
Why does the sequence length of vectors affect the calculation results of dense under bf16? #19878

Open
pass-lin opened this issue Jun 19, 2024 · 6 comments
pass-lin commented Jun 19, 2024

import os
os.environ['KERAS_BACKEND'] = 'torch'
os.environ['OPS_KERNAL'] = '1'
import keras
keras.config.set_floatx('bfloat16')  # run everything in bfloat16
from keras import ops
import numpy as np

initial_dim = 2048
finally_dim = 64
z = ops.convert_to_tensor(np.random.random([1, 36, initial_dim]))
dense = keras.layers.Dense(finally_dim)
z1 = dense(z)          # full sequence of 36 positions
z2 = dense(z[:, :8])   # only the first 8 positions
print(ops.isclose(z1[:, :8], z2).all())  # expected True, but sometimes False

Example code is above. In some cases z1 and z2 fail the isclose check, even though in theory they should pass it in every case (and under fp32 they do), since the first eight positions of z1 and z2 are computed from the same inputs and the same weights. What is the problem, and how can it be solved?
This bug is also found with the tf and jax backends, but not with the numpy backend.
Passing cases: initial_dim = 2048, finally_dim = 2048; initial_dim = 2048, finally_dim = 4096; initial_dim = 1024, finally_dim = 2048.
Failing cases: initial_dim = 2048, finally_dim = 64; initial_dim = 2048, finally_dim = 1024; initial_dim = 1024, finally_dim = 2047.
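
For reference, a minimal sketch that sweeps the shape pairs listed above and prints whether the isclose check passes, assuming the same torch backend and bfloat16 floatx as the snippet above:

import os
os.environ['KERAS_BACKEND'] = 'torch'
import keras
keras.config.set_floatx('bfloat16')
from keras import ops
import numpy as np

# Sweep the (initial_dim, finally_dim) pairs reported above and print
# whether the sliced and unsliced Dense outputs agree under isclose.
for initial_dim, finally_dim in [(2048, 2048), (2048, 4096), (1024, 2048),
                                 (2048, 64), (2048, 1024), (1024, 2047)]:
    z = ops.convert_to_tensor(np.random.random([1, 36, initial_dim]))
    dense = keras.layers.Dense(finally_dim)
    z1 = dense(z)
    z2 = dense(z[:, :8])
    print(initial_dim, finally_dim, bool(ops.isclose(z1[:, :8], z2).all()))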

However, we did not find a similar issue in pure torch:
import torch
import numpy as np

initial_dim = 4096
finally_dim = 32
z = torch.tensor(np.random.random([1, 36, initial_dim]), dtype=torch.bfloat16)
linear = torch.nn.Linear(initial_dim, finally_dim).bfloat16()
z1 = linear(z)          # full sequence of 36 positions
z2 = linear(z[:, :8])   # only the first 8 positions
print(torch.isclose(z1[:, :8], z2).all())  # True in these pure-torch runs
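
One way to narrow this down is to compare both bf16 outputs against a float32 reference computed from the same up-cast weights, so the size of the mismatch is visible directly; a minimal sketch, reusing the failing Keras dimensions (2048 -> 64):

import torch
import numpy as np

initial_dim, finally_dim = 2048, 64
z = torch.tensor(np.random.random([1, 36, initial_dim]), dtype=torch.bfloat16)
linear = torch.nn.Linear(initial_dim, finally_dim).bfloat16()

z1 = linear(z)[:, :8]      # slice after the matmul
z2 = linear(z[:, :8])      # slice before the matmul

# float32 reference using the same weights and inputs cast up
ref = torch.nn.functional.linear(z[:, :8].float(),
                                 linear.weight.float(),
                                 linear.bias.float())

print('z1 vs z2  max abs diff:', (z1.float() - z2.float()).abs().max().item())
print('z1 vs fp32 ref        :', (z1.float() - ref).abs().max().item())
print('z2 vs fp32 ref        :', (z2.float() - ref).abs().max().item())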

mehtamansi29 (Collaborator) commented:

Hi @pass-lin -

Thanks for reporting the issue. I have tested the code snippet and it doesn't reproduce the reported behaviour in Keras 3.3.3 with the torch backend. Gist file attached for reference.
Could you let us know which Keras version you are using here?

pass-lin (Author) commented Jun 19, 2024

My test environment is as follows:
device: 4060 Ti
OS: WSL Ubuntu 22.04
keras: 3.3.3
torch: 2.2.2+cu121
jax: 0.4.23+cuda12
tensorflow-cpu: 1.15

This bug is also found on Windows 10 + torch 2.2.1 + keras 3.3.3,
but not on another machine (pure Linux + A800 + torch 2.2.0 (cu118) + keras 3.3.3) when the backend is torch.
When the environment is pure Linux + A800 + jax 0.4.28+cuda12.cudnn89 (or 0.4.23+cuda11.cudnn86) + keras 3.3.3, the bug also exists.
The bug is not found on CPU or on a V100.

@mehtamansi29 mehtamansi29 added the keras-team-review-pending Pending review by a Keras team member. label Jun 21, 2024
szxysdt commented Jun 27, 2024

I reproduced this bug in this environment:

& pip list
Package           Version
----------------- ------------
absl-py           2.1.0       
filelock          3.13.1      
fsspec            2024.2.0    
h5py              3.11.0      
intel-openmp      2021.4.0    
Jinja2            3.1.3       
keras             3.4.1       
markdown-it-py    3.0.0       
MarkupSafe        2.1.5       
mdurl             0.1.2       
mkl               2021.4.0    
ml-dtypes         0.4.0       
mpmath            1.3.0       
namex             0.0.8       
networkx          3.3
numpy             1.26.4      
optree            0.11.0      
packaging         24.1        
pillow            10.3.0      
pip               24.1.1      
Pygments          2.18.0      
rich              13.7.1      
setuptools        58.1.0      
sympy             1.12.1      
tbb               2021.13.0   
torch             2.3.1+cu121 
torchaudio        2.3.1+cu121 
torchvision       0.18.1+cu121
typing_extensions 4.12.2      

cuda device: 3060-12G
platform: Windows 10

code:

import os
os.environ['KERAS_BACKEND'] = 'torch'
os.environ['OPS_KERNAL'] = '1'
import keras
keras.config.set_floatx('bfloat16')
from keras import ops
import numpy as np
initial_dim = 2048
finally_dim = 64
z = ops.convert_to_tensor(np.random.random([1,36,initial_dim]))
dense = keras.layers.Dense(finally_dim)
z1 = dense(z)
z2 = dense(z[:,:8])
print(ops.isclose(z1[:,:8],z2).all())

output:

tensor(False, device='cuda:0')

123mbcz123 commented Jun 27, 2024

I cannot reproduce this error. One possible reason is that my graphics card is a 2080 Ti, and the tensor cores of the 2080 Ti do not support bfloat16 calculations, so bfloat16 is handled by the CUDA cores instead.

platform: Windows 10
graphics card: 2080 Ti 11G
pytorch version: 2.3.1+cu121
keras version: 3.4.1

output:
tensor(True, device='cuda:0')
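
A minimal sketch for checking which hardware path is in play: Ampere-class cards (compute capability 8.x, e.g. the 3060, 4060 Ti, and A800 above) have native bf16 tensor cores, while the 2080 Ti (7.5) and V100 (7.0) do not; note that, depending on the PyTorch version, torch.cuda.is_bf16_supported() may also count emulated support, so the compute capability is the more telling signal:

import torch

# Print the GPU name, its compute capability, and PyTorch's bf16 support flag.
# (8, x) means Ampere or newer, i.e. native bf16 tensor cores;
# (7, 5) is the 2080 Ti and (7, 0) is the V100 mentioned in this thread.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))
print(torch.cuda.is_bf16_supported())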

SamanehSaadat (Member) commented:

@pass-lin Could you provide a colab that reproduces the issue?

@SamanehSaadat SamanehSaadat removed the keras-team-review-pending Pending review by a Keras team member. label Jun 28, 2024
@SamanehSaadat SamanehSaadat self-assigned this Jun 28, 2024
pass-lin (Author) commented:

I don't think I can provide you with a Windows environment or one with an RTX 30 or 40 series on Colab.
