I tested your fp16 kernel on my A30 device and it took longer than the baseline. Can you help me figure out why? #3

Open
yp19961009 opened this issue Nov 20, 2023 · 0 comments

My environment
CUDA: 12.2
torch: 2.1

fp16 kernel
~/FastGEMV main !4 > ./gemv -s 16384 -x 512 -y 2 -i 10000
size=16384, iter=10000
block_dim (512, 2)
grid_dim (1, 8192)
num_per_thread=32
solving...
Time taken: 8550.95 ms
checking...
checked
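
For readers checking the launch configuration in the log above, here is a minimal sketch of how those numbers relate to each other. It assumes, based only on the printed values, that each row of the matrix is handled by `block_dim.x` threads and that each block covers `block_dim.y` rows; the function name `launch_geometry` is hypothetical and is not part of FastGEMV.

```python
def launch_geometry(size, block_x, block_y):
    """Derive grid size and per-thread work for a size x size GEMV.

    Assumption (inferred from the log, not from the kernel source):
    block_x threads share one row, and one block covers block_y rows.
    """
    if size % block_x or size % block_y:
        raise ValueError("size must be divisible by both block dimensions")
    grid_dim = (1, size // block_y)    # one block per block_y rows
    num_per_thread = size // block_x   # elements of a row each thread reads
    return grid_dim, num_per_thread


grid_dim, num_per_thread = launch_geometry(16384, 512, 2)
print("grid_dim", grid_dim, "num_per_thread", num_per_thread)
```

With `-s 16384 -x 512 -y 2` this reproduces the `grid_dim (1, 8192)` and `num_per_thread=32` lines printed by `./gemv`, so other `-x`/`-y` combinations can be reasoned about before running them.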

baseline
python baseline.py -size 16384
cost: 6031.960248947144 ms
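
To put the two timings on a common scale, here is a rough sketch that converts them into effective memory bandwidth. It rests on assumptions that the logs do not confirm: that the matrix is 16384 x 16384 fp16 values (2 bytes each), that both numbers are totals over 10000 iterations (the baseline's iteration count is not shown), and that traffic for the input and output vectors is negligible. The helper name `effective_bandwidth_gbs` is hypothetical.

```python
def effective_bandwidth_gbs(size, total_ms, iters, bytes_per_elem=2):
    """Approximate GB/s achieved when reading a size x size matrix once per iteration."""
    bytes_per_iter = size * size * bytes_per_elem   # matrix traffic only
    seconds_per_iter = total_ms / 1e3 / iters
    return bytes_per_iter / seconds_per_iter / 1e9


# Timings copied from the logs above; the iteration count for the
# baseline is an assumption (taken to match the -i 10000 kernel run).
kernel_gbs = effective_bandwidth_gbs(16384, 8550.95, 10000)
baseline_gbs = effective_bandwidth_gbs(16384, 6031.960248947144, 10000)
print(f"fp16 kernel: {kernel_gbs:.0f} GB/s, baseline: {baseline_gbs:.0f} GB/s")
```

Under those assumptions the fp16 kernel reaches roughly 630 GB/s and the baseline roughly 890 GB/s. NVIDIA lists about 933 GB/s of memory bandwidth for the A30, so if the assumptions hold, the baseline is already close to the hardware limit and the kernel's launch configuration is a likely place to look for the gap.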
