I tested your fp16 kernel on my A30 device and it took longer than the baseline. Can you help me figure out why? #3

yp19961009 · 2023-11-20T09:49:30Z

My environment
CUDA: 12.2
PyTorch: 2.1

fp16 kernel
~/FastGEMV main !4 > ./gemv -s 16384 -x 512 -y 2 -i 10000
size=16384, iter=10000
block_dim (512, 2)
grid_dim (1, 8192)
num_per_thread=32
solving...
Time taken: 8550.95 ms
checking...
checked
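
For reference, the launch geometry in the log above is consistent with each block covering `block_dim.y` rows and each thread reducing `size / block_dim.x` elements of a row. This is a minimal sketch inferred from the printed values only, not taken from the FastGEMV source; `launch_geometry` is a hypothetical helper name.

```python
def launch_geometry(size, block_x, block_y):
    """Hypothetical helper: reproduce the values printed by ./gemv.

    ASSUMPTION: the relationships below are inferred from the log output,
    not read from the FastGEMV source code.
    """
    grid_y = size // block_y            # one block per block_y rows of the matrix
    num_per_thread = size // block_x    # elements of one row handled by each thread
    return {
        "block_dim": (block_x, block_y),
        "grid_dim": (1, grid_y),
        "num_per_thread": num_per_thread,
    }


# Arguments taken from the command line above: -s 16384 -x 512 -y 2
geometry = launch_geometry(16384, 512, 2)
print(geometry)
```

With `-x 512 -y 2` each thread handles 32 elements; other `-x`/`-y` combinations change that split and may be worth trying on an A30.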

baseline
python baseline.py -size 16384
cost: 6031.960248947144 ms
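
To make the two timings easier to compare, they can be converted to effective memory bandwidth, since GEMV is normally limited by how fast the matrix can be read. The sketch below assumes a square 16384 x 16384 fp16 matrix (2 bytes per element) and ignores the much smaller vector traffic. It also assumes `baseline.py` runs 10000 iterations like the kernel run, which the output above does not confirm; `effective_bandwidth_gbs` is a hypothetical helper name.

```python
def effective_bandwidth_gbs(size, total_ms, iters, bytes_per_elem=2):
    """Hypothetical helper: effective matrix-read bandwidth in GB/s.

    ASSUMPTION: a square size x size matrix is read once per iteration;
    traffic for the input and output vectors is ignored.
    """
    matrix_bytes = size * size * bytes_per_elem
    seconds_per_iter = total_ms / 1000.0 / iters
    return matrix_bytes / seconds_per_iter / 1e9


# Timings copied from the outputs above.
kernel_gbs = effective_bandwidth_gbs(16384, 8550.95, 10000)

# ASSUMPTION: the baseline also ran 10000 iterations (not shown in its output).
baseline_gbs = effective_bandwidth_gbs(16384, 6031.960248947144, 10000)

print(f"fp16 kernel: {kernel_gbs:.1f} GB/s")
print(f"baseline:    {baseline_gbs:.1f} GB/s")
```

Under those assumptions the fp16 kernel reaches about 628 GB/s and the baseline about 890 GB/s. If the baseline uses a different iteration count, the second figure does not apply.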
