QB4W AVX GEMM Kernels #6621

Open. Wants to merge 7 commits into base: master.

mcr229 (Contributor) commented Jun 25, 2024

This pull request adds blockwise 4-bit (qb4w) GEMM microkernels targeting the x86 AVX instruction family.

Note: Since the AVX1 microkernels share the same meta kernels as the SSE2/4.1 kernels, this PR sits on top of the SSE kernels. For a benchmark comparison against the SSE2/4.1 kernels, see the table in PR #6576.

Tests and benchmarks were run on an Ice Lake Xeon processor. A block_size equal to KC is functionally equivalent to QC4W, so the QC4W kernels provide a reasonable performance comparison.
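To illustrate why a block size equal to KC reduces to the per-channel case, here is a minimal scalar reference sketch of a blockwise 4-bit dot product. It is not the code in this PR. The function names, the packing layout (two weights per byte, low nibble first), and the implicit zero point of 8 are assumptions made for the example. The input scale and zero point of the qd8 path are omitted, and only a single input row is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical layout (not the packed format used by the microkernels):
// each output channel holds kc 4-bit weights, two per byte, low nibble first,
// stored unsigned with an assumed zero point of 8.
static int32_t unpack_w4(const uint8_t* w, size_t k) {
  const uint8_t byte = w[k / 2];
  const uint8_t nibble = (k % 2 == 0) ? (uint8_t) (byte & 0x0F) : (uint8_t) (byte >> 4);
  return (int32_t) nibble - 8;
}

// Illustrative reference: for each output channel n,
//   out[n] = sum over blocks b of block_scales[n][b] * sum_{k in b} a[k] * w[n][k]
// Each block of bl weights is accumulated in int32 and scaled by its own float.
static void ref_gemv_qb4w(size_t nc, size_t kc, size_t bl,
                          const int8_t* a, const uint8_t* w,
                          const float* block_scales, float* out) {
  assert(bl != 0 && kc % bl == 0 && kc % 2 == 0);
  const size_t num_blocks = kc / bl;
  for (size_t n = 0; n < nc; n++) {
    const uint8_t* w_row = &w[n * kc / 2];
    float acc = 0.0f;
    for (size_t b = 0; b < num_blocks; b++) {
      int32_t block_acc = 0;
      for (size_t k = b * bl; k < (b + 1) * bl; k++) {
        block_acc += (int32_t) a[k] * unpack_w4(w_row, k);
      }
      acc += (float) block_acc * block_scales[n * num_blocks + b];
    }
    out[n] = acc;
  }
}
```

With bl equal to kc there is exactly one block, and therefore one scale, per output channel, which is the per-channel (qc4w) computation. Smaller blocks add one int32-to-float conversion and one scale multiply per block.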

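AVX Benchmarks

Average of OPs, M = 128 for all columns.

| Tile Size | Kernel | BL | Load Type | N=16, K=1024 | N=128, K=1024 | N=4096, K=1024 | N=11008, K=4096 | N=32000, K=4096 |
|---|---|---|---|---|---|---|---|---|
| 1x4c8 | qd8_f32_qb4w | 32 | avx_ld128 | 17.2997 | 17.266 | 17.0955 | 17.2444 | 15.369 |
| 1x4c8 | qd8_f32_qb4w | 32 | avx_ld64 | 17.0759 | 17.0915 | 16.4827 | 16.8174 | 15.4215 |
| 1x4c8 | qd8_f32_qb4w | 256 | avx_ld128 | 20.1059 | 20.2838 | 20.3294 | 20.1497 | 19.2685 |
| 1x4c8 | qd8_f32_qb4w | 256 | avx_ld64 | 19.3468 | 19.2601 | 18.999 | 19.2969 | 18.8185 |
| 1x4c8 | qd8_f32_qc4w | N/A | avx_ld128 | 20.5992 | 20.5697 | 20.5364 | 20.6587 | 19.1282 |
| 1x4c8 | qd8_f32_qc4w | N/A | avx_ld64 | 19.5944 | 19.7817 | 19.7276 | 19.8363 | 18.6789 |
| 2x4c8 | qd8_f32_qb4w | 32 | avx_ld128 | 20.9994 | 21.1004 | 21.03 | 21.0452 | 20.2548 |
| 2x4c8 | qd8_f32_qb4w | 32 | avx_ld64 | 21.2391 | 21.4121 | 21.234 | 20.8914 | 20.6457 |
| 2x4c8 | qd8_f32_qb4w | 256 | avx_ld128 | 26.1548 | 25.5381 | 26.2522 | 26.2795 | 26.106 |
| 2x4c8 | qd8_f32_qb4w | 256 | avx_ld64 | 25.7053 | 26.1241 | 25.8427 | 25.8229 | 25.3338 |
| 2x4c8 | qd8_f32_qc4w | N/A | avx_ld128 | 26.9955 | 27.2483 | 27.3093 | 27.2677 | 26.6632 |
| 2x4c8 | qd8_f32_qc4w | N/A | avx_ld64 | 26.4074 | 26.7154 | 26.7614 | 26.7571 | 26.2805 |
| 3x4c8 | qd8_f32_qb4w | 32 | avx_ld128 | 22.6648 | 22.212 | 22.5704 | 22.8571 | 22.6488 |
| 3x4c8 | qd8_f32_qb4w | 32 | avx_ld64 | 22.4549 | 22.5161 | 22.5755 | 22.5949 | 22.1056 |
| 3x4c8 | qd8_f32_qb4w | 256 | avx_ld128 | 28.8683 | 28.3443 | 28.7889 | 28.6925 | 28.6756 |
| 3x4c8 | qd8_f32_qb4w | 256 | avx_ld64 | 28.1638 | 28.5717 | 28.7398 | 28.1747 | 27.8443 |
| 3x4c8 | qd8_f32_qc4w | N/A | avx_ld128 | 29.7597 | 29.642 | 29.9686 | 29.8611 | 30.0058 |
| 3x4c8 | qd8_f32_qc4w | N/A | avx_ld64 | 29.8879 | 30.0863 | 29.7302 | 29.8452 | 29.7871 |
| 4x4c8 | qd8_f32_qb4w | 32 | avx_ld128 | 22.7898 | 23.3741 | 23.2927 | 23.358 | 23.3069 |
| 4x4c8 | qd8_f32_qb4w | 32 | avx_ld64 | 23.5985 | 23.7258 | 23.843 | 23.976 | 23.1728 |
| 4x4c8 | qd8_f32_qb4w | 256 | avx_ld128 | 30.1112 | 30.318 | 30.218 | 30.4057 | 30.0172 |
| 4x4c8 | qd8_f32_qb4w | 256 | avx_ld64 | 29.8162 | 29.6653 | 30.3024 | 29.8823 | 30.4935 |
| 4x4c8 | qd8_f32_qc4w | N/A | avx_ld128 | 30.6894 | 31.185 | 31.3612 | 31.7971 | 31.4948 |
| 4x4c8 | qd8_f32_qc4w | N/A | avx_ld64 | 31.4501 | 31.3585 | 30.564 | 31.6051 | 31.8028 |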
