QB4W AVX2 GEMM Kernels #6618

Open · wants to merge 5 commits into master
Conversation

@GregoryComer commented Jun 24, 2024

This pull request adds blockwise 4-bit (qb4w) GEMM microkernels targeting x86 via the AVX2 instruction family.

Note: This PR includes one commit from #6557 (Test generation update for qb4w). I'm putting this PR up for review before that PR merges so that we can start the review process.

Tests and benchmarks were run on Intel Ice Lake. I also did some informal benchmarking on Zen 3, which I can include if desired. The benchmark data includes qc4w results for comparison; a blockwise kernel with block_size equal to kc is functionally equivalent to qc4w, so qc4w serves as a reasonable performance baseline. I expect qb4w with bl=256 to be slightly less performant than qc4w, both because of the slightly larger memory footprint (~4.125 bits/weight vs. ~4 bits/weight for qc4w; a quick check of that arithmetic follows the table below) and because of the small overhead of the block loop.

Average OPS by tile size (bl does not apply to qc4w):

| n | k | bl | datatype | 1x8c8 | 2x8c8 | 3x8c8 | 4x8c8 |
|---|---|----|----------|-------|-------|-------|-------|
| 16 | 1024 | 32 | qd8_f32_qb4w | 30.81 | 33.59 | 35.82 | 25.56 |
| 16 | 1024 | 256 | qd8_f32_qb4w | 37.39 | 44.77 | 46.17 | 31.61 |
| 16 | 1024 | n/a | qd8_f32_qc4w | 38.99 | 45.29 | 48.60 | 45.71 |
| 128 | 1024 | 32 | qd8_f32_qb4w | 30.29 | 33.55 | 34.90 | 25.77 |
| 128 | 1024 | 256 | qd8_f32_qb4w | 37.74 | 44.63 | 46.17 | 32.17 |
| 128 | 1024 | n/a | qd8_f32_qc4w | 38.91 | 44.62 | 49.21 | 44.86 |
| 4096 | 1024 | 32 | qd8_f32_qb4w | 29.71 | 32.69 | 34.89 | 25.58 |
| 4096 | 1024 | 256 | qd8_f32_qb4w | 37.24 | 45.28 | 45.46 | 32.03 |
| 4096 | 1024 | n/a | qd8_f32_qc4w | 38.83 | 46.76 | 47.19 | 44.68 |
| 11008 | 4096 | 32 | qd8_f32_qb4w | 27.13 | 32.23 | 33.80 | 25.78 |
| 11008 | 4096 | 256 | qd8_f32_qb4w | 37.29 | 44.41 | 46.50 | 32.43 |
| 11008 | 4096 | n/a | qd8_f32_qc4w | 36.02 | 45.84 | 48.99 | 44.29 |
| 32000 | 4096 | 32 | qd8_f32_qb4w | 19.77 | 26.41 | 30.11 | 25.09 |
| 32000 | 4096 | 256 | qd8_f32_qb4w | 30.58 | 41.53 | 44.61 | 32.19 |
| 32000 | 4096 | n/a | qd8_f32_qc4w | 29.53 | 40.96 | 45.74 | 44.25 |
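
A quick check of the bits-per-weight arithmetic behind the qb4w-vs-qc4w comparison above. This is a sketch, not kernel code; the 32-bit-per-block scale width is an assumption, chosen because it reproduces the ~4.125 bits/weight figure quoted for bl=256.

```c
#include <stdio.h>

/* Hypothetical helper, not part of this PR: effective storage cost of
 * blockwise 4-bit weights is 4 payload bits per weight plus one per-block
 * scale amortized over the block of `bl` weights. */
static double qb4w_bits_per_weight(int bl, int scale_bits) {
  return 4.0 + (double) scale_bits / (double) bl;
}

int main(void) {
  printf("bl=32:  %.4f bits/weight\n", qb4w_bits_per_weight(32, 32));  /* 5.0000 */
  printf("bl=256: %.4f bits/weight\n", qb4w_bits_per_weight(256, 32)); /* 4.1250 */
  /* qc4w amortizes one scale per channel over all of kc, hence ~4 bits/weight. */
  return 0;
}
```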

@GregoryComer marked this pull request as ready for review June 24, 2024 22:01

@GregoryComer commented Jun 24, 2024

As a general note, I did test moving the hadds so that a single YMM register can serve as the outer accumulator. This was marginally slower (~2%) than the current approach for bl=32 and about the same (between ~0.5% slower and ~0.5% faster) for bl=256 on the LLM benchmarks, tested at MR=1 and MR=3. It might be worth exploring this a little more (does it reduce register spillage at MR=4 enough to make it faster than MR=3?). I'll look into it, but I'd prefer to take any changes as a follow-up.
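
For context, a rough sketch of the two accumulation strategies being compared. This is illustrative only; the helper names are hypothetical and the kernel's actual register allocation and scaling steps are omitted.

```c
#include <immintrin.h>

/* Option A (roughly the current approach): keep independent YMM accumulators
 * through the block loop and defer the horizontal combine to the end. */
static inline __m256i combine_late(__m256i acc0, __m256i acc1) {
  /* One vphaddd after the k loop; acc0/acc1 stay live throughout. */
  return _mm256_hadd_epi32(acc0, acc1);
}

/* Option B (the variant tested above): hadd inside the block loop so a single
 * YMM register carries the running outer accumulator. This frees registers
 * (potentially reducing spillage at MR=4) at the cost of an extra shuffle-port
 * op per block, consistent with the ~2% slowdown observed at bl=32. */
static inline __m256i combine_early(__m256i outer, __m256i block_acc) {
  return _mm256_add_epi32(outer, _mm256_hadd_epi32(block_acc, block_acc));
}
```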

@fbarchard commented

Re YMMs: yes, I tested that too. In fact, the old code for qs8 8-bit output on AVX and AVX512 used to combine all the bytes and do a single vmin, which is slower.
The old code used pack instructions and a vpshufb for qu8 support, but for qs8 we can use cvt directly and avoid the shuffle. A problem with cvt on x86 is that you can't convert the upper half without an extract, so it's better to use multiple registers.
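
A minimal illustration of the extract issue (hypothetical helper, shown here in the widening direction): the 128-bit-source converts can only consume the lower lane of a YMM directly, so touching the upper half requires an explicit extract.

```c
#include <immintrin.h>

/* Hypothetical helper illustrating the limitation described above: the
 * sign-extending converts (vpmovsxbw) take a 128-bit source, so widening all
 * 32 bytes of a YMM register costs an extra vextracti128 for the top half.
 * Keeping results spread across multiple registers avoids that extract. */
static inline void widen_i8_to_i16(__m256i v, __m256i* lo, __m256i* hi) {
  *lo = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(v));      /* bytes 0..15  */
  *hi = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(v, 1)); /* bytes 16..31 */
}
```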
