QB4W AVX2 GEMM Kernels #6618

Open · wants to merge 5 commits into master
Conversation

@GregoryComer commented Jun 24, 2024

This pull request adds blockwise 4-bit (qb4w) GEMM microkernels targeting x86 via the AVX2 instruction family.

Note: This PR includes one commit from #6557 (Test generation update for qb4w). I'm putting this PR up for review before that PR merges so that we can start the review process.

Tests and benchmarks were run on Intel Ice Lake. I also did some informal benchmarking on Zen 3, which I can include if desired. The benchmark data includes qc4w results for comparison; a blockwise kernel with block_size equal to kc is functionally equivalent to qc4w, so qc4w serves as a reasonable performance baseline. I expect qb4w with bl=256 to be slightly less performant than qc4w, both because of the slightly larger memory footprint (~4.125 bits/weight vs. ~4 bits/weight for qc4w; a quick check of that arithmetic follows the table below) and because of the small overhead of the block loop.

Average OPS by tile size (bl does not apply to qc4w):

| n | k | bl | datatype | 1x8c8 | 2x8c8 | 3x8c8 | 4x8c8 |
|---|---|----|----------|-------|-------|-------|-------|
| 16 | 1024 | 32 | qd8_f32_qb4w | 30.81 | 33.59 | 35.82 | 25.56 |
| 16 | 1024 | 256 | qd8_f32_qb4w | 37.39 | 44.77 | 46.17 | 31.61 |
| 16 | 1024 | n/a | qd8_f32_qc4w | 38.99 | 45.29 | 48.60 | 45.71 |
| 128 | 1024 | 32 | qd8_f32_qb4w | 30.29 | 33.55 | 34.90 | 25.77 |
| 128 | 1024 | 256 | qd8_f32_qb4w | 37.74 | 44.63 | 46.17 | 32.17 |
| 128 | 1024 | n/a | qd8_f32_qc4w | 38.91 | 44.62 | 49.21 | 44.86 |
| 4096 | 1024 | 32 | qd8_f32_qb4w | 29.71 | 32.69 | 34.89 | 25.58 |
| 4096 | 1024 | 256 | qd8_f32_qb4w | 37.24 | 45.28 | 45.46 | 32.03 |
| 4096 | 1024 | n/a | qd8_f32_qc4w | 38.83 | 46.76 | 47.19 | 44.68 |
| 11008 | 4096 | 32 | qd8_f32_qb4w | 27.13 | 32.23 | 33.80 | 25.78 |
| 11008 | 4096 | 256 | qd8_f32_qb4w | 37.29 | 44.41 | 46.50 | 32.43 |
| 11008 | 4096 | n/a | qd8_f32_qc4w | 36.02 | 45.84 | 48.99 | 44.29 |
| 32000 | 4096 | 32 | qd8_f32_qb4w | 19.77 | 26.41 | 30.11 | 25.09 |
| 32000 | 4096 | 256 | qd8_f32_qb4w | 30.58 | 41.53 | 44.61 | 32.19 |
| 32000 | 4096 | n/a | qd8_f32_qc4w | 29.53 | 40.96 | 45.74 | 44.25 |
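
A quick check of the bits-per-weight arithmetic behind the qb4w-vs-qc4w comparison above. This is a sketch, not kernel code; the 32-bit-per-block scale width is an assumption, chosen because it reproduces the ~4.125 bits/weight figure quoted for bl=256.

```c
#include <stdio.h>

/* Hypothetical helper, not part of this PR: effective storage cost of
 * blockwise 4-bit weights is 4 payload bits per weight plus one per-block
 * scale amortized over the block of `bl` weights. */
static double qb4w_bits_per_weight(int bl, int scale_bits) {
  return 4.0 + (double) scale_bits / (double) bl;
}

int main(void) {
  printf("bl=32:  %.4f bits/weight\n", qb4w_bits_per_weight(32, 32));  /* 5.0000 */
  printf("bl=256: %.4f bits/weight\n", qb4w_bits_per_weight(256, 32)); /* 4.1250 */
  /* qc4w amortizes one scale per channel over all of kc, hence ~4 bits/weight. */
  return 0;
}
```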

@GregoryComer marked this pull request as ready for review June 24, 2024 22:01

@GregoryComer commented Jun 24, 2024

As a general note, I did test moving the hadds so that a single YMM register can serve as the outer accumulator. This was marginally slower (~2%) than the current approach for bl=32 and about the same (between ~0.5% slower and ~0.5% faster) for bl=256 on the LLM benchmarks, tested at MR=1 and MR=3. It might be worth exploring this a little more (does it reduce register spillage at MR=4 enough to make it faster than MR=3?). I'll look into it, but I'd prefer to take any changes as a follow-up.
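
For context, a rough sketch of the two accumulation strategies being compared. This is illustrative only; the helper names are hypothetical and the kernel's actual register allocation and scaling steps are omitted.

```c
#include <immintrin.h>

/* Option A (roughly the current approach): keep independent YMM accumulators
 * through the block loop and defer the horizontal combine to the end. */
static inline __m256i combine_late(__m256i acc0, __m256i acc1) {
  /* One vphaddd after the k loop; acc0/acc1 stay live throughout. */
  return _mm256_hadd_epi32(acc0, acc1);
}

/* Option B (the variant tested above): hadd inside the block loop so a single
 * YMM register carries the running outer accumulator. This frees registers
 * (potentially reducing spillage at MR=4) at the cost of an extra shuffle-port
 * op per block, consistent with the ~2% slowdown observed at bl=32. */
static inline __m256i combine_early(__m256i outer, __m256i block_acc) {
  return _mm256_add_epi32(outer, _mm256_hadd_epi32(block_acc, block_acc));
}
```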

@fbarchard commented

Re YMMs: yes, I tested that too. In fact, the old code for qs8 8-bit output on AVX and AVX512 used to combine all the bytes and do a single vmin, which is slower.
The old code used pack instructions and a vpshufb for qu8 support, but for qs8 we can use cvt directly and avoid the shuffle. A problem with cvt on x86 is that you can't convert the upper half without an extract, so it's better to use multiple registers.
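
A minimal illustration of the extract issue (hypothetical helper, shown here in the widening direction): the 128-bit-source converts can only consume the lower lane of a YMM directly, so touching the upper half requires an explicit extract.

```c
#include <immintrin.h>

/* Hypothetical helper illustrating the limitation described above: the
 * sign-extending converts (vpmovsxbw) take a 128-bit source, so widening all
 * 32 bytes of a YMM register costs an extra vextracti128 for the top half.
 * Keeping results spread across multiple registers avoids that extract. */
static inline void widen_i8_to_i16(__m256i v, __m256i* lo, __m256i* hi) {
  *lo = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(v));      /* bytes 0..15  */
  *hi = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(v, 1)); /* bytes 16..31 */
}
```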
