
QB4W SSE2/SSE41 GEMM Kernels #6576

Open

wants to merge 6 commits into master
Conversation

@mcr229 mcr229 commented Jun 17, 2024

This pull request adds blockwise 4-bit (qb4w) GEMM microkernels targeting the x86 SSE2 and SSE4.1 instruction sets.

Note: This PR includes one commit from #6557 (Test generation update for qb4w). I'm putting this PR up for review before that PR merges so that we can start the review process.

Tests and benchmarks were run on an Ice Lake Xeon processor. A block_size equal to KC is functionally equivalent to QC4W, so QC4W provides a reasonable performance comparison.
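For context, a scalar sketch of what one qb4w output element computes is shown below. The packed nibble order, the implied weight zero point of 8, and the parameter names are assumptions for illustration only; they do not describe the microkernels' actual packed layout.

```c
#include <stddef.h>
#include <stdint.h>

// Reference sketch only: blockwise 4-bit (qb4w) dot product for one output
// element. Assumes two 4-bit weights per byte (low nibble first), an implied
// weight zero point of 8, a per-block fp32 weight scale, and int8 inputs
// with a per-row zero point and scale. Not the kernels' packed format.
static float qb4w_dot_reference(
    const int8_t* a,            // k quantized int8 inputs for one row
    int32_t a_zero_point,       // input zero point for that row
    float a_scale,              // input dequantization scale for that row
    const uint8_t* w_packed,    // k/2 bytes holding k packed 4-bit weights
    const float* block_scales,  // k/block_size per-block weight scales
    size_t k, size_t block_size)
{
  float acc = 0.0f;
  for (size_t b = 0; b < k / block_size; b++) {
    int32_t block_acc = 0;  // integer accumulation within one block
    for (size_t i = 0; i < block_size; i++) {
      const size_t idx = b * block_size + i;
      const uint8_t byte = w_packed[idx / 2];
      const int32_t w = (int32_t) ((idx & 1) ? (byte >> 4) : (byte & 0x0F)) - 8;
      block_acc += ((int32_t) a[idx] - a_zero_point) * w;
    }
    // Each block contributes with its own scale; QC4W is the special case of
    // a single scale per column (block_size == KC).
    acc += (float) block_acc * block_scales[b];
  }
  return acc * a_scale;
}
```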

SSE2 Benchmarks (average of OPS, M = 128)

| Tile Size | Kernel | BL | Kernel Type | N=16, K=1024 | N=128, K=1024 | N=4096, K=1024 | N=11008, K=4096 | N=32000, K=4096 |
|-----------|--------|----|-------------|--------------|---------------|----------------|-----------------|-----------------|
| 1x4c8 | qd8_f32_qb4w | 32 | sse2_ld128 | 16.3626 | 16.5004 | 16.4693 | 16.2591 | 15.1152 |
| 1x4c8 | qd8_f32_qb4w | 32 | sse2_ld64 | 13.8485 | 13.8654 | 13.7583 | 13.7609 | 12.9008 |
| 1x4c8 | qd8_f32_qb4w | 256 | sse2_ld128 | 19.1497 | 19.1257 | 19.2257 | 19.0563 | 17.9628 |
| 1x4c8 | qd8_f32_qb4w | 256 | sse2_ld64 | 15.5274 | 15.5863 | 15.541 | 15.4571 | 14.8878 |
| 1x4c8 | qd8_f32_qc4w | | sse2_ld128 | 19.0017 | 19.3572 | 19.4491 | 19.1485 | 18.5355 |
| 1x4c8 | qd8_f32_qc4w | | sse2_ld64 | 15.364 | 15.311 | 15.4119 | 15.3349 | 15.4244 |
| 2x4c8 | qd8_f32_qb4w | 32 | sse2_ld128 | 19.7986 | 19.7132 | 19.3471 | 19.874 | 19.457 |
| 2x4c8 | qd8_f32_qb4w | 32 | sse2_ld64 | 15.0903 | 16.8612 | 13.1044 | 16.8245 | 17.2448 |
| 2x4c8 | qd8_f32_qb4w | 256 | sse2_ld128 | 24.3207 | 24.276 | 24.263 | 24.386 | 23.4275 |
| 2x4c8 | qd8_f32_qb4w | 256 | sse2_ld64 | 16.5991 | 19.1659 | 17.5122 | 20.0571 | 21.1375 |
| 2x4c8 | qd8_f32_qc4w | | sse2_ld128 | 24.7754 | 24.3957 | 24.3669 | 25.0426 | 24.9166 |
| 2x4c8 | qd8_f32_qc4w | | sse2_ld64 | 21.6573 | 21.4722 | 21.3531 | 21.6695 | 21.7461 |
| 3x4c8 | qd8_f32_qb4w | 32 | sse2_ld128 | 21.0667 | 20.9242 | 20.4278 | 21.1425 | 20.9614 |
| 3x4c8 | qd8_f32_qb4w | 32 | sse2_ld64 | 19.7949 | 19.862 | 19.8228 | 19.78 | 19.6134 |
| 3x4c8 | qd8_f32_qb4w | 256 | sse2_ld128 | 26.2247 | 25.7603 | 26.5098 | 26.3606 | 26.2232 |
| 3x4c8 | qd8_f32_qb4w | 256 | sse2_ld64 | 24.0754 | 24.1035 | 24.1776 | 24.1109 | 23.6717 |
| 3x4c8 | qd8_f32_qc4w | | sse2_ld128 | 26.8972 | 27.2868 | 27.1239 | 27.4936 | 27.3707 |
| 3x4c8 | qd8_f32_qc4w | | sse2_ld64 | 24.008 | 23.7783 | 24.4216 | 24.9255 | 24.7481 |
| 4x4c8 | qd8_f32_qb4w | 32 | sse2_ld128 | 21.546 | 21.2863 | 21.9667 | 22.0366 | 21.8368 |
| 4x4c8 | qd8_f32_qb4w | 32 | sse2_ld64 | 20.2396 | 20.4829 | 20.3171 | 20.1945 | 20.327 |
| 4x4c8 | qd8_f32_qb4w | 256 | sse2_ld128 | 27.4331 | 27.4372 | 27.6257 | 27.7552 | 27.7811 |
| 4x4c8 | qd8_f32_qb4w | 256 | sse2_ld64 | 25.1659 | 25.7404 | 25.4879 | 25.5119 | 25.4176 |
| 4x4c8 | qd8_f32_qc4w | | sse2_ld128 | 28.0353 | 28.4844 | 28.1422 | 28.431 | 28.4336 |
| 4x4c8 | qd8_f32_qc4w | | sse2_ld64 | 26.1268 | 26.5393 | 26.4284 | 26.8376 | 26.746 |

SSE4.1 Benchmarks (average of OPS, M = 128)

| Tile Size | Kernel | BL | Kernel Type | N=16, K=1024 | N=128, K=1024 | N=4096, K=1024 | N=11008, K=4096 | N=32000, K=4096 |
|-----------|--------|----|-------------|--------------|---------------|----------------|-----------------|-----------------|
| 1x4c8 | qd8_f32_qb4w | 32 | sse41_ld128 | 17.3014 | 17.3535 | 17.2064 | 17.1379 | 16.0412 |
| 1x4c8 | qd8_f32_qb4w | 32 | sse41_ld64 | 16.9989 | 16.8812 | 16.9124 | 16.8923 | 16.2466 |
| 1x4c8 | qd8_f32_qb4w | 256 | sse41_ld128 | 20.1779 | 20.2345 | 20.2459 | 20.2033 | 19.4152 |
| 1x4c8 | qd8_f32_qb4w | 256 | sse41_ld64 | 19.4751 | 19.1835 | 19.4524 | 19.2826 | 18.5472 |
| 1x4c8 | qd8_f32_qc4w | N/A | sse41_ld128 | 20.4684 | 20.5874 | 20.1799 | 20.5715 | 18.9625 |
| 1x4c8 | qd8_f32_qc4w | N/A | sse41_ld64 | 19.6819 | 19.3431 | 19.7433 | 19.5436 | 19.2356 |
| 2x4c8 | qd8_f32_qb4w | 32 | sse41_ld128 | 21.2354 | 21.1893 | 21.1614 | 20.9931 | 20.8421 |
| 2x4c8 | qd8_f32_qb4w | 32 | sse41_ld64 | 21.3569 | 21.4488 | 21.235 | 20.9837 | 21.0178 |
| 2x4c8 | qd8_f32_qb4w | 256 | sse41_ld128 | 26.0017 | 25.9916 | 25.4473 | 25.5135 | 25.57 |
| 2x4c8 | qd8_f32_qb4w | 256 | sse41_ld64 | 25.2493 | 25.7025 | 26.085 | 25.9694 | 25.2185 |
| 2x4c8 | qd8_f32_qc4w | N/A | sse41_ld128 | 26.8566 | 27.1628 | 27.0692 | 27.3674 | 26.9184 |
| 2x4c8 | qd8_f32_qc4w | N/A | sse41_ld64 | 26.183 | 26.2211 | 26.139 | 26.5423 | 26.6486 |
| 3x4c8 | qd8_f32_qb4w | 32 | sse41_ld128 | 22.6084 | 22.5638 | 22.2168 | 22.7137 | 22.4851 |
| 3x4c8 | qd8_f32_qb4w | 32 | sse41_ld64 | 22.2074 | 22.2576 | 22.1917 | 22.2111 | 21.9517 |
| 3x4c8 | qd8_f32_qb4w | 256 | sse41_ld128 | 28.6587 | 28.6288 | 29.0569 | 28.9576 | 28.5065 |
| 3x4c8 | qd8_f32_qb4w | 256 | sse41_ld64 | 28.233 | 28.3912 | 28.5254 | 28.3961 | 28.1014 |
| 3x4c8 | qd8_f32_qc4w | N/A | sse41_ld128 | 29.2685 | 29.6141 | 29.612 | 29.7231 | 29.4312 |
| 3x4c8 | qd8_f32_qc4w | N/A | sse41_ld64 | 29.0769 | 29.047 | 29.7566 | 30.0988 | 29.7775 |
| 4x4c8 | qd8_f32_qb4w | 32 | sse41_ld128 | 23.4833 | 22.5307 | 23.0766 | 23.3969 | 23.209 |
| 4x4c8 | qd8_f32_qb4w | 32 | sse41_ld64 | 23.0014 | 23.5207 | 23.657 | 23.5848 | 23.4027 |
| 4x4c8 | qd8_f32_qb4w | 256 | sse41_ld128 | 30.0163 | 29.9047 | 30.2299 | 30.367 | 29.8178 |
| 4x4c8 | qd8_f32_qb4w | 256 | sse41_ld64 | 30.0503 | 29.1273 | 29.8003 | 30.3124 | 30.0438 |
| 4x4c8 | qd8_f32_qc4w | N/A | sse41_ld128 | 30.7884 | 31.3583 | 31.1508 | 31.4678 | 30.5934 |
| 4x4c8 | qd8_f32_qc4w | N/A | sse41_ld64 | 31.2101 | 30.8091 | 31.3049 | 31.2445 | 31.8041 |

digantdesai mentioned this pull request on Jun 17, 2024.
```
@@ -31,17 +31,35 @@ $if DATATYPE != "QD8":
#include <xnnpack/unaligned.h>


$DATATYPE_SPEC = {"QC8": "qs8_qc8w", "QD8": "qd8_f32_qc8w", "QC4": "qd8_f32_qc4w", "QS8": "qs8", "QU8": "qu8"}[DATATYPE]
$#
```
Contributor:

Suggest moving the template into a src/qd8-f32-qb4w-gemm/ folder.

Contributor:

Do you mean we should also create a separate template for QB4W?

I'm not sure which way that falls from a maintenance point of view TBH, i.e. one overloaded template vs. separate templates with similar logic.

Contributor (Author):

I wanted to follow up on whether this means moving the SET_INDENT function to some common tools file, or whether all the qb4w GEMM metakernels should live in their own folder separate from qs8-gemm.

Comment on lines +154 to +157:

```c
__m128 one_sixteenth = _mm_set_ps1(1.0f/16);
vout0x0123 = _mm_mul_ps(vout0x0123, one_sixteenth);

const __m128 vinput_scale0 = _mm_load1_ps(&quantization_params[0].inv_scale);
```
Contributor:

Possible to trade some of the mul_ps with a scalar one?

Suggested change:

```diff
-__m128 one_sixteenth = _mm_set_ps1(1.0f/16);
-vout0x0123 = _mm_mul_ps(vout0x0123, one_sixteenth);
-const __m128 vinput_scale0 = _mm_load1_ps(&quantization_params[0].inv_scale);
+const __m128 vinput_scale0 = _mm_set1_ps(quantization_params[0].inv_scale * (1/16.0));
```

Contributor (Author):

There is some special logic for loading vinput_scale from the struct when MR > 1; specifically, we can get two inv_scales per load. For MR = 1 we can do this, but I'm not sure it would give us any better perf for MR > 1. Any thoughts on that?
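To make the "two inv_scales per load" point concrete, here is a rough sketch; the struct layout and names below are assumptions for illustration and may not match the actual kernel code or the XNNPACK struct definitions.

```c
#include <stdint.h>
#include <immintrin.h>

// Illustrative struct standing in for the per-row quantization params
// (assumed layout: 32-bit zero point followed by fp32 inv_scale).
struct row_qparams {
  int32_t zero_point;
  float inv_scale;
};

// With MR = 2, a single 128-bit load covers the {zero_point, inv_scale}
// pairs of two rows; shuffles then broadcast the two inv_scale lanes.
static inline void load_two_input_scales(const struct row_qparams params[2],
                                         __m128* vinput_scale0,
                                         __m128* vinput_scale1) {
  const __m128 vparams01 =
      _mm_castsi128_ps(_mm_loadu_si128((const __m128i*) params));
  *vinput_scale0 = _mm_shuffle_ps(vparams01, vparams01, _MM_SHUFFLE(1, 1, 1, 1));
  *vinput_scale1 = _mm_shuffle_ps(vparams01, vparams01, _MM_SHUFFLE(3, 3, 3, 3));
}
```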

digantdesai (Contributor), Jun 19, 2024:

So, two things:

  • The scalar pipes should be relatively open at this point, and we don't have any dependency, so this should be better than competing for the vector pipes with another vector multiply. But if we are avoiding cross-pipe transfers for MR > 1, then the current approach might be better.
  • Somewhat orthogonal to this: if we move the 1/16.0f into the weight (block) scales, it might be better for both performance and accuracy, since we no longer do any multiplication in the kernel and the fp32 accumulation might behave better with smaller numbers. To be clear, when converting the larger int32 accumulator to fp32 in the first step we may drop some precision if the int32 is too large, but multiplying by a smaller scale and accumulating in fp32 might still help. That said, I don't expect any significant change in either numerics or perf, TBH.

WDYT?
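A rough sketch of what folding the 1/16.0f into the packed per-block weight scales could look like at packing time (function and variable names are illustrative, not the actual XNNPACK packing code):

```c
#include <stddef.h>

// Illustrative only: pre-multiply the per-block weight scales by 1/16 once
// during packing so the kernel can drop its _mm_mul_ps by one_sixteenth.
static void fold_one_sixteenth_into_block_scales(float* block_scales,
                                                 size_t num_blocks) {
  for (size_t i = 0; i < num_blocks; i++) {
    block_scales[i] *= 1.0f / 16.0f;
  }
}
```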

Contributor:

Here you go: #6596

Contributor (Author):

Sure, I'm OK with doing the 1/16 change as a follow-up after this PR.

mcr229 marked this pull request as ready for review on June 24, 2024, 20:28.
mcr229 mentioned this pull request on Jun 25, 2024.
alankelly (Collaborator):

Please resolve conflicts and I will import.
