
QB4W SSE2/SSE41 GEMM Kernels #6576

Open

wants to merge 6 commits into master
Conversation

@mcr229 mcr229 commented Jun 17, 2024

This pull request adds blockwise 4-bit (qb4w) GEMM microkernels targeting the x86 SSE2 and SSE4.1 instruction sets.

Note: This PR includes one commit from #6557 (Test generation update for qb4w). I'm putting this PR up for review before that PR merges so that we can start the review process.

Tests and benchmarks were run on an Ice Lake Xeon processor. A block_size equal to KC is functionally equivalent to QC4W, so QC4W provides a reasonable performance comparison.
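For context, a scalar sketch of what one qb4w output element computes is shown below. The packed nibble order, the implied weight zero point of 8, and the parameter names are assumptions for illustration only; they do not describe the microkernels' actual packed layout.

```c
#include <stddef.h>
#include <stdint.h>

// Reference sketch only: blockwise 4-bit (qb4w) dot product for one output
// element. Assumes two 4-bit weights per byte (low nibble first), an implied
// weight zero point of 8, a per-block fp32 weight scale, and int8 inputs
// with a per-row zero point and scale. Not the kernels' packed format.
static float qb4w_dot_reference(
    const int8_t* a,            // k quantized int8 inputs for one row
    int32_t a_zero_point,       // input zero point for that row
    float a_scale,              // input dequantization scale for that row
    const uint8_t* w_packed,    // k/2 bytes holding k packed 4-bit weights
    const float* block_scales,  // k/block_size per-block weight scales
    size_t k, size_t block_size)
{
  float acc = 0.0f;
  for (size_t b = 0; b < k / block_size; b++) {
    int32_t block_acc = 0;  // integer accumulation within one block
    for (size_t i = 0; i < block_size; i++) {
      const size_t idx = b * block_size + i;
      const uint8_t byte = w_packed[idx / 2];
      const int32_t w = (int32_t) ((idx & 1) ? (byte >> 4) : (byte & 0x0F)) - 8;
      block_acc += ((int32_t) a[idx] - a_zero_point) * w;
    }
    // Each block contributes with its own scale; QC4W is the special case of
    // a single scale per column (block_size == KC).
    acc += (float) block_acc * block_scales[b];
  }
  return acc * a_scale;
}
```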

SSE2 Benchmarks (average of OPS, M = 128)

| Tile Size | Kernel | BL | Kernel Type | N=16, K=1024 | N=128, K=1024 | N=4096, K=1024 | N=11008, K=4096 | N=32000, K=4096 |
|-----------|--------|----|-------------|--------------|---------------|----------------|-----------------|-----------------|
| 1x4c8 | qd8_f32_qb4w | 32 | sse2_ld128 | 16.3626 | 16.5004 | 16.4693 | 16.2591 | 15.1152 |
| 1x4c8 | qd8_f32_qb4w | 32 | sse2_ld64 | 13.8485 | 13.8654 | 13.7583 | 13.7609 | 12.9008 |
| 1x4c8 | qd8_f32_qb4w | 256 | sse2_ld128 | 19.1497 | 19.1257 | 19.2257 | 19.0563 | 17.9628 |
| 1x4c8 | qd8_f32_qb4w | 256 | sse2_ld64 | 15.5274 | 15.5863 | 15.541 | 15.4571 | 14.8878 |
| 1x4c8 | qd8_f32_qc4w | | sse2_ld128 | 19.0017 | 19.3572 | 19.4491 | 19.1485 | 18.5355 |
| 1x4c8 | qd8_f32_qc4w | | sse2_ld64 | 15.364 | 15.311 | 15.4119 | 15.3349 | 15.4244 |
| 2x4c8 | qd8_f32_qb4w | 32 | sse2_ld128 | 19.7986 | 19.7132 | 19.3471 | 19.874 | 19.457 |
| 2x4c8 | qd8_f32_qb4w | 32 | sse2_ld64 | 15.0903 | 16.8612 | 13.1044 | 16.8245 | 17.2448 |
| 2x4c8 | qd8_f32_qb4w | 256 | sse2_ld128 | 24.3207 | 24.276 | 24.263 | 24.386 | 23.4275 |
| 2x4c8 | qd8_f32_qb4w | 256 | sse2_ld64 | 16.5991 | 19.1659 | 17.5122 | 20.0571 | 21.1375 |
| 2x4c8 | qd8_f32_qc4w | | sse2_ld128 | 24.7754 | 24.3957 | 24.3669 | 25.0426 | 24.9166 |
| 2x4c8 | qd8_f32_qc4w | | sse2_ld64 | 21.6573 | 21.4722 | 21.3531 | 21.6695 | 21.7461 |
| 3x4c8 | qd8_f32_qb4w | 32 | sse2_ld128 | 21.0667 | 20.9242 | 20.4278 | 21.1425 | 20.9614 |
| 3x4c8 | qd8_f32_qb4w | 32 | sse2_ld64 | 19.7949 | 19.862 | 19.8228 | 19.78 | 19.6134 |
| 3x4c8 | qd8_f32_qb4w | 256 | sse2_ld128 | 26.2247 | 25.7603 | 26.5098 | 26.3606 | 26.2232 |
| 3x4c8 | qd8_f32_qb4w | 256 | sse2_ld64 | 24.0754 | 24.1035 | 24.1776 | 24.1109 | 23.6717 |
| 3x4c8 | qd8_f32_qc4w | | sse2_ld128 | 26.8972 | 27.2868 | 27.1239 | 27.4936 | 27.3707 |
| 3x4c8 | qd8_f32_qc4w | | sse2_ld64 | 24.008 | 23.7783 | 24.4216 | 24.9255 | 24.7481 |
| 4x4c8 | qd8_f32_qb4w | 32 | sse2_ld128 | 21.546 | 21.2863 | 21.9667 | 22.0366 | 21.8368 |
| 4x4c8 | qd8_f32_qb4w | 32 | sse2_ld64 | 20.2396 | 20.4829 | 20.3171 | 20.1945 | 20.327 |
| 4x4c8 | qd8_f32_qb4w | 256 | sse2_ld128 | 27.4331 | 27.4372 | 27.6257 | 27.7552 | 27.7811 |
| 4x4c8 | qd8_f32_qb4w | 256 | sse2_ld64 | 25.1659 | 25.7404 | 25.4879 | 25.5119 | 25.4176 |
| 4x4c8 | qd8_f32_qc4w | | sse2_ld128 | 28.0353 | 28.4844 | 28.1422 | 28.431 | 28.4336 |
| 4x4c8 | qd8_f32_qc4w | | sse2_ld64 | 26.1268 | 26.5393 | 26.4284 | 26.8376 | 26.746 |

SSE4.1 Benchmarks (average of OPS, M = 128)

| Tile Size | Kernel | BL | Kernel Type | N=16, K=1024 | N=128, K=1024 | N=4096, K=1024 | N=11008, K=4096 | N=32000, K=4096 |
|-----------|--------|----|-------------|--------------|---------------|----------------|-----------------|-----------------|
| 1x4c8 | qd8_f32_qb4w | 32 | sse41_ld128 | 17.3014 | 17.3535 | 17.2064 | 17.1379 | 16.0412 |
| 1x4c8 | qd8_f32_qb4w | 32 | sse41_ld64 | 16.9989 | 16.8812 | 16.9124 | 16.8923 | 16.2466 |
| 1x4c8 | qd8_f32_qb4w | 256 | sse41_ld128 | 20.1779 | 20.2345 | 20.2459 | 20.2033 | 19.4152 |
| 1x4c8 | qd8_f32_qb4w | 256 | sse41_ld64 | 19.4751 | 19.1835 | 19.4524 | 19.2826 | 18.5472 |
| 1x4c8 | qd8_f32_qc4w | N/A | sse41_ld128 | 20.4684 | 20.5874 | 20.1799 | 20.5715 | 18.9625 |
| 1x4c8 | qd8_f32_qc4w | N/A | sse41_ld64 | 19.6819 | 19.3431 | 19.7433 | 19.5436 | 19.2356 |
| 2x4c8 | qd8_f32_qb4w | 32 | sse41_ld128 | 21.2354 | 21.1893 | 21.1614 | 20.9931 | 20.8421 |
| 2x4c8 | qd8_f32_qb4w | 32 | sse41_ld64 | 21.3569 | 21.4488 | 21.235 | 20.9837 | 21.0178 |
| 2x4c8 | qd8_f32_qb4w | 256 | sse41_ld128 | 26.0017 | 25.9916 | 25.4473 | 25.5135 | 25.57 |
| 2x4c8 | qd8_f32_qb4w | 256 | sse41_ld64 | 25.2493 | 25.7025 | 26.085 | 25.9694 | 25.2185 |
| 2x4c8 | qd8_f32_qc4w | N/A | sse41_ld128 | 26.8566 | 27.1628 | 27.0692 | 27.3674 | 26.9184 |
| 2x4c8 | qd8_f32_qc4w | N/A | sse41_ld64 | 26.183 | 26.2211 | 26.139 | 26.5423 | 26.6486 |
| 3x4c8 | qd8_f32_qb4w | 32 | sse41_ld128 | 22.6084 | 22.5638 | 22.2168 | 22.7137 | 22.4851 |
| 3x4c8 | qd8_f32_qb4w | 32 | sse41_ld64 | 22.2074 | 22.2576 | 22.1917 | 22.2111 | 21.9517 |
| 3x4c8 | qd8_f32_qb4w | 256 | sse41_ld128 | 28.6587 | 28.6288 | 29.0569 | 28.9576 | 28.5065 |
| 3x4c8 | qd8_f32_qb4w | 256 | sse41_ld64 | 28.233 | 28.3912 | 28.5254 | 28.3961 | 28.1014 |
| 3x4c8 | qd8_f32_qc4w | N/A | sse41_ld128 | 29.2685 | 29.6141 | 29.612 | 29.7231 | 29.4312 |
| 3x4c8 | qd8_f32_qc4w | N/A | sse41_ld64 | 29.0769 | 29.047 | 29.7566 | 30.0988 | 29.7775 |
| 4x4c8 | qd8_f32_qb4w | 32 | sse41_ld128 | 23.4833 | 22.5307 | 23.0766 | 23.3969 | 23.209 |
| 4x4c8 | qd8_f32_qb4w | 32 | sse41_ld64 | 23.0014 | 23.5207 | 23.657 | 23.5848 | 23.4027 |
| 4x4c8 | qd8_f32_qb4w | 256 | sse41_ld128 | 30.0163 | 29.9047 | 30.2299 | 30.367 | 29.8178 |
| 4x4c8 | qd8_f32_qb4w | 256 | sse41_ld64 | 30.0503 | 29.1273 | 29.8003 | 30.3124 | 30.0438 |
| 4x4c8 | qd8_f32_qc4w | N/A | sse41_ld128 | 30.7884 | 31.3583 | 31.1508 | 31.4678 | 30.5934 |
| 4x4c8 | qd8_f32_qc4w | N/A | sse41_ld64 | 31.2101 | 30.8091 | 31.3049 | 31.2445 | 31.8041 |

digantdesai mentioned this pull request on Jun 17, 2024.
```
@@ -31,17 +31,35 @@ $if DATATYPE != "QD8":
#include <xnnpack/unaligned.h>


$DATATYPE_SPEC = {"QC8": "qs8_qc8w", "QD8": "qd8_f32_qc8w", "QC4": "qd8_f32_qc4w", "QS8": "qs8", "QU8": "qu8"}[DATATYPE]
$#
```
Contributor:

Suggest moving the template into a src/qd8-f32-qb4w-gemm/ folder.

Contributor:

Do you mean we should also create a separate template for QB4W?

I'm not sure which way that falls from a maintenance point of view TBH, i.e. one overloaded template vs. separate templates with similar logic.

Contributor (Author):

I wanted to follow up on whether this means moving the SET_INDENT function to some common tools file, or whether all the qb4w GEMM metakernels should live in their own folder separate from qs8-gemm.

Comment on lines +154 to +157:

```c
__m128 one_sixteenth = _mm_set_ps1(1.0f/16);
vout0x0123 = _mm_mul_ps(vout0x0123, one_sixteenth);

const __m128 vinput_scale0 = _mm_load1_ps(&quantization_params[0].inv_scale);
```
Contributor:

Possible to trade some of the mul_ps with a scalar one?

Suggested change:

```diff
-__m128 one_sixteenth = _mm_set_ps1(1.0f/16);
-vout0x0123 = _mm_mul_ps(vout0x0123, one_sixteenth);
-const __m128 vinput_scale0 = _mm_load1_ps(&quantization_params[0].inv_scale);
+const __m128 vinput_scale0 = _mm_set1_ps(quantization_params[0].inv_scale * (1/16.0));
```

Contributor (Author):

There is some special logic for loading vinput_scale from the struct when MR > 1; specifically, we can get two inv_scales per load. For MR = 1 we can do this, but I'm not sure it would give us any better perf for MR > 1. Any thoughts on that?
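To make the "two inv_scales per load" point concrete, here is a rough sketch; the struct layout and names below are assumptions for illustration and may not match the actual kernel code or the XNNPACK struct definitions.

```c
#include <stdint.h>
#include <immintrin.h>

// Illustrative struct standing in for the per-row quantization params
// (assumed layout: 32-bit zero point followed by fp32 inv_scale).
struct row_qparams {
  int32_t zero_point;
  float inv_scale;
};

// With MR = 2, a single 128-bit load covers the {zero_point, inv_scale}
// pairs of two rows; shuffles then broadcast the two inv_scale lanes.
static inline void load_two_input_scales(const struct row_qparams params[2],
                                         __m128* vinput_scale0,
                                         __m128* vinput_scale1) {
  const __m128 vparams01 =
      _mm_castsi128_ps(_mm_loadu_si128((const __m128i*) params));
  *vinput_scale0 = _mm_shuffle_ps(vparams01, vparams01, _MM_SHUFFLE(1, 1, 1, 1));
  *vinput_scale1 = _mm_shuffle_ps(vparams01, vparams01, _MM_SHUFFLE(3, 3, 3, 3));
}
```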

digantdesai (Contributor), Jun 19, 2024:

So, two things:

  • The scalar pipes should be relatively open at this point, and we don't have any dependency, so this should be better than competing for the vector pipes with another vector multiply. But if we are avoiding cross-pipe transfers for MR > 1, then the current approach might be better.
  • Somewhat orthogonal to this: if we move the 1/16.0f into the weight (block) scales, it might be better for both performance and accuracy, since we no longer do any multiplication in the kernel and the fp32 accumulation might behave better with smaller numbers. To be clear, when converting the larger int32 accumulator to fp32 in the first step we may drop some precision if the int32 is too large, but multiplying by a smaller scale and accumulating in fp32 might still help. That said, I don't expect any significant change in either numerics or perf, TBH.

WDYT?
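A rough sketch of what folding the 1/16.0f into the packed per-block weight scales could look like at packing time (function and variable names are illustrative, not the actual XNNPACK packing code):

```c
#include <stddef.h>

// Illustrative only: pre-multiply the per-block weight scales by 1/16 once
// during packing so the kernel can drop its _mm_mul_ps by one_sixteenth.
static void fold_one_sixteenth_into_block_scales(float* block_scales,
                                                 size_t num_blocks) {
  for (size_t i = 0; i < num_blocks; i++) {
    block_scales[i] *= 1.0f / 16.0f;
  }
}
```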

Contributor:

Here you go: #6596

Contributor (Author):

Sure, I'm OK with doing the 1/16 change as a follow-up after this PR.

mcr229 marked this pull request as ready for review on June 24, 2024, 20:28.
mcr229 mentioned this pull request on Jun 25, 2024.
alankelly (Collaborator):

Please resolve conflicts and I will import.
