QB4W SSE2/SSE41 GEMM Kernels #6576
base: master
Conversation
@@ -31,17 +31,35 @@ $if DATATYPE != "QD8":
  #include <xnnpack/unaligned.h>

$DATATYPE_SPEC = {"QC8": "qs8_qc8w", "QD8": "qd8_f32_qc8w", "QC4": "qd8_f32_qc4w", "QS8": "qs8", "QU8": "qu8"}[DATATYPE]
$#
Suggest moving template into src/qd8-f32-qb4w-gemm/ folder
Do you mean we should also create a different template for QB4W? Not sure which way it will fall from a maintenance point of view TBH, i.e. one overloaded template vs. separate templates with similar logic.
Wanted to follow up on whether this meant moving the SET_INDENT function to some common tools file, or whether all the qb4w gemm metakernels should live in their own folder separate from qs8-gemm.
__m128 one_sixteenth = _mm_set_ps1(1.0f/16);
vout0x0123 = _mm_mul_ps(vout0x0123, one_sixteenth);

const __m128 vinput_scale0 = _mm_load1_ps(&quantization_params[0].inv_scale);
Possible to trade some mul_ps with a scalar one?
Suggested change:
-__m128 one_sixteenth = _mm_set_ps1(1.0f/16);
-vout0x0123 = _mm_mul_ps(vout0x0123, one_sixteenth);
-const __m128 vinput_scale0 = _mm_load1_ps(&quantization_params[0].inv_scale);
+const __m128 vinput_scale0 = _mm_set1_ps(quantization_params[0].inv_scale * (1/16.0));
There is some special logic loading the vinput_scale from the struct when MR > 1: specifically, we can get two inv_scales per load. For MR=1 we can do this, but I'm not sure this would give us any better perf for MR > 1. Any thoughts on that?
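For reference, a minimal sketch of the "two inv_scales per load" idea. The struct layout and names here are illustrative assumptions, not the kernel's actual code; the point is that one 128-bit load covers two consecutive 8-byte {zero_point, inv_scale} records, and a shuffle then splats the two inv_scale lanes.

```c
#include <stdint.h>
#include <emmintrin.h>  // SSE2

// Illustrative layout: an 8-byte {zero_point, inv_scale} record, so one
// 128-bit load covers two consecutive entries.
struct quantization_params {
  int32_t zero_point;
  float inv_scale;
};

// Lanes of the load are {zp0, inv_scale0, zp1, inv_scale1}; shuffling lanes
// 1 and 3 yields {inv_scale0, inv_scale0, inv_scale1, inv_scale1}.
static inline __m128 load_two_inv_scales(const struct quantization_params params[2]) {
  const __m128i vzp_scale01 = _mm_loadu_si128((const __m128i*) params);
  return _mm_castsi128_ps(_mm_shuffle_epi32(vzp_scale01, _MM_SHUFFLE(3, 3, 1, 1)));
}
```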
So, two things:
- Scalar pipes should be relatively open at this point, and we don't have any dependency, so this should be better than competing for the vector pipes with another vector multiplication. But if we are avoiding cross-pipe transfers for MR > 1, then that might be better.
- Somewhat orthogonal to this, I feel that if we move the 1/16.0f into the weight (block) scales, it might be better for both performance and accuracy, given we don't have to do any multiplication in the kernel and the fp32 accumulation might behave better with a smaller number. To be clear, when we convert a large int32 accumulator into fp32 in the first step we may drop some precision if the int32 is too large, but multiplying with a smaller scale and accumulating in fp32 might still help. That said, I don't expect any significant changes in either numerics or perf TBH. (A packing-time sketch of this idea follows below.)

WDYT?
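For illustration, folding the 1/16 into the block scales would be a packing-time change along these lines; the helper name and calling convention are hypothetical, not XNNPACK's actual packing API.

```c
#include <stddef.h>

// Hypothetical packing-time helper: pre-scale every packed block scale by
// 1/16 so the microkernel can drop the one_sixteenth multiply entirely.
static void fold_one_sixteenth_into_block_scales(float* packed_block_scales,
                                                 size_t num_block_scales) {
  for (size_t i = 0; i < num_block_scales; i++) {
    packed_block_scales[i] *= 1.0f / 16.0f;
  }
}
```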
Here you go #6596
Sure, I'm OK with doing the 1/16 stuff as a follow-up after this PR.
Please resolve conflicts and I will import.
This pull request adds blockwise 4-bit (qb4w) GEMM microkernels targeting the x86 SSE2 and SSE4.1 instruction families.
Note: This PR includes one commit from #6557 (Test generation update for qb4w). I'm putting this PR up for review before that PR merges so that we can start the review process.
Tests and benchmarks were run on an Ice Lake Xeon processor. A block_size equal to KC is functionally equivalent to QC4W, so QC4W provides a reasonable performance comparison.
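For context, a scalar reference of the blockwise 4-bit computation these kernels vectorize. The layout, the fixed zero point of 8, and all names are illustrative assumptions rather than the microkernel's actual packing format; with block_size == kc there is exactly one block per output channel, which is the per-channel (QC4W) case mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

// Scalar reference sketch of blockwise 4-bit GEMM (illustrative names only).
// Each block of block_size weights along K shares one scale.
static void qb4w_gemm_ref(size_t mc, size_t nc, size_t kc, size_t block_size,
                          const int8_t* a, const float* a_scale,
                          const uint8_t* w4,        // 4-bit weights, one per byte here for clarity
                          const float* block_scale, // [nc][kc / block_size]
                          float* c) {
  const size_t num_blocks = kc / block_size;
  for (size_t m = 0; m < mc; m++) {
    for (size_t n = 0; n < nc; n++) {
      float acc = 0.0f;
      for (size_t b = 0; b < num_blocks; b++) {
        int32_t iacc = 0;
        for (size_t k = b * block_size; k < (b + 1) * block_size; k++) {
          // Assumed fixed zero point of 8 for the unsigned 4-bit weights.
          iacc += (int32_t) a[m * kc + k] * ((int32_t) w4[n * kc + k] - 8);
        }
        // Per-block weight scale is applied once per block.
        acc += (float) iacc * block_scale[n * num_blocks + b];
      }
      // Per-row activation scale is factored out of the K loop.
      c[m * nc + n] = acc * a_scale[m];
    }
  }
}
```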
SSE2 Benchmarks
SSE4.1 Benchmarks