
[CPU] add fp16 support to shm inference_all_reduce #5669

Merged · 12 commits · Jun 26, 2024

Conversation

@delock (Contributor) commented Jun 17, 2024

This PR adds FP16 support to DeepSpeed SHM inference_all_reduce. Previously only FP32 and BF16 were supported. This aligns with PyTorch CPU support for the FP16 datatype.
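To illustrate the kind of reduction this enables, here is a minimal reference sketch of a sum-allreduce over low-precision buffers. It accumulates in FP32 and casts back to the input dtype, a common pattern when reducing fp16/bf16 data to limit rounding error. This is purely illustrative (NumPy-based, single process) and is not DeepSpeed's actual SHM kernel; bf16 is omitted because NumPy has no native bfloat16 type.

```python
import numpy as np

def reference_all_reduce(rank_buffers):
    """Reference sum-allreduce over per-rank buffers.

    Accumulates in float32 and casts back to the input dtype, mirroring
    the usual trick for reducing low-precision (fp16) buffers accurately.
    Illustrative only -- not DeepSpeed's SHM implementation.
    """
    acc = np.zeros_like(rank_buffers[0], dtype=np.float32)
    for buf in rank_buffers:
        acc += buf.astype(np.float32)
    return acc.astype(rank_buffers[0].dtype)

# Four simulated ranks each hold the same fp16 payload; the reduced
# result should be world_size * payload, still in fp16.
world_size = 4
payload = np.arange(8, dtype=np.float16)
result = reference_all_reduce([payload.copy() for _ in range(world_size)])
```

All values here are small integers, so they are exactly representable in fp16 and the result matches `world_size * payload` bit-for-bit.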

@loadams loadams requested review from tjruwase and tohtana June 17, 2024 21:08
@adk9 (Contributor) commented Jun 17, 2024

Could you consider adding some unit tests to perhaps test_dist.py to test support for the different data types?

@delock (Contributor, Author) commented Jun 18, 2024

> Could you consider adding some unit tests to perhaps test_dist.py to test support for the different data types?

Let me see if I can add some tests.

@delock delock requested a review from loadams as a code owner June 18, 2024 15:36
@delock (Contributor, Author) commented Jun 18, 2024

Hi @adk9, TestDistInferenceAllReduce has been modified to test fp32, bf16, and fp16. Can you help start the workflow? Thanks!
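The actual test lives in DeepSpeed's test suite and runs under a real distributed backend. As a rough, self-contained sketch of what a dtype-parametrized allreduce check asserts, the snippet below simulates each rank contributing a tensor of `rank + 1` and verifies the summed result within a per-dtype tolerance. The dtype/tolerance table and the function name are hypothetical; bf16 is left out since NumPy lacks a native bfloat16.

```python
import numpy as np

# Hypothetical dtype -> relative-tolerance table mirroring the kind of
# parametrization a TestDistInferenceAllReduce might iterate over.
DTYPES = {np.float32: 1e-6, np.float16: 1e-3}

def check_all_reduce(world_size, numel, dtype, rtol):
    # Rank r contributes a buffer full of (r + 1); a sum-allreduce should
    # yield world_size * (world_size + 1) / 2 in every element.
    bufs = [np.full(numel, rank + 1, dtype=dtype) for rank in range(world_size)]
    out = sum(b.astype(np.float32) for b in bufs).astype(dtype)
    expected = world_size * (world_size + 1) / 2
    return np.allclose(out, expected, rtol=rtol)

# Cover world_size = 1, 2, 4 for each dtype, including the single-rank
# edge case discussed later in this thread.
results = [check_all_reduce(ws, 1024, dt, tol)
           for ws in (1, 2, 4)
           for dt, tol in DTYPES.items()]
```

Note the inclusion of `world_size=1`: a correct allreduce should be an identity operation with a single rank, so no special-case assertion is needed.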

@delock (Contributor, Author) commented Jun 19, 2024

Hi @adk9, the FP32 allreduce failure is due to the modified UT now testing world_size=1, 2, and 4, while the code carried an unnecessary assertion for world_size == 1. That assertion has been removed, since the code handles this case correctly. Can you help restart the workflow? Thanks!

@adk9 adk9 added this pull request to the merge queue Jun 19, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Jun 20, 2024
@adk9 adk9 added this pull request to the merge queue Jun 20, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 20, 2024
@adk9 adk9 added this pull request to the merge queue Jun 21, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Jun 22, 2024
@adk9 adk9 added this pull request to the merge queue Jun 24, 2024
github-merge-queue bot pushed a commit that referenced this pull request Jun 24, 2024
This PR adds FP16 support to DeepSpeed SHM inference_all_reduce.
Previously only FP32 and BF16 were supported. This aligns with
PyTorch CPU support for the FP16 datatype.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 24, 2024
@loadams loadams enabled auto-merge June 24, 2024 22:35
@loadams loadams disabled auto-merge June 24, 2024 22:45
@loadams loadams enabled auto-merge June 26, 2024 17:01
@loadams loadams added this pull request to the merge queue Jun 26, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 26, 2024
@loadams loadams added this pull request to the merge queue Jun 26, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 26, 2024
@loadams loadams added this pull request to the merge queue Jun 26, 2024
Merged via the queue into microsoft:master with commit 19da95f Jun 26, 2024
12 checks passed
4 participants