
[CPU] add fp16 support to shm inference_all_reduce #5669

Merged · 12 commits · Jun 26, 2024

Conversation

@delock (Contributor) commented Jun 17, 2024

This PR adds FP16 support to DeepSpeed SHM inference_all_reduce. Previously only FP32 and BF16 were supported. This aligns with PyTorch CPU support for the FP16 datatype.
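To illustrate the kind of reduction this enables, here is a minimal reference sketch of a sum-allreduce over low-precision buffers. It accumulates in FP32 and casts back to the input dtype, a common pattern when reducing fp16/bf16 data to limit rounding error. This is purely illustrative (NumPy-based, single process) and is not DeepSpeed's actual SHM kernel; bf16 is omitted because NumPy has no native bfloat16 type.

```python
import numpy as np

def reference_all_reduce(rank_buffers):
    """Reference sum-allreduce over per-rank buffers.

    Accumulates in float32 and casts back to the input dtype, mirroring
    the usual trick for reducing low-precision (fp16) buffers accurately.
    Illustrative only -- not DeepSpeed's SHM implementation.
    """
    acc = np.zeros_like(rank_buffers[0], dtype=np.float32)
    for buf in rank_buffers:
        acc += buf.astype(np.float32)
    return acc.astype(rank_buffers[0].dtype)

# Four simulated ranks each hold the same fp16 payload; the reduced
# result should be world_size * payload, still in fp16.
world_size = 4
payload = np.arange(8, dtype=np.float16)
result = reference_all_reduce([payload.copy() for _ in range(world_size)])
```

All values here are small integers, so they are exactly representable in fp16 and the result matches `world_size * payload` bit-for-bit.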

@loadams loadams requested review from tjruwase and tohtana June 17, 2024 21:08
@adk9 (Contributor) commented Jun 17, 2024

Could you consider adding some unit tests to perhaps test_dist.py to test support for the different data types?

@delock (Contributor, Author) commented Jun 18, 2024

> Could you consider adding some unit tests to perhaps test_dist.py to test support for the different data types?

Let me see if I can add some tests.

@delock delock requested a review from loadams as a code owner June 18, 2024 15:36
@delock (Contributor, Author) commented Jun 18, 2024

Hi @adk9, TestDistInferenceAllReduce has been modified to test fp32, bf16, and fp16. Can you help start the workflow? Thanks!
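The actual test lives in DeepSpeed's test suite and runs under a real distributed backend. As a rough, self-contained sketch of what a dtype-parametrized allreduce check asserts, the snippet below simulates each rank contributing a tensor of `rank + 1` and verifies the summed result within a per-dtype tolerance. The dtype/tolerance table and the function name are hypothetical; bf16 is left out since NumPy lacks a native bfloat16.

```python
import numpy as np

# Hypothetical dtype -> relative-tolerance table mirroring the kind of
# parametrization a TestDistInferenceAllReduce might iterate over.
DTYPES = {np.float32: 1e-6, np.float16: 1e-3}

def check_all_reduce(world_size, numel, dtype, rtol):
    # Rank r contributes a buffer full of (r + 1); a sum-allreduce should
    # yield world_size * (world_size + 1) / 2 in every element.
    bufs = [np.full(numel, rank + 1, dtype=dtype) for rank in range(world_size)]
    out = sum(b.astype(np.float32) for b in bufs).astype(dtype)
    expected = world_size * (world_size + 1) / 2
    return np.allclose(out, expected, rtol=rtol)

# Cover world_size = 1, 2, 4 for each dtype, including the single-rank
# edge case discussed later in this thread.
results = [check_all_reduce(ws, 1024, dt, tol)
           for ws in (1, 2, 4)
           for dt, tol in DTYPES.items()]
```

Note the inclusion of `world_size=1`: a correct allreduce should be an identity operation with a single rank, so no special-case assertion is needed.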

@delock (Contributor, Author) commented Jun 19, 2024

Hi @adk9, the FP32 allreduce failure is due to the modified UT now testing world_size=1, 2, and 4, while the code carried an unnecessary assertion for world_size == 1. That assertion has been removed, since the code handles this case correctly. Can you help restart the workflow? Thanks!

@adk9 adk9 added this pull request to the merge queue Jun 19, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Jun 20, 2024
@adk9 adk9 added this pull request to the merge queue Jun 20, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 20, 2024
@adk9 adk9 added this pull request to the merge queue Jun 21, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Jun 22, 2024
@adk9 adk9 added this pull request to the merge queue Jun 24, 2024
github-merge-queue bot pushed a commit that referenced this pull request Jun 24, 2024
This PR adds FP16 support to DeepSpeed SHM inference_all_reduce.
Previously only FP32 and BF16 were supported. This aligns with
PyTorch CPU support for the FP16 datatype.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 24, 2024
@loadams loadams enabled auto-merge June 24, 2024 22:35
@loadams loadams disabled auto-merge June 24, 2024 22:45
@loadams loadams enabled auto-merge June 26, 2024 17:01
@loadams loadams added this pull request to the merge queue Jun 26, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 26, 2024
@loadams loadams added this pull request to the merge queue Jun 26, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 26, 2024
@loadams loadams added this pull request to the merge queue Jun 26, 2024
Merged via the queue into microsoft:master with commit 19da95f Jun 26, 2024
12 checks passed
4 participants