
reduce all-to-all communication volume when both expert and non-expert are tensor-parallel #5626

Open
wants to merge 2 commits into master
Conversation

@taozhiwei (Contributor) commented Jun 7, 2024

Example: E + M + D parallel
world_size = 8
model_degree = 2
expert_degree = 4
mp_group = [0, 1], [2, 3], [4, 5], [6, 7]
expert_parallel_group = [0, 2, 4, 6], [1, 3, 5, 7]
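For illustration, a minimal sketch of how this group layout could be constructed with torch.distributed (the variable names and new_group calls are illustrative, not the PR's actual code):

```python
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called on 8 ranks.
world_size = 8
model_degree = 2   # tensor (model) parallel degree
expert_degree = 4  # expert parallel degree

# Tensor-parallel groups: consecutive ranks share the model shards.
# -> [0, 1], [2, 3], [4, 5], [6, 7]
mp_groups = [dist.new_group(list(range(i, i + model_degree)))
             for i in range(0, world_size, model_degree)]

# Expert-parallel groups: ranks strided by the TP degree hold different experts.
# -> [0, 2, 4, 6], [1, 3, 5, 7]
ep_groups = [dist.new_group(list(range(i, world_size, model_degree)))
             for i in range(model_degree)]
```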

In the original execution, there was no drop operation before the expert layer: the two expert-parallel groups each ran their all-to-all separately, and both ended up with the complete data, so ranks 0 and 1 received exactly the same data (likewise ranks 2 and 3, and so on).
We can therefore drop the duplicated data before the all-to-all, and run an allgather after the all-to-all to recover the complete data.

After the expert layer, the data on ranks 0 and 1 is again identical, so we can drop it, run the all-to-all, and then allgather to recover the complete data.

  1. non-expert uses TP, expert does not use TP: drop -> alltoall -> exe MoE -> alltoall -> allgather
  2. both non-expert and expert use TP:
    • original execution order: alltoall -> exe MoE -> allreduce -> alltoall
    • optimized execution order: drop -> alltoall -> allgather -> exe MoE -> drop -> alltoall -> allgather (see the sketch below)
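For illustration, a minimal sketch of the optimized path in case 2, built on plain PyTorch collectives. The helper names drop_tokens, gather_tokens, and moe_forward are hypothetical stand-ins for the PR's internals, and the tensor-parallel expert computation is abbreviated to a single call:

```python
import torch
import torch.distributed as dist

def drop_tokens(x, mp_group):
    # Keep only this TP rank's slice of the tokens. The other ranks in
    # mp_group hold identical copies, so no information is lost.
    # Assumes the token dimension is divisible by the TP degree.
    rank = dist.get_rank(group=mp_group)
    size = dist.get_world_size(group=mp_group)
    return x.chunk(size, dim=0)[rank].contiguous()

def gather_tokens(x, mp_group):
    # Inverse of drop_tokens: allgather the slices across the TP group
    # to rebuild the full token tensor.
    size = dist.get_world_size(group=mp_group)
    out = [torch.empty_like(x) for _ in range(size)]
    dist.all_gather(out, x, group=mp_group)
    return torch.cat(out, dim=0)

def moe_forward(tokens, mp_group, ep_group, experts):
    # Optimized order for case 2:
    #   drop -> alltoall -> allgather -> exe MoE -> drop -> alltoall -> allgather
    x = drop_tokens(tokens, mp_group)             # alltoall now moves 1/TP of the volume
    y = torch.empty_like(x)
    dist.all_to_all_single(y, x, group=ep_group)
    y = gather_tokens(y, mp_group)                # rebuild the full expert input
    y = experts(y)                                # tensor-parallel experts (allreduce inside)
    y = drop_tokens(y, mp_group)                  # outputs are again replicated across TP ranks
    z = torch.empty_like(y)
    dist.all_to_all_single(z, y, group=ep_group)
    return gather_tokens(z, mp_group)             # full output, replicated on each TP rank
```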

@tjruwase (Contributor) commented Jun 9, 2024

@siddharth9820, can you help review?

@tjruwase tjruwase requested review from tohtana and removed request for arashb and mrwyattii June 9, 2024 22:37
@siddharth9820 (Contributor) commented

At a quick glance, this looks good to me. We did use the same "optimized execution order" in Deepspeed-TED (https://dl.acm.org/doi/pdf/10.1145/3577193.3593704), but I think that update got lost in an unmerged branch. Thank you @taozhiwei for implementing this!

@taozhiwei (Contributor, Author) commented

Thank you for your review. I have just updated to the latest mainline code; please run the tests again. Thank you.

@taozhiwei taozhiwei closed this Jun 17, 2024
@taozhiwei taozhiwei reopened this Jun 17, 2024
@taozhiwei (Contributor, Author) commented

@siddharth9820 @tjruwase Please review again when you have time. Thank you very much.

@siddharth9820 (Contributor) commented Jun 18, 2024

@taozhiwei still lgtm. Do you have some convergence curves for your changes?

@taozhiwei (Contributor, Author) commented

I ran a test locally and the model still converged.

@siddharth9820 (Contributor) commented

Can you please post the loss curves for a model before and after your changes? If those are identical then this PR should be good to go.

@taozhiwei (Contributor, Author) commented Jun 21, 2024

@siddharth9820 https://github.com/microsoft/DeepSpeed/actions/runs/9605259289/job/26492504473?pr=5626
This test failed due to network issues; the CI needs to be triggered again. Thank you.

@taozhiwei (Contributor, Author) commented

How do I attach the images? I tried a few times without success.

@siddharth9820 (Contributor) commented

@taozhiwei you can take a screenshot and paste it here. Or maybe you can upload it to a shared gdrive location and share that with us?

@taozhiwei (Contributor, Author) commented Jun 25, 2024

Here is a comparison of the loss curves before and after the modification; they are consistent: https://imgur.com/Nhj7c1m
Please review again when you can, @siddharth9820 @tjruwase.

@siddharth9820 (Contributor) commented Jun 27, 2024

Thanks for doing this. LGTM. @tjruwase do we need any other tests?

@tjruwase (Contributor) commented

@taozhiwei, thanks for the PR. This is really a great contribution.

@siddharth9820, thanks for helping to review.

Approved.

@taozhiwei (Contributor, Author) commented Jun 28, 2024

The first failed test was due to HTTP 429: https://github.com/microsoft/DeepSpeed/actions/runs/9698089296/job/26763816372?pr=5626.
The second failed test passes for me locally (https://imgur.com/v2eMEox); in any case, my PR should not affect that test.
Can you help run the CI again? Thank you! @siddharth9820

@siddharth9820 (Contributor) commented Jun 28, 2024

@tjruwase or someone else working on DeepSpeed might be able to help you with the CI.
