reduce all-to-all communication volume when both expert and non-expert are tensor-parallel #5626
base: master
Conversation
@siddharth9820, can you help review?
At a quick glance, this looks good to me. We used the same "optimized execution order" in DeepSpeed-TED (https://dl.acm.org/doi/pdf/10.1145/3577193.3593704), but I think that update got lost in an unmerged branch. Thank you @taozhiwei for implementing this!
@siddharth9820 @tjruwase Please help review again when you have free time. Thank you very much.
@taozhiwei still lgtm. Do you have some convergence curves for your changes?
I ran a test locally and it still converged.
Can you please post the loss curves for a model before and after your changes? If they are identical, then this PR should be good to go.
@siddharth9820 https://github.com/microsoft/DeepSpeed/actions/runs/9605259289/job/26492504473?pr=5626
@taozhiwei you can take a screenshot and paste it here. Or maybe you can upload it to a shared Google Drive location and share that with us?
Here is a comparison of the loss curves before and after the modification; they are consistent.
Thanks for doing this. LGTM. @tjruwase do we need any other tests?
@taozhiwei, thanks for the PR. This is really a great contribution. @siddharth9820, thanks for helping to review. Approved.
The first failed test was due to HTTP 429: https://github.com/microsoft/DeepSpeed/actions/runs/9698089296/job/26763816372?pr=5626.
@tjruwase or someone else working at DeepSpeed might be able to help you with CI.
Example: E + M + D parallel
world_size = 8
model_degree = 2
expert_degree = 4
mp_group = [0, 1], [2, 3], [4, 5], [6, 7]
expert_parallel_group = [0,2,4,6], [1,3,5,7]
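The group layout above follows directly from the degrees. As a hedged sketch (not DeepSpeed's actual group-creation code), the tensor-parallel and expert-parallel groups can be derived like this:

```python
# Hypothetical sketch: derive the mp and expert-parallel groups listed
# above from world_size and model_degree. Variable names are illustrative.
world_size = 8
model_degree = 2
expert_degree = world_size // model_degree  # 4

# Tensor (model) parallel groups: consecutive ranks.
mp_groups = [list(range(r, r + model_degree))
             for r in range(0, world_size, model_degree)]

# Expert parallel groups: strided ranks, one group per mp rank index.
ep_groups = [list(range(m, world_size, model_degree))
             for m in range(model_degree)]

print(mp_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(ep_groups)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```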
In the original execution, there was no drop operation before the expert layer: the two expert-parallel groups each performed an all-to-all independently. Both obtained the complete data, but ranks 0 and 1 received exactly the same data (and likewise ranks 2 and 3, and so on), because tensor-parallel ranks hold replicated activations.
Therefore, we can drop the replicated portion before the all-to-all, and then run an allgather (over the tensor-parallel group) after the all-to-all to recover the complete data.
After the expert layer executes, the data on ranks 0 and 1 is again identical, so we can drop it, run the all-to-all, and then allgather to recover the complete data.
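The drop/allgather pattern can be illustrated with a minimal pure-Python simulation (no torch.distributed; lists stand in for per-rank tensors, and the helper names are hypothetical, not DeepSpeed APIs). It shows that dropping before the all-to-all halves the payload each rank sends when model_degree = 2, while the allgather over the mp_group restores the full data:

```python
# Hypothetical simulation of the "drop -> all-to-all -> allgather"
# optimization for mp_group [0, 1] with model_degree = 2.

MP_DEGREE = 2  # tensor (model) parallel degree

def drop(data, mp_rank, mp_degree):
    # Keep only this rank's 1/mp_degree shard of the replicated data.
    shard = len(data) // mp_degree
    return data[mp_rank * shard:(mp_rank + 1) * shard]

def allgather(shards):
    # Concatenate the shards held by all ranks in the mp_group.
    out = []
    for s in shards:
        out.extend(s)
    return out

# Ranks 0 and 1 (one mp_group) start with identical token data,
# since tensor-parallel ranks hold replicated activations.
full = list(range(16))
per_rank = {0: list(full), 1: list(full)}

# Drop before the all-to-all: each rank now sends only half as much.
dropped = {rank: drop(per_rank[rank], mp_rank=i, mp_degree=MP_DEGREE)
           for i, rank in enumerate([0, 1])}
print(len(dropped[0]), len(full))  # 8 16 -> all-to-all volume halved

# (the all-to-all across the expert_parallel_group would happen here,
#  operating on the halved payload)

# Allgather within the mp_group restores the complete data.
restored = allgather([dropped[0], dropped[1]])
print(restored == full)  # True
```

The same pattern is applied a second time after the expert layer, since its output is again replicated across the tensor-parallel ranks.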