reduce all-to-all communication volume when both expert and non-expert are tensor-parallel #5626
base: master
Conversation
@siddharth9820, can you help review?
At a quick glance, this looks good to me. We used the same "optimized execution order" in DeepSpeed-TED (https://dl.acm.org/doi/pdf/10.1145/3577193.3593704), but I think that update got lost in an unmerged branch. Thank you @taozhiwei for implementing this!
@siddharth9820 @tjruwase Please help review again when you have free time. Thank you very much.
@taozhiwei still lgtm. Do you have some convergence curves for your changes?
I ran a test locally and it still converged.
Can you please post the loss curves for a model before and after your changes? If they are identical, then this PR should be good to go.
@siddharth9820 https://github.com/microsoft/DeepSpeed/actions/runs/9605259289/job/26492504473?pr=5626
@taozhiwei you can take a screenshot and paste it here. Or maybe you can upload it to a shared Google Drive location and share that with us?
Here is a comparison of the loss curves before and after the modification; they are consistent.
Thanks for doing this. LGTM. @tjruwase do we need any other tests?
@taozhiwei, thanks for the PR. This is really a great contribution. @siddharth9820, thanks for helping to review. Approved.
The first failed test was due to HTTP 429: https://github.com/microsoft/DeepSpeed/actions/runs/9698089296/job/26763816372?pr=5626.
@tjruwase or someone else working at DeepSpeed might be able to help you with CI.
Example: E + M + D parallel
world_size = 8
model_degree = 2
expert_degree = 4
mp_group = [0, 1], [2, 3], [4, 5], [6, 7]
expert_parallel_group = [0,2,4,6], [1,3,5,7]
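The group layout above follows directly from the degrees. As a hedged sketch (not DeepSpeed's actual group-creation code), the tensor-parallel and expert-parallel groups can be derived like this:

```python
# Hypothetical sketch: derive the mp and expert-parallel groups listed
# above from world_size and model_degree. Variable names are illustrative.
world_size = 8
model_degree = 2
expert_degree = world_size // model_degree  # 4

# Tensor (model) parallel groups: consecutive ranks.
mp_groups = [list(range(r, r + model_degree))
             for r in range(0, world_size, model_degree)]

# Expert parallel groups: strided ranks, one group per mp rank index.
ep_groups = [list(range(m, world_size, model_degree))
             for m in range(model_degree)]

print(mp_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(ep_groups)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```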
In the original execution, there was no drop operation before the expert layer: the two expert-parallel groups each performed an all-to-all independently. Both obtained the complete data, but ranks 0 and 1 received exactly the same data (and likewise ranks 2 and 3, and so on), because tensor-parallel ranks hold replicated activations.
Therefore, we can drop the replicated portion before the all-to-all, and then run an allgather (over the tensor-parallel group) after the all-to-all to recover the complete data.
After the expert layer executes, the data on ranks 0 and 1 is again identical, so we can drop it, run the all-to-all, and then allgather to recover the complete data.
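The drop/allgather pattern can be illustrated with a minimal pure-Python simulation (no torch.distributed; lists stand in for per-rank tensors, and the helper names are hypothetical, not DeepSpeed APIs). It shows that dropping before the all-to-all halves the payload each rank sends when model_degree = 2, while the allgather over the mp_group restores the full data:

```python
# Hypothetical simulation of the "drop -> all-to-all -> allgather"
# optimization for mp_group [0, 1] with model_degree = 2.

MP_DEGREE = 2  # tensor (model) parallel degree

def drop(data, mp_rank, mp_degree):
    # Keep only this rank's 1/mp_degree shard of the replicated data.
    shard = len(data) // mp_degree
    return data[mp_rank * shard:(mp_rank + 1) * shard]

def allgather(shards):
    # Concatenate the shards held by all ranks in the mp_group.
    out = []
    for s in shards:
        out.extend(s)
    return out

# Ranks 0 and 1 (one mp_group) start with identical token data,
# since tensor-parallel ranks hold replicated activations.
full = list(range(16))
per_rank = {0: list(full), 1: list(full)}

# Drop before the all-to-all: each rank now sends only half as much.
dropped = {rank: drop(per_rank[rank], mp_rank=i, mp_degree=MP_DEGREE)
           for i, rank in enumerate([0, 1])}
print(len(dropped[0]), len(full))  # 8 16 -> all-to-all volume halved

# (the all-to-all across the expert_parallel_group would happen here,
#  operating on the halved payload)

# Allgather within the mp_group restores the complete data.
restored = allgather([dropped[0], dropped[1]])
print(restored == full)  # True
```

The same pattern is applied a second time after the expert layer, since its output is again replicated across the tensor-parallel ranks.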