reduce cpu host overhead when using moe #5578
base: master
Conversation
@@ -366,7 +366,7 @@ def top2gating(logits: Tensor,
     combine_weights = combine1_sec + combine2_sec
     dispatch_mask = combine_weights.bool()

-    return l_aux, combine_weights, dispatch_mask, exp_counts.detach().to('cpu')
+    return l_aux, combine_weights, dispatch_mask, exp_counts
Currently `exp_counts` is unused in the `forward()` of any of the MoE classes, right?
Yes, I have tested it in Megatron-DeepSpeed and found that `exp_counts` is unused in the `forward()` of any of the MoE classes.
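For callers that do want the per-expert token counts on the host (e.g. for logging), a minimal sketch of fetching them on demand instead of on every forward pass; the helper name and logging format here are illustrative, not part of DeepSpeed:

```python
import torch

def log_expert_counts(exp_counts: torch.Tensor, step: int) -> None:
    # Move the counts to the host only when the values are actually needed,
    # so the forward path itself stays free of device-to-host synchronization.
    counts = exp_counts.detach().cpu().tolist()
    print(f"step {step}: tokens per expert = {counts}")
```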
I have tested these changes on my GPU test platform and they work fine: no errors, and the loss stays the same as with the original approach.
@ranzhejiang Thank you for your contribution! I have a few questions about your changes. Can you clarify them?
deepspeed/moe/sharded_moe.py (outdated)
@@ -322,7 +322,7 @@ def top2gating(logits: Tensor,
     l_aux = torch.mean(me * ce) * num_experts * num_experts

     # gating decisions
-    exp_counts = torch.sum(mask1 + mask2, dim=0)
+    exp_counts = torch.sum(mask1 + mask2, dim=0).detach().to(logits.device)
Can the device of `mask1` and `mask2` be different from `logits`?
From lines 296 to 301 we can see that the calculation of `mask1` depends on `logits`, and all torch operations keep the original device, so `mask1` and `logits` are on the same device. The same holds for `mask1` and `mask2` (lines 309 to 311).
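A minimal stand-alone sketch of this point (the gating math here is simplified relative to the real `top2gating`; the shapes and `num_experts` value are illustrative):

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
num_experts = 4

# logits live on the accelerator (when one is available)
logits = torch.randn(8, num_experts, device=device)

# mask1 is derived from logits via ordinary torch ops, so it inherits the device
indices1 = torch.argmax(logits, dim=1)
mask1 = F.one_hot(indices1, num_classes=num_experts)

assert mask1.device == logits.device  # torch ops keep the input tensor's device
```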
Force-pushed from e9e32f4 to d860d2c
Hi @tohtana, I have clarified the modifications you mentioned and retested this PR with Megatron-DeepSpeed on a GPU platform (8xA800). It runs well and the loss remains consistent with the original method. Could you please review it again? Thanks!
The operation `.to('cpu')` is not necessary for `exp_counts`, and it causes a device-to-host synchronization, which hurts performance.
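To illustrate the overhead being described, a rough host-side timing sketch (assumes a CUDA device is available; exact numbers depend on hardware and tensor sizes):

```python
import time
import torch

device = "cuda"
mask = torch.randint(0, 2, (16384, 64), device=device)

# Case 1: keep the result on the GPU -- the host only launches kernels.
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    exp_counts = torch.sum(mask, dim=0)
launch_only = time.perf_counter() - start

# Case 2: .to('cpu') forces the host to wait for the GPU to finish the kernel
# (a device-to-host synchronization) before each copy can complete.
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    exp_counts_cpu = torch.sum(mask, dim=0).to('cpu')
with_sync = time.perf_counter() - start

print(f"launch only: {launch_only:.4f}s, with .to('cpu'): {with_sync:.4f}s")
```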