
[CPU] Allow deepspeed.comm.inference_all_reduce in torch.compile graph #5604

Open
delock wants to merge 20 commits into master

Conversation

delock
Contributor

@delock delock commented Jun 3, 2024

This PR allows deepspeed.comm.inference_all_reduce() to enter the torch.compile graph even though it is implemented as a C++ kernel in DeepSpeed.

The previous implementation registered the inference_all_reduce() C++ kernel as a pybind11 function so it could be called from Python code. However, a pybind function is opaque to PyTorch, so the graph breaks whenever inference_all_reduce is called.

We address this by registering inference_all_reduce as a PyTorch custom op, torch.ops.deepspeed.inference_all_reduce, so it can be captured in the PyTorch graph.
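
For reference, here is a minimal hypothetical sketch of the same idea on the Python side. The PR itself registers the op from the C++ extension; the namespace schema and the stand-in CPU implementation below are assumptions for illustration, not DeepSpeed's actual code:

import torch
import torch.distributed as dist

# Hypothetical sketch: expose a kernel as a custom op under the "deepspeed" namespace.
# The real PR does this in C++; the schema and CPU fallback here are illustrative only.
lib = torch.library.Library("deepspeed", "DEF")
lib.define("inference_all_reduce(Tensor input) -> Tensor")

def _inference_all_reduce_cpu(input):
    # Stand-in implementation; the real op dispatches to DeepSpeed's optimized C++ kernel.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(input)
    return input

lib.impl("inference_all_reduce", _inference_all_reduce_cpu, "CPU")

# Once registered, the op is reachable as torch.ops.deepspeed.inference_all_reduce and
# torch.compile can keep it in the captured graph instead of breaking on a pybind call.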

The trace output from TorchInductor:

class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[5, 4]", primals_2: "f32[5]", primals_3: "f32[4, 4]"):
        # File: /home/gma/DeepSpeed/deepspeed/comm/torch.py:161 in inference_all_reduce, code: return torch.ops.deepspeed.inference_all_reduce_(tensor)
        inference_all_reduce: "f32[4, 4]" = torch.ops.deepspeed.inference_all_reduce.default(primals_3)

        # File: /home/gma/allreduce_graph/test_allreduce.py:33 in forward, code: return self.linear(input)
        permute: "f32[4, 5]" = torch.ops.aten.permute.default(primals_1, [1, 0]);  primals_1 = None
        addmm: "f32[4, 5]" = torch.ops.aten.addmm.default(primals_2, inference_all_reduce, permute);  primals_2 = permute = None

        # No stacktrace found for following nodes
        copy_: "f32[4, 4]" = torch.ops.aten.copy_.default(primals_3, inference_all_reduce);  primals_3 = None
        return [addmm, inference_all_reduce]
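
For context, a minimal sketch of the kind of model that produces a trace like the one above. The module and shapes below mirror the trace but are assumptions, not the actual test_allreduce.py; it also assumes a CPU DeepSpeed build with this PR and the CCL backend:

import torch
import deepspeed
import deepspeed.comm as dist

deepspeed.init_distributed("ccl")  # assumed backend for CPU tensor-parallel inference

class TPLinear(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 5)

    def forward(self, x):
        # With this PR, this call maps to torch.ops.deepspeed.inference_all_reduce,
        # so torch.compile keeps it in the graph instead of falling back to eager.
        dist.inference_all_reduce(x)
        return self.linear(x)

compiled = torch.compile(TPLinear())
print(compiled(torch.randn(4, 4)))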

Note that in this PR the CPU inference_all_reduce op does not handle the multi-node case or the FP16 data type. For FP16 support, we will align with the PyTorch CPU FP16 plan. For multi-node, we are still investigating upstreaming oneCCL integration into PyTorch, so that oneCCL can be used for multi-node tensor-parallel inference with PyTorch.

This PR is independent of #5571; they can work separately or together without issue.

@delock delock marked this pull request as draft June 3, 2024 08:28
@delock delock marked this pull request as ready for review June 6, 2024 05:30
@tjruwase tjruwase requested review from tohtana and umchand and removed request for arashb, awan-10 and mrwyattii June 21, 2024 22:23
@tohtana
Contributor

tohtana commented Jun 21, 2024

@delock Thank you for the great PR. I didn't know we could avoid some graph breaks by registering a C++ extension op as a torch operator. This approach will definitely be useful for many features in DeepSpeed.
Let's merge it after it passes all the tests.

@delock
Contributor Author

delock commented Jun 22, 2024

> @delock Thank you for the great PR. I didn't know we could avoid some graph breaks by registering a C++ extension op as a torch operator. This approach will definitely be useful for many features in DeepSpeed. Let's merge it after it passes all the tests.

Hi @tohtana, the formatting is fixed. The other error is an HF hub connection issue; it should pass on rerun.

@tohtana tohtana enabled auto-merge June 24, 2024 20:19