does FSDP support AMSP (a new DP shard strategy) #128706

Open
guoyejun opened this issue Jun 14, 2024 · 2 comments
Labels
oncall: distributed (add this issue/PR to distributed oncall triage queue), triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

guoyejun (Contributor) commented Jun 14, 2024

🚀 The feature, motivation and pitch

There's a new DP sharding strategy that is more flexible and general; see the details in https://arxiv.org/abs/2311.00257, "AMSP: Reducing Communication Overhead of ZeRO for Efficient LLM Training".

Does FSDP support a similar feature? If not, is there any plan to support it? Thanks.
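
For context, a minimal sketch of the sharding strategies FSDP exposes today through torch.distributed.fsdp.ShardingStrategy (the tiny model and the CUDA/process-group setup are placeholder assumptions, and the ZeRO mapping in the comments is only my rough understanding):

```python
# Minimal sketch: FSDP's current sharding strategies.
# Assumes torch.distributed is already initialized (e.g. via torchrun)
# and a GPU is available; MyModel is just a placeholder module.
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(1024, 1024)

    def forward(self, x):
        return self.net(x)

# Rough mapping of the available strategies:
# FULL_SHARD    ~ ZeRO-3: shard parameters, gradients, and optimizer states
# SHARD_GRAD_OP ~ ZeRO-2: shard gradients and optimizer states; parameters
#                         stay unsharded between forward and backward
# NO_SHARD      ~ DDP:    replicate everything
# HYBRID_SHARD  ~ HSDP:   FULL_SHARD within a node, replicate across nodes
model = FSDP(
    MyModel().cuda(),
    sharding_strategy=ShardingStrategy.FULL_SHARD,
)
```

Whichever strategy is picked applies to parameters, gradients, and optimizer states together, which is exactly the coupling AMSP relaxes.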

Alternatives

No response

Additional context

No response

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

awgu (Contributor) commented Jun 14, 2024

I do not think FSDP supports this currently. In my high-level understanding, the flexibility introduced in AMSP is mainly useful when doing microbatching / gradient accumulation?
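
For the gradient-accumulation angle, FSDP's existing no_sync() context is the relevant knob today; a minimal sketch, assuming model is already FSDP-wrapped and microbatches / loss_fn are placeholders:

```python
# Minimal sketch of gradient accumulation with FSDP's no_sync() context;
# model, optimizer, microbatches (a list of (input, target) pairs), and
# loss_fn are assumed to exist. Inside no_sync(), gradient communication
# is skipped and gradients accumulate locally; the reduce-scatter only
# happens on the last microbatch.
import contextlib

def accumulation_step(model, optimizer, microbatches, loss_fn):
    for i, (x, y) in enumerate(microbatches):
        last = i == len(microbatches) - 1
        ctx = contextlib.nullcontext() if last else model.no_sync()
        with ctx:
            loss_fn(model(x), y).backward()
    optimizer.step()
    optimizer.zero_grad()
```

With FULL_SHARD, no_sync() keeps unsharded gradients resident across microbatches, which is presumably where a decoupled gradient-sharding setting like AMSP's would help.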

@malfet added the oncall: distributed label Jun 14, 2024
@weifengpy added the triaged label Jun 17, 2024
guoyejun (Contributor, Author) commented:

My understanding is that the flexibility comes from decoupling the sharding strategies for the parameters, gradients, and optimizer states, which can each be different. By nature this covers many sharding strategies, including DDP, ZeRO-1, ZeRO-2, ZeRO-3, HSDP and MiCS, and many more. For a given cluster and a given model, we may then find a better sharding strategy, such as in Table IV of the paper (copied below).
[screenshot of Table IV from the AMSP paper]

Another point is that the sharding strategy is expressed in two dimensions, one for the number of nodes and one for the number of GPUs per node, which makes it clearer.

It is general because all of these sharding strategies can be obtained just by changing the configuration values. We could even loosen some of the constraints in the paper while keeping its key idea, if possible.
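
To make that concrete, a purely hypothetical configuration sketch (none of these names exist in FSDP; the shard degree of each state is written in the paper's two dimensions, nodes × GPUs per node):

```python
# Purely hypothetical sketch of an AMSP-style config: each state gets its
# own two-dimensional shard degree (nodes, gpus_per_node). None of these
# names exist in FSDP today; this only illustrates the decoupling.
from dataclasses import dataclass

@dataclass
class ShardDegree:
    nodes: int          # number of nodes the state is sharded across
    gpus_per_node: int  # number of GPUs per node it is sharded across

@dataclass
class AMSPStyleConfig:
    params: ShardDegree
    grads: ShardDegree
    optim_states: ShardDegree

# Example on a 4-node x 8-GPU cluster: shard parameters and gradients
# within a node but replicate them across nodes (HSDP-like), while
# sharding optimizer states across the whole cluster.
cfg = AMSPStyleConfig(
    params=ShardDegree(nodes=1, gpus_per_node=8),
    grads=ShardDegree(nodes=1, gpus_per_node=8),
    optim_states=ShardDegree(nodes=4, gpus_per_node=8),
)
```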
