How to understand this note: "note: Since Deepspeed-ZeRO can process multiple generate streams in parallel its throughput can be further divided by 8 or 16 ..." #99

Open
HuipengXu opened this issue Jun 14, 2023 · 1 comment

Comments

@HuipengXu

Why does the note say that only ds_zero currently runs world_size streams on world_size GPUs? Shouldn't accelerate and ds-inference be doing the same, since they also use multiprocessing?

@mayank31398
Collaborator

Hey, ds-inference is also running world_size streams.
However, accelerate only runs 1 stream, since we are just using the naive pipeline parallelism capability from accelerate.
A more efficient approach to pipeline parallelism would be to overlap micro-batches in the forward pass (no backward pass is needed for inference).

For example, check this image from the Megatron-LM paper. That schedule would be more efficient when serving. I think implementing it would require multiple processes, but you might still get better throughput using DS-inference.
[Screenshot: pipeline parallelism figure from the Megatron-LM paper]
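
To make the scheduling difference concrete, here is a minimal sketch of the idea (an illustration only, not code from this repo or from accelerate): `NUM_STAGES`, `NUM_MICROBATCHES`, and the one-tick-per-stage timing are invented for the example. With naive pipeline parallelism only one stage is busy at any time; overlapping micro-batches keeps every stage busy once the pipeline fills.

```python
# Hypothetical sizes for illustration only.
NUM_STAGES = 4        # pipeline stages (one per GPU)
NUM_MICROBATCHES = 8  # micro-batches in one request

def naive_schedule():
    """Naive pipeline parallelism: exactly one stage is busy per tick."""
    ticks = []
    for mb in range(NUM_MICROBATCHES):
        for stage in range(NUM_STAGES):
            ticks.append([(mb, stage)])        # one (micro-batch, stage) pair per tick
    return ticks

def overlapped_schedule():
    """Forward-only pipelining: at tick t, stage s works on micro-batch t - s."""
    ticks = []
    for t in range(NUM_MICROBATCHES + NUM_STAGES - 1):
        busy = [(t - s, s) for s in range(NUM_STAGES)
                if 0 <= t - s < NUM_MICROBATCHES]
        ticks.append(busy)                     # several stages busy in the same tick
    return ticks

if __name__ == "__main__":
    print("naive ticks:     ", len(naive_schedule()))       # 8 * 4 = 32
    print("overlapped ticks:", len(overlapped_schedule()))  # 8 + 4 - 1 = 11
```

Under these assumptions the overlapped schedule finishes in 11 ticks instead of 32, which is why forward-only micro-batch overlap helps throughput when serving.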

Also, if you are really interested in exploring model serving, I would suggest using text-gen-inference. It does dynamic batching and is much more efficient.
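
For readers unfamiliar with the term, here is a minimal sketch of the dynamic-batching idea (this is not text-generation-inference's actual implementation; the request lengths and batch cap are invented): finished sequences leave the batch and queued requests join between decoding steps, so the batch stays full instead of waiting for the slowest request.

```python
from collections import deque

def dynamic_batch_decode(requests, max_batch_size=4):
    """`requests` is a list of (request_id, tokens_to_generate) pairs."""
    queue = deque(requests)
    active = {}            # request_id -> tokens still to generate
    steps = 0
    while queue or active:
        # Admit queued requests whenever a slot in the batch is free.
        while queue and len(active) < max_batch_size:
            rid, remaining = queue.popleft()
            active[rid] = remaining
        # One decoding step generates one token for every active request.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # a finished request frees its slot immediately
        steps += 1
    return steps

if __name__ == "__main__":
    reqs = [("a", 5), ("b", 2), ("c", 7), ("d", 3), ("e", 4)]
    print("decode steps:", dynamic_batch_decode(reqs))  # 7 steps for these inputs
```

With static batching, the same five requests would take 7 steps for the first batch of four plus 4 more for the last request; admitting new work as slots free up removes that tail.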
