Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multi-device TPU Pods #1337

Open
r4victor opened this issue Jun 14, 2024 · 0 comments
Open

Support multi-device TPU Pods #1337

r4victor opened this issue Jun 14, 2024 · 0 comments

Comments

@r4victor
Copy link
Collaborator

#1323 added single-device TPU Pods support. Multi-device TPU Pods have not been supported because running multi-node tasks on them may require changes to dstack.

Currently, dstack runs different jobs of a multi-node task on different instances. To run multi-node tasks on TPU Pods, we can create an instance for each device in the Pod. The possible downside is that the Pod management UX will be suboptimal: users won't see TPU Pods in pools but all the TPU Pods devices as different instances. This can be mitigated by introducing a cluster concept to dstack.

Another solution would be to have one InstanceModel per TPU Pod but make it possible to run multiple jobs on such instance simultaneously. This will require no changes to the dstack interface but may lead to significant internal refactoring.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant