#1323 added support for single-device TPU Pods. Multi-device TPU Pods are not yet supported because running multi-node tasks on them may require changes to dstack.
Currently, dstack runs the jobs of a multi-node task on different instances. To run multi-node tasks on TPU Pods, we can create an instance for each device in the Pod. The possible downside is a suboptimal Pod management UX: users won't see a TPU Pod as a single entry in pools; instead, each of the Pod's devices will appear as a separate instance. This can be mitigated by introducing a cluster concept to dstack.
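To make the cluster idea concrete, here is a minimal sketch of how per-device instances could be grouped back into Pods for display. The `Instance` class, its `cluster` field, and `group_by_cluster` are all hypothetical names used for illustration; none of them are taken from dstack's codebase.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Instance:
    # Hypothetical, simplified stand-in for a pool instance; not dstack's actual model.
    name: str
    # Hypothetical field: the TPU Pod (cluster) this device belongs to, if any.
    cluster: Optional[str] = None


def group_by_cluster(instances: List[Instance]) -> Dict[str, List[str]]:
    """Collapse per-device instances into one entry per cluster for display."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for inst in instances:
        # Standalone instances form a single-member group keyed by their own name.
        groups[inst.cluster or inst.name].append(inst.name)
    return dict(groups)


pool = [
    Instance("pod-a-worker-0", cluster="pod-a"),
    Instance("pod-a-worker-1", cluster="pod-a"),
    Instance("gpu-box"),
]
pool_view = group_by_cluster(pool)
```

With this grouping, the pool listing could show `pod-a` as one entry with two members, while jobs are still scheduled one per instance as they are today.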
Another solution would be to have one InstanceModel per TPU Pod but make it possible to run multiple jobs on such an instance simultaneously. This would require no changes to the dstack interface but may lead to significant internal refactoring.
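A rough sketch of the second option, assuming one record per Pod that tracks which device each job occupies. `PodInstance`, `num_devices`, `jobs`, and `assign` are hypothetical and only illustrate the bookkeeping; the real InstanceModel is not shown here and may look quite different.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PodInstance:
    # Hypothetical stand-in for a single instance record covering a whole TPU Pod.
    name: str
    num_devices: int
    # Hypothetical bookkeeping: device index -> name of the job running on it.
    jobs: Dict[int, str] = field(default_factory=dict)

    def assign(self, job_name: str) -> Optional[int]:
        """Place a job on the first free device; return its index, or None if the Pod is full."""
        for device in range(self.num_devices):
            if device not in self.jobs:
                self.jobs[device] = job_name
                return device
        return None


pod = PodInstance("pod-a", num_devices=2)
# Three jobs compete for two devices; the third cannot be placed.
placements = [pod.assign(f"job-{i}") for i in range(3)]
```

The refactoring cost mentioned above comes from exactly this change: any code that assumes one job per instance would need to handle a per-device mapping instead.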