Problem
The only reliable way to know that a service is healthy is to test it, i.e. to perform a health check that does a minimal amount of the processing the service is supposed to do.
Since this is an end-to-end test, it's a good indicator that the service is really healthy and can take load.
If something is wrong in the network, the health check will fail. Standard restart policies that only trigger on container exit do not account for this kind of failure.
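To make the idea concrete, here is a minimal sketch of such an end-to-end check. It assumes the service exposes an HTTP endpoint that does a small piece of real work. The function name `check_health`, the URL, and the timeout are illustrative assumptions, not part of dstack.

```python
import http.client
import urllib.error
import urllib.request


def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the service answers a real request in time.

    Assumption: `url` points at an endpoint that performs a minimal
    version of the service's actual task (hypothetical endpoint).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # Any 2xx status counts as healthy in this sketch.
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Network errors, timeouts, non-2xx responses, and malformed
        # URLs all count as a failed check.
        return False
```

Because the request travels the same path as real traffic, a network problem makes the check fail even though the container itself never exits.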
Solution
Because of this, it would be extremely helpful to have health check support in dstack, along with configuration options for how to react to changes in health status.
To react, it would be helpful to have a config option for how many failed checks mark a replica as unhealthy, for example 3 failed checks.
Then one reaction could be to try to restart the pod.
Another reaction could be to remove the pod and replace it with a new one.
Basically, the idea is to always ensure that the configured number of replicas is really healthy.
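The threshold and reaction logic described above could look roughly like this. It is only a sketch: `HealthPolicy`, `failure_threshold`, `on_unhealthy`, and `ReplicaHealth` are hypothetical names, not existing dstack options, and the sketch assumes that failures are counted consecutively and that a successful check resets the count.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"        # replica is fine or below the threshold
    RESTART = "restart"  # try to restart the pod
    REPLACE = "replace"  # remove the pod and start a new one


@dataclass
class HealthPolicy:
    failure_threshold: int = 3            # hypothetical config option
    on_unhealthy: Action = Action.RESTART  # hypothetical config option


class ReplicaHealth:
    """Tracks health check results for a single replica."""

    def __init__(self, policy: HealthPolicy) -> None:
        self.policy = policy
        self.failures = 0

    def record(self, ok: bool) -> Action:
        """Record one check result and return the reaction to take."""
        if ok:
            self.failures = 0  # assumption: a success resets the count
            return Action.NONE
        self.failures += 1
        if self.failures >= self.policy.failure_threshold:
            self.failures = 0  # start counting again after reacting
            return self.policy.on_unhealthy
        return Action.NONE
```

A controller would call `record` after every check of every replica and carry out the returned action, so the number of healthy replicas converges back to the configured count.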
Workaround
https://github.com/devidw/gingo was created to perform health checks and then restart, add, or remove pods based on the health status of the pods in a configured cluster.
It can be extended by writing other connectors; currently it only has a RunPod one.
Would you like to help us implement this feature by sending a PR?
No