chore: deprecating container_runtime config, AgentRM supporting singularity, podman, and apptainer #9516
base: main
Conversation
agent/internal/agent.go
Outdated
}()
docker := docker.NewClient(dcl)

// a.log.Tracef("setting up %s runtime", a.opts.ContainerRuntime)
Commented out a few parts of the code instead of removing them completely, since we are unclear whether we want to deprecate the whole of container_runtime (the configuration option/variable) or just the overloaded use of "container runtime" (container "providers", e.g. docker, podman, apptainer).
I will make corrections as needed in future commits.
I think the only valid option for container_runtime should be docker.
So if I had a config file that specified container_runtime: docker, I would expect that to not error after this change.
For every other container runtime, I would expect the agent to return an error message saying it can't run.
agent/cmd/determined-agent/init.go
Outdated
@@ -152,6 +152,6 @@ func registerAgentConfig() {
registerInt(flags, name("agent-reconnect-backoff"), defaults.AgentReconnectBackoff,
	"Time between agent reconnect attempts")

registerString(flags, name("container-runtime"), defaults.ContainerRuntime,
	"The container runtime to use")
registerString(flags, name("docker-container-runtime"), defaults.DockerContainerRuntime,
What is this change?
The change is to define the config for the /etc/determined/agent.yaml file.
My changes here will be undone after your suggestions below.
)

// ContainerRuntime is our interface for interacting with runtimes like Docker or Singularity.
type ContainerRuntime interface {
I kind of like the interface. I think it's a lot harder to go the other way (that is, to put code behind a clean interface when it isn't) than it is to go from a clean interface to not having one.
So I think my preference would be to just leave this in here for now.
@@ -152,6 +152,6 @@ func registerAgentConfig() {
registerInt(flags, name("agent-reconnect-backoff"), defaults.AgentReconnectBackoff,
I think we are going to need a lot of changes to the circleci file to no longer run agent tests
dd9734f to cd0e626 (compare)
agent/internal/options/options.go
Outdated
// requires root or a suid installation with /etc/subuid --fakeroot.
AllowNetworkCreation bool `json:"allow_network_creation"`
}

// PodmanOptions configures how we interact with podman.
type PodmanOptions struct {
I am currently leaving this as is; I would remove it if it's not needed in the next subtask.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #9516 +/- ##
==========================================
+ Coverage 49.28% 49.45% +0.16%
==========================================
Files 1242 1240 -2
Lines 161444 160860 -584
Branches 2867 2867
==========================================
- Hits 79568 79548 -20
+ Misses 81704 81140 -564
Partials 172 172
Flags with carried forward coverage won't be shown.
agent/internal/agent.go
Outdated
@@ -124,29 +122,14 @@ func (a *Agent) run(ctx context.Context) error {
var cruntime container.ContainerRuntime
switch a.opts.ContainerRuntime {
case options.PodmanContainerRuntime:
I think we also want to remove this switch statement; no one should even be using the configuration option.
I think we should simply set cruntime to options.DockerContainerRuntime (or whatever makes sense in the code). We can check if a.opts.ContainerRuntime has a value and is not docker if we want to print a generic warning ... but I think we can simplify this section of the code.
Yes, earlier I removed the switch statement altogether and had only Docker configs, but @NicholasBlaskey suggested we throw an error here.
I agree with both cases; maybe now we can return an error, and later just have a default case which handles it with a generic error message.
This configuration option shouldn't be known to users; we do not want them using it even if they have discovered it.
It looks like docker is the default for the container runtime configuration. I think this because there's no default clause in the original switch statement, but you should confirm this is true.
I think we can have a much simpler check. In pseudo-code, something like:
if a.opts.ContainerRuntime != "" && a.opts.ContainerRuntime != "docker" {
    // print a warning, then error and exit
}
// continue to set up the ContainerRuntime interface using Docker
Let me know if this is still confusing and we can go over the code together! (I might be able to explain it better in a call :)
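Fleshing out that pseudo-code, a minimal sketch of the check replacing the switch statement might look like the following. The constant name dockerRuntime and the function name are assumptions for illustration; the real constants live in agent/internal/options.

```go
package main

import "fmt"

// dockerRuntime stands in for the real constant in agent/internal/options.
const dockerRuntime = "docker"

// checkContainerRuntime mirrors the pseudo-code above: an empty value falls
// back to docker (the default), an explicit "docker" is accepted for backward
// compatibility, and any other runtime is rejected with an error rather than
// being silently ignored.
func checkContainerRuntime(runtime string) error {
	if runtime != "" && runtime != dockerRuntime {
		return fmt.Errorf(
			"container runtime %q is no longer supported by AgentRM; only docker is available",
			runtime)
	}
	return nil
}

func main() {
	for _, r := range []string{"", "docker", "podman"} {
		fmt.Printf("%q -> %v\n", r, checkContainerRuntime(r))
	}
}
```

With this in place, the agent would unconditionally construct the Docker-backed runtime after the check succeeds, and the switch over providers disappears.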
agent/internal/options/options.go
Outdated
SingularityOptions SingularityOptions `json:"singularity_options"`
PodmanOptions      PodmanOptions      `json:"podman_options"`
ContainerRuntime   string             `json:"container_runtime"`
PodmanOptions      PodmanOptions      `json:"podman_options"`
We don't want users to be using these configuration options. I'm ok with removing them in the next set of work though if that's what you prefer (I saw one of your comments).
I was thinking of keeping the container_runtime configuration as is, but will try PodmanOptions in later iterations.
I see Nick likes the ContainerRuntime interface and wants to keep it. I think that makes a lot of sense.
I'm saying we don't need to keep the user-facing container_runtime configuration option. Either way, yes, we should definitely remove PodmanOptions in the next PR if it's not done in this one.
I see what you mean here. That's possible and makes sense.
I think there may be some confusion since there are multiple ContainerRuntimes in the code.
Looking at this snippet ...
var cruntime container.ContainerRuntime // ContainerRuntime is an interface, defined in agent/internal/container_runtime.go
switch a.opts.ContainerRuntime { // ContainerRuntime is a string, populated by user configuration
The interface is good to keep. The configuration option one is up for debate.
@@ -2840,7 +2840,7 @@ jobs:
PROJECT=$(terraform -chdir=tools/slurm/terraform output --raw project)
gcloud compute scp agent/build/determined-agent "$INSTANCE_NAME":~ --zone $ZONE
gcloud compute ssh --zone "$ZONE" "$INSTANCE_NAME" --project "$PROJECT" -- \
srun determined-agent --master-host=<<parameters.master-host>> --master-port=<<parameters.master-port>> --resource-pool=default --container-runtime=<<parameters.container-run-type>>
srun determined-agent --master-host=<<parameters.master-host>> --master-port=<<parameters.master-port>> --resource-pool=default
We can delete the agent-use parameter and all steps that use it since it should be unused.
So I would completely delete this when block and the when block below it.
Yes, I understand your comment here and will update it.
We would need to remove the test-e2e-slurm case as well, right?
jobs:
test-e2e-slurm:
parameters:
- when:
condition:
equal: ["-A", <<parameters.agent-use>>]
@@ -151,7 +151,4 @@ func registerAgentConfig() {
	"Max attempts agent has to reconnect")
registerInt(flags, name("agent-reconnect-backoff"), defaults.AgentReconnectBackoff,
	"Time between agent reconnect attempts")

registerString(flags, name("container-runtime"), defaults.ContainerRuntime,
I think we should leave this flag in but only allow docker as the container-runtime.
If for some reason someone has determined-agent --container-runtime=docker, I don't think we should break that use case.
@@ -81,10 +80,6 @@ type Options struct {
// master config.
AgentReconnectBackoff int `json:"agent_reconnect_backoff"`

ContainerRuntime string `json:"container_runtime"`
Same comment about leaving this option in but checking that it is always docker.
So if someone has container_runtime: docker in their agent config, I think we shouldn't break them.
According to our previous iterations on the PR, I thought we agreed on removing the ContainerRuntime config altogether. I see what you are saying here and agree; earlier I assumed that because this config was never documented, it was not actually used by any customers.
@kkunapuli - Do you agree?
@ShreyaLnuHpe Yes, that was my understanding from talking with Bradley (that the container_runtime configuration option was never documented).
As you've dug into the task more thoroughly, it seems you've uncovered an additional use case that was documented: using agents with slurm. I'm ok with leaving the configuration option in, with non-Docker values returning an error. I thought that extra config options would be ignored, not result in errors.
@NicholasBlaskey @kkunapuli
Shall we now document the container_runtime config, mentioning that it only accepts the docker value in the case of AgentRM?
No, I think we don't document it. We won't block someone from having container_runtime: docker, but we don't want anyone adding it either.
@@ -76,7 +76,7 @@ while [[ $# -gt 0 ]]; do
echo '    -A '
I think the -A flag is the one we want to remove.
Do we need any code changes to the scripts here to no longer support -A, beyond changing the usage description?
@@ -76,7 +76,7 @@ while [[ $# -gt 0 ]]; do
echo '    -A '
echo "        Description: Invokes a slurmcluster that uses agents instead of the launcher."
echo "        Example: $0 -A"
echo '    -c {enroot|podman|singularity}'
echo '    -c {docker}'
I think enroot|podman|singularity are the correct options, since we don't support docker with the launcher and we will still support the other runtimes for the launcher.
@@ -120,47 +118,18 @@ func (a *Agent) run(ctx context.Context) error {
	return fmt.Errorf("failed to detect devices: %v", devices)
}

a.log.Tracef("setting up %s runtime", a.opts.ContainerRuntime)
do we have a release note?
I will add a release note, thanks
file. Below is a minimal example using a resource pool named for the user (``$USER``) and
``singularity`` as the container runtime platform. If configured using variables such as ``$HOME``,
a single ``agent.yaml`` could be shared by all users.
file. Below is a minimal example using a resource pool named for the user (``$USER``). ``docker``
I don't think this will work.
I think in slurm environments it is very likely that the launched jobs won't have access to docker, so the agent method won't be supported.
I think this whole hpc-with-agent workload is no longer going to be supported, so I think we can remove all these docs.
Would this entail removing the page https://docs.determined.ai/latest/setup-cluster/slurm/hpc-with-agent.html?
If so, please create a redirect. There is a guide here on using the docs redirect tool: https://hpe-aiatscale.atlassian.net/wiki/spaces/DOC/pages/1338310660/How+to+Use+the+Docs+Redirect+Tool
@@ -5,7 +5,7 @@
1. Install Terraform following [these instructions](https://developer.hashicorp.com/terraform/downloads).
2. Download the [GCP CLI](https://cloud.google.com/sdk/docs/install-sdk) and run `gcloud auth application-default login` to get credentials.
3. Run `make slurmcluster` from the root of the repo and wait (up to 10 minutes) for it to start.
- To specify which container runtime environment to use, pass in `FLAGS="-c {container_run_type}"` to `make slurmcluster`. Choose from either `singularity` (default), `podman`, or `enroot`.
- To specify which container runtime environment to use, pass in `FLAGS="-c {container_run_type}"` to `make slurmcluster`. Choose from either `singularity` (default), `podman`, or `enroot`. If utilizing Slurmcluster with Determined Agents, the `docker` container runtime environment is the sole available option.
I don't think we support Slurmcluster with docker
For clarity, the intent is to no longer give customers the option to use slurm with determined agents.
At least, that's what I understood from slack thread: https://hpe-aiatscale.slack.com/archives/C06GMG83ZE0/p1718224380888699
@@ -81,10 +80,6 @@ type Options struct {
// master config.
AgentReconnectBackoff int `json:"agent_reconnect_backoff"`

ContainerRuntime string `json:"container_runtime"`
What about image_root?
https://docs.determined.ai/latest/reference/deploy/agent-config-reference.html#image-root
Does this have any effect on Docker?
viewed hpc-with-agent
I'm OK with the infra-relevant portion of this PR, pending the stuff Nick suggested.
Ticket
RM-351
Description
Deprecate all container runtimes except for Docker for AgentRM
Potential Impact: AgentRM can only run with the Docker container runtime; it can no longer run with Apptainer/Singularity or Podman.
Test Plan
Pass CircleCI
Checklist
docs/release-notes/ — See Release Note for details.