Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: deprecating container_runtime config, AgentRM supporting singularity,podman, and apptainer #9516

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

ShreyaLnuHpe
Copy link
Contributor

@ShreyaLnuHpe ShreyaLnuHpe commented Jun 13, 2024

Ticket

RM-351

Description

Deprecate all container runtimes except for Docker for AgentRM

Potential Impact: Can only run AgentRM with Docker container runtime. Can no longer run with Apptainer/Singularity or podman

Test Plan

Pass CircleCI

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@ShreyaLnuHpe ShreyaLnuHpe requested a review from a team as a code owner June 13, 2024 18:03
@cla-bot cla-bot bot added the cla-signed label Jun 13, 2024
Copy link

netlify bot commented Jun 13, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit b012c8e
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/6671fddcca3a530008dd96ca

}()
docker := docker.NewClient(dcl)

// a.log.Tracef("setting up %s runtime", a.opts.ContainerRuntime)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented a few parts of code instead of completely removing it as we are unclear if we want to deprecate whole of container_runtime (configuration option/variable) or just overloaded use of "container runtime" (container "providers", e.g. docker, podman, apptainer).

I will make corrections as needed in future commits

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only valid option for container_runtime should be docker

so if I had a config file that specified container_runtime: docker I would expect that to not error after this change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

every other container runtime I would expect the agent to return an error message saying it can't run

@ShreyaLnuHpe ShreyaLnuHpe changed the title chore: deprecating container_runtime chore: draft deprecating container_runtime Jun 13, 2024
@@ -152,6 +152,6 @@ func registerAgentConfig() {
registerInt(flags, name("agent-reconnect-backoff"), defaults.AgentReconnectBackoff,
"Time between agent reconnect attempts")

registerString(flags, name("container-runtime"), defaults.ContainerRuntime,
"The container runtime to use")
registerString(flags, name("docker-container-runtime"), defaults.DockerContainerRuntime,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change is to make config definition for /etc/determined/agent.yaml file.
My changes will be undone here after your suggestions below.

}()
docker := docker.NewClient(dcl)

// a.log.Tracef("setting up %s runtime", a.opts.ContainerRuntime)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only valid option for container_runtime should be docker

so if I had a config file that specified container_runtime: docker I would expect that to not error after this change

}()
docker := docker.NewClient(dcl)

// a.log.Tracef("setting up %s runtime", a.opts.ContainerRuntime)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

every other container runtime I would expect the agent to return an error message saying it can't run

)

// ContainerRuntime is our interface for interacting with runtimes like Docker or Singularity.
type ContainerRuntime interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kinda like the interface. I think its a lot harder to go the other way (that is put code behind a clean interface when it isn't) than it is to go from a clean interface to not having one.

So I think my preference would be to just leave this in here for now.

@@ -152,6 +152,6 @@ func registerAgentConfig() {
registerInt(flags, name("agent-reconnect-backoff"), defaults.AgentReconnectBackoff,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are going to need a lot of changes to the circleci file to no longer run agent tests

@ShreyaLnuHpe ShreyaLnuHpe force-pushed the shreya/deprecateContainerRuntime branch from dd9734f to cd0e626 Compare June 14, 2024 19:26
// requires root or a suid installation with /etc/subuid --fakeroot.
AllowNetworkCreation bool `json:"allow_network_creation"`
}

// PodmanOptions configures how we interact with podman.
type PodmanOptions struct {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am currently leaving this out. Would remove it if not needed in the next subtask

Copy link

codecov bot commented Jun 14, 2024

Codecov Report

Attention: Patch coverage is 0% with 9 lines in your changes missing coverage. Please review.

Project coverage is 49.45%. Comparing base (2bce8b6) to head (b012c8e).
Report is 17 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9516      +/-   ##
==========================================
+ Coverage   49.28%   49.45%   +0.16%     
==========================================
  Files        1242     1240       -2     
  Lines      161444   160860     -584     
  Branches     2867     2867              
==========================================
- Hits        79568    79548      -20     
+ Misses      81704    81140     -564     
  Partials      172      172              
Flag Coverage Δ
backend 44.36% <0.00%> (+0.46%) ⬆️
harness 63.81% <ø> (ø)
web 44.87% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
agent/cmd/determined-agent/init.go 100.00% <ø> (ø)
agent/internal/options/options.go 27.77% <ø> (-1.32%) ⬇️
agent/internal/agent.go 0.00% <0.00%> (ø)

... and 2 files with indirect coverage changes

@ShreyaLnuHpe ShreyaLnuHpe changed the title chore: draft deprecating container_runtime chore: deprecating container_runtime Jun 14, 2024
@@ -124,29 +122,14 @@ func (a *Agent) run(ctx context.Context) error {
var cruntime container.ContainerRuntime
switch a.opts.ContainerRuntime {
case options.PodmanContainerRuntime:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also want to remove this switch statement; no one should even be using the configuration option.

I think we should simply set cruntime to options.DockerContainerRuntime (or whatever makes sense in the code). We can check if a.opts.ContainerRuntime has a value and is not docker if we want to print a generic warning ... but I think we can simplify this section of the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, earlier i removed the switch statement all together and had only Docker configs, but @NicholasBlaskey suggested we throw error in here

I agree to both cases, maybe now we can return an error, and later just have a case default which handles with a generic error message

Copy link
Contributor

@kkunapuli kkunapuli Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This configuration option shouldn't be known to users; we do not want them using it even if they have discovered it.

It looks like docker is the default for container runtime configuration. I think this because there's no default clause in the original switch statement, but you should confirm this is true.

I think we can have a much simpler check. In pseudo-code, something like:

if a.opts.ContainerRuntime != "" and a.opts.ContainerRuntime != "Docker" {
  // print warning
  // error and exit
}

// continue to set up ContainerRuntime interface using Docker

Let me know if this is still confusing and we can go over the code together! (I might be able to explain it better in a call :)

SingularityOptions SingularityOptions `json:"singularity_options"`
PodmanOptions PodmanOptions `json:"podman_options"`
ContainerRuntime string `json:"container_runtime"`
PodmanOptions PodmanOptions `json:"podman_options"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want users to be using these configuration options. I'm ok with removing them in the next set of work though if that's what you prefer (I saw one of your comments).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of having contianer_runtime configuration as is, but will try PodmanOption in later iterations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see Nick likes the ContainerRuntime interface and wants to keep it. I think that makes a lot of sense.

I'm saying, we don't need to keep the user-facing container_runtime configuration option. Either way, yes we should definitely remove PodmanOptions in the next PR if it's not done in this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean here. That's possible and makes sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there may be some confusion since there's multiple ContainerRuntimes in the code.

looking at this snippet ....

	var cruntime container.ContainerRuntime // ContainerRuntime is an interface, defined in agent/internal/container_runtime.go
	switch a.opts.ContainerRuntime { // ContainerRuntime is a string, populated by user configuration

The interface is good to keep. The configuration option one is up for debate.

@ShreyaLnuHpe ShreyaLnuHpe requested a review from a team as a code owner June 18, 2024 19:13
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Jun 18, 2024
@determined-ci determined-ci requested a review from a team June 18, 2024 19:13
@ShreyaLnuHpe ShreyaLnuHpe changed the title chore: deprecating container_runtime chore: deprecating container_runtime config, AgentRM supporting singularity,podman, and apptainer Jun 18, 2024
@@ -2840,7 +2840,7 @@ jobs:
PROJECT=$(terraform -chdir=tools/slurm/terraform output --raw project)
gcloud compute scp agent/build/determined-agent "$INSTANCE_NAME":~ --zone $ZONE
gcloud compute ssh --zone "$ZONE" "$INSTANCE_NAME" --project "$PROJECT" -- \
srun determined-agent --master-host=<<parameters.master-host>> --master-port=<<parameters.master-port>> --resource-pool=default --container-runtime=<<parameters.container-run-type>>
srun determined-agent --master-host=<<parameters.master-host>> --master-port=<<parameters.master-port>> --resource-pool=default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can delete the agent-use parameter and all steps that use it since it should be unused

So I would completely delete this when block and the when block below this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I understand your comment here and will update it.
We would need to remove the test-e2e-slurm case as well, right?

jobs:
  test-e2e-slurm:
      parameters:
        - when:
                  condition:
                    equal: ["-A", <<parameters.agent-use>>]

@@ -151,7 +151,4 @@ func registerAgentConfig() {
"Max attempts agent has to reconnect")
registerInt(flags, name("agent-reconnect-backoff"), defaults.AgentReconnectBackoff,
"Time between agent reconnect attempts")

registerString(flags, name("container-runtime"), defaults.ContainerRuntime,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should leave this flag in but only allow docker as the container-runtime

If for some reason someone has determined-agent --container-runtime=docker I don't think we should break that use case

@@ -81,10 +80,6 @@ type Options struct {
// master config.
AgentReconnectBackoff int `json:"agent_reconnect_backoff"`

ContainerRuntime string `json:"container_runtime"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about leaving this option in but checking that is it always docker

So if someone has container_runtime: docker in their agent config I think we shouldn't break them

Copy link
Contributor Author

@ShreyaLnuHpe ShreyaLnuHpe Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to our previous iterations in the PR, I thought we agreed on having to remove ContainerRuntime config all-together. I see what you are saying here and agree, whereas earlier I assumed because this config was never documented, it was not actually used by any customers.

@kkunapuli - Do you agree to the same?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShreyaLnuHpe Yes, that was my understanding from talking with Bradley (that container_runtime configuration option was never documented).

As you've dug into the task more thoroughly, it seems you've uncovered an additional use case that was documented: using agents with slurm. I'm ok with leaving the configuration option in, with non-Docker values returning an error. I though that extra config options would be ignored, not result in errors.

Copy link
Contributor Author

@ShreyaLnuHpe ShreyaLnuHpe Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NicholasBlaskey @kkunapuli
Shall we now have a documentation of container_runtime config, and mentioning that it only intakes docker value in case of agentRM?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think don't document it. We won't block someone from having container_runtime: docker but we don't want anyone adding it either.

@@ -76,7 +76,7 @@ while [[ $# -gt 0 ]]; do
echo ' -A '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the -A flag is the one we want to remove

Do we need any code changes to the scripts here to no longer support -A beyond changing the usage description?

@@ -76,7 +76,7 @@ while [[ $# -gt 0 ]]; do
echo ' -A '
echo " Description: Invokes a slurmcluster that uses agents instead of the launcher."
echo " Example: $0 -A"
echo ' -c {enroot|podman|singularity}'
echo ' -c {docker}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think enroot|podman|singularity are the correct options, since we don't support docker with launcher and we still will support the other runtimes for launcher

@@ -120,47 +118,18 @@ func (a *Agent) run(ctx context.Context) error {
return fmt.Errorf("failed to detect devices: %v", devices)
}

a.log.Tracef("setting up %s runtime", a.opts.ContainerRuntime)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have a release note?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a release note, thanks

file. Below is a minimal example using a resource pool named for the user (``$USER``) and
``singularity`` as the container runtime platform. If configured using variables such as ``$HOME``,
a single ``agent.yaml`` could be shared by all users.
file. Below is a minimal example using a resource pool named for the user (``$USER``). ``docker``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will work

I think in slurm environments it is very likely that the launched jobs won't have access to docker so the agent method won't be supported

I think this whole hpc-with-agent workload is no longer going to be supported, so I think we can remove all these docs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this entail removing the page?: https://docs.determined.ai/latest/setup-cluster/slurm/hpc-with-agent.html

if so, please create a redirect. there is a guide here on using the docs redirect tool https://hpe-aiatscale.atlassian.net/wiki/spaces/DOC/pages/1338310660/How+to+Use+the+Docs+Redirect+Tool

@@ -5,7 +5,7 @@
1. Install Terraform following [these instructions](https://developer.hashicorp.com/terraform/downloads).
2. Download the [GCP CLI](https://cloud.google.com/sdk/docs/install-sdk) and run `gcloud auth application-default login` to get credentials.
3. Run `make slurmcluster` from the root of the repo and wait (up to 10 minutes) for it to start.
- To specify which container runtime environment to use, pass in `FLAGS="-c {container_run_type}"` to `make slurmcluster`. Choose from either `singularity` (default), `podman`, or `enroot`.
- To specify which container runtime environment to use, pass in `FLAGS="-c {container_run_type}"` to `make slurmcluster`. Choose from either `singularity` (default), `podman`, or `enroot`. If utilizing Slurmcluster with Determined Agents, `docker` container runtime environment is the sole available option.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we support Slurmcluster with docker

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity, the intent is to no longer give customers the option to use slurm with determined agents.

At least, that's what I understood from slack thread: https://hpe-aiatscale.slack.com/archives/C06GMG83ZE0/p1718224380888699

@@ -81,10 +80,6 @@ type Options struct {
// master config.
AgentReconnectBackoff int `json:"agent_reconnect_backoff"`

ContainerRuntime string `json:"container_runtime"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about image_root?

https://docs.determined.ai/latest/reference/deploy/agent-config-reference.html#image-root

docs this have any affect on Docker?

Copy link
Member

@tara-det-ai tara-det-ai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

viewed hpc-with-agent

Copy link
Member

@dannysauer dannysauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with the infra-relevant portion of this PR, pending the stuff Nick suggested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants