Job specific environment variables can't be set in Hydra multi-run #9449

Open
domenVres opened this issue Jun 12, 2024 · 2 comments
@domenVres

Is your feature request related to a problem? Please describe.

Hydra offers a way to set environment variables that are specific to each job (the hydra.job.env_set section of the Hydra documentation). However, when I added this to the configuration for my multi-run hyperparameter search, the environment variable was not changed. I suspect the reason is that NeMo uses a custom launcher through which Hydra executes the job, and I did not find these environment variables being handled anywhere in it.

Describe the solution you'd like

I would like to be able to set job-specific environment variables in the way the Hydra documentation describes. This means I should be able to add the hydra.job.env_set field to the Hydra config for multi-runs (screenshot below).

[Screenshot 2024-06-12: multi-run Hydra config showing the hydra.job.env_set field]
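For reference, a minimal sketch of the kind of configuration I would expect to work, based on the hydra.job.env_set syntax from the Hydra documentation (the concrete port value is only illustrative):

    # config.yaml (sketch; assumes the documented hydra.job.env_set behavior)
    hydra:
      job:
        env_set:
          MASTER_PORT: "29501"  # environment variable each launched job should see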

Describe alternatives you've considered

The only alternative I have found so far that works is to manually set the environment variables inside the training script (each job is a training script). However, inside these scripts I don't have access to the job ID, so I had to infer it from the hyperparameter values; a sketch of this workaround is shown below. This is far from ideal, and a config-level solution would be much nicer.
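For illustration, the workaround currently looks roughly like the sketch below; the hash-based port derivation and the set_master_port helper are my own improvisation, not part of NeMo or Hydra:

    # Inside the training script (sketch): derive a job-specific port from the
    # hyperparameter values, since the Hydra job ID is not available here.
    import hashlib
    import os

    def set_master_port(hparams: dict) -> None:
        # Hash the hyperparameter values to get a stable, job-specific port.
        digest = hashlib.md5(repr(sorted(hparams.items())).encode()).hexdigest()
        os.environ["MASTER_PORT"] = str(20000 + int(digest, 16) % 10000)

    # e.g. set_master_port({"lr": 1e-4, "batch_size": 32}) before the trainer is built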

Additional context

The main reason I ran into trouble with environment variables is a problem with ports when training multiple models at once. Every job's distributed process is initialized on the same master port, and consequently the following error occurs:

torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:53394 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:53394 (errno: 98 - Address already in use).

The way to avoid this error is to set the MASTER_PORT environment variable to a different value for each job; without that, running multiple training processes in parallel is impossible due to port collisions. A minimal illustration of how the collision arises is sketched below.
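To make the failure mode concrete, here is a small sketch (my own reproduction, not NeMo code) of why two parallel jobs that see the same MASTER_PORT collide:

    # Sketch: torch.distributed reads MASTER_ADDR/MASTER_PORT for the default
    # env:// rendezvous. If two independent jobs use the same port, the second
    # bind fails with "Address already in use" (errno 98).
    import os
    import torch.distributed as dist

    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "53394")  # identical in every job -> collision
    dist.init_process_group(backend="gloo", rank=0, world_size=1)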

@titu1994
Collaborator

We don't use the process launcher unless you use a Hydra sweep config. Can you try removing that? If not, we'll have to see how to implement your request. Of course, if you have a solution, you're encouraged to send a PR.

@domenVres
Author

Is there a way to perform a hyperparameter search without the Hydra sweep?
