Job specific environment variables can't be set in Hydra multi-run #9449

Open
domenVres opened this issue Jun 12, 2024 · 2 comments
@domenVres

Is your feature request related to a problem? Please describe.

Hydra offers a way to set environment variables that are specific to each job (the hydra.job.env_set section of the Hydra documentation). However, when I added this to the configuration for my multi-run hyperparameter search, the environment variable was not changed. I suspect the reason is that NeMo uses a custom launcher through which Hydra executes the job, and I did not find these environment variables being handled anywhere in it.

Describe the solution you'd like

I would like to be able to set job-specific environment variables in the way the Hydra documentation describes. This means I should be able to add the hydra.job.env_set field to the Hydra config for multi-runs (screenshot below).

[Screenshot 2024-06-12: multi-run Hydra config showing the hydra.job.env_set field]
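For reference, a minimal sketch of the kind of configuration I would expect to work, based on the hydra.job.env_set syntax from the Hydra documentation (the concrete port value is only illustrative):

    # config.yaml (sketch; assumes the documented hydra.job.env_set behavior)
    hydra:
      job:
        env_set:
          MASTER_PORT: "29501"  # environment variable each launched job should see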

Describe alternatives you've considered

The only alternative I have found so far that works is to manually set the environment variables inside the training script (each job is a training script). However, inside these scripts I don't have access to the job ID, so I had to infer it from the hyperparameter values; a sketch of this workaround is shown below. This is far from ideal, and a config-level solution would be much nicer.
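For illustration, the workaround currently looks roughly like the sketch below; the hash-based port derivation and the set_master_port helper are my own improvisation, not part of NeMo or Hydra:

    # Inside the training script (sketch): derive a job-specific port from the
    # hyperparameter values, since the Hydra job ID is not available here.
    import hashlib
    import os

    def set_master_port(hparams: dict) -> None:
        # Hash the hyperparameter values to get a stable, job-specific port.
        digest = hashlib.md5(repr(sorted(hparams.items())).encode()).hexdigest()
        os.environ["MASTER_PORT"] = str(20000 + int(digest, 16) % 10000)

    # e.g. set_master_port({"lr": 1e-4, "batch_size": 32}) before the trainer is built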

Additional context

The main reason I ran into trouble with environment variables is a problem with ports when training multiple models at once. Every job's distributed process is initialized on the same master port, and consequently the following error occurs:

torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:53394 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:53394 (errno: 98 - Address already in use).

The way to avoid this error is to set the MASTER_PORT environment variable to a different value for each job; without that, running multiple training processes in parallel is impossible due to port collisions. A minimal illustration of how the collision arises is sketched below.
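To make the failure mode concrete, here is a small sketch (my own reproduction, not NeMo code) of why two parallel jobs that see the same MASTER_PORT collide:

    # Sketch: torch.distributed reads MASTER_ADDR/MASTER_PORT for the default
    # env:// rendezvous. If two independent jobs use the same port, the second
    # bind fails with "Address already in use" (errno 98).
    import os
    import torch.distributed as dist

    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "53394")  # identical in every job -> collision
    dist.init_process_group(backend="gloo", rank=0, world_size=1)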

@titu1994
Collaborator

We don't use the process launcher unless you use a Hydra sweep config. Can you try removing that? If not, we'll have to see how to implement your request. Of course, if you have a solution, you're encouraged to send a PR.

@domenVres
Author

Is there a way to perform a hyperparameter search without the Hydra sweep?
