
PyTorch Lightning logs are not synchronised when using distributed training #3157

Open
admivsn opened this issue Apr 5, 2024 · 1 comment
admivsn commented Apr 5, 2024

Description

As described in the PyTorch Lightning documentation, logs need to be synchronised across devices by passing `sync_dist=True` to `self.log(...)`.

For example, in DeepAR I think the logging calls need this extra argument when running distributed training.

I notice that when training on multi-GPU SageMaker instances I don't see a performance uplift compared to a single-GPU instance, and I also get a warning from PyTorch Lightning:

It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
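
For reference, here is a minimal sketch of what the documentation recommends. It uses a generic `LightningModule` rather than GluonTS's actual DeepAR lightning module, so the class, layer, and batch structure are only illustrative; the relevant part is the `sync_dist=True` argument to `self.log`:

```python
import torch
import lightning.pytorch as pl


class ExampleModule(pl.LightningModule):
    """Illustrative only; not GluonTS's actual DeepAR lightning module."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # sync_dist=True makes Lightning all-reduce the metric across processes
        # before logging, so the epoch-level 'train_loss' reflects every GPU
        # rather than only rank 0.
        self.log("train_loss", loss, on_epoch=True, prog_bar=True, sync_dist=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```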

To Reproduce

It's difficult to reproduce exactly as I'm running a SageMaker training job. The relevant instance configuration is:

SM_CURRENT_INSTANCE_TYPE=ml.g4dn.12xlarge
SM_NUM_CPUS=48
SM_NUM_GPUS=4
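
For context, the job is launched along the lines of the sketch below using the SageMaker Python SDK. The entry point, IAM role, and S3 path are placeholders, and the framework/Python versions are assumptions based on the Environment section:

```python
from sagemaker.pytorch import PyTorch

# Hypothetical launcher; entry_point, role, and the S3 path are placeholders.
estimator = PyTorch(
    entry_point="train.py",            # script that builds and trains the GluonTS DeepAR estimator
    source_dir=".",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.g4dn.12xlarge",  # 4 GPUs, matching SM_NUM_GPUS=4 above
    instance_count=1,
    framework_version="2.2",           # assumed to match PyTorch 2.2.1 below
    py_version="py310",
)

estimator.fit({"train": "s3://my-bucket/train/"})
```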

You can see from this setup that there are 4 GPUs, and PyTorch Lightning detects them, as the logs show:

2024-04-17 20:37:48 Starting - Starting the training job...
2024-04-17 20:38:05 Starting - Preparing the instances for training......
2024-04-17 20:39:10 Downloading - Downloading input data...
2024-04-17 20:39:29 Downloading - Downloading the training image............
2024-04-17 20:41:50 Training - Training image download completed. Training in progress........2024-04-17 20:42:45,440 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)

...

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:67: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/configuration_validator.py:74: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Missing logger folder: /opt/ml/code/lightning_logs
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Missing logger folder: /opt/ml/code/lightning_logs
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
Missing logger folder: /opt/ml/code/lightning_logs
Missing logger folder: /opt/ml/code/lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
  | Name  | Type        | Params | In sizes                                                        | Out sizes  
----------------------------------------------------------------------------------------------------------------------
0 | model | DeepARModel | 25.9 K | [[1, 1], [1, 1], [1, 1102, 4], [1, 1102], [1, 1102], [1, 1, 4]] | [1, 100, 1]
----------------------------------------------------------------------------------------------------------------------
25.9 K    Trainable params
0         Non-trainable params
25.9 K    Total params
0.104     Total estimated model params size (MB)
/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:433: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
Epoch 0, global step 50: 'train_loss' reached 0.75536 (best 0.75536), saving model to '/opt/ml/code/lightning_logs/version_0/checkpoints/epoch=0-step=50.ckpt' as top 1
Epoch 1, global step 100: 'train_loss' reached 0.72144 (best 0.72144), saving model to '/opt/ml/code/lightning_logs/version_0/checkpoints/epoch=1-step=100.ckpt' as top 1

Error message or code output

The particular warning of interest is:

/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:433: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.

Environment

  • Operating system:
  • Python version: 3.10
  • GluonTS version: 0.14.4
  • PyTorch version: 2.2.1
  • PyTorch Lightning version: 2.1.4


admivsn added the bug (Something isn't working) label on Apr 5, 2024
admivsn changed the title from "PyTorch Lightning validation logs are not synchronised when using distributed training" to "PyTorch Lightning logs are not synchronised when using distributed training" on Apr 17, 2024

admivsn commented Apr 17, 2024

Updated this with some more info. Originally I thought it was just when using validation data, but upon investigation it seems like it's a wider issue.
