
Multi-GPU training fails under collective operation timeout #567

Open
2 tasks done
BaiqingL opened this issue Jun 17, 2024 · 2 comments
BaiqingL commented Jun 17, 2024

System Info

AWS ml.g5.12xlarge instance with 4x A10G GPUs, PyTorch 2.3.1, CUDA 12.1

I am using a modified dataset: everything was pre-tokenized ahead of time so that no GPU-instance time (and cost) is spent on tokenization. The dataset is at https://huggingface.co/datasets/BaiqingL/pokemon-rag-llama-3-tokenized

The tokenizer has been modified in the following way:

    # Load the tokenizer and add special tokens
    from transformers import AutoTokenizer

    LLM_ACTION = "LLM_ACTION"
    MOVE_CHOSEN = "MOVE_CHOSEN"
    SWITCH_PKMN = "SWITCH_PKMN"
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", token="oh no")  # access token redacted
    tokenizer.add_special_tokens(
        {"additional_special_tokens": [SWITCH_PKMN, MOVE_CHOSEN, LLM_ACTION]}
    )
    tokenizer.pad_token = tokenizer.eos_token

The rest of the training script resizes the model's token embeddings to account for the added special tokens; a sketch is shown below.
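For reference, a minimal sketch of what that resize step typically looks like (this assumes a standard transformers causal-LM object loaded here for illustration; the actual finetuning script loads and wraps the model differently and is not reproduced in this issue):

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
    # Grow the embedding matrix so the three added special tokens have embedding rows;
    # without this, their new token ids would index past the end of the original table.
    model.resize_token_embeddings(len(tokenizer))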

The dataset loading has been modified in the following way:

  from datasets import load_dataset

  ds = load_dataset(
      "BaiqingL/pokemon-rag-llama-3-tokenized",
      cache_dir="/home/ec2-user/SageMaker/cache",
      split="train[:1%]",
  ).train_test_split(test_size=500)
  # Load and preprocess the dataset for training and validation
  dataset_train = ds["train"]

And the validation dataset:

dataset_val = ds["test"]

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

After the final step of training, presumably during model saving, the process crashes and wastes all of the time spent training.
Command executed:

torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --num_workers_dataloader 12 --enable_fsdp --model_name meta-llama/Meta-Llama-3-8B --use_peft --batch_size_training 2 --context_length 2048 --num-epochs 1 --peft_method lora --save_metrics --output_dir /home/ec2-user/SageMaker/output

Error logs

[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600078 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600096 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600038 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 0] Timeout at NCCL work: 1474236, last enqueued NCCL work: 1474236, last completed NCCL work: 1474235.
[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600078 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f367500b897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f36762e4c62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f36762e9a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f36762eadcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f36c1d71e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7f36cadd944b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7f36ca3cd52f in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600078 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f367500b897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f36762e4c62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f36762e9a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f36762eadcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f36c1d71e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7f36cadd944b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7f36ca3cd52f in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f367500b897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7f3675f6e119 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7f36c1d71e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x744b (0x7f36cadd944b in /lib64/libpthread.so.0)
frame #4: clone + 0x3f (0x7f36ca3cd52f in /lib64/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 1474236, last enqueued NCCL work: 1474236, last completed NCCL work: 1474235.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5402892897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5403b6bc62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5403b70a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5403b71dcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f544f5f8e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7f545866044b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7f5457c5452f in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5402892897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5403b6bc62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5403b70a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5403b71dcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7f544f5f8e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7f545866044b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7f5457c5452f in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5402892897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7f54037f5119 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7f544f5f8e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x744b (0x7f545866044b in /lib64/libpthread.so.0)
frame #4: clone + 0x3f (0x7f5457c5452f in /lib64/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 1474236, last enqueued NCCL work: 1474236, last completed NCCL work: 1474235.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600096 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbacdf3c897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbacf215c62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbacf21aa80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbacf21bdcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7fbb1aca2e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7fbb23d0a44b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7fbb232fe52f in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=_ALLGATHER_BASE, NumelIn=262675456, NumelOut=1050701824, Timeout(ms)=600000) ran for 600096 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbacdf3c897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbacf215c62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbacf21aa80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbacf21bdcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7fbb1aca2e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7fbb23d0a44b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7fbb232fe52f in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbacdf3c897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7fbacee9f119 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7fbb1aca2e95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x744b (0x7fbb23d0a44b in /lib64/libpthread.so.0)
frame #4: clone + 0x3f (0x7fbb232fe52f in /lib64/libc.so.6)

[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 1474236, last enqueued NCCL work: 1474236, last completed NCCL work: 1474235.
[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600038 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ffa89345897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7ffa8a61ec62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ffa8a623a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ffa8a624dcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7ffad60abe95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7ffadf11944b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7ffade70d52f in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1474236, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600038 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ffa89345897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7ffa8a61ec62 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ffa8a623a80 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ffa8a624dcc in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3e95 (0x7ffad60abe95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x744b (0x7ffadf11944b in /lib64/libpthread.so.0)
frame #6: clone + 0x3f (0x7ffade70d52f in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ffa89345897 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32119 (0x7ffa8a2a8119 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7ffad60abe95 in /home/ec2-user/anaconda3/envs/pytorch_p310/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x744b (0x7ffadf11944b in /lib64/libpthread.so.0)
frame #4: clone + 0x3f (0x7ffade70d52f in /lib64/libc.so.6)

E0617 01:16:30.106000 139731967973184 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 40911) of binary: /home/ec2-user/anaconda3/envs/pytorch_p310/bin/python3.10
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
finetuning_2.py FAILED
------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-17_01:16:30
  host      : ip-172-16-17-88.ec2.internal
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 40912)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 40912
[2]:
  time      : 2024-06-17_01:16:30
  host      : ip-172-16-17-88.ec2.internal
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 40913)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 40913
[3]:
  time      : 2024-06-17_01:16:30
  host      : ip-172-16-17-88.ec2.internal
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 40914)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 40914
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-17_01:16:30
  host      : ip-172-16-17-88.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 40911)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 40911
======================================================

Expected behavior

The model should be saved successfully after training finishes.

wukaixingxp (Contributor) commented:

Hi! From your log I cannot see the root cause of the NCCL timeout. Is this error reproducible? A worker may have been killed, or the NCCL connection may have been disrupted somehow. We can first check your NCCL config; here are some ways to verify that it is correct: (1) Run the official NCCL all_reduce_perf test. (2) Try the Hugging Face multi-GPU debug script. (3) If both tests pass, export NCCL_DEBUG=INFO and rerun the distributed training using our official example, then check whether the NCCL communication info reports any error or warning; you can paste the NCCL info back here for me to double-check.
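A minimal sanity check in the spirit of steps (1) and (2) might look like the sketch below (this is not the official NCCL test or the Hugging Face debug script, just a small stand-in). Launch it with the same torchrun invocation as the training run; if it hangs or times out, the problem is in the NCCL/interconnect setup rather than in finetuning.py:

    # nccl_check.py -- run with: torchrun --nnodes 1 --nproc_per_node 4 nccl_check.py
    import torch
    import torch.distributed as dist

    def main():
        dist.init_process_group(backend="nccl")
        rank = dist.get_rank()
        torch.cuda.set_device(rank % torch.cuda.device_count())
        # Each rank contributes its rank id; with 4 ranks the all_reduce sum should be 0+1+2+3 = 6.
        x = torch.tensor([float(rank)], device="cuda")
        dist.all_reduce(x, op=dist.ReduceOp.SUM)
        print(f"rank {rank}: all_reduce result = {x.item()}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()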

wukaixingxp commented Jun 18, 2024

If you believe your NCCL config is correct, then I suggest using a small dataset and py-spy's record or dump functions to capture the PyTorch main thread's call stack, so you can see the name of the last function that ran before the crash.
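For example, standard py-spy usage would look something like the following (pick the PID of the training process on the stuck rank; <PID> here is a placeholder):

    pip install py-spy
    # One-off snapshot of the current Python call stack of the running process:
    py-spy dump --pid <PID>
    # Or record over time and write a flame graph to inspect where it got stuck:
    py-spy record -o profile.svg --pid <PID>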
