Tokenizer has been modified in the following way:

```python
# Load the tokenizer and add special tokens
LLM_ACTION = "LLM_ACTION"
MOVE_CHOSEN = "MOVE_CHOSEN"
SWITCH_PKMN = "SWITCH_PKMN"
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", token="oh no")
tokenizer.add_special_tokens(
    {"additional_special_tokens": [SWITCH_PKMN, MOVE_CHOSEN, LLM_ACTION]}
)
tokenizer.pad_token = tokenizer.eos_token
```
The rest of the training script resizes the model's token embeddings to account for the added tokens.
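For context, adding three special tokens makes the tokenizer's vocabulary larger than the model's embedding table, so the resize presumably amounts to `model.resize_token_embeddings(len(tokenizer))` after loading the model. A toy sketch of why the resize is required (plain Python, no transformers dependency; the vocabulary size is the published Llama 3 base value):

```python
# Toy view of the embedding table: one row per token ID.
old_vocab_size = 128256            # Llama 3 base vocabulary size
embedding_rows = old_vocab_size

# The 3 added special tokens get IDs 128256, 128257, 128258 --
# all out of range for the old table, so lookups would fail.
new_token_ids = [old_vocab_size + i for i in range(3)]
assert max(new_token_ids) >= embedding_rows

# resize_token_embeddings grows the table to len(tokenizer):
embedding_rows = old_vocab_size + 3
assert max(new_token_ids) < embedding_rows   # lookups now succeed
print(embedding_rows)  # 128259
```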
The dataset has been modified as follows:

```python
ds = load_dataset(
    "BaiqingL/pokemon-rag-llama-3-tokenized",
    cache_dir="/home/ec2-user/SageMaker/cache",
    split="train[:1%]",
).train_test_split(test_size=500)
```
```python
# Load and preprocess the dataset for training and validation
dataset_train = ds["train"]
```
And the validation dataset:

```python
dataset_val = ds["test"]
```
Information
- The official example scripts
- My own modified scripts
🐛 Describe the bug
After the final training step, presumably during model saving, the process crashes, wasting the entire training run...
Command executed:
Hi! From your log, I did not see the root cause of the NCCL timeout. Is this error reproducible? A worker may have been killed, or the NCCL connection disrupted somehow. We can first check your NCCL config; here are some ways to verify it:
1. Run the official NCCL `all_reduce_perf` test.
2. Try the Hugging Face multi-GPU debug script.
3. If both tests pass, `export NCCL_DEBUG=INFO` and rerun the distributed training using our official example; if the NCCL communication info shows any error or warning, paste it back so I can double-check.
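The `NCCL_DEBUG` step can be sketched as follows (the launch command is a placeholder, not the actual command from this issue):

```shell
# Verbose NCCL logging for the next run
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL   # optional: narrow to init/collective logs
# then relaunch the distributed job, e.g. (placeholder command):
# torchrun --nproc_per_node=4 finetuning.py ...
echo "NCCL_DEBUG=$NCCL_DEBUG"
```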
If you believe your NCCL config is correct, then I suggest using a small dataset and `py-spy record` or `py-spy dump` to capture the PyTorch main thread's call stack and see the last function that ran before the crash.
System Info
ml.g5.12xlarge instance from AWS, with PyTorch 2.3.1, 4x A10G GPUs, CUDA 12.1
I use a modified dataset since I pre-tokenized everything offline, to avoid spending GPU-instance time on tokenization and reduce costs: https://huggingface.co/datasets/BaiqingL/pokemon-rag-llama-3-tokenized
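The pre-tokenization idea can be sketched as follows (a toy whitespace tokenizer, not the actual Llama 3 tokenizer; in practice this would be a `datasets` `map` call over the real tokenizer on a CPU machine):

```python
# Toy stand-in for offline tokenization: map raw text to token IDs once,
# on cheap CPU hardware, so the GPU instance only loads integer IDs.
def tokenize(text, vocab):
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]

vocab = {}
raw_examples = ["switch to pikachu", "use thunderbolt"]
pretokenized = [{"input_ids": tokenize(t, vocab)} for t in raw_examples]
print(pretokenized)
# [{'input_ids': [0, 1, 2]}, {'input_ids': [3, 4]}]
```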
Error logs
Expected behavior
Save the model