I am trying to reproduce OPT-66B training on 16xH100 GPUs (2 servers). Each server has 1000 GiB of CPU memory. When I run the OPT benchmark, the program crashes with the error below; watching CPU memory, usage climbs to 924 GiB before the crash.
How can I run the OPT-66B benchmark with these resources?
Error log:
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal
performance in your application as needed.
*****************************************
/usr/local/lib/python3.8/dist-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
warnings.warn("`config` is deprecated and will be removed soon.")
[06/25/24 19:04:54] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/dist-packages/colossalai/initialize.py:67 launch
[06/25/24 19:04:55] INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 16
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51974 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51975 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51976 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51977 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51978 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51980 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51981 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 5 (pid: 51979) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.14.0a0+44dac51', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
opt/opt_train_demo.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
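For context, exit code -9 means the worker was terminated by signal 9 (SIGKILL), which on Linux is most often the kernel OOM killer reclaiming memory. A quick way to check (a sketch; reading the kernel log may require root):

```shell
# Exit code -9 corresponds to signal 9; confirm the signal name:
kill -l 9   # prints: KILL

# If the OOM killer fired, the kernel log records which process it killed:
dmesg -T | grep -iE 'out of memory|oom-killer' | tail -n 5
```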
Environment
Docker image : nvcr.io/nvidia/pytorch:23.02-py3
transformers : 4.33
colossalai : 0.3.6
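A back-of-the-envelope estimate suggests why CPU memory is exhausted. The sketch below assumes the fp32 master weights and both Adam moments are offloaded to CPU (as a CPU placement policy for optimizer states would do) and ignores activations, buffers, gradients, and fragmentation, so real peak usage will be higher:

```python
# Rough CPU-side memory estimate for OPT-66B optimizer-state offload.
# Assumption (not from the log): fp32 master weights + Adam m and v
# moments live on CPU, each stored in fp32 (4 bytes per parameter).
PARAMS = 66e9          # OPT-66B parameter count
GIB = 1024 ** 3

fp32_master = PARAMS * 4   # fp32 copy of the weights
adam_m = PARAMS * 4        # Adam first moment
adam_v = PARAMS * 4        # Adam second moment

cpu_bytes = fp32_master + adam_m + adam_v
print(f"optimizer state on CPU: ~{cpu_bytes / GIB:.0f} GiB")  # ~738 GiB
```

With ~738 GiB of optimizer state alone, plus model weights held per process during initialization, two 1000 GiB hosts can plausibly hit the observed 924 GiB peak unless the states are sharded across data-parallel ranks before being offloaded.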