I am trying to reproduce OPT-66B training on 16xH100 GPUs (2 servers). Each server has 1000 GiB of CPU memory. When I run the OPT benchmark, the program crashes with the error below; watching CPU memory, usage climbs to 924 GiB before the crash.
How can I run the OPT-66B benchmark with these resources?
Error log:
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal
performance in your application as needed.
*****************************************
/usr/local/lib/python3.8/dist-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
warnings.warn("`config` is deprecated and will be removed soon.")
[06/25/24 19:04:54] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/dist-packages/colossalai/initialize.py:67 launch
[06/25/24 19:04:55] INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 16
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51974 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51975 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51976 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51977 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51978 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51980 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51981 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 5 (pid: 51979) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.14.0a0+44dac51', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
opt/opt_train_demo.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
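For context, exit code -9 means the worker was terminated by signal 9 (SIGKILL), which on Linux is most often the kernel OOM killer reclaiming memory. A quick way to check (a sketch; reading the kernel log may require root):

```shell
# Exit code -9 corresponds to signal 9; confirm the signal name:
kill -l 9   # prints: KILL

# If the OOM killer fired, the kernel log records which process it killed:
dmesg -T | grep -iE 'out of memory|oom-killer' | tail -n 5
```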
Environment
Docker image : nvcr.io/nvidia/pytorch:23.02-py3
transformers : 4.33
colossalai : 0.3.6
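A back-of-the-envelope estimate suggests why CPU memory is exhausted. The sketch below assumes the fp32 master weights and both Adam moments are offloaded to CPU (as a CPU placement policy for optimizer states would do) and ignores activations, buffers, gradients, and fragmentation, so real peak usage will be higher:

```python
# Rough CPU-side memory estimate for OPT-66B optimizer-state offload.
# Assumption (not from the log): fp32 master weights + Adam m and v
# moments live on CPU, each stored in fp32 (4 bytes per parameter).
PARAMS = 66e9          # OPT-66B parameter count
GIB = 1024 ** 3

fp32_master = PARAMS * 4   # fp32 copy of the weights
adam_m = PARAMS * 4        # Adam first moment
adam_v = PARAMS * 4        # Adam second moment

cpu_bytes = fp32_master + adam_m + adam_v
print(f"optimizer state on CPU: ~{cpu_bytes / GIB:.0f} GiB")  # ~738 GiB
```

With ~738 GiB of optimizer state alone, plus model weights held per process during initialization, two 1000 GiB hosts can plausibly hit the observed 924 GiB peak unless the states are sharded across data-parallel ranks before being offloaded.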