
[BUG]: loading OPT 66B model - CPU runs out of memory #5855

Open · 1 task done
PurvangL opened this issue Jun 25, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@PurvangL

Is there an existing issue for this bug?

  • I have searched the existing issues

🐛 Describe the bug

I am trying to reproduce the OPT-66B benchmark on 16x H100 GPUs (2 servers, 8 GPUs each). Each server has 1000 GiB of CPU memory. When I run the OPT benchmark, the program crashes with the following error; watching CPU memory, usage climbs to 924 GiB before the crash.
How can I run the OPT-66B benchmark with these resources?
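
As a rough sanity check on the numbers (a back-of-the-envelope sketch, assuming fp16 weights and that each local rank materializes a full copy of the model in host RAM before sharding; neither is confirmed from the run itself):

```python
params = 66e9          # OPT-66B parameter count
bytes_per_param = 2    # assuming fp16 weights
ranks_per_node = 8     # 8x H100 per server

# If every local rank builds its own full copy on the CPU before sharding:
per_rank_gib = params * bytes_per_param / 2**30
total_gib = per_rank_gib * ranks_per_node
print(f"{per_rank_gib:.0f} GiB per rank, {total_gib:.0f} GiB per node")
# -> ~123 GiB per rank, ~983 GiB per node, right at the 1000 GiB limit
```

That would be consistent with the ~924 GiB observed before the process is killed (exit code -9, i.e. SIGKILL, typically the kernel OOM killer).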

Error:

WARNING:torch.distributed.run:                                                                                                                                        
*****************************************                                                                                                                             
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal 
performance in your application as needed.                                                                                                                            
*****************************************                                                                                                                             
/usr/local/lib/python3.8/dist-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.                                     
  warnings.warn("`config` is deprecated and will be removed soon.")                                                                                                   
[06/25/24 19:04:54] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/dist-packages/colossalai/initialize.py:67 launch                                
[06/25/24 19:04:55] INFO     colossalai - colossalai - INFO: Distributed environment is initialized, world size: 16                                                   
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51974 closing signal SIGTERM                                                                    
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51975 closing signal SIGTERM                                                                    
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51976 closing signal SIGTERM                                                                    
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51977 closing signal SIGTERM                                                                    
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51978 closing signal SIGTERM                                                                    
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51980 closing signal SIGTERM                                                                    
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51981 closing signal SIGTERM                                                                    
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 5 (pid: 51979) of binary: /usr/bin/python                                       
Traceback (most recent call last):                                                                                                                                    
  File "/usr/local/bin/torchrun", line 33, in <module>                                                                                                                
    sys.exit(load_entry_point('torch==1.14.0a0+44dac51', 'console_scripts', 'torchrun')())                                                                            
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper                                    
    return f(*args, **kwargs)                                                                                                                                         
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main                                                                           
    run(args)                                                                                                                                                         
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run                                                                            
    elastic_launch(                                                                                                                                                   
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__                                                              
    return launch_agent(self._config, self._entrypoint, list(args))                                                                                                   
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent                                                          
    raise ChildFailedError(                                                                                                                                           
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:                                                                                                    
======================================================                                                                                                                
opt/opt_train_demo.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):

Environment

Docker image : nvcr.io/nvidia/pytorch:23.02-py3
transformers : 4.33
colossalai : 0.3.6

PurvangL added the bug label on Jun 25, 2024
@Edenzzzz (Contributor)

You can try lazy init as shown here and file a PR if it works.
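
For reference, a minimal sketch of what that could look like in the OPT demo, assuming the `colossalai.lazy.LazyInitContext` API from colossalai 0.3.x (the exact integration point in opt_train_demo.py may differ):

```python
import torch
from transformers import AutoConfig, OPTForCausalLM
from colossalai.lazy import LazyInitContext

# Construct the model under LazyInitContext so parameters are recorded as
# lazy/meta tensors instead of being fully allocated in host RAM on every rank.
config = AutoConfig.from_pretrained("facebook/opt-66b")
with LazyInitContext(default_device=torch.device("cuda")):
    model = OPTForCausalLM(config)

# Weights are only materialized (and sharded) later, e.g. when the model is
# passed to booster.boost(...) with a Gemini or HybridParallel plugin.
```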
