[BUG] Memory allocation left resident in GPU(s) after model upload to HuggingFace #736

Open
tmostak opened this issue May 31, 2024 · 0 comments
Labels
type/bug Bug in code

🐛 Bug

When uploading a model to HuggingFace with the cpu_shard setting (and, I believe, with any of the available GPU settings as well), allocations are left resident in GPU memory after the upload finishes. This usually means I have to restart H2O LLM Studio before I can train another model, especially if I expect to be tight on memory.

To Reproduce

Upload any model to HuggingFace using the cpu_shard setting. After the upload finishes, check nvidia-smi. Below is the output after I uploaded a 22B-parameter model:

(base) ubuntu@207-211-184-180:~$ nvidia-smi
Fri May 31 17:48:30 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:08:00.0 Off |                    0 |
| N/A   35C    P0              70W / 400W |   5585MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:09:00.0 Off |                    0 |
| N/A   33C    P0              69W / 400W |   5969MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:0A:00.0 Off |                    0 |
| N/A   33C    P0              70W / 400W |   5969MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:0B:00.0 Off |                    0 |
| N/A   34C    P0              70W / 400W |   5969MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  | 00000000:0C:00.0 Off |                    0 |
| N/A   33C    P0              68W / 400W |   5965MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  | 00000000:0D:00.0 Off |                    0 |
| N/A   32C    P0              67W / 400W |   5969MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  | 00000000:0E:00.0 Off |                    0 |
| N/A   33C    P0              71W / 400W |   5969MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   34C    P0              68W / 400W |   5589MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2271624      C   ...envs/h2o_llm_2024_05_28/bin/python3     5572MiB |
|    1   N/A  N/A   2271624      C   ...envs/h2o_llm_2024_05_28/bin/python3     5956MiB |
|    2   N/A  N/A   2271624      C   ...envs/h2o_llm_2024_05_28/bin/python3     5956MiB |
|    3   N/A  N/A   2271624      C   ...envs/h2o_llm_2024_05_28/bin/python3     5956MiB |
|    4   N/A  N/A   2271624      C   ...envs/h2o_llm_2024_05_28/bin/python3     5952MiB |
|    5   N/A  N/A   2271624      C   ...envs/h2o_llm_2024_05_28/bin/python3     5956MiB |
|    6   N/A  N/A   2271624      C   ...envs/h2o_llm_2024_05_28/bin/python3     5956MiB |
|    7   N/A  N/A   2271624      C   ...envs/h2o_llm_2024_05_28/bin/python3     5576MiB |
+---------------------------------------------------------------------------------------+

LLM Studio version
c23a3c80f847561736217a1d355837c0e4a8f595 (master)