Training CUDA Out Of Memory error #98

Open
Kev1MSL opened this issue May 30, 2024 · 5 comments

Comments

@Kev1MSL

Kev1MSL commented May 30, 2024

Hi! I am trying to train the InstantMesh model, but I get a CUDA out-of-memory error just before backpropagation. Have you faced a similar issue when training, and if so, how did you solve it? I am training on 8 GPUs with the same memory as the H800, as described in the paper.
Thanks!

@sumanttyagi

Please check your CUDA devices.
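
One way to act on this suggestion is to list the GPUs that PyTorch can actually see, along with their total memory. The following is only a minimal sketch, assuming PyTorch is installed; `list_cuda_devices` is a hypothetical helper name and is not part of InstantMesh.

```python
def list_cuda_devices():
    """Return (index, name, total_GiB) for each GPU visible to PyTorch."""
    try:
        import torch
    except ImportError:
        return []  # PyTorch is not installed, so there is nothing to report
    if not torch.cuda.is_available():
        return []  # no usable CUDA device (driver issue, CPU-only build, ...)
    devices = []
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        devices.append((i, props.name, props.total_memory / 1024**3))
    return devices


# Print one line per visible GPU; prints nothing if none are found.
for index, name, total_gib in list_cuda_devices():
    print(f"cuda:{index}  {name}  {total_gib:.1f} GiB")
```

If fewer GPUs show up than expected, the `CUDA_VISIBLE_DEVICES` environment variable is a common place to look, since it restricts which devices a process can see.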

@gaodalii

I am using a single A800 (80 GB), but I can only train with batch_size=1. If I set batch_size=2, I also get a CUDA out-of-memory error.
[Two images attached]

@Kev1MSL
Author

Kev1MSL commented May 31, 2024

Yes, same here: it works with batch_size=1 but not with batch_size=2. However, I am only short by a few GB (~2 GB), so I was wondering whether there is a way to optimize this. Also, what happens if I distribute the training across multiple GPUs with batch_size=1? Does each GPU get its own batch of 1, or is the single batch split across the GPUs?

Because with a batch of size 1, wouldn't we have trouble converging?
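
For anyone reading along, here is a rough sketch of the arithmetic behind these two questions. It assumes standard data-parallel training in the style of PyTorch `DistributedDataParallel`, where every process loads its own batch, so `batch_size` is per GPU; whether the InstantMesh training script behaves this way is an assumption here, not something confirmed in this thread. The second half illustrates gradient accumulation, a general technique (not something the maintainers have suggested here) for getting a larger effective batch while keeping `batch_size=1` in memory. The function names and the toy data are made up for illustration.

```python
def effective_batch_size(per_gpu_batch, num_gpus, accumulate_steps=1):
    """Samples that contribute to one optimizer step under data parallelism."""
    return per_gpu_batch * num_gpus * accumulate_steps


def grad_mse(w, batch):
    """Gradient w.r.t. w of the mean squared error of y ~ w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Hypothetical toy data: (x, y) pairs for a one-parameter linear model.
data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]
w = 0.5

# Gradient computed on the whole batch at once.
full_batch_grad = grad_mse(w, data)

# Gradient accumulated over micro-batches of size 1, then averaged.
accumulated_grad = sum(grad_mse(w, [sample]) for sample in data) / len(data)

print(effective_batch_size(1, 8))     # batch_size=1 on 8 GPUs
print(effective_batch_size(1, 8, 4))  # same, with 4 accumulation steps
print(full_batch_grad, accumulated_grad)
```

The two gradients agree because the loss is a mean over samples. That equivalence does not carry over exactly to layers whose output depends on the batch, such as batch normalization, so treat this as an approximation for a real model.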

@Mrguanglei

@Kev1MSL Hello, I ran into several problems during training. My dataset is structured as shown in the pictures, but I don't know how to write my training configuration file. Could you help me? Thank you very much for your reply.

微信图片_20240531213352
微信图片_20240531213146

@throb081

throb081 commented Jun 4, 2024

@Kev1MSL Hello, I am trying to run the training process, but I don't know how to construct the dataset. Could I have a look at the structure of your dataset? Thank you very much for your reply.
