
DataParallel crashes about "not same cuda device" when training after onnx export #1582

Open
yujiepan-work opened this issue Feb 17, 2023 · 7 comments


@yujiepan-work (Contributor) commented Feb 17, 2023

When using DataParallel in PyTorch (not DDP), after calling `compression_ctrl.export_model()`, the subsequent training crashes with `Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!`

Scripts for reproducing the error: https://gist.github.com/yujiepan-work/964d4716902ee75bf132dc4d80c96e61

After some debugging, I found that after the ONNX export, the cuda:1 replica of the model in DP runs its forward pass with model parameters that still live on cuda:0.
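
For reference, a condensed sketch of the failing sequence (the full script is in the linked gist; `base_model`, `nncf_config`, `loader`, and `train_one_epoch` are placeholders for the gist's details):

```python
import torch
from nncf.torch import create_compressed_model

# Wrap the model with NNCF compression, then with DataParallel (>= 2 visible GPUs).
compression_ctrl, model = create_compressed_model(base_model, nncf_config)
model = torch.nn.DataParallel(model.cuda())

train_one_epoch(model, loader)          # trains fine before export

compression_ctrl.export_model("model.onnx")

train_one_epoch(model, loader)          # RuntimeError: Expected all tensors to be
                                        # on the same device ... cuda:0 and cuda:1!
```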

@vshampor (Contributor)

I have been doing some relevant work in #1584 and ran into the same issue at some points. The problem is likely that NNCFNetwork stores unaccounted-for references to the wrapped model object, and these are not handled correctly when DataParallel replicates the entire NNCFNetwork into device-specific copies. Replication does not re-point such a stale reference to the replicated copy of the module, so when the forward pass runs through that reference, the referenced module receives inputs from the wrong CUDA device.
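
As a minimal stand-in (this is not NNCFNetwork's actual code, just an illustration of the mechanism): a module that keeps an unregistered reference to a submodule fails under DataParallel in exactly this way, because `replicate()` re-points registered submodules onto each device but only shallow-copies plain attributes.

```python
import torch
import torch.nn as nn

class BadWrapper(nn.Module):
    """Caches a raw reference to its inner module, bypassing nn.Module registration."""

    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner       # registered submodule: re-pointed on replication
        self._refs = [inner]     # hidden inside a list: NOT re-pointed on replication

    def forward(self, x):
        # The cuda:1 replica still calls the original cuda:0 module here.
        return self._refs[0](x)

# Requires at least two visible GPUs to trigger.
model = nn.DataParallel(BadWrapper(nn.Linear(8, 8)).cuda())
x = torch.randn(4, 8, device="cuda")
model(x)  # RuntimeError: Expected all tensors to be on the same device,
          # but found at least two devices, cuda:0 and cuda:1!
```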

@vshampor (Contributor)

The reproducer you posted actually works against the state introduced in #1584.

@yujiepan-work (Contributor, Author)

Thanks for the reply! I will close this issue since it is solved in the latest version 😊

@ljaljushkin (Contributor)

Unfortunately, it's still reproducible with the latest NNCF in the Optimum tests (https://github.com/huggingface/optimum-intel/blob/main/tests/openvino/test_training_examples.py):

`pytest tests/openvino/test_training_examples.py -k "JPQD"`

`RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper_CUDA__index_select)`

@ljaljushkin (Contributor)

It's also reproducible with quantization-aware training alone:

`pytest tests/openvino/test_training_examples.py -k "QAT"`

@MaximProshin (Collaborator)

@ljaljushkin, @vshampor, is it still valid?

@ljaljushkin (Contributor) commented Jun 20, 2023

@MaximProshin yes, it's still valid.
