
DataParallel crashes about "not same cuda device" when training after onnx export #1582

Open
yujiepan-work opened this issue Feb 17, 2023 · 7 comments


@yujiepan-work (Contributor) commented Feb 17, 2023

When using DataParallel in PyTorch (not DDP), after calling `compression_ctrl.export_model()`, the subsequent training crashes with `Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!`

Scripts for reproducing the error: https://gist.github.com/yujiepan-work/964d4716902ee75bf132dc4d80c96e61

After some debugging, I found that after the ONNX export, the cuda:1 replica of the model in DP runs its forward pass with model parameters that still live on cuda:0.
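
For reference, a condensed sketch of the failing sequence (the full script is in the linked gist; `base_model`, `nncf_config`, `loader`, and `train_one_epoch` are placeholders for the gist's details):

```python
import torch
from nncf.torch import create_compressed_model

# Wrap the model with NNCF compression, then with DataParallel (>= 2 visible GPUs).
compression_ctrl, model = create_compressed_model(base_model, nncf_config)
model = torch.nn.DataParallel(model.cuda())

train_one_epoch(model, loader)          # trains fine before export

compression_ctrl.export_model("model.onnx")

train_one_epoch(model, loader)          # RuntimeError: Expected all tensors to be
                                        # on the same device ... cuda:0 and cuda:1!
```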

@vshampor (Contributor)

I have been doing some relevant work in #1584 and ran into the same issue at some points. The problem is likely that NNCFNetwork stores unaccounted-for references to the wrapped model object, and these are not handled correctly when DataParallel replicates the entire NNCFNetwork into device-specific copies. Replication does not re-point such a stale reference to the replicated copy of the module, so when the forward pass runs through that reference, the referenced module receives inputs from the wrong CUDA device.
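
As a minimal stand-in (this is not NNCFNetwork's actual code, just an illustration of the mechanism): a module that keeps an unregistered reference to a submodule fails under DataParallel in exactly this way, because `replicate()` re-points registered submodules onto each device but only shallow-copies plain attributes.

```python
import torch
import torch.nn as nn

class BadWrapper(nn.Module):
    """Caches a raw reference to its inner module, bypassing nn.Module registration."""

    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner       # registered submodule: re-pointed on replication
        self._refs = [inner]     # hidden inside a list: NOT re-pointed on replication

    def forward(self, x):
        # The cuda:1 replica still calls the original cuda:0 module here.
        return self._refs[0](x)

# Requires at least two visible GPUs to trigger.
model = nn.DataParallel(BadWrapper(nn.Linear(8, 8)).cuda())
x = torch.randn(4, 8, device="cuda")
model(x)  # RuntimeError: Expected all tensors to be on the same device,
          # but found at least two devices, cuda:0 and cuda:1!
```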

@vshampor (Contributor)

The reproducer you posted actually works against the state introduced in #1584.

@yujiepan-work (Contributor, Author)

Thanks for the reply! I will close this issue since it is solved in the latest version 😊

@ljaljushkin (Contributor)

Unfortunately, it's still reproducible with the latest NNCF in the Optimum tests (https://github.com/huggingface/optimum-intel/blob/main/tests/openvino/test_training_examples.py):

`pytest tests/openvino/test_training_examples.py -k "JPQD"`

`RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper_CUDA__index_select)`

@ljaljushkin (Contributor)

It's also reproducible with quantization-aware training alone:

`pytest tests/openvino/test_training_examples.py -k "QAT"`

@MaximProshin (Collaborator)

@ljaljushkin, @vshampor, is it still valid?

@ljaljushkin (Contributor) commented Jun 20, 2023

@MaximProshin yes, it's still valid.
