
[Bug] lmdeploy lite auto_awq: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! #1675

Closed
dawnranger opened this issue May 29, 2024 · 5 comments

@dawnranger

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

An error occurs when I try to run `lmdeploy lite auto_awq` on the Qwen1.5-0.5b-chat model.

Reproduction

lmdeploy lite auto_awq ./Qwen1.5-0.5b-chat
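
For reference, the same quantization can also be invoked from Python through the `auto_awq` API that appears in the traceback below (`lmdeploy/lite/apis/auto_awq.py`). A minimal sketch, assuming the keyword names mirror the CLI flags (not verified against 0.4.2):

```python
# Sketch: calling the AWQ quantization API directly instead of the CLI.
# Keyword names are assumed to mirror the CLI flags
# (--calib-dataset, --calib-samples, --calib-seqlen, --w-bits, --w-group-size, --work-dir).
from lmdeploy.lite.apis.auto_awq import auto_awq

auto_awq(
    './Qwen1.5-0.5b-chat',      # HF model directory to quantize
    work_dir='./lmdeploy_hf',   # where the quantized weights are written
    calib_dataset='c4',
    calib_samples=64,
    calib_seqlen=2048,
    w_bits=4,
    w_group_size=128,
)
```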

Environment

cuda 11.8
python 3.8.12
lmdeploy                      0.4.2
peft                          0.9.0
torch                         2.1.2+cu118
transformers                  4.41.1

Error traceback

lmdeploy lite auto_awq ./Qwen1.5-0.5b-chat --calib-dataset 'c4' --calib-samples 64 --calib-seqlen 2048 --w-bits 4 --w-group-size 128 --work-dir ./lmdeploy_hf     
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Move model.embed_tokens to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
Move model.layers.2 to CPU.
Move model.layers.3 to CPU.
Move model.layers.4 to CPU.
Move model.layers.5 to CPU.
Move model.layers.6 to CPU.
Move model.layers.7 to CPU.
Move model.layers.8 to CPU.
Move model.layers.9 to CPU.
Move model.layers.10 to CPU.
Move model.layers.11 to CPU.
Move model.layers.12 to CPU.
Move model.layers.13 to CPU.
Move model.layers.14 to CPU.
Move model.layers.15 to CPU.
Move model.layers.16 to CPU.
Move model.layers.17 to CPU.
Move model.layers.18 to CPU.
Move model.layers.19 to CPU.
Move model.layers.20 to CPU.
Move model.layers.21 to CPU.
Move model.layers.22 to CPU.
Move model.layers.23 to CPU.
Move model.norm to GPU.
Move lm_head to CPU.
Loading calibrate dataset ...
Traceback (most recent call last):
  File "/root/venv_lmdeploy_0.4.2/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/root/venv_lmdeploy_0.4.2/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/root/venv_lmdeploy_0.4.2/lib/python3.8/site-packages/lmdeploy/cli/lite.py", line 137, in auto_awq
    auto_awq(**kwargs)
  File "/root/venv_lmdeploy_0.4.2/lib/python3.8/site-packages/lmdeploy/lite/apis/auto_awq.py", line 96, in auto_awq
    vl_model, model, tokenizer, work_dir = calibrate(model,
  File "/root/venv_lmdeploy_0.4.2/lib/python3.8/site-packages/lmdeploy/lite/apis/calibrate.py", line 235, in calibrate
    calib_ctx.calibrate(all_data)
  File "/root/venv_lmdeploy_0.4.2/lib/python3.8/site-packages/lmdeploy/lite/quantization/calibration.py", line 315, in calibrate
    _ = model(data.to(self.device))
  File "/root/venv_lmdeploy_0.4.2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/venv_lmdeploy_0.4.2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/venv_lmdeploy_0.4.2/lib/python3.8/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 978, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/root/venv_lmdeploy_0.4.2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/venv_lmdeploy_0.4.2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/venv_lmdeploy_0.4.2/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/root/venv_lmdeploy_0.4.2/lib/python3.8/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
@AllentDan
Collaborator

The Qwen1.5-0.5B-Chat model is not supported by the lmdeploy turbomind backend.

@dawnranger
Author

@AllentDan any plan to support qwen2-0.5b?

@AllentDan
Collaborator

After #1782

@AllentDan
Collaborator

As replied in #1782 (comment), you may try the PyTorch backend to run the Qwen1.5-0.5B-Chat model.
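
A minimal sketch of what that could look like with the lmdeploy `pipeline` API and `PytorchEngineConfig` (available in 0.4.x; treat the exact arguments as an assumption, they are not taken from this thread):

```python
# Sketch: run Qwen1.5-0.5B-Chat on the PyTorch backend instead of turbomind.
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline('./Qwen1.5-0.5b-chat',
                backend_config=PytorchEngineConfig(tp=1))  # tp=1: single GPU
print(pipe(['Hello, who are you?']))
```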

@AllentDan
Collaborator

The issue will be closed. Feel free to reopen it.
