Device dispatcher #1775
Conversation
class CUDADeviceUtils(BaseDeviceUtils):

    device = 'cuda'
The class name already indicates CUDA. Is "device = 'cuda'" necessary?
BaseDeviceUtils._sub_classes[sub_cls.device] = sub_cls
The class is registered automatically under this key when it is imported, so the device attribute is what the registry is built on.
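For context, a minimal sketch of how such import-time registration typically works; the `__init_subclass__` hook and the assertion below are assumptions for illustration, only the `_sub_classes` assignment quoted above comes from the diff:

```python
class BaseDeviceUtils:
    _sub_classes = {}
    device = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Each subclass registers itself under its `device` attribute as
        # soon as its module is imported; the attribute is the lookup key.
        BaseDeviceUtils._sub_classes[cls.device] = cls


class CUDADeviceUtils(BaseDeviceUtils):
    # Without this line the class would be registered under the base default
    # (`None`), so the attribute is still required despite the class name.
    device = 'cuda'


assert BaseDeviceUtils._sub_classes['cuda'] is CUDADeviceUtils
```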
num_ignore_eos=num_ignore_eos,
output_que=out_que,
)
await self._async_step_background(
"with torch.cuda.stream(self.stream):" is removed. Will it bring side effect?
It has been moved out to:
lmdeploy/lmdeploy/pytorch/engine/engine.py
Line 850 in 41f179e
with device_manager.context(self.device_context), torch.cuda.stream(
So the context only needs to be entered once.
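A schematic sketch of the idea, with a stub device manager standing in for the real one (names and structure here are simplifications, not the actual engine code):

```python
import contextlib

import torch


class StubDeviceManager:
    """Stand-in for the engine's device manager (illustrative only)."""

    def context(self, device_context):
        return contextlib.nullcontext(device_context)


def step_loop(engine, device_manager, num_steps):
    stream_ctx = (torch.cuda.stream(engine.stream)
                  if torch.cuda.is_available() else contextlib.nullcontext())
    # Both contexts are entered once, around the whole step loop, so the
    # per-call `with torch.cuda.stream(self.stream):` inside the forward
    # path is no longer needed.
    with device_manager.context(engine.device_context), stream_ctx:
        for _ in range(num_steps):
            engine.step()  # every forward pass inherits the ambient stream
```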
@@ -0,0 +1 @@
# Copyright (c) OpenMMLab. All rights reserved.
What is the package "default" used for?
This package is designed to hold pure PyTorch implementations, without any device assumptions.
So if a backend does not provide a custom kernel, we can still perform inference with the default one.
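As a hedged illustration of what could live in such a `default` package, here is an RMS norm written only with standard torch ops, so it runs on any backend without a custom kernel (the function name and signature are placeholders, not necessarily lmdeploy's actual kernel interface):

```python
import torch


def rms_norm(hidden_states: torch.Tensor,
             weight: torch.Tensor,
             eps: float = 1e-6) -> torch.Tensor:
    """RMS normalization using only plain PyTorch ops, no device assumption."""
    input_dtype = hidden_states.dtype
    hidden_states = hidden_states.to(torch.float32)
    variance = hidden_states.pow(2).mean(-1, keepdim=True)
    hidden_states = hidden_states * torch.rsqrt(variance + eps)
    return weight * hidden_states.to(input_dtype)
```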
@zhulinJulia24
The models' evaluation has to be performed, too. @zhulinJulia24
Do we really need to run the full evaluation? For example, could we verify basic correctness by comparing the results at temperature 0 against transformers, and only run the evaluation when necessary? Of course, this is just a suggestion. If resources are sufficient, quickly running an evaluation task should not be a problem.
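A hedged sketch of that kind of sanity check: greedy decoding with the PyTorch engine compared against transformers. The model path and prompt are placeholders, chat templating and dtype differences mean exact matches are not guaranteed, and the argument names follow lmdeploy's public pipeline API but may differ between versions:

```python
from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = 'internlm/internlm2-chat-7b'  # placeholder model
prompt = 'Explain what a KV cache is in one sentence.'

# Reference output: greedy decoding with transformers.
tok = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
ref_model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype='auto', device_map='cuda', trust_remote_code=True)
inputs = tok(prompt, return_tensors='pt').to(ref_model.device)
ref_ids = ref_model.generate(**inputs, do_sample=False, max_new_tokens=64)
ref_text = tok.decode(ref_ids[0, inputs['input_ids'].shape[1]:],
                      skip_special_tokens=True)

# Candidate output: the PyTorch engine with sampling disabled (top_k=1).
pipe = pipeline(model_path, backend_config=PytorchEngineConfig(tp=1))
out = pipe([prompt], gen_config=GenerationConfig(top_k=1, max_new_tokens=64))[0]

# Gross kernel or numerics regressions usually show up within the first few
# tokens; compare prefixes (or token ids) rather than requiring exact matches.
print('transformers  :', ref_text)
print('pytorch engine:', out.text)
```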
https://github.com/InternLM/lmdeploy/actions/runs/9600920652 @grimoire @lvhan028 All precision metrics are improved compared to https://github.com/zhulinJulia24/lmdeploy/actions/runs/9240064913, which is the 0.4.2 version's precision.
llama3-70b
lmdeploy serve api_server /mnt/models-new/llm_models/models--meta-llama--Meta-Llama-3-70B-Instruct/snapshots/0cac6d727e4cdf117e1bde11e4c7badd8b963919 --server-port 24555 --tp 4 --backend pytorch
concurrency: 256, first_token latency (min, max, ave): 2.923s, 371.669s, 29.348s, number of prompt tokens: 447592
concurrency: 128, first_token latency (min, max, ave): 0.465s, 17.343s, 3.991s, number of prompt tokens: 447592

mixtral-moe-7x8b
lmdeploy serve api_server /nvme/qa_test_models/mistralai/Mixtral-8x7B-Instruct-v0.1 --server-port 24555 --tp 2 --backend pytorch
concurrency: 128, first_token latency (min, max, ave): 1.912s, 20.649s, 3.140s, number of prompt tokens: 491513
concurrency: 256, first_token latency (min, max, ave): 0.248s, 236.570s, 19.633s, number of prompt tokens: 491513

internlm2-chat-20b
batch | num_prompts | RPS | RPM | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | throughput(out tok/s) | throughput(total tok/s)

meta-Llama-3-8B-Instruct
batch | num_prompts | RPS | RPM | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | throughput(out tok/s) | throughput(total tok/s)

internlm2-chat-20b and meta-Llama-3-8B-Instruct are consistent with the 0.4.2 baseline.
Compared to the previous torch engine, as shown in https://github.com/zhulinJulia24/lmdeploy/actions/runs/9240064913, the evaluation accuracy doesn't degrade, does it?
After internal discussion: this PR does not cause accuracy degradation compared to the previous version.
    importlib.import_module(f'{__name__}.{device_type}')
    assert device_type in loaded_utils
except ImportError:
    logger.debug('Failed to import device utils for '
It falls back to cuda. Should we make it a warning instead of debug?
Most kernels won't have a device-specific implementation, so a warning would be annoying.
And we will fall back to default instead of cuda in the future.
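A hedged sketch of that fallback behaviour, built around the lines quoted above (the helper name, the `loaded_utils` registry, and the `default` fallback are illustrative assumptions):

```python
import importlib
import logging

logger = logging.getLogger(__name__)

# Populated as a side effect of importing each device submodule, which
# registers its utils there (illustrative stand-in for the real registry).
loaded_utils = {}


def get_device_utils(device_type: str, fallback: str = 'cuda'):
    """Return utils for `device_type`, falling back when no module exists."""
    try:
        importlib.import_module(f'{__name__}.{device_type}')
        assert device_type in loaded_utils
        return loaded_utils[device_type]
    except (ImportError, AssertionError):
        # Most kernels have no device-specific implementation, so a missing
        # module is the common case: log at debug level rather than warning,
        # and return the fallback (planned to become `default`, i.e. the
        # pure-PyTorch implementations, rather than `cuda`).
        logger.debug('Failed to import device utils for %s, falling back '
                     'to %s', device_type, fallback)
        return loaded_utils.get(fallback)
```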
LGTM
pytorch/kernels/<device name>
pytorch/engine/devices/<device name>
pytorch/models/module_map.py
XXX_MODULE_MAP
PytorchEngineConfig
requirement
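The list above reads as the set of extension points to touch when wiring up a new backend. As a hedged sketch of the module-map and config items, assuming a hypothetical `ascend` backend (every key, path, and field name below is a placeholder, not lmdeploy's actual layout):

```python
# pytorch/models/module_map.py: a per-device map whose entries override the
# default MODULE_MAP when that device is selected (the XXX_MODULE_MAP item).
ASCEND_MODULE_MAP = {
    'transformers.models.llama.modeling_llama.LlamaAttention':
    'lmdeploy.pytorch.models.llama.AscendLlamaAttention',  # hypothetical
}

# PytorchEngineConfig: select the backend when building the pipeline
# (the `device_type` field is assumed here; check the current config class).
from lmdeploy import PytorchEngineConfig, pipeline

pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=PytorchEngineConfig(device_type='ascend'))
```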