Torch deepseek v2 #1621
Hi @grimoire, are the current performance benchmark results as expected, and how much of a lead is there compared to vLLM? Thanks. https://github.com/deepseek-ai/DeepSeek-V2?tab=readme-ov-file#inference-with-vllm-recommended
@zhyncs The latest profile result (256 concurrency, 3000 prompt, block_size=32, …): apart from the fact that the default value cannot be used for block_size, the rest is relatively acceptable. We have not benchmarked vLLM yet; 8 A100s are not always available (T T).
@grimoire Could you add unit tests (UT)?
Hi @grimoire, I used your commit to run the workflow at https://github.com/zhyncs/lmdeploy/actions/runs/9584655537 and obtained the whl https://github.com/zhyncs/dl/releases/tag/0620. And I encountered an error triton-lang/triton#4172. Do you have any ideas? Thanks!
Triton has a prepackaged ptxas, which might differ from your CUDA driver version. You can set your own ptxas via `TRITON_PTXAS_PATH`.

```bash
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
```

It works for me. Thanks and cheers. @grimoire
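The same override can also be applied from Python before Triton compiles anything (a minimal sketch; the path is an assumption, adjust it to your CUDA install):

```python
# Point Triton at the CUDA toolkit's ptxas instead of its prepackaged binary.
import os
os.environ.setdefault('TRITON_PTXAS_PATH', '/usr/local/cuda/bin/ptxas')

import triton  # imported after setting the variable so compilation picks it up
print(triton.__version__)
```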
Hi @grimoire, may I ask if this uses a single A100 card or 8 cards? Thanks.
LMDeploy, single A100:

```bash
# server
python3 -m lmdeploy serve api_server DeepSeek-V2-Lite --backend pytorch --cache-block-seq-len 32

# client
# https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py
python3 benchmark_serving.py --backend lmdeploy --host 127.0.0.1 --port 23333 --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --model DeepSeek-V2-Lite --tokenizer DeepSeek-V2-Lite --num-prompts 1000 --request-rate 128
```

Result with `ignore_eos` false:
```
============ Serving Benchmark Result ============
Successful requests:              1000
Benchmark duration (s):           154.05
Total input tokens:               236142
Total generated tokens:           148682
Request throughput (req/s):       6.49
Input token throughput (tok/s):   1532.88
Output token throughput (tok/s):  965.14
---------------Time to First Token----------------
Mean TTFT (ms):                   56583.14
Median TTFT (ms):                 55727.01
P99 TTFT (ms):                    113475.30
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                   116.80
Median TPOT (ms):                 90.45
P99 TPOT (ms):                    475.46
---------------Inter-token Latency----------------
Mean ITL (ms):                    77.64
Median ITL (ms):                  58.83
P99 ITL (ms):                     430.49
==================================================
```
Result with `ignore_eos` true:
```
============ Serving Benchmark Result ============
Successful requests:              1000
Benchmark duration (s):           181.48
Total input tokens:               236142
Total generated tokens:           215605
Request throughput (req/s):       5.51
Input token throughput (tok/s):   1301.17
Output token throughput (tok/s):  1188.01
---------------Time to First Token----------------
Mean TTFT (ms):                   65216.61
Median TTFT (ms):                 65241.68
P99 TTFT (ms):                    135946.04
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                   96.65
Median TPOT (ms):                 80.46
P99 TPOT (ms):                    267.16
---------------Inter-token Latency----------------
Mean ITL (ms):                    72.56
Median ITL (ms):                  60.45
P99 ITL (ms):                     372.03
==================================================
```
It is profiled with a single A100. The bottleneck of the Lite model is on the host side; TP would make it worse.
May we set `cache-block-seq-len 32` as the default for this model? Same client command as above:

```bash
python3 benchmark_serving.py --backend lmdeploy --host 127.0.0.1 --port 23333 --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --model DeepSeek-V2-Lite --tokenizer DeepSeek-V2-Lite --num-prompts 1000 --request-rate 128
```

With `cache-block-seq-len 32` and `ignore_eos` true:
```
============ Serving Benchmark Result ============
Successful requests:              1000
Benchmark duration (s):           181.48
Total input tokens:               236142
Total generated tokens:           215605
Request throughput (req/s):       5.51
Input token throughput (tok/s):   1301.17
Output token throughput (tok/s):  1188.01
---------------Time to First Token----------------
Mean TTFT (ms):                   65216.61
Median TTFT (ms):                 65241.68
P99 TTFT (ms):                    135946.04
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                   96.65
Median TPOT (ms):                 80.46
P99 TPOT (ms):                    267.16
---------------Inter-token Latency----------------
Mean ITL (ms):                    72.56
Median ITL (ms):                  60.45
P99 ITL (ms):                     372.03
==================================================
```
With the default `cache-block-seq-len` and `ignore_eos` true:
```
============ Serving Benchmark Result ============
Successful requests:              1000
Benchmark duration (s):           384.18
Total input tokens:               236142
Total generated tokens:           215594
Request throughput (req/s):       2.60
Input token throughput (tok/s):   614.67
Output token throughput (tok/s):  561.18
---------------Time to First Token----------------
Mean TTFT (ms):                   155387.50
Median TTFT (ms):                 153036.64
P99 TTFT (ms):                    328194.63
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                   196.80
Median TPOT (ms):                 181.42
P99 TPOT (ms):                    515.61
---------------Inter-token Latency----------------
Mean ITL (ms):                    163.64
Median ITL (ms):                  96.56
P99 ITL (ms):                     1542.84
==================================================
```
LGTM
"""adjust block_size.""" | ||
# TODO: support kernel with both large head dim and large block size. | ||
if model_config.k_head_dim >= 512 and cache_config.block_size > 32: | ||
cache_config.block_size = 32 |
Will this affect models other than DeepSeek v2?
Yes, the MHA kernel needs enough shared memory (smem) to cache the KV-cache block and the query block, so any model with such a large head_dim should be limited.
Among all the models the PyTorch engine supports, only DeepSeek-V2 with the MLA implementation meets the condition.
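As a rough illustration of that shared-memory pressure (the tile shapes, fp16 dtype, head dims, and budget below are back-of-envelope assumptions, not the kernel's exact accounting):

```python
# Why k_head_dim >= 512 forces block_size <= 32: the attention kernel stages a
# query tile plus one KV-cache block of keys and values in shared memory.
ELEM = 2                  # bytes per fp16 element
SMEM_BUDGET = 164 * 1024  # approx. opt-in shared memory per block on A100

def tiles_kb(block_m: int, block_n: int, k_dim: int = 576, v_dim: int = 512) -> float:
    """KB of smem for a query tile plus one KV block (assumed MLA-like dims)."""
    q = block_m * k_dim * ELEM  # query tile
    k = block_n * k_dim * ELEM  # keys of one KV-cache block
    v = block_n * v_dim * ELEM  # values of one KV-cache block
    return (q + k + v) / 1024

print(tiles_kb(64, 64))  # ~208 KB -> exceeds the budget
print(tiles_kb(64, 32))  # ~140 KB -> fits with block_size=32
```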
Hold on please, @grimoire @RunningLeon:

```bash
python3 -m lmdeploy serve api_server /workdir/DeepSeek-V2-Lite-Chat --backend pytorch

# running multiple times gives different results with temperature 0
python3 benchmark/profile_restful_api.py 127.0.0.1:23333 /workdir/DeepSeek-V2-Lite-Chat /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --model_name /workdir/DeepSeek-V2-Lite-Chat --num_prompts 1 --concurrency 1 --temperature 0
```
@zhyncs temperature=0 is an invalid value; I set it to 1 if temperature <= 0 in the PyTorch engine. See lmdeploy/pytorch/messages.py, line 75 in 9e8cb3c.
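A minimal sketch of that fallback (the helper name is hypothetical; the real check lives in `lmdeploy/pytorch/messages.py`):

```python
# Non-positive temperatures are replaced with 1.0 rather than rejected, which
# is why temperature=0 requests still sample stochastically.
def sanitize_temperature(temperature: float) -> float:
    return 1.0 if temperature <= 0 else temperature

assert sanitize_temperature(0.0) == 1.0   # temperature=0 silently becomes 1.0
assert sanitize_temperature(0.7) == 0.7   # valid values pass through
```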
@grimoire If temperature 0 is not supported, how can I get a deterministic answer?
TurboMind supports [0, 2] for temperature.
Just set top_k=1 or give a small enough temperature. Note that a small temperature might still lead to different results if two values in the logits are close.
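For example, a minimal sketch of the greedy route via the lmdeploy pipeline API (the model path is illustrative):

```python
# top_k=1 makes the sampler take the argmax token at every step, giving a
# deterministic completion regardless of the temperature fallback above.
from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline

pipe = pipeline('/workdir/DeepSeek-V2-Lite-Chat',
                backend_config=PytorchEngineConfig())
gen_config = GenerationConfig(top_k=1)  # greedy: no randomness in sampling
print(pipe(['Who are you?'], gen_config=gen_config))
```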
```diff
@@ -157,11 +179,16 @@ def __forward_hook(module, args, kwargs, output):
    target_args = args
    target_kwargs = kwargs
    target_output = output
    raise ExtractorFound()
```
why?
This tool is used to extract the input/output of a submodule (for debugging); the computation after that module is not needed, so the exception aborts the rest of the forward pass.
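A self-contained sketch of that pattern (the toy model is hypothetical; the hook and exception names mirror the diff):

```python
# Register a forward hook on the target submodule, capture its I/O, then raise
# a sentinel exception so every layer after it is skipped -- the remaining
# forward pass is irrelevant when only the submodule's tensors are wanted.
import torch

class ExtractorFound(Exception):
    """Sentinel: the target module ran and its input/output were captured."""

captured = {}

def __forward_hook(module, args, kwargs, output):
    captured['args'], captured['kwargs'], captured['output'] = args, kwargs, output
    raise ExtractorFound()

model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Linear(8, 2))
handle = model[0].register_forward_hook(__forward_hook, with_kwargs=True)
try:
    model(torch.randn(1, 4))
except ExtractorFound:
    pass  # expected: model[1] was never executed
finally:
    handle.remove()
print(captured['output'].shape)  # torch.Size([1, 8])
```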
```python
from lmdeploy.pytorch.engine.model_agent import StepContext
```

```python
if model_config is None:
```
Should we assert that `model_config` is not None here?
- `q_a_proj`, `kv_a_proj_with_mqa` in the attention layer and `gate` in the MoE layer are not distributed, so fewer NCCL ops are required, at the cost of memory (see the sketch after this list).
- `block_size=32` would have better performance.
- `cache_max_entry_count` and `max_prefill_token_num` might lead to OOM.
- Result of deepseek-v2-lite (WIP).
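A simplified sketch of that trade-off (layer sizes are illustrative, and the plain `nn.Linear` stand-ins below are not lmdeploy's actual parallel layers):

```python
# Replicating a small projection on every rank removes the NCCL collective its
# sharded version would need, at the cost of world_size full weight copies.
import torch.nn as nn

world_size = 8
hidden_size, q_lora_rank = 2048, 1536  # illustrative DeepSeek-V2-Lite-like dims

# replicated: each rank holds the full weight; output is complete, no NCCL op
q_a_proj = nn.Linear(hidden_size, q_lora_rank, bias=False)

# column-parallel alternative: each rank holds 1/world_size of the output dim,
# so the per-rank outputs must later be gathered/reduced across ranks
q_a_proj_shard = nn.Linear(hidden_size, q_lora_rank // world_size, bias=False)

extra = q_a_proj.weight.numel() - q_a_proj_shard.weight.numel()
print(f'extra fp16 bytes per rank when replicated: {extra * 2}')
```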
requirements: