support vl benchmark #1662

Open · wants to merge 5 commits into main
Conversation

AllentDan (Collaborator)

The first chart benchmarks the llava-v1.6-vicuna-7b model: completion tokens/s and first-token latency (FTL) after quantization. The second chart compares throughput before and after quantization.
[comparison charts: awq]

vody-am (Contributor) commented May 28, 2024

@AllentDan which command did you use to run this benchmark? Could you share it? Was this run on an A100? I am also curious about performance on lesser GPUs that are more widely available and cheaper (such as GPUs with 24 GB of VRAM), and I can test on those.

I have followed the instructions at https://github.com/InternLM/lmdeploy/tree/main/benchmark to get the ShareGPT dataset, so I have the data!

Another metric of interest would be how response time changes under load (if we increase requests per second, how much does the latency increase?)

AllentDan (Collaborator, Author) commented May 28, 2024

@vody-am yes, A100 card.

python benchmark/profile_restful_api.py http://0.0.0.0:23333 /nvme/shared/llava-v1.6-vicuna-7b-4bit ShareGPT_V3_unfiltered_cleaned_split.json --concurrency 16 --img_hw 512*512 --stream_output True --num_prompts 1000
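
For context, profile_restful_api.py benchmarks an api_server that is already listening at the given URL. A plausible way to launch the quantized server first (a sketch using standard lmdeploy flags; double-check against your installed version):

lmdeploy serve api_server /nvme/shared/llava-v1.6-vicuna-7b-4bit --model-format awq --server-port 23333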

vody-am (Contributor) commented May 28, 2024

OK, I am testing on A100, L4, and 4090 and will report back with numbers when completed.
For generating the charts, did you use a script or make them by hand? I can post back when the runs complete.

The non-quantized model runs successfully across all devices (impressively, the L4 managed not to fall over while serving requests).

EDIT: figured out the issues with the quantized model; I needed to properly install lmdeploy by building the wheel.

GPU: 4090
Model: llava-1.6-vicuna-7b

--------------------------------------------------
concurrency: 3
elapsed_time: 1829.352s

first_token latency(min, max, ave): 0.081s, 1.160s, 0.462s

number of prompt tokens: 248339
number of completion tokens: 240582
token throughput (completion token): 131.512 token/s
token throughput (prompt + completion token): 267.265 token/s
RPS (request per second): 0.547 req/s
RPM (request per minute): 32.798 req/min
--------------------------------------------------

GPU: 4090
Model: llava-1.6-vicuna-7b AWQ

--------------------------------------------------
concurrency: 3
elapsed_time: 1085.826s

first_token latency(min, max, ave): 0.074s, 1.131s, 0.469s

number of prompt tokens: 248339
number of completion tokens: 240582
token throughput (completion token): 221.566 token/s
token throughput (prompt + completion token): 450.276 token/s
RPS (request per second): 0.921 req/s
RPM (request per minute): 55.257 req/min
--------------------------------------------------

GPU: L4
Model: llava-1.6-vicuna-7b

--------------------------------------------------
concurrency: 2
elapsed_time: 8207.655s

first_token latency(min, max, ave): 0.209s, 2.419s, 1.209s

number of prompt tokens: 248339
number of completion tokens: 240582
token throughput (completion token): 29.312 token/s
token throughput (prompt + completion token): 59.569 token/s
RPS (request per second): 0.122 req/s
RPM (request per minute): 7.310 req/min
--------------------------------------------------

GPU: L4
Model: llava-1.6-vicuna-7b AWQ

--------------------------------------------------
concurrency: 2
elapsed_time: 4065.314s

first_token latency(min, max, ave): 0.149s, 2.142s, 1.070s

number of prompt tokens: 248339
number of completion tokens: 240582
token throughput (completion token): 59.179 token/s
token throughput (prompt + completion token): 120.266 token/s
RPS (request per second): 0.246 req/s
RPM (request per minute): 14.759 req/min
--------------------------------------------------

AllentDan (Collaborator, Author) commented May 29, 2024

Hi @vody-am, I used Excel to plot the charts.

vody-am (Contributor) commented May 29, 2024

Currently running the benchmark on an NVIDIA L4 with AWQ. With a concurrency value of 2, I'm seeing this:

[screenshot: benchmark progress]

FP16 was much slower; I saw an ETA of over 2 hours. I can let it run unattended if there's interest in collecting that data, but if not, no worries. I would personally like to be able to run quantized models on that hardware, as it's relatively cheap and plentiful.

Besides that, PR LGTM!

Just two more questions come to mind:

  1. This line, which constructs the image, creates one that is (I believe) all black. Maybe filling it with random data would produce a different result? That way each request has some variation in the image.

  2. While benchmarking, should one set vision_max_batch_size? (See the sketch after this list.)
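
Regarding question 2, a minimal sketch of capping the vision batch size via the Python API, assuming VisionConfig exposes a max_batch_size field as in the lmdeploy VLM docs (the value 8 is illustrative, not a recommendation):

from lmdeploy import pipeline, VisionConfig

# Limit how many images the vision encoder processes per forward pass.
pipe = pipeline('/nvme/shared/llava-v1.6-vicuna-7b-4bit',
                vision_config=VisionConfig(max_batch_size=8))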


from lmdeploy.vl.utils import encode_image_base64
h, w = [int(s) for s in img_hw.split('*')]
img = PIL.Image.new(mode='RGB', size=(w, h))
Review comment (Collaborator):
Please use random values for the pixels.
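
A minimal sketch of that suggestion, assuming numpy is available (variable names are illustrative):

import numpy as np
import PIL.Image

h, w = 512, 512
# Fill the benchmark image with random RGB pixels so each request carries
# a different image instead of an all-black one.
pixels = np.random.randint(0, 256, size=(h, w, 3), dtype=np.uint8)
img = PIL.Image.fromarray(pixels)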

import PIL

from lmdeploy.vl.utils import encode_image_base64
h, w = [int(s) for s in img_hw.split('*')]
Review comment (Collaborator):
Can we use "x" instead of "*"? "x" takes just one keypress, while "*" needs Shift+8.

RunningLeon mentioned this pull request on May 30, 2024.
AllentDan (Collaborator, Author)
I could not reproduce the error. @vody-am

vody-am (Contributor) commented May 31, 2024

@AllentDan I was too hasty in posting 😓 Thank you for checking; I will double-check next time. I believe it is an environment error on my part, because it works on some hosts but not on others. Thanks 🫡

vody-am (Contributor) commented May 31, 2024

OK, it turns out I was not too hasty; I believe it worked once for me due to the random sampling. The issue I ran into is specific to Qwen-VL, since its tokenizer treats <img> as a special token. Some of the completions sampled on line https://github.com/AllentDan/lmdeploy/blob/vl-bench/benchmark/profile_restful_api.py#L37 may contain <img> within them, e.g. there is at least one example with the following:

Typical inline elements in HTML include:

1. `<a>` - hyperlink
2. `<span>` - generic inline container
3. `<strong>` - strong importance
4. `<em>` - emphasized importance
5. `<img>` - image
6. `<input>` - input field
7. `<label>` - label for a form control
8. `<button>` - button
9. `<select>` - drop-down list
10. `<textarea>` - multi-line input field
11. `<small>` - smaller text
12. `<sup>` - superscript
13. `<sub>` - subscript

I got around this via completions = [c.replace('<img>', '<IMG>') for c in completions]

Otherwise the tokenizer raises "ValueError: Unclosed image token" from /home/user/.cache/huggingface/modules/transformers_modules/tokenization_qwen.py line 97.
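
A sketch of that workaround in context, assuming the standard ShareGPT JSON layout (each item holds a 'conversations' list whose turns have 'from' and 'value' fields):

import json

# Load the ShareGPT dump and collect the model-side turns as completions.
with open('ShareGPT_V3_unfiltered_cleaned_split.json') as f:
    dataset = json.load(f)
completions = [turn['value']
               for item in dataset
               for turn in item['conversations']
               if turn['from'] == 'gpt']

# Escape literal '<img>' so Qwen-VL's tokenizer does not treat it as an
# unclosed special image token.
completions = [c.replace('<img>', '<IMG>') for c in completions]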

AllentDan (Collaborator, Author)
@irexyc Do you have any ideas for handling the <img> encoding case?

irexyc (Collaborator) commented Jun 19, 2024

> Do you have any ideas for handling the <img> encoding case?

I think the unclosed <img> token issue is specific to Qwen-VL, and maybe we can ignore this situation for now.
