support vl benchmark #1662

Open · wants to merge 5 commits into main
Conversation

AllentDan (Collaborator)

The first chart benchmarks the llava-v1.6-vicuna-7b model: completion tokens/s and first-token latency (FTL) after quantization. The second chart compares throughput before and after quantization.
[comparison charts: awq]

vody-am (Contributor) commented May 28, 2024

@AllentDan which command did you use to run this benchmark? Could you share it? Was this run on an A100? I am also curious about performance on lesser GPUs that are more widely available and cheaper (such as GPUs with 24 GB of VRAM), and I can test on those.

I have followed the instructions at https://github.com/InternLM/lmdeploy/tree/main/benchmark to get the ShareGPT dataset, so I have the data!

Another metric of interest would be how response time changes under load (if we increase requests per second, how much does the latency increase?)

AllentDan (Collaborator, Author) commented May 28, 2024

@vody-am yes, A100 card.

python benchmark/profile_restful_api.py http://0.0.0.0:23333 /nvme/shared/llava-v1.6-vicuna-7b-4bit ShareGPT_V3_unfiltered_cleaned_split.json --concurrency 16 --img_hw 512*512 --stream_output True --num_prompts 1000
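
For context, profile_restful_api.py benchmarks an api_server that is already listening at the given URL. A plausible way to launch the quantized server first (a sketch using standard lmdeploy flags; double-check against your installed version):

lmdeploy serve api_server /nvme/shared/llava-v1.6-vicuna-7b-4bit --model-format awq --server-port 23333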

vody-am (Contributor) commented May 28, 2024

OK, I am testing on A100, L4, and 4090 and will report back with numbers when completed.
For generating the charts, did you use a script or make them by hand? I can post back when the runs complete.

The non-quantized model runs successfully across all devices (impressively, the L4 managed not to fall over while serving requests).

EDIT: figured out the issues with the quantized model; I needed to properly install lmdeploy by building the wheel.

GPU: 4090
Model: llava-1.6-vicuna-7b

--------------------------------------------------
concurrency: 3
elapsed_time: 1829.352s

first_token latency(min, max, ave): 0.081s, 1.160s, 0.462s

number of prompt tokens: 248339
number of completion tokens: 240582
token throughput (completion token): 131.512 token/s
token throughput (prompt + completion token): 267.265 token/s
RPS (request per second): 0.547 req/s
RPM (request per minute): 32.798 req/min
--------------------------------------------------

GPU: 4090
Model: llava-1.6-vicuna-7b AWQ

--------------------------------------------------
concurrency: 3
elapsed_time: 1085.826s

first_token latency(min, max, ave): 0.074s, 1.131s, 0.469s

number of prompt tokens: 248339
number of completion tokens: 240582
token throughput (completion token): 221.566 token/s
token throughput (prompt + completion token): 450.276 token/s
RPS (request per second): 0.921 req/s
RPM (request per minute): 55.257 req/min
--------------------------------------------------

GPU: L4
Model: llava-1.6-vicuna-7b

--------------------------------------------------
concurrency: 2
elapsed_time: 8207.655s

first_token latency(min, max, ave): 0.209s, 2.419s, 1.209s

number of prompt tokens: 248339
number of completion tokens: 240582
token throughput (completion token): 29.312 token/s
token throughput (prompt + completion token): 59.569 token/s
RPS (request per second): 0.122 req/s
RPM (request per minute): 7.310 req/min
--------------------------------------------------

GPU: L4
Model: llava-1.6-vicuna-7b AWQ

--------------------------------------------------
concurrency: 2
elapsed_time: 4065.314s

first_token latency(min, max, ave): 0.149s, 2.142s, 1.070s

number of prompt tokens: 248339
number of completion tokens: 240582
token throughput (completion token): 59.179 token/s
token throughput (prompt + completion token): 120.266 token/s
RPS (request per second): 0.246 req/s
RPM (request per minute): 14.759 req/min
--------------------------------------------------

AllentDan (Collaborator, Author) commented May 29, 2024

Hi @vody-am, I used Excel to plot the charts.

vody-am (Contributor) commented May 29, 2024

Currently running the benchmark on an NVIDIA L4 with AWQ. With a concurrency value of 2, I'm seeing this:

[screenshot: benchmark progress]

FP16 was much slower; I saw an ETA of over 2 hours. I can let it run unattended if there's interest in collecting that data, but if not, no worries. I would personally like to be able to run quantized models on that hardware, as it's relatively cheap and plentiful.

Besides that, PR LGTM!

Just two more questions come to mind:

  1. This line, which constructs the image, creates one that is (I believe) all black. Maybe filling it with random data would produce a different result? That way each request has some variation in the image.

  2. While benchmarking, should one set vision_max_batch_size? (See the sketch after this list.)
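
Regarding question 2, a minimal sketch of capping the vision batch size via the Python API, assuming VisionConfig exposes a max_batch_size field as in the lmdeploy VLM docs (the value 8 is illustrative, not a recommendation):

from lmdeploy import pipeline, VisionConfig

# Limit how many images the vision encoder processes per forward pass.
pipe = pipeline('/nvme/shared/llava-v1.6-vicuna-7b-4bit',
                vision_config=VisionConfig(max_batch_size=8))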


from lmdeploy.vl.utils import encode_image_base64
h, w = [int(s) for s in img_hw.split('*')]
img = PIL.Image.new(mode='RGB', size=(w, h))
Review comment (Collaborator):
Please use random values for the pixels.
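
A minimal sketch of that suggestion, assuming numpy is available (variable names are illustrative):

import numpy as np
import PIL.Image

h, w = 512, 512
# Fill the benchmark image with random RGB pixels so each request carries
# a different image instead of an all-black one.
pixels = np.random.randint(0, 256, size=(h, w, 3), dtype=np.uint8)
img = PIL.Image.fromarray(pixels)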

import PIL

from lmdeploy.vl.utils import encode_image_base64
h, w = [int(s) for s in img_hw.split('*')]
Review comment (Collaborator):
Can we use "x" instead of "*"? "x" takes just one keypress, while "*" needs Shift+8.

RunningLeon mentioned this pull request on May 30, 2024.
AllentDan (Collaborator, Author)
I could not reproduce the error. @vody-am

vody-am (Contributor) commented May 31, 2024

@AllentDan I was too hasty in posting 😓 Thank you for checking; I will double-check next time. I believe it is an environment error on my part, because it works on some hosts but not on others. Thanks 🫡

vody-am (Contributor) commented May 31, 2024

OK, it turns out I was not too hasty; I believe it worked once for me due to the random sampling. The issue I ran into is specific to Qwen-VL, since its tokenizer treats <img> as a special token. Some of the completions sampled on line https://github.com/AllentDan/lmdeploy/blob/vl-bench/benchmark/profile_restful_api.py#L37 may contain <img> within them, e.g. there is at least one example with the following:

Typical inline elements in HTML include:

1. `<a>` - hyperlink
2. `<span>` - generic inline container
3. `<strong>` - strong importance
4. `<em>` - emphasized importance
5. `<img>` - image
6. `<input>` - input field
7. `<label>` - label for a form control
8. `<button>` - button
9. `<select>` - drop-down list
10. `<textarea>` - multi-line input field
11. `<small>` - smaller text
12. `<sup>` - superscript
13. `<sub>` - subscript

I got around this via completions = [c.replace('<img>', '<IMG>') for c in completions]

Otherwise the tokenizer raises "ValueError: Unclosed image token" from /home/user/.cache/huggingface/modules/transformers_modules/tokenization_qwen.py line 97.
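
A sketch of that workaround in context, assuming the standard ShareGPT JSON layout (each item holds a 'conversations' list whose turns have 'from' and 'value' fields):

import json

# Load the ShareGPT dump and collect the model-side turns as completions.
with open('ShareGPT_V3_unfiltered_cleaned_split.json') as f:
    dataset = json.load(f)
completions = [turn['value']
               for item in dataset
               for turn in item['conversations']
               if turn['from'] == 'gpt']

# Escape literal '<img>' so Qwen-VL's tokenizer does not treat it as an
# unclosed special image token.
completions = [c.replace('<img>', '<IMG>') for c in completions]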

AllentDan (Collaborator, Author)
@irexyc Do you have any ideas for handling the <img> encoding case?

irexyc (Collaborator) commented Jun 19, 2024

> Do you have any ideas for handling the <img> encoding case?

I think the unclosed <img> token issue is specific to Qwen-VL, and maybe we can ignore this situation for now.
