LMDeploy Release V0.4.0

Highlights

Support for Llama3 and additional Vision-Language Models (VLMs):

This release adds support for Llama3 and more Vision-Language Models (VLMs): InternVL v1.1 and v1.2, MiniGemini, and InternLM-XComposer2.

Introduce online int4/int8 KV quantization and inference

Data-free online quantization
Supports all NVIDIA GPUs with the Volta architecture (sm70) and above
KV int8 quantization is almost lossless in accuracy, and KV int4 accuracy stays within an acceptable range
Efficient inference: with int8 and int4 KV quantization applied to llama2-7b, RPS improves by approximately 30% and 40% respectively compared to fp16
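
A minimal sketch of enabling online KV quantization from Python. It assumes the KV precision is chosen through the `quant_policy` field of `TurbomindEngineConfig` (`8` for int8, `4` for int4, `0` for no quantization); the helper names `quant_policy_for` and `build_pipeline` are hypothetical, and the model ID is only an example. Check the KV quantization docs of your installed version before relying on it.

```python
def quant_policy_for(kv_bits: int) -> int:
    """Map a KV-cache bit width to the assumed `quant_policy` value."""
    policies = {16: 0, 8: 8, 4: 4}  # 0 = leave the KV cache unquantized
    if kv_bits not in policies:
        raise ValueError(f"unsupported KV bit width: {kv_bits}")
    return policies[kv_bits]


def build_pipeline(model: str, kv_bits: int = 8):
    """Create a pipeline with online KV quantization (needs lmdeploy and a GPU)."""
    # Imported lazily so quant_policy_for() works without lmdeploy installed.
    from lmdeploy import TurbomindEngineConfig, pipeline

    engine_config = TurbomindEngineConfig(quant_policy=quant_policy_for(kv_bits))
    return pipeline(model, backend_config=engine_config)


# Example call (not executed here, it downloads a model and needs a GPU):
# pipe = build_pipeline("internlm/internlm2-chat-7b", kv_bits=8)
# print(pipe(["Hi, please introduce yourself"]))
```

Because the quantization is online and data-free, no calibration step or converted checkpoint is involved in this sketch; only the engine config changes.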

The following table shows the evaluation results of three LLMs at different KV numerical precisions:

-	-	-	llama2-7b-chat	-	-	internlm2-chat-7b	-	-	qwen1.5-7b-chat	-	-
dataset	version	metric	kv fp16	kv int8	kv int4	kv fp16	kv int8	kv int4	kv fp16	kv int8	kv int4
ceval	-	naive_average	28.42	27.96	27.58	60.45	60.88	60.28	70.56	70.49	68.62
mmlu	-	naive_average	35.64	35.58	34.79	63.91	64	62.36	61.48	61.56	60.65
triviaqa	2121ce	score	56.09	56.13	53.71	58.73	58.7	58.18	44.62	44.77	44.04
gsm8k	1d7fe4	accuracy	28.2	28.05	27.37	70.13	69.75	66.87	54.97	56.41	54.74
race-middle	9a54b6	accuracy	41.57	41.78	41.23	88.93	88.93	88.93	87.33	87.26	86.28
race-high	9a54b6	accuracy	39.65	39.77	40.77	85.33	85.31	84.62	82.53	82.59	82.02
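
To give some intuition for why int8 scores track fp16 more closely than int4 in the table above, here is a toy round-trip through asymmetric min/max quantization. This is an illustration under assumptions, not LMDeploy's kernel: the sample vector is made up, and the real implementation may use a different granularity and rounding scheme.

```python
def quantize_roundtrip(values, bits):
    """Asymmetric min/max quantization of one vector, then dequantization."""
    levels = (1 << bits) - 1            # 255 steps for int8, 15 for int4
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for constant vectors
    codes = [round((v - lo) / scale) for v in values]
    return [lo + c * scale for c in codes]


def max_abs_error(values, bits):
    """Largest absolute difference between a vector and its round-trip."""
    restored = quantize_roundtrip(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


# A made-up vector standing in for one token's key or value entries.
sample = [-1.30, -0.42, -0.05, 0.00, 0.17, 0.66, 0.91, 1.85]
err8 = max_abs_error(sample, 8)
err4 = max_abs_error(sample, 4)
print(f"int8 max error: {err8:.4f}, int4 max error: {err4:.4f}")
```

The round-trip error is bounded by half a quantization step, and the int4 step is 17 times larger than the int8 step for the same value range (255 levels versus 15), which is consistent with int4 showing the larger accuracy drift.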

The table below presents LMDeploy's inference performance with a quantized KV cache.

model	kv type	test settings	RPS	vs. kv fp16
llama2-chat-7b	fp16	tp1 / ratio 0.8 / bs 256 / prompts 10000	14.98	1.0
-	int8	tp1 / ratio 0.8 / bs 256 / prompts 10000	19.01	1.27
-	int4	tp1 / ratio 0.8 / bs 256 / prompts 10000	20.81	1.39
llama2-chat-13b	fp16	tp1 / ratio 0.9 / bs 128 / prompts 10000	8.55	1.0
-	int8	tp1 / ratio 0.9 / bs 256 / prompts 10000	10.96	1.28
-	int4	tp1 / ratio 0.9 / bs 256 / prompts 10000	11.91	1.39
internlm2-chat-7b	fp16	tp1 / ratio 0.8 / bs 256 / prompts 10000	24.13	1.0
-	int8	tp1 / ratio 0.8 / bs 256 / prompts 10000	25.28	1.05
-	int4	tp1 / ratio 0.8 / bs 256 / prompts 10000	25.80	1.07
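
The last column of the table is each row's RPS divided by the fp16 RPS of the same model. A short sketch that recomputes it from the RPS values above (the `rps` dictionary and the `speedups` helper are names chosen here for illustration):

```python
# RPS values copied from the table above: {model: {kv type: RPS}}
rps = {
    "llama2-chat-7b": {"fp16": 14.98, "int8": 19.01, "int4": 20.81},
    "llama2-chat-13b": {"fp16": 8.55, "int8": 10.96, "int4": 11.91},
    "internlm2-chat-7b": {"fp16": 24.13, "int8": 25.28, "int4": 25.80},
}


def speedups(by_kv_type):
    """Return RPS relative to the fp16 baseline, rounded to two decimals."""
    base = by_kv_type["fp16"]
    return {kv: round(value / base, 2) for kv, value in by_kv_type.items()}


for model, by_kv_type in rps.items():
    print(model, speedups(by_kv_type))
```

Note that the llama2-chat-13b fp16 baseline was measured at bs 128 while its quantized rows use bs 256, so that model's ratios compare slightly different settings.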

Support qwen1.5 in turbomind engine by @lvhan028 in #1406
Online 8/4-bit KV-cache quantization by @lzhangzz in #1377
Support qwen1.5-*-AWQ model inference in turbomind by @lvhan028 in #1430
support Internvl chat v1.1, v1.2 and v1.2-plus by @irexyc in #1425
support Internvl chat llava by @irexyc in #1426
Add llama3 chat template by @AllentDan in #1461
Support mini gemini llama by @AllentDan in #1438
add interactive api in service for VL models by @AllentDan in #1444
support output logprobs with turbomind backend. by @irexyc in #1391
support internlm-xcomposer2-7b & internlm-xcomposer2-4khd-7b by @irexyc in #1458
Add qwen1.5 awq quantization by @AllentDan in #1470

Reduce binary size, add sm_89 and sm_90 targets by @lzhangzz in #1383
Use new event loop instead of the current loop for pipeline by @AllentDan in #1352
Optimize inference of pytorch engine with tensor parallelism by @grimoire in #1397
add llava-v1.6-34b template by @irexyc in #1408
Initialize vl encoder first to avoid OOM by @AllentDan in #1434
Support model_name customization for api_server by @AllentDan in #1403
Expose dynamic split&fuse parameters by @lvhan028 in #1433
warning transformers version by @grimoire in #1453
Optimize apply_rotary kernel and remove useless inference_mode by @grimoire in #1457
set infinity timeout to nccl by @grimoire in #1465
Feat: format internlm2 chat template by @liujiangning30 in #1456

handle SIGTERM by @grimoire in #1389
fix chat cli ArgumentError error happened in python 3.11 by @RunningLeon in #1401
Fix llama_triton_example by @AllentDan in #1414
miss --trust-remote-code in converter, which is side effect brought by pr #1406 by @lvhan028 in #1420
fix sampling kernel by @grimoire in #1417
Fix loading single safetensor file error by @AllentDan in #1427
remove space in deepseek template by @grimoire in #1441
fix free repetition_penalty_workspace_ buffer by @irexyc in #1467
fix adapter failure when tp>1 by @grimoire in #1476
get model in advance to fix downloading from modelscope error by @irexyc in #1473
Fix the side effect in engine_instance brought by #1391 by @lvhan028 in #1480

Full Changelog: v0.3.0...v0.4.0