# Benchmark

## Table of Contents

- [Parameter Settings](#parameter-settings)
- [Quantization](#quantization)
- [Model Type & Max Length](#model-type--max-length)
- [Batch Size](#batch-size)
- [Use Flash Attn & Gradient Checkpointing](#use-flash-attn--gradient-checkpointing)
- [LoRA Rank & LoRA Target Modules](#lora-rank--lora-target-modules)
- [Gradient Accumulation Steps](#gradient-accumulation-steps)
- [Tuners](#tuners)
- [unsloth](#unsloth)
- [Export](#export)
- [AWQ](#awq)
- [AQLM](#aqlm)
- [Sequence Parallel](#sequence-parallel)

## Parameter Settings

Experimental environment:

- A100
- CUDA 11.8
- python 3.10
- torch 2.1.1
- flash_attn 2.3.4
- xformers 0.0.23
- auto_gptq 0.5.1
- bitsandbytes 0.41.3.post2

The following command-line settings are shared by all experiments:

```bash
    --dataset_test_ratio 0 \
    --dataset cls-fudan-news-zh \
    --save_strategy no \
    --check_dataset_strategy warning \
    --preprocess_num_proc 4 \
```

Unless otherwise specified, the following default values are used:

```bash
    --max_length 2048 \
    --batch_size 1 \
    --gradient_checkpointing true \
    --use_flash_attn true \
    --lora_rank 8 \
    --lora_target_modules DEFAULT \
    --quantization_bit 0 \
    --gradient_accumulation_steps 16 \
```

Token statistics of the corresponding test dataset (computed with the qwen tokenizer): 3234.4±2547.5, min=91, max=19548.

The experimental script can be found in `scripts/benchmark/test_memory_time/`.
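
Putting the shared settings and the defaults together, a single benchmark run looks roughly like the sketch below; the model type, `--sft_type`, and the swept parameter vary per experiment:

```bash
# Minimal sketch of one benchmark run assembled from the shared flags and defaults above.
# qwen-7b-chat and --sft_type lora stand in for whatever a given experiment sweeps.
swift sft \
    --model_type qwen-7b-chat \
    --sft_type lora \
    --dataset cls-fudan-news-zh \
    --dataset_test_ratio 0 \
    --save_strategy no \
    --check_dataset_strategy warning \
    --preprocess_num_proc 4 \
    --max_length 2048 \
    --batch_size 1 \
    --gradient_checkpointing true \
    --use_flash_attn true \
    --lora_rank 8 \
    --lora_target_modules DEFAULT \
    --quantization_bit 0 \
    --gradient_accumulation_steps 16
```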

## Quantization

The test script is:

```bash
swift sft \
    --model_type {MODEL_TYPE} \
    --quantization_bit {QUANTIZATION_BIT} \
    --sft_type lora \
    ...
```

| Model Type [LoRA] | Quantization | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | ------------ | -------------------------- | ---------------- |
| qwen-7b-chat | bf16 | 4.31 | 27.74 |
| | int4 (gptq) | 2.05 | 19.21 |
| | int8 (gptq) | 1.97 | 22.20 |
| | int4 (bnb) | 2.41 | 23.85 |
| qwen-14b-chat | bf16 | 2.60 | 40.14 |
| | int4 (gptq) | 1.15 | 23.30 |
| | int8 (gptq) | 1.08 | 29.13 |
| | int4 (bnb) | 1.36 | 30.05 |
| qwen-72b-chat | bf16 | 0.59 (2*A100) | 73.71+78.54 |
| | int4 (gptq) | 0.23 | 54.86 |
| | int8 (gptq) | 0.21 | 78.44 |
| | int4 (bnb) | 0.28 | 74.87 |

## Model Type & Max Length

### LoRA

The test script is:

```bash
swift sft \
    --model_type {MODEL_TYPE} \
    --max_length {MAX_LENGTH} \
    --sft_type lora \
    ...
```

| Model Type [LoRA] | Max Length | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | ---------- | -------------------------- | ---------------- |
| qwen-1_8b-chat | 512 | 9.88 | 6.99 |
| | 1024 | 9.90 | 10.71 |
| | 2048 | 8.77 | 16.35 |
| | 4096 | 5.92 | 23.80 |
| | 8192 | 4.19 | 37.03 |
| qwen-7b-chat | 512 | 7.43 | 18.01 |
| | 1024 | 6.51 | 21.73 |
| | 2048 | 4.31 | 27.74 |
| | 4096 | 2.05 | 35.31 |
| | 8192 | 1.34 | 48.41 |
| qwen-14b-chat | 512 | 5.63 | 30.14 |
| | 1024 | 4.36 | 34.43 |
| | 2048 | 2.60 | 40.14 |
| | 4096 | 1.17 | 47.95 |
| | 8192 | 0.79 | 60.74 |
| qwen-72b-chat (2*A100) | 512 | 1.41 | 67.68+73.07 |
| | 1024 | 1.02 | 70.25+77.11 |
| | 2048 | 0.59 | 73.71+78.54 |
| | 4096 | - | OOM |
| | 8192 | - | OOM |
| chatglm3-6b | 512 | 6.72 | 13.94 |
| | 1024 | 6.16 | 12.99 |
| | 2048 | 4.20 | 17.20 |
| | 4096 | 1.92 | 29.80 |
| | 8192 | 1.24 | 66.82 |
| yi-6b-chat | 512 | 5.27 | 13.72 |
| | 1024 | 5.07 | 15.44 |
| | 2048 | 3.84 | 16.95 |
| | 4096 | 1.99 | 28.25 |
| | 8192 | 1.35 | 43.81 |
| yi-34b-chat | 512 | 2.32 | 66.72 |
| | 1024 | 1.76 | 69.10 |
| | 2048 | 1.05 | 71.34 |
| | 4096 | 0.47 | 78.72 |
| | 8192 | 0.31 (2*A100) | 47.01+65.03 |
| openbuddy-zephyr-7b-chat | 512 | 5.17 | 14.99 |
| | 1024 | 3.92 | 16.57 |
| | 2048 | 3.08 | 19.89 |
| | 4096 | 1.85 | 23.29 |
| | 8192 | 0.92 | 52.14 |
| baichuan2-7b-chat | 512 | 6.09 | 18.18 |
| | 1024 | 5.36 | 17.45 |
| | 2048 | 3.43 | 19.18 |
| | 4096 | 1.69 | 34.22 |
| | 8192 | 1.16 | 45.47 |
| baichuan2-13b-chat | 512 | 5.32 | 31.01 |
| | 1024 | 3.91 | 31.58 |
| | 2048 | 1.77 | 32.40 |
| | 4096 | 0.65 | 49.63 |
| | 8192 | 0.36 | 76.17 |

### Full

The test script is:

```bash
swift sft \
    --model_type {MODEL_TYPE} \
    --max_length {MAX_LENGTH} \
    --sft_type full \
    ...
```

| Model Type [FULL] | Max Length | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | ---------- | -------------------------- | ---------------- |
| qwen-1_8b-chat | 512 | 10.77 | 18.16 |
| | 1024 | 10.39 | 18.62 |
| | 2048 | 8.73 | 35.11 |
| | 4096 | 5.45 | 31.62 |
| | 8192 | 3.81 | 38.93 |
| qwen-7b-chat | 512 | 5.96 | 73.37 |
| | 1024 | 5.00 | 73.64 |
| | 2048 | 3.30 | 74.26 |
| | 4096 | 1.64 | 78.76 |
| | 8192 | 1.11 (2*A100) | 61.34+73.00 |
| qwen-14b-chat (2*A100) | 512 | 3.66 | 60.42+72.31 |
| | 1024 | 2.98 | 60.61+74.37 |
| | 2048 | 1.93 | 60.70+78.22 |
| | 4096 | 0.92 | 75.59+78.64 |
| | 8192 | 0.62 | 76.59+77.68 |

## Batch Size

The test script is:

```bash
swift sft \
    --batch_size {BATCH_SIZE} \
    --model_type qwen-7b-chat \
    --sft_type lora \
    ...
```

| Model Type [LoRA] | Batch Size | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | ---------- | -------------------------- | ---------------- |
| qwen-7b-chat | 1 | 4.31 | 27.74 |
| | 2 | 3.60 | 43.11 |
| | 4 | 3.02 | 63.81 |
| | 8 | 2.77 | 76.14 |

## Use Flash Attn & Gradient Checkpointing

The test script is:

```bash
swift sft \
    --use_flash_attn {USE_FLASH_ATTN} \
    --gradient_checkpointing {GRADIENT_CHECKPOINTING} \
    --model_type qwen-7b-chat \
    --sft_type lora \
    ...
```

| Model Type [LoRA] | Use Flash Attn | Gradient Checkpointing | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | -------------- | ---------------------- | -------------------------- | ---------------- |
| qwen-7b-chat | ✔ | ✔ | 4.31 | 27.74 |
| | ✔ | ✘ | 6.19 | 37.70 |
| | ✘ | ✔ | 3.13 | 27.71 |
| | ✘ | ✘ | 4.45 | 57.67 |

## LoRA Rank & LoRA Target Modules

The test script is:

```bash
swift sft \
    --lora_rank {LORA_RANK} \
    --lora_target_modules {LORA_TARGET_MODULES} \
    --model_type qwen-7b-chat \
    --sft_type lora \
    ...
```

| Model Type [LoRA] | LoRA Rank | LoRA Target Modules | Training Speed (samples/s) | GPU Memory (GiB) | Trainable Params (M) |
| ----------------- | --------- | ------------------- | -------------------------- | ---------------- | -------------------- |
| qwen-7b-chat | 2 | DEFAULT (c_attn) | 4.27 | 27.72 | 1.05 |
| | 8 | DEFAULT | 4.31 | 27.74 | 4.19 |
| | 64 | DEFAULT | 4.19 | 27.85 | 33.55 |
| | 8 | ALL (all linear) | 3.22 | 27.87 | 17.89 |
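
The trainable-parameter counts behave as expected for LoRA: each adapted weight matrix adds roughly rank × (d_in + d_out) parameters, so the count grows linearly with the rank (4.19 M at rank 8 versus about 8 × 4.19 ≈ 33.5 M at rank 64) and with the number of target modules (ALL adapts every linear layer rather than only c_attn, hence 17.89 M at the same rank 8), while GPU memory is barely affected.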

## Gradient Accumulation Steps

The test script is:

```bash
swift sft \
    --gradient_accumulation_steps {GRADIENT_ACCUMULATION_STEPS} \
    --model_type qwen-7b-chat \
    --sft_type lora \
    ...
```

| Model Type [LoRA] | Gradient Accumulation Steps | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | --------------------------- | -------------------------- | ---------------- |
| qwen-7b-chat | 1 | 4.26 | 27.73 |
| | 2 | 4.32 | 27.74 |
| | 4 | 4.31 | 27.74 |
| | 8 | 4.32 | 27.74 |
| | 16 | 4.33 | 27.74 |
| | 32 | 4.30 | 27.74 |
| | 64 | 4.32 | 27.74 |
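
As the table shows, the number of gradient accumulation steps has essentially no effect on throughput or memory; it only controls how many micro-batches are summed before each optimizer update, so the effective batch size is batch_size × gradient_accumulation_steps (1 × 16 = 16 with the defaults above).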

## Tuners

| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| adalora | qwen-7b-chat | ms-agent | 2.0 | adalora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False | 26.8389(0.3464%) | True | True | lr=5e-05/epoch=2 | 32.55GiB | 0.92(87543 samples/95338.71 seconds) | 17.33(2345 tokens/135.29 seconds) | 0.57 | 1.07 | 0.391 | 0.665 | 0.569 |
| adapter | qwen-7b-chat | ms-agent | 2.0 | adapter | | 33.6896(0.4344%) | True | True | lr=5e-05/epoch=2 | 32.19GiB | 1.48(87543 samples/59067.71 seconds) | 26.63(4019 tokens/150.90 seconds) | 0.55 | 1.03 | 0.438 | 0.662 | 0.565 |
| dora | qwen-7b-chat | ms-agent | 2.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=True | 19.2512(0.2487%) | True | True | lr=5e-05/epoch=2 | 32.46GiB | 0.51(87543 samples/171110.54 seconds) | 4.29(2413 tokens/562.32 seconds) | 0.53 | 1.01 | 0.466 | 0.683 | 0.577 |
| full+galore128 | qwen-7b-chat | ms-agent | 2.0 | full | galore_rank=128/galore_per_parameter=false/galore_with_embedding=false | 7721.3245(100.0000%) | True | True | lr=5e-05/epoch=2 | 47.02GiB | 1.10(87543 samples/79481.96 seconds) | 28.96(2400 tokens/82.88 seconds) | 0.55 | 1.00 | 0.358 | 0.688 | 0.577 |
| full+galore32 | qwen-7b-chat | ms-agent | 2.0 | full | galore_rank=32/galore_per_parameter=false/galore_with_embedding=false | 7721.3245(100.0000%) | True | True | lr=5e-05/epoch=2 | 47.05GiB | 1.11(87543 samples/78989.74 seconds) | 29.17(2431 tokens/83.35 seconds) | 0.56 | 1.01 | 0.386 | 0.667 | 0.539 |
| full+galore64 | qwen-7b-chat | ms-agent | 2.0 | full | galore_rank=64/galore_per_parameter=false/galore_with_embedding=false | 7721.3245(100.0000%) | True | True | lr=5e-05/epoch=2 | 46.91GiB | 1.11(87543 samples/79200.36 seconds) | 28.94(2448 tokens/84.60 seconds) | 0.56 | 1.01 | 0.397 | 0.674 | 0.544 |
| full+galore_emb | qwen-7b-chat | ms-agent | 2.0 | full | galore_rank=128/galore_per_parameter=false/galore_with_embedding=true | 7721.3245(100.0000%) | True | True | lr=5e-05/epoch=2 | 44.53GiB | 1.10(87543 samples/79775.02 seconds) | 29.45(2433 tokens/82.62 seconds) | 0.55 | 1.00 | 0.398 | 0.670 | 0.568 |
| full+galore_perparam | qwen-7b-chat | ms-agent | 2.0 | full | galore_rank=128/galore_per_parameter=true/galore_with_embedding=false | 7721.3245(100.0000%) | True | True | lr=5e-05/epoch=2 | 47.02GiB | 1.25(87543 samples/69821.89 seconds) | 29.02(2478 tokens/85.39 seconds) | 0.54 | 1.00 | 0.372 | 0.669 | 0.524 |
| full+no_mix | qwen-7b-chat | ms-agent | 0.0 | full | | 7721.3245(100.0000%) | True | True | lr=5e-05/epoch=2 | 72.56GiB | 1.27(29698 samples/23356.97 seconds) | 30.31(11738 tokens/387.29 seconds) | 0.57 | 0.44 | 0.174 | 0.652 | 0.553 |
| full | qwen-7b-chat | ms-agent | 2.0 | full | | 7721.3245(100.0000%) | True | True | lr=5e-05/epoch=2 | 73.53GiB | 1.43(87543 samples/61022.97 seconds) | 29.51(3382 tokens/114.62 seconds) | 0.54 | 0.95 | 0.343 | 0.536 | 0.495 |
| llamapro | qwen-7b-chat | ms-agent | 2.0 | llamapro | num_blocks=4 | 809.5826(9.4900%) | True | True | lr=5e-05/epoch=2 | 38.11GiB | 1.53(87543 samples/57294.42 seconds) | 25.80(2374 tokens/92.02 seconds) | 0.53 | 1.00 | 0.434 | 0.645 | 0.357 |
| lora+ | qwen-7b-chat | ms-agent | 2.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=16.0/use_rslora=False/use_dora=False | 17.8913(0.2312%) | True | True | lr=5e-05/epoch=2 | 32.35GiB | 0.95(87543 samples/91923.80 seconds) | 18.81(3329 tokens/176.94 seconds) | 0.53 | 0.98 | 0.432 | 0.647 | 0.344 |
| lora+neftune | qwen-7b-chat | ms-agent | 2.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/neftune_noise_alpha=15.0 | 17.8913(0.2312%) | True | True | lr=5e-05/epoch=2 | 32.35GiB | 0.96(87543 samples/91525.50 seconds) | 19.84(161792 tokens/8156.02 seconds) | 0.53 | 1.02 | 0.456 | 0.671 | 0.401 |
| lora+no_mix | qwen-7b-chat | ms-agent | 0.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False | 17.8913(0.2312%) | True | True | lr=5e-05/epoch=2 | 30.86GiB | 0.91(29698 samples/32570.15 seconds) | 19.89(36308 tokens/1825.26 seconds) | 0.53 | 0.53 | 0.470 | 0.666 | 0.574 |
| lora | qwen-7b-chat | ms-agent | 2.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False | 17.8913(0.2312%) | True | True | lr=5e-05/epoch=2 | 32.35GiB | 0.95(87543 samples/91974.29 seconds) | 18.11(2415 tokens/133.32 seconds) | 0.53 | 1.01 | 0.462 | 0.676 | 0.304 |
| qwen-7b-chat-eval | qwen-7b-chat | None | 0.0 | None | | None(None) | | | | None | | 30.81(13765 tokens/446.83 seconds) | | | 0.517 | 0.679 | 0.568 |
| rslora | qwen-7b-chat | ms-agent | 2.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=True/use_dora=False | 17.8913(0.2312%) | True | True | lr=5e-05/epoch=2 | 32.35GiB | 0.94(87543 samples/92758.63 seconds) | 18.87(2762 tokens/146.34 seconds) | 0.53 | 0.99 | 0.451 | 0.679 | 0.339 |
| full+lisa_2 | qwen-7b-chat | ms-agent | 2.0 | full | lisa_activated_layers=2/lisa_step_interval=20 | - | True | True | lr=5e-05/epoch=2 | 31.11GiB | 2.66(76837 samples/28881.28 seconds) | 36.10(134469 tokens/3725.21 seconds) | 0.62 | 1.06 | 0.349 | 0.653 | 0.592 |
| full+lisa_4 | qwen-7b-chat | ms-agent | 2.0 | full | lisa_activated_layers=4/lisa_step_interval=20 | - | True | True | lr=5e-05/epoch=2 | 31.87GiB | 2.63(76837 samples/29215.15 seconds) | 36.75(135477 tokens/3686.17 seconds) | 0.63 | 1.06 | 0.377 | 0.656 | 0.607 |
| lora+packing+ddp | qwen-7b-chat | ms-agent | 2.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/packing=True | 17.8913(0.2312%) | True | True | lr=5e-05/epoch=2 | 35.65GiB*2 | 1.56(7900 samples/5057.30 seconds) | 26.20(421094 tokens/16073.09 seconds) | 0.63 | 0.98 | 0.473 | 0.664 | 0.552 |
| lora+packing+lazytokenize | qwen-7b-chat | ms-agent | 2.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/packing=True | 17.8913(0.2312%) | True | True | lr=5e-05/epoch=2 | 32.83GiB | 7.69(78237 samples/10179.40 seconds) | 25.86(307390 tokens/11888.17 seconds) | 0.63 | 1.04 | 0.472 | 0.660 | 0.554 |
| lora+packing | qwen-7b-chat | ms-agent | 2.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/packing=True | 17.8913(0.2312%) | True | True | lr=5e-05/epoch=2 | 28.06GiB | 0.79(7900 samples/10048.53 seconds) | 26.12(409507 tokens/15675.36 seconds) | 0.61 | 0.95 | 0.492 | 0.676 | 0.539 |
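
The tuner_params and hypers columns map onto `swift sft` flags. As an illustration, the dora row corresponds roughly to the sketch below; the flag names (in particular `--train_dataset_mix_ratio` for the ms-bench mix ratio) are assumptions inferred from the table and may differ between swift versions:

```bash
# Hypothetical reconstruction of the `dora` row above; flag names are assumptions
# inferred from the tuner_params/hypers columns, not taken from the original scripts.
swift sft \
    --model_type qwen-7b-chat \
    --dataset ms-agent \
    --train_dataset_mix_ratio 2.0 \
    --sft_type lora \
    --lora_rank 8 \
    --lora_target_modules ALL \
    --lora_alpha 32 \
    --use_dora true \
    --learning_rate 5e-5 \
    --num_train_epochs 2
```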

## unsloth

| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| unsloth+lora+q4 | llama3-8b-instruct | ms-agent | 2.0 | lora | | 4.7186(0.1038%) | True | True | lr=5e-05/epoch=2 | 21.69GiB | 1.76(76839 samples/43763.01 seconds) | 15.22(160885 tokens/10570.90 seconds) | 0.58 | 1.03 | 0.668 | 0.755 | 0.501 |
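
The unsloth run uses the same LoRA recipe but routes training through the unsloth backend with 4-bit quantization. A rough sketch is below; the `--tuner_backend unsloth` flag and the other names are assumptions, not the original script:

```bash
# Hypothetical sketch of the unsloth+lora+q4 row; flag names are assumptions.
swift sft \
    --model_type llama3-8b-instruct \
    --dataset ms-agent \
    --train_dataset_mix_ratio 2.0 \
    --sft_type lora \
    --tuner_backend unsloth \
    --quantization_bit 4 \
    --learning_rate 5e-5 \
    --num_train_epochs 2
```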

## Export

| exp_name | model_type | calibration dataset | quantization method | quantization bits | infer speed(tokens/s) | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| awq-ms-bench-mini | qwen-7b-chat | ms-bench-mini | awq | 4 | 27.25(16501 tokens/605.47 seconds) | 0.494 | 0.665 | 0.571 |
| awq-pileval | qwen-7b-chat | pileval | awq | 4 | 26.92(12994 tokens/482.72 seconds) | 0.497 | 0.675 | 0.577 |
| gptq-ms-bench-mini | qwen-7b-chat | ms-bench-mini | gptq | 4 | 31.16(15349 tokens/492.54 seconds) | 0.482 | 0.642 | 0.556 |
| gptq-pileval | qwen-7b-chat | pileval | gptq | 4 | 31.67(15185 tokens/479.54 seconds) | 0.478 | 0.654 | 0.559 |
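
These rows measure post-training quantization via `swift export` with the listed calibration dataset, followed by inference and evaluation of the quantized model. A sketch of the export step for the first row is below; the `--quant_method` / `--quant_bits` flag names are assumptions and may differ by version:

```bash
# Hypothetical sketch of the awq-ms-bench-mini export; flag names are assumptions.
swift export \
    --model_type qwen-7b-chat \
    --quant_method awq \
    --quant_bits 4 \
    --dataset ms-bench-mini
```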

## AWQ

| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen1half-7b-chat-awq | qwen1half-7b-chat-awq | ms-agent | 2.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False | 19.9885(1.5802%) | True | True | lr=5e-05/epoch=2 | 24.26GiB | 0.45(87543 samples/194746.58 seconds) | 16.08(2469 tokens/153.58 seconds) | 0.55 | 1.19 | 0.505 | 0.737 | 0.656 |

## AQLM

| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama2-7b-aqlm-2bit-1x16 | llama2-7b-aqlm-2bit-1x16 | dureader-robust-zh | 0.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False | 19.9885(1.6510%) | True | True | lr=5e-05/epoch=2 | 4.04GiB | 0.17(14994 samples/86140.71 seconds) | | 0.48 | 0.74 | | | |

## Sequence Parallel

| Model | Dataset | Hyper params | Total steps | Train speed | GPU memory |
| ----- | ------- | ------------ | ----------- | ----------- | ---------- |
| chatglm3-6b-32k | long-alpaca-12k (8055 tokens * 12000 rows) | gpu=2 / sequence_parallel_size=1 (2-GPU DDP baseline) | 5940 | 0.30 iter/s (5h13min total) | 27G*2 |
| | | gpu=2 / sequence_parallel_size=2 (2 GPUs, sequence parallel 2) | 11880 | 0.5 iter/s (6h total) | 20G*2 |
| | | gpu=4 / sequence_parallel_size=4 (4 GPUs, sequence parallel 4) | 11880 | 1 iter/s (3h20min total) | 18G*4 |
| | | gpu=4 / sequence_parallel_size=2 (4 GPUs, sequence parallel 2) | 5940 | 0.45 iter/s (3h total) | 21G*4 |
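
Sequence parallelism splits each long sequence across `sequence_parallel_size` GPUs, which is why per-GPU memory drops as the parallel size grows while the number of optimizer steps rises (the data-parallel world size shrinks correspondingly). A sketch of the 2-GPU, sequence_parallel_size=2 run is below; the launch convention and flag names are assumptions:

```bash
# Hypothetical sketch of the 2-GPU sequence-parallel run; names are assumptions.
CUDA_VISIBLE_DEVICES=0,1 \
NPROC_PER_NODE=2 \
swift sft \
    --model_type chatglm3-6b-32k \
    --dataset long-alpaca-12k \
    --sequence_parallel_size 2
```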