MiniCPM-V: inference output of a multi-GPU-trained model is inconsistent with single-GPU training #1191

Uooga · 2024-06-20T08:26:24Z

Describe the bug
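I fine-tuned MiniCPM-V with LoRA using multi-GPU data parallelism. At test time the outputs are almost identical to those of the original model without fine-tuning. The training loss decreased normally, yet even on training-set samples the outputs look as if they came from the un-fine-tuned model.
Could something be wrong with my infer command? I would appreciate it if someone could take a look.
P.S. The model trained on a single GPU produces the expected outputs.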

Training command:
nproc_per_node=8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=$nproc_per_node \
MASTER_PORT=29500 \
swift sft \
    --model_type minicpm-v-v2-chat \
    --dataset train_minicpm_v_2_0619.jsonl \
    --lora_target_modules ALL \
    --train_dataset_sample -1 \
    --num_train_epochs 8 \
    --ddp_find_unused_parameters True

Single-GPU training command:
CUDA_VISIBLE_DEVICES=1 swift sft --model_type minicpm-v-v2-chat --dataset train_minicpm_v_2_0619.jsonl --lora_target_modules ALL

Inference commands:
CUDA_VISIBLE_DEVICES=1 swift export --ckpt_dir output/minicpm-v-v2-chat/v3-20240619-204718/checkpoint-6200/ --merge_lora true
CUDA_VISIBLE_DEVICES=1 swift infer --ckpt_dir output/minicpm-v-v2-chat/v3-20240619-204718/checkpoint-6200-merged --load_dataset_config true --val_dataset val_minicpm_v_2_0619.jsonl --show_dataset_sample -1
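
One way to narrow down a symptom like this is to check whether the merged checkpoint differs from the base model at all. If no tensor has changed, the problem lies in what was saved or merged rather than in the infer command. The following is a minimal sketch of that comparison and is not part of swift. It assumes both sets of weights have already been loaded into plain dictionaries mapping tensor names to flattened lists of floats. The function name `changed_tensors` and the tensor names in the toy data are hypothetical.

```python
from typing import Dict, List, Sequence


def changed_tensors(
    base: Dict[str, Sequence[float]],
    tuned: Dict[str, Sequence[float]],
    tol: float = 0.0,
) -> List[str]:
    """Return the sorted names of shared tensors that differ by more than `tol`."""
    changed = []
    # Only compare tensors present in both checkpoints.
    for name in sorted(base.keys() & tuned.keys()):
        a, b = base[name], tuned[name]
        if len(a) != len(b) or any(abs(x - y) > tol for x, y in zip(a, b)):
            changed.append(name)
    return changed


if __name__ == "__main__":
    # Toy stand-ins for flattened weights (hypothetical tensor names).
    # Real checkpoints would be loaded with a loader matching their format.
    base = {"llm.q_proj": [0.1, 0.2], "vpm.fc1": [1.0, 2.0]}
    merged = {"llm.q_proj": [0.1, 0.25], "vpm.fc1": [1.0, 2.0]}
    print(changed_tensors(base, merged))
```

An empty result for a merged LoRA checkpoint would suggest that the adapter weights had no effect on the saved model, which would match the behaviour described above.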

tastelikefeet · 2024-06-21T07:04:18Z

fixed #1197

tastelikefeet · 2024-06-21T07:04:26Z

You will need to retrain the model.

Uooga · 2024-06-24T08:02:42Z

Hello, I trained with the latest commit and got a new error:

The versions I am currently using are as follows:

Jintao-Huang · 2024-06-28T13:04:00Z

It looks like this problem occurs whenever the vision encoder part is trained.

Jintao-Huang assigned Jintao-Huang and tastelikefeet Jun 25, 2024
