MiniCPM-V: inference results of a multi-GPU-trained model differ from the single-GPU-trained one #1191

Open
Uooga opened this issue Jun 20, 2024 · 4 comments

Uooga commented Jun 20, 2024

Describe the bug
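I fine-tuned MiniCPM-V with LoRA using multi-GPU data parallelism. At test time the outputs are almost identical to those of the original model without fine-tuning. The training loss decreased normally, yet even on samples from the training set the outputs look as if the model had never been fine-tuned.
Could something be wrong with my infer command? I would appreciate it if someone could take a look.
P.S. The model trained on a single GPU produces the expected output.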

Multi-GPU training command:
nproc_per_node=8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=$nproc_per_node \
MASTER_PORT=29500 \
swift sft \
--model_type minicpm-v-v2-chat \
--dataset train_minicpm_v_2_0619.jsonl \
--lora_target_modules ALL \
--train_dataset_sample -1 \
--num_train_epochs 8 \
--ddp_find_unused_parameters True \

Single-GPU training command:
CUDA_VISIBLE_DEVICES=1 swift sft --model_type minicpm-v-v2-chat --dataset train_minicpm_v_2_0619.jsonl --lora_target_modules ALL

Inference commands (merge the LoRA weights, then run infer):
CUDA_VISIBLE_DEVICES=1 swift export --ckpt_dir output/minicpm-v-v2-chat/v3-20240619-204718/checkpoint-6200/ --merge_lora true
CUDA_VISIBLE_DEVICES=1 swift infer --ckpt_dir output/minicpm-v-v2-chat/v3-20240619-204718/checkpoint-6200-merged --load_dataset_config true --val_dataset val_minicpm_v_2_0619.jsonl --show_dataset_sample -1
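
One way to narrow down whether the problem lies in the checkpoint rather than in the infer command is to look at the LoRA weights themselves. The sketch below is only an illustration under stated assumptions: it assumes the adapter follows the PEFT convention in which the `lora_B` matrices of linear layers are initialised to zeros, so a `lora_B` that is still all zeros after training means that adapter never received an update. The helper name `find_untrained_lora_modules` and the demo keys are hypothetical, and the function works on plain lists of floats so that it runs without any checkpoint.

```python
import math
from typing import Dict, Iterable, List


def find_untrained_lora_modules(
    state: Dict[str, Iterable[float]], tol: float = 0.0
) -> List[str]:
    """Return the names of lora_B entries whose L2 norm is <= tol.

    Assumption: PEFT-style LoRA, where lora_B of a linear layer starts at
    zero, so a zero norm after training suggests the adapter was not updated.
    """
    untouched = []
    for name, values in state.items():
        if "lora_B" not in name:
            continue  # only lora_B is informative; lora_A starts non-zero
        norm = math.sqrt(sum(float(v) ** 2 for v in values))
        if norm <= tol:
            untouched.append(name)
    return sorted(untouched)


if __name__ == "__main__":
    # Hypothetical, hand-made values standing in for flattened tensors.
    demo_state = {
        "layers.0.q_proj.lora_A.weight": [0.1, -0.2],
        "layers.0.q_proj.lora_B.weight": [0.0, 0.0],
        "layers.1.q_proj.lora_B.weight": [0.0, 0.3],
    }
    for name in find_untrained_lora_modules(demo_state):
        print("still zero:", name)
```

With a real checkpoint, `state` could be built from the saved adapter file, for example via `safetensors.torch.load_file(...)` followed by `tensor.flatten().tolist()` for each entry. Whether the checkpoint directory above stores its adapter in that format is an assumption that would need checking.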

@tastelikefeet
Collaborator

fixed #1197

@tastelikefeet
Collaborator

You will need to retrain the model.

Author

Uooga commented Jun 24, 2024

Hi, I trained with the latest commit and ran into a new error:
[image]
The versions I am currently using are:
[image]

@Jintao-Huang
Collaborator

It seems this problem shows up whenever the vision encoder part is trained.
