Describe the bug
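Setup: deepspeed-zero3, lora_target_modules ALL, model_type phi3-vision-128k-instruct, multi-node multi-GPU. When resuming from a checkpoint, the model apparently cannot be loaded. Note that the checkpoint folder contains only the LoRA-related parameters, yet the error shows the model trying to load additional parameters.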
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1708, in _inner_training_loop
deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)
File "/opt/conda/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 402, in deepspeed_load_checkpoint
load_path, _ = deepspeed_engine.load_checkpoint(
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2724, in load_checkpoint
load_path, client_states = self._load_checkpoint(load_dir,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2794, in _load_checkpoint
self.load_module_state_dict(checkpoint=checkpoint,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2587, in load_module_state_dict
self.module.load_state_dict(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
Missing key(s) in state_dict: "base_model.model.model.embed_tokens.weight", "base_model.model.model.vision_embed_tokens.glb_GN", "base_model.model.model.vision_embed_tokens.sub_GN", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.embeddings.class_embedding", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.embeddings.patch_embedding.weight", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.embeddings.position_embedding.weight", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.pre_layrnorm.weight", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.pre_layrnorm.bias", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.k_proj.base_layer.weight", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.k_proj.base_layer.bias", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.v_proj.base_layer.weight", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.v_proj.base_layer.bias", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.q_proj.base_layer.weight", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.q_proj.base_layer.bias", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.out_proj.base_layer.weight", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.out_proj.base_layer.bias", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.layer_norm1.weight", "base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.layer_norm1.bias", 
"base_model.model.model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.mlp.fc1.base_layer.weight", (remaining keys omitted)
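To narrow this down, it may help to confirm that every missing key belongs to the frozen base model rather than to the adapter. The sketch below is only a diagnostic under stated assumptions: `split_missing_keys` is a hypothetical helper, the `lora_` substring is assumed to mark adapter parameters, and the `lora_A` key in the example is made up for illustration (the two base keys are copied from the traceback above).

```python
def split_missing_keys(model_keys, checkpoint_keys, adapter_marker="lora_"):
    """Return (missing_adapter, missing_base): keys the model expects
    that the checkpoint does not contain, split by whether the key name
    contains adapter_marker (assumed to identify LoRA parameters)."""
    saved = set(checkpoint_keys)
    missing = [k for k in model_keys if k not in saved]
    missing_adapter = [k for k in missing if adapter_marker in k]
    missing_base = [k for k in missing if adapter_marker not in k]
    return missing_adapter, missing_base


# Keys the wrapped model expects. The first two appear in the traceback;
# the lora_A key is a hypothetical example of an adapter parameter name.
prefix = "base_model.model.model."
k_proj = prefix + ("vision_embed_tokens.img_processor.vision_model."
                   "encoder.layers.0.self_attn.k_proj.")
model_keys = [
    prefix + "embed_tokens.weight",
    k_proj + "base_layer.weight",
    k_proj + "lora_A.default.weight",
]

# An adapter-only checkpoint, as described in the report.
checkpoint_keys = [k_proj + "lora_A.default.weight"]

missing_adapter, missing_base = split_missing_keys(model_keys, checkpoint_keys)
print("missing adapter keys:", missing_adapter)
print("missing base keys:", missing_base)
```

If only base-model keys come back as missing, that would be consistent with a strict `load_state_dict` call being run against an adapter-only checkpoint. DeepSpeed's `load_checkpoint` accepts a `load_module_strict` argument; whether passing `False` there is the right fix for this setup is an assumption that would need to be verified.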