OOM during LoRA fine-tuning on a single RTX 3090 Ti #228

Closed
1 of 2 tasks
RyanCcc114 opened this issue Jun 24, 2024 · 15 comments
Comments

@RyanCcc114

System Info

torch 2.1.0; hardware: a single RTX 3090 Ti

lora.yaml

training_args:
  # see transformers.Seq2SeqTrainingArguments
  output_dir: ./output
  max_steps: 27000
  # needed to be fit for the dataset
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 2
  evaluation_strategy: steps
  eval_steps: 500
  # settings for optimizer
  # adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see transformers.GenerationConfig
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  # deepspeed: ds_zero_2.json
peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Reproduction

When training on a dataset of 9,000 examples, I ran into an out-of-memory error.

OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 MiB. GPU 0 has a total capacty of 23.99 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in
use. Of the allocated memory 23.01 GiB is allocated by PyTorch, and 195.10 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid
fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
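
(Aside, not something from the thread itself: the last sentence of the error is the allocator's own hint. A minimal sketch of acting on it, assuming the training entry point is plain Python and CUDA has not been initialized yet; the value 128 is an arbitrary example.)

# Illustrative only: set the allocator option the error message mentions.
# It must be in the environment before the first CUDA allocation, so set it
# before importing torch, or export it in the shell before launching finetune.py.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after the env var on purpose

print(torch.cuda.is_available())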

Expected behavior

Fine-tuning should run normally.

@zRzRzRzRzRzRzR zRzRzRzRzRzRzR self-assigned this Jun 24, 2024
@zRzRzRzRzRzRzR
Collaborator

Can you run the job without DeepSpeed directly, or does it error out either way?

@RyanCcc114
Author

Can you run the job without DeepSpeed directly, or does it error out either way?

Running the job without DeepSpeed also errors out; I'm fine-tuning in a WSL environment.
What's odd is that the official fine-tuning script runs out of GPU memory, while fine-tuning with llama-factory does not.

@zRzRzRzRzRzRzR
Collaborator

Have you updated to the latest fine-tuning code? The old code could indeed run out of GPU memory.
I haven't tested on WSL; I develop purely on Linux.

@RyanCcc114
Author

After updating to the latest fine-tuning code, the loss stays at 0 once training starts.

@zRzRzRzRzRzRzR
Collaborator

zRzRzRzRzRzRzR commented Jun 26, 2024

Make sure you are fine-tuning in BF16 precision.

@Yang-125

Make sure you are fine-tuning in BF16 precision.

Where exactly do I set BF16 precision for fine-tuning?

@RyanCcc114
Author

RyanCcc114 commented Jun 26, 2024

Make sure you are fine-tuning in BF16 precision.

I've already set bf16: true in the lora config file.

lora.yaml

data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: dev.jsonl
  num_proc: 1
max_input_length: 512
max_output_length: 512
training_args:
  # see transformers.Seq2SeqTrainingArguments
  output_dir: ./output
  max_steps: 3000
  # needed to be fit for the dataset
  bf16: true
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 4
  evaluation_strategy: steps
  eval_steps: 500
  # settings for optimizer
  # adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see transformers.GenerationConfig
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  # deepspeed: ds_zero_2.json
peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1

@Yang-125

Make sure you are fine-tuning in BF16 precision.

I've already set bf16: true in the lora config file.

data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: dev.jsonl
  num_proc: 1
max_input_length: 512
max_output_length: 512
training_args:
  # see transformers.Seq2SeqTrainingArguments
  output_dir: ./output
  max_steps: 3000
  # needed to be fit for the dataset
  bf16: true
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 4
  evaluation_strategy: steps
  eval_steps: 500
  # settings for optimizer
  # adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see transformers.GenerationConfig
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  deepspeed: /home/yqx/workspace/Compared/GLM-4/finetune_demo/configs/ds_zero_2.json
peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1

In your config the deepspeed entry is commented out; does it need to be enabled?
Also, after I enabled it on my side, the loss still always comes out as 0.0.

@zRzRzRzRzRzRzR
Collaborator

zRzRzRzRzRzRzR commented Jun 26, 2024

Yes, it needs to be enabled. Could you share a screenshot of the dataset-loading output?

@Yang-125

Yang-125 commented Jun 26, 2024

Yes, it needs to be enabled. Could you share a screenshot of the dataset-loading output?

Sure!
[screenshot attachment: 截屏2024-06-26 15.41.38.png]

@RyanCcc114
Author

With the same fine-tuning config file, the old fine-tuning code runs out of GPU memory, while the new code gives a loss of 0.
[screenshot attachment: Snipaste_2024-06-26_15-39-31]

@zRzRzRzRzRzRzR
Collaborator

Is your dataset content actually being recognized correctly? Before starting fine-tuning, I suggest you check the label portion after applying the chat template.
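
(For illustration, a minimal version of that check, not code from the repo: load the tokenizer, apply the chat template to one sample, and decode what comes back so you can see whether the assistant reply, i.e. the future labels, is present. The model path and the sample conversation are placeholders.)

# Hypothetical sketch of the suggested check.
from transformers import AutoTokenizer

MODEL_PATH = "THUDM/glm-4-9b-chat"  # placeholder: use your local model path
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

conversation = [
    {"role": "user", "content": "你好"},
    {"role": "assistant", "content": "你好,有什么可以帮你的吗?"},
]

out = tokenizer.apply_chat_template(conversation, tokenize=True, return_dict=False)
# Depending on the tokenizer version this can be a flat list of ids or a nested
# list of sequences, which is exactly what the fix further down deals with.
ids = out[0] if out and isinstance(out[0], list) else out
print(tokenizer.decode(ids))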

@RyanCcc114
Author

After some debugging, I found the error happens when building input_ids.
In the process_batch function of finetune.py, changing new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[2:] to new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[0][2:] lets the inputs and labels load correctly; the same change also works for process_batch_eval.
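
(The same change written out for readability, as a before/after of the line inside process_batch in finetune.py described above; process_batch_eval gets the identical edit.)

# before: indexes into the returned value as if it were a flat list of ids
new_input_ids = tokenizer.apply_chat_template(
    [message], tokenize=True, return_dict=False
)[2:]

# after: take the first (and only) tokenized sequence, then drop its two
# leading special tokens
new_input_ids = tokenizer.apply_chat_template(
    [message], tokenize=True, return_dict=False
)[0][2:]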

@RyanCcc114
Author

But after the change I still hit the OOM problem 😂

After some debugging, I found the error happens when building input_ids. In the process_batch function of finetune.py, changing new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[2:] to new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[0][2:] lets the inputs and labels load correctly; the same change also works for process_batch_eval.

@Yang-125

But after the change I still hit the OOM problem 😂

After some debugging, I found the error happens when building input_ids. In the process_batch function of finetune.py, changing new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[2:] to new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[0][2:] lets the inputs and labels load correctly; the same change also works for process_batch_eval.

Thanks a lot, the inputs and labels load correctly now. The OOM was probably caused by the dataset being too large together with the number of training epochs. On my side, multi-GPU fine-tuning also failed because one of the cards did not have enough free memory, but it ran successfully once I excluded that card.
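
(A minimal illustration of excluding one card, not taken from the thread; the device indices are assumptions for a 4-GPU machine where index 2 is the busy one. The variable has to be set before CUDA is initialized, or exported in the shell before launching the training script.)

# Hypothetical example: hide GPU 2 so the training process only sees the others.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,3"  # assumed layout: skip the busy card 2

import torch  # imported after the env var so the remapping takes effect

print(torch.cuda.device_count())  # reports 3 visible devices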
