OOM during LoRA fine-tuning on a single RTX 3090 Ti #228

Closed
1 of 2 tasks
RyanCcc114 opened this issue Jun 24, 2024 · 15 comments
Comments

@RyanCcc114

System Info

torch 2.1.0; hardware: a single RTX 3090 Ti

lora.yaml

training_args:
  # see transformers.Seq2SeqTrainingArguments
  output_dir: ./output
  max_steps: 27000
  # needed to be fit for the dataset
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 2
  evaluation_strategy: steps
  eval_steps: 500
  # settings for optimizer
  # adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see transformers.GenerationConfig
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  # deepspeed: ds_zero_2.json
peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Reproduction

When training on a dataset of 9,000 examples, I ran into an out-of-memory error.

OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 MiB. GPU 0 has a total capacty of 23.99 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in
use. Of the allocated memory 23.01 GiB is allocated by PyTorch, and 195.10 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid
fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
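
(Aside, not something from the thread itself: the last sentence of the error is the allocator's own hint. A minimal sketch of acting on it, assuming the training entry point is plain Python and CUDA has not been initialized yet; the value 128 is an arbitrary example.)

# Illustrative only: set the allocator option the error message mentions.
# It must be in the environment before the first CUDA allocation, so set it
# before importing torch, or export it in the shell before launching finetune.py.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after the env var on purpose

print(torch.cuda.is_available())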

Expected behavior

Fine-tuning should run normally.

@zRzRzRzRzRzRzR zRzRzRzRzRzRzR self-assigned this Jun 24, 2024
@zRzRzRzRzRzRzR
Collaborator

Can you run the job without DeepSpeed directly, or does it error out either way?

@RyanCcc114
Author

Can you run the job without DeepSpeed directly, or does it error out either way?

Running the job without DeepSpeed also errors out; I'm fine-tuning in a WSL environment.
What's odd is that the official fine-tuning script runs out of GPU memory, while fine-tuning with llama-factory does not.

@zRzRzRzRzRzRzR
Collaborator

Have you updated to the latest fine-tuning code? The old code could indeed run out of GPU memory.
I haven't tested on WSL; I develop purely on Linux.

@RyanCcc114
Author

After updating to the latest fine-tuning code, the loss stays at 0 once training starts.

@zRzRzRzRzRzRzR
Collaborator

zRzRzRzRzRzRzR commented Jun 26, 2024

Make sure you are fine-tuning in BF16 precision.

@Yang-125

Make sure you are fine-tuning in BF16 precision.

Where exactly do I set BF16 precision for fine-tuning?

@RyanCcc114
Author

RyanCcc114 commented Jun 26, 2024

Make sure you are fine-tuning in BF16 precision.

I've already set bf16: true in the lora config file.

lora.yaml

data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: dev.jsonl
  num_proc: 1
max_input_length: 512
max_output_length: 512
training_args:
  # see transformers.Seq2SeqTrainingArguments
  output_dir: ./output
  max_steps: 3000
  # needed to be fit for the dataset
  bf16: true
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 4
  evaluation_strategy: steps
  eval_steps: 500
  # settings for optimizer
  # adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see transformers.GenerationConfig
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  # deepspeed: ds_zero_2.json
peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1

@Yang-125

Make sure you are fine-tuning in BF16 precision.

I've already set bf16: true in the lora config file.

data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: dev.jsonl
  num_proc: 1
max_input_length: 512
max_output_length: 512
training_args:
  # see transformers.Seq2SeqTrainingArguments
  output_dir: ./output
  max_steps: 3000
  # needed to be fit for the dataset
  bf16: true
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 4
  evaluation_strategy: steps
  eval_steps: 500
  # settings for optimizer
  # adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see transformers.GenerationConfig
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  deepspeed: /home/yqx/workspace/Compared/GLM-4/finetune_demo/configs/ds_zero_2.json
peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1

In your config the deepspeed entry is commented out; does it need to be enabled?
Also, after I enabled it on my side, the loss still always comes out as 0.0.

@zRzRzRzRzRzRzR
Collaborator

zRzRzRzRzRzRzR commented Jun 26, 2024

Yes, it needs to be enabled. Could you share a screenshot of the dataset-loading output?

@Yang-125

Yang-125 commented Jun 26, 2024

Yes, it needs to be enabled. Could you share a screenshot of the dataset-loading output?

Sure!
[screenshot attachment: 截屏2024-06-26 15.41.38.png]

@RyanCcc114
Author

With the same fine-tuning config file, the old fine-tuning code runs out of GPU memory, while the new code gives a loss of 0.
[screenshot attachment: Snipaste_2024-06-26_15-39-31]

@zRzRzRzRzRzRzR
Collaborator

Is your dataset content actually being recognized correctly? Before starting fine-tuning, I suggest you check the label portion after applying the chat template.
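
(For illustration, a minimal version of that check, not code from the repo: load the tokenizer, apply the chat template to one sample, and decode what comes back so you can see whether the assistant reply, i.e. the future labels, is present. The model path and the sample conversation are placeholders.)

# Hypothetical sketch of the suggested check.
from transformers import AutoTokenizer

MODEL_PATH = "THUDM/glm-4-9b-chat"  # placeholder: use your local model path
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

conversation = [
    {"role": "user", "content": "你好"},
    {"role": "assistant", "content": "你好,有什么可以帮你的吗?"},
]

out = tokenizer.apply_chat_template(conversation, tokenize=True, return_dict=False)
# Depending on the tokenizer version this can be a flat list of ids or a nested
# list of sequences, which is exactly what the fix further down deals with.
ids = out[0] if out and isinstance(out[0], list) else out
print(tokenizer.decode(ids))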

@RyanCcc114
Author

After some debugging, I found the error happens when building input_ids.
In the process_batch function of finetune.py, changing new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[2:] to new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[0][2:] lets the inputs and labels load correctly; the same change also works for process_batch_eval.
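
(The same change written out for readability, as a before/after of the line inside process_batch in finetune.py described above; process_batch_eval gets the identical edit.)

# before: indexes into the returned value as if it were a flat list of ids
new_input_ids = tokenizer.apply_chat_template(
    [message], tokenize=True, return_dict=False
)[2:]

# after: take the first (and only) tokenized sequence, then drop its two
# leading special tokens
new_input_ids = tokenizer.apply_chat_template(
    [message], tokenize=True, return_dict=False
)[0][2:]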

@RyanCcc114
Author

But after the change I still hit the OOM problem 😂

After some debugging, I found the error happens when building input_ids. In the process_batch function of finetune.py, changing new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[2:] to new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[0][2:] lets the inputs and labels load correctly; the same change also works for process_batch_eval.

@Yang-125

But after the change I still hit the OOM problem 😂

After some debugging, I found the error happens when building input_ids. In the process_batch function of finetune.py, changing new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[2:] to new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[0][2:] lets the inputs and labels load correctly; the same change also works for process_batch_eval.

Thanks a lot, the inputs and labels load correctly now. The OOM was probably caused by the dataset being too large together with the number of training epochs. On my side, multi-GPU fine-tuning also failed because one of the cards did not have enough free memory, but it ran successfully once I excluded that card.
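
(A minimal illustration of excluding one card, not taken from the thread; the device indices are assumptions for a 4-GPU machine where index 2 is the busy one. The variable has to be set before CUDA is initialized, or exported in the shell before launching the training script.)

# Hypothetical example: hide GPU 2 so the training process only sees the others.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,3"  # assumed layout: skip the busy card 2

import torch  # imported after the env var so the remapping takes effect

print(torch.cuda.device_count())  # reports 3 visible devices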
