
Data loading issue #17

Open
MonkeyTB opened this issue Apr 13, 2023 · 15 comments
Labels
wontfix This will not be worked on

Comments

@MonkeyTB

self.examples = dataset["input_ids"]

Hi, a quick question: after reading the data this way, chatglm_model.py lines 243-245 show that the loaded data is empty. How should I interpret this?

@MonkeyTB
Author

To add: training fails with an error saying the data is empty.

@shibing624
Owner

Did you not download the ADGEN dataset?

@MonkeyTB
Author

I switched to the code updated today and the data problem is gone. I'm puzzled; from the code it looks like a filter step was simply missing (a sketch of that kind of filter is below).
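For reference, a minimal sketch of the kind of filter the updated code presumably adds (hypothetical call; assumes a Hugging Face `datasets.Dataset` named `dataset` with an `input_ids` column):

```python
# Drop examples whose tokenization produced no input_ids, so that
# self.examples = dataset["input_ids"] is never empty downstream.
dataset = dataset.filter(
    lambda example: example.get("input_ids") and len(example["input_ids"]) > 0
)
```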
There is also a very strange problem:
```
2023-04-13 11:51:18.354 | INFO | chatglm.chatglm_model:train_model:283 - Training/evaluation parameters TrainingArguments( _n_gpu=3, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=True, bf16=False, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, do_eval=False, do_predict=False, do_train=False, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=True, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=False, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=<HUB_TOKEN>, ignore_data_skip=False, include_inputs_for_metrics=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=0.0002, length_column_name=length, load_best_model_at_end=False, local_rank=-1, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=./result//logs, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=50, logging_strategy=steps, lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=1, optim=adamw_torch, optim_args=None, output_dir=./result/, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=<PUSH_TO_HUB_TOKEN>, ray_scope=last, remove_unused_columns=False, report_to=['tensorboard', 'wandb'], resume_from_checkpoint=None, run_name=./result/, save_on_each_node=False, save_steps=400, save_strategy=steps, save_total_limit=3, seed=42, sharded_ddp=[], skip_memory_metrics=True, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, xpu_backend=None, )
2023-04-13 11:51:18.501 | INFO | chatglm.chatglm_model:train_model:297 - *** Train ***
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 0
wandb: WARNING Invalid choice
wandb: Enter your choice:
```

  1. Why does it report _n_gpu=3? I searched every config and assignment and could not find where 3 is set; the config file I checked says 1.
  2. What is wandb? It makes me type at the "Enter your choice" prompt, and after several (random) inputs the following link appears:
    wandb: You chose 'Create a W&B account'
    wandb: Create an account here: https://wandb.ai/authorize?signup=true
    wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

@shibing624
Owner

Then just use the latest code. wandb is only for training-run logging; you can ignore it.

@MonkeyTB
Author

```
2023-04-13 12:23:01.014 | INFO | chatglm.chatglm_model:train_model:297 - *** Train ***
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: e091e352ec72db11655f6fa7dcfd6d4a7b83xxxx
wandb: WARNING Invalid choice
wandb: Enter your choice: glm
wandb: WARNING Invalid choice
wandb: Enter your choice: 111
wandb: WARNING Invalid choice
wandb: Enter your choice: 0
wandb: WARNING Invalid choice
wandb: Enter your choice: 1
wandb: You chose 'Create a W&B account'
wandb: Create an account here: https://wandb.ai/authorize?signup=true
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: ERROR API key must be 40 characters long, yours was 1
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:
```

This can't simply be ignored: training does not proceed. I even commented out `import wandb`, but this prompt still pops up and forces me to type something.

@MonkeyTB
Author

I registered an account and entered the 40-character key; it still does not work 😓

@shibing624
Copy link
Owner

export WANDB_MODE=offline
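In case the shell export is inconvenient, a hedged alternative is to set the variable from Python before training starts (assumes the script goes through the standard Hugging Face / wandb environment handling):

```python
import os

# Same effect as `export WANDB_MODE=offline`: wandb logs locally and never prompts for a login.
os.environ["WANDB_MODE"] = "offline"
# Or skip the wandb integration entirely (recognized by transformers' WandbCallback):
os.environ["WANDB_DISABLED"] = "true"
```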

@MonkeyTB
Author

MonkeyTB commented Apr 13, 2023

```python
    input_text, target_text = example["content"], example["summary"]
    instruction = "改写为电商广告文案:"
    prompt = f"问:{instruction}\n{input_text}\n答:"
    prompt_ids = tokenizer.encode(prompt, max_length=args.max_seq_length)
    target_ids = tokenizer.encode(target_text, max_length=args.max_length,
                                  add_special_tokens=False)
    input_ids = prompt_ids + target_ids
    input_ids = input_ids[:(args.max_seq_length + args.max_length)] + [tokenizer.eos_token_id]

    example['input_ids'] = input_ids
    return example
```
This part looks slightly off to me:
`input_ids = prompt_ids + target_ids`
should probably be changed to
`input_ids = prompt_ids + [tokenizer.bos_token_id] + target_ids`
because data_collator in chatglm_model.py looks for the prompt's bos_token_id to mask (ignore) the prompt portion:
```python
    def data_collator(self, batch):
        len_ids = [len(example) for example in batch]
        longest = max(len_ids)
        input_ids = []
        labels_list = []
        for ids_l, example in sorted(zip(len_ids, batch), key=lambda x: -x[0]):
            ids = list(example)
            logger.info(ids)
            seq_len = ids.index(self.tokenizer.bos_token_id) + 1  # is equal to prompt length
            ignore_idx = -100
            labels = ([ignore_idx] * (seq_len - 1) + ids[(seq_len - 1):] + [ignore_idx] * (longest - ids_l))
            ids = ids + [self.tokenizer.pad_token_id] * (longest - ids_l)
            _ids = torch.LongTensor(ids)
            labels_list.append(torch.LongTensor(labels))
            input_ids.append(_ids)
        input_ids = torch.stack(input_ids)
        labels = torch.stack(labels_list)
        return {"input_ids": input_ids, "labels": labels}
```


I'm not sure whether my understanding here is correct (a toy walk-through of the masking is sketched below).
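For illustration, a toy walk-through of that masking logic with made-up token ids (the bos/eos values are taken from the chatglm-6b config, not from this repo's output):

```python
# Hypothetical sequence: prompt ends with [gmask, bos], then the answer, then eos.
bos_id, ignore_idx = 130004, -100
ids = [5, 64286, 12, 130001, 130004, 65831, 72663, 130005]
seq_len = ids.index(bos_id) + 1                            # prompt length = 5
labels = [ignore_idx] * (seq_len - 1) + ids[seq_len - 1:]
print(labels)  # [-100, -100, -100, -100, 130004, 65831, 72663, 130005]
# Without a bos token inside input_ids, ids.index(bos_id) raises ValueError,
# so the collator cannot tell where the prompt ends.
```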

@MonkeyTB
Author

export WANDB_MODE=offline

Thanks. I had run out of ideas and simply uninstalled wandb, which worked; I will reinstall it and try this 😓

@shibing624
Owner

```
2023-04-13 12:23:01.014 | INFO | chatglm.chatglm_model:train_model:297 - *** Train ***
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: e091e352ec72db11655f6fa7dcfd6d4a7b83xxxx
wandb: WARNING Invalid choice
wandb: Enter your choice: glm
wandb: WARNING Invalid choice
wandb: Enter your choice: 111
wandb: WARNING Invalid choice
wandb: Enter your choice: 0
wandb: WARNING Invalid choice
wandb: Enter your choice: 1
wandb: You chose 'Create a W&B account'
wandb: Create an account here: https://wandb.ai/authorize?signup=true
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: ERROR API key must be 40 characters long, yours was 1
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:
```

This can't simply be ignored: training does not proceed. I even commented out `import wandb`, but this prompt still pops up and forces me to type something.

Just choose 3.

@shibing624
Owner

```python
    input_text, target_text = example["content"], example["summary"]
    instruction = "改写为电商广告文案:"
    prompt = f"问:{instruction}\n{input_text}\n答:"
    prompt_ids = tokenizer.encode(prompt, max_length=args.max_seq_length)
    target_ids = tokenizer.encode(target_text, max_length=args.max_length,
                                  add_special_tokens=False)
    input_ids = prompt_ids + target_ids
    input_ids = input_ids[:(args.max_seq_length + args.max_length)] + [tokenizer.eos_token_id]

    example['input_ids'] = input_ids
    return example
```
This part looks slightly off to me:
`input_ids = prompt_ids + target_ids`
should probably be changed to
`input_ids = prompt_ids + [tokenizer.bos_token_id] + target_ids`
because data_collator in chatglm_model.py looks for the prompt's bos_token_id to mask (ignore) the prompt portion:
```python
    def data_collator(self, batch):
        len_ids = [len(example) for example in batch]
        longest = max(len_ids)
        input_ids = []
        labels_list = []
        for ids_l, example in sorted(zip(len_ids, batch), key=lambda x: -x[0]):
            ids = list(example)
            logger.info(ids)
            seq_len = ids.index(self.tokenizer.bos_token_id) + 1  # is equal to prompt length
            ignore_idx = -100
            labels = ([ignore_idx] * (seq_len - 1) + ids[(seq_len - 1):] + [ignore_idx] * (longest - ids_l))
            ids = ids + [self.tokenizer.pad_token_id] * (longest - ids_l)
            _ids = torch.LongTensor(ids)
            labels_list.append(torch.LongTensor(labels))
            input_ids.append(_ids)
        input_ids = torch.stack(input_ids)
        labels = torch.stack(labels_list)
        return {"input_ids": input_ids, "labels": labels}
```


I'm not sure whether my understanding here is correct.

That's right: prompt_ids is encoded with add_special_tokens=True by default, so it already contains bos + gmask.

@MonkeyTB
Author

```python
    input_text, target_text = example["content"], example["summary"]
    instruction = "改写为电商广告文案:"
    prompt = f"问:{instruction}\n{input_text}\n答:"
    prompt_ids = tokenizer.encode(prompt, max_length=args.max_seq_length)
    target_ids = tokenizer.encode(target_text, max_length=args.max_length,
                                  add_special_tokens=False)
    input_ids = prompt_ids + target_ids
    input_ids = input_ids[:(args.max_seq_length + args.max_length)] + [tokenizer.eos_token_id]

    example['input_ids'] = input_ids
    return example
```
This part looks slightly off to me:
`input_ids = prompt_ids + target_ids`
should probably be changed to
`input_ids = prompt_ids + [tokenizer.bos_token_id] + target_ids`
because data_collator in chatglm_model.py looks for the prompt's bos_token_id to mask (ignore) the prompt portion:
```python
    def data_collator(self, batch):
        len_ids = [len(example) for example in batch]
        longest = max(len_ids)
        input_ids = []
        labels_list = []
        for ids_l, example in sorted(zip(len_ids, batch), key=lambda x: -x[0]):
            ids = list(example)
            logger.info(ids)
            seq_len = ids.index(self.tokenizer.bos_token_id) + 1  # is equal to prompt length
            ignore_idx = -100
            labels = ([ignore_idx] * (seq_len - 1) + ids[(seq_len - 1):] + [ignore_idx] * (longest - ids_l))
            ids = ids + [self.tokenizer.pad_token_id] * (longest - ids_l)
            _ids = torch.LongTensor(ids)
            labels_list.append(torch.LongTensor(labels))
            input_ids.append(_ids)
        input_ids = torch.stack(input_ids)
        labels = torch.stack(labels_list)
        return {"input_ids": input_ids, "labels": labels}
```


I'm not sure whether my understanding here is correct.

That's right: prompt_ids is encoded with add_special_tokens=True by default, so it already contains bos + gmask.

Let me dig into it some more. With add_special_tokens=True it appends two 0s, i.e. two gmask tokens, and does not append the bos token_id. Thanks for open-sourcing this.

@shibing624
Owner

```
train_dataset len: 10000, train_dataset[0]: [ 5 64286 12 65601 115448 68816 94113 75564 66104 63823
63976 70705 6 64157 64091 66889 64447 63823 4 95059
78289 63825 72663 12 28 64265 69028 63907 65667 6
70283 63854 64091 69466 97891 73134 6 63847 65283 64472
66876 78 4 4 67342 12 130001 130004 65831 72663
65247 75564 66104 63823 130005]
```

The pair here:
130001 130004

130001 is the gmask token and 130004 is the bos token.

@MonkeyTB
Author

train_dataset len: 10000, train_dataset[0]: [ 5 64286 12 65601 115448 68816 94113 75564 66104 63823 63976 70705 6 64157 64091 66889 64447 63823 4 95059 78289 63825 72663 12 28 64265 69028 63907 65667 6 70283 63854 64091 69466 97891 73134 6 63847 65283 64472 66876 78 4 4 67342 12 130001 130004 65831 72663 65247 75564 66104 63823 130005]

The pair here: 130001 130004

130001 is the gmask token and 130004 is the bos token.

add special tokens True: [5, 66219, 1389, 64812, 69171, 0, 0]
add special tokens False: [5, 66219, 1389, 64812, 69171]

After replacing ice_text.model it works correctly:
add special tokens True: [5, 66219, 1389, 64812, 69171, 130001, 130004]
add special tokens False: [5, 66219, 1389, 64812, 69171]
It seems the updated and non-updated files had not been fully swapped, which caused the confusion. Before the update, tokenizer.gmask_token_id printed as 0; after updating everything runs normally.
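As a sanity check after swapping in the new ice_text.model, something along these lines can be used (a sketch only; assumes the THUDM/chatglm-6b tokenizer, and the exact ids and attribute names depend on the tokenizer version):

```python
from transformers import AutoTokenizer

# trust_remote_code pulls in ChatGLM's custom tokenizer implementation.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
print(tokenizer.bos_token_id)    # expected 130004 (<sop>)
print(tokenizer.gmask_token_id)  # expected 130001 ([gMASK]), if this version exposes the attribute
print(tokenizer.eos_token_id)    # expected 130005 (<eop>)

# With add_special_tokens=True the encoded prompt should end in [gmask, bos],
# not in two zeros.
ids = tokenizer.encode("问:改写为电商广告文案:\n答:")
print(ids[-2:])                  # expected [130001, 130004]
```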

shibing624 pinned this issue May 11, 2023

stale bot commented Dec 27, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. (This issue was automatically closed by the bot due to long inactivity; feel free to open a new question if needed.)

stale bot added the wontfix label Dec 27, 2023