
Building pretraining data from my own data raises KeyError: '##cry' #70

Open
ccoocode opened this issue Jun 21, 2020 · 1 comment

Comments

@ccoocode

Hi, when I run create_pretraining_data.py on my own data, it raises KeyError: '##cry'. I traced it to this function:

def convert_by_vocab(vocab, items):
    """Converts a sequence of [tokens|ids] using the vocab."""
    output = []
    for i, item in enumerate(items):
        # print(i, "item:", item)  # e.g. prints: ##期
        output.append(vocab[item])
    return output

The error is raised here, and my guess is that some of the tokens produced after jieba Chinese word segmentation are not in the vocabulary.
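In case anyone just wants the conversion step to survive such out-of-vocabulary tokens instead of crashing, here is a minimal sketch of a defensive variant (the '##'-stripping retry and the [UNK] fallback are my own workaround, not part of the original script):

def convert_by_vocab_safe(vocab, items, unk_token="[UNK]"):
    """Like convert_by_vocab, but never raises KeyError on unknown tokens."""
    output = []
    for item in items:
        if item in vocab:
            output.append(vocab[item])
        elif item.startswith("##") and item[2:] in vocab:
            # e.g. '##dna': retry without the wwm '##' prefix
            output.append(vocab[item[2:]])
        else:
            # Last resort: map to [UNK]; assumes unk_token exists in the vocab
            output.append(vocab[unk_token])
    return output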

The arguments I passed to create_pretraining_data.py are:
--do_lower_case=True --max_seq_length=40 --do_whole_word_mask=True --max_predictions_per_seq=20 --masked_lm_prob=0.15 --dupe_factor=3

The vocab file is BERT's.
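For reference, a complete invocation also needs the input/output/vocab paths; the paths below are placeholders, not the ones from this report:

python create_pretraining_data.py \
  --input_file=./data/corpus.txt \
  --output_file=./data/corpus.tfrecord \
  --vocab_file=./bert/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=40 \
  --do_whole_word_mask=True \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --dupe_factor=3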

@waywaywayw

I ran into the same problem.
The fix I'm considering is to skip the wwm strategy for words that mix Chinese and English.

Code change: in the get_new_segment function, change
    if segment_str in seq_cws_dict:
to
    if segment_str in seq_cws_dict and len(re.findall('[a-zA-Z]', segment_str)) == 0:

An example of the cause:
BERT tokenization: '顺', '利', '的', '无', '创', 'dna'
jieba-based wwm segmentation: '顺', '##利', '的', '无', '##创', '##dna'
Later on, ##dna is not in the BERT vocab, so the KeyError is raised.
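To illustrate what the extra condition does, a small self-contained sketch (the seq_cws_dict contents here are made up for the example):

import re

# Hypothetical jieba word dict for the sentence '顺利的无创dna'
seq_cws_dict = {'顺利': 1, '的': 1, '无创dna': 1}

for segment_str in ['顺利', '的', '无创dna']:
    has_english = len(re.findall('[a-zA-Z]', segment_str)) > 0
    if segment_str in seq_cws_dict and not has_english:
        print(segment_str, '-> apply wwm, prefix non-leading pieces with ##')
    else:
        print(segment_str, '-> leave the original BERT tokens untouched')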
