About the MacBERT for Chinese Spelling Correction (macbert4csc) model #249

Closed
MachineLearningCuiFan opened this issue Dec 10, 2021 · 12 comments
Labels: question (Further information is requested), wontfix (This will not be worked on)

Comments

@MachineLearningCuiFan

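I'd like to ask: is this model a BERT that was retrained around the characteristics of the CSC task, that is, trained on text where noise is injected by substituting confusable characters and words?
Or is it simply the original MacBERT, fine-tuned on the Wang271K dataset?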

MachineLearningCuiFan added the question label on Dec 10, 2021
@zhihanyang2022

Same question here.

@shibing624
Owner

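It is a standalone correction model with a modified network architecture. It is modeled on softmaskedbert (Soft-Masked BERT): it has an error detection network module and an error correction network, and the loss is a weighted sum of the two networks' losses.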
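A minimal sketch of the weighted loss described above, in plain Python so it runs without any deep learning library. The function names, the default weight `w=0.3`, and the convex form `w * detection + (1 - w) * correction` are assumptions made for illustration; the thread only states that the two losses are weighted.

```python
import math

EPS = 1e-12  # avoids log(0)


def detection_loss(error_probs, error_labels):
    """Mean binary cross-entropy per character (label 1 = misspelled, 0 = correct)."""
    total = 0.0
    for p, y in zip(error_probs, error_labels):
        total += -(y * math.log(p + EPS) + (1 - y) * math.log(1 - p + EPS))
    return total / len(error_labels)


def correction_loss(char_probs, target_ids):
    """Mean cross-entropy; char_probs[i] is a distribution over the vocabulary."""
    total = 0.0
    for probs, target in zip(char_probs, target_ids):
        total += -math.log(probs[target] + EPS)
    return total / len(target_ids)


def combined_loss(det_loss, cor_loss, w=0.3):
    # w is a hypothetical weight, not a value confirmed by the thread.
    return w * det_loss + (1 - w) * cor_loss


# Toy example: a 3-character sentence and a 4-entry vocabulary.
det = detection_loss([0.1, 0.8, 0.2], [0, 1, 0])
cor = correction_loss(
    [[0.7, 0.1, 0.1, 0.1], [0.2, 0.6, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]],
    [0, 1, 3],
)
loss = combined_loss(det, cor)
print(f"detection={det:.4f} correction={cor:.4f} combined={loss:.4f}")
```

With this form, a larger `w` makes training focus more on locating errors and a smaller `w` on choosing the right replacement character.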

@zhihanyang2022

zhihanyang2022 commented Dec 23, 2021

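@shibing624 Thanks for the reply.

The model you uploaded to Hugging Face presumably does not include the error detection network, right? If so, when the pretrained model is loaded with Hugging Face transformers, how can we be sure that the parameters of the error detection network module are loaded as well?

In other words, I don't understand which part of the following script includes the error detection network module.

https://github.com/shibing624/pycorrector/blob/master/examples/macbert_demo.py

Many thanks.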

@shibing624
Owner

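The error detection network's weights are loaded. The model's correction flow is to detect first and then correct; it is just that this process is end-to-end.

PS: the detection weights can be seen separately in the pytorch_model.bin weight file.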
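One way to check a claim like this is to list the parameter names in the weight file and filter them by keyword. The sketch below uses a toy dictionary in place of a real state dict so it runs on its own; the helper name `find_parameters` and every parameter name in the toy data are hypothetical.

```python
def find_parameters(state_dict, keyword):
    """Return the sorted parameter names that contain `keyword`."""
    return sorted(name for name in state_dict if keyword in name)


# Toy stand-in for a loaded state dict (parameter name -> weights).
# With a real file, the mapping would typically come from
# torch.load("pytorch_model.bin", map_location="cpu").
toy_state_dict = {
    "bert.embeddings.word_embeddings.weight": [[0.1, 0.2]],
    "cls.predictions.bias": [0.0, 0.0],
    "detection.weight": [[0.3, 0.4]],
    "detection.bias": [0.0],
}

detection_names = find_parameters(toy_state_dict, "detection")
print(detection_names)
```

An empty result for a given file would mean that no parameter with that keyword was saved in it.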

@zhihanyang2022

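Thank you very much!

So does that mean pycorrector can train a soft-masked MacBERT directly? The README for MacBERT training mentions another repo that can be used to train Soft-Masked BERT, but it doesn't seem to say whether pycorrector itself can.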

@shibing624
Owner

Yes, softmaskedbert can be trained; the training code was written by abtion.

@zhihanyang2022

zhihanyang2022 commented Dec 28, 2021

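Could you explain how the detection weights are merged into pytorch_model.bin? In other words, detection originally uses the weights of a separate network, so how are they converted into weights that the transformers package can load directly?

I read the code carefully but still don't quite understand how the end-to-end behavior is achieved. I downloaded pytorch_model.bin from macbert4csc-base-chinese and did not see any detection weights in it. Or is it that what is currently on Hugging Face does not actually include Soft-Masked BERT?

Thanks!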

@shibing624
Owner

shibing624 commented Dec 28, 2021

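1. The model on Hugging Face is called macbert4csc. It is a heavily modified version of the softmaskedbert model with the soft-masking structure removed: detection is a single Linear layer, and correction is BERT's MLM layer. During training the loss is a weighted sum of the detection and correction losses, while at prediction time the correction result can be produced as soon as bert_outputs is available. For details, see the code at https://github.com/shibing624/pycorrector/blob/master/pycorrector/macbert/macbert4csc.py
2. The implementation of the softmaskedbert model from the ByteDance paper is at https://github.com/shibing624/pycorrector/blob/master/pycorrector/macbert/softmaskedbert4csc.py . In my own evaluation it performs worse than the macbert4csc model, so I did not upload it to Hugging Face. Feel free to evaluate and compare them yourselves.
3. The ckpt model file of macbert4csc does store the detection weights, but pytorch_model.bin does not contain them. See the code that saves pytorch_model.bin at https://github.com/shibing624/pycorrector/blob/master/pycorrector/macbert/train.py#L119 . The main reasons are: 1) compatibility with the calling logic of the transformers library; 2) leaving out the detection weights does not affect the correction predictions.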
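Point 3 can be illustrated with a small sketch of such an export step. It is not the repository's code (that is at the train.py link above). It assumes, without confirmation from the thread, that the training wrapper keeps the masked-LM model under an attribute named `bert` and the detection layer under `detection`, so checkpoint keys start with `bert.` or `detection.`. The function name `export_for_transformers` and all parameter names in the toy data are hypothetical.

```python
from collections import OrderedDict


def export_for_transformers(ckpt_state_dict, prefix="bert."):
    """Keep parameters under `prefix`, strip the prefix, and drop the rest.

    Returns (exported, dropped): the renamed parameters and the names left out.
    """
    exported = OrderedDict()
    dropped = []
    for name, value in ckpt_state_dict.items():
        if name.startswith(prefix):
            exported[name[len(prefix):]] = value
        else:
            dropped.append(name)
    return exported, dropped


# Toy checkpoint (hypothetical names): wrapper -> masked-LM model + detection layer.
toy_ckpt = OrderedDict([
    ("bert.bert.embeddings.word_embeddings.weight", [[0.1, 0.2]]),
    ("bert.cls.predictions.bias", [0.0, 0.0]),
    ("detection.weight", [[0.3, 0.4]]),
    ("detection.bias", [0.0]),
])

exported, dropped = export_for_transformers(toy_ckpt)
print(list(exported))  # names a plain masked-LM model would expect
print(dropped)         # names that exist only in the training checkpoint
```

Under these assumptions the exported file matches the layout of a plain masked-LM model, which is why it can be loaded without the custom wrapper class, and the detection parameters end up only in the ckpt file.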

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. (Due to prolonged inactivity, the bot closes this issue automatically; feel free to ask again if needed.)

stale bot added the wontfix label on Mar 2, 2022
stale bot closed this as completed on Apr 16, 2022
shibing624 pinned this issue on Aug 8, 2023
@nevermorez

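Quoting @shibing624:

1. The model on Hugging Face is called macbert4csc. It is a heavily modified version of the softmaskedbert model with the soft-masking structure removed: detection is a single Linear layer, and correction is BERT's MLM layer. During training the loss is a weighted sum of the detection and correction losses, while at prediction time the correction result can be produced as soon as bert_outputs is available. For details, see the code at https://github.com/shibing624/pycorrector/blob/master/pycorrector/macbert/macbert4csc.py
2. The implementation of the softmaskedbert model from the ByteDance paper is at https://github.com/shibing624/pycorrector/blob/master/pycorrector/macbert/softmaskedbert4csc.py . In my own evaluation it performs worse than the macbert4csc model, so I did not upload it to Hugging Face. Feel free to evaluate and compare them yourselves.
3. The ckpt model file of macbert4csc does store the detection weights, but pytorch_model.bin does not contain them. See the code that saves pytorch_model.bin at https://github.com/shibing624/pycorrector/blob/master/pycorrector/macbert/train.py#L119 . The main reasons are: 1) compatibility with the calling logic of the transformers library; 2) leaving out the detection weights does not affect the correction predictions.

Hi author, I asked a similar question in another issue, and I seem to have found the cause here. How can we be sure that leaving the detection weights out of pytorch_model.bin does not affect the correction results? In my own fine-tuning, the outputs of the ckpt model and the bin model differ greatly: the ckpt model can correct the new data, but the bin model cannot. The original ckpt model has fit the training set, but the bin model does not appear to have. Could this be related to the bin model not containing the detection weights?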
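A discrepancy like this could be narrowed down by comparing the two weight files parameter by parameter. The sketch below is one possible approach, not something taken from the repository: it uses plain lists in place of tensors so it runs on its own, and the `bert.` prefix, the helper name `compare_state_dicts`, and all parameter names are hypothetical.

```python
def compare_state_dicts(ckpt_state, bin_state, prefix="bert."):
    """Compare a training checkpoint with an exported weight file.

    Returns (missing, changed): checkpoint parameters absent from the export,
    and parameters present in both whose values differ.
    """
    missing, changed = [], []
    for name, value in ckpt_state.items():
        # Assumed naming: the export strips the wrapper prefix from each key.
        exported_name = name[len(prefix):] if name.startswith(prefix) else name
        if exported_name not in bin_state:
            missing.append(name)
        elif bin_state[exported_name] != value:
            # With real tensors, use torch.equal(a, b) instead of != here.
            changed.append(name)
    return missing, changed


# Toy data: the export lacks the detection layer and has a stale MLM bias.
toy_ckpt = {
    "bert.bert.encoder.layer.0.output.dense.bias": [1.0, 2.0],
    "bert.cls.predictions.bias": [0.5, 0.1],
    "detection.weight": [[0.3, 0.4]],
}
toy_bin = {
    "bert.encoder.layer.0.output.dense.bias": [1.0, 2.0],
    "cls.predictions.bias": [0.0, 0.0],
}

missing, changed = compare_state_dicts(toy_ckpt, toy_bin)
print("missing from bin:", missing)
print("different values:", changed)
```

If only the detection parameters showed up as missing and nothing as changed, the missing detection layer would be the remaining suspect. A non-empty list of changed parameters would instead suggest that the export did not pick up the fine-tuned weights.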

@shibing624
Owner

If that is the case, the logic that produces the bin model needs to be updated. I'll verify this again.

@nevermorez

Many thanks, much appreciated!
