
Why does the discreteness of word embeddings lead the optimizer to easily fall into local minima? #47

Open
xsc1234 opened this issue May 27, 2023 · 1 comment

Comments

@xsc1234

xsc1234 commented May 27, 2023

I recently read your paper "GPT Understands, Too" and there is a passage I do not fully understand; I hope you can help explain it: "1) Discreteness: the original word embedding e of M has already become highly discrete after pre-training. If h is initialized with random distribution and then optimized with stochastic gradient descent (SGD), which has been proved to only change the parameters in a small neighborhood (AllenZhu et al., 2019), the optimizer would easily fall into local minima." As I understand it, this passage first points out that the pretrained model's word embeddings are discrete, i.e. scattered far apart from one another. But the trainable parameters h are themselves randomly initialized and do not come from the word embeddings, so what effect does the discreteness of the word embeddings have on the optimization of h?
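For context, here is a small, hypothetical illustration of the "discreteness" point (not from the paper or this repository; it assumes a Hugging Face `bert-base-uncased` checkpoint is available): the pretrained embeddings occupy scattered regions of the space, and a randomly initialized prompt vector tends to lie far from all of them, so an optimizer that only moves it within a small neighborhood keeps it off that region.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel

# Hypothetical illustration: compare a randomly initialized prompt vector h
# against every pretrained word embedding of a frozen LM.
model = AutoModel.from_pretrained("bert-base-uncased")
word_emb = model.get_input_embeddings().weight.detach()   # (vocab_size, 768)

h = torch.randn(word_emb.size(1)) * word_emb.std()        # randomly initialized h
sims = F.cosine_similarity(h.unsqueeze(0), word_emb, dim=-1)
print(f"max cosine similarity of h to any real word embedding: {sims.max().item():.3f}")
# The maximum similarity is usually small, i.e. h starts far away from the
# clusters that the pretrained embeddings form, and an optimizer that only
# moves h within a small neighborhood tends to keep it there.
```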

@init-neok

Let me briefly share my own understanding. I have watched the author's talk at BAAI. I think the point of this sentence is that the pseudo prompts introduced by random initialization are not real words in any meaningful sense; to make them better match the properties of natural language, the authors use this sentence to motivate the later idea of initializing these pseudo prompts with a BiLSTM. Moreover, P-tuning is a form of continuous prompting, and using an LSTM to initialize the prompts also fits this "continuous" property better.
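For reference, here is a minimal sketch of such a prompt encoder (my own reading of the idea in PyTorch, not the authors' exact code; the class and parameter names are made up): the randomly initialized pseudo-token embeddings are passed through a BiLSTM and an MLP, so the continuous prompt vectors are modeled jointly instead of being optimized as independent free parameters.

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Sketch of a P-tuning-style prompt encoder (BiLSTM + MLP)."""

    def __init__(self, prompt_len: int, hidden_size: int):
        super().__init__()
        # Randomly initialized pseudo-token embeddings (the "h" in the paper).
        self.embedding = nn.Embedding(prompt_len, hidden_size)
        # The BiLSTM ties the pseudo tokens together so they are optimized
        # as a sequence rather than as unrelated vectors.
        self.lstm = nn.LSTM(
            input_size=hidden_size,
            hidden_size=hidden_size // 2,
            num_layers=2,
            bidirectional=True,
            batch_first=True,
        )
        # A two-layer MLP maps the LSTM states back to the LM's embedding size.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
        )
        self.register_buffer("indices", torch.arange(prompt_len))

    def forward(self) -> torch.Tensor:
        h = self.embedding(self.indices).unsqueeze(0)   # (1, prompt_len, hidden)
        h, _ = self.lstm(h)                             # (1, prompt_len, hidden)
        return self.mlp(h).squeeze(0)                   # (prompt_len, hidden)

# Usage: the output vectors replace the pseudo tokens' input embeddings and are
# prepended/inserted into the frozen LM's input sequence.
encoder = PromptEncoder(prompt_len=8, hidden_size=768)
prompt_vectors = encoder()   # shape: (8, 768)
```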
