
Keyword extraction using Word2Vec and PageRank algorithms

The most common representation in distributional semantics is the one-hot representation, in which the dimensionality equals the vocabulary's cardinality and each element of the vector is either 0 or 1. However, this representation has several disadvantages: it is difficult to make deductions about word similarity from one-hot vectors, their high dimensionality can cause overfitting, and they are computationally expensive to work with.
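As a minimal illustration (the toy vocabulary below is hypothetical), one-hot vectors for a four-word vocabulary look like this; any two distinct words are orthogonal, which is why similarity cannot be read off these vectors:

```python
# Sketch: one-hot vectors for a toy four-word vocabulary (hypothetical example).
import numpy as np

vocabulary = ["keyword", "extraction", "word", "vector"]
word_to_index = {w: i for i, w in enumerate(vocabulary)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocabulary))
    vec[word_to_index[word]] = 1.0
    return vec

# Any two distinct words are orthogonal, so the representation carries
# no information about word similarity.
print(one_hot("keyword") @ one_hot("extraction"))  # 0.0
```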

Word embeddings are designed to capture attributional similarities between vocabulary items: words that appear in similar contexts should be close to each other in the projected vector space, so groups of words that lie close together share semantic properties. Among embedding methods, Latent Semantic Analysis (LSA) uses a count-based dimensionality reduction; Word2Vec was created as an alternative. Its low dimensionality helps reduce computational complexity, it causes less overfitting than one-hot distributional representations, and it can also detect analogies between words.
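A rough sketch of such similarity and analogy queries using gensim (the library call and the local file path are assumptions for illustration, not part of this repository's code):

```python
# Sketch: similarity and analogy queries with pre-trained word vectors.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Words that appear in similar contexts are close in the vector space.
print(vectors.similarity("car", "automobile"))

# Analogy: king - man + woman is closest to queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```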

Our model takes Word2Vec representations of words in a vector space. When constructing the Word2Vec model, we set a minimum-count threshold, because words that appear only once or twice in a large corpus are probably not useful for the model, and there is not enough data to train them meaningfully. A reasonable value for the minimum count ranges from 0 to 100, depending on the size of the corpus. Another critical parameter of the Word2Vec model is the dimensionality of the vectors, which typically ranges from 100 to 400; larger dimensionalities require more training but lead to more accurate models. I used the Google News corpus provided by Google, which consists of 3 million word vectors. I did not remove stop words or infrequent words, because Word2Vec learns vector representations from a window of nearby words, so those neighbouring words are needed.
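A minimal sketch of how these two parameters appear when training a Word2Vec model with gensim (the toy corpus below is a placeholder; in practice one trains on, or loads, a large corpus such as Google News):

```python
# Sketch: training a Word2Vec model with gensim, showing the two parameters
# discussed above (minimum count and vector dimensionality).
from gensim.models import Word2Vec

sentences = [
    ["keyword", "extraction", "with", "word", "embeddings"],
    ["pagerank", "ranks", "words", "by", "importance"],
]

model = Word2Vec(
    sentences,
    vector_size=300,  # dimensionality of the word vectors (100-400 is typical)
    min_count=1,      # minimum count threshold; raise this (e.g. 5+) on a large corpus
    window=5,         # size of the context window of nearby words
)
print(model.wv["keyword"].shape)  # (300,)
```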

The second step of the algorithm is to compute a PageRank value for each word. PageRank is based on a random walk; the original algorithm treats web pages as nodes, whereas in our model the nodes are the Word2Vec representations of words. The cosine distance between word vectors is used to calculate the edge weights between nodes. Once the PageRank values are computed, the words with the highest values are selected, and these words serve as the keywords of the text.
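A minimal sketch of this ranking step, using networkx for PageRank (the graph construction and the get_vector lookup are assumptions; cosine similarity is used as the edge weight, which is one plausible reading of the cosine-distance weighting described above):

```python
# Sketch: rank candidate words with PageRank over a word-similarity graph.
# get_vector is a hypothetical lookup into any trained Word2Vec model.
import itertools

import networkx as nx
import numpy as np


def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def extract_keywords(words, get_vector, top_k=5):
    graph = nx.Graph()
    graph.add_nodes_from(words)
    for w1, w2 in itertools.combinations(set(words), 2):
        # Clip negative similarities so all edge weights stay non-negative.
        weight = max(cosine_similarity(get_vector(w1), get_vector(w2)), 0.0)
        graph.add_edge(w1, w2, weight=weight)
    scores = nx.pagerank(graph, weight="weight")
    # Words with the highest PageRank values are returned as keywords.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```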
