
Tokenizer/Stemmer and few other questions #141

Open
vprelovac opened this issue Feb 27, 2020 · 3 comments
vprelovac commented Feb 27, 2020

Hey Mišo

I spent a lot of time on TextRank, and while digging deeper into Sumy I want to ask you a few clarifying questions about some of the choices you made. This is all for the English language.

  1. _WORD_PATTERN = re.compile(r"^[^\W\d_]+$", re.UNICODE)

Used with word_tokenize() to filter out 'non-word' tokens. The problem is it "kills" words like "data-mining" or "sugar-free". Also, word_tokenize is very slow. Here is an alternative to replace these two, for your consideration:

WORDS = re.compile(r"\w+(?:['-]\w+)*")
words = WORDS.findall(sentence)
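To illustrate the difference, a quick sketch (using the two patterns quoted above on sample tokens; the sample sentence is made up):

```python
import re

# Sumy's current filter: a token must be letters only, so hyphenated
# words that survive word_tokenize are discarded afterwards.
_WORD_PATTERN = re.compile(r"^[^\W\d_]+$", re.UNICODE)

# Proposed replacement: extract words directly, allowing internal
# hyphens and apostrophes ("data-mining", "sugar-free", "isn't").
WORDS = re.compile(r"\w+(?:['-]\w+)*")

tokens = ["data-mining", "sugar-free", "hello", "123"]
kept = [t for t in tokens if _WORD_PATTERN.match(t)]
print(kept)  # only "hello" survives; hyphenated words are dropped

sentence = "Sugar-free data-mining isn't easy"
print(WORDS.findall(sentence))  # hyphenated words kept whole
```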
  2. What made you choose the Snowball vs. the Porter stemmer?

Snowball: DVDs -> dvds
Porter: DVDs -> dvd

I don't have a particular opinion, just wondering how you made the decision.
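For context, the "DVDs" divergence comes down to the plural rule in each algorithm (after lowercasing): the original Porter stemmer strips a final "s" unconditionally, while Snowball's English stemmer (Porter2) only strips it when a vowel appears earlier in the word, not counting the letter right before the "s". A toy sketch of just that one rule, not the full stemmers:

```python
VOWELS = set("aeiouy")

def porter_plural(word):
    # Porter (1980) step 1a: sses -> ss, ies -> i, ss -> ss, s -> ""
    if word.endswith("sses") or word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word

def snowball_plural(word):
    # Porter2 rule for a bare "s": delete only if the part before the
    # "s" contains a vowel that is not immediately before the "s".
    if word.endswith("s") and any(c in VOWELS for c in word[:-2]):
        return word[:-1]
    return word

print(porter_plural("dvds"))    # dvd
print(snowball_plural("dvds"))  # dvds -- no earlier vowel, "s" kept
```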

  3. How did you come up with your stopwords (for English)? It is very different than the NLTK defaults, for example.

  4. The heuristics in the plaintext parser are interesting.

In this example of text extracted from https://www.karoly.io/amazon-lightsail-review-2018/

Is Amazon Lightsail worth it?
Written by Niklas Karoly 10/28/2018 · 8 min read
Amazon AWS Lightsail review 2018
In November of 2016 AWS launched its brand Amazon Lightsail to target the ever growing market that DigitalOcean , Linode and co. made popular.

This ends up as two sentences instead of four.
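The merge is easy to see: only two of the four extracted lines carry terminal punctuation, so a splitter keyed on end marks alone fuses the unpunctuated lines into their neighbors. A quick check (plain Python, not Sumy's actual parser):

```python
text = (
    "Is Amazon Lightsail worth it?\n"
    "Written by Niklas Karoly 10/28/2018 · 8 min read\n"
    "Amazon AWS Lightsail review 2018\n"
    "In November of 2016 AWS launched its brand Amazon Lightsail "
    "to target the ever growing market that DigitalOcean, Linode "
    "and co. made popular."
)

lines = text.split("\n")
# Lines ending in a sentence mark -- only the first and last qualify,
# so end-mark-based splitting sees two sentences, not four.
with_end_mark = [ln for ln in lines if ln.rstrip()[-1:] in ".!?"]
print(len(lines), len(with_end_mark))  # 4 lines, 2 end marks
```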

@miso-belica (Owner) commented

Hi Vladimir, I think you know the code better than I do, because TextRank was not contributed by me; at least not the current implementation. But I will try to check the code and respond to your questions.

  1. I am not against replacing the implementation with a simpler/faster one, but tokenizing is not always about regexes. There are other languages, and Sumy relies on NLTK and other libs, so I don't want to make it perfect for one language and break it for the others. Also, I trust NLTK to do its job better than I would. But you are right that those words should be fixed and tested. Also, Sumy is pluggable, so you can provide your own tokenizer implemented as you mentioned. I think the regex can be simplified:
WORDS = re.compile(r"[\w'-]+")
words = WORDS.findall(sentence)
  2. Snowball vs. Porter stemmer: to be honest, I don't remember the decision. It was years ago. I don't even know whether I tried both and picked the better one or simply used the first one I saw in the documentation.

  3. I barely remember; it's a mix of NLTK, wiki word-frequency lists, and the stopwords from other projects I was involved in. Sumy was my experiment in the early days, and I used what gave me better results; I started to make it more generic when more people "joined" the project on GitHub.

  4. Yep, the sentences are separated by the correct end mark, not the newline, if that is what you mean.

vprelovac commented Mar 4, 2020

  1. I agree it is complex. However, you already decided to use the regex approach for English, and my point is that the regex I provided is higher quality and faster overall.

Note: your tweaked version would leave lonely dashes floating.
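The difference is easy to show: the character class [\w'-]+ matches a bare dash as a token of its own, while the grouped pattern only allows dashes between word characters. A quick check (the sample sentence is made up):

```python
import re

TWEAKED = re.compile(r"[\w'-]+")          # character-class version
GROUPED = re.compile(r"\w+(?:['-]\w+)*")  # grouped version

sentence = "well-known tricks - and more"
print(TWEAKED.findall(sentence))  # the lone "-" survives as a token
print(GROUPED.findall(sentence))  # only real words come through
```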

  2. Fair enough.

  3. Sharing my current stopword list:
    stopWords=frozenset(['front', 'wednesday', 'whole', 'thin', "you're", 'appear', 'could', 'further', 'q', 'fri', 'willing', 'years', 'saturday', 'be', 'is', 's', 'various', 'example', 'your', "i'd", 'specifying', 'entirely', 'follows', 'therefore', 'asking', "we're", 'otherwise', 'newsinfo', 'doesn', 'becomes', 'ie', 't', 'inner', 'friday', 'ltd', 'however', 'different', 'herein', 'got', 'mightn', 'lately', "that'll", 'been', 'sometime', 'wherein', 'i', 'inquirer', 'no', 'along', '1', 'ever', 'hereupon', 'mean', 'value', 'described', 'via', '2', 'move', 'shouldn', 'december', 'five', 'anyone', "that's", 'sincere', 'toward', 'useful', 'had', 'normally', 'seems', 'am', 'allows', 'sent', 'april', 'instead', '5', 'yourselves', 'fifth', 'top', 'all', 'hasnt', 'inward', 'say', 'thickv', 'll', 'soon', 'weren', 'while', 'a', '10', 'might', 'sixty', 'anyways', "we've", 'please', 'little', 'least', 'definitely', 'eg', 'her', 'accordingly', 'hereafter', 'home', 'sun', 'y', 'seriously', 'whose', 'clearly', 'the', 'said', 'came', 'herself', 'stories', 'wouldn', 'ain', 'z', 'on', 'doing', 'until', 'except', 'anyhow', 'former', 'concerning', 'same', 'whereby', 'possible', 'going', 'still', "it's", "he's", 'keep', 'see', 'done', 'find', "c's", 'thus', 'indicates', 'ours', 'itself', 'thank', 'inc', 'lest', 'beyond', "wouldn't", 'currently', "we'd", 'himself', 'just', 'thu', 'although', 'consider', 'between', 'far', 'percent', 'o', 'will', 'looking', 'tries', "they're", 'okay', 'cannot', 'put', 'hundred', 'thereafter', 'mainly', 'ex', 'look', 'ten', 'allow', 'thanks', 'getting', 'much', "i've", 'gotten', 'my', 'plus', 'w', 'become', 'why', 'wants', 'after', 'zero', "when's", 'certain', 'unlikely', "how's", '0', 'photo', 'necessary', 'more', 'says', 'ma', 'whereas', 'so', 'whether', 'self', 'afterwards', 'rappler', 'yet', 'especially', 'wonder', "don't", '6', 'in', 'hopefully', 'having', "she'd", 'others', 'myself', 'often', 'tried', 'may', 'awfully', 'whoever', 'does', 'own', 
'anything', 'besides', 'gives', "shouldn't", 'c', 'reasonably', 'again', 'associated', 'best', 'tends', 'amount', "aren't", 'ye', 'pm', 'anyway', 'would', 'sorry', 'mine', 'reuters', 'everywhere', 'found', 'of', 'specify', "i'm", 'looks', "hadn't", 're', 'yung', 'able', 'last', "you've", 'few', 'something', 'tue', 'this', "you'd", 'empty', "isn't", 'must', 'either', 'considering', 'whereafter', "we'll", 'eleven', 'usually', 'time', "hasn't", 'our', 'greetings', 'since', 'you', 'thursday', 'particularly', 'gone', 'don', 'above', 'new', 'amongst', 'seen', 'up', 'consequently', 'many', 'needs', 'behind', 'has', 'couldn', 'contain', 'tell', 'under', 'twenty', 'use', 'well', 'following', 'sports', 'later', 'go', 'every', 'but', 'it', 'indeed', 'namely', 'not', "weren't", 'once', 'each', 'first', 'beside', 'hardly', 'did', 'thence', 'liked', 'sub', 'used', 'b', 'hi', 'think', 'maybe', "should've", 'ako', 'rather', 'eight', 'against', "haven't", 'hers', 'too', 'was', 'beforehand', 'rapplercom', 'right', 'vs', 'seem', 'unto', 'sat', 'seemed', 'then', 'welcome', 'when', 'part', 'serious', 'can', 'sup', 'here', 'wherever', 'saying', 'ang', 'second', 'alone', 'another', 'with', 'co', 'according', 'ask', 'nowhere', 'wed', 'despite', 'particular', 'by', 'nothing', 'year', 'qv', 'regarding', 'nd', 'his', 'january', 'side', 'section', 'tuesday', 'never', 'both', 'indicated', "here's", 'quite', 'k', 'full', "couldn't", 'february', 'aren', 'somewhere', 'available', 'yes', 'into', 'per', 'g', "they've", 'thats', 'n', 'than', 'sometimes', 'uucp', 'always', 'back', 'get', 'merely', 'nobody', 'october', 'yourself', 'followed', 'specified', 'even', 'for', 'nor', 'shall', 'rd', 'whence', 'somebody', 'howbeit', 'f', 'news', 'down', 'july', "let's", 'third', 'yours', 'fifteen', 'hadn', 'seeming', '3', 'bottom', 'v', 'saw', 'contains', 'immediate', 'now', 'trying', 'though', 'march', 'story', 'certainly', 'mon', "why's", 'tweet', 'placed', 'latterly', 'monday', 'try', 'haven', 'made', 
'changes', 'those', 'latter', 'enough', 'noone', 'together', 'viz', 'someone', 'september', "where's", 'onto', 'make', 'were', 'elsewhere', 'do', 'thorough', 'overall', "he'd", 'thereupon', 'non', 'gets', 'containing', 'he', 'most', 'downwards', 'kept', 'everybody', "shan't", 'towards', 'happens', 'cant', 'already', 'how', 'un', 'using', 'sure', 'nine', 'meanwhile', "didn't", 'great', 'selves', 've', 'because', 'outside', 'some', 'there', 'four', 'amoungst', 'from', 'take', 'way', 'detail', 'throughout', 'moreover', 'anywhere', "i'll", 'among', 'oh', 'actually', 'isn', 'l', 'comes', 'six', 'wasn', 'an', 'ourselves', 'them', 'over', 'wish', "what's", 'only', 'keeps', 'being', 'upon', 'regardless', 'm', 'didn', 'd', 'several', 'else', "they'll", 'describe', 'novel', 'e', 'better', 'that', 'exactly', 'who', 'people', 'want', 'none', 'course', 'june', 'without', 'me', 'sensible', 'sa', 'nevertheless', 'very', 'unless', 'presumably', 'needn', 'about', 'let', 'somewhat', 'whenever', 'indicate', 'such', 'mill', 'shan', 'before', '2012', 'ok', 'during', 'yun', 'us', 'due', 'come', 'que', 'appreciate', 'fire', 'themselves', 'within', 'insofar', 'name', 'everyone', 'are', 'forth', 'at', 'ones', 'believe', 'brief', 'secondly', 'th', 'everything', 'also', 'thanx', 'next', 'if', 'away', 'somehow', 'furthermore', 'seven', 'mostly', 'help', "it'll", "doesn't", 'took', 'perhaps', 'neither', 'what', "there's", "t's", 'less', 'apart', 'hereby', 'as', 'they', 'thereby', "needn't", 'should', 'other', 'near', 'went', 'hither', 'inasmuch', 'provides', 'cause', 'forty', 'de', "he'll", "wasn't", 'and', 'p', 'x', '9', 'anybody', "it'd", 'yahoo', 'corresponding', 'around', 'one', 'truly', 'hasn', 'formerly', 'out', 'hello', "mightn't", 'off', 'three', 'twelve', 'ought', 'she', 'which', 'theres', 'won', 'thoroughly', 'two', 'whither', 'causes', '8', 'became', 'call', 'u', 'mustn', 'any', 'h', 'need', 'becoming', 'homepage', 'fifty', "a's", 'almost', 'or', 'known', 'really', 'taken', 'edu', 
'likely', 'where', 'we', 'have', "mustn't", 'given', 'ignored', 'nearly', 'uses', 'show', "she'll", 'ko', 'hence', "can't", 'unfortunately', 'november', 'respectively', 'j', 'r', "ain't", 'relatively', 'probably', 'et', 'theirs', "she's", 'fill', 'august', "won't", 'these', "c'mon", 'sunday', 'through', 'him', 'etc', 'regards', "who's", 'whom', 'thru', 'com', 'appropriate', 'knows', 'know', 'seeing', 'goes', 'below', "they'd", 'whereupon', 'na', 'con', "you'll", 'aside', 'old', '4', 'twice', 'across', 'give', 'obviously', 'its', '2013', 'therein', '7', 'ng', 'whatever', 'like', 'to', 'their'])

  4. Yes, but that is wrong, as these are clearly four sentences.

Thanks!

miso-belica commented Mar 5, 2020

1 - It's not completely true. Sumy uses nltk.word_tokenize, and the regex is used only to filter some words out. You are right that it probably should not filter words with - or ', but your version removes NLTK completely and relies only on the regex, and I am not sure that is OK for me, especially when it's not hard to implement and use a custom tokenizer with Sumy. Anyway, thanks for explaining why you decided to go with the more complicated regex :)
3 - Yep, you can use these or any others. That's why I left Sumy open for custom components.
4 - Yes, it is, as far as I can see. Unfortunately, NLTK couldn't detect it. If you have a better implementation of a Python sentence tokenizer, I will be happy to test it and replace NLTK in Sumy with it 👍
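For anyone following along, a minimal sketch of such a pluggable tokenizer, assuming Sumy's duck-typed interface (its parsers call to_sentences() and to_words() on whatever tokenizer they are given; verify against the current sumy.nlp.tokenizers.Tokenizer before relying on this):

```python
import re

class RegexTokenizer:
    """Sketch of a drop-in tokenizer exposing the two methods
    Sumy's parsers are assumed to call: to_sentences(paragraph)
    and to_words(sentence)."""

    # Split after ., ! or ? followed by whitespace (naive: no
    # abbreviation handling, unlike NLTK's punkt).
    _SENTENCES = re.compile(r"(?<=[.!?])\s+")
    # Words with internal hyphens/apostrophes kept whole.
    _WORDS = re.compile(r"\w+(?:['-]\w+)*")

    def to_sentences(self, paragraph):
        return tuple(s for s in self._SENTENCES.split(paragraph) if s)

    def to_words(self, sentence):
        return tuple(self._WORDS.findall(sentence))

tok = RegexTokenizer()
print(tok.to_words("Sugar-free data-mining isn't easy"))
print(tok.to_sentences("One. Two."))
```

It would then be passed where Sumy expects a tokenizer, e.g. PlaintextParser.from_string(text, RegexTokenizer()), though that usage is an assumption to check against the current Sumy docs.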
