Very slow performance on large sets #11

Open
briannadon opened this issue Jan 28, 2022 · 1 comment

@briannadon

Hi, this package looks really cool and I'd love to use it for my use case.

My index is about 7,000 sets with about 1,000 elements each, and I have about 1,000 query sets of similar size (about 1,000 elements each). However, when I profile query times for this package against datasketch's MinHashLSHEnsemble, the results are far from the numbers presented in the README.

In my case, a single MinHash LSH Ensemble query takes about 10ms, while a SetSimilaritySearch query takes anywhere from 300ms to 500ms, and whole seconds in some cases. Are these numbers expected, and is SetSimilaritySearch simply not suitable for sets this large? My sets contain only integers, if that matters.

Any insight or help is appreciated.

@ekzhu
Owner

ekzhu commented Jun 7, 2022

You are correct that this package is less suitable than datasketch for larger sets. The benchmark datasets used in the README have an average set size of around 20-30.

For this package, the query time is directly proportional to the size of the query set (number of tokens). It is also heavily influenced by the size of the indexed sets, because the exact set similarity is computed at query time for each candidate set.
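
To make the two cost factors concrete, here is a minimal sketch of an inverted-index search with exact verification. It is a simplified illustration, not this package's implementation: `NaiveSetIndex` and `jaccard` are hypothetical names made up for the example, and the real package may prune candidates in ways this sketch omits.

```python
from collections import defaultdict


def jaccard(a, b):
    """Exact Jaccard similarity between two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class NaiveSetIndex:
    """Toy inverted index (hypothetical, for illustration only)."""

    def __init__(self, sets):
        self.sets = [set(s) for s in sets]
        # token -> ids of the indexed sets that contain it
        self.postings = defaultdict(list)
        for set_id, s in enumerate(self.sets):
            for token in s:
                self.postings[token].append(set_id)

    def query(self, query, threshold):
        query = set(query)
        # Step 1: one posting-list lookup per query token,
        # so this loop grows with the size of the query set.
        candidates = set()
        for token in query:
            candidates.update(self.postings.get(token, ()))
        # Step 2: exact similarity for every candidate. Each call
        # touches both sets, so it grows with the indexed set size.
        results = []
        for set_id in candidates:
            sim = jaccard(query, self.sets[set_id])
            if sim >= threshold:
                results.append((set_id, sim))
        # Highest similarity first
        return sorted(results, key=lambda r: -r[1])


index = NaiveSetIndex([[1, 2, 3], [2, 3, 4], [7, 8, 9]])
print(index.query([2, 3, 4, 5], threshold=0.5))  # [(1, 0.75)]
```

With 1,000-token queries against 1,000-token indexed sets, both steps do far more work per query than with sets of 20-30 tokens, which is consistent with the slowdown reported above.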

There are some algorithmic tricks that are designed to handle exact search over large sets. I made one: https://github.com/ekzhu/josie. I haven't had time to make it available here.
