You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hi, this package looks really cool and I'd love to use it for my use case.
I have about 7,000 sets with about 1,000 elements each that I'm using as my index. I also have a set of about 1,000 queries with similar sizes, about 1,000 elements each, as queries. However, when I profile the times for queries for this package vs. datasketch's MinHashLSHEnsemble method, the results are pretty wildly off-base from the numbers presented in the readme.
In general, a single minhash LSH ensemble query in my case is taking about 10ms, and the SetSimilarity query is taking anywhere from 300ms to 500ms, even whole seconds in some cases. Are these numbers to be expected, and is SetSimilaritySearch simply not suitable for sets this large? My sets are exclusively integers, if that matters.
Any insight or help is appreciated.
The text was updated successfully, but these errors were encountered:
You are correct that this package is not suitable comparing to datasketch when it comes to larger sets. The benchmark datasets used in README have average set size around 20-30.
For this package, the query time is directly proportional to the size of the query set (# of tokens). It is also heavily influenced by the size of indexed sets because exact set similarity calculation is made at query time for candidate sets.
There are some algorithmic tricks that are designed to handle exact search over large sets. I made one: https://github.com/ekzhu/josie. I haven't had time to make it available here.
Hi, this package looks really cool and I'd love to use it for my use case.
I have about 7,000 sets with about 1,000 elements each that I'm using as my index. I also have a set of about 1,000 queries with similar sizes, about 1,000 elements each, as queries. However, when I profile the times for queries for this package vs.
datasketch
'sMinHashLSHEnsemble
method, the results are pretty wildly off-base from the numbers presented in the readme.In general, a single minhash LSH ensemble query in my case is taking about 10ms, and the SetSimilarity query is taking anywhere from 300ms to 500ms, even whole seconds in some cases. Are these numbers to be expected, and is SetSimilaritySearch simply not suitable for sets this large? My sets are exclusively integers, if that matters.
Any insight or help is appreciated.
The text was updated successfully, but these errors were encountered: