Very slow performance on large sets #11

briannadon · 2022-01-28T19:29:50Z

Hi, this package looks really cool and I'd love to use it for my use case.

I have about 7,000 sets with about 1,000 elements each that I'm using as my index. I also have a set of about 1,000 queries with similar sizes, about 1,000 elements each, as queries. However, when I profile the times for queries for this package vs. datasketch's MinHashLSHEnsemble method, the results are pretty wildly off-base from the numbers presented in the readme.

In general, a single minhash LSH ensemble query in my case is taking about 10ms, and the SetSimilarity query is taking anywhere from 300ms to 500ms, even whole seconds in some cases. Are these numbers to be expected, and is SetSimilaritySearch simply not suitable for sets this large? My sets are exclusively integers, if that matters.

Any insight or help is appreciated.

The text was updated successfully, but these errors were encountered:

ekzhu · 2022-06-07T06:03:08Z

You are correct that this package is not suitable comparing to datasketch when it comes to larger sets. The benchmark datasets used in README have average set size around 20-30.

For this package, the query time is directly proportional to the size of the query set (# of tokens). It is also heavily influenced by the size of indexed sets because exact set similarity calculation is made at query time for candidate sets.

There are some algorithmic tricks that are designed to handle exact search over large sets. I made one: https://github.com/ekzhu/josie. I haven't had time to make it available here.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Very slow performance on large sets #11

Very slow performance on large sets #11

briannadon commented Jan 28, 2022

ekzhu commented Jun 7, 2022

Very slow performance on large sets #11

Very slow performance on large sets #11

Comments

briannadon commented Jan 28, 2022

ekzhu commented Jun 7, 2022