Issues API: Too many collections #4161
Base branch: `dev`
Conversation
Force-pushed from 2ebc202 to 9c79a63
```rust
/// Defines how many collections are considered too many
const MANY_COLLECTIONS: usize = 30;

/// Defines how many points are considered too few as an average per collection
const FEW_POINTS: usize = 10000;
```
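A minimal sketch of how these two constants could combine into a trigger; the function name and the averaging logic are assumptions for illustration, not the PR's actual implementation:

```rust
const MANY_COLLECTIONS: usize = 30;
const FEW_POINTS: usize = 10000;

/// True when there are more than MANY_COLLECTIONS collections and, on
/// average, each holds fewer than FEW_POINTS points (hypothetical check).
fn too_many_collections(point_counts: &[usize]) -> bool {
    if point_counts.len() <= MANY_COLLECTIONS {
        return false;
    }
    let avg = point_counts.iter().sum::<usize>() / point_counts.len();
    avg < FEW_POINTS
}

fn main() {
    assert!(too_many_collections(&[100; 31])); // many sparse collections: triggers
    assert!(!too_many_collections(&[1_000_000; 31])); // many, but dense: no trigger
    assert!(!too_many_collections(&[5; 10])); // few collections: no trigger
}
```

The density condition is what keeps a deployment with many legitimately large collections from being flagged.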
We might want to define this number in kilobytes instead. That represents the actual size in bytes, which is more interesting than a plain point count. We use it in many places, including for our indexing threshold (line 112 in 97c107f):

```yaml
indexing_threshold_kb: 20000
```

Speaking of which, we might want to increase this number to more closely match the default indexing threshold. Note that the threshold is per segment.
Wdyt?
I don't see why kilobytes is a better metric here, since the problem we want to spot is that the user has a dynamic number of collections. E.g. having many named vectors and/or high-dimensional ones does not relate to the root of the issue, IMO.
Would you elaborate on why you think it is a better metric in this case?
Regarding the threshold, I am happy to move it around; the choice was completely arbitrary.
My thought was this: we count collections that are below some point threshold. It would make sense to match that up with the indexing threshold, in which case kilobytes should be used. I thought about it that way because it's one of the problems we usually mention when a lot of collections are created: many of them will be too small to make use of indexing.
But if you designed it a bit differently, or if we just trigger from n collections onward, the specific metric probably doesn't matter much.
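Whether a 10 000-point threshold lines up with the 20 000 kB indexing threshold depends entirely on vector size. A back-of-the-envelope conversion; the 4-bytes-per-dimension (f32) figure and the example dimensions are illustrative assumptions, and payload size is ignored:

```rust
const INDEXING_THRESHOLD_KB: usize = 20_000;

/// Roughly how many points of `dim`-dimensional f32 vectors fit in one
/// segment before the kilobyte-based indexing threshold is reached.
fn points_under_threshold(dim: usize) -> usize {
    // 4 bytes per f32 dimension; payload and named vectors ignored.
    INDEXING_THRESHOLD_KB * 1024 / (dim * 4)
}

fn main() {
    assert_eq!(points_under_threshold(128), 40_000); // small vectors
    assert_eq!(points_under_threshold(768), 6_666); // typical embedding size
}
```

This is why the two metrics diverge: for high-dimensional vectors, far fewer than 10 000 points already reach the indexing threshold.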
> we want to spot is that the user has a dynamic number of collections.
Could you elaborate on what this means, a dynamic number of collections?
Wouldn't we just trigger this issue above 30 (?) collections, no matter their contents?
For me, a dynamic number of collections means that creation or deletion of collections comes from user interaction, not from admin decisions.
Valid cases for creating new collections might be:
- using new embedding models
- refactoring, new features, etc.
I think it might still be possible to reach the fixed limit while having "correct" usage, that's why I included the density threshold too.
Force-pushed from a5638e5 to 26758c3
Force-pushed from 9c79a63 to a0a23cf
Force-pushed from a0a23cf to 72f9cd6
Supersedes #3652: same thing, but now event-based.
Tracked in #3471
Needs #4139
This PR determines that there are too many collections when all of the following are true:
The current way of counting the number of points in the collections is using the `count` API (approximate).

All Submissions:

- [ ] Contributions should target the `dev` branch. Did you create your branch from `dev`?

New Feature Submissions:

- [ ] Have you formatted your code locally using the `cargo +nightly fmt --all` command prior to submission?
- [ ] Have you checked your code using the `cargo clippy --all --all-features` command?