-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
User flag to construct CAGRA index with dataset #200
base: branch-24.08
Are you sure you want to change the base?
Conversation
cpp/include/cuvs/neighbors/cagra.hpp
Outdated
* - `false` means `build` only builds the graph, but | ||
* the user is expected to update dataset separately. | ||
*/ | ||
bool add_data_on_build = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a different parameter name here. add_data_on_build
is already defined for the index_params, and I believe it has a slightly different meaning. It's meant to be used together with extend
function; in IVF-* methods we allow adding any new data after clustering this way. In CAGRA in the current form, the user would only be allowed to add the data that is present in the graph (and not via extend
function).
Or, at the very least, use explicit using cuvs::neighbors::index_params::add_data_on_build;
instead of shadowing the parent definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Artem that there is some potential for confusion here, but I am not against repurposing the existing add_data_on_build
flag. We should be clear what is the intended usage, therefore the documentation shall contain a code example on how to use build
& index.update_dataset()
when this flag is enabled. Also the docstring of update_dataset() should explain, that it is expected that the same set of vectors should be are used for update_dataset
and build
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I can remove the flag add_data_on_build
from cagra::index_params
and directly use cuvs::neighbors::add_data_on_build
, without overriding it. I can update the documentation of cuvs::neighbors::add_data_on_build
stating that it has a different meaning for a CAGRA index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have moved add_data_on_build
to the ivf index_params
and added a new arg for cagra -- populate_data
. The PR is also green on CI right now.
@tarang-jain could you clarify the PR title and description? Currently the word "save" is used, and that makes me associate to serialization, but for serialization we already have the I assume that problem you are addressing is index construction: whether to construct the index with the dataset. In that case the description could be updated like: "currently we would always construct the CAGAR index with dataset if the dataset fits into GPU mem. If it does not fit, then we fall back to constructing index with graph only. In this PR a simple flag is added to index_params to allow the user a choice if they want to disable constructing the index with the dataset (i.e. only construct with the only graph instead)." |
cpp/include/cuvs/neighbors/cagra.hpp
Outdated
* - `false` means `build` only builds the graph, but | ||
* the user is expected to update dataset separately. | ||
*/ | ||
bool add_data_on_build = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Artem that there is some potential for confusion here, but I am not against repurposing the existing add_data_on_build
flag. We should be clear what is the intended usage, therefore the documentation shall contain a code example on how to use build
& index.update_dataset()
when this flag is enabled. Also the docstring of update_dataset() should explain, that it is expected that the same set of vectors should be are used for update_dataset
and build
.
Currently we would always construct the CAGAR index with dataset if the dataset fits into GPU mem. If it does not fit, then we fall back to constructing index with graph only. In this PR a simple flag is added to
index_params
to allow the user a choice if they want to disable constructing the index with the dataset (i.e. only construct with the only graph instead).Closes #199