Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User flag to construct CAGRA index with dataset #200

Open
wants to merge 4 commits into
base: branch-24.08
Choose a base branch
from

Conversation

tarang-jain
Copy link

@tarang-jain tarang-jain commented Jun 25, 2024

Currently we would always construct the CAGAR index with dataset if the dataset fits into GPU mem. If it does not fit, then we fall back to constructing index with graph only. In this PR a simple flag is added to index_params to allow the user a choice if they want to disable constructing the index with the dataset (i.e. only construct with the only graph instead).

Closes #199

@tarang-jain tarang-jain requested a review from a team as a code owner June 25, 2024 21:54
@github-actions github-actions bot added the cpp label Jun 25, 2024
@tarang-jain tarang-jain marked this pull request as draft June 25, 2024 21:55
@tarang-jain tarang-jain marked this pull request as ready for review June 25, 2024 22:36
* - `false` means `build` only builds the graph, but
* the user is expected to update dataset separately.
*/
bool add_data_on_build = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a different parameter name here. add_data_on_build is already defined for the index_params, and I believe it has a slightly different meaning. It's meant to be used together with extend function; in IVF-* methods we allow adding any new data after clustering this way. In CAGRA in the current form, the user would only be allowed to add the data that is present in the graph (and not via extend function).

Or, at the very least, use explicit using cuvs::neighbors::index_params::add_data_on_build; instead of shadowing the parent definition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Artem that there is some potential for confusion here, but I am not against repurposing the existing add_data_on_build flag. We should be clear what is the intended usage, therefore the documentation shall contain a code example on how to use build & index.update_dataset() when this flag is enabled. Also the docstring of update_dataset() should explain, that it is expected that the same set of vectors should be are used for update_dataset and build.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I can remove the flag add_data_on_build from cagra::index_params and directly use cuvs::neighbors::add_data_on_build, without overriding it. I can update the documentation of cuvs::neighbors::add_data_on_build stating that it has a different meaning for a CAGRA index.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have moved add_data_on_build to the ivf index_params and added a new arg for cagra -- populate_data. The PR is also green on CI right now.

@tfeher
Copy link
Contributor

tfeher commented Jun 26, 2024

@tarang-jain could you clarify the PR title and description? Currently the word "save" is used, and that makes me associate to serialization, but for serialization we already have the include_dataset flag.

I assume that problem you are addressing is index construction: whether to construct the index with the dataset. In that case the description could be updated like: "currently we would always construct the CAGAR index with dataset if the dataset fits into GPU mem. If it does not fit, then we fall back to constructing index with graph only. In this PR a simple flag is added to index_params to allow the user a choice if they want to disable constructing the index with the dataset (i.e. only construct with the only graph instead)."

* - `false` means `build` only builds the graph, but
* the user is expected to update dataset separately.
*/
bool add_data_on_build = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Artem that there is some potential for confusion here, but I am not against repurposing the existing add_data_on_build flag. We should be clear what is the intended usage, therefore the documentation shall contain a code example on how to use build & index.update_dataset() when this flag is enabled. Also the docstring of update_dataset() should explain, that it is expected that the same set of vectors should be are used for update_dataset and build.

@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jun 26, 2024
@tarang-jain tarang-jain changed the title User flag to save the dataset to the CAGRA index User flag to construct CAGRA index with dataset Jun 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

[FEA] Expose CAGRA construct_index_with_dataset flag to public API
5 participants