User flag to construct CAGRA index with dataset #200

tarang-jain · 2024-06-25T21:54:06Z

Currently, the CAGRA index is always constructed with the dataset if the dataset fits into GPU memory. If it does not fit, we fall back to constructing the index with the graph only. This PR adds a simple flag to index_params that lets the user disable constructing the index with the dataset (i.e. construct it with the graph only instead).

Closes #199

achirkin · 2024-06-26T05:08:47Z

cpp/include/cuvs/neighbors/cagra.hpp

+   *  - `false` means `build` only builds the graph, but
+   * the user is expected to update dataset separately.
+   */
+  bool add_data_on_build = true;


I think we need a different parameter name here. add_data_on_build is already defined for the index_params, and I believe it has a slightly different meaning. It's meant to be used together with the extend function; in IVF-* methods, this is how we allow adding any new data after clustering. In CAGRA in its current form, the user would only be allowed to add the data that is present in the graph (and not via the extend function).

Or, at the very least, use explicit using cuvs::neighbors::index_params::add_data_on_build; instead of shadowing the parent definition.
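To illustrate the shadowing concern, here is a minimal, self-contained sketch. The struct names (base_index_params, shadowing_params, reusing_params) and the helper wants_data are hypothetical stand-ins, not the cuVS types; only the member name add_data_on_build is taken from the diff above.

```cpp
#include <cassert>

// Hypothetical stand-in for the parent parameter struct (not the cuVS type).
struct base_index_params {
  bool add_data_on_build = true;
};

// Redeclaring the member adds a second field that hides the parent's one.
struct shadowing_params : base_index_params {
  bool add_data_on_build = true;
};

// A using-declaration re-exposes the parent's member; no second field exists.
struct reusing_params : base_index_params {
  using base_index_params::add_data_on_build;
};

// Generic code that only sees the parent type reads the parent's field.
inline bool wants_data(const base_index_params& p) { return p.add_data_on_build; }

inline void shadowing_demo() {
  shadowing_params s;
  s.add_data_on_build = false;  // writes the derived field only
  assert(wants_data(s));        // the parent's copy is still true

  reusing_params r;
  r.add_data_on_build = false;  // writes the single, inherited field
  assert(!wants_data(r));       // the parent view sees the change
}
```

With the shadowed member, code that takes the parent type silently ignores the value the user set, which is the kind of confusion the explicit using-declaration avoids.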

I agree with Artem that there is some potential for confusion here, but I am not against repurposing the existing add_data_on_build flag. We should be clear about the intended usage, so the documentation should contain a code example showing how to use build and index.update_dataset() when this flag is enabled. The docstring of update_dataset() should also explain that the same set of vectors is expected to be used for update_dataset and build.
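As a rough illustration of the build-then-update_dataset usage requested here, the following is a CPU-only mock. Everything in it (mock_params, mock_index, mock_build, the row-count check) is a hypothetical stand-in and an assumption about the intended behaviour, not the cuVS API; the real build and update_dataset take a resources handle and device or host matrix views.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameter struct; `false` means build only constructs the graph.
struct mock_params {
  bool add_data_on_build = true;
};

// Hypothetical index: holds a graph size and a non-owning pointer to the data.
struct mock_index {
  std::size_t n_rows = 0;                       // number of graph nodes
  const std::vector<float>* dataset = nullptr;  // non-owning, like a view

  // Attach a dataset after build. Assumption: it must be the same set of
  // vectors used for build; here only the row count is checked.
  bool update_dataset(const std::vector<float>& data, std::size_t dim) {
    if (dim == 0 || data.size() != n_rows * dim) return false;
    dataset = &data;
    return true;
  }
};

// Hypothetical build: graph construction is elided, only bookkeeping is shown.
inline mock_index mock_build(const mock_params& params,
                             const std::vector<float>& data,
                             std::size_t dim) {
  mock_index idx;
  idx.n_rows = (dim == 0) ? 0 : data.size() / dim;
  if (params.add_data_on_build) idx.dataset = &data;  // attach on build
  return idx;
}

inline void build_then_update_demo() {
  std::vector<float> data(4 * 3, 1.0f);  // 4 vectors of dimension 3
  mock_params params;
  params.add_data_on_build = false;      // graph-only build

  mock_index idx = mock_build(params, data, 3);
  assert(idx.dataset == nullptr);        // nothing attached yet
  assert(idx.update_dataset(data, 3));   // user attaches the same vectors
}
```

The mock rejects a dataset whose row count differs from the graph, which is one simple way to surface the "same set of vectors" expectation; whether the real update_dataset validates this is not stated in this thread.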

So I can remove the flag add_data_on_build from cagra::index_params and directly use cuvs::neighbors::add_data_on_build, without overriding it. I can update the documentation of cuvs::neighbors::add_data_on_build stating that it has a different meaning for a CAGRA index.

I have moved add_data_on_build to the ivf index_params and added a new arg for cagra -- populate_data. The PR is also green on CI right now.

tfeher · 2024-06-26T07:36:03Z

@tarang-jain could you clarify the PR title and description? Currently the word "save" is used, which makes me think of serialization, but for serialization we already have the include_dataset flag.

I assume the problem you are addressing is index construction: whether to construct the index with the dataset. In that case the description could be updated to something like: "Currently, the CAGRA index is always constructed with the dataset if the dataset fits into GPU memory. If it does not fit, we fall back to constructing the index with the graph only. This PR adds a simple flag to index_params that lets the user disable constructing the index with the dataset (i.e. construct it with the graph only instead)."

tfeher · 2024-06-26T07:46:39Z

cpp/include/cuvs/neighbors/cagra.hpp

+   *  - `false` means `build` only builds the graph, but
+   * the user is expected to update dataset separately.
+   */
+  bool add_data_on_build = true;


I agree with Artem that there is some potential for confusion here, but I am not against repurposing the existing add_data_on_build flag. We should be clear about the intended usage, so the documentation should contain a code example showing how to use build and index.update_dataset() when this flag is enabled. The docstring of update_dataset() should also explain that the same set of vectors is expected to be used for update_dataset and build.

…expose-public-api

update build API

7cdbcd1

tarang-jain requested a review from a team as a code owner June 25, 2024 21:54

github-actions bot added the cpp label Jun 25, 2024

tarang-jain marked this pull request as draft June 25, 2024 21:55

divyegala approved these changes Jun 25, 2024

tarang-jain marked this pull request as ready for review June 25, 2024 22:36

achirkin suggested changes Jun 26, 2024

tfeher requested changes Jun 26, 2024

cjnolet assigned tarang-jain Jun 26, 2024

cjnolet added the improvement and non-breaking labels Jun 26, 2024

tarang-jain changed the title ~~User flag to save the dataset to the CAGRA index~~ User flag to construct CAGRA index with dataset Jun 26, 2024

tarang-jain added 3 commits June 26, 2024 07:51

updates to docstring;update index_params

7f3ccef

update index params

6dcb3dc

Merge branch 'branch-24.08' of https://github.com/rapidsai/cuvs into …

7040e31

…expose-public-api
