CAGRA new vector addition #151

enp1s0 · 2024-05-24T08:50:20Z

This PR introduces the new vector addition feature to CAGRA.

Rel: rapidsai/raft#1775
Original PR: rapidsai/raft#2157

CAGRA-Q is not supported

Usage

auto additional_dataset = raft::make_host_matrix<float, int64_t>(res,updated_dataset_size, dim);
cuvs::neighbors::cagra::extend(handle, raft::make_const_mdspan(additiona_dataset.view()), cagra_index);

Algorithm

Graph degree: d

The algorithm consists of two stages: rank-based reordering and reverse edge addition.

Rank-based reordering
1-1. Obtain d' (=2d) nearest neighbor vectors (V) of a given new vector using the CAGRA search
1-2. Count the number of detourable edges using the result of step 1 and the neighbor list of the input index. Then we prune (3*d/2) edges in the same way as the CAGRA graph optimization. Through this operation, we decide d/2 neighbors.
Reverse edge addition
2-1. Count the number of incoming edges for all nodes.
2-2. Add d/2 reverse edges from the nodes added to the neighbor list in Step 1 by replacing a node with a new node. To prevent the connection to the replaced node from being lost, we add the node to the neighbor list of the new node. This allow us to make a detour connection. The replaced nodes are the largest number of incoming edge nodes in the 2/d nodes from the back of the neighbor list without duplication with the nodes already in the neighbor list.

Performance

In this experiment, we first split the dataset into two parts: the initial and the additional part. Then, we extend the CAGRA index built by the initial part to include the additional part.

We can see a larger recall drop compared to the baseline by increasing the number of added vectors.
Therefore, rebuilding the CAGRA index is recommended when one wants to add a lot of vectors.

tfeher

Thanks @enp1s0 for the PR! The implementation was already reviewed in rapidsai/raft#2157. The PR looks good to me.

tfeher · 2024-05-27T09:43:10Z

Tests were failing on CI (CAGRA test hanging), restarting tests. According to @enp1s0 locally test were passing.

tfeher · 2024-05-27T13:43:07Z

AnnCagraAddNodesTest hangs on V100, I could reproduce it locally.

tfeher · 2024-05-27T21:41:15Z

Some raft -> cuvs namespace change were missing and the instantiated cagra::extend code was calling itself recursively due to a missing detail scope. The latter lead to the test hanging. These issues are fixed now.

cjnolet

This is definitely a very welcome feature and it would be great to get this into 24.06. I haven't gone through the guts of the implementation yet but I wanted to leave some initial feedback about the public APIs to start.

cpp/include/cuvs/neighbors/cagra.hpp

cpp/src/neighbors/cagra.cuh

enp1s0 · 2024-05-28T15:10:25Z

Thank you @cjnolet @tfeher for reviewing and fixing the bug. I fixed the docs.

enp1s0 · 2024-05-30T13:22:07Z

@cjnolet I fixed the code. Can you check it again?

cjnolet

Thanks again for the recent updates @enp1s0 and please accept my sincere apology that it's taken some time for me to get back to this PR. This past release turned out to require a significant amount of attention since it's the first release with the Ann code officially migrated from RAFT.

I think these changes are shaping up but they aren't yet ready to be merged. It's very important that we focus on the user experience in the public APIs as a first and highest priority. That means making sure the docs are usable and easily searchable, and that the APIs are consistent and easy to use.

cpp/include/cuvs/neighbors/cagra.hpp

cjnolet · 2024-05-31T14:55:09Z

cpp/include/cuvs/neighbors/cagra.hpp

+    updated_dataset_view,
+  const cuvs::neighbors::cagra::index<float, uint32_t>& idx,
+  raft::host_matrix_view<uint32_t, std::int64_t> updated_graph_view,
+  const cagra::extend_params& params);


This ordering is not consistent with the way we order arguments in public API functions. It should be:

Resource handle

Params structs

Input args

Output Args

Any free param arguments

cpp/include/cuvs/neighbors/cagra.hpp

cjnolet · 2024-05-31T15:01:11Z

cpp/include/cuvs/neighbors/cagra.hpp

+ * @param[in] handle raft resources
+ * @param[in] input_updated_dataset_view updated dataset (initial + additional dataset)
+ * @param[in] index CAGRA index
+ * @param[out] updated_graph_view updated graph


This seems awkward to return to the user- what are we expecting them to do with this parameter? From the user's perspective, this should be getting updated inside the index itself, shouldn't it?

cjnolet · 2024-05-31T15:07:09Z

cpp/include/cuvs/neighbors/cagra.hpp

+ * @tparam IdxT type of the indices
+ *
+ * @param[in] handle raft resources
+ * @param[in] input_updated_dataset_view updated dataset (initial + additional dataset)


I'm really struggling with the need to expose this function through the public APIs, especially when the cagra index contains methods for the dataset to be attached or reattached after the fact.

End users generally interact with these index APIs as a black box- that is, they invoke cuVS APIs to build and populate an index object and then they invoke functions to act upon that index object without ever having to care what's actually being done in the index object.

I understand the want to provide helper functions for power users but this isn't the correct way to provide these types of functions. If the only reason for exposing this function is because we want to provide a way for the user to give the dataset with the additional vectors appended then I think we need to put some additional thought into the existing "update_dataset" methods on the index and add an option to the extend params that affords more control to the user.

The problem is that this function assumes the user knows or cares about a graph object, which is intentionally being hidden inside the index. If we remove the graph output, we are left with the extend() function and what I think should be an option for the user to specify whether they're attaching the whole detaset with the new vectors appended. Does that make sense?

Yes, I agree that providing the APIs as a black box would be better for most users. And providing an option to specify whether the passed dataset is an entire one sounds good. An alternative way is to provide a function to expand the memory space (num vectors) for the dataset and graph in the CAGRA index. We allocate the host memory for the new dataset because we need to replace the old dataset with the new one on the device memory. To do this, we 1) copy the old dataset on the host memory, 2) deallocate the device memory, 3) allocate the new memory space on the device memory, and 4) copy the old dataset and additional dataset to the device memory (I noticed that we only need to allocate the host memory for the old dataset). It would be beneficial if there is a functionality to expand the memory space for new vectors because we don't need to reallocate the dataset memory space as above every time as long as the memory space is left.

cjnolet · 2024-06-05T03:28:37Z

I'm not opposed to the feature that add_graph_nodes is providing. We should find a way to enable it through extend(), so that the user has a unified experience. We do something similar with the IVF methods where we provide an option to over allocate the lists so adding new vectors can be done without re-allocating memory. We should be able to do something similar here.

…to cagra-new-vector-addition

cjnolet

I think your most recent changes look much better with some fairly minor things at this point. We need some more documentation around the why/when/how. I also think I'd prefer to keep the two additional optional args, rather than combine them into a new vocab item.

cjnolet · 2024-07-01T19:55:26Z

cpp/include/cuvs/neighbors/cagra.hpp

@@ -173,6 +174,43 @@ struct search_params : cuvs::neighbors::search_params {
 /**
 * @}
 */
+
+/**
+ * @defgroup cagra_cpp_extend_params CAGRA index extend parameters


It's not completely clear from the documentation and code examples how and when the user might use this. Can you expand the descriptions below to explain why and when these additional arguments are needed?

I fixed the explanation of the parameter. Please let me know if the updated explanation is not sufficient.

cjnolet · 2024-07-01T19:58:40Z

cpp/include/cuvs/neighbors/cagra.hpp

+ * @param[in] params extend params
+ * @param[in] additional_dataset additional dataset on device memory
+ * @param[in,out] idx CAGRA index
+ * @param[in] new_memory_buffers (Optional) user-allocated memory buffers for the extended dataset


From what I've gathered, these new memory buffers are outputs, right? It's unclear from the docs and examples alone how these are used and why a user might want to use this. To be completetely honest, I'd more prefer using 2 std::optional args instead of adding the additional struct to the API's vocabulary. I understand the consolidation but a user now has to look up what this new struct does instead of just being able to see the params documented above the functions.

I changed the arguments of the extend function to pass the dataset and graph optional views.

enp1s0 · 2024-07-05T07:54:02Z

@cjnolet Thank you for reviewing the code. I have fixed it according to your comment. Can you check it again?

tfeher

The recent changes look good to me: extend now allows fine control of dataset and graph placement if needed.

cjnolet

The changes look great, @enp1s0! So happy to have this feature.

raydouglass · 2024-07-09T19:26:49Z

Removing ops-codeowners from the required reviews since it doesn't seem there are any file changes that we're responsible for. Feel free to add us back if necessary.

cjnolet · 2024-07-09T19:32:08Z

/merge

enp1s0 requested review from a team as code owners May 24, 2024 08:50

github-actions bot added cpp CMake labels May 24, 2024

enp1s0 added feature request New feature or request non-breaking Introduces a non-breaking change labels May 24, 2024

cjnolet assigned enp1s0 May 24, 2024

tfeher approved these changes May 27, 2024

View reviewed changes

cjnolet requested changes May 28, 2024

View reviewed changes

cpp/include/cuvs/neighbors/cagra.hpp Outdated Show resolved Hide resolved

cpp/include/cuvs/neighbors/cagra.hpp Outdated Show resolved Hide resolved

cpp/include/cuvs/neighbors/cagra.hpp Outdated Show resolved Hide resolved

cpp/src/neighbors/cagra.cuh Outdated Show resolved Hide resolved

enp1s0 requested a review from a team as a code owner May 28, 2024 15:50

cjnolet requested changes May 31, 2024

View reviewed changes

enp1s0 and others added 14 commits June 3, 2024 01:52

Port new vector addition from RAFT PR

7765726

Fix style

a525e0e

Fix assertion

78b3595

Fix cuvs API

2277221

Fix the recursive func call issue by adding template params explicitly

48b004f

Update docs for new vector addition

039ce04

Move the comments of cagra::build and cagra::search to hpp

a5e4dda

Add doxygengroup

aee0af6

Add extend_params

4ce176b

Remove ref to build_knn_graph

0a340d0

Fix CAGRA config log in test

67e5363

Update CAGRA new vector addition test

b8e096b

Fix new vector addition test

462be18

Fix CAGRA test (build_algo)

289e774

enp1s0 and others added 3 commits June 14, 2024 19:20

Add support for non_owning_dataset in new vector addition

2bd62e7

Merge branch 'branch-24.08' into cagra-new-vector-addition

e40a3bf

Merge branch 'cagra-new-vector-addition' of github.com:enp1s0/cuvs in…

56b3ab1

…to cagra-new-vector-addition

github-actions bot removed the ci label Jun 14, 2024

enp1s0 and others added 8 commits June 14, 2024 19:22

Remove add_graph_nodes from neighbor_cagra.rst

630a6bf

Update cagra::extent doc

dbcfbe6

Merge branch 'branch-24.08' into cagra-new-vector-addition

ec1f927

Fix dataset copy

b47386a

Fix docs

d852bcf

Fix cagra::extend args order

32c111e

Merge branch 'branch-24.08' into cagra-new-vector-addition

f3a76af

Fix docs for cagra::extend

a0f08f0

enp1s0 changed the title ~~CAGRA new vector addition~~ [WIP] CAGRA new vector addition Jul 1, 2024

cjnolet requested changes Jul 1, 2024

View reviewed changes

cjnolet and others added 5 commits July 1, 2024 16:27

Merge branch 'branch-24.08' into cagra-new-vector-addition

2e07a6f

Add non_owning_dataset test

bdbdd6f

Fix cagra::extend API

a78961f

Update the explanation of extend_params

e891d21

Merge branch 'branch-24.08' into cagra-new-vector-addition

9869e80

enp1s0 changed the title ~~[WIP] CAGRA new vector addition~~ CAGRA new vector addition Jul 5, 2024

tfeher approved these changes Jul 7, 2024

View reviewed changes

Merge branch 'branch-24.08' into cagra-new-vector-addition

368dfeb

cjnolet approved these changes Jul 9, 2024

View reviewed changes

raydouglass removed the request for review from a team July 9, 2024 19:26

rapids-bot bot merged commit bf53940 into rapidsai:branch-24.08 Jul 9, 2024
54 checks passed

cjnolet mentioned this pull request Aug 27, 2024

[QST] updating an index #295

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CAGRA new vector addition #151

CAGRA new vector addition #151

enp1s0 commented May 24, 2024 •

edited

Loading

tfeher left a comment

tfeher commented May 27, 2024

tfeher commented May 27, 2024

tfeher commented May 27, 2024

cjnolet left a comment

enp1s0 commented May 28, 2024

enp1s0 commented May 30, 2024

cjnolet left a comment •

edited

Loading

cjnolet May 31, 2024

cjnolet May 31, 2024

cjnolet May 31, 2024

enp1s0 Jun 3, 2024

cjnolet commented Jun 5, 2024 •

edited

Loading

cjnolet left a comment

cjnolet Jul 1, 2024

enp1s0 Jul 3, 2024

cjnolet Jul 1, 2024

enp1s0 Jul 2, 2024

enp1s0 commented Jul 5, 2024

tfeher left a comment

cjnolet left a comment

raydouglass commented Jul 9, 2024

cjnolet commented Jul 9, 2024

CAGRA new vector addition #151

CAGRA new vector addition #151

Conversation

enp1s0 commented May 24, 2024 • edited Loading

Usage

Algorithm

Performance

tfeher left a comment

Choose a reason for hiding this comment

tfeher commented May 27, 2024

tfeher commented May 27, 2024

tfeher commented May 27, 2024

cjnolet left a comment

Choose a reason for hiding this comment

enp1s0 commented May 28, 2024

enp1s0 commented May 30, 2024

cjnolet left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cjnolet commented Jun 5, 2024 • edited Loading

cjnolet left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

enp1s0 commented Jul 5, 2024

tfeher left a comment

Choose a reason for hiding this comment

cjnolet left a comment

Choose a reason for hiding this comment

raydouglass commented Jul 9, 2024

cjnolet commented Jul 9, 2024

enp1s0 commented May 24, 2024 •

edited

Loading

cjnolet left a comment •

edited

Loading

cjnolet commented Jun 5, 2024 •

edited

Loading