Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CAGRA new vector addition #151

Merged
merged 31 commits into from
Jul 9, 2024

Conversation

enp1s0
Copy link
Member

@enp1s0 enp1s0 commented May 24, 2024

This PR introduces the new vector addition feature to CAGRA.

Rel: rapidsai/raft#1775
Original PR: rapidsai/raft#2157

CAGRA-Q is not supported

Usage

auto additional_dataset = raft::make_host_matrix<float, int64_t>(res,updated_dataset_size, dim);
cuvs::neighbors::cagra::extend(handle, raft::make_const_mdspan(additiona_dataset.view()), cagra_index);

Algorithm

Graph degree: d

The algorithm consists of two stages: rank-based reordering and reverse edge addition.

  1. Rank-based reordering
    1-1. Obtain d' (=2d) nearest neighbor vectors (V) of a given new vector using the CAGRA search
    1-2. Count the number of detourable edges using the result of step 1 and the neighbor list of the input index. Then we prune (3*d/2) edges in the same way as the CAGRA graph optimization. Through this operation, we decide d/2 neighbors.
  2. Reverse edge addition
    2-1. Count the number of incoming edges for all nodes.
    2-2. Add d/2 reverse edges from the nodes added to the neighbor list in Step 1 by replacing a node with a new node. To prevent the connection to the replaced node from being lost, we add the node to the neighbor list of the new node. This allow us to make a detour connection. The replaced nodes are the largest number of incoming edge nodes in the 2/d nodes from the back of the neighbor list without duplication with the nodes already in the neighbor list.

Performance

In this experiment, we first split the dataset into two parts: the initial and the additional part. Then, we extend the CAGRA index built by the initial part to include the additional part.
search-eval

We can see a larger recall drop compared to the baseline by increasing the number of added vectors.
Therefore, rebuilding the CAGRA index is recommended when one wants to add a lot of vectors.

@enp1s0 enp1s0 requested review from a team as code owners May 24, 2024 08:50
@enp1s0 enp1s0 added feature request New feature or request non-breaking Introduces a non-breaking change labels May 24, 2024
Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @enp1s0 for the PR! The implementation was already reviewed in rapidsai/raft#2157. The PR looks good to me.

@tfeher
Copy link
Contributor

tfeher commented May 27, 2024

Tests were failing on CI (CAGRA test hanging), restarting tests. According to @enp1s0 locally test were passing.

@tfeher
Copy link
Contributor

tfeher commented May 27, 2024

AnnCagraAddNodesTest hangs on V100, I could reproduce it locally.

@tfeher
Copy link
Contributor

tfeher commented May 27, 2024

Some raft -> cuvs namespace change were missing and the instantiated cagra::extend code was calling itself recursively due to a missing detail scope. The latter lead to the test hanging. These issues are fixed now.

Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely a very welcome feature and it would be great to get this into 24.06. I haven't gone through the guts of the implementation yet but I wanted to leave some initial feedback about the public APIs to start.

cpp/include/cuvs/neighbors/cagra.hpp Outdated Show resolved Hide resolved
cpp/include/cuvs/neighbors/cagra.hpp Outdated Show resolved Hide resolved
cpp/include/cuvs/neighbors/cagra.hpp Outdated Show resolved Hide resolved
cpp/src/neighbors/cagra.cuh Outdated Show resolved Hide resolved
@enp1s0
Copy link
Member Author

enp1s0 commented May 28, 2024

Thank you @cjnolet @tfeher for reviewing and fixing the bug. I fixed the docs.

@enp1s0 enp1s0 requested a review from a team as a code owner May 28, 2024 15:50
@enp1s0
Copy link
Member Author

enp1s0 commented May 30, 2024

@cjnolet I fixed the code. Can you check it again?

Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for the recent updates @enp1s0 and please accept my sincere apology that it's taken some time for me to get back to this PR. This past release turned out to require a significant amount of attention since it's the first release with the Ann code officially migrated from RAFT.

I think these changes are shaping up but they aren't yet ready to be merged. It's very important that we focus on the user experience in the public APIs as a first and highest priority. That means making sure the docs are usable and easily searchable, and that the APIs are consistent and easy to use.

cpp/include/cuvs/neighbors/cagra.hpp Outdated Show resolved Hide resolved
updated_dataset_view,
const cuvs::neighbors::cagra::index<float, uint32_t>& idx,
raft::host_matrix_view<uint32_t, std::int64_t> updated_graph_view,
const cagra::extend_params& params);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ordering is not consistent with the way we order arguments in public API functions. It should be:

  1. Resource handle
  2. Params structs
  3. Input args
  4. Output Args
  5. Any free param arguments

cpp/include/cuvs/neighbors/cagra.hpp Outdated Show resolved Hide resolved
cpp/include/cuvs/neighbors/cagra.hpp Outdated Show resolved Hide resolved
* @param[in] handle raft resources
* @param[in] input_updated_dataset_view updated dataset (initial + additional dataset)
* @param[in] index CAGRA index
* @param[out] updated_graph_view updated graph
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems awkward to return to the user- what are we expecting them to do with this parameter? From the user's perspective, this should be getting updated inside the index itself, shouldn't it?

* @tparam IdxT type of the indices
*
* @param[in] handle raft resources
* @param[in] input_updated_dataset_view updated dataset (initial + additional dataset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really struggling with the need to expose this function through the public APIs, especially when the cagra index contains methods for the dataset to be attached or reattached after the fact.

End users generally interact with these index APIs as a black box- that is, they invoke cuVS APIs to build and populate an index object and then they invoke functions to act upon that index object without ever having to care what's actually being done in the index object.

I understand the want to provide helper functions for power users but this isn't the correct way to provide these types of functions. If the only reason for exposing this function is because we want to provide a way for the user to give the dataset with the additional vectors appended then I think we need to put some additional thought into the existing "update_dataset" methods on the index and add an option to the extend params that affords more control to the user.

The problem is that this function assumes the user knows or cares about a graph object, which is intentionally being hidden inside the index. If we remove the graph output, we are left with the extend() function and what I think should be an option for the user to specify whether they're attaching the whole detaset with the new vectors appended. Does that make sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that providing the APIs as a black box would be better for most users. And providing an option to specify whether the passed dataset is an entire one sounds good. An alternative way is to provide a function to expand the memory space (num vectors) for the dataset and graph in the CAGRA index. We allocate the host memory for the new dataset because we need to replace the old dataset with the new one on the device memory. To do this, we 1) copy the old dataset on the host memory, 2) deallocate the device memory, 3) allocate the new memory space on the device memory, and 4) copy the old dataset and additional dataset to the device memory (I noticed that we only need to allocate the host memory for the old dataset). It would be beneficial if there is a functionality to expand the memory space for new vectors because we don't need to reallocate the dataset memory space as above every time as long as the memory space is left.

@cjnolet
Copy link
Member

cjnolet commented Jun 5, 2024

I'm not opposed to the feature that add_graph_nodes is providing. We should find a way to enable it through extend(), so that the user has a unified experience. We do something similar with the IVF methods where we provide an option to over allocate the lists so adding new vectors can be done without re-allocating memory. We should be able to do something similar here.

@github-actions github-actions bot removed the ci label Jun 14, 2024
@enp1s0 enp1s0 changed the title CAGRA new vector addition [WIP] CAGRA new vector addition Jul 1, 2024
Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your most recent changes look much better with some fairly minor things at this point. We need some more documentation around the why/when/how. I also think I'd prefer to keep the two additional optional args, rather than combine them into a new vocab item.

@@ -173,6 +174,43 @@ struct search_params : cuvs::neighbors::search_params {
/**
* @}
*/

/**
* @defgroup cagra_cpp_extend_params CAGRA index extend parameters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not completely clear from the documentation and code examples how and when the user might use this. Can you expand the descriptions below to explain why and when these additional arguments are needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the explanation of the parameter. Please let me know if the updated explanation is not sufficient.

* @param[in] params extend params
* @param[in] additional_dataset additional dataset on device memory
* @param[in,out] idx CAGRA index
* @param[in] new_memory_buffers (Optional) user-allocated memory buffers for the extended dataset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I've gathered, these new memory buffers are outputs, right? It's unclear from the docs and examples alone how these are used and why a user might want to use this. To be completetely honest, I'd more prefer using 2 std::optional args instead of adding the additional struct to the API's vocabulary. I understand the consolidation but a user now has to look up what this new struct does instead of just being able to see the params documented above the functions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the arguments of the extend function to pass the dataset and graph optional views.

@enp1s0
Copy link
Member Author

enp1s0 commented Jul 5, 2024

@cjnolet Thank you for reviewing the code. I have fixed it according to your comment. Can you check it again?

@enp1s0 enp1s0 changed the title [WIP] CAGRA new vector addition CAGRA new vector addition Jul 5, 2024
Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recent changes look good to me: extend now allows fine control of dataset and graph placement if needed.

Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look great, @enp1s0! So happy to have this feature.

@raydouglass
Copy link
Member

Removing ops-codeowners from the required reviews since it doesn't seem there are any file changes that we're responsible for. Feel free to add us back if necessary.

@raydouglass raydouglass removed the request for review from a team July 9, 2024 19:26
@cjnolet
Copy link
Member

cjnolet commented Jul 9, 2024

/merge

@rapids-bot rapids-bot bot merged commit bf53940 into rapidsai:branch-24.08 Jul 9, 2024
54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake cpp feature request New feature or request non-breaking Introduces a non-breaking change Python
Projects
Development

Successfully merging this pull request may close these issues.

4 participants