-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CAGRA new vector addition #151
CAGRA new vector addition #151
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @enp1s0 for the PR! The implementation was already reviewed in rapidsai/raft#2157. The PR looks good to me.
Tests were failing on CI (CAGRA test hanging), restarting tests. According to @enp1s0 locally test were passing. |
|
Some |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely a very welcome feature and it would be great to get this into 24.06. I haven't gone through the guts of the implementation yet but I wanted to leave some initial feedback about the public APIs to start.
@cjnolet I fixed the code. Can you check it again? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for the recent updates @enp1s0 and please accept my sincere apology that it's taken some time for me to get back to this PR. This past release turned out to require a significant amount of attention since it's the first release with the Ann code officially migrated from RAFT.
I think these changes are shaping up but they aren't yet ready to be merged. It's very important that we focus on the user experience in the public APIs as a first and highest priority. That means making sure the docs are usable and easily searchable, and that the APIs are consistent and easy to use.
cpp/include/cuvs/neighbors/cagra.hpp
Outdated
updated_dataset_view, | ||
const cuvs::neighbors::cagra::index<float, uint32_t>& idx, | ||
raft::host_matrix_view<uint32_t, std::int64_t> updated_graph_view, | ||
const cagra::extend_params& params); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ordering is not consistent with the way we order arguments in public API functions. It should be:
- Resource handle
- Params structs
- Input args
- Output Args
- Any free param arguments
cpp/include/cuvs/neighbors/cagra.hpp
Outdated
* @param[in] handle raft resources | ||
* @param[in] input_updated_dataset_view updated dataset (initial + additional dataset) | ||
* @param[in] index CAGRA index | ||
* @param[out] updated_graph_view updated graph |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems awkward to return to the user- what are we expecting them to do with this parameter? From the user's perspective, this should be getting updated inside the index itself, shouldn't it?
cpp/include/cuvs/neighbors/cagra.hpp
Outdated
* @tparam IdxT type of the indices | ||
* | ||
* @param[in] handle raft resources | ||
* @param[in] input_updated_dataset_view updated dataset (initial + additional dataset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really struggling with the need to expose this function through the public APIs, especially when the cagra index contains methods for the dataset to be attached or reattached after the fact.
End users generally interact with these index APIs as a black box- that is, they invoke cuVS APIs to build and populate an index object and then they invoke functions to act upon that index object without ever having to care what's actually being done in the index object.
I understand the want to provide helper functions for power users but this isn't the correct way to provide these types of functions. If the only reason for exposing this function is because we want to provide a way for the user to give the dataset with the additional vectors appended then I think we need to put some additional thought into the existing "update_dataset" methods on the index and add an option to the extend params that affords more control to the user.
The problem is that this function assumes the user knows or cares about a graph object, which is intentionally being hidden inside the index. If we remove the graph output, we are left with the extend() function and what I think should be an option for the user to specify whether they're attaching the whole detaset with the new vectors appended. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree that providing the APIs as a black box would be better for most users. And providing an option to specify whether the passed dataset is an entire one sounds good. An alternative way is to provide a function to expand the memory space (num vectors) for the dataset and graph in the CAGRA index. We allocate the host memory for the new dataset because we need to replace the old dataset with the new one on the device memory. To do this, we 1) copy the old dataset on the host memory, 2) deallocate the device memory, 3) allocate the new memory space on the device memory, and 4) copy the old dataset and additional dataset to the device memory (I noticed that we only need to allocate the host memory for the old dataset). It would be beneficial if there is a functionality to expand the memory space for new vectors because we don't need to reallocate the dataset memory space as above every time as long as the memory space is left.
I'm not opposed to the feature that add_graph_nodes is providing. We should find a way to enable it through extend(), so that the user has a unified experience. We do something similar with the IVF methods where we provide an option to over allocate the lists so adding new vectors can be done without re-allocating memory. We should be able to do something similar here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your most recent changes look much better with some fairly minor things at this point. We need some more documentation around the why/when/how. I also think I'd prefer to keep the two additional optional args, rather than combine them into a new vocab item.
@@ -173,6 +174,43 @@ struct search_params : cuvs::neighbors::search_params { | |||
/** | |||
* @} | |||
*/ | |||
|
|||
/** | |||
* @defgroup cagra_cpp_extend_params CAGRA index extend parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not completely clear from the documentation and code examples how and when the user might use this. Can you expand the descriptions below to explain why and when these additional arguments are needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed the explanation of the parameter. Please let me know if the updated explanation is not sufficient.
cpp/include/cuvs/neighbors/cagra.hpp
Outdated
* @param[in] params extend params | ||
* @param[in] additional_dataset additional dataset on device memory | ||
* @param[in,out] idx CAGRA index | ||
* @param[in] new_memory_buffers (Optional) user-allocated memory buffers for the extended dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I've gathered, these new memory buffers are outputs, right? It's unclear from the docs and examples alone how these are used and why a user might want to use this. To be completetely honest, I'd more prefer using 2 std::optional
args instead of adding the additional struct to the API's vocabulary. I understand the consolidation but a user now has to look up what this new struct does instead of just being able to see the params documented above the functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the arguments of the extend function to pass the dataset and graph optional views.
@cjnolet Thank you for reviewing the code. I have fixed it according to your comment. Can you check it again? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The recent changes look good to me: extend
now allows fine control of dataset and graph placement if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look great, @enp1s0! So happy to have this feature.
Removing |
/merge |
This PR introduces the new vector addition feature to CAGRA.
Rel: rapidsai/raft#1775
Original PR: rapidsai/raft#2157
CAGRA-Q is not supported
Usage
Algorithm
Graph degree: d
The algorithm consists of two stages: rank-based reordering and reverse edge addition.
1-1. Obtain d' (=2d) nearest neighbor vectors (V) of a given new vector using the CAGRA search
1-2. Count the number of detourable edges using the result of step 1 and the neighbor list of the input index. Then we prune (3*d/2) edges in the same way as the CAGRA graph optimization. Through this operation, we decide d/2 neighbors.
2-1. Count the number of incoming edges for all nodes.
2-2. Add d/2 reverse edges from the nodes added to the neighbor list in Step 1 by replacing a node with a new node. To prevent the connection to the replaced node from being lost, we add the node to the neighbor list of the new node. This allow us to make a detour connection. The replaced nodes are the largest number of incoming edge nodes in the 2/d nodes from the back of the neighbor list without duplication with the nodes already in the neighbor list.
Performance
In this experiment, we first split the dataset into two parts: the initial and the additional part. Then, we extend the CAGRA index built by the initial part to include the additional part.
We can see a larger recall drop compared to the baseline by increasing the number of added vectors.
Therefore, rebuilding the CAGRA index is recommended when one wants to add a lot of vectors.