-
Notifications
You must be signed in to change notification settings - Fork 143
Expose graph and dataset accessors for CAGRA to C/Python #1086
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose graph and dataset accessors for CAGRA to C/Python #1086
Conversation
Add the ability to get the graph and dataset from a CAGRA index to the c-api and python apis, as well as being able to reconstruct the cagra index from a graph and dataset. The eventual goal here is to be more flexible in terms of allowing other serialization formats. Rather than supporting every format inside of cuvs, by exposing the raw data needed to recreate a cagra index - we can let consumers of cuvs decide how they want to serialize an index.
| files: | | ||
| (?x) | ||
| [.](cmake|cpp|cu|cuh|h|hpp|sh|pxd|py|pyx|rs)$| | ||
| [.](cmake|cpp|cu|cuh|h|hpp|sh|pxd|py|pyx|rs|java)$| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| */ | ||
|
|
||
| enum cuvsKMeansInitMethod { | ||
| typedef enum { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
@benfred I tried to check this out and build locally, Scratch that, I needed to update my conda environment as the raft include files changed. |
ldematte
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments while I was writing a Java wrapper for this
|
TL;DR: I liked the API as it was in 2f22d53 ( If anything, in a follow-up PR, it would be great to have partial copy (e.g. |
How about having something like a // get the cagra graph
DLManagedTensor graph;
cuvsCagraIndexGetGraph(index, &graph);
// get the first 1K rows from the graph
DLManagedTensor subgraph;
cuvsSliceMatrix(&graph, 0, 1024, &subgraph);
// copy the subgraph to host memory
cuvsCopyMatrix(&subgraph, &subgraphHost);(this actually makes me think that I should rename cuvsCopyMatrix to cuvsMatrixCopy , so that we can have cuvsMatrixSlice etc - and to be consistent with other cuvs API's) |
@benfred / @cjnolet will keep me honest here. I think the C-lang user should not use Using a |
add ability to page through returned rows via a new 'cuvsMatrixSliceRows' function
That's even better!
++ |
mythrocks
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of nitpicks, but this looks good to go.
Co-authored-by: MithunR <[email protected]>
|
/merge |
| from cuvs.common.resources import auto_sync_resources | ||
|
|
||
|
|
||
| cdef class DeviceTensorView: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we return device_matrix_view without intermediate copies? I am not 100% sure I see the benefits of adding this class compared to device views?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The python API is using the c-api (instead of the c++ api) - meaning we can't use device_matrix_view directly inside python. Instead we're using dlpack.DLManagedTensor inside our C-API.
Previous to this PR, our c-api only accepted dlpack arrays as inputs (and would either use the contents in functions like cagra::build - or fill pre-allocated arrays with the outputs in functions like cagra::search). We hadn't yet exposed memory that was allocated in our C++ codebase to python.
This PR changes that - and we are now returning dlpack DLManagedTensor objects that return memory that is owned by and allocated inside our C++ codebase. This code does that without copying the data , with the flow going : device_matrix_view (c++) -> DLManagedTensor (C) -> DeviceTensorView (python) . At each step we aren't copying the data, there isn't an intermediate copy - so much as intermediate objects that have a pointer to the original data, and also extra information such as the shape/dtype/strides etc.
This DeviceTensorView code is necessary because we didn't have anything that would take a DLManagedTensor and return something that could be easily consumed in python. the closest object we had was the pylibraft.device_ndarray object - but that object wouldn't have worked here.
cpp/include/cuvs/neighbors/cagra.h
Outdated
| * | ||
| * @endcode | ||
| */ | ||
| cuvsError_t cuvsCagraIndexFromGraph(cuvsResources_t res, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick- can we please rename this to cuvsCagraIndexFromParams or cuvsCagraIndexFromArgs? I'd like to keep tthe API design consistent and having to specify specific args in the name will get unwieldy quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed to cuvsCagraIndexFromArgs in the last commit - (went with FromArgs instead of FromParams - since I think the FromParams could be confused with the Index Params we use to build the index)
bdice
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving packaging changes.
…1216) This PR leverages the functions introduced by #1086 and the data structures introduced by #1111 to access, copy, and re-create an index to/from a CAGRA graph. Supersedes #1105 Authors: - Lorenzo Dematté (https://github.com/ldematte) - MithunR (https://github.com/mythrocks) Approvers: - MithunR (https://github.com/mythrocks) URL: #1216
…apidsai#1216) This PR leverages the functions introduced by rapidsai#1086 and the data structures introduced by rapidsai#1111 to access, copy, and re-create an index to/from a CAGRA graph. Supersedes rapidsai#1105 Authors: - Lorenzo Dematté (https://github.com/ldematte) - MithunR (https://github.com/mythrocks) Approvers: - MithunR (https://github.com/mythrocks) URL: rapidsai#1216
Add the ability to get the graph and dataset from a CAGRA index to the c-api and python apis, as well as being able to reconstruct the cagra index from a graph and dataset.
The eventual goal here is to be more flexible in terms of allowing other serialization formats. Rather than supporting every format inside of cuvs, by exposing the raw data needed to recreate a cagra index - we can let consumers of cuvs decide how they want to serialize an index.