Skip to content

Conversation

@ldematte
Copy link
Contributor

This PR is a follow-up from #902.
Still WIP (see self-comments on the changes) but I'd like some early feedback.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Jun 16, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ldematte
Copy link
Contributor Author

@chatman @mythrocks wdyt?

@ldematte ldematte changed the title [WIP][Java] Encapsulate on-heap float arrays into Dataset [Java] Encapsulate on-heap float arrays into Dataset Jun 19, 2025
@ldematte
Copy link
Contributor Author

I added another commit 3f348c6 in which I reduce the scope for the Dataset/MemorySegment: I think there is no need to keep it around once the index is built, it is not referenced again (but keep me honest here and point out if I'm wrong!)

@mythrocks
Copy link
Contributor

@chatman @mythrocks wdyt?

Looking forward to reading/reviewing this next week.

@chatman
Copy link
Contributor

chatman commented Jun 20, 2025 via email

@ldematte
Copy link
Contributor Author

I'll take a closer look at the scope of the
memory segment. I remember extending to scope of the memory segment to
closing it at index close due to some issues otherwise. I'll take another
look today to make sure that's not happening.

Yes please; it might be that the C++ side of things holds a reference to the input memory (wrapped and managed by the MemorySegment), and therefore that that memory needs to be available for the lifetime of the Index.
If we even suspect this is the case, I'll revert that change - better hold onto some memory a bit longer than shorter than we need!

@ldematte
Copy link
Contributor Author

BTW, no hurry -- I wasn't aware you guys were off this week. This can wait till next week, no prob.

@ldematte
Copy link
Contributor Author

I'll take a closer look at the scope of the
memory segment. I remember extending to scope of the memory segment to
closing it at index close due to some issues otherwise.

Hey @chatman, I dug into the C and C++ implementations of cuvsCagraBuild/cuvs::neighbors::cagra::build and a pointer to the original dataset seems to be passed down and retained/referenced by Index. Seems because the DLTensor gets transformed to a mdspan, but I think it's just wrapped and no copy happens there, so we need to retain the dataset memorysegment "alive" for the lifetime of the index. I'll change this PR to do that.

@chatman
Copy link
Contributor

chatman commented Jun 26, 2025 via email

@ldematte ldematte changed the title [Java] Encapsulate on-heap float arrays into Dataset [Review][Java] Encapsulate on-heap float arrays into Dataset Jun 26, 2025
@ldematte
Copy link
Contributor Author

ldematte commented Jun 30, 2025

but it would be pretty safe to keep the scope tied up with the lifecycle of the index.

👍
I've done exactly that in the last commit

Copy link
Contributor

@ChrisHegarty ChrisHegarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This is a nice simplification.

Comment on lines +31 to +35
mvn clean integration-test -Dit.test=com.nvidia.cuvs.CagraBuildAndSearchIT
```
or, for a single test:
```sh
mvn clean integration-test -Dit.test=com.nvidia.cuvs.CagraBuildAndSearchIT#testMergeStrategies
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Will schedule to merge this once #1045 is merged.

ldematte added 2 commits July 4, 2025 09:31
…array-dataset

# Conflicts:
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/BruteForceIndexImpl.java
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CagraIndexImpl.java
@ldematte
Copy link
Contributor Author

ldematte commented Jul 4, 2025

@mythrocks rebased on top of #1045 and ready to go! 👍

@mythrocks mythrocks added improvement Improves an existing functionality non-breaking Introduces a non-breaking change Java labels Jul 4, 2025
@mythrocks mythrocks changed the title [Review][Java] Encapsulate on-heap float arrays into Dataset [Java] Encapsulate on-heap float arrays into Dataset Jul 4, 2025
@mythrocks
Copy link
Contributor

/ok to test 3b36197

@mythrocks
Copy link
Contributor

/ok to test b0b3552

@mythrocks
Copy link
Contributor

I just resolved the conflicts from #1081. This should go in after CI passes.

@mythrocks
Copy link
Contributor

/merge

@rapids-bot rapids-bot bot merged commit f02ee48 into rapidsai:branch-25.08 Jul 6, 2025
100 of 102 checks passed
@mythrocks
Copy link
Contributor

Thank you for this change, @ldematte. This has been merged.

rapids-bot bot pushed a commit that referenced this pull request Jul 15, 2025
This PR adds the ability to define a Dataset directly over a MemorySegment, "wrapping" it instead of allocating a new one.

- Depends on #1033 and #1024
- ~~The new API has a `Object memorySegment` parameter, as we target Java 21 for the API (but 22 for the implementation); it works but it's definitely a hack and we need to sort this out~~
   - As discussed, we want to keep targeting Java 21 for the API. This means the API will return a `MethodHandle`, and the Java 22 implementation will use it to return a factory method to build a Dataset from a MemorySegment.
   - This factory method can then be used as shown in the tests (see the `DatasetHelper` convenience class/method).
- Benchmarks show a sizeable speedup -- it is still tiny related to the "big picture" (index build time), but there is an improvement and above all we avoid a whole new copy of the input data (halving the memory requirements).

Fixes #698

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Chris Hegarty (https://github.com/ChrisHegarty)
  - MithunR (https://github.com/mythrocks)

URL: #1034
punAhuja added a commit to SearchScale/cuvs that referenced this pull request Jul 15, 2025
This PR adds the ability to define a Dataset directly over a MemorySegment, "wrapping" it instead of allocating a new one.

- Depends on rapidsai#1033 and rapidsai#1024
- ~~The new API has a `Object memorySegment` parameter, as we target Java 21 for the API (but 22 for the implementation); it works but it's definitely a hack and we need to sort this out~~
   - As discussed, we want to keep targeting Java 21 for the API. This means the API will return a `MethodHandle`, and the Java 22 implementation will use it to return a factory method to build a Dataset from a MemorySegment.
   - This factory method can then be used as shown in the tests (see the `DatasetHelper` convenience class/method).
- Benchmarks show a sizeable speedup -- it is still tiny related to the "big picture" (index build time), but there is an improvement and above all we avoid a whole new copy of the input data (halving the memory requirements).

Fixes rapidsai#698

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Chris Hegarty (https://github.com/ChrisHegarty)
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#1034
punAhuja pushed a commit to SearchScale/cuvs that referenced this pull request Jul 15, 2025
This PR is a follow-up from rapidsai#902.
Still WIP (see self-comments on the changes) but I'd like some early feedback.

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Chris Hegarty (https://github.com/ChrisHegarty)
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#1024
punAhuja added a commit to SearchScale/cuvs that referenced this pull request Jul 15, 2025
This PR adds the ability to define a Dataset directly over a MemorySegment, "wrapping" it instead of allocating a new one.

- Depends on rapidsai#1033 and rapidsai#1024
- ~~The new API has a `Object memorySegment` parameter, as we target Java 21 for the API (but 22 for the implementation); it works but it's definitely a hack and we need to sort this out~~
   - As discussed, we want to keep targeting Java 21 for the API. This means the API will return a `MethodHandle`, and the Java 22 implementation will use it to return a factory method to build a Dataset from a MemorySegment.
   - This factory method can then be used as shown in the tests (see the `DatasetHelper` convenience class/method).
- Benchmarks show a sizeable speedup -- it is still tiny related to the "big picture" (index build time), but there is an improvement and above all we avoid a whole new copy of the input data (halving the memory requirements).

Fixes rapidsai#698

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Chris Hegarty (https://github.com/ChrisHegarty)
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#1034
rapids-bot bot pushed a commit that referenced this pull request Jul 15, 2025
This PR tidies up the lifecycle of `MemorySegment`s by using specific confined `Arena`s where possible, and specific index-bound `Arena`s where the lifetime needs to be bound to the lifetime of an index.

It addresses all remaining usages of `CuVSResources#arena` after #1024 and #1045; as such, we can remove `CuVSResources#arena` completely, and force native memory usages to be tracked and dealt with specifically.

This would towards the goal of #1037

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Chris Hegarty (https://github.com/ChrisHegarty)
  - MithunR (https://github.com/mythrocks)

URL: #1069
punAhuja pushed a commit to SearchScale/cuvs that referenced this pull request Jul 16, 2025
This PR tidies up the lifecycle of `MemorySegment`s by using specific confined `Arena`s where possible, and specific index-bound `Arena`s where the lifetime needs to be bound to the lifetime of an index.

It addresses all remaining usages of `CuVSResources#arena` after rapidsai#1024 and rapidsai#1045; as such, we can remove `CuVSResources#arena` completely, and force native memory usages to be tracked and dealt with specifically.

This would towards the goal of rapidsai#1037

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Chris Hegarty (https://github.com/ChrisHegarty)
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#1069
punAhuja pushed a commit to SearchScale/cuvs that referenced this pull request Jul 16, 2025
This PR is a follow-up from rapidsai#902.
Still WIP (see self-comments on the changes) but I'd like some early feedback.

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Chris Hegarty (https://github.com/ChrisHegarty)
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#1024
punAhuja added a commit to SearchScale/cuvs that referenced this pull request Jul 16, 2025
This PR adds the ability to define a Dataset directly over a MemorySegment, "wrapping" it instead of allocating a new one.

- Depends on rapidsai#1033 and rapidsai#1024
- ~~The new API has a `Object memorySegment` parameter, as we target Java 21 for the API (but 22 for the implementation); it works but it's definitely a hack and we need to sort this out~~
   - As discussed, we want to keep targeting Java 21 for the API. This means the API will return a `MethodHandle`, and the Java 22 implementation will use it to return a factory method to build a Dataset from a MemorySegment.
   - This factory method can then be used as shown in the tests (see the `DatasetHelper` convenience class/method).
- Benchmarks show a sizeable speedup -- it is still tiny related to the "big picture" (index build time), but there is an improvement and above all we avoid a whole new copy of the input data (halving the memory requirements).

Fixes rapidsai#698

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Chris Hegarty (https://github.com/ChrisHegarty)
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#1034
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improves an existing functionality Java non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants