
Conversation

Contributor

@mythrocks mythrocks commented Aug 5, 2025

This commit introduces exception safety for RMM allocations.

Previously, device memory allocated through cuvsRmmAlloc() was freed manually using cuvsRmmFree() in all the index impl classes. The problem is that an exception thrown in the interval between the alloc and the free would leak the device memory.

This commit extends the CloseableHandle class to encapsulate the allocation of device memory. This new class is used in try-with-resources blocks, to make device memory allocations exception-safe.

Signed-off-by: MithunR <[email protected]>
@mythrocks mythrocks self-assigned this Aug 5, 2025
@mythrocks mythrocks added the improvement Improves an existing functionality label Aug 5, 2025
@mythrocks mythrocks requested a review from a team as a code owner August 5, 2025 06:22
@mythrocks mythrocks added non-breaking Introduces a non-breaking change Java labels Aug 5, 2025
@mythrocks mythrocks marked this pull request as draft August 5, 2025 06:22

copy-pr-bot bot commented Aug 5, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


Contributor

chatman commented Aug 5, 2025

+1, very important change.

Contributor

@ldematte ldematte left a comment


It's a good idea! It might interfere with work I'm doing on the device CuVSMatrix, but I'll take on the job of merging them.
Left a couple of comments.


return new IndexReference(datasetMemorySegmentP, datasetBytes, tensorDataArena, index);
closeableDataMemorySegmentP.release();
return new IndexReference(datasetMemorySegmentP, datasetBytes, tensorDataArena, index);
Contributor

I see a problem here: we pass the dataset "pointer" to IndexReference to hold it and clean it up when we are done with the index (see destroyIndex()), so this will lead to a double free.
Your change is good only if we can determine that we don't need the dataset device memory after the index is built; however, I was not able to determine whether we need it. I think we might, as it might not be copied over again.

Contributor

Ah, I see now why you need "release": it's a way to work around that.
I think it is better to avoid it, and simply not use try-with-resources here.

Contributor Author

@mythrocks mythrocks Aug 5, 2025

I think I haven't conveyed the utility of CloseableRMMAllocation properly.

Simple case (CagraIndexImpl, etc.)

Consider the following simplified (if contrived) case, representative of how allocateRMMSegment() is used in CagraIndexImpl, TieredIndexImpl, etc.:

// search()
{
  var queriesDP = allocateRMMSegment(...);
  // ...
  cudaMemcpy(...); // Can throw.
  checkCuVSError( cuvsCagraSearch(...) ); // Can also throw.
  // ...
  // And finally.
  cuvsRmmFree( queriesDP );
}

There are several throwable points between the alloc() and the free(). If any of them fire, queriesDP is leaked in __device__ memory. This is the simple case that CloseableRMMAllocation addresses.

The case for .release() (BruteForceIndexImpl)

Similar example as above, except that the Index adopts the allocation, and holds it until destroyIndex() is called.

// build()
{
  var datasetMemorySegmentP = allocateRMMSegment(...);
  // ...
  cudaMemcpy(...); // Can throw.
  checkCuVSError( cuvsBruteForceBuild(...) ); // Can also throw.
  // ...
  // And finally, commit.  No free().
  return new IndexReference( datasetMemorySegmentP, ... );
}

All the perils of the first example appear here as well; there are many throwable points between the allocation and the creation of the IndexReference. We need a way to clean up the memory allocation if there's any throw before the final commit (to IndexReference).
This is why we release() right before the return.

Not using try-with-resources would mean that we're still open to __device__ memory leaks.
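A self-contained sketch of the pattern (class and method names here are illustrative stand-ins, not the actual cuvs-java API; a counter simulates cuvsRmmAlloc()/cuvsRmmFree()):

```java
// Illustrative sketch only: the counter stands in for RMM device allocations.
public class RmmSketch {
    static int liveAllocations = 0;

    static final class CloseableAllocation implements AutoCloseable {
        private boolean owned = true;

        CloseableAllocation() {
            liveAllocations++; // stands in for cuvsRmmAlloc()
        }

        @Override
        public void close() {
            if (owned) {
                liveAllocations--; // stands in for cuvsRmmFree()
            }
        }
    }

    public static void main(String[] args) {
        try (var queriesDP = new CloseableAllocation()) {
            // cudaMemcpy / cuvsCagraSearch would go here; simulate a throw:
            throw new RuntimeException("simulated CUDA error");
        } catch (RuntimeException e) {
            // close() already freed the allocation before the exception propagated.
        }
        System.out.println("live allocations: " + liveAllocations); // 0: nothing leaked
    }
}
```

Even when the simulated copy throws, the try-with-resources block guarantees the close() path runs, so nothing leaks.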

Contributor

@ldematte ldematte Aug 6, 2025

Ah, I see. Still, I feel like release() is not clear; a more explicit way would be a plain old try/catch:

var closeableDataMemorySegment = allocateRMMSegment(cuvsResources, datasetBytes);
try {
   MemorySegment datasetMemorySegment = closeableDataMemorySegment.handle();
   // use handle
   // here ownership is "transferred"
   return new IndexReference(closeableDataMemorySegment, ... );
} catch (Throwable t) {
   closeableDataMemorySegment.close();
   throw t;
}

let me see if I can think of a different pattern.

Contributor

@ldematte ldematte Aug 6, 2025

I think what you are trying can be seen as similar to C++ unique_ptr; the difference is that we don't have move semantics in Java (as a simple way to transfer ownership).
An alternative could be to implement release() closer to what C++ has, e.g. returning the enclosed object:

try (var closeableDataMemorySegment = allocateRMMSegment(cuvsResources, datasetBytes)) {
   MemorySegment datasetMemorySegment = closeableDataMemorySegment.handle();
   // use handle
   // here ownership is "transferred" more explicitly
   return new IndexReference(
          new CloseableRMMAllocation(closeableDataMemorySegment.release()), // release it, get back the "raw pointer", and pass it immediately to another `CloseableRMMAllocation` that will handle its lifetime
         ... );
}

While I think this models the "I am transferring ownership" idea more clearly, there is one drawback: what if the IndexReference or CloseableRMMAllocation ctors throw? That is not a danger here, as they are both no-throw ctors (just simple assignments), but it's still a little bit less robust.

I think I still like the explicit try/catch more.

Contributor

I want to highlight this though:

If any of them fire, queriesDP is leaked in __device__ memory

Leaking device memory seems particularly bad; but how bad is it, actually? Would we be leaking it even after the process is gone, or will the OS/device driver be able to reclaim that memory (as with host memory)?

In any case, calling @ChrisHegarty in to see if we can protect further against this possibility (using cleaners maybe?)
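For reference, a Cleaner-based backstop could look roughly like this (a hypothetical sketch, not the cuvs-java code; a boolean stands in for cuvsRmmFree()):

```java
import java.lang.ref.Cleaner;

// Hypothetical sketch: a Cleaner acts as a backstop for a forgotten close().
public class CleanerSketch {
    static final Cleaner CLEANER = Cleaner.create();
    static boolean freed = false;

    static final class DeviceBuffer implements AutoCloseable {
        // The cleanup action must not capture `this`, or the buffer stays reachable.
        private final Cleaner.Cleanable cleanable =
                CLEANER.register(this, () -> freed = true); // cuvsRmmFree() stand-in

        @Override
        public void close() {
            cleanable.clean(); // deterministic release; the GC-triggered path is only a backstop
        }
    }

    public static void main(String[] args) {
        try (var buf = new DeviceBuffer()) {
            // use the buffer
        }
        System.out.println("freed: " + freed); // true
    }
}
```

The explicit close() remains the primary release path; the Cleaner only catches the case where a caller never closes the buffer, and its timing is at the mercy of the GC.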

Contributor Author

@mythrocks mythrocks Aug 6, 2025

Similar to C++ unique_ptr...

Exactly right. That's what this is modeled on. This will turn into an RAII wrapper after I have moved the allocation into the constructor.

That's also the pattern we use in https://github.com/rapidsai/cudf, when we transfer ownership of the underlying __device__ memory in a CUDF column.

My initial version also had the pointer returned from release.

Edit: Looks like it's not just my initial version; release() does currently return the old pointer when relinquishing the memory. I didn't use it in BruteForceIndexImpl because the original pointer was already at hand in the same scope. This will get tighter once the RAII change is made.
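A sketch of that unique_ptr-style transfer (again with illustrative stand-in names, a counter for the RMM calls, and a long for the device pointer):

```java
// Illustrative sketch of unique_ptr-style ownership transfer; names are stand-ins.
public class ReleaseSketch {
    static int liveAllocations = 0;

    static final class CloseableAllocation implements AutoCloseable {
        private final long handle; // stands in for the device pointer
        private boolean owned = true;

        CloseableAllocation(long handle) {
            this.handle = handle;
            liveAllocations++; // cuvsRmmAlloc() stand-in
        }

        /** Relinquish ownership and return the raw handle, like unique_ptr::release(). */
        long release() {
            owned = false;
            return handle;
        }

        @Override
        public void close() {
            if (owned) {
                liveAllocations--; // cuvsRmmFree() stand-in
            }
        }
    }

    static final class IndexRef {
        final long datasetHandle; // adopted; freed later by a destroyIndex() analogue
        IndexRef(long datasetHandle) { this.datasetHandle = datasetHandle; }
    }

    static IndexRef build() {
        try (var dataset = new CloseableAllocation(0xCAFE)) {
            // copies / index build can throw here; the allocation would then be freed.
            return new IndexRef(dataset.release()); // commit: ownership transfers
        }
    }

    public static void main(String[] args) {
        IndexRef ref = build();
        System.out.println("still live (owned by index): " + liveAllocations); // 1
    }
}
```

If anything throws before the final return, close() frees the allocation; after release(), close() becomes a no-op and the index owns the memory.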

Contributor Author

would we be leaking it even after the process is gone, or will the OS/device driver be able to reclaim that memory (as with host memory)?

No, the leak does not persist beyond the lifetime of the process. The memory is reclaimed after the process exits, yes.

But I would prefer not to assume that the users of cuvs-java are short-running processes, and require them to be tolerant of leaked memory. Even if current users might be alright with a leak in exceptional events, a future user might be a long-running application.

Contributor

I would like not to make an assumption that the users of cuvs-java are short-running processes

Definitely not; I was just trying to understand the worst-case scenario here, and whether we need something extra beyond taking care of exceptions.


copy-pr-bot bot commented Aug 5, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


Contributor Author

Hmm... There seems to be a logic error in CagraIndexImpl::search(). I'm debugging it right now.

@mythrocks mythrocks marked this pull request as ready for review August 5, 2025 23:32
Contributor Author

/ok to test 4fd5f8b

@mythrocks mythrocks changed the title [WIP] [Java] Exception-safe RMM Allocations [REVIEW] [Java] Exception-safe RMM Allocations Aug 5, 2025
Contributor Author

/ok to test 7314065

Contributor

@ldematte ldematte left a comment


Looks good, I left a couple of optional comments (for this PR or for a follow-up)

long[] datasetShape = {rows, cols};
MemorySegment datasetTensor =
prepareTensor(localArena, datasetDP, datasetShape, kDLFloat(), 32, kDLCUDA(), 1);
MemorySegment index = localArena.allocate(cuvsTieredIndex_t);
Contributor

Unrelated to this PR, but I'm conflicted about using cuvsTieredIndex_t, since we need a POINTER here (as we see below, when we extract a C_POINTER).
Right now, cuvsTieredIndex_t is a pointer, but if that changes, we would end up allocating e.g. a struct, and using it to store (and retrieve) a pointer...

Contributor Author

Right now, cuvsTieredIndex_t is a pointer, but if that changes...

I think that could apply to any of the typedefs exposed in the C++ API. Changing those structures would be a non-trivial event. I would think @benfred and gang are likely to keep those stable. :]

Contributor Author

/ok to test f90efc0

Contributor Author

@ldematte: I've addressed the remainder of your concerns. That copy-constructor suggestion helped DRY the code up a bit more.

Does this look more agreeable?

Contributor

@ldematte ldematte left a comment


It was already good to me, but now it looks even better, I think. Thanks for sticking with me and implementing my suggestions!

@mythrocks mythrocks changed the title [REVIEW] [Java] Exception-safe RMM Allocations [Java] Exception-safe RMM Allocations Aug 8, 2025
Contributor Author

mythrocks commented Aug 8, 2025

/merge

@cjnolet, any chance you might approve this change? @ldematte has approved it already.

? allocateRMMSegment(cuvsRes, prefilterBytes)
: CloseableRMMAllocation.EMPTY) {

cudaMemcpy(queriesDP.handle(), floatsSeg, queriesBytes, INFER_DIRECTION);
Member

@cjnolet cjnolet Aug 8, 2025

Not something to address in this PR, but we should really be using cudaMemcpyAsync everywhere (and accepting a stream / cuvs resources). This is going to synchronize the whole device, and that'll slow things down immensely. I suspect this could be why @chatman and team have seen that multi-threaded search reduces perf significantly.

Member

@cjnolet cjnolet left a comment

The change intended in this PR LGTM. We should address the cudaMemcpy -> cudaMemcpyAsync change ASAP, though, because it will affect concurrency / perf drastically. We really should go through the whole C/Java API and replace cudaMemcpy with cudaMemcpyAsync everywhere (and accept cuvsResources everywhere needed).

Contributor

ldematte commented Aug 8, 2025

The change intended in this PR LGTM. We should address the cudaMemcpy -> cudaMemcpyAsync change ASAP, though, because it will affect concurrency / perf drastically. We really should go through the whole C/Java API and replace cudaMemcpy with cudaMemcpyAsync everywhere (and accept cuvsResources everywhere needed).

@cjnolet a couple of questions about this to cement my understanding: cudaMemcpyAsync is "async" in two ways: it does not block the caller (the CPU thread calling it), and it can allow for overlapping copy/computation (if it uses a non-0 stream parameter).

  • If we want the overlapping copy/computation but need to be sure the operation has finished before proceeding with the CPU side of things (e.g. having copied out the data), we need to pass a non-0 stream and then "block" the caller (CPU thread) with cudaStreamSynchronize(stream), correct? Operations on the device will still overlap.
  • If we want copy/computation to overlap with cuvs operations, we need to use a different stream, right? We cannot use cuvsGetStream()/raft::resources::get_cuda_stream() for the copy too, or it will use the same stream it uses for e.g. indexing, and we will be unable to overlap. So: different resources (or a different, manually managed stream), correct?

Contributor

ldematte commented Aug 8, 2025

I still have to benchmark it, but this is where I am leaning with CuVSDeviceMatrix: create (and manage) a separate stream for copy operations via cudaMemcpyAsync. An additional reason is that we then won't need to keep a reference to the resources, which might be problematic (scope, threading); but it is my understanding that having a different stream is also best performance-wise.

Member

cjnolet commented Aug 8, 2025

If we want the overlapping copy/computation but we need to be sure the operation is finished before proceeding (e.g. copied out the data) with the CPU side of things, we need to pass a non-0 stream and then "block" the caller (CPU thread) with a cudaStreamSynchronize(stream), correct? Operations on the device will still overlap.

Even worse: cudaMemcpy blocks the whole device! So imagine having multiple threads, each performing search() with different streams; they won't be able to overlap, because cudaMemcpy will block the entire device each time it's called, until the memory has been fully copied. This is why we NEVER use cudaMemcpy, and always use cudaMemcpyAsync.

If we want overlapping copy/computation with cuvs operations we need to use a different stream, right? We cannot use cuvsGetStream()/raft::resources::get_cuda_stream() for the copy too, or it will use the same stream it uses for e.g. indexing, and we will be unable to overlap. So, different resources (or different stream, manually managed), correct?

A stream is necessary to queue up the memory copy, but this is still also very important because the goal here is to queue up work (even on the same stream) and not block the client, so that the GPU can work off the queue asynchronously. Of course, if you want to overlap multiple memory copies and kernel launches, the cuvsResources can also be configured by the user to provide multiple "worker streams", which can be used to perform these copies asynchronously and independently.

In order to take advantage of any of this, we need to use cudaMemcpyAsync. It's not a question of whether we should use it; we need to use it everywhere.
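The queue-up-and-synchronize-late idea can be loosely modeled on the host side (this is only an analogy using a plain single-threaded executor as a FIFO "stream", not a CUDA binding):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Analogy only: a CUDA stream behaves like a FIFO queue that the device drains
// asynchronously; a single-threaded executor gives the same in-order semantics.
public class StreamAnalogy {
    static final List<String> completed = new ArrayList<>();

    public static void main(String[] args) throws Exception {
        ExecutorService stream = Executors.newSingleThreadExecutor();

        // Queue work without blocking the caller (cudaMemcpyAsync / kernel launch).
        stream.submit(() -> completed.add("H2D copy"));
        stream.submit(() -> completed.add("search kernel"));
        Future<?> d2h = stream.submit(() -> completed.add("D2H copy"));

        // cudaStreamSynchronize analogue: block only when the host must read results.
        d2h.get();
        System.out.println(completed);
        stream.shutdown();
    }
}
```

The caller thread only blocks at the final get(); everything before that is merely enqueued, which is the behavior cudaMemcpyAsync preserves and a blocking cudaMemcpy destroys.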

Member

cjnolet commented Aug 8, 2025

If we want overlapping copy/computation with cuvs operations we need to use a different stream, right?

I mentioned this above, but want to expand on it just to make it fully clear: asynchronous computation, even on the same stream, is just as important; queue up work and synchronize as little as possible. Queueing up work on a stream (such as a memory copy, memory allocation/free, or a kernel launch) has very little overhead, if any. But if we have to synchronize the CPU thread every time we queue up one of these operations, we're going to insert blocking gaps in the pipeline, and these can be very costly in latency-sensitive workloads. (You can use the nsight-systems application to profile this; you'll literally see the gaps in the pipeline.)

Instead, we utilize the asynchronous, non-blocking behavior as much as possible, so the CPU can continue queuing up work for the GPU independent of the thread that's making the calls to the runtime API. This is also why we only explicitly synchronize the stream in the cuVS C++/C layer when it's absolutely necessary (which in general is either because we just did a device-to-host copy and need to immediately read the memory on the host, or because we are about to use multiple streams to overlap some operations, and we synchronize the main stream before splitting off into a series of "worker streams" for things like overlapping computation / memory copies, as you mentioned).

Member

cjnolet commented Aug 8, 2025

The additional reason is that we also won't need to keep a reference to resources, which might be problematic (scope, threading), but it is my understanding that it's also best performance-wise.

resources are expensive to create, but the individual resources that get stored on the cuvsResources instance are created lazily, as algorithms need them, to amortize the cost. I hope we're not saying we are recreating these resource objects often. Ideally, we should create them once at the beginning and reuse them in subsequent calls. They don't have a high memory cost, but things like cuBLAS handles can be very costly to create, and those are used very often.

Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 5e3a5a6 into rapidsai:branch-25.10 Aug 8, 2025
100 of 102 checks passed
Contributor Author

mythrocks commented Aug 8, 2025

I'll address the cudaMemcpyAsync change in a follow-on, shortly.

Thank you for the reviews, @ldematte, @cjnolet.

@mythrocks mythrocks deleted the exception-safe-rmm-allocation branch August 8, 2025 23:55
Contributor

In order to take advantage of any of this, we need to use cudaMemcpyAsync. It's not a question of whether we should use it; we need to use it everywhere.

Absolutely; as you mentioned, even with the same stream + a synchronize on the CPU thread, this is still better (not blocking the device -- other threads can still make progress).

lowener pushed a commit to lowener/cuvs that referenced this pull request Aug 11, 2025
This commit introduces exception safety for RMM allocations.

Previously, device memory allocated through `cuvsRmmAlloc()` was freed manually using `cuvsRmmFree()`, in all the index impl classes.  The problem there is that if an exception is thrown in the intervening time between alloc and free, it would lead to a leak of device memory.

This commit extends the `CloseableHandle` class to encapsulate the allocation of device memory.  This new class is used in try-with-resources blocks, to make device memory allocations exception-safe.

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Lorenzo Dematté (https://github.com/ldematte)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1215