
Conversation

Contributor

@mythrocks mythrocks commented Aug 5, 2025

This commit introduces exception safety for RMM allocations.

Previously, device memory allocated through cuvsRmmAlloc() was freed manually using cuvsRmmFree() in all the index impl classes. The problem is that an exception thrown in the interval between the alloc and the free would leak the device memory.

This commit extends the CloseableHandle class to encapsulate the allocation of device memory. This new class is used in try-with-resources blocks, to make device memory allocations exception-safe.

Signed-off-by: MithunR <[email protected]>
@mythrocks mythrocks self-assigned this Aug 5, 2025
@mythrocks mythrocks added the improvement Improves an existing functionality label Aug 5, 2025
@mythrocks mythrocks requested a review from a team as a code owner August 5, 2025 06:22
@mythrocks mythrocks added non-breaking Introduces a non-breaking change Java labels Aug 5, 2025
@mythrocks mythrocks marked this pull request as draft August 5, 2025 06:22

copy-pr-bot bot commented Aug 5, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


Contributor

chatman commented Aug 5, 2025

+1, very important change.

Contributor

@ldematte ldematte left a comment


It's a good idea! It might interfere with work I'm doing on the device CuVSMatrix, but I'll take on the job of merging them.
Left a couple of comments.


return new IndexReference(datasetMemorySegmentP, datasetBytes, tensorDataArena, index);
closeableDataMemorySegmentP.release();
return new IndexReference(datasetMemorySegmentP, datasetBytes, tensorDataArena, index);
Contributor

I see a problem here: we pass the dataset "pointer" to IndexReference to hold it and clean it up when we are done with the index (see destroyIndex()), so this will lead to a double free.
Your change is good only if we can determine that we don't need the dataset device memory after the index is built; however, I was not able to determine whether we need it. I think we might, as it might not be copied over again.

Contributor

Ah, I see now why you need "release": it's a way to work around that.
I think it is better to avoid it, and simply not use try-with-resources here.

Contributor Author

@mythrocks mythrocks Aug 5, 2025

I think I haven't conveyed the utility of CloseableRMMAllocation properly.

Simple case (CagraIndexImpl, etc.)

Consider the following simplified (if contrived) case, representative of how allocateRMMSegment() is used in CagraIndexImpl, TieredIndexImpl, etc.:

// search()
{
  var queriesDP = allocateRMMSegment(...);
  // ...
  cudaMemcpy(...); // Can throw.
  checkCuVSError( cuvsCagraSearch(...) ); // Can also throw.
  // ...
  // And finally.
  cuvsRmmFree( queriesDP );
}

There are several throwable points between the alloc() and the free(). If any of them fire, queriesDP is leaked in __device__ memory. This is the simple case that CloseableRMMAllocation addresses.

The case for .release() (BruteForceIndexImpl)

Similar example as above, except that the Index adopts the allocation, and holds it until destroyIndex() is called.

// build()
{
  var datasetMemorySegmentP = allocateRMMSegment(...);
  // ...
  cudaMemcpy(...); // Can throw.
  checkCuVSError( cuvsBruteForceBuild(...) ); // Can also throw.
  // ...
  // And finally, commit.  No free().
  return new IndexReference( datasetMemorySegmentP, ... );
}

All the perils of the first example appear here as well; there are many throwable points between the allocation and the creation of the IndexReference. We need a way to clean up the memory allocation if there's any throw before the final commit (to IndexReference).
This is why we release() right before the return.

Not using try-with-resources would mean that we're still open to __device__ memory leaks.
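A self-contained sketch of the pattern (class and method names here are illustrative stand-ins, not the actual cuvs-java API; a counter simulates cuvsRmmAlloc()/cuvsRmmFree()):

```java
// Illustrative sketch only: the counter stands in for RMM device allocations.
public class RmmSketch {
    static int liveAllocations = 0;

    static final class CloseableAllocation implements AutoCloseable {
        private boolean owned = true;

        CloseableAllocation() {
            liveAllocations++; // stands in for cuvsRmmAlloc()
        }

        @Override
        public void close() {
            if (owned) {
                liveAllocations--; // stands in for cuvsRmmFree()
            }
        }
    }

    public static void main(String[] args) {
        try (var queriesDP = new CloseableAllocation()) {
            // cudaMemcpy / cuvsCagraSearch would go here; simulate a throw:
            throw new RuntimeException("simulated CUDA error");
        } catch (RuntimeException e) {
            // close() already freed the allocation before the exception propagated.
        }
        System.out.println("live allocations: " + liveAllocations); // 0: nothing leaked
    }
}
```

Even when the simulated copy throws, the try-with-resources block guarantees the close() path runs, so nothing leaks.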

Contributor

@ldematte ldematte Aug 6, 2025

Ah, I see. Still, I feel like release() is not clear; a more explicit way would be a plain old try/catch:

var closeableDataMemorySegment = allocateRMMSegment(cuvsResources, datasetBytes);
try {
   MemorySegment datasetMemorySegment = closeableDataMemorySegment.handle();
   // use handle
   // here ownership is "transferred"
   return new IndexReference(closeableDataMemorySegment, ... );
} catch (Throwable t) {
   closeableDataMemorySegment.close();
   throw t;
}

let me see if I can think of a different pattern.

Contributor

@ldematte ldematte Aug 6, 2025

I think what you are trying can be seen as similar to C++ unique_ptr; the difference is that we don't have move semantics in Java (as a simple way to transfer ownership).
An alternative could be to implement release() closer to what C++ has, e.g. returning the enclosed object:

try (var closeableDataMemorySegment = allocateRMMSegment(cuvsResources, datasetBytes)) {
   MemorySegment datasetMemorySegment = closeableDataMemorySegment.handle();
   // use handle
   // here ownership is "transferred" more explicitly
   return new IndexReference(
          new CloseableRMMAllocation(closeableDataMemorySegment.release()), // release it, get back the "raw pointer", and pass it immediately to another `CloseableRMMAllocation` that will handle its lifetime
         ... );
}

While I think this models the "I am transferring ownership" idea more clearly, there is one drawback: what if the IndexReference or CloseableRMMAllocation ctors throw? That is not a danger here, as they are both no-throw ctors (just simple assignments), but it's still a little bit less robust.

I think I still like the explicit try/catch more.

Contributor

I want to highlight this though:

If any of them fire, queriesDP is leaked in __device__ memory

Leaking device memory seems particularly bad; but how bad is it, actually? Would we be leaking it even after the process is gone, or will the OS/device driver be able to reclaim that memory (as with host memory)?

In any case, calling @ChrisHegarty in to see if we can protect further against this possibility (using cleaners maybe?)
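For reference, a Cleaner-based backstop could look roughly like this (a hypothetical sketch, not the cuvs-java code; a boolean stands in for cuvsRmmFree()):

```java
import java.lang.ref.Cleaner;

// Hypothetical sketch: a Cleaner acts as a backstop for a forgotten close().
public class CleanerSketch {
    static final Cleaner CLEANER = Cleaner.create();
    static boolean freed = false;

    static final class DeviceBuffer implements AutoCloseable {
        // The cleanup action must not capture `this`, or the buffer stays reachable.
        private final Cleaner.Cleanable cleanable =
                CLEANER.register(this, () -> freed = true); // cuvsRmmFree() stand-in

        @Override
        public void close() {
            cleanable.clean(); // deterministic release; the GC-triggered path is only a backstop
        }
    }

    public static void main(String[] args) {
        try (var buf = new DeviceBuffer()) {
            // use the buffer
        }
        System.out.println("freed: " + freed); // true
    }
}
```

The explicit close() remains the primary release path; the Cleaner only catches the case where a caller never closes the buffer, and its timing is at the mercy of the GC.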

Contributor Author

@mythrocks mythrocks Aug 6, 2025

Similar to C++ unique_ptr...

Exactly right. That's what this is modeled on. This will turn into an RAII wrapper after I have moved the allocation into the constructor.

That's also the pattern we use in https://github.com/rapidsai/cudf, when we transfer ownership of the underlying __device__ memory in a CUDF column.

My initial version also had the pointer returned from release.

Edit: Looks like it's not just my initial version; release() does currently return the old pointer when relinquishing the memory. I didn't use it in BruteForceIndexImpl because the original pointer was already at hand in the same scope. This will get tighter once the RAII change is made.
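A sketch of that unique_ptr-style transfer (again with illustrative stand-in names, a counter for the RMM calls, and a long for the device pointer):

```java
// Illustrative sketch of unique_ptr-style ownership transfer; names are stand-ins.
public class ReleaseSketch {
    static int liveAllocations = 0;

    static final class CloseableAllocation implements AutoCloseable {
        private final long handle; // stands in for the device pointer
        private boolean owned = true;

        CloseableAllocation(long handle) {
            this.handle = handle;
            liveAllocations++; // cuvsRmmAlloc() stand-in
        }

        /** Relinquish ownership and return the raw handle, like unique_ptr::release(). */
        long release() {
            owned = false;
            return handle;
        }

        @Override
        public void close() {
            if (owned) {
                liveAllocations--; // cuvsRmmFree() stand-in
            }
        }
    }

    static final class IndexRef {
        final long datasetHandle; // adopted; freed later by a destroyIndex() analogue
        IndexRef(long datasetHandle) { this.datasetHandle = datasetHandle; }
    }

    static IndexRef build() {
        try (var dataset = new CloseableAllocation(0xCAFE)) {
            // copies / index build can throw here; the allocation would then be freed.
            return new IndexRef(dataset.release()); // commit: ownership transfers
        }
    }

    public static void main(String[] args) {
        IndexRef ref = build();
        System.out.println("still live (owned by index): " + liveAllocations); // 1
    }
}
```

If anything throws before the final return, close() frees the allocation; after release(), close() becomes a no-op and the index owns the memory.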

Contributor Author

would we be leaking it even after the process is gone, or will the OS/device driver be able to reclaim that memory (as with host memory)?

No, the leak does not persist beyond the lifetime of the process. The memory is reclaimed after the process exits, yes.

But I would prefer not to assume that the users of cuvs-java are short-running processes, and require them to be tolerant of leaked memory. Even if current users might be alright with a leak in exceptional events, a future user might be a long-running application.

Contributor

I would like not to make an assumption that the users of cuvs-java are short-running processes

Definitely not; I was just trying to understand the worst-case scenario here, and whether we need something extra beyond taking care of exceptions.


copy-pr-bot bot commented Aug 5, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


Contributor Author

Hmm... There seems to be a logic error in CagraIndexImpl::search(). I'm debugging it right now.

@mythrocks mythrocks marked this pull request as ready for review August 5, 2025 23:32
Contributor Author

/ok to test 4fd5f8b

@mythrocks mythrocks changed the title [WIP] [Java] Exception-safe RMM Allocations [REVIEW] [Java] Exception-safe RMM Allocations Aug 5, 2025
Contributor Author

/ok to test 7314065

Contributor

@ldematte ldematte left a comment


Looks good, I left a couple of optional comments (for this PR or for a follow-up)

long[] datasetShape = {rows, cols};
MemorySegment datasetTensor =
prepareTensor(localArena, datasetDP, datasetShape, kDLFloat(), 32, kDLCUDA(), 1);
MemorySegment index = localArena.allocate(cuvsTieredIndex_t);
Contributor

Unrelated to this PR, but I'm conflicted about using cuvsTieredIndex_t, since we need a POINTER here (as we see below, when we extract a C_POINTER).
Right now, cuvsTieredIndex_t is a pointer, but if that changes, we would end up allocating e.g. a struct, and using it to store (and retrieve) a pointer...

Contributor Author

Right now, cuvsTieredIndex_t is a pointer, but if that changes...

I think that could apply to any of the typedefs exposed in the C++ API. Changing those structures would be a non-trivial event. I would think @benfred and gang are likely to keep those stable. :]

Contributor Author

/ok to test f90efc0

Contributor Author

@ldematte: I've addressed the remainder of your concerns. That copy-constructor suggestion helped DRY the code up a bit more.

Does this look more agreeable?

Contributor

@ldematte ldematte left a comment


It was already good to me, but now it looks even better, I think. Thanks for sticking with me and implementing my suggestions!

@mythrocks mythrocks changed the title [REVIEW] [Java] Exception-safe RMM Allocations [Java] Exception-safe RMM Allocations Aug 8, 2025
Contributor Author

mythrocks commented Aug 8, 2025

/merge

@cjnolet, any chance you might approve this change? @ldematte has approved it already.

? allocateRMMSegment(cuvsRes, prefilterBytes)
: CloseableRMMAllocation.EMPTY) {

cudaMemcpy(queriesDP.handle(), floatsSeg, queriesBytes, INFER_DIRECTION);
Member

@cjnolet cjnolet Aug 8, 2025

Not something to address in this PR, but we should really be using cudaMemcpyAsync everywhere (and accepting a stream / cuvs resources). This is going to synchronize the whole device, and that'll slow things down immensely. I suspect this could be why @chatman and team have seen that multi-threaded search reduces perf significantly.

Member

@cjnolet cjnolet left a comment

The change intended in this PR LGTM. We should address the cudaMemcpy -> cudaMemcpyAsync change ASAP, though, because it will affect concurrency / perf drastically. We really should go through the whole C/Java API and replace cudaMemcpy with cudaMemcpyAsync everywhere (and accept cuvsResources everywhere needed).

Contributor

ldematte commented Aug 8, 2025

The change intended in this PR LGTM. We should address the cudaMemcpy -> cudaMemcpyAsync change ASAP, though, because it will affect concurrency / perf drastically. We really should go through the whole C/Java API and replace cudaMemcpy with cudaMemcpyAsync everywhere (and accept cuvsResources everywhere needed).

@cjnolet a couple of questions about this to cement my understanding: cudaMemcpyAsync is "async" in two ways: it does not block the caller (the CPU thread calling it), and it can allow for overlapping copy/computation (if it uses a non-0 stream parameter).

  • If we want the overlapping copy/computation but need to be sure the operation has finished before proceeding with the CPU side of things (e.g. having copied out the data), we need to pass a non-0 stream and then "block" the caller (CPU thread) with cudaStreamSynchronize(stream), correct? Operations on the device will still overlap.
  • If we want copy/computation to overlap with cuvs operations, we need to use a different stream, right? We cannot use cuvsGetStream()/raft::resources::get_cuda_stream() for the copy too, or it will use the same stream it uses for e.g. indexing, and we will be unable to overlap. So: different resources (or a different, manually managed stream), correct?

Contributor

ldematte commented Aug 8, 2025

I still have to benchmark it, but this is where I am leaning with CuVSDeviceMatrix: create (and manage) a separate stream for copy operations via cudaMemcpyAsync. An additional reason is that we then won't need to keep a reference to the resources, which might be problematic (scope, threading); but it is my understanding that having a different stream is also best performance-wise.

Member

cjnolet commented Aug 8, 2025

If we want the overlapping copy/computation but we need to be sure the operation is finished before proceeding (e.g. copied out the data) with the CPU side of things, we need to pass a non-0 stream and then "block" the caller (CPU thread) with a cudaStreamSynchronize(stream), correct? Operations on the device will still overlap.

Even worse: cudaMemcpy blocks the whole device! So imagine having multiple threads, each performing search() with different streams; they won't be able to overlap, because cudaMemcpy will block the entire device each time it's called, until the memory has been fully copied. This is why we NEVER use cudaMemcpy, and always use cudaMemcpyAsync.

If we want overlapping copy/computation with cuvs operations we need to use a different stream, right? We cannot use cuvsGetStream()/raft::resources::get_cuda_stream() for the copy too, or it will use the same stream it uses for e.g. indexing, and we will be unable to overlap. So, different resources (or different stream, manually managed), correct?

A stream is necessary to queue up the memory copy, but this is still also very important because the goal here is to queue up work (even on the same stream) and not block the client, so that the GPU can work off the queue asynchronously. Of course, if you want to overlap multiple memory copies and kernel launches, the cuvsResources can also be configured by the user to provide multiple "worker streams", which can be used to perform these copies asynchronously and independently.

In order to take advantage of any of this, we need to use cudaMemcpyAsync. It's not a question of whether we should use it; we need to use it everywhere.
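The queue-up-and-synchronize-late idea can be loosely modeled on the host side (this is only an analogy using a plain single-threaded executor as a FIFO "stream", not a CUDA binding):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Analogy only: a CUDA stream behaves like a FIFO queue that the device drains
// asynchronously; a single-threaded executor gives the same in-order semantics.
public class StreamAnalogy {
    static final List<String> completed = new ArrayList<>();

    public static void main(String[] args) throws Exception {
        ExecutorService stream = Executors.newSingleThreadExecutor();

        // Queue work without blocking the caller (cudaMemcpyAsync / kernel launch).
        stream.submit(() -> completed.add("H2D copy"));
        stream.submit(() -> completed.add("search kernel"));
        Future<?> d2h = stream.submit(() -> completed.add("D2H copy"));

        // cudaStreamSynchronize analogue: block only when the host must read results.
        d2h.get();
        System.out.println(completed);
        stream.shutdown();
    }
}
```

The caller thread only blocks at the final get(); everything before that is merely enqueued, which is the behavior cudaMemcpyAsync preserves and a blocking cudaMemcpy destroys.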

Member

cjnolet commented Aug 8, 2025

If we want overlapping copy/computation with cuvs operations we need to use a different stream, right?

I mentioned this above, but want to expand on it just to make it fully clear: asynchronous computation, even on the same stream, is just as important; queue up work and synchronize as little as possible. Queueing up work on a stream (such as a memory copy, memory allocation/free, or a kernel launch) has very little overhead, if any. But if we have to synchronize the CPU thread every time we queue up one of these operations, we're going to insert blocking gaps in the pipeline, and these can be very costly in latency-sensitive workloads. (You can use the nsight-systems application to profile this; you'll literally see the gaps in the pipeline.)

Instead, we utilize the asynchronous, non-blocking behavior as much as possible, so the CPU can continue queuing up work for the GPU independent of the thread that's making the calls to the runtime API. This is also why we only explicitly synchronize the stream in the cuVS C++/C layer when it's absolutely necessary (which in general is either because we just did a device-to-host copy and need to immediately read the memory on the host, or because we are about to use multiple streams to overlap some operations, and we synchronize the main stream before splitting off into a series of "worker streams" for things like overlapping computation / memory copies, as you mentioned).

Member

cjnolet commented Aug 8, 2025

The additional reason is that we also won't need to keep a reference to resources, which might be problematic (scope, threading), but it is my understanding that it's also best performance-wise.

resources are expensive to create, but the individual resources that get stored on the cuvsResources instance are created lazily, as algorithms need them, to amortize the cost. I hope we're not saying we are recreating these resource objects often. Ideally, we should create them once at the beginning and reuse them in subsequent calls. They don't have a high memory cost, but things like cuBLAS handles can be very costly to create, and those are used very often.

Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 5e3a5a6 into rapidsai:branch-25.10 Aug 8, 2025
100 of 102 checks passed
Contributor Author

mythrocks commented Aug 8, 2025

I'll address the cudaMemcpyAsync change in a follow-on, shortly.

Thank you for the reviews, @ldematte, @cjnolet.

@mythrocks mythrocks deleted the exception-safe-rmm-allocation branch August 8, 2025 23:55
Contributor

In order to take advantage of any of this, we need to use cudaMemcpyAsync. It's not a question of whether we should use it; we need to use it everywhere.

Absolutely; as you mentioned, even with the same stream + a synchronize on the CPU thread, this is still better (not blocking the device -- other threads can still make progress).

lowener pushed a commit to lowener/cuvs that referenced this pull request Aug 11, 2025
This commit introduces exception safety for RMM allocations.

Previously, device memory allocated through `cuvsRmmAlloc()` was freed manually using `cuvsRmmFree()`, in all the index impl classes.  The problem there is that if an exception is thrown in the intervening time between alloc and free, it would lead to a leak of device memory.

This commit extends the `CloseableHandle` class to encapsulate the allocation of device memory.  This new class is used in try-with-resources blocks, to make device memory allocations exception-safe.

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Lorenzo Dematté (https://github.com/ldematte)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1215