[Java] Exception-safe RMM Allocations #1215
Conversation
This commit introduces exception safety for RMM allocations. Previously, device memory allocated through `cuvsRmmAlloc()` was freed manually via `cuvsRmmFree()` in all the index impl classes. The problem: if an exception is thrown between the alloc and the free, the device memory is leaked. This commit extends the `CloseableHandle` class to encapsulate the allocation of device memory. The new class is used in try-with-resources blocks, making device memory allocations exception-safe.

Signed-off-by: MithunR <[email protected]>
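For orientation, here is a minimal sketch of that shape (the names, field types, and binding signatures are assumptions for illustration; the PR's actual class and helpers may differ):

```java
import java.lang.foreign.MemorySegment;

class CloseableRMMAllocation implements AutoCloseable {
  private MemorySegment handle; // the device "pointer"
  private final long resources;
  private final long size;

  CloseableRMMAllocation(long resources, long size) {
    this.resources = resources;
    this.size = size;
    this.handle = rmmAlloc(resources, size); // stand-in for the cuvsRmmAlloc() binding
  }

  MemorySegment handle() {
    return handle;
  }

  /** Relinquish ownership, e.g. when an IndexReference adopts the allocation. */
  MemorySegment release() {
    MemorySegment released = handle;
    handle = null;
    return released;
  }

  @Override
  public void close() {
    if (handle != null) {
      rmmFree(resources, handle, size); // stand-in for the cuvsRmmFree() binding
      handle = null;
    }
  }

  // Placeholder signatures; the real code calls the generated FFM bindings instead.
  private static MemorySegment rmmAlloc(long resources, long size) {
    throw new UnsupportedOperationException();
  }

  private static void rmmFree(long resources, MemorySegment ptr, long size) {
    throw new UnsupportedOperationException();
  }
}
```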
+1, very important change.
ldematte left a comment:
It's a good idea! It might interfere with work I'm doing on the device CuVSMatrix, but I'll take on the duty of merging them.
Left a couple of comments.
```java
closeableDataMemorySegmentP.release();
return new IndexReference(datasetMemorySegmentP, datasetBytes, tensorDataArena, index);
```
I see a problem here: we pass the dataset "pointer" to IndexReference to hold it and clean it up when we are done with the index (see destroyIndex()), so this will lead to a double free.
Your change is good only if we can determine that we don't need the dataset device memory after the index is built. I was not able to determine that; I think we might need it, as it might not be copied over again.
Ah, I see now why you need "release": it's a way to work around that.
I think it is better to avoid it, and simply not use try-with-resources here.
I think I haven't conveyed the utility of CloseableRMMAllocation properly.
Simple case (CagraIndexImpl, etc.)
Consider the following simplified (if contrived) case, representative of how allocateRMMSegment() is used in CagraIndexImpl, TieredIndexImpl, etc.:
```java
// search()
{
  var queriesDP = allocateRMMSegment(...);
  // ...
  cudaMemcpy(...);                        // Can throw.
  checkCuVSError( cuvsCagraSearch(...) ); // Can also throw.
  // ...
  // And finally.
  cuvsRmmFree( queriesDP );
}
```

There are several throwable points between the alloc() and the free(). If any of them fire, queriesDP is leaked in __device__ memory. This is the simple case that CloseableRMMAllocation addresses.
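Under the wrapper, the same flow becomes exception-safe with try-with-resources. A sketch, reusing the pseudo-names from the example above:

```java
// search(), exception-safe: the allocation is freed on every exit path.
try (var queriesDP = allocateRMMSegment(cuvsResources, queriesBytes)) {
  cudaMemcpy(queriesDP.handle(), hostQueries, queriesBytes);     // Can throw.
  checkCuVSError(cuvsCagraSearch(queriesDP.handle() /* ... */)); // Can also throw.
} // close() invokes cuvsRmmFree() here, even if an exception fired above.
```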
The case for .release() (BruteForceIndexImpl)
A similar example to the one above, except that the Index adopts the allocation and holds it until destroyIndex() is called.
```java
// build()
{
  var datasetMemorySegmentP = allocateRMMSegment(...);
  // ...
  cudaMemcpy(...);                            // Can throw.
  checkCuVSError( cuvsBruteForceBuild(...) ); // Can also throw.
  // ...
  // And finally, commit. No free().
  return new IndexReference( datasetMemorySegmentP, ... );
}
```

All the perils of the first example appear here as well; there are many throwable points between the allocation and the creation of the IndexReference. We need a way to clean up the memory allocation if there's any throw before the final commit (to IndexReference).
This is why we release() right before the return.
Not using try-with-resources would mean that we're still open to __device__ memory leaks.
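Put together, the commit-style flow looks roughly like this (a sketch with the same pseudo-names as above):

```java
// build(), exception-safe: freed on any throw, adopted by the index on success.
try (var closeableDatasetP = allocateRMMSegment(cuvsResources, datasetBytes)) {
  cudaMemcpy(closeableDatasetP.handle(), hostDataset, datasetBytes); // Can throw.
  checkCuVSError(cuvsBruteForceBuild(/* ... */));                    // Can also throw.
  var ref = new IndexReference(closeableDatasetP.handle() /* ... */);
  closeableDatasetP.release(); // Commit: the IndexReference now owns the memory.
  return ref;
} // close() is a no-op after release(); it frees the memory if we threw earlier.
```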
Ah, I see. Still, I feel like release() is not clear; a more explicit way would be a plain old try/catch:

```java
var closeableDataMemorySegment = allocateRMMSegment(cuvsResources, datasetBytes);
try {
  MemorySegment datasetMemorySegment = closeableDataMemorySegment.handle();
  // use handle
  // here ownership is "transferred"
  return new IndexReference(closeableDataMemorySegment, ... );
} catch (Throwable t) {
  closeableDataMemorySegment.close();
  throw t;
}
```

Let me see if I can think of a different pattern.
I think what you are trying can be seen as similar to C++ unique_ptr; the difference is that we don't have move semantics in Java (as a simple way to transfer ownership).
An alternative could be to implement release() closer to what C++ has, e.g. returning the enclosed object:

```java
try (var closeableDataMemorySegment = allocateRMMSegment(cuvsResources, datasetBytes)) {
  MemorySegment datasetMemorySegment = closeableDataMemorySegment.handle();
  // use handle
  // here ownership is "transferred" more explicitly
  return new IndexReference(
      // release, get back the "raw pointer", and pass it immediately to another
      // CloseableRMMAllocation that will handle its lifetime
      new CloseableRMMAllocation(closeableDataMemorySegment.release()),
      ... );
}
```

While I think this models the "I am transferring ownership" idea more clearly, there is one drawback: what if the IndexReference or CloseableRMMAllocation ctors throw? That is not a danger here, as they are both no-throw ctors (just simple assignments), but it's still a little bit less robust.
I think I still like the explicit try/catch more.
I want to highlight this though:
If any of them fire, queriesDP is leaked in __device__ memory.
Leaking device memory seems particularly bad; but how bad is it, actually? Like, would we be leaking it even after the process is gone, or will the OS/device driver be able to reclaim that memory (as with host memory)?
In any case, calling @ChrisHegarty in to see if we can protect further against this possibility (using cleaners, maybe?)
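For reference, a Cleaner-based backstop might look roughly like this (a sketch only; whether cuvs-java should add one was left open in this thread, and the free call is a stand-in for the real binding):

```java
import java.lang.foreign.MemorySegment;
import java.lang.ref.Cleaner;

class CleanedRMMAllocation implements AutoCloseable {
  private static final Cleaner CLEANER = Cleaner.create();

  // The cleanup action must not capture the owning object, or the owner would
  // never become phantom-reachable; it holds only the raw handle and size.
  private record FreeAction(MemorySegment ptr, long size) implements Runnable {
    @Override
    public void run() {
      // Call the cuvsRmmFree() binding here (stand-in).
    }
  }

  private final Cleaner.Cleanable cleanable;

  CleanedRMMAllocation(MemorySegment ptr, long size) {
    this.cleanable = CLEANER.register(this, new FreeAction(ptr, size));
  }

  @Override
  public void close() {
    cleanable.clean(); // Deterministic path; the Cleaner is only a safety net.
  }
}
```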
Similar to C++ unique_ptr...
Exactly right. That's what this is modeled on. This will turn into an RAII wrapper after I have moved the allocation into the constructor.
That's also the pattern we use in https://github.com/rapidsai/cudf, when we transfer ownership of the underlying __device__ memory in a cuDF column.
My initial version also had the pointer returned from release().
Edit: It looks like it's not just my initial version; release() does currently return the old pointer when relinquishing the memory. I didn't use it in BruteForceIndexImpl because the original pointer was already at hand in the same scope. This will get tighter once the RAII change is made.
would we be leaking it even after the process is gone, or will the OS/device driver be able to reclaim that memory (as with host memory)?
No, the leak does not persist beyond the lifetime of the process; that memory should be reclaimed after the process exits, yes.
But I would rather not assume that the users of cuvs-java are short-running processes, and require them to be tolerant of leaked memory. Even if current users might be alright with a leak in exceptional events, a future user might be a long-running application.
I would like not to make an assumption that the users of cuvs-java are short-running processes
Definitely not; I was just trying to understand the worst-case scenario here, and whether we need something extra beyond taking care of exceptions.
Hmm... There seems to be a logic error in …
/ok to test 4fd5f8b
/ok to test 7314065
ldematte left a comment:
Looks good. I left a couple of optional comments (for this PR or for a follow-up).
```java
long[] datasetShape = {rows, cols};
MemorySegment datasetTensor =
    prepareTensor(localArena, datasetDP, datasetShape, kDLFloat(), 32, kDLCUDA(), 1);
MemorySegment index = localArena.allocate(cuvsTieredIndex_t);
```
Unrelated to this PR, but I'm conflicted about using cuvsTieredIndex_t here, since we need a POINTER (as we see below, when we extract a C_POINTER).
Right now, cuvsTieredIndex_t is a pointer, but if that changes, we would end up allocating e.g. a struct and using it to store (and retrieve) a pointer...
Right now, cuvsTieredIndex_t is a pointer, but if that changes...
I think that could apply to any of the typedefs exposed in the C++ API. Changing those structures would be a non-trivial event. I would think @benfred and gang are likely to keep those stable. :]
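If one did want to make that decoupling explicit, one option is to allocate the out-parameter slot as a plain pointer layout rather than via the typedef. A sketch using standard java.lang.foreign (`ValueLayout`); `localArena` is from the snippet above, and the index-create call is elided:

```java
// Allocate the out-parameter slot explicitly as a pointer, independent of
// whatever layout the cuvsTieredIndex_t typedef maps to in a given release.
MemorySegment indexPtr = localArena.allocate(ValueLayout.ADDRESS);
// ... pass indexPtr to the index-create binding, then read the pointer back:
MemorySegment index = indexPtr.get(ValueLayout.ADDRESS, 0);
```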
/ok to test f90efc0
@ldematte: I've addressed the remainder of your concerns. That copy-constructor suggestion helped DRY the code up a bit more. Does this look more agreeable?
ldematte left a comment:
It was already good to me, but now it looks even better, I think. Thanks for sticking with me and implementing my suggestions!
```java
// ...
    ? allocateRMMSegment(cuvsRes, prefilterBytes)
    : CloseableRMMAllocation.EMPTY) {
// ...
cudaMemcpy(queriesDP.handle(), floatsSeg, queriesBytes, INFER_DIRECTION);
```
Not something to address in this PR, but we should really be using cudaMemcpyAsync everywhere (and accepting a stream / cuvs resources). This is going to synchronize the whole device, and that'll slow things down immensely. I suspect this could be why @chatman and team have seen that multi-threaded search reduces perf significantly.
cjnolet left a comment:
The change intended in this PR LGTM. We should address the cudaMemcpy -> cudaMemcpyAsync change ASAP, though, because it will affect concurrency / perf drastically. We really should go through the whole C/Java API and replace cudaMemcpy with cudaMemcpyAsync everywhere (and accept cuvsResources everywhere needed).
@cjnolet a couple of questions about this to cement my understanding:
I still have to benchmark it, but this is where I am leaning with …
Even worse-
A stream is necessary to queue up the memory copy, but this is also very important because the goal here is to queue up work (even if on the same stream) and not block the client, so that the GPU can work off the queue asynchronously. Of course, if you want to overlap multiple memory copies and kernel launches, the cuvsResources can also be configured by the user to provide multiple "worker streams", which can be used to perform these copies asynchronously and independently. In order to take advantage of any of this, we need to use cudaMemcpyAsync; it's not a question of whether we should use it. We need to use it everywhere.
I mentioned this above, but I want to expand on it to make it fully clear: asynchronous computation, even on the same stream, is just as important, so that we queue up work and synchronize as little as possible. Queueing up work on a stream (such as a memory copy, a memory allocation/free, or a kernel launch) has very little overhead, if any. But if we have to synchronize the CPU thread every time we queue up one of these operations, we're going to insert blocking gaps in the pipeline, and these can be very costly in latency-sensitive workloads (you can use the nsight-systems application to profile this; you'll literally see the gaps in the pipeline). Instead, we utilize the asynchronous non-blocking behavior as much as possible so the CPU can continue queuing up work for the GPU independent of the thread that's making the calls to the runtime API. This is also why we only explicitly synchronize the stream in the cuVS C++/C layer when it's absolutely necessary: in general, either because we just did a device-to-host copy and need to immediately read the memory on host, or because we are about to use multiple streams to overlap some operations, in which case we synchronize the main stream before splitting off into a series of "worker streams" for things like overlapping computation / memory copies, as you mentioned.
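Sketched at the Java layer, the queue-then-sync shape described here would look something like the following (hypothetical binding names: the `cudaMemcpyAsync` and `streamSynchronize` wrappers are assumptions; only the shape is the point):

```java
// Enqueue all the work on one stream without blocking the CPU thread...
cudaMemcpyAsync(queriesDP, hostQueries, queriesBytes, stream);        // host-to-device, enqueued
checkCuVSError(cuvsCagraSearch(/* ..., stream via cuvsResources */)); // enqueued behind the copy
cudaMemcpyAsync(hostResults, resultsDP, resultsBytes, stream);        // device-to-host, enqueued
// ...and block exactly once, when the host actually needs the results.
streamSynchronize(stream);
```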
Resources are expensive to create, but the individual resources stored on the cuvsResources instance are created lazily, as algorithms need them, to try to amortize the cost. I hope we're not saying we are recreating these resource objects often. Ideally we should create them once at the beginning and reuse them in subsequent calls. They don't have a high memory cost, but things like cuBLAS handles can be very costly to create, and those are used very often.
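In cuvs-java terms, that amortization pattern is simply create-once, reuse-everywhere (a sketch; it assumes the public `CuVSResources.create()` factory and its throws clause, which may differ):

```java
import com.nvidia.cuvs.CuVSResources;

class ResourcesReuse {
  static void run() throws Throwable {
    // Create the resources handle once and reuse it for every build/search;
    // costly members (e.g. cuBLAS handles) are created lazily on first use,
    // so their cost is amortized across all subsequent calls.
    try (CuVSResources resources = CuVSResources.create()) {
      // Build indexes and run searches against this single instance...
    }
  }
}
```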
/merge
Merged 5e3a5a6 into rapidsai:branch-25.10
Absolutely; like you mentioned, even with the same stream plus a synchronize on the CPU thread, this is still better (not blocking the device -- other threads can still make progress).
Authors:
- MithunR (https://github.com/mythrocks)

Approvers:
- Lorenzo Dematté (https://github.com/ldematte)
- Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1215