
Conversation


@rmaschal rmaschal commented Aug 6, 2025

As title. For larger numbers of clusters (or a faster PCIe link), the work required to find cluster assignments can take longer than the HtoD copy for each batch. One of the batch_load_iterator device buffers could then be overwritten before the kmeans predict completed, lowering recall.

This adds a raft::resource::sync_stream at the end of the kmeans predict loop to avoid this. Although not strictly necessary there (the prior DtoH copy into pageable memory is already synchronous), a sync_stream is also added at the end of the quantization loop for consistency/correctness.

Also removed one unused comment.

@rmaschal rmaschal requested a review from a team as a code owner August 6, 2025 18:15
@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Aug 6, 2025
@cjnolet cjnolet moved this from Todo to In Progress in Vector Search, ML, & Data Mining Release Board Aug 6, 2025
dataset_vec_batches.prefetch_next_batch();

// Make sure work on device is finished before swapping buffers
raft::resource::sync_stream(res);
Contributor

prefetch_next_batch seems to be synchronizing at the very end. Would it make more sense to put the sync_stream before prefetching the batch (above line 175)?

Contributor Author

prefetch_next_batch() only syncs on the stream used for copying, so if the copy happens on a separate stream from the compute work (e.g. to overlap prefetch with work), that work might not be done when the buffers are swapped, which is what I am trying to fix. Moving the sync_stream before the prefetch would also fix the problem, but at the cost of losing the overlap of work and copy, reducing performance.
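The trade-off in this exchange can be sketched as two loop orderings (raft-style pseudocode; everything other than sync_stream and prefetch_next_batch is illustrative):

```
// Ordering used in this PR: the copy of batch b+1 (on the copy stream)
// overlaps the kmeans predict on batch b (on the main stream).
for each batch b:
  predict(res, batch[b])             // async on the main stream
  batches.prefetch_next_batch()      // async HtoD on the copy stream
  raft::resource::sync_stream(res)   // predict finishes before buffers swap

// Alternative raised in review (sync before prefetch): also safe, but the
// copy of batch b+1 cannot start until predict on batch b has finished,
// so copy and compute no longer overlap.
for each batch b:
  predict(res, batch[b])
  raft::resource::sync_stream(res)
  batches.prefetch_next_batch()
```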

Contributor

Good point. I had not realized that prefetching is done on a different stream.

@cjnolet
Member

cjnolet commented Aug 7, 2025

/merge

@rapids-bot rapids-bot bot merged commit 3cd48dc into rapidsai:branch-25.10 Aug 7, 2025
55 checks passed
lowener pushed a commit to lowener/cuvs that referenced this pull request Aug 11, 2025

Authors:
  - https://github.com/rmaschal

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Tarang Jain (https://github.com/tarang-jain)

URL: rapidsai#1224
