
Conversation

@jinsolp (Contributor) commented Jul 26, 2025

This fixes an edge case whose root cause is described in issue #1056: `raft::linalg::gemm` can return slightly different distances for the same pair of vectors depending on the size of its inputs.

Right now, when merging two knn graphs from different batches, we sort by distance (i.e. keys) and, if the distances are equal, by index (i.e. values). After sorting, we compare each index with the one immediately next to it to check for duplicates, under the assumption that the same vector always ends up with the same distance.

However, due to the problem described in issue #1056, the distance for the same index can differ slightly depending on the size of the input matrix passed to gemm (or on where the vector sits in the overall matrix).

For example, say we are computing the nearest neighbors of vector 0. We could end up with
```
indices   = [1, 2, 3, 2, ...]
distances = [0.023, 0.02355981, 0.02355983, 0.02355987]
```

because the distance between vector 0 and vector 2 is computed as 0.02355981 in the first batch and 0.02355987 in the second batch.

This PR fixes the issue by checking the 4 neighbors to the left of each entry for duplicates, instead of only the one immediately next to it.
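
For illustration, here is a minimal host-side sketch of the windowed duplicate check (not the actual cuvs kernel; the function name `mark_duplicates` and the `kDupWindow` constant are made up for this example):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration of the fix: after the merged candidate list has
// been sorted by (distance, index), an entry is treated as a duplicate if the
// same index appears anywhere in the previous `window` positions, not only in
// the slot immediately to its left.
constexpr std::size_t kDupWindow = 4;  // window size chosen heuristically in this PR

std::vector<bool> mark_duplicates(const std::vector<int64_t>& indices,
                                  std::size_t window = kDupWindow)
{
  std::vector<bool> is_dup(indices.size(), false);
  for (std::size_t i = 1; i < indices.size(); ++i) {
    std::size_t start = (i > window) ? i - window : 0;
    for (std::size_t j = start; j < i; ++j) {
      if (indices[j] == indices[i]) {  // same neighbor id seen within the window
        is_dup[i] = true;
        break;
      }
    }
  }
  return is_dup;
}

int main()
{
  // The example from the description: index 2 appears twice with slightly
  // different distances, but the windowed check still catches the duplicate.
  std::vector<int64_t> indices = {1, 2, 3, 2};
  std::vector<bool> dup = mark_duplicates(indices);
  // dup == {false, false, false, true}
  return 0;
}
```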

@cjnolet (Member) commented Jul 30, 2025

@jinsolp thanks for working to fix this issue. Have you done any profiling or benchmarking to gauge the impact of the fix? I just want to make sure this doesn't have a large impact on the resulting perf (and to quantify it in some way).

@jinsolp (Contributor, Author) commented Jul 30, 2025

On average about +3 seconds to run all_neighbors e2e using 10 clusters on a 10M x 128 dataset with k=32. I don't think it's too concerning.

@cjnolet (Member) commented Jul 31, 2025

@jinsolp can you share more info about the benchmark? What percentage of the overall time was 3 seconds? Was it <1%? Was it 70% of the overall time?

@jinsolp (Contributor, Author) commented Jul 31, 2025

Oh yes, sure. The entire e2e time was about 67 seconds vs. 70 seconds.

@cjnolet (Member) commented Jul 31, 2025

Thanks @jinsolp, so that amounts to about a 4.4% hit in perf, unfortunately. Still, I'd rather have a 5% hit for better correctness than to have a faster impl that is incorrect.

Do you have a sense for how much this drift happens in practice? Does it pop up nearly all the time? Is it an edge case?

@jinsolp (Contributor, Author) commented Jul 31, 2025

This doesn't happen that often (especially if the distances are large in scale, since they're unlikely to end up being the same). It's also not a major issue for NN Descent, only for brute-force search that directly uses `raft::linalg::gemm`. Still, I'm a bit concerned because I've seen this issue a few times while adding mutual reachability for brute force. After learning that `raft::linalg::gemm` is causing it, I ran (non-MR) brute force with a larger k, which ends up using `raft::linalg::gemm` internally, and it has a similar issue.
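
To give a sense of the mechanism behind this drift (a generic floating-point illustration, not a reproduction of the gemm kernel from issue #1056): changing the accumulation order, which different batch shapes and tilings can do, changes the result in the last bits.

```cpp
#include <cstdio>

int main()
{
  // Summing the same three values in two different orders gives two different
  // float results, because rounding happens after every partial sum.
  float a = 1e8f, b = 1.0f, c = -1e8f;
  float sum1 = (a + b) + c;  // b is absorbed into a, result is 0.0f
  float sum2 = (a + c) + b;  // a and c cancel first, result is 1.0f
  std::printf("%g vs %g\n", sum1, sum2);
  return 0;
}
```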

jinsolp changed the base branch from branch-25.08 to branch-25.10 on July 31, 2025 18:35
@cjnolet (Member) commented Jul 31, 2025

> This doesn't happen that often (especially if the distances are large in scale, since they're unlikely to end up being the same). It's also not a major issue for NN Descent, only for brute-force search that directly uses `raft::linalg::gemm`. Still, I'm a bit concerned because I've seen this issue a few times while adding mutual reachability for brute force. After learning that `raft::linalg::gemm` is causing it, I ran (non-MR) brute force with a larger k, which ends up using `raft::linalg::gemm` internally, and it has a similar issue.

Yeah, that's where I'm going with things. I definitely don't think we should ignore this issue by any means, but I wonder if we could lessen the perf effects by characterizing the conditions under which we see it happening and coming up with some rules so that this only needs to be done under those conditions. For example, is there a specific dimensionality where it starts happening? Or a specific number of rows? Does it always happen below/above those sizes? Is it specific to small matrices only? Is it that it doesn't happen at larger scales, or that its effects are less noticeable?

Trust me, I'm not trying to halt progress here, but imagine if we made a bunch of these fixes that each had a 4% or more impact on perf... eventually the perf gap would become unacceptable. I'm just trying to make sure we can "soften the blow," so to speak.

@jinsolp (Contributor, Author) commented Jul 31, 2025

Got it, I'll try to narrow this down a bit to see what we can do. Let me also try running it more times before averaging (right now I only ran each configuration 5 times), because I didn't expect this to add +3 seconds.

@jinsolp (Contributor, Author) commented Aug 5, 2025

@cjnolet since this is a rare edge case and doesn't happen across many different vectors, it should be enough to sweep a small neighboring window instead of sweeping the entire row. Window size 4 was chosen heuristically.

@jinsolp (Contributor, Author) commented Aug 6, 2025

Hmm, I realized that the duplicate problem still happens for mutual reachability unless we sweep the whole row. For example:

After the ordinary KNN pass:
```
neighbors of 25: [25,1840,2435,1185,3745,290,955,410,970,4410,3220,3435,1420,3640,490,320]
distances for 25: [-0,187.677,189.562,194.995,197.123,199.602,205.618,205.697,207.173,207.21,211.438,211.749,212.001,212.875,214.381,214.964]
```
The core distance for this vector is 214.964.

After the MR pass:
```
neighbors of 25: [25,290,320,410,490,970,1185,1420,1840,2435,3220,3435,3640,3745,4410,320]
distances for 25: [214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.965]
```

Neighbor 320 is calculated to have distance 214.965 from vector 25 in the second run, so the duplicate 320 cannot be detected unless the entire row is swept.
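
For context, assuming the standard HDBSCAN-style definition of mutual reachability (the maximum of the raw distance and the two core distances), here is a small sketch of why almost every entry above collapses to the core distance 214.964; the second core distance in the example call is a made-up value:

```cpp
#include <algorithm>
#include <cstdio>

// Mutual reachability distance (standard HDBSCAN-style definition, assumed
// here for illustration): the raw distance is replaced by the max of itself
// and the two points' core distances.
float mutual_reachability(float d_ab, float core_a, float core_b)
{
  return std::max({d_ab, core_a, core_b});
}

int main()
{
  // With core_dist(25) == 214.964f, any neighbor whose raw distance and own
  // core distance are smaller collapses to exactly 214.964, which is why the
  // MR row above is full of ties.
  std::printf("%g\n", mutual_reachability(187.677f, 214.964f, 190.0f));  // prints 214.964
  return 0;
}
```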

I think what we can do is sweep the whole row when we are calculating MR, and just sweep a small window otherwise.

@jinsolp (Contributor, Author) commented Aug 6, 2025

The window sweep (checking 4 neighbors) doesn't add much overhead. We only sweep the whole row if the template parameter `SweepAll` is set to true.
`merge_subgraph_kernel` times are below.
[Screenshot: `merge_subgraph_kernel` kernel timings]
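
As a rough sketch of the compile-time switch (a hypothetical host-side analogue; the real `SweepAll` is a template parameter on the CUDA merge kernel, whose signature isn't reproduced here):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical analogue of the SweepAll template parameter described above:
// SweepAll = false scans only a small trailing window (the common path),
// SweepAll = true scans the whole sorted row (the mutual-reachability path).
template <bool SweepAll>
bool seen_within(const std::vector<int64_t>& row, std::size_t i, std::size_t window)
{
  std::size_t start = SweepAll ? 0 : (i > window ? i - window : 0);
  for (std::size_t j = start; j < i; ++j) {
    if (row[j] == row[i]) { return true; }
  }
  return false;
}

int main()
{
  std::vector<int64_t> row = {25, 290, 320, 410, 490, 970, 1185, 320};
  bool windowed = seen_within<false>(row, 7, 4);  // false: the earlier 320 is outside the window
  bool full     = seen_within<true>(row, 7, 4);   // true: the full sweep finds it
  return (full && !windowed) ? 0 : 1;
}
```

At compile time the mutual-reachability build would instantiate the `SweepAll = true` path, while every other path keeps the cheap bounded scan.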

@cjnolet (Member) commented Aug 7, 2025

@jinsolp are you comfortable with the fix now, or are you still trying to find a solution to the problem you listed above?

@jinsolp (Contributor, Author) commented Aug 7, 2025

I think I am comfortable with the new fix!

@cjnolet (Member) commented Aug 7, 2025

/merge

rapids-bot merged commit afc24ee into rapidsai:branch-25.10 on Aug 8, 2025
53 checks passed
jinsolp deleted the an-merge-sweepall branch on August 8, 2025 00:58
lowener pushed a commit to lowener/cuvs that referenced this pull request Aug 11, 2025
…ss batches (rapidsai#1185)


Authors:
  - Jinsol Park (https://github.com/jinsolp)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1185

Labels: bug (Something isn't working), cpp, non-breaking (Introduces a non-breaking change), Waiting for author
