
Conversation

@jinsolp (Contributor) commented Jul 26, 2025

This fixes an edge case whose root cause is described in issue #1056: `raft::linalg::gemm` can return slightly different distances for the same pair of vectors depending on the size of its inputs.

Right now, when merging two knn graphs from different batches, we sort by distance (i.e. keys) and, if the distances are equal, by index (i.e. values). After sorting, we compare each index with the one immediately next to it to check for duplicates, under the assumption that the same vector always ends up with the same distance.

However, due to the problem described in issue #1056, the distance for the same index can differ slightly depending on the size of the input matrix passed to gemm (or on where the vector sits in the overall matrix).

For example, say we are computing the nearest neighbors of vector 0. We could end up with
```
indices   = [1, 2, 3, 2, ...]
distances = [0.023, 0.02355981, 0.02355983, 0.02355987]
```

because the distance between vector 0 and vector 2 is computed as 0.02355981 in the first batch and 0.02355987 in the second batch.

This PR fixes the issue by checking the 4 neighbors to the left of each entry for duplicates, instead of only the one immediately next to it.
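
For illustration, here is a minimal host-side sketch of the windowed duplicate check (not the actual cuvs kernel; the function name `mark_duplicates` and the `kDupWindow` constant are made up for this example):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration of the fix: after the merged candidate list has
// been sorted by (distance, index), an entry is treated as a duplicate if the
// same index appears anywhere in the previous `window` positions, not only in
// the slot immediately to its left.
constexpr std::size_t kDupWindow = 4;  // window size chosen heuristically in this PR

std::vector<bool> mark_duplicates(const std::vector<int64_t>& indices,
                                  std::size_t window = kDupWindow)
{
  std::vector<bool> is_dup(indices.size(), false);
  for (std::size_t i = 1; i < indices.size(); ++i) {
    std::size_t start = (i > window) ? i - window : 0;
    for (std::size_t j = start; j < i; ++j) {
      if (indices[j] == indices[i]) {  // same neighbor id seen within the window
        is_dup[i] = true;
        break;
      }
    }
  }
  return is_dup;
}

int main()
{
  // The example from the description: index 2 appears twice with slightly
  // different distances, but the windowed check still catches the duplicate.
  std::vector<int64_t> indices = {1, 2, 3, 2};
  std::vector<bool> dup = mark_duplicates(indices);
  // dup == {false, false, false, true}
  return 0;
}
```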

@cjnolet (Member) commented Jul 30, 2025

@jinsolp thanks for working to fix this issue. Have you done any profiling or benchmarking to gauge the impact of the fix? I just want to make sure this doesn't have a large impact on the resulting perf (and to quantify it in some way).

@jinsolp (Contributor, Author) commented Jul 30, 2025

On average about +3 seconds to run all_neighbors e2e using 10 clusters on a 10M x 128 dataset with k=32. I don't think it's too concerning.

@cjnolet (Member) commented Jul 31, 2025

@jinsolp can you share more info about the benchmark? What percentage of the overall time was 3 seconds? Was it <1%? Was it 70% of the overall time?

@jinsolp (Contributor, Author) commented Jul 31, 2025

Oh yes, sure. The entire e2e time was about 67 seconds vs. 70 seconds.

@cjnolet (Member) commented Jul 31, 2025

Thanks @jinsolp, so that amounts to about a 4.4% hit in perf, unfortunately. Still, I'd rather have a 5% hit for better correctness than to have a faster impl that is incorrect.

Do you have a sense for how much this drift happens in practice? Does it pop up nearly all the time? Is it an edge case?

@jinsolp (Contributor, Author) commented Jul 31, 2025

This doesn't happen that often (especially if the distances are large in scale, since they're unlikely to end up being the same). It's also not a major issue for NN Descent, only for brute-force search that directly uses `raft::linalg::gemm`. Still, I'm a bit concerned because I've seen this issue a few times while adding mutual reachability for brute force. After learning that `raft::linalg::gemm` is causing it, I ran (non-MR) brute force with a larger k, which ends up using `raft::linalg::gemm` internally, and it has a similar issue.
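
To give a sense of the mechanism behind this drift (a generic floating-point illustration, not a reproduction of the gemm kernel from issue #1056): changing the accumulation order, which different batch shapes and tilings can do, changes the result in the last bits.

```cpp
#include <cstdio>

int main()
{
  // Summing the same three values in two different orders gives two different
  // float results, because rounding happens after every partial sum.
  float a = 1e8f, b = 1.0f, c = -1e8f;
  float sum1 = (a + b) + c;  // b is absorbed into a, result is 0.0f
  float sum2 = (a + c) + b;  // a and c cancel first, result is 1.0f
  std::printf("%g vs %g\n", sum1, sum2);
  return 0;
}
```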

jinsolp changed the base branch from branch-25.08 to branch-25.10 on July 31, 2025 18:35
@cjnolet (Member) commented Jul 31, 2025

> This doesn't happen that often (especially if the distances are large in scale, since they're unlikely to end up being the same). It's also not a major issue for NN Descent, only for brute-force search that directly uses `raft::linalg::gemm`. Still, I'm a bit concerned because I've seen this issue a few times while adding mutual reachability for brute force. After learning that `raft::linalg::gemm` is causing it, I ran (non-MR) brute force with a larger k, which ends up using `raft::linalg::gemm` internally, and it has a similar issue.

Yeah, that's where I'm going with things. I definitely don't think we should ignore this issue by any means, but I wonder if we could lessen the perf effects by characterizing the conditions under which we see it happening and coming up with some rules so that this only needs to be done under those conditions. For example, is there a specific dimensionality where it starts happening? Or a specific number of rows? Does it always happen below/above those sizes? Is it specific to small matrices only? Is it that it doesn't happen at larger scales, or that its effects are less noticeable?

Trust me, I'm not trying to halt progress here, but imagine if we made a bunch of these fixes that each had a 4% or more impact on perf... eventually the perf gap would become unacceptable. I'm just trying to make sure we can "soften the blow," so to speak.

@jinsolp (Contributor, Author) commented Jul 31, 2025

Got it, I'll try to narrow this down a bit to see what we can do. Let me also try running it more times before averaging (right now I only ran each configuration 5 times), because I didn't expect this to add +3 seconds.

@jinsolp (Contributor, Author) commented Aug 5, 2025

@cjnolet since this is a rare edge case and doesn't happen across many different vectors, it should be enough to sweep a small neighboring window instead of sweeping the entire row. Window size 4 was chosen heuristically.

@jinsolp (Contributor, Author) commented Aug 6, 2025

Hmm, I realized that the duplicate problem still happens for mutual reachability unless we sweep the whole row. For example:

After the ordinary KNN pass:
```
neighbors of 25: [25,1840,2435,1185,3745,290,955,410,970,4410,3220,3435,1420,3640,490,320]
distances for 25: [-0,187.677,189.562,194.995,197.123,199.602,205.618,205.697,207.173,207.21,211.438,211.749,212.001,212.875,214.381,214.964]
```
The core distance for this vector is 214.964.

After the MR pass:
```
neighbors of 25: [25,290,320,410,490,970,1185,1420,1840,2435,3220,3435,3640,3745,4410,320]
distances for 25: [214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.964,214.965]
```

Neighbor 320 is calculated to have distance 214.965 from vector 25 in the second run, so the duplicate 320 cannot be detected unless the entire row is swept.
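
For context, assuming the standard HDBSCAN-style definition of mutual reachability (the maximum of the raw distance and the two core distances), here is a small sketch of why almost every entry above collapses to the core distance 214.964; the second core distance in the example call is a made-up value:

```cpp
#include <algorithm>
#include <cstdio>

// Mutual reachability distance (standard HDBSCAN-style definition, assumed
// here for illustration): the raw distance is replaced by the max of itself
// and the two points' core distances.
float mutual_reachability(float d_ab, float core_a, float core_b)
{
  return std::max({d_ab, core_a, core_b});
}

int main()
{
  // With core_dist(25) == 214.964f, any neighbor whose raw distance and own
  // core distance are smaller collapses to exactly 214.964, which is why the
  // MR row above is full of ties.
  std::printf("%g\n", mutual_reachability(187.677f, 214.964f, 190.0f));  // prints 214.964
  return 0;
}
```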

I think what we can do is sweep the whole row when we are calculating MR, and just sweep a small window otherwise.

@jinsolp (Contributor, Author) commented Aug 6, 2025

The window sweep (checking 4 neighbors) doesn't add much overhead. We only sweep the whole row if the template parameter `SweepAll` is set to true.
`merge_subgraph_kernel` times are below.
[Screenshot: `merge_subgraph_kernel` kernel timings]
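
As a rough sketch of the compile-time switch (a hypothetical host-side analogue; the real `SweepAll` is a template parameter on the CUDA merge kernel, whose signature isn't reproduced here):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical analogue of the SweepAll template parameter described above:
// SweepAll = false scans only a small trailing window (the common path),
// SweepAll = true scans the whole sorted row (the mutual-reachability path).
template <bool SweepAll>
bool seen_within(const std::vector<int64_t>& row, std::size_t i, std::size_t window)
{
  std::size_t start = SweepAll ? 0 : (i > window ? i - window : 0);
  for (std::size_t j = start; j < i; ++j) {
    if (row[j] == row[i]) { return true; }
  }
  return false;
}

int main()
{
  std::vector<int64_t> row = {25, 290, 320, 410, 490, 970, 1185, 320};
  bool windowed = seen_within<false>(row, 7, 4);  // false: the earlier 320 is outside the window
  bool full     = seen_within<true>(row, 7, 4);   // true: the full sweep finds it
  return (full && !windowed) ? 0 : 1;
}
```

At compile time the mutual-reachability build would instantiate the `SweepAll = true` path, while every other path keeps the cheap bounded scan.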

@cjnolet (Member) commented Aug 7, 2025

@jinsolp are you comfortable with the fix now, or are you still trying to find a solution to the problem you listed above?

@jinsolp (Contributor, Author) commented Aug 7, 2025

I think I am comfortable with the new fix!

@cjnolet (Member) commented Aug 7, 2025

/merge

rapids-bot merged commit afc24ee into rapidsai:branch-25.10 on Aug 8, 2025
53 checks passed
jinsolp deleted the an-merge-sweepall branch on August 8, 2025 00:58
lowener pushed a commit to lowener/cuvs that referenced this pull request Aug 11, 2025
…ss batches (rapidsai#1185)


Authors:
  - Jinsol Park (https://github.com/jinsolp)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1185

Labels: bug (Something isn't working), cpp, non-breaking (Introduces a non-breaking change), Waiting for author
