Make duplicate removal in all neighbors robust to distance drift across batches #1185
Conversation
@jinsolp thanks for working to fix this issue. Have you done any profiling or benchmarking to gauge the impact of the fix? Just want to make sure this doesn't have a huge impact on the resulting perf (and quantify it in some way).
On average, about +3 seconds to run all_neighbors end-to-end using 10 clusters on a 10M x 128 dataset with k=32. I don't think it's too concerning.
@jinsolp can you share more info about the benchmark? What percentage of the overall time was 3 seconds? Was it <1%? Was it 70% of the overall time?
Oh yes, sure. The entire e2e time was about 67 seconds vs. 70 seconds.
Thanks @jinsolp, so that amounts to about a 4.4% hit in perf, unfortunately. Still, I'd rather take a 5% hit for better correctness than have a faster impl that is incorrect. Do you have a sense for how often this drift happens in practice? Does it pop up nearly all the time? Is it an edge case?
This doesn't happen that often (especially if the distances are large in scale, so they are unlikely to end up being the same). It's also not a major issue for NN Descent, only for brute-force search that directly uses `raft::linalg::gemm`.
Yeah, that's where I'm going with things. I definitely don't think we should ignore this issue by any means, but I wonder if we could lessen the perf effects by characterizing the conditions under which we see it happening and try to come up with some rules we can apply so that this only needs to be done under those conditions. For example, is there a specific dimensionality where it starts happening? Or a specific number of rows? Does it always happen below/above those sizes? Is it specific to small matrices only? Is it that it doesn't happen at larger scales, or that its effects are less noticeable? Trust me, I'm not trying to halt progress here, but imagine if we made a bunch of these fixes that each had a 4% or more impact on perf... eventually the perf gap would become unacceptable. Just trying to make sure we can "soften the blow", so to speak.
Got it, I'll try to narrow this down a bit to see what we can do. Let me also try running this more times before averaging (right now I only ran it 5 times each) because I didn't expect this to add +3 seconds.
@cjnolet since this is a rare edge case and doesn't happen across many different vectors, it should be enough to sweep a small neighboring window instead of sweeping the entire row. A window size of 4 was chosen heuristically.
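To make the idea concrete, here is a minimal Python sketch of merging two per-row batch results with the small-window duplicate sweep. This is an illustration only, not the actual CUDA implementation; the function name `merge_row` and its signature are hypothetical.

```python
def merge_row(idx_a, dist_a, idx_b, dist_b, k, window=4):
    """Merge two candidate lists for one query row.

    Hypothetical sketch: concatenate candidates from both batches, sort by
    (distance, index) to mirror the key/value sort in the merge step, then
    sweep a small window of previously kept neighbors for duplicates, since
    drifted distances keep duplicates nearby but not necessarily adjacent.
    """
    pairs = sorted(zip(dist_a + dist_b, idx_a + idx_b))
    merged_idx, merged_dist = [], []
    for d, i in pairs:
        # Only look back `window` kept entries instead of the whole row.
        if i in merged_idx[-window:]:
            continue
        merged_idx.append(i)
        merged_dist.append(d)
        if len(merged_idx) == k:
            break
    return merged_idx, merged_dist

# Vector 2 appears in both batches with slightly drifted distances,
# yet is still removed by the windowed sweep.
ia, da = [1, 2], [0.023, 0.02355981]
ib, db = [3, 2], [0.02355983, 0.02355987]
print(merge_row(ia, da, ib, db, k=3))
# ([1, 2, 3], [0.023, 0.02355981, 0.02355983])
```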
Hmmm, I realized that the duplicate problem happens for mutual reachability unless we sweep the whole row. Neighbor 320 is calculated to have distance 214.965 with vector 25 in the second run, so the duplicate 320 cannot be detected unless the entire row is swept. I think what we can do is sweep the whole row when we are calculating mutual reachability (MR), and just sweep a small window otherwise.
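The conditional sweep described above could look roughly like this (a hypothetical Python sketch, not the project's code; `is_duplicate` and its parameters are illustrative):

```python
def is_duplicate(indices, i, mutual_reachability, window=4):
    """Check whether indices[i] already appeared to its left.

    Sketch of the proposed rule: for mutual reachability, drift can separate
    duplicates arbitrarily far in the sorted row, so sweep everything to the
    left; for other metrics a small window is enough for the rare
    near-adjacent drift case.
    """
    span = i if mutual_reachability else min(i, window)
    return indices[i] in indices[i - span:i]

row = [1, 2, 3, 4, 5, 6, 1]  # duplicate of index 1 lands far away
print(is_duplicate(row, 6, mutual_reachability=False))  # False: outside window
print(is_duplicate(row, 6, mutual_reachability=True))   # True: whole-row sweep
```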
@jinsolp are you comfortable with the fix now, or are you still trying to find a solution to the problem you listed above?
I think I am comfortable with the new fix!
/merge |
Make duplicate removal in all neighbors robust to distance drift across batches (rapidsai#1185)

This fixes an edge case whose root cause is described in issue rapidsai#1056: `raft::linalg::gemm` can return slightly different distance results depending on the input sizes. Currently, when merging two knn graphs from different batches, we sort by distances (i.e. keys) and, if the distances are equal, by indices (i.e. values). After doing so, we compare indices right next to each other to check for duplicates, under the assumption that the same vector ends up with the same distance. However, due to the problem stated in issue 1056, the distance for the same index can be slightly different depending on the size of the input matrix passed to gemm (or where the vector sits in the entire matrix). For example, say we are calculating nearest neighbors for vector 0. We could end up with

```
indices   = [1, 2, 3, 2, ...]
distances = [0.023, 0.02355981, 0.02355983, 0.02355987]
```

because the distance between vector 0 and vector 2 is calculated as 0.02355981 in the first batch and 0.02355987 in the second batch. This PR fixes the issue by checking the 4 neighbors to the left of each entry for duplicates, instead of only the one directly next to it.

Authors:
- Jinsol Park (https://github.com/jinsolp)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1185
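The difference between the old adjacent-only check and the windowed check can be sketched as follows. This is a minimal Python illustration of the technique, not the actual CUDA implementation; the function names are hypothetical.

```python
def dedup_adjacent(indices):
    """Old behavior: drop an index only if it equals the entry directly
    to its left. Assumes the list is already sorted by (distance, index),
    so identical distances would make duplicates adjacent."""
    return [idx for i, idx in enumerate(indices)
            if i == 0 or indices[i - 1] != idx]

def dedup_window(indices, window=4):
    """Fixed behavior: drop an index if it appears anywhere in the
    `window` entries to its left, tolerating small distance drift
    between batches."""
    keep = []
    for i, idx in enumerate(indices):
        lo = max(0, i - window)
        if idx not in indices[lo:i]:
            keep.append(idx)
    return keep

# The drifted example from the PR description: vector 2 appears twice with
# slightly different distances, so the copies are close but not adjacent
# after sorting by (distance, index).
indices = [1, 2, 3, 2]
print(dedup_adjacent(indices))  # [1, 2, 3, 2]  duplicate missed
print(dedup_window(indices))    # [1, 2, 3]     duplicate caught
```

The window trades a small amount of extra comparison work per entry for robustness to the gemm-dependent drift, which is why the discussion above focuses on bounding its perf cost.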
