
Conversation

Contributor

@viclafargue viclafargue commented Apr 28, 2025

Answers #6539
Requires rapidsai/raft#2739

This PR:

  • Trims the graph before embedding initialization
  • Stores the graph on host when using the UMAP estimator

@viclafargue viclafargue requested a review from a team as a code owner April 28, 2025 13:28
@csadorf csadorf linked an issue Apr 28, 2025 that may be closed by this pull request
@viclafargue viclafargue requested review from jcrist and jinsolp April 28, 2025 16:52
Contributor

@jinsolp jinsolp left a comment

Thank you for this PR, Victor! I have a question before I move on with further reviews. : )

Comment on lines 242 to 247

value_t threshold = get_threshold(handle, fss_graph, inputs.n, params->n_epochs);
perform_thresholding(handle, fss_graph, threshold);

raft::sparse::op::coo_remove_zeros<value_t>(&fss_graph, graph, stream);
}
Contributor

@jinsolp jinsolp Apr 28, 2025

graph is the final output exposed at the Python level. It looks like umap-learn runs the fuzzy simplicial set operation (link), corresponding to our FuzzySimplSetImpl::symmetrize, and then eliminates zeros, which becomes the final output.

I think if we do perform_thresholding here, we end up storing the graph after thresholding as the final graph, which seems different from what umap-learn has?
I understand that we need to do the thresholding before the embedding init, but I think by doing this we end up with a different output graph (which should correspond to umap-learn's self.graph_). Please correct me if I'm wrong!
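
For reference, a minimal sketch of the umap-learn behavior described above, assuming the standard scipy.sparse API (illustrative only, not cuml code):

import scipy.sparse

def finalize_graph(fuzzy_set):
    # umap-learn's graph_ is the symmetrized fuzzy simplicial set with explicit
    # zeros removed; no epoch-based thresholding is applied at this point.
    graph = fuzzy_set.tocsr()
    graph.eliminate_zeros()
    return graph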

Contributor Author

@viclafargue viclafargue Apr 29, 2025

My thinking was that once the graph reaches the Python layer, we can no longer make assumptions about how the user intends to use it, so we should avoid modifying or potentially corrupting it. Since our goal is to trim the graph, that would require creating a copy, which could significantly increase VRAM usage. I acknowledge this approach doesn't exactly align with how umap-learn handles it, but I assumed that most users would primarily be interested in a fuzzy simplicial set graph that retains only the most important connections. I had even considered performing the trimming before the symmetrization step, since the thrust operations there appear to cause a memory spike, if I've understood correctly.

However, I just realized that because the trimming threshold depends on the number of training epochs, trimming before storing the graph might work well with the UMAP estimator, but not as well with fuzzy_simplicial_set, which lacks access to the number of epochs. In that case, creating a copy might indeed be the only viable option, unless the graph is stored as a SciPy array on the Cython side?
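
For context, here is a hedged sketch of the epoch-dependent trimming being discussed, modeled on umap-learn's simplicial_set_embedding (the function name is illustrative):

import scipy.sparse

def trim_for_epochs(graph, n_epochs):
    # Entries below max / n_epochs would never be sampled during optimization,
    # so they can be zeroed out and dropped before the embedding initialization.
    graph = graph.tocoo(copy=True)
    threshold = graph.data.max() / float(n_epochs)
    graph.data[graph.data < threshold] = 0.0
    graph.eliminate_zeros()
    return graph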

Member

In that case, creating a copy might indeed be the only viable option—unless the graph is stored as a SciPy array on the Cython side?

When Corey and I discussed this a month or two ago, IIRC we came to an agreement that the graph_ attribute could (and probably should) be returned to the user on host. It's mostly there for introspection or for umap-learn compatibility. We do want to make sure the graph is treated roughly the same way umap-learn treats it, so that zero-code-change methods that reference it (e.g. merging of estimators) work roughly the same. Since we never reference it ourselves, having it on host would be fine and would decrease the device memory pressure at this point.
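
As a rough illustration of returning graph_ on host, assuming the device graph were available as a cupyx.scipy.sparse COO matrix (illustrative only, not the code in this PR):

def graph_to_host(device_graph):
    # .get() copies the sparse matrix to host as a scipy.sparse matrix; the
    # device-side copy can then be released to reduce device memory pressure.
    return device_graph.get()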

@viclafargue viclafargue changed the title Fix graph thresholding Fix UMAP graph thresholding May 1, 2025
@viclafargue viclafargue requested a review from a team as a code owner May 5, 2025 12:02
@github-actions github-actions bot added the Cython / Python label May 5, 2025
@csadorf
Contributor

csadorf commented May 14, 2025

This PR depends on rapidsai/raft#2650. Is it realistic that it will be merged before burndown?

@divyegala
Member

@csadorf no, RAFT burndown is tomorrow. Let's move this to 25.08.

@viclafargue viclafargue requested review from a team as code owners July 1, 2025 12:27
@github-actions github-actions bot added the conda and ci labels Jul 1, 2025
@viclafargue viclafargue changed the base branch from branch-25.06 to branch-25.08 July 9, 2025 15:42
@github-actions github-actions bot removed the conda, CMake, and ci labels Jul 9, 2025

copy-pr-bot bot commented Aug 27, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@viclafargue viclafargue changed the base branch from branch-25.08 to branch-25.10 August 27, 2025 15:17
@viclafargue viclafargue added the bug and breaking labels Sep 5, 2025
Contributor

@jinsolp jinsolp left a comment

Thanks Victor! 👍 I left some suggestions and questions!

Comment on lines 842 to 843
if num_clusters == 5:
    pytest.skip("Skipping test for 5 clusters")
Contributor

Sorry if I missed anything, but why would we want to skip the test in this case?

Contributor Author

It looks like there is an issue with do_snmg = True and num_clusters == 5 for some reason. It may be because my workstation has 2 GPUs, while CI only has one. Basically there are a lot of -1 values in the indices. The issue happens on the main branch too, independently of this PR.

Contributor Author

More details here.

bool apply_thresholding = should_perform_thresholding(handle, in, threshold);

if (apply_thresholding) {
  trim_graph(handle, in, out, threshold);
Contributor

I think we should also be calling perform_thresholding here before we trim the graph?
Also, is my understanding correct that trim_graph is doing what raft::sparse::op::coo_remove_zeros is doing?

Contributor Author

The trimming was already performing a thresholding with a thrust::copy_if.

After thinking about it again, I decided to revert to a simpler systematic element-wise thresholding (setting values to 0) followed by a coo_remove_zeros operation. The reasons are the following: 1) the performance of thrust::copy_if is probably not optimal; 2) since the graph is now stored on host, this no longer alters the graph in the graph_ attribute; 3) the graph will in most cases require a thresholding operation anyway, so making it conditional through extra checks is probably slower overall.
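
A rough Python analogue of the two strategies being compared, purely for illustration (the actual implementation is on the C++ side):

import numpy as np

def trim_copy_if(values, threshold):
    # copy_if style: compact in a single pass by keeping only the retained entries
    return values[values >= threshold]

def trim_zero_then_remove(values, threshold):
    # chosen approach: zero out small entries, then drop the explicit zeros
    out = values.copy()
    out[out < threshold] = 0.0
    return out[out != 0.0]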

@github-actions github-actions bot removed the conda, CMake, and ci labels Sep 8, 2025
Member

@jcrist jcrist left a comment

Gave a brief read-through and logically this all makes sense to me. I left a few small notes on things I'm less than happy about (but don't view as blockers), but overall this is an improvement.

copy_device_graph_to_host(handle, graph, host_graph);

trim_graph(handle, graph, trimmed_graph, inputs.n, params->n_epochs);
}
Member

I assume the scope here is to drop graph once the trimming is done to reduce device memory pressure? If so, can you add a small comment noting that?

Also, is there any way the trimming could be done in place? Last time I checked, this was one of the memory high points, so reducing duplications of the graph here would be beneficial if easy.

Contributor Author

I assume the scope here is to drop graph once the trimming is done to reduce device memory pressure?

Exactly.

Also, is there any way the trimming could be done in place?

I might be wrong, but I don't think we can do this efficiently in place with CUDA. Performing the operation on host might be relatively fast and would avoid doubling VRAM use here, but it would probably be much slower than doing it on device with a copy.

cols = create_nonowning_numpy_array(self.cols(), np.int32)

graph = scipy.sparse.coo_matrix((vals.copy(), (rows.copy(), cols.copy())))
return graph
Member

This creates a full copy of the COO with memory allocated by numpy instead of within libcuml. This doubles the host memory (original host COO + new host COO), but the original should be freed shortly afterwards, once the HostGraphHolder is dropped.

I don't love this, but I'm fine with this for now. Note that if we wanted, we could instead expose the COO from libcuml directly through the buffer protocol, avoiding the brief memory doubling.

Can you add a small comment/docstring on this method just noting that it currently returns a copy (to help future readers)?
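
For reference, a minimal sketch of the buffer-protocol alternative mentioned above, assuming the libcuml-side holder exposed its host arrays through Python's buffer protocol (the holder attribute names are hypothetical):

import numpy as np
import scipy.sparse

def view_host_coo(holder, shape):
    # np.frombuffer wraps the existing memory without copying; the holder must
    # stay alive for as long as the resulting matrix is used.
    vals = np.frombuffer(holder.vals_buffer, dtype=np.float32)
    rows = np.frombuffer(holder.rows_buffer, dtype=np.int32)
    cols = np.frombuffer(holder.cols_buffer, dtype=np.int32)
    return scipy.sparse.coo_matrix((vals, (rows, cols)), shape=shape)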

Contributor Author

the original should be freed momentarily after once the HostGraphHolder is dropped.

Exactly, the function that instantiates the HostGraphHolder should release it immediately (modulo the garbage collector).

we could instead expose the COO from libcuml directly through the buffer protocol, avoiding the brief memory doubling

The COO matrix should have a predictable nnz before thresholding, so it would indeed be theoretically possible to pre-allocate the matrix on host and expose it from libcuml. However, we would like to use raft::host_coo_matrix for the libcuml API, and this utility cannot be used in a non-owning way for now.

Member

We don't have to preallocate it; we can expose the memory directly. I don't think this is worth doing though, unless we find that host memory becomes a bottleneck. What you have now is fine, just please add the comment/docstring I requested.

@@ -839,6 +839,8 @@ def test_umap_distance_metrics_fit_transform_trust_on_sparse_input(
def test_umap_trustworthiness_on_batch_nnd(
    num_clusters, fit_then_transform, metric, do_snmg
):
    if do_snmg and num_clusters == 5:
        pytest.xfail("xfailing snmg test with 5 clusters for now")
Member

Why is this being xfailed? Can this be removed? If it can't, can you update the reason to be more descriptive of why it's xfailed, so future readers know when it can be removed (and if it's a TODO, open a follow-up issue)?

Also: xfail markers like this don't give the test a chance to run and xpass, so there's no indicator when running the tests of when this can be un-xfailed.

Ideally, if a test is flaky or failing, we instead handle this by refactoring the parameters to mark the bad case as an xfail, as sketched below. That way the test still runs, but failures will be ignored.
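
A small sketch of that pattern: mark only the problematic parameter combination with pytest.param so the test still runs and can xpass (parameter values and the reason string here are illustrative):

import pytest

@pytest.mark.parametrize(
    "num_clusters, do_snmg",
    [
        (3, False),
        (3, True),
        (5, False),
        pytest.param(
            5,
            True,
            marks=pytest.mark.xfail(reason="multi-GPU KNN returns -1 indices"),
        ),
    ],
)
def test_umap_trustworthiness_on_batch_nnd(num_clusters, do_snmg):
    ...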

Contributor Author

After some investigation, it looks like there is an issue in this test when running it on multiple GPUs with the parameters do_snmg = True and num_clusters == 5: the KNN step returns -1 for the indices. The issue is present in the main branch too. Setting CUDA_VISIBLE_DEVICES=0 removes the issue, which explains why this does not seem to pop up in CI.

Since it is not related to this PR, I removed the xfail. I will open an issue instead.

Labels
breaking (Breaking change), bug (Something isn't working), CUDA/C++, Cython / Python (Cython or Python issue)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Differences between umap.UMAP and cuml.UMAP in embeddings logic
5 participants