Optimize hnsw::from_cagra<GPU> #826

achirkin · 2025-04-16T07:16:57Z

Reduce the CAGRA-for-HNSW build times by:

- avoiding unnecessary copies of the data between cagra::build and hnsw::from_cagra in the benchmarks
- avoiding unnecessary temporary data buffers in hnsw::from_cagra<GPU>
- reducing random reads via forcing 1-1 mapping between the internal indices and external labels during HNSW import
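
To illustrate the last point, here is a minimal sketch (not the code from this PR) of why a 1-1 mapping between internal indices and external labels reduces random reads. The `Dataset` struct and both function names are hypothetical and exist only for this example. With an arbitrary label table, every internal slot has to gather its row from wherever the label points; with the identity mapping, the import is one contiguous copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical flat, row-major dataset (n_rows x dim); not a cuVS or hnswlib type.
struct Dataset {
  std::size_t n_rows;
  std::size_t dim;
  std::vector<float> values;
};

// General import: internal slot `slot` stores the row whose external label is
// label_of_slot[slot]. Each iteration reads from an unpredictable source offset.
std::vector<float> import_with_label_table(const Dataset& src,
                                           const std::vector<std::size_t>& label_of_slot)
{
  assert(src.values.size() == src.n_rows * src.dim);
  assert(label_of_slot.size() == src.n_rows);
  std::vector<float> dst(src.values.size());
  for (std::size_t slot = 0; slot < src.n_rows; ++slot) {
    const std::size_t row = label_of_slot[slot];
    assert(row < src.n_rows);
    std::memcpy(dst.data() + slot * src.dim,
                src.values.data() + row * src.dim,
                src.dim * sizeof(float));
  }
  return dst;
}

// 1-1 import: slot i holds label i, so the whole dataset is copied sequentially
// and no label lookup is needed afterwards.
std::vector<float> import_one_to_one(const Dataset& src)
{
  assert(src.values.size() == src.n_rows * src.dim);
  return src.values;
}
```

When the label table happens to be the identity, both functions produce the same buffer; the difference is only in the memory access pattern and the extra indirection.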

As a side-effect, this PR also fixes the bug where hnsw::from_cagra segfaults in benchmarks if the dataset is passed in device memory (and incorrectly wrapped in a host_matrix_view).

In addition, this PR adds a bit more verbose NVTX reporting of different stages during the CAGRA/HNSW index build.

copy-pr-bot · 2025-04-16T07:17:00Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

achirkin · 2025-04-16T07:17:57Z

/ok to test

achirkin · 2025-04-16T13:48:04Z

Example timeline

Hardware: NVIDIA H200 NVL + Intel(R) Xeon(R) Gold 6444Y (16 cores / 32 threads)
Dataset: DEEP-100M

      "build_param": {
        "M": 16,
        "hierarchy": "gpu",
        "graph_degree": 32,
        "intermediate_graph_degree": 48,
        "graph_build_algo": "IVF_PQ",
        "ivf_pq_build_pq_dim": 32,
        "ivf_pq_build_pq_bits": 8,
        "ivf_pq_build_nlist": 10000,
        "ivf_pq_build_niter": 10,
        "ivf_pq_build_ratio": 10,
        "ivf_pq_build_codebook_kind": "subspace",
        "ivf_pq_search_nprobe": 20,
        "ivf_pq_search_internalDistanceDtype": "half",
        "ivf_pq_search_smemLutDtype": "half",
        "ivf_pq_search_refine_ratio": 1
      }

branch-25.06 / input data on the host

branch-25.06 / input data on the device

(segmentation fault)

PR-826 / input data on the host

PR-826 / input data on the device

Notes

- Open the images in a separate tab to see the numbers.
- There is a cudaMemcpy2DAsync at the end of the CAGRA build in branch-25.06. It attaches the dataset to the CAGRA index; that copy is not used afterwards if the dataset is on the host.
- Most of the savings in hnsw::from_cagra come from the initialization phase, which was contended on an atomic counter and slowed down by copying the data in a slightly randomized order (caused by the concurrent updates of that counter).
- The data-on-device version is ~5% faster than the data-on-host version.
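
The atomic-counter effect mentioned in the notes can be reproduced with a small standalone sketch. This is an illustration under assumptions, not the hnswlib or cuVS code: the function names are hypothetical, and the strided work split is chosen only to keep the example short. Each worker claims the next free internal slot from a shared counter, so the slot order depends on thread scheduling; the alternative assigns slot i to row i with no shared state.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Contended variant: every row insertion bumps one shared atomic counter.
// The returned table maps internal slot -> source row and is generally a
// scheduling-dependent permutation, so later copies follow a shuffled order.
std::vector<std::size_t> assign_slots_with_counter(std::size_t n_rows, unsigned n_threads)
{
  assert(n_threads > 0);
  std::vector<std::size_t> row_of_slot(n_rows);
  std::atomic<std::size_t> next_slot{0};
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < n_threads; ++t) {
    workers.emplace_back([&, t] {
      for (std::size_t row = t; row < n_rows; row += n_threads) {
        // fetch_add hands out unique slots, so each element is written once.
        const std::size_t slot = next_slot.fetch_add(1, std::memory_order_relaxed);
        row_of_slot[slot] = row;
      }
    });
  }
  for (auto& w : workers) { w.join(); }
  return row_of_slot;
}

// Uncontended variant: slot i is row i by construction, no shared counter.
std::vector<std::size_t> assign_slots_identity(std::size_t n_rows)
{
  std::vector<std::size_t> row_of_slot(n_rows);
  for (std::size_t i = 0; i < n_rows; ++i) { row_of_slot[i] = i; }
  return row_of_slot;
}
```

Both variants place every row exactly once. Only the second guarantees that rows land in their original order, which is what allows the data to be copied sequentially.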

cjnolet · 2025-04-24T21:09:42Z

@achirkin it looks like we have some Python failures from these changes (this makes me happy we now have Python tests for the benchmarks).

achirkin · 2025-04-25T05:11:16Z

@cjnolet lol, these tests didn't make me happy while I was trying to set up my environment to run them :) They just complain about a missing, useless column (GPU time) that I removed for the HNSW algorithm.

cjnolet · 2025-04-25T14:38:50Z

/merge

cjnolet · 2025-04-25T14:53:18Z

@achirkin does this issue address #762?

Reduce the CAGRA-for-HNSW build times by:
- avoiding unnecessary copies of the data between cagra::build and hnsw::from_cagra in the benchmarks
- avoiding unnecessary temporary data buffers in hnsw::from_cagra<GPU>
- reducing random reads via forcing 1-1 mapping between the internal indices and external labels during HNSW import

As a side-effect, this PR also fixes the bug where hnsw::from_cagra segfaults in benchmarks if the dataset is passed in device memory (and incorrectly wrapped in a host_matrix_view).

In addition, this PR adds a bit more verbose NVTX reporting of different stages during the CAGRA/HNSW index build.

Authors:
- Artem M. Chirkin (https://github.com/achirkin)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#826

achirkin added 3 commits April 16, 2025 00:08

Simplify the wrapper and avoid unnecessary data copies

da7e409

Track cagra::build in NVTX

b30da0e

Rework hnsw::from_cagra<GPU>

2b636f4

achirkin added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Apr 16, 2025

achirkin self-assigned this Apr 16, 2025

achirkin added this to Vector Search, ML, & Data Mining Release Board Apr 16, 2025

github-actions bot added the cpp label Apr 16, 2025

achirkin added 2 commits April 16, 2025 03:52

Fix style

1e9d88e

Use double-buffering when copying data from device

a698eef

achirkin marked this pull request as ready for review April 16, 2025 12:50

achirkin requested a review from a team as a code owner April 16, 2025 12:50

achirkin added 3 commits April 16, 2025 17:01

Merge branch 'branch-25.06' into enh-faster-hnsw-from-cagra

ab07a1d

Merge branch 'branch-25.06' into enh-faster-hnsw-from-cagra

af55095

Update test_cli.py

8402c16

achirkin requested a review from a team as a code owner April 22, 2025 12:14

github-actions bot added the Python label Apr 22, 2025

cjnolet moved this to In Progress in Vector Search, ML, & Data Mining Release Board Apr 22, 2025

cjnolet and others added 3 commits April 23, 2025 17:52

Merge branch 'branch-25.06' into enh-faster-hnsw-from-cagra

900c01c

Merge branch 'branch-25.06' into enh-faster-hnsw-from-cagra

5a4f526

Merge branch 'branch-25.06' into enh-faster-hnsw-from-cagra

298c66a

achirkin added 3 commits April 25, 2025 00:47

Adjust expected annbench output columns

5070a78

Merge branch 'branch-25.06' into enh-faster-hnsw-from-cagra

072e044

Merge branch 'branch-25.06' into enh-faster-hnsw-from-cagra

4de22f3

cjnolet approved these changes Apr 25, 2025

rapids-bot bot merged commit 0778a95 into rapidsai:branch-25.06 Apr 25, 2025
66 checks passed

github-project-automation bot moved this from In Progress to Done in Vector Search, ML, & Data Mining Release Board Apr 25, 2025
