Skip to content

Conversation

@achirkin
Copy link
Contributor

@achirkin achirkin commented Apr 16, 2025

Reduce the CAGRA-for-HNSW build times by:

  • avoiding unnecessary copies of the data between cagra::build and hnsw::from_cagra in the benchmarks
  • avoiding unnecessary temporary data buffers in hnsw::from_cagra
  • reducing random reads via forcing 1-1 mapping between the internal indices and external labels during HNSW import

As a side-effect, this PR also fixes the bug where hnsw::from_cagra segfaults in benchmarks if the dataset is passed in device memory (and incorrectly wrapped in a host_matrix_view).

In addition, this PR adds a bit more verbose NVTX reporting of different stages during the CAGRA/HNSW index build.

@achirkin achirkin added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Apr 16, 2025
@achirkin achirkin self-assigned this Apr 16, 2025
@copy-pr-bot
Copy link

copy-pr-bot bot commented Apr 16, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cpp label Apr 16, 2025
@achirkin
Copy link
Contributor Author

/ok to test

@achirkin achirkin marked this pull request as ready for review April 16, 2025 12:50
@achirkin achirkin requested a review from a team as a code owner April 16, 2025 12:50
@achirkin
Copy link
Contributor Author

achirkin commented Apr 16, 2025

Example timeline

Hardware: NVIDIA H200 NVL + Intel(R) Xeon(R) Gold 6444Y (16 cores / 32 threads)
Dataset: DEEP-100M

      "build_param": {
        "M": 16,
        "hierarchy": "gpu",
        "graph_degree": 32,
        "intermediate_graph_degree": 48,
        "graph_build_algo": "IVF_PQ",
        "ivf_pq_build_pq_dim": 32,
        "ivf_pq_build_pq_bits": 8,
        "ivf_pq_build_nlist": 10000,
        "ivf_pq_build_niter": 10,
        "ivf_pq_build_ratio": 10,
        "ivf_pq_build_codebook_kind": "subspace",
        "ivf_pq_search_nprobe": 20,
        "ivf_pq_search_internalDistanceDtype": "half",
        "ivf_pq_search_smemLutDtype": "half",
        "ivf_pq_search_refine_ratio": 1
      }

branch-25.06 / input data on the host

Screenshot from 2025-04-16 15-31-20

branch-25.06 / input data on the device

(segmentation fault)

PR-826 / input data on the host

Screenshot from 2025-04-16 15-32-32

PR-826 / input data on the device

Screenshot from 2025-04-16 15-33-03

Notes

  1. Open the images in a separate tab to see the numbers
  2. There's a cudaMemCpy2dAsync in the branch-25.06 at the end of the CAGRA build. This is attaching the dataset to the CAGRA index; it's not used subsequently if the dataset is on the host
  3. The most savings in hnsw::from_cagra come from the initialization phase, which is congested on atomic counter and slowed down due to copying data in a slightly randomized order (due to the concurrent updates of the atomic counter).
  4. The data-on-device version is faster by ~5% than the data-on-host version

@achirkin achirkin requested a review from a team as a code owner April 22, 2025 12:14
@cjnolet
Copy link
Member

cjnolet commented Apr 24, 2025

@achirkin it looks like we have some python failures from these changes (this makes me happy we now have python tests for the benchmarks).

@achirkin
Copy link
Contributor Author

@cjnolet lol these tests haven't made me happy while I was trying to setup my environment to run them :) and they just complain about a missing useless column I removed for HNSW algorithm (GPU time).

@cjnolet
Copy link
Member

cjnolet commented Apr 25, 2025

/merge

@cjnolet
Copy link
Member

cjnolet commented Apr 25, 2025

@achirkin does this issue address #762?

@rapids-bot rapids-bot bot merged commit 0778a95 into rapidsai:branch-25.06 Apr 25, 2025
66 checks passed
jamxia155 pushed a commit to jamxia155/cuvs that referenced this pull request Apr 25, 2025
Reduce the CAGRA-for-HNSW build times by:
 - avoiding unnecessary copies of the data between cagra::build and hnsw::from_cagra in the benchmarks
 - avoiding unnecessary temporary data buffers in hnsw::from_cagra<GPU>
 - reducing random reads via forcing 1-1 mapping between the internal indices and external labels during HNSW import

As a side-effect, this PR also fixes the bug where hnsw::from_cagra segfaults in benchmarks if the dataset is passed in device memory (and incorrectly wrapped in a host_matrix_view).

In addition, this PR adds a bit more verbose NVTX reporting of different stages during the CAGRA/HNSW index build.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#826
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change Python

Development

Successfully merging this pull request may close these issues.

2 participants