Conversation

@viclafargue
Contributor

@viclafargue viclafargue commented Jul 18, 2024

This PR adds a distributed (single-node, multi-GPU) implementation of ANN indexes. It allows building, extending, and searching an index on multiple GPUs.

Before building the index, the user has to choose between two modes:

Sharding mode: the index dataset is split across GPUs, and each GPU trains its own index on its share of the dataset. This is intended both to increase search throughput and to raise the maximum size of the index.
Index duplication mode: the index is built once on one GPU and then copied to the others. Alternatively, the index dataset is sent to each GPU to be built there. This is intended to increase search throughput.

SNMG indexes can be serialized and deserialized. A local (single-GPU) index can also be deserialized and deployed in index duplication mode.

[benchmark results chart: bench]

Migrated from rapidsai/raft#1993

@viclafargue viclafargue requested review from a team as code owners July 18, 2024 10:46
@copy-pr-bot

copy-pr-bot bot commented Jul 18, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@viclafargue viclafargue requested a review from a team as a code owner July 18, 2024 16:08
@viclafargue viclafargue requested a review from jameslamb July 18, 2024 16:08
Contributor

@tfeher tfeher left a comment


Thanks Victor for the PR, it looks really great! I have just a few smaller comments.

@viclafargue viclafargue requested a review from a team as a code owner July 26, 2024 13:17
Contributor

@tfeher tfeher left a comment


A few additional comments

Contributor

@tfeher tfeher left a comment


We are almost there, but the matrix extent type has to be fixed: we should distinguish the neighbor index type (IdxT, usually int64) from the mdspan extent type, which is always int64. In a previous review I marked such changes in ann_mp.cuh, but the public API in ann_mg.hpp also has a few places where this needs to be fixed.

@tfeher tfeher added feature request New feature or request non-breaking Introduces a non-breaking change labels Jul 29, 2024
@cjnolet
Member

cjnolet commented Oct 3, 2024

/ok to test

@dantegd
Member

dantegd commented Oct 3, 2024

The CI error here seems to have been a connection error:

[rapids-conda-retry] Exiting, no retryable mamba errors detected: 'ChecksumMismatchError:', 'ChunkedEncodingError:', 'CondaHTTPError:', 'CondaMultiError:', 'Connection broken:', 'ConnectionError:', 'DependencyNeedsBuildingError:', 'EOFError:', 'JSONDecodeError:', 'Multi-download failed', 'Timeout was reached', segfault exit code 139

@divyegala
Member

/ok to test

@tfeher
Contributor

tfeher commented Oct 3, 2024

/ok to test

@cjnolet
Member

cjnolet commented Oct 3, 2024

/ok to test

@cjnolet
Member

cjnolet commented Oct 3, 2024

/ok to test

@cjnolet
Member

cjnolet commented Oct 3, 2024

/merge

@rapids-bot rapids-bot bot merged commit 3383f28 into rapidsai:branch-24.10 Oct 3, 2024
const cuvs::neighbors::index_params* index_params,
raft::mdspan<const T, matrix_extent<int64_t>, row_major, Accessor> index_dataset)
{
interface.mutex_->lock();
Member


Hi @viclafargue @achirkin, I'm studying the code for the multi-GPU index feature—great work! I noticed a potential issue: the lock system does not prevent data races on device memory since the API does not explicitly synchronize the stream (at least, I found this to be the case with deserialize). If this behavior is intentional, please feel free to disregard my comments.

Contributor Author

@viclafargue viclafargue Mar 4, 2025


Hi @rhdong, thank you for noticing this! It is indeed probably safer to synchronize before unlocking, to ensure that all cudaMemcpy calls made during deserialization have completed. Will update this.

Contributor Author


Thanks again! Fixed in 4e0f512.


Labels

CMake cpp feature request New feature or request non-breaking Introduces a non-breaking change Python


10 participants