This project incorporates NVIDIA NCCL Tests as part of our benchmarking and validation framework.
NCCL (NVIDIA Collective Communications Library) is a high-performance library developed by NVIDIA that accelerates collective communication primitives such as all-reduce, all-gather, broadcast, reduce, and reduce-scatter on multi-GPU systems. It is optimized for NVIDIA hardware and is widely used by deep learning frameworks such as PyTorch and TensorFlow to scale training across multiple GPUs and nodes.
For details on how we deploy and manage these tests in Kubernetes, see our Kubernetes README.
For details on how we deploy and manage these tests in Docker, see our Docker README.
The all_reduce operation reduces (typically sums) equal-sized arrays of data across all GPUs and distributes the result back to every GPU. It is fundamental in distributed deep learning, where it is used to synchronize gradients across devices.
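For reference, here is a minimal sketch of the NCCL API pattern that the all_reduce test exercises: a single process driving one communicator per GPU, as in the single-process example from the NCCL documentation. The buffer size is arbitrary and error checking is omitted for brevity.

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>
#include <cstdio>

int main() {
  int nDev = 0;
  cudaGetDeviceCount(&nDev);

  const size_t count = 1 << 20;  // elements per GPU (illustrative size)
  std::vector<int> devs(nDev);
  std::vector<float*> sendbuff(nDev), recvbuff(nDev);
  std::vector<cudaStream_t> streams(nDev);
  std::vector<ncclComm_t> comms(nDev);

  // Allocate one send/recv buffer and one stream per GPU.
  for (int i = 0; i < nDev; ++i) {
    devs[i] = i;
    cudaSetDevice(i);
    cudaMalloc(&sendbuff[i], count * sizeof(float));
    cudaMalloc(&recvbuff[i], count * sizeof(float));
    cudaMemset(sendbuff[i], 1, count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // Create one communicator per device, all within this process.
  ncclCommInitAll(comms.data(), nDev, devs.data());

  // Launch the collective on every device inside a single group call.
  ncclGroupStart();
  for (int i = 0; i < nDev; ++i)
    ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  // Wait for completion, then clean up.
  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(sendbuff[i]);
    cudaFree(recvbuff[i]);
    ncclCommDestroy(comms[i]);
  }
  printf("all_reduce complete on %d device(s)\n", nDev);
  return 0;
}
```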
The NCCL all_reduce test measures:
- Bandwidth (GB/s): How much data can be reduced per second; the bus-bandwidth convention is sketched after this list.
- Latency (μs): Time taken for the operation to complete.
- Scalability: How performance changes with more GPUs or nodes.
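As a rough sketch of how the bandwidth figure is derived: nccl-tests reports both an algorithm bandwidth (bytes processed per unit time) and a bus bandwidth, which for all_reduce scales the algorithm bandwidth by 2(n-1)/n to reflect the data each of the n ranks must send and receive. The helper name and the numbers below are illustrative, not part of the test suite.

```cpp
#include <cstdio>

// Illustrative helper: convert a measured all_reduce time into a
// bus-bandwidth figure. Algorithm bandwidth is bytes per second;
// the 2*(n-1)/n factor accounts for the traffic each rank moves
// in a ring-style all_reduce.
double bus_bandwidth_gbs(double bytes, double seconds, int n_ranks) {
  double alg_bw_gbs = bytes / seconds / 1e9;
  return alg_bw_gbs * 2.0 * (n_ranks - 1) / n_ranks;
}

int main() {
  // Example: 1 GiB reduced across 8 GPUs in 25 ms (made-up numbers).
  double gib = 1024.0 * 1024.0 * 1024.0;
  printf("bus bandwidth: %.2f GB/s\n", bus_bandwidth_gbs(gib, 0.025, 8));
  return 0;
}
```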
NCCL tests are provided under the BSD license. All source code and accompanying documentation are copyright (c) 2016-2025, NVIDIA CORPORATION. All rights reserved.