Observation
After the earlier UCX version and ETCD coordination issues were resolved, nixlbench shows much lower bandwidth than raw UCX performance tests on the same hardware.
Performance Comparison
| Test | Bandwidth | Transport | Hardware |
|---|---|---|---|
| Raw UCX (ucp_put_bw) | 284.98 GB/s | UCX over EFA | 2x H100 80GB |
| nixlbench | 0.324 GB/s | UCX via NIXL | Same hardware |
| Ratio | 1:878 | - | - |
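For reference, the raw UCX baseline comes from ucx_perftest's ucp_put_bw test; the exact flags used for that run are an assumption, but a typical CUDA-memory measurement between two nodes looks roughly like this:

```bash
# Server side (node A) -- waits for the perftest client.
# -t selects the test, -m cuda places buffers in GPU memory,
# -s is the message size in bytes, -n the iteration count (values illustrative).
ucx_perftest -t ucp_put_bw -m cuda -s 1048576 -n 1000

# Client side (node B) -- connects to the server by hostname/IP.
ucx_perftest nodeA -t ucp_put_bw -m cuda -s 1048576 -n 1000
```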
Test Configuration

```bash
nixlbench \
  --backend UCX \
  --initiator_seg_type VRAM \
  --target_seg_type VRAM \
  --enable_pt 0 \
  --num_threads 1 \
  --max_batch_size 1 \
  --num_iter 64
```

Environment
- Platform: AWS SageMaker HyperPod EKS
- Hardware: 2x H100 80GB HBM3
- Network: EFA with GPUDirect RDMA enabled
- Container: Custom build with UCX 1.19.0, NIXL 0.7.1
- ETCD: Successfully coordinating, barriers working
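A quick way to sanity-check this environment (a sketch; device names and output vary per instance) is to confirm that UCX actually sees the EFA devices and that the GPU/NIC topology supports GPUDirect RDMA; if UCX cannot open an RDMA-capable device it can silently fall back to TCP, which would be consistent with sub-GB/s numbers:

```bash
# List the transports/devices UCX can open; look for EFA/RDMA entries.
ucx_info -d | grep -iE 'efa|srd|rc|tcp' | head

# Confirm the EFA provider is visible through libfabric.
fi_info -p efa | head

# Check GPU <-> NIC topology (PIX/PXB links favor GPUDirect RDMA).
nvidia-smi topo -m
```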
Potential Causes
- Progress threads disabled (--enable_pt 0)
- Single thread (--num_threads 1)
- Batch size 1 (--max_batch_size 1)
- UCX transport settings not optimized for EFA (a retuned invocation is sketched below)
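A retuned run along the lines below would address the first three points. It only reuses flags already shown in the configuration above with larger values (which are illustrative assumptions, not recommended settings), and the UCX variables are standard diagnostics to show which transport is actually selected rather than EFA-specific tuning:

```bash
# Print UCX protocol/transport selection at runtime; this confirms whether
# transfers actually go over EFA with GPUDirect or fall back to TCP/host copies.
export UCX_PROTO_INFO=y
export UCX_LOG_LEVEL=info

nixlbench \
  --backend UCX \
  --initiator_seg_type VRAM \
  --target_seg_type VRAM \
  --enable_pt 1 \
  --num_threads 4 \
  --max_batch_size 64 \
  --num_iter 1000
```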
Request
Could the team provide guidance on:
- Recommended nixlbench settings for optimal EFA/GPUDirect performance
- UCX environment variables for AWS EFA optimization
- Expected performance characteristics (is 0.32 GB/s expected for this config?)
- Whether the NIXL abstraction layer introduces expected overhead
Documentation Reference
Full test results: https://github.com/dmvevents/dynamo-workshop/blob/main/NIXLBENCH_SUCCESS_RESULTS.md
The gap seems too large to be explained by configuration alone, so any guidance to confirm we're benchmarking correctly would be appreciated.