[Performance] nixlbench showing significantly lower bandwidth than raw UCX - optimization guidance needed #1021

@dmvevents

Description

Observation

After successfully resolving UCX version and ETCD coordination issues, nixlbench is showing much lower bandwidth than raw UCX performance tests on the same hardware.

Performance Comparison

Test                    Bandwidth     Transport       Hardware
Raw UCX (ucp_put_bw)    284.98 GB/s   UCX over EFA    2x H100 80GB
nixlbench               0.324 GB/s    UCX via NIXL    Same hardware
Ratio                   1:878         -               -
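
For context, a raw UCX put-bandwidth figure like the one above is typically measured with ucx_perftest run between the two nodes. The exact invocation for our run is in the linked results; the sketch below is illustrative only (the host name node-1 and the 1 MiB message size are placeholders):

ucx_perftest -t ucp_put_bw -m cuda -s 1048576           # on the target node (acts as the server)
ucx_perftest node-1 -t ucp_put_bw -m cuda -s 1048576    # on the initiator node, pointing at the target (placeholder host name)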

Test Configuration

nixlbench \
  --backend UCX \
  --initiator_seg_type VRAM \
  --target_seg_type VRAM \
  --enable_pt 0 \
  --num_threads 1 \
  --max_batch_size 1 \
  --num_iter 64

Environment

  • Platform: AWS SageMaker HyperPod EKS
  • Hardware: 2x H100 80GB HBM3
  • Network: EFA with GPUDirect RDMA enabled
  • Container: Custom build with UCX 1.19.0, NIXL 0.7.1
  • ETCD: Successfully coordinating, barriers working
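
For reference, transport and GPUDirect visibility inside the container can be sanity-checked with standard tools; a minimal sketch, assuming ucx_info, fi_info, and nvidia-smi are available in the image:

ucx_info -v          # confirm the UCX build/version in use (expecting 1.19.0)
ucx_info -d          # list the devices/transports UCX detects (CUDA and EFA entries should appear)
fi_info -p efa       # confirm the EFA libfabric provider is visible
nvidia-smi topo -m   # check GPU/NIC topology relevant to GPUDirect RDMA paths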

Potential Causes

  1. Progress threads disabled (--enable_pt 0)
  2. Single worker thread (--num_threads 1)
  3. Batch size of 1 (--max_batch_size 1)
  4. UCX transport settings not tuned for EFA (see the sketch after this list)
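
A re-run with those knobs changed would help separate configuration effects from NIXL overhead. The sketch below reuses the flags from the configuration above; the flag values and UCX environment variables are illustrative assumptions, not known-good settings (which is part of what we are asking about):

# Transport/device selection; "all" is the default, placeholders to be narrowed
# based on the output of `ucx_info -d` on the HyperPod nodes.
export UCX_TLS=all
export UCX_NET_DEVICES=all

# Illustrative values: progress thread on, multiple workers, larger batches, more iterations.
nixlbench \
  --backend UCX \
  --initiator_seg_type VRAM \
  --target_seg_type VRAM \
  --enable_pt 1 \
  --num_threads 4 \
  --max_batch_size 16 \
  --num_iter 1000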

Request

Could the team provide guidance on:

  1. Recommended nixlbench settings for optimal EFA/GPUDirect performance
  2. UCX environment variables for AWS EFA optimization
  3. Expected performance characteristics (is 0.32 GB/s expected for this config?)
  4. Whether the NIXL abstraction layer is expected to introduce overhead of this magnitude

Documentation Reference

Full test results: https://github.com/dmvevents/dynamo-workshop/blob/main/NIXLBENCH_SUCCESS_RESULTS.md

The gap seems too large to be explained by configuration alone, so any guidance to confirm we are benchmarking correctly would be appreciated.
