[Performance] nixlbench showing significantly lower bandwidth than raw UCX - optimization guidance needed #1021

@dmvevents

Description

Observation

After successfully resolving UCX version and ETCD coordination issues, nixlbench is showing much lower bandwidth than raw UCX performance tests on the same hardware.

Performance Comparison

Test                    Bandwidth     Transport       Hardware
Raw UCX (ucp_put_bw)    284.98 GB/s   UCX over EFA    2x H100 80GB
nixlbench               0.324 GB/s    UCX via NIXL    Same hardware
Ratio                   1:878         -               -
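
For context, a raw UCX put-bandwidth figure like the one above is typically measured with ucx_perftest run between the two nodes. The exact invocation for our run is in the linked results; the sketch below is illustrative only (the host name node-1 and the 1 MiB message size are placeholders):

ucx_perftest -t ucp_put_bw -m cuda -s 1048576           # on the target node (acts as the server)
ucx_perftest node-1 -t ucp_put_bw -m cuda -s 1048576    # on the initiator node, pointing at the target (placeholder host name)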

Test Configuration

nixlbench \
  --backend UCX \
  --initiator_seg_type VRAM \
  --target_seg_type VRAM \
  --enable_pt 0 \
  --num_threads 1 \
  --max_batch_size 1 \
  --num_iter 64

Environment

  • Platform: AWS SageMaker HyperPod EKS
  • Hardware: 2x H100 80GB HBM3
  • Network: EFA with GPUDirect RDMA enabled
  • Container: Custom build with UCX 1.19.0, NIXL 0.7.1
  • ETCD: Successfully coordinating, barriers working
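
For reference, transport and GPUDirect visibility inside the container can be sanity-checked with standard tools; a minimal sketch, assuming ucx_info, fi_info, and nvidia-smi are available in the image:

ucx_info -v          # confirm the UCX build/version in use (expecting 1.19.0)
ucx_info -d          # list the devices/transports UCX detects (CUDA and EFA entries should appear)
fi_info -p efa       # confirm the EFA libfabric provider is visible
nvidia-smi topo -m   # check GPU/NIC topology relevant to GPUDirect RDMA paths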

Potential Causes

  1. Progress threads disabled (--enable_pt 0)
  2. Single worker thread (--num_threads 1)
  3. Batch size of 1 (--max_batch_size 1)
  4. UCX transport settings not tuned for EFA (see the sketch after this list)
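
A re-run with those knobs changed would help separate configuration effects from NIXL overhead. The sketch below reuses the flags from the configuration above; the flag values and UCX environment variables are illustrative assumptions, not known-good settings (which is part of what we are asking about):

# Transport/device selection; "all" is the default, placeholders to be narrowed
# based on the output of `ucx_info -d` on the HyperPod nodes.
export UCX_TLS=all
export UCX_NET_DEVICES=all

# Illustrative values: progress thread on, multiple workers, larger batches, more iterations.
nixlbench \
  --backend UCX \
  --initiator_seg_type VRAM \
  --target_seg_type VRAM \
  --enable_pt 1 \
  --num_threads 4 \
  --max_batch_size 16 \
  --num_iter 1000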

Request

Could the team provide guidance on:

  1. Recommended nixlbench settings for optimal EFA/GPUDirect performance
  2. UCX environment variables for AWS EFA optimization
  3. Expected performance characteristics (is 0.32 GB/s expected for this config?)
  4. Whether the NIXL abstraction layer is expected to introduce overhead of this magnitude

Documentation Reference

Full test results: https://github.com/dmvevents/dynamo-workshop/blob/main/NIXLBENCH_SUCCESS_RESULTS.md

The gap seems too large to be explained by configuration alone, so any guidance to confirm we are benchmarking correctly would be appreciated.
