Problem
During AWS SageMaker HyperPod deployment, we encountered nixlbench failing with:
UCP API version is incompatible: required >= 1.19, actual 1.16.0
Environment
- Platform: AWS SageMaker HyperPod EKS
- Container: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1.post1
- Nodes: 2x H100 80GB HBM3
Root Cause
The published runtime containers include UCX 1.16.0, but nixlbench requires UCX >= 1.19.0.
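One way to confirm which UCX version a container actually ships is to read the output of `ucx_info -v`. The following is a minimal sketch, assuming `ucx_info` is installed and on `PATH` inside the container. Because the exact output layout can differ between UCX releases, it takes the first dotted `X.Y.Z` triple it finds. The helper name `extract_ucx_version` is hypothetical and not part of UCX or nixlbench.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first X.Y.Z version string found on stdin.
# Prints nothing if no dotted triple is present.
extract_ucx_version() {
  grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Usage inside the container (assumes ucx_info is on PATH):
#   ucx_info -v | extract_ucx_version
```

Running this in the published runtime image should report the 1.16.0 version named in the error above, if the assumption about `ucx_info` holds.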
Resolution
We rebuilt containers with UCX 1.19.0 from source, which resolved the issue.
Suggestion
- Update documentation to clearly state nixlbench requires UCX 1.19.0+
- Update published containers to include UCX 1.19.0+ for nixlbench compatibility
- Add a version check and a clear error message at nixlbench startup
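The version check suggested above could look roughly like the following when done in a container entrypoint or wrapper script. This is a sketch under stated assumptions, not how nixlbench implements or would implement the check: the function names `version_ge` and `require_ucx` are hypothetical, the comparison relies on GNU `sort -V`, and both versions are expected as full `X.Y.Z` strings.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if version $1 >= version $2.
# sort -V (GNU coreutils) orders version strings; if the smaller of the
# pair is $2, then $1 is at least $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical guard: fail with a clear message when UCX is too old.
# $1 = detected UCX version, $2 = required minimum (defaults to 1.19.0).
require_ucx() {
  local actual="$1" required="${2:-1.19.0}"
  if ! version_ge "$actual" "$required"; then
    echo "UCX ${actual} is too old: nixlbench requires UCX >= ${required}" >&2
    return 1
  fi
}

# Usage (the detected version would come from the container, e.g. ucx_info -v):
#   require_ucx 1.16.0 || exit 1
```

Failing early in the wrapper with the required and detected versions side by side makes the mismatch visible before the benchmark starts.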
Workaround
Build custom containers with UCX 1.19.0:
RUN git clone --depth 1 --branch v1.19.0 https://github.com/openucx/ucx.git && \
    cd ucx && ./autogen.sh && ./configure && make -j && make install
Successfully tested with a 2-node ETCD-coordinated setup, achieving 0.324 GB/s throughput.
Reference: https://github.com/dmvevents/dynamo-workshop/blob/main/NIXLBENCH_SUCCESS_RESULTS.md