[Documentation] nixlbench UCX version requirement unclear #1023

@dmvevents

Problem

During an AWS SageMaker HyperPod deployment, nixlbench failed with:

UCP API version is incompatible: required >= 1.19, actual 1.16.0

Environment

  • Platform: AWS SageMaker HyperPod EKS
  • Container: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1.post1
  • Nodes: 2x H100 80GB HBM3

Root Cause

The published runtime containers include UCX 1.16.0, but nixlbench requires UCX >= 1.19.0.

Resolution

We rebuilt containers with UCX 1.19.0 from source, which resolved the issue.

Suggestion

  1. Update the documentation to state clearly that nixlbench requires UCX 1.19.0+
  2. Update the published containers to include UCX 1.19.0+ for nixlbench compatibility
  3. Add a version check and a clear error message at nixlbench startup
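
To illustrate suggestion 3, here is a minimal sketch of a preflight check that could run before nixlbench starts, for example from a container entrypoint. It is a shell stand-in for a check that would more naturally live inside nixlbench itself. The function names `version_ge` and `check_ucx_version` and the variable `REQUIRED_UCX` are hypothetical, and the comparison assumes a `sort` that supports `-V` (GNU coreutils).

```shell
#!/bin/sh
# Hypothetical preflight helper; these names are illustrative and not part of nixlbench.
REQUIRED_UCX="1.19.0"

# Succeeds when version $1 is >= version $2.
# Assumption: `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints a clear message; returns non-zero when the given UCX version is too old.
check_ucx_version() {
    actual="$1"
    if version_ge "$actual" "$REQUIRED_UCX"; then
        echo "UCX $actual satisfies >= $REQUIRED_UCX"
    else
        echo "UCX $actual is too old: nixlbench needs >= $REQUIRED_UCX" >&2
        return 1
    fi
}
```

One possible way to feed it the installed version is `check_ucx_version "$(ucx_info -v | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)"`. This assumes the first dotted triple that `ucx_info -v` prints is the library version, which should be verified against the UCX build in use.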

Workaround

Build custom containers with UCX 1.19.0:

RUN git clone --depth 1 --branch v1.19.0 https://github.com/openucx/ucx.git && \
    cd ucx && ./autogen.sh && ./configure && make -j"$(nproc)" && make install

Successfully tested with a 2-node, etcd-coordinated setup, achieving 0.324 GB/s throughput.

Reference: https://github.com/dmvevents/dynamo-workshop/blob/main/NIXLBENCH_SUCCESS_RESULTS.md
