[GPUNetIO] [ERR] kernel_progress: block 0 error CQE! poll_status -5 wqe 511 index 0 #952

@foraxe

1. Error message when I tried to run the GPUNetIO backend

Block Size (B) Batch Size B/W (GB/Sec) Avg Lat. (us) Avg Prep (us) P99 Prep (us) Avg Post (us) P99 Post (us) Avg Tx (us) P99 Tx (us)
got completion with err: syndrome=0x5, vendor_err_synd=0xf9, hw_err_synd=0, hw_synd_type=0, wqe_counter=65281 wqe_qpn=ed020008
kernel_progress: block 0 error CQE! poll_status -5 wqe 511 index 0
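For triage, the fields of the "got completion with err" line can be pulled apart with a short script. This is a minimal sketch in Python and is not part of NIXL or DOCA: `parse_cqe_error` is a hypothetical helper, the syndrome names are assumed to follow the `MLX5_CQE_SYNDROME_*` enum in rdma-core's mlx5 provider (please verify against your installed headers), and reading `wqe_qpn` as bare hex is also an assumption.

```python
import re

# Assumption: names mirror the MLX5_CQE_SYNDROME_* enum in rdma-core's
# mlx5 provider; verify against the headers installed on your system.
MLX5_CQE_SYNDROMES = {
    0x01: "LOCAL_LENGTH_ERR",
    0x02: "LOCAL_QP_OP_ERR",
    0x04: "LOCAL_PROT_ERR",
    0x05: "WR_FLUSH_ERR",
    0x06: "MW_BIND_ERR",
    0x10: "BAD_RESP_ERR",
    0x11: "LOCAL_ACCESS_ERR",
    0x12: "REMOTE_INVAL_REQ_ERR",
    0x13: "REMOTE_ACCESS_ERR",
    0x14: "REMOTE_OP_ERR",
    0x15: "TRANSPORT_RETRY_EXC_ERR",
    0x16: "RNR_RETRY_EXC_ERR",
    0x22: "REMOTE_ABORTED_ERR",
}


def parse_cqe_error(line):
    """Return the key=value fields of an error-CQE log line as integers."""
    fields = {}
    for key, raw in re.findall(r"(\w+)=(\w+)", line):
        try:
            fields[key] = int(raw, 0)   # decimal or 0x-prefixed hex
        except ValueError:
            fields[key] = int(raw, 16)  # assumption: bare hex (e.g. wqe_qpn)
    return fields


# Error line copied from the log above.
line = ("got completion with err: syndrome=0x5, vendor_err_synd=0xf9, "
        "hw_err_synd=0, hw_synd_type=0, wqe_counter=65281 wqe_qpn=ed020008")

fields = parse_cqe_error(line)
name = MLX5_CQE_SYNDROMES.get(fields["syndrome"], "UNKNOWN")
print(f"syndrome {fields['syndrome']:#x} -> {name}")
print(f"vendor_err_synd {fields['vendor_err_synd']:#x}")
print(f"wqe_counter {fields['wqe_counter']} ({fields['wqe_counter']:#06x})")
```

If that table is right, syndrome 0x5 is a flush error, which typically means the QP was already in the error state when this WQE was processed, so the first failing operation may be an earlier one than wqe 511.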

2. The error occurs when running the following command:

LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/gdrcopy/src:/opt/mellanox/doca NIXL_PLUGIN_DIR=/usr/local/nixl/lib/x86_64-linux-gnu/plugins CUDA_MODULE_LOADING=EAGER ./nixlbench --etcd-endpoints http://x.x.x.x:2379/ --backend=GPUNETIO --initiator_seg_type=VRAM --target_seg_type=DRAM --runtime_type=ETCD --gpunetio_device_list=1 --device_list=mlx5_0 --start_batch_size=512 --max_batch_size=512 --total_buffer_size=34359738368

3. Environment and settings

The environment was built from the main branch (as of 25/10/27).
I am running on a server with 8 H*** GPUs and 4 ConnectX-7 NICs.

Does anyone have ideas on how to fix this error when running the GPUNetIO backend, or suggestions for how to debug it?

Note: With the earlier code from pull/760, the GPUNETIO backend returns 'Error 22 doca_gpu_dev_rdma_wait_all xfer' right after printing the nixlbench settings.
