[GPUNetIO] [ERR] kernel_progress: block 0 error CQE! poll_status -5 wqe 511 index 0 #952

## 1. Error message when running the GPUNetIO backend
```
Block Size (B)      Batch Size     B/W (GB/Sec)   Avg Lat. (us)  Avg Prep (us)  P99 Prep (us)  Avg Post (us)  P99 Post (us)  Avg Tx (us)    P99 Tx (us)
got completion with err: syndrome=0x5, vendor_err_synd=0xf9, hw_err_synd=0, hw_synd_type=0, wqe_counter=65281 wqe_qpn=ed020008
kernel_progress: block 0 error CQE! poll_status -5 wqe 511 index 0
```
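
To make the completion error line above easier to read, here is a minimal Python sketch that splits it into fields and decodes the numeric values. `parse_cqe_error` is a hypothetical helper, not part of NIXL or DOCA. The `SYNDROME_NAMES` table is an assumption written from memory of the `MLX5_CQE_SYNDROME_*` values in the rdma-core mlx5 provider; it is partial and should be checked against the installed headers before drawing conclusions.

```python
import re

# The error line exactly as printed in the log above.
LINE = ("got completion with err: syndrome=0x5, vendor_err_synd=0xf9, "
        "hw_err_synd=0, hw_synd_type=0, wqe_counter=65281 wqe_qpn=ed020008")

# ASSUMPTION: partial syndrome table recalled from rdma-core's mlx5 provider.
# Verify against your installed headers; do not treat it as authoritative.
SYNDROME_NAMES = {
    0x01: "LOCAL_LENGTH_ERR",
    0x02: "LOCAL_QP_OP_ERR",
    0x04: "LOCAL_PROT_ERR",
    0x05: "WR_FLUSH_ERR",
    0x13: "REMOTE_ACCESS_ERR",
    0x15: "TRANSPORT_RETRY_EXC_ERR",
    0x16: "RNR_RETRY_EXC_ERR",
}


def parse_cqe_error(line):
    """Return the key=value fields of an error CQE log line as integers."""
    fields = {}
    for key, value in re.findall(r"(\w+)=(\w+)", line):
        # wqe_qpn is printed as bare hex (no 0x prefix) in the log above;
        # the other fields are either 0x-prefixed hex or plain decimal.
        base = 16 if key == "wqe_qpn" else 0
        fields[key] = int(value, base)
    return fields


fields = parse_cqe_error(LINE)
name = SYNDROME_NAMES.get(fields["syndrome"], "unknown (not in assumed table)")
print(f"syndrome        = {fields['syndrome']:#x} ({name})")
print(f"vendor_err_synd = {fields['vendor_err_synd']:#x}")
print(f"wqe_counter     = {fields['wqe_counter']} ({fields['wqe_counter']:#x})")
print(f"wqe_qpn         = {fields['wqe_qpn']:#x}")
```

If the assumed table is right, syndrome 0x5 would be a flush error. A flush error is usually reported for work requests that were still queued when the QP moved to the error state, so the root cause may be an earlier failure rather than this particular WQE.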

## 2. The error occurs when running the following command:
```
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/gdrcopy/src:/opt/mellanox/doca \
NIXL_PLUGIN_DIR=/usr/local/nixl/lib/x86_64-linux-gnu/plugins \
CUDA_MODULE_LOADING=EAGER \
./nixlbench --etcd-endpoints http://x.x.x.x:2379/ \
    --backend=GPUNETIO \
    --initiator_seg_type=VRAM \
    --target_seg_type=DRAM \
    --runtime_type=ETCD \
    --gpunetio_device_list=1 \
    --device_list=mlx5_0 \
    --start_batch_size=512 \
    --max_batch_size=512 \
    --total_buffer_size=34359738368
```
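
For readers comparing the command with the error output, the following Python sketch pulls the flags out of the `nixlbench` invocation and converts the buffer size to GiB. `parse_flags` is a hypothetical helper written only for this illustration; it does not use any nixlbench API and assumes nothing about how nixlbench itself interprets the flags.

```python
import shlex

# The nixlbench part of the command above (environment variables omitted).
CMD = (
    "./nixlbench --etcd-endpoints http://x.x.x.x:2379/ --backend=GPUNETIO "
    "--initiator_seg_type=VRAM --target_seg_type=DRAM --runtime_type=ETCD "
    "--gpunetio_device_list=1 --device_list=mlx5_0 --start_batch_size=512 "
    "--max_batch_size=512 --total_buffer_size=34359738368"
)


def parse_flags(cmd):
    """Collect --key=value and --key value pairs into a dict of strings."""
    tokens = shlex.split(cmd)
    flags = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--"):
            if "=" in tok:
                key, value = tok[2:].split("=", 1)
            elif i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
                key, value = tok[2:], tokens[i + 1]
                i += 1  # the next token was consumed as this flag's value
            else:
                key, value = tok[2:], ""
            flags[key] = value
        i += 1
    return flags


flags = parse_flags(CMD)
buffer_gib = int(flags["total_buffer_size"]) / 2**30
print(f"backend           = {flags['backend']}")
print(f"batch size        = {flags['start_batch_size']}..{flags['max_batch_size']}")
print(f"total_buffer_size = {buffer_gib:g} GiB")
```

The batch size is fixed at 512 and the error reports `wqe 511`. If that index is zero-based within a batch, it would be the last entry of the batch, but that reading is a guess and not something the log confirms.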

## 3. Environment & settings
The environment is built from the main branch (25/10/27).
I am running on a server with 8 H*** GPUs and 4 ConnectX-7 NICs.

## Does anyone have ideas on how to solve this error when running the GPUNetIO backend, or how to work around it?

Note: With the earlier code from [pull/760](https://github.com/ai-dynamo/nixl/pull/760), the GPUNETIO backend returned 'Error 22 doca_gpu_dev_rdma_wait_all xfer' after printing the nixlbench settings.
