Description
1. Error message when I tried to run the GPUNetIO backend
Block Size (B) Batch Size B/W (GB/Sec) Avg Lat. (us) Avg Prep (us) P99 Prep (us) Avg Post (us) P99 Post (us) Avg Tx (us) P99 Tx (us)
got completion with err: syndrome=0x5, vendor_err_synd=0xf9, hw_err_synd=0, hw_synd_type=0, wqe_counter=65281 wqe_qpn=ed020008
kernel_progress: block 0 error CQE! poll_status -5 wqe 511 index 0
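To make the error line easier to read, here is a minimal sketch that splits the `key=value` fields of the error CQE message and looks up the syndrome. The helper names (`parse_error_cqe`, `describe_syndrome`) are hypothetical, and the syndrome table is an assumption: a subset of the mlx5 CQE syndrome codes as I understand them from the Linux mlx5 driver headers, not something printed by nixlbench or DOCA.

```python
import re

# Assumption: subset of mlx5 error-CQE syndrome codes, taken from the
# Linux mlx5 driver headers. Not part of the nixlbench or DOCA output.
MLX5_CQE_SYNDROMES = {
    0x01: "local length error",
    0x02: "local QP operation error",
    0x04: "local protection error",
    0x05: "work request flushed error",
    0x13: "remote access error",
    0x15: "transport retry counter exceeded",
    0x16: "RNR retry counter exceeded",
}


def _convert(value):
    """Convert 0x-prefixed hex and plain decimal to int; keep anything else as text."""
    if value.lower().startswith("0x"):
        return int(value, 16)
    if value.isdigit():
        return int(value)
    return value  # e.g. wqe_qpn=ed020008 has no prefix, so it is left as a string


def parse_error_cqe(line):
    """Hypothetical helper: extract key=value pairs from an error CQE log line."""
    return {key: _convert(value) for key, value in re.findall(r"(\w+)=(\w+)", line)}


def describe_syndrome(code):
    """Hypothetical helper: map a syndrome code to a short description."""
    return MLX5_CQE_SYNDROMES.get(code, "unknown syndrome 0x%x" % code)


if __name__ == "__main__":
    log_line = (
        "got completion with err: syndrome=0x5, vendor_err_synd=0xf9, "
        "hw_err_synd=0, hw_synd_type=0, wqe_counter=65281 wqe_qpn=ed020008"
    )
    fields = parse_error_cqe(log_line)
    print(fields)
    print(describe_syndrome(fields["syndrome"]))
```

If the table above is right, syndrome 0x5 is a flushed work request, which usually means the queue pair had already moved to the error state, so the root cause may be an earlier failure that is not shown in this log.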
2. The error occurs when running the command below:
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/gdrcopy/src:/opt/mellanox/doca NIXL_PLUGIN_DIR=/usr/local/nixl/lib/x86_64-linux-gnu/plugins CUDA_MODULE_LOADING=EAGER ./nixlbench --etcd-endpoints http://x.x.x.x:2379/ --backend=GPUNETIO --initiator_seg_type=VRAM --target_seg_type=DRAM --runtime_type=ETCD --gpunetio_device_list=1 --device_list=mlx5_0 --start_batch_size=512 --max_batch_size=512 --total_buffer_size=34359738368
3. Env & Settings
The environment was built from the main branch (25/10/27).
I am running on a server with 8x H*** GPUs and 4 ConnectX-7 NICs.
Does anyone have an idea what causes this error when running the GPUNetIO backend, or how to work around it?
Note: With the earlier code from pull/760, the GPUNETIO backend returns 'Error 22 doca_gpu_dev_rdma_wait_all xfer' after printing the nixlbench settings.