Description
Hi community, we've been testing 3FS in various environments and encountered a connection establishment issue on certain cloud platforms where RDMA is virtualized at the VM layer.
Problem Description
We observed that RDMA connections consistently fail between different VMs, while RDMA communication works when both endpoints are on the same host (loopback through the same physical NIC). After adding extensive debug logging, we traced the problem to the connection setup phase. It appears that the transport layer in these specific virtualized environments silently drops empty (0-byte) RDMA packets. In the current 3FS implementation, the IBConnect class sends a 0-byte event to notify the peer as part of the connection handshake, and this 0-byte packet is the one being dropped.
As a test, we modified the implementation to send a 1-byte event instead of a 0-byte one. With this change, the RDMA connection is established successfully across different VMs, and the system works as expected.
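
To make the difference concrete, here is a minimal sketch of the two variants of the notification event. It is not the actual 3FS code and does not use the libibverbs types: `NotifyRequest`, `kPadding`, and `makeNotifyRequest` are hypothetical names invented for illustration, and the only thing the sketch models is whether the event carries zero bytes or one padding byte.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a send work request. This is NOT the 3FS or
// libibverbs type; it only records what the notification would carry.
struct NotifyRequest {
    const std::uint8_t* payload;  // nullptr when no data is attached
    std::size_t length;           // number of payload bytes in the event
    int numSge;                   // scatter/gather entries describing the payload
};

// A single static padding byte (assumption: the receiver ignores its value).
inline constexpr std::array<std::uint8_t, 1> kPadding{0};

// Builds the handshake notification.
//   padEmptyEvents == false: the 0-byte event described in this issue.
//   padEmptyEvents == true:  the 1-byte workaround we tested.
inline NotifyRequest makeNotifyRequest(bool padEmptyEvents) {
    if (padEmptyEvents) {
        return NotifyRequest{kPadding.data(), kPadding.size(), 1};
    }
    return NotifyRequest{nullptr, 0, 0};
}
```

In a real verbs-based send path, the padding byte would presumably also need to live in registered memory or be posted as inline data, and the receiving side would need a buffer large enough to accept it. Both points are assumptions about how such a change would be wired in, not statements about the existing code.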
Question
Is this change to a 1-byte event for connection setup a reasonable and safe solution? Are there any potential risks, performance implications, or other side effects we might not be aware of?