RDMA connection fails in virtualized RDMA environments due to 0-byte event packet #320

@SimonCqk

Hi community, we've been testing 3FS in various environments and encountered a connection establishment issue on certain cloud platforms where RDMA is virtualized at the VM layer.

Problem Description

We observed that RDMA connections consistently fail between different VMs, while RDMA communication works when both endpoints are on the same host (loopback through the same physical NIC). After adding extensive debug logging, we traced the problem to the connection setup phase: the transport layer in these virtualized environments appears to silently drop empty (0-byte) RDMA packets.

In the current 3FS implementation, the IBConnect class sends a 0-byte event to notify the peer as part of the connection handshake, and this 0-byte packet is the one being dropped. As a test, we modified the implementation to send a 1-byte event instead. With this change, the RDMA connection is established successfully across different VMs, and the system works as expected.
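To make the observed behavior concrete, here is a minimal, self-contained C++ sketch of a toy model. It is not the 3FS code and does not use the verbs API. All names (`Message`, `DroppingTransport`, `make_connect_event`, `handshake_succeeds`) are hypothetical and exist only for illustration. The model assumes a transport that silently discards empty messages, as we believe the virtualized RDMA layer does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a message handed to the RDMA transport.
struct Message {
    std::vector<std::uint8_t> payload;
};

// Toy transport mimicking the behavior we observed: when drops_empty is set,
// 0-byte messages vanish without any error being reported to the sender.
struct DroppingTransport {
    bool drops_empty = true;

    std::optional<Message> deliver(const Message& m) const {
        if (drops_empty && m.payload.empty()) {
            return std::nullopt;  // silently dropped
        }
        return m;
    }
};

// Build the handshake notification event.
// pad_bytes == 0 models the original behavior; pad_bytes == 1 models our workaround.
Message make_connect_event(std::size_t pad_bytes) {
    return Message{std::vector<std::uint8_t>(pad_bytes, std::uint8_t{0})};
}

// The handshake completes only if the peer actually receives the event.
bool handshake_succeeds(const DroppingTransport& transport, std::size_t pad_bytes) {
    return transport.deliver(make_connect_event(pad_bytes)).has_value();
}
```

In this model, `handshake_succeeds(DroppingTransport{true}, 0)` fails while `handshake_succeeds(DroppingTransport{true}, 1)` succeeds, which matches what we see across VMs. With `drops_empty` set to `false` (the same-host case), the 0-byte event goes through. One thing the sketch leaves out is the receiving side: with a real 1-byte event, the peer presumably has to tolerate and ignore the padding byte.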

Question

Is this change to a 1-byte event for connection setup a reasonable and safe solution? Are there any potential risks, performance implications, or other side effects we might not be aware of?
