Stale locks appear after a successful interruption of a blocked posix lock #3179

@xhernandez

Description of problem:

When a process that is blocked waiting for a POSIX lock is interrupted by a signal, the request is cancelled with EINTR. Internally, however, AFR keeps sending LK requests to the remaining bricks. Any lock granted this way has no owner left to release it, so in most cases it is never released and becomes stale.

The exact command to reproduce the issue:

Using the program provided here, the issue can be reproduced by running these tests:

    test_wrlock(0, 1); /* this lock is granted. */
    test_wrlock(1, 1); /* this lock is blocked. */
    test_interrupt(1, 1); /* this should cancel previous lock. */
    test_unlock(0, 1); /* first lock released. */
    test_wrlock(0, 1); /* this should succeed. */
    test_unlock(0, 1);

The full output of the command that failed:

    # gluster volume create test replica 3 server:/bricks/test_{1..3}
    # gluster volume start test
    # mount -t glusterfs server:/test /mnt/test
    # touch /mnt/test/file
    # ./test /mnt/test/file
      0: Locking
      0: Locked
      1: Locking
      1: Received signal 18
      1: fcntl() failed: (4) Interrupted system call
      0: Unlocking
      0: Unlocked
      0: Locking
    <hang>

Expected results:

It shouldn't hang.

Additional info:

The issue happens because AFR takes POSIX locks sequentially, one brick at a time, and only checks for errors after the LK fop has been sent to all bricks. When a request is interrupted, FUSE unwinds the LK request as soon as the interrupt request succeeds, so AFR shouldn't continue sending it to the remaining bricks.
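
The difference between the two behaviours can be illustrated with a toy model. This is not GlusterFS code: `struct lk_result`, `lock_sequentially`, `stale_locks` and `MAX_BRICKS` are hypothetical names, and the model assumes that every brick visited after the interrupted one eventually grants the lock (for example, once the first owner unlocks).

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_BRICKS 8 /* arbitrary limit for this sketch */

/* Toy model of the outcome of one LK request sent brick by brick. */
struct lk_result {
    bool held[MAX_BRICKS]; /* bricks that ended up holding a lock */
    size_t sent;           /* number of LK requests sent */
};

/*
 * Visit the bricks in order. The request is interrupted while waiting on
 * brick `interrupted_at`, so that brick grants nothing.
 * stop_on_interrupt == false: keep going and check errors only at the end
 *                             (the behaviour described in this issue).
 * stop_on_interrupt == true:  stop as soon as the interrupt is seen.
 */
struct lk_result lock_sequentially(size_t nbricks, size_t interrupted_at,
                                   bool stop_on_interrupt)
{
    struct lk_result r = { {false}, 0 };
    for (size_t i = 0; i < nbricks && i < MAX_BRICKS; i++) {
        r.sent++;
        if (i == interrupted_at) {
            if (stop_on_interrupt)
                break;
            continue; /* the caller already received EINTR at this point */
        }
        r.held[i] = true; /* assumption: the brick grants the lock */
    }
    return r;
}

/* Locks granted on bricks visited after the interrupt: nobody owns these. */
size_t stale_locks(const struct lk_result *r, size_t nbricks,
                   size_t interrupted_at)
{
    size_t n = 0;
    for (size_t i = interrupted_at + 1; i < nbricks && i < MAX_BRICKS; i++)
        n += r->held[i] ? 1 : 0;
    return n;
}
```

With three bricks and an interrupt on the first one, the model without the early stop leaves two ownerless locks, while stopping at the interrupt leaves none. Locks granted before the interrupt would still have to be released, which the model does not cover.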

However, the way the locks xlator is implemented makes it difficult to "undo" POSIX locks that have already been acquired when an interrupt arrives in the middle of the acquisition.
