
io_uring errors when starting glusterd #4214

@patabid

Description of problem:

We have a three-server glusterfs setup. When starting glusterd, the service frequently fails to start with the following error logged:

C [gf-io-uring.c:612:gf_io_uring_cq_process_some] (-->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x7ff76) [0x7f194fc22f76] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8bf15) [0x7f194fc2ef15] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8bdd5) [0x7f194fc2edd5] ) 0-: Assertion failed:

The service will typically start and run after several attempts. It then runs stably for about two weeks before crashing.

All three servers are identical, down to the BIOS versions.

The exact command to reproduce the issue:

$ sudo systemctl start glusterd

The full output of the command that failed:

Job for glusterd.service failed because the control process exited with error code.                                                                                          
See "systemctl status glusterd.service" and "journalctl -xeu glusterd.service" for details. 

Running journalctl -xeu glusterd.service gives this output:

Jul 31 14:53:18 srv-003 glusterd[1582227]: [2023-07-31 14:53:18.894347 +0000] C [gf-io-uring.c:612:gf_io_uring_cq_process_some] (-->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x7ff76) [0x7f194fc22f76] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8bf15) [0x7f194fc2ef15] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8bdd5) [0x7f194fc2edd5] ) 0-: Assertion failed:
Jul 31 14:53:18 srv-003 glusterd[1582227]: pending frames:
Jul 31 14:53:18 srv-003 glusterd[1582227]: patchset: git://git.gluster.org/glusterfs.git
Jul 31 14:53:18 srv-003 glusterd[1582227]: signal received: 6
Jul 31 14:53:18 srv-003 glusterd[1582227]: time of crash:
Jul 31 14:53:18 srv-003 glusterd[1582227]: 2023-07-31 14:53:18 +0000
Jul 31 14:53:18 srv-003 glusterd[1582227]: configuration details:
Jul 31 14:53:18 srv-003 glusterd[1582227]: argp 1
Jul 31 14:53:18 srv-003 glusterd[1582227]: backtrace 1
Jul 31 14:53:18 srv-003 glusterd[1582227]: dlfcn 1
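
If it would help, I can also run glusterd in the foreground with debug logging the next time it fails, along these lines (assuming the --debug option behaves the same on this build):

$ sudo systemctl stop glusterd    # stop the unit so the foreground instance can take over
$ sudo glusterd --debug           # run in the foreground with debug-level logging to the terminal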

Expected results:
No output and glusterd running

Mandatory info:
- The output of the gluster volume info command:

Volume Name: vol03
Type: Distributed-Disperse
Volume ID: 49f0d0cd-3335-4e08-ae1e-fb56d2a7d685
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: srv-001:/srv/glusterfs/vol03/brick0
Brick2: srv-002:/srv/glusterfs/vol03/brick0
Brick3: srv-003:/srv/glusterfs/vol03/brick0
Options Reconfigured:
performance.cache-size: 1GB
storage.linux-io_uring: off
server.event-threads: 4
client.event-threads: 4
performance.write-behind: off
performance.parallel-readdir: on
performance.readdir-ahead: on
performance.nl-cache-timeout: 600
performance.nl-cache: on
network.inode-lru-limit: 200000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
performance.cache-samba-metadata: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
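
Note that io_uring is already disabled at the volume level via storage.linux-io_uring (as far as I understand, that option only affects the brick processes, while the assertion above is raised by glusterd itself). For reference, this is roughly how the option was set and can be re-checked (illustrative commands):

$ sudo gluster volume set vol03 storage.linux-io_uring off    # already applied, shown under "Options Reconfigured" above
$ sudo gluster volume get vol03 storage.linux-io_uring        # verify the current value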

- The output of the gluster volume status command:

** This is after the glusterd service has successfully started and is running!

Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick srv-001:/srv/glusterfs/vol03/brick0   54477     0          Y       5564
Brick srv-002:/srv/glusterfs/vol03/brick0   58095     0          Y       4288
Brick srv-003:/srv/glusterfs/vol03/brick0   50589     0          Y       5319
Self-heal Daemon on localhost               N/A       N/A        Y       1582991
Self-heal Daemon on srv-002                 N/A       N/A        Y       4323
Self-heal Daemon on srv-001                 N/A       N/A        Y       7260
 
Task Status of Volume vol03
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal command:

Brick srv-001:/srv/glusterfs/vol03/brick0
Status: Connected
Number of entries: 0

Brick srv-002:/srv/glusterfs/vol03/brick0
Status: Connected
Number of entries: 0

Brick srv-003:/srv/glusterfs/vol03/brick0
Status: Connected
Number of entries: 0

- Provide logs present on the following locations of client and server nodes:
/var/log/glusterfs/
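
I can attach these from each node; this is roughly how I would collect them (default log directory assumed):

$ sudo tail -n 500 /var/log/glusterfs/glusterd.log                    # most recent glusterd messages
$ sudo tar czf glusterfs-logs-$(hostname).tar.gz /var/log/glusterfs/  # bundle everything for attaching here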

- Is there any crash? Provide the backtrace and coredump:

Not sure how to do this; happy to provide it if someone can point me in the right direction for what is needed.
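
If the usual systemd-coredump route is what is needed, I can try something like the following after the next crash (package names and steps assumed from general Ubuntu documentation, not yet verified on these nodes):

$ sudo apt install systemd-coredump gdb    # capture cores and inspect them; matching debug symbols would help too
$ coredumpctl list glusterd                # after a crash, list any captured glusterd cores
$ sudo coredumpctl gdb glusterd            # open the most recent core in gdb
(gdb) thread apply all bt full             # then dump a full backtrace of every thread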

Additional info:

Each server has mostly identical hardware composed of the following:
CPU: AMD Ryzen 7 5700G
RAM: two servers have 16 GB and one has 32 GB (this is the only variance)
Storage:

  • 2x NVMe drives per server, each a 2 TB Samsung 970 EVO Plus

The entire storage stack:

  • EFI partition table per drive
  • the primary drive is the boot drive
  • the primary drive has a 1.8T LVM partition (after the system boot partitions)
  • the second drive has a matching 1.8T LVM partition
  • 1x volume group contains these partitions
  • a 1.15TiB logical volume in LVM RAID 1 across the two drives hosts the gluster brick on each server
  • the LV is encrypted using cryptsetup with LUKS
  • the encrypted LV is then opened via /dev/mapper
  • the mapped device is formatted with an XFS file system
  • this file system is then mounted and served by glusterd as the brick

This is a complex setup driven by a client's security policies, though the RAID setup can be removed if needed; a rough command-level equivalent is sketched below.
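
For reference, the stack on each server is roughly equivalent to the following (device, volume group, and LV names are placeholders, not the real ones):

$ sudo vgcreate vg_gluster /dev/nvme0n1p3 /dev/nvme1n1p1           # one VG over the two 1.8T partitions
$ sudo lvcreate --type raid1 -m 1 -L 1.15T -n lv_brick vg_gluster  # mirrored LV across both drives
$ sudo cryptsetup luksFormat /dev/vg_gluster/lv_brick              # LUKS-encrypt the LV
$ sudo cryptsetup open /dev/vg_gluster/lv_brick brick_crypt        # map it under /dev/mapper
$ sudo mkfs.xfs /dev/mapper/brick_crypt                            # XFS on the mapped device
$ sudo mount /dev/mapper/brick_crypt /srv/glusterfs/vol03          # mounted path that holds the brick0 directory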

- The operating system / glusterfs version:

# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 23.04
Release:        23.04
Codename:       lunar
# glusterfs --version
glusterfs 11.0
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.

Last week we were running glusterfs 10.4 with exactly the same issues. We upgraded to 11.0 this weekend to see whether that would provide a fix; there has been no change in behavior.
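
Happy to add any other environment details if useful, for example the kernel and package versions on each node (illustrative commands):

$ uname -r                       # io_uring behaviour can vary a lot with the kernel version
$ dpkg -l | grep -i glusterfs    # confirm every node is running the same 11.0 packages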
