After replacing a disk, the corresponding targets stay OFFLINE #335

@wangzhuzhen

I set up a 3FS cluster with 16 storage nodes, each using a single SSD, and it has been in use for some time.

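The SSD on one node has now suffered a physical failure and cannot be recovered, so I replaced the disk. After restarting the storage service, however, the storage targets on that node stay OFFLINE and no automatic recovery takes place. What should I do next?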

All nodes report a normal status:

# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://100.70.8.24:8001","RDMA://100.70.8.23:8001","RDMA://100.70.8.25:8001"]' "list-nodes"
Id     Type     Status               Hostname           Pid    Tags  LastHeartbeatTime    ConfigVersion  ReleaseVersion
3      MGMTD    PRIMARY_MGMTD        gpu25-n204-c01-6u  18314  []    N/A                  0              250228-dev-1-999999-394583db
1      MGMTD    HEARTBEAT_CONNECTED  gpu23-n204-b17-6u  31749  []    2025-09-01 06:44:05  0              250228-dev-1-999999-394583db
2      MGMTD    HEARTBEAT_CONNECTED  gpu24-n204-b18-6u  46455  []    2025-09-01 06:44:12  0              250228-dev-1-999999-394583db
100    META     HEARTBEAT_CONNECTED  gpu23-n204-b17-6u  36418  []    2025-09-01 06:44:03  0              250228-dev-1-999999-394583db
200    META     HEARTBEAT_CONNECTED  gpu24-n204-b18-6u  51648  []    2025-09-01 06:44:08  0              250228-dev-1-999999-394583db
300    META     HEARTBEAT_CONNECTED  gpu25-n204-c01-6u  22979  []    2025-09-01 06:44:08  0              250228-dev-1-999999-394583db
10009  STORAGE  HEARTBEAT_CONNECTED  gpu09-n204-a13-6u  18369  []    2025-09-01 06:44:12  0              250228-dev-1-999999-394583db
10010  STORAGE  HEARTBEAT_CONNECTED  gpu10-n204-a15-6u  62739  []    2025-09-01 06:44:12  0              250228-dev-1-999999-394583db
10011  STORAGE  HEARTBEAT_CONNECTED  gpu11-n204-a16-6u  39859  []    2025-09-01 06:44:12  0              250228-dev-1-999999-394583db
10012  STORAGE  HEARTBEAT_CONNECTED  gpu12-n204-a18-6u  30478  []    2025-09-01 06:44:10  0              250228-dev-1-999999-394583db
10013  STORAGE  HEARTBEAT_CONNECTED  gpu13-n204-b03-6u  47478  []    2025-09-01 06:44:11  0              250228-dev-1-999999-394583db
10014  STORAGE  HEARTBEAT_CONNECTED  gpu14-n204-b04-6u  38979  []    2025-09-01 06:44:11  0              250228-dev-1-999999-394583db
10015  STORAGE  HEARTBEAT_CONNECTED  gpu15-n204-b06-6u  7326   []    2025-09-01 06:44:11  0              250228-dev-1-999999-394583db
10016  STORAGE  HEARTBEAT_CONNECTED  gpu16-n204-b07-6u  54174  []    2025-09-01 06:44:12  0              250228-dev-1-999999-394583db
10017  STORAGE  HEARTBEAT_CONNECTED  gpu17-n204-b08-6u  7493   []    2025-09-01 06:44:11  0              250228-dev-1-999999-394583db
10018  STORAGE  HEARTBEAT_CONNECTED  gpu18-n204-b09-6u  59328  []    2025-09-01 06:44:10  0              250228-dev-1-999999-394583db
10019  STORAGE  HEARTBEAT_CONNECTED  gpu19-n204-b11-6u  31081  []    2025-09-01 06:44:11  0              250228-dev-1-999999-394583db
10020  STORAGE  HEARTBEAT_CONNECTED  gpu20-n204-b12-6u  21612  []    2025-09-01 06:44:13  0              250228-dev-1-999999-394583db
10021  STORAGE  HEARTBEAT_CONNECTED  gpu21-n204-b14-6u  62261  []    2025-09-01 06:44:13  0              250228-dev-1-999999-394583db
10022  STORAGE  HEARTBEAT_CONNECTED  gpu22-n204-b15-6u  38910  []    2025-09-01 06:44:12  0              250228-dev-1-999999-394583db
10023  STORAGE  HEARTBEAT_CONNECTED  gpu23-n204-b17-6u  58510  []    2025-09-01 06:44:12  0              250228-dev-1-999999-394583db
10024  STORAGE  HEARTBEAT_CONNECTED  gpu24-n204-b18-6u  64182  []    2025-09-01 06:44:10  0              250228-dev-1-999999-394583db
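
To double-check a listing like this without reading every row, a small filter can help. The sketch below is only an illustration: the function name `unhealthy_nodes` is hypothetical, and it assumes that `HEARTBEAT_CONNECTED` and `PRIMARY_MGMTD` are the only statuses that count as healthy, which is inferred from the output above rather than taken from the 3FS documentation.

```shell
#!/bin/sh
# Read `list-nodes` output on stdin and print every node whose Status column
# is neither HEARTBEAT_CONNECTED nor PRIMARY_MGMTD.
# ASSUMPTION: those two values are the only healthy statuses (inferred from
# the listing in this issue).
unhealthy_nodes() {
  awk '$1 != "Id" && NF >= 3 &&
       $3 != "HEARTBEAT_CONNECTED" && $3 != "PRIMARY_MGMTD" {
         # Columns: 1 = node Id, 2 = Type, 3 = Status
         print $1, $2, $3
       }'
}
```

Piping the `list-nodes` output shown above into `unhealthy_nodes` should print nothing, which matches the observation that every node is connected.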

Several chains contain an OFFLINE target (all of them on the same node):

# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://100.70.8.24:8001","RDMA://100.70.8.23:8001","RDMA://100.70.8.25:8001"]' "list-chains" | grep OFFLINE
900100001  1             2             SERVING(2/3)  []              101002000101(SERVING-UPTODATE)  101002200101(SERVING-UPTODATE)  101002300101(OFFLINE-OFFLINE)
900100004  1             2             SERVING(2/3)  []              101001800101(SERVING-UPTODATE)  101002100101(SERVING-UPTODATE)  101002300102(OFFLINE-OFFLINE)
900100007  1             2             SERVING(2/3)  []              101001200103(SERVING-UPTODATE)  101001400101(SERVING-UPTODATE)  101002300103(OFFLINE-OFFLINE)
900100017  1             5             SERVING(2/3)  []              101002000104(SERVING-UPTODATE)  101001600103(SERVING-UPTODATE)  101002300104(OFFLINE-OFFLINE)
900100027  1             2             SERVING(2/3)  []              101001300106(SERVING-UPTODATE)  101001400104(SERVING-UPTODATE)  101002300105(OFFLINE-OFFLINE)
900100030  1             2             SERVING(2/3)  []              101001900105(SERVING-UPTODATE)  101002200107(SERVING-UPTODATE)  101002300106(OFFLINE-OFFLINE)
900100032  1             2             SERVING(2/3)  []              101001100105(SERVING-UPTODATE)  101001700106(SERVING-UPTODATE)  101002300107(OFFLINE-OFFLINE)
900100033  1             2             SERVING(2/3)  []              101001200106(SERVING-UPTODATE)  101001500108(SERVING-UPTODATE)  101002300108(OFFLINE-OFFLINE)
900100037  1             2             SERVING(2/3)  []              101001300108(SERVING-UPTODATE)  101001800107(SERVING-UPTODATE)  101002300109(OFFLINE-OFFLINE)
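
The claim that all OFFLINE targets sit on one node can be checked mechanically. The following is a minimal sketch, not part of 3FS: the function name `offline_targets_by_prefix` is hypothetical, and it assumes that the first 7 characters of a target ID identify the node (for example `1010023` for node 10023). That mapping is inferred from the listings in this issue, not from the 3FS documentation.

```shell
#!/bin/sh
# Read `list-chains` output on stdin and count OFFLINE targets per ID prefix.
# ASSUMPTION: the first 7 characters of a target ID identify the node
# (inferred from the listings in this issue).
offline_targets_by_prefix() {
  grep -o '[0-9]*(OFFLINE-OFFLINE)' |  # keep only the OFFLINE target fields
    sed 's/(.*//' |                    # drop the "(OFFLINE-OFFLINE)" suffix
    cut -c1-7 |                        # keep the assumed node prefix
    sort |
    uniq -c |
    awk '{ print $2, $1 }'             # print "<prefix> <count>"
}
```

If the output is a single line, every OFFLINE target shares the same prefix. For the nine chains listed above that line would be `1010023 9`.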

The storage targets on that node are all OFFLINE:

# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://100.70.8.24:8001","RDMA://100.70.8.23:8001","RDMA://100.70.8.25:8001"]' "list-targets" | grep OFFLINE
101002300108  900100033  TAIL    OFFLINE      OFFLINE     N/A     N/A        0
101002300103  900100007  TAIL    OFFLINE      OFFLINE     N/A     N/A        0
101002300101  900100001  TAIL    OFFLINE      OFFLINE     N/A     N/A        0
101002300106  900100030  TAIL    OFFLINE      OFFLINE     N/A     N/A        0
101002300107  900100032  TAIL    OFFLINE      OFFLINE     N/A     N/A        0
101002300102  900100004  TAIL    OFFLINE      OFFLINE     N/A     N/A        0
101002300105  900100027  TAIL    OFFLINE      OFFLINE     N/A     N/A        0
101002300109  900100037  TAIL    OFFLINE      OFFLINE     N/A     N/A        0
101002300104  900100017  TAIL    OFFLINE      OFFLINE     N/A     N/A        0

On a healthy node, the SSD storage directory /storage/data1/3fs/ contains one directory per target ID plus an engine directory:

/storage/data1/3fs# ls -l
total 0
drwxr-xr-x  2 root root  33 Aug 12 08:30 101002400101
drwxr-xr-x  2 root root  33 Aug 12 08:30 101002400102
drwxr-xr-x  2 root root  25 Aug 12 08:30 101002400103
drwxr-xr-x  2 root root  25 Aug 12 08:30 101002400104
drwxr-xr-x  2 root root  33 Aug 12 08:30 101002400105
drwxr-xr-x  2 root root  33 Aug 12 08:30 101002400106
drwxr-xr-x  2 root root  25 Aug 12 08:30 101002400107
drwxr-xr-x  2 root root  25 Aug 12 08:30 101002400108
drwxr-xr-x  2 root root  33 Aug 12 08:30 101002400109
drwxr-xr-x 14 root root 212 Aug 12 06:25 engine

On the node with the replaced disk, the SSD storage directory /storage/data1/3fs/ contains only the engine directory:

# ls -l
total 0
drwxr-xr-x 14 root root 212 Sep  1 03:26 engine
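
The comparison between the two directory listings can also be scripted. This is a rough sketch under the assumption, taken from the healthy node above, that each target has a directory named after its target ID directly under the storage directory. The function name `missing_target_dirs` is hypothetical.

```shell
#!/bin/sh
# Usage: missing_target_dirs STORAGE_DIR < target_ids.txt
# Reads one target ID per line on stdin and prints every ID that has no
# directory of the same name under STORAGE_DIR.
# ASSUMPTION: target directories are named exactly after the target ID,
# as on the healthy node shown above.
missing_target_dirs() {
  dir=$1
  while read -r target_id; do
    [ -n "$target_id" ] || continue                 # skip blank lines
    [ -d "$dir/$target_id" ] || printf '%s\n' "$target_id"
  done
}
```

Feeding it the first column of the `list-targets | grep OFFLINE` output together with `/storage/data1/3fs` would, on the node with the replaced disk, print all nine target IDs, since none of their directories exist there.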

What do I need to do next to bring the storage targets on this node back to the normal SERVING / UPTODATE state? Should I run unregister-node followed by register-node, or remove-target followed by create-target?
