We set up a 3FS cluster with 16 storage nodes, each storage node using a single SSD, and it has been in use for a while.
Recently the SSD on one node failed physically and could not be recovered, so we replaced the disk. However, after the storage service was restarted, the storage targets on that node stay OFFLINE and never recover automatically. How should we handle this?
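(For context: the replacement disk was formatted and mounted back at the original path before the storage service was restarted, which is why the service could recreate the engine directory shown further below. Roughly the following; the device name and filesystem type are placeholders, not the exact values used:)
# mkfs.xfs /dev/nvme0n1
# mount -o noatime,nodiratime /dev/nvme0n1 /storage/data1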
All nodes report a normal status:
# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://100.70.8.24:8001","RDMA://100.70.8.23:8001","RDMA://100.70.8.25:8001"]' "list-nodes"
Id Type Status Hostname Pid Tags LastHeartbeatTime ConfigVersion ReleaseVersion
3 MGMTD PRIMARY_MGMTD gpu25-n204-c01-6u 18314 [] N/A 0 250228-dev-1-999999-394583db
1 MGMTD HEARTBEAT_CONNECTED gpu23-n204-b17-6u 31749 [] 2025-09-01 06:44:05 0 250228-dev-1-999999-394583db
2 MGMTD HEARTBEAT_CONNECTED gpu24-n204-b18-6u 46455 [] 2025-09-01 06:44:12 0 250228-dev-1-999999-394583db
100 META HEARTBEAT_CONNECTED gpu23-n204-b17-6u 36418 [] 2025-09-01 06:44:03 0 250228-dev-1-999999-394583db
200 META HEARTBEAT_CONNECTED gpu24-n204-b18-6u 51648 [] 2025-09-01 06:44:08 0 250228-dev-1-999999-394583db
300 META HEARTBEAT_CONNECTED gpu25-n204-c01-6u 22979 [] 2025-09-01 06:44:08 0 250228-dev-1-999999-394583db
10009 STORAGE HEARTBEAT_CONNECTED gpu09-n204-a13-6u 18369 [] 2025-09-01 06:44:12 0 250228-dev-1-999999-394583db
10010 STORAGE HEARTBEAT_CONNECTED gpu10-n204-a15-6u 62739 [] 2025-09-01 06:44:12 0 250228-dev-1-999999-394583db
10011 STORAGE HEARTBEAT_CONNECTED gpu11-n204-a16-6u 39859 [] 2025-09-01 06:44:12 0 250228-dev-1-999999-394583db
10012 STORAGE HEARTBEAT_CONNECTED gpu12-n204-a18-6u 30478 [] 2025-09-01 06:44:10 0 250228-dev-1-999999-394583db
10013 STORAGE HEARTBEAT_CONNECTED gpu13-n204-b03-6u 47478 [] 2025-09-01 06:44:11 0 250228-dev-1-999999-394583db
10014 STORAGE HEARTBEAT_CONNECTED gpu14-n204-b04-6u 38979 [] 2025-09-01 06:44:11 0 250228-dev-1-999999-394583db
10015 STORAGE HEARTBEAT_CONNECTED gpu15-n204-b06-6u 7326 [] 2025-09-01 06:44:11 0 250228-dev-1-999999-394583db
10016 STORAGE HEARTBEAT_CONNECTED gpu16-n204-b07-6u 54174 [] 2025-09-01 06:44:12 0 250228-dev-1-999999-394583db
10017 STORAGE HEARTBEAT_CONNECTED gpu17-n204-b08-6u 7493 [] 2025-09-01 06:44:11 0 250228-dev-1-999999-394583db
10018 STORAGE HEARTBEAT_CONNECTED gpu18-n204-b09-6u 59328 [] 2025-09-01 06:44:10 0 250228-dev-1-999999-394583db
10019 STORAGE HEARTBEAT_CONNECTED gpu19-n204-b11-6u 31081 [] 2025-09-01 06:44:11 0 250228-dev-1-999999-394583db
10020 STORAGE HEARTBEAT_CONNECTED gpu20-n204-b12-6u 21612 [] 2025-09-01 06:44:13 0 250228-dev-1-999999-394583db
10021 STORAGE HEARTBEAT_CONNECTED gpu21-n204-b14-6u 62261 [] 2025-09-01 06:44:13 0 250228-dev-1-999999-394583db
10022 STORAGE HEARTBEAT_CONNECTED gpu22-n204-b15-6u 38910 [] 2025-09-01 06:44:12 0 250228-dev-1-999999-394583db
10023 STORAGE HEARTBEAT_CONNECTED gpu23-n204-b17-6u 58510 [] 2025-09-01 06:44:12 0 250228-dev-1-999999-394583db
10024 STORAGE HEARTBEAT_CONNECTED gpu24-n204-b18-6u 64182 [] 2025-09-01 06:44:10 0 250228-dev-1-999999-394583db
Some chains have an OFFLINE target (all on the same node):
# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://100.70.8.24:8001","RDMA://100.70.8.23:8001","RDMA://100.70.8.25:8001"]' "list-chains" | grep OFFLINE
900100001 1 2 SERVING(2/3) [] 101002000101(SERVING-UPTODATE) 101002200101(SERVING-UPTODATE) 101002300101(OFFLINE-OFFLINE)
900100004 1 2 SERVING(2/3) [] 101001800101(SERVING-UPTODATE) 101002100101(SERVING-UPTODATE) 101002300102(OFFLINE-OFFLINE)
900100007 1 2 SERVING(2/3) [] 101001200103(SERVING-UPTODATE) 101001400101(SERVING-UPTODATE) 101002300103(OFFLINE-OFFLINE)
900100017 1 5 SERVING(2/3) [] 101002000104(SERVING-UPTODATE) 101001600103(SERVING-UPTODATE) 101002300104(OFFLINE-OFFLINE)
900100027 1 2 SERVING(2/3) [] 101001300106(SERVING-UPTODATE) 101001400104(SERVING-UPTODATE) 101002300105(OFFLINE-OFFLINE)
900100030 1 2 SERVING(2/3) [] 101001900105(SERVING-UPTODATE) 101002200107(SERVING-UPTODATE) 101002300106(OFFLINE-OFFLINE)
900100032 1 2 SERVING(2/3) [] 101001100105(SERVING-UPTODATE) 101001700106(SERVING-UPTODATE) 101002300107(OFFLINE-OFFLINE)
900100033 1 2 SERVING(2/3) [] 101001200106(SERVING-UPTODATE) 101001500108(SERVING-UPTODATE) 101002300108(OFFLINE-OFFLINE)
900100037 1 2 SERVING(2/3) [] 101001300108(SERVING-UPTODATE) 101001800107(SERVING-UPTODATE) 101002300109(OFFLINE-OFFLINE)
The storage targets on that node are all OFFLINE:
# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://100.70.8.24:8001","RDMA://100.70.8.23:8001","RDMA://100.70.8.25:8001"]' "list-targets" | grep OFFLINE
101002300108 900100033 TAIL OFFLINE OFFLINE N/A N/A 0
101002300103 900100007 TAIL OFFLINE OFFLINE N/A N/A 0
101002300101 900100001 TAIL OFFLINE OFFLINE N/A N/A 0
101002300106 900100030 TAIL OFFLINE OFFLINE N/A N/A 0
101002300107 900100032 TAIL OFFLINE OFFLINE N/A N/A 0
101002300102 900100004 TAIL OFFLINE OFFLINE N/A N/A 0
101002300105 900100027 TAIL OFFLINE OFFLINE N/A N/A 0
101002300109 900100037 TAIL OFFLINE OFFLINE N/A N/A 0
101002300104 900100017 TAIL OFFLINE OFFLINE N/A N/A 0
On a healthy node, the SSD data directory /storage/data1/3fs/ contains the target ID directories and an engine directory:
/storage/data1/3fs# ls -l
total 0
drwxr-xr-x 2 root root 33 Aug 12 08:30 101002400101
drwxr-xr-x 2 root root 33 Aug 12 08:30 101002400102
drwxr-xr-x 2 root root 25 Aug 12 08:30 101002400103
drwxr-xr-x 2 root root 25 Aug 12 08:30 101002400104
drwxr-xr-x 2 root root 33 Aug 12 08:30 101002400105
drwxr-xr-x 2 root root 33 Aug 12 08:30 101002400106
drwxr-xr-x 2 root root 25 Aug 12 08:30 101002400107
drwxr-xr-x 2 root root 25 Aug 12 08:30 101002400108
drwxr-xr-x 2 root root 33 Aug 12 08:30 101002400109
drwxr-xr-x 14 root root 212 Aug 12 06:25 engine
On the node with the replaced disk, /storage/data1/3fs/ contains only the engine directory:
# ls -l
total 0
drwxr-xr-x 14 root root 212 Sep 1 03:26 engine
What do we need to do next to bring the storage targets on this node back to a normal SERVING/UPTODATE state? Should we unregister-node and then register-node, or remove-target and then create-target?
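If the right approach is remove-target followed by create-target, I assume recreating each offline target would look roughly like the create-target commands used during the initial setup, e.g. for the first offline target (a guess, not verified; the disk-index value and whether --use-new-chunk-engine applies here are assumptions on my part):
# /opt/3fs/bin/admin_cli -cfg /opt/3fs/etc/admin_cli.toml --config.mgmtd_client.mgmtd_server_addresses '["RDMA://100.70.8.24:8001","RDMA://100.70.8.23:8001","RDMA://100.70.8.25:8001"]' "create-target --node-id 10023 --disk-index 0 --target-id 101002300101 --chain-id 900100001 --use-new-chunk-engine"
(target ID and chain ID are taken from the list-targets output above; node 10023 is the node with the replaced disk)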