Recovering a Failed Member in a bitnami/etcd Cluster

This article describes how to recover a failed member in a cluster deployed with bitnami/etcd. Before touching anything, some data is written into the etcd cluster so we can later verify that the recovery procedure does not lose data. The failure of member etcd-1 is then simulated by deleting the data directory on its PV, and the cause is analyzed from the etcd-1 pod logs, the cluster member list, and the endpoint health status. The recovery procedure consists of: removing the failed member etcd-1 from the cluster, scaling the etcd StatefulSet to 0, deleting the data directory on the PV bound to etcd-1, changing ETCD_INITIAL_CLUSTER_STATE in the StatefulSet to "existing", scaling the etcd StatefulSet back to its original replica count, and checking the endpoint health of the cluster. Finally, the data is verified to be intact.

Environment preparation

Prepare a cluster deployed with bitnami/etcd.

Check the StatefulSet:

[root@cnode1 ~]# kubectl -n etcd get statefulsets.apps
NAME READY AGE
etcd 3/3 29d

Check the pods:

[root@cnode1 ~]# kubectl -n etcd get pod
NAME READY STATUS RESTARTS AGE
etcd-0 1/1 Running 0 19m
etcd-1 1/1 Running 0 21m
etcd-2 1/1 Running 0 22m

Check the ETCD_INITIAL_CLUSTER_STATE setting:

[root@cnode1 ~]# kubectl -n etcd get statefulsets.apps etcd -o yaml | grep ETCD_INITIAL_CLUSTER_STATE -A 1
- name: ETCD_INITIAL_CLUSTER_STATE
value: new

Check the etcd cluster members:

I have no name!@etcd-0:/opt/bitnami/etcd$ bin/etcdctl member list
14105f2e4f2559a9, started, etcd-0, http://etcd-0.etcd-headless.etcd:2380, http://etcd-0.etcd-headless.etcd:2379,http://etcd.etcd:2379, false
432042ce92be9a79, started, etcd-2, http://etcd-2.etcd-headless.etcd:2380, http://etcd-2.etcd-headless.etcd:2379,http://etcd.etcd:2379, false
e3ff14c8562cf8c2, started, etcd-1, http://etcd-1.etcd-headless.etcd:2380, http://etcd-1.etcd-headless.etcd:2379,http://etcd.etcd:2379, false

The cluster has 3 members:

  • etcd-0
  • etcd-1
  • etcd-2
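
Most etcdctl commands below take a client `--endpoints` list. As a small plain-shell convenience (not part of the chart), the list can be derived from the StatefulSet naming convention; the service and namespace names `etcd-headless` and `etcd` match this cluster:

```shell
# Build the --endpoints list for members etcd-0..etcd-2 of the headless
# service "etcd-headless" in namespace "etcd". The port is a parameter;
# 2379 is etcd's client port (the transcripts below happen to target 2380).
port=2379
endpoints=$(for i in 0 1 2; do printf 'etcd-%s.etcd-headless.etcd:%s,' "$i" "$port"; done)
endpoints=${endpoints%,}  # strip the trailing comma
echo "$endpoints"
```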

Add test data

Before doing anything else, write some data into the etcd cluster so we can later verify that the recovery procedure does not lose data.

[root@cnode1 ~]# kubectl -n etcd exec -it etcd-0 -- bash
I have no name!@etcd-0:/opt/bitnami/etcd$ bin/etcdctl put /hello world
OK
I have no name!@etcd-0:/opt/bitnami/etcd$ bin/etcdctl get /hello
/hello
world

Break the etcd-1 member

Get the PV of etcd-1 (all PVs use the local-storage StorageClass):

[root@cnode1 ~]# kubectl -n etcd get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
data-etcd-0 Bound pvc-2a2f2a25-69b3-41ab-a21d-a5aa3d05277d 10Gi RWO local-storage 29d
data-etcd-1 Bound pvc-4832f8c4-0827-4992-9e39-185f9c0f8e20 10Gi RWO local-storage 60m
data-etcd-2 Bound pvc-6da8bcd7-cd2a-43fc-9849-7e1c2870f193 10Gi RWO local-storage 29d

Find the node that hosts the etcd-1 PV:

[root@cnode1 ~]# kubectl get pv pvc-4832f8c4-0827-4992-9e39-185f9c0f8e20 -o yaml | grep nodeAffinity -A 7
nodeAffinity:
  required:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - cnode3

Get the data directory of the etcd-1 PV:

[root@cnode1 ~]# kubectl get pv pvc-4832f8c4-0827-4992-9e39-185f9c0f8e20 -o yaml | grep hostPath -A 2
hostPath:
  path: /data/local-path-provisioner/pvc-4832f8c4-0827-4992-9e39-185f9c0f8e20_etcd_data-etcd-1
  type: DirectoryOrCreate
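
Instead of grepping the YAML, `kubectl get pv <pv> -o jsonpath='{.spec.hostPath.path}'` prints the path directly. Where only raw YAML is at hand, a small awk filter works too (a sketch; the YAML below is a trimmed stand-in for the real output):

```shell
# Trimmed stand-in for `kubectl get pv <pv> -o yaml | grep hostPath -A 2`.
yaml='hostPath:
  path: /data/local-path-provisioner/pvc-4832f8c4-0827-4992-9e39-185f9c0f8e20_etcd_data-etcd-1
  type: DirectoryOrCreate'

# Print the value of the first "path:" key.
datadir=$(printf '%s\n' "$yaml" | awk '$1 == "path:" {print $2; exit}')
echo "$datadir"
```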

SSH into node cnode3:

ssh root@cnode3

Delete the data directory of the etcd-1 PV:

[root@cnode3 ~]# rm -rf /data/local-path-provisioner/pvc-4832f8c4-0827-4992-9e39-185f9c0f8e20_etcd_data-etcd-1/data/

Delete the etcd-1 pod:

[root@cnode1 ~]# kubectl -n etcd delete pod etcd-1
pod "etcd-1" deleted

Wait for the etcd-1 pod to be recreated:

[root@cnode1 ~]# kubectl -n etcd get pod
NAME READY STATUS RESTARTS AGE
etcd-0 1/1 Running 0 46m
etcd-1 0/1 CrashLoopBackOff 5 (114s ago) 5m10s
etcd-2 1/1 Running 0 48m

In this state, the cluster cannot be recovered even by restarting it:

[root@cnode1 ~]# kubectl -n etcd rollout restart statefulset etcd
statefulset.apps/etcd restarted
[root@cnode1 ~]# kubectl -n etcd scale statefulset etcd --replicas=0
statefulset.apps/etcd scaled
[root@cnode1 ~]# kubectl -n etcd scale statefulset etcd --replicas=3
statefulset.apps/etcd scaled
[root@cnode1 ~]# kubectl -n etcd get pod
NAME READY STATUS RESTARTS AGE
etcd-0 1/1 Running 0 67s
etcd-1 0/1 CrashLoopBackOff 1 (12s ago) 67s
etcd-2 1/1 Running 0 67s

Analyze the etcd-1 failure

Check the etcd-1 pod logs:

[root@cnode1 ~]# kubectl -n etcd logs -f etcd-1
...
{"level":"info","ts":"2023-07-31T22:47:11.938+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_ADVERTISE_PEER_URLS","variable-value":"http://etcd-1.etcd-headless.etcd:2380"}
{"level":"info","ts":"2023-07-31T22:47:11.938+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER","variable-value":"etcd-0=http://etcd-0.etcd-headless.etcd:2380,etcd-1=http://etcd-1.etcd-headless.etcd:2380,etcd-2=http://etcd-2.etcd-headless.etcd:2380"}
{"level":"info","ts":"2023-07-31T22:47:11.938+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER_STATE","variable-value":"new"}
...
{"level":"info","ts":"2023-07-31T22:47:11.939+0800","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["etcd"]}
{"level":"warn","ts":"2023-07-31T22:47:11.939+0800","caller":"etcdmain/etcd.go:446","msg":"found invalid file under data directory","filename":"member_id","data-dir":"/bitnami/etcd/data"}
{"level":"info","ts":"2023-07-31T22:47:11.939+0800","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/bitnami/etcd/data","dir-type":"member"}
...
{"level":"info","ts":"2023-07-31T22:47:11.957+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"af3fef14dc164ad9 [term: 1] received a MsgHeartbeat message with higher term from 14105f2e4f2559a9 [term: 24]"}
{"level":"info","ts":"2023-07-31T22:47:11.957+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"af3fef14dc164ad9 became follower at term 24"}
{"level":"info","ts":"2023-07-31T22:47:11.957+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: af3fef14dc164ad9 elected leader 14105f2e4f2559a9 at term 24"}
{"level":"warn","ts":"2023-07-31T22:47:11.958+0800","caller":"etcdserver/server.go:1150","msg":"server error","error":"the member has been permanently removed from the cluster"}
{"level":"warn","ts":"2023-07-31T22:47:11.958+0800","caller":"etcdserver/server.go:1151","msg":"data-dir used by this member must be removed"}
{"level":"warn","ts":"2023-07-31T22:47:11.959+0800","caller":"etcdserver/server.go:2053","msg":"stopped publish because server is stopped","local-member-id":"af3fef14dc164ad9","local-member-attributes":"{Name:etcd-1 ClientURLs:[http://etcd-1.etcd-headless.etcd:2379 http://etcd.etcd:2379]}","publish-timeout":"25s","error":"etcdserver: server stopped"}
{"level":"info","ts":"2023-07-31T22:47:11.962+0800","caller":"rafthttp/peer.go:330","msg":"stopping remote peer","remote-peer-id":"432042ce92be9a79"}

The logs explain the crash loop: because its data directory was wiped, etcd-1 started up as a brand-new server with a freshly generated member ID (af3fef14dc164ad9). The cluster, however, still registers etcd-1 under its old ID (e3ff14c8562cf8c2), so the new server concludes it "has been permanently removed from the cluster" and stops itself.

Check the etcd cluster members (note that this command does not reveal endpoint health):

[root@cnode1 ~]# kubectl -n etcd exec -it etcd-0 -- bash
I have no name!@etcd-0:/opt/bitnami/etcd$ bin/etcdctl member list
14105f2e4f2559a9, started, etcd-0, http://etcd-0.etcd-headless.etcd:2380, http://etcd-0.etcd-headless.etcd:2379,http://etcd.etcd:2379, false
432042ce92be9a79, started, etcd-2, http://etcd-2.etcd-headless.etcd:2380, http://etcd-2.etcd-headless.etcd:2379,http://etcd.etcd:2379, false
e3ff14c8562cf8c2, started, etcd-1, http://etcd-1.etcd-headless.etcd:2380, http://etcd-1.etcd-headless.etcd:2379,http://etcd.etcd:2379, false

Check the endpoint health of the etcd cluster:

I have no name!@etcd-0:/opt/bitnami/etcd$ bin/etcdctl endpoint health --endpoints=etcd-0.etcd-headless.etcd:2380,etcd-1.etcd-headless.etcd:2380,etcd-2.etcd-headless.etcd:2380 -w table
{"level":"warn","ts":"2023-07-31T22:58:52.131+0800","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x400044aa80/etcd-1.etcd-headless.etcd:2380","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.188.20:2380: connect: connection refused\""}
+--------------------------------+--------+-------------+---------------------------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+--------------------------------+--------+-------------+---------------------------+
| etcd-0.etcd-headless.etcd:2380 | true | 2.97674ms | |
| etcd-2.etcd-headless.etcd:2380 | true | 8.06479ms | |
| etcd-1.etcd-headless.etcd:2380 | false | 5.00261904s | context deadline exceeded |
+--------------------------------+--------+-------------+---------------------------+
Error: unhealthy cluster
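
When scripting this health check, the table output can be filtered to pick out the unhealthy members (a plain-shell sketch; the sample rows are copied from the transcript above):

```shell
# Sample rows from `etcdctl endpoint health -w table` output.
table='| etcd-0.etcd-headless.etcd:2380 | true  | 2.97674ms   |                           |
| etcd-2.etcd-headless.etcd:2380 | true  | 8.06479ms   |                           |
| etcd-1.etcd-headless.etcd:2380 | false | 5.00261904s | context deadline exceeded |'

# Print the ENDPOINT column of every row whose HEALTH column is "false".
unhealthy=$(printf '%s\n' "$table" | awk -F'|' '$3 ~ /false/ {gsub(/ /, "", $2); print $2}')
echo "$unhealthy"
```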

Recovery procedure

(1) Remove the failed member etcd-1 from the cluster

[root@cnode1 ~]# kubectl -n etcd exec -it etcd-0 -- bash
I have no name!@etcd-0:/opt/bitnami/etcd$ bin/etcdctl endpoint health --endpoints=etcd-0.etcd-headless.etcd:2380,etcd-1.etcd-headless.etcd:2380,etcd-2.etcd-headless.etcd:2380
{"level":"warn","ts":"2023-07-31T23:42:37.481+0800","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x4000230000/etcd-1.etcd-headless.etcd:2380","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.188.20:2380: connect: connection refused\""}
etcd-0.etcd-headless.etcd:2380 is healthy: successfully committed proposal: took = 4.34159ms
etcd-2.etcd-headless.etcd:2380 is healthy: successfully committed proposal: took = 5.19394ms
etcd-1.etcd-headless.etcd:2380 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
I have no name!@etcd-0:/opt/bitnami/etcd$ bin/etcdctl member list -w table
+------------------+---------+--------+---------------------------------------+-------------------------------------------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+--------+---------------------------------------+-------------------------------------------------------------+------------+
| 14105f2e4f2559a9 | started | etcd-0 | http://etcd-0.etcd-headless.etcd:2380 | http://etcd-0.etcd-headless.etcd:2379,http://etcd.etcd:2379 | false |
| 432042ce92be9a79 | started | etcd-2 | http://etcd-2.etcd-headless.etcd:2380 | http://etcd-2.etcd-headless.etcd:2379,http://etcd.etcd:2379 | false |
| e3ff14c8562cf8c2 | started | etcd-1 | http://etcd-1.etcd-headless.etcd:2380 | http://etcd-1.etcd-headless.etcd:2379,http://etcd.etcd:2379 | false |
+------------------+---------+--------+---------------------------------------+-------------------------------------------------------------+------------+
I have no name!@etcd-0:/opt/bitnami/etcd$ bin/etcdctl member remove e3ff14c8562cf8c2
Member e3ff14c8562cf8c2 removed from cluster d96b6d5811544d30
I have no name!@etcd-0:/opt/bitnami/etcd$ bin/etcdctl member list -w table
+------------------+---------+--------+---------------------------------------+-------------------------------------------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+--------+---------------------------------------+-------------------------------------------------------------+------------+
| 14105f2e4f2559a9 | started | etcd-0 | http://etcd-0.etcd-headless.etcd:2380 | http://etcd-0.etcd-headless.etcd:2379,http://etcd.etcd:2379 | false |
| 432042ce92be9a79 | started | etcd-2 | http://etcd-2.etcd-headless.etcd:2380 | http://etcd-2.etcd-headless.etcd:2379,http://etcd.etcd:2379 | false |
+------------------+---------+--------+---------------------------------------+-------------------------------------------------------------+------------+
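
The member ID passed to `member remove` can also be picked out by name instead of by eye (a sketch; the sample lines mirror the `member list` output above, whose comma-separated fields are ID, status, name, peer URLs, client URLs, learner flag):

```shell
# Sample `etcdctl member list` output.
members='14105f2e4f2559a9, started, etcd-0, http://etcd-0.etcd-headless.etcd:2380, http://etcd-0.etcd-headless.etcd:2379,http://etcd.etcd:2379, false
432042ce92be9a79, started, etcd-2, http://etcd-2.etcd-headless.etcd:2380, http://etcd-2.etcd-headless.etcd:2379,http://etcd.etcd:2379, false
e3ff14c8562cf8c2, started, etcd-1, http://etcd-1.etcd-headless.etcd:2380, http://etcd-1.etcd-headless.etcd:2379,http://etcd.etcd:2379, false'

# Print the ID (field 1) of the member whose name (field 3) is etcd-1.
id=$(printf '%s\n' "$members" | awk -F', ' '$3 == "etcd-1" {print $1}')
echo "$id"
```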

(2) Scale the etcd StatefulSet to 0

[root@cnode1 ~]# kubectl -n etcd scale statefulset etcd --replicas=0
statefulset.apps/etcd scaled
[root@cnode1 ~]# kubectl -n etcd get statefulsets.apps etcd
NAME READY AGE
etcd 0/0 30d

(3) Delete the data directory of the PV bound to the failed member etcd-1

  • Get the PV of etcd-1

    [root@cnode1 ~]# kubectl -n etcd get pvc
    NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
    data-etcd-0 Bound pvc-2a2f2a25-69b3-41ab-a21d-a5aa3d05277d 10Gi RWO local-storage 30d
    data-etcd-1 Bound pvc-4832f8c4-0827-4992-9e39-185f9c0f8e20 10Gi RWO local-storage 4h10m
    data-etcd-2 Bound pvc-6da8bcd7-cd2a-43fc-9849-7e1c2870f193 10Gi RWO local-storage 30d
  • Find the node that hosts the etcd-1 PV

    [root@cnode1 ~]# kubectl get pv pvc-4832f8c4-0827-4992-9e39-185f9c0f8e20 -o yaml | grep nodeAffinity -A 7
    nodeAffinity:
      required:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - cnode3
  • Get the data directory of the etcd-1 PV

    [root@cnode1 ~]# kubectl get pv pvc-4832f8c4-0827-4992-9e39-185f9c0f8e20 -o yaml | grep hostPath -A 2
    hostPath:
      path: /data/local-path-provisioner/pvc-4832f8c4-0827-4992-9e39-185f9c0f8e20_etcd_data-etcd-1
      type: DirectoryOrCreate
  • SSH into node cnode3

    ssh root@cnode3
  • Clear the data directory of the etcd-1 PV

    [root@cnode3 ~]# rm -rf /data/local-path-provisioner/pvc-4832f8c4-0827-4992-9e39-185f9c0f8e20_etcd_data-etcd-1/data/
    [root@cnode3 ~]# ls /data/local-path-provisioner/pvc-4832f8c4-0827-4992-9e39-185f9c0f8e20_etcd_data-etcd-1/
    [root@cnode3 ~]#

(4) Change the value of ETCD_INITIAL_CLUSTER_STATE in the StatefulSet to "existing"

[root@cnode1 ~]# kubectl -n etcd edit statefulsets.apps etcd
statefulset.apps/etcd edited

The change:

- name: ETCD_INITIAL_CLUSTER_STATE
value: existing

(5) Scale the etcd StatefulSet back to its original replica count

[root@cnode1 ~]# kubectl -n etcd scale statefulset etcd --replicas=3
statefulset.apps/etcd scaled

(6) Check the etcd-1 logs

[root@cnode1 ~]# kubectl -n etcd logs -f etcd-1
...
{"level":"info","ts":"2023-07-31T23:48:36.065+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER","variable-value":"etcd-0=http://etcd-0.etcd-headless.etcd:2380,etcd-1=http://etcd-1.etcd-headless.etcd:2380,etcd-2=http://etcd-2.etcd-headless.etcd:2380"}
{"level":"info","ts":"2023-07-31T23:48:36.065+0800","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER_STATE","variable-value":"existing"}
...
{"level":"info","ts":"2023-07-31T23:48:36.430+0800","caller":"etcdserver/server.go:2042","msg":"published local member to cluster through raft","local-member-id":"1ea060417ff0fbae","local-member-attributes":"{Name:etcd-1 ClientURLs:[http://etcd-1.etcd-headless.etcd:2379 http://etcd.etcd:2379]}","request-path":"/0/members/1ea060417ff0fbae/attributes","cluster-id":"d96b6d5811544d30","publish-timeout":"25s"}
{"level":"info","ts":"2023-07-31T23:48:36.430+0800","caller":"embed/serve.go:98","msg":"ready to serve client requests"}
{"level":"info","ts":"2023-07-31T23:48:36.430+0800","caller":"etcdmain/main.go:44","msg":"notifying init daemon"}
{"level":"info","ts":"2023-07-31T23:48:36.430+0800","caller":"etcdmain/main.go:50","msg":"successfully notified init daemon"}
{"level":"info","ts":"2023-07-31T23:48:36.431+0800","caller":"embed/serve.go:140","msg":"serving client traffic insecurely; this is strongly discouraged!","address":"[::]:2379"}

(7) Check the etcd StatefulSet status

[root@cnode1 ~]# kubectl -n etcd get statefulsets.apps etcd
NAME READY AGE
etcd 3/3 30d
[root@cnode1 ~]# kubectl -n etcd get pod
NAME READY STATUS RESTARTS AGE
etcd-0 1/1 Running 0 6m39s
etcd-1 1/1 Running 0 6m39s
etcd-2 1/1 Running 0 6m39s

(8) Check the endpoint health of the etcd cluster

[root@cnode1 ~]# kubectl -n etcd exec -it etcd-0 -- bash
I have no name!@etcd-0:/opt/bitnami/etcd$ bin/etcdctl endpoint health --endpoints=etcd-0.etcd-headless.etcd:2380,etcd-1.etcd-headless.etcd:2380,etcd-2.etcd-headless.etcd:2380 -w table
+--------------------------------+--------+-----------+-------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+--------------------------------+--------+-----------+-------+
| etcd-0.etcd-headless.etcd:2380 | true | 5.09933ms | |
| etcd-1.etcd-headless.etcd:2380 | true | 5.97046ms | |
| etcd-2.etcd-headless.etcd:2380 | true | 5.37909ms | |
+--------------------------------+--------+-----------+-------+

(9) Check the etcd cluster members

[root@cnode1 ~]# kubectl -n etcd exec -it etcd-0 -- bash
I have no name!@etcd-0:/opt/bitnami/etcd$ bin/etcdctl member list -w table
+------------------+---------+--------+---------------------------------------+-------------------------------------------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+--------+---------------------------------------+-------------------------------------------------------------+------------+
| 14105f2e4f2559a9 | started | etcd-0 | http://etcd-0.etcd-headless.etcd:2380 | http://etcd-0.etcd-headless.etcd:2379,http://etcd.etcd:2379 | false |
| 1ea060417ff0fbae | started | etcd-1 | http://etcd-1.etcd-headless.etcd:2380 | http://etcd-1.etcd-headless.etcd:2379,http://etcd.etcd:2379 | false |
| 432042ce92be9a79 | started | etcd-2 | http://etcd-2.etcd-headless.etcd:2380 | http://etcd-2.etcd-headless.etcd:2379,http://etcd.etcd:2379 | false |
+------------------+---------+--------+---------------------------------------+-------------------------------------------------------------+------------+

Note that etcd-1 has rejoined the cluster under a new member ID (1ea060417ff0fbae); its old ID e3ff14c8562cf8c2 was removed in step (1).

(10) Restore the value of ETCD_INITIAL_CLUSTER_STATE in the StatefulSet to "new" (optional)

This step is optional because etcd itself reads the initial-cluster settings only when bootstrapping a member with an empty data directory; once every member has data, the value has no effect on restarts.

[root@cnode1 ~]# kubectl -n etcd edit statefulsets.apps etcd
statefulset.apps/etcd edited

The change:

- name: ETCD_INITIAL_CLUSTER_STATE
value: new

Wait for the StatefulSet to finish restarting:

[root@cnode1 ~]# kubectl -n etcd get statefulsets.apps etcd
NAME READY AGE
etcd 3/3 30d
[root@cnode1 ~]# kubectl -n etcd get pod
NAME READY STATUS RESTARTS AGE
etcd-0 1/1 Running 0 2m2s
etcd-1 1/1 Running 0 3m5s
etcd-2 1/1 Running 0 4m8s

(11) Verify that the data is intact

[root@cnode1 ~]# kubectl -n etcd exec -it etcd-0 -- bash
I have no name!@etcd-0:/opt/bitnami/etcd$ bin/etcdctl get /hello
/hello
world