Deploying an etcd Cluster on Rocky Linux

This article walks through deploying an etcd cluster on Rocky Linux. It starts with environment preparation, covering the operating system, IP address, and version of the three nodes. It then installs the required software, including etcd and the certificate tool cfssl, and creates the certificates etcd needs. Next it configures etcd: the configuration and startup unit on the first node, followed by the remaining nodes. The etcd service is then started and verified, including checking cluster health, listing the cluster endpoints, verifying the key-value store, and watching for data changes. The article also covers data backup and restore, node failure and recovery, repairing a failed etcd node, and authentication. Finally, it shows how to install etcdkeeper.

1. Environment Preparation

Hostname  Operating System  IP Address  Version
node1 Rocky Linux release 8.6 (Green Obsidian) 10.128.170.131 etcd-v3.5.6
node2 Rocky Linux release 8.6 (Green Obsidian) 10.128.170.132 etcd-v3.5.6
node3 Rocky Linux release 8.6 (Green Obsidian) 10.128.170.133 etcd-v3.5.6
  • /etc/hosts configuration

    10.128.170.131 node1 node1.local
    10.128.170.132 node2 node2.local
    10.128.170.133 node3 node3.local
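
If these entries are not present yet, a minimal sketch for adding them on every node (assuming root access and that the lines are not already in the file):

cat >> /etc/hosts << EOF
10.128.170.131 node1 node1.local
10.128.170.132 node2 node2.local
10.128.170.133 node3 node3.local
EOF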

2. Install Required Software

The following installation steps must be performed on all nodes.

The following components are used during etcd cluster deployment:

  • cfssl
  • cfssljson
  • cfssl-certinfo
  • etcd
  • etcdctl

2.1. Create Data Directories

mkdir -p /data/etcd/{default,cfg,ssl}
  • default — etcd data
  • cfg — etcd configuration files
  • ssl — certificates and private keys

2.2. Download and Install etcd

Download the following file from https://github.com/etcd-io/etcd/releases/tag/v3.5.6:

  • etcd-v3.5.6-linux-amd64.tar.gz

Extract and install:

tar -zxvf etcd-v3.5.6-linux-amd64.tar.gz -C /opt

Create a bin directory and move the executables into it:

cd /opt/etcd-v3.5.6-linux-amd64
mkdir bin
mv etcd etcdctl etcdutl bin

2.3. Download and Install the cfssl Certificate Tools

Download the following files from https://github.com/cloudflare/cfssl/releases/tag/v1.6.3:

  • cfssl_1.6.3_linux_amd64
  • cfssljson_1.6.3_linux_amd64
  • cfssl-certinfo_1.6.3_linux_amd64

After downloading the cfssl tools, copy them into /opt/etcd-v3.5.6-linux-amd64/bin and make them executable:

chmod +x cfssl_1.6.3_linux_amd64
cp cfssl_1.6.3_linux_amd64 /opt/etcd-v3.5.6-linux-amd64/bin/cfssl

chmod +x cfssljson_1.6.3_linux_amd64
cp cfssljson_1.6.3_linux_amd64 /opt/etcd-v3.5.6-linux-amd64/bin/cfssljson

chmod +x cfssl-certinfo_1.6.3_linux_amd64
cp cfssl-certinfo_1.6.3_linux_amd64 /opt/etcd-v3.5.6-linux-amd64/bin/cfssl-certinfo

For convenience, the tools are renamed to:

  • cfssl
  • cfssljson
  • cfssl-certinfo

2.4. Configure Environment Variables

vim ~/.bashrc

Add the following:

# Etcd Environment
export ETCD_HOME=/opt/etcd-v3.5.6-linux-amd64

export PATH=$PATH:$ETCD_HOME/bin

Apply the changes immediately:

source ~/.bashrc
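
To confirm the tools are on the PATH, check their versions (the exact output varies by build):

etcd --version
etcdctl version
cfssl version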

3. Create TLS Certificates

The following TLS certificate steps only need to be performed on a single node.

3.1. Create the CA Signing Request File

Create the CA signing request file ca-csr.json in /data/etcd/ssl with the following content:

cat > /data/etcd/ssl/ca-csr.json << EOF
{
  "CN": "etcd",
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [
    {
      "C": "CN",
      "L": "Shenzhen",
      "ST": "GuangDong"
    }
  ]
}
EOF

Field reference:

C: Country
L: Locality (city)
O: Organization Name (company)
OU: Organizational Unit Name (department)
ST: State or province

3.2. Generate the CA Certificate, Private Key, and Certificate Signing Request

Run the following in the /data/etcd/ssl directory:

cfssl gencert -initca ca-csr.json | cfssljson -bare ca

This generates three files:

  • ca-key.pem (CA private key)
  • ca.pem (CA certificate)
  • ca.csr (CA certificate signing request)

3.3. CA Configuration

Create the CA configuration file ca-config.json in /data/etcd/ssl with the following content:

cat > /data/etcd/ssl/ca-config.json << EOF
{
  "signing": {
    "default": {
      "expiry": "87600h"
    },
    "profiles": {
      "etcd": {
        "expiry": "87600h",
        "usages": [
          "signing",
          "key encipherment",
          "server auth",
          "client auth"
        ]
      }
    }
  }
}
EOF

Field reference:

profiles: multiple profiles can be defined; the profile used here is etcd
expiry: the certificate validity period, 10 years (87600h)
signing: indicates the certificate can be used to sign other certificates; the generated ca.pem has CA=TRUE
server auth: a client can use this CA to verify the certificate presented by a server
client auth: a server can use this CA to verify the certificate presented by a client

3.4. Create the etcd Certificate Request File

Create the etcd certificate request file server-csr.json in /data/etcd/ssl with the following content:

cat > /data/etcd/ssl/server-csr.json << EOF
{
  "CN": "server",
  "hosts": [
    "127.0.0.1",
    "10.128.170.131",
    "10.128.170.132",
    "10.128.170.133",
    "node1",
    "node2",
    "node3",
    "node1.local",
    "node2.local",
    "node3.local"
  ],
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [
    {
      "C": "CN",
      "L": "Shenzhen",
      "ST": "GuangDong"
    }
  ]
}
EOF

The hosts field lists the IP addresses and hostnames authorized to use this certificate; all three etcd node IPs must be included.

3.5. Generate the Server Certificate

Run the following in the /data/etcd/ssl directory:

cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=etcd server-csr.json | cfssljson -bare server

This generates three files:

  • server-key.pem (server private key)
  • server.pem (server certificate)
  • server.csr (server certificate signing request)
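
To double-check that all node IPs and hostnames made it into the certificate's SANs, the certificate can be inspected with cfssl-certinfo:

cfssl-certinfo -cert server.pem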

3.6. Copy the Certificates to the Other Nodes

Copy the certificate files in /data/etcd/ssl to the other nodes as needed:

cd /data/etcd/ssl
scp ca.pem server.pem server-key.pem root@node2:/data/etcd/ssl
scp ca.pem server.pem server-key.pem root@node3:/data/etcd/ssl
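
Optionally verify that the files arrived (assuming SSH access as root):

ssh root@node2 'ls -l /data/etcd/ssl'
ssh root@node3 'ls -l /data/etcd/ssl'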

4. etcd Configuration

4.1. etcd Configuration on the First Node

Create the etcd configuration file etcd.conf for node1 in /data/etcd/cfg, with the following content:

cat > /data/etcd/cfg/etcd.conf << "EOF"
#[Member]
ETCD_NAME="etcd01"
ETCD_DATA_DIR="/data/etcd/default"
ETCD_LISTEN_PEER_URLS="https://10.128.170.131:2380"
ETCD_LISTEN_CLIENT_URLS="https://10.128.170.131:2379,https://127.0.0.1:2379"

#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.128.170.131:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://10.128.170.131:2379"
ETCD_INITIAL_CLUSTER="etcd01=https://10.128.170.131:2380,etcd02=https://10.128.170.132:2380,etcd03=https://10.128.170.133:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"

#[Security]
ETCD_CERT_FILE="/data/etcd/ssl/server.pem"
ETCD_KEY_FILE="/data/etcd/ssl/server-key.pem"
ETCD_TRUSTED_CA_FILE="/data/etcd/ssl/ca.pem"
ETCD_CLIENT_CERT_AUTH="false"
ETCD_PEER_CERT_FILE="/data/etcd/ssl/server.pem"
ETCD_PEER_KEY_FILE="/data/etcd/ssl/server-key.pem"
ETCD_PEER_TRUSTED_CA_FILE="/data/etcd/ssl/ca.pem"
ETCD_PEER_CLIENT_CERT_AUTH="true"
EOF

4.2. etcd Startup Configuration

Create the systemd unit file etcd.service in the /usr/lib/systemd/system directory, with the following content:

cat > /usr/lib/systemd/system/etcd.service << "EOF"
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=/data/etcd/cfg/etcd.conf
ExecStart=/opt/etcd-v3.5.6-linux-amd64/bin/etcd
#ExecStart=/opt/etcd-v3.5.6-linux-amd64/bin/etcd \
#--name=${ETCD_NAME} \
#--data-dir=${ETCD_DATA_DIR} \
#--listen-peer-urls=${ETCD_LISTEN_PEER_URLS} \
#--listen-client-urls=${ETCD_LISTEN_CLIENT_URLS} \
#--advertise-client-urls=${ETCD_ADVERTISE_CLIENT_URLS} \
#--initial-advertise-peer-urls=${ETCD_INITIAL_ADVERTISE_PEER_URLS} \
#--initial-cluster=${ETCD_INITIAL_CLUSTER} \
#--initial-cluster-token=${ETCD_INITIAL_CLUSTER_TOKEN} \
#--initial-cluster-state=${ETCD_INITIAL_CLUSTER_STATE} \
#--cert-file=${ETCD_CERT_FILE} \
#--key-file=${ETCD_KEY_FILE} \
#--peer-cert-file=${ETCD_PEER_CERT_FILE} \
#--peer-key-file=${ETCD_PEER_KEY_FILE} \
#--trusted-ca-file=${ETCD_TRUSTED_CA_FILE} \
#--client-cert-auth=${ETCD_CLIENT_CERT_AUTH} \
#--peer-client-cert-auth=${ETCD_PEER_CLIENT_CERT_AUTH} \
#--peer-trusted-ca-file=${ETCD_PEER_TRUSTED_CA_FILE}
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

Field notes:

WorkingDirectory, --data-dir: the working directory and data directory, set to ${ETCD_DATA_DIR}; this directory must exist before the service starts;
--wal-dir: the WAL directory; for better performance it is usually placed on an SSD or on a disk separate from --data-dir;
--name: the node name; when --initial-cluster-state is new, the value of --name must appear in the --initial-cluster list;
--cert-file, --key-file: the certificate and private key etcd uses for server-to-client communication;
--trusted-ca-file: the CA certificate that signed the client certificates, used to verify client certificates;
--peer-cert-file, --peer-key-file: the certificate and private key used for peer (etcd-to-etcd) communication;
--peer-trusted-ca-file: the CA certificate that signed the peer certificates, used to verify peer certificates;

Startup error: conflicting environment variable "ETCD_NAME" is shadowed by corresponding command-line flag (either unset environment variable or disable flag)

Cause: since etcd 3.4, etcd automatically reads its ETCD_* environment variables, so a parameter that is present in EnvironmentFile must not also be passed as a command-line flag in ExecStart. Use one or the other; configuring both triggers the error above.

Fix: remove the flags in ExecStart that duplicate entries in the configuration file.

5. Configure the Remaining Nodes

Configure the remaining nodes with the same steps, changing the node name and server IP addresses in etcd.conf accordingly (a sketch follows).
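
As a sketch, assuming node1's etcd.conf was copied to node2 unchanged, the per-node values can be adjusted like this (node3 is analogous, using etcd03 and 10.128.170.133; the ETCD_INITIAL_CLUSTER line is identical on all nodes and is left untouched):

sed -i \
  -e 's/^ETCD_NAME="etcd01"/ETCD_NAME="etcd02"/' \
  -e '/^ETCD_LISTEN_PEER_URLS=/s/10.128.170.131/10.128.170.132/' \
  -e '/^ETCD_LISTEN_CLIENT_URLS=/s/10.128.170.131/10.128.170.132/' \
  -e '/^ETCD_INITIAL_ADVERTISE_PEER_URLS=/s/10.128.170.131/10.128.170.132/' \
  -e '/^ETCD_ADVERTISE_CLIENT_URLS=/s/10.128.170.131/10.128.170.132/' \
  /data/etcd/cfg/etcd.conf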

6. Start the etcd Service

systemctl daemon-reload
systemctl start etcd.service
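
Run this on all three nodes; with a brand-new cluster a node may not report started until enough peers are up to form a quorum. To enable the service on boot and check that it came up cleanly:

systemctl enable etcd.service
systemctl status etcd.service
journalctl -u etcd.service -n 50 --no-pager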

7. Verify the etcd Service

Before verifying, set the etcd API version. There are two API versions, v2 and v3, which store data differently and use different command sets; we standardize on v3. (Since etcd 3.4, etcdctl already defaults to the v3 API, but exporting it explicitly does no harm.) Add it to the environment:

echo 'export ETCDCTL_API=3' >> ~/.bashrc
source ~/.bashrc
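
The etcdctl commands below are long because the TLS and endpoint flags are repeated every time. etcdctl also reads these settings from environment variables, so exporting the following (in ~/.bashrc, for example) shortens every later command to just etcdctl plus a subcommand:

export ETCDCTL_CACERT=/data/etcd/ssl/ca.pem
export ETCDCTL_CERT=/data/etcd/ssl/server.pem
export ETCDCTL_KEY=/data/etcd/ssl/server-key.pem
export ETCDCTL_ENDPOINTS=https://node1.local:2379,https://node2.local:2379,https://node3.local:2379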

7.1. Check Cluster Health

/opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" endpoint health

A healthy cluster reports:

https://node3.local:2379 is healthy: successfully committed proposal: took = 23.967506ms
https://node1.local:2379 is healthy: successfully committed proposal: took = 36.438089ms
https://node2.local:2379 is healthy: successfully committed proposal: took = 36.013216ms

7.2. List the Cluster Endpoints

In the output, true marks the leader, which is responsible for handling client requests.

/opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" endpoint status

With the service running normally, the output looks like:

https://node1.local:2379, 2cc5cd20ac23bb53, 3.5.6, 20 kB, true, false, 2, 11, 11,
https://node2.local:2379, 60084d96d6d6e98a, 3.5.6, 20 kB, false, false, 2, 11, 11,
https://node3.local:2379, 57107ffa28aa353e, 3.5.6, 20 kB, false, false, 2, 11, 11,

7.3. Verify the Cluster Key-Value Store

On any node, store a value with the etcdctl client, then read it back from the other nodes.

Store data from the etcd01 client:

[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" put /hello "world"
OK

Read the data from etcd02 or etcd03:

[root@node3 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" get /hello
/hello
world

7.4. Watch for Data Changes

The watch command blocks and monitors a key; whenever the key is updated it prints the latest value, until the user exits with CTRL+C.

Keep a watch running on 10.128.170.133:

[root@node3 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" watch /name

Update the value of /name twice on 10.128.170.132:

[root@node2 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" put /name "first"
OK
[root@node2 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" put /name "second"
OK

The watch session on 10.128.170.133 then prints the change records:

[root@node3 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" watch /name
PUT
/name
first
PUT
/name
second

8. Data Backup and Restore

8.1. Backup

[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379" snapshot save mysnapshot.db
{"level":"info","ts":"2023-01-06T01:01:59.039+0800","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"mysnapshot.db.part"}
{"level":"info","ts":"2023-01-06T01:01:59.053+0800","logger":"client","caller":"v3@v3.5.6/maintenance.go:212","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2023-01-06T01:01:59.053+0800","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://node1.local:2379"}
{"level":"info","ts":"2023-01-06T01:01:59.082+0800","logger":"client","caller":"v3@v3.5.6/maintenance.go:220","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2023-01-06T01:01:59.146+0800","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://node1.local:2379","size":"20 kB","took":"now"}
{"level":"info","ts":"2023-01-06T01:01:59.146+0800","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"mysnapshot.db"}
Snapshot saved at mysnapshot.db
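
As a quick sanity check, the snapshot can be inspected with etcdutl from the same release (moved into bin in section 2.2):

etcdutl snapshot status mysnapshot.db --write-out=table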

8.2. Restore

https://etcd.io/docs/v3.5/op-guide/recovery/
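
The official guide restores each member's data directory from the snapshot with etcdutl and then starts etcd on the restored directory. A minimal sketch for node1 (the other members run the same command with their own --name and peer URL; stopping the service first and moving the old data directory aside are assumptions about this layout):

systemctl stop etcd.service
mv /data/etcd/default /data/etcd/default.bak
etcdutl snapshot restore mysnapshot.db \
  --name etcd01 \
  --initial-cluster etcd01=https://10.128.170.131:2380,etcd02=https://10.128.170.132:2380,etcd03=https://10.128.170.133:2380 \
  --initial-cluster-token etcd-cluster \
  --initial-advertise-peer-urls https://10.128.170.131:2380 \
  --data-dir /data/etcd/default
systemctl start etcd.service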

9. Node Failure and Recovery

9.1. Check the Status of Each Node

[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" endpoint status --write-out=table
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://node1.local:2379 | 2cc5cd20ac23bb53 | 3.5.6 | 20 kB | true | false | 2 | 14 | 14 | |
| https://node2.local:2379 | 60084d96d6d6e98a | 3.5.6 | 20 kB | false | false | 2 | 14 | 14 | |
| https://node3.local:2379 | 57107ffa28aa353e | 3.5.6 | 20 kB | false | false | 2 | 14 | 14 | |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

9.2. Simulate Failure of the Leader Node node1

Stop the leader node with systemctl stop etcd, then check the cluster again: a new leader has been elected and the cluster remains usable:

[root@node1 ~]# systemctl stop etcd.service
[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" endpoint status --write-out=table
{"level":"warn","ts":"2023-01-06T12:39:07.738+0800","logger":"etcd-client","caller":"v3@v3.5.6/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000b8c40/node1.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.128.170.131:2379: connect: connection refused\""}
Failed to get the status of endpoint https://node1.local:2379 (context deadline exceeded)
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://node2.local:2379 | 60084d96d6d6e98a | 3.5.6 | 20 kB | true | false | 3 | 15 | 15 | |
| https://node3.local:2379 | 57107ffa28aa353e | 3.5.6 | 20 kB | false | false | 3 | 15 | 15 | |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

9.3. Simulate Failure of Follower Node node2

Also stop node2 with systemctl stop etcd and check the cluster again: it can no longer serve requests, which shows that a 3-node etcd cluster tolerates only one failed node. Per the official documentation, the fault tolerance is (N-1)/2: with N = 3 nodes, at most 1 node can fail without affecting the cluster.

[root@node2 ~]# systemctl stop etcd.service
[root@node2 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" endpoint status --write-out=table
{"level":"warn","ts":"2023-01-06T12:39:58.307+0800","logger":"etcd-client","caller":"v3@v3.5.6/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00035aa80/node1.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.128.170.131:2379: connect: connection refused\""}
Failed to get the status of endpoint https://node1.local:2379 (context deadline exceeded)
{"level":"warn","ts":"2023-01-06T12:40:03.308+0800","logger":"etcd-client","caller":"v3@v3.5.6/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00035aa80/node1.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.128.170.132:2379: connect: connection refused\""}
Failed to get the status of endpoint https://node2.local:2379 (context deadline exceeded)
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
| https://node3.local:2379 | 57107ffa28aa353e | 3.5.6 | 20 kB | false | false | 4 | 16 | 16 | etcdserver: no leader |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+

9.4. Recover Node node2

Start node2 with systemctl start etcd.service and check the cluster: a leader has been elected again and the cluster is back to normal:

[root@node2 ~]# systemctl start etcd.service
[root@node2 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" endpoint status --write-out=table
{"level":"warn","ts":"2023-01-06T12:41:13.714+0800","logger":"etcd-client","caller":"v3@v3.5.6/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000332c40/node1.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.128.170.131:2379: connect: connection refused\""}
Failed to get the status of endpoint https://node1.local:2379 (context deadline exceeded)
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://node2.local:2379 | 60084d96d6d6e98a | 3.5.6 | 20 kB | false | false | 5 | 18 | 18 | |
| https://node3.local:2379 | 57107ffa28aa353e | 3.5.6 | 20 kB | true | false | 5 | 18 | 18 | |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

10. Repairing a Failed etcd Node

10.1. Remove the Failed Node from the Cluster

(run on a healthy node)

List the members to get the ID of the failed node, here 2cc5cd20ac23bb53:

[root@node3 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" member list
2cc5cd20ac23bb53, started, etcd01, https://10.128.170.131:2380, https://10.128.170.131:2379, false
57107ffa28aa353e, started, etcd03, https://10.128.170.133:2380, https://10.128.170.133:2379, false
60084d96d6d6e98a, started, etcd02, https://10.128.170.132:2380, https://10.128.170.132:2379, false

Remove the failed member:

[root@node3 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" member remove 2cc5cd20ac23bb53
Member 2cc5cd20ac23bb53 removed from cluster 6fe4c34617a995fa

10.2. Repair the Failed Node

(run on the failed node)

Modify the configuration file.

Change ETCD_INITIAL_CLUSTER_STATE="new" to ETCD_INITIAL_CLUSTER_STATE="existing":

[root@node1 ~]# sed -i 's/ETCD_INITIAL_CLUSTER_STATE="new"/ETCD_INITIAL_CLUSTER_STATE="existing"/' /data/etcd/cfg/etcd.conf

Clean up the node's data:

[root@node1 ~]# rm -rf /data/etcd/default/member/

10.3. Re-add the Node

(run on a healthy node)

[root@node3 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" member add etcd01 --peer-urls=https://10.128.170.131:2380
Member cda884b84bce7043 added to cluster 6fe4c34617a995fa

ETCD_NAME="etcd01"
ETCD_INITIAL_CLUSTER="etcd03=https://10.128.170.133:2380,etcd02=https://10.128.170.132:2380,etcd01=https://10.128.170.131:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.128.170.131:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

10.4. Restart the Failed Node

(run on the failed node)

[root@node1 ~]# systemctl start etcd.service
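
From any node, the member list can then be checked again to confirm that the repaired node rejoined under its new member ID:

/opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" member list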

11. Authentication

11.1. Create the root User

First create the root user; when prompted, enter a password (this example uses cluster):

[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" user add root
Password of root:
Type password of root again for confirmation:
User root created

Grant the root user the root role:

[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" user grant-role root root
Role root is granted to user root

11.2. View the root User Details

[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" user get root
User: root
Roles: root

Confirm that the root user has the root role (and therefore all permissions) before moving on.

11.3. Enable Authentication

[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" auth enable
Authentication Enabled

11.4. Create a New User

Create a new user named server; the password here is also set to cluster (any value can be used):

[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=root:cluster user add server
Password of server:
Type password of server again for confirmation:
User server created

List the users and view the server user's details:

[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=root:cluster user list
root
server
[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=root:cluster user get server
User: server
Roles:

11.5. Create a Role

(The built-in root role does not show up in role list. A role also named root can be created, but when a user is later granted the root role it will receive the custom role rather than the built-in one.)

Create the server role:

[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=root:cluster role add server
Role server created

List the roles and view the server role's details (a newly created role has no permissions):

[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=root:cluster role list
server
[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=root:cluster role get server
Role server
KV Read:
KV Write:

11.6. Grant Permissions to the Role

A role has no password; it is simply a named set of access permissions, which can be read, write, or readwrite.

Grant the server role read/write access to the key prefix /hello:

[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=root:cluster role grant-permission server readwrite /hello --prefix=true
Role server updated
[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=root:cluster role get server
Role server
KV Read:
[/hello, /hellp) (prefix /hello)
KV Write:
[/hello, /hellp) (prefix /hello)

Grant the server role read/write access to the /hello/* key prefix:

[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=root:cluster role grant-permission server readwrite /hello/* --prefix=true
Role server updated
[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=root:cluster role get server
Role server
KV Read:
[/hello, /hellp) (prefix /hello)
[/hello/*, /hello/+) (prefix /hello/*)
KV Write:
[/hello, /hellp) (prefix /hello)
[/hello/*, /hello/+) (prefix /hello/*)

11.7. Assign the Role to the User

Grant the server user the server role:

[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=root:cluster user grant-role server server
Role server is granted to user server
[root@node1 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=root:cluster user get server
User: server
Roles: server

11.8. Test the server User's Permissions

[root@node3 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=root:cluster put /hello world
OK
[root@node3 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=root:cluster put /name wylu
OK
[root@node3 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=server:cluster get /hello
/hello
world
[root@node3 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=server:cluster get /wylu
{"level":"warn","ts":"2023-01-06T14:31:02.431+0800","logger":"etcd-client","caller":"v3@v3.5.6/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000558540/node1.local:2379","attempt":0,"error":"rpc error: code = PermissionDenied desc = etcdserver: permission denied"}
Error: etcdserver: permission denied

11.8.1. Without the User/Password Parameter

An error is reported:

[root@node3 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" get /hello
{"level":"warn","ts":"2023-01-06T14:32:37.153+0800","logger":"etcd-client","caller":"v3@v3.5.6/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00034a380/node1.local:2379","attempt":0,"error":"rpc error: code = InvalidArgument desc = etcdserver: user name is empty"}
Error: etcdserver: user name is empty

11.8.2. With the User/Password Parameter

[root@node3 ~]# /opt/etcd-v3.5.6-linux-amd64/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/server.pem --key=/data/etcd/ssl/server-key.pem --endpoints="https://node1.local:2379,https://node2.local:2379,https://node3.local:2379" --user=server:cluster get /hello
/hello
world

11.9. Command Reference

11.9.1. The root User and the root Role

root is etcd's superuser and holds all permissions in etcd; it must be created before authentication is enabled. Note that the root user must also be granted the root role, which allows every operation in etcd.

The root role can be granted to any user. A user with the root role has global read/write access and permission to manage the cluster's authentication configuration; it can also change cluster membership, defragment the store, and take snapshots.

11.9.2. User Commands

user: creates etcd users and sets their passwords; its subcommands are:

1) add — add a user
2) delete — delete a user
3) get — show a user's details
4) list — list all users
5) passwd — change a user's password
6) grant-role — grant a role to a user
7) revoke-role — revoke a role from a user

11.9.3. Role Commands

role: creates etcd roles and sets their permissions; its subcommands are:

1) add — add a role
2) delete — delete a role
3) get — show a role's details
4) list — list all roles
5) grant-permission — grant a role permission on a key
6) revoke-permission — revoke a role's permission on a key

11.9.4. Authentication Toggle

Enable: auth enable

Disable: auth disable

11.9.5. Reference Links

https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/authentication.md

https://juejin.im/post/5b986abff265da0ad947b52f

12. Install etcdkeeper

etcdkeeper provides a web UI for managing the data stored in etcd.

12.1. Install etcdkeeper

Download the following file from https://github.com/evildecay/etcdkeeper/releases/tag/v0.7.6:

  • etcdkeeper-v0.7.6-linux_x86_64.zip

Extract and install:

unzip etcdkeeper-v0.7.6-linux_x86_64.zip -d /opt
chmod +x /opt/etcdkeeper/etcdkeeper

12.2. Create the systemd Service

cat > /usr/lib/systemd/system/etcdkeeper.service << "EOF"
[Unit]
Description=etcdkeeper service
After=network.target

[Service]
Type=simple
ExecStart=/opt/etcdkeeper/etcdkeeper \
-h 0.0.0.0 \
-p 8800 \
-cacert=/data/etcd/ssl/ca.pem \
-cert=/data/etcd/ssl/server.pem \
-key=/data/etcd/ssl/server-key.pem \
-auth \
-usetls
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
PrivateTmp=true

[Install]
WantedBy=multi-user.target
EOF

12.3. Modify index.html

Convert the file to Unix line endings and change the default etcd address shown in the UI from 127.0.0.1 to the node's own address:

[root@node1 Downloads]# dos2unix /opt/etcdkeeper/assets/etcdkeeper/index.html
dos2unix: converting file /opt/etcdkeeper/assets/etcdkeeper/index.html to Unix format ...
[root@node1 Downloads]# sed -i "154s/etcdBase = '127.0.0.1:2379'/etcdBase = '10.128.170.131:2379'/" /opt/etcdkeeper/assets/etcdkeeper/index.html

12.4. Service Control

systemctl daemon-reload
systemctl enable etcdkeeper.service   # enable start on boot
systemctl disable etcdkeeper.service  # disable start on boot
systemctl start etcdkeeper            # start the etcdkeeper service
systemctl stop etcdkeeper             # stop the etcdkeeper service
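
After the service starts, the web UI should be reachable in a browser on port 8800, e.g. http://10.128.170.131:8800/etcdkeeper/ (the /etcdkeeper/ path is the default web root per the upstream README; treat the exact path as an assumption for this build). A quick reachability check:

curl -I http://10.128.170.131:8800/etcdkeeper/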