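Preface
etcd is the "brain" of a Kubernetes cluster: it stores all of the cluster's configuration, state, and metadata. If the etcd data is lost and no backup exists, the cluster cannot be returned to its previous state, and every resource definition (Deployments, Services, ConfigMaps, and so on) is gone.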
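This article applies when:
- You run a single-master Kubernetes cluster (deployed with kubeadm)
- You need a backup before a major change (such as a node upgrade or a version migration)
- You need to set up regular etcd backups
Environment:
- Kubernetes version: v1.32.3
- etcd version: 3.5.x (bundled with kubeadm)
- Operating system: Ubuntu 24.04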
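I. Why You Must Back Up etcd
Real-world risk scenarios
- Failed node upgrade: the control plane is damaged while Kubernetes components are being upgraded
- Accidental deletion: a mistaken kubectl delete removes critical resources
- Hardware failure: the master node's disk is damaged
- Version incompatibility: after an upgrade you find you cannot roll back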
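Recommended backup strategy
| Scenario | Backup frequency | Retention |
| --- | --- | --- |
| Production cluster | Every 6 hours | 30 days |
| Test cluster | Once a day | 7 days |
| Before a major change | Immediately | Keep permanently |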
II. Manual etcd Backup (Emergency Use)
Step 1: SSH into the master node
Step 2: Create the backup directory

    sudo mkdir -p /backup/etcd

Step 3: Take the snapshot

    sudo ETCDCTL_API=3 etcdctl snapshot save \
      /backup/etcd/etcd-snapshot-$(date +%F-%H%M).db \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key

Parameters:
- ETCDCTL_API=3: use the etcd v3 API
- --endpoints: the etcd client address (usually port 2379 on the local host)
- --cacert/--cert/--key: paths to the etcd TLS certificates (the kubeadm default locations)
- The file name carries a timestamp so that multiple backups are easy to tell apart
Expected output (timestamps and paths will differ):

    {"level":"info","ts":"2026-02-06T15:30:45.123Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/backup/etcd/etcd-snapshot-2026-02-06-1530.db.part"}
    {"level":"info","ts":"2026-02-06T15:30:45.234Z","logger":"client","caller":"v3@v3.5.10/maintenance.go:212","msg":"opened snapshot stream; downloading"}
    {"level":"info","ts":"2026-02-06T15:30:45.345Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
    {"level":"info","ts":"2026-02-06T15:30:45.456Z","logger":"client","caller":"v3@v3.5.10/maintenance.go:220","msg":"completed snapshot read; closing"}
    Snapshot saved at /backup/etcd/etcd-snapshot-2026-02-06-1530.db

Step 4: Verify the backup file

    # List the backup files and their sizes
    ls -lh /backup/etcd/

    # Inspect the newest snapshot (etcdutl expects exactly one file)
    sudo ETCDCTL_API=3 etcdutl snapshot status \
      "$(ls -t /backup/etcd/etcd-snapshot-*.db | head -n 1)" \
      --write-out=table

Expected output:

    +----------+----------+------------+------------+
    |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
    +----------+----------+------------+------------+
    | 3a4b5c6d |    12345 |       1234 |     5.6 MB |
    +----------+----------+------------+------------+

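What to check:
- ✅ The file size is greater than 0 (usually a few MB to a few tens of MB)
- ✅ snapshot status can read the file and reports TOTAL KEYS > 0
- ✅ A HASH value is printed, which shows the file could be read in full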
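The checks above confirm that a snapshot is readable at the time it is taken. To notice later corruption or a bad copy, one option is to record a checksum next to each snapshot. The sketch below is only an illustration: `record_snapshot_checksum` and `verify_snapshot_checksum` are hypothetical helper names, not etcd commands, and it assumes GNU `sha256sum` is available (as on Ubuntu).

```shell
# Hypothetical helper (not part of etcd): refuse empty snapshots and write a
# SHA-256 sidecar file "<snapshot>.sha256" next to the snapshot.
record_snapshot_checksum() {
    local snapshot="$1"

    # -s is true only if the file exists and is larger than 0 bytes.
    if [ ! -s "$snapshot" ]; then
        echo "snapshot missing or empty: $snapshot" >&2
        return 1
    fi

    # Run from the snapshot's directory so the sidecar stores a relative
    # file name and stays valid if the whole directory is moved.
    (
        cd "$(dirname "$snapshot")" &&
        sha256sum "$(basename "$snapshot")" > "$(basename "$snapshot").sha256"
    )
}

# Hypothetical helper: compare a snapshot with its sidecar.
# Returns non-zero if the contents no longer match.
verify_snapshot_checksum() {
    local snapshot="$1"
    (
        cd "$(dirname "$snapshot")" &&
        sha256sum -c "$(basename "$snapshot").sha256" > /dev/null
    )
}
```

For example, `record_snapshot_checksum /backup/etcd/etcd-snapshot-2026-02-06-1530.db` right after the backup, and `verify_snapshot_checksum` on the same path before a restore. This complements `snapshot status` and does not replace it.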
III. Persisting Backups (Protecting Against Master Node Failure)
TODO: persistence approach
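This section is still a placeholder in the article. As one possible shape for it, and not the author's chosen approach, the sketch below copies a snapshot to a second directory and confirms the copy is byte-identical. It assumes the destination is a directory that lives off the master node, for example an NFS mount; `copy_snapshot_offnode` and the destination path are hypothetical.

```shell
# Hypothetical helper: copy a snapshot to a second location (for example an
# NFS mount) and keep the copy only if it is byte-identical to the source.
copy_snapshot_offnode() {
    local snapshot="$1" dest_dir="$2"
    local dest="$dest_dir/$(basename "$snapshot")"

    [ -s "$snapshot" ] || { echo "snapshot missing or empty: $snapshot" >&2; return 1; }
    mkdir -p "$dest_dir" || return 1

    # Copy under a temporary name first so a half-written file is never
    # mistaken for a finished backup.
    cp "$snapshot" "$dest.part" || return 1

    if cmp -s "$snapshot" "$dest.part"; then
        mv "$dest.part" "$dest"
    else
        rm -f "$dest.part"
        echo "copy verification failed: $dest" >&2
        return 1
    fi
}
```

A call might look like `copy_snapshot_offnode /backup/etcd/etcd-snapshot-2026-02-06-1530.db /mnt/nfs/etcd`, where `/mnt/nfs/etcd` is an assumed mount point. Uploading to object storage would need the provider's own tooling, which is not covered here.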
IV. Automated Scheduled Backups (CronJob)
Option 1: Linux Cron (simple setups)
Create the backup script on the master node:

    sudo tee /usr/local/bin/etcd-backup.sh <<'EOF'
    #!/bin/bash
    # Take an etcd snapshot, verify it, and prune old snapshots.
    set -euo pipefail

    BACKUP_DIR="/backup/etcd"
    BACKUP_FILE="${BACKUP_DIR}/etcd-snapshot-$(date +%F-%H%M).db"
    RETENTION_DAYS=7

    mkdir -p "${BACKUP_DIR}"

    # Save the snapshot
    ETCDCTL_API=3 etcdctl snapshot save "${BACKUP_FILE}" \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key

    # Check that the snapshot can be read
    ETCDCTL_API=3 etcdctl snapshot status "${BACKUP_FILE}"

    # Delete snapshots older than the retention period
    find "${BACKUP_DIR}" -name "etcd-snapshot-*.db" -mtime +"${RETENTION_DAYS}" -delete

    echo "Backup completed: ${BACKUP_FILE}"
    EOF

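The cleanup step in the script relies on `find -mtime`, whose rounding is easy to misread. The sketch below isolates that step in a hypothetical `prune_old_snapshots` function so it can be tried against a scratch directory before trusting it with real backups. It assumes GNU find, as shipped with Ubuntu.

```shell
# Hypothetical helper mirroring the script's cleanup step: delete snapshot
# files in backup_dir that are older than retention_days, printing each one.
prune_old_snapshots() {
    local backup_dir="$1" retention_days="$2"

    # find rounds a file's age down to whole days, so "-mtime +7" only
    # matches files that are at least 8 full days old.
    # -maxdepth 1 keeps the deletion from descending into subdirectories.
    find "$backup_dir" -maxdepth 1 -type f -name 'etcd-snapshot-*.db' \
        -mtime "+$retention_days" -print -delete
}
```

A dry run on a copy of the backup directory, for example `prune_old_snapshots /tmp/etcd-backup-copy 7`, shows which files would go. Files that do not match the `etcd-snapshot-*.db` pattern are left alone.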
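Make the script executable:

    sudo chmod +x /usr/local/bin/etcd-backup.sh

Test the script:

    sudo /usr/local/bin/etcd-backup.sh

Schedule it with Cron (daily at 02:00). Open root's crontab, for example with `sudo crontab -e`, and add the following line:

    0 2 * * * /usr/local/bin/etcd-backup.sh >> /var/log/etcd-backup.log 2>&1

Verify the Cron entry, for example by listing root's crontab with `sudo crontab -l`. Cron runs jobs with a minimal PATH, so if etcdctl is installed outside it, use the absolute path to etcdctl in the script.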
Option 2: Kubernetes CronJob (cloud-native approach)
Create the Namespace, ServiceAccount, and RBAC objects (saved as etcd-backup-rbac.yaml, the file name used in the deploy step):

    apiVersion: v1
    kind: Namespace
    metadata:
      name: kube-backup
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: etcd-backup
      namespace: kube-backup
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: etcd-backup
    rules:
    - apiGroups: [""]
      resources: ["nodes"]
      verbs: ["get", "list"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: etcd-backup
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: etcd-backup
    subjects:
    - kind: ServiceAccount
      name: etcd-backup
      namespace: kube-backup

Create the CronJob (saved as etcd-backup-cronjob.yaml). The command assumes the image provides /bin/sh, date, and find; check this for the tag you use, and pick the tag that matches the etcd version your cluster runs:

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: etcd-backup
      namespace: kube-backup
    spec:
      schedule: "0 2 * * *"
      successfulJobsHistoryLimit: 3
      failedJobsHistoryLimit: 1
      jobTemplate:
        spec:
          template:
            spec:
              serviceAccountName: etcd-backup
              hostNetwork: true
              nodeSelector:
                node-role.kubernetes.io/control-plane: ""
              tolerations:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists
                effect: NoSchedule
              containers:
              - name: backup
                image: registry.k8s.io/etcd:3.5.10-0
                command:
                - /bin/sh
                - -c
                - |
                  ETCDCTL_API=3 etcdctl snapshot save \
                    /backup/etcd-snapshot-$(date +%F-%H%M).db \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/server.crt \
                    --key=/etc/kubernetes/pki/etcd/server.key
                  find /backup -name "etcd-snapshot-*.db" -mtime +7 -delete
                volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
              restartPolicy: OnFailure
              volumes:
              - name: etcd-certs
                hostPath:
                  path: /etc/kubernetes/pki/etcd
                  type: Directory
              - name: backup
                hostPath:
                  path: /backup/etcd
                  type: DirectoryOrCreate

Deploy:

    kubectl apply -f etcd-backup-rbac.yaml
    kubectl apply -f etcd-backup-cronjob.yaml

Verify:

    # Check that the CronJob exists
    kubectl get cronjob -n kube-backup

    # Trigger a run manually
    kubectl create job -n kube-backup etcd-backup-manual --from=cronjob/etcd-backup

    # Read the job's logs
    kubectl logs -n kube-backup job/etcd-backup-manual

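V. Restoring etcd from a Backup (Disaster Recovery)
⚠️ Warning
Restoring etcd completely overwrites the current cluster state. Use it only when:
- The etcd data is damaged beyond repair
- You need to roll back to a specific point in time
- You are validating a backup in an isolated environment
Restore steps (outline)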
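1. Stop kube-apiserver on all master nodes (moving the static Pod manifest away makes the kubelet stop the Pod):

    sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

2. Stop etcd:

    sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/

3. Clear the old data:

    sudo rm -rf /var/lib/etcd

4. Restore the snapshot. The --name and --initial-cluster values below come from the author's node; replace them with your own etcd member name, which kubeadm sets with --name in the etcd manifest. etcd 3.5 still accepts etcdctl snapshot restore but marks it deprecated in favor of etcdutl snapshot restore:

    sudo ETCDCTL_API=3 etcdctl snapshot restore \
      /backup/etcd/etcd-snapshot-2026-02-06-1530.db \
      --data-dir=/var/lib/etcd \
      --name=master-gz-amd64-ubuntu-1 \
      --initial-cluster=master-gz-amd64-ubuntu-1=https://127.0.0.1:2380 \
      --initial-advertise-peer-urls=https://127.0.0.1:2380

5. Bring etcd and the API server back:

    sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
    sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

6. Verify the cluster:

    kubectl get nodes
    kubectl get pods -A

The full restore procedure deserves its own test run; this article focuses on the backup side.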
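Step 3 deletes the old data directory outright, which leaves nothing to fall back on if the snapshot turns out to be unusable. A more cautious variant, sketched below as a suggestion and not as the article's procedure, renames the directory instead. `stash_data_dir` is a hypothetical helper; moving the directory still frees the /var/lib/etcd path for the restore to recreate.

```shell
# Hypothetical helper: rename a directory to "<dir>.bak-<timestamp>" and
# print the new path, so the old etcd data can still be inspected or put
# back if the restore goes wrong.
stash_data_dir() {
    local data_dir="${1%/}"    # drop a trailing slash, if any
    local stash="$data_dir.bak-$(date +%F-%H%M%S)"

    [ -d "$data_dir" ] || { echo "not a directory: $data_dir" >&2; return 1; }
    # Refuse to overwrite a stash created within the same second.
    [ ! -e "$stash" ] || { echo "already exists: $stash" >&2; return 1; }

    mv "$data_dir" "$stash" && echo "$stash"
}
```

On the master node this would be run as root, for example `sudo bash -c '. ./stash.sh; stash_data_dir /var/lib/etcd'`, where `stash.sh` is an assumed file holding the function. Remove the stashed copy once the restored cluster has been verified, since it occupies as much disk space as the original data.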
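VI. Best Practices and Caveats
✅ Best practices
- Back up first: take a manual backup before any major change
- Store copies in more than one place: keep at least one backup outside the cluster (COS/S3/NFS)
- Rehearse regularly: run a restore drill monthly or quarterly (in a test environment)
- Monitor backups: set up alerts so that a failed backup triggers a notification immediately
- Version your files: include a timestamp and the cluster version in backup file names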
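The naming advice in the last bullet can be sketched as a small function. `snapshot_filename` is a hypothetical helper, and how the cluster version string is obtained is left to the reader; here it is simply passed in as an argument.

```shell
# Hypothetical helper: build a snapshot path that carries the cluster
# version and a timestamp, in the form
#   <backup_dir>/etcd-snapshot-<version>-<YYYY-MM-DD-HHMM>.db
snapshot_filename() {
    local backup_dir="${1%/}" cluster_version="$2"
    local safe_version

    # Replace every character outside A-Z, a-z, 0-9, ".", "_" and "-"
    # with "_" so the version is safe to embed in a file name.
    safe_version=$(printf '%s' "$cluster_version" | tr -c 'A-Za-z0-9._-' '_')

    printf '%s/etcd-snapshot-%s-%s.db\n' \
        "$backup_dir" "$safe_version" "$(date +%F-%H%M)"
}
```

For example, `snapshot_filename /backup/etcd v1.32.3` yields a path such as `/backup/etcd/etcd-snapshot-v1.32.3-2026-02-06-1530.db`, with the timestamp depending on when it runs. The name still matches the `etcd-snapshot-*.db` pattern used for cleanup elsewhere in this article.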
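⚠️ Common pitfalls
| Problem | Consequence | How to avoid it |
| --- | --- | --- |
| Backups kept only on the local node | Nothing to restore from when the master node fails | Upload to object storage |
| Backups never verified | Corruption is discovered only at restore time | Run snapshot status regularly |
| No retention policy | The disk fills up | Automatically delete backups older than N days |
| Wrong certificate path | Backups fail without any alert | Add error handling to the script |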
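Silent failures, the last pitfall in the table, can also be caught from the outside by checking how old the newest snapshot is. The sketch below assumes GNU find; `backup_is_fresh` is a hypothetical helper, and the alerting mechanism itself is left open.

```shell
# Hypothetical monitoring helper: succeed only if backup_dir holds at least
# one non-empty snapshot modified within the last max_age_minutes.
backup_is_fresh() {
    local backup_dir="$1" max_age_minutes="$2"
    local recent

    # "! -empty" skips zero-byte files left behind by a failed run;
    # "-mmin -N" matches files modified less than N minutes ago.
    recent=$(find "$backup_dir" -maxdepth 1 -type f -name 'etcd-snapshot-*.db' \
        ! -empty -mmin "-$max_age_minutes" 2>/dev/null | head -n 1)

    [ -n "$recent" ]
}
```

With a daily schedule, a check such as `backup_is_fresh /backup/etcd 1560 || echo "etcd backup is stale"` (26 hours, leaving some slack) could run from a separate Cron entry, with the `echo` replaced by whatever notification channel is in use.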
VII. Troubleshooting
Problem 1: Permission denied
Error:

    Error: open /backup/etcd/snapshot.db: permission denied

Fix: create the directory as root, then run the backup command with sudo as well:

    sudo mkdir -p /backup/etcd
    sudo chown -R root:root /backup/etcd

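Problem 2: Wrong certificate path
Error:

    Error: context deadline exceeded

Check that the certificate files exist:

    ls -l /etc/kubernetes/pki/etcd/

Common locations (they depend on how etcd was installed):
- kubeadm: /etc/kubernetes/pki/etcd/
- Binary installation: /etc/etcd/ssl/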
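Because a wrong certificate path surfaces only as a timeout, it can help to check the files before calling etcdctl at all. The following is a minimal sketch; `check_etcd_certs` is a hypothetical helper, and the three file names are the ones used in the backup commands in this article.

```shell
# Hypothetical pre-flight check: confirm that ca.crt, server.crt and
# server.key exist in cert_dir and are readable by the current user.
check_etcd_certs() {
    local cert_dir="${1:-/etc/kubernetes/pki/etcd}"
    local missing=0 f

    for f in ca.crt server.crt server.key; do
        if [ ! -r "$cert_dir/$f" ]; then
            echo "missing or unreadable: $cert_dir/$f" >&2
            missing=1
        fi
    done

    # 0 if all three files are readable, 1 otherwise.
    return "$missing"
}
```

Calling `check_etcd_certs || exit 1` near the top of a backup script turns a slow, ambiguous timeout into an immediate message that names the offending file. Pass a different directory, such as `/etc/etcd/ssl`, for installations that keep certificates elsewhere.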
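Problem 3: etcd is not listening on 127.0.0.1
Error:

    Error: connection refused

Check which address etcd is listening on (if netstat is not installed, ss -tlnp from iproute2 gives the same information):

    sudo netstat -tlnp | grep 2379

If etcd listens on a different address, change the endpoint accordingly:

    --endpoints=https://<actual IP>:2379
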
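On a kubeadm cluster the listen addresses are also declared in the etcd static Pod manifest, through the `--listen-client-urls` flag. The sketch below reads them from there; `etcd_listen_client_urls` is a hypothetical helper, and it assumes the flag is written on one line in `--flag=value` form, as kubeadm generates it.

```shell
# Hypothetical helper: print the client URLs etcd is configured to listen
# on, one per line, taken from --listen-client-urls in a Pod manifest.
etcd_listen_client_urls() {
    local manifest="${1:-/etc/kubernetes/manifests/etcd.yaml}"

    # Keep only what follows "--listen-client-urls=", then split the
    # comma-separated list into separate lines.
    sed -n 's/^.*--listen-client-urls=//p' "$manifest" | tr ',' '\n'
}
```

Any URL it prints is a candidate for `--endpoints`. The manifest is normally readable only by root, so the function would be run with root privileges on the master node.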
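VIII. Summary
- ✅ A manual backup takes a single command, but the result must be persisted somewhere safe
- ✅ Automate backups with Cron or a CronJob; the latter is recommended (cloud-native)
- ✅ Always verify a backup with snapshot status after taking it
- ✅ Rehearse the restore procedure regularly (in a test environment)
Suggested next steps:
- Take a manual backup right now
- Set up automated backups (Cron/CronJob)
- Upload backups to object storage
- Rehearse a restore in a test environment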