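Preface

In day-to-day Kubernetes cluster operations, nodes often end up running different versions. Common causes:

  • Nodes joined the cluster in batches and were installed with different kubeadm versions
  • Some nodes were upgraded while others were not upgraded in time
  • The test environment lacks strict version management

Risks of version mismatch

  • When the version skew between kubelet and kube-apiserver exceeds the supported range, behavior can be unpredictable (upstream supports a kubelet up to three minor versions older than kube-apiserver, and never a newer minor version)
  • Some APIs or features are incompatible across versions
  • Troubleshooting is harder because problems are difficult to reproduce

Scenario for this article

  • The cluster has 3 nodes: master (v1.32.3), worker (v1.32.9), gateway (v1.28.2)
  • Goal: upgrade the gateway node from v1.28.2 to v1.32.9
  • Cluster type: test environment that can tolerate a brief service interruption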
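As a rough illustration of the skew rule, the sketch below computes how many minor versions a kubelet lags behind the API server. The function names (minor_of, kubelet_skew, skew_supported) and the MAX_SKEW variable are hypothetical helpers, and the default limit of 3 is an assumption based on the upstream version skew policy for recent releases; check the policy that applies to your own versions.

```shell
# Extract the minor version number from a string such as "v1.32.9".
minor_of() {
  local v="${1#v}"   # strip the leading "v"      -> "1.32.9"
  v="${v#*.}"        # drop the major component   -> "32.9"
  echo "${v%%.*}"    # keep the part before a dot -> "32"
}

# Print how many minor versions the kubelet lags behind the API server.
# A negative result means the kubelet is newer than the API server.
kubelet_skew() {
  local apiserver="$1" kubelet="$2"
  echo $(( $(minor_of "$apiserver") - $(minor_of "$kubelet") ))
}

# Return 0 when the skew is inside the supported window.
# MAX_SKEW defaults to 3, which is an assumption (see the lead-in).
skew_supported() {
  local skew max="${MAX_SKEW:-3}"
  skew="$(kubelet_skew "$1" "$2")"
  [ "$skew" -ge 0 ] && [ "$skew" -le "$max" ]
}
```

With the versions in this scenario, kubelet_skew v1.32.3 v1.28.2 prints 4, which falls outside the assumed limit, while the worker at v1.32.9 has a minor-version skew of 0.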

1. Pre-upgrade preparation

1.1 Confirm the current versions

kubectl get nodes -o wide

Example output (abridged)

NAME                         STATUS   ROLES           AGE   VERSION
master-gz-amd64-ubuntu-1     Ready    control-plane   60d   v1.32.3
work-gz-amd64-ubuntu-1       Ready    <none>          60d   v1.32.9
gateway-gz-amd64-ubuntu-1    Ready    <none>          60d   v1.28.2   ← needs upgrading
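To spot out-of-date nodes without reading the table by eye, a small helper along these lines could list every node whose kubelet version differs from a target. nodes_not_at is a hypothetical name; the kubectl flags used here (--no-headers, -o custom-columns) are standard, and .status.nodeInfo.kubeletVersion is the field shown in the VERSION column.

```shell
# Print the names of nodes whose kubelet version differs from the target.
# Usage: nodes_not_at v1.32.9
nodes_not_at() {
  local target="$1"
  kubectl get nodes --no-headers \
    -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion \
  | awk -v t="$target" '$2 != t { print $1 }'
}
```

For the cluster in this article, nodes_not_at v1.32.9 would be expected to list the master and the gateway node, since only the worker already runs v1.32.9.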

1.2 Back up etcd (mandatory!)

Run on the master node:

sudo mkdir -p /backup/etcd

sudo ETCDCTL_API=3 etcdctl snapshot save \
  /backup/etcd/etcd-snapshot-before-upgrade-$(date +%F-%H%M).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the backup (the glob must match exactly one snapshot file;
# pass the full file name if the directory holds several)
sudo ETCDCTL_API=3 etcdctl snapshot status \
  /backup/etcd/etcd-snapshot-*.db --write-out=table

Key checks

  • ✅ The file exists and its size is > 0
  • ✅ snapshot status reports TOTAL KEYS > 0

For detailed backup steps, see 《Kubernetes 集群 etcd 备份实战指南》 (a hands-on guide to etcd backups for Kubernetes clusters).
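The first check above (file exists and is non-empty) can be scripted. The sketch below picks the newest file matching the naming pattern used in this article and tests its size. latest_snapshot and snapshot_nonempty are hypothetical helper names, and the etcd-snapshot-*.db pattern is assumed from the save command above.

```shell
# Print the most recently modified snapshot in a directory, if any.
latest_snapshot() {
  local dir="$1" f newest=""
  for f in "$dir"/etcd-snapshot-*.db; do
    [ -e "$f" ] || continue   # the glob matched nothing
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest="$f"
    fi
  done
  [ -n "$newest" ] && echo "$newest"
}

# Return 0 only if a snapshot exists and its size is greater than zero.
snapshot_nonempty() {
  local f
  f="$(latest_snapshot "$1")" || return 1
  [ -s "$f" ]
}
```

A possible use is snapshot_nonempty /backup/etcd && echo "backup looks usable". This covers only the size check; the TOTAL KEYS check still needs etcdctl snapshot status.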

1.3 Check the Pods running on the node

kubectl get pods -A -o wide --field-selector spec.nodeName=gateway-gz-amd64-ubuntu-1

Example output (abridged)

NAMESPACE        NAME                    READY   STATUS    NODE
kube-system      kube-proxy-xxxxx        1/1     Running   gateway-gz-amd64-ubuntu-1
kube-flannel     kube-flannel-xxxxx      1/1     Running   gateway-gz-amd64-ubuntu-1
higress-system   higress-gateway-xxxxx   1/1     Running   gateway-gz-amd64-ubuntu-1

Key checks

  • Are there stateful services (databases, caches, and so on)? If so, migrate them first
  • Are there single-replica services (such as Higress above)? If so, they will be briefly unavailable during the upgrade
  • If everything is DaemonSet-managed (such as kube-proxy and the CNI), you can proceed directly
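The checks above come down to finding Pods on the node that are not DaemonSet-managed. A sketch of one way to list them is shown below. pods_to_evict is a hypothetical name, and the OWNER column reads the kind of the first owner reference, so a bare Pod shows up with an owner of <none>.

```shell
# List Pods on a node that are not owned by a DaemonSet, as "namespace/name (owner kind)".
# Usage: pods_to_evict gateway-gz-amd64-ubuntu-1
pods_to_evict() {
  local node="$1"
  kubectl get pods -A --no-headers \
    --field-selector "spec.nodeName=${node}" \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind' \
  | awk '$3 != "DaemonSet" { print $1 "/" $2 " (" $3 ")" }'
}
```

An empty result suggests only DaemonSet Pods are present. Anything listed (for example a ReplicaSet-owned Higress Pod) will be evicted by the drain in the next section.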

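2. Upgrade steps (for test environments)

The steps below do not upgrade in place. They remove the node from the cluster, install the target version, and rejoin it, which avoids stepping through each intermediate minor version (kubeadm does not support skipping minor versions during an in-place upgrade).

Step 1: Evict Pods and mark the node unschedulable

kubectl drain gateway-gz-amd64-ubuntu-1 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force

Expected output

node/gateway-gz-amd64-ubuntu-1 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/kube-proxy-xxxxx, kube-flannel/kube-flannel-xxxxx
evicting pod higress-system/higress-gateway-xxxxx
pod/higress-gateway-xxxxx evicted
node/gateway-gz-amd64-ubuntu-1 drained

Flag reference

  • --ignore-daemonsets: skip DaemonSet-managed Pods; the DaemonSet controller would recreate them immediately, and without this flag drain refuses to proceed
  • --delete-emptydir-data: also evict Pods that use emptyDir volumes (their temporary data is lost)
  • --force: also delete bare Pods that are not managed by a controller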

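Step 2: Delete the node object from the cluster

# Custom labels and taints live on the node object and are deleted with it;
# record them first if needed: kubectl describe node gateway-gz-amd64-ubuntu-1
kubectl delete node gateway-gz-amd64-ubuntu-1

Verify the deletion

kubectl get nodes
# Confirm that gateway-gz-amd64-ubuntu-1 is no longer listed

Step 3: SSH into the gateway node

ssh ubuntu@<gateway-ip>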

Step 4: Stop kubelet and reset the node configuration

# Stop kubelet
sudo systemctl stop kubelet

# Reset the kubeadm state (removes all old configuration)
sudo kubeadm reset -f

# Clean up the CNI configuration
sudo rm -rf /etc/cni/net.d/*

# Flush iptables rules (optional but recommended)
sudo iptables -F && sudo iptables -t nat -F && sudo iptables -t mangle -F && sudo iptables -X

# Restart containerd
sudo systemctl restart containerd

Expected output (kubeadm reset). With -f the confirmation is skipped, so the first two lines may not appear:

[reset] WARNING: Changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] Are you sure you want to proceed? [y/N]: y
[reset] Deleting contents of directories: [/etc/kubernetes/manifests /var/lib/kubelet /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf ...]
The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d

Verify the cleanup

# The key configuration file should be gone
ls /etc/kubernetes/kubelet.conf
# Expected: No such file or directory

sudo systemctl status kubelet
# Expected: inactive (dead)

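Step 5: Upgrade kubeadm, kubelet, and kubectl

# Remove the version hold
sudo apt-mark unhold kubeadm kubelet kubectl

# Refresh the package index
# (if the apt source is a per-minor repository, for example a pkgs.k8s.io URL
#  ending in /v1.28/deb/, point it at v1.32 first or no 1.32.x package is listed)
sudo apt-get update

# List available versions (confirm the target version exists)
apt-cache madison kubeadm | grep -F 1.32

# Install the target version (v1.32.9)
sudo apt-get install -y \
  kubeadm=1.32.9-1.1 \
  kubelet=1.32.9-1.1 \
  kubectl=1.32.9-1.1

# Re-apply the hold (prevents accidental upgrades)
sudo apt-mark hold kubeadm kubelet kubectl

Verify the versions

kubeadm version
# Expected: kubeadm version: &version.Info{Major:"1", Minor:"32", GitVersion:"v1.32.9", ...}

kubelet --version
# Expected: Kubernetes v1.32.9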
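The version check can also be done without reading the output by eye. The sketch below takes the text printed by kubelet --version and compares its last word with the target. kubelet_version_from and version_matches are hypothetical helper names, and the "Kubernetes vX.Y.Z" output format is assumed from the example above.

```shell
# Extract the version from a line such as "Kubernetes v1.32.9" (takes the last word).
kubelet_version_from() {
  echo "${1##* }"
}

# Return 0 if the reported version equals the target.
# Usage: version_matches "$(kubelet --version)" v1.32.9
version_matches() {
  [ "$(kubelet_version_from "$1")" = "$2" ]
}
```

For kubeadm, kubeadm version -o short prints only the version string, which can be compared with the target directly.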

Step 6: Restart kubelet (apply the new configuration)

# Until the node joins the cluster, kubelet has no configuration and
# restarts in a loop; this is expected at this stage
sudo systemctl daemon-reload
sudo systemctl restart kubelet

Step 7: Generate the join command on the master node

Switch to the master node

ssh ubuntu@<master-ip>

Generate the join command

sudo kubeadm token create --print-join-command

Example output (copy the entire output):

kubeadm join 172.16.16.10:6443 --token tk3x59.398ytf1itnhntst5 \
  --discovery-token-ca-cert-hash sha256:1d9cfe17699c2f3a0954620669785e28d3a60d2770f70cd859df9fbd6258ab03

Step 8: Rejoin the cluster from the gateway node

Switch back to the gateway node

ssh ubuntu@<gateway-ip>

Run the join (paste the full command from the previous step, and remember to prefix it with sudo):

sudo kubeadm join 172.16.16.10:6443 \
  --token tk3x59.398ytf1itnhntst5 \
  --discovery-token-ca-cert-hash sha256:1d9cfe17699c2f3a0954620669785e28d3a60d2770f70cd859df9fbd6258ab03

Expected output

[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

Key checkpoints

  • ✅ No error messages
  • ✅ The output includes "This node has joined the cluster"

3. Verify the upgrade

3.1 Check the node status

Run on the master node:

kubectl get nodes -o wide

Expected output (abridged)

NAME                         STATUS   ROLES           AGE   VERSION
master-gz-amd64-ubuntu-1     Ready    control-plane   60d   v1.32.3
work-gz-amd64-ubuntu-1       Ready    <none>          60d   v1.32.9
gateway-gz-amd64-ubuntu-1    Ready    <none>          30s   v1.32.9

Key checks

  • ✅ The VERSION column for the gateway node shows v1.32.9
  • ✅ STATUS is Ready (this may take 30-60 seconds)

If STATUS is NotReady

# Wait 1-2 minutes (the CNI plugin needs time to initialize)

# If the node stays NotReady, check the kubelet logs
ssh ubuntu@<gateway-ip>
sudo journalctl -u kubelet -n 50 --no-pager
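Rather than re-running kubectl get nodes by hand while waiting, kubectl wait can block until the node reports Ready or a timeout expires. The wrapper below is a minimal sketch; wait_node_ready is a hypothetical name and the 120s default is an assumption chosen to match the 1-2 minute wait mentioned above.

```shell
# Block until the node's Ready condition is true, or fail after the timeout.
# Usage: wait_node_ready gateway-gz-amd64-ubuntu-1 [timeout]
wait_node_ready() {
  local node="$1" timeout="${2:-120s}"
  kubectl wait --for=condition=Ready "node/${node}" --timeout="${timeout}"
}
```

A non-zero exit status means the node did not become Ready in time, at which point the kubelet logs on the gateway node are the next place to look.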

3.2 Check the system Pods

# Check the kube-system namespace
kubectl get pods -n kube-system -o wide | grep gateway

# Check the CNI (Flannel)
kubectl get pods -n kube-flannel -o wide | grep gateway

Expected output (abridged; with -n there is no NAMESPACE column)

kube-proxy-xxxxx      1/1   Running   gateway-gz-amd64-ubuntu-1
kube-flannel-xxxxx    1/1   Running   gateway-gz-amd64-ubuntu-1

3.3 Check that workload Pods have recovered

# Check Higress (adjust to your actual services)
kubectl get pods -n higress-system -o wide

If the Pod is not automatically scheduled back onto the gateway node (it was evicted during the drain):

Option: add a toleration so that Higress is scheduled onto the gateway node

# Check the taints on the gateway node
kubectl describe node gateway-gz-amd64-ubuntu-1 | grep Taints

# The node object was deleted and recreated, so custom taints are lost; re-add them if missing:
kubectl taint nodes gateway-gz-amd64-ubuntu-1 node-role=gateway:NoSchedule

Add a toleration and node affinity to higress-gateway and higress-controller (the fragment below shows only the relevant fields, using higress-gateway as the example):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: higress-gateway
  namespace: higress-system
spec:
  template:
    spec:
      # Tolerate the taint on the gateway node
      tolerations:
        - key: "node-role"
          operator: "Equal"
          value: "gateway"
          effect: "NoSchedule"

      # Node affinity: force scheduling onto the gateway node
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - gateway-gz-amd64-ubuntu-1

      containers:
        - name: higress-gateway
          # ... other settings

3.4 Functional verification

# Test the gateway (adjust to your scenario)
curl -H "Host: gateway.zlinkcloudtech.com" http://<gateway-ip>

# Or test another externally exposed service
curl https://blog.zlinkcloudtech.com

4. Common problems and troubleshooting

Problem 1: join fails with "already exists in the cluster"

Full error

error execution phase kubelet-start: a Node with name "gateway-gz-amd64-ubuntu-1" and status "Ready" already exists in the cluster. You must delete the existing Node or change the name of this new joining Node

Cause

  • kubectl delete node was not run on the master node
  • Or the deletion has not yet propagated through the cluster

Fix

# Delete the node again from the master node
kubectl delete node gateway-gz-amd64-ubuntu-1 --force --grace-period=0

# Wait about 10 seconds, then run the join again

Problem 2: join fails with "FileAvailable--etc-kubernetes-kubelet.conf already exists"

Full error

[ERROR FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists
[ERROR Port-10250]: Port 10250 is in use
[ERROR FileAvailable--etc-kubernetes-pki-ca.crt]: /etc/kubernetes/pki/ca.crt already exists

Cause

  • kubeadm reset did not clean up completely
  • kubelet is still running

Fix

# Run on the gateway node (a worker; never run these rm commands on a control-plane node)
sudo systemctl stop kubelet
sudo kubeadm reset -f
sudo rm -rf /etc/cni/net.d/*
sudo rm -rf /etc/kubernetes/*
sudo rm -rf /var/lib/kubelet/*

# Then run the join again

Problem 3: The node is Ready but Pods stay in ContainerCreating

Symptom

kubectl get pods -n higress-system
# The Pod STATUS stays at ContainerCreating

Diagnosis

kubectl describe pod -n higress-system <pod-name>
# Look at the Events section
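When the describe output is long, the events for a single Pod can also be fetched on their own and sorted by time. The helper below is a sketch; pod_events is a hypothetical name, while the field selector on involvedObject.name and the --sort-by flag are standard kubectl options.

```shell
# Show the events recorded for one Pod, oldest first.
# Usage: pod_events higress-system <pod-name>
pod_events() {
  local ns="$1" pod="$2"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=${pod}" \
    --sort-by=.lastTimestamp
}
```

The most recent lines usually carry the message to look up, such as a network setup failure or an image pull error.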

Common causes and fixes

Error message                   Cause                Fix
failed to create pod network    CNI not ready        Restart the Flannel DaemonSet Pod
image pull failed               Image pull failed    Check the network or the image registry
node not found                  Node labels lost     Re-apply the labels

Restart the CNI (using Flannel as an example)

kubectl delete pod -n kube-flannel -l app=flannel --field-selector spec.nodeName=gateway-gz-amd64-ubuntu-1

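Problem 4: The token has expired

Error

The join reports that the token has expired.

Cause

  • kubeadm tokens are valid for 24 hours by default

Fix

# Generate a new token on the master node
sudo kubeadm token create --print-join-command

# Use the new join command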
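Since the join normally happens within minutes, a shorter-lived token is one option. The sketch below wraps kubeadm token create with its --ttl flag. fresh_join_command is a hypothetical name and the 2h default is an arbitrary assumption, not a recommendation from this article.

```shell
# Print a join command backed by a new token with a limited lifetime.
# Usage (on the master node): fresh_join_command [ttl]
fresh_join_command() {
  local ttl="${1:-2h}"
  sudo kubeadm token create --ttl "${ttl}" --print-join-command
}
```

Existing tokens and their remaining lifetime can be inspected with kubeadm token list on the master node.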

Problem 5: The node stays NotReady and the logs show "cni plugin not initialized"

kubelet log

Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

Cause

  • The CNI configuration has not been loaded
  • The Flannel Pod has not started

Diagnosis

# Check the CNI configuration files (on the gateway node)
ls /etc/cni/net.d/

# Check the Flannel Pods (on the master node)
kubectl get pods -n kube-flannel -o wide

Fix

# Remove the stale CNI configuration (on the gateway node)
sudo rm -rf /etc/cni/net.d/*

# Restart the Flannel Pod (on the master node)
kubectl delete pod -n kube-flannel -l app=flannel --field-selector spec.nodeName=gateway-gz-amd64-ubuntu-1

# Wait 1-2 minutes, then check
kubectl get nodes