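Node Restart Procedure
Run the following steps for each node in turn: check etcd → cordon → drain Pods → reboot → uncordon. Finish all of them on one node before moving on to the next.
1. Check etcd status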
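Deploying etcd with an odd number of members is recommended, so that the cluster stays readable and writable as long as a quorum (majority) is alive. The quorum size is ⌊n/2⌋ + 1, and the number of member failures the cluster tolerates is n − (⌊n/2⌋ + 1); see the fault-tolerance notes in the official documentation. The current 3-member cluster therefore needs at least 2 etcd members alive.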
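The arithmetic is easy to check for a few cluster sizes. The sketch below is only an illustration; the function names `quorum` and `fault_tolerance` are made up for this example and are not part of etcd.

```shell
#!/bin/sh
# quorum N: smallest majority of an N-member cluster, floor(N/2) + 1.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: members that may fail while a majority still survives.
fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
    echo "members=$n quorum=$(quorum "$n") tolerance=$(fault_tolerance "$n")"
done
```

A 4-member cluster tolerates one failure, the same as a 3-member cluster, which is why odd sizes are recommended.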
Run the following checks on every etcd node:
## Verify data consistency / member health
# etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint status --write-out=table
## DB SIZE: database size; the limit is set at deployment time with --quota-backend-bytes (default 2 GB)
## IS LEADER: whether this member is the leader
## IS LEARNER: whether this member is a learner (non-voting member)
## RAFT TERM: current Raft term; it must be identical on all members. Every leader election,
##            for example after a restart or network jitter, increments it
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT               | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://x.x.x.x:2379   | 683c58b549788bd9 | 3.5.15  | 30 MB   | true      | false      | 40        | 130863104  | 130863104          |        |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

## Verify connectivity / response latency
# etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint health --write-out=table
## HEALTH: whether the member can serve reads and writes
## TOOK: duration of the probe; the check reads a key and reports healthy if no error comes back.
##       The time is roughly one network round trip plus the leader's confirmation
+------------------------+--------+-------------+-------+
| ENDPOINT               | HEALTH | TOOK        | ERROR |
+------------------------+--------+-------------+-------+
| https://x.x.x.x:2379   | true   | 60.842745ms |       |
+------------------------+--------+-------------+-------+

## List members
# etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    member list -w table
## STATUS: member status
## PEER ADDRS: addresses used for member-to-member traffic
## CLIENT ADDRS: addresses that clients (the API server) connect to
## IS LEARNER: whether this member is a learner (non-voting member)
+---------+---------+--------+----------------------+----------------------+------------+
| ID      | STATUS  | NAME   | PEER ADDRS           | CLIENT ADDRS         | IS LEARNER |
+---------+---------+--------+----------------------+----------------------+------------+
| xxxxxxx | started | demo-1 | https://x.x.x.x:2380 | https://x.x.x.x:2379 | false      |
| xxxxxxx | started | demo-2 | https://x.x.x.x:2380 | https://x.x.x.x:2379 | false      |
| xxxxxxx | started | demo-3 | https://x.x.x.x:2380 | https://x.x.x.x:2379 | false      |
+---------+---------+--------+----------------------+----------------------+------------+
2. Stop Pod scheduling
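Mark the node that is about to be restarted with cordon so that no new Pods are scheduled onto it:
Note: operate on one node at a time. Complete the cordon → reboot → uncordon cycle on that node before touching the next one.
# kubectl cordon demo-1
3. Drain Pods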
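kube-apiserver, kube-controller-manager, kube-scheduler and etcd are static Pods managed directly by the kubelet, so drain does not evict them. As long as etcd on the other two nodes stays up, the control plane keeps working.
# kubectl drain demo-1 \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --timeout=300s
After draining, confirm that scheduling is disabled on the node:
# kubectl get node
NAME     STATUS                     ROLES           AGE    VERSION
demo-1   Ready,SchedulingDisabled   control-plane   336d   v1.31.14
demo-2   Ready                      control-plane   336d   v1.31.14
demo-3   Ready                      control-plane   336d   v1.31.14
Common error: a drain timeout is usually caused by a PDB (PodDisruptionBudget) that does not allow the eviction, for example: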
error when evicting pods/"prometheus-k8s-0" -n "monitoring": Cannot evict pod as it would violate the pod's disruption budget.
Inspect and handle the PDB that covers the Pods on this node:
Besides temporarily changing the PDB, you can scale up the replicas of the workload, or delete the Pod manually (which bypasses the PDB and defeats its purpose).
# kubectl get pdb -n monitoring prometheus-k8s
NAME             MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
prometheus-k8s   1               N/A               0                     337d
## Temporarily set minAvailable to 0 (restore it when you are done)
# kubectl patch pdb prometheus-k8s -n monitoring --type=json -p='[{"op":"replace","path":"/spec/minAvailable","value":0}]'
4. Reboot the node
# ssh demo-1
# reboot
5. Uncordon and verify
After the node has rebooted, confirm that it is back to Ready (the kernel version will also show the updated version), then lift the scheduling restriction with uncordon:
# kubectl get node -o wide
NAME     STATUS                     ROLES           AGE    VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
demo-1   Ready,SchedulingDisabled   control-plane   336d   v1.31.14   x.x.x.x       <none>        Ubuntu 24.04.4 LTS   6.8.0-124-generic   containerd://1.7.27
demo-2   Ready                      control-plane   336d   v1.31.14   x.x.x.x       <none>        Ubuntu 24.04.4 LTS   6.8.0-64-generic    containerd://1.7.27
demo-3   Ready                      control-plane   336d   v1.31.14   x.x.x.x       <none>        Ubuntu 24.04.4 LTS   6.8.0-106-generic   containerd://1.7.27
# kubectl uncordon demo-1
node/demo-1 uncordoned
# kubectl get node -o wide
NAME     STATUS   ROLES           AGE    VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
demo-1   Ready    control-plane   336d   v1.31.14   x.x.x.x       <none>        Ubuntu 24.04.4 LTS   6.8.0-124-generic   containerd://1.7.27
demo-2   Ready    control-plane   336d   v1.31.14   x.x.x.x       <none>        Ubuntu 24.04.4 LTS   6.8.0-64-generic    containerd://1.7.27
demo-3   Ready    control-plane   336d   v1.31.14   x.x.x.x       <none>        Ubuntu 24.04.4 LTS   6.8.0-106-generic   containerd://1.7.27
6. Repeat
Once the current node is done, go back to step 1 and repeat the whole procedure for the next node.
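To keep the per-node order straight, the sequence can be printed as a checklist. This is a sketch that only prints commands and never executes them; `plan_node_restart` is a hypothetical helper name, and `ssh <node> reboot` is a shorthand assumed here for the two-step ssh and reboot shown in step 4.

```shell
#!/bin/sh
# plan_node_restart NODE: print the per-node command sequence from this procedure.
# Nothing is executed; every check between steps stays with the operator.
plan_node_restart() {
    node="$1"
    echo "# step 1: check etcd on every member (endpoint status, endpoint health, member list)"
    echo "kubectl cordon $node"
    echo "kubectl drain $node --ignore-daemonsets --delete-emptydir-data --timeout=300s"
    echo "ssh $node reboot"
    echo "kubectl get node $node -o wide"
    echo "kubectl uncordon $node"
}

# One node at a time, in order.
for node in demo-1 demo-2 demo-3; do
    echo "== $node =="
    plan_node_restart "$node"
done
```

Printing rather than running is deliberate: each step needs a manual check (etcd healthy, node Ready again) before the next one is safe.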
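Troubleshooting
1. Pods are marked <invalid> after a reboot
1.1. Symptoms
After the node reboots, the RESTARTS column of every Pod on that node shows <invalid>, and the static Pods show 0/1 in the READY column. In the Kubernetes source, a timestamp that lies 2 seconds or more in the future relative to the current time (here, the time shown in the RESTARTS column) is printed as <invalid>.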
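The threshold rule can be illustrated with plain epoch seconds. This is a simplified sketch of the behaviour described above, not the actual Go code; `age_or_invalid` is a made-up name and the real output uses a richer duration format than bare seconds.

```shell
#!/bin/sh
# age_or_invalid TS NOW: both arguments are epoch seconds.
# A timestamp 2 s or more in the future prints "<invalid>";
# a timestamp less than 2 s in the future is tolerated and prints "0s".
age_or_invalid() {
    elapsed=$(( $2 - $1 ))
    if [ "$elapsed" -le -2 ]; then
        echo "<invalid>"
    elif [ "$elapsed" -lt 0 ]; then
        echo "0s"
    else
        echo "${elapsed}s"
    fi
}

age_or_invalid 100 98    # timestamp 2 s ahead of "now"
age_or_invalid 100 99    # timestamp 1 s ahead of "now"
age_or_invalid 100 160   # timestamp 60 s in the past
```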
# kubectl get pods -A -o wide | grep demo-3
NAMESPACE     NAME                             READY   STATUS    RESTARTS            AGE    IP        NODE
kube-system   coredns-dbbb9ff68-z8wjd          1/1     Running   0 (<invalid> ago)   131m   x.x.x.x   demo-3
kube-system   etcd-demo-3                      0/1     Running   0 (<invalid> ago)   10m    x.x.x.x   demo-3
kube-system   kube-apiserver-demo-3            0/1     Running   0 (<invalid> ago)   10m    x.x.x.x   demo-3
kube-system   kube-controller-manager-demo-3   0/1     Running   0 (<invalid> ago)   10m    x.x.x.x   demo-3
kube-system   kube-proxy-f2fj8                 1/1     Running   1 (<invalid> ago)   75d    x.x.x.x   demo-3
kube-system   kube-scheduler-demo-3            0/1     Running   0 (<invalid> ago)   10m    x.x.x.x   demo-3
kube-system   node-local-dns-xkfxc             1/1     Running   3 (<invalid> ago)   75d    x.x.x.x   demo-3
1.2. Investigation
1.2.1. Compare container time and node time
Taking the etcd Pod as an example, its startedAt (UTC) is about 7 hours 40 minutes later than the node's current UTC time, that is, in the "future":
The trailing Z on a timestamp indicates that it is in UTC.
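Comparing timestamps in different zones is less error-prone after converting both to epoch seconds. The sketch below assumes GNU date (for `-d`); `to_epoch` and `skew_hm` are hypothetical helper names, and the timestamps are the ones from the etcd Pod example with fractional seconds dropped.

```shell
#!/bin/sh
# to_epoch TIMESTAMP: convert an ISO 8601 timestamp to epoch seconds (GNU date).
to_epoch() {
    date -u -d "$1" +%s
}

# skew_hm LATER EARLIER: print the difference as "<hours>h<minutes>m".
skew_hm() {
    diff=$(( $(to_epoch "$1") - $(to_epoch "$2") ))
    echo "$(( diff / 3600 ))h$(( diff % 3600 / 60 ))m"
}

# A +08:00 local stamp and a Z stamp can describe the same instant.
if [ "$(to_epoch 2026-06-18T04:46:16+08:00)" = "$(to_epoch 2026-06-17T20:46:16Z)" ]; then
    echo "same instant"
fi

# Container start time versus the node's current time, both in UTC.
skew_hm 2026-06-17T20:46:16Z 2026-06-17T13:06:16Z
```

Pass the later timestamp first; the helper does not format negative differences.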
## Current node time (CST / UTC)
# date
Wed Jun 17 09:06:16 PM CST 2026
# date -u
Wed Jun 17 01:06:16 PM UTC 2026

## Pod container status (times are UTC)
# kubectl get pod -n kube-system etcd-demo-1 -o jsonpath='{.status.containerStatuses}' | jq .
[
  {
    ...
    "lastState": {
      "terminated": {
        "exitCode": 255,
        "finishedAt": "2026-06-17T20:46:05Z",
        "reason": "Unknown",
        "startedAt": "2026-06-17T09:56:07Z"
      }
    },
    "name": "etcd",
    "ready": false,
    "restartCount": 4,
    "state": {
      "running": {
        "startedAt": "2026-06-17T20:46:16Z"
      }
    }
  }
]

## Container creation/start times recorded by containerd
## https://github.com/kubernetes-sigs/cri-tools/blob/v1.26.1/cmd/crictl/container.go#L862
# crictl inspect 2ea3cdadfec6f | grep -Ei "createdAt|startedAt|finishedAt"
"createdAt": "2026-06-18T04:46:16.17870878+08:00",
"startedAt": "2026-06-18T04:46:16.641950022+08:00",
"finishedAt": "0001-01-01T00:00:00Z",

## Converted to UTC for comparison
## containerd: 2026-06-17 20:46:16 UTC
## node now:   2026-06-17 13:06:16 UTC   ← the container time is in the "future" (7h40m ahead)
1.2.2. Check the state of the Pod's underlying container
This rules out a containerd or etcd fault: the underlying container is in fact Running:
# crictl ps -a | grep 'etcd'
CONTAINER       IMAGE           CREATED                  STATE     NAME   ATTEMPT   POD ID          POD
2ea3cdadfec6f   2e96e5913fc06   Less than a second ago   Running   etcd   4         7451b51061d22   etcd-demo-1
fab531bce16e7   2e96e5913fc06   3 hours ago              Exited    etcd   3         5f897d72fd206   etcd-demo-1

## The process is indeed running
# ps aux | grep -v 'grep' | grep 'etcd'
root 2089 ... etcd --advertise-client-urls=https://x.x.x.x:2379 ...

## etcd member status
# etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint status -w table
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT               | ID               | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://x.x.x.x:2379   | 683c58b549788bd9 | 3.5.15  | 30 MB   | false     | false      | 59        | 131325454  | 131325454          |        |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
1.2.3. Locate the root cause of the time problem
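NTP is configured on every node, and the nodes' clocks agree when compared. My reasoning was this: containerd has been around for many years, so a bug of this kind is unlikely; it is more likely that the time it saw while creating the container really was in the "future". Since the time looks normal now, the node's time synchronization must have corrected it afterwards. That is why I looked at the kernel messages from boot with dmesg:
# dmesg -T | grep -iEw "rtc|time|clock"
[Wed Jun 17 20:54:08 2026] vmware: Host bus clock speed read from hypervisor : 66000000 Hz
[Wed Jun 17 20:54:08 2026] vmware: using clock offset of 5629468472 ns
[Wed Jun 17 20:54:08 2026] PM: RTC time: 12:54:08, date: 2026-06-17
[Wed Jun 17 20:54:09 2026] PTP clock support registered
## The kernel sets the system clock (UTC)
[Wed Jun 17 20:54:10 2026] rtc_cmos 00:01: setting system clock to 2026-06-17T12:54:10 UTC (1781700850)
[Wed Jun 17 20:54:11 2026] Loaded X.509 cert 'Build time autogenerated kernel key: ...'
## About 40 seconds after the kernel set the time, systemd-journald recorded a backward time jump
[Wed Jun 17 20:54:50 2026] systemd-journald[525]: Time jumped backwards, rotating.
Next I exported the systemd-journald log with journalctl; the output is long, so I trimmed it. The timeline is 20:54:xx → 04:45:xx → 20:54:xx: containerd started while the host clock was set to the "future" (04:45), so the containers it created carry "future" timestamps as well.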
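Spotting such jumps by eye in a long journal is tedious. A minimal sketch, assuming the journal is exported with `journalctl -o short-unix` so that every line starts with an epoch timestamp; `find_clock_jumps` is a hypothetical helper and the 60-second default threshold is arbitrary.

```shell
#!/bin/sh
# find_clock_jumps [THRESHOLD_SECONDS]: read `journalctl -o short-unix` lines on stdin
# and report consecutive entries whose wall-clock stamps differ by more than the threshold.
# Lines that do not start with an epoch timestamp are ignored.
find_clock_jumps() {
    awk -v limit="${1:-60}" '
        $1 ~ /^[0-9]+(\.[0-9]+)?$/ {
            if (seen) {
                delta = $1 - prev
                if (delta > limit || delta < -limit)
                    printf "jump of %d s before: %s\n", delta, $0
            }
            prev = $1
            seen = 1
        }'
}

# Usage on a node:
#   journalctl -b -o short-unix --no-pager | find_clock_jumps 60
```

A negative jump is unambiguous. A large positive one can also just be a quiet period in the log, so check the surrounding entries before drawing conclusions.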
Note: this log does not directly prove that VMware changed the time. My reasoning is that only system services such as ssh and cron started before the clock changed to 04:45:xx, and they cannot change the time; VGAuthService is the service closest to the change that is able to do so, which makes it the prime suspect. The permanent fix described later confirmed this.
## The output is long; write it to a file and excerpt the relevant parts
# journalctl -b --no-pager > journalctl.txt
## -b: only logs from the current boot
## --no-pager: print everything without paging
## Around the 50-second mark there is a clear discontinuity: 20:54:xx → 04:45:xx → 20:54:xx
Jun 17 20:54:19 demo-1 kernel: DMI: VMware, Inc. VMware Virtual Platform/...
Jun 17 20:54:19 demo-1 kernel: vmware: hypercall mode: 0x02
Jun 17 20:54:19 demo-1 kernel: Hypervisor detected: VMware
Jun 17 20:54:19 demo-1 kernel: vmware: TSC freq read from hypervisor : 2600.000 MHz
Jun 17 20:54:19 demo-1 kernel: vmware: Host bus clock speed read from hypervisor : 66000000 Hz
Jun 17 20:54:19 demo-1 kernel: vmware: using clock offset of 5629468472 ns
Jun 17 20:54:19 demo-1 kernel: Booting paravirtualized kernel on VMware hypervisor
Jun 17 20:54:26 demo-1 VGAuthService[799]: Using '/var/lib/vmware/VGAuth/aliasStore' for alias store root directory
Jun 17 20:54:26 demo-1 VGAuthService[799]: LoadCatalogAndSchema: Using '/etc/vmware-tools/vgauth/schemas' for SAML schemas
Jun 17 20:54:26 demo-1 VGAuthService[799]: LoadPrefs: Allowing 300 of clock skew for SAML date validation
Jun 17 20:54:26 demo-1 VGAuthService[799]: SAML_Init: Using xmlsec1 1.2.39 for XML signature support
Jun 17 20:54:26 demo-1 VGAuthService[799]: ServiceNetworkCreateSocketDir: Created socket directory '/var/run/vmware'
Jun 17 20:54:26 demo-1 VGAuthService[799]: BEGIN SERVICE
Jun 17 20:54:26 demo-1 systemd[1]: Starting etcd.service - Etcd Service...
## containerd starts only after the host clock has been set to the "future"
Jun 18 04:45:57 demo-1 systemd-resolved[796]: Clock change detected. Flushing caches.
Jun 18 04:45:57 demo-1 systemd[1]: Started kubelet.service - kubelet: The Kubernetes Node Agent.
Jun 18 04:45:58 demo-1 systemd[1]: Starting containerd.service - containerd container runtime...
Jun 18 04:46:16 demo-1 containerd[919]: time="..." msg="CreateContainer ... for &ContainerMetadata{Name:etcd,Attempt:4,} returns container id \"2ea3cdadfec6f...\""
Jun 18 04:46:16 demo-1 containerd[919]: time="..." msg="StartContainer for \"2ea3cdadfec6f...\""
Jun 18 04:46:16 demo-1 systemd[1]: Started cri-containerd-2ea3cdadfec6f....scope - libcontainer container 2ea3cdadfec6f....
Jun 18 04:46:16 demo-1 containerd[919]: time="..." msg="StartContainer for \"2ea3cdadfec6f...\" returns successfully"
## The clock is changed again
Jun 17 20:54:50 demo-1 systemd-resolved[796]: Clock change detected. Flushing caches.
Jun 17 20:54:50 demo-1 systemd-journald[525]: Time jumped backwards, rotating.
Jun 17 20:54:50 demo-1 systemd-timesyncd[797]: Contacted time server 91.189.91.157:123 (ntp.ubuntu.com).
Jun 17 20:54:50 demo-1 systemd-timesyncd[797]: Initial clock synchronization to Wed 2026-06-17 20:54:50.365713 CST.
Jun 17 20:54:50 demo-1 systemd[1]: etcd.service: Scheduled restart job, restart counter is at 2.
Jun 17 20:54:50 demo-1 systemd[1]: Starting etcd.service - Etcd Service...
1.3. Solutions
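Based on the logs above, the suspected root cause is a time jump caused by VMware, which leaves two options:
- Change the boot order inside the virtual machine (workaround);
- Change the VMware time synchronization settings (permanent fix).
1.3.1. Workaround: change the boot order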
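Use this when you have no access to the VMware host. Create a service that waits for clock synchronization and make kubelet depend on it, so that containers are started only after the host time is back to normal. containerd needs the same dependency; otherwise the next problem in this document (problem 2) occurs.
What exactly this custom service does is not important; replacing it with sleep 60 achieves the same goal.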
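The waiting logic can be tried on its own before it is wrapped in a unit. A minimal sketch: `wait_for_sync` and `ntp_synced` are hypothetical names, and the check relies on `timedatectl show -p NTPSynchronized --value`, which needs a reasonably recent systemd.

```shell
#!/bin/sh
# wait_for_sync TRIES CHECK...: run CHECK up to TRIES times, one second apart.
# Returns 0 as soon as CHECK succeeds, 1 if it never does.
wait_for_sync() {
    tries="$1"
    shift
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@"; then
            return 0
        fi
        i=$(( i + 1 ))
        sleep 1
    done
    return 1
}

# ntp_synced: succeeds once systemd reports the clock as NTP-synchronized.
ntp_synced() {
    timedatectl show -p NTPSynchronized --value 2>/dev/null | grep -q yes
}

# Usage: wait up to 60 s, then carry on either way so kubelet is never blocked forever.
#   wait_for_sync 60 ntp_synced || echo "clock still unsynchronized after 60 s" >&2
```

Carrying on after the timeout is a design choice: a node that starts with a possibly wrong clock is usually easier to recover than one whose kubelet never starts.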
## 1. Use a drop-in instead of editing kubelet.service directly
## /usr/lib/systemd/system/kubelet.service is owned by the rpm/deb package and is overwritten on upgrade;
## a systemd drop-in (/etc/systemd/system/kubelet.service.d/) belongs to no package, similar to Helm custom values.
# systemctl status kubelet
## ● kubelet.service - kubelet: The Kubernetes Node Agent
##      Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; preset: enabled)
##     Drop-In: /usr/lib/systemd/system/kubelet.service.d/
##              └─10-kubeadm.conf
# mkdir -p /etc/systemd/system/kubelet.service.d/

## 2. Create the service that waits for clock synchronization
cat > /etc/systemd/system/wait-for-clock-sync.service <<'EOF'
[Unit]
Description=Wait for system clock to be synchronized
After=systemd-timesyncd.service network-online.target
Before=kubelet.service containerd.service
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/bash -c 'for i in $(seq 1 60); do if timedatectl show -p NTPSynchronized --value 2>/dev/null | grep -q yes; then exit 0; fi; sleep 1; done; exit 0'
TimeoutStartSec=70

[Install]
WantedBy=multi-user.target
EOF

## 3. Make kubelet depend on it
cat > /etc/systemd/system/kubelet.service.d/20-wait-time-sync.conf <<'EOF'
[Unit]
After=wait-for-clock-sync.service
Requires=wait-for-clock-sync.service
EOF

## 3b. containerd needs the same dependency (see the text above);
##     this drop-in simply mirrors the kubelet one and its path is an assumption
# mkdir -p /etc/systemd/system/containerd.service.d/
cat > /etc/systemd/system/containerd.service.d/20-wait-time-sync.conf <<'EOF'
[Unit]
After=wait-for-clock-sync.service
Requires=wait-for-clock-sync.service
EOF

## 4. Reload and enable
# systemctl daemon-reload
# systemctl enable --now wait-for-clock-sync.service

## 5. Verify the dependency chain
# systemctl list-dependencies --reverse wait-for-clock-sync.service
# systemctl status wait-for-clock-sync.service
1.3.2. Permanent fix: disable VMware time synchronization
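In this environment the time on the VMware host and in the virtual machine disagreed, and the guest clock was thrown off during boot. Following the official VMware blog, disable all time synchronization:
- Shut down the virtual machine;
- Edit the vmx configuration of that machine;
- Power it on.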
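After editing, it helps to confirm that none of the keys was missed. A minimal sketch, assuming the seven `time.synchronize.*` keys used in this section and the `"FALSE"` spelling used here; `check_vmx_time_sync` is a hypothetical helper that only reads the file.

```shell
#!/bin/sh
# The seven keys this section sets in the vmx file.
VMX_TIME_KEYS="time.synchronize.continue time.synchronize.restore time.synchronize.resume.disk time.synchronize.shrink time.synchronize.tools.startup time.synchronize.tools.enable time.synchronize.resume.host"

# check_vmx_time_sync FILE: print every key that is missing or not set to "FALSE".
# Returns 0 only when all keys are disabled.
check_vmx_time_sync() {
    file="$1"
    rc=0
    for key in $VMX_TIME_KEYS; do
        # Escape the dots so they match literally in the regular expression.
        pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
        if ! grep -Eiq "^[[:space:]]*${pattern}[[:space:]]*=[[:space:]]*\"FALSE\"" "$file"; then
            echo "not disabled: $key"
            rc=1
        fi
    done
    return "$rc"
}

# Usage (path taken from this section):
#   check_vmx_time_sync /vmfs/volumes/data-1/demo-1/demo-1.vmx
```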
## On the ESXi host, edit the vmx of the virtual machine
# grep -i 'time' /vmfs/volumes/data-1/demo-1/demo-1.vmx
time.synchronize.continue = "FALSE"
time.synchronize.restore = "FALSE"
time.synchronize.resume.disk = "FALSE"
time.synchronize.shrink = "FALSE"
time.synchronize.tools.startup = "FALSE"
time.synchronize.tools.enable = "FALSE"
time.synchronize.resume.host = "FALSE"

## After disabling and rebooting, dmesg no longer shows a time jump
# dmesg -T | grep -iEw "rtc|time|clock"
[Thu Jun 18 18:14:06 2026] vmware: Host bus clock speed read from hypervisor : 66000000 Hz
[Thu Jun 18 18:14:06 2026] vmware: using clock offset of 4177512396 ns
[Thu Jun 18 18:14:07 2026] PM: RTC time: 10:14:06, date: 2026-06-18
[Thu Jun 18 18:14:08 2026] PTP clock support registered
[Thu Jun 18 18:14:08 2026] rtc_cmos 00:01: setting system clock to 2026-06-18T10:14:08 UTC (1781777648)
[Thu Jun 18 18:14:08 2026] Loaded X.509 cert 'Build time autogenerated kernel key: ...'
[Thu Jun 18 18:14:09 2026] Loaded X.509 cert 'Build time autogenerated kernel key: ...'
2. Container name is reserved after a reboot and the container cannot be created
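2.1. Symptoms
The trigger is the same as for problem 1 (the time jump left containerd's data incompletely written), and the permanent fix is likewise the approach in 1.3. This section covers how to recover manually once the symptom has already appeared after a reboot.
This problem generally has one of two causes:
- Non-atomic writes: when creating a container, containerd writes two records (name + details); if a power loss, kernel panic, time jump or similar event interrupts it after only one is written, a leftover entry remains;
- A time jump (the root cause here).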
# kubectl get pods -n kube-system -o wide | grep demo-1
NAME                             READY   STATUS    RESTARTS   AGE    IP        NODE
coredns-dbbb9ff68-pzr4j          1/1     Running   4          5h8m   x.x.x.x   demo-1
etcd-demo-1                      0/1     Unknown   8          20m    x.x.x.x   demo-1
kube-apiserver-demo-1            0/1     Unknown   32         20m    x.x.x.x   demo-1
kube-controller-manager-demo-1   0/1     Unknown   20         20m    x.x.x.x   demo-1
kube-proxy-jgx42                 1/1     Running   0          34m    x.x.x.x   demo-1
kube-scheduler-demo-1            0/1     Unknown   19         20m    x.x.x.x   demo-1
node-local-dns-vgjlf             1/1     Running   0          28m    x.x.x.x   demo-1
2.2. Investigation
2.2.1. Check the underlying container and its log
The container was created but is in the Exited state, and its log file does not exist (the container never actually started):
# crictl ps -a | grep -i 'etcd'
CONTAINER       IMAGE           CREATED                  STATE    NAME   ATTEMPT   POD ID          POD
c93ac13a41b67   2e96e5913fc06   Less than a second ago   Exited   etcd   8         f2fc82e9a4d98   etcd-demo-1
## The log file does not exist
# crictl logs 624bf43aff2db
FATA[0000] failed to try resolving symlinks in path "/var/log/pods/kube-system_etcd-demo-1_c93ac13a41b67d89d5dbbfbc90cf9c8f/etcd/8.log": lstat /var/log/pods/kube-system_etcd-demo-1_c93ac13a41b67d89d5dbbfbc90cf9c8f/etcd/8.log: no such file or directory
2.2.2. Check the kubelet log
Containers are started by the kubelet, so look at the kubelet's errors. The log lines below all say the same thing: the kubelet wants to create etcd-demo-1, but that name is already taken (reserved) in containerd:
# journalctl -u kubelet --since "3 min ago" --no-pager | grep -i 'etcd'
Jun 18 23:02:36 demo-1 kubelet[23429]: E0618 23:02:36.675072 ... "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to reserve sandbox name \"etcd-demo-1_kube-system_df1fae7c70ff1a1dfc6127a8f7bf67a2_6\": name ... is reserved for \"7cc9d6627964...\""
Jun 18 23:02:36 demo-1 kubelet[23429]: E0618 23:02:36.675212 ... "Failed to create sandbox for pod" err="... failed to reserve sandbox name \"etcd-demo-1_kube-system_..._6\": name ... is reserved for \"7cc9d6627964...\"" pod="kube-system/etcd-demo-1"
Jun 18 23:02:36 demo-1 kubelet[23429]: E0618 23:02:36.675273 ... "CreatePodSandbox for pod failed" err="... failed to reserve sandbox name \"etcd-demo-1_kube-system_..._6\": name ... is reserved for \"7cc9d6627964...\"" pod="kube-system/etcd-demo-1"
Jun 18 23:02:36 demo-1 kubelet[23429]: E0618 23:02:36.675410 ... "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"etcd-demo-1_kube-system(...)\" with CreatePodSandboxError: ... name ... is reserved for \"7cc9d6627964...\"" pod="kube-system/etcd-demo-1"
2.3. Solution
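Addendum: the containerd community already has a matching issue, #10848, which is fixed in 2.2.x / 2.3.x (see PR #11576); upgrading prevents a recurrence.
Stop the kubelet service, manually remove the sandbox that holds the name, then restart containerd:
## Stop the kubelet, otherwise it keeps retrying the creation
# systemctl stop kubelet
## Find the sandbox that holds the name (look for the duplicate name)
# crictl pods                     ## list all sandboxes
# crictl pods -q --name <name>    ## get the matching ID
## Remove the leftover sandbox
# crictl stopp $ID
# crictl rmp -f $ID
# systemctl restart containerd
# systemctl start kubelet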
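The same steps can be wrapped so that an empty ID is never passed to crictl. This is a sketch that prints the commands by default; `run`, `cleanup_sandbox` and the `DRY_RUN` variable are names invented for this example, and the sandbox ID still has to be looked up by hand first.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# cleanup_sandbox ID: stop kubelet, remove one leftover sandbox, restart the services.
cleanup_sandbox() {
    id="$1"
    if [ -z "$id" ]; then
        echo "usage: cleanup_sandbox <sandbox-id>" >&2
        return 2
    fi
    run systemctl stop kubelet
    run crictl stopp "$id"
    run crictl rmp -f "$id"
    run systemctl restart containerd
    run systemctl start kubelet
}

# Usage: review the printed plan, then rerun with DRY_RUN=0 to execute it.
#   cleanup_sandbox <sandbox-id>
```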