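Kubernetes Programming / Operator Series [Zuo Yang's In-Depth Lectures]: Client-go Source Code Analysis: Controller Debugging and Diagnostic Tools, from Log Analysis to Problem Localization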

news 2026/6/13 18:31:08

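When developing a Kubernetes Controller, the most frustrating situation is this: the program is running, but you cannot tell what it is doing. Are events arriving? How many items are backed up in the WorkQueue? Why does Reconcile keep failing?

This article works through Controller debugging and diagnosis systematically: kubectl commands, log analysis, metrics, and event inspection.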

Kubernetes  Debugging and diagnostics  kubectl  v1.36.1

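🔓 Study focus: read the whole article first, then revisit the items marked below

★ Key points (must master)
   • kubectl describe/logs: the basic ways to inspect resources and logs
   • kubectl debug: how to get into a Pod to debug network problems
   • kubectl api-resources: list every resource type the cluster supports

☆ Secondary points (good to know)
   • kubectl get events: querying cluster events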

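1. Log Analysis: The First Step in Controller Debugging

Logs are the primary source of information when debugging a Controller. Kubernetes Controllers usually log through klog (the Kubernetes logging library), which emits messages at severities such as INFO, WARNING, and ERROR.

Viewing Controller logs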

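# View the Controller Pod's logs
kubectl logs -n <namespace> <controller-pod-name>

# Follow the logs in real time
kubectl logs -n <namespace> <controller-pod-name> -f

# View logs from the previous container instance (if the Pod has restarted)
kubectl logs -n <namespace> <controller-pod-name> --previous

# Show only lines that mention "error" (in klog's text format, ERROR lines also start with "E")
kubectl logs -n <namespace> <controller-pod-name> | grep -i error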

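Key log keywords

When analyzing Controller logs, watch for the following keywords:

Keyword	Meaning	Troubleshooting advice
"Waiting for caches to sync"	The Informer caches are still syncing	Normal startup log
"error syncing"	Reconcile failed	Check the specific error message and the key
"requeued after rate limit"	The item was re-enqueued through the rate limiter	Expected behavior; it means the previous attempt to process the item failed
"failed to get"	Fetching a resource from the Lister failed	The cache may not have finished syncing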

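As a rough way to apply the keyword table above, the sketch below counts how often each keyword appears in a log stream read from stdin. The function name `summarize_controller_log` and the short labels in its output are hypothetical, and the exact message wording depends on the Controller, so adjust the patterns to match what your code logs.

```shell
# Count occurrences of the keywords from the table in a log stream on stdin.
# Hypothetical usage:
#   kubectl logs -n <namespace> <controller-pod-name> | summarize_controller_log
summarize_controller_log() {
  awk '
    /Waiting for caches to sync/ { n["cache-sync"]++ }
    /error syncing/              { n["sync-error"]++ }
    /requeued after rate limit/  { n["requeued"]++ }
    /failed to get/              { n["get-failed"]++ }
    END {
      printf "cache-sync=%d sync-error=%d requeued=%d get-failed=%d\n",
        n["cache-sync"], n["sync-error"], n["requeued"], n["get-failed"]
    }
  '
}
```

A line can match more than one keyword, so the counts are per keyword rather than per line. A high sync-error count next to a cache-sync count of zero suggests the log was truncated before startup, in which case `--previous` may help.
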
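2. kubectl describe: Inspecting Resource Details and Events

kubectl describe shows a resource's detailed state together with its associated events, which makes it one of the most useful troubleshooting commands.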

# Show the Controller Pod's details (including Events)
kubectl describe pod -n <namespace> <controller-pod-name>

# Show the Deployment's details (including how many Pods were created)
kubectl describe deployment -n <namespace> <controller-deployment-name>

# Show the ReplicaSet (the object that directly manages the Pods)
kubectl describe replicaset -n <namespace> <controller-replicaset-name>

# Show a resource's OwnerReferences (to confirm what created the Pod)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.ownerReferences}' | jq

Reading the Events field

The Events section of the describe output is often overlooked, yet it carries a lot of valuable information:

Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  5m    default-scheduler  Successfully assigned my-pod to node-1
  Normal   Pulling    5m    kubelet, node-1    Pulling image "nginx:latest"
  Normal   Pulled     4m    kubelet, node-1    Successfully pulled image "nginx:latest"
  Normal   Created    4m    kubelet, node-1    Created container nginx
  Normal   Started    4m    kubelet, node-1    Started container nginx
  Warning  BackOff    2m    kubelet, node-1    Back-off restarting failed container
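
In a listing like the one above, the Warning rows are usually the interesting ones. A minimal sketch for pulling them out of `kubectl describe` output follows. The function name `describe_warnings` is hypothetical, and the filter assumes the Events section is the last section of the output, which is the case for Pods.

```shell
# Print only the Warning rows from the Events section of `kubectl describe` output.
# Hypothetical usage:
#   kubectl describe pod -n <namespace> <controller-pod-name> | describe_warnings
describe_warnings() {
  awk '
    /^Events:/ { in_events = 1; next }   # everything after this line is event rows
    in_events && $1 == "Warning"         # keep rows whose Type column is Warning
  '
}
```

For the sample output above, this would leave only the BackOff row.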

3. kubectl get events: Inspecting Cluster Events

Events record significant occurrences in a Kubernetes cluster and help trace how resources change over time.

# List events sorted by time (most recent last)
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Show only Warning events
kubectl get events -n <namespace> --field-selector type=Warning

# Show all events for a resource with a given name
kubectl get events -n <namespace> --field-selector involvedObject.name=<resource-name>

# Show events for one specific object, matched by UID
# (this selects events about that object itself, not about the Pods it owns)
kubectl get events -n <namespace> --field-selector involvedObject.namespace=<namespace>,involvedObject.uid=<controller-uid>

# Print selected fields of each event as JSON
kubectl get events -n <namespace> -o json | jq '.items[] | {reason: .reason, message: .message, involvedObject: .involvedObject.name}'

4. kubectl api-resources: Understanding the Cluster's Resource Types

kubectl api-resources lists every resource type the cluster supports, both built-in resources and CRDs.

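# List all resource types (compact output)
kubectl api-resources

# List all resource types with extra columns (including VERBS)
kubectl api-resources -o wide

# Show only the resources in a specific API group
kubectl api-resources --api-group=apps
kubectl api-resources --api-group=example.com  # a custom API group

# Show the verbs (operations) a resource supports
kubectl api-resources -o wide | grep pods

# List the names of resources in the apps group that support the get verb
kubectl api-resources --api-group=apps --verbs=get -o name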

Sample output:

NAME                        SHORTNAMES   APIVERSION                NAMESPACED   KIND
pods                        po           v1                        true         Pod
deployments                 deploy       apps/v1                   true         Deployment
configmaps                  cm           v1                        true         ConfigMap
customresourcedefinitions   crd          apiextensions.k8s.io/v1   false        CustomResourceDefinition
foos                                     example.com/v1            true         Foo

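5. kubectl debug: Advanced Debugging Techniques

kubectl debug lets you run debugging commands directly against a running Pod. Its ephemeral-container mode needs Kubernetes 1.23 or later, where ephemeral containers are enabled by default (they became stable in 1.25).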

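# Start an interactive shell in an ephemeral debug container attached to the Controller Pod
kubectl debug -it <controller-pod-name> -n <namespace> --image=busybox -- sh

# Run a one-off command from an ephemeral container whose image ships the tool you need
kubectl debug <controller-pod-name> -n <namespace> --image=curlimages/curl -- curl -k https://kubernetes.default.svc/healthz

# Copy the Pod and add a debug container that shares the process namespace (useful for startup problems)
kubectl debug -it <controller-pod-name> -n <namespace> --image=busybox --share-processes --copy-to=debug-pod -- sh

# Inspect processes in the copy (requires --share-processes and a ps binary in the container image)
kubectl exec -it debug-pod -n <namespace> -- ps aux

# Network debugging: check that the API server's Service name resolves (the image must include nslookup)
kubectl exec -it <controller-pod-name> -n <namespace> -- sh -c "nslookup kubernetes.default.svc"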

6. Troubleshooting Checklist for Common Problems

Problem 1: The Controller does not start

Troubleshooting steps:

# 1. Check whether the Pod exists
kubectl get pods -n <namespace> | grep controller

# 2. Check the Pod's status
kubectl get pod <controller-pod> -n <namespace> -o wide

# 3. Find out why the Pod failed to start
kubectl describe pod <controller-pod> -n <namespace> | tail -50

# 4. Check the image-pull events
kubectl describe pod <controller-pod> -n <namespace> | grep -A5 "Events:"

Problem 2: The Controller hangs after starting

Troubleshooting steps:

# 1. Check the logs to see whether the Controller is waiting for the cache to sync
kubectl logs <controller-pod> -n <namespace> | grep -i "cache\|sync"

# 2. Confirm that the API server is reachable (the container image must include curl)
kubectl exec <controller-pod> -n <namespace> -- curl -k https://kubernetes.default.svc/healthz

# 3. Check whether the RBAC permissions are sufficient
kubectl auth can-i --list --namespace=<namespace> --as=system:serviceaccount:<namespace>:<sa-name>
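
The `--list` output in step 3 can be long. When the question is whether the Controller's ServiceAccount may watch and update one particular resource type, a verb-by-verb check is often easier to read. The sketch below is one way to do that. The function name `check_controller_rbac`, the verb list, and the `KUBECTL` override variable are assumptions made for illustration; it relies on `kubectl auth can-i` exiting with a non-zero status when the answer is "no".

```shell
# Report, verb by verb, whether a ServiceAccount may act on a resource type.
# KUBECTL is a hypothetical override (for example a wrapper script); it defaults to kubectl.
# Hypothetical usage:
#   check_controller_rbac <namespace> <sa-name> <resource-type>
check_controller_rbac() {
  ns="$1"; sa="$2"; resource="$3"
  for verb in get list watch update patch; do
    if "${KUBECTL:-kubectl}" auth can-i "$verb" "$resource" \
         --namespace="$ns" --as="system:serviceaccount:$ns:$sa" >/dev/null 2>&1; then
      echo "$verb $resource: allowed"
    else
      echo "$verb $resource: DENIED"
    fi
  done
}
```

A missing list or watch permission is a typical reason for an Informer that never finishes syncing, so those two verbs are worth checking first.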

Problem 3: Reconcile keeps failing

Troubleshooting steps:

# 1. Search the logs for errors
kubectl logs <controller-pod> -n <namespace> | grep -i "error\|failed\|cannot"

# 2. Check the WorkQueue backlog (if metrics are exposed)
#    port-forward blocks, so run it in a separate terminal or in the background
kubectl port-forward <controller-pod> 8080:8080 -n <namespace>
curl http://localhost:8080/metrics | grep workqueue

# 3. Check whether the target resource exists
kubectl get <resource-type> <resource-name> -n <namespace>

# 4. Check whether the resource has finalizers or other protection mechanisms
kubectl get <resource-type> <resource-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'
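
The `grep workqueue` in step 2 returns every workqueue series. To focus on the backlog itself, a small filter over the metrics text can help. The sketch below assumes the Controller exposes the client-go workqueue gauge under the name `workqueue_depth`, as is commonly the case when a Prometheus metrics provider is registered, and that samples carry no trailing timestamp. The function name `workqueue_backlog` is hypothetical.

```shell
# Print workqueue_depth samples whose value exceeds a threshold (default 0).
# Reads Prometheus text-format metrics on stdin.
# Hypothetical usage:
#   curl -s http://localhost:8080/metrics | workqueue_backlog 10
workqueue_backlog() {
  awk -v limit="${1:-0}" '
    # Match only sample lines for the gauge, not HELP/TYPE comments or other series.
    /^workqueue_depth[{ ]/ && $NF + 0 > limit + 0 { print $0 }
  '
}
```

A depth that stays high across several scrapes suggests the workers cannot keep up or that items keep failing and being retried.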

7. Tuning the Log Level

Adjusting the log level gives you more detailed or more concise log output.

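# Set the log level in the Pod spec
# -v=0: minimal logging (default)
# -v=2: steady-state information and important events
# -v=4: debug-level detail, such as Informer activity
# -v=6: detailed debugging information, including client-go HTTP requests
# -v=10: maximum log level
spec:
  containers:
  - name: controller
    args:
    - --v=4  # set the log level to 4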

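🌟 Practical tip
In production, -v=2 or -v=4 is recommended: it provides enough information without the log volume hurting performance. When debugging, raise the level temporarily to -v=6 or higher.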