Cilium VXLAN Mode Guide


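VXLAN mode places the fewest requirements on the underlying network infrastructure. In this mode, all cluster nodes form a full mesh of tunnels using a UDP-based encapsulation protocol (VXLAN or Geneve), and all traffic between Cilium nodes is encapsulated.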

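Network requirements:

  • Encapsulation relies on ordinary node-to-node connectivity. In other words, as long as the Cilium nodes can reach each other, all routing requirements are met;
  • The underlying network and firewalls must allow the encapsulated packets through: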
    • VXLAN 8472/UDP
    • Geneve 6081/UDP
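
As a rough illustration of the firewall requirement above, the sketch below prints (without executing) iptables rules that would admit the encapsulation traffic. The function names `tunnel_port` and `allow_rule` are hypothetical, and it assumes inbound traffic on the node is filtered in the iptables INPUT chain, which may not match your environment.

```shell
#!/bin/sh
# Map a Cilium tunnel protocol name to its default UDP port.
tunnel_port() {
  case "$1" in
    vxlan)  echo 8472 ;;
    geneve) echo 6081 ;;
    *)      echo "unknown tunnel protocol: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) an iptables rule admitting that protocol.
# Assumption: inbound filtering happens in the INPUT chain.
allow_rule() {
  port=$(tunnel_port "$1") || return 1
  echo "iptables -A INPUT -p udp --dport $port -j ACCEPT"
}

allow_rule vxlan
allow_rule geneve
```

Printing the rules instead of applying them keeps the sketch safe to run on any machine; review the output before pasting it into a real firewall configuration.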

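Advantages:

  • Simplicity: the network connecting the cluster nodes does not need to be aware of the PodCIDRs. Cluster nodes can span multiple routing or link-layer domains. The topology of the underlying network is irrelevant as long as the cluster nodes can reach each other over IP/UDP;
  • Addressing space: because the mode does not depend on any limitation of the underlying network, the available address space is much larger. If the PodCIDR size is configured accordingly, each node can run any number of Pods;
  • Node scaling: when running together with an orchestration system such as Kubernetes, information about every node in the cluster (including the PodCIDR allocated to each node) is synchronized automatically to all Cilium Agents. A new node that joins the cluster is incorporated into the tunnel mesh automatically, with no manual configuration;
  • Identity context: encapsulation protocols allow metadata to be carried along with the network packet. Cilium uses this ability to transfer metadata such as the source security identity. Identity transfer is an optimization that avoids one extra identity lookup on the remote node.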
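
To make the identity-transfer point concrete: in VXLAN mode, Cilium is generally understood to place the source security identity in the 24-bit VNI field of the VXLAN header. That detail is not stated in the text above, so treat it as an assumption to confirm with a packet capture. Under that assumption, the hypothetical helpers below convert between a numeric identity and the hexadecimal VNI as it would appear in a raw capture.

```shell
#!/bin/sh
# Assumption: the VXLAN VNI carries the 24-bit source security identity.

# Numeric identity -> 3-byte VNI written as 6 hex digits.
identity_to_vni_hex() {
  printf '%06x\n' "$1"
}

# 6 hex digits of a VNI -> numeric identity.
vni_hex_to_identity() {
  printf '%d\n' "0x$1"
}

identity_to_vni_hex 17110    # sample identity value
vni_hex_to_identity 0042d6
```

The value 17110 is only a sample; substitute whatever identity `cilium endpoint list` reports for the Pod you are tracing.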

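Disadvantages:

  • Because encapsulation headers are added, the MTU available for payload is lower than in native-routing mode (VXLAN adds 50 bytes of overhead per packet), which lowers the maximum throughput of a given network connection. Enabling jumbo frames largely mitigates this: the same 50 bytes is a noticeable share of a standard 1500-byte frame but a very small share of a 9000-byte jumbo frame.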
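
The arithmetic behind that comparison can be sketched in a few lines of shell. The function names are hypothetical, and the 50-byte default is the VXLAN overhead quoted above; other encapsulations would need a different value.

```shell
#!/bin/sh
# Share of each frame consumed by encapsulation overhead, in percent.
# Usage: overhead_pct <mtu> [overhead_bytes]   (default overhead: 50)
overhead_pct() {
  LC_ALL=C awk -v m="$1" -v o="${2:-50}" 'BEGIN { printf "%.2f\n", o * 100 / m }'
}

# MTU left for Pod traffic after the encapsulation headers are added.
# Usage: inner_mtu <mtu> [overhead_bytes]
inner_mtu() {
  echo $(( $1 - ${2:-50} ))
}

overhead_pct 1500   # standard frame -> 3.33
overhead_pct 9000   # jumbo frame    -> 0.56
inner_mtu 1500      # -> 1450
```

With a 1500-byte underlay the inner MTU works out to 1450 bytes, roughly 3.3% overhead per full-size frame; at 9000 bytes the overhead drops to about 0.6%.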

Deployment procedure

Use Kind to quickly create a cluster and deploy Cilium in VXLAN mode

#!/bin/bash
set -v

# 1. Prepare NoCNI kubernetes environment
cat <<EOF | kind create cluster --name=cilium-kubeproxy-vxlan --image=kindest/node:v1.27.3 --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  # kubeProxyMode: "none" # Enable KubeProxy
nodes:
  - role: control-plane
  - role: worker
  - role: worker
containerdConfigPatches:
EOF

# 2. Remove kubernetes node taints
controller_node_ip=`kubectl get node -o wide --no-headers | grep -E "control-plane|bpf1" | awk -F " " '{print $6}'`
kubectl taint nodes $(kubectl get nodes -o name | grep control-plane) node-role.kubernetes.io/control-plane:NoSchedule-

# 3. Install CNI [Cilium 1.17.15]
cilium_version=v1.17.15
docker pull quay.io/cilium/cilium:$cilium_version && docker pull quay.io/cilium/operator-generic:$cilium_version
kind load docker-image quay.io/cilium/cilium:$cilium_version quay.io/cilium/operator-generic:$cilium_version --name cilium-kubeproxy-vxlan
helm repo add cilium https://helm.cilium.io ; helm repo update;

# routingMode=tunnel
# tunnelProtocol=vxlan
helm install cilium cilium/cilium \
  --version v1.17.15 \
  --namespace kube-system \
  --set k8sServiceHost=$controller_node_ip \
  --set k8sServicePort=6443 \
  --set image.pullPolicy=IfNotPresent \
  --set debug.enabled=true \
  --set debug.verbose="datapath flow kvstore envoy policy" \
  --set bpf.monitorAggregation=none \
  --set monitor.enabled=true \
  --set ipam.mode=cluster-pool \
  --set cluster.name=cilium-kubeproxy-vxlan \
  --set routingMode=tunnel \
  --set tunnelProtocol=vxlan \
  --set ipv4NativeRoutingCIDR="10.0.0.0/8"

# 4. Separate namespace and cgroup v2 verify [https://github.com/cilium/cilium/pull/16259 && https://docs.cilium.io/en/stable/installation/kind/#install-cilium]
#for container in $(docker ps -a --format "table {{.Names}}" | grep cilium-kubeproxy-vxlan);do docker exec $container ls -al /proc/self/ns/cgroup;done
#mount -l | grep cgroup && docker info | grep "Cgroup Version" | awk '$1=$1'

Create test Pods

The test Pods are essentially Nginx; they exist only to serve requests that can be captured.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: nginx
  name: pod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: burlyluo/nettool:latest
        name: nettoolbox
        env:
        - name: NETTOOL_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          privileged: true
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: nginx
            topologyKey: kubernetes.io/hostname
---
apiVersion: v1
kind: Service
metadata:
  name: pod
spec:
  type: NodePort
  selector:
    app: nginx
  ports:
  - name: http
    port: 80
    targetPort: 80
    nodePort: 32000

Check the deployment result

root@network-demo:~# kubectl get pods -A -o wide
NAMESPACE     NAME                               READY   STATUS    RESTARTS   AGE     IP           NODE
default       pod-0                              1/1     Running   0          53s     10.0.2.140   cilium-kubeproxy-vxlan-worker2
default       pod-1                              1/1     Running   0          47s     10.0.1.97    cilium-kubeproxy-vxlan-worker
default       pod-2                              1/1     Running   0          42s     10.0.0.33    cilium-kubeproxy-vxlan-control-plane
kube-system   cilium-764sh                       2/2     Running   0          5m32s   172.18.0.2   cilium-kubeproxy-vxlan-worker
kube-system   cilium-7bc47                       2/2     Running   0          5m32s   172.18.0.4   cilium-kubeproxy-vxlan-worker2
kube-system   cilium-envoy-ljdnj                 1/1     Running   0          5m32s   172.18.0.2   cilium-kubeproxy-vxlan-worker
kube-system   cilium-envoy-p4jx8                 1/1     Running   0          5m32s   172.18.0.4   cilium-kubeproxy-vxlan-worker2
kube-system   cilium-envoy-wkb5m                 1/1     Running   0          5m32s   172.18.0.3   cilium-kubeproxy-vxlan-control-plane
kube-system   cilium-j8sb5                       2/2     Running   0          5m32s   172.18.0.3   cilium-kubeproxy-vxlan-control-plane
kube-system   cilium-operator-7bfd9d69f4-nckns   1/1     Running   0          5m32s   172.18.0.2   cilium-kubeproxy-vxlan-worker
kube-system   cilium-operator-7bfd9d69f4-tdrdv   1/1     Running   0          5m32s   172.18.0.4   cilium-kubeproxy-vxlan-worker2
kube-system   coredns-5d78c9869d-7ttv2           1/1     Running   0          7m16s   10.0.1.30    cilium-kubeproxy-vxlan-worker
kube-system   coredns-5d78c9869d-d7ls6           1/1     Running   0          7m16s   10.0.1.226   cilium-kubeproxy-vxlan-worker
kube-system   etcd-cilium                        1/1     Running   0          7m30s   172.18.0.3   cilium-kubeproxy-vxlan-control-plane
kube-system   kube-apiserver-cilium              1/1     Running   0          7m31s   172.18.0.3   cilium-kubeproxy-vxlan-control-plane
kube-system   kube-controller-manager-cilium     1/1     Running   0          7m30s   172.18.0.3   cilium-kubeproxy-vxlan-control-plane
kube-system   kube-proxy-4dpp6                   1/1     Running   0          7m11s   172.18.0.4   cilium-kubeproxy-vxlan-worker2
kube-system   kube-proxy-8x95v                   1/1     Running   0          7m9s    172.18.0.2   cilium-kubeproxy-vxlan-worker
kube-system   kube-proxy-gx2hj                   1/1     Running   0          7m16s   172.18.0.3   cilium-kubeproxy-vxlan-control-plane
kube-system   kube-scheduler-cilium              1/1     Running   0          7m31s   172.18.0.3   cilium-kubeproxy-vxlan-control-plane

Inspect Cilium details

1. Inspect the detailed Cilium runtime status

root@network-demo:~# kubectl exec -it -n kube-system cilium-j8sb5 -- cilium status
KVStore:                 Disabled
Kubernetes:              Ok         1.27 (v1.27.3) [linux/amd64]
Kubernetes APIs:         ["EndpointSliceOrEndpoint", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumCIDRGroup", "core/v1::Namespace", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
## Cilium is not replacing the Kubernetes kube-proxy
KubeProxyReplacement:    False
Host firewall:           Disabled
SRv6:                    Disabled
CNI Chaining:            none
CNI Config file:         successfully wrote CNI configuration file to /host/etc/cni/net.d/05-cilium.conflist
Cilium:                  Ok   1.17.15 (v1.17.15-4206eaa5)
NodeMonitor:             Listening for events on 8 CPUs with 64x4096 of shared memory
Cilium health daemon:    Ok
IPAM:                    IPv4: 3/254 allocated from 10.0.0.0/24,
IPv4 BIG TCP:            Disabled
IPv6 BIG TCP:            Disabled
BandwidthManager:        Disabled
## VXLAN network mode
Routing:                 Network: Tunnel [vxlan]   Host: Legacy
Attach Mode:             TCX
Device Mode:             veth
Masquerading:            IPTables [IPv4: Enabled, IPv6: Disabled]
Controller Status:       26/26 healthy
Proxy Status:            OK, ip 10.0.0.23, 0 redirects active on ports 10000-20000, Envoy: external
Global Identity Range:   min 256, max 65535
Hubble:                  Ok              Current/Max Flows: 4095/4095 (100.00%), Flows/s: 44.59   Metrics: Disabled
Encryption:              Disabled
Cluster health:          3/3 reachable   (2026-05-05T08:59:02Z)
Name                     IP              Node   Endpoints
Modules Health:          Stopped(0) Degraded(0) OK(52)

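2. Inspect Cilium Endpoints

In Cilium, the term Endpoint has the following meaning: Cilium assigns IP addresses to containers. A Pod can contain several containers, all of which share the same Pod IP. All containers that share one address are grouped together, and Cilium calls that group an Endpoint.

The Cilium Agent on each node manages only the Endpoints on that node, so the output of cilium endpoint list differs from node to node. This example uses the Cilium Pod on the control-plane node: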

root@network-demo:~# kubectl exec -it -n kube-system cilium-j8sb5 -- cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                      IPv4         STATUS
           ENFORCEMENT        ENFORCEMENT
467        Disabled           Disabled          17110      k8s:app=nginx                                                    10.0.0.33    ready
                                                           k8s:io.cilium.k8s.namespace/metadata.name=default
                                                           k8s:io.cilium.k8s.policy.cluster=cilium-kubeproxy-vxlan
                                                           k8s:io.cilium.k8s.policy.serviceaccount=default
                                                           k8s:io.kubernetes.pod.namespace=default
1364       Disabled           Disabled          1          k8s:node-role.kubernetes.io/control-plane                                     ready
                                                           k8s:node.kubernetes.io/exclude-from-external-load-balancers
                                                           reserved:host
3682       Disabled           Disabled          4          reserved:health                                                  10.0.0.204   ready

3. Inspect Cilium Services

In Cilium, the term Service refers to the actual forwarding state of a Kubernetes Service in Cilium's eBPF maps. The official documentation notes that when Cilium does not replace kube-proxy, only load balancing for ClusterIP services is enabled:

By default, Helm sets kubeProxyReplacement=false, which only enables per-packet in-cluster load-balancing of ClusterIP services.

root@network-demo:~# kubectl exec -it -n kube-system cilium-j8sb5 -- cilium service list
ID   Frontend                Service Type   Backend
1    10.96.0.1:443/TCP       ClusterIP      1 => 172.18.0.3:6443/TCP (active)
2    10.96.153.200:443/TCP   ClusterIP      1 => 172.18.0.3:4244/TCP (active)
3    10.96.0.10:53/UDP       ClusterIP      1 => 10.0.1.30:53/UDP (active)
                                            2 => 10.0.1.226:53/UDP (active)
4    10.96.0.10:53/TCP       ClusterIP      1 => 10.0.1.30:53/TCP (active)
                                            2 => 10.0.1.226:53/TCP (active)
5    10.96.0.10:9153/TCP     ClusterIP      1 => 10.0.1.30:9153/TCP (active)
                                            2 => 10.0.1.226:9153/TCP (active)
7    10.96.128.232:80/TCP    ClusterIP      1 => 10.0.2.140:80/TCP (active)
                                            2 => 10.0.1.97:80/TCP (active)
                                            3 => 10.0.0.33:80/TCP (active)

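Verification

Inspect Cilium host routes, network devices, and tunnel information

1. Inspect the network devices on a Cilium node

1.1. Inspect the cilium_host device

The output shows that cilium_host is one end of a veth pair, not a VXLAN device: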

root@network-demo:~# docker exec -it cilium-kubeproxy-vxlan-control-plane ip address show cilium_host
5: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 0a:c2:82:31:a5:f9 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.23/32 scope global cilium_host
       valid_lft forever preferred_lft forever
root@network-demo:~# docker exec -it cilium-kubeproxy-vxlan-control-plane ip -d link show cilium_host
5: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 0a:c2:82:31:a5:f9 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
    veth addrgenmode eui64 numtxqueues 8 numrxqueues 8 gso_max_size 65536 gso_max_segs 65535

1.2. Inspect the cilium_vxlan device

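The real VXLAN device is cilium_vxlan. Compared with the VXLAN mode of other CNIs, the cilium_vxlan device specifies no local/remote information; the external keyword appears in that position instead. It means that forwarding decisions for this VXLAN device are supplied by an external control plane rather than by the device's internal FDB and kernel learning. Here the "external program" is Cilium's eBPF program.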

root@network-demo:~# docker exec -it cilium-kubeproxy-vxlan-control-plane ip address show cilium_vxlan
6: cilium_vxlan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether da:36:b5:e9:0a:30 brd ff:ff:ff:ff:ff:ff
root@network-demo:~# docker exec -it cilium-kubeproxy-vxlan-control-plane ip -d link show cilium_vxlan
6: cilium_vxlan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether da:36:b5:e9:0a:30 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
    vxlan external addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

For comparison, the Calico VXLAN device on another host:

root@ce-demo-1:~# ip address show vxlan.calico
30499: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 66:e0:bb:93:52:4f brd ff:ff:ff:ff:ff:ff
    inet 10.244.142.0/32 scope global vxlan.calico
       valid_lft forever preferred_lft forever
root@ce-demo-1:~# ip -d link show vxlan.calico
30499: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 66:e0:bb:93:52:4f brd ff:ff:ff:ff:ff:ff promiscuity 0 allmulti 0 minmtu 68 maxmtu 65535
    info: Using default fan map value (33)
    ## Calico uses local 10.51.0.100 to make this VXLAN device use that local IP as the source IP of the outer VXLAN header
    vxlan id 4096 local 10.51.0.100 dev ens160 srcport 0 0 dstport 4789 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 tso_max_size 65536 tso_max_segs 65535 gro_max_size 65536
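
The difference between the two devices can be checked mechanically. The sketch below is a hypothetical helper (the name `vxlan_mode` and the labels it prints are illustrative) that reads `ip -d link show <dev>` output on standard input and reports whether the `external` keyword follows `vxlan`.

```shell
#!/bin/sh
# Classify a VXLAN device from `ip -d link show <dev>` output on stdin.
# Prints "external" when the device relies on an external control plane,
# otherwise "configured" (VNI/local address set on the device itself).
vxlan_mode() {
  if grep -Eq 'vxlan[[:space:]]+external'; then
    echo external
  else
    echo configured
  fi
}

# Sample lines taken from the two outputs above.
echo 'vxlan external addrgenmode eui64 numtxqueues 1' | vxlan_mode
echo 'vxlan id 4096 local 10.51.0.100 dev ens160 dstport 4789 nolearning' | vxlan_mode
```

On a live node the same check would be `ip -d link show cilium_vxlan | vxlan_mode`.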

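2. Inspect the routing table on a Cilium node

The routing table suggests that cross-node Pod traffic is forwarded through cilium_host rather than through the VXLAN device.

In practice, however, cross-node Pod traffic never passes through cilium_host. The eBPF program cil_from_container, attached on the ingress side of the Pod's lxc veth device, sees that the destination IP is not a Pod IP on the local node. It looks up the cilium_tunnel map to find which node the remote Pod IP should be forwarded to, then hands the packet directly to the cilium_vxlan device for VXLAN encapsulation.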
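
The lookup described above can be approximated in shell as a rough mental model. This is only a sketch: the real decision is made by eBPF code in the kernel, the function name `tunnel_endpoint` is hypothetical, the entries are hard-coded from this cluster's control-plane node, and the prefix match is valid only because every PodCIDR here is a /24.

```shell
#!/bin/sh
# Model of the tunnel lookup on the control-plane node (PodCIDR 10.0.0.0/24).
# Usage: tunnel_endpoint <destination-pod-ip>
# Prints the remote node IP to encapsulate towards, or "local" when the
# destination has no tunnel entry and is therefore delivered on this node.
tunnel_endpoint() {
  # Assumption: all PodCIDRs are /24, so the network is the first 3 octets.
  case "${1%.*}.0" in
    10.0.1.0) echo 172.18.0.2 ;;   # cilium-kubeproxy-vxlan-worker
    10.0.2.0) echo 172.18.0.4 ;;   # cilium-kubeproxy-vxlan-worker2
    *)        echo local ;;
  esac
}

tunnel_endpoint 10.0.2.140   # pod-0 on worker2
tunnel_endpoint 10.0.1.97    # pod-1 on worker
tunnel_endpoint 10.0.0.33    # pod-2 on this node
```

A packet to 10.0.2.140 would therefore be encapsulated towards 172.18.0.4, while 10.0.0.33 stays on the local node.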

root@cilium-kubeproxy-vxlan-control-plane:/# ip route show
default via 172.18.0.1 dev eth0
10.0.0.0/24 via 10.0.0.23 dev cilium_host proto kernel src 10.0.0.23
10.0.0.23 dev cilium_host proto kernel scope link
10.0.1.0/24 via 10.0.0.23 dev cilium_host proto kernel src 10.0.0.23 mtu 1450
10.0.2.0/24 via 10.0.0.23 dev cilium_host proto kernel src 10.0.0.23 mtu 1450
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.3

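3. Inspect the cilium_tunnel map

cilium_tunnel_m (the map name as bpftool shows it, truncated to 15 characters) is the core forwarding table for Cilium's VXLAN tunnels. This eBPF map tells the eBPF program which node IP a VXLAN packet must be sent to in order to reach a given remote PodCIDR.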

root@cilium-kubeproxy-vxlan-control-plane:/# bpftool map show name cilium_tunnel_m
2026: hash  name cilium_tunnel_m  flags 0x1
        key 20B  value 20B  max_entries 65536  memlock 1050000B
        pids cilium-agent(386835)
2046: hash  name cilium_tunnel_m  flags 0x1
        key 20B  value 20B  max_entries 65536  memlock 1050000B
2108: hash  name cilium_tunnel_m  flags 0x1
        key 20B  value 20B  max_entries 65536  memlock 1050000B
root@cilium-kubeproxy-vxlan-control-plane:/# bpftool map dump id 2026
key:   0a 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
value: ac 12 00 04 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
key:   0a 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
value: ac 12 00 02 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
root@cilium-kubeproxy-vxlan-control-plane:/# bpftool map dump id 2046
key:   0a 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
value: ac 12 00 04 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
key:   0a 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
value: ac 12 00 03 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
root@cilium-kubeproxy-vxlan-control-plane:/# bpftool map dump id 2108
key:   0a 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
value: ac 12 00 03 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
key:   0a 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
value: ac 12 00 02 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00

Converted to dotted-decimal notation, and taking the control-plane node as the example, the entries read:

root@network-demo:~# kubectl exec -it -n kube-system cilium-j8sb5 -- cilium bpf tunnel list
TUNNEL     VALUE
10.0.2.0   172.18.0.4:0
10.0.1.0   172.18.0.2:0
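
The same translation can be done by hand from the raw bpftool dump. The sketch below assumes, based on the dumps shown earlier, that the first four bytes of each 20-byte key or value hold the IPv4 address; the function name `hex_to_ipv4` is hypothetical.

```shell
#!/bin/sh
# Convert the first four space-separated hex bytes of a bpftool dump line
# into a dotted-decimal IPv4 address. The remaining bytes are ignored.
hex_to_ipv4() {
  # Intentionally unquoted: split the byte string into positional parameters.
  set -- $1
  printf '%d.%d.%d.%d\n' "0x$1" "0x$2" "0x$3" "0x$4"
}

# First key/value pair from `bpftool map dump id 2026`.
hex_to_ipv4 "0a 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00"
hex_to_ipv4 "ac 12 00 04 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00"
```

The key decodes to 10.0.2.0 and the value to 172.18.0.4, matching the first row of the cilium bpf tunnel list output.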

Packet capture

1. Capture at the Pod