Cilium Traffic Analysis (Part 1)

This series analyzes Cilium traffic flow against a development environment and the source code, and skips some features such as Overlay and L7 Policy. The order follows the diagrams in the official Life of a Packet documentation; this post covers the Endpoint to Endpoint (socket) path.

Environment Setup

Using the Cilium 1.12.0 code base, a single-node Cilium Kubernetes cluster is deployed from a Mac with:

K8S=1 NO_BUILD=1 TUNNEL_MODE_STRING=disabled SERVER_BOX=cilium/ubuntu-next-cgroupv2 SERVER_VERSION=0 contrib/vagrant/start.sh

Notes:
1) make build was run beforehand, otherwise compiling inside the VM is very slow.
2) The cilium startup parameters in the shell scripts were modified (diffs below), mainly to: disable the tunnel (VXLAN by default), enable Cilium's kube-proxy replacement, and enable sockops.

--- a/contrib/vagrant/scripts/03-install-kubernetes-worker.sh
+++ b/contrib/vagrant/scripts/03-install-kubernetes-worker.sh
@@ -305,10 +305,10 @@ EOF

 log "reloading systemctl daemon and enabling and restarting kube-proxy"
 sudo systemctl daemon-reload
-sudo systemctl enable kube-proxy
-sudo systemctl restart kube-proxy
+# sudo systemctl enable kube-proxy
+# sudo systemctl restart kube-proxy

-sudo systemctl status kube-proxy --no-pager
+# sudo systemctl status kube-proxy --no-pager

--- a/contrib/vagrant/start.sh
+++ b/contrib/vagrant/start.sh
@@ -306,7 +306,7 @@ function write_cilium_cfg() {

     cilium_options="\
       --debug --pprof --enable-hubble --hubble-listen-address :4244 --enable-k8s-event-handover \
-      --k8s-require-ipv4-pod-cidr --enable-bandwidth-manager --kube-proxy-replacement=disabled \
+      --k8s-require-ipv4-pod-cidr --enable-bandwidth-manager --kube-proxy-replacement=strict --cgroup-root=/sys/fs/cgroup --sockops-enable \
       --enable-remote-node-identity"
     cilium_operator_options=" --debug"

3) A new box, cilium/ubuntu-next-cgroupv2, was built: it is cilium/ubuntu-next with the cgroup v2 unified mode enabled, since BPF_PROG_TYPE_CGROUP_XXX programs require cgroup v2.

Kubernetes needs some additional configuration to use cgroup v2.

After the environment is up, deploy pod/nc-k8s1 and service/nc-k8s1.

$ kubectl get po -o wide
NAME      READY   STATUS    RESTARTS   AGE   IP           NODE   NOMINATED NODE   READINESS GATES
nc-k8s1   1/1     Running   0          5s    10.11.0.99   k8s1   <none>           <none>
$ kubectl get node -o wide
NAME   STATUS   ROLES    AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION         CONTAINER-RUNTIME
k8s1   Ready    <none>   61m   v1.24.2   192.168.60.11   <none>        Ubuntu 20.04.4 LTS   5.18.0-g7e062cda7d90   containerd://1.6.3
$ systemctl status kube-proxy
● kube-proxy.service - Kubernetes Kube-Proxy Server
     Loaded: loaded (/etc/systemd/system/kube-proxy.service; disabled; vendor preset: enabled)
     Active: inactive (dead)
       Docs: https://kubernetes.io/docs/concepts/overview/components/#kube-proxy
             https://kubernetes.io/docs/reference/generated/kube-proxy/
$ kubectl get svc -o wide
NAME         TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE    SELECTOR
kubernetes   ClusterIP   172.20.0.1    <none>        443/TCP        8m2s   <none>
nc-k8s1      NodePort    172.20.0.84   <none>        80:32438/TCP   9s     app=nc-k8s1

Endpoint to Endpoint (socket)

Traffic Diagram

This post mainly analyzes the socket diagram below; the tc part will be analyzed later. In the diagram, bpf_sockops.c and bpf_redir.c (BPF_PROG_TYPE_SOCK_OPS and BPF_PROG_TYPE_SK_MSG programs, respectively) work together to implement a socket-level redirect: bpf_sockops.c maintains a map of type BPF_MAP_TYPE_SOCKHASH, and bpf_redir.c redirects data to the matching socket based on that map.

Implementation

Inspect bpf_redir, bpf_sockops, and the map:

$ bpftool prog show pinned /sys/fs/bpf/bpf_redir
3228: sk_msg  name bpf_redir_proxy  tag 2dfc83bbb7ceae9b  gpl
	loaded_at 2022-08-05T06:58:48+0000  uid 0
	xlated 1064B  jited 590B  memlock 4096B  map_ids 402,406,422
	btf_id 233
$ bpftool prog show pinned /sys/fs/bpf/bpf_sockops
3222: sock_ops  name bpf_sockmap  tag 00baed82e9c683bc  gpl
	loaded_at 2022-08-05T06:58:48+0000  uid 0
	xlated 1656B  jited 888B  memlock 4096B  map_ids 417,402,406,98,422
	btf_id 225
$ bpftool map show id 422
422: sockhash  name cilium_sock_ops  flags 0x0
	key 44B  value 4B  max_entries 65535  memlock 3145728B

Call hierarchy:

__section("sk_msg")
bpf_redir_proxy
  |-sk_msg_extract4_key
  |-lookup_ip4_remote_endpoint
  |-policy_sk_egress
  |-msg_redirect_hash

__section("sockops")
bpf_sockmap
  |-switch (op)
    case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
    case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
       bpf_sock_ops_ipv4
         |-sk_extract4_key
         |-lookup_ip4_remote_endpoint
         |-policy_sk_egress
         |-__lookup_ip4_endpoint
         |-sock_hash_update

A look at the overall implementation:

// performs the socket-level redirect
__section("sk_msg")
int bpf_redir_proxy(struct sk_msg_md *msg)
{
  sk_msg_extract4_key(msg, &key); // extract the map key from msg: source address, source port, destination address, destination port, and IP family
  info = lookup_ip4_remote_endpoint(key.dip4); // look up the destination endpoint in the ipcache map for the policy check; ipcache stores the endpoints managed by cilium, keyed by IP address, with endpoint info such as the identity (which cilium's network policy relies on) as the value
  verdict = policy_sk_egress(dst_id, key.sip4, (__u16)key.dport); // policy verdict
  if (verdict >= 0)
    msg_redirect_hash(msg, &SOCK_OPS_MAP, &key, flags); // calls the bpf_msg_redirect_hash helper to redirect via the sockhash map
}

// maintains the sockhash map
__section("sockops")
int bpf_sockmap(struct bpf_sock_ops *skops)
{
  switch (op) {
  case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: // passive and active connections respectively, so the sockets on both ends of a connection end up in the sockhash map
  case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
  ...
#ifdef ENABLE_IPV4
    if (family == AF_INET)
      bpf_sock_ops_ipv4(skops); // once the socket reaches ESTABLISHED, run bpf_sock_ops_ipv4
#endif
  ...
  }
}

static inline void bpf_sock_ops_ipv4(struct bpf_sock_ops *skops)
{
  sk_extract4_key(skops, &key);
  if (1) {
    info = lookup_ip4_remote_endpoint(key.dip4); // look up the corresponding endpoint in the ipcache map
    if (info != NULL && info->sec_label)
      dst_id = info->sec_label;
    else
      dst_id = WORLD_ID;
  }

  verdict = policy_sk_egress(dst_id, key.sip4, (__u16)key.dport); // policy verdict

  /* Lookup IPv4 address, this will return a match if:
   * - The destination IP address belongs to the local endpoint manage
   *   by Cilium.
   * - The destination IP address is an IP address associated with the
   *   host itself.
   * Then because these are local IPs that have passed LB/Policy/NAT
   * blocks redirect directly to socket.
   */
  exists = __lookup_ip4_endpoint(key.dip4);
  if (!exists)
    return;

  sock_hash_update(skops, &SOCK_OPS_MAP, &key, BPF_NOEXIST); // update the sockhash map
}

The logic is fairly straightforward. The redirect itself is done by the BPF helper bpf_msg_redirect_hash(), which forwards directly based on the key in the sockhash map. Two more maps are involved: cilium_ipcache is the map queried by lookup_ip4_remote_endpoint(), and cilium_lxc is the map queried by __lookup_ip4_endpoint().

$ cilium map get cilium_ipcache | grep 10.11.0.99
10.11.0.99/32                             identity=3187 encryptkey=0 tunnelendpoint=0.0.0.0   sync
$ cilium map get cilium_lxc | grep 10.11.0.99
10.11.0.99:0     id=25    flags=0x0000 ifindex=22  mac=DA:96:96:60:21:7E nodemac=9A:37:05:A5:EA:14   sync

Test

Next, test the redirect by hand. First dump the map with bpftool; it contains 4 entries.

$ bpftool map dump id 422
key:
0a 0b 00 0a 00 00 00 00  00 00 00 00 00 00 00 00
0a 0b 00 e9 00 00 00 00  00 00 00 00 00 00 00 00
01 00 00 00 10 90 00 00  c4 e6 00 00
value:
No space left on device
key:
c0 a8 3c 0b 00 00 00 00  00 00 00 00 00 00 00 00
c0 a8 3c 0b 00 00 00 00  00 00 00 00 00 00 00 00
01 00 00 00 10 90 00 00  d6 70 00 00
value:
No space left on device
key:
c0 a8 3c 0b 00 00 00 00  00 00 00 00 00 00 00 00
c0 a8 3c 0b 00 00 00 00  00 00 00 00 00 00 00 00
01 00 00 00 d6 70 00 00  10 90 00 00
value:
No space left on device
key:
0a 0b 00 e9 00 00 00 00  00 00 00 00 00 00 00 00
0a 0b 00 0a 00 00 00 00  00 00 00 00 00 00 00 00
01 00 00 00 c4 e6 00 00  10 90 00 00
value:
No space left on device
Found 0 elements
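
To make the 44-byte keys readable, here is a small standalone decoder. It is only an aid: struct sock_key_guess below is a hypothetical layout inferred from the key-extraction comments in bpf_redir_proxy above (source/destination address, IP family, source/destination port, padded to 44 bytes) and from the dump itself, not a copy of Cilium's definition.

/* decode_sock_key.c -- hypothetical decoder for one raw sockhash key as
 * printed by `bpftool map dump`. Assumed layout: 16B source address,
 * 16B destination address, 1B family, 3B pad, 4B source port,
 * 4B destination port, with the ports in network byte order. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

struct sock_key_guess {
	uint8_t  sip[16];	/* the IPv4 address sits in the first 4 bytes */
	uint8_t  dip[16];
	uint8_t  family;	/* 1 == IPv4 here */
	uint8_t  pad[3];
	uint32_t sport;
	uint32_t dport;
} __attribute__((packed));

int main(void)
{
	/* first key from the dump above */
	uint8_t raw[44] = {
		0x0a, 0x0b, 0x00, 0x0a, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
		0x0a, 0x0b, 0x00, 0xe9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
		0x01, 0x00, 0x00, 0x00,
		0x10, 0x90, 0x00, 0x00,
		0xc4, 0xe6, 0x00, 0x00,
	};
	struct sock_key_guess k;
	char sip[INET_ADDRSTRLEN], dip[INET_ADDRSTRLEN];

	memcpy(&k, raw, sizeof(k));
	inet_ntop(AF_INET, k.sip, sip, sizeof(sip));
	inet_ntop(AF_INET, k.dip, dip, sizeof(dip));
	printf("family=%u %s:%u -> %s:%u\n", k.family,
	       sip, ntohs((uint16_t)k.sport),
	       dip, ntohs((uint16_t)k.dport));
	/* under these assumptions: family=1 10.11.0.10:4240 -> 10.11.0.233:50406 */
	return 0;
}

Read this way, the first and fourth keys in the dump are the two directions of the same connection between 10.11.0.10 and 10.11.0.233, which matches the comment in bpf_sockmap(): both the passive and the active socket are recorded.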

On the node run tcpdump -i lxcf4d0f7379d34 -enn, where lxcf4d0f7379d34 is the host side of the nc-k8s1 container's veth pair.

$ kubectl exec -it nc-k8s1 -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
21: eth0@if22: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether da:96:96:60:21:7e brd ff:ff:ff:ff:ff:ff
    inet 10.11.0.99/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fd04::b033/128 scope global flags 02
       valid_lft forever preferred_lft forever
    inet6 fe80::d896:96ff:fe60:217e/64 scope link
       valid_lft forever preferred_lft forever
$ ip a
...
22: lxcf4d0f7379d34@if21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 9a:37:05:a5:ea:14 brd ff:ff:ff:ff:ff:ff link-netns cni-83b0571c-c726-6eaa-1e3f-5395347df6ea
    inet6 fe80::9837:5ff:fea5:ea14/64 scope link
       valid_lft forever preferred_lft forever

On the node run nc 10.11.0.99 80 (nc -lv 80 is listening inside nc-k8s1). tcpdump captures the TCP three-way handshake, but none of the data packets that follow: those are forwarded by the socket redirect and no longer traverse the kernel network stack, tc included, so tcpdump cannot capture them at the tc layer.

$ nc 10.11.0.99 80
hello
package

$ kubectl logs -f nc-k8s1
Listening on [0.0.0.0] (family 0, port 80)
Connection from 10.11.0.233 47638 received!
hello
package

Before closing the connection, dump the sockhash map again: there are two new entries, added for the sockets on the two ends of the TCP connection when it was established.

$ bpftool map dump id 422 | grep key | wc -l
6

Service Implementation (socket)

Besides the two programs above, Cilium also attaches cgroup v2 type BPF programs at the socket layer. They are triggered by socket operations, and Cilium uses them to implement the Kubernetes Service functionality.

$ bpftool cgroup show /sys/fs/cgroup/
ID       AttachType      AttachFlags     Name
3222     sock_ops                        bpf_sockmap
3280     connect4                        sock4_connect
3260     connect6                        sock6_connect
3288     post_bind4                      sock4_post_bind
3268     post_bind6                      sock6_post_bind
3292     sendmsg4                        sock4_sendmsg
3272     sendmsg6                        sock6_sendmsg
3296     recvmsg4                        sock4_recvmsg
3276     recvmsg6                        sock6_recvmsg
3284     getpeername4                    sock4_getpeername
3264     getpeername6                    sock6_getpeername
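
The core trick of these hooks is that a cgroup/connect4 program can rewrite the address an application passes to connect() before the connection is ever established. As a bare-bones illustration, here is a hypothetical standalone sketch that translates one hard-coded VIP to one hard-coded backend (both taken from the environment above); Cilium's real sock4_connect, walked through below, does the same rewrite but drives it from the service and backend maps.

/* connect4_dnat_sketch.c -- minimal cgroup/connect4 sketch (not Cilium's
 * code): rewrite the destination 172.20.0.84:80 to 10.11.0.99:80 at
 * connect() time, the same kind of rewrite __sock4_xlate_fwd() performs. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("cgroup/connect4")
int connect4_dnat(struct bpf_sock_addr *ctx)
{
	/* user_ip4/user_port hold the sockaddr passed to connect(),
	 * both in network byte order */
	if (ctx->user_ip4 == bpf_htonl(0xac140054) &&	/* 172.20.0.84 */
	    ctx->user_port == bpf_htons(80)) {
		ctx->user_ip4 = bpf_htonl(0x0a0b0063);	/* 10.11.0.99 */
		ctx->user_port = bpf_htons(80);
	}
	return 1;	/* SYS_PROCEED: let the connect() continue */
}

char _license[] SEC("license") = "GPL";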

Taking an IPv4 TCP Service as an example, the BPF program is triggered by the socket connect() call.
Call hierarchy:

__section("cgroup/connect4")
sock4_connect
  |-__sock4_xlate_fwd
      |-lb4_lookup_service
          |-map_lookup_elem(LB4_SERVICES_MAP_V2)
      |-if (!svc)
        sock4_wildcard_lookup_full
          |-sock4_wildcard_lookup
      |-if (lb4_svc_is_affinity(svc))
        lb4_affinity_backend_id_by_netns
          |-__lb4_affinity_backend_id
              |-map_lookup_elem(LB4_AFFINITY_MAP)
        __lb4_lookup_backend
          |-map_lookup_elem()
      |-if (backend_id == 0)
        __lb4_lookup_backend_slot
          |-map_lookup_elem(LB4_SERVICES_MAP_V2)
        __lb4_lookup_backend
          |-map_lookup_elem(LB4_BACKEND_MAP_V2)
      |-lb4_update_affinity_by_netns
          |-__lb4_update_affinity
              |-map_update_elem(LB4_AFFINITY_MAP)
      |-sock4_update_revnat
        |-map_lookup_elem(LB4_REVERSE_NAT_SK_MAP)

The code, roughly:

__section("cgroup/connect4")
int sock4_connect(struct bpf_sock_addr *ctx)
{
	....
	__sock4_xlate_fwd(ctx, ctx, false); 
	return SYS_PROCEED;
}
static __always_inline int __sock4_xlate_fwd(struct bpf_sock_addr *ctx,
					     struct bpf_sock_addr *ctx_full,
					     const bool udp_only)
{
	if (is_defined(ENABLE_SOCKET_LB_HOST_ONLY) && !in_hostns) // socket-LB host-only restriction
		return -ENXIO;
	if (!udp_only && !sock_proto_enabled(ctx->protocol))  // check the protocol type
		return -ENOTSUP;
	
	svc = lb4_lookup_service(&key, true); // look up by ClusterIP
	if (!svc)
		svc = sock4_wildcard_lookup_full(&key, in_hostns); // look up NodePort / HostPort entries
	if (!svc)
		return -ENXIO;

	if (lb4_svc_is_affinity(svc)) {  // session affinity is configured
		backend_id = lb4_affinity_backend_id_by_netns(svc, &id);
		if (backend_id != 0) { 
			backend = __lb4_lookup_backend(backend_id); 
			if (!backend) // the backend pod no longer exists, select again
				backend_id = 0;
		}
	}

	if (backend_id == 0) {
		key.backend_slot = (sock_select_slot(ctx_full) % svc->count) + 1;
		backend_slot = __lb4_lookup_backend_slot(&key);
		backend_id = backend_slot->backend_id;
		backend = __lb4_lookup_backend(backend_id);
	}

	if (lb4_svc_is_affinity(svc) && !backend_from_affinity) // if the service has affinity and the backend was re-selected this time, update the affinity map
		lb4_update_affinity_by_netns(svc, &id, backend_id);
		
	if (sock4_update_revnat(ctx_full, backend, &orig_key,
				svc->rev_nat_index) < 0) {
		update_metrics(0, METRIC_EGRESS, REASON_LB_REVNAT_UPDATE);
		return -ENOMEM;
	}
	// DNAT at the socket level
	ctx->user_ip4 = backend->address;
	ctx_set_port(ctx, backend->port);
	return 0;
}

The flow is as follows:
1) Find the matching Service. lb4_lookup_service() checks whether the destination is the ClusterIP of some Service. This queries the hash map cilium_lb4_services_v2, whose key and value are as follows.

struct lb4_key {
	__be32 address;		/* Service virtual IPv4 address */
	__be16 dport;		/* L4 port filter, if unset, all ports apply */
	__u16 backend_slot;	/* Backend iterator, 0 indicates the svc frontend */
	__u8 proto;		/* L4 protocol, currently not used (set to 0) */
	__u8 scope;		/* LB_LOOKUP_SCOPE_* for externalTrafficPolicy=Local */
	__u8 pad[2];
};
struct lb4_service {
	union {
		__u32 backend_id;	/* Backend ID in lb4_backends */
		__u32 affinity_timeout;	/* In seconds, only for svc frontend */
		__u32 l7_lb_proxy_port;	/* In host byte order, only when flags2 && SVC_FLAG_L7LOADBALANCER */
	};
	/* For the service frontend, count denotes number of service backend
	 * slots (otherwise zero).
	 */
	__u16 count;
	__u16 rev_nat_index;	/* Reverse NAT ID in lb4_reverse_nat */
	__u8 flags;
	__u8 flags2;
	__u8  pad[2];
};

Taking service/nc-k8s1 in this environment as an example, the ClusterIP is 172.20.0.84 and the map entry is:

$ cilium map get cilium_lb4_services_v2  | grep 172.20.0.84
172.20.0.84:80        10800 1 (6) [0x10 0x0]    sync

With bpftool you can see two records: the first is a service backend slot, recording the backend nc-k8s1 pod; the second is the service frontend, recording the service itself.

$ bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_lb4_services_v2 | grep "ac 14 00 54"
key: ac 14 00 54 00 50 01 00  00 00 00 00  value: 14 00 00 00 00 00 00 06  00 00 00 00
key: ac 14 00 54 00 50 00 00  00 00 00 00  value: 30 2a 00 00 01 00 00 06  10 00 00 00

Decoded, that is:

[{
    "key": {
		"address": "172.20.0.84",
		"dport": 80,
		"backend_slot": 1,
		"proto": 0,
		"scope": 0
	},
	"value": {
		"backend_id": 0x14,
		"count": 0,
		"rev_nat_index": 0x06
	}

}, {
	"key": {
		"address": "172.20.0.84",
		"dport": 80,
		"backend_slot": 0,
		"proto": 0,
		"scope": 0
	},
	"value": {
		"affinity_timeout": 0x2a30,
		"count": 1,
		"rev_nat_index": 0x06
	}
}]

lb4_lookup_service() looks the service up by ClusterIP; if that lookup fails, sock4_wildcard_lookup_full() is used to look for NodePort and HostPort records. In this environment the NodePort of the nc-k8s1 service is 32438, and the map records are shown below.

$ cilium map get cilium_lb4_services_v2  | grep 32438
10.0.2.15:32438       10800 1 (11) [0x52 0x0]   sync
0.0.0.0:32438         10800 1 (8) [0x12 0x0]    sync
192.168.59.15:32438   10800 1 (10) [0x52 0x0]   sync
192.168.61.11:32438   10800 1 (13) [0x52 0x0]   sync
192.168.60.11:32438   10800 1 (12) [0x52 0x0]   sync

2) Affinity logic. lb4_affinity_backend_id_by_netns() looks up the backend bound to this client. The map used here is cilium_lb4_affinity: the key identifies the client and the service, and the value (struct lb_affinity_val below) records the last access time and the chosen backend. After a backend has been selected, the map is updated via lb4_update_affinity_by_netns().

$  bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_lb4_affinity
key: 01 00 00 00 00 00 00 00  00 06 01 00 00 00 00 00
value: 14 1d 00 00 00 00 00 00  14 00 00 00 00 00 00 00
Found 1 element

struct lb_affinity_val {
	__u64 last_used;
	__u32 backend_id; 
	__u32 pad;
} __packed;

Once lb4_affinity_backend_id_by_netns() has returned a backend_id, __lb4_lookup_backend() looks the backend up in cilium_lb4_backends_v2: the key is the backend_id and the value records the pod IP and port.

$ cilium map get cilium_lb4_backends_v2
Key   Value                 State   Error
20    ANY://10.11.0.99:80   sync

3) Select a backend pod for load balancing. sock_select_slot() picks the backend slot: for TCP it chooses randomly with get_prandom_u32(), while for UDP it derives the slot from the socket cookie. The backend_slot is then used to look up the backend_id, which finally resolves to the backend.

4) Update the reverse-NAT map. The record below says that traffic destined for 172.20.0.84:80 (ac 14 00 54 00 50) was sent to 10.11.0.99:80 (0a 0b 00 63 00 50); the first 64 bits of the key are the socket cookie. This map is used by recvmsg4 and getpeername4 to reverse the translation: some UDP applications check the source address of replies, so the rewrite has to be undone there, and getpeername() returning the Service address keeps the whole mechanism completely transparent to the application.

$ bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_lb4_reverse_sk | grep "ac 14 00 54 00 50"
key: 0b 3e 00 00 00 00 00 00  0a 0b 00 63 00 50 00 00  value: ac 14 00 54 00 50 00 06
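
For completeness, the mirror image of the connect4 sketch above: a cgroup/getpeername4 program can rewrite the backend address back to the hypothetical VIP before it is returned to the application, which is what makes the translation invisible to getpeername() callers. Cilium's real getpeername hook looks the answer up in cilium_lb4_reverse_sk instead of hard-coding it.

/* getpeername4_revnat_sketch.c -- minimal cgroup/getpeername4 sketch (not
 * Cilium's code): report the VIP 172.20.0.84:80 instead of the backend
 * 10.11.0.99:80, so the application never sees the translated address. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("cgroup/getpeername4")
int getpeername4_revnat(struct bpf_sock_addr *ctx)
{
	/* user_ip4/user_port hold the address getpeername() is about to
	 * return, both in network byte order */
	if (ctx->user_ip4 == bpf_htonl(0x0a0b0063) &&	/* 10.11.0.99 */
	    ctx->user_port == bpf_htons(80)) {
		ctx->user_ip4 = bpf_htonl(0xac140054);	/* 172.20.0.84 */
		ctx->user_port = bpf_htons(80);
	}
	return 1;
}

char _license[] SEC("license") = "GPL";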