Cilium Traffic Analysis (Part 1)
This series analyzes Cilium traffic flows by combining a development environment with the source code, omitting some features such as Overlay and L7 Policy. The order follows the diagrams in the official Life of a Packet documentation; this article covers the Endpoint to Endpoint (socket) path.
Environment Setup
Using the Cilium 1.12.0 code base, deploy a single-node Cilium Kubernetes cluster on a Mac with
K8S=1 NO_BUILD=1 TUNNEL_MODE_STRING=disabled SERVER_BOX=cilium/ubuntu-next-cgroupv2 SERVER_VERSION=0 contrib/vagrant/start.sh
Notes:
1) make build was run in advance, otherwise compiling inside the VM is very slow.
2) The Cilium startup parameters in the shell scripts were modified, mainly to: disable the tunnel (VXLAN by default); enable Cilium's kube-proxy replacement; and enable sockops.
--- a/contrib/vagrant/scripts/03-install-kubernetes-worker.sh
+++ b/contrib/vagrant/scripts/03-install-kubernetes-worker.sh
@@ -305,10 +305,10 @@ EOF
log "reloading systemctl daemon and enabling and restarting kube-proxy"
sudo systemctl daemon-reload
-sudo systemctl enable kube-proxy
-sudo systemctl restart kube-proxy
+# sudo systemctl enable kube-proxy
+# sudo systemctl restart kube-proxy
-sudo systemctl status kube-proxy --no-pager
+# sudo systemctl status kube-proxy --no-pager
--- a/contrib/vagrant/start.sh
+++ b/contrib/vagrant/start.sh
@@ -306,7 +306,7 @@ function write_cilium_cfg() {
cilium_options="\
--debug --pprof --enable-hubble --hubble-listen-address :4244 --enable-k8s-event-handover \
- --k8s-require-ipv4-pod-cidr --enable-bandwidth-manager --kube-proxy-replacement=disabled \
+ --k8s-require-ipv4-pod-cidr --enable-bandwidth-manager --kube-proxy-replacement=strict --cgroup-root=/sys/fs/cgroup --sockops-enable \
--enable-remote-node-identity"
cilium_operator_options=" --debug"
3) A new box, cilium/ubuntu-next-cgroupv2, was built: it is cilium/ubuntu-next with cgroupv2 unified mode enabled, since the BPF_PROG_TYPE_CGROUP_XXX program types require cgroupv2. Kubernetes also needs some extra configuration to run with cgroupv2.
After the environment is up, deploy pod/nc-k8s1 and service/nc-k8s1.
$ kubectl get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nc-k8s1 1/1 Running 0 5s 10.11.0.99 k8s1 <none> <none>
$ kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s1 Ready <none> 61m v1.24.2 192.168.60.11 <none> Ubuntu 20.04.4 LTS 5.18.0-g7e062cda7d90 containerd://1.6.3
$ systemctl status kube-proxy
● kube-proxy.service - Kubernetes Kube-Proxy Server
Loaded: loaded (/etc/systemd/system/kube-proxy.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: https://kubernetes.io/docs/concepts/overview/components/#kube-proxy
https://kubernetes.io/docs/reference/generated/kube-proxy/
$ kubectl get svc -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
kubernetes ClusterIP 172.20.0.1 <none> 443/TCP 8m2s <none>
nc-k8s1 NodePort 172.20.0.84 <none> 80:32438/TCP 9s app=nc-k8s1
Endpoint to Endpoint (socket)
Traffic Diagram
This article focuses on the socket diagram below; the tc part will be analyzed later. In the diagram, bpf_sockops.c and bpf_redir.c are a BPF_PROG_TYPE_SOCK_OPS program and a BPF_PROG_TYPE_SK_MSG program working together to implement a socket-level redirect: bpf_sockops.c maintains a map of type BPF_MAP_TYPE_SOCKHASH, and bpf_redir.c uses that map to redirect data directly to the corresponding socket.
Implementation
Inspect bpf_redir, bpf_sockops, and the map:
$ bpftool prog show pinned /sys/fs/bpf/bpf_redir
3228: sk_msg name bpf_redir_proxy tag 2dfc83bbb7ceae9b gpl
loaded_at 2022-08-05T06:58:48+0000 uid 0
xlated 1064B jited 590B memlock 4096B map_ids 402,406,422
btf_id 233
$ bpftool prog show pinned /sys/fs/bpf/bpf_sockops
3222: sock_ops name bpf_sockmap tag 00baed82e9c683bc gpl
loaded_at 2022-08-05T06:58:48+0000 uid 0
xlated 1656B jited 888B memlock 4096B map_ids 417,402,406,98,422
btf_id 225
$ bpftool map show id 422
422: sockhash name cilium_sock_ops flags 0x0
key 44B value 4B max_entries 65535 memlock 3145728B
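The 44-byte key is Cilium's struct sock_key from the sockops code. A rough sketch of its layout (field names as I recall them from the 1.12 tree; padding details may differ):
struct sock_key {
    union {
        struct {
            __u32 sip4;   /* source IPv4 address */
            __u32 pad1;
            __u32 pad2;
            __u32 pad3;
        };
        union v6addr sip6;    /* or a full IPv6 source address */
    };
    union {
        struct {
            __u32 dip4;   /* destination IPv4 address */
            __u32 pad4;
            __u32 pad5;
            __u32 pad6;
        };
        union v6addr dip6;
    };
    __u8  family;             /* ENDPOINT_KEY_IPV4 / ENDPOINT_KEY_IPV6 */
    __u8  pad7;
    __u16 pad8;
    __u32 sport;              /* source port */
    __u32 dport;              /* destination port */
} __packed;                   /* 16 + 16 + 4 + 4 + 4 = 44 bytes, matching the key size above */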
Call hierarchy:
__section("sk_msg")
bpf_redir_proxy
|-sk_msg_extract4_key
|-lookup_ip4_remote_endpoint
|-policy_sk_egress
|-msg_redirect_hash
__section("sockops")
bpf_sockmap
|-switch (op)
case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
bpf_sock_ops_ipv4
|-sk_extract4_key
|-lookup_ip4_remote_endpoint
|-policy_sk_egress
|-__lookup_ip4_endpoint
|-sock_hash_update
A look at the rough implementation:
// Responsible for the socket redirect
__section("sk_msg")
int bpf_redir_proxy(struct sk_msg_md *msg)
{
sk_msg_extract4_key(msg, &key); // Extract the map key from msg: the socket's source address/port, destination address/port, and IP family
info = lookup_ip4_remote_endpoint(key.dip4); // Look up the destination endpoint in the ipcache map for the policy check. The ipcache stores the endpoints known to Cilium, keyed by IP address; the value carries endpoint info such as the identity, on which Cilium's network policies are based
verdict = policy_sk_egress(dst_id, key.sip4, (__u16)key.dport); // Policy verdict
if (verdict >= 0)
msg_redirect_hash(msg, &SOCK_OPS_MAP, &key, flags); // Calls the bpf_msg_redirect_hash() helper to redirect based on the sockhash map
}
// Responsible for maintaining the sockhash map
__section("sockops")
int bpf_sockmap(struct bpf_sock_ops *skops)
{
switch (op) {
case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: // Passive and active connection establishment respectively, so the sockets of both connection ends are recorded in the sockhash map
case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
...
#ifdef ENABLE_IPV4
if (family == AF_INET)
bpf_sock_ops_ipv4(skops); // Once the socket reaches ESTABLISHED, run bpf_sock_ops_ipv4
#endif
...
}
}
static inline void bpf_sock_ops_ipv4(struct bpf_sock_ops *skops)
{
sk_extract4_key(skops, &key);
if (1) {
info = lookup_ip4_remote_endpoint(key.dip4); // Look up the destination endpoint in the ipcache map
if (info != NULL && info->sec_label)
dst_id = info->sec_label;
else
dst_id = WORLD_ID;
}
verdict = policy_sk_egress(dst_id, key.sip4, (__u16)key.dport); // Policy verdict
/* Lookup IPv4 address, this will return a match if:
* - The destination IP address belongs to a local endpoint managed
* by Cilium.
* - The destination IP address is an IP address associated with the
* host itself.
* Then because these are local IPs that have passed LB/Policy/NAT
* blocks redirect directly to socket.
*/
exists = __lookup_ip4_endpoint(key.dip4);
if (!exists)
return;
sock_hash_update(skops, &SOCK_OPS_MAP, &key, BPF_NOEXIST); // Update the sockhash map
}
The logic is fairly simple. The redirect itself is implemented by the BPF helper bpf_msg_redirect_hash(), which forwards directly based on the key in the sockhash map. Two other maps are involved: cilium_ipcache is the map queried by lookup_ip4_remote_endpoint(), and cilium_lxc is the map queried by __lookup_ip4_endpoint().
$ cilium map get cilium_ipcache | grep 10.11.0.99
10.11.0.99/32 identity=3187 encryptkey=0 tunnelendpoint=0.0.0.0 sync
$ cilium map get cilium_lxc | grep 10.11.0.99
10.11.0.99:0 id=25 flags=0x0000 ifindex=22 mac=DA:96:96:60:21:7E nodemac=9A:37:05:A5:EA:14 sync
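For reference, the values returned by those two lookups are roughly the following structs (a sketch based on bpf/lib/common.h in the 1.12 tree; exact field names and padding may differ). lookup_ip4_remote_endpoint() yields a remote_endpoint_info, __lookup_ip4_endpoint() an endpoint_info:
struct remote_endpoint_info {
    __u32 sec_label;       /* the identity, 3187 in the output above */
    __u32 tunnel_endpoint; /* 0.0.0.0 here since tunneling is disabled */
    __u8  key;             /* encryption key index */
};
struct endpoint_info {
    __u32 ifindex;         /* host-side veth ifindex, 22 above */
    __u16 unused;
    __u16 lxc_id;          /* endpoint id, 25 above */
    __u32 flags;
    mac_t mac;             /* the endpoint's MAC */
    mac_t node_mac;        /* MAC of the host-side veth */
    __u32 pad[4];
};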
Testing
Next, test the redirect manually. First dump the map with bpftool; it already holds 4 entries (a field-by-field decode of the first key follows the dump; the values are socket references, which presumably is why bpftool cannot print them and reports an error instead).
$ bpftool map dump id 422
key:
0a 0b 00 0a 00 00 00 00 00 00 00 00 00 00 00 00
0a 0b 00 e9 00 00 00 00 00 00 00 00 00 00 00 00
01 00 00 00 10 90 00 00 c4 e6 00 00
value:
No space left on device
key:
c0 a8 3c 0b 00 00 00 00 00 00 00 00 00 00 00 00
c0 a8 3c 0b 00 00 00 00 00 00 00 00 00 00 00 00
01 00 00 00 10 90 00 00 d6 70 00 00
value:
No space left on device
key:
c0 a8 3c 0b 00 00 00 00 00 00 00 00 00 00 00 00
c0 a8 3c 0b 00 00 00 00 00 00 00 00 00 00 00 00
01 00 00 00 d6 70 00 00 10 90 00 00
value:
No space left on device
key:
0a 0b 00 e9 00 00 00 00 00 00 00 00 00 00 00 00
0a 0b 00 0a 00 00 00 00 00 00 00 00 00 00 00 00
01 00 00 00 c4 e6 00 00 10 90 00 00
value:
No space left on device
Found 0 elements
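Overlaying the sock_key layout shown earlier on the first dumped key gives roughly the following (my own decode, treat it as an illustration):
/* 0a 0b 00 0a + 12B pad -> sip4   = 10.11.0.10
 * 0a 0b 00 e9 + 12B pad -> dip4   = 10.11.0.233
 * 01 00 00 00           -> family = 1 (IPv4) plus padding
 * 10 90 00 00           -> sport  = port 4240 in network byte order; 4240 is
 *                          cilium-health's port, so these pre-existing entries
 *                          likely belong to agent/health connections
 * c4 e6 00 00           -> dport  = an ephemeral port on the peer side
 */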
On the node, run tcpdump -i lxcf4d0f7379d34 -enn, where lxcf4d0f7379d34 is the host side of the nc-k8s1 container's veth pair.
$ kubectl exec -it nc-k8s1 -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
21: eth0@if22: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP qlen 1000
link/ether da:96:96:60:21:7e brd ff:ff:ff:ff:ff:ff
inet 10.11.0.99/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fd04::b033/128 scope global flags 02
valid_lft forever preferred_lft forever
inet6 fe80::d896:96ff:fe60:217e/64 scope link
valid_lft forever preferred_lft forever
$ ip a
...
22: lxcf4d0f7379d34@if21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 9a:37:05:a5:ea:14 brd ff:ff:ff:ff:ff:ff link-netns cni-83b0571c-c726-6eaa-1e3f-5395347df6ea
inet6 fe80::9837:5ff:fea5:ea14/64 scope link
valid_lft forever preferred_lft forever
On the node, run nc 10.11.0.99 80 (nc -lv 80 is already running inside nc-k8s1). tcpdump captures the TCP three-way handshake, but not the subsequent data packets: those are forwarded via the socket redirect and no longer pass through the kernel network stack, tc included, so tcpdump cannot capture them at the tc layer.
$ nc 10.11.0.99 80
hello
package
$ kubectl logs -f nc-k8s1
Listening on [0.0.0.0] (family 0, port 80)
Connection from 10.11.0.233 47638 received!
hello
package
Before closing the connection, dump the sockhash map again: two new entries have appeared, one for each end of the TCP connection, added when it was established.
$ bpftool map dump id 422 | grep key | wc -l
6
Service Implementation (socket)
Besides the two programs above, Cilium's socket-layer BPF also includes cgroupv2-attached programs that are triggered by socket operations; Cilium uses them to implement Kubernetes Services.
$ bpftool cgroup show /sys/fs/cgroup/
ID AttachType AttachFlags Name
3222 sock_ops bpf_sockmap
3280 connect4 sock4_connect
3260 connect6 sock6_connect
3288 post_bind4 sock4_post_bind
3268 post_bind6 sock6_post_bind
3292 sendmsg4 sock4_sendmsg
3272 sendmsg6 sock6_sendmsg
3296 recvmsg4 sock4_recvmsg
3276 recvmsg6 sock6_recvmsg
3284 getpeername4 sock4_getpeername
3264 getpeername6 sock6_getpeername
Taking an IPv4 TCP Service as an example, the BPF program is triggered when the socket's connect() is invoked.
Call hierarchy:
__section("cgroup/connect4")
sock4_connect
|-__sock4_xlate_fwd
|-lb4_lookup_service
|-map_lookup_elem(LB4_SERVICES_MAP_V2)
|-if (!svc)
sock4_wildcard_lookup_full
|-sock4_wildcard_lookup
|-if (lb4_svc_is_affinity(svc))
lb4_affinity_backend_id_by_netns
|-__lb4_affinity_backend_id
|-map_lookup_elem(LB4_AFFINITY_MAP)
__lb4_lookup_backend
|-map_lookup_elem()
|-if (backend_id == 0)
__lb4_lookup_backend_slot
|-map_lookup_elem(LB4_SERVICES_MAP_V2)
__lb4_lookup_backend
|-map_lookup_elem(LB4_BACKEND_MAP_V2)
|-lb4_update_affinity_by_netns
|--__lb4_update_affinity
|-map_update_elem(LB4_AFFINITY_MAP)
|-sock4_update_revnat
|-map_lookup_elem(LB4_REVERSE_NAT_SK_MAP)
Rough code:
__section("cgroup/connect4")
int sock4_connect(struct bpf_sock_addr *ctx)
{
....
__sock4_xlate_fwd(ctx, ctx, false);
return SYS_PROCEED;
}
static __always_inline int __sock4_xlate_fwd(struct bpf_sock_addr *ctx,
struct bpf_sock_addr *ctx_full,
const bool udp_only)
{
if (is_defined(ENABLE_SOCKET_LB_HOST_ONLY) && !in_hostns) // Enforce socket-LB host-only mode
return -ENXIO;
if (!udp_only && !sock_proto_enabled(ctx->protocol)) // Check the protocol type
return -ENOTSUP;
svc = lb4_lookup_service(&key, true); // Look up by ClusterIP
if (!svc)
svc = sock4_wildcard_lookup_full(&key, in_hostns); // Look up NodePort and HostPort entries
if (!svc)
return -ENXIO;
if (lb4_svc_is_affinity(svc)) { // Session affinity is configured
backend_id = lb4_affinity_backend_id_by_netns(svc, &id);
if (backend_id != 0) {
backend = __lb4_lookup_backend(backend_id);
if (!backend) // The backend pod no longer exists; pick a new one
backend_id = 0;
}
}
if (backend_id == 0) {
key.backend_slot = (sock_select_slot(ctx_full) % svc->count) + 1;
backend_slot = __lb4_lookup_backend_slot(&key);
backend_id = backend_slot->backend_id;
backend = __lb4_lookup_backend(backend_id);
}
if (lb4_svc_is_affinity(svc) && !backend_from_affinity) // If the service has affinity and a backend was newly selected this time, update the affinity map
lb4_update_affinity_by_netns(svc, &id, backend_id);
if (sock4_update_revnat(ctx_full, backend, &orig_key,
svc->rev_nat_index) < 0) {
update_metrics(0, METRIC_EGRESS, REASON_LB_REVNAT_UPDATE);
return -ENOMEM;
}
// Perform the socket-level DNAT
ctx->user_ip4 = backend->address;
ctx_set_port(ctx, backend->port);
return 0;
}
The flow is as follows:
1) First find the corresponding Service. lb4_lookup_service() checks whether the destination is some Service's ClusterIP. This queries the hash map cilium_lb4_services_v2, whose key and value are as follows.
struct lb4_key {
__be32 address; /* Service virtual IPv4 address */
__be16 dport; /* L4 port filter, if unset, all ports apply */
__u16 backend_slot; /* Backend iterator, 0 indicates the svc frontend */
__u8 proto; /* L4 protocol, currently not used (set to 0) */
__u8 scope; /* LB_LOOKUP_SCOPE_* for externalTrafficPolicy=Local */
__u8 pad[2];
};
struct lb4_service {
union {
__u32 backend_id; /* Backend ID in lb4_backends */
__u32 affinity_timeout; /* In seconds, only for svc frontend */
__u32 l7_lb_proxy_port; /* In host byte order, only when flags2 && SVC_FLAG_L7LOADBALANCER */
};
/* For the service frontend, count denotes number of service backend
* slots (otherwise zero).
*/
__u16 count;
__u16 rev_nat_index; /* Reverse NAT ID in lb4_reverse_nat */
__u8 flags;
__u8 flags2;
__u8 pad[2];
};
Taking the environment's service/nc-k8s1 as an example, the ClusterIP is 172.20.0.84 and the map contains:
$ cilium map get cilium_lb4_services_v2 | grep 172.20.0.84
172.20.0.84:80 10800 1 (6) [0x10 0x0] sync
With bpftool, two records are visible: the first is a service backend slot, recording the backend nc-k8s1 pod, and the second is the service frontend, recording the Service itself.
$ bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_lb4_services_v2 | grep "ac 14 00 54"
key: ac 14 00 54 00 50 01 00 00 00 00 00 value: 14 00 00 00 00 00 00 06 00 00 00 00
key: ac 14 00 54 00 50 00 00 00 00 00 00 value: 30 2a 00 00 01 00 00 06 10 00 00 00
Decoded, that is:
[{
"key": {
"address": "172.20.0.84",
"dport": 80,
"backend_slot": 1,
"proto": 0,
"scope": 0
},
"value": {
"backend_id": 0x14,
"count": 0,
"rev_nat_index": 0x06
}
}, {
"key": {
"address": "172.20.0.84",
"dport": 80,
"backend_slot": 0,
"proto": 0,
"scope": 0
},
"value": {
"affinity_timeout": 0x302a,
"count": 1,
"rev_nat_index": 0x06
}
}]
lb4_lookup_service() looks up the Service by its ClusterIP. If that lookup fails, sock4_wildcard_lookup_full() is used to search for NodePort and HostPort entries. The nc-k8s1 Service's NodePort in this environment is 32438, and the map records are shown below (a simplified sketch of the wildcard-lookup idea follows the output).
$ cilium map get cilium_lb4_services_v2 | grep 32438
10.0.2.15:32438 10800 1 (11) [0x52 0x0] sync
0.0.0.0:32438 10800 1 (8) [0x12 0x0] sync
192.168.59.15:32438 10800 1 (10) [0x52 0x0] sync
192.168.61.11:32438 10800 1 (13) [0x52 0x0] sync
192.168.60.11:32438 10800 1 (12) [0x52 0x0] sync
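The idea behind the wildcard lookup, as a simplified sketch (not the literal 1.12 code; the name wildcard_lookup_sketch is mine): if the destination port falls into the NodePort range and the destination address is this node (or loopback from the host netns), redo the service lookup with the frontend address wildcarded to 0.0.0.0, which is what the 0.0.0.0:32438 entry above is for.
static __always_inline struct lb4_service *
wildcard_lookup_sketch(struct lb4_key *key, bool in_hostns)
{
    __u16 port = bpf_ntohs(key->dport);

    /* Only ports in the NodePort range are candidates. */
    if (port < NODEPORT_PORT_MIN || port > NODEPORT_PORT_MAX)
        return NULL;

    /* The destination must be this node itself, or loopback when the
     * caller sits in the host network namespace. */
    if (!(in_hostns && is_v4_loopback(key->address))) {
        struct remote_endpoint_info *info;

        info = lookup_ip4_remote_endpoint(key->address);
        if (!info || info->sec_label != HOST_ID)
            return NULL;
    }

    /* Retry the service lookup with a wildcarded frontend address. */
    key->address = 0;
    return lb4_lookup_service(key, true);
}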
2) Affinity logic. lb4_affinity_backend_id_by_netns() looks up the backend pinned by session affinity. The map used here is cilium_lb4_affinity: the key identifies the client (netns cookie or client IP) plus the service's reverse-NAT ID, while the value records the last access time and the backend ID; the map is updated via lb4_update_affinity_by_netns() after a backend has been selected (a sketch of the key layout follows the value struct below).
$ bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_lb4_affinity
key: 01 00 00 00 00 00 00 00 00 06 01 00 00 00 00 00
value: 14 1d 00 00 00 00 00 00 14 00 00 00 00 00 00 00
Found 1 element
struct lb_affinity_val {
__u64 last_used;
__u32 backend_id;
__u32 pad;
} __packed;
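The key of that map is roughly the following (a sketch from memory of the 1.12 headers). In the dumped entry above, client_id appears to be the netns cookie 1 with the netns_cookie bit set, and rev_nat_id ties the entry to the service's reverse-NAT index:
union lb4_affinity_client_id {
    __u32 client_ip;             /* used when matching by client IP */
    __net_cookie client_cookie;  /* used when matching by netns cookie */
} __packed;
struct lb4_affinity_key {
    union lb4_affinity_client_id client_id;
    __u16 rev_nat_id;            /* binds the entry to one service */
    __u8 netns_cookie:1;         /* 1 = client_id holds a netns cookie */
    __u8 reserved:7;
    __u8 pad1;
    __u32 pad2;
} __packed;                      /* 16 bytes, matching the dumped key */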
After lb4_affinity_backend_id_by_netns() returns the backend_id, __lb4_lookup_backend() looks the backend up in the cilium_lb4_backends_v2 map, keyed by backend_id; the value records the pod IP and port (see the struct sketch after the map output).
$ cilium map get cilium_lb4_backends_v2
Key Value State Error
20 ANY://10.11.0.99:80 sync
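The value of cilium_lb4_backends_v2 is a small struct holding the pod address and port, roughly (a sketch; the trailing byte may be padding or a cluster id depending on the version):
struct lb4_backend {
    __be32 address; /* pod IP, 10.11.0.99 above */
    __be16 port;    /* pod port, 80 above */
    __u8 flags;
    __u8 pad;
};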
3) Select a backend Pod. sock_select_slot() picks the backend slot: for TCP it uses get_prandom_u32() for a random choice, while for UDP it derives the slot from the socket cookie so a given socket keeps hitting the same backend. The backend_slot is then used to look up the backend_id, and finally the backend itself.
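sock_select_slot() itself is tiny; roughly (from bpf/bpf_sock.c, as I recall it):
static __always_inline __u64 sock_select_slot(struct bpf_sock_addr *ctx)
{
    /* TCP: pick a random slot per connect().
     * UDP: derive the slot from the socket cookie so all datagrams of
     * one socket keep hitting the same backend. */
    return ctx->protocol == IPPROTO_TCP ?
           get_prandom_u32() : sock_local_cookie(ctx);
}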
4) Update the reverse NAT map. The record below means that traffic this socket sent to 172.20.0.84:80 (ac 14 00 54 00 50) was translated to 10.11.0.99:80 (0a 0b 00 63 00 50); the first 64 bits of the key are the socket cookie. This map is consulted mainly in recvmsg4 and getpeername4 to translate the backend address back into the Service address: some UDP applications verify the source address of reply packets, so the reverse translation is required, and getpeername returning the Service address keeps the whole mechanism completely transparent to the application (a simplified sketch of the reverse lookup follows the dump).
$ bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_lb4_reverse_sk | grep "ac 14 00 54 00 50"
key: 0b 3e 00 00 00 00 00 00 0a 0b 00 63 00 50 00 00 value: ac 14 00 54 00 50 00 06
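A simplified sketch of the reverse direction (what recvmsg4/getpeername4 do, condensed from __sock4_xlate_rev(); details such as stale-entry cleanup are omitted): look up the map with the socket cookie plus the backend address/port and, on a hit, rewrite the address back to the Service.
static __always_inline int sock4_xlate_rev_sketch(struct bpf_sock_addr *ctx)
{
    struct ipv4_revnat_entry *entry;
    struct ipv4_revnat_tuple key = {
        .cookie  = get_socket_cookie(ctx),
        .address = ctx->user_ip4,     /* backend pod IP seen by the kernel */
        .port    = ctx_dst_port(ctx), /* backend port */
    };

    entry = map_lookup_elem(&LB4_REVERSE_NAT_SK_MAP, &key);
    if (entry) {
        /* Present the original Service address to the application. */
        ctx->user_ip4 = entry->address;
        ctx_set_port(ctx, entry->port);
        return 0;
    }
    return -ENXIO;
}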