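Analysis of tcpconnlat, a libbpf-based TCP connection latency monitoring tool - eBPF Fundamentals Part5
About the "eBPF Fundamentals" series:
The "eBPF Fundamentals" series aims to organize the basics of BPF, focusing mainly on the interface between a program and the kernel. The articles use libbpf, but the series is still useful if you do not use libbpf directly, because it focuses on the program-to-kernel interface rather than on the libbpf wrapper itself. Every BPF development framework has to talk to the kernel in a similar way, and some frameworks are themselves built on libbpf. That holds even for golang/rust/python/BCC/bpftrace.
- "A Brief Look at the ELF Format - eBPF Fundamentals Part1"
- "The BPF System Interface and a libbpf Example Analysis - eBPF Fundamentals Part2"
- "Classic libbpf Example: bootstrap Analysis - eBPF Fundamentals Part3"
- "Classic libbpf Example: uprobe Analysis - eBPF Fundamentals Part4"
As usual: as many figures and as few words as possible. What follows assumes that the reader already has some understanding of BPF, or has read the earlier articles in the "eBPF Fundamentals" series.
Few people know that BCC, the pioneer of eBPF applications, offers more than its many python/bpftrace-based tools. Because libbpf 1.0 greatly improved usability, functionality, and the portability of executables through BPF CO-RE (Compile Once – Run Everywhere), many tools are now written directly in C on top of libbpf 1.0. One of them is tcpconnlat, the subject of this article.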
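Motivation: why I am studying the libbpf version of tcpconnlat
Before starting the analysis, a few words of preamble: why am I studying the libbpf version of tcpconnlat?
To understand how this classic and practical BPF tool interacts with the kernel to do its job.
For historical and compatibility reasons, the kernel's BPF interface (syscall) is genuinely complex and unintuitive. Its designers wanted to keep the number of syscalls down by letting one syscall serve many functions, but that also makes it harder to use. Here I want to understand:
- Which kernel objects are needed
- How the kernel objects are linked together to form a data/event flow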
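To make the "one syscall, many functions" point concrete, here is a minimal sketch of calling the multiplexed bpf(2) syscall directly, without libbpf. The helper names `sys_bpf` and `create_hash_map` are hypothetical and exist only for this illustration. The call needs a Linux system with kernel headers installed, and it may fail with a permission error when run without sufficient privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Every BPF operation goes through the same syscall number: `cmd` selects
 * the operation and the `attr` union carries that operation's arguments. */
static long sys_bpf(int cmd, union bpf_attr *attr)
{
    return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/* Hypothetical helper: create a hash map kernel object.
 * Returns a file descriptor on success, or -errno on failure. */
static int create_hash_map(uint32_t key_size, uint32_t value_size,
                           uint32_t max_entries)
{
    union bpf_attr attr;
    long fd;

    assert(key_size > 0 && max_entries > 0);

    /* Fields not used by this command must be zero. */
    memset(&attr, 0, sizeof(attr));
    attr.map_type = BPF_MAP_TYPE_HASH;
    attr.key_size = key_size;
    attr.value_size = value_size;
    attr.max_entries = max_entries;

    fd = sys_bpf(BPF_MAP_CREATE, &attr);
    return fd < 0 ? -errno : (int)fd;
}
```

The same `sys_bpf` wrapper would serve program loading, map lookups and link creation; only `cmd` and the fields filled in `attr` change. This is the complexity that libbpf hides behind typed functions.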
To learn how to use libbpf. This is a secondary goal.
- How libbpf makes it easier for developers to talk to the kernel
What the tcpconnlat example program does
The tcpconnlat program works as follows:
- The kernel-space BPF program hooks the kernel functions tcp_v4_connect and tcp_rcv_state_process to record and analyze how long socket connection establishment takes, and emits events through bpf_perf_event_output. A note on these two functions:

tcp_v4_connect - called when the kernel attempts to establish a socket connection

    /* This will initiate an outgoing connection. */
    int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
    {...}

tcp_rcv_state_process - called when the state of a kernel socket changes

    /*
     * This function implements the receiving procedure of RFC 793 for
     * all states except ESTABLISHED and TIME_WAIT.
     * It's called from both tcp_v4_rcv and tcp_v6_rcv and should be
     * address independent.
     */
    int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
    {...}

- The user-space program loads and attaches the kernel-space BPF program, then listens for the events emitted through bpf_perf_event_output and prints them.
    $ sudo ./tcpconnlat
    PID    COMM         IP SADDR            DADDR            DPORT LAT(ms)
    4218   code         4  192.168.1.14     192.168.1.17     8118  0.62
    3930   Chrome_Child 4  192.168.1.14     192.168.1.17     8118  0.61
    ...
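The measurement described above can be modelled in plain user-space C: one hook stores a start timestamp keyed by the socket pointer, and the other looks it up, computes the delta, and removes the entry. This is only a sketch of the idea, not the tool's code. The names `on_connect` and `on_state_change`, the linear table and its size are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 64 /* illustrative capacity only */

struct start_entry {
    const void *sk; /* key: socket pointer */
    uint64_t ts_ns; /* value: when the connect attempt was seen */
    bool used;
};

static struct start_entry start_tbl[SLOTS];

/* Models the tcp_v4_connect hook: remember when this socket started. */
static bool on_connect(const void *sk, uint64_t now_ns)
{
    struct start_entry *free_slot = NULL;

    for (size_t i = 0; i < SLOTS; i++) {
        if (start_tbl[i].used && start_tbl[i].sk == sk) {
            start_tbl[i].ts_ns = now_ns; /* overwrite an existing key */
            return true;
        }
        if (!start_tbl[i].used && !free_slot)
            free_slot = &start_tbl[i];
    }
    if (!free_slot)
        return false; /* table full */
    free_slot->sk = sk;
    free_slot->ts_ns = now_ns;
    free_slot->used = true;
    return true;
}

/* Models the tcp_rcv_state_process hook. Returns true and fills *delta_us
 * when an event would be emitted; a matched entry is always removed. */
static bool on_state_change(const void *sk, bool syn_sent, uint64_t now_ns,
                            uint64_t min_us, uint64_t *delta_us)
{
    assert(delta_us != NULL);

    if (!syn_sent)
        return false; /* only sockets still in SYN_SENT are of interest */

    for (size_t i = 0; i < SLOTS; i++) {
        struct start_entry *e = &start_tbl[i];
        int64_t delta;
        uint64_t us;

        if (!e->used || e->sk != sk)
            continue;
        delta = (int64_t)(now_ns - e->ts_ns);
        e->used = false;
        if (delta < 0)
            return false;
        us = (uint64_t)delta / 1000;
        if (min_us && us < min_us)
            return false; /* faster than the threshold: filtered out */
        *delta_us = us;
        return true;
    }
    return false; /* no matching connect was recorded */
}
```

For example, a connect recorded at 1,000,000 ns followed by a SYN_SENT state change at 1,620,000 ns yields 620 us, which the tool would print as 0.62 in the LAT(ms) column.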
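Program walkthrough
Overview of how tcpconnlat interacts with the kernel
If the figure above renders poorly, click here to open it in Draw.io. Parts of it carry interactive links and hover tips.
The figure shows the results of my tracing. After opening it in Draw.io, hover the mouse over a region to see the stack (call stack).
The notes in the figure are already fairly detailed and cover the important data structures and steps.
Steps 1~5. User-space libbpf data loading and in-memory data structure preparation
- .rodata mmap memory page preparation
- vmlinux BTF loading, used for BPF CO-RE
Why stop writing here? Because there is really no need to go on: everything is already in the figure. A small tip for quickly locating a step number in the figure: in draw.io, press CTRL+F and search for the number.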
Analysis environment

    $ uname -a
    Linux T30 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
    $ cat /etc/os-release
    PRETTY_NAME="Ubuntu 22.04.2 LTS"
    VERSION="22.04.2 LTS (Jammy Jellyfish)"
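Kernel-space BPF bytecode program
<mark>I have always tried to avoid putting code directly into articles. The reason is that, in my own experience, reading code in an article is just too hard... </mark>Sometimes, though, it has to be pasted. The goal is not for the reader to understand all of the code in one pass, but to get a feel for the main logic and the names of the symbols. I will trim it as much as I can. Do not let this paper tiger scare you off; just read it alongside the figure.
First, the kernel-side BPF bytecode program: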
tcpconnlat.bpf.c
    /* Excerpt: includes, the tcp_v6_connect hook and the kprobe fallback
     * programs referenced by the user-space code are omitted. */

    /* Filters written by user space before load; they live in .rodata. */
    const volatile __u64 targ_min_us = 0;
    const volatile pid_t targ_tgid = 0;

    struct piddata {
        char comm[TASK_COMM_LEN];
        u64 ts;
        u32 tgid;
    };

    /* sock pointer -> who started the connect, and when */
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 4096);
        __type(key, struct sock *);
        __type(value, struct piddata);
    } start SEC(".maps");

    /* perf event array used to send events to user space */
    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(u32));
        __uint(value_size, sizeof(u32));
    } events SEC(".maps");

    /* On connect: record the start time, keyed by the socket. */
    static int trace_connect(struct sock *sk)
    {
        u32 tgid = bpf_get_current_pid_tgid() >> 32;
        struct piddata piddata = {};

        if (targ_tgid && targ_tgid != tgid)
            return 0;

        bpf_get_current_comm(&piddata.comm, sizeof(piddata.comm));
        piddata.ts = bpf_ktime_get_ns();
        piddata.tgid = tgid;
        bpf_map_update_elem(&start, &sk, &piddata, 0);
        return 0;
    }

    /* On state change: if the socket is still in SYN_SENT, compute the
     * latency since connect and emit an event. */
    static int handle_tcp_rcv_state_process(void *ctx, struct sock *sk)
    {
        struct piddata *piddatap;
        struct event event = {};
        s64 delta;
        u64 ts;

        if (BPF_CORE_READ(sk, __sk_common.skc_state) != TCP_SYN_SENT)
            return 0;

        piddatap = bpf_map_lookup_elem(&start, &sk);
        if (!piddatap)
            return 0;

        ts = bpf_ktime_get_ns();
        delta = (s64)(ts - piddatap->ts);
        if (delta < 0)
            goto cleanup;

        event.delta_us = delta / 1000U;
        if (targ_min_us && event.delta_us < targ_min_us)
            goto cleanup;

        __builtin_memcpy(&event.comm, piddatap->comm, sizeof(event.comm));
        event.ts_us = ts / 1000;
        event.tgid = piddatap->tgid;
        event.lport = BPF_CORE_READ(sk, __sk_common.skc_num);
        event.dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
        event.af = BPF_CORE_READ(sk, __sk_common.skc_family);
        if (event.af == AF_INET) {
            event.saddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
            event.daddr_v4 = BPF_CORE_READ(sk, __sk_common.skc_daddr);
        } else {
            BPF_CORE_READ_INTO(&event.saddr_v6, sk,
                    __sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
            BPF_CORE_READ_INTO(&event.daddr_v6, sk,
                    __sk_common.skc_v6_daddr.in6_u.u6_addr32);
        }
        bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                &event, sizeof(event));

    cleanup:
        /* The entry is removed whether or not an event was emitted. */
        bpf_map_delete_elem(&start, &sk);
        return 0;
    }

    SEC("fentry/tcp_v4_connect")
    int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk)
    {
        return trace_connect(sk);
    }

    SEC("fentry/tcp_rcv_state_process")
    int BPF_PROG(fentry_tcp_rcv_state_process, struct sock *sk)
    {
        return handle_tcp_rcv_state_process(ctx, sk);
    }
User-space BPF program
tcpconnlat.c
    #define PERF_BUFFER_PAGES 16
    #define PERF_POLL_TIMEOUT_MS 100

    static volatile sig_atomic_t exiting = 0;

    static struct env {
        __u64 min_us;
        pid_t pid;
        bool timestamp;
        bool lport;
        bool verbose;
    } env;

    const char *argp_program_version = "tcpconnlat 0.1";
    const char *argp_program_bug_address =
        "https://github.com/iovisor/bcc/tree/master/libbpf-tools";
    const char argp_program_doc[] =
    "\nTrace TCP connects and show connection latency.\n"
    "\n"
    "USAGE: tcpconnlat [--help] [-t] [-p PID] [-L]\n"
    "\n"
    "EXAMPLES:\n"
    " tcpconnlat # summarize on-CPU time as a histogram\n"
    " tcpconnlat 1 # trace connection latency slower than 1 ms\n"
    " tcpconnlat 0.1 # trace connection latency slower than 100 us\n"
    " tcpconnlat -t # 1s summaries, milliseconds, and timestamps\n"
    " tcpconnlat -p 185 # trace PID 185 only\n"
    " tcpconnlat -L # include LPORT while printing outputs\n";

    static const struct argp_option opts[] = {
        { "timestamp", 't', NULL, 0, "Include timestamp on output" },
        { "pid", 'p', "PID", 0, "Trace this PID only" },
        { "lport", 'L', NULL, 0, "Include LPORT on output" },
        { "verbose", 'v', NULL, 0, "Verbose debug output" },
        { NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" },
        {},
    };

    static error_t parse_arg(int key, char *arg, struct argp_state *state)
    {...}

    static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
    {...}

    static void sig_int(int signo)
    {
        exiting = 1;
    }

    /* Called by perf_buffer__poll() once per event sent by the BPF program. */
    void handle_event(void *ctx, int cpu, void *data, __u32 data_sz)
    {
        const struct event *e = data;
        char src[INET6_ADDRSTRLEN];
        char dst[INET6_ADDRSTRLEN];
        union {
            struct in_addr x4;
            struct in6_addr x6;
        } s, d;
        static __u64 start_ts;

        if (env.timestamp) {
            if (start_ts == 0)
                start_ts = e->ts_us;
            printf("%-9.3f ", (e->ts_us - start_ts) / 1000000.0);
        }
        if (e->af == AF_INET) {
            s.x4.s_addr = e->saddr_v4;
            d.x4.s_addr = e->daddr_v4;
        } else if (e->af == AF_INET6) {
            memcpy(&s.x6.s6_addr, e->saddr_v6, sizeof(s.x6.s6_addr));
            memcpy(&d.x6.s6_addr, e->daddr_v6, sizeof(d.x6.s6_addr));
        } else {
            fprintf(stderr, "broken event: event->af=%d", e->af);
            return;
        }

        if (env.lport) {
            printf("%-6d %-12.12s %-2d %-16s %-6d %-16s %-5d %.2f\n", e->tgid,
                e->comm, e->af == AF_INET ? 4 : 6,
                inet_ntop(e->af, &s, src, sizeof(src)), e->lport,
                inet_ntop(e->af, &d, dst, sizeof(dst)), ntohs(e->dport),
                e->delta_us / 1000.0);
        } else {
            printf("%-6d %-12.12s %-2d %-16s %-16s %-5d %.2f\n", e->tgid,
                e->comm, e->af == AF_INET ? 4 : 6,
                inet_ntop(e->af, &s, src, sizeof(src)),
                inet_ntop(e->af, &d, dst, sizeof(dst)), ntohs(e->dport),
                e->delta_us / 1000.0);
        }
    }

    void handle_lost_events(void *ctx, int cpu, __u64 lost_cnt)
    {
        fprintf(stderr, "lost %llu events on CPU #%d\n", lost_cnt, cpu);
    }

    int main(int argc, char **argv)
    {
        static const struct argp argp = {
            .options = opts,
            .parser = parse_arg,
            .doc = argp_program_doc,
        };
        struct perf_buffer *pb = NULL;
        struct tcpconnlat_bpf *obj;
        int err;

        err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
        if (err)
            return err;

        libbpf_set_print(libbpf_print_fn);

        /* open: parse the embedded BPF object, nothing is in the kernel yet */
        obj = tcpconnlat_bpf__open();
        if (!obj) {
            fprintf(stderr, "failed to open BPF object\n");
            return 1;
        }

        /* initialize global data (filtering options) */
        obj->rodata->targ_min_us = env.min_us;
        obj->rodata->targ_tgid = env.pid;

        /* use the fentry programs if the kernel supports them,
         * otherwise fall back to the other set */
        if (fentry_can_attach("tcp_v4_connect", NULL)) {
            bpf_program__set_attach_target(obj->progs.fentry_tcp_v4_connect, 0, "tcp_v4_connect");
            bpf_program__set_attach_target(obj->progs.fentry_tcp_v6_connect, 0, "tcp_v6_connect");
            bpf_program__set_attach_target(obj->progs.fentry_tcp_rcv_state_process, 0, "tcp_rcv_state_process");
            bpf_program__set_autoload(obj->progs.tcp_v4_connect, false);
            bpf_program__set_autoload(obj->progs.tcp_v6_connect, false);
            bpf_program__set_autoload(obj->progs.tcp_rcv_state_process, false);
        } else {
            bpf_program__set_autoload(obj->progs.fentry_tcp_v4_connect, false);
            bpf_program__set_autoload(obj->progs.fentry_tcp_v6_connect, false);
            bpf_program__set_autoload(obj->progs.fentry_tcp_rcv_state_process, false);
        }

        /* load: create maps and load programs into the kernel */
        err = tcpconnlat_bpf__load(obj);
        if (err) {
            fprintf(stderr, "failed to load BPF object: %d\n", err);
            goto cleanup;
        }

        /* attach: hook the loaded programs to their kernel functions */
        err = tcpconnlat_bpf__attach(obj);
        if (err) {
            goto cleanup;
        }

        pb = perf_buffer__new(bpf_map__fd(obj->maps.events), PERF_BUFFER_PAGES,
                handle_event, handle_lost_events, NULL, NULL);
        if (!pb) {
            fprintf(stderr, "failed to open perf buffer: %d\n", errno);
            goto cleanup;
        }

        /* print header */
        if (env.timestamp)
            printf("%-9s ", ("TIME(s)"));
        if (env.lport) {
            printf("%-6s %-12s %-2s %-16s %-6s %-16s %-5s %s\n",
                "PID", "COMM", "IP", "SADDR", "LPORT", "DADDR", "DPORT", "LAT(ms)");
        } else {
            printf("%-6s %-12s %-2s %-16s %-16s %-5s %s\n",
                "PID", "COMM", "IP", "SADDR", "DADDR", "DPORT", "LAT(ms)");
        }

        if (signal(SIGINT, sig_int) == SIG_ERR) {
            fprintf(stderr, "can't set signal handler: %s\n", strerror(errno));
            err = 1;
            goto cleanup;
        }

        /* main: poll */
        while (!exiting) {
            err = perf_buffer__poll(pb, PERF_POLL_TIMEOUT_MS);
            if (err < 0 && err != -EINTR) {
                fprintf(stderr, "error polling perf buffer: %s\n", strerror(-err));
                goto cleanup;
            }
            /* reset err to return 0 if exiting */
            err = 0;
        }

    cleanup:
        perf_buffer__free(pb);
        tcpconnlat_bpf__destroy(obj);

        return err != 0;
    }
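A point worth noticing in handle_event is byte order: `e->lport` is printed as-is, while `e->dport` goes through `ntohs()`, because the destination port (like the addresses) arrives in network byte order. The standalone sketch below isolates that conversion for an IPv4 endpoint. `format_v4_endpoint` is a hypothetical helper written for this illustration, and the "addr:port" output format is my own choice, not the tool's.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Render "addr:port" for an IPv4 endpoint whose address and port are both
 * in network byte order, as daddr_v4 and dport are in the event.
 * Returns 0 on success, -1 on conversion failure or truncation. */
static int format_v4_endpoint(uint32_t addr_be, uint16_t port_be,
                              char *out, size_t out_len)
{
    char ip[INET_ADDRSTRLEN];
    struct in_addr a = { .s_addr = addr_be };
    int n;

    assert(out != NULL);

    /* inet_ntop expects the address in network byte order already. */
    if (!inet_ntop(AF_INET, &a, ip, sizeof(ip)))
        return -1;

    /* The port must be converted to host order before printing. */
    n = snprintf(out, out_len, "%s:%u", ip, (unsigned)ntohs(port_be));
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

Called with `htonl(0xC0A80111)` and `htons(8118)`, the helper produces `192.168.1.17:8118`, matching the DADDR and DPORT columns of the sample output earlier in the article.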
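Postscript
The road to technical enlightenment is perhaps like a person's process of maturing: only the hardships of reality lead to attainment.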