共计 5054 个字符,预计需要花费 13 分钟才能阅读完成。
起因
某天更新发现某服务收到请求但客户端无法收到返回。幸运的是,客户端同学能在测试环境重现问题。2 分法找到首个故障版本并进行了回退。故障版本仅仅修改了 snd_buf,理论上不应导致该问题。
追查过程
抓包判断故障端
定位到服务端未返回
15:27:52.589785 IP (tos 0x0, ttl 52, id 15592, offset 0, flags [DF], proto TCP (6), length 187)
106.11.0.208.26082 > XXX.12345: Flags [P.], cksum 0xfe51 (correct), seq 51:186, ack 104, win 347, options [nop,nop,TS val 94902541 ecr 3888443977], length 135
E...<.@.4..Aj.......e.09..1........[.Q.....
.......I....s...K......}..[..KG.....N...r?[T....\"...U@q.3.O*M..5..........e...hx.1...43K7#.......<Gu..O..&qNX.o .....Z......?.s....q....N*.y..
15:27:52.632302 IP (tos 0x0, ttl 63, id 13795, offset 0, flags [DF], proto TCP (6), length 52)
XXX.12345 > 106.11.0.208.26082: Flags [.], cksum 0x6b54 (correct), seq 104, ack 186, win 224, options [nop,nop,TS val 3888446989 ecr 94902541], length 0
E..45.@.?.......j...09e.......2(....kT.....
.......
服务端调用了 :gen_tcp.send, 发送接口返回:ok. 为什么抓包无法抓到呢?
排查 erlang 的问题
erlang 版本: OTP-21.0
最终定位至 inet_drv.c:10902, 这里也可以看到, erlang 最终是调用了 writev.
使用 ltrace 跟踪 writev 系统调用.
1ltrace -p 343 -e writev
2
3<... writev resumed> ) = 4
4<... writev resumed> ) = 4
5exe->writev(233, 0xe697e9d0, 2, 247) = 213
6exe->writev(249, 0xe697e9d0, 2, 10) = 0x5166
7exe->writev(19, 0xe687c800, 2, 0x2b5b82c0) = 34
发现调用了 writev, 0x5166 的长度, 和代码里 send 的长度 20834 一致 (0x5166=20834+4, 还有 erlang 自动添加的 4 字节包头长度).
inet_drv.c:10856
case TCP_PB_4:
put_int32(len, buf);
h_len = 4;
break
排查 docker 的问题
尝试在 docker 容器内用执行如下 python 代码
import socket
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(('0.0.0.0',12345))
s.listen(5)
conn, addr = s.accept()
conn.setblocking(0)
conn.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
conn.send("a" * 40960)
conn.close()
在另一个容器 telnet, 能够收到返回, 不能重现问题.
而直接在容器外, 使用同样脚本, 从外网 telnet. 可以重现问题. 故排除 docker 问题.
排查 linux kernel 问题
发现同样 python 脚本, 在某个 QA 环境, 内核版本为 4.15.0-1026-gcp, 无法重现问题.
在另一 QA 环境, 内核版本为 4.15.0-1034-gcp, 可重现问题. 于是怀疑是 linux kernel.
查看 linux kernel 4.15.0 提交日志
linux v4.15 有 9 个 rc 版本, 正式版在 Jan 29, 2018, 理论上, 不应该那么久都没有人报 bug 的.
gcp
所以, gcp 是什么呢?
Software Description
linux - Linux kernel
linux-aws - Linux kernel for Amazon Web Services (AWS) systems
linux-gcp - Linux kernel for Google Cloud Platform (GCP) systems
linux-kvm - Linux kernel for cloud environments
linux-raspi2 - Linux kernel for Raspberry Pi 2
linux-snapdragon - Linux kernel for Snapdragon processors
linux-azure - Linux kernel for Microsoft Azure Cloud systems
linux-hwe - Linux hardware enablement (HWE) kernel
linux-oem - Linux kernel for OEM processors
linux-oracle - Linux kernel for Oracle Cloud systems
linux-aws-hwe - Linux kernel for Amazon Web Services (AWS-HWE) systems
这里猜测是谷歌云对 linux kernel 做了一些私有化修改.
于是从 1026 往上逐个测试 linux 内核版本:
root@XXX:/home/dingxinglong# apt-cache search gcp | grep 4.15.0 | grep signed
linux-image-unsigned-4.15.0-1018-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1019-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1021-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1023-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1024-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1025-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1026-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1027-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1028-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1029-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1030-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1032-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1033-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1034-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1036-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1037-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1040-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1041-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1042-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1044-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1046-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1047-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
发现确实和 linux kernel image 相关:
4.15.0-1026-gcp OK
4.15.0-1027-gcp OK
4.15.0-1028-gcp OK
4.15.0-1029-gcp OK
4.15.0-1030-gcp OK
4.15.0-1032-gcp OK
4.15.0-1033-gcp OK
4.15.0-1034-gcp BUG
4.15.0-1036-gcp OK
同一台机器, 在 1033 版本 ok, 升到 1034 后有 BUG, 升到 1036 后再次 OK.
可查到对应的 bug:
http://changelogs.ubuntu.com/…
- Remote denial of service (resource exhaustion) caused by TCP SACK scoreboard
bug 出现时间是:
— Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Thu, 06 Jun 2019 11:07:33 -0300
buf fix 时间是:
— Kleber Sacilotto de Souza <kleber.souza@canonical.com> Mon, 24 Jun 2019 14:48:10 +0200
没有搜索到 bug_fix 对应的代码. 猜测是非开源代码. 也许这是 google 云干不过 amazon 的原因?
这里附上升级 kernel 的命令行.
apt-get install -y linux-image-4.15.0-1034-gcp linux-headers-4.15.0-1034-gcp linux-modules-4.15.0-1034-gcp linux-modules-extra-4.15.0-1034-gcp
总结
- 发布任何更新内容前, 准备好回滚方案, 总有想象不到的问题.
- 自动化测试用例尽量丰富. 如果测试用例的返回包够大, 会及早暴露问题.
- 能够重现的 bug, 尽早追到底, 总有收获.