一个erlang-nif野指针的追踪过程

30次阅读

共计 12078 个字符,预计需要花费 31 分钟才能阅读完成。

概述

最近半年, 经常出现一些奇怪的 bug.

  • CPU 非常高, 但 etop 并没有 red 特别高的进程.
  • 内存占用非常高, 和 etop 看到的内存不吻合.
  • coredump.

因为 erlang 层面无法定位问题, 那问题只能在 c 了.

BUG 现场

cpu %100

ErtsRunQueue 被写坏

#0  sched_prep_cont_spin_wait (ssi=<optimized out>) at beam/erl_process.c:2947
#1  scheduler_wait (fcalls=<synthetic pointer>, rq=<optimized out>, esdp=0x7fac4d7c7b40) at beam/erl_process.c:3377
#2  erts_schedule () at beam/erl_process.c:9419
#3  0x0000000000464e31 in erts_dirty_process_main () at beam/beam_emu.c:1175
#4  0x000000000044d6f0 in sched_dirty_cpu_thread_func (vesdp=0x7fac4d7c7b40) at beam/erl_process.c:8396
#5  0x0000000000674cf9 in thr_wrapper (vtwd=0x7ffcb5f11cf0) at pthread/ethread.c:118
#6  0x00007fac908226ba in start_thread (arg=0x7fac4c3b1700) at pthread_create.c:333
#7  0x00007fac9035041d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
erts_schedule () at beam/erl_process.c:9419
9419    in beam/erl_process.c
(gdb) p rq
$9 = (ErtsRunQueue *) 0x20
(gdb) p *rq
Cannot access memory at address 0x20

AOFF_RBTree_t 被写坏. rb_tree 存在环.

#0  rbt_insert (order=FF_BF, root=root@entry=0x7f0c9c6000e8, blk=0x7f0c9c605a58) at beam/erl_ao_firstfit_alloc.c:796
#1  0x00000000005e8a42 in aoff_link_free_block (allctr=<optimized out>, block=<optimized out>) at beam/erl_ao_firstfit_alloc.c:750
#2  0x000000000049b444 in mbc_free (allctr=0x22e90c0, p=<optimized out>, busy_pcrr_pp=0x7f0c98376a30) at beam/erl_alloc_util.c:2579
#3  0x000000000049be8f in dealloc_block (allctr=allctr@entry=0x22e90c0, ptr=ptr@entry=0x7f0c9c605a60, fix=fix@entry=0x0, dec_cc_on_redirect=dec_cc_on_redirect@entry=1) at beam/erl_alloc_util.c:2321
#4  0x000000000049f440 in dealloc_block (fix=0x0, dec_cc_on_redirect=1, ptr=0x7f0c9c605a60, allctr=0x22e90c0) at beam/erl_alloc_util.c:2306
#5  handle_delayed_dealloc (need_more_work=<optimized out>, thr_prgr_p=<optimized out>, need_thr_progress=0x7f0c98376b28, ops_limit=20, use_limit=<optimized out>, allctr_locked=0, allctr=0x22e90c0) at beam/erl_alloc_util.c:2174
#6  erts_alcu_check_delayed_dealloc (allctr=0x22e90c0, limit=limit@entry=1, need_thr_progress=need_thr_progress@entry=0x7f0c98376b28, thr_prgr_p=thr_prgr_p@entry=0x7f0c98376b30, more_work=more_work@entry=0x7f0c98376b2c)
    at beam/erl_alloc_util.c:2276
#7  0x000000000048df73 in erts_alloc_scheduler_handle_delayed_dealloc (vesdp=0x7f0c9a56dcc0, need_thr_progress=need_thr_progress@entry=0x7f0c98376b28, thr_prgr_p=thr_prgr_p@entry=0x7f0c98376b30, 
    more_work=more_work@entry=0x7f0c98376b2c) at beam/erl_alloc.c:1895
#8  0x0000000000461812 in handle_delayed_dealloc_thr_prgr (waiting=1, aux_work=4, awdp=0x7f0c9a56ddd8) at beam/erl_process.c:2101
#9  handle_aux_work (awdp=awdp@entry=0x7f0c9a56ddd8, orig_aux_work=orig_aux_work@entry=4, waiting=waiting@entry=1) at beam/erl_process.c:2596
#10 0x0000000000460ced in scheduler_wait (fcalls=<synthetic pointer>, rq=<optimized out>, esdp=0x7f0c9a56dcc0) at beam/erl_process.c:3292
#11 erts_schedule () at beam/erl_process.c:9419
#12 0x0000000000451920 in process_main () at beam/beam_emu.c:689
#13 0x000000000044d863 in sched_thread_func (vesdp=0x7f0c9a56dcc0) at beam/erl_process.c:8349
#14 0x0000000000674cf9 in thr_wrapper (vtwd=0x7ffe71ca9770) at pthread/ethread.c:118
#15 0x00007f0cdd1976ba in start_thread (arg=0x7f0c98377700) at pthread_create.c:333
#16 0x00007f0cdccc541d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) p *root->right->right
$25 = {hdr = {bhdr = 1185, u = {carrier = 0x7f0c9c600000, udata__ = ""}}, parent = 0x7f0c9c607b80, left = 0x0, right = 0x7f0c9c605ac0, flags = 3, max_sz = 1288, u = {next = 0x0, birth_time = 0}}
(gdb) p *root->right->right
$26 = {hdr = {bhdr = 1185, u = {carrier = 0x7f0c9c600000, udata__ = ""}}, parent = 0x7f0c9c607b80, left = 0x0, right = 0x7f0c9c605ac0, flags = 3, max_sz = 1288, u = {next = 0x0, birth_time = 0}}
(gdb) p *root->right->right
$27 = {hdr = {bhdr = 1185, u = {carrier = 0x7f0c9c600000, udata__ = ""}}, parent = 0x7f0c9c607b80, left = 0x0, right = 0x7f0c9c605ac0, flags = 3, max_sz = 1288, u = {next = 0x0, birth_time = 0}}
(gdb) p *root->right->right->right

memory leak

见 cpu %100 中的 AOFF_RBTree_t 问题. memory leak 的现场较少. 目前只分析一个.
因 AOFF_RBTree_t busy loop. 导致分配的内存无法释放, 主要内存分配的源头是普罗米修斯监控采集.

#Cp<0x9a6153d8>  (I)
#Cp<ram_file:ll_open/3+0x80>.
#Cp<0x9c0d3510>.
#Cp<prometheus_text_format:format/1+0x60>.
#Cp<'Elixir.Prometheus.Format.Text':format/1+0x58>.
#Cp<'Elixir.Prometheus.MetricsRouter':'-do_match/4-fun-1-'/1+0x48>.
#Cp<'Elixir.Prometheus.MetricsRouter':plug_builder_call/2+0xd0>.
#Cp<0x9c02b7a0>.
#Cp<0x827211d8>.
#Cp<cowboy_stream_h:execute/3+0x60>.
#Cp<cowboy_stream_h:request_process/3+0xb8>.
#Cp<proc_lib:init_p_do_apply/3+0x50>.
#Cp<terminate process normally>.

coredump

如下两个 coredump 较为典型.

#0  0x00000000005e7a7b in lower_max_size (stop_at=0x0, node=0x7eff735b2d68) at beam/erl_ao_firstfit_alloc.c:161
#1  rbt_delete (root=root@entry=0x7eff735800e8, del=0x7eff735bc2f8) at beam/erl_ao_firstfit_alloc.c:617
#2  0x00000000005e826e in aoff_unlink_free_block (allctr=<optimized out>, blk=<optimized out>) at beam/erl_ao_firstfit_alloc.c:548
#3  0x00000000005e84aa in aoff_get_free_block (allctr=<optimized out>, size=<optimized out>, cand_blk=<optimized out>, cand_size=<optimized out>) at beam/erl_ao_firstfit_alloc.c:911
#4  0x0000000000499191 in create_carrier (allctr=allctr@entry=0xb0bdc0, umem_sz=umem_sz@entry=65544, flags=<optimized out>, flags@entry=2) at beam/erl_alloc_util.c:3921
#5  0x000000000049a700 in mbc_alloc_block (blk_szp=<synthetic pointer>, size=65536, allctr=0xb0bdc0) at beam/erl_alloc_util.c:2357
#6  mbc_alloc (allctr=0xb0bdc0, size=65536) at beam/erl_alloc_util.c:2483
#7  0x00000000004a1491 in erts_alcu_alloc_thr_pref (type=107, extra=<optimized out>, size=65536) at beam/erl_alloc_util.c:5650
#8  0x00007f002eff82d0 in deflateInit2_ () from /lib/x86_64-linux-gnu/libz.so.1
#9  0x00007f002eff841e in deflateInit_ () from /lib/x86_64-linux-gnu/libz.so.1
#10 0x00000000005d67b0 in erl_zlib_deflate_start (streamp=<optimized out>, source=<optimized out>, sourceLen=<optimized out>, level=<optimized out>) at beam/erl_zlib.c:60
#11 0x000000000053cf9f in erts_term_to_binary_int (p=p@entry=0x7effecf00438, Term=<optimized out>, level=<optimized out>, flags=<optimized out>, context_b=0x7eff73a438f8, context_b@entry=0x0) at beam/external.c:2049
#12 0x000000000053f7b2 in term_to_binary_2 (A__p=0x7effecf00438, BIF__ARGS=<optimized out>, A__I=<optimized out>) at beam/external.c:1204
#13 0x00000000004525cc in process_main () at x86_64-unknown-linux-gnu/opt/smp/beam_hot.h:310
#14 0x000000000044d863 in sched_thread_func (vesdp=0x7effec184dc0) at beam/erl_process.c:8349
#15 0x0000000000674cf9 in thr_wrapper (vtwd=0x7fffdd513f70) at pthread/ethread.c:118
#16 0x00007f002eddc6ba in pthread_getattr_np (thread_id=139637642776000, attr=0x0) at pthread_getattr_np.c:88
#17 0x0000000000000000 in ?? ()
#0  0x00000000005e870d in tree_insert_fixup (blk=<optimized out>, root=<optimized out>) at beam/erl_ao_firstfit_alloc.c:476
#1  rbt_insert (order=FF_BF, root=root@entry=0x7f0b7ad400e8, blk=0x7f0b7ad775d8) at beam/erl_ao_firstfit_alloc.c:830
#2  0x00000000005e8a42 in aoff_link_free_block (allctr=<optimized out>, block=<optimized out>) at beam/erl_ao_firstfit_alloc.c:750
#3  0x000000000049a5dc in mbc_alloc_finalize (valid_blk_info=1, want_blk_sz=31720, crr=0x7f0b7ad40000, flags=<optimized out>, org_blk_sz=<optimized out>, blk=0x7f0b7ad6f9f0, allctr=0x1cb1dc0) at beam/erl_alloc_util.c:2421
#4  mbc_alloc (allctr=0x1cb1dc0, size=<optimized out>) at beam/erl_alloc_util.c:2487
#5  0x00000000004a1491 in erts_alcu_alloc_thr_pref (type=86, extra=<optimized out>, size=31712) at beam/erl_alloc_util.c:5650
#6  0x0000000000486f0d in erts_alloc (type=11060, size=31712) at beam/erl_alloc.h:230
#7  erts_start_staging_ranges (num_new=num_new@entry=1) at beam/beam_ranges.c:131
#8  0x0000000000486a4a in erts_start_staging_code_ix (num_new=num_new@entry=1) at beam/code_ix.c:74
#9  0x000000000047aca8 in finish_loading_1 (A__p=0x7f0be94c1e38, BIF__ARGS=<optimized out>, A__I=<optimized out>) at beam/beam_bif_load.c:373
#10 0x00000000004525cc in process_main () at x86_64-unknown-linux-gnu/opt/smp/beam_hot.h:310
#11 0x000000000044d863 in sched_thread_func (vesdp=0x7f0bec784dc0) at beam/erl_process.c:8349
#12 0x0000000000674cf9 in thr_wrapper (vtwd=0x7ffe8f8841e0) at pthread/ethread.c:118
#13 0x00007f0c2f3df6ba in pthread_getattr_np (thread_id=139689188675008, attr=0x0) at pthread_getattr_np.c:88
#14 0x0000000000000000 in ?? ()

追踪过程

理论上 erlang 不应该有那么大 BUG. 问题大概率出现在我们使用的三方库上. 汇总后有 dyntrace,crypto,asn1rt_nif,Elixir.CBson,jiffy. 三方的 nif 有 CBson, jiffy, 因 jiffy 使用的人更多. 怀疑 CBson.

通过被破坏的内存将范围限定在越界和野指针

见上面的现场分析,可能存在越界或野指针写坏了内存.从 CBson 开始,阅读代码未发现问题.

构造出 CBson 的 coredump, 但和线上不吻合

并发 encode/decode 一些较大的随机生成数据, 能够有一定概率 coredump, 调用栈如下, 发现是 decode 时,判断长度没有考虑负数.仔细阅读了 decode 代码,在 bson 字节流正确时,理论上不应该引发崩溃.

#0  0x00007febcdf0bb79 in decode_iter (env=0x7fec455bbd70, argc=<optimized out>, argv=0x7fec482041c0) at src/bson_decoder.c:516
#1  0x0000000000455158 in process_main () at x86_64-unknown-linux-gnu/opt/smp/beam_cold.h:119
#2  0x000000000044d80e in sched_thread_func (vesdp=0x7fec4668f180) at beam/erl_process.c:8349
#3  0x0000000000674cc9 in thr_wrapper (vtwd=0x7ffface81c50) at pthread/ethread.c:118
#4  0x00007fec895b76db in start_thread (arg=0x7fec455bc700) at pthread_create.c:463
#5  0x00007fec890d888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

此外测试了如下情况:

  • 和并发无关.
  • 和 decode 的 yield 无关
  • 和 value binary 大小无关.
  • 和 value binary 是否随机无关.

erlang crash in instrument:allocations

因有相当多 coredump 都和 erl_alloc 相关. 查了 erlang 对 erl_alloc 的最近修改. 发现一个有意思的提交:

~/git/otp/erts/emulator/beam(a0ae44f*) » git log -p master erl_alloc*
commit 2193618f3524d6115693c8e710fffbdf9a68b533
Author: John Högberg <john@erlang.org>
Date:   Mon Aug 12 15:09:17 2019 +0200

    erts: Fix crash in instrument:allocations/0-1
    
    The current carrier list was read when the allocator wasn't locked,
    crashing the emulator if a block scan raced with a carrier
    allocation

看起来能解释 AOFF_RBTree_t 问题. 可惜升级到 erlang 21.3.8.9 后, 仍然能够重现 BUG.

构造了吻合线上 coredump 的用例

如果 decode 在 bson 字节流正确的情况下没有问题, 说明 encode 生成了错误的字节流. 经过验证发现. encode 同样数据, 生成的字节流可能不一致. 尝试 diff 字节流寻找灵感, 因为字节流实在太大. 如果没有分析 decode 过程的工具, 无法简单地看出问题. 
接着尝试.

  1. 单纯反复 decode 同一份数据.
  2. 单纯反复 encode 同一份数据.

发现, 并发 encode 同一份数据, 能够在 1 -2MIN 重现 coredump. 而且调用栈极为近似.

定位到一个野指针 bug

在范围更加缩小的情况下, 继续把 encode 代码再次过了一遍. 重点看 memcpy, 指针写逻辑, 最终发现一个野指针. 可能改写一片已释放内存.

erlang nif 的调试

程序在做什么

top -H

如,找到 cpu %100 的线程.

top -H -p 12229
Threads:  32 total,   1 running,  27 sleeping,   0 stopped,   0 zombie
%Cpu(s): 20.1 us,  3.9 sy,  0.0 ni, 75.2 id,  0.1 wa,  0.0 hi,  0.7 si,  0.0 st
KiB Mem : 32939792 total,  1596620 free, 19476236 used, 11866936 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 12275476 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                                                                                  
13760 root      20   0  9.836g 6.811g   2348 R 99.9 21.7  17782:37 8_dirty_cpu_sch                                                                                                                                                          
12229 root      20   0  9.836g 6.811g   2348 t  0.0 21.7   0:00.08 beam.smp                                                                                                                                                                 
13669 root      20   0  9.836g 6.811g   2348 S  0.0 21.7   0:00.02 sys_sig_dispatc 

pstack

....
Thread 2 (Thread 0x7fee70b3f700 (LWP 3863)):
#0  0x00007feeb2cb851d in read () at ../sysdeps/unix/syscall-template.S:84
#1  0x00000000005fd38c in read (__nbytes=4, __buf=0x7fee70b3eee0, __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
#2  signal_dispatcher_thread_func (unused=<optimized out>) at sys/unix/sys.c:993
#3  0x0000000000674cf9 in thr_wrapper (vtwd=0x7fff9fbb3770) at pthread/ethread.c:118
#4  0x00007feeb2caf6ba in start_thread (arg=0x7fee70b3f700) at pthread_create.c:333
#5  0x00007feeb27dd41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 1 (Thread 0x7feeb3a2d700 (LWP 3721)):
#0  0x00007feeb27d35d3 in select () at ../sysdeps/unix/syscall-template.S:84
#1  0x00000000005fe4f0 in erts_sys_main_thread () at sys/unix/sys.c:1140
#2  0x00000000004a8c3c in erl_start (argc=32, argv=<optimized out>) at beam/erl_init.c:2327
#3  0x00000000004518a9 in main (argc=<optimized out>, argv=<optimized out>) at sys/unix/erl_main.c:30

etp-stacktrace

可以在 gdbinit 里 source etp-commands.in, 更方便地使用 etp-stacktrace. 这样可以知道程序在 erlang 层在干什么.

~/git/elixir-cbson(dee7f1d*) » cat ~/.gdbinit                                                                                                                                enjolras@ubuntu
set target-async 1
set pagination off
set non-stop on
source /home/enjolras/git/otp/erts/etc/unix/etp-commands.in
(gdb) etp-stacktrace p
% Stacktrace (24)
#Cp<0x37d953d8>  (I)
#Cp<rand:seed_put/1+0x58>  (cp)
#Cp<rand:uniform/1+0x40>.
#Cp<'Elixir.Ejoy.Service.Discovery':get_service/2+0xe8>.
#Cp<'Elixir.Ejoy.Service.Discovery':'-get_service_by_names/2-fun-0-'/3+0x88>.
#Cp<'Elixir.Enum':'-reduce/3-lists^foldl/2-0-'/3+0x50>.
#Cp<'Elixir.Ejoy.Services.Client':call_game/3+0xc0>.
#Cp<'Elixir.Gangplank.Srpc':rpc/4+0x80>.
#Cp<'Elixir.OnlineCounter':get_count_subprocess/3+0x78>.
#Cp<terminate process normally>.

gdb

source /home/enjolras/git/otp/erts/etc/unix/etp-commands.in
可以使用一些更方便查看 erlang 内部结构的命令. 最好用是 etp 打印 eterm, 比如打印 atom.
各个语言都有类似的脚手架. 如 cpython 的 cpython\Misc\gdbinit.

总结

野指针 BUG 的查找非常看运气. 关键在于如何重现 BUG, 能够重现之后, 如何缩小 BUG 范围就是老生常谈了:

如何重现问题

分析现场. 阅读代码. 大胆假设, 细心验证.
如, 现场发现内存又被写坏. 而且不只一个地方被写坏. 那么堆上的越界 / 野指针是可能的.
erlang 的 The current carrier list was read when the allocator wasn’t locked, 可能导致 allocator 对应的数据结构被写坏, 但不可能导致 ErtsRunQueue 被写坏.
如, bug 是随机二进制数据触发的吗? 字典中有随机二进制数据可触发 BUG, 替换成固定字符串呢?
如, bug 和 encode 包大小相关吗? 和 yield 相关吗? 和并发相关?
信息越多,就离重现/解决问题越近.

缩小 BUG 范围.

  • Taints 将范围缩小至特定几个库
  • 构造不同的测试用例将范围缩小到部分逻辑. 
  • 2 分搜索最近提交.

正文完
 0