About the JVM: Using eBPF to Verify the Rumor That Java GC Logging Can Stall the Entire JVM Service


Overview

Real-world Java applications all record GC logs. But not everyone knows that this humble log can stall the entire JVM service. This article tries to use eBPF and other analysis methods to verify whether, under specific conditions, the problem really exists.

At the time of reviewing and publishing this article, I had only been back on my feet for a few hours after a second bout of COVID. Having spent several weeks writing it, I really did not want to delay publishing any longer. If there are mistakes or omissions, please bear with me and point them out.

Java service stalls are the eternal king of topics in the Java world. When Java stalls come to mind, most people's first reaction is these keywords:

  • GC
  • Stop the world(STW)

Writing GC logs causing Java stalls, however, is probably a niche topic. Yet I did witness it first-hand at a previous e-commerce company. That was around 12/8 of 2017, still on Java 8. The relevant experts at the time spent roughly several weeks investigating and experimenting, and finally delivered a diagnosis and a prescription:

  • Diagnosis: writing GC logs caused the JVM GC stalls
  • Prescription: keep GC logs off the data disk (write them to a ram disk/tmpfs), or write them to a dedicated disk; a minimal tmpfs sketch follows below.
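
As a minimal sketch of that prescription (the mount point, size, and application name here are made up for illustration), redirecting the GC log onto a tmpfs looks roughly like this:

# create a small tmpfs just for GC logs (hypothetical mount point and size)
sudo mkdir -p /mnt/gclog
sudo mount -t tmpfs -o size=256m tmpfs /mnt/gclog

# point the JVM's GC log at the tmpfs instead of a contended disk
java -Xloggc:/mnt/gclog/gc.log ... MyApp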

Fast forward to 2023, and OpenJDK 21 is already here. Does the problem still exist? Look at what a search engine returns:

  • Eliminating Large JVM GC Pauses Caused by Background IO Traffic – 2016 年 Zhenyun Zhuang@LinkedIn

    • OS-Caused Large JVM Pauses: Investigations and Solutions – 2016 年 Zhenyun Zhuang@LinkedIn
  • JVM performance tuning – YGC – heapdump.cn
  • GC log output caused the JVM to hang, and I was dumbfounded

OpenJDK's fix

The problem is by now widely recognized, and OpenJDK has shipped a fix:

  • Buffered Logging? – rramakrishna@twitter.com
  • JDK-8229517: Support for optional asynchronous/buffered logging

    • Git Commit
  • JDK-8264323: Add a global option -Xlog:async for Unified Logging
  • Asynchronous Logging in Corretto 17 – Xin Liu@amazon.com

Unfortunately, following the compatibility principle that a new version should behave like the old one, this fix is disabled by default. You have to add the Java command-line flag -Xlog:async to manually switch on the "feature" of writing GC logs asynchronously from a dedicated native thread.
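
As a sketch of what enabling it might look like (the log path and application are placeholders; the exact unified-logging options depend on your setup), you combine -Xlog:async with the existing GC log configuration:

# default: GC log lines are written synchronously, inside the safepoint
java -Xlog:gc*:file=/var/log/myapp/gc.log -jar myapp.jar

# JDK 17+: -Xlog:async buffers log lines in memory, and a dedicated
# native thread flushes them to the file outside the critical path
java -Xlog:async -Xlog:gc*:file=/var/log/myapp/gc.log -jar myapp.jar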

So here comes the question: do we simply add the flag and call it a day, or do we first verify whether the service's stalls really originate from this problem before deciding to change the flag? Naturally, this article chooses the latter.

Fundamentals

When does the JVM write GC logs?

Safepoint

[Figure from https://krzysztofslusarski.github.io/2022/12/12/async-manual….]

The figure above gives a rough picture of the main safepoint flow. Interested readers can follow the link above or search around; there are plenty of good articles online explaining it, so I won't presume to lecture on it here.

One thing to note: GC is not the only user of the safepoint STOP THE WORLD (STW) mechanism; there are others.
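
If you want to see for yourself which VM operations are triggering safepoints, unified logging has a safepoint tag (a hedged suggestion: the exact output format varies across JDK versions):

# prints one line per safepoint, including the VM operation name and the time spent reaching and inside it
java -Xlog:safepoint -Xlog:gc* ... YourApp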

Here is only a brief explanation (I am not a JVM expert, so please treat the following with caution):

  1. Global safepoint request

    1.1 Some thread issues a request to enter a safepoint, carrying a safepoint operation parameter; the parameter is actually the callback to execute after STOP THE WORLD (STW). It may be that a memory allocation could not be satisfied and triggered a GC, or it may be for some other reason.

    1.2 A thread called the "VM Thread", after receiving the safepoint request, sets a JVM-global safepoint flag to true (this is a simplification; if you are an expert, you know the flag is actually implemented via an OS memory page-fault mechanism).

    1.3 The "VM Thread" then starts waiting for the other application threads (App threads) to reach (enter) the safepoint.

    1.4 The application threads (App threads) check this safepoint flag at high frequency; when they see it is true, they reach (enter) the safepoint state.

    Source link: SafepointSynchronize::begin()

  2. Global safepoint

    When the "VM Thread" sees that all App threads have reached the safepoint (the start of the real STW), it starts executing the safepoint operation. In this article, that is of course the GC operation.

    Source link: RuntimeService::record_safepoint_synchronized()

  3. End of safepoint operation

    When the safepoint operation finishes, the "VM Thread" ends the STW.

    Source link: SafepointSynchronize::end()

GC

The key point above is that the GC operation runs after the STW starts, and only when it finishes does the STW end. The GC operation writes GC logs, and by default that ends up calling the write syscall. The proof follows below.
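
Before reaching for eBPF, a quick sanity check is possible with plain strace (a rough sketch; strace adds noticeable overhead, so avoid doing this on a production instance):

# attach to the running JVM and watch write() calls; with default synchronous logging,
# GC log lines appear as write() issued by the VM Thread during the safepoint
sudo strace -f -tt -e trace=write -p $(pgrep java)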

Diagnosis

Diagnosis – eBPF tracing

First, the bpftrace program trace-safepoint.bt:

#!/usr/local/bin/bpftrace
// trace-safepoint.bt

BEGIN
{}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*begin*
{@java_pid[pid] = pid;
}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*begin*
/@java_pid[pid]/
{@sp_begin[pid, tid, comm, ustack] = count();
    @vm_thread_tid[tid] = tid;
    @vm_thread_waiting_sp_sync[tid] = tid;
    @vm_thread_in_STW[tid] = tid;
    printf("SafepointSynchronize::begin %-6d %-16s\n", tid, comm);
}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*RuntimeService*record_safepoint_synchronized*
/@java_pid[pid]/
{@sp_synchronized[pid, tid, comm, ustack] = count();
    delete (@vm_thread_waiting_sp_sync[tid]);
    printf("RuntimeService::record_safepoint_synchronized %-6d %-16s\n", tid, comm);
}

uretprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*end*
/@java_pid[pid]/
{delete (@vm_thread_in_STW[tid]);
    printf("AsyncLogWriter: %-6d %-16s %s\n", pid, comm, func);
    printf("SafepointSynchronize::end %-6d %-16s\n", tid, comm);
}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*AsyncLogWriter*
/@java_pid[pid] && @vm_thread_in_STW[tid]/
{printf("AsyncLogWriter: %-6d %-16s %s\n", pid, comm, func);
    @AsyncLogWriter[pid, tid, comm, ustack] = count();}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*LogFileOutput*
/@java_pid[pid] && @vm_thread_in_STW[tid]/
{printf("LogFileOutput: %-6d %-16s %s\n", pid, comm, func);
    @LogFileOutput[pid, tid, comm, ustack] = count();}

tracepoint:block:block_rq_complete,
tracepoint:block:block_bio_complete
/@java_pid[pid]/
{@block_bio_complete[pid, tid, comm, kstack(4)] = count();}

tracepoint:syscalls:sys_enter_sync,
tracepoint:syscalls:sys_enter_syncfs,
tracepoint:syscalls:sys_enter_fsync,
tracepoint:syscalls:sys_enter_fdatasync,
tracepoint:syscalls:sys_enter_sync_file_range*,
tracepoint:syscalls:sys_enter_msync
/@java_pid[pid]/
{time("%H:%M:%S");
    printf("sync syscall: %-6d %-16s %s\n", pid, comm, probe);

    @sync_call[pid, tid, comm, probe, ustack(6)] = count();
    printf("sp sync_call: %-6d %-16s %s\n", pid, comm, probe);

}

kprobe:vfs_read*,
kprobe:vfs_write*,
kprobe:vfs_fsync,
kprobe:vfs_open,
kprobe:vfs_create
/@java_pid[pid]/
{@vfs_call[pid, tid, comm, func, ustack(6)] = count();
    printf("sp vfs_call: %-6d %-16s %s\n", pid, comm, probe);
}

END
{}

This program assumes that only one Java process is running on your machine; if there are several, it needs a small modification (see the sketch below).
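
One possible modification (a sketch using bpftrace positional parameters; not tested here) is to pass the target PID on the command line and filter the registering probe on it:

// in trace-safepoint.bt, only register the PID passed as the first positional parameter
uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*begin*
/pid == $1/
{ @java_pid[pid] = pid; }

The script would then be run as sudo bpftrace ./trace-safepoint.bt $(pgrep -n -f TestGCLog).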

The BPF probes we mainly care about are listed below (a quick symbol check follows the list):

  1. uprobe:SafepointSynchronize::begin()
  2. uprobe:RuntimeService::record_safepoint_synchronized()
  3. uprobe:LogFileOutput::*(): the JVM code that writes log output to the file
  4. uprobe:AsyncLogWriter::*(): the JVM asynchronous logging code
  5. kprobe:vfs_write: the kernel file-write operation
  6. uprobe:SafepointSynchronize::end()
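
Before running the script, it is worth confirming that these wildcard symbols actually resolve in your libjvm.so (a sketch; adjust the JDK path to your installation):

# list the concrete symbols the wildcards will attach to
sudo bpftrace -l 'uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*'
sudo bpftrace -l 'uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*AsyncLogWriter*'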

Then write a simple Java program, TestGCLog.java:

// TestGCLog.java
import java.util.*;

public class TestGCLog {
    List objList = new ArrayList();
    byte[] m = new byte[100 * 1024];

    public static void main(String[] arg) throws Exception {
        TestGCLog t = new TestGCLog();
        t.test();
    }

    void test() throws Exception {
        long lastprint = System.currentTimeMillis();
        while (true) {
            objList.clear();
            long start = System.currentTimeMillis();
            TestGCLog newObj = new TestGCLog();
            long end = System.currentTimeMillis();
            if (end - start > 10) {
                System.out.println(new Date() + "- TestGCLog: slow new object:" + (end - start) + "ms");
            }
            objList.add(newObj);
            Thread.sleep(5);
            long now = System.currentTimeMillis();
            if (now - lastprint > 5 * 1000) {
                System.out.println(new Date() + "- TestGCLog");
                lastprint = now;
            }
        }
    }
}

Compile and run the Java program, then run bpftrace:

/usr/lib/jvm/java-17-openjdk-amd64/bin/javac TestGCLog.java

/usr/lib/jvm/java-17-openjdk-amd64/bin/java -Xloggc:/mnt/dm0/gc.log  -XX:+UnlockDiagnosticVMOptions -XX:+PreserveFramePointer -Xms128M -Xmx128M -cp . TestGCLog
sudo bpftrace ./trace-safepoint.bt

We can see some interesting output from the bpftrace program:

Order of events:

SafepointSynchronize::begin 422373 VM Thread       
RuntimeService::record_safepoint_synchronized 422373 VM Thread       
LogFileOutput: 422361 VM Thread        LogFileOutput::write(LogDecorations const&, char const*)
AsyncLogWriter: 422361 VM Thread        AsyncLogWriter::instance()
sp vfs_call: 422361 VM Thread        kprobe:vfs_write
AsyncLogWriter: 422361 VM Thread        VMThread::inner_execute(VM_Operation*)
SafepointSynchronize::end 422373 VM Thread 

As you can see, the vfs_write file-write syscall happens synchronously inside the STW, i.e. between SafepointSynchronize::begin and SafepointSynchronize::end.

Some stack traces:

@AsyncLogWriter[422361, 422373, VM Thread, 
    AsyncLogWriter::instance()+0
    LogTagSet::vwrite(LogLevel::type, char const*, __va_list_tag*)+471
    LogTargetHandle::print(char const*, ...)+159
    LogStream::write(char const*, unsigned long)+338
    outputStream::do_vsnprintf_and_write_with_automatic_buffer(char const*, __va_list_tag*, bool)+194
    outputStream::print_cr(char const*, ...)+407
    GCTraceTimeLoggerImpl::log_end(TimeInstant<CompositeCounterRepresentation, CompositeElapsedCounterSource>)+288
    G1CollectedHeap::do_collection_pause_at_safepoint_helper(double)+2116
    G1CollectedHeap::do_collection_pause_at_safepoint(double)+53
    VM_G1CollectForAllocation::doit()+75
    VM_Operation::evaluate()+226
    VMThread::evaluate_operation(VM_Operation*)+177
    VMThread::inner_execute(VM_Operation*)+436
    VMThread::run()+199
    Thread::call_run()+196
    thread_native_entry(Thread*)+233
    start_thread+755
]: 18

@LogFileOutput[422361, 422373, VM Thread, 
    LogFileOutput::write(LogDecorations const&, char const*)+0
    LogTargetHandle::print(char const*, ...)+159
    LogStream::write(char const*, unsigned long)+338
    outputStream::do_vsnprintf_and_write_with_automatic_buffer(char const*, __va_list_tag*, bool)+194
    outputStream::print_cr(char const*, ...)+407
    GCTraceTimeLoggerImpl::log_end(TimeInstant<CompositeCounterRepresentation, CompositeElapsedCounterSource>)+288
    G1CollectedHeap::do_collection_pause_at_safepoint_helper(double)+2116
    G1CollectedHeap::do_collection_pause_at_safepoint(double)+53
    VM_G1CollectForAllocation::doit()+75
    VM_Operation::evaluate()+226
    VMThread::evaluate_operation(VM_Operation*)+177
    VMThread::inner_execute(VM_Operation*)+436
    VMThread::run()+199
    Thread::call_run()+196
    thread_native_entry(Thread*)+233
    start_thread+755
]: 17

@sp_begin[404917, 404924, VM Thread, 
    SafepointSynchronize::begin()+0
    VMThread::run()+199
    Thread::call_run()+196
    thread_native_entry(Thread*)+233
    start_thread+755
]: 1
@sp_synchronized[404917, 404924, VM Thread, 
    RuntimeService::record_safepoint_synchronized(long)+0
    VMThread::inner_execute(VM_Operation*)+389
    VMThread::run()+199
    Thread::call_run()+196
    thread_native_entry(Thread*)+233
    start_thread+755
]: 1

You can refer to the [How to simulate IO stalls on Linux] section below to simulate IO stalls, or even a complete hang, on the target disk the GC log is written to.

If you suspend the target disk with sudo dmsetup suspend /dev/$your_dm_dev, you can see directly from the Java program's stdout and from top that the entire JVM pauses too.

If you then inspect thread states with top -H -p $(pgrep java), you will find a VM Thread in the D (uninterruptible sleep) state.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND           
 430796 labile    20   0 3068740 112820  20416 S   0.3   0.2   0:00.57 VM Periodic Tas   
 430779 labile    20   0 3068740 112820  20416 S   0.0   0.2   0:00.00 java              
 430780 labile    20   0 3068740 112820  20416 S   0.0   0.2   0:02.96 java              
 430781 labile    20   0 3068740 112820  20416 S   0.0   0.2   0:00.02 GC Thread#0       
 430782 labile    20   0 3068740 112820  20416 S   0.0   0.2   0:00.00 G1 Main Marker       
 430785 labile    20   0 3068740 112820  20416 S   0.0   0.2   0:00.05 G1 Service        
 430786 labile    20   0 3068740 112820  20416 D   0.0   0.2   0:00.04 VM Thread 

If you are curious where the VM Thread is stuck, you can look at:

$ sudo cat /proc/`pgrep java`/task/$thread_id_ofVM_Thread/stack
[<0>] percpu_rwsem_wait+0x107/0x130
[<0>] vfs_write+0x274/0x2a0  <----
[<0>] ksys_write+0x67/0xf0  <----
[<0>] __x64_sys_write+0x19/0x30
[<0>] do_syscall_64+0x5c/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Or:

[<0>] do_get_write_access+0x261/0x330
[<0>] jbd2_journal_get_write_access+0x54/0xa0
[<0>] __ext4_journal_get_write_access+0x8f/0x1e0
[<0>] ext4_reserve_inode_write+0xa0/0xe0
[<0>] __ext4_mark_inode_dirty+0x57/0x230
[<0>] ext4_dirty_inode+0x5c/0x90
[<0>] __mark_inode_dirty+0x5e/0x3b0
[<0>] generic_update_time+0x6c/0xf0
[<0>] file_update_time+0x12f/0x150
[<0>] file_modified+0x27/0x40
[<0>] ext4_buffered_write_iter+0x53/0x130
[<0>] ext4_file_write_iter+0x49/0x80
[<0>] new_sync_write+0xfe/0x190 <----
[<0>] vfs_write+0x20d/0x2a0 <----
[<0>] ksys_write+0x67/0xf0
[<0>] __x64_sys_write+0x19/0x30
[<0>] do_syscall_64+0x5c/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Diagnosis – gdb tracing

Since the focus of this article is BPF, gdb gets only a brief treatment.

For how to set up gdb debugging of OpenJDK, see the [A JDK gdb environment on Ubuntu] section below.

With gdb you can set breakpoints on the JVM functions and kernel calls of interest.

$ gdb -ex 'handle SIGSEGV noprint nostop pass' -p `pgrep java`

(gdb) thread thread_id_of_vm_thread

(gdb) bt
#0  __GI___libc_write (nbytes=95, buf=0x7f7cc4031390, fd=4) at ../sysdeps/unix/sysv/linux/write.c:26
#1  __GI___libc_write (fd=4, buf=0x7f7cc4031390, nbytes=95) at ../sysdeps/unix/sysv/linux/write.c:24
#2  0x00007f7cca88af6d in _IO_new_file_write (f=0x7f7cc4000c20, data=0x7f7cc4031390, n=95) at ./libio/fileops.c:1180
#3  0x00007f7cca88ca61 in new_do_write (to_do=95, data=0x7f7cc4031390 "[3598.772s][info][gc] GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms\n", fp=0x7f7cc4000c20) at ./libio/libioP.h:947
#4  _IO_new_do_write (to_do=95, data=0x7f7cc4031390 "[3598.772s][info][gc] GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms\n", fp=0x7f7cc4000c20) at ./libio/fileops.c:425
#5  _IO_new_do_write (fp=0x7f7cc4000c20, data=0x7f7cc4031390 "[3598.772s][info][gc] GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms\n", to_do=95) at ./libio/fileops.c:422
#6  0x00007f7cca88a558 in _IO_new_file_sync (fp=0x7f7cc4000c20) at ./libio/fileops.c:798
#7  0x00007f7cca87f22a in __GI__IO_fflush (fp=0x7f7cc4000c20) at ./libio/libioP.h:947
#8  0x00007f7cc9eae679 in LogFileStreamOutput::flush (this=0x7f7cc4003e70) at src/hotspot/share/logging/logFileStreamOutput.cpp:93
#9  LogFileStreamOutput::write (this=this@entry=0x7f7cc4003e70, decorations=..., msg=msg@entry=0x7f7cac39b5c0 "GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms")
    at src/hotspot/share/logging/logFileStreamOutput.cpp:131
#10 0x00007f7cc9eadade in LogFileOutput::write_blocking (msg=0x7f7cac39b5c0 "GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms", decorations=..., this=0x7f7cc4003e70)
    at src/hotspot/share/logging/logFileOutput.cpp:308
#11 LogFileOutput::write (msg=0x7f7cac39b5c0 "GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms", decorations=..., this=0x7f7cc4003e70) at src/hotspot/share/logging/logFileOutput.cpp:332
#12 LogFileOutput::write (this=0x7f7cc4003e70, decorations=..., msg=0x7f7cac39b5c0 "GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms") at src/hotspot/share/logging/logFileOutput.cpp:320
#13 0x00007f7cc9eb3a47 in LogTagSet::log (msg=0x7f7cac39b5c0 "GC(509) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(128M) 3.261ms", level=LogLevel::Info, 
    this=0x7f7cca74bee0 <LogTagSetMapping<(LogTag::type)43, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::_tagset>) at src/hotspot/share/logging/logTagSet.cpp:85
#14 LogTagSet::vwrite (this=0x7f7cca74bee0 <LogTagSetMapping<(LogTag::type)43, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::_tagset>, level=LogLevel::Info, fmt=<optimized out>, 
    args=<optimized out>) at src/hotspot/share/logging/logTagSet.cpp:150
#15 0x00007f7cc9b7f09f in LogTargetHandle::print (this=this@entry=0x7f7cac39c2f0, fmt=fmt@entry=0x7f7cca41dd6e "%s") at src/hotspot/share/logging/logHandle.hpp:95
#16 0x00007f7cc9eb3232 in LogStream::write (this=0x7f7cac39c260, s=0x7f7cac39b930 "3.261ms\n", len=9) at src/hotspot/share/logging/logStream.hpp:52
#17 0x00007f7cca03eaa2 in outputStream::do_vsnprintf_and_write_with_automatic_buffer (this=this@entry=0x7f7cac39c260, format=format@entry=0x7f7cca498e6c "%.3fms", ap=ap@entry=0x7f7cac39c158, add_cr=add_cr@entry=true)
    at src/hotspot/share/utilities/ostream.cpp:131
#18 0x00007f7cca03f427 in outputStream::do_vsnprintf_and_write (add_cr=true, ap=0x7f7cac39c158, format=0x7f7cca498e6c "%.3fms", this=0x7f7cac39c260) at src/hotspot/share/utilities/ostream.cpp:144
#19 outputStream::print_cr (this=this@entry=0x7f7cac39c260, format=format@entry=0x7f7cca498e6c "%.3fms") at src/hotspot/share/utilities/ostream.cpp:158
#20 0x00007f7cc9b8fec0 in GCTraceTimeLoggerImpl::log_end (this=<optimized out>, end=...) at src/hotspot/share/gc/shared/gcTraceTime.cpp:70
#21 0x00007f7cc9afc8c4 in GCTraceTimeDriver::at_end (end=..., cb=<optimized out>, this=0x7f7cac39c688) at src/hotspot/share/gc/shared/gcTraceTime.inline.hpp:78
#22 GCTraceTimeDriver::~GCTraceTimeDriver (this=0x7f7cac39c688, __in_chrg=<optimized out>) at src/hotspot/share/gc/shared/gcTraceTime.inline.hpp:61
#23 GCTraceTimeImpl::~GCTraceTimeImpl (this=0x7f7cac39c618, __in_chrg=<optimized out>) at src/hotspot/share/gc/shared/gcTraceTime.hpp:141
#24 GCTraceTimeWrapper<(LogLevel::type)3, (LogTag::type)43, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::~GCTraceTimeWrapper (this=0x7f7cac39c610, __in_chrg=<optimized out>)
    at src/hotspot/share/gc/shared/gcTraceTime.inline.hpp:184
#25 G1CollectedHeap::do_collection_pause_at_safepoint_helper (this=this@entry=0x7f7cc4026ef0, target_pause_time_ms=target_pause_time_ms@entry=200) at src/hotspot/share/gc/g1/g1CollectedHeap.cpp:3071
#26 0x00007f7cc9afcb45 in G1CollectedHeap::do_collection_pause_at_safepoint (this=this@entry=0x7f7cc4026ef0, target_pause_time_ms=200) at src/hotspot/share/gc/g1/g1CollectedHeap.cpp:2887
#27 0x00007f7cc9b7902b in VM_G1CollectForAllocation::doit (this=0x7f7cc8ffe690) at src/hotspot/share/gc/g1/g1VMOperations.cpp:146
#28 0x00007f7cca35b632 in VM_Operation::evaluate (this=this@entry=0x7f7cc8ffe690) at src/hotspot/share/runtime/vmOperations.cpp:70
#29 0x00007f7cca35ced1 in VMThread::evaluate_operation (this=this@entry=0x7f7cc40e0f10, op=0x7f7cc8ffe690) at src/hotspot/share/runtime/vmThread.cpp:269
#30 0x00007f7cca35d2d4 in VMThread::inner_execute (this=this@entry=0x7f7cc40e0f10, op=<optimized out>) at src/hotspot/share/runtime/vmThread.cpp:415
#31 0x00007f7cca35d5c7 in VMThread::loop (this=0x7f7cc40e0f10) at src/hotspot/share/runtime/vmThread.cpp:482
#32 VMThread::run (this=0x7f7cc40e0f10) at src/hotspot/share/runtime/vmThread.cpp:162
#33 0x00007f7cca2d7834 in Thread::call_run (this=0x7f7cc40e0f10) at src/hotspot/share/runtime/thread.cpp:394
#34 0x00007f7cca0326c9 in thread_native_entry (thread=0x7f7cc40e0f10) at src/hotspot/os/linux/os_linux.cpp:724
#35 0x00007f7cca894b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#36 0x00007f7cca926a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
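
A small follow-up sketch (assuming the openjdk-17-dbg symbols from the appendix are installed; the symbol comes from frame #8 of the backtrace above): you can also break exactly where the GC log is flushed.

(gdb) break LogFileStreamOutput::flush
(gdb) continue

Since the backtrace shows this flush is reached from VM_Operation::evaluate() on the VM Thread, hitting the breakpoint confirms the file flush happens while the application threads are still held at the safepoint.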

Problems that may be amplified in cloud-native environments

  • Direct Page Reclaim, after worker-node memory is shared or container memory is limited

    In a cloud environment, one machine may run many containers. Some containers are IO-heavy applications (for example heavy log writers); they consume page cache and put contention pressure on the shared disk. Insufficient memory on the whole machine, or insufficient memory inside the container itself, can both trigger Direct Page Reclaim.

  • IO contention on local disks shared across the worker node
  • Network PV disks such as Ceph are far less stable than local disks; their bandwidth and latency are comparatively unstable.

In a configuration like the one below, multiple containers may share the local disk backing emptyDir. Under IO contention, writing the GC log can then hit high latency while inside the STW, directly causing service API latency timeouts.

apiVersion: v1
kind: Pod
metadata:
  annotations:
  labels:
  name: xyz
spec:
  containers:
  - command: java ... -Xloggc:/usr/share/myapp/logs/gc.log ...
    ...
    volumeMounts:
    - mountPath: /usr/share/myapp/logs
      name: myapp-logs
  volumes:
  - emptyDir: {}
    name: myapp-logs

Conclusion

Money is a good thing, and so is attention; people go to great lengths to obtain them, but both have side effects. Look at today's IT industry: a kind of buzzword worship appears all too easily, with every other sentence dropping a pile of trendy technology keywords. Knowing a hammer is good is fine, but not everything is a nail. Understanding a thing itself, along with its characteristics and weaknesses, is the key to sound technology selection. Some misuse of BPF has already begun in the market; let us wait and see. I still feel the most mature use case for eBPF today is tracing problems that are hard to analyze by other means, such as anomalous kernel behavior.

Later on I will write two companion ("twin") articles: "Using eBPF to verify a rumor: can mmap + Java safepoints stall the entire JVM service?" and "How to measure process-level IO latency".

"學而不思則罔,思而不學則殆" ("Learning without thinking leads to confusion; thinking without learning ends in peril.") – The Analects of Confucius, Wei Zheng

Appendix

Why can an application's file writes stall on Linux?

Readers who know the Linux IO mechanism may raise a doubt: the application's write syscall only writes into the page cache (marking pages dirty), and dedicated pdflush/writeback threads then take care of the actual IO; so the application itself generally should not stall on file writes, unless it calls fsync() itself to force data down to disk.

Indeed, Linux's main IO write path is mostly asynchronous. But there are exceptions to everything (the devil is in the details), which is exactly what makes the programmer's world fun: Why buffered writes are sometimes stalled – YOSHINORI MATSUNOBU. The article boils down to a few possibilities:

  • READ MODIFY WRITE – some cases require reading before writing
  • WRITE() MAY BE BLOCKED FOR "STABLE PAGE WRITES" – the page being written was already dirty and is currently locked by the pdflush/writeback thread, about to be flushed to disk.
  • WAITING FOR JOURNAL BLOCK ALLOCATION IN EXT3/4

Is the list above complete? I think there are other possibilities:

  • Direct Page Reclaim – memory page reclaim

When an application asks the kernel for memory and the kernel finds the currently free memory (for a container (cgroup), free memory only counts the portion within the limit) insufficient, it may suspend the current thread and go free more memory. This usually includes:

  1. Flushing page cache previously marked dirty down to disk, which is of course a slow IO operation. The current thread may enter the "uninterruptible sleep" state until the IO writes and the memory release complete.

In a cloud environment, one machine may run many containers. Some containers are IO-heavy applications (for example heavy log writers); they consume page cache and put contention pressure on the shared disk. Insufficient memory on the whole machine, or insufficient memory inside the container itself, can both trigger Direct Page Reclaim.

For more, see: Our lessons on Linux writeback: do "dirty" jobs the right way

My idol Brendan Gregg has a BPF (bpftrace) diagnostic approach for this: vmscan.bt
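
A minimal sketch in the same spirit (assuming the vmscan tracepoints are available on your kernel), histogramming how long each process spends in direct reclaim:

sudo bpftrace -e '
tracepoint:vmscan:mm_vmscan_direct_reclaim_begin { @start[tid] = nsecs; }
tracepoint:vmscan:mm_vmscan_direct_reclaim_end /@start[tid]/ {
    @direct_reclaim_us[comm] = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'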

How to simulate IO stalls on Linux

There are roughly two methods you can find online for simulating IO stalls:

  • The device mapper delay method

    • Emulate a slow block device with dm-delay – Christophe Vu-Brugier
    • Emulate a slow block device with dm-delay – flamingbytes
    • dm-delay
  • The SystemTap delay-injection method

    • Cleverly using SystemTap to inject delay and simulate IO device jitter
    • Using SystemTap to inject delay into I/O devices

This article uses the device mapper delay method: the mapped device itself carries a built-in delay.

# load the brd kernel module. rd_nr is the maximum number of ramdisks. rd_size is the ramdisk size in KB.
sudo modprobe brd rd_nr=1 rd_size=1048576

ls -l /dev/ram0
brw-rw---- 1 root disk 1, 0 Aug 24 20:00 /dev/ram0

sudo blockdev --getsize /dev/ram0 # Display the size in 512-byte sectors
2097152

# Specify the read/write delay as 200 ms (a separate write delay, e.g. 100 ms, can also be given).
size=$(sudo blockdev --getsize /dev/ram0)
sudo dmsetup remove delayed
echo "0 $size delay /dev/ram0 0 200" | sudo dmsetup create delayed


sudo dmsetup table delayed
0 2097152 delay 1:0 0 50

ls -la /dev/mapper/ | grep delayed
lrwxrwxrwx  1 root root       7 May 19 18:37 delayed -> ../dm-0

sudo mkfs.ext4 /dev/dm-0

sudo mkdir /mnt/dm0
sudo mount /dev/dm-0 /mnt/dm0

cd /mnt/dm0

sudo dd if=/dev/urandom of=/mnt/dm0/mapped.file bs=1G count=1
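
To sanity-check that the injected delay is really in effect (a rough sketch; dd is not a precise benchmark), issue a few synchronous writes against the mounted mapper device:

# with oflag=dsync each write waits for the device, so the dm-delay latency becomes visible
time sudo dd if=/dev/zero of=/mnt/dm0/latency-test.bin bs=4k count=10 oflag=dsync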

In addition, if needed for debugging, you can suspend the device, which makes every process doing IO on the mapper device hang (uninterruptible sleep):

# suspend
sudo dmsetup suspend /dev/$your_dm_dev

# resume
sudo dmsetup resume /dev/$your_dm_dev

The problem of broken BPF stack traces on Ubuntu, because glibc is not built with -fno-omit-frame-pointer

sudo apt install libc6-prof

env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu /usr/lib/jvm/java-17-openjdk-amd64/bin/java ...

TL;DR :

https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1908307

A JDK gdb environment on Ubuntu

The symbol tables gdb needs:

sudo apt install openjdk-17-jdk
sudo apt install openjdk-17-dbg

Ubuntu's customized OpenJDK source:

For details, see: https://www.cyberciti.biz/faq/how-to-get-source-code-of-packa…

## Prerequisite: the deb-src (source) repo has already been added to your apt sources.list

sudo apt update
cd ~
sudo apt-get source openjdk-17