Overview
Java supports quite a few ways to read files. This article is about a less-discussed one, mmap (MappedByteBuffer),
and its relationship to Safepoints and JVM service stalls. Using eBPF and other analysis methods, it tries to verify whether the problem actually exists in a concrete environment.
I reviewed and published this article only hours after getting back on my feet from a second bout of COVID; after weeks of writing, I really did not want to delay publishing any longer. If there are errors or omissions, please bear with me and kindly point them out.
Introduction
Java service stalls are the eternal king of topics in the Java world. At the mention of Java stalls, most people's first reaction is these keywords:
- GC
- Safepoint / Stop the world (STW)
Saying that access to mmap (MappedByteBuffer)
causes Java stalls may sound like a niche topic. But once you look into some IO-heavy open-source projects built on Java, you will find it is not so niche after all. The wildly popular NoSQL database Cassandra, for example, by default relies heavily on mmap to read its data files and write its commit log.
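As a quick refresher on the API involved, here is a minimal sketch (the file path is a made-up placeholder): the read looks like an ordinary memory access, yet if the touched page is not resident in the page cache, the thread page-faults and waits on disk IO, with no syscall in sight.

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MMapPeek {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile f = new RandomAccessFile("/tmp/some.data", "r");
             FileChannel ch = f.getChannel()) {
            MappedByteBuffer mbb = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // Looks like a plain memory access, but if the page is not resident,
            // the thread takes a page fault and waits for disk IO right here,
            // while the JVM still counts it as running Java code.
            byte b = mbb.get(0);
            System.out.println(b);
        }
    }
}
```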
Fundamentals
An analysis of when the JVM writes GC logs
Safepoint
[From https://krzysztofslusarski.github.io/2022/12/12/async-manual....]
The figure above roughly outlines the main safepoint flow. Interested readers can follow the link above, or search the web; plenty of good articles have covered it, so I will not presume to lecture on it here.
One thing worth noting: GC is not the only user of the safepoint STOP THE WORLD (STW) mechanism. There are others, such as biased-lock revocation (the RevokeBias entries in the log later) and thread dumps.
Here is only a brief explanation (I am not a JVM expert, so treat the following with caution):
Global safepoint request

1.1 Some thread raises a request to enter a safepoint, carrying a `safepoint operation` parameter, which is really the callback to execute once the world is stopped (STW). The trigger may be a failed memory allocation leading to GC, or something else entirely.
1.2 On receiving the safepoint request, a thread called the "VM Thread" sets a JVM-global `safepoint flag` to true (a simplification; if you know the internals, the flag is in fact an OS memory page rigged to trigger a page fault).
1.3 The "VM Thread" then starts waiting for the other application threads (App threads) to reach (enter) the safepoint.
1.4 The application threads (App threads) check this safepoint flag at high frequency; when one sees it turn true, it reaches (enters) the safepoint state.

Source link: SafepointSynchronize::begin()
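To make steps 1.1 to 1.4 concrete, below is a toy Java analogy of the protocol (my own illustration, not JVM internals): a coordinator raises a flag, workers poll it between units of work, and the coordinator's wait is bounded by the slowest worker. That is exactly why a single thread stuck in a slow page fault stretches the time-to-safepoint for the whole JVM.

```java
import java.util.concurrent.CountDownLatch;

public class SafepointToy {
    static volatile boolean safepointRequested = false; // the "safepoint flag"

    public static void main(String[] args) throws Exception {
        int nThreads = 4;
        CountDownLatch reached = new CountDownLatch(nThreads);
        for (int i = 0; i < nThreads; i++) {
            final int id = i;
            new Thread(() -> {
                while (!safepointRequested) {
                    doSomeWork(id); // a slow unit of work delays the next flag check
                }
                reached.countDown(); // "this thread has reached the safepoint"
            }).start();
        }
        Thread.sleep(100);
        long t0 = System.nanoTime();
        safepointRequested = true; // 1.2: set the global flag
        reached.await();           // 1.3: wait for every thread to reach it
        System.out.printf("TTSP analogy: %.1f ms%n", (System.nanoTime() - t0) / 1e6);
        // this is where the "safepoint operation" would run, then STW would end
    }

    static void doSomeWork(int id) {
        // worker 0 plays the role of a thread stuck in a slow page fault
        try { Thread.sleep(id == 0 ? 200 : 1); } catch (InterruptedException e) { }
    }
}
```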
This article is mostly concerned with the "waiting for the other application threads (App threads) to reach (enter) the safepoint" part.
Global safepoint
当 “
VM Thread
” 发现所有 App thread 都达到 safepoint (实在的 STW 的开始) 。就开始执行safepoint operation
。GC 操作
是safepoint operation
其中一种可能类型。源码链接 RuntimeService::record_safepoint_synchronized()
End of safepoint operation

Once the `safepoint operation` has finished executing, the "VM Thread" ends the STW.

Source link: SafepointSynchronize::end()
Symptoms
Assume the following java command-line flags have been added (these are the JDK 8-era flags; the JDK 17 runs later in this article use unified `-Xlog:safepoint` logging instead):
```
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDetails
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
-XX:+SafepointTimeout
-XX:SafepointTimeoutDelay=<ms before timeout log is printed>
```
GC log:
```
# Most of the time is spent between "Global safepoint request" and "Global safepoint",
# i.e. "waiting for the application threads (App threads) to reach (enter) the safepoint"
2023-09-07T13:19:50.029+0000: 58103.440: Total time for which application threads were stopped: 3.4971808 seconds, Stopping threads took: 3.3050644 seconds
```
Or the safepoint log (in the statistics line below, the `sync` column, 3305 ms here, is the time it took to bring all threads to the safepoint, i.e. the time-to-safepoint, TTSP):
```
[safepoint] Application time: 0.1950250 seconds
[safepoint] Entering safepoint region: RevokeBias
[safepoint] Leaving safepoint region
[safepoint] Total time for which application threads were stopped: 0.0003424 seconds, Stopping threads took: 0.0000491 seconds

///////

# SafepointSynchronize::begin: Timeout detected:
# SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
# SafepointSynchronize::begin: Threads which did not reach the safepoint:
# "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0
#   tid=0x00007f8785bb1f30 nid=0x4e14 runnable [0x0000000000000000]
#   java.lang.Thread.State: RUNNABLE
# SafepointSynchronize::begin: (End of list)
         vmop                 [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
58099.941: G1IncCollectionPause   [ 447 11 ]   [ 3304 0 3305 1 190 ]   1
```
This problem is already somewhat famous in the Cassandra world:
- JVM safepoints, mmap, and slow disks (or here)
- Use of MappedByteBuffer could cause long STW due to TTSP (time to safepoint)
- Long unexplained time-to-safepoint (TTSP) in the JVM
So the question becomes: how do you prove whether this problem exists in your environment's Java application?
Diagnosis
Environment setup
For how to simulate IO stalls, see the appendix section [How to simulate IO stalls on Linux].
Generate the test files:
```bash
sudo dd if=/dev/urandom of=/mnt/dm1/mapped.file bs=1G count=1
cp /mnt/dm1/mapped.file /tmp/mapped.file
```
The target Java program under test
```java
// TestMMap.java
import java.util.*;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.ThreadLocalRandom;

public class TestMMap {
    public static void main(String[] arg) throws Exception {
        TestGCLog l = new TestGCLog();
        l.start();
        TestMMap t = new TestMMap();
        t.test(arg[0]);
    }

    void test(String mapfile) throws Exception {
        long startTime = System.currentTimeMillis();
        RandomAccessFile file = new RandomAccessFile(mapfile, "r");
        FileChannel channel = file.getChannel();
        // Read file into mapped buffer
        MappedByteBuffer mbb = channel.map(FileChannel.MapMode.READ_ONLY,
                0, // position
                channel.size());
        long lastprint = System.currentTimeMillis();
        while (true) {
            // for (int i = 0; i < channel.size(); i++) {
            long now = System.currentTimeMillis();
            // System.out.print((char) mbb.get());
            long p = ThreadLocalRandom.current().nextLong(0, channel.size() - 1);
            mbb.position((int) p);
            mbb.get();
            if (System.currentTimeMillis() - now > 1 * 1000) {
                System.out.println(new Date() + " - Read MMap: read slow: "
                        + mbb.position() + "@" + mbb.limit() + " ,"
                        + (System.currentTimeMillis() - now) + "ms");
            } else if (now - lastprint > 5 * 1000) {
                System.out.println(new Date() + " - Read MMap:" + mbb.position() + "/" + mbb.limit());
                lastprint = now;
            }
            Thread.sleep(10);
            // }
            // mbb.position(0);
            if (false) break;
        }
        channel.close();
        file.close();
        System.out.println("Total read and print time: " + (System.currentTimeMillis() - startTime));
    }

    static class TestGCLog extends Thread {
        List objList = new ArrayList();
        byte[] m = new byte[100 * 1024];

        public TestGCLog() {
            super("TestGCLog");
        }

        @Override
        public void run() {
            try {
                test();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        void test() throws Exception {
            long lastprint = System.currentTimeMillis();
            while (true) {
                objList.clear();
                long start = System.currentTimeMillis();
                TestGCLog newObj = new TestGCLog();
                long end = System.currentTimeMillis();
                if (end - start > 10) {
                    System.out.println(new Date() + " - TestGCLog: slow new object: " + (end - start) + "ms");
                }
                objList.add(newObj);
                Thread.sleep(5);
                long now = System.currentTimeMillis();
                if (now - lastprint > 5 * 1000) {
                    System.out.println(new Date() + " - TestGCLog");
                    lastprint = now;
                }
            }
        }
    }
}
```
The program's main logic: the `TestGCLog` thread keeps allocating objects to trigger GC, while the main thread randomly accesses the mmap-mapped file to trigger page faults and IO.
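As an aside, GC is not the only way to drive safepoints in a test like this. A possible variation (my own, not part of the program above, and assuming HotSpot still implements full thread dumps as a global safepoint VM operation): poll Thread.getAllStackTraces() instead.

```java
import java.util.Map;

public class SafepointDumper {
    public static void main(String[] args) throws Exception {
        // Each getAllStackTraces() call issues a thread-dump VM operation,
        // which (in HotSpot, to my knowledge) first brings all threads to a
        // global safepoint, exercising safepoints without relying on GC.
        while (true) {
            Map<Thread, StackTraceElement[]> ignored = Thread.getAllStackTraces();
            Thread.sleep(5);
        }
    }
}
```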
The bpftrace program
```
#!/usr/local/bin/bpftrace
// trace-safepoint-mmap2.bt
// The script assumes only one java process is running in the environment

BEGIN
{
}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*begin*
{
    // printf("java elf main. PID=%d", pid);
    @java_pid[pid] = pid;
}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*begin*
/@java_pid[pid]/
{
    @sp_begin[pid, tid, comm, ustack] = count();
    @vm_thread_tid[tid] = tid;
    @vm_thread_waiting_sp_sync[tid] = tid;
    @vm_thread_in_STW[tid] = tid;
    @STWing[pid] = pid;
    printf("%d %-6d %-16s SafepointSynchronize::begin \n", elapsed / 1e6, tid, comm);
}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*RuntimeService*record_safepoint_synchronized*
/@java_pid[pid]/
{
    @sp_synchronized[tid, comm, ustack] = count();
    delete (@vm_thread_waiting_sp_sync[tid]);
    delete (@STWing[pid]);
    printf("%d %-6d %-16s RuntimeService::record_safepoint_synchronized\n", elapsed / 1e6, tid, comm);
}

uretprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*end*
/@java_pid[pid]/
{
    delete (@vm_thread_in_STW[tid]);
    printf("%d %-6d %-16s SafepointSynchronize::end\n", elapsed / 1e6, tid, comm);
    printf("===========\n\n");
}

// tracepoint:exceptions:page_fault_user
// /@java_pid[pid]/
// {
//     @page_fault_user[tid, comm, kstack(8)] = count();
//     printf("%d page_fault_user(\"%s %d\", address=0x%lX, ip=0x%lX)\n", elapsed / 1e6, comm, tid, args->address, args->ip);
// }

kprobe:filemap_fault
/@java_pid[pid]/
{
    @filemap_fault[tid, comm, kstack(8)] = count();
    printf("%d %-6d %-16s filemap_fault()\n", elapsed / 1e6, tid, comm);
    @filemap_fault_kstack[tid, comm, kstack(8)] = count();
    @filemap_fault_ustack[tid, comm, ustack(8)] = count();
}

kretfunc:filemap_fault
/@java_pid[pid]/
{
    printf("%d %-6d %-16s kretfunc:filemap_fault()\n", elapsed / 1e6, tid, comm);
}

// tracepoint:block:block_rq_complete,
// tracepoint:block:block_bio_complete
// /@java_pid[pid]/
// {
//     @block_bio_complete[pid, tid, comm, kstack(8)] = count();
// }

kretfunc:ext4_mpage_readpages
/@java_pid[pid]/
{
    $page = args->page;
    $node = args->inode;
    $rac = args->rac;
    $nr_pages = $rac->_nr_pages;
    printf("%d %-6d %-16s kretfunc:ext4_mpage_readpages %ld %s\n", elapsed / 1e6, tid, comm,
        $rac->_index * 4096, str($rac->file->f_path.dentry->d_name.name));
    @ext4_mpage_readpages[tid, comm, kstack(8)] = count();
    // @ext4_mpage_readpages_ustack[tid, comm, ustack(8)] = count();
}

// kprobe:ext4_mpage_readpages
// /@java_pid[pid]/
// {
//     printf("%d kprobe:ext4_mpage_readpages %d %s\n", elapsed / 1e6, tid, comm);
// }

// tracepoint:syscalls:sys_enter_sync,
// tracepoint:syscalls:sys_enter_syncfs,
// tracepoint:syscalls:sys_enter_fsync,
// tracepoint:syscalls:sys_enter_fdatasync,
// tracepoint:syscalls:sys_enter_sync_file_range*,
// tracepoint:syscalls:sys_enter_msync
// /@java_pid[pid]/
// {
//     time("%H:%M:%S ");
//     printf("sync syscall: %-6d %-16s %s\n", pid, comm, probe);
//     // if( @vm_thread_waiting_sp_sync[tid] ) {
//     @sync_call[pid, tid, comm, probe, ustack(6)] = count();
//     printf("sp sync_call: %-6d %-16s %s\n", pid, comm, probe);
//     // }
// }

END
{
}
```
Run them:
```bash
env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu /usr/lib/jvm/java-17-openjdk-amd64/bin/java '-Xlog:safepoint,gc=info:stdout:level,tags,time,uptime,pid' -XX:+UnlockDiagnosticVMOptions -XX:+PreserveFramePointer -Xms128M -Xmx128M -XX:+AlwaysPreTouch -cp . TestMMap /mnt/dm1/mapped.file
```
```bash
sudo bpftrace ./trace-safepoint-mmap2.bt
```
Trace output:
```
37507 568704 java             filemap_fault()
37507 568704 java             kretfunc:ext4_mpage_readpages 1026187264 mapped.file
38521 568704 java             kretfunc:filemap_fault()
38531 568704 java             filemap_fault()
38531 568704 java             kretfunc:ext4_mpage_readpages 116322304 mapped.file
39549 568704 java             kretfunc:filemap_fault()
39569 568704 java             filemap_fault()
39569 568704 java             kretfunc:ext4_mpage_readpages 960757760 mapped.file
40601 568704 java             kretfunc:filemap_fault()
40632 568704 java             filemap_fault()                <--- Page Fault START
40632 568704 java             kretfunc:ext4_mpage_readpages 484405248 mapped.file
41386 568719 VM Thread        SafepointSynchronize::begin    <-- waiting for safepoint reach START
41657 568704 java             kretfunc:filemap_fault()       <-- Page Fault END
41658 568719 VM Thread        RuntimeService::record_safepoint_synchronized  <-- waiting for safepoint reach END
41658 568719 VM Thread        SafepointSynchronize::end
===========

41669 568704 java             filemap_fault()
41669 568704 java             kretfunc:ext4_mpage_readpages 43298816 mapped.file
42681 568704 java             kretfunc:filemap_fault()
42691 568704 java             filemap_fault()
42691 568704 java             kretfunc:ext4_mpage_readpages 828358656 mapped.file
43709 568704 java             kretfunc:filemap_fault()
43719 568704 java             filemap_fault()
43719 568704 java             kretfunc:ext4_mpage_readpages 191881216 mapped.file
44732 568704 java             kretfunc:filemap_fault()
44743 568704 java             filemap_fault()                <--- Page Fault START
44743 568704 java             kretfunc:ext4_mpage_readpages 905154560 mapped.file
45574 568719 VM Thread        SafepointSynchronize::begin    <-- waiting for safepoint reach START
45753 568704 java             kretfunc:filemap_fault()       <-- Page Fault END
45754 568719 VM Thread        RuntimeService::record_safepoint_synchronized  <-- waiting for safepoint reach END
45755 568719 VM Thread        SafepointSynchronize::end
```
So the latent hazard is clearly real; whether it erupts merely depends on whether the environmental conditions line up.

Now switch back to a disk without IO stalls and try again:
```bash
env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu /usr/lib/jvm/java-17-openjdk-amd64/bin/java '-Xlog:safepoint,gc=info:stdout:level,tags,time,uptime,pid' -XX:+UnlockDiagnosticVMOptions -XX:+PreserveFramePointer -Xms128M -Xmx128M -XX:+AlwaysPreTouch -cp . TestMMap /tmp/mapped.file
```
The problem does not show up.
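This also points at the common mitigation discussed in the articles linked earlier: replace mmap reads with explicit positional reads. As I understand HotSpot's behavior, a thread blocked inside a native read()/pread() syscall is already counted as safepoint-safe, so slow IO there does not stretch the time-to-safepoint. A minimal sketch, reusing one file offset from the trace above as an arbitrary example:

```java
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class PreadInstead {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile f = new RandomAccessFile(args[0], "r");
             FileChannel ch = f.getChannel()) {
            ByteBuffer buf = ByteBuffer.allocate(4096);
            // Explicit positional read: the slow disk wait happens inside a
            // native syscall, where the thread does not hold up safepoint
            // synchronization, unlike a page fault taken in Java code.
            ch.read(buf, 484405248L % ch.size());
        }
    }
}
```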
Problems that cloud native may aggravate
Direct page reclaim when worker nodes share memory or containers hit memory limits
In a cloud environment, a single machine may run a great many containers. Some of them are IO-heavy applications (log writers, for example), which consume page cache and add contention on the shared disks. Memory pressure, whether machine-wide or within a container's own limit, can trigger Direct Page Reclaim.
- IO contention on local disks shared across the worker node
- Network-backed PV disks such as Ceph are far less stable than local disks; their bandwidth and latency fluctuate.
For example, with a configuration like the one below, multiple containers may end up sharing the local disk backing emptyDir. Under IO contention, GC-log writes may then hit high latency while inside an STW, directly pushing the service's API latency past its timeout.
```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
  labels:
  name: xyz
spec:
  containers:
  - command:
      java ... -Xloggc:/usr/share/myapp/logs/gc.log ...
    ...
    volumeMounts:
    - mountPath: /usr/share/myapp/logs
      name: myapp-logs
  volumes:
  - emptyDir: {}
    name: myapp-logs
```
Closing words
Coming up, I will write two more articles: "eBPF verifies the street rumor: can Java GC logging stall the whole JVM service?" and "How to measure process-level IO latency".
Appendix
How to simulate IO stalls on Linux
Online searches turn up roughly two methods for simulating IO stalls:
The device-mapper delay method
- Emulate a slow block device with dm-delay - Christophe Vu-Brugier
- Emulate a slow block device with dm-delay - flamingbytes
- dm-delay
The SystemTap delay-injection method
- Cleverly using SystemTap to inject delays and simulate IO device jitter
- Using SystemTap to inject latency into I/O devices
This article uses the device-mapper delay method: the mapped device itself carries a configurable delay.
```bash
# Load the brd kernel module. rd_nr is the maximum number of ramdisks; rd_size is the ramdisk size in KB.
sudo modprobe brd rd_nr=1 rd_size=1048576

ls -l /dev/ram0
brw-rw---- 1 root disk 1, 0 Aug 24 20:00 /dev/ram0

sudo blockdev --getsize /dev/ram0   # Display the size in 512-byte sectors
2097152

# Create a delayed device on top of the ramdisk, with a 200ms read/write delay.
size=$(sudo blockdev --getsize /dev/ram0)
sudo dmsetup remove delayed
echo "0 $size delay /dev/ram0 0 200" | sudo dmsetup create delayed

sudo dmsetup table delayed
0 2097152 delay 1:0 0 50

ls -la /dev/mapper/ | grep delayed
lrwxrwxrwx 1 root root 7 May 19 18:37 delayed -> ../dm-0

sudo mkfs.ext4 /dev/dm-0
sudo mkdir /mnt/dm1
sudo mount /dev/dm-0 /mnt/dm1
cd /mnt/dm1
```
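To sanity-check the injected delay from the Java side, here is a minimal hedged sketch (it assumes the mapped.file generated by the earlier dd step, and an arbitrary offset of my choosing; a cold read must come from the delayed device, so it should take roughly the configured delay):

```java
import java.io.RandomAccessFile;

public class DelayCheck {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile f = new RandomAccessFile("/mnt/dm1/mapped.file", "r")) {
            byte[] buf = new byte[4096];
            long t0 = System.nanoTime();
            f.seek(123_456_789L % f.length()); // arbitrary offset, likely not cached yet
            f.readFully(buf);
            System.out.printf("cold read took %.1f ms%n", (System.nanoTime() - t0) / 1e6);
        }
    }
}
```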
If needed for debugging, you can also suspend the device outright, leaving every IO operation against the mapper device hanging (uninterruptible sleep):
```bash
# suspend
sudo dmsetup suspend /dev/$your_dm_dev
# resume
sudo dmsetup resume /dev/$your_dm_dev
```
Ubuntu: broken BPF stack traces because glibc is not built with -fno-omit-frame-pointer
```bash
sudo apt install libc6-prof
env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu /usr/lib/jvm/java-17-openjdk-amd64/bin/java ...
```
TL;DR :
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1908307