关于jvm:eBPF-求证坊间传闻mmap-Java-Safepoint-可导致整个-JVM-服务卡顿

46次阅读

共计 11902 个字符,预计需要花费 30 分钟才能阅读完成。

概述

Java 反对好几种文件读取办法,本文要说的是小众的 mmap(MappedByteBuffer) 以及它与 Safepoint、JVM 服务卡顿之间的关系。本文尝试用 eBPF 等分析方法,去证实具体环境下,问题的存在与否。

审误和公布本文时,我才是二阳后活过来数小时而已,写了数周的文章切实不想再迁延公布了。如文章有错漏,还请多多包涵和斧正。

Java 服务卡顿,是 Java 世界永恒的话题之王。想到 Java 卡顿,大部分人的第一反馈是以下关键词:

  • GC
  • Safepoint / Stop the world(STW)

而说拜访 mmap(MappedByteBuffer) 导致 Java 卡顿可能是个小众话题。但如果你理解了一些基于 Java 的重 IO 的开源我的项目后,就会发现,这其实不是个小众话题。如已经大热的 NoSQL 数据库 Cassandra 默认就是重度应用 mmap 去读取数据文件和写 commit log 的。

基本原理

JVM 写 GC 日志机会剖析

Safepoint

[From https://krzysztofslusarski.github.io/2022/12/12/async-manual….]

上图大略阐明了 safepoint 的次要流程。有趣味同学能够看看下面链接,或搜寻一下,网上很多好文章阐明过了。我就不搬门弄斧了。

一点须要留神的是,safepoint STOP THE WORLD(STW) 的使用者不只有 GC。还有其它。

这里只简略阐明一下(我不是 JVM 专家,所以请审慎应用以下观点):

  1. Global safepoint request

    1.1 有一个线程提出了进入 safepoint 的申请,其中带上 safepoint operation 参数,参数其实是 STOP THE WORLD(STW) 后要执行的 Callback 操作。可能是分配内存有余,触发 GC。也可能是其它起因。

    1.2 一个叫“VM Thread”的线程在收到 safepoint request 后,批改一个 JVM 全局的 safepoint flag 为 true(这里只为简略阐明,如果你是里手,那么这个 flag 其实是操作系统的内存缺页标识)。

    1.3 而后这个“VM Thread”就开始期待其它利用线程(App thread)达到(进入)safepoint。

    1.4 其它利用线程(App thread)其实会高频查看这个 safepoint flag,当发现为 true 时,就达到(进入)safepoint 状态。

    源码链接 SafepointSynchronize::begin()

    本文次要关注这里的“期待其它利用线程(App thread)达到(进入)safepoint”。

  1. Global safepoint

    当“VM Thread”发现所有 App thread 都达到 safepoint(实在的 STW 的开始)。就开始执行 safepoint operationGC 操作 safepoint operation 其中一种可能类型。

    源码链接 RuntimeService::record_safepoint_synchronized()

  1. End of safepoint operation

    safepoint operation 执行结束,“VM Thread”完结 STW。

    源码链接 SafepointSynchronize::end()

症状

假如曾经退出 java 命令参数:

-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDetails
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
-XX:+SafepointTimeout
-XX:SafepointTimeoutDelay=<ms before timeout log is printed>

GC log:

# 大部分工夫,花在 Global safepoint request 与 Global safepoint 这间的“期待其它利用线程(App thread)达到(进入)safepoint”2023-09-07T13:19:50.029+0000: 58103.440: Total time for which application threads were stopped: 3.4971808 seconds, Stopping threads took: 3.3050644 seconds

或 safepoint log:

[safepoint] Application time: 0.1950250 seconds
[safepoint] Entering safepoint region: RevokeBias
[safepoint] Leaving safepoint region
[safepoint] Total time for which application threads were stopped: 0.0003424 seconds, Stopping threads took: 0.0000491 seconds
///////
# SafepointSynchronize::begin: Timeout detected:
# SafepointSynchronize::begin: Timed out while spinning to reach a
# safepoint.
# SafepointSynchronize::begin: Threads which did not reach the
# safepoint:
# "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0
# tid=0x00007f8785bb1f30 nid=0x4e14 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
>> >
# SafepointSynchronize::begin: (End of list)
         vmop                    [threads: total initially_running
         wait_to_block]    [time: spin block sync cleanup vmop]
         page_trap_count
58099.941: G1IncCollectionPause             [     447          1
1    ]      [3304     0  3305     1   190]  1
>>

这个问题曾经在 Cassandra 界有了个朵:

  • JVM safepoints, mmap, and slow disks 或这里
  • Use of MappedByteBuffer could cause long STW due to TTSP (time to safepoint)
  • Long unexplained time-to-safepoint (TTSP) in the JVM

那么问题来了,怎么证实你的环境的 java 利用存在或不存在这个问题?

诊断

环境筹备

IO 卡顿的模仿办法见 [Linux 如何模仿 IO 卡顿] 一节。

生成测试文件:

sudo dd if=/dev/urandom of=/mnt/dm1/mapped.file bs=1G count=1
cp /mnt/dm1/mapped.file /tmp/mapped.file

测试的指标 java 程序

// TestMMap.java

import java.util.*;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.ThreadLocalRandom;

public class TestMMap {public static void main(String[] arg) throws Exception {TestGCLog l = new TestGCLog();
        l.start();

        TestMMap t = new TestMMap();
        t.test(arg[0]);
   }

   void test(String mapfile) throws Exception {long startTime = System.currentTimeMillis();
        RandomAccessFile file = new RandomAccessFile(mapfile, "r");
        FileChannel channel = file.getChannel();

        // Read file into mapped buffer
        MappedByteBuffer mbb =
                channel.map(FileChannel.MapMode.READ_ONLY,
                        0,          // position
                        channel.size());

        long lastprint = System.currentTimeMillis();

        while(true) {// for (int i = 0; i < channel.size(); i++) {long now = System.currentTimeMillis();
                // System.out.print((char) mbb.get());
                long p = ThreadLocalRandom.current().nextLong(0, channel.size() - 1);
                mbb.position((int)p);
                mbb.get();
                if(System.currentTimeMillis() - now > 1 * 1000 ) {System.out.println( new Date() + "- Read MMap: read slow:" + mbb.position() + "@" + mbb.limit() + "," + (System.currentTimeMillis() - now) + "ms" );
                }
                else if(now - lastprint > 5 * 1000) {System.out.println( new Date() + "- Read MMap:" + mbb.position() + "/" + mbb.limit() );
                    lastprint = now;
                }
                Thread.sleep(10);
            // }
            // mbb.position(0);
            if(false)
                break;
        }

        channel.close();
        file.close();
        System.out.println("Total read and print time:" + (System.currentTimeMillis() - startTime));
   }

    static class TestGCLog extends Thread {List objList = new ArrayList();
        byte[] m = new byte[100*1024];

        public TestGCLog() {super("TestGCLog");
        }
        
        @Override
        public void run() {
            try {test();
            } catch (Exception e) {e.printStackTrace();
            }
        }

        void test() throws Exception {long lastprint = System.currentTimeMillis();
            while(true) {objList.clear();
                long start = System.currentTimeMillis();
                TestGCLog newObj = new TestGCLog();
                long end = System.currentTimeMillis();
                if(end - start > 10) {System.out.println( new Date() + "- TestGCLog: slow new object:" + (end - start) + "ms"  );
                }                
                objList.add(newObj);
                Thread.sleep(5);
                long now = System.currentTimeMillis();
                if(now - lastprint > 5 * 1000) {System.out.println( new Date() + "- TestGCLog" );
                    lastprint = now;
                }
            }
        }
    }
}

程序的次要逻辑是在 TestGCLog 线程中触发 GC。在 main 线程中随机拜访 mmap 映射到内存中的文件,以触发 page fault 和 IO。

bpftrace 程序

#!/usr/local/bin/bpftrace

// trace-safepoint-mmap2.bt
// 程序假如环境中只有一个 java 过程

BEGIN
{}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*begin*
{// printf("java elf main. PID=%d", pid);
    @java_pid[pid] = pid;
}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*begin*
/@java_pid[pid]/
{@sp_begin[pid, tid, comm, ustack] = count();
    @vm_thread_tid[tid] = tid;
    @vm_thread_waiting_sp_sync[tid] = tid;
    @vm_thread_in_STW[tid] = tid;
    @STWing[pid] = pid;
    printf("%d %-6d %-16s SafepointSynchronize::begin \n", elapsed / 1e6, tid, comm);
}

uprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*RuntimeService*record_safepoint_synchronized*
/@java_pid[pid]/
{@sp_synchronized[tid, comm, ustack] = count();
    delete (@vm_thread_waiting_sp_sync[tid]);
    delete (@STWing[pid]);
    printf("%d %-6d %-16s RuntimeService::record_safepoint_synchronized\n", elapsed / 1e6, tid, comm);
}

uretprobe:/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so:*SafepointSynchronize*end*
/@java_pid[pid]/
{delete (@vm_thread_in_STW[tid]);
    printf("%d %-6d %-16s SafepointSynchronize::end\n", elapsed / 1e6, tid, comm);
    printf("===========\n\n");
}


// tracepoint:exceptions:page_fault_user
// /@java_pid[pid]/
// {//     @page_fault_user[tid, comm, kstack(8)] = count();
//     printf("%d page_fault_user(\"%s %d\", address=0x%lX, ip=0x%lX)\n", elapsed / 1e6, comm, tid, args->address, args->ip);
// }

kprobe:filemap_fault
/@java_pid[pid]/
{@filemap_fault[tid, comm, kstack(8)] = count();
    printf("%d %-6d %-16s filemap_fault()\n", elapsed / 1e6, tid, comm);

    @filemap_fault_kstack[tid, comm, kstack(8)] = count();
    @filemap_fault_ustack[tid, comm, ustack(8)] = count();}

kretfunc:filemap_fault
/@java_pid[pid]/
{printf("%d %-6d %-16s kretfunc:filemap_fault()\n", elapsed / 1e6, tid, comm);
}

// tracepoint:block:block_rq_complete,
// tracepoint:block:block_bio_complete
// /@java_pid[pid]/
// {//     @block_bio_complete[pid, tid, comm, kstack(8)] = count();
// }

kretfunc:ext4_mpage_readpages
/@java_pid[pid]/
{
    $page = args->page;
    $node = args->inode;
    $rac = args->rac;
    $nr_pages = $rac->_nr_pages;
    printf ("%d %-6d %-16s kretfunc:ext4_mpage_readpages %ld %s\n", elapsed / 1e6, tid, comm, $rac->_index * 4096, str($rac->file->f_path.dentry->d_name.name));
    @ext4_mpage_readpages[tid, comm, kstack(8)] = count();
    // @ext4_mpage_readpages_ustack[tid, comm, ustack(8)] = count();}

// kprobe:ext4_mpage_readpages
// /@java_pid[pid]/
// {//     printf ("%d kprobe:ext4_mpage_readpages %d %s\n", elapsed / 1e6, tid, comm);
// }


// tracepoint:syscalls:sys_enter_sync,
// tracepoint:syscalls:sys_enter_syncfs,
// tracepoint:syscalls:sys_enter_fsync,
// tracepoint:syscalls:sys_enter_fdatasync,
// tracepoint:syscalls:sys_enter_sync_file_range*,
// tracepoint:syscalls:sys_enter_msync
// /@java_pid[pid]/
// {//     time("%H:%M:%S");
//     printf("sync syscall: %-6d %-16s %s\n", pid, comm, probe);

//     // if(@vm_thread_waiting_sp_sync[tid] ) {//         @sync_call[pid, tid, comm, probe, ustack(6)] = count();
//         printf("sp sync_call: %-6d %-16s %s\n", pid, comm, probe);
//     // }

// }

END
{}

执行

env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu /usr/lib/jvm/java-17-openjdk-amd64/bin/java '-Xlog:safepoint,gc=info:stdout:level,tags,time,uptime,pid' -XX:+UnlockDiagnosticVMOptions -XX:+PreserveFramePointer -Xms128M -Xmx128M -XX:+AlwaysPreTouch -cp . TestMMap /mnt/dm1/mapped.file
sudo bpftrace ./trace-safepoint-mmap2.bt

跟踪输入:

37507 568704 java             filemap_fault()
37507 568704 java             kretfunc:ext4_mpage_readpages 1026187264 mapped.file
38521 568704 java             kretfunc:filemap_fault()
38531 568704 java             filemap_fault()
38531 568704 java             kretfunc:ext4_mpage_readpages 116322304 mapped.file
39549 568704 java             kretfunc:filemap_fault()
39569 568704 java             filemap_fault()
39569 568704 java             kretfunc:ext4_mpage_readpages 960757760 mapped.file
40601 568704 java             kretfunc:filemap_fault()
40632 568704 java             filemap_fault() <--- Page Fault START
40632 568704 java             kretfunc:ext4_mpage_readpages 484405248 mapped.file
41386 568719 VM Thread        SafepointSynchronize::begin <-- waiting for safepoint reach START
41657 568704 java             kretfunc:filemap_fault() <-- Page Fault  END
41658 568719 VM Thread        RuntimeService::record_safepoint_synchronized <-- waiting for safepoint reach END
41658 568719 VM Thread        SafepointSynchronize::end
===========

41669 568704 java             filemap_fault()
41669 568704 java             kretfunc:ext4_mpage_readpages 43298816 mapped.file
42681 568704 java             kretfunc:filemap_fault()
42691 568704 java             filemap_fault()
42691 568704 java             kretfunc:ext4_mpage_readpages 828358656 mapped.file
43709 568704 java             kretfunc:filemap_fault()
43719 568704 java             filemap_fault()
43719 568704 java             kretfunc:ext4_mpage_readpages 191881216 mapped.file
44732 568704 java             kretfunc:filemap_fault()
44743 568704 java             filemap_fault() <--- Page Fault START
44743 568704 java             kretfunc:ext4_mpage_readpages 905154560 mapped.file
45574 568719 VM Thread        SafepointSynchronize::begin <-- waiting for safepoint reach START 
45753 568704 java             kretfunc:filemap_fault()) <-- Page Fault  END
45754 568719 VM Thread        RuntimeService::record_safepoint_synchronized <-- waiting for safepoint reach END
45755 568719 VM Thread        SafepointSynchronize::end

可见隐患确实存在。只是视环境条件是否满足,爆不暴发而已。

如果用回没有 IO 卡顿的盘试试:

env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu /usr/lib/jvm/java-17-openjdk-amd64/bin/java '-Xlog:safepoint,gc=info:stdout:level,tags,time,uptime,pid' -XX:+UnlockDiagnosticVMOptions -XX:+PreserveFramePointer -Xms128M -Xmx128M -XX:+AlwaysPreTouch -cp . TestMMap /tmp/mapped.file

问题不存在。

云原生下可能加剧的问题

  • worker node 共享内存 或 container limit 内存后,Direct Page Reclaim

    在云环境下,一台机器可能运行很多 container。有的 container 是高 IO 的利用,如写日志,这些都会占用 page cache 和对共享的 disk 产生争用压力。整机的内存不足,或者是 container 自身的内存不足,都可能触发 Direct Page Reclaim

  • worker node 共享本地磁盘 IO 争用
  • Ceph 等 PV 网络盘稳定性远不如本地盘,带宽与 latency 绝对不稳固。

如以下一个配置,多个 container 可能共享了 emptyDir 用的本地盘。而后在 IO 争用时,写 GC 日志就可能呈现 STW 下的高 latency,间接导致服务 API latency 超时。

apiVersion: v1
kind: Pod
metadata:
  annotations:
  labels:
  name: xyz
spec:
  containers:
  - command: java ... -Xloggc:/usr/share/myapp/logs/gc.log ...
    ...
    volumeMounts:
    - mountPath: /usr/share/myapp/logs
      name: myapp-logs
  volumes:
  - emptyDir: {}
    name: myapp-logs

结语

前面,我会写另外两编文章:《eBPF 求证坊间风闻:Java GC 日志可导致整个 JVM 服务卡顿?》、《如何测量 过程级别的 IO 提早》

附录

Linux 如何模仿 IO 卡顿

网上能搜寻到的模仿 IO 卡顿大略有两个办法:

  • device mapper delay 法

    • Emulate a slow block device with dm-delay – Christophe Vu-Brugier
    • Emulate a slow block device with dm-delay – flamingbytes
    • dm-delay
  • systemtap 注入提早法

    • 巧用 Systemtap 注入提早模仿 IO 设施抖动
    • 应用 SystemTap 给 I / O 设施注入提早

本文用的是 device mapper delay 法。device 自身带了一个 delay。

# load the brd kernel module. rd_nr is the maximum number of ramdisks. rd_size is the ramdisk size in KB.
sudo modprobe brd rd_nr=1 rd_size=1048576

ls -l /dev/ram0
brw-rw---- 1 root disk 1, 0 Aug 24 20:00 /dev/ram0

sudo blockdev --getsize /dev/ram0 # Display the size in 512-byte sectors
2097152

#We specify the read write delay as 200ms, specify the write latency(e.g. 100ms) as below.
size=$(sudo blockdev --getsize /dev/ram0)
sudo dmsetup remove delayed
echo "0 $size delay /dev/ram0 0 200" | sudo dmsetup create delayed


sudo dmsetup table delayed
0 2097152 delay 1:0 0 50

ls -la /dev/mapper/ | grep delayed
lrwxrwxrwx  1 root root       7 May 19 18:37 delayed -> ../dm-0

sudo mkfs.ext4 /dev/dm-0

sudo mkdir /mnt/dm1
sudo mount /dev/dm-0 /mnt/dm1

cd /mnt/dm1

另外如果 debug 须要,能够挂起 device,间接让 mapper device 的所有 IO 操作过程挂起(uninterruptable_sleep):

# 挂起
sudo dmsetup suspend /dev/$your_dm_dev

# 复原
sudo dmsetup resume /dev/$your_dm_dev

Ubuntu 下的 BPF 跟踪堆栈因为 -fno-omit-frame-pointer 的 glibc 缺失的问题

sudo apt install libc6-prof

env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu /usr/lib/jvm/java-17-openjdk-amd64/bin/java ...

TL;DR :

https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1908307

正文完
 0