关于java:一次鞭辟入里的-Log4j2-异步日志输出阻塞问题的定位

4次阅读

共计 14826 个字符,预计需要花费 38 分钟才能阅读完成。

一次鞭辟入里的 Log4j2 日志输入阻塞问题的定位

问题景象

线上某个利用的某个实例忽然呈现某些次申请 服务响应极慢 的状况,有几次申请 超过 60s 才返回,并且通过日志发现,服务线程并没有做什么很重的操作。这种状况断断续续继续了半小时左右。

外围问题定位

因为半小时后,服务实例恢复正常,并且呈现申请极慢的次数并不太多,没有能及时打印呈现问题的时候线程堆栈采集相干信息。然而,咱们有本人的法宝,JFR(对于 JFR,请参考我的另一系列 JFR 全解)。

JFR 非常适合用来预先复盘定位问题,并且配置切当的话,性能损耗极小,并且不像 APM 零碎一样须要额定的采集以及须要对立的服务过程进行整合展示。咱们 随用随取 即可,然而 JFR 面向的是单过程的问题定位,找到问题过程,定位跨过程业务问题链路追踪还是须要 APM 零碎的。

咱们通过如下命令采集对应时间段的 JFR:

jcmd 过程 pid JFR.dump begin=2021-07-04T17:30:00.000Z end=2021-07-04T18:30:00.000Z

咱们抉择的时间跨度比出问题的时间跨度大一些,这样保障底层问题呈现的时候也被采集写入这个 dump 文件。

对于 历史某些申请响应慢,我个别依照如下流程去看:

  1. 是否有 STW(Stop-the-world,参考我的另一篇文章:JVM 相干 – SafePoint 与 Stop The World 全解):

    1. 是否有 GC 导致的长时间 STW
    2. 是否有其余起因导致过程所有线程进入 safepoint 导致 STW
  2. 是否 IO 花了太长时间,例如调用其余微服务,拜访各种存储(硬盘,数据库,缓存等等)
  3. 是否在某些锁下面阻塞太长时间?
  4. 是否 CPU 占用过高,哪些线程导致的?

接下来咱们来详细分析,首先是 GC,从下图看,并没有能重大影响线程 STW 很长时间的:

而后查看 safepoint 相干,发现也没有什么影响:

接着,咱们查看 IO 相干的事件,也并没有发现业务相干的阻塞读:

而后,咱们查看是否在某些锁下面阻塞太长时间,终于发现异常:

通过堆栈,咱们发现是 log4j2 打印日志卡住了

问题剖析

1. Java Monitor Blocked 的 Monitor Address 对于同一个对象也是会扭转的

首先,对于 Java Monitor Blocked 这个事件,咱们能够通过 Monitor Address 来判断是否是同一个锁。例如这里咱们发现这些线程都是阻塞在 FFFF 4C09 3188 这个地址的对象,这个地址是基于对象在 Java 堆中的绝对地址得出的。

然而 对于同一个对象 ,这个地址 并不会是不变 的,在每次 GC 扫描整顿到这个对象对象的时候,如果有回收内存,个别这个地址就会扭转。因为咱们应用的是 G1,所以并不是每次 GC 都会扫描到这个对象,然而如果地址产生扭转,肯定是因为产生了 GC

2. log4j2 异步日志原理简介

在咱们的利用配置中,采纳的日志框架是 log4j2,采纳的配置是 异步日志配置。这里简略说一下 Log4j2 异步日志的原理:Log4j2 异步日志基于高性能数据结构 Disruptor,Disruptor 是一个环形 buffer,做了很多性能优化(具体原理能够参考我的另一系列:高并发数据结构(disruptor)),Log4j2 对此的利用如下所示:

简略来说,多线程通过 log4j2 的门面类 org.apache.logging.log4j.Logger 进行日志输入,被封装成为一个 org.apache.logging.log4j.core.LogEvent,放入到 Disruptor 的环形 buffer 中。在生产端有一个单线程生产这些 LogEvent 写入对应的 Appender,咱们这里只有一个 Appender,其配置是:

<RollingFile name="file" append="true"
             filePattern="./app.log-%d{yyyy.MM.dd.HH}">
    <PatternLayout pattern="${logFormat}"/>
    <Policies>
        <TimeBasedTriggeringPolicy interval="1" modulate="true"/>
    </Policies>
    <DirectWriteRolloverStrategy maxFiles="72"/>
</RollingFile>

异步日志 logger 配置为(配置 includeLocation 为 false,防止每次打印日志须要获取调用堆栈的性能损耗):

 <Asyncroot level="info" includeLocation="false">
    <appender-ref ref="file"/>
</Asyncroot>

log4j component 额定配置为:

log4j2.component.properties:

# 当没有日志的时候,生产线程通过 Block 期待日志事件到来,这样 CPU 占用起码
AsyncLoggerConfig.WaitStrategy=Block

3. log4j2 disruptor 的 RingBuffer 的大小

既然是一个环形 Buffer,它的容量是无限的,咱们这里没有批改它的大小,走的是默认配置,查看其源码:

AsyncLoggerConfigDisruptor.java

public synchronized void start() {
    // 省略其余代码
    ringBufferSize = DisruptorUtil.calculateRingBufferSize("AsyncLoggerConfig.RingBufferSize");
    disruptor = new Disruptor<>(factory, ringBufferSize, threadFactory, ProducerType.MULTI, waitStrategy);
}

DisruptorUtil.java

private static final int RINGBUFFER_MIN_SIZE = 128;
private static final int RINGBUFFER_DEFAULT_SIZE = 256 * 1024;
private static final int RINGBUFFER_NO_GC_DEFAULT_SIZE = 4 * 1024;

static int calculateRingBufferSize(final String propertyName) {
    // 是否启用了 ThreadLocal,如果是则为 4 kB,不是则为 256 kB
    int ringBufferSize = Constants.ENABLE_THREADLOCALS ? RINGBUFFER_NO_GC_DEFAULT_SIZE : RINGBUFFER_DEFAULT_SIZE;
    // 读取零碎变量,以及 log4j2.component.properties 文件获取 propertyName(这里是 AsyncLoggerConfig.RingBufferSize) 这个配置
    final String userPreferredRBSize = PropertiesUtil.getProperties().getStringProperty(propertyName,
            String.valueOf(ringBufferSize));
    try {int size = Integer.parseInt(userPreferredRBSize);
        // 如果小于 128 字节则依照 128 字节设置
        if (size < RINGBUFFER_MIN_SIZE) {
            size = RINGBUFFER_MIN_SIZE;
            LOGGER.warn("Invalid RingBufferSize {}, using minimum size {}.", userPreferredRBSize,
                    RINGBUFFER_MIN_SIZE);
        }
        ringBufferSize = size;
    } catch (final Exception ex) {LOGGER.warn("Invalid RingBufferSize {}, using default size {}.", userPreferredRBSize, ringBufferSize);
    }
    // 取最近的 2 的 n 次方,因为对于 2 的 n 次方取余等于对于 2^n-1 取与运算,这样更快
    return Integers.ceilingNextPowerOfTwo(ringBufferSize);
}

如果启用了 ThreadLocal 这种形式生成 LogEvent,每次不新生成的 LogEvent 用之前的,用 ThreadLocal 存储的,这样防止了创立新的 LogEvent。然而思考上面这种状况:

logger.info("{}", someObj);

这样会造成强援用,导致如果线程没有新的日志,这个 someObj 始终不能回收。所以 针对 Web 利用,log4j2 默认是不启用 ThreadLocal 的 形式生成 LogEvent

Constants.java

public static final boolean IS_WEB_APP = PropertiesUtil.getProperties().getBooleanProperty("log4j2.is.webapp", isClassAvailable("javax.servlet.Servlet"));
public static final boolean ENABLE_THREADLOCALS = !IS_WEB_APP && PropertiesUtil.getProperties().getBooleanProperty("log4j2.enable.threadlocals", true);

由此,能够看出,咱们的 RingBuffer 的大小为 256 kB

4. RingBuffer 满了 log4j2 会产生什么?

当 RingBuffer 满了,如果在 log4j2.component.properties 配置了 AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull=false,则会 Wait(其实是 park)在 Disruptor 的 produce 办法上,期待生产出下一个能够生产的环形 buffer 槽;默认这个配置为 true,即 所有生产日志的线程尝试获取全局中的同一个锁(private final Object queueFullEnqueueLock = new Object();):

DisruptorUtil.java

static final boolean ASYNC_CONFIG_SYNCHRONIZE_ENQUEUE_WHEN_QUEUE_FULL = PropertiesUtil.getProperties()
        .getBooleanProperty("AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull", true);
private boolean synchronizeEnqueueWhenQueueFull() {
    return DisruptorUtil.ASYNC_CONFIG_SYNCHRONIZE_ENQUEUE_WHEN_QUEUE_FULL
            // Background thread must never block
            && backgroundThreadId != Thread.currentThread().getId();
}

private final Object queueFullEnqueueLock = new Object();

private void enqueue(final LogEvent logEvent, final AsyncLoggerConfig asyncLoggerConfig) {
    // 如果 AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull=true,默认就是 true
    if (synchronizeEnqueueWhenQueueFull()) {synchronized (queueFullEnqueueLock) {disruptor.getRingBuffer().publishEvent(translator, logEvent, asyncLoggerConfig);
        }
    } else {
        // 如果 AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull=false
        disruptor.getRingBuffer().publishEvent(translator, logEvent, asyncLoggerConfig);
    }
}

默认配置的时候,异样堆栈和咱们 JFR 中看到的一样,举个例子:

"Thread-0" #27 [13136] prio=5 os_prio=0 cpu=0.00ms elapsed=141.08s tid=0x0000022d6f2fbcc0 nid=0x3350 waiting for monitor entry  [0x000000399bcfe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue(AsyncLoggerConfigDisruptor.java:375)
    - waiting to lock <merged>(a java.lang.Object)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueueEvent(AsyncLoggerConfigDisruptor.java:330)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logInBackgroundThread(AsyncLoggerConfig.java:159)
    at org.apache.logging.log4j.core.async.EventRoute$1.logMessage(EventRoute.java:46)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.handleQueueFull(AsyncLoggerConfig.java:149)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncDelegate(AsyncLoggerConfig.java:136)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:116)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:460)
    at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:82)
    at org.apache.logging.log4j.core.Logger.log(Logger.java:162)
    at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2190)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2144)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2127)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2003)
    at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1975)
    at org.apache.logging.log4j.spi.AbstractLogger.info(AbstractLogger.java:1312)
    省略业务办法堆栈

配置为 false 的时候,堆栈是这个样子的:

"Thread-0" #27 [18152] prio=5 os_prio=0 cpu=0.00ms elapsed=5.68s tid=0x000002c1fa120e00 nid=0x46e8 runnable  [0x000000eda8efe000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@17-loom/Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17-loom/LockSupport.java:410)
    at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:136)
    at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:105)
    at com.lmax.disruptor.RingBuffer.publishEvent(RingBuffer.java:524)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue(AsyncLoggerConfigDisruptor.java:379)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueueEvent(AsyncLoggerConfigDisruptor.java:330)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logInBackgroundThread(AsyncLoggerConfig.java:159)
    at org.apache.logging.log4j.core.async.EventRoute$1.logMessage(EventRoute.java:46)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.handleQueueFull(AsyncLoggerConfig.java:149)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncDelegate(AsyncLoggerConfig.java:136)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:116)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:460)
    at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:82)
    at org.apache.logging.log4j.core.Logger.log(Logger.java:162)
    at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2190)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2144)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2127)
    at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2003)
    at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1975)
    at org.apache.logging.log4j.spi.AbstractLogger.info(AbstractLogger.java:1312)

5. 为何会满,咱们的生产线程过后在干啥?

咱们来看下过后的生产线程是否有异样,因为 过后硬盘 io 看系统监控并没有异样 所以这个线程很大可能是 Runnable 的,一直的在写入日志。同时,咱们晓得,写入文件 java 底层基于 native 调用,所以咱们查看 JFR 的 native 办法采集。应用事件查看器中的 Method Profiling Sample Native,右键点击,并抉择“应用所选类型事件创立新页”:

在创立进去的新页中,依照线程分组查看,查看 Log4j2 的 disruptor 生产线程,能够得出下图:

能够看出:在出问题的时间段,采集到的写入日志文件的事件数量显著增多很多,并且,通过堆栈能够看到:

"Log4j2-TF-2-AsyncLoggerConfig-1" #26 [23680] daemon prio=5 os_prio=0 cpu=406.25ms elapsed=2.89s tid=0x0000022d6f3d4080 nid=0x5c80 runnable  [0x000000399bbfe000]
   java.lang.Thread.State: RUNNABLE
    at java.io.FileOutputStream.writeBytes(java.base@17-loom/Native Method)
    at java.io.FileOutputStream.write(java.base@17-loom/FileOutputStream.java:365)
    at org.apache.logging.log4j.core.appender.OutputStreamManager.writeToDestination(OutputStreamManager.java:261)
    - eliminated <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
    at org.apache.logging.log4j.core.appender.FileManager.writeToDestination(FileManager.java:272)
    - eliminated <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
    at org.apache.logging.log4j.core.appender.rolling.RollingFileManager.writeToDestination(RollingFileManager.java:236)
    - eliminated <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
    at org.apache.logging.log4j.core.appender.OutputStreamManager.flushBuffer(OutputStreamManager.java:293)
    - locked <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
    at org.apache.logging.log4j.core.appender.OutputStreamManager.flush(OutputStreamManager.java:302)
    - locked <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
    at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:199)
    at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:190)
    at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:181)
    at org.apache.logging.log4j.core.appender.RollingFileAppender.append(RollingFileAppender.java:312)
    at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
    at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:543)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.callAppenders(AsyncLoggerConfig.java:127)
    at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:502)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:485)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:121)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncLoggerConfigsOnCurrentThread(AsyncLoggerConfig.java:169)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapperHandler.onEvent(AsyncLoggerConfigDisruptor.java:111)
    at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapperHandler.onEvent(AsyncLoggerConfigDisruptor.java:97)
    at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
    at java.lang.Thread.run(java.base@17-loom/Thread.java:1521)

log4j2 调用了很屡次 flush,JFR 显示采集到的事件,每次都调用了 flush。每次调用 flush,都会造成文件真正写的 native 调用。而生产慢的起因,其实就是这种 native 调用太多,零碎写入不过去了。

问题解决

咱们能够通过以下四个方向解决这个问题:

  1. 缩小日志输入,精简日志,这是比拟根底的形式,也是比较简单的形式,然而身为一个技术人,咱们不能满足于此
  2. 减少硬盘 io,这个也是比拟根底简略的解决形式
  3. 咱们是否能够缩小这个 flush 呢?答案是能够的,咱们能够配置 Appender 的 immediateFlush 为 false
  4. 减少监控,针对堆栈蕴含 org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue 的 java monitor block 事件进行监控,如果发现工夫过长或者数量很多的事件则报警或者重建过程

1. 配置 Appender 的 immediateFlush 为 false

咱们能够配置 Appender 的 immediateFlush 为 false,例如:

<RollingFile name="file" append="true"
             filePattern="./app.log-%d{yyyy.MM.dd.HH}"
             immediateFlush="false">
    <PatternLayout pattern="${logFormat}"/>
    <Policies>
        <TimeBasedTriggeringPolicy interval="1" modulate="true"/>
    </Policies>
    <DirectWriteRolloverStrategy maxFiles="72"/>
</RollingFile>

这里的原理对应源码:

AbstractOutputStreamAppender.java

protected void directEncodeEvent(final LogEvent event) {getLayout().encode(event, manager);
    // 如果配置了 immdiateFlush(默认为 true)或者以后事件是 EndOfBatch
    if (this.immediateFlush || event.isEndOfBatch()) {manager.flush();
    }
}

那么对于 Log4j2 Disruptor 异步日志来说,什么时候 LogEventEndOfBatch 呢?是在生产到的 index 等于生产公布到的最大 index 的时候,这也是比拟合乎性能设计思考,即在没有生产完的时候,尽可能地不 flush,生产完以后所有的时候再去 flush:

BatchEventProcessor.java

private void processEvents()
{
    T event = null;
    long nextSequence = sequence.get() + 1L;

    while (true)
    {
        try
        {final long availableSequence = sequenceBarrier.waitFor(nextSequence);
            if (batchStartAware != null)
            {batchStartAware.onBatchStart(availableSequence - nextSequence + 1);
            }

            while (nextSequence <= availableSequence)
            {event = dataProvider.get(nextSequence);
                // 这里 nextSequence == availableSequence 就是 EndOfBatch
                eventHandler.onEvent(event, nextSequence, nextSequence == availableSequence);
                nextSequence++;
            }

            sequence.set(availableSequence);
        }
        catch (final TimeoutException e)
        {notifyTimeout(sequence.get());
        }
        catch (final AlertException ex)
        {if (running.get() != RUNNING)
            {break;}
        }
        catch (final Throwable ex)
        {exceptionHandler.handleEventException(ex, nextSequence, event);
            sequence.set(nextSequence);
            nextSequence++;
        }
    }
}

2. 减少基于 JFR 事件监控

这个须要 Java 14 以上的版本

Configuration config = Configuration.getConfiguration("default");
// 设置监控的锁 block 工夫超过多少就会采集
config.getSettings().put("jdk.JavaMonitorEnter#threshold", "1s");
try (var es = new RecordingStream(config)) {
    es.onEvent("jdk.JavaMonitorEnter", recordedEvent -> {
        // 如果堆栈蕴含咱们关注的,则报警
        if (recordedEvent.getStackTrace().getFrames().stream().anyMatch(recordedFrame -> recordedFrame.toString().contains("org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue")))  {System.out.println("Alarm:" + recordedEvent);
        }
    });
    es.start();}

微信搜寻“我的编程喵”关注公众号,每日一刷,轻松晋升技术,斩获各种 offer

正文完
 0