netty

Introduction

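As the title says, this post recounts the hands-on investigation and fix of a hard-to-track-down Netty ByteBuf memory leak. First-hand accounts like this are well worth studying and keeping as a reference.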

Original article

A Netty ByteBuf Memory Leak Story and the Lessons Learned | Logz.io

By: Asaf Mesika

Just a while ago, I was chasing a memory leak at Logz.io while refactoring our log receiver. We were using Netty, and after a major refactoring we noticed a gradual decrease in the machine's free memory.

Our first action was to try to run garbage collection to see if this was an on-heap or off-heap (utilizing ByteBuf) memory issue. We quickly found that it was an off-heap issue and started to read through the code to see where we forgot to call the release() method on the ByteBuf type. We could not find anything obvious — but that is usually the case when it comes to memory leaks.
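
To make the failure mode concrete, here is a minimal sketch of how a single early return can skip release(). CountedBuffer is a hypothetical toy stand-in written only for this illustration, not Netty's ByteBuf; in real Netty code the corresponding calls would be ByteBuf.release() or ReferenceCountUtil.release(msg).

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ReleaseDemo {
    /** Toy reference-counted buffer (hypothetical; NOT Netty's ByteBuf). */
    static final class CountedBuffer {
        /** Number of buffers whose memory has not been returned yet. */
        static final AtomicInteger LIVE = new AtomicInteger();

        private int refCnt = 1; // a new buffer starts with one reference

        CountedBuffer() {
            LIVE.incrementAndGet();
        }

        /** Drops one reference; returns true when the memory was freed. */
        boolean release() {
            if (refCnt == 0) {
                throw new IllegalStateException("already released");
            }
            refCnt--;
            if (refCnt == 0) {
                LIVE.decrementAndGet();
                return true;
            }
            return false;
        }
    }

    /** Leaky: the early return skips release(). */
    static void handleLeaky(CountedBuffer buf, boolean valid) {
        if (!valid) {
            return; // the buffer is never released on this path
        }
        buf.release();
    }

    /** Safe: try/finally releases on every path, including exceptions. */
    static void handleSafe(CountedBuffer buf, boolean valid) {
        try {
            if (!valid) {
                return;
            }
            // ... process the buffer here ...
        } finally {
            buf.release();
        }
    }

    public static void main(String[] args) {
        handleSafe(new CountedBuffer(), false);
        System.out.println("live after safe handler:  " + CountedBuffer.LIVE.get()); // 0
        handleLeaky(new CountedBuffer(), false);
        System.out.println("live after leaky handler: " + CountedBuffer.LIVE.get()); // 1
    }
}
```

Paths like the early return in handleLeaky are easy to miss when reading code, which is one reason the leak was not obvious from inspection alone.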

Then, I noticed that there was a message that appeared only once when we started the application:

ERROR i.n.u.ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. Enable advanced leak reporting to find out where the leak occurred. To enable advanced leak reporting, specify the JVM option '-Dio.netty.leakDetectionLevel=advanced' or call ResourceLeakDetector.setLevel()

At first, I did not pay much attention to the message because it only appeared once. So, I figured that it was a single ByteBuf that I forgot to release and that I would fix it the following week. After a couple of days, we noticed that the host’s free memory was still decreasing. So, I realized that I needed to understand more about this error.

In the “Reference counted objects” section of Netty’s documentation, there is a detailed subsection entitled “Troubleshooting buffer leaks.” When I read that part of the documentation, I did not understand it completely until I read the following:

Netty adds a hook to the ByteBuf code so that when a GC occurs, it checks whether the buffer was released; if it was not, it prints the error message above. One important detail here is that it performs this check for only a fraction of the byte buffers (sampling), so when you see this error message only once, the leak has probably happened many more times than that.
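
A small simulation, under stated assumptions, shows why one report usually hides many leaks. It assumes the detector tracks one allocation in every 128 (I believe 128 is Netty's default sampling interval, but treat that number as an assumption), and it picks every 128th allocation deterministically to keep the result reproducible, whereas Netty samples at random. SamplingDemo and reportedLeaks are hypothetical names for this sketch only.

```java
final class SamplingDemo {
    /**
     * Counts how many of {@code leaked} unreleased buffers would be reported
     * if only every {@code interval}-th allocation is tracked.
     */
    static int reportedLeaks(int leaked, int interval) {
        int reported = 0;
        for (int i = 1; i <= leaked; i++) {
            if (i % interval == 0) { // this allocation happens to be tracked
                reported++;
            }
        }
        return reported;
    }

    public static void main(String[] args) {
        int interval = 128; // assumed sampling interval, see the note above
        for (int leaked : new int[] {100, 1_000, 10_000}) {
            System.out.printf("%d leaked buffers -> %d reports%n",
                    leaked, reportedLeaks(leaked, interval));
        }
    }
}
```

Under these assumptions, 1,000 leaked buffers produce only 7 reports, so a single report suggests a leak count at least in the low hundreds.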

Once I understood that, I added the JVM option

-Dio.netty.leakDetectionLevel=advanced

as recommended. When the application started, I then saw two error messages instead of one. The log message now contained one more important detail: the location in the code where I had created the specific ByteBuf that had not been released. This showed me where I was causing the leak. The first takeaway: do not ignore memory leak messages. Immediately switch the leak detection level to advanced mode with the JVM command-line argument to find the origin of the leak.
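
The same setting can also be applied from code, as long as that happens before Netty is initialized. The sketch below uses only System.setProperty with the property name from the log message; LeakDetectionConfig and enableAdvancedIfUnset are hypothetical names, and it assumes that Netty reads the property when its leak detector class is first loaded. When Netty is on the classpath, the log message's other suggestion, ResourceLeakDetector.setLevel(), is the more direct route.

```java
final class LeakDetectionConfig {
    /** Property name taken from the LEAK log message. */
    static final String LEVEL_PROPERTY = "io.netty.leakDetectionLevel";

    /**
     * Requests advanced leak reporting unless a level was already supplied,
     * for example with -D on the command line. Call this before any Netty
     * class is used (assumption: the property is read once at class load).
     */
    static String enableAdvancedIfUnset() {
        if (System.getProperty(LEVEL_PROPERTY) == null) {
            System.setProperty(LEVEL_PROPERTY, "advanced");
        }
        return System.getProperty(LEVEL_PROPERTY);
    }

    public static void main(String[] args) {
        System.out.println(LEVEL_PROPERTY + "=" + enableAdvancedIfUnset());
        // ... start the Netty-based application here (hypothetical) ...
    }
}
```

Respecting an existing value keeps the command line in charge, so an operator can still lower the level without rebuilding.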

The second takeaway: when hunting down ByteBuf memory leaks, run a “find usage” on the class and trace your code upwards through the calling hierarchy until you get to the actual code that created the buffer. Do this even if the cause seems obvious, and especially if third-party code is causing the problem.

The third takeaway was a side effect of changing the leak-detection level to advanced mode. When I ran my performance load test, I noticed that the receiver barely reached 25 MB/sec, whereas the same machine usually sustains 200 MB/sec. The build I tested also contained other new code, so I was not sure what had caused the slowdown.

I started commenting out code until I had reached a point where my handler simply did nothing — the handler practically looked like a copy-paste of the Discard Server example from Netty’s documentation.

When I removed the

-Dio.netty.leakDetectionLevel=advanced

JVM option, the speed returned to normal. I was amazed! So, just to boil this article down to a single point to remember: The leak detection level’s advanced mode may slow down Netty by a factor of 10.
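
One way to act on this is to make advanced mode an explicit opt-in, so that it cannot slip into load tests or production by accident. A minimal sketch, assuming a hypothetical RunMode enum of my own; the level names simple and advanced are Netty's, with simple being its default as far as I know.

```java
final class LeakLevelChooser {
    /** Hypothetical run modes, defined only for this illustration. */
    enum RunMode { PRODUCTION, LOAD_TEST, LEAK_HUNT }

    /** Returns the value to pass as -Dio.netty.leakDetectionLevel=<value>. */
    static String levelFor(RunMode mode) {
        // Advanced mode records where sampled buffers were accessed, which is
        // the likely source of its overhead, so reserve it for leak hunting.
        return mode == RunMode.LEAK_HUNT ? "advanced" : "simple";
    }

    public static void main(String[] args) {
        for (RunMode mode : RunMode.values()) {
            System.out.println(mode + " -> -Dio.netty.leakDetectionLevel=" + levelFor(mode));
        }
    }
}
```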

Have you had any experiences with memory leaks using Netty and learned some lessons as a result? If so, I’d love to hear your stories in the comments below!
