Netty: A Netty ByteBuf Memory Leak Story and the Lessons Learned

Introduction

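As the title says, this article recounts a real-world investigation and fix of a hard-to-diagnose Netty ByteBuf memory leak. First-hand accounts like this are well worth studying and referring back to.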

Original article

A Netty ByteBuf Memory Leak Story and the Lessons Learned | Logz.io

By: Asaf Mesika

Just a while ago, I was chasing a memory leak we had at Logz.io while I was refactoring our log receiver. We were using Netty, and after a major refactoring, we noticed a gradual decrease in the machine's free memory.

Our first action was to try to run garbage collection to see if this was an on-heap or off-heap (utilizing ByteBuf) memory issue. We quickly found that it was an off-heap issue and started to read through the code to see where we forgot to call the release() method on the ByteBuf type. We could not find anything obvious — but that is usually the case when it comes to memory leaks.
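
One way to make the on-heap versus off-heap check concrete is to read both numbers from the JVM's management beans before and after a garbage collection. The sketch below uses only the JDK; the class name MemorySnapshot is made up for illustration. It is an assumption that the leaked buffers are visible in the JDK's "direct" buffer pool: depending on how Netty is configured, some of its direct allocations may bypass this pool, so treat the figure as a lower bound.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Netty: compares heap and direct-buffer usage.
class MemorySnapshot {
    /** Bytes currently used on the Java heap. */
    static long heapUsed() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    /** Bytes held by the JDK's "direct" buffer pool, or -1 if the pool is not exposed. */
    static long directUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.gc(); // only a hint to the JVM, but usually enough for a rough reading
        System.out.println("heap bytes used:   " + heapUsed());
        System.out.println("direct bytes used: " + directUsed());

        // Allocate 1 MiB off-heap to show that it lands in the direct pool, not on the heap.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(1 << 20);
        System.out.println("direct bytes used after allocating " + offHeap.capacity()
                + " bytes: " + directUsed());
    }
}
```

If heap usage falls back after a collection while the direct figure (or the process's resident memory) keeps climbing, the leak is most likely off-heap, which matches what the author observed.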

Then, I noticed that there was a message that appeared only once when we started the application:

ERROR i.n.u.ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. Enable advanced leak reporting to find out where the leak occurred. To enable advanced leak reporting, specify the JVM option '-Dio.netty.leakDetectionLevel=advanced' or call ResourceLeakDetector.setLevel()

At first, I did not pay much attention to the message because it only appeared once. So, I figured that it was a single ByteBuf that I forgot to release and that I would fix it the following week. After a couple of days, we noticed that the host’s free memory was still decreasing. So, I realized that I needed to understand more about this error.

In the reference counted objects section in Netty’s documentation, there was a detailed section entitled “Troubleshooting buffer leaks.” When I read that part of the documentation, I did not understand it completely until I read the following:

Netty adds a hook to the ByteBuf code so that, when a GC occurs, it checks whether the buffer was released; if it was not, it prints the error message above. One important detail is that it performs this check for only a fraction of the byte buffers (sampling), so when you see this error message only once, the leak has probably happened many more times than that.
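
To get a feel for why a single report can hide a large number of leaks, here is a small JDK-only simulation of sampling. It is a sketch, not Netty's detector: the class name LeakSamplingDemo is hypothetical, and the 1-in-128 sampling interval is an illustrative assumption (check your Netty version for the value it actually uses).

```java
import java.util.Random;

// Hypothetical simulation of sampled leak tracking; it does not use Netty itself.
class LeakSamplingDemo {
    /** Counts how many of `leaks` leaked buffers a 1-in-`interval` sampler would have tracked. */
    static int sampledLeaks(int leaks, int interval, long seed) {
        Random random = new Random(seed);
        int reported = 0;
        for (int i = 0; i < leaks; i++) {
            // Each buffer is tracked with probability 1/interval; untracked leaks stay silent.
            if (random.nextInt(interval) == 0) {
                reported++;
            }
        }
        return reported;
    }

    public static void main(String[] args) {
        int leaks = 10_000;
        int interval = 128; // assumed sampling interval, for illustration only
        int reported = sampledLeaks(leaks, interval, 42L);
        System.out.println(leaks + " leaked buffers, " + reported + " of them tracked");
        System.out.println("rough estimate from reports: " + (long) reported * interval + " leaks");
    }
}
```

Only a small fraction of the leaked buffers is ever tracked, so each reported leak stands in for roughly `interval` real ones.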

Once I understood that, I added the JVM option

-Dio.netty.leakDetectionLevel=advanced

as recommended. However, when the application started, I then saw two error messages instead of one as a side effect. There was one more important detail in the log message: the location in the code where I had created the specific ByteBuf that had not been released. This helped me to understand the location where I was causing the leak. The first takeaway: Do not ignore memory leak messages — immediately switch the leak detection level to advanced mode in the JVM command line argument to detect the origin of the leak.
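
The reason a more detailed mode can name the place where a buffer was created is, conceptually, that it records a stack trace when the buffer is allocated. The following JDK-only sketch illustrates that idea with a made-up TrackedResource class; it is an assumption-laden toy and not how Netty's ResourceLeakDetector is actually implemented.

```java
// Hypothetical toy class: remembers where it was created so a leak report can name the origin.
class TrackedResource {
    private final Throwable creationSite; // null when origin tracking is switched off
    private boolean released;

    TrackedResource(boolean recordCreationSite) {
        // Capturing a Throwable walks the call stack, which costs time on every allocation.
        this.creationSite = recordCreationSite ? new Throwable("created here") : null;
    }

    void release() {
        released = true;
    }

    /** Describes the leak, including the creating code when it was recorded. */
    String leakReport() {
        if (released) {
            return "no leak";
        }
        if (creationSite == null) {
            return "LEAK: origin unknown";
        }
        StackTraceElement[] frames = creationSite.getStackTrace();
        if (frames.length < 2) {
            return "LEAK: origin unknown";
        }
        // frames[0] is this constructor; frames[1] is the code that called it.
        return "LEAK: created at " + frames[1];
    }
}

class TrackedResourceDemo {
    public static void main(String[] args) {
        TrackedResource cheap = new TrackedResource(false);
        TrackedResource detailed = new TrackedResource(true);
        System.out.println(cheap.leakReport());    // no origin available
        System.out.println(detailed.leakReport()); // names the line in main that allocated it
    }
}
```

The extra bookkeeping per allocation is also a plausible reason such a mode costs throughput, which becomes relevant later in the story.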

The second takeaway: When hunting down ByteBuf memory leaks, run a “find usage” on the class and trace your code upwards through the calling hierarchy until you get to the actual code that created it. Do this even if the cause seems obvious, and especially if it is third-party code that is causing the problem.
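
Once the trace reaches the code that creates or receives the buffer, the usual fix is to guarantee a release on every path, including the error path. The sketch below shows that pattern with a toy CountedBuffer class, which is a stand-in written for this example and not Netty's ByteBuf. In real Netty code the corresponding calls would be ByteBuf.release() or ReferenceCountUtil.release(msg).

```java
// Hypothetical stand-in for a reference-counted buffer; it starts with one reference.
class CountedBuffer {
    private int refCnt = 1;

    int refCnt() {
        return refCnt;
    }

    /** Drops one reference; returns true when the buffer has been fully released. */
    boolean release() {
        if (refCnt == 0) {
            throw new IllegalStateException("already released");
        }
        refCnt--;
        return refCnt == 0;
    }
}

class ReleaseDemo {
    /** Processes a buffer and releases it whether or not processing fails. */
    static int process(CountedBuffer buf, boolean fail) {
        try {
            if (fail) {
                throw new IllegalArgumentException("bad frame");
            }
            return 1; // pretend one message was decoded
        } finally {
            buf.release(); // runs on both the normal and the exceptional path
        }
    }

    public static void main(String[] args) {
        CountedBuffer ok = new CountedBuffer();
        process(ok, false);
        System.out.println("refCnt after normal path: " + ok.refCnt());

        CountedBuffer bad = new CountedBuffer();
        try {
            process(bad, true);
        } catch (IllegalArgumentException e) {
            System.out.println("refCnt after failure path: " + bad.refCnt());
        }
    }
}
```

A leak often hides on exactly the path that the happy-path code review skips, such as an early return or an exception thrown before the release call.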

The third takeaway was a side effect of changing the leak-detection level to advanced mode. When I ran my performance load test, I noticed that the receiver barely made it through 25 MB/sec, but the rate when using the same machine is usually 200 MB/sec. I had placed more code into the build that I had tested, so I was not sure of the cause of the slowdown.

I started commenting out code until I had reached a point where my handler simply did nothing — the handler practically looked like a copy-paste of the Discard Server example from Netty’s documentation.

When I removed the

-Dio.netty.leakDetectionLevel=advanced

JVM option, the speed returned to normal. I was amazed! So, just to boil this article down to a single point to remember: The leak detection level’s advanced mode may slow down Netty by a factor of 10.
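
One way to reconcile the first and third takeaways is to choose the leak detection level per environment, so that the expensive mode runs in test or staging while production keeps a cheaper level. The sketch below sets the same io.netty.leakDetectionLevel system property from code. The class name, the APP_ENV variable, and the environment names are assumptions made for illustration. The property has to be set before Netty's leak detector class is first loaded, otherwise it will not be picked up.

```java
// Hypothetical startup helper; only the property name comes from Netty's own log message.
class LeakDetectionConfig {
    /** Picks a leak detection level name for the given (hypothetical) environment name. */
    static String levelFor(String environment) {
        if ("test".equalsIgnoreCase(environment) || "staging".equalsIgnoreCase(environment)) {
            return "advanced"; // detailed origin reports, at a noticeable throughput cost
        }
        return "simple"; // cheaper sampled detection for production and anything unknown
    }

    /** Applies the level through the system property that Netty reads at startup. */
    static void apply(String environment) {
        System.setProperty("io.netty.leakDetectionLevel", levelFor(environment));
    }

    public static void main(String[] args) {
        // APP_ENV is an assumed variable name; default to the cautious production setting.
        String environment = System.getenv().getOrDefault("APP_ENV", "production");
        apply(environment);
        System.out.println("io.netty.leakDetectionLevel="
                + System.getProperty("io.netty.leakDetectionLevel"));
    }
}
```

Defaulting unknown environments to the cheaper level means a forgotten setting costs some diagnostic detail rather than most of the receiver's throughput.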

Have you had any experiences with memory leaks when using Netty, and learned some lessons as a result? If so, I’d love to hear your stories in the comments below!