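Introduction
As the title says, this article walks through tracking down and fixing a hard-to-diagnose Netty ByteBuf memory leak. First-hand accounts like this one are well worth studying and keeping for reference.
Original article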
A Netty ByteBuf Memory Leak Story and the Lessons Learned | Logz.io
By: Asaf Mesika
Not long ago, I was chasing a memory leak at Logz.io while refactoring our log receiver. We were using Netty, and after a major refactoring we noticed that free memory on the machine was gradually decreasing.
Our first step was to force a garbage collection to see whether this was an on-heap or an off-heap (ByteBuf) memory issue. We quickly found that it was off-heap and started reading through the code to see where we had forgotten to call release() on a ByteBuf. We could not find anything obvious, which is usually the case with memory leaks.
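The article does not say how the on-heap versus off-heap check was done. As a minimal sketch using only the JDK's management beans, one could compare heap usage around a requested GC with the size of the NIO "direct" buffer pool. The class name MemorySnapshot is hypothetical. Depending on the Netty version and configuration, Netty's own direct allocations may not be counted in this pool, so treat the numbers as a hint rather than a verdict.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical helper for this sketch; not part of Netty or the article.
final class MemorySnapshot {
    /** Bytes currently used on the Java heap. */
    static long heapUsed() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /** Bytes held by NIO direct buffers, or -1 if no "direct" pool is exposed. */
    static long directUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stand-in for off-heap usage: one 1 MiB direct buffer.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(1 << 20);

        long heapBefore = heapUsed();
        System.gc(); // only a request; the JVM may ignore it
        long heapAfter = heapUsed();

        System.out.println("heap before GC: " + heapBefore + " bytes");
        System.out.println("heap after GC:  " + heapAfter + " bytes");
        System.out.println("direct pool:    " + directUsed() + " bytes");
        System.out.println("buffer capacity: " + offHeap.capacity());
    }
}
```

If heap usage drops back after the GC while the machine's free memory keeps shrinking, the growth is probably happening outside the heap.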
Then I noticed a message that had appeared only once, when we started the application:
ERROR i.n.u.ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. Enable advanced leak reporting to find out where the leak occurred. To enable advanced leak reporting, specify the JVM option '-Dio.netty.leakDetectionLevel=advanced' or call ResourceLeakDetector.setLevel()
At first, I did not pay much attention to the message because it had appeared only once. I figured that it was a single ByteBuf that I had forgotten to release and that I would fix it the following week. After a couple of days, we noticed that the host's free memory was still decreasing, and I realized that I needed to understand this error better.
The reference counted objects section of Netty's documentation has a detailed part entitled "Troubleshooting buffer leaks." When I read it, I did not understand it completely until I got to the following:
Netty adds a hook to the ByteBuf code so that, when a GC occurs, it checks whether the buffer was released; if it was not, it prints the error message above. One important detail is that it performs this check for only a fraction of the byte buffers (sampling), so when you see this error message only once, the leak has probably happened many more times than that.
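To make the sampling idea concrete, here is a toy model built only on the JDK's PhantomReference and ReferenceQueue. It is a sketch of the general technique, not Netty's implementation: the class SampledLeakDetector, its method names, and the sampling interval of 100 are all assumptions made for illustration.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of a sampled leak detector; NOT Netty's implementation.
final class SampledLeakDetector {
    /** Tracks one sampled buffer; the GC enqueues it once the buffer is unreachable. */
    static final class Tracker extends PhantomReference<Object> {
        Tracker(Object buffer, ReferenceQueue<Object> queue) {
            super(buffer, queue);
        }
    }

    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    private final Set<Tracker> unreleased = ConcurrentHashMap.newKeySet();
    private final int samplingInterval; // hypothetical value chosen by the caller
    private long allocations;

    SampledLeakDetector(int samplingInterval) {
        this.samplingInterval = samplingInterval;
    }

    /** Returns a tracker for every samplingInterval-th buffer, otherwise null. */
    synchronized Tracker track(Object buffer) {
        if (allocations++ % samplingInterval != 0) {
            return null; // not sampled: a leak of this buffer goes unnoticed
        }
        Tracker tracker = new Tracker(buffer, queue);
        unreleased.add(tracker);
        return tracker;
    }

    /** Called on release: a released buffer is no longer a leak candidate. */
    void released(Tracker tracker) {
        if (tracker != null) {
            unreleased.remove(tracker);
        }
    }

    /** Counts sampled buffers that were collected without being released. */
    int reportLeaks() {
        int leaks = 0;
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            if (unreleased.remove(ref)) {
                leaks++;
            }
        }
        return leaks;
    }

    /** Number of sampled buffers not yet released or reported. */
    int trackedCount() {
        return unreleased.size();
    }

    public static void main(String[] args) throws InterruptedException {
        SampledLeakDetector detector = new SampledLeakDetector(100);
        for (int i = 0; i < 1000; i++) {
            detector.track(new byte[16]); // never released: every one is a leak
        }
        System.gc();       // only a request; collection is not guaranteed
        Thread.sleep(200); // give the JVM time to enqueue the references
        System.out.println("reported " + detector.reportLeaks() + " leaks out of 1000 leaked buffers");
    }
}
```

In this model, 1000 leaked buffers can produce at most 10 reports, which is why a single message in the log says little about how many buffers actually leaked.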
Once I understood that, I added the JVM option
-Dio.netty.leakDetectionLevel=advanced
as recommended. However, when the application started, I now saw two error messages instead of one as a side effect. The log message also carried one more important detail: the location in the code where I had created the specific ByteBuf that had not been released. This showed me where I was causing the leak. The first takeaway: do not ignore memory leak messages. Immediately switch the leak detection level to advanced on the JVM command line to find the origin of the leak.
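Besides the command-line flag, the log message mentions calling ResourceLeakDetector.setLevel(). A JDK-only alternative, sketched below, is to set the same system property at the very start of main. This assumes Netty reads the property when its leak detector class is first loaded, so the call has to run before any Netty code. The class LeakDetectionSetup and the LEAK_DEBUG environment variable are hypothetical names for this sketch.

```java
// Hypothetical startup helper; only the property name comes from the log message.
final class LeakDetectionSetup {
    static final String LEVEL_PROPERTY = "io.netty.leakDetectionLevel";

    /**
     * Sets the leak detection level unless one was already given on the
     * command line, and returns the value now in effect.
     * Assumption: this runs before any Netty class is loaded.
     */
    static String configure(boolean diagnosing) {
        if (System.getProperty(LEVEL_PROPERTY) == null) {
            System.setProperty(LEVEL_PROPERTY, diagnosing ? "advanced" : "simple");
        }
        return System.getProperty(LEVEL_PROPERTY);
    }

    public static void main(String[] args) {
        // LEAK_DEBUG is an assumed environment variable, not a Netty setting.
        boolean diagnosing = "1".equals(System.getenv("LEAK_DEBUG"));
        System.out.println(LEVEL_PROPERTY + "=" + configure(diagnosing));
        // ... start the Netty server here ...
    }
}
```

Keeping the switch behind a flag makes it easy to turn advanced reporting on for a diagnosis session and off again afterwards.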
The second takeaway: when hunting down ByteBuf memory leaks, run a "find usage" on the class and trace your code upwards through the calling hierarchy until you reach the code that actually created the buffer. Do this even if the cause seems obvious, and especially if third-party code is causing the problem.
The third takeaway came from a side effect of changing the leak detection level to advanced mode. When I ran my performance load test, I noticed that the receiver barely reached 25 MB/sec, while the usual rate on the same machine is 200 MB/sec. The build I tested also contained other new code, so I was not sure what had caused the slowdown.
I started commenting out code until my handler did nothing at all; it looked practically like a copy-paste of the Discard Server example from Netty's documentation.
When I removed the
-Dio.netty.leakDetectionLevel=advanced
JVM option, the speed returned to normal. I was amazed! So, to boil this article down to a single point to remember: the leak detection level's advanced mode may slow Netty down by close to a factor of 10 (here, from 200 MB/sec to 25 MB/sec).
Have you run into memory leaks while using Netty and learned some lessons as a result? If so, I'd love to hear your stories in the comments below!