关于jvm:CMS垃圾收集器停顿案例

CMS垃圾收集器从 jdk1.6 中开始利用, 是一个老年代垃圾收集器, 在 JVM 的倒退过程中表演了重要的历史作用,jdk1.7,jdk1.8中都能够开启应用。在 jdk9 中曾经废除掉了。

`CMS`垃圾收集器的重要毛病

因为老年代碎片问题, 在 YGC 的时候会产生降职失败 (promotion failures), 即便老年代有足够的空间, 然而依然可能导致调配失败, 因为没有足够间断的空间, 从而触发Concurrent mode Failure, 会产生SWT 的FullGC。FullGC相比于 CMS 这种并发模式的 GC 须要更长的进展工夫能力实现垃圾回收工作。这会导致重大的进展服务不可用问题。concurrent mode failure，须要 stop-the-wold 降级为 GC-Serail Old）。

`CMS`为什么会产生碎片

CMS垃圾收集器在回收老年代时, 采纳的是标记清理 (Mark-Sweep) 算法, 它在垃圾回收时并不会压缩堆, 工夫久了, 导致老年代的碎片化问题越来越重大, 直到产生单线程的 Mark-Sweep Compact GC 即FullGC, 会齐全 STW。如果堆比拟大并且老年代占的空间比拟大,STW 的工夫会继续几秒, 十几秒, 几十秒。对于应用程序来说就是长时间的进展, 这对于互联网利用的影响是很大的。

剖析 `CMS` 日志

启动 jvm 的时候，减少参数 -XX:+PrintGCDetails 和 -XX:+PrintGCTimeStamps 能够打印出 CMS GC 的具体日志。-XX:+PrintHeapAtGC 在进行 GC 的前后打印出堆的信息,-Xloggc:../logs/gc.log 日志文件的输入门路。

-XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps and -XX:+PrintGCApplicationStoppedTime -XX:PrintFLSStatistics=2

{Heap before GC invocations=7430 (full 24):
parnew generation total 134400K, used 121348K[0x53000000, 0x5c600000, 0x5c600000)
eden space 115200K, 99% used [0x53000000, 0x5a07e738, 0x5a080000)
from space 19200K, 32% used [0x5a080000, 0x5a682cc0, 0x5b340000)
to space 19200K, 0% used [0x5b340000, 0x5b340000, 0x5c600000)
concurrent mark-sweep generation total 2099200K, used 1694466K [0x5c600000, 0xdc800000, 0xdc800000)
concurrent-mark-sweep perm gen total 409600K, used 186942K [0xdc800000, 0xf5800000, 0xfbc00000)
10628.167: [GC Before GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 103224160
Max Chunk Size: 5486
Number of Blocks: 57345
Av. Block Size: 1800
Tree Height: 36 <---- High fragmentation
Statistics for IndexedFreeLists:
--------------------------------
Total Free Space: 371324
Max Chunk Size: 254
Number of Blocks: 8591 <---- High fragmentation
Av. Block Size: 43
free=103595484
frag=1.0000 <---- High fragmentation
Before GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
Tree Height: 0
Statistics for IndexedFreeLists:
--------------------------------
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
free=0 frag=0.0000
10628.168: [ParNew (promotion failed) Desired survivor size 9830400 bytes, new threshold 1 (max 1)
- age 1: 4770440 bytes, 4770440 total: 121348K->122157K(134400K), 0.4263254secs]
10628,594: [CMS10630.887: [CMS-concurrent-mark: 7.286/8.682 secs] [Times: user=14.81, sys=0.34, real=8.68 secs]
(concurrent mode failure):1698044K->625427K(2099200K), 17.1365396 secs]
1815815K->625427K(2233600K), [CMS Perm : 186942K->180711K(409600K)]
 
After GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 377269492
Max Chunk Size:
377269492
Number of Blocks: 1 <---- No fragmentation
Av. Block Size: 377269492
Tree Height: 1 <---- No fragmentation
Statistics for IndexedFreeLists:
--------------------------------
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
free=377269492
frag=0.0000 <---- No fragmentation
After GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
Tree Height: 0
Statistics for IndexedFreeLists:
--------------------------------
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
free=0 frag=0.0000
, 17.5645589 secs] [Times: user=17.82 sys=0.06, real=17.57 secs]
Heap after GC invocations=7431 (full 25):
parnew generation total 134400K, used 0K [0x53000000, 0x5c600000, 0x5c600000)
eden space 115200K, 0% used [0x53000000, 0x53000000, 0x5a080000)
from space 19200K, 0% used [0x5b340000, 0x5b340000, 0x5c600000)
to space 19200K, 0% used [0x5a080000, 0x5a080000, 0x5b340000)
concurrent mark-sweep generation total 2099200K, used 625427K [0x5c600000, 0xdc800000, 0xdc800000)
concurrent-mark-sweep perm gen total 409600K, used 180711K [0xdc800000, 0xf5800000, 0xfbc00000)
}
Total time for which application threads were stopped: 17.5730653 seconds

因为碎片率十分高，从而导致 promotion failure，而后产生 concurrent mode failure，触发的 FullGC 总计花了 17.1365396 秒才实现。

操作系统内存不够, 应用了 `swap`, 导致`CMS` 长时间进展

操作系统应用了swap, 可能导致 GC 进展工夫更长，这些进展可能是几秒，甚至几十秒级别。

系统配置了容许应用 swap 空间，操作系统可能把 JVM 过程的非流动内存页移到 swap 空间，从而开释内存给以后流动过程（可能是操作系统上其余过程，取决于系统调度）。Swapping 因为须要拜访磁盘，所以相比物理内存，它的速度慢的令人发指。所以，如果在 GC 的时候，零碎正好须要执行 Swapping

{Heap before GC invocations=132 (full 0):
par new generation total 2696384K, used 2696384K [0xfffffffc20010000, 0xfffffffce0010000, 0xfffffffce0010000)
eden space 2247040K, 100% used [0xfffffffc20010000, 0xfffffffca9270000, 0xfffffffca9270000)
from space 449344K, 100% used [0xfffffffca9270000, 0xfffffffcc4940000, 0xfffffffcc4940000)
to space 449344K, 0% used [0xfffffffcc4940000, 0xfffffffcc4940000, 0xfffffffce0010000)
concurrent mark-sweep generation total 9437184K, used 1860619K [0xfffffffce0010000, 0xffffffff20010000, 0xffffffff20010000)
concurrent-mark-sweep perm gen total 1310720K, used 511451K [0xffffffff20010000, 0xffffffff70010000, 0xffffffff70010000)
2020-07-17T03:58:06.601-0700: 51522.120: [GC Before GC: :2696384K->449344K(2696384K), 29.4779282 secs] 4557003K->2326821K(12133568K) ,29.4795222 secs] [Times: user=915.56, sys=6.35, real=29.48 secs]

最初一行 [Times: user=915.56, sys=6.35, real=29.48 secs] 中 real 就是 YGC 时利用实在的进展工夫。

`YGC`时 vmstat 命令输入

r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id
0 0 0 77611960 94847600 55 266 0 0 0 0 0 0 0 0 0 3041 2644 2431 44 8 48
0 0 0 76968296 94828816 79 324 0 18 18 0 0 0 0 1 0 3009 3642 2519 59 13 28
1 0 0 77316456 94816000 389 2848 0 7 7 0 0 0 0 2 0 40062 78231 61451 42 6 53
2 0 0 77577552 94798520 115 591 0 13 13 0 0 13 12 1 0 4991 8104 5413 2 0 98

YGC总共花了 29.48 才实现, 从下面看出零碎在此期间应用了 600 多 Mb 的 swap 分区, 这就意味着，在 GC 的时候，内存中的一些页被移到了 swap 空间，这个内存页不肯定属于 JVM 过程，可能是其余操作系统上的其余过程。

操作系统上可用物理内容不足以运行零碎上所有的过程，解决办法就是尽可能运行更少的过程，减少 RAM 从而晋升零碎的物理内存。在这个例子中，Old 区有 9G，然而只应用了 1.8G（mark-sweep generation total 9437184K, used 1860619K）。咱们能够适当的升高 Old 区的大小以及整个堆的大小，从而缩小内存压力，最小化零碎上的利用产生 swapping 的可能。

如果应用程序须要的堆内存比咱们设定的 Xms 大, 也会导致频繁的 GC, 重大的状况会导致OOM。因为堆空间有余, 对象调配失败,JVM 就要调用GC 尝试回收曾经调配的空间, 然而 GC 不能开释更多的内存空间, 又导致下一次GC。

利用运行时，频繁的 FullGC 会引起长时间进展，在上面这个例子中，Perm 空间 (永恒代) 简直是满的，并且在 Perm 区尝试分配内存也都失败了，从而触发 FullGC：

永恒代：这个区域会存储包含类定义、构造、字段、办法（数据及代码）以及常量在内的类相干数据。它能够通过（以下两个是非堆区配置参数）-XX:PermSize 及 -XX:MaxPermSize 来进行调节。若永恒代 (Perm Gen) 空间用完，会导致 java.lang.OutOfMemoryError: PermGenspace 的异样。而且从 JDK8 开始，永恒代被元空间所取代。

166687.013: [Full GC [PSYoungGen:126501K->0K(922048K)] [PSOldGen: 2063794K->1598637K(2097152K)]2190295K->1598637K(3019200K) [PSPermGen: 165840K->164249K(166016K)],6.8204928 secs] [Times: user=6.80 sys=0.02, real=6.81 secs]
 
166699.015: [Full GC [PSYoungGen:125518K->0K(922048K)] [PSOldGen: 1763798K->1583621K(2097152K)]1889316K->1583621K(3019200K) [PSPermGen: 165868K->164849K(166016K)],4.8204928 secs] [Times: user=4.80 sys=0.02, real=4.81 secs]

如果老年代空间有余时, 也会导致频繁的 FullGC, 解决方案就是扩充 老年代 和永恒代 的空间。

程序中调用了`System.gc`

System.gc 调用，利用中的一些类里，或者第三方模块中调用 System.gc 调用从而触发 STW 的 FullGC，也可能会引起十分长时间的进展。如下 GC 日志所示，Full GC 前面的（System）示意它是由调用 System.GC 触发的 FullGC，并且耗时 5.75 秒：

164638.058: [Full GC (System) [PSYoungGen: 22789K->0K(992448K)]
[PSOldGen: 1645508K->1666990K(2097152K)] 1668298K->1666990K(3089600K)
[PSPermGen: 164914K->164914K(166720K)], 5.7499132 secs] [Times: user=5.69, sys=0.06, real=5.75 secs]

如果你要敞开通过调用 System.gc() 触发 FullGC，配置 JVM 参数 –XX:+DisableExplicitGC 即可。

promotion failed – concurrent mode failure

Minor GC 后，Survivor 空间包容不了残余对象，将要放入老年代，老年代有碎片或者不能包容这些对象，就产生了 concurrent mode failure, 而后进行 stop-the-world 的 Serial Old 收集器。

解决办法：-XX:UseCMSCompactAtFullCollection -XX:CMSFullGCBeforeCompaction=5 或者调大新生代或者 Survivor 空间

concurrent mode failure

CMS 是和业务线程并发运行的，在执行 CMS 的过程中有业务对象须要在老年代间接调配，例如大对象，然而老年代没有足够的空间来调配，所以导致 concurrent mode failure, 而后须要进行 stop-the-world 的 Serial Old 收集器。

解决办法：+XX:CMSInitiatingOccupancyFraction 触发 CMS 收集器的内存比例, 调大老年带的空间。+XX:CMSMaxAbortablePrecleanTime,+XX:CMSFullGCsBeforeCompaction=5,5 次 Full GC 后压缩 old generation 一次

关注微信公众号:【入门小站】, 解锁更多知识点。

CMS垃圾收集器的重要毛病

CMS为什么会产生碎片

剖析 CMS 日志

操作系统内存不够, 应用了 swap, 导致CMS 长时间进展

YGC时 vmstat 命令输入

堆空间有余

程序中调用了System.gc

总结

`CMS`垃圾收集器的重要毛病

`CMS`为什么会产生碎片

剖析 `CMS` 日志

操作系统内存不够, 应用了 `swap`, 导致`CMS` 长时间进展

`YGC`时 vmstat 命令输入

程序中调用了`System.gc`