On container technology: Playing with Fire, Container Memory Control with CGroup – Container Fundamentals Revisited, Part 1


This article is from: Playing with Fire, Container Memory Control with CGroup – Container Fundamentals Revisited, Part 1. If the images are unclear, please refer to the original post.

    • What are we really talking about when we talk about container memory?
  • CGroup memory explained

    • Basic concepts

      • Page
      • anonymous pages
      • LRU list
      • LRU list groups
    • CGroup overview

      • Main accounting scope
      • Kernel memory accounting scope
      • Memory reclaim
    • Status files

      • memory.stat
      • memory.usage_in_bytes
      • memory.numa_stat
      • memory.failcnt
    • Control and event notification

      • Forcing memory reclaim
      • Notifications based on memory usage watermarks
      • Don't OOM Kill, just pause
  • Monitoring and metrics

    • Commonly misunderstood K8s metrics

      • container_memory_usage_bytes
      • container_memory_working_set_bytes
      • kubectl top
      • Which metric actually relates to OOM Kill
  • Memory limits' impact on IO performance
  • Understanding kernel data structures
  • Page Cache usage analysis tools
  • Useful references not mentioned above

A container's memory limit is a contradictory yet important choice: give too much and you waste resources; give too little and your service crashes at random.

CGroup memory control is the core of container resource control. She is a strict warden, mercilessly OOM Killing an application the moment it exceeds its limit. Yet she also has a lenient side, allocating and releasing Cache for applications when resources run short. Her internal accounting algorithm rewards careful study, and observing and monitoring her state and behavior is an intricate business. This article attempts a systematic analysis.

🤫 If, after the warden metaphor above, you thought of a certain someone in your life, that has nothing to do with me.

Last year I wrote "Fitting an Elephant into a Shipping Container: Dissecting Java Container Memory", which briefly touched on the relationship between CGroup memory control and Java. Recently I ran into container memory problems at work again, so I wanted to understand this pit more deeply and systematically: a pit that everyone doing containerization (moving to the cloud) must face, yet most wants to avoid.

What are we really talking about when we talk about container memory?

Let's start with a few questions:

  • Is container memory usage just the user-space memory used by processes? If so, what does it include?
  • If kernel-space memory is included, which kinds of kernel memory are there?
  • Is all of this memory a hard occupation, or can the kernel shrink and release some of it when necessary? When monitoring shows a metric approaching or even exceeding the limit, will an OOM Kill necessarily follow?

This article tries to analyze the questions above, <mark> and to dig into the little-known tricks behind CGroup ✨🔮 that help pinpoint OOM Kill problems </mark>. It also covers the pitfalls in estimating and monitoring container memory.

CGroup memory explained

Basic concepts

The Basic concepts section is a bit dry and TL;DR; feel free to skip it.

Page

To manage memory efficiently, the operating system manages it in units of Pages; one Page is typically 4 KiB:

The images and some of the text below come from: https://lwn.net/Articles/443241/ and https://biriukov.dev/docs/pag…
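You can confirm the page size on the machine you're on (4096 bytes on typical x86_64; some ARM64 systems use 16 KiB or 64 KiB pages). A quick check:

```shell
# Query the memory page size via POSIX getconf
page_size=$(getconf PAGESIZE)
echo "page size: $page_size bytes"
```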

anonymous pages

Anonymous pages, as the name says, are memory Pages with no actual backing file, such as an application's Heap and Stack space. The opposite concept is file pages, which include the page cache, file mappings, tmpfs, and so on.

LRU list

Linux maintains an LRU list holding Page metadata, which makes it possible to traverse pages in order of most recent access:

To make reclaim more efficient, Linux uses two LRU lists:

active LRU and inactive LRU

Their roles:

If the kernel decides pages will probably not be needed in the near future, it moves them from the active LRU to the inactive LRU. If some process tries to access them, pages on the inactive LRU can quickly be moved back to the active LRU. The inactive LRU can be thought of as a kind of probation area for pages the system is considering reclaiming soon.

The flow between the active LRU and the inactive LRU looks like this:

LRU list groups

Of course, the implementation is more complicated than that. Current kernels actually maintain five LRU lists.

  • anonymous pages get their own active LRU and inactive LRU – the reclaim policy for these pages is different: if the system runs without a swap partition, they are not reclaimed (which is the usual Kubernetes setup, although newer versions differ).
  • A list of unevictable pages (LRU_UNEVICTABLE), e.g. pages locked into memory.
  • The file-related lists

The five lists in the kernel source:

enum lru_list {
    LRU_INACTIVE_ANON = LRU_BASE, // anonymous pages
    LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
    LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE, // file pages
    LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
    LRU_UNEVICTABLE, // unevictable pages
    NR_LRU_LISTS // boundary marker
};

Before CGroup existed, one group of these lists existed per NUMA Node + Zone memory region – the so-called "global LRU":

Figure: LRU under NUMA. Image source

The cgroup mem controller, which has to control and track each cgroup's memory usage, added another level of complexity. It needs to track extra information for every page, associating each page with the cgroup mem controller that owns it; that information was added to struct page, the page's metadata.

With the cgroup mem controller, the LRU lists exist per cgroup and per NUMA node. Each memory cgroup has its own LRU lists dedicated to its memory reclaimer:

Figure: LRU under CGroup + NUMA. Image source

CGroup overview

The official documentation on kernel.org:

Memory Resource Controller from kernel.org

That document targets kernel developers and researchers and is not very newcomer friendly. Translating it wholesale here would just drive readers away. Still, some of its core points are worth covering.

Main accounting scope

The limit for the main memory accounting is configured in the file memory.limit_in_bytes, corresponding to spec.containers[].resources.limits.memory in k8s.

CGROUP mainly accounts for the following:

  • anon pages (RSS), memory with no backing file, such as the heap
  • cache pages (Page Cache), file-related memory, such as File Cache / File Mapping

For reclaimable memory (such as file Cache or File Mapping), CGROUP records page access order on an LRU list and reclaims in FIFO (LRU) fashion.

For those new to Linux, one point worth stressing: memory accounting is based on pages actually touched, not pages allocated. Put plainly, only pages that have been accessed get charged.

If you are a Java programmer, this is one reason to add -XX:+AlwaysPreTouch on top of -Xms xG -Xmx xG: get the memory charged early and surface problems early.

Kernel memory accounting scope

Kernel-space memory can also be accounted and capped, although k8s does not seem to support this cgroup limit for now. Note that cache pages count toward the main accounting scope, not the kernel memory accounting scope.

The kernel memory limit is configured in the file memory.kmem.limit_in_bytes. It mainly covers:

  • stack pages
  • slab pages
  • sockets memory pressure
  • tcp memory pressure
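These show up as separate memory.kmem.* files in the cgroup directory. A guarded peek at two of them (the path assumes a cgroup v1 mount at the usual location, which may not exist where you run this):

```shell
# Read kernel-memory accounting files if the cgroup v1 memory controller
# is mounted; fall back to a placeholder otherwise.
for f in memory.kmem.limit_in_bytes memory.kmem.usage_in_bytes; do
  p="/sys/fs/cgroup/memory/$f"
  if [ -r "$p" ]; then
    v=$(cat "$p")
  else
    v="unavailable (cgroup v1 memory controller not mounted here)"
  fi
  echo "$f: $v"
done
```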

Memory reclaim

Memory Resource Controller from kernel.org

Each cgroup maintains a per cgroup LRU which has the same structure as global VM. When a cgroup goes over its limit, we first try to reclaim memory from the cgroup so as to make space for the new pages that the cgroup has touched. If the reclaim is unsuccessful, an OOM routine is invoked to select and kill the bulkiest task in the cgroup. (See 10. OOM Control below.)

The reclaim algorithm has not been modified for cgroups, except that pages that are selected for reclaiming come from the per-cgroup LRU list.

Each cgroup maintains a group of cgroup LRU lists with the same structure as the global LRU group. When the memory a cgroup has touched exceeds its limit, we first try to reclaim memory from the cgroup to make room for the new pages. If reclaim fails, an OOM routine is invoked to select and kill the bulkiest task in the cgroup.

The reclaim algorithm for cgroups is the same as the earlier global LRU one, except that the pages selected for reclaim come from the per-cgroup LRU lists. In practice, what can mainly be reclaimed is the Page Cache.

Status files

There are quite a few files; only a few important ones are covered here.

memory.stat

The memory.stat file contains various statistics:

cache # of bytes of page cache memory.
rss # of bytes of anonymous and swap cache memory (includes transparent hugepages).
rss_huge # of bytes of anonymous transparent hugepages.
mapped_file # of bytes of mapped file (includes tmpfs/shmem)
pgpgin # of charging events to the memory cgroup. The charging event happens each time a page is accounted as either mapped anon page(RSS) or cache page(Page Cache) to the cgroup.
pgpgout # of uncharging events to the memory cgroup. The uncharging event happens each time a page is unaccounted from the cgroup.
swap # of bytes of swap usage
dirty # of bytes that are waiting to get written back to the disk.
writeback # of bytes of file/anon cache that are queued for syncing to disk.
inactive_anon # of bytes of anonymous and swap cache memory on inactive LRU list.
active_anon # of bytes of anonymous and swap cache memory on active LRU list.
inactive_file # of bytes of file-backed memory on inactive LRU list.
active_file # of bytes of file-backed memory on active LRU list.
unevictable # of bytes of memory that cannot be reclaimed (mlocked etc).

Note: the # in the listing above means "number of".

Note: only anonymous and swap cache memory is counted as part of the "rss" statistic. This should not be confused with the true "resident set size" or the amount of physical memory used by the cgroup.

"rss + mapped_file" can be taken as the cgroup's resident set size.

(Note: file-backed pages and shmem can be shared with other cgroups. In that case, mapped_file only includes page cache pages whose owner is this cgroup; the first cgroup to touch a page becomes its owner.)
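To make the "rss + mapped_file" arithmetic concrete, here is a sketch over a fabricated memory.stat excerpt (the byte values are illustrative, not taken from a real cgroup):

```shell
# A fabricated memory.stat excerpt, values in bytes
stat_sample='cache 1048576
rss 4194304
rss_huge 0
mapped_file 262144
inactive_file 524288'

# Resident set size of the cgroup is approximately rss + mapped_file
rss_total=$(printf '%s\n' "$stat_sample" |
  awk '$1 == "rss" || $1 == "mapped_file" { sum += $2 } END { print sum }')
echo "approx resident set size: $rss_total bytes"
```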

memory.usage_in_bytes

This is roughly equivalent to RSS+CACHE(+SWAP) in memory.stat, with some update lag (it is a fuzz value maintained for efficient access).

memory.numa_stat

Provides the cgroup's per-NUMA-node memory allocation information. As the earlier figures showed, each NUMA node has its own LRU list group.

memory.failcnt

The cgroup exposes the memory.failcnt file, which shows how many times the CGroup limit has been hit. When a memory cgroup hits its limit, failcnt increments and memory reclaim is triggered.
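For example, a quick check from inside a container (the path assumes cgroup v1; the read is guarded so the snippet is harmless elsewhere):

```shell
# memory.failcnt counts how often the cgroup hit its limit and had to reclaim
fc=/sys/fs/cgroup/memory/memory.failcnt
if [ -r "$fc" ]; then
  failcnt=$(cat "$fc")
else
  failcnt=0   # controller not mounted here; treat as zero
fi
echo "limit hit count: $failcnt"
```

A steadily climbing failcnt with no OOM Kills usually means the cgroup is surviving on Page Cache reclaim, which is worth correlating with IO latency.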

Control and event notification

There are quite a few files; only a few important ones are covered here.

Forcing memory reclaim: memory.force_empty

$ echo 0 > memory.force_empty

This reclaims all reclaimable memory; with Swap disabled, as in K8s, that mainly means the file-related Page Cache.

There's an interesting aside here.

As of Kernel v3.0:

memory.force_empty interface is provided to make cgroup’s memory usage empty.
You can use this interface only when <mark>the cgroup has no tasks</mark>.
When writing anything to this

From Kernel v4.0 on:

memory.force_empty interface is provided to make cgroup’s memory usage empty.
When writing anything to this

That is, from kernel v4.0 on, the docs no longer say the cgroup must have no processes before you can force reclaim. This little episode came up when I recommended the feature to a colleague, who dug up an old Redhat 6.0 document claiming force reclaim could not be used while processes existed.

Notifications based on memory usage watermarks

CGroup has a mysterious hidden trick, one k8s does not seem to use: you can register notifications at multiple memory usage watermarks, via the file cgroup.event_control. I won't go into detail here; if interested, see:

https://www.kernel.org/doc/ht…

What is this hidden trick good for? Think about it: to diagnose a container's OOM problem, if we take a core dump or java heap dump just before the limit is reached and the OOM fires, can't we then analyze the process's memory usage at that moment?
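A sketch of the registration format from the kernel docs. Pure shell cannot create the required eventfd, so the listener side needs a small helper; the kernel tree ships one at tools/cgroup/cgroup_event_listener.c. The cgroup path and 100 MiB watermark below are illustrative assumptions:

```shell
# The string written to cgroup.event_control has this documented form:
#   "<event_fd> <fd of memory.usage_in_bytes> <threshold in bytes>"
threshold=$((100 * 1024 * 1024))   # illustrative watermark: 100 MiB
event_control_line="<event_fd> <fd of memory.usage_in_bytes> $threshold"
echo "$event_control_line"
# With the kernel's sample helper (build and path assumed), which performs
# the eventfd(2) + write for you and then blocks until the watermark is hit:
#   cgroup_event_listener /sys/fs/cgroup/memory/mycg/memory.usage_in_bytes 100M
# On wake-up you could trigger a core dump or java heap dump of the suspect
# process before the real OOM arrives.
```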

Don't OOM Kill, just pause

This, too, is one of CGroup's mysterious hidden tricks that k8s does not seem to use. What is it good for? Think about it: to diagnose a container's OOM problem, if the container simply pauses when OOM is about to happen, we can then:

  • core dump the process
  • or configure a larger CGroup limit so the jvm can keep running, then take a java heap dump

Wouldn't that let us analyze the process's memory usage at that moment?

If interested, see:

https://www.kernel.org/doc/ht…

Related file: memory.oom_control
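A sketch of how to flip the switch (paths assume cgroup v1 mounted at the usual location; the write needs root and must be done in the target cgroup's directory). The read below is guarded so it degrades gracefully where the controller isn't mounted:

```shell
# memory.oom_control reports the oom_kill_disable and under_oom flags
oc=/sys/fs/cgroup/memory/memory.oom_control
if [ -r "$oc" ]; then
  state=$(cat "$oc")
else
  state="oom_kill_disable 0 (memory controller not mounted at this path)"
fi
echo "$state"
# To pause tasks at the limit instead of killing them (as root, inside the
# target cgroup's directory):
#   echo 1 > memory.oom_control
```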

Monitoring and metrics

Anyone with real operations experience knows that the hardest part of ops is monitoring, and the hardest part of monitoring is choosing and understanding metrics, along with the tangled relationships between them.

K8s container metrics

K8s's built-in container metrics come from the cAdvisor module running inside the kubelet.

cAdvisor's official metric documentation is here: Monitoring cAdvisor with Prometheus. That official document is too terse, too terse to be of much help in troubleshooting…

Fortunately, the community delivers: Out-of-memory (OOM) in Kubernetes – Part 3: Memory metrics sources and tools to collect them:

The table's four columns (cAdvisor metric, source OS metric, explanation of the source OS metric, what the metric means) are flattened below into one entry per metric:

  • container_memory_cache
    Source: the total_cache value in the memory.stat file inside the container's cgroup directory.
    OS metric: number of bytes of page cache memory.
    Meaning: size of memory used by the cache that's automatically populated when reading/writing files.
  • container_memory_rss
    Source: the total_rss value in memory.stat inside the container's cgroup directory.
    OS metric: number of bytes of anonymous and swap cache memory (includes transparent hugepages). "[…] This should not be confused with the true 'resident set size' or the amount of physical memory used by the cgroup. 'rss + mapped_file' will give you resident set size of cgroup."
    Meaning: size of memory not used for mapping files from the disk.
  • container_memory_mapped_file
    Source: the total_mapped_file value in memory.stat inside the container's cgroup directory.
    OS metric: number of bytes of mapped file (includes tmpfs/shmem).
    Meaning: size of memory that's used for mapping files.
  • container_memory_swap
    Source: the total_swap value in memory.stat inside the container's cgroup directory.
    OS metric: number of bytes of swap usage.
  • container_memory_failcnt
    Source: the value inside the memory.failcnt file.
    OS metric: shows the number of times that a usage counter hit its limit.
  • container_memory_usage_bytes
    Source: the value inside the memory.usage_in_bytes file.
    OS metric: doesn't show the 'exact' value of memory (and swap) usage; it's a fuzz value for efficient access (of course, when necessary, it's synchronized). If you want to know the more exact memory usage, you should use the RSS+CACHE(+SWAP) value in memory.stat.
    Meaning: size of overall memory used, regardless of whether it's for mapping from disk or just allocating.
  • container_memory_max_usage_bytes
    Source: the value inside the memory.max_usage_in_bytes file.
    OS metric: max memory usage recorded.
  • container_memory_working_set_bytes
    Source: deduct inactive_file in memory.stat from the value inside the memory.usage_in_bytes file; if the result is negative, use 0.
    OS metric: inactive_file is the number of bytes of file-backed memory on the inactive LRU list; usage_in_bytes is a fuzz value as described above.
    Meaning: a heuristic for the minimum size of memory required for the app to work; the amount of memory in use that cannot be freed under memory pressure. It includes all anonymous (non-file-backed) memory, since Kubernetes does not support swap, and typically also some cached (file-backed) memory, because the host OS cannot always reclaim such pages.

Table: cAdvisor metrics and their sources

If the descriptions above don't satisfy your curiosity, there's more here:

  • https://jpetazzo.github.io/20…

Commonly misunderstood K8s metrics

container_memory_usage_bytes

A Deep Dive into Kubernetes Metrics — Part 3 Container Resource Metrics

You might think that memory utilization is easily tracked with container_memory_usage_bytes, however, this metric also includes cached (think filesystem cache) items that can be evicted under memory pressure. The better metric is container_memory_working_set_bytes as this is what the OOM killer is watching for.

container_memory_working_set_bytes

Memory usage discrepancy: cgroup memory.usage_in_bytes vs. RSS inside docker container

container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file (from /sys/fs/cgroup/memory/memory.stat); this is calculated in cAdvisor and is <= container_memory_usage_bytes
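The formula is easy to reproduce by hand. A sketch with illustrative numbers (not real readings from a cgroup):

```shell
# Working set = usage_in_bytes - total_inactive_file, floored at 0,
# exactly as cAdvisor computes container_memory_working_set_bytes
usage_in_bytes=$((512 * 1024 * 1024))        # sample memory.usage_in_bytes
total_inactive_file=$((128 * 1024 * 1024))   # sample total_inactive_file
working_set=$((usage_in_bytes - total_inactive_file))
if [ "$working_set" -lt 0 ]; then working_set=0; fi
echo "working set: $working_set bytes"
```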

kubectl top

Memory usage discrepancy: cgroup memory.usage_in_bytes vs. RSS inside docker container

when you use the kubectl top pods command, you get the value of container_memory_working_set_bytes not container_memory_usage_bytes metric.

The relationship between container_memory_cache and container_memory_mapped_file

Out-of-memory (OOM) in Kubernetes – Part 3: Memory metrics sources and tools to collect them:

Notice the“page cache”term on the definition of the container_memory_cache metric. In Linux the page cache is“used to cache the content of files as IO is performed upon them”as per the“Linux Kernel Programming”book by Kaiwan N Billimoria (author's note: I have read this book; it is the most approachable kernel book I have come across recently). You might be tempted as such to think that container_memory_mapped_file pretty much refers to the same thing, but that’s actually just a subset: e.g. a file can be mapped in memory (whole or parts of it) or it can be read in blocks, but the page cache will include data coming from either way of accessing that file. See https://stackoverflow.com/que… for more info.

Which metric actually relates to OOM Kill

Memory usage discrepancy: cgroup memory.usage_in_bytes vs. RSS inside docker container

It is also worth to mention that when the value of container_memory_usage_bytes reaches to the limits, your pod will NOT get oom-killed. BUT if container_memory_working_set_bytes or container_memory_rss reached to the limits, the pod will be killed.

Memory limits' impact on IO performance

This section mainly draws on: Don't Let Linux Control Groups Run Uncontrolled

Page cache usage by apps is counted towards a cgroup’s memory limit, and anonymous memory usage can steal page cache for the same cgroup

An application's page cache usage is counted toward the cgroup's memory limit, and anonymous memory usage can steal page cache from the same cgroup.

In other words, when a container is short on memory, its file cache shrinks, so repeated-read performance drops, and Writeback performance may drop as well. If your container does a lot of file IO, configure its memory limit with care.

Understanding kernel data structures

TODO: add some of my diagrams.

Page Cache usage analysis tools

To find which files occupy the Page Cache:

  • vmtouch – the Virtual Memory Toucher
  • pcstat
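A sketch of typical vmtouch usage (assumes the tool is installed; the call is guarded so it degrades gracefully, and /var/log is just an illustrative target):

```shell
# Summarize how much of a directory tree is resident in the page cache.
# Add -v for per-file residency maps, or -e to evict a file's pages.
if command -v vmtouch >/dev/null 2>&1; then
  result=$(vmtouch /var/log 2>/dev/null)
else
  result="vmtouch not installed"
fi
[ -n "$result" ] || result="(no output)"
echo "$result"
```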

Another POD killer: eviction

Most of this section comes from: Out-of-memory (OOM) in Kubernetes – Part 4: Pod evictions, OOM scenarios and flows leading to them

But when does the Kubelet decide to evict pods?“Low memory situation”is a rather fuzzy concept: we’ve seen that the OOM killer acts at the system level (in OOM killer) when memory is critically low (essentially almost nothing left), so it follows that pod evictions should happen before that. But when exactly?

As per the official Kubernetes docs“‘Allocatable’on a Kubernetes node is defined as the amount of compute resources that are available for pods“. This feature is enabled by default via the --enforce-node-allocatable=pods and once the memory usage for the pods crosses this value, the Kubelet triggers the eviction mechanism:“Enforcement is performed by evicting pods whenever the overall usage across all pods exceeds‘Allocatable’”as documented here.

We can easily see the value by checking the output of kubectl describe node. Here’s how the section of interest looks like for one of nodes of the Kubernetes cluster used throughout this article (a 7-GiB Azure DS2_v2 node):

Memory available to Pods on a Node:

Figure: worker node memory categories

Figure: worker node allocatable memory distribution

The eviction mechanism

How does all we’ve seen in the previous section reflect in the eviction mechanism? Pods are allowed to use memory as long as the overall usage across all pods is less than the allocatable memory value. Once this threshold is exceeded, you’re at the mercy of the Kubelet – as every 10s it checks the memory usage against the defined thresholds. Should the Kubelet decide an eviction is necessary the pods are sorted based on an internal algorithm described in Pod selection for kubelet eviction – which includes QoS class and individual memory usage as factors – and evicts the first one. The Kubelet continues evicting pods as long as thresholds are still being breached. Should one or more pods allocate memory so rapidly that the Kubelet doesn’t get a chance to spot it inside its 10s window, and the overall pod memory usage attempts to grow over the sum of allocatable memory plus the hard eviction threshold value, then the kernel’s OOM killer will step in and kill one or more processes inside the pods’containers, as we’ve seen at length in the Cgroups and the OOM killer section.

Out of all that, one simple and direct takeaway: the more a POD's actual memory usage exceeds its memory request, the more likely it is to be picked for eviction.

Metrics the eviction algorithm consults

We’re talking about memory usage a lot, but what exactly do we mean by that? Take the“allocatable”amount of memory that pods can use overall: for our DS2_v2 AKS nodes that is 4565 MiB. Is it when the RAM of a node has 4565 MiB filled with data for the pods means it’s right next to the threshold of starting evictions? In other words, what’s the metric used?

Back in the Metrics values section we’ve seen there are quite a few metrics that track memory usage per type of object. Take the container object, for which cAdvisor will return half a dozen metrics such as container_memory_rss, container_memory_usage_bytes, container_memory_working_set_bytes etc.

So when the Kubelet looks at the eviction thresholds, what memory metric is it actually comparing against? The official Kubernetes documentation provides the answer to this: it’s the working set. There’s even a small script included there that shows the computations for deciding evictions at the node level. Essentially it computes the working_set metric for the node as the root memory cgroup’s memory.usage_in_bytes minus the inactive_file field in the root memory cgroup’s memory.stat file.

And that is the exact formula we’ve come across in the past when we’ve looked at how are the node metrics computed through the Resource Metrics API (see the last row of the cAdvisor metrics table). Which is good news, as we’ll be able to plot the same exact metrics used in Kubelet’s eviction decisions on our charts in the following sections, by choosing the metric source which currently gives almost all of the memory metrics: cAdvisor.

As a side-note, if you want to see that the Kubelet’s code reflects what was said above – both for the“allocatable”threshold as well as the --eviction-hard one – have a look at What is the memory metric that the Kubelet is using when making eviction decisions?.

A catalogue of container memory kills

Here is a flow chart of container memory event handling. In one sentence: play with fire carefully, or there will always be a kill with your name on it.

from Out-of-memory (OOM) in Kubernetes – Part 4: Pod evictions, OOM scenarios and flows leading to them

Useful references not mentioned above

  • Out-of-memory (OOM) in Kubernetes – Part 1: Intro and topics discussed
  • SRE deep dive into Linux Page Cache
  • Dropping cache didn’t drop cache
  • Memory Measurements Complexities and Considerations – Part 1
  • Overcoming challenges with Linux cgroups memory accounting
