Why do multi-threaded programs on Linux consume so much virtual memory?


Recently, while optimizing server memory usage, I ran into a rather subtle problem. Our authentication server (AuthServer) is responsible for talking to third-party channel SDKs; because it uses curl in blocking mode, it runs 128 threads. The strange part: right after startup it occupied about 2.3G of virtual memory, then every round of message processing added another 64M, until the figure stopped growing at 4.4G. Since we use pre-allocation, the threads basically never allocate large blocks of memory themselves, so where was all this memory coming from? It had me completely puzzled.

1. Investigation

The first thing to rule out was a memory leak: it could hardly be a coincidence that exactly 64M leaked every single time. To prove it, I reached for valgrind first.

valgrind --leak-check=full --track-fds=yes --log-file=./AuthServer.vlog ./AuthServer &

Then I started the test and let it run until memory stopped growing; sure enough, valgrind reported no memory leaks at all. I repeated the experiment many times with the same result.

After several fruitless rounds with valgrind, I began to suspect the program was using mmap or similar calls somewhere, so I used strace to watch mmap, brk and related system calls:

strace -f -e trace=brk,mmap,munmap -p $(pidof AuthServer)

The output looked like this:

[pid 19343] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f53c8ca9000
[pid 19343] munmap(0x7f53c8ca9000, 53833728) = 0
[pid 19343] munmap(0x7f53d0000000, 13275136) = 0
[pid 19343] mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f53d04a8000
Process 19495 attached

Going through the trace file, I did not see a large volume of mmap activity, and even the growth caused by brk was small. I was starting to feel lost. Next I suspected the file cache might be occupying the virtual memory, so I commented out all the log reading and writing code; virtual memory still grew, which ruled that out as well.

2. A flash of insight

Later I started cutting down the number of threads for testing and stumbled on a very odd phenomenon: if the process creates a thread and allocates as little as 1K of memory inside that thread, the process's virtual memory immediately jumps by 64M, and further allocations do not increase it. The test code is as follows:

#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

using namespace std;

volatile bool start = 0;

// Worker thread: waits until 'start' is set, then allocates 1K once.
void* thread_run(void*)
{
    while (1)
    {
        if (start)
        {
            cout << "Thread malloc" << endl;
            char* buf = new char[1024];   // the tiny allocation under test (deliberately never freed)
            start = 0;
        }
        sleep(1);
    }
    return 0;   // never reached
}

int main()
{
    pthread_t th;

    // First input: create the child thread.
    getchar();
    getchar();
    pthread_create(&th, 0, thread_run, 0);

    // Every further input triggers one allocation in the child thread.
    while (getchar())
    {
        start = 1;
    }

    return 0;
}

The run went like this. Right after startup the process occupied 14M of virtual memory. Typing 0 created the child thread and the process grew to 23M; the extra ~10M is the thread stack (the thread stack size can be inspected and changed with ulimit -s). The first time I typed 1, the program allocated 1K and the whole process gained 64M of virtual memory; typing 2 and then 3 allocated another 1K each, but the memory no longer changed.
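One simple way to watch these numbers change yourself, assuming the test program is compiled to a binary named main (the same name used in the pmap command below), is to poll VmSize in /proc:

watch -n 1 'grep -E "VmSize|Threads" /proc/$(pidof main)/status'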

This rang a bell. I had studied Google's TCMalloc before, in which every thread has its own cache to cope with allocation contention between threads, and I guessed that newer versions of glibc had picked up the same trick. So I ran pmap $(pidof main) to inspect the memory map:

Note the mapping of 65404K in the output: all the evidence says that this block plus the one right next to it (132K here) is exactly the extra 64M (65404K + 132K = 65536K = 64M). Adding more threads produces a matching number of additional 65404K blocks.
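The relevant pair of lines in the pmap output looks roughly like the sketch below (the addresses and exact split are illustrative, not copied from the original run). The small rw--- mapping is the part of the arena already in use, and the large ----- (PROT_NONE) mapping is the rest of the 64M address-space reservation:

00007f53d0000000    132K rw---   [ anon ]
00007f53d0021000  65404K -----   [ anon ]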


3. Getting to the bottom of it

After some searching and code reading, I finally found out that glibc's malloc was the culprit here: glibc 2.11 and later behave this way. From Red Hat's official documentation:

Red Hat Enterprise Linux 6 features version 2.11 of glibc, providing many features and enhancements, including… An enhanced dynamic memory allocation (malloc) behaviour enabling higher scalability across many sockets and cores. This is achieved by assigning threads their own memory pools and by avoiding locking in some situations. The amount of additional memory used for the memory pools (if any) can be controlled using the environment variables MALLOC_ARENA_TEST and MALLOC_ARENA_MAX. MALLOC_ARENA_TEST specifies that a test for the number of cores is performed once the number of memory pools reaches this value. MALLOC_ARENA_MAX sets the maximum number of memory pools used, regardless of the number of cores.

The developer, Ulrich Drepper, has a much deeper explanation on his blog:

Before, malloc tried to emulate a per-core memory pool. Every time when contention for all existing memory pools was detected a new pool is created. Threads stay with the last used pool if possible… This never worked 100% because a thread can be descheduled while executing a malloc call. When some other thread tries to use the memory pool used in the call it would detect contention. A second problem is that if multiple threads on multiple core/sockets happily use malloc without contention memory from the same pool is used by different cores/on different sockets. This can lead to false sharing and definitely additional cross traffic because of the meta information updates. There are more potential problems not worth going into here in detail.

The changes which are in glibc now create per-thread memory pools. This can eliminate false sharing in most cases. The meta data is usually accessed only in one thread (which hopefully doesn’t get migrated off its assigned core). To prevent the memory handling from blowing up the address space use too much the number of memory pools is capped. By default we create up to two memory pools per core on 32-bit machines and up to eight memory pools per core on 64-bit machines. The code delays testing for the number of cores (which is not cheap, we have to read /proc/stat) until there are already two or eight memory pools allocated, respectively.

While these changes might increase the number of memory pools which are created (and thus increase the address space they use) the number can be controlled. Because using the old mechanism there could be a new pool being created whenever there are collisions the total number could in theory be higher. Unlikely but true, so the new mechanism is more predictable.

… Memory use is not that much of a premium anymore and most of the memory pool doesn’t actually require memory until it is used, only address space… We have done internally some measurements of the effects of the new implementation and they can be quite dramatic.

The Hadoop community ran into the same problem in production:

New versions of glibc present in RHEL6 include a new arena allocator design. In several clusters we’ve seen this new allocator cause huge amounts of virtual memory to be used, since when multiple threads perform allocations, they each get their own memory arena. On a 64-bit system, these arenas are 64M mappings, and the maximum number of arenas is 8 times the number of cores. We’ve observed a DN process using 14GB of vmem for only 300M of resident set. This causes all kinds of nasty issues for obvious reasons.

Setting MALLOC_ARENA_MAX to a low number will restrict the number of memory arenas and bound the virtual memory, with no noticeable downside in performance – we’ve been recommending MALLOC_ARENA_MAX=4. We should set this in Hadoop-env.sh to avoid this issue as RHEL6 becomes more and more common.

To sum up: for the sake of allocation performance, glibc uses a number of memory pools called arenas. By default, on 64-bit systems each arena is 64M and a process can have at most cores × 8 arenas. If your machine has 4 cores, that is at most 4 × 8 = 32 arenas, i.e. 32 × 64M = 2048M of virtual memory. You can of course change the number of arenas through an environment variable, e.g. export MALLOC_ARENA_MAX=1.

Hadoop recommends setting this value to 4. On the other hand, since arenas exist precisely to reduce allocation contention between threads on multi-core machines, setting it to the number of CPU cores is probably also a sensible choice. Whatever value you pick, it is best to stress-test the program afterwards to see whether changing the number of arenas affects its performance.
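This cap can be applied without touching the code by exporting the variable in the startup environment; for example, following the Hadoop recommendation above (assuming the server binary is launched as ./AuthServer):

export MALLOC_ARENA_MAX=4
./AuthServer &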

If you would rather set this from code, you can call mallopt(M_ARENA_MAX, xxx). Because our AuthServer uses pre-allocation and does not allocate memory inside its threads, this optimization is of no use to us, so at initialization we simply switch it off with mallopt(M_ARENA_MAX, 1). (A value of 0 means the system configures the arena count automatically based on the number of CPUs.)
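A minimal sketch of that initialization, assuming a glibc where M_ARENA_MAX is available as a mallopt() parameter (it is glibc-specific and declared in <malloc.h>):

#include <malloc.h>   // glibc-specific: mallopt(), M_ARENA_MAX
#include <cstdio>

int main()
{
    // Cap glibc's malloc at a single arena, as described above for AuthServer.
    // This must run before the worker threads make their first allocations.
    if (mallopt(M_ARENA_MAX, 1) != 1)
        std::fprintf(stderr, "mallopt(M_ARENA_MAX) is not supported by this libc\n");

    // ... create the thread pool and run the server as usual ...
    return 0;
}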

4. An unexpected discovery

Remembering that TCMalloc only serves small objects from the per-thread cache, while large allocations still come from the central allocator, I wondered how glibc handled this. So I changed the size of each allocation in the test program above from 1K to 1M. Just as expected, once the 64M had appeared, every further allocation still grew the process by 1M. From this it looks as if the new glibc has borrowed TCMalloc's idea quite thoroughly.
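For reference, the only change to the test program above is the size of the per-iteration allocation:

// was: char* buf = new char[1024];        // 1K per allocation
char* buf = new char[1024 * 1024];          // 1M per allocation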

A problem that had kept me busy for days was finally solved, which felt great. It also taught me that a server programmer who does not understand the compiler and the operating-system kernel is simply not up to the job; I clearly need to study more in this area.
