关于tidb:深入解析-TiFlash丨多并发下线程创建释放的阻塞问题

TiFlash 初期存在一个辣手的问题：对于简单的小查问，无论减少多少并发，TiFlash 的整机 CPU 使用率都远远不能打满。如下图：

对 TiFlash 和问题自身通过一段时间的理解后，认为方向应该在“公共组件”（全局锁、底层存储、下层服务等）上。在这个方向上做“地毯式”排查后，终于定位到问题的一个重要起因：高并发下频繁的线程创立和开释，这会引发线程在创立 / 开释过程呈现排队和阻塞景象。

因为 TiFlash 的工作模式依赖于启动大量长期新线程去做一些部分计算或者其余的事件，大量线程创立 / 开释过程呈现了排队和阻塞景象，导致利用的计算工作也被阻塞了。而且并发越多，这个问题越重大，所以 CPU 使用率不再随着并发减少而减少。
具体的排查过程，因为篇幅无限，本篇就不多赘述了。首先咱们能够结构个简略试验来复现这个问题：

首先定义三种工作模式: wait、work、workOnNewThread

wait: while 循环，期待 condition_variable

work: while 循环，每次 memcpy 20 次(每次 memcpy copy 1000000 bytes)。

workOnNewThread: while 循环，每次申请新的 thread，新 thread 内 memcpy 20 次，join 期待线程完结，反复这个过程。

接下来按不同的工作模式组合去做试验。

试验 1：40 个 work 线程

试验 2：1000 个 wait 线程, 40 个 work 线程

试验 3：40 个 workOnNewThread 线程

试验 4：120 个 workOnNewThread 线程

试验 5：500 个 workOnNewThread 线程

各试验 CPU 使用率如下:

试验 1 和 2 表明，即便试验 2 比试验 1 多了 1000 个 wait 线程，并不会因为 wait 线程数十分多而导致 CPU 打不满。过多的 wait 线程数并不会让 CPU 打不满。从起因上来讲，wait 类型的线程不参加调度，前面会讲到。另外，linux 采纳的是 cfs 调度器，工夫复杂度是 O(lgn)，所以实践上大规模可调度线程数目也并不会给调度减少显著的压力。

试验 3、4、5 表明，如果大量工作线程的工作模式是频繁申请和开释线程，能够导致 cpu 打不满的状况。

接下来带大家一起剖析下，为什么线程的频繁创立和开释会带来排队和阻塞景象，代价如此之高？

应用 GDB 查看线程的频繁创立和开释场景下的程序，能够看到线程创立和开释过程被 lll_lock_wait_private 的锁阻塞掉。如图:

#0 _lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007fbc55f60d80 in _L_lock_3443 () from /lib64/libpthread.so.0
#2 0x00007fbc55f60200 in get_cached_stack (memp=<synthetic pointer>, sizep=<synthetic pointer>)
   at allocatestack.c:175
#3 allocate_stack (stack=<synthetic pointer>, pdp=<synthetic pointer>,
   attr=0x7fbc56173400 <__default_pthread_attr>) at allocatestack.c:474
#4 __pthread_create_2_1 (newthread=0x7fb8f6c234a8, attr=0x0,
   start_routine=0x88835a0 <std::execute_native_thread_routine(void*)>, arg=0x7fbb8bd10cc0)
   at pthread_create.c:447
#5 0x0000000008883865 in __gthread_create (__args=<optimized out>
   __func=0x88835a0 <std::execute_native_thread_routine(void*)>,
   __threadid=_threadid@entry=0x7fb8f6c234a8)
   at /root/XXX/gcc-7.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/b...
#6 std::thread::_M_start_thread (this=this@entry=0x7fb8f6c234a8,state=...) 
   at ../../../../-/libstdc++-v3/src/c++11/thread.cc:163

<center>Figure 1：线程申请阻塞时堆栈 </center>

#0 _lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007fbc55f60e59 in _L_lock_4600 () from /lib64/libpthread.so.0
#2 0x00007fbc55f6089f in allocate_stack (stack=<synthetic pointer>, pdp=<synthetic pointer>
   attr=0x7fbc56173400 <__default_pthread_attr>) at allocatestack.c:552
#3 __pthread_create_2_1 (newthread=0x7fb5f1a5e8b0, attr=0x0,
   start_routine=0x88835a0 <std::execute_native_thread_routine(void*)>, arg=0x7fbb8bcd6500)
   at pthread_create.c:447
#4 0x0000000008883865 in __gthread_create (__args=<optimized out>,
   __func=0x88835a0 <std::execute_native_thread_routine(void*)>,
   __threadid=__threadid@entry=0x7fb5f1a5e8b0)
   at /root/XXX/gcc-7.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/...
#5 std::thread::_M_start_thread (this=this@entry=0x7fb5f1a5e8b0, state=...) 
   at ../../../.././libstdc++-v3/src/c++11/thread.cc:163

<center>Figure 2：线程申请阻塞时堆栈 </center>

#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007fbc55f60b71 in _L_lock_244 () from /lib64/libpthread.so.0
#2 0x00007fbc55f5ef3c in _deallocate_stack (pd=0x7fbc56173320 <stack_cache_lock>, pd@entry=0x7fb378912700) at allocatestack.c:704
#3 0x00007fbc55f60109 in __free_tcb (pd=pd@entry=0x7fb378912700) at pthread_create.c:223
#4 0x00007fbc55f61053 in pthread_join (threadid=140408798652160, thread_return=0x0) at pthread_join.c:111
#5 0x0000000008883803 in __gthread_join (__value_ptr=0x0, __threadid=<optimized out>)
        at /root/XXX/gcc-7.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:668
#6 std::thread::join (this=this@entry=0x7fbbc2005668) at ../../../.././libstdc++-v3/src/c++11/thread.cc:136

<center>Figure 3：线程开释阻塞时堆栈 </center>

从图中堆栈能够看到，线程创立时会调用allocate_stack 和 __deallocate_stack，而线程开释时会调用 __deallocate_stack，这几个函数会因为触发了名为 lll_lock_wait_private 的锁争抢而产生阻塞。

为了解释这个状况，须要对 thread 的创立开释过程进行理解。

咱们日常用到的线程，是通过 NPTL 实现的 pthread。NPTL(native posix thread library)，俗称原生 pthread 库，自身集成在 glibc 外面。在剖析了 glibc 的相干源码后，能够理解到 pthread 创立和开释的工作过程。

线程创立工作会给线程调配 stack，析构工作会开释 stack，这期间会用到stack_used 和stack_cache 两个链表:stack_used 保护的是正在被线程应用 stack，而 stack_cache 保护的是的之前线程开释后回收可利用的 stack。线程申请 stack 时，并不是间接去申请新的 stack，而是先尝试从stack_cache 里获取。

__lll_lock_wait_private 是 private 状态的 __lll_lock_wait，理论是一种基于 futex 实现的互斥锁, 前面会讲到，private 是指在这个锁只在过程外部应用，而不会跨过程。
这个锁争抢就是在线程调用 allocate_stack (线程申请时)、deallocate_stack (线程开释时) 过程中对这两个链表进行操作时产生的。

allocate_stack 过程:

Returns a usable stack for a new thread either by allocating a
   new stack or reusing a cached stack of sufficient size.
   ATTR must be non-NULL and point to a valid pthread_attr.
   PDP must be non-NULL.  */
static int
allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
                ALLOCATE_STACK_PARMS)
{
  ... // do something

  /* Get memory for the stack.  */
  if (__glibc_unlikely (attr->flags & ATTR_FLAG_STACKADDR))
    {... // do something}
  else
    {
      // main branch
      /* Allocate some anonymous memory.  If possible use the cache.  */
      ... // do something

      /* Try to get a stack from the cache.  */
      reqsize = size;
      pd = get_cached_stack (&size, &mem);
      /* 
          If get_cached_stack() succeed, it will use cached_stack 
          to do rest work. Otherwise, it will call mmap() to allocate a stack.
      */
      if (pd == NULL) // if pd == NULL, get_cached_stack() failed
        {
          ... // do something
          mem = mmap (NULL, size, prot,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
          ... // do something
          /* Prepare to modify global data.  */
          lll_lock (stack_cache_lock, LLL_PRIVATE); // global lock

          /* And add to the list of stacks in use.  */
          stack_list_add (&pd->list, &stack_used);

          lll_unlock (stack_cache_lock, LLL_PRIVATE);
          ... // do something
        }
      ... //do something
    }
  ... //do something
  return 0;
}

/* Get a stack frame from the cache.  We have to match by size since
   some blocks might be too small or far too large.  */
static struct pthread *
get_cached_stack (size_t *sizep, void **memp)
{
  size_t size = *sizep;
  struct pthread *result = NULL;
  list_t *entry;

  lll_lock (stack_cache_lock, LLL_PRIVATE); // global lock

  /* Search the cache for a matching entry.  We search for the
     smallest stack which has at least the required size.  Note that
     in normal situations the size of all allocated stacks is the
     same.  As the very least there are only a few different sizes.
     Therefore this loop will exit early most of the time with an
     exact match.  */
  list_for_each (entry, &stack_cache)
    {... // do something}

  ... // do something

  /* Dequeue the entry.  */
  stack_list_del (&result->list);

  /* And add to the list of stacks in use.  */
  stack_list_add (&result->list, &stack_used);

  /* And decrease the cache size.  */
  stack_cache_actsize -= result->stackblock_size;

  /* Release the lock early.  */
  lll_unlock (stack_cache_lock, LLL_PRIVATE);
  ... // do something
  return result;
}

<center>Figure 4: allocate_stack 代码剖析 </center>

联合堆栈和源码可知，pthread_create 最开始会调用allocate_stack 来进行线程堆栈的调配。具体过程如上图: 首先检查用户是否本人提供了 stack 空间，如果是，那么间接用用户提供的空间进行调配。不过这种状况很少见。默认状况下，用户是不提供的，而是零碎本人去调配。这种状况下会先调用 get_cached_stack，尝试从曾经调配过的 stack 列表中从新利用。如果获取 stack 失败，那么会调用 syscall mmap 进行 stack 的调配，获取 stack 后，会尝试获取全局锁lll_lock 将 stack 增加到stack_used 列表中。这个过程中，get_cached_stack 外部也会尝试获取雷同的全局锁lll_lock，首先扫描stack_cache 列表，将可用的 stack 找到，而后将该 stack 从stack_cache 列表中删除，再退出到stack_used 列表中。

deallocate_stack 过程:

void
internal_function
__deallocate_stack (struct pthread *pd)
{lll_lock (stack_cache_lock, LLL_PRIVATE); //global lock

  /* Remove the thread from the list of threads with user defined
     stacks.  */
  stack_list_del (&pd->list); 

  /* Not much to do.  Just free the mmap()ed memory.  Note that we do
     not reset the 'used' flag in the 'tid' field.  This is done by
     the kernel.  If no thread has been created yet this field is
     still zero.  */
  if (__glibc_likely (! pd->user_stack))
    (void) queue_stack (pd); 
  else
    /* Free the memory associated with the ELF TLS.  */
    _dl_deallocate_tls (TLS_TPADJ (pd), false);

  lll_unlock (stack_cache_lock, LLL_PRIVATE);
}

/* Add a stack frame which is not used anymore to the stack.  Must be
   called with the cache lock held.  */
static inline void
__attribute ((always_inline))
queue_stack (struct pthread *stack)
{
  /* We unconditionally add the stack to the list.  The memory may
     still be in use but it will not be reused until the kernel marks
     the stack as not used anymore.  */
  stack_list_add (&stack->list, &stack_cache);

  stack_cache_actsize += stack->stackblock_size;
  if (__glibc_unlikely (stack_cache_actsize > stack_cache_maxsize))
    //if stack_cache is full, release some stacks
    __free_stacks (stack_cache_maxsize); 
}

/* Free stacks until cache size is lower than LIMIT.  */
void
__free_stacks (size_t limit)
{
  /* We reduce the size of the cache.  Remove the last entries until
     the size is below the limit.  */
  list_t *entry;
  list_t *prev;

  /* Search from the end of the list.  */
  list_for_each_prev_safe (entry, prev, &stack_cache)
    {
      struct pthread *curr;

      curr = list_entry (entry, struct pthread, list);
      if (FREE_P (curr))
        {
          ... // do something
          
          /* Remove this block.  This should never fail.  If it does
             something is really wrong.  */
          if (munmap (curr->stackblock, curr->stackblock_size) != 0)
            abort ();

          /* Maybe we have freed enough.  */
          if (stack_cache_actsize <= limit)
            break;
        }
    }
}

<center>Figure 5: deallocate_stack 代码剖析 </center>

//file path: nptl/allocatestack.c
/* Maximum size in kB of cache.  */
static size_t stack_cache_maxsize = 40 * 1024 * 1024; /* 40MiBi by default.  */
static size_t stack_cache_actsize;

<center>Figure 6: stack_cache 列表容量 stack_cache_maxsize 的默认值 </center>

联合堆栈和源码可知，线程在完结时，会调用 __free_tcb 来先将线程的 TCB(Thread Control Block，线程的元数据) 开释，而后调用deallocate_stack 将 stack 回收。这个过程中，次要的瓶颈点在deallocate_stack 上。deallocate_stack 会尝试持有跟 allocate_stack 外面雷同的lll_lock 全局锁，将 stack 从stack_used 列表中删除。而后判断 stack 是否是零碎调配的，如果是，那么将其退出到stack_cache 列表中。退出后，会查看stack_cache 列表的大小是否超出阈值stack_cache_maxsize，如果是，那么会调用__free_stacks 函数开释一些 stack 直到小于阈值stack_cache_maxsize。值得注意的是，__free_stacks 函数外面会调用syscall munmap 来开释内存。对于阈值stack_cache_maxsize，如上图，从源码上看，它的默认值是 4010241024，联合代码中的正文，仿佛单位是 kB。然而起初实测后发现，这个正文是有问题，实际上stack_cache_maxsize 的单位是 Byte，也就是默认 40MB。而 thread 默认 stack 大小个别为 8~10 MB，也就是说 glibc 默认状况下大略能够帮用户 cache 4~5 个线程 stack。

由此可见，线程在创立和开释过程中，都会抢同一把全局互斥锁 lll_lock，从而在高并发线程创立 / 开释时，这些线程会产生排队、阻塞的状况。因为这个过程中同一时间只能一个线程在工作，假如线程创立 / 开释的代价是 c，那么能够大抵推算出 n 个线程创立 / 开释的均匀提早 avg_rt = (1+2+…+n)c/n = n(n+1)/2c/n=(n+1)*c/2。也就是创立 / 开释的均匀提早随并发数线性减少。在 TiFlash 上对线程创立做打点监控后发现，40 个嵌套查问（max_threads =4，注：此为 TiFlash 的并发度参数）下，线程创立 / 开释的线程数规模达到了 3500 左右，线程创立均匀提早竟然达到了 30ms! 这是提早是十分恐怖的，线程创立 / 开释曾经不像设想中那么“轻量”了。单次操作的提早曾经如此之高，对于像 TiFlash 这种嵌套型的线程创立场景，可想而知会更重大。

讲到这里，大家曾经理解到线程创立和开释过程会尝试获取全局互斥锁而产生排队阻塞的行为，不过可能还对lll_lock 一头雾水。什么是lll_lock 呢?

<center>Figure 7: futex</center>

lll 是 low level lock 的缩写，俗称底层锁，理论是基于 Futex 实现的互斥锁。Futex，全称 fast userspace mutex，是一个非 vDSO 的 system call。高版本 linux 的 mutex 也是基于 futex 实现的。futex 的设计思路认为大部分状况锁争抢是不会产生的，这时候能够间接在用户态实现锁操作。而当产生锁争抢时，lll_lock 通过非 vDSO 的零碎调用 sys_futex(FUTEX_WAIT) 陷入内核态期待被唤醒。胜利抢到锁的线程，干完活后，通过 lll_unlock 来唤醒 val 个线程(val 个别设为 1)，lll_unlock 理论通过非 vDSO 的零碎调用sys_futex(FUTEX_WAIT) 来实现唤醒操作。

从上面对 lll_lock、futex 的原理中能够理解到，如果是非争抢状况下，这个操作是比拟轻量的，也不会陷入内核态。然而在争抢状况下，岂但产生了排队阻塞，还会触发用户态和内核态的切换，线程的创立 / 开释效率雪上加霜。内核态和用户态的切换之所以慢，次要因为非 vDSO 的零碎调用。上面无妨讲讲零碎调用的代价。

古代 linux 零碎中，个别会将局部的 syscall 汇合用 vDSO 的形式裸露给过程，过程以 vDSO 的形式对 syscall 进行调用其实是很高效的，因为不波及到用户态和内核态的切换。而非 vDSO 的 syscall 就不那么侥幸了，可怜的是 Futex 就属于非 vDSO 类的。

<center>Figure 8: system call 工作形式 </center>

传统的 syscall 通过 int 0x80 中断的形式进行，CPU 把控制权交给 OS，OS 会查看传入的参数，例如SYS_gettimeofday，而后利依据寄存器中的零碎调用号查找零碎调用表，取得调用号对应的服务函数并执行比方: gettimeofday。中断会强制 CPU 保留中断前的执行状态，为了在中断完结后能够把状态复原。除了中断自身, kernel 还会做更多的事件。Linux 被分为用户空间和内核空间，内核空间的权限等级最高，能够间接对硬件做操作。为了避免用户程序的歹意行为，用户利用无奈间接拜访内核空间，要想做用户态无奈实现的工作，便须要 syscall 来间接实现，kernel 必须在用户和内核这两个内存段之间切换来实现这个操作。这种内存地址的切换，须要 CPU 寄存器内容的“大换血”，因为须要保留和复原之前寄存器的现场。此外还要对用户传入的内容做额定的权限查看，所以对性能的冲击是比拟大的。古代的 CPU 提供了 syscall 指令，因而高版本的 linux 理论通过 syscall 指令来代替原来的 int 0x80 中断形式，然而代价仍然很高。

从之前 futex 介绍中讲到，mutex 理论也是基于 futex 实现的。同样都是基于 futex 实现，为啥线程创立 / 析构就会变慢呢？

通过批改 glibc 和 kernel 的源码，在外面退出 trace 代码，定位到线程创立 / 析构在 futex 临界区内耗时次要是munmap 这个 syscall 奉献的。之前的源码剖析中讲到，当线程开释时，如果 stack_cache 列表曾经满了，会调用munmap 来将 stack 开释掉。

munmap 这个操作的耗时大略有几 us 甚至几十 us。简直奉献了整个过程耗时的 90% 以上。又因为munmap 是通过 futex 全局锁在实现的，导致这期间其余的线程创立 / 析构工作都必须阻塞。引发重大的性能降级。

所以，线程创立 / 析构慢的更深层起因是：线程析构时如果stack_cache 满了，须要调用munmap 来将 stack 开释，这个过程的 futex 临界区耗时过长！这样创立和析构在抢同一把 futex 锁的时候，都会产生阻塞景象。

可是同样都是操作内存，用于申请内存的 mmap 却十分快，根本不到 1us 就能够执行完结，为什么两者反差如此之大？接下来咱们剖析下 munmap 为什么会这么慢。

首先简要讲下 munmap 的工作过程：munmap 会依据要开释的内存范畴寻找对应的虚拟内存区 VMA（virtual memory area），如果要开释的内存范畴的首尾正好落在某个 VMA 的两头，那么须要对对应的 VMA 进行决裂。而后解映射 unmap、开释对应的 VMA、页表。并作废相干的 TLB。

通过在 kernel 的中退出的 trace 发现，耗时次要产生在 tlb_flush_mmu 中，这个是驱赶 TLB 的过程。因为 munmap 在开释内存后，须要将过期生效的 TLB 作废掉，所以会调用这个函数。

再深刻上来，如果波及的 TLB 在多个 CPU 核上都存在，tlb_flush_mmu 会调用smp_call_function_many 来在这些核上都做一遍 flush TLB，并且以同步的形式期待该过程执行结束。单核 Flush TLB 的操作通过单核中断实现，多核 Flush TLB 须要通过核间中断 IPI 来实现。

通过 trace 定位，耗时次要是 IPI 奉献的，光是 IPI 通信的耗时就有几 us 甚至几十 us，而 flush TLB 自身却不到 1us。

<center>Figure 9: 核间中断 IPI 工作形式 </center>

IPI 的具体工作形式如上图，多个 CPU 外围通过系统总线 System Bus 进行 IPI 音讯的通信，当一个 CPU 核须要在多个 CPU 外围上做 IPI 工作时，该外围会发送 IPI 申请到 System Bus 并期待其余外围全副实现 IPI 操作，相干的 CPU 外围上收到 IPI 申请后处理本人的 Interrupt 工作，实现后通过 System Bus 告诉发起方。因为这个过程须要通过 CPU 内部的 System Bus 来实现，并且发起方在发送 IPI 到期待其余外围实现中断操作的过程中只能傻等着，所以 overhead 十分高（几 us 甚至更高）。

翻看他人的研究成果, 更加验证了 IPI 是很重的操作。依据 18 年发表的论文 <<Latr: Lazy Translation Coherence>> 的钻研表明，

“an IPI takes up to 6.6 µs for 120 cores with 8 sockets and 2.7 µs for 16 cores with 2 sockets.”

也就是说一次 IPI 操作的 overhead 大略就是 us 级别的。

除了线程创立和开释的问题，线程数也是一个比拟值得关注的问题。尤其是 running 线程数多了后，context switch 和调度的代价可能会对性能带来冲击。为什么这里刻意强调是 running 态线程呢? 因为处于阻塞态的线程（锁期待、nanosleep 等），理论并不参加调度也不会产生上下文切换。可能很多人都有这样的误会就是: 线程数 (无论是否处于阻塞态) 多了，上下文切换、调度代价就肯定高，实际上并不完全正确的。因为对于处于阻塞态的线程，调度器不会调配给他任何 CPU 工夫，直到被唤醒为止。Linux 的调度器实现是 CFS（Completely Fair Scheduler），它实际上在每个 CPU core 上保护了一个基于红黑树的处于 runnable 态线程的 queue，也叫 runqueue。这个 runqueue 的调度代价为 log(n)（n 为该队列中 runnable 线程的数目）。因为 CFS 只对 running 态线程做调度，所以调度和 context switch 次要产生在 running 线程之间。方才详细分析了调度器 CFS 的代价，接下来讲一下 context switch 的。

context switch 分为过程内和过程间，因为咱们个别都是单过程下的多线程开发，所以这里的上下文切换次要是指过程内线程的切换代价。过程内线程切换绝对于跨过程切换效率绝对较高，因为不产生 TLB(Translation lookaside buffer) flush。不过过程内线程切换的代价也不低，因为会产生寄存器现场、TCB(thread control block)的保留和复原，还有 CPU cache 的局部生效。

之前版本 TiFlash 在高并发查问下线程总数能够达到 5000 多，的确是一个比拟恐怖的数目。然而 runnning 线程数个别不超过 100 个。假如在 40 个逻辑核的机器上运行, 这时候的调度代价最坏状况下不超过 lg(100) , 现实状态应该是 lg(100/40) , 代价绝对较小。而上下文切换代价大略相当于几十个 running 线程的量级, 也属于比拟可控的状态。

这里, 我也做了个试验来比照 5000 个 running 线程和 5000 个 blocked 线程的耗时比照。首先定义了 3 种线程状态：work 是从 0 到 50000000 做计数；Yield 是循环做sched_yeild , 让线程不做任何计算工作就让出 CPU 执行权并维持 runnable 状态，这样的目标是在减少 running 态线程数目的同时，不引入额定计算工作量。Wait 是做condition_variable.wait(lock, false)。耗时后果如下：

能够看到，因为锁期待是非 running 的线程，试验一和试验二的耗时相差不大，阐明 5000 个阻塞态线程并没对性能造成显著冲击。而试验三，500 个只做上下文切换的线程(相当于不做计算工作的 running 态线程)，数目上没有试验二的 wait 线程多，即便不做别的计算工作，也给性能造成微小的冲击。这带来的调度和上下文切换代价就相当显著了，耗时间接涨了近 10 倍多。这阐明，调度和上下文切换代价次要跟非阻塞态的 running 线程数无关。这一点，有助于咱们当前在剖析性能问题时失去更精确的判断。

咱们在排查问题时，在监控上其实踩了不少坑，一个是系统监控工具 top 挖的。咱们在 top 下看到 running threads 数目低于预期，常常在个位数彷徨。让咱们误以为问题出在了零碎的上下文无关。然而，主机的 CPU 使用率却能达到 80%。可是细想又感觉不对劲：如果大部分工夫都是几个或者十几个线程在工作，对于一台 40 逻辑核的主机来说，是不可能达到这么高的 CPU 使用率的，这是怎么回事呢？

//Entry Point
static void procs_refresh (void) {
   ...
   read_something = Thread_mode ? readeither : readproc;

   for (;;) {
      ...
      // on the way to n_alloc, the library will allocate the underlying
      // proc_t storage whenever our private_ppt[] pointer is NULL...
      // read_something() is function readeither() in Thread_mode!
      if (!(ptask = read_something(PT, private_ppt[n_used]))) break;
      procs_hlp((private_ppt[n_used] = ptask));  // tally this proc_t
   }

   closeproc(PT);
   ...
} // end: procs_refresh

// readeither() is function pointer of read_something() in Thread_mode;
// readeither: return a pointer to a proc_t filled with requested info about
// the next unique process or task available.  If no more are available,
// return a null pointer (boolean false).  Use the passed buffer instead
// of allocating space if it is non-NULL.
proc_t* readeither (PROCTAB *restrict const PT, proc_t *restrict x) {
    ...

next_proc:
    ...

next_task:
    // fills in our path, plus x->tid and x->tgid
    // find next thread
    if ((!(PT->taskfinder(PT,&skel_p,x,path)))   // simple_nexttid()
    || (!(ret = PT->taskreader(PT,new_p,x,path)))) { // simple_readtask
        goto next_proc;
    }
    if (!new_p) {
        new_p = ret;
        canary = new_p->tid;
    }
    return ret;

end_procs:
    if (!saved_x) free(x);
    return NULL;
}

// simple_nexttid() is function simple_nexttid() actually
// This finds tasks in /proc/*/task/ in the traditional way.
// Return non-zero on success.
static int simple_nexttid(PROCTAB *restrict const PT, const proc_t *restrict const p, proc_t *restrict const t, char *restrict const path) {
  static struct dirent *ent;        /* dirent handle */
  if(PT->taskdir_user != p->tgid){ // init
    if(PT->taskdir){closedir(PT->taskdir); 
    }
    // use "path" as some tmp space
    // get iterator of directory  /proc/[PID]/task
    snprintf(path, PROCPATHLEN, "/proc/%d/task", p->tgid);
    PT->taskdir = opendir(path);
    if(!PT->taskdir) return 0;
    PT->taskdir_user = p->tgid;
  }
  for (;;) { // iterate files in current directory
    ent = readdir(PT->taskdir); // read state file of a thread
    if(unlikely(unlikely(!ent) || unlikely(!ent->d_name[0]))) return 0;
    if(likely(likely(*ent->d_name > '0') && likely(*ent->d_name <= '9'))) break;
  }
  ...
  return 1;
}

// how TOP statisticizes state of threads 
switch (this->state) {
      case 'R':
         Frame_running++;
         break;
      case 't':     // 't' (tracing stop)
      case 'T':
         Frame_stopped++;
         break;
      case 'Z':
         Frame_zombied++;
         break;
      default:
         /* the following states are counted as sleeping state
            currently: 'D' (disk sleep),
                       'I' (idle),
                       'P' (parked),
                       'S' (sleeping),
                       'X' (dead - actually 'dying' & probably never seen)
         */
         Frame_sleepin++;
         break;
   }

<center>Figure 10: top 源码剖析 </center>

剖析了 top 的源码后，终于明确了起因。原来 top 显示的不是过后的 ” 刹时状况 ”，因为 top 不会把程序停掉。具体的工作过程如上图，top 会扫描一遍过后的线程列表，而后一个一个去取状态，这个过程中程序是持续运行的，所以 top 扫完列表后，之后新启动线程是没记录进去的，而旧线程一部分曾经完结了，完结状态的线程会算到 sleeping 里。所以对于高并发线程频繁申请和开释的场景下，top 上看到的 running 数就是会偏少的。

所以 top 中的 running 线程数，对于线程频繁创立和开释的程序来说，这个指标是不精确的。

此外，对于 pipeline 模式的 TiFlash，数据在 pipeline 流动的过程中，同一数据只会呈现在 pipleline 的一个环节上，算子有数据就解决，没数据就期待（GDB 上看大部分线程都是这个状态）。pipeline 中大部分的环节都处于没数据期待，有数据又很快完结的情况。监控工程中没有停掉整个 TiFlash，所以对于每个线程了，大概率会取到这个线程的期待状态。

在整个问题的排查过程中，有一些办法是能够积淀下来，当前的开发、排查工作中，仍然能够用到：

多线程开发中，应尽量采纳线程池、协程等伎俩来防止频繁的线程创立和开释。
尽量在简略环境下复现问题，以缩小会对排查产生烦扰的因素。
管制 running 态的线程数目，大于 CPU 核数后会产生多余的上下文切换代价。
在线程期待资源的场景的开发中，尽量应用 lock，cv 等。如果用 sleep，睡眠距离应尽量设得长一点，以缩小不必要的线程唤醒。
辩证地对待监控工具，当剖析后果和监控数据有矛盾时，不能排除对监控工具自身的质疑。此外，要仔细阅读监控工具的文档和指标阐明，防止对指标产生误读。
多线程 hang、slow、争抢问题排查：pstack、GDB 看各个线程的状态。
性能热点工具：perf、flamegraph。

关于tidb:深入解析-TiFlash丨多并发下线程创建释放的阻塞问题

试验复现、验证

定义

各试验

具体试验后果

后果剖析

多并发下，线程创立和开释会产生什么?

GDB 上看到的阻塞景象

thread 创立和开释的工作过程

lll_lock 和 Futex

零碎调用的代价

mutex 理论也是基于 futex 实现的，为啥线程创立 / 析构就会变慢呢？

munmap、TLB shootdown 和核间中断 IPI

Context switch 和 CFS

警觉系统监控的误导

经验总结

Just My Socks（注册教程内含优惠码）

关于tidb:深入解析-TiFlash丨多并发下线程创建释放的阻塞问题

试验复现、验证

定义

各试验

具体试验后果

后果剖析

多并发下，线程创立和开释会产生什么?

GDB 上看到的阻塞景象

thread 创立和开释的工作过程

lll_lock 和 Futex

零碎调用的代价

mutex 理论也是基于 futex 实现的，为啥线程创立 / 析构就会变慢呢？

munmap、TLB shootdown 和核间中断 IPI

Context switch 和 CFS

警觉系统监控的误导

经验总结

Just My Socks（注册教程 内含优惠码）

Just My Socks（注册教程内含优惠码）