
Troubleshooting 100% CPU usage in production

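When a production service shows abnormally high CPU usage or runs out of memory, a Linux host already provides the tools to see what is consuming resources and where the problem lies. The steps below trace a CPU spike in a Java process down to the thread responsible.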

Find the process with high CPU usage

 [log@task-a-shprod-1 ~]$ top
top - 12:00:19 up 20 days, 19:46,  1 user,  load average: 2.42, 1.71, 2.40
Tasks:  98 total,   2 running,  96 sleeping,   0 stopped,   0 zombie
%Cpu(s): 53.1 us, 16.9 sy,  0.0 ni, 27.8 id,  0.0 wa,  0.0 hi,  2.3 si,  0.0 st
KiB Mem : 16267724 total,   353600 free,  8349840 used,  7564284 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7557728 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
15980 root      20   0 8375964 3.148g  14824 S 175.7 20.3   3329:10 java
  348 root      20   0   62560  20068  19612 R  85.3  0.1  13592:20 systemd-journal
 8978 root      20   0 7898304 1.205g  14804 S  22.3  7.8  56:33.51 java
10214 root      20   0 8065148 1.695g  14800 S   1.3 10.9  15:43.18 java
 1038 root      10 -10  128800  12248   9300 S   1.0  0.1 294:46.14 AliYunDun
 9605 root      20   0 7970496 1.689g  14784 S   1.0 10.9   4:29.24 java
    3 root      20   0       0      0      0 S   0.3  0.0  12:25.10 ksoftirqd/0
    9 root      20   0       0      0      0 S   0.3  0.0  43:15.84 rcu_sched
   13 root      20   0       0      0      0 S   0.3  0.0  14:10.83 ksoftirqd/1
   18 root      20   0       0      0      0 S   0.3  0.0  13:32.39 ksoftirqd/2
   23 root      20   0       0      0      0 S   0.3  0.0  16:09.67 ksoftirqd/3
 1044 root      20   0  263504  41520   5936 S   0.3  0.3  40:29.56 ilogtail
    1 root      20   0   43384   3788   2496 S   0.0  0.0   0:25.62 systemd

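The biggest consumer is the java process with PID 15980, at 175.7% CPU. By default top reports this figure relative to a single core, so a value above 100% means the process is keeping more than one core busy.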

Find the threads in that process that use the most CPU

 [log@task-a-shprod-1 ~]$ top -Hp 15980
top - 12:01:25 up 20 days, 19:48,  1 user,  load average: 4.98, 2.55, 2.64
Threads:  58 total,   2 running,  56 sleeping,   0 stopped,   0 zombie
%Cpu(s): 65.4 us, 15.2 sy,  0.0 ni, 17.2 id,  0.1 wa,  0.0 hi,  2.1 si,  0.0 st
KiB Mem : 16267724 total,   322392 free,  8380124 used,  7565208 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7527436 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
16131 root      20   0 8375964 3.223g  14824 S 44.9 20.8 651:07.49 java
16130 root      20   0 8375964 3.223g  14824 S 35.9 20.8 628:32.23 java
16132 root      20   0 8375964 3.223g  14824 R 30.9 20.8 569:00.13 java
16133 root      20   0 8375964 3.223g  14824 S 25.9 20.8 638:04.25 java
16129 root      20   0 8375964 3.223g  14824 R 12.0 20.8 678:13.62 java
15982 root      20   0 8375964 3.223g  14824 S  0.7 20.8  12:06.16 java
15983 root      20   0 8375964 3.223g  14824 S  0.7 20.8  12:07.24 java
16149 root      20   0 8375964 3.223g  14824 S  0.7 20.8  25:09.56 java
15984 root      20   0 8375964 3.223g  14824 S  0.3 20.8  12:07.52 java
15985 root      20   0 8375964 3.223g  14824 S  0.3 20.8  12:04.10 java
15987 root      20   0 8375964 3.223g  14824 S  0.3 20.8   5:59.05 java
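The same lookup can be done without the interactive screen. The following is a minimal sketch, assuming a procps-style top whose batch output has the column layout shown above; `hottest_tid` is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: read `top -H -b` output on stdin and print the
# thread ID from the first row below the column header. top sorts rows
# by %CPU by default, so that row is the busiest thread in the sample.
hottest_tid() {
    awk '
        found && NF { print $1; exit }   # first non-empty row after the header
        $1 == "PID" { found = 1 }        # the column header line
    '
}
```

A possible invocation is `top -H -b -n 1 -p 15980 | hottest_tid`. This is a single sample, so it is worth running it a few times to confirm that the same thread stays at the top.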

Convert the ID of the most CPU-intensive thread to hexadecimal

[log@task-a-shprod-1 ~]$ printf "%x\n" 16131
3f03
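To make the next step less error-prone, the conversion can also emit the exact token to search for in a thread dump. This is a small sketch, assuming a JDK whose jstack prints native thread IDs in hex as `nid=0x...`; `tid_to_nid` is a hypothetical helper name.

```shell
# Hypothetical helper: turn a decimal thread ID taken from top into the
# "nid=0x..." token used in a thread dump.
tid_to_nid() {
    printf 'nid=0x%x\n' "$1"
}

tid_to_nid 16131   # prints nid=0x3f03
```

Searching for the full `nid=0x3f03` token rather than the bare `3f03` avoids accidental matches against memory addresses elsewhere in the dump.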

Locate the code that is causing the problem

[log@task-a-shprod-1 ~]$ jstack 15980 | grep -A 30 'nid=0x3f03'

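The stack trace printed for that thread shows the code that is consuming the CPU. The hex conversion is needed because jstack, at least on older JDKs such as 8, prints each thread's native ID in hexadecimal, in the form nid=0x3f03.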
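A fixed window of 30 lines can cut a deep stack short or spill into the next thread. As an alternative, here is a minimal sketch that prints exactly one thread's entry. It assumes the dump separates thread entries with a blank line and writes the native ID as `nid=0x...` followed by a space; `thread_stack` is a hypothetical helper name.

```shell
# Hypothetical helper: print one thread's entry from a jstack dump on stdin.
# $1 is the native thread ID in hex without the 0x prefix, e.g. 3f03.
thread_stack() {
    # The trailing space in the key keeps 3f03 from matching 3f031.
    awk -v key="nid=0x$1 " '
        index($0, key)   { printing = 1 }   # header line of the wanted thread
        printing && /^$/ { exit }           # a blank line ends the entry
        printing         { print }
    '
}
```

A possible invocation is `jstack 15980 | thread_stack 3f03`. jstack generally has to run as the same user that owns the JVM, which in the listing above is root.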
