关于go:51-Golang实战Go程序分析利器pprof

不晓得你有没有遇到这种状况，Go 服务总是是不是的响应十分慢排查发现所有的依赖还都挺快，感觉是 Go 服务本身慢又不晓得怎么慢在哪里；或者说申请甚至齐全没响应，Go 服务明明在失常运行，申请去哪儿了呢？或者说 Go 服务内存占用总是居高不下，内存都节约在哪呢？这些问题都能够通过 pprof 剖析，本篇文章将为你演示如何基于 pprof 分析程序性能（问题）。

为什么 pprof 能够帮忙咱们剖析 Go 程序性能呢？因为它能够采集程序运行时数据：比如说协程栈，这样服务阻塞在哪里是不是高深莫测了；比如说内存分配情况包含调用栈，这样哪里消耗内存是不是也分明了。

当然想要通过 pprof 分析程序性能，也是要引入一丢丢代码吧，毕竟采集程序运行时数据，是须要消耗一些资源的。引入 pprof 采样也非常简单，几行代码就能搞定：

package main

import (
    "net/http"
    _ "net/http/pprof"
)

func main() {go func() {http.ListenAndServe("0.0.0.0:6060", nil)
    }()}

就是这么简略就能够了。那怎么看 pprof 采集的数据呢？同样很简略，浏览器拜访某个接口就能够，接口返回页面如下：

pprof 采集的数据指标阐明如下：

// 浏览器输出：http://127.0.0.1:8888/debug/pprof/

// 内存调配采样，只有引入 net/http/pprof 才会采集
allocs: A sampling of all past memory allocations

// 采集协程阻塞事件，默认不开启；可通过 runtime.SetBlockProfileRate 开启
block: Stack traces that led to blocking on synchronization primitives

// 程序启动命令
cmdline: The command line invocation of the current program

// 协程调用栈桢
goroutine: Stack traces of all current goroutines

// 与 allocs 采样数据一样，然而可用来采样存活对象内存调配（采样前可运行 GC）heap: A sampling of memory allocations of live objects. You can specify the gc GET parameter to run GC before taking the heap sample.

// 采集锁事件，默认不开启；可通过 runtime.SetMutexProfileFraction 开启
mutex: Stack traces of holders of contended mutexes

//CPU 采样，能够通过参数 seconds 指定采样工夫；可用来分析程序热点
profile: CPU profile. You can specify the duration in the seconds GET parameter. After you get the profile file, use the go tool pprof command to investigate the profile.

// 线程创立调用栈
threadcreate: Stack traces that led to the creation of new OS threads

// 记录 Go 服务在做什么，采集包含协程调度，GC，零碎调用等事件
trace: A trace of execution of the current program. You can specify the duration in the seconds GET parameter. After you get the trace file, use the go tool trace command to investigate the trace.

举一个例子，goroutine 指标输入所有线程栈桢，可用于分析程序阻塞状况。比方，大量协程阻塞在获取锁这一处，那是不是因为锁没被开释导致死锁了；比方，大量协程阻塞在管道写操作（生产生产业务），那是不是因为生产协程解决太慢了，导致管道数据满了，生产者无奈写入数据了，等等。goroutine 指标输入内容如下：

http://127.0.0.1:8888/debug/pprof/goroutine?debug=1

// 第一个数字示意协程数
1 @ 0x10390d6 0x1032357 0x1063929 0x10d0372 0x10d16da 0x10d16c8 0x1126c29 0x1135d85 0x131dc9f 0x1069921
#    0x1063928    internal/poll.runtime_pollWait+0x88        /go1.18/src/runtime/netpoll.go:302
#    0x10d0371    internal/poll.(*pollDesc).wait+0x31        /go1.18/src/internal/poll/fd_poll_runtime.go:83
#    0x10d16d9    internal/poll.(*pollDesc).waitRead+0x259    /go1.18/src/internal/poll/fd_poll_runtime.go:88
#    0x10d16c7    internal/poll.(*FD).Read+0x247            /go1.18/src/internal/poll/fd_unix.go:167
#    0x1126c28    net.(*netFD).Read+0x28                /go1.18/src/net/fd_posix.go:55
#    0x1135d84    net.(*conn).Read+0x44                /go1.18/src/net/net.go:183
#    0x131dc9e    net/http.(*connReader).backgroundRead+0x3e    /go1.18/src/net/http/server.go:672

......

为什么只是引入 net/http/pprof 就能采集这些数据指标呢（当然只是局部），留神下面的代码事例，咱们还启动了一个 web 服务，而且没有设置 HTTP 申请处理器；引入引入 net/http/pprof 包的时候，init 办法就默认注册好 HTTP 申请处理器了，所以咱们能力通过这些接口获取这些数据指标。

func init() {
    // 前缀匹配，蕴含 allocs、heap、goroutine 等
    http.HandleFunc("/debug/pprof/", Index)

    http.HandleFunc("/debug/pprof/cmdline", Cmdline)
    http.HandleFunc("/debug/pprof/profile", Profile)
    http.HandleFunc("/debug/pprof/symbol", Symbol)
    http.HandleFunc("/debug/pprof/trace", Trace)
}

内存的指标可通过 heap 或者 allocs 查看，这两个指标采样的数据是一样，然而 heap 能够用来采样存活对象的内存调配（采样前可运行 GC，这样剩下的都是存活对象了）。以 heap 为例，内存指标展现如下：

http://127.0.0.1:8888/debug/pprof/heap?debug=1

//heap profile 汇总数据: 
//10: 29728 [16: 54144] 含意为 inuse 对象数目（已调配的，去除了已开释的）: inuse 字节 [已调配对象数目: 已调配字节] 
//heap/1048576 均匀采样频率 1048576 字节
heap profile: 10: 29728 [16: 54144] @ heap/1048576

//1: 18432 [1: 18432] 含意同下面；该调用栈内存分配情况
1: 18432 [1: 18432] @ 0x1363bc9 0x1365cce 0x13660f8 0x1366c51 0x1366c46 0x1669517 0x1045c06 0x1045b51 0x1045b51 0x1045b51 0x1045b51 0x1045b51 0x1038c33 0x1069881
#    0x1363bc8    regexp.onePassCopy+0x48                    /go1.18/src/regexp/onepass.go:226
#    0x1365ccd    regexp.compileOnePass+0x14d                /go1.18/src/regexp/onepass.go:498
......

// 其余调用栈


// 总的内存统计指标
# runtime.MemStats
# Alloc = 1311912        // 已分堆内存字节数，不蕴含已开释内存，与统一（调配时累加，开释减）# TotalAlloc = 5007040    // 总的调配堆内存字节数，蕴含已开释内存
# Sys = 16598024        // 从操作系统申请的内存总数（包含栈，堆，还有 Go 的一些原生对象等）# Mallocs = 34063        // 内存调配数（调配的 mspan 内存块数目之和）# Frees = 25879            // 内存开释数（开释也就是回收的 mspan 内存块数目之和）# HeapAlloc = 1311912    // 同 Alloc
# HeapSys = 7602176        // 从操作系统申请的，仅用于堆
# HeapIdle = 4104192    //mspan 没被调配就是 idle 状态，HeapIdle 等于这些状态的 mspan 内存之和
# HeapInuse = 3497984    //mspan 局部内存被调配了就是 inuse 状态，HeapInuse 等于这些状态的 mspan 内存之和
# HeapReleased = 2449408    // 归还给操作系统的堆内存
# HeapObjects = 8184        // 堆对象数目，等于 Mallocs-Frees
# Stack = 786432 / 786432    //inuse 的栈内存 / 向操作系统申请的栈内存

# NextGC = 4194304            // 下次 GC 完结后的堆内存指标大小
# LastGC = 1663660279482080000    // 上次 GC 完结工夫
# PauseNs = [42600 .......]        //stopTheWorld 暂停工夫，单位纳秒；256 大小的循环数组
# PauseEnd = [1663582175653806000 ......]    //GC 完结工夫；256 大小的循环数组
# GCCPUFraction = 2.6664927576867443e-06    //GC 消耗 CPU 工夫占整体的比例
# MaxRSS = 14467072                            // 过程调配的常驻内存大小
// 省略了局部

能够看到，内存指标还是比较完善的，包含以后内存调配一些指标，历时 GC 指标，以及每个调用栈的内存分配器状况都有（当然只是采样）。问题来了，怎么采集到内存调配调用栈呢？而且就算只是采样，也是会耗费资源的，假如我没开启 pprof 也不剖析这些数据，会采集吗？当然不会了。

首先，要采集内存调配调用栈，必定是须要在内存调配入口也就是 mallocgc 采集了，如果判断须要采样，则记录以后调用栈以及内存调配大小；貌似还不够，调用栈的内存调配指标是 inuse 状态调配的内存 / 对象，也就是内存被开释后，须要去除在外的，如果只在 mallocgc 处采集数据的话，显然是不够的。所以，真正的逻辑应该是，mallocgc 时采集内存调配，垃圾回收 sweep 时采集内存回收，这两相减不就是 inuse 状态的内存。采集过程能够参考这两个函数：

// Called by malloc to record a profiled block.
func mProf_Malloc(p unsafe.Pointer, size uintptr)
// Called when freeing a profiled block.
func mProf_Free(b *bucket, size uintptr)

第二个问题，什么状况下才开启内存采集？当然是引入 net/http/pprof 包了。不过，这一逻辑还是有些简单，全局变量 runtime.disableMemoryProfiling 管制是否开启内存指标采集，默认为 false，然而在 Go 编译阶段，连接器检测到代码中没有引入 runtime.MemProfile，则设置为 true，也就是禁止内存指标采集；当咱们引入 net/http/pprof 包时，代码中也就引入了 runtime.MemProfile。

// disableMemoryProfiling is set by the linker if runtime.MemProfile
// is not used and the link type guarantees nobody else could use it
// elsewhere.
var disableMemoryProfiling bool

虽说通过 heap 或者 allocs 能够查看内存指标，甚至还能蕴含调用栈，然而这么多数据，如何能疾速的定位到哪些代码在大量分配内存呢？Go 语言还提供了 pprof 工具，能够帮忙咱们剖析这些数据，应用也很简略：

go tool pprof http://127.0.0.1:8888/debug/pprof/heap?debug=1

//help 查看反对命令；top 命令显示内存申请最多的调用，默认 top10
//flat 申请内存大小；flat% 申请内存占用的比例；sum% 以后行后面所有行总申请内存占用的比例
//cum 累计值，该函数以及调用栈所有函数总得申请内存；cum% 申请内存占用的比例
(pprof) top
     flat  flat%   sum%        cum   cum%
 3587.94kB 87.51% 87.51%  3587.94kB 87.51%  runtime.allocm
  512.06kB 12.49%   100%   512.06kB 12.49%  sync.(*poolChain).pushHead

看到了吧，基于 pprof 工具能够很不便找出申请内存最多的调用。这是本地启动的一个测试服务，根本没有申请拜访，所以申请内存最多的还是创立线程相干逻辑。另外，pprof 工具也能够帮忙咱们将内存申请指标绘制成图片(svg、web、pdf 命令都能够)，图片蕴含函数调用链路每个节点申请的内存指标（线越粗阐明该调用栈内存调配越多）：

构想这么一个场景，某一天你忽然收到报警，线上机器 cpu 利用率忽然飙升且居高不下，简略定位发现是 Go 程序占用太多 cpu。接下来怎么排查呢？能够应用
cpu profile 剖析，其可用来分析程序热点，比方发现程序中的死循环。

为什么 cpu profile 能够分析程序热点呢？因为开启之后，会定时记录当然程序执行点 (就是程序正在执行哪一行代码) 以及调用栈，采集一段时间之后（默认 30 秒），依据这些数据是不是就能剖析进去热点代码了？

cpu profile 的指标输入是一个文件，须要借助 pprof 工具剖析，应用形式如下：

go tool pprof http://127.0.0.1:8888/debug/pprof/profile?debug=1
(pprof)

这里咱们基于后面介绍的 Go 性能测试工具，阐明如何剖析 cpu 指标：

package demo

import (
    "strings"
    "testing"
)

func BenchmarkStringPlus(b *testing.B) {
    s := ""
    for i := 0; i < b.N; i++ {s += "abc"}
}

func BenchmarkStringBuilder(b *testing.B) {build := strings.Builder{}
    for i := 0; i < b.N; i++ {build.WriteString("abc")
    }
}

// -cpuprofile 采集 cpu 指标，并输入的文件
//go test -benchtime 100000x -cpuprofile cpu.out  -bench .

基于 pprof 剖析 cpu 指标文件：

go tool pprof cpu.out

// 查问消耗 cpu top10 的调用
//flat 消耗 cpu 工夫；flat% 消耗 cpu 工夫的比例；sum% 以后行后面消耗 cpu 工夫的比例
//cum 累计值，该函数以及调用栈所有函数消耗 cpu 工夫；cum% 消耗 cpu 工夫的比例
(pprof) top
Showing nodes accounting for 2.72s, 88.03% of 3.09s total
Dropped 26 nodes (cum <= 0.02s)
Showing top 10 nodes out of 82
      flat  flat%   sum%        cum   cum%
     1.02s 33.01% 33.01%      1.07s 34.63%  runtime.kevent
     0.69s 22.33% 55.34%      0.76s 24.60%  runtime.pthread_cond_wait
     0.30s  9.71% 65.05%      0.30s  9.71%  runtime.pthread_kill
     0.18s  5.83% 70.87%      0.18s  5.83%  runtime.pthread_cond_signal
     0.15s  4.85% 75.73%      0.15s  4.85%  runtime.libcCall
     0.10s  3.24% 78.96%      0.10s  3.24%  runtime.memmove
     0.10s  3.24% 82.20%      0.15s  4.85%  runtime.scanobject
     0.08s  2.59% 84.79%      0.08s  2.59%  runtime.gcFlushBgCredit
     0.07s  2.27% 87.06%      0.12s  3.88%  runtime.scanblock
     0.03s  0.97% 88.03%      0.04s  1.29%  runtime.casgstatus

字符串的相加是须要调配大量的内存，所以能够看到这次的性能测试，垃圾回收占用了较多 cpu 工夫。另外咱们还能够基于其余命令，剖析出某一个函数的代码具体是哪里消耗工夫多（list 命令），：

//list 命令参数为匹配模式（函数调用含糊匹配）(pprof) list scanobject

Total: 3.09s
ROUTINE ======================== runtime.scanobject in /go1.18/src/runtime/mgcmark.go
     100ms      150ms (flat, cum)  4.85% of Total
 
         .       10ms   1246:    hbits := heapBitsForAddr(b)

      70ms       80ms   1290:    for i = 0; i < n; i, hbits = i+goarch.PtrSize, hbits.next() {}

另外，pprof 工具也能够帮忙咱们将 cpu 指标绘制成图片(svg、web、pdf 命令都能够)，图片蕴含函数调用链路每个节点的 cpu 耗时（线越粗阐明该调用 cpu 耗时越多）：

当然 pprof 还提供 web 拜访，如 top，list，耗时图片等在 web 界面查看也十分不便，web 界面还能查看经典的火焰图：

// -http 启动 web 服务
go tool pprof -http=:9999 cpu.out
Serving web UI on http://localhost:9999

最初再简略阐明一个问题：Go 语言是如何定时采集 cpu 指标的，又是怎么采集 cpu 指标的呢？pprof.StartCPUProfile 函数用来开启 cpu 指标的采集，”/debug/pprof/profile” 接口也是通过这个函数实现的。定时底层其实也是通过定时器实现的（timer_create），定时器触发时（默认采集频率 100hz，也就是 1 秒 100 次），会给创立定时器的线程发送 SIGPROF 信号，信号处理器（sighandler）记录以后程序执行点（pc 寄存器）以及调用栈。

// 创立线程级 cpu profile: 
-setThreadCPUProfiler(hz)
    -timer_create(_CLOCK_THREAD_CPUTIME_ID, &sevp, &timerid)

CLOCK_THREAD_CPUTIME_ID (since Linux 2.6.12) A clock that measures (user and system) CPU time consumed by the calling thread.

咱们曾经直到，block 采集协程阻塞事件（包含调用栈桢），默认不开启，须要通过 runtime.SetBlockProfileRate 函数开启；mutex 采集锁事件（包含调用栈桢），默认不开启，须要通过 runtime.SetMutexProfileFraction 函数开启。

以 block 指标为例，输入的数据如下：

//cycles 取的是 cputick，这里的值是均匀每秒钟 cputick 递增多少；cycles/second=1996821317

// 第一个值是协程阻塞的 cputick 数，第二个值是阻塞次数，依据 cycles/second 就能够计算均匀阻塞工夫了
161357271645546 364 @ 0x1048768 0x13a4014 0x1069921
#    0x1048767    runtime.selectgo+0x407                            /go1.18/src/runtime/select.go:509
#    0x13a4013    github.com/go-redis/redis/v8/internal/pool.(*ConnPool).reaper+0xd3    /xxx/vendor/github.com/go-redis/redis/v8/internal/pool/pool.go:485

留神 block 以及 mutex 输入的工夫都是 cputick，与工夫有一个换算关系（cycles/second）。block 含意是阻塞事件，协程抢锁或者管道的读写阻塞等，都会导致协程的阻塞，协程阻塞时会记录阻塞工夫点 cputick，解除阻塞时同样记录时间点 cputick，以此计算出协程阻塞工夫，同时在解除阻塞时记录调用栈。mutex 含意是锁持有事件，在开释锁时记录时间以及调用栈。

同样的 pprof 工具能够帮忙咱们剖析协程阻塞状况：

go tool pprof http://127.0.0.1:8888/debug/pprof/block?debug=1

(pprof) top
Dropped 102 nodes (cum <= 0.20hrs)
      flat  flat%   sum%        cum   cum%
  40.97hrs   100%   100%   40.97hrs   100%  runtime.selectgo
         0     0%   100%      23hrs 56.13%  github.com/go-redis/redis/v8/internal/pool.(*ConnPool).reaper
         0     0%   100%   17.97hrs 43.85%  github.com/go-sql-driver/mysql.(*mysqlConn).startWatcher.func1
         0     0%   100%   40.97hrs   100%  runtime.goexit

trace 可跟踪 Go 程序多种类型事件，包含协程调度，零碎调用，锁，垃圾回收等等，所以 trace 是一款 Go 程序性能问题利器。咱们能够通过 trace.Start 开启 trace 追踪，”/debug/pprof/trace” 接口就是基于该函数实现的。

// 默认追踪 1 秒
go tool trace http://127.0.0.1:8888/debug/pprof/trace?seconds=1
Parsing trace...
Splitting trace...
Opening browser. Trace viewer is listening on http://127.0.0.1:64557

trace 工具主动启动 web 服务，不便咱们通过 web 界面查看，次要蕴含这几个方面的剖析：

// 展现每个逻辑处理器 P 每时刻在做什么
View trace
// 协程剖析
Goroutine analysis
// 网络阻塞
Network blocking profile (⬇)
// 同步阻塞，如锁之类的
Synchronization blocking profile (⬇)
// 零碎调用阻塞
Syscall blocking profile (⬇)
// 调度提早
Scheduler latency profile (⬇)

比如说协程剖析，能够查看每一个协程的一些数据指标，包含执行工夫，网络等待时间，同步阻塞工夫，期待调度工夫，GC 暂停工夫等等，列表如下（能够依照不通维度排序）：

再比方同步阻塞剖析，与下面相似，会生成一张图片，显示整个调用栈的阻塞工夫：

Go 程序性能剖析的根本伎俩还是须要把握的，不然遇到性能问题你将大刀阔斧。本篇文章次要介绍 pprof 以及 trace 的根本应用，以及次要数据指标剖析，当然这里只介绍了一些实践，更多实际中的利用还须要你去摸索。

关于go:51-Golang实战Go程序分析利器pprof

pprof 概述

内存

cpu pprofile

锁 & 阻塞

trace 概述

总结