Go: Hands-On Performance Profiling with pprof


Author: 耿宗杰

Foreword

Articles about pprof are a dime a dozen online, yet they are almost all the same tour of commands, with very few real hands-on cases. Drawing on material from the Go community and lessons from my own work, this article walks through profiling and optimizing a Go program in practice.

Optimization approach

First, a word about the general approach to performance optimization. Analyzing and optimizing a system's performance should proceed from large to small: start with the business architecture, then the system architecture, then the interactions between system modules, and only finally the code itself. Optimizing the business architecture offers the best cost-benefit ratio: the technical difficulty is relatively low, yet it can bring large performance gains. For example, by talking with colleagues or other teams you may be able to eliminate some interface calls or drop unnecessary, convoluted business logic, easily improving the performance of the whole system.

System-architecture optimizations, such as adding a cache or switching from HTTP to RPC, can also deliver sizable gains for a moderate investment. Last comes code-level optimization, which itself has two parts: first, choosing and using data structures well; second, profiling on top of that foundation. For instance, when using Go's convenient slice type, allocate enough capacity up front where possible, to avoid the reallocation and copying that happen when append exceeds the current capacity; when protecting shared state, prefer RWMutex over Mutex, and under extremely high concurrency consider even finer-grained atomic operations instead of locks.
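To make those last two code-level points concrete, here is a minimal, self-contained sketch; the `counter` type and the capacity of 1000 are illustrative, not taken from the benchmark program:

```go
package main

import (
	"fmt"
	"sync"
)

// counter guards reads with an RWMutex so many concurrent
// readers don't serialize behind a single exclusive lock.
type counter struct {
	mu sync.RWMutex
	n  int
}

func (c *counter) Get() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.n
}

func (c *counter) Add() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}

func main() {
	// Preallocate capacity up front: append never has to
	// reallocate the backing array or copy elements.
	items := make([]int, 0, 1000)
	for i := 0; i < 1000; i++ {
		items = append(items, i)
	}
	fmt.Println(len(items), cap(items)) // prints "1000 1000"
}
```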

Optimization in practice

Now for the main part. The program to optimize is an example from the community; the code is a bit long. It implements the famous computer scientist Tarjan's algorithm for finding the strongly connected components of a graph; for the ideas behind the algorithm, please google it yourself. The hands-on process follows (it gets a little long...):

Initial version, havlak1.go:

```go
// Go from multi-language-benchmark/src/havlak/go_pro
// Copyright 2011 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// Test Program for the Havlak loop finder.
//
// This program constructs a fairly large control flow
// graph and performs loop recognition. This is the Go
// version.
//
package main

import (
	"flag"
	"fmt"
	"log"
	"os"
	"runtime/pprof"
)

type BasicBlock struct {
	Name     int
	InEdges  []*BasicBlock
	OutEdges []*BasicBlock
}

func NewBasicBlock(name int) *BasicBlock {
	return &BasicBlock{Name: name}
}

func (bb *BasicBlock) Dump() {
	fmt.Printf("BB#%06d:", bb.Name)
	if len(bb.InEdges) > 0 {
		fmt.Printf(" in :")
		for _, iter := range bb.InEdges {
			fmt.Printf(" BB#%06d", iter.Name)
		}
	}
	if len(bb.OutEdges) > 0 {
		fmt.Print(" out:")
		for _, iter := range bb.OutEdges {
			fmt.Printf(" BB#%06d", iter.Name)
		}
	}
	fmt.Printf("\n")
}

func (bb *BasicBlock) NumPred() int {
	return len(bb.InEdges)
}

func (bb *BasicBlock) NumSucc() int {
	return len(bb.OutEdges)
}

func (bb *BasicBlock) AddInEdge(from *BasicBlock) {
	bb.InEdges = append(bb.InEdges, from)
}

func (bb *BasicBlock) AddOutEdge(to *BasicBlock) {
	bb.OutEdges = append(bb.OutEdges, to)
}

//-----------------------------------------------------------

type CFG struct {
	Blocks []*BasicBlock
	Start  *BasicBlock
}

func NewCFG() *CFG {
	return &CFG{}
}

func (cfg *CFG) NumNodes() int {
	return len(cfg.Blocks)
}

func (cfg *CFG) CreateNode(node int) *BasicBlock {
	if node < len(cfg.Blocks) {
		return cfg.Blocks[node]
	}
	if node != len(cfg.Blocks) {
		println("oops", node, len(cfg.Blocks))
		panic("wtf")
	}
	bblock := NewBasicBlock(node)
	cfg.Blocks = append(cfg.Blocks, bblock)
	if len(cfg.Blocks) == 1 {
		cfg.Start = bblock
	}
	return bblock
}

func (cfg *CFG) Dump() {
	for _, n := range cfg.Blocks {
		n.Dump()
	}
}

//-----------------------------------------------------------

type BasicBlockEdge struct {
	Dst *BasicBlock
	Src *BasicBlock
}

func NewBasicBlockEdge(cfg *CFG, from int, to int) *BasicBlockEdge {
	self := new(BasicBlockEdge)
	self.Src = cfg.CreateNode(from)
	self.Dst = cfg.CreateNode(to)
	self.Src.AddOutEdge(self.Dst)
	self.Dst.AddInEdge(self.Src)
	return self
}

//-----------------------------------------------------------

// Basic Blocks and Loops are being classified as regular, irreducible,
// and so on. This enum contains a symbolic name for all these classifications
//
const (
	_             = iota // Go has an interesting iota concept
	bbTop                // uninitialized
	bbNonHeader          // a regular BB
	bbReducible          // reducible loop
	bbSelf               // single BB loop
	bbIrreducible        // irreducible loop
	bbDead               // a dead BB
	bbLast               // sentinel
)

// UnionFindNode is used in the Union/Find algorithm to collapse
// complete loops into a single node. These nodes and the
// corresponding functionality are implemented with this class
//
type UnionFindNode struct {
	parent    *UnionFindNode
	bb        *BasicBlock
	loop      *SimpleLoop
	dfsNumber int
}

// Init explicitly initializes UnionFind nodes.
//
func (u *UnionFindNode) Init(bb *BasicBlock, dfsNumber int) {
	u.parent = u
	u.bb = bb
	u.dfsNumber = dfsNumber
	u.loop = nil
}

// FindSet implements the Find part of the Union/Find Algorithm
//
// Implemented with Path Compression (inner loops are only
// visited and collapsed once, however, deep nests would still
// result in significant traversals).
//
func (u *UnionFindNode) FindSet() *UnionFindNode {
	var nodeList []*UnionFindNode

	node := u
	for ; node != node.parent; node = node.parent {
		if node.parent != node.parent.parent {
			nodeList = append(nodeList, node)
		}
	}

	// Path Compression, all nodes' parents point to the 1st level parent.
	for _, ll := range nodeList {
		ll.parent = node.parent
	}

	return node
}

// Union relies on path compression.
//
func (u *UnionFindNode) Union(B *UnionFindNode) {
	u.parent = B
}

// Constants
//
// Marker for uninitialized nodes.
const unvisited = -1

// Safeguard against pathological algorithm behavior.
const maxNonBackPreds = 32 * 1024

// IsAncestor
//
// As described in the paper, determine whether a node 'w' is a
// "true" ancestor for node 'v'.
//
// Dominance can be tested quickly using a pre-order trick
// for depth-first spanning trees. This is why DFS is the first
// thing we run below.
//
// Go comment: Parameters can be written as w, v int, unlike in C, where
//   each parameter needs its own type.
//
func isAncestor(w, v int, last []int) bool {
	return ((w <= v) && (v <= last[w]))
}

// listContainsNode
//
// Check whether a list contains a specific element.
//
func listContainsNode(l []*UnionFindNode, u *UnionFindNode) bool {
	for _, ll := range l {
		if ll == u {
			return true
		}
	}
	return false
}

// DFS - Depth-First-Search and node numbering.
//
func DFS(currentNode *BasicBlock, nodes []*UnionFindNode, number map[*BasicBlock]int, last []int, current int) int {
	nodes[current].Init(currentNode, current)
	number[currentNode] = current

	lastid := current
	for _, target := range currentNode.OutEdges {
		if number[target] == unvisited {
			lastid = DFS(target, nodes, number, last, lastid+1)
		}
	}
	last[number[currentNode]] = lastid
	return lastid
}

// FindLoops
//
// Find loops and build loop forest using Havlak's algorithm, which
// is derived from Tarjan. Variable names and step numbering has
// been chosen to be identical to the nomenclature in Havlak's
// paper (which, in turn, is similar to the one used by Tarjan).
//
func FindLoops(cfgraph *CFG, lsgraph *LSG) {
	if cfgraph.Start == nil {
		return
	}

	size := cfgraph.NumNodes()

	nonBackPreds := make([]map[int]bool, size)
	backPreds := make([][]int, size)

	number := make(map[*BasicBlock]int)
	header := make([]int, size, size)
	types := make([]int, size, size)
	last := make([]int, size, size)
	nodes := make([]*UnionFindNode, size, size)

	for i := 0; i < size; i++ {
		nodes[i] = new(UnionFindNode)
	}

	// Step a:
	//   - initialize all nodes as unvisited.
	//   - depth-first traversal and numbering.
	//   - unreached BB's are marked as dead.
	//
	for i, bb := range cfgraph.Blocks {
		number[bb] = unvisited
		nonBackPreds[i] = make(map[int]bool)
	}

	DFS(cfgraph.Start, nodes, number, last, 0)

	// Step b:
	//   - iterate over all nodes.
	//
	//   A backedge comes from a descendant in the DFS tree, and non-backedges
	//   from non-descendants (following Tarjan).
	//
	//   - check incoming edges 'v' and add them to either
	//     - the list of backedges (backPreds) or
	//     - the list of non-backedges (nonBackPreds)
	//
	for w := 0; w < size; w++ {
		header[w] = 0
		types[w] = bbNonHeader

		nodeW := nodes[w].bb
		if nodeW == nil {
			types[w] = bbDead
			continue // dead BB
		}

		if nodeW.NumPred() > 0 {
			for _, nodeV := range nodeW.InEdges {
				v := number[nodeV]
				if v == unvisited {
					continue // dead node
				}

				if isAncestor(w, v, last) {
					backPreds[w] = append(backPreds[w], v)
				} else {
					nonBackPreds[w][v] = true
				}
			}
		}
	}

	// Start node is root of all other loops.
	header[0] = 0

	// Step c:
	//
	// The outer loop, unchanged from Tarjan. It does nothing except
	// for those nodes which are the destinations of backedges.
	// For a header node w, we chase backward from the sources of the
	// backedges adding nodes to the set P, representing the body of
	// the loop headed by w.
	//
	// By running through the nodes in reverse of the DFST preorder,
	// we ensure that inner loop headers will be processed before the
	// headers for surrounding loops.
	//
	for w := size - 1; w >= 0; w-- {
		// this is 'P' in Havlak's paper
		var nodePool []*UnionFindNode

		nodeW := nodes[w].bb
		if nodeW == nil {
			continue // dead BB
		}

		// Step d:
		for _, v := range backPreds[w] {
			if v != w {
				nodePool = append(nodePool, nodes[v].FindSet())
			} else {
				types[w] = bbSelf
			}
		}

		// Copy nodePool to workList.
		//
		workList := append([]*UnionFindNode(nil), nodePool...)

		if len(nodePool) != 0 {
			types[w] = bbReducible
		}

		// work the list...
		//
		for len(workList) > 0 {
			x := workList[0]
			workList = workList[1:]

			// Step e:
			//
			// Step e represents the main difference from Tarjan's method.
			// Chasing upwards from the sources of a node w's backedges. If
			// there is a node y' that is not a descendant of w, w is marked
			// the header of an irreducible loop, there is another entry
			// into this loop that avoids w.
			//
			// The algorithm has degenerated. Break and
			// return in this case.
			//
			nonBackSize := len(nonBackPreds[x.dfsNumber])
			if nonBackSize > maxNonBackPreds {
				return
			}

			for iter := range nonBackPreds[x.dfsNumber] {
				y := nodes[iter]
				ydash := y.FindSet()

				if !isAncestor(w, ydash.dfsNumber, last) {
					types[w] = bbIrreducible
					nonBackPreds[w][ydash.dfsNumber] = true
				} else {
					if ydash.dfsNumber != w {
						if !listContainsNode(nodePool, ydash) {
							workList = append(workList, ydash)
							nodePool = append(nodePool, ydash)
						}
					}
				}
			}
		}

		// Collapse/Unionize nodes in a SCC to a single node
		// For every SCC found, create a loop descriptor and link it in.
		//
		if (len(nodePool) > 0) || (types[w] == bbSelf) {
			loop := lsgraph.NewLoop()

			loop.SetHeader(nodeW)
			if types[w] != bbIrreducible {
				loop.IsReducible = true
			}

			// At this point, one can set attributes to the loop, such as:
			//
			// the bottom node:
			//    iter  = backPreds[w].begin();
			//    loop bottom is: nodes[iter].node);
			//
			// the number of backedges:
			//    backPreds[w].size()
			//
			// whether this loop is reducible:
			//    type[w] != BasicBlockClass.bbIrreducible
			//
			nodes[w].loop = loop

			for _, node := range nodePool {
				// Add nodes to loop descriptor.
				header[node.dfsNumber] = w
				node.Union(nodes[w])

				// Nested loops are not added, but linked together.
				if node.loop != nil {
					node.loop.Parent = loop
				} else {
					loop.AddNode(node.bb)
				}
			}

			lsgraph.AddLoop(loop)
		} // nodePool.size
	} // Step c
}

// External entry point.
func FindHavlakLoops(cfgraph *CFG, lsgraph *LSG) int {
	FindLoops(cfgraph, lsgraph)
	return lsgraph.NumLoops()
}

//======================================================
// Scaffold Code
//======================================================

// Basic representation of loops, a loop has an entry point,
// one or more exit edges, a set of basic blocks, and potentially
// an outer loop - a "parent" loop.
//
// Furthermore, it can have any set of properties, e.g.,
// it can be an irreducible loop, have control flow, be
// a candidate for transformations, and what not.
//
type SimpleLoop struct {
	// No set, use map to bool
	basicBlocks map[*BasicBlock]bool
	Children    map[*SimpleLoop]bool
	Parent      *SimpleLoop
	header      *BasicBlock

	IsRoot       bool
	IsReducible  bool
	Counter      int
	NestingLevel int
	DepthLevel   int
}

func (loop *SimpleLoop) AddNode(bb *BasicBlock) {
	loop.basicBlocks[bb] = true
}

func (loop *SimpleLoop) AddChildLoop(child *SimpleLoop) {
	loop.Children[child] = true
}

func (loop *SimpleLoop) Dump(indent int) {
	for i := 0; i < indent; i++ {
		fmt.Printf("  ")
	}

	// No ? operator ?
	fmt.Printf("loop-%d nest: %d depth %d ",
		loop.Counter, loop.NestingLevel, loop.DepthLevel)
	if !loop.IsReducible {
		fmt.Printf("(Irreducible) ")
	}

	// must have > 0
	if len(loop.Children) > 0 {
		fmt.Printf("Children: ")
		for ll := range loop.Children {
			fmt.Printf("loop-%d ", ll.Counter)
		}
	}
	if len(loop.basicBlocks) > 0 {
		fmt.Printf("(")
		for bb := range loop.basicBlocks {
			fmt.Printf("BB#%06d ", bb.Name)
			if loop.header == bb {
				fmt.Printf("*")
			}
		}
		fmt.Printf("\b)")
	}
	fmt.Printf("\n")
}

func (loop *SimpleLoop) SetParent(parent *SimpleLoop) {
	loop.Parent = parent
	loop.Parent.AddChildLoop(loop)
}

func (loop *SimpleLoop) SetHeader(bb *BasicBlock) {
	loop.AddNode(bb)
	loop.header = bb
}

//------------------------------------
// Helper (No templates or such)
//
func max(x, y int) int {
	if x > y {
		return x
	}
	return y
}

// LoopStructureGraph
//
// Maintain loop structure for a given CFG.
//
// Two values are maintained for this loop graph, depth, and nesting level.
// For example:
//
// loop        nesting level    depth
//----------------------------------------
// loop-0      2                0
//   loop-1    1                1
//   loop-3    1                1
//     loop-2  0                2
//
var loopCounter = 0

type LSG struct {
	root  *SimpleLoop
	loops []*SimpleLoop
}

func NewLSG() *LSG {
	lsg := new(LSG)
	lsg.root = lsg.NewLoop()
	lsg.root.NestingLevel = 0

	return lsg
}

func (lsg *LSG) NewLoop() *SimpleLoop {
	loop := new(SimpleLoop)
	loop.basicBlocks = make(map[*BasicBlock]bool)
	loop.Children = make(map[*SimpleLoop]bool)
	loop.Parent = nil
	loop.header = nil

	loop.Counter = loopCounter
	loopCounter++
	return loop
}

func (lsg *LSG) AddLoop(loop *SimpleLoop) {
	lsg.loops = append(lsg.loops, loop)
}

func (lsg *LSG) Dump() {
	lsg.dump(lsg.root, 0)
}

func (lsg *LSG) dump(loop *SimpleLoop, indent int) {
	loop.Dump(indent)

	for ll := range loop.Children {
		lsg.dump(ll, indent+1)
	}
}

func (lsg *LSG) CalculateNestingLevel() {
	for _, sl := range lsg.loops {
		if sl.IsRoot {
			continue
		}
		if sl.Parent == nil {
			sl.SetParent(lsg.root)
		}
	}
	lsg.calculateNestingLevel(lsg.root, 0)
}

func (lsg *LSG) calculateNestingLevel(loop *SimpleLoop, depth int) {
	loop.DepthLevel = depth
	for ll := range loop.Children {
		lsg.calculateNestingLevel(ll, depth+1)

		ll.NestingLevel = max(loop.NestingLevel, ll.NestingLevel+1)
	}
}

func (lsg *LSG) NumLoops() int {
	return len(lsg.loops)
}

func (lsg *LSG) Root() *SimpleLoop {
	return lsg.root
}

//======================================================
// Testing Code
//======================================================

func buildDiamond(cfgraph *CFG, start int) int {
	bb0 := start
	NewBasicBlockEdge(cfgraph, bb0, bb0+1)
	NewBasicBlockEdge(cfgraph, bb0, bb0+2)
	NewBasicBlockEdge(cfgraph, bb0+1, bb0+3)
	NewBasicBlockEdge(cfgraph, bb0+2, bb0+3)
	return bb0 + 3
}

func buildConnect(cfgraph *CFG, start int, end int) {
	NewBasicBlockEdge(cfgraph, start, end)
}

func buildStraight(cfgraph *CFG, start int, n int) int {
	for i := 0; i < n; i++ {
		buildConnect(cfgraph, start+i, start+i+1)
	}
	return start + n
}

func buildBaseLoop(cfgraph *CFG, from int) int {
	header := buildStraight(cfgraph, from, 1)
	diamond1 := buildDiamond(cfgraph, header)
	d11 := buildStraight(cfgraph, diamond1, 1)
	diamond2 := buildDiamond(cfgraph, d11)
	footer := buildStraight(cfgraph, diamond2, 1)
	buildConnect(cfgraph, diamond2, d11)
	buildConnect(cfgraph, diamond1, header)

	buildConnect(cfgraph, footer, from)
	footer = buildStraight(cfgraph, footer, 1)
	return footer
}

var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to this file")

func main() {
	flag.Parse()
	if *cpuprofile != "" {
		f, err := os.Create(*cpuprofile)
		if err != nil {
			log.Fatal(err)
		}
		pprof.StartCPUProfile(f)
		defer pprof.StopCPUProfile()
	}

	lsgraph := NewLSG()
	cfgraph := NewCFG()

	cfgraph.CreateNode(0) // top
	cfgraph.CreateNode(1) // bottom
	NewBasicBlockEdge(cfgraph, 0, 2)

	for dummyloop := 0; dummyloop < 15000; dummyloop++ {
		FindHavlakLoops(cfgraph, NewLSG())
	}

	n := 2

	for parlooptrees := 0; parlooptrees < 10; parlooptrees++ {
		cfgraph.CreateNode(n + 1)
		buildConnect(cfgraph, 2, n+1)
		n = n + 1

		for i := 0; i < 100; i++ {
			top := n
			n = buildStraight(cfgraph, n, 1)
			for j := 0; j < 25; j++ {
				n = buildBaseLoop(cfgraph, n)
			}
			bottom := buildStraight(cfgraph, n, 1)

			buildConnect(cfgraph, n, top)
			n = bottom
		}
		buildConnect(cfgraph, n, 1)
	}

	FindHavlakLoops(cfgraph, lsgraph)
	for i := 0; i < 50; i++ {
		FindHavlakLoops(cfgraph, NewLSG())
	}

	fmt.Printf("# of loops: %d (including 1 artificial root node)\n", lsgraph.NumLoops())
	lsgraph.CalculateNestingLevel()
}
```

We use the time command on a MacBook to print the program's running time (kernel time, user time, and wall time):

Compile and run the program:





User time is 23.07s, kernel time 0.4s, wall time 13.7s (two cores were used, about 170%). Since the program already starts pprof CPU profiling, run the top command directly:





As shown, pprof collected data for 12.99s, with a total sampling duration of 19.49s (again because two cores were used).

A note here: both CPU-time profiling and memory profiling are interval sampling. For CPU time, the call stack is recorded at regular intervals (roughly every 10ms); the analysis then counts, across all samples, how often each function appears, both as the function currently running and as a caller on the stack (which means it called some other function). For memory, one allocation path is recorded for roughly every 512KB allocated.

The biggest consumer is mapaccess1_fast64, the runtime's lookup function for map reads and writes. If you compile without disabling inlining, the output looks a little different: it shows the FindHavlakLoops function, marked as inline. Since FindHavlakLoops does nothing but call FindLoops, the compiler expands the function, substituting FindLoops for FindHavlakLoops. With inlining enabled you can also set pprof's noinlines switch to true (it defaults to false), i.e. noinlines=true.

The top function here is a runtime function, not business code, and there is nothing for us to optimize inside the runtime (it can only mean we are using it wrong). So let's see where this runtime function is called from:

web mapaccess1_fast64





The heaviest callers are DFS and FindLoops. Let's look at how these two functions use maps, starting with DFS:





Within DFS, the expensive lines that touch the map are lines 242, 246, and 250. The fix is to replace the map with a slice: when a slice does the job, there is no need for a map. At first glance that looks fine, yet a little odd, right? It works because of a particular property of this program: a BasicBlock's Name field is exactly its own index. Looking up a BasicBlock therefore doesn't need a map; direct indexing is obviously the lowest possible time complexity. (On map vs. slice complexity: a map looks like O(1), but that ignores the cost of hashing the key and resolving hash collisions, whereas slice indexing is genuinely O(1).) After replacing this map with a slice — version H2 — recompile and measure again:
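The map-to-slice swap can be sketched in isolation. This is an illustrative reconstruction, not the article's exact diff; `numberByMap`/`numberBySlice` are names I made up to show both shapes side by side:

```go
package main

import "fmt"

// BasicBlock's Name doubles as its index into the CFG's block list,
// which is what makes the slice replacement possible.
type BasicBlock struct{ Name int }

const unvisited = -1

// numberByMap is the original shape: DFS visit order keyed by pointer.
func numberByMap(blocks []*BasicBlock) map[*BasicBlock]int {
	m := make(map[*BasicBlock]int, len(blocks))
	for _, b := range blocks {
		m[b] = unvisited
	}
	return m
}

// numberBySlice is the replacement: index i holds block i's number,
// so every lookup is plain indexing with no hashing at all.
func numberBySlice(n int) []int {
	s := make([]int, n)
	for i := range s {
		s[i] = unvisited
	}
	return s
}

func main() {
	blocks := []*BasicBlock{{0}, {1}, {2}}
	num := numberBySlice(len(blocks))
	num[blocks[1].Name] = 7 // O(1), no hash, no collision handling
	fmt.Println(num)        // prints "[-1 7 -1]"
}
```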





Runtime is now down to 8s. Check the CPU-time distribution again:





Now the top function is scanobject, no longer map access. Seeing that function, and mallocgc below it, we should recognize that both are memory-related: the former scans memory objects as part of the garbage collector's tri-color marking, while the latter manages allocation and reclamation directly (note it handles both: malloc && gc). This tells us the program is spending a lot of time on memory management, about 8% by proportion. So we need to find out what is producing all this memory churn and whether it is justified. Analyzing memory is pprof's memprofile feature. Add the memory-profiling code; everyone surely knows how, so I won't paste it (it's all over the web). Then recompile as version H3, run it, and generate the memory profile file:





Look at the allocation profile. Here we do not disable inlining, because at run time inlinable functions really are expanded and substituted:





Node creation ranks first and FindLoops second. Creating a BasicBlock is trivial, with no complex logic, and it is a core data structure of the program: changing the struct could force changes to the algorithm itself, which is clearly inappropriate — it risks introducing bugs and promises little gain. So we follow the second entry, FindLoops, which is the other business function we saw earlier calling the mapaccess builtins. The optimization is the same as before: replace the map with a slice. One wrinkle: after the switch, appending to what used to be a map must now deduplicate, so we add an appendUnique helper. Recompile as version H4 and measure:
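The dedup helper is small enough to sketch in full; this is a plausible reconstruction of what an `appendUnique` for int slices looks like, restoring the set semantics the map used to provide:

```go
package main

import "fmt"

// appendUnique appends x only if it is not already in a, so the slice
// keeps behaving like the set the map previously implemented. The scan
// is O(n), which is fine for the short per-node lists in this program.
func appendUnique(a []int, x int) []int {
	for _, y := range a {
		if x == y {
			return a
		}
	}
	return append(a, x)
}

func main() {
	s := []int{1, 2}
	s = appendUnique(s, 2) // duplicate, ignored
	s = appendUnique(s, 3)
	fmt.Println(s) // prints "[1 2 3]"
}
```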





Runtime drops to 5.6s. Check the allocation profile again to confirm the optimization worked:





FindLoops is no longer near the top; its memory consumption has come down. Now look at the CPU-time distribution again:





FindLoops is now top 1 and scanobject has dropped to second place. That is exactly right: we want the CPU to spend more of its time running business code, putting the time where it is genuinely needed.

Now the next round of optimization can begin (this is what performance tuning is: sweep for hot spots over and over, squeeze out unnecessary CPU time, and wring out computing performance). Let's see where FindLoops is now spending its time, and whether that is reasonable:





Line by line, nothing looks seriously wrong: there are no outrageous costs (the 360ms line is due to a loop). So the coding itself is fine, and following the big-to-small approach from the beginning, we should move up a level and see whether the logic can be changed to avoid computation altogether (that is: if every individual step is already cheap, can we do fewer steps?). FindLoops spends a fair amount of time at its entry initializing a series of temporary variables, about 180ms combined, meaning every call pays 180ms of setup. This repeated creation and initialization of temporaries can be reduced with a memory cache; the standard solution is an object pool. (Think of a Go web service under high concurrency: every HTTP request pays for object creation, memory allocation, and deserialization; sync.Pool can cut that repeated work and lift performance.) In the same spirit, we add a memory-cache object called cache here:
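The sync.Pool idea mentioned above looks like this in miniature; the `bufPool`, `handle`, and the 4096-byte capacity are illustrative, not from the benchmark:

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool hands out reusable byte buffers so each request does not
// pay for a fresh allocation (and the GC does not pay to reclaim it).
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 0, 4096) },
}

// handle simulates per-request work that needs a scratch buffer.
func handle(payload string) int {
	buf := bufPool.Get().([]byte)[:0] // reuse capacity, reset length
	buf = append(buf, payload...)
	n := len(buf)
	bufPool.Put(buf) // return the buffer for the next caller
	return n
}

func main() {
	fmt.Println(handle("hello")) // prints "5"
}
```

sync.Pool suits transient, request-scoped objects; the benchmark program instead uses a plain global cache, which works because it is single-threaded.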





It replaces the original allocate-and-initialize sequence: when the existing cache is too small, allocate fresh slices; otherwise reslice the cached ones, cutting down on memory allocation:
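The grow-or-reslice pattern can be sketched as follows. This is a simplified illustration with made-up field names (`backPreds`, `number`), not the actual cache struct from version H5:

```go
package main

import "fmt"

// cache keeps FindLoops-style scratch slices alive between calls.
// A bare global is NOT concurrency-safe; it is acceptable here only
// because the program's logic is single-threaded.
var cache struct {
	size      int
	backPreds [][]int
	number    []int
}

// scratch returns slices of length n, reusing the cached backing
// arrays when they are big enough and allocating only when they are not.
func scratch(n int) ([][]int, []int) {
	if cache.size < n {
		cache.size = n
		cache.backPreds = make([][]int, n)
		cache.number = make([]int, n)
	}
	bp := cache.backPreds[:n]
	num := cache.number[:n]
	for i := 0; i < n; i++ { // reused memory must be reset by hand
		bp[i] = bp[i][:0]
		num[i] = 0
	}
	return bp, num
}

func main() {
	bp1, _ := scratch(8)
	bp2, _ := scratch(4) // no new allocation: reslices the same arrays
	fmt.Println(len(bp1), len(bp2)) // prints "8 4"
}
```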





With that change in place, compile the new version H5 and measure again:





Runtime is now down to 4.1s, an improvement of more than 2x over the original 13.7s. Look at the CPU-time distribution now:





Compared with the previous distribution, the top two entries are both business functions now, showing that business code's share of CPU time has risen further and irrelevant runtime overhead has fallen. Note that caching through a bare global variable like this is not concurrency-safe, but since this program's logic is not concurrent anyway, it is fine here.

That completes the hands-on optimization. To distill the steps and the main commands used:

1. Use top5 to inspect CPU time: map operations take the largest share;

2. Use the web command to see who mainly calls the map functions: DFS and FindLoops;

3. Use the list command to see per-line costs inside DFS, find the map-related lines, and settle on the fix: replace the map with a slice;

4. Rerun and check the CPU distribution with top again: the biggest consumer is now the object-scanning function scanobject;

5. Add memory-profiling code, run, and examine the allocation size distribution: FindLoops allocates heavily;

6. Use list on FindLoops' per-line memory costs: map creation is the big allocator;

7. Again replace the map with a slice, rerun, and re-check the allocation distribution: FindLoops is out of the top ranks, and in the CPU profile scanobject has dropped as well — goal achieved;

8. A new round of investigation begins: in the CPU-time ranking, FindLoops now sits on top;

9. Use list on FindLoops again; no line stands out. So change tack: shift from hunting coding problems to hunting logic problems and reduce the computation itself — here, by adding a cache;

10. With the cache in place, performance clearly improves, confirming the fix. Then loop into the next round of investigation and optimization, again and again... until we have squeezed out all the performance we can!

Those are the overall steps and reasoning for this optimization. The method stays constant throughout: keep using pprof to find the top functions by CPU time and memory usage, then improve data structures, cut unnecessary computation, and reduce the burden of allocation and garbage collection. This applies to any program, and it is the methodology worth carrying forward.

Two takeaways

1. The precondition for optimizing a program is understanding it deeply! (Put another way: let's not optimize bugs into it...) The final production version H6 doubled performance yet again, and that was entirely built on the author's complete understanding of the program — only then could he restructure loops, reuse objects, adjust call logic, and so on. Take this thinking one level up and you reach program logic, system architecture, and ultimately the overall business architecture, which brings us right back to the big-to-small approach we started with.

2. A garbage-collected language actually demands more memory-management skill than one without, not less. Go's gain in development efficiency comes from handing garbage collection to the runtime, but don't forget: what it consumes is still our own CPU time (the wool still comes off the sheep's back...). So for every data structure we use and every computation we perform, we should understand the memory cost behind it, and manage memory churn as second nature.

That's the whole hands-on process of optimizing a Go program with pprof, along with the takeaways. I hope it serves as a useful reference — thanks!
