
Golang GC in Detail

Golang introduced a tri-color GC in 1.5. After several rounds of improvement, GC pause times in the current 1.9 release can be made extremely short.
Shorter pauses mean a lower "maximum response time", which makes Go even better suited to writing network services.
This article explains how Go's tri-color GC is implemented by walking through the golang source code.

The golang source code analyzed in this series is version 1.9.2 of Google's official implementation; it does not apply to other versions or to other implementations such as gccgo.
The runtime environment is Ubuntu 16.04 LTS 64bit.
We will first cover the basic concepts, then the allocator, and finally the implementation of the collector.

Basic concepts

Memory layout

At startup, a Go program reserves a block of memory with contiguous virtual addresses, laid out as follows:

This block is divided into 3 regions; on x64 their sizes are 512MB, 16GB and 512GB respectively, and their roles are as follows:

arena

The arena region is what we usually call the heap; all memory Go allocates from the heap lives in this region.

bitmap

The bitmap region records which addresses in the arena hold objects, and which addresses within those objects contain pointers.
One byte (8 bits) in the bitmap covers four pointer-sized words of the arena, i.e. 2 bits per pointer-sized word.
So the size of the bitmap region is 512GB / pointer size (8 bytes) / 4 = 16GB.

The structure of one bitmap byte covering four pointer-sized words of the arena is shown below;
each pointer-sized word gets two bits, indicating respectively whether scanning should continue and whether the word contains a pointer:

The mapping from bitmap bytes to the arena starts from the end of the bitmap, so as memory is allocated the two grow toward opposite ends:

spans

The spans region records which span a given page of the arena belongs to; what a span is will be introduced below.
One pointer (8 bytes) in the spans region corresponds to one page of the arena (one page = 8KB in Go).
So the size of the spans region is 512GB / page size (8KB) * pointer size (8 bytes) = 512MB.

The structure of one spans pointer covering one arena page is shown below; unlike the bitmap, the mapping starts from the beginning:
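The size arithmetic for the bitmap and spans regions can be verified with a few lines of Go (a quick sketch; the constants are the x64 values quoted above):

```go
package main

import "fmt"

func main() {
	const (
		ptrSize  = 8         // pointer size on x64
		pageSize = 8 << 10   // one Go page = 8KB
		arena    = 512 << 30 // 512GB arena on x64
	)
	// bitmap: 2 bits per pointer-sized word, so 1 byte covers 4 words
	bitmap := arena / ptrSize / 4
	// spans: one 8-byte pointer per 8KB page
	spans := arena / pageSize * ptrSize
	fmt.Println(bitmap>>30, "GB") // 16 GB
	fmt.Println(spans>>20, "MB")  // 512 MB
}
```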

When objects are allocated from the heap

Many articles and books about Go mention that Go automatically decides which objects go on the stack and which go on the heap.
Simply put, if an object's contents may be accessed after the function that created it returns, the object is allocated on the heap.
Situations where objects are heap-allocated include:

  • returning a pointer to the object
  • passing a pointer to the object to another function
  • using the object in a closure that also needs to modify it
  • using new

In C, returning a pointer to a stack-allocated object is very dangerous, but in Go it is safe, because the object is automatically allocated on the heap.
The process by which Go decides whether an object should be heap-allocated is called "escape analysis".

GC Bitmap

When marking, the GC needs to know where pointers are; for example, the bitmap region described above covers the pointer information of the arena.
Beyond that, the GC also needs to know which slots on the stack contain pointers.
Since stacks are not part of the arena, stack pointer information lives in the function information.
Also, when allocating an object the GC needs to set up the bitmap region according to the object's type; that source pointer information lives in the type information.

In summary, Go has the following GC bitmaps:

  • bitmap region: covers the arena, 2 bits per pointer-sized word
  • function information: covers a function's stack frame, 1 bit per pointer-sized word (located in stackmap.bytedata)
  • type information: copied into the bitmap region when an object is allocated, 1 bit per pointer-sized word (located in _type.gcdata)

Span

A span is a block used for allocating objects; the figure below gives a simple view of a span's internal structure:

A span normally contains multiple elements of the same size, with one element holding one object, except when:

  • the span holds a large object, in which case the span has exactly one element
  • the span holds tiny objects that contain no pointers (tiny objects), in which case one element may hold multiple objects

A span has a freeindex marking where the next allocation should start searching; freeindex increases after each allocation.
All elements before freeindex are allocated; elements after freeindex may or may not be allocated.

Each GC may free some of a span's elements. allocBits records which elements are allocated and which are free.
Using freeindex + allocBits, allocation can skip already-allocated elements and place the object in a free one.
But consulting allocBits on every allocation is slow, so a span keeps an integer allocCache that caches the bitmap starting at freeindex, with the cached bits inverted relative to the originals.

gcmarkBits marks which objects are alive during GC; after each GC, gcmarkBits becomes the new allocBits.
Note that the memory for the span structure itself is allocated from the system; the spans and bitmap regions described above are only indexes.

Span classes

Spans are divided into 67 classes by size, as follows:

// class  bytes/obj  bytes/span  objects  tail waste  max waste
//     1          8        8192     1024           0     87.50%
//     2         16        8192      512           0     43.75%
//     3         32        8192      256           0     46.88%
//     4         48        8192      170          32     31.52%
//     5         64        8192      128           0     23.44%
//     6         80        8192      102          32     19.07%
//     7         96        8192       85          32     15.95%
//     8        112        8192       73          16     13.56%
//     9        128        8192       64           0     11.72%
//    10        144        8192       56         128     11.82%
//    11        160        8192       51          32      9.73%
//    12        176        8192       46          96      9.59%
//    13        192        8192       42         128      9.25%
//    14        208        8192       39          80      8.12%
//    15        224        8192       36         128      8.15%
//    16        240        8192       34          32      6.62%
//    17        256        8192       32           0      5.86%
//    18        288        8192       28         128     12.16%
//    19        320        8192       25         192     11.80%
//    20        352        8192       23          96      9.88%
//    21        384        8192       21         128      9.51%
//    22        416        8192       19         288     10.71%
//    23        448        8192       18         128      8.37%
//    24        480        8192       17          32      6.82%
//    25        512        8192       16           0      6.05%
//    26        576        8192       14         128     12.33%
//    27        640        8192       12         512     15.48%
//    28        704        8192       11         448     13.93%
//    29        768        8192       10         512     13.94%
//    30        896        8192        9         128     15.52%
//    31       1024        8192        8           0     12.40%
//    32       1152        8192        7         128     12.41%
//    33       1280        8192        6         512     15.55%
//    34       1408       16384       11         896     14.00%
//    35       1536        8192        5         512     14.00%
//    36       1792       16384        9         256     15.57%
//    37       2048        8192        4           0     12.45%
//    38       2304       16384        7         256     12.46%
//    39       2688        8192        3         128     15.59%
//    40       3072       24576        8           0     12.47%
//    41       3200       16384        5         384      6.22%
//    42       3456       24576        7         384      8.83%
//    43       4096        8192        2           0     15.60%
//    44       4864       24576        5         256     16.65%
//    45       5376       16384        3         256     10.92%
//    46       6144       24576        4           0     12.48%
//    47       6528       32768        5         128      6.23%
//    48       6784       40960        6         256      4.36%
//    49       6912       49152        7         768      3.37%
//    50       8192        8192        1           0     15.61%
//    51       9472       57344        6         512     14.28%
//    52       9728       49152        5         512      3.64%
//    53      10240       40960        4           0      4.99%
//    54      10880       32768        3         128      6.24%
//    55      12288       24576        2           0     11.45%
//    56      13568       40960        3         256      9.99%
//    57      14336       57344        4           0      5.35%
//    58      16384       16384        1           0     12.49%
//    59      18432       73728        4           0     11.11%
//    60      19072       57344        3         128      3.57%
//    61      20480       40960        2           0      6.87%
//    62      21760       65536        3         256      6.25%
//    63      24576       24576        1           0     11.45%
//    64      27264       81920        3         128     10.00%
//    65      28672       57344        2           0      4.91%
//    66      32768       32768        1           0     12.50%

Take the span of class 1 as an example:
its element size is 8 bytes, the span itself occupies one page (8K), and it can hold 1024 objects.

When allocating an object, the span class is chosen according to the object's size.
For example, a 16-byte object uses a class 2 span, while 17-byte and 32-byte objects both use a class 3 span.
As this example shows, allocating 17-byte and 32-byte objects both uses class 3, meaning objects of some sizes waste a certain amount of space when allocated.
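The rounding can be illustrated with a tiny helper (a simplified sketch using only the first few class sizes from the table above; the runtime uses the size_to_class8/size_to_class128 lookup tables rather than a loop):

```go
package main

import "fmt"

// roundUp picks the smallest size class that fits, mirroring how
// mallocgc rounds a request up to its class size. classes lists
// the bytes/obj column for classes 1-5 of the table above.
func roundUp(size int) int {
	classes := []int{8, 16, 32, 48, 64}
	for _, c := range classes {
		if size <= c {
			return c
		}
	}
	return -1 // larger sizes omitted from this sketch
}

func main() {
	fmt.Println(roundUp(16)) // 16: exact fit, no waste
	fmt.Println(roundUp(17)) // 32: 15 bytes wasted per object
	fmt.Println(roundUp(32)) // 32: exact fit again
}
```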

You may notice that the largest span class above has an element size of 32K, so where are objects larger than 32K allocated?
Objects larger than 32K are called "large objects". A large object gets a special span allocated directly from the heap;
this special span has class 0, contains exactly one large object, and its size is determined by the object's size.

The special class plus the 66 standard classes together make up the 67 span classes.

Where spans live

As mentioned in the previous article, P is a virtual resource: only one thread can access a given P at a time, so data inside a P needs no locking.
For better allocation performance, each P holds a cache of spans (also called the mcache), structured as follows:

Each P has 67*2 = 134 cached spans, one slot per span class,

where the difference between scan and noscan is:
if an object contains pointers, it is allocated from a scan span;
if it contains no pointers, it is allocated from a noscan span.
The point of splitting spans into scan and noscan is that
when the GC scans objects, it can skip consulting the bitmap region to mark child objects for noscan spans, which greatly speeds up marking.

When allocating an object, a suitable span is obtained from the following places:

  • first the P's cache (mcache); if it has a cached span that is not full, use it — this step needs no lock
  • then the global cache (mcentral); on success the span is installed into the P — this step requires a lock
  • finally the mheap; the obtained span is installed into the global cache — this step requires a lock

Caching spans per P is similar to how CoreCLR caches an Allocation Context per thread:
both let most allocations proceed without taking a thread lock, improving allocation performance.
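The three-tier lookup above can be sketched as follows (a minimal model, not the runtime's code: span, central, and getSpan are illustrative stand-ins for mspan, mcentral, and the mcache refill path):

```go
package main

import (
	"fmt"
	"sync"
)

// span is a stand-in for runtime.mspan.
type span struct{ free int }

// central is a stand-in for mcentral: a shared pool guarded by a lock.
type central struct {
	mu    sync.Mutex
	spans []*span
}

// getSpan mirrors the order described above: the per-P cache is
// consulted without locking; only the shared tier takes a lock.
func getSpan(cache map[int]*span, class int, c *central) *span {
	if s := cache[class]; s != nil && s.free > 0 {
		return s // 1. per-P mcache: no lock needed
	}
	c.mu.Lock() // 2. shared mcentral: lock required
	defer c.mu.Unlock()
	if n := len(c.spans); n > 0 {
		s := c.spans[n-1]
		c.spans = c.spans[:n-1]
		cache[class] = s // install into the P's cache
		return s
	}
	return nil // 3. would fall through to mheap here
}

func main() {
	cache := map[int]*span{}
	c := &central{spans: []*span{{free: 4}}}
	s := getSpan(cache, 2, c) // cache miss → refilled from central
	fmt.Println(s.free, cache[2] == s)
}
```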

Object allocation

Allocation flow

When Go allocates an object from the heap it calls the newobject function, whose flow is roughly as follows:

First it checks whether a GC is in progress; if so, and the current G has allocated a certain amount of memory, the G must assist the GC with some work.
This mechanism is called GC Assist, and it prevents allocation from outpacing GC reclamation.

Then it decides whether the object is small or large. A large object is allocated directly from the heap via largeAlloc;
for a small object, a usable span is obtained in 3 stages, and the object is then allocated from that span:

  • first from the P's cache (mcache)
  • then from the global cache (mcentral), which keeps lists of available spans
  • finally from the mheap, which also keeps free lists of spans; if all of these fail, allocate from the arena region

The detailed structures of these three stages are shown in the figure below:

Data type definitions

The data types involved in object allocation are:

p: as covered in the previous article, P is the virtual resource used to run Go code
m: as covered in the previous article, M currently represents a system thread
g: as covered in the previous article, G is a goroutine
mspan: a block used for allocating objects
mcentral: the global mspan cache; there are 67*2 = 134 of them
mheap: the object that manages the heap; there is exactly one globally

Source code analysis

When Go allocates an object from the heap it calls the newobject function, so let's start there:

// implementation of new builtin
// compiler (both frontend and SSA backend) knows the signature
// of this function
func newobject(typ *_type) unsafe.Pointer {
    return mallocgc(typ.size, typ, true)
}

newobject calls the mallocgc function:

// Allocate an object of size bytes.
// Small objects are allocated from the per-P cache's free lists.
// Large objects (> 32 kB) are allocated straight from the heap.
func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
    if gcphase == _GCmarktermination {
        throw("mallocgc called with gcphase == _GCmarktermination")
    }

    if size == 0 {
        return unsafe.Pointer(&zerobase)
    }

    if debug.sbrk != 0 {
        align := uintptr(16)
        if typ != nil {
            align = uintptr(typ.align)
        }
        return persistentalloc(size, align, &memstats.other_sys)
    }

    // Decide whether to assist the GC
    // gcBlackenEnabled is set during the GC mark phase
    // assistG is the G to charge for this allocation, or nil if
    // GC is not currently active.
    var assistG *g
    if gcBlackenEnabled != 0 {
        // Charge the current user G for this allocation.
        assistG = getg()
        if assistG.m.curg != nil {
            assistG = assistG.m.curg
        }
        // Charge the allocation against the G. We'll account
        // for internal fragmentation at the end of mallocgc.
        assistG.gcAssistBytes -= int64(size)

        // Determine how much GC work to assist with based on the allocated size
        // The exact algorithm is explained below when covering the collector
        if assistG.gcAssistBytes < 0 {
            // This G is in debt. Assist the GC to correct
            // this before allocating. This must happen
            // before disabling preemption.
            gcAssistAlloc(assistG)
        }
    }

    // Increment the lock count on the M behind the current G to keep this G from being preempted
    // Set mp.mallocing to keep from being preempted by GC.
    mp := acquirem()
    if mp.mallocing != 0 {
        throw("malloc deadlock")
    }
    if mp.gsignal == getg() {
        throw("malloc during signal")
    }
    mp.mallocing = 1

    shouldhelpgc := false
    dataSize := size
    // Get the local span cache (mcache) of the P owned by the current G's M
    // When an M acquires a P it copies the P's mcache into the M, so this returns getg().m.mcache
    c := gomcache()
    var x unsafe.Pointer
    noscan := typ == nil || typ.kind&kindNoPointers != 0
    // Check whether this is a small object; maxSmallSize is currently 32K
    if size <= maxSmallSize {
        // If the object contains no pointers and is smaller than 16 bytes, it gets special treatment
        // This is an optimization for very small objects: the smallest span element is 8 bytes, so anything smaller would waste a lot of space
        // Very small objects are packed together into "class 2 noscan" elements (16 bytes in size)
        if noscan && size < maxTinySize {
            // Tiny allocator.
            //
            // Tiny allocator combines several tiny allocation requests
            // into a single memory block. The resulting memory block
            // is freed when all subobjects are unreachable. The subobjects
            // must be noscan (don't have pointers), this ensures that
            // the amount of potentially wasted memory is bounded.
            //
            // Size of the memory block used for combining (maxTinySize) is tunable.
            // Current setting is 16 bytes, which relates to 2x worst case memory
            // wastage (when all but one subobjects are unreachable).
            // 8 bytes would result in no wastage at all, but provides less
            // opportunities for combining.
            // 32 bytes provides more opportunities for combining,
            // but can lead to 4x worst case wastage.
            // The best case winning is 8x regardless of block size.
            //
            // Objects obtained from tiny allocator must not be freed explicitly.
            // So when an object will be freed explicitly, we ensure that
            // its size >= maxTinySize.
            //
            // SetFinalizer has a special case for objects potentially coming
            // from tiny allocator, it such case it allows to set finalizers
            // for an inner byte of a memory block.
            //
            // The main targets of tiny allocator are small strings and
            // standalone escaping variables. On a json benchmark
            // the allocator reduces number of allocations by ~12% and
            // reduces heap size by ~20%.
            off := c.tinyoffset
            // Align tiny pointer for required (conservative) alignment.
            if size&7 == 0 {
                off = round(off, 8)
            } else if size&3 == 0 {
                off = round(off, 4)
            } else if size&1 == 0 {
                off = round(off, 2)
            }
            if off+size <= maxTinySize && c.tiny != 0 {
                // The object fits into existing tiny block.
                x = unsafe.Pointer(c.tiny + off)
                c.tinyoffset = off + size
                c.local_tinyallocs++
                mp.mallocing = 0
                releasem(mp)
                return x
            }
            // Allocate a new maxTinySize block.
            span := c.alloc[tinySpanClass]
            v := nextFreeFast(span)
            if v == 0 {
                v, _, shouldhelpgc = c.nextFree(tinySpanClass)
            }
            x = unsafe.Pointer(v)
            (*[2]uint64)(x)[0] = 0
            (*[2]uint64)(x)[1] = 0
            // See if we need to replace the existing tiny block with the new one
            // based on amount of remaining free space.
            if size < c.tinyoffset || c.tiny == 0 {
                c.tiny = uintptr(x)
                c.tinyoffset = size
            }
            size = maxTinySize
        } else {
            // Otherwise allocate as an ordinary small object
            // First find which span class fits the object's size
            var sizeclass uint8
            if size <= smallSizeMax-8 {
                sizeclass = size_to_class8[(size+smallSizeDiv-1)/smallSizeDiv]
            } else {
                sizeclass = size_to_class128[(size-smallSizeMax+largeSizeDiv-1)/largeSizeDiv]
            }
            size = uintptr(class_to_size[sizeclass])
            // equals sizeclass * 2 + (noscan ? 1 : 0)
            spc := makeSpanClass(sizeclass, noscan)
            span := c.alloc[spc]
            // Try a fast allocation from this span
            v := nextFreeFast(span)
            if v == 0 {
                // Fast allocation failed; may need to fetch a span from mcentral or mheap
                // If a new span was fetched from mcentral or mheap, shouldhelpgc will be true
                // When shouldhelpgc is true, whether to trigger a GC is checked below
                v, span, shouldhelpgc = c.nextFree(spc)
            }
            x = unsafe.Pointer(v)
            if needzero && span.needzero != 0 {
                memclrNoHeapPointers(unsafe.Pointer(v), size)
            }
        }
    } else {
        // Large objects are allocated directly from the mheap; s here is a special span whose class is 0
        var s *mspan
        shouldhelpgc = true
        systemstack(func() {
            s = largeAlloc(size, needzero, noscan)
        })
        s.freeindex = 1
        s.allocCount = 1
        x = unsafe.Pointer(s.base())
        size = s.elemsize
    }

    // Set the bitmap corresponding to the arena, recording which locations contain pointers; the GC uses the bitmap to scan all reachable objects
    var scanSize uintptr
    if !noscan {
        // If allocating a defer+arg block, now that we've picked a malloc size
        // large enough to hold everything, cut the "asked for" size down to
        // just the defer header, so that the GC bitmap will record the arg block
        // as containing nothing at all (as if it were unused space at the end of
        // a malloc block caused by size rounding).
        // The defer arg areas are scanned as part of scanstack.
        if typ == deferType {
            dataSize = unsafe.Sizeof(_defer{})
        }
        // This function is very long; the interested reader can see
        // https://github.com/golang/go/blob/go1.9.2/src/runtime/mbitmap.go#L855
        // Although the code is long, what it sets up matches the bitmap region layout described above
        // Based on the type information it sets the scan bit and the pointer bit; a set scan bit means scanning should continue, a set pointer bit means the location holds a pointer
        // Things to note:
        // - If only the beginning of a type contains pointers, e.g. [ptr, ptr, large non-pointer data],
        //   the scan bits for the trailing part will be 0, which greatly speeds up marking
        // - The scan bit of the second slot is special: it does not mark whether to continue scanning, but marks checkmark
        // What is checkmark?
        // - Because Go's concurrent GC is quite complex, to verify the implementation is correct, Go needs a mechanism
        //   to check that every object that should have been marked was marked
        //   That mechanism is checkmark: with checkmark enabled, Go stops the world at the end of the mark phase and re-runs marking
        //   The second slot's scan bit above records whether the object was marked during the checkmark pass
        // - Some may notice that the second slot requires the object to be at least two pointers in size; what about objects one pointer in size?
        //   Objects of exactly one pointer in size fall into two cases:
        //   the object is a pointer: since its size is exactly one pointer, the bitmap region need not be consulted, and the first slot is the checkmark
        //   the object is not a pointer: because of the tiny alloc mechanism, a pointer-free object of one pointer size is allocated in a two-pointer span,
        //               so the bitmap region is again not needed, and as above the first slot is the checkmark
        heapBitsSetType(uintptr(x), size, dataSize, typ)
        if dataSize > typ.size {
            // Array allocation. If there are any
            // pointers, GC has to scan to the last
            // element.
            if typ.ptrdata != 0 {
                scanSize = dataSize - typ.size + typ.ptrdata
            }
        } else {
            scanSize = typ.ptrdata
        }
        c.local_scan += scanSize
    }

    // Memory barrier; since stores on x86 and x64 are not reordered, this is only a compiler barrier (in assembly it is just a ret)
    // Ensure that the stores above that initialize x to
    // type-safe memory and set the heap bits occur before
    // the caller can make x observable to the garbage
    // collector. Otherwise, on weakly ordered machines,
    // the garbage collector could follow a pointer to x,
    // but see uninitialized memory or stale heap bits.
    publicationBarrier()

    // If a GC is in progress, the newly allocated object must immediately be marked "black" to keep it from being collected
    // Allocate black during GC.
    // All slots hold nil so no scanning is needed.
    // This may be racing with GC so do it atomically if there can be
    // a race marking the bit.
    if gcphase != _GCoff {
        gcmarknewobject(uintptr(x), size, scanSize)
    }

    // Race Detector handling (for detecting data races between threads)
    if raceenabled {
        racemalloc(x, size)
    }

    // Memory Sanitizer handling (for detecting dangerous pointers and other memory problems)
    if msanenabled {
        msanmalloc(x, size)
    }

    // Allow the current G to be preempted again
    mp.mallocing = 0
    releasem(mp)

    // Debug tracing
    if debug.allocfreetrace != 0 {
        tracealloc(x, size, typ)
    }

    // Profiler recording
    if rate := MemProfileRate; rate > 0 {
        if size < uintptr(rate) && int32(size) < c.next_sample {
            c.next_sample -= int32(size)
        } else {
            mp := acquirem()
            profilealloc(mp, x, size)
            releasem(mp)
        }
    }

    // Subtract "actual allocated size - requested size" from gcAssistBytes to adjust it to the exact value
    if assistG != nil {
        // Account for internal fragmentation in the assist
        // debt now that we know it.
        assistG.gcAssistBytes -= int64(size - dataSize)
    }

    // If a new span was fetched earlier, check whether a background GC should be started
    // The trigger logic here (gcTrigger) is explained in detail below
    if shouldhelpgc {
        if t := (gcTrigger{kind: gcTriggerHeap}); t.test() {
            gcStart(gcBackgroundMode, t)
        }
    }

    return x
}
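One detail worth isolating from the tiny allocator above is its alignment rule: the offset is rounded up according to the object's size. A standalone sketch (round mirrors the runtime helper of the same name; tinyAlign is an illustrative wrapper, not a runtime function):

```go
package main

import "fmt"

// round rounds n up to a multiple of a (a power of two).
func round(n, a uintptr) uintptr {
	return (n + a - 1) &^ (a - 1)
}

// tinyAlign applies the same size-based alignment rules mallocgc
// uses before placing a tiny object at offset off.
func tinyAlign(off, size uintptr) uintptr {
	switch {
	case size&7 == 0:
		return round(off, 8)
	case size&3 == 0:
		return round(off, 4)
	case size&1 == 0:
		return round(off, 2)
	}
	return off
}

func main() {
	fmt.Println(tinyAlign(3, 8)) // 8-byte object → offset rounded 3→8
	fmt.Println(tinyAlign(3, 4)) // 4-byte object → 3→4
	fmt.Println(tinyAlign(3, 1)) // 1-byte object → no alignment, stays 3
}
```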

Next, let's see how objects are allocated from a span. nextFreeFast is called first to attempt a fast allocation:

// nextFreeFast returns the next free object if one is quickly available.
// Otherwise it returns 0.
func nextFreeFast(s *mspan) gclinkptr {
    // Find the position of the first nonzero bit, i.e. which element is unallocated
    theBit := sys.Ctz64(s.allocCache) // Is there a free object in the allocCache?
    // Found an unallocated element
    if theBit < 64 {
        result := s.freeindex + uintptr(theBit)
        // The index must be less than the number of elements
        if result < s.nelems {
            // The next freeindex
            freeidx := result + 1
            // Values divisible by 64 need special handling (see nextFree)
            if freeidx%64 == 0 && freeidx != s.nelems {
                return 0
            }
            // Update freeindex and allocCache (the high bits become 0; it is refilled once exhausted)
            s.allocCache >>= uint(theBit + 1)
            s.freeindex = freeidx
            // Return the element's address
            v := gclinkptr(result*s.elemsize + s.base())
            // Increment the allocated element count
            s.allocCount++
            return v
        }
    }
    return 0
}

If no unallocated element can be found quickly after freeindex, nextFree must be called for more involved handling:

// nextFree returns the next free object from the cached span if one is available.
// Otherwise it refills the cache with a span with an available object and
// returns that object along with a flag indicating that this was a heavy
// weight allocation. If it is a heavy weight allocation the caller must
// determine whether a new GC cycle needs to be started or if the GC is active
// whether this goroutine needs to assist the GC.
func (c *mcache) nextFree(spc spanClass) (v gclinkptr, s *mspan, shouldhelpgc bool) {
    // Find the next freeindex and update allocCache
    s = c.alloc[spc]
    shouldhelpgc = false
    freeIndex := s.nextFreeIndex()
    // If every element in the span is allocated, a new span is needed
    if freeIndex == s.nelems {
        // The span is full.
        if uintptr(s.allocCount) != s.nelems {
            println("runtime: s.allocCount=", s.allocCount, "s.nelems=", s.nelems)
            throw("s.allocCount != s.nelems && freeIndex == s.nelems")
        }
        // Request a new span
        systemstack(func() {
            c.refill(spc)
        })
        // Fetch the newly requested span and flag that a GC check is needed
        shouldhelpgc = true
        s = c.alloc[spc]

        freeIndex = s.nextFreeIndex()
    }

    if freeIndex >= s.nelems {
        throw("freeIndex is not valid")
    }

    // Return the element's address
    v = gclinkptr(freeIndex*s.elemsize + s.base())
    // Increment the allocated element count
    s.allocCount++
    if uintptr(s.allocCount) > s.nelems {
        println("s.allocCount=", s.allocCount, "s.nelems=", s.nelems)
        throw("s.allocCount > s.nelems")
    }
    return
}

If the span of the given class in the mcache is full, the refill function is called to request a new span:

// Gets a span that has a free object in it and assigns it
// to be the cached span for the given sizeclass. Returns this span.
func (c *mcache) refill(spc spanClass) *mspan {
    _g_ := getg()

    // Prevent the G from being preempted
    _g_.m.locks++
    // Return the current cached span to the central lists.
    s := c.alloc[spc]

    // Ensure every element of the current span is allocated
    if uintptr(s.allocCount) != s.nelems {
        throw("refill of span with free space remaining")
    }

    // Clear the span's incache flag, unless it is the globally shared empty span (the default value of the span pointers in mcache)
    if s != &emptymspan {
        s.incache = false
    }

    // Request a new span from mcentral
    // Get a new cached span from the central lists.
    s = mheap_.central[spc].mcentral.cacheSpan()
    if s == nil {
        throw("out of memory")
    }

    if uintptr(s.allocCount) == s.nelems {
        throw("span has no free space")
    }

    // Install the new span into the mcache
    c.alloc[spc] = s
    // Allow the G to be preempted again
    _g_.m.locks--
    return s
}

Requesting a new span from mcentral goes through the cacheSpan function:
mcentral first tries to reuse an existing span from its internal lists; if that fails, it requests one from mheap.

// Allocate a span to use in an MCache.
func (c *mcentral) cacheSpan() *mspan {
    // Have the current G assist with part of the sweep work
    // Deduct credit for this span allocation and sweep if necessary.
    spanBytes := uintptr(class_to_allocnpages[c.spanclass.sizeclass()]) * _PageSize
    deductSweepCredit(spanBytes, 0)

    // Lock the mcentral, because multiple M's (P's) may access it concurrently
    lock(&c.lock)
    traceDone := false
    if trace.enabled {
        traceGCSweepStart()
    }
    sg := mheap_.sweepgen
retry:
    // mcentral keeps two span lists
    // - nonempty: spans known to have at least one unallocated element
    // - empty: spans not known to have at least one unallocated element
    // The nonempty list is searched first
    // sweepgen increases by 2 on every GC
    // - sweepgen == global sweepgen: the span has been swept
    // - sweepgen == global sweepgen-1: the span is being swept
    // - sweepgen == global sweepgen-2: the span is waiting to be swept
    var s *mspan
    for s = c.nonempty.first; s != nil; s = s.next {
        // If the span is waiting to be swept, try to atomically change its sweepgen to global sweepgen-1
        if s.sweepgen == sg-2 && atomic.Cas(&s.sweepgen, sg-2, sg-1) {
            // On success, move the span to the empty list, sweep it, then jump to havespan
            c.nonempty.remove(s)
            c.empty.insertBack(s)
            unlock(&c.lock)
            s.sweep(true)
            goto havespan
        }
        // If this span is being swept by another thread, skip it
        if s.sweepgen == sg-1 {
            // the span is being swept by background sweeper, skip
            continue
        }
        // The span has already been swept
        // Spans on the nonempty list are known to have at least one unallocated element, so it can be used directly
        // we have a nonempty span that does not require sweeping, allocate from it
        c.nonempty.remove(s)
        c.empty.insertBack(s)
        unlock(&c.lock)
        goto havespan
    }

    // Search the empty list
    for s = c.empty.first; s != nil; s = s.next {
        // If the span is waiting to be swept, try to atomically change its sweepgen to global sweepgen-1
        if s.sweepgen == sg-2 && atomic.Cas(&s.sweepgen, sg-2, sg-1) {
            // Move the span to the back of the empty list
            // we have an empty span that requires sweeping,
            // sweep it and see if we can free some space in it
            c.empty.remove(s)
            // swept spans are at the end of the list
            c.empty.insertBack(s)
            unlock(&c.lock)
            // Try to sweep it
            s.sweep(true)
            // After sweeping, check whether it has any unallocated elements; if so, it can be used
            freeIndex := s.nextFreeIndex()
            if freeIndex != s.nelems {
                s.freeindex = freeIndex
                goto havespan
            }
            lock(&c.lock)
            // the span is still empty after sweep
            // it is already in the empty list, so just retry
            goto retry
        }
        // If this span is being swept by another thread, skip it
        if s.sweepgen == sg-1 {
            // the span is being swept by background sweeper, skip
            continue
        }
        // No span with unallocated objects was found
        // already swept empty span,
        // all subsequent ones must also be either swept or in process of sweeping
        break
    }
    if trace.enabled {
        traceGCSweepDone()
        traceDone = true
    }
    unlock(&c.lock)

    // No span with unallocated objects was found; one must be allocated from mheap
    // After allocation it is added to the empty list
    // Replenish central list if empty.
    s = c.grow()
    if s == nil {
        return nil
    }
    lock(&c.lock)
    c.empty.insertBack(s)
    unlock(&c.lock)

    // At this point s is a non-empty span, queued at the end of the empty list,
    // c is unlocked.
havespan:
    if trace.enabled && !traceDone {
        traceGCSweepDone()
    }
    // Count the span's unallocated elements and add the count to mcentral.nmalloc
    // Add the total size of the unallocated elements to memstats.heap_live
    cap := int32((s.npages << _PageShift) / s.elemsize)
    n := cap - int32(s.allocCount)
    if n == 0 || s.freeindex == s.nelems || uintptr(s.allocCount) == s.nelems {
        throw("span has no free objects")
    }
    // Assume all objects from this span will be allocated in the
    // mcache. If it gets uncached, we'll adjust this.
    atomic.Xadd64(&c.nmalloc, int64(n))
    usedBytes := uintptr(s.allocCount) * s.elemsize
    atomic.Xadd64(&memstats.heap_live, int64(spanBytes)-int64(usedBytes))
    // Tracing
    if trace.enabled {
        // heap_live changed.
        traceHeapAlloc()
    }
    // If a GC is in progress, since heap_live changed, readjust the assist-mark work value for G's
    // See the analysis of the revise function below for details
    if gcBlackenEnabled != 0 {
        // heap_live changed.
        gcController.revise()
    }
    // Set the span's incache flag to indicate it is now in an mcache
    s.incache = true
    // Update allocCache according to freeindex
    freeByteBase := s.freeindex &^ (64 - 1)
    whichByte := freeByteBase / 8
    // Init alloc bits cache.
    s.refillAllocCache(whichByte)

    // Adjust the allocCache so that s.freeindex corresponds to the low bit in
    // s.allocCache.
    s.allocCache >>= s.freeindex % 64

    return s
}
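The sweepgen convention used by cacheSpan above can be modeled in isolation (an illustrative sketch; sweepState is not a runtime function, and the real code uses runtime-internal atomics rather than sync/atomic):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sweepState interprets a span's sweepgen relative to the global
// sweep generation, which the runtime bumps by 2 each GC cycle.
func sweepState(spanGen, globalGen uint32) string {
	switch spanGen {
	case globalGen:
		return "swept"
	case globalGen - 1:
		return "being swept"
	case globalGen - 2:
		return "needs sweeping"
	}
	return "unknown"
}

func main() {
	var global uint32 = 6
	span := uint32(4) // two behind → waiting for sweep
	fmt.Println(sweepState(span, global))
	// Claim the span for sweeping exactly as cacheSpan does:
	// CAS from globalGen-2 to globalGen-1.
	if atomic.CompareAndSwapUint32(&span, global-2, global-1) {
		fmt.Println(sweepState(span, global))
	}
}
```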

mcentral requests a new span from mheap via the grow function:

// grow allocates a new empty span from the heap and initializes it for c's size class.
func (c *mcentral) grow() *mspan {
    // From the mcentral's span class, compute the size of the span to request (divided by 8K = number of pages) and how many elements it can hold
    npages := uintptr(class_to_allocnpages[c.spanclass.sizeclass()])
    size := uintptr(class_to_size[c.spanclass.sizeclass()])
    n := (npages << _PageShift) / size

    // Request a new span from mheap, in units of pages (8K)
    s := mheap_.alloc(npages, c.spanclass, false, true)
    if s == nil {
        return nil
    }

    p := s.base()
    s.limit = p + size*n

    // Allocate and initialize the span's allocBits and gcmarkBits
    heapBitsForSpan(s.base()).initSpan(s)
    return s
}

The mheap function that allocates spans is alloc:

func (h *mheap) alloc(npage uintptr, spanclass spanClass, large bool, needzero bool) *mspan {
    // Call the alloc_m function on g0's stack
    // See the previous article for an explanation of systemstack
    // Don't do any operations that lock the heap on the G stack.
    // It might trigger stack growth, and the stack growth code needs
    // to be able to allocate heap.
    var s *mspan
    systemstack(func() {
        s = h.alloc_m(npage, spanclass, large)
    })

    if s != nil {
        if needzero && s.needzero != 0 {
            memclrNoHeapPointers(unsafe.Pointer(s.base()), s.npages<<_PageShift)
        }
        s.needzero = 0
    }
    return s
}

The alloc function calls alloc_m on g0's stack:

// Allocate a new span of npage pages from the heap for GC'd memory
// and record its size class in the HeapMap and HeapMapCache.
func (h *mheap) alloc_m(npage uintptr, spanclass spanClass, large bool) *mspan {
    _g_ := getg()
    if _g_ != _g_.m.g0 {
        throw("_mheap_alloc not on g0 stack")
    }
    // Lock the mheap; this is a global lock
    lock(&h.lock)

    // To keep the heap from growing too fast, before allocating n pages, at least n pages are swept and reclaimed first
    // It enumerates the busy list and then the busyLarge list for sweeping; see the reclaim and reclaimList functions for details
    // To prevent excessive heap growth, before allocating n pages
    // we need to sweep and reclaim at least n pages.
    if h.sweepdone == 0 {
        // TODO(austin): This tends to sweep a large number of
        // spans in order to find a few completely free spans
        // (for example, in the garbage benchmark, this sweeps
        // ~30x the number of pages its trying to allocate).
        // If GC kept a bit for whether there were any marks
        // in a span, we could release these free spans
        // at the end of GC and eliminate this entirely.
        if trace.enabled {
            traceGCSweepStart()
        }
        h.reclaim(npage)
        if trace.enabled {
            traceGCSweepDone()
        }
    }

    // Add the mcache's local statistics to the global statistics
    // transfer stats from cache to global
    memstats.heap_scan += uint64(_g_.m.mcache.local_scan)
    _g_.m.mcache.local_scan = 0
    memstats.tinyallocs += uint64(_g_.m.mcache.local_tinyallocs)
    _g_.m.mcache.local_tinyallocs = 0

    // Call allocSpanLocked to allocate the span; allocSpanLocked requires the mheap to already be locked
    s := h.allocSpanLocked(npage, &memstats.heap_inuse)
    if s != nil {
        // Record span info, because gc needs to be
        // able to map interior pointer to containing span.
        // Set the span's sweepgen to the global sweepgen
        atomic.Store(&s.sweepgen, h.sweepgen)
        // Put it in the global span lists; sweepSpans here has length 2
        // sweepSpans[h.sweepgen/2%2] holds the spans currently in use
        // sweepSpans[1-h.sweepgen/2%2] holds the spans waiting to be swept
        // Since sweepgen increases by 2 on every GC, the two lists swap each cycle
        h.sweepSpans[h.sweepgen/2%2].push(s) // Add to swept in-use list.
        // Initialize the span's members
        s.state = _MSpanInUse
        s.allocCount = 0
        s.spanclass = spanclass
        if sizeclass := spanclass.sizeclass(); sizeclass == 0 {
            s.elemsize = s.npages << _PageShift
            s.divShift = 0
            s.divMul = 0
            s.divShift2 = 0
            s.baseMask = 0
        } else {
            s.elemsize = uintptr(class_to_size[sizeclass])
            m := &class_to_divmagic[sizeclass]
            s.divShift = m.shift
            s.divMul = m.mul
            s.divShift2 = m.shift2
            s.baseMask = m.baseMask
        }

        // update stats, sweep lists
        h.pagesInUse += uint64(npage)
        // The grow function above passes false for large; large is true when allocating a large object via largeAlloc
        // Add the allocated span to the busy list; if its page count exceeds _MaxMHeapList (128 pages = 8K*128 = 1M), put it in the busylarge list
        if large {
            memstats.heap_objects++
            mheap_.largealloc += uint64(s.elemsize)
            mheap_.nlargealloc++
            atomic.Xadd64(&memstats.heap_live, int64(npage<<_PageShift))
            // Swept spans are at the end of lists.
            if s.npages < uintptr(len(h.busy)) {
                h.busy[s.npages].insertBack(s)
            } else {
                h.busylarge.insertBack(s)
            }
        }
    }
    }
    // If a GC is in progress, since heap_live changed, readjust the assist-mark work value for G's
    // See the analysis of the revise function below for details
    // heap_scan and heap_live were updated.
    if gcBlackenEnabled != 0 {
        gcController.revise()
    }

    // Tracing
    if trace.enabled {
        traceHeapAlloc()
    }

    // h.spans is accessed concurrently without synchronization
    // from other threads. Hence, there must be a store/store
    // barrier here to ensure the writes to h.spans above happen
    // before the caller can publish a pointer p to an object
    // allocated from s. As soon as this happens, the garbage
    // collector running on another processor could read p and
    // look up s in h.spans. The unlock acts as the barrier to
    // order these writes. On the read side, the data dependency
    // between p and the index in h.spans orders the reads.
    unlock(&h.lock)
    return s
}

Next, look at the allocSpanLocked function:

// Allocates a span of the given size.  h must be locked.
// The returned span has been removed from the
// free list, but its state is still MSpanFree.
func (h *mheap) allocSpanLocked(npage uintptr, stat *uint64) *mspan {
    var list *mSpanList
    var s *mspan

    // Try to allocate from mheap's free lists.
    // Free spans with fewer than _MaxMHeapList (128 pages = 1MB) pages live in the free lists;
    // larger free spans live in the freelarge list.
    // Try in fixed-size lists up to max.
    for i := int(npage); i < len(h.free); i++ {
        list = &h.free[i]
        if !list.isEmpty() {
            s = list.first
            list.remove(s)
            goto HaveSpan
        }
    }
    // If the free lists have nothing, search the freelarge list.
    // If that fails too, grow the heap from the arena, add the new span to
    // freelarge, then search freelarge again.
    // Best fit in list of large spans.
    s = h.allocLarge(npage) // allocLarge removed s from h.freelarge for us
    if s == nil {
        if !h.grow(npage) {
            return nil
        }
        s = h.allocLarge(npage)
        if s == nil {
            return nil
        }
    }

HaveSpan:
    // Mark span in use.
    if s.state != _MSpanFree {
        throw("MHeap_AllocLocked - MSpan not free")
    }
    if s.npages < npage {
        throw("MHeap_AllocLocked - bad npages")
    }
    // If the span has released pages (virtual memory decommitted from physical
    // memory), tell the OS they will be used again, then update statistics.
    if s.npreleased > 0 {
        sysUsed(unsafe.Pointer(s.base()), s.npages<<_PageShift)
        memstats.heap_released -= uint64(s.npreleased << _PageShift)
        s.npreleased = 0
    }

    // If the span has more pages than requested,
    // split the remainder into another span and put it back on the free list.
    if s.npages > npage {
        // Trim extra and put it back in the heap.
        t := (*mspan)(h.spanalloc.alloc())
        t.init(s.base()+npage<<_PageShift, s.npages-npage)
        s.npages = npage
        p := (t.base() - h.arena_start) >> _PageShift
        if p > 0 {
            h.spans[p-1] = s
        }
        h.spans[p] = t
        h.spans[p+t.npages-1] = t
        t.needzero = s.needzero
        s.state = _MSpanManual // prevent coalescing with s
        t.state = _MSpanManual
        h.freeSpanLocked(t, false, false, s.unusedsince)
        s.state = _MSpanFree
    }
    s.unusedsince = 0

    // Set up the spans area: record which addresses belong to this mspan.
    p := (s.base() - h.arena_start) >> _PageShift
    for n := uintptr(0); n < npage; n++ {
        h.spans[p+n] = s
    }

    // Update statistics.
    *stat += uint64(npage << _PageShift)
    memstats.heap_idle -= uint64(npage << _PageShift)

    //println("spanalloc", hex(s.start<<_PageShift))
    if s.inList() {
        throw("still in list")
    }
    return s
}
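The trim step above (split an oversized span and return the remainder to a free list) can be modeled with plain integers. This is a toy sketch — pageToSpan, trim, and the span ids are invented for illustration, not runtime code:

```go
package main

import "fmt"

// trim mirrors allocSpanLocked's split: span sID covers pages
// [start, start+have); keep need pages for sID and hand the remaining
// pages to a new span tID. pageToSpan plays the role of the spans area,
// recording which span owns each page.
func trim(pageToSpan []int, start, have, need, sID, tID int) {
	for i := 0; i < need; i++ {
		pageToSpan[start+i] = sID
	}
	for i := need; i < have; i++ {
		pageToSpan[start+i] = tID
	}
}

func main() {
	pages := make([]int, 8)
	// Span 1 owned pages 0..7; serve a 3-page request and carve the
	// 5-page remainder into span 2 (which would go back on a free list).
	trim(pages, 0, 8, 3, 1, 2)
	fmt.Println(pages) // [1 1 1 2 2 2 2 2]
}
```

The real code additionally updates the boundary entries (h.spans[p-1], h.spans[p+t.npages-1]) so that coalescing of adjacent free spans works.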

Now look at the allocLarge function:

// allocLarge allocates a span of at least npage pages from the treap of large spans.
// Returns nil if no such span currently exists.
func (h *mheap) allocLarge(npage uintptr) *mspan {
    // Search treap for smallest span with >= npage pages.
    return h.freelarge.remove(npage)
}

freelarge has type mTreap; its remove function searches the treap for the smallest span in the tree with at least npage pages:

// remove searches for, finds, removes from the treap, and returns the smallest
// span that can hold npages. If no span has at least npages return nil.
// This is slightly more complicated than a simple binary tree search
// since if an exact match is not found the next larger node is
// returned.
// If the last node inspected > npagesKey not holding
// a left node (a smaller npages) is the "best fit" node.
func (root *mTreap) remove(npages uintptr) *mspan {
    t := root.treap
    for t != nil {
        if t.spanKey == nil {
            throw("treap node with nil spanKey found")
        }
        if t.npagesKey < npages {
            t = t.right
        } else if t.left != nil && t.left.npagesKey >= npages {
            t = t.left
        } else {
            result := t.spanKey
            root.removeNode(t)
            return result
        }
    }
    return nil
}

The function that requests a new span from the arena is mheap's grow function:

// Try to add at least npage pages of memory to the heap,
// returning whether it worked.
//
// h must be locked.
func (h *mheap) grow(npage uintptr) bool {
    // Ask for a big chunk, to reduce the number of mappings
    // the operating system needs to track; also amortizes
    // the overhead of an operating system mapping.
    // Allocate a multiple of 64kB.
    npage = round(npage, (64<<10)/_PageSize)
    ask := npage << _PageShift
    if ask < _HeapAllocChunk {
        ask = _HeapAllocChunk
    }

    // Request memory via mheap.sysAlloc.
    v := h.sysAlloc(ask)
    if v == nil {
        if ask > npage<<_PageShift {
            ask = npage << _PageShift
            v = h.sysAlloc(ask)
        }
        if v == nil {
            print("runtime: out of memory: cannot allocate ", ask, "-byte block (", memstats.heap_sys, " in use)\n")
            return false
        }
    }

    // 创立一个新的 span 并加到自在列表中
    // Create a fake "in use" span and free it, so that the
    // right coalescing happens.
    s := (*mspan)(h.spanalloc.alloc())
    s.init(uintptr(v), ask>>_PageShift)
    p := (s.base() - h.arena_start) >> _PageShift
    for i := p; i < p+s.npages; i++ {
        h.spans[i] = s
    }
    atomic.Store(&s.sweepgen, h.sweepgen)
    s.state = _MSpanInUse
    h.pagesInUse += uint64(s.npages)
    h.freeSpanLocked(s, false, true, 0)
    return true
}

Now look at mheap's sysAlloc function:

// sysAlloc allocates the next n bytes from the heap arena. The
// returned pointer is always _PageSize aligned and between
// h.arena_start and h.arena_end. sysAlloc returns nil on failure.
// There is no corresponding free function.
func (h *mheap) sysAlloc(n uintptr) unsafe.Pointer {
    // strandLimit is the maximum number of bytes to strand from
    // the current arena block. If we would need to strand more
    // than this, we fall back to sysAlloc'ing just enough for
    // this allocation.
    const strandLimit = 16 << 20

    // If the committed part of the arena is insufficient, call sysReserve to
    // reserve more address space, then update arena_end.
    // On linux, sysReserve calls mmap:
    // mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
    if n > h.arena_end-h.arena_alloc {
        // If we haven't grown the arena to _MaxMem yet, try
        // to reserve some more address space.
        p_size := round(n+_PageSize, 256<<20)
        new_end := h.arena_end + p_size // Careful: can overflow
        if h.arena_end <= new_end && new_end-h.arena_start-1 <= _MaxMem {
            // TODO: It would be bad if part of the arena
            // is reserved and part is not.
            var reserved bool
            p := uintptr(sysReserve(unsafe.Pointer(h.arena_end), p_size, &reserved))
            if p == 0 {
                // TODO: Try smaller reservation
                // growths in case we're in a crowded
                // 32-bit address space.
                goto reservationFailed
            }
            // p can be just about anywhere in the address
            // space, including before arena_end.
            if p == h.arena_end {
                // The new block is contiguous with
                // the current block. Extend the
                // current arena block.
                h.arena_end = new_end
                h.arena_reserved = reserved
            } else if h.arena_start <= p && p+p_size-h.arena_start-1 <= _MaxMem && h.arena_end-h.arena_alloc < strandLimit {
                // We were able to reserve more memory
                // within the arena space, but it's
                // not contiguous with our previous
                // reservation. It could be before or
                // after our current arena_used.
                //
                // Keep everything page-aligned.
                // Our pages are bigger than hardware pages.
                h.arena_end = p + p_size
                p = round(p, _PageSize)
                h.arena_alloc = p
                h.arena_reserved = reserved
            } else {
                // We got a mapping, but either
                //
                // 1) It's not in the arena, so we
                // can't use it. (This should never
                // happen on 32-bit.)
                //
                // 2) We would need to discard too
                // much of our current arena block to
                // use it.
                //
                // We haven't added this allocation to
                // the stats, so subtract it from a
                // fake stat (but avoid underflow).
                //
                // We'll fall back to a small sysAlloc.
                stat := uint64(p_size)
                sysFree(unsafe.Pointer(p), p_size, &stat)
            }
        }
    }

    // When the reservation has enough space, just advance arena_alloc.
    if n <= h.arena_end-h.arena_alloc {
        // Keep taking from our reservation.
        p := h.arena_alloc
        sysMap(unsafe.Pointer(p), n, h.arena_reserved, &memstats.heap_sys)
        h.arena_alloc += n
        if h.arena_alloc > h.arena_used {
            h.setArenaUsed(h.arena_alloc, true)
        }

        if p&(_PageSize-1) != 0 {
            throw("misrounded allocation in MHeap_SysAlloc")
        }
        return unsafe.Pointer(p)
    }

    // Fall back when the reservation has run out.
reservationFailed:
    // If using 64-bit, our reservation is all we have.
    if sys.PtrSize != 4 {
        return nil
    }

    // On 32-bit, once the reservation is gone we can
    // try to get memory at a location chosen by the OS.
    p_size := round(n, _PageSize) + _PageSize
    p := uintptr(sysAlloc(p_size, &memstats.heap_sys))
    if p == 0 {
        return nil
    }

    if p < h.arena_start || p+p_size-h.arena_start > _MaxMem {
        // This shouldn't be possible because _MaxMem is the
        // whole address space on 32-bit.
        top := uint64(h.arena_start) + _MaxMem
        print("runtime: memory allocated by OS (", hex(p), ") not in usable range [", hex(h.arena_start), ",", hex(top), ")\n")
        sysFree(unsafe.Pointer(p), p_size, &memstats.heap_sys)
        return nil
    }

    p += -p & (_PageSize - 1)
    if p+n > h.arena_used {
        h.setArenaUsed(p+n, true)
    }

    if p&(_PageSize-1) != 0 {
        throw("misrounded allocation in MHeap_SysAlloc")
    }
    return unsafe.Pointer(p)
}

That completes the object-allocation flow. Next we analyze how the GC marks and reclaims objects.

Reclaiming objects

The reclamation flow

Go's GC is a concurrent GC: most of the GC work runs at the same time as ordinary go code, which makes the flow fairly complex.
The GC has four phases:

  • Sweep Termination: sweep any unswept spans; a new GC cycle can only start once the previous cycle's sweeping is done
  • Mark: scan all root objects and everything reachable from them, marking them so they are not reclaimed
  • Mark Termination: finish marking and rescan some roots (requires STW)
  • Sweep: sweep spans according to the mark results

The diagram below shows the full GC flow, with the four phases color-coded:

During GC there are two kinds of background tasks (Gs): mark workers and sweep workers.
Mark workers are started on demand; the number that can run at the same time is about 25% of the number of Ps, which is the basis of Go's claim that 25% of the CPU is spent on GC.
One sweep worker is started at program startup and is woken when the sweep phase begins.

Currently a full GC cycle stops the world (STW) twice: once at the start of the Mark phase and once in the Mark Termination phase.
The first STW prepares the root scan and enables the write barrier and mutator assists.
The second STW rescans some roots and disables the write barrier and mutator assists.
Note that not all root scanning requires STW; for example, scanning a G's stack only requires stopping the G that owns the stack.
Since go 1.9 the write barrier is implemented as a Hybrid Write Barrier, which greatly shortens the second STW.
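The phase ordering described above can be modeled as a tiny state machine. The names mirror the runtime's _GCoff/_GCmark/_GCmarktermination constants, but this transition function is purely illustrative:

```go
package main

import "fmt"

type gcPhase int

const (
	gcOff             gcPhase = iota // sweep termination and sweeping happen here
	gcMark                           // concurrent marking, write barrier on (1st STW at entry)
	gcMarkTermination                // rescan some roots with the world stopped (2nd STW)
)

// next returns the phase that follows p within one GC cycle.
func next(p gcPhase) gcPhase {
	switch p {
	case gcOff:
		return gcMark
	case gcMark:
		return gcMarkTermination
	default:
		return gcOff
	}
}

func main() {
	p := gcOff
	for i := 0; i < 3; i++ {
		p = next(p)
		fmt.Println(p)
	}
	// after three transitions a full cycle returns to gcOff
}
```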

GC trigger conditions

The GC is triggered once certain conditions are met. The trigger kinds are:

  • gcTriggerAlways: trigger GC unconditionally
  • gcTriggerHeap: trigger GC when the allocated heap reaches a threshold
  • gcTriggerTime: trigger GC when none has run for a certain amount of time
  • gcTriggerCycle: request a new GC cycle, skipped if one is already running; manually triggered GC via runtime.GC() uses this kind

The conditions are checked in gcTrigger's test function.
gcTriggerHeap and gcTriggerTime fire naturally; the gcTriggerHeap check is:

return memstats.heap_live >= memstats.gc_trigger

How heap_live grows can be seen in the allocator analysis above. GC triggers when it reaches gc_trigger, so how is gc_trigger determined?
gc_trigger is computed in the gcSetTriggerRatio function with the formula:

trigger = uint64(float64(memstats.heap_marked) * (1 + triggerRatio))

That is, the currently marked live size multiplied by (1 + triggerRatio) is the allocation level that triggers the next GC.
triggerRatio is adjusted after every GC; the function that computes it is endCycle, with the formula:

const triggerGain = 0.5
// Goal heap growth ratio; defaults to 1.0.
goalGrowthRatio := float64(gcpercent) / 100
// Actual heap growth ratio = total size / live size - 1.
actualGrowthRatio := float64(memstats.heap_live)/float64(memstats.heap_marked) - 1
// Time spent in the GC mark phase (endCycle is called from Mark Termination).
assistDuration := nanotime() - c.markStartTime
// CPU utilization of the mark phase; the goal is 0.25.
utilization := gcGoalUtilization
if assistDuration > 0 {
    // assistTime is the total time Gs spent assisting with marking
    // (nanoseconds spent in mutator assists during this cycle).
    // extra utilization = total assist time / (mark duration * number of Ps)
    utilization += float64(c.assistTime) / float64(assistDuration*int64(gomaxprocs))
}
// trigger-ratio error = goal growth ratio - old trigger ratio
//                       - utilization / goal utilization * (actual growth ratio - old trigger ratio)
// Behavior:
// the larger the actual growth ratio, the smaller the error; below 0 the next GC triggers earlier
// the larger the CPU utilization, the smaller the error; below 0 the next GC triggers earlier
// the larger the old trigger ratio, the smaller the error; below 0 the next GC triggers earlier
triggerError := goalGrowthRatio - memstats.triggerRatio - utilization/gcGoalUtilization*(actualGrowthRatio-memstats.triggerRatio)
// Adjust the trigger ratio by half of the error each time (a gradual adjustment).
triggerRatio := memstats.triggerRatio + triggerGain*triggerError

The "goal heap growth ratio" in the formula can be adjusted with the environment variable GOGC; the default is 100, and increasing it makes GC trigger less often.
Setting GOGC=off disables GC entirely.
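Plugging sample numbers into the endCycle formula shows the feedback behavior. adjustTriggerRatio and all of the input values below are invented for illustration:

```go
package main

import "fmt"

// adjustTriggerRatio applies endCycle's proportional controller with
// triggerGain = 0.5. goalGrowth is GOGC/100; utilization is the measured
// mark-phase CPU fraction (the goal is 0.25).
func adjustTriggerRatio(old, goalGrowth, actualGrowth, utilization float64) float64 {
	const triggerGain = 0.5
	const gcGoalUtilization = 0.25
	triggerError := goalGrowth - old - utilization/gcGoalUtilization*(actualGrowth-old)
	return old + triggerGain*triggerError
}

func main() {
	// The heap grew faster than the goal (1.2 vs 1.0) at exactly the
	// target CPU utilization: the ratio drops from 0.9 toward 0.8,
	// so the next GC triggers earlier.
	fmt.Println(adjustTriggerRatio(0.9, 1.0, 1.2, 0.25))
}
```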

The gcTriggerTime check is:

lastgc := int64(atomic.Load64(&memstats.last_gc_nanotime))
return lastgc != 0 && t.now-lastgc > forcegcperiod

forcegcperiod is defined as 2 minutes, so a GC is forced if none has run within 2 minutes.

The three colors (black, gray, white)

The best explanation I have seen of the "three colors" in tri-color GC is this article; I strongly recommend reading its explanation first.
The "three colors" can be understood simply as:

  • black: the object has been marked in this GC cycle, and so have the child objects it contains
  • gray: the object has been marked in this GC cycle, but the child objects it contains have not
  • white: the object has not been marked in this GC cycle

Objects in go do not actually store a color attribute; the colors merely describe their state:
a white object has its bit set to 0 in its span's gcmarkBits,
a gray object has its bit set to 1 in its span's gcmarkBits and is in the mark queue,
a black object has its bit set to 1 in its span's gcmarkBits and has already been taken off the mark queue and processed.
When GC finishes, gcmarkBits is moved to allocBits and a fresh all-zero bitmap is allocated, which turns all black objects back to white.
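The mapping from gcmarkBits plus mark-queue membership to a color can be written out directly; colorOf is a model of the description above, not runtime code:

```go
package main

import "fmt"

type color string

// colorOf applies the rules above: unmarked ⇒ white; marked and still in
// the mark queue ⇒ gray; marked and already processed ⇒ black.
func colorOf(marked, inQueue bool) color {
	switch {
	case !marked:
		return "white"
	case inQueue:
		return "gray"
	default:
		return "black"
	}
}

func main() {
	fmt.Println(colorOf(false, false)) // white
	fmt.Println(colorOf(true, true))   // gray
	fmt.Println(colorOf(true, false))  // black
}
```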

Write Barrier

Because go supports concurrent GC, GC scanning and go code run at the same time. The problem this brings is that the go code may change the object graph while GC is scanning it.
For example: the scan starts with root objects A and B, and B holds a pointer to C. GC scans A first, then B hands C's pointer to A; when GC then scans B, C is never scanned.
To avoid this problem, go enables the write barrier during the GC mark phase.

With the write barrier enabled, when B hands C's pointer to A, GC treats C's pointer as live in this cycle;
even if A drops C shortly afterwards, C is simply collected in the next cycle.
The write barrier applies only to pointers, and only during the mark phase; otherwise values are written directly to the target address.

Since 1.9, go uses a hybrid write barrier, with the following pseudocode:

writePointer(slot, ptr):
    shade(*slot)
    if any stack is grey:
        shade(ptr)
    *slot = ptr

The hybrid write barrier marks both the "old pointer" and the "new pointer" of a pointer write.

The old pointer is marked because another running thread may concurrently copy that pointer into a register or a stack-local variable,
and copying a pointer to a register or a stack-local variable does not go through the write barrier, so the pointer could otherwise go unmarked. Consider:

 b = obj
 oldx = nil
[gc] scan oldx...
 oldx = b.x // copy b.x to a local variable; this does not go through the write barrier
 b.x = ptr // the write barrier should mark the old value of b.x
[gc] scan b...
If the write barrier did not mark the old value, oldx would never be scanned.

The new pointer is marked because another running thread may move the pointer elsewhere. Consider:

 a = ptr
 b = obj
[gc] scan b...
 b.x = a // the write barrier should mark the new value of b.x
 a = nil
[gc] scan a...
If the write barrier did not mark the new value, ptr would never be scanned.

With the hybrid write barrier, GC does not need to rescan each G's stack after concurrent marking finishes, which reduces the STW time of Mark Termination.
Besides the write barrier, every object newly allocated during GC immediately becomes black, as seen in the mallocgc function above.
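The pseudocode above translates almost directly into Go. The object model below (a map of mark flags and an explicit slot) is made up for the example; the real barrier operates on heap words:

```go
package main

import "fmt"

type obj struct{ name string }

var (
	marked       = map[*obj]bool{}
	anyStackGrey = true // conservatively assume some stack is still unscanned
)

func shade(o *obj) {
	if o != nil && !marked[o] {
		marked[o] = true // the real runtime would also enqueue o as gray
	}
}

// writePointer is the hybrid barrier: shade the old value in *slot, shade
// the new value while any stack may be grey, then perform the store.
func writePointer(slot **obj, ptr *obj) {
	shade(*slot)
	if anyStackGrey {
		shade(ptr)
	}
	*slot = ptr
}

func main() {
	oldVal, newVal := &obj{"C"}, &obj{"F"}
	slot := oldVal
	writePointer(&slot, newVal)
	fmt.Println(marked[oldVal], marked[newVal]) // true true
}
```

Both the old and the new pointee end up marked, which is exactly what protects the two scenarios shown above.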

Mutator assist

To keep the heap from growing too fast, a G that allocates memory while GC is running is required to assist the GC with part of its work.
Goroutines running during GC are called "mutators", and the "mutator assist" mechanism is how Gs help the GC with part of its work.

Assist work comes in two kinds: marking and sweeping.
The assist-mark trigger can be seen in the mallocgc function above; when triggered, the G helps scan an amount of work computed as:

debtBytes * assistWorkPerByte

That is, the allocation size multiplied by the factor assistWorkPerByte, which is computed in the revise function:

// scan work remaining = total expected scan work - scan work already done
scanWorkExpected := int64(memstats.heap_scan) - c.scanWork
if scanWorkExpected < 1000 {
    scanWorkExpected = 1000
}
// heap distance to the GC goal = goal heap size - current heap size
// Note that next_gc is computed differently from gc_trigger: next_gc = heap_marked * (1 + gcpercent/100)
heapDistance := int64(memstats.next_gc) - int64(atomic.Load64(&memstats.heap_live))
if heapDistance <= 0 {
    heapDistance = 1
}
// scan work owed per allocated byte = remaining scan work / heap distance to the goal
c.assistWorkPerByte = float64(scanWorkExpected) / float64(heapDistance)
c.assistBytesPerWork = float64(heapDistance) / float64(scanWorkExpected)
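With sample numbers, the revise computation behaves as follows. assistRatios and the inputs are illustrative (units are bytes of scan work):

```go
package main

import "fmt"

// assistRatios mirrors revise: remaining scan work divided by the heap
// headroom left before the goal, with the same floors the runtime applies.
func assistRatios(heapScan, scanDone, nextGC, heapLive int64) (perByte, perWork float64) {
	scanWorkExpected := heapScan - scanDone
	if scanWorkExpected < 1000 {
		scanWorkExpected = 1000
	}
	heapDistance := nextGC - heapLive
	if heapDistance <= 0 {
		heapDistance = 1
	}
	return float64(scanWorkExpected) / float64(heapDistance),
		float64(heapDistance) / float64(scanWorkExpected)
}

func main() {
	// 40MB of scan work left and 20MB of headroom: every allocated byte
	// must be paid for with 2 bytes of scan work.
	perByte, _ := assistRatios(50<<20, 10<<20, 120<<20, 100<<20)
	fmt.Println(perByte) // 2
}
```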

Unlike assist marking, which is checked on every allocation, assist sweeping is only checked when a new span is requested.
The assist-sweep trigger can be seen in the cacheSpan function above; when triggered, the G helps sweep a number of pages computed as:

spanBytes * sweepPagesPerByte // not exactly; see the deductSweepCredit function for details

That is, the allocation size multiplied by the factor sweepPagesPerByte, which is computed in gcSetTriggerRatio:

// Current heap size.
heapLiveBasis := atomic.Load64(&memstats.heap_live)
// Heap distance to the GC trigger = next trigger heap size - current heap size.
heapDistance := int64(trigger) - int64(heapLiveBasis)
heapDistance -= 1024 * 1024
if heapDistance < _PageSize {
    heapDistance = _PageSize
}
// Pages already swept.
pagesSwept := atomic.Load64(&mheap_.pagesSwept)
// Unswept pages = pages in use - pages swept.
sweepDistancePages := int64(mheap_.pagesInUse) - int64(pagesSwept)
if sweepDistancePages <= 0 {
    mheap_.sweepPagesPerByte = 0
} else {
    // pages to sweep per allocated byte (of span) = unswept pages / heap distance to the trigger
    mheap_.sweepPagesPerByte = float64(sweepDistancePages) / float64(heapDistance)
}
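The sweep-assist ratio can be worked through the same way (page size 8KB; sweepPagesPerByte and the numbers here are illustrative, not runtime code):

```go
package main

import "fmt"

const pageSize = 8 << 10

// sweepPagesPerByte mirrors gcSetTriggerRatio: unswept pages divided by
// the heap headroom left before the next GC trigger.
func sweepPagesPerByte(trigger, heapLive, pagesInUse, pagesSwept int64) float64 {
	heapDistance := trigger - heapLive
	heapDistance -= 1024 * 1024 // finish sweeping slightly ahead of the trigger
	if heapDistance < pageSize {
		heapDistance = pageSize
	}
	sweepDistancePages := pagesInUse - pagesSwept
	if sweepDistancePages <= 0 {
		return 0
	}
	return float64(sweepDistancePages) / float64(heapDistance)
}

func main() {
	// 2048 unswept pages and about 16MB of headroom: roughly one page
	// must be swept for every 8KB allocated.
	fmt.Println(sweepPagesPerByte(117<<20, 100<<20, 4096, 2048))
}
```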

Root objects

The first things marked in the GC mark phase are the "root objects"; everything reachable from a root is considered live.
The roots include global variables and the variables on each G's stack; GC scans the roots first, then everything reachable from them.
Scanning the roots consists of a series of jobs, defined in the function at [https://github.com/golang/go/blob/go1.9.2/src/runtime/mgcmark.go#L54]:

  • Fixed Roots: special scan jobs

    • fixedRootFinalizers: scan the finalizer queue
    • fixedRootFreeGStacks: free the stacks of terminated Gs
  • Flush Cache Roots: release all spans held in mcaches (requires STW)
  • Data Roots: scan the initialized global variables (the data sections)
  • BSS Roots: scan the uninitialized global variables (the bss sections)
  • Span Roots: scan the special objects in each span (finalizer lists)
  • Stack Roots: scan each G's stack

The Mark phase performs the "Fixed Roots", "Data Roots", "BSS Roots", "Span Roots" and "Stack Roots" jobs.
The Mark Termination phase performs the "Fixed Roots" and "Flush Cache Roots" jobs.

The mark queue

The GC mark phase uses a "mark queue" to ensure that every object reachable from the roots gets marked; the "gray" objects mentioned above are the objects in this queue.
For example, if the current roots are [A, B, C], scanning the roots puts them on the mark queue:

work queue: [A, B, C]

A background mark worker takes A off the queue; A references D, so D is queued:

work queue: [B, C, D]

A background mark worker takes B off the queue; B also references D, but D's bit in gcmarkBits is already 1, so it is skipped:

work queue: [C, D]

If concurrently running go code allocates an object E, E is marked immediately but not queued (because E is known not to reference any other object yet).
Then the go code stores object F into a member of E; the write barrier marks F and adds it to the mark queue:

work queue: [C, D, F]

A background mark worker takes C off the queue; C references no other objects, so nothing more is needed:

work queue: [D, F]

A background mark worker takes D off the queue; D references X, so X is queued:

work queue: [F, X]

A background mark worker takes F off the queue; F references no other objects, so nothing more is needed.
A background mark worker takes X off the queue; X references no other objects, so nothing more is needed.
Finally the mark queue is empty, marking is complete, and the live objects are [A, B, C, D, E, F, X].

Reality is slightly more complicated than this sketch.
The mark queue is split into a global mark queue and per-P local mark queues, similar to the scheduler's run queues.
And once the mark queue is empty, the world must be stopped and the write barrier disabled, and then the queue is checked again to confirm it is really empty.
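The walkthrough above can be run as a small simulation. The object graph and queue are plain Go values, and mark only models the queue discipline, not the real scanner:

```go
package main

import "fmt"

// mark drains a work queue over an object graph (adjacency list),
// skipping objects whose mark bit is already set — the gcmarkBits
// check described above. It returns the order objects were processed.
func mark(graph map[string][]string, roots []string) []string {
	marked := map[string]bool{}
	queue := append([]string{}, roots...)
	for _, r := range roots {
		marked[r] = true
	}
	var order []string
	for len(queue) > 0 {
		obj := queue[0]
		queue = queue[1:]
		order = append(order, obj)
		for _, child := range graph[obj] {
			if !marked[child] { // D, reached from both A and B, is queued only once
				marked[child] = true
				queue = append(queue, child)
			}
		}
	}
	return order
}

func main() {
	graph := map[string][]string{"A": {"D"}, "B": {"D"}, "D": {"X"}}
	fmt.Println(mark(graph, []string{"A", "B", "C"}))
	// [A B C D X] — D is processed once even though A and B both reference it
}
```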

Source code analysis

Triggering a GC starts from the gcStart function:

// gcStart transitions the GC from _GCoff to _GCmark (if
// !mode.stwMark) or _GCmarktermination (if mode.stwMark) by
// performing sweep termination and GC initialization.
//
// This may return without performing this transition in some cases,
// such as when called on a system stack or with locks held.
func gcStart(mode gcMode, trigger gcTrigger) {
    // Don't trigger GC if the current G is not preemptible.
    // Since this is called from malloc and malloc is called in
    // the guts of a number of libraries that might be holding
    // locks, don't attempt to start GC in non-preemptible or
    // potentially unstable situations.
    mp := acquirem()
    if gp := getg(); gp == mp.g0 || mp.locks > 1 || mp.preemptoff != "" {
        releasem(mp)
        return
    }
    releasem(mp)
    mp = nil

    // Concurrently sweep spans left over from the previous GC cycle.
    // Pick up the remaining unswept/not being swept spans concurrently
    //
    // This shouldn't happen if we're being invoked in background
    // mode since proportional sweep should have just finished
    // sweeping everything, but rounding errors, etc, may leave a
    // few spans unswept. In forced mode, this is necessary since
    // GC can be forced at any point in the sweeping cycle.
    //
    // We check the transition condition continuously here in case
    // this G gets delayed in to the next GC cycle.
    for trigger.test() && gosweepone() != ^uintptr(0) {
        sweep.nbgsweep++
    }

    // Take the lock, then re-check the gcTrigger condition; if it no longer holds, don't trigger GC.
    // Perform GC initialization and the sweep termination
    // transition.
    semacquire(&work.startSema)
    // Re-check transition condition under transition lock.
    if !trigger.test() {
        semrelease(&work.startSema)
        return
    }

    // Record whether this GC was forced; gcTriggerCycle is used by runtime.GC.
    // For stats, check if this GC was forced by the user.
    work.userForced = trigger.kind == gcTriggerAlways || trigger.kind == gcTriggerCycle

    // Check whether a debug flag disabling concurrent GC was specified.
    // In gcstoptheworld debug mode, upgrade the mode accordingly.
    // We do this after re-checking the transition condition so
    // that multiple goroutines that detect the heap trigger don't
    // start multiple STW GCs.
    if mode == gcBackgroundMode {
        if debug.gcstoptheworld == 1 {
            mode = gcForceMode
        } else if debug.gcstoptheworld == 2 {
            mode = gcForceBlockMode
        }
    }

    // Ok, we're doing it!  Stop everybody else
    semacquire(&worldsema)

    // Tracing.
    if trace.enabled {
        traceGCStart()
    }

    // Start the background mark workers (Gs).
    if mode == gcBackgroundMode {
        gcBgMarkStartWorkers()
    }

    // Reset mark-related state.
    gcResetMarkState()

    // Reset parameters.
    work.stwprocs, work.maxprocs = gcprocs(), gomaxprocs
    work.heap0 = atomic.Load64(&memstats.heap_live)
    work.pauseNS = 0
    work.mode = mode

    // Record the start time.
    now := nanotime()
    work.tSweepTerm = now
    work.pauseStart = now
    
    // Stop all running Gs and prevent them from running again.
    systemstack(stopTheWorldWithSema)
    
    // !!!!!!!!!!!!!!!!
    // The world is now stopped (STW)...
    // !!!!!!!!!!!!!!!!
    
    // Sweep spans left unswept by the previous GC, ensuring the previous cycle has finished.
    // Finish sweep before we start concurrent scan.
    systemstack(func() {
        finishsweep_m()
    })
    // Clear sched.sudogcache and sched.deferpool.
    // clearpools before we start the GC. If we wait they memory will not be
    // reclaimed until the next GC cycle.
    clearpools()

    // Increment the GC cycle count.
    work.cycles++
    
    // Check whether this is concurrent GC mode.
    if mode == gcBackgroundMode { // Do as much work concurrently as possible
        // Mark that a new GC cycle has started.
        gcController.startCycle()
        work.heapGoal = memstats.next_gc

        // Set the global GC phase to _GCmark,
        // which enables the write barrier.
        // Enter concurrent mark phase and enable
        // write barriers.
        //
        // Because the world is stopped, all Ps will
        // observe that write barriers are enabled by
        // the time we start the world and begin
        // scanning.
        //
        // Write barriers must be enabled before assists are
        // enabled because they must be enabled before
        // any non-leaf heap objects are marked. Since
        // allocations are blocked until assists can
        // happen, we want enable assists as early as
        // possible.
        setGCPhase(_GCmark)

        // Reset the background mark worker counters.
        gcBgMarkPrepare() // Must happen before assist enable.

        // Compute the number of root-scan jobs.
        gcMarkRootPrepare()

        // Mark all tiny alloc blocks waiting to be merged.
        // Mark all active tinyalloc blocks. Since we're
        // allocating from these, they need to be black like
        // other allocations. The alternative is to blacken
        // the tiny block on every allocation from it, which
        // would slow down the tiny allocator.
        gcMarkTinyAllocs()

        // Enable mutator assists.
        // At this point all Ps have enabled the write
        // barrier, thus maintaining the no white to
        // black invariant. Enable mutator assists to
        // put back-pressure on fast allocating
        // mutators.
        atomic.Store(&gcBlackenEnabled, 1)

        // Record the time marking started.
        // Assists and workers can start the moment we start
        // the world.
        gcController.markStartTime = now

        // Restart the world.
        // The background mark workers created earlier start working; once they all finish, we enter mark termination.
        // Concurrent mark.
        systemstack(startTheWorldWithSema)
        
        // !!!!!!!!!!!!!!!
        // The world has been restarted...
        // !!!!!!!!!!!!!!!
        
        // Record how long the pause was and when the mark phase started.
        now = nanotime()
        work.pauseNS += now - work.pauseStart
        work.tMark = now
    } else {
        // Not concurrent GC mode.
        // Record the time mark termination started.
        t := nanotime()
        work.tMark, work.tMarkTerm = t, t
        work.heapGoal = work.heap0

        // Skip the mark phase and perform mark termination directly.
        // All mark work runs with the world stopped.
        // (The mark phase sets work.markrootDone=true; if skipped it stays false and mark termination performs all the jobs.)
        // Mark termination will restart the world.
        // Perform mark termination. This will restart the world.
        gcMarkTermination(memstats.triggerRatio)
    }

    semrelease(&work.startSema)
}

Next we analyze the functions called by gcStart one by one; it helps to follow along with the diagram in the reclamation-flow section above.

The gcBgMarkStartWorkers function starts the background mark workers, one for each P:

// gcBgMarkStartWorkers prepares background mark worker goroutines.
// These goroutines will not run until the mark phase, but they must
// be started while the work is not stopped and from a regular G
// stack. The caller must hold worldsema.
func gcBgMarkStartWorkers() {
    // Background marking is performed by per-P G's. Ensure that
    // each P has a background GC G.
    for _, p := range &allp {
        if p == nil || p.status == _Pdead {
            break
        }
        // Don't start a second worker if one is already running.
        if p.gcBgMarkWorker == 0 {
            go gcBgMarkWorker(p)
            // After starting it, wait for the worker to signal bgMarkReady before continuing.
            notetsleepg(&work.bgMarkReady, -1)
            noteclear(&work.bgMarkReady)
        }
    }
}

Although a background mark worker is started for each P, only about 25% of them can work at the same time; that logic lives in findRunnableGCWorker, which is called when an M acquires a G to run:

// findRunnableGCWorker returns the background mark worker for _p_ if it
// should be run. This must only be called when gcBlackenEnabled != 0.
func (c *gcControllerState) findRunnableGCWorker(_p_ *p) *g {
    if gcBlackenEnabled == 0 {
        throw("gcControllerState.findRunnable: blackening not enabled")
    }
    if _p_.gcBgMarkWorker == 0 {
        // The mark worker associated with this P is blocked
        // performing a mark transition. We can't run it
        // because it may be on some other run or wait queue.
        return nil
    }

    if !gcMarkWorkAvailable(_p_) {
        // No work to be done right now. This can happen at
        // the end of the mark phase when there are still
        // assists tapering off. Don't bother running a worker
        // now because it'll just return immediately.
        return nil
    }

    // Atomically decrement the value; return true if the result is >= 0, false otherwise.
    decIfPositive := func(ptr *int64) bool {
        if *ptr > 0 {
            if atomic.Xaddint64(ptr, -1) >= 0 {
                return true
            }
            // We lost a race
            atomic.Xaddint64(ptr, +1)
        }
        return false
    }

    // Decrement dedicatedMarkWorkersNeeded; on success the worker runs in Dedicated mode.
    // dedicatedMarkWorkersNeeded is 25% of the number of Ps with the fraction truncated.
    // See the startCycle function for details.
    if decIfPositive(&c.dedicatedMarkWorkersNeeded) {
        // This P is now dedicated to marking until the end of
        // the concurrent mark phase.
        _p_.gcMarkWorkerMode = gcMarkWorkerDedicatedMode
    } else {
        // Decrement fractionalMarkWorkersNeeded; on success the worker runs in Fractional mode.
        // If the computation above leaves a fractional part (not evenly divisible),
        // fractionalMarkWorkersNeeded is 1, otherwise 0. See the startCycle function.
        // For example, 4 Ps run 1 Dedicated worker; 5 Ps run 1 Dedicated plus 1 Fractional worker.
        if !decIfPositive(&c.fractionalMarkWorkersNeeded) {
            // No more workers are need right now.
            return nil
        }

        // Use the Dedicated workers' execution time to judge whether CPU utilization exceeds the budget; if so, don't start.
        // This P has picked the token for the fractional worker.
        // Is the GC currently under or at the utilization goal?
        // If so, do more work.
        //
        // We used to check whether doing one time slice of work
        // would remain under the utilization goal, but that has the
        // effect of delaying work until the mutator has run for
        // enough time slices to pay for the work. During those time
        // slices, write barriers are enabled, so the mutator is running slower.
        // Now instead we do the work whenever we're under or at the
        // utilization work and pay for it by letting the mutator run later.
        // This doesn't change the overall utilization averages, but it
        // front loads the GC work so that the GC finishes earlier and
        // write barriers can be turned off sooner, effectively giving
        // the mutator a faster machine.
        //
        // The old, slower behavior can be restored by setting
        //    gcForcePreemptNS = forcePreemptNS.
        const gcForcePreemptNS = 0

        // TODO(austin): We could fast path this and basically
        // eliminate contention on c.fractionalMarkWorkersNeeded by
        // precomputing the minimum time at which it's worth
        // next scheduling the fractional worker. Then Ps
        // don't have to fight in the window where we've
        // passed that deadline and no one has started the
        // worker yet.
        //
        // TODO(austin): Shorter preemption interval for mark
        // worker to improve fairness and give this
        // finer-grained control over schedule?
        now := nanotime() - gcController.markStartTime
        then := now + gcForcePreemptNS
        timeUsed := c.fractionalMarkTime + gcForcePreemptNS
        if then > 0 && float64(timeUsed)/float64(then) > c.fractionalUtilizationGoal {
            // Nope, we'd overshoot the utilization goal
            atomic.Xaddint64(&c.fractionalMarkWorkersNeeded, +1)
            return nil
        }
        _p_.gcMarkWorkerMode = gcMarkWorkerFractionalMode
    }

    // Schedule the background mark worker to run.
    // Run the background mark worker
    gp := _p_.gcBgMarkWorker.ptr()
    casgstatus(gp, _Gwaiting, _Grunnable)
    if trace.enabled {
        traceGoUnpark(gp, 0)
    }
    return gp
}
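The Dedicated/Fractional split described in the comments can be computed directly; markWorkers is a sketch of what startCycle's arithmetic amounts to, assuming the 25% utilization goal:

```go
package main

import "fmt"

// markWorkers returns how many Dedicated workers run and whether one
// Fractional worker is also needed, for nProcs Ps and a 25% goal.
func markWorkers(nProcs int) (dedicated int, fractional bool) {
	target := float64(nProcs) * 0.25
	dedicated = int(target)                   // whole workers (fraction truncated)
	fractional = target != float64(dedicated) // a leftover fraction needs one part-time worker
	return
}

func main() {
	for _, p := range []int{4, 5, 8} {
		d, f := markWorkers(p)
		fmt.Println(p, d, f)
	}
	// 4 Ps: 1 dedicated, no fractional; 5 Ps: 1 dedicated + 1 fractional; 8 Ps: 2 dedicated
}
```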

The gcResetMarkState function resets mark-related state:

// gcResetMarkState resets global state prior to marking (concurrent
// or STW) and resets the stack scan state of all Gs.
//
// This is safe to do without the world stopped because any Gs created
// during or after this will start out in the reset state.
func gcResetMarkState() {
    // This may be called during a concurrent phase, so make sure
    // allgs doesn't change.
    lock(&allglock)
    for _, gp := range allgs {
        gp.gcscandone = false  // set to true in gcphasework
        gp.gcscanvalid = false // stack has not been scanned
        gp.gcAssistBytes = 0
    }
    unlock(&allglock)

    work.bytesMarked = 0
    work.initialHeapLive = atomic.Load64(&memstats.heap_live)
    work.markrootDone = false
}

The stopTheWorldWithSema function stops the world; it must run on g0:

// stopTheWorldWithSema is the core implementation of stopTheWorld.
// The caller is responsible for acquiring worldsema and disabling
// preemption first and then should stopTheWorldWithSema on the system
// stack:
//
//    semacquire(&worldsema, 0)
//    m.preemptoff = "reason"
//    systemstack(stopTheWorldWithSema)
//
// When finished, the caller must either call startTheWorld or undo
// these three operations separately:
//
//    m.preemptoff = ""
//    systemstack(startTheWorldWithSema)
//    semrelease(&worldsema)
//
// It is allowed to acquire worldsema once and then execute multiple
// startTheWorldWithSema/stopTheWorldWithSema pairs.
// Other P's are able to execute between successive calls to
// startTheWorldWithSema and stopTheWorldWithSema.
// Holding worldsema causes any other goroutines invoking
// stopTheWorld to block.
func stopTheWorldWithSema() {
    _g_ := getg()

    // If we hold a lock, then we won't be able to stop another M
    // that is blocked trying to acquire the lock.
    if _g_.m.locks > 0 {
        throw("stopTheWorld: holding locks")
    }

    lock(&sched.lock)
    
    // The number of Ps that need to stop.
    sched.stopwait = gomaxprocs
    
    // Set the gcwaiting flag; the scheduler parks when it sees it.
    atomic.Store(&sched.gcwaiting, 1)
    
    // Preempt all running Gs.
    preemptall()
    
    // Stop the current P.
    _g_.m.p.ptr().status = _Pgcstop // Pgcstop is only diagnostic.
    
    // Decrement the count of Ps to stop (the current P counts as one).
    sched.stopwait--
    
    // Retake all Ps in _Psyscall status so they cannot rejoin scheduling.
    for i := 0; i < int(gomaxprocs); i++ {
        p := allp[i]
        s := p.status
        if s == _Psyscall && atomic.Cas(&p.status, s, _Pgcstop) {
            if trace.enabled {
                traceGoSysBlock(p)
                traceProcStop(p)
            }
            p.syscalltick++
            sched.stopwait--
        }
    }
    
    // Stop idle Ps so they cannot rejoin scheduling.
    for {p := pidleget()
        if p == nil {break}
        p.status = _Pgcstop
        sched.stopwait--
    }
    wait := sched.stopwait > 0
    unlock(&sched.lock)

    // 如果仍有须要进行的 P, 则期待它们进行
    // wait for remaining P's to stop voluntarily
    if wait {
        for {
            // 循环期待 + 抢占所有运行中的 G
            // wait for 100us, then try to re-preempt in case of any races
            if notetsleep(&sched.stopnote, 100*1000) {noteclear(&sched.stopnote)
                break
            }
            preemptall()}
    }

    // 逻辑正确性查看
    // sanity checks
    bad := ""
    if sched.stopwait != 0 {bad = "stopTheWorld: not stopped (stopwait != 0)"
    } else {for i := 0; i < int(gomaxprocs); i++ {p := allp[i]
            if p.status != _Pgcstop {bad = "stopTheWorld: not stopped (status != _Pgcstop)"
            }
        }
    }
    if atomic.Load(&freezing) != 0 {
        // Some other thread is panicking. This can cause the
        // sanity checks above to fail if the panic happens in
        // the signal handler on a stopped thread. Either way,
        // we should halt this thread.
        lock(&deadlock)
        lock(&deadlock)
    }
    if bad != "" {throw(bad)
    }
    
    // 到这里所有运行中的 G 都会变为待运行, 并且所有的 P 都不能被 M 获取
    // 也就是说所有的 go 代码 (除了以后的) 都会进行运行, 并且不能运行新的 go 代码
}

The finishsweep_m function sweeps any spans left unswept by the previous GC cycle, ensuring the previous cycle has fully completed:

// finishsweep_m ensures that all spans are swept.
//
// The world must be stopped. This ensures there are no sweeps in
// progress.
//
//go:nowritebarrier
func finishsweep_m() {
    // sweepone takes one unswept span and sweeps it;
    // it is analyzed in detail in the sweep phase below
    // Sweeping must be complete before marking commences, so
    // sweep any unswept spans. If this is a concurrent GC, there
    // shouldn't be any spans left to sweep, so this should finish
    // instantly. If GC was forced before the concurrent sweep
    // finished, there may be spans to sweep.
    for sweepone() != ^uintptr(0) {
        sweep.npausesweep++
    }

    // After all spans are swept, start a new markbit epoch.
    // This function is the key to how a span's gcmarkBits and allocBits
    // are allocated and reused; the flow is:
    // - a span allocates gcmarkBits and allocBits
    // - the span finishes sweeping
    //   - the old allocBits are no longer used
    //   - gcmarkBits become the new allocBits
    //   - new gcmarkBits are allocated
    // - a new markbit epoch begins
    // - the span finishes sweeping, as above
    // - another new markbit epoch begins
    //   - bitmaps from two epochs ago are no longer used and can be reused
    nextMarkBitArenaEpoch()
}
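The bit-swapping step above can be modeled in a few lines. This is a conceptual sketch, not the runtime's actual code: after a span is swept, the mark bits from the cycle that just finished describe exactly the live objects, so they simply become the new allocation bits, and a fresh zeroed bitmap serves as the next cycle's mark bits.

```go
package main

import "fmt"

// span is a simplified stand-in for the runtime's mspan.
type span struct {
	allocBits  []byte // which elements are allocated
	gcmarkBits []byte // which elements were marked live this cycle
}

// finishSweep mirrors the swap described above.
func (s *span) finishSweep() {
	s.allocBits = s.gcmarkBits                    // live set becomes the allocated set
	s.gcmarkBits = make([]byte, len(s.allocBits)) // all-zero bits for the next mark phase
}

func main() {
	// 8 elements, all allocated, but only the low 4 survived marking.
	s := &span{allocBits: []byte{0xFF}, gcmarkBits: []byte{0x0F}}
	s.finishSweep()
	fmt.Printf("alloc=%02x mark=%02x\n", s.allocBits[0], s.gcmarkBits[0]) // alloc=0f mark=00
}
```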

The clearpools function clears sched.sudogcache and sched.deferpool so that their memory can be reclaimed:

func clearpools() {
    // clear sync.Pools
    if poolcleanup != nil {
        poolcleanup()
    }

    // Clear central sudog cache.
    // Leave per-P caches alone, they have strictly bounded size.
    // Disconnect cached list before dropping it on the floor,
    // so that a dangling ref to one entry does not pin all of them.
    lock(&sched.sudoglock)
    var sg, sgnext *sudog
    for sg = sched.sudogcache; sg != nil; sg = sgnext {
        sgnext = sg.next
        sg.next = nil
    }
    sched.sudogcache = nil
    unlock(&sched.sudoglock)

    // Clear central defer pools.
    // Leave per-P pools alone, they have strictly bounded size.
    lock(&sched.deferlock)
    for i := range sched.deferpool {
        // disconnect cached list before dropping it on the floor,
        // so that a dangling ref to one entry does not pin all of them.
        var d, dlink *_defer
        for d = sched.deferpool[i]; d != nil; d = dlink {
            dlink = d.link
            d.link = nil
        }
        sched.deferpool[i] = nil
    }
    unlock(&sched.deferlock)
}

startCycle marks the beginning of a new GC cycle:

// startCycle resets the GC controller's state and computes estimates
// for a new GC cycle. The caller must hold worldsema.
func (c *gcControllerState) startCycle() {
    c.scanWork = 0
    c.bgScanCredit = 0
    c.assistTime = 0
    c.dedicatedMarkTime = 0
    c.fractionalMarkTime = 0
    c.idleMarkTime = 0

    // If gc_trigger is very small, fake the value of heap_marked
    // to avoid a bad adjustment to triggerRatio later
    // If this is the first GC cycle or we're operating on a very
    // small heap, fake heap_marked so it looks like gc_trigger is
    // the appropriate growth from heap_marked, even though the
    // real heap_marked may not have a meaningful value (on the
    // first cycle) or may be much smaller (resulting in a large
    // error response).
    if memstats.gc_trigger <= heapminimum {
        memstats.heap_marked = uint64(float64(memstats.gc_trigger) / (1 + memstats.triggerRatio))
    }

    // Recompute next_gc; note that next_gc is computed differently from gc_trigger
    // Re-compute the heap goal for this cycle in case something
    // changed. This is the same calculation we use elsewhere.
    memstats.next_gc = memstats.heap_marked + memstats.heap_marked*uint64(gcpercent)/100
    if gcpercent < 0 {
        memstats.next_gc = ^uint64(0)
    }

    // Ensure there is at least 1MB between next_gc and heap_live
    // Ensure that the heap goal is at least a little larger than
    // the current live heap size. This may not be the case if GC
    // start is delayed or if the allocation that pushed heap_live
    // over gc_trigger is large or if the trigger is really close to
    // GOGC. Assist is proportional to this distance, so enforce a
    // minimum distance, even if it means going over the GOGC goal
    // by a tiny bit.
    if memstats.next_gc < memstats.heap_live+1024*1024 {
        memstats.next_gc = memstats.heap_live + 1024*1024
    }

    // Compute how many background mark workers can run at the same time.
    // dedicatedMarkWorkersNeeded is 25% of the number of P's, truncated;
    // fractionalMarkWorkersNeeded is 0 if that divides evenly, otherwise 1.
    // totalUtilizationGoal is the target number of P's used by GC
    // (e.g. with 5 P's the goal is 1.25 P's);
    // fractionalUtilizationGoal is the target for fractional-mode workers
    // (e.g. with 5 P's the goal is 0.25 P's).
    // Compute the total mark utilization goal and divide it among
    // dedicated and fractional workers.
    totalUtilizationGoal := float64(gomaxprocs) * gcGoalUtilization
    c.dedicatedMarkWorkersNeeded = int64(totalUtilizationGoal)
    c.fractionalUtilizationGoal = totalUtilizationGoal - float64(c.dedicatedMarkWorkersNeeded)
    if c.fractionalUtilizationGoal > 0 {
        c.fractionalMarkWorkersNeeded = 1
    } else {
        c.fractionalMarkWorkersNeeded = 0
    }

    // Reset each P's statistics of time spent assisting GC
    // Clear per-P state
    for _, p := range &allp {
        if p == nil {
            break
        }
        p.gcAssistTime = 0
    }

    // Compute the parameters for GC assists;
    // see the analysis of the assistWorkPerByte formula below
    // Compute initial values for controls that are updated
    // throughout the cycle.
    c.revise()

    if debug.gcpacertrace > 0 {
        print("pacer: assist ratio=", c.assistWorkPerByte,
            " (scan ", memstats.heap_scan>>20, " MB in ",
            work.initialHeapLive>>20, "->",
            memstats.next_gc>>20, " MB)",
            " workers=", c.dedicatedMarkWorkersNeeded,
            "+", c.fractionalMarkWorkersNeeded, "\n")
    }
}
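The worker-count arithmetic above is easy to verify in isolation. A small sketch (the function name is illustrative; the constant and the formula mirror what startCycle does):

```go
package main

import "fmt"

const gcGoalUtilization = 0.25 // the runtime's 25% CPU goal for GC

// markWorkerGoal mirrors startCycle's arithmetic: the integer part of
// gomaxprocs*25% becomes dedicated workers, and any remainder adds one
// fractional worker.
func markWorkerGoal(gomaxprocs int) (dedicated, fractional int64) {
	total := float64(gomaxprocs) * gcGoalUtilization
	dedicated = int64(total) // truncates toward zero
	if total-float64(dedicated) > 0 {
		fractional = 1
	}
	return
}

func main() {
	for _, procs := range []int{1, 4, 5, 8} {
		d, f := markWorkerGoal(procs)
		fmt.Printf("GOMAXPROCS=%d: dedicated=%d fractional=%d\n", procs, d, f)
	}
	// GOMAXPROCS=4 gives 1 dedicated worker and no fractional one;
	// GOMAXPROCS=5 gives 1 dedicated plus 1 fractional worker.
}
```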

The setGCPhase function updates the global variable that records the current GC phase, together with the global flag that enables the write barrier:

//go:nosplit
func setGCPhase(x uint32) {
    atomic.Store(&gcphase, x)
    writeBarrier.needed = gcphase == _GCmark || gcphase == _GCmarktermination
    writeBarrier.enabled = writeBarrier.needed || writeBarrier.cgo
}
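The phase-to-barrier relationship is small enough to model directly. A toy sketch (the constants are hypothetical stand-ins mirroring the runtime's internal ones): the write barrier is needed exactly while marking and during mark termination.

```go
package main

import "fmt"

// Hypothetical phase constants mirroring the runtime's internal _GCoff,
// _GCmark and _GCmarktermination.
const (
	_GCoff = iota
	_GCmark
	_GCmarktermination
)

// writeBarrierNeeded reproduces the condition set by setGCPhase.
func writeBarrierNeeded(phase uint32) bool {
	return phase == _GCmark || phase == _GCmarktermination
}

func main() {
	fmt.Println("off:", writeBarrierNeeded(_GCoff))                          // false
	fmt.Println("mark:", writeBarrierNeeded(_GCmark))                        // true
	fmt.Println("marktermination:", writeBarrierNeeded(_GCmarktermination)) // true
}
```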

The gcBgMarkPrepare function resets the counters for the background mark workers:

// gcBgMarkPrepare sets up state for background marking.
// Mutator assists must not yet be enabled.
func gcBgMarkPrepare() {
    // Background marking will stop when the work queues are empty
    // and there are no more workers (note that, since this is
    // concurrent, this may be a transient state, but mark
    // termination will clean it up). Between background workers
    // and assists, we don't really know how many workers there
    // will be, so we pretend to have an arbitrarily large number
    // of workers, almost all of which are "waiting". While a
    // worker is working it decrements nwait. If nproc == nwait,
    // there are no workers.
    work.nproc = ^uint32(0)
    work.nwait = ^uint32(0)
}

The gcMarkRootPrepare function computes the number of root-scanning jobs:

// gcMarkRootPrepare queues root scanning jobs (stacks, globals, and
// some miscellany) and initializes scanning-related state.
//
// The caller must have call gcCopySpans().
//
// The world must be stopped.
//
//go:nowritebarrier
func gcMarkRootPrepare() {
    // The jobs that flush all spans out of the mcaches run only
    // during mark termination
    if gcphase == _GCmarktermination {
        work.nFlushCacheRoots = int(gomaxprocs)
    } else {
        work.nFlushCacheRoots = 0
    }

    // Helper computing the number of blocks; rootBlockBytes is 256KB
    // Compute how many data and BSS root blocks there are.
    nBlocks := func(bytes uintptr) int {
        return int((bytes + rootBlockBytes - 1) / rootBlockBytes)
    }

    work.nDataRoots = 0
    work.nBSSRoots = 0

    // data and bss are scanned only once per GC cycle:
    // concurrent GC scans them in the background mark workers and skips
    // them during mark termination, while non-concurrent GC scans them
    // during mark termination
    // Only scan globals once per cycle; preferably concurrently.
    if !work.markrootDone {
        // Compute the number of jobs for scanning the initialized
        // read-write global variables (.data)
        for _, datap := range activeModules() {
            nDataRoots := nBlocks(datap.edata - datap.data)
            if nDataRoots > work.nDataRoots {
                work.nDataRoots = nDataRoots
            }
        }

        // Compute the number of jobs for scanning the zero-initialized
        // global variables (.bss)
        for _, datap := range activeModules() {
            nBSSRoots := nBlocks(datap.ebss - datap.bss)
            if nBSSRoots > work.nBSSRoots {
                work.nBSSRoots = nBSSRoots
            }
        }
    }

    // Finalizers in spans and each G's stack are likewise scanned
    // only once per GC cycle, as above
    if !work.markrootDone {
        // Compute the number of jobs for scanning finalizers in spans
        // On the first markroot, we need to scan span roots.
        // In concurrent GC, this happens during concurrent
        // mark and we depend on addfinalizer to ensure the
        // above invariants for objects that get finalizers
        // after concurrent mark. In STW GC, this will happen
        // during mark termination.
        //
        // We're only interested in scanning the in-use spans,
        // which will all be swept at this point. More spans
        // may be added to this list during concurrent GC, but
        // we only care about spans that were allocated before
        // this mark phase.
        work.nSpanRoots = mheap_.sweepSpans[mheap_.sweepgen/2%2].numBlocks()

        // Compute the number of jobs for scanning each G's stack
        // On the first markroot, we need to scan all Gs. Gs
        // may be created after this point, but it's okay that
        // we ignore them because they begin life without any
        // roots, so there's nothing to scan, and any roots
        // they create during the concurrent phase will be
        // scanned during mark termination. During mark
        // termination, allglen isn't changing, so we'll scan
        // all Gs.
        work.nStackRoots = int(atomic.Loaduintptr(&allglen))
    } else {
        // We've already scanned span roots and kept the scan
        // up-to-date during concurrent mark.
        work.nSpanRoots = 0

        // The hybrid barrier ensures that stacks can't
        // contain pointers to unmarked objects, so on the
        // second markroot, there's no need to scan stacks.
        work.nStackRoots = 0

        if debug.gcrescanstacks > 0 {
            // Scan stacks anyway for debugging.
            work.nStackRoots = int(atomic.Loaduintptr(&allglen))
        }
    }

    // Compute the total number of jobs.
    // Background mark workers atomically increment markrootNext to decide
    // which job to take; using a counter as a lock-free queue like this is
    // quite clever, even though the Google engineers find it ugly
    // (see the analysis of the markroot function below)
    work.markrootNext = 0
    work.markrootJobs = uint32(fixedRootCount + work.nFlushCacheRoots + work.nDataRoots + work.nBSSRoots + work.nSpanRoots + work.nStackRoots)
}
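The nBlocks helper is a plain ceiling division; pulled out of the runtime, it can be checked directly:

```go
package main

import "fmt"

const rootBlockBytes = 256 << 10 // 256KB per root-scan block, as in the runtime

// nBlocks is the same ceiling division gcMarkRootPrepare uses to split
// the data/bss sections into independently scannable jobs.
func nBlocks(bytes uintptr) int {
	return int((bytes + rootBlockBytes - 1) / rootBlockBytes)
}

func main() {
	// 0 bytes -> 0 jobs; 1 byte -> 1 job; exactly 256KB -> 1 job; one byte more -> 2 jobs
	fmt.Println(nBlocks(0), nBlocks(1), nBlocks(256<<10), nBlocks(256<<10+1))
}
```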

The gcMarkTinyAllocs function marks the active tiny-allocation block of every P (the block still waiting for more small objects to be merged into it):

// gcMarkTinyAllocs greys all active tiny alloc blocks.
//
// The world must be stopped.
func gcMarkTinyAllocs() {
    for _, p := range &allp {
        if p == nil || p.status == _Pdead {
            break
        }
        c := p.mcache
        if c == nil || c.tiny == 0 {
            continue
        }
        // Mark the tiny block in each P's mcache;
        // as seen in the mallocgc function above, tiny is the block
        // currently waiting for more small objects to be merged into it
        _, hbits, span, objIndex := heapBitsForObject(c.tiny, 0, 0)
        gcw := &p.gcw
        // Mark the object alive and add it to the mark queue (the object turns grey)
        greyobject(c.tiny, 0, 0, hbits, span, gcw, objIndex)
        // gcBlackenPromptly indicates whether local mark queues are
        // currently disabled; if so, flush the work to the global queue
        if gcBlackenPromptly {
            gcw.dispose()
        }
    }
}

The startTheWorldWithSema function restarts the world:

func startTheWorldWithSema() {
    _g_ := getg()
    
    // Disable preemption of this G
    _g_.m.locks++        // disable preemption because it can be holding p in a local var
    
    // Poll received network events (fd readable/writable or error)
    // and add the corresponding G's to the run queue
    gp := netpoll(false) // non-blocking
    injectglist(gp)
    
    // Decide whether a gc helper needs to be started
    add := needaddgcproc()
    lock(&sched.lock)
    
    // If a change to gomaxprocs was requested, adjust the number of P's;
    // procresize returns a linked list of the P's that have runnable work
    procs := gomaxprocs
    if newprocs != 0 {
        procs = newprocs
        newprocs = 0
    }
    p1 := procresize(procs)
    
    // Clear the GC-waiting flag
    sched.gcwaiting = 0
    
    // Wake sysmon if it is waiting
    if sched.sysmonwait != 0 {
        sched.sysmonwait = 0
        notewakeup(&sched.sysmonnote)
    }
    unlock(&sched.lock)
    
    // Wake up the P's that have runnable work
    for p1 != nil {
        p := p1
        p1 = p1.link.ptr()
        if p.m != 0 {
            mp := p.m.ptr()
            p.m = 0
            if mp.nextp != 0 {
                throw("startTheWorld: inconsistent mp->nextp")
            }
            mp.nextp.set(p)
            notewakeup(&mp.park)
        } else {
            // Start M to run P.  Do not start another M below.
            newm(nil, p)
            add = false
        }
    }
    
    // If there are idle P's and no spinning M's, wake or create an M
    // Wakeup an additional proc in case we have excessive runnable goroutines
    // in local queues or in the global queue. If we don't, the proc will park itself.
    // If we have lots of excessive work, resetspinning will unpark additional procs as necessary.
    if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 {
        wakep()
    }
    
    // Start the gc helper if needed
    if add {
        // If GC could have used another helper proc, start one now,
        // in the hope that it will be available next time.
        // It would have been even better to start it before the collection,
        // but doing so requires allocating memory, so it's tricky to
        // coordinate. This lazy approach works out in practice:
        // we don't mind if the first couple gc rounds don't have quite
        // the maximum number of procs.
        newm(mhelpgc, nil)
    }
    
    // Re-allow preemption
    _g_.m.locks--
    
    // If the current G requested preemption, restore the request
    if _g_.m.locks == 0 && _g_.preempt { // restore the preemption request in case we've cleared it in newstack
        _g_.stackguard0 = stackPreempt
    }
}

After the world restarts, each M resumes scheduling. Scheduling first uses the findRunnableGCWorker function mentioned above to find work, so roughly 25% of the P's end up running background mark workers.
The background mark worker function is gcBgMarkWorker:

func gcBgMarkWorker(_p_ *p) {
    gp := getg()
    
    // Struct used to reacquire the P after waking from sleep
    type parkInfo struct {
        m      muintptr // Release this m on park.
        attach puintptr // If non-nil, attach to this p on park.
    }
    // We pass park to a gopark unlock function, so it can't be on
    // the stack (see gopark). Prevent deadlock from recursively
    // starting GC by disabling preemption.
    gp.m.preemptoff = "GC worker init"
    park := new(parkInfo)
    gp.m.preemptoff = ""
    
    // Record the current M and disable preemption
    park.m.set(acquirem())
    // Record the current P (the P this worker needs to attach to)
    park.attach.set(_p_)
    
    // Notify gcBgMarkStartWorkers that it can continue
    // Inform gcBgMarkStartWorkers that this worker is ready.
    // After this point, the background mark worker is scheduled
    // cooperatively by gcController.findRunnable. Hence, it must
    // never be preempted, as this would put it into _Grunnable
    // and put it on a run queue. Instead, when the preempt flag
    // is set, this puts itself into _Gwaiting to be woken up by
    // gcController.findRunnable at the appropriate time.
    notewakeup(&work.bgMarkReady)
    
    for {
        // Put the current G to sleep
        // Go to sleep until woken by gcController.findRunnable.
        // We can't releasem yet since even the call to gopark
        // may be preempted.
        gopark(func(g *g, parkp unsafe.Pointer) bool {
            park := (*parkInfo)(parkp)
            
            // Re-allow preemption
            // The worker G is no longer running, so it's
            // now safe to allow preemption.
            releasem(park.m.ptr())
            
            // Attach to the P: store the current G in the P's
            // gcBgMarkWorker field so that the next findRunnableGCWorker
            // can use it; do not sleep if attaching fails
            // If the worker isn't attached to its P,
            // attach now. During initialization and after
            // a phase change, the worker may have been
            // running on a different P. As soon as we
            // attach, the owner P may schedule the
            // worker, so this must be done after the G is
            // stopped.
            if park.attach != 0 {
                p := park.attach.ptr()
                park.attach.set(nil)
                // cas the worker because we may be
                // racing with a new worker starting
                // on this P.
                if !p.gcBgMarkWorker.cas(0, guintptr(unsafe.Pointer(g))) {
                    // The P got a new worker.
                    // Exit this worker.
                    return false
                }
            }
            return true
        }, unsafe.Pointer(park), "GC worker (idle)", traceEvGoBlock, 0)
        
        // Check that the P's gcBgMarkWorker is still this G;
        // if not, end this worker
        // Loop until the P dies and disassociates this
        // worker (the P may later be reused, in which case
        // it will get a new worker) or we failed to associate.
        if _p_.gcBgMarkWorker.ptr() != gp {
            break
        }
        
        // Disable preemption of this G
        // Disable preemption so we can use the gcw. If the
        // scheduler wants to preempt us, we'll stop draining,
        // dispose the gcw, and then preempt.
        park.m.set(acquirem())
        
        if gcBlackenEnabled == 0 {
            throw("gcBgMarkWorker: blackening not enabled")
        }
        
        // Record the start time
        startTime := nanotime()
        
        decnwait := atomic.Xadd(&work.nwait, -1)
        if decnwait == work.nproc {
            println("runtime: work.nwait=", decnwait, "work.nproc=", work.nproc)
            throw("work.nwait was > work.nproc")
        }
        
        // Switch to g0
        systemstack(func() {
            // Set the G's status to waiting so its stack can be scanned
            // (two mark workers may scan each other's stacks)
            // Mark our goroutine preemptible so its stack
            // can be scanned. This lets two mark workers
            // scan each other (otherwise, they would
            // deadlock). We must not modify anything on
            // the G stack. However, stack shrinking is
            // disabled for mark workers, so it is safe to
            // read from the G stack.
            casgstatus(gp, _Grunning, _Gwaiting)
            
            // Check which mode this background mark worker runs in
            switch _p_.gcMarkWorkerMode {
            default:
                throw("gcBgMarkWorker: unexpected gcMarkWorkerMode")
            case gcMarkWorkerDedicatedMode:
                // In this mode the P works exclusively on marking.
                // Mark until preempted, crediting the background scan work
                // to reduce GC assists and wake waiting G's
                gcDrain(&_p_.gcw, gcDrainUntilPreempt|gcDrainFlushBgCredit)
                // On preemption, kick every G in the local run queue
                // out to the global run queue
                if gp.preempt {
                    // We were preempted. This is
                    // a useful signal to kick
                    // everything out of the run
                    // queue so it can run
                    // somewhere else.
                    lock(&sched.lock)
                    for {
                        gp, _ := runqget(_p_)
                        if gp == nil {
                            break
                        }
                        globrunqput(gp)
                    }
                    unlock(&sched.lock)
                }
                // Continue marking until there is no more work,
                // with the same background-credit accounting
                // Go back to draining, this time
                // without preemption.
                gcDrain(&_p_.gcw, gcDrainNoBlock|gcDrainFlushBgCredit)
            case gcMarkWorkerFractionalMode:
                // In this mode the P marks part-time.
                // Mark until preempted, with background-credit accounting
                gcDrain(&_p_.gcw, gcDrainUntilPreempt|gcDrainFlushBgCredit)
            case gcMarkWorkerIdleMode:
                // In this mode the P marks only while idle.
                // Mark until preempted or until a bounded amount of work
                // is done, with background-credit accounting
                gcDrain(&_p_.gcw, gcDrainIdle|gcDrainUntilPreempt|gcDrainFlushBgCredit)
            }
            
            // Restore the G's status to running
            casgstatus(gp, _Gwaiting, _Grunning)
        })
        
        // If local mark queues are disabled, flush to the global mark queue
        // If we are nearing the end of mark, dispose
        // of the cache promptly. We must do this
        // before signaling that we're no longer
        // working so that other workers can't observe
        // no workers and no work while we have this
        // cached, and before we compute done.
        if gcBlackenPromptly {
            _p_.gcw.dispose()
        }
        
        // Accumulate the time used
        // Account for time.
        duration := nanotime() - startTime
        switch _p_.gcMarkWorkerMode {
        case gcMarkWorkerDedicatedMode:
            atomic.Xaddint64(&gcController.dedicatedMarkTime, duration)
            atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 1)
        case gcMarkWorkerFractionalMode:
            atomic.Xaddint64(&gcController.fractionalMarkTime, duration)
            atomic.Xaddint64(&gcController.fractionalMarkWorkersNeeded, 1)
        case gcMarkWorkerIdleMode:
            atomic.Xaddint64(&gcController.idleMarkTime, duration)
        }
        
        // Was this the last worker and did we run out
        // of work?
        incnwait := atomic.Xadd(&work.nwait, +1)
        if incnwait > work.nproc {
            println("runtime: p.gcMarkWorkerMode=", _p_.gcMarkWorkerMode,
                "work.nwait=", incnwait, "work.nproc=", work.nproc)
            throw("work.nwait > work.nproc")
        }
        
        // Check whether all background mark workers are done
        // and there is no more work
        // If this worker reached a background mark completion
        // point, signal the main GC goroutine.
        if incnwait == work.nproc && !gcMarkWorkAvailable(nil) {
            // Detach from the P
            // Make this G preemptible and disassociate it
            // as the worker for this P so
            // findRunnableGCWorker doesn't try to
            // schedule it.
            _p_.gcBgMarkWorker.set(nil)
            
            // Re-allow preemption
            releasem(park.m.ptr())
            
            // Prepare to enter mark termination
            gcMarkDone()
            
            // Reattach to a P before sleeping.
            // Since preemption was allowed above, we may be on a
            // different P by now; if reattaching fails, this worker ends
            // Disable preemption and prepare to reattach
            // to the P.
            //
            // We may be running on a different P at this
            // point, so we can't reattach until this G is
            // parked.
            park.m.set(acquirem())
            park.attach.set(_p_)
        }
    }
}

The gcDrain function performs the marking work:

// gcDrain scans roots and objects in work buffers, blackening grey
// objects until all roots and work buffers have been drained.
//
// If flags&gcDrainUntilPreempt != 0, gcDrain returns when g.preempt
// is set. This implies gcDrainNoBlock.
//
// If flags&gcDrainIdle != 0, gcDrain returns when there is other work
// to do. This implies gcDrainNoBlock.
//
// If flags&gcDrainNoBlock != 0, gcDrain returns as soon as it is
// unable to get more work. Otherwise, it will block until all
// blocking calls are blocked in gcDrain.
//
// If flags&gcDrainFlushBgCredit != 0, gcDrain flushes scan work
// credit to gcController.bgScanCredit every gcCreditSlack units of
// scan work.
//
//go:nowritebarrier
func gcDrain(gcw *gcWork, flags gcDrainFlags) {
    if !writeBarrier.needed {
        throw("gcDrain phase incorrect")
    }
    
    gp := getg().m.curg
    
    // Whether to return when the preempt flag is seen
    preemptible := flags&gcDrainUntilPreempt != 0
    
    // Whether to wait for more work when there is none
    blocking := flags&(gcDrainUntilPreempt|gcDrainIdle|gcDrainNoBlock) == 0
    
    // Whether to credit background scan work to reduce GC assists
    // and wake waiting G's
    flushBgCredit := flags&gcDrainFlushBgCredit != 0
    
    // Whether to do only a bounded amount of work
    idle := flags&gcDrainIdle != 0
    
    // Record the initial amount of scan work done
    initScanWork := gcw.scanWork
    
    // Check whether to return after scanning idleCheckThreshold (100000) objects
    // idleCheck is the scan work at which to perform the next
    // idle check with the scheduler.
    idleCheck := initScanWork + idleCheckThreshold
    
    // If the root objects are not all scanned yet, scan them first
    // Drain root marking jobs.
    if work.markrootNext < work.markrootJobs {
        // If preemptible is set, loop until preempted
        for !(preemptible && gp.preempt) {
            // Take a job from the root-scan queue (atomic increment)
            job := atomic.Xadd(&work.markrootNext, +1) - 1
            if job >= work.markrootJobs {
                break
            }
            // Run the root-scan job
            markroot(gcw, job)
            // In idle mode, return if there is other work to do
            if idle && pollWork() {
                goto done
            }
        }
    }
    
    // The roots are now in the mark queues; consume the mark queues.
    // If preemptible is set, loop until preempted
    // Drain heap marking jobs.
    for !(preemptible && gp.preempt) {
        // If the global mark queue is empty, move part of the local
        // mark queue to it (move wbuf2 if it is non-empty, otherwise
        // move half of wbuf1)
        // Try to keep work available on the global queue. We used to
        // check if there were waiting workers, but it's better to
        // just keep work available than to make workers wait. In the
        // worst case, we'll do O(log(_WorkbufSize)) unnecessary
        // balances.
        if work.full == 0 {
            gcw.balance()
        }
        
        // Get an object from the local mark queue,
        // falling back to the global mark queue
        var b uintptr
        if blocking {
            // blocking get
            b = gcw.get()
        } else {
            // non-blocking get
            b = gcw.tryGetFast()
            if b == 0 {
                b = gcw.tryGet()
            }
        }
        
        // No object: the mark queues are empty, exit the loop
        if b == 0 {
            // work barrier reached or tryGet failed.
            break
        }
        
        // Scan the object we got
        scanobject(b, gcw)
        
        // If enough objects have been scanned (gcCreditSlack is 2000)
        // Flush background scan work credit to the global
        // account if we've accumulated enough locally so
        // mutator assists can draw on it.
        if gcw.scanWork >= gcCreditSlack {
            // Add the local scan-work count to the global total
            atomic.Xaddint64(&gcController.scanWork, gcw.scanWork)
            // Reduce the assist workload and wake waiting G's
            if flushBgCredit {
                gcFlushBgCredit(gcw.scanWork - initScanWork)
                initScanWork = 0
            }
            idleCheck -= gcw.scanWork
            gcw.scanWork = 0
            
            // In idle mode, once the check threshold is reached, check
            // whether there is other work (G's); if so, exit the loop
            if idle && idleCheck <= 0 {
                idleCheck += idleCheckThreshold
                if pollWork() {
                    break
                }
            }
        }
    }
    
    // In blocking mode, write barriers are not allowed after this
    // point because we must preserve the condition that the work
    // buffers are empty.
    
done:
    // Add the remaining local scan-work count to the global total
    // Flush remaining scan work credit.
    if gcw.scanWork > 0 {
        atomic.Xaddint64(&gcController.scanWork, gcw.scanWork)
        // Reduce the assist workload and wake waiting G's
        if flushBgCredit {
            gcFlushBgCredit(gcw.scanWork - initScanWork)
        }
        gcw.scanWork = 0
    }
}

The markroot function runs one root-scanning job:

// markroot scans the i'th root.
//
// Preemption must be disabled (because this uses a gcWork).
//
// nowritebarrier is only advisory here.
//
//go:nowritebarrier
func markroot(gcw *gcWork, i uint32) {
    // Work out which kind of job the number corresponds to
    // (the Google engineers find this approach a bit ridiculous)
    // TODO(austin): This is a bit ridiculous. Compute and store
    // the bases in gcMarkRootPrepare instead of the counts.
    baseFlushCache := uint32(fixedRootCount)
    baseData := baseFlushCache + uint32(work.nFlushCacheRoots)
    baseBSS := baseData + uint32(work.nDataRoots)
    baseSpans := baseBSS + uint32(work.nBSSRoots)
    baseStacks := baseSpans + uint32(work.nSpanRoots)
    end := baseStacks + uint32(work.nStackRoots)

    // Note: if you add a case here, please also update heapdump.go:dumproots.
    switch {
    // Flush all spans in the mcaches; requires STW
    case baseFlushCache <= i && i < baseData:
        flushmcache(int(i - baseFlushCache))

    // Scan the initialized read-write global variables (.data).
    // Only the block corresponding to i is scanned, passing in the
    // bitmap that records where the pointers are
    case baseData <= i && i < baseBSS:
        for _, datap := range activeModules() {
            markrootBlock(datap.data, datap.edata-datap.data, datap.gcdatamask.bytedata, gcw, int(i-baseData))
        }

    // Scan the zero-initialized global variables (.bss).
    // Again only the block corresponding to i is scanned, with its pointer bitmap
    case baseBSS <= i && i < baseSpans:
        for _, datap := range activeModules() {
            markrootBlock(datap.bss, datap.ebss-datap.bss, datap.gcbssmask.bytedata, gcw, int(i-baseBSS))
        }

    // Scan the finalizer queue
    case i == fixedRootFinalizers:
        // Only do this once per GC cycle since we don't call
        // queuefinalizer during marking.
        if work.markrootDone {
            break
        }
        for fb := allfin; fb != nil; fb = fb.alllink {
            cnt := uintptr(atomic.Load(&fb.cnt))
            scanblock(uintptr(unsafe.Pointer(&fb.fin[0])), cnt*unsafe.Sizeof(fb.fin[0]), &finptrmask[0], gcw)
        }

    // Free the stacks of dead G's
    case i == fixedRootFreeGStacks:
        // Only do this once per GC cycle; preferably
        // concurrently.
        if !work.markrootDone {
            // Switch to the system stack so we can call
            // stackfree.
            systemstack(markrootFreeGStacks)
        }

    // Scan the special objects (finalizer lists) in each span
    case baseSpans <= i && i < baseStacks:
        // mark MSpan.specials
        markrootSpans(gcw, int(i-baseSpans))

    // Scan each G's stack
    default:
        // Get the G that needs scanning
        // the rest is scanning goroutine stacks
        var gp *g
        if baseStacks <= i && i < end {
            gp = allgs[i-baseStacks]
        } else {
            throw("markroot: bad index")
        }

        // Record when the wait started
        // remember when we've first observed the G blocked
        // needed only to output in traceback
        status := readgstatus(gp) // We are not in a scan state
        if (status == _Gwaiting || status == _Gsyscall) && gp.waitsince == 0 {
            gp.waitsince = work.tstart
        }

        // Switch to g0 (we may end up scanning our own stack)
        // scang must be done on the system stack in case
        // we're trying to scan our own stack.
        systemstack(func() {
            // Check whether the stack being scanned is our own
            // If this is a self-scan, put the user G in
            // _Gwaiting to prevent self-deadlock. It may
            // already be in _Gwaiting if this is a mark
            // worker or we're in mark termination.
            userG := getg().m.curg
            selfScan := gp == userG && readgstatus(userG) == _Grunning
            
            // If scanning our own stack, switch the status to waiting
            // to prevent a self-deadlock
            if selfScan {
                casgstatus(userG, _Grunning, _Gwaiting)
                userG.waitreason = "garbage collection scan"
            }
            
            // Scan the G's stack
            // TODO: scang blocks until gp's stack has
            // been scanned, which may take a while for
            // running goroutines. Consider doing this in
            // two phases where the first is non-blocking:
            // we scan the stacks we can and ask running
            // goroutines to scan themselves; and the
            // second blocks.
            scang(gp, gcw)
            
            // If we scanned our own stack, switch the status back to running
            if selfScan {
                casgstatus(userG, _Gwaiting, _Grunning)
            }
        })
    }
}

The scang function is responsible for scanning a G's stack:

// scang blocks until gp's stack has been scanned.
// It might be scanned by scang or it might be scanned by the goroutine itself.
// Either way, the stack scan has completed when scang returns.
func scang(gp *g, gcw *gcWork) {// Invariant; we (the caller, markroot for a specific goroutine) own gp.gcscandone.
    // Nothing is racing with us now, but gcscandone might be set to true left over
    // from an earlier round of stack scanning (we scan twice per GC).
    // We use gcscandone to record whether the scan has been done during this round.

    // 标记扫描未实现
    gp.gcscandone = false

    // See http://golang.org/cl/21503 for justification of the yield delay.
    const yieldDelay = 10 * 1000
    var nextYield int64

    // 循环直到扫描实现
    // Endeavor to get gcscandone set to true,
    // either by doing the stack scan ourselves or by coercing gp to scan itself.
    // gp.gcscandone can transition from false to true when we're not looking
    // (if we asked for preemption), so any time we lock the status using
    // castogscanstatus we have to double-check that the scan is still not done.
loop:
    for i := 0; !gp.gcscandone; i++ {
        // Check the G's current status
        switch s := readgstatus(gp); s {
        default:
            dumpgstatus(gp)
            throw("stopg: invalid status")

        // The G is dead, no need to scan it
        case _Gdead:
            // No stack.
            gp.gcscandone = true
            break loop

        // The G's stack is being copied (grown or shrunk), retry next round
        case _Gcopystack:
        // Stack being switched. Go around again.

        // The G is not running; first prevent it from running
        case _Grunnable, _Gsyscall, _Gwaiting:
            // Claim goroutine by setting scan bit.
            // Racing with execution or readying of gp.
            // The scan bit keeps them from running
            // the goroutine until we're done.
            if castogscanstatus(gp, s, s|_Gscan) {
                // The atomic status switch succeeded, so scan its stack
                if !gp.gcscandone {
                    scanstack(gp, gcw)
                    gp.gcscandone = true
                }
                // Restore the G's status and break out of the loop
                restartg(gp)
                break loop
            }

        // The G is scanning itself, wait for the scan to finish
        case _Gscanwaiting:
        // newstack is doing a scan for us right now. Wait.

        // The G is running
        case _Grunning:
            // Goroutine running. Try to preempt execution so it can scan itself.
            // The preemption handler (in newstack) does the actual scan.

            // If there is already a pending preemption request, it will take care of the scan for us when it succeeds
            // Optimization: if there is already a pending preemption request
            // (from the previous loop iteration), don't bother with the atomics.
            if gp.preemptscan && gp.preempt && gp.stackguard0 == stackPreempt {
                break
            }

            // Preempt the G; when the preemption succeeds the G scans itself
            // Ask for preemption and self scan.
            if castogscanstatus(gp, _Grunning, _Gscanrunning) {
                if !gp.gcscandone {
                    gp.preemptscan = true
                    gp.preempt = true
                    gp.stackguard0 = stackPreempt
                }
                casfrom_Gscanstatus(gp, _Gscanrunning, _Grunning)
            }
        }

        // Spin for roughly the first 10 microseconds, then yield the OS thread about every 5 microseconds
        // (yieldDelay is 10 * 1000 nanoseconds = 10µs)
        if i == 0 {
            nextYield = nanotime() + yieldDelay
        }
        if nanotime() < nextYield {
            procyield(10)
        } else {
            osyield()
            nextYield = nanotime() + yieldDelay/2
        }
    }

    // The scan is done; cancel the preemptive-scan request
    gp.preemptscan = false // cancel scan request if no longer needed
}

After preemptscan is set, scanstack will be called to scan the G's own stack once the preemption succeeds (the preemption handler is in newstack).
The function used to scan the stack is scanstack:

// scanstack scans gp's stack, greying all pointers found on the stack.
//
// scanstack is marked go:systemstack because it must not be preempted
// while using a workbuf.
//
//go:nowritebarrier
//go:systemstack
func scanstack(gp *g, gcw *gcWork) {
    if gp.gcscanvalid {
        return
    }

    if readgstatus(gp)&_Gscan == 0 {
        print("runtime:scanstack: gp=", gp, ", goid=", gp.goid, ", gp->atomicstatus=", hex(readgstatus(gp)), "\n")
        throw("scanstack - bad status")
    }

    switch readgstatus(gp) &^ _Gscan {
    default:
        print("runtime: gp=", gp, ", goid=", gp.goid, ", gp->atomicstatus=", readgstatus(gp), "\n")
        throw("mark - bad status")
    case _Gdead:
        return
    case _Grunning:
        print("runtime: gp=", gp, ", goid=", gp.goid, ", gp->atomicstatus=", readgstatus(gp), "\n")
        throw("scanstack: goroutine not stopped")
    case _Grunnable, _Gsyscall, _Gwaiting:
        // ok
    }

    if gp == getg() {
        throw("can't scan our own stack")
    }
    mp := gp.m
    if mp != nil && mp.helpgc != 0 {
        throw("can't scan gchelper stack")
    }

    // Shrink the stack if not much of it is being used. During
    // concurrent GC, we can do this during concurrent mark.
    if !work.markrootDone {
        shrinkstack(gp)
    }

    // Scan the stack.
    var cache pcvalueCache
    scanframe := func(frame *stkframe, unused unsafe.Pointer) bool {
        // scanframeworker looks up the function info from the code address (pc),
        // then finds stackmap.bytedata in the function info, which records which
        // slots of the function's stack frame contain pointers.
        // It then calls scanblock to scan the function's stack space; the
        // function's arguments are scanned the same way.
        scanframeworker(frame, &cache, gcw)
        return true
    }
    // Enumerate all call frames and invoke scanframe on each
    gentraceback(^uintptr(0), ^uintptr(0), 0, gp, 0, nil, 0x7fffffff, scanframe, nil, 0)
    // Enumerate all deferred call frames and invoke scanframe on each
    tracebackdefers(gp, scanframe, nil)
    gp.gcscanvalid = true
}

scanblock is a general-purpose scanning function: it is used to scan both global variables and stack space. Unlike scanobject, the bitmap must be passed in explicitly:

// scanblock scans b as scanobject would, but using an explicit
// pointer bitmap instead of the heap bitmap.
//
// This is used to scan non-heap roots, so it does not update
// gcw.bytesMarked or gcw.scanWork.
//
//go:nowritebarrier
func scanblock(b0, n0 uintptr, ptrmask *uint8, gcw *gcWork) {
    // Use local copies of original parameters, so that a stack trace
    // due to one of the throws below shows the original block
    // base and extent.
    b := b0
    n := n0

    arena_start := mheap_.arena_start
    arena_used := mheap_.arena_used

    // Iterate over the addresses to scan
    for i := uintptr(0); i < n; {
        // Find the corresponding byte in the bitmap
        // Find bits for the next word.
        bits := uint32(*addb(ptrmask, i/(sys.PtrSize*8)))
        if bits == 0 {
            i += sys.PtrSize * 8
            continue
        }
        // Iterate over the bits in the byte
        for j := 0; j < 8 && i < n; j++ {
            // If this address contains a pointer
            if bits&1 != 0 {
                // Mark the object at that address live and enqueue it (the object turns grey)
                // Same work as in scanobject; see comments there.
                obj := *(*uintptr)(unsafe.Pointer(b + i))
                if obj != 0 && arena_start <= obj && obj < arena_used {
                    // Find the span and bitmap entry for the object
                    if obj, hbits, span, objIndex := heapBitsForObject(obj, b, i); obj != 0 {
                        // Mark the object live and enqueue it (the object turns grey)
                        greyobject(obj, b, i, hbits, span, gcw, objIndex)
                    }
                }
            }
            // Move on to the next bit and the next pointer-sized word
            bits >>= 1
            i += sys.PtrSize
        }
    }
}

greyobject marks an object live and enqueues it on the mark queue (the object turns grey):

// obj is the start of an object with mark mbits.
// If it isn't already marked, mark it and enqueue into gcw.
// base and off are for debugging only and could be removed.
//go:nowritebarrierrec
func greyobject(obj, base, off uintptr, hbits heapBits, span *mspan, gcw *gcWork, objIndex uintptr) {
    // obj should be start of allocation, and so must be at least pointer-aligned.
    if obj&(sys.PtrSize-1) != 0 {
        throw("greyobject: obj not pointer-aligned")
    }
    mbits := span.markBitsForIndex(objIndex)

    if useCheckmark {
        // checkmark is a mechanism for verifying that all reachable objects were
        // marked correctly; it is used for debugging only
        if !mbits.isMarked() {
            printlock()
            print("runtime:greyobject: checkmarks finds unexpected unmarked object obj=", hex(obj), "\n")
            print("runtime: found obj at *(", hex(base), "+", hex(off), ")\n")

            // Dump the source (base) object
            gcDumpObject("base", base, off)

            // Dump the object
            gcDumpObject("obj", obj, ^uintptr(0))

            getg().m.traceback = 2
            throw("checkmark found unmarked object")
        }
        if hbits.isCheckmarked(span.elemsize) {
            return
        }
        hbits.setCheckmarked(span.elemsize)
        if !hbits.isCheckmarked(span.elemsize) {
            throw("setCheckmarked and isCheckmarked disagree")
        }
    } else {
        if debug.gccheckmark > 0 && span.isFree(objIndex) {
            print("runtime: marking free object ", hex(obj), " found at *(", hex(base), "+", hex(off), ")\n")
            gcDumpObject("base", base, off)
            gcDumpObject("obj", obj, ^uintptr(0))
            getg().m.traceback = 2
            throw("marking free object")
        }

        // If the corresponding bit in the span's gcmarkBits is already 1, we can skip this object
        // If marked we have nothing to do.
        if mbits.isMarked() {
            return
        }
        
        // Set the corresponding bit in the span's gcmarkBits to 1
        // mbits.setMarked() // Avoid extra call overhead with manual inlining.
        atomic.Or8(mbits.bytep, mbits.mask)
        
        // If the object is known to contain no pointers (its span class is noscan), it does not need to be enqueued
        // If this is a noscan object, fast-track it to black
        // instead of greying it.
        if span.spanclass.noscan() {
            gcw.bytesMarked += uint64(span.elemsize)
            return
        }
    }

    // Enqueue the object on the mark queue:
    // try the local mark queue first; on failure, move part of the local queue's
    // work to the global mark queue, then retry the local queue
    // Queue the obj for scanning. The PREFETCH(obj) logic has been removed but
    // seems like a nice optimization that can be added back in.
    // There needs to be time between the PREFETCH and the use.
    // Previously we put the obj in an 8 element buffer that is drained at a rate
    // to give the PREFETCH time to do its work.
    // Use of PREFETCHNTA might be more appropriate than PREFETCH
    if !gcw.putFast(obj) {
        gcw.put(obj)
    }
}

Once gcDrain has finished scanning the root objects, it starts consuming the mark queue, calling scanobject on each object taken from it:

// scanobject scans the object starting at b, adding pointers to gcw.
// b must point to the beginning of a heap object or an oblet.
// scanobject consults the GC bitmap for the pointer mask and the
// spans for the size of the object.
//
//go:nowritebarrier
func scanobject(b uintptr, gcw *gcWork) {
    // Note that arena_used may change concurrently during
    // scanobject and hence scanobject may encounter a pointer to
    // a newly allocated heap object that is *not* in
    // [start,used). It will not mark this object; however, we
    // know that it was just installed by a mutator, which means
    // that mutator will execute a write barrier and take care of
    // marking it. This is even more pronounced on relaxed memory
    // architectures since we access arena_used without barriers
    // or synchronization, but the same logic applies.
    arena_start := mheap_.arena_start
    arena_used := mheap_.arena_used

    // Find the bits for b and the size of the object at b.
    //
    // b is either the beginning of an object, in which case this
    // is the size of the object to scan, or it points to an
    // oblet, in which case we compute the size to scan below.
    // Get the bitmap for the object
    hbits := heapBitsForAddr(b)
    
    // Get the span the object belongs to
    s := spanOfUnchecked(b)
    
    // Get the object's size
    n := s.elemsize
    if n == 0 {
        throw("scanobject n == 0")
    }

    // Oversized objects (maxObletBytes is 128KB) are split up for scanning,
    // at most 128KB at a time
    if n > maxObletBytes {
        // Large object. Break into oblets for better
        // parallelism and lower latency.
        if b == s.base() {
            // It's possible this is a noscan object (not
            // from greyobject, but from other code
            // paths), in which case we must *not* enqueue
            // oblets since their bitmaps will be
            // uninitialized.
            if s.spanclass.noscan() {
                // Bypass the whole scan.
                gcw.bytesMarked += uint64(n)
                return
            }

            // Enqueue the other oblets to scan later.
            // Some oblets may be in b's scalar tail, but
            // these will be marked as "no more pointers",
            // so we'll drop out immediately when we go to
            // scan those.
            for oblet := b + maxObletBytes; oblet < s.base()+s.elemsize; oblet += maxObletBytes {
                if !gcw.putFast(oblet) {
                    gcw.put(oblet)
                }
            }
        }

        // Compute the size of the oblet. Since this object
        // must be a large object, s.base() is the beginning
        // of the object.
        n = s.base() + s.elemsize - b
        if n > maxObletBytes {
            n = maxObletBytes
        }
    }

    // Scan the pointers in the object
    var i uintptr
    for i = 0; i < n; i += sys.PtrSize {
        // Get the corresponding bits
        // Find bits for this word.
        if i != 0 {
            // Avoid needless hbits.next() on last iteration.
            hbits = hbits.next()
        }
        // Load bits once. See CL 22712 and issue 16973 for discussion.
        bits := hbits.bits()
        
        // Check the scan bit to decide whether to keep scanning; note that the second scan bit doubles as the checkmark bit
        // During checkmarking, 1-word objects store the checkmark
        // in the type bit for the one word. The only one-word objects
        // are pointers, or else they'd be merged with other non-pointer
        // data into larger allocations.
        if i != 1*sys.PtrSize && bits&bitScan == 0 {
            break // no more pointers in this object
        }
        
        // Check the pointer bit; if this word is not a pointer, continue
        if bits&bitPointer == 0 {
            continue // not a pointer
        }

        // Load the pointer's value
        // Work here is duplicated in scanblock and above.
        // If you make changes here, make changes there too.
        obj := *(*uintptr)(unsafe.Pointer(b + i))

        // If the pointer falls within the arena region, call greyobject to mark the object and enqueue it
        // At this point we have extracted the next potential pointer.
        // Check if it points into heap and not back at the current object.
        if obj != 0 && arena_start <= obj && obj < arena_used && obj-b >= n {
            // Mark the object.
            if obj, hbits, span, objIndex := heapBitsForObject(obj, b, i); obj != 0 {
                greyobject(obj, b, i, hbits, span, gcw, objIndex)
            }
        }
    }
    
    // Record the number of bytes marked and the amount of scan work done
    gcw.bytesMarked += uint64(n)
    gcw.scanWork += int64(i)
}

When all background mark workers have drained the mark queue, gcMarkDone is executed to prepare the transition to the mark-termination phase.
In concurrent GC, gcMarkDone runs twice: the first call disables the local mark queues and then restarts the background mark workers; the second enters the mark-termination phase.

// gcMarkDone transitions the GC from mark 1 to mark 2 and from mark 2
// to mark termination.
//
// This should be called when all mark work has been drained. In mark
// 1, this includes all root marking jobs, global work buffers, and
// active work buffers in assists and background workers; however,
// work may still be cached in per-P work buffers. In mark 2, per-P
// caches are disabled.
//
// The calling context must be preemptible.
//
// Note that it is explicitly okay to have write barriers in this
// function because completion of concurrent mark is best-effort
// anyway. Any work created by write barriers here will be cleaned up
// by mark termination.
func gcMarkDone() {
top:
    semacquire(&work.markDoneSema)

    // Re-check transition condition under transition lock.
    if !(gcphase == _GCmark && work.nwait == work.nproc && !gcMarkWorkAvailable(nil)) {
        semrelease(&work.markDoneSema)
        return
    }

    // Temporarily disallow starting new background mark workers
    // Disallow starting new workers so that any remaining workers
    // in the current mark phase will drain out.
    //
    // TODO(austin): Should dedicated workers keep an eye on this
    // and exit gcDrain promptly?
    atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, -0xffffffff)
    atomic.Xaddint64(&gcController.fractionalMarkWorkersNeeded, -0xffffffff)

    // Check whether the local mark queues have already been disabled
    if !gcBlackenPromptly {
        // The local mark queues are not yet disabled: disable them, then restart the background mark workers
        // Transition from mark 1 to mark 2.
        //
        // The global work list is empty, but there can still be work
        // sitting in the per-P work caches.
        // Flush and disable work caches.

        // Disable the local mark queues
        // Disallow caching workbufs and indicate that we're in mark 2.
        gcBlackenPromptly = true

        // Prevent completion of mark 2 until we've flushed
        // cached workbufs.
        atomic.Xadd(&work.nwait, -1)

        // GC is set up for mark 2. Let Gs blocked on the
        // transition lock go while we flush caches.
        semrelease(&work.markDoneSema)

        // Push all objects in the local mark queues to the global mark queue
        systemstack(func() {
            // Flush all currently cached workbufs and
            // ensure all Ps see gcBlackenPromptly. This
            // also blocks until any remaining mark 1
            // workers have exited their loop so we can
            // start new mark 2 workers.
            forEachP(func(_p_ *p) {
                _p_.gcw.dispose()
            })
        })

        // For debugging
        // Check that roots are marked. We should be able to
        // do this before the forEachP, but based on issue
        // #16083 there may be a (harmless) race where we can
        // enter mark 2 while some workers are still scanning
        // stacks. The forEachP ensures these scans are done.
        //
        // TODO(austin): Figure out the race and fix this
        // properly.
        gcMarkRootCheck()

        // Allow starting new background mark workers again
        // Now we can start up mark 2 workers.
        atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 0xffffffff)
        atomic.Xaddint64(&gcController.fractionalMarkWorkersNeeded, 0xffffffff)

        // If it is certain there is no more work, jump straight back to the top of
        // the function, which then counts as the second call
        incnwait := atomic.Xadd(&work.nwait, +1)
        if incnwait == work.nproc && !gcMarkWorkAvailable(nil) {
            // This loop will make progress because
            // gcBlackenPromptly is now true, so it won't
            // take this same "if" branch.
            goto top
        }
    } else {
        // Record the start time of the mark-termination phase and of the STW pause
        // Transition to mark termination.
        now := nanotime()
        work.tMarkTerm = now
        work.pauseStart = now
        
        // Disallow preemption of this G
        getg().m.preemptoff = "gcing"
        
        // Stop all running Gs and prevent them from being scheduled
        systemstack(stopTheWorldWithSema)
        
        // !!!!!!!!!!!!!!!!
        // The world is stopped (STW)...
        // !!!!!!!!!!!!!!!!
        
        // The gcphase is _GCmark, it will transition to _GCmarktermination
        // below. The important thing is that the wb remains active until
        // all marking is complete. This includes writes made by the GC.
        
        // Record that the scan of the root objects is complete; this affects the processing in gcMarkRootPrepare
        // Record that one root marking pass has completed.
        work.markrootDone = true
        
        // Disable GC assists and background mark workers
        // Disable assists and background workers. We must do
        // this before waking blocked assists.
        atomic.Store(&gcBlackenEnabled, 0)
        
        // Wake all Gs that went to sleep because of assist GC
        // Wake all blocked assists. These will run when we
        // start the world again.
        gcWakeAllAssists()
        
        // Likewise, release the transition lock. Blocked
        // workers and assists will run when we start the
        // world again.
        semrelease(&work.markDoneSema)
        
        // Compute the heap size required to trigger the next GC
        // endCycle depends on all gcWork cache stats being
        // flushed. This is ensured by mark 2.
        nextTriggerRatio := gcController.endCycle()
        
        // Perform mark termination; this will restart the world
        // Perform mark termination. This will restart the world.
        gcMarkTermination(nextTriggerRatio)
    }
}

The gcMarkTermination function carries out the mark-termination phase:

func gcMarkTermination(nextTriggerRatio float64) {
    // World is stopped.
    // Start marktermination which includes enabling the write barrier.
    // Disable GC assists and background mark workers
    atomic.Store(&gcBlackenEnabled, 0)
    
    // Re-enable the local mark queues (for the next GC)
    gcBlackenPromptly = false
    
    // Set the current GC phase to mark termination and enable the write barrier
    setGCPhase(_GCmarktermination)

    // Record the start time
    work.heap1 = memstats.heap_live
    startTime := nanotime()

    // Disallow preemption of this G
    mp := acquirem()
    mp.preemptoff = "gcing"
    _g_ := getg()
    _g_.m.traceback = 2
    
    // Set the G's status to waiting so that its stack can be scanned
    gp := _g_.m.curg
    casgstatus(gp, _Grunning, _Gwaiting)
    gp.waitreason = "garbage collection"

    // Switch to the g0 stack
    // Run gc on the g0 stack. We do this so that the g stack
    // we're currently running on will no longer change. Cuts
    // the root set down a bit (g0 stacks are not scanned, and
    // we don't need to scan gc's internal state).  We also
    // need to switch to g0 so we can shrink the stack.
    systemstack(func() {
        // Perform the STW mark
        gcMark(startTime)
        
        // Must return immediately, because the outer G's stack may have been moved;
        // none of its variables may be accessed after this point
        // Must return immediately.
        // The outer function's stack may have moved
        // during gcMark (it shrinks stacks, including the
        // outer function's stack), so we must not refer
        // to any of its variables. Return back to the
        // non-system stack to pick up the new addresses
        // before continuing.
    })

    // Switch to the g0 stack again
    systemstack(func() {
        work.heap2 = work.bytesMarked
        
        // If checkmark is enabled, verify that every reachable object was marked
        if debug.gccheckmark > 0 {
            // Run a full stop-the-world mark using checkmark bits,
            // to check that we didn't forget to mark anything during
            // the concurrent mark process.
            gcResetMarkState()
            initCheckmarks()
            gcMark(startTime)
            clearCheckmarks()
        }

        // Set the current GC phase to off and disable the write barrier
        // marking is complete so we can turn the write barrier off
        setGCPhase(_GCoff)
        
        // Wake the background sweep workers; they will start running after the STW ends
        gcSweep(work.mode)

        // For debugging
        if debug.gctrace > 1 {
            startTime = nanotime()
            // The g stacks have been scanned so
            // they have gcscanvalid==true and gcworkdone==true.
            // Reset these so that all stacks will be rescanned.
            gcResetMarkState()
            finishsweep_m()

            // Still in STW but gcphase is _GCoff, reset to _GCmarktermination
            // At this point all objects will be found during the gcMark which
            // does a complete STW mark and object scan.
            setGCPhase(_GCmarktermination)
            gcMark(startTime)
            setGCPhase(_GCoff) // marking is done, turn off wb.
            gcSweep(work.mode)
        }
    })

    // Set the G's status back to running
    _g_.m.traceback = 0
    casgstatus(gp, _Gwaiting, _Grunning)

    // Trace handling
    if trace.enabled {
        traceGCDone()
    }

    // all done
    mp.preemptoff = ""

    if gcphase != _GCoff {
        throw("gc done but gcphase != _GCoff")
    }

    // Update the heap size required to trigger the next GC (gc_trigger)
    // Update GC trigger and pacing for the next cycle.
    gcSetTriggerRatio(nextTriggerRatio)

    // Update timing records
    // Update timing memstats
    now := nanotime()
    sec, nsec, _ := time_now()
    unixNow := sec*1e9 + int64(nsec)
    work.pauseNS += now - work.pauseStart
    work.tEnd = now
    atomic.Store64(&memstats.last_gc_unix, uint64(unixNow)) // must be Unix time to make sense to user
    atomic.Store64(&memstats.last_gc_nanotime, uint64(now)) // monotonic time for us
    memstats.pause_ns[memstats.numgc%uint32(len(memstats.pause_ns))] = uint64(work.pauseNS)
    memstats.pause_end[memstats.numgc%uint32(len(memstats.pause_end))] = uint64(unixNow)
    memstats.pause_total_ns += uint64(work.pauseNS)

    // Update CPU usage records
    // Update work.totaltime.
    sweepTermCpu := int64(work.stwprocs) * (work.tMark - work.tSweepTerm)
    // We report idle marking time below, but omit it from the
    // overall utilization here since it's "free".
    markCpu := gcController.assistTime + gcController.dedicatedMarkTime + gcController.fractionalMarkTime
    markTermCpu := int64(work.stwprocs) * (work.tEnd - work.tMarkTerm)
    cycleCpu := sweepTermCpu + markCpu + markTermCpu
    work.totaltime += cycleCpu

    // Compute overall GC CPU utilization.
    totalCpu := sched.totaltime + (now-sched.procresizetime)*int64(gomaxprocs)
    memstats.gc_cpu_fraction = float64(work.totaltime) / float64(totalCpu)

    // Reset sweep state
    // Reset sweep state.
    sweep.nbgsweep = 0
    sweep.npausesweep = 0

    // Count the number of forced GCs
    if work.userForced {
        memstats.numforcedgc++
    }

    // Bump the GC cycle count, then wake the goroutines waiting on sweep
    // Bump GC cycle count and wake goroutines waiting on sweep.
    lock(&work.sweepWaiters.lock)
    memstats.numgc++
    injectglist(work.sweepWaiters.head.ptr())
    work.sweepWaiters.head = 0
    unlock(&work.sweepWaiters.lock)

    // For profiling
    // Finish the current heap profiling cycle and start a new
    // heap profiling cycle. We do this before starting the world
    // so events don't leak into the wrong cycle.
    mProf_NextCycle()

    // Restart the world
    systemstack(startTheWorldWithSema)

    // !!!!!!!!!!!!!!!
    // The world has been restarted...
    // !!!!!!!!!!!!!!!

    // For profiling
    // Flush the heap profile so we can start a new cycle next GC.
    // This is relatively expensive, so we don't do it with the
    // world stopped.
    mProf_Flush()

    // Move the buffers used by the mark queues to the free list so they can be reclaimed
    // Prepare workbufs for freeing by the sweeper. We do this
    // asynchronously because it can take non-trivial time.
    prepareFreeWorkbufs()

    // Free unused stack spans
    // Free stack spans. This must be done between GC cycles.
    systemstack(freeStackSpans)

    // For debugging
    // Print gctrace before dropping worldsema. As soon as we drop
    // worldsema another cycle could start and smash the stats
    // we're trying to print.
    if debug.gctrace > 0 {
        util := int(memstats.gc_cpu_fraction * 100)

        var sbuf [24]byte
        printlock()
        print("gc ", memstats.numgc,
            " @", string(itoaDiv(sbuf[:], uint64(work.tSweepTerm-runtimeInitTime)/1e6, 3)), "s ",
            util, "%: ")
        prev := work.tSweepTerm
        for i, ns := range []int64{work.tMark, work.tMarkTerm, work.tEnd} {
            if i != 0 {
                print("+")
            }
            print(string(fmtNSAsMS(sbuf[:], uint64(ns-prev))))
            prev = ns
        }
        print(" ms clock, ")
        for i, ns := range []int64{sweepTermCpu, gcController.assistTime, gcController.dedicatedMarkTime + gcController.fractionalMarkTime, gcController.idleMarkTime, markTermCpu} {
            if i == 2 || i == 3 {
                // Separate mark time components with /.
                print("/")
            } else if i != 0 {
                print("+")
            }
            print(string(fmtNSAsMS(sbuf[:], uint64(ns))))
        }
        print(" ms cpu, ",
            work.heap0>>20, "->", work.heap1>>20, "->", work.heap2>>20, " MB, ",
            work.heapGoal>>20, " MB goal, ",
            work.maxprocs, " P")
        if work.userForced {
            print(" (forced)")
        }
        print("\n")
        printunlock()
    }

    semrelease(&worldsema)
    // Careful: another GC cycle may start now.

    // Allow the current G to be preempted again
    releasem(mp)
    mp = nil

    // With concurrent sweep, the current M keeps running (it returns to gcBgMarkWorker and then sleeps)
    // Without concurrent sweep, let the current M yield to the scheduler
    // now that gc is done, kick off finalizer thread if needed
    if !concurrentSweep {
        // give the queued finalizers, if any, a chance to run
        Gosched()
    }
}

The gcSweep function wakes the background sweep worker:
the background sweep worker itself is started by the gcenable function, which is called at program startup.

func gcSweep(mode gcMode) {
    if gcphase != _GCoff {
        throw("gcSweep being done but phase is not GCoff")
    }

    // Increment sweepgen so that the two queues in sweepSpans swap roles and every span becomes "to be swept"
    lock(&mheap_.lock)
    mheap_.sweepgen += 2
    mheap_.sweepdone = 0
    if mheap_.sweepSpans[mheap_.sweepgen/2%2].index != 0 {
        // We should have drained this list during the last
        // sweep phase. We certainly need to start this phase
        // with an empty swept list.
        throw("non-empty swept list")
    }
    mheap_.pagesSwept = 0
    unlock(&mheap_.lock)

    // If this is non-concurrent GC, do all the sweep work here (still inside the STW)
    if !_ConcurrentSweep || mode == gcForceBlockMode {
        // Special case synchronous sweep.
        // Record that no proportional sweeping has to happen.
        lock(&mheap_.lock)
        mheap_.sweepPagesPerByte = 0
        unlock(&mheap_.lock)
        // Sweep all spans eagerly.
        for sweepone() != ^uintptr(0) {
            sweep.npausesweep++
        }
        // Free workbufs eagerly.
        prepareFreeWorkbufs()
        for freeSomeWbufs(false) { }
        // All "free" events for this mark/sweep cycle have
        // now happened, so we can make this profile cycle
        // available immediately.
        mProf_NextCycle()
        mProf_Flush()
        return
    }

    // Wake the background sweep worker
    // Background sweep.
    lock(&sweep.lock)
    if sweep.parked {
        sweep.parked = false
        ready(sweep.g, 0, true)
    }
    unlock(&sweep.lock)
}

The background sweep worker's function is bgsweep:

func bgsweep(c chan int) {
    sweep.g = getg()

    // Wait to be woken
    lock(&sweep.lock)
    sweep.parked = true
    c <- 1
    goparkunlock(&sweep.lock, "GC sweep wait", traceEvGoBlock, 1)

    // Sweep loop
    for {
        // Sweep one span, then yield to the scheduler (only a little work at a time)
        for gosweepone() != ^uintptr(0) {
            sweep.nbgsweep++
            Gosched()
        }
        // Release some unused mark-queue buffers back to the heap
        for freeSomeWbufs(true) {
            Gosched()
        }
        // If sweeping is not yet done, continue the loop
        lock(&sweep.lock)
        if !gosweepdone() {
            // This can happen if a GC runs between
            // gosweepone returning ^0 above
            // and the lock being acquired.
            unlock(&sweep.lock)
            continue
        }
        // Otherwise put the background sweep worker to sleep; the current M continues scheduling
        sweep.parked = true
        goparkunlock(&sweep.lock, "GC sweep wait", traceEvGoBlock, 1)
    }
}

The gosweepone function pops a single span from sweepSpans and sweeps it:

//go:nowritebarrier
func gosweepone() uintptr {
    var ret uintptr
    // Switch to the g0 stack
    systemstack(func() {
        ret = sweepone()
    })
    return ret
}

The sweepone function is as follows:

// sweeps one span
// returns number of pages returned to heap, or ^uintptr(0) if there is nothing to sweep
//go:nowritebarrier
func sweepone() uintptr {
    _g_ := getg()
    sweepRatio := mheap_.sweepPagesPerByte // For debugging

    // Disallow preemption of this G
    // increment locks to ensure that the goroutine is not preempted
    // in the middle of sweep thus leaving the span in an inconsistent state for next GC
    _g_.m.locks++
    
    // Check whether sweeping is already done
    if atomic.Load(&mheap_.sweepdone) != 0 {
        _g_.m.locks--
        return ^uintptr(0)
    }
    
    // Update the count of workers sweeping concurrently
    atomic.Xadd(&mheap_.sweepers, +1)

    npages := ^uintptr(0)
    sg := mheap_.sweepgen
    for {
        // Pop a span from sweepSpans
        s := mheap_.sweepSpans[1-sg/2%2].pop()
        // Break out of the loop once everything has been swept
        if s == nil {
            atomic.Store(&mheap_.sweepdone, 1)
            break
        }
        // Skip the span if another M has already swept it
        if s.state != mSpanInUse {
            // This can happen if direct sweeping already
            // swept this span, but in that case the sweep
            // generation should always be up-to-date.
            if s.sweepgen != sg {
                print("runtime: bad span s.state=", s.state, " s.sweepgen=", s.sweepgen, " sweepgen=", sg, "\n")
                throw("non in-use span in unswept list")
            }
            continue
        }
        // Atomically increment the span's sweepgen; failure means another M has
        // already started sweeping this span, so skip it
        if s.sweepgen != sg-2 || !atomic.Cas(&s.sweepgen, sg-2, sg-1) {
            continue
        }
        // Sweep this span, then break out of the loop
        npages = s.npages
        if !s.sweep(false) {
            // Span is still in-use, so this returned no
            // pages to the heap and the span needs to
            // move to the swept in-use list.
            npages = 0
        }
        break
    }

    // Update the count of workers sweeping concurrently
    // Decrement the number of active sweepers and if this is the
    // last one print trace information.
    if atomic.Xadd(&mheap_.sweepers, -1) == 0 && atomic.Load(&mheap_.sweepdone) != 0 {
        if debug.gcpacertrace > 0 {
            print("pacer: sweep done at heap size ", memstats.heap_live>>20, "MB; allocated ", (memstats.heap_live-mheap_.sweepHeapLiveBasis)>>20, "MB during sweep; swept ", mheap_.pagesSwept, " pages at ", sweepRatio, " pages/byte\n")
        }
    }
    // Allow the G to be preempted again
    _g_.m.locks--
    // Return the number of pages swept
    return npages
}

The span's sweep method sweeps a single span:

// Sweep frees or collects finalizers for blocks not marked in the mark phase.
// It clears the mark bits in preparation for the next GC round.
// Returns true if the span was returned to heap.
// If preserve=true, don't return it to heap nor relink in MCentral lists;
// caller takes care of it.
//TODO go:nowritebarrier
func (s *mspan) sweep(preserve bool) bool {
    // It's critical that we enter this function with preemption disabled,
    // GC must not start while we are in the middle of this function.
    _g_ := getg()
    if _g_.m.locks == 0 && _g_.m.mallocing == 0 && _g_ != _g_.m.g0 {
        throw("MSpan_Sweep: m is not locked")
    }
    sweepgen := mheap_.sweepgen
    if s.state != mSpanInUse || s.sweepgen != sweepgen-1 {
        print("MSpan_Sweep: state=", s.state, " sweepgen=", s.sweepgen, " mheap.sweepgen=", sweepgen, "\n")
        throw("MSpan_Sweep: bad span state")
    }

    if trace.enabled {
        traceGCSweepSpan(s.npages * _PageSize)
    }

    // Count the number of swept pages
    atomic.Xadd64(&mheap_.pagesSwept, int64(s.npages))

    spc := s.spanclass
    size := s.elemsize
    res := false

    c := _g_.m.mcache
    freeToHeap := false

    // The allocBits indicate which unmarked objects don't need to be
    // processed since they were free at the end of the last GC cycle
    // and were not allocated since then.
    // If the allocBits index is >= s.freeindex and the bit
    // is not marked then the object remains unallocated
    // since the last GC.
    // This situation is analogous to being on a freelist.

    // Handle the finalizers recorded in specials: if the corresponding object is
    // no longer alive, mark it alive to prevent it from being freed, then move
    // its finalizer to the run queue
    // Unlink & free special records for any objects we're about to free.
    // Two complications here:
    // 1. An object can have both finalizer and profile special records.
    //    In such case we need to queue finalizer for execution,
    //    mark the object as live and preserve the profile special.
    // 2. A tiny object can have several finalizers setup for different offsets.
    //    If such object is not marked, we need to queue all finalizers at once.
    // Both 1 and 2 are possible at the same time.
    specialp := &s.specials
    special := *specialp
    for special != nil {
        // A finalizer can be set for an inner byte of an object, find object beginning.
        objIndex := uintptr(special.offset) / size
        p := s.base() + objIndex*size
        mbits := s.markBitsForIndex(objIndex)
        if !mbits.isMarked() {
            // This object is not marked and has at least one special record.
            // Pass 1: see if it has at least one finalizer.
            hasFin := false
            endOffset := p - s.base() + size
            for tmp := special; tmp != nil && uintptr(tmp.offset) < endOffset; tmp = tmp.next {
                if tmp.kind == _KindSpecialFinalizer {
                    // Stop freeing of object if it has a finalizer.
                    mbits.setMarkedNonAtomic()
                    hasFin = true
                    break
                }
            }
            // Pass 2: queue all finalizers _or_ handle profile record.
            for special != nil && uintptr(special.offset) < endOffset {
                // Find the exact byte for which the special was setup
                // (as opposed to object beginning).
                p := s.base() + uintptr(special.offset)
                if special.kind == _KindSpecialFinalizer || !hasFin {
                    // Splice out special record.
                    y := special
                    special = special.next
                    *specialp = special
                    freespecial(y, unsafe.Pointer(p), size)
                } else {
                    // This is a profile record, but the object has finalizers (so kept alive).
                    // Keep special record.
                    specialp = &special.next
                    special = *specialp
                }
            }
        } else {
            // object is still live: keep special record
            specialp = &special.next
            special = *specialp
        }
    }

    // For debugging only
    if debug.allocfreetrace != 0 || raceenabled || msanenabled {
        // Find all newly freed objects. This doesn't have to
        // be efficient; allocfreetrace has massive overhead.
        mbits := s.markBitsForBase()
        abits := s.allocBitsForIndex(0)
        for i := uintptr(0); i < s.nelems; i++ {
            if !mbits.isMarked() && (abits.index < s.freeindex || abits.isMarked()) {
                x := s.base() + i*s.elemsize
                if debug.allocfreetrace != 0 {
                    tracefree(unsafe.Pointer(x), size)
                }
                if raceenabled {
                    racefree(unsafe.Pointer(x), size)
                }
                if msanenabled {
                    msanfree(unsafe.Pointer(x), size)
                }
            }
            mbits.advance()
            abits.advance()
        }
    }

    // Count the number of free objects in this span.
    nalloc := uint16(s.countAlloc())
    if spc.sizeclass() == 0 && nalloc == 0 {
        // sizeclass 0 means a large-object span; if its object is no longer
        // alive, release the whole span back to the heap
        s.needzero = 1
        freeToHeap = true
    }
    nfreed := s.allocCount - nalloc
    if nalloc > s.allocCount {
        print("runtime: nelems=", s.nelems, " nalloc=", nalloc, " previous allocCount=", s.allocCount, " nfreed=", nfreed, "\n")
        throw("sweep increased allocation count")
    }

    // Set the new allocCount
    s.allocCount = nalloc

    // Check whether the span has no unallocated elements left
    wasempty := s.nextFreeIndex() == s.nelems

    // Reset freeindex; the next allocation searches from the start of the span
    s.freeindex = 0 // reset allocation index to start of span.
    if trace.enabled {
        getg().m.p.ptr().traceReclaimed += uintptr(nfreed) * s.elemsize
    }

    // gcmarkBits becomes the allocBits.
    // Get a fresh cleared gcmarkBits in preparation for the next GC.
    // The next allocation can tell from allocBits which elements are unallocated.
    s.allocBits = s.gcmarkBits
    s.gcmarkBits = newMarkBits(s.nelems)

    // Refresh the allocCache starting from freeindex
    // Initialize alloc bits cache.
    s.refillAllocCache(0)

    // If the span has no live objects left, update sweepgen to the latest;
    // below, the span is added to mcentral or returned to the mheap
    // We need to set s.sweepgen = h.sweepgen only when all blocks are swept,
    // because of the potential for a concurrent free/SetFinalizer.
    // But we need to set it before we make the span available for allocation
    // (return it to heap or mcentral), because allocation code assumes that a
    // span is already swept if available for allocation.
    if freeToHeap || nfreed == 0 {
        // The span must be in our exclusive ownership until we update sweepgen,
        // check for potential races.
        if s.state != mSpanInUse || s.sweepgen != sweepgen-1 {
            print("MSpan_Sweep: state=", s.state, " sweepgen=", s.sweepgen, " mheap.sweepgen=", sweepgen, "\n")
            throw("MSpan_Sweep: bad span state after sweep")
        }
        // Serialization point.
        // At this point the mark bits are cleared and allocation ready
        // to go so release the span.
        atomic.Store(&s.sweepgen, sweepgen)
    }

    if nfreed > 0 && spc.sizeclass() != 0 {
        // Add the span back to mcentral; res reports whether that succeeded
        c.local_nsmallfree[spc.sizeclass()] += uintptr(nfreed)
        res = mheap_.central[spc].mcentral.freeSpan(s, preserve, wasempty)
        // MCentral_FreeSpan updates sweepgen
    } else if freeToHeap {
        // Free large span to heap

        // NOTE(rsc,dvyukov): The original implementation of efence
        // in CL 22060046 used SysFree instead of SysFault, so that
        // the operating system would eventually give the memory
        // back to us again, so that an efence program could run
        // longer without running out of memory. Unfortunately,
        // calling SysFree here without any kind of adjustment of the
        // heap data structures means that when the memory does
        // come back to us, we have the wrong metadata for it, either in
        // the MSpan structures or in the garbage collection bitmap.
        // Using SysFault here means that the program will run out of
        // memory fairly quickly in efence mode, but at least it won't
        // have mysterious crashes due to confused memory reuse.
        // It should be possible to switch back to SysFree if we also
        // implement and then call some kind of MHeap_DeleteSpan.
        if debug.efence > 0 {
            s.limit = 0 // prevent mlookup from finding this span
            sysFault(unsafe.Pointer(s.base()), size)
        } else {
            mheap_.freeSpan(s, 1)
        }
        c.local_nlargefree++
        c.local_largefree += size
        res = true
    }
    
    // If the span was neither added to mcentral nor released to the mheap, it is still in use
    if !res {
        // The span has been swept and is still in-use, so put
        // it on the swept in-use list.
        mheap_.sweepSpans[sweepgen/2%2].push(s)
    }
    return res
}
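
The key trick in sweep above is that freeing unmarked objects costs no per-object work: the mark bits from this cycle simply become the allocation bits for the next, and a zeroed mark bitmap is swapped in. A minimal toy sketch of this idea, using bool slices in place of the runtime's packed bitmaps (toySpan and its fields are illustrative names, not runtime types):

```go
package main

import "fmt"

// toySpan mimics the two bitmaps an mspan keeps: allocBits records which
// elements were allocated, gcmarkBits records which were reached during
// the mark phase.
type toySpan struct {
	nelems     int
	allocBits  []bool
	gcmarkBits []bool
}

// sweep "frees" every allocated-but-unmarked element without touching them
// individually: the mark bits simply become the new allocation bits, and a
// zeroed mark bitmap is swapped in for the next GC cycle.
func (s *toySpan) sweep() (freed int) {
	for i := 0; i < s.nelems; i++ {
		if s.allocBits[i] && !s.gcmarkBits[i] {
			freed++ // counted only; no per-object free work is needed
		}
	}
	s.allocBits = s.gcmarkBits
	s.gcmarkBits = make([]bool, s.nelems) // fresh, all-clear mark bits
	return freed
}

func main() {
	s := &toySpan{
		nelems:     4,
		allocBits:  []bool{true, true, true, false},
		gcmarkBits: []bool{true, false, true, false},
	}
	fmt.Println(s.sweep())   // element 1 was allocated but not marked
	fmt.Println(s.allocBits) // marks became the allocation bits
}
```

After the swap, a previously allocated but unmarked element looks exactly like one that was never allocated, which is why the real allocator can reuse it by consulting allocBits alone.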

As bgsweep and the allocator seen earlier show, the sweep phase is extremely lazy:
in practice a new GC cycle may need to start before the previous cycle has finished sweeping,
so every GC cycle must first complete the previous cycle's leftover sweep work (the Sweep Termination phase).
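
The claim step in the sweep loop above can be sketched in isolation. Assuming the sweepgen convention described by the runtime's comments (sg-2 = needs sweeping, sg-1 = being swept, sg = swept), a simplified stand-alone version using sync/atomic (trySweep is an illustrative name, not a runtime function):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// trySweep sketches the claim protocol: for heap sweep generation sg,
//   *spanGen == sg-2 → span needs sweeping
//   *spanGen == sg-1 → span is being swept by some M
//   *spanGen == sg   → span has been swept
// A single CAS claims the span, so two Ms can never sweep it concurrently.
func trySweep(spanGen *uint32, sg uint32) bool {
	if atomic.LoadUint32(spanGen) != sg-2 ||
		!atomic.CompareAndSwapUint32(spanGen, sg-2, sg-1) {
		return false // another M already claimed or swept this span
	}
	// ... the actual sweeping of the span would happen here ...
	atomic.StoreUint32(spanGen, sg) // publish: span is now swept
	return true
}

func main() {
	const sg = 4
	gen := uint32(sg - 2)
	fmt.Println(trySweep(&gen, sg)) // first claim succeeds
	fmt.Println(trySweep(&gen, sg)) // span already swept: claim fails
}
```

Because the heap generation advances by 2 each cycle, a span whose generation is still two behind is unambiguously unswept, which is what lets sweeping stay this lazy without double-freeing.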

This completes the analysis of the whole GC flow. To finish, here is the implementation of the write barrier function writebarrierptr:

// NOTE: Really dst *unsafe.Pointer, src unsafe.Pointer,
// but if we do that, Go inserts a write barrier on *dst = src.
//go:nosplit
func writebarrierptr(dst *uintptr, src uintptr) {
    if writeBarrier.cgo {
        cgoCheckWriteBarrier(dst, src)
    }
    if !writeBarrier.needed {
        *dst = src
        return
    }
    if src != 0 && src < minPhysPageSize {
        systemstack(func() {
            print("runtime: writebarrierptr *", dst, " = ", hex(src), "\n")
            throw("bad pointer in write barrier")
        })
    }
    // Mark the pointers
    writebarrierptr_prewrite1(dst, src)
    // Store the pointer into the target
    *dst = src
}

The writebarrierptr_prewrite1 function is as follows:

// writebarrierptr_prewrite1 invokes a write barrier for *dst = src
// prior to the write happening.
//
// Write barrier calls must not happen during critical GC and scheduler
// related operations. In particular there are times when the GC assumes
// that the world is stopped but scheduler related code is still being
// executed, dealing with syscalls, dealing with putting gs on runnable
// queues and so forth. This code cannot execute write barriers because
// the GC might drop them on the floor. Stopping the world involves removing
// the p associated with an m. We use the fact that m.p == nil to indicate
// that we are in one these critical section and throw if the write is of
// a pointer to a heap object.
//go:nosplit
func writebarrierptr_prewrite1(dst *uintptr, src uintptr) {
    mp := acquirem()
    if mp.inwb || mp.dying > 0 {
        releasem(mp)
        return
    }
    systemstack(func() {
        if mp.p == 0 && memstats.enablegc && !mp.inwb && inheap(src) {
            throw("writebarrierptr_prewrite1 called with mp.p == nil")
        }
        mp.inwb = true
        gcmarkwb_m(dst, src)
    })
    mp.inwb = false
    releasem(mp)
}

The gcmarkwb_m function is as follows:

func gcmarkwb_m(slot *uintptr, ptr uintptr) {
    if writeBarrier.needed {
        // Note: This turns bad pointer writes into bad
        // pointer reads, which could be confusing. We avoid
        // reading from obviously bad pointers, which should
        // take care of the vast majority of these. We could
        // patch this up in the signal handler, or use XCHG to
        // combine the read and the write. Checking inheap is
        // insufficient since we need to track changes to
        // roots outside the heap.
        //
        // Note: profbuf.go omits a barrier during signal handler
        // profile logging; that's safe only because this deletion barrier exists.
        // If we remove the deletion barrier, we'll have to work out
        // a new way to handle the profile logging.
        if slot1 := uintptr(unsafe.Pointer(slot)); slot1 >= minPhysPageSize {
            if optr := *slot; optr != 0 {
                // Shade the old pointer
                shade(optr)
            }
        }
        // TODO: Make this conditional on the caller's stack color.
        if ptr != 0 && inheap(ptr) {
            // Shade the new pointer
            shade(ptr)
        }
    }
}

The shade function is as follows:

// Shade the object if it isn't already.
// The object is not nil and known to be in the heap.
// Preemption must be disabled.
//go:nowritebarrier
func shade(b uintptr) {
    if obj, hbits, span, objIndex := heapBitsForObject(b, 0, 0); obj != 0 {
        gcw := &getg().m.p.ptr().gcw
        // Mark the object as alive and add it to the mark queue (the object turns grey)
        greyobject(obj, 0, 0, hbits, span, gcw, objIndex)
        // If local mark queues are disabled, flush to the global mark queue
        if gcphase == _GCmarktermination || gcBlackenPromptly {
            // Ps aren't allowed to cache work during mark
            // termination.
            gcw.dispose()
        }
    }
}
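
Putting writebarrierptr, gcmarkwb_m and shade together: the barrier shades both the old pointer being overwritten (a deletion barrier in the style of Yuasa) and the new pointer being written (an insertion barrier in the style of Dijkstra). A toy model of this behaviour, with a map and a slice standing in for the runtime's mark bitmaps and per-P work buffers, object ids in place of real addresses, and 0 playing the role of a nil pointer (all names here are illustrative):

```go
package main

import "fmt"

// barrier models tri-color invariant maintenance: marked objects are grey
// or black, unmarked ones are white.
type barrier struct {
	marked map[int]bool
	grey   []int // the mark queue
}

// shade marks an object and queues it for scanning, like runtime.shade.
func (b *barrier) shade(obj int) {
	if obj != 0 && !b.marked[obj] {
		b.marked[obj] = true
		b.grey = append(b.grey, obj)
	}
}

// writePointer models gcmarkwb_m: shade the pointer being overwritten
// (deletion barrier) and the pointer being written (insertion barrier),
// then perform the actual write.
func (b *barrier) writePointer(slot *int, ptr int) {
	b.shade(*slot) // old value: a deleted reference must stay alive this cycle
	b.shade(ptr)   // new value: the target must not remain white
	*slot = ptr
}

func main() {
	b := &barrier{marked: map[int]bool{}}
	slot := 7 // the slot currently points at object 7
	b.writePointer(&slot, 9)
	fmt.Println(b.marked[7], b.marked[9], slot) // both objects are shaded
}
```

Shading both sides is what allows the collector to keep running while mutators rewrite the object graph: neither a reference hidden by deletion nor one installed into an already-scanned object can end up white at the end of the mark phase.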

References

https://github.com/golang/go
https://making.pusher.com/golangs-real-time-gc-in-theory-and-practice
https://github.com/golang/proposal/blob/master/design/17503-eliminate-rescan.md
https://golang.org/s/go15gcpacing
https://golang.org/ref/mem
https://talks.golang.org/2015/go-gc.pdf
https://docs.google.com/document/d/1ETuA2IOmnaQ4j81AtTGT40Y4_Jr6_IDASEKg0t0dBR8/edit#heading=h.x4kziklnb8fr
https://go-review.googlesource.com/c/go/+/21503
http://www.cnblogs.com/diegodu/p/5803202.html
http://legendtkl.com/2017/04/28/golang-gc
https://lengzzz.com/note/gc-in-golang

Comparing Golang's GC with CoreCLR's GC

Since I have previously analyzed CoreCLR's GC (see this article and this one), here is a brief comparison of the CoreCLR and Go GC implementations:

  • CoreCLR objects carry type information, Go objects do not; Go instead records which addresses contain pointers in the bitmap area
  • CoreCLR allocates objects noticeably faster, while Go has to look up a span and write to the bitmap area on allocation
  • CoreCLR's collector has much more work to do than Go's

    • CoreCLR puts objects of all sizes in the same segment, which can only be scanned linearly
    • CoreCLR must consult type information to identify object references, while Go only needs to consult the bitmap
    • CoreCLR sweeps by marking free objects one by one, while Go only needs to swap in allocBits
  • CoreCLR's pause times are longer than Go's

    • Although CoreCLR supports concurrent GC, it is not as thorough as Go's; Go does not even need a full stop-the-world to scan root objects
  • CoreCLR supports generational GC

    • Although CoreCLR's full GC is less efficient than Go's, CoreCLR can scan only generation 0 and 1 objects most of the time
    • Thanks to generational GC, CoreCLR usually spends less CPU time on GC than Go does

CoreCLR's allocator and collector are generally more efficient than Go's; in other words, CoreCLR achieves higher throughput.
But CoreCLR's maximum pause time is not as short as Go's, because Go's entire GC design aims at minimizing pause time.

Now that distributed computing and horizontal scaling are increasingly popular,
pursuing low latency and letting distribution solve the throughput problem is clearly wiser than chasing single-machine throughput,
and Go's design goals make it better suited than other languages for writing network services.
