
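The official documentation says relatively little about how MongoDB synchronization works, and there is not much material online either. What follows is pieced together from the official documentation, online sources, and the logs I collected while testing.
Because every shard in a MongoDB sharded cluster is itself a replica set, it is enough to understand how replica set synchronization works.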

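Broadly speaking, MongoDB replica set synchronization consists of two steps:

1. Initial sync: a full copy of the data
2. Replication: syncing the oplog

A member first copies the full data set through initial sync, then keeps up with incremental changes through replication by continuously replaying the Primary's oplog. Once initial sync completes, the member transitions from STARTUP2 to SECONDARY.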

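1) Initial sync starts: record the latest timestamp on the sync source, t1
2) Copy all collection data and build indexes (the time-consuming part)
3) Record the latest timestamp on the sync source, t2
4) Replay every oplog entry between t1 and t2
5) Initial sync ends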

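Put simply, the node walks through every collection of every database on the Primary, copies the data to itself, and then reads and replays the oplog entries written between the start and the end of the full copy.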
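
The five steps can be sketched with a toy in-memory model. This is a minimal illustration under simplifying assumptions, not MongoDB's implementation: the sync source is a Python dict, the oplog is a list of (timestamp, op, _id, doc) tuples, and the names apply_op, initial_sync and concurrent_writes are made up for the example. It shows why replaying the t1 to t2 window brings an inconsistent copy up to date, provided each oplog entry is idempotent.

```python
def apply_op(data, op, doc_id, doc):
    """Apply one oplog entry idempotently: replaying it twice is harmless."""
    if op in ("insert", "update"):
        data[doc_id] = doc            # treated as an upsert
    elif op == "delete":
        data.pop(doc_id, None)        # deleting a missing document is a no-op


def initial_sync(source_data, source_oplog, concurrent_writes):
    """Copy source_data while concurrent_writes land on the source mid-copy.

    source_oplog must contain at least one entry.
    """
    # 1) record the newest timestamp on the sync source
    t1 = source_oplog[-1][0]

    # 2) clone the data; the copy is split in two so that writes can arrive
    #    between the halves, which leaves the copy inconsistent
    ids = sorted(source_data)
    half = len(ids) // 2
    local = {i: source_data[i] for i in ids[:half]}
    for entry in concurrent_writes:
        source_oplog.append(entry)
        apply_op(source_data, *entry[1:])
    local.update({i: source_data[i] for i in ids[half:] if i in source_data})

    # 3) record the newest timestamp again
    t2 = source_oplog[-1][0]

    # 4) replay every oplog entry in the [t1, t2] window, in order
    for ts, op, doc_id, doc in source_oplog:
        if t1 <= ts <= t2:
            apply_op(local, op, doc_id, doc)

    # 5) initial sync is done
    return local
```

In this model, a document updated after it was copied, or deleted before it was copied, ends up correct after step 4 because the replay overwrites or removes it.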

After initial sync finishes, the Secondary opens a tailable cursor on the Primary's local.oplog.rs, keeps fetching newly written oplog entries from the Primary, and applies them to itself.

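A Secondary has to perform an initial sync first when any of the following holds:

1) The oplog is empty
2) The _initialSyncFlag field in the local.replset.minvalid collection is set to true (used to handle a failed initial sync)
3) The in-memory flag initialSyncRequested is set to true (used by the resync command; resync only applies to the master/slave architecture and cannot be used with replica sets)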

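These three conditions correspond to the scenarios below (I could not find conditions 2 and 3 in the official documentation; they come from Zhang Youdong's blog):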

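1) A new node joins with no oplog at all, so it must run initial sync first
2) When initial sync starts, the node sets _initialSyncFlag to true, and sets it back to false after a normal finish. If the node restarts and finds _initialSyncFlag still true, the previous initial sync failed partway and initial sync has to run again
3) When the user sends the resync command, initialSyncRequested is set to true, which forces a fresh initial sync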
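
The three conditions amount to a short decision function. The sketch below only restates the list; the function name initial_sync_reason and its arguments are hypothetical, and the real server reads these values from its own internal state rather than from arguments like these.

```python
def initial_sync_reason(oplog_is_empty, minvalid_doc, initial_sync_requested):
    """Return why initial sync is required, or None if it is not.

    minvalid_doc stands for the document stored in local.replset.minvalid.
    """
    if oplog_is_empty:
        return "oplog is empty (new node)"
    if minvalid_doc.get("_initialSyncFlag") is True:
        return "previous initial sync did not finish"
    if initial_sync_requested:
        return "resync was requested"
    return None
```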

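1.3.1 During the full copy, can the oplog on the sync source wrap around and cause initial sync to fail?

In version 3.4 and later, no.
The figure below shows how 3.4 improved initial sync (figure from Zhang Youdong's blog):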

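The official documentation says:

Initial sync builds all collection indexes as the documents are copied for each collection. In earlier versions of MongoDB (before 3.4), only the _id indexes were built during this stage. While copying the data, initial sync also stores newly added oplog records locally (new in 3.4).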

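Once initial sync is done, the Secondary opens a tailable cursor starting from the point where initial sync ended, keeps pulling oplog entries from the sync source, and replays them against itself. This work is not done by a single thread: to make synchronization faster, MongoDB splits fetching the oplog and replaying the oplog across different threads.
The threads and their roles are as follows (I have not found this in the official documentation yet; it comes from Zhang Youdong's blog):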

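producer thread: continuously pulls oplog entries from the sync source and puts them into a BlockQueue. The BlockQueue holds at most 240MB of oplog data; once that limit is reached, the producer has to wait until replBatcher consumes entries before it can fetch more.
replBatcher thread: takes oplog entries one by one from the producer thread's queue and puts them into a queue of its own. That queue holds at most 5000 entries with a total size of no more than 512MB; when it is full, replBatcher has to wait for oplogApplication to consume it.
oplogApplication thread: takes every entry currently in the replBatcher thread's queue and distributes the entries across the replWriter threads by docId (or by collection name if the storage engine does not support document-level locking). The replWriter threads apply the oplog entries to the node; once all of them have been applied, the oplogApplication thread writes the entries to the local.oplog.rs collection in order.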
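
The distribution step done by oplogApplication can be illustrated as follows. This is a sketch under assumptions: the entry layout (ns, _id), the function name partition_by_doc and the use of crc32 as the hash are all chosen for the example and are not taken from MongoDB. The point is that every entry for a given document lands on the same writer, in its original order, so per-document ordering survives parallel replay.

```python
import zlib


def partition_by_doc(batch, writer_count, doc_level_locking=True):
    """Split a batch into one ordered list of entries per writer thread."""
    buckets = [[] for _ in range(writer_count)]
    for entry in batch:
        # Key on namespace + _id, or on the namespace alone when the storage
        # engine has no document-level locking.
        key = entry["ns"]
        if doc_level_locking:
            key += ":" + str(entry["_id"])
        # crc32 is a stand-in hash chosen for determinism, not MongoDB's own.
        buckets[zlib.crc32(key.encode()) % writer_count].append(entry)
    return buckets
```

With doc_level_locking=False, every entry for a collection goes to a single writer, which is why replay parallelism drops to the collection level on such engines.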

To make the description above easier to follow, I drew a diagram:

Statistics for the producer's buffer and for the apply threads can both be queried with db.serverStatus().metrics.repl.

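2.2.1 Why does oplog replay involve so many threads?

As in MySQL, each thread does one job: a single thread fetches the oplog while other threads replay it, and running several replay threads speeds things up.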

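2.2.2 Why is the replBatcher thread needed as a go-between?

Oplog replay has to preserve ordering, and DDL commands such as create and drop cannot run in parallel with ordinary insert, update, delete, and query operations. replBatcher is what enforces these constraints.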
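
One way to picture this control is a batcher that closes the current batch whenever it meets a DDL command and gives that command a batch of its own. The sketch below is an assumption about the shape of the logic, not the server's code: the "kind" field, the DDL_KINDS set (limited to the two commands named in the text) and split_batches are hypothetical, and only the 5000-entry limit comes from the description of the replBatcher queue.

```python
DDL_KINDS = {"create", "drop"}  # only the commands named in the text


def split_batches(entries, max_ops=5000):
    """Group oplog entries into batches that are safe to apply in parallel."""
    batches, current = [], []
    for entry in entries:
        if entry["kind"] in DDL_KINDS:
            # Flush what came before, then apply the DDL command alone.
            if current:
                batches.append(current)
                current = []
            batches.append([entry])
        else:
            current.append(entry)
            if len(current) == max_ops:
                batches.append(current)
                current = []
    if current:
        batches.append(current)
    return batches
```

Batches are applied one after another, so a create or drop never overlaps with the inserts and updates on either side of it.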

2.2.3 What can be done when oplog replay on a secondary cannot keep up with the primary?

Option 1: use more replay threads

  * On the mongod command line: mongod --setParameter replWriterThreadCount=32
  * In the configuration file:

setParameter:
  replWriterThreadCount: 32

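Option 2: increase the size of the oplog
Option 3: spread the writeOpsToOplog step across multiple replWriter threads so it runs concurrently. According to the official development log, this has already been implemented (in 3.4.0-rc2).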

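2.3 Caveats

Initial sync copies data in a single thread and is fairly slow. Production environments should avoid triggering initial sync where possible, which means sizing the oplog sensibly.
When adding a new node, initial sync can be avoided with a physical copy: copy the Primary's dbpath to the new node and start it directly.
When a Secondary lags because of very high write concurrency on the Primary, and sizeBytes in db.serverStatus().metrics.repl.buffer stays close to maxSizeBytes, raising the number of replWriter threads on the Secondary can help it catch up.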
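
The "sizeBytes stays close to maxSizeBytes" check can be written down as a small helper. This is a sketch that works on plain dicts shaped like the serverStatus output; how the samples are collected is left out, and the 0.9 threshold and the function names are assumptions made for the example.

```python
def buffer_fill_ratio(server_status):
    """Fraction of the replication buffer in use, from a serverStatus document."""
    buf = server_status["metrics"]["repl"]["buffer"]
    return buf["sizeBytes"] / buf["maxSizeBytes"]


def buffer_stays_full(samples, threshold=0.9):
    """True when every sample shows the buffer at or above the threshold.

    A buffer that is full in sample after sample suggests the apply side,
    not the fetch side, is the bottleneck.
    """
    return bool(samples) and all(
        buffer_fill_ratio(sample) >= threshold for sample in samples
    )
```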

Set the log verbosity level to 1, then filter the log:
grep -E "clone|index|oplog" mg36000.log > b.log
Part of the filtered log is shown below.

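3.4.21: log from adding a new node

There is far too much log output to paste in full, so only the entries for one collection in the db01 database are shown below. The node first creates the collection's indexes, then clones the collection data and index data, which completes the clone of that collection. It then moves on to the next collection.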

2019-08-21T16:50:10.880+0800 D STORAGE  [InitialSyncInserters-db01.test20] create uri: table:db01/index-27-154229953453504826 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "num" : 1}, "name" : "num_1", "ns" : "db01.test2" }),
2019-08-21T16:50:10.882+0800 I INDEX    [InitialSyncInserters-db01.test20] build index on: db01.test2 properties: {v: 2, key: { num: 1.0}, name: "num_1", ns: "db01.test2" }
2019-08-21T16:50:10.882+0800 I INDEX    [InitialSyncInserters-db01.test20]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-21T16:50:10.882+0800 D STORAGE  [InitialSyncInserters-db01.test20] create uri: table:db01/index-28-154229953453504826 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1}, "name" : "_id_", "ns" : "db01.test2" }),
2019-08-21T16:50:10.886+0800 I INDEX    [InitialSyncInserters-db01.test20] build index on: db01.test2 properties: {v: 2, key: { _id: 1}, name: "_id_", ns: "db01.test2" }
2019-08-21T16:50:10.886+0800 I INDEX    [InitialSyncInserters-db01.test20]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-21T16:50:10.901+0800 D INDEX    [InitialSyncInserters-db01.test20]      bulk commit starting for index: num_1
2019-08-21T16:50:10.906+0800 D INDEX    [InitialSyncInserters-db01.test20]      bulk commit starting for index: _id_
2019-08-21T16:50:10.913+0800 D REPL     [repl writer worker 11] collection clone finished: db01.test2
2019-08-21T16:50:10.913+0800 D REPL     [repl writer worker 11]     collection: db01.test2, stats: {ns: "db01.test2", documentsToCopy: 2000, documentsCopied: 2000, indexes: 2, fetchedBatches: 1, start: new Date(1566377410875), end: new Date(1566377410913), elapsedMillis: 38 }
2019-08-21T16:50:10.920+0800 D STORAGE  [InitialSyncInserters-db01.collection10] create uri: table:db01/index-30-154229953453504826 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1}, "name" : "_id_", "ns" : "db01.collection1" }),
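
The per-collection stats line in the log above ("collection: ..., stats: {...}") carries the document counts and elapsed time, and can be pulled apart with a regular expression when comparing many collections. This is a sketch that assumes the line layout seen in these particular logs; the pattern and the name parse_clone_stats are made up for the example, and other server versions may format the line differently.

```python
import re

# Matches the "collection: <ns>, stats: {...}" line written when a clone ends.
CLONE_STATS = re.compile(
    r"collection: (?P<ns>[^,]+), stats: \{.*?"
    r"documentsToCopy: (?P<to_copy>\d+), documentsCopied: (?P<copied>\d+), "
    r"indexes: (?P<indexes>\d+).*?elapsedMillis: (?P<millis>\d+)"
)


def parse_clone_stats(line):
    """Return the clone stats from one log line as a dict, or None."""
    match = CLONE_STATS.search(line)
    if match is None:
        return None
    stats = match.groupdict()
    return {
        "ns": stats["ns"],
        "to_copy": int(stats["to_copy"]),
        "copied": int(stats["copied"]),
        "indexes": int(stats["indexes"]),
        "millis": int(stats["millis"]),
    }
```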

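3.6.12: log from adding a new node

Compared with 3.4, 3.6 makes it explicit that the threads copying the database are the repl writer worker threads (according to the documentation this was already the case in 3.4), and that the copy is done through cursors. Otherwise it is the same as 3.4: create the indexes, then clone the data.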

2019-08-22T13:59:39.444+0800 D STORAGE  [repl writer worker 9] create uri: table:db01/index-32-3334250984770678501 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1}, "name" : "_id_", "ns" : "db01.collection1" }),log=(enabled=true)
2019-08-22T13:59:39.446+0800 I INDEX    [repl writer worker 9] build index on: db01.collection1 properties: {v: 2, key: { _id: 1}, name: "_id_", ns: "db01.collection1" }
2019-08-22T13:59:39.446+0800 I INDEX    [repl writer worker 9]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-22T13:59:39.447+0800 D REPL     [replication-1] Collection cloner running with 1 cursors established.
2019-08-22T13:59:39.681+0800 D INDEX    [repl writer worker 7]      bulk commit starting for index: _id_
2019-08-22T13:59:39.725+0800 D REPL     [repl writer worker 7] collection clone finished: db01.collection1
2019-08-22T13:59:39.725+0800 D REPL     [repl writer worker 7]     database: db01, stats: {dbname: "db01", collections: 1, clonedCollections: 1, start: new Date(1566453579439), end: new Date(1566453579725), elapsedMillis: 286 }
2019-08-22T13:59:39.725+0800 D REPL     [repl writer worker 7]     collection: db01.collection1, stats: {ns: "db01.collection1", documentsToCopy: 50000, documentsCopied: 50000, indexes: 1, fetchedBatches: 1, start: new Date(1566453579440), end: new Date(1566453579725), elapsedMillis: 285 }
2019-08-22T13:59:39.731+0800 D STORAGE  [repl writer worker 8] create uri: table:test/index-34-3334250984770678501 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1}, "name" : "_id_", "ns" : "test.user1" }),log=(enabled=true)

4.0.11: log from adding a new node

Uses cursors; essentially the same as 3.6.

2019-08-22T15:02:13.806+0800 D STORAGE  [repl writer worker 15] create uri: table:db01/index-30--463691904336459055 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "num" : 1}, "name" : "num_1", "ns" : "db01.collection1" }),log=(enabled=false)
2019-08-22T15:02:13.816+0800 I INDEX    [repl writer worker 15] build index on: db01.collection1 properties: {v: 2, key: { num: 1.0}, name: "num_1", ns: "db01.collection1" }
2019-08-22T15:02:13.816+0800 I INDEX    [repl writer worker 15]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-22T15:02:13.816+0800 D STORAGE  [repl writer worker 15] create uri: table:db01/index-31--463691904336459055 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1}, "name" : "_id_", "ns" : "db01.collection1" }),log=(enabled=false)
2019-08-22T15:02:13.819+0800 I INDEX    [repl writer worker 15] build index on: db01.collection1 properties: {v: 2, key: { _id: 1}, name: "_id_", ns: "db01.collection1" }
2019-08-22T15:02:13.819+0800 I INDEX    [repl writer worker 15]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-22T15:02:13.820+0800 D REPL     [replication-0] Collection cloner running with 1 cursors established.

2019-08-22T15:15:17.566+0800 D STORAGE  [repl writer worker 2] create collection db01.collection2 {uuid: UUID("8e61a14e-280c-4da7-ad8c-f6fd086d9481") }
2019-08-22T15:15:17.567+0800 I STORAGE  [repl writer worker 2] createCollection: db01.collection2 with provided UUID: 8e61a14e-280c-4da7-ad8c-f6fd086d9481
2019-08-22T15:15:17.567+0800 D STORAGE  [repl writer worker 2] stored meta data for db01.collection2 @ RecordId(22)
2019-08-22T15:15:17.580+0800 D STORAGE  [repl writer worker 2] db01.collection2: clearing plan cache - collection info cache reset
2019-08-22T15:15:17.580+0800 D STORAGE  [repl writer worker 2] create uri: table:db01/index-43--463691904336459055 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1}, "name" : "_id_", "ns" : "db01.collection2" }),log=(enabled=false)

References:
https://docs.mongodb.com/v4.0/core/replica-set-sync/
https://docs.mongodb.com/v4.0/tutorial/resync-replica-set-member/#replica-set-auto-resync-stale-member
http://www.mongoing.com/archives/2369

Author: hs2021


This article is original content from the Yunqi Community (云栖社区) and may not be reproduced without permission.

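How MongoDB Replica Set Synchronization Works

I. Initial Sync

1.1 The initial sync process

1.2 When initial sync is triggered

1.3 Open questions explained

1.3.1 During the full copy, can the oplog on the sync source wrap around and cause initial sync to fail?

II. Replication

2.1 The oplog sync process

2.2 Questions about the process explained

2.3 Caveats

III. Log analysis

3.1 Initial sync logs

3.2 Replication logs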
