关于zookeeper:谈谈分布式一致性算法-paxos-zab-raft-gossip

概述

这篇blog次要探讨常见的分布式一致性算法：

paxos
zab (zookeeper atomic broadcast)
raft
gossip

重点在于比照这些算法的相同点和差别。思考算法的设计和取舍.

paxos

paxos是最早提出的分布式一致性算法.
见http://lamport.azurewebsites....

重要概念

Roles

实际中, 节点往往同时承载proposer/acceptors/learner的性能.

proposer

提案人，A proposer sends a proposed value to a set of acceptors. 能够了解为能够执行写入的master节点。留神paxos协定并没有限度Proposer的数量. 在实践中为了防止核心节点生成全序proposal number的单点故障, 应用了选主协定. 参考上面的leader.

leader

见论文的第三节：Implementing a State Machine
必须有一个核心节点生成全序的 proposal number. 被选为主节点的proposer才可发动proposal.
In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm.

acceptors

投票人，An acceptor may accept the proposed value. 即形成majority或quorum的节点。

learner

被动承受选举后果的角色。能够了解为只读的从库。

client

客户端. 发动申请和接管返回.

work flow

Basic Paxos without failures

下图来自: https://en.wikipedia.org/wiki...

Client   Proposer      Acceptor     Learner   |         |          |  |  |       |  |   X-------->|          |  |  |       |  |  Request   |         X--------->|->|->|       |  |  Prepare(1)   |         |<---------X--X--X       |  |  Promise(1,{Va,Vb,Vc})   |         X--------->|->|->|       |  |  Accept!(1,V)   |         |<---------X--X--X------>|->|  Accepted(1,V)   |<---------------------------------X--X  Response   |         |          |  |  |       |  |

算法外围

acceptor不会承受proposal number更小的prepare, 会返回accept过的highest-numbered.
被提交的value只有在阶段2才会被抉择.

因为所有最高的proposal number, 都是通过了majority的accept的, 所以相对不会产生如下状况:
更低版本议题笼罩更高版本议题的状况, 如,
(proposal number 1, V1) 笼罩 (proposal number 2, V2).

为什么paxos要有prepare阶段

paxos容许同时存在多个proposer, prepare阶段能够确保不笼罩higher proposal number.

paxos算法是线性的吗?

paxos算法没有明确要求proposal number的生成是严格有序的. 也没有规定learner严格有序地按proposal number读取.
在严格有序的状况下, paxos才是线性的.

zab

zab是zookeeper实现的分布式一致性算法.
见: http://www.cs.cornell.edu/cou...

why not paxos

线性一致性

It is important in our setting that we enable multiple outstanding ZooKeeper operations and that a prefix of operations submitted concurrently by a ZooKeeper client are committed according to FIFO order.
paxos没有间接反对线性一致性. 如果多个primaries同时执行多个事务, 事务提交的程序是无奈保障的, 这个将导致最终的后果不统一.
下图阐明了这种不统一, 27A -> 28B 变成了 27C -> 28B. 如果心愿防止这种不统一, 只能限度一次只能执行一个proposal. 显然这样会导致系统吞吐低下. 若应用合并transactions的形式优化的话, 又会导致系统提早进步. 合并transactions的size也很难抉择.

疾速复原

在多个primaries同时执行多个事务的状况下, paxos不同的processes在同一个sequence number下可能有不同的值. 新的master必须对所有未learned的value走一遍paxos phases 1, 即通过majories取得正确的值.

重要概念

Roles

leader

zab 是通过单主节点, 即leader播送来实现线性, 即全序(论文中的PO, primary order)的播送.

follower

选主, abdeliver 主节点的播送.

epoch

新纪元, 每次选主胜利, 都会有有 epoch_new > epoch_current.

proposes 和 commit

proposes和commit对应paxos的Promise/Accept.
proposes和commit 都蕴含 (e, (v, z)):
e: epoch 对应master的任期number
v: commit transaction value, 对应理论的值
z: commit transaction identifier (zxid), 对应提交版本号.
正是commit的全序线性化保障, 保障了各个节点的value变动具备线性一致性.

work flow

选主(discovery 和 synchronization)

如何确认prospective leader？

尚未在了论文中找到prospective leader的准确形容，有含糊形容在 V. Z AB IN DETAIL 如下：

When a process starts, it enters the ELECTION state. While in this state the process tries to elect a new leader or become a leader. If the process finds an elected leader, it moves to the FOLLOWING state and begins to follow the leader.

为什么选主须要discovery 和 synchronization 分两个阶段?

discovery 阶段通过少数follower的cepoch确保leader有最新的commit和history(f.p和h.f)．
synchronization 阶段确保leader将最新的commit/history同步到了少数节点．

broadcast

少数成员ack后，即可平安commit和返回client. 参照discovery阶段，因为会抉择所有节点中最新的history，故，只须要少数节点ack，即可返回写入胜利.
返回失败不肯定意味着写入不胜利，可能是

胜利，但返回未被收到。
失败，leader 播送之前解体。
失败，leader 在少数成员ack前解体，之后的选主未选中ack的follower的history。

写入失败不肯定失败，写入胜利能够确保胜利。

raft

raft 是目前最易于实现，最易懂的分布式一致性算法，和zab一样，raft必须先选出主节点。

why not paxos

难懂
不容易构建具备实际意义的实现.

paxos指single-decree paxos, 在工程中价值不大, 而multi-paxos的诸多细节并未给出.

和 zab 的异同

异

选主逻辑不同, raft通过随机的选主超时触发选主. 只有一个阶段
broadcast逻辑不同, 见上面的log.

从论文登程, raft思考到了更多工程上的细节, 如日志压缩, 和扭转集群节点都是zab/paxos未提及的.

同

single master and multi followers
任期概念 raft (term) zab (epoch)

外围形象

Roles

leader

同zab, raft有一个strong leader. 必须先选出主节点, 才可提供写入服务.

follower

同zab, follower可投票, 写入主节点播送的log(by AppendEntries rpc)

term

同zab, 每选出一个leader, 都对应一个全序的term.

log

日志, 即zab的propose和commit.
AppendEntries设计得极为精美. 同时能够作为

heartbeat
propose
commit (leaderCommit字段能够commit所有小于等于leaderCommit的log)

work flow

Leader election

https://raft.github.io/
有具体的选主过程. 能够随便操作节点, 察看各个状况下的边界. 参照paxos的文本化流程, 上面是一个失常状况下的选主:

Node1                 Node2                 Node3   |                    |                     |timeout                 |                     |   X----RequestVote---->|                     |   X--------------------|-----RequestVote---->|   |<-------vote--------X                     |   |<-------vote--------|-------------- ------Xbecome leader           |                     |

Log replication

Figure 6: Logs are composed of entries, which are numbered
sequentially. Each entry contains the term in which it was
created (the number in each box) and a command for the state
machine. Anentryisconsidered committed if itissafeforthat
entry to be applied to state machines.

认真思考AppendEntries RPC的Receiver implementation:

不承受小于以后term的log. 播送AppendEntries的节点只可能有两种, 一, 误以为本人仍是leader的老leader. 二, 以后leader, 一个分区中的Node可能不停进步term, 但它取得少数节点投票成为leader之前, 不会收回播送. 这一点保障了, 只承受以后任期term的log.
不承受prevLogIndex和prevLogTerm不匹配的log. 若prevLogIndex和prevLogTerm不匹配. 阐明prevLogIndex之上有多个leader播送, leader的log和follower的log不统一. leader通过缩小nextIndex重试来修复和follower的不统一. 直到所有未commit的log都达成统一.
如果以后entry的值不统一, 应用leader的. leader的log是可信的.
如果有新的entries, 增加.
如果leaderCommit > commitIndex, 设置 commitIndex = min(leaderCommit, index of last new entry)

当少数节点都返回对应success时, AppendEntries的entries即可认为是写入, 能够返回给client. 这是通过在选主时施加的额定限度来保障的, 具备更新(比拟log index和term). log的节点才可博得选举. 那么, 只有一个log在少数节点写入了(就算没有commit), 选主时, 肯定会抉择一个具备该entry的节点:

不切换leader的状况下, 不会写入更新的log
切换leader须要少数节点的批准, 肯定会抉择一个具备该log的节点. 因为该log是最新的.

Cluster membership changes

Figure11: Timeline for a configuration change. Dashed lines
show configuration entries that have been created but not
committed,and solidlinesshow thelatestcommittedconfigu-
ration entry. The leader first creates theC old,new configuration
entry in its log and commits it to C old,new (a majority of C old
and a majority of C new ). Then it creates the C new entry and
commits it to a majority of C new . There is no point in time in
whichC old and C new can both make decisions independently.

关键在于可能做决定的majority的切换.

gossip

gossip一致性算法和paxos/zab/raft 有较大差别, 不存在主节点, 没有线性化保障. 通过有限重试的播送, 确保变动在一段时间后, 可能同步到所有节点.

work flow

实现算法的要害在与, 如何将所有节点的gossip播送合并成一个状态, 即, 节点a能够申明节点a上有k1=v1, 节点b申明节点b上有k1=v2. 那么合并的状态是 a上有k1=v1, b上有k1=v2. 只有合并的算法是统一的, 如, 能够抉择并存, 优先a或优先b, 那么收到对应播送的节点, 状态可保持一致.
因为没有线性化保障, 播送的内容不能用差量, 而应该用全量.

Active thread (peer P):                 Passive thread (peer Q):(1) selectPeer(&Q);                     (1)(2) selectToSend(&bufs);                (2)(3) sendTo(Q, bufs);            ----->  (3) receiveFromAny(&P, &bufr);(4)                                     (4) selectToSend(&bufs);(5) receiveFrom(Q, &bufr);      <-----  (5) sendTo(P, bufs);(6) selectToKeep(cache, bufr);          (6) selectToKeep(cache, bufr);(7) processData(cache);                 (7) processData(cache)Figure 1: The general organization of a gossiping protocol.

分布式共识的要害

不论是paxos, zab, raft，实现一致性的外围原理都相似：

少数节点批准后才可提交．
批准后不可撤回．
这个过程是幂等的．

不论是节点解体后持续提交，还是由其它节点持续提交，会在少数节点达成共识.

参考

https://en.wikipedia.org/wiki...
http://lamport.azurewebsites....
https://stackoverflow.com/que...
http://www.cs.cornell.edu/cou...
https://raft.github.io/
https://web.stanford.edu/~ous...