Backend: How to Implement Paxos

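Paxos is a classic distributed consensus algorithm and serves as the standard example in all kinds of teaching material. Because it is so abstract, however, few people build a consensus library on plain Paxos. RAFT, by contrast, is the consensus algorithm most often implemented in industry; its paper, In Search of an Understandable Consensus Algorithm, is listed in the references at the end. By introducing a strong leader role, RAFT avoids many of the engineering difficulties of Paxos, and its log + state machine concept gives multi-node replication a clean, high-level abstraction.

I went the other way and chose to implement Paxos, mainly because:

- There are few open-source Paxos implementations. The algorithm is a classic, its definitions are highly abstract (a good fit for a general-purpose library), and it is a real challenge.
- Correctness does not depend on leader election, which suits fast switching of the writing node (leader preemption). In this implementation, with a single Paxos group and 3 nodes over local loopback with in-memory storage, concurrent writes on all 3 nodes reach 16k/s, and 43k/s with the 10 ms leader lease optimization (measured on a 2018 MBP13).
- It imposes few implementation restrictions and is highly extensible.

The implementation borrows concepts from RAFT as well as the implementation and architecture of phxpaxos. It implements the multi-paxos algorithm, with the main emphasis on thread safety and module abstraction. Networking, membership management, logging, snapshots and storage are plugged in through interfaces. The algorithm is event-driven and the library is header-only, which makes it easy to port and extend.

This article assumes the reader has some familiarity with the Paxos protocol. It does not dwell on the derivation and proof of Paxos or on basic concepts, and focuses on the engineering implementation. Readers interested in the derivation and proof can read the papers in the references.

What can you do with Paxos

Paxos is famous, but what can you actually do once you have a library? Most directly, you can build a distributed system on top of Paxos that offers:

- Strong consistency: every node is guaranteed to hold the same data, even when writes are issued concurrently on several nodes.
- High availability: a 3-node Paxos system, for example, tolerates the failure of any one node and keeps serving.

With the log + state machine built on Paxos, it is easy to implement stateful, highly available services such as a distributed KV store. Combined with snapshots and membership management, the service gains advanced features such as online migration and dynamically adding replicas. Tempted? Let's move on to the implementation.

Code

Talk is cheap, show me the code. The repository comes first: the zpaxos GitHub repository.

I like to write basic algorithm libraries directly as header files. They are then easy to reference and port to other projects, and the compiler can inline functions aggressively; the drawback is longer compile times. To keep extra dependencies down, the published code bundles only a logging library (spdlog, also header-only). The unit tests are fairly simple, and anyone interested is welcome to add more.

- Core algorithm directory
- Test code directory

Paxos basics

To avoid misunderstandings introduced by translation, the passages below are copied verbatim from Paxos Made Simple.

Goal of the algorithm

A consensus algorithm ensures that a single one among the proposed values is chosen.

- Only a value that has been proposed may be chosen,
- Only a single value is chosen, and
- A process never learns that a value has been chosen unless it actually has been.

The goal of the most basic consensus algorithm is to negotiate, among a set of peer nodes, one value that all of them agree on. The value must have been proposed by one of the nodes, and once it is decided every node can learn it.

Algorithm

Many articles already cover the derivation and proof of Paxos, so I will not repeat them here. The goal of this article is to build a Paxos library, so we concentrate on the code.

Phase 1. (prepare)

- A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.
- If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.

Phase 2. (accept)

- If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
- If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.

This two-round vote is the most basic flow. To carry out the vote, the entities in the description have to be implemented in code.

Base class Cbase

base.h defines the entities the algorithm needs: the ballot ballot_number_t, the value value_t, the acceptor state state_t, and the message passed between roles, message_t.

struct ballot_number_t final {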
    proposal_id_t proposal_id;
    node_id_t node_id;
};

struct value_t final {
    state_machine_id_t state_machine_id;
    utility::Cbuffer buffer;
};

struct state_t final {
    ballot_number_t promised, accepted;
    value_t value;
};

struct message_t final {
    enum type_e {
        noop = 0,
        prepare,
        prepare_promise,
        prepare_reject,
        accept,
        accept_accept,
        accept_reject,
        value_chosen,
        learn_ping,
        learn_pong,
        learn_request,
        learn_response
    } type;

    // Sender info.
    group_id_t group_id;
    instance_id_t instance_id;
    node_id_t node_id;

    /**
     * The following fields are optional.
     */

    // Used as the sequence number for replies.
    proposal_id_t proposal_id;

    ballot_number_t ballot;
    value_t value;

    // For learner data transfer.
    bool overload; // Used in ping & pong. Should be considered when sending a learn request.
    instance_id_t min_stored_instance_id; // Used in ping & pong.
    std::vector<learn_t> learn_batch;
    std::vector<Csnapshot::shared_ptr> snapshot_batch;
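};

base.h also defines Cbase, the base class of a node. It describes the state of the node, the current log instance id, locks and so on, and provides basic facilities for advancing the index, sending and receiving messages, checking membership, and storing messages. Part of Cbase is excerpted below.

template<class T>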
class Cbase {
    // Const info of instance.
    const node_id_t node_id_;
    const group_id_t group_id_;
    const write_options_t default_write_options_;

    std::mutex update_lock_;
    std::atomic<instance_id_t> instance_id_;

    Cstorage &storage_;
    Ccommunication &communication_;
    CpeerManager &peer_manager_;

    bool is_voter(const instance_id_t &instance_id);
    bool is_all_peer(const instance_id_t &instance_id, const std::set<node_id_t> &node_set);
    bool is_all_voter(const instance_id_t &instance_id, const std::set<node_id_t> &node_set);
    bool is_quorum(const instance_id_t &instance_id, const std::set<node_id_t> &node_set);

    int get_min_instance_id(instance_id_t &instance_id);
    int get_max_instance_id(instance_id_t &instance_id);
    void set_instance_id(instance_id_t &instance_id);

    bool get(const instance_id_t &instance_id, state_t &state);
    bool get(const instance_id_t &instance_id, std::vector<state_t> &states);
    bool put(const instance_id_t &instance_id, const state_t &state, bool &lag);
    bool next_instance(const instance_id_t &instance_id, const value_t &chosen_value);
    bool put_and_next_instance(const instance_id_t &instance_id, const state_t &state, bool &lag);
    bool put_and_next_instance(const instance_id_t &instance_id, const std::vector<state_t> &states, bool &lag);

    bool reset_min_instance(const instance_id_t &instance_id, const state_t &state);

    bool broadcast(const message_t &msg,
                   Ccommunication::broadcast_range_e range,
                   Ccommunication::broadcast_type_e type);
    bool send(const node_id_t &target_id, const message_t &msg);
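};

Proposer role: Cpropose

proposer.h implements the behaviour of the proposer in Paxos, including making proposals and handling the messages that acceptors send back.

on_prepare_reply handles the responses acceptors return for the prepare phase. Compared with the description in the paper, the message has to be checked carefully here. Once it is confirmed to belong to the current context, the sender is added to the response sets, and the majority rule then decides whether to give up or to move on to the accept phase.

response_set_.insert(msg.node_id);
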
if (message_t::prepare_promise == msg.type) {
    // Promise.
    promise_or_accept_set_.insert(msg.node_id);
    // Caution: this moves the value into a local variable; do not touch msg afterwards.
    update_ballot_and_value(std::forward<message_t>(msg));
} else {
    // Reject.
    reject_set_.insert(msg.node_id);
    has_rejected_ = true;
    record_other_proposal_id(msg);
}

if (base_.is_quorum(working_instance_id_, promise_or_accept_set_)) {
    // Prepare success.
    can_skip_prepare_ = true;
    accept(accept_msg);
} else if (base_.is_quorum(working_instance_id_, reject_set_) ||
           base_.is_all_voter(working_instance_id_, response_set_)) {
    // Prepare fail.
    state_ = proposer_idle;
    last_error_ = error_prepare_rejected;
    notify_idle = true;
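}

on_accept_reply handles the responses acceptors return for the accept phase. Following the paper, the majority rule decides whether the proposal has finally passed. If it has, the proposer enters the chosen flow and broadcasts the decided value.

response_set_.insert(msg.node_id);
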
if (message_t::accept_accept == msg.type) {
    // Accept.
    promise_or_accept_set_.insert(msg.node_id);
} else {
    // Reject.
    reject_set_.insert(msg.node_id);
    has_rejected_ = true;
    record_other_proposal_id(msg);
}

if (base_.is_quorum(working_instance_id_, promise_or_accept_set_)) {
    // Accept success.
    chosen(chosen_msg);
    chosen_value = value_;
} else if (base_.is_quorum(working_instance_id_, reject_set_) ||
           base_.is_all_voter(working_instance_id_, response_set_)) {
    // Accept fail.
    state_ = proposer_idle;
    last_error_ = error_accept_rejected;
    notify_idle = true;
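}

Acceptor role: Cacceptor

acceptor.h implements the behaviour of the acceptor in Paxos: it handles the proposer's requests, persists state, and advances the log instance id. Cacceptor has one more important duty: on initialization it loads the existing state, so that the promised state and the accepted value are preserved.

on_prepare handles a received prepare request. Based on the ballot number of the proposal, it decides which message to return and persists the promised state.

if (msg.ballot >= state_.promised) {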
    // Promise.
    response.type = message_t::prepare_promise;
    if (state_.accepted) {
        response.ballot = state_.accepted;
        response.value = state_.value;
    }

    state_.promised = msg.ballot;

    auto lag = false;
    if (!persist(lag)) {
        if (lag)
            return Cbase<T>::routine_error_lag;
        return Cbase<T>::routine_write_fail;
    }
} else {
    // Reject.
    response.type = message_t::prepare_reject;
    response.ballot = state_.promised;
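}

on_accept handles a received accept request. Based on its own state and the ballot number, the acceptor decides whether to update its current state or to reply with a rejection, and finally persists the resulting accepted state and value.

if (msg.ballot >= state_.promised) {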
    // Accept.
    response.type = message_t::accept_accept;

    state_.promised = msg.ballot;
    state_.accepted = msg.ballot;
    state_.value = std::move(msg.value); // Move value to local variable.

    auto lag = false;
    if (!persist(lag)) {
        if (lag)
            return Cbase<T>::routine_error_lag;
        return Cbase<T>::routine_write_fail;
    }
} else {
    // Reject.
    response.type = message_t::accept_reject;
    response.ballot = state_.promised;
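}

on_chosen handles the message, broadcast by the proposer, announcing that the value of an instance has been decided. After validation it advances the current log instance id, so that the node moves on to deciding the next value (the multi-paxos logic).

if (base_.next_instance(working_instance_id_, state_.value)) {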
    chosen_instance_id = working_instance_id_;
    chosen_value = state_.value;
} else
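    return Cbase<T>::routine_error_lag;

Advanced Paxos

Multi-Paxos

At this point we have implemented the basic functions of the two basic roles in the paper. It is also obvious that the two roles on their own are not much use: they can only decide one fixed value. This is where multi-paxos comes in. Deciding one value is of little use, but deciding a series of values, combined with a state machine, enables much richer functionality. This is the log instance id mentioned earlier, a u64 starting from 0.

typedef uint64_t instance_id_t; // [0, inf)

With it, a sequence of values is easy to build, each value decided with the Paxos algorithm. As shown below, instance_id_t starts at 0 and increases one by one, and the proposer decides the values in turn through the prepare & accept flow. If each value is an operation, a state machine gives us strongly consistent replication across nodes.

| instance_id_t | 0 | 1 | 2 | 3 | ... | inf |
| --- | --- | --- | --- | --- | --- | --- |
| value_t | a=1 | b=2 | b=a+1 | a=b+1 | ... | |
| Paxos | prepare, accept | prepare, accept | prepare, accept | prepare, accept | ... | |

It is easy to see that deciding each value takes at least 2 network round trips (the on_chosen message can be pipelined and adds no latency) plus 2 disk IOs, which is a considerable cost. The Paxos paper itself, however, also proposes the multi-paxos idea.

Key to the efficiency of this approach is that, in the Paxos consensus algorithm, the value to be proposed is not chosen until phase 2. Recall that, after completing phase 1 of the proposer’s algorithm, either the value to be proposed is determined or else the proposer is free to propose any value.

In short:

- A value need not be a single value; it can be a sequence of values (treat the sequence as one whole, one big value that takes several network transmissions). While the proposer id is reused, the phase 2 accept flow can be run many times to commit the sequence.
- This optimization breaks none of the assumptions or requirements of Paxos, so a leader is not mandatory for multi-paxos.
- The continuous flow can be interrupted at any time by a higher proposer id (think of it as a new value preempting and cutting off the previous transmission; the constraints on the earlier values still hold, the sequence is merely cut short).

The ideal case is then that a node acquires a proposer id, has it acknowledged, and keeps committing with accept. Deciding each value is reduced to 1 network round trip plus 1 disk IO, which is the optimal cost for replicating data across nodes.

| instance_id_t | 0 | 1 | 2 | 3 | ... | inf |
| --- | --- | --- | --- | --- | --- | --- |
| value_t | a=1 | b=2 | b=a+1 | a=b+1 | ... | |
| Paxos | prepare, accept | accept | accept | accept | ... | |

On top of this, the implementation adds mechanisms that skip unnecessary steps to improve performance:

- proposer.h uses can_skip_prepare and has_rejected to decide whether the prepare phase can be skipped, and to fall back to the two-phase flow after a rejection (some other node's proposer has taken a higher proposer id).
- Although several nodes competing to write causes no correctness problem, repeated preemption means that no node can sustain the consecutive-accept optimization for long. leader_manager.h is introduced for this: after an accept, the acceptor flatly rejects prepare requests from every other node for a period. The node whose accept succeeded keeps exclusive use of the acceptor for a while and, under heavy contention, can commit with consecutive accepts inside that time window.

Learner role

The learner is used to learn decided log instances quickly.

To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors. The obvious algorithm is to have each acceptor, whenever it accepts a proposal, respond to all learners, sending them the proposal. This allows learners to find out about a chosen value as soon as possible, but it requires each acceptor to respond to each learner—a number of responses equal to the product of the number of acceptors and the number of learners.

The method in the paper is to ask all acceptors and determine the value held by a majority. Here, the proposer instead broadcasts the proposer id through on_chosen, so that all other nodes know which value has been decided. This advances the log instance id quickly and also tells each node which values can be handed to the state machine for replay.

learner.h uses ping packets to find out the decided log instance id of each peer and picks a suitable peer to learn from quickly. In practice, depending on how far behind the node is and how far the log has been trimmed, it learns either from the log or from a snapshot.

Pluggable network, membership, log and state machine

According to Paxos Made Live, the difficulty of implementing Paxos correctly lies not only in the standard algorithm, but even more in its assumptions of reliable message transport and storage (non-Byzantine faults), accurate quorum decisions (membership change), and so on. The answer here is to separate these parts from the core algorithm through interfaces and leave them to more specialized people or libraries, while the library concentrates on the algorithm itself, its optimization and its scheduling (which makes the library stateless). This separation also lets Paxos run on top of existing storage and network systems, avoiding the redundancy and performance cost of additional storage or networking.

Everything that is an assumption rather than part of the Paxos algorithm is therefore plugged into the core through interfaces: storage, communication, membership management, state machine and snapshot. For testing, the code ships the simplest queue-based communication, which can simulate non-Byzantine faults such as random delay, reordering and packet loss, together with in-memory storage. The appendix at the end provides a RocksDB-based storage, a membership manager that supports changes (with its membership state machine and snapshot), and an asio-based mixed TCP & UDP communication system.

Fusing the roles of a single Paxos group: Cinstance

With proposer, acceptor and learner all in place, a managing object is needed to fuse them into a Paxos group. This is the class Cinstance in instance.h. It uses templates to get zero-overhead interfaces, avoiding the cost of virtual calls, and ties the three roles together with Cbase, which handles log instance advancement, communication, storage and the state machine. For the convenience of external callers, Cinstance also offers a blocking interface with flow control: given the timeout parameters, it submits a value to the Paxos group and returns on success or timeout. To decouple the roles from each other, every interface involved in role state transitions is exposed and handled in Cinstance as a callback. The interactions can then be handled in one code context, which keeps logic bugs to a minimum.

void synced_value_done(const instance_id_t &instance_id, const value_t &value);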

void synced_reset_instance(const instance_id_t &from, const instance_id_t &to);

Cbase::routine_status_e self_prepare(const message_t &msg);

Cbase::routine_status_e self_chosen(const message_t &msg);

void on_proposer_idle();

void on_proposer_final(const instance_id_t &instance_id, value_t &&value);

void value_chosen(const instance_id_t &instance_id, value_t &&value);

void value_learnt(const instance_id_t &instance_id, value_t &&value);

void learn_done();

Cbase::routine_status_e take_snapshots(const instance_id_t &peer_instance_id, std::vector<Csnapshot::shared_ptr> &snapshots);

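Cbase::routine_status_e load_snapshots(const std::vector<Csnapshot::shared_ptr> &snapshots);

Multithreading

This part is mostly engineering work, so only the basic ideas are given here; see the code for details.

- Paxos cleanly separates the roles. Apart from log instance advancement, which must proceed in strict order, each role can work on any log instance id, and the state machine inside each role is protected by a lock.
- The global lock is held only briefly, while persisting and advancing the log instance id, to keep the number of serialization points as small as possible. An atomic variable lets a role check quickly whether it has fallen behind.
- The model is fully event-driven (including timeouts and state changes).
- Timeout and work queue: timer_work_queue.h
- Resettable timeout mechanism: resetable_timeout_notify.h

Log + state machine + snapshot (log compaction)

The sequence of values is now in place; all that is missing for a complete stateful application is the state machine. RAFT already describes this in full, and the same log + state machine design is adopted here. To let the learner catch up quickly, a snapshot interface is provided as well. With snapshots, the complete log no longer needs to be kept: a snapshot brings a node directly to the corresponding log instance id. Log, state machine and snapshot are likewise implemented as interfaces; see the excerpt from state_machine.h in this section. The interfaces reserve a number of auxiliary operations that make non-blocking snapshot capture and application easier to implement.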
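
Before the real interface, here is a minimal sketch of the log + state machine + snapshot idea. It does not use the zpaxos types: ToyKvStateMachine, ToySnapshot, toy_instance_id_t and the "key=value" encoding are all hypothetical names invented for illustration. It only shows the three behaviours the text relies on: applying values in instance id order, ignoring duplicates caused by replay, and catching up from a snapshot instead of the full log.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins; the real zpaxos interfaces (Csnapshot, CstateMachine) differ.
using toy_instance_id_t = std::uint64_t;

struct ToySnapshot {
    toy_instance_id_t next_instance_id;        // state covers ids [0, next_instance_id)
    std::map<std::string, std::string> data;
};

class ToyKvStateMachine {
public:
    // Apply the value chosen for instance_id. Values are assumed to look like "key=value".
    void consume_sequentially(toy_instance_id_t instance_id, const std::string &value) {
        if (instance_id < next_execute_id_)
            return;  // Duplicate caused by replay: already applied, ignore it.
        if (instance_id > next_execute_id_)
            throw std::runtime_error("gap in log: unexpected instance id");
        const auto pos = value.find('=');
        if (pos != std::string::npos)
            data_[value.substr(0, pos)] = value.substr(pos + 1);
        ++next_execute_id_;  // Values without '=' still advance the id.
    }

    toy_instance_id_t get_next_execute_id() const { return next_execute_id_; }

    ToySnapshot take_snapshot() const {
        ToySnapshot snapshot;
        snapshot.next_instance_id = next_execute_id_;
        snapshot.data = data_;
        return snapshot;
    }

    void load_snapshot(const ToySnapshot &snapshot) {
        if (snapshot.next_instance_id < next_execute_id_)
            return;  // Never move backwards.
        data_ = snapshot.data;
        next_execute_id_ = snapshot.next_instance_id;
    }

    std::string get(const std::string &key) const {
        const auto it = data_.find(key);
        return it == data_.end() ? std::string() : it->second;
    }

private:
    toy_instance_id_t next_execute_id_ = 0;
    std::map<std::string, std::string> data_;
};

// Usage sketch: replay on one node, then bring a second node up from a snapshot.
inline void toy_state_machine_demo() {
    ToyKvStateMachine a;
    a.consume_sequentially(0, "a=1");
    a.consume_sequentially(1, "b=2");
    a.consume_sequentially(1, "b=999");  // replayed id, ignored
    ToyKvStateMachine b;
    b.load_snapshot(a.take_snapshot());
    assert(b.get("b") == "2" && b.get_next_execute_id() == 2);
}
```

The second node never sees instances 0 and 1; it starts from the snapshot and continues at instance 2, which is why the full log does not have to be retained.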

class Csnapshot {
public:
    // Get the global state machine id that identifies this object; it must be **unique**.
    virtual state_machine_id_t get_id() const = 0;

    // The snapshot represents the state machine after running up to this id (not included).
    virtual const instance_id_t &get_next_instance_id() const = 0;
};

class CstateMachine {
public:
    // Get the global state machine id that identifies this object; it must be **unique**.
    virtual state_machine_id_t get_id() const = 0;

    // Invoked sequentially by instance id.
    // The callback must be able to handle duplicate ids caused by replay,
    // and should throw an exception on an unexpected instance id.
    // If the chosen value of the instance is not for this SM, an empty value is given.
    virtual void consume_sequentially(const instance_id_t &instance_id, const utility::Cslice &value) = 0;

    // Overload taking a buffer that can be moved from.
    virtual void consume_sequentially(const instance_id_t &instance_id, utility::Cbuffer &&value) {
        consume_sequentially(instance_id, value.slice());
    }

    // Return the instance id that will be executed next.
    // May be smaller than the actual value only while consuming concurrently, but **never** larger.
    virtual instance_id_t get_next_execute_id() = 0;

    // Return the smallest instance id that has not been persisted.
    // May be smaller than the actual value only while consuming concurrently, but **never** larger.
    virtual instance_id_t get_next_persist_id() = 0;

    // get_next_instance_id() of the returned snapshot should be >= get_next_persist_id().
    virtual int take_snapshot(Csnapshot::shared_ptr &snapshot) = 0;

    // After a successful load, next_persist_id should be >= the snapshot's next instance id.
    virtual int load_snapshot(const Csnapshot::shared_ptr &snapshot) = 0;
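};

To support more advanced functionality, the algorithm also provides two sets of value-chosen callbacks. One, synced_value_done, is invoked inside the critical section in which the log instance id advances; the other, value_chosen, is asynchronous. They suit, respectively, state control that is tightly coupled to the log instance id (membership management, for example, discussed later) and ordinary state machines. The asynchronous callback runs outside the critical section. It occupies an event-driven thread but does not affect the overall throughput of Paxos, and a synchronized queue, CstateMachineBase, guarantees that log entries are applied in order.

Membership change

By now most of the requirements for a distributed consensus library are met, but one common and important requirement remains: dynamic membership management in a production-grade consensus library. There are several main ways to implement it:

- Stop the service and change the configuration file by hand.
- RAFT's approach, joint consensus (two-phase approach):
  - Log entries are replicated to all servers in both configurations.
  - Any server from either configuration may serve as leader.
  - Agreement (for elections and entry commitment) requires separate majorities from both the old and new configurations.
- Single-step membership change, which turns membership management into a consensus problem that Paxos itself solves (the method used by this library).

RAFT does not use single-step change because it can produce several non-intersecting quorums during the intermediate state. Take the following scenario: node C is to be replaced by node D. At log instance id 3, because of delays or similar causes, nodes A and C have not yet applied the membership change and still believe the members are ABC, so AC forms a quorum and accepts a value. Nodes B and D, which already know the latest membership ABD, can also form a quorum and accept a different value. This breaks the Paxos algorithm.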
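
The scenario can be checked with a few lines of code. This is only an illustrative sketch: toy_node_set, toy_is_quorum and toy_intersects are hypothetical helpers, not part of the library, and a quorum is assumed to mean a strict majority of the member set a node currently believes in.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

using toy_node_set = std::set<std::string>;

// True if `nodes` contains a strict majority of `members`.
inline bool toy_is_quorum(const toy_node_set &members, const toy_node_set &nodes) {
    std::size_t hits = 0;
    for (const auto &n : nodes)
        if (members.count(n)) ++hits;
    return hits * 2 > members.size();
}

// True if the two sets share at least one node.
inline bool toy_intersects(const toy_node_set &a, const toy_node_set &b) {
    for (const auto &n : a)
        if (b.count(n)) return true;
    return false;
}

// Replacing C with D in one step: two nodes still use the old view, two use the new one.
inline void toy_split_quorum_demo() {
    const toy_node_set old_members{"A", "B", "C"};
    const toy_node_set new_members{"A", "B", "D"};
    const toy_node_set ac{"A", "C"};
    const toy_node_set bd{"B", "D"};
    assert(toy_is_quorum(old_members, ac));   // AC is a majority of ABC
    assert(toy_is_quorum(new_members, bd));   // BD is a majority of ABD
    assert(!toy_intersects(ac, bd));          // yet they share no node
}
```

Within a single member set any two majorities must overlap, which is what Paxos relies on. The two quorums above are majorities of different member sets, so nothing forces them to overlap, and each can accept its own value for the same instance.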

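The essence of the problem is that, while the consensus algorithm runs, membership does not change atomically; intermediate states exist across nodes. The fix is to put membership change operations into the log, replay them on every node through a state machine, and use multi-version membership control to select the correct member set for each log instance id. Membership change is then integrated into the Paxos algorithm and appears as an atomic change.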
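
A minimal sketch of multi-version membership control follows. ToyVersionedPeers and its methods are hypothetical and much simpler than the dynamic_peer_manager.h mentioned at the end of the article. The sketch assumes that a membership change chosen at instance id k takes effect from instance k + 1, which is safe only because this library decides instances strictly one after another.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <set>
#include <utility>

using toy_instance_id_t = std::uint64_t;
using toy_node_id_t = std::uint64_t;

class ToyVersionedPeers {
public:
    explicit ToyVersionedPeers(std::set<toy_node_id_t> initial) {
        versions_[0] = std::move(initial);  // version valid from instance 0
    }

    // Called when the membership-change entry chosen at `change_id` is replayed:
    // the new member set applies to every later instance (assumption of this sketch).
    void apply_change(toy_instance_id_t change_id, std::set<toy_node_id_t> members) {
        versions_[change_id + 1] = std::move(members);
    }

    // Member set in effect for the given instance.
    const std::set<toy_node_id_t> &members_at(toy_instance_id_t instance_id) const {
        const auto it = versions_.upper_bound(instance_id);  // first version starting later
        return std::prev(it)->second;                        // key 0 always exists
    }

    // Strict majority of the member set that is valid for this instance.
    bool is_quorum(toy_instance_id_t instance_id, const std::set<toy_node_id_t> &nodes) const {
        const auto &members = members_at(instance_id);
        std::size_t hits = 0;
        for (const auto n : nodes) hits += members.count(n);
        return hits * 2 > members.size();
    }

private:
    std::map<toy_instance_id_t, std::set<toy_node_id_t>> versions_;
};

// Usage sketch with A=1, B=2, C=3, D=4: the entry at instance 2 replaces C with D.
inline void toy_versioned_peers_demo() {
    const std::set<toy_node_id_t> abc{1, 2, 3}, abd{1, 2, 4}, ac{1, 3}, bd{2, 4};
    ToyVersionedPeers peers(abc);
    peers.apply_change(2, abd);
    assert(peers.is_quorum(2, ac));    // instance 2 is still decided by ABC
    assert(!peers.is_quorum(3, ac));   // instance 3 is decided by ABD only
    assert(peers.is_quorum(3, bd));
}
```

Because the member set is looked up by instance id rather than by "whatever this node currently believes", every node that works on instance 3 uses the same set, and the disjoint quorums of the single-step scenario cannot occur.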

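As you may have noticed, the communication and membership management interfaces also take the group id (for multiple Paxos groups) and the log instance id as parameters (the communication interface gets them from the message). This makes it easier for an implementation to support dynamic membership change.

class Ccommunication {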
public:
    virtual int send(const node_id_t &target_id, const message_t &message) = 0;

    enum broadcast_range_e {
        broadcast_voter = 0,
        broadcast_follower,
        broadcast_all,
    };

    enum broadcast_type_e {
        broadcast_self_first = 0,
        broadcast_self_last,
        broadcast_no_self,
    };

    virtual int broadcast(const message_t &message, broadcast_range_e range, broadcast_type_e type) = 0;
};

class CpeerManager {
public:
    virtual bool is_voter(const group_id_t &group_id, const instance_id_t &instance_id,
                          const node_id_t &node_id) = 0;

    virtual bool is_all_peer(const group_id_t &group_id, const instance_id_t &instance_id,
                             const std::set<node_id_t> &node_set) = 0;

    virtual bool is_all_voter(const group_id_t &group_id, const instance_id_t &instance_id,
                              const std::set<node_id_t> &node_set) = 0;

    virtual bool is_quorum(const group_id_t &group_id, const instance_id_t &instance_id,
                           const std::set<node_id_t> &node_set) = 0;
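};

Summary

A complete, modular Paxos library is now in place. It delivers most of the capabilities we hoped for and is highly extensible. There were trade-offs, of course. The library implements only a single Paxos group, which decides values strictly one at a time. This gives it fast leader preemption at the cost of pipelining (the holes left by fast preemption under pipelining are very unfriendly to state machine implementations). Pipelining can still be achieved with multiple groups, without much difference in efficiency. Further optimizations, such as mixed persistence of the stored log and the state machine, or message grouping (batching), can be built freely on the provided interfaces.

Some extension code is provided as reference examples:

- RocksDB-based storage: rocks_storage.h
- ASIO-based TCP & UDP communication: asio_network.h
- Dynamic membership management based on a state machine + MVCC: dynamic_peer_manager.h

References

- Paxos Made Simple
- Paxos Made Live – An Engineering Perspective
- In Search of an Understandable Consensus Algorithm
- phxpaxos wiki
- PolarDB-X consensus protocol (X-Paxos)
- Database Architecture Notes (2): High Availability and Consistency
- phxpaxos

Original link: http://click.aliyun.com/m/100…

This article is original content from Alibaba Cloud and may not be reproduced without permission.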
