Understanding Paxos
Asynchronous Fault-Tolerant Consensus
Paul Krzyzanowski
November 1, 2018
The Consensus Problem
Suppose you have a collection of computers and want them all to agree on something. This is what consensus is about; consensus means agreement.
Consensus comes up frequently in distributed systems design. There are various reasons why we want it: to agree on who gets access to a resource (mutual exclusion), agree on who is in charge (elections), or to agree on a common ordering of events among a collection of computers (e.g., what action to take next, or state machine replication).
Replication may be the most common use of consensus. We set up collections (clusters) of servers, all of which hold replicated content. This provides fault tolerance: if any server dies, others are still running. It also provides scale: clients can read data from any available server, although writing data will generally require the participation of all servers and will not scale as well. For many applications, though, reads far outnumber writes. To keep the content replicated, we have two design choices. We can funnel all write requests through one coordinator, which will ensure in-order delivery to all replicas, or we can send updates to any server and have that server coordinate with its replicas to ensure that all updates are applied in the same order on each replica. In the first case, we need consensus to elect that coordinator. In the second case, we need to run a consensus algorithm for each update to ensure that everyone agrees on the order.
The consensus problem can be stated in a basic, generic manner: One or more systems may propose some value. How do we get a collection of computers to agree on exactly one of those proposed values?
The formal properties for asynchronous consensus are:
- Validity: only the proposed values can be decided. If a process decides on some value, _v_, then some process must have proposed _v_.
- Uniform Agreement: no two correct processes (those that do not crash) can decide on different values.
- Integrity: each process can decide a value at most once.
- Termination: all processes will eventually decide on a result.
Paxos
Paxos is an algorithm that is used to achieve consensus among a distributed set of computers that communicate via an asynchronous network. One or more clients propose a value to Paxos, and we have consensus when a majority of the systems running Paxos agree on one of the proposed values. Paxos is widely used and is legendary in computer science since it is the first consensus algorithm that has been rigorously proved to be correct.
Paxos simply selects a single value from one or more values that are proposed to it and lets everyone know what that value is. A run of the Paxos protocol results in the selection of a single proposed value. If you need to use Paxos to create a replicated log (for a replicated state machine, for example), then you need to run Paxos repeatedly. This is called multi-Paxos. There are some optimizations that could be implemented for multi-Paxos but we will not discuss those here.
Paxos provides abortable consensus. This means that some processes abort the consensus if there is contention while others decide on the value. Those processes that decide have to agree on the same value. Aborting allows a process to terminate rather than be blocked indefinitely. When a client proposes a value to Paxos, it is possible that the proposed value might fail if there was a competing concurrent proposal that won. The client will then have to propose the value again to another run of the Paxos algorithm.
Our assumptions for the algorithm are:
- Concurrent proposals: One or more systems may propose a value concurrently. If only one system would propose a value then it is clear what the consensus would be. With multiple systems, we need to select one from among those values.
- Validity: The chosen value that is agreed upon must be one of the proposed values. The servers cannot just choose a random number.
- Majority rule: Once a majority of Paxos servers agrees on one of the proposed values, we have consensus on that value. This also implies that a majority of servers need to be functioning for the algorithm to run. To survive m failures, we will need 2m+1 systems.
- Asynchronous network: The network is unreliable and asynchronous: messages may get lost or arbitrarily delayed. The network may also get partitioned.
- Fail-stop faults: Systems may exhibit fail-stop faults. They may restart but need to remember their previous state to make sure they do not change their mind. Failures are not Byzantine.
- Unicasts: Communication is point-to-point. There is no mechanism to multicast a message atomically to the set of Paxos servers.
- Announcement: Once consensus is reached, the results can be made known to everyone.
What we need in a distributed consensus protocol
We have an environment of multiple systems (nodes), connected by a network, where one or more of these nodes may concurrently propose a value (e.g., perform an operation on a server, select a coordinator, add to a log, whatever…). We need a protocol to choose exactly one value in cases where multiple competing values may be proposed.
We will call the processes that are proposing values proposers. We will also have processes called acceptors that will accept these values and help figure out which one value will be chosen.
Multiple proposers, single acceptor
The simplest attempt to design a consensus protocol will use a single acceptor. We can elect one of our systems to take on this role. Multiple proposers will concurrently propose various values to a single acceptor. The acceptor chooses one of these proposed values. Unfortunately, this does not handle the case where the acceptor crashes. If it crashes after choosing a value, we will not know what value has been chosen and will have to wait for the acceptor to restart. We want to design a system that will be functional if the majority of nodes are running.
Fault tolerance: multiple acceptors
To achieve fault tolerance, we will use a collection of acceptors. Each proposal will be sent to at least a majority of these acceptors. If a quorum, or majority, of these acceptors chooses a particular proposed value (proposal) then that value will be considered chosen. Even if some acceptors crash, others are up so we will still know the outcome. If an acceptor crashes after it accepted a proposal, other acceptors know that a value was chosen.
If an acceptor simply accepts the first proposed value it receives and cannot change its mind (i.e., it chooses that value), it is possible that we may have no majority of accepted values. Depending on the order in which messages arrive at various acceptors, each acceptor may choose a different value or small groups of acceptors may choose different values such that there is no majority of acceptors that chose the same proposed values. This tells us that acceptors may need to change their mind about which accepted value to choose. We want a value to be chosen only when a majority of acceptors accept that value.
Since we know that an acceptor may need to change its mind, we might consider just having an acceptor accept any value given by any proposer. A majority of acceptors may accept one value, which means that value is chosen. However, another proposer may also tell a majority of acceptors to accept a different value. Some acceptors will receive both of these proposals, because any two majorities must have at least one acceptor in common. That means at least one acceptor had to change its mind about what value is ultimately chosen, violating the integrity property. Once we choose a value, there is no going back.
To fix this, instead of just having a proposer propose a value, we will first ask it to contact a majority of acceptors and check whether a value has already been chosen. If it has, then the proposer must propose that chosen value to the other acceptors. To implement this, we will need a two phase protocol: check first and then provide a value.
Asking a proposer to check is not sufficient. Another proposer may come along after the checking phase and propose a different value. What if that second value is accepted by a majority and then the acceptors receive requests from the first proposer to accept its value? We again end up with two chosen values. The protocol needs to be modified to ensure that once we accept a value, we will abort competing proposals. To do this, Paxos will impose an ordering on proposals. Newer proposals will take precedence over older ones. If that first proposer tries to finish its protocol, its requests will fail.
How Paxos works
The cast of characters
Paxos has three entities:
- Proposers: Receive requests (values) from clients and try to convince acceptors to accept their proposed values.
- Acceptors: Accept certain proposed values from proposers and let proposers know if something else was accepted. A response from an acceptor represents a vote for a particular proposal.
- Learners: Announce the outcome.
In practice, a single node may run proposer, acceptor, and learner roles. It is common for Paxos to coexist with the service that requires consensus (e.g., distributed storage) on a set of replicated servers, with each server taking on all three roles rather than using separate servers dedicated to Paxos. For the sake of discussing the protocol, however, we consider these to be independent entities.
What Paxos does
A client sends a request to any Paxos proposer. The proposer then runs a two-phase protocol with the acceptors. Paxos is a majority-wins protocol. A majority avoids split-brain problems: if you make a proposal and ask over 50% of the systems whether anyone else has made a proposal, and they all reply no, then you know for certain that no other proposer could have received the same answer from its own majority, since any two majorities share at least one system. Because of this, Paxos requires a majority of its servers to be running for the algorithm to terminate. A majority also ensures that there is at least one node in common from one majority to another if servers die and restart. The system requires 2m+1 servers to tolerate the failure of m servers. As we shall see, Paxos requires a majority of acceptors; we can have one or a smaller number of proposers and learners.
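As a quick arithmetic check, the following Python sketch (illustrative only; the function names are assumptions) captures the relationship between cluster size, tolerated failures, and quorum size:

    # Illustrative sketch: quorum sizing for a Paxos cluster.
    def cluster_size(m):
        # To tolerate m failed servers we need 2m + 1 servers,
        # so that a majority of m + 1 remains available.
        return 2 * m + 1

    def quorum(n):
        # Smallest majority of n servers.
        return n // 2 + 1

    assert cluster_size(2) == 5
    assert quorum(5) == 3   # any two groups of 3 out of 5 servers share at least one server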
Paxos acceptors cannot forget what they accepted, so they need to keep track of the information they received from proposers by writing it to stable storage. This is storage such as flash memory or disk, whose contents can be retrieved even if the process or system is restarted.
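As a rough illustration of what remembering state means here, this Python sketch persists an acceptor's state with an atomic write; the file name and field names are assumptions, not part of Paxos itself:

    import json
    import os

    STATE_FILE = "acceptor_state.json"   # illustrative path

    def save_state(max_id, accepted_id, accepted_value):
        # Write to a temporary file, flush to disk, then atomically rename so the
        # state file is never left half-written if the acceptor crashes mid-save.
        tmp = STATE_FILE + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"max_id": max_id,
                       "accepted_id": accepted_id,
                       "accepted_value": accepted_value}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, STATE_FILE)

    def load_state():
        # A fresh acceptor starts with no promises and nothing accepted.
        if not os.path.exists(STATE_FILE):
            return {"max_id": 0, "accepted_id": None, "accepted_value": None}
        with open(STATE_FILE) as f:
            return json.load(f)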
The Paxos protocol (initial version)
Paxos is a two-phase protocol, meaning that the proposers interact with the acceptors twice. At a high level:
Phase 1
A proposer asks all the working acceptors whether anyone has already received a proposal. If the answer is no, it proposes a value.
Phase 2
If a majority of acceptors agree to this value then that is our consensus.
Let’s look at the protocol in a bit more detail now. Right now, we will examine the mostly failure-free case. Later, we will augment the algorithm to correct for other failures.
When a proposer receives a client request to reach consensus on a value, the proposer must create a proposal number. This number must have two properties:
- It must be unique. No two proposers can come up with the same number. An easy way of doing this is to use a global process identifier for the least significant bits of the number. For example, instead of ID=12, node 3 will generate ID=12.3 and node 2 will generate ID=12.2.
- It must be bigger than any identifier previously used in the cluster. A proposer may use an incrementing counter or a nanosecond-level timestamp to achieve this. If the number is not bigger than one previously used, the proposer will find out by having its proposal rejected and will have to try again. One way of generating such identifiers is sketched after this list.
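Here is one way (a sketch, with assumed names) to generate proposal numbers that satisfy both properties, by keeping the node ID in the least significant position of the number:

    class ProposalIdGenerator:
        # Illustrative sketch: unique, increasing proposal IDs.
        def __init__(self, node_id, num_nodes):
            self.node_id = node_id       # distinct per node, 0 <= node_id < num_nodes
            self.num_nodes = num_nodes
            self.round = 0               # would be kept on stable storage in practice

        def next_id(self):
            # Encoding the node ID in the low-order position guarantees uniqueness;
            # incrementing the round makes each new ID larger than our previous ones.
            self.round += 1
            return self.round * self.num_nodes + self.node_id

        def observe(self, rejecting_id):
            # After a rejection, skip ahead of the highest ID seen so the next
            # proposal number is larger than anything used so far in the cluster.
            self.round = max(self.round, rejecting_id // self.num_nodes)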
Phase 1: PREPARE-PROMISE
A proposer receives a consensus request for a VALUE from a client. It creates a unique proposal number, ID, and sends a PREPARE(ID) message to at least a majority of acceptors.
Each acceptor that receives the PREPARE message looks at the ID in the message and decides:
Is this ID bigger than any round I have previously received?
If yes
store the ID number, max_id = ID
respond with a PROMISE message
If no
do not respond (or respond with a "fail" message)
If the proposer receives a PROMISE response from a majority of acceptors, it now knows that a majority of them are willing to participate in this proposal. The proposer can now proceed with getting consensus. Each of these acceptors made a promise that no other proposal with a smaller number can make it to consensus.
Phase 2: PROPOSE-ACCEPT
If a proposer received a PROMISE message from the majority of acceptors, it now has to tell the acceptors to accept that proposal. If not, it has to start over with another round of Paxos.
In this phase of the protocol, the proposer tells all the acceptors (that are live) what value to accept. It sends:
PROPOSE(ID, VALUE)
to a majority or all of the acceptors. Each acceptor now decides whether to accept the proposal. It accepts the proposal if the ID number of the proposal is still the largest one that it has seen. Recall that an acceptor promised not to accept proposals from PREPARE messages with smaller numbers but can and will accept proposals with higher numbers. The logic the acceptor uses is:
is the ID the largest I have seen so far (ID == max_id)?
if yes
reply with an ACCEPTED message & send ACCEPTED(ID, VALUE) to all learners
if no
do not respond (or respond with a "fail" message)
The ACCEPTED message can be sent back to the proposer as well as to the learners so they can propagate the value to wherever its action is needed (e.g., append to a log, modify a replicated database, …). When the proposer or learner receives a majority of accept messages then it knows that consensus has been reached on the value.
To summarize, in the first phase, the proposer finds out that no promises have been made to higher numbered proposals. In the second phase, the proposer asks the acceptors to accept the proposal with a specific value. As long as no higher numbered proposals have arrived during this time, the acceptor responds back that the proposal has been accepted.
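A learner only needs to tally ACCEPTED messages until some (ID, VALUE) pair has a majority. The following Python sketch shows that tally; the class and message names are assumptions for illustration:

    class Learner:
        # Illustrative sketch: declare consensus once a majority of acceptors
        # report ACCEPTED for the same (ID, VALUE) pair.
        def __init__(self, num_acceptors):
            self.num_acceptors = num_acceptors
            self.votes = {}          # (proposal_id, value) -> set of acceptor names
            self.chosen = None

        def on_accepted(self, acceptor, proposal_id, value):
            voters = self.votes.setdefault((proposal_id, value), set())
            voters.add(acceptor)     # a set ignores duplicate messages from one acceptor
            if len(voters) > self.num_acceptors // 2:
                self.chosen = value  # majority reached: this value is the consensus
            return self.chosen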
Handling failures and fixing the protocol
Suppose a proposer sends a PREPARE message with a lower ID than was previously promised by the majority of acceptors. Some acceptors, those that did not receive the earlier higher ID, might reply with a PROMISE. But a majority will not and the proposer will get fail messages (negative acknowledgements) or simply time out waiting and have to retry with a higher ID.
On the other hand, suppose some proposer now sends a PREPARE message with a higher ID (higher_ID). The acceptors only promised to ignore messages with lower numbers. We need to modify what the acceptor does in Phase 1.
If the acceptor has not yet accepted any proposal (that is, it responded with a PROMISE to a past proposal but never sent an ACCEPTED), it will simply respond back to the proposer with a PROMISE. However, if the acceptor has already accepted an earlier proposal, it responds to the proposer with a PROMISE that contains the accepted ID and its corresponding value.
The acceptor needs to keep track of proposals that it has accepted (in phase 2). If it already sent an ACCEPTED message, it cannot change its mind, but it will inform the proposer that it already accepted an earlier proposal. The logic for the acceptor in the first phase is now:
acceptor receives a PREPARE(ID) message:
is this ID bigger than any round I have previously received?
if yes
store the ID number, max_id = ID
respond with a PROMISE(ID) message
if no
did I already accept a proposal?
if yes
respond with a PROMISE(ID, accepted_ID, accepted_VALUE) message
if no
do not respond (or respond with a "fail" message)
When the proposer receives responses from acceptors at the end of phase 1, it needs to get a majority of responses for its proposal ID to continue with the protocol. It also has to look through each of those responses to see if there were any accepted proposals. If there were none, then the proposer is free to propose its own value. If there were, then the proposer is obligated to pick the value corresponding to the highest-numbered accepted proposal. This is how other acceptors get to find out about accepted proposals that they may have missed. The logic at the proposer at the start of phase 2 is now:
proposer receives PROMISE(ID, [VALUE]) messages:
do I have PROMISE responses from a majority of acceptors?
if yes
do any responses contain accepted values (from other proposals)?
if yes
pick the value with the highest accepted ID
send PROPOSE(ID, accepted_VALUE) to at least a majority of acceptors
if no
we can use our proposed value
send PROPOSE(ID, VALUE) to at least a majority of acceptors
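The value-selection rule above is small enough to write out directly. In this Python sketch (the data shapes are assumptions), each PROMISE is represented as a pair (accepted_ID, accepted_VALUE), with (None, None) meaning the acceptor had not accepted anything yet:

    def value_to_propose(promises, my_value, num_acceptors):
        # Abort (return None) unless a majority of acceptors sent a PROMISE.
        if len(promises) <= num_acceptors // 2:
            return None
        accepted = [(aid, aval) for (aid, aval) in promises if aid is not None]
        if accepted:
            # Obligated to propose the value of the highest-numbered accepted proposal.
            return max(accepted)[1]
        return my_value              # nothing accepted yet: free to propose our own value

    # Example: two of three acceptors promised, and one already accepted (4, "x"),
    # so a new proposal must carry "x" rather than our own value "y".
    assert value_to_propose([(None, None), (4, "x")], "y", 3) == "x"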
Our full Paxos protocol now looks like this:
Phase 1a: Proposer (PREPARE)
A proposer initiates a PREPARE message, picking a unique, ever-incrementing value.
ID = cnt++;
send PREPARE(ID)
Phase 1b: Acceptor (PROMISE)
An acceptor receives a PREPARE(ID) message:
if (ID <= max_id)
do not respond (or respond with a "fail" message)
else
max_id = ID // save highest ID we've seen so far
if (proposal_accepted == true) // was a proposal already accepted?
respond: PROMISE(ID, accepted_ID, accepted_VALUE)
else
respond: PROMISE(ID)
Phase 2a: Proposer (PROPOSE)
The proposer now checks to see if it can use its proposal or if it has to use the highest-numbered one it received from among all responses:
did I receive PROMISE responses from a majority of acceptors?
if yes
do any responses contain accepted values (from other proposals)?
if yes
val = accepted_VALUE // value from PROMISE message with the highest accepted ID
if no
val = VALUE // we can use our proposed value
send PROPOSE(ID, val) to at least a majority of acceptors
Phase 2b: Acceptor (ACCEPT)
Each acceptor receives a PROPOSE(ID, VALUE) message from a proposer. If the ID is the highest number it has processed then accept the proposal and propagate the value to the proposer and to all the learners.
if (ID == max_id) // is the ID the largest I have seen so far?
proposal_accepted = true // note that we accepted a proposal
accepted_ID = ID // save the accepted proposal number
accepted_VALUE = VALUE // save the accepted proposal data
respond: ACCEPTED(ID, VALUE) to the proposer and all learners
else
do not respond (or respond with a "fail" message)
If a majority of acceptors accept (ID, VALUE), then consensus is reached on that value. Consensus is on the value, not necessarily the ID.
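The acceptor's side of both phases can be summarized in a short Python sketch. Message passing, stable storage, and learner notification are omitted, and the return values are illustrative stand-ins for the messages described above:

    class Acceptor:
        def __init__(self):
            self.max_id = 0                  # highest PREPARE ID promised so far
            self.proposal_accepted = False
            self.accepted_id = None
            self.accepted_value = None

        def on_prepare(self, proposal_id):
            # Phase 1b: promise to ignore anything numbered lower than proposal_id.
            if proposal_id <= self.max_id:
                return ("fail", proposal_id)
            self.max_id = proposal_id        # persisted to stable storage in practice
            if self.proposal_accepted:
                return ("promise", proposal_id, self.accepted_id, self.accepted_value)
            return ("promise", proposal_id)

        def on_propose(self, proposal_id, value):
            # Phase 2b: accept only if no higher-numbered PREPARE arrived in between.
            if proposal_id == self.max_id:
                self.proposal_accepted = True
                self.accepted_id = proposal_id
                self.accepted_value = value
                return ("accepted", proposal_id, value)   # also sent to all learners
            return ("fail", proposal_id)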
Failure examples
Acceptor fails in phase 1
Suppose an acceptor fails during phase 1. That means it will not return a PROMISE message. As long as the proposer still gets responses from a majority of acceptors, the protocol can continue to make progress.
Acceptor fails in phase 2
Suppose an acceptor fails during phase 2. That means it will not be able to send back an ACCEPTED message. This is also not a problem as long as enough of the acceptors are still alive and will respond so that the proposer or learner receives responses from a majority of acceptors.
Proposer fails in the Prepare phase
If the proposer fails before it sent any messages, then it is the same as if it did not run at all.
What if the proposer fails after sending one or more PREPARE messages? Acceptors would have sent PROMISE responses back, but no PROPOSE messages would follow, so there would be no consensus. Some other node will eventually run its own instance of Paxos as a proposer, picking its own ID. If that higher ID number works, then the algorithm runs to completion. Otherwise, the proposer will have its request rejected and will have to pick a higher ID number.
What if the proposer fails during the accept phase? At least one PROPOSE message was sent. Some other node then starts a new proposal with PREPARE(higher-ID). The acceptor responds by telling that proposer that an earlier proposal was already accepted:
PROMISE(higher-ID, <old_ID, Value>)
If a proposer gets any PROMISE responses with a value then it must choose the response with the highest accepted ID and change its own value. It sends out:
PROPOSE(higher-ID, Value)
If a proposer fails in the accept phase, any proposer that takes over finishes the job of propagating the old value that was accepted.
Suppose a majority of acceptors receive PREPARE messages for some ID followed by PROPOSE messages. That means a majority of acceptors accepted a value for that ID. No PREPARE messages with lower IDs can now succeed at a majority: that would require promises from a majority of acceptors for the lower-numbered ID, but those acceptors already made promises for the higher-numbered ID. No PROPOSE messages with higher IDs and different values will be accepted by a majority either: in any majority, at least one acceptor will know the ID and corresponding value that it accepted, and it will report that back to the proposer. You can have proposals with higher IDs, but the proposer is obligated to give them the same value. If a proposer sends:
PREPARE(high-ID)
At least one acceptor will respond back with the previously accepted ID and value:
PROMISE(high-ID, accepted-ID, value)
The proposer will then have to send PROPOSE(high-ID, value). This is how proposers and learners learn about what was accepted and ultimately create a majority outcome.
Proposer fails in the ACCEPT phase
Here, a proposer that takes over may not know that a consensus protocol was already in progress; it simply proposes its own value. There are two cases to consider:
1. The proposer does not get any responses from a majority of acceptors that contain an old proposal ID and corresponding value. That means there has been no majority agreement yet. The proposer then just executes normally and finishes its protocol.
2. The proposer finds out that it is taking over a pending consensus because it gets at least one response that contains an accepted proposal number and value. It executes using that previous value and finishes the protocol.
The Leader
With Paxos, there is a chance of reaching livelock, where the algorithm makes no progress. There are approaches to break this livelock, such as using random exponentially-increasing delays before retrying. However, a common solution is to select a single proposer to handle all incoming requests. This elected proposer is called the leader. Selecting a leader requires running an election among the proposers. While we could run Paxos itself to select the leader, we would then face the same risk of livelock during the election, so a simpler election algorithm, such as the Bully algorithm, is typically used to choose a leader. Note that Paxos is still designed to be fault tolerant: the leader is not a requirement, and requests may still be made via other proposers, or other proposers may step in at any time.
Bully algorithm recap: A node that starts an election sends its server ID to all of its peers. If it gets a response from any peer with a higher ID, it will not be the leader. If all responses have lower IDs, then it becomes the leader. If a node receives an election message from a node with a lower-numbered ID, the receiving node starts its own election. Eventually, the node with the highest ID will be the winner.
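A minimal Python sketch of that election decision might look like this (the send function, which broadcasts our ID and returns the IDs of the peers that answered, is an assumption of the sketch):

    def run_bully_election(my_id, peer_ids, send_election):
        # Announce our ID to every peer; send_election returns the IDs of the
        # live peers that responded to the election message.
        responders = send_election(my_id, peer_ids)
        if any(peer_id > my_id for peer_id in responders):
            return None     # a higher-numbered node is alive; it will win its own election
        return my_id        # no higher ID answered, so this node becomes the leader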
Engineering Paxos for the real world
Paxos defines a protocol for single-value distributed consensus but does not consider other factors that are needed to get Paxos to run in real environments. Some of these factors are:
Single run vs. multi-run
Paxos yields consensus for a single value. Much of the time, we need to make consensus decisions repeatedly, such as keeping a replicated log or state machine synchronized. For this we need to run Paxos multiple times. This environment is called multi-Paxos.
Group management
The cluster of systems that are running Paxos needs to be administered. We need to be able to add systems to the group, remove them, and detect if any processes, entire systems, or network links are dead. Each proposer needs to know the set of acceptors so it can communicate with them and needs to know the learners (if those are present). Paxos is a fault-tolerant protocol but the mechanisms for managing the group are outside of its scope. We can turn to something like the group membership service of Isis virtual synchrony to track this.
Byzantine failures
We assumed that none of the systems running Paxos suffer Byzantine failures. That is, either they run and communicate correctly or they stay silent. In real life, however, Byzantine failures do exist. We can guard against network problems with mechanisms such as checksums or, if we fear malicious interference, digital signatures. However, we do need to worry about a misbehaving proposer that may inadvertently set its proposal ID to infinity (e.g., INFINITY from math.h in C or math.inf in Python if using floats; otherwise INT_MAX in C or sys.maxsize in Python). This puts the Paxos protocol into a state where acceptors will have to reject any other proposal.
Location of servers
Possibly the most common use of Paxos is in implementing replicated state machines, such as a distributed storage system. To ensure that replicas are consistent, incoming operations must be processed in the same order on all systems. A common way of doing this is to use a replicated log. That is, each of the servers will maintain a log that is sequenced identically to the logs on the other servers. A consensus algorithm will decide the next value that goes on the log. Then, each server simply processes the log in order and applies the requested operations.
In these environments, and most others, each server also serves as a Paxos node, running the functions of proposer, acceptor, and learner. A client can send a message to any server, which invokes some operation that updates the replicated state machine (e.g., a replicated storage service). In this case, the proposer will often be co-resident with the server that the client contacted. The request from the client is the value for a proposal that will be originated by that node. The proposer on that node manages its proposal ID numbers and sends proposals to all the acceptors it can reach, including itself. The acceptors accept the requests via the Paxos protocol. When the node knows that consensus has been reached, the operation can be applied to the log and propagated to the other servers. If another node tries to propose something concurrently, one of the proposers will be told that another proposal has already been accepted; it will adopt that accepted value, carry the Paxos protocol through to consensus on it, and then have to try again later, running Paxos again, to get its own value into the log.
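As a rough illustration of that last step, the sketch below (class and parameter names are assumptions) applies multi-Paxos decisions to a replicated log, one slot per Paxos run, in strict order:

    class ReplicatedLog:
        # Illustrative sketch: every log slot is filled by an independent Paxos run,
        # and entries are applied to the state machine strictly in slot order.
        def __init__(self, apply_fn):
            self.entries = {}          # slot number -> chosen value (operation)
            self.next_to_apply = 0
            self.apply_fn = apply_fn   # applies one operation to the local replica

        def on_decided(self, slot, value):
            self.entries[slot] = value
            # Apply any contiguous prefix of decided slots; gaps are waited out
            # until the corresponding Paxos runs complete.
            while self.next_to_apply in self.entries:
                self.apply_fn(self.entries[self.next_to_apply])
                self.next_to_apply += 1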
References
Leslie Lamport, Paxos Made Simple, ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001), December 2001
Leslie Lamport, The Part-Time Parliament, ACM Transactions on Computer Systems, Volume 16, Issue 2, May 1998, pages 133–169. Corrections from August 2000
Wei Chen, Abortable Consensus and Its Application to Probabilistic Atomic Broadcast, Microsoft Research Asia, MSR-TR-2006-135, September 2007
John Ousterhout, Paxos lecture
Luis Quesada Torres, The Paxos Algorithm, Google TechTalks, February 28, 2018.
Chris Coloha, Paxos Simplified, Distributed Systems Course, December 12, 2017.