yannmar January 8, 2018 at 09:11

The Pacx algorithm. Clear article on consensus in a distributed system

In this article we will analyze the Paxos consensus algorithm, discuss why it is needed, why it works, prove its correctness and talk a bit about practical application problems. In many ways, this is a free retelling of an article by Leslie Lamport “Paxos Made Simple”

Why distributed consensus is needed and what it is

Often when a distributed system is running (there are simply several servers processing user requests or a distributed storage system or something like that), it becomes necessary to make general decisions, for example, which server to run the cluster level singleton service on, and where to migrate it when it falls server. This task is simply solved in the presence of the arbiter of the decision - the admin server. The problem is that the arbiter becomes a single point of collapse, his failure can lead to complete or partial inoperability of the system, and its restoration and restoration of the system will require manual intervention. Obviously, a fault-tolerant system should consist of equal participants who are able to agree among themselves on common issues - to reach consensus.

Leslie Lamport, originally wanted to prove mathematically rigorously that consensus in a distributed system with unreliable communications is impossible, but instead he invented and proved the Paxos algorithm that allows such a consensus to be reached. The description of the algorithm, which can be found, for example, on Wikipedia, looks deceptively simple and short, but it is not easy to understand, and to put it into practice is even more difficult.

This algorithm has important limitations: it describes the process of choosing only one value, in practice, you need to make many decisions and for each next case you will need to create a new instance of the algorithm. It is also assumed that the participants in the system do not behave in a Byzantine manner, i.e. keep their promises, do not provide false data (I don’t know the story, but I feel - the reputation of the Byzantines was so-so)

About selection and roles

The concept of “choice” (choice of solution, choice of value) in the context of consensus algorithms is different from everyday. If in real life we compare the pros and cons, then in the case of the consensus algorithm, the choice is made from equivalent options, i.e. any value / solution is suitable, the main thing is to decide which one of the many proposed.

This can be compared, as the three of you and your friends go to the pizzeria, where the whole pizza is equally good (or bad) and trying to choose which large pizza to order for three, the only waiter comes up to you, and you continue to look perplexedly at the menu. Due to the neighboring tables are starting to suggest: "Take already pepperoni, do not hold no worse than all the rest" - it was an offer (proposal) and expressed his offering(proposer). One of you has heard him and thinks: "OK, let it be pepperoni." He - the host (acceptor) and just took (to accept) offer . For other table shouting: "You the third time they say - take the Margaret and may have an order with us will be" - this is another proposal , but "you for the third time" - a number of proposals (proposal number), and "Margarita" - is value of the sentence (value). Then the waiter begins to question you who accepted what offers. He - finding out (Leaner), and if the majority (Majority) of you took the same sentence ( "I'll say for the third time - Margarita") called the proposal selected(chosen), and if the majority informs him of the waiter, the waiter will know (to learn) about the selected value and will be able to accept the order. But, say, one of those who accepted the third sentence is stuck in a smartphone and does not respond to the waiter's question.

The value was chosen, but it is impossible to recognize it - the two friends who remain in the real world give conflicting answers. The next round of negotiations begins and perhaps a new proposal will be accepted, and the participant stuck in the smartphone will continue to read the Habr. So, the correct consensus algorithm sets such negotiation rules that any next selected value will coincide with the first time the selected value.

Look again, this is a very important and absolutely wonderful moment: the state of the system (accepted values) is distributed among the nodes - each receiver knows only about the proposal he has accepted, the general value has been chosen, but a situation may arise when it is impossible to know and even impossible to know whether something is selected. In this case, negotiations will continue. But Paxos is such that any next value chosen by the majority of the accepting values will coincide with the first one selected. The choice, once made, will not be changed.

So we figured out the terminology and roles:

Offers - make suggestions .
Acceptors - accept offers, and remember the accepted ( number and value ). In this case, the host will accept the first proposal received and will accept subsequent ones even if a choice has already been made.
Recognizing - find out which proposal was chosen (accepted by most participants).

In practice, roles can and will be combined, for example, if we talk about servers choosing a leader among themselves, each server can offer itself as a leader, will accept offers (including its own offers), and in the end it needs to find out who It was chosen.

About most

The majority refers to “more than half”, i.e. N + 1 from 2N + 1 (and from 2N too) or more.

The requirement of acceptance by the majority is obvious: if a choice of value would require acceptance by a smaller number of acceptors (exactly half or less), then nothing would hinder to choose several different values at the same time, i.e. consensus would not have been reached. It will be useful to us that any two sets that make up the majority have a nonempty intersection.

In this case, the participation of all recipients in the choice is not required, the rest of the recipients may be malfunctioning, communication with them may be broken. Thus, we get that a system of 2N + 1 participants is able to survive the failure of N of them.

Global time distributed system

We have already mentioned the offer number. Each proposal has a number. In the description of Lamport, this is a natural number such that:

the number is unique, each offer has its own;
the bidder uses a larger number for each subsequent offer.

The following analogy helped me in understanding the algorithm: a number is essentially a timestamp of a sentence, together timestamps set the global synthetic time of a distributed system designed so that a) no two sentences can be made (issued) at the same time , b) we can understand which of the two proposals was made “later”. The host accepting some proposal will ignore the proposals that were made “earlier” than the accepted one (proposals “from the past”).

In practice, the offer number will consist of a pair: the value of the offer counter of a specific server and the server identifier.

We construct the correct algorithm

The value v is chosen if the proposal with the number m and the value v was accepted by the majority of the hosts. After that, any subsequent selected value (accepted by most of the receivers) should coincide with the selected v .

This requirement can be fulfilled if, after the value of v has been chosen, the acceptors will only accept offers with values equal to v .

The acceptor accepts what the offerers offer and, therefore, the previous requirement can in turn be fulfilled if all the offerers in all offers with the number n> m offer only the previously selected value v .

Let us imagine that we are proving the last statement by induction and try to understand what is missing to make the induction transition.

The sentence (m, v) - was chosen, i.e. there are many C consisting of the majority of the hosts and each of these hosts has accepted the offer (m, v) . It also follows from the induction hypothesis that all sentences with numbers from the range from m to n-1 inclusive have the value v (*). Now we need to make the sentence with number n also have the value v .

Suppose that there exists a set S consisting of the majority of the hosts and such that we can find out the meaning of the last sentence (with a maximum number less than n) accepted by the host from this set. Let's say the number of this “maximum” sentence is k . Since the set S contains at least one receiving a from the set C , it means that the number k will be greater than or equal to the number of the last accepted offer by the receiving a , and the number of the last sentence of received a will be greater than or equal to m (the moment the value was chosen) i.e. . k> = m , and thus, by virtue of (*), the value of the sentence k is the previously chosen v , we will use it to generate the sentence (n, v). Thus, if we are able to guarantee the existence of the sets S , then this will provide an induction transition.

In strict mathematical language, it was formulated by Lamport as follows: for the induction transition to occur, the following invariant must be satisfied:

For any v and n , if the sentence (n, v) is made (issued), then there exists a set S consisting of the majority of the receivers and such that one of the two is true: (a) no receiver from the set S has accepted any proposal with smaller number n or (b) v - this value is the highest numbered proposals smaller n among all proposals received from a plurality of host S .

To ensure the existence of the set S described above , the person planning to make an offer n must find out from the majority of those accepting what number and value of the last sentence they will accept until the moment n .

See what the problem is: the offeror can ask anyone accepting the number and meaning of the last offer he accepted, but there is a chance that by sending an answer the receiver will accept the next offer and the answer received by the offerer will be outdated and useless. To avoid this, Lamport came up with this: let the bidder not only ask the meaning of the last accepted offer, but also ask him not to accept any offers with a number less than n. In response from the recipient, the offeror will receive: 1) the value and number of the last accepted offer and 2) the promise not to accept any offers with numbers less than n . Then the information about the last acceptance of an offer received in response does not expire, and if the answers are received from the majority decision, the decision to form the desired set of the S . As soon as the set S is formed, the offeror makes the offer (n, v) (where v , as we discussed above, is the value of the last sentence (the offer with the maximum number) accepted by the host from the set S until the moment n ).

Thus, we got the correct consensus algorithm, where proposals are made in two stages:

Preparatory stage:
1. The offerer sends out an announcement that he plans to make offer n (at time n )
2. The receiving recipients of the announcement send back a promise not to accept any offers with numbers less than n (up to the moment n ), as well as the last accepted offer (number and value) or do not answer at all if the number of the last accepted offer is greater than n or if n is among proposals that the host has already promised not to accept another bidder.
Sentence:
1. The proposer, having received answers from the majority of the receivers, selects the value from the response with the maximum offer number as the offer value v and sends out the offer (n, v) .
2. The receiver, having received the offer (n, v) , is obliged to accept it if, of course, he has not promised another offeror not to accept offers with numbers less than n * , where n *> n .

In order to know that a value has been chosen, the one recognizing must see that the same offer (number plus value) has been accepted by most of the hosts.

Practical problems

Messaging is asynchronous, communications are unreliable, so it is not known whether the offeror will receive at least some answers, without waiting for the required number of promises, the offeror will try again by making the next offer. All of this can be repeated for a long time.

The algorithm does not guarantee that consensus will be reached. Say, one offeror receives from the majority a promise not to accept offers with a number less than 1, the second following him receives from the majority a promise not to accept offers with a number less than 2. That is, 1st sentence can no longer be selected. Having understood that his offer will not be selected, the first bidder will try to get from the majority of the promise not to accept offers with a number less than 3. Now the 2nd offer will not be accepted. And so the two offerers can infinitely lead the majority of the hosts to one or the other side, blocking the possibility of reaching consensus. This can be corrected by introducing random delays into the algorithm, and of course it is advisable to make sure that there is only one offering. For this,

But here a new problem will arise: if we choose a leader, nothing prevents the remaining nodes behind him from initiating new elections and choosing another leader among themselves, they will manage to gather a quorum. Moreover, the previous leader may not know anything about this and will continue to act considering that he is in charge here.

Another complication is dishonest Byzantine behavior, and for this it is not necessary to have malware among the elements of the system. For example, the recipients are obliged to remember their promise, for this they save it to disk, data corruption on the disk (due to a failure or inaccurate deletion of files by the operator) will lead to Byzantine behavior. In practice, even data in memory can be corrupted with the same sad result. Errors in the code can also cause Byzantine behavior.

You can read about these and other problems of the Paxos algorithm in a fascinating article by Google developers “Paxos Made Live - An Engineering Perspective” , where they talk about their experience in implementing distributed consensus in the Chubby project (intracorporate analogue of Zookeeper)

Thus, the Paxos algorithm is not silver a bullet, but just a small wonderful fundamental brick of our knowledge about fault-tolerant distributed systems.

Tags: