Two-phase commit and the future of distributed systems

Original author: Murat Demirbas
  • Transfer
In this article, we will model and investigate a two-phase commit protocol using TLA +.

The two-phase commit protocol is practical and is used today in many distributed systems. However, it is rather short. Therefore, we can quickly simulate it and learn a lot. In fact, with this example we will illustrate what result is fundamentally impossible in distributed systems .

The problem of a two-phase commit

The transaction goes through resource managers (RM) . All RMs must agree on whether the transaction will be completed or aborted .

The transaction manager (TM) makes the final decision: commit or cancel . The prerequisite for a commit is the commitment to commit all RMs. Otherwise, the transaction should be canceled.

Some notes on modeling

For simplicity, we perform simulations in a shared memory model, not in a messaging system. It also provides a quick model check. But we will add non-atomicity to the “read from neighboring node and state update” actions in order to capture interesting behavior when sending messages.

RM can only read the TM state and read / update its own state. He cannot read the status of another resource manager. TM can read the status of all RM nodes and read / update its own state.


Lines 9-10 set the initial rmStatefor each RM on working, and TM - on init.

A predicate is canCommitdefined as trueif all RMs are “prepared” (ready to commit). If there is a RM in the “hang up” state, then the predicate becomes valid canAbort.

TM modeling is simple. The transaction manager checks the possibility of commit or cancellation - and updates accordingly tmState.

There is a possibility that TM will not be able to make it tmState“inaccessible”, but only if the constant is TMMAYFAILset totruebefore starting to validate the model. In TLA +, labels define the degree of atomicity, that is, its granularity. By the labels F1 and F2 we denote that the corresponding operators are executed nonatomically (after some indefinite time) with respect to the previous operators.

RM Model

The RM model is also simple. Since the “working” and “prepared” states are not finite, RM selects non-deterministic among actions until it reaches the final state. The “working” RM can go to the “interrupted” or “prepared” state. A “prepared” RM expects a commit / cancel from TM, and acts accordingly. The figure below shows the possible state transitions for one RM. But note that we have several RMs, each of which passes through its own states at its own pace without knowing the status of other RMs.

Two Phase Commit Model

We need to check the consistency of our two-phase commit: so that there are no different RMs, one of which says “commit” and the other “abortion”.

The predicate Completedchecks that the protocol does not hang forever: in the end, each RM reaches the final state committedor aborted.

Now we are ready to check the protocol model. Initially, we installed a TMMAYFAIL=FALSE, RM=1..3protocol with three RMs and one TM, that is, in a reliable configuration. Checking the model takes 15 seconds and says that there are no errors. And Consistency, and Completedsatisfied with any possible protocol execution with any alternation of RM actions and TM actions.

Now installTMMAYFAIL=TRUEand restart the check. The program quickly produces the opposite result, where RM is stuck waiting for a response from an inaccessible TM.

We see that at the stage State=4RM2 transitions are interrupted, RM3 transitions State=7are interrupted, the State=8TM switches to the “hang up” state and falls on State=9. The State=10system freezes because RM1 remains in a prepared state forever, waiting for a decision from a fallen TM.

BTM simulation

To avoid transaction freezes, we add a backup TM (BTM), which quickly takes control if the main TM is unavailable. BTM uses the same logic as TM for decision making. And for simplicity, we assume that BTM never falls.

When we check the model with the added BTM process, we get a new error message.

BTM cannot accept a commit, because our original condition canCommitstates that everyone RMstatesshould be “prepared” and does not take into account the condition when some RMs already received a decision on a commit from the original TM before BTM took control. It is necessary to rewrite the conditions canCommittaking into account this situation.

Success! When we test the model, we achieve both consistency and completeness, since the BTM takes control and completes the transaction if TM drops. Here is the 2PCwithBTM model in TLA + (BTM and the second line canCommit are initially uncommented) and the corresponding pdf .

What if RM fails too?

We assumed that RM is reliable. Now cancel this condition and see how the protocol behaves when RM fails. Add the "inaccessible" state to the failure model. To investigate the behavior and model intermittent loss of availability, let the emergency RM recover and continue working by reading its status from the log. Here is another RM state transition diagram with the “unavailable” state added and red transitions. And below is a revised model for RM.

It is also necessary to refinecanAborttaking into account the state of inaccessibility. TM may decide to “hang up” if any of the services is in an interrupted or inaccessible state. If you omit this condition, then the fallen and not restored RM interrupts the progress of the transaction. Of course, again, it is necessary to take into account the RMs, who have learned the transaction completion decision from the source TM.

Model checking

When we check the model, the inconsistency problem arises! How could this happen? We trace the trace.

When State=6all RMs are in a prepared state, TM decided to complete the transaction, RM1 saw this decision and switched to the RC label, which means being ready to change its state to “completed”. (Memorize RM1, this gun will fire in the last act). Unfortunately, TM collapses at the stage State=7, and RM2 becomes unavailable at State=8. In the ninth step, the backup BTM takes control and reads the state of the three RMs as “prepared, unavailable, prepared” - and decides to cancel the transaction in the tenth step. Remember RM1? He decides to complete the transaction because he received such a decision from the original TM, and goes tocommittedon the 11th step. In State=13RM3, it performs the decision to abort the transaction from the BTM and goes into a state aborted- and now we have broken coordination with RM1.

In this case, the BTM made a decision that broke the consistency . On the other hand, if you force the BTM to wait for the RM to exit from the inaccessible state, it may hang forever in the event of a crash on the node, and this will violate the condition for fulfillment (progress).

An updated TLA + model file is available here , as well as the corresponding pdf .

Impossibility of FLP

So what happened? We have stumbled into the theorem of Fisher, Lynch, Paterson (FLP) about the impossibility of consensus in an asynchronous system with failures.

In our example, the BTM cannot correctly decide whether RM2 is in a state of failure or not — and incorrectly decides to abort the transaction. If only the original TM made the decision, such an inaccuracy in recognizing the failure would not have become a problem. RM will comply with any TM decision, so both consistency and progress will be maintained.

The problem is that we have two objects make decisions: TM and BTM, they look at the state of RM at different times and make different decisions. Such asymmetry of information is the root of all evil in distributed systems.

The problem does not disappear even with expansion to a three-phase commit.Here is a three-phase commit modeled in TLA + ( pdf version ), and the error tracing below shows that progress has been broken this time (on the Wikipedia page on the three-phase commit, the situation is described when RM1 hangs after receiving the solution before the commit, and RM2 and RM3 commit commit, which violates the consistency).

Paxos is trying to make the world better

But all is not lost, hope is not dead. We have Paxos . It acts neatly within the FLP theorem. The innovation of Paxos is that it is always safe (even in the presence of inaccurate detectors, asynchronous execution and failures), and ultimately ends the transaction when consensus becomes possible.

You can emulate TM on a cluster with three Paxos nodes, and this will solve the TM / BTM inconsistency problem. Or, as Gray and Lamport showed in a scientific paper on consensus in a commit transaction , if RM uses the Paxos container to store its decisions at the same time as the TM response, this eliminates one extra step in the standard protocol algorithm.

Also popular now: