Kinetics of large clusters

    Summary


    1. Martin Kleppmann's fatal mistake.
    2. Physico-chemical kinetics does the math.
    3. The half-life of the cluster.
    4. We solve nonlinear differential equations without solving them.
    5. Nodes as a catalyst.
    6. The predictive power of charts.
    7. 100 million years.
    8. Synergy.

    In the previous article, we examined in detail Brewer's article and his theorem of the same name. This time we'll dissect Martin Kleppmann's post “The probability of data loss in large clusters”.

    In this post, the author tries to model the following problem. To ensure data integrity, data replication is usually employed; whether erasure coding is used on top does not really matter here. In the original post, the author starts from the probability of a single node failing, and then asks: what is the probability of losing data as the number of nodes grows?

    The answer is shown in this picture:

    [Figure: Data loss]

    That is, as the number of nodes grows, the amount of lost data grows in direct proportion.

    Why is this important? If we look at the sizes of modern clusters, we see that they grow continuously over time. So a reasonable question arises: is it worth worrying about the safety of your data and raising the replication factor? After all, this directly affects the business, the cost of ownership, and so on. This example also demonstrates beautifully how one can produce a mathematically correct, yet wrong, result.

    Cluster modeling


    To demonstrate the inaccuracy of the calculations, it helps to understand what a model and modeling are. If a model describes the real behavior of the system poorly, then no matter how correct the formulas, we can easily get a wrong result. And all because the model may ignore some important parameters of the system that cannot be neglected. The art is to understand what is important and what is not.

    To describe the life of a cluster, it is important to account for the dynamics of change and the interconnection of the various processes. This is precisely the weak link of the original article: it takes a static snapshot, without any of the specifics of replication.

    To describe the dynamics, I will use the methods of chemical kinetics, with an ensemble of nodes in place of an ensemble of particles. As far as I know, no one has applied this formalism to describing cluster behavior, so I will improvise.

    We introduce the following notation:
    1. N: total number of cluster nodes
    2. A: number of working nodes (available)
    3. F: number of problem nodes (failed)

    Then it is obvious that:

    \begin{aligned}
    N = A + F
    \end{aligned}

    I will count any problem toward the failed nodes: a dead disk, a broken CPU, network, and so on. The specific cause does not matter to me; what matters is the fact that data is broken or unavailable. Later, of course, one can account for finer dynamics.

    Now we write the kinetic equations of the processes of breakdown and restoration of the cluster nodes:

    \begin{aligned}
    A &\xrightarrow{k_f} F & (1) \\
    F &\xrightarrow{k_a} A & (2)
    \end{aligned}

    These simplest equations say the following. The first equation describes the process of node failure. It does not depend on any parameters and describes an isolated node failure; other nodes are not involved in this process. The original “composition” of the process participants is on the left, and the products of the process are on the right. The rate constants k_f and k_a specify the rates of the node failure and recovery processes, respectively.

    Let us find out the physical meaning of the rate constants. To do this, we write the kinetic equations:

    \begin{aligned}
    \frac{dA}{dt} &= -k_f A + k_a F \\
    \frac{dF}{dt} &= k_f A - k_a F
    \end{aligned}

    From these equations the meaning of the constants k_f and k_a becomes clear. If we assume that there are no admins and the cluster does not heal itself (i.e. k_a = 0), we immediately get the equation:

    \begin{aligned}
    \frac{dA}{dt} = -k_f A
    \end{aligned}

    Or

    \begin{aligned}
    A = N e^{-k_f t}
    \end{aligned}

    That is, 1/k_f is the “half-life” of the cluster, up to the difference between e and 2 (it is the e-folding time rather than the halving time). Let \tau_f be the characteristic time for a single node to pass from state A to state F, and let \tau_a be the characteristic time for a single node to pass from state F to state A. Then

    \begin{aligned}
    \tau_f &\sim \frac{1}{k_f} \\
    \tau_a &\sim \frac{1}{k_a} \\
    \frac{\tau_a}{\tau_f} &= \frac{k_f}{k_a}
    \end{aligned}

    Now we solve our kinetic equations. Let me say right away that I will cut corners wherever possible, in order to get the simplest analytical dependencies that I can use for predictions and tuning.

    Since my quota for honestly solving differential equations has already been used up, I will solve these equations by the method of quasistationary states:

    \begin{aligned}
    \frac{dA}{dt} = 0 \Rightarrow F = A \frac{k_f}{k_a}
    \end{aligned}

    Given that F \ll N (a perfectly reasonable assumption; otherwise you need to buy better hardware or hire better admins), we get:

    \begin{aligned}
    A &\simeq N \\
    F &= N \frac{k_f}{k_a}
    \end{aligned}
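
    As a side note (my addition, not part of the original derivation): this node subsystem is linear, so it can also be solved exactly. With A(0) = N the exact solution is

    \begin{aligned}
    A(t) = N\,\frac{k_a + k_f\,e^{-(k_f + k_a)t}}{k_f + k_a} \xrightarrow[t \to \infty]{} N\,\frac{k_a}{k_f + k_a} \simeq N \quad (k_f \ll k_a)
    \end{aligned}

    which relaxes to the quasistationary values above over a time \sim 1/(k_f + k_a) \approx \tau_a, justifying the approximation.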

    If we assume that the recovery time \tau_a is approximately 1 week, and the time to failure \tau_f is approximately 1 year, we obtain the share of broken nodes p_f:

    \begin{aligned}
    p_f = \frac{F}{N} = \frac{\tau_a}{\tau_f} \approx 2\%
    \end{aligned}
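
    A minimal numerical sanity check of this steady state (my sketch, not from the original post; the 1 week / 1 year figures are the assumptions above):

    from scipy.integrate import solve_ivp

    YEAR = 365 * 24 * 3600.0            # seconds in a year
    tau_f = 1.0 * YEAR                  # assumed time to failure: 1 year
    tau_a = 7 * 24 * 3600.0             # assumed recovery time: 1 week
    k_f, k_a = 1 / tau_f, 1 / tau_a     # rate constants
    N = 1000.0                          # total nodes (illustrative)

    def rhs(t, y):
        A, F = y
        return [-k_f * A + k_a * F, k_f * A - k_a * F]

    # Start with all nodes healthy; a few months is far beyond the
    # relaxation time ~tau_a, so the end state is the steady state.
    sol = solve_ivp(rhs, (0, 0.25 * YEAR), [N, 0.0])
    A_end, F_end = sol.y[:, -1]
    print(f"p_f simulated     = {F_end / N:.4f}")   # ~0.019
    print(f"p_f = tau_a/tau_f = {tau_a / tau_f:.4f}")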

    Chunks


    Let U denote the number of under-replicated chunks that need to be re-replicated after a node fails, i.e. passes into state F. Then, to account for the chunks, we amend our equations:

    \begin{aligned}
    A &\xrightarrow{k_f} F + U & (1) \\
    F &\xrightarrow{k_a} A & (2) \\
    U + A &\xrightarrow{k_r} H + A & (3)
    \end{aligned}

    where k_r is the rate constant of the second-order replication process, and H denotes a healthy chunk, which dissolves into the total mass of chunks.

    The third equation needs clarification. It describes a second-order process, not a first-order one such as:

    \begin{aligned}
    U \xrightarrow{k_r} H
    \end{aligned}

    Had we written it that way, we would have obtained Kleppmann's curve, which is not in my plans. In reality, all nodes participate in the recovery process, and the more nodes there are, the faster it goes. This is because the chunks of a dead node are spread approximately evenly across the cluster, so each participant has to replicate roughly A times fewer chunks. Hence the overall recovery rate for the chunks of a dead node is proportional to the number of available nodes.

    It is also worth noting that A appears in equation (3) both on the left and on the right, while not being consumed. A chemist would immediately say that A acts here as a catalyst. And if you think about it carefully, that is exactly what it is.

    Using the method of quasistationary concentrations, we instantly get the result:

    \begin{aligned}
    \frac{dU}{dt} = 0 = k_f A - k_r U A
    \end{aligned}

    or

    \begin{aligned}
    U = \frac{k_f}{k_r}
    \end{aligned}

    An amazing result! The number of chunks awaiting replication does not depend on the number of nodes! And this is precisely because an increase in the number of nodes speeds up the resulting rate of reaction (3), compensating for the larger inflow of failures. Catalysis!

    Let us estimate this value. Here \tau_r is the chunk recovery time as if we had only one node. Suppose a node holds 5 TB of data to be replicated, and the replication throughput is capped at 50 MB/s; then we get:

    \begin{aligned}
    U = \frac{\tau_r}{\tau_f} \approx \frac{1 \times 10^5}{3.2 \times 10^7} \approx 3 \times 10^{-3}
    \end{aligned}

    That is, U \ll 1, and there is no need to fear for the safety of the data. Bear in mind also that losing one replica of a chunk out of 3 does not yet mean losing data.
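
    The same arithmetic in compact form (with the assumed figures above: 5 TB per node, a 50 MB/s replication cap, \tau_f = 1 year):

    data_per_node = 5e12                 # bytes, assumed: 5 TB to re-replicate
    repl_rate = 50e6                     # bytes/s, assumed: 50 MB/s cap
    tau_r = data_per_node / repl_rate    # 1e5 s, a bit over a day

    tau_f = 365 * 24 * 3600.0            # ~3.15e7 s, time to failure: 1 year

    U = tau_r / tau_f
    print(f"tau_r = {tau_r:.0e} s, U = {U:.1e}")   # U ~ 3e-3 << 1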

    Replication planning


    In the previous calculation we made an implicit assumption that nodes instantly learn exactly which chunks need to be replicated and immediately start replicating them. In reality this is not the case at all: the master first needs to realize that a node has died, then work out which specific chunks need replication, and only then start the replication process on the nodes. All this is not instantaneous and takes some time \tau_s (scheduling).

    To take the delay into account, we borrow from transition state theory (the activated complex), which describes passage through a saddle point on a multidimensional potential energy surface. From our point of view, we get an additional intermediate state U^*, meaning the chunk has been scheduled for replication but the replication process has not yet begun. That is, in the next nanosecond replication will definitely begin, but not a picosecond earlier. Then our system takes its final form:

    \begin{aligned}
    A &\xrightarrow{k_f} F + U & (1) \\
    F &\xrightarrow{k_a} A & (2) \\
    U &\xrightarrow{k_s} U^* & (3) \\
    U^* + A &\xrightarrow{k_r} H + A & (4)
    \end{aligned}

    Solving it, we find that:

    \begin{aligned}
    \frac{dU}{dt} &= k_f A - k_s U \\
    \frac{dU^*}{dt} &= k_s U - k_r U^* A
    \end{aligned}

    Using the method of quasistationary concentrations, we find:

    \begin{aligned}
    U &= A \frac{k_f}{k_s} \\
    U^* &= \frac{k_f}{k_r} \\
    U_{sum} &= U + U^* = \frac{\tau_r}{\tau_f} \bigg(1 + \frac{A}{\widetilde{A}}\bigg)
    \end{aligned}

    Where:

    \begin{aligned}
    \widetilde{A} = \frac{\tau_r}{\tau_s}
    \end{aligned}

    As you can see, the result coincides with the previous one except for the factor (1 + A/\widetilde{A}). Consider two limiting cases:
    1. A \ll \widetilde{A}. In this case all the earlier arguments still hold: the number of under-replicated chunks does not depend on the number of nodes and therefore does not grow with the cluster.
    2. A \gg \widetilde{A}. In this case U_{sum} \simeq A \tau_s/\tau_f, which grows linearly with the number of nodes.

    To determine which regime we are in, let us estimate \widetilde{A}. Here \tau_s is the characteristic total time to detect an under-replicated chunk and schedule its replication. A rough estimate (by the “finger in the air” method) gives a value of 100 s. Then:

    \begin{aligned}
    \widetilde{A} = \frac{1 \times 10^5}{100} = 1000
    \end{aligned}

    Thus, under these conditions, growing the cluster beyond this figure will begin to increase the likelihood of losing a chunk.
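
    A small sketch of this crossover (my illustration, using the same assumed values \tau_r = 10^5 s, \tau_s = 100 s, \tau_f = 1 year):

    tau_r = 1e5                   # single-node replication time, s (assumed)
    tau_s = 100.0                 # detection + scheduling time, s (assumed)
    tau_f = 365 * 24 * 3600.0     # node time to failure, s (1 year)

    A_tilde = tau_r / tau_s       # crossover cluster size, = 1000 here

    def u_sum(A):
        """Quasistationary number of under-replicated chunks."""
        return (tau_r / tau_f) * (1 + A / A_tilde)

    for A in (100, 1000, 10_000, 100_000):
        print(f"A = {A:>6}: U_sum = {u_sum(A):.2e}")
    # Below ~1000 nodes U_sum stays near tau_r/tau_f ~ 3e-3; above,
    # the linear term A*tau_s/tau_f dominates and grows with the cluster.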

    What can be done to improve the situation? It might seem possible to improve the asymptotics by shifting the boundary \widetilde{A} upward via a larger \tau_r, but that would only increase U_{sum} without any real improvement. The surest way is to decrease \tau_s, the decision time for chunk replication, since \tau_f depends on the hardware and is hard to influence in software.

    Discussion of limiting cases


    The proposed model effectively divides the set of clusters into two camps.

    The first camp contains relatively small clusters with fewer than 1000 nodes. In this case the probability of having an under-replicated chunk is described by the simple formula:

    \begin{aligned}
    U = \frac{\tau_r}{\tau_f}
    \end{aligned}

    To improve the situation, two approaches can be applied:
    1. Improve the hardware, thereby increasing \tau_f.
    2. Speed up replication, reducing \tau_r.

    These methods are generally fairly obvious.

    The second camp holds large and extra-large clusters with more than 1000 nodes. Here the dependence is determined by:

    \begin{aligned}
    U = A \frac{\tau_s}{\tau_f}
    \end{aligned}

    That is, U is directly proportional to the number of nodes, which means that further cluster growth will negatively affect the likelihood of under-replicated chunks. However, the negative effects can be substantially reduced using the following approaches:
    1. Continue to increase \tau_f.
    2. Speed up the detection of under-replicated chunks and the subsequent replication scheduling, thereby reducing \tau_s.

    The second approach, reducing \tau_s, is far from obvious. What difference could it make whether it takes 20 seconds or 100? Yet this value significantly affects the likelihood of under-replicated chunks. Also non-obvious is the fact that \tau_r, the replication speed itself, does not matter. This makes sense within the model: as the number of nodes grows, the aggregate replication rate grows with it, so the constant overhead of detecting and scheduling the replication becomes the dominant contribution to a chunk's recovery.

    It is worth dwelling a little longer on \tau_f. Besides its direct contribution to the lifetime of chunks, increasing \tau_f also has a beneficial effect on the number of available nodes, because

    \begin{aligned}
    A = N - F \simeq N \bigg(1 - \frac{\tau_a}{\tau_f}\bigg)
    \end{aligned}

    For instance, with the 2% estimate above, a 1000-node cluster has on average about 980 nodes available. Thus, improving \tau_f directly affects the availability of cluster resources, speeding up computation and at the same time increasing the reliability of data storage. On the other hand, better hardware directly affects the cost of owning the cluster. The described model lets us quantify the economic feasibility of such decisions.

    Comparison of approaches


    To conclude this part, let us compare the two approaches. The following graphs speak eloquently for themselves.

    [Figure: Data loss]

    [Figure: Kinetics]

    The first graph shows only a linear relationship and gives no answer to the question: “what should be done to improve the situation?” The second one describes a more complex model that can immediately answer what to do and how to influence the behavior of the replication process. Moreover, it gives a recipe for quickly, literally in one's head, assessing the consequences of various architectural decisions. In other words, the predictive power of the developed model is on a qualitatively different level.

    Loss of a chunk


    Let us now find the characteristic time to lose a chunk. To do this, we write out the kinetics of the processes that produce such chunks, assuming a replication factor of 3:

    \begin{aligned}
    A &\xrightarrow{k_f} F + U \\
    F &\xrightarrow{k_a} A \\
    U &\xrightarrow{k_s} U^* \\
    U^* + A &\xrightarrow{k_r} H + A \\
    U &\xrightarrow{k_f} F + U_2 \\
    U^* &\xrightarrow{k_f} F + U_2 \\
    U_2 &\xrightarrow{k_s} U_2^* \\
    U_2^* + A &\xrightarrow{k_r} U + A \\
    U_2 &\xrightarrow{k_f} F + L \\
    U_2^* &\xrightarrow{k_f} F + L
    \end{aligned}

    Here U_2 denotes the number of under-replicated chunks missing two copies, U_2^* is the intermediate state for U_2 analogous to U^*, and L denotes a lost chunk. Then:

    \begin{aligned}
    \frac{dL}{dt} &= k_f \big(U_2 + U_2^*\big) \\
    \tau_l &= \frac{1}{k_f \big(U_2 + U_2^*\big)}
    \end{aligned}

    where \tau_l is the characteristic time for a lost chunk to appear. We solve our system in the two limiting cases, taking A = 1000.

    For A \ll \widetilde{A} we get:

    \begin{aligned}
    \tau_l = A \frac{\tau_f^3}{\tau_r^2} \approx 100\,000\,000 \text{ years}
    \end{aligned}

    For A \gg \widetilde{A} we get:

    \begin{aligned}
    \tau_l = \frac{\tau_f^3}{A \tau_s^2} \approx 100\,000\,000 \text{ years}
    \end{aligned}

    That is, the characteristic time for a lost chunk to appear is 100 million years! The two limits give approximately the same value because we are in the transition zone between them. The magnitude of \tau_l speaks for itself, and everyone can draw their own conclusions.
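
    Plugging in the numbers (my check, with the assumed values used throughout: \tau_f = 1 year, \tau_r = 10^5 s, \tau_s = 100 s, A = 1000):

    YEAR = 365 * 24 * 3600.0
    tau_f = 1.0 * YEAR      # node time to failure (assumed: 1 year)
    tau_r = 1e5             # single-node replication time, s (assumed)
    tau_s = 100.0           # scheduling time, s (assumed)
    A = 1000                # available nodes: the transition zone

    tau_l_small = A * tau_f**3 / tau_r**2     # limit A << A~
    tau_l_large = tau_f**3 / (A * tau_s**2)   # limit A >> A~
    print(f"A << A~: tau_l ~ {tau_l_small / YEAR:.1e} years")   # ~1e8
    print(f"A >> A~: tau_l ~ {tau_l_large / YEAR:.1e} years")   # ~1e8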

    One feature, however, deserves attention. In the limiting case A \ll \widetilde{A}, although U is constant and independent of A, the expression for \tau_l contains A in the numerator: while the cluster is small, growing it under triple replication actually improves data safety! With further cluster growth, the situation reverses.

    Conclusions


    This article has introduced, step by step, a novel method for modeling the kinetics of processes in large clusters. The approximate model of cluster dynamics considered here makes it possible to compute the probabilistic characteristics that describe data loss.

    Of course, this model is only a first approximation of what really happens on a cluster; here we accounted only for the most significant processes, aiming at a qualitative result. But even such a model lets us judge what is going on inside the cluster and gives recommendations for improving the situation.

    The proposed approach also allows more accurate and reliable results, based on a finer accounting of various factors and on the analysis of real cluster operation data. Below is a far from exhaustive list of possible model improvements:
    1. Cluster nodes can fail for various hardware reasons, and each kind of failure typically has its own probability. Moreover, a CPU failure, for example, does not lose data but only makes the node temporarily unavailable. This is easy to account for in the model by introducing different states F_{proc}, F_{disk}, F_{mem}, etc., with different process rates and different consequences.
    2. Not all nodes are equally useful. Different hardware batches may differ in the nature and frequency of their failures. This can be taken into account by introducing A_1, A_2, etc., with different rates for the corresponding processes.
    3. Adding various node types to the model: nodes with partially damaged disks, banned nodes, etc. For example, one could analyze in detail the effect of switching off an entire rack and find the characteristic rates at which the cluster returns to the stationary regime. Moreover, by numerically solving the process equations, one can visually examine the dynamics of chunks and nodes.
    4. Each drive has slightly different read/write characteristics, including latency and throughput. The process rate constants can nevertheless be estimated more accurately by integrating over the corresponding distribution functions of the disk characteristics, by analogy with rate constants in gases, which are integrated over the Maxwell velocity distribution.

    Thus, the kinetic approach allows one, on the one hand, to obtain a simple description and analytical dependencies, and on the other hand, it has serious potential for introducing additional subtle factors and processes, grounded in the analysis of real cluster operation data, adding specifics as needed. One can evaluate the contribution of each factor to the resulting equations and model proposed improvements to judge their expediency. In the simplest case, such a model quickly yields analytical dependencies and recipes for improving the situation. Modeling can work in both directions: one can iteratively improve the model by adding processes to the system of kinetic equations, and one can analyze potential system improvements by introducing the corresponding processes into the model.

    In addition, one can always move on to numerically integrating the stiff nonlinear system of differential equations, obtaining the dynamics and the system's response to specific influences or small perturbations.
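
    For example, here is a minimal sketch of how the full system from the “Loss of a chunk” section could be integrated numerically (my illustration; the rate constants come from the assumed values above, and since the rates span many orders of magnitude, an implicit stiff solver is used). The F produced by the chunk reactions is dropped from the node balance: those failures are already counted by A -> F, and U << N anyway.

    from scipy.integrate import solve_ivp

    YEAR = 365 * 24 * 3600.0
    k_f = 1 / YEAR                 # node failure rate (tau_f = 1 year)
    k_a = 1 / (7 * 24 * 3600.0)    # node repair rate (tau_a = 1 week)
    k_s = 1 / 100.0                # scheduling rate (tau_s = 100 s)
    k_r = 1 / 1e5                  # per-node replication rate (tau_r = 1e5 s)
    N = 1000.0

    def rhs(t, y):
        # Species: nodes A (available), F (failed); chunks U (one copy
        # missing), Us = U*, U2 (two copies missing), U2s = U2*, L (lost).
        A, F, U, Us, U2, U2s, L = y
        return [
            -k_f * A + k_a * F,                          # dA/dt
             k_f * A - k_a * F,                          # dF/dt
             k_f * A - (k_s + k_f) * U + k_r * U2s * A,  # dU/dt
             k_s * U - (k_r * A + k_f) * Us,             # dU*/dt
             k_f * (U + Us) - (k_s + k_f) * U2,          # dU2/dt
             k_s * U2 - (k_r * A + k_f) * U2s,           # dU2*/dt
             k_f * (U2 + U2s),                           # dL/dt
        ]

    sol = solve_ivp(rhs, (0, 10 * YEAR), [N, 0, 0, 0, 0, 0, 0],
                    method="Radau", rtol=1e-8, atol=1e-12)
    A, F, U, Us, U2, U2s, L = sol.y[:, -1]
    print(f"U + U* = {U + Us:.2e}")              # close to the analytic U_sum
    print(f"dL/dt  = {k_f * (U2 + U2s):.2e}/s")  # tau_l of ~1e7..1e8 years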

    Thus, the synergy of seemingly unrelated areas of knowledge allows one to obtain remarkable results with undeniable predictive power.

    Grigory Demchenko, developer of YT



