Shedar August 4, 2010 at 10:08

Eventually Consistent

Original author: Werner Vogels, Amazon.com

Transfer

Recently on a habr discussion of scalable systems and NoSQL decisions more often began to meet. This article, written by Amazon's CTO, is one of the best introductory, in my opinion, showing what problems arise when building scalable systems, what to consider when choosing a toolbox, which cassandra authors have in mind when talking about providing AP in cassandra and CP in HBase and much more.

Introduction

Amazon cloud computing is based on infrastructure services. Such as Amazon S3, SimpleDB and EC2. They allow you to build scalable computing platforms and applications. The requirements for these services are very stringent. They should provide excellent security, scalability, availability, performance and efficient use of resources. And all this while serving millions of customers from all over the world.
Internally, these services are huge, distributed systems operating on a global scale. Which creates additional difficulties, because when processing trillions and trillions of requests, events that usually happen with a very low probability are now guaranteed to happen. And this must be taken into account when designing and developing the system. Globally, we use replication everywhere to provide the required performance and high availability. Although replication brings us closer to our goals, it still does not allow us to transparently achieve them. There are a number of nuances that users of services using replication will encounter.
One of these nuances is the type of data consistency provided by the system. In particular, many common distributed systems use the eventual consistency model in the context of data replication. When developing large scalable systems at Amazon, we used a set of rules and abstractions related to data replication on large scalable systems. We focused on finding a compromise between high availability and data consistency. In this article, I will examine some of the information that has shaped our approach to building reliable distributed systems that work on a global scale.

Historical perspective

In an ideal world, there would be only one model of consistency: after performing a data update, all observers will see the update. The first difficulties with achieving this arose in the DBMS in the late 70s. The best work on this subject is Bruce Lindsay's Notes on Distributed Databases. He outlines the basic principles of database replication and discusses a number of techniques related to achieving consistency. Many of these techniques try to achieve distribution transparency - so that from the point of view of the user it looks like a single system, and not like a lot of connected systems. Many systems of the time followed the approach that a failure of the entire system is better than a violation of transparency.
In the mid-90s, with the growth of systems on the Internet, this practice has been revised. At this time, people began to feel that accessibility was the most important property, but they could not decide what to sacrifice as a compromise. Eric Brewer, professor of Berkeley, who was then the head of Inktomi (the company that released the successful search engine, which was later absorbed by Yahoo - approx. Per.) , Brought all the compromises together in a report at the 2000 PODC conference. He presented the CAP theorem, which states that of the three properties of systems with distributed data are consistency, system availability when one of the nodes fails (system availability), and resistance to loss of connection between network segments (partition tolerance)(hereinafter, network segmentation means the loss of communication between parts of a distributed system, when each part is separately operable, but they "do not see" each other - approx. per.) - only two can be achieved at a time. A more formal confirmation is published in an article by Seth Gilbert and Nancy Lynch in 2002.
A system that does not provide resistance to loss of communication between network segments can achieve data consistency and availability, which is often achieved using a transaction protocol. In this case, certain situations are treated as a system failure. For example, if the client does not see part of the nodes. It is worth noting that segmentation is often present in large scalable systems, because data consistency and availability are not achievable at the same time. This means that we have two choices: weaken the consistency, which will allow us to create a system with high availability in the conditions of network segmentation, or focus on consistency, which will lead to the unavailability of the system in certain situations.
Both options require the attention of the client side developer to the system capabilities. If the system focuses on integrity, then the developer should keep in mind that the system may not be available, for example, for recording and accordingly handle this situation so as not to lose data. If the system focuses on accessibility, then it can always provide a record, but reading data in some cases will not reflect the result of a recent recording. The developer has to decide whether the client really needs the very latest changes. In many cases, slightly outdated data is acceptable.
Basically, consistency in transaction systems that conform to ACIDs is a slightly different kind of consistency. In ACID, consistency refers to the guarantee that, upon completion of a transaction, the database is in a consistent state. For example, when transferring money between accounts, the amount of money in the accounts should not change. In ACID-compliant systems, this kind of consistency is typically ensured by the use of transactions and database tools to ensure data integrity.

Consistency - Client and Server

There are two perspectives on consistency. One from the point of view of the developer / client: how they see data updates. The second is from the server side: how are updates in the system and what can the system guarantee regarding updates.

Consistency from the point of view of the client

From the point of view of the client, we have the following components:
Storage system . At the moment, we are considering it as a black box. But we take into account that inside something is highly scalable and distributed, built to ensure stability and accessibility.
Process A . A process that writes to and reads from a storage system.
Processes B and C . Two processes independent of process A, which also write and read from the storage system. It doesn’t matter if they are processes or threads of one process. The only important thing is that they are independent and must interact to exchange information.
Client (client-side) consistency determines how and when observers (in our case, processes A, B and C) see changes in the given object in the storage system. In the following examples, illustrating various types of consistency, process A performed an update of the data.
Strong consistency . After the update is completed, any subsequent access to the data (by Process A, B or C) will return the updated value.
Weak consistency . The system does not guarantee that subsequent data accesses will return the updated value. Before the updated value is returned, a number of conditions must be met. The period between updating and the moment when each observer is always guaranteed to see the updated value is calledwindow mismatch (inconsistency window) .
Eventual consistency . A special case of weak consistency. The system ensures that, in the absence of new data updates, ultimately, all queries will return the last updated value. If there are no failures, the maximum inconsistency window size can be determined based on factors such as communication delay, system load, and the number of replicas in accordance with the replication scheme. The most popular system that implements “consistency in the long run” is DNS. The updated record is distributed in accordance with the configuration parameters and cache interval settings. Ultimately, all customers will see the update.
Eventual consistency has many variations that are important to consider:
Causal consistency . If process A has informed process B that it has updated the data, then subsequent calls of process B to this data will return updated values and the record is guaranteed to replace the earlier one. Access to process C, which is not causally related to process A, is subject to the usual rules of eventual consistency.
Consistency Model Read-your-writes consistency . This is an important model in which process A, after updating data, always receives an updated value when it is accessed and never sees the older one. This is a special case of causal consistency.
Session consistency . This is a practical version of the previous model when a process gains access to the repository in a session context. As long as the session exists, the system guarantees Read-your-writes consistency. If the session ends due to some kind of failure, a new session should be created, which is guaranteed not to overlap with others.
Monotonic read consistency model . If the process saw a certain value, then, with subsequent access to this data, it will never get an older value.
Monotonic write consistency model. In this version, the system guarantees the ordering of the recording of one process. Systems that do not provide this level of consistency are difficult to use.
Some of these variations can be combined. For example, you can combine monotonic reads and session consistency. From a practical point of view, monotonic reads and read-your-writes are most desirable in systems that implement "consistency in the long run", but are not always necessary. This combination facilitates application development, while allowing storage to weaken consistency and provide high availability.
“Eventual consistency” is not some kind of esoteric poetry of extremal distributed systems. Many modern relational DBMSs that provide reliability by duplication to the backup server (primary-backup reliability) implement the replication mechanism in two modes: synchronous and asynchronous. In synchronous mode, the replica update is part of the transaction. In asynchronous mode, the update is delivered as backup, with some delay, often through the delivery of logs. In the latter case, if the primary server fails before the log has been delivered, then reading from the backup server raised instead of the primary will return the outdated data to us. Also, to provide better read scalability, relational DBMSs began to provide readability from a backup server,

Server side consistency

On the server side, we need to understand more deeply how updates are distributed in the system in order to understand what this or that method gives to the developer. Let's start by agreeing a few definitions:
N = the number of nodes that store copies (replicas) of data;
W = the number of replicas that must acknowledge receipt of the update before the update is considered complete;
R = number of replicas that are contacted when processing a request to read data.
If W + R> N, then the sets of replicas participating in the recording and participating in the reading always intersect, which can guarantee strong consistency. The mechanism of synchronous replication to the backup server of relational DBMSs is N = 2, W = 2, and R = 1. It doesn’t matter which replica the reading is from, actual data will always be read. With asynchronous replication and enabled reading from the backup server, N = 2, W = 1, and R = 1. In this case, R + W = N and data consistency cannot be guaranteed.
The problem with such configurations is that if it is impossible to write to W nodes due to a failure, the write operation should return an error, noting the unavailability of the system. For example, with N = 3, W = 3 and two available nodes, the system should generate a write error.
In distributed repositories, which should provide high performance and availability, the number of replicas is generally more than two. A system that focuses only on fault tolerance often uses N = 3 (with W = 2 and R = 2). Systems that need to handle very high reading loads often use more replicas than they need to provide fault tolerance. The value of N can be several tens, or even hundreds of nodes, with R = 1, so that a query to one node will return the result. Systems focused on data consistency set W = N for updates, which can reduce the likelihood of a successful recording. A frequent configuration for systems requiring fault tolerance but not requiring strong consistency is to work with W = 1,
How to configure N, W, and R depends on the use cases and performance requirements for different loads. For R = 1, N = W, we optimize the reading speed, and for W = 1, R = N, we optimize the system for very fast writing. Of course, in the latter case, the survival of the system in case of failures is not guaranteed, and for W <(N + 1) / 2, there is the possibility of conflicting records when multiple nodes do not intersect during various write operations.
Weak / eventual consistency occurs when W + R <= N. Those. there is a possibility that the sets of nodes do not intersect during writing and reading. If this is a deliberate step and not because of the requirements for fault tolerance, then setting R to something other than 1 makes little sense. Weak consistency arises in two main cases: the first is replication to many nodes to provide read scalability, as noted above, and the second is for more complex data access. On simple key-value systems, it is fairly easy to compare versions to determine which value was last recorded. But in systems that return sets of objects, it is more difficult to determine which of the sets is considered the last relevant. Most systems in which W The ability to implement read-your-writes, session, monotonic consistency models generally depends on binding the client to a specific server that provides work with the entire distributed system. When a client accesses the same server each time, the implementation of read-your-writes and monotonic reads is quite simple. At the same time, it is somewhat more difficult to implement load balancing and fault tolerance, but this is a simple solution. Using sessions, which are sticky, makes this explicit and provides an exposure level that clients can reason about (translation options are welcome) .
Sometimes read-your-writes and monotonic reads are implemented by the client. When adding versions to records, the client discards the value with versions less than the last one encountered.
Segmentation occurs when some nodes of the system cannot connect to other nodes, but both sets of nodes are accessible to clients. Using the classical quorum mechanism, a segment having W nodes can continue to function and receive updates, while another segment becomes inaccessible. Similar considerations apply to reading. Since the sets of nodes during reading and writing intersect by definition, a smaller set of nodes becomes inaccessible. Segmentation happens infrequently, but can occur both between data centers and inside data centers.
In some applications, the unavailability of some nodes is unacceptable, and it is important that a client that interacts with any segment can work normally. In this case, both segments define a new set of nodes for storing data, and a merge operation is performed when communication between the segments is restored.

Amazon dynamo

Amazon's dynamo is a system that allows you to configure all the parameters discussed above in accordance with the architecture of the application. This is a key-value storage system that is used in many e-commerce platform services and Amazon web services. One of the goals of Dynamo development is to allow service owners who use instances of the Dynamo storage system, often distributed across several data centers, to determine a compromise between consistency, stability, availability, and system performance.

Summarizing the above

Data inconsistency in highly scalable reliable distributed systems should be acceptable for two reasons: improving read and write performance when there are many competitive requests; segmentation processing, when otherwise it is necessary to declare that a part of the system is inaccessible, even if all nodes are working.
Whether inconsistency is acceptable depends on the client application. In any case, the developer must not forget what kind of consistency is provided by the storage system and take this into account when developing applications. There are a number of practical improvements to the eventual consistency model, such as session consistency and uniform reading, that make life easier for developers. Often, an application can use eventual consistency without any problems. A particular widespread case is a website on which we have the concept of consistency from the user's point of view. In this case, the inconsistency window should be less than the expected time for the user to go to the next page. This allows you to distribute the update on the system until the next read request.
The purpose of this article is to raise awareness of the complexity of systems that need to work on a global scale and require careful tuning to ensure that they can provide the application with the performance, availability, and resilience that it needs. One of the things that distributed system designers have to work with is the size of the inconsistency window, during which system customers can experience the realities of developing highly scalable systems.

Comments and suggestions for improving the translation are welcome.

Tags: