
Databases and NoSQL (Chapter 4 of the book “Scalable and High-Performance Web Applications”)
In this chapter we discuss databases, both relational and NoSQL, that run on a single machine. This mode of operation is the building block from which distributed databases are constructed.
Knowledge accumulation
The most important criterion for choosing a database is the knowledge accumulated about it. For example, the relational model is several decades old, and over this time an enormous amount of knowledge has accumulated about how to fit a given subject area into it properly. Around widespread databases such as MySQL, a wealth of information has likewise accumulated about their behavior under different conditions, and about which systems can in principle be built on top of them, possibly with some changes or adaptations.
This wealth should never be discounted. A new, unfamiliar system will have to go through its own period of knowledge accumulation within each development team. How does the domain fit into this database's data model? How does the database behave under different conditions, with different access patterns? How does it behave when hardware fails? How far can it be pushed beyond its standard use cases? Where can advice or someone else's experience with this database be obtained, for free or for money?
Relational data model
The relational data model is several decades old. In the relational model, a database consists of tables, which consist of rows and columns. Each column has its own data type (string, number, boolean, date, text, binary blob). All rows of a table have the same structure.

Typically, each type of object is stored in a separate table (for example, a user table or a project table), and each object has a unique identifier. The identifier can be surrogate, that is, simply a number, or it can come from the subject area, such as a person's passport number or an ISBN for a book. Surrogate identifiers are the usual choice.
Objects (tables) can be linked to each other by identifier. For example, if we have a department table and an employee table, the department table contains a department identifier, while the employee table contains an employee identifier and the identifier of the department the employee belongs to. In relational database theory this case is called “one to many” (one department has many employees).

The many-to-many case is also possible. For example, there is a project table and a developer table. Many developers can work on one project, and one developer can work on several projects. In this case a third table is usually created: a link table with two fields, a project identifier and a developer identifier. Each link between a developer and a project is expressed as a row in the link table. If a developer has not yet been assigned to any project, the link table simply contains no record referencing that developer.

Relational database servers provide standard statements for accessing table data, such as SELECT, INSERT, UPDATE, and DELETE. Different servers also provide additional statements of their own. Data can be retrieved from tables by many different criteria. There is a “core” of the SQL standard supported by almost all servers, and there are always extensions of the standard that can be used when working with a specific database server.
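As an illustration, here is a minimal sketch of both kinds of relationships and of the basic statements, written with Python's built-in sqlite3 module; all table and column names are invented for this example.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the sketch
    cur = conn.cursor()

    # One to many: each employee row references its department by identifier.
    cur.execute("CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT,"
                " department_id INTEGER REFERENCES department(id))")

    # Many to many: a separate link table holding two identifiers.
    cur.execute("CREATE TABLE project (id INTEGER PRIMARY KEY, title TEXT)")
    cur.execute("CREATE TABLE developer (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("CREATE TABLE project_developer (project_id INTEGER, developer_id INTEGER)")

    # The basic statements: INSERT, UPDATE, SELECT, DELETE.
    cur.execute("INSERT INTO department (id, name) VALUES (1, 'Sales')")
    cur.execute("INSERT INTO employee (id, name, department_id) VALUES (10, 'Alice', 1)")
    cur.execute("UPDATE employee SET name = 'Alice B.' WHERE id = 10")
    rows = cur.execute("SELECT name FROM employee WHERE department_id = 1").fetchall()
    cur.execute("DELETE FROM employee WHERE id = 10")
    conn.commit()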
Access optimization
The “rigidity” of the relational data model allows various optimizations of data access. The most obvious example is creating indexes on table fields for fast lookups. For example, on the employee table you can create an index on the “department ID” field, and then the operation “get the list of employees of a department” will run faster. An index is simply a materialized data structure (see Chapter 1), such as a B-tree or a hash table. It is important to understand how this data structure works so that you can predict how it will behave in a particular case.
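A minimal sketch of creating and using such an index, again in sqlite3 (where ordinary indexes are implemented as B-trees); the employee table is the same hypothetical one as above.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, department_id INTEGER)")

    # The index is a separate, materialized structure (a B-tree here) keyed by
    # department_id, so the query below no longer has to scan the whole table.
    conn.execute("CREATE INDEX idx_employee_department ON employee (department_id)")
    rows = conn.execute(
        "SELECT id, name FROM employee WHERE department_id = ?", (1,)
    ).fetchall()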
Normalization and denormalization of the data model play an important role in relational database design. A database is in normal form when information in it is not duplicated. For speed and efficiency, the database is sometimes denormalized, and duplicated information appears in it.

For example, suppose we have a customer table and a sales table, and some customers are considered “important” because their total purchases exceed N. Each time we need to know this, we could fetch the customer's list of sales, sum the amounts, and compare the total with N. For speed, we can instead add a field to the customer table, an “important” flag, and keep it permanently up to date: for example, when a new sale is recorded, check whether the total has exceeded N and, if so, set the flag to TRUE. Programming errors can let such fields get out of sync with the underlying data, and then the database has to be repaired.
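A minimal sketch of maintaining such a denormalized flag at the application level, again in sqlite3; the table layout and the threshold N are invented for the example.

    import sqlite3

    N = 1000  # hypothetical "important customer" threshold

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, important INTEGER DEFAULT 0)")
    conn.execute("CREATE TABLE sale (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")

    def record_sale(customer_id, amount):
        # Insert the sale and keep the denormalized flag in sync in one transaction.
        with conn:
            conn.execute("INSERT INTO sale (customer_id, amount) VALUES (?, ?)", (customer_id, amount))
            total = conn.execute(
                "SELECT COALESCE(SUM(amount), 0) FROM sale WHERE customer_id = ?", (customer_id,)
            ).fetchone()[0]
            if total > N:
                conn.execute("UPDATE customer SET important = 1 WHERE id = ?", (customer_id,))

    conn.execute("INSERT INTO customer (id) VALUES (1)")
    record_sale(1, 600)
    record_sale(1, 700)  # the total now exceeds N, so the flag is set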
Successful denormalization can greatly increase performance. However, denormalization is not a panacea; it can lead to negative consequences.
How do you work efficiently with data structures such as hierarchical trees or graphs in a relational database? Vast experience has accumulated in this area over the years. For example, hierarchical trees can be stored using materialized paths for speed.
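A minimal sketch of the materialized-path technique: each node stores the full chain of its ancestors' identifiers as a string, so a whole subtree can be fetched with a single prefix query. The table and the data are invented for the example.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT, path TEXT)")

    # The path column encodes the chain of ancestors: '1/' -> '1/2/' -> '1/2/5/'.
    conn.executemany(
        "INSERT INTO category (id, name, path) VALUES (?, ?, ?)",
        [(1, "Books", "1/"), (2, "Databases", "1/2/"),
         (5, "NoSQL", "1/2/5/"), (7, "Cooking", "7/")],
    )

    # The "Books" node and all its descendants with one prefix query, no recursive joins.
    subtree = conn.execute(
        "SELECT id, name FROM category WHERE path LIKE ?", ("1/%",)
    ).fetchall()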
Key-value style data can be stored either as an obvious three-column “object_id-key-value” table or (sometimes) as a “wide” table with one column per key.
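A sketch of the three-column layout (the alternative “wide” layout would instead give each key its own column); names and data are invented for the example.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # Three-column layout: any object can carry any set of keys.
    conn.execute("CREATE TABLE property (object_id INTEGER, key TEXT, value TEXT)")
    conn.executemany(
        "INSERT INTO property (object_id, key, value) VALUES (?, ?, ?)",
        [(42, "color", "red"), (42, "size", "XL"), (43, "color", "blue")],
    )

    # Reassemble object 42 as an ordinary key-value dictionary.
    obj = dict(conn.execute("SELECT key, value FROM property WHERE object_id = ?", (42,)))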
Some data structures, beyond a certain size, are nearly impossible to fit into a relational database efficiently, and specialized solutions have to be used. For example, a friendship graph of a billion people is nearly impossible to process with standard graph algorithms within the relational model, even on modern hardware.
Other data models
One of the meanings of the term “NoSQL” is a departure from the relational model in favor of more specialized (or more generalized) data models. For example, traditionally successful NoSQL systems are key-value stores such as Redis or Memcache. Their data model is extremely simple: it is essentially an associative array where keys are strings and values can contain arbitrary data. Like any associative array, such systems support a limited set of operations on the data: read the value by key, set the value for a key, delete a key and its associated value. An operation like “get the list of all keys” may not be supported at all.
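The entire interface of such a system fits in a few calls. A minimal sketch using the redis-py client, assuming a Redis server running on localhost; the key name is invented for the example.

    import redis  # assumes the redis-py package and a locally running Redis server

    r = redis.Redis(host="localhost", port=6379)

    r.set("user:42:name", "Alice")   # set the value for a key
    name = r.get("user:42:name")     # read it back (bytes, or None if the key is absent)
    r.delete("user:42:name")         # delete the key and its value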
Another example of successful NoSQL systems is document stores. Objects in such stores are usually associative arrays of free structure; that is, essentially different objects can be stored in the same “table”. Examples of systems of this class are MongoDB and Cassandra. Depending on what data is actually stored in a particular database, its performance can vary greatly; for example, such a “table” can be optimized by storing only objects of the same type in it.
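For illustration, here is a sketch of what “objects of free structure in the same table” means, written as plain Python dictionaries; a document store such as MongoDB would accept both shapes into a single collection. All field names and values are invented.

    # Two documents with different sets of fields that could live in one collection.
    user_a = {"_id": 1, "name": "Alice", "email": "alice@example.com"}
    user_b = {"_id": 2, "name": "Bob",
              "phones": ["+1-555-0100", "+1-555-0199"],
              "address": {"city": "Springfield"}}

    # Code that reads such a collection must be prepared for missing fields.
    for doc in (user_a, user_b):
        print(doc["name"], doc.get("email", "<no email>"))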
A third example of specialized NoSQL systems is graph databases. They are tailored to processing one specific data structure, and usually to large amounts of data (on small volumes a standard relational implementation may do just fine).
A very important example of NoSQL systems is ordinary file systems such as ext4 or NTFS. They are designed to store objects with free-format content in a hierarchical structure. Databases themselves, relational and NoSQL alike, usually store their content on top of a file system, and in some cases the interaction between these two subsystems becomes important.

Another important class is full-text search engines such as Elasticsearch or Google's search engine.
Large volumes and complex algorithms
The fundamental problem in designing a system on top of databases is that almost any design works on relatively small amounts of data, and almost any design gradually stops working as the amount of data grows. This means that as the system develops and its data grows, you have to rethink how the data is handled, change the storage model, or even replace the database server with a different one.

It is traditionally believed that each next order of magnitude of data requires redesigning the database. People sometimes try to fight this by designing the database two or three orders of magnitude ahead, but this is not always fully achievable. Coping with ever-growing data is, in general, an unsolved engineering problem.

Another common problem is the sudden need to apply new algorithms to existing data, usually with demanding performance requirements. For example, a company stores all information about its sales in a form suitable for accounting and monthly reports. But the demands of the moment call for daily and hourly analysis of the sales history and for business decisions based on it: which stores to ship goods to, which advertising campaigns to launch, what else to offer to people who buy certain goods. Such algorithms may require a fundamental change in how the data is stored, while staying compatible with the existing system and its data. Working under such conditions is likewise an unsolved engineering problem.
Equipment Failure Behavior
Any equipment fails sooner or later: disks, memory, processors, power supply, and so on. In this chapter we consider the case of a single physical machine running the database server. Suppose this machine suddenly loses power. After power is restored, it boots up again and starts the database server. What happens to the data?
Each database system, relational and NoSQL, has its own strategy for handling such failures.
Generally speaking, a “zero strategy” is possible when all data is simply lost and the database becomes empty. An example of a highly successful NoSQL system with such a strategy is Memcache.
ACID: Atomicity, Consistency, Isolation and Durability
Relational database systems traditionally support some strategy that provides the set of guarantees called ACID: atomicity, consistency, isolation, durability. These terms refer to transaction processing.

A transaction is a set of operations treated as a single whole. The classic example is transferring money between two bank accounts: we must decrease the amount in one account and at the same time increase the amount in the other.

Atomicity is the guarantee that, whatever the hardware does, either both of these operations are performed or neither is. Even if we have already “withdrawn the money from one account” and a power surge happens at that very microsecond, after the database is rebooted and brought back into operation we will again see the original amount in the source account.
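A minimal sketch of such a transfer as one transaction, again in sqlite3; the account table is invented for the example. If anything fails between the two updates, the rollback returns the database to its state before the transaction.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
    conn.executemany("INSERT INTO account (id, balance) VALUES (?, ?)",
                     [(1, 100.0), (2, 50.0)])
    conn.commit()

    def transfer(src, dst, amount):
        # The "with" block is one transaction: both updates are committed together,
        # or rolled back together if anything inside raises an exception.
        with conn:
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))

    transfer(1, 2, 30.0)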
Consistency is the least clearly defined guarantee. Moreover, the same term is used in the statement of the CAP theorem (discussed below), where it means something different, though related. Most generally, consistency guarantees some “reasonable” behavior of the database, such that the programmer gets no particular surprises while working with it, including in case of equipment failure.

Isolation means that other concurrently running operations do not “see” the intermediate states of a transaction. For example, suppose we have computed the total amount across all accounts. If we now start transferring money, have just “withdrawn the money from one account”, and at that very microsecond another process computes the “total across all accounts” again, it will get the same total, not a smaller one.

Durability means that once a transaction has completed successfully, its results will not be lost under any circumstances. For example, we transfer money, close the transaction, and receive a message from the server that it completed successfully; a microsecond later a power surge occurs. Durability guarantees that when the machine boots up again and returns to operation, the record of the transfer will be in the database.

Traditionally, databases that support ACID also allow it to be violated to some degree by means of so-called transaction isolation levels. For example, at the “read uncommitted” level, concurrent transactions can “see” the intermediate states of other transactions.
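A sketch of how a weaker isolation level is typically requested from the server. The assumptions here: the psycopg2 driver and a running PostgreSQL server, a placeholder connection string, and a hypothetical table name; the SET TRANSACTION wording follows the SQL standard, but servers differ in which levels they actually honor (PostgreSQL, for example, treats READ UNCOMMITTED as READ COMMITTED, while MySQL supports it).

    import psycopg2  # assumes the psycopg2 driver and a running PostgreSQL server

    conn = psycopg2.connect("dbname=test")  # placeholder connection parameters
    cur = conn.cursor()

    # Ask for a weaker isolation level for the current transaction.
    cur.execute("SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED")
    cur.execute("SELECT count(*) FROM some_table")  # hypothetical table
    print(cur.fetchone()[0])
    conn.commit()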
Weakening of guarantees
In general, weakening guarantees often makes it possible to increase efficiency, at the price of special requirements on how the results are interpreted (for example, they may be insufficiently accurate or simply incorrect). In some cases this is justified: if we want to show the total number of registered users on a site, we are generally not interested in the exact value; it is enough to say “about a hundred” or just show some number, because nobody will ever be able to confirm or refute any specific value.

Many NoSQL systems simply refuse to support ACID and instead declare their own set of guarantees, which can lie anywhere on the spectrum from “none” to more or less “full ACID”. For example, some versions of some systems can simply leave the database in a corrupted state when the machine fails, so that manual or semi-automatic recovery is required after a reboot, and there is no guarantee that all written data will survive.

Guarantees that are weakened at the level of an individual machine can be “restored”, and an appreciably more reliable system can even be built on top of them, if physical machines are combined into a network and a special mode of working with them is adopted. See below for details.
Concurrent Database Access
Typically, a database has many clients that perform read and write operations simultaneously. The database must nonetheless fulfill the guarantees it declares. For example, relational databases typically provide transaction isolation (see above).

Supporting concurrent access to a database often requires significant effort from the server's developers, who must ensure both the speed and the reliability of such access. Many different algorithms and data structures underlie concurrent access.

For example, to add a record to a table, we may need to allocate a new page in the table and also update the index. If another client adds another record in parallel, does it need to allocate another page (or can it use the same one)? Does it update the index again (or can the two index updates be combined)? What if the first client started a transaction, announced the insertion of a record, waited two seconds, and then rolled the transaction back? What if one client incremented a field by one while another decremented it by one? And what if a power surge can occur at any microsecond, and after a reboot the system must return to a “correct” state despite all the numerous combinations of intermediate states?
Supporting concurrent access while maintaining performance is a huge and complex engineering challenge. All database servers solve it with more or less standard approaches, but the specific implementations of these approaches, and the subtleties associated with them, differ from server to server.

It is traditionally believed that an order-of-magnitude increase in the number of simultaneous clients requires a review of the database architecture. In general terms, this too is an unsolved engineering problem.
Administrative functions
All database servers provide many administrative functions related to the server's life on an individual machine. Among them are backup; restoring from backup; compacting the space occupied by tables; distributing data files across disks and file systems; network access to the database server (see also the corresponding chapter of the book) and the efficiency of such access.

Some servers can also make effective use of special operating system features (often themselves designed with database servers in mind). A typical example is support for asynchronous I/O.

There are also administrative functions associated with combining physical machines into a network, for example configuring the replication topology and managing machines in clusters. See below for details.
All these, and many other functions, are implemented in different ways in different database servers. Administrative functions, their sophistication and convenience, are an important criterion for choosing a database server suitable for a specific task.
Modern database servers provide many performance-tuning options. Comparing the speed of different databases under specially arranged conditions is a fascinating but not always meaningful activity.

It is important to understand that any server tuned for particular conditions can always be brought to its knees by changing the data access patterns, increasing the number of clients, or increasing the amount of stored data. Access patterns change as the system evolves; the number of clients grows with the system's popularity; the volume of stored data usually grows as well. All this makes old measurements and past successes irrelevant, so the fine-tuning process has to be repeated, and sometimes the data access architecture has to be reconsidered.
Distributed databases
As we have said, any physical machine can fail at any moment. In addition, any physical machine has a limit to the performance it can provide. These two circumstances push us to combine machines into a network and treat them as a distributed database.

Distributed databases force us to think again about all the questions we discussed for a single physical machine: the data model, the data access protocol, and the guarantees provided in case of equipment failure.
We will discuss this issue in more detail in the next chapter of our book.