NoSQL Databases: Understanding the Essence
Recently the term “NoSQL” has become very fashionable: all kinds of software products are actively developed and promoted under this banner. NoSQL has become synonymous with huge volumes of data, linear scalability, clusters, fault tolerance, and non-relational models. Yet few people have a clear idea of what NoSQL stores actually are, how the term appeared, and what common characteristics they share. Let's try to close this gap.

History.
The most curious thing about the term is that, although it was first used in the late 1990s, it acquired its present meaning only in mid-2009. Originally it was the name of an open-source database created by Carlo Strozzi, which stored all its data as ASCII files and used shell scripts instead of SQL to access them. That database had nothing to do with “NoSQL” in its current sense.
In June 2009 Johan Oskarsson organized a meetup in San Francisco to discuss new trends in the data storage and processing market. The main impetus for the meeting was a new wave of open-source products in the spirit of Google BigTable and Amazon Dynamo. To give the meetup a catchy banner, a capacious yet concise term was needed, one that would fit neatly into a Twitter hashtag. Eric Evans of Rackspace suggested “NoSQL”. The term was intended for a single meetup and carried no deep meaning, but it spread across the web like a viral ad and became the de facto name of a whole direction in the IT industry. Incidentally, the projects presented at the meetup included Voldemort (a clone of Amazon Dynamo), Cassandra and HBase (analogues of Google BigTable), Hypertable, CouchDB, and MongoDB.
It is worth emphasizing once again that the term “NoSQL” arose entirely spontaneously and has no universally accepted definition or scientific institution behind it. The name rather characterizes a vector of IT development away from relational databases. It is usually read as Not Only SQL, although some insist on the literal meaning No SQL. Pramod Sadalage and Martin Fowler attempted to group and systematize knowledge about the NoSQL world in their recent book “NoSQL Distilled”.
NoSQL Database Features
There are few characteristics common to all NoSQL systems, since a great many heterogeneous products now hide behind the NoSQL label (probably the most complete list can be found at http://nosql-database.org/ ). Many characteristics apply only to certain NoSQL databases, and I will point this out as I go through the list.
1. SQL is not used
This refers to ANSI SQL DML: many databases try to offer query languages resembling the familiar, well-loved syntax, but nobody has managed to implement it in full and hardly anyone will. That said, there are rumored to be startups trying to implement SQL on top of Hadoop, for example ( http://www.drawntoscalehq.com/ and http://www.hadapt.com/ ).
2. Unstructured (schemaless)
The point is that in NoSQL databases, unlike relational ones, the data structure is not regulated (or is weakly typed, to borrow an analogy from programming languages): you can add an arbitrary field to an individual row or document without first declaratively changing the structure of the whole table. So if the data model needs to change, the only action required is to reflect the change in the application code.
For example, when renaming a field in MongoDB:
BasicDBObject order = new BasicDBObject();
order.put("date", orderDate);     // this field has been here for a long time
order.put("totalSum", total);     // previously we simply used "sum"
If we change the application logic, we expect the new field when reading as well. But because there is no data schema, the totalSum field is missing from the other, already existing Order objects. In this situation there are two options. The first is to walk through all documents and update the field in every existing one. Because of the data volume this process is not instantaneous and takes no locks (unlike an alter table rename column statement), so while the update is running, other processes can still read documents in the old format. Therefore the second option, a check in the application code, is unavoidable:
DBObject order = orders.findOne(query);        // orders is the DBCollection, query selects the document
Double totalSum = (Double) order.get("sum");   // the old model
if (totalSum == null) {
    totalSum = (Double) order.get("totalSum"); // the updated model
}
And when the document is written back, we save this field to the database in the new format.
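A minimal sketch of that write-back, assuming the same orders collection and the order document from the snippet above (both names are just placeholders for this example):
order.put("totalSum", totalSum);  // store the value under the new field name
order.removeField("sum");         // drop the legacy field
orders.save(order);               // write the whole document back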
A pleasant consequence of having no schema is efficiency with sparse data. If one document has a date_published field and a second one does not, no empty date_published field is created for the second. This is logical in itself, but a less obvious example is column-family NoSQL databases, which use the familiar concepts of tables and columns. Because there is no schema, however, columns are not declared up front and can be added or changed during a client's session with the database. In particular, this makes it possible to use dynamic columns to implement lists, as in the sketch below.
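Purely as a conceptual illustration (plain Java collections, not any particular client API), a column-family row with dynamic columns might be modeled like this; all names and values are made up:
// One row of a column family: a row key plus an arbitrary, sorted set of columns.
// A list is represented by creating a new column per element, with no schema change.
java.util.SortedMap<String, String> orderRow = new java.util.TreeMap<>();
orderRow.put("customer", "42");
orderRow.put("item:0001", "NoSQL Distilled");  // columns created on the fly
orderRow.put("item:0002", "Refactoring");      // sorted keys keep the list in order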
The lack of a schema has its drawbacks: besides the above-mentioned overhead in application code when the data model changes, the database offers no constraints of any kind (not null, unique, check constraints, and so on), and it becomes harder to understand and control the data structure when several projects work with the same database in parallel (there are no data dictionaries on the database side). In a rapidly changing world, however, such flexibility is still an advantage. A good example is Twitter, which five years ago stored only a little extra information alongside each tweet (timestamp, Twitter handle, and a few more bytes of metadata), whereas now several kilobytes of metadata are stored with each message.
(Hereinafter I mostly mean key-value, document, and column-family databases; graph databases may not have these properties.)
3. Data is represented as aggregates.
Unlike the relational model, which for the sake of normalization splits an application's logical business entity across several physical tables, NoSQL stores operate on these entities as integral objects:

This example shows aggregates for the standard conceptual relational e-commerce model “order - order items - payments - product”. In both cases the order is combined with its line items into one logical object, and each line item holds a reference to the product plus a few of its attributes, for example the name (such denormalization is needed so that the product object does not have to be fetched when an order is retrieved; the main rule of distributed systems is to minimize “joins” between objects). In one aggregate the payments are embedded in the order and form an integral part of it; in the other they are kept as a separate object. This illustrates the main rule for designing a data structure in NoSQL databases: it must follow the requirements of the application and be optimized as much as possible for the most frequent queries.
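As a rough sketch, such an order aggregate could be built as a single MongoDB document with the Java driver; every field name and value below is made up for the example:
BasicDBList items = new BasicDBList();
items.add(new BasicDBObject("productId", 27)
        .append("name", "NoSQL Distilled")   // denormalized product name
        .append("price", 32.45)
        .append("quantity", 1));

BasicDBList payments = new BasicDBList();
payments.add(new BasicDBObject("card", "****1234").append("amount", 32.45));

BasicDBObject order = new BasicDBObject("customerId", 42)
        .append("date", new java.util.Date())
        .append("items", items)        // line items live inside the order
        .append("payments", payments); // payments embedded in the same aggregate
orders.insert(order);                  // orders is a DBCollection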
Many will object that working with large, often denormalized objects causes numerous problems when you run arbitrary queries that do not fit the structure of the aggregates. Suppose we normally work with orders together with their line items and payments (that is how the application operates), but the business asks us to calculate how many units of a particular product were sold last month. Then, instead of scanning an OrderItem table (as in the relational model), we would have to retrieve entire orders from the NoSQL store, even though we do not need most of that information. Unfortunately, this is the compromise a distributed system forces on us: we cannot normalize the data as in a conventional single-server system, so queries that cut across aggregates become expensive.
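A sketch of what such an off-aggregate query costs, assuming the order documents from the example above: to count the units sold of one product we have to pull whole orders and dig the line items out in application code (the date filter is omitted for brevity):
int unitsSold = 0;
DBCursor cursor = orders.find();                  // full scan over entire order aggregates
for (DBObject doc : cursor) {
    BasicDBList items = (BasicDBList) doc.get("items");
    for (Object o : items) {
        DBObject item = (DBObject) o;
        if (Integer.valueOf(27).equals(item.get("productId"))) {   // the product we care about
            unitsSold += ((Number) item.get("quantity")).intValue();
        }
    }
}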
I have tried to group the pros and cons of both approaches in a small table:

4. Weak ACID properties.
For a long time data consistency was a “sacred cow” for architects and developers. All relational databases provided one isolation level or another, either through locks on writes and blocking reads, or through undo logs. With the arrival of huge volumes of data and distributed systems, it became clear that providing a transactional set of operations over them while also getting high availability and fast response times is impossible. Moreover, even an update of a single record does not guarantee that any other user will see the change immediately: the change may happen, for example, on the master node while the replica is copied asynchronously to the slave node that the other user is working with. In that case the second user sees the result only after some delay. This is called eventual consistency, and it is what all the major Internet companies are practicing today, including Facebook and Amazon. The latter proudly declare that the maximum interval during which a user may see inconsistent data does not exceed one second. An example of such a situation is shown in the figure:

The logical question is: what about systems that traditionally place high demands on the atomicity and consistency of operations yet also need fast distributed clusters - financial systems, online stores, and so on? Practice shows that those demands have long since been relaxed. Here is what one developer of banking software said: “If we really waited for every transaction to complete across the global ATM network, transactions would take so long that customers would run away in a rage. What happens if you and your partner withdraw money at the same time and go over the limit? You both get the cash, and we sort it out later.” Another example is the hotel booking shown in the picture. Online stores whose data policy implies eventual consistency must provide measures for such situations (automatic conflict resolution, rolling the operation back, compensating with other data). In practice hotels usually keep a “pool” of free rooms for unforeseen cases, and that can be the way a disputed booking is resolved.
In fact, weak ACID properties do not mean that they are absent altogether. In most cases an application working with a relational database needs a transaction to change logically related objects (an order plus its order items) because they live in different tables. With a properly designed data model in a NoSQL database (the aggregate is the order together with its list of items), changing a single record can give you the same level of isolation as a relational database does, as the sketch below shows.
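A sketch of that idea with the MongoDB Java driver: adding a line item touches exactly one document, and MongoDB applies an update to a single document atomically (field names follow the earlier example, orderId is a placeholder):
DBObject newItem = new BasicDBObject("productId", 31)
        .append("name", "Refactoring")
        .append("quantity", 1);
// One document, one operation: the order and its items change together or not at all.
orders.update(new BasicDBObject("_id", orderId),
              new BasicDBObject("$push", new BasicDBObject("items", newItem)));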
5. Distributed systems without shared resources (shared nothing).
Again, this does not apply to graph databases, whose very structure is hard to distribute across remote nodes.
This is perhaps the main leitmotif of NoSQL development. With the avalanche-like growth of information in the world and the need to process it in reasonable time, the problem of vertical scalability became acute: CPU clock speeds stopped growing at around 3.5 GHz, disk read speeds also improve at a modest pace, and a powerful server always costs more than several simple servers put together. In this situation conventional relational databases, even clustered over a disk array, cannot solve the problem of speed, scalability, and throughput. The only way out is horizontal scaling, where several independent servers are connected by a fast network and each owns and processes only part of the data and/or only part of the read and update requests. In such an architecture, to increase capacity (storage, response time, throughput) you only need to add another server to the cluster, and that is all. The database itself takes care of sharding, replication, fault tolerance (a result is returned even if one or several servers stop responding), and data redistribution when a node is added. Briefly, the main properties of distributed NoSQL databases are:
Replication - copying data to other nodes when it is updated. Replication both enables greater scalability and improves the availability and safety of the data. It is customarily divided into two types:
master-slave:

and peer-to-peer:

The first type gives good read scalability (reads can be served by any node) but unscalable writes (writes go only to the master node); a conceptual sketch of this routing rule follows below. There are also subtleties in keeping the system continuously available: if the master fails, one of the remaining nodes is promoted in its place, either manually or automatically. The second type of replication assumes that all nodes are equal and can serve both read and write requests.
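Purely as a conceptual sketch (plain Java, not any real driver API, with Node standing in for a connection to one server), the routing rule implied by master-slave replication looks roughly like this:
// Writes always go to the master, reads are spread round-robin across all nodes.
class ReplicaRouter {
    private final Node master;
    private final java.util.List<Node> allNodes;   // master plus slaves
    private int next = 0;

    ReplicaRouter(Node master, java.util.List<Node> allNodes) {
        this.master = master;
        this.allNodes = allNodes;
    }

    Node forWrite() { return master; }

    Node forRead() {
        Node n = allNodes.get(next);
        next = (next + 1) % allNodes.size();
        return n;
    }
}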
Sharding - partitioning the data across nodes:

Sharding was often used as a “crutch” for relational databases to improve speed and throughput: the application itself partitioned the data across several independent databases and, when some data was needed, went to the specific database holding it (a sketch of this approach follows below). In NoSQL databases sharding, like replication, is performed automatically by the database itself, and the application is kept away from these complex mechanisms.
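A sketch of that application-side “crutch”: the shard is chosen from a key by a hash-modulo rule, and the query then goes to that one database only (the DataSource list and the key are illustrative):
// Application-side sharding over several independent relational databases.
class ShardSelector {
    private final java.util.List<javax.sql.DataSource> shards;  // one pool per database

    ShardSelector(java.util.List<javax.sql.DataSource> shards) {
        this.shards = shards;
    }

    javax.sql.DataSource shardFor(String customerId) {
        // The modulo is taken first, so Math.abs never sees Integer.MIN_VALUE.
        int index = Math.abs(customerId.hashCode() % shards.size());
        return shards.get(index);
    }
}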
6. NoSQL databases are mostly open source and created in the 21st century.
It is on the second point that Sadalage and Fowler decline to classify object databases as NoSQL (although http://nosql-database.org/ includes them in the general list), since they were created back in the 1990s and never gained much popularity.
I also wanted to dwell on the classification of NoSQL databases, but I will perhaps do that in the next article, if it is of interest to Habr readers.
Summary.
The NoSQL movement is gaining popularity at a tremendous pace. However, this does not mean that relational databases are becoming a relic or something archaic. Most likely they will keep being used as before, but more and more often NoSQL databases will work in symbiosis with them. We are entering an era of polyglot persistence, an era in which different data stores are used for different needs. Relational databases no longer hold a monopoly as the only possible data source. Increasingly, architects choose a store based on the nature of the data itself, on how it will be manipulated, and on the expected volume of information. And so everything is only getting more interesting.