Are relational databases doomed?
Translator's note: although the article is quite old (published 2 years ago) and has a big name, it still gives a good idea about the differences between relational databases and NoSQL databases, their advantages and disadvantages, and also provides a brief overview of non-relational repositories.
Recently, many non-relational databases have appeared. This suggests that if you need virtually unlimited scalability on demand, you need a non-relational database.
If this is true, does this mean that powerful relational databases have become vulnerable? Does this mean that the days of relational databases pass and will soon pass completely? In this article, we will consider the popular trend of non-relational databases in relation to various situations and see if this will affect the future of relational databases.
Relational databases have been around for about 30 years. During this time, several revolutions broke out, which were supposed to put an end to relational repositories. Of course, none of these revolutions took place, and one of them did not shake the position of relational databases a bit.
Let's start with the basics
A relational database is a set of tables (entities). Tables consist of columns and rows (tuples). Constraints can be defined inside tables; relationships exist between tables. Using SQL, you can execute queries that return datasets obtained from one or more tables. Within one query, data is obtained from several tables by joining them (JOIN), most often the same columns are used to join, which determine the relationship between the tables. Normalization is the process of structuring a data model that provides connectivity and lack of redundancy in the data.
Access to relational databases is through relational database management systems (RDBMS). Almost all of the database systems we use are relational, such as Oracle, SQL Server, MySQL, Sybase, DB2, TeraData, and so on.
The reasons for this dominance are not obvious. Throughout the existence of relational databases, they constantly offered the best mixture of simplicity, stability, flexibility, performance, scalability and compatibility in the field of data management.
However, to provide all of these features, relational stores are incredibly complex inside. For example, a simple SELECT query may have hundreds of potential execution paths that the optimizer will evaluate directly during query execution. All of this is hidden from users, but inside the RDBMS it creates an execution plan based on things like costing algorithms that best suits the request.
Relational Database Issues
Although relational storages provide the best mixture of simplicity, stability, flexibility, performance, scalability and compatibility, their performance for each of these items is not necessarily higher than that of similar systems focused on one particular feature. This was not a big problem, since the general dominance of relational DBMSs outweighed any flaws. However, if conventional RDBs did not meet the needs, alternatives always existed.
Today the situation is a little different. The variety of applications is growing, and with it the importance of these features is growing. And with the growing number of databases, one feature begins to overshadow all the others. This is scalability. As more and more applications work under high load conditions, such as web services, their scalability requirements can change very quickly and grow rapidly. The first problem can be very difficult to resolve if you have a relational database located on your own server. Suppose server load tripled overnight. How fast can you upgrade hardware? The solution to the second problem also causes difficulties in the case of using relational databases.
Relational databases scale well only if they are located on a single server. When the resources of this server run out, you will need to add more machines and distribute the load between them. And here the complexity of relational databases begins to play against scalability. If you try to increase the number of servers not up to a few, but up to hundreds or thousands, the complexity will increase by an order of magnitude, and the characteristics that make relational databases so attractive will rapidly reduce to zero the chances of using them as a platform for large distributed systems.
To stay competitive, cloud service vendors have to deal with this limitation somehow, because what kind of cloud platform is this without scalable data storage. Therefore, vendors have only one option left if they want to provide users with a scalable place to store data. It is necessary to use other types of databases that have a higher ability to scale, albeit at the cost of other features available in relational databases.
These benefits, as well as the existing demand for them, have led to a wave of new database management systems.
This type of database is commonly called a key-value store. In fact, there is no official name, so you can find it in the context of document-oriented, attribute-oriented, distributed databases (although they can also be relational), sharded sorted arrays, distributed hash tables and storages key-value type. And although each of these names indicates specific features of the system, all of them are variations on a topic that we will call a key-value storage.
However, whatever you call it, this “new” type of database is not so new and has always been used mainly for applications for which the use of relational databases would be unsuitable. However, without the need for the web and the cloud for scalability, these systems remained not much in demand. Now the task is to determine which type of storage is more suitable for a particular system.
Relational databases and key-value storages differ fundamentally and are designed to solve various problems. Comparison of characteristics will only allow you to understand the difference between them, but let's start with this:
|Relational database||Key-Value Storage|
|A database consists of tables, tables contain columns and rows, and rows consist of column values. All rows of one table have a single structure.||For domains, you can draw an analogy with tables, but unlike tables for domains, the data structure is not defined. A domain is a box in which you can put anything you want. Entries within the same domain can have a different structure.|
|Data Model 1 is predefined. It is strongly typed, contains restrictions and relationships to ensure data integrity.||Records are identified by a key, and each record has a dynamic set of attributes associated with it.|
|The data model is based on a natural representation of the data contained, and not on the functionality of the application.||In some implementations, attributes can only be string. In other implementations, attributes have simple data types that reflect the types used in programming: integers, an array of strings, and lists.|
|The data model is normalized to avoid data duplication. Normalization breeds relationships between tables. Relations relate data from different tables.||Between domains, as well as within the same domain, relations are not clearly defined.|
Key-value storage is record-oriented. This means that all information related to this record is stored with it. A domain (which you can think of as a table) can contain countless different records. For example, a domain may contain information about customers and orders. This means that data is usually duplicated between different domains. This is an acceptable approach because disk space is cheap. The main thing is that it allows you to store all related data in one place, which improves scalability, since the need to join data from different tables disappears. When using a relational database, it would be necessary to use connections to group the necessary information in one place.
Although the need for a relationship drops dramatically for storing key-value pairs, relationships are still needed. Such relationships usually exist between core entities. For example, an order system would have records that contain data about customers, products, and orders. It doesn’t matter if this data is in the same domain or in several. The bottom line is that when the buyer places an order, you most likely will not want to store information about the buyer and the order in one record.
Instead, the order entry should contain keys that point to the corresponding customer and product records. Since any information can be stored in the records, and relations are not defined in the data model itself, the database management system will not be able to control the integrity of the relations. This means that you can delete customers and the products they ordered. Ensuring data integrity rests entirely with the application.
|Relational database||Key-Value Storage|
|Data is created, updated, deleted, and queried using the Structured Query Language (SQL).||Data is created, updated, deleted, and queried using an API method call.|
|SQL queries can retrieve data both from a single table and from several tables, using connections (joins).||Some implementations provide SQL-like syntax for setting filtering conditions.|
|SQL queries can include aggregations and complex filters.||Often, you can use only the basic comparison operators (=,! =, <,>, <= And =>).|
|A relational database typically contains built-in logic, such as triggers, stored procedures, and functions.||All business logic and logic to support data integrity is contained in application code.|
Interaction with applications
|Relational database||Key-Value Storage|
|Most commonly used are native APIs, or generic ones, such as OLE DB or ODBC.||Most often, SOAP and / or REST APIs are used to access data.|
|Data is stored in a format that displays their natural structure, therefore mapping of application structures and relational database structures is necessary.||Data can be displayed more efficiently in the structure of the application, only code is needed to write data to objects.|
Key-Value Storage: Benefits
There are two distinct advantages of such systems over relational storage.
Suitable for cloud services
The first advantage of key-value storages is that they are simpler, which means they are more scalable than relational databases. If you host your own system together and plan to host a dozen or a hundred servers that need to cope with the increasing load behind your data warehouse, then your choice is key-value storage.
Due to the fact that such storage is easily and dynamically expanding, they are also useful to vendors who provide a multi-user web storage platform. Such a database is a relatively cheap means of data storage with great potential for scalability. Users usually pay only for what they use, but their needs can grow. The vendor will be able to dynamically and practically without restrictions increase the size of the platform, based on the load.
More natural code integration
The relational data model and the code object model are usually constructed in different ways, which leads to some incompatibility. Developers solve this problem by writing code that maps a relational model to an object model. This process does not have a clear and quickly achievable value and can take quite a considerable amount of time that could be spent on developing the application itself. Meanwhile, many key-value stores store data in a structure that maps to objects more naturally. This can significantly reduce development time.
Other arguments in favor of using key-value storage, such as “Relational databases can become clumsy” (by the way, I have no idea what this means), are less convincing. But before becoming a proponent of such repositories, check out the next section.
Key-Value Storage: Disadvantages
Constraints in relational databases guarantee data integrity at the lowest level. Data that does not satisfy the restrictions physically cannot get into the database. There are no such restrictions in key-value storages; therefore, data integrity control lies entirely with applications. However, there are errors in any code. If errors in a correctly designed relational database usually do not lead to data integrity problems, then errors in key-value stores usually lead to such problems.
Another advantage of relational databases is that they force you to go through the process of developing a data model. If you have designed the model well, then the database will contain a logical structure that fully reflects the structure of the stored data, but diverges from the structure of the application. Thus, data becomes application independent. This means that another application can use the same data and the application logic can be changed without any changes to the base model. To do the same with key-value storage, try replacing the process of designing a relational model with class design, which creates common classes based on the natural data structure.
And don't forget about compatibility. Unlike relational databases, cloud-centric storage has far fewer common standards. Although conceptually they do not differ, they all have different APIs, query interfaces and their own specifics. Therefore, you are better off trusting your vendor, because in which case, you cannot easily switch to another service provider. And given the fact that almost all modern key-value stores are in beta 2 , trusting becomes even more risky than using relational databases.
Limited data analytics
Typically, all cloud storage is built on the type of multiple leases , which means that the same system is used by a large number of users and applications. To prevent the "capture" of the overall system, vendors usually somehow limit the execution of requests. For example, in SimpleDB, a query cannot run for longer than 5 seconds. In the Google AppEngine Datastore, you can’t get more than 1000 records 3 per request .
These restrictions are not scary for simple logic (creating, updating, deleting and retrieving a small number of records). But what if your application becomes popular? You’ve got a lot of new users and a lot of new data, and now you want to make new opportunities for users or somehow benefit from the data. Here you can be hard to break off with even simple queries for data analysis. Features like tracking application usage patterns or a recommendation system based on a user's history may at best be difficult to implement. And at worst, they are simply impossible.
In this case, for analytics it is better to create a separate database that will be populated with data from your key-value storage. Think in advance how this can be done. Will you host the server in the cloud or at home? Will there be problems due to signal delays between you and your provider? Does your storage support such data transfer? If you have 100 million records and you can take 1,000 records at a time, how much will it take to transfer all the data?
However, do not prioritize scalability. It will be useless if your users decide to use the services of another service, because it provides more options and settings.
Many web service providers offer multi-tenant key-value storage. Most of them satisfy the criteria listed above, however each has its own distinctive features and differs from the standards described above. Let's take a look at specific sample repositories such as SimpleDB, Google AppEngine Datastore, and SQL Data Services.
SimpleDB is an attribute-based key-value storage that is part of Amazon WebServices. SimpleDB is in beta; users can use it for free - as long as their needs do not exceed a certain limit.
SimpleDB has several limitations. First, the query execution time is limited to 5 seconds. Second, there are no data types other than strings. Everything is stored, retrieved and compared as a string, so in order to compare the dates, you will need to convert them to the ISO8601 format. Third, the maximum size of any line is 1024 bytes, which limits the size of the text (for example, a description of the product), which you can store as an attribute. However, since the data structure is flexible, you can circumvent this limitation by adding the attributes “Product Description1”, “Product Description2”, etc. But the number of attributes is also limited - a maximum of 256 attributes. While SimpleDB is in beta, the domain size is limited to 10 gigabytes, and the entire database cannot occupy more than 1 terabyte.
One key feature of the model is to use SimpleDB final konstistentsii ( eventual consistency model ). This model is suitable for multi-threaded work, however, it should be borne in mind that after you change the attribute value in some record, these changes may not be visible during subsequent read operations. The likelihood of such a development of events is quite low, however, it must be remembered. You do not want to sell the last ticket to five customers just because your data was inconsistent at the time of sale.
Google AppEngine Data Store
Google's AppEngine Datastore is built on top of BigTable, Google’s internal structured data storage system. AppEngine Datastore does not provide direct access to BigTable, but can be perceived as a simplified interface for interacting with BigTable.
AppEngine Datastore supports a larger number of data types within a single record than SimpleDB. For example, lists that may contain collections within a record.
Most likely you will use this particular data warehouse when developing using Google AppEngine. However, unlike SimpleDB, you cannot use the AppEngine Datastore (or BigTable) outside of Google’s web services.
Microsoft: SQL Data Services
SQL Data Services is part of the Microsoft Azure platform . SQL Data Services is free, in beta, and has database size limits. SQL Data Services is a separate application - an add-on for many SQL servers that store data. These storages can be relational, however for you SDS is a key-value storage, as well as the products described above.
There are also a number of storages that you can use outside the cloud by installing them on your own. Almost all of these projects are young, are in alpha or beta, and are open source. With open source, you will probably be more aware of potential problems and limitations than using closed products.
CouchDB is a free, open source, document-oriented database. As a data storage format, JSON is used. CouchDB is designed to fill the gap between document-oriented and relational databases using “views”. Such views contain data from documents in a form similar to a table, and allow you to build indexes and perform queries.
CouchDB is currently not a truly distributed database. It has replication features that allow you to synchronize data between servers, but this is not the same distribution that is needed to build a highly scalable environment. However, CouchDB developers are working on this.
The Voldemort project is a distributed key-value database designed for horizontal scaling on a large number of servers. Born in the LinkedIn development process, it was used for several systems with high scalability requirements. The Voldemort project also uses a finite consistency model.
Mongo is a database developed at 10gen by Geir Magnusson and Dwight Merriman (whom you can know from DoubleClick). Like CouchDB, Mongo is a document-oriented database that stores data in JSON format. However, Mongo is more likely an object base than a pure key-value storage.
Drizzle presents a completely different approach to solving the problems that key-value stores are designed to deal with. Drizzle started as one of the MySQL 6.0 branches. Developers later removed a number of functions (including views, triggers, compiled expressions, stored procedures, query cache, ACLs, and some data types), with the goal of creating a simpler and faster DBMS. However, Drizzle can still be used to store relational data. The goal of the developers is to build a semi-relational platform designed for web applications and cloud applications running on systems with 16 or more cores.
Ultimately, there are four reasons why you can choose a non-relational key-value storage for your application:
- Your data is strongly document-oriented, and more suitable for a key-value data model than for a relational model.
- Your domain model is highly object-oriented, so using key-value storage will reduce the size of additional code for data conversion.
- The data warehouse integrates cheaply and easily with your vendor’s web services.
- Your main concern is high scalability on demand.
However, when making a decision, be aware of the limitations of specific databases and the risks that you will encounter when choosing to use non-relational databases.
For all other requirements, it is better to choose the good old relational DBMS. So are they doomed? Of course not. At least for now.
1 - in my opinion, the term “data structure” is more suitable here, but left the original data model.
2 - most likely, the author had in mind that non-relational databases are inferior to relational ones in their capabilities.
3 - perhaps the data is already outdated, the article dates from February 2009.