
Data Access Speed: The Battle for the Future
Since ancient times, mankind has accumulated information, analyzed it, and stored it in some form so that it could later be passed on to descendants. The evolution of our consciousness became possible largely because of this: each new generation did not have to rediscover what had already been understood before it. Starting with the oldest media, Egyptian papyrus and Sumerian cuneiform tablets, humanity has amassed an ever-growing body of information. There were times in history when, as a result of wars and cataclysms, part of the accumulated knowledge was destroyed or lost; progress then stopped, and humanity was thrown back in its development. The real revolution and breakthrough was the invention of mass printing, which made it possible to disseminate information to a wide audience, which in turn led to explosive growth in the sciences and the arts and raised the consciousness of all mankind to a higher level. The development of technology in the twentieth century brought new storage media: punched cards, punched tape, hard magnetic disks, and so on. More and more information migrated from paper ledgers to electronic media, and a need arose to organize and manage access to this data. This is how the first DBMSs appeared.
The relational data model proposed in 1970 by E. F. Codd set the trend in database development for a long time and has been able to meet business requirements right up to the present day. Since 1970, relational databases have come a long way and overcome many of the challenges that stood in their way. Constantly growing data volumes led to methods that provide faster access to the necessary data: indexes, storing data in sorted form, and so on. These methods have coped with their task quite successfully and have not lost their relevance. However, the rapid growth in the capacity of storage media and the falling cost of data storage mean that databases of tens of terabytes are no longer unusual; they are perceived as commonplace. Business cannot afford to let this data sit as dead weight: ever-increasing competition forces companies to look for new approaches to their field of activity, because, as the saying goes, "who owns the information, owns the world." And when we talk about time, the count is not in days or even hours, but in minutes: whoever can get the necessary information quickly will win.
But not all modern databases are ready for the new volumes; the old methods are no longer as effective. The main component that slows down the database as a whole is the storage device. Unfortunately, it is precisely the capabilities of the hard drive that now limit the further development of ways to extract useful information from data sets of tens of terabytes. Technology today is not keeping pace with the growth in the amount of data that needs to be analyzed. Flash drives are still quite expensive and have significant drawbacks, in particular limited write endurance, which hinders their use as corporate storage devices for databases. In this article, I propose to discuss the methods that modern analytical databases use to overcome the shortcomings of existing technologies. I would like to leave the NoSQL family, rich in varieties as it is, for a separate article, so as not to confuse the approaches: NoSQL databases are still quite exotic for traditional analytical systems, although they have gained some popularity in individual tasks. The main interest here is in databases with a traditional relational data model that meet the ACID requirements and are intended for Big Data analytics, and in how they respond to the modern challenge.
It is clear that the data used by analytical databases must be properly prepared and ordered, since it is difficult to find any regularities in chaos. There are exceptions to this rule, which may be the subject of a separate article. Let us assume that the data has been prepared by some ETL process and loaded into the data warehouse. How can modern analytical databases provide access to data fast enough that reading several terabytes, or tens of terabytes, does not take several days?
Massive parallel processing
A massively parallel architecture is built from separate nodes, where each node has its own processor and memory, as well as communication facilities that allow it to talk to the other nodes. Each node here is a separate database that works in concert with all the others. If we go one level lower, we see a process or a set of processes that constitutes a database and performs its own separate task, contributing to the common cause. Because each node has its own infrastructure (processor and memory), we do not run into the traditional limitations of databases that are essentially a single node with access to the entire volume of stored data. Here each node stores its own portion of the data and works with it, providing the fastest possible access.
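To make the idea of each node owning its own portion of the data concrete, here is a minimal sketch in Python of hash-distributing rows across segment nodes by a distribution key. The node count, row layout, and function names are illustrative assumptions, not the mechanism of any particular product.

import hashlib

NUM_NODES = 4  # assumed cluster size for this sketch

def node_for_key(distribution_key: str, num_nodes: int = NUM_NODES) -> int:
    """Map a row's distribution key to one of the segment nodes."""
    digest = hashlib.md5(distribution_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# Each node holds only its own slice of the table.
segments = {n: [] for n in range(NUM_NODES)}

rows = [
    {"customer_id": "C-1001", "amount": 250.0},
    {"customer_id": "C-1002", "amount": 90.5},
    {"customer_id": "C-1003", "amount": 410.0},
]

for row in rows:
    segments[node_for_key(row["customer_id"])].append(row)

for node, portion in segments.items():
    print(f"node {node}: {len(portion)} row(s)")  # shows how rows spread out

Because the same key always hashes to the same node, every node can scan and aggregate its local slice independently, which is what the architecture relies on.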
In theory, we can give each node its own processor and its own disks, so the maximum read speed becomes the sum of the read speeds of all the storage devices; for example, ten nodes each scanning at 100 MB/s can, in the ideal case, read roughly 1 GB/s together. This makes it possible to achieve acceptable response times for queries that have to analyze extremely large volumes of data. In practice, to achieve better utilization, several nodes live on the same server and share its resources.
Obviously, in a database built on such an architecture, one of the nodes (or each of them) must be able to accept a query from the user, distribute it to all the remaining nodes, wait for their responses, and return the combined result to the user as the answer to the query.
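Below is a minimal scatter-gather sketch of this coordinator role, again in Python. The query_node function and the in-process node_data dictionary stand in for real inter-node communication and are purely illustrative assumptions.

from concurrent.futures import ThreadPoolExecutor

# Pretend each node already holds its own portion of a sales table.
node_data = {
    0: [120.0, 75.5],
    1: [300.0],
    2: [18.25, 42.0, 10.0],
    3: [],
}

def query_node(node_id: int) -> float:
    """Each node computes a partial aggregate over its local data only."""
    return sum(node_data[node_id])

def coordinator_total() -> float:
    """Scatter the request to all nodes, gather and combine partial sums."""
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        partials = list(pool.map(query_node, node_data.keys()))
    return sum(partials)

print(coordinator_total())  # 565.75

Real systems split the plan into far more elaborate steps (local aggregation, data redistribution, final merge), but the coordinator's scatter-and-gather role is the same.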
The advantage of such an architecture lies on the surface: almost linear scalability, both horizontal and vertical.
The disadvantage is that significant effort is required to create software that takes full advantage of such an architecture, which is what drives the high cost of these products.
Analytical relational databases using MPP:
1. EMC Greenplum
One of the best solutions, with a powerful set of features that allow it to be configured for almost any task.
2. Teradata
A well-known solution that has proven itself in the market. Its cost is high compared to competitors, and not because of significant advantages.
3. HP Vertica
Its advantages are at the same time its disadvantages: a large amount of redundant (duplicated) data that must be stored, a focus on a narrow range of tasks, and the lack of some important functionality.
4. IBM Netezza
An interesting and fairly fast solution. Its drawbacks are that it is a purely hardware solution built on a proprietary and partially outdated platform, and there are questions about its scalability.
A separate review could be written for each of these solutions if readers are interested. Be that as it may, it is these four products that set the trend in the MPP sector with a shared-nothing architecture. Using them as an example, we can see the direction in which technologies for processing super-large volumes of data are developing. An entirely new class of databases has appeared, designed specifically for processing tens of terabytes.
However, a second direction has also appeared, one that circumvents the limitations imposed by hard drives.
IMDB
An in-memory database is a database that works with data kept entirely in RAM. As is well known, random access memory is orders of magnitude faster than conventional hard drives, so it provides a high-performance storage medium with enormous read and write speeds. Despite this, there are still few who want to store their data entirely in RAM. This is primarily because such memory costs much more than hard drives. Another important factor is that all the data disappears as soon as the power is turned off. For a long time, databases that worked with data in main memory were auxiliary and served as a buffer holding short-lived data needed only for operational processing. Nonetheless, the falling cost of this type of memory has spurred interest in such databases. Entry into the Big Data area is generally considered to begin at terabytes, and until recently there were no solutions of this type that could work with sufficiently large volumes. In 2011, however, SAP introduced its HANA database, which supported up to 8 terabytes of uncompressed data; theoretically, with compression, the usable volume can be raised to 40 terabytes. Another representative of IMDB technology is Oracle's TimesTen. Both solutions offer rich functionality and are the most mature products in the field of in-memory RDBMSs.
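To illustrate the basic idea (data served entirely from RAM, with the risk of losing everything on a power failure mitigated by persistence), here is a minimal sketch of an in-memory store with a snapshot to disk. All names here are assumptions for illustration; this is not how HANA or TimesTen actually persist data.

import json
import os

class InMemoryStore:
    def __init__(self, snapshot_path: str = "snapshot.json"):
        self.snapshot_path = snapshot_path
        self.data = {}
        # Recover the last snapshot, if any, after a restart.
        if os.path.exists(snapshot_path):
            with open(snapshot_path) as f:
                self.data = json.load(f)

    def put(self, key: str, value) -> None:
        self.data[key] = value  # write goes straight to RAM

    def get(self, key: str, default=None):
        return self.data.get(key, default)  # read is served from RAM

    def snapshot(self) -> None:
        """Flush the whole in-memory state to disk as a recovery point."""
        with open(self.snapshot_path, "w") as f:
            json.dump(self.data, f)

store = InMemoryStore()
store.put("orders:2011", {"count": 1200, "revenue": 54000.0})
print(store.get("orders:2011"))
store.snapshot()  # without this, a power cut would lose the data

Production in-memory databases combine such snapshots with logging and replication, but the trade-off is the same: reads and writes at RAM speed, durability bought separately.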
Thus, companies are ready to accept the challenge of Big Data. Solutions have already been devised and tested that make it possible, within an acceptable time, to answer the questions posed by analysts, marketers, and managers using information accumulated over decades. New classes of databases are being created for processing super-large amounts of data, and new methods are being developed to increase the speed of data access.
At the same time, modern realities show that relational databases cannot be one-size-fits-all solutions. That is why, in addition to a data warehouse that stores and processes large amounts of information, a company must also have OLTP databases designed for its day-to-day operations.