
Oracle vs Teradata vs Hadoop
This article is aimed at Large and Very Large Data Warehouses, though for completeness of the classification small ones are also briefly mentioned.
The article is written for specialists whose main criterion for working with databases is speed. It is about systems built for brute-force full scans (the Oracle folks have already tensed up, while the Teradata folks are pleased).
Let's look at what volumes of data and kinds of work Oracle or Hadoop/NoSQL is best suited for.
1) On small volumes it is more profitable to use NoSQL without Hadoop, if you begrudge the $900 for Oracle SE One. Its main benefit is price: NoSQL databases are usually free. A small volume of data also implies a simple model and little in-database development.
2) On medium and large volumes, Oracle has great advantages over Teradata and Hadoop. Its main advantages:
1. Very high maturity of the technology and the product, and far more implementations than Hadoop.
2. A very rich set of technologies, which greatly simplifies and speeds up development compared to both.
3. I suspect Oracle is also cheaper to operate than Hadoop, given the cost of server space and electricity.
4. Price compared to Teradata. If you do not buy Exadata but build your own server, I think the price difference compared to Hadoop will not be huge.
Oracle scales well, but it has a bottleneck: the storage subsystem, which is shared by however many nodes you add. So only up to a certain scale does Oracle show one of the best data processing speeds.
The fastest self-assembled storage arrays I have seen deliver 18 GB/s (though I am sure faster ones exist). An Exadata Full Rack, thanks to custom firmware across all of its hardware and software, delivers 25 GB/s.
However, it often happens that the full scan performance in Oracle is not enough.
Let me give an example. In 2007 at Beeline, 170 million records a day landed in one table: every call across all of Russia. Analyzing it directly is unrealistic; in slang terms, you cannot "run over" such a table, the hard drives simply do not have the throughput. In such cases an optimization is applied: from that fact table, several large aggregates of about 4 million records per day are built, and from those large aggregates many smaller aggregates are built for specific tasks and reports. This kind of optimization can be done on Oracle, on Teradata, and on Hadoop (a sketch follows below).
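Concretely, the rollup might look like this minimal engine-agnostic Python sketch (the field names and figures are hypothetical, not Beeline's actual schema):

```python
from collections import defaultdict

# Hypothetical call-detail rows: (day, region, subscriber_id, duration_sec).
# In reality the fact table receives ~170 million such rows per day.
fact_rows = [
    ("2007-03-01", "Moscow", 101, 120),
    ("2007-03-01", "Moscow", 102, 45),
    ("2007-03-01", "Siberia", 201, 300),
]

# First-level aggregate: collapse the detail rows into a few million rows
# keyed by (day, region), keeping a call count and a total duration.
daily_agg = defaultdict(lambda: [0, 0])
for day, region, _subscriber, duration in fact_rows:
    entry = daily_agg[(day, region)]
    entry[0] += 1          # number of calls
    entry[1] += duration   # total talk time, seconds

# Smaller task-specific aggregates are then built from daily_agg,
# never from the raw fact table.
for (day, region), (calls, total_sec) in sorted(daily_agg.items()):
    print(day, region, calls, total_sec)
```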
This scheme has three drawbacks:
1. If business users need a new field that is not in the aggregates, adding it is a very long development process: the field has to be propagated through all the aggregates.
2. Not all ad-hoc reports are possible on such a system. And the whole point of ad-hoc is that a small, simple report is needed here and now; otherwise the company either takes a loss, or the answer to the question is already outdated and no longer needed.
3. Very complex ETL.
It is to solve these drawbacks that you can use Hadoop or Teradata.
3) On extremely large volumes, you can use Hadoop.
This technology has two advantages:
1. Almost unlimited linear scalability. You can get 25, 125, even 1000 gigabytes per second.
2. Price: everything is free. Except the hardware, of course.
Disadvantage:
1. Writing MapReduce procedures is usually labor-intensive, so ad-hoc queries will not be as simple as in SQL (see the sketch below).
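To illustrate the gap, here is what a one-line SQL aggregation becomes when written as explicit map and reduce steps. This is a plain-Python stand-in, not the real Hadoop API, and the field name is hypothetical:

```python
# The one-line SQL query
#   SELECT region, COUNT(*) FROM calls GROUP BY region;
# written as explicit map and reduce phases.
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    for record in records:
        yield record["region"], 1                 # emit (key, 1) per call

def reduce_phase(pairs):
    pairs = sorted(pairs, key=itemgetter(0))      # stand-in for shuffle/sort
    for region, group in groupby(pairs, key=itemgetter(0)):
        yield region, sum(count for _key, count in group)

calls = [{"region": "Moscow"}, {"region": "Moscow"}, {"region": "Siberia"}]
print(dict(reduce_phase(map_phase(calls))))       # {'Moscow': 2, 'Siberia': 1}
```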
I have not compared Oracle and Hadoop on identical hardware, but I think Hadoop would lose to Oracle significantly. Counting only raw disk speed: Exadata delivers 25 GB/s, while an ordinary office 7.2K disk gives 100 MB/s, so you would need about 250 ordinary computers. An ordinary computer costs about 20 thousand rubles and consumes 200 watts; an Exadata rack consumes 7600 watts. Hadoop, it turns out, is very uneconomical on electricity, and that is before accounting for the fact that Exadata has everything doubly redundant. The arithmetic is sketched below.
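The back-of-the-envelope arithmetic behind those figures (all values are the assumed ones from the text; real throughput, prices, and power draw will vary):

```python
# Back-of-the-envelope check of the figures above (assumed values).
exadata_scan_gb_s = 25          # GB/s, Exadata Full Rack scan rate
office_disk_mb_s = 100          # MB/s, an ordinary 7.2K desktop drive

disks_needed = exadata_scan_gb_s * 1000 / office_disk_mb_s
print(disks_needed)             # 250 one-disk machines to match the scan rate

pc_watts, exadata_watts = 200, 7600
print(disks_needed * pc_watts)  # 50,000 W for the DIY cluster
print(exadata_watts)            # 7,600 W for one Exadata rack
```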
4) Super large volumes on Teradata.
Teradata copes much better with a brute-force access method like full scan. Teradata follows a shared-nothing ideology and in that respect is very similar to Hadoop/NoSQL: data lies on many servers, and each server processes its own part itself. But Teradata has a significant drawback: a rather poor toolkit. It is inconvenient to work with. Compared to Oracle, it is not as mature a product. As for price, a full Teradata cabinet and an Exadata Full Rack cost roughly the same, about $5 million.
I will also mention a flaw common to Teradata and Hadoop: the need to somehow distribute the data among the nodes. The distribution key can be either a natural key, i.e. a business key, or a surrogate one. Time will not do here; this is not partitioning. Future data must land evenly across all nodes. Region, for example, is a bad attribute for Beeline: Moscow alone accounts for 30%. So you use some kind of surrogate or hash key, as in the sketch below.
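A sketch of the idea, assuming a 30-node system and a subscriber ID as a hypothetical business key:

```python
import hashlib
from collections import Counter

NUM_NODES = 30  # hypothetical node count

def node_for(subscriber_id: int) -> int:
    # Hash a business or surrogate key so that rows land evenly on nodes.
    digest = hashlib.md5(str(subscriber_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES

# With a hash key every node receives ~1/30 of the rows; with "region"
# as the distribution key, one node (Moscow) would receive ~30% of them.
load = Counter(node_for(s) for s in range(100_000))
print(max(load.values()) / min(load.values()))  # close to 1.0 => even load
```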
Teradata's advantage is that it effectively gives you three levels of partitioning, while Oracle gives two. With 170 million rows per daily partition this turns out to be very useful: if we split those 170 million into 85 subpartitions by region, and on Teradata additionally across 30 nodes, each node ends up scanning only about 170M / 85 / 30, roughly 67 thousand rows, so the final data set can be read very quickly.
Teradata's limit:
Thanks to its shared-nothing architecture and the BYNET V5 network, Teradata scales up to 2048 nodes at 76 TB (10K disks) per node, 234 PB in total. One Exadata rack, by contrast, holds only 672 TB (15K) or 200 TB (7.2K). Parallelizing Exadata does not help much: the disk space is shared by everything! And if you combine the disk space of two racks (whether Exadata even allows it, I do not know), everything runs into the 40-gigabit network between the racks. More precisely, rack 1 will have fast, wide access to its own disks but slow access to rack 2's disks, and vice versa; the sketch below puts numbers on that link.
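A rough calculation of that inter-rack penalty, using the figures above (an assumed sketch; real deployments will differ):

```python
# Local scan speed vs the inter-rack link, using the figures above.
local_scan_gb_s = 25                 # one rack scanning its own disks
interconnect_gbit_s = 40             # network link between racks
interconnect_gb_s = interconnect_gbit_s / 8

print(interconnect_gb_s)                     # 5.0 GB/s across the link
print(local_scan_gb_s / interconnect_gb_s)   # remote disks look ~5x slower
```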
It should also be borne in mind that both Teradata and Exadata have multi-level, hybrid compression, averaging 4-6x. Compression exists in NoSQL databases too, but it may not be as effective as in these monsters, whose development cost a great deal of money.
For the sake of completeness, it is worth mentioning that:
Oracle has two levels of cache: RAM and SSD flash cards.
Teradata has one level, memory, but it has its own know-how: temperature-based (hot/cold) data storage.
Thanks to its two cache levels and the absence of MPP, Exadata is much better suited to OLTP workloads.
Conclusion: if you have no unplanned queries, all queries are known in advance, and the data is no more than 600 TB, then take Oracle: it is very convenient to work with. If you have more, take Teradata or Hadoop.
If you have more than 100 TB of data and a lot of ad-hoc queries, take Teradata or Hadoop.
P.S. I wanted to add an Oracle + Lustre combination to the article, but realized it adds nothing new for Oracle: everything again runs into the performance of the 40-gigabit network.