Hadoop: to be or not to be?

Original author: Anand Krishnaswami
  • Translation
Hello dear readers!

Some time ago, we published a translation of the fundamental O'Reilly book about the Hadoop framework:



Now the editors face a difficult choice: translate the new 4th edition of this book, or reprint the existing one.

Therefore, we decided to publish a translation of an article by Anand Krishnaswami that appeared on the ThoughtWorks blog back in 2013, in which the author analyzes the cases where using Hadoop is appropriate and where it is unnecessary.

We hope the material proves interesting and provokes discussion, and that you will share your impressions of working with Hadoop and take part in the survey.



The Hadoop system is often positioned as a universal framework that will help your organization deal decisively with any problem. Just mention "big data" or "analytics" and the answer arrives at once: "Hadoop!" However, the framework was designed to solve a very specific class of problems; in other cases it is, to put it mildly, a poor fit, and sometimes using Hadoop is an outright mistake. Data transformation (in the broader sense, ETL: extract, transform, load) is greatly streamlined by Hadoop, but if your business has even one of the five properties listed below, you should probably do without it.

1. The thirst for big data

Many companies are inclined to believe that the data at their disposal qualifies as "big", but unfortunately, in most cases this estimate is inflated. The research paper Nobody Ever Got Fired For Buying a Cluster examines how much data is commonly considered "big". The authors conclude that Hadoop was built to process tera- and petabyte data volumes, whereas the input for most practical jobs is under 100 GB (the median job size at Microsoft and Yahoo is less than 14 GB, and 90% of Facebook's jobs are well under 100 GB). Accordingly, the authors consider it reasonable to serve such workloads from a single server that is scaled up (vertically) as the occasional need arises.

Ask yourself:

• Do we have a few terabytes of data or more?
• Do we have a steady and very high-volume inflow of data?
• How much data are we actually going to operate on?

2. You're in a queue

When a job is submitted, Hadoop's minimum latency is about a minute. So the system needs a minute or more to react to a customer's purchase and come back with recommendations for related products. Only a very loyal and patient customer will stare at the screen for 60+ seconds waiting for an answer. The alternative is to precompute the related items for every item in the catalogue ahead of time (using Hadoop) and give the website or mobile application near-instant (within a second) access to the stored result. Hadoop is an excellent engine for this kind of precomputation over big data. Of course, the more complex a typical response of this kind becomes, the less feasible it is to precompute every result in full.
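To make this precompute-and-serve pattern concrete, here is a minimal Python sketch with made-up function and store names: the batch step stands in for what a nightly Hadoop job would produce, and the request path only performs a constant-time lookup.

# Sketch of the precompute-and-serve pattern (names are illustrative).
# The batch step is what a Hadoop job would do offline; the lookup step
# is what the website or mobile app does at request time.

import json

def nightly_batch_job(orders, store):
    """Offline: compute related items for every product (Hadoop's role)."""
    related = {}
    for order in orders:                      # co-occurrence within one order
        for item in order:
            related.setdefault(item, set()).update(i for i in order if i != item)
    for item, others in related.items():
        store[item] = json.dumps(sorted(others))   # persist to a key-value store

def recommend(item, store):
    """Online: constant-time lookup of the precomputed result."""
    cached = store.get(item)
    return json.loads(cached) if cached else []

# Usage: a plain dict stands in for Redis/HBase/etc.
store = {}
nightly_batch_job([["tent", "sleeping bag"], ["tent", "lantern"]], store)
print(recommend("tent", store))   # ['lantern', 'sleeping bag']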

Ask yourself:

• What are user expectations regarding program response speed?
• Which of the tasks can be grouped into batches?

3. Your call will be answered in...

Hadoop is not intended for cases that require real-time responses to queries. Jobs that pass through the map and reduce phases also spend time in the shuffle phase, and the duration of these phases is unbounded, which makes building real-time applications on top of Hadoop genuinely difficult. Volume-weighted average price (VWAP) trading is a practical example of a system that needs an immediate response in order to complete transactions.

Analysts cannot do without SQL. Hadoop is not well suited to ad-hoc access to data sets (even with Hive, which in fact turns your query into MapReduce jobs). Google's Dremel architecture (and, of course, BigQuery) is designed precisely to answer ad-hoc queries over giant row sets in no more than a few seconds, and it supports SQL, which means joins between tables. Other promising alternatives are Shark, developed at UC Berkeley's AMPLab, and the Stinger initiative driven by Hortonworks.
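For reference, an ad-hoc Hive query looks like ordinary SQL, but each statement is compiled into one or more MapReduce jobs, which is why a response takes tens of seconds or minutes rather than the seconds an analyst expects. A rough sketch using the PyHive client (the host, table, and column names are invented):

# Rough sketch: an ad-hoc analytical query against Hive.
# Hive compiles the SQL into MapReduce jobs, so even a simple
# aggregation over a large table runs as a batch job, not interactively.
# Requires the PyHive package; connection details are hypothetical.

from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000,
                       username="analyst", database="sales")
cursor = conn.cursor()

# A typical exploratory query with a join and an aggregation.
cursor.execute("""
    SELECT c.region, SUM(o.amount) AS revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE o.order_date >= '2013-01-01'
    GROUP BY c.region
""")

for region, revenue in cursor.fetchall():
    print(region, revenue)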

Ask yourself:

• How closely do users and analysts need to interact with my data?
• Is interactivity required over terabytes of data, or only over a small subset of it?

Note also that Hadoop works in batch mode. This means that when new information is added, a job has to sift through the entire data set again, so the analysis takes longer and longer. Meanwhile, fragments of data (small updates or changes) can arrive in real time, and a business often has to make decisions based on those events. No matter how quickly new data is loaded into the system, Hadoop will still process it in batch mode. Perhaps this problem will eventually be solved with the help of YARN. Twitter's Storm is already a popular and viable alternative, and combining Storm with a distributed messaging system such as Kafka opens up a range of possibilities for streaming aggregation and processing. However, Storm is still sorely lacking in load balancing.
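The article points to Storm plus Kafka as the streaming stack; purely to illustrate the stream-versus-batch difference, here is a minimal Python sketch (using the kafka-python client rather than Storm; the topic and broker address are made up) that keeps a running aggregate as each event arrives instead of re-reading the whole data set the way a batch job would.

# Illustration only: incremental aggregation over a stream of events,
# as opposed to Hadoop re-scanning the full data set in batch.
# Uses the kafka-python package; the topic and broker address are hypothetical.

import json
from collections import defaultdict
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "page-views",                             # hypothetical topic
    bootstrap_servers="kafka.example.com:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

views_per_page = defaultdict(int)

for message in consumer:                      # each event updates the running state
    event = message.value                     # e.g. {"page": "/pricing", "user": "u42"}
    views_per_page[event["page"]] += 1
    # The business can act on the fresh value immediately,
    # without waiting for a nightly batch job to finish.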

Ask yourself:

• What is the "shelf life" of my data?
• How quickly does my business need to derive value from incoming data?
• How important is it for my business to respond to changes or updates in real time?

Real-time advertising or processing readings from sensors calls for streaming input to be handled in real time, but Hadoop and the tools built on top of it are not the only options. For example, SAP's in-memory database HANA was used in McLaren's ATLAS analytics toolkit during the recent Indy 500, alongside MATLAB, to run models and react to telemetry during the race. Many analysts believe that the future of Hadoop lies precisely in interactivity and real-time operation.

4. You've just closed your account on your favorite social network

Hadoop, and MapReduce in particular, is best suited to data that can be decomposed into key-value pairs without losing context or implicit relationships. Graphs contain implicit relationships (edges, subtrees, parent-child links, weights, and so on), and not all of those relationships can live on a single node. As a result, most graph algorithms need to process the whole graph, or a large part of it, on every iteration, which is often impossible or very awkward in MapReduce. There is also the problem of choosing a strategy for partitioning the data across nodes. If your core data structure is a graph or a network, you are probably better off with a graph database such as Neo4J or Dex; it is also worth looking at newer developments such as Apache Giraph and GraphLab.
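To make the contrast concrete: a question such as "friends of friends" is a multi-hop traversal that would take several chained MapReduce jobs over key-value pairs, but is a single statement against a graph database. A sketch using the official Neo4j Python driver (the URI, credentials, labels, and property names are invented):

# Sketch: a relationship-centric query expressed directly against a graph store.
# In MapReduce the same "friends of friends" question needs several chained jobs
# (one hop per job); in Cypher it is one traversal.
# Uses the neo4j Python driver; URI, credentials, and labels are hypothetical.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://graph.example.com:7687",
                              auth=("neo4j", "secret"))

query = """
MATCH (me:Person {name: $name})-[:FRIEND]->()-[:FRIEND]->(fof:Person)
WHERE fof <> me
RETURN DISTINCT fof.name AS suggestion
"""

with driver.session() as session:
    for record in session.run(query, name="Alice"):
        print(record["suggestion"])

driver.close()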

Ask yourself:

• Is the underlying structure of my data just as important as the data itself?
• Does the insight I am looking for lie in the structure of the data at least as much as in the data itself?

5. MapReduce Model

Some tasks, problems, and algorithms simply do not fit the MapReduce programming model. One such class of problems was discussed above. Another category consists of tasks where computing the result requires knowing the results of intermediate stages (a textbook example is computing the Fibonacci sequence). Some machine learning algorithms (for example, those based on gradient descent or expectation maximization) also fit the MapReduce paradigm poorly. Researchers have proposed various optimization strategies and workarounds (global state, passing data structures by reference, and so on) for each of these problems, but implementing them remains more complex and less intuitive than one would like.
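The awkwardness with iterative algorithms is easiest to see in a sketch: every pass depends on the output of the previous one, so on Hadoop a driver program would have to launch a fresh MapReduce job per iteration and re-read the data from disk each time. Below is a schematic Python illustration in which the cluster job is simulated by a local function; nothing here is a real Hadoop API.

# Schematic sketch of why gradient descent sits awkwardly on MapReduce:
# each iteration needs the result of the previous one, so the driver has to
# launch a fresh batch job (and re-read the whole data set) per step.
# Here the "job" is simulated locally; on a cluster each call would be a
# full MapReduce round with minutes of start-up latency.

def simulated_mapreduce_gradient(weights, records):
    """Stand-in for a MapReduce job: 'mappers' emit per-record gradients
    for a least-squares fit, the 'reducer' sums them up."""
    partials = []
    for x, y in records:                          # map phase
        error = weights[0] + weights[1] * x - y
        partials.append((error, error * x))
    n = len(records)
    return [sum(p[0] for p in partials) / n,      # reduce phase
            sum(p[1] for p in partials) / n]

def gradient_descent(records, learning_rate=0.1, steps=200):
    weights = [0.0, 0.0]
    for _ in range(steps):
        # On Hadoop this line would mean: submit a job, wait, re-read HDFS.
        grad = simulated_mapreduce_gradient(weights, records)
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return weights

# Toy data drawn from y = 2x + 1; the loop converges to roughly [1.0, 2.0].
print(gradient_descent([(0, 1), (1, 3), (2, 5), (3, 7)]))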

Ask yourself:

• Does the company rely heavily on specialized algorithms or domain-specific processes?
• Will the engineering team handle the analytics better if the algorithms involved are adapted to MapReduce, or not?

One should also consider the practical cases in which the data set is not particularly large, or in which the total volume is large but consists of billions of small files (for example, sifting through a mass of image files to pick out those containing a certain shape) that would first have to be concatenated. As noted above, if a task does not fit the MapReduce paradigm of "divide and aggregate", using Hadoop to solve it is a dubious undertaking.

Now that we have looked at when Hadoop may not be the best solution, let's discuss when it is worth using.

Ask yourself:

Is your organization going to ...
1. Extract information from colossal volumes of text logs?
2. Convert mostly unstructured or poorly structured data into a convenient, organized format?
3. Solve problems that involve processing the entire data set in overnight runs (similar to how credit card companies process the day's transactions)?
4. Rely on conclusions drawn from a single pass over the data, valid until the next scheduled run (which does not hold, for example, for stock quotes, which change far more often than once per trading day)?

In such cases, you should almost certainly pay attention to Hadoop.

There is a range of business problems that fit the Hadoop model well (although practice shows that solving even these is far from trivial). As a rule, they come down to processing huge volumes of unstructured or semi-structured data, and consist either of summarizing its contents or of converting the observations into a structured form for later use by other components of the system. In such cases, the Hadoop model comes into its own. If the data you collect contains elements that naturally act as identifiers for the corresponding values (in Hadoop these are key-value pairs), then this simple association immediately supports several kinds of aggregation.
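When records carry a natural key like this, the map and reduce steps are almost mechanical. Here is a minimal local sketch of the pattern in Python, where the assumed input format is a URL followed by the number of bytes sent; in a real Hadoop Streaming job the mapper and reducer would be separate scripts reading stdin and writing tab-separated pairs.

# Minimal sketch in the Hadoop Streaming style: the mapper turns each raw
# log line into a key-value pair, the reducer aggregates values per key.
# The input format (URL in column 1, bytes sent in column 2) is an assumption.

from itertools import groupby

def mapper(lines):
    """Emit (url, bytes) for every access-log line."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            yield fields[0], int(fields[1])

def reducer(pairs):
    """Sum the bytes for each URL (pairs must arrive grouped by key,
    which Hadoop's shuffle/sort phase guarantees)."""
    for url, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield url, sum(value for _, value in group)

if __name__ == "__main__":
    log = ["/index.html 2048", "/about.html 512", "/index.html 1024"]
    for url, total in reducer(mapper(log)):
        print(f"{url}\t{total}")   # /about.html 512, /index.html 3072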

So the most important thing is to clearly understand what resources the business has and what problem you are actually trying to solve. I hope that these considerations, along with the recommendations above, will help you choose exactly the tools that fit your business.

It is likely that this will be Hadoop.


Hadoop Classic Book

  • 67.9% (106 votes): interested in a new edition
  • 12.8% (20 votes): a reprint of the existing edition is enough
  • 11.5% (18 votes): neither a new edition nor a reprint is worth doing
  • 7.6% (12 votes): better to translate a different book on Hadoop (link and description in the comments)
