codezombie January 10, 2013 at 00:20

Microsoft HDInsight. The cloudy (and not only) future of Hadoop

The amount of data generated and collected by modern research centers, financial institutions, social networks, is already habitually measured in petabytes. Since Facebook data centers already have more than 15 billion images, the NYSE stock exchange creates and replicates about 1 TB of data daily, the Large Hadron Collider receives about 1 PB of data per second.

Obviously, the tasks of processing large amounts of data are increasingly becoming not only for large companies, but also for startups and small research groups.

The Hadoop platform, which, in principle, successfully solves the Big Data problem for semi- and unstructured data, in its “pure” form imposes significant requirements both on the qualifications of the administrators of the Hadoop cluster and on the initial financial costs for the hardware of such a cluster.

In this situation, the symbiosis of cloud technologies and the Hadoop platform is increasingly presented as an extremely promising way to solve the Big Data problem , which has an extremely low input level (qualification + launch costs).

Today, the largest cloud providers, with varying degrees of proximity to the release version, provide “Cloud Computing + Hadoop” services. So Amazon in 2009 announced Amazon Elastic MapReduce, Google in 2010 - Google MapperAPI, Microsoft in 2011 - Windows Azure HDInsight.

Service from Microsoft this year showed a high development dynamics and, in general, provides a wide range of capabilities related to both management and development and integration with Windows Azure cloud services and other Microsoft BI-tools .

1. HDInsight. Overview, History, Ecosystem

HDInsight is Microsoft's solution for deploying the Hadoop platform on the Windows Server family and on the Windows Azure cloud platform.

HDInsight is available in two versions:

as a cloud service on Windows Azure - Windows Azure HDInsight (access requires an invite);
as a local cluster on Windows Server 2008/2012 - Microsoft HDInsight to Windows Server (CTP) (installation is available through WebPI, including on the desktop versions of Windows).

2011

Microsoft's partnership with Hortonworks, aimed at integrating the Hadoop platform into Microsoft products, was announced in October 2011 at the Professional Association for SQL Server (PASS) conference.

On December 14 of the same year, the cloud service “ Hadoop on Azure ” (CTP) was introduced , available for a limited number of invites.

2012

On June 29, 2012, the Hadoop on Azure service update (Service Update 2) was released. The most noticeable change compared to the previous version of the service was an increase in cluster power by 2.5 times.

Already after 2 months (August 21), the third Service Update 3 was released, among the innovations of which were:

REST API for setting tasks (Job), requesting status of completion and removal of tasks;
the ability to directly access the cluster through a browser;
C # SDK v1.0;
PowerShell cmdlets.

On October 25, at the Strata Conference-Hadoop World, HDInsight was presented (the development of a service formerly known as “Hadoop on Azure”), which allows you to deploy the Hadoop platform both in enterprises ( on-premise ) and on demand ( on-demand ) .

In October of the same year, the Microsoft .NET SDK for Hadoop project was launched , among the bright goals of which is the creation of the Linq-to-Hive, Job Submission API and WebHDFS client.

As of December 2012, HDInsight is in the testing phase and, in addition to Hadoop v1.0.1, includes the following components *:

Pig version 0.9.3: a high-level data processing language and framework for the execution of these requests;
Hive 0.9.0: distributed storage infrastructure supporting data requests;
Mahout 0.5: machine learning;
HCatalog 0.4.0: metadata management;
Sqoop 1.4.2: moving large amounts of data between Hadoop and structured data warehouses;
Pegasus 2: graph analysis;
Flume : a distributed high-availability service for collecting, aggregating and moving log data in HDFS.
Business intelligence tools, including
- Hive ODBC Driver;
- Hive Add-in for Excel (available for download via HDInsight Dashboard).

Development for HDInsight can be done in C #, Java, JavaScript, Python.

HDInsight has the ability to integrate:

with Active Directory for security management and access control;
with System Center for cluster management;
with Azure Blob Storage and Amazon S3 cloud storage services as data sources.

For Microsoft SQL Server 2012 Parallel Data Warehouse in 2013, a connector is expected for SQL Server (for data exchange), integration with SQL Server Integration Services (for data loading) and with SQL Server Analysis Services (analytics).

In the next part of the article, we will consider a practical example of writing a Hadoop job (Hadoop Job) for word counting on a locally installed HDInsight for Windows Server solution.

2. HDInsight. Workshop

Training

First, install the “Microsoft HDInsight for Windows Server CTP” solution through the Web Platform Installer. After successfully completing the installation of the solution, we will have new websites on the local IIS and useful shortcuts on the desktop.

JavaScript WordCounter

One of these shortcuts, pointing to a website with the speaking name “HadoopDashboard,” is what we need. Clicking on this shortcut opens the HDInsight control panel.

HDInsight Dashboard

Following the “Interactive JavaScript” tile, we proceed to create a job for our local cluster. Let's start with the task of classical word counting.

To do this, we will need several files with text (I took a speech from one of Martin Luther King's speeches) and the javascript logic of counting itself in the map and reduce phases (Listing 1). According to this logic, we will separate the words when finding a non-letter character (it is obvious that this is not always the case), in addition to getting rid of “noise”, we will not take into account words shorter than 3 characters.

Listing 1. WordCountJob.js file

// map() implementation
var map = function(key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.lenght; i++) {
        if (words[i] !== "" && words[i].lenght > 3) {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};
// reduce() implementation
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};

The text files and javascript created for analysis, saved in the wordCountJob.js file, must be transferred to HDFS (Listing 2).

Listing 2.

#mkdir wordcount // создаем директорию wordcount
fs.put("wordcount") // копируем в созданную директорию файлы с локального диска
fs.put("wordcount")
#ls wordcount // запрашивает листинг директории с анализируемыми данными
fs.put() // копируем в HDFS файл wordCountJob.js
#cat wordCountJob.js // просматриваем wordCountob.js

What has been done looks something like this: We make

HDInsight. Word count job

a request for the first 10 results, sorted in descending order:

pig.from("wordcount").mapReduce("wordCountJob.js", "word, count:long").orderBy("count DESC").take(49).to("output");

and so it would look for Amazon S3 cloud storage:

pig.from("s3n://HDinsightTest/MartinLutherKing").mapReduce("wordCountJob.js", "word, count:long").orderBy("count DESC").take(49).to("output");

We start and at the end we get something like:

[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! (View Log)

Let's look at the result in the form of text output, and then graphic.

js> #ls output // получаем список файлов в выходной директории

Found 3 items

-rw-r--r-- 1 hadoop supergroup 0 2012-12-12 08:47 /user/hadoop/output/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2012-12-12 08:47 /user/hadoop/output/_logs
-rw-r--r-- 1 hadoop supergroup 0 2012-12-12 08:47 /user/hadoop/output/part-r-00000

js> #cat output/part-r-00000 // просматриваем результат

Result (word / number of occurrences)

that 115
they 65
this 57
their 50
must 50
have 48
with 39
will 38
vietnam 38
them 33
nation 27
from 26
when 23
world 23
which 22
america 22
speak 21
there 20
were 17
into 16
life 15
against 15
those 14
american 14
revolution 14
been 14
make 13
people 13
land 13
government 13
these 13
about 13
what 12
such 12
love 12
peace 12
more 11
over 11
even 11
than 11
great 11
only 10
being 10
violence 10
every 10
some 10
without 10
would 10
choice 10

js> file = fs.read("output")
js> data = parse(file.data, "word, count:long")
js> graph.bar(data) // строим график

It is gratifying to note that Martin Luther King so often used such words as “love,” “freedom,” “choice,” “life.”

3. And what next?

According to the subjective opinion of the author, the task of counting the frequency of words in a text fragment (if you do not include morphological analysis) does not differ in the complexity of the subject area. And, as a result, it does not cause a ~~sinking heart of~~ wild interest in me or the IT community.

The most typical “Big Data” cases are studies aimed at:

identify complex dependencies in social networks;
anti-fraud systems (fraud detection) in the financial sector;
genome studies;
log analysis, etc.

I will not go further, as this is the topic of a separate article. I can only say that cases solved on the Hadoop platform, in general, have the following properties:

the problem being solved is reduced to a task parallel in data
(with the advent of YARN, the fulfillment of this condition is not critical);
input data does not fit into RAM;
input data is semi- and / or unstructured;
input data is not streaming data;
The result is not realtime-sensitive.

Therefore, answering the question presented in the subtitle (“what's next?”), I set the first task to change the subject area . My view fell on the financial sector, in particular, on high-frequency trading ** (High Frequency Trading, HFT).

The second aspect that you need to look at in a way different from the one presented in the previous example is the automation of calculations .

The Javascript + Pig bundle , available from the HDInsight Dashboard, is certainly a powerful tool in the hands of an analyst familiar with the subject area and programming. But still this is a “man in the middle” and this is manual input. For the trading robot, this approach is obviously unacceptable.

In fairness, I repeat that HDInsight supports the execution of tasks via the command line, which can cool automate the solution of any data analysis task. But for me this is more of a quantitative improvement than a qualitative one.

And a quality improvement is the development in one of the high-level programming languages supported by HDInsight. Among them: Java, Python and C #.

So, we decided on what needs to be changed so that it would not be embarrassing for the boss to show or talk about the benefits of using HDInsight.

Unfortunately, the description of the tasks of the analysis component of stock ticks and the implementation of these tasks (with source code), in addition to being far from the final stage, also goes beyond the scope and scope of the article (but, I assure you, this is a matter of time).

Conclusion

HDInsight is a flexible solution that allows a developer to deploy an on-premises Hadoop cluster in a relatively short period of time or order it as an on-demand service in the cloud, with a wide range of programming languages and friends to choose from (for .NET developers) development tools.

Thus, the HDInsight solution unprecedentedly reduces the level of entry into the development of "data-intensive" applications. Which, in turn, will necessarily lead to even more popularization of the Hadoop platform, MPP applications and the demand for such applications among researchers and businesses.

Source of inspiration

[1] Matt Winkler. Developing Big Data Analytics Applications with JavaScript and .NET for Windows Azure and Windows . Build 2012 Conference, 2012.

* Component versions are for December 2012.
** The author knows that this is not the most typical case for Hadoop (first and only, due to the sensitivity to real-time calculations). But the eyes are afraid, and the ~~hands - make the~~ head - think.

Tags: