Microsoft HDInsight. The cloudy (and not only) future of Hadoop

    The amount of data generated and collected by modern research centers, financial institutions, social networks, is already habitually measured in petabytes. Since Facebook data centers already have more than 15 billion images, the NYSE stock exchange creates and replicates about 1 TB of data daily, the Large Hadron Collider receives about 1 PB of data per second.

    Obviously, the tasks of processing large amounts of data are increasingly becoming not only for large companies, but also for startups and small research groups.

    The Hadoop platform, which, in principle, successfully solves the Big Data problem for semi- and unstructured data, in its “pure” form imposes significant requirements both on the qualifications of the administrators of the Hadoop cluster and on the initial financial costs for the hardware of such a cluster.

    In this situation, the symbiosis of cloud technologies and the Hadoop platform is increasingly presented as an extremely promising way to solve the Big Data problem , which has an extremely low input level (qualification + launch costs).

    Today, the largest cloud providers, with varying degrees of proximity to the release version, provide “Cloud Computing + Hadoop” services. So Amazon in 2009 announced Amazon Elastic MapReduce, Google in 2010 - Google MapperAPI, Microsoft in 2011 - Windows Azure HDInsight.

    Service from Microsoft this year showed a high development dynamics and, in general, provides a wide range of capabilities related to both management and development and integration with Windows Azure cloud services and other Microsoft BI-tools .

    1. HDInsight. Overview, History, Ecosystem


    HDInsight is Microsoft's solution for deploying the Hadoop platform on the Windows Server family and on the Windows Azure cloud platform.

    HDInsight is available in two versions:
    • as a cloud service on Windows Azure - Windows Azure HDInsight (access requires an invite);
    • as a local cluster on Windows Server 2008/2012 - Microsoft HDInsight to Windows Server (CTP) (installation is available through WebPI, including on the desktop versions of Windows).

    2011


    Microsoft's partnership with Hortonworks, aimed at integrating the Hadoop platform into Microsoft products, was announced in October 2011 at the Professional Association for SQL Server (PASS) conference.

    On December 14 of the same year, the cloud service “ Hadoop on Azure ” (CTP) was introduced , available for a limited number of invites.

    2012


    On June 29, 2012, the Hadoop on Azure service update (Service Update 2) was released. The most noticeable change compared to the previous version of the service was an increase in cluster power by 2.5 times.

    Already after 2 months (August 21), the third Service Update 3 was released, among the innovations of which were:
    • REST API for setting tasks (Job), requesting status of completion and removal of tasks;
    • the ability to directly access the cluster through a browser;
    • C # SDK v1.0;
    • PowerShell cmdlets.

    On October 25, at the Strata Conference-Hadoop World, HDInsight was presented (the development of a service formerly known as “Hadoop on Azure”), which allows you to deploy the Hadoop platform both in enterprises ( on-premise ) and on demand ( on-demand ) .

    In October of the same year, the Microsoft .NET SDK for Hadoop project was launched , among the bright goals of which is the creation of the Linq-to-Hive, Job Submission API and WebHDFS client.

    As of December 2012, HDInsight is in the testing phase and, in addition to Hadoop v1.0.1, includes the following components *:
    • Pig version 0.9.3: a high-level data processing language and framework for the execution of these requests;
    • Hive 0.9.0: distributed storage infrastructure supporting data requests;
    • Mahout 0.5: machine learning;
    • HCatalog 0.4.0: metadata management;
    • Sqoop 1.4.2: moving large amounts of data between Hadoop and structured data warehouses;
    • Pegasus 2: graph analysis;
    • Flume : a distributed high-availability service for collecting, aggregating and moving log data in HDFS.
    • Business intelligence tools, including
      • Hive ODBC Driver;
      • Hive Add-in for Excel (available for download via HDInsight Dashboard).

    HDInsight Ecosystem

    Development for HDInsight can be done in C #, Java, JavaScript, Python.

    HDInsight has the ability to integrate:
    • with Active Directory for security management and access control;
    • with System Center for cluster management;
    • with Azure Blob Storage and Amazon S3 cloud storage services as data sources.

    For Microsoft SQL Server 2012 Parallel Data Warehouse in 2013, a connector is expected for SQL Server (for data exchange), integration with SQL Server Integration Services (for data loading) and with SQL Server Analysis Services (analytics).

    In the next part of the article, we will consider a practical example of writing a Hadoop job (Hadoop Job) for word counting on a locally installed HDInsight for Windows Server solution.

    2. HDInsight. Workshop


    Training


    First, install the “Microsoft HDInsight for Windows Server CTP” solution through the Web Platform Installer. After successfully completing the installation of the solution, we will have new websites on the local IIS and useful shortcuts on the desktop.

    JavaScript WordCounter


    One of these shortcuts, pointing to a website with the speaking name “HadoopDashboard,” is what we need. Clicking on this shortcut opens the HDInsight control panel.

    HDInsight Dashboard

    Following the “Interactive JavaScript” tile, we proceed to create a job for our local cluster. Let's start with the task of classical word counting.

    To do this, we will need several files with text (I took a speech from one of Martin Luther King's speeches) and the javascript logic of counting itself in the map and reduce phases (Listing 1). According to this logic, we will separate the words when finding a non-letter character (it is obvious that this is not always the case), in addition to getting rid of “noise”, we will not take into account words shorter than 3 characters.

    Listing 1. WordCountJob.js file
    // map() implementation
    var map = function(key, value, context) {
        var words = value.split(/[^a-zA-Z]/);
        for (var i = 0; i < words.lenght; i++) {
            if (words[i] !== "" && words[i].lenght > 3) {
                context.write(words[i].toLowerCase(), 1);
            }
        }
    };
    // reduce() implementation
    var reduce = function (key, values, context) {
        var sum = 0;
        while (values.hasNext()) {
            sum += parseInt(values.next());
        }
        context.write(key, sum);
    }; 

    The text files and javascript created for analysis, saved in the wordCountJob.js file, must be transferred to HDFS (Listing 2).

    Listing 2.
    #mkdir wordcount // создаем директорию wordcount
    fs.put("wordcount") // копируем в созданную директорию файлы с локального диска
    fs.put("wordcount")
    #ls wordcount // запрашивает листинг директории с анализируемыми данными
    fs.put() // копируем в HDFS файл wordCountJob.js
    #cat wordCountJob.js // просматриваем wordCountob.js

    What has been done looks something like this: We make

    HDInsight.  Word count job

    a request for the first 10 results, sorted in descending order:

    pig.from("wordcount").mapReduce("wordCountJob.js", "word, count:long").orderBy("count DESC").take(49).to("output");

    and so it would look for Amazon S3 cloud storage:

    pig.from("s3n://HDinsightTest/MartinLutherKing").mapReduce("wordCountJob.js", "word, count:long").orderBy("count DESC").take(49).to("output"); 

    We start and at the end we get something like:

    [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! (View Log) 

    Let's look at the result in the form of text output, and then graphic.

    js> #ls output // получаем список файлов в выходной директории

    Found 3 items
    -rw-r--r-- 1 hadoop supergroup 0 2012-12-12 08:47 /user/hadoop/output/_SUCCESS
    drwxr-xr-x - hadoop supergroup 0 2012-12-12 08:47 /user/hadoop/output/_logs
    -rw-r--r-- 1 hadoop supergroup 0 2012-12-12 08:47 /user/hadoop/output/part-r-00000


    js> #cat output/part-r-00000 // просматриваем результат

    Result (word / number of occurrences)
    that 115
    they 65
    this 57
    their 50
    must 50
    have 48
    with 39
    will 38
    vietnam 38
    them 33
    nation 27
    from 26
    when 23
    world 23
    which 22
    america 22
    speak 21
    there 20
    were 17
    into 16
    life 15
    against 15
    those 14
    american 14
    revolution 14
    been 14
    make 13
    people 13
    land 13
    government 13
    these 13
    about 13
    what 12
    such 12
    love 12
    peace 12
    more 11
    over 11
    even 11
    than 11
    great 11
    only 10
    being 10
    violence 10
    every 10
    some 10
    without 10
    would 10
    choice 10

    js> file = fs.read("output")
    js> data = parse(file.data, "word, count:long")
    js> graph.bar(data) // строим график

    HDInsight.  Word count graphic

    It is gratifying to note that Martin Luther King so often used such words as “love,” “freedom,” “choice,” “life.”

    3. And what next?


    According to the subjective opinion of the author, the task of counting the frequency of words in a text fragment (if you do not include morphological analysis) does not differ in the complexity of the subject area. And, as a result, it does not cause a sinking heart of wild interest in me or the IT community.

    The most typical “Big Data” cases are studies aimed at:
    • identify complex dependencies in social networks;
    • anti-fraud systems (fraud detection) in the financial sector;
    • genome studies;
    • log analysis, etc.

    I will not go further, as this is the topic of a separate article. I can only say that cases solved on the Hadoop platform, in general, have the following properties:
    • the problem being solved is reduced to a task parallel in data
      (with the advent of YARN, the fulfillment of this condition is not critical);
    • input data does not fit into RAM;
    • input data is semi- and / or unstructured;
    • input data is not streaming data;
    • The result is not realtime-sensitive.

    Therefore, answering the question presented in the subtitle (“what's next?”), I set the first task to change the subject area . My view fell on the financial sector, in particular, on high-frequency trading ** (High Frequency Trading, HFT).

    The second aspect that you need to look at in a way different from the one presented in the previous example is the automation of calculations .

    The Javascript + Pig bundle , available from the HDInsight Dashboard, is certainly a powerful tool in the hands of an analyst familiar with the subject area and programming. But still this is a “man in the middle” and this is manual input. For the trading robot, this approach is obviously unacceptable.

    In fairness, I repeat that HDInsight supports the execution of tasks via the command line, which can cool automate the solution of any data analysis task. But for me this is more of a quantitative improvement than a qualitative one.

    And a quality improvement is the development in one of the high-level programming languages ​​supported by HDInsight. Among them: Java, Python and C #.

    So, we decided on what needs to be changed so that it would not be embarrassing for the boss to show or talk about the benefits of using HDInsight.

    Unfortunately, the description of the tasks of the analysis component of stock ticks and the implementation of these tasks (with source code), in addition to being far from the final stage, also goes beyond the scope and scope of the article (but, I assure you, this is a matter of time).

    Conclusion


    HDInsight is a flexible solution that allows a developer to deploy an on-premises Hadoop cluster in a relatively short period of time or order it as an on-demand service in the cloud, with a wide range of programming languages ​​and friends to choose from (for .NET developers) development tools.

    Thus, the HDInsight solution unprecedentedly reduces the level of entry into the development of "data-intensive" applications. Which, in turn, will necessarily lead to even more popularization of the Hadoop platform, MPP applications and the demand for such applications among researchers and businesses.

    Source of inspiration


    [1] Matt Winkler. Developing Big Data Analytics Applications with JavaScript and .NET for Windows Azure and Windows . Build 2012 Conference, 2012.

    * Component versions are for December 2012.
    ** The author knows that this is not the most typical case for Hadoop (first and only, due to the sensitivity to real-time calculations). But the eyes are afraid, and the hands - make the head - think.

    Also popular now: