Windows Azure and Hadoop: Friendship Ready for Enterprise

    Over the past half a month, three IT events have taken place at the intersection of Big Data and Cloud Computing. By a strange coincidence, they received little attention both from the Habr community and from the few professional communities on LinkedIn and Facebook.

    The events in question are the Strata + Hadoop World conference, the stable release of Hadoop 2.2.0, and the general availability of the Windows Azure HDInsight cloud service. How these events relate to one another, directly and indirectly, is discussed below.

    Windows Azure HDInsight 2.1 Ecosystem

    Why so many links?
    Because space in this article is limited, I link to resources that are more useful than any loose retelling of their contents would be.

    The first event is the Strata + Hadoop World conference, which was held from October 28 to October 30. Speakers talked about Data Science and the Hadoop platform and used Python and R. Cloudera, Hortonworks, MapR, SAP, IBM, Intel, VMware, and Microsoft were among the long list of sponsors.

    Microsoft's vision was presented in three talks (1, 2, 3). At the same time, it was announced that Windows Azure HDInsight, the service that provides "Hadoop as a Service" on Windows Azure, had reached the stable version (Generally Available, GA).

    As expected, HDInsight GA has no new features compared to the CTP version. More importantly, Microsoft has been consistent (it often happens with development platforms): everything that was available to developers in the CTP version remained in the final release. Notable "features" of HDInsight are that it is a pay-per-use cloud service, that it integrates with Microsoft BI tools, and that jobs can be written in JavaScript and C# (.NET).

    One of the most important strategic advantages of the service, in my opinion, is that HDInsight GA, like the earlier versions, provides a 100% Hadoop-compatible solution.

    This matters because it makes clear that Microsoft is offering a subscription service that you can "leave" without substantially rewriting your code. This is a strategically important opportunity for startups and small research teams.

    Thus, Hive, Pig, or Java code written for on-demand jobs (Windows Azure) can run without modification on an on-premises Hadoop cluster.

    As for the Hadoop platform itself, a stable version of Apache Hadoop 2.x was announced on October 15. I wrote about what is new in Apache Hadoop 2.2.0 earlier; I will only repeat that the most anticipated feature of the release is the YARN framework.

    The announced changes in Hadoop 2.2.0 also include support for running Hadoop on Windows. I was not there to see it firsthand, but I have a feeling that the Windows Server integration is the result of the efforts of Hortonworks engineers, with whom Microsoft works closely.

    But Hortonworks still ships Apache Hadoop 1.2 in HDP for Windows, which (let the Hadoop community correct me if I'm wrong) does not support the YARN computing framework.

    Given the stable release of Hadoop 2 discussed above, the lack of YARN support is a major limitation of Windows Azure HDInsight. Another limitation of the HDInsight service is the maximum available cluster size of 32 nodes.

    Create HDInsight cluster

    In fairness, it is worth noting that the stable version of Hadoop was released only 2 weeks ago, that Hortonworks is already working on support for the latest Hadoop distribution in HDP for Windows (HDP 2 for Windows), and that the limit on the number of nodes can probably be lifted by a request to Windows Azure support.

    The limitations of Hadoop 2.x are not yet understood by the community. I note that most features in version 2 only work around the limitations of version 1. Separately, I will highlight the YARN computing framework, which not only removed the tight coupling to the map/reduce programming model but is also a qualitative leap in the development of the Hadoop platform as a whole.
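
    For readers unfamiliar with the map/reduce programming model mentioned above, here is a minimal single-process sketch in Python. It only illustrates the three phases (map, shuffle, reduce) on a word count; it does not use any Hadoop API, and the names map_words, sum_counts, and run_map_reduce are hypothetical, chosen for this illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def sum_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce phase: collapse all values collected for one key into a total."""
    return word, sum(counts)


def run_map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterable[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run mapper and reducer in one process (illustration only, not Hadoop)."""
    # Shuffle phase: group every value emitted by the mapper under its key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Each key is reduced independently of all the others.
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    lines = ["Hadoop runs on Windows", "YARN runs on Hadoop"]
    print(run_map_reduce(lines, map_words, sum_counts))
```

    The point of the sketch is that the programmer supplies only the mapper and the reducer, and the framework owns grouping and scheduling. In Hadoop 1 that single model was the only way to use the cluster, which is the coupling that YARN removes.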


    Looking at Google's work in the field of Big Data, I would add the lack of support for geo-distribution (of both replication and execution) to the platform's potential limitations. A shortcoming in working with the community (primarily the academic one) is the lack of PDF papers, like those produced by the research units of Google and Microsoft and by UC Berkeley researchers, that describe the principles and architectural approaches used in the design of a particular computing system (one famous example: MapReduce: Simplified Data Processing on Large Clusters).

    Instead of a conclusion: how to try it?

    Hadoop 2.0

    Cloudera: Apache Hadoop 2.x distribution (plus additional components of the Hadoop ecosystem) - CDH 5 Beta.
    Hortonworks: Apache Hadoop 2.x distribution (plus additional components of the Hadoop ecosystem) - Hortonworks Data Platform 2.0.

    Windows Azure HDInsight

    Cloud service: Hadoop as a Service on Windows Azure - Windows Azure HDInsight.
    Locally: an HDInsight emulator - HDInsight Emulator for Windows Azure (install via the Web Platform Installer).
