Hadoop and Automation: Part 3

    Well, Habrazhiteli, it's time to summarize a series of articles ( part 1 and part 2 ) devoted to my adventure with automation of deployment of a Hadoop cluster.

    My project is almost ready, it remains only to test the process and you can make yourself a notch on the fuselage.
    In this article I will talk about raising the “driving force” of our cluster - Slaves , as well as summarize and provide useful links to resources that I used throughout my project. Perhaps, for some, the articles seemed scarce on the source code and implementation details, so at the end of the article I will provide a link to Github

    Well, out of habit, at the very beginning I will give a diagram of the architecture that I managed to deploy to the cloud.

    In our case, given the test nature of the run, only 2 Slave nodes were used , but in real conditions there would be dozens of them. Next, I will briefly describe how their deployment was organized.

    Deploying Slaves


    As you can guess from the architecture, the Slave node consists of 2 parts , each of which is responsible for the actions associated with the parts of the Masters architecture. The DataNode is the point of interaction of the Slave node with the NameNode , which coordinates the distributed storage of data .
    The DataNode process connects to the service on the NameNode node, after which Clients can access file operations directly to the DataNode nodes. It is also worth noting that DataNode nodes communicate with each other in case of data replication, which in turn allows you to get away from using RAID arrays, because The replication mechanism has already been laid in software.
    The process of deploying a DataNode is quite simple:
    • Setting prerequisites in the form of Java ;
    • Adding repositories with packages of Hadoop distribution;
    • Creating the backbone of the directories needed to set the DataNode ;
    • Generate configuration files based on the template and the attributes cookbook yet
    • Installing distribution packages ( hadoop-hdfs-datanode )
    • Starting the DataNode process by service hadoop-hdfs-datanode start;
    • Registration of the status of the deployment process .

    As a result, if all the data is specified correctly and the configuration is applied, the added Slave nodes can be seen on the web interface of NameNode nodes. This means that the DataNode node is now available for file operations related to distributed data storage. Copy the file to HDFS and see for yourself. NodeManager , in turn, is responsible for interacting with the ResourceManager , which manages the tasks and resources available to execute them. The deployment process of NodeManager is similar to the process in the case of DataNode , with a difference in the package name for installation and service ( hadoop-yarn-nodemanager
    )

    After the successful completion of the deployment of Slaves- nodes - we can consider our cluster ready. It is worth paying attention to the files that set the environment variables (hadoop_env, yarn_env, etc.) - the data in the variables must correspond to the actual values ​​in the cluster. Also, it is worth paying attention to the accuracy of the values ​​of the variables in which the domain names and ports on which this or that service is running are indicated .
    How can we verify the health of a cluster? The most affordable option is to start the task from one of the Clients nodes. For example, like this:
    hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 5
    

    where hadoop-mapreduce-examples-2.2.0.jar is the name of the task description file (available in the base installation), pi indicates the type of task (MapReduce Task in this case), and 2 and 5 are responsible for the distribution parameters of the tasks ( in more detail - here ).
    The result , after all the calculations, will be output to the terminal with statistics and the result of the calculations, or the creation of an output file with the output of data there (the nature of the data and the format of their output depends on the task described in the .jar file).
    <End />

    These are the clusters and pies, dear Habrazhiteli . At this stage - I do not pretend to be ideal for this solution, because there are still phases of testing and making improvements / edits to the cookbook code . I wanted to share my experience and describe another approach to deploying a Hadoop cluster - the approach is not the simplest and most Orthodox, I would say. But it is in such unconventional conditions that “steel” is tempered. My final goal is a modest counterpart to the Amazon MapReduce service, for our private cloud.
    I really welcome the advice from everyone who pays and paid attention to this series of articles (special thanks ffriend who paid attention and asked questions, some of which led me to new ideas).

    Material Links


    As promised, here is a list of materials that, along with my colleagues, helped to bring the project to an acceptable form:

    - Detailed documentation on the HDP distribution - docs.hortonworks.com
    - Wiki from the fathers, Apache Hadoop - wiki.apache.org/hadoop
    - Documentation from them - hadoop.apache.org/docs/current
    - A slightly outdated (in terms of) architecture description article - here
    - Nice tutorial in 2 parts - here
    - Adapted tutorial translation from martsen -habrahabr.ru/post/206196
    - Community cookbook " Hadoop ", on the basis of which I made my project - Hadoop cookbook
    - As a result - my humble project as it is (ahead of the update) - GitHub Thank you

    all for your attention! Comments are welcome! If I can help with something - contact! See you again. UPD Added a link to an article on Habré, translation of a tutorial.


    Also popular now: