Hadoop and Automation: Part 1

    Hello colleagues!

    For the last couple of weeks I have been busy with an interesting (at least from my point of view) task: building a Hadoop-as-a-Service solution for our company's private cloud. First of all, I wanted to figure out what kind of beast this Hadoop is, and why the words Big Data and Hadoop are heard together so often these days. For me, getting started with Hadoop meant starting from scratch. I was not, and still am not, a Big Data specialist, so I dug into the subject only as deeply as was necessary to understand the processes in the context of automating cluster deployment.

    My work was made easier by the fact that the task was formulated quite clearly: there is a cluster architecture, there is a Hadoop distribution, and there is Chef as the automation tool. All that remained was to get acquainted with installing and configuring the parts of the cluster, and with the ways it can be used. In the articles that follow I will try to give a simplified description of the cluster architecture and the purpose of its parts, as well as of the configuration and startup process.

    Cluster architecture


    What did I need to end up with? This is the architecture diagram that was given to me.

    As I realized later, it is a fairly simple bare cluster architecture (without HBase, Hive, Pig and other third-party products from the Hadoop ecosystem). But at first glance nothing was clear. Well, Google to the rescue, and here is what came out of it in the end...

    A Hadoop cluster can be divided into 3 parts: Masters, Slaves and Clients.
    Masters control the two main functions of the cluster: data placement and the computation/processing associated with that data.
    Data placement is the responsibility of HDFS, the Hadoop Distributed File System, represented in our architecture by the NameNode and the JournalNode. YARN, also known as MapReduce v2, is responsible for assigning tasks and coordinating distributed computation.
    Slaves do all the "dirty" work; they are the ones that receive and execute the tasks related to computation. Each Slave consists of an HDFS part (DataNode) and a YARN part (NodeManager), each responsible for the corresponding function: distributed data storage or distributed computation.

    And finally, the Clients, a kind of kings of the cluster, who do nothing but feed data and jobs to the cluster and collect the results.
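
    Looking ahead to the automation side, this is roughly how the three parts could be mapped onto Chef roles. This is only a sketch: the role names and the my_hadoop wrapper cookbook with its recipes are purely illustrative, not the actual ones from my setup.

    # roles/hadoop_master.rb -- Masters: HDFS NameNode/JournalNode plus YARN ResourceManager
    name 'hadoop_master'
    description 'Master part of the cluster'
    run_list(
      'recipe[my_hadoop::namenode]',
      'recipe[my_hadoop::journalnode]',
      'recipe[my_hadoop::resourcemanager]'
    )

    # roles/hadoop_slave.rb -- Slaves: HDFS DataNode plus YARN NodeManager
    name 'hadoop_slave'
    description 'Slave part of the cluster'
    run_list(
      'recipe[my_hadoop::datanode]',
      'recipe[my_hadoop::nodemanager]'
    )

    # roles/hadoop_client.rb -- Clients: only the client libraries and tools
    name 'hadoop_client'
    description 'Client part of the cluster'
    run_list('recipe[my_hadoop::client]')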

    Hadoop is written in Java, so every component requires it. The interaction between the parts of the cluster is hidden in the depths of the Java classes, but part of the cluster's behaviour is exposed to us for configuration by making the necessary settings in the right place.
    By default, the configuration files live under /etc/hadoop/conf and contain the parameters that can be overridden for the cluster (a sketch of how they can be declared as cookbook attributes follows the list):
    • hadoop-env.sh and yarn-env.sh - contain the environment variable settings; this is where it is recommended to set the paths and options that Hadoop needs;
    • core-site.xml - contains cluster-wide values that override the defaults, such as the address of the root file system, various Hadoop directories, etc.;
    • hdfs-site.xml - contains the HDFS settings, namely for the NameNode, DataNode and JournalNode, as well as for ZooKeeper: the domain names and ports a particular service runs on, the directories used to store the distributed data, and so on;
    • yarn-site.xml - contains the YARN settings, namely for the ResourceManager and NodeManager: the domain names and ports a particular service runs on, the resource allocation settings for processing, etc.;
    • mapred-site.xml - contains the configuration for MapReduce jobs, as well as the settings for the MapReduce JobHistory server;
    • log4j.properties - contains the configuration of logging, which uses the Apache Commons Logging framework;
    • hadoop-metrics.properties - specifies where Hadoop sends its metrics, be it a file or a monitoring system;
    • hadoop-policy.xml - security and ACL settings for the cluster;
    • capacity-scheduler.xml - settings for the CapacityScheduler, which is responsible for scheduling jobs, placing them in execution queues, and distributing cluster resources among those queues.
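
    Here is a minimal sketch of how such settings can be declared as Chef node attributes, keyed by configuration file and property name, so that the XML on the nodes is always generated and never edited by hand. The attribute layout follows the general pattern of the community cookbook mentioned below; the host names and paths are made up for illustration.

    # attributes/default.rb of the wrapper cookbook -- a sketch.
    # Keys mirror the config files listed above; values are illustrative.
    default['hadoop']['core_site']['fs.defaultFS'] = 'hdfs://mycluster'
    default['hadoop']['hdfs_site']['dfs.namenode.name.dir'] = '/data/hadoop/hdfs/namenode'
    default['hadoop']['hdfs_site']['dfs.datanode.data.dir'] = '/data/hadoop/hdfs/datanode'
    default['hadoop']['yarn_site']['yarn.resourcemanager.hostname'] = 'master-01.example.com'
    default['hadoop']['yarn_site']['yarn.nodemanager.resource.memory-mb'] = '4096'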

    Accordingly, our automation needs not only an unattended installation, but also the ability to change and compose this configuration without actually editing it on the nodes.
    The cluster was deployed on Ubuntu using the HortonWorks distribution (HDP) version 2.0.*.
    To create the cluster, 1 virtual machine was allocated for each of the Masters parts, 1 virtual machine for the Clients, and 2 virtual machines as Slaves.
    When writing the wrapper cookbook I followed the community's best practices, namely this project by Chris Gianelloni, who turned out to be a very active developer, quickly responding to bugs found in the cookbook. This cookbook provides the ability to install the various parts of a Hadoop cluster, to do the basic configuration of the cluster by setting cookbook attributes and generating the configuration files from them, and to verify that the configuration is sufficient to start the cluster.
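
    To illustrate the "generate configuration files from attributes" idea, here is a minimal sketch of the mechanism. The community cookbook already does this for us, so the resource below is only an illustration; the template name and the hdfs/hadoop owner and group are assumptions.

    # A sketch of rendering core-site.xml from node attributes.
    # The .erb template would simply loop over the properties hash and emit
    # <property><name>...</name><value>...</value></property> entries.
    template '/etc/hadoop/conf/core-site.xml' do
      source 'hadoop-site.xml.erb'
      owner  'hdfs'
      group  'hadoop'
      mode   '0644'
      variables(properties: node['hadoop']['core_site'])
    end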

    Clients Deployment Automation


    Clients are the virtual machines that supply data and jobs to the Hadoop cluster and collect the results of the distributed computation.
    After adding the HortonWorks repository entries to the Ubuntu package sources, the various prebuilt deb packages responsible for the different parts of the cluster became available.
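    For reference, wiring in that repository with Chef could look roughly like the sketch below; the repository URL, distribution and key are placeholders, and the exact values should be taken from the HDP documentation for your version.

    # A sketch of adding the HDP apt repository (values are illustrative).
    apt_repository 'hortonworks-hdp' do
      uri          'http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.x/GA'  # check the HDP docs
      distribution 'HDP'
      components   ['main']
      keyserver    'keyserver.ubuntu.com'
      key          'REPLACE_WITH_HDP_GPG_KEY_ID'
      action       :add
    end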
    In this case we were interested in the hadoop-client package, whose installation was carried out as follows:
    package "hadoop-client" do
      action :install
    end
    

    Simple? It could not be easier, thanks to the colleagues from HortonWorks who saved system administrators from having to build Hadoop from source.
    No configuration is needed exclusively for the Clients; they rely on the configuration for the Masters/Slaves (how configuration files are generated from attributes is covered in the next article).
    As a result, once the installation is complete, we are able to submit jobs to our cluster. Jobs are described in .jar files using the Hadoop classes. I will try to describe examples of launching jobs at the end of this series of articles, when our cluster is fully operational.
    The results of the cluster's work end up in the directories specified when the job is started or in the configuration files.
    What should happen next? After we submit a job to our cluster, our Masters should receive the job (YARN) and the files needed to complete it (HDFS), and then distribute the received resources across the Slaves. In the meantime, we have neither Masters nor Slaves. It is the detailed process of installing and configuring these parts of the cluster that I want to talk about in the future articles.

    Part 1 turned out to be a light, introductory piece, in which I described what I needed to do and which path I chose to accomplish it.
    The following parts will contain more code and more details on starting and configuring a Hadoop cluster.
    Comments about inaccuracies and errors in the description are welcome; I clearly still have plenty to learn in the field of Big Data. Thank you all for your attention!
