Hadoop from scratch

This article provides practical guidance for novice administrators on how to build Hadoop from source, perform the initial configuration, and verify that everything works as it should. You will not find any theory here. If you have never encountered Hadoop before and don't know what parts it consists of or how they interact, here are a couple of useful links to the official documentation:

hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/YARN.html

Why not just use a ready-made distribution?

- Training. Articles like this one often begin by recommending that you download a virtual machine image with a Cloudera or Hortonworks distribution. A distribution is usually a complex ecosystem with a lot of components, and it is not easy for a beginner to figure out where everything is and how it all interacts. Starting from scratch lowers the entry threshold a little, because we get to look at the components one at a time.

- Functional tests and benchmarks. There is a lag between the release of a new version of a product and the moment it appears in a distribution. If you need to test new features of a freshly released version, a ready-made distribution will not help. It is also hard to compare the performance of two versions of the same software, since distributions usually give you no way to upgrade a single component while leaving everything else as it is.

- Just for fun.

Why are we compiling from source? After all, Hadoop binary builds are also available.

Part of the Hadoop code is written in C/C++. I don't know which system the development team builds on, but the native libraries that ship with the Hadoop binary builds depend on a version of libc that is available in neither RHEL nor Debian/Ubuntu. Non-functional native libraries are not critical in general, but some features will not work without them.

Why re-describe everything that is already in the official documentation?

This article is intended to save you time. The official documentation does not contain a "do this and it will work" quickstart. If for one reason or another you need to build a "vanilla" Hadoop but have no time to do it by trial and error, you have come to the right place.

Building


We will build on CentOS 7. According to Cloudera, most clusters run on RHEL and its derivatives (CentOS, Oracle Linux). Version 7 is the best fit, since its repositories already contain the protobuf library of the required version. If you want to use CentOS 6, you will have to build protobuf yourself.

We will do the build and all further experiments as root (to keep the article simple).

Roughly 95% of the Hadoop code is written in Java, so for the build we need the Oracle JDK and Maven.

Download the latest JDK from the Oracle site and unpack it to /opt. Also add the JAVA_HOME variable (used by Hadoop) and put /opt/java/bin on the PATH for the root user (for convenience):

cd ~ 
wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u112-b15/jdk-8u112-linux-x64.tar.gz 
tar xvf ~/jdk-8u112-linux-x64.tar.gz 
mv ~/jdk1.8.0_112 /opt/java 
echo "PATH=\"/opt/java/bin:\$PATH\"" >> ~/.bashrc 
echo "export JAVA_HOME=\"/opt/java\"" >> ~/.bashrc 

Install Maven. It is needed only at the build stage, so we will install it in the home directory (once the build is complete, everything left in the home directory can be deleted).

cd ~ 
wget http://apache.rediris.es/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz 
tar xvf ~/apache-maven-3.3.9-bin.tar.gz 
mv ~/apache-maven-3.3.9 ~/maven 
echo "PATH=\"/root/maven/bin:\$PATH\"" >> ~/.bashrc 
source ~/.bashrc 
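
Maven does not need any special configuration; just verify that it is on the PATH and sees our JDK:

mvn -version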

The remaining 4-5% of the Hadoop code is written in C/C++. Install the compiler and the other packages needed for the build:

 yum -y install gcc gcc-c++ autoconf automake libtool cmake

We will also need some third-party libraries:

yum -y install zlib-devel openssl openssl-devel snappy snappy-devel bzip2 bzip2-devel protobuf protobuf-devel 
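
Hadoop 2.7.3 is built against protobuf 2.5, which is exactly why CentOS 7 was chosen. A quick way to confirm that the packaged version matches:

protoc --version

The output should be something like "libprotoc 2.5.0".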

The system is ready. Download, build and install Hadoop into /opt:

cd ~ 
wget http://apache.rediris.es/hadoop/common/hadoop-2.7.3/hadoop-2.7.3-src.tar.gz 
tar -xvf ~/hadoop-2.7.3-src.tar.gz 
mv ~/hadoop-2.7.3-src ~/hadoop-src 
cd ~/hadoop-src 
mvn package -Pdist,native -DskipTests -Dtar 
tar -C/opt -xvf ~/hadoop-src/hadoop-dist/target/hadoop-2.7.3.tar.gz 
mv /opt/hadoop-* /opt/hadoop 
echo "PATH=\"/opt/hadoop/bin:\$PATH\"" >> ~/.bashrc 
source ~/.bashrc 
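
Since one of the reasons for building from source was to get working native libraries, it is worth checking right away that they were compiled and picked up (the -a flag makes the command fail if anything is missing):

hadoop checknative -a

The libraries listed in the output (zlib, snappy, bzip2, openssl) should all be reported as true.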

Initial configuration


Hadoop has about a thousand parameters. Fortunately, about 40 of them are enough to start Hadoop and take the first steps toward mastering it; the rest can be left at their defaults.

Let's get started. As you remember, we installed Hadoop in /opt/hadoop. All configuration files live in /opt/hadoop/etc/hadoop. In total we will need to edit six configuration files. All the configs below are given in the form of commands, so that anyone building their own Hadoop along with this article can simply copy and paste them into the console.

First, set the JAVA_HOME environment variable in hadoop-env.sh and yarn-env.sh. This tells every component which Java installation it should use.

sed -i '1iJAVA_HOME=/opt/java' /opt/hadoop/etc/hadoop/hadoop-env.sh 
sed -i '1iJAVA_HOME=/opt/java' /opt/hadoop/etc/hadoop/yarn-env.sh 

We configure the HDFS URL in the core-site.xml file. It consists of the hdfs:// prefix, the host name of the node running the NameNode, and the port. If this is not done, Hadoop will not use the distributed file system but will work with the local file system of your computer (the default URL is file:///).

cat << EOF > /opt/hadoop/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
EOF
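
A simple way to catch typos in the XML is to ask Hadoop to print the effective value back (this only reads the config, so no daemons need to be running):

hdfs getconf -confKey fs.defaultFS

It should print hdfs://localhost:8020.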

In hdfs-site.xml we configure 4 parameters. We set the number of replicas to 1, since our "cluster" consists of only one node, and we configure the directories where the NameNode, DataNode and SecondaryNameNode data will be stored.

cat << EOF > /opt/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/dfs/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/dfs/dn</value>
  </property>
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>/data/dfs/snn</value>
  </property>
</configuration>
EOF

That is all for the HDFS configuration. We could already start NameNode and DataNode and work with the file system, but let's leave that for the next section and move on to the YARN configuration.

cat << EOF > /opt/hadoop/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/data/yarn</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/data/yarn/log</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
EOF

There are a lot of options. Let's go through them in order.

The yarn.resourcemanager.hostname parameter specifies on which host the ResourceManager service is running.

The parameters yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores are perhaps the most important. With them we tell the cluster how much memory and how many CPU cores each node can devote, in total, to running containers.

The parameters yarn.scheduler.maximum-allocation-mb and yarn.scheduler.maximum-allocation-vcores indicate how much memory and cores can be allocated for a separate container. It is easy to see that with this configuration in our “cluster”, consisting of one node, 4 containers can be started simultaneously (with 1GB of memory each).

The yarn.nodemanager.vmem-check-enabled parameter set to false disables checking the amount of virtual memory used. As you can see from the previous paragraph, not so much memory is available for each container, and with this configuration, any application will probably exceed the limit of available virtual memory.

The yarn.nodemanager.local-dirs parameter specifies where the temporary container data will be stored (jar with application bytecode, configuration files, temporary data generated at runtime, ...)

The yarn.nodemanager.log-dirs parameter specifies where the logs of each task will be stored locally.

The yarn.log-aggregation-enable parameter tells YARN to store logs in HDFS. After an application finishes, its logs are moved from yarn.nodemanager.log-dirs on each node to HDFS (by default, to the /tmp/logs directory).

The yarn.nodemanager.aux-services and yarn.nodemanager.aux-services.mapreduce_shuffle.class parameters specify a third-party shuffle service for the MapReduce framework.

That is about it for YARN. I will also give a configuration for MapReduce (one of the possible distributed computing frameworks). Although it has lately lost some popularity to Spark, it is still used in plenty of places.

cat << EOF > /opt/hadoop/etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>localhost:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>localhost:19888</value>
  </property>
  <property>
    <name>mapreduce.job.reduce.slowstart.completedmaps</name>
    <value>0.8</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.cpu-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.command-opts</name>
    <value>-Djava.net.preferIPv4Stack=true -Xmx768m</value>
  </property>
  <property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Djava.net.preferIPv4Stack=true -Xmx768m</value>
  </property>
  <property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Djava.net.preferIPv4Stack=true -Xmx768m</value>
  </property>
</configuration>
EOF

The mapreduce.framework.name parameter specifies that we will run MapReduce tasks in YARN (the default value of local is used only for debugging - all tasks run in the same jvm on the same machine).

The parameters mapreduce.jobhistory.address and mapreduce.jobhistory.webapp.address specify the host (and ports) on which the JobHistory service will run.

The mapreduce.job.reduce.slowstart.completedmaps parameter tells the framework to start the reduce phase only after at least 80% of the map phase has completed.

The remaining parameters set the maximum memory, CPU cores and JVM heap for mappers, reducers and application masters. As you can see, they must not exceed the corresponding values for the YARN containers that we defined in yarn-site.xml. The JVM heap values are usually set to 75% of the corresponding *.memory.mb parameters.

Start


Create the /data directory, in which the HDFS data will be stored, as well as the temporary YARN container files.

mkdir /data

Format HDFS

hadoop namenode -format 

And finally, we’ll launch all the services of our “cluster”:


/opt/hadoop/sbin/hadoop-daemon.sh start namenode 
/opt/hadoop/sbin/hadoop-daemon.sh start datanode 
/opt/hadoop/sbin/yarn-daemon.sh start resourcemanager 
/opt/hadoop/sbin/yarn-daemon.sh start nodemanager 
/opt/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver 

If everything went well (check for error messages in the logs under /opt/hadoop/logs), Hadoop is deployed and ready to go.
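
A quick way to confirm that all five daemons are actually running is jps, which lists the local Java processes:

jps

The output should contain NameNode, DataNode, ResourceManager, NodeManager and JobHistoryServer (plus Jps itself).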

Health Check


Let's look at the Hadoop directory structure:

/opt/hadoop/ 
├── bin 
├── etc 
│   └── hadoop 
├── include 
├── lib 
│   └── native 
├── libexec 
├── logs 
├── sbin 
└── share 
    ├── doc 
    │   └── hadoop 
    └── hadoop 
        ├── common 
        ├── hdfs 
        ├── httpfs 
        ├── kms 
        ├── mapreduce 
        ├── tools 
        └── yarn 

Hadoop itself (the executable Java bytecode) lives in the share directory and is split into components (hdfs, yarn, mapreduce, etc.). The lib directory contains the libraries written in C. The purpose of the remaining directories is intuitive: bin holds the command line utilities for working with Hadoop, sbin the startup scripts, etc the configs, and logs the logs. We are primarily interested in two utilities from the bin directory: hdfs and yarn.

If you remember, we already formatted HDFS and started all the necessary processes. Let's see what we have in HDFS:


hdfs dfs -ls -R / 
drwxrwx---   - root supergroup          0 2017-01-05 10:07 /tmp 
drwxrwx---   - root supergroup          0 2017-01-05 10:07 /tmp/hadoop-yarn 
drwxrwx---   - root supergroup          0 2017-01-05 10:07 /tmp/hadoop-yarn/staging 
drwxrwx---   - root supergroup          0 2017-01-05 10:07 /tmp/hadoop-yarn/staging/history 
drwxrwx---   - root supergroup          0 2017-01-05 10:07 /tmp/hadoop-yarn/staging/history/done 
drwxrwxrwt   - root supergroup          0 2017-01-05 10:07 /tmp/hadoop-yarn/staging/history/done_intermediate 

We did not create this directory structure explicitly; it was created by the JobHistory service (the last daemon we started: mr-jobhistory-daemon.sh start historyserver).

Let's see what is in the /data directory:

/data/ 
├── dfs 
│   ├── dn 
│   │   ├── current 
│   │   │   ├── BP-1600342399-192.168.122.70-1483626613224 
│   │   │   │   ├── current 
│   │   │   │   │   ├── finalized 
│   │   │   │   │   ├── rbw 
│   │   │   │   │   └── VERSION 
│   │   │   │   ├── scanner.cursor 
│   │   │   │   └── tmp 
│   │   │   └── VERSION 
│   │   └── in_use.lock 
│   └── nn 
│       ├── current 
│       │   ├── edits_inprogress_0000000000000000001 
│       │   ├── fsimage_0000000000000000000 
│       │   ├── fsimage_0000000000000000000.md5 
│       │   ├── seen_txid 
│       │   └── VERSION 
│       └── in_use.lock 
└── yarn 
    ├── filecache 
    ├── log 
    ├── nmPrivate 
    └── usercache 

As you can see, in /data/dfs/nn the NameNode has created the fsimage file and the first edits file. In /data/dfs/dn the DataNode has created a directory for storing data blocks, but there are no blocks yet.

Copy some file from the local file system to HDFS:

hdfs dfs -put /var/log/messages /tmp/
hdfs dfs -ls /tmp/messages 
-rw-r--r--   1 root supergroup     375974 2017-01-05 09:33 /tmp/messages

Let's look at the contents of /data again:

/data/dfs/dn 
├── current 
│   ├── BP-1600342399-192.168.122.70-1483626613224 
│   │   ├── current 
│   │   │   ├── finalized 
│   │   │   │   └── subdir0 
│   │   │   │       └── subdir0 
│   │   │   │           ├── blk_1073741825 
│   │   │   │           └── blk_1073741825_1001.meta 
│   │   │   ├── rbw 
│   │   │   └── VERSION 
│   │   ├── scanner.cursor 
│   │   └── tmp 
│   └── VERSION 
└── in_use.lock 

Hooray! The first block and its checksum have appeared.
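
The same block can be seen from the HDFS side with fsck, which reports the blocks a file consists of and their locations (the block id should match the blk_1073741825 file above):

hdfs fsck /tmp/messages -files -blocks -locations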

Let's run an application to make sure that YARN works as it should. For example, pi from the hadoop-mapreduce-examples.jar package:

yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 3 100000
…
Job Finished in 37.837 seconds 
Estimated value of Pi is 3.14168000000000000000 
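
The completed job is also visible to YARN itself (the application id and timings will of course differ from run to run):

yarn application -list -appStates FINISHED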

If you look at the contents of /data/yarn while the application is running, you can learn a lot of interesting things about how YARN applications are executed:

/data/yarn/ 
├── filecache 
├── log 
│   └── application_1483628783579_0001 
│       ├── container_1483628783579_0001_01_000001 
│       │   ├── stderr 
│       │   ├── stdout 
│       │   └── syslog 
│       ├── container_1483628783579_0001_01_000002 
│       │   ├── stderr 
│       │   ├── stdout 
│       │   └── syslog 
│       ├── container_1483628783579_0001_01_000003 
│       │   ├── stderr 
│       │   ├── stdout 
│       │   └── syslog 
│       └── container_1483628783579_0001_01_000004 
│           ├── stderr 
│           ├── stdout 
│           └── syslog 
├── nmPrivate 
│   └── application_1483628783579_0001 
│       ├── container_1483628783579_0001_01_000001 
│       │   ├── container_1483628783579_0001_01_000001.pid 
│       │   ├── container_1483628783579_0001_01_000001.tokens 
│       │   └── launch_container.sh 
│       ├── container_1483628783579_0001_01_000002 
│       │   ├── container_1483628783579_0001_01_000002.pid 
│       │   ├── container_1483628783579_0001_01_000002.tokens 
│       │   └── launch_container.sh 
│       ├── container_1483628783579_0001_01_000003 
│       │   ├── container_1483628783579_0001_01_000003.pid 
│       │   ├── container_1483628783579_0001_01_000003.tokens 
│       │   └── launch_container.sh 
│       └── container_1483628783579_0001_01_000004 
│           ├── container_1483628783579_0001_01_000004.pid 
│           ├── container_1483628783579_0001_01_000004.tokens 
│           └── launch_container.sh 
└── usercache 
    └── root 
        ├── appcache 
        │   └── application_1483628783579_0001 
        │       ├── container_1483628783579_0001_01_000001 
        │       │   ├── container_tokens 
        │       │   ├── default_container_executor_session.sh 
        │       │   ├── default_container_executor.sh 
        │       │   ├── job.jar -> /data/yarn/usercache/root/appcache/application_1483628783579_0001/filecache/11/job.jar 
        │       │   ├── jobSubmitDir 
        │       │   │   ├── job.split -> /data/yarn/usercache/root/appcache/application_1483628783579_0001/filecache/12/job.split 
        │       │   │   └── job.splitmetainfo -> /data/yarn/usercache/root/appcache/application_1483628783579_0001/filecache/10/job.splitmetainfo 
        │       │   ├── job.xml -> /data/yarn/usercache/root/appcache/application_1483628783579_0001/filecache/13/job.xml 
        │       │   ├── launch_container.sh 
        │       │   └── tmp 
        │       │       └── Jetty_0_0_0_0_37883_mapreduce____.rposvq 
        │       │           └── webapp 
        │       │               └── webapps 
        │       │                   └── mapreduce 
        │       ├── container_1483628783579_0001_01_000002 
        │       │   ├── container_tokens 
        │       │   ├── default_container_executor_session.sh 
        │       │   ├── default_container_executor.sh 
        │       │   ├── job.jar -> /data/yarn/usercache/root/appcache/application_1483628783579_0001/filecache/11/job.jar 
        │       │   ├── job.xml 
        │       │   ├── launch_container.sh 
        │       │   └── tmp 
        │       ├── container_1483628783579_0001_01_000003 
        │       │   ├── container_tokens 
        │       │   ├── default_container_executor_session.sh 
        │       │   ├── default_container_executor.sh 
        │       │   ├── job.jar -> /data/yarn/usercache/root/appcache/application_1483628783579_0001/filecache/11/job.jar 
        │       │   ├── job.xml 
        │       │   ├── launch_container.sh 
        │       │   └── tmp 
        │       ├── container_1483628783579_0001_01_000004 
        │       │   ├── container_tokens 
        │       │   ├── default_container_executor_session.sh 
        │       │   ├── default_container_executor.sh 
        │       │   ├── job.jar -> /data/yarn/usercache/root/appcache/application_1483628783579_0001/filecache/11/job.jar 
        │       │   ├── job.xml 
        │       │   ├── launch_container.sh 
        │       │   └── tmp 
        │       ├── filecache 
        │       │   ├── 10 
        │       │   │   └── job.splitmetainfo 
        │       │   ├── 11 
        │       │   │   └── job.jar 
        │       │   │       └── job.jar 
        │       │   ├── 12 
        │       │   │   └── job.split 
        │       │   └── 13 
        │       │       └── job.xml 
        │       └── work 
        └── filecache 
42 directories, 50 files

In particular, we see that the logs are written to /data/yarn/log (the yarn.nodemanager.log-dirs parameter from yarn-site.xml).

After the application finishes, /data/yarn returns to its original form:

/data/yarn/ 
├── filecache 
├── log 
├── nmPrivate 
└── usercache 
    └── root 
        ├── appcache 
        └── filecache

If we look at the contents of HDFS again, we can see that log aggregation works: the logs of the application we just ran were moved from the local /data/yarn/log to /tmp/logs in HDFS.
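
Aggregated logs can be read back without digging through HDFS manually, using the application id from the run above:

yarn logs -applicationId application_1483628783579_0001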

We will also see that the JobHistory service has saved information about our application in /tmp/hadoop-yarn/staging/history/done.

hdfs dfs -ls -R / 
drwxrwx---   - root supergroup          0 2017-01-05 10:12 /tmp 
drwxrwx---   - root supergroup          0 2017-01-05 10:07 /tmp/hadoop-yarn 
drwxrwx---   - root supergroup          0 2017-01-05 10:12 /tmp/hadoop-yarn/staging 
drwxrwx---   - root supergroup          0 2017-01-05 10:07 /tmp/hadoop-yarn/staging/history 
drwxrwx---   - root supergroup          0 2017-01-05 10:13 /tmp/hadoop-yarn/staging/history/done 
drwxrwx---   - root supergroup          0 2017-01-05 10:13 /tmp/hadoop-yarn/staging/history/done/2017 
drwxrwx---   - root supergroup          0 2017-01-05 10:13 /tmp/hadoop-yarn/staging/history/done/2017/01 
drwxrwx---   - root supergroup          0 2017-01-05 10:13 /tmp/hadoop-yarn/staging/history/done/2017/01/05 
drwxrwx---   - root supergroup          0 2017-01-05 10:13 /tmp/hadoop-yarn/staging/history/done/2017/01/05/000000 
-rwxrwx---   1 root supergroup      46338 2017-01-05 10:13 /tmp/hadoop-yarn/staging/history/done/2017/01/05/000000/job_1483628783579_0001-1483629144632-root-QuasiMonteCarlo-1483629179995-3-1-SUCCEEDED-default-1483629156270.jhist 
-rwxrwx---   1 root supergroup     117543 2017-01-05 10:13 /tmp/hadoop-yarn/staging/history/done/2017/01/05/000000/job_1483628783579_0001_conf.xml 
drwxrwxrwt   - root supergroup          0 2017-01-05 10:12 /tmp/hadoop-yarn/staging/history/done_intermediate 
drwxrwx---   - root supergroup          0 2017-01-05 10:13 /tmp/hadoop-yarn/staging/history/done_intermediate/root 
drwx------   - root supergroup          0 2017-01-05 10:12 /tmp/hadoop-yarn/staging/root 
drwx------   - root supergroup          0 2017-01-05 10:13 /tmp/hadoop-yarn/staging/root/.staging 
drwxrwxrwt   - root supergroup          0 2017-01-05 10:12 /tmp/logs 
drwxrwx---   - root supergroup          0 2017-01-05 10:12 /tmp/logs/root 
drwxrwx---   - root supergroup          0 2017-01-05 10:12 /tmp/logs/root/logs 
drwxrwx---   - root supergroup          0 2017-01-05 10:13 /tmp/logs/root/logs/application_1483628783579_0001 
-rw-r-----   1 root supergroup      65829 2017-01-05 10:13 /tmp/logs/root/logs/application_1483628783579_0001/master.local_37940 
drwxr-xr-x   - root supergroup          0 2017-01-05 10:12 /user 
drwxr-xr-x   - root supergroup          0 2017-01-05 10:13 /user/root

Distributed Cluster Testing


You may have noticed that until now I have been putting "cluster" in quotation marks, since everything has been running on a single machine. Let's fix that annoying shortcoming and test our Hadoop in a truly distributed cluster.

First of all, let's tweak the Hadoop configuration. At the moment the host name in the Hadoop configuration is specified as localhost. If we simply copied this configuration to other nodes, each node would try to find the NameNode, ResourceManager and JobHistory services on its own host. So we decide in advance which host will run these services and change the configs accordingly.

In my case all of the master services mentioned above (NameNode, ResourceManager, JobHistory) will run on the host master.local. Replace localhost with master.local in the configuration:

cd /opt/hadoop/etc/hadoop
sed -i 's/localhost/master.local/' core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml
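
A quick sanity check that no localhost references remain in the files we just edited (the command should print nothing):

grep localhost core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml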

Now I simply clone the virtual machine I have been building on twice to get two slave nodes. The slave nodes need unique host names (in my case slave1.local and slave2.local). Also, on all three nodes of our cluster, configure /etc/hosts so that every machine in the cluster can reach the others by host name. In my case it looks like this (the same content on all three machines):

cat /etc/hosts
…
192.168.122.70   master.local 
192.168.122.59   slave1.local 
192.168.122.217 slave2.local 

Additionally, on the slave1.local and slave2.local nodes you need to clear the contents of /data/dfs/dn:

rm -rf /data/dfs/dn/*

Everything is ready. On master.local we start all the services:

/opt/hadoop/sbin/hadoop-daemon.sh start namenode 
/opt/hadoop/sbin/hadoop-daemon.sh start datanode 
/opt/hadoop/sbin/yarn-daemon.sh start resourcemanager 
/opt/hadoop/sbin/yarn-daemon.sh start nodemanager 
/opt/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver 

On slave1.local and slave2.local, run only DataNode and NodeManager:

/opt/hadoop/sbin/hadoop-daemon.sh start datanode 
/opt/hadoop/sbin/yarn-daemon.sh start nodemanager 

Let's check that our cluster now consists of three nodes.

For HDFS, look at the output of the dfsadmin -report command and make sure that all three machines are included in the Live datanodes list:

hdfs dfsadmin -report
...
Live datanodes (3): 
…
Name: 192.168.122.70:50010 (master.local) 
...
Name: 192.168.122.59:50010 (slave1.local) 
...
Name: 192.168.122.217:50010 (slave2.local) 

Or open the NameNode web page:

master.local:50070/dfshealth.html#tab-datanode


For YARN, look at the output of the node -list command:

yarn node -list -all 
17/01/06 06:17:52 INFO client.RMProxy: Connecting to ResourceManager at master.local/192.168.122.70:8032 
Total Nodes:3 
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers 
slave2.local:39694              RUNNING slave2.local:8042                                  0 
slave1.local:36880              RUNNING slave1.local:8042                                  0 
master.local:44373              RUNNING master.local:8042                                  0 

Or open the ResourceManager web page:

master.local:8088/cluster/nodes


All nodes must be in the list with the RUNNING status.

Finally, let's make sure that running MapReduce applications use resources on all three nodes. Run the familiar pi example from hadoop-mapreduce-examples.jar:

yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 30 1000

While it is running, look at the output of yarn node -list -all again:

...
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers 
slave2.local:39694              RUNNING slave2.local:8042                                  4 
slave1.local:36880              RUNNING slave1.local:8042                                  4 
master.local:44373              RUNNING master.local:8042                                  4 

Number-of-Running-Containers is 4 on each node.

We can also open master.local:8088/cluster/nodes and see how many cores and how much memory are being used in total by all applications on each node.
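
The same numbers are available through the ResourceManager REST API, which is handy for scripting (the fields to look at should be usedMemoryMB and usedVirtualCores for each node):

curl -s http://master.local:8088/ws/v1/cluster/nodes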



Conclusion


We have compiled Hadoop from source, installed it, configured it and verified that it works, both on a single machine and in a distributed cluster. If the topic interests you and you would like to build other services from the Hadoop ecosystem in a similar way, here is a link to a script that I maintain for my own needs:

github.com/hadoopfromscratch/hadoopfromscratch

With it you can install ZooKeeper, Spark, Hive, HBase, Cassandra and Flume. If you find errors or inaccuracies, please write; I would really appreciate it.
