
Big Data - Admin Course

Table of Contents – version 20200619

1. Prelude
2. Ambari
3. Debugging – Ambari (A)
4. Understanding YARN Config
5. Map Reduce Job Submission – YARN (A)
6. Using HDFS
7. Understanding HDFS Internals (A)
8. Understanding Debugging in HDFS (A)
10. Change NN Heap settings & Config Group – Services (A)
11. Hadoop Benchmarks (A)
12. ResourceManager high availability
13. Tuning and Debugging HDP (A)


1. Prelude

All software is in the D:\Software folder on your desktop.


All commands should be executed using PuTTY.
Use the WinSCP browser to copy software from the Windows desktop to your Linux box.

Action: Start virtual machine.


Start VMware Workstation or VMware Player and import the VM from the D:\Software folder. After that,
start the VM and connect to it using PuTTY or directly from the console. The credentials for the
system are root/life213. You can determine the IP address of the system using the ifconfig command.
Refer to the Supplement for importing the VM into VMware Workstation.
Note: Add the system IP and hostname to the hosts file of the VM as well as of the Windows client
machine so that you can access the system by hostname.

Ex : 10.10.20.21 tos.master.com

Mount the shared folder in the VM; henceforth it will be referred to as the Software folder. You can
refer to the supplement document for enabling and mounting the shared folder.
If you hit any issue at boot, comment out the last line of /etc/fstab as shown below.

Issue

Resolution (#vi /etc/fstab)



Reboot the machine for the changes to take effect.


Mounting Shared Folder:
/usr/bin/vmware-hgfsclient
/usr/bin/vmhgfs-fuse .host:/ /mnt/hgfs -o subtype=vmhgfs-fuse,allow_other
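A quick sanity check after mounting; the shared folder listed should match the one you enabled in VMware:

ls /mnt/hgfs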


2. Ambari

Goal: You will install the Ambari server on the hedge host.

Hadoop requires java hence you need to install JDK and set Java Home on all the nodes.
#su - root
#mkdir /YARN
#tar -xvf jdk-8u181-linux-x64.tar.gz -C /YARN
#cd /YARN
#mv jdk1.8.0_181 jdk
To include JAVA_HOME for all bash users, make an entry in /etc/profile.d as follows:

#echo "export JAVA_HOME=/YARN/jdk/" > /etc/profile.d/java.sh

Include in .bashrc (vi ~/.bashrc)


export PATH=$PATH:$JAVA_HOME/bin
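To verify the setup, start a new shell and check the Java home and version (the version string assumes the JDK 8u181 archive extracted above):

#bash
#echo $JAVA_HOME
#$JAVA_HOME/bin/java -version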


We have extracted Java into the /YARN/jdk folder and set JAVA_HOME while logged on as root
(CentOS 7 64-bit, CLI).
Type bash at the command prompt to reinitialize the shell scripts.
Next we will download the Ambari repository file so that the yum utility can install Ambari from it.
Steps
1. #mkdir /apps
2. #cd /apps
3. Download the Ambari repository file to a directory on your installation host.
4. yum install wget

wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.7.3.0/ambari.repo -O /etc/yum.repos.d/ambari.repo

Important
Do not modify the ambari.repo file name. This file is expected to be available on the
Ambari Server host during Agent registration.
5. Confirm that the repository is configured by checking the repo list.

yum repolist


1. Install Ambari. This also installs the default PostgreSQL Ambari database.

yum install ambari-server


2. Enter y when prompted to confirm transaction and dependency checks.


Ambari Server by default uses an embedded PostgreSQL database. When you install the Ambari
Server, the PostgreSQL packages and dependencies must be available for install. These packages
are typically available as part of your Operating System repositories.

Set Up the Ambari Server

Before starting the Ambari Server, you must set up the Ambari Server. Setup configures Ambari
to talk to the Ambari database, installs the JDK and allows you to customize the user account the
Ambari Server daemon will run as.
ambari-server setup
The command manages the setup process. Run the command on the Ambari server host to start
the setup process.
Respond to the setup prompt:
1. If you have not temporarily disabled SELinux, you may get a warning. Accept the default (y),
and continue.
2. By default, Ambari Server runs under root. Accept the default (n) at the Customize user account
for ambari-server daemon prompt, to proceed as root.

3. If you have not temporarily disabled iptables you may get a warning. Enter y to continue.
4. Select a JDK version to download. Enter 1 to download Oracle JDK 1.8. Alternatively, you can
choose to enter a Custom JDK. If you choose Custom JDK, you must manually install the
JDK on all hosts and specify the Java Home path.
For our lab, enter 2 (Custom JDK) and specify the Java home as /YARN/jdk.
Note


JDK support depends entirely on your choice of Stack versions. By default, Ambari
Server setup downloads and installs Oracle JDK 1.8 and the accompanying Java
Cryptography Extension (JCE) Policy Files.
5. Enable Ambari Server to download and install GPL Licensed LZO packages [y/n] (n)? y
6. Accept the Oracle JDK license when prompted. You must accept this license to download the
necessary JDK from Oracle. The JDK is installed during the deploy phase.

7. Select n at Enter advanced database configuration to use the default, embedded PostgreSQL
database for Ambari. The default PostgreSQL database name is ambari. The default user name
and password are ambari/bigdata.


8. Setup completes.

Start the Ambari Server


 Run the following command on the Ambari Server host:

ambari-server start

 To check the Ambari Server processes:

ambari-server status


 To stop the Ambari Server: Do not execute this command. It’s for your information.

ambari-server stop

On Ambari Server start, Ambari runs a database consistency check looking for issues. If any issues
are found, Ambari Server start will abort and display the following message: DB configs
consistency check failed. Ambari writes more details about database consistency check results to
the /var/log/ambari-server/ambari-server-check-database.log file.
You can force Ambari Server to start by skipping this check with the following option (use it only
when there is an issue):
ambari-server start --skip-database-check
If you have database issues and choose to skip this check, do not make any changes to your
cluster topology or perform a cluster upgrade until you correct the database consistency issues.

If an error like the one shown below occurs during startup:


2019-04-13 20:25:48,248 INFO - Checking DB store version
2019-04-13 20:25:51,247 ERROR - Current database store version is not compatible
with current server version, serverVersion=2.7.3.0, schemaVersion=2.6.0


Solution:

# ambari-server status
Using python /usr/bin/python
Ambari-server status
Ambari Server not running. Stale PID File at: /var/run/ambari-server/ambari-server.pid
# ambari-server reset
Using python /usr/bin/python
Resetting ambari-server
**** WARNING **** You are about to reset and clear the Ambari Server database. This will
remove all cluster host and configuration information from the database. You will be required to
re-configure the Ambari server and re-run the cluster wizard.
Are you SURE you want to perform the reset [yes/no] (no)? yes
Confirm server reset [yes/no](no)? yes
Resetting the Server database...
Creating schema and user...
done.
Creating tables...
done.
Ambari Server 'reset' completed successfully.

Then run the Ambari setup (ambari-server setup) again.


Next Steps

Log on to Apache Ambari - hedge.ostech.com


Prerequisites
Ambari Server must be running.
The following downloads the required MySQL JDBC driver:
#yum install mysql-connector-java*
#cd /usr/lib/ambari-agent
#cp /usr/share/java/mysql-connector-java.jar .

#cp /usr/share/java/mysql-connector-java.jar /var/lib/ambari-agent/tmp/

Note: Whenever there is an issue related to a jar file, determine the jar from the log file and
manually download and copy it to the tmp folder as shown above.

Log on to Ambari Web using a web browser and install the HDP cluster software.
Stop the firewall in the VM.
#systemctl stop firewalld


#systemctl disable firewalld
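You can optionally confirm that the firewall is stopped and disabled before continuing:

#systemctl status firewalld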


Steps
1. Point your web browser to
http://<your.ambari.server>:8080
where <your.ambari.server> is the name of your Ambari Server host.
For example, a default Ambari server host is located at http://hedge:8080/#/login.

2. Log in to the Ambari Server using the default user name/password: admin/admin. For a new
cluster, the Cluster Install wizard displays a Welcome page.


Click Sign In

-------------------------------------------- Lab ends here ---------------------------------------------


3. Debugging – Ambari(A)
Debug logs help us troubleshoot Ambari issues better and faster. Debug logs contain many more internal calls, which helps
in understanding the problem.
Check current log level in log4j.properties file

Check log4j.rootLogger property value in log4j.properties file.

#grep rootLogger /etc/ambari-server/conf/log4j.properties

In the output above the rootLogger value is shown as INFO,file; we need to change it to DEBUG,file.

INFO is the default log level in Ambari server.

We can also check ambari-server.log file for log level.

#tail -f /var/log/ambari-server/ambari-server.log

1. Open the relevant configuration file in a UNIX text editor:

 Ambari Server Log Configuration: /etc/ambari-server/conf/log4j.properties

2. Replace "INFO" with "DEBUG" in the rootLogger line, i.e. change

log4j.rootLogger=INFO,file

to DEBUG,file, for example with vi /etc/ambari-server/conf/log4j.properties.

3. Confirm the change:

[root@hawq20 conf]# grep rootLogger /etc/ambari-server/conf/log4j.properties

log4j.rootLogger=DEBUG,file
[root@hawq20 conf]#

4. Save the configuration file and close it.

Restart the Ambari server:

ambari-server restart

Check the DEBUG entries in the ambari-server.log file and find the entry recording the heartbeat received from your cluster node. If you
have an issue with any of the nodes, look for the heartbeat received from it at the Ambari server. If it is not in the log file, check
the Ambari agent status on that node.

Command :

tail -f /var/log/ambari-server/ambari-server.log
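To narrow the output to heartbeat entries for a particular node, you can filter the log; the hostname below is just an example, substitute one of your cluster hosts:

grep -i heartbeat /var/log/ambari-server/ambari-server.log | grep tos.master.com | tail -n 20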

Revert log level to INFO

Please revert the log level to INFO once the debug logs are collected, using the same steps. Debug logs take a lot of space and can also
cause service failures sometimes.

To enable debug logging in Ambari agent, follow these steps:

1. Open the relevant configuration file in a UNIX text editor:

 Ambari Agent: /etc/ambari-agent/conf/ambari-agent.ini

2. Locate (or add) the loglevel entry:


[root@hawq20 conf]# grep loglevel ambari-agent.ini
;loglevel=(DEBUG/INFO)
loglevel=INFO
[root@hawq20 conf]#

3. Replace "loglevel=INFO" with "loglevel=DEBUG":

[root@hawq20 conf]# grep loglevel ambari-agent.ini


;loglevel=(DEBUG/INFO)
loglevel=DEBUG
[root@hawq20 conf]#

4. Save the configuration file and close it.

5. Restart Ambari agent:

ambari-agent restart

NOTE: Changing the logging level affects only the Ambari agent on that host; the other hosts in the cluster are not affected.

tail -f /var/log/ambari-agent/ambari-agent.log
Look for an entry that shows the agent sending a heartbeat to the server, as shown below.

After this, revert the setting to INFO and restart the Ambari agent.

--------------------------------- Lab Ends Here --------------------------------------------


4. Understanding YARN Config

Goal: You will verify some of the settings related to HDFS and YARN in the configuration files so that you
become familiar with the various config files.

Start the Ambari Server

 Run the following command on the Ambari Server host:

ambari-server start

 To check the Ambari Server processes:


ambari-server status

 To stop the Ambari Server: Do not execute this command. It's for your information.
ambari-server stop

You can view the log file in case of any issue: /var/log/ambari-server/ambari-server*.log

Log In to Apache Ambari

1. Point your web browser to
http://<your.ambari.server>:8080
where <your.ambari.server> is the name of your Ambari Server host.
For example, a default Ambari server host is located at http://tos.master.com:8080/#/login.

2. Log in to the Ambari Server using the default user name/password: admin/admin. You can
change these credentials later.


Action: Task to stop or start services.


Log on to the Ambari server and click any of the services listed under the Services tab in the dashboard. Click
the red icon and choose Start from the menu.

http://tos.master.com:8080/#/main/dashboard/metrics

Start the hdfs and yarn services.

HDFS services:
 NameNode
 DataNode
Yarn services:
 Resource Manager
 Node Manager

NameNode
If the NameNode takes quite a long time to start up (i.e. exceeds 10 minutes or so), verify the log in
the following location and view the latest file.

/var/lib/ambari-agent/data/out*txt
If there is an error message as shown below, execute the command to come out of safe mode.

2019-06-09 14:04:54,384 - Retrying after 10 seconds. Reason: Execution of
'/usr/hdp/current/hadoop-hdfs-namenode/bin/hdfs dfsadmin -fs hdfs://tos.master.com:8020 -safemode get | grep 'Safe mode is OFF'' returned 1.

#sudo -u hdfs hdfs dfsadmin -safemode leave
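You can confirm that the NameNode has left safe mode before moving on:

#sudo -u hdfs hdfs dfsadmin -safemode get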

If everything goes well it will be as shown below:


Ensure that following services are started:


Open a terminal and view /etc/hadoop/conf/core-site.xml (for example with vi). Verify the port number and the host
that runs the NameNode service.

<property>
<name>fs.defaultFS</name>
<value>hdfs://tos.hp.com:8020</value>
<final>true</final>
</property>
<property>
<name>hadoop.proxyuser.hdfs.groups</name>
<value>*</value>
</property>
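You can cross-check the same value from the command line; hdfs getconf reads the client-side configuration:

#hdfs getconf -confKey fs.defaultFS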

You can also verify using the Ambari console as shown below:

Ambari Dashboard > Services > HDFS > Configs > Advanced


This is the graphical representation of the config file. All changes should be made from the web
console only, so that Ambari manages synchronizing them to all slave nodes; otherwise you have to
do it manually.


Verify the replication factor and the physical location of the data or block configured for the
cluster.

#vi /etc/hadoop/conf/hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/apps/YARN/data/hadoop/hdfs/nn</value>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>file:/apps/YARN/data/hadoop/hdfs/snn</value>
</property>
<property>
<name>fs.checkpoint.edits.dir</name>
<value>file:/apps/YARN/data/hadoop/hdfs/snn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/apps/YARN/data/hadoop/hdfs/dn</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>hp.tos.com:50070</value>
</property>
</configuration>


Verify /etc/hadoop/conf/mapred-site.xml

MapReduce-related settings: MapReduce executes in YARN mode in this cluster.

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

View yarn-site.xml (you can use the vi editor). The pluggable shuffle and pluggable sort capabilities
allow replacing the built-in shuffle and sort logic with alternate implementations.

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

Let us review the Java heap sizes:


#vi /etc/hadoop/conf/hadoop-env.sh

Verify the following parameters.

HADOOP_HEAPSIZE="500"
HADOOP_NAMENODE_INIT_HEAPSIZE="500"

View the following information in mapred-env.sh:

HADOOP_JOB_HISTORYSERVER_HEAPSIZE=250

Verify yarn-env.sh [if the following variables are not in the file, you can add them on the
last line; however, ignore this for the time being]:

JAVA_HEAP_MAX=-Xmx500m
YARN_HEAPSIZE=500


After starting HDFS, you can verify the Java processes as shown below. The jps command lists the
Java processes started for Hadoop and YARN.

#su -
#su hdfs
#jps

You can also verify the services using the web interface. Replace the IP with that of your
server.


Access the NameNode UI and DataNode UI. Familiarize yourself with the various features of these UIs,
especially the nodes that belong to the HDP cluster and the files stored in HDFS.

http://tos.master.com:50070/dfshealth.html#tab-overview

Verify the NameNode that is used to connect to the cluster.

Hint: Overview 'tos.hp.com:8020' (active)

You can click on the various tabs to familiarize with the web UI.


It provides the overview of the Hadoop cluster.

You can verify the datanode information using this tab.


How many DataNodes are there in the cluster? Currently only one. What about any node being
decommissioned?

Any snapshots taken? You can verify this after the snapshot lab; all snapshot
information will be shown here.


You can verify the startup status of the Hadoop cluster.


Browse the files in the cluster using the following option:


Click any of the files and verify its content.


You can access the Resource Manager UI as shown below:

http://tos.hp.com:8088/

Whenever you submit a job to the YARN cluster, it will be listed in this console, along with how
much of the cluster's resources it consumes.

Verify the nodes that are running the yarn applications.


Understand the log folder and location: /var/log

Review the NameNode log: tail -f hadoop-hdfs-namenode-tos.hp.com.log
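To surface only problems, you can filter the NameNode log; the path below assumes the HDP default log directory for the hdfs user and the log file name shown above:

grep -iE "warn|error" /var/log/hadoop/hdfs/hadoop-hdfs-namenode-tos.hp.com.log | tail -n 20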


Congratulations! You have successfully completed understanding the main configuration of the YARN cluster.

-------------------------------------------- Lab ends here ----------------------------------------------


5. Map Reduce Job Submission – YARN(A)

At the end of this lab you will be able to submit a MapReduce job to a Hadoop YARN cluster.
Ensure that the Hadoop cluster is configured and started before proceeding.
We are going to use the sample MapReduce examples provided by the Hadoop installation, run as the hdfs
user, to understand how to submit an MR job.
Run Sample MapReduce Examples using hdfs user


# su - hdfs
#export YARN_EXAMPLES=/usr/hdp/current/hadoop-mapreduce-client
#yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-3.1.1.3.1.4.0-315.jar pi 16 1000
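You can also list the submitted applications from the command line, as an alternative to the web console:

#yarn application -list -appStates ALL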


You can verify the execution of jobs using the YARN web console.


http://tos.hp.com:8088/ui2/#/cluster-overview

or, to view the job status, click on the Resource Manager UI.


Then click on Queues to understand which queue executed the job.

We will talk about queues when we discuss schedulers later in the training.


Click on the Applications tab.

Now the job is in the ACCEPTED state. Eventually it will be in the RUNNING state as shown below.


Click on the application ID link and verify the resources consumed by this job. Hover the mouse
over the colored bar to get the exact value of memory consumption.

Find out where the AM (ApplicationMaster) executes for the job we have just submitted.

Click on Application ID using the RM UI.


In my case it is the slave node; this can be different for your execution.


Click on Diagnostics to understand the resources demanded and consumed.

In the above example it asks for 5 containers, each of 768 MB and 1 vcore.


Finally, at the end of the job execution, the pi result will be shown as above.


As shown above in the counters:


16 map tasks were launched for this job and only one reducer. Data locality was achieved for all
16 mappers.


Errata: /etc/hadoop/conf/yarn-site.xml

Issue: MapReduce jobs do not proceed and remain stuck in the ACCEPTED state.
Solution: verify the properties below in the yarn-site.xml file and run the YARN services as the yarn user only.

<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>124</value>
</property>

<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>

Reduce the memory of the ResourceManager/NodeManager to about 250 MB each if you are unable to execute jobs, and
start the MRv2 history server.


YARN > Configs > Edit

-------------------------------------------- Lab ends here ----------------------------------------------


6. Using HDFS

In this lab you will begin to get acquainted with the Hadoop tools. You will manipulate files in
HDFS, the Hadoop Distributed File System.
Set Up Your Environment
Before starting the labs, start up the VM and HDFS; you need to log on as the hdfs user for this
exercise.

Start the Ambari Server

 Run the following command on the Ambari Server host:

ambari-server start


 To check the Ambari Server processes:

ambari-server status

 To stop the Ambari Server: Do not execute this command. It's for your information.
ambari-server stop

You can view the log file in case of any issue: /var/log/ambari-server/ambari-server*.log

Log In to Apache Ambari

1. Point your web browser to
http://<your.ambari.server>:8080
where <your.ambari.server> is the name of your Ambari Server host.
For example, a default Ambari server host is located at http://tos.master.com:8080/#/login.

2. Log in to the Ambari Server using the default user name/password: admin/admin. You can
change these credentials later.


Action: Task to stop or start services.


Log on to the Ambari server and click any of the services listed under the Services tab in the dashboard. Click
the red icon and choose Start from the menu.

http://tos.master.com:8080/#/main/dashboard/metrics


Start the hdfs and yarn services.

HDFS services:
 NameNode
 DataNode
Yarn services:
 Resource Manager
 Node Manager

NameNode
If the NameNode takes quite a long time to start up (i.e. exceeds 10 minutes or so), verify the log in
the following location and view the latest file.

/var/lib/ambari-agent/data/out*txt
If there is an error message as shown below, execute the command to exit safe mode.

2019-06-09 14:04:54,384 - Retrying after 10 seconds. Reason: Execution of
'/usr/hdp/current/hadoop-hdfs-namenode/bin/hdfs dfsadmin -fs hdfs://tos.master.com:8020 -safemode get | grep 'Safe mode is OFF'' returned 1.

#sudo -u hdfs hdfs dfsadmin -safemode leave

If everything goes well it should be as shown below:


Ensure that the following services are started:


Data files (local): you need to copy all of these files into your VM. All exercises need to be performed
using the hdfs logon unless specified otherwise. You can create a data folder in your home directory and put
all the data inside that folder.

/software/data/shakespeare.tar.gz
/software/data/access_log.gz
/software/data/pg20417.txt

Hadoop is already installed, configured, and running on your virtual machine. Most of your
interaction with the system will be through a command-line wrapper called hadoop. If you run
this program with no arguments, it prints a help message. To try this, run the following command
in a terminal window:
# su - hdfs
$ hadoop


The hadoop command is subdivided into several subsystems. For example, there is a subsystem
for working with files in HDFS and another for launching and managing MapReduce processing
jobs.

Exploring HDFS
The subsystem associated with HDFS in the Hadoop wrapper program is called FsShell. This
subsystem can be invoked with the command hadoop fs.
Open a terminal window (if one is not already open) by double-clicking the Terminal icon on the
desktop.
In the terminal window, enter:

$ hadoop fs


You see a help message describing all the commands associated with the FsShell subsystem.
Enter:

$ hadoop fs -ls /
This shows you the contents of the root directory in HDFS. There will be multiple entries, one of
which is /user. Individual users have a “home” directory under this directory, named after their
username; your username in this course is hdfs, therefore your home directory is /user/hdfs.

Try viewing the contents of the /user directory by running:


$ hadoop fs -ls /user

You will see your home directory in the directory listing.

List the contents of your home directory by running:

$ hadoop fs -ls /user/hdfs

This is different from running hadoop fs -ls /foo, which refers to a directory that doesn’t exist. In
this case, an error message would be displayed.
Note that the directory structure in HDFS has nothing to do with the directory structure of the
local filesystem; they are completely separate namespaces.

Uploading Files


Besides browsing the existing filesystem, another important thing you can do with FsShell is to
upload new data into HDFS. Change directories to the local filesystem directory containing the
sample data we will be using in the homework labs.

$ cd /Software

If you perform a regular Linux ls command in this directory, you will see a few files, including two
named shakespeare.tar.gz and shakespeare-stream.tar.gz. Both of these contain the complete
works of Shakespeare in text format, but with different formats and organizations. For now we
will work with shakespeare.tar.gz.

Unzip shakespeare.tar.gz by running with root credentials (su - root):

$ tar zxvf shakespeare.tar.gz

This creates a directory named shakespeare/ containing several files on your local filesystem.


Copy this directory into HDFS as the hdfs user:

$ hadoop fs -put shakespeare /user/hdfs/shakespeare

This copies the local shakespeare directory and its contents into a remote, HDFS directory named
/user/hdfs/shakespeare.

List the contents of your HDFS home directory now:

$ hadoop fs -ls /user/hdfs

You should see an entry for the shakespeare directory.


Now try the same fs -ls command but without a path argument:

$ hadoop fs -ls

You should see the same results. If you don't pass a directory name to the -ls command, it
assumes you mean your home directory, i.e. /user/hdfs.

Relative paths
If you pass any relative (non-absolute) paths to FsShell commands (or use relative paths in
MapReduce programs), they are considered relative to your home directory.


We will also need a sample web server log file, which we will put into HDFS for use in future labs.
This file is currently compressed using GZip. Rather than extract the file to the local disk and then
upload it, we will extract and upload in one step.
First, create a directory in HDFS in which to store it:

$ hadoop fs -mkdir weblog

Now, extract and upload the file in one step. The -c option to gunzip uncompresses to standard
output, and the dash (-) in the hadoop fs -put command takes whatever is being sent to its
standard input and places that data in HDFS.

$ gunzip -c access_log.gz | hadoop fs -put - weblog/access_log

Run the hadoop fs -ls command to verify that the log file is in your HDFS home directory.

The access log file is quite large – around 500 MB. Create a smaller version of this file, consisting
only of its first 5000 lines, and store the smaller version in HDFS. You can use the smaller version
for testing in subsequent labs.

$ hadoop fs -mkdir testlog

$ gunzip -c access_log.gz | head -n 5000 | hadoop fs -put - testlog/test_access_log

Viewing and Manipulating Files


Now let’s view some of the data you just copied into HDFS.
Enter:


$ hadoop fs -ls shakespeare

This lists the contents of the /user/hdfs/shakespeare HDFS directory, which consists of the files
comedies, glossary, histories, poems, and tragedies.
The glossary file included in the compressed file you began with is not strictly a work of
Shakespeare, so let’s remove it:

$ hadoop fs -rm shakespeare/glossary

Note that you could leave this file in place if you so wished. If you did, then it would be included in
subsequent computations across the works of Shakespeare, and would skew your results slightly.
As with many real-world big data problems, you make trade-offs between the labor to purify
your input data and the precision of your results.

Enter:

$ hadoop fs -cat shakespeare/histories | tail -n 50

This prints the last 50 lines of Henry IV, Part 1 to your terminal. This command is handy for
viewing the output of MapReduce programs. Very often, an individual output file of a MapReduce
program is very large, making it inconvenient to view the entire file in the terminal. For this
reason, it’s often a good idea to pipe the output of the fs -cat command into head, tail, more, or
less.

To download a file to work with on the local filesystem use the fs -get command. This command
takes two arguments: an HDFS path and a local path. It copies the HDFS contents into the local
filesystem:

$ hadoop fs -get shakespeare/poems ~/shakepoems.txt


$ less ~/shakepoems.txt


There are several other operations available with the hadoop fs command to perform most
common filesystem manipulations: mv, cp, mkdir, etc.

$ hadoop fs

This displays a brief usage report of the commands available within FsShell. Try playing around
with a few of these commands if you like.

Basic Hadoop Filesystem commands (Optional)

In order to work with HDFS you need to use the hadoop fs command. For example, to list the /
and /tmp directories you need to input the following commands:

hadoop fs -ls /
hadoop fs -ls /tmp
There are many commands you can run within the Hadoop filesystem. For example to make the
directory test you can issue the following command:
hadoop fs -mkdir test

Now let's see the directory we've created:


hadoop fs -ls /
hadoop fs -ls /user/hdfs

You should be aware that you can pipe (using the | character) any HDFS command to be used with
the Linux shell. For example, you can easily use grep with HDFS by doing the following:

hadoop fs -mkdir /user/hdfs/test2


hadoop fs -ls /user/hdfs | grep test

As you can see, the grep command only returned the lines which had test in them (thus removing
the "Found x items" line and the oozie-root directory from the listing).

In order to move files between your regular Linux filesystem and HDFS you will likely use the put
and get commands. First, move a single file to the Hadoop filesystem
(copy pg20417.txt from the Software folder to your data folder first).

hadoop fs -put /home/hdfs/data/pg20417.txt pg20417.txt


hadoop fs -ls /user/hdfs


You should now see a new file called /user/hdfs/pg* listed. In order to view the contents of this file
we will use the -cat command as follows:

hadoop fs -cat pg20417.txt

We can also use the Linux diff command to see if the file we put on HDFS is actually the same as
the original on the local filesystem. You can do this as follows:

diff <( hadoop fs -cat pg20417.txt) /home/hdfs/data/pg20417.txt


Since the diff command produces no output we know that the files are the same (the diff command
prints all the lines in the files that differ).
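Another simple way to confirm the two copies match is to compare checksums of the bytes; this is just a sketch using md5sum (hdfs dfs -checksum also exists, but it reports an HDFS-specific checksum format):

hadoop fs -cat pg20417.txt | md5sum
md5sum /home/hdfs/data/pg20417.txt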


Some more Hadoop Filesystem commands

In order to use HDFS commands recursively you generally add the "-R" argument to the HDFS command
(as is generally done in the Linux shell). For example, to do a recursive
listing we'll use -ls -R rather than just -ls. Try this:

hadoop fs -ls /user


hadoop fs -ls -R /user


In order to find the size of files you need to use the -du or -dus commands. Keep in mind that
these commands return the file size in bytes. To find the size of the pg20417.txt file use the
following command:

hadoop fs -du pg20417.txt

To find the size of all files individually in the /user/hdfs directory use the following command:

hadoop fs -du /user/hdfs

To find the total size of all files in the /user/hdfs directory use the following command:

hadoop fs -dus /user/hdfs
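For human-readable sizes, the -h flag can be combined with -du (and -du -s is the modern equivalent of -dus):

hadoop fs -du -h /user/hdfs
hadoop fs -du -s -h /user/hdfs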


If you would like to get more information about a given command, invoke -help as follows:

hadoop fs -help

For example, to get help on the dus command you'd do the following:
hadoop fs -help dus
You can observe the HDFS NameNode console as follows; familiarize yourself with the various options:
http://10.10.20.20:50070/dfshealth.html#tab-overview


Click on Datanodes

Click on Snapshot

Click on Startup Progress and Utilities


Click on Logs
http://192.168.246.131:50070/logs/

You can verify the log by clicking on the DataNode log file.

You can verify the health of the Hadoop file system. Check for minimally replicated blocks, if any.


#hadoop fsck /


#hadoop dfsadmin -report

This dfsadmin command reports on each DataNode and displays the status of the Hadoop
cluster. Are there any under-replicated blocks or corrupt replicas?


#hadoop dfsadmin -metasave hadoop.txt


This saves some of the NameNode's metadata into its log directory under the given filename.
In this metadata, you'll find lists of blocks waiting for replication, blocks being replicated, and
blocks awaiting deletion. For replication, each block will also have a list of DataNodes being
replicated to. Finally, the metasave file will also have summary statistics on each DataNode.

Go to log folder:
# cd /var/hadoop/logs
# ls

# vi hadoop.txt


You can get information about Hadoop safe mode.


#hadoop dfsadmin -safemode get

hadoop dfsadmin -safemode enter

hadoop dfsadmin -safemode leave

hadoop fsck /


You can determine the version of Hadoop:

#hadoop version

The default NameNode HTTP port is 50070. To get a list of files in a directory over WebHDFS you would use:

curl -i http://tos.hp.com:50070/webhdfs/v1/user/root/output/?op=LISTSTATUS
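Another WebHDFS operation you can try is OPEN, which reads a file's contents; the path below is just an example from the earlier upload, and -L follows the redirect to a DataNode:

curl -i -L "http://tos.hp.com:50070/webhdfs/v1/user/hdfs/pg20417.txt?op=OPEN"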

--------------------------------------- Lab Ends Here --------------------------------------


7. Understanding HDFS Internals(A)


Copy any file larger than 128 MB.
Now let's copy a file from the local file system (LFS) to the Hadoop Distributed File System (HDFS)
and see how the data is copied and what happens internally.

#su - hdfs
#hadoop fs -put yelp_academic_dataset_review.json /user/hdfs/mydata

The NameNode has all the metadata related to the file, such as the replication factor, locations, racks, etc.
We can view this information by executing the command below.

# hdfs fsck /user/hdfs/mydata -locations -racks -blocks -files

On running the above command, the gateway node runs the fsck and connects to the NameNode.
The NameNode checks for the file and the time it was created.

Next, the NameNode will go to the particular block pool ID which contains the
metadata information.

Based on the block pool ID, it will look up the block IDs on the DataNodes and details such as the
rack information on which the data is stored, based on the replication factor.

Further, it will give you information about blocks that are over-replicated, under-replicated, or
corrupt, the number of DataNodes and racks used, along with the health
status of the file system.


Apart from this, the scheduler also plays a role in distributing the resources and scheduling jobs
that store data into HDFS. In this case we are using the YARN architecture. The details related to
scheduling are present in yarn-site.xml. The default scheduler is the Capacity Scheduler.
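A quick, read-only way to confirm the configured scheduler class on a node is to grep the client configuration (the path is the HDP default config directory used throughout these labs):

grep -A1 "yarn.resourcemanager.scheduler.class" /etc/hadoop/conf/yarn-site.xml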

-------------------------------------------- Lab ends here ---------------------------------------------


8. Understanding Debugging in HDFS(A)

Open a terminal on any of the nodes.


Execute the following commands and go through the debug steps.
# su - hdfs
#export HADOOP_ROOT_LOGGER=DEBUG,console
#echo $HADOOP_ROOT_LOGGER;

#hadoop fs -ls /

#hdfs dfs -du /


For example, imagine if someone (or something) decided to run hdfs dfs -du on root ("/"). This
would recursively scan the entire file system tree, holding the FSNamesystem lock for as long as it
takes to process 5000 content summary counts (assuming the default setting for
dfs.content-summary.limit).

Hints : export HADOOP_ROOT_LOGGER=DEBUG,console


echo $HADOOP_ROOT_LOGGER;

https://community.cloudera.com/t5/Community-Articles/Set-log-level-of-namenode/ta-p/249460
https://leveragebigdata.blogspot.com/2017/01/debugging-apache-hadoop.html
https://stackoverflow.com/questions/19198367/is-there-a-way-to-debug-namenode-or-datanode-of-hadoop-using-eclipse
#hdfs dfsadmin -metasave metasave-report.txt

hdfs dfsadmin -metasave provides information about blocks, including:


 blocks waiting for replication
 blocks currently being replicated
 the total number of blocks

Log on to the NameNode server and verify the metadata.


#cd /var/log/hadoop/hdfs

#cat metasave-report.txt

Have a glance of the report.

Execute the following:

#hdfs getconf -confKey dfs.namenode.avoid.read.stale.datanode
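A couple of other handy getconf queries; these subcommands read the client-side HDFS configuration:

#hdfs getconf -namenodes
#hdfs getconf -confKey dfs.replication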


Verify the block consistency.

Determine the meta file of a block and verify its status as shown below.

Go to the data directory of a DataNode and pick any of the metadata files:


#cd /hadoop/hdfs/data/current/BP-919298001-10.10.10.15-1582852006440/current/finalized/subdir0/subdir0
#ls


#hdfs debug verifyMeta -meta /hadoop/hdfs/data/current/BP-919298001-10.10.10.15-1582852006440/current/finalized/subdir0/subdir0/blk_1073741859_1035.meta
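Another hdfs debug subcommand worth knowing is recoverLease, which recovers the lease on a file left open by a dead client; the path below is only an example:

#hdfs debug recoverLease -path /user/hdfs/mydata -retries 3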

Optional:

Enable debug logging using the following option and restart the services.

Update the root logger to DEBUG,RFA.



export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} -Dhadoop.root.logger=DEBUG,DRFA"

Restart the services as required.

---------------------------------------- Lab Ends Here -------------------------------


10. Change NN Heap settings & Config Group – Services (A)

Set the NameNode Java heap size (memory) to 2.5 GB using the following option.
Use Services > [HDFS] > Configs to optimize service performance for the service.
1. In Ambari Web, click a service name in the service summary list on the left.
2. From the service Summary page, click the Configs tab, then use one of the following tabs
to manage configuration settings.
o Use the Configs tab to manage configuration versions and groups.
o Use the Settings tab to manage Smart Configs by adjusting the green, slider buttons.
o Use the Advanced tab to edit specific configuration properties and values.


3. Click Save.

4. Enter a description for this configuration version that includes your current changes.
5. Review and confirm each recommended change.
Restart all affected services.

Let us configure a config group.
Click on the HDFS service > Configs > Config Group > Manage Config Groups > Add
Enter the following details:

Ok


Select the group on the left side and add the slavea host on the right.


Click Save.

Now let us change the memory setting of slavea. Select the config group we have just created
above.


Override Configurations
Once you have created the configuration group and assigned some hosts to it, you are ready to override configuration values.
This section uses the HDFS maximum Java heap size property as an example to describe how to override configuration values.
1. On the HDFS configuration page, from the Group drop-down list, select the configuration
group created in the previous section. You will see that the configuration values displayed are
identical to the ones in the default group. Configuration groups show the full list of configuration
properties; you can choose which ones to override.

2. Click the Override button next to the property for which you want to set a new value. Enter a new value
in the text box shown below the default value.

3. You will not be able to save the configuration changes unless you specify a value that’s
different from the default value.

4. Click Save on the top of the configuration page to save the configuration. Enter a description
for the change in the Save Configuration wizard and click Save again.


5. Ambari web UI opens up a new wizard dialog with the save configuration result.

6. Restart HDFS to have the configuration change take effect.

7. Ambari web UI shows different configuration values defined in various groups when it displays
the default group.
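One rough way to confirm the override actually took effect on a host in the group after the restart is to inspect the running JVM options on that host; the grep below targets the DataNode process as an example, so adjust the process name for whichever component's heap you overrode:

ps -ef | grep -i datanode | grep -o "\-Xmx[0-9a-zA-Z]*"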

In this lab, we described how you can override component configuration on a subset of hosts. This is a very useful and straightforward way to
apply host-specific configuration values when a cluster is a heterogeneous mixture of hosts. You can also re-assign hosts from the non-default
configuration groups to the default group or to other non-default configuration groups.

-------------------------------------------- Lab ends here ----------------------------------------------


11. Hadoop Benchmarks(A)

And before we start, here’s a nifty trick for your tests: When running the benchmarks described
in the following sections, you might want to use the Unix time command to measure the elapsed
time. This saves you the hassle of navigating to the Hadoop JobTracker web interface to get the
(almost) same information. Simply prefix every Hadoop command with time :
time hadoop jar hadoop-*examples*.jar ...
TestDFSIO
The TestDFSIO benchmark is a read and write test for HDFS. It is helpful for tasks such
as stress testing HDFS, to discover performance bottlenecks in your network, to shake
out the hardware, OS and Hadoop setup of your cluster machines (particularly the
NameNode and the DataNodes) and to give you a first impression of how fast your
cluster is in terms of I/O.
The default output directory is /benchmarks/TestDFSIO
When a write test is run via -write , the TestDFSIO benchmark writes its files
to /benchmarks/TestDFSIO on HDFS. Files from older write runs are overwritten.
Benchmark results are saved in a local file called TestDFSIO_results.log in the current local
directory (results are appended if the file already exists) and also printed to STDOUT.
Run write tests before read tests
The read test of TestDFSIO does not generate its own input files. For this reason, it is a
convenient practice to first run a write test via -write and then follow-up with a read test
via -read (while using the same parameters as during the previous -write run).


# su - yarn
#export YARN_EXAMPLES=/usr/hdp/current/hadoop-mapreduce-client
cd /usr/hdp/current/hadoop-mapreduce-client
Run a write test (as input data for the subsequent read test)
TestDFSIO is designed in such a way that it will use 1 map task per file, i.e. it is a 1:1 mapping
from files to map tasks. Splits are defined so that each map gets only one filename, which it
creates ( -write ) or reads ( -read ).
The command to run a write test that generates 10 output files of size 1GB for a total of 10GB is:

$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000


Run a read test


The command to run the corresponding read test using 10 input files of size 1GB is:

$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
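As noted earlier, the benchmark appends its results to TestDFSIO_results.log in the directory where you ran the command, so you can review them afterwards:

$ cat TestDFSIO_results.log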


Clean up and remove test data


The command to remove previous test data is:

$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean

The cleaning run will delete the output directory /benchmarks/TestDFSIO on HDFS.

Interpreting TestDFSIO results


Let’s have a look at this exemplary result for writing and reading 1TB of data on a cluster of twenty nodes and try to deduce its
meaning:

Here, the most notable metrics are Throughput mb/sec and Average IO rate mb/sec. Both of them are based on the file size written
(or read) by the individual map tasks and the elapsed time to do so.

Two derived metrics you might be interested in are estimates of the “concurrent” throughput and average IO rate (for the lack of a
better term) your cluster is capable of. Imagine you let TestDFSIO create 1,000 files but your cluster has only 200 map slots. This
means that it takes about five MapReduce waves ( 5 * 200 = 1,000 ) to write the full test data because the cluster can only run
200 map tasks at the same time. In this case, simply take the minimum of the number of files (here: 1,000 ) and the number of
available map slots in your cluster (here: 200 ), and multiply the throughput and average IO rate by this minimum. In our example,
the concurrent throughput would be estimated at 4.989 * 200 = 997.8 MB/s and the concurrent average IO rate at 5.185 *
200 = 1,037.0 MB/s .

TeraSort benchmark suite


A full TeraSort benchmark run consists of the following three steps:

1. Generating the input data via TeraGen .


2. Running the actual TeraSort on the input data.
3. Validating the sorted output data via TeraValidate .

You do not need to re-generate input data before every TeraSort run (step 2). So you can skip step 1 (TeraGen) for later TeraSort
runs if you are satisfied with the generated data.

Figure 1 shows the basic data flow. We use the included HDFS directory names in the later examples.

Figure 1: Hadoop Benchmarking and Stress Testing: The basic data flow of the TeraSort benchmark suite.

TeraGen: Generate the TeraSort input data (if needed)


TeraGen (source code) generates random data that can be conveniently used as input data for a subsequent TeraSort run.

The syntax for running TeraGen is as follows:

$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar teragen <number of 100-byte rows> <output dir>

Using the HDFS output directory /user/hdfs/terasort-input as an example, the command to run TeraGen in order to
generate 1TB of input data (i.e. 1,000,000,000,000 bytes) is:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-3.1.1.3.1.0.0-78.jar teragen 10000000000 /user/hdfs/terasort-input

Please note that the first parameter supplied to TeraGen is 10 billion (10,000,000,000), i.e. not 1 trillion = 1 TB
(1,000,000,000,000). The reason is that the first parameter specifies the number of rows of input data to generate, each of which
has a size of 100 bytes.

Here is the actual TeraGen data format per row to clear things up:

<10 bytes key><10 bytes rowid><78 bytes filler>\r\n


where

1. The keys are random characters from the set ‘ ‘ .. ‘~’.


2. The rowid is the right-justified row ID as an int.
3. The filler consists of 7 runs of 10 characters from ‘A’ to ‘Z’.


Using the input directory /user/hdfs/terasort-input and the output


directory /user/hdfs/terasort-output as an example, the command to run the TeraSort
benchmark is:

$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-3.1.1.3.1.0.0-78.jar terasort /user/hdfs/terasort-input /user/hdfs/terasort-output


TeraValidate: Validate the sorted output data of TeraSort


TeraValidate (source code) ensures that the output data of TeraSort is globally sorted.

Using the output directory /user/hdfs/terasort-output from the previous sections and the
report (output) directory /user/hdfs/terasort-validate as an example, the command to run
the TeraValidate test is:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-3.1.1.3.1.0.0-78.jar teravalidate /user/hdfs/terasort-output /user/hdfs/terasort-validate
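You can then inspect the TeraValidate report; rows beginning with "error" indicate misordered data, so a report without error rows means the output was globally sorted:

hadoop fs -cat /user/hdfs/terasort-validate/part-r-00000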


NameNode benchmark

The following command will run a NameNode benchmark that creates 1000 files using 12 maps and 6 reducers. It uses a custom
output directory based on the machine’s short hostname. This is a simple trick to ensure that one box does not accidentally write
into the same output directory of another box running NNBench at the same time.

$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar nnbench -operation create_write \
-maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 \
-replicationFactorPerFile 3 -readFileAfterOpen true \
-baseDir /benchmarks/NNBench-`hostname -s`

Note that by default the benchmark waits 2 minutes before it actually starts!


------------------------------------- Lab Ends Here --------------------------------------


12. ResourceManager high availability

To access the wizard and enable ResourceManager high availability:


You can configure high availability for ResourceManager by using the Enable ResourceManager
HA wizard.

Prerequisites – you must have:


 at least three hosts in your cluster (start the master, slave and slavea nodes)
 Apache ZooKeeper servers running on all nodes

Start the following on the master node:


NameNode, DataNode, ResourceManager and NodeManager services.

On all nodes:
DataNode, ZooKeeper, ZKFC and JournalNode, wherever they are installed.

On slavea:
NameNode

At this point we have only one ResourceManager configured in the cluster.


HDFS Services should be up as shown below:

YARN Services should be up too.


1. In Ambari Web, browse to Services > YARN > Summary.


2. Select Service Actions and choose Enable ResourceManager HA.
The Enable ResourceManager HA wizard launches, describing a set of automated and
manual steps that you must take to set up ResourceManager high availability.
3. On Get Started, read the overview of enabling ResourceManager HA.

Click Next to proceed.


4. On Select Host (Slavea), accept the default selection or choose an available host.


Click Next to proceed.


5. On Review Selections, expand YARN if necessary, to review all the configuration changes
proposed for YARN.


Click Next to approve the changes and start automatically configuring ResourceManager HA.
6. On Configure Components, click Complete when all the progress bars finish tracking.

At the end you should have the following two ResourceManager nodes:


As you can see, one will be active and the other will be in standby.


Test the failover.


Submit a job.
Let's submit a MapReduce job to the cluster. In a terminal on the master node:
# su - hdfs
#export YARN_EXAMPLES=/usr/hdp/current/hadoop-mapreduce-client
# yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-3.1.1.3.1.0.0-78.jar pi -Dmapred.job.queue.name=Training 6 10

Stop the primary RM after the job execution has started.


To determine which RM node is primary, execute the following commands; the one whose
state is active is the primary node.
#su - yarn
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
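On Hadoop 3.x you can also query both ResourceManagers with a single command:

yarn rmadmin -getAllServiceState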

In my case rm1 is the primary RM. (YARN > Configs > Custom yarn-site.xml)


#rm1 is the primary resource manager.

Stop the RM service on the master node using the dashboard: YARN > Active ResourceManager > Stop.
Then watch the terminal where the job is running for the text shown below.



A "failing over" message will be displayed in the console as shown below:

Verify that the job proceeds with the secondary RM.


After some time it should fail over to rm2.


You can also verify from the dashboard that rm2 is the primary resource manager now.

Now the job will be orchestrated by the new primary RM, i.e. rm2, and we do not need to resubmit
the job to the cluster.


You can also determine the status of rm2 (i.e. slavea) using the yarn command:
yarn rmadmin -getServiceState rm2

The job will complete after a few minutes, depending on the resources.

Let us start the failed RM, i.e. rm1 on the master node.

Let us check which one is primary now.


yarn rmadmin -getServiceState rm2
yarn rmadmin -getServiceState rm1


This means that the currently active node remains the primary ResourceManager until it fails over,
even though the earlier primary node has come back up.

-------------------------------------------- Lab ends here ----------------------------------------------


13. Tuning and Debugging HDP – (A)

Case study: debug and resolve job-related issues in a running Hadoop cluster. You cannot execute
a job whose virtual memory demand exceeds what is configured in the configuration file.

If the virtual memory usage exceeds the allowed configured memory, then the container
will be killed and the job will fail.
Let us enable the flag in the custom yarn-site.xml file so that the NodeManager monitors the virtual
memory usage of the containers, i.e. yarn.nodemanager.vmem-check-enabled = true.
Dashboard > YARN > Configs > Advanced

Accept all warnings and default settings to complete the configuration.


You need to restart the following services for the setting to take effect:
ResourceManager and NodeManager services on all the applicable nodes.
Submit the following job to the cluster.
# su - hdfs
#export YARN_EXAMPLES=/usr/hdp/current/hadoop-mapreduce-client
#yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-3.1.1.3.1.0.0-78.jar pi 16 1000


After some time the job will fail with the following errors.

Container physical memory consumption at this juncture: virtual memory usage is beyond the
permissible limit.

Current usage: 107.5 MB of 206 MB physical memory used …


Physical memory allocated is 2536 MB, i.e. about 2.4 GB.

1.9 GB of 824 MB virtual memory used (virtual memory usage exceeds the limit). Killing
container.
Observation:
Open the file and observe the allowed virtual-to-physical memory ratio. Here it is 4 times for each
map container.
#vi /etc/hadoop/3.1.0.0-78/0/yarn-site.xml
yarn.nodemanager.vmem-pmem-ratio = 4 times

Since "mapreduce.map.memory.mb" is set to 206 MB, the total allowed virtual memory is 4 *
206 = 824 MB.
#vi /etc/hadoop/3.1.0.0-78/0/mapred-site.xml


However, as shown in the log below, 1.9 GB of virtual memory is demanded, more than the allowed
824 MB configured. Hence the job failed.

You can verify this from the log. This error is due to the overall consumption of virtual memory being
more than the allowed virtual memory. How do we resolve this? One way is to increase
the physical memory and raise the allowed virtual memory ratio. Another way is to disable the
validation of the virtual memory, which is what we will do.
Concepts:
NodeManager can monitor the memory usage (virtual and physical) of the container. If its virtual
memory exceeds “yarn.nodemanager.vmem-pmem-ratio” times the
"mapreduce.reduce.memory.mb" or "mapreduce.map.memory.mb", then the container will be
killed if “yarn.nodemanager.vmem-check-enabled” is true.

Solution:
Set yarn.nodemanager.vmem-check-enabled to false and restart the cluster services, i.e.
NodeManager and ResourceManager.
Then resubmit the job. Do not update the XML files directly; all changes have to be
made through the Ambari UI only.
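After Ambari pushes the change and the services restart, you can confirm the effective value on a node with a read-only check; the path is the HDP client configuration directory used earlier in these labs:

grep -A1 "yarn.nodemanager.vmem-check-enabled" /etc/hadoop/conf/yarn-site.xml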


-------------------------------------------- Lab ends here ------------------------------------------
