Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation.
The contents of this course and all its related materials, including lab exercises and files, are Copyright Hortonworks
Inc. 2014.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any
means electronic, mechanical, photocopying, recording or otherwise without the prior written permission of
Hortonworks Inc. All rights reserved.
Table of Contents
Table of Contents ...................................................................................................................... 4
Course Introduction.............................................................................................................. 10
Unit 1: Introduction to HDP and Hadoop 2.0 ............................................................... 11
Enterprise Data Trends @ Scale ................................................................................................. 12
What is Big Data? ............................................................................................................................. 13
A Market for Big Data ..................................................................................................................... 14
Most Common New Types of Data.............................................................................................. 15
Moving from Causation to Correlation..................................................................................... 17
What is Hadoop? .............................................................................................................................. 19
What is Hadoop 2.0? ....................................................................................................................... 20
Traditional Systems vs. Hadoop ................................................................................................. 21
Overview of a Hadoop Cluster ..................................................................................................... 22
Who is Hortonworks?..................................................................................................................... 23
The Hortonworks Data Platform ............................................................................................... 24
Use Case: EDW before Hadoop .................................................................................................... 26
Banking Use Case: EDW with HDP ............................................................................................. 27
Nagios................................................................................................................................................. 357
Nagios UI ........................................................................................................................................... 359
Monitoring JVM Processes .......................................................................................................... 360
Understanding JVM Memory ..................................................................................................... 362
Eclipse Memory Analyzer ........................................................................................................... 364
JVM Memory Heap Dump ............................................................................................................ 366
Java Management Extensions (JMX) ....................................................................................... 368
Course Introduction
Course Agenda
Introductions
What is Hadoop?
Who is Hortonworks?
[Slide diagram: new data sources at scale (machine data, social media, VoIP) alongside traditional enterprise data.]
Twitter messages are 140 bytes each, generating 8TB of data per day.
Part One: The 3 Vs: Gartner analyst Doug Laney came up with the famous three Vs
(Volume, Velocity, and Variety) in 2001.
Part Three: Enhanced Insight and Decision Making: The goal of working with
big data is to increase business value and to respond more quickly and with more
accuracy to meet well-defined business objectives.
Source: http://www.researchmoz.us/big-data-market-business-case-market-analysisand-forecasts-2014-2019-report.html
2. Clickstream: Capture and analyze website visitors' data trails and optimize your website.
3. Sensor/Machine: Discover patterns in data streaming automatically from remote sensors and machines.
4. Geographic
5. Server Logs: Research logs to diagnose process failures and prevent security breaches.
+ Keep existing data longer!
Sentiment: The most commonly cited source: analyzing language usage, text,
and computational linguistics in an attempt to better analyze subjective
information. Many companies are trying to leverage this data to provide
sentiment trackers, identify influencers, etc.
Server logs: These are not new to the IT world. You often lose precious trails
and information when you simply roll over log files. Today, you should not have
to lose this data; just save it in Hadoop!
Text: Text is everywhere. We all love to express ourselves; on every blog, article,
news site, and e-commerce site you visit these days, you will find people putting out
their thoughts. And this is on top of already existing text sources like surveys
and the Web content itself. How do you store, search, and analyze all this text
data to glean key insights? Hadoop!
Organizations are also looking at extra data that comes from social media and machine
data and correlating it with their existing traditional data. Correlating data from multiple
sources produces much richer analytical results.
Businesses that can use big data to generate more detailed results with a higher degree
of accuracy will be at a competitive advantage. It's about being able to "out-Hadoop"
your competition.
"Data-driven decisions are better decisions; it's as simple as that. Using big
data enables managers to decide on the basis of evidence rather than
intuition. For that reason it has the potential to revolutionize management."
(Harvard Business Review, October 2012)
What is Hadoop?
Hadoop is all about processing and storage. Hadoop is a software framework that
provides a parallel processing environment on a distributed file system using
commodity hardware. A Hadoop cluster is made up of master processes and slave
processes spread out across different x86 servers. This framework allows someone
to build a Hadoop cluster that offers high-performance, supercomputer-like
capability.
Hadoop Common: the utilities that provide support for the other Hadoop
modules.
MapReduce: for processing large data sets in a scalable and parallel fashion.
[Slide: Traditional Systems vs. Hadoop. Traditional systems (EDW, MPP, NoSQL) require a schema on write, work with structured data types, and focus on analytics, speed, and governance; writes are fast. A Hadoop distribution requires a schema only on read, works with loosely structured data types, and focuses on data discovery, processing unstructured data, and massive storage/processing. Hadoop is not a relational, NoSQL, or real-time database.]
Hadoop is a data platform that complements existing data systems. Hadoop is designed
for schema-on-read and can handle the large data volumes coming from semi-structured
and unstructured data. With the low cost of storage on Hadoop, organizations are
looking at using Hadoop more for archiving.
[Cluster diagram: Master Node 2 runs the ResourceManager, Standby NameNode, HBase Master, HiveServer2, and ZooKeeper. A Management Node runs the Ambari Server, Ganglia/Nagios, the WebHCat Server, the JobHistoryServer, and ZooKeeper. Each slave node (DataNode 2, DataNode 3, ... DataNode n) runs a DataNode, a NodeManager, and an HBase RegionServer.]
HBase components: HBase also has a master server and slave servers called
RegionServers.
Who is Hortonworks?
[Slide: the Hortonworks Data Platform (HDP): Operational Services (Ambari, Falcon*, Oozie), Data Services (Hive & HCatalog, Pig, HBase, Sqoop, Flume), Load & Extract (NFS, WebHDFS), Platform Services (Knox*), and the Hadoop core (MapReduce, Tez, YARN, HDFS), deployable on an OS/VM, in the cloud, or on an appliance. The focus is an enterprise distribution of Hadoop with enterprise readiness.]
Who is Hortonworks?
Hortonworks develops, distributes, and supports Enterprise Apache Hadoop:
Develop: Hortonworks was formed by the key architects, builders, and operators
from Yahoo!. The Hortonworks software engineering team has led the effort to
design and build every major release of Apache Hadoop from 0.1 to the most
current stable release, contributing more than 80% of the code along the way.
[Chart: the HDP release timeline (HDP 1.0, June 2012; HDP 1.1, September 2012; HDP 1.2, February 2013; HDP 1.3, May 2013; HDP 2.0, October 2013) showing the Apache component versions (Hadoop, Hive, Pig, HBase, HCatalog, Oozie, Sqoop, ZooKeeper, Mahout, Ambari/HMC) bundled in each release.]
It takes a tremendous amount of skill and testing to find the right combination
for all the frameworks.
Other software runs alongside Hadoop. It is hard for a software vendor to
work with customers that each have their own unique distribution of the Hadoop
frameworks.
Hortonworks:
Utilizes HDP, a 100% free, open source distribution of Hadoop. Every line of code
generated by Hortonworks is contributed back to the Apache Software Foundation.
Has developed over 614,041 lines of code, compared to the next nearest
distribution vendor with 147,933 lines of code (based on a recent comparison).
Tests HDP at a much larger scale than any other distribution. HDP is certified and
tested at scale.
[Diagram: EDW before Hadoop. Database data flows through ETL into the EDW and out to data marts (DM); log files, exhaust data, social media, and sensor/device data sit outside the platform.]
A schema was required for ingestion. When a new source of data was
introduced, new schemas had to be created (which took up to a month!).
SLAs were suffering because the EDW was busy performing ETL.
Data had to be thrown out after 2 to 5 days because it was not cost-effective to
maintain it.
The bank was missing out on new data sources, and also historical data was lost.
[Diagram: EDW with HDP. Log files, exhaust data, social media, sensor/device data, and database data all land on the big data platform, which feeds the EDW and data marts (DM) and supports exploration.]
Data is now available for use with minimal delay, which enables real-time
capture of source data.
They have a new philosophy about data: capture all data first, and then structure
the data as business needs evolve. This makes their systems much more
dynamic.
The bank now stores years' worth of raw transactional data. The data is no
longer archived; it has become ACTIVE!
Data Lineage: The bank stores intermediate stages of their data, enabling a more
powerful analytics platform.
The EDW can focus less on storage and transformation and more on analytics.
Hadoop opens up an opportunity for exploration of data that was never there
before!
Unit 1 Review
1. The core Hadoop frameworks are __________________ and _______________.
2. True or False: Hadoop is equivalent to a NoSQL platform.
3. What is the name of the management interface used for provisioning, managing,
and monitoring Hadoop clusters? _________________
4. What processes might you find running on a Master node of a Hadoop cluster?
_________________________________________________________________
OS Architecture
HDFS Architecture
The NameNode
The DataNodes
DataNode Failure
HDFS Clients
Metadata: All nodes in a directory tree can have various levels of ownership
(user, group, anonymous), permissions (read, write, execute), last accessed time,
create time, modified time, is-hidden, etc.
Tools: All file systems have tools to perform file operations as well as
administrative operations such as troubleshooting and fixing problems.
OS Architecture
A familiar file system architecture: namespace(s), tools, metadata, journaling, a file
system (ext4, ext3, xfs, etc.), and storage on disk.
OS Architecture
Most common file systems are POSIX-based, and HDFS follows a similar, POSIX-like file system model.
HDFS Architecture
[Diagram: HDFS architecture. The NameNode (a daemon JVM) holds the namespace, block map, metadata, and journaling; DataNodes (also daemon JVMs) provide storage on local disks; tools interact with both.]
HDFS Architecture
A Hadoop instance consists of a cluster of HDFS machines, often referred to as the
Hadoop cluster or HDFS cluster. There are two main components of an HDFS cluster:
1. NameNode: The master node of HDFS that manages the data (without
actually storing it) by determining and maintaining how the chunks of data
are distributed across the DataNodes. The NameNode will contain and
manage the namespace, metadata, journaling, and a BlockMap. The
BlockMap is an in-memory map of all the blocks that make up a file and
DataNode locations of those blocks in the HDFS cluster.
2. DataNode: Stores the chunks of data, and is responsible for replicating the
chunks across other DataNodes.
The NameNode and DataNode are daemon processes running in the cluster. Some
important concepts involving the NameNode and DataNodes are:
By default only one NameNode is used in a cluster, which creates a single point
of failure. We will later discuss how to enable HA in Hadoop to mitigate this risk.
Data never resides on or passes through the NameNode. Your big data only
resides on DataNodes.
The NameNode keeps track of how the data is broken down into chunks on the
DataNodes.
The default replication factor is 3 (and is also configurable), which means each
chunk of data is replicated across 3 DataNodes.
5. The first DataNode pipelines the replication to the next DataNode in the list.
You can specify the block size for each file using the dfs.blocksize property. If you do not
specify a block size at the file level, the global value of dfs.blocksize defined in
hdfs-site.xml will be used.
IMPORTANT: The data never passes through the NameNode. The client
program that is uploading the data into HDFS performs I/O directly with the
DataNodes. The NameNode only stores the metadata of the file system; it is
not responsible for storing or transferring the data.
1.2. Try putting the hadoop-common JAR file into HDFS with a block size of 30
bytes:
# hadoop fs -D dfs.blocksize=30 -put hadoop-common-x.jar hadoop-common.jar
1.3. Notice 30 bytes is not a valid blocksize. The blocksize needs to be at least
1048576 according to the dfs.namenode.fs-limits.min-block-size property:
put: Specified block size is less than configured minimum
value (dfs.namenode.fs-limits.min-block-size): 30 < 1048576
1.4. Try the put again, but use a block size of 2,000,000:
# hadoop fs -D dfs.blocksize=2000000 -put hadoop-common-x.jar hadoop-common.jar
1.5. Notice 2,000,000 is not a valid block size because it is not a multiple of 512
(the checksum size).
1.6. Try the put again, but this time use 1,048,576 for the block size:
# hadoop fs -D dfs.blocksize=1048576 -put hadoop-common-x.jar hadoop-common.jar
1.7. This time the put command should have worked. Use ls to verify the file is in
HDFS:
# hadoop fs -ls
...
-rw-r--r--   3 root root    2679929  hadoop-common.jar
2.2. Notice there are three blocks. Look for the following line in the output:
Total blocks (validated):
2.3. What is the average block replication for this file? ________________
Step 3: Specify a Replication Factor
3.1. Add another file from /usr/lib/hadoop into HDFS, except this time specify a
different replication factor:
# hadoop fs -D dfs.replication=2 -put hadoop-nfs-x.jar hadoop-nfs.jar
Notice the output contains the block IDs, which coincidentally are the names of
the files on the DataNodes.
4.2. Change directories to the following:
# cd /hadoop/hdfs/data/current/BP-xxx/current/finalized/
You are looking for a subfolder with a recent timestamp. Once you find it, cd into
that folder.
4.5. See if you can find the various blocks for hadoop-common.jar and hadoop-nfs.jar. They will look similar to the following:
-rw-r--r--. 1 hdfs ...  blk_<id>
-rw-r--r--. 1 hdfs ...  blk_<id>.meta
(one such line per block file and per corresponding .meta file)
4.6. How come some of the blocks are exactly 1048576 bytes? ______________
_________________________________________________________________
4.7. What is in the .meta files? _______________________________________
The NameNode
1. When the NameNode starts, it reads the fsimage_N and edits_N files.
2. The transactions in edits_N are merged with fsimage_N.
3. A newly-created fsimage_N+1 is written to disk, and a new, empty edits_N+1 is created.
The NameNode
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode,
which is a master server that manages the file system namespace and regulates access
to files by clients.
The NameNode has the following characteristics:
fsimage_N: Contains the entire file system namespace, including the mapping of
blocks to files and file system properties.
edits_N: A transaction log that persistently records every change that occurs to
file system metadata.
When the NameNode starts up, it enters safemode (a read-only mode). It loads the
fsimage_N and edits_N from disk, applies all the transactions from the edits_N to the
in-memory representation of the fsimage_N, and flushes out this new version into a
new fsimage_N on disk.
NOTE: The edits_N file naming actually contains a range of numbers for the
historical events. For example, edits_0008-0012. There is an additional file
named edits_inprogress_<start-of-range> for the current edits.
For example, initially you will have an fsimage_0 file and an edits_inprogress_0 file.
When the merging occurs, the transactions in edits_inprogress_0 are merged with
fsimage_0, and a new fsimage_1 file is created. In addition, a new, empty
edits_inprogress file is created for all future transactions that occur after the creation of
fsimage_1.
This process is called a checkpoint. Once the NameNode has successfully checkpointed,
it will leave safemode, thus enabling writes.
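For illustration, a NameNode metadata directory following this naming convention might contain files such as the following (the transaction IDs shown are illustrative and will differ on your cluster):

fsimage_0000000000000000012
edits_0000000000000000001-0000000000000000012
edits_inprogress_0000000000000000013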
The DataNodes
[Diagram: a file's blocks (1, 2, 3) stored across DataNodes 2, 3, and 4, with the NameNode tracking the mapping.]
The DataNodes
HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of
DataNodes.
The NameNode determines the mapping of blocks to DataNodes. The DataNodes are
responsible for:
Performing block creation, deletion, and replication upon instruction from the
NameNode. (The NameNode makes all decisions regarding replication of
blocks.)
The NameNode periodically receives a Heartbeat and a Blockreport from each of the
DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is
functioning properly. A Blockreport contains a list of all blocks on a DataNode.
It stores each block of HDFS data in a separate file on its local file system.
The DataNode does not create all files in the same local directory. It uses a
discovery technique to determine the optimal number of files per directory and
creates subdirectories appropriately.
When a DataNode starts up, it scans through its local file system, generates a list
of all HDFS data blocks that correspond to each of these local files, and sends this
information to the NameNode (as a Blockreport).
DataNode Failure
[Diagram: DataNodes 1, 2, and 4 continue to send Heartbeats and Blockreports to the NameNode, but DataNode 3 does not. NameNode: "Sorry, DataNode 3, but I'm going to assume you are dead."]
DataNode Failure
The primary objective of HDFS is to store data reliably even in the presence of failures.
Hadoop is designed to recover gracefully from a disk failure or network failure of a
DataNode using the following guidelines:
Any data that was registered to a dead DataNode is no longer available to HDFS.
The NameNode does not send new I/O requests to a dead DataNode, and its
blocks are replicated to live DataNodes.
DataNode death typically causes the replication factor of some blocks to fall below their
specified value. The NameNode constantly tracks which blocks need to be replicated
and initiates replication whenever necessary.
HDFS Clients
Command-line tools: user file system commands plus HDFS admin commands (dfsadmin, namenode, datanode, balancer, daemonlog, secondarynamenode).
WebHDFS: the NameNode and DataNodes both expose RESTful APIs to perform user operations.
HttpFS: a REST gateway that supports user operations and is interoperable with WebHDFS.
Hue: a feature-rich GUI that includes an HDFS file browser; a job browser for MR and YARN; and HBase, Hive, Pig, and Sqoop support.
HDFS Clients
HDFS provides many out of the box methods for clients to interact with the file system.
These include command line, RESTful, and a Java HDFS API. Additionally, HDP provides
Hue, a GUI interface to not only HDFS but also other components in HDP.
We will explore the various types of clients in an upcoming lab.
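As a quick illustration of the RESTful option, a directory listing can be requested from WebHDFS with a plain HTTP call. This is a sketch that assumes WebHDFS is enabled and uses the NameNode web port from this course's cluster; adjust the host and path as needed:

# curl -i "http://node1:50070/webhdfs/v1/user/root?op=LISTSTATUS"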
Unit 2 Review
1. Which component of HDFS is responsible for maintaining the namespace of the
distributed file system? _________________________
2. What is the default file replication factor in HDFS? _________________________
3. True or False: To input a file into HDFS, the client application passes the data to
the NameNode, which then divides the data into blocks and passes the blocks to
the DataNodes. _____________
4. Which property is used to specify the block size of a file stored in HDFS?
__________________________
5. The NameNode maintains the namespace of the file system using which two sets
of files? _______________________________________________________
Slave Nodes: JBOD, or Just a Bunch of Disks, is a simple array of disks with no
striping or mirroring.
We will cover cluster tuning later and answer questions such as:
Is the cluster for a small group? Is it multi-tenant?
How much storage do you anticipate in the short term?
How quickly will the data grow?
Do you anticipate compute-heavy processing of the data? Compute/memory-heavy
algorithms?
1.4. This script does a lot. Look over the steps and see if you can follow what is
happening. Some of the highlights of the script include:
- Sets up passwordless SSH amongst your four nodes.
- Installs ntp on each node.
- Configures the repositories for installing HDP locally.
- Disables security and turns off iptables.
Step 2: Run the Setup Script
2.1. Run the setup script using the following command:
# ./env_setup.sh
2.2. The script will take a while to execute. Watch the output and keep an eye out
for any errors. The end of the output will look like:
Installed:
yum-plugin-priorities.noarch 0:1.1.30-14.el6
Complete!
NOTE: If you don't see your command prompt, simply press Enter when the
script is finished.
IMPORTANT: If you find an error, try to determine at which step in the script
it occurred. You may need to manually copy-and-paste the remainder of the
script based on where your error occurred.
RESULT: Your cluster is now ready for HDP 2.0 to be installed using Ambari!
NOTE: The -s option runs the setup in silent mode, meaning all default
values are accepted at any prompts.
http://node1:8080
3.2. Log in to the Ambari server using the default credentials admin/admin:
5.2. In the Host Registration Information section, click the Choose File button,
then browse to and select the training-keypair.pem file at Desktop:
5.4. Click the Register and Confirm button. Click OK if you are warned about not
using fully qualified domain names.
Step 6: Confirm Hosts
6.1. Wait for some initial verification to occur on your cluster. Once the process is
done, click the Next button to proceed:
NOTE: You may see a confirmation message with a warning. Verify your
nodes are configured correctly before continuing. If the warning is related to the
firewall, you can ignore it.
CAUTION: Make sure to choose the right node for each master service as
specified below. Once the installation starts, you cannot change the
selection!
NameNode: node1
SNameNode: node2
History Server: node2
ResourceManager: node2
Nagios Server: node3
Ganglia Server: node3
HiveServer2: node2
10.2. Click on the Oozie tab and enter oozie for its Database Password:
10.3. Click on the Nagios tab. Enter admin for the Nagios Admin password, and
enter your email address in the Hadoop Admin email field:
11.1. Notice the Review page allows you to review your complete install
configuration. If you're satisfied that everything is correct, click Deploy to start
the installation process. (If you need to go back and make changes, you can use
the Back button.)
12.2. You should see the following screen if the installation completes
successfully:
12.3. When the process completes, click Next to get a summary of the installation
process. Check all configured services are on the expected nodes, then click
Complete:
RESULT: You now have a running 3-node cluster of the Hortonworks Data Platform!
Configuration Considerations
Deployment Layout
Configuring HDFS
What is Ambari
Management
Monitoring
REST API
Configuration Considerations
There are two ways to configure HDP:
- Manual configuration
- Ambari UI configuration
Deployment layout categories: Install Bits, Binaries, Configuration, Data, Runtime.
Deployment Layout
The HDP deployment layout per machine may vary slightly because not all machines will
have the same components. For example, there's only one Ambari Server per cluster.
However, by using the deployment layout above as a guide, you can quickly find the
configuration, binaries, and repos needed for Ambari to run.
The Deployment Layout can be broken down into five key categories:
1. Install Bits: It is a best practice to set up a local repository of the install bits, or
RPM repos. When setting up a local repo, a yum repo file is added to
/etc/yum.repos.d/. The RPMs are hosted on a simple web server.
2. Binaries: Hadoop executables, libraries, dependencies, template configs, etc. are
located at /usr/lib/ in the appropriate project folder. Files in these directories
should not be modified, especially configuration files. A best practice for
customization of shell scripts is that modifications should be done via wrapper
scripts, such as passing parameters or piping stdout to a log file.
3. Configuration: By convention, Hadoop configurations are under /etc/ under the
appropriate project. This is where configuration changes should be made rather
than in install (binaries) directories.
4. Data: Various Hadoop services require data directories. For example, HDFS
requires space for the NameNode to write its edits log files. And the DataNodes
will write the actual data blocks to the local file system. Throughout the
configuration files, you will find services requiring a directory path to use as
temporary or permanent storage.
5. Runtime: As Hadoop services are running, starting, and stopping, they will be
writing to self-maintenance files such as pid (process id) files, typically to
/var/run/. For example, Hadoop HDFS services will publish pid files to
/var/run/hadoop/hdfs/.
Configuring HDFS
There are two configurations involved when configuring HDFS. In addition to Hadoop
configuration properties necessary to bring up an HDFS cluster, there are some prerequisites, which we will discuss:
Ports (firewall considerations):

NameNode WebUI (runs on master nodes):
  50070 (http): NameNode WebUI
  50470 (https): secure NameNode WebUI
NameNode metadata service (runs on master nodes):
  8020/9000 (IPC): file system metadata operations
DataNode (runs on all slave nodes):
  50075 (http): DataNode WebUI
  50475 (https): secure DataNode WebUI
  50010: data transfer
  50020 (IPC): metadata operations
Secondary (Checkpoint) NameNode:
  50090 (http): Secondary NameNode WebUI
DNS
Ensure that HDFS hosts are resolvable via DNS. If this is not possible, every host will
need an /etc/hosts file containing all the hosts in the cluster. The hosts file is a local
hostname-to-IP mapping file.
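As a quick sketch, /etc/hosts entries for a small cluster would look like the following (the IP addresses shown are illustrative, not the actual class values):

192.168.1.101   node1
192.168.1.102   node2
192.168.1.103   node3
192.168.1.104   node4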
core-site.xml: Cluster-wide settings, including the NameNode host and port and proxy
user/groups. This file gets distributed to all nodes, but it is always changed
uniformly.
hdfs-site.xml: Some settings are cluster-wide, while others are DataNode
specific. For example, dfs.datanode.data.dir can be different between
DataNodes. (A minimal example of both files appears after the property list below.)
NameNode
fs.defaultFS: hdfs://namenodehost:8020
dfs.namenode.name.dir: /hadoop/hdfs/namenode
dfs.replication: The Hadoop default is 3 and should be kept at 3. This property can be
overridden by the client per operation if you want to change the replication for a
file. For example, if a file is referenced multiple times in many jobs, it is often a
performance gain to have more replicas of that same file (e.g., when joining with
lookup files).
dfs.replication.max: Maximum replication.
dfs.blocksize: The default block size is 128MB (this property is expressed in bytes). If
your cluster generally has larger datasets and the datasets are not processing
intensive, you can set this to a higher size. However, 128MB is a good default to
keep. Just as with the replication factor, you can change the block size for each
file that you upload into HDFS.
dfs.namenode.stale.datanode.interval: Default, 30000ms. Threshold for
amount of time in milliseconds before the NameNode considers a DataNode to
be stale, at which point the DataNode is moved to the end of the list of available
replica locations.
SecondaryNameNode
dfs.namenode.checkpoint.dir: Directory where the SecondaryNameNode
temporarily stores the images it needs to merge from the NameNode.
dfs.namenode.checkpoint.period: Default 3600. The number of seconds between
two periodic checkpoints.
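For illustration, here is a minimal sketch of how these properties might appear in the two files; the host name and directory path are the defaults described above, and 134217728 bytes is simply 128MB expressed in bytes:

core-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenodehost:8020</value>
</property>

hdfs-site.xml:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/hadoop/hdfs/namenode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>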
What is Ambari
Ambari is a 100% Apache open source operations framework for provisioning,
managing, and monitoring Hadoop clusters. It provides these features through a web
frontend and an extensive REST API.
With Ambari, clusters can be built from the ground up on clean operating system instances.
Ambari handles propagating binaries to all the hosts in a cluster, configuring services,
launching them, and monitoring them.
Management
Once a cluster is provisioned, services can be managed either as an entire service, or
management can be granular down to a service's sub-components. For example, an
administrator can choose to start/stop the entire YARN service (ResourceManager +
NodeManagers), or just stop a particular NodeManager on a host.
Configuration of services can also be managed. Properties, credentials, and paths are
some examples of common configurations. Ambari allows custom or advanced properties
to be managed for most services. Once a configuration change has been made, Ambari
will persist the change to its own internal database, a PostgreSQL database by default.
Management Flow
1. Stop service(s): Services are required to be stopped.
2. Edit and save: Once saved, Ambari will validate and persist the new settings in
its database and write the settings to the appropriate configuration files on the cluster.
3. Start service(s): Services can now be started.
Advanced Configurations
Ambari supports configuring NameNode HA and security. These are advanced features
available under the Admin page. These topics will be covered in later units.
Monitoring
Ambari provides monitoring with the combination of two powerful open source
frameworks: Ganglia and Nagios.
Ganglia
All cluster metrics are gathered by Ganglia agents running on each host and aggregated.
Nagios
Nagios is used to provide alerts, escalation schemes to implement enterprise SLAs, and
reports. With Nagios, alerts via email, SMS, or script execution can be triggered by
events such as a threshold limit being crossed. For example, an administrator may want
to receive an SMS alert if a Hadoop master node's CPU is pegged at 100% for more than
5 minutes. All such thresholds are configurable in Nagios.
Dashboard
Ambari provides a dashboard that gives an administrator a quick view of the overall
health of the entire cluster. There are 20+ widgets that provide quick stats on services.
Widgets can be added, or you can write your own widgets using the Ambari APIs.
REST API
Ambari exposes a REST API, so you can write your own automation scripts to perform
extensive operations. The REST API allows you to monitor as well as manage a cluster.
The Ambari REST API is an evolving feature. While most operations will work
as expected, be sure to thoroughly test an operation and validate the expected
results.
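As a quick illustration, the API can be queried with plain HTTP calls; the host, port, and credentials below are the class defaults used earlier in this course, and the cluster name is a placeholder:

# curl -u admin:admin http://node1:8080/api/v1/clusters
# curl -u admin:admin http://node1:8080/api/v1/clusters/<clustername>/services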
RESULT: You have added a new node to the cluster and Hadoop is installed on it. In a
later lab, you will commission this node as a DataNode.
Objective: To learn how to start and stop the various HDP services using
either the command line or Ambari.
Successful Outcome: You will have stopped HDP from the command line and
started it again using Ambari.
Before You Begin: Your cluster should be up and running.
[Lab table, flattened in this copy: the services to stop (Oozie, WebHCat, Hive, ZooKeeper, YARN NodeManagers, the MapReduce History Server, the YARN ResourceManager, HDFS DataNodes, the Secondary NameNode, and the NameNode), the node (node1 through node4) on which each runs, and the command used to stop each daemon. The stop commands preserved in this copy are:]

su - yarn -c 'export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop-yarn/sbin/yarn-daemon.sh --config /etc/hadoop/conf stop nodemanager'   (run on each NodeManager node)

su - mapred -c 'export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop-mapreduce/sbin/mr-jobhistory-daemon.sh --config /etc/hadoop/conf stop historyserver'

su - yarn -c 'export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop-yarn/sbin/yarn-daemon.sh --config /etc/hadoop/conf stop resourcemanager'

su -l hdfs -c "/usr/lib/hadoop/sbin/hadoop-daemon.sh stop datanode"   (run on each DataNode node)

su -l hdfs -c "/usr/lib/hadoop/sbin/hadoop-daemon.sh stop secondarynamenode"

su -l hdfs -c "/usr/lib/hadoop/sbin/hadoop-daemon.sh stop namenode"
1.2. SSH into node1. (Make sure you run this script from node1.)
1.3. Run the following script to shut down all HDP services on your cluster:
# ~/scripts/shutdown_all_services.sh
1.4. Wait for the script to execute and all the services to stop.
Step 2: View Ambari
2.1. Go to your Ambari Dashboard. Notice the Cluster Status and Metrics on the
Dashboard are mostly n/a:
2.2. Notice that all the Services are down - as shown by the red icon next to each
service name:
2.3. From the Services page, click on each service individually. They should all be
stopped.
Step 3: Stop Ambari Services
3.1. Run the following script to shut down all Ambari services on your cluster:
# ~/scripts/stop_ambari.sh
6.4. Once all the services are started, click the OK button to close the progress
dialog.
6.5. Verify on the Services page of Ambari that all the HDP services in your cluster
are up and running.
NOTE: The table below shows the proper order for starting HDP services.
These can be executed using the /root/scripts/startup_all_services.sh script
provided in your class cluster.
[Startup-order table, flattened in this copy: HDFS (NameNode, Secondary NameNode, DataNodes), YARN (ResourceManager, History Server, NodeManagers), ZooKeeper, Hive (Metastore, HiveServer2), WebHCat, Oozie, Ganglia, and Nagios, together with the node (node1 through node4) on which each runs and the command used to start each daemon. The start commands preserved in this copy are:]

su - yarn -c 'export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop-yarn/sbin/yarn-daemon.sh --config /etc/hadoop/conf start nodemanager'   (run on each NodeManager node)

/usr/lib/zookeeper/bin/zkServer.sh start   (run on each ZooKeeper node)

su - hive -c 'env HADOOP_HOME=/usr JAVA_HOME=/usr/jdk64/jdk1.6.0_31 /tmp/startMetastore.sh /var/log/hive/hive.out /var/log/hive/hive.log /var/run/hive/hive.pid /etc/hive/conf.server'

su - hive -c 'env JAVA_HOME=/usr/jdk64/jdk1.6.0_31 /tmp/startHiveserver2.sh /var/log/hive/hive-server2.out /var/log/hive/hive-server2.log /var/run/hive/hive-server.pid /etc/hive/conf.server'

su -l hcat -c "/usr/lib/hcatalog/sbin/webhcat_server.sh start"

sudo su -l oozie -c "/usr/lib/oozie/bin/oozied.sh start"

/etc/init.d/hdp-gmetad start   (Ganglia server)
/etc/init.d/hdp-gmond start   (run on each monitored node)

service nagios start
Notice the usage contains options for performing file system tasks in HDFS, like
copying files from a local folder into HDFS, retrieving a file from HDFS, copying
and moving files around, and making and removing directories.
1.2. Enter the following command:
# hdfs dfs
Notice you get the same usage list as the hadoop fs command.
NOTE: The hadoop command is a more generic command that has fewer
options than the hdfs command. However, notice hdfs dfs is just an alias for
hadoop fs.
...   hdfs          0   /user/ambari-qa
...   hdfs          0   /user/hcat
...   hdfs          0   /user/hive
...   hdfs          0   /user/oozie
Notice HDFS has four user folders by default: ambari-qa, hcat, hive and oozie.
2.3. Run the -ls command again, but this time specify the root HDFS folder:
# hadoop fs -ls /
... yarn     hdfs   0  2013-08-20 13:59  /app-logs
... hdfs     hdfs   0  2013-08-20 13:53  /apps
... mapred   hdfs   0  2013-08-20 13:57  /mapred
... hdfs     hdfs   0  2013-08-20 13:58  /mr-history
... hdfs     hdfs   0  2013-08-28 22:03  /tmp
... hdfs     hdfs   0  2013-08-28 22:03  /user
IMPORTANT: Notice how adding the / in the -ls command caused the
contents of the root folder to display, but leaving off the / attempted to list
the contents of /user/root. If you do not specify an absolute path, then all
hadoop commands are relative to the user's default home folder.
Notice the root user does not have permission to create this folder.
3.2. Switch to the hdfs user:
# su - hdfs
3.4. Change the permissions to make root the owner of the directory:
$ hadoop fs -chown root:root /user/root
3.5. Verify the folder was created successfully and root is the owner:
$ hadoop fs -ls /user
...
drwxr-xr-x   - root root          /user/root
3.7. Now view the contents of /user/root using the following command again:
# hadoop fs -ls
The directory is empty, but notice this time the command worked.
Step 4: Create Directories in HDFS
4.1. Enter the following command to create a directory named test in HDFS:
test
Notice you only see the test directory. To recursively view the contents of a
folder, use -ls -R:
# hadoop fs -ls -R
drwxr-xr-x   - root root   0  test
drwxr-xr-x   - root root   0  test/test1
drwxr-xr-x   - root root   0  test/test2
drwxr-xr-x   - root root   0  test/test2/test3

After deleting test/test2, a recursive listing shows the remaining directories plus the
deleted content, which has been moved under the .Trash folder:
.Trash
.Trash/Current
.Trash/Current/user
.Trash/Current/user/root
.Trash/Current/user/root/test
.Trash/Current/user/root/test/test2
.Trash/Current/user/root/test/test2/test3
test
test/test1
NOTE: Notice Hadoop created a .Trash folder for the root user and moved
the deleted content there. The .Trash folder empties automatically after a
configured amount of time.
6.3. Run the following -put command to copy hdfs-audit.log into the test folder in
HDFS:
# hadoop fs -put hdfs-audit.log test/
-rw-r--r--   3 root root  3744098  test/hdfs-audit.log
drwxr-xr-x   - root root        0  test/test1
7.2. Verify the file is in both places by using the -ls -R command on test. The
output should look like the following:
# hadoop fs -ls -R test
-rw-r--r--   3 root root  3744098  test/hdfs-audit.log
drwxr-xr-x   - root root        0  test/test1
-rw-r--r--   3 root root  3744098  test/test1/copy.log
7.3. Now delete the copy.log file using the -rm command:
# hadoop fs -rm test/test1/copy.log
8.2. You can also use the -tail command to view the end of a file:
# hadoop fs -tail test/hdfs-audit.log
Notice the output this time is only the last 20 rows of hdfs-audit.log.
Step 9: Getting a File from HDFS
9.1. See if you can figure out how to use the -get command to copy test/hdfs-audit.log into your local /tmp folder.
Step 10: The getmerge Command
10.1. Put the file /var/log/hadoop/hdfs/hadoop-hdfs-namenode-node1.log into
the test folder in HDFS. You should now have two files in test: hdfs-audit.log and
hadoop-hdfs-namenode-node1.log:
# hadoop fs -ls test
Found 3 items
-rw-r--r--   3 root root  ...  test/hadoop-hdfs-namenode-node1.log
-rw-r--r--   3 root root  ...  test/hdfs-audit.log
drwxr-xr-x   - root root  ...  test/test1
10.3. What did the previous command do? Compare the file size of merged.txt
with the two log files from the test folder.
Step 11: Specify the Block Size of a File
11.1. Change directories to /root/labs:
# cd /root/labs
Notice this folder contains an HBase JAR file that is about 4.7MB.
11.2. Put the HBase JAR file into /user/root in HDFS with the name hbase.jar, and
assign it a blocksize of 1048576 bytes. HINT: The blocksize is defined using the
dfs.blocksize property on the command line.
11.3. Run the following fsck command on hbase.jar:
# hdfs fsck /user/root/hbase.jar
11.4. How many blocks did this file get broken down in to? ________________
RESULT: You should now be comfortable with executing the various HDFS commands,
including creating directories, putting files into HDFS, copying files out of HDFS, and
deleting files and folders.
ANSWERS:
Step 2.4: hdfs
Step 9.1:
# hadoop fs -get test/hdfs-audit.log /tmp
Step 10.3: The two files that were in the test folder in HDFS were merged into a single
file and stored on the local file system.
Step 11.2:
hadoop fs -D dfs.blocksize=1048576 -put hbase-0.94.3bimota-1.2.0.21+HBASE-7644.jar hbase.jar
Replication Placement
NameNode Information
These features not only maintain the reliability and durability of the data blocks but also
allow for easy administration.
The HDFS client will calculate a checksum for each block and send it to the
DataNode along with the block.
The DataNode stores checksums in a metadata file separate from the blocks
data file.
The block as well as the checksum is sent to the client when reading. The client
will validate the checksum and if there is an inconsistency it will inform the
NameNode that the block is corrupt.
Replication Placement
Every file has a block size and a replication factor associated with it. All blocks that make
up a file are the same size, except for the last block. The NameNode makes all
decisions regarding block replication for a file in HDFS. DataNodes send block reports to
the NameNode containing a list of all the blocks on a specific DataNode. The
DataNodes are responsible for the creation, deletion, and replication of blocks based
upon instructions from the NameNode.
Be aware that HDFS block placement does not take into account disk space utilization on
the DataNodes. This ensures that blocks are placed for availability and not just on the
DataNodes with the most free space.
Below is an example of how blocks and metadata are laid out in a DataNode directory.
The data blocks are stored in HDFS directories beginning with the blk_ prefix and
contain the raw bytes. The metadata file has the .meta suffix and contains header,
version and type information. It also contains the checksum data for the blocks.
${dfs.data.dir}/current/VERSION
/blk_<id_1>
/blk_<id_1>.meta
/blk_<id_2>
/blk_<id_2>.meta
/...
/blk_<id_64>
/blk_<id_64>.meta
/subdirectory0/
/subdirectory1/
/...
[Diagram: Data Pipeline. 1. The client tells the NameNode it wants to write a block of data. 3. The client sends the data plus a checksum to the first DataNode. 4. The data and checksum are pipelined to the next DataNode, which verifies the checksum. 5-6. Success is acknowledged back through the pipeline to the client.]
Checksums
Checksums are generated for each data block and are used to validate the block during
reads. A checksum is created for a set number of bytes of data, as defined by
io.bytes.per.checksum. The size of the checksum data is minimal; for instance, a CRC32
checksum is 4 bytes long.
Use the -ignoreCrc option with the -get or -copyToLocal command to read data
without checksum validation.
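For example (a sketch using the file from the earlier lab; any HDFS path works here):

# hadoop fs -get -ignoreCrc test/hdfs-audit.log /tmp/hdfs-audit.log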
[Diagram: a client tells the NameNode, "I need to read a portion of file.txt," and the NameNode replies, "OK, you'll find it on DataNode 12, block 5."]
[Diagram: DataNodes report bad blocks to the NameNode with the block report; the NameNode's Web UI displays the bad blocks; the DataNode is notified with the results for each validated block.]
The block scanner will adjust its read rate to ensure it completes the block scanning
within the defined time frame. The time frame is defined by the parameter
dfs.datanode.scan.period.hours (the default is 504 hours, or 3 weeks). The DataNode
keeps an in-memory list of the blocks' verification times, which are also stored in a log
file.
The Block Scanning Report can be accessed from the DataNode GUI:
http://datanode:50075/blockScannerReport
The period of time between block scanner runs is set with the
dfs.datanode.scan.period.hours property.
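A minimal sketch of overriding this in hdfs-site.xml (504 is simply the default three-week period mentioned above):

<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>504</value>
</property>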
If the fsck command is run with no arguments, it will print usage information. If run
with a path, the command will check all blocks for files within the path. If / is given as
the path, the entire HDFS file system will be checked.
The fsck command will not check files open for write by a Hadoop client. The
-openforwrite option overrides that default.
NOTE: fsck retrieves all of its information from the NameNode; it does not
communicate with any DataNodes to retrieve any block data.
Over-replicated blocks: Blocks that exceed their target replication for the file
they belong to. Normally, over-replication is not a problem, and HDFS will
automatically delete excess replicas.
Under-replicated blocks: Blocks that do not meet their target replication for the
file they belong to. The NameNode will automatically create new replicas of
under-replicated blocks until they meet the target replication. You can get
information about the blocks being replicated (or waiting to be replicated) using
hdfs dfsadmin -metasave.
Mis-replicated blocks: Blocks that do not satisfy the block replica placement
policy. For example, for a replication level of three in a multirack cluster, if all
three replicas of a block are on the same rack, then the block is mis-replicated
because the replicas should be spread across at least two racks for resilience.
The NameNode will automatically re-replicate mis-replicated blocks so that they
satisfy the rack placement policy.
The hdfs fsck command takes a path plus options:
path: check all blocks for files under the given path
-move: move corrupted files to /lost+found
-delete: delete corrupted files
-openforwrite: include files opened for write in the check
-files: print each file being checked
-blocks: print the block report for each file
-locations: print the DataNode locations for every block
-racks: print the network topology (rack) for each block location
Results:
$ hdfs fsck /
....................................................................
Status: HEALTHY
 Total size:                  128847681 B
 Total dirs:                  144
 Total files:                 200 (Files currently being written: 3)
 Total blocks (validated):    198 (avg. block size 650745 B) (Total open file blocks (not validated): 3)
 Minimally replicated blocks: 198 (100.0 %)
 Over-replicated blocks:      0 (0.0 %)
 Under-replicated blocks:     1 (0.5050505 %)
 Mis-replicated blocks:       0 (0.0 %)
 Default replication factor:  3
 Average block replication:   2.989899
 Corrupt blocks:              0
 Missing replicas:            7 (1.1824324 %)
 Number of data-nodes:        3
 Number of racks:             1
FSCK ended at Wed Oct 03 12:05:58 EDT 2012 in 44 milliseconds
Useful commands and their output:
$ hdfs fsck / : checks the file system; with -files, -blocks, and -locations it lists all files
that are checked and all the blocks for each file, including the addresses of the
DataNodes containing the blocks.
$ hadoop fs -du -s -h / : displays a human-readable summary of the space consumed
under a path.
$ hadoop fs -count -q / : displays quota information and counts of directories, files,
and bytes under a path.
Make sure to redirect fsck output to a file if working on a large cluster. Writing to
STDOUT on a large cluster can be time consuming.
$ hdfs fsck / -files -blocks -locations > myfsck001.log
Look for key patterns in output of fsck. Search for these strings:
CORRUPT block
CORRUPT
MISSING
The dfs command can be used to get a detailed listing of the HDFS
namespace
$ hdfs dfs -ls / > mydfslsr001.log
Common hdfs dfsadmin options:
-report: returns a list of all the DataNodes in a cluster along with basic file system statistics.
-safemode: enter, leave, or get the status of safemode.
-finalizeUpgrade: finalizes a previous HDFS upgrade.
-refreshNodes: re-reads the include/exclude files; decommissions or recommissions DataNode(s).

$ hdfs dfsadmin -report > datanodereport001.log
NameNode Information
The NameNode information can also be saved to a
file using the metasave option:
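For example (the output file name here is illustrative; run the command as the hdfs superuser):

$ hdfs dfsadmin -metasave metasave-report.txt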
347 files and directories, 201 blocks = 548 total
Live Datanodes: 3
Dead Datanodes: 0
Metasave: Blocks waiting for replication: 1
/user/root/.staging/job_201210012351_0005/job.jar: blk_2319384830921372914_1198
(replicas: l: 3 d: 0 c: 0 e: 0) 10.202.29.145:50010 : 10.77.22.74:50010 : 10.34.49.188:50010 :
Metasave: Blocks being replicated: 0
Metasave: Blocks 0 waiting deletion from 0 datanodes.
Metasave: Number of datanodes: 3
10.202.29.145:50010 IN 885570207744(824.75 GB) 124239872(118.48 MB) 0.01%
842086092800(784.25 GB) Wed Oct 03 12:18:41 EDT 2012
10.77.22.74:50010 IN 885570207744(824.75 GB) 132116480(126 MB) 0.01%
842075832320(784.24 GB) Wed Oct 03 12:18:42 EDT 2012
10.34.49.188:50010 IN 885570207744(824.75 GB) 122011648(116.36 MB) 0.01%
842083241984(784.25 GB) Wed Oct 03 12:18:41 EDT 2012
NameNode Information
The metasave option creates a file (named filename) that is written to
HADOOP_LOG_DIR/hadoop/hdfs on the NameNode's local file system and contains:
Summary statistics.
Unit 5 Review
1. What is the priority of placement of the second block replica during block
replication? ____________________
2. What is the purpose of setting the io.bytes.per.checksum parameter?
_____________________
3. What process uses the dfs.datanode.scan.period.hours parameter?
_____________________
4. List three things the hdfs fsck command will look for.
5. Which option of the hdfs fsck command would you use to list DataNode
addresses for the blocks? _______________________
6. What output value(s) of the hdfs fsck command would you use to determine the
total amount of disk storage, including replication, that a file is taking up?
7. Why would you run the command below?
$ hdfs dfsadmin -report > myreport001.log
Objective: View the various tools for performing block verification and
the health of files in HDFS.
Successful Outcome: You will see the result of the Block Scanner Report on node1,
and the output of the fsck command.
Before You Begin: SSH into node1.
You should see a list of all blocks on that DataNode and their status:
NOTE: If a block is corrupt, the NameNode is notified and attempts to fix the
issue. The default time period for scanning blocks is every three weeks, so in
a production environment you would not set this interval to 30 minutes like
you did in this lab. Use the block scanner report as a quick way to verify the
integrity of the blocks in your cluster.
5.3. How many blocks did test_data get split into? ____________
5.4. What is the average block replication of test_data? ___________
Step 6: Using fsck Options
6.1. Run the fsck command again, but this time add the -blocks option:
# hdfs fsck /user/root/test_data -files -blocks
6.2. What did the blocks option add to the output? _________________________
6.3. Add the -locations option as well:
# hdfs fsck /user/root/test_data -files -blocks -locations
6.4. What did the locations option add to the output? _______________________
Step 7: Run a File system Check
7.1. You can run fsck on the entire file system. Enter the following command:
# hdfs fsck /
Notice this command fails, because root does not have permission to view all the
files in HDFS.
7.2. Switch to the hdfs user:
# su - hdfs
8.5. Click on the Live Nodes link to view the Live DataNodes in your cluster:
9.3. Go back to the dfshealth.jsp page and refresh it. Notice you now have 1 Dead
Node and a large number of under-replicated blocks:
9.4. Why does your cluster have so many under-replicated blocks? ___________
_________________________________________________________________
Step 10: Run fsck Again
10.1. Switch to the hdfs user and run fsck on the entire file system:
# su - hdfs
[hdfs@node1 ~]$ hdfs fsck /
Notice you get a long list of every file that contains under-replicated blocks.
10.2. What is the average block replication now on your cluster? _____________
10.3. Compare the value of Missing replicas in the output of fsck with the value of
Number of Under-Replicated Blocks in the NameNode UI.
Step 11: Start the DataNode Again
11.1. Using Ambari, start the DataNode process on node1.
11.2. Refresh the dfshealth.jsp page in the NameNode UI frequently, and you can
watch as the number of under-replicated blocks gradually decreases to 0:
11.3. Run fsck again on your entire file system, and notice everything is back to
normal again.
RESULT: The Block Scanner Report is a quick way to view the status of the blocks on the
DataNodes of your cluster. The fsck tool is a great way to view the health of your file
system and block replication, as is using the NameNode UI.
User Authentication
[Diagram: an NFSv3 client talks to the NFS Gateway; the gateway's DFSClient communicates with the NameNode over the ClientProtocol and with the DataNodes over the DataTransferProtocol.]
NFS Client: The number of application users doing the writing and the number
of files being loaded concurrently define the workload.
DFS Client: Multiple threads are used to process multiple files. The DFSClient
averages 30 MB/s for writes.
HDFS NFS Gateway simplifies data ingest of large-scale analytical workloads. Random
writes are not supported. Different ways the NFS interface to HDFS can be used include:
A few reminders:
NOTE: NFSv4 support, HA, Kerberos are on the roadmap for the HDFS NFS
Gateway.
HA is not built into the gateway servers. If a gateway server goes down, the
corresponding HDFS client mounts will fail.
Update the following property in hdfs-site.xml. It sets the maximum number of files
being uploaded in parallel.
<property>
<name>dfs.datanode.max.transfer.threads</name>
<value>1024</value>
</property>
Add the following property to hdfs-site.xml. The NFS client often reorders writes, so
sequential writes can arrive at the NFS gateway in random order. This directory is used
to temporarily save out-of-order writes before writing to HDFS. You need to make sure
the directory has enough space.
<property>
<name>dfs.nfs3.dump.dir</name>
<value>/tmp/.hdfs-nfs</value>
</property>
Change the log level in the log4j.properties file to DEBUG to collect more details:
log4j.logger.org.apache.hadoop.hdfs.nfs=DEBUG
Start mountd and nfsd, making sure the user starting the Hadoop cluster and the user
starting the NFS gateway are the same:
$ hdfs nfs3
Make sure NFS gateway services have started properly. Verify mountd, portmapper and
NFS are up and running.
Execute the following command to verify if all the services are up and running:
rpcinfo -p $nfs_server_ip
Make sure the HDFS namespace is exported and can be mounted by any client.
# showmount -e $nfs_server_ip
User Authentication
The OS login user ID on the NFS client must match the user ID accessing HDFS.
LDAP/NIS should be used to make sure the same user IDs are deployed on the
NFS client and HDFS.
[Diagram: a user ("Gage") on the NFS client maps to the same user on the NFS Gateway; the NameNode looks up the UID/GID for the user.]
User Authentication
The user authentication method needs to make sure the UID/GID match between the
user accessing HDFS through the NFS client and the user running the HDFS operations.
The manual creation of users is not recommended for production environments.
Unit 6 Review
1. What nodes in a Hadoop cluster can the HDFS NFS gateway run on?
2. A _________ needs to be running on the NFS gateway node.
3. What configuration file is modified to configure the HDFS NFS Gateway server?
4. The _____________ must match between the NFS client and HDFS for user
authentication.
3.1. Run the following commands to stop the nfs and rpcbind services. (If they are
not running, the following commands will fail, which is no problem):
# service nfs stop
# service rpcbind stop
3.2. Now start the NFS services using the hadoop-daemon.sh script:
# hdfs portmap &
# hdfs nfs3 &
3.4. Verify that the HDFS namespace is exported and can be mounted by any
client.
# showmount -e node1
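For reference, mounting the exported namespace typically looks like the following (the local mount point /hdfs_mount is an assumption; node1 is the gateway host used in this lab):

# mkdir -p /hdfs_mount
# mount -t nfs -o vers=3,proto=tcp,nolock node1:/ /hdfs_mount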
RESULT: You have mounted HDFS to a local file system, which can be a convenient skill
to know how to do when working frequently with files in HDFS.
What is YARN?
Beyond MapReduce
ResourceManager
NodeManager
MapReduce
Configuring YARN
Configuring MapReduce
Tools
What is YARN?
The goal of an operating system is to let applications achieve 100% utilization
of all resources on the physical system while letting every application execute at its
maximum potential. This is what YARN achieves in a Hadoop cluster. The compute
resources managed by YARN in a Hadoop cluster are memory and CPU. A YARN
application can request these resources, and YARN will make them available according
to its scheduler policy.
What distinguishes YARN from other distributed compute frameworks is that the
applications that run on YARN can be rapidly developed. Many standalone
applications have already been adapted, and they range from batch applications
such as MapReduce to real-time, always-on database applications such as HOYA (HBase
on YARN).
[Diagram: Hadoop 1.0 vs. Hadoop 2.0. In Hadoop 1.0 (batch apps only), MapReduce handled both cluster resource management and data processing on top of HDFS (redundant, reliable storage). In Hadoop 2.0, YARN handles cluster resource management, MapReduce and other frameworks handle data processing, and HDFS2 provides redundant, reliable storage.]
Availability: The JobTracker was a single point of failure. If it failed, then ALL
jobs failed.
Hard partition of resources into map and reduce slots: This limitation is a major
factor that causes a cluster's compute resources to be underutilized.
Lacks support for alternate paradigms and services: Legacy Hadoop was meant
to solve batch-processing scenarios, and MapReduce was the only programming
paradigm available.
138
Beyond Java: The types of applications that run on YARN are not limited to Java.
Applications written in any language, as long as the binaries are installed on the
cluster, can run natively, all while requesting resources from YARN and utilizing
HDFS.
139
Beyond MapReduce
Tez
MapReduce as a workflow
Storm
Stream processing; always-running application
ONLINE
(HBase)
STREAMING
(Storm, S4,)
GRAPH
(Giraph)
IN-MEMORY
(Spark)
HPC MPI
(OpenMPI)
OTHER
(Search)
(Weave)
Beyond MapReduce
Remember that MapReduce is just a type of application paradigm that can run on YARN.
Applications are continuously being ported to run on YARN to utilize HDFS storage and
in some cases to utilize YARN's distributed compute framework itself.
140
YARN Use-case
Two key factors led to dropping an entire datacenter of 10k nodes:
1. YARN is a pure resource manager; it does not care about application specifics
such as what type of application is running on the cluster. Resource management
is lightweight once these types of details are offloaded to other processes. YARN
simply knows about resource availability for each node in the cluster and will
lease these resources based on its scheduler policy. The responsibility of
applications using these resources is left to another type of per-application
process called an ApplicationMaster.
2. MapReduce in Hadoop 2 (MRv2) itself has taken advantage of this type of
architecture, where each job has its own ApplicationMaster. We will discuss
ApplicationMaster details later on. Each MRv2 job's resource requests are
dynamically sized for its Map and Reduce processes.
141
ResourceManager (master)
Application management
Scheduling
Security
ApplicationsManager
Scheduler
NodeManager (worker)
[Diagram: NodeManager 1 runs containers for Job1 Task1 and Job2 Map1, Reducer, and Map2, with free capacity remaining; NodeManager 2 runs the AppMaster for YARN Job1 and Job1 Task2, with free capacity; NodeManager 3 runs the AppMaster for MR Job2 and Job2 Map3-Map6, with free capacity.]
142
NodeManagers: These are the worker nodes in a YARN cluster. They publish
resource pools (memory & CPU) to the ResourceManager. The ResourceManager
will have an aggregate view of these resources.
Client & Admin Utilities: YARN provides both client and admin command-line
tools. For monitoring YARN components, there is a REST API as well as MBeans
for daemon processes.
143
[Diagram: job submission flow. The ResourceManager (Scheduler, ApplicationsManager, ApplicationMasterService) responds with an ApplicationID, capabilities are retrieved, and the ApplicationMaster is started (steps 4-6). The ApplicationMaster for the MR job runs on NodeManager 3 and requests/receives containers; the job's Map1-Map8 and Reducer1-Reducer2 tasks run in containers across NodeManagers 1-4, with free capacity remaining on some nodes.]
144
145
ResourceManager
146
NodeManager
147
MapReduce
[Diagram: in the Map Phase, Mappers run on NodeManager + DataNode (NM + DN) nodes; during Shuffle/Sort, data is shuffled across the network and sorted; in the Reduce Phase, Reducers run on NM + DN nodes.]
MapReduce
The original use-case for Hadoop was distributed batch processing. MapReduce is a
powerful application paradigm for processing massive amounts of data.
Core features of MapReduce are:
Co-locating processing with data blocks: Take the computing to where the data
lives, rather than querying or reading data into a remote application. Would you
rather move hundreds of GB/TB of data around your network, or would you
rather move an application that processes the same data to where the data
actually lives?
Map Phase: This is the initial phase of all MapReduce jobs. This is where raw
data can be read, extracted, transformed, and results written out to HDFS or
moved on to Reducers for aggregate processing, such as a final count, sum, min,
max, etc. The Map phase can also be thought of as the ETL or projection step for
MapReduce.
Reduce Phase: This is the final phase where data is sorted on a user-defined key
and grouped by that same key. The Reducer has the option to perform an
148
aggregate function on that data. The Reduce phase can be thought of as the
aggregation step.
Data is always moved along the pipeline in MapReduce in the form of key/value
pairs.
A MapReduce job scales to the size of the data. For example, if a dataset in
HDFS is 1 terabyte broken into 256MB blocks, it is possible for 4,096 mappers to
run in parallel, one reading each block (if the cluster has the capacity).
149
2.2. Notice a MapReduce job gets submitted to the cluster. Wait for the job to
complete.
150
151
Configuring YARN
For configuring YARN, there is one core configuration file:
/etc/hadoop/conf/yarn-site.xml
The most important aspect to configuring YARN is how resource allocation works. There
are two types of resources:
Physical: The total physical memory that a container will claim.
Virtual: The total virtual memory that a container will claim. It is usually much larger than physical memory. You want to keep this higher because, once the containers are running, a process can often take advantage of virtual memory addressing to give the application the impression that it has more memory than is physically allocated.
Why does this work? Because the underlying operating system will page out memory that's not being used to a partition on its local disk known as a swap partition.
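For example, the memory a NodeManager makes available to containers, the minimum container allocation, and the physical-to-virtual memory ratio are all set in yarn-site.xml. The values below are only a sketch with assumed numbers, not tuning recommendations:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>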
152
Servers           Ports   Protocol  Description
ResourceManager   8088    http      WebUI for RM
ResourceManager   8032    IPC       Application submissions
NodeManagers      50060   http      WebUI for NMs
153
Configuring MapReduce
There are a few additional considerations for properly configuring MapReduce. In the
mapred-site.xml, there are two additional properties that should be kept in tune:
MapReduce container size (physical):
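The physical container sizes are set with the mapreduce.map.memory.mb and mapreduce.reduce.memory.mb properties in mapred-site.xml. The values below are a sketch with assumed numbers, chosen to be consistent with the 2560MB reduce container and the heap settings shown next:
<property>
<name>mapreduce.map.memory.mb</name>
<value>1536</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>2560</value>
</property>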
Now you need to ensure that the JVM heap is lower than the physical memory allotted to the
container:
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1024m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx2048m</value>
</property>
154
Notice the Xmx (Java heap max) is less than the container allocation of 2560MB. This
gives the container some breathing room to continue running without hitting
out-of-memory issues.
155
Tools
Starting daemons from the command line:
# yarn resourcemanager
# yarn nodemanager
# yarn proxyserver
156
Admin operations
Operation           Description
$ yarn rmadmin      Administer the ResourceManager (for example, refresh queues and nodes).
$ yarn application  List/kill applications.
$ yarn node         List the status of the nodes in the cluster.
$ yarn logs         Dump the aggregated logs of a job's containers.
$ yarn daemonlog    Get/set the log level of a daemon.
REST
YARN MR applications and cluster can be monitored via the REST API. The REST API is
available at:
http://<resourcemanager:port>/ws/v1/cluster
http://<node:port>/ws/v1/node
http://<webapplicationproxy:port>/proxy/<appid>/ws/v1/mapreduce
Currently, GET requests are supported for monitoring and gathering metrics. Full REST
API usage with examples can be found at:
http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html
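For example, cluster-wide metrics and the list of applications can be pulled from the ResourceManager REST API with a simple GET (the hostname is a placeholder; 8088 is the default web port shown earlier):
$ curl http://<resourcemanager>:8088/ws/v1/cluster/metrics
$ curl http://<resourcemanager>:8088/ws/v1/cluster/apps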
Ambari
The primary management and provisioning of YARN components should be done via
Ambari (if possible). Ambari has extensive UIs to manage your YARN cluster.
157
Unit 7 Review
1. What are the three main phases of a MapReduce job? _____________
_________________________________________________________
2. What determines the number of Mappers of a MapReduce job? ___________
________________________________________________________________
3. What determines the number of Reducers of a MapReduce job? ___________
________________________________________________________________
158
This file contains URLs, along with keywords found on the webpages of each URL.
NOTE: The MapReduce job in this lab computes an inverted index, one of
the earliest use cases of Hadoop and MapReduce. A Web crawler scans the
Internet and retrieves URLs along with keywords on each page. The index
inverter job flips this information around, outputting the keywords along
with each web page that contains the keyword.
159
3.2. Run the job again, using the same command as the previous step.
3.3. The job should run successfully this time. How many map tasks were needed
for this job? ________ How many reduce tasks? ___________
3.4. How long (in ms) did it take for all the mappers to run? _________________
3.5. How long (in ms) did it take for all the reducers to run? _________________
3.6. How many bytes did the mappers of this job process? ____________
3.7. How many bytes did the reducers output? ____________
Step 4: View the Output
4.1. Verify the index_output folder was created in HDFS:
# hadoop fs -ls index_output
4.3. What did the reducer use as the key for its output? ________________
4.4. What did the reducer use as the values for its output? _________________
Step 5: Run the Job Again
5.1. Run the IndexInverterJob again with the exact same command.
5.2. The job failed. Why? ____________________________________________
5.3. Delete the index_output folder in HDFS.
160
5.4. Run the job again, and it should run successfully this time.
Step 6: View the Resource Manager UI
6.1. Point your web browser to Ambari at http://node1:8080.
6.2. From the Dashboard page, select YARN from the left-hand menu.
6.3. Select the Configs tab on the YARN page.
6.4. Which node in your cluster is the Resource Manager running on? __________
6.5. Point your web browser to the Resource Manager UI, which is
http://node2:8088. You should see the All Applications page:
6.6. In the Cluster menu on the left side of the page, click on the various links like
About, Nodes, Applications, and Scheduler. Notice there is a lot of useful
information provided in this UI.
4708852 hbase.jar
If you do not have this file in HDFS, put it there. The file is found in your
/root/labs folder.
161
7.2. Run the IndexInverterJob using the following command (entered on a single
line):
# hadoop jar invertedindex.jar inverted.IndexInverterJob
hbase.jar index_output2
7.3. Notice exceptions are thrown, and eventually the job will fail. From the
output of the job, how many map tasks were launched? _________ How many
map tasks failed? ________ How many were killed? ________
7.4. The input file, hbase.jar, is split into 5 blocks in HDFS. Why did this
MapReduce job launch 10 map tasks? ___________________________________
Step 8: View the Log Files
8.1. Let's figure out what happened to the IndexInverterJob. View the Job History
page of node2 by pointing your browser to http://node2:19888:
Notice the most recent job at the top of the list has a status of FAILED.
8.2. Click on Job ID of the failed IndexInverterJob. You should see the details page
for this job:
162
8.3. Notice this page contains useful details about the job, including the average
map and reduce time, and how long it took to execute the entire job.
8.4. Also notice that 5 mappers and 1 reducer were started for this job. In the
screen shot above, 8 map tasks failed, 2 were killed, and 0 were successful. Notice
these numbers are links - click on your number of failed map tasks:
8.5. In the Logs column, click on the logs link of one of the failed map tasks to
view the corresponding log file:
8.6. What happened in this job? Why did the mapper fail? ___________________
RESULT: You have executed a MapReduce job that failed for several different reasons.
Being able to troubleshoot these types of issues is an important and handy skill for any
Hadoop administrator.
ANSWERS:
2.2: The job fails because the input file for the job does not exist in HDFS.
3.3: 1 mapper and 1 reducer, as found in the Job Counters section of the output.
3.4: Look for the counter Total time spent by all maps in occupied slots (ms)
163
3.5: Similarly, look for Total time spent by all reduces in occupied slots (ms)
3.6: Bytes Read=1126, as found in the File Input Format Counters section.
3.7: Bytes Written=2997, as found in the File Output Format Counters section.
5.2: The output folder of a MapReduce job cannot exist. You should have gotten the
following error message: FileAlreadyExistsException: Output directory
hdfs://node1:8020/user/root/index_output already exists
7.3: You should see 10 launched map tasks. The number of failed and killed tasks will
vary, but expect about 8 failed and 2 killed.
7.4: When a map task fails, the MapReduce framework launches the map task again. A
map task has to fail 2 times (by default) before the entire job fails. The input file was
split into 5 blocks, and each block generated a map task that failed 2 times, so 5x2=10.
8.6: A NullPointerException was thrown on line 32 of IndexInverterJob.java. Useful
information for the Java developer!
164
Defining Queues
Configuring Permissions
Multi-Tenancy Limits
165
Job Scheduler
166
Fair Scheduler: Schedules jobs so that all jobs get, on average, an equal share of
cluster resources.
We recommend using the Capacity Scheduler, and HDP uses it by default. We will discuss both of these schedulers in this unit, but our focus will be on configuring and managing the Capacity Scheduler.
167
Queue     Configured for:   Actual:
Queue1    40%               50%
Queue2    35%               30%
Queue3    25%               20%
Queue 1 might represent the Marketing department, which gets 40% of the
resources because it paid for 40% of the cluster from its budget.
Queue 2 might represent the Sales department, and they get 35% of the
resources because of a company SLA.
Queue 3 might represent the Engineering department, and the remaining 25% is
allocated to them until another department comes along and needs to use the
cluster.
168
The Capacity Scheduler provides elastic resource scheduling, which means that if some
of the resources in the cluster are idle, then one queue can take up more of the cluster
capacity than was minimally allocated to them in the above configuration.
Let's now take a look at how to define and configure queues.
169
Manual editing: If you edit the XML file directly, run the following command to
have the changes take effect:
# yarn rmadmin -refreshQueues
170
Ambari: If you configure the Capacity Scheduler using Ambari, you will need to
stop the YARN service, make your changes, and then start YARN again.
Defining Queues
yarn.scheduler.capacity.root.queues=
"Marketing,Sales,Engineering"
yarn.scheduler.capacity.root.Marketing.capacity=50
yarn.scheduler.capacity.root.Sales.capacity=30
yarn.scheduler.capacity.root.Engineering.capacity=
20
Defining Queues
To define a child queue of root, use a comma-separated list of queue names for the
yarn.scheduler.capacity.root.queues property. For example:
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>Marketing,Sales,Engineering</value>
</property>
171
A queue's properties are configured by adding the queue name to the specific property.
For example, the following allocates 50% of the total capacity to the Marketing queue
and 30% to the Sales queue:
<property>
<name>
yarn.scheduler.capacity.root.Marketing.capacity
</name>
<value>50</value>
</property>
<property>
<name>
yarn.scheduler.capacity.root.Sales.capacity
</name>
<value>30</value>
</property>
172
173
A good use case for maximum-capacity is applications that take a long time to run.
You may not want a long-running app to consume a lot of resources, while still providing
a large maximum for applications that you want to run quickly. You could set up
different queues for this behavior:
yarn.scheduler.capacity.root.queues="Marketing,Marketing-longrunning"
yarn.scheduler.capacity.root.Marketing.capacity=50
yarn.scheduler.capacity.root.Marketing.maximum-capacity=80
yarn.scheduler.capacity.root.Marketing-longrunning.capacity=35
yarn.scheduler.capacity.root.Marketing-longrunning.maximum-capacity=35
174
The maximum user limit is based on the number of users that have actually submitted
jobs at any given time. For example, two users each get a maximum of 50%, three users
would each get a maximum of 33%, and so on.
Suppose the Sales queue is configured with a user minimum of 20%. Answer the
following questions:
1. If one user submits two jobs to the Sales queue, then each job will get between
__________ and _________ percent of resource.
2. If 3 different users have submitted one job each to the Sales queue, then each
user will get between _______ and _________ percent of resources.
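The user minimum in the scenario above corresponds to the minimum-user-limit-percent queue property. A configuration sketch for the Sales queue (the queue name comes from the example; the value 20 matches the 20% minimum described):
yarn.scheduler.capacity.root.Sales.minimum-user-limit-percent=20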
175
Configuring Permissions
yarn.scheduler.capacity.root.Engineering.acl_submit_applications="developer,admin,George,Tom"
yarn.scheduler.capacity.root.Engineering.acl_administer_queue="admin,Tom"
Configuring Permissions
Each queue can define an Access Control List that authorizes which users and groups
can submit jobs to the queue. For example:
yarn.scheduler.capacity.root.Engineering.acl_submit_applications="developer,admin,George,Tom"
There is also a property for configuring the users and/or groups who can administer a queue:
yarn.scheduler.capacity.root.Engineering.acl_administer_queue="admin,Tom"
176
Queues can be defined so that resources are shared fairly between these
queues.
Different fairness algorithms can be used, including FIFO and Dominant Resource
Fairness (which uses an algorithm combining memory usage with CPU usage).
177
For details of all configuration options of the Fair Scheduler, view the documentation at
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
178
Unit 8 Review
1. What are the two built-in schedulers in Hadoop?
2. Which scheduler does Hortonworks recommend you use in HDP 2.0?
3. When using a Capacity Scheduler, all queues are children of the ___________
queue.
Suppose you have the following properties configured:
yarn.scheduler.capacity.root.queues=A,B
yarn.scheduler.capacity.root.A.capacity=80
yarn.scheduler.capacity.root.B.capacity=20
yarn.scheduler.capacity.root.B.maximum-capacity=100
179
1.3. Notice there is one child queue of root defined. What is the name of the
queue? _________________
1.4. Click on the arrow to the left of default to expand and view the settings of
the default queue:
180
1.5. Notice since there are no jobs running on your cluster, the status page simply
shows 0.0% of default is being used right now.
Step 2: View the Settings of the Capacity Scheduler
2.1. Go to the Ambari Dashboard page.
2.2. Click on the YARN link in the list of services, then click on the Configs tab and
scroll down to the Scheduler section:
2.3. Notice this is where you configure the settings for the scheduler of the
Resource Manager. Which type of scheduler is currently being used?
________________________________________
Step 3: Stop YARN
3.1. You cannot configure the Capacity Scheduler using Ambari while the Resource
Manager and Node Manager services are running. While on the YARN Services
page, click the Stop button in the upper-right corner of the page, then OK to
confirm. Wait for the YARN service to stop:
181
6.3. Expand the A queue and verify its capacity is 50% and its maximum capacity is
70%:
7.3. Make sure you have a file in /user/root in HDFS named hbase.jar. If not, put
the HBase jar from the labs folder into HDFS, giving it the name hbase.jar.
7.4. In the first window, submit the test1.pig script to queue A by running the
following command (all on a single line):
# pig -Dmapreduce.job.queuename=A test1.pig 1>pig1.out
&>pig1.err &
183
7.5. While test1.pig is running, submit the test2.pig script to queue B in the other
terminal window:
# pig -Dmapreduce.job.queuename=B test2.pig 1>pig2.out
&>pig2.err &
7.6. While both jobs are running, refresh the Scheduler status page. (It may take a
minute for both jobs to run long enough to show up in the queues, so refresh the
page often until they do):
7.7. You should see resources being used in both the A and B queues.
RESULT: You just defined two queues for the Capacity Scheduler, configured specific
capacities for each queue, and submitted a job to each queue.
ANSWERS:
1.3: default
2.3: The Capacity Scheduler
184
Data Ingestion
distcp Options
185
Keeping the current data in the EDW and archiving historical data to Hadoop.
186
Using a hybrid approach incorporating some data layers in Hadoop and the
speed layer in HBase (more discussion on this later in this unit).
Using Hadoop to aggregate and filter the data, then loading the results into
an EDW and/or datamarts.
Storing the data in Hadoop and using high-speed connectors from Teradata,
Oracle, SQL Server, etc., so that the analytics in the EDW can read the data from
the Hadoop cluster.
As data strategies evolve, they create more data movement between the enterprise
data systems. Lifecycle data management has always been a central part of enterprise
data platforms, and the Hadoop and HBase clusters now become a part of that lifecycle.
Falcon is the framework used in Hadoop 2 for lifecycle data management.
187
Incapable/high complexity when dealing with loosely structured data
No visibility into transactional data
Scalability is very expensive with vendor proprietary solutions and SAN storage.
There is very high business latency between data hitting the disks and being able
to make business decisions using the data.
188
-Linearly scalable on
commodity hardware
-Massively parallel
storage and compute
Hadoop greatly reduces the business latency between data hitting the disks and being
able to make business decisions using the data.
189
Data Ingestion
[Diagram: data sources/transports (web logs and clicks; social, graph, feeds; sensors, devices, RFID; spatial, GPS; docs, text, XML; 3rd party; audio, video, images; DB data; events, other) are extracted and loaded into the Big Data Refinery using WebHDFS, Sqoop, and Flume.]
Data Ingestion
Data ingestion is one of the key components of any data warehouse, enterprise data
store or Hadoop cluster. It is a major effort to design data ingestion strategies for any
enterprise data store. Hadoop Data Refineries and Data Lakes take data ingestion to an
entirely new level of volume, speed and types of data.
Extraction, Transformation and Loading (ETL) has been a standard method for moving
data into enterprise data stores. The reason for the transformation before loading is
that the cost of SAN storage has required data to be aggregated and filtered to reduce
the amount of data that will be loaded into an enterprise data store.
With Extraction, Loading and Transformation (ELT) the data is loaded into Hadoop to a
layer known as the source of truth. This is the raw data. Since Hadoop can store data
much more cost effectively, all of the detailed data gets loaded into Hadoop. The data is
then transformed into different data layers.
190
[Diagram: the Big Data Refinery model. Data sources (DB data, web logs and clicks, audio/video/images, docs/text/XML, social/graph feeds, sensors/devices/RFID, spatial/GPS, events and other data) feed both classic ETL processing for Business Transactions & Interactions and the Big Data Refinery, which stores, aggregates, and transforms multi-structured data to unlock value (Step 2). The refinery feeds Business Intelligence & Analytics (dashboards, reports, visualization) and retains historical data to unlock additional value (Step 5).]
191
More interestingly, there are businesses deriving value from processing large video,
audio, and image files. Retail stores, for example, are leveraging in-store video feeds to
help them better understand how customers navigate the aisles as they find and
purchase products. Retailers that provide optimized shopping paths and intelligent
product placement within their stores are able to drive more revenue for the business.
In this case, while the video files may be big in size, the refined output of the analysis is
typically small in size but potentially big in value.
The Big Data Refinery platform provides fertile ground for new types of tools and data
processing workloads to emerge in support of rich multi-level data refinement solutions.
With that as backdrop, Step 3 takes the model further by showing how the Big Data
Refinery interacts with the systems powering Business Transactions & Interactions and
Business Intelligence & Analytics. Interacting in this way opens up the ability for
businesses to get a richer and more informed 360-degree view of customers, for example.
By directly integrating the Big Data Refinery with existing Business Intelligence &
Analytics solutions that contain much of the transactional information for the business,
companies can enhance their ability to more accurately understand the customer
behaviors that lead to the transactions.
Moreover, systems focused on Business Transactions & Interactions can also benefit
from connecting with the Big Data Refinery. Complex analytics and calculations of key
parameters can be performed in the refinery and flow downstream to fuel runtime
models powering business applications with the goal of more accurately targeting
customers with the best and most relevant offers, for example.
Since the Big Data Refinery is great at retaining large volumes of data for long periods of
time, the model is completed with the feedback loops illustrated in Steps 4 and 5.
Retaining the past 10 years of historical Black Friday retail data, for example, can
benefit the business, especially if it's blended with other data sources such as 10 years
of weather data accessed from a third party data provider. The opportunities for
creating value from multi-structured data sources available inside and outside the
enterprise are virtually endless if you have a platform that can do it cost effectively and
at scale.
192
Organize data based on source/derived relationships
Allows for fault and rebuild process
[Diagram: data is extracted and loaded into the Batch Layer (standardize, cleanse, integrate, filter, transform), which feeds the Serving Layer (conform, summarize, access) and the Speed Layer.]
Batch Layer: Immutable master data set (source of truth). Used to create the batch views.
193
<SourceURL1>
For example, perform a copy between two Hadoop clusters running the same version of
Hadoop:
$ hadoop distcp hdfs://<SourceURL>:8020/input/data1
hdfs://<DestinationURL>:8020/input/data1
To perform a copy between two Hadoop clusters running a different version of Hadoop,
the older cluster uses the hftp protocol, and the 2.x cluster uses the hdfs protocol:
$ hadoop distcp hftp://<SourceURL>:50070/input/data2
hdfs://<DestinationURL>:8020/input/data2
/input/data2 /input/data3
/input/data3
If a map fails and -i is NOT specified, all the files in the split, not only those that failed,
will be recopied. It also changes the semantics for generating destination paths, so users
should use this carefully.
Flag -i means Ignore failures. This option will keep more accurate statistics about the
copy than the default case. It also preserves logs from failed copies, which can be
valuable for debugging. A failing map will not cause the job to fail before all splits are
attempted.
195
Distcp Options
The following list are things to take into consideration when using the distcp command:
The -update option is used to make sure only files that have changed are
copied. It compares file sizes and checksums (CRC32) to decide whether the
destination file differs. The -skipcrccheck option can be used to disable the checksum comparison.
Distcp will skip files that already exist in the destination path. Use the
-overwrite option to make sure existing files are overwritten. File sizes are not checked.
The -delete option can be used to delete any files in the destination that are not in
the source.
Use the hftp file system on the source if there are different versions between the
source and destination HDFS clusters.
196
# hadoop distcp
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
 -async                 Should distcp execution be blocking
 -atomic                Commit all changes or none
 -bandwidth <arg>       Specify bandwidth per map in MB
 -delete                Delete from target, files missing in source
 -f <arg>               List of files that need to be copied
 -filelimit <arg>       (Deprecated!) Limit number of files copied to <= n
 -i                     Ignore failures during copy
 -log <arg>             Folder on DFS where distcp execution logs are saved
 -m <arg>               Max number of concurrent maps to use for copy
 -mapredSslConf <arg>   Configuration for ssl config file, to use with hftps://
 -overwrite             Choose to overwrite target files unconditionally, even if they exist.
 -p <arg>               preserve status (rbugp)(replication, block-size, user, group, permission)
 -sizelimit <arg>       (Deprecated!) Limit number of files copied to <= n bytes
 -skipcrccheck          Whether to skip CRC checks between source and target paths.
 -strategy <arg>        Copy strategy to use. Default is dividing work based on file sizes
 -tmp <arg>             Intermediate work path to be used for atomic commit
 -update                Update target, copying only missing files or directories
There are two strategy options: static (the default) and dynamic. When static is used,
mappers are balanced based on the total size of files copied by each map. The dynamic
approach splits files into chunks and map tasks process a chunk at a time, allowing
faster mappers to consume more file paths than slower ones and thereby speeding up
the overall distcp job.
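For example, a copy that uses the dynamic strategy and caps the number of maps might look like the following sketch (host names and paths are placeholders):
$ hadoop distcp -strategy dynamic -m 20 \
hdfs://<SourceURL>:8020/input/data1 \
hdfs://<DestinationURL>:8020/input/data1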
197
Using distcp
The distcp command starts up containers for running the mappers and generates
I/O based on the volume of data to be copied. Take the resource utilization and the
IOPS generated into account, and schedule large distcp jobs during appropriate times.
Distcp also consumes container resources on the destination cluster, which may
be the same or a different cluster.
Copying data between the two clusters will also generate network traffic
between the data nodes for each cluster. Make sure network resources are not
exceeded between the two clusters.
198
Using the hdfs:// schema for the source and destination requires the clusters be running
the same version of software. Other protocols that can be used include:
webhdfs://
hftp://
Best practice is to validate the copy between the source and the destination.
Use hadoop fs -ls <path> to confirm ownership, permissions and files.
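One simple way to validate is to compare listings and counts on both clusters. A sketch (host names and paths are placeholders):
$ hadoop fs -ls hdfs://<SourceURL>:8020/input/data1
$ hadoop fs -ls hdfs://<DestinationURL>:8020/input/data1
$ hadoop fs -count hdfs://<SourceURL>:8020/input/data1
$ hadoop fs -count hdfs://<DestinationURL>:8020/input/data1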
199
200
201
Objective: To become familiar with how to copy data from one cluster
to another.
Successful Outcome: Data from a remote cluster is copied to your own cluster.
Before You Begin: For this exercise use node1 as your Remote-Cluster.
3.2. View the contents of distcp_target and verify test_data file copied over to
your cluster:
202
3.4. View the contents of distcp_target and verify the wordcount &
constitution.txt file copied over to your cluster again.
Step 4: Copy only new/updated files and directories using -update option.
4.1. Check the timestamp for files in the /user/root/wordcount directory. Delete the part-r-00000 file from the wordcount directory.
4.2. Now run following command with -update option.
$ hadoop distcp -update
hdfs://node1:8020/user/root/wordcount
distcp_target/wordcount
4.3. View the contents of distcp_target and compare the timestamps of all the files.
You can see that the timestamp changed only for the part-r-00000 file and the
wordcount folder.
Step 5: Copy data from a Remote-Cluster running different version of Hadoop.
5.1. Execute the following command to copy a remote file into distcp_target.
$ hadoop distcp
hftp://node1:50070/user/root/hbase.jar
distcp_target
5.2. View the contents of distcp_target and verify the hbase.jar file copied over to
your cluster.
RESULT: You have learned the steps to copy data from one cluster to another.
203
What is WebHDFS?
Setting up WebHDFS
Using WebHDFS
WebHDFS Authentication
Running WebHCat
Using WebHCat
204
What is WebHDFS?
Hadoop contains native libraries for accessing HDFS from the Hadoop cluster. WebHDFS
provides a full set of HTTP REST APIs to access Hadoop remotely. HDFS commands can
be run from a platform that does not contain Hadoop software.
REST (Representational State Transfer) uses well known HTTP verb commands GET,
POST, PUT and DELETE to perform operations. REST:
Uses URIs (Uniform Resource Identifier defines a web resource using text).
205
WebHDFS uses REST APIs to perform HDFS user operations including reading files,
writing to files, making directories, changing permissions and renaming. WebHDFS can
be used to copy data between different versions of HDFS.
WebHDFS is built in to HDFS. It runs inside NameNodes and DataNodes, and therefore it can
use all HDFS functionality. Because it is a part of HDFS, there are no additional servers to
install. WebHDFS can also be used through a proxy (HttpFS). In most cases it uses hdfs://
(the port is optional).
WebHDFS supports the following:
206
Setting up WebHDFS
WebHDFS should be enabled during the Ambari install by selecting the
enable WebHDFS checkbox.
Description
dfs.webhdfs.enabled
dfs.web.authentication.kerberos.principal
dfs.web.authentication.kerberos.keytab
Setting up WebHDFS
If manually setting the dfs.webhdfs.enabled property in the hdfs-site.xml file, HDFS
(NameNode and DataNodes) must be restarted for the changes to take effect.
hdfs-site.xml:
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
207
When using Kerberos to secure the cluster, see the documentation for all the details, but here is a summary:
1. Create an HTTP service user principal:
kadmin: addprinc -randkey
HTTP/$<Fully_Qualified_Domain_Name>@$<Realm_Name>.COM
2. Export the principal to a keytab file (for example, /etc/security/spnego.service.keytab).
3. Verify that the keytab file and the principal are associated with the correct
service:
klist -k -t /etc/security/spnego.service.keytab
208
Using WebHDFS
The URL syntax to access the REST API of WebHDFS is:
http://hostname:port/webhdfs/v1/<PATH>?op=
Using WebHDFS
The REST API uses the prefix "/webhdfs/v1" in the path and appends a query at the end.
HTTP URL format:
http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=...
HDFS URI:
hdfs://<HOST>:<RPC_PORT>/<PATH>
cURL and wget can be used to execute WebHDFS commands. cURL has been around a
long time in Unix and Linux environments. It is a popular command line tool (and
library) because it can support so many protocols (HTTP, HTTPS, FTP, SCP, LDAP, TELNET,
POP3, SMTP, IMAP, etc.).
209
wget is a free software package for retrieving files using HTTP, HTTPS and
FTP.
Additional examples:
List the status of a file (use the -v option to display output in verbose mode to
get more details):
$ curl -i
"http://host:port/webhdfs/v1/input/mydata?op=GETFILESTATUS"
210
WebHDFS Authentication
Authentication can be controlled through the following commands:
Security off:
curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?[user.name=<USER>&
]op=..."
211
Proxy Users
A proxy user P can send a request on behalf of another user U. The username of U must be
specified in the doas query parameter unless a delegation token is presented in
authentication. In that case, the information of both users P and U must be encoded in the
delegation token. Below is the syntax to use when:
Security is off:
curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?[user.name=<USER>&]
doas=<USER>&op=..."
212
213
HTTP GET:     OPEN, GETFILESTATUS, LISTSTATUS, GETCONTENTSUMMARY,
              GETFILECHECKSUM, GETHOMEDIRECTORY, GETDELEGATIONTOKEN
HTTP PUT:     CREATE, MKDIRS, RENAME, SETREPLICATION, SETOWNER, SETPERMISSION,
              SETTIMES, RENEWDELEGATIONTOKEN, CANCELDELEGATIONTOKEN
HTTP POST:    APPEND
HTTP DELETE:  DELETE
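Create a File. The original create-file command is not reproduced above; based on the WebHDFS REST API, the first step is a PUT to the NameNode (a sketch; host, port, and path are placeholders), and the NameNode replies with the redirect shown below:
curl -i -X PUT
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE[&overwrite=<true|false>][&permission=<OCTAL>]"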
Location:
http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
Content-Length: 0
214
Append to a File:
curl -i -X POST
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=APPEND[&buffersi
ze=<INT>]"
Location:
http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
Content-Length: 0
Make a Directory:
curl -i -X PUT
"http://<HOST>:<PORT>/<PATH>?op=MKDIRS[&permission=<OCTAL>]
Rename a File/Directory:
curl -i -X PUT
"<HOST>:<PORT>/webhdfs/v1/<PATH>?op=RENAME&destination=<PAT
H>
Delete a File/Directory:
curl -i -X DELETE
"http://<host>:<port>/webhdfs/v1/<path>?op=DELETE[&recursiv
e=<true|false>]
215
List a Directory:
curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS
Set Permission:
curl -i -X PUT
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETPERMISSION
[&permission=<OCTAL>]
Set Owner:
curl -i -X PUT
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETOWNER
[&owner=<USER>][&group=<GROUP>]
216
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETTIMES
[&modificationtime=<TIME>][&accesstime=<TIME>]
Exceptions are mapped to HTTP response codes:
IllegalArgumentException         400 Bad Request
UnsupportedOperationException    400 Bad Request
SecurityException                401 Unauthorized
IOException                      403 Forbidden
FileNotFoundException            404 Not Found
RuntimeException                 500 Internal Server Error
217
218
HttpFS is a full rewrite of Hadoop HDFS proxy. A key difference is HttpFS supports all file
system operations while Hadoop HDFS proxy supports only read operations.
HttpFS also supports:
219
220
Filename
Description
webhcat_server.sh
webhcat-default.xml
webhcat-site.xml
webhcat-log4j.properties
Description
templeton.port
templeton.hadoop.config.dir
templeton.jar
templeton.streaming.jar
templeton.hive.path
templeton.hive.properties
templeton.zookeeper.hosts
221
Definition                 Default
templeton.pig.archive      hdfs:///apps/webhcat/pig.tar.gz
templeton.pig.path         pig.tar.gz/pig/bin/pig
templeton.hive.archive     hdfs:///apps/webhcat/hive.tar.gz
templeton.hive.path        hive.tar.gz/hive/bin/hive
templeton.streaming.jar    hdfs:///apps/webhcat/hadoop-streaming.jar
222
Running WebHCat
Start the server:
$ /usr/lib/hcatalog/sbin/webhcat_server.sh start
Running WebHCat
Hadoop uses a LocalResource to keep Pig and Hive from having to be installed
everywhere on the cluster. The server will get a copy of the LocalResource when
needed.
223
Using WebHCat
The URL to access the REST API of WebHCat is:
http://hostname:port/templeton/v1/
Here is an example of running a MapReduce job:
# curl -s -d user.name=hadoop_user \
-d jar=wordcount.jar \
-d class=com.hortonworks.WordCount \
-d libjars=transform.jar \
-d arg=wordcount/input \
-d arg=wordcount/output \
'http://host:50111/templeton/v1/mapreduce/jar'
Using WebHCat
WebHCat can execute programs through the Knox Gateway. The URL for accessing the
REST API of WebHCat is: http://hostname:port/templeton/v1/.
Below is an example of WebHCat running a Java MapReduce job. This example
assumes the input and output directories have been setup as well as the inode being
created for the file.
$ curl -v -i -k -u <USERID>:<PASSWORD> -X POST \
-d jar=/dev/my-examples.jar -d class=wordcount \
-d arg=/dev/input -d arg=/dev/output \
'https://127.0.0.1:8443/gateway/sample/templeton/api/v1
/mapreduce/jar'
224
Unit 10 Review
1. WebHDFS supports HDFS _____________ and _______________ operations.
2. The _________________ parameter needs to be set to true to enable WebHDFS.
3. WebHDFS can use ________________ and _________________ for
authentication.
4. HttpFS is a _____________________ from the NameNode and must be configured.
225
1.2. You should see a 200 OK response, along with a JSON object containing the
files and directories in your /user/root folder:
HTTP/1.1 200 OK
Cache-Control: no-cache
Expires: Thu, 14 Nov 2013 14:35:42 GMT
Date: Thu, 14 Nov 2013 14:35:42 GMT
Pragma: no-cache
Expires: Thu, 14 Nov 2013 14:35:42 GMT
Date: Thu, 14 Nov 2013 14:35:42 GMT
Pragma: no-cache
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
{"FileStatuses":{"FileStatus":[
{"accessTime":0,"blockSize":0,"childrenNum":0,"fileId":1732
1,"group":"hadoop","length":0,"modificationTime":1384408800
226
076,"owner":"root","pathSuffix":".Trash","permission":"700"
,"replication":0,"type":"DIRECTORY"},
{"accessTime":1384219125588,"blockSize":134217728,"children
Num":0,"fileId":17331,"group":"hadoop","length":861,"modifi
cationTime":1384219125967,"owner":"root","pathSuffix":"cons
titution.txt","permission":"644","replication":3,"type":"FI
LE"},
...
]}}
3.2. Use the temporary redirect URL that the NameNode provides in the response
above to submit the file to the DataNode. For example, the command shown here
puts the file onto node4, but you should copy-and-paste the URL from the
response of the previous step:
227
44841 history/constitution.txt
4.3. Using the URL provided by the previous command, upload test_data into
HDFS, and then verify the upload worked successfully.
Step 5: Append to an Existing File
5.1. Appending a file is similar to creating a file - it is a two-step process. Using
WebHDFS, append the local file constitution.txt to big.txt in HDFS.
228
SOLUTION to 6.2:
curl -i -L
"http://node1:50070/webhdfs/v1/user/root/big.txt?op=OPEN&of
fset=1000000&length=1048576" > big_partial.txt
229
Introduction to Hive
Hive Components
Hive MetaStore
HiveServer2
Performing Queries
ORCFile Example
Hive Tables
ORCFile Example
Compression
Hive Security
230
Introduction to Hive
Hive queries are capable of data summarization, ad-hoc querying and analytics of large
volumes of data. Hive is scalable to 100PB+. Apache Hive is the gateway for business
intelligence and visualization tools integrated with Apache Hadoop. Hive supports
databases, tables, SQL language and other foundational constructs for analyzing data.
Hive takes the SQL code, processes it, and converts it into a MapReduce program. The
MapReduce program runs in the YARN framework and generates the results.
Additional Hive capabilities and features:
231
SerDes map JSON, XML and other formats natively into Hive.
232
Hive MetaStore
The Hive MetaStore contains all the metadata definitions for Hive tables
and partitions
The metastore can be local or remote
[Diagram: Local Metastore - the HiveServer2 driver and metastore run together and connect to a local RDBMS datastore. Remote Metastore - the HiveServer2 driver connects to a separate metastore service, which connects to a remote RDBMS datastore.]
Hive MetaStore
The Hive metastore stores table definitions and related metadata information. Hive
uses an Object Relational Mapper (ORM) to access relational databases. Valid Hive
metastore databases are: MySQL, PostgreSQL, Oracle and Derby. An embedded
metastore is available, but it should only be used for unit testing.
Below is an example of setting up a local metastore using MySQL as the metastore
repository:
Property                                 Value
javax.jdo.option.ConnectionURL           jdbc:mysql://<HOSTNAME>/<DBNAME>?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName    com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName      <MYSQL_USER>
javax.jdo.option.ConnectionPassword      <MYSQL_PASSWORD>
hive.metastore.local                     true
233
hive.metastore.warehouse.dir             <DEFINE_PATH_HIVETABLES>
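As a sketch, the connection properties above would appear in hive-site.xml roughly as follows (hostname, database name, and credentials are placeholders):
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://<HOSTNAME>/<DBNAME>?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value><MYSQL_USER></value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value><MYSQL_PASSWORD></value>
</property>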
With a remote metastore setup, a Hive client needs to connect to a metastore server
that then communicates to the remote datastore (RDBMS) using the Thrift protocol.
Thrift is an Interface Definition Language (IDL) that defines the specification for the
interface to a software component. Thrift uses Remote Procedure Calls (RPCs) for the
communication between two service endpoints.
234
HiveServer2
HiveServer2 is a server interface that allows JDBC/ODBC remote clients
to run queries and retrieve the results.
[Diagram: Hive SQL clients (CLI, JDBC/ODBC, Web UI) connect to HiveServer2, which uses an RDBMS datastore for metadata and runs queries as Mappers and Reducers on the DataNodes.]
HiveServer2
HiveServer2 (HS2) is a gateway / JDBC / ODBC endpoint Hive clients can talk to. ODBC
allows Excel and just about any BI tool to use Hive to access Hadoop data.
Configuration parameters for the HiveServer2 are set in the hive-site.xml file.
HiveServer2 supports no authentication (Anonymous), Kerberos, LDAP and custom
authentication. Authentication mode is defined with the hive.server2.authentication
parameter (NONE, KERBEROS, LDAP and CUSTOM). NONE is the default value.
HiveServer2 executes a query as the user who started the query by default
(hive.server2.enable.doAs=true). If this parameter is set to false, the query will run as
the same user the HiveServer2 process runs as.
There are multiple ways to start the HiveServer2:
$ $HIVE_HOME/bin/hiveserver2
$ $HIVE_HOME/bin/hive --service hiveserver2
235
The -e option can be used to execute a query from the Linux command line. The -S option
runs Hive in silent mode.
$ hive -S -e "select * FROM mycooltab" > /tmp/mytabout
236
Hadoop dfs commands can be run from the Hive CLI. Dfs commands can be run
without typing hadoop first.
hive> dfs -ls /user;
Beeline connects to the Hive Server2 instance. Hive clients connect to the HiveServer
instance.
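For example, a Beeline connection to HiveServer2 typically looks like the following sketch (the host and user are placeholders; 10000 is the usual HiveServer2 port, assumed here):
$ beeline -u jdbc:hive2://<HS2_HOST>:10000 -n <USER> -e "show tables;"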
237
[Diagram: Hive SQL is submitted from the CLI, JDBC/ODBC, or Web UI to HiveServer2; the Hive compiler, optimizer, and executor translate it into MapReduce, and the Mappers and Reducers run on the DataNodes in the Hadoop cluster.]
238
Partitions: Can physically separate table data into separate data units.
239
A Hive CREATE TABLE command can create a Hive and HBase table as well as create a
Hive table that points to an existing HBase table. Hive tables can also point to other
NoSQL database tables.
CREATE EXTERNAL TABLE myhtab(id INT, name STRING) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
Delimited Text: Excellent for sharing among Pig, Hive, and Linux tools (awk, perl,
python, etc.). Binary file formats are more efficient.
ORCFile
Hive uses SerDes to read and write from tables. The SerDe determines the format in
which the records are serialized and deserialized. You can write your own custom SerDe,
or use one of the built-in ones which include:
240
Avro: Easily converts Avro schema and data types into Hive schema and data
types. Avro understands compression.
Regular Expression
ORC
Thrift
NOTE: Accumulo is not part of the HDP distribution yet, but it is supported
by Hortonworks.
241
Hive Tables
Data stored in HDFS is schema-on-read, meaning Hive does not control the data
integrity when it is written. For Hive Managed tables, the table name is the name Hive
will assign to the directory in HDFS. For external tables, the files can be in any folder in
HDFS.
If you drop an external table, it will keep the data in its defined directory. With a Hive
Managed table, if you drop the table, then the data is deleted.
Multiple schemas can be connected to a single directory.
242
Hive supports the following data types: TINYINT, SMALLINT, INT, BIGINT,
BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP, VARCHAR and DATE.
Hive also has four complex data types: ARRAY, MAP, STRUCT and UNIONTYPE.
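As an illustration (a hypothetical table, not part of the course data), a CREATE TABLE statement can mix simple and complex types:
CREATE TABLE employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, zip:INT>
);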
243
An external table is just like a Hive-managed table, except that when the table is
dropped, Hive will not delete the underlying /apps/hive/warehouse/salaries folder.
In the table above, the table data for salaries will be whatever is in the
/user/train/salaries directory.
244
For Hive-managed tables, the data is moved into a special Hive subfolder of
/apps/hive/warehouse.
For external tables, the data is moved to the folder specified by the LOCATION
clause in the table's definition.
The LOAD DATA command can load files from the local file system (using the LOCAL
qualifier) or files already in HDFS. For example, the following command loads a local file
into a table named customers:
LOAD DATA LOCAL INPATH '/tmp/customers.csv' OVERWRITE INTO
TABLE customers;
The OVERWRITE option deletes any existing data in the table and replaces it with
the new data. If you want to append data to the table's existing contents, simply
leave off the OVERWRITE keyword.
245
If the data is already in HDFS, then leave off the LOCAL keyword:
LOAD DATA INPATH '/user/train/customers.csv' OVERWRITE INTO
TABLE customers;
In either case above, the file customers.csv is moved either into HDFS in a subfolder of
/apps/hive/warehouse or to the table's LOCATION folder, and the contents of
customers.csv are now associated with the customers table.
You can also insert data into a Hive table that is the result of a query, which is a
common technique in Hive. An example of the syntax is below:
INSERT INTO birthdays SELECT firstName, lastName, birthday
FROM customers WHERE birthday IS NOT NULL;
The birthdays table will contain all customers whose birthday column is not null.
246
Performing Queries
Let's take a look at some sample queries to demonstrate what HiveQL looks like. The
following SELECT statement selects all records from the customers table:
SELECT * FROM customers;
You can use the familiar WHERE clause to specify which rows to select from a table:
FROM customers SELECT firstName, lastName, address, zip
WHERE orderID > 0 GROUP BY zip;
NOTE: The FROM clause in Hive can appear before or after the SELECT
clause.
One benefit of Hive is its ability to join data in a simple fashion. The JOIN command in
HiveQL is similar to its SQL counterpart. For example, the following statement performs
an inner join on two tables:
SELECT customers.*, orders.* FROM customers JOIN orders ON
(customers.customerID = orders.customerID);
In the SELECT above, a row is returned only for customers that have matching orders; an
outer join would be needed to also return customers without any orders.
247
A best practice is to divide data among different files that can be pruned out,
which is accomplished by using partitions, buckets and skewed tables (see the sketch after this list).
Sort data ahead of time. Sorting data ahead of time simplifies joins and skipping
becomes more effective.
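For example, a partitioned table keeps each partition in its own HDFS subfolder, so queries that filter on the partition column read only the matching folders. A sketch using a hypothetical web_logs table partitioned by date:
CREATE TABLE web_logs (
  ip STRING,
  url STRING,
  referrer STRING
)
PARTITIONED BY (log_date STRING)
STORED AS ORC;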
248
4 Stages
Stage Details
249
[Diagram: Hive on MapReduce vs. Hive on Tez for the same query (SELECT a.state, b.id, c.price; JOIN(a, b); JOIN(a, c); GROUP BY a.state with COUNT(*) and AVERAGE(c.price)). The MapReduce plan runs several map/reduce stages and writes intermediate results to HDFS between them; the Tez plan runs the work as a single DAG, so Tez avoids unneeded writes to HDFS.]
250
ORCFile Example
sale
id      timestamp            productsk  storesk  amount   state
10000   2013-06-13T09:03:05  16775      670      $70.50   CA
10001   2013-06-13T09:03:05  10739      359      $52.99   IL
10002   2013-06-13T09:03:06  4671       606      $67.12   MA
10003   2013-06-13T09:03:08  7224       174      $96.85   CA
10004   2013-06-13T09:03:12  9354       123      $67.76   CA
10005   2013-06-13T09:03:18  1192       497      $25.73   IL
ORCFile Example
The Optimized Row Columnar (ORC) file format provides a highly efficient way to store
Hive data. File formats in Hive are specified at the table level using the STORED AS clause. For
example:
CREATE TABLE tablename (
...
) STORED AS ORC;
You can also specify ORC as the default file format of new tables:
SET hive.default.fileformat=Orc
The ORC file format is a part of the Stinger Initiative to improve the performance of Hive
queries, and using ORC files can greatly improve the execution time of your Hive
queries.
251
Compression
Hive queries will usually become I/O bound before they become CPU bound. Reducing
the amount of data to be read by using compression can improve performance.
Different compression codecs include: Snappy, LZO, Gzip, BZip2, etc.
Get a listing of the compression codes available in your environment. Compression
options can also be defined in the Hive CLI.
$ hive -e "set io.compression.codecs"
hive> set mapred.output.compression.type=BLOCK;
hive> set mapred.map.output.compression.codec
=org.apache.hadoop.io.compress.GzipCodec;
hive> set hive.exec.compress.intermediate=true;
hive> set hive.exec.compress.output=true;
252
Example:
<property>
<name>hive.exec.compress.intermediate</name>
<value>true</value>
</property>
Specify that the output of the Reducer(s) should be compressed with the
hive.exec.compress.output parameter.
<property>
<name>hive.exec.compress.output</name>
<value>true</value>
</property>
253
Hive Security
Usernames can be defined when executing commands. You can specify user.name in a
GET :table command:
$ curl -s
'http://localhost:50111/templeton/v1/ddl/database/default/t
able/my_table?user.name=cole'
254
Unit 11 Review
1. The Hive component for storing schema and metadata information is
___________________.
4. True or False: Tez improves the performance of any MapReduce job, not just
Hive queries.
255
1.2. Notice there are 5 part-m-0000x files, which are the result of a MapReduce
job that formatted the data for use with Hive. View the contents of one of these
files:
# more part-m-00000
Notice the data consists of information about visitors to the White House,
including the name, date, person being visited, and a comment section.
Step 2: Define a Hive Table
2.1. In the data folder, there is a text file named wh_visits.hive. View its
contents. Notice it defines a Hive table named wh_visits with a schema that
matches the data in the part-m-0000x files:
# more wh_visits.hive
create table wh_visits (
lname string,
fname string,
time_of_arrival string,
256
appt_scheduled_time string,
meeting_location string,
info_comment string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' ;
2.3. If successful, you should see OK in the output along with the time it took to
run the query.
Step 3: Verify the Table Creation
3.1. Start the Hive Shell:
# hive
hive>
3.2. From the hive> prompt, enter the show tables command:
hive> show tables;
257
Notice there is a folder named wh_visits. When did this folder get created?
_________________________________________________________________
4.3. List the contents of the wh_visits folder:
# hadoop fs -ls /apps/hive/warehouse/wh_visits
This time, you should see a couple thousand rows of data. Notice that by simply
putting a file into the wh_visits folder, the table now contains data.
5.3. Notice no MapReduce job was executed to perform the select * query. Why
not? ___________________________________________________________
Step 6: Drop the Table
258
6.1. Run the following query, which drops the wh_visits table:
hive> drop table wh_visits;
6.2. Exit the Hive shell and view the contents of the Hive warehouse folder:
# hadoop fs -ls /apps/hive/warehouse/
Notice that not only has the part-m-00000 file been deleted, but also the
wh_visits folder no longer exists!
Step 7: Create the Table Again
7.1. Run wh_visits.hive again to recreate the wh_visits table:
# hive -f wh_visits.hive
259
8.6. Try the following query. Make sure the output looks like first names:
hive> select fname from wh_visits limit 20;
Notice the folder is empty. The LOAD DATA command moved the files from their
original HDFS folder into the Hive warehouse folder; it did not copy them.
IMPORTANT: Be careful when you drop a managed table in Hive. Make sure
you either have the data backed up somewhere else, or that you no longer
want the data.
# more external_table.hive
create external table wh_visits (
lname string,
fname string,
time_of_arrival string,
appt_scheduled_time string,
meeting_location string,
info_comment string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/root/whitehouse/' ;
11.3. Create the whitehouse folder in HDFS again, and put the five part-m files
into whitehouse.
11.4. Verify that there is not a subfolder of /apps/hive/warehouse named
wh_visits.
11.5. Run the query in external_table.hive to create the wh_visits table:
# hive -f external_table.hive
11.6. Run a query on wh_visits to verify that the table does actually contain
records.
11.7. Drop wh_visits again, but this time notice that the files in the whitehouse
folder are not deleted.
RESULT: As you just verified, the data for external tables is not deleted when the
corresponding table is dropped. Aside from this behavior, managed tables and external
tables in Hive are essentially the same.
261
Overview of Sqoop
Importing a Table
Exporting a Table
262
Overview of Sqoop
[Diagram: (1) the client executes a sqoop command; Map tasks in the Hadoop cluster move data between HDFS and relational databases, enterprise data warehouse systems, and document-based systems.]
Overview of Sqoop
Sqoop is a tool designed to transfer data between Hadoop and external structured
datastores like RDBMS and data warehouses. Using Sqoop, you can provision the data
from an external system into HDFS. Sqoop uses a connector-based architecture that
supports plugins that provide connectivity to additional external systems.
As you can see in the slide, Sqoop uses MapReduce to distribute its work across the
Hadoop cluster:
1. A Sqoop job gets executed using the sqoop command line.
2. Sqoop uses Map tasks (4 by default) to execute the command.
3. Plugins are used to communicate with the outside data source. The schema is
provided by the data source, and Sqoop generates and executes SQL statements
using JDBC or other connectors.
263
Teradata
MySQL
Netezza
264
Sqoop will read the table row-by-row into HDFS. The output of this import
process is a set of files containing a copy of the imported table.
The import process is performed in parallel. For this reason, the output will be in
multiple files.
These files may be delimited text files (for example, with commas or tabs
separating each field), or binary Avro or SequenceFiles containing serialized
record data.
265
Credentials can be included in the connect string, or passed using the --username and --password arguments
Must specify either a table to import using --table, or the result of a SQL query
using --query
266
Importing a Table
sqoop import
--connect jdbc:mysql://host/nyse
--table StockPrices
--target-dir /data/stockprice/
--as-textfile
Importing a Table
The following Sqoop command imports a database table named StockPrices into a
folder in HDFS named /data/stockprice:
sqoop import
--connect jdbc:mysql://host/nyse
--table StockPrices
--target-dir /data/stockprice/
--as-textfile
The connect string in this example is for MySQL. The database name is nyse.
The --table argument is the name of the table in the NYSE database.
The default number of map tasks for Sqoop is 4, so the result of this import will
be in 4 files.
267
NOTE: You can use --as-avrodatafile to import the data to Avro files, and use
--as-sequencefile to import the data to sequence files.
--split-by: the column used to determine how the data is split between mappers.
If you do not specify a split-by column, then the primary key column is used.
--query: use instead of --table, the imported data is the resulting records from
the given SQL query.
NOTE: The import command shown here looks like it entered over multiple
lines, but you have to enter this entire Sqoop command on a single
command line.
Which column will Sqoop use to split the data up between the mappers?
____________________________
269
Only rows whose Volume column is greater than 1,000,000 will be imported.
The $CONDITIONS token must appear somewhere in the WHERE clause of your
SQL query. Sqoop replaces this token with the criteria it generates to partition the
data so that it can be split between mappers.
If you use --query, then you must also specify a --split-by column or the Sqoop
command will fail to execute. A hedged example follows.
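For example, a free-form query import might look like the following sketch (the connect
string, query, and split-by column are illustrative, and the command is entered on a
single line):
sqoop import
--connect jdbc:mysql://host/nyse
--query "SELECT * FROM StockPrices WHERE Volume > 1000000 AND \$CONDITIONS"
--split-by StockSymbol
--target-dir /data/highvolume/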
270
271
272
--table: the table to populate in the database. This table must already exist in the
database. If no --update-key is defined, then the command is executed in Insert
Mode.
--update-key: the primary key column for supporting updates. If you define this
argument, the Update Mode is used and existing rows are updated with the
exported data.
--call: invokes a stored procedure for every record, thereby using Call Mode. If
you define --call, then do not define the --table argument or an error will occur.
--update-mode: Specifies how updates are performed when new rows are found
with non-matching keys in the database. Values are updateonly (the default) and
allowinsert. A sketch combining these export arguments follows.
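As a hedged sketch (the connect string, table, export directory, and key column are
illustrative; enter the command on a single line), an export that updates existing rows
by id and inserts rows with new keys might look like:
sqoop export
--connect jdbc:mysql://host/nyse
--table StockPrices
--export-dir /data/stockprice/
--update-key id
--update-mode allowinsert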
273
Exporting to a Table
sqoop export
--connect jdbc:mysql://host/nyse
--table LogData
--export-dir /data/logfiles/
--input-fields-terminated-by "\t"
Exporting to a Table
The following Sqoop command exports the data in the /data/logfiles/ folder in HDFS to
a table named LogData:
sqoop export
--connect jdbc:mysql://host/nyse
--table LogData
--export-dir /data/logfiles/
--input-fields-terminated-by "\t"
The column values are determined by the delimiter, which is a tab in this
example.
Sqoop will perform this job using 4 mappers, but you can specify the number to
use with the -m argument.
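For example (a sketch; the mapper count is illustrative), the same export could be run
with eight mappers by entering the command on a single line:
sqoop export --connect jdbc:mysql://host/nyse --table LogData --export-dir /data/logfiles/ --input-fields-terminated-by "\t" -m 8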
274
Unit 12 Review
1. What is the default number of map tasks for a Sqoop job? _____________
2. How do you specify a different number of mappers in a Sqoop job?
_________________________________________________
3. What is the purpose of the $CONDITIONS value in the WHERE clause of a Sqoop
query?
__________________________________________________________________
275
The comma-separated fields represent a gender, age, salary and zip code.
2.3. Notice there is a salaries.sql script that defines a new table in MySQL named
salaries. For this script to work, you need to copy salaries.txt into the publicly-available /tmp folder:
276
# cp salaries.txt /tmp
2.4. Now run the salaries.sql script using the following command:
# mysql test < salaries.sql
3.2. Switch to the test database, which is where the salaries table was created:
mysql> use test;
3.3. Run the show tables command and verify salaries is defined:
mysql> show tables;
+----------------+
| Tables_in_test |
+----------------+
| salaries
|
+----------------+
1 row in set (0.00 sec)
277
4.1. Enter the following command at the mysql prompt to grant access to node2
and node3 to connect to the mysql-server running on node1:
grant all privileges on *.* to 'root'@'%' with grant option;
5.2. A MapReduce job should start executing, and it may take a couple minutes for
the job to complete.
Step 6: Verify the Import
6.1. View the contents of the salaries folder:
# hadoop fs -ls salaries
6.2. You should see a new folder named salaries. View its contents:
# hadoop fs -ls salaries
Found 4 items
-rw-r--r--   1 root hdfs        272   part-m-00000
-rw-r--r--   1 root hdfs        241   part-m-00001
-rw-r--r--   1 root hdfs        238   part-m-00002
-rw-r--r--   1 root hdfs        272   part-m-00003
6.3. Notice there are four new files in the salaries folder named part-m-0000x.
Why are there four of these files?
__________________________________________________________________
6.4. Use the cat command to view the contents of the files. For example:
278
Notice the contents of these files are the rows from the salaries table in MySQL.
You have now successfully imported data from a MySQL database into HDFS.
Notice you imported the entire table with all of its columns. In the next step, you
will import only specific columns of a table.
Step 7: Specify Columns to Import
7.1. Using the --columns argument, write a Sqoop command that imports the
salary and age columns (in that order) of the salaries table into a directory in
HDFS named salaries2. In addition, set the -m argument to 1 so that the result is a
single file.
7.2. After the import, verify you only have one part-m file in salaries2:
# hadoop fs -ls salaries2
Found 1 items
-rw-r--r--   1 root hdfs        482   salaries2/part-m-00000
7.3. Verify the contents of part-m-00000 are only the 2 columns you specified:
# hadoop fs -cat salaries2/part-m-00000
TIP: The Sqoop command will look similar to the ones you have been using
throughout this lab, except you will use --query instead of --table. Recall
279
that when you use a --query command you must also define a --split-by
column, or define -m to be 1.
Also, do not forget to add $CONDITIONS to the WHERE clause of your query,
as demonstrated earlier in this Unit.
8.2. To verify the result, view the contents of the files in salaries3. You should
have only two output files.
8.3. View the contents of part-m-00000 and part-m-00001. Notice one file
contains females, and the other file contains males. Why? ______________
______________________________________________________________
8.4. Verify the output files contain only records whose salary is greater than
90,000.00.
Step 9: Put the Export Data into HDFS
9.1. Now let's export data from HDFS to the database. Start by viewing the
contents of the data, which is in a file named salarydata.txt:
# tail salarydata.txt
M,49,29000,95103
M,44,34000,95102
M,99,25000,94041
F,93,96000,95105
F,75,9000,94040
F,14,0,95102
M,68,1000,94040
F,45,78000,94041
M,40,6000,95103
F,82,5000,95050
Notice the records in this file contain 4 values separated by commas, and the
values represent a gender, age, salary and zip code, respectively.
9.2. Create a new directory in HDFS named salarydata.
9.3. Put salarydata.txt into the salarydata directory in HDFS.
Step 10: Create a Table in the Database
10.1. There is a script in the /root/labs folder that creates a table in MySQL that
matches the records in salarydata.txt. View the SQL script:
280
# more salaries2.sql
281
| M      |    3 |      0 |   95101 |
| M      |   25 |  26000 |   94040 |
+--------+------+--------+---------+
RESULT: You have imported the data from MySQL to HDFS using the entire table,
specific columns, and also using the result of a query. You have also exported a folder of
data in HDFS into a table in MySQL.
SOLUTIONS:
Step 7.1 is the following command (entered on a single line):
# sqoop import --connect jdbc:mysql://node1/test
--table salaries
--columns salary,age
-m 1
--target-dir salaries2
--username root
Step 8.1:
sqoop import --connect jdbc:mysql://node1/test
--query "select * from salaries s where s.salary > 90000.00
and \$CONDITIONS"
--split-by gender
-m 2
--target-dir salaries3
--username root
282
Step 11
sqoop export
--connect jdbc:mysql://node1/test
--table salaries2
--export-dir salarydata
--input-fields-terminated-by ","
--username root
ANSWERS:
Step 6.3: The MapReduce job that executed the Sqoop command used four mappers, so
there are four output files (one from each mapper).
Step 8.3: You used gender as the split-by column, so all records with the same gender
are sent to the same mapper.
283
Flume Introduction
Installing Flume
Flume Events
Flume Sources
Flume Channels
Flume Sinks
Multiple Sinks
Flume Interceptors
Design Patterns
Flume Configuration
Monitoring Flume
284
Flume Introduction
A flume is an artificial channel or stream that uses
water to transport objects down the channel.
Apache Flume, a data ingestion tool, collects,
aggregates and directs data streams into Hadoop
using the same concepts. Flume works with
different data sources to process and send data to
defined destinations.
(Diagram: a Flume agent made up of a source, a channel, and a sink.)
Flume Introduction
A flume is an artificial channel or stream that uses water to transport objects
down the channel. Flumes were often used by the logging industry to move cut
wooden logs. Apache Flume transfers data from multiple sources into Hadoop via
events instead of wooden logs. It efficiently collects, aggregates, and moves large
amounts of streaming data.
Flume Components
Event: The individual unit of data (such as a log entry), made up of
header(s) and a byte-array body.
Source: Defines the type of data stream that is entering Flume. Sources may
be either active (constantly looking for data) or passive (waiting for data to be
passed to them).
Sink: Delivers the data to its destination. Each sink is defined based on the
destination it will be transferring data into. For example: HDFS, HBase, a local
file.
285
Channel: The conduit between the source and the sink (destination).
Flume Workflow
1. Client transmits event to a source.
2. Source receives event and delivers it to one or more channels.
3. The sink or sinks transfer the data from the channel to the final destination.
286
Installing Flume
Following are the system requirements for running Flume:
Memory: The Flume agent requires an appropriate amount of memory for all
components of the agent.
Disk Space: Flume agent needs permission to access sources and write to
destinations. Make sure channels have sufficient storage.
Although not required, it is recommended to set your time to UTC versus local
time.
NOTE: The Flume agent heap size can be set with JAVA_OPTS:
JAVA_OPTS="-Xms100m -Xmx200m"
287
Flume configuration files:
/etc/flume/conf/flume-conf.properties
/etc/flume/conf/flume-env.sh
/etc/flume/conf/log4j.properties (flume.log.dir=/var/log/flume)
288
Flume Events
An event can range from text to images. The key point about events is they need to be
generated from regular streaming data.
An Event is a single unit of data that can be transported by Flume NG (akin to messages
in JMS). Events are generally small (ranging from a few bytes to a few kilobytes) and are
commonly a single record from a larger dataset. Events are made up of headers
containing a key/value map, and a body storing an arbitrary byte array.
Clients generate data as a stream of events and run in a separate thread. The clients
send data to a source. A log4j appender sends events directly to Flume NG's source or
syslog daemon.
289
Flume Sources
A Flume source is the data stream from which Flume receives the data. The source can
be pollable or event-driven. A spooling directory source can be set up to look for new
files, and a suffix can be added to each file once all of its events have been transmitted.
Property                               Sample Value
agent.sources                          mychannel
agent.sources.channels                 mychannel
agent.sources.mychannel.type           spooldir
agent.sources.mychannel.spoolDir       /directorypath
agent.sources.mychannel.fileSuffix     .COMPLETE
290
Flume source types include: Avro Source, Exec Source, Thrift Source (an RPC source),
NetCat Source, SpoolDir Source, JMS Source, HTTP Source, Syslog Source, and
Custom Source.
291
Flume Channels
The channel is the conduit for events between a source and a sink. The channel dictates
the durability of event delivery between a source and a sink. An event stays in the
channel until the sink successfully sends the data to the defined destination. The source
and the sink run asynchronously in processing events in the channel. Channel
exceptions can be thrown if the ingest rate exceeds the channel's ability to handle that
rate.
292
File Channel: Writes and checkpoints files to disk. Slower but durable (see the sketch
below).
JDBC Channel: Events are stored in persistent storage backed by a database. Slower
but durable.
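As a hedged sketch (the agent, channel, and directory names are illustrative), a file
channel might be configured as follows:
agent.channels = ch1
agent.channels.ch1.type = file
agent.channels.ch1.checkpointDir = /var/flume/checkpoint
agent.channels.ch1.dataDirs = /var/flume/data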
293
Events can be batched as a transaction, and each transaction has a unique id. The
number of events that are processed together as a single transaction determines the
batch size. Each event in a transaction has a unique sequence number.
Batch size is also a durability trade-off: larger batches increase throughput, but more
events are in flight in a single transaction.
294
(Diagram: a single source within an agent feeding multiple channels, each drained by
its own sink.)
Multiplexing:
agent.sources.mychannel.selector.type = multiplexing
agent.sources.mychannel.selector.header = port
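A fuller multiplexing configuration (a sketch; the header values and channel names are
illustrative) maps header values to channels:
agent.sources.mychannel.selector.type = multiplexing
agent.sources.mychannel.selector.header = port
agent.sources.mychannel.selector.mapping.8888 = ch1
agent.sources.mychannel.selector.mapping.9999 = ch2
agent.sources.mychannel.selector.default = ch1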
295
Flume Sinks
Sinks receive events from channels and write them to HDFS or forward
them to another destination. Supported destinations are shown
below:
HDFS
Avro
Flume Sinks
A sink is the destination for the data stream in Flume. The sink receives events from a
channel and runs in a separate thread. Sinks can support text and sequence files when
writing to HDFS and both file types can be compressed. Below is a list of the different
types of sinks.
Sink types include: HDFS Sink, Logger Sink, Avro Sink,
296
Thrift Sink, IRC Sink, HBase Sink, Null Sink, and Custom Sink.
297
Multiple Sinks
A single sink is the default behavior.
Multiple sinks can provide:
Failover for Sinks.
Load balancing of Sinks.
(Diagram: an agent whose source and channel feed a sink processor that distributes
events across multiple sinks.)
Multiple Sinks
Sink Processors are a collection of multiple sinks and can be set up for load balancing
over multiple sinks or to achieve failover from one sink to another in case of failure.
298
There are different types of sink processors: Default, Failover, and Load Balancing.
299
Flume Interceptors
Interceptors are set with the interceptors property and have the ability to drop or
modify an event based on how the interceptor is coded. Flume supports chaining
multiple interceptors together and the order of definition sets the order they run in.
agent.sources.mychannel.interceptors = inter1 inter2 inter3
agent.sources.mychannel.interceptors = inter1
agent.sources.mychannel.interceptors.inter1.type = timestamp
agent.sources.mychannel.interceptors.inter1.preserveExisting
= true
300
301
Design Patterns
(Diagrams of three topologies:
Multi-Agent Flow: an agent's Avro sink sends events over Avro RPC to the Avro source
of a second agent.
Fan In (Consolidation): several agents, each with its own source, channel, and sink,
send their events into a single downstream agent.
Fan Out: one source writes to multiple channels, each drained by its own sink, for
example one sink delivering to HDFS.)
Design Patterns
Flume has the flexibility to create complex data workflows. Agents are able to have
multiple sources, channels and sinks. You can also connect multiple agents to each
other.
The Flume topology supports multiple design patterns. A few are shown above:
Multi-Agent Flow
Fan In
Fan Out
For any Flume agent, the source ingests data and sends it to the channel. There can be
multiple sources, channels and sinks in a Flume agent but each sink can only receive
data from a single channel.
302
303
Example formats:
<AgentName>.sources = <SourceName>
<AgentName>.sinks = <SinkName>
<AgentName>.channels = <Channel1> <Channel2>
<AgentName>.sources.<SourceName>.channels = <Channel1>
<Channel2> ... # set channel for source
<AgentName>.sinks.<SinkName>.channel = <Channel1>
# set channel for sink
<AgentName>.sources.<SourceName>.<someProperty> = <someValue>
# properties for sources
<AgentName>.channels.<ChannelName>.<someProperty> = <someValue>
# properties for channels
<AgentName>.sinks.<SinkName>.<someProperty> = <someValue>
# properties for sinks
To start a Flume agent, call the flume-ng shell script (located in the Flume bin
directory). The script sets the agent name, the configuration directory and the
configuration properties file.
$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties
304
305
# my.conf file
#Define source name
agent.sources = snet
#Define sink name
agent.sinks = sink1
#Define channel name
agent.channels = chmem
#Set the source
agent.sources.snet.type = netcat
agent.sources.snet.bind = localhost
agent.sources.snet.port = 44444
# Set the sink destination
agent.sinks.sink1.type = logger
#Set channel to type memory
agent.channels.chmem.type = memory
agent.channels.chmem.capacity = 1000
agent.channels.chmem.transactionCapacity = 100
#Set the source with the channel
agent.sources.snet.channels = chmem
#Set the sink with the channel
agent.sinks.sink1.channel = chmem
306
307
Flume Configuration
# A single-node Flume configuration
# Name the components on this agent
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channelA
# Describe/configure source1
agent1.sources.source1.type = netcat
agent1.sources.source1.bind = localhost
agent1.sources.source1.port = 44444
# Describe sink1
agent1.sinks.sink1.type = logger
# Use a channel which buffers events in memory
agent1.channels.channelA.type = memory
agent1.channels.channelA.capacity = 1000
agent1.channels.channelA.transactionCapacity = 100
# Bind the source and sink to the channel
agent1.sources.source1.channels = channelA
agent1.sinks.sink1.channel = channelA
Flume Configuration
The property "type" needs to be set for each component for Flume to understand what
kind of object it needs to be. Each source, sink and channel type has its own set of
properties required for it to function as intended. All those need to be set as needed. In
the previous example, we have a flow from avro-AppSrv-source to hdfs-Cluster1-sink
through the memory channel mem-channel-1.
308
309
Monitoring Flume
Flume monitoring options can be set in /etc/flume/conf/flume-env.sh (JAVA_OPTS) for
the following:
JMX monitoring
JAVA_OPTS="-Dcom.sun.management.jmxremoteDcom.sun.management.jmxremote.port=4159
-Dcom.sun.management.jmxremote.authenticate=false Dcom.sun.management.jmxremote.ssl=false
310
Nagios: Nagios can be configured to watch the Flume agents. Monitoring for
cpu, memory and disk resources consumed by Flume should be standard. Look
at the Nagios JMX plugin to monitor performance.
Unit 13 Review
1. The basic unit of data for Flume is an ____________________ .
2. Sources can be polled or _________________________.
3. A channel selector can be replicating, multiplexing or _____________________.
4. The Flume component ____________________ allows inspection and
transformation of the data as it flows through the stream.
311
8.2. Verify Flume is installed by viewing the usage of the flume-ng command:
# flume-ng
9.3. Notice the name of the agent defined in this file is called logagent.
9.4. The source of logagent is source1. Based on the source1 configuration, where
is the data coming from for this Flume agent?
__________________________________________________
312
10.2. Start logagent using the following command (all on a single line):
flume-ng agent -n logagent -f logagent.conf
-Dflume.log.dir=/var/log/flume/
-Dflume.log.file=logagent.log &
10.3. View the output of the command. Make sure sink1 and source1 started:
INFO sink.RollingFileSink: RollingFileSink sink1 started.
INFO instrumentation.MonitoredCounterGroup: Monitoried
counter group for type: SOURCE, name: source1, registered
successfully.
INFO instrumentation.MonitoredCounterGroup: Component type:
SOURCE, name: source1 started
INFO source.AvroSource: Avro source source1 started.
11.2. Change directories to ~/labs/flume and view the contents of test.log. This
will be the data that you send to the source of logagent.
313
11.3. From the ~/labs/flume folder, run the following command (all on a single
line) which takes the contents of test.log and writes it in the Avro format to port
8888 on node1:
# flume-ng avro-client -H node1 -p 8888 -C
/usr/lib/flume/lib/flume-ng-core-1.4.0.2.0.6.0-76.jar -F
test.log
11.4. Wait for this task to execute. When complete, view the contents of
flumedata in HDFS, which should now contain a new file:
# hadoop fs -ls flumedata
Found 1 items
-rw-r--r--   3 root root        739   flumedata/FlumeData.1384193670669
11.5. View the contents of the file in HDFS. It should match the content from
test.log:
# hadoop fs -cat flumedata/FlumeData.1384193670669
12.2. To kill a Flume agent, simply issue the kill command on the process:
# kill pid
RESULT: You just ran a Flume agent that reads data from a network connection and
streams it into a folder in HDFS.
ANSWERS:
2.4: The source of logagent is a network connection on port 8888 of node1.
2.5: The channel is an in-memory channel of size 100.
2.6: The sink is the /user/root/flumedata folder in HDFS.
314
Oozie Overview
Oozie Components
Oozie Console
Interfaces to Oozie
Oozie Scripts
Oozie Actions
Oozie Metrics
315
Oozie Overview
A workflow is a sequence of actions scheduled for execution. Oozie is the workflow
scheduler for Hadoop that runs as a service on the cluster. Clients submit workflow
definitions for immediate or scheduled execution. Oozie is tightly integrated with
Hadoop.
Oozie actions may include:
Streaming
MapReduce
Pig
Hive
Distcp
Sqoop jobs
316
Oozie Components
(Diagram: the Oozie server JVM runs the Coordinator Engine and the Workflow Engine,
which runs workflows; a database stores workflow definitions and state information;
the Oozie console connects to the server.)
Oozie Components
Oozie is a Java web application that runs in a Java servlet container (Tomcat). Oozie
uses a database to store the workflow definitions, the state of current workflow
instances, and instance variables.
Two main components are the Oozie server and the Oozie client. The server is the
engine that runs the workflows, and the Oozie client launches jobs and communicates
with the Oozie server.
Oozie's metadata database contains the workflow definitions and the current status of
workflow instances, including their states and variables.
317
(Diagram: a workflow transitions from a start node to an action; on OK the flow
proceeds to the end node, and on error or fail/kill it transitions to a kill node.)
When an HDFS URI is defined as a data set, Oozie will perform an availability check.
When data dependencies are met, the coordinator's workflow is triggered. Oozie
coordinators also support triggers that run when HCatalog table partitions are available,
and workflow actions can read data from the partitions. (HCatalog provides abstract
table definitions for the underlying data storage.)
A Directed Acyclic Graph (DAG) is a collection of vertices (nodes, or actions) and
directed edges that connect the vertices in an order (a directed graph), so there is an
end and the DAG does not circle back to the start.
The Oozie workflows are defined in an XML process definition language called hPDL. The
XML documents contain the workflow, made up of start, end and fail nodes, as well as
control nodes such as decision, fork and join nodes. A minimal skeleton appears after
the list below.
Workflow actions:
All workflows must have one start and one end node.
If the workflow fails, it transitions to a kill node. The workflow reports the error
message specified in the message element in the workflow definition.
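As a minimal sketch of the hPDL structure (the schema version, action, and node names
are illustrative):
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="first-action"/>
    <action name="first-action">
        <fs>
            <mkdir path="${nameNode}/user/root/wf-output"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>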
319
(Diagram: a workflow with start, action, fork, join, and end nodes; failed actions
transition to kill nodes. An action may also invoke an Oozie sub-workflow.)
Most actions have to wait until the previous action completes. Callbacks and polling are
used by Oozie to stay in communication with the defined processing.
Computation/processing tasks triggered by an action node are executed by the
MapReduce framework. Most operations are executed asynchronously; however, file
system operations are executed synchronously.
320
Oozie Actions
Shell Action: Oozie will wait for the shell command to complete before going to the next
action. The standard output of the shell command can be used to make decisions.
Pig, Hive and MapReduce Actions: For executing Pig and Hive scripts and Java
MapReduce jobs.
Sqoop Action: Oozie will wait for the Sqoop command to complete before going to the
next action.
Ssh Action: Runs a secure shell command on a remote machine. The workflow
will wait for the ssh command to complete. The command is executed in the home
directory of the defined user on the remote host.
Custom Action: Custom actions can be set up to run synchronously or asynchronously.
321
Email Actions: Sent synchronously, an email must contain an address, a subject and a
body. Here is an example of setting the properties for an email action. Examples of
other Oozie actions can be found in the documentation.
<workflow-app name="sample-wf"
xmlns="uri:oozie:workflow:0.1">
...
<action name="an-email">
<email xmlns="uri:oozie:email-action:0.1">
<to>bigkahuna@hwxs.com</to>
<subject>Email notifications for
${wf:id()}</subject>
<body>My cool workflow ${wf:id()} successfully
completed.</body>
</email>
<ok to="mycooljob"/>
<error to="errorcleanup"/>
</action>
...
</workflow-app>
322
(Diagram: the Oozie server asks the ResourceManager to execute the action by running
a MapReduce launcher task.)
NOTE: If the Oozie job consists of multiple actions, then a new Launcher
MapReduce job is executed for each distinct action in the workflow.
323
(Diagram: the Oozie server, running the Coordinator Engine and Workflow Engine,
submits jobs to the cluster and stores its metadata in a database.)
Databases supported: Derby (default), MySQL, Oracle, PostgreSQL, and HSQL.
Many organizations use their enterprise scheduler to call Oozie workflows. You may
also use the REST API to call workflows.
Yahoo runs over 700 workflows. They are organized into coordinators and
bundled together.
324
Oozie Console
Oozie Console
The Oozie Web Console provides a UI for viewing and monitoring your Oozie jobs. You
will use the Console in the upcoming lab.
325
Interfaces to Oozie
The Oozie Web Services (WS) API is an HTTP REST JSON API.
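For example (a sketch; the host and port assume an Oozie server listening on its default
port 11000 on node1), the server status can be checked with either the CLI or the REST
API:
# oozie admin -oozie http://node1:11000/oozie -status
# curl http://node1:11000/oozie/v1/admin/status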
326
327
Here are the primary Oozie environment variables, which are configured in oozie-env.sh:
Variable Name
Description
CATALINA_OPTS
OOZIE_CONFIG_FILE
OOZIE_LOGS
OOZIE_LOG4J_FILE
OOZIE_LOG4J_RELOAD
OOZIE_HTTP_PORT
OOZIE_ADMIN_PORT
OOZIE_HTTP_HOSTNAME
OOZIE_BASE_URL
OOZIE_CHECK_OWNER
328
Description
OOZIE_HTTPS_PORT
OOZIE_HTTPS_KEYSTORE_FILE
OOZIE_HTTPS_KEYSTORE_PASS
329
Oozie Scripts
Run the oozie-setup.sh script to manually configure Oozie with all the components
added to the libext/ directory.
$ bin/oozie-setup.sh prepare-war [-d directory] [-secure]
sharelib create -fs <FS_URI> [-locallib <PATH>]
sharelib upgrade -fs <FS_URI> [-locallib <PATH>]
db create|upgrade|postupgrade -run [-sqlfile <FILE>]
330
Examples:
Command
Description
bin/oozied.sh start
bin/oozied.sh run
331
332
333
334
335
Behind the scenes, a workflow.xml file is generated dynamically that contains a single
action. The action will be the script specified on the command line, and the job will be
created and executed right away.
336
Unit 14 Review
1. There are three types of Oozie jobs. They are _______________________ ,
___________________________and _______________________ jobs.
2. An Oozie __________________ provides a way to package multiple coordinator
and workflow jobs.
3. List three types of Oozie actions: ______________________________________
4. Set Oozie logging information in the ____________________________ file.
337
1.2. Unzip the archive in the oozielab folder, which contains a file named
whitehouse_visits.txt that is quite large:
# unzip whitehouse_visits.zip
This publicly available data contains records of visitors to the White House in
Washington, D.C.
Step 2: Load the Data into HDFS
2.1. Make a new directory in HDFS named whitehouse. (If you already have a
whitehouse folder in HDFS, delete it first):
# hadoop fs -rm -R whitehouse
# hadoop fs -mkdir whitehouse
338
2.2. Use the put command to copy the whitehouse_visits.txt
file to the whitehouse folder in HDFS, renaming the file visits.txt. (Be sure to enter
this command on a single line):
# hadoop fs -put whitehouse_visits.txt
whitehouse/visits.txt
2.3. Use the ls command to verify the file was uploaded successfully:
# hadoop fs -ls whitehouse
Found 1 items
-rw-r--r--   3 root root  183292235  whitehouse/visits.txt
3.4. Click the Add Property... link and add two properties: set the
hadoop.proxyuser.root.hosts property to * and set
hadoop.proxyuser.root.groups to * as well:
3.5. Click the Save button to save your changes to the HDFS config.
3.6. Start HDFS service.
Step 4: Deploy the Oozie Workflow
4.1. SSH into node2.
339
4.7. Put congress_visits.hive and whitehouse.pig from the oozielab folder into
the new congress folder in HDFS.
4.8. Also, put workflow.xml into the congress folder.
4.9. If you look at the Hive action in workflow.xml, you will notice that it
references a file named hive-site.xml within the <job-xml> tag. This file
represents the settings Oozie needs to connect to your Hive instance, and the file
needs to be deployed in HDFS (using a relative path to the workflow directory).
Put hive-site.xml into the congress directory:
# hadoop fs -put /etc/hive/conf/hive-site.xml congress
4.10. Verify you have four files now in your congress folder in HDFS:
# hadoop fs -ls congress
Found 4 items
-rw-r--r--   3 root root    429   congress/congress_visits.hive
-rw-r--r--   3 root root   3509   congress/hive-site.xml
-rw-r--r--   3 root root    580   congress/whitehouse.pig
-rw-r--r--   3 root root   1623   congress/workflow.xml
You should see your Oozie job in the list of Workflow Jobs:
341
Notice you can view the status of each Action within the workflow.
Step 9: Verify the Results
9.1. Once the Oozie job is completed successfully, start the Hive Shell.
9.2. Run a select statement on congress_visits and verify the table is populated:
hive> select * from congress_visits;
...
WATERS      MAXINE      12/8/2010 17:00                   POTUS  OEOB  MEMBERS OF CONGRESS AND CONGRESSIONAL STAFF
WATT        MEL         12/8/2010 17:00                   POTUS  OEOB  MEMBERS OF CONGRESS AND CONGRESSIONAL STAFF
WEGNER      DAVID L     12/8/2010 16:46  12/8/2010 17:00  POTUS  OEOB  MEMBERS OF CONGRESS AND CONGRESSIONAL STAFF
WILLOUGHBY  JEANNE P    12/8/2010 17:07  12/8/2010 17:00  POTUS  OEOB  MEMBERS OF CONGRESS AND CONGRESSIONAL STAFF
WILSON      ROLLIE E    12/8/2010 16:49  12/8/2010 17:00  POTUS  OEOB  MEMBERS OF CONGRESS AND CONGRESSIONAL STAFF
YOUNG       DON         12/8/2010 17:00                   POTUS  OEOB  MEMBERS OF CONGRESS AND CONGRESSIONAL STAFF
MCCONNELL   MITCH       12/14/2010 9:00                   POTUS  WH    MEMBER OF CONGRESS MEETING WITH POTUS.
Time taken: 1.082 seconds, Fetched: 102 row(s)
342
RESULT: You have just executed an Oozie workflow that consists of a Pig script followed
by a Hive script.
ANSWERS:
Step 4.2: Two
Step 4.3: The Pig action named export_congress
Step 4.4: The Hive action named define_congress_table
343
Ambari
Monitoring Architecture
Ganglia
Nagios
Nagios UI
344
Ambari
The HDP install needs to get software from a YUM repository. A remote yum repository
can be used; however, usually a local copy of the HDP repository is set up so your hosts
within the firewall can access it. Reference the Hortonworks documentation on
Deploying HDP In Production Data Centers with Firewalls for more information.
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP2.0.6.0/bk_reference/content/reference_chap4.html
Database Metastores are required for Ambari, Hive and Oozie. MySQL, Oracle or
PostgreSQL are recommended. Derby is the default.
HDP is certified and supported for running on virtual or cloud platforms (VMware
vSphere, Amazon Web Services and Rackspace).
The Hortonworks Sandbox (a pseudo-distributed deployment model) provides VMs for
VMware Fusion, VirtualBox, and Hyper-V. Ambari is used to manage the Hadoop cluster
running in the sandbox. (www.hortonworks.com/sandbox)
345
Ambari was first released in HDP 1.2. Ambari 1.4.1 is released with HDP2 and contains
additional functionality:
Ability to deploy and manage the Hadoop 2.0 stack using Ambari.
Support for enabling Kerberos based security for Hadoop 2.0 services.
346
Monitoring Architecture
(Diagram: the Ambari server, backed by Postgres, communicates with the Ganglia
server (gmetad, storing metrics with RRDtool) and the Nagios server; each cluster host,
including the gateway, runs a Ganglia monitor (gmond) and an Ambari agent.)
Monitoring Architecture
Ambari monitors Hadoop services including HDFS, HBase, Pig, Hive, etc. A service can
have multiple components (for example, HDFS has NameNode, Standby NameNode, and
DataNode components). The terms node and host are used interchangeably.
The Ambari server has agents installed on each host. Each agent sends heartbeats to
the Ambari server and receives commands in response to those heartbeats.
Each host also runs a Ganglia Monitor (gmond) that collects metrics and reports them,
via the Ganglia connector, to the Ambari server.
Ambari Web sessions do not time out, so it is important to log out of the Ambari web
interface when you are done.
347
348
Advantages of Ambari:
While Ambari is not the first management system for Hadoop, Ambari is an
excellent example of the innovation and accelerated development open source
delivers. Ambari has grown significantly in HDP 1.2, 1.3 and HDP2.
Start the Ambari Server on the node where it has been configured.
# ambari-server start
349
(Screenshot: the Ambari dashboard with numbered callouts for the Add Widget control,
the gear menu, service status, and the widgets area.)
350
Widgets can be moved around the screen (drag and drop), and hovering over a widget
provides a summary. You can also:
Click on the gear icon (#5 in the slide) and move to the Classic Version. The gear allows
you to reset widgets to the default and view metrics in Ganglia.
351
352
3. Hosts: The Hosts view lets you drill down into a host to get detailed information
on the services running on that host. Actions are available to start, stop and
decommission. Hosts can be added with the +Add Hosts Wizard.
4. Admin: The Admin View supports user management and provides general
information.
High Availability: NameNode HA can be set up. This option will start the
NameNode HA Wizard. The Wizard will walk you through defining the
Standby NameNode, and JournalNodes.
Checking Stack and Component Versions: This screen allows you to see the
Hadoop software stack and the specific version installed.
Checking Service User Accounts and Groups: Display users and groups and
the services they own.
353
Ganglia
Designed for monitoring and collecting large quantities of metrics of
federations of clusters
Ganglia
Ganglia was developed at Berkeley and is a BSD-licensed open source project. Berkeley
is known as a center of grid and high-performance environments. Ganglia was designed
and developed in an environment where large computing environments were the norm.
Ganglia was assumed to be running in extremely scalable environments where minimal
overhead and performance were a fundamental requirement. Ganglia was designed
from the very beginning to scale to cloud-sized networks. Therefore, Ganglia is an ideal
tool for monitoring Hadoop clusters that can grow to 10,000+ nodes per cluster.
Ganglia ships with a large number of metrics that can be accessed with visual graphs.
Ganglia has a plug-in to receive Hadoop metrics and can provide aggregate statistics for
the cluster as a whole. Ganglia also provides real-time graphing capabilities.
354
Ganglia Monitors
Gmetad: The Ganglia Meta Daemon polls information from the gmond daemons
then collects and aggregates the statistics. RRDtool is a tool that stores metrics in
round robin databases.
Gweb: Ganglia Web is a PHP program that runs in an Apache web server that
provides visualization. The configuration file is conf.php.
355
The Ganglia configuration file (gmetad.conf) is organized into sections that are defined
in curly braces. Section names and attributes are case insensitive. There are two
categories:
hbase: Number of regions, memstore sizes, read and write requests, StoreFile
Index sizes, block cache hit and miss ratios, and block cache memory available.
356
Nagios
The Nagios primary configuration file (nagios.cfg) default location is the /etc/nagios
directory.
Key parameters:
Parameter
log file
Description
Contains the location of the nagios.log file
(/usr/local/nagios/var/nagios.log).
nagios_user
nagios_group
status_file
/usr/local/nagios/var/status.dat holds the
current status and downtime information.
temp_path
357
temp_file
/usr/local/nagios/var/nagios.tmp is used
as a temporary file when updating status
information.
NOTE: HDP2 uses Nagios 3.5.0. Nagios is installed as part of the Ambari
install.
358
Nagios UI
Nagios UI
Nagios can be accessed from the Ambari interface or from the server running Nagios.
Launch the Nagios UI on the server it is running on via http://localhost/nagios.
359
360
Viewing JVM heap dumps is only needed when a problem must be examined in a very
detailed way. What's nice about jmap is that it is available if necessary.
If the JVM is running out of memory, you can have a heap dump generated
automatically: set the -XX:+HeapDumpOnOutOfMemoryError option to generate a heap
dump when an out-of-memory error occurs.
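As a hedged sketch (the environment variable and dump path are illustrative; adjust
them for the daemon you are tuning), the option can be added to a daemon's JVM
options in hadoop-env.sh:
export HADOOP_NAMENODE_OPTS="-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/namenode.hprof ${HADOOP_NAMENODE_OPTS}"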
361
(Diagram of the JVM heap: the New/Young generation contains Eden and the
Survivor 1 and Survivor 2 spaces, alongside the Tenured/Old generation and PermGen.)
An object will move from the Survivor II memory area into the Old memory area.
362
When the Old memory area fills up, a major garbage collection will occur which can
impact performance. This can impact YARN which is running mappers and reducers in
Containers.
The -Xms option sets the initial size of the combined Old and Young generation memory
areas (the heap); the maximum size of this combined area is set with the -Xmx option.
Hints can be provided for the Young and Old memory areas; the exact memory size will
be determined by the JVM. The Young generation size can be initialized with the
-XX:NewSize argument, and the Old-to-Young ratio with -XX:NewRatio. A value of 2 will
make the Old generation memory area twice as big as the Young generation.
363
364
JVM memory heap dumps can also be viewed with commands. Use jps to get process id
(2719) and jmap to display.
# jps -l
# jmap -histo:live 2719 | head
# jmap -heap 2719
# jstat -gcutil 2719 5000
# jps
3855 ResourceManager
3096 SecondaryNameNode
3973 NodeManager
2719 NameNode
2645 QuorumPeerMain
2952 DataNode
3292 RunJar
3332 RunJar
4080 JobHistoryServer
4558 AmbariServer
3505 RunJar
4903 RunJar
3238 Jps
3789 Bootstrap
365
366
Sample jstat -gcutil output columns: S0, S1, YGC, YGCT, FGC, FGCT, and GCT
(survivor space 0 and 1 utilization, young-generation GC count and time, full GC count
and time, and total GC time), with sample values such as 0.00, 100.0, 40.85, 11.19,
99.35, 0.044, 0.000, and 0.044.
367
368
Unit 15 Review
1. The Dashboard View supports two different types of views, they are the
________________ and _________________ views.
2. The Ganglia primary daemons are _____________ , _____________ and
_______________.
3. The main Nagios configuration file is ______________________.
4. Use this Java JDK tool to create a JVM heap dump: ______________________
5. Use this Java tool to access JVM metrics: ___________________________
369
Balancer
Running Balancer
370
Architectural Review
Decommissioning/Commissioning nodes need to take the above into consideration.
Daemons and Processes running on a slave server can include additional frameworks.
Usually a slave server in the cluster runs both a DataNode and NodeManager daemon.
If running HBase, the slave server will also run a HBase Region Server. Additional
frameworks such as Accumulo, Storm, etc. will have their own client processes.
The ResourceTrackerService is responsible for registering new nodes and for
decommissioning/commissioning nodes.
The NMLivelinessMonitor monitors live and dead nodes.
The NodesListManager manages the collection of valid and excluded nodes. The
NodesListManager reads the following local host configuration files. Lines that begin
with # are comments.
dfs.hosts: Names a file that contains a list of hosts that are permitted to connect
to the NameNode.
dfs.hosts.exclude: Names a file that contains a list of hosts that are not
permitted to connect to the NameNode.
371
Run the refreshNodes option for the ResourceManager daemon to recognize the
changes:
# yarn rmadmin -refreshNodes
372
Adds more processing capabilities because the cluster can run more Containers.
373
Decommissioning Nodes
Although HDFS is designed to tolerate DataNode failures, this does not mean you can
just terminate DataNodes en masse with no ill effect. With a replication level of three
for example, the chances are very high that you will lose data by simultaneously
shutting down three DataNodes if they are on different racks. The way to decommission
DataNodes is to inform the NameNode infrastructure of the DataNode(s) to be taken
out of circulation, so that it can replicate the blocks to the rest of HDFS before taking the
node down.
With NodeManagers and Containers, Hadoop is more forgiving. If you shut down a
NodeManager that is running tasks, the ResourceManager will notice the failure and
reschedule the tasks on other nodes in the Cluster.
The decommissioning process is controlled by an exclude file. The exclude file lists the
nodes that are not permitted to connect to the cluster (master daemons).
374
The rules for whether a NodeManager may connect to the ResourceManager are
simple: a NodeManager may connect only if it appears in the include file and does not
appear in the exclude file. An unspecified or empty include file is taken to mean that all
nodes are in the include file.
For HDFS, the rules are slightly different. If a node appears in both the include and
exclude file, then it may connect, but only to be decommissioned.
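A hedged sketch of decommissioning a DataNode follows (the hostname and exclude
file path are illustrative, and dfs.hosts.exclude and the ResourceManager exclude
property must already point at the exclude files):
# echo "node4" >> /etc/hadoop/conf/dfs.exclude
# hdfs dfsadmin -refreshNodes
# yarn rmadmin -refreshNodes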
375
(Slide: a numbered list of decommissioning steps; among them, check the Cluster Web
Console / NameNode Web UI for node status.)
376
http://<ResourceManager node>:8088
377
Dead Nodes: The NameNode will declare a DataNode dead when a heartbeat is
not received for a period of time. The default is 10 minutes.
A node remains dead until it is removed from the dfs.include list AND the dfsadmin
command is run to refresh the nodes (hdfs dfsadmin -refreshNodes).
378
379
Check that the new Node appears in the ResourceManager Web UI.
http://<ResourceManager node>:8088
Run balancer if you want existing blocks to be written to the new DataNode. This
ensures the HDFS cluster is able to leverage the processing and IOPS of the new
DataNode.
The Balancer needs to work with the NameNodes in the cluster to balance the cluster.
Example:
"$HADOOP_PREFIX"/bin/hadoop-daemon.sh --script "$bin"/hdfs
start balancer [-policy <policy>]
380
Balancer
381
382
Running Balancer
Balancer can be run periodically as a batch job
Every 24 hours or weekly for example
Balancer should be run after new nodes have been added to the cluster
Running the balancer is also useful if a client loads files into HDFS from a
computer that is also a DataNode
One replica of the blocks will be placed on the local DataNode
Balancer runs until there are no blocks to move or until it has lost
contact with the NameNode
Can be stopped with a Ctrl+C
Running Balancer
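For example (a sketch; the threshold value is illustrative), the balancer can be started
from the command line with a 5 percent utilization threshold:
# hdfs balancer -threshold 5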
383
384
Unit 16 Review
1. Which property points to the file that contains the list of hosts allowed to
connect to the NameNode? _________________________________
2. Which property points to the file that contains the list of hosts not allowed to
connect to the NameNode? _________________________________
3. The ResourceManager also has include and exclude files. Which two properties
define where these two files are located? ____________________________
_______________________________________________________________
4. The rmadmin option is to __________________________________________.
385
386
1.5. Wait for the DataNode component to be installed. When the install is
complete, DataNode should appear in the list of Components on node4:
387
6.5. Wait a couple of minutes for the balancer to even out the block storage. You will
see output at the command prompt as blocks get moved from one node to another:
INFO balancer.Balancer: 0 over-utilized: []
INFO balancer.Balancer: 1 underutilized:
[BalancerDatanode[10.222.133.205:50010,
utilization=0.40288804420577484]]
INFO balancer.Balancer: Need to move 173.74 MB to make the
cluster balanced.
INFO balancer.Balancer: Decided to move 89.95 MB bytes from
10.170.202.246:50010 to 10.222.133.205:50010
INFO balancer.Balancer: Decided to move 151.79 MB bytes
from 10.174.49.252:50010 to 10.222.133.205:50010
INFO balancer.Balancer: Will move 241.74 MB in this iteration
INFO balancer.Balancer: Moving block 1073742573 from
10.174.49.252:50010 to 10.222.133.205:50010 through
10.174.50.60:50010 is succeeded.
INFO balancer.Balancer: Moving block 1073742572 from
10.174.49.252:50010 to 10.222.133.205:50010 through
10.174.50.60:50010 is succeeded.
...
6.6. Refresh the Live Nodes page of the NameNode UI. Your node4 DataNode
should now have blocks on it, and the number of blocks will gradually increase as
the balancer app continues to even out the block storage on your cluster.
NOTE: The balancer app will run for a long time. Just leave the process open
in your terminal window. If you need to perform any future tasks on node1,
just open a new terminal window.
388
7.3. Click OK in the confirmation dialog, and wait for the decommissioning task to
complete.
NOTE: There is a minimal chance that the decommissioning task may fail
due to a known bug in Hadoop 2.0 where the node contains a block that
belongs to a file with a replication factor larger than the rest of the cluster
size. The work-around is to locate and delete any files that have a replication
factor larger than 3. View https://issues.apache.org/jira/browse/HDFS-5662
for more details.
389
8.2. Click on Decommissioning Nodes and it will show that node1 is undergoing
the decommission process.
8.3. Go to the Live Nodes page of the NameNode UI. You will see that blocks are
gradually being copied from node1 to the other nodes. The Admin State of node1
is going to be either Decommission in Progress or Decommissioned. Refresh the
page until the status is Decommissioned.
8.4. Go back to the NameNode UI page. Notice you have 4 Live Nodes, and 1 of
them is Decommissioned:
9.3. It will take several minutes to stop the DataNode process on node1.
9.4. From the Ambari Dashboard page, you should see 3/4 live DataNodes:
390
RESULT: You have now seen how to commission a new DataNode, and also how to run
the balancer tool to balance the blocks across a cluster once new DataNodes are
commissioned. You also have decommissioned one of the DataNodes from your cluster.
391
392
HDFS Snapshots
393
HDFS Snapshots
HDFS Snapshots
Another major highlighted feature of Hadoop 2 is HDFS snapshots. Taking a snapshot is
fast. As long as snapshotting is enabled on a particular directory, users with write
permission to that directory can create and remove as many snapshots as needed.
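As a sketch of the basic snapshot commands (the directory and snapshot name are
illustrative):
# hdfs dfsadmin -allowSnapshot /user/root/data
# hdfs dfs -createSnapshot /user/root/data ss01
# hdfs lsSnapshottableDir
# hdfs dfs -deleteSnapshot /user/root/data ss01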
394
(Diagram of a backup workflow: perform an HDFS snapshot, distcp the new snapshot to
the backup cluster, snapshot the new data on the backup cluster, run the enterprise
retention policy cleanup, then take the on-success or on-failure action.)
395
396
PostgreSQL Backup
pg_dump hive > hive_backup.sql
397
Oracle Backup
[oracle]$ expdp hive/password schemas=hive directory=backups
dumpfile=hive_backup.dmp
Backup Ambari
Take a backup of the following Ambari cluster configurations:
1. /etc/ambari-server
2. The ambari database in PostgreSQL
398
1.5. Which node and folder is the block stored in? ________________________
Step 2: Enable Snapshots
2.1. Now let's enable the /user/root/data directory for taking snapshots:
399
3.2. Verify the snapshot was created by viewing the contents of the
data/.snapshot folder:
# hadoop fs -ls -R data/.snapshot
drwxr-xr-x   - root hadoop      0  data/.snapshot/ss01
-rw-r--r--   3 root hadoop  44841  data/.snapshot/ss01/constitution.txt
4.2. Use the ls command to verify the file is no longer in the data folder in HDFS.
4.3. Check whether the file still exists in /user/root/data/.snapshot/ss01. It
should still be there.
400
4.4. Run the same find command again that you ran in the earlier step. Does the
block file still exist on your local file system? _____________________________
Step 5: Recover the File
5.1. Let's copy this file from data/.snapshot/ss01 to the data directory.
# hadoop fs -cp data/.snapshot/ss01/constitution.txt data/
5.2. Run the fsck command again on data/constitution.txt. Notice that the block
and location information have changed for this file.
5.3. Run the find command for the new blocks. Notice the blocks for the
constitution.txt file appear in two locations on your local file system (before
deleting the file and after copying the file).
RESULT: This lab demonstrates how the snapshot process prevents the blocks from
being deleted or edited, so the blocks remain available in case you need to recover
your file in the future.
Answers:
Step 1.5: In a subfolder of: /hadoop/hdfs/data/current/
Step 3.3: Once snapshots are enabled for a directory, the directory cannot be deleted
until its snapshots are deleted.
Step 4.4: Yes
401
Rack Awareness
HDFS Replication
Rack Topology
402
Rack Awareness
Rack awareness spreads block replicas across different racks to make sure if a rack
becomes unavailable (power failure, switch failure, etc.) all replicas for a block are not
lost. Rack awareness makes sure that all operations that involve rack placement
understand to spread the blocks across multiple racks. The NameNode makes the
decision where blocks are placed. Examples of block operations that are rack aware
include:
Inserts
Hadoop balancer
Decommissioning a datanode
For rack awareness, each data node is assigned to a rack. Each rack will have a unique
rack id. Rack ids are hierarchical and appear as path names.
If rack awareness is not configured, the entire Hadoop cluster is treated as if it were a
single rack. Every DataNode will have a rack id of /default-rack. With the default
behavior, data is loaded on a DataNode and then two other DataNodes are selected
at random to make sure replicas are spread across multiple DataNodes.
403
404
HDFS Replication
First replica is placed on the same rack as the client, if possible. If that
is not possible, it will be placed randomly.
Second replica is placed on a DataNode on another rack
Third replica is on another DataNode on the second rack
(Diagram: the client writes data and checksums to a DataNode on Rack 1, which
pipelines them to DataNodes on Rack 2; acknowledgements flow back along the
pipeline and the checksum is verified.)
Replica Placement
Rack awareness places different priorities on each replica. The assumption is traffic
within a rack is faster than across racks.
The first replica is put on the DataNode that is closest to the Hadoop client. This is the
rack the client is running on.
The second replica is placed on a different rack for high availability. This makes sure
that if a rack fails a replica of a block still exists.
The third replica is placed on the same rack as the second replica. Once the second
replica is on a different rack, high availability has been taken care of; the goal is then to
place the third replica on another DataNode of that second rack.
405
Rack Topology
(Diagram: an example two-rack topology with aggregation switches and, in each rack,
dual ToR switches, a KVM switch, a staging node, and DataNodes. Master services such
as the NameNode, HBase Master, Oozie server, Standby/Secondary NameNode,
ResourceManager, and a management node running the Ambari server, Ganglia/Nagios,
WebHCat server, JobHistoryServer, and HiveServer2 are spread across the racks.)
Rack Topology
Rack topologies need to make sure there are no single points of failure.
There are a number of different ways to deploy rack topologies for Hadoop. The
Top-of-Rack (ToR) architecture is popular because of its short cable runs and easy
replication of rack configurations. As companies build out data centers, they deploy
rack servers as the core building block, with ToR switches and cabling within the rack.
Pod-based (containerized) modular designs are also becoming very popular. A pod is a
preconfigured system with compute, network and storage resources; a pod
architecture's strength is integration and standardization.
Top-of-Rack does not mean the switches must be at the top of the rack. The top of the
rack is popular because of ease of access and cabling, but switches can be anywhere in
the rack.
This example uses the leaf-spine topology. Each TOR switch is a leaf and each
aggregation switch is a spine. Scalability can be increased by designing a dual-tier
aggregation layer. TOR switches in a rack can be connected to aggregation switches that
can provide interconnection to the rest of the data center.
Each rack should have two Top Of Rack (TOR) Ethernet switches that are bonded. Two
switches are used for scalability and availability.
406
407
408
rack-topology.sh
409
Unit 18 Review
1. Each rack has a _____________________ path name.
2. The priority of the second replica for rack aware is _______________________.
3. Rack topology is configured in the __________________________________ file.
410
411
Notice this script calculates the rack name using the IP address of the node. The
first three parts of the IP address become its rack name. For example: if
192.168.1.100 is the IP address, then the rack name would be /192.168.1.
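A minimal sketch of such a script (illustrative only; the sample script shipped with HDP
may differ):
#!/bin/bash
# Print a rack name for each IP address argument, using its first three octets.
for node in "$@"; do
  echo "/${node%.*}"
done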
Step 3: Configure the Rack Script
3.1. Copy the script to directory /etc/hadoop/conf as rack-topology.sh:
# cp rack-topology.sh.sample
/etc/hadoop/conf/rack-topology.sh
3.2. Stop HDFS. Edit core-site.xml and add the following properties:
topology.script.file.name=/etc/hadoop/conf/rack-topology.sh
topology.script.number.arg=1
4.2. You can also view the current topology by using the following command:
412
4.3. Run the fsck command. You should see 4 racks now:
RESULT: The nodes in your cluster are now each assigned to a rack, and the rack
assignment takes place automatically using the rack-topology.sh script. You can write
your own custom script for automatically determining the appropriate rack names for
your cluster nodes.
413
HDFS HA Components
Understanding NameNode HA
NameNodes in HA
Failover Modes
NameNode Architectures
Red Hat HA
VMware HA
414
(Diagram: the NameNode manages the namespace (fsimage and edits log) and block
management; DataNodes 1 through n send heartbeats to it and hold the replicated
blocks.)
Red Hat and VMware HA are solutions that work well but there are reasons customers
want an HA solution built into HDP.
Both Red Hat and VMware HA:
415
416
HDFS HA Components
Hadoop HA clusters use nameservice IDs to identify an HDFS instance that may be made
up of multiple NameNodes. A NameNode ID is also added: each NameNode in an HA
cluster has a unique ID so that it can be uniquely identified.
DataNodes send block map reports and heartbeats to the Primary and Standby
NameNodes to maintain consistency.
A ZooKeeper Quorum is used to coordinate data, perform update notifications and
monitor for failure. Each NameNode maintains a persistent session in ZooKeeper.
The ZooKeeper Quorum will:
417
The ZKFailoverController (ZKFC) monitors and manages the state of the NameNode. The
Active and Standby NameNode will each run a ZKFC.
The ZKFC:
Monitors the health of the NameNode it is monitoring and manages its state of
being healthy or unhealthy.
The Journal Nodes (JNs) make sure that a split-brain scenario (both NN writing at same
time) does not occur. The JNs make sure that only one NameNode can be a writer at a
time.
The Active NameNode will write records to the shared edits log. The Standby NameNode
will read the edits log and apply the changes to itself. The Standby NameNode will read
all edits before becoming active during a failover.
Currently there can only be one shared directory. The storage needs to support
redundancy to protect the metadata.
The Standby NameNode performs checkpoints. If upgrading from HDP1 to HDP2, the
previous Secondary NameNode can be replaced with the Standby NameNode.
An experimental shared storage solution is BookKeeper. BookKeeper can replicate edit
log entries across multiple storage nodes. The edit log can be striped across the storage
nodes for high performance. Fencing is supported in the protocol. The metadata for
BookKeeper is stored in ZooKeeper. In current HA architecture, a ZooKeeper cluster is
required for ZKFC. The same cluster can be for BookKeeper metadata. Refer to the
Apache BookKeeper project documentation for more information.
http://zookeeper.apache.org/bookkeeper/
418
Understanding NameNode HA
(Diagram of a Hadoop HA cluster: an Active NameNode and a Standby NameNode, each
paired with a ZKFC, use a three-node ZooKeeper ensemble for automatic failover; the
active NameNode writes namespace edits to a quorum of JournalNodes and the standby
reads them; both NameNodes perform block management, and DataNodes 1 through n
send heartbeats and hold the replicated blocks.)
Understanding NameNode HA
NameNode High Availability (HA) has no external dependency.
NameNode HA has an active NameNode and a standby NameNode running in an
active-passive relationship. If the active NameNode goes down, the passive NameNode
becomes the active NameNode. If the failed NameNode restarts, it will become the
passive NameNode. The ZooKeeper FailoverController (ZKFC) maintains a lock on the
active NameNode for a namespace.
On each platform running a NameNode service there will be an associated ZKFC. The
ZKFC communicates with:
The NameNode service it is associated with. ZKFC monitors the health and
manages the HA state of the NameNode.
The FailoverController (FC) monitors the health of the NameNode, Operating System
(OS) and Hardware (HW). There is an active and standby FailoverController.
Heartbeats occur between the Failover Controllers (active and passive) and the
zookeeper servers.
419
Recommendations:
420
NameNodes in HA
Start the services in the following order:
1. JournalNodes
2. NameNodes
3. DataNodes
Always start the NameNode then its corresponding ZKFC.
The Active NameNode is determined by which NameNode starts first. If one NameNode
is the preferred Active NameNode, then always start it first.
The hdfs haadmin command is used to perform a manual failover.
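For example (a sketch; nn1 and nn2 are illustrative NameNode IDs):
# hdfs haadmin -getServiceState nn1
# hdfs haadmin -failover nn1 nn2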
421
There are two ways of sharing edit logs with NameNode HA:
The active NameNode writes the edits in the edits.log. The Standby NameNode will
read and apply edits to maintain a consistent state. The current state is maintained with
a quorum of Journal Nodes.
Commands and scripts used to manage HA:
Format and initialize the HA state in ZooKeeper:
$ hdfs zkfc -formatZK
start-dfs.sh will start the ZKFC daemon when automatic failover is set up.
To manually start a ZKFC process:
$ hadoop-daemon.sh start zkfc
422
Failover Modes
The ZooKeeper FailoverController (ZKFC) process monitors the health of the NameNodes for a namespace. The ZKFC facilitates the failover process and performs a fencing operation to make sure a split-brain scenario cannot occur.
A split-brain scenario occurs when both NameNodes think they are the active NameNode. The fencing operation makes sure one NameNode is fenced off so it cannot act as active. This protects the NameNode metadata from being corrupted by two NameNodes writing at the same time.
The command below can be used to fail over from the active to the standby NameNode:
$ hdfs haadmin -failover <StandbyNN-To-Be> <ActiveNN-To-Be>
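Before and after a manual failover, the HA state of each NameNode can be checked with haadmin; for example (nn1 and nn2 are the hypothetical NameNode IDs defined by dfs.ha.namenodes):
$ hdfs haadmin -getServiceState nn1
active
$ hdfs haadmin -getServiceState nn2
standby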
423
A cluster may not generate the workload that requires federated NameNodes, but may still have HA requirements.
424
425
Options include:
426
Red Hat HA
[Diagram: Red Hat HA for the NameNode. A monitoring agent on each node monitors the NameNode process, heartbeats run between the active NameNode node and the standby NameNode node, power fencing is available, and the NameNode state is kept on shared storage.]
Red Hat HA
Red Hat Enterprise Linux (RHEL) HA cluster software is separate from the Hadoop
cluster. A power-fencing device is required (deals with split-brain scenario). A floating
IP is required for failover. RHEL HA cluster must be configured for the Hadoop master
servers that have high availability requirements.
Typically, the overall Hadoop cluster must include the following types of machines:
The RHEL HA cluster machines. These machines must host only those master
services that require HA (in this case, the NameNode and the ResourceManager).
Master machines that run other master services such as Hive Server 2, HBase
master, etc.
427
VMware HA
[Diagram: VMware HA. A vSphere HA cluster managed by VMware vCenter Server contains multiple ESXi hosts backed by shared storage. The NameNode, ResourceManager, and other master nodes each run in their own VM; ESXi hosts exchange heartbeats, and on a failure the affected VMs are started on another ESXi host.]
VMware HA
vSphere is VMware's virtualization platform. The VMware vCenter Server is VMware's central point of management.
A vSphere ESXi host can run multiple VMs. A vSphere HA/DRS cluster can be set up with
multiple ESXi hosts. The ESXi hosts maintain heartbeats and communication so they
understand what VMs are running in the vSphere HA cluster. If an ESXi host fails, HA will
start the failed VMs on another ESXi host in the vSphere HA cluster automatically. If a
VM fails on an ESXi host, VMware HA can restart the VM on another ESXi host in the HA
cluster. The vSphere HA cluster must use shared storage.
A NameNode monitoring agent notifies vSphere if the NameNode daemon fails or becomes unstable. vSphere HA will then restart the NameNode VM on the same ESXi host or on a different ESXi host, depending on the error. A monitoring agent needs to be set up for any other HDP 2 master nodes so that vSphere HA is aware when a master node VM needs to be started again. vSphere HA can also automatically handle an ESXi host failure.
It takes about five clicks to set up HA with vSphere, and vSphere will then manage the HA environment automatically. When HA is enabled, a Fault Domain Manager (FDM) service is started on the ESXi hosts. The ESXi hosts hold an election and pick a master host. The master host manages the FDM environment.
428
There are a number of different options for configuring how vSphere HA/DRS high availability works. It takes a fair amount of expertise to set up a virtual HA environment, but once set up it works automatically.
When running a vSphere HA/DRS cluster a number of features of virtualization can be
leveraged.
Fault Tolerance: Two VMs across different ESXi hosts can stay synchronized in an active-passive relationship. If the active VM fails, the passive VM takes over (only supported with up to four vCPUs in vSphere 5.5). Fault Tolerance has zero downtime.
vSphere Replication (VR) can perform VM replication across different sites (the
hardware does not have to be an exact match between sites).
Site Recovery Manager (SRM) supports automatic failover to another site.
vSphere HA can protect against an ESXi host failure or the failure of applications running on a VM (vSphere application failover, in vSphere 5.5). vSphere HA also protects against VM failure, guest OS failure, and network failures.
The NameNode must run inside a virtual machine which is hosted on the
vSphere HA cluster.
The ResourceManager must run inside its own virtual machine which is hosted
on the vSphere HA cluster.
The vSphere HA cluster must include a minimum of two ESXi server machines.
429
1.2. Click the Enable NameNode HA button. Notice on the first step of the wizard
that you get a warning about stopping HBase first:
430
431
3.4. Once Ambari recognizes that your cluster is in Safe Mode and a Checkpoint
has been made, you will be able to click the Next button.
Step 4: Wait for the Configuration
4.1. At this point, Ambari will stop all services, install the necessary components,
and restart the services. Wait for these tasks to complete:
432
4.2. Once all the tasks are complete, click the Next button.
Step 5: Initialize the JournalNodes
5.1. On node1, enter the command shown in the wizard to initialize the
JournalNodes:
# sudo su -l hdfs -c 'hdfs namenode -initializeSharedEdits'
5.2. Once Ambari determines that the JournalNodes are initialized, you will be
able to click the Next button:
433
7.2. On node4, run the command to initialize the metadata for the new
NameNode:
# sudo su -l hdfs -c 'hdfs namenode -bootstrapStandby'
8.2. Click the Done button when all the tasks are complete.
434
10.3. Go back to the HDFS page in Ambari. Notice the Standby NameNode has
become the Active NameNode:
10.4. Now start the stopped NameNode again, and you will notice that it becomes
a Standby NameNode:
435
RESULT: You now have NameNode HA configured on your cluster, and you have also
verified that the HA works when one of the NameNodes stops.
436
Security Concepts
Kerberos Synopsis
437
Security Concepts
Before implementing security in a Hadoop cluster, it's important to understand basic security concepts and terms.
Principal: A principal is any user or service that is performing an operation in the secured environment. A user principal is an interactive or unattended (system) user that logs into a secured environment and starts to interact with services. A service principal is a service that needs to perform operations in a secured environment.
Authentication: There are many authentication mechanisms available by which
principals can prove their credentials are trusted. Credentials can be username and
password, a key file or certificate of trust, or a combination of usernames and trust files.
The common authentication protocols are Kerberos, Plain Text, X.509, Digest, and many
others. The protocol that is used in Hadoop is Kerberos, an MIT open source project.
438
439
Kerberos Synopsis
Kerberos is a protocol that aims to provide an authentication and authorization system
to:
Prevent the need for passwords to be transferred over the network.
Still allow users to enter passwords.
Allow a user to establish an authenticated session without needing to re-enter a password for every operation.
To create that secure communication among its various components, Hadoop uses
Kerberos. Kerberos is a third party authentication mechanism, in which users and
services that users want to access rely on a third party - the Kerberos server - to
authenticate each to the other. The Kerberos server itself is known as the Key
Distribution Center, or KDC.
At a high level, the KDC has three parts:
A database of the users and services (principals) that it knows about, along with their Kerberos passwords.
An Authentication Server (AS) that performs the initial authentication and issues a Ticket Granting Ticket (TGT).
A Ticket Granting Server (TGS) that issues subsequent service tickets based on the initial TGT.
A user principal requests authentication from the AS. The AS returns a TGT that is
encrypted using the user principal's Kerberos password, which is known only to the user
principal and the AS. The user principal decrypts the TGT locally using its Kerberos
password, and from that point forward, until the ticket expires, the user principal can
use the TGT to get service tickets from the TGS. Service tickets are what allow a principal
to access various services.
Because cluster resources (hosts or services) cannot provide a password each time to
decrypt the TGT, they use a special file, called a keytab, which contains the resource
principal's encrypted credentials.
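For example, a service principal typically obtains its TGT non-interactively using kinit with its keytab (the keytab path and principal below are illustrative only):
$ kinit -kt /etc/security/keytabs/nn.service.keytab nn/node1.example.com@EXAMPLE.COM
$ klist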
Kerberos Components
Key Distribution Center, or KDC: The trusted third-party Kerberos server; it holds the principal database and issues tickets.
Kerberos KDC Server: The machine, or server, that hosts the KDC.
Kerberos Client: Any machine in the cluster that authenticates against the KDC.
Principal: The unique name of a user or service that authenticates against the KDC.
Keytab: A file containing the encrypted credentials (keys) of one or more principals, used by resources that cannot enter a password.
Realm: The Kerberos network, or administrative domain, served by a KDC.
441
Service        Component                     Principal
HDFS           NameNode                      nn/$FQDN
HDFS           NameNode HTTP                 HTTP/$FQDN
HDFS           SecondaryNameNode             nn/$FQDN
HDFS           SecondaryNameNode HTTP        HTTP/$FQDN
HDFS           DataNode                      dn/$FQDN
MR2            History Server                jhs/$FQDN
MR2            History Server HTTP           HTTP/$FQDN
YARN           ResourceManager               rm/$FQDN
YARN           NodeManager                   nm/$FQDN
Oozie          Oozie Server                  oozie/$FQDN
Oozie          Oozie HTTP                    HTTP/$FQDN
Hive           Hive Metastore, HiveServer2   hive/$FQDN
Hive           WebHCat                       HTTP/$FQDN
HBase          MasterServer                  hbase/$FQDN
HBase          RegionServer                  hbase/$FQDN
ZooKeeper      ZooKeeper                     zookeeper/$FQDN
Nagios Server  Nagios                        nagios/$FQDN
JournalNode Server [a]  JournalNode          jn/$FQDN
[a]
Once principals are established in the KDC's database, keytab files can be extracted.
Recall that a keytab is a key file that identifies a principal. Keytabs need to be installed
on each host where a service principal resides.
To extract a keytab file from an established principal:
$ kadmin.local -q "xst -norandkey -k $keytab_file_name $primary_name/fully.qualified.domain.name@EXAMPLE.COM"
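To sanity-check the extracted keytab, the principals and key versions it contains can be listed (the file name below is just an example):
$ klist -kt /etc/security/keytabs/nn.service.keytab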
443
Note:
Once authentication is set up, the next step is to set up mappings of local UNIX
service accounts to Kerberos principals.
These mappings live in the core-site.xml under the hadoop.security.auth_to_local
property.
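As an illustration only (the realm and rules are hypothetical and must match the principals actually created), an auth_to_local mapping in core-site.xml could look like:
<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1@$0](nn@EXAMPLE.COM)s/.*/hdfs/
RULE:[2:$1@$0](dn@EXAMPLE.COM)s/.*/hdfs/
RULE:[2:$1@$0](jhs@EXAMPLE.COM)s/.*/mapred/
DEFAULT
</value>
</property>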
444
445
1.2. Switch to the hdfs user and create a new directory in HDFS named
/user/horton.
1.3. Change ownership of /user/horton in HDFS to the horton user.
1.4. Exit out from the hdfs user, and switch to the horton user.
1.5. Check whether you can do a listing of the /user directory successfully:
$ hadoop fs -ls /user
NOTE: The current cluster is not a secure cluster so you can easily do a
listing of the /user directory in HDFS successfully.
2.2. Login to node2, node3 and node4 and install the Kerberos client only:
# yum -y install krb5-workstation
3.3. Enter the following command to create a Kerberos database using the
kdb5_util utility:
# kdb5_util create -s
During this step it will ask you to define a master key. Enter 1234 as the key.
Step 4: Start Kerberos
4.1. Start the KDC server by executing following commands:
# /etc/rc.d/init.d/krb5kdc start
# /etc/rc.d/init.d/kadmin start
447
6.2. Create a new file on node1 named /root/scripts/kerberos.csv and copy-and-paste the contents of the CSV file into kerberos.csv.
6.3. Run the pre-written script to create all required principals and keytabs. It will
ask for the location of the CSV file you created. Provide the full path to the file:
# /root/scripts/create_principals.sh
NOTE: This step will create all required principals and keytab files on all the
nodes. Once you are done with the step, go back to Ambari UI.
448
7.1. We have completed all 4 required steps. Now it is time to enable security
through Ambari. Click the Apply button.
7.2. The Save and Apply Configuration step can take 10-15 minutes. When the
task is complete, click the Done button:
9.3. Now create a keytab file in the /etc/security/keytabs directory using the
following command:
449
9.4. Set appropriate permissions for the keytab file for the horton user:
# chown horton:hadoop
/etc/security/keytabs/horton.headless.keytab
# chmod 440 /etc/security/keytabs/horton.headless.keytab
9.5. Switch to the horton user and initialize the keytab file:
# su - horton
$ kinit -kt /etc/security/keytabs/horton.headless.keytab
horton@EXAMPLE.COM
9.6. Now try to list the contents of /user in HDFS again. This time you should be
able to view the folder's contents!
RESULT: You have enabled Kerberos security for your HDP cluster.
450
451
452
453
454
3. yarn.resourcemanager.nodes.include-path and
yarn.resourcemanager.nodes.exclude-path
4. execute ResourceManager administration operations
455
Knox
ZooKeeper
HBase
HCatalog
NameNode Federation
456
Discover
Design
Enable
Maintain
Archive
457
Tools: Oozie, Sqoop, Distcp, Flume, MapReduce
Data Processing: Replication, Retention, Scheduling, Reprocessing, Multi-Cluster Management
458
459
Falcon
Falcon is a data lifecycle management framework for Apache Hadoop.
Falcon enables users to configure, manage, and orchestrate data motion,
disaster recovery, and data retention workflows in support of business
continuity and data governance use cases.
Falcon
Falcon provides the key services data processing applications need. Falcon manages
workflow and replication.
Falcon's goal is to simplify data management on Hadoop. It achieves this by providing
important data lifecycle management services that any Hadoop application can rely on.
Instead of hard-coding complex data lifecycle capabilities, apps can now rely on a
proven, well-tested and extremely scalable data management system built specifically
for the unique capabilities that Hadoop offers.
Falcon also supports multi-cluster failover.
460
Future: Knox
Provide perimeter security
Support authentication and token verification security scenarios
Single URL to access multiple Hadoop services
Enable integration with enterprise and cloud identity management environments
Supports: WebHDFS, WebHCat, Oozie, HBase, Hive
Future: Knox
While not yet part of HDP, Knox is intended to provide perimeter security. It aims to
provide a single point of entry into a Hadoop cluster for a user to access different
services such as HDFS, YARN, Hive, and Oozie. Knox can be installed in HDP 2 as an add-on. A
user authenticates once with the Knox service via Kerberos, while Knox itself handles
serving requests for that user inside the cluster.
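As a sketch of what access through the gateway looks like (the host, port, topology name default, and credentials are placeholders), a WebHDFS directory listing via Knox could be requested with:
$ curl -ik -u guest:guest-password 'https://knoxhost:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS'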
For more information:
Hortonworks: http://hortonworks.com/hadoop/knox-gateway/
Apache: http://knox.incubator.apache.org/
461
ZooKeeper Synopsis
ZooKeeper is a service that provides configuration management, naming, distributed
synchronization, and group services. Various Hadoop services rely on ZooKeeper to
operate. In this Unit, we will focus on Administering ZooKeeper.
Centralized service for:
Configuration management: Services such as HBase use ZooKeeper extensively
for configuration management, such as a registry of all HBase nodes and tables.
Naming & group services: ZooKeeper can act as a naming service, similar to
what DNS provides. At an application level, you can use ZooKeeper as a
replacement for DNS. For example, if your application needs to resolve a
host name, that information can be maintained by ZooKeeper and provided to
the application.
462
Components
An ensemble of ZooKeeper hosts; three hosts suffice for most clusters.
Ensembles are configured in odd numbers (3, 5, 7, etc.) because an odd-sized
ensemble always has a clear majority and tolerates one more failure than the
next even size: 5 ZooKeeper nodes tolerate 2 failures, whereas 4 nodes tolerate
only 1, yet both require the same majority of 3.
The ensemble of hosts works together as a quorum; as long as a majority
of them agree on an operation, the operation succeeds.
ZooKeeper Client
ZooKeeper ships with a command line client that allows you to perform file-system like
operations:
/usr/lib/zookeeper/bin/zkCli.sh
463
The simple client allows you to create znodes. More complex operations would be
performed programmatically. For more in-depth information and a programmer's guide,
visit:
http://zookeeper.apache.org/doc/r3.4.5/zookeeperProgrammers.html
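As a quick illustration (the znode path and data below are made up), a zkCli.sh session might look like:
$ /usr/lib/zookeeper/bin/zkCli.sh -server localhost:2181
[zk: localhost:2181(CONNECTED) 0] create /myapp mydata
[zk: localhost:2181(CONNECTED) 1] ls /
[zk: localhost:2181(CONNECTED) 2] get /myapp
[zk: localhost:2181(CONNECTED) 3] delete /myapp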
464
Configuring ZooKeeper
To configure ZooKeeper, edit the configuration file below; it contains the key
configuration properties mentioned above:
/etc/zookeeper/conf/zoo.cfg
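A minimal zoo.cfg might contain entries along these lines (the data directory and server host names are placeholders):
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/hadoop/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888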
465
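For example, the ruok four letter word command can be sent to a ZooKeeper server with netcat (host and port here are assumed to be the local defaults):
$ echo ruok | nc localhost 2181
imok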
Notice that ZooKeeper echoed back imok. A full list of the four letter word commands is
provided below:
Command  Description
conf     Print details about the server's configuration.
cons     List full connection/session details for all clients connected to this server.
crst     Reset connection/session statistics for all connections.
dump     List the outstanding sessions and ephemeral znodes (works only on the leader).
envi     Print details about the serving environment.
ruok     Test whether the server is running in a non-error state; the server responds with imok.
srst     Reset server statistics.
srvr     List full details for the server.
stat     List brief details for the server and its connected clients.
wchs     List brief information on watches for the server.
wchc     List detailed information on watches for the server, by session: sessions with their associated watches (paths). Note: depending on the number of watches this operation may be expensive (i.e. impact server performance); use it carefully.
wchp     List detailed information on watches for the server, by path: paths (znodes) with their associated sessions. Note: depending on the number of watches this operation may be expensive (i.e. impact server performance); use it carefully.
mntr     Output a list of variables that can be used for monitoring the health of the cluster.
Service           Servers       Port  Description
ZooKeeper Server  All ZK nodes  2888  Peer-to-peer communication.
ZooKeeper Server  All ZK nodes  3888  Peer-to-peer leader election.
ZooKeeper Server  All ZK nodes  2181  Clients connect to this port.
467
HBase Synopsis
HBase is a NoSQL database known as the Hadoop Database. Because HDFS is a resilient,
highly scalable distributed file system, HBase capitalizes on these characteristics by
persisting its data directly to HDFS.
HBase Architecture
[Diagram: HBase architecture. An HMaster coordinates with a three-node ZooKeeper ensemble; RegionServers are co-located with DataNodes and persist their data to HDFS.]
468
Components
Rowkey: Data is always identified by a rowkey. A rowkey can be thought of as the
primary key found in relational databases. It is a unique key that identifies a row
in HBase. Rowkeys are always sorted lexicographically in ascending order within regions.
Region: A region is a collection of rows that is managed by one of the RegionServers.
RegionServer: The HBase worker node, a Java process that is co-located with
DataNodes. A RegionServer can load HBase block files into memory for caching and scan
blocks locally, and is thus co-located with the data blocks that make up the regions it
manages.
HMaster: Responsible for HBase maintenance tasks such as load balancing and
orchestrating recovery when a RegionServer fails. Since clients talk directly to
RegionServers, HBase can continue functioning even if the HMaster goes down;
however, the HMaster should be restarted as soon as possible.
ZooKeeper: ZooKeeper handles all of the configuration management. Clients always talk
to ZooKeeper first to find the appropriate RegionServer to talk to.
Since HBase RegionServers have a data block cache, heap sizes for
RegionServers are often very large. It is recommended (resources permitting) to
set the heap for RegionServers to at least 8 GB.
469
Configuring HBase
HBase configuration properties are described above.
Ports Firewall considerations:

Service              Servers        Port   Protocol  Description
HMaster              Masters        60000            RegionServer communication with the HMaster.
HMaster Info Web UI  Masters        60010  http      HMaster Web UI stats.
RegionServer         RegionServers  60020            Client to RegionServer, Master to RegionServer, and RegionServer to RegionServer communications.
RegionServer         RegionServers  60030  http      RegionServer Web UI stats.
HCatalog
HCatalog is Hive's table and storage management layer.
[Diagram: MapReduce, Pig, Hive, and streaming applications access data through HCatalog, which abstracts underlying storage formats such as ORC, RC, Text, Sequence, custom formats, and HBase.]
HCatalog
HCatalog allows the creation of schema definitions that will be accessed from
applications. This allows the schema definition to be outside of the application code.
HCatalog is a set of interfaces that provide access to Hive's metastore for different types
of applications.
HCatalog provides:
A table abstraction so that users need not be concerned with where or how their
data is stored.
Interoperability across data processing tools such as Pig, MapReduce, and Hive.
The HCatalog CLI supports all Hive DDL commands that do not require MapReduce. HCatalog is
used to create, alter, and drop tables, etc. The HCatalog CLI supports commands like SHOW
TABLES and DESCRIBE TABLE.
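For instance (the table name and columns are made up), a table could be created and listed from the HCatalog CLI with:
$ hcat -e "CREATE TABLE weblogs (ip STRING, ts STRING, url STRING) PARTITIONED BY (dt STRING) STORED AS ORC;"
$ hcat -e "SHOW TABLES;"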
471
[Diagram: HDP1 HDFS architecture. A single NameNode manages the namespace and block management and persists changes to the edits log, while DataNodes 1 through n store the replicated blocks and send heartbeats to the NameNode.]
DataNodes handle all I/O, storage and block management on data node machines (slave
servers). Data blocks are replicated for high availability.
The HDP1 NameNode architecture scales to approximately 5,000 data nodes. One of
the big advantages of a Hadoop platform is the ability to coordinate data from all types
of different sources. Customers want to put more of their data into a central data lake
versus creating lots of Hadoop clusters.
472
In HDP1, a namespace volume = a single namespace + its block storage. All tenants shared a single
namespace.
With a single namespace there is no isolation in a multi-user environment.
In HDP1, customer clusters can contain 4,500+ nodes, 100+ PB of storage, and 400+ million
files, and they keep growing bigger.
HDFS has over 7 9s of data reliability with less than 0.38 failures across 25
clusters.
HDFS offers fast repair time for disk failure or node failure. In HDFS, repairs can
occur in minutes versus RAID arrays where fixes can take hours.
473
Federating NameNodes
Hadoop clusters are increasing in size, workloads and complexity.
At Facebook, HDFS has around 2600 nodes, 300 million files and blocks,
addressing up to 60PB of storage.
The number of files in HDFS is limited by the amount of memory in a single NameNode.
More RAM in a single machine creates more garbage collection issues.
Multiple NameNodes increase the total amount of memory available and therefore the
number of files that can be stored in HDFS.
474
475
[Diagram: NameNode federation. NameNode 1 through NameNode n each manage their own namespace (fsimage and edits log) and block pool (Pool 1 through Pool n), performing block management independently. All DataNodes store blocks for every block pool and send heartbeats to every NameNode.]
476
Namespace Volume
The NameServiceID is an identifier for coordinating a NameNode with its backup,
secondary, or checkpointing nodes. The NameServiceID is used in the configuration files
to identify the set of nodes associated with a namespace. All NameNodes share all of the
DataNodes in the cluster: DataNodes store blocks for all the namespace volumes; there
is no partitioning.
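A minimal sketch of a federated hdfs-site.xml (the nameservice names ns1 and ns2 and the host names are placeholders) might define two namespaces like this:
<property>
<name>dfs.nameservices</name>
<value>ns1,ns2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns1</name>
<value>nn-host1:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns2</name>
<value>nn-host2:8020</value>
</property>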
A federation of NameNodes is a simple design and required minimal changes to existing
NameNode code.
Separating the namespace and block management also allows block storage to become
a separate service. The namespace happens to be one of the applications that uses the
service. This opens up the potential of associating different types of services on block
storage. Examples:
HBase
New block categories can be created in the future to support different types
of garbage collection and optimization for different types of applications.
Foreign namespaces
477
478
479
480
The HDFS cluster can be started from any node as long as the HDFS configuration
information is available. The startup process starts the NameNodes, and the DataNodes
listed in the slaves file are started as well.
$HADOOP_PREFIX_HOME/bin/start-dfs.sh
$HADOOP_PREFIX_HOME/bin/stop-dfs.sh
481
The Cluster Web Console reports cluster-wide information such as the number of files
and the number of blocks. You can run the Cluster Web Console from any NameNode:
http://<NameNodeHost>:<port>/dfsclusterhealth.jsp
NameNodes can be added and removed in a Federated cluster without restarting the
cluster.
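For example, after adding a new NameNode to the configuration, each DataNode can be told to pick it up without a restart (the DataNode host name and IPC port below are placeholders):
$ hdfs dfsadmin -refreshNamenodes datanode-host:8010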
482
483
Configuration Parameters

Property                                       Value (examples)                          Description
dfs.nameservices                               mycoolcluster
dfs.ha.namenodes.[nameservice ID]              nn1,nn2                                   NameNode IDs
dfs.ha.automatic-failover.enabled              true
ha.zookeeper.quorum                            <Host1>:2181,<Host2>:2181,<Host3>:2181
dfs.journalnode.edits.dir                      /localdirpath/journalnode/
dfs.namenode.rpc-address.mycoolcluster.nn1     <Host1>:8020
dfs.namenode.http-address.mycoolcluster.nn1    <Host1>:50070
dfs.namenode.rpc-address.mycoolcluster.nn2     <Host2>:8020
dfs.namenode.http-address.mycoolcluster.nn2    <Host2>:50070

dfs.ha.automatic-failover.enabled is set in hdfs-site.xml; ha.zookeeper.quorum is set in core-site.xml.
484
sshfence: Uses SSH to connect to the active NameNode and kill the active
NameNode process.
For example:
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/exampleuser/.ssh/id_rsa</value>
</property>
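Alternatively, an arbitrary fencing script can be configured; as a sketch (the script path is hypothetical), the shell fencing method looks like:
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/path/to/my/fence_script.sh)</value>
</property>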
485