HDPOps-ManageAmbari Docker GA Rev3

HDP Operations: Install and Manage
With Apache Ambari

A Hortonworks University
Hadoop Training Course
Copyright 2014, Hortonworks, Inc. All rights reserved.
Title: HDP Operations: Install and Manage with Apache Ambari

Version: GA
Revision: 3
Date: Jul 30, 2014
Copyright 2013-2014 Hortonworks Inc. All rights reserved.
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation.
The contents of this course and all its related materials, including lab exercises and files, are Copyright Hortonworks
Inc. 2014.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any
means electronic, mechanical, photocopying, recording or otherwise without the prior written permission of
Hortonworks Inc. All rights reserved.
Table of Contents
Table of Contents ...................................................................................................................... 4
Course Introduction.............................................................................................................. 10
Unit 1: Introduction to HDP and Hadoop 2.0 ............................................................... 11
Enterprise Data Trends @ Scale ................................................................................................. 12
What is Big Data? ............................................................................................................................. 13
A Market for Big Data ..................................................................................................................... 14
Most Common New Types of Data.............................................................................................. 15
Moving from Causation to Correlation..................................................................................... 17
What is Hadoop? .............................................................................................................................. 19
What is Hadoop 2.0? ....................................................................................................................... 20
Traditional Systems vs. Hadoop ................................................................................................. 21
Overview of a Hadoop Cluster ..................................................................................................... 22
Who is Hortonworks?..................................................................................................................... 23
The Hortonworks Data Platform ............................................................................................... 24
Use Case: EDW before Hadoop .................................................................................................... 26
Banking Use Case: EDW with HDP ............................................................................................. 27
Unit 2: HDFS Architecture................................................................................................... 29

What is a File System? .................................................................................................................... 30
OS Architecture ................................................................................................................................ 31
HDFS Architecture ........................................................................................................................... 32
Understanding Block Storage...................................................................................................... 34
Demonstration: Understanding Block Storage ..................................................................... 36
The NameNode ................................................................................................................................. 39
The DataNodes.................................................................................................................................. 41
DataNode Failure ............................................................................................................................. 43
HDFS Clients ...................................................................................................................................... 45
Unit 3: Installation Prerequisites and Planning ......................................................... 47

Minimum Hardware Requirements .......................................................................................... 48
Minimum Software Requirements ............................................................................................ 49
A Formidable Starter Cluster ...................................................................................................... 50
Lab 3.1: Setting up the Environment ........................................................................................ 52
Lab 3.2: Install HDP 2.0 Cluster using Ambari ...................................................................... 54
Unit 4: Configuring Hadoop ................................................................................................ 65

Configuration Considerations ..................................................................................................... 66
Deployment Layout ......................................................................................................................... 67
Configuring HDFS............................................................................................................................. 69
What is Ambari ................................................................................................................................. 72
Configuration via Ambari ............................................................................................................. 73
Management ...................................................................................................................................... 74
Monitoring ......................................................................................................................................... 76
REST API.............................................................................................................................................. 77
Lab 4.1: Add a New Node to the Cluster ................................................................................... 79
Lab 4.2: Stopping and Starting HDP Services......................................................................... 81
Lab 4.3: Using HDFS Commands ................................................................................................. 87
Unit 5: Ensuring Data Integrity ......................................................................................... 95

Ensuring Data Integrity ................................................................................................................. 96
Replication Placement ................................................................................................................... 98
Data Integrity Writing Data .................................................................................................... 100
Data Integrity Reading Data ................................................................................................... 102
Data Integrity - Block Scanning ................................................................................................ 103
Running a File System Check ..................................................................................................... 105
What Does the File System Check Look For? ....................................................................... 106
hdfs fsck Syntax .............................................................................................................................. 108
Data Integrity File System Check: Commands & Output .............................................. 110
The dfs Command .......................................................................................................................... 112
NameNode Information ............................................................................................................... 114
Changing the Replication Factor .............................................................................................. 115
Lab 5.1: Verify Data with Block Scanner and fsck .............................................................. 117
Unit 6: HDFS NFS Gateway ................................................................................................ 123
HDFS NFS Gateway Introduction .............................................................................................. 124

NFS Gateway Node ......................................................................................................................... 126
Configuring the HDFS NFS Gateway ........................................................................................ 127
Starting the NFS Gateway Service ............................................................................................ 129
User Authentication ...................................................................................................................... 131
Lab 6.1: Mounting HDFS to a Local File System ................................................................... 133
Unit 7: YARN Architecture and MapReduce ............................................................... 136
What is YARN? ................................................................................................................................. 137

Hadoop as Next-Gen Platform ................................................................................................... 138
Beyond MapReduce ...................................................................................................................... 140
YARN Use-case ................................................................................................................................ 141
YARN Birds Eye View ................................................................................................................... 142
Lifecycle of a YARN Application ................................................................................................ 144
ResourceManager .......................................................................................................................... 146
NodeManager .................................................................................................................................. 147
MapReduce....................................................................................................................................... 148
Demonstration: Understanding MapReduce ....................................................................... 150
Configuring YARN .......................................................................................................................... 152
Configuring MapReduce .............................................................................................................. 154
Tools ................................................................................................................................................... 156
Lab 7.1: Troubleshooting a MapReduce Job......................................................................... 159
Unit 8: Job Schedulers ........................................................................................................ 165
Overview of Job Scheduling ....................................................................................................... 166

The Built-in Schedulers ............................................................................................................... 167
Overview of the Capacity Scheduler ....................................................................................... 168
Configuring the Capacity Scheduler ........................................................................................ 170
Defining Queues ............................................................................................................................. 171
Configuring Capacity Limits ....................................................................................................... 173
Configuring User Limits............................................................................................................... 175
Configuring Permissions ............................................................................................................. 176
Overview of the Fair Scheduler ................................................................................................ 177
Configuration of the Fair Scheduler ........................................................................................ 178
Lab 8.1: Configuring the Capacity Scheduler ....................................................................... 180
Unit 9: Enterprise Data Movement ................................................................................ 185

Enterprise Data Movement ........................................................................................................ 186
Challenges with a Traditional ETL Platform ........................................................................ 188
Hadoop Based ETL Platform ...................................................................................................... 189
Data Ingestion ................................................................................................................................. 190
Hadoop: Data Movement ............................................................................................................. 191
Defining Data Layers .................................................................................................................... 193
Distributed Copy (distcp) Command ...................................................................................... 194
Distcp Options................................................................................................................................. 196
Using distcp...................................................................................................................................... 198
Using distcp for Backups ............................................................................................................. 200
Lab 9.1: Use distcp to Copy Data from a Remote Cluster ................................................. 202
Unit 10: HDFS Web Services ............................................................................................. 204
What is WebHDFS? ........................................................................................................................ 205

Setting up WebHDFS ..................................................................................................................... 207
Using WebHDFS .............................................................................................................................. 209
WebHDFS Authentication ........................................................................................................... 211
Copying Files to HDFS................................................................................................................... 213
Hadoop HDFS over HTTP ............................................................................................................ 218
Who Uses WebHCat REST API?.................................................................................................. 220
Running WebHCat ......................................................................................................................... 223
Using WebHCat ............................................................................................................................... 224
Lab 10.1: Using WebHDFS........................................................................................................... 226
Unit 11: Hive Administration .......................................................................................... 230
Introduction to Hive ..................................................................................................................... 231

Comparing Hive with RDBMS .................................................................................................... 232
Hive MetaStore ............................................................................................................................... 233
HiveServer2 ..................................................................................................................................... 235
Hive Command Line Interface ................................................................................................... 236
Processing Hive SQL Statements .............................................................................................. 238
Hive Data Hierarchical Structures........................................................................................... 239
Hive Tables....................................................................................................................................... 242
Defining a Hive-Managed Table................................................................................................ 243
Defining an External Table ......................................................................................................... 244
Defining a Table LOCATION ....................................................................................................... 244
Loading Data into Hive ................................................................................................................ 245
Performing Queries ...................................................................................................................... 247
Guidelines for Architecting Hive Data.................................................................................... 248
Hive Query Optimizations .......................................................................................................... 249
Hive/MR versus Hive/Tez .......................................................................................................... 250
ORCFile Example ............................................................................................................................ 251
Compression .................................................................................................................................... 252
Hive Security ................................................................................................................................... 254
Lab 11.1: Understanding Hive Tables .................................................................................... 256
Unit 12: Transferring Data with Sqoop ........................................................................ 262

Overview of Sqoop ......................................................................................................................... 263
The Sqoop Import Tool ................................................................................................................ 265
Importing a Table .......................................................................................................................... 267

Importing Specific Columns ....................................................................................................... 269
Importing from a Query .............................................................................................................. 270
The Sqoop Export Tool ................................................................................................................ 272
Exporting to a Table...................................................................................................................... 274
Lab 12.1: Using Sqoop .................................................................................................................. 276
Unit 13: Flume....................................................................................................................... 284
Flume Introduction ....................................................................................................................... 285

Installing Flume ............................................................................................................................. 287
Flume Events ................................................................................................................................... 289
Flume Sources ................................................................................................................................. 290
Flume Channels .............................................................................................................................. 292
Flume Channel Selectors ............................................................................................................. 294
Flume Channel Selector ............................................................................................................... 295
Flume Sinks ...................................................................................................................................... 296
Multiple Sinks ................................................................................................................................. 298
Flume Interceptors ....................................................................................................................... 300
Design Patterns .............................................................................................................................. 302
Configuring Individual Components....................................................................................... 303
Flume Netcat Source Example ................................................................................................... 305
Flume Exec Source Example ...................................................................................................... 307
Flume Configuration ..................................................................................................................... 308
Monitoring Flume .......................................................................................................................... 310
Lab 13.1: Install and Test Flume .............................................................................................. 312
Unit 14: Oozie ........................................................................................................................ 315
Oozie Overview............................................................................................................................... 316

Oozie Components......................................................................................................................... 317
Jobs, Workflows, Coordinators, Bundles............................................................................... 318
Workflow Actions and Decisions ............................................................................................. 320
Oozie Actions ................................................................................................................................... 321
Oozie Job Submission ................................................................................................................... 323
Oozie Server Workflow Coordinator ...................................................................................... 324
Oozie Console .................................................................................................................................. 325
Interfaces to Oozie......................................................................................................................... 326
Oozie Server Configuration ........................................................................................................ 327
Oozie Scripts .................................................................................................................................... 330
The Oozie CLI................................................................................................................................... 332
Using the Oozie CLI........................................................................................................................ 334
Submit Jobs through HTTP ......................................................................................................... 336
Lab 14.1: Running an Oozie Workflow................................................................................... 338
Unit 15: Monitoring HDP2 Services ............................................................................... 344
Ambari ............................................................................................................................................... 345

Monitoring Architecture ............................................................................................................. 347
Monitoring HDP2 Clusters .......................................................................................................... 348
Ambari Web Interface .................................................................................................................. 350
Ambari Web Interface (cont.) ................................................................................................... 352
Ganglia ............................................................................................................................................... 354
Ganglia Monitoring a Hadoop Cluster .................................................................................... 355
Nagios................................................................................................................................................. 357
Nagios UI ........................................................................................................................................... 359
Monitoring JVM Processes .......................................................................................................... 360
Understanding JVM Memory ..................................................................................................... 362
Eclipse Memory Analyzer ........................................................................................................... 364
JVM Memory Heap Dump ............................................................................................................ 366
Java Management Extensions (JMX) ....................................................................................... 368
Unit 16: Commissioning and Decommissioning Nodes .......................................... 370

Architectural Review.................................................................................................................... 371
Decommissioning and Commissioning Nodes .................................................................... 373
Decommissioning Nodes ............................................................................................................. 374
Steps for Decommissioning a DataNode................................................................................ 376
Decommissioning Node States .................................................................................................. 378
Steps for Commissioning a Node .............................................................................................. 379
Balancer ............................................................................................................................................ 381
Balancer Threshold Setting ....................................................................................................... 382
Configuring Balancer Bandwidth............................................................................................. 384
Lab 16.1: Commissioning & Decommissioning DataNodes ............................................ 386
Unit 17: Backup and Recovery ........................................................................................ 392
What should you backup? ........................................................................................................... 393

HDFS Snapshots .............................................................................................................................. 394
HDFS Data - Backups .................................................................................................................... 395
HDFS Data Automate & Restore ............................................................................................ 396
Hive & Ambari Backup ................................................................................................................. 397
Lab 17.1: Using HDFS Snapshots .............................................................................................. 399
Unit 18: Rack Awareness and Topology ...................................................................... 402
Rack Awareness ............................................................................................................................. 403

YARN Rack Awareness ................................................................................................................. 404
Replica Placement ......................................................................................................................... 405
Rack Topology ................................................................................................................................ 406
Rack Topology Script.................................................................................................................... 408
Configuring the Rack Topology Script.................................................................................... 409
Lab 18.1: Configuring Rack Awareness ................................................................................. 411
Unit 19: NameNode HA ...................................................................................................... 414
NameNode Architecture HDP1 ................................................................................................. 415

NameNode High Availability ...................................................................................................... 416
HDFS HA Components .................................................................................................................. 417
Understanding NameNode HA .................................................................................................. 419
NameNodes in HA .......................................................................................................................... 421
Failover Modes ............................................................................................................................... 423
hdfs haadmin Command ............................................................................................................. 426
Red Hat HA ....................................................................................................................................... 427
VMware HA ...................................................................................................................................... 428
Lab 19.1: Implementing NameNode HA................................................................................. 430
Unit 20: Securing HDP ........................................................................................................ 437
Security Concepts .......................................................................................................................... 438

Kerberos Synopsis ......................................................................................................................... 440
HDP Security Overview................................................................................................................ 442

Securing HDP Authentication................................................................................................. 444
Securing HDP - Authorization ................................................................................................... 445
Lab 20.1: Securing a HDP Cluster ............................................................................................. 446
Appendix A: Unit Review Answers ................................................................................ 451

Appendix B: Other Hadoop Tools................................................................................... 456
Data Lifecycle Management ....................................................................................................... 457

Data Lifecycle Management on Hadoop ................................................................................ 458
Falcon Use Cases and Capabilities ........................................................................................... 459
Falcon ................................................................................................................................................. 460
Future: Knox .................................................................................................................................... 461
ZooKeeper Synopsis ..................................................................................................................... 462
Configuring ZooKeeper................................................................................................................ 465
HBase Synopsis ............................................................................................................................... 468
Configuring HBase ......................................................................................................................... 470
HCatalog ............................................................................................................................................ 471
Federating NameNodes ............................................................................................................... 474
Namespace Volume ....................................................................................................................... 477
Benefits of Independent Block Pools ...................................................................................... 478
Namespaces Increase Scalability ............................................................................................. 479
Configuring NameServices for DataNodes ............................................................................ 480
Block Management with Federation ....................................................................................... 482
Federation Configuration Parameters ................................................................................... 484
Course Introduction
10
Welcome to Hortonworks University
Course Agenda
Introductions
Overview of Hortonworks Certification
Unit 1: Introduction to HDP and

Hadoop 2.0
Topics covered:
Enterprise Data Trends @ Scale
What is Big Data?
A Market for Big Data
Most Common New Types of Data
Moving from Causation to Correlation
What is Hadoop?
What is Hadoop 2.0?
Traditional Systems vs. Hadoop
Overview of a Hadoop Cluster
Who is Hortonworks?
The Hortonworks Data Platform
Hadoop Use Case
Lab 1.1: Login to Your Cluster
11

Organizations are redefining data strategies due to the requirements of the
evolving Enterprise Data Warehouse (EDW).
Machine
Data
Social Media
VoIP
Enterprise
Data

The volume of data that is available for analysis is transforming organizations, as well as
the entire IT industry. Everyone is seeing data external to an organization as becoming
just as strategic as internal data. Semi-structured and unstructured data volume is
beginning to dwarf the traditional data in relational databases and data warehouses.
12
Facebook has around 50 PB warehouse and its constantly growing.
Twitter messages are 140 bytes each generating 8TB data per day.
Data is more than doubling every year.
Almost 80% of data will be unstructured data.
Netflix: 75% of streaming video results from recommendations.
Amazon: 35% of product sales come from product recommendations.
What is Big Data?

Big data is high-volume, -velocity and -variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight
and decision-making.
Gartners Big Data definition is broken into three parts:
Part One: 3Vs: Gartner analyst Doug Laney came up with famous three Vs
(Volume, Velocity and Variety) in 2001.
Part Two: Cost-Effective, Innovative Forms of Information Processing:

Organizations are looking to access unstructured and semi-structured data and
process that data with traditional structured data to perform comprehensive
analysis.
Part Three: Enhanced Insight and Decision Making - The goal of working with
big data is to increase business value and to respond quicker and with more
accuracy to meet well-defined business objectives.
Source: FORBES - http://www.forbes.com/sites/gartnergroup/2013/03/27/gartners-bigdata-definition-consists-of-three-parts-not-to-be-confused-with-three-vs/

13
A Market for Big Data

IDC is predicting a big data market that will grow revenue at 31.7 percent a year until it
hits the $23.8 billion mark in 2016. Thats a big number for a relatively new market, but
it only tells part of the story of where big data technology will make money. Defining
big data isnt always an easy task, and breaking it out into a group of separate
technologies might not be either. While this report appears to subsume a May 2012
report from IDC predicting an $813 million Hadoop market, it certainly doesnt include
the market for analytics software. In July, IDC predicted that market which is a critical
piece of the overall big data picture would hit $51 billion by 2016. (IBMs Steve Mills
said he expects IBM to do $15 billion in analytics revenue itself by 2015.)
Despite challenges, such as the lack of clear big data strategies, security concerns, and
the need for workforce re-skilling, the growth potential of Big Data is unprecedented.
Mind Commerce estimates that global spending on Big Data will grow at a CAGR of 48%
between 2014 and 2019. Big Data revenues will reach $135 Billion by the end of 2019.
This report provides an in-depth assessment of the global Big Data market, including a
study of the business case, application use cases, vendor landscape, value chain analysis,
case studies and a quantitative assessment of the industry from 2013 to 2019.
Source: http://www.researchmoz.us/big-data-market-business-case-market-analysisand-forecasts-2014-2019-report.html
14

1. Sentiment
Understand how your customers feel about your brand and
products right now
2. Clickstream
Capture and analyze website visitors data trails and
optimize your website
3. Sensor/Machine
Discover patterns in data streaming automatically from
remote sensors and machines
4. Geographic
Value
Analyze location-based data to manage operations where

they occur
5. Server Logs
Research logs to diagnose process failures and prevent
security breaches
6. Unstructured (txt, video, pictures, etc..)

Understand patterns in files across millions of web pages,
emails, and documents
+ Keep existing
data longer!
Sentiment: The most commonly sighted source, analyzing language usage, text
and computational iinguistics in an attempt to better analyze subjective
information. Many companies are trying to leverage this data to provide
sentiment trackers, identify influencers etc.
Clickstream: The trail a user leaves behind as he navigates your website.

Analyze the trail to optimize website design.
Sensor/Machine: These are everywhere: cars, health equipment, smartphones,

etc. Nike put one in shoes. Someone also put one in baby diapers! They call it
proactive maintenance.
15
16
Geographic: Location based data a common use being location based

targeting. This data has much more wider application in supply chain
optimization across the manufacturing industry allowing organizations to
optimize routes, predict inventory levels, etc.
Server logs: This one is not new to the IT world. You often lose precious trails
and information when you simply roll over log files. Today, you should not have
to lose this data; you just save the data in Hadoop!
Text: Text is everywhere. We all love to express ourselves - every blog, article,
news site, ecommerce site you go these days, you will find people putting out
their thoughts. And this is on top of the already existing text sources like surveys
and the Web content itself. How do you store, search and analyze all this text
data to glean for key insights? Hadoop!
Moving from Causation to Correlation

Big data allows your organization to generate results that are more accurate and that
you can have more confidence in, because there is more detailed data that has been
correlated with other sources to provide more accuracy. In addition, the ability to
reduce the business latency of the time between when the data hits the disk to the time
the data can be used to make business decisions can be a critical success factor for an
organizations success.
Microsoft was looking to improve the grammar accuracy for Microsoft Word. The
researchers Michele Banko and Eric Brill found the more data they fed into existing
algorithms, the more accurate the algorithms got. They took four common algorithms
and fed them 10 million, 100 million and then 1 billion records. One algorithm that had
an accuracy of 75% went up to 95% when fed more data. The more data they looked at,
the smarter the algorithms got.
Amazon does translations in over 60 languages and its translations are considered the
best. Amazon has massive amounts of data that they have access to and this data gives
them a strategic advantage if they use it properly. Companies are seeing that the more
data they analyze, the more accurate the results become.
17
Organizations are also looking at extra data that comes from social media and machine
data and correlating it with their existing traditional data. Correlating data from multiple
sources is generating much higher data analysis results.
Businesses that can use big data to generate more detailed results with a higher degree
of accuracy will be at a competitive advantage. Its about being able to out Hadoop
your competition.
Data driven decisions are better decisions its as simple as that. Using big
data enables managers to decide on the basis of evidence rather than
intuition. For that reason it has the potential to revolutionize management.
Harvard Business Review, October 2012
By 2015, organizations that build a modern information management

system will outperform their peers financially by 20 percent.
Gartner, Mark Beyer, Information Management in the 21st Century
18
What is Hadoop?
Hadoop is all about processing and storage. Hadoop is a software framework
environment that provides a parallel processing environment on a distributed file
system using commodity hardware. A Hadoop cluster is made up of master processes
and slave processes spread out across different x86 servers. This framework allows
someone to build a Hadoop cluster that offers high performance super computer
capability.
Wikipedia states: Apache Hadoop is an open-source software framework

that supports distributed applications, licensed under the Apache v2 license
(public domain)". It enables applications to work with thousands of
computational independent computers and petabytes of data.
19
What is Hadoop 2.0?

Hadoop 2.0 refers to the next generation of Hadoop. As expected, the Hadoop
framework has grown to meet the demands of its own popularity and usage, and 2.0
reflects the natural maturing of the open-source project.
The Apache Hadoop project consists of the following modules:
20
Hadoop Common: the utilities that provide support for the other Hadoop
modules.
HDFS: the Hadoop Distributed File System
YARN: a framework for job scheduling and cluster resource management.
MapReduce: for processing large data sets in a scalable and parallel fashion.

SCALE (storage & processing)
Traditional
Database
EDW
Required on write
MPP
Analytics
schema
speed
Reads are fast
NoSQL
Hadoop
Distribution
Required on read
Writes are fast
Standards and structured
governance
Loosely structured
Limited, no data processing
processing
Processing coupled with data
Structured
data types
Multi and unstructured
best fit use
Data Discovery
Processing unstructured data
Massive Storage/Processing
Interactive OLAP Analytics

Complex ACID Transactions
Operational Data Store

Hadoop is not designed to replace existing relational databases or data warehouses.
Relational databases are designed to manage transactions. They contain a lot of
feature/functionality designed around managing transactions. They are based upon
schema-on-write. Organizations have spent years building Enterprise Data Warehouses
(EDW) and reporting systems for their traditional data. The traditional EDWs are not
going anywhere either. EDWs are also based on schema-on-write.
Hadoop is not:
Relational
NoSQL
Real-time
A database
Hadoop is a data platform that compliments existing data systems. Hadoop is designed
for schema-on-read and can handle the large data volumes coming from semistructured and unstructured data. With the low cost of storage on Hadoop,
organizations are looking at using Hadoop more for archiving.
21

A Hadoop cluster is made up of master and slave servers
Master servers manage the infrastructure
Slave servers contain the distributed data and perform processing
Master Servers NameNode, ResourceManager, Standby Name Node, HBase Master
Master Node 1
NameNode
Oozie Server
ZooKeeper
Master Node 2
ResourceManager
Standby NameNode
HBase Master
HiveServer2
ZooKeeper
Management Node
Ambari Server
Ganglia/Nagios
WebHCat Server
JobHistoryServer
ZooKeeper
Slave Servers NodeManager, DataNode, HBase RegionServer

DataNode 1
DataNode
NodeManager
H RegionServer
DataNode 2
DataNode
NodeManager
H RegionServer
DataNode 3
DataNode
NodeManager
H RegionServer
DataNode n
DataNode
NodeManager
H RegionServer

A Hadoop cluster consists of the following components:
NameNode: a master server that manages the namespace of HDFS.
DataNodes: slave servers that store blocks of data.
ResourceManager: the master server of the YARN processing framework.
NodeManagers: slave servers of the YARN processing framework.
HBase components: HBase also has a master server and slave servers called
RegionServers.
Some of the components working in the background of a cluster include ZooKeeper,

Ambari, Ganglia, Nagios, JobHistory, HiveServer2, and WebHCat.
22
Who is Hortonworks?
OPERATIONAL
SERVICES
AMBARI
FLUME
HBASE
FALCON*
OOZIE
Hortonworks
Data Platform (HDP)
DATA
SERVICES
PIG
SQOOP
HIVE &
HCATALOG
Focus on enterprise
distribution of Hadoop
LOAD &
EXTRACT
HADOOP
CORE
PLATFORM
SERVICES
NFS
WebHDFS
KNOX*
MAP
REDUCE
TEZ
YARN
True open source model, no

vendor lock-in
HDFS
Enterprise Readiness
Defining Hadoop roadmap
High Availability, Disaster

Recovery, Rolling Upgrades,
Security and Snapshots
HORTONWORKS
DATA PLATFORM (HDP)
OS/VM
Cloud
Appliance
Who is Hortonworks?
Hortonworks develops, distributes and supports Enterprise Apache Hadoop:
Develop: Hortonworks was formed by the key architects, builders, and operators
from Yahoo! Hortonworks software engineering team has led the effort to
design and build every major release of Apache Hadoop from 0.1 to the most
current stable release, contributing more than 80% of the code along the way.
Distribute: We provide a 100% Open Source Distribution of Apache Hadoop,

adding the required Operational, Data and Platform services from the open
source community in the Hortonworks Data Platform (HDP).
Support: We provide a range of support options for customers of the

Hortonworks Data Platform and are the leading provider of expert Hadoop
training available today.
23
HDP: Reliable, Consistent & Current

HDP demonstrates most recent community innovation
0.96.0
0.12.0
2.2.0
HDP 1.3
0.5.0
1.4.1
1.2.0
1.0.3
HMC1.1
3.1.3
Ambari
Mahout
HMC1
Zookeeper
HDP 1.0
3.2.0
Oozie
2012
1.2.3
3.4.5
3.3.4
0.92.1
0.4.0
Sqoop
JUNE
HDP 1.1
3.3.2
0.9.0
0.9.2
0.8.0
4.0.0
0.7.0
0.94.2
HBase
2012
1.4.3
1.4.2
HCatalog
SEPT
0.10.0
0.10.1
HDP 1.2
Hadoop
2013
0.94.6
0.11
1.1.2
FEB
1.4.4
0.11.0
Pig
May
2013
1.4.1
0.12.0
HDP 2.0
Hive
OCT
2013
Hortonworks Data Platform
The Hortonworks Data Platform

Individuals can download a free release of Hadoop from the Apache Software
Foundation (100% free open source). This gives someone the opportunity to test
different versions of the different frameworks that make up Hadoop. A company looking
to run a production version of Hadoop will want an enterprise version of Hadoop. An
enterprise version of Hadoop has gone through rigorous system, function and
regression testing of the distribution. Its important for an enterprise version of Hadoop
to determine what is the best combination of all the frameworks. Finding the best stable
release of a framework that works the best with the other frameworks is critical for
stability. Most organizations will not work with the Apache Hadoop distribution for a
production release for the following two reasons:
24
It takes a tremendous amount of skill and testing to find the right combination
for all the frameworks.
Other software runs along side of Hadoop. It is hard for a software vendor to
work with customers that have their own unique distribution of the Hadoop
frameworks.
The Hortonworks Data Platform (HDP) is the Hadoop distribution provided by

Hortonworks. HDP is perceived in the industry as the enterprise distribution of
Hadoop. Hortonworks is recognized for their platform expertise around Hadoop as well
as being the organization that is defining the roadmap for Hadoop.
Hortonworks:
Utilizes HDP, a 100% free, open source distribution of Hadoop. Every line of code
generated by Hortonworks is put back into the Apache Software Foundation.
Is considered to be the distribution defining the Hadoop road map due to

Hortonworks creating significantly more lines of open source code for Hadoop
than any other source.
Has developed over 614,041 lines of code compared to the next nearest
distribution vendor with 147,933 lines of code (based on a recent comparison).
Has 21 committers compared to the nearest distribution vendor having 7

committers.
Tests HDP at a much larger scale than any other distribution. HDP is certified and
tested at scale.
25
Use Case: EDW before HDP

Unstructured
Data
DM
Log files
Exhaust Data
DM
Social Media
Sensors,
devices
ETL
EDW
DM
DB data
EDW was occupied with ETL, SLAs Suffer

Schema required for ingest. New source? Adjust ETL+ EDW
Data Discarded due to Scale and Cost (Volume, Speed)
Lots of time to understand process, high latency, low value
Use Case: EDW before Hadoop

Before Hadoop, a banks Enterprise Data Warehouse was ingesting large amounts of
data with the following characteristics:
26
A schema was required for ingestion. When a new source of data was
introduced, new schemas had to be created (which took up to a month!).
SLAs were suffering because the EDW was busy performing ETL.
Data had to be thrown out after 2 to 5 days because it was not cost-effective to
maintain it.
The bank was missing out on new data sources, and also historical data was lost.
Use Case: EDW with HDP

Unstructured
Data
Explore
DM
Log files
Exhaust Data
Big Data
Platform
Social Media
EDW
DM
Sensors,
devices
DB data
Consolidate data types: structured/polystructured

Data available with minimal delay (explore), On demand Schema
Active Archive: store unprecendented amounts, enable more

complete analysis
DM
Banking Use Case: EDW with HDP

By adding Hadoop, the bank is now benefitting from a list of new capabilities in its EDW:
Data is now available for use with minimal delay, which enables real-time
capture of source data.
They have a new philosophy about data: capture all data first, and then structure
the data as business needs evolve. This makes their systems much more
dynamic.
The bank now stores years worth of raw transactional data. The data is no
longer archived, it has become ACTIVE!
Data Lineage: The bank stores intermediate stages of their data, enabling a more
powerful analytics platform.
The EDW can focus less on storage and transformation and more on analytics.
Hadoop opens up an opportunity for exploration of data that was never there
before!
27
Unit 1 Review
1. The core Hadoop frameworks are __________________ and _______________.
2. True or False: Hadoop is equivalent to a NoSQL platform.
3. What is the name of the management interface used for provisioning, managing,
and monitoring Hadoop clusters? _________________
4. What processes might you find running on a Master node of a Hadoop cluster?
_________________________________________________________________
28
Unit 2: HDFS Architecture

Topics covered:
What is a File System?
OS Architecture
HDFS Architecture
Understanding Block Storage
Demonstration: Understanding Block Storage
The NameNode
The DataNodes
DataNode Failure
HDFS Clients
29
What is a File System?

Data in Hadoop is stored on a file system referred to as HDFS - the Hadoop Distributed
File System. Within HDFS, data is broken down into chunks and distributed across a
cluster of machines.
Before discussing HDFS, lets take a look at the features of common file systems:
30
Namespace: Multi-level directory trees and file names.
Metadata: All nodes in a directory tree can have various levels of ownership
(user, group, anonymous), permissions (read, write, execute), last accessed time,
create time, modified time, is-hidden, etc.
Journaling: Reliable file systems will maintain a journal of edits in case of

failures, such as power or disk failures. Journals will contain metadata related to
the edit and in some implementations, the actual data to be flushed to disk.
Storage: Storage in a file system is on a physical or network attached storage

device. These devices will persist data, which is chunked by blocks.
Tools: All file systems have tools to perform file operations as well as
administrative operations such as troubleshooting and fixing problems.
OS Architecture
A familiar file system architecture:
Operating System (OS)

Virtual File System
Namespace(s)
Tools
Metadata
File System
(ext4, ext3, xfs, etc.)
Note: File systems are components of an OS
Journaling
Storage
Disk
OS Architecture
Most common file systems are POSIX based. HDFS is also a POSIX based file system.
31
HDFS Architecture
NameNode
Namespace
Block Map
Metadata
Journaling
NameNode and
DataNodes are
daemon jvms
Disk
Tools
DataNode
DataNode
DataNode
Storage
Storage
Disk
Disk
Storage
Disk
HDFS Architecture
A Hadoop instance consists of a cluster of HDFS machines; often referred to as the
Hadoop cluster or HDFS cluster. There are two main components of a HDFS cluster:
1. NameNode: The master node of HDFS that manages the data (without
actually storing it) by determining and maintaining how the chunks of data
are distributed across the DataNodes. The NameNode will contain and
manage the namespace, metadata, journaling, and a BlockMap. The
BlockMap is an in-memory map of all the blocks that make up a file and
DataNode locations of those blocks in the HDFS cluster.
2. DataNode: Stores the chunks of data, and is responsible for replicating the
chunks across other DataNodes.
32
The NameNode and DataNode are daemon processes running in the cluster. Some
important concepts involving the NameNode and DataNodes are:
By default only one NameNode is used in a cluster, which creates a single point
of failure. We will later discuss how to enable HA in Hadoop to mitigate this risk.
Data never resides on or passes through the NameNode. Your big data only
resides on DataNodes.
DataNodes are referred to as slave daemons to the NameNode and are

constantly communicating their state with the NameNode.
The NameNode keeps track of how the data is broken down into chunks on the
DataNodes.
The default chunk size is 128MB (but is configurable).
The default replication factor is 3 (and is also configurable), which means each
chunk of data is replicated across 3 DataNodes.
DataNodes communicate with other DataNodes (through commands from the

NameNode) to achieve data replication.
NOTE: HDFS supports a traditional hierarchical file organization. A user or an

application can create directories and store files inside these directories. It is
a distributed POSIX implementation.
33
1. Client sends a request to

the NameNode to add a
file to HDFS.
NameNode
2. NameNode gives client a
lease to the file path.
3. For every block, the client will request the NameNode to provide
a new blockid and a list of destination DataNodes.
4. The client will write the block directly to the first DataNode in the
list.
DataNode 1
DataNode 2
DataNode 3
5. The first DataNode pipelines the replication to the next DataNode in the list.
Understanding Block Storage

Putting a file into HDFS involves the following steps:
1. A client application sends a request to the NameNode that specifies where they
want to put the file in HDFS.
2. The NameNode gives the client a lease to the file path. This lease will be released
if there is a failure, timeout, or will be made permanent if the write is successful
and file handle is closed.
3. For every block that the client needs to write (128MB by default), the client will
make a request to the NameNode for a new blockid and a destination list of
DataNodes for where to write the new block to.
4. Once the client gets the blockid and destination DataNodes it will start flushing
its buffer to the first DataNode in the list.
5. To replicate the block, that first DataNode will open a stream to the next
DataNode in the list and flush its persisted chunk to that DataNode. This block
replication pipeline is established to all the nodes in the list. Thus replication is
very efficient and occurs in parallel.
34
You can specify the block size for each file using the dfs.blocksize property. If you do not
specify a block size at the file level, the global value of dfs.blocksize defined in hdfssite.xml will be used.
IMPORTANT: The data never passes through the NameNode. The client
program that is uploading the data into HDFS performs I/O directly with the
DataNodes. The NameNode only stores the metadata of the file system; it is
not responsible for storing or transferring the data.
35
Demonstration: Understanding Block Storage
Objective: To understand how data is partitioned into blocks and stored

in HDFS.
During this Watch as your instructor performs the following steps.
demonstration:
Step 1: Put the File into HDFS

1.1. Change directories to /usr/lib/hadoop (or any folder containing a file larger
than 2MB):
# cd /usr/lib/hadoop
1.2. Try putting the hadoop-common JAR file into HDFS with a block size of 30
bytes:
# hadoop fs -D dfs.blocksize=30 -put hadoop-common-x.jar hadoop-common.jar
1.3. Notice 30 bytes is not a valid blocksize. The blocksize needs to be at least
1048576 according to the dfs.namenode.fs-limits.min-block-size property:
put: Specified block size is less than configured minimum
value (dfs.namenode.fs-limits.min-block-size): 30 < 1048576
1.4. Try the put again, but use a block size of 2,000,000:
# hadoop fs -D dfs.blocksize=2000000 -put hadoop-commonx.jar hadoop-common.jar
1.5. Notice 2,000,000 is not a valid block size because it is not a multiple of 512
(the checksum size).
36
1.6. Try the put again, but this time use 1,048,576 for the block size:
# hadoop fs -D dfs.blocksize=1048576 -put hadoop-commonx.jar hadoop-common.jar
1.7. This time the put command should have worked. Use ls to verify the file is in
HDFS:
# hadoop fs -ls
...
-rw-r--r-3 root root
2679929
hadoop-common.jar
Step 2: View the Number of Blocks

2.1. Run the following command to view the number of blocks that were created
for hadoop-common.jar:
# hdfs fsck /user/root/hadoop-common.jar
2.2. Notice there are three blocks. Look for the following line in the output:
Total blocks (validated):
3 (avg. block size 893309 B)
2.3. What is the average block replication for this file? ________________
Step 3: Specify a Replication Factor
3.1. Add another file from /usr/lib/hadoop into HDFS, except this time specify a
different replication factor:
# hadoop fs -D dfs.replication=2
hadoop-nfs.jar
-put hadoop-nfs-x.jar
3.2. Run the hdfs fsck command on hadoop-nfs.jar:

# hdfs fsck /user/root/hadoop-nfs.jar
3.3. Verify the average block replication for this file is 2.

Step 4: Find the Actual Blocks
4.1. Run the same fsck command on hadoop-common.jar as before, but add the
files and blocks options:
37
# hdfs fsck /user/root/hadoop-common.jar -files -blocks
Notice the output contains the block IDs, which coincidentally are the names of
the files on the DataNodes.
4.2. Change directories to the following:
# cd /hadoop/hdfs/data/current/BP-xxx/current/finalized/
replacing BP-xxx with the actual folder name.

4.3. Notice the actual blocks appear in this folder. List the contents of the folder
and look for files with a recent timestamp. You may not see any new files. Why
not? ____________________________________________________________
4.4. Try another node in your cluster:
# ssh root@node3
# cd /hadoop/hdfs/data/current/BP-xxx/current/finalized/
You are looking for a subfolder with a recent timestamp. Once you find it, cd into
that folder.
4.5. See if you can find the various blocks for hadoop-common.jar and hadoopnfs.jar. They will look similar to the following:
-rw-r--r--.
-rw-r--r--.
-rw-r--r--.
-rw-r--r--.
-rw-r--r--.
-rw-r--r--.
1
1
1
1
1
1
hdfs
hdfs
hdfs
hdfs
hdfs
hdfs
hadoop 1048576 blk_1073742331

hadoop
8199 blk_1073742331_1507.meta
hadoop 1048576 blk_1073742332
hadoop
8199 blk_1073742332_1508.meta
hadoop 582777 blk_1073742333
hadoop
4563 blk_1073742333_1509.meta
4.6. How come some of the blocks are exactly 1048576 bytes? ______________
_________________________________________________________________
4.7. What is in the .meta files? _______________________________________
38
The NameNode
1. When the NameNode starts, it reads
the fsimage_N and edits_N files.
2. The transactions in edits_N are
merged with fsimage_N.
3. A newly-created fsimage_N+1 is
written to disk, and a new, empty
edits_N+1 is created.
fsimage
edits
Namespace
Journaling
Metadata
The NameNode will be in safemode,

a read-only mode.
4. Now a client application can

create a new file in HDFS.
5. The NameNode journals that
create transaction in the
edits_N+1 file.
NameNode
The NameNode
HDFS has a master/slave architecture. A HDFS cluster consists of a single NameNode,
which is a master server that manages the file system namespace and regulates access
to files by clients.
The NameNode has the following characteristics:
It is the master of the DataNodes.
It executes file system namespace operations such as opening, closing, and

renaming files and directories.
It determines the mapping of blocks to DataNodes.
It maintains the file system namespace.
39
The NameNode performs these tasks by maintaining two files:
fsimage_N: Contains the entire file system namespace, including the mapping of
blocks to files and file system properties.
edits_N: A transaction log that persistently records every change that occurs to
file system metadata.
When the NameNode starts up, it enters safemode (a read-only mode). It loads the
fsimage_N and edits_N from disk, applies all the transactions from the edits_N to the
in-memory representation of the fsimage_N, and flushes out this new version into a
new fsimage_N on disk.
NOTE: The edits_N file naming actually contains a range of numbers for the
historical events. For example, edits_0008-0012. There is an additional file
named edits_inprogress_<start-of-range> for the current edits.
For example, initially you will have an fsimage_0 file and an edits_inprogress_0 file.
When the merging occurs, the transactions in edits_inprogress_0 are merged with
fsimage_0, and a new fsimage_1 file is created. In addition, a new, empty
edits_inprogress file is created for all future transactions that occur after the creation of
fsimage_1.
This process is called a checkpoint. Once the NameNode has successfully checkpointed,
it will leave safemode, thus enabling writes.
40
The DataNodes
NameNode
Im still here! This

is my latest
Blockreport.
DataNode 1
Replicate block 123

to DataNode 1.
DataNode 2
DataNode 3
Im here too! And

here is my latest
Blockreport.
DataNode 4
123
Storage
The DataNodes
HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of
DataNodes.
The NameNode determines the mapping of blocks to DataNodes. The DataNodes are
responsible for:
Handling read and write requests from application clients.
Performing block creation, deletion, and replication upon instruction from the
NameNode. (The NameNode makes all decisions regarding replication of
blocks.)
Sending heartbeats to the NameNode.
Sending the blcoks stored on the DataNode in a Blockreport.
The NameNode periodically receives a Heartbeat and a Blockreport from each of the
DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is
functioning properly. A Blockreport contains a list of all blocks on a DataNode.
41
DataNodes have the following characteristics:
The DataNode has no knowledge about HDFS files.
It stores each block of HDFS data in a separate file on its local file system.
The DataNode does not create all files in the same local directory. It uses a
discovery technique to determine the optimal number of files per directory and
creates subdirectories appropriately.
When a DataNode starts up, it scans through its local file system, generates a list
of all HDFS data blocks that correspond to each of these local files, and sends this
information to the NameNode (as a Blockreport).
REFERENCE: For tips on configuring a network for a Hadoop cluster, visit

http://hortonworks.com/kb/best-practices-for-cluster-network-configuration/ .
42
DataNode Failure
Sorry, DataNode 3,
but Im going to
assume you are
dead.
NameNode
Heartbeat &
Blockreport
DataNode 1
Heartbeat &
Blockreport
DataNode 2
Heartbeat &
Blockreport
DataNode 3
DataNode 4
DataNode Failure
The primary objective of HDFS is to store data reliably even in the presence of failures.
Hadoop is designed to recover gracefully from a disk failure or network failure of a
DataNode using the following guidelines:
If a DataNode fails to send a Heartbeat to the NameNode, that DataNode is

labeled as dead.
Any data that was registered to a dead DataNode is no longer available to HDFS.
The NameNode does not send new I/O requests to a dead DataNode, and its
blocks are replicated to live DataNodes.
DataNode death typically causes the replication factor of some blocks to fall below their
specified value. The NameNode constantly tracks which blocks need to be replicated
and initiates replication whenever necessary.
43
NOTE: It is possible that a block of data fetched from a DataNode arrives

corrupted, either from a disk failure or network error. HDFS implements
checksum checking on the contents of HDFS files.
When a client creates an HDFS file, it computes a checksum of each block of
the file and stores these checksums in a separate hidden file in the same
HDFS namespace.
When a client retrieves file contents, it verifies that the data it received from
each DataNode matches the checksum stored in the associated checksum
file. If the checksum verification fails, the client can opt to retrieve that block
from another DataNode that has a replica of that block.
44
HDFS Clients
Commandline
Tools
User
HDFS
fs, archive, distcp, fsck, fetchdt
Admin
dfsadmin, namenode, datanode, balancer, daemonlog,
secondarynamenode
WebHDFS
The NameNode and DataNodes both expose RESTful apis to
perform user operations
NameNode
DataNode 1
HttpFS
A REST gateway that supports user operations and is
interoperable with WebHDFS
Hue
DataNode 2
A feature rich GUI that includes a HDFS file browser, job browser
for MR & YARN, HBase, Hive, Pig, and Sqoop support
HDFS Java API

Can write your own java based client applications
YARN applications; MapReduce!
HDFS Clients
HDFS provides many out of the box methods for clients to interact with the file system.
These include command line, RESTful, and a Java HDFS API. Additionally, HDP provides
Hue, a GUI interface to not only HDFS but also other components in HDP.
We will explore the various types of clients in an upcoming lab.
45
Unit 2 Review
1. Which component of HDFS is responsible for maintaining the namespace of the
distributed file system? _________________________
2. What is the default file replication factor in HDFS? _________________________
3. True or False: To input a file into HDFS, the client application passes the data to
the NameNode, which then divides the data into blocks and passes the blocks to
the DataNodes. _____________
4. Which property is used to specify the block size of a file stored in HDFS?
__________________________
5. The NameNode maintains the namespace of the file system using which two sets
of files? _______________________________________________________
46
Unit 3: Installation Prerequisites

and Planning
Topics covered:
Minimum Hardware Requirements
Minimum Software Requirements
A Formidable Starter Cluster
Lab 3.1: Setting up the Environment
Lab 3.2: Install HDP 2.0 Cluster using Ambari
47
Minimum Hardware Requirements

The great benefit of Hadoop is that hardware requirements are very flexible.
Infrastructure choices are in your hands. There are, however, guidelines to help you
make better choices.
48
Master Nodes: RAID10 is recommended because master nodes need to be more

resilient to failures. RAID10 will allow for multiple disk failures at the same time.
Slave Nodes: JBOD, or Just a Bunch of Disks, is simple array of disks with no
striping or mirroring.
Minimum Software Requirements

All hosts, both masters and slaves, should have the above installed. Ambari can push the
JDK to all the hosts if you are using Ambari to manage your cluster. At least one of the
master nodes should have Ambari bits downloaded. We will discuss this later in the
course.
49
A Formidable Starter Cluster

Configuration that will yield 26TB storage:
42U single rack
10 nodes: 2 masters, 8 slaves
Masters (2x):
- 6 - 1 TB drives
- Dual quad core
- 64 GB RAM
- RAID 10
- 2x Gigabit Ethernet
- Redundant power supply
Slaves (8 x 2U):
- Single quad core
- 32 GB RAM
- 2x internal SATA drives for OS
- 6 x 2TB SATA drives as JBOD
50
We will cover cluster tuning later and answer questions such as:
Is the cluster for small group? Multi-tenant?
How much storage do you anticipate in the short-term?
How quickly will the data grow?
Do you anticipate compute heavy processing of the data? Compute/memory
heavy algorithms?
51
Lab 3.1: Setting up the Environment
Objective: To become familiar with preparing an HDP 2.0 installation.

Successful Outcome: You have setup passwordless SSH and configured the
repositories.
Before You Begin: SSH into node1.
Step 1: View the Setup Script

1.1. SSH into node1.
1.2. Change directories to /root/scripts:
# cd /root/scripts
1.3. View the contents of env_setup.sh:

# more env_setup.sh
1.4. This script does a lot. Look over the steps and see if you can follow what is
happening. Some of the highlights of the script include:
- Sets up passwordless SSH amongst your four nodes.
- Installs ntp on each node.
- Configures the repositories for installing HDP locally.
- Disables security and turns off iptables.
Step 2: Run the Setup Script
2.1. Run the setup script using the following command:
52
# ./env_setup.sh
2.2. The script will take a while to execute. Watch the output and keep an eye out
for any errors. The end of the output will look like:
Installed:
yum-plugin-priorities.noarch 0:1.1.30-14.el6
Complete!
NOTE: If you dont see your command prompt, simply press Enter when the
script is finished.
IMPORTANT: If you find an error, try to determine at which step in the script
it occurred. You may need to manually copy-and-paste the remainder of the
script based on where your error occurred.
RESULT: Your cluster is now ready for HDP 2.0 to be installed using Ambari!
53
Lab 3.2: Install HDP 2.0 Cluster using Ambari
Objective: To install a Hadoop cluster using Ambari UI.

Successful Outcome: You can see the various HDP 2.0 services running from within
the Ambari UI.
Step 1: Install ambari-server

1.1. From the command line of node1, enter the following command:
# yum -y install ambari-server
1.2. Open the following file using vi:

# vi /var/lib/ambariserver/resources/stacks/HDPLocal/2.0.6/repos/repoinfo.xml
1.3. Change the centos6 configuration to point to the local repositories by

changing the baseurl property as shown here:
<os type="centos6">
<repo>
<baseurl>http://node1/hdp/HDP-2.0.6.0-76</baseurl>
<repoid>HDP-2.0.6</repoid>
<reponame>HDP</reponame>
</repo>
</os>
Step 2: Start the ambari-server

2.1. The Ambari Server manages the install process. Run the Ambari Server setup
using the following command:
54
# ambari-server setup -s -i /usr/jdk-6u31-linux-x64.bin
NOTE: The -s option runs the setup in silent mode, meaning all default
values are accepted at any prompts.
2.2. Now start the Ambari Server:

# ambari-server start
Step 3: Login to Ambari

3.1. Start the Firefox browser from left side-bar and go to the following URL:
http://node1:8080
3.2. Log in to the Ambari server using the default credentials admin/admin:
Step 4: Run the Install Wizard

4.1. At the Welcome page, enter the name horton for your cluster and click the
Next button.
55
4.2. Select the service stack HDP2.0.6:
Step 5: Enter the Host and SSH Key Details

5.1. Enter node1, node2 and node3 in the list of Target Hosts. (Do not enter
node4; you will add that node to the cluster in a later lab.)
56
5.2. In the Host Registration Information section, click the Choose File button,
then browse to and select the training-keypair.pem file at Desktop:
5.3. Under Advanced Options, check Use a local software repository:
5.4. Click the Register and Confirm button. Click OK if you are warned about not
using fully qualified domain names.
Step 6: Confirm Hosts
6.1. Wait for some initial verification to occur on your cluster. Once the process is
done, click the Next button to proceed:
57
NOTE: You may see a confirmation message with a warning. Verify your
nodes are configured correctly before continuing. If it is related to firewall,
you can ingnore the error.
Step 7: Choose the Services to Install

7.1. Hortonworks Data Platform is made up of a number of components. You are
going to install all the services except HBase on your cluster, so uncheck HBase
and make sure all other serves are checked, then click the Next button:
58
Step 8: Assign Master Nodes

8.1. The Ambari wizard attempts to assign the various master services on
appropriate hosts in your cluster. Carefully choose the following assignments of
the master services!
CAUTION: Make sure to choose the right node for each master service as
specified below. Once the installation starts, you cannot change the
selection!
NameNode: node1
SNameNode: node2
History Server: node2
ResourceManager: node2
Nagios Server: node3
Ganglia Server: node3
HiveServer2: node2
59
Oozie Server: node2

ZooKeeper: node1
ZooKeeper: node2
ZooKeeper: node3
8.2. Verify your assignments match the following:
8.3. Click the Next button to continue.

Step 9: Assign Slaves and Clients
9.1. Assign all slave and client components to all nodes in the list:
60
Step 10: Customize Services

10.1. Notice three services require additional configuration: Hive, Oozie and
Nagios. Click on the Hive tab, then enter hive for the Database Password:
10.2. Click on the Oozie tab and enter oozie for its Database Password:
10.3. Click on the Nagios tab. Enter admin for the Nagios Admin password, and
enter your email address in the Hadoop Admin email field:

Step 11: Review the Configuration
61
11.1. Notice the Review page allows you to review your complete install
configuration. If youre satisfied that everything is correct, click Deploy to start
the installation process. (If you need to go back and make changes, you can use
the Back button.)
Step 12: Wait for HDP to Install

12.1. The installation will begin now. It will take 20-30 minutes to complete,
depending on network speed. You will see progress updates under the Status
column as components are installed, tested, and started:
62
12.2. You should see the following screen if the installation completes
successfully:
12.3. When the process completes, click Next to get a summary of the installation
process. Check all configured services are on the expected nodes, then click
Complete:
63
Step 13: View the Ambari Dashboard

13.1. After the install wizard completes, you will be directed to your clusters
Ambari Dashboard page. Verify the DataNodes Live status shows 3/3:
RESULT: You now have a running 3-node cluster of the Hortonworks Data Platform!
64
Unit 4: Configuring Hadoop

Topics covered:
Configuration Considerations
Deployment Layout
Configuring HDFS
What is Ambari
Configuration via Ambari
Management
Monitoring
REST API
Lab 4.1: Add a New Node to the Cluster
Lab 4.2: Stopping and Starting HDP Services
Lab 4.3: Using HDFS Commands
65
Configuration Considerations
There are two ways to configure HDP:
-
Manual configuration
Ambari UI configuration
Currently, it is necessary to know key configurations at the configuration file level.

Ambari is highly capable of managing configurations, however, as an HDP admin if
Ambari fails, it will be important to know key properties to run a cluster. This will also
provide insight into what exact configuration files need to be looked at when
configuring the various services. In this Unit, we focus on HDFS, YARN, MapReduce,
ZooKeeper, & HBase.
66
Data
Binaries
Configuration
Runtime
Install Bits
Deployment Layout
The HDP deployment layout per machine may vary slightly because not all machines will
have the same components. For example, theres only one Ambari Server per cluster.
However, by using the deployment layout above as a guide, you can quickly find the
configuration, binaries, and repos needed for Ambari to run.
The Deployment Layout can be broken down into five key categories:
1. Install Bits: It is a best practice to setup a local repository of the install bits or
rpm repos. When setting up a local repo, a yum repo is added to
/etc/yum.repos.d/. The rpms are installed on a simple webserver.
2. Binaries: Hadoop executables, libraries, dependencies, template configs, etc. are
located at /usr/lib/ in the appropriate project folder. Files in these directories
should not be modified, especially configuration files. A best practice for
customization of shell scripts is that modifications should be done via wrapper
scripts, such as passing parameters or piping stdout to a log file.
3. Configuration: By convention, Hadoop configurations are under /etc/ under the
appropriate project. This is where configuration changes should be made rather
than in install (binaries) directories.
67
4. Data: Various Hadoop services require data directories. For example, HDFS
requires space for the NameNode to write its edits log files. And the DataNodes
will write the actual data blocks to the local file system. Throughout the
configuration files, you will find services requiring a directory path to use as
temporary or permanent storage.
5. Runtime: As Hadoop services are running, starting, and stopping, they will be
writing to self-maintenance files such as pid (process id) files, typically to
/var/run/. For example, Hadoop HDFS services will publish pid files to
/var/run/hadoop/hdfs/.
68
Configuring HDFS
There are two configurations involved when configuring HDFS. In addition to Hadoop
configuration properties necessary to bring up an HDFS cluster, there are some prerequisites, which we will discuss:
Ports Firewall considerations:
Service
NameNode WebUI
NameNode metadata service
DataNode
Servers
Master
Nodes
All Slave
Nodes
Secondary (Checkpoint)
NameNode
Ports
Protocol Description
50070
http
50470
https
8020/9000 IPC
50075
50475
50010
50020
50090
http
https
IPC
http
NameNode WebUI
Secure http
File system metadata
operations
DataNode WebUI
Secure http
Data transfer
Metadata operations
Secondary NameNode
WebUI
69
DNS
Ensure that HDFS hosts are resolvable via DNS. If this is not possible, all the hosts will
need their /etc/hosts file to contain all the hosts in the cluster. The hosts file is a local
domain to ip mapping file.
core-site.xml: Cluster-wide settings, including NameNode host and port, proxy
user/groups. This file will get distributed to all nodes, but is always changed
uniformly.
hdfs-site.xml: Some settings are cluster-wide, while others are DataNode
specific. For example, dfs.datanode.data.dir can be different between
DataNodes.
NameNode
fs.defaultFS: hdfs://namenodehost:8020
dfs.namenode.name.dir: /hadoop/hdfs/namenode
dfs.replication: Hadoop default is 3 and should be kept at 3. This property can be
overridden by the client per-operation if you want to change the replication for a
file. For example; if a file is referenced multiple times in many jobs, it is often a
performance gain to have more replicas of that same file, i.e. joining with lookup
files).
dfs.replication.max: Maximum replication.
dfs.blocksize: Default block size is 128MB (this property is expressed in bytes). If
your cluster generally has larger datasets and the datasets are not process
intensive, you can set this to a higher size. However, 128MB is a good default to
keep. If you like, just as the replication factor, you can change the blocksize for
each file that you upload into HDFS.
dfs.namenode.stale.datanode.interval: Default, 30000ms. Threshold for
amount of time in milliseconds before the NameNode considers a DataNode to
be stale, at which point the DataNode is moved to the end of the list of available
replica locations.
SecondaryNameNode
dfs.namenode.checkpoint.dir: Directory where SecondaryNameNode
temporarily stores the images it needs to merge from the NameNode
dfs.namenode.checkpoint.period: Default 3600s. Number of seconds between
two periodic checkpoints in seconds.
70
dfs.namenode.checkpoint.txns: Default 1,000,000 transactions. After this many

transactions on the NameNode, the SecondaryNameNode will create a
checkpoint. This property has precedence over the checkpoint.period
DataNodes
dfs.datanode.address: 0.0.0.0:50010. The DataNode host address.
dfs.datanode.data.dir: DataNode data block directory. If a node has multiple
drives, you can specify a comma-separated list of data directories for each drive.
71
What is Ambari
Ambari is a 100% Apache open source operations framework for provisioning,
managing, and monitoring Hadoop clusters. It provides these features through a web
frontend and an extensive REST API.
With Ambari, clusters can be built from ground up on clean operating system instances.
It will do the job of propagating binaries, configuring services, launching them, and
monitoring them to all the hosts in a cluster.
72
Configuration via Ambari

Using Ambari is very convenient. However, there are some cases where performing
manual changes is necessary, such as a differing property values between DataNodes.
When the same property is different within the cluster, we refer to this as a
heterogeneous configuration.
We will be using Ambari throughout this course to monitor, provision, and manage your
HDP cluster.
73
Management
Once a cluster is provisioned, services can be managed either an entire service or
management can be granular to services sub-component. For example, an
administrator can choose to start/stop the entire YARN service (ResourceManager +
NodeManagers), or just stop a particular NodeManager on a host.
Configuration of services can also be managed. Properties, credentials, paths are some
examples of common configurations. Ambari allow for custom or advanced properties
to be managed for most services. Once a configuration change has been made, Ambari
will persist changes to its own internal database, a PostgreSQL database by default.
74
Management Flow
1. Stop service(s): Services are required to be stopped.
2. Edit and save: Once saved, Ambari will validate and persist the new settings in
its database, write the settings to appropriate configuration files on the cluster.
3. Start service(s): Services can now be started.
Ambari uses Puppet, an open source automation system for orchestrating

the starting/stopping of services.
Advanced Configurations
Ambari supports configuring NameNode HA and security. These are advanced features
available under the Admin page. These topics will be covered in later units.
75
Monitoring
Ambari provides monitoring with the combination of two powerful open source
frameworks: Ganglia and Nagios.
Ganglia
All cluster metrics are gathered by Ganglia agents running on each host and aggregated.
Nagios
Nagios is used to provide alerts, escalation schemes to implement enterprise SLAs, and
reports. With Nagios, alerts via email, SMS, or script execution can be triggered by
events such as a threshold limit being crossed. For example, an administrator may want
to receive an SMS alert if a Hadoop master nodes CPU is pegged at 100% for more than
5 minutes. All such thresholds are configurable in Nagios.
Dashboard
Ambari provides a dashboard that gives an administrator a quick view of the overall
health of the entire cluster. There are 20+ widgets that provide quick stats on services.
Widgets can be added or you could write your own widget using the Ambari APIs.
76
REST API
Ambari uses a REST API. You can write your own automation scripts to perform
extensive operations. The REST API allows you to monitor as well as manage a cluster.
77
All components of a cluster are Resources that can be added, updated/configured, or

removed. Core resources include:
Clusters: Top most level resource; a Hadoop cluster.
Services: Hadoop services such as HDFS, YARN, etc.
Components: Individual components of a service such as NameNode or

ResourceManager.
Hosts: The host machines that participate in a cluster.
Host_Components: Individual resources on a host; often times this resource

is used to get all services running on a host.
Configurations: Sets of key/value pairs that configure services.
The Ambari REST API is an evolving feature. While most operations will work
as expected, be sure to thoroughly test an operation and validate expected
results
78
Lab 4.1: Add a New Node to the Cluster
Objective: Add an additional node to a cluster.

Successful Outcome: node4 will be added to your HDP cluster as a DataNode.
Before You Begin: Your 3-node cluster should have HDP successfully installed.
Step 1: Login to Ambari

1.1. If you are not logged in already, login to the Ambari Dashboard of your
cluster.
Step 2: Run the Add Hosts Wizard
2.1. Click on the Hosts tab of Ambari.
2.2. Click the Add New Hosts button to start the Add Host Wizard:
Step 3: Complete the Add Host Wizard

3.1. You have seen this wizard before when you installed HDP, so we will provide
only a few hints in this lab. Read through the following hints before running the
wizard.
3.2. The hostname of the node you are adding is node4.
3.3. Make sure you check the box for using a local software repository.
3.4. On the Assign Slaves and Clients step, choose only the Client option.
79
Step 4: Verify the New Host

4.1. Go to the Hosts page of Ambari. You should see all four of your nodes now
listed.
RESULT: You have added a new node to the cluster and Hadoop is installed on it. In a
later lab, you will commission this node as a DataNode.
80
Lab 4.2: Stopping and Starting HDP Services
Objective: To learn how to start and stop the various HDP services using
either the command line or Ambari.
Successful Outcome: You will have stopped HDP from the command line and
started it again using Ambari.
Before You Begin: Your cluster should be up and running.
Step 1: Stop the HDP Services from the Command Line

1.1. The following table lists all the processes that need to be stopped, in the
proper order for shutting down all HDP services. Do not type these in yet - they
are provided for you in a script:
Nagios
Ganglia
Oozie
WebHCat
Hive
Zookeeper
Yarn
Node
Manager
service nagios stop

service hdp-gmetad stop
service hdp-gmond stop
sudo su -l oozie -c "/usr/lib/oozie/bin/oozied.sh
stop"
su -l hcat -c
"/usr/lib/hcatalog/sbin/webhcat_server.sh stop"
ps aux | awk '{print $1,$2}' | grep hive | awk '{print
$2}' | xargs kill >/dev/null 2>&1
/usr/lib/zookeeper/bin/zkServer.sh stop
su - yarn -c 'export
HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
&& /usr/lib/hadoop-yarn/sbin/yarn-daemon.sh -config /etc/hadoop/conf stop nodemanager'
node3
node3
node1
node2
node3
node4
node2
node2
node2
node1
node2
node3
node1
81
MapReduce
History
Server
Yarn
Resource
Manager
HDFS
DataNode
Secondary
NameNode
NameNode
su - mapred -c 'export
&& /usr/lib/hadoop-mapreduce/sbin/mrjobhistory-daemon.sh --config /etc/hadoop/conf
stop historyserver'
&& /usr/lib/hadoop-yarn/sbin/yarn-daemon.sh -config /etc/hadoop/conf stop resourcemanager'
su -l hdfs -c "/usr/lib/hadoop/sbin/hadoopdaemon.sh stop datanode"
su -l hdfs -c "/usr/lib/hadoop/sbin/hadoopdaemon.sh stop secondarynamenode"
su -l hdfs -c "/usr/lib/hadoop/sbin/hadoopdaemon.sh stop namenode"
node2
node3
node2
node2
node1
node2
node3
node2
node1
1.2. SSH into node1. (Make sure you run this script from node1.)
1.3. Run the following script to shutdown all HDP services on your cluster:
# ~/scripts/shutdown_all_services.sh
1.4. Wait for the script to execute and all the services to stop.
Step 2: View Ambari
2.1. Go to your Ambari Dashboard. Notice the Cluster Status and Metrics on the
Dashboard are mostly n/a:
82
2.2. Notice that all the Services are down - as shown by the red icon next to each
service name:
2.3. From the Services page, click on each service individually. They should all be
stopped.
Step 3: Stop ambari services
3.1. Run the following script to shutdown all Ambari services on your cluster:
# ~/scripts/stop_ambari.sh
83
Step 4: Take a backup of your VM.

4.1. Go to VMWare Player/Fusion Menu and shutdown the VM.
4.2. Copy the existing VM folder from current location to a new location.
4.3. Once you are done with backup, go to VMWare Player/Fusion Menu and start
the VM again.
NOTE: It is important to take a backup of the VM now so that you do not

need to start from beginning in case of a fatal error.
Step 5: Start Ambari services

5.1. Login to node1 again
5.2. Run the following script to start all Ambari services on your cluster:
# ~/scripts/start_ambari.sh
Step 6: Start the HDP Services from Ambari UI

6.1. Go To Ambari UI using Firefox browser
6.2. From the Services page, click the Start All button. Click the OK button to
confirm.
6.3. Wait for the services to start, which can take 10-15 minutes. If you click the
small arrow to the right of Start All Services, you can view the progress on each
node:
84
6.4. Once all the services are started, click the OK button to close the progress
dialog.
6.5. Verify on the Services page of Ambari that all the HDP services in your cluster
are up and running.
NOTE: The table below shows the proper order for starting HDP services.
These can be executed using the /root/scripts/startup_all_services.sh script
provided in your class cluster.
HDFS
NameNode
Secondary
NameNode
DataNodes
YARN
Resource Manager
History Server
Node Managers
su -l hdfs -c "/usr/lib/hadoop/sbin/hadoopdaemon.sh start namenode"

su -l hdfs -c "/usr/lib/hadoop/sbin/hadoopdaemon.sh start secondarynamenode"
su -l hdfs -c "/usr/lib/hadoop/sbin/hadoopdaemon.sh start datanode"
&& /usr/lib/hadoop-yarn/sbin/yarn-daemon.sh
--config /etc/hadoop/conf start
resourcemanager'
su - mapred -c 'export
&& /usr/lib/hadoop-mapreduce/sbin/mrjobhistory-daemon.sh --config
/etc/hadoop/conf start historyserver'
--config /etc/hadoop/conf start nodemanager'
85
node1
node2
node1
node2
node3
node2
node2
node1
node2
Zookeeper
Hive
Metastore
HiveServer2
WebHCat
Oozie
Ganglia
Nagios
86
/usr/lib/zookeeper/bin/zkServer.sh start
su - hive -c 'env HADOOP_HOME=/usr
JAVA_HOME=/usr/jdk64/jdk1.6.0_31
/tmp/startMetastore.sh /var/log/hive/hive.out
/var/log/hive/hive.log /var/run/hive/hive.pid
/etc/hive/conf.server '
su - hive -c 'env
JAVA_HOME=/usr/jdk64/jdk1.6.0_31
/tmp/startHiveserver2.sh /var/log/hive/hiveserver2.out /var/log/hive/hive-server2.log
/var/run/hive/hive-server.pid
/etc/hive/conf.server '
su -l hcat -c
"/usr/lib/hcatalog/sbin/webhcat_server.sh
start"
sudo su -l oozie -c "/usr/lib/oozie/bin/oozied.sh
start"
/etc/init.d/hdp-gmetad start
/etc/init.d/hdp-gmond start
service nagios start
node3
node1
node2
node3
node2
node2
node2
node2
node3
node1
node2
node3
node4
node3
Lab 4.3: Using HDFS Commands
Objective: To become familiar running HDFS commands and how to

view the HDFS file system.
Successful Outcome: You will have added and deleted several files and folders in
HDFS.
Step 1: View the hadoop fs Command

1.1. From the command line, enter the following command to view the usage of
hadoop fs:
# hadoop fs
Notice the usage contains options for performing file system tasks in HDFS, like
copying files from a local folder into HDFS, retrieving a file from HDFS, copying
and moving files around, and making and removing directories.
1.2. Enter the following command:
# hdfs dfs
Notice you get the same usage list as the hadoop fs command.
NOTE: The hadoop command is a more generic command that has fewer
options than the hdfs command. However, notice hdfs dfs is just an alias for
hadoop fs.
87
Step 2: Understanding the Default Folders in HDFS

2.1. Enter the following -ls command to view the contents of the users root
directory in HDFS, which is /user/root:
# hadoop fs -ls
You do have not a /user/root directory yet, so no output is displayed:

ls: `.': No such file or directory
2.2. View the contents of the /user directory in HDFS:

# hadoop fs -ls /user
Found 4 items
drwxrwx--- ambari-qa
drwxr-xr-x
- hcat
drwx------ hive
drwxrwxr-x
- oozie
hdfs
hdfs
hdfs
hdfs
0
0
0
0
/user/ambari-qa
/user/hcat
/user/hive
/user/oozie
Notice HDFS has four user folders by default: ambari-qa, hcat, hive and oozie.
2.3. Run the -ls command again, but this time specify the root HDFS folder:
# hadoop fs -ls /
The output should look like:

Found 6 items
drwxrwxrwt
drwxr-xr-x
drwxr-xr-x
drwxr-xr-x
drwxrwxrwx
drwxr-xr-x
-
yarn
hdfs
mapred
hdfs
hdfs
hdfs
hdfs
hdfs
hdfs
hdfs
hdfs
hdfs
0
0
0
0
0
0
2013-08-20
2013-08-20
2013-08-20
2013-08-20
2013-08-28
2013-08-28
13:59
13:53
13:57
13:58
22:03
22:03
/app-logs
/apps
/mapred
/mr-history
/tmp
/user
2.4. Which user is the owner of the /user folder? ___________________
IMPORTANT: Notice how adding the / in the -ls command caused the
contents of the root folder to display, but leaving off the / attempted to list
the contents of /user/root. If you do not specify an absolute path, then all
hadoop commands are relative to the users default home folder.
88
Step 3: Create a User Directory in HDFS

3.1. Enter the following mkdir command:
# hadoop fs -mkdir /user/root
Notice the root user does not have permission to create this folder.
3.2. Switch to the hdfs user:
# su - hdfs
3.3. Make a new directory in HDFS named /user/root:

$ hadoop fs -mkdir /user/root
3.4. Change the permissions to make root the owner of the directory:
$ hadoop fs -chown root:root /user/root
3.5. Verify the folder was created successfully and root is the owner:
$ hadoop fs -ls /user
...
drwxr-xr-x
- root
root
/user/root
3.6. Switch back to the root user:

$ exit
logout
[root@node1 ~]#
3.7. Now view the contents of /user/root using the following command again:
# hadoop fs -ls
The directory is empty, but notice this time the command worked.
Step 4: Create Directories in HDFS
4.1. Enter the following command to create a directory named test in HDFS:
89
# hadoop fs -mkdir test
4.2. Verify the folder was created successfully:

# hadoop fs -ls
Found 1 items
drwxr-xr-x
- root root
test
4.3. Create a couple of subdirectories of test:

# hadoop fs -mkdir test/test1
# hadoop fs -mkdir test/test2
# hadoop fs -mkdir test/test2/test3
4.4. Use the -ls command to view the contents of /user/root:

# hadoop fs -ls
Notice you only see the test directory. To recursively view the contents of a
folder, use: -ls -R:
# hadoop fs -ls -R
The output should look like:

drwxr-xr-x
drwxr-xr-x
drwxr-xr-x
drwxr-xr-x
root
root
root
root
root
root
root
root
0
0
0
0
test
test/test1
test/test2
test/test2/test3
Step 5: Delete a Directory

5.1. Delete the test2 folder (and recursively its subcontents) using the -rm -R
command:
# hadoop fs -rm -R test/test2
5.2. Now run the -ls -R command:

# hadoop fs -ls -R
The directory structure of the output should look like:

90
.Trash
.Trash/Current
.Trash/Current/user
.Trash/Current/user/root
.Trash/Current/user/root/test
.Trash/Current/user/root/test/test2
.Trash/Current/user/root/test/test2/test3
test
test/test1
NOTE: Notice Hadoop created a .Trash folder for the root user and moved
the deleted content there. The .Trash folder empties automatically after a
configured amount of time.
Step 6: Upload a File to HDFS

6.1. Now lets put a file into the test folder. Change directories to
/var/log/hadoop/hdfs:
# cd /var/log/hadoop/hdfs
6.2. Notice this folder contains a file named hdfs-audit.log:

# tail hdfs-audit.log
6.3. Run the following -put command to copy hdfs-audit.log into the test folder in
HDFS:
# hadoop fs -put hdfs-audit.log test/
6.4. Verify the file is in HDFS by listing the contents of test:

# hadoop fs -ls test
Found 2 items
drwxr-xr-x
- root root
3744098
0
test/hdfs-audit.log
test/test1
Step 7: Copy a File in HDFS

7.1. Now copy the hdfs-audit.log file in test to another folder in HDFS:
# hadoop fs -cp test/hdfs-audit.log test/test1/copy.log
91
7.2. Verify the file is in both places by using the -ls -R command on test. The
output should look like the following:
# hadoop fs -ls -R test
drwxr-xr-x
- root root
3744098 test/hdfs-audit.log
0 test/test1
3744098 test/test1/copy.log
7.3. Now delete the copy.log file using the -rm command:
# hadoop fs -rm test/test1/copy.log
7.4. Verify the copy.log file is in the .Trash folder.

Step 8: View the Contents of a File in HDFS
8.1. You can use the -cat command to view text files in HDFS. Enter the following
command to view the contents of data.txt:
# hadoop fs -cat test/hdfs-audit.log
8.2. You can also use the -tail command to view the end of a file:
# hadoop fs -tail test/hdfs-audit.log
Notice the output this time is only the last 20 rows of hdfs-audit.log.
Step 9: Getting a File from HDFS
9.1. See if you can figure out how to use the get command to copy test/hdfsaudit.log into your local /tmp folder.
Step 10: The getmerge Command
10.1. Put the file /var/log/hadoop/hdfs/hadoop-hdfs-namenode-node1.log into
the test folder in HDFS. You should now have two files in test: hdfs-audit.log and
hadoop-hdfs-namenode-node1.log:
# hadoop fs -ls test
Found 3 items
namenode-node1.log
drwxr-xr-x
- root root
92
1033038 test/hadoop-hdfs3744098 test/hdfs-audit.log

0 test/test1
10.2. Run the following getmerge command:

# hadoop fs -getmerge test /tmp/merged.txt
10.3. What did the previous command do? Compare the file size of merged.txt
with the two log files from the test folder.
Step 11: Specify the Block Size of a File
11.1. Change directories to /root/labs:
# cd /root/labs
Notice this folder contains an HBase JAR file that is about 4.7MB.
11.2. Put the HBase JAR file into /user/root in HDFS with the name hbase.jar, and
assign it a blocksize of 1048576 bytes. HINT: The blocksize is defined using the
dfs.blocksize property on the command line.
11.3. Run the following fsck command on hbase.jar:
# hdfs fsck /user/root/hbase.jar
11.4. How many blocks did this file get broken down in to? ________________
RESULT: You should now be comfortable with executing the various HDFS commands,
including creating directories, putting files into HDFS, copying files out of HDFS, and
deleting files and folders.
93
ANSWERS:
Step 2.4: hdfs
Step 9.1:
# hadoop fs -get test/hdfs-audit.log /tmp
Step 10.3: The two files that were in the test folder in HDFS were merged into a single
file and stored on the local file system.
Step 11.2:
hadoop fs -D dfs.blocksize=1048576 -put hbase-0.94.3bimota-1.2.0.21+HBASE-7644.jar hbase.jar
Step 11.4: The file should be broken down into 5 blocks.
94
Unit 5: Ensuring Data Integrity

Topics covered:
Ensuring Data Integrity
Replication Placement
Data Integrity Writing Data
Data Integrity Reading Data
Data Integrity Block Scanning
Running a File System Check
What Does the File System Check Look For?
hdfs fsck syntax
Data Integrity File System Check: Commands & Output
Hadoop dfs Command
NameNode Information
Changing the Replication Factor
Lab 5.1: Verify Data with Block Scanner and fsck
95
Ensuring Data Integrity

HDFS has a simple yet robust architecture that was designed for data reliability in the
face of faults and failures in disks, nodes and networks. HDFS is a successful file system
because it is simple. The simplicity allows HDFS to be fast, relatively easy to administer,
and it avoids the complex locking issues in databases and standard file systems.
DataNodes and disk drives can fail. Along with hardware failures data can become
corrupted or lost. Disk rot can occur. Memory, disk and network issues can all
contribute to the corruption of a block. We will discuss how HDFS file system checks
and block scanning will help make sure any corrupt blocks are replaced.
96
These features not only maintain the reliability and durability of the data blocks but also
allow for easy administration.
The HDFS client will calculate a checksum for each block and send it to the
DataNode along with the block.
The DataNode stores checksums in a metadata file separate from the blocks
data file.
The block as well as the checksum is sent to the client when reading. The client
will validate the checksum and if there is an inconsistency it will inform the
NameNode that the block is corrupt.
97
Replication Placement
Every file has a block size and replication factor associated with it. All blocks that make
up a file will be the same size except for the last file. The NameNode will make all
decisions regarding block replication for a file in HDFS. DataNodes send block reports to
the NameNode containing a list of all the blocks for a specific DataNode. The
DataNodes are responsible for the creation, deletion and replication of blocks based
upon instructions from NameNode.
Be aware that HDFS block placement does not take into account disk space utilization on
the DataNodes. This ensures that blocks are placed for availability and not just on the
DataNodes with the most free space.
98
Below is an example of how blocks and metadata are laid out in a DataNode directory.
The data blocks are stored in HDFS directories beginning with the blk_ prefix and
contain the raw bytes. The metadata file has the .meta suffix and contains header,
version and type information. It also contains the checksum data for the blocks.
${dfs.data.dir}/current/VERSION
/blk_<id_1>
/blk_<id_1>.meta
/blk_<id_2>
/blk_<id_2>.meta
/...
/blk_<id_64>
/blk_<id_64>.meta
/subdirectory0/
/subdirectory1/
/...
99

2. OK,
please use
DataNodes
1, 4, 12.
Client
1. I want to
write a block
of data.
NameNode
ta +
3. Daksum
che c
6. Success!
DataNode 1
DataNode 4
4. Data and
checksum
5. Success!
DataNode 12
4. Verify
Checksum
5.Success!
Data Pipeline

High performing applications stream data to files. HDFS does this as well; the HDFS
client caches packets of data in memory. Once that data reaches the HDFS block size,
the client will notify the NameNode. The NameNode will provide the DataNode
information about, and the locations, for the block replicas. The client will then stream
the packet of data to the first targeted DataNode. Replication is performed in a pipeline
fashion; the first DataNode will start writing the block and will then transfer that data to
the second DataNode. The second DataNode will start sending the data to the third
DataNode and so on.
When the blocks in a directory reach a defined limit, which is controlled via
dfs.datanode.numblocks, the DataNode will define a new subdirectory. After defining
the subdirectory it will start placing new data blocks and the corresponding metadata in
that subdirectory. This is performed using a fan-out structure ensuring no single
directory is overloaded with files or becomes too deep.
100
Checksums
Checksums are generated for each data block and are used to validate the block during
reads. A checksum is created for a set number of bytes of data as defined by
io.bytes.per.checksum. The size of the checksum data is minimal, for instance; a CRC32 checksum is 4 bytes long.
Note: The default number of bytes for the checksum is 512.
Investigating Corrupt Blocks

In the case of corrupt blocks, you may want to verify or recover as much of the data as
possible. Verification of checksums can be turned off if you want to look at the corrupt
blocks. To disable the verification:
When using the HDFS API, call FileSystem.setVerifyChecksum(false) when

opening a file to read.
Use the -ignoreCrc option when using the get or -copyToLocal command to
read data.
101
Client
1. I need
to read a
portion of
file.txt
NameNode
2. OK, youll
find it on
DataNode
12, block 5.
3. Verify checksum and read the block.

DataNode 12

When a Hadoop client reads a file it will get a list of the blocks and locations of the block
replicas from the NameNode. The blocks will be sorted based on their distance from
the client. The client will attempt to find the closest block replica. If the DataNode is
down or the first block replica is corrupt, the client will then get the next replica of the
block from a different DataNode.
The client will validate checksums to make sure the block is valid. A computed
checksum is compared against the stored checksum. If the client detects a problem
with the checksum it will get the block from the next block replica from a different
DataNode in the list.
102
Data Integrity - Block Scanning

Deep inspection every 3 weeks
Web UI
display of bad
blocks
NameNode
Periodic block Scanning

and checksum checking.
DataNode
Reporting of bad
blocks with block report.
DataNode
New blocks are created

for bad blocks.
DataNode
Data Integrity - Block Scanning

The block scanner is a software process that runs over a defined time frame to validate
the integrity of the blocks on a DataNode. The block scanners role is to check all the
blocks in a data node and report the results to the NameNode.
The block scanner will:
Ensure the block matches the stored checksums.
Notify the DataNode with the results for each validated block.
Inform the NameNode if it detects a corrupt block.
Will NOT fix any corrupt blocks.
The block scanner will adjust its read rate to ensure it completes the block scanning
within the defined time frame. The time frame is defined by the parameter
dfs.datanode.scan.period.hours (the default is 504 hours or 3 weeks). The DataNode
keeps an in-memory list of the blocks verification times, which are also stored in a log
file.
103
The Block Scanning Report can be accessed from the DataNode GUI:
http://datanode:50075/blockScannerReport
The period of time between running block scanner is set with the
dfs.datanode.scan.period.hour property.
104
Running a File System Check

Block information is gathered from the DataNode and heartbeats are used to send block
information to the NameNode. A HDFS file system check reviews the blocks metadata
information looking for any issues. This file system check reports on the health of HDFS
files from the status of the blocks.
A Linux file system check can repair files on local Linux file systems. The HDFS file
system check checks the blocks in HDFS files and reports the results. The HDFS file
system check information is then used by the NameNode to perform repairs on the
HDFS blocks.
The fsck command is run from the command line.
# hdfs fsck [path] [options]
If the fsck command is run with no arguments, it will print usage information. If run
with a path, the command will check all blocks for files within the path. If / is given as
the path, the entire HDFS file system will be checked.
The fsck command will not run on files open for write by a Hadoop client. The
openforwrite option will override that default.
105
What Does the File System Check Look For?

The fsck command will recursively walk through the file system namespace, starting at
the given path and check the blocks for all files it finds. It prints a dot for every file it
checks. To check a file, fsck retrieves the metadata for the files blocks and looks for
problems or inconsistencies.
NOTE: fsck retrieves all of its information from the NameNode; it does not
communicate with any DataNodes to retrieve any block data.
A few of the conditions the fsck looks for are:
Over-replicated blocks: Blocks that exceed their target replication for the file
they belong to. Normally, over-replication is not a problem, and HDFS will
automatically delete excess replicas.
Under-replicated blocks: Blocks that do not meet their target replication for the
file they belong to. The NameNode will automatically create new replicas of
under-replicated blocks until they meet the target replication. You can get
information about the blocks being replicated (or waiting to be replicated) using
hdfs dfs -metasave.
106
Mis-replicated blocks: Blocks that do not satisfy the block replica placement
policy. For example, for a replication level of three in a multirack cluster, if all
three replicas of a block are on the same rack, then the block is mis-replicated
because the replicas should be spread across at least two racks for resilience.
The NameNode will automatically re-replicate mis-replicated blocks so that they
satisfy the rack placement policy.
Corrupt blocks: Blocks whose replicas are corrupt.
Missing replicas: Blocks with no replicas anywhere in the cluster.
107
hdfs fsck Syntax

hdfs fsck [OPTIONS] <path> [-move | -delete | openforwrite] [-files [-blocks [-locations | -racks]]]
Options
Description
path
Start checking from this path.
-move
Move corrupted files to /lost+found.
-delete
Delete corrupted files.
-openforwrite
Print out files opened for write.
-files
Print out information on files being checked.
-blocks
Print out block information.
-locations
Print out DataNode address of blocks.
-racks
Print out rack information for data-nodes.
hdfs fsck Syntax

The following options may be used to request additional block information:
Option
Description
-files
File name and path.

File size in bytes and blocks.
File health status.
-blocks
Block name & length.

Number of replicas of the block.
-locations
DataNode address of blocks (including

replicas).
-racks
Rack name is added to the location

information.
-delete
Remove corrupt files.
-move
Move corrupt files.
108
Code example:
$
hdfs fsck /user/user-name -files -blocks locations racks
109
Data Integrity Filesystem Check

fsck
NameNode
results
fsck checks reported

condition in
NameNode metadata
$ hdfs fsck /
....................................................................
Status: HEALTHY
Total size:
128847681 B
Total dirs:
144
Total files: 200 (Files currently being written: 3)
Total blocks (validated):
198 (avg. block size 650745 B) (Total open file
blocks (not validated): 3)
Minimally replicated blocks: 198 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks:
1 (0.5050505 %)
Mis-replicated blocks:
0 (0.0 %)
Default replication factor: 3
Average block replication:
2.989899
Corrupt blocks:
0
Missing replicas:
7 (1.1824324 %)
Number of data-nodes:
3
Number of racks:
1
FSCK ended at Wed Oct 03 12:05:58 EDT 2012 in 44 milliseconds
Data Integrity File System Check: Commands & Output

Command
Output
$ hdfs fsck /
All files in the HDFS file system are checked.
$ hadoop fs du s h /
HDFS storage used not including replication factor.
$ hadoop fs count q /
Quota information can also provide storage space

used.
$ hdfs fsck / -openforwrite
A list of all files currently open for writing
$ hdfs fsck - files - blocks

locations
List all files that are rechecked and list all the blocks
for each file. Includes the addresses of the
DataNodes containing the blocks.
110
Make sure and redirect fsck output to a file if working on a large cluster. Writing to
STDOUT on a large cluster can be time consuming.
$ hdfs fsck / -files -blocks -locations > myfsck001.log
Look for key patterns in output of fsck. Search for these strings:
Target Replicas is x but found y replica(s).
CORRUPT block
CORRUPT
MISSING
Minimally replicated blocks
111
The dfs Command

The hdfs dfs command supports some additional HDFS administration
operations
The report option can provide some detailed information on the HDFS
environment
The refreshNodes option is used to refresh the list of nodes to decommission,
defined in a file specified by the dfs.hosts.exclude property.
The dfs command can be used to get a detailed listing of the HDFS
namespace
$ hdfs dfs -ls / > mydfslsr001.log
The dfs Command

The hdfs dfs command syntax is:
hdfs dfs [-report] [-safemode enter | leave | get | wait]
[-refreshNodes] [-finalizeUpgrade] [-upgradeProgress status
| details | force] [-metasave filename] [-setQuota <quota>
<dirname>...<dirname>] [-clrQuota <dirname>...<dirname>] [help [cmd]]
dfs command Option
Description
-report
Provides statistics on HDFS.
-safemode
Used to enter or leave Safemode.
-finalizeUpgrade
Removes cluster backup made during the last

upgrade.
-refreshNodes
Updates the DataNodes allowed to connect to

the NameNode.
112
Re-reads the config file to update the
DataNodes defined in the files specified

by the dfs.hosts and dfs.host.exclude
properties.
Each entry that exists in

dfs.hosts.exclude and NOT in dfs.hosts
is decommissioned.
Each entry that exists in dfs.hosts and

NOT in dfs.hosts.exclude is stopped
from decommissioning if it had been
previously defined for
decommissioning.
DataNodes not in either dfs.hosts or

dfs.hosts.exclude will be
decommissioned.
$ hdfs dfs -report > datanodereport001.log Returns a list of all the DataNodes in a cluster.
$ hdfs dfs -report >

datanodelist102113.log
$ hdfs dfs -lsr / > namespace102113.log
Used together, the dfsadmin and dfs

commands can be used to compare old and
current lists of DataNodes and the HDFS
Namespace.
$ hdfs dfs -safemode enter
Puts HDFS in Safemode.
$ hdfs dfs refreshNodes
Decommissions or Recommissions
DataNodes(s).
$ hdfs dfs -upgradeProgress status
Determines if there is currently an upgrade in

progress.
$ hdfs dfs finalizeUpgrade
Makes an upgrade permanent after completing

all the work.
113
The NameNode information can also be saved to a
file using the metasave option:
347 files and directories, 201 blocks = 548 total
Live Datanodes: 3
Dead Datanodes: 0
Metasave: Blocks waiting for replication: 1
/user/root/.staging/job_201210012351_0005/job.jar: blk_2319384830921372914_1198
(replicas: l: 3 d: 0 c: 0 e: 0) 10.202.29.145:
50010 : 10.77.22.74:50010 : 10.34.49.188:50010 :
Metasave: Blocks being replicated: 0
Metasave: Blocks 0 waiting deletion from 0 datanodes.
Metasave: Number of datanodes: 3
10.202.29.145:50010 IN 885570207744(824.75 GB) 124239872(118.48 MB) 0.01%
842086092800(784.25 GB) Wed Oct 03 12:18:41 EDT 2012
10.77.22.74:50010 IN 885570207744(824.75 GB) 132116480(126 MB) 0.01%
842075832320(784.24 GB) Wed Oct 03 12:18:42 EDT 2012
10.34.49.188:50010 IN 885570207744(824.75 GB) 122011648(116.36 MB) 0.01%
842083241984(784.25 GB) Wed Oct 03 12:18:41 EDT 2012
The metasave option creates a file (named filename) that is written to
HADOOP_LOG_DIR/hadoop/hdfs on the NameNodess local file system and contains:
Blocks waiting for replication.
Blocks being replicated.
Blocks waiting for deletion.
Each block will list the DataNodes to which it is replicated.
Summary statistics.
114
Changing the Replication Factor

The replication factor can be increased or decreased. When the replication factor of a
file is decreased, the NameNode will eliminate block replicas to get to the modified
replication factor. The NameNode will have a DataNode eliminate a block replica during
a heartbeat. The DataNode will remove the block replica, freeing up the space,
although, there may be a time delay before the block is deleted.
To set the replication of file to 2:
$ hadoop dfs -setrep w 2 /directorypath/file
To set the replication of all of HDFS recursively to 4:

./bin/hadoop dfs -setrep w 4
115
Unit 5 Review
1. What is the priority of placement of the second block replica during block
replication? ____________________
2. What is the purpose of setting the io.bytes.per.checksum parameter?
_____________________
3. What process uses the dfs.datanode.scan.period.hours parameter?
_____________________
4. List three things an hdfs fsck command will look for?
5. Which option of the hdfs fsck command would you use to list DataNode
addresses for the blocks? _______________________
6. What output value(s) of the hdfs fsck command would you use to determine the
total amount of disk storage including replication is a file taking up?
7. Why would you run the command below?
$ hdfs dfs -report > myreport001.log
116
Lab 5.1: Verify Data with Block Scanner and

fsck
Objective: View the various tools for performing block verification and
the health of files in HDFS.
Successful Outcome: You will see the result of the Block Scanner Report on node1,
and the output of the fsck command.
Step 1: Stop HDFS service

Step 2: Configure the Scan Period
1.1. Go to HDFS -> Configs, then expand the Custom hdfs-site.xml section and
click the Add Property
1.2. Add the following property to hdfs-site.xml:
dfs.datanode.scan.period.hours=1
Step 2: Restart the HDFS Service

Step 3: View the Block Scanner Report
3.1. Point your web browser to http://node1:50075/blockScannerReport. The
report will look similar to the following:
117
3.2. How many blocks are on your node1 DataNode? ______________

Step 4: View the Block Details
4.1. Add the listblocks parameter to the blockScannerReport URL:
http://node1:50075/blockScannerReport?listblocks
You should see a list of all blocks on that DataNode and their status:
NOTE: If a block is corrupt, the NameNode is notified and attempts to fix the
issue. The default time period for scanning blocks is every three weeks, so in
a production environment you would not set this interval to 30 minutes like
you did in this lab. Use the block scanner report as a quick way to verify the
integrity of the blocks in your cluster.
Step 5: Run the fsck Command on a File

5.1. Put the file /root/data/test_data into the /user/root folder of HDFS:
118
# hadoop fs -put ~/data/test_data test_data
5.2. Run the fsck command for the file /user/root/test_data:

# hdfs fsck /user/root/test_data -files
5.3. How many blocks did test_data get split into? ____________
5.4. What is the average block replication of test_data? ___________
Step 6: Using fsck Options
6.1. Run the fsck command again, but this time add the -blocks option:
# hdfs fsck /user/root/test_data -files -blocks
6.2. What did the blocks option add to the output? _________________________
6.3. Add the -locations option as well:
# hdfs fsck /user/root/test_data -files -blocks -locations
6.4. What did the locations option add to the output? _______________________
Step 7: Run a File system Check
7.1. You can run fsck on the entire file system. Enter the following command:
# hdfs fsck /
Notice this command fails, because root does not have permission to view all the
files in HDFS.
7.2. Switch to the hdfs user:
# su - hdfs
7.3. Run fsck on the entire file system:

$ hdfs fsck /
7.4. What is the total size of your HDFS? _______________________
119
7.5. How many directories does your cluster have? ____________

7.6. How many files are on your cluster? _____________
7.7. How many total blocks are on your cluster? ___________
7.8. What is the average block replication of your cluster? ________________
7.9. Switch back to the root user:
[hdfs@node1 ~]$ exit
Step 8: View the Health of your Cluster

8.1. From the Hosts page for node1 in Ambari, stop the DataNode service on
node1.
8.2. Click on the Services page, then HDFS. You should see that one of your
DataNodes is not live.
8.3. Using the Quick Links menu on the HDFS Services page (shown in the
screenshot below), open the NameNode UI:
The NameNode UI opens the dfshealth.jsp page by default.

8.4. Notice you still have 3 Live Nodes and 0 Dead Nodes:
NOTE: It takes 10.5 minutes for a DataNode to be marked as dead in a

cluster.
120
8.5. Click on the Live Nodes link to view the Live DataNodes in your cluster:
Step 9: Take a Break

9.1. ...and wait for your DataNode to be marked as Dead in your cluster!
9.2. Refresh the Live DataNodes page, and you should only see two live
DataNodes:
9.3. Go back to the dfshealth.jsp page and refresh it. Notice you now have 1 Dead
Node and a large number of under-replicated blocks:
9.4. Why does your cluster have so many under-replicated blocks? ___________
_________________________________________________________________
Step 10: Run fsck Again
10.1. Switch to the hdfs user and run fsck on the entire file system:
# su - hdfs
[hdfs@node1 ~]$ hdfs fsck /
Notice you get a long list of every file that contains under-replicated blocks.
121
10.2. What is the average block replication now on your cluster? _____________
10.3. Compare the value of Missing replicas in the output of fsck with the value of
Number of Under-Replication Blocks in the NameNode UI.
Step 11: Start the DataNode Again
11.1. Using Ambari, start the DataNode process on node1.
11.2. Refresh the dfshealth.jsp page in the NameNode UI frequently, and you can
watch as the number of under-replicated blocks gradually decreases to 0:
11.3. Run fsck again on your entire file system, and notice everything is back to
normal again.
RESULT: The Block Scanner Report is a quick way to view the status of the blocks on the
DataNodes of your cluster. The fsck tool is a great way to view the health of your file
system and block replication, as is using the NameNode UI.
122
Unit 6: HDFS NFS Gateway

Topics covered:
HDFS NFS Gateway Introduction
HDFS NFS Gateway
Configuring the HDFS NFS Gateway
Starting the NFS Gateway Service
User Authentication
Lab 6.1: Mounting HDFS to a Local File System
123
The HDFS NFS Gateway

HDFS access is usually done using the HDFS API or the web API
The HDFS NFS Gateway (NFS) allows HDFS to be mounted as part of a
client local file system
DFSClient
The NFS Gateway is a stateless daemon, that translates NFS protocol to

HDFS access protocols
ferP
r an s
a t aT
c
roto
ol
ClientProtocol
NameNode
NFSv3
Client
NFS
Gateway
DataNode
Data
Tran
s
ferP
rotoc
ol
DataNode
HDFS NFS Gateway Introduction

Network File System (NFS) is a distributed file system protocol that allows access to files
on a remote computer in a manner similar to how local file system is accessed.
The DFSClient is inside the NFS Gateway daemon (nfs3), therefore, the DFSClient is part
of the NFS Gateway.
HDFS NFS Gateway allows HDFS to be accessed using the NFS protocol supporting more
applications and use cases. All HDFS commands are supported from listing files, copying
files, moving files, creating and removing directories.
NFS Client: The number of application users doing the writing and the number
of files being loaded concurrently define the workload.
DFS Client: Multiple threads are used to process multiple files. DFSClient
averages 30 MB/S writes.
NFS Gateway: Multiple NFS Gateways can be created for scalability.
124
HDFS NFS Gateway simplifies data ingest of large-scale analytical workloads. Random
writes are not supported. Different ways the NFS interface to HDFS can be used include:
Browsing files on HDFS.
Downloading files from HDFS.
Uploading files on HDFS.
Streaming data directly to HDFS.
A few reminders:
HDFS is a read-only file system with append capabilities.
NFSv3 is a stateless environment.
After an idle period, the file will be closed.
Following write will reopen the file.
NOTE: NFSv4 support, HA, Kerberos are on the roadmap for the HDFS NFS
Gateway.
125
NFS Gateway Node

For scalable loading of data to HDFS, you should use webHDFS or the Java RPC. NFS
supports loading of smaller (< 1 GB at a time) files into HDFS through the NFS Gateway.
NFS Gateway Node is a very easy way to load data not necessarily the fastest. As
mentioned earlier, you should create multiple HDFS NFS Gateways to increase
scalability.
When an application user writes to an HDFS mount point, the Gateway server will
generate one DFSClient. If the application user writes a number of files to each HDFS
mount point, the Gateway Server will generate one DFSClient with multiple threads.
When multiple application users perform loads with multiple files to each mount point,
the Gateway server will generate multiple DFSClients with multiple threads each.
HDFS NFS Gateway was introduced in HDP 1.3. There are some operational limits to be
aware of:
Additional Gateway servers have to be setup, started and monitored manually.

Ambari management is on the roadmap.
HA is not build into the Gateway Servers. IF a gateway sever goes down then the
corresponding HDFS client mounts will fail.
126
Configuring the HDFS NFS Gateway

Edit the hdfs-site.xml file on your NFS gateway machine and modify the following
property. If the export is mounted with access time update allowed, make sure this
property is not disabled in the configuration file.
<property>
<name>dfs.namenode.accesstime.precision</name>
<value>3600000</value>
<description>The access time for HDFS file is precise
upto this value.
The default value is 1 hour. Setting a value
of 0 disables access times for HDFS.
</description>
</property>
127
Update the following property to hdfs-site.xml. This sets the maximum number of files
being uploaded in parallel.
<property>
<name>dfs.datanode.max.transfer.threads</name>
<value>1024</value>
</property>
Add the following property to hdfs-site.xml. NFS client often reorders writes.
Sequential writes can arrive at the NFS gateway at random order. This directory is used
to temporarily save out-of-order writes before writing to HDFS. You need to make sure
the directory has enough space.
<property>
<name>dfs.nfs3.dump.dir</name>
<value>/tmp/.hdfs-nfs</value>
</property>
Change the log level in the log4j.property file to debug to collect more details:
log4j.logger.org.apache.hadoop.hdfs.nfs=DEBUG
To get more details of RPC requests, add the following:

log4j.logger.org.apache.hadoop.oncrpc=DEBUG
128
Starting the NFS Gateway Service

NFS3 can be thought of as a combination of the nfsd and mountd daemon together.
HDP2 release enables file streaming to HDFS through NFS.
Start the portmap in the NFS gateway package as root.
# hdfs portmap
Start mountd and nfsd making sure the user starting the Hadoop cluster and the user
starting the NFS gateway are the same.
$ hdfs nfs3
Stop the NFS gateway service.

$ hadoop-daemon.sh stop nfs3
$ hadoop-daemon.sh stop portmap
Make sure NFS gateway services have started properly. Verify mountd, portmapper and
NFS are up and running.
129
Execute the following command to verify if all the services are up and running:
rpcinfo -p $nfs_server_ip
Make sure the HDFS namespace is exported and can be mounted by any client.
#
showmount -e $nfs_server_ip
Make HDFS available through NFS.

mount -t nfs -o vers=3,proto=tcp,nolock $server:/
$mount_point
130
User Authentication
The OS login user id on the NFS client must match the user id
accessing HDFS
LDAP/NIS should be used to make sure the same user ids are deployed
on the NFS client and HDFS
NFS
Client
Gage
Login user: Gage

UID: 350,
GID: 300
NFS
Gateway
Gage
NameNode
Look up UID/GID
for user
User Authentication
The user authentication method needs to make sure the UID/GID match between the
user access, the NFS client, and the user running the HDFS operations.
The manual creation of users is not recommended for production environments.
131
Unit 6 Review
1. What nodes in a Hadoop cluster can the HDFS NFS gateway run on?
2. A _________ needs to be running on the NFS gateway node.
3. What configuration file is modified to configure the HDFS NFS Gateway server?
4. The _____________ must match between the NFS client and HDFS for user
authentication.
132
Lab 6.1: Mounting HDFS to a Local File System
Objective: To learn how to mount HDFS to a local file system.

Successful Outcome: HDSF will be mounted to /home/hdfs/hdfs-mount.
Step 1: Install NFS

1.1. Execute following yum command to install NFS on node1:
# yum -y install nfs*
Step 2: Configure NFS

2.1. Within Ambari, stop the HDFS service.
2.2. Go to Services -> HDFS -> Configs, scroll down to the Advanced section and
click on the Advanced heading to expand the section.
2.3. Locate the dfs.namenode.accesstime.precision property and set it to
3600000.
2.4. Scroll down to the "Custom hdfs-site.xml" section.
2.5. Using the "Add Property..." link, add the following new property:
dfs.nfs3.dump.dir=/tmp/.hdfs-nfs
2.6. Click the Save button to save your changes.

2.7. Start the HDFS service back up.
Step 3: Start NFS
133
3.1. Run the following commands to stop the nfs and rpcbind services. (If they are
not running, the following commands will fail, which is no problem):
# service nfs stop
# service rpcbind stop
3.2. Now start the NFS services using the hadoop-daemon.sh script:
# hdfs portmap &
# hdfs nfs3 &
3.3. Verify that the required services are up and running.

# rpcinfo -p node1
You should se output similar to the following:
3.4. Verify that the HDFS namespace is exported and can be mounted by any
client.
# showmount -e node1
You should se output similar to the following:
134
Step 4: Mount HDFS to the Local File System

4.1. As the root user, create a new directory on your local file system named
/home/hdfs/hdfs-mount.
4.2. Change its ownership to the hdfs user.
4.3. Execute the following command on a single line to mount HDFS to the hdfsmount directory:
# mount -t nfs -o vers=3,proto=tcp,nolock node1:/
/home/hdfs/hdfs-mount
4.4. Browse HDFS now on your local file system.

# ls -l /home/hdfs/hdfs-mount
4.5. Switch to the hdfs user.

4.6. Copy a local file using the cp command into the hdfs-mount directory, and
then check whether you can see this file using the hadoop fs -ls / command.
4.7. Delete the file from the local file system using the rm command, and verify
the file is no longer in HDFS.
4.8. Try other Linux commands and see whether they work successfully.
RESULT: You have mounted HDFS to a local file system, which can be a convenient skill
to know how to do when working frequently with files in HDFS.
135
Unit 7: YARN Architecture and

MapReduce
Topics covered:
136
What is YARN?
Hadoop as Next-Gen Platform
Beyond MapReduce
YARN Use Case
YARN Birds Eye View
Lifecycle of a YARN Application
ResourceManager
NodeManager
MapReduce
Demonstration: Understanding MapReduce
Configuring YARN
Configuring MapReduce
Tools
Lab 7.1: Troubleshooting a MapReduce Job
What is YARN?
The goal of an operating system is to facilitate applications to achieve 100% utilization
of all resources on the physical system while letting every application execute at its
maximum potential. This is what YARN achieves in a Hadoop cluster. The compute
resources managed by YARN in a Hadoop cluster are memory and cpu. A YARN
application can request these resources and YARN will make them available according to
its scheduler policy.
What distinguishes YARN from other distributed compute frameworks is that the
applications that can run on YARN can be rapidly developed. Many standalone
applications have already been adapted and they range in types from batch applications
such as MapReduce to realtime always-on database applications such as HOYA (HBase
on YARN).
137

Single Use System
Multi Purpose Platform
Batch Apps
Batch, Interactive, Online, Streaming,
HADOOP 1.0
HADOOP 2.0
MapReduce
(cluster resource management
& data processing)
HDFS
(redundant, reliable storage)
MapReduce
Others
(data processing)
(data processing)
YARN
(cluster resource management)
HDFS2
(redundant, reliable storage)

What are key areas that needed improvement in Hadoop 1 that were addressed in
YARN?
Scalability: It was difficult for a Hadoop 1 cluster to go beyond 4000 nodes.
Availability: The JobTracker was a single point of failure. If it failed, then ALL
jobs failed.
Hard partition of resources into map and reduce slots: This limitation is a major
factor that causes a clusters compute resources to be underutilized.
Lacks support for alternate paradigms and services: Legacy Hadoop was meant
to solve batch-processing scenarios, and MapReduce was the only programming
paradigm available.
Hadoop 2 presents five key benefits:
Scale: YARN has been successfully deployed on 35,000+ nodes.
New programming models & services: Resource management and task

management are now separated concerns. This opens the doors to having
138
different types of applications. Applications dont necessarily have to be batch

jobs. Resources are now generic that any kind of application can ask for them.
Improved cluster utilization: In Hadoop 1 MapReduce, each worker node or

TaskTracker had a hardcoded allocation of slots that were a combination of Map
and Reduce slots. Because resources are dynamically allocated per application in
YARN, there are no longer any wasted compute units.
Agility: Applications written on YARN can adapt to a changing cluster in terms of

resource availability, data locality, state persistence for handling failure, etc.
Beyond Java: The types of applications that run on YARN are not limited to Java.
Applications written in any language, as long as the binaries are installed on the
cluster, can run natively, all while requesting resources from YARN and utilizing
HDFS.
139
Beyond MapReduce
Tez
MapReduce as a workflow
Hoya Hbase on YARN

HBase resource management now centralized
Storm
Stream processing; always-running application
Applica ons Run Na vely IN Hadoop

BATCH
INTERACTIVE
(MapReduce)
(Tez)
ONLINE
(HBase)
STREAMING
(Storm, S4,)
GRAPH
(Giraph)
IN-MEMORY
(Spark)
HPC MPI
(OpenMPI)
OTHER
(Search)
(Weave)
YARN (Cluster Resource Management)

HDFS2 (Redundant, Reliable Storage)
Beyond MapReduce
Remember that MapReduce is just a type of application paradigm that can run on YARN.
Applications are continuously being ported to run on YARN to utilize HDFS storage and
in some cases to utilize YARNs distributed compute framework itself.
140
YARN Use-case
Two key factors led to dropping an entire datacenter of 10k nodes:
1. YARN is a pure resource manager; it does not care about application specifics
such as what type of application is running on the cluster. Resource management
is lightweight once these types of details are offloaded to other processes. YARN
simply knows about resource availability for each node in the cluster and will
lease these resources based on its scheduler policy. The responsibility of
applications using these resources is left to another type of per-application
process called an ApplicationMaster.
2. MapReduce in Hadoop 2 (MRv2) itself has taken advantage of this type of
architecture, where each job has its own ApplicationMaster. We will discuss
ApplicationMaster details later on. Each MRv2 jobs resource requests are
dynamically sized for its Map and Reduce processes.
141
YARN Birds Eye View

ResourceManager
ResourceManager (master)
Application management
Scheduling
Security
ApplicationsManager
Scheduler
NodeManager (worker)
NodeManager 1
Provides resources to lease

Memory, cpu
Container as unit of processing

ApplicationMaster to manage
individual jobs
Job History Server
Job history preserved in HDFS
Client & Admin Utilities

CLI
REST
JMX
Security
Container
Container
Container
Container
Job1
Task1
Job2
Map1
Job2
Reducer
Job2
Map2
Free
Capacity
NodeManager 2
Container
Container
AppMaster
YARN Job1
Job1
Task2
Free Capacity
NodeManager 3
Container
Container
Container
Container
Container
Job2
Map3
Job2
Map4
Job2
Map5
Job2
Map6
AppMaster
MR Job2
Free
Capacity
YARN Birds Eye View

The following core components make up the YARN framework:
ResourceManager
The ResourceManager is the heart of YARN. This the single entry point for clients to
submit YARN applications. Its responsible for:
Application management: Launching, tracking application status, and leasing

cluster resources to applications
Scheduling: When an application requests resources, resources are made

available (or rejected) to the application based on the ResourceManagers
scheduler policy. The default scheduler is the Capacity Scheduler where an
application is submitted to a pre-configured queue. A queue will have a certain
portion of the clusters resources allocated. Another type of scheduler is the Fair
Scheduler. These are discussed in more detail later.
Security: Intracommunication between YARN components can be secured with

authorization tokens. The ResourceManager contains token managers that are
used by various components:
142
AMRMTokenSecretManager: Per application; ApplicationMasters will

use a token when submitting ResourceManager requests. The manager
stores the tokens by ApplicationID.
RMContainerTokenSecretManager: Per container; after a container is

leased by the ResourceManager to the ApplicationMaster, the
ApplicationMaster submits a container specific token as part of the
container launch request to the NodeManager.
RMDelegationTokenSecretManager: Per RM client; an unauthenticated

process can be passed a delegation token by a trusted client and can then
communicate with the ResourceManager.
NodeManagers: These are the worker nodes in a YARN cluster. They publish
resource pools (memory & CPU) to the ResourceManager. The ResourceManager
will have an aggregate view of these resources.
Containers: A Container is the atomic resource unit in YARN. Containers are

sized by memory and cpu (virtual cores) and are dynamic. Thus the number of
containers that can be running in parallel on a single NodeManager will vary
depending on what the ApplicationMaster has requested.
ApplicationMasters: The controllers in a YARN application. An

ApplicationMasters responsibility is to ensure that the application gets the
necessary resources, and hence, requests/releases them from the
ResourceManager. It is also responsible for handling container failures,
application-level cleanup, and application-level failure recovery.
ApplicationMasters themselves run in a Container.
Client & Admin Utilities: YARN provides both client and admin command-line
tools. For monitoring YARN components, there is a REST API as well as MBeans
for daemon processes.
143

1
Client submits application request
ResourceManager
Response with ApplicationID
Scheduler
ApplicationsManager
Application Submission Context

(user, queue, dependencies, resource requests)
+
3
Containter Launch Context
(resource reqs, launch commands, delegation tokens)
ApplicationMasterService
5
4
Get Capabilities
Start ApplicationMaster
6
NodeManager 1
Container
Container
Job
Map1
Job
Reducer1
Free
Capacity
NodeManager 2
Container
r
Job
Map2
NodeManager 3
Free Capacity
Req/Rec Containers
Container
Container
Container
Container
Container
Job
Map3
Job
Map4
Job
Map5
Free
Job
Capacity
Map6
ApplicationMaster
MR Job
Container Launch Requests
NodeManager 4
Container
Container
Job
Map7
Job
Map8
Container
Container
Job Free Capacity

Job
Map10
Map9
Container
Job
Reducer2

1. Client submits application request: This is an initial lightweight request for an
ApplicationID.
2. Response with ApplicationID: If the request is successful, the
ApplicationManager will respond with an ApplicationID, which will be used for
the actual application submission to the cluster.
3. Application Submission Context (ASC) + Container Launch Context (CLC): The
client submits the application to the ApplicationManager. Any queue,
dependencies, container launch commands, etc. are sent in the request as well.
4. The ApplicationManager is responsible for finding a Container on a
NodeManager to start the ApplicationMaster.
5. Once the ApplicationMaster has started, it will establish connection with the
ResourceManager, specifically, a component called the
ApplicationMasterService. It will then retrieve cluster capabilities (memory &
cpu) availability.
144
6. The ApplicationMaster has the option to continue. If it does, it will send a

request for Containers to run the application. In the case of MapReduce, it will
ask for specific NodeManagers because it wants to co-locate processing with
data blocks in HDFS.
7. The ApplicationMaster will receive leases on Containers per the YarnScheduler
policy. An ApplicationMaster can request resources as often as necessary. It will
also have to take care of graceful shutdown of Containers by releasing its lease(s)
to the ResourceManager.
8. Once the ApplicationMaster has determined that the application should end, it
can optionally persist application logs to HDFS, and any other post-process
activity before it self-terminates.
145
ResourceManager
146
NodeManager
147
MapReduce
Map Phase
Shuffle/Sort
Reduce Phase
NM + DN
NM + DN
Mapper
Reducer
NM + DN
Mapper
Data is shuffled
across the network
and sorted
NM + DN
NM + DN
Reducer
Mapper
MapReduce
The original use-case for Hadoop was distributed batch processing. MapReduce is a
power application paradigm for processing massive amounts of data.
Core features of MapReduce are:
Co-locating processing with data blocks: Take the computing to where the data
lives, rather than querying or reading data into a remote application. Would you
rather move hundreds of GB/TB of data around your network, or would you
rather move an application that processes the same data to where the data
actually lives?
Map Phase: This is the initial phase of all MapReduce jobs. This is where raw
data can be read, extracted, transformed, and results written out to HDFS or
moved on to Reducers for aggregate processing, such as a final count, sum, min,
max, etc. The Map phase can also be thought of as the ETL or projection step for
MapReduce.
Reduce Phase: This is the final phase where data is sorted on a user-defined key
and grouped by that same key. The Reducer has the option to perform an
148
aggregate function on that data. The Reduce phase can be thought of as the
aggregation step.
Data is always moved along the pipeline in MapReduce in the form of key/value
pairs.
A MapReduce job scales to the size of the data For example, if a dataset in
HDFS is 1 terabyte broken into 256MB blocks, it is possible for 4096 mappers to
run in parallel to read each block (if the cluster has the capacity).
149
Demonstration: Understanding MapReduce
Objective: To understand how MapReduce works.

During this Watch as your instructor performs the following steps.
demonstration:

1.1. Change directories to the labs folder:
# cd /root/labs
1.2. Notice a file named constitution.txt:

# more constitution.txt
1.3. Put the file into HDFS:

# hadoop fs -put constitution.txt constitution.txt
Step 2: Run the WordCount Job

2.1. The following command runs the WordCount job on the constitution.txt and
writes the output to wordcount_output:
# hadoop jar wordcount.jar wordcount.WordCountJob
constitution.txt wordcount_output
2.2. Notice a MapReduce job gets submitted to the cluster. Wait for the job to
complete.
150
Step 3: View the Results

3.1. View the contents of the wordcount_output folder:
# hadoop fs -ls wordcount_output
You should see a single output file named part-r-00000:

Found 1 items
-rw-r--r-- 1 root hdfs 17031 wordcount_output/part-r-00000
3.2. Why is there one file in this directory? ______________________________

3.3. What does the r in the filename stand for? _________________________
3.4. View the contents of part-r-00000:
# hadoop fs -cat wordcount_output/part-r-00000
3.5. Why are the words sorted alphabetically? _____________________________

3.6. What was the key output by the WordCount reducer? ___________
3.7. What was the value output by the WordCount reducer? _____________
3.8. Based on the output of the reducer, what do you think the mapper output
contained as key/value pairs? _________________________________________
__________________________________________________________________
151
Configuring YARN
For configuring YARN, there is one core configuration file:
/etc/hadoop/conf/yarn-site.xml
The most important aspect to configuring YARN is how resource allocation works. There
are two types of resources:
Physical: The total physical resources (memory) that a container will claim.
Virtual: The total virtual resources (memory) that a container will claim. It is
usually much larger than physical memory. You want to keep this higher
because once the containers are running, a process can often times take
advantage of virtual memory addressing in order to give the application an
impression that it has more memory than physically allocated.
Why does this work? Because the underlying operating system will page out
memory thats not being used to a partition on its local disk known as a
swap partition.
152
Ports firewall considerations:

Service
Servers
Resource Manager WebUI

ResourceManager
NodeManager WebUI
ResourceManager 8088
ResourceManager 8032
NodeManagers
50060
Ports
http
IPC
http
WebUI for RM
Application submissions
WebUI for NMs
153
Configuring MapReduce
There are a few additional considerations for properly configuring MapReduce. In the
mapred-site.xml, there are two additional properties that should be kept in tune:
MapReduce container size (physical):
mapreduce.map.memory.mb: 1GB + 512MB for non-heap such as permgen.
mapreduce.reduce.memory.mb: 2GB + 512MB for non-heap such as

permgen.
Now you need to ensure that the jvm heap is lower than the physical allotted to the
container:
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1024m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx2048m</value>
</property>
154
Notice the Xmx (Java Heap max) is less than the container allocation of 2560MB. This
will give the container some breathing room to continue running without expressing any
out of memory issues.
155
Tools
Starting daemons from the command line:
# yarn resourcemanager
# yarn nodemanager
# yarn proxyserver
Web Application Proxy

It is the responsibility of an ApplicationMaster to provide a web UI and send the link to
the UI to the ResourceManager. The ResourceManager runs as a trusted user, however,
ApplicationMasters are not. In order to provide a warning to users that they are visiting
an untrusted link and also stripping out any cookies sent by the user, a proxy server can
be optionally started. Note that a web application proxy runs embedded within the
ResourceManager if yarn.web-proxy.address is not set.
156
Admin operations
Operation
Description
$ yarn rmadmin
Primarily to refresh properties such as

queues, nodes, and acls.
$ yarn application
List/kill applications.
$ yarn node
Print node reports such as status (rack

info, containers, memory, cpu) and view
nodes by states (i.e. RUNNING, INITING,
etc.).
$ yarn logs
Container level logs dumped to stdout.
$ yarn daemonlog
Get/set log level for live daemons.
REST
YARN MR applications and cluster can be monitored via the REST API. The REST API is
available at:
http://<resourcemanager:port>/ws/v1/cluster
http://<node:port>/ws/v1/node
http://<webapplicationproxy:port>/proxy/<appid>/ws/v1/mapreduce
Currently, GET requests are supported for monitoring and gathering metrics. Full REST
API usage with examples can be found at:
http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarnsite/WebServicesIntro.html
Ambari
The primary management and provisioning of YARN components should be done via
Ambari (if possible). Ambari has extensive UIs to manage your YARN cluster.
157
Unit 7 Review
1. What are the three main phases of a MapReduce job? _____________
_________________________________________________________
2. What determines the number of Mappers of a MapReduce job? ___________
________________________________________________________________
3. What determines the number of Reducers of a MapReduce job? ___________
________________________________________________________________
4. What are the 3 main components of YARN? ____________________________

________________________________________________________________
5. What happens if a Container fails to complete its task in a YARN application?
________________________________________________________________
158
Lab 7.1: Troubleshooting a MapReduce Job

Objective: To learn tips on how to troubleshoot issues with a
MapReduce job.
Successful Outcome: You will have run the IndexInverter MapReduce job several
times and viewed its log files.

1.1. Change directories to the labs folder:
# cd /root/labs
1.2. Notice a file named hortonworks.txt. View its contents:

# more hortonworks.txt
This file contains URLS, along with keywords found on the webpages of each URL.
NOTE: The MapReduce job in this lab computes an inverted index, one of
the earliest use cases of Hadoop and MapReduce. A Web crawler scans the
Internet and retrieves URLs along with keywords on each page. The index
inverter job flips this information around, outputting the keywords along
with each web page that contains the keyword.
Step 2: Run the IndexInverterJob

2.1. Enter the following command to run the IndexInverterJob:
# hadoop jar invertedindex.jar inverted.IndexInverterJob
hortonworks.txt index_output
159
2.2. The job fails. What is the issue? ___________________________________

________________________________________________________________
Step 3: Run the Job Again
3.1. Put the file /root/labs/hortonworks.txt into HDFS:
# hadoop fs -put /root/labs/hortonworks.txt hortonworks.txt
3.2. Run the job again, using the same command as the previous step.
3.3. The job should run successfully this time. How many map tasks were needed
for this job? ________ How many reduce tasks? ___________
3.4. How long (in ms) did it take for all the mappers to run? _________________
3.5. How long (in ms) did it take for all the reducers to run? _________________
3.6. How many bytes did the mappers of this job process? ____________
3.7. How many bytes did the reducers output? ____________
Step 4: View the Output
4.1. Verify the index_output folder was created in HDFS:
# hadoop fs -ls index_output
You should see a single output file named part-r-00000:

4.2. View the contents of part-r-00000:
# hadoop fs -cat index_output/part-r-00000
4.3. What did the reducer use as the key for its output? ________________
4.4. What did the reducer use as the values for its output? _________________
Step 5: Run the Job Again
5.1. Run the IndexInverterJob again with the exact same command.
5.2. The job failed. Why? ____________________________________________
5.3. Delete the index_output folder in HDFS.
160
5.4. Run the job again, and it should run successfully this time.
Step 6: View the Resource Manager UI
6.1. Point your web browser to Ambari at http://node1:8080.
6.2. From the Dashboard page, select YARN from the left-hand menu.
6.3. Select the Configs tab on the YARN page.
6.4. Which node in your cluster is the Resource Manager running on? __________
6.5. Point your web browser to the Resource Manager UI, which is
http://node2:8088. You should see the All Applications page:
6.6. In the Cluster menu on the left side of the page, click on the various links like
About, Nodes, Applications, and Scheduler. Notice there is a lot of useful
information provided in this UI.
Step 7: Troubleshoot a Problem

7.1. From a previous lab, you should have a file named hbase.jar in /user/root in
HDFS:
# hadoop fs -ls hbase.jar
Found 1 items
4708852 hbase.jar
If you do not have this file in HDFS, put it there. The file is found in your
/root/labs folder.
161
7.2. Run the IndexInverterJob using the following command (entered on a single
line):
# hadoop jar invertedindex.jar inverted.IndexInverterJob
hbase.jar index_output2
7.3. Notice exceptions are thrown, and eventually the job will fail. From the
output of the job, how many map tasks were launched? _________ How many
map tasks failed? ________ How many were killed? ________
7.4. The input file, hbase.jar, is split into 5 blocks in HDFS. Why did this
MapReduce job launch 10 map tasks? ___________________________________
Step 8: View the Log Files
8.1. Lets figure out what happened to the IndexInverterJob. View the Job History
page of node2 by pointing your browser to http://node2:19888:
Notice the most recent job at the top of the list has a status of FAILED.
8.2. Click on Job ID of the failed IndexInverterJob. You should see the details page
for this job:
162
8.3. Notice this page contains useful details about the job, including the average
map and reduce time, and how long it took to execute the entire job.
8.4. Also notice that 5 mappers and 1 reducer were started for this job. In the
screen shot above, 8 map tasks failed, 2 were killed, and 0 were successful. Notice
these numbers are links - click on your number of failed map tasks:
8.5. In the Logs column, click on the logs link of one of the failed map tasks to
view the corresponding log file:
8.6. What happened in this job? Why did the mapper fail? ___________________
RESULT: You have executed a MapReduce job that failed for several different reasons.
Being able to troubleshoot these types of issues is an important and handy skill for any
Hadoop administrator.
ANSWERS:
2.2: The job fails because the input file for the job does not exist in HDFS.
3.3: 1 mapper and 1 reducer, as found in the Job Counters section of the output.
3.4: Look for the counter Total time spent by all maps in occupied slots (ms)
163
3.5: Similarly, look for Total time spent by all reduces in occupied slots (ms)
3.6: Bytes Read=1126, as found in the File Input Format Counters section.
3.7: Bytes Written=2997, as found in the File Output Format Counters section.
5.2: The output folder of a MapReduce job cannot exist. You should have gotten the
following error message: FileAlreadyExistsException: Output directory
hdfs://node1:8020/user/root/index_output already exists
7.3: You should see 10 launched map tasks. The number of failed and killed tasks will
vary, but expect about 8 failed and 2 killed.
7.4: When a map task fails, the MapReduce framework launches the map task again. A
map task has to fail 2 times (by default) before the entire job fails. The input file was
split into 5 blocks, and each block generated a map task that failed 2 times, so 5x2=10.
8.6: A NullPointerException was thrown on line 32 of IndexInverterJob.java. Useful
information for the Java developer!
164
Unit 8: Job Schedulers

Topics covered:
Overview of Job Scheduling
The Built-in Schedulers
Overview of the Capacity Scheduler
Configuring the Capacity Scheduler
Defining Queues
Configuring Capacity Limits
Configuring User Limits
Configuring Permissions
Overview of the Fair Scheduler
Configuration of the Fair Scheduler
Multi-Tenancy Limits
Lab 8.1: Configuring the Capacity Scheduler
165
Job Scheduler
1 Client applications submit

jobs to the cluster.
2 A pluggable job scheduler

assigns resources for each
job.
3 The jobs execute on the

cluster within their assigned
resources.

YARN jobs are submitted to the ResourceManager, which is responsible for scheduling
jobs. The ResourceManager has a pluggable Scheduler that allows you to configure
scheduling to meet the needs of your organization and your cluster.
The Scheduler is responsible for allocating cluster resources to a job. This unit discusses
how to configure this important aspect of your Hadoop cluster.
NOTE: The Scheduler of the ResourceManager is purely a scheduler; it only

schedules jobs. It is not responsible for monitoring the status or ensuring the
completion of jobs.
166
The Built-in Schedulers

HDP has two commonly-used built-in Schedulers:
Capacity Scheduler: Schedules jobs based on memory usage using hierarchical

queues that you configure to meet the requirements of your organizations
needs.
Fair Scheduler: Schedules jobs so that all jobs get, on average, an equal share of
cluster resources.
We recommend using a Capacity Scheduler, and HDP uses the Capacity Scheduler by
default. We will discuss both of these schedulers in this Unit, but our focus will be on
configuring and managing the Capacity Scheduler.
167
Distributes jobs to queues based on capacity

Capacity determined by percentage of available memory
Total of all of the task percentage assigned has to add up to 100%
Can set individual user limits per queue
Extra capacity available is given to other queues evenly
Actual:
Configured for:
Queue1
Queue2
Queue3
40%
35%
25%
50%
30%
20%

The Capacity Scheduler uses a hierarchical collection of queues. The parent of all queues
is named root, and by default the root queue is allocated 100% of the resources.
You can define as many queues as you like, and distribute the percentage of resources
however you like, as long as the percentages add up to 100. This design of the Capacity
Scheduler allows you to share the resources of your Hadoop cluster according to your
organizations needs and SLAs.
For example:
Queue 1 might represent the Marketing department, which gets 40% of the
resources because it paid for 40% of the cluster from its budget.
Queue 2 might represent the Sales department, and they get 35% of the
resources because of a company SLA.
Queue 3 might represent the Engineering department, and the remaining 25% is
allocated to them until another department comes along and needs to use the
cluster.
168
The Capacity Scheduler provides elastic resource scheduling, which means that if some
of the resources in the cluster are idle, then one queue can take up more of the cluster
capacity than was minimally allocated to them in the above configuration.
Lets now take a look at how to define and configure queues.
169

The queues of the Capacity Scheduler are defined and configured in the capacityscheduler.xml file found in the HADOOP_CONF folder. You can edit the XML file
directly, or use Ambari on the Services -> YARN -> Configs page (as shown in the
screenshot above).
Manual editing: If you edit the XML file directly, run the following command to
have the changes take effect:
# yarn rmadmin -refreshQueues
170
Ambari: If you configure the Capacity Scheduler using Ambari, you will need to
stop the YARN service, make your changes, and then start YARN again.
Defining Queues
yarn.scheduler.capacity.root.queues=
"Marketing,Sales,Engineering"
yarn.scheduler.capacity.root.Marketing.capacity=50
yarn.scheduler.capacity.root.Sales.capacity=30
yarn.scheduler.capacity.root.Engineering.capacity=
20
Defining Queues
To define a child queue of root, use a comma-separated list of queue names for the
yarn.scheduler.capacity.root.queues property. For example:
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>Marketing,Sales,Engineering</value>
</property>
171
A queues properties are configured by adding the queue name to the specific property.
For example, the following allocates 50% of the total capacity to the Marketing queue
and 30% to the Sales queue:
<property>
<name>
yarn.scheduler.capacity.root.Marketing.capacity
</name>
<value>50</value>
</property>
<property>
<name>
yarn.scheduler.capacity.root.Sales.capacity
</name>
<value>30</value>
</property>
In Ambari, you set properties using an equals sign. For example:

yarn.scheduler.capacity.root.Engineering.capacity=20
172

yarn.scheduler.capacity.root.queues =
"Marketing,Marketing-longrunning "
yarn.scheduler.capacity.root.Marketing.maximumcapacity=80
yarn.scheduler.capacity.root.Marketing-longrunning
.capacity=35
.maximum-capacity=35

With elastic resource scheduling, a queue can use up more resources than it is initially
allocated if some of the resources in the cluster are idle. You can control the behavior of
this elasticity by configuring a queues maximum capacity using the
yarn.scheduler.capacity.<queue-name>.maximum-capacity property.
For example, the following Marketing queue can use up to 80% of a clusters resources:
yarn.scheduler.capacity.root.Marketing.maximum-capacity=80
173
A good use case for maximum-capacity is for applications that take a long time to run.
You may not want a long-running app to not consume a lot of resources, while providing
a large maximum for applications that you want to run quickly. You could setup
different queues for this behavior:
yarn.scheduler.capacity.root.queues="Marketing,
Marketing-longrunning"
yarn.scheduler.capacity.root.Marketing.maximum-capacity=80
.capacity=35
.maximum-capacity=35
Given the configuration above, answer the following questions:

1. What is the highest percentage of resources that an application submitted to the
Marketing queue use? ______________
2. What is the highest percentage of resources that an application submitted to the
Marketing-longrunning queue use? ______________
174
Configuring User Limits

You can configure a queue so that users of the queue are guaranteed a certain
minimum percentage of resources. For example, the following property assigns each
user a minimum of 20 of the resources available to the Sales queue:
yarn.scheduler.capacity.Sales.minimum-user-limit-percent=20
The maximum user limit is based on the number of users that have actually submitted
jobs at any given time. For example, two users each get a maximum of 50%, three users
would each get a maximum of 33%, and so on.
Suppose the Sales queue is configured with a user minimum of 20%. Answer the
following questions:
1. If one user submits two jobs to the Sales queue, then each job will get between
__________ and _________ percent of resource.
2. If 3 different users have submitted one job each to the Sales queue, then each
user will get between _______ and _________ percent of resources.
175
yarn.scheduler.capacity.root.
Engineering.acl_submit_applications=
"developer,admin,George,Tom
yarn.scheduler.capacity.root.
Engineering.acl_administer_queue=
"admin,Tom"
Each queue can define an Access Control List that authorizes which users and groups
can submit jobs to the queue. For example:
yarn.scheduler.capacity.root.Engineering.acl_submit_applica
tions="developer,admin,George,Tom"
There is also a property for configuring users and/or groups you can administer a queue:
yarn.scheduler.capacity.root.Engineering.acl_administer_que
ue="admin,Tom"
NOTE: The acl_submit_applications property includes users and groups

from any parent queue.
176
Overview of the Fair Scheduler

Fair scheduling is a method of assigning resources to applications such that all apps get,
on average, an equal share of resources over time. The Fair Scheduler is enabled using
the following property in yarn-site.xml:
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.
scheduler.fair.FairScheduler</value>
</property>
Some of the features of the Fair Scheduler include:
All applications get, on average, an equal share of resources (memory) based on

time.
Queues can be defined so that resources are shared fairly between these
queues.
Different fairness algorithms can be used, including FIFO and Dominant Resource
Fairness (which uses an algorithm combining memory usage with CPU usage).
177
Configuration of the Fair Scheduler

Queues are defined in an allocation file specified by the
yarn.scheduler.fair.allocation.file property.
The file looks like:
<?xml version="1.0"?>
<allocations>
<queue name="myqueue">
<minResources>10000 mb,0vcores</minResources>
<maxResources>80000 mb,0vcores</maxResources>
<maxRunningApps>25</maxRunningApps>
<weight>2.0</weight>
<schedulingPolicy>fair</schedulingPolicy>
<queue name="my_subqueue">
<minResources>5000 mb,0vcores</minResources>
</queue>
</queue>
<user name="Tom">
<maxRunningApps>10</maxRunningApps>
</user>
<userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>
For details of all configuration options of the Fair Scheduler, view the documentation at
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarnsite/FairScheduler.html.
178
Unit 8 Review
1. What are the two built-in schedulers in Hadoop?
2. Which scheduler does Hortonworks recommend you use in HDP 2.0?
3. When using a Capacity Scheduler, all queues are children of the ___________
queue.
Suppose you have the following properties configured:
yarn.scheduler.capacity.root.queues=A,B
yarn.scheduler.capacity.root.A.capacity=80
yarn.scheduler.capacity.root.B.capacity=20
yarn.scheduler.capacity.root.B.maximum-capacity=100
4. How many queues does this cluster have?

5. What does setting As capacity to 80 mean?
6. Is it possible that a job submitted to queue B can use 100% of the clusters
resources?
7. Is it possible that a job submitted to queue A can use 100% of the clusters
resources?
179
Lab 8.1: Configuring the Capacity Scheduler

Objective: Learn how to define and configure queues for the Capacity
Scheduler.
Successful Outcome: You will have two new queues configured.
Before You Begin: Your cluster should be up and running.
Step 1: View the Status of the Capacity Scheduler

1.1. Point your web browser to http://node2:8088 to view the UI of the Resource
Manager.
1.2. In the menu on the left side, click on the Scheduler link to view the current
status of the Capacity Scheduler:
1.3. Notice there is one child queue of root defined. What is the name of the
queue? _________________
1.4. Click on the arrow to the left of default to expand and view the settings of
the default queue:
180
1.5. Notice since there are no jobs running on your cluster, the status page simply
shows 0.0% of default is being used right now.
Step 2: View the Settings of the Capacity Scheduler
2.1. Go to the Ambari Dashboard page.
2.2. Click on the YARN link in the list of services, then click on the Configs tab and
scroll down to the Scheduler section:
2.3. Notice this is where you configure the settings for the scheduler of the
Resource Manager. Which type of scheduler is currently being used?
________________________________________
Step 3: Stop YARN
3.1. You cannot configure the Capacity Scheduler using Ambari while the Resource
Manager and Node Manager services are running. While on the YARN Services
page, click the Stop button in the upper-right corner of the page, then OK to
confirm. Wait for the YARN service to stop:
181
Step 4: Define Custom Queues

4.1. From the Configs tab of the YARN Services page, add two new queues named
A and B to the Capacity Scheduler (and leave the default queue defined also):
yarn.scheduler.capacity.root.queues=default,A,B
4.2. Assign the following capacities to the three queues:

default capacity = 20%
A capacity = 50%
B capacity = 30%
4.3. Assign the following maximum capacities to the three queues:
default maximum capacity = 100%
A maximum capacity = 70%
B maximum capacity = 50%
4.4. Save your changes by clicking the Save button at the bottom of the page. You
should see the following confirmation dialog:
4.5. Click OK to close the dialog window.

Step 5: Start YARN
5.1. Go back to the Summary page of YARN and click the Start button to start the
YARN service again.
182
Step 6: Verify the Changes

6.1. Go back to the Scheduler web page at http://node2:8088/cluster/scheduler.
6.2. You should see the A and B queues now:
6.3. Expand the A queue and verify its capacity is 50% and its maximum capacity is
70%:
6.4. Similarly, verify B and default are configured correctly.

6.5. Leave this web page open - you are going to view it in the next step.
Step 7: Submit a Job to a Specific Queue
7.1. Lets submit a MapReduce job to each new queue. SSH into node1, and
change directories to the /root/labs folder:
[root@node1 ~]# cd ~/labs
7.2. Make sure environment variable JAVA_HOME is defined, so enter the

following command:
# export JAVA_HOME=/usr/jdk64/jdk1.6.0_31/
7.3. Make sure you have a file in /user/root in HDFS named hbase.jar. If not, put
the HBase jar from the labs folder into HDFS, giving it the name hbase.jar.
7.4. In the first window, submit the test1.pig script to queue A by running the
following command (all on a single line):
# pig -Dmapreduce.job.queuename=A test1.pig 1>pig1.out
&>pig1.err &
183
7.5. While test1.pig is running, submit the test2.pig script to queue B in the other
terminal window:
# pig -Dmapreduce.job.queuename=B test2.pig 1>pig2.out
&>pig2.err &
7.6. While both jobs are running, refresh the Scheduler status page. (It may take a
minute for both jobs to run long enough to show up in the queues, so refresh the
page often until they do):
7.7. You should see resources being used in both the A and B queues.
RESULT: You just defined two queues for the Capacity Scheduler, configured specific
capacities for each queue, and submitted a job to each queue.
ANSWERS:
1.3: default
2.3: The Capacity Scheduler
184
Unit 9: Enterprise Data Movement

Topics covered:
Enterprise Data Movement
Challenges with a Traditional ETL Platform
Hadoop Based ETL Platform
Data Ingestion
Hadoop: Reducing Business Latency
Defining Data Layers
Distributed Copy (distcp) Command
distcp Options
Considerations for distcp
Using distcp for Backups
Lab 9.1: Use distcp to Copy Data from a Remote Cluster
185
Enterprise Data Movement

Organizations are continually looking for ways to reduce and manage costs. At the same
time, data centers are storing more and more data and for longer periods of time than
they ever have in the past. Compared to SAN, HDFS is a very cost effective way of
managing mass amounts of data. HDFS is a file system; therefore any type of data can
be stored in Hadoop.
Similar to any Data Warehouse or Enterprise Data Store (EDS), data architecture is a
critical component within the Hadoop cluster. As organizations move both their current
and historical data from the traditional data warehouses into Hadoop, many factors
must be considered and evaluated and a data strategy must be put into place.
The first step in that process is for organizations to evaluate their current system and
decide their goals for working with the data. Some common scenarios include:
Keeping the current data in the EDW and archiving historical data to Hadoop.
Loading the new semi-structured and unstructured data into Hadoop.
Building data layers and performing data analysis within Hadoop.
186
Using a hybrid approach incorporating some data layers in Hadoop and the
speed layer in HBase (more discussion on this later in this unit).
Using Hadoop to aggregate and filter the data then loading the results and into
an EDW and/or datamarts.
Storing the data in Hadoop and using high-speed connecters from Teradata,
Oracle, SQL Server, etc. and using the analytics in the EDW to read the data from
the Hadoop cluster.
As the data strategies evolve they create more data movement between the enterprise
data systems. Lifecycle data management has always been a central part of enterprise
data platforms, and the Hadoop and HBase clusters now become a part of that lifecycle.
Falcon is the framework used in Hadoop 2 for lifecycle data management.
187

Data discarded
due to cost and/or
performance
Incapable/high
complexity when
dealing with loosely
structured data
No visibility into
transactional data
EDW used as an ETL

tool with 100s of
transient staging tables
-Lot of time spent understanding

source and defining destination data
structures
-Doesnt scale linearly.

-License Costs High
-High latency between data generation

and availability

Traditional ETL platforms are not designed to handle big data. The traditional ETL
platforms have the following challenges when trying to work with big data:
They do not work well with semi-structured or unstructured data.
Scalability is very expensive with vendor proprietary solutions and SAN storage.
High cost of storage leads to eliminating an enormous amount of detailed data.
Traditional systems have high licensing costs.
There is very high business latency between data hitting the disks and being able
to make business decisions using the data.
188

-Provides data for use with
minimum delay and latency
-Enables real time capture
of source data
-Support for any type

of data: structured/
unstructured
-Store raw transactional data

-Store 7+ years of data with no archiving
-Data Lineage: Store intermediate stages of data
-Becomes a powerful analytics platform
-Linearly scalable on
commodity hardware
-Data warehouse can

focus less on storage
& transformation and
more on analytics
-Massively parallel
storage and compute

The weaknesses of the traditional platforms are the strengths of the Hadoop platforms.
Traditional ETL platforms are not designed to handle big data.
Hadoop platforms:
Easily work with semi-structured or unstructured data.
Offer relatively inexpensive storage.
Use free open source software and commodity hardware.
Hadoop greatly reduces the business latency between data hitting the disks and being
able to make business decisions using the data.
189
Data Ingestion
Data Sources/Transports
Web Logs,
Clicks
Social,
Graph, Feeds
Sensors,
Devices, RFID
Spatial, GPS
Extract &
Load
Big Data
Refinery
WebHDFS
Docs, Text,
XML
3rd Party
Audio, Video,
Images
Sqoop
Flume
DB Data
Events, Other
Data Ingestion
Data ingestion is one of the key components of any data warehouse, enterprise data
store or Hadoop cluster. It is a major effort to design data ingestion strategies for any
enterprise data store. Hadoop Data Refineries and Data Lakes take data ingestion to an
entirely new level of volume, speed and types of data.
Extraction, Transformation and Loading (ETL) has been a standard method for moving
data into enterprise data stores. The reason for the transformation before loading is
that the cost of SAN storage has required data to be transformed by aggregating and
filtering data to reduce the amount of data that will be loaded into an enterprise data
store.
With Extraction, Loading and Transformation (ELT) the data is loaded into Hadoop to a
layer known as the source of truth. This is the raw data. Since Hadoop can store data
much more cost effectively, all of the detailed data gets loaded into Hadoop. The data is
then transformed into different data layers.
190
Hadoop: Reducing Business Latency

Retain runtime models and historical data
for ongoing refinement & analysis
DB
Data
Business
Transactions
& Interactions
Audio,
Video,
Images
Docs,
Text,
XML
Social,
Graph,
Feeds
Sensors,
Devices,
RFID
Spatial,
GPS
Events,
Other
Web, Mobile, CRM,

ERP, SCM,
Big Data
Refinery
Web
Logs,
Clicks
Share refined data and

runtime models
Classic
ETL
processing
2
Store, aggregate,
and transform
multi-structured
data to unlock
value
Business
Intelligence
& Analytics
Retain historical
data to unlock
additional value
5
Dashboards, Reports,
Visualization,
Hadoop: Data Movement

There are three broad areas of data processing:
1. Business Transactions & Interactions Relational databases.
2. Business Intelligence & Analytics Data Warehouses and Enterprise Data Stores.
3. Big Data Refinery Hadoop Cluster.
Enterprise IT has been connecting systems via classic ETL processing, as illustrated in
Step 1 above, for many years in order to deliver structured and repeatable analysis. In
this step, the business determines the questions to ask and IT collects and structures the
data needed to answer those questions.
The Big Data Refinery, is a new system capable of storing, aggregating, and
transforming a wide range of multi-structured raw data sources into usable formats that
help fuel new insights for the business. The Big Data Refinery provides a cost-effective
platform for unlocking the potential value within data and discovering the business
questions worth answering with this data. A popular example of big data refining is
processing Web logs, clickstreams, social interactions, social feeds, and other user
generated data sources into more accurate assessments of customer churn or more
effective creation of personalized offers.
191
More interestingly, there are businesses deriving value from processing large video,
audio, and image files. Retail stores, for example, are leveraging in-store video feeds to
help them better understand how customers navigate the aisles as they find and
purchase products. Retailers that provide optimized shopping paths and intelligent
product placement within their stores are able to drive more revenue for the business.
In this case, while the video files may be big in size, the refined output of the analysis is
typically small in size but potentially big in value.
The Big Data Refinery platform provides fertile ground for new types of tools and data
processing workloads to emerge in support of rich multi-level data refinement solutions.
With that as backdrop, Step 3 takes the model further by showing how the Big Data
Refinery interacts with the systems powering Business Transactions & Interactions and
Business Intelligence & Analytics. Interacting in this way opens up the ability for
businesses to get a richer and more informed 360 view of customers, for example.
By directly integrating the Big Data Refinery with existing Business Intelligence &
Analytics solutions that contain much of the transactional information for the business,
companies can enhance their ability to more accurately understand the customer
behaviors that lead to the transactions.
Moreover, systems focused on Business Transactions & Interactions can also benefit
from connecting with the Big Data Refinery. Complex analytics and calculations of key
parameters can be performed in the refinery and flow downstream to fuel runtime
models powering business applications with the goal of more accurately targeting
customers with the best and most relevant offers, for example.
Since the Big Data Refinery is great at retaining large volumes of data for long periods of
time, the model is completed with the feedback loops illustrated in Steps 4 and 5.
Retaining the past 10 years of historical Black Friday retail data, for example, can
benefit the business, especially if its blended with other data sources such as 10 years
of weather data accessed from a third party data provider. The opportunities for
creating value from multi-structured data sources available inside and outside the
enterprise are virtually endless if you have a platform that can do it cost effectively and
at scale.
192

There are lots of different ways of organizing data in an enterprise data
platform that includes Hadoop.
Organize data
based on
source/derived
relationships
Allows for fault
and rebuild
process
Speed
Layer
Conform, Summarize, Access
Serving
Layer
Standardize, Cleanse, Integrate, Filter,
Transform
Batch
Layer
Extract & Load

There are multiple ways of organizing data in an Enterprise Data Warehouse and the
same goes for Hadoop.
One way is the Lambda Architecture, which defines different data layers. A Hadoop
cluster can work by itself or be integrated with HBase and other EDWs and ODSs to build
different data layers that meet the data needs of an organization.
The process of building different data layers is a familiar concept within data
warehousing and analytics. The data layers are built in a Hadoop cluster for the same
reasons they have been built in data warehouses for the last 30 years, the facilitate
speed. There are 3 data layers:
Batch Layer: Immutable master data set (source of truth). Used to create views
for the batch layer.
Serving Layer: Contains pre-computed views.
Speed Layer: Contains additional levels of pre-computed views, structures and

indexes to reduce the latency that exists in the serving layer.
193
Distributed Copy (distcp) Command

The distcp command makes it easy to copy large volumes of HDFS data in parallel.
Distcp uses the MapReduce framework to support copying files or directories
recursively.
Usage: hadoop distcp [OPTIONS]
<SourceURLn> <DestinationURL>
<SourceURL1>
For example, perform a copy between two Hadoop clusters running the same version of
Hadoop:
$ hadoop distcp hdfs://<SourceURL>:8020/input/data1
hdfs://<DestinationURL>:8020/input/data1
To perform a copy between two Hadoop clusters running a different version of Hadoop,
the older cluster uses the hftp protocol, and the 2.x cluster uses the hdfs protocol:
$ hadoop distcp hftp://<SourceURL>:50070/input/data2
hdfs://<DestinationURL>:8020/input/data2
Perform a copy within the same cluster:

194
$ hadoop distcp input/data1

$ hadoop distcp -f
/srcfile
/input/data2 /input/data3
/input/data3
If a map fails and -i is NOT specified, all the files in the split, not only those that failed,
will be recopied. It also changes the semantics for generating destination paths, so users
should use this carefully.
Flag -i means Ignore failures. This option will keep more accurate statistics about the
copy than the default case. It also preserves logs from failed copies, which can be
valuable for debugging. A failing map will not cause the job to fail before all splits are
attempted.
195
Options for the distcp command

To display the help text for distcp, type the following command:
$ hadoop distcp
Controllable distcp options:

Defining log directory location
The maximum amount of files or data to copy
SSL configuration for mapper
Distcp Options
The following list are things to take into consideration when using the distcp command:
Source paths need to be absolute.
If the destination directory does not exist it will be created.
The update option is used to make sure only files that have changed are
updated. Using a checksum (CRC32) it verifies if the destination file sizes are
different. The skipcrccheck option can be used to disable the checksum.
Distcp will skip files that already exist in the destination path. Use the
overwrite to make sure existing files are overwritten. File sizes are not checked.
The delete option can be used to delete any files in destination that are not in
the source.
Use the hftp file system on the source if there are different versions between the
source and destination HDFS clusters.
196
# hadoop distcp
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
-async
Should distcp execution be blocking
-atomic
Commit all changes or none
-bandwidth <arg>
Specify bandwidth per map in MB
-delete
Delete from target, files missing in
source
-f <arg>
List of files that need to be copied
-filelimit <arg>
(Deprecated!) Limit number of files
copied to <= n
-i
Ignore failures during copy
-log <arg>
Folder on DFS where distcp execution
logs are saved
-m <arg>
Max number of concurrent maps to use
for copy
-mapredSslConf <arg>
Configuration for ssl config file,
to use with hftps://
-overwrite
Choose to overwrite target files
unconditionally, even if they exist.
-p <arg>
preserve status (rbugp)(replication,
block-size, user, group, permission)
-sizelimit <arg>
(Deprecated!) Limit number of files
copied to <= n bytes
-skipcrccheck
Whether to skip CRC checks between
source and target paths.
-strategy <arg>
Copy strategy to use. Default is
dividing work based on file sizes
-tmp <arg>
Intermediate work path to be used for
atomic commit
-update
Update target, copying only
missingfiles or directories
There are two strategy options: static (the default) and dynamic. When static is used,
mappers are balanced based on the total size of files copied by each map. The dynamic
approach splits files into chunks and map tasks process a chunk at a time, allowing
faster mappers to consume more file paths than slower ones and thereby speeding up
the overall distcp job.
197
Considerations for distcp

The m option can be used to select the number of Mappers. If there is a large
amount of data, there may be a need to limit the number of mappers.
Minimizing the number of mappers will have each mapper copy more data.
Note: This may be more efficient if there are a lot of mappers or the system is
constrained for resources.
Using distcp
The distcp command starts up containers for running the mappers as well as generating
I/O based on the volume of data to be copied. Take the resource utilization and the
IOPS generated to schedule large distcp jobs during appropriate times.
Distcp also consumes container resources on the destination cluster, which may
be the same or a different cluster.
Copying data between the two clusters will also generate network traffic
between the data nodes for each cluster. Make sure network resources are not
exceeded between the two clusters.
198
Using the hdfs:// schema for the source and destination requires the clusters be running
the same version of software. Other protocols that can be used include:
webhdfs://
hftp://
Best practice is to validate the copy between the source and the destination.
Use the hadoop fs ls /Path to confirm ownership, permissions and files.
199
Using distcp for Backups

When using HCatalog or Hive Metadata you must also move the SQL Files containing
that info.
Using the Lambda architecture as an example; a Hadoop cluster will have multiple data
layers. It may be resource, time or cost prohibitive to backup the entire Hadoop
cluster. In such cases, because the other data layers can be rebuilt from the source, you
can backup only the raw data layer.
200
Unit 9 Review Questions

1. _____________ is one of the key components of any data warehouse,
enterprise data store or Hadoop cluster.
2. Data is inherently ___________ and _____________.
3. _____________, is a new system capable of storing, aggregating, and
transforming a wide range of multi-structured raw data sources into usable
formats that help fuel new insights for the business.
4. The Lambda Architecture includes what three layers __________________,
__________________ & __________________.
5. When running distcp for two Hadoop clusters running a different version of
Hadoop the source prefix needs to be ___________.
6. The update option checks ____________________ and __________ to see if a
file has changed.
201
Lab 9.1: Use distcp to Copy Data from a

Remote Cluster
Objective: To become familiar with how to copy data from one cluster
to another.
Successful Outcome: Data from a remote cluster is copied to your own cluster.
Before You Begin: For this exercise use node1 as your Remote-Cluster.
Step 1: Access Remote-Cluster

1.1. Verify you can reach the NameNode of the Remote-Cluster by executing the ls
command.
$ hadoop fs -ls hdfs://node1:8020/
You should see the / folder of the Remote-Cluster.

Step 2: View the Remote Folder
2.1. Use the ls command and view the contents of the remote directory that you
are going to copy:
$ hadoop fs -ls hdfs://node1:8020/user/root
Step 3: Copy the Data from Remote-Cluster

3.1. Execute the following command to copy a remote file into distcp_target:
$ hadoop distcp
hdfs://Remote-Cluster:8020/user/root/test_data
distcp_target
3.2. View the contents of distcp_target and verify test_data file copied over to
your cluster:
202
$ hadoop fs -ls -R distcp_target
3.3. Copy one or many directories/files into distcp_target:

$ hadoop distcp
hdfs://node1:8020/user/root/wordcount
hdfs://node1:8020/user/root/constitution.txt
distcp_target
3.4. View the contents of distcp_target and verify the wordcount &
constitution.txt file copied over to your cluster again.
Step 4: Copy only new/updated files and directories using -update option.
4.1. Check the timestamp for files in /user/root/wordcount directory. Delete partr-00000 file from wordcount directory.
4.2. Now run following command with -update option.
$ hadoop distcp -update
hdfs://node1:8020/user/root/wordcount
distcp_target/wordcount
4.3. View the contents of distcp_target and compare timestamp of all the files.
You can see that the timestamp changed only for part-r-00000 file and wordcount
folder timstamp.
Step 5: Copy data from a Remote-Cluster running different version of Hadoop.
5.1. Execute the following command to copy a remote file into distcp_target.
$ hadoop distcp
hftp://node1:50070/user/root/hbase.jar
distcp_target
5.2. View the contents of distcp_target and verify test_data file copied over to
your cluster.
RESULT: You have learnt the steps to copy data from one cluster to another.
203
Unit 10: HDFS Web Services

Topics covered:
What is WebHDFS ?
Setting up WebHDFS
Using WebHDFS
WebHDFS Authentication
Copying Files to HDFS
Hadoop HDFS over HTTP
Who Uses WebHCat REST API?
Running WebHCat
Using WebHCat
Lab 10.1: Using WebHDFS
204
What is WebHDFS?
Hadoop contains native libraries for accessing HDFS from the Hadoop cluster. WebHDFS
provides a full set of HTTP REST APIs to access Hadoop remotely. HDFS commands can
be run from a platform that does not contain Hadoop software.
REST (Representational State Transfer) uses well known HTTP verb commands GET,
POST, PUT and DELETE to perform operations. REST:
Uses HTTP to execute calls between different machines.
Is platform independent, language independent and can be used with firewalls.
Uses URIs (Uniform Resource Identifier defines a web resource using text).
Is a lightweight alternative to using SOAP (Simple Object Access Protocol - XML

based protocol) and RPC (remote procedure calls) to access web resources.
205
WebHDFS uses REST APIs to perform HDFS user operations including reading files,
writing to files, making directories, changing permissions and renaming. WebHDFS can
be used to copy data between different versions of HDFS.
WebHDFS is built-in to HDFS. It runs inside NameNodes and DataNodes, therefore, it can
use all HDFS functionalities. It is a part of HDFS there are no additional servers to
install. WebHDFS can use a proxy WebHDFS (httpfs). In most cases it uses hdfs:// (port
optional).
WebHDFS supports the following:
HDFS read and write operations as well as HDFS parameters.
Kerberos (SPNEGO) and delegation tokens for authentication.
206
Setting up WebHDFS
WebHDFS should be enabled during the Ambari install by selecting the
enable WebHDFS checkbox.
Properties for WebHDFS
Description
dfs.webhdfs.enabled
Use to enable WebHDFS.
dfs.web.authentication.kerberos.principal
HTTP Kerberos principal.
dfs.web.authentication.kerberos.keytab
Kerberos keytab file.
Setting up WebHDFS
If manually setting the dfs.webhdfs.enabled property in the hdfs-site.xml file, HDFS
(NameNode and DataNodes) must be restarted for the changes to take effect.
hdfs-site.xml:
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
207
When using Kerberos to secure cluster, look at the documentation to get all the details
but here is a summary.
1. Create a HTTP service user principal.
2. kadmin: addprinc -randkey
HTTP/$<Fully_Qualified_Domain_Name>@$<Realm_Name>.COM
3. Create keytab files for the HTTP principals.

kadmin: xst -norandkey -k
/etc/security/spnego.service.keytabHTTP/$<Fully_Qualified_D
omain_Name>
4. Verify that the keytab file and the principal are associated with the correct
service:
klist k -t /etc/security/spnego.service.keytab
5. Add the following properties to the hdfs-site.xml file.

<property>
<name>dfs.web.authentication.kerberos.principal</name>
<value>HTTP/$<Fully_Qualified_Domain_Name>@$<Realm_Name>.CO
M</value>
</property>
<property>
<name>dfs.web.authentication.kerberos.keytab</name>
<value>/etc/security/spnego.service.keytab</value>
</property>
208
Using WebHDFS
The URL syntax to access the REST API of WebHDFS is:
http://hostname:port/webhdfs/v1/<PATH>?op=
To read a file named input/mydata:

$ curl -i -L "http://host:50070/webhdfs/v1/input/mydata?op=OPEN"
To list the contents of a directory named tdata:

$ curl -i "http://host:50070/webhdfs/v1/tdata/?op=LISTSTATUS"
To make a directory named myoutput:

$ curl -i -X PUT "http://host:50070/webhdfs/v1/myoutput?
op=MKDIRS&permission=744"
Using WebHDFS
The REST API uses the prefix "/webhdfs/v1" in the path and appends a query at the end.
HTTP URL format:
http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=...
WebHDFS File system URI :

webhdfs://<HOST>:<HTTP_PORT>/<PATH>
HDFS URI:
hdfs://<HOST>:<RPC_PORT>/<PATH>
cURL and wget can be used to execute WebHDFS commands. cURL has been around a
long time in Unix and Linux environments. It is a popular command line tool (and
library) because it can support so many protocols (HTTP, HTTPS, FTP, SCP, LDAP, TENET,
POP2, SMTP, IMAP, .).
209
wget is a free software package for retrieving files using HTTP, HTTPS and
FTP.
Additional examples:
List the status of a file (Use the v option to display output in verbose mode to
get more details.)
$ curl -i
"http://host:port/webhdfs/v1/input/mydata?op=GETFILESTATUS"
To write a file into a /input/myfile2 file:

$ curl v -i -X PUT -L
"http://$<Host_Name>:$<Port>/webhdfs/v1/input/myfile2?op=CR
EATE" -T mycoolfile
210
WebHDFS Authentication
Authentication can be controlled through the following commands:
Security off:
curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?[user.name=<USER>&
]op=..."
Security on and using Kerberos SPNEGO:

curl -i --negotiate -u :
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=..."
Security on with Hadoop delegation token:

curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?delegation=<TOKEN>&
op=..."
211
Proxy Users
A proxy user can send a request for another. The username of U must be specified in the
doas query parameter unless a delegation token is presented in authentication. In such
case, the information of both users P and U must be encoded in the delegation token.
Below is the syntax to use when;
Security is off:
curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?[user.name=<USER>&]
doas=<USER>&op=..."
When security is on (Kerberos SPNEGO):

curl -i --negotiate -u :
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?doas=<USER>&op=..."
When security is on (Hadoop delegation token):

curl -i
"http://<HOST>:PORT>/webhdfs/v1/<PATH>?delegation=<TOKEN>&o
p=..."
212

Prepare a file for writing to HDFS
Syntax:
curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE
[&overwrite=<true|false>][&blocksize=<LONG>][&replication=<SHORT>]
[&permission=<OCTAL>][&buffersize=<INT>]"
Write a file to HDFS

$ curl -v -X PUT "http://ec2-50-17-40-3.compute-1.amazonaws.com:50070/webhdfs/v1/
input/mydata?op=CREATE&user.name=jdoe"
$ curl -v -X PUT T mydata "http://ec2-50-17-40-3.compute-1.amazonaws.com:50075/
webhdfs/v1/input/mydata?op=CREATE&user.name=jdoe"

Below is an example of copying data into HDFS using WebHDFS. The two-step process is
a temporary workout for a software library bug.
The output will contain the 307 response code with a server address and port number
for where to write the file.
$ curl -v -X PUT
"http://localhost:50070/webhdfs/v1/input/mydata?op=CREATE
Format for 307 response code:
HTTP/1.1 307 TEMPORARY_REDIRECT
Location:
http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
Content-Length: 0
$ curl -v -X PUT T mydata "http://ec2-50-17-40-3.compute1.amazonaws.com:50075/webhdfs/v1/input/mydata?op=CREATE&use
r.name=jdoe"
# Verify the file was created.
$ hadoop fs -ls /input
213
Additional WebHDFS Commands

WEBHDFS operations will return a JSON schema or a zero length response. Only
exception is the OPEN command.
HTTP GET
HTTP PUT
HTTP POST
OPEN
CREATE
APPEND
GETFILESTATUS
MKDIRS
LISTSTATUS
RENAME
GETCONTENTSUMMARY
SETREPLICATION
HTTP POST
GETFILECHECKSUM
SETOWNER
DELETE
GETHOMEDIRECTORY
SETPERMISSION
GETDELEGATIONTOKEN
SETTIMES
RENEWDELEGATONTOKEN
CANCELDELEGATIONTOKEN
Additional WebHDFS Commands

Submit a HTTP GET request with automatically following redirects.
curl -i -L "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN
[&offset=<LONG>][&length=<LONG>][&buffersize=<INT>]
The request is redirected to a datanode where the file data is to be written:

Location:
http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
Content-Length: 0
Create and Write to a File:

curl -i -X PUT -T <LOCAL_FILE>
"http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
214
Append to a File:
curl -i -X POST
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=APPEND[&buffersi
ze=<INT>]"
The request is redirected to a datanode where the file data is to be appended:

Location:
http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
Content-Length: 0
Step 2: Submit request to append file:

curl -i -X POST -T <LOCAL_FILE>
"http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
Make a Directory:
curl -i -X PUT
"http://<HOST>:<PORT>/<PATH>?op=MKDIRS[&permission=<OCTAL>]
Rename a File/Directory:
curl -i -X PUT
"<HOST>:<PORT>/webhdfs/v1/<PATH>?op=RENAME&destination=<PAT
H>
Delete a File/Directory:
curl -i -X DELETE
"http://<host>:<port>/webhdfs/v1/<path>?op=DELETE[&recursiv
e=<true|false>]
Get the Status of a File/Directory:

curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILESTATUS
215
List a Directory:
curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS
Get Directory Content Summary:

curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETCONTENTSUMMAR
Y
Get File Checksum:

curl -i
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILECHECKSUM
Get Home Directory:

curl -i
"http://<HOST>:<PORT>/webhdfs/v1/?op=GETHOMEDIRECTORY
Set Permission:
curl -i -X PUT
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETPERMISSION
[&permission=<OCTAL>]
Set Owner:
curl -i -X PUT
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETOWNER
[&owner=<USER>][&group=<GROUP>]
Set Replication Factor:

curl -i -X PUT
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETREPLICATION
[&replication=<SHORT>]
Set Access or Modification Time:

curl -i -X PUT
216
"http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETTIMES
[&modificationtime=<TIME>][&accesstime=<TIME>]
Get Delegation Token:

curl -i
"http://<HOST>:<PORT>/webhdfs/v1/?op=GETDELEGATIONTOKEN&ren
ewer=<USER>
Renew Delegation Token:

curl -i -X PUT
"http://<HOST>:<PORT>/webhdfs/v1/?op=RENEWDELEGATIONTOKEN&t
oken=<TOKEN>
Cancel Delegation Token:

curl -i -X PUT
"http://<HOST>:<PORT>/webhdfs/v1/?op=CANCELDELEGATIONTOKEN&
token=<TOKEN>
HTTP Response Codes

Exception
HHT Response Code
IllegalArgumentException
400 Bad Request
UnsupportedOperationException
400 Bad Request
SecurityException
401 Unauthorized
IOException
403 Forbidden
FileNotFoundException
404 Not Found
RumtimeException
500 Internal Server Error
217
Hadoop HDFS over HTTP

HttpFS is a separate service from NameNode and must be configured. HttpFS is a Java
application that runs in Tomcat that comes with the HttpFS binary distribution.
Examples:
Create the HDFS /user/tom directory:
$ curl -X POST http://httpfshost:14000/webhdfs/v1/user/tom/bar?op=mkdirs
Display the contents of the HDFS /user/george directory.

$ curl http://<HTTPFSHOST>:14000/webhdfs/v1/user/jdoe?op=list
218
HttpFS is a full rewrite of Hadoop HDFS proxy. A key difference is HttpFS supports all file
system operations while Hadoop HDFS proxy supports only read operations.
HttpFS also supports:
Hadoop pseudo authentication
Kerberos SPNEGOS authentication
Hadoop proxy users
Hadoop HDFS proxy did not support the above authentications.
219
Who Uses WebHCat REST API?

WebHCat is designed to connect services. WebHCat was initially called Templeton (rat
in Charlottes Web). Therefore, you will still see references to Templeton in the
directories, etc.
WebHCat will look for files in the CLASSPATH and then in the TEMPLETON_HOME
environmental variable. Key files with WebHCat:
220
Filename
Description
webhcat_server.sh
webhcat-default.xml
Default configuration variables. Never

change this file it has no effect. The
webhcat-default.xml file Is in the WebHCat
war file.
webhcat-site.xml
Modify and add variables to customize

WebHCat. The WebHCat server needs to
be restarted after configuration changes.
webhcat-log4j.properties
Contains the location of the WebHCat log

files.
The WebHCat configuration variables can be found in the HCatalog documentation.

WebHCat can be configured to use Kerberos. Here are a few examples of WebHCat
variables.
Configuration Variable
Description
templeton.port
The HTTP port for the WebHCat server (50111).
templeton.hadoop.config.dir
Path to the Hadoop configuration files

${env.HADOOP.CONFIG_DIR}
templeton.jar
Path to the WebHCat jar file

${env.TEMPLETON_HOME/share/webhcat/svr/webhcat0.11.0.jar}
templeton.streaming.jar
HDFS path to the Hadoop streaming jar file

hdfs:///user/temleton/hadoop-streaming.jar
templeton.hive.path
Path to Hive executable

hive-0.11.0.tar.gz/hive-0.11.0/bin/hive
templeton.hive.properties
Properties to use when running Hive.
templeton.zookeeper.hosts
Zookeeper servers listed in a comma delimited order

(host:port).
221
Accessing and Securing WebHCat Files

Parameters for making files accessible to WebHCat:
Variables
Definition
Default
Caching and Securing

WebHCat Files
Path to Pig archive.
hdfs:///apps/webhcat/pig.tar.gz
templeton.pig.path
Path to Pig executable.
pig.tar.gz/pig/bin/pig
templeton.hive.archive
Path to Hive archive.
hdfs:///apps/webhcat/hive.tar.gz
templeton.hive.path
Path to Hive executable.
hive.tar.gz/hive/bin/hive
templeton.streaming.jar
Path to Hadoop streaming

jar file.
hdfs:///apps/webhcat/hadoopstreaming.jar
Caching and Securing WebHCat Files

The paths shown above are configured in the /etc/hcatalog/conf/webhcat-site.xml file
on the node where WebHCat is installed.
222
Running WebHCat
Start the server:
$ /usr/lib/hcatalog/sbin/webhcat_server.sh start
Stop the server:

$ /usr/lib/hcatalog/sbin/webhcat_server.sh stop
WebHCat requirements include:

Zookeeper if using the ZooKeeper storage class
A secure cluster will require Kerberos keys and principals
Running WebHCat
Hadoop uses a LocalResource to keep Pig and Hive from having to be installed
everywhere on the cluster. The server will get a copy of the LocalResource when
needed.
223
Using WebHCat
The URL to access the REST API of WebHCat is:
http:/ / hostname/ tem pleton/ v1/
Here is an example of running a MapReduce job:
# curl -s -d user.name=hadoop_user \
-d jar=wordcount.jar \
-d class=com.hortonworks.WordCount \
-d libjars=transform.jar \
-d arg=wordcount/input \
-d arg=wordcount/output \
'http://host:50111/templeton/v1/mapreduce/jar'
Using WebHCat
WebHCat can execute programs through the Knox Gateway. The URL for accessing the
REST API of WebHCat is: http://hostname:port/templeton/v1/.
Below is an example of WebHCat running a Java MapReduce job. This example
assumes the input and output directories have been setup as well as the inode being
created for the file.
$ curl -v -i -k u <USERIDID>:<PASSWORD> -X POST \
-d jar=/dev/my-examples.jar -d class=wordcount \
-d arg=/dev/input -d arg=/dev/output \
'https://127.0.0.1:8443/gateway/sample/templeton/api/v1
/mapreduce/jar'
224
Unit 10 Review
1. WebHDFS supports HDFS _____________ and _______________ operations.
2. The _________________ parameter needs to be set to true to enable WebHDFS.
3. WebHDFS can use ________________ and _________________ for
authentication.
4. HttpsFS is a _____________________ from NameNode and must be configured.
225
Lab 10.1: Using WebHDFS
Objective: To become familiar with the capabilities of WebHDFS and

how to use it.
Successful Outcome: You will have executed several WebHDFS file commands
successfully, include upload, append, list a directory, and
download.
Step 1: List a Directory

1.1. Using curl, view the contents of the /user/root directory in HDFS:
# curl -i
"http://node1:50070/webhdfs/v1/user/root?op=LISTSTATUS"
1.2. You should see a 200 OK response, along with a JSON object containing the
files and directories in your /user/root folder:
HTTP/1.1 200 OK
Cache-Control: no-cache
Expires: Thu, 14 Nov 2013 14:35:42 GMT
Date: Thu, 14 Nov 2013 14:35:42 GMT
Pragma: no-cache
Expires: Thu, 14 Nov 2013 14:35:42 GMT
Date: Thu, 14 Nov 2013 14:35:42 GMT
Pragma: no-cache
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
{"FileStatuses":{"FileStatus":[
{"accessTime":0,"blockSize":0,"childrenNum":0,"fileId":1732
1,"group":"hadoop","length":0,"modificationTime":1384408800
226
076,"owner":"root","pathSuffix":".Trash","permission":"700"
,"replication":0,"type":"DIRECTORY"},
{"accessTime":1384219125588,"blockSize":134217728,"children
Num":0,"fileId":17331,"group":"hadoop","length":861,"modifi
cationTime":1384219125967,"owner":"root","pathSuffix":"cons
titution.txt","permission":"644","replication":3,"type":"FI
LE"},
...
]}}
Step 2: Make a New Directory

2.1. Use WebHDFS to make a new subdirectory of /user/root in HDFS named
history:
# curl -i -X PUT
"http://node1:50070/webhdfs/v1/user/root/history?op=MKDIRS"
If you get an AccessControlException, you need to add the user.name property to

the URL:
# curl -i -X PUT
"http://node1:50070/webhdfs/v1/user/root/history?op=MKDIRS&
user.name=root"
2.2. Verify the history directory was created successfully:

# hadoop fs -ls
Step 3: Upload a File

3.1. The first step to uploading a file is to create a path on the NameNode. As part
of the REST contract, the NameNode will respond with a 307
TEMPORARY_REDIRECT to the actual DataNode that the blocks should go to:
# curl -i -X PUT
"http://node1:50070/webhdfs/v1/user/root/history/constituti
on.txt?op=CREATE&blocksize=1048576"
3.2. Use the temporary redirect URL that the NameNode provides in the response
above to submit the file to the DataNode. For example, the command shown here
puts the file onto node4, but you should copy-and-paste the URL from the
response of the previous step:
227
# curl -i -PUT -T constitution.txt

"http://node4:50075/webhdfs/v1/user/root/history/constituti
on.txt?op=CREATE&namenoderpcaddress=node1:8020&blocksize=10
48576&overwrite=false&user.name=root"
3.3. Verify the file was uploaded successfully:

# hadoop fs -ls history
Found 1 items
-rwxr-xr-x
3 root hadoop
44841 history/constitution.txt
Step 4: Upload a Large File

4.1. In this step, you will upload a larger file, one that spans multiple blocks. Start
by changing directories to /root/data:
# cd ~/data
You should have a large file in data named test_data.

4.2. Ask the NameNode to create a file named big.txt in /user/root:
# curl -i -X PUT
"http://node1:50070/webhdfs/v1/user/root/big.txt?op=CREATE&
blocksize=1048576"
4.3. Using the URL provided by the previous command, upload test_data into
HDFS, and then verify the upload worked successfully.
Step 5: Append to an Existing File
5.1. Appending a file is similar to creating a file - it is a two-step process. Using
WebHDFS, append the local file constitution.txt to big.txt in HDFS.
228
Step 6: Retrieve a File

6.1. Use WebHDFS to retrieve the file constitution.txt.
6.2. Retrieve big.txt from the 1,000,000th byte offset and get 1048576 bytes
(1MB). Pipe the result to a local file named big_partial.txt.
RESULT: You have seen how to use WebHDFS to execute a variety of HDFS commands
over HTTP using RESTful web Services.
SOLUTION to 6.2:
curl -i -L
"http://node1:50070/webhdfs/v1/user/root/big.txt?op=OPEN&of
fset=1000000&length=1048576" > big_partial.txt
229
Unit 11: Hive Administration

Topics covered:
Introduction to Hive
Comparing Hive with RDBMS
Hive Components
Hive MetaStore
HiveServer2
Hive Command Line Interface
Processing Hive SQL Statements
Defining a Hive-Managed Table
Defining an External Table
Loading Data into Hive
Performing Queries
Guidelines for Architecting Hive Data
ORCFile Example
Hive Tables
Hive Query Optimizations
Hive/MR verses Hive/Tez
ORCFile Example
Compression
Hive Security
Lab 11.1: Understanding Hive Tables
230
Introduction to Hive
Hive queries are capable of data summarization, ad-hoc querying and analytics of large
volumes of data. Hive is scalable to 100PB+. Apache Hive is the gateway for business
intelligence and visualization tools integrated with Apache Hadoop. Hive supports
databases, tables, SQL language and other foundational constructs for analyzing data.
Hive will get the SQL code, process the code and convert the code to a MapReduce
program. The MapReduce program runs in the YARN framework and generates the
results.
Additional Hive capabilities and features:
Allows queries, inserts and appends.
Does not allow updates or deletes.
Data can be separated into partitions and buckets.
Supports cubes, dimensions, and star schemas.
231
Comparing Hive with RDBMS

Remember, Hive is a data warehouse infrastructure on top of Hadoop. HDFS uses
schema-on-read. Data can be stored in different formats such as text, sequential files
and columnar files.
Other comparisons include:
Views in Hive are logical query constructs.
Materialized views are not supported.
Indexes in Hive store their data in separate table constructs.
Bitmap indexes are supported.
User Defined Functions (UDFs) can be used to add additional functionality to

Hive queries.
Hive supports arrays, maps, structs and unions.
SerDes map JSON, XML and other formats natively into Hive.
232
Hive MetaStore
The Hive MetaStore contains all the metadata definitions for Hive tables
and partitions
The metastore can be local or remote
Local Metastore
Driver
Metastore
RDBMS
Local
Datastore
HiveServer2
Remote Metastore
Driver
Metastore
RDBMS
Remote
Datastore
HiveServer2
Hive MetaStore
The Hive metastore stores table definitions and related metadata information. Hive
uses an Object Relational Mapper (ORM) to access relational databases. Valid Hive
metastore database are: MySQL, PostgreSQL, Oracle and Derby. An embedded
metastore is available but it should only be used for unit testing.
Below is an example of setting up a local metastore using MySQL at the metastore
repository
Property
Value
javax.jdo.option.ConnectionURL
jdbc:mysql://<HOSTNAME>/<DBNAME>?cre
ateDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName
com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName
<MYSQL_USER>
javax.jdo.option.ConnectionPassword
<MYSQL_PASSWORD>
hive.metastore.local
True, this is local store.
233
hive.metastore.warehouse.dir
<DEFINE_PATH_HIVETABLES>
With a remote metastore setup, a Hive client needs to connect to a metastore server
that then communicates to the remote datastore (RDBMS) using the Thrift protocol.
Thrift is an Interface Definition Language (IDL) that defines the specification for the
interface to a software component. Thrift uses Remote Procedure Calls (RPCs) for the
communication between two service endpoints.
234
HiveServer2
HiveServer2 is a server interface that allows JDBC/ODBC remote clients
to run queries and retrieve the results.
Hive SQL
CLI
JDBC / ODBC
Web UI
HiveServer2
RDBMS
Datastore
Hive
Mappers
Reducers
DataNodes
HiveServer2
HiveServer2 (HS2) is a gateway / JDBC / ODBC endpoint Hive clients can talk to. ODBC
allows Excel and just about any BI tool to use Hive to access Hadoop data.
Configuration parameters for the HiveServer2 are set in the hive-site.xml file.
HiveServer2 supports no authentication (Anonymous), Kerberos, LDAP and custom
authentication. Authentication mode is defined with the hive.server2.authentication
parameter (NONE, KERBEROS, LDAP and CUSTOM). NONE is the default value.
HiveServer2 executes a query as the user who started the query by default
(hive.server2.enable.doAs=true). Setting this parameter to false, the query will run as
the same user the HiveServer2 process runs as.
There are multiple ways to start the HiveServer2:
$ $HIVE_HOME/bin/hiveserver2
$ $HIVE_HOME/bin/hive --service hiveserver2
235
Hive Command Line Interface

The $HIVE_HOME/bin path should be added to the PATH environment variable in Linux.
The hive --help option can provide a listing of hive options.
In the Hive CLI, the set and set-v options will display variables in hiveconf, hivevar,
system and env namespaces and all Hadoop properties respectfully.
Example:
hive> set;
hive> set -v;
The e option can be used to execute from the Linux command line. The S option says
run in silent mode.
$ hive -S -e "select * FROM mycooltab" > /tmp/mytabout
Use the set command to find a property value.

$ hive -S -e "set" | grep auth
236
Run a script file containing SQL code.

$ hive -f /hscripts/myrockingquery.hql
Bash shell commands can be run from the Hive CLI.

hive> ! whoami;
Hadoop dfs commands can be run from the Hive CLI. Dfs commands can be run
without typing hadoop first.
hive> dfs -ls /user;
Beeline connects to the Hive Server2 instance. Hive clients connect to the HiveServer
instance.
237

1
Client executes SQL query
Hive parses and plans query
Hadoop
1
Hive SQL
Query converted to Map/Reduce
Map/Reduce run by Hadoop
CLI
JDBC / ODBC
Hive
HiveServer2
Web UI
Hive
Compiler
Optimizer
Executor
Map / Reduce 3
Mappers
Reducers
DataNodes
4

Hive is a system that provides a SQL interface and a relational model on top of Hadoop.
When you use Hive, you expose data within Hadoop as relational tables and you issue
SQL queries to query the data. The underlying data can be structured or unstructured.
Under the covers Hive converts these SQL queries to map/reduce jobs and submits
them to the Hadoop cluster. Using Hive lets you program SQL rather than Java map
reduce and lets you work at a much higher level of abstraction.
The driver handles the parsing, optimization, compilation, and execution. Hive does not
generate Java code for MapReduce. Hive uses Map and Reduce modules (like
interpreters) that are driven by the defined job execution. Hive will communicate with
the YARN Resource Manager for starting the MapReduce job.
Once the Hive query is converted into a MapReduce program, Hadoop can use parallel
processing and distributed database to generate a result using a highly scalable and
highly available Hadoop infrastructure.
238
Hive Data Hierarchical Structures

Hive data hierarchical structures.
Databases: In Hive a database is a namespace that separates tables and data structures.
The default database directory is defined by the hive.metastore.warehouse.dir
parameter. A directory is defined for each database. The default location can be
overridden by specifying the path. Comments and additional properties can be added
to the database definition. The USE command defines the current database. All
following commands will execute on the objects in the current database. Database
properties can be modified with the ALTER DATABASE command.
Tables: Schema objects (abstract table definitions mapping relational definition to
underlying data).
Partitions: Can physically separate table data into separate data units.
Buckets (or Clusters): Data in each partition can be sub-partitioned based on a

hash function of some column of the Table.
239
Database commands in Hive:

hive> CREATE DATABASE mydb;
hive> CREATE DATABASE mydb2 LOCATION
/user/george/mydbs;
hive> CREATE DATABASE mydb3
COMMENT My supercool db;
hive> CREATE DATABASE mydb4 WITH DBPROPERTIES creator
= George Trujillo, date = 2013-11-02);
hive> SHOW DATABASES;
hive> DESCRIBE DATABASE mydb3;
hive> DESCRIBE DATABASE EXTENDED mydb4;
hive> USE mydb;
hive> CREATE TABLE mycooltab( id INT, name STRING);
hive> DESCRIBE mycooltab;
hive> DESCRIBE EXTENDED mycooltab;
-- can tell if
table is INTERNAL or EXTERNAL
hive> SHOW TABLES;
A Hive CREATE TABLE command can create a Hive and HBase table as well as create a
Hive table that points to an existing HBase table. Hive tables can also point to other
NoSQL database tables.
CREATE EXTERNAL TABLE myhtab(id INT, name STRING) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
Hive has four built-in file formats:
Delimited Text: Excellent for sharing among Pig, Hive, and Linux (awk, perl,
pythonn etc.) Binary file formats are more efficient.
SequenceFile: Binary key-value pairs. Can be compressed at BLOCK and

RECORD level. Supports splitting on blocks.
ORCFile
RCFile: Stores data in a record columnar format. Allows compression of

individual columns and fast analyzing on columns.
Hive uses SerDes to read and write from tables. The SerDe determines the format in
which the records are serialized and deserialized. You can write your own custom SerDe,
or use one of the built-in ones which include:
240
Avro: Easily converts Avro schema and data types into Hive schema and data
types. Avro understands compression.
Regular Expression
ORC
JSON: JavaScript Object Notation is supported by Hive.
Thrift
NOTE: SerDe stands for serializer/deserializer. A SerDe is used when data

needs to be converted from one format (unstructured) to another format
(record structure).
NOTE: Accumulo is not part of the HDP distribution yet, but it is supported
by Hortonworks.
241
Hive Tables
Data stored in HDFS is schema-on-read, meaning Hive does not control the data
integrity when it is written. For Hive Managed tables, the table name is the name Hive
will assign to the directory in HDFS. For external tables, the files can be in any folder in
HDFS.
If you drop an external table, it will keep the data in its defined directory. With a Hive
Managed table, if you drop the table, then the data is deleted.
Multiple schemas can be connected to a single directory.
242
Defining a Hive-Managed Table

A Hive table allows you to add structure to your otherwise unstructured data in HDFS.
Use the CREATE TABLE command to define a Hive table, similar to creating a table in
SQL.
For example, the following HiveQL creates a new Hive-manged table named customer:
CREATE TABLE customer (
customerID INT,
firstName STRING,
lastName STRING,
birthday TIMESTAMP,
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
The customer table has four columns.
ROW FORMAT is either DELIMITED or SERDE.
Hive supports the following data types: TINYINT, SMALLINT, INT, BIGINT,
BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP, VARCHAR and DATE.
Hive also has four complex data types: ARRAY, MAP, STRUCT and UNIONTYPE.
243
Defining an External Table

The following CREATE statement creates an external table named salaries:
CREATE EXTERNAL TABLE salaries (
gender string,
age int,
salary double,
zip int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
An external table is just like a Hive-manged table, except that when the table is
dropped, Hive will not delete the underlying /apps/hive/warehouse/salaries folder.
Defining a Table LOCATION

Hive does not have to store the underlying data in /apps/hive/warehouse. Instead, the
files for a Hive table can be stored in a folder anywhere in HDFS by defining the
LOCATION clause. For example:
CREATE EXTERNAL TABLE salaries (
gender string,
age int,
salary double,
zip int
)
FIELDS TERMINATED BY ','
LOCATION '/user/train/salaries/';
In the table above, the table data for salaries will be whatever is in the
/user/train/salaries directory.
IMPORTANT: The sole difference in behavior between external tables and

Hive-managed tables is when they are dropped. If you drop a Hive-managed
table, then its underlying data is deleted from HDFS. If you drop an external
table, then its underlaying data remains in HDFS (even if the LOCATION was
in /apps/hive/warehouse/).
244

LOAD DATA LOCAL INPATH /tmp/customers.csv'
OVERWRITE INTO TABLE customers;
LOAD DATA INPATH '/user/train/customers.csv'
OVERWRITE INTO TABLE customers;
INSERT INTO birthdays
SELECT firstName, lastName, birthday
FROM customers
WHERE birthday IS NOT NULL;

The data for a Hive table resides in HDFS. To associate data with a table, use the LOAD
DATA command. The data does not actually get loaded into anything, but the data
does get moved:
For Hive-managed tables, the data is moved into a special Hive subfolders of
/apps/hive/warehouse.
For external tables, the data is moved to the folder specified by the LOCATION
clause in the tables definition.
The LOAD DATA command can load files from the local file system (using the LOCAL
qualifier) or files already in HDFS. For example, the following command loads a local file
into a table named customers:
LOAD DATA LOCAL INPATH '/tmp/customers.csv' OVERWRITE INTO
TABLE customers;
The OVERWRITE option deletes any existing data in the table and replaces it with
the new data. If you want to append data to the tables existing contents, simply
leave off the OVERWRITE keyword.
245
If the data is already in HDFS, then leave off the LOCAL keyword:
LOAD DATA INPATH '/user/train/customers.csv' OVERWRITE INTO
TABLE customers;
In either case above, the file customers.csv is moved either into HDFS in a subfolder of
/apps/hive/warehouse or to the tables LOCATION folder, and the contents of
customers.csv are now associated with the customers table.
You can also insert data into a Hive table that is the result of a query, which is a
common technique in Hive. An example of the syntax is below:
INSERT INTO birthdays SELECT firstName, lastName, birthday
FROM customers WHERE birthday IS NOT NULL;
The birthdays table will contain all customers whose birthday column is not null.
246
Performing Queries
Lets take a look at some sample queries to demonstrate what HiveQL looks like. The
following SELECT statement selects all records from the customers table:
SELECT * FROM customers;
You can use the familiar WHERE clause to specify which rows to select from a table:
FROM customers SELECT firstName, lastName, address, zip
WHERE orderID > 0 GROUP BY zip;
NOTE: The FROM clause in Hive can appear before or after the SELECT
clause.
One benefit of Hive is its ability to join data in a simple fashion. The JOIN command in
HiveQL is similar to its SQL counterpart. For example, the following statement performs
an inner join on two tables:
SELECT customers.*, orders.* FROM customers JOIN orders ON
(customers.customerID = orders.customerID);
To perform an outer join, use the OUTER keyword:

SELECT customers.*, orders.* FROM customers LEFT OUTER JOIN
orders ON (customers.customerID = orders.customerID);
In the SELECT above, a row will be returned for every customer, even those without
any orders.
247
Guidelines for Architecting Hive Data

Hadoop is very good at coordinated, sequential scans, but it does not have the concept
of random I/O. Traditional indexes are not very effective in Hive.
Sorting and skipping takes the place of indexing.
A key consideration in Hive is to minimize the amount of data that needs to be

shuffled during the shuffle/sort phase of the MapReduce job.
A best practice is to divide data among different files that can be pruned out,
which is accomplished by using partitions, buckets and skewed tables.
Sort data ahead of time. Sorting data ahead of time simplifies joins and skipping
becomes more effective.
248
4 Stages
Stage Details

Hive queries can be optimized. There is a Hive explain plan that can be used to evaluate
the execution plan of the query. Use explain extended in front of your query.
Sections:
Abstract syntax tree: You can usually ignore this.
Stage dependencies: Dependencies and # of stages.
Stage plans: Important info on how Hive is running the job.
249
Hive/MR versus Hive/Tez

SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
Tez avoids
unneeded writes to
HDFS
GROUP BY a.state
Hive MR
M
Hive Tez
SELECT a.state
SELECT b.id
R
SELECT a.state,
c.itemId
M
R
SELECT b.id
M
M
HDFS
JOIN (a, c)
SELECT c.price
HDFS
JOIN (a, c)
HDFS
JOIN(a, b)
GROUP BY a.state
COUNT(*)
AVERAGE(c.price)
JOIN(a, b)
GROUP BY a.state
COUNT(*)
AVERAGE(c.price)
Hive/MR versus Hive/Tez

As you can see in the diagram above, a Hive query without Tez can consist of multiple
MapReduce jobs. Tez performs a Hive query in a single job, avoiding the intermediate
writes to disk that were a result of the multiple MapReduce jobs.
250
ORCFile Example
sale
id
mestamp
productsk
storesk
amount
state
10000
2013-06-13T09:03:05
16775
670
$70.50
CA
10001
2013-06-13T09:03:05
10739
359
$52.99
IL
10002
2013-06-13T09:03:06
4671
606
$67.12
MA
10003
2013-06-13T09:03:08
7224
174
$96.85
CA
10004
2013-06-13T09:03:12
9354
123
$67.76
CA
10005
2013-06-13T09:03:18
1192
497
$25.73
IL
CREATE TABLE sale (

id int, timestamp timestamp,
productsk int, storesk int,
amount decimal, state string
) STORED AS orc;
ORCFile Example
The Optimized Row Columnar (ORC) file format provides a highly efficient way to store
Hive data. File formats in Hive are specified at the table level using the AS keyword. For
example:
CREATE TABLE tablename (
...
) AS ORC;
You can also modify the file format of an existing table:

ALTER TABLE tablename SET FILEFORMAT ORC;
You can also specify ORC as the default file format of new tables:
SET hive.default.fileformat=Orc
The ORC file format is a part of the Stinger Initiative to improve the performance of Hive
queries, and using ORC files can greatly improve the execution time of your Hive
queries.
251
Compression
Hive queries will usually become I/O bound before they become CPU bound. Reducing
the amount of data to be read by using compression can improve performance.
Different compression codecs include: Snappy, LZO, Gzip, BZip2, etc.
Get a listing of the compression codes available in your environment. Compression
options can also be defined in the Hive CLI.
$ hive -e "set io.compression.codecs"
hive> set mapred.output.compression.type=BLOCK;
hive> set mapred.map.output.compression.codec
=org.apache.hadoop.io.compress.GZipCodec;
hive> set hive.exec.compress.intermediate=true;
hive> set hive.exec.compress.output=true;
The intermediate data generated by the Mappers can be compressed. The

hive.exec.compress.intermediate property needs to be set to true.
252
Example:
<property>
<name>hive.exec.compress.intermediate</name>
<value>true</value>
</property>
The default compression codec is set with the mapred.map.output.compression.codec

in the $HADOOP_HOME/conf/mapred-site.xml file.
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
Specify that the output of the Reducer(s) should be compressed with the
hive.exec.compress.output parameter.
<property>
<name>hive.exec.compress.output</name>
<value>true/value>
</property>
Set the codec to use for the output.

<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
253
Hive Security
Usernames can be defined when executing commands. You can specify user.name in a
GET :table command:
$ curl -s
'http://localhost:50111/templeton/v1/ddl/database/default/t
able/my_table?user.name=cole'
Or you can specify user.name in a POST :table command:

$ curl -s -d user.name=cole -d rename=myoldtable_2
'http://localhost:50111/templeton/v1/ddl/database/default/t
able/mycool_table'
254
Unit 11 Review
1. The Hive component for storing schema and metadata information is
___________________.
2. A(n) ______________________ table loosely couples the table schema to the

underlying data storage.
3. ___________________ is a server interface that allows JDBC/ODBC remote

clients to run queries and retrieve the results.
4. True or False: Tez improves the performance of any MapReduce job, not just
Hive queries.
255
Lab 11.1: Understanding Hive Tables
Objective: Understand how Hive table data is stored in HDFS.

Successful Outcome: You will have created a couple of tables in Hive and learned
how data gets associated with a Hive table.
Before you begin: SSH into node1.
Step 1: Review the Data

1.1. As root, change directories to the /root/labs/data folder:
# cd /root/labs/data
1.2. Notice there are 5 part-m-0000x files, which are the result of a MapReduce
job that formatted the data for use with Hive. View the contents of one of these
files:
# more part-m-00000
Notice the data consists of information about visitors to the White House,
including the name, date, person being visited, and a comment section.
Step 2: Define a Hive Table
2.1. In the data folder, there is a text file named wh_visits.hive. View its
contents. Notice it defines a Hive table named wh_visits with a schema that
matches the data in the part-m-0000x files:
# more wh_visits.hive
create table wh_visits (
lname string,
fname string,
time_of_arrival string,
256
appt_scheduled_time string,
meeting_location string,
info_comment string)
FIELDS TERMINATED BY '\t' ;
NOTE: You cannot use comment or location as column names because

those are reserved Hive keywords.
2.2. Run the script with the following command:

# hive -f wh_visits.hive
2.3. If successful, you should see OK in the output along with the time it took to
run the query.
Step 3: Verify the Table Creation
3.1. Start the Hive Shell:
# hive
hive>
3.2. From the hive> prompt, enter the show tables command:
hive> show tables;
You should see wh_visits in the list of tables.

3.3. Use the describe command to view the details of wh_visits:
hive> describe wh_visits;
OK
lname
string
fname
string
time_of_arrival
string
appt_scheduled_time
string
meeting_location
string
info_comment
string
None
None
None
None
None
None
3.4. Try running a query (even though the table is empty):

select * from wh_visits;
257
The query should execute fine, but no result appears.

Step 4: View the Hive Folder Structure
4.1. Exit the Hive shell:
hive> exit;
[root@node1 data]#
4.2. View the contents of the Hive warehouse folder:

# hadoop fs -ls /apps/hive/warehouse
Found 1 items
drwxr-xr-x
- root hdfs
0 /apps/hive/warehouse/wh_visits
Notice there is a folder named wh_visits. When did this folder get created?
_________________________________________________________________
4.3. List the contents of the wh_visits folder:
# hadoop fs -ls /apps/hive/warehouse/wh_visits
Notice the folder is empty.

Step 5: Populate the Hive Table
5.1. Run the following command to put the local part-m-00000 file into the
wh_visits folder:
# hadoop fs -put part-m-00000
/apps/hive/warehouse/wh_visits
5.2. From the Hive shell, run the following query:

hive> select * from wh_visits;
This time, you should see a couple thousand rows of data. Notice that by simply
putting a file into the wh_visits folder, the table now contains data.
5.3. Notice no MapReduce job was executed to perform the select * query. Why
not? ___________________________________________________________
Step 6: Drop the Table
258
6.1. Run the following query, which drops the wh_visits table:
hive> drop table wh_visits;
6.2. Exit the Hive shell and view the contents of the Hive warehouse folder:
# hadoop fs -ls /apps/hive/warehouse/
Notice that not only has the part-m-00000 file been deleted, but also the
wh_visits folder no longer exists!
Step 7: Create the Table Again
7.1. Run wh_visits.hive again to recreate the wh_visits table:
# hive -f wh_visits.hive
Step 8: Use the Hive LOAD DATA Command

8.1. Create a new directory in HDFS named whitehouse:
# hadoop fs -mkdir whitehouse
8.2. Put all 5 part-m files into whitehouse:

# hadoop fs -put part-m-* whitehouse/
8.3. Verify the files are there:

# hadoop fs -ls whitehouse
8.4. From the Hive shell, run the following query:

hive> LOAD DATA INPATH '/user/root/whitehouse/' OVERWRITE
INTO TABLE wh_visits;
8.5. Verify you have now have data in the table:

hive> select * from wh_visits limit 10;
You should ten rows of visitors, and no MapReduce is needed.
259
8.6. Try the following query. Make sure the output looks like first names:
hive> select fname from wh_visits limit 20;
This time a MapReduce job executed. Why? ____________________________

Step 9: View the Folder Structure
9.1. View the contents of the wh_visits folder:
# hadoop fs -ls /apps/hive/warehouse/wh_visits
Notice the five part-m files are located in wh_visits.

9.2. Try viewing the contents of the whitehouse folder:
Notice the folder is empty. The LOAD DATA command moved the files from their
original HDFS folder into the Hive warehouse folder; it did not copy them.
IMPORTANT: Be careful when you drop a managed table in Hive. Make sure
you either have the data backed up somewhere else, or that you no longer
want the data.
Step 10: Count the Number of Rows in a Table

10.1. Enter the following Hive query, which outputs the number of rows in
wh_visits:
hive> select count(*) from wh_visits;
10.2. How many rows are currently in wh_visits? _____________

Step 11: Define an External Table
11.1. Drop the wh_visits table again:
hive> drop table wh_visits;
11.2. View the contents of external_table.hive in the /root/labs/data folder:

260
# more external_table.hive
create external table wh_visits (
lname string,
fname string,
time_of_arrival string,
appt_scheduled_time string,
meeting_location string,
info_comment string)
FIELDS TERMINATED BY '\t'
LOCATION '/user/root/whitehouse/' ;
11.3. Create the whitehouse folder in HDFS again, and put the five part-m files
into whitehouse.
11.4. Verify that there is not a subfolder of /apps/hive/warehouse named
wh_visits.
11.5. Run the query in external_table.hive to create the wh_visits table:
# hive f external_table.hive
11.6. Run a query on wh_visits to verify that the table does actually contain
records.
11.7. Drop wh_visits again, but this time notice that the files in the whitehouse
folder are not deleted.
RESULT: As you just verified, the data for external tables is not deleted when the
corresponding table is dropped. Aside from this behavior, managed tables and external
tables in Hive are essentially the same.
261
Unit 12: Transferring Data with

Sqoop
Topics covered:
Overview of Sqoop
The Sqoop Import Tool
Importing a Table
Importing Specific Columns
Importing from a Query
The Sqoop Export Tool
Exporting a Table
Lab 12.1: Using Sqoop
262
Overview of Sqoop
Relational
Database
1. Client executes a
sqoop command
Enterprise
Document-based
Data Warehouse
Systems
3. Plugins provide connectivity to

various data sources
2. Sqoop executes the

command as a
MapReduce job on the
cluster (using Map-only
tasks)
Map
tasks
Hadoop Cluster
Overview of Sqoop
Sqoop is a tool designed to transfer data between Hadoop and external structured
datastores like RDBMS and data warehouses. Using Sqoop, you can provision the data
from an external system into HDFS. Sqoop uses a connector-based architecture that
supports plugins that provide connectivity to additional external systems.
As you can see in the slide, Sqoop uses MapReduce to distribute its work across the
Hadoop cluster:
1. A Sqoop job gets executed using the sqoop command line.
2. Sqoop uses Map tasks (4 by default) to execute the command.
3. Plugins are used to communicate with the outside data source. The schema is
provided by the data source, and Sqoop generates and executes SQL statements
using JDBC or other connectors.
NOTE: Using MapReduce to perform Sqoop commands provides parallel operation

as well as fault tolerance.
263
HDP provides the following connectors for Sqoop:
Teradata
MySQL
Oracle JDBC connector
Netezza
A Sqoop connector for SQL Server is also available from Microsoft:
264
SQL Server R2 connector
The Sqoop Import Tool

With Sqoop, you can import data from a relational database system into HDFS:
The input to the import process is a database table.
Sqoop will read the table row-by-row into HDFS. The output of this import
process is a set of files containing a copy of the imported table.
The import process is performed in parallel. For this reason, the output will be in
multiple files.
These files may be delimited text files (for example, with commas or tabs
separating each field), or binary Avro or SequenceFiles containing serialized
record data.
The import command looks like:

sqoop import (generic-args) (import-args)
265
The import command has the following requirements:
Must specify a connect string using the --connect argument
Credentials can be included in the connect string, so using the --username and -password arguments
Must specify either a table to import using --table, or the result of a SQL query
using --query
266
Importing a Table
sqoop import
--connect jdbc:mysql://host/nyse
--table StockPrices
--target-dir /data/stockprice/
--as-textfile
Importing a Table
The following Sqoop command imports a database table named StockPrices into a
folder in HDFS named /data/stockprices:
sqoop import
--table StockPrices
--target-dir /data/stockprice/
--as-textfile
Based on the import command above:
The connect string in this example is for MySQL. The database name is nyse.
The --table argument is the name of the table in the NYSE database.
The --target-dir is where in HDFS the data will be imported.
The default number of map tasks for Sqoop is 4, so the result of this import will
be in 4 files.
The --as-textfile argument imports the data as plain text.
267
NOTE: You can use --as-avrodatafile to import the data to Avro files, and use
--as-sequencefile to import the data to sequence files.
Other useful import arguments include:
--columns: a comma-separated list of the columns in the table to import (as

opposed to importing all columns, which is the default behavior).
--fields-terminated-by: specify the delimiter. Sqoop uses a comma by default.
--append: the data is appended to an existing dataset in HDFS.
--split-by: the column used to determine how the data is split between mappers.
If you do not specify a split-by column, then the primary key column is used.
-m: the number of map tasks to use.
--query: use instead of --table, the imported data is the resulting records from
the given SQL query.
--compress: enables compression.
NOTE: The import command shown here looks like it entered over multiple
lines, but you have to enter this entire Sqoop command on a single
command line.
REFERENCE: Visit http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html for a list

of all arguments available for the import command.
268

sqoop import
--table StockPrices
--columns StockSymbol,Volume, High,ClosingPrice
--target-dir /data/dailyhighs/
--as-textfile
--split-by StockSymbol
-m 10

Use the --columns argument to specify which columns from the table to import. For
example:
sqoop import
--table StockPrices
--columns StockSymbol,Volume,High,ClosingPrice
--target-dir /data/dailyhighs/
--as-textfile
-m 10
Based on the import command above:
How many columns will be in imported? ______________
How many files will be created in /data/dailyhighs/? ______________
Which column will Sqoop use to split the data up between the mappers?
____________________________
269

sqoop import
--query "SELECT * FROM StockPrices s
WHERE s.Volume >= 1000000
AND \$CONDITIONS"
--target-dir /data/highvolume/
--as-textfile

Use the --query argument to specify which rows to select from a table. For example:
sqoop import
--query "SELECT * FROM StockPrices s
WHERE s.Volume >= 1000000
AND \$CONDITIONS"
--target-dir /data/highvolume/
--as-textfile
Based on the command above:
Only rows whose Volume column is greater than 1,000,000 will be imported.
The $CONDITIONS token must appear somewhere in the WHERE clause of your
SQL query. Sqoop replaces this token with LIMIT and OFFSET clauses so that the
data can be split between mappers.
If you use --query, then you must also specify a --split-by column or the Sqoop
command will fail to execute.
270
NOTE: Using --query is limited to simple queries where there are no

ambiguous projections and no OR conditions in the WHERE clause. Use of
complex queries (such as queries that have sub-queries, or joins leading to
ambiguous projections) can lead to unexpected results.
IMPORTANT: You either use --query or --table, but attempting to define

both results in an error.
271
The Sqoop Export Tool

Sqoops export process will read a set of delimited text files from HDFS in parallel, parse
them into records, and insert them as new rows in a target database table. The syntax
for the export command is:
sqoop export (generic-args) (export-args)
The Sqoop export tool runs in three modes:

1. Insert Mode: the records being exported are inserted into the table using a SQL
INSERT statement.
2. Update Mode: an UPDATE SQL statement is executed for existing rows, and an
INSERT can be used for new rows.
3. Call Mode: a stored procedure is invoked for each record.
The mode used is determined by the arguments specified:
272
--table: the table to populate in the database. This table must already exist in the
database. If no --update-key is defined, then the command is executed in Insert
Mode.
--update-key: the primary key column for supporting updates. If you define this
argument, the Update Mode is used and existing rows are updated with the
exported data.
--call: invokes a stored procedure for every record, thereby using Call Mode. If
you define --call, then do not define the --table argument or an error will occur.
The following are sqoop export arguments:
--export-dir: the directory in HDFS that contains the data to export.
--input-fields-terminated-by: the input field delimiter. A comma is the default.
--update-mode: Specify how updates are performed when new rows are found
with non-matching keys in database. Values are updateonly (the default) and
allowinsert.
273
Exporting to a Table
sqoop export
--table LogData
--export-dir /data/logfiles/
--input-fields-terminated-by "\t"
Exporting to a Table
The following Sqoop command exports the data in the /data/logfiles/ folder in HDFS to
a table named LogData:
sqoop export
--table LogData
--export-dir /data/logfiles/
--input-fields-terminated-by "\t"
Based on the command above:
The table LogData needs to already exist in the Weblogs database.
The column values are determined by the delimiter, which is a tab in this
example.
All files in the /data/logfiles/ directory will be exported.
Sqoop will perform this job using 4 mappers, but you can specify the number to
use with the -m argument.
274
Unit 12 Review
1. What is the default number of map tasks for a Sqoop job? _____________
2. How do you specify a different number of mappers in a Sqoop job?
_________________________________________________
3. What is the purpose of the $CONDITIONS value in the WHERE clause of a Sqoop
query?
__________________________________________________________________
275
Lab 12.1: Using Sqoop
Objective: Move data between HDFS and a RDBMS.

Successful Outcome: You will have imported data from MySQL into folders in
HDFS, and exported data from HDFS into a MySQL table.
Perform the following steps:

Step 1: Install MySQL
1.1. Run the following commands to install MySQL on node1:
# yum -y install mysql mysql-server
1.2. Start the server with the following command:

# service mysqld start
Step 2: Create a Table in MySQL

2.1. As the root user, change directories to /root/labs:
# cd ~/labs/
2.2. View the contents of salaries.txt:

# cat salaries.txt
The comma-separated fields represent a gender, age, salary and zip code.
2.3. Notice there is a salaries.sql script that defines a new table in MySQL named
salaries. For this script to work, you need to copy salaries.txt into the publiclyavailable /tmp folder:
276
# cp salaries.txt /tmp
2.4. Now run the salaries.sql script using the following command:
# mysql test < salaries.sql
Step 3: View the Table

3.1. To verify the table is populated in MySQL, open the mysql prompt:
# mysql
3.2. Switch to the test database, which is where the salaries table was created:
mysql> use test;
3.3. Run the show tables command and verify salaries is defined:
mysql> show tables;
+----------------+
| Tables_in_test |
+----------------+
| salaries
|
+----------------+
1 row in set (0.00 sec)
3.4. Select 10 items from the table to verify it is populated:

mysql> select * from salaries limit 10;
+--------+------+--------+---------+----+
| gender | age | salary | zipcode | id |
+--------+------+--------+---------+----+
| F
|
66 | 41000 |
95103 | 1 |
| M
|
40 | 76000 |
95102 | 2 |
| F
|
58 | 95000 |
95103 | 3 |
| F
|
68 | 60000 |
95105 | 4 |
| M
|
85 | 14000 |
95102 | 5 |
| M
|
14 |
0 |
95105 | 6 |
| M
|
52 |
2000 |
94040 | 7 |
| M
|
67 | 99000 |
94040 | 8 |
| F
|
43 | 11000 |
94041 | 9 |
| F
|
37 | 65000 |
94040 | 10 |
+--------+------+--------+---------+----+
Step 4: Grant the Necessary Privileges

277
4.1. Enter the following command at the mysql prompt to grant access to node2
and node3 to connect to the mysql-server running on node1:
grant all privileges on *.* to 'root'@'%' with grant
option;
4.2. Exit the mysql prompt:

mysql> exit
Step 5: Import the Table into HDFS

5.1. Enter the following Sqoop command (all on a single line), which imports the
salaries table in the test database into HDFS:
# sqoop import
--connect jdbc:mysql://node1/test
--table salaries
--username root
5.2. A MapReduce job should start executing, and it may take a couple minutes for
the job to complete.
Step 6: Verify the Import
6.1. View the contents of the salaries folder:
# hadoop fs -ls salaries
6.2. You should see a new folder named salaries. View its contents:
# hadoop fs -ls salaries
Found 4 items
-rw-r--r-1 root hdfs
272
241
238
272
part-m-00000
part-m-00001
part-m-00002
part-m-00003
6.3. Notice there are four new files in the salaries folder named part-m-0000x.
Why are there four of these files?
__________________________________________________________________
6.4. Use the cat command to view the contents of the files. For example:
278
# hadoop fs -cat salaries/part-m-00000
Notice the contents of these files are the rows from the salaries table in MySQL.
You have now successfully imported data from a MySQL database into HDFS.
Notice you imported the entire table with all of its columns. In the next step, you
will import only specific columns of a table.
Step 7: Specify Columns to Import
7.1. Using the --columns argument, write a Sqoop command that imports the
salary and age columns (in that order) of the salaries table into a directory in
HDFS named salaries2. In addition, set the -m argument to 1 so that the result is a
single file.
7.2. After the import, verify you only have one part-m fie in salaries2:
# hadoop fs -ls salaries2
Found 1 items
482
salaries2/part-m-00000
7.3. Verify the contents of part-m-00000 are only the 2 columns you specified:
# hadoop fs -cat salaries2/part-m-00000
The last few lines should look like the following:

69000.0,97
91000.0,48
0.0,1
48000.0,45
3000.0,39
14000.0,84
Step 8: Importing from a Query

8.1. Write a Sqoop import command that imports the rows from salaries in
MySQL whose salary column is greater than 90,000.00. Use gender as the --splitby value, specify only 2 mappers, and import the data into the salaries3 folder in
HDFS.
TIP: The Sqoop command will look similar to the ones you have been using
throughout this lab, except you will use --query instead of --table. Recall
279
that when you use a --query command you must also define a --split-by
column, or define -m to be 1.
Also, do not forget to add $CONDITIONS to the WHERE clause of your query,
as demonstrated earlier in this Unit.
8.2. To verify the result, view the contents of the files in salaries3. You should
have only two output files.
8.3. View the contents of part-m-00000 and part-m-00001. Notice one file
contains females, and the other file contains males. Why? ______________
______________________________________________________________
8.4. Verify the output files contain only records whose salary is greater than
90,000.00.
Step 9: Put the Export Data into HDFS
9.1. Now lets export data from HDFS to the database. Start by viewing the
contents of the data, which is in a file named salarydata.txt:
# tail salarydata.txt
M,49,29000,95103
M,44,34000,95102
M,99,25000,94041
F,93,96000,95105
F,75,9000,94040
F,14,0,95102
M,68,1000,94040
F,45,78000,94041
M,40,6000,95103
F,82,5000,95050
Notice the records in this file contain 4 values separated by commas, and the
values represent a gender, age, salary and zip code, respectively.
9.2. Create a new directory in HDFS named salarydata.
9.3. Put salarydata.txt into the salarydata directory in HDFS.
Step 10: Create a Table in the Database
10.1. There is a script in the /root/labs folder that creates a table in MySQL that
matches the records in salarydata.txt. View the SQL script:
280
# more salaries2.sql
10.2. Run this script using the following command:

# mysql test < salaries2.sql
10.3. Verify the table was created successfully in MySQL:

# mysql
mysql> use test;
mysql> describe salaries2;
+---------+------------+------+-----+---------+-------+
| Field
| Type
| Null | Key | Default | Extra |
+---------+------------+------+-----+---------+-------+
| gender | varchar(1) | YES |
| NULL
|
|
| age
| int(11)
| YES |
| NULL
|
|
| salary | double
| YES |
| NULL
|
|
| zipcode | int(11)
| YES |
| NULL
|
|
+---------+------------+------+-----+---------+-------+
10.4. Exit the mysql prompt:

mysql> exit
Step 11: Export the Data

11.1. Run a Sqoop command that exports the salarydata folder in HDFS into the
salaries2 table in MySQL. At the end of the MapReduce output, you should see a
log event stating that 10,000 records were exported.
11.2. Verify it worked by viewing the tables contents from the mysql prompt. The
output should look like the following:
mysql> use test;
mysql> select * from salaries2 limit 10;
+--------+------+--------+---------+
| gender | age | salary | zipcode |
+--------+------+--------+---------+
| M
|
57 | 39000 |
95050 |
| F
|
63 | 41000 |
95102 |
| M
|
55 | 99000 |
94040 |
| M
|
51 | 58000 |
95102 |
| M
|
75 | 43000 |
95101 |
| M
|
94 | 11000 |
95051 |
| M
|
28 |
6000 |
94041 |
| M
|
14 |
0 |
95102 |
281
| M
|
3 |
0 |
95101 |
| M
|
25 | 26000 |
94040 |
+--------+------+--------+---------+
RESULT: You have imported the data from MySQL to HDFS using the entire table,
specific columns, and also using the result of a query. You have also exported a folder of
data in HDFS into a table in MySQL.
SOLUTIONS:
Step 7.1 is the following command (entered on a single line):
# sqoop import --connect jdbc:mysql://node1/test
--table salaries
--columns salary,age
-m 1
--target-dir salaries2
--username root
Step 8.1:
sqoop import --connect jdbc:mysql://node1/test
--query "select * from salaries s where s.salary > 90000.00
and \$CONDITIONS"
--split-by gender
-m 2
--target-dir salaries3
--username root
282
Step 11
sqoop export
--connect jdbc:mysql://node1/test
--table salaries2
--export-dir salarydata
--input-fields-terminated-by ","
--username root
ANSWERS:
Step 6.3: The MapReduce job that executed the Sqoop command used four mappers, so
there are four output files (one from each mapper).
Step 8.3: You used gender as the split-by column, so all records with the same gender
are sent to the same mapper.
283
Unit 13: Flume

Topics covered:
Flume Introduction
Installing Flume
Flume Events
Flume Sources
Flume Channels
Flume Channel Selectors
Flume Channel Selector
Flume Sinks
Multiple Sinks
Flume Interceptors
Design Patterns
Configuring Individual Components
Flume Netcat Source Example
Flume Exec Source Example
Flume Configuration
Monitoring Flume
Lab 13.1: Install and Test Flume
284
Flume Introduction
A Flume is an artificial channel or stream created
which uses water to transport objects down a
channel.
Apache Flume, a data ingestion tool, collects,
aggregates and directs data streams into Hadoop
using the same concepts. Flume works with
different data sources to process and send data to
defined destinations.
Source
Channel
Sink
Flume Agent
Flume Introduction
A flume is an artificial channel or stream created that uses water to transport objects
down the channel. Flumes were often used by the logging industry to move cut
wooden logs. Apache Flume transfers data from multiple sources into Hadoop via
events instead of wooden logs. It efficiently collects, aggregates, and moves large
amounts of streaming data.
Flume Components
Event: The individual unit of data (such as a log entry) and is made up of
header(s) and a byte-array body.
Source: Defines the type of data stream that is entering Flume. Sources may
either be active; constantly looking for data, or passive; waiting for data to be
passed to them.
Client: Produces and communicates events to the source.
Sink: Delivers the data to its destination. Each sink is defined based on the
destination it will be transferring data into. For example: HDFS, HBase, a local
file.
285
Channel: The conduit between the source and the sink (destination).
Agent: A JVM process that is a collection of sources, sinks and channels. An

agent requires that at least one source, channel and sink be defined however
may also be configured with multiple sources, channels, and sinks.
Flume Workflow
1. Client transmits event to a source.
2. Source receives event and delivers it to one or more channels.
3. The sink or sinks transfer the data from the channel to the final destination.
286
Installing Flume
Following are the system requirement for running Flume:
Java Runtime Environment: Java 1.6 or later (Java 1.7 Recommended).
Memory: The Flume agent requires appropriate amount of memory for all
components of the agent.
Disk Space: Flume agent needs permission to access sources and write to
destinations. Make sure channels have sufficient storage.
Permissions: Read/write access to all directories the agents will be using.
Although not required, it is recommended to set your time to UTC versus local
time.
NOTE: The Flume agent heap size can be set with JAVA_OPTS:
JAVA_OPTS= "-Xms100m -Xmx200m"
287
The key Flume environment and configuration files are:

File Location / Name
Description
/etc/flume/conf/flume-conf.properties
Java properties file.
/etc/flume/conf/flume-env.sh
Contains environment variables.
/etc/flume/conf/log4j.properties
Contains Java logging properties (such as

log directory).
flume.log.dir=/var/log/flume
Flume log directory.
NOTE: The following templates are available:

/etc/flume/conf/flume-conf.properties.template and
/etc/flume/conf/flume-env.sh.template
Flume can be started using either of the commands below:

$ /etc/rc.d/init.d/flume-ng start
-- OR -$ service flume-ng start
Use the help option to view the usage list:

$ ./bin/flume-ng help
288
Flume Events
An event can range from text to images. The key point about events is they need to be
generated from regular streaming data.
An Event is a single unit of data that can be transported by Flume NG (akin to messages
in JMS). Events are generally small (ranging from a few bytes to a few kilobytes) and are
commonly a single record from a larger dataset. Events are made up of headers,
containing the key / value map and a body, storing the arbitrary byte array.
Clients generate data as a stream of events and run in a separate thread. The clients
send data to a source. A log4j appender sends events directly to Flume NG's source or
syslog daemon.
289
Flume Sources
A Flume source is the data stream from which Flume receives the data. The source can
be pollable or event driven. A spool director can be set up to look for new files. A suffix
can be added once all events have been transmitted.
Property
Sample Value
agent.sources
mychannel
agent.sources.channels
mychannel
agent.sources.mychannel. type
spooldir
agent.sources.mychannel. spoolDir
/directorypath
agent.source.mychannel. fileSuffix
.COMPLETE
290
Flume Source
Description
Avro Source
Listens on Avro port and receives events

from an external Avro client stream.
Exec Source
Runs a given Unix command at start-up

and listens to standard out.
Thrift Source
RPC source.
NetCat Source
Listens on a given port and turns each line

of text into an event.
Sequence Generator Source
Continuously generates events with a

counter that starts at 0 and increments by
1. Useful for testing and debugging.
SpoolDir Source
Process rotating log files.
JMS Source
Process stream from Java Messaging

Service.
HTTP Source
Accepts events by HTTP POST and GET

(experimentation only).
Syslog Source
Reads syslog data and generates Flume

events.
Also: Syslog TCP Source and Syslog UDP
Source.
Custom Source
Implementation of Source interface.
291
Flume Channels
The channel is the conduit for events between a source and a sink. The channel dictates
the durability of event delivery between a source and a sink. An event stays in the
channel until the sink successfully sends the data to the defined destination. The source
and the sink run asynchronously in processing events in the channel. Channel
exceptions can be thrown if the ingest rate exceeds the channels ability to handle that
rate.
292
Types of channels include:
Memory Channel: Fast but makes no guarantee against data loss.
File Channel: Backed by WAL implementation; fully durable and reliable.
JDBC Channel: Backed by embedded Database; fully durable and reliable.
Memory Channels: Events are stored in an in-memory queue with configurable

max size. They are the fastest but lack durability.
File Channel: Writes and checkpoints files to disk. Slower but durable.
JDBC Channel: Events are stored in a persistent storage that is backed up with a
database. Slower but durable.
Custom Channel: Implementation of the Channel interface. Examples of

defining different types of channels.
agent.channels.mychannel. type = memory
agent.channels.mychannel. type = file
293
Flume Channel Selectors

A channel selector allows you to go from a single source to multiple channels using a fan
out strategy.
Example:
agent.sources.mychannel.channels = mych1 mych2 mych3
agent.sources.mychannel.selector.type = replicating
agent.sources = mychannel
agent.channels = mych1 mych2 mych3
agent.sources. mychannel.selector.type = multiplexing
agent.sources. mychannel.selector.header = state
agent.sources. mychannel.selector.mapping.src1 = mych1
mych2
Events can be batched as a transaction and each transaction has a unique id. The
number of events that are processed together as a single transaction determines the
batch size. Each event in a transaction has a unique sequence number.
The durability of transactions is determined by the batch sizes as batch sizes control
throughput.
294

Channel Selectors support fanning out of events. The events are either
replicated to all channels or sent to a specific channel.
Source
Channel
Sink
Channel
Sink
C
S
Agent

A channel selector can be set to replicating (default) or multiplexing. Channel selectors
set to replicating will send all events to multiple channels. Channel selectors set to
multiplexing can replicate or selectively route an event to one or more channels. See
below for examples of setting the channel selector.
Replicating:
agent.sources.mychannel. channels = c1 c2 c3
agent.sources.mychannel. selector.type = replicating
Multiplexing:
agent.sources.mychannel. selector.type = multiplexing
agent.sources.mychannel. selector.header = port
295
Flume Sinks
Sinks receive Events from Channels which are then written to the HDFS
or forwarded to another data source. Supported destinations are shown
below:
HDFS
Avro
Flume Sinks
A sink is the destination for the data stream in Flume. The sink receives events from a
channel and runs in a separate thread. Sinks can support text and sequence files when
writing to HDFS and both file types can be compressed. Below is a list of the different
types of sinks.
Type of Sink
Description
HDFS Sink
Write events into HDFS. Data streams can

be compressed.
Creates text or sequence files. Sequence
files are the default and are able to be
split.
Supports Kerberos authentication.
Logger Sink
Logs event at INFO level.
Avro Sink
Use to sent to Avro Source on another

server.
296
Thrift Sink
Flume events are turned into Thrift events

and sent to hostname/port.
IRC Sink
Takes messages from attached channel

and relays those to configured IRC
destinations.
File Roll Sink
Stores events on the local file system.
HBase Sink
Write data to HBase.
Async HBase Sink
Writes data to HBase using asynchronous

model.
Null Sink
Removes all events received from the

channel.
Custom Sink
Create a custom implementation of Sink

interface.
297
Multiple Sinks
Single sink is default behavior.
Multiple sinks can provide:
Failover for Sinks.
Load balancing of Sinks.
Sink
Source
Channel
S
P
Sink
Agent
Multiple Sinks
Sink Processors are a collection of multiple sinks and can be setup for load balancing
over multiple sinks or to achieve failover from one sink to another in case of failure.
298
There are different types of sink processors; each is described in the table below:
Type of Sink Processor
Description
Default
Accepts one sink and does not require user

to create a sink processor. Follows the
source channel sink pattern.
Failover
Maintains a list of prioritized sinks. Moves

out failed sinks, if they continue to fail
they are retired, if they send a successful
event, they are restored.
Load Balancing
Load balances across multiple sinks. The

load distribution can be round-robin or
random selection with a default of roundrobin. Picks next available sink if selected
sink fails to deliver event.
299
Flume Interceptors
Interceptors are set with the interceptors property and have the ability to drop or
modify an event based on how the interceptor is coded. Flume supports chaining
multiple interceptors together and the order of definition sets the order they run in.
agent.sources.mychannel.interceptors = inter1 iinter2 inter3
The type property sets the type of interceptor.
Timestamp Interceptor: Inserts timestamp into header of events. An existing

time stamp can be overwritten.
agent.sources.mychannel.interceptors = inter1
agent.sources.mychannel.interceptors.inter1.type = timestamp
agent.sources.mychannel.interceptors.inter1.preserveExisting
= true
300
Host Interceptor: Inserts the hostname or IP address in the header of event.

Allows user to append a static header to all events.
agent.sources.mychannel.interceptors = inter1
agent.sources.mychannel.interceptors.type = host
Static Interceptor: Allows a key/value to be added.
UUID Interceptor: Sets a universally unique identifier on all events that go

through interceptor.
Regex Filtering Interceptor: Filters events selectively by interpreting the event

body as text and matching the text against a configured regular expression.
Regex Extractor Interceptor: Filters events by interpreting event body as text,

matching the text against a configured regular expression and extract an
element out of the body and place it in the header.
Morphline Interceptor: Filters the events through a morphline configuration file

that defines a chain of transformation commands that pipe records from one
command to another.
301
Design Patterns
Multi-Agent Flow
Source
Channel
Avro Sink
Avro
RPC
Channel
Avro Source
Sink
Fan In (Consolidation)
Source
Channel
Sink
Source
Channel
Sink
Source
Channel
Sink
Source
Channel
Sink
Fan Out
Source
Channel
Sink
Channel
Sink
Channel
Sink
HDFS
Source
Channel
Sink
Design Patterns
Flume has the flexibility to create complex data workflows. Agents are able to have
multiple sources, channels and sinks. You can also connect multiple agents to each
other.
The Flume topology supports multiple design patterns. A few are shown above:
Multi-Agent Flow
Fan In
Fan Out
For any Flume agent, the source ingests data and sends it to the channel. There can be
multiple sources, channels and sinks in a Flume agent but each sink can only receive
data from a single channel.
302
Configuring Individual Components

Multiple agents can be configured in a single configuration file. The configuration file
contains properties about each source, channel, and sink for the agent(s). How these
areas are configured determines the data flow through the agent.
A prefix name is used to identify an agent. Each source, channel and sink needs to have
a name associated with it. Following the name are the properties that define the
component.
303
Example formats:
<AgentName>.sources = <SourceName>
<AgentName>.sinks = <SinkName>
<AgentName>.channels = <Channel1> <Channel2>
<AgentName>.sources.<SourceName>.channels = <Channel1>
<Channel2> ... # set channel for source
<AgentName>.sinks.<SinkName>.channel = <Channel1>
# set channel for sink
<AgentName>.sources.<SourceName>.<someProperty> =
<someValue>
<AgentName>.channel.<ChannelName>.<someProperty> =
<someValue>
# properties for channels
<AgentName>.sources.<SinkName>.<someProperty> = <someValue>
# properties for sinks
# Agent name examples

agent.sources.mysource.port=20100 -- # Uses agent called
agent.
a1.sources.mysource.port=20100 -- # Uses agent called a1.
To start a Flume agent, call the flume-ng shell (located in Flume bin directory) script.
The script sets the agent name, the configuration directory and the configuration
properties file.
$ bin/flume-ng agent -n $agent_name -c conf -f conf/flumeconf.properties.
304
Flume Netcat Source Example

The following table outlines options that can be set within the my.conf file. The agent
will receive the text from the telnet command.
305
# my.conf file
#Define source name
agent.sources = snet
#Define sink name
agent.sinks = sink1
#Define channel name
agent.channels = chmem
#Set the source
agent.sources.snet.type = netcat
agent.sources.snet.bind = localhost
agent.sources.snet.port = 44444
# Set the sink destination
agent.sinks.sink1.type = logger
#Set channel to type memory
agent.channels.chmem.type = memory
agent.channels.chmem.capacity = 1000
agent.channels.chmem.transactioncapacity = 100
#Set the source with the channel
agent.sources.snet.channels = chmem
#Set the sink with the channel
agent.sinks.sink1.channel = chmem
306
Flume Exec Source Example

agent.sources = pstream
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.sources.pstream.channels = memoryChannel
agent.sources.pstream.type = exec
agent.sources.pstream.command = tail -f /etc/passwd
agent.sinks = hdfsSink
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = memoryChannel
agent.sinks.hdfsSink.hdfs.path =
hdfs://hdp/user/root/flumetest
agent.sinks.hdfsSink.hdfs.fileType = SequenceFile
agent.sinks.hdfsSink.hdfs.writeFormat = Text
307
Flume Configuration
# A single-node Flume configuration
# Name the components on this agent
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channelA
# Describe/configure source1
agent1.sources.source1.type = netcat
agent1.sources.source1.bind = localhost
agent1.sources.source1.port = 44444
# Describe sink1
agent1.sinks.sink1.type = logger
# Use a channel which buffers events in memory
agent1.channels.channelA.type = memory
agent1.channels.channelA.capacity = 1000
agent1.channels.channelA.transactionCapactiy = 100
# Bind the source and sink to the channel
agent1.sources.source1.channels = channelA
agent1.sinks.sink1.channel = channelA
Flume Configuration
The property "type" needs to be set for each component for Flume to understand what
kind of object it needs to be. Each source, sink and channel type has its own set of
properties required for it to function as intended. All those need to be set as needed. In
the previous example, we have a flow from avro-AppSrv-source to hdfs-Cluster1-sink
through the memory channel mem-channel-1.
308
The example below shows configuration of each of those components:

# set channel for sources, sinks, channels
my_agent.sources = avro-AppSrv-source
my_agent.sinks = hdfs-Cluster1-sink
my_agent.channels = channel1
# properties of avro-AppSrv-source
my_agent.sources.avro-AppSrv-source.type = avro
my_agent.sources.avro-AppSrv-source.bind = localhost
my_agent.sources.avro-AppSrv-source.port = 10000
# properties of channel1
my_agent.channels.channel1.type = memory
my_agent.channels.channel1.capacity = 1000
my_agent.channels.channel1.transactionCapacity = 100
# properties of hdfs-Cluster1-sink
my_agent.sinks.hdfs-Cluster1-sink.type = hdfs
my_agent.sinks.hdfs-Cluster1-sink.hdfs.path =
hdfs://namenode/flume/webdata
#...
309
Monitoring Flume
Flume monitoring options can be set in /etc/flume/conf/flume-env.sh (JAVA_OPTS) for
the following:
JMX monitoring
JAVA_OPTS="-Dcom.sun.management.jmxremoteDcom.sun.management.jmxremote.port=4159
-Dcom.sun.management.jmxremote.authenticate=false Dcom.sun.management.jmxremote.ssl=false
Ganglia: Flume metrics can be sent to Ganglia.

JAVA_OPTS="-Dflume.monitoring.type=ganglia Dflume.monitoring.hosts=<ganglia-server>:8660"
310
Nagios: Nagios can be configured to watch the Flume agents. Monitoring for
cpu, memory and disk resources consumed by Flume should be standard. Look
at the Nagios JMX plugin to monitor performance.
Unit 13 Review
1. The basic unit of data for Flume is an ____________________ .
2. Sources can be polled or _________________________.
3. A channel selector can be replicating, multiplexing or _____________________.
4. The Flume component ____________________ allows inspection and
transformation of the data as it flows through the stream.
311
Lab 13.1: Install and Test Flume
Objective: Install, configure and run Flume.

Successful Outcome: A running Flume agent that reads data from a network
connection and writes it to a folder in HDFS.
Perform the following steps:

Step 8: Install Flume
8.1. On node1 as root, install Flume using the following command:
# yum y install flume
8.2. Verify Flume is installed by viewing the usage of the flume-ng command:
# flume-ng
Step 9: View the Flume Agent Configuration

9.1. Change directories to the /root/labs/flume directory.
9.2. A Flume agent has been written for you in a file named logagent.conf. View
the file:
# less logagent.conf
9.3. Notice the name of the agent defined in this file is called logagent.
9.4. The source of logagent is source1. Based on the source1 configuration, where
is the data coming from for this Flume agent?
__________________________________________________
312
9.5. What type of channel is being used for logagent?

___________________________________
9.6. Where is the sink for this Flume agent?
___________________________________________
Step 10: Start a Flume agent
10.1. Flume requires JAVA_HOME to be defined, so enter the following command:
# export JAVA_HOME=/usr/jdk64/jdk1.6.0_31/
10.2. Start logagent using the following command (all on a single line):
flume-ng agent -n logagent -f logagent.conf
-Dflume.log.dir=/var/log/flume/
-Dflume.log.file=logagent.log &
10.3. View the output of the command. Make sure sink1 and source1 started:
INFO sink.RollingFileSink: RollingFileSink sink1 started.
INFO instrumentation.MonitoredCounterGroup: Monitoried
counter group for type: SOURCE, name: source1, registered
successfully.
INFO instrumentation.MonitoredCounterGroup: Component type:
SOURCE, name: source1 started
INFO source.AvroSource: Avro source source1 started.
10.4. Verify a flume process is running:

# ps -eaf | grep flume
Step 11: Test the Flume Agent

11.1. The sink for logagent is the /user/root/flumedata folder in HDFS. View the
contents of this folder, which should either be empty or does not exist yet:
# hadoop fs -ls flumedata
11.2. Change directories to ~/labs/flume and view the contents of test.log. This
will be the data that you send to the source of logagent.
313
11.3. From the ~/labs/flume folder, run the following command (all on a single
line) which takes the contents of test.log and writes it in the Avro format to port
8888 on node1:
# flume-ng avro-client -H node1 -p 8888 -C
/usr/lib/flume/lib/flume-ng-core-1.4.0.2.0.6.0-76.jar -F
test.log
11.4. Wait for this task to execute. When complete, view the contents of
flumedata in HDFS, which should now contain a new file:
# hadoop fs -ls flumedata
Found 1 items
739
flumedata/FlumeData.1384193670669
11.5. View the contents of the file in HDFS. It should match the content from
test.log:
# hadoop fs -cat flumedata/FlumeData.1384193670669
Step 12: Stop the Agent

12.1. Determine the process ID of the Flume agent:
# ps -eaf | grep flume
12.2. To kill a Flume agent, simply issue the kill command on the process:
# kill pid
RESULT: You just ran a Flume agent that reads data from a network connection and
streams it into a folder in HDFS.
ANSWERS:
2.4: The source of logagent is a network connection on port 8888 of node1.
2.5: The channel is an in-memory channel of size 100.
2.6: The sink is the /user/root/flumedata folder in HDFS.
314
Unit 14: Oozie

Topics covered:
Oozie Overview
Oozie Components
Jobs, Workflows, Coordinators, Bundles
Workflow Actions and Decisions
Oozie Job Submission
Oozie Server Workflow Coordinator
Oozie Console
Interfaces to Oozie
Oozie Server Configuration Files
Oozie Scripts
The Oozie CLI
Using the Oozie CLI
Submit Jobs through http
Oozie Actions
Oozie Metrics
Lab 14.1: Running an Oozie Workflow
315
Oozie Overview
A workflow is a sequence of actions scheduled for execution. Oozie is the workflow
scheduler for Hadoop that runs as a service on the cluster. Clients submit workflow
definitions for immediate or scheduled execution. Oozie is tightly integrated with
Hadoop.
Oozie actions may include:
Streaming
MapReduce
Pig
Hive
Distcp
Sqoop jobs
316
Oozie Components
Workflow Engine
Runs workflows
Coordinating
Engine
Coordinating Engine Scheduler

Runs workflow jobs based on:
Workflow
Engine
Oozie Server
Predefined schedules (fixed or cron intervals)

Data availability
Oozie Server
JVM runs Coordinating Engine and Workflow Engine
Database
Database
Stores workflow definitions and state information
Oozie Console
Oozie Components
Oozie is a Java Web-application that runs in a Java servlet-container in Tomcat. Tomcat
is a Java web-application server that uses a database to store the Oozie workflow
definitions, the state of current workflow instances, instance states and variables.
Two main components are the Oozie server and the Oozie client. The Server is the
engine that runs the workflow, and the Oozie client launches jobs and communicates
with the Oozie server.
Oozies metadata database contains the workflow definitions and the current status of
workflows including state information and workflow instances (such as states and
variables) in a database.
317

Workflows run in a defined order set by a Direct Acyclic Graph (DAG)
Workflow is defined with nodes
Include beginning and end of a workflow
Start
End
OK
Fail/Kill
Start
Action
End
ERROR
Kill

An Oozie job is an operation to be performed. There are Oozie workflow and
coordinator jobs.
An Oozie workflow is a collection of jobs or actions to be performed in a defined order
(DAG). Workflows contain control flow (execution path) and action nodes (processing to
be performed). Oozie triggers workflow actions, but Hadoop MapReduce executes
them. Oozie detects completion of tasks through callback and polling. When Oozie starts
a task, it provides a unique callback HTTP URL to the task, thereby notifying that URL
when its complete. If the task fails to invoke the callback URL, Oozie can poll the task
for completion.
The Oozie coordinator jobs are triggered by time and data availability. The Coordinator
allows you to model workflow execution triggers in the form of the data, time or event
predicates. The workflow job is started after those predicates are satisfied. Oozie
Coordinator can also manage multiple workflows that are dependent on the outcome of
subsequent workflows. The outputs of subsequent workflows become the input to the
next workflow. This chain is called a data application pipeline.
An Oozie bundle provides a way to package multiple coordinator and workflow jobs.
The bundle defines a set of coordinator applications (data pipeline). This allows a user
to start, stop, suspend, resume and resume a bundle.
318
When a HDFS URI is defined as a data set, Oozie will perform availability check. When
data dependencies are met, the coordinators workflow is triggered. Oozie coordinators
also support triggers that run when HCatalog table partitions are available and workflow
actions can read data from the partitions. (HCatalog provides abstract table definitions
for the underlying data storage.)
A Direct Acyclic Graph (DAG) is a collection of vertices (nodes actions) and directed
edges that connect different vertices in an order (directed graph) so there is an end and
the DAG does not circle back to the start.
The Oozie workflows are defined in a XML Process Definition Language called hPDL. The
XML documents contain the workflow made up of start, end and fail nodes as well as
decision control statements contain decision, fork and join nodes.
Workflow actions:
All workflows must have one start and one end node.
Workflow starts with a transition to the start node.
Workflow succeeds when it transitions to the end node.
If the workflow fails, it transitions to a kill node. The workflow reports the error
message specified in the message element in the workflow definition.
319

Workflow is defined with mechanisms to control the workflow execution:
Decision
Fork
Join
Kill
Action
Start
Action
Fork
Join
Action
Kill
End
Kill

The workflow will trigger the execution of a computation of a computation/processing
task.
Action types include:
Streaming, MapReduce, HDFS, Pig, and Sqoop.
Java, shell, email, and distcp.
HDFS file system operations.
Oozie sub-workflow.
Most actions have to wait until the previous action completes. Callbacks and polling are
used by Oozie to stay in communication with the defined processing.
Computation/processing tasks triggered by an action node are executed by the
MapReduce framework. Most operations are executed asynchronously however file
system operations are executed synchronously.
320
Oozie Actions
Shell Action: Oozie will wait for the shell command to complete before going to next
action. The standard output of the shell command can be used to make decisions.
Pig, Hive and MapReduce Actions: For executing Pig and Hive scripts and Java
MapReduce jobs.
Sqoop Action: Oozie will wait for the Sqoop command to complete before going to next
action.
Ssh Action: Runs a remote secure shell command on a remote machine. The workflow
will wait for ssh command to complete. Ssh command is executed in the home
directory of the defined user on the remote host.
Custom Action: Custom actions can be set up to run synchronous or asynchronous.
321
Email Actions: Sent synchronously, an email must contain an address, a subject and a
body. Here is an example of setting the properties for an email action. Examples of
other Oozie actions can be found in the documentation.
<workflow-app name="sample-wf"
xmlns="uri:oozie:workflow:0.1">
...
<action name="an-email">
<email xmlns="uri:oozie:email-action:0.1">
<to>bigkahuna@hwxs.com</to>
<subject>Email notifications for
${wf:id()}</subject>
<body>My cool workflow ${wf:id()} successfully
completed.</body>
</email>
<ok to="mycooljob"/>
<error to="errorcleanup"/>
</action>
...
</workflow-app>
322

1.
2.
3.
4.
5.
Submit MapReduce Job

Start Map only process (runs Oozie script)
The action is executed
The Launcher task is terminated
If the workflow contains another action, a new Launcher is submitted
Oozie
Server
Resource
Manager
Execute the
Action
MapReduce
Job
(Launcher
task)

As mentioned in the previous slide, most of the actions are executed asynchronously.
HDFS actions, however, are handled synchronously. Identical workflows can be run
concurrently when properly parameterized.
Oozie can detect completion of computation/processing tasks through callbacks and
polling. When a computation/processing task is started by Oozie, it provides a unique
callback URL to the task. The task should invoke the given URL to notify its completion.
When the task fails to invoke the callback URL for any reason (Transient network failure,
for example), or when the type of task cannot invoke the callback URL upon completion,
Oozie has a mechanism to poll computation/processing tasks for completions. The
default number of retries is three.
NOTE: If the Oozie job consists of multiple actions, then a new Launcher
MapReduce job is executed for each distinct action in the workflow.
323

Oozie server can receive job completion from either callback or poll.
Coordinating
Engine
Job Submission (with callback URL)
Workflow
Engine
Periodic polling for job status
Callback to URL on job completion
JobTracker
Oozie Server
Databases supported:
Derby (default), MYSQL, Oracle, PostgreSQL, and HSQL.
Database

Workflows can be submitted individually or as part of a Coordinator workflow.
Oozies coordinating engine:
Allows the user to define workflow execution schedules based on dependencies.
Allows for workflow execution triggers in the form of predicates.
Starts a workflow job after the predicate event is satisfied.
Predicates can reference data, time or a cron-style schedule.
Allows for multiple coordinators to be bundled together.
Many organizations use their enterprise scheduler to call Oozie Workflows. You may
use REST API to call workflows.
Yahoo runs over 700 workflows. They are organized into coordinators and
bundled together.
324
Oozie Console
Use to watch progress of workflows
Cannot submit jobs
Use to see results of workflow execution
Cannot modify job status
Use to get detailed information on job

execution
Oozie Console
The Oozie Web Console provides a UI for viewing and monitoring your Oozie jobs. You
will use the Console in the upcoming lab.
325
Interfaces to Oozie
The Oozie Web Services (WS) API is a HTTP REST JSON API.
326
Oozie Server Configuration

A proxy user should be set up for the Oozie server process. Set the following two
properties in the Hadoop core-site.xml file. You should always restart Hadoop when
changes to the proxy user settings are modified.
<! Setting up Oozie proxy user -->
<property>
<name>hadoop.proxyuser.[PROXY_USERNAME].hosts</name>
<value>[OOZIE_SERVER_HOSTNAME]</value>
</property>
<property>
<name>hadoop.proxyuser.[PROXY_USERNAME].groups</name>
<value>[PROXY_USER_GROUPS]</value>
</property>
The oozie-default.xml contains the initial read-only Oozie parameters.
327
Here are the primary Oozie environmental variables, which are configured in oozieenv.sh:
Variable Name
Description
CATALINA_OPTS
Java properties for the Embedded Tomcat

server that runs Oozie.
OOZIE_CONFIG_FILE
Oozie configuration file (oozie-site.xml).
OOZIE_LOGS
Oozie logs directory (logs/).
OOZIE_LOG4J_FILE
Oozie Log4J configuration file (oozielog4j.properties).
OOZIE_LOG4J_RELOAD
Reload interval for Log4J configuration file

(10) in seconds.
OOZIE_HTTP_PORT
Oozie port number (11000).
OOZIE_ADMIN_PORT
Oozie admin port (11001).
OOZIE_HTTP_HOSTNAME
The Oozie host server name. Find by

running hostname f command.
OOZIE_BASE_URL
The base URL for actions callback URLs to

Oozie
OOZIE_CHECK_OWNER
When set to TRUE , Oozie

setup/start/run/stop scripts will verify the
owner of the Oozie installation directory
matches the user executing the script. The
default value is undefined.
328
Configure when Oozie is using HTTPS (SSL).

Variable Name
Description
OOZIE_HTTPS_PORT
The Oozie server SSL port runs (11443).
OOZIE_HTTPS_KEYSTORE_FILE
The keystore certificate file location

($OOZIE_HOME}/.keystore).
OOZIE_HTTPS_KEYSTORE_PASS
The keystore file password ( password).
329
Oozie Scripts
oozied.sh start: start the Oozie process as a daemon
oozied.sh run: start Oozie as a foreground process
oozied.sh stop: stop the Oozie process
Run the oozie-setup.sh script to manually configure Oozie with all the components
added to the libext/ directory.
$ bin/oozie-setup.sh prepare-war [-d directory] [-secure]
sharelib create -fs <FS_URI> [-locallib <PATH>]
sharelib upgrade -fs <FS_URI> [-locallib <PATH>]
db create|upgrade|postupgrade -run [-sqlfile <FILE>]
Manually create the Oozie metadata DB:

$ bin/ooziedb.sh create -sqlfile oozie.sql -runValidate DB
Connection.
330
Examples:
Command
Description
bin/oozied.sh start
Start Oozie as a daemon process.
bin/oozied.sh run
Start Oozie as a foreground process.
bin/oozie admin -oozie

http://localhost:11000/oozie -status
Check the status of Oozie. The status

should be normal.
331
The Oozie CLI

The Oozie Command Line Interface (CLI) can perform pseudo/simple and Kerberos HTTP
SPNEGO authentication when requested by Oozie server. The doas option allows a
proxy user to run oozie commands if the proxy user is set.
332
Oozie CLI commands:

Command
Description
oozie job <OPTIONS>
Execute job operations such as start, kill,

dryrun, resume, suspend, etc.
oozie jobs <OPTIONS>
Get jobs status.
oozie admin <OPTIONS>
Perform administration operations.
oozie validate <ARGS>
Validate a workflow XML file
oozie pig <OPTIONS> -X <ARGS>
Submit a pig job.
oozie hive <OPTIONS> -X<ARGS>
Submit a hive job.
oozie info <OPTIONS>
Return detailed info about listed options.
oozie mapreduce <OPTIONS>
Submit a mapreduce job.
oozie sla <OPTIONS>
sla operations (Deprecated with Oozie 4.0)
333
Using the Oozie CLI

Included in the distribution are a number of Oozie examples found in the oozieexamples-*.tar.gz file.
Examples:
# Submitting a job and put it in prep status.
$ oozie job -oozie http://localhost:11000/oozie -config
job.properties submit
job: 14-20131025454545-oozie-myjob
#Starts a job in prep status.
$ oozie job -oozie http://localhost:11000/oozie -start 1420131025454545-oozie-myjob
# Check the status of a job
$ oozie job -oozie http://localhost:11000/oozie -info 1420131025454545-oozie-myjob
# Stop workflows
$ oozie job -oozie http://localhost:11000/oozie -suspend
14-20131025454545-oozie-myjob
334
# Resume suspended job

$ oozie job -oozie http://localhost:11000/oozie -resume 1420131025454545-oozie-myjob
#Kill an Oozie job
$ oozie job -oozie http://localhost:11000/oozie -kill 1420131025454545-oozie-myjob
# Perform dryrun of workflow job
$ oozie job -oozie http://localhost:11000/oozie -dryrun config job.properties
# Check status of Oozie system
$ oozie admin -oozie http://localhost:11000/oozie systemmode [NORMAL|NOWEBSERVICE|SAFEMODE]
# Validate workflow XML
$ oozie validate myJob/myworkflow.xml
# Submit a MapReduce job
$ oozie mapreduce -oozie http://localhost:11000/oozie config job.properties
335
Submit Jobs through HTTP

You can submit an Oozie job over HTTP by specifying the -oozie property:
$ oozie pig -oozie http://host:8080/oozie
-file myscript.pig -config job.properties
-PINPUT=/user/me/in -POUTPUT=/user/me/out
-X -Dmapred.job.queue.name=UserQueue
Behind the scenes, a workflow.xml file is generated dynamically that contains a single
action. The action will be script specified at the command line, and the job will be
created and executed right away.
336
Unit 14 Review
1. There are three types of Oozie jobs. They are _______________________ ,
___________________________and _______________________ jobs.
2. An Oozie __________________ provides a way to package multiple coordinator
and workflow jobs.
3. List three types of Oozie actions: ______________________________________
4. Set Oozie logging information in the ____________________________ file.
337
Lab 14.1: Running an Oozie Workflow

Objective: Deploy and run an Oozie workflow
Successful Outcome: You will run an Oozie job that executes a Pig script and a Hive
script.
Step 1: View the Raw Data

1.1. On node2, change directories to the oozielab folder:
# cd ~/labs/oozielab/
1.2. Unzip the archive in the oozielab folder, which contains a file named
whitehouse_visits.txt that is quite large:
# unzip whitehouse_visits.zip
1.3. View the contents of this file:

# tail whitehouse_visits.txt
This publicly available data contains records of visitors to the White House in
Washington, D.C.
Step 2: Load the Data into HDFS
2.1. Make a new directory in HDFS named whitehouse. (If you already have a
whitehouse folder in HDFS, delete it first):
# hadoop fs -rm -R whitehouse
# hadoop fs -mkdir whitehouse
338
2.2. Use the put command in the Grunt shell to copy the whitehouse_visits.txt
file the whitehouse folder in HDFS, renaming the file visits.txt. (Be sure to enter
this command on a single line):
# hadoop fs -put whitehouse_visits.txt
whitehouse/visits.txt
2.3. Use the ls command to verify the file was uploaded successfully:
Found 1 items
-rw-r--r-3 root root 183292235 whitehouse/visits.txt
Step 3: Configure Oozie User Permissions

3.1. In Ambari, go to the Services page and Stop HDFS service.
3.2. Go the HDFS page in Ambari, then scroll down and expand the Custom coresite.xml section.
3.3. The Oozie workflow you defined is going to be executed by the root user, so
root needs permission to communicate with the Oozie server. Add root to the
hadoop.proxyuser.oozie.groups property:
3.4. Click the Add Property... link and add two properties. Assign the
hadoop.proxyuser.root.hosts property to * and also the
hadoop.proxyuser.root.groups:
3.5. Click the Save button to save your changes to the HDFS config.
3.6. Start HDFS service.
Step 4: Deploy the Oozie Workflow
4.1. SSH into node2.
339
4.2. View the file workflow.xml in /root/labs/oozielab.

4.3. How many actions are in this workflow? _____________
4.4. Which action will execute first? _________________
4.5. If the first action is successful, which action will execute next? ____________
4.6. To deploy this workflow, we need a directory in HDFS:
# hadoop fs -mkdir congress
4.7. Put congress_visits.hive and whitehouse.pig from the oozielab folder into
the new congress folder in HDFS.
4.8. Also, put workflow.xml into the congress folder.
4.9. If you look at the Hive action in workflow.xml, you will notice that it
references a file named hive-site.xml within the <job-xml> tag. This file
represents the settings Oozie needs to connect to your Hive instance, and the file
needs to be deployed in HDFS (using a relative path to the workflow directory).
Put hive-site.xml into the congress directory:
# hadoop fs -put /etc/hive/conf/hive-site.xml congress
4.10. Verify you have four files now in your congress folder in HDFS:
# hadoop fs -ls congress
Found 4 items
congress/congress_visits.hive
429
3509 congress/hive-site.xml
580 congress/whitehouse.pig
1623 congress/workflow.xml
Step 5: Define the OOZIE_URL Environment Variable

5.1. Although not required, you can simplify oozie commands by defining the
OOZIE_URL environment variable. From the command line, enter the following
command:
# export OOZIE_URL=http://node2:11000/oozie
Step 6: Define the Job Properties

340
6.1. View the contents of job.properties in oozielab.

6.2. Notice the oozie.wf.application.path property points to the congress folder
in HDFS. This property is how you denote which Oozie job is going to execute.
6.3. Make sure the resourceManager and nameNode properties are defined
properly.
Step 7: Run the Workflow
7.1. From the oozielab folder, run the workflow with the following command:
# oozie job -config job.properties -run
If successful, the job ID should be displayed at the command prompt.

Step 8: Monitor the Workflow
8.1. Point your Web browser to the Oozie Web Console:
http://node2:11000/
You should see your Oozie job in the list of Workflow Jobs:
8.2. Double-click on the Job Id to view the Job Info page:
341
Notice you can view the status of each Action within the workflow.
Step 9: Verify the Results
9.1. Once the Oozie job is completed successfully, start the Hive Shell.
9.2. Run a select statement on congress_visits and verify the table is populated:
hive> select * from congress_visits;
...
WATERS MAXINE
12/8/2010 17:00 POTUS
OEOB
MEMBERS OF CONGRESS AND CONGRESSIONAL STAFF
WATT
MEL
12/8/2010 17:00 POTUS
OEOB
WEGNER DAVID L
12/8/2010 16:46 12/8/2010 17:00 POTUS
OEOB MEMBERS OF CONGRESS AND CONGRESSIONAL STAFF
WILLOUGHBY JEANNE
P
12/8/2010 17:07 12/8/2010 17:00
POTUS
OEOB MEMBERS OF CONGRESS AND CONGRESSIONAL
STAFF
WILSON ROLLIE
E
12/8/2010 16:49 12/8/2010 17:00
POTUS
OEOB MEMBERS OF CONGRESS AND CONGRESSIONAL
STAFF
YOUNG DON
12/8/2010 17:00 POTUS
OEOB
MCCONNELL
MITCH
12/14/2010 9:00 POTUS
WH
MEMBER OF CONGRESS MEETING WITH POTUS.
Time taken: 1.082 seconds, Fetched: 102 row(s)
342
RESULT: You have just executed an Oozie workflow that consists of a Pig script followed
by a Hive script.
ANSWERS:
Step 4.2: Two
Step 4.3: The Pig action named export_congress
Step 4.4: The Hive action named define_congress_table
343
Unit 15: Monitoring HDP2 Services

Topics covered:
Ambari
Monitoring Architecture
Monitoring HDP2 Clusters
Ambari Web Interface
Ambari Services - HDFS
Ganglia
Ganglia Monitoring a Hadoop Cluster
Nagios
Nagios - Ambari Interface
Nagios UI
Monitoring JVM Processes
Understanding JVM memory
Eclipse Memory Analyzer
JVM Memory Heap Dump
Java Management Extensions (JMX)
344
Ambari
The HDP install needs to get software from a YUM repository. A remote yum repository
can be used however; usually a local copy of the HDP repository is set up so your hosts
within the firewall can access it. Reference the Hortonworks documentation on
Deploying HDP In Production Data Centers with Firewalls for more information.
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP2.0.6.0/bk_reference/content/reference_chap4.html
Database Metastores are required for Ambari, Hive and Oozie. MySQL, Oracle or
PostgreSQL are recommended. Derby is the default.
HDP is certified and supported for running on virtual or cloud platforms (VMware
vSphere, Amazon Web Services and Rackspace).
The Hortonworks Sandbox (Pseudo distribution deployment model) has VMs for
VMware Fusion. Ambari is used to manage the Hadoop cluster running in the sandbox.
VirtualBox and Hyper-V. (www.hortonworks.com/sandbox)
345
Ambari was first released in HDP 1.2. Ambari 1.4.1 is released with HDP2 and contains
additional functionality:
Ability to deploy and manage the Hadoop 2.0 stack using Ambari.
Capability to enable NameNode HA.
Support for enabling Kerberos based security for Hadoop 2.0 services.
Support to work with SSL enabled Hadoop daemons.
Support to work with web authentication enabled for Hadoop daemons.
Added support for JDK 7.
346
Node1 Resource Manager

Node2 NameNode
Node3 Mgmt Node
Ganglia Server
(gmetad)
AmbariServer
Postgres
Nagios Server
RRDtool
gmond
AmbariAgent
IP Address #1
(Gateway)
gmond
AmbariAgent
IP Address #1
gmond
gmond
AmbariAgent
AmbariAgent
IP Address #3
IP Address #4
Ambari monitors Hadoop services including: HDFS, HBase, Pig, Hive, etc. A service can
have multiple components (i.e. HDFS, NameNode, StandbyNameNode, DataNode,).
The term node and host are used interchangeably.
The Ambari server has:
An Agent Interface for communicating with agents.
Agents installed on each host. It sends heartbeats to the Ambari server and
receives commands in response to heartbeats.
A database repository (usually Postgres or MySQL). The database maintains the

Ambari state in case of an Ambari server crash. Agents stay running during
Ambari server crash and recover when the Ambari server is restored.
Each host will have a Ganglia Monitor (gmond) running that collects information to
Ganglia Connector and then to Ambari Server.
Ambari Web sessions do not timeout. It is important to log out of the Ambari web
interface when you are done.
347
Monitoring HDP2 Clusters

A HDP2 system can run as standalone (in a single JVM), a pseudo distributed system (all
daemons on a single node) or as a distributed system. A distributed HDP2 system can
run on two nodes or 10,000+ nodes. A system like HDP2 that can scale needs a
monitoring infrastructure than can scale as the Hadoop cluster grows.
Ganglia provides the monitoring capability to generate real time graphs.
Nagios excels at monitoring and sending out alerts.
OpenStack is a cloud-computing project focused on providing Infrastructure as a Service
(IaaS). OpenStack is an open source cloud operating system. OpenStack provides the
capability to create and deliver cloud-computing services on standardized hardware.
Ambari is the Hadoop management interface with OpenStack. The OpenStack Savanna
project provides a way to deploy a Hadoop cluster on top of OpenStack.
348
Advantages of Ambari:
Open source management system for Hadoop.
Also selected as Hadoop Management interface for OpenStack.
While Ambari is not the first management system for Hadoop, Ambari is an
excellent example of the innovation and accelerated development open source
delivers. Ambari has grown significantly in HDP 1.2, 1.3 and HDP2.
Start the Ambari Server on the node where it has been configured.
# ambari-server start
Access Ambari from the Ambari Web interface.

http://<AmbariHOSTSERVER>:8080
349
Add Widget
Gear
1
2
4
Service Status
3
Widgets

The Ambari Web interface is made up of a number of different components:
1. Navigation Header: DashBoard | Heatmaps | Services | Hosts | Admin
2. The Dashboard View: Made up of two different versions that are both
customizable:
3. Dashboard View Widget Version: Shown above.
4. Dashboard View Service Status: Provides an overall view of status by color. A
colored rectangle will show number of alerts. You can then drill down into more
detail.
350
Solid Green = Up and running
Blinking Green = Starting up
Solid Red = Down
Blinking Red = Stopping
Widgets can be moved around screen (drag and drop). Hovering over a widget will
provide a summary. You can also:
Click on the X in right corner and delete widget.
Click on edit icon to modify an existing Widget.
Click on the gear icon (#5 in slide) and move to Classic Version. The gear allows
you to reset widgets to default and view metrics in Ganglia.
Zoom in for more detailed information.
Click on +Add (#6 in slide) icon to add a widget.
351
Ambari Web Interface (cont.)
Ambari Web Interface (cont.)

Continued explanation of the Ambari Web Interface monitoring tool:
1. Heatmaps: Provide a graphical representation of the overall health with color
indicators. Hovering over a heatmap will display a popup. Select the Metric
dropdown to change the metric type. The default maximums can also be
changed.
2. Services: Provides details on services running in the Hadoop cluster.
2a. Maintenance options: The management header containing
Maintenance | Start | Stop sections is an easy way to start and stop a
service as well as perform smoke tests.
2b. Services Summary: Clicking on summary tab will give an overall
perspective of a specific service.
2c. Services Config: Clicking on the Services Config tab allows updates to
configurations for a specific service. Some services have quick links that drill
into logs, JMX, different UIs.
352
3. Hosts: The Hosts view lets you dill down into a host to get detailed information
on services running on host. Actions are available to start, stop and
decommission. Hosts can be added with the +Add Hosts Wizard.
4. Admin: The Admin View supports user management and provides general
information.
User Management: Users can be added, dropped and assigned privileges.

There are two types of users: User and Admin. A User can view metrics. An
Admin has user privileges and can start and stop services, change
configurations, etc.
High Availability: NameNode HA can be set up. This option will start the
NameNode HA Wizard. The Wizard will walk you through defining the
Standby NameNode, and JournalNodes.
Enabling Kerberos Security: Enabling Kerberos Security will walk you

through creating principals and keytabs, etc.
Checking Stack and Component Versions: This screen allows you to see the
Hadoop software stack and the specific version installed.
Checking Service User Accounts and Groups: Display users and groups and
the services they own.
353
Ganglia
Designed for monitoring and collecting large quantities of metrics of
federations of clusters
Ganglia
Ganglia was developed at Berkeley and is a BSD-licensed open source project. Berkeley
is known as a center of grid and high-performance environments. Ganglia was designed
and developed in an environment where large computing environments were the norm.
Ganglia was assumed to be running in extremely scalable environments where minimal
overhead and performance were a fundamental requirement. Ganglia was designed
from the very beginning to scale to cloud-sized networks. Therefore, Ganglia is an ideal
tool for monitoring Hadoop clusters that can grow to 10,000+ nodes per cluster.
Ganglia ships with a large number of metrics that can be accessed with visual graphs.
Ganglia has a plug-in to receive Hadoop metrics and can provide aggregate statistics for
the cluster as a whole. Ganglia also provides real-time graphing capabilities.
NOTE: HDP2 uses Ganglia 3.5 and Gweb 3.5.7.
354
Ganglia Monitors
Drill down to get detail on Ganglia Server

Ganglia has three primary daemons: gmond, gmead, and gweb.
gmond: The Ganglia Monitoring Daemon runs on each host to be monitored.

Gmond collects run-time statistics. Gmond polls metrics according to its own
local configuration file. Gmond can collect metrics on compute resources such as
memory, CPU, storage networking. Gmond can also collect metrics on active
processes/daemons. Hadoop can publish metrics to Ganglia in formats that
Ganglia can understand.
Gmetad: The Ganglia Meta Daemon polls information from the gmond daemons
then collects and aggregates the statistics. RRDtool is a tool that stores metrics in
round robin databases.
Gweb: Ganglia Web is a PHP program that runs in an Apache web server that
provides visualization. The configuration file is conf.php.
355
The Ganglia configuration file (gmetad.conf) is organized into sections that are defined
in curly braces. Section names and attributes are case insensitive. There are two
categories:
Host and cluster configurations.
Metrics collection and scheduling.
Ganglia monitoring is setup in the Hadoop hadoop-metrics.properties file. Examples of

property definitions include metric time periods and ports.
Metrics (contexts) are broken into:
yarn: Containers, failures, and containers completed.
hdfs: Data block activity (reads, writes, replicates, verifications, removed).
hbase: Number of regions, memstore sizes, read and write requests, StoreFile
Index sizes, block cache hjt and miss ratios, and block cache memory available.
jvm: Memory, thread states, garbage collection, and logging events.
rpc: Number of operations, open connections, processing times, and open

connections.
356
Nagios
The Nagios primary configuration file (nagios.cfg) default location is the /etc/nagios
directory.
Key parameters:
Parameter
log file
Description
Contains the location of the nagios.log file
(/usr/local/nagios/var/nagios.log).
nagios_user
Nagios, the effective user that the Nagios

process runs under.
nagios_group
Nagios, the effective group that the Nagios

process runs under.
status_file
/usr/local/nagios/var/staus.dat is the
current status or downtime information.
temp_path
/tmp is the directory that Nagios uses as a

scratch work area.
357
temp_file
/usr/local/nagios/var/nagios.tmp is used
as a temporary file when updating status
information.
After making any configuration changes, Nagios needs to be restarted.

# service nagios restart
To define a host so it is available to Nagios, define the following values in

/etc/nagios/objects/hadoop-hosts.cfg:
define host {
use
host_name
alias
address
}
linux-server
xxxx.xxxx.xxxx
xxxx.xxxx.xxxx
xxx.xxx.xxx.xxx
NOTE: HDP2 uses Nagios 3.5.0. Nagios is installed as part of the Ambari
install.
358
Nagios UI
Nagios UI
Nagios can be accessed from the Ambari interface or from the server running Nagios.
Launch the Nagios UI on the server it is running via http:/localhost/nagios.
359
Monitoring JVM Processes

JMX is a Java specification for providing self-describing monitoring and management
capabilities within a Java application. Since all the Hadoop daemons are Java based, it
makes sense for Hadoop to use this capability as one of its mechanism for delivering
metrics data. All modern JVMs support JMX natively, so there are no additional
components to set up or install for Hadoop to produce metrics via JMX. JMX provides
metrics through components called MBeans. Each MBean can contain attributes and
operations. MBean attributes are used to expose metrics to the outside world.
There are many operational tools (For example, HP OpenView, etc.) out there which can
connect to JMX compliant applications, without having any specialized knowledge about
those applications. This is an example of the self-describing aspect of JMX. JMX metrics
will automatically be enabled when the GangliaSink is used for metrics (hadoopmetrics2.properties). You can also access JMX metrics via an HTTP interface with
http://<daemon-host-name>:<port>/jmx.
For example: http://namenode:50070/jmx.
360
Viewing JVM heap dumps is only used when a problem needs to be looked at in a very
detailed way. Whats nice about jmap is that is available if necessary.
If you are having problems running out of memory for the JVM you can cause a heap
dump to be generated automatically when an out of memory issue occurs. Set the XX:+HeapDumpOnOutOfMemoryError option to generate a heap dump when an out of
memory issue occurs.
361
Understanding JVM Memory

The memory for a JVM process is divided into the following areas.
Survivor 2
Survivor 1
Eden
New
Tenured
PermGen
Understanding JVM Memory

Hadoop Administrators need to have a fundamental understanding of JVM memory.
A JVM process is divided into three memory segments called generations. The
generations are divided into young, old, and permanent. The young generation is also
sometimes referred to as the new generation and the old generation is sometimes
referred to as the tenured generation.
During garbage collection a Java object will move from Young memory to Old memory.
The new generation memory area is divided into 3 sub-segments. These are called Eden,
Survivor Space I, and Survivor Space II.
Eden: Where an object is first created.
Survivor I: Object moves into Survivor I from Eden.
Survivor II: Object moves into Survivor II from Survivor I.
An object will move from the Survivor II memory area into the Old memory area.
362
When the Old memory area fills up, a major garbage collection will occur which can
impact performance. This can impact YARN which is running mappers and reducers in
Containers.
The Xms option determines the size of the Old and Young generation memory areas
together. This combined memory area is determined by the Xmx option.
Hints can be provided to the Young and Old memory areas. The exact memory size will
be determined by the JVM. Young generation size can be initialized with the
XX:NewSize argument
Old generation with XX:NewRatio. A value of 2 will make the Old generation memory
area twice as big as the Young generation.
363

The Eclipse Memory Analyzer (MAT) is a nice interface to use for viewing details on Java
heap dumps. The Eclipse MAT makes it easy to go out and view retained sizes of objects
and issues with garbage collection. A set of Eclipse plug-ins are used to access reference
objects on Java heap dumps. A Java heap dump is a point in time view of a Java object
graph.
[root@hdp2 ~]# jmap -F -dump:live,format=b,file=nnheap.bin
2719
Attaching to process ID 2719, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.45-b08
Dumping heap to nnheap.bin ...
Heap dump file created
364
JVM memory heap dumps can also be viewed with commands. Use jps to get process id
(2719) and jmap to display.
# jps -l
# jmap -histo:live 2719 | head
# jmap heap 2719
# jstat gcutil 2719 5000
Note process id 2719 is the NameNode.
# jps
3855 ResourceManager
3096 SecondaryNameNode
3973 NodeManager
2719 NameNode
2645 QuorumPeerMain
2952 DataNode
3292 RunJar
3332 RunJar
4080 JobHistoryServer
4558 AmbariServer
3505 RunJar
4903 RunJar
3238 Jps
3789 Bootstrap
365

It is pretty easy to create a JVM memory heap dump. The jmap command needs the
JVM heap dump file to be named and the process id of the JVM to be specified. You can
then click on the JVM heap dump to bring up Eclipse or open the file in Eclipse.
# $JAVA_HOME/bin/jstat -gcutil $(cat
/var/run/hadoop/yarn/yarn-yarn-resourcemanager.pid)
366
S0
S1
YGC
YGCT
FGC
FGCT
GCT
0.00
100.0
40.85
11.19
99.35
0.044
0.000
0.044
Field definitions in the jstat gcutil command output.

Column
Description
S0
Survivor I Memory Heap.
S1
Survivor II Memory Heap.
Eden & Young Memory Heap.
Old Memory Heap.
Permanent Memory Heap.
YGC
The number of Young (Eden) space collections.
YGCT
The total time taken for the Young (Eden) collections.
FCG
The number of Old space collections.
FGCT
The total time taken for the Old collections.
GCT
The total time taken for garbage collection consumed so far.
367
Java Management Extensions (JMX)

DataNodes metrics provide:
Number of bytes written.
Number of blocks replicated.
Number of read requests from clients.
368
Unit 15 Review
1. The Dashboard View supports two different types of views, they are the
________________ and _________________ views.
2. The Ganglia primary daemons are _____________ , _____________ and
_______________.
3. The main Nagios configuration file is ______________________.
4. Use this Java JDK tool to create a JVM heap dump: ______________________
5. Use this Java tool to access JVM metrics: ___________________________
369
Unit 16: Commissioning and

Decommissioning Nodes
Topics covered:
Decommissioning and Commissioning
Decommissioning Worker Nodes
Steps for Decommissioning a Worker Node
Decommissioning Node States
Steps for Commissioning a DataNode
Balancer
Balancer Threshold Setting
Running Balancer
Configuring Balancer Bandwidth
Lab 16.1: Commissioning DataNodes
370
Architectural Review
Decommissioning/Commissioning nodes need to take the above into consideration.
Daemons and Processes running on a slave server can include additional frameworks.
Usually a slave server in the cluster runs both a DataNode and NodeManager daemon.
If running HBase, the slave server will also run a HBase Region Server. Additional
frameworks such as Accumulo, Storm, etc. will have their own client processes.
The ResourceTrackerServer is responsible for registering new nodes,
decommissioning/commissioning nodes.
NMLiveliness Monitor monitors live and dead nodes.
The NodesListManager manages the collection of valid and excluded nodes. The
NodesListManager reads the following local host configuration files. Lines that begin
with # are comments.
dfs.hosts: Names a file that contains a list of hosts that are permitted to connect
to the NameNode.
dfs.hosts.exclude: Names a file that contains a list of hosts that are not
permitted to connect to the NameNode.
371
yarn.resourcemanager.nodes.include-path: Points to a file that has a list of

nodes accepted by ResourceManager.
yarn.resourcemanager.nodes.exclude-path: Points to a file with a list of nodes

that are not to be accepted by ResourceManager.
Run the refreshNodes option for the ResourceManager daemon to recognized the
changes:
# yarn rmadmin -refreshNodes
372
Decommissioning and Commissioning Nodes

Hadoop clusters grow horizontally. Administrators periodically need to be able to add,
decommission and recommission DataNodes (slave servers). Adding a node to a cluster:
Adds more processing capabilities because the cluster can run more Containers.
Add more IOPS and storage capability.
Make a Hadoop cluster more resilient to failure and heavy workloads.
373
Decommissioning Nodes
Although HDFS is designed to tolerate DataNode failures, this does not mean you can
just terminate DataNodes en masse with no ill effect. With a replication level of three
for example, the chances are very high that you will lose data by simultaneously
shutting down three DataNodes if they are on different racks. The way to decommission
DataNodes is to inform the NameNode infrastructure of the DataNode(s) to be taken
out of circulation, so that it can replicate the blocks to the rest of HDFS before taking the
node down.
With NodeManagers and Containers, Hadoop is more forgiving. If you shut down a
NodeManager that is running tasks, the ResourceManager will notice the failure and
reschedule the tasks on other nodes in the Cluster.
The decommissioning process is controlled by an exclude file. The exclude file lists the
nodes that are not permitted to connect to the cluster (master daemons).
374
The rules for whether a NodeManager may connect to the ResourceManager are
simple: a NodeManager may connect only if it appears in the include file and does not
appear in the exclude file. An unspecified or empty include file is taken to mean that all
nodes are in the include file.
For HDFS, the rules are slightly different. If a node appears in both the include and
exclude file, then it may connect, but only to be decommissioned.
375
Steps for Decommissioning a Node

1.
Add the Node to the the NameNode(s) dfs.exclude file
2.
Add the Node to the ResourceManagers exclude file
3.
Have the NameNode(s) & ResourceManager re-read the exclude file:

$ /usr/lib/hadoop-hdfs/sbin/distribute-exclude.sh
<exclude_file>
$ /usr/lib/hadoop-hdfs/sbin/refresh-namenodes.sh
$ yarn rmadmin refreshNodes
4.
Check the Cluster Web Console NameNode Web UI for Node status
5.
Wait for the Node to be listed as decommissioned
6.
Check the Resource Manager Web UI for the Node status
Steps for Decommissioning a DataNode

Decommissioning follows similar processes in HDP2 as in HDP1. A node is
decommissioned by being placed in the exclude file for the NameNodes and the
Resource Manager.
Optionally, Remove the Node from the NameNode(s) dfs.include file(s), if permanently
removing the node from the cluster. (Also remove this Node from the ResourceManager
include file). Remove the node information from the slave file as well.
The distribute-exclude.sh script will copy the exclude file to all the NameNodes in the
HDFS cluster. The refresh-namenodes.sh script will refresh all the NameNodes to read
the new exclude file.
The individual NameNode(s) Web UI can also be accessed for Node status. View the
Cluster status to verify the DataNode has been decommissioned.
Check the new Node using the Cluster Web Console from any NameNode.
http://<any_nn_host:port>/dfsclusterhealth.jsp.
Check that the new Node appears in the ResourceManager Web UI
376
http://<ResourceManager node>:8088
The NameNode Web UI can also be accessed for the NameNodes.

http://<NameNode node>:8020
377
Decommissioning Node States
Dead Nodes: The NameNode will declare a DataNode dead when a heartbeat is
not received for a period of time. The default is 10 minutes.
Time period is set with: heartbeat.recheck.interval
Node remains as dead until it is removed from the dfs.include list, AND dfsadmin
command is run to refresh the nodes (dfsadmin - refreshNodes).
378
Steps for Commissioning a Node

1.
Add the new Node to the dfs.include file on the NameNode(s)
2.
Add the new Node to the include file on the ResourceManager
3.
Have the NameNode(s) & ResourceManager re-read the include file

with new Node in it:
$ /usr/lib/hadoop-hdfs/sbin/refresh-namenodes.sh
$ yarn rmadmin refreshNodes
4.
Start the DataNode and NodeManager daemons
Steps for Commissioning a Node

Verify the Hadoop software on the Node to be commissioned matches the other nodes.
Make sure the DataNode process and the NodeManager daemons point to their
NameNodes and ResourceManager respectively.
yarn.resourcemanager.nodes.include-path Points to a file that has a list of DataNodes
accepted by ResourceManager.
Start the DataNode daemon.
su -l hdfs -c "/usr/lib/hadoop/sbin/hadoop-daemon.sh"
Start the ResourceManager daemon.

su -l yarn -c "/usr/lib/hadoop-yarn/sbin/yarn-daemon.sh"
Verify the new Worker Node (slave server is running properly).

Check the new Node using the Cluster Web Console from any NameNode.
http://<any_nn_host:port>/dfsclusterhealth.jsp.
379
Check that the new Node appears in the ResourceManager Web UI.
http://<ResourceManager node>:8088
The NameNode Web UI can also be accessed for the NameNodes.

http://<NameNode node>:8020
Run balancer if you want existing blocks to be written to the new DataNode. This
ensures the HDFS cluster is able to leverage the processing and IOPS of the new
DataNode.
The Balancer needs to work with the Namenodes in the cluster to balance the cluster.
Example:
"$HADOOP_PREFIX"/bin/hadoop-daemon.sh --script "$bin"/hdfs
start balancer [-policy <policy>]
380
Balancer
381
Balancer Threshold Setting
382
Running Balancer
Balancer can be run periodically as a batch job
Every 24 hours or weekly for example
Balancer should be run after new nodes have been added to the cluster
Running the balancer is also useful if a client loads files from and to a
computer that is a DataNode
One replica of the blocks will be placed on the local DataNode
To run the balancer:

hdfs balancer [-threshold <threshold>] [-policy <policy>]]
Balancer runs until there are no blocks to move or until it has lost
contact with the NameNode
Can be stopped with a Ctrl+C
Running Balancer
383

Should be run in a way which will limit impact on the cluster
You can limit the amount of bandwidth Balancer can utilize for
replicating blocks
The bandwidth limit is set in bytes/sec
Default value is 1048576 (1 MB/sec)
To set the bandwidth utilization allowed: (hdfs-site.xml)

dfs.datanode.balance.bandwidthPerSec
384
Unit 16 Review
1. Which property points to the file that contains the list of hosts allowed to
connect to the NameNode? _________________________________
2. Which property points to the file that contains the list of hosts not allowed to
connect to the NameNode? _________________________________
3. The ResourceManager also has include and exclude files. Which two properties
define where these two files are located? ____________________________
_______________________________________________________________
4. The rmadmin option is to __________________________________________.
385
Lab 16.1: Commissioning & Decommissioning

DataNodes
Objective: Commission a new DataNode to an existing cluster, and also

decommission a node.
Successful Outcome: node4 is successfully added to your cluster as a DataNode,
and node1 is successfully decommissioned from the cluster.
Before You Begin: Open the Ambari UI.
Step 1: Commission a New DataNode.

1.1. Go to the Hosts page of Ambari.
1.2. Click on the host for node4.
1.3. Click on the Add Component dropdown menu and select DataNode:
1.4. Click Yes when the confirmation dialog appears.
386
1.5. Wait for the DataNode component to be installed. When the install is
complete, DataNode should appear in the list of Components on node4:
Step 2: Copy a significant big directory to HDFS

2.1. Copy /root/repo directory to HDFS.
Step 3: Start the DataNode

3.1. Click on the Action menu next to DataNode and choose Start.
3.2. Wait for the DataNode service to start.
Step 4: Restart Nagios (Optional, if there are only 3 DataNodes under HDFS service)
Step 5: Verify the Commissioned Node
5.1. Once the DataNode is started successfully on node4, go back to the Ambari
Dashboard. You should see 4 live DataNodes:
387
Step 6: Run the Balancer

6.1. Go to the NameNode UI. Notice it also shows 4 live DataNodes.
6.2. Click on Live Nodes link.
6.3. Notice that the new DataNode does not have any data blocks currently.
6.4. On node1, run following command to run the balancer process on your
cluster:
# su -l hdfs -c "hdfs balancer -threshold 1"
6.5. Wait a couple minutes for the balancer to even out the block storage. You will
see at the command prompt as blocks get moved from one node to another:
INFO balancer.Balancer: 0 over-utilized: []
INFO balancer.Balancer: 1 underutilized:
[BalancerDatanode[10.222.133.205:50010,
utilization=0.40288804420577484]]
INFO balancer.Balancer: Need to move 173.74 MB to make the
cluster balanced.
INFO balancer.Balancer: Decided to move 89.95 MB bytes from
10.170.202.246:50010 to 10.222.133.205:50010
INFO balancer.Balancer: Decided to move 151.79 MB bytes
from 10.174.49.252:50010 to 10.222.133.205:50010
INFO balancer.Balancer: Will move 241.74 MB in this
iterationINFO balancer.Balancer: Moving block 1073742573
from 10.174.49.252:50010 to 10.222.133.205:50010 through
10.174.50.60:50010 is succeeded.
INFO balancer.Balancer: Moving block 1073742572 from
10.174.49.252:50010 to 10.222.133.205:50010 through
10.174.50.60:50010 is succeeded.
...
6.6. Refresh the Live Nodes page of the NameNode UI. Your node4 DataNode
should now have blocks on it, and the number of blocks will gradually increase as
the balancer app continue to even out the block storage on your cluster.
NOTE: The balancer app will run for a long time. Just leave the process open
in your terminal window. If you need to perform any future tasks on node1,
just open a new terminal window.
388
Step 7: Decommission a DataNode

7.1. From the Hosts page of Ambari, click on the host for node1.
7.2. In the Components section, click on the Action menu next to DataNode and
select Decommission:
7.3. Click OK in the confirmation dialog, and wait for the decommissioning task to
complete.
NOTE: There is a minimal chance that the decommissioning task may fail
due to a known bug in Hadoop 2.0 where the node contains a block that
belongs to a file with a replication factor larger than the rest of the cluster
size. The work-around is to locate and delete any files that have a replication
factor larger than 3. View https://issues.apache.org/jira/browse/HDFS-5662
for more details.
Step 8: View the NameNode UI

8.1. Go to the Namenode UI. Notice that Decommissioning Nodes is 1 and Live
Nodes is still 4:
389
8.2. Click on Decommissioning Nodes and it will show that node1 is undergoing
the decommission process.
8.3. Go to the Live Nodes page of the NameNode UI. You will see that blocks are
gradually being copied from node1 to the other nodes. The Admin State of node1
is going to be either Decommission in Progress or Decommissioned. Refresh the
page until the status is Decommissioned.
8.4. Go back to the NameNode UI page. Notice you have 4 Live Nodes, and 1 of
them is Decommissioned:
Step 9: Stop the DataNode

9.1. Once the decommissioning process is complete, go back to the Hosts page for
node1 in Ambari.
9.2. In the Action menu next to DataNode, select Stop:
9.3. It will take several minutes to stop the DataNode process on node1.
9.4. From the Ambari Dashboard page, you should see 3/4 live DataNodes:
390
RESULT: You have now seen how to commission a new DataNode, and also how to run
the balancer tool to balance the blocks across a cluster once new DataNodes are
commissioned. You also have decommissioned one of the DataNodes from your cluster.
391
Unit 17: Backup and Recovery

Topics covered:
392
What should you backup?
HDFS Snapshots
HDFS Data Backups
HDFS Data Automation & Restore
Hive & Ambari
Lab 17.1: Using HDFS Snapshots
What should you backup?

HDFS data and configurations in your Hadoop cluster are the most important files to
backup. Here well discuss backing up HDFS data, Hive, and Ambari. It is also important
to backup cluster related configuration files on the OS.
393
HDFS Snapshots
Create HDFS directory snapshots

Fast operation - only metadata affected
Results in .snapshot/ directory in the HDFS directory
Snapshots are named or default to timestamp
Directories must be made snapshottable
Snapshot Steps:
Allow snapshot on directory
hdfs dfsadmin -allowSnapshot foo/bar/
Create snapshot for directory and optionally provide snapshot name
hdfs dfs -createSnapshot foo/bar/ mysnapshot_today
Verify snapshot
hadoop fs -l foo/bar/.snapshot
Snapshot diff
# hdfs snapshotDiff foo/bar firstsnapshot secondsnapshot
Difference between snapshot firstsnapshot and snapshot secondsnapshot under
directory /user/root/foo/bar:
M
.
+
./anotherfile.txt
HDFS Snapshots
Another major highlighted feature of Hadoop 2 is HDFS snapshots. Taking a snapshot is
fast. As long as snapshotting is enabled on a particular directory, users with write
permissions to that directory can create as many snapshots as needed and removed as
needed.
394
HDFS Data - Backups

Backup
Condition(s)
Fulfilled
hdfs dfs -createSnapshot /foo/ snapshot-name

hadoop distcp update prbugp m 16 \
hdfs://original-host/foo/.snapshot/snapshot-name \
hdfs://target-host/foo/
Perform
HDFS
Snapshot
distcp new
Snapshot to
Backup
Cluster
hdfs dfs -createSnapshot /foo/ snapshot-name
Snapshot
new data on
Backup
Cluster
Enterprise
Retention
Policy
Cleanup
On Success
Action
On Failure
Action
Can be orchestrated by Oozie!
HDFS Data - Backups

A typical strategy for backing up HDFS data to a remote Hadoop cluster is highlighted
above. Oozie is the perfect tool to automate the entire process. Remember, with Oozie
you can execute distcp actions and run shell scripts. For example, you can write a
wrapper script to implementing your enterprises data retention policy, i.e. remove
data older than 7 years nightly.
395
HDFS Data Automate & Restore

Whereas Oozie can be used to automate the backup process, often times restoring data
is a manual process. If you chose to perform offsite backups, you can choose an offsite
snapshot, or use your production clusters own snapshots to restore.
Restoring a backup is fairly simple:
1. On your backup cluster, choose which snapshot to restore. We will move it to
the target/production cluster last.
2. Now clean up your production cluster by removing or moving the production
directory using either the -rm or -mv commands.
a. The -rm option with -skipTrash will free up space in case your dataset is
very large.
3. Move your pristine copy of the data to the target location via distcp and without
the -update options. And if your pristine copy is already on the target cluster,
simply run a -mv command.
396
Hive & Ambari Backup

Backup Hive Metastore
Weve already seen how to backup HDFS data. Hive data is in HDFS. We also need to be
to backup the Hive Metastore, the database that contains Hives table definitions.
Since the Metastore is simply a relational database (MySQL, PostgreSQL, or Oracle), just
simply backup the Metastore database.
The Metastore database name by default is hive.
MySQL Backup
mysqldump hive > hive_backup.sql
PostgreSQL Backup
pgadmin hive > hive_backup.sql
397
Oracle Backup
[oracle]$ expdp hive/password schemas=hive directory=backups
dumpfile=hive_backup.dmp
Backup Hive data on same cluster

Often times, its sufficient to backup any HDFS data simply by making a physical copy on
the same HDFS cluster. Here is an example of backing up a Hive table within the same
HDFS cluster:
hadoop fs -cp /apps/hive/warehouse/mytable /backups/hive/mytable
Backup Hive data to different cluster

Using the distcp command we saw earlier, you can also backup HDFS data to different
cluster:
distcp fs -cp hdfs://sourcenode/apps/hive/warehouse/mytable \
hdfs://destnode/backups/hive/mytable
Backup Ambari
Take a backup of the following Ambari cluster configurations:
1. /etc/ambari-server
2. The ambari database in PostgreSQL
398
Lab 17.1: Using HDFS Snapshots
Objective: Understand how snapshots work in Hadoop.

Successful Outcome: The data folder has a snapshot taken of it.
Step 1: Load a File into HDFS

1.1. Start by creating a new directory in HDFS:
# hadoop fs -mkdir data
1.2. Put a file in the new directory:

# hadoop fs -put ~/labs/constitution.txt data/
1.3. Run the fsck command on the file:

# hdfs fsck /user/root/data/constitution.txt -files -blocks
-locations
Select-and-copy the block ID.

1.4. Use the find command to identify the location of the block on your local file
system. The command will look like the following, and youneed to run it on
node2 or node3:
# find / -name "blk_ 1073743186"
1.5. Which node and folder is the block stored in? ________________________
Step 2: Enable Snapshots
2.1. Now lets enable the /user/root/data directory for taking snapshots:
399
# su -l hdfs -c "hdfs dfsadmin -allowSnapshot

/user/root/data"
You should see the following confirmation:

Allowing snaphot on /user/root/data succeeded
Step 3: Create a Snapshot

3.1. Now create a snapshot of /user/root/data:
# hdfs dfs -createSnapshot /user/root/data ss01
You should see the following confirmation:

Created snapshot /user/root/data/.snapshot/ss01
3.2. Verify the snapshot was created by viewing the contents of the
data/.snapshot folder:
# hadoop fs -ls -R data/.snapshot
drwxr-xr-x
- root hadoop
0 data/.snapshot/ss01
-rw-r--r-3 root hadoop
44841 data/.snapshot/
ss01/constitution.txt
3.3. Try to delete the data folder:

# hadoop fs -rm -R data
You cannot delete the folder. Why not? _________________________________

Step 4: Delete the File
4.1. Delete the constitution.txt file in data:
# hadoop fs -rm data/constitution.txt
4.2. Use the ls command to verify the file is no longer in the data folder in HDFS.
4.3. Check whether the file still exists in /user/root/data/.snapshot/ss01. It
should still be there.
400
4.4. Run the same find command again that you ran in the earlier step. Does the
block file still exist on your local file system? _____________________________
Step 5: Recover the File
5.1. Lets copy this file from data/.snapshot/ss01 to the data directory.
# hadoop fs -cp data/.snapshot/ss01/constitution.txt data/
5.2. Run the fsck command again on data/constitution.txt. Notice that the block
and location information have changed for this file.
5.3. Run the find command for the new blocks. Notice the blocks for the
constitution.txt file appear in two locations on your local file system (before
deleting the file and after copying the file).
RESULT: This lab demonstrates how the snapshot process locks down the blocks from
deleting and editing, and the blocks are always available in case you need to recover
your file in future.
Answers:
Step 1.5: In a subfolder of: /hadoop/hdfs/data/current/
Step 3.3: Once snapshot is enabled for a directory, it can not be deleted until we delete
the snapshot itself.
Step 4.4: Yes
401
Unit 18: Rack Awareness and

Topology
Topics covered:
Rack Awareness
YARN Rack Awareness
HDFS Replication
Rack Topology
Rack Topology Script
Configuring the Rack Topology Script
Lab 18.1: Configuring Rack Awareness
402
Rack Awareness
Rack awareness spreads block replicas across different racks to make sure if a rack
becomes unavailable (power failure, switch failure, etc.) all replicas for a block are not
lost. Rack awareness makes sure that all operations that involve rack placement
understand to spread the blocks across multiple racks. The NameNode makes the
decision where blocks are placed. Examples of block operations that are rack aware
include:
Inserts
Hadoop balancer
Decommissioning a datanode
For rack awareness, each data node is assigned to a rack. Each rack will have a unique
rack id. Rack ids are hierarchical and appear as path names.
If rack awareness is not configured, the entire Hadoop cluster is treated as if it were a
single rack. Every DataNode will have a rack id of /default-rack. With the default
behavior, data is loaded on a DataNode and the then two other data nodes are selected
at random to make sure replicas are spread across multiple data nodes.
403
YARN Rack Awareness

YARN is rack aware similar to HDFS. The NameNode and ResourceManager get rack
information on the DataNodes. An API resolve in an administrator module resolves the
DNS name, IP address and rack id.
404
HDFS Replication
First replica is placed on the same rack as the client, if possible. If that
is not possible, it will be placed randomly.
Second replica is placed on a DataNode on another rack
Third replica is on another DataNode on the second rack
Rack 1
Rack 2
DataNode
DataNode
DataNode
Data and
checksum
Data and
checksum
Ack
Ack
Verify
Checksum
Replica Placement
Rack awareness places different priorities on each replica. The assumption is traffic
within a rack is faster than across racks.
The first replica is put on the DataNode that is closest to the Hadoop client. This is the
rack the client is running on.
The second replica is placed on a different rack for high availability. This makes sure
that if a rack fails a replica of a block still exists.
The third replica is placed on the same rack as the second rack. Once the second replica
is on a different rack, high availability has been taken care of. The goal is to get the third
replica on another DataNode of the second rack.
405
Rack Topology
Aggregation Switches
2xToR Switches
Staging Node
NameNode
HBaseMaster
Oozie Server
DataNodes
2xToR Switches
Staging Node
Standby NameNode
Secondary NameNode
DataNodes
KVM Switch
Resource Manager
Management Node
Ambari Server
Ganglia/Nagios
WebHCat Server
JobHistoryServer
Hive2 Server
KVM Switch
Rack Topology
Rack topologies need to make sure there are no single points of failure.
There are a number of different ways to deploy rack topologies for Hadoop. The
TopofRack (ToR) architecture is popular with their short cable runs and easy replication
of rack configurations. As companies build out data centers, they deploy rack servers as
the core building block, with ToR switches and cabling within the rack. Pod-based
(containerized) modular designs are becoming very popular. A pod is a preconfigured
system with compute, network and storage resources. Pod architectures strength is
integration and standardization.
TopofRack does not mean switches are at the top of the rack. The top of the rack is
more popular because of ease of access and cabling but switches can be anywhere in
the rack.
This example uses the leaf-spine topology. Each TOR switch is a leaf and each
aggregation switch is a spine. Scalability can be increased by designing a dual-tier
aggregation layer. TOR switches in a rack can be connected to aggregation switches that
can provide interconnection to the rest of the data center.
Each rack should have two Top Of Rack (TOR) Ethernet switches that are bonded. Two
switches are used for scalability and availability.
406
Rack topologies have a number of advantages:
Modular design, making it easy to upgrade and add racks.
Support switching from 10GB cards to 40GB or higher.
Copper cabling is all in rack.
407
Rack Topology Script
408

The rack topology script is set in the core-site.xml file
The topology.script.file.name sets the rack topology script
The topology script can be a shell, python, Java, etc. script
Example:
#!/bin/bash
ipaddr=$1
rack=`echo $ipaddr | cut -f1-3 -d '.' `
if [ -z "$rack" ] ; then
echo -n "/default-rack"
else
echo -n "/$rack"
fi
rack-topology.sh

Setting the rack topology script:
<property>
<name>topology.script.file.name</name>
<value>/home/hadoop/scripts/rack-topology.sh</value>
</property>
409
Unit 18 Review
1. Each rack has a _____________________ path name.
2. The priority of the second replica for rack aware is _______________________.
3. Rack topology is configured in the __________________________________ file.
410
Lab 18.1: Configuring Rack Awareness
Objective: Configure a Hadoop cluster to be rack-aware.

Successful Outcome: Each node in your cluster will be assigned to a rack.
Step 1: View the Current Rack Awareness

1.1. Check out how many racks are being recognized by the cluster by running the
fsck command:
$ su -l hdfs -c "hdfs fsck -racks"
Notice you only have one rack in your cluster:
1.2. Switch back to the root user.

Step 2: View the rack-topology Script
411
2.1. On node1, change directories to ~/labs.

2.2. View the contents of rack-topology.sh.sample, a sample rack topology script
provided for you:
#!/bin/bash
ipaddr=$1
rack=`echo $ipaddr | cut -f1-3 -d '.' `
if [ -z "$rack" ] ; then
echo -n "/default-rack"
else
echo -n "/$rack"
fi
Notice this script calculates the rack name using the IP address of the node. The
first three parts of the IP address become its rack name. For example: if
192.168.1.100 is the IP address, then the rack name would be /192.168.1.
Step 3: Configure the Rack Script
3.1. Copy the script to directory /etc/hadoop/conf as rack-topology.sh:
# cp rack-topology.sh.sample
/etc/hadoop/conf/rack-topology.sh
3.2. Stop HDFS. Edit core-site.xml and add the following properties:
topology.script.file.name=/etc/hadoop/conf/rack-topology.sh
topology.script.number.arg=1
3.3. Start HDFS, so your changes to core-site.xml take effect:

Step 4: Verify the Rack Awareness
4.1. Run the fsck command once again and this time you will see 4 racks:
# su -l hdfs -c "hdfs fsck -racks"
4.2. You can also view current topology by using following command:
412
# su -l hdfs -c "hdfs dfsadmin -printTopology"

Rack: /172.17.0.2
172.17.0.2:50010 (node1)
Rack: /172.17.0.3
172.17.0.3:50010 (node2)
Rack: /172.17.0.4
172.17.0.4:50010 (node3)
Rack: /172.17.0.5
172.17.0.5:50010 (node4)
4.3. Run the fsck command. You should see 4 racks now:
RESULT: The nodes in your cluster are now each assigned to a rack, and the rack
assignment takes place automatically using the rack-topology.sh script. You can write
your own custom script for automatically determining the appropriate rack names for
your cluster nodes.
413
Unit 19: NameNode HA

Topics covered:
NameNode Architecture HDP1
NameNode High Availability
HDFS HA Components
Understanding NameNode HA
NameNodes in HA
Failover Modes
NameNode Architectures
hdfs haadmin Command
Red Hat HA
VMware HA
Lab 19.1: Configure NameNode High Availability using Ambari
414

Single
Point Of
Failure
NameNode
Namespace
fsimage
Block Management
Edits.log
Heartbeats
DataNode 1
DataNode 2
BLK1
BLK6
BKL4
BLK2
BK5
BK5
DataNode 3
BLK6
BLK1
BLK6
BLK2
BKL4
DataNode n
BLK1
BK5
BLK2
BKL4
HDFS Distributed Storage

The NameNode in HDP1 is a Single Point of Failure (SPOC). Two supported options for
HDP 1 are:
Red Hat HA for physical servers.
VMware HA for virtual servers running vSphere.
Red Hat and VMware HA are solutions that work well but there are reasons customers
want an HA solution built into HDP.
Both Red Hat and VMware HA:
Are crash recovery solutions.
Are 3rd party solutions that must be purchased.
Require expertise outside of Hadoop.
415
NameNode High Availability

The HDP HA architecture maintains high availability for the NameNode master daemon
service. With HA, if the NameNode master daemon fails, the standby NameNode takes
over as the active NameNode.
When any dependent service (YARN, HBase, ) detects a failure, it will pause, retry and
recover when the HA stack resolves the issue. For example, when the Standby
NameNode takes over on a failure the applications will recover and finish to completion.
Dependent services (like ResourceManager) automatically detect the failure or fail over
of the co-dependent component (NameNode) and these dependent services pause,
retry, and recover the failed service. For example, the ResourceManager does not
launch new jobs or kill jobs that have been waiting for the NameNode.
416
HDFS HA Components
Hadoop HA clusters use nameservice IDs to identify an HDFS instance that may be made
up of multiple NameNodes. A NameNode ID is added. Each NameNode has a unique ID
in a HA cluster to make sure it is uniquely identified.
DataNodes send block map reports and heartbeats to the Primary and Standby
NameNodes to maintain consistency.
A ZooKeeper Quorum is used to coordinate data, perform update notifications and
monitor for failure. Each NameNode maintains a persistent session in ZooKeeper.
The ZooKeeper Quorum will:
Notify the Standby NameNode to become active during a failover.
Perform an Active NameNode election, giving the Standby NameNode an

exclusive lock so it can become the Active NameNode.
417
The ZKFailoverController (ZKFC) monitors and manages the state of the NameNode. The
Active and Standby NameNode will each run a ZKFC.
The ZKFC:
Monitors the health of the NameNode it is monitoring and manages its state of
being healthy or unhealthy.
Manages the NameNode session. A lock is maintained on the Active NameNode.
Performs elections to determine which NameNode should be active and if

necessary will perform a failover.
The Journal Nodes (JNs) make sure that a split-brain scenario (both NN writing at same
time) does not occur. The JNs make sure that only one NameNode can be a writer at a
time.
The Active NameNode will write records to the shared edit.log file. The Standby Node
will read the edits.log file and apply changes to itself. The Standby NameNode will read
all edits before becoming active during a failover.
Currently there can only be one shared directory. The storage needs to support
redundancy to protect the metadata.
The Standby NameNode performs checkpoints. If upgrading from HDP1 to HDP2, the
previous Secondary NameNode can be replaced with the Standby NameNode.
An experimental shared storage solution is BookKeeper. BookKeeper can replicate edit
log entries across multiple storage nodes. The edit log can be striped across the storage
nodes for high performance. Fencing is supported in the protocol. The metadata for
BookKeeper is stored in ZooKeeper. In current HA architecture, a ZooKeeper cluster is
required for ZKFC. The same cluster can be for BookKeeper metadata. Refer to the
Apache BookKeeper project documentation for more information.
http://zookeeper.apache.org/bookkeeper/
418
Fo r a
utom
failov atic
er
NameNode
SNameNode
ZKFC
Active
ZKFC
Standby
ZK
ZK
ZK
Namespace
Writes
JN
JN
Namespace
Reads
JN
Block Management
Block Management
Hadoop HA Cluster
Heartbeats
Heartbeats
DataNode 1
DataNode 2
BLK1
BLK6
BKL4
BLK2
BK5
BK5
DataNode 3
BLK6
BLK1
BLK6
BLK2
BKL4
DataNode n
BLK1
BK5
BLK2
BKL4
NameNode High Availability (HA) has no external dependency.
NameNode HA has an active NameNode and standby NameNode running in an activepassive relationship. If the active NameNode goes down the Passive NameNode
becomes the Active NameNode. If the failed NameNode restarts it will become the
passive NameNode. The ZooKeeper FailOverController (ZKFC) maintains a lock on the
active NameNode for a namespace.
On each platform running a NameNode service there will be an associated ZKFC. The
ZKFC communicates with:
The NameNode service it is associated with. ZKFC monitors the health and
manages the HA state of the NameNode.
The ZooKeeper Service.
The FailoverController (FC) monitors the health of the NameNode, Operating System
(OS) and Hardware (HW). There is an active and standby FailoverController.
Heartbeats occur between the Failover Controllers (active and passive) and the
zookeeper servers.
419
Recommendations:
Have three to five ZooKeeper daemons.
It is okay to run Zookeeper daemons on the NameNode platforms (active and

standby).
It is okay to run a Zookeeper daemon on the ResourceManager platform.
It is recommended to keep the HDFS metadata and the ZooKeeper data on

separate disks and controllers.
420
NameNodes in HA
Start the services in the following order:
1. JournalNodes
2. NameNodes
3. DataNodes
Always start the NameNode then its corresponding ZKFC.
The Active NameNode is determined by which NameNode starts first. If one NameNode
is the preferred Active NameNode then always start if first.
The hdfs haadmin command is used to perform a manual failover.
421
There are two ways of sharing edit logs with NameNode HA:
Quorum based-storage (best practice)
Shared storage using NFS
The active NameNode writes the edits in the edits.log. The Standby NameNode will
read and apply edits to maintain a consistent state. The current state is maintained with
a quorum of Journal Nodes.
Commands and scripts used to manage HA:
Format and initialize the state of the Zookeeper
$ hdfs zkfc formatZK
Start-dfs.sh will start the ZKFC daemon when automatic failover is setup.
To manually start a zkfc process:
$ hadoop-daemon.sh start zkfc
422
Failover Modes
The ZooKeeper FailOverController processes monitors the health of the NameNodes for
a NameSpace. The FailOverControl will facilitate the failover process and perform a
fencing operation to make sure a split-brain scenario cannot occur.
A split-brain scenario occurs if both NameNodes think they are both the active
NameNode. The fencing operation makes sure a NameNode gets fenced off so it cannot
be active. This protects the NameNode metadata to make sure it does not become
corrupt by two NameNodes doing writes at the same time.
The command below can be used to failover from active to passive NameNode.
$ hdfs haadmin failover <StandbyNN-To-Be>
Be>
<ActiveNN-To-
423
HDP2 supports a number of different NameNode architectures dependent upon the

requirements and Service Level Agreements (SLAs).
A single NameNode with a Secondary NameNode is supported. A customer may

not have an HA requirement or has just upgraded to HDP2.
A cluster may not generate the workload or need federated NameNodes but has
HA requirements.
A cluster may need a Federation NameNode configuration but not have a

requirement for HA.
A cluster may need a Federation NameNode configuration and may have a

requirement for HA.
A cluster may not have an HA requirement for all federated NameNodes.
424
Examples of Failover Scenarios:
HDP Master service failure.
HDP Master JVM failure.
Hung HDP Master daemon or hung operating system.
HDP Master operating system failure.
Failures in virtual infrastructure (VMware).
Virtual machine failure.
ESXi host failure.
Failure of the NIC cards on ESXi hosts.
Network failure between ESXi hosts.
425
hdfs haadmin Command

Running the hdfs haadmin command without any additional arguments will display the
usage information.
To perform failover, be sure to use the hdfs haadmin -failover command.
Options include:
getServiceState: Determines whether the given NameNode is Active or Standby.

This option is useful for administration scripts and cron jobs.
checkHealth: Checks the health of the given NameNodeConnect to the provided

NameNode. The NameNode is capable of performing some diagnostics on itself
including checking if internal services are running as expected. This command
will return 0 if the NameNode is healthy, non-zero otherwise.
426
Red Hat HA
Power Fencing
Monitoring Agent
Heartbeat
Monitoring Agent
Monitor NN
Monitor NN
NN
Standby
NameNode
Shared
NameNode
State
Red Hat HA
Red Hat Enterprise Linux (RHEL) HA cluster software is separate from the Hadoop
cluster. A power-fencing device is required (deals with split-brain scenario). A floating
IP is required for failover. RHEL HA cluster must be configured for the Hadoop master
servers that have high availability requirements.
Typically, the overall Hadoop cluster must include the following types of machines:
The RHEL HA cluster machines. These machines must host only those master
services that require HA (in this case, the NameNode and the ResourceManager).
Master machines that run other master services such as Hive Server 2, HBase
master, etc.
427
VMware HA
VMware vCenter Server
NameNode
(VM)
MasterNode
(VM)
Heartbeat
MasterNode
(VM)
MasterNode
(VM)
On failure, start
VMs on another
ESXi host
ESXi Host
Shared
Storage
Resource Manager
(VM)
MasterNode
(VM)
MasterNode
(VM)
ESXi Host
VMware HA
vSphere is VMwares software platform for providing a virtualization platform. The
VMware vCenter Server is VMwares central point of management.
A vSphere ESXi host can run multiple VMs. A vSphere HA/DRS cluster can be set with
multiple ESXi hosts. The ESXi hosts maintain heartbeats and communication so they
understand what VMs are running in the vSphere HA cluster. If an ESXi host fails, HA will
start the failed VMs on another ESXi host in the vSphere HA cluster automatically. If a
VM fails on an ESXi host, VMware HA can restart the VM on another ESXi host in the HA
cluster. The vSphere HA cluster must use shared storage.
A NameNode monitoring agent notifies vSphere if the NameNode daemon fails or
becomes unstable. vSphere HA will trigger the NameNode VM to restart on the same
ESXi host or a different ESXi host dependent on the error. A monitoring agent needs to
be setup to monitor any other HDP2 master nodes to let vSphere HA be aware of the
need to start up the master node VM again. vSphere HA can also automatically handle
an
It takes about five clicks to set up HA with vSphere and vSphere will manage the HA
environment automatically. When HA is enabled, a Fault Domain Manager (FDM)
service is started on the ESXi hosts. The ESXi hosts have an election and select an
election and pick a master host. The master host manages the FDM environment.
428
There are a number of different options for configuring how vSphere HA/DRS high
availability works. It takes a fair about of expertise to setup a virtual HA environment
but once set up it works automatically.
When running a vSphere HA/DRS cluster a number of features of virtualization can be
leveraged.
vMotion: Transparently moves a VM to another ESXi host.
Distributed Resource Scheduling (DRS): Automatically moves a VM

(transparently) to another ESXi host to balance the workload across the vSphere
HA/DRS cluster.
Fault Tolerance: Two VMs across different ESXi hosts can stay synchronized in an
active-passive relationship. If active VM fails, passive VM takes over. (Only
supported with up to four vCPUs in vSphere 5.5). Fault tolerance has zero-down
time.
vSphere Replication (VR) can perform VM replication across different sites (the
hardware does not have to be an exact match between sites).
Site Recovery Manager (SRM) supports automatic failover to another site.
vSphere HA can protect against an ESXi failure or the failure of applications running on
the VM (vSphere Application Failover (in vSphere 5.5). vSphere HA also protects against
the VM, guest OS failure and network failures.
When using VMware HA:
The NameNode must run inside a virtual machine which is hosted on the
vSphere HA cluster.
The ResourceManager must run inside its own virtual machine which is hosted
on the vSphere HA cluster.
The vSphere HA cluster must include a minimum of two ESXi server machines.
429
Lab 19.1: Implementing NameNode HA
Objective: To configure and verify NameNode High Availability using

Ambari.
Successful Outcome: Your cluster will have a Standby NameNode along with
Active NameNode.
Before You Begin: Open the Ambari UI.
Step 1: Stop HBase

1.1. Click on the Admin tab in Ambari, then select High Availability from the menu
on the left side of the page:
1.2. Click the Enable NameNode HA button. Notice on the first step of the wizard
that you get a warning about stopping HBase first:
430
1.3. Click the X in the upper-right corner to end the wizard.

1.4. Go to Services -> HBase and click the Stop button. Wait for HBase to stop.
Step 2: Enable NameNode HA
2.1. Go back to the Admin page and start the Enable NameNode HA Wizard
again.
2.2. Enter HACluster as the Nameservice ID for your NameNode HA cluster. (The
name must consist of one word and not have any special characters.) Click the
Next button:
2.3. Choose node4 as the Additional NameNode:
431
2.4. Click the Next button.

2.5. Notice that a Secondary NameNode is not allowed if you have NameNode HA
configured. Ambari takes care of this for you, as you can see on the Review step
of the wizard:
2.6. Click the Next button to continue the wizard.

Step 3: Perform the Manual Steps
3.1. Notice the Enable NameNode HA Wizard is requiring you to perform some
manual steps before you are able to continue. Start by SSH-ing into node1.
3.2. Put the NameNode in Safe Mode:
# sudo su -l hdfs -c 'hdfs dfsadmin -safemode enter'
3.3. Create a Checkpoint:

# sudo su -l hdfs -c 'hdfs dfsadmin -saveNamespace'
3.4. Once Ambari recognizes that your cluster is in Safe Mode and a Checkpoint
has been made, you will be able to click the Next button.
Step 4: Wait for the Configuration
4.1. At this point, Ambari will stop all services, install the necessary components,
and restart the services. Wait for these tasks to complete:
432
4.2. Once all the tasks are complete, click the Next button.
Step 5: Initialize the JournalNodes
5.1. On node1, enter the command shown in the wizard to initialize the
JournalNodes:
# sudo su -l hdfs -c 'hdfs namenode -initializeSharedEdits'
5.2. Once Ambari determines that the JournalNodes are initialized, you will be
able to click the Next button:
Step 6: Start the Components

6.1. In the next step, Ambari will start ZooKeeper and the NameNode service. Click
Next when its complete:
433
Step 7: Initialize NameNode HA Metadata

7.1. On node1, enter the command shown in the wizard to configure the
metadata for automatic failover:
# sudo su -l hdfs -c 'hdfs zkfc -formatZK'
7.2. On node4, run the command to initialize the metadata for the new
NameNode:
# sudo su -l hdfs -c 'hdfs namenode -bootstrapStandby'

Step 8: Wait for the Wizard to Finish
8.1. In this final step, the wizard will start all required services and delete the
Secondary NameNode:
8.2. Click the Done button when all the tasks are complete.
434
Step 9: Verify the Standby NameNode

9.1. Go to the HDFS service page. You should see an Active NameNode and a
Standby NameNode:
Step 10: Test NameNode HA.

10.1. Go the Host page of node1 in Ambari.
10.2. Next to the Active NameNode component, select Stop from the Action
menu:
10.3. Go back the HDFS page in Ambari. Notice the Standby NameNode has
become the Active NameNode:
10.4. Now start the stopped NameNode again, and you will notice that it becomes
a Standby NameNode:
435
RESULT: You now have NameNode HA configured on your cluster, and you have also
verified that the HA works when one of the NameNodes stops.
436
Unit 20: Securing HDP

Topics covered:
Security Concepts
Kerberos Synopsis
HDP Security Overview
Securing HDP Authentication
Securing HDP - Authorization
Lab 20.1: Securing a HDP Cluster
437
Security Concepts
Before implementing security in a Hadoop cluster, its important to understand basic
security concepts and terms.
Principal: A principal is any user or service that is performing an operation in the
secured environment. A user principal an interactive or unattended (system) user that
logs into a secured environment and starts to interact with services. A service principal
is a service that needs to perform operations in a secured environment.
Authentication: There are many authentication mechanisms available by which
principals can prove their credentials are trusted. Credentials can be username and
password, a key file or certificate of trust, or a combination of usernames and trust files.
The common authentication protocols are Kerberos, Plain Text, X.509, Digest, and many
others. The protocol that is used in Hadoop is Kerberos, an MIT open source project.
438
Authorization: Once a principal is authenticated, it needs to receive authorization to

perform the operations it wants. Authorizations can be provided by lookups into a file,
database, and directory services such as LDAP. In Hadoop, authorization is provided via
configuration files which will be discussed in detail below.
439
Kerberos Synopsis
Kerberos is a protocol that aims to provide an authentication and authorization system
to:
Prevent the need for passwords to be transferred over the network.
Still allow for users to enter passwords.
Allow a user to establish an authenticated session without having the need
to re-enter passwords for every operation.
To create that secure communication among its various components, Hadoop uses
Kerberos. Kerberos is a third party authentication mechanism, in which users and
services that users want to access rely on a third party - the Kerberos server - to
authenticate each to the other. The Kerberos server itself is known as theKey
Distribution Center, or KDC.
At a high level, it has three parts:
440
A database of the users and services (known as principals) that it knows

about and their respective Kerberos passwords.
An authentication server (AS) which performs the initial authentication and

issues a Ticket Granting Ticket (TGT).
A Ticket Granting Server (TGS) that issues subsequent service tickets based
on the initial TGT.
A user principal requests authentication from the AS. The AS returns a TGT that is
encrypted using the user principal's Kerberos password, which is known only to the user
principal and the AS. The user principal decrypts the TGT locally using its Kerberos
password, and from that point forward, until the ticket expires, the user principal can
use the TGT to get service tickets from the TGS. Service tickets are what allow a principal
to access various services.
Because cluster resources (hosts or services) cannot provide a password each time to
decrypt the TGT, they use a special file, called a keytab, which contains the resource
principal's encrypted credentials.
Kerberos Components
Term
Description
Key Distribution
Center, or KDC
The trusted source for authentication in a Kerberos-enabled environment.
Kerberos KDC
Server
The machine, or server, that serves as the Key Distribution Center.
Kerberos Client
Any machine in the cluster that authenticates against the KDC.
Principal
The unique name of a user or service that authenticates against the KDC.
Keytab
A file that includes one or more principals and their keys.
Realm
The Kerberos network that includes a KDC and a number of Clients.
441
HDP Security Overview

HDP has a set of pre-defined Kerberos principals:
Service
Component
Mandatory Principal Name
HDFS
NameNode
nn/$FQDN
HDFS
NameNode HTTP
HTTP/$FQDN
HDFS
SecondaryNameNode
nn/$FQDN
HDFS
SecondaryNameNode HTTP
HTTP/$FQDN
HDFS
DataNode
dn/$FQDN
MR2
History Server
jhs/$FQDN
MR2
History Server HTTP
HTTP/$FQDN
YARN
ResourceManager
rm/$FQDN
YARN
NodeManager
nm/$FQDN
442
Oozie
Oozie Server
oozie/$FQDN
Oozie
Oozie HTTP
HTTP/$FQDN
Hive
Hive Metastore
HiveServer2
hive/$FQDN
Hive
WebHCat
HTTP/$FQDN
HBase
MasterServer
hbase/$FQDN
HBase
RegionServer
hbase/$FQDN
ZooKeeper
ZooKeeper
zookeeper/$FQDN
Nagios Server
Nagios
nagios/$FQDN
JournalNode
Server[a]
JournalNode
jn/$FQDN
[a]
Only required if you are setting up NameNode HA.
To create the principal for a DataNode service, issue this command:

$kadmin.local addprinc -randkey dn/$DataNode-Host@EXAMPLE.COM
Once principals are established in the KDCs database, keytab files can be extracted.
Recall that keytabs are a key file that identifies a principal. Keytabs need to be installed
on each appropriate host; wherever a service principal resides.
To extract a keytab file from an established principal:
$kadmin.local xst -norandkey -k $keytab_file_name
$primary_name/fully.qualified.domain.name@EXAMPLE.COM
$primary_name name of the principal

/fully.qualified.domain.name category in which a principal belongs
@EXAMPLE.COM the Realm
443
Securing HDP Authentication

The easiest way to enable security in HDP is via Ambari. Your Security lab at the end of
this section will walk you through the steps to secure your cluster.
Note:
Once authentication is setup, the next step is to set up mappings of local UNIX
service accounts to Kerberos principals.
These mappings live in the core-site.xml under the hadoop.security.auth_to_local
property.
444
Securing HDP - Authorization

Access Control Lists (ACLs) can also be set up in a Hadoop cluster. Authorization ACLs
can be set for:
HDFS
Jobs
HBase
Hive
ZooKeeper
445
Lab 20.1: Securing a HDP Cluster
Objective: To understand how to configure security for HDP.

Successful Outcome: The horton user will be authenticated and authorized to view
the contents of the /user folder in HDFS.
Step 1: Create a New User

1.1. On node1 as root, create a new user named horton:
# useradd -g hadoop horton
1.2. Switch to the hdfs user and create a new directory in HDFS named
/user/horton.
1.3. Change ownership of /user/horton in HDFS to the horton user.
1.4. Exit out from the hdfs user, and switch to the horton user.
1.5. Check whether you can do a listing of the /user directory successfully:
1.6. Exit out from the horton user.
NOTE: The current cluster is not a secure cluster so you can easily do a
listing of the /user directory in HDFS successfully.
Step 2: Install Kerberos

446
2.1. As the root user on node1, install the following packages:

# yum -y install krb5-server krb5-libs krb5-auth-dialog
krb5-workstation
2.2. Login to node2, node3 and node4 and install the Kerberos client only:
# yum -y install krb5-workstation
Step 3: Configure Kerberos

3.1. On node1, edit the configuration file /etc/krb5.conf and modify the value for
kdc and admin_server to node1:
# vi /etc/krb5.conf
[realms]
EXAMPLE.COM = {
kdc = node1
admin_server = node1
}
3.2. Copy the /etc/krb5.conf file to all the other nodes:

# ~/scripts/distFile.sh /etc/krb5.conf /etc/krb5.conf
3.3. Enter the following command to create a Kerberos database using the
kdb5_util utility:
# kdb5_util create -s
During this step it will ask you to define a master key. Enter 1234 as the key.
Step 4: Start Kerberos
4.1. Start the KDC server by executing following commands:
# /etc/rc.d/init.d/krb5kdc start
# /etc/rc.d/init.d/kadmin start
Step 5: Run the Enable Security Wizard

5.1. In Ambari, go the Admin page and click on the Security link:
447
5.2. Click the Enable Security button.

5.3. Notice on the Get Started step of the Enable Security Wizard there are four
manual steps that must be completed first. We have already executed the first 2
steps. For the remaining 2 steps, click the Next button.
5.4. The default settings on the Configure Services step of the wizard are fine, so
click Next to continue.
Step 6: Create the Principals and Keytabs
6.1. (NOTE: This step does not work using Safari.) On the Create Principals and
Keytabs step of the wizard, all the required default settings for Kerberos are
shown in a tabular format. Click the Download CSV button at the bottom of the
page and save the file on your local machine:
6.2. Create a new file on node1 named /root/scripts/kerberos.csv and copy-andpaste the contents of the CSV file into kerberos.csv.
6.3. Run the pre-written script to create all required principals and keytabs. It will
ask for the location of the CSV file you created. Provide the full path to the file:
# /root/scripts/create_principals.sh
NOTE: This step will create all required principals and keytab files on all the
nodes. Once you are done with the step, go back to Ambari UI.
Step 7: Finish the Enable Security Wizard
448
7.1. We have completed all 4 required steps. Now it is time to enable security
through Ambari. Click the Apply button.
7.2. The Save and Apply Configuration step can take 10-15 minutes. When the
task is complete, click the Done button:
Step 8: Verify Security is Enabled

8.1. On node1, switch to the horton user and try to list the contents of /user in
HDFS:
# su - horton
The command should throw an error.

Step 9: Configure User Permissions
9.1. As the root user, add horton to the Kerberos database. Start the kadmin
service by typing following:
# kadmin.local
9.2. Type following command to create a horton principal:

kadmin.local: addprinc -randkey horton@EXAMPLE.COM
9.3. Now create a keytab file in the /etc/security/keytabs directory using the
following command:
449
kadmin.local: xst -norandkey -k

/etc/security/keytabs/horton.headless.keytab
horton@EXAMPLE.COM
kadmin.local: exit
9.4. Set appropriate permissions for the keytab file for the horton user:
# chown horton:hadoop
/etc/security/keytabs/horton.headless.keytab
# chmod 440 /etc/security/keytabs/horton.headless.keytab
9.5. Switch to the horton user and initialize the keytab file:
# su - horton
$ kinit -kt /etc/security/keytabs/horton.headless.keytab
horton@EXAMPLE.COM
9.6. Now try to list the contents of /user in HDFS again. This time you should be
able to view the folders contents!
RESULT: You have enabled Kerberos security for your HDP cluster.
450
Appendix A: Unit Review Answers

Unit 1 Review Answers
1. YARN and HDFS
2. False
3. Ambari
4. NameNode are ResourceManager are the primary master processes.

1. NameNode
2. 3
3. False. A files data in HDFS never passes through the NameNode. Client
applications read and write directly from the DataNodes.
4. dfs.blocksize
5. The fsimage and edits files
1. Availability.
2. Defines the number of bytes for a checksum to be created (default is 512).
3. Block scanner.
4. Under-replicated blocks, Over-replicated blocks, Mis-replicated blocks (on the
same node), corrupt blocks.
5. The location option.
6. The Total Size and the Default Replication Factor.
7. To get a list of all the DataNodes in a cluster.
451

1. DataNode, NameNode or HDP client machine
2. NFS server
3. hdfs-site.xml
4. UID/GID

1. Map phase, shuffle/sort phase, and reduce phase
2. The number of Mappers is determined by the input splits.
3. You get to choose the number of Reducers.
4. ResourceManager, NodeManager and ApplicationMaster
5. It is up to the ApplicationMaster to request a new Container from the
ResourceManager and attempt the task again.
1. Capacity and Fairness
2. Capacity
3. root
4. 3 (A and B and the root parent queue)
5. The A queue is allocated 80% of the resources of the cluster.
6. Yes, because its maximum-capacity is set to 100%.
7. No, because queue A does not have a maximum-capacity configured, which
means elasticity is disabled for A
452

1. Data ingestion
2. Time-based and immutable
3. Big Data Refinery
4. The Batch layer, serving layer and speed layer.
5. hftp
6. source and destination (using a checksum CRC32)

1. Read and write operations.
2. dfs.webhdfs.enabled
3. Kerberos SPNEGO and Hadoop delegation tokens
4. Service

1. Metastore
2. External
3. WebHCat
4. True

1. 4 map tasks by default
2. The -m option is for specifying the number of mappers.
3. The $CONDITIONS value is used internally by Sqoop to specify LIMIT and OFFSET
clauses so the data can be split up amongst the map tasks
453

1. Event
2. Event driven
3. Custom
4. Interceptor

1. workflow, coordinator and bundle jobs
2. Bundle
3. Answers can include: Email, Shell, Pig, MapReduce, Hive, Sqoop, ssh, DistCp or
Custom
4. oozie-log4j.properties

1. Widget and Classic
2. gmond, gmead and gweb
3. hadoop-cluster.cfg
4. jmap
5. JMX

1. dfs.hosts
2. dfs.hosts.exclude
454
3. yarn.resourcemanager.nodes.include-path and
yarn.resourcemanager.nodes.exclude-path
4. execute ResourceManager administration operations

1. Hierarchical
2. High Availability
3. core-site.xml
455
Appendix B: Other Hadoop Tools

Topics covered:
Data Lifecycle Management
Data Lifecycle Management on Hadoop
Falcon Use Cases and Capabilities
Role of Falcon in Hadoop
Knox
ZooKeeper
HBase
HCatalog
NameNode Federation
456
Data Lifecycle Management

In a world of increased governance and regulation, data lifecycle management is more
important than ever.
Data management lifecycle has different stages the data goes through:
Discover
Design
Enable
Maintain
Archive
457

Hadoop systems are integral to the data lifecycle management for an
organization.
Data Management Needs
Tools
Data Processing
Oozie
Replication
Sqoop
Retention
Distcp
Scheduling
Flume
Reprocessing
MapReduce
Multi-Cluster Management
Hive and Pig

The data volume, ingestion rates and different types of data in Hadoop add to the
complexity of data lifecycle management for an organization.
The individual tools in Hadoop are excellent for their defined functionality but an
organized way of managing all this data needs to be implemented.
458
Falcon Use Cases and Capabilities
459
Falcon
Falcon is a data lifecycle management framework for Apache Hadoop.
Falcon enables users to configure, manage and orchestrate data motion,
disaster recovery, and data retention workflows to support of business
continuity and data governance use cases.
Falcon
Falcon provides the key services data processing applications need. Falcon manages
workflow and replication.
Falcons goal is to simplify data management on Hadoop. It achieves this by providing
important data lifecycle management services that any Hadoop application can rely on.
Instead of hard-coding complex data lifecycle capabilities, apps can now rely on a
proven, well-tested and extremely scalable data management system built specifically
for the unique capabilities that Hadoop offers.
Falcon also supports multi-cluster failover.
Falcon is not in the initial GA release of HDP2. Falcon will be in a following

release of HDP2.
460
Future: Knox
Provide perimeter security
Support authentication and token
verification security scenarios
Single URL to access multiple
Hadoop services
Enable integration with enterprise
and cloud identity management
environments
Supports
WebHDFS
WebHCat
Oozie
HBase
Hive
Future: Knox
While not yet part of HDP, Knox is intended to provide permiter security. It aims to
provide a single point of entry into a Hadoop cluster for a user to access different
services such as HDFS, YARN, Hive, Oozie. Knox can be installed in HDP 2 as an add-on. A
user authenticates once with the Knox service via Kerberos, while Knox itself handles
serving requests for that user inside the cluster.
For more information:
Hortonworks: http://hortonworks.com/hadoop/knox-gateway/
Apache: http://knox.incubator.apache.org/
461
ZooKeeper Synopsis
ZooKeeper is a service that provides configuration management, naming, distributed
synchronization, and group services. Various Hadoop services rely on ZooKeeper to
operate. In this Unit, we will focus on Administering ZooKeeper.
Centralized service for:
Configuration management: Services such as HBase use ZooKeeper extensively
for configuration management, such as a registry of all HBase nodes and tables.
Naming & group services: ZooKeeper can act as a naming service, similar to
what a DNS would provide. At an application level, can you can use ZooKeeper as
your replacement to DNS. For example, if your application needs to resolve a
host name, that information can be maintained and provided by ZooKeeper to
the application.
Distributed synchronization (i.e. distributed transactions): You can implement a

Two-phase commit (2PC), a common distributed transaction type, using
ZooKeeper.
462
Components
An ensemble of ZooKeeper hosts 3 hosts suffices for most clusters
Ensembles are configured in odd numbers of 3,5,7, etc.
This is due to always having a majority and allowing for one
additional failure than even numbers.
5 zknodes allows for 2 failures, whereas 4 zknodes allows for only
1 failure. Both have the same number of majority.
The ensemble of hosts work together as a quorum; as long as a majority
of them agree on an operation, then the operation succeeds.
A leader host out of the ensemble.

ZooKeeper nodes will self-elect a leader.
All operations are stamped with a sequential transaction id calls a zxid.

The zxid exposes total ordering of operations.
ZooKeeper Client
ZooKeeper ships with a command line client that allows you to perform file-system like
operations:
/user/lib/zookeeper/bin/zkCli.sh
To connect to a ZooKeeper host:

[root@node1 ~]# /usr/lib/zookeeper/bin/zkCli.sh -server
node1:2181
463
To view available commands:

[zk: node1:2181(CONNECTED) 0] help
ZooKeeper -server host:port cmd args
connect host:port
get path [watch]
ls path [watch]
set path data [version]
rmr path
delquota [-n|-b] path
quit
printwatches on|off
create [-s] [-e] path data acl
stat path [watch]
close
ls2 path [watch]
history
listquota path
setAcl path acl
getAcl path
sync path
redo cmdno
addauth scheme auth
delete path [version]
setquota -n|-b val path
The simple client allows you to create znodes. More complex operations would be
performed programmatically. For more in-depth information and programmers guide,
visit:
http://zookeeper.apache.org/doc/r3.4.5/zookeeperProgrammers.html
464
Configuring ZooKeeper
To configure ZooKeeper, find the config file at with key configuration properties
mentioned above:
/etc/zookeeper/conf/zoo.cfg
Data & Logs

By default, the data and log directory are the same. However, in production, these
should be separated out. As with any logging strategy, throughput increases and latency
decreases when logs are directed to a log device or drive. Use the dataDir and
dataLogDir properties to separate these out.
Monitoring
Ambari does a good job of monitoring liveliness of each ZooKeeper host in the cluster.
To perform more detailed monitoring, you can use a set of commands, called the Four
Letter Words. Another way to monitor ZooKeeper is via JMX. Several MBeans are
exposed that allow you to get detailed statistics.
465
ZooKeeper Commands: The Four Letter Words

Below is an example of the ruok (are you okay?) command:
[root@node1 ~]# echo -n ruok | nc node1 2181
imok
[root@node1 ~]#
Notice that ZooKeeper echoed back imok. A full list of the four letter word commands is
provided below:
Command Description
conf
cons
crst
dump
envi
ruok
srst
srvr
stat
wchs
Print details about server configuration

List full connection/session details for all clients connected to this server. Includes
information on numbers of packets received/sent, session id, operation latencies,
last operation performed, etc...
Reset connection/session statistics for all connections.
Lists the outstanding sessions and ephemeral nodes. This only works on the leader.
Print details about serving environment.
Tests if server is running in a non-error state. The server will respond with imok if it
is running. Otherwise it will not respond at all.
A response of "imok" does not necessarily indicate that the server has joined the
quorum, just that the server process is active and bound to the specified client port.
Use "stat" for details on state wrt quorum and client connection information
Reset server statistics.
Lists full details for the server.
Lists brief details for the server and connected clients.
Lists brief information on watches for the server.
Lists detailed information on watches for the server, by session. This outputs a list of
sessions (connections) with associated watches (paths).
wchc
Note: Depending on the number of watches this operation may be expensive (ie
impact server performance), use it carefully.
Lists detailed information on watches for the server, by path. This outputs a list of
paths (znodes) with associated sessions.
wchp
mntr
466
Note: Depending on the number of watches this operation may be expensive (ie
impact server performance), use it carefully.
Outputs a list of variables that could be used for monitoring the health of the
cluster.
Ports firewall considerations
Service
Servers
Ports
ZooKeeper Server
ZooKeeper Server
ZooKeeper Server
All ZK Nodes
All ZK Nodes
All ZK Nodes
2888
3888
2818
Protocol
Description
Peer to peer communication.
Peer to peer leader election.
Clients connect to this port.
467
HBase Synopsis
HBase is a NoSQL database known as the Hadoop Database. Because HDFS is a resilient,
highly scalable distributed file system, HBase capitalizes on these characteristics by
persisting its data directly to HDFS.
HBase Architecture
ZooKeeper1
HMaster
ZooKeeper2
ZooKeeper3
RegionServer
RegionServer
RegionServer
DataNode
DataNode
DataNode
HDFS
468
Hortonworks Inc. 2012
Page 321
Components
Rowkey: Data is always identified by a rowkey. A Row key can be considered as a
primary key that you find in relational databases. It is a unique key that identifies a row
in HBase. Rowkeys are always sorted lexagraphically in ascending order within Regions.
Region: A region is a collection of rows that is managed by one of the RegionServers.
RegionServer: The HBase worker node, java process, which is co-located with
DataNodes. A RegionServer can load HBase block files into memory for caching, scan
blocks locally, and is thus co-located with the data blocks that make up the regions it
manages.
HMaster: Responsible for HBase maintenance tasks such as load-balancing and
orchestrating recovery when a RegionServer fails. Since clients talk directly to
RegionServers, it is possible for an HMaster to go down, and HBase can continue
functioning, however, the HMaster should be restarted as soon as possible.
ZooKeeper: ZooKeeper handles all of the configuration management. Clients always talk
to ZooKeeper first to find the appropriate RegionServer to talk to.
Since HBase RegionServers have a data block cache, heap sizes for
RegionServers is often very large. It is recommended to set the heap for
RegionServers (resource permitting) to at least 8GB.
469
Configuring HBase
HBase configuration properties are described above.
Ports Firewall considerations:
Service
Servers
Ports
HMaster
HMaster Info Web UI
Masters
Masters
60000
60010
http
RegionServer
RegionServers 60020
RegionServer
RegionServers 60030
470
http
RegionServers comm
HMaster WebUI stats
Client to RegionServer
communications, Master to
RegionServer, RegionServer
to RegionServer
RegionServer Web UI
stats
HCatalog
HCatalog is Hives table and storage management layer.
MapReduce
Pig
Hive
Streaming
HCatalog
ORC
RC
Text
Sequence
Custom
HBase
HCatalog
HCatalog allows the creation of schema definitions that will be accessed from
applications. This allows the schema definition to be outside of the application code.
HCatalog is a set of interfaces that provide access to Hive's metastore for different types
of applications.
HCatalog provides:
A shared schema and data type mechanism for Hadoop tools.
A table abstraction so that users need not be concerned with where or how their
data is stored.
Interoperability across data processing tools such as Pig, Map Reduce, and Hive.
A REST interface to allow language independent access to Hive's metadata.
The HCatalog CLI supports all Hive DDLs that do not require MapReduce. HCatalog is
used to create, alter, drop tables, etc. The HCatalog CLI supports commands like SHOW
TABLES and DESCRIBE TABLE.
471

NameNode
Namespace
fsimage
Block Management
Edits.log
Heartbeats
DataNode 1
DataNode 2
BLK1
BLK6
BKL4
BLK2
BK5
BK5
DataNode 3
BLK6
BLK1
BLK6
BLK2
BKL4
DataNode n
BLK1
BK5
BLK2
BKL4

The NameNode has two main areas:
Namespace: Contains information on directories, files and blocks. In HDP1

there is only one NameNode per cluster, therefore one namespace for the
cluster. The namespace supports the creation, deletion and modification and
listing of files as well as directory operations.
Block Management: Locations of blocks, manages replicas of blocks, DataNode

cluster membership, heartbeats, etc. Supports the creation, deletion and
modification of block operations.
DataNodes handle all I/O, storage and block management on data node machines (slave
servers). Data blocks are replicated for high availability.
The HDP1 NameNode architecture scales to approximately 5,000 data nodes. One of
the big advantages of a Hadoop platform is the ability to coordinate data from all types
of different sources. Customers want to put more of their date into a central data lake
versus creating lots of Hadoop clusters.
472
In HDP1 the Namespace Volume = Single Namespace + block. All tenants shared a single
namespace.
With a single namespace there is isolation in multi user environment.
In HDP1, customer clusters can contain 4500+ nodes, 100+ PB storage and 400+ million
files and they keep growing bigger.
HDFS has over 7 9s of data reliability with less than 0.38 failures across 25
clusters.
One administrator can manage 1000 3000k nodes in a cluster.
HDFS offers fast repair time for disk failure or node failure. In HDFS, repairs can
occur in minutes versus RAID arrays where fixes can take hours.
473
Federating NameNodes
Hadoop clusters are increasing in size, workloads and complexity.
A typical large deployment at Yahoo! includes an HDFS cluster with 2700-4200

DataNodes with 180 million files and blocks, and address ~25 PB of storage.
At Facebook, HDFS has around 2600 nodes, 300 million files and blocks,
addressing up to 60PB of storage.
The number of files in HDFS is limited by the amount of memory in a single name
node. More RAM in a single machine creates more garbage collection issues.
Multiple NameNodes increase the amount of memory and files that can be
HDFS.
At a lot of companies, the Hadoop cluster is used in a multi-tenant environment where

different business units share a cluster. One large application can impact all the tenants
if a single namespace (HDP2).
In HDP1, a lot of organizations run HBase in a separate cluster to make sure SLAs can be
met. In HDP2 HBase can run in its own namespace, eliminating the need for separate
clusters.
474
In a federated NameNode environment, the NameNodes are independent and do not

need to coordinate with each other.
When a NameNode/namespace is deleted, the corresponding block pool at the
DataNodes is deleted.
Each namespace volume is upgraded as a unit, during cluster upgrade.
475

NameNode k
NameNode 1
Namespace 1
Pool 1
NameNode n
Namespace k
Namespace n
Pool k
Block Management
fsimage
Pool n
fsimage
Block Management
fsimage
Block Management
Edits.log
Edits.log
Edits.log
Heartbeats
DataNode 1
DataNode 2
BLK1
BLK6
BKL4
BLK2
BK5
BK5
DataNode 3
BLK6
BLK1
BLK6
BLK2
BKL4
DataNode n
BLK1
BK5
BLK2
BKL4

In HDP2, a Hadoop cluster can have a single NameNode managing a single namespace.
For scalability, HDP2 can configure a federation of NameNodes, each with its own
Namespace. A federation is a group of objects that work together with each group
acting with a level of independence. This helps allow a HDP2 cluster to scale to 10,000
DataNodes.
The NameNodes are federated, so each NameNode is independent. They do not need
to coordinate with each other. DataNodes will register with each federated NameNode.
A ClusterID is a new identifier, used to identify all nodes in the cluster.
Despite there being separate namespaces and block pools the NameNode is quite
similar to a configuration of a single NameNode in a single namespace. Most changes
are in the DataNodes, configurations, and tools.
DataNodes in a federated cluster:
Register with all the NameNodes.
Send periodic heartbeats and block reports to all the NameNodes.
Send block received/deleted for a block pool to corresponding NameNode.
476
Namespace Volume
The NameServiceID is an identifier for coordinating a NameNode with its backup,
secondary or checkpointing nodes. The NameServiceId is used to identify a set of nodes
associated with a namespace in the configuration files. Datanodes will reference all the
DataNodes in the cluster. DataNodes store blocks for all the namespace volumes, there
is no partitioning.
A federation of NameNodes is a simple design and required minimal changes to existing
NameNode code.
Separating the namespace and block management also allows block storage to become
a separate service. The namespace happens to be one of the applications that uses the
service. This opens up the potential of associating different types of services on block
storage. Examples:
HBase
New block categories can be created in the future to support with different types
of garbage collection and optimization for different types of applications.
Foreign namespaces
477
Benefits of Independent Block Pools

Separating the block pools from namespace volumes allows the potential of block
management to move into separate independent nodes. Separate block pools allow
different types of application implementations to be simplified. Areas such as
distributed caches become easier to implement.
An independent namespace can generate Block IDs for new blocks without the need for
coordination with the other namespaces. The failure of a NameNode does not prevent
other NameNodes to service DataNodes in the cluster.
478
Namespaces Increase Scalability

The HDP1 NameNode architecture scales to approximately 5,000 data nodes. One of
the big advantages of Hadoop platforms is the ability to coordinate data from all types
of different sources. Customers want to put more of their data into a central data lake
versus creating many Hadoop clusters. HDP2 NameNode architecture can scale to
10,000+ data nodes.
The scalability of multiple NameNodes takes HDP2 pas 100k current tasks, 200PB of
storage and 1+ billion files.
479
Configuring NameServices for DataNodes

<configuration>
<property>
<name>dfs.nameservices</name>
Define NameNodes <value>ns1,ns2</value>
in Cluster
</property>
<property>
<name>dfs.namenode.rpc-address.ns1</name> <value>nn-host1:rpc-port</value>
Server and Port # for NameNode1
</property>
<property>
<name>dfs.namenode.http-address.ns1</name> <value>nn-host1:http-port</value>
</property>
<property>
<name>dfs.namenode.secondaryhttp-address.ns1</name> <value>snn-host1:http-port</
Server and Port # for SNameNode1
value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns2</name>
<value>nn-host2:rpc-port</value>
</property>
<property>
<name>dfs.namenode.http-address.ns2</name>
<value>nn-host2:http-port</value>
</property>
<property>
Server and Port # for SNameNode2
<name>dfs.namenode.secondaryhttp-address.ns2</name>
<value>snn-host2:http-port</
value>
</property>
</configuration>
Configuring NameServices for DataNodes

There is a single configuration for all nodes in the cluster.
Formatting a NameNode requires specifying the ClusterId. Multiple NameNodes can be
formatted at the same time. ClusterIDs are auto generated if not provide.
$HADOOP_PREFIX_HOME/bin/hdfs namenode -format [-clusterId
<cluster_id>]
When upgrading and going from a single namespace to multiple namespaces

(NameNodes).
$HADOOP_PREFIX_HOME/bin/hdfs start namenode --config
$HADOOP_CONF_DIR -upgrade -clusterId <cluster_ID>
A NameNode can be added to an existing HDFS cluster. Update configuration files to

reflect the new NameNode(s). Run the refreshNameNode option on all the Datanodes.
$HADOOP_PREFIX_HOME/bin/hdfs dfadmin -refreshNameNode
<datanode_host_name>:<datanode_rpc_port>
480
The HDFS cluster can be started from any node as long as the HDFS configuration
information is available. The startup process starts the NameNodes and the DataNodes
in the slaves file are started.
$HADOOP_PREFIX_HOME/bin/start-dfs.sh
$HADOOP_PREFIX_HOME/bin/stop-dfs.sh
NameNode Configuration Parameters

dfs.namenode.rpc-address dfs.namenode.servicerpc-address dfs.namenode.httpaddress dfs.namenode.https-address dfs.namenode.keytab.file dfs.namenode.name.dir
dfs.namenode.edits.dir dfs.namenode.checkpoint.dir dfs.namenode.checkpoint.edits.dir
Secondary Configuration Parameters

dfs.namenode.secondary.http-address dfs.secondary.namenode.keytab.file
Backup Configuration Parameters

dfs.namenode.backup.address dfs.secondary.namenode.keytab.file
481
Block Management with Federation

The Cluster Web Service can use any of the NameNode servers to monitor the cluster.
The Cluster Web Service displays:
Number of files
Number of blocks
Total storage capacity
Used and available storage information for HDFS cluster.
You can run the Cluster Web Console from any NameNode:
http://<NameNodeNHost>:port>/dfsclusterhealth.jsp
NameNodes can be added and removed in a Federated cluster without restarting the
cluster.
482
Run the balancer choosing the Node or Blockpool policy.

"$HADOOP_PREFIX"/bin/hadoop-daemon.sh --config
$HADOOP_CONF_DIR --script "$bin"/hdfs start balancer [policy <policy>]
When decommissioning a DataNode, the exclude file needs to be distributed to all

NameNodes.
"$HADOOP_PREFIX"/bin/distribute-exclude.sh <exclude_file>
To refresh all the NameNodes, run the following script:

"$HADOOP_PREFIX"/bin/refresh-namenodes.sh
483
Configuration Parameters
Property
Value (examples)
Description
dfs.nameservices
mycoolcluster
Logical name for nameservice
dfs.ha.namenodes.
[nameservice ID]
nn1,nn2
NameNode IDs
dfs.ha.automaticfailover.enabled
true
Set cluster automatic failover on

each NameNode using HA.
ha.zookeeper.quorum
nn1,nn2
NameNodes in the cluster
ha.zookeeper.quorum
<Host1>:2181,
<Host2>:2181,
<Host3>:2181
Define Zookeeper Quorum
dfs.journalnode.edits.dir
/localdirpath/journalnode/
Where JNs store state data
dfs.namenode.rpcaddress.mycoolcluster
<Host1>:8020
Set RPC address for NN to listen

on per NN.
dfs.namenode.http-address.
mycoolcluster
<Host1>:50070
Set HTTP address for NN to

listen on per NN.
dfs.namenode.rpc-address.
mycoolcluster
<Host2>:8020
Set RPC address for NN to listen

on per NN.
dfs.namenode.httpaddress.nmycoolcluster
<Host2>:50070
Set HTTP address for NN to

listen on per NN.
Federation Configuration Parameters

Both the hdfs-site.xml and core-site.xml configuration files need to be modified to set
up HDFS HA. Examples:
hdfs-site.xml
dfs.ha.automatic-failover.enabled
core-site.xml
ha.zookeeper.quorum
484
Additional parameters need to be configured:
dfs.ha.fencing.methods: A set of scripts or Java classes that determine how to

fence the Active NameNode during a failover.
Sshfence: Uses SSH to connect to the Active NameNode and kill the Active
NameNode process.
For example:
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/exampleuser/.ssh/id_rsa</value>
</property>
Shell: Execute a shell script or command to fence the Active NameNode.
485

HDPOps-ManageAmbari Docker GA Rev3

Încărcat de

Informații document

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

HDPOps-ManageAmbari Docker GA Rev3

Încărcat de

Drepturi de autor:

Formate disponibile

HDP Operations: Install and Manage

With Apache Ambari

Copyright 2014, Hortonworks, Inc. All rights reserved.

Title: HDP Operations: Install and Manage with Apache Ambari

Copyright 2014, Hortonworks, Inc. All rights reserved.

Unit 2: HDFS Architecture................................................................................................... 29

Unit 3: Installation Prerequisites and Planning ......................................................... 47

Unit 4: Configuring Hadoop ................................................................................................ 65

Copyright 2014, Hortonworks, Inc. All rights reserved.

Lab 4.3: Using HDFS Commands ................................................................................................. 87

Unit 5: Ensuring Data Integrity ......................................................................................... 95

Unit 6: HDFS NFS Gateway ................................................................................................ 123

HDFS NFS Gateway Introduction .............................................................................................. 124

Unit 7: YARN Architecture and MapReduce ............................................................... 136

What is YARN? ................................................................................................................................. 137

Unit 8: Job Schedulers ........................................................................................................ 165

Overview of Job Scheduling ....................................................................................................... 166

Copyright 2014, Hortonworks, Inc. All rights reserved.

Lab 8.1: Configuring the Capacity Scheduler ....................................................................... 180

Unit 9: Enterprise Data Movement ................................................................................ 185

Unit 10: HDFS Web Services ............................................................................................. 204

What is WebHDFS? ........................................................................................................................ 205

Unit 11: Hive Administration .......................................................................................... 230

Introduction to Hive ..................................................................................................................... 231

Unit 12: Transferring Data with Sqoop ........................................................................ 262

Copyright 2014, Hortonworks, Inc. All rights reserved.

Importing a Table .......................................................................................................................... 267

Unit 13: Flume....................................................................................................................... 284

Flume Introduction ....................................................................................................................... 285

Unit 14: Oozie ........................................................................................................................ 315

Oozie Overview............................................................................................................................... 316

Unit 15: Monitoring HDP2 Services ............................................................................... 344

Ambari ............................................................................................................................................... 345

Copyright 2014, Hortonworks, Inc. All rights reserved.

Unit 16: Commissioning and Decommissioning Nodes .......................................... 370

Unit 17: Backup and Recovery ........................................................................................ 392

What should you backup? ........................................................................................................... 393

Unit 18: Rack Awareness and Topology ...................................................................... 402

Rack Awareness ............................................................................................................................. 403

Unit 19: NameNode HA ...................................................................................................... 414

NameNode Architecture HDP1 ................................................................................................. 415

Unit 20: Securing HDP ........................................................................................................ 437

Security Concepts .......................................................................................................................... 438

Copyright 2014, Hortonworks, Inc. All rights reserved.

HDP Security Overview................................................................................................................ 442

Appendix A: Unit Review Answers ................................................................................ 451

Data Lifecycle Management ....................................................................................................... 457

Copyright 2014, Hortonworks, Inc. All rights reserved.

Welcome to Hortonworks University

Overview of Hortonworks Certification

Copyright 2014, Hortonworks, Inc. All rights reserved.

Unit 1: Introduction to HDP and

Enterprise Data Trends @ Scale

What is Big Data?

A Market for Big Data

Most Common New Types of Data

Moving from Causation to Correlation

What is Hadoop 2.0?

Traditional Systems vs. Hadoop

Overview of a Hadoop Cluster

The Hortonworks Data Platform

Hadoop Use Case

Lab 1.1: Login to Your Cluster

Copyright 2014, Hortonworks, Inc. All rights reserved.