Raghavan Solium
Big Data Consultant
raghavan.solium@gmail.com
Day - 1
HDFS: The Hadoop Distributed File System
Building Blocks
Name Node & Data Node
Starting HDFS Services
HDFS Commands
Hands on
Configure HDFS
Start & Examine the daemons
Export & Import files into HDFS
MapReduce Workflow
Job Tracker & Task Tracker
Starting MapReduce Services
Hands on
Configure MapReduce
Start & Examine the daemons
Day - 2
MapReduce Programming
Java API
Data Types
Input & Output Formats
Hands on
Advanced Topics
Combiner
Partitioner
Counters
Compression, Speculative Execution, Zero & One Reducer
Distributed Cache
Job Chaining
HDFS Federation
HDFS HA
Hadoop Cluster Administration
Day - 3
Pig
Hive
What is Hive?
Hive Architecture
Install & Configure Hive
Hive Data Models
Hive Metastore
Partitioning and Bucketing
Hands On
Day - 4
Sqoop
What is Sqoop
Install & Configure Sqoop
Import & Export
Hands On
Day - 4
Hadoop Administration
Let's Define
Variety
Sensor Data
Machine logs
Social media data
Scientific data
RFID readers
sensor networks
vehicle GPS traces
Retail transactions
Volume
The New York Stock Exchange has
several petabytes of data for analysis
Facebook hosts approximately 10
billion photos, taking up one
petabyte of storage.
At the end of 2010 the Large Hadron
Collider near Geneva, Switzerland had
about 150 petabytes of data
Velocity
The New York Stock Exchange
generates about one terabyte of new
trade data every day
The Large Hadron Collider produces
about 15 petabytes of data per year
Weather sensors collect data every
hour at many locations across the
globe and gather a large volume of
log data
8
Inflection Points
Data Storage
Big Data ranges from several Terabytes to Petabytes.
At these volumes, the access speed of storage devices dominates overall analysis time.
A terabyte of data takes about 2.5 hours to read from a 100 MB/s drive
Writing will be even slower
Analysis
Much of Big Data is unstructured; traditional RDBMS/EDW systems cannot handle it
A lot of Big Data analysis is ad hoc in nature and involves whole-data scans, self-references,
joins, combinations, etc.
Traditional RDBMS/EDW systems cannot handle these workloads, given their limited scalability
options and architectural limitations
You can buy better servers and processors and throw in more RAM, but there
is a limit to it
9
Inflection Points
We need a drastically different approach
A distributed file system with high capacity and high reliability
A processing engine that can handle structured/unstructured
data
A computation model that can operate on distributed data
and abstracts away data dispersion
PRAM and MapReduce are such models
10
[Diagram: MapReduce data flow: an input file is divided into splits; a map task on each computer turns (K1, V1) records into intermediate (K2, V2) key/value pairs, which are sorted and shuffled so that each reduce task receives all values for a key and writes (K3, V3) output files (Part 1, Part 2).]
11
MapReduce Model
A MapReduce implementation should have
The ability to initiate and monitor parallel processes and
coordinate between them
A mechanism to route all map outputs with the same key
to a single reduce process
The ability to recover from any failures transparently
13
14
              Traditional RDBMS            MapReduce
Data size     Gigabytes                    Petabytes
Access        Interactive and batch        Batch
Updates       Read and write many times    Write once, read many times
Structure     Static schema                Dynamic schema
Integrity     High                         Low
Scaling       Nonlinear                    Linear
(Some of these points are debatable, as the Big Data and Hadoop ecosystems are evolving fast and moving to a higher degree of
maturity and flexibility. For example, HBase brings in the ability to serve point queries.)
15
Retail
Trend analysis, Personalized promotions
Machine Learning
Log Analytics
16
What is Apache Hadoop and how can it help with Big Data?
It is an open-source Apache project for handling Big Data
It addresses the data storage and analysis (processing) problems through its
HDFS file system and its implementation of the MapReduce computation model
It is designed for massive scalability and reliability
The model enables leveraging cheap commodity servers, keeping the cost in
check
[Diagram: Hadoop ecosystem: Pig, Hive and Mahout sit on top of the MapReduce framework; Sqoop brings in structured data from relational stores, while Flume ingests unstructured data such as log files.]
18
MapReduce:
A distributed data processing model and execution environment that runs on large
clusters of commodity machines.
HDFS: A distributed file system that runs on large clusters of commodity machines.
Pig:
A data flow language and execution environment for exploring very large datasets. Pig runs on
HDFS and MapReduce clusters.
Hive:
A distributed data warehouse. Hive manages data stored in HDFS and provides a query
language based on SQL (and which is translated by the runtime engine to MapReduce jobs) for
querying the data.
HBase:
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and
supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such
as distributed locks that can be used for building distributed applications.
Sqoop:
A tool for efficient bulk transfer of data between structured data stores (such as relational
databases) and HDFS.
Oozie:
A service for running and scheduling workflows of Hadoop jobs (including Map-Reduce, Pig,
Hive, and Sqoop jobs).
19
Hadoop Requirements
Supported Platforms
GNU/Linux is supported as a development and production platform
Win32 is supported as a development platform only
Cygwin is required for running on Windows
Required Software
Java 1.6.x
ssh must be installed and sshd must be running (for launching the
daemons on the cluster with passwordless login)
Development Environment
Eclipse 3.5 or above
20
Lab Requirements
Windows 7 - 64 bit OS, Min 4 GB Ram
VMWare Player 5.0.0
Linux VM - Ubuntu 12.04 LTS
User: hadoop, Password: hadoop123
Java 6 installed on Linux VM
Open SSH installed on Linux VM
Putty - For opening Telnet sessions to the Linux VM
WinSCP - For transferring files between Windows / VM
Eclipse 3.5
21
Hands On
Using the VM
Install & Configure hadoop
22
Starting VM
23
Starting VM
24
Install ssh
>>sudo apt-get install ssh
Generate an RSA key pair with an empty passphrase
>>ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Verify ssh
Verify SSH by logging into target (localhost here)
>>ssh localhost
This command should log you into the machine localhost
26
27
28
>>apt-get install openjdk-6-jdk
Check Installation
>>java -version
Install Hadoop
30
Run an Example
Verify Hadoop installation
>> hadoop version
>>cd $HADOOP_INSTALL
>>hadoop jar hadoop-examples-1.0.3.jar
Run without a program name, this lists the example programs contained in the jar file
31
[Diagram: Hadoop daemons: a NameNode, Secondary NameNode and JobTracker, networked to worker nodes that each run a DataNode and a TaskTracker hosting map and reduce slots.]
(Hadoop supports many file systems other than HDFS itself. However, to leverage Hadoop's abilities completely, HDFS is one of
the most reliable choices.)
32
Task Tracker: takes care of local task execution on the local data segment. It
talks to the DataNode for file information and constantly communicates with the JobTracker
daemon to report task progress.
When the Hadoop system is running in a distributed mode, all of these daemons run
on their respective computers.
33
Filename                      Format
hadoop-env.sh                 Bash script
core-site.xml                 Hadoop configuration XML
hdfs-site.xml                 Hadoop configuration XML
mapred-site.xml               Hadoop configuration XML
masters                       Plain text
slaves                        Plain text
hadoop-metrics.properties     Java properties
log4j.properties              Java properties
Property Name        Conf File         Standalone           Pseudo-Distributed   Fully Distributed
fs.default.name      core-site.xml     file:/// (default)   hdfs://localhost/    hdfs://namenode/
dfs.replication      hdfs-site.xml     N/A                  1                    3 (default)
mapred.job.tracker   mapred-site.xml   local (default)      localhost:8021       jobtracker:8021
36
HDFS
37
Design of HDFS
HDFS is Hadoop's distributed file system
Designed for storing very large files (with sizes up to petabytes)
A single file can be stored across several disks
Designed for streaming data access patterns
Not suitable for low-latency data access
Designed to be highly fault tolerant, hence can run on
commodity hardware
38
HDFS Concepts
Like any file system, HDFS stores files by breaking them
into smaller units called blocks
The default HDFS block size is 64 MB
The large block size helps in maintaining high
throughput
Each Block is replicated across multiple machines in the
cluster for redundancy
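The block size and replication factor of a stored file can also be inspected programmatically. Below is a minimal sketch using the Hadoop FileSystem Java API; the ShowBlocks class name and the use of a command-line path argument are assumptions, not part of the course material.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        System.out.println("Block size  : " + status.getBlockSize());
        System.out.println("Replication : " + status.getReplication());
        // Show where each block of the file is physically stored
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("Block at offset " + b.getOffset() + " on hosts "
                    + java.util.Arrays.toString(b.getHosts()));
        }
        fs.close();
    }
}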
39
[Diagram: HDFS read path: a client asks the NameNode for block locations and then reads the data blocks directly from the DataNodes in the Hadoop cluster; a Secondary NameNode sits alongside the NameNode.]
40
DataNodes store and retrieve the blocks for files when they are told to by the
NameNode
The NameNode maintains the information on which DataNodes all the blocks for a
given file are located
DataNodes report to the NameNode periodically with the list of blocks they are
storing
With the NameNode down, HDFS is inaccessible
Secondary NameNode
Not a backup for the NameNode
Just helps in merging the filesystem image with the edit log, to keep the edit log
from becoming too large
41
Hands On
Configure HDFS file system for hadoop
Format HDFS
Start & Verify HDFS services
Verify HDFS
Stop HDFS services
Change replication
42
Set up core-site.xml
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost/</value>
</property>
</configuration>
Add the fs.default.name property under the configuration tag to specify the NameNode location:
localhost for pseudo-distributed mode. The NameNode runs at port 8020 by default if no
port is specified.
43
Starting HDFS
Format NameNode
>>hadoop namenode -format
Creates empty file system with storage directories and
persistent data structures
Data nodes are not involved
>>hadoop fs -ls
>>hadoop fsck / -files -blocks
>>hadoop fs -mkdir testdir
44
Verify HDFS
List / Check HDFS again
>>hadoop fs -ls
>>hadoop fsck / -files -blocks
45
Property Name        Default Value
dfs.name.dir         ${hadoop.tmp.dir}/dfs/name
dfs.data.dir         ${hadoop.tmp.dir}/dfs/data
fs.checkpoint.dir    ${hadoop.tmp.dir}/dfs/namesecondary
47
Removing Directory
>>hadoop fs -rmr <dirname>
>>hadoop fs -ls <dirname>
>>hadoop fsck / -files -blocks
48
Hands On
Create data directories for
NameNode
Secondary NameNode
DataNode
50
HDFS Web UI
52
MapReduce
53
MapReduce
A distributed parallel processing engine of Hadoop
Processes the data in two sequential, parallel steps: Map and Reduce
Mapper outputs with the same key are sent to the same reducer
Input to Reducer is always sorted by key
Number of mappers and reducers per node can be configured
54
[Diagram: word-count example: the input "If you go up and down / The weight go down and the health go up" is split across three computers; each map task emits (word, 1) pairs as (K2, V2), the pairs for the same word are shuffled to the same reducer, and the reducers produce the final (K3, V3) counts: and 2, down 2, go 3, if 1, up 2, you 1, the 2, health 1, weight 1.]
55
[Diagram: MapReduce runtime: a JobTracker coordinates, over the network, a set of TaskTrackers; each TaskTracker offers map and reduce slots and reads/writes HDFS.]
57
Hands On
Configure MapReduce
Start MapReduce daemons
Verify the daemons
Stop the daemons
58
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>
Add the mapred.job.tracker property under the configuration tag to specify the JobTracker
location: localhost:8021 for pseudo-distributed mode.
59
60
MapReduce
Programming
61
MapReduce Programming
[Diagram: structure of a MapReduce program: a Driver sets the execution parameters and submits the job; the framework runs the Map and Reduce phases and produces the output.]
62
63
Map Function
The Map function is represented by the Mapper class, which declares
a map() method for you to override
The Mapper class is a generic type with four type parameters for the
input and output key/value pairs
Mapper<K1, V1, K2, V2>
K1, V1 are the types of the input key/value pair
K2, V2 are the types of the output key/value pair
Hadoop provides its own types that are optimized for network
serialization
Text
LongWritable
IntWritable
while (itr.hasMoreTokens()) {
word.set(itr.nextToken().toLowerCase());
context.write(word, one);
}
}
}
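For reference, the fragment above is the inner loop of a map() implementation; a minimal, self-contained word-count Mapper along the same lines might look as follows (the class and field names are assumptions):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token in the input line
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken().toLowerCase());
            context.write(word, one);
        }
    }
}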
65
Reduce Function
The Reduce function is represented by the Reducer class, which
declares a reduce() method for you to override
The Reducer class is a generic type with four type parameters for the
input and output key/value pairs
Reducer<K2, V2, K3, V3>
K2, V2 are the types of the input key/value pair; they must match
the output types of the Mapper
K3, V3 are the types of the output key/value pair
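A minimal word-count Reducer sketch matching the Reducer<K2, V2, K3, V3> signature described above (the class and variable names are assumptions):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s emitted by the mappers for this word
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}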
67
The Job object gives you control over how the job is run
Set the jar file containing the mapper and reducer for distribution
around the cluster
job.setJarByClass(WordCount.class);
}
}
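A minimal driver sketch that ties the pieces together, assuming the hypothetical WordCountMapper and WordCountReducer classes from the sketches above and the new (org.apache.hadoop.mapreduce) API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");        // Job(conf, name) constructor of the 1.x API
        job.setJarByClass(WordCount.class);           // jar containing mapper/reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}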
69
70
Combiner
A Combiner function helps aggregate the map output
before it is passed on to the reduce function
Reduces the intermediate data to be written to disk
Reduces the data to be transferred over the network
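For word count, the reducer can typically double as the combiner, since summing is associative and commutative. A usage sketch, assuming the WordCountReducer class from the earlier sketch:
job.setCombinerClass(WordCountReducer.class);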
71
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
72
Partitioning
Map tasks partition their output keys across the number of
reducers
There can be many keys in a partition
All records for a given key will be in a single partition
A Partitioner class controls partitioning based on the key
Hadoop uses hash partitioning by default (HashPartitioner)
Partitioner Example
public class WordPartitioner extends Partitioner <Text, IntWritable>{
@Override
public int getPartition(Text key, IntWritable value, int numPartitions) {
String ch = key.toString().substring(0,1);
/*if (ch.matches("[abcdefghijklm]")) {
return 0;
} else if (ch.matches("[nopqrstuvwxyz]")) {
return 1;
}
return 2;*/
//return (ch.charAt(0) % numPartitions); // partition by the ASCII value of the first character
return 0; // default behavior
}
}
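To activate a custom partitioner such as the WordPartitioner above, the driver would typically also set the number of reduce tasks (the values below are illustrative):
job.setPartitionerClass(WordPartitioner.class);
job.setNumReduceTasks(3);   // partitioning only matters with more than one reducer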
74
OR mapred.reduce.tasks=10
One Reducer
The map output data is not partitioned; all key/value pairs reach
the only reducer
Only one output file is created
The output file is sorted by key
A good way of combining files or producing a sorted output for
small amounts of data
Data Types
Keys are compared with each other during the sorting phase
The RawComparator registered for the key type is used for the comparison
public interface RawComparator<T> extends Comparator<T> {
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}
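As an illustration of plugging in a comparator, the sketch below sorts IntWritable keys in descending order by extending WritableComparator (which implements RawComparator); the class name is an assumption:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class DescendingIntComparator extends WritableComparator {
    protected DescendingIntComparator() {
        super(IntWritable.class, true);   // true = deserialize into objects for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Reverse the natural IntWritable ordering
        return -((IntWritable) a).compareTo((IntWritable) b);
    }
}

// In the driver: job.setSortComparatorClass(DescendingIntComparator.class);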
76
Data Types
Writable wrapper classes for Java primitives:

Java primitive   Writable implementation   Serialized size (bytes)
boolean          BooleanWritable           1
byte             ByteWritable              1
short            ShortWritable             2
int              IntWritable               4
                 VIntWritable              1 to 5
float            FloatWritable             4
long             LongWritable              8
                 VLongWritable             1 to 9
double           DoubleWritable            8
NullWritable
A special Writable class with zero-length
serialization
Used as a placeholder for a key/value
when you do not need to use that
position
77
78
Input Formats
An Input Format determines how the input data is to be
interpreted and passed on to the mapper
Based on the Input Format, the input data is divided into
chunks called splits
Each split is processed by a separate map task
Each split is in turn divided into records based on the Input
Format; one record is passed with each map() call
The key and the value of each input record (including their types)
are determined by the Input Format
79
Input Formats
80
TextInputFormat
KeyValueTextInputFormat
NLineInputFormat
CombineFileInputFormat (meant for lots of small files, to avoid too many splits)
Binary
SequenceFileInputFormat
81
Input to Mapper: LongWritable key, Text value
NLineInputFormat
Each split contains a fixed number of lines
The default is one line per split, which can be changed by setting the
property
mapreduce.input.lineinputformat.linespermap
CombineFileInputFormat
A split can consist of multiple files (based on a maximum split size)
Typically used for lots of small files
This is an abstract class; one needs to subclass it to use it
83
SequenceFileInputFormat
Enables reading data from a Sequence File
Can read MapFiles as well
Variants of SequenceFileInputFormat
SequenceFileAsTextInputFormat
Converts keys and values into Text objects
SequenceFileAsBinaryInputFormat
Retrieves the keys and values as BytesWritable objects
84
85
Output Formats
OutputFormat class hierarchy
86
FileBased
FileOutputFormat is the base class
FileOutputFormat offers a static method for setting the output path
FileOutputFormat.setOutputPath(job, path);
One file per reducer is created (default file name: part-r-nnnnn),
where nnnnn is an integer designating the part number, starting from zero
TextOutputFormat
SequenceFileOutputFormat
SequenceFileAsBinaryOutputFormat
MapFileOutputFormat
NullOutputFormat
DBOutputFormat
Output format to dump output data to RDBMS through JDBC
87
Lazy Output
FileOutputFormat subclasses will create output files,
even if there is no record to write
LazyOutputFormat can be used to delay output file
creation until there is a record to write
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class)
Instead of
job.setOutputFormatClass(TextOutputFormat.class)
88
89
Counters
Useful means of
Monitoring job progress
Gathering statistics
Problem diagnosis
Dynamic counters
Counters can also be created on the fly, without predefining them as enums
context.getCounter(groupName, counterName).increment(1);
Effective only for small amounts of counter data (a few KB); otherwise they put
pressure on the memory of the daemons
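A minimal sketch of a predefined (enum) counter incremented from a mapper; the RecordQuality enum and the malformed-record check are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ParseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    enum RecordQuality { MALFORMED, MISSING_FIELD }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Count records that do not have the expected number of fields
        if (value.toString().split(",").length < 3) {
            context.getCounter(RecordQuality.MALFORMED).increment(1);
            return;
        }
        // ... normal processing ...
    }
}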
92
Multiple Inputs
Often in real life you get related data from different
sources in different formats
Hadoop provides the MultipleInputs class to handle this
situation
MultipleInputs.addInputPath(job, inputPath1, <inputformat>.class);
MultipleInputs.addInputPath(job, inputPath2, <inputformat>.class);
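MultipleInputs also offers an overload that additionally takes a Mapper class per input path, so that differently formatted sources each get their own mapper (SalesMapper and ReturnsMapper are hypothetical names):
MultipleInputs.addInputPath(job, salesPath, TextInputFormat.class, SalesMapper.class);
MultipleInputs.addInputPath(job, returnsPath, TextInputFormat.class, ReturnsMapper.class);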
Joins
Two or more record sets are to be joined based on a key
There are two techniques for joining data in MapReduce
Map-side join (replicated join)
Possible only when
one of the data sets is small enough to be distributed to the data
nodes and fits into memory, so each map can join independently, OR
both data sets are partitioned in such a way that they have an equal
number of partitions, sorted by the same key, and all records for a given key
reside in the same partition
The smaller data set is used for the lookup using the join key (see the sketch after this list)
Faster, as the lookup data is loaded into memory
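A minimal sketch of a replicated (map-side) join, assuming the driver has placed a small customers file (cust_id,name per line) in the DistributedCache; the file layout, field positions and class names are illustrative, not from the course material:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReplicatedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Map<String, String> customers = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Load the small data set into memory once per map task
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split(",");
            customers.put(parts[0], parts[1]);      // cust_id -> name
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] txn = value.toString().split(","); // txn_id, cust_id, amount, ...
        String name = customers.get(txn[1]);        // in-memory lookup on the join key
        if (name != null) {
            context.write(new Text(txn[1]), new Text(name + "\t" + value.toString()));
        }
    }
}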
95
Joins
Reduce-side join
The mapper tags the records from the two data sets distinctly
The join key is used as the map output key
The records for the same key are brought together in the
reducer, and the reducer completes the joining process
Less efficient, as both data sets have to go through the
MapReduce shuffle
96
Job Chaining
Multiple jobs can be run in a linear or complex dependent fashion
The simple way is to call the job drivers one after the other with their
respective configurations
JobClient.runJob(conf1);
JobClient.runJob(conf2);
Here the second job is not launched until the first job has completed
With JobControl, dependencies can be declared explicitly (see the sketch below):
cjob2.addDependingJob(cjob1);
JobControl jc = new JobControl("Chained Job");
jc.addJob(cjob1);
jc.addJob(cjob2);
jc.run();
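A fuller sketch of the JobControl approach shown above, using the old-API jobcontrol classes; the JobConf setup for the two jobs is omitted, and the polling loop is one common pattern rather than the only one:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        JobConf conf1 = new JobConf();   // configure mapper/reducer/paths for the first job here
        JobConf conf2 = new JobConf();   // configure the second job here
        Job cjob1 = new Job(conf1);
        Job cjob2 = new Job(conf2);
        cjob2.addDependingJob(cjob1);                  // cjob2 starts only after cjob1 succeeds
        JobControl jc = new JobControl("Chained Job");
        jc.addJob(cjob1);
        jc.addJob(cjob2);
        new Thread(jc).start();                        // JobControl is a Runnable scheduler
        while (!jc.allFinished()) {
            Thread.sleep(500);                         // poll until both jobs have finished
        }
        jc.stop();
    }
}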
97
Speculative Execution
A MapReduce job's execution time is typically determined
by the slowest running task
The job is not complete until all tasks are completed
One slow task can bring down the overall performance of the job
Speculative execution can be enabled or disabled separately for map and reduce
tasks by setting mapred.map.tasks.speculative.execution and
mapred.reduce.tasks.speculative.execution to true/false
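For example, speculative execution could be disabled from the driver (property names as used in Hadoop 1.x):
job.getConfiguration().setBoolean("mapred.map.tasks.speculative.execution", false);
job.getConfiguration().setBoolean("mapred.reduce.tasks.speculative.execution", false);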
99
101
103
Disadvantages of MapReduce
MapReduce (the Java API) is difficult to program and has a long
development cycle
Trivial operations like join and filter need to be re-expressed in
map/reduce, key/value terms
Being locked to Java makes it practically impossible for data
analysts to work with Hadoop
There are several abstraction layers on top of
MapReduce that make working with Hadoop simpler;
Pig and Hive are at the leading front
104
PIG
105
PIG
Pig is an abstraction layer on top of MapReduce that frees analysts
from the complexity of MapReduce programming
Architected towards handling unstructured and semi-structured
data
It is a dataflow language: the data is processed in a
sequence of steps, each transforming the data
The transformations support relational-style operations such as
filter, union, group and join
Designed to be extensible and reusable
Programmers can develop their own functions (UDFs) and use them
Programmer friendly
Allows you to introspect data structures
Can do a sample run on a representative subset of your input
PIG Architecture
Pig runs as a client-side application; there is no need to
install anything on the cluster
[Diagram: Pig architecture: Pig scripts and the Grunt shell run on the client and are compiled into MapReduce jobs that execute on the Hadoop cluster.]
107
Verify Installation
>>pig -help
Displays command usage
>>pig
Takes you into Grunt shell
grunt>
108
Local Mode
MapReduce Mode
In this mode the queries are translated into MapReduce jobs
and run on the Hadoop cluster
The Pig version must be compatible with the Hadoop version
Set the HADOOP_HOME environment variable to indicate to Pig
which Hadoop client to use
export HADOOP_HOME=$HADOOP_INSTALL
If not set, Pig uses its bundled version of Hadoop
109
Grunt
An interactive shell for running Pig commands
Grunt is started when the pig command is run without any
options
Script
Pig commands can be executed directly from a script file
>>pig pigscript.pig
It is also possible to run Pig scripts from Grunt shell using run
and exec.
Embedded
You can run Pig programs from Java using the PigServer class,
much like you use JDBC from Java (see the sketch below)
For programmatic access to Grunt, use PigRunner
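A minimal sketch of the embedded approach with PigServer; the input file, schema and output path are assumptions:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);   // or ExecType.LOCAL for local mode
        pig.registerQuery("cust = LOAD 'customers.csv' USING PigStorage(',') "
                + "AS (id:int, name:chararray, age:int);");
        pig.registerQuery("teens = FILTER cust BY age < 20;");
        pig.store("teens", "output/teens");                  // triggers the compiled MapReduce job(s)
    }
}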
110
An Example
A Sequence of transformation steps to get the end result
LOAD
FILTER
GROUP
AGGREGATE
Data Types
Simple Types

Category   Type        Description
Numeric    int         32-bit signed integer
           long        64-bit signed integer
           float       32-bit floating-point number
           double      64-bit floating-point number
Text       chararray   Character array in UTF-16 format
Binary     bytearray   Byte array
112
Data Types
Complex Types

Type    Description                                                                            Example
tuple   Sequence of fields of any type                                                         (1,'pomegranate')
bag     An unordered collection of tuples, possibly with duplicates                            {(1,'pomegranate'),(2)}
map     A set of key-value pairs; keys must be character arrays, but values may be any type    ['a'#'pomegranate']
113
LOAD Operator
114
Diagnostic Operators
DESCRIBE
Describes the schema of a relation
EXPLAIN
Display the execution plan used to compute a relation
ILLUSTRATE
Illustrate step-by-step how data is transformed
Uses a sample of the input data to simulate the execution.
115
LIMIT
Limits the number of tuples from a relation
DUMP
Display the tuples from a relation
STORE
Store the data from a relation into a directory.
The directory must not already exist
116
Relational Operators
FILTER
Selects tuples based on Boolean expression
teenagers = FILTER cust BY age < 20;
ORDER
Sort a relation based on one or more fields
Further processing (FILTER, DISTINCT, etc.) may destroy the
ordering
ordered_list = ORDER cust BY name DESC;
DISTINCT
Removes duplicate tuples
unique_custlist = DISTINCT cust;
117
Relational Operators
GROUP BY
Within a relation, group tuples with the same group key
GROUP ALL will group all tuples into one group
groupByProfession=GROUP cust BY profession
groupEverything=GROUP cust ALL
FOREACH
Loop through each tuple in nested_alias and generate new
tuple(s).
countByProfession = FOREACH groupByProfession GENERATE
group, COUNT(cust);
Built in aggregate functions AVG, COUNT, MAX, MIN, SUM
118
Relational Operators
GROUP BY
Within a relation, group tuples with the same group key
GROUP ALL will group all tuples into one group
groupByProfession=GROUP cust BY profession
groupEverything=GROUP cust ALL
FOREACH
Loop through each tuple in nested_alias and generate new
tuple(s).
At least one of the fields of nested_alias should be a bag
DISTINCT, FILTER, LIMIT, ORDER, and SAMPLE are allowed
operations in nested_op to operate on the inner bag(s).
countByProfession = FOREACH groupByProfession GENERATE
group, COUNT(cust);
Built in aggregate functions AVG, COUNT, MAX, MIN, SUM
119
JOIN
Compute inner join of two or
more relations based on common
field values.
>>DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
>>DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(7,9)
>>X = JOIN A BY a1, B BY b1;
>>DUMP X;
(1,2,3,1,3)
(8,3,4,8,9)
(7,2,5,7,9)
120
COGROUP
Group tuples from two or
more relations, based on
common group values.
>>DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
>>DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(7,9)
(1, {(1,2,3)}, {(1,3)})
(8, {(8,3,4)}, {(8,9)})
(7, {(7,2,5)}, {(7,9)})
(2, {}, {(2,4),(2,7)})
(4, {(4,2,1),(4,3,3)}, {})
121
UNION
Creates the union of two or
more relations
>>DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
>>DUMP B;
(2,4)
(8,9)
>>X = UNION A, B;
>>DUMP X;
(1,2,3)
(4,2,1)
(8,3,4)
(2,4)
(8,9)
SPLIT
(8,3,4)
(8,9)
122
SAMPLE
Randomly samples a relation as per given sampling factor.
There is no guarantee that the same number of tuples are
returned every time.
123
UDFs
Pig lets users define their own functions and use them
in statements
The UDFs can be developed in Java, Python or
JavaScript
Filter UDF
Must subclass FilterFunc, which is itself a subclass of EvalFunc (see the sketch after this list)
Eval UDF
Must subclass EvalFunc
public abstract class EvalFunc<T> {
public abstract T exec(Tuple input) throws IOException;
}
Load UDF
Must subclass LoadFunc
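A minimal Filter UDF sketch: it keeps tuples whose first field (assumed to be an age) is under 20; the class name and field layout are hypothetical. In a Pig script, the jar would be registered with REGISTER and the function then used in a FILTER expression.

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class IsTeenager extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        // Tuples without a usable age field are filtered out
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return false;
        }
        int age = (Integer) input.get(0);
        return age < 20;
    }
}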
Macros
IMPORT '<path>/<macrofile>';
125
HIVE
126
HIVE
A data warehousing framework built on top of Hadoop
Abstracts MapReduce complexity away from the user
Target users are generally data analysts who are
comfortable with SQL
Provides an SQL-like language called HiveQL
Hive is meant only for structured data
You can interact with Hive using several methods (a JDBC sketch follows this list)
CLI (Command Line Interface)
A Web GUI
JDBC
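A minimal JDBC sketch, assuming a Hive server (HiveServer1, as shipped with Hive 0.x) is listening on localhost:10000 and that the retail_trans table from the later hands-on exists:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // The query is compiled into MapReduce jobs by Hive
        ResultSet rs = stmt.executeQuery(
                "SELECT category, COUNT(*) FROM retail_trans GROUP BY category");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}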
127
HIVE Architecture
[Diagram: Hive architecture: CLI, Web GUI and JDBC clients submit HiveQL to Hive, whose parser/planner/optimizer uses the metastore and compiles queries into MapReduce jobs that run on the Hadoop cluster.]
128
Configure
Environment variables to add in .bash_profile
export HIVE_INSTALL=/<parent directory path>/hive-x.y.z
export PATH=$PATH:$HIVE_INSTALL/bin
Verify Installation
>>hive -help
Displays command usage
>>hive
Takes you into hive shell
hive>
129
If not set, they default to the local file system and the local
(in-process) job runner - just like they do in Hadoop
130
Metastore
Out of the box, Hive comes with the lightweight SQL database
Derby to store and manage metadata
This can be configured to use other databases, such as MySQL
131
132
MAPS
ARRAYS
[a, b, c]
133
Tables
A Hive table is logically made up of the data being
stored and the associated metadata
Creating a Table
CREATE TABLE emp_table (id INT, name STRING, address STRING)
PARTITIONED BY (designation STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;
Loading Data
LOAD DATA INPATH '/home/hadoop/employee.csv'
OVERWRITE INTO TABLE emp_table;
Hands On
Create retail, customers tables
hive> CREATE DATABASE retail;
hive> USE retail;
hive> CREATE TABLE retail_trans (txn_id INT, txn_date STRING, Cust_id INT,
Amount FLOAT, Category STRING, Sub_Category STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
135
Hands On
Load data and run queries
hive> LOAD DATA INPATH 'retail/txn.csv' INTO TABLE retail_trans;
hive> LOAD DATA INPATH 'retail/custs.csv' INTO TABLE customers;
hive> SELECT Category, count(*) FROM retail_trans
GROUP BY Category;
hive> SELECT Category, count(*) FROM retail_trans WHERE Amount > 100
GROUP BY Category;
hive> SELECT Concat (cu.FirstName, ' ', cu.LastName), rt.Category, count(*)
FROM retail_trans rt JOIN customers cu
ON rt.cust_id = cu.cust_id
GROUP BY cu.FirstName, cu.LastName, rt.Category;
136
Queries
SELECT
SELECT id, name FROM emp_table WHERE designation =
'manager';
SELECT count(*) FROM emp_table;
SELECT designation, count(*) FROM emp_table
GROUP BY designation;
INSERT
INSERT OVERWRITE TABLE new_emp (SELECT * FROM
emp_table WHERE id > 100);
Inserting local directory
INSERT OVERWRITE LOCAL DIRECTORY 'tmp/results' (SELECT * FROM
emp_table WHERE id > 100);
JOIN
SELECT emp_table.*, detail.age FROM emp_table JOIN detail
ON (emp_table.id = detail.id);
137
Bucketing
Bucketing imposes extra structure on the table,
which makes sampling more efficient
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
138
UDFs
UDFs have to be written in Java
They have to subclass UDF
(org.apache.hadoop.hive.ql.exec.UDF)
A UDF must implement at least one evaluate() method.
public class Strip extends UDF {
public Text evaluate(Text str) {
// ... (strip logic elided on the slide)
return str1;
}
}
ADD JAR /path/to/hive-examples.jar;
CREATE TEMPORARY FUNCTION strip AS
'com.hadoopbook.hive.Strip';
SELECT strip(' bee ') FROM dummy;
139
SQOOP
140
SQOOP
Configure
Environment variables to add in .bash_profile
export SQOOP_HOME=/<parent directory path>/sqoop-x.y.z
export PATH=$PATH:$SQOOP_HOME /bin
Verify Installation
>>sqoop
>>sqoop help
141
Importing Data
[Diagram: Sqoop import: the Sqoop client (1) examines the table schema in the RDBMS, (2) generates a Java class (MyClass.java) representing a record, then launches map tasks on the Hadoop cluster to pull the data into HDFS.]
142
Importing Data
Copy the MySQL JDBC driver to Sqoop's lib directory
Sqoop does not come with the JDBC driver
Sample import
>>sqoop import --connect jdbc:mysql://localhost/retail
--table transactions -m 1
>>hadoop fs -ls transactions
The import tool runs a MapReduce job that connects to
the database and reads the table
By default, four map tasks are used
The output is written to a directory named after the table, under the user's
HDFS home directory
Codegen
The code can also be generated without import action
>>sqoop codegen --connect jdbc:mysql://localhost/hadoopguide
--table widgets --class-name Widget
The generated class can hold a single record retrieved from
the table
The generated code can be used in MapReduce programs to
manipulate the data
144
Administration
146
${dfs.name.dir}/
current/
VERSION
edits
fsimage
fstime
147
${fs.checkpoint.dir}/
current/
VERSION
edits
fsimage
fstime
previous.checkpoint/
VERSION
edits
fsimage
fstime
blk_<id_2>
blk_<id_2>.meta
Subdir0/
Subdir1/
148
Option to either
move (to lost+found)
or delete affected
files
hadoop fsck / -move
hadoop fsck / -delete
151
HDFS Balancer
Logging
All Hadoop daemons produce their respective log files
Log files are stored under $HADOOP_INSTALL/logs
The location can be changed by setting the
HADOOP_LOG_DIR variable in hadoop-env.sh
Job Tracker
log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG
Stack Traces
The stack traces for all the Hadoop daemons can be obtained from the
/stacks page of the web UI that each daemon exposes
The JobTracker stack trace can be found at
http://<jobtracker-host>:50030/stacks
153
Data Backup
HDFS replication is not a substitute for data backup
As the data volume is very high, it is good practice to prioritize
the data to be backed up
Business-critical data
Data that cannot be regenerated
Decommissioning of Nodes
Several options
Build your own cluster from scratch
Use offerings that provide hadoop as a service on cloud
Memory
Storage
Network
Gigabit Ethernet
Network Topology
1GB Switch
Network Topology
For a multi-rack cluster, the admin needs to map nodes to
racks so that Hadoop is network-aware and can place data, as well as
MapReduce tasks, as close as possible to the data
Two ways to define the network map
Implement the Java interface DNSToSwitchMapping (a sketch follows)
public interface DNSToSwitchMapping {
public List<String> resolve(List<String> names);
}
Have the property topology.node.switch.mapping.impl point to the
implementing class. The NameNode and JobTracker will make use of this
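A minimal sketch of implementing the interface above; the hostname-prefix rule (rack1-*) is purely hypothetical and would be replaced by whatever naming or lookup scheme the cluster actually uses:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

public class PrefixRackMapping implements DNSToSwitchMapping {
    @Override
    public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<String>();
        for (String name : names) {
            // Map each host name or IP to a rack path
            if (name.startsWith("rack1-")) {
                racks.add("/rack1");
            } else {
                racks.add("/rack2");
            }
        }
        return racks;
    }
}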
Configure
Generate an RSA key pair and share the public key on all nodes
Configure Hadoop. A better way of doing this is by using tools like Chef
or Puppet
161
163
164
Security
Hadoop uses Kerberos for authentication
Kerberos does not manage the
permissions for Hadoop
To enable Kerberos
authentication, set the property
hadoop.security.authentication
in core-site.xml to kerberos
Enable service-level
authorization by setting
hadoop.security.authorization
to true in the same file
To control which users and groups can do what, configure
Access Control Lists (ACLs) in hadoop-policy.xml
165
Security Policies
Allow only alice, bob and users in the mapreduce group to submit jobs
<property>
<name>security.job.submission.protocol.acl</name>
<value>alice,bob mapreduce</value>
</property>
Recommended Readings
167