Raghavan Solium
Big Data Consultant
raghavan.solium@gmail.com
Day - 1
HDFS: The Hadoop Distributed File System
Building Blocks
Name Node & Data Node
Starting HDFS Services
HDFS Commands
Hands on
Configure HDFS
Start & Examine the daemons
Export & Import files into HDFS
MapReduce Workflow
Job Tracker & Task Tracker
Starting MapReduce Services
Hands on
Configure MapReduce
Start & Examine the daemons
Day - 2
MapReduce Programming
Java API
Data Types
Input & Output Formats
Hands on
Advanced Topics
Combiner
Partitioner
Counters
Compression, Speculative Execution, Zero & One Reducer
Distributed Cache
Job Chaining
HDFS Federation
HDFS HA
Hadoop Cluster Administration
Day - 3
Pig
Hive
What is Hive?
Hive Architecture
Install & Configure Hive
Hive Data Models
Hive Metastore
Partitioning and Bucketing
Hands On
Day - 4
Sqoop
What is Sqoop
Install & Configure Sqoop
Import & Export
Hands On
Day - 4
Hadoop Administration
Let's Define
Variety
Sensor Data
Machine logs
Social media data
Scientific data
RFID readers
sensor networks
vehicle GPS traces
Retail transactions
Volume
The New York Stock Exchange has
several petabytes of data for analysis
Facebook hosts approximately 10
billion photos, taking up one
petabyte of storage.
At the end of 2010 the Large Hadron
Collider near Geneva, Switzerland had
about 150 petabytes of data
Velocity
The New York Stock Exchange
generates about one terabyte of new
trade data every day
The Large Hadron Collider produces
about 15 petabytes of data per year
Weather sensors collect data every
hour at many locations across the
globe and gather a large volume of
log data
8
Inflection Points
Data Storage
Big Data ranges from several Terabytes to Petabytes.
At these volumes, the access speed of storage devices dominates overall analysis time.
A terabyte of data takes about 2.5 hours to read from a 100 MB/s drive
Writing will be even slower
Analysis
Much of Big Data is unstructured; traditional RDBMS/EDW systems cannot handle it
A lot of Big Data analysis is ad hoc in nature and involves whole-data scans, self-references,
joins, combinations, etc.
Traditional RDBMS/EDW systems cannot handle these workloads, given their limited scalability
options and architectural limitations
You can buy better servers and processors and throw in more RAM, but there
is a limit to it
9
Inflection Points
We need a drastically different approach
A distributed file system with high capacity and high reliability
A processing engine that can handle structured/unstructured
data
A computation model that can operate on distributed data
and abstracts away data dispersion
PRAM and MapReduce are such models
10
[Diagram: MapReduce data flow: an input file is divided into splits; a map task on each computer turns (K1, V1) records into intermediate (K2, V2) key/value pairs, which are sorted and shuffled so that each reduce task receives all values for a key and writes (K3, V3) output files (Part 1, Part 2).]
11
MapReduce Model
A MapReduce implementation should have
The ability to initiate and monitor parallel processes and
coordinate between them
A mechanism to route all map outputs with the same key
to a single reduce process
The ability to recover from any failures transparently
13
14
              Traditional RDBMS            MapReduce
Data size     Gigabytes                    Petabytes
Access        Interactive and batch        Batch
Updates       Read and write many times    Write once, read many times
Structure     Static schema                Dynamic schema
Integrity     High                         Low
Scaling       Nonlinear                    Linear
(Some of these points are debatable, as the Big Data and Hadoop ecosystems are evolving fast and moving to a higher degree of
maturity and flexibility. For example, HBase brings in the ability to serve point queries.)
15
Retail
Trend analysis, Personalized promotions
Machine Learning
Log Analytics
16
What is Apache Hadoop and how can it help with Big Data?
It is an open-source Apache project for handling Big Data
It addresses the data storage and analysis (processing) problems through its
HDFS file system and its implementation of the MapReduce computation model
It is designed for massive scalability and reliability
The model enables leveraging cheap commodity servers, keeping the cost in
check
[Diagram: Hadoop ecosystem: Pig, Hive and Mahout sit on top of the MapReduce framework; Sqoop brings in structured data from relational stores, while Flume ingests unstructured data such as log files.]
18
MapReduce:
A distributed data processing model and execution environment that runs on large
clusters of commodity machines.
HDFS: A distributed file system that runs on large clusters of commodity machines.
Pig:
A data flow language and execution environment for exploring very large datasets. Pig runs on
HDFS and MapReduce clusters.
Hive:
A distributed data warehouse. Hive manages data stored in HDFS and provides a query
language based on SQL (and which is translated by the runtime engine to MapReduce jobs) for
querying the data.
HBase:
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and
supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such
as distributed locks that can be used for building distributed applications.
Sqoop:
A tool for efficient bulk transfer of data between structured data stores (such as relational
databases) and HDFS.
Oozie:
A service for running and scheduling workflows of Hadoop jobs (including Map-Reduce, Pig,
Hive, and Sqoop jobs).
19
Hadoop Requirements
Supported Platforms
GNU/Linux is supported as a development and production platform
Win32 is supported as a development platform only
Cygwin is required for running on Windows
Required Software
Java 1.6.x
ssh must be installed and sshd must be running (for launching the
daemons on the cluster with passwordless login)
Development Environment
Eclipse 3.5 or above
20
Lab Requirements
Windows 7 - 64 bit OS, Min 4 GB Ram
VMWare Player 5.0.0
Linux VM - Ubuntu 12.04 LTS
User: hadoop, Password: hadoop123
Java 6 installed on Linux VM
Open SSH installed on Linux VM
Putty - For opening Telnet sessions to the Linux VM
WinSCP - For transferring files between Windows / VM
Eclipse 3.5
21
Hands On
Using the VM
Install & Configure hadoop
22
Starting VM
23
Starting VM
24
Install ssh
>>sudo apt-get install ssh
Generate an RSA key pair with an empty passphrase
>>ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Verify ssh
Verify SSH by logging into target (localhost here)
>>ssh localhost
This command should log you into the machine localhost
26
27
28
>>apt-get install openjdk-6-jdk
Check Installation
>>java -version
Install Hadoop
30
Run an Example
Verify Hadoop installation
>> hadoop version
>>cd $HADOOP_INSTALL
>>hadoop jar hadoop-examples-1.0.3.jar
Run without a program name, this lists the example programs contained in the jar file
31
[Diagram: Hadoop daemons: a NameNode, Secondary NameNode and JobTracker, networked to worker nodes that each run a DataNode and a TaskTracker hosting map and reduce slots.]
(Hadoop supports many file systems other than HDFS itself. However, to leverage Hadoop's abilities completely, HDFS is one of
the most reliable choices.)
32
Task Tracker: takes care of local task execution on the local data segment. It
talks to the DataNode for file information and constantly communicates with the JobTracker
daemon to report task progress.
When the Hadoop system is running in a distributed mode, all of these daemons run
on their respective computers.
33
Filename                      Format
hadoop-env.sh                 Bash script
core-site.xml                 Hadoop configuration XML
hdfs-site.xml                 Hadoop configuration XML
mapred-site.xml               Hadoop configuration XML
masters                       Plain text
slaves                        Plain text
hadoop-metrics.properties     Java properties
log4j.properties              Java properties
Property Name        Conf File         Standalone           Pseudo-Distributed   Fully Distributed
fs.default.name      core-site.xml     file:/// (default)   hdfs://localhost/    hdfs://namenode/
dfs.replication      hdfs-site.xml     N/A                  1                    3 (default)
mapred.job.tracker   mapred-site.xml   local (default)      localhost:8021       jobtracker:8021
36
HDFS
37
Design of HDFS
HDFS is Hadoop's distributed file system
Designed for storing very large files (with sizes up to petabytes)
A single file can be stored across several disks
Designed for streaming data access patterns
Not suitable for low-latency data access
Designed to be highly fault tolerant, hence can run on
commodity hardware
38
HDFS Concepts
Like any file system, HDFS stores files by breaking them
into smaller units called blocks
The default HDFS block size is 64 MB
The large block size helps in maintaining high
throughput
Each Block is replicated across multiple machines in the
cluster for redundancy
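The block size and replication factor of a stored file can also be inspected programmatically. Below is a minimal sketch using the Hadoop FileSystem Java API; the ShowBlocks class name and the use of a command-line path argument are assumptions, not part of the course material.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        System.out.println("Block size  : " + status.getBlockSize());
        System.out.println("Replication : " + status.getReplication());
        // Show where each block of the file is physically stored
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("Block at offset " + b.getOffset() + " on hosts "
                    + java.util.Arrays.toString(b.getHosts()));
        }
        fs.close();
    }
}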
39
[Diagram: HDFS read path: a client asks the NameNode for block locations and then reads the data blocks directly from the DataNodes in the Hadoop cluster; a Secondary NameNode sits alongside the NameNode.]
40
DataNodes store and retrieve the blocks for files when they are told to by the
NameNode
The NameNode maintains the information on which DataNodes all the blocks for a
given file are located
DataNodes report to the NameNode periodically with the list of blocks they are
storing
With the NameNode down, HDFS is inaccessible
Secondary NameNode
Not a backup for the NameNode
Just helps in merging the filesystem image with the edit log, to keep the edit log
from becoming too large
41
Hands On
Configure HDFS file system for hadoop
Format HDFS
Start & Verify HDFS services
Verify HDFS
Stop HDFS services
Change replication
42
Set up core-site.xml
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost/</value>
</property>
</configuration>
Add the fs.default.name property under the configuration tag to specify the NameNode location:
localhost for pseudo-distributed mode. The NameNode runs at port 8020 by default if no
port is specified.
43
Starting HDFS
Format NameNode
>>hadoop namenode -format
Creates empty file system with storage directories and
persistent data structures
Data nodes are not involved
>>hadoop fs -ls
>>hadoop fsck / -files -blocks
>>hadoop fs -mkdir testdir
44
Verify HDFS
List / Check HDFS again
>>hadoop fs -ls
>>hadoop fsck / -files -blocks
45
Property Name        Default Value
dfs.name.dir         ${hadoop.tmp.dir}/dfs/name
dfs.data.dir         ${hadoop.tmp.dir}/dfs/data
fs.checkpoint.dir    ${hadoop.tmp.dir}/dfs/namesecondary
47
Removing Directory
>>hadoop fs -rmr <dirname>
>>hadoop fs -ls <dirname>
>>hadoop fsck / -files -blocks
48
Hands On
Create data directories for
NameNode
Secondary NameNode
DataNode
50
HDFS Web UI
52
MapReduce
53
MapReduce
A distributed parallel processing engine of Hadoop
Processes the data in two sequential, parallel steps: Map and Reduce
Mapper outputs with the same key are sent to the same reducer
Input to Reducer is always sorted by key
Number of mappers and reducers per node can be configured
54
[Diagram: word-count example: the input "If you go up and down / The weight go down and the health go up" is split across three computers; each map task emits (word, 1) pairs as (K2, V2), the pairs for the same word are shuffled to the same reducer, and the reducers produce the final (K3, V3) counts: and 2, down 2, go 3, if 1, up 2, you 1, the 2, health 1, weight 1.]
55
[Diagram: MapReduce runtime: a JobTracker coordinates, over the network, a set of TaskTrackers; each TaskTracker offers map and reduce slots and reads/writes HDFS.]
57
Hands On
Configure MapReduce
Start MapReduce daemons
Verify the daemons
Stop the daemons
58
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>
Add the mapred.job.tracker property under the configuration tag to specify the JobTracker
location: localhost:8021 for pseudo-distributed mode.
59
60
MapReduce
Programming
61
MapReduce Programming
[Diagram: structure of a MapReduce program: a Driver sets the execution parameters and submits the job; the framework runs the Map and Reduce phases and produces the output.]
62
63
Map Function
The Map function is represented by the Mapper class, which declares
a map() method for you to override
The Mapper class is a generic type with four type parameters for the
input and output key/value pairs
Mapper<K1, V1, K2, V2>
K1, V1 are the types of the input key/value pair
K2, V2 are the types of the output key/value pair
Hadoop provides its own types that are optimized for network
serialization
Text
LongWritable
IntWritable
while (itr.hasMoreTokens()) {
word.set(itr.nextToken().toLowerCase());
context.write(word, one);
}
}
}
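For reference, the fragment above is the inner loop of a map() implementation; a minimal, self-contained word-count Mapper along the same lines might look as follows (the class and field names are assumptions):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token in the input line
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken().toLowerCase());
            context.write(word, one);
        }
    }
}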
65
Reduce Function
The Reduce function is represented by the Reducer class, which
declares a reduce() method for you to override
The Reducer class is a generic type with four type parameters for the
input and output key/value pairs
Reducer<K2, V2, K3, V3>
K2, V2 are the types of the input key/value pair; they must match
the output types of the Mapper
K3, V3 are the types of the output key/value pair
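A minimal word-count Reducer sketch matching the Reducer<K2, V2, K3, V3> signature described above (the class and variable names are assumptions):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s emitted by the mappers for this word
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}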
67
The Job object gives you control over how the job is run
Set the jar file containing the mapper and reducer for distribution
around the cluster
job.setJarByClass(WordCount.class);
}
}
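A minimal driver sketch that ties the pieces together, assuming the hypothetical WordCountMapper and WordCountReducer classes from the sketches above and the new (org.apache.hadoop.mapreduce) API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");        // Job(conf, name) constructor of the 1.x API
        job.setJarByClass(WordCount.class);           // jar containing mapper/reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}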
69
70
Combiner
A Combiner function helps aggregate the map output
before it is passed on to the reduce function
Reduces the intermediate data to be written to disk
Reduces the data to be transferred over the network
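For word count, the reducer can typically double as the combiner, since summing is associative and commutative. A usage sketch, assuming the WordCountReducer class from the earlier sketch:
job.setCombinerClass(WordCountReducer.class);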
71
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
72
Partitioning
Map tasks partition their output keys across the number of
reducers
There can be many keys in a partition
All records for a given key will be in a single partition
A Partitioner class controls partitioning based on the key
Hadoop uses hash partitioning by default (HashPartitioner)
Partitioner Example
public class WordPartitioner extends Partitioner <Text, IntWritable>{
@Override
public int getPartition(Text key, IntWritable value, int numPartitions) {
String ch = key.toString().substring(0,1);
/*if (ch.matches("[abcdefghijklm]")) {
return 0;
} else if (ch.matches("[nopqrstuvwxyz]")) {
return 1;
}
return 2;*/
//return (ch.charAt(0) % numPartitions); // partition by the ASCII value of the first character
return 0; // default behavior
}
}
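To activate a custom partitioner such as the WordPartitioner above, the driver would typically also set the number of reduce tasks (the values below are illustrative):
job.setPartitionerClass(WordPartitioner.class);
job.setNumReduceTasks(3);   // partitioning only matters with more than one reducer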
74
OR mapred.reduce.tasks=10
One Reducer
The map output data is not partitioned; all key/value pairs reach
the only reducer
Only one output file is created
The output file is sorted by key
A good way of combining files or producing a sorted output for
small amounts of data
Data Types
Keys are compared with each other during the sorting phase
The RawComparator registered for the key type is used for the comparison
public interface RawComparator<T> extends Comparator<T> {
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}
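As an illustration of plugging in a comparator, the sketch below sorts IntWritable keys in descending order by extending WritableComparator (which implements RawComparator); the class name is an assumption:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class DescendingIntComparator extends WritableComparator {
    protected DescendingIntComparator() {
        super(IntWritable.class, true);   // true = deserialize into objects for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Reverse the natural IntWritable ordering
        return -((IntWritable) a).compareTo((IntWritable) b);
    }
}

// In the driver: job.setSortComparatorClass(DescendingIntComparator.class);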
76
Data Types
Writable wrapper classes for Java primitives:

Java primitive   Writable implementation   Serialized size (bytes)
boolean          BooleanWritable           1
byte             ByteWritable              1
short            ShortWritable             2
int              IntWritable               4
                 VIntWritable              1 to 5
float            FloatWritable             4
long             LongWritable              8
                 VLongWritable             1 to 9
double           DoubleWritable            8
NullWritable
A special Writable class with zero-length
serialization
Used as a placeholder for a key/value
when you do not need to use that
position
77
78
Input Formats
An Input Format determines how the input data is to be
interpreted and passed on to the mapper
Based on the Input Format, the input data is divided into
chunks called splits
Each split is processed by a separate map task
Each split is in turn divided into records based on the Input
Format; one record is passed with each map() call
The key and the value of each input record (including their types)
are determined by the Input Format
79
Input Formats
80
TextInputFormat
KeyValueTextInputFormat
NLineInputFormat
CombineFileInputFormat (meant for lots of small files, to avoid too many splits)
Binary
SequenceFileInputFormat
81
Input to Mapper: LongWritable key, Text value
NLineInputFormat
Each split contains a fixed number of lines
The default is one line per split, which can be changed by setting the
property
mapreduce.input.lineinputformat.linespermap
CombineFileInputFormat
A split can consist of multiple files (based on a maximum split size)
Typically used for lots of small files
This is an abstract class; one needs to subclass it to use it
83
SequenceFileInputFormat
Enables reading data from a Sequence File
Can read MapFiles as well
Variants of SequenceFileInputFormat
SequenceFileAsTextInputFormat
Converts keys and values into Text objects
SequenceFileAsBinaryInputFormat
Retrieves the keys and values as BytesWritable objects
84
85
Output Formats
OutputFormat class hierarchy
86
FileBased
FileOutputFormat is the base class
FileOutputFormat offers a static method for setting the output path
FileOutputFormat.setOutputPath(job, path);
One file per reducer is created (default file name: part-r-nnnnn),
where nnnnn is an integer designating the part number, starting from zero
TextOutputFormat
SequenceFileOutputFormat
SequenceFileAsBinaryOutputFormat
MapFileOutputFormat
NullOutputFormat
DBOutputFormat
Output format to dump output data to RDBMS through JDBC
87
Lazy Output
FileOutputFormat subclasses will create output files,
even if there is no record to write
LazyOutputFormat can be used to delay output file
creation until there is a record to write
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class)
Instead of
job.setOutputFormatClass(TextOutputFormat.class)
88
89
Counters
Useful means of
Monitoring job progress
Gathering statistics
Problem diagnosis
Dynamic counters
Counters can also be created on the fly, without predefining them as enums
context.getCounter(groupName, counterName).increment(1);
Effective only for small amounts of counter data (a few KB); otherwise they put
pressure on the memory of the daemons
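A minimal sketch of a predefined (enum) counter incremented from a mapper; the RecordQuality enum and the malformed-record check are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ParseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    enum RecordQuality { MALFORMED, MISSING_FIELD }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Count records that do not have the expected number of fields
        if (value.toString().split(",").length < 3) {
            context.getCounter(RecordQuality.MALFORMED).increment(1);
            return;
        }
        // ... normal processing ...
    }
}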
92
Multiple Inputs
Often in real life you get related data from different
sources in different formats
Hadoop provides the MultipleInputs class to handle this
situation
MultipleInputs.addInputPath(job, inputPath1, <inputformat>.class);
MultipleInputs.addInputPath(job, inputPath2, <inputformat>.class);
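MultipleInputs also offers an overload that additionally takes a Mapper class per input path, so that differently formatted sources each get their own mapper (SalesMapper and ReturnsMapper are hypothetical names):
MultipleInputs.addInputPath(job, salesPath, TextInputFormat.class, SalesMapper.class);
MultipleInputs.addInputPath(job, returnsPath, TextInputFormat.class, ReturnsMapper.class);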
Joins
Two or more record sets are to be joined based on a key
There are two techniques for joining data in MapReduce
Map-side join (replicated join)
Possible only when
one of the data sets is small enough to be distributed to the data
nodes and fits into memory, so each map can join independently, OR
both data sets are partitioned in such a way that they have an equal
number of partitions, sorted by the same key, and all records for a given key
reside in the same partition
The smaller data set is used for the lookup using the join key (see the sketch after this list)
Faster, as the lookup data is loaded into memory
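A minimal sketch of a replicated (map-side) join, assuming the driver has placed a small customers file (cust_id,name per line) in the DistributedCache; the file layout, field positions and class names are illustrative, not from the course material:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReplicatedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Map<String, String> customers = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Load the small data set into memory once per map task
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split(",");
            customers.put(parts[0], parts[1]);      // cust_id -> name
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] txn = value.toString().split(","); // txn_id, cust_id, amount, ...
        String name = customers.get(txn[1]);        // in-memory lookup on the join key
        if (name != null) {
            context.write(new Text(txn[1]), new Text(name + "\t" + value.toString()));
        }
    }
}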
95
Joins
Reduce-side join
The mapper tags the records from the two data sets distinctly
The join key is used as the map output key
The records for the same key are brought together in the
reducer, and the reducer completes the joining process
Less efficient, as both data sets have to go through the
MapReduce shuffle
96
Job Chaining
Multiple jobs can be run in a linear or complex dependent fashion
The simple way is to call the job drivers one after the other with their
respective configurations
JobClient.runJob(conf1);
JobClient.runJob(conf2);
Here the second job is not launched until the first job has completed
With JobControl, dependencies can be declared explicitly (see the sketch below):
cjob2.addDependingJob(cjob1);
JobControl jc = new JobControl("Chained Job");
jc.addJob(cjob1);
jc.addJob(cjob2);
jc.run();
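A fuller sketch of the JobControl approach shown above, using the old-API jobcontrol classes; the JobConf setup for the two jobs is omitted, and the polling loop is one common pattern rather than the only one:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        JobConf conf1 = new JobConf();   // configure mapper/reducer/paths for the first job here
        JobConf conf2 = new JobConf();   // configure the second job here
        Job cjob1 = new Job(conf1);
        Job cjob2 = new Job(conf2);
        cjob2.addDependingJob(cjob1);                  // cjob2 starts only after cjob1 succeeds
        JobControl jc = new JobControl("Chained Job");
        jc.addJob(cjob1);
        jc.addJob(cjob2);
        new Thread(jc).start();                        // JobControl is a Runnable scheduler
        while (!jc.allFinished()) {
            Thread.sleep(500);                         // poll until both jobs have finished
        }
        jc.stop();
    }
}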
97
Speculative Execution
A MapReduce job's execution time is typically determined
by the slowest running task
The job is not complete until all tasks are completed
One slow task can bring down the overall performance of the job
Speculative execution can be enabled or disabled separately for map and reduce
tasks by setting mapred.map.tasks.speculative.execution and
mapred.reduce.tasks.speculative.execution to true/false
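For example, speculative execution could be disabled from the driver (property names as used in Hadoop 1.x):
job.getConfiguration().setBoolean("mapred.map.tasks.speculative.execution", false);
job.getConfiguration().setBoolean("mapred.reduce.tasks.speculative.execution", false);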
99
101
103
Disadvantages of MapReduce
MapReduce (the Java API) is difficult to program and has a long
development cycle
Trivial operations like join and filter need to be re-expressed in
map/reduce, key/value terms
Being locked to Java makes it practically impossible for data
analysts to work with Hadoop
There are several abstraction layers on top of
MapReduce that make working with Hadoop simpler;
Pig and Hive are at the leading front
104
PIG
105
PIG
Pig is an abstraction layer on top of MapReduce that frees analysts
from the complexity of MapReduce programming
Architected towards handling unstructured and semi-structured
data
It is a dataflow language: the data is processed in a
sequence of steps, each transforming the data
The transformations support relational-style operations such as
filter, union, group and join
Designed to be extensible and reusable
Programmers can develop their own functions (UDFs) and use them
Programmer friendly
Allows you to introspect data structures
Can do a sample run on a representative subset of your input
PIG Architecture
Pig runs as a client-side application; there is no need to
install anything on the cluster
[Diagram: Pig architecture: Pig scripts and the Grunt shell run on the client and are compiled into MapReduce jobs that execute on the Hadoop cluster.]
107
Verify Installation
>>pig -help
Displays command usage
>>pig
Takes you into Grunt shell
grunt>
108
Local Mode
MapReduce Mode
In this mode the queries are translated into MapReduce jobs
and run on the Hadoop cluster
The Pig version must be compatible with the Hadoop version
Set the HADOOP_HOME environment variable to indicate to Pig
which Hadoop client to use
export HADOOP_HOME=$HADOOP_INSTALL
If not set, Pig uses its bundled version of Hadoop
109
Grunt
An interactive shell for running Pig commands
Grunt is started when the pig command is run without any
options
Script
Pig commands can be executed directly from a script file
>>pig pigscript.pig
It is also possible to run Pig scripts from Grunt shell using run
and exec.
Embedded
You can run Pig programs from Java using the PigServer class,
much like you use JDBC from Java (see the sketch below)
For programmatic access to Grunt, use PigRunner
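A minimal sketch of the embedded approach with PigServer; the input file, schema and output path are assumptions:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);   // or ExecType.LOCAL for local mode
        pig.registerQuery("cust = LOAD 'customers.csv' USING PigStorage(',') "
                + "AS (id:int, name:chararray, age:int);");
        pig.registerQuery("teens = FILTER cust BY age < 20;");
        pig.store("teens", "output/teens");                  // triggers the compiled MapReduce job(s)
    }
}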
110
An Example
A Sequence of transformation steps to get the end result
LOAD
FILTER
GROUP
AGGREGATE
Data Types
Simple Types

Category   Type        Description
Numeric    int         32-bit signed integer
           long        64-bit signed integer
           float       32-bit floating-point number
           double      64-bit floating-point number
Text       chararray   Character array in UTF-16 format
Binary     bytearray   Byte array
112
Data Types
Complex Types

Type    Description                                                                            Example
tuple   Sequence of fields of any type                                                         (1,'pomegranate')
bag     An unordered collection of tuples, possibly with duplicates                            {(1,'pomegranate'),(2)}
map     A set of key-value pairs; keys must be character arrays, but values may be any type    ['a'#'pomegranate']
113
LOAD Operator
114
Diagnostic Operators
DESCRIBE
Describes the schema of a relation
EXPLAIN
Display the execution plan used to compute a relation
ILLUSTRATE
Illustrate step-by-step how data is transformed
Uses a sample of the input data to simulate the execution.
115
LIMIT
Limits the number of tuples from a relation
DUMP
Display the tuples from a relation
STORE
Store the data from a relation into a directory.
The directory must not already exist
116
Relational Operators
FILTER
Selects tuples based on Boolean expression
teenagers = FILTER cust BY age < 20;
ORDER
Sort a relation based on one or more fields
Further processing (FILTER, DISTINCT, etc.) may destroy the
ordering
ordered_list = ORDER cust BY name DESC;
DISTINCT
Removes duplicate tuples
unique_custlist = DISTINCT cust;
117
Relational Operators
GROUP BY
Within a relation, group tuples with the same group key
GROUP ALL will group all tuples into one group
groupByProfession=GROUP cust BY profession
groupEverything=GROUP cust ALL
FOREACH
Loop through each tuple in nested_alias and generate new
tuple(s).
countByProfession = FOREACH groupByProfession GENERATE
group, COUNT(cust);
Built in aggregate functions AVG, COUNT, MAX, MIN, SUM
118
Relational Operators
GROUP BY
Within a relation, group tuples with the same group key
GROUP ALL will group all tuples into one group
groupByProfession=GROUP cust BY profession
groupEverything=GROUP cust ALL
FOREACH
Loop through each tuple in nested_alias and generate new
tuple(s).
At least one of the fields of nested_alias should be a bag
DISTINCT, FILTER, LIMIT, ORDER, and SAMPLE are allowed
operations in nested_op to operate on the inner bag(s).
countByProfession = FOREACH groupByProfession GENERATE
group, COUNT(cust);
Built in aggregate functions AVG, COUNT, MAX, MIN, SUM
119
JOIN
Compute inner join of two or
more relations based on common
field values.
>>DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
>>DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(7,9)
>>X = JOIN A BY a1, B BY b1;
>>DUMP X;
(1,2,3,1,3)
(8,3,4,8,9)
(7,2,5,7,9)
120
COGROUP
Group tuples from two or
more relations, based on
common group values.
>>DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
>>DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(7,9)
(1, {(1,2,3)}, {(1,3)})
(8, {(8,3,4)}, {(8,9)})
(7, {(7,2,5)}, {(7,9)})
(2, {}, {(2,4),(2,7)})
(4, {(4,2,1),(4,3,3)}, {})
121
UNION
Creates the union of two or
more relations
>>DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
>>DUMP B;
(2,4)
(8,9)
>>X = UNION A, B;
>>DUMP X;
(1,2,3)
(4,2,1)
(8,3,4)
(2,4)
(8,9)
SPLIT
(8,3,4)
(8,9)
122
SAMPLE
Randomly samples a relation as per given sampling factor.
There is no guarantee that the same number of tuples are
returned every time.
123
UDFs
Pig lets users define their own functions and use them
in statements
The UDFs can be developed in Java, Python or
JavaScript
Filter UDF
Must subclass FilterFunc, which is itself a subclass of EvalFunc (see the sketch after this list)
Eval UDF
Must subclass EvalFunc
public abstract class EvalFunc<T> {
public abstract T exec(Tuple input) throws IOException;
}
Load UDF
Must subclass LoadFunc
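A minimal Filter UDF sketch: it keeps tuples whose first field (assumed to be an age) is under 20; the class name and field layout are hypothetical. In a Pig script, the jar would be registered with REGISTER and the function then used in a FILTER expression.

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class IsTeenager extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        // Tuples without a usable age field are filtered out
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return false;
        }
        int age = (Integer) input.get(0);
        return age < 20;
    }
}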
Macros
IMPORT '<path>/<macrofile>';
125
HIVE
126
HIVE
A data warehousing framework built on top of Hadoop
Abstracts MapReduce complexity away from the user
Target users are generally data analysts who are
comfortable with SQL
Provides an SQL-like language called HiveQL
Hive is meant only for structured data
You can interact with Hive using several methods (a JDBC sketch follows this list)
CLI (Command Line Interface)
A Web GUI
JDBC
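A minimal JDBC sketch, assuming a Hive server (HiveServer1, as shipped with Hive 0.x) is listening on localhost:10000 and that the retail_trans table from the later hands-on exists:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // The query is compiled into MapReduce jobs by Hive
        ResultSet rs = stmt.executeQuery(
                "SELECT category, COUNT(*) FROM retail_trans GROUP BY category");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}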
127
HIVE Architecture
[Diagram: Hive architecture: CLI, Web GUI and JDBC clients submit HiveQL to Hive, whose parser/planner/optimizer uses the metastore and compiles queries into MapReduce jobs that run on the Hadoop cluster.]
128
Configure
Environment variables to add in .bash_profile
export HIVE_INSTALL=/<parent directory path>/hive-x.y.z
export PATH=$PATH:$HIVE_INSTALL/bin
Verify Installation
>>hive -help
Displays command usage
>>hive
Takes you into hive shell
hive>
129
If not set, they default to the local file system and the local
(in-process) job runner - just like they do in Hadoop
130
Metastore
Out of the box, Hive comes with the lightweight SQL database
Derby to store and manage metadata
This can be configured to use other databases, such as MySQL
131
132
MAPS
ARRAYS
[a, b, c]
133
Tables
A Hive table is logically made up of the data being
stored and the associated metadata
Creating a Table
CREATE TABLE emp_table (id INT, name STRING, address STRING)
PARTITIONED BY (designation STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;
Loading Data
LOAD DATA INPATH '/home/hadoop/employee.csv'
OVERWRITE INTO TABLE emp_table;
Hands On
Create retail, customers tables
hive> CREATE DATABASE retail;
hive> USE retail;
hive> CREATE TABLE retail_trans (txn_id INT, txn_date STRING, Cust_id INT,
Amount FLOAT, Category STRING, Sub_Category STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
135
Hands On
Load data and run queries
hive> LOAD DATA INPATH 'retail/txn.csv' INTO TABLE retail_trans;
hive> LOAD DATA INPATH 'retail/custs.csv' INTO TABLE customers;
hive> SELECT Category, count(*) FROM retail_trans
GROUP BY Category;
hive> SELECT Category, count(*) FROM retail_trans WHERE Amount > 100
GROUP BY Category;
hive> SELECT Concat (cu.FirstName, ' ', cu.LastName), rt.Category, count(*)
FROM retail_trans rt JOIN customers cu
ON rt.cust_id = cu.cust_id
GROUP BY cu.FirstName, cu.LastName, rt.Category;
136
Queries
SELECT
SELECT id, name FROM emp_table WHERE designation =
'manager';
SELECT count(*) FROM emp_table;
SELECT designation, count(*) FROM emp_table
GROUP BY designation;
INSERT
INSERT OVERWRITE TABLE new_emp (SELECT * FROM
emp_table WHERE id > 100);
Inserting local directory
INSERT OVERWRITE LOCAL DIRECTORY 'tmp/results' (SELECT * FROM
emp_table WHERE id > 100);
JOIN
SELECT emp_table.*, detail.age FROM emp_table JOIN detail
ON (emp_table.id = detail.id);
137
Bucketing
Bucketing imposes extra structure on the table,
which makes sampling more efficient
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
138
UDFs
UDFs have to be written in Java
They have to subclass UDF
(org.apache.hadoop.hive.ql.exec.UDF)
A UDF must implement at least one evaluate() method.
public class Strip extends UDF {
public Text evaluate(Text str) {
// ... (strip logic elided on the slide)
return str1;
}
}
ADD JAR /path/to/hive-examples.jar;
CREATE TEMPORARY FUNCTION strip AS
'com.hadoopbook.hive.Strip';
SELECT strip(' bee ') FROM dummy;
139
SQOOP
140
SQOOP
Configure
Environment variables to add in .bash_profile
export SQOOP_HOME=/<parent directory path>/sqoop-x.y.z
export PATH=$PATH:$SQOOP_HOME /bin
Verify Installation
>>sqoop
>>sqoop help
141
Importing Data
[Diagram: Sqoop import: the Sqoop client (1) examines the table schema in the RDBMS, (2) generates a Java class (MyClass.java) representing a record, then launches map tasks on the Hadoop cluster to pull the data into HDFS.]
142
Importing Data
Copy the MySQL JDBC driver to Sqoop's lib directory
Sqoop does not come with the JDBC driver
Sample import
>>sqoop import --connect jdbc:mysql://localhost/retail
--table transactions -m 1
>>hadoop fs -ls transactions
The import tool runs a MapReduce job that connects to
the database and reads the table
By default, four map tasks are used
The output is written to a directory named after the table, under the user's
HDFS home directory
Codegen
The code can also be generated without import action
>>sqoop codegen --connect jdbc:mysql://localhost/hadoopguide
--table widgets --class-name Widget
The generated class can hold a single record retrieved from
the table
The generated code can be used in MapReduce programs to
manipulate the data
144
Administration
146
${dfs.name.dir}/
current/
VERSION
edits
fsimage
fstime
147
${fs.checkpoint.dir}/
current/
VERSION
edits
fsimage
fstime
previous.checkpoint/
VERSION
edits
fsimage
fstime
blk_<id_2>
blk_<id_2>.meta
Subdir0/
Subdir1/
148
Option to either
move (to lost+found)
or delete affected
files
hadoop fsck / -move
hadoop fsck / -delete
151
HDFS Balancer
Logging
All Hadoop daemons produce their respective log files
Log files are stored under $HADOOP_INSTALL/logs
The location can be changed by setting the
HADOOP_LOG_DIR variable in hadoop-env.sh
Job Tracker
log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG
Stack Traces
The stack traces for all the Hadoop daemons can be obtained from the
/stacks page of the web UI that each daemon exposes
The JobTracker stack trace can be found at
http://<jobtracker-host>:50030/stacks
153
Data Backup
HDFS replication is not a substitute for data backup
As the data volume is very high, it is good practice to prioritize
the data to be backed up
Business-critical data
Data that cannot be regenerated
Decommissioning of Nodes
Several options
Build your own cluster from scratch
Use offerings that provide hadoop as a service on cloud
Memory
Storage
Network
Gigabit Ethernet
Network Topology
1GB Switch
Network Topology
For a multi-rack cluster, the admin needs to map nodes to
racks so that Hadoop is network-aware and can place data, as well as
MapReduce tasks, as close as possible to the data
Two ways to define the network map
Implement the Java interface DNSToSwitchMapping (a sketch follows)
public interface DNSToSwitchMapping {
public List<String> resolve(List<String> names);
}
Have the property topology.node.switch.mapping.impl point to the
implementing class. The NameNode and JobTracker will make use of this
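A minimal sketch of implementing the interface above; the hostname-prefix rule (rack1-*) is purely hypothetical and would be replaced by whatever naming or lookup scheme the cluster actually uses:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

public class PrefixRackMapping implements DNSToSwitchMapping {
    @Override
    public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<String>();
        for (String name : names) {
            // Map each host name or IP to a rack path
            if (name.startsWith("rack1-")) {
                racks.add("/rack1");
            } else {
                racks.add("/rack2");
            }
        }
        return racks;
    }
}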
Configure
Generate an RSA key pair and share the public key on all nodes
Configure Hadoop. A better way of doing this is by using tools like Chef
or Puppet
161
163
164
Security
Hadoop uses Kerberos for authentication
Kerberos does not manage the
permissions for Hadoop
To enable Kerberos
authentication, set the property
hadoop.security.authentication
in core-site.xml to kerberos
Enable service-level
authorization by setting
hadoop.security.authorization
to true in the same file
To control which users and groups can do what, configure
Access Control Lists (ACLs) in hadoop-policy.xml
165
Security Policies
Allow only alice, bob and users in the mapreduce group to submit jobs
<property>
<name>security.job.submission.protocol.acl</name>
<value>alice,bob mapreduce</value>
</property>
Recommended Readings
167