
An Introduction to Hadoop

Hello
Processing against a 156-node cluster
Certified Hadoop Developer
Certified Hadoop System Administrator
Goals
Why should you care?
What is it?
How does it work?
Data Everywhere
"Every two days now we create as much information as we did from the dawn of civilization up until 2003."
- Eric Schmidt, then CEO of Google (Aug 4, 2010)
The Hadoop Project
Originally based on papers published by Google in 2003 and 2004
Hadoop started in 2006 at Yahoo!
Top-level Apache Software Foundation project
Large, active user base and user groups
Very active development, strong development team
Who Uses Hadoop?
Hadoop Components
HDFS (Storage)
Self-healing, high-bandwidth clustered storage
MapReduce (Processing)
Fault-tolerant, distributed processing
Typical Cluster
3 to 4,000 commodity servers
Each server:
2x quad-core CPUs
16-24 GB RAM
4-12 TB disk space
20-30 servers per rack
2 Kinds of Nodes
Master Nodes and Slave Nodes
Master Nodes
NameNode
only 1 per cluster
metadata server and database
SecondaryNameNode helps with some housekeeping
JobTracker
only 1 per cluster
job scheduler
Slave Nodes
DataNodes
1-4000 per cluster
block data storage
TaskTrackers
1-4000 per cluster
task execution
HDFS Basics
HDFS is a filesystem written in Java
Sits on top of a native filesystem
Provides redundant storage for massive amounts of data
Use cheap(ish), unreliable computers
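
Because HDFS exposes a Java filesystem API, applications read from it much like a local filesystem. Below is a minimal sketch using the org.apache.hadoop.fs.FileSystem API; the NameNode URI and file path are hypothetical placeholders, and in practice the NameNode address usually comes from core-site.xml rather than being hard-coded.

    // Minimal sketch: print an HDFS file to stdout via the Java FileSystem API.
    // The cluster URI and file path below are hypothetical examples.
    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class CatHdfsFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point at the NameNode; normally picked up from core-site.xml instead.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            InputStream in = fs.open(new Path("/user/demo/foo.txt"));
            try {
                IOUtils.copyBytes(in, System.out, 4096, false); // stream file contents to stdout
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }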
HDFS Data
Data is split into blocks and stored on multiple nodes in the cluster
Each block is usually 64 MB or 128 MB (configurable)
Each block is replicated multiple times (configurable)
Replicas are stored on different DataNodes
Designed for large files, 100 MB+
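
Because block size and replication factor are configurable, they can even be set per file at write time. A hedged sketch using the Java FileSystem API: the path, replication factor, and block size below are illustrative values, while cluster-wide defaults normally come from dfs.replication and dfs.blocksize in hdfs-site.xml.

    // Sketch: create a file with an explicit replication factor and block size.
    // The values are examples only, not cluster defaults.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteWithReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("/user/demo/bar.txt");   // hypothetical path
            short replication = 3;                        // three replicas per block
            long blockSize = 128L * 1024 * 1024;          // 128 MB blocks
            FSDataOutputStream stream =
                    fs.create(out, true, 4096, replication, blockSize);
            stream.writeBytes("example payload\n");
            stream.close();
        }
    }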
NameNode
A single NameNode stores all metadata
Filenames, locations on DataNodes of each block, owner, group, etc.
All information maintained in RAM for fast lookup
Filesystem metadata size is limited to the amount of available RAM on the NameNode
SecondaryNameNode
The SecondaryNameNode is not a failover NameNode
Performs memory-intensive administrative functions for the NameNode
Should run on a separate machine
DataNode
DataNodes store file contents
Stored as opaque blocks on the underlying filesystem
Different blocks of the same file will be stored on different DataNodes
Same block is stored on three (or more) DataNodes for redundancy
Self-healing
DataNodes send heartbeats to the NameNode
After a period without any heartbeats, a DataNode is assumed to be lost
NameNode determines which blocks were on the lost node
NameNode finds other DataNodes with copies of these blocks
These DataNodes are instructed to copy the blocks to other nodes
Replication is actively maintained
HDFS Data Storage
NameNode holds file metadata
DataNodes hold the actual data
Block size is 64 MB, 128 MB, etc.
Each block replicated three times

Example:
NameNode metadata:
foo.txt: blk_1, blk_2, blk_3
bar.txt: blk_4, blk_5
DataNodes: five DataNodes each hold a subset of blk_1 through blk_5, so that every block has three replicas spread across different nodes
What is MapReduce?
MapReduce is a method for distributing a task across multiple nodes
Automatic parallelization and distribution
Each node processes data stored on that node (processing goes to the data)
Features of MapReduce
Fault-tolerance
Status and monitoring tools
A clean abstraction for programmers
JobTracker
MapReduce jobs are controlled by a software daemon known as the JobTracker
The JobTracker resides on a master node
Assigns Map and Reduce tasks to other nodes on the cluster
These nodes each run a software daemon known as the TaskTracker
The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker
Two Parts
Developer specifies two functions:
map()
reduce()
The framework does the rest
map()
The Mapper reads data in the form of key/value pairs
It outputs zero or more key/value pairs
map(key_in, value_in) -> list of (key_out, value_out)
reduce()
After the Map phase, all the intermediate values for a given intermediate key are combined together into a list
This list is given to one or more Reducers
The Reducer outputs zero or more final key/value pairs
These are written to HDFS
map() Word Count
map(String input_key, String input_value):
    foreach word w in input_value:
        emit(w, 1)

Input:
(1234, to be or not to be)
(5678, to see or not to see)
Output:
(to,1), (be,1), (or,1), (not,1), (to,1), (be,1),
(to,1), (see,1), (or,1), (not,1), (to,1), (see,1)
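
Expressed against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce), the same word-count Mapper might look roughly like the sketch below; the class name and the whitespace tokenization are illustrative choices, not part of the pseudocode above.

    // Word-count Mapper sketch for the org.apache.hadoop.mapreduce API.
    // Input key: byte offset of the line; input value: the line of text.
    // Emits (word, 1) for every word in the line.
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String w : value.toString().split("\\s+")) {
                if (!w.isEmpty()) {
                    word.set(w);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }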
reduce() Word Count
reduce(String output_key, List intermediate_vals):
    count = 0
    foreach v in intermediate_vals:
        count += v
    emit(output_key, count)

Input:
(to, [1,1,1,1]), (be, [1,1]), (or, [1,1]), (not, [1,1]), (see, [1,1])
Output:
(to, 4), (be, 2), (or, 2), (not, 2), (see, 2)
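
The matching Reducer, together with a minimal driver that wires the two classes into a job, might look like the sketch below. The class names (WordCountMapper, WordCountReducer) and the input/output paths are assumptions carried over from the Mapper sketch above, not a definitive implementation.

    // Word-count Reducer and driver sketch for the org.apache.hadoop.mapreduce API.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class WordCountReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int count = 0;
                for (IntWritable v : values) {
                    count += v.get();            // sum the 1s emitted for this word
                }
                context.write(key, new IntWritable(count));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hadoop 2+ form; older releases used new Job(conf, "word count").
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordCountMapper.class);   // from the Mapper sketch above
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }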

Resources
http://hadoop.apache.org/
http://developer.yahoo.com/hadoop/
http://www.cloudera.com/resources/?media=Video
Questions?
