
1

Scale Unlimited, Inc

A very short Intro to Hadoop

photo by: i_pinz, flickr


Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.



2

Welcome to Intro to Hadoop

Usually a 4-hour online class


Focus on the Why, What, and How of Hadoop



3

Meet Your Instructor

Developer of Bixo web mining toolkit


Committer on Apache Tika
Hadoop developer and trainer
Solr developer and trainer



4

Overview

A very short Intro to Hadoop

photo by: exfordy, flickr


5

How to Crunch a Petabyte?

Lots of disks, spinning all the time


Redundancy, since disks die
Lots of CPU cores, working all the time
Retry, since network errors happen



6

Hadoop to the Rescue

Scalable - many servers with lots of cores and spindles


Reliable - detect failures, redundant storage
Fault-tolerant - auto-retry, self-healing
Simple - use many servers as one really big computer



7

Logical Architecture
[Diagram: a Cluster containing an Execution layer on top of a Storage layer]

Logically, Hadoop is simply a computing cluster that provides:


a Storage layer, and
an Execution layer



8

Storage Layer

Hadoop Distributed File System (aka HDFS, or Hadoop DFS)


Runs on top of the regular OS file system, typically Linux ext3
Fixed-size blocks (64MB by default) that are replicated
Write once, read many; optimized for streaming data in and out (see the sketch below)
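
To make the streaming model concrete, here is a minimal Java sketch (the path and file contents are made up for illustration) that writes a file into HDFS once and then streams it back out through the standard FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsStreamingSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // the configured (HDFS) file system
            Path path = new Path("/tmp/example.txt");   // hypothetical path

            // Write once - blocks are allocated and replicated as the stream is written.
            FSDataOutputStream out = fs.create(path);
            out.writeBytes("streamed into HDFS\n");
            out.close();

            // Read many - data is streamed back out; there is no random in-place update.
            FSDataInputStream in = fs.open(path);
            IOUtils.copyBytes(in, System.out, 4096, true);
        }
    }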



9

Execution Layer

Hadoop Map-Reduce
Responsible for running a job in parallel on many servers
Handles retrying tasks that fail and validating that results are complete
Jobs consist of special map and reduce operations



10

Scalable
[Diagram: a Cluster made up of Racks, each Rack holding many Nodes; each Node contributes CPU to the Execution layer and disk to the Storage layer]

Virtual execution and storage layers span many nodes (servers)


Scales linearly (sort of) with cores and disks.



11

Reliable

Each block is replicated, typically three times.


Each task must succeed, or the job fails.



12

Fault-tolerant

Failed tasks are automatically retried.


Failed data transfers are automatically retried.
Servers can join and leave the cluster at any time.



13

Simple
[Diagram: many Nodes, each contributing CPU to the Execution layer and disk to the Storage layer]

Reduces complexity
Conceptual operating system that spans many CPUs & disks



14

Typical Hadoop Cluster

Has one master server - a high-quality, beefy box.


NameNode process - manages the file system
JobTracker process - manages tasks
Has multiple slave servers - commodity hardware.
DataNode process - manages file system blocks on local drives
TaskTracker process - runs tasks on that server
Uses a high-speed network between all servers



15

Architectural Components
[Diagram: on the Master, the NameNode and JobTracker; on the Slaves, DataNodes managing data blocks and TaskTrackers spawning mapper and reducer child JVMs. The Client submits jobs to the JobTracker, which hands out tasks to the TaskTrackers. Legend: solid boxes are unique applications, dashed boxes are child JVM instances on the same node as their parent, dotted boxes are blocks of managed files.]



16

Distributed File System

A very short Intro to Hadoop

photo by: Graham Racher, flickr


17

Virtual File System

Treats many disks on many servers as one huge, logical volume


Data is stored in 1...n blocks
The DataNode process manages blocks of data on a slave.
The NameNode process keeps track of file metadata on the master.



18

Replication

Each block is stored on several different disks (default is 3)


Hadoop tries to copy blocks to different servers and racks.
Protects data against disk, server, and rack failures.
Reduces the need to move data to code (see the sketch below).
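
As a small hedged sketch (the path and factors are arbitrary examples), the replication factor can be set cluster-wide through the dfs.replication property or per file through the FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("dfs.replication", "2");   // default replication for files created with this conf
            FileSystem fs = FileSystem.get(conf);

            // Raise the replication factor of an existing (hypothetical) file back to 3.
            fs.setReplication(new Path("/tmp/example.txt"), (short) 3);
        }
    }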



19

Error Recovery

Slaves constantly check in with the master.


Data is automatically replicated if a disk or server goes away.



20

Limitations

Optimized for streaming data in/out


So there is no random access to data within a file
Data rates are 30%-50% of the maximum raw disk rate
No append support currently
Write once, read many



21

NameNode

Runs on the master node


Is a single point of failure
There is no built-in software hot-failover mechanism
Maintains the filesystem metadata, the Namespace:
files and hierarchical directories



22

DataNodes

[Diagram: several DataNodes, each managing data blocks]
Files stored in HDFS are chunked and stored as blocks on the DataNodes


Each DataNode manages the storage attached to the node it runs on
Data never flows through the NameNode, only through the DataNodes (see the sketch below)
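
As a sketch of how a client can see this (the path is hypothetical), the FileSystem API asks the NameNode which DataNodes hold each block of a file:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/tmp/example.txt"));

            // One BlockLocation per block; each one lists the DataNodes holding a replica.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("bytes " + block.getOffset() + "-"
                    + (block.getOffset() + block.getLength())
                    + " on " + Arrays.toString(block.getHosts()));
            }
        }
    }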



23

Map-Reduce Paradigm

A very short Intro to Hadoop

photo by: Graham Racher, flickr


24

Definitions

Key Value Pair -> two units of data, exchanged between Map & Reduce
Map -> the map function in the MapReduce algorithm
  user defined
  converts each input Key Value Pair to 0...n output Key Value Pairs
Reduce -> the reduce function in the MapReduce algorithm
  user defined
  converts each input Key plus all of its Values to 0...n output Key Value Pairs
Group -> a built-in operation that happens between Map and Reduce
  ensures each Key passed to Reduce arrives with all of its Values



25

All Together
[Diagram: many Mappers feed the Shuffle, which groups the map output and routes it to the Reducers]



26

How MapReduce Works

Map translates input keys and values to new keys and values
[K1,V1] -> Map -> [K2,V2]

The system Groups each unique key with all of its values


[K2,V2] -> Group -> [K2,{V2,V2,...}]

Reduce translates each unique key and its values to new keys and values
[K2,{V2,V2,...}] -> Reduce -> [K3,V3]
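
To make the Group step concrete, here is a tiny plain-Java illustration (deliberately independent of Hadoop's own implementation) of turning [K2,V2] pairs into [K2,{V2,V2,...}]:

    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class GroupStepSketch {
        public static void main(String[] args) {
            // [K2,V2] pairs as they might come out of Map
            List<Map.Entry<String, Integer>> mapOutput = List.of(
                new SimpleEntry<>("the", 1), new SimpleEntry<>("quick", 1), new SimpleEntry<>("the", 1));

            // Group: collect all values for each unique key -> [K2,{V2,V2,...}]
            Map<String, List<Integer>> grouped = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> pair : mapOutput) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
            System.out.println(grouped);   // {the=[1, 1], quick=[1]}
        }
    }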



27

Canonical Example - Word Count

Word Count
Read a document, parse out the words, count the frequency of each word

Specifically, in MapReduce
With a document consisting of lines of text
Translate each line of text into key = word and value = 1
e.g. <the,1> <quick,1> <brown,1> <fox,1> <jumped,1>
Sum the values (ones) for each unique word (see the sketch below)
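
A minimal sketch of these two steps in Hadoop's Java API (org.apache.hadoop.mapreduce), mirroring the standard word count example; the class names here are chosen for illustration:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: one line of text in, <word, 1> pairs out
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(line.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: one unique word plus all of its 1s in, <word, total> out
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : counts) {
                    sum += count.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }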



28
[K1,V1]
[1, When in the Course of human events, it becomes]
[2, necessary for one people to dissolve the political bands]
[3, which have connected them with another, and to assume]
... in general, [n, {k,k,k,k,...}] - a line number and its words

Map (User Defined)

[K2,V2]
[When,1] [in,1] [the,1] [Course,1] ... in general, [k,1]

Group

[K2,{V2,V2,...}]
[When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}] [connected,{1,1,1,1,1}] ... in general, [k,{v,v,...}]

Reduce (User Defined)

[K3,V3]
[When,4] [people,6] [dissolve,3] [connected,5] ... in general, [k,sum(v,v,...)]



29

Divide & Conquer (splitting data)

Because
The Map function only cares about the current key and value, and
The Reduce function only cares about the current key and its values
Then
A Mapper can invoke Map on an arbitrary number of input keys and values
or just some fraction of the input data set
A Reducer can invoke Reduce on an arbitrary number of the unique keys
but it must receive all of the values for each of those keys



30
[Diagram: an input file is broken into splits (split1, split2, split3, split4, ...); each split is processed by a Mapper ([K1,V1] -> [K2,V2]); the Shuffle groups the map output ([K2,{V2,V2,...}]); each Reducer ([K2,{V2,V2,...}] -> [K3,V3]) writes one part file (part-00000, part-00001, ... part-000N) into the output directory. Mappers must complete before Reducers can begin.]



31

Divide & Conquer (parallelizable)

Because
Each Mapper is independent and processes part of the whole, and
Each Reducer is independent and processes part of the whole
Then
Any number of Mappers can run on each node, and
Any number of Reducers can run on each node, and
The cluster can contain any number of nodes



32

JobTracker
[Diagram: the Client submits jobs to the JobTracker]

Is a single point of failure


Determines the number of Mapper tasks from the file splits, via the InputFormat
Uses a predefined value for the number of Reducer tasks
Client applications use JobClient to submit jobs and query status (a job-submission sketch follows this list)
Command line: use hadoop job <commands>
Web status console: use http://jobtracker-server:50030/
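
As a hedged sketch of job submission with the newer Job API (the input/output paths come from the command line, and the WordCount mapper/reducer classes from the earlier sketch are assumed):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");       // Job.getInstance(conf, ...) in later releases
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(2);                    // reducer count is chosen by the client

            // Mapper count comes from the input splits computed by the InputFormat.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }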



33

TaskTracker
[Diagram: each TaskTracker spawns mapper and reducer child JVMs]

Spawns each Task as a new child JVM


The maximum numbers of mapper and reducer tasks are set independently
Child JVM options can be passed via mapred.child.java.opts
JVMs can be re-used to avoid the overhead of task initialization (see the sketch below)
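
A small sketch of that tuning using the classic pre-YARN property names (the values chosen here are arbitrary examples, and the per-TaskTracker slot counts would normally live in the TaskTracker's own mapred-site.xml rather than in job code):

    import org.apache.hadoop.conf.Configuration;

    public class ChildJvmTuningSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Options passed to every spawned child JVM (heap size is just an example).
            conf.set("mapred.child.java.opts", "-Xmx512m");

            // Re-use each child JVM for multiple tasks of the same job; -1 means no limit.
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

            // Per-TaskTracker maximums for map and reduce slots, set independently
            // (normally configured on the TaskTracker itself, shown here for completeness).
            conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
            conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        }
    }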



34

Hadoop Deep Thoughts (sort of)

A very short Intro to Hadoop

photo by: Graham Racher, flickr


35

Avoiding Hadoop

Hadoop is a big hammer - but not every problem is a nail


Small data
Real-time data
Beware the Hadoopaphile



36

Avoiding Map-Reduce

Writing Hadoop code is painful and error-prone


Hive & Pig are good solutions for People Who Like SQL
Cascading is a good solution for complex, stable workflows



37

Leveraging the Eco-System

Many open source projects built on top of Hadoop


HBase - scalable NoSQL data store
Sqoop - getting SQL data in/out of Hadoop
Other projects work well with Hadoop
Kafka/Scribe - getting log data in/out of Hadoop
Avro - data serialization/storage format



38

Get Involved

Join the mailing list - http://hadoop.apache.org/mailing_lists.html


Go to user group meetings - e.g. http://hadoop.meetup.com/



39

Learn More

Buy the book - Hadoop: The Definitive Guide, 2nd edition


Try the tutorials
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
Get training (danger, personal plug)
http://www.scaleunlimited.com/training
http://www.cloudera.com/hadoop-training



40

Resources

Scale Unlimited Alumni list - scale-unlimited-alumni@googlegroups.com


Hadoop mailing lists - http://hadoop.apache.org/mailing_lists.html
Users groups - e.g. http://hadoop.meetup.com/
Hadoop API - http://hadoop.apache.org/common/docs/current/api/
Hadoop: The Definitive Guide, 2nd edition by Tom White
Cascading: http://www.cascading.org
Datameer: http://www.datameer.com

