
1

Scale Unlimited, Inc

A very short Intro to Hadoop

photo by: i_pinz, flickr


Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.



2

Welcome to Intro to Hadoop

Usually a 4-hour online class


Focus on the Why, What, and How of Hadoop



3

Meet Your Instructor

Developer of Bixo web mining toolkit


Committer on Apache Tika
Hadoop developer and trainer
Solr developer and trainer



4

Overview

A very short Intro to Hadoop

photo by: exfordy, flickr


5

How to Crunch a Petabyte?

Lots of disks, spinning all the time


Redundancy, since disks die
Lots of CPU cores, working all the time
Retry, since network errors happen



6

Hadoop to the Rescue

Scalable - many servers with lots of cores and spindles


Reliable - detect failures, redundant storage
Fault-tolerant - auto-retry, self-healing
Simple - use many servers as one really big computer



7

Logical Architecture
[Diagram: a Cluster containing an Execution layer on top of a Storage layer]

Logically, Hadoop is simply a computing cluster that provides:


a Storage layer, and
an Execution layer



8

Storage Layer

Hadoop Distributed File System (aka HDFS, or Hadoop DFS)


Runs on top of the regular OS file system, typically Linux ext3
Fixed-size blocks (64MB by default) that are replicated
Write once, read many; optimized for streaming data in and out (see the sketch below)
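
To make the streaming model concrete, here is a minimal Java sketch (the path and file contents are made up for illustration) that writes a file into HDFS once and then streams it back out through the standard FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsStreamingSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // the configured (HDFS) file system
            Path path = new Path("/tmp/example.txt");   // hypothetical path

            // Write once - blocks are allocated and replicated as the stream is written.
            FSDataOutputStream out = fs.create(path);
            out.writeBytes("streamed into HDFS\n");
            out.close();

            // Read many - data is streamed back out; there is no random in-place update.
            FSDataInputStream in = fs.open(path);
            IOUtils.copyBytes(in, System.out, 4096, true);
        }
    }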



9

Execution Layer

Hadoop Map-Reduce
Responsible for running a job in parallel on many servers
Handles retrying tasks that fail and validating that results are complete
Jobs consist of special map and reduce operations



10

Scalable
[Diagram: a Cluster made up of Racks, each Rack holding many Nodes; each Node contributes CPU to the Execution layer and disk to the Storage layer]

Virtual execution and storage layers span many nodes (servers)


Scales linearly (sort of) with cores and disks.



11

Reliable

Each block is replicated, typically three times.


Each task must succeed, or the job fails.



12

Fault-tolerant

Failed tasks are automatically retried.


Failed data transfers are automatically retried.
Servers can join and leave the cluster at any time.



13

Simple
[Diagram: many Nodes, each contributing CPU to the Execution layer and disk to the Storage layer]

Reduces complexity
Conceptual operating system that spans many CPUs & disks



14

Typical Hadoop Cluster

Has one master server - a high-quality, beefy box.


NameNode process - manages the file system
JobTracker process - manages tasks
Has multiple slave servers - commodity hardware.
DataNode process - manages file system blocks on local drives
TaskTracker process - runs tasks on that server
Uses a high-speed network between all servers



15

Architectural Components
[Diagram: on the Master, the NameNode and JobTracker; on the Slaves, DataNodes managing data blocks and TaskTrackers spawning mapper and reducer child JVMs. The Client submits jobs to the JobTracker, which hands out tasks to the TaskTrackers. Legend: solid boxes are unique applications, dashed boxes are child JVM instances on the same node as their parent, dotted boxes are blocks of managed files.]



16

Distributed File System

A very short Intro to Hadoop

photo by: Graham Racher, flickr


17

Virtual File System

Treats many disks on many servers as one huge, logical volume


Data is stored in 1...n blocks
The DataNode process manages blocks of data on a slave.
The NameNode process keeps track of file metadata on the master.



18

Replication

Each block is stored on several different disks (default is 3)


Hadoop tries to copy blocks to different servers and racks.
Protects data against disk, server, and rack failures.
Reduces the need to move data to code (see the sketch below).
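
As a small hedged sketch (the path and factors are arbitrary examples), the replication factor can be set cluster-wide through the dfs.replication property or per file through the FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("dfs.replication", "2");   // default replication for files created with this conf
            FileSystem fs = FileSystem.get(conf);

            // Raise the replication factor of an existing (hypothetical) file back to 3.
            fs.setReplication(new Path("/tmp/example.txt"), (short) 3);
        }
    }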



19

Error Recovery

Slaves constantly check in with the master.


Data is automatically replicated if a disk or server goes away.



20

Limitations

Optimized for streaming data in/out


So there is no random access to data within a file
Data rates are 30%-50% of the maximum raw disk rate
No append support currently
Write once, read many



21

NameNode

Runs on the master node


Is a single point of failure
There is no built-in software hot-failover mechanism
Maintains the filesystem metadata, the Namespace:
files and hierarchical directories



22

DataNodes

[Diagram: several DataNodes, each managing data blocks]
Files stored in HDFS are chunked and stored as blocks on the DataNodes


Each DataNode manages the storage attached to the node it runs on
Data never flows through the NameNode, only through the DataNodes (see the sketch below)
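
As a sketch of how a client can see this (the path is hypothetical), the FileSystem API asks the NameNode which DataNodes hold each block of a file:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/tmp/example.txt"));

            // One BlockLocation per block; each one lists the DataNodes holding a replica.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("bytes " + block.getOffset() + "-"
                    + (block.getOffset() + block.getLength())
                    + " on " + Arrays.toString(block.getHosts()));
            }
        }
    }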



23

Map-Reduce Paradigm

A very short Intro to Hadoop

photo by: Graham Racher, flickr


24

Definitions

Key Value Pair -> two units of data, exchanged between Map & Reduce
Map -> the map function in the MapReduce algorithm
  user defined
  converts each input Key Value Pair to 0...n output Key Value Pairs
Reduce -> the reduce function in the MapReduce algorithm
  user defined
  converts each input Key plus all of its Values to 0...n output Key Value Pairs
Group -> a built-in operation that happens between Map and Reduce
  ensures each Key passed to Reduce arrives with all of its Values



25

All Together
[Diagram: many Mappers feed the Shuffle, which groups the map output and routes it to the Reducers]



26

How MapReduce Works

Map translates input keys and values to new keys and values
[K1,V1] -> Map -> [K2,V2]

The system Groups each unique key with all of its values


[K2,V2] -> Group -> [K2,{V2,V2,...}]

Reduce translates each unique key and its values to new keys and values
[K2,{V2,V2,...}] -> Reduce -> [K3,V3]
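
To make the Group step concrete, here is a tiny plain-Java illustration (deliberately independent of Hadoop's own implementation) of turning [K2,V2] pairs into [K2,{V2,V2,...}]:

    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class GroupStepSketch {
        public static void main(String[] args) {
            // [K2,V2] pairs as they might come out of Map
            List<Map.Entry<String, Integer>> mapOutput = List.of(
                new SimpleEntry<>("the", 1), new SimpleEntry<>("quick", 1), new SimpleEntry<>("the", 1));

            // Group: collect all values for each unique key -> [K2,{V2,V2,...}]
            Map<String, List<Integer>> grouped = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> pair : mapOutput) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
            System.out.println(grouped);   // {the=[1, 1], quick=[1]}
        }
    }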



27

Canonical Example - Word Count

Word Count
Read a document, parse out the words, count the frequency of each word

Specifically, in MapReduce
With a document consisting of lines of text
Translate each line of text into key = word and value = 1
e.g. <the,1> <quick,1> <brown,1> <fox,1> <jumped,1>
Sum the values (ones) for each unique word (see the sketch below)
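
A minimal sketch of these two steps in Hadoop's Java API (org.apache.hadoop.mapreduce), mirroring the standard word count example; the class names here are chosen for illustration:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: one line of text in, <word, 1> pairs out
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(line.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: one unique word plus all of its 1s in, <word, total> out
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : counts) {
                    sum += count.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }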



28
[K1,V1]
[1, When in the Course of human events, it becomes]
[2, necessary for one people to dissolve the political bands]
[3, which have connected them with another, and to assume]
... in general, [n, {k,k,k,k,...}] - a line number and its words

Map (User Defined)

[K2,V2]
[When,1] [in,1] [the,1] [Course,1] ... in general, [k,1]

Group

[K2,{V2,V2,...}]
[When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}] [connected,{1,1,1,1,1}] ... in general, [k,{v,v,...}]

Reduce (User Defined)

[K3,V3]
[When,4] [people,6] [dissolve,3] [connected,5] ... in general, [k,sum(v,v,...)]



29

Divide & Conquer (splitting data)

Because
The Map function only cares about the current key and value, and
The Reduce function only cares about the current key and its values
Then
A Mapper can invoke Map on an arbitrary number of input keys and values
or just some fraction of the input data set
A Reducer can invoke Reduce on an arbitrary number of the unique keys
but it must receive all of the values for each of those keys



30
[Diagram: an input file is broken into splits (split1, split2, split3, split4, ...); each split is processed by a Mapper ([K1,V1] -> [K2,V2]); the Shuffle groups the map output ([K2,{V2,V2,...}]); each Reducer ([K2,{V2,V2,...}] -> [K3,V3]) writes one part file (part-00000, part-00001, ... part-000N) into the output directory. Mappers must complete before Reducers can begin.]



31

Divide & Conquer (parallelizable)

Because
Each Mapper is independent and processes part of the whole, and
Each Reducer is independent and processes part of the whole
Then
Any number of Mappers can run on each node, and
Any number of Reducers can run on each node, and
The cluster can contain any number of nodes



32

JobTracker
[Diagram: the Client submits jobs to the JobTracker]

Is a single point of failure


Determines the number of Mapper tasks from the file splits, via the InputFormat
Uses a predefined value for the number of Reducer tasks
Client applications use JobClient to submit jobs and query status (a job-submission sketch follows this list)
Command line: use hadoop job <commands>
Web status console: use http://jobtracker-server:50030/
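
As a hedged sketch of job submission with the newer Job API (the input/output paths come from the command line, and the WordCount mapper/reducer classes from the earlier sketch are assumed):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");       // Job.getInstance(conf, ...) in later releases
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(2);                    // reducer count is chosen by the client

            // Mapper count comes from the input splits computed by the InputFormat.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }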



33

TaskTracker
[Diagram: each TaskTracker spawns mapper and reducer child JVMs]

Spawns each Task as a new child JVM


The maximum numbers of mapper and reducer tasks are set independently
Child JVM options can be passed via mapred.child.java.opts
JVMs can be re-used to avoid the overhead of task initialization (see the sketch below)
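
A small sketch of that tuning using the classic pre-YARN property names (the values chosen here are arbitrary examples, and the per-TaskTracker slot counts would normally live in the TaskTracker's own mapred-site.xml rather than in job code):

    import org.apache.hadoop.conf.Configuration;

    public class ChildJvmTuningSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Options passed to every spawned child JVM (heap size is just an example).
            conf.set("mapred.child.java.opts", "-Xmx512m");

            // Re-use each child JVM for multiple tasks of the same job; -1 means no limit.
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

            // Per-TaskTracker maximums for map and reduce slots, set independently
            // (normally configured on the TaskTracker itself, shown here for completeness).
            conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
            conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        }
    }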



34

Hadoop Deep Thoughts (sort of)

A very short Intro to Hadoop

photo by: Graham Racher, flickr


35

Avoiding Hadoop

Hadoop is a big hammer - but not every problem is a nail


Small data
Real-time data
Beware the Hadoopaphile



36

Avoiding Map-Reduce

Writing Hadoop code is painful and error-prone


Hive & Pig are good solutions for People Who Like SQL
Cascading is a good solution for complex, stable workflows



37

Leveraging the Eco-System

Many open source projects built on top of Hadoop


HBase - scalable NoSQL data store
Sqoop - getting SQL data in/out of Hadoop
Other projects work well with Hadoop
Kafka/Scribe - getting log data in/out of Hadoop
Avro - data serialization/storage format



38

Get Involved

Join the mailing list - http://hadoop.apache.org/mailing_lists.html


Go to user group meetings - e.g. http://hadoop.meetup.com/



39

Learn More

Buy the book - Hadoop: The Definitive Guide, 2nd edition


Try the tutorials
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
Get training (danger, personal plug)
http://www.scaleunlimited.com/training
http://www.cloudera.com/hadoop-training



40

Resources

Scale Unlimited Alumni list - scale-unlimited-alumni@googlegroups.com


Hadoop mailing lists - http://hadoop.apache.org/mailing_lists.html
Users groups - e.g. http://hadoop.meetup.com/
Hadoop API - http://hadoop.apache.org/common/docs/current/api/
Hadoop: The Definitive Guide, 2nd edition by Tom White
Cascading: http://www.cascading.org
Datameer: http://www.datameer.com

