MapReduce Programming

The Configuration API


Components in Hadoop are configured using Hadoop's own configuration API.
org.apache.hadoop.conf.Configuration represents a collection of configuration properties and their values.
Each property is named by a String, and the value may be one of several types:
o Java primitives such as boolean, int, long, and float
o other useful types such as String, Class, and java.io.File, as well as collections of Strings


The Configuration API..
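The original slide shows the API only as a screenshot; the following is a minimal sketch of typical Configuration usage. The resource name configuration-1.xml and the property names are hypothetical.

import org.apache.hadoop.conf.Configuration;

public class ConfigurationExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Load properties from an XML resource on the classpath (hypothetical file name).
        conf.addResource("configuration-1.xml");

        // Typed accessors; the second argument is the default used when the property is absent.
        String color = conf.get("color", "unknown");
        int size = conf.getInt("size", 0);
        boolean debug = conf.getBoolean("debug", false);

        System.out.println(color + " " + size + " " + debug);
    }
}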


Tool implementation :
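The Tool implementation on the original slide is an image; below is a hedged sketch of the usual Tool/ToolRunner driver pattern with the classic JobConf API. The class name MyDriver is illustrative, and IdentityMapper/IdentityReducer stand in for a real mapper and reducer.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("Usage: %s <input> <output>%n", getClass().getSimpleName());
            return -1;
        }
        JobConf conf = new JobConf(getConf(), MyDriver.class);
        conf.setJobName("my job");

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Identity classes used as placeholders; a real job plugs in its own mapper and reducer.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic options such as -conf and -D before calling run().
        int exitCode = ToolRunner.run(new MyDriver(), args);
        System.exit(exitCode);
    }
}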


Packaging a Job
A job's classes must be packaged into a job JAR file to send to the cluster.
Any dependent JAR files can be packaged in a lib subdirectory in the job JAR file.
The client classpath
The user's client-side classpath set by hadoop jar <jar> is made up of:
o The job JAR file
o Any JAR files in the lib directory of the job JAR file, and the classes directory
o The classpath defined by HADOOP_CLASSPATH, if set


Launching a Job

To launch the job, we run the driver, specifying the cluster that we want to run the job on with the -conf option, for example:
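A typical invocation might look like the following; the JAR name, driver class, and paths are hypothetical placeholders.

$ hadoop jar my-job.jar MyDriver -conf conf/cluster.xml input/sample.txt output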


The Job output..


The MapReduce Web UI.


A web UI for viewing information about your jobs.
It is useful for:
o following a job's progress while it is running
o finding job statistics and logs after the job has completed
The jobtracker UI is available at http://jobtracker-host:50030/.


The jobtracker page


The job page


Map Reduce Programming


The MapReduce Approach


Shared memory approach (OpenMP, MPI, ...)
o Developer needs to take care of (almost) everything
o Synchronization, concurrency
o Resource allocation
MapReduce: a shared-nothing approach
o Most of the above issues are taken care of
o Problem decomposition and sharing partial results need particular attention
o Optimizations (memory and network consumption) are tricky


Functional Programming Roots


Key feature: higher-order functions
o Functions that accept other functions as arguments
o Map and Fold

Figure: Illustration of map and fold.


Functional Programming Roots


map phase:
o Given a list, map takes as an argument a function f (that takes a single argument) and applies it to all elements in the list
fold phase:
o Given a list, fold takes as arguments a function g (that takes two arguments) and an initial value
o g is first applied to the initial value and the first item in the list
o The result is stored in an intermediate variable, which is used as an input together with the next item to a second application of g
o The process is repeated until all items in the list have been consumed
(A short code illustration follows.)
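Purely to illustrate these functional roots (plain Java, not Hadoop code), a minimal sketch of map and fold over a small list:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapFoldDemo {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4);

        // map: apply f (here, squaring) to every element, each application in isolation
        List<Integer> squares = numbers.stream()
                .map(x -> x * x)
                .collect(Collectors.toList());

        // fold: combine elements with g (here, addition), starting from the initial value 0
        int sum = numbers.stream().reduce(0, (acc, x) -> acc + x);

        System.out.println(squares + ", sum = " + sum);   // prints [1, 4, 9, 16], sum = 10
    }
}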


Functional Programming Roots


We can view map as a transformation over a dataset
o This transformation is specified by the function f
o Each functional application happens in isolation
o The application of f to each element of a dataset can be parallelized in a straightforward manner
We can view fold as an aggregation operation
o The aggregation is defined by the function g
o Data locality: elements of the list must be brought together
o If we can group elements of the list, the fold phase can also proceed in parallel
Associative and commutative operations
o Allow performance gains through local aggregation and reordering


Functional Programming and MapReduce


Equivalence of MapReduce and functional programming:
o The map of MapReduce corresponds to the map operation
o The reduce of MapReduce corresponds to the fold operation
The framework coordinates the map and reduce phases:
o It decides how intermediate results are grouped so that the reduce can happen in parallel
In practice:
o A user-specified computation is applied (in parallel) to all input records of a dataset
o Intermediate results are aggregated by another user-specified computation


Mappers and Reducers


Data Structures
Key-value pairs are the basic data structure in MapReduce
o Keys and values can be integers, floats, strings, or raw bytes
o They can also be arbitrary data structures
The design of MapReduce algorithms involves:
o Imposing the key-value structure on arbitrary datasets
- E.g.: for a collection of Web pages, input keys may be URLs and values may be the HTML content
o In some algorithms input keys are not used; in others they uniquely identify a record
o Keys can be combined in complex ways to design various algorithms


A MapReduce job
The programmer defines a mapper and a reducer as follows:
o map: (k1, v1) → [(k2, v2)]
o reduce: (k2, [v2]) → [(k3, v3)]
A MapReduce job consists of:
o A dataset stored on the underlying distributed filesystem, which is split into a number of files across machines
o The mapper, applied to every input key-value pair to generate intermediate key-value pairs
o The reducer, applied to all values associated with the same intermediate key to generate output key-value pairs


Where the magic happens


Implicit between the map and reduce phases is a distributed group-by operation on intermediate keys
o Intermediate data arrive at each reducer in order, sorted by the key
o No ordering is guaranteed across reducers
Output keys from reducers are written back to the distributed filesystem
o The output may consist of r distinct files, where r is the number of reducers
o Such output may be the input to a subsequent MapReduce phase
Intermediate keys are transient:
o They are not stored on the distributed filesystem
o They are spilled to the local disk of each machine in the cluster


A Simplified view of MapReduce

Figure: Mappers are applied to all input key-value pairs to generate an arbitrary number of intermediate pairs. Reducers are applied to all intermediate values associated with the same intermediate key. Between the map and reduce phases lies a barrier that involves a large distributed sort and group-by.


Hello World in MapReduce


Input:
o Key-value pairs: (docid, doc) stored on the distributed filesystem
o docid: unique identifier of a document
o doc: the text of the document itself
Mapper:
o Takes an input key-value pair and tokenizes the document
o Emits intermediate key-value pairs: the word is the key and an integer count is the value
The framework:
o Guarantees that all values associated with the same key (the word) are brought to the same reducer
The reducer:
o Receives all values associated with a given key
o Sums the values and writes output key-value pairs: the key is the word and the value is the number of occurrences
(A minimal word count sketch follows.)
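A minimal word count mapper and reducer, sketched with the classic org.apache.hadoop.mapred API used elsewhere in these slides; the class names are illustrative.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Tokenize the line and emit (word, 1) for every token.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Sum all counts for this word and emit (word, total).
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}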


Implementation and Execution Details


The partitioner is in charge of assigning intermediate keys (words) to reducers.
Note that the partitioner can be customized.
How many map and reduce tasks?
o The framework essentially takes care of the number of map tasks (one per input split)
o The designer/developer takes care of the number of reduce tasks


Restrictions
Using external resources
o E.g.: data stores other than the distributed filesystem
o Beware of concurrent access by many map/reduce tasks
Side effects
o Not allowed in functional programming
o E.g.: preserving state across multiple inputs
o State is kept internal
I/O and execution
o External side effects are possible using distributed data stores (e.g., BigTable)
o You can have no input (e.g., computing π), no reducers, but never no mappers


The Execution Framework



MapReduce program, a.k.a. a job:
o Code of mappers and reducers
o Code for combiners and partitioners (optional)
o Configuration parameters
o All packaged together
A MapReduce job is submitted to the cluster
o The framework takes care of everything else


Tutorial: Map Reduce


Debugging a Job
o The web UI (add debug statements that log to standard error)
o Custom counters


Add debugging to the mapper:
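The slide's code screenshot is not reproduced here; a hedged sketch of the usual pattern is shown below, writing to standard error and incrementing a custom counter through the Reporter. It is written as a variant of the map() method of the hypothetical WordCount.Map class above, and the condition and counter names are illustrative.

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    String line = value.toString();
    if (line.length() > 1000) {                      // arbitrary "suspicious record" condition
        // Goes to the task's standard error log, viewable from the web UI.
        System.err.println("Unusually long line for input offset: " + key);
        reporter.setStatus("Detected possibly corrupt record: see logs.");
        // Custom counter, aggregated by the framework and shown on the job page.
        reporter.incrCounter("MyCounters", "SUSPICIOUS_LINES", 1);
    }
    // ... normal map logic continues here ...
}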


The tasks page


The task details page


Hadoop Logs


Anything written to standard output or standard error is directed to the relevant log file.


Remote Debugging
Attaching a debugger is hard to arrange when running the job on a cluster.
Options:
o Reproduce the failure locally
o Use JVM debugging options
o Use task profiling
o Use IsolationRunner
Set keep.failed.task.files to true to keep a failed task's files.


Tuning a Job


Job Submission
JobClient class
o The runJob() method creates a new instance of a JobClient
o It then calls submitJob() on this instance
Simple verifications on the job:
o Is there an output directory?
o Are there any input splits?
o Can the JAR of the job be copied to HDFS?
NOTE: the JAR of the job is replicated 10 times


MapReduce Workflows
o When the processing gets more complex, as a rule of thumb, think about adding more jobs rather than adding complexity to jobs.
o For more complex problems, consider a higher-level language than MapReduce, such as Pig, Hive, Cascading, Cascalog, or Crunch.
o One immediate benefit is that this frees you from the translation into MapReduce jobs, allowing you to concentrate on the analysis you are performing.


JobControl:

When there is more than one job in a MapReduce workflow:

o For a linear chain, the simplest approach is to run each job one after another.
o For anything more complex than a linear chain, use org.apache.hadoop.mapreduce.jobcontrol.JobControl:
- represents a graph of jobs to be run
- add the job configurations and tell the JobControl instance the dependencies between jobs
- run the JobControl in a thread, and it runs the jobs in dependency order
- you can poll for progress, and when the jobs have finished, you can query for all the jobs' statuses and the associated errors for any failures
- if a job fails, JobControl won't run its dependencies
(A sketch follows below.)
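A sketch of driving two dependent jobs with JobControl. It assumes the new-API helper classes ControlledJob and JobControl in org.apache.hadoop.mapreduce.lib.jobcontrol (exact package names vary between Hadoop versions), and the two Configurations stand for jobs that are assumed to be fully configured elsewhere.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class WorkflowRunner {
    public static void main(String[] args) throws Exception {
        Configuration job1Conf = new Configuration();   // assumed to be fully configured elsewhere
        Configuration job2Conf = new Configuration();   // assumed to be fully configured elsewhere

        ControlledJob first = new ControlledJob(job1Conf);
        ControlledJob second = new ControlledJob(job2Conf);
        second.addDependingJob(first);                  // second runs only after first succeeds

        JobControl control = new JobControl("my-workflow");
        control.addJob(first);
        control.addJob(second);

        new Thread(control).start();                    // run the controller in its own thread
        while (!control.allFinished()) {
            Thread.sleep(1000);                         // poll for progress
        }
        System.out.println("Failed jobs: " + control.getFailedJobList());
        control.stop();
    }
}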


Advanced MapReduce: How does MapReduce work?


Classic MapReduce


Failures
One of the major benefits of using Hadoop is its ability to handle failures and allow the job to complete.
Task failure:
o When user code in the map or reduce task throws a runtime exception, the error ultimately makes it into the user logs.
o Hanging tasks are dealt with differently: mapred.task.timeout
o When the jobtracker is notified of a task attempt that has failed (by the tasktracker's heartbeat call), it will reschedule execution of the task.
o The jobtracker will try to avoid rescheduling the task on a tasktracker where it has previously failed.


Failures
Tasktracker failure:
The jobtracker will notice a tasktracker that has stopped sending heartbeats if it hasn't received one for 10 minutes (configured via the mapred.tasktracker.expiry.interval property, in milliseconds)
and will remove it from its pool of tasktrackers to schedule tasks on.

Jobtracker failure
Failure of the jobtracker is the most serious failure mode.
Hadoop has no mechanism for dealing with jobtracker failure: it is a single point of failure, so in this case all running jobs fail.
After restarting a jobtracker, any jobs that were running at the time it was stopped will need to be resubmitted.


Partitioners and Combiners


Partitioners
Partitioners are responsible for:
o Dividing up the intermediate key space
o Assigning intermediate key-value pairs to reducers
o Specifying the reduce task to which an intermediate key-value pair must be copied
Hash-based partitioner
o Computes the hash of the key modulo the number of reducers r
o This ensures a roughly even partitioning of the key space
- However, it ignores values: this can cause imbalance in the data processed by each reducer
o When dealing with complex keys, even the base partitioner may need customization


Combiners
Combiners are an (optional) optimization:
o They allow local aggregation before the shuffle and sort phase
o Each combiner operates in isolation
Essentially, combiners are used to save bandwidth
o E.g.: the word count program
Combining can also be implemented inside the mapper using local data structures (see the sketch below)
o E.g., an associative array keeps intermediate computations and their aggregation
o The map function then only emits once all input records (of its split) are processed
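A hedged sketch of this in-mapper combining pattern for word count, using the classic API; a separate combiner class could instead be registered with JobConf.setCombinerClass(). The class name is illustrative.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class InMapperCombiningMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Integer> counts = new HashMap<String, Integer>();
    private OutputCollector<Text, IntWritable> out;   // remembered so close() can emit

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter) {
        this.out = output;
        for (String word : value.toString().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            Integer current = counts.get(word);
            counts.put(word, current == null ? 1 : current + 1);   // aggregate locally
        }
    }

    @Override
    public void close() throws IOException {
        // Emit once, after all records of the split have been processed.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.collect(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}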


Partitioners and Combiners, an Illustration

Note: in Hadoop, partitioners are executed before combiners


Lab : Combiner & Partitioners


MRUnit: MapReduce Unit Testing

The map and reduce functions in MapReduce are easy to test in isolation.
MRUnit:
o a testing library that makes it easy to pass known inputs to a mapper or a reducer and check that the outputs are as expected
o used in conjunction with a standard test execution framework, such as JUnit (see the sketch below)
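A hedged sketch of an MRUnit test for the hypothetical WordCount.Map class shown earlier, using the old-API MapDriver in org.apache.hadoop.mrunit; which driver package applies (org.apache.hadoop.mrunit vs. org.apache.hadoop.mrunit.mapreduce) depends on which MapReduce API the mapper uses.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

    @Test
    public void emitsOneCountPerToken() throws Exception {
        new MapDriver<LongWritable, Text, Text, IntWritable>()
                .withMapper(new WordCount.Map())
                .withInput(new LongWritable(0), new Text("cat cat dog"))
                // Expected outputs, in the order the mapper emits them.
                .withOutput(new Text("cat"), new IntWritable(1))
                .withOutput(new Text("cat"), new IntWritable(1))
                .withOutput(new Text("dog"), new IntWritable(1))
                .runTest();
    }
}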


Mapper


Reducer


Tutorial : MRUnit.


Hadoop MapReduce Types and Formats


MapReduce Types
o Input / output to mappers and reducers:
a. map: (k1, v1) → [(k2, v2)]
b. reduce: (k2, [v2]) → [(k3, v3)]
o In Hadoop, a mapper is created as follows:
a. void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
o Types:
a. K types implement WritableComparable
b. V types implement Writable


What is a Writable?
o Hadoop defines its own classes for strings (Text), integers (IntWritable), etc.
o All keys are instances of WritableComparable
- Why comparable? Keys must be sorted during the shuffle and sort phase.
o All values are instances of Writable
(A sketch of a custom key type follows.)
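A hedged sketch of a custom key type: a pair of ints implementing WritableComparable so the framework can serialize and sort it. The class name is illustrative.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class IntPairWritable implements WritableComparable<IntPairWritable> {
    private int first;
    private int second;

    public IntPairWritable() { }          // required: the framework instantiates keys via reflection

    public IntPairWritable(int first, int second) {
        this.first = first;
        this.second = second;
    }

    public void write(DataOutput out) throws IOException {    // serialization
        out.writeInt(first);
        out.writeInt(second);
    }

    public void readFields(DataInput in) throws IOException { // deserialization
        first = in.readInt();
        second = in.readInt();
    }

    public int compareTo(IntPairWritable other) {              // lets the framework sort keys
        if (first != other.first) {
            return first < other.first ? -1 : 1;
        }
        if (second != other.second) {
            return second < other.second ? -1 : 1;
        }
        return 0;
    }

    @Override
    public int hashCode() {                                    // used by the default HashPartitioner
        return 31 * first + second;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof IntPairWritable)) {
            return false;
        }
        IntPairWritable p = (IntPairWritable) o;
        return first == p.first && second == p.second;
    }

    @Override
    public String toString() {
        return first + "\t" + second;
    }
}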


Reading Data
Datasets are specified by InputFormats
o InputFormats define the input data (e.g., a file, a directory)
o An InputFormat is a factory for RecordReader objects that extract key-value records from the input source
InputFormats identify partitions of the data that form an InputSplit
o An InputSplit is a (reference to a) chunk of the input processed by a single map
- The largest split is processed first
o Each split is divided into records, and the map processes each record (a key-value pair) in turn
o Splits and records are logical; they are not physically bound to a file


The relationship between InputSplit and HDFS blocks


FileInputFormat and Friends

TextInputFormat
o Treats each newline-terminated line of a file as a value
KeyValueTextInputFormat
o Maps newline-terminated text lines of the form "key SEPARATOR value"
SequenceFileInputFormat
o Binary file of key-value pairs with some additional metadata
SequenceFileAsTextInputFormat
o Same as before, but maps (k.toString(), v.toString())


Filtering File Inputs

FileInputFormat reads all files out of a specified directory and sends them to the mapper.
It delegates filtering of this file list to a method that subclasses may override.
o Example: create your own XyzFileInputFormat to read *.xyz from a directory list (see the sketch below)
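Instead of subclassing, a path filter can also be attached to FileInputFormat. A hedged sketch with an illustrative filter that accepts only *.xyz files (the class names and the input directory are hypothetical):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class XyzInputSetup {

    // Accept only files whose names end in .xyz.
    public static class XyzFilter implements PathFilter {
        public boolean accept(Path path) {
            return path.getName().endsWith(".xyz");
        }
    }

    public static void configure(JobConf conf) {
        FileInputFormat.addInputPath(conf, new Path("input"));    // hypothetical input directory
        FileInputFormat.setInputPathFilter(conf, XyzFilter.class);
    }
}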


Record Readers
Each InputFormat provides its own RecordReader implementation
LineRecordReader
 Reads a line from a text file
KeyValueRecordReader
 Used by KeyValueTextInputFormat


Input Split Size


Sending Data to Reducers


The map function receives an OutputCollector object
o OutputCollector.collect() receives key-value elements
o Any (WritableComparable, Writable) pair can be used
o By default, the mapper output types are assumed to be the same as the reducer output types


WritableComparator
Compares WritableComparable data
o Will call the WritableComparable.compareTo() method
o Can provide a fast path for serialized data
Configured through JobConf.setOutputValueGroupingComparator()


Partitioner
int getPartition(key, value, numPartitions)
o Outputs the partition number for a given key
o One partition == all values sent to a single reduce task
HashPartitioner is used by default
o Uses key.hashCode() to return the partition number
JobConf is used to set the Partitioner implementation (see the sketch below)
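A hedged sketch of a custom partitioner with the classic API, routing keys by their first character; the class name is illustrative, and it would be registered in the driver with conf.setPartitionerClass(FirstLetterPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

    public void configure(JobConf job) {
        // No configuration needed for this simple partitioner.
    }

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        if (s.isEmpty()) {
            return 0;
        }
        // All keys starting with the same (lower-cased) character go to the same reduce task.
        int letter = Character.toLowerCase(s.charAt(0));
        return (letter & Integer.MAX_VALUE) % numPartitions;
    }
}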


The Reducer
void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
Keys and values sent to one partition all go to the same reduce task
Calls are sorted by key
o Early keys are reduced and output before late keys


Writing the Output



o Analogous to InputFormat
o TextOutputFormat writes "key value <newline>" strings to the output file
o SequenceFileOutputFormat uses a binary format to pack key-value pairs
o NullOutputFormat discards output


Lab: Input and Output


Map Side and Reduce Side Joins


Joins
MapReduce can perform joins between large datasets


Join:

o If the join is performed by the mapper, it is called a map-side join.
o If it is performed by the reducer, it is called a reduce-side join.


Map-Side Joins
o A map-side join between large inputs works by performing the join before the data reaches the map function.
o The inputs to each map must be partitioned and sorted in a particular way.
o Each input dataset must be divided into the same number of partitions, and it must be sorted by the same key (the join key) in each source.
o All the records for a particular key must reside in the same partition.


Reduce-Side Joins
o A reduce-side join is more general than a map-side join:
o the input datasets don't have to be structured in any particular way;
o the mapper tags each record with its source and uses the join key as the map output key, so that the records with the same key are brought together in the reducer.


Lab : Map Side Join.


Managing a Hadoop Cluster

Hadoop Cluster Component


NameNode: Manages the namespace, file system metadata, and access control. There is exactly one NameNode in each cluster.
SecondaryNameNode: Downloads periodic checkpoints from the NameNode for fault-tolerance. There is exactly one SecondaryNameNode in each cluster.

Hadoop Cluster Component


JobTracker: Hands out tasks to the slave nodes. There is exactly one JobTracker in each cluster.
DataNode: Holds file system data; each data node manages its own locally-attached storage (i.e., the node's hard disk) and stores a copy of some or all blocks in the file system. There are one or more DataNodes in each cluster. If your cluster has only one DataNode, then file system data cannot be replicated.

Hadoop Cluster Component


TaskTracker: Slaves that carry out map and reduce tasks. There are one or more TaskTrackers in each cluster.

HDFS Architecture
Figure: HDFS architecture. Clients send metadata operations (e.g., name and replica count for /home/foo/data) to the NameNode, and read/write block operations directly to DataNodes; DataNodes replicate blocks across racks (Rack 1, Rack 2).

Platform requirements for Hadoop


Java Requirements
Hadoop is a Java-based system. Recent versions of
Hadoop require Sun Java 1.6.

Operating System
As Hadoop is written in Java, it is mostly portable
between different operating systems

Downloading and Installing Hadoop

Topology of a typical Hadoop cluster .

Installation Steps
o Install Java
o Install ssh and sshd
o gunzip hadoop-0.18.0.tar.gz
  or tar xvf hadoop-0.18.0.tar
o Set JAVA_HOME in conf/hadoop-env.sh
o Modify hadoop-site.xml

Hadoop Installation Flavors


Standalone
Pseudo-distributed
Hadoop clusters of multiple nodes

Additional Configuration
conf/masters
o contains the hostname of the SecondaryNameNode
o It should be a fully-qualified domain name.

conf/slaves
o contains the hostname of every machine in the cluster which should start TaskTracker and DataNode daemons
Ex:
slave01
slave02
slave03

Advanced Configuration
Enable passwordless ssh:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

The ~/.ssh/id_dsa.pub and authorized_keys files should be replicated on all machines in the cluster.

Advanced Configuration
Various directories should be created on each node.
The NameNode requires the NameNode metadata directory:
$ mkdir -p /home/hadoop/dfs/name

Every node needs the Hadoop tmp directory and the DataNode directory created.

Advanced Configuration..
bin/slaves.sh allows a command to be executed on all nodes in the slaves file.
$ mkdir -p /tmp/hadoop
$ export HADOOP_CONF_DIR=${HADOOP_HOME}/conf
$ export HADOOP_SLAVES=${HADOOP_CONF_DIR}/slaves
$ ${HADOOP_HOME}/bin/slaves.sh "mkdir -p /tmp/hadoop"
$ ${HADOOP_HOME}/bin/slaves.sh "mkdir -p /home/hadoop/dfs/data"

Format HDFS
$ bin/hadoop namenode -format

Start the cluster:
$ bin/start-all.sh

Important Directories

HADOOP_LOG_DIR
o Description: output location for log files from daemons
o Default location: ${HADOOP_HOME}/logs
o Suggested location: /var/log/hadoop

hadoop.tmp.dir
o Description: a base for other temporary directories
o Default location: /tmp/hadoop-${user.name}
o Suggested location: /tmp/hadoop

dfs.name.dir
o Description: where the NameNode metadata should be stored
o Default location: ${hadoop.tmp.dir}/dfs/name
o Suggested location: /home/hadoop/dfs/name

dfs.data.dir
o Description: where DataNodes store their blocks
o Default location: ${hadoop.tmp.dir}/dfs/data
o Suggested location: /home/hadoop/dfs/data

mapred.system.dir
o Description: the in-HDFS path to shared MapReduce system files
o Default location: ${hadoop.tmp.dir}/mapred/system
o Suggested location: /hadoop/mapred/system

Recommended configuration
o Move dfs.name.dir and dfs.data.dir out from under hadoop.tmp.dir.
o Adjust mapred.system.dir.

Selecting Machines
Hadoop is designed to take advantage of
whatever hardware is available
Hadoop jobs written in Java can consume
between 1 and 2 GB of RAM per core
If you use HadoopStreaming to write your jobs
in a scripting language such as Python, more
memory may be advisable.

Cluster Configurations
Small Clusters: 2-10 Nodes
Medium Clusters: 10-40 Nodes
Large Clusters: Multiple Racks

Small Clusters: 2-10 Nodes

In a two-node cluster:
o one node runs the NameNode/JobTracker and a DataNode/TaskTracker;
o the other node runs a DataNode/TaskTracker.
Clusters of three or more machines typically use a dedicated NameNode/JobTracker, and all other nodes are workers.

configuration in conf/hadoop-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>head.server.node.com:9001</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://head.server.node.com:9000</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/dfs/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/dfs/name</value>
<final>true</final>
</property>
<property>

<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>/hadoop/mapred/system</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>

Medium Clusters: 10-40 Nodes

The single point of failure in a Hadoop cluster is the NameNode.
Hence, back up the NameNode metadata.
One machine in the cluster should be designated as the NameNode's backup:
o it does not run the normal Hadoop daemons;
o it exposes a directory via NFS which is only mounted on the NameNode.

NameNode's backup
The cluster's hadoop-site.xml file should then instruct the NameNode to write to this directory as well:
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/dfs/name,/mnt/namenode-backup</value>
<final>true</final>
</property>

Backup NameNode
The backup machine can also serve as the SecondaryNameNode.
Note that this is not a failover NameNode process; it only takes periodic snapshots of the NameNode's metadata.

conf/hadoop-site.xml
Nodes must be decommissioned on a schedule that permits
replication of blocks being decommissioned.
conf/hadoop-site.xml
<property>
<name>dfs.hosts.exclude</name>
<value>/home/hadoop/excludes</value>
<final>true</final>
</property>
<property>
<name>mapred.hosts.exclude</name>
<value>/home/hadoop/excludes</value>
<final>true</final>
</property>
create an empty file with this name:
$ touch /home/hadoop/excludes

Replication Setting
<property>
<name>dfs.replication</name>
<value>3</value>
</property>

Disk & heap


<property>
<name>dfs.datanode.du.reserved</name>
<value>1073741824</value>
<final>true</final>
</property>

<property>
<name>mapred.child.java.opts</name>
<value>-Xmx512m</value>
</property>

Using multiple drives per machine


DataNodes can be configured to write blocks
out to multiple disks via the dfs.data.dir
property.
<property>
<name>dfs.data.dir</name>
<value>/d1/dfs/data,/d2/dfs/data,/d3/dfs/data,/d4/dfs/data</value>
<final>true</final>
</property>

Using multiple drives per machine..


<property>
<name>mapred.local.dir</name>
<value>
/d1/mapred/local,/d2/mapred/local,
/d3/mapred/local,/d4/mapred/local
</value>
<final>true</final>
</property>

Tutorial
Configure Hadoop Cluster in two nodes.
Tutorial-Installed Hadoop in Cluster.docx

Large Clusters: Multiple Racks


The possibility of rack failure now exists:
o operational racks should be able to continue even if entire other racks are disabled;
o the amount of metadata under the care of the NameNode increases.

Large Clusters: Multiple Racks


The NameNode is responsible for managing metadata associated with each block in HDFS.
As the amount of data in the cluster scales into the 10's or 100's of TB, a larger block size keeps the number of blocks, and hence the NameNode metadata, manageable:
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>

Large Clusters: Multiple Racks


The NFS-mounted write-through backup
should be placed in a different rack from the
NameNode.
The SecondaryNameNode should be
instantiated on a separate rack

Large Clusters: Multiple Racks


<property>
<name>dfs.namenode.handler.count</name>
<value>40</value>
</property>
<property>
<name>mapred.job.tracker.handler.count</name>
<value>40</value>
</property>

Large Clusters: Multiple Racks


Useful tuning properties (Property, suggested Range, Description):

io.file.buffer.size (32768-131072)
o Read/write buffer size used in SequenceFiles (should be a multiple of the hardware page size)
io.sort.factor (50-200)
o Number of streams to merge concurrently when sorting files during shuffling
io.sort.mb (50-200)
o Amount of memory to use while sorting data
mapred.reduce.parallel.copies (20-50)
o Number of concurrent connections a reducer should use when fetching its input from mappers
tasktracker.http.threads (40-50)
o Number of threads each TaskTracker uses to provide intermediate map output to reducers
mapred.tasktracker.map.tasks.maximum (1/2 * (cores/node) to 2 * (cores/node))
o Number of map tasks to deploy on each machine
mapred.tasktracker.reduce.tasks.maximum (1/2 * (cores/node) to 2 * (cores/node))
o Number of reduce tasks to deploy on each machine
