
Profiling the Network Performance

of Hadoop Jobs
Team : Pramod Biligiri & Sayed Asad Ali

Talk Outline
Introduction to the problem
What is Hadoop?
Hadoop's MapReduce Framework
Shuffle as a Bottleneck
Experimental Setup
Choice of Benchmarks
Terasort Discussion
Ranked Inverted Index Discussion
Summary and Future Work

Introduction to the problem


Reproduce existing results which show that the
network is the bottleneck in shuffle-intensive
Hadoop jobs.

What is Hadoop?
A framework for the distributed processing of large data sets across
clusters of computers using simple programming models, based on
Google's MapReduce.
Distinct Features:
- Designed for commodity hardware
- Highly fault-tolerant
- Horizontally scalable
- Pushes computation to the data

MapReduce
MapReduce is a programming model for processing large data sets
with a parallel, distributed algorithm on a cluster.
Programming Model:
- For each input record, generate (key, value) pairs
- Apply a reduce operation to all values that share the same key
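
To make the model concrete, here is a minimal, framework-free word-count sketch in plain Java (class and method names are illustrative, not Hadoop's API): the map step emits (key, value) pairs per record, an explicit grouping step stands in for the shuffle, and the reduce step folds all values that share a key.

import java.util.*;

public class MapReduceSketch {
    // Map step: for each input record, emit (key, value) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Reduce step: combine all values grouped under the same key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "the lazy dog");

        // "Shuffle": group emitted values by key (done by the framework in Hadoop).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : input) {
            for (Map.Entry<String, Integer> kv : map(record)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        grouped.forEach((k, v) -> System.out.println(k + " -> " + reduce(k, v)));
    }
}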

Hadoop's MapReduce Framework


1. Prepare the Map() input
2. Run the user-provided Map() code
3. "Shuffle" the Map output to the Reduce processors
4. Run the user-provided Reduce() code
5. Produce the final output
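
For orientation, a minimal Hadoop driver sketch below marks where each of the five steps is wired up. MyMapper and MyReducer are hypothetical user classes (as are the Text/IntWritable output types); the Job and FileInputFormat/FileOutputFormat calls are the standard Hadoop MapReduce API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "example");
        job.setJarByClass(Driver.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // 1. prepare the Map() input
        job.setMapperClass(MyMapper.class);                     // 2. user-provided Map() code
                                                                // 3. shuffle: handled by the framework
        job.setReducerClass(MyReducer.class);                   // 4. user-provided Reduce() code
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // 5. where the final output lands

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}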

MapReduce Flow
[Figure: end-to-end MapReduce data flow; the shuffle step is highlighted]

Shuffle as a Bottleneck?
"On average, the shuffle phase accounts for 33% of the running
time in these jobs. In addition, in 26% of the jobs with reduce tasks,
shuffles account for more than 50% of the running time, and in 16% of
jobs, they account for more than 70% of the running time. This
confirms widely reported results that the network is a bottleneck
in MapReduce."
Managing Data Transfers in Computer Clusters with Orchestra
- Mosharaf Chowdhury et al.

Chosen Benchmarks
Terasort
Ranked Inverted Index

Experimental Setups

Instance type         Memory   CPU                                               Elastic Compute Units   Disk         Network performance
Config 1: m1.large    7.5 GB   64-bit                                            -                       2 x 420 GB   Moderate
Config 2: m1.xlarge   15 GB    64-bit                                            -                       4 x 420 GB   High
SDSC: custom          8 GB     64-bit, Intel Xeon CPU 5140 @ 2.33 GHz, 4 cores   -                       2 x 1.5 TB   1 Gb/s

Network Performance of EMR


Conflicting values!

Source 1: Measured with AppNeta pathtest, average 753 Mb/s
http://www.appneta.com/resources/pathtest-download.html

Source 2: "The available bandwidth is still 1 Gb/s, confirming
anecdotal evidence that EC2 has full bisection bandwidth."
Opening Up Black Box Networks with CloudTalk, by Costin Raiciu et al.

Source 3: "The median TCP/UDP throughput of medium
instances are both close to 760 Mb/s."
The Impact of Virtualization on Network Performance of Amazon EC2 Data Center, by Guohui Wang et al.

Why Terasort?

Popular benchmark for Hadoop
- Shipped with most Hadoop distributions
- Utilizes all aspects of the cluster: CPU, network, disk and memory
- Large amount of data to shuffle (240 GB)

Representative of real-world workloads
"This data shuffle pattern arises in large scale sorts, merges and join
operations in the data center. We chose this test because, in our
interactions with application developers, we learned that many use such
operations with caution, because the operations are highly expensive in
today's data center network."
source: VL2: A Scalable and Flexible Data Center Network - A. Greenberg et al.

Terasort - How it works:

Sorts 1 terabyte of data.
- Each data item is 100 bytes in size.
- The first 10 bytes of a data item constitute its sort key.

Format of input data:
<key 10 bytes><rowid 10 bytes><filler 78 bytes>\r\n
  key    : random characters from ASCII 32-126
  rowid  : an integer
  filler : random characters from the set A-Z
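
As an illustration of this layout only (this is not Hadoop's teragen code), a small Java sketch that builds one 100-byte record in the format above and pulls out its 10-byte sort key:

import java.util.Random;

public class TerasortRecord {
    static String makeRecord(long rowId, Random rnd) {
        StringBuilder sb = new StringBuilder(100);
        for (int i = 0; i < 10; i++)                  // key: random ASCII 32-126
            sb.append((char) (32 + rnd.nextInt(95)));
        sb.append(String.format("%010d", rowId));     // rowid: integer (zero-padded here for illustration)
        for (int i = 0; i < 78; i++)                  // filler: random A-Z
            sb.append((char) ('A' + rnd.nextInt(26)));
        return sb.append("\r\n").toString();          // 10 + 10 + 78 + 2 = 100 bytes
    }

    static String sortKey(String record) {
        return record.substring(0, 10);               // first 10 bytes are the sort key
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        String rec = makeRecord(0, rnd);
        System.out.println(rec.length() + " bytes, key = \"" + sortKey(rec) + "\"");
    }
}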

Terasort - How it works:

Map
- Partition input keys into different buckets
  (leverages Hadoop's default sorting of Map output)

Reduce
- Collect the outputs from the different maps
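
The bucketing idea can be pictured with a simplified range partitioner that splits the printable-ASCII key space evenly across reducers by the key's first byte. Hadoop's TeraSort actually samples the input to choose split points, so treat this only as a sketch of the concept:

public class KeyRangePartitioner {
    static int getPartition(String key, int numReduceTasks) {
        int first = key.charAt(0) - 32;      // keys use ASCII 32-126, so first maps to 0..94
        return first * numReduceTasks / 95;  // even split of the key space across reducers
    }

    public static void main(String[] args) {
        System.out.println(getPartition("~zzzzzzzzz", 16)); // highest bucket: 15
        System.out.println(getPartition("          ", 16)); // lowest bucket: 0
    }
}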

Results

Comparison of Terasort on different configurations

            Total Job    Map Time   Reduce Time   Shuffle Average   Shuffle
            Time (min)   (min)      (min)         Time              Time %
Config 1    205          84         205           60                29.3
SDSC        166          60         90            36                21.7
Config 2    86           40         75            22                25.5

Instance types:
Config 1: m1.large (RAM 7.5 GB)
Config 2: m1.xlarge (RAM 15 GB)
SDSC: custom (RAM 8 GB)

CDF of data transferred over the network during the lifetime of the job
[Figure: CDF with annotations marking where the shuffle starts, Map outputs are sorted (local to the node), Map ends, the shuffle ends, and Reduce is nearly done]

Network Transfer Rate on nodes
[Figure: per-node network transfer rate; annotation marks the point where the network link is saturated]

Disk I/O
[Figure: per-node disk read (blue) and write (red) throughput; annotation marks the sorting of map outputs]

CPU Utilisation

Memory Statistics

Why Ranked Inverted Index?

For a given text corpus, for each word it generates the list of
documents containing that word, in decreasing order of frequency:
word -> (count1 | file1), (count2 | file2), ...
where count1 > count2 > ...

A ranked inverted index is often used in text processing and
information retrieval tasks.

Mentioned in the Tarazu paper as a shuffle-heavy workload
Tarazu: Optimizing MapReduce On Heterogeneous Clusters, Faraz Ahmad et al.

Ranked Inverted Index - How it works:

Map input:     (word | filename) -> count
Map output:    word -> (filename, count)
Reduce output: word -> (count1 | file1), (count2 | file2), ...

It involves a sort of the values on the reduce side.
(Note that the Map input is the output of another MapReduce job called
sequence-count.)
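
A hedged sketch of how the Map and Reduce sides could look with the standard Hadoop API, assuming (this is an assumption, not the benchmark's actual code) that the sequence-count output is text lines of the form word|filename<TAB>count:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RankedInvertedIndex {
    public static class RIIMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] keyAndCount = value.toString().split("\t");   // "word|file" TAB count (assumed format)
            String[] wordAndFile = keyAndCount[0].split("\\|");
            // emit word -> (filename, count)
            ctx.write(new Text(wordAndFile[0]),
                      new Text(wordAndFile[1] + "," + keyAndCount[1]));
        }
    }

    public static class RIIReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            // Sort the (filename, count) pairs by count, descending -- this per-key
            // sort of values on the reduce side is what the slide refers to.
            List<String[]> postings = new ArrayList<>();
            for (Text v : values) postings.add(v.toString().split(","));
            postings.sort((a, b) -> Long.compare(Long.parseLong(b[1]), Long.parseLong(a[1])));

            StringBuilder out = new StringBuilder();
            for (String[] p : postings)
                out.append("(").append(p[1]).append("|").append(p[0]).append(") ");
            ctx.write(word, new Text(out.toString().trim()));
        }
    }
}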

Experimental Results of Ranked Inverted Index

            Total Job    Map Time   Reduce Time   Shuffle Average   Shuffle
            Time (min)   (min)      (min)         Time              Time %
Config 1    12           5.5        11.5          3.5               27.14

Input Data Set: 40 GB ftp://ftp.ecn.purdue.edu/fahmad/rankedinvindex_40GB.tar.bz2

Instance type:
Config 1: m1.large (RAM 7.5 GB)

CDF of data transferred over the network during the lifetime of the job
[Figure: CDF with annotations marking where the shuffle starts, Map ends, the shuffle ends, Reduce is nearly done, and results are replicated to 3 nodes]

Network Transfer Rate on nodes
[Figure: per-node network transfer rate; annotation marks the point where the network link is saturated]

Disk I/O
[Figure: per-node disk read (blue) and write (red) throughput]

CPU Utilisation

Memory Statistics

Summary
- Shuffle can account for a significant fraction of the total job runtime
- It is worth investing in good network connectivity for a compute cluster

Stuff that doesn't add up!

Why does the peak network bandwidth for Ranked Inverted Index
overshoot the 1 Gb/s mark?
Why is the sort phase of RII so short?

Future Work

How does changing the various parameters make a difference? e.g.
io.sort.mb, io.sort.factor, fs.inmemory.size.mb (see the sketch after this list)
Effect of Combiners?
Varying the number of Map tasks and Reduce tasks
How many Map tasks are rack-local or machine-local?
Investigate the unresolved issues
Lack of precise information about topology and network
bandwidth for EMR clusters
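
As a starting point for such a parameter sweep, a small sketch of how these (pre-YARN style) property names could be set programmatically before submitting a job; the rest of the job setup is elided:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ParamSweep {
    public static Job configure(int sortMb, int sortFactor, int inMemFsMb) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("io.sort.mb", sortMb);               // map-side sort buffer size
        conf.setInt("io.sort.factor", sortFactor);       // number of streams merged at once
        conf.setInt("fs.inmemory.size.mb", inMemFsMb);   // in-memory FS size listed above
        return Job.getInstance(conf, "terasort-io.sort.mb=" + sortMb);
    }
}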

Q&A

Thank you!

Standard Test Results

                        Input Size   Run Time on      Shuffle Volume   Critical Path
                                     Hadoop (min)
tera-sort               300          2353             200              Shuffle
ranked-inverted-index   205          2322             219              Shuffle
