
Profiling the Network Performance

of Hadoop Jobs
Team : Pramod Biligiri & Sayed Asad Ali

Talk Outline
Introduction to the problem
What is Hadoop?
Hadoop's MapReduce Framework
Shuffle as a Bottleneck
Experimental Setup
Choice of Benchmarks
Terasort Discussion
Ranked Inverted Index Discussion
Summary and Future Work

Introduction to the problem


Reproduce existing results which show that the
network is the bottleneck in shuffle-intensive
Hadoop jobs.

What is Hadoop?
A framework for the distributed processing of large data sets across
clusters of computers using simple programming models, based on
Google's MapReduce.
Distinct Features:
- Designed for commodity hardware
- Highly fault-tolerant
- Horizontally scalable
- Pushes computation to the data

MapReduce
MapReduce is a programming model for processing large data sets
with a parallel, distributed algorithm on a cluster.
Programming Model:
- For each input record, generate (key, value) pairs
- Apply a reduce operation to all values that share the same key
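
To make the model concrete, here is a minimal, framework-free word-count sketch in plain Java (class and method names are illustrative, not Hadoop's API): the map step emits (key, value) pairs per record, an explicit grouping step stands in for the shuffle, and the reduce step folds all values that share a key.

import java.util.*;

public class MapReduceSketch {
    // Map step: for each input record, emit (key, value) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Reduce step: combine all values grouped under the same key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "the lazy dog");

        // "Shuffle": group emitted values by key (done by the framework in Hadoop).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : input) {
            for (Map.Entry<String, Integer> kv : map(record)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        grouped.forEach((k, v) -> System.out.println(k + " -> " + reduce(k, v)));
    }
}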

Hadoop's MapReduce Framework


1. Prepare the Map() input
2. Run the user-provided Map() code
3. "Shuffle" the Map output to the Reduce processors
4. Run the user-provided Reduce() code
5. Produce the final output
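
For orientation, a minimal Hadoop driver sketch below marks where each of the five steps is wired up. MyMapper and MyReducer are hypothetical user classes (as are the Text/IntWritable output types); the Job and FileInputFormat/FileOutputFormat calls are the standard Hadoop MapReduce API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "example");
        job.setJarByClass(Driver.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // 1. prepare the Map() input
        job.setMapperClass(MyMapper.class);                     // 2. user-provided Map() code
                                                                // 3. shuffle: handled by the framework
        job.setReducerClass(MyReducer.class);                   // 4. user-provided Reduce() code
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // 5. where the final output lands

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}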

MapReduce Flow
[Figure: end-to-end MapReduce data flow; the shuffle step is highlighted]

Shuffle as a Bottleneck?
"On average, the shuffle phase accounts for 33% of the running
time in these jobs. In addition, in 26% of the jobs with reduce tasks,
shuffles account for more than 50% of the running time, and in 16% of
jobs, they account for more than 70% of the running time. This
confirms widely reported results that the network is a bottleneck
in MapReduce."
Managing Data Transfers in Computer Clusters with Orchestra
- Mosharaf Chowdhury et al.

Chosen Benchmarks
Terasort
Ranked Inverted Index

Experimental Setups

Instance type         Memory   CPU                                               Elastic Compute Units   Disk         Network performance
Config 1: m1.large    7.5 GB   64-bit                                            -                       2 x 420 GB   Moderate
Config 2: m1.xlarge   15 GB    64-bit                                            -                       4 x 420 GB   High
SDSC: custom          8 GB     64-bit, Intel Xeon CPU 5140 @ 2.33 GHz, 4 cores   -                       2 x 1.5 TB   1 Gb/s

Network Performance of EMR


Conflicting values!

Source 1: Measured with AppNeta pathtest, average 753 Mb/s
http://www.appneta.com/resources/pathtest-download.html

Source 2: "The available bandwidth is still 1 Gb/s, confirming
anecdotal evidence that EC2 has full bisection bandwidth."
Opening Up Black Box Networks with CloudTalk, by Costin Raiciu et al.

Source 3: "The median TCP/UDP throughput of medium
instances are both close to 760 Mb/s."
The Impact of Virtualization on Network Performance of Amazon EC2 Data Center, by Guohui Wang et al.

Why Terasort?

Popular benchmark for Hadoop
- Shipped with most Hadoop distributions
- Utilizes all aspects of the cluster: CPU, network, disk and memory
- Large amount of data to shuffle (240 GB)

Representative of real-world workloads
"This data shuffle pattern arises in large scale sorts, merges and join
operations in the data center. We chose this test because, in our
interactions with application developers, we learned that many use such
operations with caution, because the operations are highly expensive in
today's data center network."
source: VL2: A Scalable and Flexible Data Center Network - A. Greenberg et al.

Terasort - How it works:

Sorts 1 terabyte of data.
- Each data item is 100 bytes in size.
- The first 10 bytes of a data item constitute its sort key.

Format of input data:
<key 10 bytes><rowid 10 bytes><filler 78 bytes>\r\n
  key    : random characters from ASCII 32-126
  rowid  : an integer
  filler : random characters from the set A-Z
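
As an illustration of this layout only (this is not Hadoop's teragen code), a small Java sketch that builds one 100-byte record in the format above and pulls out its 10-byte sort key:

import java.util.Random;

public class TerasortRecord {
    static String makeRecord(long rowId, Random rnd) {
        StringBuilder sb = new StringBuilder(100);
        for (int i = 0; i < 10; i++)                  // key: random ASCII 32-126
            sb.append((char) (32 + rnd.nextInt(95)));
        sb.append(String.format("%010d", rowId));     // rowid: integer (zero-padded here for illustration)
        for (int i = 0; i < 78; i++)                  // filler: random A-Z
            sb.append((char) ('A' + rnd.nextInt(26)));
        return sb.append("\r\n").toString();          // 10 + 10 + 78 + 2 = 100 bytes
    }

    static String sortKey(String record) {
        return record.substring(0, 10);               // first 10 bytes are the sort key
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        String rec = makeRecord(0, rnd);
        System.out.println(rec.length() + " bytes, key = \"" + sortKey(rec) + "\"");
    }
}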

Terasort - How it works:

Map
- Partition input keys into different buckets
  (leverages Hadoop's default sorting of Map output)

Reduce
- Collect the outputs from the different maps
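
The bucketing idea can be pictured with a simplified range partitioner that splits the printable-ASCII key space evenly across reducers by the key's first byte. Hadoop's TeraSort actually samples the input to choose split points, so treat this only as a sketch of the concept:

public class KeyRangePartitioner {
    static int getPartition(String key, int numReduceTasks) {
        int first = key.charAt(0) - 32;      // keys use ASCII 32-126, so first maps to 0..94
        return first * numReduceTasks / 95;  // even split of the key space across reducers
    }

    public static void main(String[] args) {
        System.out.println(getPartition("~zzzzzzzzz", 16)); // highest bucket: 15
        System.out.println(getPartition("          ", 16)); // lowest bucket: 0
    }
}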

Results

Comparison of Terasort on different configurations

            Total Job    Map Time   Reduce Time   Shuffle Average   Shuffle
            Time (min)   (min)      (min)         Time              Time %
Config 1    205          84         205           60                29.3
SDSC        166          60         90            36                21.7
Config 2    86           40         75            22                25.5

Instance types:
Config 1: m1.large (RAM 7.5 GB)
Config 2: m1.xlarge (RAM 15 GB)
SDSC: custom (RAM 8 GB)

CDF of data transferred over the network during the lifetime of the job
[Figure: CDF with annotations marking where the shuffle starts, Map outputs are sorted (local to the node), Map ends, the shuffle ends, and Reduce is nearly done]

Network Transfer Rate on nodes
[Figure: per-node network transfer rate; annotation marks the point where the network link is saturated]

Disk I/O
[Figure: per-node disk read (blue) and write (red) throughput; annotation marks the sorting of map outputs]

CPU Utilisation

Memory Statistics

Why Ranked Inverted Index?

For a given text corpus, for each word it generates the list of
documents containing that word, in decreasing order of frequency:
word -> (count1 | file1), (count2 | file2), ...
where count1 > count2 > ...

A ranked inverted index is often used in text processing and
information retrieval tasks.

Mentioned in the Tarazu paper as a shuffle-heavy workload
Tarazu: Optimizing MapReduce On Heterogeneous Clusters, Faraz Ahmad et al.

Ranked Inverted Index - How it works:

Map input:     (word | filename) -> count
Map output:    word -> (filename, count)
Reduce output: word -> (count1 | file1), (count2 | file2), ...

It involves a sort of the values on the reduce side.
(Note that the Map input is the output of another MapReduce job called
sequence-count.)
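
A hedged sketch of how the Map and Reduce sides could look with the standard Hadoop API, assuming (this is an assumption, not the benchmark's actual code) that the sequence-count output is text lines of the form word|filename<TAB>count:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RankedInvertedIndex {
    public static class RIIMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] keyAndCount = value.toString().split("\t");   // "word|file" TAB count (assumed format)
            String[] wordAndFile = keyAndCount[0].split("\\|");
            // emit word -> (filename, count)
            ctx.write(new Text(wordAndFile[0]),
                      new Text(wordAndFile[1] + "," + keyAndCount[1]));
        }
    }

    public static class RIIReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            // Sort the (filename, count) pairs by count, descending -- this per-key
            // sort of values on the reduce side is what the slide refers to.
            List<String[]> postings = new ArrayList<>();
            for (Text v : values) postings.add(v.toString().split(","));
            postings.sort((a, b) -> Long.compare(Long.parseLong(b[1]), Long.parseLong(a[1])));

            StringBuilder out = new StringBuilder();
            for (String[] p : postings)
                out.append("(").append(p[1]).append("|").append(p[0]).append(") ");
            ctx.write(word, new Text(out.toString().trim()));
        }
    }
}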

Experimental Results of Ranked Inverted Index

            Total Job    Map Time   Reduce Time   Shuffle Average   Shuffle
            Time (min)   (min)      (min)         Time              Time %
Config 1    12           5.5        11.5          3.5               27.14

Input Data Set: 40 GB ftp://ftp.ecn.purdue.edu/fahmad/rankedinvindex_40GB.tar.bz2

Instance type:
Config 1: m1.large (RAM 7.5 GB)

CDF of data transferred over the network during the lifetime of the job
[Figure: CDF with annotations marking where the shuffle starts, Map ends, the shuffle ends, Reduce is nearly done, and results are replicated to 3 nodes]

Network Transfer Rate on nodes
[Figure: per-node network transfer rate; annotation marks the point where the network link is saturated]

Disk I/O
[Figure: per-node disk read (blue) and write (red) throughput]

CPU Utilisation

Memory Statistics

Summary
- Shuffle can account for a significant fraction of the total job runtime
- It is worth investing in good network connectivity for a compute cluster

Stuff that doesn't add up!

Why does the peak network bandwidth for Ranked Inverted Index
overshoot the 1 Gb/s mark?
Why is the sort phase of RII so short?

Future Work

How does changing the various parameters make a difference? e.g.
io.sort.mb, io.sort.factor, fs.inmemory.size.mb (see the sketch after this list)
Effect of Combiners?
Varying the number of Map tasks and Reduce tasks
How many Map tasks are rack-local or machine-local?
Investigate the unresolved issues
Lack of precise information about topology and network
bandwidth for EMR clusters
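
As a starting point for such a parameter sweep, a small sketch of how these (pre-YARN style) property names could be set programmatically before submitting a job; the rest of the job setup is elided:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ParamSweep {
    public static Job configure(int sortMb, int sortFactor, int inMemFsMb) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("io.sort.mb", sortMb);               // map-side sort buffer size
        conf.setInt("io.sort.factor", sortFactor);       // number of streams merged at once
        conf.setInt("fs.inmemory.size.mb", inMemFsMb);   // in-memory FS size listed above
        return Job.getInstance(conf, "terasort-io.sort.mb=" + sortMb);
    }
}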

Q&A

Thank you!

Standard Test Results

                        Input Size   Run Time on      Shuffle Volume   Critical Path
                                     Hadoop (min)
tera-sort               300          2353             200              Shuffle
ranked-inverted-index   205          2322             219              Shuffle
