of Hadoop Jobs
Team: Pramod Biligiri & Sayed Asad Ali
Talk Outline
Introduction to the problem
What is Hadoop?
Hadoop's MapReduce Framework
Shuffle as a Bottleneck
Experimental Setup
Choice of Benchmarks
Terasort Discussion
Ranked Inverted Index Discussion
Summary and Future Work
What is Hadoop?
A framework for distributed processing of large data sets across
clusters of computers, using simple programming models based on
Google's MapReduce.
Distinct Features:
MapReduce
MapReduce is a programming model for processing large data sets
with a parallel, distributed algorithm on a cluster.
Programming Model
For each input record, the map function emits (key, value) pairs
The reduce operation is then applied to all values that share the
same key
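The two steps above can be sketched as a toy word-count job in Python (`map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative names, not Hadoop API; real jobs implement Mapper and Reducer classes in Java). The grouping step between map and reduce is the shuffle discussed below.

```python
from collections import defaultdict

def map_fn(record):
    # Emit a (key, value) pair for each word in the input record.
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Combine all values that share the same key.
    return (key, sum(values))

def run_mapreduce(records):
    # Map phase: apply map_fn to every input record.
    intermediate = [pair for r in records for pair in map_fn(r)]
    # Shuffle phase: group values by key (the network-heavy step in a cluster).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

counts = run_mapreduce(["hadoop shuffle", "hadoop map reduce"])
# counts == {"hadoop": 2, "map": 1, "reduce": 1, "shuffle": 1}
```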
MapReduce Flow
Shuffle!
Shuffle as a Bottleneck?
"On average, the shuffle phase accounts for 33% of the running
time in these jobs. In addition, in 26% of the jobs with reduce tasks,
shuffles account for more than 50% of the running time, and in 16% of
jobs, they account for more than 70% of the running time. This
confirms widely reported results that the network is a bottleneck
in MapReduce."
- "Managing Data Transfers in Computer Clusters with Orchestra", Mosharaf Chowdhury et al.
Chosen Benchmarks
Terasort
Ranked Inverted Index
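As a rough sketch of what the second benchmark computes, assuming the common definition of a ranked inverted index (for each word, the documents containing it, ranked by descending occurrence count; `ranked_inverted_index` is an illustrative name):

```python
from collections import Counter

def ranked_inverted_index(docs):
    # docs: {doc_id: text}. Map phase: per-document word counts.
    postings = {}
    for doc_id, text in docs.items():
        for word, count in Counter(text.split()).items():
            postings.setdefault(word, []).append((count, doc_id))
    # Reduce phase: for each word, rank documents by descending count.
    return {w: sorted(p, reverse=True) for w, p in postings.items()}

index = ranked_inverted_index({"d1": "a a b", "d2": "a b b b"})
# index["a"] == [(2, "d1"), (1, "d2")]
# index["b"] == [(3, "d2"), (1, "d1")]
```

The shuffle moves every (word, (count, doc)) posting across the network, which is why this benchmark is shuffle-heavy.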
Experimental Setups
Setup      Instance type   Memory    CPU      Disk          Network performance
Config 1   m1.large        7.5 GB    64-bit   2 x 420 GB    Moderate
Config 2   m1.xlarge       15 GB     64-bit   4 x 420 GB    High
SDSC       custom          8 GB      -        2 x 1.5 TB    1 Gb/s
Why Terasort?
Reduce
Collect outputs from different maps
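Terasort's reduce side works because map output is range-partitioned: each reducer receives a disjoint, ordered key range, so concatenating the sorted reducer outputs yields a globally sorted result. A minimal sketch, with hypothetical split points (real Terasort samples the input to choose split points that balance data across reducers):

```python
import bisect

# Hypothetical split points; 3 boundaries => 4 reducers.
SPLIT_POINTS = ["g", "n", "t"]

def partition(key, splits=SPLIT_POINTS):
    # Reducer i gets all keys that fall between splits[i-1] and splits[i].
    return bisect.bisect_left(splits, key)

keys = ["zebra", "apple", "mango", "hadoop", "terasort"]
by_reducer = {}
for k in sorted(keys):  # each reducer sorts its own partition
    by_reducer.setdefault(partition(k), []).append(k)
# Reducer ranges are disjoint and ordered, so concatenating the outputs
# of reducer 0, 1, 2, 3 gives a globally sorted list.
```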
Results
Setup      Total Time (min)   Map Time (min)   Reduce Time (min)   Shuffle Average Time (min)   Shuffle Time %
Config 1   205                84               205                 60                           29.3
SDSC       166                60               90                  36                           21.7
Config 2   86                 40               75                  22                           25.5
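Assuming the Shuffle Time % column is the average shuffle time divided by the total run time, the reported percentages can be reproduced from the other columns:

```python
# (total run time, shuffle average time, reported shuffle %) per setup;
# the formula below is an assumption inferred from the numbers.
rows = {
    "Config 1": (205, 60, 29.3),
    "SDSC": (166, 36, 21.7),
    "Config 2": (86, 22, 25.5),
}
for name, (total, shuffle, pct) in rows.items():
    assert abs(100 * shuffle / total - pct) < 0.2, name
```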
[Figure: CDF of data transferred over the network during the lifetime of the job, one curve per setup (Config 1, Config 2, SDSC); annotations mark shuffle start, sorting of map outputs (local to the node), map end, shuffle end, and reduce nearly done]
[Figure: Disk I/O (blue: read, red: write), showing the sorting of map outputs]
[Figure: CPU utilisation]
[Figure: Memory statistics]
Setup      Total Time (min)   Map Time (min)   Reduce Time (min)   Shuffle Average Time (min)   Shuffle Time %
Config 1   12                 5.5              11.5                3.5                          27.14
[Figure: CDF of data transferred over the network during the lifetime of the job; annotations mark shuffle start, map end, shuffle end, reduce nearly done, and replication of results to 3 nodes]
[Figure: Disk I/O (blue: read, red: write)]
[Figure: CPU utilisation]
[Figure: Memory statistics]
Summary
- Shuffle can constitute a significant fraction of total job runtime
- It is worth investing in good network connectivity for a compute cluster
Future Work
Q&A
Thank you!
Benchmark               Run Time on Hadoop (min)   Shuffle Volume   Shuffle Time (min)   Critical Path
tera-sort               300                        2353             200                  Shuffle
ranked-inverted-index   205                        2322             219                  Shuffle