and Frameworks
15-719/18-847b
Garth Gibson
Greg Ganger
Majd Sakr
• Optional
• Ref 3: DryadLINQ: A system for general-purpose distributed data-
parallel computing using a high-level language. Yuan Yu, Michael Isard,
Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda,
Jon Currey. OSDI’08.
http://research.microsoft.com/en-us/projects/dryadlinq/dryadlinq.pdf
• Ref 4: Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson,
Carlos Guestrin, and Joseph M. Hellerstein (2010). "GraphLab: A New
Parallel Framework for Machine Learning." Conf on Uncertainty in
Artificial Intelligence (UAI).
http://www.select.cs.cmu.edu/publications/scripts/papers.cgi
• Optional
• Ref 5: TensorFlow: A system for large-scale machine learning.
Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy
Davis, Jeff Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey
Irving, Michael Isard. OSDI’16.
https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf
• HPC was almost the only home for parallel computing in the 90s
• Physical simulation was the killer app – weather, vehicle design,
explosions/collisions, etc – replace “wet labs” with “dry labs”
o Physics is the same everywhere, so define a mesh on a set of particles, code the
physics you want to simulate at one mesh point as a property of the influence
of nearby mesh points, and iterate
o Bulk Synchronous Processing (BSP): run all updates of mesh points in parallel
based on value at last time point, form new set of values & repeat
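The BSP pattern above can be sketched in a few lines of Python. This is an illustrative toy, not HPC code: a 1-D ring mesh where each point's new value is the average of itself and its two neighbors, with all updates computed from the previous time step's values before any new value becomes visible.

```python
# Minimal BSP-style stencil sketch (toy example, not real HPC code):
# each superstep computes every new mesh value from the previous
# time step's values, then swaps buffers (bulk synchrony).

def bsp_step(mesh):
    """One bulk-synchronous update: every point is recomputed from
    its neighbors' values at the *previous* time step."""
    n = len(mesh)
    return [
        (mesh[(i - 1) % n] + mesh[i] + mesh[(i + 1) % n]) / 3.0
        for i in range(n)
    ]

def simulate(mesh, steps):
    for _ in range(steps):
        mesh = bsp_step(mesh)  # barrier: new values visible only next step
    return mesh
```

Note that the double-buffering (building a new list rather than updating in place) is exactly what makes the step bulk-synchronous: no point ever reads a partially updated neighbor.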
• Defined “Weak Scaling” for bigger machines – rather than make a fixed
problem go faster (strong scaling), make bigger problem go same speed
o Most demanding users set problem size to match total available memory
• HPC demanded too much expertise, too many details and tuning
• Cloud frameworks all about making parallel programming easier
o Willing to sacrifice efficiency (too willing perhaps)
o Willing to specialize to application (rather than machine)
• Canonical BigData user has data & processing needs that require lots
of computers, but doesn’t have CS or HPC training & experience
o Wants to learn least amount of computer science to get results this week
o Might later want to learn more if same jobs become a personal bottleneck
• MapReduce
o Package two Sipelstein91 operators, filter/map and reduce, as the base of a
data-parallel programming model built around Java libraries
• DryadLinq
o Compile workflows of different data processing programs into
schedulable processes
• Spark
o Work to keep partial results in memory, and declarative programming
• TensorFlow
o Specialize to iterative machine learning
• MapReduce uses disks for input, temporary data, & output
• Want to use memory mostly
• Machine Learning apps iterate over same data to “solve” something
o Way too much use of disk when the data is not giant
• Spark is MR rewrite: more general (dryad-like graphs of work), more
interactive (scala interpreter) & more efficient (in-memory)
• rdd_x.map(foo).map(bar)
• Function foo() takes in a record x and outputs a record y
• Function bar() takes in a record y and outputs a record z
• Spark automatically creates a function foo_bar() that takes in a
record x and outputs a record z.
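A minimal sketch of this pipelining idea in plain Python. The names foo and bar follow the example above; fuse is a hypothetical helper, not a Spark API, but it shows why the composed function needs only one pass over a partition with no intermediate collection.

```python
# Illustrative sketch of why rdd.map(foo).map(bar) costs one pass:
# consecutive narrow transformations are pipelined, so each record
# flows through the composed function without materializing the
# intermediate RDD. `fuse` is a hypothetical helper, not a Spark API.

def fuse(foo, bar):
    """Compose two per-record functions into one ('foo_bar')."""
    return lambda record: bar(foo(record))

def foo(x):        # record x -> record y
    return x + 1

def bar(y):        # record y -> record z
    return y * 2

foo_bar = fuse(foo, bar)
partition = [1, 2, 3]
result = [foo_bar(x) for x in partition]  # one pass, no intermediate list
```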
Three Strategies
• Parallelism
– Break down jobs into distributed independent
tasks to exploit parallelism
• Scheduling
– Consider data-locality and variations in overall
system workloads for scheduling
• Fault Tolerance
– Transparently tolerate data and task failures
Hadoop MapReduce
• Hadoop is an open source implementation of MapReduce
– ~2006
MapReduce In a Nutshell
• MapReduce incorporates two phases
• Map Phase
• Reduce phase
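A single-process sketch of the two phases, using the classic word-count example. This is plain Python, not the Hadoop Java API; the grouping step stands in for the shuffle between the phases.

```python
from collections import defaultdict

# Toy sketch of the two MapReduce phases (word count), not the
# Hadoop API: map emits (K, V) pairs, a grouping step stands in
# for the shuffle, and reduce aggregates each key's values.

def map_phase(split):
    """Map: emit one (word, 1) pair per word in the input split."""
    for line in split:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Group by key (the 'shuffle'), then Reduce: sum counts per word."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["the cat", "the hat"]))
```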
[Figure: MapReduce dataflow. HDFS blocks (BLK 0–3) become input splits, each processed by a Map task; each map task’s output is divided into partitions; partitions are shuffled, merged & sorted, and fed to Reduce tasks, whose output is written back to HDFS. Stages: Map Phase (Map tasks) → Shuffle Stage → Merge & Sort Stage → Reduce Stage (Reduce Phase).]
Data Distribution
• In a MapReduce cluster, data is distributed to all the nodes of the cluster
as it is being loaded
• An underlying distributed file system (e.g., GFS, HDFS) splits large data
files into chunks which are managed by different nodes in the cluster
• Even though the file chunks are distributed across several machines, they
form a single namespace
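A toy sketch of the splitting and placement idea, assuming hypothetical node names and an unrealistically small chunk size (HDFS blocks are typically 64–128 MB) purely for illustration.

```python
# Hypothetical sketch of splitting a file into fixed-size chunks and
# spreading them over cluster nodes. Real HDFS uses large blocks
# (64-128 MB) and replica-aware placement; this toy uses 4-byte
# chunks and round-robin placement just to show the idea.

def split_into_chunks(data, chunk_size):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def assign_chunks(chunks, nodes):
    """Round-robin placement: chunk i lives on node i mod len(nodes)."""
    placement = {}
    for i, chunk in enumerate(chunks):
        placement[i] = nodes[i % len(nodes)]
    return placement

chunks = split_into_chunks(b"abcdefghij", 4)  # 3 chunks: 4 + 4 + 2 bytes
placement = assign_chunks(chunks, ["node0", "node1"])
```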
Network Topology In MapReduce
• Nodes are spread over different racks housed in one or more data centers
• The bandwidth between two nodes is dependent on their relative locations in the
network topology
• The assumption is that nodes that are on the same rack will have higher bandwidth
between them as opposed to nodes that are off-rack
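This assumption is often captured as a tree-shaped "network distance" (Hadoop computes something similar from rack-aware location strings). A sketch, with hypothetical /datacenter/rack/node path strings:

```python
# Sketch of a rack-aware "network distance" over a topology tree:
# same node < same rack < off-rack, mirroring the bandwidth
# assumption above. Location strings like "/dc1/rack1/node3" are
# hypothetical examples.

def network_distance(a, b):
    """0 = same node, 2 = same rack, 4 = same data center, ..."""
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    # hops up from each leaf to the deepest common ancestor
    return (len(pa) - common) + (len(pb) - common)
```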
Computing Units: Tasks
MapReduce divides the workload into multiple independent tasks and
automatically schedules them on cluster nodes
MapReduce Phases
• In MapReduce, input splits are processed by tasks called mappers in the Map Phase
• The output from the mappers is denoted as intermediate output and brought
into a second set of tasks called Reducers in the Reduce Phase
• The process of reading intermediate output into the Reducers is known as shuffling
• The map and reduce functions receive and emit (K, V) pairs
[Figure: chunks C0–C3 feed mappers M0–M3; each mapper’s intermediate output (IO) is shuffled to reducers R0 and R1.]
[Figure: end-to-end pipeline. Input files are broken by an InputFormat into Input Splits and parsed by RecordReaders (RR) into records for the mappers; Intermediate Output is shuffled to the reducers; Final Output is written by an OutputFormat back to the local HDFS store.]
Input Files
• Input files are where the data for a MapReduce task is
initially stored
• Their format is arbitrary: binary files, multi-line input records,
or something else entirely
InputFormat
• How the input files are split up and read is defined
by the InputFormat
• InputFormat is a class that does the following:
• Selects the files that should be used for input
• Defines the InputSplits that break a file into tasks
• Provides a factory for RecordReader objects that read the file
InputFormat Types
• Several InputFormats are provided with Hadoop (e.g., TextInputFormat for
line-based text and SequenceFileInputFormat for binary records)
Input Splits
• An input split describes a unit of data that a single map task in a
MapReduce program will process
• The RecordReader class actually loads data from its source and converts it
into (K, V) pairs suitable for reading by Mappers
• The RecordReader is invoked repeatedly
on the input until the entire split is consumed
• Each invocation of the RecordReader leads
to another call of the map function defined
by the programmer
[Figure: files loaded from the local HDFS store flow through the InputFormat, are broken into Splits, and are read by RecordReaders (RR).]
Mapper and Reducer
• The Mapper performs the user-defined work of the
first phase of the MapReduce program
• The Reducer performs the user-defined work of the second phase
[Figure: files loaded from the local HDFS store → Splits → RecordReaders (RR) → Map → Sort → Reduce.]
Partitioner
• Each mapper may emit (K, V) pairs to any partition; the Partitioner
determines which partition (and hence which Reducer) each intermediate key goes to
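The common default is hash partitioning (Hadoop's HashPartitioner hashes the key modulo the number of reducers). A deterministic sketch in Python, using CRC32 so the routing is stable across runs:

```python
import zlib

# Sketch of hash partitioning: every (K, V) pair emitted by any
# mapper is routed to a partition by hashing the key, so all
# values for one key land at the same reducer. CRC32 stands in
# for Java's key.hashCode() to keep the toy deterministic.

def partition_for(key, num_reducers):
    return zlib.crc32(key.encode()) % num_reducers

def shuffle(pairs, num_reducers):
    """Route each (K, V) pair to its reducer's partition."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return partitions

parts = shuffle([("a", 1), ("b", 2), ("a", 3)], 2)
```

The key property shown is that both ("a", 1) and ("a", 3) always land in the same partition, whatever that partition index happens to be.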
Sort (merge)
• Each Reducer is responsible for reducing
the values associated with (several)
intermediate keys
• The set of intermediate keys on a single
node is automatically sorted (merged) by
MapReduce before they are presented
to the Reducer
[Figure: Map tasks → Partitioner → Sort → Reduce, operating on files loaded from the local HDFS store.]
OutputFormat
• The instances of OutputFormat provided by
Hadoop write to files on the local disk or in HDFS
[Figure: the pipeline (Splits → RecordReaders → Map → Sort → Reduce) ends in an OutputFormat that writes the final output files.]
Combiner Functions
• MapReduce applications are limited by the bandwidth available on the cluster
• It pays to minimize the data shuffled between map and reduce tasks
• Hadoop allows the user to specify a combiner function (just like the reduce
function) to be run on the map output; this is safe only if the reduce function
is commutative and associative
[Figure: map tasks (MT) on nodes (N) across racks (R) emit (Year, Temperature) pairs to a reduce task (RT). On one map task, a combiner collapses the map output (1950, 0), (1950, 20), (1950, 10) into the single pair (1950, 20) before the shuffle. Legend: R = Rack, N = Node, MT = Map Task, RT = Reduce Task, Y = Year, T = Temperature.]
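A sketch of that max-temperature example: because max is commutative and associative, running a local per-map-task reduction shrinks the shuffled data without changing the reducer's answer.

```python
# Sketch of the max-temperature combiner example above: since max()
# is commutative and associative, applying it to each map task's
# local output before the shuffle leaves the final result unchanged.

def combine(pairs):
    """Local, per-map-task reduction: keep only the max per key."""
    best = {}
    for year, temp in pairs:
        best[year] = temp if year not in best else max(best[year], temp)
    return list(best.items())

map_output = [(1950, 0), (1950, 20), (1950, 10)]
combined = combine(map_output)        # shuffles 1 pair instead of 3
final = max(t for _, t in combined)   # the reducer sees the same answer
```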
The Shuffle in MapReduce
Job Scheduling in MapReduce
• In MapReduce, an application is represented as a job
Multi-user Job Scheduling in MapReduce
• Fair scheduler (Facebook)
– Pools of jobs, each pool is assigned a set of shares
– Jobs get (on average) an equal share of slots over time
– Across pools, the Fair scheduler is used; within a pool, FIFO
or the Fair scheduler is used
• Capacity scheduler (Yahoo!)
– Creates job queues
– Each queue is configured with a number (capacity) of slots
– Within queue, scheduling is priority based
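The share idea behind both schedulers can be sketched as proportional slot division. Pool names and weights here are hypothetical, and real schedulers also handle minimum shares, queues, and preemption; this only shows the arithmetic.

```python
# Hypothetical sketch of share-proportional slot allocation across
# pools: divide the cluster's task slots in proportion to each
# pool's configured share weight.

def fair_allocation(total_slots, pool_shares):
    """pool_shares: {pool_name: share_weight}. Returns slots per pool;
    integer division, with leftover slots given to the largest pools."""
    total_shares = sum(pool_shares.values())
    alloc = {p: total_slots * s // total_shares
             for p, s in pool_shares.items()}
    leftover = total_slots - sum(alloc.values())
    for p in sorted(pool_shares, key=pool_shares.get, reverse=True)[:leftover]:
        alloc[p] += 1
    return alloc
```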
Task Scheduling in MapReduce
• MapReduce adopts a master-slave architecture
• The master node in MapReduce is referred
to as the Job Tracker (JT)
• Each slave node in MapReduce is referred
to as a Task Tracker (TT)
• Scheduling is pull-based; i.e., JT does not push map and reduce tasks to TTs but rather TTs pull them by
making requests
[Figure: JT maintains a tasks queue (T0, T1, T2); each TT has task slots and pulls tasks from JT.]
Map and Reduce Task Scheduling
• Every TT periodically sends a heartbeat message to JT that includes a
request for a map or a reduce task to run
• JT satisfies requests for map tasks by attempting to schedule mappers in the
vicinity of their input splits (i.e., it considers locality)
Task Scheduling in Hadoop
• A golden principle adopted by Hadoop is: “Moving computation towards data
is cheaper than moving data towards computation”
– Hadoop applies this principle to Map task scheduling
• With map task scheduling, once a slave (or TaskTracker, TT) polls for a map
task, M, at the master node (or JobTracker, JT), JT attempts to assign TT
an M that has its input data local to TT
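A sketch of that preference in Python. The task ids and data structures here are hypothetical stand-ins, not the actual JobTracker code; the point is only the "local first, then any" ordering.

```python
# Sketch of locality-aware map-task assignment: when a TaskTracker
# asks for work, prefer a pending map task whose input split is
# stored on that node; otherwise fall back to any pending task.
# Task ids and structures are hypothetical.

def assign_map_task(pending, tt_node, split_locations):
    """pending: list of task ids in queue order;
    split_locations: task id -> set of nodes holding its input split."""
    for task in pending:
        if tt_node in split_locations[task]:
            return task                           # data-local assignment
    return pending[0] if pending else None        # non-local fallback

locs = {"t1": {"n2"}, "t2": {"n1"}}
```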
• With reduce task scheduling, once a slave (or TaskTracker, TT) polls for a
reduce task, R, at the master node (or JobTracker, JT), JT assigns TT any R,
regardless of where R’s shuffle partitions reside
[Figure: a locality problem, where R is scheduled at TT1 while its partitions exist at TT4; the shuffled partitions must traverse the rack switches (RS1, RS2) and the core switch (CS). Legend: CS = Core Switch, RS = Rack Switch.]
Fault Tolerance in Hadoop
• Data redundancy
• Achieved at the storage layer through replicas (default is 3)
• Stored at physically separate machines
• Can tolerate
– Corrupted files
– Faulty nodes
• HDFS:
– Computes checksums for all data written to it
– Verifies when reading
• Task Resiliency (task slowdown or failure)
• Monitors to detect faulty or slow tasks
• Replicates tasks
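The checksum mechanism can be sketched with CRC32. HDFS in fact stores CRC checksums for fixed-size chunks of each block; this toy checks a whole block at once to show the write-then-verify pattern.

```python
import zlib

# Sketch of HDFS-style checksumming: compute a CRC32 when data is
# written, store it alongside the block, and verify it on every
# read so corrupted data is detected (a fresh replica can then be
# read instead).

def write_block(data):
    return {"data": data, "crc": zlib.crc32(data)}

def read_block(block):
    if zlib.crc32(block["data"]) != block["crc"]:
        raise IOError("corrupt block detected")
    return block["data"]
```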
Task Resiliency
• MapReduce can guide jobs toward a successful completion even when jobs are
run on a large cluster where probability of failures increases
• If the job is still in the map phase, JT asks another TT to re-execute all
Mappers that previously ran at the failed TT
Speculative Execution
• A MapReduce job is dominated by the slowest task
• If a task’s progress score is less than (average – 0.2), and the task has
run for at least 1 minute, it is marked as a straggler
[Figure: task T1 with progress score PS = 2/3 is not a straggler; task T2 with PS = 1/12 is a straggler.]
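The straggler rule above can be written directly: the 0.2 threshold and one-minute minimum come from the slide, while the data structures are hypothetical.

```python
# Sketch of the speculative-execution trigger described above:
# a task is a straggler if its progress score is below the average
# by more than 0.2 and it has run for at least one minute.

def stragglers(progress, runtime_s, threshold=0.2, min_runtime_s=60):
    """progress: task id -> progress score in [0, 1];
    runtime_s: task id -> seconds the task has been running."""
    avg = sum(progress.values()) / len(progress)
    return [
        t for t, score in progress.items()
        if score < avg - threshold and runtime_s[t] >= min_runtime_s
    ]
```

With the slide's numbers (T1 at 2/3, T2 at 1/12) the average is 0.375, so only T2 falls below 0.175 and is marked.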
Issues with Speculative Execution
• Susceptible to mistakes in heterogeneous environments
– If there is transient congestion, lots of speculative tasks get launched
• Launches speculative tasks without checking speed of
TT or load of speculative task
– Slow TT will become slower
• Locality always trumps
– If 2 speculative tasks ST1 & ST2
• With stragglers T1@70% and T2@20%
• If task slot is local to ST2’s HDFS block, ST2 gets scheduled
• Three reduce stages treated equally
– Shuffle stage is typically slower than the merge & sort and
reduce stages
MapReduce Applications
[Figure: dataflow of a MapReduce application: Input Data → Map → Shuffled Data → Sort → Reduce → Output Data, with data moving via local disk or the network between stages.]