and Frameworks
15-719/18-847b
Garth Gibson
Greg Ganger
Majd Sakr
• Optional
• Ref 3: DryadLINQ: A system for general-purpose distributed data-
parallel computing using a high-level language. Yuan Yu, Michael Isard,
Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda,
Jon Currey. OSDI’08.
http://research.microsoft.com/en-us/projects/dryadlinq/dryadlinq.pdf
• Ref 4: Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson,
Carlos Guestrin, and Joseph M. Hellerstein (2010). "GraphLab: A New
Parallel Framework for Machine Learning." Conf on Uncertainty in
Artificial Intelligence (UAI).
http://www.select.cs.cmu.edu/publications/scripts/papers.cgi
• Optional
• Ref 5: TensorFlow: A system for large-scale machine learning.
Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy
Davis, Jeff Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey
Irving, Michael Isard. OSDI’16.
https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf
• HPC was almost the only home for parallel computing in the 90s
• Physical simulation was the killer app – weather, vehicle design,
explosions/collisions, etc – replace “wet labs” with “dry labs”
o Physics is the same everywhere, so define a mesh on a set of particles, code the
physics you want to simulate at one mesh point as a property of the influence
of nearby mesh points, and iterate
o Bulk Synchronous Processing (BSP): run all updates of mesh points in parallel
based on value at last time point, form new set of values & repeat
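The BSP pattern above can be sketched in a few lines of Python. This is an illustrative toy, not HPC code: a 1-D ring mesh where each point's new value is the average of itself and its two neighbors, with all updates computed from the previous time step's values before any new value becomes visible.

```python
# Minimal BSP-style stencil sketch (toy example, not real HPC code):
# each superstep computes every new mesh value from the previous
# time step's values, then swaps buffers (bulk synchrony).

def bsp_step(mesh):
    """One bulk-synchronous update: every point is recomputed from
    its neighbors' values at the *previous* time step."""
    n = len(mesh)
    return [
        (mesh[(i - 1) % n] + mesh[i] + mesh[(i + 1) % n]) / 3.0
        for i in range(n)
    ]

def simulate(mesh, steps):
    for _ in range(steps):
        mesh = bsp_step(mesh)  # barrier: new values visible only next step
    return mesh
```

Note that the double-buffering (building a new list rather than updating in place) is exactly what makes the step bulk-synchronous: no point ever reads a partially updated neighbor.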
• Defined “Weak Scaling” for bigger machines – rather than make a fixed
problem go faster (strong scaling), make bigger problem go same speed
o Most demanding users set problem size to match total available memory
• HPC demanded too much expertise, too many details and tuning
• Cloud frameworks all about making parallel programming easier
o Willing to sacrifice efficiency (too willing perhaps)
o Willing to specialize to application (rather than machine)
• Canonical BigData user has data & processing needs that require lots
of computers, but doesn’t have CS or HPC training & experience
o Wants to learn least amount of computer science to get results this week
o Might later want to learn more if same jobs become a personal bottleneck
• MapReduce
o Package two Sipelstein91 operators, filter/map and reduce, as the base of a
data-parallel programming model built around Java libraries
• DryadLinq
o Compile workflows of different data processing programs into
schedulable processes
• Spark
o Work to keep partial results in memory, and declarative programming
• TensorFlow
o Specialize to iterative machine learning
• MapReduce uses disks for input, temporary data, & output
• Want to use memory mostly
• Machine Learning apps iterate over same data to “solve” something
o Way too much use of disk when the data is not giant
• Spark is MR rewrite: more general (dryad-like graphs of work), more
interactive (scala interpreter) & more efficient (in-memory)
• rdd_x.map(foo).map(bar)
• Function foo() takes in a record x and outputs a record y
• Function bar() takes in a record y and outputs a record z
• Spark automatically creates a function foo_bar() that takes in a
record x and outputs a record z.
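A minimal sketch of this pipelining idea in plain Python. The names foo and bar follow the example above; fuse is a hypothetical helper, not a Spark API, but it shows why the composed function needs only one pass over a partition with no intermediate collection.

```python
# Illustrative sketch of why rdd.map(foo).map(bar) costs one pass:
# consecutive narrow transformations are pipelined, so each record
# flows through the composed function without materializing the
# intermediate RDD. `fuse` is a hypothetical helper, not a Spark API.

def fuse(foo, bar):
    """Compose two per-record functions into one ('foo_bar')."""
    return lambda record: bar(foo(record))

def foo(x):        # record x -> record y
    return x + 1

def bar(y):        # record y -> record z
    return y * 2

foo_bar = fuse(foo, bar)
partition = [1, 2, 3]
result = [foo_bar(x) for x in partition]  # one pass, no intermediate list
```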
Three Strategies
• Parallelism
– Break down jobs into distributed independent
tasks to exploit parallelism
• Scheduling
– Consider data-locality and variations in overall
system workloads for scheduling
• Fault Tolerance
– Transparently tolerate data and task failures
Hadoop MapReduce
• Hadoop is an open source implementation of MapReduce
– ~2006
MapReduce In a Nutshell
• MapReduce incorporates two phases
• Map Phase
• Reduce phase
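A single-process sketch of the two phases, using the classic word-count example. This is plain Python, not the Hadoop Java API; the grouping step stands in for the shuffle between the phases.

```python
from collections import defaultdict

# Toy sketch of the two MapReduce phases (word count), not the
# Hadoop API: map emits (K, V) pairs, a grouping step stands in
# for the shuffle, and reduce aggregates each key's values.

def map_phase(split):
    """Map: emit one (word, 1) pair per word in the input split."""
    for line in split:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Group by key (the 'shuffle'), then Reduce: sum counts per word."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["the cat", "the hat"]))
```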
[Figure: MapReduce dataflow. HDFS blocks (BLK 0–3) become input splits, each processed by a Map task; each map task’s output is divided into partitions; partitions are shuffled, merged & sorted, and fed to Reduce tasks, whose output is written back to HDFS. Stages: Map Phase (Map tasks) → Shuffle Stage → Merge & Sort Stage → Reduce Stage (Reduce Phase).]
Data Distribution
• In a MapReduce cluster, data is distributed to all the nodes of the cluster
as it is being loaded
• An underlying distributed file system (e.g., GFS, HDFS) splits large data
files into chunks which are managed by different nodes in the cluster
• Even though the file chunks are distributed across several machines, they
form a single namespace
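A toy sketch of the splitting and placement idea, assuming hypothetical node names and an unrealistically small chunk size (HDFS blocks are typically 64–128 MB) purely for illustration.

```python
# Hypothetical sketch of splitting a file into fixed-size chunks and
# spreading them over cluster nodes. Real HDFS uses large blocks
# (64-128 MB) and replica-aware placement; this toy uses 4-byte
# chunks and round-robin placement just to show the idea.

def split_into_chunks(data, chunk_size):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def assign_chunks(chunks, nodes):
    """Round-robin placement: chunk i lives on node i mod len(nodes)."""
    placement = {}
    for i, chunk in enumerate(chunks):
        placement[i] = nodes[i % len(nodes)]
    return placement

chunks = split_into_chunks(b"abcdefghij", 4)  # 3 chunks: 4 + 4 + 2 bytes
placement = assign_chunks(chunks, ["node0", "node1"])
```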
Network Topology In MapReduce
• Nodes are spread over different racks housed in one or more data centers
• The bandwidth between two nodes is dependent on their relative locations in the
network topology
• The assumption is that nodes that are on the same rack will have higher bandwidth
between them as opposed to nodes that are off-rack
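This assumption is often captured as a tree-shaped "network distance" (Hadoop computes something similar from rack-aware location strings). A sketch, with hypothetical /datacenter/rack/node path strings:

```python
# Sketch of a rack-aware "network distance" over a topology tree:
# same node < same rack < off-rack, mirroring the bandwidth
# assumption above. Location strings like "/dc1/rack1/node3" are
# hypothetical examples.

def network_distance(a, b):
    """0 = same node, 2 = same rack, 4 = same data center, ..."""
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    # hops up from each leaf to the deepest common ancestor
    return (len(pa) - common) + (len(pb) - common)
```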
Computing Units: Tasks
MapReduce divides the workload into multiple independent tasks and
automatically schedules them on cluster nodes
MapReduce Phases
• In MapReduce, input splits are processed by tasks called mappers in the Map Phase
• The output from the mappers is denoted as intermediate output and brought
into a second set of tasks called Reducers in the Reduce Phase
• The process of reading intermediate output into the Reducers is known as shuffling
• The map and reduce functions receive and emit (K, V) pairs
[Figure: chunks C0–C3 feed mappers M0–M3; each mapper’s intermediate output (IO) is shuffled to reducers R0 and R1.]
[Figure: end-to-end pipeline. Input files are broken by an InputFormat into Input Splits and parsed by RecordReaders (RR) into records for the mappers; Intermediate Output is shuffled to the reducers; Final Output is written by an OutputFormat back to the local HDFS store.]
Input Files
• Input files are where the data for a MapReduce task is
initially stored
• Their format is arbitrary: binary files, multi-line input records,
or something else entirely
InputFormat
• How the input files are split up and read is defined
by the InputFormat
• InputFormat is a class that does the following:
• Selects the files that should be used for input
• Defines the InputSplits that break a file into tasks
• Provides a factory for RecordReader objects that read the file
InputFormat Types
• Several InputFormats are provided with Hadoop (e.g., TextInputFormat for
line-based text and SequenceFileInputFormat for binary records)
Input Splits
• An input split describes a unit of data that a single map task in a
MapReduce program will process
• The RecordReader class actually loads data from its source and converts it
into (K, V) pairs suitable for reading by Mappers
• The RecordReader is invoked repeatedly
on the input until the entire split is consumed
• Each invocation of the RecordReader leads
to another call of the map function defined
by the programmer
[Figure: files loaded from the local HDFS store flow through the InputFormat, are broken into Splits, and are read by RecordReaders (RR).]
Mapper and Reducer
• The Mapper performs the user-defined work of the
first phase of the MapReduce program
• The Reducer performs the user-defined work of the second phase
[Figure: files loaded from the local HDFS store → Splits → RecordReaders (RR) → Map → Sort → Reduce.]
Partitioner
• Each mapper may emit (K, V) pairs to any partition; the Partitioner
determines which partition (and hence which Reducer) each intermediate key goes to
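The common default is hash partitioning (Hadoop's HashPartitioner hashes the key modulo the number of reducers). A deterministic sketch in Python, using CRC32 so the routing is stable across runs:

```python
import zlib

# Sketch of hash partitioning: every (K, V) pair emitted by any
# mapper is routed to a partition by hashing the key, so all
# values for one key land at the same reducer. CRC32 stands in
# for Java's key.hashCode() to keep the toy deterministic.

def partition_for(key, num_reducers):
    return zlib.crc32(key.encode()) % num_reducers

def shuffle(pairs, num_reducers):
    """Route each (K, V) pair to its reducer's partition."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return partitions

parts = shuffle([("a", 1), ("b", 2), ("a", 3)], 2)
```

The key property shown is that both ("a", 1) and ("a", 3) always land in the same partition, whatever that partition index happens to be.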
Sort (merge)
• Each Reducer is responsible for reducing
the values associated with (several)
intermediate keys
• The set of intermediate keys on a single
node is automatically sorted (merged) by
MapReduce before they are presented
to the Reducer
[Figure: Map tasks → Partitioner → Sort → Reduce, operating on files loaded from the local HDFS store.]
OutputFormat
• The instances of OutputFormat provided by
Hadoop write to files on the local disk or in HDFS
[Figure: the pipeline (Splits → RecordReaders → Map → Sort → Reduce) ends in an OutputFormat that writes the final output files.]
Combiner Functions
• MapReduce applications are limited by the bandwidth available on the cluster
• It pays to minimize the data shuffled between map and reduce tasks
• Hadoop allows the user to specify a combiner function (just like the reduce
function) to be run on the map output; this is safe only if the reduce function
is commutative and associative
[Figure: map tasks (MT) on nodes (N) across racks (R) emit (Year, Temperature) pairs to a reduce task (RT). On one map task, a combiner collapses the map output (1950, 0), (1950, 20), (1950, 10) into the single pair (1950, 20) before the shuffle. Legend: R = Rack, N = Node, MT = Map Task, RT = Reduce Task, Y = Year, T = Temperature.]
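A sketch of that max-temperature example: because max is commutative and associative, running a local per-map-task reduction shrinks the shuffled data without changing the reducer's answer.

```python
# Sketch of the max-temperature combiner example above: since max()
# is commutative and associative, applying it to each map task's
# local output before the shuffle leaves the final result unchanged.

def combine(pairs):
    """Local, per-map-task reduction: keep only the max per key."""
    best = {}
    for year, temp in pairs:
        best[year] = temp if year not in best else max(best[year], temp)
    return list(best.items())

map_output = [(1950, 0), (1950, 20), (1950, 10)]
combined = combine(map_output)        # shuffles 1 pair instead of 3
final = max(t for _, t in combined)   # the reducer sees the same answer
```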
The Shuffle in MapReduce
Job Scheduling in MapReduce
• In MapReduce, an application is represented as a job
Multi-user Job Scheduling in MapReduce
• Fair scheduler (Facebook)
– Pools of jobs, each pool is assigned a set of shares
– Jobs get (on average) an equal share of slots over time
– Across pools, the Fair scheduler is used; within a pool, FIFO
or the Fair scheduler is used
• Capacity scheduler (Yahoo!)
– Creates job queues
– Each queue is configured with a number (capacity) of slots
– Within queue, scheduling is priority based
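The share idea behind both schedulers can be sketched as proportional slot division. Pool names and weights here are hypothetical, and real schedulers also handle minimum shares, queues, and preemption; this only shows the arithmetic.

```python
# Hypothetical sketch of share-proportional slot allocation across
# pools: divide the cluster's task slots in proportion to each
# pool's configured share weight.

def fair_allocation(total_slots, pool_shares):
    """pool_shares: {pool_name: share_weight}. Returns slots per pool;
    integer division, with leftover slots given to the largest pools."""
    total_shares = sum(pool_shares.values())
    alloc = {p: total_slots * s // total_shares
             for p, s in pool_shares.items()}
    leftover = total_slots - sum(alloc.values())
    for p in sorted(pool_shares, key=pool_shares.get, reverse=True)[:leftover]:
        alloc[p] += 1
    return alloc
```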
Task Scheduling in MapReduce
• MapReduce adopts a master-slave architecture
• The master node in MapReduce is referred
to as the Job Tracker (JT)
• Each slave node in MapReduce is referred
to as a Task Tracker (TT)
• Scheduling is pull-based; i.e., JT does not push map and reduce tasks to TTs but rather TTs pull them by
making requests
[Figure: JT maintains a tasks queue (T0, T1, T2); each TT has task slots and pulls tasks from JT.]
Map and Reduce Task Scheduling
• Every TT periodically sends a heartbeat message to JT that includes a
request for a map or a reduce task to run
• JT satisfies requests for map tasks by attempting to schedule mappers in the
vicinity of their input splits (i.e., it considers locality)
Task Scheduling in Hadoop
• A golden principle adopted by Hadoop is: “Moving computation towards data
is cheaper than moving data towards computation”
– Hadoop applies this principle to Map task scheduling
• With map task scheduling, once a slave (or TaskTracker, TT) polls for a map
task, M, at the master node (or JobTracker, JT), JT attempts to assign TT
an M that has its input data local to TT
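A sketch of that preference in Python. The task ids and data structures here are hypothetical stand-ins, not the actual JobTracker code; the point is only the "local first, then any" ordering.

```python
# Sketch of locality-aware map-task assignment: when a TaskTracker
# asks for work, prefer a pending map task whose input split is
# stored on that node; otherwise fall back to any pending task.
# Task ids and structures are hypothetical.

def assign_map_task(pending, tt_node, split_locations):
    """pending: list of task ids in queue order;
    split_locations: task id -> set of nodes holding its input split."""
    for task in pending:
        if tt_node in split_locations[task]:
            return task                           # data-local assignment
    return pending[0] if pending else None        # non-local fallback

locs = {"t1": {"n2"}, "t2": {"n1"}}
```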
• With reduce task scheduling, once a slave (or TaskTracker, TT) polls for a
reduce task, R, at the master node (or JobTracker, JT), JT assigns TT any R,
regardless of where R’s shuffle partitions reside
[Figure: a locality problem, where R is scheduled at TT1 while its partitions exist at TT4; the shuffled partitions must traverse the rack switches (RS1, RS2) and the core switch (CS). Legend: CS = Core Switch, RS = Rack Switch.]
Fault Tolerance in Hadoop
• Data redundancy
• Achieved at the storage layer through replicas (default is 3)
• Stored at physically separate machines
• Can tolerate
– Corrupted files
– Faulty nodes
• HDFS:
– Computes checksums for all data written to it
– Verifies when reading
• Task Resiliency (task slowdown or failure)
• Monitors to detect faulty or slow tasks
• Replicates tasks
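The checksum mechanism can be sketched with CRC32. HDFS in fact stores CRC checksums for fixed-size chunks of each block; this toy checks a whole block at once to show the write-then-verify pattern.

```python
import zlib

# Sketch of HDFS-style checksumming: compute a CRC32 when data is
# written, store it alongside the block, and verify it on every
# read so corrupted data is detected (a fresh replica can then be
# read instead).

def write_block(data):
    return {"data": data, "crc": zlib.crc32(data)}

def read_block(block):
    if zlib.crc32(block["data"]) != block["crc"]:
        raise IOError("corrupt block detected")
    return block["data"]
```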
Task Resiliency
• MapReduce can guide jobs toward a successful completion even when jobs are
run on a large cluster where probability of failures increases
• If the job is still in the map phase, JT asks another TT to re-execute all
Mappers that previously ran at the failed TT
Speculative Execution
• A MapReduce job is dominated by the slowest task
• If a task’s progress score is less than (average – 0.2), and the task has
run for at least 1 minute, it is marked as a straggler
[Figure: task T1 with progress score PS = 2/3 is not a straggler; task T2 with PS = 1/12 is a straggler.]
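The straggler rule above can be written directly: the 0.2 threshold and one-minute minimum come from the slide, while the data structures are hypothetical.

```python
# Sketch of the speculative-execution trigger described above:
# a task is a straggler if its progress score is below the average
# by more than 0.2 and it has run for at least one minute.

def stragglers(progress, runtime_s, threshold=0.2, min_runtime_s=60):
    """progress: task id -> progress score in [0, 1];
    runtime_s: task id -> seconds the task has been running."""
    avg = sum(progress.values()) / len(progress)
    return [
        t for t, score in progress.items()
        if score < avg - threshold and runtime_s[t] >= min_runtime_s
    ]
```

With the slide's numbers (T1 at 2/3, T2 at 1/12) the average is 0.375, so only T2 falls below 0.175 and is marked.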
Issues with Speculative Execution
• Susceptible to mistakes in heterogeneous environments
– If there is transient congestion, lots of speculative tasks get launched
• Launches speculative tasks without checking speed of
TT or load of speculative task
– Slow TT will become slower
• Locality always trumps
– If 2 speculative tasks ST1 & ST2
• With stragglers T1@70% and T2@20%
• If task slot is local to ST2’s HDFS block, ST2 gets scheduled
• Three reduce stages treated equally
– Shuffle stage is typically slower than the merge & sort and
reduce stages
MapReduce Applications
[Figure: dataflow of a MapReduce application: Input Data → Map → Shuffled Data → Sort → Reduce → Output Data, with data moving via local disk or the network between stages.]