Optimizing Latency and Reliability of Pipeline Workflow Applications

INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE
Anne Benoit, Veronika Rehn-Sonigo, Yves Robert
Theme NUM (Numerical Systems), Project GRAAL
Research Report No. 6345, March 2008, 16 pages
ISRN INRIA/RR--6345--FR+ENG, ISSN 0249-6399
inria-00186152, version 4 - 26 Mar 2008
Abstract: Mapping applications onto heterogeneous platforms is a difficult challenge, even for simple application patterns such as pipeline graphs. The problem is even more complex when processors are subject to failure during the execution of the application.
In this paper, we study the complexity of a bi-criteria mapping which aims at optimizing the latency (i.e., the response time) and the reliability (i.e., the probability that the computation will be successful) of the application. Latency is minimized by using faster processors, while reliability is increased by replicating computations on a set of processors. However, replication increases latency (additional communications, slower processors). The application fails to be executed only if all the processors fail during execution.
While simple polynomial algorithms can be found for fully homogeneous platforms, the problem becomes NP-hard when tackling heterogeneous platforms. This is yet another illustration of the additional complexity added by heterogeneity.
Key-words: Heterogeneity, scheduling, complexity results, reliability, response time.
This text is also available as a research report of the Laboratoire de l'Informatique du Parallélisme, http://www.ens-lyon.fr/LIP.
Unité de recherche INRIA Rhône-Alpes, 655 avenue de l'Europe, 38334 Montbonnot Saint Ismier (France)
Telephone: +33 4 76 61 52 00, Fax: +33 4 76 61 52 52
1 Introduction
Mapping applications onto parallel platforms is a difficult challenge. Several scheduling and load-balancing techniques have been developed for homogeneous architectures (see [14] for a survey), but the advent of heterogeneous clusters has rendered the mapping problem even more difficult. Moreover, in a distributed computing architecture, some processors may suddenly become unavailable, and we are facing the problem of failure [1, 2]. In this context of dynamic heterogeneous platforms with failures, a structured programming approach rules out many of the problems with which the low-level parallel application developer is usually confronted, such as deadlocks or process starvation.

In this paper, we consider application workflows that can be expressed as pipeline graphs. Typical applications include digital image processing, where images have to be processed in steady-state mode. A well-known pipeline application of this type is, for example, JPEG encoding (see http://www.jpeg.org/). In such workflow applications, a series of data sets (tasks) enter the input stage and progress from stage to stage until the final result is computed. Each stage has its own communication and computation requirements: it reads an input file from the previous stage, processes the data and outputs a result to the next stage. For each data set, initial data is input to the first stage, and final results are output from the last stage.

Each processor has a failure probability, which expresses the chance that the processor fails during execution. Key metrics for a given workflow are the latency and the failure probability. The latency is the time elapsed between the beginning and the end of the execution of a given data set, hence it measures the response time of the system to process the data set entirely. Intuitively, we minimize the latency by assigning all stages to the fastest processor, but this may lead to an unreliable execution of the application. Therefore, we need to find trade-offs between two antagonistic objectives, namely latency and failure probability. Informally, the application will be reliable for a given mapping if the corresponding global failure probability is small. Here, we focus on bi-criteria approaches, i.e., minimizing the latency under failure probability constraints, or the converse. Indeed, such bi-criteria approaches seem more natural than the minimization of a linear combination of both criteria. Users may have latency constraints or reliability constraints, but it makes little sense for them to minimize the sum of the latency and of the failure probability.

We focus on pipeline skeletons and thus we enforce the rule that a given stage is mapped onto a single processor. In other words, a processor that is assigned a stage will execute the operations required by this stage (input, computation and output) for all the tasks fed into the pipeline. However, in order to improve reliability, we can replicate the computations for a given stage on several processors, i.e., a set of processors performs identical computations on every data set. Thus, in case of failure, we can take the result from a processor which is still working. The optimization problem can be stated informally as follows: which stage to assign to which (set of) processors? We require the mapping to be interval-based, i.e., a set of processors is assigned an interval of consecutive stages. The main objective of this paper is to assess the complexity of this bi-criteria mapping problem.

The rest of the paper is organized as follows. Section 2 is devoted to the presentation of the target optimization problems. Next, in Section 3, some motivating examples are presented. In Section 4 we proceed to the complexity results. Finally, we briefly review related work and state some concluding remarks in Section 5.

2 Framework and optimization problems

2.1 Framework

The application is expressed as a pipeline graph of $n$ stages $S_k$, $1 \le k \le n$, as illustrated on Figure 1. Consecutive data sets are fed into the pipeline and processed from stage to stage, until they exit the pipeline after the last stage. Each stage executes a task. More precisely, the $k$-th stage $S_k$ receives an input from the previous stage, of size $\delta_{k-1}$, performs a number of $w_k$ computations, and outputs data of size $\delta_k$ to the next stage. This operation corresponds to the $k$-th task and is repeated periodically on each data set. The first stage $S_1$ receives an input of size $\delta_0$ from the outside world, while the last stage $S_n$ returns the result, of size $\delta_n$, to the outside world.
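
Figure 1: The application pipeline (stages $S_1, \ldots, S_n$ with computation costs $w_1, \ldots, w_n$).
Figure 2: The target platform (processors $P_u$ and $P_v$ with speeds $s_u$ and $s_v$, special processors $P_{in}$ and $P_{out}$, and link bandwidths $b_{in,u}$, $b_{u,v}$ and $b_{v,out}$).
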
We target a platform (see Figure 2) with $m$ processors $P_u$, $1 \le u \le m$, fully interconnected as a (virtual) clique. We associate to each processor a failure probability $0 \le fp_u \le 1$, $1 \le u \le m$, which is the probability that the processor breaks down during the execution of the application. A set of processors with identical failure probabilities is denoted Failure Homogeneous, and otherwise Failure Heterogeneous. We consider a constant failure probability as we are dealing with workflows. These workflows are meant to run for a very long time, and therefore we address the question of whether the processor will break down or not at any time during execution. Indeed, the maximum latency will be determined by the latency of the data sets which are processed after the failure.

There is a bidirectional link $link_{u,v}: P_u \to P_v$ between any processor pair $P_u$ and $P_v$, of bandwidth $b_{u,v}$. The speed of processor $P_u$ is denoted as $s_u$, and it takes $X / s_u$ time-units for $P_u$ to execute $X$ floating point operations. We also enforce a linear cost model for communications, hence it takes $X / b_{u,v}$ time-units to send (or receive) a message of size $X$ from $P_u$ to $P_v$. Communication contention is taken care of by enforcing the one-port model [6, 7]. In this model, a given processor can be involved in a single communication at any time-step, either a send or a receive. However, independent communications between distinct processor pairs can take place simultaneously. The one-port model seems to fit the performance of some current MPI implementations, which serialize asynchronous MPI sends as soon as message sizes exceed a few megabytes [13].

We consider three types of platforms:
- Fully Homogeneous platforms have identical processors ($s_u = s$ for $1 \le u \le m$) and interconnection links ($b_{u,v} = b$ for $1 \le u, v \le m$);
- Communication Homogeneous platforms, with identical links but different speed processors, introduce a first degree of heterogeneity;
- Fully Heterogeneous platforms constitute the most difficult instance, with different speed processors and different capacity links.

Finally, we assume that two special additional processors $P_{in}$ and $P_{out}$ are devoted to input/output data. Initially, the input data for each task resides on $P_{in}$, while all results must be returned to and stored in $P_{out}$.

2.2 Bi-criteria Mapping Problem

The general mapping problem consists in assigning application stages to platform processors. For simplicity, we could assume that each stage $S_i$ of the application pipeline is mapped onto a distinct processor (which is possible only if $n \le m$). However, such one-to-one mappings may be unduly restrictive, and a natural extension is to search for interval mappings, i.e., allocation functions where each participating processor is assigned an interval of consecutive stages. Intuitively, assigning several consecutive tasks to the same processor will increase its computational load, but may well dramatically decrease communication requirements. In fact, the best interval mapping may turn out to be a one-to-one mapping, or instead may enroll only a very small number of fast computing processors interconnected by high-speed links. Interval mappings constitute a natural and useful generalization of one-to-one mappings (not to speak of situations where $m < n$, where interval mappings are mandatory), and such mappings have been studied by Subhlok et al. [15, 16].

Formally, we search for a partition of $[1..n]$ into $p \le m$ intervals $I_j = [d_j, e_j]$ such that $d_j \le e_j$ for $1 \le j \le p$, $d_1 = 1$, $d_{j+1} = e_j + 1$ for $1 \le j \le p - 1$ and $e_p = n$.

The function $alloc(j)$ returns the indices of the processors on which interval $I_j$ is mapped. There are $k_j = |alloc(j)|$ processors executing $I_j$, and obviously $k_j \ge 1$. Increasing $k_j$ increases the reliability of the execution of interval $I_j$. The optimization problem is to determine the best mapping, over all possible partitions into intervals, and over all processor assignments. The objective can be to minimize either the latency or the failure probability, or a combination: given a threshold latency, what is the minimum failure probability that can be achieved? Similarly, given a threshold failure probability, what is the minimum latency that can be achieved?

The failure probability can be computed given the number $p$ of intervals and the set of processors assigned to each interval:
$$FP = 1 - \prod_{1 \le j \le p} \Big( 1 - \prod_{u \in alloc(j)} fp_u \Big).$$
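
The failure probability formula can be evaluated directly. The following is a minimal sketch, assuming the mapping is given as a Python list `alloc` with one list of processor indices per interval and `fp` holds the individual failure probabilities; the function name and data layout are illustrative choices, not notation from the report.

```python
from math import prod


def failure_probability(alloc, fp):
    """Global failure probability FP of an interval mapping.

    alloc: list with one entry per interval, each a list of processor indices
           (hypothetical layout chosen for this sketch).
    fp:    fp[u] is the failure probability of processor u.
    """
    # An interval fails only if every processor replicating it fails.
    interval_failure = [prod(fp[u] for u in procs) for procs in alloc]
    # The application succeeds only if every interval succeeds.
    return 1.0 - prod(1.0 - f for f in interval_failure)
```

For instance, replicating a single interval on two processors of failure probability 0.5 gives `failure_probability([[0, 1]], [0.5, 0.5]) == 0.25`.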

We assume that $alloc(0) = \{in\}$ and $alloc(p+1) = \{out\}$, where $P_{in}$ is a special processor holding the initial data, and $P_{out}$ is receiving the results. Dealing with Fully Homogeneous and Communication Homogeneous platforms, the latency is obtained as
$$T_{latency} = \sum_{1 \le j \le p} \left\{ k_j \times \frac{\delta_{d_j - 1}}{b} + \frac{\sum_{i = d_j}^{e_j} w_i}{\min_{u \in alloc(j)} (s_u)} \right\} + \frac{\delta_n}{b}. \qquad (1)$$

In equation (1), we consider the longest path required to compute a given data set. The worst case is when the first processors involved in the replication fail during execution. A communication to interval $j$ must then be paid $k_j$ times since these are serialized (one-port model). For computations, we consider the total computation time required by the slowest processor assigned to the interval. For the final output, only one communication is required, hence the $\delta_n / b$. Note that in order to achieve this latency, we need a standard consensus protocol to determine which of the surviving processors performs the outgoing communications [17].
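
Equation (1) translates into a short function. This is a sketch under stated assumptions: intervals are given as 1-indexed inclusive pairs `(d_j, e_j)`, `w` is a dict from stage number to $w_i$, `delta` is the list $\delta_0, \ldots, \delta_n$, and `b` is the common bandwidth. The name `latency_comm_homogeneous` and this data layout are hypothetical.

```python
def latency_comm_homogeneous(intervals, alloc, w, delta, s, b):
    """Worst-case latency of equation (1) with identical links of bandwidth b.

    intervals: list of (d_j, e_j) pairs, 1-indexed and inclusive.
    alloc:     alloc[j] lists the processors replicating the j-th interval.
    w:         dict mapping stage i (1..n) to its computation cost w_i.
    delta:     list [delta_0, ..., delta_n] of data sizes.
    s:         s[u] is the speed of processor u.
    """
    n = len(delta) - 1
    total = delta[n] / b  # single final output communication
    for (d, e), procs in zip(intervals, alloc):
        k = len(procs)  # the incoming communication is paid k_j times
        work = sum(w[i] for i in range(d, e + 1))
        # computation time is dictated by the slowest replica
        total += k * delta[d - 1] / b + work / min(s[u] for u in procs)
    return total
```

With `w = {1: 2, 2: 4}`, `delta = [3, 6, 9]`, `s = [1.0, 2.0]` and `b = 3.0`, mapping both stages as one interval replicated on both processors costs 2 + 6 + 3 = 11 time-units.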
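
A similar mechanism is used for Fully Heterogeneous platforms:
$$T_{latency} = \sum_{u \in alloc(1)} \frac{\delta_0}{b_{in,u}} + \sum_{1 \le j \le p} \max_{u \in alloc(j)} \left\{ \frac{\sum_{i = d_j}^{e_j} w_i}{s_u} + \sum_{v \in alloc(j+1)} \frac{\delta_{e_j}}{b_{u,v}} \right\}. \qquad (2)$$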

3 Motivating examples

Before presenting complexity results in Section 4, we want to make the reader more sensitive to the difficulty of the problem via some motivating examples.

We start with the mono-criterion interval mapping problem of minimizing the latency. For Fully Homogeneous and Communication Homogeneous platforms, the optimal latency is achieved by assigning the whole pipeline to the fastest processor. This is due to the fact that mapping the whole pipeline onto one single processor minimizes the communication cost since all communication links have the same characteristics. Choosing the fastest processor on Communication Homogeneous platforms ensures the shortest processing time.

However, this line of reasoning does not hold anymore when communications become heterogeneous. Let us consider for instance the mapping of the pipeline of Figure 3 on the Fully Heterogeneous platform of Figure 4. The pipeline consists of two stages, both needing the same amount of computation ($w = 2$), and the same amount of communications ($\delta = 100$). In this example, a mapping which minimizes the latency must map each stage on a different processor, thus splitting the stages into two intervals. In fact, if we map the whole pipeline on a single processor, we achieve a latency of $100/100 + (2+2)/1 + 100/1 = 105$, whether we choose $P_1$ or $P_2$ as target processor. Splitting the pipeline and hence mapping the first stage on $P_1$ and the second stage on $P_2$ requires paying the communication between $P_1$ and $P_2$ but drastically decreases the latency: $100/100 + 2/1 + 100/100 + 2/1 + 100/100 = 1 + 2 + 1 + 2 + 1 = 7$.
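
The arithmetic of this example can be replayed with a direct transcription of equation (2). The sketch below makes two assumptions that should be read as such: the data layout (dicts keyed by processor names, with the hypothetical names `"in"` and `"out"` for $P_{in}$ and $P_{out}$), and the slow links of bandwidth 1 between $P_{in}$ and $P_2$ and between $P_1$ and $P_{out}$, which are inferred from the latency of 105 quoted in the text.

```python
def latency_fully_heterogeneous(intervals, alloc, w, delta, s, b):
    """Latency of equation (2); b maps (sender, receiver) pairs to bandwidths."""
    successors = alloc[1:] + [["out"]]  # alloc(p+1) = {out}
    # The initial input is sent to every replica of the first interval.
    total = sum(delta[0] / b[("in", u)] for u in alloc[0])
    for (d, e), procs, succ in zip(intervals, alloc, successors):
        work = sum(w[i] for i in range(d, e + 1))
        # Worst replica: computation plus serialized sends to all successors.
        total += max(work / s[u] + sum(delta[e] / b[(u, v)] for v in succ)
                     for u in procs)
    return total


# Platform of the example (bandwidths of value 1 are inferred, see above).
w = {1: 2, 2: 2}
delta = [100, 100, 100]
s = {"P1": 1, "P2": 1}
b = {("in", "P1"): 100, ("in", "P2"): 1, ("P1", "P2"): 100,
     ("P1", "out"): 1, ("P2", "out"): 100}

single_p1 = latency_fully_heterogeneous([(1, 2)], [["P1"]], w, delta, s, b)
single_p2 = latency_fully_heterogeneous([(1, 2)], [["P2"]], w, delta, s, b)
split = latency_fully_heterogeneous([(1, 1), (2, 2)], [["P1"], ["P2"]], w, delta, s, b)
```

Under these assumptions, both single-processor mappings evaluate to 105 and the split mapping to 7, matching the values above.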

Figure 3: Example optimal with 2 intervals (two stages with $w_1 = w_2 = 2$ and data sizes of 100).
Figure 4: The pipeline has to be split into intervals to achieve an optimal latency on this platform (two processors with $s_1 = s_2 = 1$).

Unfortunately these intuitions cannot be generalized when tackling bi-criteria optimization, where latency should be minimized respecting a certain failure threshold or the converse. We will prove in Lemma 1 that minimizing the failure probability under a fixed latency threshold on Fully Homogeneous and Communication Homogeneous-Failure Homogeneous platforms can still be done by keeping a single interval.

However, if we consider Communication Homogeneous-Failure Heterogeneous, we can find examples in which this property is not true. Consider for instance the pipeline of Figure 5. The target platform consists of one processor of speed 1 and failure probability 0.1: it is a slow but reliable processor. On the other hand, we have 10 fast and unreliable processors, of speed 100 and failure probability 0.8. All communication links have a bandwidth $b = 1$. If the latency threshold is fixed to 22, the slow processor cannot be used in the replication scheme. Also, if we use three fast processors, the latency is $3 \times 10 + 101/100 > 22$. Thus the best one-interval solution reaches a failure probability of $1 - (1 - 0.8^2) = 0.64$, which is very high. We can do much better by using the slow processor on the first stage, and then replicating the second stage ten times on the fast processors, achieving a latency of $10 + 1/1 + 10 \times 1 + 100/100 = 22$ and a failure probability of $1 - (1 - 0.1) \cdot (1 - 0.8^{10}) < 0.2$. Thus the optimal solution does not consist of a single interval in this case.
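
The numbers of this second example can be checked with a few lines of arithmetic. This is only a sketch: the data sizes $\delta_0 = 10$, $\delta_1 = 1$ and $\delta_2 = 0$ are assumptions inferred from the latency expressions quoted in the text, since the figure itself is not reproduced here, and all variable names are illustrative.

```python
b, L = 1.0, 22.0
delta0, delta1, delta2 = 10.0, 1.0, 0.0  # assumed data sizes (inferred)
w1, w2 = 1.0, 100.0
fast_speed, fast_fp = 100.0, 0.8
slow_speed, slow_fp = 1.0, 0.1


def one_interval_latency(k, speed):
    """Equation (1) for the whole pipeline replicated on k processors."""
    return k * delta0 / b + (w1 + w2) / speed + delta2 / b


# Largest number of fast replicas that respects the latency threshold.
best_k = max(k for k in range(1, 11)
             if one_interval_latency(k, fast_speed) <= L)
one_interval_fp = 1 - (1 - fast_fp ** best_k)

# Two intervals: stage 1 on the slow processor, stage 2 on ten fast ones.
two_interval_latency = (1 * delta0 / b + w1 / slow_speed
                        + 10 * delta1 / b + w2 / fast_speed + delta2 / b)
two_interval_fp = 1 - (1 - slow_fp) * (1 - fast_fp ** 10)
```

With these assumed sizes, at most two fast processors fit under the threshold in a single interval (failure probability about 0.64), while the two-interval mapping meets the threshold of 22 exactly with a failure probability below 0.2.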

Figure 5: Example optimal with 2 intervals (two stages with $w_1 = 1$ and $w_2 = 100$).

4 Complexity results

In this section, we present the complexity results for both mono-criterion and bi-criteria problems.

4.1 Mono-criterion problems

Theorem 1. Minimizing the failure probability can be done in polynomial time.

Proof. This can be seen easily from the formula computing the global failure probability: the minimum is reached by replicating the whole pipeline as a single interval on all processors. This is true for all platform types.

The problem of minimizing the latency is trivially of polynomial time complexity for Fully Homogeneous and Communication Homogeneous platforms. However, the problem becomes harder for Fully Heterogeneous platforms because of the first and last communications, which should be mapped on fast communicating links to optimize the latency. Notice that replication can only increase latency, so we do not consider any replication in this mono-criterion problem. However, we need to find the best partition of stages into intervals.

Theorem 2. Minimizing the latency can be done in polynomial time on Communication Homogeneous platforms.

Proof. The latency is optimized when we suppress all communications. Also, replication increases latency by adding extra communications. On a Communication Homogeneous platform, the latency is minimized by mapping the whole pipeline as a single interval on the fastest processor.

Theorem 3. Minimizing the latency is NP-hard on Fully Heterogeneous platforms for one-to-one mappings.

Proof. The problem clearly belongs to NP. We use a reduction from the Traveling Salesman Problem (TSP), which is NP-complete [11]. Consider an arbitrary instance $I_1$ of TSP, i.e., a complete graph $G = (V, E, c)$, where $c(e)$ is the cost of edge $e$, a source vertex $s \in V$, a tail vertex $t \in V$, and a bound $K$: is there a Hamiltonian path in $G$ from $s$ to $t$ whose cost is not greater than $K$?

We build the following instance $I_2$ of the one-to-one latency minimization problem: we consider an application with $n = |V|$ identical stages. All application costs are unit costs: $w_i = \delta_i = 1$ for all $i$. For the platform, in addition to $P_{in}$ and $P_{out}$ we use $m = n = |V|$ identical processors of unit speed: $s_i = 1$ for all $i$. We simply write $i$ for the processor $P_i$ that corresponds to vertex $v_i \in V$.

We only play with the link bandwidths: we interconnect $P_{in}$ and $s$, $P_{out}$ and $t$ with links of bandwidth 1. We interconnect $i$ and $j$ with a link of bandwidth $1 / c(e_{i,j})$. All the other links are very slow (say their bandwidth is smaller than $1 / (K + n + 3)$). We ask whether we can achieve a latency $T_{latency} \le K'$, where $K' = K + n + 2$. Clearly, the size of $I_2$ is linear in the size of $I_1$.

Because we have as many processors as stages, any solution to $I_2$ will use all processors. We need to map the first stage on $s$ and the last one on $t$, otherwise the input/output cost already exceeds $K'$. We spend 2 time-units for input/output, and $n$ time-units for computing (one unit per stage/processor). There remain exactly $K$ time-units for inter-processor communications, i.e., for the total cost of the Hamiltonian path that goes from $s$ to $t$. We cannot use any slow link either. Hence we have a solution for $I_2$ if and only if we have one for $I_1$.

As far as we know, the complexity is still open for interval mappings, although we suspect it might be NP-hard. However, if we relax the interval constraint, i.e., a set of non-consecutive stages can be assigned to a same processor, then the problem becomes polynomial. We call such mappings general mappings.

Theorem 4. Minimizing the latency is polynomial on Fully Heterogeneous platforms for general mappings.

Proof. We consider Fully Heterogeneous platforms and we want to minimize the latency. Let us consider a directed graph with $n \cdot m + 2$ vertices, and $(n-1) m^2 + 2m$ edges, as illustrated in Figure 6. $V_{i,u}$ corresponds to the mapping of stage $S_i$ onto processor $P_u$. $V_{0,in}$ and $V_{(n+1),out}$ represent the initial and final processors, and data must flow from $V_{0,in}$ to $V_{(n+1),out}$. Edges represent the flow of data from one stage to another, thus we have $m^2$ edges for $i = 0..n$, connecting vertex $V_{i,u}$ to $V_{i+1,v}$ for $u, v = 1..m$ (except for the first and last stages where there are only $m$ edges).

Figure 6: Minimizing the latency (layered graph with vertices $V_{i,u}$ between $V_{0,in}$ and $V_{n+1,out}$, and edges $e_{i,u,v}$).

Thus, a general mapping can be represented by a path from $V_{0,in}$ to $V_{(n+1),out}$: if $V_{i,u}$ is in the path then stage $S_i$ is mapped onto $P_u$. Notice that a path can create intervals of non-consecutive stages, thus this mapping is not interval-based.

We assign weights to the edges to ensure that the weight of a path is the latency of the corresponding mapping. Computation cost of stage $S_i$ on $P_u$ is added on the $m$ edges exiting $V_{i,u}$, and thus $e_{i,u,v} = w_i / s_u$. Communication costs are added on all edges: $e_{i,u,v} \mathrel{+}= \delta_i / b_{u,v}$ if $P_u \ne P_v$. Edges $e_{i,u,u}$ correspond to intra-interval communications, and thus there is no communication cost to pay.

The mapping which realizes the minimum latency can be obtained by finding a shortest path in this graph going from $V_{0,in}$ to $V_{(n+1),out}$. The graph has polynomial size and the shortest path can be computed in polynomial time [8], thus we have the result in polynomial time, which concludes the proof.
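
The construction in this proof can be sketched in a few lines. Because the graph is layered and acyclic, the sketch below computes the shortest path with a single dynamic-programming sweep over the layers rather than a general shortest-path routine; this is an implementation choice made here, not something prescribed by the proof. The function name, the 0-indexed lists, and the separate `b_in` / `b_out` arrays are assumptions of the example.

```python
def min_latency_general_mapping(w, delta, s, b_in, b_out, b):
    """Shortest path in the layered graph of the proof of Theorem 4.

    w[i] is the cost of stage i+1, delta[i] the data size delta_i,
    s[u] the speed of P_u, b_in[u] / b_out[u] the bandwidths to P_in / P_out,
    and b[u][v] the bandwidth between P_u and P_v (b[u][u] is never read).
    Returns (latency, mapping) where mapping[i] is the processor of stage i+1.
    """
    n, m = len(w), len(s)
    # dist[u]: weight of the shortest path from V_{0,in} to V_{1,u}.
    dist = [delta[0] / b_in[u] for u in range(m)]
    back = []  # back[i][v]: best processor for stage i+1 if stage i+2 is on v
    for i in range(n - 1):
        new_dist, pred = [], []
        for v in range(m):
            # Edge weight: compute stage i+1 on u, then send delta_{i+1}
            # to v unless both stages share the same processor.
            cands = [dist[u] + w[i] / s[u]
                     + (0.0 if u == v else delta[i + 1] / b[u][v])
                     for u in range(m)]
            best = min(range(m), key=cands.__getitem__)
            new_dist.append(cands[best])
            pred.append(best)
        dist = new_dist
        back.append(pred)
    # Last layer: compute stage n, then send delta_n to P_out.
    finals = [dist[u] + w[n - 1] / s[u] + delta[n] / b_out[u] for u in range(m)]
    last = min(range(m), key=finals.__getitem__)
    mapping = [last]
    for pred in reversed(back):
        mapping.append(pred[mapping[-1]])
    mapping.reverse()
    return finals[last], mapping
```

The sweep examines each of the $O(n m^2)$ edges once. On a two-stage, two-processor instance with `w = [2, 2]`, `delta = [100, 100, 100]`, unit speeds, `b_in = [100, 1]`, `b_out = [1, 100]` and a link of bandwidth 100 between the processors, it returns a latency of 7 with stage 1 on the first processor and stage 2 on the second.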

4.2 Preliminary Lemma for bi-criteria problems

We start with a preliminary lemma which proves that there is an optimal solution of both bi-criteria problems consisting of a single interval for Fully Homogeneous platforms, and for Communication Homogeneous platforms with identical failure probabilities.

Lemma 1. On Fully Homogeneous and Communication Homogeneous-Failure Homogeneous platforms, there is a mapping of the pipeline as a single interval which minimizes the failure probability under a fixed latency threshold, and there is a mapping of the pipeline as a single interval which minimizes the latency under a fixed failure probability threshold.

Proof. If the stages are split into $p$ intervals, the failure probability is expressed as
$$1 - \prod_{1 \le j \le p} \Big( 1 - \prod_{u \in alloc(j)} fp_u \Big).$$

Let us start with the Fully Homogeneous case, and with Failure Heterogeneous for a most general setting. We can transform the solution into a new one using a single interval, which improves both latency and failure probability. Let $k_0$ be the number of times that the first interval is replicated in the original solution. Then a solution which replicates the whole interval on the $k_0$ most reliable processors realizes: (i) a latency which is smaller since we remove the communications between intervals; (ii) a smaller failure probability since for the new solution $(1 - \prod_{u \in alloc(1)} fp_u)$ is greater than the same expression in the original solution (the most reliable processors are used in the new one), and moreover the old solution even decreases this value by multiplying it by other terms smaller than 1. Thus the new solution is better for both criteria.

In the case with Communication Homogeneous and Failure Homogeneous, we use a similar reasoning to transform the solution. We select the interval with the smallest number of processors, denoted $k$. In the failure probability expression, there is a term in $(1 - fp^k)$, and thus the global failure probability is greater than $1 - (1 - fp^k)$, which is obtained by replicating the whole interval onto $k$ processors. Since we do not want to increase the latency, we use the fastest $k$ processors, and it is easy to check that this scheme cannot increase latency ($k \le k_0$ and the slowest processor is not slower than the slowest processor of any interval of the initial solution). Thus the new solution is better for both criteria, which ends the proof.

We point out that Lemma 1 cannot be extended to Communication Homogeneous and Failure Heterogeneous: instead, we can build counter-examples in which this property is not true, as illustrated in Section 3.

4.3 Bi-criteria problems on Fully Homogeneous platforms

For Fully Homogeneous platforms, we consider that all failure probabilities are identical, since the platform is made of identical processors. However, results can easily be extended for different failure probabilities. We have seen in Lemma 1 that the optimal solution for a bi-criteria mapping on such platforms always consists in mapping the whole pipeline as a single interval. Otherwise, both latency and failure probability would be increased.

Theorem 5. On Fully Homogeneous platforms, the solution to the bi-criteria problem can be found in polynomial time using Algorithm 1 or Algorithm 2.

Informally, the algorithms find the maximum number of processors $k$ that can be used in the replication set, and the whole interval is mapped on a set of $k$ identical processors. With different failure probabilities, the more reliable processors are used.

begin
    Find $k$ maximum, such that
        $k \times \frac{\delta_0}{b} + \frac{\sum_{1 \le j \le n} w_j}{s} + \frac{\delta_n}{b} \le L$
    Replicate the whole pipeline as a single interval onto the $k$ (most reliable) processors;
end
Algorithm 1: Fully Homogeneous platforms: Minimizing $FP$ for a fixed $L$

begin
    Find $k$ minimum, such that
        $1 - (1 - fp^k) \le FP$
    Replicate the whole pipeline as a single interval onto the $k$ (most reliable) processors;
end
Algorithm 2: Fully Homogeneous platforms: Minimizing $L$ for a fixed $FP$

Proof. The proof of this theorem is based on Lemma 1. We prove it in the general setting of heterogeneous failure probabilities. An optimal solution can be obtained by mapping the pipeline as a single interval, thus we need to decide the set of processors $alloc$ used for replication. $|alloc|$ is the number of processors used.

The first problem can be formally expressed as follows:
$$\text{Minimize } 1 - \Big( 1 - \prod_{u \in alloc} fp_u \Big), \quad \text{under the constraint} \quad |alloc| \times \frac{\delta_0}{b} + \frac{\sum_{1 \le i \le n} w_i}{s} + \frac{\delta_n}{b} \le L. \qquad (3)$$

This leads to minimize $\prod_{u \in alloc} fp_u$, and the constraint on the latency determines the maximum number $k$ of processors which can be used:
$$k = \left\lfloor \frac{b}{\delta_0} \left( L - \frac{\sum_{1 \le i \le n} w_i}{s} - \frac{\delta_n}{b} \right) \right\rfloor.$$
In order to minimize $\prod_{u \in alloc} fp_u$, we need to use as many processors as possible since $fp_u \le 1$ for $1 \le u \le m$.

If one of the most reliable processors is not used, we can exchange it with a less reliable processor of the replication set, which decreases the product $\prod_{u \in alloc} fp_u$ and hence the failure probability. The formula is therefore minimized when using the $k$ most reliable processors, which is represented in Algorithm 1.
T he second problem is expressed below :
P
wi
M inim ize jallocjb0 + 1 is n +

Y
fpu ) F P
1 (1
(4)
u2 alloc
Latency increasesw hen jallocjislarge,thusw e need to nd the sm allestnum berofprocessorsw hich satis esconstraint(4). A sbefore,ifone ofthe m ostreliable processorsisnotused,
w e can exchange it and im prove the reliability w ithout increasing the latency,w hich m ight
lead to add few er processors to the replication set for an identicalreliability. A lgorithm 2
thus returns the optim alsolution.
R em ark B oth algorithm s (1 and 2) are optim alas w ellin the case ofheterogeneous failure
probabilities. W e add the m ost reliable processors to the replication schem e (thus increasing
latency and decreasing the failure probability) w hile L or F P are not reached.
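
Algorithms 1 and 2 can be sketched as follows in the general setting of the remark (heterogeneous failure probabilities on a Fully Homogeneous platform). The function names, the argument order and the convention of returning `None` when no replication set satisfies the threshold are assumptions of this sketch. A linear scan over $k$ is used instead of the closed-form floor expression to sidestep floating-point rounding.

```python
from math import prod


def latency(k, w, delta0, deltan, s, b):
    """Latency of the whole pipeline replicated on k identical processors."""
    return k * delta0 / b + sum(w) / s + deltan / b


def algorithm1(w, delta0, deltan, s, b, fp, L):
    """Minimize FP for a fixed latency threshold L (sketch of Algorithm 1)."""
    by_reliability = sorted(range(len(fp)), key=lambda u: fp[u])
    feasible = [k for k in range(1, len(fp) + 1)
                if latency(k, w, delta0, deltan, s, b) <= L]
    # Use as many processors as the threshold allows, most reliable first.
    return by_reliability[:max(feasible)] if feasible else None


def algorithm2(fp, FP):
    """Minimize latency for a fixed threshold FP (sketch of Algorithm 2)."""
    by_reliability = sorted(range(len(fp)), key=lambda u: fp[u])
    for k in range(1, len(fp) + 1):
        chosen = by_reliability[:k]
        # Smallest k whose replication set meets the reliability target.
        if 1.0 - (1.0 - prod(fp[u] for u in chosen)) <= FP:
            return chosen
    return None
```

For example, with `fp = [0.5, 0.1, 0.2]`, unit speed, bandwidth and data sizes, and `w = [2, 2]`, a threshold `L = 7.5` admits two replicas, so `algorithm1` selects processors 1 and 2; `algorithm2(fp, 0.05)` selects the same pair.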

4.4 Bi-criteria problems on Com. Homogeneous platforms

For Communication Homogeneous platforms, we first consider the simpler case where all failure probabilities are identical, denoted by Failure Homogeneous. In this case, the optimal bi-criteria solution still consists of the mapping of the pipeline as a single interval.

Theorem 6. On Communication Homogeneous platforms with Failure Homogeneous, the solution to the bi-criteria problem can be found in polynomial time using Algorithm 3 or 4.

Informally, we add the fastest processors to the replication set while the latency is not exceeded (or until $FP$ is reached), thus reducing the failure probability and increasing the latency.

begin
    Order processors in non-increasing order of $s_j$;
    Find $k$ maximum, such that
        $k \times \frac{\delta_0}{b} + \frac{\sum_{1 \le j \le n} w_j}{s_k} + \frac{\delta_n}{b} \le L$
    Replicate the whole pipeline as a single interval onto the fastest $k$ processors;
    // Note that at any time $s_k$ is the speed of the slowest processor used in the replication scheme.
end
Algorithm 3: Communication Homogeneous platforms - Failure Homogeneous: Minimizing $FP$ for a fixed $L$

begin
    Find $k$ minimum, such that
        $1 - (1 - fp^k) \le FP$
    Replicate the whole pipeline as a single interval onto the fastest $k$ processors;
end
Algorithm 4: Communication Homogeneous platforms - Failure Homogeneous: Minimizing $L$ for a fixed $FP$

Proof. In this particular setting, Lemma 1 still applies, so we restrict to mappings as a single interval, and search for the optimal set of processors $alloc$ which should be used.

The first problem is expressed as:
$$\text{Minimize } 1 - (1 - fp^{|alloc|}), \quad \text{under the constraint} \quad |alloc| \times \frac{\delta_0}{b} + \frac{\sum_{1 \le i \le n} w_i}{\min_{u \in alloc} s_u} + \frac{\delta_n}{b} \le L. \qquad (5)$$

The failure probability is smaller when $|alloc|$ is large, thus we need to add as many processors as we can while satisfying the constraint. The latency increases when adding more processors, and it depends on the speed of the slowest processor. Thus, if the $|alloc|$ fastest processors are not used, we can exchange a fastest processor with a used one without increasing latency. Algorithm 3 thus returns an optimal mapping.

The other problem is similar, with the following expression:
$$\text{Minimize } |alloc| \times \frac{\delta_0}{b} + \frac{\sum_{1 \le i \le n} w_i}{\min_{u \in alloc} s_u} + \frac{\delta_n}{b}, \quad \text{under the constraint} \quad 1 - (1 - fp^{|alloc|}) \le FP. \qquad (6)$$

We can thus find the smallest number of processors that should be used in order to satisfy $FP$, and then use the fastest processors to optimize latency, which is done by Algorithm 4.
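
Algorithms 3 and 4 differ from the fully homogeneous case in that the computation term depends on $s_k$, the speed of the slowest selected processor. The sketch below illustrates this; names, argument order and the `None` return value for infeasible thresholds are assumptions of the example.

```python
def algorithm3(w, delta0, deltan, speeds, b, L):
    """Minimize FP for a fixed latency L with identical failure probabilities."""
    order = sorted(range(len(speeds)), key=lambda u: -speeds[u])  # fastest first
    best = None
    for k in range(1, len(order) + 1):
        s_k = speeds[order[k - 1]]  # slowest processor among the k fastest
        if k * delta0 / b + sum(w) / s_k + deltan / b <= L:
            best = order[:k]
        else:
            break  # latency is non-decreasing in k, so larger sets fail too
    return best


def algorithm4(speeds, fp, FP):
    """Minimize latency for a fixed FP with a common failure probability fp."""
    order = sorted(range(len(speeds)), key=lambda u: -speeds[u])
    for k in range(1, len(order) + 1):
        if 1.0 - (1.0 - fp ** k) <= FP:
            return order[:k]  # fewest replicas, taken among the fastest
    return None
```

With `speeds = [1, 4, 2]`, `w = [4, 4]` and unit data sizes and bandwidth, a threshold `L = 7` admits the two fastest processors (indices 1 and 2), since a third replica would raise the latency to 12. With `fp = 0.5` and `FP = 0.2`, three replicas are needed.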

However, the problem is more complex when we consider different failure probabilities (Failure Heterogeneous). It is also more natural since we have different processors and there is no reason why they would have the same failure probability. Unfortunately for Failure Heterogeneous, we can exhibit for some problem instances an optimal solution in which the pipeline stages must be divided into several intervals. The complexity of the problem remains open, but we conjecture it is NP-hard.

4.5 Bi-criteria problems on Fully Heterogeneous platforms

For Fully Heterogeneous platforms, we restrict to heterogeneous failure probabilities, which is the most natural case. We prove that the bi-criteria problems are NP-hard.

Theorem 7. On Fully Heterogeneous platforms, the bi-criteria (decision problems associated to the) optimization problems are NP-hard.

Proof. We consider the following decision problem on Fully Heterogeneous platforms: given a failure probability threshold $FP$ and a latency threshold $L$, is there a mapping of failure probability at most $FP$ and of latency at most $L$? The problem is obviously in NP: given a mapping, it is easy to check in polynomial time that it is valid by computing its failure probability and latency.
To establish the completeness, we use a reduction from 2-PARTITION [11]. We consider
an instance $I_1$ of 2-PARTITION: given $m$ positive integers $a_1, a_2, \ldots, a_m$, does there exist a
subset $I \subset \{1, \ldots, m\}$ such that $\sum_{i \in I} a_i = \sum_{i \notin I} a_i$? Let $S = \sum_{i=1}^{m} a_i$.
We build the following instance $I_2$ of our problem: the pipeline is composed of a single
stage with $w = 1$, and the input and output communication costs are $\delta_0 = \delta_1 = 1$. The
platform consists of $m$ processors with speeds $s_j = 1$ and failure probability $fp_j = e^{-a_j}$, for
$1 \le j \le m$ (thus $0 \le fp_j \le 1$). Bandwidths are defined as $b_{in,j} = 1/a_j$ and $b_{j,out} = 1$ for
$1 \le j \le m$.
We ask whether it is possible to realize a latency of $S/2 + 2$ and a failure probability of
$e^{-S/2}$. Clearly, the size of $I_2$ is polynomial (and even linear) in the size of $I_1$. We now show
that instance $I_1$ has a solution if and only if instance $I_2$ does.
Suppose first that $I_1$ has a solution. The solution to $I_2$ which replicates the stage on the
set of processors $I$ has a latency of $S/2 + 2$, since the first communication requires summing
$\delta_0 / b_{in,j}$ over all processors $P_j$ included in the replication scheme, and then both the computation and
the final output require a time 1. The failure probability of this solution is
$$1 - \Big(1 - \prod_{j \in I} fp_j\Big) = e^{-\sum_{j \in I} a_j} = e^{-S/2}.$$
Thus we have solved $I_2$.
On the other hand, if $I_2$ has a solution, let $I$ be the set of processors on which the stage
is replicated. Because of the latency constraint,
$$\sum_{j \in I} \frac{1}{b_{in,j}} + 1 + 1 \le \frac{S}{2} + 2.$$
Since $b_{in,j} = 1/a_j$, this implies that $\sum_{j \in I} a_j \le S/2$. Next we consider the failure probability
constraint. We must have
$$1 - \Big(1 - \prod_{j \in I} fp_j\Big) \le e^{-S/2}$$
and thus $e^{-\sum_{j \in I} a_j} \le e^{-S/2}$, which forces $\sum_{j \in I} a_j \ge S/2$. Thus $\sum_{j \in I} a_j = S/2$ and we have
a solution to the instance of 2-PARTITION $I_1$, which concludes the proof.
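The reduction can be checked by brute force on small instances. The sketch below is an illustration written for this purpose, not part of the report: the dictionary layout and the names `build_instance`, `feasible_replication` and `has_partition` are assumptions. To avoid rounding issues, failure probabilities are kept as exact logarithms ($\log fp_j = -a_j$), so the constraint $\prod_{j \in I} fp_j \le e^{-S/2}$ becomes $\sum_{j \in I} \log fp_j \le -S/2$.

```python
from fractions import Fraction
from itertools import combinations
from typing import Dict, List, Optional, Tuple


def build_instance(a: List[int]) -> Dict:
    """Build the mapping instance I2 from a 2-PARTITION instance a_1..a_m."""
    total = sum(a)  # S in the proof
    return {
        "w": 1, "delta_in": 1, "delta_out": 1,
        "speeds": [1] * len(a),
        "log_fp": [Fraction(-x) for x in a],       # fp_j = exp(-a_j)
        "b_in": [Fraction(1, x) for x in a],       # b_in,j = 1 / a_j
        "b_out": [1] * len(a),
        "max_latency": Fraction(total, 2) + 2,     # S/2 + 2
        "max_log_fp": Fraction(-total, 2),         # log of exp(-S/2)
    }


def feasible_replication(inst: Dict) -> Optional[Tuple[int, ...]]:
    """Try every non-empty processor set; return one meeting both thresholds."""
    m = len(inst["speeds"])
    for size in range(1, m + 1):
        for subset in combinations(range(m), size):
            input_time = sum(inst["delta_in"] / inst["b_in"][j] for j in subset)
            compute_time = Fraction(inst["w"], min(inst["speeds"][j] for j in subset))
            # All output bandwidths equal 1 here, so the output takes time 1.
            output_time = Fraction(inst["delta_out"], min(inst["b_out"][j] for j in subset))
            latency = input_time + compute_time + output_time
            log_failure = sum(inst["log_fp"][j] for j in subset)
            if latency <= inst["max_latency"] and log_failure <= inst["max_log_fp"]:
                return subset
    return None


def has_partition(a: List[int]) -> bool:
    """Brute-force 2-PARTITION: is there a subset summing to half the total?"""
    total = sum(a)
    return any(2 * sum(c) == total
               for size in range(1, len(a) + 1)
               for c in combinations(a, size))
```

On any small list of positive integers, `feasible_replication(build_instance(a))` finds a processor set exactly when `has_partition(a)` is true, and the set it returns sums to $S/2$. Both searches are exponential, which is expected for a sanity check of an NP-hardness reduction.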
Related work and conclusion
In this paper, we have assessed the complexity of trading between response time and reliability,
which are among the most important criteria for a typical user. Indeed, in the context of
large scale distributed platforms such as clusters or grids, failure probability becomes a major
concern [10, 12, 9], and the bi-criteria approach tackled in this paper makes it possible to provide robust
solutions while fulfilling user demands (minimizing latency under some reliability threshold,
or the converse). We have shown that the more heterogeneous the target platform, the
more difficult the problems. In particular, the bi-criteria optimization problem is polynomial
for Fully Homogeneous, NP-hard for Fully Heterogeneous and remains an open problem for
Communication Homogeneous.
An example of a real world application consisting of a pipeline workflow can be found
in [3]. In this work, we study the interval mapping of the JPEG encoder pipeline on a cluster
of workstations.
Several other bi-criteria optimization problems have been considered in the literature. For
instance optimizing both latency and throughput is quite natural, as these objectives represent
trade-offs between user expectations and the whole system performance. See [16, 5, 4] for
pipeline graphs and [18] for general application DAGs. In the context of embedded systems,
energy consumption is another important objective to minimize. Three-criteria optimization
(energy, latency and throughput) is discussed in [19].
For large scale distributed platforms such as production grids, throughput is a very important criterion as it measures the aggregate rate of processing of data, hence the global rate
at which execution progresses. We can envision two types of replication: the first type is to
replicate the same computation on different processors, as in this paper, to increase reliability.
The second type is to allocate the processing of different data sets to different processors (say
in a round-robin fashion), in order to increase the throughput. Both replication types can be
conducted simultaneously, at the price of more resource consumption. Our future work will
be devoted to the study of the interplay between throughput, latency and reliability, a very
challenging algorithmic problem.
References
[1] J. Abawajy. Fault-tolerant scheduling policy for grid computing systems. In International Parallel and Distributed Processing Symposium IPDPS'2004. IEEE Computer Society Press, 2004.
[2] S. Albers and G. Schmidt. Scheduling with unexpected machine breakdowns. Discrete Applied Mathematics, 110(2-3):85-99, 2001.
[3] A. Benoit, H. Kosch, V. Rehn-Sonigo, and Y. Robert. Bi-criteria Pipeline Mappings for Parallel Image Processing. Research Report 2008-02, LIP, ENS Lyon, France, Jan. 2008. Available at graal.ens-lyon.fr/~vsonigo/.
[4] A. Benoit, V. Rehn-Sonigo, and Y. Robert. Multi-criteria scheduling of pipeline workflows. In HeteroPar'2007: International Conference on Heterogeneous Computing, jointly published with Cluster'2007. IEEE Computer Society Press, 2007.
[5] A. Benoit and Y. Robert. Complexity results for throughput and latency optimization of replicated and data-parallel workflows. In HeteroPar'2007: International Conference on Heterogeneous Computing, jointly published with Cluster'2007. IEEE Computer Society Press, 2007.
[6] P. Bhat, C. Raghavendra, and V. Prasanna. Efficient collective communication in distributed heterogeneous systems. In ICDCS'99 19th International Conference on Distributed Computing Systems, pages 15-24. IEEE Computer Society Press, 1999.
[7] P. Bhat, C. Raghavendra, and V. Prasanna. Efficient collective communication in distributed heterogeneous systems. Journal of Parallel and Distributed Computing, 63:251-263, 2003.
[8] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Press, 1990.
[9] A. Duarte, D. Rexachs, and E. Luque. A distributed scheme for fault-tolerance in large clusters of workstations. In NIC Series, Vol. 33, pages 473-480. John von Neumann Institute for Computing, Julich, 2006.
[10] A. H. Frey and G. Fox. Problems and approaches for a teraflop processor. In Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, pages 21-25. ACM Press, 1988.
[11] M. R. Garey and D. S. Johnson. Computers and Intractability, a Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.
[12] A. Geist and C. Engelmann. Development of naturally fault tolerant algorithms for computing on 100,000 processors. http://www.csm.ornl.gov/~geist/Lyon2002-geist.pdf, 2002.
[13] T. Saif and M. Parashar. Understanding the behavior and performance of non-blocking communications in MPI. In Proceedings of Euro-Par 2004: Parallel Processing, LNCS 3149, pages 173-182. Springer, 2004.
[14] B. A. Shirazi, A. R. Hurson, and K. M. Kavi. Scheduling and load balancing in parallel and distributed systems. IEEE Computer Science Press, 1995.
[15] J. Subhlok and G. Vondran. Optimal mapping of sequences of data parallel tasks. In Proc. 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'95, pages 134-143. ACM Press, 1995.
[16] J. Subhlok and G. Vondran. Optimal latency-throughput tradeoffs for data parallel pipelines. In ACM Symposium on Parallel Algorithms and Architectures SPAA'96, pages 62-71. ACM Press, 1996.
[17] G. Tel. Introduction to Distributed Algorithms. Cambridge University Press, 2000.
[18] N. Vydyanathan, U. Catalyurek, T. Kurc, P. Saddayappan, and J. Saltz. An approach for optimizing latency under throughput constraints for application workflows on clusters. Research Report OSU-CISRC-1/07-TR03, Ohio State University, Columbus, OH, Jan. 2007. Available at ftp://ftp.cse.ohio-state.edu/pub/tech-report/2007.
[19] R. Xu, R. Melhem, and D. Mosse. Energy-aware scheduling for streaming applications on chip multiprocessors. In the 28th IEEE Real-Time System Symposium (RTSS'07), Tucson, Arizona, December 2007.
Unité de recherche INRIA Rhône-Alpes
655, avenue de l'Europe - 38334 Montbonnot Saint-Ismier (France)
Unité de recherche INRIA Futurs : Parc Club Orsay Université - ZAC des Vignes
4, rue Jacques Monod - 91893 ORSAY Cedex (France)
Unité de recherche INRIA Lorraine : LORIA, Technopôle de Nancy-Brabois - Campus scientifique
615, rue du Jardin Botanique - BP 101 - 54602 Villers-lès-Nancy Cedex (France)
Unité de recherche INRIA Rennes : IRISA, Campus universitaire de Beaulieu - 35042 Rennes Cedex (France)
Unité de recherche INRIA Rocquencourt : Domaine de Voluceau - Rocquencourt - BP 105 - 78153 Le Chesnay Cedex (France)
Unité de recherche INRIA Sophia Antipolis : 2004, route des Lucioles - BP 93 - 06902 Sophia Antipolis Cedex (France)
Éditeur
INRIA - Domaine de Voluceau - Rocquencourt, BP 105 - 78153 Le Chesnay Cedex (France)
http://www.inria.fr
ISSN 0249-6399
