
INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE

Optimizing Latency and Reliability of Pipeline Workflow Applications

Anne Benoit, Veronika Rehn-Sonigo, Yves Robert

Research report N° 6345, March 2008
Thème NUM
ISRN INRIA/RR--6345--FR+ENG
ISSN 0249-6399
inria-00186152, version 4 - 26 Mar 2008

Theme NUM: Numerical systems
Project GRAAL
Research report n° 6345, March 2008, 16 pages

Abstract: Mapping applications onto heterogeneous platforms is a difficult challenge, even for simple application patterns such as pipeline graphs. The problem is even more complex when processors are subject to failure during the execution of the application.
In this paper, we study the complexity of a bi-criteria mapping which aims at optimizing the latency (i.e., the response time) and the reliability (i.e., the probability that the computation will be successful) of the application. Latency is minimized by using faster processors, while reliability is increased by replicating computations on a set of processors. However, replication increases latency (additional communications, slower processors). The application fails to be executed only if all the processors fail during execution.
While simple polynomial algorithms can be found for fully homogeneous platforms, the problem becomes NP-hard when tackling heterogeneous platforms. This is yet another illustration of the additional complexity added by heterogeneity.
Key-words: Heterogeneity, scheduling, complexity results, reliability, response time.

This text is also available as a research report of the Laboratoire de l'Informatique du Parallélisme, http://www.ens-lyon.fr/LIP.

Unité de recherche INRIA Rhône-Alpes
655, avenue de l'Europe, 38334 Montbonnot Saint Ismier (France)
Telephone: +33 4 76 61 52 00, Fax: +33 4 76 61 52 52

Latency and reliability optimization of pipeline workflow applications

Résumé: Scheduling and allocating applications on heterogeneous platforms are crucial problems, even for simple applications such as pipeline graphs. The problem becomes even more complex when processors can fail during the execution of the application. In this article, we study the complexity of a bi-criteria allocation which aims at optimizing the latency (i.e., the response time) and the reliability (i.e., the probability that the computation succeeds) of the application. Latency is minimized by using fast processors, while reliability is increased by replicating computations on a set of processors. However, replication increases latency (additional communications and slower processors). The application fails to be executed only if all processors fail during execution. Simple polynomial-time algorithms can be found for fully homogeneous platforms, while the problem becomes NP-hard when tackling heterogeneous platforms. This is yet another illustration of the additional complexity due to heterogeneity.
Mots-clés: Heterogeneity, scheduling, complexity results, reliability, response time.

1 Introduction

Mapping applications onto parallel platforms is a difficult challenge. Several scheduling and load-balancing techniques have been developed for homogeneous architectures (see [14] for a survey) but the advent of heterogeneous clusters has rendered the mapping problem even more difficult. Moreover, in a distributed computing architecture, some processors may suddenly become unavailable, and we are facing the problem of failure [1, 2]. In this context of dynamic heterogeneous platforms with failures, a structured programming approach rules out many of the problems which the low-level parallel application developer is usually confronted with, such as deadlocks or process starvation.

In this paper, we consider application workflows that can be expressed as pipeline graphs. Typical applications include digital image processing, where images have to be processed in steady-state mode. A well-known pipeline application of this type is, for example, JPEG encoding (see http://www.jpeg.org/). In such workflow applications, a series of data sets (tasks) enter the input stage and progress from stage to stage until the final result is computed. Each stage has its own communication and computation requirements: it reads an input file from the previous stage, processes the data and outputs a result to the next stage. For each data set, initial data is input to the first stage, and final results are output from the last stage.

Each processor has a failure probability, which expresses the chance that the processor fails during execution. Key metrics for a given workflow are the latency and the failure probability. The latency is the time elapsed between the beginning and the end of the execution of a given data set, hence it measures the response time of the system to process the data set entirely. Intuitively, we minimize the latency by assigning all stages to the fastest processor, but this may lead to an unreliable execution of the application. Therefore, we need to find trade-offs between two antagonistic objectives, namely latency and failure probability. Informally, the application will be reliable for a given mapping if the corresponding global failure probability is small. Here, we focus on bi-criteria approaches, i.e., minimizing the latency under failure probability constraints, or the converse. Indeed, such bi-criteria approaches seem more natural than the minimization of a linear combination of both criteria. Users may have latency constraints or reliability constraints, but it makes little sense for them to minimize the sum of the latency and of the failure probability.

We focus on pipeline skeletons and thus we enforce the rule that a given stage is mapped onto a single processor. In other words, a processor that is assigned a stage will execute the operations required by this stage (input, computation and output) for all the tasks fed into the pipeline. However, in order to improve reliability, we can replicate the computations for a given stage on several processors, i.e., a set of processors performs identical computations on every data set. Thus, in case of failure, we can take the result from a processor which is still working. The optimization problem can be stated informally as follows: which stage to assign to which (set of) processors? We require the mapping to be interval-based, i.e., a set of processors is assigned an interval of consecutive stages. The main objective of this paper is to assess the complexity of this bi-criteria mapping problem.

The rest of the paper is organized as follows. Section 2 is devoted to the presentation of the target optimization problems. Next, in Section 3, some motivating examples are presented. In Section 4 we proceed to the complexity results. Finally, we briefly review related work and state some concluding remarks in Section 5.

2 Framework and optimization problems

2.1 Framework

The application is expressed as a pipeline graph of $n$ stages $S_k$, $1 \le k \le n$, as illustrated on Figure 1. Consecutive data sets are fed into the pipeline and processed from stage to stage, until they exit the pipeline after the last stage. Each stage executes a task. More precisely, the $k$-th stage $S_k$ receives an input from the previous stage, of size $\delta_{k-1}$, performs a number of $w_k$ computations, and outputs data of size $\delta_k$ to the next stage. This operation corresponds to the $k$-th task and is repeated periodically on each data set. The first stage $S_1$ receives an input of size $\delta_0$ from the outside world, while the last stage $S_n$ returns the result, of size $\delta_n$, to the outside world.
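[Figure 1: The application pipeline. Stages $S_1, S_2, \dots, S_k, \dots, S_n$ with computation costs $w_1, w_2, \dots, w_k, \dots, w_n$, linked by data of sizes $\delta_0, \dots, \delta_{k-1}, \delta_k, \dots, \delta_n$.]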

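[Figure 2: The target platform. Processors $P_{in}$, $P_u$, $P_v$ and $P_{out}$, with speeds $s_{in}$, $s_u$, $s_v$, $s_{out}$ and link bandwidths $b_{in,u}$, $b_{u,v}$, $b_{v,out}$.]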


We target a platform (see Figure 2), with $m$ processors $P_u$, $1 \le u \le m$, fully interconnected as a (virtual) clique. We associate to each processor a failure probability $0 \le fp_u \le 1$, $1 \le u \le m$, which is the probability that the processor breaks down during the execution of the application. A set of processors with identical failure probabilities is denoted Failure Homogeneous and otherwise Failure Heterogeneous. We consider a constant failure probability as we are dealing with workflows. These workflows are meant to run for a very long time, and therefore we address the question of whether the processor will break down or not at any time during execution. Indeed the maximum latency will be determined by the latency of the data sets which are processed after the failure.
There is a bidirectional link $\mathrm{link}_{u,v}: P_u \to P_v$ between any processor pair $P_u$ and $P_v$, of bandwidth $b_{u,v}$. The speed of processor $P_u$ is denoted as $s_u$, and it takes $X/s_u$ time-units for $P_u$ to execute $X$ floating point operations. We also enforce a linear cost model for communications, hence it takes $X/b_{u,v}$ time-units to send (or receive) a message of size $X$ from $P_u$ to $P_v$. Communication contention is taken care of by enforcing the one-port model [6, 7]. In this model, a given processor can be involved in a single communication at any time-step, either a send or a receive. However, independent communications between distinct processor pairs can take place simultaneously. The one-port model seems to fit the performance of some current MPI implementations, which serialize asynchronous MPI sends as soon as message sizes exceed a few megabytes [13].
We consider three types of platforms:

- Fully Homogeneous platforms have identical processors ($s_u = s$ for $1 \le u \le m$) and interconnection links ($b_{u,v} = b$ for $1 \le u, v \le m$);
- Communication Homogeneous platforms, with identical links but different speed processors, introduce a first degree of heterogeneity;
- Fully Heterogeneous platforms constitute the most difficult instance, with different speed processors and different capacity links.

Finally, we assume that two special additional processors $P_{in}$ and $P_{out}$ are devoted to input/output data. Initially, the input data for each task resides on $P_{in}$, while all results must be returned to and stored in $P_{out}$.

2.2 Bi-criteria Mapping Problem

The general mapping problem consists in assigning application stages to platform processors. For simplicity, we could assume that each stage $S_i$ of the application pipeline is mapped onto a distinct processor (which is possible only if $n \le m$). However, such one-to-one mappings may be unduly restrictive, and a natural extension is to search for interval mappings, i.e., allocation functions where each participating processor is assigned an interval of consecutive stages. Intuitively, assigning several consecutive tasks to the same processor will increase its computational load, but may well dramatically decrease communication requirements. In fact, the best interval mapping may turn out to be a one-to-one mapping, or instead may enroll only a very small number of fast computing processors interconnected by high-speed links. Interval mappings constitute a natural and useful generalization of one-to-one mappings (not to speak of situations where $m < n$, where interval mappings are mandatory), and such mappings have been studied by Subhlok et al. [15, 16].

Formally, we search for a partition of $[1..n]$ into $p \le m$ intervals $I_j = [d_j, e_j]$ such that $d_j \le e_j$ for $1 \le j \le p$, $d_1 = 1$, $d_{j+1} = e_j + 1$ for $1 \le j \le p-1$ and $e_p = n$.

The function $\mathrm{alloc}(j)$ returns the indices of the processors on which interval $I_j$ is mapped. There are $k_j = |\mathrm{alloc}(j)|$ processors executing $I_j$, and obviously $k_j \ge 1$. Increasing $k_j$ increases the reliability of the execution of interval $I_j$. The optimization problem is to determine the best mapping, over all possible partitions into intervals, and over all processor assignments. The objective can be to minimize either the latency or the failure probability, or a combination: given a threshold latency, what is the minimum failure probability that can be achieved? Similarly, given a threshold failure probability, what is the minimum latency that can be achieved?
The failure probability can be computed given the number $p$ of intervals and the set of processors assigned to each interval:
\[ FP = 1 - \prod_{1 \le j \le p} \Bigl( 1 - \prod_{u \in \mathrm{alloc}(j)} fp_u \Bigr). \]
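The failure probability formula can be sketched in a few lines of Python. This is only an illustration: the function name `failure_probability`, the list-of-lists encoding of $\mathrm{alloc}$, and the numeric values are assumptions made for the example, not part of the report.

```python
from math import prod


def failure_probability(alloc, fp):
    """FP = 1 - prod_j (1 - prod_{u in alloc(j)} fp_u).

    alloc: one collection of processor indices per interval (assumed encoding).
    fp: failure probability of each processor, indexed by processor.
    """
    success = 1.0
    for procs in alloc:
        # An interval fails only if all of its replicas fail.
        success *= 1.0 - prod(fp[u] for u in procs)
    return 1.0 - success


fp = [0.1, 0.8, 0.8]  # hypothetical failure probabilities
one_interval = failure_probability([[1, 2]], fp)        # single interval, 2 replicas
two_intervals = failure_probability([[0], [1, 2]], fp)  # two intervals
print(one_interval, two_intervals)
```

With a single interval the expression collapses to the product of the failure probabilities of the replicas, which is why adding replicas can only help reliability.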
We assume that $\mathrm{alloc}(0) = \{in\}$ and $\mathrm{alloc}(p+1) = \{out\}$, where $P_{in}$ is a special processor holding the initial data, and $P_{out}$ is receiving the results. Dealing with Fully Homogeneous and Communication Homogeneous platforms, the latency is obtained as
\[ T_{latency} = \sum_{1 \le j \le p} \left\{ k_j \times \frac{\delta_{d_j - 1}}{b} + \frac{\sum_{i=d_j}^{e_j} w_i}{\min_{u \in \mathrm{alloc}(j)} (s_u)} \right\} + \frac{\delta_n}{b}. \tag{1} \]


In equation (1), we consider the longest path required to compute a given data set. The worst case is when the first processors involved in the replication fail during execution. A communication to interval $j$ must then be paid $k_j$ times since these are serialized (one-port model). For computations, we consider the total computation time required by the slowest processor assigned to the interval. For the final output, only one communication is required, hence the $\delta_n / b$. Note that in order to achieve this latency, we need a standard consensus protocol to determine which of the surviving processors performs the outgoing communications [17].
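As a minimal sketch of equation (1), the function below evaluates the latency of an interval mapping on a platform with identical links. The function name, the argument encoding (1-based stage indices with a placeholder `w[0]`), and the numeric instance are assumptions chosen for illustration.

```python
def latency_comm_homogeneous(intervals, alloc, w, delta, s, b):
    """Equation (1) for identical link bandwidth b.

    intervals: list of (d_j, e_j), 1-based and inclusive.
    alloc: processors assigned to each interval (same order as intervals).
    w: w[i] is the work of stage i; w[0] is an unused placeholder.
    delta: delta[i] is the data size output by stage i; delta[0] is the input.
    s: processor speeds, indexed by processor.
    """
    total = 0.0
    for (d, e), procs in zip(intervals, alloc):
        # Incoming data is sent k_j times, serialized by the one-port model.
        total += len(procs) * delta[d - 1] / b
        # The slowest replica determines the computation time.
        total += sum(w[d:e + 1]) / min(s[u] for u in procs)
    # A single outgoing communication of the final result.
    return total + delta[-1] / b


# Hypothetical two-stage instance.
w = [0, 2, 2]
delta = [100, 100, 100]
s = [1, 1]
b = 100
single = latency_comm_homogeneous([(1, 2)], [[0]], w, delta, s, b)
replicated = latency_comm_homogeneous([(1, 2)], [[0, 1]], w, delta, s, b)
split = latency_comm_homogeneous([(1, 1), (2, 2)], [[0], [1]], w, delta, s, b)
print(single, replicated, split)
```

On this instance, replicating the single interval on two processors adds one extra input communication, and splitting into two intervals adds one inter-interval communication.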
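A similar mechanism is used for Fully Heterogeneous platforms:
\[ T_{latency} = \sum_{u \in \mathrm{alloc}(1)} \frac{\delta_0}{b_{in,u}} + \sum_{1 \le j \le p} \max_{u \in \mathrm{alloc}(j)} \left\{ \frac{\sum_{i=d_j}^{e_j} w_i}{s_u} + \sum_{v \in \mathrm{alloc}(j+1)} \frac{\delta_{e_j}}{b_{u,v}} \right\}. \tag{2} \]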

3 Motivating examples

Before presenting complexity results in Section 4, we want to make the reader more sensitive to the difficulty of the problem via some motivating examples.

We start with the mono-criterion interval mapping problem of minimizing the latency. For Fully Homogeneous and Communication Homogeneous platforms the optimal latency is achieved by assigning the whole pipeline to the fastest processor. This is due to the fact that mapping the whole pipeline onto one single processor minimizes the communication cost since all communication links have the same characteristics. Choosing the fastest processor on Communication Homogeneous platforms ensures the shortest processing time.

However, this line of reasoning does not hold anymore when communications become heterogeneous. Let us consider for instance the mapping of the pipeline of Figure 3 on the Fully Heterogeneous platform of Figure 4. The pipeline consists of two stages, both needing the same amount of computation ($w = 2$), and the same amount of communications ($\delta = 100$). In this example, a mapping which minimizes the latency must map each stage on a different processor, thus splitting the stages into two intervals. In fact, if we map the whole pipeline on a single processor, we achieve a latency of $100/100 + (2+2)/1 + 100/1 = 105$, whether we choose $P_1$ or $P_2$ as target processor. Splitting the pipeline and hence mapping the first stage on $P_1$ and the second stage on $P_2$ requires paying for the communication between $P_1$ and $P_2$ but drastically decreases the latency: $100/100 + 2/1 + 100/100 + 2/1 + 100/100 = 1 + 2 + 1 + 2 + 1 = 7$.
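The arithmetic of this example can be replayed with a small sketch of equation (2). The function name and the dictionary encoding are illustrative assumptions. The link bandwidths below are not read from the figure: they are inferred from the latency computations in the text (fast links of bandwidth 100 on the path $P_{in} \to P_1 \to P_2 \to P_{out}$, bandwidth 1 on $P_{in} \to P_2$ and $P_1 \to P_{out}$).

```python
def latency_fully_heterogeneous(intervals, alloc, w, delta, s, bw):
    """Equation (2). bw[(u, v)] is the bandwidth of the link from u to v;
    the names "in" and "out" stand for P_in and P_out (assumed encoding)."""
    # Input data is sent to every replica of the first interval.
    total = sum(delta[0] / bw[("in", u)] for u in alloc[0])
    for j, (d, e) in enumerate(intervals):
        receivers = alloc[j + 1] if j + 1 < len(alloc) else ["out"]
        work = sum(w[d:e + 1])
        # Worst case over the replicas of interval j: compute, then send
        # the result to every replica of the next interval.
        total += max(
            work / s[u] + sum(delta[e] / bw[(u, v)] for v in receivers)
            for u in alloc[j]
        )
    return total


w = [0, 2, 2]                 # w[0] is a placeholder; stages are 1-based
delta = [100, 100, 100]
s = {"P1": 1, "P2": 1}
bw = {                        # bandwidths inferred from the text's arithmetic
    ("in", "P1"): 100, ("in", "P2"): 1,
    ("P1", "P2"): 100,
    ("P1", "out"): 1, ("P2", "out"): 100,
}
on_p1 = latency_fully_heterogeneous([(1, 2)], [["P1"]], w, delta, s, bw)
on_p2 = latency_fully_heterogeneous([(1, 2)], [["P2"]], w, delta, s, bw)
split = latency_fully_heterogeneous([(1, 1), (2, 2)], [["P1"], ["P2"]], w, delta, s, bw)
print(on_p1, on_p2, split)
```

Under these assumed bandwidths, both single-processor mappings pay one slow communication, while the split mapping only uses fast links.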
[Figure 3: Example optimal with 2 intervals. Two stages $S_1$ and $S_2$ with $w_1 = 2$, $w_2 = 2$, and all data sizes equal to 100.]
Unfortunately these intuitions cannot be generalized when tackling bi-criteria optimization, where latency should be minimized respecting a certain failure threshold or the converse. We will prove in Lemma 1 that minimizing the failure probability under a fixed latency threshold on Fully Homogeneous and Communication Homogeneous-Failure Homogeneous platforms still can be done by keeping a single interval.

However, if we consider Communication Homogeneous-Failure Heterogeneous, we can find examples in which this property is not true. Consider for instance the pipeline of Figure 5.

[Figure 4: The pipeline has to be split into intervals to achieve an optimal latency on this platform. Processors $P_{in}$, $P_1$, $P_2$ and $P_{out}$, with $s_1 = 1$ and $s_2 = 1$; link bandwidths are 100 or 1.]
The target platform consists of one processor of speed 1 and failure probability 0.1: it is a slow but reliable processor. On the other hand we have 10 fast and unreliable processors, of speed 100 and failure probability 0.8. All communication links have a bandwidth $b = 1$. If the latency threshold is fixed to 22, the slow processor cannot be used in the replication scheme. Also, if we use three fast processors, the latency is $3 \times 10 + 101/100 > 22$. Thus the best one-interval solution reaches a failure probability of $1 - (1 - 0.8^2) = 0.64$, which is very high. We can do much better by using the slow processor on the first stage, and then replicating the second stage ten times on the fast processors, achieving a latency of $10 + 1/1 + 10 \times 1 + 100/100 = 22$ and a failure probability of $1 - (1 - 0.1) \cdot (1 - 0.8^{10}) < 0.2$. Thus the optimal solution does not consist of a single interval in this case.
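The numbers of this example can be checked with a short sketch combining equation (1) and the failure probability formula. The helper names are illustrative, and the data sizes $\delta_1 = 1$ and $\delta_2 = 0$ are assumptions inferred from the latency arithmetic above, since only $\delta_0 = 10$ is recoverable from the figure.

```python
from math import prod


def latency(intervals, alloc, w, delta, s, b):
    """Equation (1), identical links of bandwidth b (1-based stages)."""
    total = 0.0
    for (d, e), procs in zip(intervals, alloc):
        total += len(procs) * delta[d - 1] / b
        total += sum(w[d:e + 1]) / min(s[u] for u in procs)
    return total + delta[-1] / b


def failure_probability(alloc, fp):
    """FP = 1 - prod_j (1 - prod_{u in alloc(j)} fp_u)."""
    return 1.0 - prod(1.0 - prod(fp[u] for u in procs) for procs in alloc)


# Processor 0 is slow and reliable; processors 1..10 are fast and unreliable.
s = [1] + [100] * 10
fp = [0.1] + [0.8] * 10
w = [0, 1, 100]        # w[0] is a placeholder
delta = [10, 1, 0]     # delta_1 and delta_2 are inferred, see above
b = 1

two_fast = latency([(1, 2)], [[1, 2]], w, delta, s, b)
three_fast = latency([(1, 2)], [[1, 2, 3]], w, delta, s, b)
slow_only = latency([(1, 2)], [[0]], w, delta, s, b)
fast = list(range(1, 11))
split_latency = latency([(1, 1), (2, 2)], [[0], fast], w, delta, s, b)
fp_single = failure_probability([[1, 2]], fp)
fp_split = failure_probability([[0], fast], fp)
print(two_fast, three_fast, slow_only, split_latency, fp_single, fp_split)
```

Under these assumptions, two fast replicas are the most a single interval can afford within the threshold of 22, while the two-interval mapping meets the threshold exactly with a much lower failure probability.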

[Figure 5: Example optimal with 2 intervals. Two stages $S_1$ and $S_2$ with $w_1 = 1$, $w_2 = 100$, and input data of size 10.]

4 Complexity results

In this section, we present the complexity results for both mono-criterion and bi-criteria problems.

4.1 Mono-criterion problems

Theorem 1. Minimizing the failure probability can be done in polynomial time.
Proof. This can be seen easily from the formula computing the global failure probability: the minimum is reached by replicating the whole pipeline as a single interval on all processors. This is true for all platform types.

The problem of minimizing the latency is trivially of polynomial time complexity for Fully Homogeneous and Communication Homogeneous platforms. However the problem becomes harder for Fully Heterogeneous platforms because of the first and last communications, which should be mapped on fast communicating links to optimize the latency. Notice that replication can only increase latency so we do not consider any replication in this mono-criterion problem. However, we need to find the best partition of stages into intervals.


Theorem 2. Minimizing the latency can be done in polynomial time on Communication Homogeneous platforms.
Proof. The latency is optimized when we suppress all communications. Also, replication is increasing latency by adding extra communications. On a Communication Homogeneous platform, the latency is minimized by mapping the whole pipeline as a single interval on the fastest processor.

Theorem 3. Minimizing the latency is NP-hard on Fully Heterogeneous platforms for one-to-one mappings.
Proof. The problem clearly belongs to NP. We use a reduction from the Traveling Salesman Problem (TSP), which is NP-complete [11]. Consider an arbitrary instance $I_1$ of TSP, i.e., a complete graph $G = (V, E, c)$, where $c(e)$ is the cost of edge $e$, a source vertex $s \in V$, a tail vertex $t \in V$, and a bound $K$: is there a Hamiltonian path in $G$ from $s$ to $t$ whose cost is not greater than $K$?

We build the following instance $I_2$ of the one-to-one latency minimization problem: we consider an application with $n = |V|$ identical stages. All application costs are unit costs: $w_i = \delta_i = 1$ for all $i$. For the platform, in addition to $P_{in}$ and $P_{out}$ we use $m = n = |V|$ identical processors of unit speed: $s_i = 1$ for all $i$. We simply write $i$ for the processor $P_i$ that corresponds to vertex $v_i \in V$.

We only play with the link bandwidths: we interconnect $P_{in}$ and $s$, $P_{out}$ and $t$ with links of bandwidth 1. We interconnect $i$ and $j$ with a link of bandwidth $\frac{1}{c(e_{i,j})}$. All the other links are very slow (say their bandwidth is smaller than $\frac{1}{K+n+3}$). We ask whether we can achieve a latency $T_{latency} \le K'$, where $K' = K + n + 2$. Clearly, the size of $I_2$ is linear in the size of $I_1$.

Because we have as many processors as stages, any solution to $I_2$ will use all processors. We need to map the first stage on $s$ and the last one on $t$, otherwise the input/output cost already exceeds $K'$. We spend 2 time-units for input/output, and $n$ time-units for computing (one unit per stage/processor). There remain exactly $K$ time-units for inter-processor communications, i.e., for the total cost of the Hamiltonian path that goes from $s$ to $t$. We cannot use any slow link either. Hence we have a solution for $I_2$ if and only if we have one for $I_1$.

As far as we know, the complexity is still open for interval mappings, although we suspect it might be NP-hard. However, if we relax the interval constraint, i.e., a set of non-consecutive stages can be assigned to a same processor, then the problem becomes polynomial. We call such mappings general mappings.

Theorem 4. Minimizing the latency is polynomial on Fully Heterogeneous platforms for general mappings.
Proof. We consider Fully Heterogeneous platforms and we want to minimize the latency. Let us consider a directed graph with $n \cdot m + 2$ vertices, and $(n-1)m^2 + 2m$ edges, as illustrated in Figure 6. $V_{i,u}$ corresponds to the mapping of stage $S_i$ onto processor $P_u$. $V_{0,in}$ and $V_{(n+1),out}$ represent the initial and final processors, and data must flow from $V_{0,in}$ to $V_{(n+1),out}$. Edges represent the flow of data from one stage to another, thus we have $m^2$ edges for $i = 0..n$, connecting vertex $V_{i,u}$ to $V_{i+1,v}$ for $u, v = 1..m$ (except for the first and last stages where there are only $m$ edges).

[Figure 6: Minimizing the latency. Layered graph from $V_{0,in}$ through $V_{1,1}, \dots, V_{1,m}$, $V_{2,1}, \dots, V_{2,m}$, up to $V_{n,1}, \dots, V_{n,m}$ and $V_{n+1,out}$, with edges $e_{0,in,u}$, $e_{i,u,v}$ and $e_{n,u,out}$.]
Thus, a general mapping can be represented by a path from $V_{0,in}$ to $V_{(n+1),out}$: if $V_{i,u}$ is in the path then stage $S_i$ is mapped onto $P_u$. Notice that a path can create intervals of non-consecutive stages, thus this mapping is not interval-based.

We assign weights to the edges to ensure that the weight of a path is the latency of the corresponding mapping. Computation cost of stage $S_i$ on $P_u$ is added on the $m$ edges exiting $V_{i,u}$, and thus $e_{i,u,v} = \frac{w_i}{s_u}$. Communication costs are added on all edges: $e_{i,u,v} \mathrel{+}= \frac{\delta_i}{b_{u,v}}$ if $P_u \ne P_v$. Edges $e_{i,u,u}$ correspond to intra-interval communications, and thus there is no communication cost to pay.

The mapping which realizes the minimum latency can be obtained by finding a shortest path in this graph going from $V_{0,in}$ to $V_{(n+1),out}$. The graph has polynomial size and the shortest path can be computed in polynomial time [8], thus we have the result in polynomial time, which concludes the proof.
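The construction above can be sketched as follows. Because the graph is layered (stage by stage), the sketch uses a simple dynamic program over the layers rather than a generic shortest-path routine; this is a substitution made for brevity, and it computes the same shortest path. The function name, the 0-based encoding, and the test instance (the two-processor platform of Section 3, with bandwidths inferred from the text) are assumptions.

```python
INF = float("inf")


def min_latency_general_mapping(w, delta, s, b_in, b_out, b):
    """Shortest path in the layered graph of the proof, computed layer by layer.

    w[i]: work of stage i+1 (0-based list); delta[i]: data output by stage i,
    with delta[0] the initial input; s[u]: speed of P_u;
    b_in[u], b_out[u]: bandwidths P_in -> P_u and P_u -> P_out;
    b[u][v]: bandwidth P_u -> P_v, with b[u][u] = INF (no cost on the same processor).
    Returns (latency, processor index chosen for each stage).
    """
    n, m = len(w), len(s)
    # dist[u]: cheapest way to have finished the current stage on P_u.
    dist = [delta[0] / b_in[u] + w[0] / s[u] for u in range(m)]
    back_pointers = []
    for i in range(1, n):
        new_dist, back = [], []
        for v in range(m):
            u_best = min(range(m), key=lambda u: dist[u] + delta[i] / b[u][v])
            new_dist.append(dist[u_best] + delta[i] / b[u_best][v] + w[i] / s[v])
            back.append(u_best)
        dist = new_dist
        back_pointers.append(back)
    last = min(range(m), key=lambda u: dist[u] + delta[n] / b_out[u])
    best = dist[last] + delta[n] / b_out[last]
    # Walk the back pointers to recover the processor of each stage.
    mapping = [last]
    for back in reversed(back_pointers):
        mapping.append(back[mapping[-1]])
    mapping.reverse()
    return best, mapping


result = min_latency_general_mapping(
    w=[2, 2], delta=[100, 100, 100], s=[1, 1],
    b_in=[100, 1], b_out=[1, 100], b=[[INF, 100], [100, INF]],
)
print(result)
```

The dynamic program runs in $O(n m^2)$ time, in line with the polynomial size of the graph.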

4.2 Preliminary Lemma for bi-criteria problems

We start with a preliminary lemma which proves that there is an optimal solution of both bi-criteria problems consisting of a single interval for Fully Homogeneous platforms, and for Communication Homogeneous platforms with identical failure probabilities.

Lemma 1. On Fully Homogeneous and Communication Homogeneous-Failure Homogeneous platforms, there is a mapping of the pipeline as a single interval which minimizes the failure probability under a fixed latency threshold, and there is a mapping of the pipeline as a single interval which minimizes the latency under a fixed failure probability threshold.
Proof. If the stages are split into $p$ intervals, the failure probability is expressed as
\[ 1 - \prod_{1 \le j \le p} \Bigl( 1 - \prod_{u \in \mathrm{alloc}(j)} fp_u \Bigr). \]

Let us start with the Fully Homogeneous case, and with Failure Heterogeneous for a most general setting. We can transform the solution into a new one using a single interval, which improves both latency and failure probability. Let $k_0$ be the number of times that the first interval is replicated in the original solution. Then a solution which replicates the whole interval on the $k_0$ most reliable processors realizes: (i) a latency which is smaller since we remove the communications between intervals; (ii) a smaller failure probability since for the new solution $\bigl(1 - \prod_{u \in \mathrm{alloc}(1)} fp_u\bigr)$ is greater than the same expression in the original solution (the most reliable processors are used in the new one), and moreover the old solution even decreases this value by multiplying it by other terms smaller than 1. Thus the new solution is better for both criteria.


In the case with Communication Homogeneous and Failure Homogeneous, we use a similar reasoning to transform the solution. We select the interval with the fewest processors, and denote this number by $k$. In the failure probability expression, there is a term in $(1 - fp^k)$, and thus the global failure probability is at least $1 - (1 - fp^k)$, which is obtained by replicating the whole interval onto $k$ processors. Since we do not want to increase the latency, we use the fastest $k$ processors, and it is easy to check that this scheme cannot increase latency ($k \le k_0$ and the slowest processor is not slower than the slowest processor of any interval of the initial solution). Thus the new solution is better for both criteria, which ends the proof.

We point out that Lemma 1 cannot be extended to Communication Homogeneous and Failure Heterogeneous: instead, we can build counter-examples in which this property is not true, as illustrated in Section 3.

4.3 Bi-criteria problems on Fully Homogeneous platforms

For Fully Homogeneous platforms, we consider that all failure probabilities are identical, since the platform is made of identical processors. However, results can easily be extended for different failure probabilities. We have seen in Lemma 1 that the optimal solution for a bi-criteria mapping on such platforms always consists in mapping the whole pipeline as a single interval. Otherwise, both latency and failure probability would be increased.

Theorem 5. On Fully Homogeneous platforms, the solution to the bi-criteria problem can be found in polynomial time using Algorithm 1 or Algorithm 2.

Informally, the algorithms find the maximum number of processors $k$ that can be used in the replication set, and the whole interval is mapped on a set of $k$ identical processors. With different failure probabilities, the more reliable processors are used.
begin
    Find $k$ maximum, such that
        $k \times \frac{\delta_0}{b} + \frac{\sum_{1 \le j \le n} w_j}{s} + \frac{\delta_n}{b} \le L$
    Replicate the whole pipeline as a single interval onto the $k$ (most reliable) processors;
end
Algorithm 1: Fully Homogeneous platforms: Minimizing $FP$ for a fixed $L$
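A possible Python rendering of Algorithm 1 is sketched below, in the general setting of heterogeneous failure probabilities. The function name, the argument list and the sample values are assumptions for illustration; the function returns `None` when even a single processor exceeds the latency threshold.

```python
from math import prod


def min_fp_fixed_latency(w, delta_0, delta_n, s, b, fp, L):
    """Algorithm 1 sketch: largest k whose latency fits within L,
    then replicate on the k most reliable processors.

    Returns (chosen processor indices, failure probability) or None.
    """
    fixed = sum(w) / s + delta_n / b   # computation + final output
    k = 0
    while k < len(fp) and (k + 1) * delta_0 / b + fixed <= L:
        k += 1
    if k == 0:
        return None
    # Most reliable processors first (smallest failure probability).
    chosen = sorted(range(len(fp)), key=lambda u: fp[u])[:k]
    # Single interval: FP = 1 - (1 - prod fp_u) = prod fp_u.
    return chosen, prod(fp[u] for u in chosen)


# Hypothetical instance.
result = min_fp_fixed_latency(
    w=[2, 2], delta_0=10, delta_n=10, s=1, b=10,
    fp=[0.5, 0.1, 0.3, 0.2], L=8,
)
print(result)
```

Here the latency budget allows three input communications, so the three most reliable processors are selected.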

begin
    Find $k$ minimum, such that
        $1 - (1 - fp^k) \le FP$
    Replicate the whole pipeline as a single interval onto the $k$ (most reliable) processors;
end
Algorithm 2: Fully Homogeneous platforms: Minimizing $L$ for a fixed $FP$
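Algorithm 2 can be sketched in the same spirit, again allowing heterogeneous failure probabilities. The function name, arguments and sample values are illustrative assumptions; `None` is returned when the threshold cannot be met even with all processors.

```python
def min_latency_fixed_fp(w, delta_0, delta_n, s, b, fp, FP):
    """Algorithm 2 sketch: smallest k such that the k most reliable
    processors reach the failure probability threshold FP.

    Returns (chosen processor indices, latency) or None.
    """
    order = sorted(range(len(fp)), key=lambda u: fp[u])
    product = 1.0
    for k, u in enumerate(order, start=1):
        product *= fp[u]            # FP of a single interval on k replicas
        if product <= FP:
            latency = k * delta_0 / b + sum(w) / s + delta_n / b
            return order[:k], latency
    return None


# Hypothetical instance.
result = min_latency_fixed_fp(
    w=[2, 2], delta_0=10, delta_n=10, s=1, b=10,
    fp=[0.5, 0.1, 0.3, 0.2], FP=0.05,
)
print(result)
```

Adding processors in order of increasing failure probability reaches the threshold with as few replicas as possible, which keeps the number of input communications minimal.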


Proof. The proof of this theorem is based on Lemma 1. We prove it in the general setting of heterogeneous failure probabilities. An optimal solution can be obtained by mapping the pipeline as a single interval, thus we need to decide the set of processors $\mathrm{alloc}$ used for replication. $|\mathrm{alloc}|$ is the number of processors used.

The first problem can be formally expressed as follows:
\[ \text{Minimize } 1 - \Bigl( 1 - \prod_{u \in \mathrm{alloc}} fp_u \Bigr), \]
under the constraint
\[ |\mathrm{alloc}| \times \frac{\delta_0}{b} + \frac{\sum_{1 \le i \le n} w_i}{s} + \frac{\delta_n}{b} \le L. \tag{3} \]
This leads to minimize $\prod_{u \in \mathrm{alloc}} fp_u$, and the constraint on the latency determines the maximum number $k$ of processors which can be used:
\[ k = \left\lfloor \frac{b}{\delta_0} \left( L - \frac{\sum_{1 \le i \le n} w_i}{s} - \frac{\delta_n}{b} \right) \right\rfloor. \]
In order to minimize $\prod_{u \in \mathrm{alloc}} fp_u$, we need to use as many processors as possible since $fp_u \le 1$ for $1 \le u \le m$.

If one of the most reliable processors is not used, we can swap it in for a less reliable one that is used, and thus decrease the value of the product, so the formula is minimized when using the $k$ most reliable processors, which is represented in Algorithm 1.

The second problem is expressed below:
\[ \text{Minimize } |\mathrm{alloc}| \times \frac{\delta_0}{b} + \frac{\sum_{1 \le i \le n} w_i}{s} + \frac{\delta_n}{b}, \]
under the constraint
\[ 1 - \Bigl( 1 - \prod_{u \in \mathrm{alloc}} fp_u \Bigr) \le FP. \tag{4} \]
Latency increases when $|\mathrm{alloc}|$ is large, thus we need to find the smallest number of processors which satisfies constraint (4). As before, if one of the most reliable processors is not used, we can exchange it and improve the reliability without increasing the latency, which might lead to adding fewer processors to the replication set for an identical reliability. Algorithm 2 thus returns the optimal solution.

Remark. Both algorithms (1 and 2) are optimal as well in the case of heterogeneous failure probabilities. We add the most reliable processors to the replication scheme (thus increasing latency and decreasing the failure probability) while $L$ or $FP$ are not reached.

4.4 Bi-criteria problems on Com. Homogeneous platforms

For Communication Homogeneous platforms, we first consider the simpler case where all failure probabilities are identical, denoted by Failure Homogeneous. In this case, the optimal bi-criteria solution still consists of the mapping of the pipeline as a single interval.

Theorem 6. On Communication Homogeneous platforms with Failure Homogeneous, the solution to the bi-criteria problem can be found in polynomial time using Algorithm 3 or 4.

Informally, we add the fastest processors to the replication set while the latency is not exceeded (or until $FP$ is reached), thus reducing the failure probability and increasing the latency.
begin
    Order processors in non-increasing order of $s_j$;
    Find $k$ maximum, such that
        $k \times \frac{\delta_0}{b} + \frac{\sum_{1 \le j \le n} w_j}{s_k} + \frac{\delta_n}{b} \le L$
    Replicate the whole pipeline as a single interval onto the fastest $k$ processors;
    // Note that at any time $s_k$ is the speed of
    // the slowest processor used
    // in the replication scheme.
end
Algorithm 3: Communication Homogeneous platforms - Failure Homogeneous: Minimizing $FP$ for a fixed $L$
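Algorithm 3 can be sketched as below. The function name, arguments and numeric values are assumptions for illustration. The scan stops at the first violation, which is safe because the latency can only grow as $k$ increases (more input communications and a slower slowest processor).

```python
def min_fp_fixed_latency_comm_hom(w, delta_0, delta_n, speeds, b, fp, L):
    """Algorithm 3 sketch: largest k such that the k fastest processors
    respect the latency threshold L; all processors share failure probability fp.

    Returns (chosen processor indices, failure probability) or None.
    """
    order = sorted(range(len(speeds)), key=lambda u: -speeds[u])
    k = 0
    for j, u in enumerate(order, start=1):
        # speeds[u] is the speed of the slowest of the j fastest processors.
        if j * delta_0 / b + sum(w) / speeds[u] + delta_n / b > L:
            break
        k = j
    if k == 0:
        return None
    return order[:k], fp ** k


# Hypothetical instance.
result = min_fp_fixed_latency_comm_hom(
    w=[4, 4], delta_0=2, delta_n=2, speeds=[1, 4, 2, 8], b=2, fp=0.5, L=7,
)
print(result)
```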

begin
    Find $k$ minimum, such that
        $1 - (1 - fp^k) \le FP$
    Replicate the whole pipeline as a single interval onto the fastest $k$ processors;
end
Algorithm 4: Communication Homogeneous platforms - Failure Homogeneous: Minimizing $L$ for a fixed $FP$
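A matching sketch of Algorithm 4 follows. As before, the function name, arguments and sample values are illustrative assumptions, and `None` signals that the threshold cannot be reached with the available processors.

```python
def min_latency_fixed_fp_comm_hom(w, delta_0, delta_n, speeds, b, fp, FP):
    """Algorithm 4 sketch: smallest k with fp**k <= FP, mapped onto
    the k fastest processors.

    Returns (chosen processor indices, latency) or None.
    """
    m = len(speeds)
    k = next((j for j in range(1, m + 1) if fp ** j <= FP), None)
    if k is None:
        return None
    chosen = sorted(range(m), key=lambda u: -speeds[u])[:k]
    slowest = speeds[chosen[-1]]   # slowest processor of the replication set
    latency = k * delta_0 / b + sum(w) / slowest + delta_n / b
    return chosen, latency


# Hypothetical instance.
result = min_latency_fixed_fp_comm_hom(
    w=[4, 4], delta_0=2, delta_n=2, speeds=[1, 4, 2, 8], b=2, fp=0.5, FP=0.2,
)
print(result)
```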

Proof. In this particular setting, Lemma 1 still applies, so we restrict to mappings as a single interval, and search for the optimal set of processors $\mathrm{alloc}$ which should be used.

The first problem is expressed as:
\[ \text{Minimize } 1 - \bigl( 1 - fp^{|\mathrm{alloc}|} \bigr), \]
under the constraint
\[ |\mathrm{alloc}| \times \frac{\delta_0}{b} + \frac{\sum_{1 \le i \le n} w_i}{\min_{u \in \mathrm{alloc}} s_u} + \frac{\delta_n}{b} \le L. \tag{5} \]
The failure probability is smaller when $|\mathrm{alloc}|$ is large, thus we need to add as many processors as we can while satisfying the constraint. The latency increases when adding more processors, and it depends on the speed of the slowest processor. Thus, if the $|\mathrm{alloc}|$ fastest processors are not used, we can exchange a fastest processor with a used one without increasing latency. Algorithm 3 thus returns an optimal mapping.

The other problem is similar, with the following expression:
\[ \text{Minimize } |\mathrm{alloc}| \times \frac{\delta_0}{b} + \frac{\sum_{1 \le i \le n} w_i}{\min_{u \in \mathrm{alloc}} s_u} + \frac{\delta_n}{b}, \]
under the constraint
\[ 1 - \bigl( 1 - fp^{|\mathrm{alloc}|} \bigr) \le FP. \tag{6} \]
We can thus find the smallest number of processors that should be used in order to satisfy $FP$, and then use the fastest processors to optimize latency, which is done by Algorithm 4.

However, the problem is more complex when we consider different failure probabilities (Failure Heterogeneous). It is also more natural since we have different processors and there is no reason why they would have the same failure probability. Unfortunately for Failure Heterogeneous, we can exhibit for some problem instances an optimal solution in which the pipeline stages must be divided in several intervals. The complexity of the problem remains open, but we conjecture it is NP-hard.

4.5 Bi-criteria problems on Fully Heterogeneous platforms

For Fully Heterogeneous platforms, we restrict to heterogeneous failure probabilities, which
is the most natural case. We prove that the bi-criteria problems are NP-hard.
Theorem 7. On Fully Heterogeneous platforms, the bi-criteria (decision problems associated
to the) optimization problems are NP-hard.
Proof. We consider the following decision problem on Fully Heterogeneous platforms: given
a failure probability threshold FP and a latency threshold L, is there a mapping of failure
probability less than FP and of latency less than L? The problem is obviously in NP: given
a mapping, it is easy to check in polynomial time that it is valid by computing its failure
probability and latency.
To establish the completeness, we use a reduction from 2-PARTITION [11]. We consider
an instance $I_1$ of 2-PARTITION: given $m$ positive integers $a_1, a_2, \ldots, a_m$, does there exist a
subset $I \subset \{1, \ldots, m\}$ such that $\sum_{i \in I} a_i = \sum_{i \notin I} a_i$? Let $S = \sum_{i=1}^{m} a_i$.
We build the following instance $I_2$ of our problem: the pipeline is composed of a single
stage with $w = 1$, and the input and output communication costs are $\delta_0 = \delta_1 = 1$. The
platform consists of $m$ processors with speeds $s_j = 1$ and failure probability $\mathit{fp}_j = e^{-a_j}$, for
$1 \le j \le m$ (thus $0 \le \mathit{fp}_j \le 1$). Bandwidths are defined as $b_{in,j} = 1/a_j$ and $b_{j,out} = 1$ for
$1 \le j \le m$.
We ask whether it is possible to realize a latency of $S/2 + 2$ and a failure probability of
$e^{-S/2}$. Clearly, the size of $I_2$ is polynomial (and even linear) in the size of $I_1$. We now show
that instance $I_1$ has a solution if and only if instance $I_2$ does.
Suppose first that $I_1$ has a solution. The solution to $I_2$ which replicates the stage on the
set of processors $I$ has a latency of $S/2 + 2$, since the first communication requires summing
$\delta_0 / b_{in,j}$ over all processors $P_j$ included in the replication scheme, and then both computation and
the final output require a time 1. The failure probability of this solution is
\[ 1 - \Bigl(1 - \prod_{j \in I} \mathit{fp}_j\Bigr) = e^{-\sum_{j \in I} a_j} = e^{-S/2}. \]
Thus we have solved $I_2$.


On the other hand, if $I_2$ has a solution, let $I$ be the set of processors on which the stage
is replicated. Because of the latency constraint,
\[ \sum_{j \in I} \frac{1}{b_{in,j}} + 1 + 1 \le \frac{S}{2} + 2. \]

Since $b_{in,j} = 1/a_j$, this implies that $\sum_{j \in I} a_j \le S/2$. Next we consider the failure probability
constraint. We must have
\[ 1 - \Bigl(1 - \prod_{j \in I} \mathit{fp}_j\Bigr) \le e^{-S/2}, \]
and thus $e^{-\sum_{j \in I} a_j} \le e^{-S/2}$, which forces $\sum_{j \in I} a_j \ge S/2$. Thus $\sum_{j \in I} a_j = S/2$ and we have
a solution to the instance of 2-PARTITION $I_1$, which concludes the proof.
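
The reduction can be checked on small inputs with the following Python sketch. It is an illustration only: the helper names (`build_instance`, `is_feasible`, `solve_by_enumeration`) are hypothetical, and two representation choices are assumptions made here to avoid floating-point rounding, namely exact `Fraction` arithmetic for the bandwidths and storing each failure probability e^{-a_j} through its exponent a_j (so the failure constraint becomes a comparison of sums). The search itself is exponential, as expected for an NP-hard problem.

```python
from fractions import Fraction
from itertools import combinations


def build_instance(a):
    """Build the single-stage instance of the proof from 2-PARTITION integers."""
    total = sum(a)
    return {
        "b_in": [Fraction(1, x) for x in a],      # b_in,j = 1 / a_j
        "neg_log_fp": list(a),                    # fp_j = exp(-a_j), stored as a_j
        "latency_bound": Fraction(total, 2) + 2,  # L = S/2 + 2
        "neg_log_fp_bound": Fraction(total, 2),   # FP = exp(-S/2), stored as S/2
    }


def is_feasible(inst, subset):
    """Check both thresholds when the stage is replicated on `subset`."""
    # One input communication per replica, then computation (1) and output (1).
    latency = sum(1 / inst["b_in"][j] for j in subset) + 1 + 1
    # The product of exp(-a_j) is exp(-sum a_j): compare exponents instead.
    neg_log_failure = sum(inst["neg_log_fp"][j] for j in subset)
    return (latency <= inst["latency_bound"]
            and neg_log_failure >= inst["neg_log_fp_bound"])


def solve_by_enumeration(a):
    """Return a feasible set of 0-based processor indices, or None."""
    inst = build_instance(a)
    for size in range(1, len(a) + 1):
        for subset in combinations(range(len(a)), size):
            if is_feasible(inst, subset):
                return subset
    return None
```

Any subset returned by `solve_by_enumeration` sums to exactly half of the total, which mirrors the two inequalities of the proof closing into an equality.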

Related work and conclusion

In this paper, we have assessed the complexity of trading between response time and reliability,
which are among the most important criteria for a typical user. Indeed, in the context of
large scale distributed platforms such as clusters or grids, failure probability becomes a major
concern [10, 12, 9], and the bi-criteria approach tackled in this paper makes it possible to provide robust
solutions while fulfilling user demands (minimizing latency under some reliability threshold,
or the converse). We have shown that the more heterogeneous the target platform, the
more difficult the problems. In particular, the bi-criteria optimization problem is polynomial
for Fully Homogeneous, NP-hard for Fully Heterogeneous and remains an open problem for
Communication Homogeneous with heterogeneous failure probabilities.
An example of a real world application consisting of a pipeline workflow can be found
in [3]. In that work, we study the interval mapping of the JPEG encoder pipeline on a cluster
of workstations.
Several other bi-criteria optimization problems have been considered in the literature. For
instance, optimizing both latency and throughput is quite natural, as these objectives represent
trade-offs between user expectations and the whole system performance. See [16, 5, 4] for
pipeline graphs and [18] for general application DAGs. In the context of embedded systems,
energy consumption is another important objective to minimize. Three-criteria optimization
(energy, latency and throughput) is discussed in [19].
For large scale distributed platforms such as production grids, throughput is a very important
criterion as it measures the aggregate rate of processing of data, hence the global rate
at which execution progresses. We can envision two types of replication: the first type is to
replicate the same computation on different processors, as in this paper, to increase reliability.
The second type is to allocate the processing of different data sets to different processors (say
in a round-robin fashion), in order to increase the throughput. Both replication types can be
conducted simultaneously, at the price of more resource consumption. Our future work will
be devoted to the study of the interplay between throughput, latency and reliability, a very
challenging algorithmic problem.

References
[1] J. Abawajy. Fault-tolerant scheduling policy for grid computing systems. In International Parallel and Distributed Processing Symposium IPDPS'2004. IEEE Computer Society Press, 2004.
[2] S. Albers and G. Schmidt. Scheduling with unexpected machine breakdowns. Discrete Applied Mathematics, 110(2-3):85–99, 2001.


[3] A. Benoit, H. Kosch, V. Rehn-Sonigo, and Y. Robert. Bi-criteria Pipeline Mappings for Parallel Image Processing. Research Report 2008-02, LIP, ENS Lyon, France, Jan. 2008. Available at graal.ens-lyon.fr/~vsonigo/.
[4] A. Benoit, V. Rehn-Sonigo, and Y. Robert. Multi-criteria scheduling of pipeline workflows. In HeteroPar'2007: International Conference on Heterogeneous Computing, jointly published with Cluster'2007. IEEE Computer Society Press, 2007.
[5] A. Benoit and Y. Robert. Complexity results for throughput and latency optimization of replicated and data-parallel workflows. In HeteroPar'2007: International Conference on Heterogeneous Computing, jointly published with Cluster'2007. IEEE Computer Society Press, 2007.
[6] P. Bhat, C. Raghavendra, and V. Prasanna. Efficient collective communication in distributed heterogeneous systems. In ICDCS'99 19th International Conference on Distributed Computing Systems, pages 15–24. IEEE Computer Society Press, 1999.
[7] P. Bhat, C. Raghavendra, and V. Prasanna. Efficient collective communication in distributed heterogeneous systems. Journal of Parallel and Distributed Computing, 63:251–263, 2003.
[8] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Press, 1990.
[9] A. Duarte, D. Rexachs, and E. Luque. A distributed scheme for fault-tolerance in large clusters of workstations. In NIC Series, Vol. 33, pages 473–480. John von Neumann Institute for Computing, Jülich, 2006.
[10] A. H. Frey and G. Fox. Problems and approaches for a teraflop processor. In Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, pages 21–25. ACM Press, 1988.
[11] M. R. Garey and D. S. Johnson. Computers and Intractability, a Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.
[12] A. Geist and C. Engelmann. Development of naturally fault tolerant algorithms for computing on 100,000 processors. http://www.csm.ornl.gov/~geist/Lyon2002-geist.pdf, 2002.
[13] T. Saif and M. Parashar. Understanding the behavior and performance of non-blocking communications in MPI. In Proceedings of Euro-Par 2004: Parallel Processing, LNCS 3149, pages 173–182. Springer, 2004.
[14] B. A. Shirazi, A. R. Hurson, and K. M. Kavi. Scheduling and load balancing in parallel and distributed systems. IEEE Computer Science Press, 1995.
[15] J. Subhlok and G. Vondran. Optimal mapping of sequences of data parallel tasks. In Proc. 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'95, pages 134–143. ACM Press, 1995.


[16] J. Subhlok and G. Vondran. Optimal latency-throughput tradeoffs for data parallel pipelines. In ACM Symposium on Parallel Algorithms and Architectures SPAA'96, pages 62–71. ACM Press, 1996.
[17] G. Tel. Introduction to Distributed Algorithms. Cambridge University Press, 2000.
[18] N. Vydyanathan, U. Catalyurek, T. Kurc, P. Saddayappan, and J. Saltz. An approach for optimizing latency under throughput constraints for application workflows on clusters. Research Report OSU-CISRC-1/07-TR03, Ohio State University, Columbus, OH, Jan. 2007. Available at ftp://ftp.cse.ohio-state.edu/pub/tech-report/2007.
[19] R. Xu, R. Melhem, and D. Mossé. Energy-aware scheduling for streaming applications on chip multiprocessors. In the 28th IEEE Real-Time System Symposium (RTSS'07), Tucson, Arizona, December 2007.


Unité de recherche INRIA Rhône-Alpes
655, avenue de l'Europe - 38334 Montbonnot Saint-Ismier (France)
Unité de recherche INRIA Futurs : Parc Club Orsay Université - ZAC des Vignes
4, rue Jacques Monod - 91893 ORSAY Cedex (France)
Unité de recherche INRIA Lorraine : LORIA, Technopôle de Nancy-Brabois - Campus scientifique
615, rue du Jardin Botanique - BP 101 - 54602 Villers-lès-Nancy Cedex (France)
Unité de recherche INRIA Rennes : IRISA, Campus universitaire de Beaulieu - 35042 Rennes Cedex (France)
Unité de recherche INRIA Rocquencourt : Domaine de Voluceau - Rocquencourt - BP 105 - 78153 Le Chesnay Cedex (France)
Unité de recherche INRIA Sophia Antipolis : 2004, route des Lucioles - BP 93 - 06902 Sophia Antipolis Cedex (France)

Éditeur
INRIA - Domaine de Voluceau - Rocquencourt, BP 105 - 78153 Le Chesnay Cedex (France)

http://www.inria.fr
ISSN 0249-6399
