INRIA Research Report No. 6345, March 2008
ISRN INRIA/RR--6345--FR+ENG
Thème NUM
ISSN 0249-6399
This text is also available as a research report of the Laboratoire de l'Informatique du Parallélisme, http://www.ens-lyon.fr/LIP.
Introduction
2
2.1
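[Figure: the application pipeline, with stages $S_1, S_2, \ldots, S_k, \ldots, S_n$ of computation costs $w_1, w_2, \ldots, w_k, \ldots, w_n$.]
[Figure: the target platform, with processors $P_u$ and $P_v$ of speeds $s_u$ and $s_v$, the input and output processors $P_{in}$ and $P_{out}$ of speeds $s_{in}$ and $s_{out}$, and links of bandwidths $b_{in,u}$, $b_{u,v}$ and $b_{v,out}$.]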
Communication Homogeneous platforms, with identical links but different-speed processors, introduce a first degree of heterogeneity;
m) and
Finally, we assume that two special additional processors $P_{in}$ and $P_{out}$ are devoted to input/output data. Initially, the input data for each task resides on $P_{in}$, while all results must be returned to and stored in $P_{out}$.
2.2
In equation (1), we consider the longest path required to compute a given data set. The worst case occurs when the first processors involved in the replication fail during execution. A communication to interval $j$ must then be paid $k_j$ times, since these communications are serialized (one-port model). For computations, we consider the total computation time required by the slowest processor assigned to the interval. For the final output, only one communication is required, hence the term $\delta_n / b$. Note that in order to achieve this latency, we need a standard consensus protocol to determine which of the surviving processors performs the outgoing communications [17].
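A similar mechanism is used for Fully Heterogeneous platforms:
$$T_{latency} = \sum_{u \in alloc(1)} \frac{\delta_0}{b_{in,u}} + \sum_{1 \le j \le p} \max_{u \in alloc(j)} \left\{ \frac{\sum_{i=d_j}^{e_j} w_i}{s_u} + \sum_{v \in alloc(j+1)} \frac{\delta_{e_j}}{b_{u,v}} \right\} \qquad (2)$$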
Before presenting complexity results in Section 4, we want to make the reader more sensitive to the difficulty of the problem via some motivating examples.
We start with the mono-criterion interval mapping problem of minimizing the latency. For Fully Homogeneous and Communication Homogeneous platforms, the optimal latency is achieved by assigning the whole pipeline to the fastest processor. This is because mapping the whole pipeline onto a single processor minimizes the communication cost, since all communication links have the same characteristics. Choosing the fastest processor on Communication Homogeneous platforms ensures the shortest processing time.
However, this line of reasoning no longer holds when communications become heterogeneous. Let us consider for instance the mapping of the pipeline of Figure 3 on the Fully Heterogeneous platform of Figure 4. The pipeline consists of two stages, both needing the same amount of computation ($w = 2$) and the same amount of communication ($\delta = 100$). In this example, a mapping which minimizes the latency must map each stage on a different processor, thus splitting the stages into two intervals. In fact, if we map the whole pipeline on a single processor, we achieve a latency of $100/100 + (2+2)/1 + 100/1 = 105$, whether we choose $P_1$ or $P_2$ as the target processor. Splitting the pipeline, and hence mapping the first stage on $P_1$ and the second stage on $P_2$, requires paying for the communication between $P_1$ and $P_2$ but drastically decreases the latency: $100/100 + 2/1 + 100/100 + 2/1 + 100/100 = 1 + 2 + 1 + 2 + 1 = 7$.
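The two latencies above can be checked with a short Python sketch of equation (2). The function name, the dictionary-based platform description, and the convention that the last interval sends its result to $P_{out}$ are illustrative assumptions. So are the bandwidth values $b_{in,2} = 1$ and $b_{1,out} = 1$, which are inferred from the arithmetic in the text rather than read from Figure 4.

```python
def latency_fully_heterogeneous(w, delta, intervals, alloc, s, b_in, b, b_out):
    """Evaluate equation (2).

    Stages are numbered 1..n: w[i-1] is the cost of stage i and delta[i]
    the size of its output (delta[0] is the initial input).
    intervals[j] = (d, e), inclusive; alloc[j] lists the processors that
    replicate interval j. Assumption: the last interval sends to P_out.
    """
    # Serialized input communications to every replica of the first interval.
    total = sum(delta[0] / b_in[u] for u in alloc[0])
    p = len(intervals)
    for j, (d, e) in enumerate(intervals):
        work = sum(w[d - 1:e])
        costs = []
        for u in alloc[j]:
            if j + 1 < p:
                out = sum(delta[e] / b[(u, v)] for v in alloc[j + 1])
            else:
                out = delta[e] / b_out[u]
            costs.append(work / s[u] + out)
        total += max(costs)  # worst case over the replicas of interval j
    return total


# Example of Figures 3 and 4 (the two slow-link bandwidths are assumptions).
w, delta = [2, 2], [100, 100, 100]
s = {1: 1, 2: 1}
b_in, b_out = {1: 100, 2: 1}, {1: 1, 2: 100}
b = {(1, 2): 100, (2, 1): 100}

lat_p1 = latency_fully_heterogeneous(w, delta, [(1, 2)], [[1]], s, b_in, b, b_out)
lat_p2 = latency_fully_heterogeneous(w, delta, [(1, 2)], [[2]], s, b_in, b, b_out)
lat_split = latency_fully_heterogeneous(
    w, delta, [(1, 1), (2, 2)], [[1], [2]], s, b_in, b, b_out)
print(lat_p1, lat_p2, lat_split)
```

Under these assumptions, both single-processor mappings cost 105 and the split mapping costs 7, matching the figures quoted in the text.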
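[Figure 3: the example pipeline, with two stages $S_1$ and $S_2$ ($w_1 = w_2 = 2$) and data sizes of 100.]
[Figure 4 diagram: processors $P_1$ and $P_2$ ($s_1 = s_2 = 1$), connected to each other and to $P_{in}$ and $P_{out}$ by links of bandwidth 100 or 1.]
Figure 4: The pipeline has to be split into intervals to achieve an optimal latency on this platform.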
The target platform consists of one processor of speed 1 and failure probability 0.1: it is a slow but reliable processor. On the other hand, we have 10 fast and unreliable processors, of speed 100 and failure probability 0.8. All communication links have a bandwidth $b = 1$. If the latency threshold is fixed to 22, the slow processor cannot be used in the replication scheme. Also, if we use three fast processors, the latency is $3 \times 10 + 101/100 > 22$. Thus the best one-interval solution uses two fast processors and reaches a failure probability of $1 - (1 - 0.8^2) = 0.64$, which is very high. We can do much better by using the slow processor for the first stage, and then replicating the second stage ten times on the fast processors, achieving a latency of $10 + 1/1 + 10 \times 1 + 100/100 = 22$ and a failure probability of $1 - (1 - 0.1)(1 - 0.8^{10}) < 0.2$. Thus the optimal solution does not consist of a single interval in this case.
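The numbers in this example can be reproduced with a small Python sketch. It assumes the Communication Homogeneous latency described for equation (1): each interval pays its incoming communication once per replica, its computation on its slowest replica, plus one final output. It also assumes the global failure probability $1 - \prod_j (1 - \prod_{u \in alloc(j)} fp_u)$. The data sizes $\delta_0 = 10$, $\delta_1 = 1$, $\delta_2 = 0$ are inferred from the arithmetic in the text, and all identifiers are illustrative.

```python
from math import prod


def latency_comm_homogeneous(w, delta, intervals, alloc, s, b):
    """Latency with identical links of bandwidth b (sketch of equation (1))."""
    total = 0.0
    for (d, e), procs in zip(intervals, alloc):
        total += len(procs) * delta[d - 1] / b               # serialized inputs
        total += sum(w[d - 1:e]) / min(s[u] for u in procs)  # slowest replica
    return total + delta[-1] / b                             # single final output


def failure_probability(alloc, fp):
    """The mapping fails if all replicas of at least one interval fail."""
    return 1 - prod(1 - prod(fp[u] for u in procs) for procs in alloc)


fast = [f"fast{i}" for i in range(10)]
s = {"slow": 1, **{u: 100 for u in fast}}
fp = {"slow": 0.1, **{u: 0.8 for u in fast}}
w, delta, b = [1, 100], [10, 1, 0], 1   # data sizes are assumptions

lat_two = latency_comm_homogeneous(w, delta, [(1, 2)], [fast[:2]], s, b)
lat_three = latency_comm_homogeneous(w, delta, [(1, 2)], [fast[:3]], s, b)
fp_two = failure_probability([fast[:2]], fp)

split = [["slow"], fast]
lat_split = latency_comm_homogeneous(w, delta, [(1, 1), (2, 2)], split, s, b)
fp_split = failure_probability(split, fp)
print(lat_two, fp_two, lat_three, lat_split, fp_split)
```

With these inputs, two fast processors fit under the threshold of 22 with a failure probability of about 0.64, three do not fit, and the two-interval mapping reaches a latency of exactly 22 with a failure probability just below 0.2.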
[Figure: the example pipeline, with two stages $S_1$ ($w_1 = 1$) and $S_2$ ($w_2 = 100$) and an input of size 10.]
Complexity results
In this section, we present the complexity results for both mono-criterion and bi-criteria problems.
4.1
Mono-criterion problems
Theorem 1. Minimizing the failure probability can be done in polynomial time.
Proof. This can be seen easily from the formula computing the global failure probability: the minimum is reached by replicating the whole pipeline as a single interval on all processors. This is true for all platform types.
The problem of minimizing the latency is trivially of polynomial time complexity for Fully Homogeneous and Communication Homogeneous platforms. However, the problem becomes harder for Fully Heterogeneous platforms because of the first and last communications, which should be mapped onto fast communication links to optimize the latency. Notice that replication can only increase the latency, so we do not consider any replication in this mono-criterion problem. However, we need to find the best partition of stages into intervals.
Theorem 2. Minimizing the latency can be done in polynomial time on Communication Homogeneous platforms.
Proof. The latency is optimized when we suppress all communications. Also, replication increases the latency by adding extra communications. On a Communication Homogeneous platform, the latency is minimized by mapping the whole pipeline as a single interval on the fastest processor.
Theorem 3. Minimizing the latency is NP-hard on Fully Heterogeneous platforms for one-to-one mappings.
Proof. The problem clearly belongs to NP. We use a reduction from the Traveling Salesman Problem (TSP), which is NP-complete [11]. Consider an arbitrary instance $I_1$ of TSP, i.e., a complete graph $G = (V, E, c)$, where $c(e)$ is the cost of edge $e$, a source vertex $s \in V$, a tail vertex $t \in V$, and a bound $K$: is there a Hamiltonian path in $G$ from $s$ to $t$ whose cost is not greater than $K$?
We build the following instance $I_2$ of the one-to-one latency minimization problem: we consider an application with $n = |V|$ identical stages. All application costs are unit costs: $w_i = \delta_i = 1$ for all $i$. For the platform, in addition to $P_{in}$ and $P_{out}$ we use $m = n = |V|$ identical processors of unit speed: $s_i = 1$ for all $i$. We simply write $i$ for the processor $P_i$ that corresponds to vertex $v_i \in V$.
We only play with the link bandwidths: we interconnect $P_{in}$ and $s$, and $P_{out}$ and $t$, with links of bandwidth 1. We interconnect $i$ and $j$ with a link of bandwidth $1/c(e_{i,j})$. All the other links are very slow (say their bandwidth is smaller than $1/(K+n+3)$). We ask whether we can achieve a latency $T_{latency} \le K'$, where $K' = K + n + 2$. Clearly, the size of $I_2$ is linear in the size of $I_1$.
Because we have as many processors as stages, any solution to $I_2$ will use all processors. We need to map the first stage on $s$ and the last one on $t$, otherwise the input/output cost already exceeds $K'$. We spend 2 time-units for input/output, and $n$ time-units for computing (one unit per stage/processor). There remain exactly $K$ time-units for inter-processor communications, i.e., for the total cost of the Hamiltonian path that goes from $s$ to $t$. We cannot use any slow link either. Hence we have a solution for $I_2$ if and only if we have one for $I_1$.
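As an illustration of this construction, the following Python sketch builds the bandwidths of $I_2$ from a small cost matrix and evaluates the latency of a one-to-one mapping. The function names and the dictionary layout are hypothetical, edge costs are assumed to be positive integers, and the slow bandwidth $1/(K+n+4)$ is one arbitrary value below $1/(K+n+3)$. Exact fractions are used so that latencies can be compared with the bound without rounding issues.

```python
from fractions import Fraction


def build_latency_instance(cost, src, dst, K):
    """Build instance I2 from a symmetric cost matrix (positive integers)."""
    n = len(cost)
    slow = Fraction(1, K + n + 4)  # assumption: any value below 1/(K+n+3)
    return {
        "n": n,
        "b_in": [Fraction(1) if u == src else slow for u in range(n)],
        "b_out": [Fraction(1) if u == dst else slow for u in range(n)],
        "b": [[Fraction(1, cost[u][v]) if u != v else None for v in range(n)]
              for u in range(n)],
        "bound": K + n + 2,  # K' in the proof
    }


def one_to_one_latency(inst, order):
    """Latency when stage i runs on processor order[i] (unit costs and speeds)."""
    total = 1 / inst["b_in"][order[0]]      # input, delta_0 = 1
    total += inst["n"]                      # one time unit per stage
    for u, v in zip(order, order[1:]):
        total += 1 / inst["b"][u][v]        # equals c(e_{u,v})
    return total + 1 / inst["b_out"][order[-1]]


cost = [[0, 1, 4, 9],
        [1, 0, 2, 5],
        [4, 2, 0, 3],
        [9, 5, 3, 0]]
inst = build_latency_instance(cost, src=0, dst=3, K=6)
good = one_to_one_latency(inst, [0, 1, 2, 3])   # Hamiltonian path of cost 6
bad = one_to_one_latency(inst, [1, 0, 2, 3])    # first stage not mapped on s
print(good, bad, inst["bound"])
```

The path 0, 1, 2, 3 has cost 6, so its mapping meets the bound $K' = 12$ exactly, while a mapping that does not start on $s$ already pays more than $K'$ for the input communication alone.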
As far as we know, the complexity is still open for interval mappings, although we suspect it might be NP-hard. However, if we relax the interval constraint, i.e., if a set of non-consecutive stages can be assigned to the same processor, then the problem becomes polynomial. We call such mappings general mappings.
Theorem 4. Minimizing the latency is polynomial on Fully Heterogeneous platforms for general mappings.
Proof. We consider Fully Heterogeneous platforms and we want to minimize the latency. Let us consider a directed graph with $n \cdot m + 2$ vertices and $(n-1)m^2 + 2m$ edges, as illustrated in Figure 6. $V_{i,u}$ corresponds to the mapping of stage $S_i$ onto processor $P_u$. $V_{0,in}$ and $V_{(n+1),out}$ represent the initial and final processors, and data must flow from $V_{0,in}$ to $V_{(n+1),out}$. Edges represent the flow of data from one stage to another, thus we have $m^2$ edges for $i = 0..n$, connecting vertex $V_{i,u}$ to $V_{i+1,v}$ for $u, v = 1..m$ (except for the first and last stages, where there are only $m$ edges).
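A shortest path in such a layered graph can be computed layer by layer. The Python sketch below is one possible reading of the construction. The edge weights are an assumption, since they are not given in the text above: the edge from $V_{0,in}$ to $V_{1,u}$ costs $\delta_0 / b_{in,u}$, the edge from $V_{i,u}$ to $V_{i+1,v}$ costs $w_i / s_u$ plus $\delta_i / b_{u,v}$ when $u \ne v$, and the edge from $V_{n,u}$ to $V_{(n+1),out}$ costs $w_n / s_u + \delta_n / b_{u,out}$. The function name and argument layout are illustrative.

```python
def min_latency_general_mapping(w, delta, s, b_in, b, b_out):
    """Shortest path from V_{0,in} to V_{n+1,out}, computed layer by layer.

    w[i-1] is the cost of stage i, delta[i] the size of its output,
    s[u] the speed of processor u, b[u][v] the bandwidth between u and v.
    Edge weights are an assumption stated in the accompanying text.
    """
    n, m = len(w), len(s)
    # best[u]: cheapest cost to have the input of the current stage on u.
    best = [delta[0] / b_in[u] for u in range(m)]
    for i in range(1, n):  # stage i runs on u, stage i+1 runs on v
        best = [
            min(best[u] + w[i - 1] / s[u]
                + (0 if u == v else delta[i] / b[u][v])
                for u in range(m))
            for v in range(m)
        ]
    # Last stage computes on u, then sends its result to P_out.
    return min(best[u] + w[n - 1] / s[u] + delta[n] / b_out[u] for u in range(m))


# Platform of Figure 4, with the two slow-link bandwidths assumed equal to 1.
result = min_latency_general_mapping(
    w=[2, 2], delta=[100, 100, 100], s=[1, 1],
    b_in=[100, 1], b=[[None, 100], [100, None]], b_out=[1, 100])
print(result)
```

Each layer is relaxed in $O(m^2)$ operations, for $O(n m^2)$ overall, in line with the edge count above. On the Figure 4 example, the sketch returns 7, the latency of the split mapping.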
[Figure 6: the layered graph used in the proof, with vertices $V_{0,in}$, $V_{i,u}$ ($1 \le i \le n$, $1 \le u \le m$) and $V_{n+1,out}$, and edges $e_{0,in,u}$, $e_{i,u,v}$ and $e_{n,u,out}$.]
4.2
We start with a preliminary lemma which proves that there is an optimal solution of both bi-criteria problems consisting of a single interval for Fully Homogeneous platforms, and for Communication Homogeneous platforms with identical failure probabilities.
Lemma 1. On Fully Homogeneous and Communication Homogeneous-Failure Homogeneous platforms, there is a mapping of the pipeline as a single interval which minimizes the failure probability under a fixed latency threshold, and there is a mapping of the pipeline as a single interval which minimizes the latency under a fixed failure probability threshold.
Proof. If the stages are split into $p$ intervals, the failure probability is expressed as
$$1 - \prod_{1 \le j \le p} \Bigl(1 - \prod_{u \in alloc(j)} fp_u\Bigr).$$
Let us start with the Fully Homogeneous case, and with Failure Heterogeneous for the most general setting. We can transform the solution into a new one using a single interval, which improves both latency and failure probability. Let $k_0$ be the number of times that the first interval is replicated in the original solution. Then a solution which replicates the whole interval on the $k_0$ most reliable processors realizes: (i) a latency which is smaller, since we remove the communications between intervals; (ii) a smaller failure probability, since for the new solution $(1 - \prod_{u \in alloc(1)} fp_u)$ is greater than the same expression in the original solution (the most reliable processors are used in the new one), and moreover the old solution even decreases this value by multiplying it by other terms smaller than 1. Thus the new solution is better for both criteria.
In the case with Communication Homogeneous and Failure Homogeneous, we use a similar reasoning to transform the solution. We select the interval with the fewest processors, and denote this number by $k$. In the failure probability expression, there is a term in $(1 - fp^k)$, and thus the global failure probability is greater than $1 - (1 - fp^k)$, which is obtained by replicating the whole interval onto $k$ processors. Since we do not want to increase the latency, we use the fastest $k$ processors, and it is easy to check that this scheme cannot increase the latency ($k \le k_0$ and the slowest processor is not slower than the slowest processor of any interval of the initial solution). Thus the new solution is better for both criteria, which ends the proof.
We point out that Lemma 1 cannot be extended to Communication Homogeneous and Failure Heterogeneous: instead, we can build counterexamples in which this property is not true, as illustrated in Section 3.
4.3
For Fully Homogeneous platforms, we consider that all failure probabilities are identical, since the platform is made of identical processors. However, results can easily be extended to different failure probabilities. We have seen in Lemma 1 that the optimal solution for a bi-criteria mapping on such platforms always consists in mapping the whole pipeline as a single interval. Otherwise, both latency and failure probability would be increased.
Theorem 5. On Fully Homogeneous platforms, the solution to the bi-criteria problem can be found in polynomial time using Algorithm 1 or Algorithm 2.
Informally, the algorithms find the maximum number of processors $k$ that can be used in the replication set, and the whole interval is mapped on a set of $k$ identical processors. With different failure probabilities, the more reliable processors are used.
begin
Find $k$ maximum, such that
$$k \times \frac{\delta_0}{b} + \frac{\sum_{1 \le j \le n} w_j}{s} + \frac{\delta_n}{b} \le L$$
begin
Find $k$ minimum, such that
$$1 - (1 - fp^k) \le FP$$
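$$\text{minimize } |alloc| \times \frac{\delta_0}{b} + \frac{\sum_{1 \le i \le n} w_i}{s} + \frac{\delta_n}{b} \quad \text{subject to} \quad 1 - \Bigl(1 - \prod_{u \in alloc} fp_u\Bigr) \le FP \qquad (4)$$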
Latency increases when $|alloc|$ is large, thus we need to find the smallest number of processors which satisfies constraint (4). As before, if one of the most reliable processors is not used, we can exchange it and improve the reliability without increasing the latency, which might lead to adding fewer processors to the replication set for an identical reliability. Algorithm 2 thus returns the optimal solution.
Remark. Both algorithms (1 and 2) are optimal as well in the case of heterogeneous failure probabilities. We add the most reliable processors to the replication scheme (thus increasing latency and decreasing the failure probability) while $L$ or $FP$ are not reached.
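The two searches for $k$ can be sketched in Python as follows, for identical processors with a common failure probability. The sketch assumes that the latency of a single interval replicated $k$ times is $k \, \delta_0 / b + \sum_j w_j / s + \delta_n / b$, and it uses the identity $1 - (1 - fp^k) = fp^k$. The function names and the parameter `m` (number of available processors) are illustrative.

```python
def max_replication_under_latency(w, delta, s, b, m, L):
    """Largest k <= m whose single-interval latency stays within L (0 if none)."""
    fixed = sum(w) / s + delta[-1] / b  # computation plus final output
    k = 0
    # Each extra replica adds one serialized input communication delta_0 / b.
    while k < m and (k + 1) * delta[0] / b + fixed <= L:
        k += 1
    return k


def min_replication_under_failure(fp, m, FP):
    """Smallest k <= m with 1 - (1 - fp**k) = fp**k <= FP (None if none)."""
    for k in range(1, m + 1):
        if fp ** k <= FP:
            return k
    return None


k_max = max_replication_under_latency(
    w=[4, 6], delta=[2, 1, 3], s=2, b=1, m=5, L=14)
k_min = min_replication_under_failure(fp=0.5, m=5, FP=0.1)
print(k_max, k_min)
```

In this example the fixed part of the latency is $10/2 + 3 = 8$, so three replicas fit under $L = 14$, and four replicas are needed to bring $0.5^k$ below $0.1$.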
4.4
For Communication Homogeneous platforms, we first consider the simpler case where all failure probabilities are identical, denoted by Failure Homogeneous. In this case, the optimal bi-criteria solution still consists of the mapping of the pipeline as a single interval.
Theorem 6. On Communication Homogeneous platforms with Failure Homogeneous, the solution to the bi-criteria problem can be found in polynomial time using Algorithm 3 or 4.
Informally, we add the fastest processors to the replication set while the latency is not exceeded (or until $FP$ is reached), thus reducing the failure probability and increasing the latency.
begin
Order processors in non-increasing order of $s_j$;
Find $k$ maximum, such that
$$k \times \frac{\delta_0}{b} + \frac{\sum_{1 \le j \le n} w_j}{s_k} + \frac{\delta_n}{b} \le L$$
begin
Find $k$ minimum, such that
$$1 - (1 - fp^k) \le FP$$
(5)
The failure probability is smaller when $|alloc|$ is large, thus we need to add as many processors as we can while satisfying the constraint. The latency increases when adding more processors, and it depends on the speed of the slowest processor. Thus, if the $|alloc|$ fastest processors are not used, we can exchange one of the fastest processors with a used one without increasing the latency. Algorithm 3 thus returns an optimal mapping.
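The greedy idea behind Algorithm 3 can be sketched in Python as follows: sort the processors by non-increasing speed and keep the longest prefix whose latency, evaluated with the speed $s_k$ of the slowest selected processor, stays within $L$. The function name and the return value (a list of processor indices) are illustrative assumptions.

```python
def fastest_prefix_under_latency(w, delta, speeds, b, L):
    """Return the indices of the k fastest processors, for the largest k with
    k*delta_0/b + sum(w)/s_k + delta_n/b <= L (empty list if no k fits)."""
    order = sorted(range(len(speeds)), key=lambda u: -speeds[u])
    chosen = []
    for k in range(1, len(order) + 1):
        s_k = speeds[order[k - 1]]  # slowest processor among the k fastest
        if k * delta[0] / b + sum(w) / s_k + delta[-1] / b <= L:
            chosen = order[:k]
        else:
            break  # latency is non-decreasing in k, so no larger k can fit
    return chosen


chosen = fastest_prefix_under_latency(
    w=[4, 4], delta=[1, 1, 1], speeds=[1, 4, 2, 4], b=1, L=8)
failure = 0.5 ** len(chosen)  # 1 - (1 - fp^k) with fp = 0.5
print(chosen, failure)
```

Stopping at the first $k$ that violates the threshold is safe because adding a processor both adds a communication and can only lower $s_k$. In the example, the three fastest processors give a latency of exactly 8, and adding the fourth would raise it to 13.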
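The other problem is similar, with the following expression:
$$\text{minimize } |alloc| \times \frac{\delta_0}{b} + \frac{\sum_{1 \le i \le n} w_i}{\min_{u \in alloc} s_u} + \frac{\delta_n}{b} \quad \text{subject to} \quad 1 - (1 - fp^{|alloc|}) \le FP \qquad (6)$$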
However, the problem is more complex when we consider different failure probabilities (Failure Heterogeneous). It is also more natural, since we have different processors and there is no reason why they would have the same failure probability. Unfortunately, for Failure Heterogeneous, we can exhibit for some problem instances an optimal solution in which the pipeline stages must be divided into several intervals. The complexity of the problem remains open, but we conjecture it is NP-hard.
4.5
For Fully Heterogeneous platforms, we restrict to heterogeneous failure probabilities, which is the most natural case. We prove that the bi-criteria problems are NP-hard.
Theorem 7. On Fully Heterogeneous platforms, the bi-criteria (decision problems associated to the) optimization problems are NP-hard.
Proof. We consider the following decision problem on Fully Heterogeneous platforms: given a failure probability threshold $FP$ and a latency threshold $L$, is there a mapping of failure probability less than $FP$ and of latency less than $L$? The problem is obviously in NP: given a mapping, it is easy to check in polynomial time that it is valid by computing its failure probability and latency.
To establish the completeness, we use a reduction from 2-PARTITION [11]. We consider an instance $I_1$ of 2-PARTITION: given $m$ positive integers $a_1, a_2, \ldots, a_m$, does there exist a subset $I \subset \{1, \ldots, m\}$ such that $\sum_{i \in I} a_i = \sum_{i \notin I} a_i$? Let $S = \sum_{i=1}^{m} a_i$.
We build the following instance $I_2$ of our problem: the pipeline is composed of a single stage with $w = 1$, and the input and output communication costs are $\delta_0 = \delta_1 = 1$. The platform consists of $m$ processors with speeds $s_j = 1$ and failure probability $fp_j = e^{-a_j}$, for $1 \le j \le m$ (thus $0 \le fp_j \le 1$). Bandwidths are defined as $b_{in,j} = 1/a_j$ and $b_{j,out} = 1$ for $1 \le j \le m$.
We ask whether it is possible to realize a latency of $S/2 + 2$ and a failure probability of $e^{-S/2}$. Clearly, the size of $I_2$ is polynomial (and even linear) in the size of $I_1$. We now show that instance $I_1$ has a solution if and only if instance $I_2$ does.
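Suppose first that $I_1$ has a solution. The solution to $I_2$ which replicates the stage on the set of processors $I$ has a latency of $S/2 + 2$, since the first communication requires summing $\delta_0 / b_{in,j}$ over all processors $P_j$ included in the replication scheme, and then both the computation and the final output require a time 1. The failure probability of this solution is
$$1 - \Bigl(1 - \prod_{j \in I} fp_j\Bigr) = e^{-\sum_{j \in I} a_j} = e^{-S/2},$$
so $I_2$ has a solution.
Conversely, suppose that $I_2$ has a solution, and let $I$ be the set of processors on which the stage is replicated. The latency constraint reads
$$\sum_{j \in I} \frac{1}{b_{in,j}} + 1 + 1 \le \frac{S}{2} + 2.$$
Since $b_{in,j} = 1/a_j$, this implies that $\sum_{j \in I} a_j \le S/2$. Next we consider the failure probability constraint. We must have
$$1 - \Bigl(1 - \prod_{j \in I} fp_j\Bigr) \le e^{-S/2},$$
and thus $e^{-\sum_{j \in I} a_j} \le e^{-S/2}$, which forces $\sum_{j \in I} a_j \ge S/2$. Thus $\sum_{j \in I} a_j = S/2$ and we have a solution to the instance $I_1$ of 2-PARTITION, which concludes the proof.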
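The equivalence can be checked by brute force on small instances. The Python sketch below builds $I_2$ from a list of integers and enumerates the replication sets that meet both thresholds. To keep comparisons exact, it stores each failure probability $fp_j = e^{-a_j}$ through its logarithm $-a_j$, so that the condition $\prod_{j \in I} fp_j \le e^{-S/2}$ becomes $-\sum_{j \in I} a_j \le -S/2$. This encoding, the function names and the dictionary layout are illustrative choices, and the enumeration is exponential, so it only serves as a sanity check.

```python
from fractions import Fraction
from itertools import combinations


def build_instance(a):
    """Instance I2 of the proof, with fp_j = exp(-a_j) stored as log(fp_j)."""
    S = sum(a)
    return {
        "b_in": [Fraction(1, aj) for aj in a],
        "log_fp": [-aj for aj in a],
        "L": Fraction(S, 2) + 2,        # latency threshold S/2 + 2
        "log_FP": -Fraction(S, 2),      # log of the threshold exp(-S/2)
    }


def feasible_sets(inst):
    """Yield every replication set I meeting both thresholds."""
    m = len(inst["b_in"])
    for r in range(1, m + 1):
        for I in combinations(range(m), r):
            # delta_0 / b_in_j summed over replicas, then w/s = 1 and delta_1/b = 1
            latency = sum(1 / inst["b_in"][j] for j in I) + 1 + 1
            log_failure = sum(inst["log_fp"][j] for j in I)
            if latency <= inst["L"] and log_failure <= inst["log_FP"]:
                yield I


a_yes = [3, 1, 1, 2, 2, 1]   # S = 10, e.g. 3 + 2 = 5
a_no = [1, 2, 4]             # S = 7, no balanced split
sets_yes = list(feasible_sets(build_instance(a_yes)))
sets_no = list(feasible_sets(build_instance(a_no)))
print(len(sets_yes), len(sets_no))
```

Every set returned for `a_yes` sums to exactly $S/2$, and no set is returned for `a_no`, as the proof predicts.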
In this paper, we have assessed the complexity of trading off response time against reliability, which are among the most important criteria for a typical user. Indeed, in the context of large scale distributed platforms such as clusters or grids, failure probability becomes a major concern [10, 12, 9], and the bi-criteria approach tackled in this paper makes it possible to provide robust solutions while fulfilling user demands (minimizing latency under some reliability threshold, or the converse). We have shown that the more heterogeneous the target platform, the more difficult the problems. In particular, the bi-criteria optimization problem is polynomial for Fully Homogeneous, NP-hard for Fully Heterogeneous, and remains an open problem for Communication Homogeneous.
An example of a real world application consisting of a pipeline workflow can be found in [3]. In this work, we study the interval mapping of the JPEG encoder pipeline on a cluster of workstations.
Several other bi-criteria optimization problems have been considered in the literature. For instance, optimizing both latency and throughput is quite natural, as these objectives represent trade-offs between user expectations and the whole system performance. See [16, 5, 4] for pipeline graphs and [18] for general application DAGs. In the context of embedded systems, energy consumption is another important objective to minimize. Three-criteria optimization (energy, latency and throughput) is discussed in [19].
For large scale distributed platforms such as production grids, throughput is a very important criterion, as it measures the aggregate rate of processing of data, hence the global rate at which execution progresses. We can envision two types of replication: the first type is to replicate the same computation on different processors, as in this paper, to increase reliability. The second type is to allocate the processing of different data sets to different processors (say in a round-robin fashion), in order to increase the throughput. Both replication types can be conducted simultaneously, at the price of more resource consumption. Our future work will be devoted to the study of the interplay between throughput, latency and reliability, a very challenging algorithmic problem.
References
[1] J. Abawajy. Fault-tolerant scheduling policy for grid computing systems. In International Parallel and Distributed Processing Symposium IPDPS'2004. IEEE Computer Society Press, 2004.
[2] S. Albers and G. Schmidt. Scheduling with unexpected machine breakdowns. Discrete Applied Mathematics, 110(2-3):85-99, 2001.
[16] J. Subhlok and G. Vondran. Optimal latency-throughput tradeoffs for data parallel pipelines. In ACM Symposium on Parallel Algorithms and Architectures SPAA'96, pages 62-71. ACM Press, 1996.
[17] G. Tel. Introduction to Distributed Algorithms. Cambridge University Press, 2000.
[18] N. Vydyanathan, U. Catalyurek, T. Kurc, P. Saddayappan, and J. Saltz. An approach for optimizing latency under throughput constraints for application workflows on clusters. Research Report OSU-CISRC-1/07-TR03, Ohio State University, Columbus, OH, Jan. 2007. Available at ftp://ftp.cse.ohio-state.edu/pub/tech-report/2007.
[19] R. Xu, R. Melhem, and D. Mosse. Energy-aware scheduling for streaming applications on chip multiprocessors. In the 28th IEEE Real-Time System Symposium (RTSS'07), Tucson, Arizona, December 2007.
Publisher
INRIA - Domaine de Voluceau - Rocquencourt, BP 105 - 78153 Le Chesnay Cedex (France)
http://www.inria.fr