Computer Organization Lecture

ENEE350 Lecture Notes-Weeks 14 and 15
Pipelining & Amdahl's Law
Pipelining is a method of processing in which a problem is divided into a number of subproblems and solved, and the solutions of the subproblems for different instances of the problem are then overlapped.
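Example: a[i] = b[i] + c[i] + d[i] + e[i] + f[i], i = 1, 2, 3, ..., n
[Figure: pipelined adder network; the operand streams b[i], c[i], d[i], e[i], f[i] pass through adders (+) and delay elements (D) to produce a[1], a[2], ...]
Adders have delay D to compute.
Computation time = 4D + (n-1)D = nD + 3D
A sequential computation needs four additions per result, or 4nD in total, so
Speed-up = 4nD/(3D + nD) -> 4 for large n.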
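The timing argument above can be checked numerically. The following is a minimal Python sketch, not part of the original notes; the function names and the `stages` and `delay` parameters are illustrative assumptions, defaulting to four adder stages with D = 1.

```python
def pipelined_time(n, stages=4, delay=1.0):
    """Time to push n problem instances through a linear pipeline.

    The first result needs stages * delay; each later result
    follows one delay after the previous one.
    """
    return stages * delay + (n - 1) * delay


def sequential_time(n, stages=4, delay=1.0):
    """Time when every instance performs all additions back to back."""
    return n * stages * delay


def speedup(n, stages=4, delay=1.0):
    return sequential_time(n, stages, delay) / pipelined_time(n, stages, delay)


for n in (1, 10, 100, 10_000):
    print(n, round(speedup(n), 3))
```

The printed values climb toward 4, the number of adder stages, as n grows.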
We can describe the computation process in a linear pipeline algorithmically. There are three distinct phases to this computation: (a) filling the pipeline, (b) running the pipeline in the filled state until the last input arrives, and (c) emptying the pipeline.

(linear pipeline with n segments; the bound m marks the arrival of the last input)

while (1) {
    resetLatches();
    clock = 0;

    // (a) fill the pipeline: one more segment becomes active each clock
    for (j = 0; j <= n-1; j++) {
        for (k = 0; k <= j; k++)
            segment(k);
        clock++;
    }

    // (b) execute all segments until the last input arrives
    while (clock <= m) {
        for (j = 0; j <= n-1; j++)
            segment(j);
        clock++;
    }

    // (c) empty the pipeline: one fewer segment is active each clock
    for (j = 0; j <= n-1; j++) {
        for (k = j; k <= n-1; k++)
            segment(k);
        clock++;
    }
}
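To visualize the three phases, the sketch below lists which input each segment holds at every clock. It is written in Python rather than the C-like pseudocode of the notes, the helper name `occupancy` is hypothetical, and it assumes that one new input enters segment 0 per clock and that every input advances one segment per clock.

```python
def occupancy(num_segments, num_inputs):
    """Return one row per clock; row[s] is the index of the input
    held by segment s, or None when that segment is idle."""
    total_clocks = num_segments + num_inputs - 1
    table = []
    for clock in range(total_clocks):
        row = []
        for seg in range(num_segments):
            item = clock - seg  # the input that entered `seg` clocks ago
            row.append(item if 0 <= item < num_inputs else None)
        table.append(row)
    return table


for clock, row in enumerate(occupancy(4, 6)):
    print(clock, row)
```

Rows with leading entries only correspond to the fill phase, full rows to the steady state, and rows with trailing entries only to the drain phase.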
Instruction pipelines:
Goal:
(i) to increase the throughput (number of instructions/sec) in executing programs
(ii) to reduce the execution time (clock cycles/instruction, etc.)
clock       0    1    2    3    4
fetch       I1   I2   I3   I4
decode           I1   I2   I3   I4
execute               I1   I2   I3

clock       0    1    2    3    4
fetch       I1   I2   I3   I4   I5
decode           I1   I2   I3   I4
execute               I1   I2   I3
memory                     I1   I2
writeback                       I1
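Speedup of pipelined execution of instructions over a sequential execution:

$$S(5) = \frac{CPI_u \, N_u \, f_5}{CPI_p \, N_p \, f_1}$$

Here CPI is the average number of clock cycles per instruction, N is the number of instructions executed, and f is the clock rate; the subscripts u and 1 refer to the unpipelined machine, and p and 5 to the 5-stage pipelined machine.

Assuming that the systems operate at the same clock rate and use the same number of operations:

$$S(5) = \frac{CPI_u}{CPI_p}$$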
Example
Suppose that the instruction mix of programs executed on a serial and a pipelined machine is 40% ALU, 20% branching, and 40% memory, with 4, 2, and 4 cycles per instruction in the three classes, respectively. Then, under ideal conditions (no stalls due to hazards),

$$S(5) = \frac{CPI_u}{CPI_p} = \frac{4 \times 0.4 + 2 \times 0.2 + 4 \times 0.4}{1} = 3.6$$
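The weighted sum in this example is easy to evaluate directly. The sketch below is illustrative only; `average_cpi` and the `mix` list are names introduced here, and the pipelined CPI of 1 is the ideal, hazard-free value assumed in the example. Evaluating the sum for the stated mix gives 3.6.

```python
def average_cpi(mix):
    """mix is a list of (fraction_of_instructions, cycles_per_instruction)."""
    return sum(fraction * cycles for fraction, cycles in mix)


# 40% ALU at 4 cycles, 20% branch at 2 cycles, 40% memory at 4 cycles
mix = [(0.4, 4), (0.2, 2), (0.4, 4)]

cpi_unpipelined = average_cpi(mix)
cpi_pipelined = 1.0  # ideal pipeline: one instruction completes per clock
speedup = cpi_unpipelined / cpi_pipelined
print(round(speedup, 2))
```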

If the clock period needs to be increased for the pipelined implementation, then the speedup will have to be scaled down accordingly.
MIPS Pipeline
IF  ID  EX  WB          (register operations)
IF  ID  EX  ME  WB      (register/memory operations)
Instruction Pipelines (Hennessy & Patterson)
Hazards
1. Structural Hazards
2. Data Hazards
3. Control Hazards
Structural Hazards: They arise when limited resources are scheduled to operate concurrently on different streams during the same clock period.
Example: Memory conflict (data fetch + instruction fetch) or datapath conflict (arithmetic operation + PC update)
Clock   0    1    2    3    4    5    6
IF      I1   I2   I3   I4   I5   I6   I7
ID           I1   I2   I3   I4   I5   I6
EX                I1   I2   I3   I4   I5
ME                     I1   I2   I3   I4
WB                          I1   I2   I3
Fix:
Duplicate hardware (too expensive)
Stall the pipeline (serialize the operation) (too slow)
Clock   0    1    2    3    4    5    6    7    8    9
IF      I1   I2             I3   I4             I5   I6
ID           I1   I2             I3   I4             I5
EX                I1   I2             I3   I4
ME                     I1   I2             I3   I4
WB                          I1   I2             I3   I4
Speedup = Tserial/Tpipeline
= 5n ts/(2n ts + 2 ts), for odd n
= 5n ts/(2n ts + 1 ts), for even n
-> 5/2 as the number of instructions, n, tends to infinity.
Thus, we lose half the throughput due to stalls.
Note: The pipeline time of execution can be computed using the recurrences
T1 = 4
Ti = Ti-1 + 1 for even i
Ti = Ti-1 + 3 for odd i
which unroll to Tn = 2n + 2 for odd n and Tn = 2n + 1 for even n.
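The recurrence can be unrolled mechanically to check the limiting speedup. This is a hedged sketch: `pipeline_time` and `speedup` are names chosen here, time is measured in units of ts, and T1 = 4 is taken as given in the notes. Unrolling yields 2n + 2 for odd n and 2n + 1 for even n.

```python
def pipeline_time(n):
    """Completion time of instruction n (in units of ts) under
    T1 = 4, Ti = T(i-1) + 1 for even i, Ti = T(i-1) + 3 for odd i."""
    t = 4
    for i in range(2, n + 1):
        t += 1 if i % 2 == 0 else 3
    return t


def speedup(n, stages=5):
    """Serial time 5n ts divided by the stalled pipeline time."""
    return stages * n / pipeline_time(n)


for n in (1, 2, 3, 4, 100, 10_000):
    print(n, pipeline_time(n), round(speedup(n), 4))
```

For large n the ratio settles at 5/2, half of the ideal five-stage speedup.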
Data Hazards
They occur when the executions of two instructions may result in the incorrect reading of operands and/or writing of a result.
Read After Write (RAW) Hazard (Data Dependency)
Write After Read (WAR) Hazard (Data Antidependency)
Write After Write (WAW) Hazard (Output Dependency)
RAW Hazards
They occur when reads are early and writes are late.

Clock   0    1    2    3      4       5    6
IF      I1   I2   I3   I4     I5      I6   I7
ID           I1   I2   I3     I4      I5   I6
EX                I1   Read   I3      I4   I5
ME                     I1     I2      I3   I4
WB                            Write   I2   I3

I1: R1 = R1 + R2    I2: R3 = R1 + R2
(Read: I2 reads R1 in EX at clock 3; Write: I1 writes R1 in WB at clock 4.)
RAW Hazards (Cont'd)
They can be avoided by stalling the reads, but this increases the execution time. A better approach is to use data forwarding:

Clock   0    1    2    3      4       5    6
IF      I1   I2   I3   I4     I5      I6   I7
ID           I1   I2   I3     I4      I5   I6
EX                I1   Read   Read    I4   I5
ME                     I1     I2      I3   I4
WB                            Write   I2   I3

I1: R1 = R1 + R2    I2: R3 = R1 + R2
WAR Hazards
They occur when writes are early and reads are late.
[Figure: timing diagram for clocks 0 to 6 of instructions I1 to I7 in a pipeline with stages IF, ID, EX, ME, WB, EX, ME, WB; a Write is marked ahead of a Read.]
I1: R2 = R2 + R3; R9 = R3 + R4
I2: R3 = R7 + R5; R6 = R2 + R8
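The three hazard classes can be recognized from the register sets that two instructions read and write. The sketch below is an illustration, not the notes' method: `classify_dependences` is a hypothetical helper, and it only reports register dependences in program order. Whether a dependence becomes an actual hazard still depends on the pipeline timing shown in the diagrams.

```python
def classify_dependences(first, second):
    """Classify register dependences between two instructions in
    program order. Each instruction is a (writes, reads) pair of sets."""
    writes1, reads1 = first
    writes2, reads2 = second
    found = set()
    if writes1 & reads2:
        found.add("RAW")  # second reads what first writes
    if reads1 & writes2:
        found.add("WAR")  # second writes what first reads
    if writes1 & writes2:
        found.add("WAW")  # both write the same register
    return found


# I1: R1 = R1 + R2    I2: R3 = R1 + R2
raw_pair = (({"R1"}, {"R1", "R2"}), ({"R3"}, {"R1", "R2"}))

# I1: R2 = R2 + R3; R9 = R3 + R4    I2: R3 = R7 + R5; R6 = R2 + R8
war_pair = (({"R2", "R9"}, {"R2", "R3", "R4"}),
            ({"R3", "R6"}, {"R7", "R5", "R2", "R8"}))

print(classify_dependences(*raw_pair))
print(classify_dependences(*war_pair))
```

The second pair shows the WAR dependence on R3 from the example, and also a RAW dependence on R2, since I2 reads the R2 that I1 writes.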
Branch Prediction in Pipeline Instruction Sequencing
One of the major issues in pipelined instruction processing is to schedule conditional branch instructions.
When a pipeline controller encounters a conditional branch instruction, it has a choice to decode it into one of two instruction streams.
If the branch condition is met, then the execution continues from the target of the conditional branch instruction; otherwise, it continues with the instruction that follows the conditional branch instruction.
Example: Suppose that we execute the following assembly code on a 5-stage pipeline (IF, ID, EX, ME, WB):
      JCD R0 < 10, add;
      SUB R0, R1;
      JMPD, halt;
add:  ADD R0, R1;
halt: HLT;
If we assume that R0 < 10, then the SUB instruction would have been incorrectly fetched during the second clock cycle, and we will need another fetch cycle to fetch the ADD instruction.
Classification of branch prediction algorithms
Static Branch Prediction: The branch decision does not change over time; we use a fixed branching policy.
Dynamic Branch Prediction: The branch decision does change over time; we use a branching policy that varies over time.
Static Branch Prediction Algorithms
1. Don't predict (stall the pipeline)
2. Never take the branch
3. Always take the branch
4. Delayed branch
1. Stall the pipeline by 1 clock cycle: This allows us to determine the target of the branch instruction.

JCD          IF   ID   EX   ME   WB
SUB or ADD             IF   ID   EX   ME   WB

Stall and decide the branch.
Pipeline Execution Speed (stall case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as
CPI of the pipeline
= CPI of ideal pipeline + the number of idle cycles/instruction
= 1 + branch penalty × branch frequency
= 1 + branch frequency
In general, CPI of the pipeline > 1 + branch frequency because of data and possibly structural hazards.
Pros: Straightforward to implement.
Cons: The time overhead is high when the instruction mix includes a high percentage of branch instructions.
2. Never take the branch. The instruction in the pipeline is flushed if it is determined that the branch should have been taken after the ID stage is carried out.

JCD    IF   ID   EX   ME   WB
SUB         IF   ID   EX   ME   WB
IOR              IF   ID   EX   ME   WB      (branch not taken)
XOR              IF   ID   EX   ME   WB      (branch taken, SUB flushed)

The SUB instruction is always fetched; then either the IOR instruction is executed next, or SUB is flushed and XOR is executed.
Pipeline Execution Speed (Never take the branch case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as
CPI of the pipeline
= CPI of ideal pipeline + the number of idle cycles/instruction
= 1 + branch penalty × branch frequency × misprediction rate
= 1 + branch frequency × misprediction rate
Pros: If the prediction is highly accurate, then the pipeline can operate close to its full throughput.
Cons: Implementation is not as straightforward and requires flushing if decoding the branch address takes more than 1 clock cycle.
3. Always take the branch. The instruction in the pipeline is flushed if it is determined that the branch should not have been taken after the ID stage is carried out.
[Figure: timing diagram for JCD, SUB, IOR and XOR, each passing through IF, ID, EX, ME, WB; the target fetch is delayed by the branch address computation.]
Pipeline Execution Speed (Always take the branch case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as
CPI of the pipeline
= CPI of ideal pipeline + the number of idle cycles/instruction
= 1 + branch penalty × branch frequency × prediction rate + branch penalty × branch frequency × misprediction rate
= 1 + branch frequency × prediction rate + 2 × branch frequency × misprediction rate
Pros: Better suited for the execution of loops without the compiler's intervention (but this can generally be overcome, see the next slide).
Cons: Implementation is not as straightforward, and it has a higher misprediction penalty. Not as advantageous as not taking the branch, since the branch address computation is not completed until after the EX segment is carried out.
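The three CPI expressions (stall, never take, always take) can be compared side by side. The sketch below is illustrative: the function names are introduced here, the 20% branch frequency is borrowed from the earlier instruction-mix example, and the 60% taken fraction is a made-up value used only to exercise the formulas.

```python
def cpi_stall(branch_freq, penalty=1):
    """Every branch costs `penalty` idle cycles."""
    return 1 + penalty * branch_freq


def cpi_never_taken(branch_freq, mispredict_rate, penalty=1):
    """Idle cycles are paid only when the branch turns out to be taken."""
    return 1 + penalty * branch_freq * mispredict_rate


def cpi_always_taken(branch_freq, mispredict_rate, hit_penalty=1, miss_penalty=2):
    """One cycle for the address computation on a correct prediction,
    two cycles on a misprediction, as in the formula above."""
    predict_rate = 1 - mispredict_rate
    return 1 + branch_freq * (hit_penalty * predict_rate + miss_penalty * mispredict_rate)


branch_freq = 0.2      # from the 40/20/40 instruction mix
taken_fraction = 0.6   # hypothetical fraction of branches that are taken

print(cpi_stall(branch_freq))
print(cpi_never_taken(branch_freq, mispredict_rate=taken_fraction))
print(cpi_always_taken(branch_freq, mispredict_rate=1 - taken_fraction))
```

With these assumed numbers the never-take policy gives the lowest CPI, because always-take pays at least one cycle per branch for the target address.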
Example: for (i = 0; i < 10; i++) a[i] = a[i] + 1;

Branch always will not work well without compiler's help:
      CLR R0;
loop: JCD R0 >= 10, exit;
      LDD R1, R0;
      ADD R1, 1;
      ST+ R1, R0;
      JMPD, loop;
exit:

Branch always will work well without compiler's help:
      CLR R0;
loop: LDD R1, R0;
      ADD R1, 1;
      ST+ R1, R0;
      JCD R0 < 10, loop;
4. Delayed branch: Insert an instruction after a branch instruction, and always execute it whether or not the branch condition applies. Of course, this must be an instruction that can be executed without any side effects on the correctness of the program.
Pros: The pipeline is never stalled or flushed, and with the correct choice of delay-slot instruction, performance can approach that of an ideal pipeline.
Cons: It is not always possible to find a delay-slot instruction, in which case a NOP instruction may have to be inserted into the delay slot to make sure that the program's integrity is not violated. It makes compilers work harder.

Which instruction to place into the delayed branch slot?

4.1 Choose an instruction before the branch, but make sure that the branch does not depend on the moved instruction. If such an instruction can be found, this always pays off.
Example:
      ADD R1, R2;
      JCD R2 > 10, exit;
can be rescheduled as
      JCD R2 > 10, exit;
      ADD R1, R2;   (Delay slot)

4.2 Choose an instruction from the target of the branch, but make sure that the moved instruction is executable when the branch is not taken.
Example:
      ADD R1, R2;
      JCD R2 > 10, sub;
      JMPD, add;
      ...
sub:  SUB R4, R5;
add:  ADI R3, 5;
can be rescheduled as
      ADD R1, R2;
      JCD R2 > 10, sub;
      ADI R3, 5;    (Delay slot)
      ...
sub:  SUB R4, R5;

4.3 Choose an instruction from the anti-target of the branch, but make sure that the moved instruction is executable when the branch is taken.
Example:
      //ADD R3, R2;
      JCD R2 > 10, exit;
      ADD R3, R2;
exit: SUB R4, R5;
      //ADD R4, R3;
can be rescheduled as
      ADD R1, R2;
      JCD R2 > 10, exit;
      ADD R3, R2;   (Schedule for execution if it does not alter the program flow or output)
exit: SUB R4, R5;
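The safety condition for the first strategy (moving an instruction from before the branch into the delay slot) reduces to a register check. The sketch below is a simplified illustration: `can_move_into_delay_slot` is a hypothetical helper, and it assumes that the first operand of an instruction such as ADD R1, R2 is its destination register.

```python
def can_move_into_delay_slot(instr_writes, branch_reads):
    """An instruction placed before the branch may move into the delay
    slot only if the branch condition reads none of its results."""
    return set(instr_writes).isdisjoint(branch_reads)


# ADD R1, R2 (assumed to write R1) before JCD R2 > 10, exit (reads R2)
print(can_move_into_delay_slot({"R1"}, {"R2"}))

# An instruction that writes R2 could not be moved past the same branch
print(can_move_into_delay_slot({"R2"}, {"R2"}))
```

The other two strategies need an additional check on the path that did not originally contain the moved instruction, which this sketch does not attempt.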
Dynamic Branch Prediction
Dynamic branch prediction relies on the history of how branch conditions were resolved in the past.
History of branches is kept in a buffer. To keep this buffer reasonably small and easy to access, the buffer is indexed by some fixed number of lower-order bits of the address of the branch instruction.
The assumption is that the address values in the lower address field are unique enough to prevent frequent collisions or overrides. Thus, if we are trying to predict branches in a program which remains within a block of 256 locations, 8 bits should suffice.
[Figure: block of memory from address x to x+256 containing JCD instructions at different addresses.]
Branch instructions in the instruction cache include a branch prediction field that is used to predict if the branch should be taken.
[Figure: program stored at memory locations x, x+4, x+8, x+12, x+16, x+20; three of the entries are branch instructions, with branch prediction fields 0 (branch was not taken), 0 (branch was not taken), and 1 (branch was taken).]
Branch prediction:
In the simplest case, the field is a 1-bit tag:
0 <=> branch was not taken last time (State A)
1 <=> branch was taken last time (State B)
State transitions:
A: not taken -> A, taken -> B
B: taken -> B, not taken -> A
While in state A, predict the branch as not to be taken.
While in state B, predict the branch as to be taken.
This works relatively well: It accurately predicts the branches in loops in all but two of the iterations.
      CLR R0;
loop: LDD R1, R0;
      ADD R1, 1;
      ST+ R1, R0;
      JCD R0 < 10, loop;
Assuming that we begin in state A, prediction fails when R0 = 1 (branch is not taken when it should be) and R0 = 10 (branch is taken when it should not be).
Assuming that we begin in state B, prediction fails when R0 = 10 (branch is taken when it should not be).
We can modify the loop to make the branch prediction algorithm fail twice when we begin in state B as well.
      CLR R0;
loop: LDD R1, R0;
      ADD R1, 1;
      ST+ R1, R0;
      JCD R0 >= 10, exit;
      JMPD, loop;
exit:
Assuming that we begin in state B, prediction fails when R0 = 1 (branch is taken when it should not be) and R0 = 10 (branch is not taken when it should be).
What is worse is that we can make this branch prediction algorithm fail each time it makes a prediction:
      LDI R0, 1;
loop: JCD R0 > 0, neg;
      LDI R0, 1;
      JMPD, loop;
neg:  LDI R0, -1;
      JMPD, loop;
Assuming that we begin in state A, prediction fails
when R0 = 1 (branch is not taken when it should be),
R0 = -1 (branch is taken when it should not be),
R0 = 1 (branch is not taken when it should be),
R0 = -1 (branch is taken when it should not be),
and so on.
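The 1-bit scheme is small enough to simulate. In this sketch the predictor state is a boolean (False for state A, True for state B), and each loop is reduced to the sequence of outcomes of its conditional branch; the function and list names are assumptions made for illustration.

```python
def one_bit_mispredictions(outcomes, start_taken):
    """Count mispredictions of a 1-bit predictor.

    outcomes: sequence of booleans, True when the branch is taken.
    start_taken: initial state, False for state A, True for state B.
    """
    state = start_taken
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken  # the tag always records the last outcome
    return misses


loop_bottom = [True] * 9 + [False]   # JCD R0 < 10, loop
loop_top = [False] * 9 + [True]      # JCD R0 >= 10, exit
alternating = [True, False] * 5      # JCD R0 > 0, neg with R0 = 1, -1, 1, ...

print(one_bit_mispredictions(loop_bottom, start_taken=False))
print(one_bit_mispredictions(loop_bottom, start_taken=True))
print(one_bit_mispredictions(loop_top, start_taken=True))
print(one_bit_mispredictions(alternating, start_taken=False))
```

The alternating sequence misses on every prediction when started in state A, which is the worst case described above.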
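2-bit prediction (A more reluctant flip in decision)
[Figure: state transition diagram with four states A1, A2, B2 and B1; each state has one outgoing transition labeled "taken" and one labeled "not taken".]
While in states A1 and A2, predict the branch as not to be taken.
While in states B1 and B2, predict the branch as to be taken.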
      CLR R0;
loop: LDD R1, R0;
      ADD R1, 1;
      ST+ R1, R0;
      JCD R0 < 10, loop;
Assuming that we begin in state A1, prediction fails when R0 = 1, 2 (branch is not taken when it should be) and R0 = 10 (branch is taken when it should not be).
Assuming that we begin in state B1, prediction fails when R0 = 10 (branch is taken when it should not be).
2-bit predictors are more resilient to branch inversions (predictions are reversed when they are missed twice):
      LDI R0, 1;
loop: JCD R0 > 0, neg;
      LDI R0, 1;
      JMPD, loop;
neg:  LDI R0, -1;
      JMPD, loop;
Assuming that we begin in state B1, prediction
succeeds when R0 = 1 (branch is taken when it should be),
fails when R0 = -1 (branch is taken when it should not be),
succeeds when R0 = 1 (branch is taken when it should be),
fails when R0 = -1 (branch is taken when it should not be),
and so on.
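The 2-bit behavior can be simulated in the same way. This sketch assumes the four states form a saturating counter in the order A1, A2, B2, B1 (encoded 0 to 3). That is one common arrangement; the diagram in the notes may wire the two middle transitions differently, although the failure counts quoted above come out the same here.

```python
def two_bit_mispredictions(outcomes, start=3):
    """Count mispredictions of a 2-bit saturating-counter predictor.

    States: 0 = A1, 1 = A2 (predict not taken);
            2 = B2, 3 = B1 (predict taken).
    """
    counter = start
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # move one step toward the observed outcome, saturating at the ends
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


loop_bottom = [True] * 9 + [False]   # JCD R0 < 10, loop
alternating = [True, False] * 5      # JCD R0 > 0, neg with R0 = 1, -1, 1, ...

print(two_bit_mispredictions(loop_bottom, start=0))   # begin in A1
print(two_bit_mispredictions(loop_bottom, start=3))   # begin in B1
print(two_bit_mispredictions(alternating, start=3))   # begin in B1
```

On the alternating branch the 2-bit predictor started in B1 misses on every other prediction, rather than on every prediction as the 1-bit scheme can.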
Amdahl's Law (Fixed Load Speedup)
Let q be the fraction of a load L that cannot be sped up by introducing more processors, and let T(p) be the amount of time it takes to execute L on p processors by a linear work function, p > 1. Then

$$T(p) \ge qT(1) + \frac{(1-q)\,T(1)}{p}$$

$$S(p) = \frac{T(1)}{T(p)} \le \frac{1}{q + \frac{1-q}{p}} \to \frac{1}{q} \quad \text{as } p \to \infty$$
All this means is that the maximum speedup of a system is limited by the fraction of the work that must be completed sequentially. Thus, the execution of the work using p processors can be reduced to qT(1) under the best of circumstances, and the speedup cannot exceed 1/q.
Example
A 4-processor computer executes instructions that are fetched from a random access memory over a shared bus as shown below:
[Figure: four processors connected to a random access memory over a shared bus.]
The task to be performed is divided into two parts:
1. Fetch instruction (serial part): it takes 30 microseconds
2. Execute instruction (parallel part): it takes 10 microseconds to execute
The serial fraction is q = 30/(30 + 10) = 0.75, so
S(4) = T(1)/T(4) = 1/(0.75 + 0.25/4) = 4/3.25 = 1.23
Now, suppose that the number of processors is doubled. Then
S(8) = T(1)/T(8) = 1/(0.75 + 0.25/8) = 8/6.25 = 1.28
Suppose that the number of processors is doubled again. Then
S(16) = T(1)/T(16) = 1/(0.75 + 0.25/16) = 16/12.25 = 1.31
What is the limit?
S(p) = T(1)/T(p) = 1/(0.75 + 0.25/p) -> 1/0.75 = 1.333 as p tends to infinity.
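The sequence of speedups in this example follows from a one-line function. The sketch below uses the name `amdahl_speedup`, chosen here, and takes the serial fraction from the 30 and 10 microsecond figures of the example.

```python
def amdahl_speedup(q, p):
    """Fixed-load speedup with serial fraction q on p processors."""
    return 1.0 / (q + (1.0 - q) / p)


q = 30 / (30 + 10)  # fetch is serial, execute is parallel

for p in (4, 8, 16, 1024):
    print(p, round(amdahl_speedup(q, p), 3))

print("limit:", round(1 / q, 3))
```

Quadrupling the processor count from 4 to 16 moves the speedup only from about 1.23 to about 1.31, already close to the 1/q ceiling of 1.333.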
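Alternate Forms of Amdahl's Law

$$S = \frac{T(1)}{T_{\text{unenhanced}} + T_{\text{enhanced}}} = \frac{T(1)}{T(1)\left(q + \frac{1-q}{s}\right)} \to \frac{1}{q} \quad \text{as } s \to \infty$$

where s is the speedup of the computation that can be enhanced.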
Example:
Suppose that you've upgraded your computer from a 2 GHz processor to a 4 GHz processor. What is the maximum speedup you expect in executing a typical program, assuming that (1) the speed of fetching each instruction is directly proportional to the speed of reading an instruction from the primary memory of your computer, and reading an instruction takes four times longer than executing it, and (2) the speed of executing each instruction is directly proportional to the clock speed of the processor of your computer?
Using Amdahl's Law with q = 0.8 and s = 2, we have
S = 2/(0.2 + 0.8 × 2) = 1.111
Very disappointing, as you are likely to have paid quite a bit of money for the upgrade!
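As a check on the arithmetic, the same value follows from the form 1/(q + (1-q)/s). Here q = 4/5 because reading an instruction takes four of every five time units, and s = 2 because the clock rate doubles. Multiplying numerator and denominator by s gives the expression used in the example:

```latex
S = \frac{1}{q + \frac{1-q}{s}}
  = \frac{1}{0.8 + \frac{0.2}{2}}
  = \frac{1}{0.9} \approx 1.111,
\qquad
\frac{s}{(1-q) + q\,s}
  = \frac{2}{0.2 + 0.8 \times 2}
  = \frac{2}{1.8} \approx 1.111 .
```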
Generalized Amdahl's Law
In general, a task may be partitioned into a set of subtasks, with each subtask requiring a designated number of processors to execute. In this case, the speedup of the parallel execution of the task over its sequential execution can be characterized by the following, more general formula:

$$S(p_1, p_2, \ldots, p_k) = \frac{T(1)}{T(p_1, p_2, \ldots, p_k)} \le \frac{T(1)}{\frac{q_1 T(1)}{p_1} + \frac{q_2 T(1)}{p_2} + \cdots + \frac{q_k T(1)}{p_k}} = \frac{1}{\frac{q_1}{p_1} + \frac{q_2}{p_2} + \cdots + \frac{q_k}{p_k}}$$

where $q_1 + q_2 + \cdots + q_k = 1$.

When k = 2, q1 = q, q2 = 1 - q, p1 = 1, p2 = p, this formula reduces to Amdahl's Law.
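Remark:
The generalized Amdahl's Law can also be rewritten to express the speedup due to different amounts of speed enhancement (Se) that can be made to different parts of a system:

$$S_e(s_1, s_2, \ldots, s_k) = \frac{T(1)}{T(s_1, s_2, \ldots, s_k)} \le \frac{T(1)}{\frac{q_1 T(1)}{s_1} + \frac{q_2 T(1)}{s_2} + \cdots + \frac{q_k T(1)}{s_k}} = \frac{1}{\frac{q_1}{s_1} + \frac{q_2}{s_2} + \cdots + \frac{q_k}{s_k}}$$

where $q_1 + q_2 + \cdots + q_k = 1$.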
Example:
Suppose that your computer executes a program that has the following profile of execution:
(a) 30% integer operations,
(b) 20% floating point operations,
(c) 50% memory reference instructions.
How much speedup will you expect if you double the speed of the floating point unit of your computer? Using the formula above:
Se = 1/(0.3 + 0.2/2 + 0.5) = 1/0.9 = 1.11
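The generalized formula is a weighted harmonic sum, which the sketch below evaluates for the profile in this example. The function name `generalized_speedup` and the check that the fractions sum to 1 are additions made here for illustration.

```python
def generalized_speedup(fractions, factors):
    """Speedup when fraction q_i of the work is sped up by factor s_i."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return 1.0 / sum(q / s for q, s in zip(fractions, factors))


# integer, floating point, memory; only the floating point unit is doubled
print(round(generalized_speedup([0.3, 0.2, 0.5], [1, 2, 1]), 3))

# with k = 2, s_1 = 1 and s_2 = p this is the ordinary Amdahl's Law
print(round(generalized_speedup([0.75, 0.25], [1, 4]), 3))
```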
Example:
Suppose that you have a fixed budget of $500 to upgrade each of the computers in your laboratory, and you find out that the computations you perform on your computers require
(a) 40% integer operations,
(b) 60% floating point operations.
If every dollar spent on the integer unit after $50 decreases its execution time by 2%, and if every dollar spent on the floating point unit after $100 decreases its execution time by 1%, how would you spend the $500?
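Example (Continued):
Let x1 and x2 be the dollars spent on the integer unit beyond the first $50 and on the floating point unit beyond the first $100, so that x1 + x2 = 500 - 50 - 100 = 350.

$$S = \frac{T(1)}{T_i(x_1) + T_f(x_2)} \quad \text{where } x_1 + x_2 = 350$$

$$T_i(x_1) = (1 - 0.02)\,T_i(x_1 - 1) \;\Rightarrow\; T_i(x_1) = 0.98^{x_1}\,T_i(0)$$

$$T_f(x_2) = (1 - 0.01)\,T_f(x_2 - 1) \;\Rightarrow\; T_f(x_2) = 0.99^{x_2}\,T_f(0)$$

$$T_i(0) = 0.4\,T(1), \qquad T_f(0) = 0.6\,T(1)$$

Substituting these into the generalized Amdahl's speedup expression gives:

$$S = \frac{T(1)}{0.98^{x_1} \cdot 0.4\,T(1) + 0.99^{x_2} \cdot 0.6\,T(1)} = \frac{1}{0.98^{x_1} \cdot 0.4 + 0.99^{x_2} \cdot 0.6}$$

Example (Continued):
So we maximize

$$\frac{1}{0.98^{x_1} \cdot 0.4 + 0.99^{x_2} \cdot 0.6}$$

subject to x1 + x2 = 350, or maximize

$$\frac{1}{0.98^{x_1} \cdot 0.4 + 0.99^{350 - x_1} \cdot 0.6}$$

subject to 0 <= x1 <= 350.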
Example (Continued):
Computing the values in the neighborhood of 120 reveals that the speedup is maximized when x1 = 126, that is, when $50 + $126 = $176 is spent on the integer unit and $100 + $224 = $324 on the floating point unit.
From Mathematica:
Table[1/(0.4*0.98^x + 0.6*0.99^(350-x)), {x, 120, 128, 1}]
{10.5398, 10.5518, 10.5616, 10.5691, 10.5744, 10.5776, 10.5785, 10.5773, 10.574}
Note: It is possible to have a higher speedup with all of the money invested in one of the units if the fixed cost for one of the units becomes sufficiently large.
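The same search can be reproduced outside Mathematica with a brute-force scan over whole-dollar splits. This is a sketch under the example's assumptions (2% and 1% time reduction per dollar, 350 dollars left after the fixed costs); the names `speedup` and `best_x1` are introduced here.

```python
def speedup(x1, budget=350):
    """Speedup when x1 dollars go to the integer unit and the rest
    of the discretionary budget goes to the floating point unit."""
    x2 = budget - x1
    return 1.0 / (0.4 * 0.98 ** x1 + 0.6 * 0.99 ** x2)


# try every whole-dollar split of the 350 discretionary dollars
best_x1 = max(range(351), key=speedup)

print(best_x1, 350 - best_x1, round(speedup(best_x1), 4))
for x in range(120, 129):
    print(x, round(speedup(x), 4))
```

Scanning the whole range rather than a neighborhood confirms that the interior optimum also beats both extremes of spending everything on a single unit.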
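Addendum:
If the changes in performance due to upgrades are specified in terms of speed rather than time, we can then use the following formulation:

$$t = \frac{L}{s}$$

$$\frac{\partial t}{\partial x} = \frac{\partial t}{\partial s}\,\frac{\partial s}{\partial x} = -\frac{L}{s^2}\,\frac{\partial s}{\partial x}$$

$$\Delta t = -\frac{L}{s}\,\frac{\Delta s}{s} = -t\,\frac{\Delta s}{s}$$

$$\Delta t = T(x) - T(x-1) = -T(x-1)\,\frac{\Delta s}{s}$$

$$T(x) = \left(1 - \frac{\Delta s}{s}\right) T(x-1)$$

where $\Delta s / s$ denotes the percentage change in speed.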
