Pipelining is a method of processing in which a problem is divided into a number of subproblems and solved, and the solutions of the subproblems for different instances of the problem are then overlapped.
Example: a[i] = b[i] + c[i] + d[i] + e[i] + f[i], i = 1, 2, 3, ..., n
[Figure: pipelined chain of four adders, each with delay D, consuming the operand streams and producing a[1], a[2], ...]
Adders have delay D to compute.
Computation time = 4D + (n-1)D = nD + 3D
Speed-up = 4nD / (3D + nD) -> 4 for large n.
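The timing argument above can be checked numerically. The following is a minimal sketch, assuming all stages have the same delay D and that time is counted in units of D; the function names are hypothetical and not part of the original slides.

```c
#include <assert.h>

/* Time, in units of the stage delay D, to produce n results one after
   another when each result needs `stages` additions in sequence. */
long sequential_time(long stages, long n) {
    assert(stages > 0 && n > 0);
    return stages * n;
}

/* Time, in units of D, when the stages are pipelined: `stages` units
   until the first result appears, then one more unit per later result. */
long pipelined_time(long stages, long n) {
    assert(stages > 0 && n > 0);
    return stages + (n - 1);
}

/* Speed-up of the pipelined over the sequential evaluation. */
double pipeline_speedup(long stages, long n) {
    return (double)sequential_time(stages, n) /
           (double)pipelined_time(stages, n);
}
```

With stages = 4 this reproduces the nD + 3D expression, and the ratio approaches 4 as n grows.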
    while (1) {
        resetLatches();
        clock = 0;
        // fill the pipeline
        for (j = 0; j <= n-1; j++) {
            for (k = 0; k <= j; k++)
                segment(k);
            clock++;
        }
        // execute all segments until the last input arrives
        while (clock <= m) {
            for (j = 0; j <= n-1; j++)
                segment(j);
            clock++;
        }
        // empty the pipeline
        for (j = 0; j <= n-1; j++) {
            for (k = j; k <= n-1; k++)
                segment(k);
            clock++;
        }
    }
clock     0    1    2    3    4
fetch     I1   I2   I3   I4
decode         I1   I2   I3   I4
execute             I1   I2   I3
clock       0    1    2    3    4
fetch       I1   I2   I3   I4   I5
decode           I1   I2   I3   I4
execute               I1   I2   I3
memory                     I1   I2
writeback                       I1
Speedup of pipelined execution of instructions over a sequential execution:
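A common textbook form of this speedup, sketched here under the assumption of k equal-length stages with cycle time \tau, n instructions, and no hazards (the symbols k, n, and \tau are introduced for illustration and do not come from the slides):

```latex
\[
S(n) = \frac{T_{\text{sequential}}}{T_{\text{pipelined}}}
     = \frac{n k \tau}{(k + n - 1)\,\tau}
     = \frac{n k}{k + n - 1}
     \;\longrightarrow\; k \quad \text{as } n \to \infty
\]
```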
MIPS Pipeline
Register operations: IF, ID, EX, WB
Register/Memory operations: IF, ID, EX, ME, WB
Structural Hazards: They arise when limited resources are scheduled to operate concurrently on different streams during the same clock period.
Example: Memory conflict (data fetch + instruction fetch) or datapath conflict (arithmetic operation + PC update)
Clock   0    1    2    3    4    5    6
IF      I1   I2   I3   I4   I5   I6   I7
ID           I1   I2   I3   I4   I5   I6
EX                I1   I2   I3   I4   I5
ME                     I1   I2   I3   I4
WB                          I1   I2   I3
Fix:
Duplicate hardware (too expensive)
Stall the pipeline (serialize the operation) (too slow)
[Figure: timing diagram of the stalled pipeline, showing instructions I1 to I6 in stages IF, ID, EX, ME, WB over successive clock cycles]
RAW Hazards: They occur when reads are early and writes are late.
Clock   0    1    2    3           4            5    6
IF      I1   I2   I3   I4          I5           I6   I7
ID           I1   I2   I3          I4           I5   I6
EX                I1   I2 (Read)   I3           I4   I5
ME                     I1          I2           I3   I4
WB                                 I1 (Write)   I2   I3
I1: R1 = R1 + R2
I2: R3 = R1 + R2
I1: R2 = R2 + R3; R9 = R3 + R4
I2: R3 = R7 + R5; R6 = R2 + R8
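The register examples above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical three-address encoding (dest = src1 + src2, with registers identified by number); the struct and function names are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical three-address operation: R[dest] = R[src1] + R[src2]. */
typedef struct {
    int dest;
    int src1;
    int src2;
} Instr;

/* True when `later` reads the register that `earlier` writes,
   i.e. the pair forms a read-after-write (RAW) dependence. */
bool has_raw_hazard(Instr earlier, Instr later) {
    assert(earlier.dest >= 0);
    return later.src1 == earlier.dest || later.src2 == earlier.dest;
}
```

For I1: R1 = R1 + R2 followed by I2: R3 = R1 + R2, the check reports a dependence on R1. Whether that dependence actually stalls the pipeline depends on how far apart the read and write stages are.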
Branch Prediction in Pipeline Instruction Sequencing
One of the major issues in pipelined instruction processing is to schedule conditional branch instructions.
When a pipeline controller encounters a conditional branch instruction, it has a choice to decode it into one of two instruction streams.
If the branch condition is met, then the execution continues from the target of the conditional branch instruction;
otherwise, it continues with the instruction that follows the conditional branch instruction.
Classification of branch prediction algorithms
Static Branch Prediction: The branch decision does not change over time; we use a fixed branching policy.
Dynamic Branch Prediction: The branch decision does change over time; we use a branching policy that varies over time.
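1. Stall the pipeline by 1 clock cycle: This allows us to determine the target of the branch instruction.
[Figure: pipeline timing diagram (IF, ID, EX, ME, WB) for the branch and the instruction that follows it. Caption: Stall and decide the branch.]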
Pipeline Execution Speed (stall case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as
CPI of the pipeline
  = CPI of ideal pipeline + the number of idle cycles/instruction
  = 1 + branch penalty × branch frequency
  = 1 + branch frequency
In general, CPI of the pipeline > 1 + branch frequency because of data and possibly structural hazards.
Pros: Straightforward to implement.
Cons: The time overhead is high when the instruction mix includes a high percentage of branch instructions.
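The stall-case CPI expression can be evaluated directly. The following is a minimal sketch, assuming an ideal CPI of 1 and only branch hazards; the function name and the sample numbers in the note below are illustrative.

```c
#include <assert.h>

/* CPI of a pipeline that stalls on every branch:
   ideal CPI (1) plus branch penalty times branch frequency. */
double stall_cpi(double branch_penalty, double branch_frequency) {
    assert(branch_penalty >= 0.0);
    assert(branch_frequency >= 0.0 && branch_frequency <= 1.0);
    return 1.0 + branch_penalty * branch_frequency;
}
```

For example, with a one-cycle penalty and 20% branches, stall_cpi(1.0, 0.2) evaluates 1 + 0.2.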
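[Figure: pipeline timing diagram (IF, ID, EX, ME, WB) for the never-take-the-branch policy, showing the branch followed by the SUB, IOR, and XOR instructions]
The SUB instruction is always executed, and then either the IOR instruction is executed next or SUB is flushed and XOR is executed.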
Pipeline Execution Speed (never take the branch case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as
CPI of the pipeline
  = CPI of ideal pipeline + the number of idle cycles/instruction
  = 1 + branch penalty × branch frequency × misprediction rate
  = 1 + branch frequency × misprediction rate
Pros: If the prediction is highly accurate, then the pipeline can operate close to its full throughput.
Cons: Implementation is not as straightforward and requires flushing if decoding the branch address takes more than 1 clock cycle.
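3. Always take the branch. The instruction in the pipeline is flushed if it is determined, after the ID stage is carried out, that the branch should not have been taken.
[Figure: pipeline timing diagram (IF, ID, EX, ME, WB) for the always-take-the-branch policy, showing the branch address computation and the XOR instruction at the branch target]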
Pipeline Execution Speed (always take the branch case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as
CPI of the pipeline
  = CPI of ideal pipeline + the number of idle cycles/instruction
  = 1 + branch penalty × branch frequency × prediction rate
      + branch penalty × branch frequency × misprediction rate
  = 1 + branch frequency × prediction rate
      + 2 × branch frequency × misprediction rate
Pros: Better suited for the execution of loops without the compiler's intervention (but this can generally be overcome, see the next slide).
Cons: Implementation is not as straightforward, and has a higher misprediction penalty. Not as advantageous as not taking the branch, since the branch address computation is not completed until after the EX segment is carried out.
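The two prediction policies can be compared by evaluating their CPI expressions side by side. The following is a minimal sketch, assuming that "prediction rate" means the fraction of branches predicted correctly (1 minus the misprediction rate) and using the penalties of the formulas above (1 cycle for a mispredicted never-take branch; 1 cycle for a correctly predicted and 2 cycles for a mispredicted always-take branch). The function names are hypothetical.

```c
#include <assert.h>

/* Never take the branch: a penalty of 1 cycle is paid only on a misprediction. */
double never_take_cpi(double branch_frequency, double misprediction_rate) {
    assert(branch_frequency >= 0.0 && branch_frequency <= 1.0);
    assert(misprediction_rate >= 0.0 && misprediction_rate <= 1.0);
    return 1.0 + branch_frequency * misprediction_rate;
}

/* Always take the branch: 1 cycle is paid even when the prediction is right
   (the target address is not yet known), 2 cycles when it is wrong. */
double always_take_cpi(double branch_frequency, double misprediction_rate) {
    assert(branch_frequency >= 0.0 && branch_frequency <= 1.0);
    assert(misprediction_rate >= 0.0 && misprediction_rate <= 1.0);
    double prediction_rate = 1.0 - misprediction_rate; /* assumed definition */
    return 1.0 + branch_frequency * prediction_rate
               + 2.0 * branch_frequency * misprediction_rate;
}
```

Under these assumptions the always-take CPI is never below 1 + branch frequency, which is the cost of simply stalling, so its appeal rests on the misprediction rate being low for loop-closing branches.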
Example: for (i = 0; i < 10; i++) a[i] = a[i] + 1;

Branch always will not work well without compiler's help
    CLR R0;
    loop: JCD R0 >= 10, exit
    LDD R1, R0;
    ADD R1, 1;
    ST+ R1, R0;
    JMPD, loop;
    exit:

Branch always will work well without compiler's help
    CLR R0;
    loop: LDD R1, R0;
    ADD R1, 1;
    ST+ R1, R0;
    JCD R0 < 10, loop;
3. Delayed branch: Insert an instruction after a branch instruction, and always execute it whether or not the branch condition applies. Of course, this must be an instruction that can be executed without any side effects on the correctness of the program.
Pros: Pipeline is never stalled or flushed, and with the correct choice of branch delayed slot instruction, performance can approach that of an ideal pipeline.
Cons: It is not always possible to find a delayed slot instruction, in which case a NOP instruction may have to be inserted into the delayed slot to make sure that the program's integrity is not violated. It makes compilers work harder.
Which instruction to place into the delayed branch slot?
3.1 Choose an instruction before the branch, but make sure that the branch does not depend on the moved instruction. If such an instruction can be found, this always pays off.
Example:
    ADD R1, R2;
    JCD R2 > 10, exit;
can be rescheduled as
    JCD R2 > 10, exit;
    ADD R1, R2; (Delay slot)
3.2 Choose an instruction from the target of the branch, but make sure that the moved instruction is executable when the branch is not taken.
Example:
    ADD R1, R2;
    JCD R2 > 10, sub;
    JMPD, add;
    .
    sub: SUB R4, R5;
    add: ADI R3, 5;
can be rescheduled as
    ADD R1, R2;
    JCD R2 > 10, sub;
    ADI R3, 5; (Delay slot)
    .
    sub: SUB R4, R5;
3.3 Choose an instruction from the antitarget of the branch, but make sure that the moved instruction is executable when the branch is taken.
Example:
    // ADD R3, R2;
    JCD R2 > 10, exit;
    ADD R3, R2;
    exit: SUB R4, R5;
    // ADD R4, R3;
can be rescheduled as
    ADD R1, R2;
    JCD R2 > 10, exit;
    ADD R3, R2; (Schedule for execution if it does not alter the program flow or output)
    exit: SUB R4, R5;
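Branch instructions in the instruction cache include a branch prediction field that is used to predict if the branch should be taken.
[Figure: instruction cache with memory locations x, x+1, ..., x+256, some holding JCD branch instructions together with their prediction fields]
State diagram (states A and B):
    A, branch not taken -> A
    A, branch taken     -> B
    B, branch taken     -> B
    B, branch not taken -> A
While in state A, predict the branch as not to be taken.
While in state B, predict the branch as to be taken.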
This works relatively well: it accurately predicts the branches in loops in all but two of the iterations.
    CLR R0;
    loop: LDD R1, R0;
    ADD R1, 1;
    ST+ R1, R0;
    JCD R0 < 10, loop;
Assuming that we begin in state A, prediction fails when R0 = 1 (branch is not taken when it should be) and R0 = 10 (branch is taken when it should not be).
Assuming that we begin in state B, prediction fails when R0 = 10 (branch is taken when it should not be).
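The misprediction counts for this loop can be reproduced with a small simulation. The following is a minimal sketch, assuming the two-state scheme described above, where the state simply remembers the last outcome (false stands for state A, true for state B); the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One-bit predictor: predict whatever the branch did last time.
   start_state == false is state A (predict not taken),
   start_state == true  is state B (predict taken). */
int one_bit_mispredictions(bool start_state, const bool *taken, size_t n) {
    assert(taken != NULL || n == 0);
    bool state = start_state;
    int misses = 0;
    for (size_t i = 0; i < n; i++) {
        if (state != taken[i]) {
            misses++;          /* prediction disagreed with the outcome */
        }
        state = taken[i];      /* remember the most recent outcome */
    }
    return misses;
}

/* Outcomes of "JCD R0 < 10, loop" as R0 takes the values 1..10:
   taken nine times, then not taken once. */
void loop_branch_outcomes(bool out[10]) {
    for (int r0 = 1; r0 <= 10; r0++) {
        out[r0 - 1] = (r0 < 10);
    }
}
```

Feeding the ten loop outcomes to one_bit_mispredictions gives two misses when starting in state A and one miss when starting in state B, matching the analysis above.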
2-bit prediction (a more reluctant flip in decision)
[Figure: state diagram of the 2-bit predictor with states A1, A2, B1, and B2 connected by taken / not taken transitions]
While in states A1 and A2, predict the branch as not to be taken.
While in states B1 and B2, predict the branch as to be taken.
2-bit predictors are more resilient to branch inversions (predictions are reversed when they are missed twice):
    LDI R0, 1;
    loop: JCD R0 > 0, neg;
    LDI R0, 1;
    JMPD, loop;
    neg: LDI R0, -1;
    JMPD, loop;
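The effect of a more reluctant predictor on an alternating branch can be simulated. The following is a minimal sketch that assumes the common saturating-counter formulation of a 2-bit predictor (counter values 0 and 1 predict not taken, 2 and 3 predict taken); the exact transitions in the slide's state diagram may differ, and the mapping of counter values to the states A1, A2, B1, B2 is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Two-bit saturating counter predictor (assumed formulation).
   counter 0, 1 -> predict not taken; counter 2, 3 -> predict taken.
   A taken branch increments the counter, a not-taken branch decrements it,
   saturating at 0 and 3. */
int two_bit_mispredictions(int start_counter, const bool *taken, size_t n) {
    assert(start_counter >= 0 && start_counter <= 3);
    assert(taken != NULL || n == 0);
    int counter = start_counter;
    int misses = 0;
    for (size_t i = 0; i < n; i++) {
        bool predict_taken = (counter >= 2);
        if (predict_taken != taken[i]) {
            misses++;
        }
        if (taken[i] && counter < 3) {
            counter++;
        } else if (!taken[i] && counter > 0) {
            counter--;
        }
    }
    return misses;
}
```

On a branch that alternates taken, not taken, taken, ... this counter, started at 0, misses every other branch, whereas a predictor that flips after a single miss would be wrong every time when it starts out predicting not taken.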
\[ T(p) \ge q\,T(1) + \frac{(1 - q)\,T(1)}{p} \]
Now, suppose that the number of processors is doubled. Then
S(8) = T(1)/T(8) = 1/(0.75 + 0.25/8) = 8/6.25 = 1.28
Suppose that the number of processors is doubled again. Then
S(16) = T(1)/T(16) = 1/(0.75 + 0.25/16) = 16/12.25 ≈ 1.31
What is the limit?
S(p) = T(1)/T(p) = 1/(0.75 + 0.25/p) -> 1/0.75 ≈ 1.333 as p grows without bound.
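These values follow from one formula. The following is a minimal sketch, assuming that 0.75 is the fraction of the work that cannot be parallelized and that the remaining 0.25 divides evenly across p processors; the function name is hypothetical.

```c
#include <assert.h>

/* Amdahl speedup S(p) = 1 / (serial + (1 - serial) / p),
   where `serial_fraction` is the share of work that stays sequential. */
double amdahl_speedup(double serial_fraction, int processors) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(processors >= 1);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors);
}
```

Calling amdahl_speedup(0.75, 8) and amdahl_speedup(0.75, 16) reproduces the two figures above, and larger processor counts approach but never reach 1/0.75.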
Alternate Forms of Amdahl's Law
Example:
Suppose that you've upgraded your computer from a 2 GHz processor to a 4 GHz processor. What is the maximum speedup you expect in executing a typical program, assuming that (1) the speed of fetching each instruction is directly proportional to the speed of reading an instruction from the primary memory of your computer, and reading an instruction takes four times longer than executing it, and (2) the speed of executing each instruction is directly proportional to the clock speed of the processor of your computer?
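One way to work this example is sketched below. It assumes that fetch and execute do not overlap, that the memory speed is unchanged by the upgrade, and it introduces t (the time to execute one instruction at 2 GHz) purely for illustration.

```latex
\[
T_{\text{old}} = \underbrace{4t}_{\text{fetch}} + \underbrace{t}_{\text{execute}} = 5t,
\qquad
T_{\text{new}} = 4t + \frac{t}{2} = 4.5t
\]
\[
S = \frac{T_{\text{old}}}{T_{\text{new}}} = \frac{5t}{4.5t}
  = \frac{1}{0.8 + 0.2/2} = \frac{10}{9} \approx 1.11
\]
```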
Generalized Amdahl's Law
In general, a task may be partitioned into a set of subtasks, with each subtask requiring a designated number of processors to execute. In this case, the speedup of the parallel execution of the task over its sequential execution can be characterized by the following, more general formula:
\[ S(p_1, p_2, \ldots, p_k) = \frac{T(1)}{T(p_1, p_2, \ldots, p_k)} \]
Remark:
\[ S_e(s_1, s_2, \ldots, s_k) = \frac{1}{\dfrac{q_1}{s_1} + \dfrac{q_2}{s_2} + \cdots + \dfrac{q_k}{s_k}} \]
where q_1 + q_2 + ... + q_k = 1.
Example:
Suppose that your computer executes a program that has the following profile of execution:
(a) 30% integer operations,
(b) 20% floating-point operations,
(c) 50% memory reference instructions.
How much speedup will you expect if you double the speed of the floating-point unit of your computer? Using the formula above:
Se = 1/(0.3 + 0.2/2 + 0.5) = 1/0.9 ≈ 1.11
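The same computation can be written for any number of instruction classes. The following is a minimal sketch, assuming q[i] is the fraction of execution time spent in class i and s[i] is the factor by which that class is sped up (1.0 for classes left unchanged); the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Generalized Amdahl speedup: 1 / sum(q[i] / s[i]).
   q[i]: fraction of the original execution time spent in class i.
   s[i]: speedup factor applied to class i (1.0 means unchanged). */
double enhanced_speedup(const double *q, const double *s, size_t k) {
    assert(q != NULL && s != NULL && k > 0);
    double relative_time = 0.0;
    for (size_t i = 0; i < k; i++) {
        assert(s[i] > 0.0);
        relative_time += q[i] / s[i];
    }
    return 1.0 / relative_time;
}
```

With q = {0.3, 0.2, 0.5} and s = {1, 2, 1} this evaluates 1/(0.3 + 0.2/2 + 0.5), the expression used in the example.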
Example:
Suppose that you have a fixed budget of $500 to upgrade each of the computers in your laboratory, and you find out that the computations you perform on your computers require
(a) 40% integer operations,
(b) 60% floating-point operations.
If every dollar spent on the integer unit after $50 decreases its execution time by 2%, and if every dollar spent on the floating-point unit after $100 decreases its execution time by 1%, how would you spend the $500?
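Example (Continued):
Let x_1 denote the dollars spent on the integer unit beyond the first $50 and x_2 the dollars spent on the floating-point unit beyond the first $100. Substituting these into the generalized Amdahl's speedup expression gives:
\[ S = \frac{T(1)}{0.98^{x_1} \cdot 0.4\,T(1) + 0.99^{x_2} \cdot 0.6\,T(1)} = \frac{1}{0.98^{x_1} \cdot 0.4 + 0.99^{x_2} \cdot 0.6} \]
Example 8 (Continued):
So we maximize
\[ S = \frac{1}{0.98^{x_1} \cdot 0.4 + 0.99^{x_2} \cdot 0.6} \]
subject to x_1 + x_2 = 350,
or maximize
\[ S = \frac{1}{0.98^{x_1} \cdot 0.4 + 0.99^{350 - x_1} \cdot 0.6} \]
subject to x_1 ≤ 350.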
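The maximization can be explored numerically. The following is a minimal sketch, assuming spending is restricted to whole dollars (an assumption made here only to allow an exhaustive search) and that x1 is the amount spent on the integer unit beyond the first $50, with the remaining budget going to the floating-point unit. The function names are hypothetical.

```c
#include <assert.h>

/* base^exponent for a non-negative integer exponent, by repeated multiplication. */
double int_pow(double base, int exponent) {
    assert(exponent >= 0);
    double result = 1.0;
    for (int i = 0; i < exponent; i++) {
        result *= base;
    }
    return result;
}

/* Execution time relative to T(1) when x1 dollars of `budget` go to the
   integer unit (2% faster per dollar) and the rest to the floating-point
   unit (1% faster per dollar). The speedup is the reciprocal of this value. */
double relative_time(int x1, int budget) {
    assert(x1 >= 0 && x1 <= budget);
    int x2 = budget - x1;
    return 0.4 * int_pow(0.98, x1) + 0.6 * int_pow(0.99, x2);
}

/* Whole-dollar x1 in 0..budget that minimizes the relative time,
   which is the same as maximizing the speedup. */
int best_integer_spend(int budget) {
    assert(budget >= 0);
    int best = 0;
    for (int x1 = 1; x1 <= budget; x1++) {
        if (relative_time(x1, budget) < relative_time(best, budget)) {
            best = x1;
        }
    }
    return best;
}
```

Calling best_integer_spend(350) returns the integer-unit share of the $350 that remains after the two thresholds, and 1.0 / relative_time(x1, 350) gives the corresponding speedup.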
where Δs/s denotes the percentage change in speed.