
ENEE350 Lecture Notes-Weeks 14 and 15

Pipelining & Amdahl's Law

Pipelining is a method of processing in which a problem is divided into a number of subproblems, and the solutions of the subproblems for different instances of the problem are then overlapped.

Example: a[i] = b[i] + c[i] + d[i] + e[i] + f[i], i = 1, 2, 3, ..., n
[Figure: linear pipeline of four adders, each with delay D, fed by b[i], c[i], d[i], e[i], f[i] and producing a[1], a[2], ...]
Each adder takes time D to compute.
Computation time = 4D + (n-1)D = nD + 3D
Speed-up = 4nD/(3D + nD) -> 4 for large n.
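The speed-up expression above can be checked numerically. The following is a minimal sketch, assuming a chain of four adders with unit delay D; the function names are hypothetical and not part of the original notes.

```python
def pipelined_time(n, stages=4, delay=1.0):
    """Time for n inputs: `stages` delays to fill, then one result per delay."""
    return stages * delay + (n - 1) * delay


def sequential_time(n, stages=4, delay=1.0):
    """Time for n inputs when each one passes through all adders alone."""
    return n * stages * delay


def speedup(n, stages=4, delay=1.0):
    return sequential_time(n, stages, delay) / pipelined_time(n, stages, delay)


# The speed-up approaches the number of stages (4) as n grows.
for n in (1, 10, 100, 10_000):
    print(n, round(speedup(n), 3))
```

The delay cancels in the ratio, so the limit depends only on the number of stages.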

We can describe the computation process in a linear pipeline algorithmically. There are three distinct phases to this computation:
(a) filling the pipeline,
(b) running the pipeline in the filled state until the last input arrives, and
(c) emptying the pipeline.


(linear pipeline)

while (1) {
    resetLatches();
    clock = 0;
    // fill the pipeline: at step j only segments 0..j are active
    for (j = 0; j <= n-1; j++) {
        for (k = 0; k <= j; k++) segment(k);
        clock++;
    }
    // execute all segments until the last input arrives
    while (clock <= m) {
        for (j = 0; j <= n-1; j++) segment(j);
        clock++;
    }
    // empty the pipeline: at step j only segments j..n-1 are active
    for (j = 0; j <= n-1; j++) {
        for (k = j; k <= n-1; k++) segment(k);
        clock++;
    }
}
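The fill, run, and empty phases can also be observed in a small runnable simulation. This is only a sketch under stated assumptions: `run_linear_pipeline` and `add_first_two` are hypothetical helpers, `None` marks an empty latch (so inputs and segment results must not be `None`), and the four adder segments mirror the a[i] example rather than any particular hardware.

```python
def run_linear_pipeline(inputs, segments):
    """Clock a linear pipeline; return (outputs, number of clock cycles used)."""
    latches = [None] * len(segments)   # latches[k] holds segment k's latest result
    outputs, clock = [], 0
    feed = iter(inputs)
    while True:
        incoming = next(feed, None)    # None means no new input this clock
        stage_inputs = [incoming] + latches[:-1]
        if all(v is None for v in stage_inputs):
            break                      # input exhausted and pipeline drained
        latches = [seg(v) if v is not None else None
                   for seg, v in zip(segments, stage_inputs)]
        if latches[-1] is not None:
            outputs.append(latches[-1])
        clock += 1
    return outputs, clock


def add_first_two(values):
    """One adder segment: fold the first two operands into a partial sum."""
    return (values[0] + values[1],) + values[2:]


rows = [(1, 2, 3, 4, 5), (10, 20, 30, 40, 50), (0, 0, 0, 0, 1)]
outputs, clocks = run_linear_pipeline(rows, [add_first_two] * 4)
sums = [out[0] for out in outputs]
print(sums, clocks)
```

With 3 inputs and 4 segments the run uses 4 + (3 - 1) clock cycles, matching the computation time formula for a linear pipeline.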

Instruction pipelines:
Goal:
(i) to increase the throughput (number of instructions/sec) in executing programs,
(ii) to reduce the execution time (clock cycles/instruction, etc.).

clock     0    1    2    3    4
fetch     I1   I2   I3   I4
decode         I1   I2   I3   I4
execute             I1   I2   I3

clock      0    1    2    3    4
fetch      I1   I2   I3   I4   I5
decode          I1   I2   I3   I4
execute              I1   I2   I3
memory                    I1   I2
writeback                      I1

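Speedup of pipelined execution of instructions over a sequential execution:

$$S(5) = \frac{CPI_u \, N_u \, f_5}{CPI_p \, N_p \, f_1}$$

where the subscripts u and p denote the unpipelined and pipelined machines, N is the number of instructions executed, and f_1 and f_5 are the clock rates of the unpipelined and 5-stage pipelined machines.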


Assuming that the systems operate at the same clock rate and use the same number of operations:

$$S(5) = \frac{CPI_u}{CPI_p}$$

Example
Suppose that the instruction mix of programs executed on a serial and a pipelined machine is 40% ALU, 20% branching, and 40% memory, with 4, 2, and 4 cycles per instruction in the three classes, respectively.
Then, under ideal conditions (no stalls due to hazards):

$$S(5) = \frac{CPI_u}{CPI_p} = \frac{4 \times 0.4 + 2 \times 0.2 + 4 \times 0.4}{1} = 3.6$$


If the clock period needs to be increased (that is, the clock rate lowered) for the pipeline implementation, then the speedup will have to be scaled down accordingly.
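The average CPI and the resulting ideal speedup can be computed directly from an instruction mix. The sketch below uses the mix from the example; the function names and the `clock_ratio` parameter (pipelined clock rate divided by unpipelined clock rate) are hypothetical additions for illustration.

```python
def average_cpi(mix):
    """mix is a list of (fraction of instructions, cycles per instruction)."""
    return sum(fraction * cycles for fraction, cycles in mix)


def pipeline_speedup(cpi_unpipelined, cpi_pipelined=1.0, clock_ratio=1.0):
    """Speedup with equal instruction counts; clock_ratio scales for clock changes."""
    return cpi_unpipelined / cpi_pipelined * clock_ratio


# 40% ALU (4 cycles), 20% branching (2 cycles), 40% memory (4 cycles)
mix = [(0.4, 4), (0.2, 2), (0.4, 4)]
cpi_u = average_cpi(mix)
print(cpi_u, pipeline_speedup(cpi_u))
# A pipelined clock running at 90% of the original rate scales the speedup down.
print(pipeline_speedup(cpi_u, clock_ratio=0.9))
```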

MIPS Pipeline

Register operations:         IF  ID  EX  WB
Register/Memory operations:  IF  ID  EX  ME  WB

Instruction Pipelines (Hennessy & Patterson)

Hazards
1 Structural Hazards
2 Data Hazards
3 Control Hazards

Structural Hazards: They arise when limited resources are scheduled to operate concurrently on different streams during the same clock period.

Example: Memory conflict (data fetch + instruction fetch) or datapath conflict (arithmetic operation + PC update)
Clock   IF   ID   EX   ME   WB
0       I1
1       I2   I1
2       I3   I2   I1
3       I4   I3   I2   I1
4       I5   I4   I3   I2   I1
5       I6   I5   I4   I3   I2
6       I7   I6   I5   I4   I3

Fix:
Duplicate hardware (too expensive)
Stall the pipeline (serialize the operation) (too slow)
Clock   IF   ID   EX   ME   WB
0       I1
1       I2   I1
2            I2   I1
3                 I2   I1
4       I3             I2   I1
5       I4   I3             I2
6            I4   I3
7                 I4   I3
8       I5             I4   I3
9       I6   I5             I4

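$$S = \frac{T_{serial}}{T_{pipeline}} = \begin{cases} \dfrac{5n t_s}{2n t_s + 2 t_s}, & n \text{ odd} \\[2ex] \dfrac{5n t_s}{2n t_s + t_s}, & n \text{ even} \end{cases} \;\to\; \frac{5}{2} \text{ as the number of instructions } n \to \infty$$

Thus, we lose half the throughput due to stalls.
Note: The pipeline time of execution can be computed using the recurrences
T_1 = 4
T_i = T_{i-1} + 1 for even i
T_i = T_{i-1} + 3 for odd i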
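The recurrence can be unrolled mechanically to check the closed forms. This is a sketch, assuming T_n is measured in units of t_s exactly as in the recurrence above; the function names are hypothetical. Unrolling gives T_n = 2n + 2 for odd n and T_n = 2n + 1 for even n.

```python
def completion_time(n):
    """T_n from the recurrence: T_1 = 4, +1 for even i, +3 for odd i."""
    t = 4
    for i in range(2, n + 1):
        t += 1 if i % 2 == 0 else 3
    return t


def stalled_speedup(n, stages=5):
    """Serial time (stages * n) over the stalled pipeline time T_n."""
    return stages * n / completion_time(n)


for n in (1, 2, 3, 4, 101, 10_001):
    print(n, completion_time(n), round(stalled_speedup(n), 4))
```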

Data Hazards
They occur when the executions of two instructions may result in the incorrect reading of operands and/or writing of a result.
Read After Write (RAW) Hazard (Data Dependency)
Write After Read (WAR) Hazard (Data Antidependency)
Write After Write (WAW) Hazard (Output Dependency)

RAW Hazards
They occur when reads are early and writes are late.

Clock   0    1    2    3           4            5    6
IF      I1   I2   I3   I4          I5           I6   I7
ID           I1   I2   I3          I4           I5   I6
EX                I1   I2 (Read)   I3           I4   I5
ME                     I1          I2           I3   I4
WB                                 I1 (Write)   I2   I3

I1: R1 = R1 + R2
I2: R3 = R1 + R2

RAW Hazards (Contd)
They can be avoided by stalling the reads, but this increases the execution time. A better approach is to use data forwarding:

Clock   0    1    2    3      4            5    6
IF      I1   I2   I3   I4     I5           I6   I7
ID           I1   I2   I3     I4           I5   I6
EX                I1   Read   Read         I4   I5
ME                     I1     I2           I3   I4
WB                            I1 (Write)   I2   I3

I1: R1 = R1 + R2
I2: R3 = R1 + R2

WAR Hazards
They occur when writes are early and reads are late.
[Figure: pipeline timing for clocks 0-6 with stages IF, ID, EX, ME, WB, EX, ME, WB and instructions I1-I7, with the early Write and the late Read marked.]

I1: R2 = R2 + R3; R9 = R3 + R4
I2: R3 = R7 + R5; R6 = R2 + R8

Branch Prediction in Pipeline Instruction Sequencing
One of the major issues in pipelined instruction processing is to schedule conditional branch instructions.
When a pipeline controller encounters a conditional branch instruction, it has a choice to decode it into one of two instruction streams.
If the branch condition is met, then the execution continues from the target of the conditional branch instruction; otherwise, it continues with the instruction that follows the conditional branch instruction.

Example: Suppose that we execute the following assembly code on a 5-stage pipeline (IF, ID, EX, ME, WB):
      JCD R0 < 10, add;
      SUB R0, R1;
      JMPD, halt;
add:  ADD R0, R1;
halt: HLT;
If we assume that R0 < 10, then the SUB instruction would have been incorrectly fetched during the second clock cycle, and we will need another fetch cycle to fetch the ADD instruction.

Classification of branch prediction algorithms
Static Branch Prediction: The branch decision does not change over time; we use a fixed branching policy.
Dynamic Branch Prediction: The branch decision does change over time; we use a branching policy that varies over time.

Static Branch Prediction Algorithms
1 Don't predict (stall the pipeline)
2 Never take the branch
3 Always take the branch
4 Delayed branch

1 Stall the pipeline by 1 clock cycle: This allows us to determine the target of the branch instruction.

JCD         IF   ID   EX   ME   WB
SUB / ADD        --   IF   ID   EX   ME   WB

Stall and decide the branch.

Pipeline Execution Speed (stall case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as
CPI of the pipeline = CPI of ideal pipeline + the number of idle cycles/instruction
                    = 1 + branch penalty × branch frequency
                    = 1 + branch frequency
In general, CPI of the pipeline > 1 + branch frequency because of data and possibly structural hazards.
Pros: Straightforward to implement.
Cons: The time overhead is high when the instruction mix includes a high percentage of branch instructions.

2 Never take the branch. The instruction in the pipeline is flushed if it is determined that the branch should have been taken after the ID stage is carried out.

JCD   IF   ID   EX   ME   WB
SUB        IF   ID   EX   ME   WB
IOR             IF   ID   EX   ME   WB    (branch not taken)
XOR             IF   ID   EX   ME   WB    (branch taken, SUB flushed)

The SUB instruction is always fetched, and then either the IOR instruction is executed next, or SUB is flushed and XOR is executed.

Pipeline Execution Speed (Never take the branch case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as
CPI of the pipeline = CPI of ideal pipeline + the number of idle cycles/instruction
                    = 1 + branch penalty × branch frequency × misprediction rate
                    = 1 + branch frequency × misprediction rate
Pros: If the prediction is highly accurate, then the pipeline can operate close to its full throughput.
Cons: Implementation is not as straightforward and requires flushing if decoding the branch address takes more than 1 clock cycle.

3 Always take the branch. The instruction in the pipeline is flushed if it is determined that the branch should not have been taken after the ID stage is carried out.

[Figure: pipeline timing of JCD, SUB, IOR, and XOR, each passing through IF, ID, EX, ME, WB, with the branch address computation marked.]

Pipeline Execution Speed (Always take the branch case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as
CPI of the pipeline = CPI of ideal pipeline + the number of idle cycles/instruction
                    = 1 + branch penalty × branch frequency × prediction rate
                        + branch penalty × branch frequency × misprediction rate
                    = 1 + branch frequency × prediction rate
                        + 2 × branch frequency × misprediction rate
Pros: Better suited for the execution of loops without the compiler's intervention (but this can generally be overcome, see the next slide).
Cons: Implementation is not as straightforward, and has a higher misprediction penalty. Not as advantageous as not taking the branch, since the branch address computation is not completed until after the EX segment is carried out.
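The three CPI expressions for the static policies can be compared side by side. This is a sketch only: the function names are hypothetical, the penalties (1 cycle for a stall or a never-take misprediction, 1 and 2 cycles for a correct and an incorrect always-take prediction) follow the formulas above, and the 20% branch frequency and 30% misprediction rate are illustrative values rather than measurements.

```python
def cpi_stall(branch_freq, penalty=1):
    """Always stall: every branch costs the penalty."""
    return 1 + penalty * branch_freq


def cpi_never_take(branch_freq, mispredict_rate, penalty=1):
    """Predict not taken: only mispredicted branches cost the penalty."""
    return 1 + penalty * branch_freq * mispredict_rate


def cpi_always_take(branch_freq, mispredict_rate, hit_penalty=1, miss_penalty=2):
    """Predict taken: correct predictions still wait for the target address."""
    predict_rate = 1 - mispredict_rate
    return 1 + branch_freq * (hit_penalty * predict_rate
                              + miss_penalty * mispredict_rate)


branch_freq, mispredict_rate = 0.2, 0.3   # illustrative values
print(cpi_stall(branch_freq))
print(cpi_never_take(branch_freq, mispredict_rate))
print(cpi_always_take(branch_freq, mispredict_rate))
```

Under these penalties, always-take never beats stalling, which is consistent with the note that the branch address is not available early enough.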

Example: for (i = 0; i < 10; i++) a[i] = a[i] + 1;

Branch always will not work well without compiler's help:
      CLR R0;
loop: JCD R0 >= 10, exit;
      LDD R1, R0;
      ADD R1, 1;
      ST+ R1, R0;
      JMPD, loop;
exit:

Branch always will work well without compiler's help:
      CLR R0;
loop: LDD R1, R0;
      ADD R1, 1;
      ST+ R1, R0;
      JCD R0 < 10, loop;

4 Delayed branch: Insert an instruction after a branch instruction, and always execute it whether or not the branch condition applies. Of course, this must be an instruction that can be executed without any side effects on the correctness of the program.
Pros: Pipeline is never stalled or flushed, and with the correct choice of branch delay slot instruction, performance can approach that of an ideal pipeline.
Cons: It is not always possible to find a delay slot instruction, in which case a NOP instruction may have to be inserted into the delay slot to make sure that the program's integrity is not violated. It makes compilers work harder.

Which instruction to place into the delayed branch slot?
4.1 Choose an instruction before the branch, but make sure that the branch does not depend on the moved instruction. If such an instruction can be found, this always pays off.
Example:
      ADD R1, R2;
      JCD R2 > 10, exit;
can be rescheduled as
      JCD R2 > 10, exit;
      ADD R1, R2;   (Delay slot)

4.2 Choose an instruction from the target of the branch, but make sure that the moved instruction is executable when the branch is not taken.
Example:
      ADD R1, R2;
      JCD R2 > 10, sub;
      JMPD, add;
      ...
sub:  SUB R4, R5;
add:  ADI R3, 5;
can be rescheduled as
      ADD R1, R2;
      JCD R2 > 10, sub;
      ADI R3, 5;   (Delay slot)
      ...
sub:  SUB R4, R5;

4.3 Choose an instruction from the antitarget of the branch, but make sure that the moved instruction is executable when the branch is taken.
Example:
      //ADD R3, R2;
      JCD R2 > 10, exit;
      ADD R3, R2;
exit: SUB R4, R5;
      //ADD R4, R3;
can be rescheduled as
      ADD R1, R2;
      JCD R2 > 10, exit;
      ADD R3, R2;   (Schedule for execution if it does not alter the program flow or output)
exit: SUB R4, R5;

Dynamic Branch Prediction
Dynamic branch prediction relies on the history of how branch conditions were resolved in the past.
History of branches is kept in a buffer. To keep this buffer reasonably small and easy to access, the buffer is indexed by some fixed number of lower-order bits of the address of the branch instruction.
The assumption is that the address values in the lower address field are unique enough to prevent frequent collisions or overrides. Thus, if we are trying to predict branches in a program which remains within a block of 256 locations, 8 bits should suffice.

[Figure: memory locations x, x+1, ..., x+256 containing two JCD instructions.]

Branch instructions in the instruction cache include a branch prediction field that is used to predict if the branch should be taken.

[Figure: program in memory locations x, x+4, x+8, x+12, x+16, x+20; three of the entries are branch instructions, with branch prediction fields 0 (branch was not taken), 0 (branch was not taken), and 1 (branch was taken).]

Branch prediction:
In the simplest case, the field is a 1-bit tag:
0 <=> branch was not taken last time (State A)
1 <=> branch was taken last time (State B)

State transitions:
A --not taken--> A
A --taken-->     B
B --taken-->     B
B --not taken--> A

While in state A, predict the branch as not to be taken.
While in state B, predict the branch as to be taken.

This works relatively well: It accurately predicts the branches in loops in all but two of the iterations.
      CLR R0;
loop: LDD R1, R0;
      ADD R1, 1;
      ST+ R1, R0;
      JCD R0 < 10, loop;
Assuming that we begin in state A, prediction fails
when R0 = 1 (branch is not taken when it should be)
and R0 = 10 (branch is taken when it should not be).
Assuming that we begin in state B, prediction fails
when R0 = 10 (branch is taken when it should not be).

We can modify the loop to make the branch prediction algorithm fail twice when we begin in state B as well.
      CLR R0;
loop: LDD R1, R0;
      ADD R1, 1;
      ST+ R1, R0;
      JCD R0 >= 10, exit;
      JMPD, loop;
exit:

Assuming that we begin in state B, prediction fails
when R0 = 1 (branch is taken when it should not be)
and R0 = 10 (branch is not taken when it should be).

What is worse is that we can make this branch prediction algorithm fail each time it makes a prediction:
      LDI R0, 1;
loop: JCD R0 > 0, neg;
      LDI R0, 1;
      JMPD, loop;
neg:  LDI R0, -1;
      JMPD, loop;

Assuming that we begin in state A, prediction fails
when R0 = 1 (branch is not taken when it should be),
R0 = -1 (branch is taken when it should not be),
R0 = 1 (branch is not taken when it should be),
R0 = -1 (branch is taken when it should not be),
and so on.
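The failure counts claimed for the 1-bit scheme can be reproduced with a short simulation. This is a sketch under stated assumptions: `one_bit_misses` is a hypothetical helper, each program is reduced to its sequence of branch outcomes, ST+ is assumed to post-increment R0 so the loop branch is evaluated for R0 = 1, ..., 10, and the last program is assumed to alternate between taken and not taken.

```python
def one_bit_misses(outcomes, state="A"):
    """Count mispredictions; state A predicts not taken, state B predicts taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = (state == "B")
        if predicted_taken != taken:
            misses += 1
        state = "B" if taken else "A"   # remember only the last outcome
    return misses


loop_bottom_test = [True] * 9 + [False]   # JCD R0 < 10, loop
loop_exit_test = [False] * 9 + [True]     # JCD R0 >= 10, exit
alternating = [True, False] * 5           # JCD R0 > 0, neg with R0 flipping sign

print(one_bit_misses(loop_bottom_test, "A"), one_bit_misses(loop_bottom_test, "B"))
print(one_bit_misses(loop_exit_test, "B"))
print(one_bit_misses(alternating, "A"), "of", len(alternating))
```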

2-bit prediction (A more reluctant flip in decision)

[Figure: state diagram with states A1, A2, B2, B1 and transitions labeled taken / not taken.]

While in states A1 and A2, predict the branch as not to be taken.
While in states B1 and B2, predict the branch as to be taken.

      CLR R0;
loop: LDD R1, R0;
      ADD R1, 1;
      ST+ R1, R0;
      JCD R0 < 10, loop;

Assuming that we begin in state A1, prediction fails
when R0 = 1, 2 (branch is not taken when it should be)
and R0 = 10 (branch is taken when it should not be).
Assuming that we begin in state B1, prediction fails
when R0 = 10 (branch is taken when it should not be).

2-bit predictors are more resilient to branch inversions (predictions are reversed when they are missed twice):
      LDI R0, 1;
loop: JCD R0 > 0, neg;
      LDI R0, 1;
      JMPD, loop;
neg:  LDI R0, -1;
      JMPD, loop;

Assuming that we begin in state B1, prediction
succeeds when R0 = 1 (branch is taken when it should be),
fails when R0 = -1 (branch is taken when it should not be),
succeeds when R0 = 1 (branch is taken when it should be),
fails when R0 = -1 (branch is taken when it should not be),
and so on.
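The 2-bit behaviour described above can be simulated in the same way. This is a sketch, not the notes' exact state machine: it assumes a saturating-counter arrangement of the states (A1, A2, B2, B1, moving one step toward B1 on a taken branch and one step toward A1 otherwise), which reproduces the failure counts stated in the text; the helper name and the outcome sequences are hypothetical.

```python
# States ordered from strongest "not taken" to strongest "taken" (assumed layout).
STATES = ["A1", "A2", "B2", "B1"]


def two_bit_misses(outcomes, state="A1"):
    """Count mispredictions of a 2-bit saturating predictor."""
    level = STATES.index(state)
    misses = 0
    for taken in outcomes:
        predicted_taken = level >= 2          # B2 and B1 predict taken
        if predicted_taken != taken:
            misses += 1
        level = min(level + 1, 3) if taken else max(level - 1, 0)
    return misses


loop_bottom_test = [True] * 9 + [False]   # JCD R0 < 10, loop for R0 = 1..10
alternating = [True, False] * 5           # branch outcome flips every time

print(two_bit_misses(loop_bottom_test, "A1"), two_bit_misses(loop_bottom_test, "B1"))
print(two_bit_misses(alternating, "B1"), "of", len(alternating))
```

On the alternating pattern the predictor started in B1 misses only every other branch, whereas a 1-bit predictor can miss every time.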

Amdahl's Law (Fixed Load Speedup)
Let q be the fraction of a load L that cannot be sped up by introducing more processors, and let T(p) be the amount of time it takes to execute L on p processors by a linear work function, p > 1. Then

$$T(p) > qT(1) + \frac{(1-q)T(1)}{p}$$

$$S(p) = \frac{T(1)}{T(p)} < \frac{1}{q + \frac{1-q}{p}} \to \frac{1}{q} \text{ as } p \to \infty$$

All this means is that the maximum speedup of a system is limited by the fraction of the work that must be completed sequentially. Thus, the execution time of the work using p processors can be reduced to qT(1) under the best of circumstances, and the speedup cannot exceed 1/q.

Example
A 4-processor computer executes instructions that are fetched from a random access memory over a shared bus as shown below:

[Figure: four processors connected to a random access memory over a shared bus, with timing labels in microseconds.]

The task to be performed is divided into two parts:
1. Fetch instruction (serial part): it takes 30 microseconds.
2. Execute instruction (parallel part): it takes 10 microseconds to execute.
S(4) = T(1)/T(4) = 1/(0.75 + 0.25/4) = 4/3.25 = 1.23

Now, suppose that the number of processors is doubled. Then
S(8) = T(1)/T(8) = 1/(0.75 + 0.25/8) = 8/6.25 = 1.28

Suppose that the number of processors is doubled again. Then
S(16) = T(1)/T(16) = 1/(0.75 + 0.25/16) = 16/12.25 = 1.31

What is the limit?
S(p) = T(1)/T(p) = 1/(0.75 + 0.25/p) -> 1/0.75 = 1.333 as p -> infinity.
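These values are easy to reproduce. The following is a minimal sketch of the fixed-load bound, with q derived from the 30 and 10 microsecond figures of the example; the function name `amdahl_speedup` is a hypothetical choice.

```python
def amdahl_speedup(q, p):
    """Upper bound on speedup with serial fraction q and p processors."""
    return 1.0 / (q + (1.0 - q) / p)


fetch_us, execute_us = 30, 10
q = fetch_us / (fetch_us + execute_us)   # serial fraction, 0.75 here

for p in (4, 8, 16, 1024):
    print(p, round(amdahl_speedup(q, p), 3))
print("limit:", round(1 / q, 3))
```

Doubling the processor count repeatedly barely moves the result, because the bound saturates at 1/q.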

Alternate Forms of Amdahl's Law

$$S = \frac{T(1)}{T_{unenhanced} + T_{enhanced}} = \frac{T(1)}{T(1)\left(q + \frac{1-q}{s}\right)} \to \frac{1}{q} \text{ as } s \to \infty$$

where s is the speedup of the computation that can be enhanced.

Example:
Suppose that you've upgraded your computer from a 2 GHz processor to a 4 GHz processor. What is the maximum speedup you expect in executing a typical program, assuming that (1) the speed of fetching each instruction is directly proportional to the speed of reading an instruction from the primary memory of your computer, and reading an instruction takes four times longer than executing it, and (2) the speed of executing each instruction is directly proportional to the clock speed of the processor of your computer?

Using Amdahl's Law with q = 0.8 and s = 2, we have
S = 2/(0.2 + 0.8 × 2) = 1.111
Very disappointing, as you are likely to have paid quite a bit of money for the upgrade!

Generalized Amdahl's Law

In general, a task may be partitioned into a set of subtasks, with each subtask requiring a designated number of processors to execute. In this case, the speedup of the parallel execution of the task over its sequential execution can be characterized by the following, more general formula:

$$S(p_1, p_2, \ldots, p_k) = \frac{T(1)}{T(p_1, p_2, \ldots, p_k)} < \frac{T(1)}{\frac{q_1 T(1)}{p_1} + \frac{q_2 T(1)}{p_2} + \cdots + \frac{q_k T(1)}{p_k}} = \frac{1}{\frac{q_1}{p_1} + \frac{q_2}{p_2} + \cdots + \frac{q_k}{p_k}}$$

where $q_1 + q_2 + \cdots + q_k = 1$.

When k = 2, q1 = q, q2 = 1 - q, p1 = 1, p2 = p, this formula reduces to Amdahl's Law.

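Remark:

The generalized Amdahl's Law can also be rewritten to express the speedup due to different amounts of speed enhancement (Se) that can be made to different parts of a system:

$$S_e(s_1, s_2, \ldots, s_k) = \frac{T(1)}{T(s_1, s_2, \ldots, s_k)} < \frac{T(1)}{\frac{q_1 T(1)}{s_1} + \frac{q_2 T(1)}{s_2} + \cdots + \frac{q_k T(1)}{s_k}} = \frac{1}{\frac{q_1}{s_1} + \frac{q_2}{s_2} + \cdots + \frac{q_k}{s_k}}$$

where $q_1 + q_2 + \cdots + q_k = 1$.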

Example:
Suppose that your computer executes a program that has the following profile of execution:
(a) 30% integer operations,
(b) 20% floating-point operations,
(c) 50% memory reference instructions.
How much speedup will you expect if you double the speed of the floating-point unit of your computer? Using the formula above:

Se = 1/(0.3 + 0.2/2 + 0.5) = 1.11
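The generalized bound is a one-line computation once the fractions and enhancement factors are listed. This is a sketch with a hypothetical function name; it checks that the fractions sum to 1, as the formula requires.

```python
def generalized_speedup(fractions, factors):
    """Bound 1 / sum(q_i / s_i) for work fractions q_i sped up by factors s_i."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return 1.0 / sum(q / s for q, s in zip(fractions, factors))


# Integer, floating-point, memory; only the floating-point unit is doubled.
print(round(generalized_speedup([0.3, 0.2, 0.5], [1, 2, 1]), 3))
# With k = 2 and factors (1, s) this is the two-part form of Amdahl's Law.
print(round(generalized_speedup([0.8, 0.2], [1, 2]), 3))
```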

Example:
Suppose that you have a fixed budget of $500 to upgrade each of the computers in your laboratory, and you find out that the computations you perform on your computers require
(a) 40% integer operations,
(b) 60% floating-point operations.
If every dollar spent on the integer unit after $50 decreases its execution time by 2%, and if every dollar spent on the floating-point unit after $100 decreases its execution time by 1%, how would you spend the $500?

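Example (Continued):
Let x1 and x2 be the dollars spent on the integer and floating-point units beyond their fixed costs of $50 and $100, so that x1 + x2 = 500 - 50 - 100 = 350.

$$S = \frac{T(1)}{T_i(x_1) + T_f(x_2)}, \quad \text{where } x_1 + x_2 = 350$$

$$T_i(x_1) = (1 - 0.02)\,T_i(x_1 - 1) \;\Rightarrow\; T_i(x_1) = 0.98^{x_1}\,T_i(0)$$

$$T_f(x_2) = (1 - 0.01)\,T_f(x_2 - 1) \;\Rightarrow\; T_f(x_2) = 0.99^{x_2}\,T_f(0)$$

$$T_i(0) = 0.4\,T(1), \qquad T_f(0) = 0.6\,T(1)$$

Substituting these into the generalized Amdahl's speedup expression gives:

$$S = \frac{T(1)}{0.98^{x_1} \cdot 0.4\,T(1) + 0.99^{x_2} \cdot 0.6\,T(1)} = \frac{1}{0.98^{x_1} \cdot 0.4 + 0.99^{x_2} \cdot 0.6}$$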

Example (Continued):
So we maximize

$$\frac{1}{0.98^{x_1} \cdot 0.4 + 0.99^{x_2} \cdot 0.6}$$

subject to x1 + x2 = 350,

or maximize

$$\frac{1}{0.98^{x_1} \cdot 0.4 + 0.99^{350 - x_1} \cdot 0.6}$$

subject to 0 <= x1 <= 350.

Example (Continued):
Computing the values in the neighborhood of 120 reveals that the speedup is maximized when x1 = 126.
From Mathematica:
Table[1/(0.4*0.98^x + 0.6*0.99^(350-x)), {x, 120, 128, 1}]
{10.5398, 10.5518, 10.5616, 10.5691, 10.5744, 10.5776, 10.5785, 10.5773, 10.574}
Note: It is possible to have higher speedup with all of the money invested in one of the units if the fixed cost for one of the units becomes sufficiently large.
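The same search can be done exhaustively over every whole-dollar split. This is a sketch mirroring the Mathematica expression above; the function name and the `budget` parameter (the $350 left after the fixed costs) are hypothetical.

```python
def upgrade_speedup(x1, budget=350):
    """Speedup when x1 dollars go to the integer unit and the rest to floating point."""
    x2 = budget - x1
    return 1.0 / (0.4 * 0.98 ** x1 + 0.6 * 0.99 ** x2)


# Try every whole-dollar allocation and keep the best one.
best_x1 = max(range(0, 351), key=upgrade_speedup)
print(best_x1, round(upgrade_speedup(best_x1), 4))

# Values around the optimum, for comparison with the table above.
print([round(upgrade_speedup(x), 4) for x in range(120, 129)])
```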

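Addendum:
If the changes in performance due to upgrades are specified in terms of speed rather than time, we can then use the following formulation:

$$t = \frac{L}{s}$$

$$\frac{\partial t}{\partial x} = L\,\frac{\partial (1/s)}{\partial x} = -\frac{L}{s^2}\,\frac{\partial s}{\partial x}$$

$$\Delta t = -\frac{L}{s}\,\frac{\Delta s}{s} = -t\,\frac{\Delta s}{s}$$

$$\Delta t = T(x) - T(x-1) = -T(x-1)\,\frac{\Delta s}{s}$$

$$T(x) = \left(1 - \frac{\Delta s}{s}\right) T(x-1)$$

where $\frac{\Delta s}{s}$ denotes the percentage change in speed.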
