
Review of Instruction Sets, Pipelines, and Caches

CSE 7381/5381

Pipelining: It's Natural!


Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
Washer takes 30 minutes
Dryer takes 40 minutes
Folder takes 20 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A-D run one after another, each taking 30 + 40 + 20 minutes]

Sequential laundry takes 6 hours for 4 loads.
If they learned pipelining, how long would laundry take?

Pipelined Laundry

Start work ASAP.

[Figure: timeline from 6 PM to about 9:30 PM; loads A-D overlap, limited by the 40-minute dryer stage]

Pipelined laundry takes 3.5 hours for 4 loads.
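
As a quick check of the numbers above, here is a small Python sketch (my own, not from the slides) that computes the sequential and pipelined laundry times:

# Stage times in minutes: washer, dryer, folder.
stages = [30, 40, 20]
loads = 4

# Sequential: every load runs all stages before the next load starts.
sequential = loads * sum(stages)                     # 4 * 90 = 360 min

# Pipelined: the first load takes the full sum; after that, each additional
# load is limited by the slowest stage (the 40-minute dryer).
pipelined = sum(stages) + (loads - 1) * max(stages)  # 90 + 3*40 = 210 min

print(sequential / 60, "hours sequential")   # 6.0
print(pipelined / 60, "hours pipelined")     # 3.5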

Pipelining Lessons
[Figure: the pipelined laundry timeline again, used to illustrate the lessons below]

Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
Pipeline rate is limited by the slowest pipeline stage.
Multiple tasks operate simultaneously.
Potential speedup = number of pipe stages.
Unbalanced lengths of pipe stages reduce speedup.
Time to fill the pipeline and time to drain it reduce speedup.

Computer Pipelines
Computers execute billions of instructions, so throughput is what matters.
DLX desirable features: all instructions are the same length, register fields are in the same place in the instruction format, and memory operands appear only in loads and stores.


5 Steps of DLX Datapath


Figure 3.1, Page 130
Stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back.

[Figure: unpipelined DLX datapath; temporary registers include IR (instruction register) and LMD (load memory data)]

Pipelined DLX Datapath


Figure 3.4, page 137
Stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, Write Back.

Data stationary control: local decode for each instruction phase / pipeline stage.

Visualizing Pipelining
Figure 3.3, Page 133
[Figure: pipeline diagram with time (clock cycles) across and instruction order down; each instruction occupies one stage per cycle, shifted one cycle from the instruction above it]
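
As a rough sketch (not part of the course materials), the following Python prints the kind of diagram Figure 3.3 shows, with clock cycles across and instruction order down:

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instructions):
    cycles = n_instructions + len(STAGES) - 1
    for i in range(n_instructions):
        row = []
        for c in range(cycles):
            stage = c - i                      # instruction i enters IF in cycle i
            row.append(STAGES[stage] if 0 <= stage < len(STAGES) else "  ")
        print(f"Instr {i+1}: " + " ".join(f"{s:>3}" for s in row))

# Each printed row shows one instruction, shifted one clock cycle to the right
# of the instruction above it.
pipeline_diagram(4)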

It's Not That Easy for Computers

Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
Structural hazards: the HW cannot support this combination of instructions (a single person to fold and put clothes away).
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (the missing sock).
Control hazards: pipelining of branches and other instructions that change the PC; the common solution is to stall the pipeline until the hazard is resolved, inserting bubbles into the pipeline.

One Memory Port/Structural Hazards


Figure 3.6, Page 142

[Figure: pipeline diagram of a Load followed by Instr 1-4 with a single memory port; the Load's MEM access and a later instruction's fetch need memory in the same cycle]

One Memory Port/Structural Hazards


Figure 3.7, Page 143

[Figure: the same sequence, resolved by stalling Instr 3's fetch one cycle so it does not conflict with the Load's memory access]

Speed Up Equation for Pipelining

CPI_pipelined = Ideal CPI + Pipeline stall clock cycles per instruction

Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI) x (Clock Cycle_unpipelined / Clock Cycle_pipelined)

If the ideal CPI is 1:

Speedup = Pipeline depth / (1 + Pipeline stall CPI) x (Clock Cycle_unpipelined / Clock Cycle_pipelined)

Example: Dual-port vs. Single-port


Machine A: dual-ported memory.
Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate.
Ideal CPI = 1 for both; loads are 40% of instructions executed.

SpeedupA = Pipeline depth / (1 + 0) x (Clock_unpipelined / Clock_pipelined) = Pipeline depth
SpeedupB = Pipeline depth / (1 + 0.4 x 1) x (Clock_unpipelined / (Clock_unpipelined / 1.05)) = (Pipeline depth / 1.4) x 1.05 = 0.75 x Pipeline depth
SpeedupA / SpeedupB = Pipeline depth / (0.75 x Pipeline depth) = 1.33

Machine A is 1.33 times faster.
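
The arithmetic can be checked with a short Python sketch (an illustration, not from the slides); note that the pipeline depth cancels out of the final ratio:

depth = 5            # any pipeline depth; it cancels out of the ratio
load_frac = 0.40     # fraction of instructions that are loads
stall_per_load = 1   # structural-hazard stall per load on machine B
clock_gain_b = 1.05  # machine B's faster clock

speedup_a = depth / (1 + 0)                              # = depth
speedup_b = depth / (1 + load_frac * stall_per_load) * clock_gain_b

print(speedup_a / speedup_b)   # ~1.33: machine A is 1.33x faster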


Data Hazard on R1
Figure 3.9, page 147

[Figure: pipeline diagram (IF ID/RF EX MEM WB) for the sequence below; the instructions after the add all use r1, which the add does not write until its WB stage]

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11

Three Generic Data Hazards


InstrI followed by InstrJ

Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it.

Three Generic Data Hazards


InstrI followed by InstrJ

Write After Read (WAR): InstrJ tries to write an operand before InstrI reads it, so InstrI gets the wrong operand.

Can't happen in the DLX 5-stage pipeline because:
all instructions take 5 stages,
reads are always in stage 2, and
writes are always in stage 5.

Three Generic Data Hazards


InstrI followed by InstrJ

Write After Write (WAW): InstrJ tries to write an operand before InstrI writes it, leaving the wrong result (InstrI's, not InstrJ's).

Can't happen in the DLX 5-stage pipeline because:
all instructions take 5 stages, and
writes are always in stage 5.

We will see WAR and WAW in later, more complicated pipes.
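
As a small illustration (not from the text), the three hazard classes can be detected by comparing the registers an earlier instruction I and a later instruction J read and write; a Python sketch:

# 'dst' is the register an instruction writes, 'srcs' the registers it reads.
def classify_hazards(dst_i, srcs_i, dst_j, srcs_j):
    hazards = []
    if dst_i is not None and dst_i in srcs_j:
        hazards.append("RAW")          # J reads what I writes
    if dst_j is not None and dst_j in srcs_i:
        hazards.append("WAR")          # J writes what I reads
    if dst_i is not None and dst_i == dst_j:
        hazards.append("WAW")          # both write the same register
    return hazards

# add r1,r2,r3 followed by sub r4,r1,r3 -> ['RAW']
print(classify_hazards("r1", {"r2", "r3"}, "r4", {"r1", "r3"}))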

Forwarding to Avoid Data Hazard


Figure 3.10, Page 149
[Figure: the same sequence, with forwarding paths from the add's ALU-output and memory-stage pipeline registers back to the ALU inputs of the later instructions]

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11

HW Change for Forwarding


Figure 3.20, Page 161
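
The core of the forwarding hardware is a comparison of the destination registers held in the EX/MEM and MEM/WB pipeline registers against the source registers of the instruction currently in EX. A minimal Python sketch of that selection logic (my own illustration of the idea, not the exact logic in Figure 3.20):

def forward_select(src_reg, ex_mem_dst, mem_wb_dst):
    """Return where the ALU input for src_reg should come from."""
    if src_reg is not None and src_reg == ex_mem_dst:
        return "EX/MEM"        # forward the just-computed ALU result
    if src_reg is not None and src_reg == mem_wb_dst:
        return "MEM/WB"        # forward the value about to be written back
    return "REGFILE"           # no hazard: read the register file normally

# add r1,r2,r3 ; sub r4,r1,r3 -> sub's first operand comes from EX/MEM
print(forward_select("r1", ex_mem_dst="r1", mem_wb_dst=None))  # EX/MEM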


Data Hazard Even with Forwarding


Figure 3.12, Page 153

[Figure: the lw's data is not available until the end of its MEM cycle, too late to forward back to the sub's EX cycle]

lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9

Data Hazard Even with Forwarding


Figure 3.13, Page 154

[Figure: the same sequence with a one-cycle stall inserted so the lw result can be forwarded to the sub]

lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
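
The load interlock that inserts this stall can be expressed as a simple check; the following Python sketch is my own illustration of the idea, not the textbook's exact control logic:

# If the instruction in EX is a load and its destination matches a source
# register of the instruction in ID, stall one cycle so the loaded value can
# then be forwarded.
def need_load_stall(ex_is_load, ex_dst, id_srcs):
    return ex_is_load and ex_dst is not None and ex_dst in id_srcs

# lw r1,0(r2) in EX while sub r4,r1,r6 is in ID -> stall one cycle
print(need_load_stall(True, "r1", {"r1", "r6"}))   # True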

Software Scheduling to Avoid Load Hazards


Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
LW  Rb,b
LW  Rc,c
ADD Ra,Rb,Rc
SW  a,Ra
LW  Re,e
LW  Rf,f
SUB Rd,Re,Rf
SW  d,Rd

Fast code:
LW  Rb,b
LW  Rc,c
LW  Re,e
ADD Ra,Rb,Rc
LW  Rf,f
SW  a,Ra
SUB Rd,Re,Rf
SW  d,Rd

Control Hazard on Branches: Three-Stage Stall

Branch Stall Impact


If CPI = 1 and 30% of instructions are branches, a 3-cycle stall gives a new CPI of 1.9!

Two-part solution:
determine whether the branch is taken or not sooner, AND
compute the taken-branch target address earlier.

DLX branches test whether a register is equal to zero or not equal to zero.

DLX solution:
move the zero test to the ID/RF stage;
add an adder to calculate the new PC in the ID/RF stage;
result: a 1 clock cycle branch penalty instead of 3.
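
A quick Python check of the CPI arithmetic above, under the same assumptions (a sketch, not from the slides):

base_cpi = 1.0
branch_frac = 0.30
stall_cycles = 3

print(base_cpi + branch_frac * stall_cycles)   # 1.9 with a 3-cycle stall

# With the DLX fix (zero test and target adder moved to ID/RF), the penalty
# drops to 1 cycle:
print(base_cpi + branch_frac * 1)              # 1.3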

Pipelined DLX Datapath


Figure 3.22, page 163
Stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, Write Back.

This is the correct 1-cycle branch latency implementation!

Four Branch Hazard Alternatives


#1: Stall until the branch direction is clear.

#2: Predict Branch Not Taken
Execute successor instructions in sequence; squash instructions in the pipeline if the branch is actually taken.
Advantage of late pipeline state update.
47% of DLX branches are not taken on average.
PC+4 is already calculated, so use it to fetch the next instruction.

#3: Predict Branch Taken
53% of DLX branches are taken on average.
But the branch target address has not yet been calculated in DLX, so DLX still incurs a 1-cycle branch penalty.
On other machines, the branch target may be known before the branch outcome.
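
Using the taken/not-taken statistics on this slide, a rough Python sketch (my arithmetic, not the text's) of the expected penalty per branch for the two predictions on DLX:

taken_frac = 0.53          # 53% of DLX branches taken on average

# Predict not taken: pay the 1-cycle penalty only when the branch is taken.
print(taken_frac * 1)      # ~0.53 cycles per branch

# Predict taken: DLX has not computed the target yet, so it still pays
# 1 cycle on every branch.
print(1.0)                 # 1 cycle per branch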

Four Branch Hazard Alternatives


#4: Delayed Branch
Define the branch to take place AFTER a following instruction:

branch instruction
sequential successor_1
sequential successor_2
........
sequential successor_n
branch target if taken

Branch delay of length n.

A 1-slot delay allows a proper decision and branch target address calculation in the 5-stage pipeline; DLX uses this.

Delayed Branch
Where do we get instructions to fill the branch delay slot?
From before the branch instruction.
From the target address: only valuable when the branch is taken.
From the fall-through: only valuable when the branch is not taken.
Canceling branches allow more slots to be filled.

Compiler effectiveness for a single branch delay slot:
fills about 60% of branch delay slots;
about 80% of instructions executed in branch delay slots are useful in computation;
so about 50% (60% x 80%) of slots are usefully filled.

Delayed branch downside: deeper (7-8 stage) pipelines and multiple instructions issued per clock (superscalar) make delay slots harder to fill and less attractive.

Evaluating Branch Alternatives


Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)

Scheduling scheme    Branch penalty   CPI    Speedup vs. unpipelined   Speedup vs. stall
Stall pipeline       3                1.42   3.5                       1.0
Predict taken        1                1.14   4.4                       1.26
Predict not taken    1                1.09   4.5                       1.29
Delayed branch       0.5              1.07   4.6                       1.31

Conditional and unconditional branches = 14% of instructions; 65% of them change the PC.
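
The CPI column can be roughly reproduced from the stall term of the speedup formula; in this Python sketch the effective per-branch penalties are my reading of the schemes (predict-not-taken pays only for the 65% of branches that change the PC):

branch_freq = 0.14
schemes = {
    "Stall pipeline":    3.0,
    "Predict taken":     1.0,
    "Predict not taken": 1.0 * 0.65,
    "Delayed branch":    0.5,
}
for name, penalty in schemes.items():
    cpi = 1 + branch_freq * penalty
    print(f"{name:18s} CPI = {cpi:.2f}")
# Stall pipeline     CPI = 1.42
# Predict taken      CPI = 1.14
# Predict not taken  CPI = 1.09
# Delayed branch     CPI = 1.07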

Improvements in Delayed Branches


Cancelling branches:
if the branch behaves as predicted, act as a normal delayed branch;
otherwise, turn the delay-slot instruction into a no-op.

This helps the compiler reschedule instructions into the slot without restrictions.
Deeper pipes with longer branch delays make delayed branching less attractive.
Newer RISC machines use a combination of ordinary and delayed branches, sometimes only ordinary branches with better prediction.

Prediction Techniques
Taken and not-taken predictions.
Separating forward and backward branches.
Profile-based predictions:
the behavior of branches is highly biased towards taken or not taken;
changing the input has minimal effect on the branch behavior.

Handling Exceptions
Turn off all writes for the faulting instruction and for all instructions that follow it in the pipe.
Save the PC of the faulting instruction.
With delayed branches, multiple PCs must be saved: number of delay slots + 1.

Precise exceptions: instructions just before the fault are completed, and those after it can be restarted from scratch.
This may require a slower mode.

Out-of-order Exceptions
The (I+1)th instruction may cause an exception before instruction I does.
Handled by using exception status vectors.
Side effects are disabled as soon as an exception is found.
Exception handling happens at WB, in unpipelined order.

Multi-Cycle Operations
It is impractical to require FP operations to complete in 1 or 2 clock cycles: that would mean either slowing down the clock or building complex FP hardware.

Instead, allow the FP pipeline a longer latency. This may cause more hazards:
the divide unit is not fully pipelined, a structural hazard;
WAW hazards, since instructions can reach WB out of order;
additional problems with exceptions.

Out-of-order precise exceptions: either serialize the FP operations or buffer the results of each operation.

Pipelining Introduction Summary


Just overlap tasks; pipelining is easy if the tasks are independent.
Speedup <= Pipeline depth; if the ideal CPI is 1, then:

Speedup = Pipeline depth / (1 + Pipeline stall CPI) x (Clock Cycle Unpipelined / Clock Cycle Pipelined)

Hazards limit performance on computers:
Structural: need more HW resources.
Data (RAW, WAR, WAW): need forwarding, compiler scheduling.
Control: delayed branch, prediction.
