
Review of Instruction Sets, Pipelines, and Caches

CSE 7381/5381

Pipelining: It's Natural!


Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
Washer takes 30 minutes
Dryer takes 40 minutes
Folder takes 20 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A-D run one after another, each taking 30 + 40 + 20 minutes]

Sequential laundry takes 6 hours for 4 loads.
If they learned pipelining, how long would laundry take?

Pipelined Laundry

Start work ASAP.

[Figure: timeline from 6 PM to about 9:30 PM; loads A-D overlap, limited by the 40-minute dryer stage]

Pipelined laundry takes 3.5 hours for 4 loads.
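
As a quick check of the numbers above, here is a small Python sketch (my own, not from the slides) that computes the sequential and pipelined laundry times:

# Stage times in minutes: washer, dryer, folder.
stages = [30, 40, 20]
loads = 4

# Sequential: every load runs all stages before the next load starts.
sequential = loads * sum(stages)                     # 4 * 90 = 360 min

# Pipelined: the first load takes the full sum; after that, each additional
# load is limited by the slowest stage (the 40-minute dryer).
pipelined = sum(stages) + (loads - 1) * max(stages)  # 90 + 3*40 = 210 min

print(sequential / 60, "hours sequential")   # 6.0
print(pipelined / 60, "hours pipelined")     # 3.5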

Pipelining Lessons
[Figure: the pipelined laundry timeline again, used to illustrate the lessons below]

Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
Pipeline rate is limited by the slowest pipeline stage.
Multiple tasks operate simultaneously.
Potential speedup = number of pipe stages.
Unbalanced lengths of pipe stages reduce speedup.
Time to fill the pipeline and time to drain it reduce speedup.

Computer Pipelines
Computers execute billions of instructions, so throughput is what matters.
DLX desirable features: all instructions are the same length, register fields are in the same place in the instruction format, and memory operands appear only in loads and stores.


5 Steps of DLX Datapath


Figure 3.1, Page 130
Stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back.

[Figure: unpipelined DLX datapath; temporary registers include IR (instruction register) and LMD (load memory data)]

Pipelined DLX Datapath


Figure 3.4, page 137
Stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, Write Back.

Data stationary control: local decode for each instruction phase / pipeline stage.

Visualizing Pipelining
Figure 3.3, Page 133
[Figure: pipeline diagram with time (clock cycles) across and instruction order down; each instruction occupies one stage per cycle, shifted one cycle from the instruction above it]
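
As a rough sketch (not part of the course materials), the following Python prints the kind of diagram Figure 3.3 shows, with clock cycles across and instruction order down:

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instructions):
    cycles = n_instructions + len(STAGES) - 1
    for i in range(n_instructions):
        row = []
        for c in range(cycles):
            stage = c - i                      # instruction i enters IF in cycle i
            row.append(STAGES[stage] if 0 <= stage < len(STAGES) else "  ")
        print(f"Instr {i+1}: " + " ".join(f"{s:>3}" for s in row))

# Each printed row shows one instruction, shifted one clock cycle to the right
# of the instruction above it.
pipeline_diagram(4)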

It's Not That Easy for Computers

Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
Structural hazards: the HW cannot support this combination of instructions (a single person to fold and put clothes away).
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (the missing sock).
Control hazards: pipelining of branches and other instructions that change the PC; the common solution is to stall the pipeline until the hazard is resolved, inserting bubbles into the pipeline.

One Memory Port/Structural Hazards


Figure 3.6, Page 142

[Figure: pipeline diagram of a Load followed by Instr 1-4 with a single memory port; the Load's MEM access and a later instruction's fetch need memory in the same cycle]

One Memory Port/Structural Hazards


Figure 3.7, Page 143

[Figure: the same sequence, resolved by stalling Instr 3's fetch one cycle so it does not conflict with the Load's memory access]

Speed Up Equation for Pipelining

CPI_pipelined = Ideal CPI + Pipeline stall clock cycles per instruction

Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI) x (Clock Cycle_unpipelined / Clock Cycle_pipelined)

If the ideal CPI is 1:

Speedup = Pipeline depth / (1 + Pipeline stall CPI) x (Clock Cycle_unpipelined / Clock Cycle_pipelined)

Example: Dual-port vs. Single-port


Machine A: dual-ported memory.
Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate.
Ideal CPI = 1 for both; loads are 40% of instructions executed.

SpeedupA = Pipeline depth / (1 + 0) x (Clock_unpipelined / Clock_pipelined) = Pipeline depth
SpeedupB = Pipeline depth / (1 + 0.4 x 1) x (Clock_unpipelined / (Clock_unpipelined / 1.05)) = (Pipeline depth / 1.4) x 1.05 = 0.75 x Pipeline depth
SpeedupA / SpeedupB = Pipeline depth / (0.75 x Pipeline depth) = 1.33

Machine A is 1.33 times faster.
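
The arithmetic can be checked with a short Python sketch (an illustration, not from the slides); note that the pipeline depth cancels out of the final ratio:

depth = 5            # any pipeline depth; it cancels out of the ratio
load_frac = 0.40     # fraction of instructions that are loads
stall_per_load = 1   # structural-hazard stall per load on machine B
clock_gain_b = 1.05  # machine B's faster clock

speedup_a = depth / (1 + 0)                              # = depth
speedup_b = depth / (1 + load_frac * stall_per_load) * clock_gain_b

print(speedup_a / speedup_b)   # ~1.33: machine A is 1.33x faster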


Data Hazard on R1
Figure 3.9, page 147

[Figure: pipeline diagram (IF ID/RF EX MEM WB) for the sequence below; the instructions after the add all use r1, which the add does not write until its WB stage]

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11

Three Generic Data Hazards


InstrI followed by InstrJ

Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it.

Three Generic Data Hazards


InstrI followed by InstrJ

Write After Read (WAR): InstrJ tries to write an operand before InstrI reads it, so InstrI gets the wrong operand.

Can't happen in the DLX 5-stage pipeline because:
all instructions take 5 stages,
reads are always in stage 2, and
writes are always in stage 5.

Three Generic Data Hazards


InstrI followed by InstrJ

Write After Write (WAW): InstrJ tries to write an operand before InstrI writes it, leaving the wrong result (InstrI's, not InstrJ's).

Can't happen in the DLX 5-stage pipeline because:
all instructions take 5 stages, and
writes are always in stage 5.

We will see WAR and WAW in later, more complicated pipes.
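
As a small illustration (not from the text), the three hazard classes can be detected by comparing the registers an earlier instruction I and a later instruction J read and write; a Python sketch:

# 'dst' is the register an instruction writes, 'srcs' the registers it reads.
def classify_hazards(dst_i, srcs_i, dst_j, srcs_j):
    hazards = []
    if dst_i is not None and dst_i in srcs_j:
        hazards.append("RAW")          # J reads what I writes
    if dst_j is not None and dst_j in srcs_i:
        hazards.append("WAR")          # J writes what I reads
    if dst_i is not None and dst_i == dst_j:
        hazards.append("WAW")          # both write the same register
    return hazards

# add r1,r2,r3 followed by sub r4,r1,r3 -> ['RAW']
print(classify_hazards("r1", {"r2", "r3"}, "r4", {"r1", "r3"}))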

Forwarding to Avoid Data Hazard


Figure 3.10, Page 149
[Figure: the same sequence, with forwarding paths from the add's ALU-output and memory-stage pipeline registers back to the ALU inputs of the later instructions]

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11

HW Change for Forwarding


Figure 3.20, Page 161
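
The core of the forwarding hardware is a comparison of the destination registers held in the EX/MEM and MEM/WB pipeline registers against the source registers of the instruction currently in EX. A minimal Python sketch of that selection logic (my own illustration of the idea, not the exact logic in Figure 3.20):

def forward_select(src_reg, ex_mem_dst, mem_wb_dst):
    """Return where the ALU input for src_reg should come from."""
    if src_reg is not None and src_reg == ex_mem_dst:
        return "EX/MEM"        # forward the just-computed ALU result
    if src_reg is not None and src_reg == mem_wb_dst:
        return "MEM/WB"        # forward the value about to be written back
    return "REGFILE"           # no hazard: read the register file normally

# add r1,r2,r3 ; sub r4,r1,r3 -> sub's first operand comes from EX/MEM
print(forward_select("r1", ex_mem_dst="r1", mem_wb_dst=None))  # EX/MEM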


Data Hazard Even with Forwarding


Figure 3.12, Page 153

[Figure: the lw's data is not available until the end of its MEM cycle, too late to forward back to the sub's EX cycle]

lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9

Data Hazard Even with Forwarding


Figure 3.13, Page 154

[Figure: the same sequence with a one-cycle stall inserted so the lw result can be forwarded to the sub]

lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
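
The load interlock that inserts this stall can be expressed as a simple check; the following Python sketch is my own illustration of the idea, not the textbook's exact control logic:

# If the instruction in EX is a load and its destination matches a source
# register of the instruction in ID, stall one cycle so the loaded value can
# then be forwarded.
def need_load_stall(ex_is_load, ex_dst, id_srcs):
    return ex_is_load and ex_dst is not None and ex_dst in id_srcs

# lw r1,0(r2) in EX while sub r4,r1,r6 is in ID -> stall one cycle
print(need_load_stall(True, "r1", {"r1", "r6"}))   # True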

Software Scheduling to Avoid Load Hazards


Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
LW  Rb,b
LW  Rc,c
ADD Ra,Rb,Rc
SW  a,Ra
LW  Re,e
LW  Rf,f
SUB Rd,Re,Rf
SW  d,Rd

Fast code:
LW  Rb,b
LW  Rc,c
LW  Re,e
ADD Ra,Rb,Rc
LW  Rf,f
SW  a,Ra
SUB Rd,Re,Rf
SW  d,Rd

Control Hazard on Branches: Three-Stage Stall

Branch Stall Impact


If CPI = 1 and 30% of instructions are branches, a 3-cycle stall gives a new CPI of 1.9!

Two-part solution:
determine whether the branch is taken or not sooner, AND
compute the taken-branch target address earlier.

DLX branches test whether a register is equal to zero or not equal to zero.

DLX solution:
move the zero test to the ID/RF stage;
add an adder to calculate the new PC in the ID/RF stage;
result: a 1 clock cycle branch penalty instead of 3.
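
A quick Python check of the CPI arithmetic above, under the same assumptions (a sketch, not from the slides):

base_cpi = 1.0
branch_frac = 0.30
stall_cycles = 3

print(base_cpi + branch_frac * stall_cycles)   # 1.9 with a 3-cycle stall

# With the DLX fix (zero test and target adder moved to ID/RF), the penalty
# drops to 1 cycle:
print(base_cpi + branch_frac * 1)              # 1.3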

Pipelined DLX Datapath


Figure 3.22, page 163
Stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, Write Back.

This is the correct 1-cycle branch latency implementation!

Four Branch Hazard Alternatives


#1: Stall until the branch direction is clear.

#2: Predict Branch Not Taken
Execute successor instructions in sequence; squash instructions in the pipeline if the branch is actually taken.
Advantage of late pipeline state update.
47% of DLX branches are not taken on average.
PC+4 is already calculated, so use it to fetch the next instruction.

#3: Predict Branch Taken
53% of DLX branches are taken on average.
But the branch target address has not yet been calculated in DLX, so DLX still incurs a 1-cycle branch penalty.
On other machines, the branch target may be known before the branch outcome.
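
Using the taken/not-taken statistics on this slide, a rough Python sketch (my arithmetic, not the text's) of the expected penalty per branch for the two predictions on DLX:

taken_frac = 0.53          # 53% of DLX branches taken on average

# Predict not taken: pay the 1-cycle penalty only when the branch is taken.
print(taken_frac * 1)      # ~0.53 cycles per branch

# Predict taken: DLX has not computed the target yet, so it still pays
# 1 cycle on every branch.
print(1.0)                 # 1 cycle per branch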

Four Branch Hazard Alternatives


#4: Delayed Branch
Define the branch to take place AFTER a following instruction:

branch instruction
sequential successor_1
sequential successor_2
........
sequential successor_n
branch target if taken

Branch delay of length n.

A 1-slot delay allows a proper decision and branch target address calculation in the 5-stage pipeline; DLX uses this.

Delayed Branch
Where do we get instructions to fill the branch delay slot?
From before the branch instruction.
From the target address: only valuable when the branch is taken.
From the fall-through: only valuable when the branch is not taken.
Canceling branches allow more slots to be filled.

Compiler effectiveness for a single branch delay slot:
fills about 60% of branch delay slots;
about 80% of instructions executed in branch delay slots are useful in computation;
so about 50% (60% x 80%) of slots are usefully filled.

Delayed branch downside: deeper (7-8 stage) pipelines and multiple instructions issued per clock (superscalar) make delay slots harder to fill and less attractive.

Evaluating Branch Alternatives


Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)

Scheduling scheme    Branch penalty   CPI    Speedup vs. unpipelined   Speedup vs. stall
Stall pipeline       3                1.42   3.5                       1.0
Predict taken        1                1.14   4.4                       1.26
Predict not taken    1                1.09   4.5                       1.29
Delayed branch       0.5              1.07   4.6                       1.31

Conditional and unconditional branches = 14% of instructions; 65% of them change the PC.
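
The CPI column can be roughly reproduced from the stall term of the speedup formula; in this Python sketch the effective per-branch penalties are my reading of the schemes (predict-not-taken pays only for the 65% of branches that change the PC):

branch_freq = 0.14
schemes = {
    "Stall pipeline":    3.0,
    "Predict taken":     1.0,
    "Predict not taken": 1.0 * 0.65,
    "Delayed branch":    0.5,
}
for name, penalty in schemes.items():
    cpi = 1 + branch_freq * penalty
    print(f"{name:18s} CPI = {cpi:.2f}")
# Stall pipeline     CPI = 1.42
# Predict taken      CPI = 1.14
# Predict not taken  CPI = 1.09
# Delayed branch     CPI = 1.07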

Improvements in Delayed Branches


Cancelling branches:
if the branch behaves as predicted, act as a normal delayed branch;
otherwise, turn the delay-slot instruction into a no-op.

This helps the compiler reschedule instructions into the slot without restrictions.
Deeper pipes with longer branch delays make delayed branching less attractive.
Newer RISC machines use a combination of ordinary and delayed branches, sometimes only ordinary branches with better prediction.

Prediction Techniques
Taken and not-taken predictions.
Separating forward and backward branches.
Profile-based predictions:
the behavior of branches is highly biased towards taken or not taken;
changing the input has minimal effect on the branch behavior.

Handling Exceptions
Turn off all writes for the faulting instruction and for all instructions that follow it in the pipe.
Save the PC of the faulting instruction.
With delayed branches, multiple PCs must be saved: number of delay slots + 1.

Precise exceptions: instructions just before the fault are completed, and those after it can be restarted from scratch.
This may require a slower mode.

Out-of-order Exceptions
The (I+1)th instruction may cause an exception before instruction I does.
Handled by using exception status vectors.
Side effects are disabled as soon as an exception is found.
Exception handling happens at WB, in unpipelined order.

Multi-Cycle Operations
It is impractical to require FP operations to complete in 1 or 2 clock cycles: that would mean either slowing down the clock or building complex FP hardware.

Instead, allow the FP pipeline a longer latency. This may cause more hazards:
the divide unit is not fully pipelined, a structural hazard;
WAW hazards, since instructions can reach WB out of order;
additional problems with exceptions.

Out-of-order precise exceptions: either serialize the FP operations or buffer the results of each operation.

Pipelining Introduction Summary


Just overlap tasks; pipelining is easy if the tasks are independent.
Speedup <= Pipeline depth; if the ideal CPI is 1, then:

Speedup = Pipeline depth / (1 + Pipeline stall CPI) x (Clock Cycle Unpipelined / Clock Cycle Pipelined)

Hazards limit performance on computers:
Structural: need more HW resources.
Data (RAW, WAR, WAW): need forwarding, compiler scheduling.
Control: delayed branch, prediction.
