Documente Academic
Documente Profesional
Documente Cultură
Presentation Outline
Pipelining versus Serial Execution
Pipelined Datapath
Pipelined Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall Unit
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelined Processor Design
Pipelining Example
Laundry Example: Three Stages
Sequential Laundry
6 PM
Time 30
7
30
8
30
30
9
30
30
10
30
30
11
30
30
12 AM
30
30
A
B
C
D
7
30
30
8
30
30
30
30
30
30
9 PM
Time
30
30
30
1 2
1 2
k
1 2
1 2
k
Without Pipelining
One completion every k time units
Pipelined Processor Design
1 2
With Pipelining
One completion every 1 time unit
Muhamed Mudawar slide 6
Synchronous Pipeline
Uses clocked registers between stages
Upon arrival of a clock edge
All registers hold the results of previous stages simultaneously
Sk
Register
S2
Register
S1
Register
Input
Register
Clock
Pipelined Processor Design
Pipeline Performance
Let ti = time delay in stage Si
k+n1
Sk k for large n
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath
Pipelined Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall Unit
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelined Processor Design
Single-Cycle Datapath
Shown below is the single-cycle datapath
How to pipeline this single-cycle datapath?
ID = Decode and
Register Fetch
EX = Execute and
Calculate Address
00
Inc
Imm26
Imm16
Rs
Register
File
Rt
Instruction
Instruction
Memory
Rd
m
u
x
BusW
Address
RW
PC
m
u
x
Next
PC
Ext
m
u
x
ALU result
zero
A
L
U
WB = Write
Back
Data
Memory
Address
m
u
x
1
Data_in
MEM = Memory
Access
Pipelined Datapath
Pipeline registers, in green, separate each pipeline stage
ID = Decode
IF/ID
EX = Execute
ID/EX
Next
PC
00
Imm26
Instruction
Register
File
Rt
RW
Instruction
Memory
Rd
zero
Rs
Ext
BusW
PC
Imm16
Address
WB
EX/MEM
Inc
m
u
x
MEM = Memory
m
u
x
m
u
x
A
L
U
MEM/WB
ALU result
Address
Data
Memory
m
u
x
1
Data_in
ID
EX
IF/ID
ID/EX
Next
PC
00
Imm26
Instruction
Register
File
Rt
BusW
Instruction
Memory
Rd
zero
Rs
Ext
RW
PC
Imm16
Address
WB
EX/MEM
Inc
m
u
x
MEM
m
u
x
m
u
x
A
L
U
MEM/WB
ALU result
Address
Data
Memory
m
u
x
1
Data_in
Figure shows the use of resources at each stage and each cycle
Time (in cycles)
CC1
CC2
CC3
CC4
CC5
lw $6, 8($5)
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
CC6
CC7
CC8
InstructionTime Diagram
Diagram shows:
Which instruction occupies what stage at each clock cycle
Instruction Order
$7, 8($3)
lw
$6, 8($5)
IF
ID
EX MEM WB
IF
ID
EX MEM WB
IF
ID
EX
WB
IF
ID
EX
IF
ID
$2, 10($3)
CC1
WB
EX MEM
Time
Reg
ALU
900 ps
MEM
Reg
IF
Reg
ALU
MEM
Reg
900 ps
Pipelined Processor Design
Reg
200
IF
200
ALU
Reg
IF
200
MEM
Reg
ALU
MEM
Reg
ALU
MEM
200
200
Reg
200
Reg
200
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath
Pipelined Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall Unit
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelined Processor Design
Control Signals
IF/ID
ID/EX
EX/MEM
j
Inc
PCSrc
00
Imm26
Imm16
Address
Instruction
beq
Register
File
Rt
BusW
Instruction
Memory
Rd
Ext
m
u
x
m
u
x
MEM/WB
bne
zero
Rs
RW
PC
m
u
x
Imm26
Next
PC
A
L
U
ALU result
m
u
x
Address
Data
Memory
Data_in
func
RegDst RegWrite
ALU
Control
ALUSrc
ALUOp Br&J
MemWrite
MemRead
MemtoReg
Execute Stage
Memory Stage
Writeback
Signal
Control Signals
Control Signals
Signal
RegWrite
Op
RegDst ALUSrc ALUOp Beq Bne
R-Type
1=Rd
0=Reg R-Type
addi
0=Rt
1=Imm
ADD
slti
0=Rt
1=Imm
SLT
andi
0=Rt
1=Imm
AND
ori
0=Rt
1=Imm
OR
lw
0=Rt
1=Imm
ADD
sw
1=Imm
ADD
beq
0=Reg
SUB
bne
0=Reg
SUB
Pipelined Control
IF/ID
ID/EX
EX/MEM
j
Inc
PCSrc
00
Imm26
Imm16
beq
MEM/WB
bne
zero
ALU result
Rs
Address
Register
File
Rt
Instruction
BusW
Instruction
Memory
m
u
x
m
u
x
m
u
x
Address
Data
Memory
Data_in
Op
Rd
A
L
U
Ext
RW
PC
m
u
x
Next
PC
RegDst
func
ALU
Control
MemWrite
RegWrite
MemtoReg
WB
M
WB
Main
Control
MemRead
ALUOp
EX
ALUSrc
WB
Pass control
signals along
pipeline just
like the data
Memory Stage
Pipelining Summary
Pipelining doesnt improve latency of a single instruction
However, it improves throughput of entire workload
Instructions are initiated and completed at a higher rate
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath
Pipelined Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall Unit
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelined Processor Design
Pipeline Hazards
Hazards: situations that would cause incorrect execution
If next instruction were launched during its designated clock cycle
1. Structural hazards
Caused by resource contention
Using same resource by two instructions during the same cycle
2. Data hazards
An instruction may compute a result needed by next instruction
Hardware can detect dependencies between instructions
3. Control hazards
Caused by instructions that change control flow (branches/jumps)
Delays in changing the flow of control
Structural Hazards
Problem
Attempt to use the same hardware resource by two different
instructions during the same cycle
Structural Hazard
Two instructions are
attempting to write
the register file
during same cycle
Example
Writing back ALU result in stage 4
Instructions
lw
$6, 8($5)
IF
$2, 10($3)
CC1
ID
EX MEM WB
IF
ID
EX
WB
IF
ID
EX
WB
IF
ID
EX MEM
Time
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath
Pipelined Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall Unit
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelined Processor Design
Data Hazards
Dependency between instructions causes a data hazard
# r1 is written
# r1 is read
Time (cycles)
value of $2
sub $2, $1, $3
and $4, $2, $5
or $6, $3, $2
add $7, $2, $2
CC1
CC2
CC3
CC4
CC5
CC6
CC7
CC8
10
10
10
10
10/20
20
20
20
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
sw $8, 10($2)
Instructions and & or will read old value of $2 from reg file
During CC5, $2 is written and read new value is read
Pipelined Processor Design
Instruction Order
CC1
CC2
CC3
CC4
CC5
CC6
CC7
CC8
10
10
10
10
10/20
20
20
20
IM
Reg
ALU
DM
Reg
bubble
bubble
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
IM
or $6, $3, $2
Two bubbles are inserted into ID/EX at end of CC3 & CC4
Bubbles are NOP instructions: do not modify registers or memory
Bubbles delay instruction execution and waste clock cycles
Pipelined Processor Design
CC1
CC2
CC3
CC4
CC5
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
CC6
CC7
CC8
Implementing Forwarding
Two multiplexers added at the inputs of A & B registers
ALU output in the EX stage is forwarded (fed back)
ALU result or Load data in the MEM stage is also forwarded
Rd
m
u
x
m
u
x
Data_in
Rw
File
Register
Rt
Address
Rs
m
u
x
m
u
x
A
L
U
MEM/WB
ALU result
Rw
Instruction
Ext
Data
Memory
m
u
x
WriteData
ALUSrc
Rw
ForwardA
Imm26
Imm26
IF/ID
RegDst
RegWrite
Pipelined Processor Design
ForwardB
ICS 233 KFUPM
Forwarding Unit
Forwarding unit generates ForwardA and ForwardB
That are used to control the two forwarding multiplexers
MemtoReg
EX/MEM
ForwardB
ALU result
m
u
x
Rd
m
u
x
Address
Data_in
Rw
File
Register
Rt
m
u
x
Rs
m
u
x
A
L
U
MEM/WB
ALU result
Rw
Instruction
Ext
Data
Memory
m
u
x
WriteData
Imm26
Rw
Imm26
IF/ID
ForwardA
Forwarding Unit
Pipelined Processor Design
Explanation
ForwardA = 00
ForwardA = 01
ForwardA = 10
ForwardB = 00
ForwardB = 01
ForwardB = 10
else ForwardA = 00
if (IF/ID.Rt == ID/EX.Rw and ID/EX.Rw 0 and ID/EX.RegWrite) ForwardB = 01
elseif (IF/ID.Rt == EX/MEM.Rw and EX/MEM.Rw 0and EX/MEM.RegWrite)
ForwardB = 10
else
ForwardB = 00
Forwarding Example
When lw reaches the MEM stage
Instruction sequence:
lw
add
sub
$4, 100($9)
$7, $5, $6
$8, $4, $7
ForwardB = 01
Data_in
Data
Memory
m
u
x
WriteData
Address
Rw
Rd
m
u
x
m
u
x
m
u
x
A
L
U
ALU result
File
Register
Rt
m
u
x
Rs
lw $4,100($9)
ALU result
Ext
Rw
Instruction
ForwardA = 10
add $7,$5,$6
sub $8,$4,$7
Rw
Imm26
Imm26
ForwardA = 10
ForwardB = 01
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath
Pipelined Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall Unit
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelined Processor Design
Load Delay
Unfortunately, not all data hazards can be forwarded
Load has a delay that cannot be eliminated by forwarding
Program Order
Time (cycles)
lw
$2, 20($1)
CC1
CC2
CC3
CC4
CC5
IF
Reg
ALU
DM
Reg
IF
Reg
ALU
DM
Reg
IF
Reg
ALU
DM
Reg
IF
Reg
ALU
DM
CC6
CC7
CC8
However, load
can forward
data to second
next instruction
Reg
Instruction that needs the load data will be in the IF/ID register
Program Order
Time (cycles)
lw
$2, 20($1)
CC1
CC2
CC3
CC4
CC5
IM
Reg
ALU
DM
Reg
bubble
Reg
IM
IM
CC6
CC7
ALU
DM
Reg
Reg
ALU
DM
CC8
Reg
Imm26
Data_in
Rw
Data
Memory
m
u
x
WriteData
ALU result
Address
IF/IDWrite
Op
PCWrite
Rd
m
u
x
m
u
x
m
u
x
A
L
U
Rw
File
Address
Register
Rt
Rs
m
u
x
Forwarding,
Hazard Detection,
and Stall Unit
MemRead
0
Main
Control
m
u
x
EX
Bubble
Bubble
clears
control
signals
ForwardB
WB
PC
Instruction
Instruction
Instruction
Memory
ALU result
Ext
ForwardA
Rw
Imm26
Compiler Scheduling
Compilers can schedule code in a way to avoid load stalls
Slow code:
lw
lw
add
sw
lw
lw
sub
sw
$10,
$11,
$12,
$12,
$13,
$14,
$15,
$15,
($1)
($2)
$10, $11
($3)
($4)
($5)
$13, $14
($6)
# $1 = addr
# $2 = addr
# stall
# $3 = addr
# $4 = addr
# $5 = addr
# stall
# $6 = addr
ICS 233 KFUPM
b
c
a
e
f
d
lw
lw
lw
lw
add
sw
sub
sw
$10,
$11,
$13,
$14,
$12,
$12,
$15,
$14,
0($1)
0($2)
0($4)
0($5)
$10, $11
0($3)
$13, $14
0($6)
# $1 is read
# $1 is written
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath
Pipelined Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall Unit
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelined Processor Design
Control Hazards
Branch instructions can cause great performance loss
Branch target
PC + 4
PC + 4 + 4 immediate
If Branch is Taken
zero = 1
m
u
x
A
L
U
ALU result
Ext
m
u
x
Imm16
beq = 1
SUB
Forwarding
from MEM stage
Rd
m
u
x
File
Address
label:
lw $8, ($7)
. . .
beq $5, $6, label
next1
next2
Pipelined Processor Design
Register
Rt
m
u
x
Instruction
Rs
Rw
m
u
x
Instruction
Memory
Instruction
00
PCSrc = 1
Next
PC
Rw
Imm26
NPC
Imm26
PC
Inc
beq $5,$6,label
next1
next2
beq $5,$6,label
Next1 # bubble
cc1
cc2
cc3
IF
Reg
ALU
IF
Next2 # bubble
cc4
cc5
cc6
Reg
Bubble
Bubbl
e
Bubble
IF
Bubble
Bubble
Bubbl
e
Bubble
IF
Reg
ALU
MEM
cc7
Modified Datapath
Imm16
Data_in
Data
Memory
m
u
x
WriteData
m
u
x
m
u
x
A
L
U
Rw
Rd
m
u
x
ALU result
File
Address
Address
Register
Rt
Ext
m
u
x
Rw
Rs
ALU result
Instruction
PC
m
u
x
Instruction
Memory
Instruction
00
PCSrc
Imm16
Imm26
reset
Rw
NPC
Inc
Next
PC
Details of Next PC
PCSrc
NPC
30
A
D
D
30
m 30
u
x
Ext
Imm16
beq
zero
msb 4
bne
Imm26
26
=
Forwarded BusA
Pipelined Processor Design
BusB
Muhamed Mudawar slide 51
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath
Pipelined Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall Unit
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelined Processor Design
Delayed Branch
Define branch to take place AFTER the next instruction
Compiler/assembler fills the branch delay slot (for 1 delay cycle)
Delayed Branch
Define branch to take place after the next instruction
label:
(next instruction)
. . .
branch target
add $t2,$t3,$t4
beq $s1,$s0,label
Delay Slot
. . .
beq $s1,$s0,label
add $t2,$t3,$t4
Muhamed Mudawar slide 54
Zero-Delayed Branch
Disadvantages of delayed branch
Branch delay can increase to multiple cycles in deeper pipelines
Solution
Check the PC to see if the instruction being fetched is a branch
Store the branch target address in a branch buffer in the IF stage
Inc
mux
PC
Target
Predict
Addresses
Bits
low-order bits
used as index
predict_taken
=
Pipelined Processor Design
Predict
Taken
Not Taken
Predict
Taken
Taken
Not Taken
Taken
Not Taken
Not
Taken
Not
Taken
Not
Taken
Taken
Pipelined Processor Design