ECE 552 Chapter 2 - The Basics: Natalie Enright Jerger

Lecture notes based on slides created by Amir Roth of University of Pennsylvania with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood. Lecture notes enhanced by Milo Martin, Mark Hill, and David Wood with sources that included Profs. Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith, Sohi, Vijaykumar, and Wood
Recap
Evaluation Metrics
Cost, Power, Reliability, Performance
What is Latency? Throughput?
CPU Performance Equation
Latency(P,A) = seconds / program = (instructions / program) * (cycles / instruction) * (seconds / cycle)
Amdahl's Law
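The CPU performance equation from the recap can be sketched as a small function. This is only an illustration: the function name and the sample numbers (one billion instructions, CPI of 1.5, 12 ns clock) are assumptions chosen for the example.

```python
def execution_time(insns_per_program, cycles_per_insn, seconds_per_cycle):
    """seconds/program = (insns/program) * (cycles/insn) * (seconds/cycle)."""
    return insns_per_program * cycles_per_insn * seconds_per_cycle

# Hypothetical program: 1e9 dynamic instructions, CPI = 1.5, 12 ns clock period.
seconds = execution_time(1_000_000_000, 1.5, 12e-9)
print(f"{seconds:.1f} s")
```

Each factor is influenced by a different layer: instruction count by the program and compiler, CPI by the microarchitecture, and clock period by circuit design.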
Fall 2012 ECE 552 (Enright Jerger): Basics 2
This lecture
[Figure: system layer stack: Application, OS, Compiler, Firmware, CPU, Memory, I/O, Digital Circuits, Gates and Transistors]
Review features found in all modern microprocessors

Pipelining
Single, in-order issue
Clock rate vs. IPC
Data Hazards
Hardware: stalling and bypassing
Control Hazards
Memory Hierarchy
Will build on basic design throughout semester

The Sequential Model

Implicit model of all modern commercial ISAs
[Figure: loop of Fetch PC, Decode, Read Inputs, Execute, Write Output, Next PC]
Called von Neumann, but in ENIAC design before
Basic feature: the program counter (PC)

Defines total order on dynamic instructions
Next PC is PC++ unless instruction (insn) says otherwise
Order and named storage define computation
Value flows from insn X to Y via storage A iff X names A as output, Y names A as input, and Y comes after X in total order
Processor logically executes loop at left
Instruction execution assumed atomic
Instruction X finishes before insn X+1 starts
Alternatives have been proposed

Datapath: Single Cycle

[Figure: single-cycle datapath: PC, +4 adder, I$, Register File (s1, s2, d), ALU, D$, Control]
Datapath
Functional units (ALUs), registers, memory interface
Control: decode portion

Muxes, write enable signals regulate flow of data in datapath
Translates opcode into control signals
Breaking Instruction Down into Steps

Instruction Fetch (F)
Instruction fetched from memory at address indicated by PC
Increment PC to point to next instruction in sequence
Instruction Decode (D)
Decode instruction to find type
Read registers
Execution (E)
ALU performs arithmetic/logic operation (arithmetic insn)
Compute memory address (load/store insn)
If branch, compute target PC and update
Memory access (M)
Contents of Mem[addr] are fetched (load)
Contents of Mem[addr] are modified (store)
Writeback (W)
Write instruction result to register
Datapath: Multi-Cycle
[Figure: multi-cycle datapath: PC, +4 adder, I$, IR, Register File (s1, s2, d), latches A, B, O, D, D$, Control]
Add latches to create multi-cycle implementation
Load is going to have longest path through datapath
Define each temporary value

A = 1st source register
B = 2nd source register
O = output from ALU
D = output from Data cache
IR = instruction register
Holds insn to drive control signals
Datapath: Multi-Cycle - Fetch

[Figure: multi-cycle datapath]
Instruction Fetch (F)
IR = Mem[PC]
PC = PC + 4
Datapath: Multi-Cycle - Decode

[Figure: multi-cycle datapath]
Instruction Decode/Register Read (D)
A = Regs[s1]
B = Regs[s2]
Datapath: Multi-Cycle - Execute

[Figure: multi-cycle datapath]
Execute (E)
O = A + Imm32 (Memory operations)
O = A op B (Reg-Reg/Arithmetic)
O = A op Imm32 (Reg-Imm/Arithmetic)
Branch: ALU2 = NPC + Imm32; Cond = (A == 0); if (Cond) PC = ALU2
Datapath: Multi-Cycle - Memory Access

[Figure: multi-cycle datapath]
Memory Access (M)
D = Mem[O] (load)
Mem[O] = B (store)
Datapath: Multi-Cycle - Writeback

[Figure: multi-cycle datapath]
Writeback (W)
Reg[d] = O (Reg-Reg/Arithmetic)
Reg[d] = O (Reg-Imm/Arithmetic)
Reg[d] = D (Load)
Quick Review
Single-cycle: | insn0.fetch, dec, exec | insn1.fetch, dec, exec |
Multi-cycle:  | insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec |
Basic datapath: fetch, decode, execute
Single-cycle control: hardwired

+ Low CPI (1)
- Long clock period (to accommodate slowest instruction)
Multi-cycle control: micro-programmed/state machine

+ Short clock period
- High CPI
Single vs. Multi-Cycle

Single Cycle
Clock period = 50ns, CPI = 1
Performance = 50ns/insn
Multi-cycle
Branch: 20% (3 cycles), load: 20% (5 cycles), ALU: 60% (4 cycles)
Clock period = 11ns, CPI = (0.2*3 + 0.2*5 + 0.6*4) = 4
Why 11ns clock period and not 10ns?
Performance = 44ns/insn
Latency vs. Throughput

Can we have both low CPI and short clock period?
Not if datapath executes only one instruction at a time
Latency vs. Throughput
Latency: No good way to make a single instruction go faster
Throughput: fortunately, we don't care about single insn latency
Goal is to make programs (not individual insns) go faster
Programs contain billions of insns
Pipelining
Multi-cycle: | insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec |
Pipelined:   | insn0.fetch | insn0.dec   | insn0.exec |
             |             | insn1.fetch | insn1.dec  | insn1.exec |
Important performance technique

Improves instruction throughput rather than instruction latency
Begin with multi-cycle design

When instruction advances from stage 1 to 2
Allow next instruction to enter stage 1
Form of parallelism: insn-stage parallelism
Individual instruction takes the same number of stages
+ But instructions enter and leave at a much faster rate
Automotive assembly line analogy
Snapshot
[Figure: pipeline snapshot over cycles t to t+9 for Insn i, i+1, i+2, i+3, ...; each insn passes through F, D, X, M, W one stage behind its predecessor]
5 instructions in progress
Some stages may take longer
Five instructions in-flight concurrently

5 Stage Pipelined Datapath

[Figure: 5-stage pipelined datapath: PC, +4 adder, I$, Register File (s1, s2, d), D$, with PC, IR, A, B, O, D latched between stages]
Temporary values (PC,IR,A,B,O,D) re-latched every stage

Why? 5 insns may be in pipeline at once, do they share a single PC?
Notice, PC not latched after ALU stage (why not?)
Pipeline Terminology
[Figure: 5-stage pipelined datapath with latches labelled PC, F/D, D/X, X/M, M/W]
Five stage: Fetch, Decode, eXecute, Memory, Writeback

Nothing magical about the number 5 (Pentium 4 has 22 stages)
Latches (pipeline registers) named by stages they separate
PC, F/D, D/X, X/M, M/W
Pipeline Control
[Figure: pipelined datapath with a CTRL block in decode; control signals (xC, mC, wC) carried down the pipeline latches alongside IR]
One single-cycle controller, but pipeline the control signals

Terminology and Foreshadowing

Scalar pipeline: one insn per stage per cycle
Alternative: superscalar (multiple insns per stage per cycle) (later)
In-order pipeline: insns enter execute stage in program order

Alternative: out-of-order (later)
Pipeline depth: number of pipeline stages

Nothing magical about 5
Logical stages
Trend has been to deeper pipelines (more later)
Instruction Convention
Some ISAs (example: MIPS)
Instruction destination (i.e. output) on the left
add r1, r2, r3 means r1 <-- r2+r3
Other ISAs
Instruction destination (i.e. output) on the right
add r1, r2, r3 means r1+r2 --> r3
Style used for examples

Will use add r2, r3 --> r1 to make destination register clear
Pipeline Example: Cycle 1

[Figure: 5-stage pipelined datapath (PC, I$, Register File, sign-extend SX, << 2, D$) with latches F/D, D/X, X/M, M/W]
3 instructions
F: add r1, r2 --> r3
Pipeline Example: Cycle 2

F: lw [r5+0] --> r4
D: add r1, r2 --> r3
Pipeline Example: Cycle 3

F: sw r6 --> [r7+4]
D: lw [r5+0] --> r4
X: add r1, r2 --> r3
Pipeline Example: Cycle 4

D: sw r6 --> [r7+4]
X: lw [r5+0] --> r4
M: add r1, r2 --> r3
Pipeline Example: Cycle 5

X: sw r6 --> [r7+4]
M: lw [r5+0] --> r4
W: add r1, r2 --> r3
Pipeline Example: Cycle 6

M: sw r6 --> [r7+4]
W: lw [r5+0] --> r4
Pipeline Example: Cycle 7

W: sw r6 --> [r7+4]
Pipeline Diagram
                    1  2  3  4  5  6  7  8  9
add r1,r2 --> r3    F  D  X  M  W
ld [r5] --> r4         F  D  X  M  W
st r6 --> [r7+4]          F  D  X  M  W
Pipeline diagram
Cycles across, insns down
Convention: X means ld [r5] --> r4 finishes execute stage and writes into X/M latch at end of cycle 4
Pipelining
Balanced
All stages must take approximately the same time
Doesn't make sense to optimize a stage whose processing time is not longest
Insn must pass through all stages and in same order

Even if stage is unneeded
Buffering
Not all stages take exactly the same time
Independent computations
No relationships between work units
Minimize pipeline stalls
Pipeline Performance Calculation

Back of the envelope calculation
Branch: 20%, load: 20%, ALU: 60%
Single-cycle
Clock period = 50ns, CPI = 1
Performance = 50ns/insn
Multi-cycle
Branch 3 cycles, load 5 cycles, ALU 4 cycles
Clock period = 11ns, CPI = (0.2*3 + 0.2*5 + 0.6*4) = 4
Performance = 44ns/insn
Pipelined
Clock period = 12ns (approx 50ns/5 stages + overheads)
CPI = 1 (each insn takes 5 cycles, but 1 completes each cycle)
Performance = 12ns/insn
Actually ... CPI = 1 + some penalty for pipelining
Say CPI = 1.5 (on average instruction completes every 1.5 cycles)
Performance = 18ns/insn
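The back-of-the-envelope comparison can be reproduced with a few lines of arithmetic. This is a sketch only; the helper names `average_cpi` and `ns_per_insn` are made up for the example, and the numbers are the ones quoted on the slide.

```python
def average_cpi(mix):
    """mix: list of (fraction_of_insns, cycles_per_insn) pairs."""
    return sum(fraction * cycles for fraction, cycles in mix)

def ns_per_insn(clock_ns, cpi):
    """Average time per instruction = clock period * CPI."""
    return clock_ns * cpi

single_cycle = ns_per_insn(50, 1)
multi_cycle = ns_per_insn(11, average_cpi([(0.2, 3), (0.2, 5), (0.6, 4)]))
pipelined_ideal = ns_per_insn(12, 1)
pipelined_with_stalls = ns_per_insn(12, 1.5)

for name, value in [("single-cycle", single_cycle),
                    ("multi-cycle", multi_cycle),
                    ("pipelined (CPI 1)", pipelined_ideal),
                    ("pipelined (CPI 1.5)", pipelined_with_stalls)]:
    print(f"{name}: {value:.0f} ns/insn")
```

Even with a 50% CPI penalty, the pipelined design comes out well ahead of the other two in this example.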
Q1: Why is Pipeline Clock Period ...

... > (delay thru datapath) / (# of pipeline stages)?
Latches add delay
Extra bypassing logic adds delay
Pipeline stages have different delays, clock period is max delay
These factors have implications for ideal number of pipeline stages
Q2: Why is Pipeline CPI ...

... > 1?
CPI for scalar in-order pipeline is 1 + stall penalties
Stalls used to resolve hazards
Hazard: condition that jeopardizes sequential illusion
Stall: pipeline delay introduced to restore sequential illusion
Calculating pipeline CPI

Frequency of stall * stall cycles
Penalties add (stalls generally don't overlap in in-order pipelines)
1 + stall-freq1 * stall-cyc1 + stall-freq2 * stall-cyc2 + ...
Correctness/performance/make common case fast (MCCF)

Long penalties OK if they happen rarely, e.g. 1 + 0.01 * 10 = 1.1
Stalls also have implications for ideal number of pipeline stages
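The pipeline CPI formula above can be sketched as a sum over stall sources. The function name and the list-of-pairs representation are assumptions made for this illustration, and the formula relies on the slide's premise that stalls do not overlap.

```python
def pipeline_cpi(stall_sources):
    """CPI = 1 + sum(stall_frequency * stall_cycles) for non-overlapping stalls.

    stall_sources: list of (frequency_per_insn, stall_cycles) pairs.
    """
    return 1 + sum(freq * cycles for freq, cycles in stall_sources)

# The rare-but-long penalty from the slide: 1% of insns stall for 10 cycles.
print(round(pipeline_cpi([(0.01, 10)]), 3))
```

A second stall source simply adds another term, which is why a frequent short stall can cost as much as a rare long one.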
Managing a Pipeline
Proper flow requires two pipeline operations
Mess with latch write-enable and clear signals to achieve
Operation I: stall
Effect: stops some insns in their current stages
Use: make younger insns wait for older ones to complete
Implementation: de-assert write-enable
Operation II: flush
Effect: removes insns from current stages
Use: see later
Implementation: assert clear signals
Both stall and flush must be propagated to younger insns

Dependence & Hazards

Dependence: relationship that serializes two insns
Data: two insns use the same value or storage location
Control: one instruction affects whether another executes at all
Maybe: two insns may have a dependence
Hazard: dependence causes potential incorrect execution

Possibility of using or corrupting data or execution flow
Structural: two insns want to use same structure, one must wait
Often fixed with stalls: insn stays in same stage for multiple cycles
Data Hazards/Dependence
Let's forget about branches and control for a while
3 insn seq from earlier example
add r3, r2, r1
lw r4, 0(r5)
sw r6, 0(r7)
But it wasn't a real program
Real programs have data dependences

They pass values via registers and memory
RAW
Read-after-write (RAW)
add r2,r3 --> r1
sub r1,r4 --> r2
or r6,r3 --> r1
Problem: swap would mean sub uses wrong value for r1
True: value flows through this dependence
Using different output register for add doesn't help
Dependent Operations
Independent operations
add r1, r2 --> r3
add r4, r5 --> r6
Would this program execute correctly on a pipeline?

add r1, r2 --> r3
add r3, r5 --> r6
This one?
add r1, r2 --> r3
ld [r3] --> r4
addi r3, 1 --> r6
st r3 --> [r7]
[Figure: pipelined datapath; younger add in D: read r3, r5; older add in X: compute r3 = r2 + r1; older add does not update r3 until WB]
Data Hazard Example

[Figure: pipelined datapath]
D: sw r3 --> [r7]
X: addi r3, 1 --> r6
M: lw [r3] --> r4
W: add r1,r2 --> r3
Which instructions would execute with correct inputs?

add is writing its result into r3 in current cycle
lw read r3 2 cycles ago --> wrong value
addi read r3 1 cycle ago --> wrong value
sw is reading r3 this cycle --> maybe
RAW: Detect and Stall
Stall logic: detect and stall reader in D

(F/D.rs1 && (F/D.rs1==D/X.rd || F/D.rs1==X/M.rd)) || (F/D.rs2 && (F/D.rs2==D/X.rd || F/D.rs2==X/M.rd))
Re-evaluated every cycle until no longer true
+ Low cost, simple
- IPC degradation, dependences are the common case
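The stall condition above translates almost directly into code. The sketch below assumes register numbers are small integers and that 0 means "no register operand", which is how the bare `F/D.rs1 &&` test is read here; the function and parameter names are invented for the example.

```python
def raw_stall(fd_rs1, fd_rs2, dx_rd, xm_rd):
    """Return True if the insn in D must stall (no bypassing available).

    fd_rs1, fd_rs2: source registers of the insn in decode (0 = unused).
    dx_rd, xm_rd:   destination registers of the insns in X and M (0 = none).
    """
    def hazard(rs):
        # A source is hazardous if it is used and an older in-flight insn
        # in X or M has not yet written it back.
        return rs != 0 and rs in (dx_rd, xm_rd)

    return hazard(fd_rs1) or hazard(fd_rs2)

# add r1,r2 --> r3 is in X; add r3,r5 --> r6 is in D and reads r3.
print(raw_stall(fd_rs1=3, fd_rs2=5, dx_rd=3, xm_rd=0))  # True
```

Because the check is re-evaluated every cycle, the reader is released once the writer has moved past M.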
RAW Hazard: Detect and Stall

[Figure: pipeline IF, ID, EX, MEM, WB with IR in latches F/D, D/X, X/M, M/W; hazard logic in ID injects a NOP into D/X]
Prevent F/D insn from reading (advancing this cycle)

Write nop into D/X.IR (effectively, insert nop in hardware)
Also reset (clear) datapath control signals
Disable F/D latch and PC write enables (why?)
Re-evaluate situation next cycle

Stall Timing (without bypassing)

D and W stages share regfile
Each gets regfile for half a cycle
1st half W writes, 2nd half D reads
2 cycle stall
d* = data stall, p* = propagated stall
                    1  2  3  4  5  6  7  8  9  10
add r1,r2 --> r3    F  D  X  M  W
lw [r3] --> r4         F  d* d* D  X  M  W
addi r3,1 --> r6          F  p* p* D  X  M  W
sw r3 --> [r7]                     F  D  X  M  W
Avoiding Stalls: Observation

[Figure: pipeline stages IF, ID, EX, MEM, WB; lw [r3+4] --> r4 in EX, add r1, r2 --> r3 in MEM]
Problem arises because

lw r4, 4(r3) has already read r3 from regfile
add r3, r2, r1 hasn't yet written r3 to regfile
But fundamentally, OK. Why?

lw r4, 4(r3) hasn't used r3 yet
add r3, r2, r1 has already computed r3
Avoiding Stalls: Observation

[Figure: two overlapped timelines. add r1, r2 --> r3: IF, ID, EX (r3 computed), MEM, WB (r3 written). ld [r3] --> r4: IF, ID (r3 read), EX (r3 needed), MEM, WB]
Forwarding paths and control logic can resolve dependencies

Reducing RAW Stalls with Bypassing

Why wait until W stage? Data available after X or M stage
Bypass (aka forward) data directly to input of X or M
MX: from beginning of M (X output) to input of X
WX: from beginning of W (M output) to input of X
WM: from beginning of W (M output) to data input of M
Two each of MX, WX + WM = full bypassing
+ Reduces stalls in a big way
- Additional wires and muxes may increase clock cycle
Avoiding Stalls: Bypassing

[Figure: pipeline IF, ID, EX, MEM, WB with latches F/D, D/X, X/M, M/W; lw [r3+4] --> r4 in EX receives r3 from add r1, r2 --> r3 in MEM]
Bypassing (aka forwarding)

Reading a value from an intermediate source
Not waiting until it is available from primary source (e.g. regfile)
Bypass Logic
[Figure: pipelined datapath with bypass muxes on the ALU inputs, fed from the X/M and M/W latches and controlled by a bypass block]
Bypass logic: similar to but separate from stall logic

Stall logic controls latches, bypass logic controls mux inputs
Complement one another: can't bypass --> must stall
ALU-A input mux bypass logic
(X/M.rd==D/X.rs1) --> 0 // check first
(M/W.rd==D/X.rs1) --> 1 // check second
Else --> 2 // check last
Why?
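The ALU-A mux select above can be written as a priority chain. This is a minimal sketch: the function name and the integer encoding of registers are assumptions, while the select values 0, 1, 2 follow the slide.

```python
def alu_a_select(dx_rs1, xm_rd, mw_rd):
    """Pick the ALU-A input for the insn in X.

    0 = MX bypass (value in the X/M latch)
    1 = WX bypass (value in the M/W latch)
    2 = value read from the register file in D
    """
    if xm_rd == dx_rs1:      # check first: youngest older producer
        return 0
    if mw_rd == dx_rs1:      # check second: next-youngest producer
        return 1
    return 2                 # check last: no in-flight producer

# Both older insns write r1: the X/M copy is the more recent value.
print(alu_a_select(dx_rs1=1, xm_rd=1, mw_rd=1))  # 0
```

The order matters when two in-flight instructions write the same register: the one in X/M is later in program order, so its value is the one the reader should see.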
Pipeline Diagrams with Bypassing

If bypass exists, from/to stages execute in same cycle
Example: full bypassing, use MX bypass
                    1  2  3  4  5  6  7  8  9  10
add r2,r3 --> r1    F  D  X  M  W
sub r1,r4 --> r2       F  D  X  M  W
Example: full bypassing, use WX bypass
                    1  2  3  4  5  6  7  8  9  10
add r2,r3 --> r1    F  D  X  M  W
ld [r7] --> r5         F  D  X  M  W
sub r1,r4 --> r2          F  D  X  M  W
Example: WM bypass
                    1  2  3  4  5  6  7  8  9  10
add r2,r3 --> r1    F  D  X  M  W
?                      F  D  X  M  W
Can you think of a code example that uses the WM bypass?

Does bypassing prevent all Data Hazards?
[Figure: pipelined datapath; lw [r2+4] --> r3 in X, add r2, r3 --> r4 stalled in D, NOP written into D/X]
No. Bypassing alone is not sufficient
Solution? Detect and stall.

Stalling on Load-To-Use Dependences

[Figure: pipelined datapath; lw [r2+4] --> r3 in X, add r2, r3 --> r4 stalled in D, NOP written into D/X]
Stall = (D/X.IR.Operation == LOAD) &&
        ((F/D.IR.rs1 == D/X.IR.rd) || ((F/D.IR.rs2 == D/X.IR.rd) && (F/D.IR.OP != Store)))

[Figure, following cycles: add r2, r3 --> r4 held by the stall while a bubble follows lw [r2+4] --> r3 down the pipeline]
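The load-to-use stall condition can be sketched as a predicate over the two pipeline latches. The names and the string opcodes are illustrative assumptions; the sketch also assumes, as the `!= Store` term suggests, that a store's rs2 is the data being stored, which is not needed until M.

```python
def load_use_stall(dx_op, dx_rd, fd_op, fd_rs1, fd_rs2):
    """True if the insn in D depends on a load that is currently in X."""
    if dx_op != "LOAD":
        return False
    needs_rs1 = fd_rs1 == dx_rd
    # A store only needs rs2 (its data) in M, so it is excluded here.
    needs_rs2 = fd_rs2 == dx_rd and fd_op != "STORE"
    return needs_rs1 or needs_rs2

# lw [r2+4] --> r3 in X, add r2, r3 --> r4 in D: must stall one cycle.
print(load_use_stall("LOAD", 3, "ADD", 2, 3))    # True
# A store of r3 to [r7] right behind the load does not stall.
print(load_use_stall("LOAD", 3, "STORE", 7, 3))  # False
```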

Load-Use Stalls
Even with full bypassing, stall logic is unavoidable
Load-use stall
Load value not ready at beginning of M --> can't use MX bypass
Use WX bypass
                    1  2  3  4  5  6  7  8  9  10
ld [r3+4] --> r1    F  D  X  M  W
sub r1,r4 --> r2       F  d* D  X  M  W
Aside I: how does stall/bypass logic handle cache misses?
Aside II: compiler scheduling can be used to reduce load-use stall frequency
Performance Impact of Load/Use Penalty

Assume
Branch: 20%, load: 20%, store: 10%, other: 50%
50% of loads are followed by dependent instruction
These require 1 cycle stall (i.e. insertion of 1 nop)
Calculate CPI
CPI = 1 + (1 * 0.20 * 0.50) = 1.1
WAW Hazards
Write-after-write (WAW)
add r2,r3 --> r1
sub r1,r4 --> r2
or r3,r6 --> r1
Compiler effects
Scheduling problem: reordering would leave wrong value in r1
Later instruction reading r1 would get wrong value
Artificial: no value flows through dependence

Eliminate using different output register name for or
Pipeline effects
Doesn't affect in-order pipeline with single-cycle operations
One reason for making ALU operations go through M stage
Can happen with multi-cycle operations (e.g., FP or cache misses)

WAR Hazards
Write-after-read (WAR)
add r3,r2 --> r1
sub r4,r5 --> r2
or r1,r3 --> r6
Compiler effects
Scheduling problem: reordering would mean add uses wrong value for r2
Artificial: solve using different output register name for sub
Pipeline effects
Can't happen in simple in-order pipeline
Can happen with out-of-order execution
Data Hazards: Summary

Real insn sequences pass values via registers/memory
Three kinds of data dependences (where's the fourth?)
Read-after-write (RAW)    Write-after-read (WAR)    Write-after-write (WAW)
True-dependence           Anti-dependence           Output-dependence
add r2,r3 --> r1          add r2,r3 --> r1          add r2,r3 --> r1
sub r1,r4 --> r2          sub r1,r4 --> r2          sub r5,r4 --> r2
or r6,r3 --> r1           or r6,r3 --> r1           or r6,r3 --> r1
Only one dependence between any two insns (RAW has priority)
Dependence is property of the program and ISA
Data hazards: function of data dependences and pipeline

Potential for executing dependent insns in wrong order
Require both insns to be in pipeline (in flight) simultaneously
Structural Hazards
                    1  2  3  4  5  6  7
ld [r1] --> r2      F  D  X  M  W
add r4,r3 --> r1       F  D  X  M  W
sub r5,r3 --> r1          F  D  X  M  W
and r3,r4 --> r6             F  D  X  M
Structural hazard: resource needed twice in one cycle

Example: shared I/D$
To fix structural hazards: proper ISA/pipeline design

Each insn uses every structure exactly once
For at most one cycle
Always at same stage relative to F (fetch)
Tolerate structural hazards

Add stall logic to stall pipeline when hazards occur
Fixing Structural Hazards

                    1  2  3  4  5  6  7  8  9
ld [r1] --> r2      F  D  X  M  W
add r4,r3 --> r1       F  D  X  M  W
sub r5,r3 --> r1          F  D  X  M  W
and r3,r4 --> r6             s* F  D  X  M  W
Can fix structural hazards by stalling
s* = structural stall
Q: which one to stall: ld or and?
Always safe to stall younger instruction (here and)
But not always the best thing to do performance wise (?)
+ Low cost, simple
- Increases CPI
Upshot: better to avoid by design than to fix
Fetch stall logic: (X/M.op == ld || X/M.op == st)
Control Hazards
Pipeline works well when there is no transfer of control
F fetches next sequential instruction
Problem when sequential flow is disrupted
First, look at steps needed for branch: br (Rj op Rk) displ
Comparison between Rj and Rk
Set flag for outcome of comparison
Compute target address: PC + displ (if necessary)
Modification of PC (if necessary)
Need to use ALU twice
Add an ALU to ID stage to compute the target address
Control Hazards
[Figure: pipeline timeline of a branch instruction followed by the next fetches (F)]
Branch decision known at this stage
If taken, 2 instructions are wrong
PC is correct, fetch the right instruction
What to do?
Control Hazards
Default: assume not-taken (at fetch, can't tell it's a branch)
Control hazards indicated with c* (or not at all)
Taken branch penalty is 2 cycles
At decode, know it's a branch and stall
Insert no-ops for 2 cycles
                    1  2  3  4  5  6  7  8  9
addi r1,1 --> r3    F  D  X  M  W
bnez r3,targ           F  D  X  M  W
st r6 --> [r7+4]          c* c* F  D  X  M  W
Control Hazard CPI Calculation

Back of the envelope calculation
Branch: 20%, other: 80%, 75% of branches are taken
CPI_BASE = 1
Always stall
CPI_BASE+BRANCH = 1 + 0.20*2 = 1.4
Stalling for branches causes 40% slowdown
Only pay 2 cycle penalty for taken branches
CPI_BASE+BRANCH = 1 + 0.20*0.75*2 = 1.3
Branches cause 30% slowdown
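The two branch-penalty cases above differ only in what fraction of branches pay the 2-cycle penalty. A small sketch, with `branch_cpi` and its parameter names being assumptions made for this example:

```python
def branch_cpi(branch_fraction, penalty_cycles, fraction_paying, base_cpi=1.0):
    """CPI when only `fraction_paying` of all branches incur the penalty."""
    return base_cpi + branch_fraction * fraction_paying * penalty_cycles

always_stall = branch_cpi(0.20, 2, 1.0)          # every branch stalls
assume_not_taken = branch_cpi(0.20, 2, 0.75)     # only taken branches pay

print(round(always_stall, 2), round(assume_not_taken, 2))
```

The same function covers any scheme in which a known fraction of branches is penalized, by changing `fraction_paying`.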
Control Speculation and Recovery

Correct:
                         1  2  3  4  5  6  7  8  9
addi r1,1 --> r3         F  D  X  M  W
bnez r3,targ                F  D  X  M  W
st r6 --> [r7+4]               F  D  X  M  W     (speculative)
add r2,r1 --> r4                  F  D  X  M  W  (speculative)
targ: add r4,r5 --> r4               F  D  X  M  W
Speculate: Predict branch outcome
Simple prediction: predict not-taken
Mis-speculation recovery: what to do on wrong guess
Not too painful in an in-order pipeline
Branch resolves in X
+ Younger insns (in F, D) haven't changed permanent state
Flush insns currently in F/D and D/X (i.e., replace with nops)
Mis-speculation recovery
Recovery:
                         1  2  3  4  5  6  7  8  9
addi r1,1 --> r3         F  D  X  M  W
bnez r3,targ                F  D  X  M  W
st r6 --> [r7+4]               F  D  -- -- --
add r1,r2 --> r4                  F  -- -- -- --
targ: add r4,r5 --> r4               F  D  X  M  W
2 insns cancelled (changed to nops)

Did not reach a stage where they updated memory or registers
Can begin fetching from target calculated in X

Branch Prediction
Simple strategy assumes branch not taken
But... taken branches are more common
Want a predictive strategy
Yield taken or not-taken prediction with high probability of being right
Jump ahead to Chapter 4 (to prepare you for assignment)
Why is this important?

Consider deeper pipelines
Branch outcome not known for 10-20 cycles after decode
Out-of-order execution (heavily dependent on prediction)

Big Idea: Speculation

Speculation
Engagement in risky transactions on the chance of profit
Speculative execution
Execute before all parameters known with certainty
Correct speculation
+ Avoid stall, improve performance
Incorrect speculation (mis-speculation)

- Must abort/flush/squash incorrect instructions
- Must undo incorrect changes (recover pre-speculation state)
The game: [%correct * gain] > [(1 - %correct) * penalty]

Control Hazards: Control Speculation

Deal with control hazards with control speculation
Unknown parameter: are these the correct insns to execute next?
Mechanics
Guess branch target, start fetching at guessed position
Execute branch to verify (check) guess
Correct speculation? Keep going
Mis-speculation? Flush mis-speculated insns
Don't write registers or memory until prediction verified
Control Speculation
Speculation game for in-order 5 stage pipeline
Gain = 2 cycles
Penalty = 0 cycles
No penalty --> mis-speculation no worse than stalling
%correct = branch prediction

Static (compiler) ~85%, dynamic (hardware) >95%
Not much better? Static has 3X mispredicts!
Branch Prediction Performance

Again assume:
Branch: 20%, load: 20%, store: 10%, other: 50%
75% of branches are taken
Dynamic branch prediction

Branches predicted with 95% accuracy
CPI = 1 + 0.20*0.05*2 = 1.02
Anatomy of a Branch Predictor

[Figure: Program Execution --> Event Selection (all insns and/or branch insns) --> Prediction Index (PC and/or global history and/or local history, etc.) --> Prediction Mechanism (static (ISA), saturating counters, Markov, etc.); Feedback path: branch outcome, update prediction mechanism, update history; Recovery?]
Event (branch) source:

Dynamic program
Event selection:
Branches, but can make predictions on all insns (ignore for non-branch)
Predictor indexing:
Access one or more tables
Predictor mechanism
Static vs. Dynamic
Feedback and recovery

Real outcome known a few cycles after prediction
Reinforce predictor confidence
Dynamic Branch Prediction

BP part I: direction predictor
Applies to conditional branches only
Predicts taken/not-taken
- Hard
BP part II: target predictor

Applies to all control transfers
Supplies target PC, tells if insn is a branch prior to decode
+ Easy
Learn from the past, predict the future

Record history in hardware structure
Will focus on direction prediction, return to target prediction in Chapter 4
Branch Direction Prediction

Predict in fetch stage
Don't know if insn is branch but can disregard prediction if not
Direction predictor (DIRP)

Map conditional-branch PC to taken/not-taken (T/N) decision
Seemingly innocuous, but quite difficult to do well
Individual conditional branches often unbiased or weakly biased
90%+ one way or the other considered "biased"
Branch History Table (BHT)

Branch history table (BHT): simplest direction predictor
PC indexes table of bits (0 = N, 1 = T), no tags
Essentially: branch will go same way it went last time
Problem: consider inner loop branch below (* = mis-prediction)
for (i=0;i<100;i++)
  for (j=0;j<3;j++)
    // whatever
[Table: State/prediction vs. Outcome for the inner-loop branch; entries not recoverable]
Two built-in mis-predictions per execution of the inner loop (i.e., per outer-loop iteration)
Branch predictor changes its mind too quickly
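The "two built-in mis-predictions" can be checked with a small simulation of a single 1-bit BHT entry. The outcome pattern is an assumption: the sketch supposes the `j<3` test is one branch that goes one way three times and the other way once per execution of the inner loop, written here as T T T N. The function name is also made up for the example.

```python
def one_bit_mispredictions(outcomes, initial=False):
    """Count mispredictions of one 1-bit entry (True = taken)."""
    state = initial           # prediction = whatever happened last time
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken         # remember only the most recent outcome
    return misses

# Assumed inner-loop branch behaviour, repeated for 100 outer iterations.
inner_loop = [True, True, True, False] * 100
print(one_bit_mispredictions(inner_loop))  # 200
```

Under that assumption the entry is wrong once when the loop exits and once more when it is re-entered, every single time.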
Two-Bit Saturating Counters (2bc)

Two-bit saturating counters (2bc) [Smith]
Replace each single-bit prediction
(0,1,2,3) = (N,n,t,T)
Force DIRP to mis-predict twice before changing its mind
Two-bit Saturating Counter State Machine

[Figure: four-state machine: 11 (TT) Predict Taken, 10 (TN) Predict Taken, 01 (NT) Predict Not Taken, 00 (NN) Predict Not Taken]
Strong taken: last two instances were taken
Weak taken: last instance N but previous T
Weak not-taken: last instance T but previous N
Initial state
Strong not-taken: last two N

Two-bit saturating counter

for (i=0;i<100;i++)
  for (j=0;j<3;j++)
    // whatever
[Table: State/prediction vs. Outcome for the inner-loop branch; entries not recoverable]
+ Fixes this pathology (changing mind too quickly)
+ (which is not contrived, by the way)
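A two-bit saturating counter can be simulated in a few lines to see how it removes one of the two mis-predictions. This is a sketch under the same assumed outcome pattern as before (T T T N per execution of the inner loop); the counter encoding follows (0,1,2,3) = (N,n,t,T), and the starting value of 0 is an assumption.

```python
def two_bit_mispredictions(outcomes, counter=0):
    """Count mispredictions of one 2-bit saturating counter (True = taken)."""
    misses = 0
    for taken in outcomes:
        predict_taken = counter >= 2          # states t (2) and T (3)
        if predict_taken != taken:
            misses += 1
        # Move toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses

inner_loop = [True, True, True, False] * 100
print(two_bit_mispredictions(inner_loop))  # 102
```

After a short warm-up the counter sits in the two "taken" states, so only the loop exit is mis-predicted: one miss per execution of the inner loop instead of two.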
Branch Prediction Buffers (BPBs)

Storing and Accessing Predictions
Like memory hierarchy/cache
Avoid cost of tags (tags >> data -- 2 bits)
[Figure: k bits of the PC index a PHT of 2^k saturating counters]
PHT: Pattern history table
Aliasing

Two branches with different PC access same counter
Correlated branch behaviour

2 bit scheme: small amount of history (local scheme)
What about:
if (aa==2) aa = 0;
if (bb==2) bb = 0;
if (aa!=bb) --
Clearly branch 3 is correlated to the behaviour of branches 1 and 2
Predictor that uses single branch to predict outcome cannot capture this behaviour
Global scheme
Correlated Branch Prediction

History register
For each branch
Shift in outcome
Use history instead of PC to index PHT
[Figure: k-bit global (shift) register indexes a PHT of 2^k saturating counters]
Called a two-level predictor
When to update global history register?
Correlated Predictor
Correlated (two-level) predictor [Patt]
Exploits observation that branch outcomes are correlated
Maintains separate prediction per (PC, BHR)
Branch history register (BHR): recent branch outcomes
Simple working example: assume program has one branch

BHT: one 1-bit DIRP entry
BHT+2BHR: 4 1-bit DIRP entries
[Table: State/prediction for BHR=NN, BHR=NT, BHR=TN, BHR=TT with the active pattern marked, vs. Outcome; entries not recoverable]
We didn't make anything better, what's the problem?
What happened?
BHR wasn't long enough to capture the pattern
Try again: BHT+3BHR: 8 1-bit DIRP entries
[Table: State/prediction for BHR=NNN, NNT, NTN, NTT, TNN, TNT, TTN, TTT with the active pattern marked, vs. Outcome; entries not recoverable]
+ No mis-predictions after predictor learns all the relevant patterns
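The effect of BHR length can be reproduced with a small simulation of the one-branch working example: a table of 1-bit entries indexed only by the history register. The outcome pattern (T T T N repeating) is an assumption chosen to match the loop example, and the function name is hypothetical.

```python
def correlated_mispredictions(outcomes, history_bits):
    """Simulate 2**history_bits 1-bit entries indexed by the BHR.

    Returns a list of booleans, one per outcome: True = mis-predicted.
    """
    mask = (1 << history_bits) - 1
    bhr = 0                                   # all-N history to start
    table = [False] * (1 << history_bits)     # every entry predicts N
    missed = []
    for taken in outcomes:
        missed.append(table[bhr] != taken)
        table[bhr] = taken                    # 1-bit entry: remember last outcome
        bhr = ((bhr << 1) | int(taken)) & mask  # shift outcome into history
    return missed

pattern = [True, True, True, False] * 100
for bits in (2, 3):
    tail = correlated_mispredictions(pattern, bits)[-40:]
    print(bits, "history bits:", sum(tail), "mispredictions in last 40 branches")
```

With two history bits, the history TT is followed sometimes by T and sometimes by N, so that entry keeps flipping; with three bits every history has a unique successor and the steady-state mis-prediction count drops to zero.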

Design choice I: one global BHR or one per PC (local)?
Each one captures different kinds of patterns
Global is better, captures local patterns for tight loop branches
Design choice II: how many history bits (BHR size)?

Tricky one
+ Longer BHRs are better for some apps, shorter better for others
- PHT utilization decreases w/ long BHRs
Many history patterns are never seen
Many branches are history independent (don't care)
PC ^ BHR allows multiple PCs to dynamically share PHT
BHR length < log2(BHT size)
- Predictor takes longer to train
Typical length: 8-12

Various 2-level schemes
Hybrid Predictor
Hybrid (tournament) predictor [McFarling]
Attacks correlated predictor BHT utilization problem
Idea: combine two predictors
Simple PHT predicts history independent branches
Correlated predictor predicts only branches that need history
Chooser assigns branches to one predictor or the other
Branches start in simple PHT, move on mis-prediction threshold
+ Correlated predictor can be made smaller, handles fewer branches
+ 90-95% accuracy
[Figure: PC and GHR index two PHTs; a chooser indexed by PC selects between their predictions]
Branch Prediction Summary

Performance requirement for pipelined processors
Importance of accuracy increases with depth and width of pipeline
Basic building block
2-bit saturating counter
Predict direction and target

You'll implement several direction predictors
We'll cover target prediction in a few weeks
How are Interrupts/Exceptions Handled in Pipeline?

Interrupt: external, e.g., timer, I/O device requests
Exception: internal, e.g., /0, page fault, illegal instruction
Can occur at any stage in pipeline except WB
Page fault: IF or Mem
Illegal op code: ID
Divide by zero: EX
Upon detecting interrupt

Operating system saves processor state
Interrupt handling routine
Abort or correct
Handling Interrupts/Exceptions
Instructions before X (X-1, X-2...) in program order currently in pipeline
Complete normally
Results are part of saved process state
Instruction X and instructions after it already in the pipeline
Converted into nops
Saved PC corresponds to the PC of instruction X
Program will restart at X
Called precise state or precise interrupts
Handling Precise Exceptions

Straightforward for simple pipeline
When exception detected
Flag inserted into pipeline register
Instruction converted to nop
Exception handled when flag reaches M/W pipeline register
Why?
Handling Precise Exceptions

Exceptions handled in program order, not temporal order
                    1  2  3  4  5  6
div r1,r0 --> r2    F  D  X  M  W
add r3,r4 --> r5       F  D  X  M  W
Handle /0 exception first
Will revisit exception handling for more complex pipelines later
Pipelined Functional Units

Pipeline so far... shallow and simple
Abstraction
Real pipelines are much deeper
Each stage can be decomposed into several stages
We'll come back to F, D, M, W later
Let's look at EX
Fast integer arithmetic and logic operations
Single cycle
Slow integer arithmetic operations: multiply, divide
Floating point operations: add, multiply, divide, sqrt
Pipelined (except div and sqrt)
Pipelined Functional Units

How many stages?
Minimum cycles between independent instructions of same type (no RAW dependency)
Ex: Intel Pentium Pro
Integer multiplies (4 stages), new instruction every cycle
FP add (3 stages), new insn every cycle
FP mult (five stages), new insn every 2 cycles
Hazards and Forwarding

[Figure: parallel execution pipelines feeding MEM: EX (integer), M1-M7 (multiply), A1-A4 (FP add), div (unpipelined)]
Long operations: RAW stalls will be more frequent
WAW hazards are possible: insns no longer reach WB in order
WAR hazards are not possible: register reads always occur in D
Handling WAW Hazards

                    1  2  3  4  5  6  7  8  9  10
div f0,f1 --> f2    F  D  E/ E/ E/ E/ E/ W
stf f2 --> [r1]        F  D  d* d* d* X  M  W
addf f0,f1 --> f2         F  D  E+ E+ W
[Figure fragment: a further diagram row involving addf f2,f3 --> f4 is not recoverable]
What to do?
Option I: stall younger instruction (addf) at writeback
+ Intuitive, simple
- Lower performance, cascading W structural hazards
Option II: cancel older instruction (divf) writeback

+ No performance loss
- What if divf or stf cause an exception (e.g., /0, page fault)?
Complicates exception handling
Basic Design Recap

5-stage pipeline
F (Fetch), D (Decode), E (Execute), M (Memory Access), W (Writeback)
Data hazards
RAW --> add forwarding logic
Stall on load-to-use
Control hazards
Branches: flush instructions when branch taken
Branch prediction
Almost done: assume perfect memory

Now add memory hierarchy