
ECE 552 Chapter 2 - The Basics


Natalie Enright Jerger
Lecture notes based on slides created by Amir Roth of University of Pennsylvania with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood. Lecture notes enhanced by Milo Martin, Mark Hill, and David Wood with sources that included Profs. Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith, Sohi, Vijaykumar, and Wood

Recap

Evaluation Metrics
  Cost, Power, Reliability
  Performance
What is Latency? Throughput?
CPU Performance Equation
  Latency(P,A) = seconds / program = (instructions / program) * (cycles / instruction) * (seconds / cycle)
Amdahl's Law
Fall 2012 ECE 552 (Enright Jerger): Basics 2

This lecture

[Figure: system layers: Application, OS, Compiler, Firmware, CPU, I/O, Memory, Digital Circuits, Gates and Transistors]
Review features found in all modern microprocessors
  Pipelining
    Single, in-order issue
    Clock rate vs. IPC
  Data Hazards
    Hardware: stalling and bypassing
  Control Hazards
  Memory Hierarchy
Will build on basic design throughout semester

The Sequential Model

[Figure: execution loop: Fetch PC -> Decode -> Read Inputs -> Execute -> Write Output -> Next PC]
Implicit model of all modern commercial ISAs
  Called von Neumann, but in ENIAC design before
Basic feature: the program counter (PC)
  Defines total order on dynamic instructions
  Next PC is PC++ unless instruction (insn) says otherwise
  Order and named storage define computation
  Value flows from insn X to Y via storage A iff X names A as output, Y names A as input, and Y is after X in total order
Processor logically executes loop at left
  Instruction execution assumed atomic
  Instruction X finishes before insn X+1 starts
Alternatives have been proposed

Datapath: Single Cycle

[Figure: single-cycle datapath: PC, +4 adder, I$, register file (s1, s2, d), D$, control]
Datapath
  Functional units (ALUs), registers, memory interface
Control: decode portion
  Muxes, write enable signals regulate flow of data in datapath
  Translates opcode into control signals

Breaking Instruction Down into Steps

Instruction Fetch (F)
  Instruction fetched from memory at address indicated by PC
  Increment PC to point to next instruction in sequence
Instruction Decode (D)
  Decode instruction to find type
  Read registers
Execution (E)
  ALU performs arithmetic/logic operation (arithmetic insn)
  Compute memory address (load/store insn)
  If branch, compute target PC and update
Memory access (M)
  Contents of Mem[addr] are fetched (load)
  Contents of Mem[addr] are modified (store)
Writeback (W)
  Write instruction result to register

Datapath: Multi-Cycle

[Figure: multi-cycle datapath: PC, +4 adder, I$, IR, register file (s1, s2, d), latches A, B, O, D, D$, control]
Add latches to create multi-cycle implementation
Load is going to have longest path through datapath

Define each temporary value
  A = 1st source register
  B = 2nd source register
  O = output from ALU
  D = output from data cache
  IR = instruction register
    Holds insn to drive control signals

Datapath: Multi-Cycle - Fetch

[Figure: multi-cycle datapath, as on the previous slide]
Instruction Fetch (F)
  IR = Mem[PC]
  PC = PC + 4

Datapath: Multi-Cycle - Decode

[Figure: multi-cycle datapath, as on the previous slide]
Instruction Decode/Register Read (D)
  A = Regs[s1]
  B = Regs[s2]

Datapath: Multi-Cycle - Execute

[Figure: multi-cycle datapath, as on the previous slide]
Execute (E)
  O = A + Imm32 (Memory operations)
  O = A op B (Reg-Reg/Arithmetic)
  O = A op Imm32 (Reg-Imm/Arithmetic)
  Branch: ALU2 = NPC + Imm32; Cond = (A == 0)?; if (Cond) PC = ALU2

Datapath: Multi-Cycle - Memory Access

[Figure: multi-cycle datapath, as on the previous slide]
Memory Access (M)
  D = Mem[O] (load)
  Mem[O] = B (store)

Datapath: Multi-Cycle - Writeback

[Figure: multi-cycle datapath, as on the previous slide]
Writeback (W)
  Regs[d] = O (Reg-Reg/Arithmetic)
  Regs[d] = O (Reg-Imm/Arithmetic)
  Regs[d] = D (Load)
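
The F, D, E, M, W register transfers above can be sketched as a small interpreter. This is only an illustration under stated assumptions: the tuple encoding `(op, d, s1, s2, imm)`, the opcode names, the dictionary used as data memory, and the 4-cycle cost for stores are all hypothetical choices for the example; the 3/5/4 cycle counts for branch/load/ALU come from the comparison later in these notes.

```python
# Cycles per instruction class; the store count is an assumption (F, D, E, M).
CYCLES = {"alu": 4, "load": 5, "store": 4, "branch": 3}

def run_multicycle(program, regs, mem):
    """Run `program` step by step on temporaries IR, A, B, O, D; return total cycles."""
    pc, cycles = 0, 0
    while 0 <= pc // 4 < len(program):
        ir = program[pc // 4]          # F: IR = Mem[PC]
        pc += 4                        # F: PC = PC + 4 (this is NPC)
        op, d, s1, s2, imm = ir
        a, b = regs[s1], regs[s2]      # D: A = Regs[s1], B = Regs[s2]
        if op == "add":
            o = a + b                  # E: O = A op B
            regs[d] = o                # W: Regs[d] = O
            cycles += CYCLES["alu"]
        elif op == "addi":
            o = a + imm                # E: O = A op Imm32
            regs[d] = o                # W: Regs[d] = O
            cycles += CYCLES["alu"]
        elif op == "lw":
            o = a + imm                # E: O = A + Imm32
            dval = mem[o]              # M: D = Mem[O]
            regs[d] = dval             # W: Regs[d] = D
            cycles += CYCLES["load"]
        elif op == "sw":
            o = a + imm                # E: O = A + Imm32
            mem[o] = b                 # M: Mem[O] = B
            cycles += CYCLES["store"]
        elif op == "beqz":
            if a == 0:                 # E: Cond = (A == 0)
                pc = pc + imm          # E: PC = NPC + Imm32
            cycles += CYCLES["branch"]
        else:
            raise ValueError("unknown opcode: " + op)
    return cycles

# Usage: add, then store the sum, then load it back.
regs = [0, 10, 20, 0, 100, 0, 0, 0]
mem = {}
program = [("add", 3, 1, 2, 0), ("sw", 0, 4, 3, 0), ("lw", 5, 4, 0, 0)]
total = run_multicycle(program, regs, mem)
```

Each instruction only pays for the steps it uses, which is why the multi-cycle CPI depends on the instruction mix.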


Quick Review

[Figure: timing comparison]
  Single-cycle: insn0.fetch, dec, exec | insn1.fetch, dec, exec
  Multi-cycle: insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec
Basic datapath: fetch, decode, execute
Single-cycle control: hardwired
  + Low CPI (1)
  - Long clock period (to accommodate slowest instruction)
Multi-cycle control: micro-programmed/state machine
  + Short clock period
  - High CPI

Single vs. Multi-Cycle

Single Cycle
  Clock period = 50ns, CPI = 1
  Performance = 50ns/insn
Multi-cycle
  Branch: 20% (3 cycles), load: 20% (5 cycles), ALU: 60% (4 cycles)
  Clock period = 11ns, CPI = (0.2*3 + 0.2*5 + 0.6*4) = 4
    Why 11ns clock period and not 10ns?
  Performance = 44ns/insn

Latency vs. Throughput

Can we have both low CPI and short clock period?
  Not if datapath executes only one instruction at a time
Latency vs. Throughput
  - Latency: no good way to make a single instruction go faster
  + Throughput: fortunately, we don't care about single insn latency
    Goal is to make programs (not individual insns) go faster
    Programs contain billions of insns

Pipelining

[Figure: timing comparison]
  Multi-cycle: insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec
  Pipelined: insn0.fetch | insn0.dec | insn0.exec, with insn1.fetch | insn1.dec | insn1.exec starting one stage later
Important performance technique
  Improves instruction throughput rather than instruction latency
Begin with multi-cycle design
  When instruction advances from stage 1 to 2, allow next instruction to enter stage 1
  Form of parallelism: insn-stage parallelism
  Individual instruction takes the same number of stages
  + But instructions enter and leave at a much faster rate
Automotive assembly line analogy

Snapshot

[Figure: pipeline snapshot over cycles t to t+9 with rows Insn i, Insn i+1, Insn i+2, Insn i+3, ...; each insn moves through F, D, X, M, W one cycle behind its predecessor, so by cycle t+3 the stages hold W M X D F]
5 instructions in progress
Some stages may take longer
Five instructions in-flight concurrently

5 Stage Pipelined Datapath

[Figure: 5-stage pipelined datapath: PC, +4 adder, I$, register file (s1, s2, d), D$, with temporaries PC, IR, A, B, O, D latched between stages]
Temporary values (PC, IR, A, B, O, D) re-latched every stage
  Why? 5 insns may be in pipeline at once, do they share a single PC?
  Notice, PC not latched after ALU stage (why not?)

Pipeline Terminology

[Figure: 5-stage pipelined datapath with pipeline latches PC, F/D, D/X, X/M, M/W]
Five stage: Fetch, Decode, eXecute, Memory, Writeback
  Nothing magical about the number 5 (Pentium 4 has 22 stages)
Latches (pipeline registers) named by stages they separate
  PC, F/D, D/X, X/M, M/W

Pipeline Control

[Figure: pipelined datapath with a CTRL block; control signals xC, mC, wC travel down the pipeline latches alongside IR]
One single-cycle controller, but pipeline the control signals

Terminology and Foreshadowing

Scalar pipeline: one insn per stage per cycle
  Alternative: superscalar (multiple insns per stage per cycle) (later)
In-order pipeline: insns enter execute stage in program order
  Alternative: out-of-order (later)
Pipeline depth: number of pipeline stages
  Nothing magical about 5
  Logical stages
  Trend has been to deeper pipelines (more later)

Instruction Convention

Some ISAs (example: MIPS)
  Instruction destination (i.e. output) on the left
  add r1, r2, r3 means r1 <-- r2+r3
Other ISAs
  Instruction destination (i.e. output) on the right
  add r1, r2, r3 means r1+r2 --> r3
Style used for examples
  Will use add r2, r3 --> r1 to make destination register clear

Pipeline Example: Cycle 1

[Figure: 5-stage pipelined datapath with latches F/D, D/X, X/M, M/W; repeated on each of the following cycle slides]
3 instructions
  F: add r1, r2 --> r3

Pipeline Example: Cycle 2

  F: lw [r5+0] --> r4
  D: add r1, r2 --> r3

Pipeline Example: Cycle 3

  F: sw r6 --> [r7+4]
  D: lw [r5+0] --> r4
  X: add r1, r2 --> r3

Pipeline Example: Cycle 4

  D: sw r6 --> [r7+4]
  X: lw [r5+0] --> r4
  M: add r1, r2 --> r3

Pipeline Example: Cycle 5

  X: sw r6 --> [r7+4]
  M: lw [r5+0] --> r4
  W: add r1, r2 --> r3

Pipeline Example: Cycle 6

  M: sw r6 --> [r7+4]
  W: lw [r5+0] --> r4

Pipeline Example: Cycle 7

  W: sw r6 --> [r7+4]

Pipeline Diagram

                        1  2  3  4  5  6  7  8  9
add r1,r2 --> r3        F  D  X  M  W
ld [r5] --> r4             F  D  X  M  W
st r6 --> [r7+4]              F  D  X  M  W

Pipeline diagram
  Cycles across, insns down
  Convention: X means ld [r5] --> r4 finishes execute stage and writes into X/M latch at end of cycle 4

Pipelining

Balanced
  All stages must take approximately the same time
  Doesn't make sense to optimize a stage whose processing time is not longest
Insn must pass through all stages and in same order
  Even if stage is unneeded
Buffering
  Not all stages take exactly the same time
Independent computations
  No relationships between work units
  Minimize pipeline stalls

Pipeline Performance Calculation

Back of the envelope calculation
  Branch: 20%, load: 20%, ALU: 60%
Single-cycle
  Clock period = 50ns, CPI = 1
  Performance = 50ns/insn
Multi-cycle
  Branch 3 cycles, load 5 cycles, ALU 4 cycles
  Clock period = 11ns, CPI = (0.2*3 + 0.2*5 + 0.6*4) = 4
  Performance = 44ns/insn
Pipelined
  Clock period = 12ns (approx 50ns/5 stages + overheads)
  CPI = 1 (each insn takes 5 cycles, but 1 completes each cycle)
  Performance = 12ns/insn
  Actually ... CPI = 1 + some penalty for pipelining
  Say CPI = 1.5 (on average instruction completes every 1.5 cycles)
  Performance = 18ns/insn
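
The back-of-the-envelope numbers above can be reproduced in a few lines. This is just a sketch of the arithmetic (time per instruction = clock period * CPI); the variable and function names are made up for the example.

```python
def ns_per_insn(clock_ns, cpi):
    """Average time per instruction = clock period * cycles per instruction."""
    return clock_ns * cpi

# Instruction mix from the slide: (frequency, cycles in the multi-cycle design).
mix = {"branch": (0.20, 3), "load": (0.20, 5), "alu": (0.60, 4)}
multi_cpi = sum(freq * cycles for freq, cycles in mix.values())

single_cycle = ns_per_insn(50, 1)          # 50 ns/insn
multi_cycle = ns_per_insn(11, multi_cpi)   # about 44 ns/insn
pipelined_ideal = ns_per_insn(12, 1)       # 12 ns/insn
pipelined_real = ns_per_insn(12, 1.5)      # 18 ns/insn with stall penalties

for name, value in [("single-cycle", single_cycle), ("multi-cycle", multi_cycle),
                    ("pipelined (ideal)", pipelined_ideal),
                    ("pipelined (CPI 1.5)", pipelined_real)]:
    print(f"{name}: {value:.1f} ns/insn")
```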


Q1: Why is Pipeline Clock Period...

... > (delay thru datapath) / (# of pipeline stages)?
  Latches add delay
  Extra bypassing logic adds delay
  Pipeline stages have different delays, clock period is max delay
  These factors have implications for ideal number of pipeline stages

Q2: Why is Pipeline CPI...

... > 1?
  CPI for scalar in-order pipeline is 1 + stall penalties
  Stalls used to resolve hazards
    Hazard: condition that jeopardizes sequential illusion
    Stall: pipeline delay introduced to restore sequential illusion
Calculating pipeline CPI
  Frequency of stall * stall cycles
  Penalties add (stalls generally don't overlap in in-order pipelines)
  1 + stall-freq1 * stall-cyc1 + stall-freq2 * stall-cyc2 + ...
Correctness/performance/make common case fast (MCCF)
  Long penalties OK if they happen rarely (e.g., 1 + 0.01 * 10 = 1.1)
  Stalls also have implications for ideal number of pipeline stages
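
The additive CPI formula above is easy to express as a function. A minimal sketch, assuming (as the slide does) that stalls do not overlap; the function name and the second stall source in the usage lines are hypothetical.

```python
def pipeline_cpi(stalls, base_cpi=1.0):
    """CPI = base + sum(stall frequency * stall cycles), assuming stalls don't overlap.

    `stalls` is a list of (frequency per instruction, penalty in cycles) pairs.
    """
    return base_cpi + sum(freq * cycles for freq, cycles in stalls)

# The rare-but-long penalty from the slide: 1% of insns stall for 10 cycles.
rare_long = pipeline_cpi([(0.01, 10)])
# A hypothetical second stall source (5% of insns, 2 cycles) simply adds on.
two_sources = pipeline_cpi([(0.01, 10), (0.05, 2)])
```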

Managing a Pipeline

Proper flow requires two pipeline operations
  Mess with latch write-enable and clear signals to achieve
Operation I: stall
  Effect: stops some insns in their current stages
  Use: make younger insns wait for older ones to complete
  Implementation: de-assert write-enable
Operation II: flush
  Effect: removes insns from current stages
  Use: see later
  Implementation: assert clear signals
Both stall and flush must be propagated to younger insns

Dependence & Hazards

Dependence: relationship that serializes two insns
  Data: two insns use the same value or storage location
  Control: one instruction affects whether another executes at all
  Maybe: two insns may have a dependence
Hazard: dependence causes potential incorrect execution
  Possibility of using or corrupting data or execution flow
  Structural: two insns want to use same structure, one must wait
  Often fixed with stalls: insn stays in same stage for multiple cycles

Data Hazards/Dependence

Let's forget about branches and control for a while
3 insn seq from earlier example
  add r3, r2, r1
  lw r4, 0(r5)
  sw r6, 4(r7)
But it wasn't a real program
Real programs have data dependences
  They pass values via registers and memory

RAW

Read-after-write (RAW)
  add r2,r3 --> r1
  sub r1,r4 --> r2
  or r6,r3 --> r1
Problem: swap would mean sub uses wrong value for r1
True: value flows through this dependence
  Using different output register for add doesn't help

Dependent Operations

Independent operations
  add r1, r2 --> r3
  add r4, r5 --> r6
Would this program execute correctly on a pipeline?
  add r1, r2 --> r3
  add r3, r5 --> r6
[Figure: 5-stage pipelined datapath with latches F/D, D/X, X/M, M/W; the younger add is in D (read r3, r5) while the older add is in X (compute r3 = r2 + r1); older add does not update r3 until WB]
This one?
  add r1, r2 --> r3
  ld [r3] --> r4
  addi r3, 1 --> r6
  st r3 --> [r7]

Data Hazard Example

[Figure: 5-stage pipelined datapath; sw r3 --> [r7] in D, addi r3, 1 --> r6 in X, lw [r3] --> r4 in M, add r1, r2 --> r3 in W]
Which instructions would execute with correct inputs?
  add is writing its result into r3 in current cycle
  lw read r3 2 cycles ago: wrong value
  addi read r3 1 cycle ago: wrong value
  sw is reading r3 this cycle: maybe

RAW: Detect and Stall

Stall logic: detect and stall reader in D
  (F/D.rs1 && (F/D.rs1==D/X.rd || F/D.rs1==X/M.rd)) ||
  (F/D.rs2 && (F/D.rs2==D/X.rd || F/D.rs2==X/M.rd))
Re-evaluated every cycle until no longer true
  + Low cost, simple
  - IPC degradation, dependences are the common case
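
The stall condition above translates directly into a predicate. This is a sketch, not the course's reference logic: it assumes register fields are small integers and that register number 0 stands for "no register", mirroring the `F/D.rs1 && ...` truthiness test; the function and parameter names are invented for the example.

```python
def raw_stall(fd_rs1, fd_rs2, dx_rd, xm_rd):
    """True if the insn in D must stall because an older insn in X or M
    has not yet written a register that it reads (no bypassing)."""
    def hazard(src):
        # Register 0 is treated as "no source register" (assumption).
        return src != 0 and (src == dx_rd or src == xm_rd)
    return hazard(fd_rs1) or hazard(fd_rs2)

# lw [r3] --> r4 in D while add r1,r2 --> r3 is in X: must stall.
stall_now = raw_stall(fd_rs1=3, fd_rs2=0, dx_rd=3, xm_rd=0)
# Independent adds: no stall.
no_stall = raw_stall(fd_rs1=4, fd_rs2=5, dx_rd=3, xm_rd=0)
```

The predicate does not look at M/W.rd because, as the stall timing discussion notes, W writes the register file in the first half of the cycle and D reads in the second half.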

RAW Hazard: Detect and Stall

[Figure: stages IF, ID, EX, MEM, WB; the IR fields of latches F/D, D/X, X/M, M/W feed a hazard detection block, which writes a NOP into D/X]
Prevent F/D insn from reading (advancing this cycle)
  Write nop into D/X.IR (effectively, insert nop in hardware)
  Also reset (clear) datapath control signals
  Disable F/D latch and PC write enables (why?)
Re-evaluate situation next cycle

Stall Timing (without bypassing)

D and W stages share regfile
  Each gets regfile for half a cycle
  1st half W writes, 2nd half D reads
  2 cycle stall
  d* = data stall, p* = propagated stall

                        1  2  3  4  5  6  7  8  9  10
add r1,r2 --> r3        F  D  X  M  W
lw [r3] --> r4             F  d* d* D  X  M  W
addi r3,1 --> r6              p* p* F  D  X  M  W
sw r3 --> [r7]                         F  D  X  M  W

Avoiding Stalls: Observation

[Figure: stages IF, ID, EX, MEM, WB; lw [r3+4] --> r4 in EX, add r1, r2 --> r3 in MEM]
Problem arises because
  lw r4, 4(r3) has already read r3 from regfile
  add r3, r2, r1 hasn't yet written r3 to regfile
But fundamentally, OK. Why?
  lw r4, 4(r3) hasn't used r3 yet
  add r3, r2, r1 has already computed r3

Avoiding Stalls: Observation

add r1, r2 --> r3:   IF   ID   EX (r3 computed)   MEM   WB (r3 written)
ld [r3] --> r4:           IF   ID (r3 read)   EX (r3 needed)   MEM   WB
Forwarding paths and control logic can resolve dependencies

Reducing RAW Stalls with Bypassing

Why wait until W stage? Data available after X or M stage
  Bypass (aka forward) data directly to input of X or M
    MX: from beginning of M (X output) to input of X
    WX: from beginning of W (M output) to input of X
    WM: from beginning of W (M output) to data input of M
    Two each of MX, WX + WM = full bypassing
  + Reduces stalls in a big way
  - Additional wires and muxes may increase clock cycle

Avoiding Stalls: Bypassing

[Figure: stages IF, ID, EX, MEM, WB with latches F/D, D/X, X/M, M/W; lw [r3+4] --> r4 in EX, add r1, r2 --> r3 in MEM]
Bypassing (aka forwarding)
  Reading a value from an intermediate source
  Not waiting until it is available from primary source (e.g. regfile)

Bypass Logic

[Figure: register file, ALU input muxes and D$ with latches F/D, D/X, X/M, M/W; a bypass block drives the mux selects]
Bypass logic: similar to but separate from stall logic
  Stall logic controls latches, bypass logic controls mux inputs
  Complement one another: can't bypass --> must stall
ALU-A input mux bypass logic
  (X/M.rd==D/X.rs1) --> 0 // check first
  (M/W.rd==D/X.rs1) --> 1 // check second
  Else --> 2 // check last
  Why?
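
The prioritized mux select above can be written as a short function. This sketch follows the slide's encoding (0 = X/M latch, 1 = M/W latch, 2 = register-file value); the function and parameter names are hypothetical, and a real design would presumably also check that the older instruction actually writes a register.

```python
def alu_a_select(dx_rs1, xm_rd, mw_rd):
    """Pick the ALU-A mux input for the insn in X."""
    if xm_rd == dx_rs1:      # check first: youngest producer (MX bypass)
        return 0
    if mw_rd == dx_rs1:      # check second: older producer (WX bypass)
        return 1
    return 2                 # check last: value read from the register file

# Both older insns wrote r1: the younger one (in M) holds the latest value.
sel_both = alu_a_select(dx_rs1=1, xm_rd=1, mw_rd=1)
```

Checking X/M before M/W matters when two in-flight instructions write the same register: the consumer must see the most recent one in program order.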


Pipeline Diagrams with Bypassing

If bypass exists, from/to stages execute in same cycle
Example: full bypassing, use MX bypass
                        1  2  3  4  5  6  7  8  9  10
add r2,r3 --> r1        F  D  X  M  W
sub r1,r4 --> r2           F  D  X  M  W

Example: full bypassing, use WX bypass
                        1  2  3  4  5  6  7  8  9  10
add r2,r3 --> r1        F  D  X  M  W
ld [r7] --> r5             F  D  X  M  W
sub r1,r4 --> r2              F  D  X  M  W

Example: WM bypass
                        1  2  3  4  5  6  7  8  9  10
add r2,r3 --> r1        F  D  X  M  W
?                          F  D  X  M  W

Can you think of a code example that uses the WM bypass?

Does bypassing prevent all Data Hazards?

[Figure: pipelined datapath with latches F/D, D/X, X/M, M/W; add r2, r3 --> r4 is stalled in D, lw [r2+4] --> r3 is in X, and a NOP is inserted into D/X]
No. Bypassing alone is not sufficient
Solution? Detect and stall.

Stalling on Load-To-Use Dependences

[Figure: pipelined datapath with latches F/D, D/X, X/M, M/W; add r2, r3 --> r4 in D (stall), lw [r2+4] --> r3 in X, NOP written into D/X]
Stall = (D/X.IR.Operation == LOAD) &&
        ((F/D.IR.rs1 == D/X.IR.rd) ||
         ((F/D.IR.rs2 == D/X.IR.rd) && (F/D.IR.OP != Store)))
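
The load-to-use stall equation above, written out as a predicate. A sketch only: the string opcodes and parameter names are assumptions made for the example. The store exception on rs2 reflects that a store's data operand is not needed until M, where (per the bypassing slides) the WM bypass can supply it.

```python
def load_use_stall(dx_op, dx_rd, fd_op, fd_rs1, fd_rs2):
    """True if the insn in D must wait one cycle for a load that is in X."""
    if dx_op != "load":
        return False
    needs_rs1 = fd_rs1 == dx_rd
    # rs2 of a store is its data operand: it can take the WM bypass instead.
    needs_rs2 = fd_rs2 == dx_rd and fd_op != "store"
    return needs_rs1 or needs_rs2

# lw [r2+4] --> r3 in X, add r2, r3 --> r4 in D: stall.
stall_add = load_use_stall("load", 3, "add", 2, 3)
# Store whose data register is the load destination: no stall.
stall_store_data = load_use_stall("load", 3, "store", 7, 3)
```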

Stalling on Load-To-Use Dependences

[Figure: following cycles of the same example; add r2, r3 --> r4 is held by the stall and a bubble follows lw [r2+4] --> r3 down the pipeline]

Load-Use Stalls

Even with full bypassing, stall logic is unavoidable
Load-use stall
  Load value not ready at beginning of M --> can't use MX bypass
  Use WX bypass

                        1  2  3  4  5  6  7  8  9  10
ld [r3+4] --> r1        F  D  X  M  W
sub r1,r4 --> r2           F  d* D  X  M  W

Aside I: how does stall/bypass logic handle cache misses?
Aside II: compiler scheduling can be used to reduce load-use stall frequency

Performance Impact of Load/Use Penalty

Assume
  Branch: 20%, load: 20%, store: 10%, other: 50%
  50% of loads are followed by dependent instruction
    require 1 cycle stall (i.e. insertion of 1 nop)
Calculate CPI
  CPI = 1 + (1 * 0.20 * 0.50) = 1.1

WAW Hazards

Write-after-write (WAW)
  add r2,r3 --> r1
  sub r1,r4 --> r2
  or r3,r6 --> r1
Compiler effects
  Scheduling problem: reordering would leave wrong value in r1
    Later instruction reading r1 would get wrong value
  Artificial: no value flows through dependence
    Eliminate using different output register name for or
Pipeline effects
  Doesn't affect in-order pipeline with single-cycle operations
    One reason for making ALU operations go through M stage
  Can happen with multi-cycle operations (e.g., FP or cache misses)

WAR Hazards

Write-after-read (WAR)
  add r3,r2 --> r1
  sub r4,r5 --> r2
  or r1,r3 --> r6
Compiler effects
  Scheduling problem: reordering would mean add uses wrong value for r2
  Artificial: solve using different output register name for sub
Pipeline effects
  Can't happen in simple in-order pipeline
  Can happen with out-of-order execution

Data Hazards: Summary

Real insn sequences pass values via registers/memory
  Three kinds of data dependences (where's the fourth?)

Read-after-write (RAW)    Write-after-read (WAR)    Write-after-write (WAW)
True-dependence           Anti-dependence           Output-dependence
add r2,r3 --> r1          add r2,r3 --> r1          add r2,r3 --> r1
sub r1,r4 --> r2          sub r5,r4 --> r2          sub r1,r4 --> r2
or r6,r3 --> r1           or r6,r3 --> r1           or r6,r3 --> r1

  Only one dependence between any two insns (RAW has priority)
  Dependence is property of the program and ISA
Data hazards: function of data dependences and pipeline
  Potential for executing dependent insns in wrong order
  Require both insns to be in pipeline (in flight) simultaneously

Structural Hazards

                        1  2  3  4  5  6  7  8
ld [r1] --> r2          F  D  X  M  W
add r4,r3 --> r1           F  D  X  M  W
sub r5,r3 --> r1              F  D  X  M  W
and r3,r4 --> r6                 F  D  X  M  W

Structural hazard: resource needed twice in one cycle
  Example: shared I/D$
To fix structural hazards: proper ISA/pipeline design
  Each insn uses every structure exactly once
  For at most one cycle
  Always at same stage relative to F (fetch)
Tolerate structural hazards
  Add stall logic to stall pipeline when hazards occur

Fixing Structural Hazards

                        1  2  3  4  5  6  7  8  9
ld [r1] --> r2          F  D  X  M  W
add r4,r3 --> r1           F  D  X  M  W
sub r5,r3 --> r1              F  D  X  M  W
and r3,r4 --> r6                 s* F  D  X  M  W

Can fix structural hazards by stalling
  s* = structural stall
  Q: which one to stall: ld or and?
    Always safe to stall younger instruction (here and)
    But not always the best thing to do performance wise (?)
  + Low cost, simple
  - Increases CPI
  Upshot: better to avoid by design than to fix
  Fetch stall logic: (X/M.op == ld || X/M.op == st)

Control Hazards

Pipeline works well when there is no transfer of control
  F fetches next sequential instruction
  Problem when sequential flow is disrupted
First, look at steps needed for branch: br (Rj op Rk) displ
  Comparison between Rj and Rk
  Set flag for outcome of comparison
  Compute target address: PC + displ (if necessary)
  Modification of PC (if necessary)
Need to use ALU twice (comparison and target address)
  Add an ALU to ID stage to compute the target address

Control Hazards

[Figure: pipeline diagram of a branch instruction followed by three fetches (F); the stage where the branch decision is known is marked; if taken, the 2 instructions fetched after the branch are wrong; by the third fetch the PC is correct and the right instruction is fetched]
What to do?

Control Hazards

Default: assume not-taken (at fetch, can't tell it's a branch)
Control hazards indicated with c* (or not at all)
Taken branch penalty is 2 cycles
At decode, know it's a branch and stall
  Insert no-ops for 2 cycles

                        1  2  3  4  5  6  7  8  9
addi r1,1 --> r3        F  D  X  M  W
bnez r3,targ               F  D  X  M  W
st r6 --> [r7+4]              c* c* F  D  X  M  W

Control Hazard CPI Calculation

Back of the envelope calculation
  Branch: 20%, other: 80%, 75% of branches are taken
  CPI_BASE = 1
Always stall
  CPI_BASE+BRANCH = 1 + 0.20*2 = 1.4
  Stalling for branches causes 40% slowdown
Only pay 2 cycle penalty for taken branches
  CPI_BASE+BRANCH = 1 + 0.20*0.75*2 = 1.3
  Branches cause 30% slowdown
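
The two branch-penalty scenarios above differ only in what fraction of branches pays the 2-cycle penalty. A small sketch of that calculation; the function name and the `penalized_fraction` parameter are made up for the example.

```python
def branch_cpi(branch_freq, penalty_cycles, penalized_fraction=1.0, base_cpi=1.0):
    """CPI when `penalized_fraction` of branches each cost `penalty_cycles`."""
    return base_cpi + branch_freq * penalized_fraction * penalty_cycles

def slowdown(cpi, base_cpi=1.0):
    """Relative slowdown versus the base CPI (0.4 means 40% slower)."""
    return cpi / base_cpi - 1.0

always_stall = branch_cpi(0.20, 2)                          # every branch pays
taken_only = branch_cpi(0.20, 2, penalized_fraction=0.75)   # only taken branches pay
```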


Control Speculation and Recovery

Correct:
                        1  2  3  4  5  6  7  8  9
addi r1,1 --> r3        F  D  X  M  W
bnez r3,targ               F  D  X  M  W
st r6 --> [r7+4]              F  D  X  M  W        (speculative)
add r2,r1 --> r4                 F  D  X  M  W     (speculative)
targ: add r4,r5 --> r4              F  D  X  M  W

Speculate: predict branch outcome
  Simple prediction: predict not-taken
Mis-speculation recovery: what to do on wrong guess
  Not too painful in an in-order pipeline
  Branch resolves in X
  + Younger insns (in F, D) haven't changed permanent state
  Flush insns currently in F/D and D/X (i.e., replace with nops)

Mis-speculation recovery

Recovery:
                        1  2  3  4  5  6  7  8  9
addi r1,1 --> r3        F  D  X  M  W
bnez r3,targ               F  D  X  M  W
st r6 --> [r7+4]              F  D  -- -- --
add r1,r2 --> r4                 F  -- -- -- --
targ: add r4,r5 --> r4              F  D  X  M  W

2 insns cancelled (changed to nops)
  Did not reach a stage where they updated memory or registers
Can begin fetching from target calculated in X

Branch Prediction

Simple strategy assumes branch not taken
But... taken branches are more common
Want a predictive strategy
  Yield taken or not taken prediction with high probability of being right
  Jump ahead to Chapter 4 (to prepare you for assignment)
Why is this important?
  Consider deeper pipelines
    Branch outcome not known for 10-20 cycles after decode
  Out-of-order execution (heavily dependent on prediction)

Big Idea: Speculation

Speculation
  Engagement in risky transactions on the chance of profit
Speculative execution
  Execute before all parameters known with certainty
Correct speculation
  + Avoid stall, improve performance
Incorrect speculation (mis-speculation)
  - Must abort/flush/squash incorrect instructions
  - Must undo incorrect changes (recover pre-speculation state)
The game: [%correct * gain] > [(1 - %correct) * penalty]
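
The inequality above can be turned into an expected-benefit calculation, plus the break-even accuracy obtained by solving it for %correct. A sketch with invented names; the 2-cycle gain with a 10-cycle penalty in the usage line is a hypothetical example, not a figure from the slides.

```python
def expected_benefit(p_correct, gain, penalty):
    """Expected cycles saved per speculation: p*gain - (1-p)*penalty."""
    return p_correct * gain - (1.0 - p_correct) * penalty

def break_even_accuracy(gain, penalty):
    """Accuracy above which speculation pays off (requires gain + penalty > 0).

    From p*gain > (1-p)*penalty  =>  p > penalty / (gain + penalty).
    """
    return penalty / (gain + penalty)

# Hypothetical: save 2 cycles when right, lose 10 when wrong.
benefit_at_90 = expected_benefit(0.90, gain=2, penalty=10)   # about 0.8 cycles
threshold = break_even_accuracy(gain=2, penalty=10)          # about 0.833
```

With a zero penalty the threshold is 0, so any guess is at least as good as stalling.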



Control Hazards: Control Speculation

Deal with control hazards with control speculation
  Unknown parameter: are these the correct insns to execute next?
Mechanics
  Guess branch target, start fetching at guessed position
  Execute branch to verify (check) guess
    Correct speculation? Keep going
    Mis-speculation? Flush mis-speculated insns
  Don't write registers or memory until prediction verified

Control Speculation

Speculation game for in-order 5 stage pipeline
  Gain = 2 cycles
  Penalty = 0 cycles
    No penalty --> mis-speculation no worse than stalling
  %correct = branch prediction
    Static (compiler) ~85%, dynamic (hardware) >95%
    Not much better? Static has 3X mispredicts!

Branch Prediction Performance

Again assume:
  Branch: 20%, load: 20%, store: 10%, other: 50%
  75% of branches are taken
Dynamic branch prediction
  Branches predicted with 95% accuracy
  CPI = 1 + 0.20*0.05*2 = 1.02

Anatomy of a Branch Predictor

[Figure: predictor flow: Program Execution -> Event Selection (all insns and/or branch insns) -> Prediction Index (PC and/or global history and/or local history, etc.) -> Prediction Mechanism (static (ISA), saturating counters, Markov, etc.) -> Recovery?; Feedback: branch outcome updates prediction mechanism and history]
Event (branch) source:
  Dynamic program
Event selection:
  Branches, but can make predictions on all insns (ignore for non-branch)
Predictor indexing:
  Access one or more tables
Predictor mechanism
  Static vs. Dynamic
Feedback and recovery
  Real outcome known a few cycles after prediction
  Reinforce predictor confidence

Dynamic Branch Prediction

BP part I: direction predictor
  Applies to conditional branches only
  Predicts taken/not-taken
  - Hard
BP part II: target predictor
  Applies to all control transfers
  Supplies target PC, tells if insn is a branch prior to decode
  + Easy
Learn from the past, predict the future
  Record history in hardware structure
Will focus on direction prediction, return to target prediction in Chapter 4

Branch Direction Prediction

Predict in fetch stage
  Don't know if insn is branch but can disregard prediction if not
Direction predictor (DIRP)
  Map conditional-branch PC to taken/not-taken (T/N) decision
  Seemingly innocuous, but quite difficult to do well
  Individual conditional branches often unbiased or weakly biased
    90%+ one way or the other considered biased

Branch History Table (BHT)

Branch history table (BHT): simplest direction predictor
  PC indexes table of bits (0 = N, 1 = T), no tags
  Essentially: branch will go same way it went last time
  Problem: consider inner loop branch below (* = mis-prediction)
    for (i=0;i<100;i++)
      for (j=0;j<3;j++)
        // whatever
  [Table: state/prediction and outcome rows for the inner loop branch; entries not preserved]
  Two built-in mis-predictions per execution of the inner loop
  Branch predictor changes its mind too quickly
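
The pathology described above can be checked with a tiny simulation of a single 1-bit BHT entry. The outcome stream is an assumption: the inner-loop branch is modeled as taken three times (while j < 3) and then not taken once on exit, repeated for each of the 100 outer iterations; the original slide's table may have used a different encoding.

```python
def loop_branch_outcomes(outer=100, inner=3):
    """Assumed outcome stream of the inner-loop branch: T while j < inner, then N."""
    return ([True] * inner + [False]) * outer

def one_bit_mispredictions(outcomes, initial=False):
    """Predict 'same as last time' with a single bit; count mis-predictions."""
    state, misses = initial, 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken            # remember only the most recent outcome
    return misses

outcomes = loop_branch_outcomes()
misses = one_bit_mispredictions(outcomes)
print(f"{misses} mis-predictions out of {len(outcomes)} branches")
```

Under these assumptions the predictor is wrong on the first and last branch of every pass through the inner loop.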

Two-Bit Saturating Counters (2bc)

Two-bit saturating counters (2bc) [Smith]
  Replace each single-bit prediction
    (0,1,2,3) = (N,n,t,T)
  Force DIRP to mis-predict twice before changing its mind

Two-Bit Saturating Counter State Machine

[Figure: four-state machine with transitions on taken/not-taken outcomes; one state is marked as the initial state]
  11 (TT): Predict Taken. Strong taken: last two instances were taken
  10 (TN): Predict Taken. Weak taken: last instance N but previous T
  01 (NT): Predict Not Taken. Weak not-taken: last instance T but previous N
  00 (NN): Predict Not Taken. Strong not-taken: last two N

Two-Bit Saturating Counter

for (i=0;i<100;i++)
  for (j=0;j<3;j++)
    // whatever
[Table: state/prediction and outcome rows for the inner loop branch; entries not preserved]
+ Fixes this pathology (changing mind too quickly)
  (which is not contrived, by the way)
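
A quick simulation shows how a 2-bit saturating counter behaves on the same nested loop. As before, the outcome stream (taken three times, then not taken once, for 100 outer iterations) and the initial state of 0 (strong not-taken) are assumptions for the example.

```python
def two_bit_mispredictions(outcomes, initial=0):
    """2-bit saturating counter: states 0..3 = (N, n, t, T); predict taken if >= 2."""
    counter, misses = initial, 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses

outcomes = ([True] * 3 + [False]) * 100   # assumed inner-loop branch behaviour
misses = two_bit_mispredictions(outcomes)
print(f"{misses} mis-predictions out of {len(outcomes)} branches")
```

After a short warm-up the counter only mis-predicts the loop exit, because a single not-taken outcome moves it from T to t and the prediction stays "taken".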

Branch Prediction Buffers (BPBs)

Storing and Accessing Predictions
  Like memory hierarchy/cache
  Avoid cost of tags (tags >> data -- 2 bits)
[Figure: k bits of the PC index a PHT of 2^k saturating counters]
PHT: Pattern history table
Aliasing
  Two branches with different PCs access same counter

Correlated branch behaviour

2 bit scheme: small amount of history (local scheme)
What about:
  if (aa==2) aa = 0;
  if (bb==2) bb = 0;
  if (aa!=bb) { ... }
Clearly branch 3 is correlated to the behaviour of branches 1 and 2
Predictor that uses single branch to predict outcome cannot capture this behaviour
Global scheme

Correlated Branch Prediction

History register
  For each branch
    Shift in outcome
Use history instead of PC to index PHT
[Figure: global (shift) register of k bits indexes a PHT of 2^k saturating counters]
Called a two-level predictor
When to update global history register?

Correlated Predictor

Correlated (two-level) predictor [Patt]
  Exploits observation that branch outcomes are correlated
  Maintains separate prediction per (PC, BHR)
    Branch history register (BHR): recent branch outcomes
Simple working example: assume program has one branch
  BHT: one 1-bit DIRP entry
  BHT+2BHR: 4 1-bit DIRP entries
[Table: state/prediction for BHR=NN, BHR=NT, BHR=TN, BHR=TT, with the active pattern and the outcome; entries not preserved]
We didn't make anything better, what's the problem?

Correlated Predictor

What happened?
  BHR wasn't long enough to capture the pattern
  Try again: BHT+3BHR: 8 1-bit DIRP entries
[Table: state/prediction for BHR=NNN through BHR=TTT, with the active pattern and the outcome; entries not preserved]
+ No mis-predictions after predictor learns all the relevant patterns
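
The effect of BHR length can be reproduced with a small one-branch simulation: one 1-bit prediction per history pattern, indexed by the last `history_bits` outcomes. The outcome stream (T T T N repeated 100 times), the all-N initial history, and the default not-taken prediction are assumptions; the slide's lost table may have used a different pattern.

```python
def correlated_mispredictions(outcomes, history_bits):
    """One branch, one 1-bit entry per BHR pattern; count mis-predictions."""
    history = (False,) * history_bits      # BHR starts as all not-taken
    table = {}                             # BHR pattern -> last outcome seen
    misses = 0
    for taken in outcomes:
        predicted = table.get(history, False)   # untrained entries predict N
        if predicted != taken:
            misses += 1
        table[history] = taken
        history = (history + (taken,))[1:]      # shift in the new outcome
    return misses

outcomes = ([True] * 3 + [False]) * 100
for bits in (0, 2, 3):
    print(bits, "history bits:", correlated_mispredictions(outcomes, bits))
```

With 2 history bits the pattern TT is followed sometimes by T and sometimes by N, so that entry keeps flipping; with 3 bits every history pattern has a unique successor, so mis-predictions stop once each pattern has been seen.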



Correlated Predictor

Design choice I: one global BHR or one per PC (local)?
  Each one captures different kinds of patterns
  Global is better, captures local patterns for tight loop branches
Design choice II: how many history bits (BHR size)?
  Tricky one
  + Longer BHRs are better for some apps, shorter better for others
  - PHT utilization decreases w/ long BHRs
    Many history patterns are never seen
    Many branches are history independent (don't care)
    PC ^ BHR allows multiple PCs to dynamically share PHT
    BHR length < log2(BHT size)
  - Predictor takes longer to train
  Typical length: 8-12

Various 2-level schemes

[Figure: various 2-level predictor schemes]

Hybrid Predictor

Hybrid (tournament) predictor [McFarling]
  Attacks correlated predictor BHT utilization problem
  Idea: combine two predictors
    Simple PHT predicts history independent branches
    Correlated predictor predicts only branches that need history
    Chooser assigns branches to one predictor or the other
    Branches start in simple PHT, move on mis-prediction threshold
  + Correlated predictor can be made smaller, handles fewer branches
  + 90-95% accuracy
[Figure: hybrid predictor built from two PHTs, a GHR, the PC, and a chooser]

Branch Prediction Summary

Performance requirement for pipelined processors
Importance of accuracy increases with depth and width of pipeline
Basic building block
  2-bit saturating counter
Predict direction and target
  You'll implement several direction predictors
  We'll cover target prediction in a few weeks

How are Interrupts/Exceptions Handled in Pipeline?

Interrupt: external, e.g., timer, I/O device requests
Exception: internal, e.g., /0, page fault, illegal instruction
Can occur at any stage in pipeline except WB
  Page fault: IF or Mem
  Illegal op code: ID
  Divide by zero: EX
Upon detecting interrupt
  Operating system saves processor state
  Interrupt handling routine
    Abort or correct

Handling Interrupts/Exceptions

Instructions before X (X-1, X-2...) in program order currently in pipeline
  Complete normally
  Results are part of saved process state
Instruction X and instructions after it already in the pipeline
  Converted into nops
Saved PC corresponds to the PC of instruction X
  Program will restart at X
Called precise state or precise interrupts

Handling Precise Exceptions

Straightforward for simple pipeline
When exception detected
  Flag inserted into pipeline register
  Instruction converted to nop
  Exception handled when flag reaches M/W pipeline register
    Why?

Handling Precise Exceptions

Exceptions handled in program order not temporal order

                        1  2  3  4  5  6
div r1,r0 --> r2        F  D  X  M  W
add r3,r4 --> r5           F  D  X  M  W

Handle /0 exception first
Will revisit exception handling for more complex pipelines later

Pipelined Functional Units

Pipeline so far... shallow and simple
  Abstraction
Real pipelines are much deeper
  Each stage can be decomposed into several stages
  We'll come back to F, D, M, W later
Let's look at EX
  Fast integer arithmetic and logic operations
    Single cycle
  Slow integer arithmetic operations: multiply, divide
  Floating point operations: add, multiply, divide, sqrt
    Pipelined (except div and sqrt)

Pipelined Functional Units

How many stages?
Minimum cycles between independent instructions of same type (no RAW dependency)
Ex: Intel Pentium Pro
  Integer multiplies (4 stages), new instruction every cycle
  FP add (3 stages), new insn every cycle
  FP mult (five stages), new insn every 2 cycles

Hazards and Forwarding

[Figure: execute stage split into parallel functional units labeled EX, M1-M7, A1-A4, and div, all feeding MEM]
Long operations: RAW stalls will be more frequent
WAW hazards are possible: insns no longer reach WB in order
WAR hazards are not possible: register reads always occur in D

Handling WAW Hazards

                        1  2  3  4  5  6  7  8  9  10
divf f0,f1 --> f2       F  D  E/ E/ E/ E/ E/ W
stf f2 --> [r1]            F  D  d* d* d* X  M  W
addf f0,f1 --> f2             F  D  E+ E+ W

What to do?
  Option I: stall younger instruction (addf) at writeback
    + Intuitive, simple
    - Lower performance, cascading W structural hazards
  Option II: cancel older instruction (divf) writeback
    + No performance loss
    - What if divf or stf cause an exception (e.g., /0, page fault)?
    - Complicates exception handling

Basic Design Recap

5-stage pipeline
  F (Fetch), D (Decode), E (Execute), M (Memory Access), W (Writeback)
Data hazards
  RAW --> add forwarding logic
  Stall on load-to-use
Control hazards
  Branches: flush instructions when branch taken
  Branch prediction
Almost done: assume perfect memory
  Now add memory hierarchy
