The Processor
Introduction
Instruction count
A simplified version
A more realistic pipelined version
Memory reference: lw, sw
Arithmetic/logical: add, sub, and, or, slt
Control transfer: beq, j
Chapter 4 The Processor 2
Instruction Execution
PC → instruction memory, fetch instruction
Register numbers → register file, read registers
Depending on instruction class
CPU Overview
Multiplexers
Use multiplexers
Control
Low voltage = 0, High voltage = 1
One wire per bit
Multi-bit data encoded on multi-wire buses
Combinational element: operates on data; output is a function of input
Sequential element: stores information
Combinational Elements
AND-gate: Y = A & B
Adder: Y = A + B
Multiplexer: Y = S ? I1 : I0
Arithmetic/Logic Unit: Y = F(A, B)
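The four combinational elements above can be modelled as pure functions, since their output depends only on their inputs. The following is a minimal sketch; the function names, the 32-bit width, and the string operation selector used for F are assumptions made for illustration, not part of the slides.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit datapath width


def and_gate(a: int, b: int) -> int:
    """Y = A & B"""
    return a & b


def adder(a: int, b: int) -> int:
    """Y = A + B, truncated to 32 bits like a hardware adder."""
    return (a + b) & MASK32


def mux(s: int, i0: int, i1: int) -> int:
    """Y = S ? I1 : I0"""
    return i1 if s else i0


def alu(a: int, b: int, f: str) -> int:
    """Y = F(A, B); the selector strings are hypothetical labels."""
    operations = {
        "add": lambda x, y: (x + y) & MASK32,
        "sub": lambda x, y: (x - y) & MASK32,
        "and": lambda x, y: x & y,
        "or": lambda x, y: x | y,
    }
    return operations[f](a, b)
```

For example, `mux(1, 10, 20)` selects the I1 input, and `alu(5, 5, "sub")` yields zero, which is the condition a branch-on-equal checks.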
Sequential Elements
Uses a clock signal to determine when to update the stored value
Edge-triggered: update when Clk changes from 0 to 1
[Figure: register with data input D, clock input Clk, and output Q]
Sequential Elements
Only updates on clock edge when write control input is 1
Used when stored value is required later
[Figure: register with inputs D, Write, and Clk, and output Q]
Clocking Methodology
Between clock edges, combinational logic takes input from state elements and sends output to a state element
Longest delay determines clock period
Building a Datapath
Datapath
Instruction Fetch
32-bit register
R-Format Instructions
Read two register operands
Perform arithmetic/logical operation
Write register result
Load/Store Instructions
Load: Read memory and update register
Store: Write register value to memory
Branch Instructions
Use ALU, subtract and check Zero output
Sign-extend displacement
Shift left 2 places (word displacement)
Add to PC + 4
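The branch steps listed above (subtract and check Zero, sign-extend, shift left 2, add to PC + 4) can be sketched as follows. The function names are hypothetical, and the sketch assumes 32-bit addresses and a 16-bit displacement field.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm: int) -> int:
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc: int, imm16: int) -> int:
    """Target = (PC + 4) + (sign-extended displacement << 2)."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & MASK32


def beq_next_pc(pc: int, rs_val: int, rt_val: int, imm16: int) -> int:
    """Use the ALU to subtract, check the Zero output, and pick the next PC."""
    zero = ((rs_val - rt_val) & MASK32) == 0
    return branch_target(pc, imm16) if zero else (pc + 4) & MASK32
```

With a branch at address 0x40 and a displacement of 7, the target is 0x40 + 4 + 28 = 0x60; a displacement of 0xFFFF (that is, -1) branches back to 0x40.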
Branch Instructions
Just re-routes wires
Each datapath element can only do one function at a time
Hence, we need separate instruction and data memories
Use multiplexers where alternate data sources are used for different instructions
R-Type/Load/Store Datapath
Full Datapath
ALU Control
ALU Control
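Instruction formats (field [bit range]):
R-type:      0 [31:26]        | rs [25:21] | rt [20:16] | rd [15:11] | shamt [10:6] | funct [5:0]
Load/Store:  35 or 43 [31:26] | rs [25:21] | rt [20:16] | address [15:0]
Branch:      4 [31:26]        | rs [25:21] | rt [20:16] | address [15:0]
Bits 31:26: opcode
rs (bits 25:21): always read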
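The bit ranges above translate directly into shifts and masks. The sketch below extracts every field from a 32-bit instruction word; the function name `decode` and the dictionary layout are assumptions for illustration, and which fields are meaningful depends on the opcode.

```python
def decode(word: int) -> dict:
    """Split a 32-bit MIPS instruction word into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31:26
        "rs": (word >> 21) & 0x1F,       # bits 25:21
        "rt": (word >> 16) & 0x1F,       # bits 20:16
        "rd": (word >> 11) & 0x1F,       # bits 15:11 (R-type only)
        "shamt": (word >> 6) & 0x1F,     # bits 10:6  (R-type only)
        "funct": word & 0x3F,            # bits 5:0   (R-type only)
        "address": word & 0xFFFF,        # bits 15:0  (load/store, branch)
    }


# A load (opcode 35) with rs = 8, rt = 9 and a 16-bit address field of 4.
load_word = (35 << 26) | (8 << 21) | (9 << 16) | 4
fields = decode(load_word)
print(fields["opcode"], fields["rs"], fields["rt"], fields["address"])
```

Extracting all fields unconditionally mirrors the hardware, where the register file is read before the control unit has finished interpreting the opcode.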
R-Type Instruction
Load Instruction
Branch-on-Equal Instruction
Implementing Jumps
Jump: 2 [31:26] | address [25:0]
Performance Issues
Critical path: load instruction
Instruction memory → register file → ALU → data memory → register file
Not feasible to vary period for different instructions
Violates design principle
Pipelining Analogy
Four loads:
Non-stop:
MIPS Pipeline
Pipeline Performance
Pipeline Performance
Single-cycle (Tc = 800 ps)
Pipeline Speedup
If all stages are balanced, i.e., all take the same time:
$\text{Time between instructions}_{\text{pipelined}} = \dfrac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$
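The relationship above can be checked numerically. This is a small sketch, not a performance model: the 800 ps figure is the single-cycle clock period quoted earlier, while the five per-stage latencies are assumed values chosen only to show what happens when stages are not balanced.

```python
def ideal_pipelined_interval(nonpipelined_ps: float, stages: int) -> float:
    """Time between instructions when all stages take the same time."""
    return nonpipelined_ps / stages


def actual_pipelined_interval(stage_latencies_ps: list) -> float:
    """With unbalanced stages, the slowest stage sets the clock period."""
    return max(stage_latencies_ps)


single_cycle_ps = 800
assumed_stage_latencies = [200, 100, 200, 200, 100]  # hypothetical, sums to 800

ideal = ideal_pipelined_interval(single_cycle_ps, 5)
actual = actual_pipelined_interval(assumed_stage_latencies)
print(ideal, actual, single_cycle_ps / actual)
```

Under these assumptions the ideal interval is 160 ps, but the unbalanced stages give 200 ps, so the speedup is 4 and not 5.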
Easier to fetch and decode in one cycle
c.f. x86: 1- to 17-byte instructions
Can decode and read registers in one step
Can calculate address in 3rd stage, access memory in 4th stage
Memory access takes only one cycle
Load/store addressing
Hazards
Situations that prevent starting the next instruction in the next cycle
Structure hazards: A required resource is busy
Data hazard: Need to wait for previous instruction to complete its data read/write
Control hazard: Deciding on control action depends on previous instruction
Structure Hazards
Load/store requires data access
Instruction fetch would have to stall for that cycle
Data Hazards
add sub
Don't wait for it to be stored in a register
Requires extra connections in the datapath
Reorder code to avoid use of load result in the next instruction
C code for A = B + E; C = B + F;

Original order (13 cycles):
    lw   $t1, 0($t0)
    lw   $t2, 4($t0)
    (stall)
    add  $t3, $t1, $t2
    sw   $t3, 12($t0)
    lw   $t4, 8($t0)
    (stall)
    add  $t5, $t1, $t4
    sw   $t5, 16($t0)

Reordered (11 cycles):
    lw   $t1, 0($t0)
    lw   $t2, 4($t0)
    lw   $t4, 8($t0)
    add  $t3, $t1, $t2
    sw   $t3, 12($t0)
    add  $t5, $t1, $t4
    sw   $t5, 16($t0)
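The 13-cycle and 11-cycle counts can be reproduced with a small model. The sketch below assumes a 5-stage pipeline with forwarding, where the only stall is one cycle when an instruction uses the result of the immediately preceding lw; the parser and function names are hypothetical and handle only the lw, sw, and add forms used in this example.

```python
import re


def parse(line: str):
    """Return (op, destination register, source registers) for one instruction."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "sw":
        return op, None, regs          # sw reads both of its registers
    return op, regs[0], regs[1:]       # first register is the destination


def count_cycles(program: list, stages: int = 5) -> int:
    """Cycles = instructions + pipeline fill + load-use stalls."""
    stalls = 0
    prev_op, prev_dest = None, None
    for line in program:
        op, dest, srcs = parse(line)
        if prev_op == "lw" and prev_dest in srcs:
            stalls += 1                # load-use hazard: one bubble
        prev_op, prev_dest = op, dest
    return len(program) + (stages - 1) + stalls


original = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2",
    "sw $t3, 12($t0)", "lw $t4, 8($t0)", "add $t5, $t1, $t4",
    "sw $t5, 16($t0)",
]
reordered = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)",
    "add $t3, $t1, $t2", "sw $t3, 12($t0)", "add $t5, $t1, $t4",
    "sw $t5, 16($t0)",
]
print(count_cycles(original), count_cycles(reordered))
```

Moving the third lw up separates each load from its first use, which removes both stalls without changing the result.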
Control Hazards
Fetching next instruction depends on branch outcome
Pipeline can't always fetch correct instruction
In MIPS pipeline
Need to compare registers and compute target early in the pipeline
Add hardware to do it in ID stage
Stall on Branch
Branch Prediction
Stall penalty becomes unacceptable
Only stall if prediction is wrong
Can predict branches not taken
Fetch instruction after branch, with no delay
In MIPS pipeline
Prediction incorrect
e.g., record recent history of each branch
When wrong, stall while re-fetching, and update history
Pipeline Summary
The BIG Picture
Executes multiple instructions in parallel
Each instruction has the same latency
Subject to hazards: structure, data, control
MEM
WB
Pipeline registers
Pipeline Operation
Shows pipeline usage in a single cycle
Highlight resources used
Graph of operation over time
We'll look at single-clock-cycle diagrams for load & store
EX for Load
WB for Load
EX for Store
WB for Store
Traditional form
Pipelined Control
Pipelined Control
e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
ID/EX.RegisterRs, ID/EX.RegisterRt
Fwd from EX/MEM pipeline reg
Fwd from MEM/WB pipeline reg
EX/MEM.RegWrite, MEM/WB.RegWrite
And only if Rd for that instruction is not $zero
Forwarding Paths
Forwarding Conditions
EX hazard
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10
MEM hazard
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
Want to use the most recent
Only fwd if EX hazard condition isn't true
MEM hazard
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
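The EX and MEM hazard conditions can be folded into one priority check: test the EX/MEM stage first, and fall back to MEM/WB only when the EX hazard condition is not true. The sketch below is one way to express that; the function name and argument names are hypothetical, and the return value 00 (no forwarding, operand comes from the register file) is an assumption not spelled out in the conditions above.

```python
def forward_select(id_ex_reg: int,
                   ex_mem_regwrite: bool, ex_mem_rd: int,
                   mem_wb_regwrite: bool, mem_wb_rd: int) -> int:
    """Return the 2-bit forwarding control for one ALU operand.

    Call with ID/EX.RegisterRs to get ForwardA and with
    ID/EX.RegisterRt to get ForwardB.
    """
    # EX hazard: the most recent result is in the EX/MEM pipeline register.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return 0b10
    # MEM hazard: only reached when the EX hazard condition is not true.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        return 0b01
    # Assumed default: no forwarding.
    return 0b00


# Double hazard: both earlier instructions write $1; the newer value wins.
print(forward_select(1, True, 1, True, 1))
```

Ordering the checks this way gives the same result as the explicit "and not (EX hazard)" term in the revised MEM hazard condition.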
Check when using instruction is decoded in ID stage
ALU operand register numbers in ID stage are given by IF/ID.RegisterRs, IF/ID.RegisterRt
EX, MEM and WB do nop (no-operation)
Using instruction is decoded again
Following instruction is fetched again
1-cycle stall allows MEM to read data for lw
Branch Hazards
If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction
[Figure: pipeline timing diagram for lw $1, addr (IF, ID, EX, MEM, WB) followed by a stalled instruction that repeats the ID stage before EX, MEM, WB]
In deeper and superscalar pipelines, branch penalty is more significant
Use dynamic prediction
Branch prediction buffer (aka branch history table)
Indexed by recent branch instruction addresses
Stores outcome (taken/not taken)
To execute a branch
Check table, expect the same outcome
Start fetching from fall-through or target
If wrong, flush pipeline and flip prediction
Mispredict as taken on last iteration of inner loop
Then mispredict as not taken on first iteration of inner loop next time around
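The double misprediction on inner loops can be seen in a short simulation of a 1-bit predictor for a single branch. This is a sketch under assumptions: the loop branch is taken three times and then falls through, the loop is entered ten times, and the predictor starts out predicting not taken. The function name is hypothetical.

```python
def count_mispredictions(outcomes: list, initial_prediction: bool = False) -> int:
    """Simulate a 1-bit predictor: predict the last outcome, flip when wrong."""
    prediction = initial_prediction
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
            prediction = taken  # flip prediction on a misprediction
    return misses


# Assumed inner loop: branch taken 3 times, then not taken on the last iteration.
inner_loop = [True, True, True, False]
outer_iterations = 10
print(count_mispredictions(inner_loop * outer_iterations))
```

Each pass through the inner loop costs two mispredictions, one on the last iteration and one on the first iteration of the next pass, for 20 in total here.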
2-Bit Predictor
1-cycle penalty for a taken branch
Cache of target addresses
Indexed by PC when instruction fetched
If hit and instruction is branch predicted taken, can fetch target immediately
4.9 Exceptions
Different ISAs use the terms differently
Exception: Arises within the CPU
Interrupt
Handling Exceptions
In MIPS, exceptions managed by a System Control Coprocessor (CP0) Save PC of offending (or interrupted) instruction
In MIPS: Exception Program Counter (EPC)
In MIPS: Cause register
We'll assume 1-bit
An Alternate Mechanism
Vectored Interrupts
Handler address determined by the cause
Example:
Undefined opcode: C000 0000
Overflow: C000 0020
...: C000 0040
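A vectored dispatch can be sketched as a lookup from cause to handler address. The table below uses only the two labelled addresses from the example; the dictionary, key strings, and function name are hypothetical and chosen for illustration.

```python
# Handler entry points taken from the example; key names are assumptions.
HANDLER_VECTORS = {
    "undefined_opcode": 0xC0000000,
    "overflow": 0xC0000020,
}


def handler_address(cause: str) -> int:
    """Return the address the hardware would jump to for this cause."""
    return HANDLER_VECTORS[cause]


spacing = handler_address("overflow") - handler_address("undefined_opcode")
print(hex(handler_address("overflow")), spacing)
```

The entries are 0x20 = 32 bytes apart, which is room for eight 32-bit instructions per vector slot.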
Instructions either
Handler Actions
Read cause, and transfer to relevant handler
Determine action required
If restartable
Take corrective action
use EPC to return to program
Otherwise
Terminate program
Report error using EPC, cause, ...
Exceptions in a Pipeline
Exception Properties
Restartable exceptions
Pipeline can flush the instruction
Handler executes, then returns to the instruction
Exception Example
Exception on add in
    40  sub  $11, $2, $4
    44  and  $12, $2, $5
    48  or   $13, $2, $6
    4C  add  $1,  $2, $1
    50  slt  $15, $6, $7
    54  lw   $16, 50($7)
Handler
    80000180  sw  $25, 1000($0)
    80000184  sw  $26, 1004($0)
Exception Example
Exception Example
Multiple Exceptions
Flush subsequent instructions
Precise exceptions
In complex pipelines
Multiple instructions issued per cycle
Out-of-order completion
Maintaining precise exceptions is difficult!
Imprecise Exceptions
Including exception cause(s)
Which instruction(s) had exceptions
Which to complete or flush
Simplifies hardware, but more complex handler software
Not feasible for complex multiple-issue out-of-order pipelines
4.10 Parallelism and Advanced Instruction Level Parallelism
Deeper pipeline
Less work per stage ⇒ shorter clock cycle
Multiple issue
Replicate pipeline stages ⇒ multiple pipelines
Start multiple instructions per clock cycle
CPI < 1, so use Instructions Per Cycle (IPC)
E.g., 4GHz 4-way multiple-issue
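The 4GHz 4-way example works out as follows. This is a peak-rate calculation only, assuming every issue slot is filled every cycle; the function names are hypothetical.

```python
def peak_instructions_per_second(clock_hz: float, issue_width: int) -> float:
    """Upper bound on throughput: one instruction per issue slot per cycle."""
    return clock_hz * issue_width


def peak_cpi(issue_width: int) -> float:
    """Best-case cycles per instruction, the reciprocal of peak IPC."""
    return 1 / issue_width


print(peak_instructions_per_second(4e9, 4), peak_cpi(4))
```

That gives a peak of 16 billion instructions per second, a peak IPC of 4, and a peak CPI of 0.25. Dependencies between instructions reduce what is achieved in practice.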
Multiple Issue
Compiler groups instructions to be issued together
Packages them into issue slots
Compiler detects and avoids hazards
CPU examines instruction stream and chooses instructions to issue each cycle
Compiler can help by reordering instructions
CPU resolves hazards using advanced techniques at runtime
Speculation
If so, complete the operation If not, roll-back and do the right thing
Speculate on load
Compiler/Hardware Speculation
e.g., move load before branch
Can include fix-up instructions to recover from incorrect guess
Buffer results until it determines they are actually needed
Flush buffers on incorrect speculation
e.g., speculative load before null-pointer check
Can add ISA support for deferring exceptions
Can buffer exceptions until instruction completion (which may not occur)
Static speculation
Dynamic speculation
Group of instructions that can be issued on a single cycle
Determined by pipeline resources required
Specifies multiple concurrent operations
Very Long Instruction Word (VLIW)
Reorder instructions into issue packets
No dependencies within a packet
Possibly some dependencies between packets
Two-issue packets
Forwarding avoided stalls with single-issue
Now can't use ALU result in load/store in same packet
    add  $t0, $s0, $s1
    load $s2, 0($t0)
Split into two packets, effectively a stall
Load-use hazard
Scheduling Example
ALU/branch
    Loop: nop
    addi $s1, $s1,4
    addu $t0, $t0, $s2
    bne
Loop Unrolling
Avoiding structural and data hazards
Though it may still help
Code semantics ensured by the CPU
But commit result to registers in order
Example
    lw   $t0, 20($s2)
    addu $t1, $t0, $t2
    sub  $s4, $s4, $t3
    slti $t5, $s4, 20
Can start sub while addu is waiting for lw
Results also sent to any waiting reservation stations
Reorder buffer for register writes
Can supply operands for issued instructions
Register Renaming
Reservation stations and reorder buffer effectively provide register renaming
On instruction issue to reservation station
Copied to reservation station
No longer required in the register; can be overwritten
It will be provided to the reservation station by a function unit
Register update may not be required
Speculation
Don't commit until branch outcome determined
Avoid load and cache miss delay
Load speculation
Predict the effective address
Predict loaded value
Load before completing outstanding stores
Bypass stored values to load unit
Why not just let the compiler schedule code?
Not all stalls are predictable
Yes, but not as much as we'd like
Programs have real dependencies that limit ILP
Some dependencies are hard to eliminate
e.g., pointer aliasing
Limited window size during instruction issue
Hard to keep pipelines full
Power Efficiency
Complexity of dynamic scheduling and speculation requires power
Multiple simpler cores may be better
Microprocessor   Year  Clock Rate  Pipeline Stages  Issue width  Out-of-order/Speculation  Cores  Power
i486             1989  25MHz       5                1            No                        1      5W
Pentium          1993  66MHz       5                2            No                        1      10W
Pentium Pro      1997  200MHz      10               3            Yes                       1      29W
P4 Willamette    2001  2000MHz     22               3            Yes                       1      75W
P4 Prescott      2004  3600MHz     31               3            Yes                       1      103W
Core             2006  2930MHz     14               4            Yes                       2      75W
UltraSparc III   2003  1950MHz     14               4            No                        1      90W
UltraSparc T1    2005  1200MHz     6                1            No                        8      70W
FP is 5 stages longer
Up to 106 RISC-ops in progress
Bottlenecks
Complex instructions with long dependencies
Branch mispredictions
Memory access delays
Fallacies
So why haven't we always done pipelining?
More transistors make more advanced techniques feasible
Pipeline-related ISA design needs to take account of technology trends
Pitfalls
Significant overhead to make pipelining work
IA-32 micro-op approach
Register update side effects, memory indirection
Advanced pipelines have long delay slots
Concluding Remarks
ISA influences design of datapath and control
Datapath and control influence design of ISA
Pipelining improves instruction throughput using parallelism
More instructions completed per second
Latency for each instruction not reduced
Hazards: structural, data, control
Multiple issue and dynamic scheduling (ILP)