The Processor: (Datapath and Pipelining)

14:332:331
[Figure: datapath overview showing the PC, Instruction Memory, Register File, ALU, and Data Memory]
Review: Design Principles
Simplicity favors regularity: fixed-size instructions and 32-bit data.
Good design demands good compromises: only three instruction formats.
Smaller is faster: limited instruction set, limited number of registers in the register file, limited number of addressing modes.
Make the common case fast: arithmetic operands come from the register file (load-store machine), and instructions are allowed to contain immediate operands.
The Processor: Datapath & Control

We're ready to look at an implementation of MIPS, simplified to contain only:
memory-reference instructions: lw, sw
arithmetic-logical instructions: add, sub, and, or, slt
control flow instructions: beq, j
The Processor: Datapath & Control

Generic implementation (the first two stages are the same for all instructions):
use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC: PC = PC + 4)
decode the instruction (and read registers)
execute the instruction
All instructions (except j) use the ALU after reading the registers.
Abstract Implementation View

Two types of functional units:
elements that operate on data (combinational, like the ALU)
elements that contain state (sequential, like registers and memory)
Abstract Implementation View

Single cycle operation (multi-cycle is presented later).
Split memory (Harvard) model: one memory for instructions (instruction cache) and one for data.
The view shows how the PC is incremented or changed when a branch is taken. It does not show multiplexors or control lines.
[Figure: abstract implementation view with the Control Unit, PC, Instruction Memory, Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data, Read Data 1, Read Data 2), ALU (overflow and zero outputs), and Data Memory (Address, Write Data, Read Data)]
Clocking Methodologies
The clocking methodology defines when signals can be read and when they can be written. A signal must not be read while it is being written, since the result is unpredictable.
[Figure: clock waveform showing the cycle time, the rising (positive) edge, and the falling (negative) edge]
We adopt an edge-triggered clocking methodology: values stored in the state elements are updated only on a clock edge. The inputs to a combinational element are values stored by the state elements in a previous clock cycle, and the outputs of the combinational element can be used in the following clock cycle.
All signals must propagate from state element 1, through the combinational logic, to state element 2 within a single clock cycle. A state element is read in one half of the cycle and written in the second half. The time needed for the logic gates to settle determines the length of the clock cycle.
[Figure: synchronous digital system with state element 1, combinational logic, and state element 2 driven by the same clock over one clock cycle]
This assumes state elements are written on every clock cycle. If not, an explicit write control signal is needed, and the write occurs only when both the write control is asserted and the clock edge occurs.
Building the Datapath
The datapath ensures that data travels between the memory units, registers, and ALUs. The control unit regulates this transfer and determines what actions are taken on the data, using control lines connected to the various hardware units.
The edge-triggered clocking methodology allows a given state element to be both read and written in a single clock cycle. Either we have a long clock cycle for one instruction (its length is determined by the slowest instruction), or multiple cycles per instruction.
Building the Datapath
What building blocks are needed to implement a subset of the MIPS instructions (lw, sw, and arithmetic and logic instructions such as add, sub, and, or, slt)?
Every instruction needs to be fetched and the PC incremented by 4.
[Figure: instruction fetch. The PC supplies the Read Address to the Instruction Memory, and a dedicated adder (a special ALU that only adds) computes PC + 4]
The PC is updated on every clock cycle.
The Instruction Memory is read every cycle, so it doesn't need an explicit read control signal.
This is extended later for j and beq.
Decoding Instructions
Decoding an instruction involves:
sending the fetched instruction's opcode and function field bits to the control unit
reading two values from the Register File (the Register File addresses are contained in the instruction)
[Figure: the Instruction feeds the Control Unit and the 5-bit Read Addr 1, Read Addr 2, and Write Addr inputs of the Register File, which produces the 32-bit Read Data 1 and Read Data 2 outputs]
Executing R Format Operations
R format operations (add, sub, slt, and, or):
R-type: op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
perform the operation indicated by op and funct on the values in rs and rt
store the result back into the Register File (into location rd)
[Figure: the Register File outputs feed a 32-bit ALU with a 4-bit ALU control input and overflow and zero outputs. The fourth ALU control line supports NOR]
Executing R Format Operations

Note that Register File is not written every cycle (e.g. sw), so we need an explicit write control signal for the Register File (RegWrite)
[Figure: Register File with the RegWrite control, ALU with the 4-bit ALU control and overflow and zero outputs, and Data Memory (Address, Write Data, Read Data) with the MemWrite and MemRead controls]
To execute lw or sw operations we need a Data Memory unit with two control signals, one for writing into it (MemWrite) and one for reading from it (MemRead).
Executing Load and Store Operations

I-Type: op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]
Load and store operations:
compute a memory address by adding the base register (in rs) to the 16-bit signed offset field in the instruction
the 32 bits in the base register were read from the Register File during decode
the offset value in the low-order 16 bits of the instruction must be sign extended to create a 32-bit signed value
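The address computation described above can be sketched in Python. This is a minimal illustration, not part of the slides: the function names `sign_extend16` and `effective_address` and the register contents in the usage line are assumptions made for the example.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base: int, offset16: int) -> int:
    """Base register plus sign-extended 16-bit offset, wrapped to 32 bits."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF


# Hypothetical case: base register holds 0x7FFFFFF0, offset field encodes -4.
print(hex(effective_address(0x7FFFFFF0, 0xFFFC)))
```

In hardware the sign extension is just wiring (bit 15 replicated into bits 31-16), and the addition is done by the ALU.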
Executing Load and Store Operations

[Figure: load/store datapath. The Register File (RegWrite) feeds the ALU (ALU control, overflow, zero); the 16-bit offset passes through Sign Extend to become a 32-bit ALU input; the ALU result is the Address of the Data Memory (MemWrite, MemRead)]
sw: the value read from the Register File during decode must be written to the Data Memory.
lw: the value read from the Data Memory must be stored in the Register File.
Executing Branch Operations

I-Type: op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]
Branch operations have to:
compare the operands read from the Register File during decode (the rs and rt values) for equality (the zero ALU output is asserted when they are equal)
compute the branch target address by adding the updated PC (PC + 4) to the sign-extended 16-bit signed offset field in the instruction
the offset value in the low-order 16 bits of the instruction must be sign extended to create a 32-bit signed value and then shifted left 2 bits, which converts the word offset into a byte offset
Executing Branch Operations

[Figure: branch datapath. The 16-bit offset is sign extended to 32 bits, shifted left 2, and added to PC + 4 to form the branch target address. The ALU control selects a subtraction of the two Register File outputs; the zero output goes to the branch control logic, and the branch is taken if the zero line is high]
Executing Jump Operations

J-Type: op [31-26] | jump target address [25-0]
Jump operations have to replace the lower 28 bits of PC+4 with the lower 26 bits of the fetched instruction shifted left by 2 bits.
[Figure: jump datapath. The PC addresses the Instruction Memory and an adder computes PC + 4; the 26-bit field of the instruction is shifted left 2 to give the 28-bit lower part of the jump address]
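The two next-PC computations described above (branch target and jump target) can be sketched as follows. This is an illustrative sketch only: the function names are assumptions, and addresses are modeled as Python integers masked to 32 bits.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, offset16: int) -> int:
    """(PC + 4) plus the sign-extended offset shifted left by 2."""
    return ((pc + 4) + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


def jump_target(pc: int, instr: int) -> int:
    """Upper 4 bits of PC + 4 joined with instr[25-0] shifted left by 2."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    return (pc_plus_4 & 0xF0000000) | ((instr & 0x03FFFFFF) << 2)


# A beq at address 40 with an offset field of 7 words.
print(branch_target(40, 7))
```

The branch adder and the jump concatenation both work on PC + 4 rather than PC, because the PC has already been incremented during fetch.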
Creating a Single Datapath from the Parts

We need to assemble the datapath segments, add control lines as needed, and design the control path.
Single cycle design: fetch, decode, and execute each instruction in one clock cycle.
The cycle time is determined by the length of the longest path.
No datapath resource can be used more than once per instruction, so some must be duplicated (that is why we have a separate Instruction Memory and Data Memory).
To share datapath elements between different instruction classes we need multiplexors at the inputs of the shared elements.
We need control lines to select the multiplexor inputs.
Fetch, R, and Memory Access Portions

[Figure: fetch, R-type, and memory access portions combined: PC, Add 4, Instruction Memory, Register File (RegWrite), ALU (ALU control, ovf, zero), Sign Extend (16 to 32), and Data Memory (MemWrite, MemRead), with the paths used by lw, by lw / sw, and by R-type instructions marked]
Multiplexor Insertion
[Figure: the same datapath with multiplexors inserted: ALUSrc selects the second ALU input (Read Data 2 or the sign-extended offset) and MemtoReg selects the Register File Write Data (ALU result or Data Memory Read Data)]
Adding the Branch Portion

[Figure: datapath with the branch portion added: Shift left 2 and a second adder compute the branch target, and the PCSrc multiplexor selects between PC + 4 (branch not taken, R-type, lw / sw) and the branch target]
Datapath - review
We wait for everything to settle down: the ALU might not produce the right answer right away.
The cycle time is determined by the length of the longest path.
Split memory (Harvard) model, single cycle operation.
Simplified to contain only the instructions lw, sw, add, sub, and, or, slt, beq.
Sequential components (PC, RegFile, Memory) are edge triggered. State elements are written on every clock cycle; if not, an explicit write control signal is needed, and the write occurs only when both the write control is asserted and the clock edge occurs.
Single cycle datapath

R-type: op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
I-Type: op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]
Observations:
the op field is always in bits 31-26
the addresses of the two registers to be read are always specified by the rs and rt fields (bits 25-21 and 20-16)
the address of the register to be written is in one of two places: in rt (bits 20-16) for lw, in rd (bits 15-11) for R-type instructions
the base register for lw and sw is always in rs (bits 25-21)
the offset for beq, lw, and sw is always in bits 15-0
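Because the fields sit at fixed bit positions, they can all be extracted in parallel with shifts and masks. The sketch below illustrates this; the function name `decode` and the dictionary keys are assumptions chosen for the example.

```python
def decode(instr: int) -> dict:
    """Slice a 32-bit MIPS instruction word into its fixed-position fields."""
    return {
        "op": (instr >> 26) & 0x3F,     # bits 31-26
        "rs": (instr >> 21) & 0x1F,     # bits 25-21
        "rt": (instr >> 16) & 0x1F,     # bits 20-16
        "rd": (instr >> 11) & 0x1F,     # bits 15-11 (R-type only)
        "shamt": (instr >> 6) & 0x1F,   # bits 10-6  (R-type only)
        "funct": instr & 0x3F,          # bits 5-0   (R-type only)
        "imm16": instr & 0xFFFF,        # bits 15-0  (I-type only)
    }


# 0x012A4020 encodes add $8, $9, $10; 0x8D280014 encodes lw $8, 20($9).
print(decode(0x012A4020))
print(decode(0x8D280014))
```

All fields are produced for every instruction; which of them are meaningful is decided afterwards by the control unit from the op field.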
(Almost) Complete Single Cycle Datapath

[Figure: (almost) complete single cycle datapath. Instr[31-0] from the Instruction Memory is split into Instr[25-21] (Read Addr 1), Instr[20-16] (Read Addr 2 and one Write Addr candidate), Instr[15-11] (the other Write Addr candidate, selected by RegDst), Instr[15-0] (sign extended from 16 to 32 bits), and Instr[5-0] (to the ALU control, together with ALUOp). Control lines shown: PCSrc, RegDst, RegWrite, ALUSrc, MemRead, MemWrite, MemtoReg]
The ALU control sets the ALU's operation based on the instruction type and the function code.
ALU Control
MIPS uses multiple control levels to increase the speed of the main control unit and decrease its size.
ALU control input (Ainvert + Binvert + Operation)   Function
0000                                                and
0001                                                or
0010                                                add
0110                                                subtract
0111                                                set on less than
1100                                                NOR
ALU Control, continued
Multiple levels of control: the main control unit generates the ALUOp bits, and the ALU control unit generates the ALU control inputs from ALUOp and Instr[5-0] (so the main control is smaller).
Instr op   funct    ALUOp   desired action     ALU control input
lw         xxxxxx   00      add                0010
sw         xxxxxx   00      add                0010
beq        xxxxxx   01      subtract           0110
add        100000   10      add                0010
sub        100010   10      subtract           0110
and        100100   10      AND                0000
or         100101   10      OR                 0001
slt        101010   10      set on less than   0111
ALU Control Truth Table

ALUOp1 ALUOp0   F5 F4 F3 F2 F1 F0   Op3 Op2 Op1 Op0
0      0        X  X  X  X  X  X    0   0   1   0
X      1        X  X  X  X  X  X    0   1   1   0
1      X        X  X  0  0  0  0    0   0   1   0
1      X        X  X  0  0  1  0    0   1   1   0
1      X        X  X  0  1  0  0    0   0   0   0
1      X        X  X  0  1  0  1    0   0   0   1
1      X        X  X  1  0  1  0    0   1   1   1
We can make use of more don't cares because ALUOp never uses the encoding 11 and because F5 and F4 are always 10.
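The second control level can be sketched as a small function from (ALUOp, funct) to the 4-bit ALU control input. This is an illustrative sketch using the encodings listed for lw, sw, beq, add, sub, and, or, and slt; the function name `alu_control` is an assumption, and only the low four funct bits are examined, in line with the don't-care observation.

```python
def alu_control(alu_op: int, funct: int) -> int:
    """Return the 4-bit ALU control input for a 2-bit ALUOp and 6-bit funct."""
    if alu_op == 0b00:            # lw / sw: add to form the address
        return 0b0010
    if alu_op & 0b01:             # ALUOp0 set (beq): subtract to compare
        return 0b0110
    # ALUOp1 set (R-type): decode the low four bits of the function field.
    r_type = {
        0b0000: 0b0010,  # add
        0b0010: 0b0110,  # sub
        0b0100: 0b0000,  # and
        0b0101: 0b0001,  # or
        0b1010: 0b0111,  # slt
    }
    return r_type[funct & 0xF]


print(format(alu_control(0b10, 0b101010), "04b"))
```

A function code outside the five listed R-type operations raises a KeyError here; real hardware would simply produce some output for it.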
ALU Control Combinational Logic
From the truth table we can design the ALU Control logic.
Summary of control lines
RegDst: source of the destination register for the operation
RegWrite: enables writing a register in the register file
ALUSrc: source of the second ALU operand, which can be a register or part of the instruction
PCSrc: source of the PC (increment [PC + 4] or branch)
MemRead / MemWrite: reading from / writing to data memory
MemtoReg: source of the write register contents
Datapath with Control Unit

[Figure: single cycle datapath with the Control Unit. Instr[31-26] drives the Control Unit, which generates RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, and ALUOp. Branch is combined with the ALU zero output to form PCSrc, and the ALU control takes ALUOp and Instr[5-0]]
Main Control Unit

Instr    Opcode   RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-type   000000   1       0       0         1         0        0         0       1       0
lw       100011   0       1       1         1         1        0         0       0       0
sw       101011   X       1       X         0         0        1         0       0       0
beq      000100   X       0       X         0         0        0         1       0       1
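The main control unit is a pure function of the 6-bit opcode, so it can be sketched as a lookup table. This is a minimal illustration under the standard settings for R-type, lw, sw, and beq in this simplified datapath; the names `MAIN_CONTROL` and `main_control` are assumptions, and `None` stands for a don't care.

```python
X = None  # don't care

SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite",
           "MemRead", "MemWrite", "Branch", "ALUOp")

# One row per opcode; ALUOp is kept as a single 2-bit value (ALUOp1 ALUOp0).
MAIN_CONTROL = {
    0b000000: (1, 0, 0, 1, 0, 0, 0, 0b10),  # R-type
    0b100011: (0, 1, 1, 1, 1, 0, 0, 0b00),  # lw
    0b101011: (X, 1, X, 0, 0, 1, 0, 0b00),  # sw
    0b000100: (X, 0, X, 0, 0, 0, 1, 0b01),  # beq
}


def main_control(opcode: int) -> dict:
    """Return the control line settings for a supported opcode."""
    return dict(zip(SIGNALS, MAIN_CONTROL[opcode]))


print(main_control(0b100011))
```

Only the state-changing lines (RegWrite, MemWrite) and Branch must never be don't cares, since asserting them by accident would corrupt state or the PC.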
Control Unit Logic
From the truth table we can design the Main Control logic.
[Figure: main control logic. Instr[31] through Instr[26] are decoded into the R-type (000000), lw (100011), sw (101011), and beq (000100) lines, which are combined to produce RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, and ALUOp0]
Adding the Jump Operation
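J has the op field 000010 followed by a 26-bit jump address. We need to add this to the datapath: instruction bits [25-0] are shifted left two bits and concatenated with the four most significant bits of PC+4.
[Figure: jump datapath. Instr[25-0] (26 bits) is shifted left 2 to give 28 bits and joined with PC+4[31-28] to form the 32-bit jump address; a new multiplexor controlled by Jump selects between this address and the output of the PCSrc multiplexor as the next PC]
A new multiplexer and control line are needed to select this input for the program counter.
One more gate is needed in the Main Control logic for J.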
Control Unit Logic
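[Figure: main control logic extended with a gate that decodes opcode 000010 from Instr[31] through Instr[26] to produce Jump, alongside the R-type, lw, sw, and beq lines that produce RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, and ALUOp0]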
Adding Jump Operation

[Figure: single cycle datapath with the jump operation added. Instr[25-0] is shifted left 2 and combined with PC+4[31-28]; the Jump control line from the Control Unit selects the jump address as the next PC]
Control Unit Logic
One more gate is needed in the Main Control logic for jal (a J-type instruction that stores PC+4 in $ra, register $31).
J format: op = 000011 (3) | 26-bit address
[Figure: main control logic with a gate that decodes the jal opcode from Instr[31] through Instr[26] and contributes to Jump, RegDst, MemtoReg, RegWrite, ALUOp1, and ALUOp0]
Adding Jal Operation

[Figure: single cycle datapath with jal added. The RegDst multiplexor gains a third input, the constant 31, and the MemtoReg multiplexor gains a third input, PC+4]
Adding Control line Settings for Jal Operation

RegDst and MemtoReg are now 2 bits each.
Instr      RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0  Jump
R-format   01      0       00        1         0        0         0       1       0       0
lw         00      1       01        1         1        0         0       0       0       0
sw         xx      1       xx        0         0        1         0       0       0       0
beq        xx      0       xx        0         0        0         1       0       1       0
J          xx      x       xx        0         0        0         x       x       x       1
JAL: RegDst = 10 (the write register is R[31]) and MemtoReg = 10 (the write data is PC+4); the PC is loaded with the jump address.
Single Cycle Implementation Cycle Time

Unfortunately, though simple, the single cycle approach is not used because it is inefficient.
The clock cycle must have the same length for every instruction.
What is the longest path (slowest instruction)?
Calculate the cycle time assuming negligible delays (for muxes, control unit, sign extend, PC access, shift left 2, wires) except:
Instruction and Data Memory (2 ns)
ALU and adders (2 ns)
Register File access, reads or writes (1 ns)
Floating point operations take even longer.
Instruction Critical Paths

Instr.   I Mem  Reg Rd  ALU Op  D Mem  Reg Wr  Total delay (ns)
R-type   2      1       2              1       6
load     2      1       2       2      1       8
store    2      1       2       2              7
beq      2      1       2                      5
jump     2                                     2
What about floating point operations?
A floating point add (add.d) = instruction fetch (2 ns) + register read (1 ns) + ALU add (8 ns) + register write (1 ns) = 12 ns.
A floating point load (l.s) = 2 + 1 + 2 (ALU op) + 2 (data memory) + 1 (register write) = 8 ns.
A floating point store (s.s) = 2 + 1 + 2 (ALU) + 2 (data memory) = 7 ns.
The longest instruction is the floating point multiply (mul) = instruction fetch (2 ns) + register read (1 ns) + ALU multiply (16 ns) + register write (1 ns) = 20 ns.
A floating point branch takes 5 ns and a floating point jump takes 2 ns (fetch only).
If the clock period is variable in length, we need to look at instruction frequency. For example: loads (31%), stores (21%), R-type (27%), beq (5%), j (2%), add.d and sub.d (7%), mult.d and div.d (7%). The average clock cycle is then
8 x 31% + 7 x 21% + 6 x 27% + 5 x 5% + 2 x 2% + 20 x 7% + 12 x 7% = 8.1 ns
What about variable cycle length?

Instead of a fixed cycle time, we allow the cycle time to depend on the instruction class. We can then compare performance, given that CPI is still 1 and the instruction count does not change:
Perf(variable cycle) / Perf(fixed cycle) = CPU exec. time(fixed cycle) / CPU exec. time(variable cycle) = Clock period(fixed) / Clock period(variable)
because Performance = 1 / (Instr. count x CPI x Clock period).
Performance improvement = 20 ns (fixed clock period) / 8.1 ns (variable clock period) ≈ 2.47 times faster
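The weighted-average cycle time and the resulting performance ratio can be checked with a few lines of Python. This is a sketch that simply evaluates the terms listed above; the dictionary names and class labels are assumptions made for the example.

```python
# Per-class instruction latency in ns and frequency in the instruction mix.
latency_ns = {"load": 8, "store": 7, "rtype": 6, "beq": 5, "j": 2,
              "fp_mul_div": 20, "fp_add_sub": 12}
mix = {"load": 0.31, "store": 0.21, "rtype": 0.27, "beq": 0.05, "j": 0.02,
       "fp_mul_div": 0.07, "fp_add_sub": 0.07}

# Variable-clock machine: average cycle time is the frequency-weighted latency.
variable_cycle = sum(latency_ns[k] * mix[k] for k in mix)

# Fixed-clock machine: every instruction takes as long as the slowest one.
fixed_cycle = max(latency_ns.values())

speedup = fixed_cycle / variable_cycle
print(f"variable: {variable_cycle:.2f} ns, fixed: {fixed_cycle} ns, "
      f"speedup: {speedup:.2f}")
```

Since CPI and instruction count are identical for both machines, the ratio of clock periods is the whole performance difference.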
Where We are Headed
Single cycle per instruction uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction, so we cannot make the common case fast. This is especially problematic for more complex instructions like floating point multiplications.
The single cycle per instruction datapath is wasteful of area, since some functional units must be duplicated because they cannot be shared during the execution of an instruction. For example, we need separate adders for the PC update and the branch target address calculation, as well as an ALU for R-type arithmetic/logic operations and data memory address calculations.
Single Cycle Disadvantages & Advantages
Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction.
[Figure: single cycle implementation timing. lw fills Cycle 1; sw finishes before the end of Cycle 2, and the remainder of that cycle is wasted]
Is wasteful of area, since some functional units must be duplicated because they cannot be shared during a clock cycle (e.g., adders, memory units).
But it is simple(r) and easy to understand.
The Five Stages of a lw Instruction

We will consider only a subset of instructions (lw, sw, add, sub, and, or, slt, beq).
IFetch: instruction fetch and update of the PC
Dec: register read and instruction decode
Exec: calculate the memory address
Mem: read the data from the Data Memory
WB: write the data back to the register file
[Figure: lw occupying Cycle 1 through Cycle 5 as IFetch, Dec (RR), Exec, Mem, WB]
What if several instructions were worked on by the CPU at the same time?
Each major logic unit works on a different stage of a different instruction.
It is like doing laundry for different roommates.
Pipelined MIPS Processor
Start the next instruction while still working on the current one
[Figure: lw, sw, and an R-type instruction overlapped across Cycle 1 to Cycle 8, each passing through IFetch, Dec (RR), Exec, Mem, WB and starting one cycle after the previous instruction]
Pipelining improves throughput, the total amount of work done in a given time.
If the pipeline is full (the ideal situation):
Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of pipeline stages
Instruction latency, the time from the start of an instruction to its completion, is not reduced.
Single Cycle vs. Pipelined

[Figure: single cycle implementation (Load in Cycle 1, Store in Cycle 2 with wasted time at the end) compared with the pipelined implementation (lw, sw, and an R-type instruction overlapped through IFetch, Dec, Exec, Mem, WB). Annotations: time savings; wasted cycle; the pipelined lw finishes later than in the non-pipelined version]
Single Cycle, vs. Pipelined

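Assume memory and ALU operations take 200 ps and register operations take 100 ps.
[Figure: single cycle implementation versus pipelined implementation for three instructions, marking the time savings for instruction 2 and instruction 3. In the pipelined implementation the register read happens in the second half of the cycle]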
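The comparison can be put into numbers with a short sketch. It assumes the stage delays stated above mapped onto the five stages (memory and ALU stages 200 ps, register stages 100 ps), a single-cycle clock long enough for an instruction that uses all five stages, and a pipelined clock equal to the slowest stage; the names `STAGE_PS`, `single_cycle_time`, and `pipelined_time` are made up for the example.

```python
# Assumed stage delays in picoseconds.
STAGE_PS = {"IFetch": 200, "Dec": 100, "Exec": 200, "Mem": 200, "WB": 100}


def single_cycle_time(n: int) -> int:
    """n instructions, each taking one long cycle that covers all stages."""
    return n * sum(STAGE_PS.values())


def pipelined_time(n: int) -> int:
    """Fill the pipeline once, then complete one instruction per cycle."""
    clock = max(STAGE_PS.values())
    return (n + len(STAGE_PS) - 1) * clock


for n in (3, 1000):
    print(n, single_cycle_time(n), pipelined_time(n))
```

For a few instructions the fill time dominates; as n grows the ratio approaches the single-cycle clock divided by the pipelined clock, which is below the number of stages because the stages are not perfectly balanced.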
Designing MIPS Instructions for Pipelining
What makes it easy:
all instructions are the same length (32 bits), so the first two pipeline stages are the same for all instructions
few instruction formats (three) with symmetry across formats: register addresses are in the same location, so the register file can be read while the instruction is being decoded
memory operations occur only in loads and stores, so the ALU can compute memory addresses in the EX stage
operands are aligned in memory, so a single data transfer requires only one memory access
MIPS Pipeline Datapath Modifications
What do we need to add or modify in our single-cycle-per-instruction datapath to make it pipelined?
A MIPS instruction has (up to) five stages, so the pipeline has 5 stages:
IFetch to fetch the instruction from Instruction Memory
Dec to decode the instruction and read the Register File registers
Exec to do the ALU operations
Mem to read from or write into Data Memory
WB to write back into the register file
So we need a way to separate the datapath into five pieces without losing intermediate results. We introduce pipeline registers between stages to isolate them and store intermediate results.
All instructions advance during one clock cycle from one pipeline register to the next.
[Figure: pipelined datapath divided into IF (fetch), ID (decode), EX (execute), Mem, and WB (write back), separated by the IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB pipeline registers, all driven by the System Clock]
Because all data is passed through the pipeline, the address of the register into which data is to be loaded (lw) must be passed along as well.
Pipeline register sizes (for now):
IF/ID: 64 bits
ID/EX: 32 x 4 + 5 = 133 bits
EX/MEM: 32 x 3 + 5 = 101 bits
MEM/WB: 32 x 2 + 5 = 69 bits
[Figure: pipelined datapath in which the Dec/Exec, Exec/Mem, and Mem/WB pipeline registers are extended to hold the address of the destination register]
MIPS Pipeline Control Path Modifications
All control signals are determined during Decode and held in the pipeline registers between pipeline stages
[Figure: the modified control path. The Control unit in the Dec stage generates the control signals, which are carried forward through the Dec/Exec, Exec/Mem, and Mem/WB pipeline registers and consumed in the Exec, Mem, and WB stages]
Pipeline Example
How does the following non-dependent instruction sequence execute in a pipeline (with no support for forwarding)?
before <4>
before <3>
before <2>
before <1>
lw $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or $13, $6, $7
add $14, $8, $9
after <1>
after <2>
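Pipeline Example - before <4> completes
[Figure: single-clock-cycle pipeline diagram for this cycle]
Pipeline Example - lw completes
[Figure: single-clock-cycle pipeline diagram. Annotations: data memory not used (MEM control lines are 0); destination register 12]
Pipeline Example - sub completes
[Figure: single-clock-cycle pipeline diagram. Annotations: data memory not used (MEM control lines are 0); destination register 13]
Pipeline Example - and completes
[Figure: single-clock-cycle pipeline diagram. Annotations: normal PC+4 increment (PCSrc = 0); destination register 14]
Pipeline Example - or completes
Pipeline Example - add completes
[Figure: single-clock-cycle pipeline diagram. Annotation: the register file is written in the first half of the cycle]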
Graphically Representing MIPS Pipeline

[Figure: one instruction drawn as IM, Reg, ALU, DM, Reg]
So far we have seen single-clock-cycle pipeline diagrams, which show the state of the entire datapath during one clock cycle (instructions are identified above the pipeline stages).
Multi-clock-cycle pipeline diagrams are simpler and:
can help answer how many cycles it takes to execute a piece of code
can help answer what the ALU is doing during a certain cycle
can represent multiple instructions in a single figure
if there is a hazard, show why it occurs and how it can be fixed
Why Pipeline? For Throughput!

[Figure: multi-clock-cycle pipeline diagram. Time (clock cycles) runs horizontally and instruction order (Inst 0 to Inst 4) runs vertically; each instruction passes through IM, Reg, ALU, DM, Reg, starting one cycle after the previous one. The first cycles are the time to fill the pipeline]
Once the pipeline is full, one instruction is completed every cycle.
The graphical representation can be converted into a single-clock-cycle pipeline diagram.
[Figure: example of a single-clock-cycle (cc 5) pipeline representation]
Pipelining the MIPS ISA
What makes it hard:
structural hazards: what if we had only one memory? Then the pipeline could not have one instruction read from memory (fetch stage) while at the same time another instruction writes into memory (sw).
control hazards: a decision must be made based on the result of one instruction while that instruction is still executing. What about branches? One approach is stalling.
Impact of branch stalling
We assume that all instructions in the pipeline have a CPI of 1, except branches, which are always followed by a stall and therefore have a CPI of 2. In a typical program branches occur 13% of the time. The aggregate CPI of the always-stall-on-branch architecture is computed from
CPI = sum for i = 1 to n of (CPI_i x F_i)
CPI(always stall) = 1 x 87% + 2 x 13% = 1.13 cycles/instruction
Thus
Perf(always stall) / Perf(no stall) = (Inst. count x CPI(no stall) x Clock period) / (Inst. count x CPI(always stall) x Clock period) = 1 / 1.13 = 0.885 (88.5%)
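The aggregate CPI formula is a frequency-weighted sum and is easy to evaluate directly. The sketch below reproduces the always-stall calculation; the function name `aggregate_cpi` is an assumption made for the example.

```python
def aggregate_cpi(classes):
    """Sum of CPI_i * F_i over (cpi, frequency) pairs."""
    return sum(cpi * freq for cpi, freq in classes)


# Non-branches: CPI 1 at 87%; branches followed by a stall: CPI 2 at 13%.
cpi_always_stall = aggregate_cpi([(1, 0.87), (2, 0.13)])
cpi_no_stall = 1.0

# Same instruction count and clock period, so performance scales as 1 / CPI.
relative_performance = cpi_no_stall / cpi_always_stall
print(f"CPI = {cpi_always_stall:.2f}, "
      f"relative performance = {relative_performance:.3f}")
```

The same function can be reused for other stall policies by changing the per-class CPI values and frequencies.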
control hazards: another approach is prediction. Static prediction always executes the instruction following a branch (it always assumes that the branch is not taken). Dynamic prediction keeps a history of each branch as taken or not taken, and is accurate about 90% of the time.
Pipelining the MIPS ISA
[Figure: simplified pipeline diagrams for the branch not taken and branch taken cases. The simplified representation shades the blocks that are used in a given clock cycle]
data hazards: what if an instruction's input operands depend on the output of a previous instruction that has not finished? Example: an add followed by a sub. The fix is forwarding.
Forwarding
Forwarding will fail for a lw followed immediately by an instruction that uses the result of the lw. Example: a lw followed by a sub.
Solution: stall the pipeline for one clock cycle, then forward.
Another solution: have the compiler reorder the code so that the lw is followed by an instruction that does not depend on the loaded word.
How About Register File Access?

[Figure: multi-clock-cycle diagram (IM, Reg, ALU, DM, Reg) for add, Inst 1, Inst 2, add, Inst 4, in which one instruction writes the register file in the same cycle in which a later instruction reads it]
The register file access hazard can be fixed by doing reads in the second half of the cycle and writes in the first half.
Branch Instructions Cause Control Hazards
Dependencies backward in time cause hazards.
[Figure: multi-clock-cycle diagram (IM, Reg, ALU, DM, Reg) for add, beq, lw, Inst 3, Inst 4, showing the dependency of the instructions after beq on the branch outcome]
One Way to Fix a Control Hazard

[Figure: multi-clock-cycle diagram for add, beq, stall, stall, lw, Inst 3]
The branch hazard can be fixed by waiting (stalling), but this affects throughput.
Register Usage Can Cause Data Hazards

add r1,r2,r3
sub r4,r1,r5
and r6,r1,r7
or r8,r1,r9
xor r4,r1,r5
[Figure: multi-clock-cycle diagram (IM, Reg, ALU, DM, Reg) for the sequence above, marking which uses of r1 are a data hazard and which are not]
One Way to Fix a Data Hazard

[Figure: multi-clock-cycle diagram for add r1,r2,r3, stall, stall, sub r4,r1,r5, and r6,r1,r7]
The data hazard can be fixed by waiting (stalling), but this affects throughput.
Loads Can Cause Data Hazards

lw r1,100(r2)
sub r4,r1,r5
and r6,r1,r7
or r8,r1,r9
xor r4,r1,r5
[Figure: multi-clock-cycle diagram (IM, Reg, ALU, DM, Reg) for the sequence above, showing the dependencies on the loaded value of r1]
Stores Can Cause Data Hazards

add r1,r2,r3
sw r1,100(r5)
and r6,r1,r7
or r8,r1,r9
xor r4,r1,r5
[Figure: multi-clock-cycle diagram (IM, Reg, ALU, DM, Reg) for the sequence above, showing the dependencies on r1, including the value stored by sw]
Pipeline Changes to accommodate Forwarding

To avoid slowing down throughput, we need to add hardware that detects data hazards. We call this the forwarding unit.
Data needs to be forwarded to the ALU when a data hazard is detected, so the forwarding unit controls additional multiplexing at the ALU inputs.
This logic unit needs input from the three pipeline registers. It also needs to know whether the RegWrite control signal is asserted, so it needs input from the control lines as well.
There is no forwarding from a stage whose destination register is $0 (EX/MEM.RegisterRd = $0 or MEM/WB.RegisterRd = $0).
It needs to detect one of four cases of data hazards:
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) Forward
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) Forward
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) Forward
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) Forward
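The four conditions can be sketched as one function that is evaluated once per ALU operand. This is an illustrative sketch: the function name, the dictionary representation of the pipeline registers, and the choice to give EX/MEM priority over MEM/WB when both match (so the most recent result wins) are assumptions. The returned codes follow the ForwardA/ForwardB encoding (00 none, 10 from EX/MEM, 01 from MEM/WB).

```python
def forward_select(ex_mem: dict, mem_wb: dict, src_reg: int) -> int:
    """Return the forwarding mux code for one ALU source register."""
    # EX/MEM holds the most recent result, so it is checked first (assumed).
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == src_reg:
        return 0b10
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == src_reg:
        return 0b01
    return 0b00


# sub $2,$1,$3 is in EX/MEM while and $4,$2,$5 is in ID/EX.
ex_mem = {"RegWrite": True, "Rd": 2}
mem_wb = {"RegWrite": False, "Rd": 0}
forward_a = forward_select(ex_mem, mem_wb, src_reg=2)   # ID/EX.RegisterRs
forward_b = forward_select(ex_mem, mem_wb, src_reg=5)   # ID/EX.RegisterRt
print(format(forward_a, "02b"), format(forward_b, "02b"))
```

Calling the function twice, once with ID/EX.RegisterRs and once with ID/EX.RegisterRt, covers all four cases.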
Using Pipeline Registers to solve data hazards

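[Figure: pipelined datapath with three-input multiplexors (inputs 0, 1, 2) at both ALU inputs, in addition to the ALUSrc multiplexor]
ForwardA = 00: ALU input 1 comes from ID/EX (no forwarding)
ForwardA = 01: ALU input 1 comes from MEM/WB
ForwardA = 10: ALU input 1 comes from EX/MEM
ForwardB = 00: ALU input 2 comes from ID/EX
ForwardB = 01: ALU input 2 comes from MEM/WB
ForwardB = 10: ALU input 2 comes from EX/MEM
ForwardB = 11: ALU input 2 is the sign extension output, OR add another multiplexer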
Forwarding Pipeline Example
How does the following dependent instruction sequence execute in a pipeline with support for forwarding?
before <4>
before <3>
before <2>
before <1>
sub $2, $1, $3
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
after <1>
after <2>
Forwarding Pipeline Example - before <2> completes
Forwarding Pipeline Example - before <1> completes
[Figure annotations: use the forwarded value of $2, not the one fetched from the register file; EX/MEM.RegWrite is asserted; EX/MEM.RegisterRd = ID/EX.RegisterRs]
Forwarding Pipeline Example - sub completes
[Figure annotations: both $4 and $2 are forwarded; EX/MEM.RegWrite is asserted; MEM/WB.RegWrite is asserted; MEM/WB.RegisterRd = ID/EX.RegisterRt]
Forwarding Pipeline Example - and completes
[Figure annotations: use the forwarded value of $4, not the one fetched from the register file; EX/MEM.RegWrite is asserted]
Example (corrected)
Forwarding does not work when the instruction following a lw tries to read the destination register of the lw:
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
The pipeline needs to be stalled, and the data is then forwarded from the MEM/WB pipeline register.
Pipeline Changes to accommodate Stalls
[Figure: multi-clock-cycle diagram of the sequence above showing where forwarding does not work]
How Stalls are inserted
Stalls happen in the EX stage, so that the subsequent two instructions in the pipeline both repeat for one cycle what they were doing. This allows forwarding to work.
The stall occurs in CC4; the and and or instructions repeat what they did in CC3.
The or instruction is fetched in CC3 and is held in place for another cycle.
Pipeline Changes to accommodate Stalls

We need a logic unit that detects hazards and then stalls.
The hazard detection unit operates in the instruction decode stage and tests whether the instruction ahead of it is a load (the ID/EX.MemRead control line is asserted).
It then checks whether either of the source registers of the instruction currently being decoded is the same as the destination register of the lw being executed (that is, ID/EX.RegisterRt = IF/ID.RegisterRs or ID/EX.RegisterRt = IF/ID.RegisterRt).
During a stall the PC is prevented from incrementing and the instruction in the IF/ID pipeline register is preserved. This needs additional control lines for the IF/ID register and for the PC.
The bubble is inserted by setting the pipelined control signals in the ID/EX pipeline register to 0, so we need a way to change the values of the control lines.
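The load-use check performed by the hazard detection unit can be sketched as a single predicate. This is an illustrative sketch: the function and parameter names are assumptions that mirror the pipeline register fields named above.

```python
def load_use_stall(id_ex_mem_read: bool, id_ex_rt: int,
                   if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in decode needs the result of a lw in EX."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


# lw $2, 20($1) is in EX while and $4, $2, $5 is being decoded.
stall = load_use_stall(True, id_ex_rt=2, if_id_rs=2, if_id_rt=5)
print(stall)
```

When the predicate is true, the PC and IF/ID register are held for one cycle and the control lines entering ID/EX are zeroed to create the bubble.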
Pipeline Changes to do Hazard Detection
[Figure: pipelined datapath with a hazard detection unit whose inputs are ID/EX.RegisterRt and the source registers of the instruction in IF/ID, and whose outputs are the PCWrite and IF/IDWrite lines. A stall is produced by zeroing all 9 control lines]
Pipeline stalling example
[Figures: successive clock cycles as before <3>, lw, and the bubble complete. Annotations: the hazard is detected; PCWrite is asserted and a bubble is inserted; IF/IDWrite is asserted; registers continue to be read; when lw completes, the forwarding unit sets the ALU input multiplexer to use the value from the MEM/WB register; when the bubble completes, the forwarding unit sets the ALU input multiplexer to use the value from the EX/MEM register]
Example
Consider executing the following code:
add $5, $6, $7
lw $6, 100($7)
sub $7, $6, $8
How many cycles will it take to execute the code? Draw a diagram that illustrates the dependencies that need to be resolved.
[Figure: multi-clock-cycle diagram through CC 8 for add $5,$6,$7, lw $6,100($7), and sub $7,$6,$8, showing the dependency of sub on the $6 loaded by lw]
Example - continued
Draw a diagram that illustrates how the code will actually be executed (incorporating any stalls or forwarding needed to solve the identified problems).
[Figure: multi-clock-cycle diagram through CC 8 in which sub $7,$6,$8 stalls one cycle after lw $6,100($7) and then receives $6 by forwarding]
Pipeline Changes to accommodate Control Hazards
Control hazards are due to branch hazards and to exceptions (I/O interrupts, requests from the OS, overflow, or an unknown instruction).
A branch hazard occurs less frequently than data hazards, and is detected in the MEM stage of the pipeline.
If we assume the branch is not taken, the three instructions following a branch that is in fact taken will be in the pipeline and need to be flushed.
[Figure: multi-clock-cycle diagram with the branch decision in the MEM stage; the branch is detected in CC4]
40 beq $1,$3,7
44 and $12,$2,$5
48 or $13,$6,$2
52 add $14,$2,$2
56
72 lw $4,50($7)
Pipeline Changes to accommodate Branch Hazards
The pipeline throughput can be improved by moving the decision on whether the branch is taken to the Decode stage of the pipeline. Then, if the branch is taken, only one instruction needs to be flushed (discarded): the instruction immediately after the branch instruction.
Thus we need a new logic circuit that compares the contents of the register file outputs.
Since the decision is made in the decode stage, the branch address needs to be computed in the decode stage too, in case the branch is taken.
Thus we need a new adder in the decode stage, as well as an IF Flush control line to flush the IF/ID pipeline register.
Pipeline Changes to accommodate Branch Hazards
[Figure: datapath with the branch resolved in the Decode stage; labels: Branch, Switch to branch address, Compute branch address, Check for equality]
Pipelined branch example <before 2> completes

PC-relative branch 40+4+7*4=72
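The arithmetic above can be written as a short sketch. It assumes the standard MIPS convention that the 16-bit offset field counts words relative to PC + 4; the helper names `sign_extend_16` and `branch_target` are hypothetical.

```python
# Minimal sketch of PC-relative branch target computation, assuming the
# MIPS convention: target = (PC + 4) + (sign-extended offset << 2).

def sign_extend_16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(branch_pc, offset_field):
    """Address fetched next when the branch at branch_pc is taken."""
    return branch_pc + 4 + (sign_extend_16(offset_field) << 2)


print(branch_target(40, 7))  # 72
```

For the beq at address 40 with offset 7, this gives 40 + 4 + 7*4 = 72, the address of the lw that is fetched when the branch is taken.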
[Figure labels: Branch, IF Flush]
Flushing means the instruction field is set to 0s
Pipelined branch example <before 1> completes
Pipeline Changes to accommodate Branch Hazards
The above scheme will fail if we have the following series of instructions:
36 add $1, $6, $7
40 beq $1, $3, 28
44 and $12, $2, $5
72 lw
This is because the correct value of register $1 is not yet available in the Decode stage (in the register file) when the comparator needs it. The pipeline needs to be stalled, and the value of $1 needs to be forwarded from the EX/MEM pipeline register.
[Figure: pipeline diagrams for 36 add $1, $6, $7; 40 beq $1, $3, 28; 44 and $12, $2, $5; 72 lw. In the corrected version there is a stall between the add and the beq, and a flush before 72 lw.]
Example
How can the following code be modified to make use of a delayed branch slot?
Loop: lw $2, 100($3)
      addi $3, $3, 4
      beq $3, $4, Loop
We cannot put the addi after the beq, since the beq compares register $3, which the addi modifies. We cannot simply put the lw after the beq either, since register $3 has changed by then. First we rewrite the code as:
Loop: addi $3, $3, 4
      lw $2, 96($3)
      beq $3, $4, Loop
Then we can move the lw after the beq:
Loop: addi $3, $3, 4
      beq $3, $4, Loop
      lw $2, 96($3)
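The offset change from 100 to 96 is what keeps the load address the same once the addi runs first. A minimal sketch of one loop iteration in each form makes this visible; the function names are hypothetical, and the register is modelled as a plain integer.

```python
# Minimal sketch comparing one iteration of the original loop body with
# the delayed-branch version. Each function returns the address that the
# lw reads and the updated value of $3.

def original_iteration(r3):
    addr = r3 + 100   # lw   $2, 100($3)
    r3 = r3 + 4       # addi $3, $3, 4
    return addr, r3   # beq  $3, $4, Loop then compares the updated $3


def delayed_iteration(r3):
    r3 = r3 + 4       # addi $3, $3, 4
    # beq $3, $4, Loop compares the updated $3 at this point
    addr = r3 + 96    # lw   $2, 96($3), executed in the branch delay slot
    return addr, r3


for start in (0, 4, 1000):
    print(original_iteration(start) == delayed_iteration(start))
```

Both versions read the same address and leave the same value in $3, so the branch sees the same comparison in either form.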
Example 2
Consider the pipelined datapath that does not accommodate branch hazards. Can an attempt to flush and an attempt to stall occur simultaneously? You may want to consider the following code sequence to help you answer this question:
        beq $1, $2, TARGET   # assume the branch is taken
        lw $3, 40($4)
        add $3, $3, $3
        sw $3, 40($4)
TARGET: or $10, $11, $12
If the beq is resolved in the MEM stage and the branch is taken, the IF/ID pipeline register must be flushed (which means the register needs to be written) and the PC must be changed to the branch address; this happens in clock cycle 4.
Example 2 - continued
At the same time, a hazard is detected between the lw and the next instruction (add), which depends on it because it uses $3 as a source register. Thus the hazard detection unit issues a stall and requests that the PC and the IF/ID register not be written. The answer is YES: a flush and a stall are issued simultaneously.
If there are any conflicting actions, which should take priority?
Flush should take priority
Is there a simple change you can make to the datapath to ensure the necessary priority?
Example 2 - continued
The hazard detection unit should be changed to see the RegWrite signal in the execution stage after it goes through the MUX used to flush the pipeline.
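The priority rule itself (flush wins over stall) can be expressed as a small behavioural sketch. This is not the hardware change described above; it only shows the intended outcome. The signal names `PCWrite` and `IFIDWrite` are assumed names for the write enables of the PC and the IF/ID register.

```python
# Behavioural sketch only: a flush request overrides a simultaneous
# stall request. Signal names are assumptions made for illustration.

def resolve(flush_requested, stall_requested):
    # A stall is honoured only when no flush is pending.
    stall = stall_requested and not flush_requested
    return {
        "PCWrite": not stall,          # PC may be updated (e.g. to the branch target)
        "IFIDWrite": not stall,        # IF/ID may be written (e.g. cleared by the flush)
        "IF.Flush": flush_requested,
    }


print(resolve(flush_requested=True, stall_requested=True))
```

With both requests active, the PC and IF/ID register stay writable, so the taken branch can redirect the fetch and clear the wrong-path instruction.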
RegWrite
Dynamic branch prediction
Static branch prediction (assume the branch is not taken, then flush if it was taken) works for simple pipelines, but it wastes performance in aggressively pipelined architectures (such as the multiple-issue Pentium 4). One approach is to have a branch prediction buffer (a small memory indexed by the lower portion of the address of the branch instruction). It contains a bit that says whether the branch was recently taken or not. The prediction bit is inverted if the prediction turns out to be wrong. When the branch is almost always taken, this 1-bit predictor will predict wrong twice (at the start and at the end of each run of branches).
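The "wrong twice" behaviour can be seen in a minimal simulation of a single 1-bit prediction entry. The function name and the loop pattern (taken nine times, then not taken once) are assumptions chosen for illustration.

```python
# Minimal sketch of one 1-bit prediction entry: predict the last outcome,
# and invert the bit whenever the prediction was wrong.

def one_bit_mispredictions(outcomes, initial=True):
    prediction = initial
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
            prediction = taken  # invert the bit after a wrong prediction
    return misses


# A loop branch: taken nine times, then not taken when the loop exits.
loop_run = [True] * 9 + [False]

print(one_bit_mispredictions(loop_run * 3))  # 5
```

The first run mispredicts only at the exit. Each later run mispredicts twice: once on re-entering the loop (the bit still says "not taken") and once on the exit.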
Dynamic branch prediction

A better approach is to use a two-bit scheme, in which a prediction must be wrong twice before its direction changes. The branch prediction is stored in a special buffer that is accessed with the beq instruction in the IF stage. If the beq is predicted as taken, fetching begins from the target once the beq is in ID.
[Figure: state diagram of the two-bit prediction scheme, with Taken and Not Taken transitions between the prediction states]
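A common way to realise a two-bit scheme is a saturating counter. The sketch below uses that variant as an assumption; the exact transitions in the slide's state diagram may differ. The function name and the loop pattern are hypothetical.

```python
# Minimal sketch of a two-bit saturating-counter predictor (an assumed
# variant of the two-bit scheme). Counter values 2 and 3 predict taken,
# values 0 and 1 predict not taken.

def two_bit_mispredictions(outcomes, counter=3):
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Same loop branch as before: taken nine times, then not taken once.
loop_run = [True] * 9 + [False]

print(two_bit_mispredictions(loop_run * 3))  # 3
```

Only the loop exit is mispredicted in each run, because a single not-taken outcome moves the counter from 3 to 2 and the prediction stays "taken".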
A further optimization is a global predictor, which takes into consideration the global behavior of recently executed branches. Each branch has two predictors, and a tournament predictor keeps track of them and favors the one that has been more accurate.
Dynamic branch prediction with compiler optimization
Furthermore, compilers place instructions that always execute in the delay slot.
[Figure labels: For mostly taken branches, Best choice]
Pipeline Changes to accommodate Exceptions

Overflow is discovered at the end of the execute stage, when the ALU sends a signal to the control unit. Following notification of an overflow, the control unit has to flush the two instructions that follow the one causing the overflow. These instructions are now in the IF and ID stages of the pipeline. Thus we add an input to the MUX in the ID stage that zeros the control signals, using an ID.Flush signal.
[Figure labels: 80000180, ID.Flush, IF.Flush, Overflow]
The instruction that causes the overflow (which is detected in the EX stage) needs to be flushed from the pipeline. This means that an EX.Flush signal needs to be sent to two multiplexers to zero the control signals for the last two stages of the pipeline. Overflow is only one of the many possible exception causes. The cause is stored in a Cause register, using the codes below:
4 - address error exception (load)
5 - address error exception (store)
10 - unknown or reserved instruction
12 - arithmetic overflow
15 - floating point exception
An additional input is added to the PC MUX that sends 8000 0180 hex to the PC (the system-reserved memory address for overflow). The address of the instruction following the offending instruction is saved in the Exception Program Counter (EPC) register, and the cause in the Cause register. If there are multiple exceptions, their causes are stored in the Cause register, so that the hardware can interrupt based on later exceptions once the earliest exception has been serviced. In the case of an I/O interrupt, execution jumps to the system routine needed to deal with the I/O, followed by a return to the address stored in the EPC for program completion. The OS responds to an exception either by terminating the process that caused it or by performing some action. A process whose exception is due to an unimplemented instruction is killed by the OS.
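The register updates described above can be summarised in a short sketch. It follows the text: the EPC receives the address of the instruction after the offending one, the Cause register receives the cause code, and the PC is loaded with 8000 0180 hex. The function name, the dictionary layout and the example address 0x50 are assumptions for illustration.

```python
# Minimal sketch of the state changes on an exception, as described in
# the text. Names and the example address are hypothetical.

CAUSE_CODES = {
    4: "address error exception (load)",
    5: "address error exception (store)",
    10: "unknown or reserved instruction",
    12: "arithmetic overflow",
    15: "floating point exception",
}

EXCEPTION_VECTOR = 0x80000180


def take_exception(offending_pc, cause_code):
    """Return the new EPC, Cause and PC values after an exception."""
    if cause_code not in CAUSE_CODES:
        raise ValueError(f"unknown cause code: {cause_code}")
    return {
        "EPC": offending_pc + 4,   # address of the following instruction
        "Cause": cause_code,
        "PC": EXCEPTION_VECTOR,    # next fetch comes from the OS routine
    }


state = take_exception(0x50, 12)   # an add at 0x50 overflows (assumed address)
print(hex(state["EPC"]), state["Cause"], hex(state["PC"]))  # 0x54 12 0x80000180
```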
Pipeline Changes to accommodate Overflow
[Figure: pipelined datapath; labels: Branch, EX.Flush, 80000180, 0 0, Overflow]
Pipeline Changes to accommodate Unknown Instruction
[Figure: pipelined datapath; labels: Branch, EX.Flush (LOW), 80000180, 0 0]
Pipelined exception example: and completes
[Figure: pipelined datapath; labels: 80000180, 54, Overflow. The add at 50 causes an overflow, and the OS instruction is fetched.]
Pipelined exception example: or completes
[Figure: pipelined datapath; labels: 80000180, 80000184]
Pipelining Speed-ups
One way to speed up pipelines is to have more stages (up to eight), which results in shorter clock cycles. Another way is superscalar architectures, which have a CPI of less than 1. Multiple instructions can be launched at the same time (multiple issue), so the instruction execution rate exceeds the clock rate. We then talk about the number of Instructions per Clock Cycle (IPC) instead of CPI. Architectures try to issue 3 to 8 instructions in every clock cycle. A third way is to balance the load through dynamic pipeline scheduling, to avoid hazards (stalls). The price for these speed-ups is more hardware, more complicated control, and a more complicated instruction execution model. If instructions are launched in pairs, only the first instruction is launched if dynamic conditions are not met.
Static Multiple Issue

Used in embedded processors and VLIW processors. Can improve performance by up to 200%. The layout is restricted to simplify decoding and instruction issue. Instructions are issued in pairs, aligned on a 64-bit boundary, with the ALU or branch portion first. If one instruction of the pair cannot be used, it is replaced by a no-op. The hardware detects data hazards and generates stalls between two issue packets, but the compiler is required to avoid all dependencies within the instruction pair. A load will cause the next two instructions to stall if they use the loaded word.
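The pairing rule can be illustrated with a simple in-order packer: each issue packet has an ALU/branch slot and a load/store slot, and a slot that cannot be filled gets a no-op. This is a sketch under stated assumptions, not a real compiler scheduler; the function names, the tuple encoding and the sample program are hypothetical, and stalls between packets (which the hardware handles) are not modelled.

```python
# Minimal sketch of static two-issue packing. Instructions are encoded
# as (op, destination, sources); this encoding is an assumption.

MEM_OPS = {"lw", "sw"}
NOP = ("nop", None, ())


def independent(a, b):
    """Conservative check: no register written by one is used or written by the other."""
    a_dest, b_dest = a[1], b[1]
    return ((a_dest is None or a_dest not in b[2])
            and (b_dest is None or b_dest not in a[2])
            and (a_dest is None or a_dest != b_dest))


def pack_two_issue(program):
    """Return a list of (alu_or_branch_slot, load_store_slot) packets."""
    packets = []
    i = 0
    while i < len(program):
        first = program[i]
        second = program[i + 1] if i + 1 < len(program) else None
        first_is_mem = first[0] in MEM_OPS
        can_pair = (second is not None
                    and (second[0] in MEM_OPS) != first_is_mem
                    and independent(first, second))
        if can_pair:
            alu, mem = (second, first) if first_is_mem else (first, second)
            packets.append((alu, mem))
            i += 2
        else:
            # The unused slot of the packet is filled with a no-op.
            packets.append((NOP, first) if first_is_mem else (first, NOP))
            i += 1
    return packets


program = [
    ("add", "$1", ("$2", "$3")),
    ("lw",  "$4", ("$5",)),
    ("sub", "$6", ("$4", "$7")),
    ("sw",  None, ("$6", "$8")),
]

for alu, mem in pack_two_issue(program):
    print(alu[0], "|", mem[0])
```

In this sample, the add and lw share a packet, while the sub and sw cannot, because the sw stores the value that the sub produces.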
[Figure: two-issue pipeline diagram through CC 8; instruction labels: add, lw, beq, sw, sub, lw]
Static two-issue datapath
We need two output ports for the instruction memory, two more read ports and one more write port for the register file, two ALUs (one handles address computation for data memory access), and two sign-extending units.
Three Primary Units of Dynamically Scheduled Pipeline
Dynamic pipeline scheduling chooses which instruction to execute next, re-ordering instructions to avoid stalls.
Reservation station: a buffer holding all the operands and the operation.
Results are sent to other reservation stations or to the commit unit.
The commit unit buffers results until it is safe to put them in the register file or in data memory (for a store).
The commit unit also serves as a forwarding station for operands that are needed before they have been written back to the register file.
AMD Opteron X4 12-stage pipeline
Speculative pipeline that executes 3 instructions per clock cycle.
Register renaming removes antidependencies. In case of incorrect speculation, the mapping between architectural and physical registers is undone.
[Figure labels: Memory address calculation, Actual memory access]
Intel Core pipeline
Each core can execute 4 instructions simultaneously
A Core Duo can execute 8 instructions simultaneously
Better branch prediction
Enhanced ALU
Less power consumption
http://www.extremetech.com/article2/0,1697,1988744,00.asp

