Chapter 4

CPE 408340 Computer Organization Chapter 4: The Processor: Datapath and Control
Saed R. Abed
[Computer Engineering Department, Hashemite University] [Adapted from Otmane Ait Mohamed Slides & Computer Organization and Design, Patterson & Hennessy, 2005, UCB]
Review: Design Principles
Simplicity favors regularity
- fixed-size instructions (32 bits)
- only three instruction formats
Good design demands good compromises
- three instruction formats
Smaller is faster
- limited instruction set
- limited number of registers in the register file
- limited number of addressing modes
Make the common case fast
- arithmetic operands come from the register file (load-store machine)
- instructions may contain immediate operands
Review: THE Performance Equation
Our basic performance equation is then
CPU time = Instruction_count x CPI x clock_cycle
or
CPU time = (Instruction_count x CPI) / clock_rate
These equations separate the three key factors that affect performance

- Can measure the CPU execution time by running the program
- The clock rate is usually given in the documentation
- Can measure instruction count by using profilers/simulators without knowing all of the implementation details
- CPI varies by instruction type and ISA implementation, for which we must know the implementation details
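As a minimal sketch of the performance equation, the Python function below evaluates CPU time from the three factors. The function name and the workload numbers (1 million instructions, CPI of 2.0, 1 GHz clock) are hypothetical and chosen only for illustration.

```python
def cpu_time_seconds(instruction_count: int, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = (Instruction_count x CPI) / clock_rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical workload (assumed values, not from the slides).
example_time = cpu_time_seconds(1_000_000, 2.0, 1e9)
print(f"CPU time: {example_time * 1e3:.3f} ms")
```

Halving the CPI or doubling the clock rate halves the result, which is why the equation is useful for comparing design alternatives.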
Datapath design tended to just work Control paths are where the system complexity lives. Bugs spawned from control path design errors reside in the microcode flow, the finite-state machines, and all the special exceptions that inevitably spring up in a machine design like thistles in a flower garden.
The Pentium Chronicles, Colwell, pg. 64
4.1 The Processor: Datapath & Control
Our implementation of the MIPS is simplified
- memory-reference instructions: lw, sw
- arithmetic-logical instructions: add, sub, and, or, slt
- control flow instructions: beq, j
Generic implementation
- use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC)
- decode the instruction (and read registers)
- execute the instruction
[Figure: instruction cycle: Fetch (PC = PC + 4), Decode, Exec]
All instructions (except j) use the ALU after reading the registers. How is it used by memory-reference, arithmetic, and control flow instructions?
Abstract Implementation View
Two types of functional units:
- elements that operate on data values (combinational)
- elements that contain state (sequential)
[Figure: abstract datapath: PC, Instruction Memory, Register File, ALU, Data Memory]
Single cycle operation
Split memory (Harvard) model: one memory for instructions and one for data
4.2 Logic Design Conventions: Clocking Methodologies
The clocking methodology defines when signals can be read and when they are written
An edge-triggered methodology
Typical execution
- read contents of state elements
- send values through combinational logic
- write results to one or more state elements
[Figure: State element 1, Combinational logic, State element 2, over one clock cycle]
Assumes state elements are written on every clock cycle; if not, an explicit write control signal is needed
- the write occurs only when both the write control is asserted and the clock edge occurs
4.3 Building a Datapath: Fetching Instructions
Fetching instructions involves
- reading the instruction from the Instruction Memory
- updating the PC value to be the address of the next (sequential) instruction
[Figure: PC, Instruction Memory, and an adder computing PC + 4]
The PC is updated every clock cycle, so it does not need an explicit write control signal, just a clock signal
Reading from the Instruction Memory is a combinational activity, so it doesn't need an explicit read control signal
Decoding Instructions
Decoding instructions involves
- sending the fetched instruction's opcode and function field bits to the control unit
and
- reading two values from the Register File
  - Register File addresses are contained in the instruction
[Figure: the fetched instruction feeding the Control Unit and the Register File read ports]
Reading Registers Just in Case
Note that both RegFile read ports are active for all instructions during the Decode cycle, using the rs and rt instruction field addresses
- since we haven't decoded the instruction yet, we don't know what the instruction is!
- just in case the instruction uses values from the RegFile, do the work ahead of time by reading the two source operands
Which instructions do make use of the RegFile values?
Also, all instructions (except j) use the ALU after reading the registers. Why? Consider memory-reference, arithmetic, and control flow instructions.
Executing R Format Operations
R format operations (add, sub, slt, and, or)
R-type: op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
- perform the operation (op and funct) on the values in rs and rt
- store the result back into the Register File (into location rd)
[Figure: Register File and ALU with RegWrite and ALU control inputs; the ALU outputs overflow and zero]
Note that the Register File is not written every cycle (e.g., sw), so we need an explicit write control signal for the Register File
Consider the slt Instruction
Remember the R format instruction slt
slt $t0, $s0, $s1   # if $s0 < $s1 then $t0 = 1 else $t0 = 0
Where does the 1 (or 0) come from to store into $t0 in the Register File at the end of the execute cycle?
[Figure: Register File and ALU with overflow and zero outputs]
Executing Load and Store Operations
Load and store operations have to
I-Type: op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]
- compute a memory address by adding the base register (in rs) to the 16-bit signed offset field in the instruction
  - the base register was read from the Register File during decode
  - the offset value in the low-order 16 bits of the instruction must be sign-extended to create a 32-bit signed value
- store value, read from the Register File during decode, must be written to the Data Memory
- load value, read from the Data Memory, must be stored in the Register File
Executing Load and Store Operations, cont
[Figure: load/store datapath: Register File, Sign Extend (16 to 32 bits), ALU, and Data Memory, with the RegWrite, ALU control, MemWrite, and MemRead signals]
Executing Branch Operations
Branch operations have to
I-Type: op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]
- compare the operands read from the Register File during decode (rs and rt values) for equality (zero ALU output)
- compute the branch target address by adding the updated PC to the sign-extended 16-bit signed offset field in the instruction
  - the base register is the updated PC
  - the offset value in the low-order 16 bits of the instruction must be sign-extended to create a 32-bit signed value and then shifted left 2 bits to turn the word offset into a byte offset
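The branch target computation above can be sketched in Python as follows. This is an illustrative model, not the hardware itself: the function names are hypothetical, and 32-bit wraparound is emulated with a mask.

```python
def sign_extend_16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, offset_field: int) -> int:
    """Branch target = (PC + 4) + (sign-extended offset shifted left by 2)."""
    updated_pc = (pc + 4) & 0xFFFFFFFF
    byte_offset = sign_extend_16(offset_field) << 2
    return (updated_pc + byte_offset) & 0xFFFFFFFF


# A forward branch of 3 words and a backward branch of 1 word from PC = 0x1000.
print(hex(branch_target(0x1000, 0x0003)))
print(hex(branch_target(0x1000, 0xFFFF)))
```

An offset field of 0xFFFF encodes -1, so the second call lands one word before the updated PC, that is, back on the branch itself.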
Executing Branch Operations, cont
[Figure: branch datapath: PC + 4 adder, Sign Extend (16 to 32 bits), Shift left 2, and the adder producing the branch target address; the ALU zero output goes to the branch control logic]
Executing Jump Operations
Jump operations have to
J-Type: op [31-26] | jump target address [25-0]
- replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits
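A minimal Python sketch of the jump address formation follows. It assumes the upper 4 bits come from the updated PC (PC + 4), as the datapath figures in this chapter label them; the function name is hypothetical.

```python
def jump_target(pc: int, instruction: int) -> int:
    """Replace the lower 28 bits of the updated PC with the 26-bit target field << 2."""
    updated_pc = (pc + 4) & 0xFFFFFFFF
    target_field = instruction & 0x03FFFFFF   # lower 26 bits of the instruction
    return (updated_pc & 0xF0000000) | (target_field << 2)


# j with opcode 000010 and a target field of 4, fetched from PC = 0x40000000.
print(hex(jump_target(0x40000000, 0x08000004)))
```

Because only 28 bits are replaced, a jump can reach any word inside the current 256 MB region but cannot leave it.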
[Figure: PC + 4 adder and Instruction Memory; the 26-bit target field passes through Shift left 2 to form the 28-bit jump address]
Creating a Single Datapath from the Parts
Assemble the datapath segments and add control lines and multiplexors as needed
Single cycle design: fetch, decode and execute each instruction in one clock cycle
- no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders)
- multiplexors are needed at the input of shared elements, with control lines to do the selection
- write signals control writing to the Register File and Data Memory
Cycle time is determined by the length of the longest path
Fetch, R, and Memory Access Portions
[Figure: fetch, R-type, and memory access portions of the datapath combined, with the RegWrite, ALUSrc, ALU control, MemWrite, MemRead, and MemtoReg signals]
Multiplexor Insertion
[Figure: the same datapath with multiplexors inserted at the shared inputs, with the RegWrite, MemWrite, MemRead, and MemtoReg signals]
Clock Distribution
[Figure: the system clock distributed to the state elements of the datapath; one clock cycle marked]
Adding the Branch Portion
[Figure: the datapath with the branch portion added: Shift left 2, branch target adder, and a PCSrc multiplexor selecting the next PC]
4.4 A Simple Implementation Scheme: Our Simple Control Structure
We wait for everything to settle down
- the ALU might not produce the right answer right away
- Memory and RegFile reads are combinational (as are the ALU, adders, muxes, shifter, and sign extender)
- use write signals along with the clock edge to determine when to write to the sequential elements (the PC, the Register File, and the Data Memory)
The clock cycle time is determined by the logic delay through the longest path
We are ignoring some details like register setup and hold times
Adding the Control
Selecting the operations to perform (ALU, Register File and Memory read/write)
Controlling the flow of data (multiplexor inputs)
R-type: op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
I-Type: op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]
J-type: op [31-26] | target address [25-0]
Observations
- the op field is always in bits 31-26
- the addresses of the registers to be read are always specified by the rs field (bits 25-21) and the rt field (bits 20-16); for lw and sw, rs is the base register
- the address of the register to be written is in one of two places: in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions
- the offset for beq, lw, and sw is always in bits 15-0
Single Cycle Datapath with Control Unit

[Figure: single cycle datapath with Control Unit. Instr[31-26] drives the Control Unit, which generates RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, and ALUOp; Instr[25-21] and Instr[20-16] address the Register File read ports; Instr[15-11] is the alternative write address; Instr[15-0] feeds Sign Extend; Instr[5-0] feeds the ALU control]
ALU Control
ALU's operation based on instruction type and function code
ALU control input | Function
0000 | and
0001 | or
0010 | xor
0011 | nor
0110 | add
1110 | subtract
1111 | set on less than
Notice that we are using different encodings than in the book
ALU Control, Cont
Controlling the ALU uses multiple decoding levels
- the main control unit generates the ALUOp bits
- the ALU control unit generates the ALUcontrol bits
Instr op | funct | ALUOp | action | ALUcontrol
lw | xxxxxx | 00 | add | 0110
sw | xxxxxx | 00 | add | 0110
beq | xxxxxx | 01 | subtract | 1110
add | 100000 | 10 | add | 0110
subt | 100010 | 10 | subtract | 1110
and | 100100 | 10 | and | 0000
or | 100101 | 10 | or | 0001
xor | 100110 | 10 | xor | 0010
nor | 100111 | 10 | nor | 0011
slt | 101010 | 10 | slt | 1111
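The second decoding level can be modeled in a few lines of Python. This is a behavioral sketch using the encodings from the table above (which differ from the book's); the names `FUNCT_TO_ALU_CONTROL` and `alu_control` are hypothetical.

```python
# funct field -> ALUcontrol bits, used only when ALUOp = 10 (R-type).
FUNCT_TO_ALU_CONTROL = {
    0b100000: 0b0110,  # add
    0b100010: 0b1110,  # subtract
    0b100100: 0b0000,  # and
    0b100101: 0b0001,  # or
    0b100110: 0b0010,  # xor
    0b100111: 0b0011,  # nor
    0b101010: 0b1111,  # set on less than
}


def alu_control(alu_op: int, funct: int) -> int:
    """Return the 4-bit ALUcontrol value for the given ALUOp and funct field."""
    if alu_op == 0b00:   # lw, sw: add to form the memory address
        return 0b0110
    if alu_op == 0b01:   # beq: subtract to compare for equality
        return 0b1110
    if alu_op == 0b10:   # R-type: the funct field selects the operation
        return FUNCT_TO_ALU_CONTROL[funct & 0x3F]
    raise ValueError(f"unsupported ALUOp: {alu_op:02b}")


print(f"{alu_control(0b10, 0b101010):04b}")  # slt
```

Keeping the funct decoding out of the main control unit keeps the main control small: it only has to emit two ALUOp bits.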
ALU Control Truth Table

F5 F4 F3 F2 F1 F0 | ALUOp1 ALUOp0 | ALUcontrol3 ALUcontrol2 ALUcontrol1 ALUcontrol0
X X X X X X | 0 0 | 0 1 1 0
X X X X X X | 0 1 | 1 1 1 0
X X 0 0 0 0 | 1 0 | 0 1 1 0
X X 0 0 1 0 | 1 0 | 1 1 1 0
X X 0 1 0 0 | 1 0 | 0 0 0 0
X X 0 1 0 1 | 1 0 | 0 0 0 1
X X 0 1 1 0 | 1 0 | 0 0 1 0
X X 0 1 1 1 | 1 0 | 0 0 1 1
X X 1 0 1 0 | 1 0 | 1 1 1 1
Output column annotations in the original figure: Add/subt, Mux control
Four 6-input truth tables
ALU Control Logic
From the truth table we can design the ALU Control logic
[Figure: combinational logic with inputs Instr[3], Instr[2], Instr[1], Instr[0], ALUOp1, ALUOp0 and outputs ALUcontrol3, ALUcontrol2, ALUcontrol1, ALUcontrol0]
R-type Instruction Data/Control Flow

[Figure: single cycle datapath with the R-type instruction data and control flow highlighted]
Store Word Instruction Data/Control Flow

[Figure: single cycle datapath with the store word instruction data and control flow highlighted]
Load Word Instruction Data/Control Flow

[Figure: single cycle datapath with the load word instruction data and control flow highlighted]
Branch Instruction Data/Control Flow

[Figure: single cycle datapath with the branch instruction data and control flow highlighted]
Main Control Unit

Instr (opcode) | RegDst | ALUSrc | MemReg | RegWr | MemRd | MemWr | Branch | ALUOp
R-type (000000) | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 10
lw (100011) | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 00
sw (101011) | X | 1 | X | 0 | 0 | 1 | 0 | 00
beq (000100) | X | 0 | X | 0 | 0 | 0 | 1 | 01
Setting of the MemRd signal (for R-type, sw, beq) depends on the memory design (it could have to be 0 or it could be an X (don't care))
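The main control truth table can be read as a lookup keyed by the opcode field. The Python sketch below encodes the table that way; the `ControlSignals` type and `decode` function are hypothetical names, don't-care entries (X) are represented as `None`, and MemRd is set to 0 where the memory design leaves it open.

```python
from typing import NamedTuple, Optional


class ControlSignals(NamedTuple):
    reg_dst: Optional[int]      # None models an X (don't care)
    alu_src: int
    mem_to_reg: Optional[int]   # None models an X (don't care)
    reg_write: int
    mem_read: int
    mem_write: int
    branch: int
    alu_op: int


MAIN_CONTROL = {
    0b000000: ControlSignals(1, 0, 0, 1, 0, 0, 0, 0b10),        # R-type
    0b100011: ControlSignals(0, 1, 1, 1, 1, 0, 0, 0b00),        # lw
    0b101011: ControlSignals(None, 1, None, 0, 0, 1, 0, 0b00),  # sw
    0b000100: ControlSignals(None, 0, None, 0, 0, 0, 1, 0b01),  # beq
}


def decode(instruction: int) -> ControlSignals:
    """Look up the control signals from the op field (bits 31-26)."""
    opcode = (instruction >> 26) & 0x3F
    return MAIN_CONTROL[opcode]


print(decode(0x8C010064))  # lw $1, 100($0)
```

In hardware the same table becomes a small block of AND/OR logic on Instr[31-26]; the dictionary is only a convenient software stand-in.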
Control Unit Logic
From the truth table we can design the Main Control logic
[Figure: combinational logic decoding Instr[31] through Instr[26] into R-type, lw, sw, and beq, which drive the outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0]
Review: Handling Jump Operations
Jump operations have to
J-Type: op [31-26] | jump target address [25-0]
- replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits
[Figure: PC + 4 adder and Instruction Memory; the 26-bit target field passes through Shift left 2 to form the 28-bit jump address]
Adding the Jump Operation

[Figure: single cycle datapath with the jump operation added. Instr[25-0] passes through Shift left 2 to give 28 bits, which are joined with PC+4[31-28] to form the 32-bit jump address; a multiplexor controlled by the Jump signal selects it as the next PC]
Main Control Unit

Instr (opcode) | RegDst | ALUSrc | MemReg | RegWr | MemRd | MemWr | Branch | ALUOp | Jump
R-type (000000) | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 10 | 0
lw (100011) | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 00 | 0
sw (101011) | X | 1 | X | 0 | 0 | 1 | 0 | 00 | 0
beq (000100) | X | 0 | X | 0 | 0 | 0 | 1 | 01 | 0
j (000010) | X | X | X | 0 | 0 | 0 | X | XX | 1
Setting of the MemRd signal (for R-type, sw, beq) depends on the memory design
Single Cycle Implementation Cycle Time
Unfortunately, though simple, the single cycle approach is not used because it is very slow
The clock cycle must have the same length for every instruction
What is the longest (slowest) path (slowest instruction)?
Instruction Critical Paths

Calculate cycle time assuming negligible delays (for muxes, control unit, sign extend, PC access, shift left 2, wires, setup and hold times) except:
- Instruction and Data Memory (4 ns)
- ALU and adders (2 ns)
- Register File access (reads or writes) (1 ns)
Instr. | I Mem | Reg Rd | ALU Op | D Mem | Reg Wr | Total
R-type | 4 | 1 | 2 | | 1 | 8
load | 4 | 1 | 2 | 4 | 1 | 12
store | 4 | 1 | 2 | 4 | | 11
beq | 4 | 1 | 2 | | | 7
jump | 4 | | | | | 4
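The totals in the table can be reproduced by summing the delays of the units each instruction class uses. The Python sketch below does that with the delays stated on this slide; the dictionary names are hypothetical.

```python
# Unit delays in ns, as given on the slide.
DELAY_NS = {"I Mem": 4, "Reg Rd": 1, "ALU Op": 2, "D Mem": 4, "Reg Wr": 1}

# Functional units on each instruction's critical path.
UNITS_USED = {
    "R-type": ["I Mem", "Reg Rd", "ALU Op", "Reg Wr"],
    "load": ["I Mem", "Reg Rd", "ALU Op", "D Mem", "Reg Wr"],
    "store": ["I Mem", "Reg Rd", "ALU Op", "D Mem"],
    "beq": ["I Mem", "Reg Rd", "ALU Op"],
    "jump": ["I Mem"],
}

totals = {instr: sum(DELAY_NS[u] for u in units) for instr, units in UNITS_USED.items()}
cycle_time_ns = max(totals.values())  # the single cycle clock must fit the slowest instruction

print(totals)
print("cycle time (ns):", cycle_time_ns)
```

The load sets the cycle time, so under these assumptions every jump spends two thirds of its cycle idle.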
Single Cycle Disadvantages & Advantages
Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
- especially problematic for more complex instructions like floating point multiply
[Figure: lw fills Cycle 1; sw finishes early in Cycle 2 and the rest of that cycle is wasted]
May be wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
but
Is simple and easy to understand
4.5 Pipelining is Natural!

Laundry Example
Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 30 minutes
- Folder takes 30 minutes
- Stasher takes 30 minutes to put clothes into drawers
Sequential Laundry
[Figure: timeline from 6 PM to 2 AM; loads A, B, C, D in task order, run one after another, each taking four 30-minute steps]
Sequential laundry takes 8 hours for 4 loads
If they learned pipelining, how long would laundry take?
Pipelined Laundry: Start work ASAP

[Figure: timeline starting at 6 PM; loads A, B, C, D in task order, each starting 30 minutes after the previous one, seven 30-minute slots in total]
Pipelined laundry takes 3.5 hours for 4 loads!
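Both laundry figures follow from simple counting, sketched below in Python. The function names are hypothetical; the model assumes every stage takes the same time and that a new load can start as soon as the first stage is free.

```python
def sequential_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """Each load runs through every stage before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """The first load takes `stages` slots; each later load finishes one slot after it."""
    return (stages + loads - 1) * stage_minutes


print(sequential_minutes(4, 4, 30) / 60, "hours")
print(pipelined_minutes(4, 4, 30) / 60, "hours")
```

With many loads the ratio of the two approaches the number of stages, which is the ideal speedup of a pipeline.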

Pipelining Lessons
Pipelining doesn't help latency of a single task, it helps throughput of the entire workload
Multiple tasks operating simultaneously using different resources
Potential speedup = number of pipe stages
Pipeline rate limited by slowest pipeline stage
Unbalanced lengths of pipe stages reduce speedup
Time to fill the pipeline and time to drain it reduce speedup
Stall for dependences
The Five Stages of Load

Load: Ifetch (Cycle 1) | Reg/Dec (Cycle 2) | Exec (Cycle 3) | Mem (Cycle 4) | Wr (Cycle 5)
Ifetch: Instruction Fetch
- fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Read the data from the Data Memory
Wr: Write the data back to the register file
Pipelining
Improve performance by increasing instruction throughput

[Figure: program execution order versus time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each passing through Instruction fetch, Reg, ALU, Data access, Reg. Without pipelining, a new instruction starts every 800 ps; with pipelining, a new instruction starts every 200 ps]
Note: timing assumptions changed for this example
Ideal speedup is number of stages in the pipeline. Do we achieve this?
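One way to explore the question is to compute total execution time for n instructions under both schemes. The Python sketch below uses the 800 ps and 200 ps figures from this example and assumes a 5-stage pipeline with no stalls; the function names are hypothetical.

```python
def unpipelined_ps(n: int, instr_ps: int = 800) -> int:
    """Each instruction occupies the whole datapath for instr_ps."""
    return n * instr_ps


def pipelined_ps(n: int, stages: int = 5, stage_ps: int = 200) -> int:
    """Fill the pipeline once, then complete one instruction per stage time."""
    return (stages + n - 1) * stage_ps


for n in (3, 1_000_000):
    speedup = unpipelined_ps(n) / pipelined_ps(n)
    print(f"n = {n}: speedup = {speedup:.3f}")
```

Under these assumptions the speedup approaches 800/200 = 4 rather than 5, because the stages are not perfectly balanced: every stage is stretched to the 200 ps of the slowest one.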

Basic Idea
IF: Instruction fetch
ID: Instruction decode / register file read
EX: Execute / address calculation
MEM: Memory access
WB: Write back
[Figure: the single cycle datapath divided into the five stages]
What do we need to add to actually split the datapath into stages?
A Pipelined MIPS Processor
Start the next instruction before the current one has completed

- improves throughput: total amount of work done in a given time
- instruction latency (execution time, delay time, response time: the time from the start of an instruction to its completion) is not reduced
[Figure: lw, sw, and an R-type instruction overlapped in time, each passing through IFetch, Dec, Exec, Mem, WB one cycle behind the previous instruction]
- clock cycle (pipeline stage time) is limited by the slowest stage
- for some instructions, some stages are wasted cycles
Single Cycle, Multiple Cycle, vs. Pipeline

[Figure: timing of lw, sw, and an R-type instruction under three implementations. Single Cycle Implementation: lw takes Cycle 1 and sw takes Cycle 2, with part of the cycle wasted. Multiple Cycle Implementation: lw takes IFetch, Dec, Exec, Mem, WB, then sw takes IFetch, Dec, Exec, Mem, then the R-type starts, over Cycles 1-10. Pipeline Implementation: the three instructions overlap, each starting one cycle after the previous one]
pipeline clock same as multicycle clock
4.6 MIPS Pipeline Datapath Modifications
What do we need to add/modify in our MIPS datapath?
State registers between each pipeline stage to isolate them

[Figure: pipelined datapath with state registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB separating the stages IF: IFetch, ID: Dec, EX: Execute, MEM: MemAccess, WB: WriteBack; all state registers are driven by the system clock]
MIPS Pipeline Control Path Modifications
All control signals can be determined during Decode
and held in the state registers between pipeline stages
[Figure: pipelined datapath with the Control unit in the decode stage; control signals travel through the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
Single Data Path to be pipelined
Pipelined version of the single cycle datapath
Instructions being executed assuming pipelined execution
IF and ID
EX (lw instruction)
MEM and WB
IF and ID (SW)
EX (SW)
MEM and WB
Correct Data Path to handle lw correctly
Example
Example
Pipelining the MIPS ISA
What makes it easy
- all instructions are the same length (32 bits)
  - can fetch in the 1st stage and decode in the 2nd stage
- few instruction formats (three) with symmetry across formats
  - can begin reading register file in 2nd stage
- memory operations can occur only in loads and stores
  - can use the execute stage to calculate memory addresses
- each MIPS instruction writes at most one result (i.e., changes the machine state) and does so near the end of the pipeline (MEM and WB)
What makes it hard
- structural hazards: what if we had only one memory?
- control hazards: what about branches?
- data hazards: what if an instruction's input operands depend on the output of a previous instruction?
Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg]
Can help with answering questions like:
- How many cycles does it take to execute this code?
- What is the ALU doing during cycle 4?
- Is there a hazard, why does it occur, and how can it be fixed?
Why Pipeline? For Performance!

[Figure: Inst 0 through Inst 4 in instruction order against time (clock cycles), each passing through IM, Reg, ALU, DM, Reg and offset by one cycle; the first cycles are the time to fill the pipeline]
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
Can Pipelining Get Us Into Trouble?
Yes: Pipeline Hazards
- structural hazards: attempt to use the same resource by two different instructions at the same time
- data hazards: attempt to use data before it is ready
  - an instruction's source operand(s) are produced by a prior instruction still in the pipeline
- control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated
  - branch and jump instructions, exceptions
Can always resolve hazards by waiting
- pipeline control must detect the hazard and take action to resolve hazards
A Single Memory Would Be a Structural Hazard

[Figure: lw followed by Inst 1 through Inst 4 against time (clock cycles), using a single memory (Mem); in the same cycle lw is reading data from memory while a later instruction is reading its instruction from memory]
Can fix with separate instr and data memories
How About Register File Access?

[Figure: "add $1," followed by Inst 1, Inst 2, and "add $2,$1," against time (clock cycles); one clock edge controls register writing and another controls loading of the pipeline state registers]
Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half
Register Usage Can Cause Data Hazards
Dependencies backward in time cause hazards

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) for the sequence below]
add $1,
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5
Read before write data hazard
Loads Can Cause Data Hazards

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) for the sequence below]
lw $1,4($2)
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5
Load-use data hazard
4.7 MIPS Pipeline Data and Control Paths
[Figure: pipelined datapath with control. The Control unit in ID generates the signals carried through ID/EX, EX/MEM, and MEM/WB: RegDst, ALUOp, and ALUSrc are used in EX; Branch, MemRead, and MemWrite in MEM (Branch helps form PCSrc); MemtoReg and RegWrite in WB]
Control Settings
Instr | EX Stage: RegDst ALUOp1 ALUOp0 ALUSrc | MEM Stage: Brch MemRead MemWrite | WB Stage: RegWrite MemtoReg
R | 1 1 0 0 | 0 0 0 | 1 0
lw | 0 0 0 1 | 0 1 0 | 1 1
sw | X 0 0 1 | 0 0 1 | 0 X
beq | X 0 1 0 | 1 0 0 | 0 X
CSE431 L05 Basic MIPS Architecture.93
Irwin, PSU, 2005
One Way to Fix a Data Hazard

Can fix data hazard by waiting: stall
[Figure: pipeline diagram for the sequence below]
add $1,
stall
stall
sub $4,$1,$5
and $6,$1,$7
Another Way to Fix a Data Hazard

Fix data hazards by forwarding results as soon as they are available to where they are needed
[Figure: pipeline diagram for the sequence below, with the add result forwarded to the later instructions]
add $1,
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5
Data Forwarding (aka Bypassing)
Take the result from the earliest point that it exists in any of the pipeline state registers and forward it to the functional units (e.g., the ALU) that need it that cycle
For the ALU functional unit: the inputs can come from any pipeline register rather than just from ID/EX by
- adding multiplexors to the inputs of the ALU
- connecting the Rd write data in EX/MEM or MEM/WB to either (or both) of the EX stage's Rs and Rt ALU mux inputs
- adding the proper control hardware to control the new muxes
Other functional units may need similar forwarding logic (e.g., the DM)
With forwarding, can achieve a CPI of 1 even in the presence of data dependencies
Data Forwarding Control Conditions

1. EX/MEM hazard:
if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd != 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
        ForwardA = 10
if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd != 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
        ForwardB = 10
Forwards the result from the previous instr. to either input of the ALU
2. MEM/WB hazard:
if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd != 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
        ForwardA = 01
if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd != 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
        ForwardB = 01
Forwards the result from the second previous instr. to either input of the ALU
Forwarding Illustration
[Figure: pipeline diagram for the sequence below, showing EX/MEM hazard forwarding and MEM/WB hazard forwarding]
add $1,
sub $4,$1,$5
and $6,$7,$1
Yet Another Complication!
Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction: which should be forwarded?
[Figure: pipeline diagram for the sequence below]
add $1,$1,$2
add $1,$1,$3
add $1,$1,$4
Corrected Data Forwarding Control Conditions

2. MEM/WB hazard:
if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd != 0)
    and (EX/MEM.RegisterRd != ID/EX.RegisterRs)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
        ForwardA = 01
if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd != 0)
    and (EX/MEM.RegisterRd != ID/EX.RegisterRt)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
        ForwardB = 01
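The EX/MEM condition and the corrected MEM/WB condition can be combined into one selector per ALU input. The Python sketch below is a behavioral model of that logic, not the hardware; the function and parameter names are hypothetical, and the same function serves ForwardA (pass the Rs number) and ForwardB (pass the Rt number).

```python
def forward_select(ex_mem_reg_write: bool, ex_mem_rd: int,
                   mem_wb_reg_write: bool, mem_wb_rd: int,
                   id_ex_src: int) -> int:
    """Return the 2-bit forwarding mux select for one ALU input."""
    # EX/MEM hazard: the most recent result wins.
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == id_ex_src:
        return 0b10
    # MEM/WB hazard, only when EX/MEM does not hold a newer value for this register.
    if (mem_wb_reg_write and mem_wb_rd != 0
            and ex_mem_rd != id_ex_src
            and mem_wb_rd == id_ex_src):
        return 0b01
    return 0b00  # no forwarding: use the value read from the Register File


# Third instruction of: add $1,$1,$2 ; add $1,$1,$3 ; add $1,$1,$4
# Both earlier adds write $1; the EX/MEM copy must be chosen.
print(f"{forward_select(True, 1, True, 1, 1):02b}")
```

Register 0 is excluded because it always reads as zero, so a result nominally written to it must never be forwarded.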
Datapath with Forwarding Hardware

[Figure: pipelined datapath with a Forward Unit in the EX stage; its inputs are ID/EX.RegisterRs, ID/EX.RegisterRt, EX/MEM.RegisterRd, and MEM/WB.RegisterRd, and it controls the multiplexors on the ALU inputs]
Memory-to-Memory Copies
For loads immediately followed by stores (memory-to-memory copies), can avoid a stall by adding forwarding hardware from the MEM/WB register to the data memory input.
- would need to add a Forward Unit and a mux to the memory access stage
[Figure: pipeline diagram for the sequence below]
lw $1,4($2)
sw $1,4($3)
Forwarding with Load-use Data Hazards

[Figure: pipeline diagram for the sequence below; even with forwarding, one stall is needed between the lw and the instruction that uses its result]
lw $1,4($2)
stall
sub $4,$1,$5
and $6,$1,$7
or $8,$1,$9
xor $4,$1,$5
Adding the Hazard Hardware

[Figure: pipelined datapath with a Hazard Unit added in the ID stage; its inputs include ID/EX.MemRead and ID/EX.RegisterRt, and it can select 0 for the control signals entering ID/EX; the Forward Unit remains in the EX stage]
4.8 Control Hazards
When the flow of instruction addresses is not sequential (i.e., not PC = PC + 4)
- conditional branches (beq, bne)
- unconditional branches (j, jal, jr)
- exceptions
Possible solutions
- stall (impacts performance)
- move the branch decision point as early in the pipeline as possible, thereby reducing the number of stall cycles
- delay the decision (requires compiler support)
- predict and hope for the best!
Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards
Datapath Branch and Jump Hardware

[Figure: pipelined datapath with the branch and jump hardware: Shift left 2 units, the branch target adder, PC+4[31-28] joined with the shifted jump field, and the Jump and PCSrc selects for the next PC]
Jumps Incur One Stall

Jumps are not decoded until ID, so one flush is needed
- to flush, set IF.Flush to zero the instruction field of the IF/ID pipeline register (turning it into a noop)
[Figure: pipeline diagram for j, a flushed slot, and then the j target]
Fix jump hazard by waiting: flush
Fortunately, jumps are very infrequent: only 3% of the SPECint instruction mix
Supporting ID Stage Jumps

[Figure: pipelined datapath supporting jumps in the ID stage, with a 0 input available to zero the instruction in IF/ID]
Branches Cause Control Hazards

[Figure: pipeline diagram for beq followed by lw, Inst 3, and Inst 4]
One Way to Fix a Branch Control Hazard

[Figure: pipeline diagram for beq, three flushed slots, then the beq target and Inst 3]
Fix branch hazard by waiting (flush), but this affects CPI
Another Way to Fix a Branch Control Hazard
Move the branch decision hardware back to as early in the pipeline as possible, i.e., during the decode cycle
[Figure: pipeline diagram for beq, one flushed slot, then the beq target and Inst 3]
Fix branch hazard by waiting: flush
Yet Another Way to Fix a Control Hazard

Predict that branches are always not taken and take corrective action when wrong (i.e., taken)
[Figure: instruction sequence with addresses: 4 beq $1,$2,2; 8 sub $4,$1,$5 (flushed); 16 and $6,$1,$7; 20 or $8,$1,$9; the branch decision hardware has been moved to the decode cycle]
To flush, set IF.Flush to zero the instruction field of the IF/ID pipeline register (turning it into a noop)
Two Types of Stalls
Noop instruction (or bubble) inserted between two instructions in the pipeline (e.g., load-use hazards)
- keep the instructions earlier in the pipeline (later in the code) from progressing down the pipeline for a cycle (bounce them in place with write control signals)
- insert a noop instruction by zeroing control bits in the pipeline register at the appropriate stage
- let the instructions later in the pipeline (earlier in the code) progress normally down the pipeline
Flushes (or instruction squashing), where an instruction in the pipeline is replaced with a noop instruction (as done for instructions located sequentially after j and beq instructions)
- zero the control bits for the instruction to be flushed
Many Other Pipeline Structures Are Possible
What about the (slow) multiply operation?
- make the clock twice as slow, or
- let it take two cycles (since it doesn't use the DM stage)
[Figure: pipeline stages IM, Reg, ALU with MUL, DM, Reg]
What if the data memory access is twice as slow as the instruction memory?
- make the clock twice as slow, or
- let the data memory access take two cycles (and keep the same clock rate)
[Figure: pipeline stages IM, Reg, ALU, DM1, DM2, Reg]
Designing a Pipelined Processor
Go back and examine your datapath and control diagram
- associate resources with states
- ensure that flows do not conflict, or figure out how to resolve them
- assert control in the appropriate stage
Pipelining the Load Instruction

[Figure: 1st lw, 2nd lw, and 3rd lw across Cycles 1-7, each passing through Ifetch, Reg/Dec, Exec, Mem, Wr one cycle behind the previous one]
The five independent functional units in the pipeline datapath are:
- Instruction Memory for the Ifetch stage
- Register File's Read ports (bus A and bus B) for the Reg/Dec stage
- ALU for the Exec stage
- Data Memory for the Mem stage
- Register File's Write port (bus W) for the Wr stage
The Four Stages of R-type

R-type: Ifetch (Cycle 1) | Reg/Dec (Cycle 2) | Exec (Cycle 3) | Wr (Cycle 4)
Reg/Dec: Registers Fetch and Instruction Decode
Exec:
- ALU operates on the two register operands
- Update PC
Wr: Write the ALU output back to the register file
Pipelining the R-type and Load Instruction

[Figure: R-type, R-type, Load, R-type, R-type across Cycles 1-9; the Load (five stages) and the R-type that follows it (four stages) reach Wr in the same cycle. Oops! We have a problem!]
We have a pipeline conflict or structural hazard:
- two instructions try to write to the register file at the same time!
- only one write port
Important Observation
Each functional unit can only be used once per instruction
Each functional unit must be used at the same stage for all instructions:
Load uses the Register File's write port during its 5th stage
          1         2         3         4         5
Load      Ifetch    Reg/Dec   Exec      Mem       Wr
R-type uses the Register File's write port during its 4th stage
          1         2         3         4
R-type    Ifetch    Reg/Dec   Exec      Wr
There are 2 ways to solve this pipeline hazard.
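The write-port conflict can be checked mechanically. The sketch below assumes one instruction issued per cycle starting at cycle 1 and a single register-file write port; `write_port_conflicts` and the stage lists are illustrative names, not part of the course material.

```python
LOAD    = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]
RTYPE_4 = ["Ifetch", "Reg/Dec", "Exec", "Wr"]          # R-type writes in its 4th stage
RTYPE_5 = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]   # R-type padded with a no-op Mem stage

def write_port_conflicts(program):
    """Return {cycle: [instruction indices]} for cycles with more than one Wr.

    program is a list of stage lists; instruction k is fetched in cycle k + 1.
    """
    writers = {}
    for issue, stages in enumerate(program):
        wr_cycle = issue + 1 + stages.index("Wr")
        writers.setdefault(wr_cycle, []).append(issue)
    return {c: ins for c, ins in writers.items() if len(ins) > 1}

mixed   = [RTYPE_4, RTYPE_4, LOAD, RTYPE_4, RTYPE_4]
uniform = [RTYPE_5, RTYPE_5, LOAD, RTYPE_5, RTYPE_5]
print(write_port_conflicts(mixed))    # the Load and the R-type after it collide
print(write_port_conflicts(uniform))  # no collisions once every Wr is in stage 5
```

With the mixed stage counts, the Load (third instruction) and the R-type that follows it both reach Wr in cycle 7. When every instruction writes in its 5th stage, the Wr cycles are all distinct.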
127
Solution 1: Insert Bubble into the Pipeline

[Pipeline diagram, Cycles 1 to 9: a Load among R-type instructions, with a pipeline bubble inserted into a later R-type so that its register write does not coincide with another write]
Insert a bubble into the pipeline to prevent 2 writes in the same cycle

The control logic can be complex.
Lose an instruction fetch and issue opportunity.
No instruction is started in Cycle 6!

128
Solution 2: Delay R-type's Write by One Cycle
Delay R-type's register write by one cycle:

Now R-type instructions also use the Register File's write port at Stage 5
Mem stage is a NOOP stage: nothing is being done.
          1         2         3         4         5
R-type    Ifetch    Reg/Dec   Exec      Mem       Wr

          Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5   Cycle 6   Cycle 7   Cycle 8   Cycle 9
R-type    Ifetch    Reg/Dec   Exec      Mem       Wr
R-type              Ifetch    Reg/Dec   Exec      Mem       Wr
Load                          Ifetch    Reg/Dec   Exec      Mem       Wr
R-type                                  Ifetch    Reg/Dec   Exec      Mem       Wr
R-type                                            Ifetch    Reg/Dec   Exec      Mem       Wr
129
The Four Stages of Store

          1         2         3         4         5
Store     Ifetch    Reg/Dec   Exec      Mem       Wr (not used)

Reg/Dec: Registers Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Write the data into the Data Memory
130
The Three Stages of Beq

          1         2         3         4                5
Beq       Ifetch    Reg/Dec   Exec      Mem (not used)   Wr (not used)

Reg/Dec: Registers Fetch and Instruction Decode
Exec: Compare the two register operands, select the correct branch target address, and latch it into the PC
131
Pipelining Summary

All modern-day processors use pipelining
Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
Potential speedup: a really fast clock cycle, and the ability to complete one instruction every clock cycle (CPI = 1)
Pipeline rate is limited by the slowest pipeline stage

Unbalanced pipe stages make for inefficiencies
The time to fill the pipeline and the time to drain it can impact speedup for deep pipelines and short code runs
Must detect and resolve hazards
Stalling negatively affects CPI (makes CPI greater than the ideal of 1)
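The effect of fill time and stalls on speedup can be put into numbers. The sketch below rests on simplifying assumptions that are not from the slides: every stage takes one cycle of equal length, and the unpipelined machine needs `n_stages` cycles per instruction. The function names are illustrative.

```python
def pipelined_cycles(n_instr, n_stages, stall_cycles=0):
    """Cycles to run n_instr instructions: fill the pipe once, then one per cycle, plus stalls."""
    return n_stages + (n_instr - 1) + stall_cycles

def speedup(n_instr, n_stages, stall_cycles=0):
    """Speedup over an unpipelined machine that spends n_stages cycles on each instruction."""
    unpipelined = n_instr * n_stages
    return unpipelined / pipelined_cycles(n_instr, n_stages, stall_cycles)

def effective_cpi(n_instr, n_stages, stall_cycles=0):
    """Average cycles per instruction, including fill time and stalls."""
    return pipelined_cycles(n_instr, n_stages, stall_cycles) / n_instr

for n in (1, 10, 1000):
    print(n, round(speedup(n, 5), 2), round(effective_cpi(n, 5), 3))
```

For a single instruction there is no speedup at all; for long runs the speedup approaches the number of stages, and each stall cycle pushes the effective CPI further above 1.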
132
4.9 Exception: Communication of I/O Devices and Processor
How the processor directs the I/O devices

Special I/O instructions
- Must specify both the device and the command
Memory-mapped I/O
- Portions of the high-order memory address space are assigned to each I/O device
- Reads (lw) and writes (sw) to those memory addresses are interpreted as commands to the I/O devices
- Loads/stores to the I/O address space can be done only by the OS
How the I/O device communicates with the processor
Polling: the processor periodically checks the status of an I/O device to determine its need for service
- Processor is totally in control, but does all the work
- Can waste a lot of processor time due to speed differences
Interrupt-driven: the I/O device issues an interrupt to the processor to indicate that it needs attention
133
MIPS I/O Instructions
MIPS has 2 coprocessors: Coprocessor 0 handles exceptions, including input and output interrupts; Coprocessor 1 handles floating point
Coprocessors have their own register sets, so there are instructions to move values between these registers and the CPU's registers

Register    #     Use
BadVAddr    8     bad mem addr
Count       9     timer
Compare     11    timer compare
Status      12    intr mask & enable bits
Cause       13    excp type and pending intrs
EPC         14    addr of instr causing excp

mfc0 rd, rt    # move from coprocessor 0
    fields: 0x10 | 0 | rt | rd | 0 | 0
mtc0 rt, rd    # move to coprocessor 0
    fields: 0x10 | 4 | rt | rd | 0 | 0
134
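The field layout of the two move instructions can be turned into machine words with a small encoder. This is a sketch that assumes the usual 6/5/5/5/5/6-bit field widths (the slide lists only the field values), with rt naming the CPU register and rd the coprocessor 0 register; `encode_cop0_move` is a made-up helper name, and 26 is the conventional number of register $k0.

```python
def encode_cop0_move(to_coprocessor, rt, rd):
    """Encode a coprocessor 0 move as a 32-bit word.

    Fields, high to low: 0x10 | 0 (from) or 4 (to) | rt | rd | 0 | 0.
    Assumed widths: 6, 5, 5, 5, 5 and 6 bits.
    """
    if not (0 <= rt < 32 and 0 <= rd < 32):
        raise ValueError("register numbers must be in 0..31")
    direction = 4 if to_coprocessor else 0
    return (0x10 << 26) | (direction << 21) | (rt << 16) | (rd << 11)

K0 = 26       # CPU register $k0
STATUS = 12   # coprocessor 0 register numbers from the table
CAUSE = 13

print(hex(encode_cop0_move(False, K0, CAUSE)))   # read Cause into $k0
print(hex(encode_cop0_move(True, K0, STATUS)))   # write $k0 into Status
```

The two words differ only in the second field, which is what distinguishes "move from" (0) from "move to" (4).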
The Downsides of Polling
Input and output devices are very slow compared to the processor

These time lags are simulated in SPIM, which measures time in instructions executed, not in real clock time
After the transmitter starts to write a character, the transmitter's ready bit becomes 0. It doesn't become ready again until the processor has executed a (large) fixed number of instructions. (You don't want to single-step the simulator!)
Polling will execute the "loop until ready" code thousands of times. While the input or output is occurring, nothing else can be done: a waste of resources. There is a better way.
135
I/O Interrupts
An I/O interrupt is used to signal an I/O request for service

Can have different urgencies (so may need to be prioritized)
Need to identify the device generating the interrupt
An I/O interrupt is asynchronous with respect to instruction execution
- An I/O interrupt is not associated with any instruction and does not prevent any instruction from completing
- You can pick your own convenient point to take an interrupt
Advantage
User program progress is only halted during the actual transfer of I/O data to/from user memory space
Disadvantage: special hardware is needed to
Cause an interrupt (I/O device)
Detect an interrupt and save the proper information to resume after servicing the interrupt (processor)

136
Additions to MIPS ISA for I/O
Coprocessor 0 records the information the software needs to handle exceptions (including interrupts)

EPC (register 14): holds the address+4 of the instruction that was executing when the exception occurred
Status (register 12): exception mask and enable bits

Bits 15-8    Intr Mask
Bit 4        User mode
Bit 1        Excp level
Bit 0        Intr enable

- Intr Mask = 1 bit for each of 6 hw and 2 sw exception levels (1 enables exceptions at that level, 0 disables them)
- User mode = 0 if running in kernel mode when the exception occurred; 1 if running in user mode (fixed at 1 in SPIM)
- Excp level = set to 1 (disables exceptions) when an exception occurs; should be reset by the exception handler when done
- Intr enable = 1 if exceptions are enabled; 0 if disabled
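The Status fields can be pulled apart with shifts and masks. The sketch below assumes Intr Mask in bits 15-8, User mode in bit 4, Excp level in bit 1 and Intr enable in bit 0; the helper names are illustrative, and the rule in `interrupt_deliverable` is a simplified reading of the bullet points above.

```python
def decode_status(status):
    """Split a 32-bit Status value into the fields described on the slide."""
    return {
        "intr_mask":   (status >> 8) & 0xFF,  # one bit per exception level (6 hw + 2 sw)
        "user_mode":   (status >> 4) & 1,     # assumed bit position 4
        "excp_level":  (status >> 1) & 1,     # 1 while an exception is being handled
        "intr_enable": status & 1,            # global enable
    }

def interrupt_deliverable(status, level):
    """True if an interrupt at the given mask level (0-7) would be taken."""
    f = decode_status(status)
    level_enabled = (f["intr_mask"] >> level) & 1
    return bool(f["intr_enable"] and not f["excp_level"] and level_enabled)

print(decode_status(0xFF11))
print(interrupt_deliverable(0xFF11, 3), interrupt_deliverable(0xFF13, 3))
```

Setting the Excp level bit (0xFF13 in the example) blocks delivery even though the mask and the enable bit are unchanged.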
141
Additions to MIPS ISA, Cont
Cause (register 13): exception pending and type bits

Bit 31       Branch delay
Bits 15-8    Pending exception (PI); PI3 = recv intr, PI2 = trans intr
Bits 6-2     Exception code

- PI: bits set if an exception occurs but is not yet serviced
  so can handle more than one exception occurring at the same time, or record exception requests when exceptions are disabled
- Exception code: encodes the reason for the exception

Code    Name    Meaning
0       INT     external interrupt (I/O device request)
4       AdEL    address error trap (load or instr fetch)
5       AdES    address error trap (store)
6       IBE     bus error on instruction fetch trap
7       DBE     bus error on data load or store trap
8       Sys     syscall trap
9       Bp      breakpoint trap
10      RI      reserved (or undefined) instruction trap
12      Ov      arithmetic overflow trap
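An exception handler typically starts by decoding Cause. The sketch below assumes the bit positions indicated on the slide (Branch delay in bit 31, pending bits in 15-8, exception code in bits 6-2); `decode_cause` is an illustrative helper, not SPIM code.

```python
EXCEPTION_CODES = {
    0: "INT", 4: "AdEL", 5: "AdES", 6: "IBE", 7: "DBE",
    8: "Sys", 9: "Bp", 10: "RI", 12: "Ov",
}

def decode_cause(cause):
    """Split a 32-bit Cause value into branch-delay flag, pending bits and exception code."""
    code = (cause >> 2) & 0x1F           # bits 6-2
    pending = (cause >> 8) & 0xFF        # bits 15-8
    return {
        "branch_delay": bool((cause >> 31) & 1),
        "pending": [i for i in range(8) if (pending >> i) & 1],  # indices of set PI bits
        "code": code,
        "name": EXCEPTION_CODES.get(code, "unknown"),
    }

# Overflow in a branch delay slot while PI3 (receiver interrupt) is also pending.
example = (1 << 31) | (1 << 11) | (12 << 2)
print(decode_cause(example))
```

Because the pending bits are separate from the exception code, the handler can service the overflow and still see that a receiver interrupt is waiting.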
142
MIPS Exception Return Instruction
Exception return sets the Excp level bit in coprocessor 0's Status register to 0 (re-enabling exceptions) and returns to the instruction pointed to by coprocessor 0's EPC register
eret    # return from exception
    fields: 0x10 | 1 | 0 | 0 | 0 | 0x18
143
Exceptions in General
[Figure: the user program follows normal control flow (sequential, jumps, branches, calls, returns); an exception transfers control to the System Exception Handler, which then returns from the exception to the user program]
Exception = unprogrammed control transfer
system takes action to handle the exception
- must record the address of the offending or next-to-execute instruction and save (and restore) user state
returns control to user after handling the exception

144
Two Types of Exceptions
Interrupts

caused by external events (e.g., a request from an I/O device)
asynchronous to program execution
may be handled between instructions
simply suspend and resume the user program
Traps

caused by internal events
- exceptional conditions (e.g., arithmetic overflow, undefined instr.)
- errors (e.g., hardware malfunction, memory parity error)
- faults (e.g., non-resident page: page fault)
synchronous to program execution
condition must be remedied by the trap handler
instruction may be retried (or simulated) and the program continued, or the program may be aborted
145
Additions to MIPS ISA for Interrupts
Control signals to write EPC (EPCWrite), Cause and Status (Cause&StatusWrite)
Hardware to record the type of interrupt in Cause
Modify the finite state machine so that

the address of the interrupt handler (8000 0180 hex) can be loaded into the PC, so must increase the size of the PC mux
the address of the next instr is saved in EPC
146
Additions to MIPS ISA for Traps

Control signals to write EPC (EPCWrite & IntrOrExcp), Cause and Status (Cause&StatusWrite)
Hardware to record the type of trap in Cause
Further modify the finite state machine so that
for traps, the address of the current (offending) instruction is recorded in the EPC, so must undo the PC = PC + 4 done during fetch
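The difference between the interrupt and trap cases comes down to which value is written into EPC. A minimal sketch, assuming the PC has already been incremented by 4 during fetch as in the multicycle design; `take_exception` and its arguments are hypothetical names standing in for the EPCWrite and IntrOrExcp behaviour.

```python
HANDLER_ADDRESS = 0x80000180  # exception handler address given on the slide

def take_exception(pc_after_fetch, is_trap):
    """Return (epc, new_pc) when an exception is taken.

    pc_after_fetch is the PC value after the fetch step has done PC = PC + 4.
    Interrupt: EPC gets the address of the next instruction (the PC as it stands).
    Trap: EPC gets the offending instruction's address, so the +4 is undone.
    """
    epc = pc_after_fetch - 4 if is_trap else pc_after_fetch
    return epc, HANDLER_ADDRESS

print([hex(v) for v in take_exception(0x00400024, is_trap=True)])
print([hex(v) for v in take_exception(0x00400024, is_trap=False)])
```

In both cases the PC is loaded with the handler address; only the saved return point differs.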
147
How Control Detects Two Traps
Undefined instruction (RI): detected when no next state is defined in state 1 (decode) for the opcode value
Define the next state value for all undefined op values as new state 10
Arithmetic overflow (Ov): the overflow signal from the ALU is used in state 6 (if we don't want to complete RegWrite)
Need to modify the FSM in a similar fashion for the remaining traps
Challenge is to handle the interactions between instructions and exception-causing events so that the control logic remains small and fast
- Complex interactions make the control unit the most challenging aspect of hardware design, especially in pipelined processors
148
What Makes Pipelining Hard?

Interrupts cause great havoc!

Examples of interrupts:
Power failing, Arithmetic overflow, I/O device request, OS call, Page fault
There are 5 instructions executing in a 5-stage pipeline when an interrupt occurs:

How to stop the pipeline?
How to restart the pipeline?
Who caused the interrupt?
Interrupts (also known as: faults, exceptions, traps) often require

a surprise jump (to a vectored address)
linking the return address
saving of the PSW (including CCs)
a state change (e.g., to kernel mode)
149

What happens on an interrupt while in a delay slot?
Next instruction is not sequential
Solution #1: save multiple PCs
- Save current and next PC
- Special return sequence, more complex hardware
Solution #2: single PC plus Branch delay bit
- PC points to branch instruction

Stage    Problem that causes the interrupt
IF       Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID       Undefined or illegal opcode
EX       Arithmetic interrupt
MEM      Page fault on data fetch; misaligned memory access; memory-protection violation
150
Simultaneous exceptions in more than one pipeline stage, e.g.,
- Load with data page fault in MEM stage
- Add with instruction page fault in IF stage
- Add fault will happen BEFORE load fault
Solution #1
- Interrupt status vector per instruction
- Defer check until last stage, kill state update if exception
Solution #2
- Interrupt ASAP
- Restart everything that is incomplete
Another advantage for state update late in pipeline!
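The load/add example can be worked through in code to show why deferring the check restores program order. This is a sketch under assumed conventions: a 5-stage pipeline with one issue per cycle, stage offsets as in `STAGE_OFFSET`, and "WB" standing for the last stage; all names are illustrative.

```python
STAGE_OFFSET = {"IF": 0, "ID": 1, "EX": 2, "MEM": 3, "WB": 4}

def detection_cycle(issue_cycle, fault_stage):
    """Cycle in which an instruction issued at issue_cycle raises a fault in fault_stage."""
    return issue_cycle + STAGE_OFFSET[fault_stage]

def order_detected(faults):
    """Order in which the faults physically occur (earliest detection cycle first)."""
    ranked = sorted(faults, key=lambda f: detection_cycle(f[1], f[2]))
    return [name for name, _, _ in ranked]

def order_handled_deferred(faults):
    """Solution #1: each fault rides along in the instruction's status vector and is
    only acted on in the last stage, so faults are handled in program order."""
    ranked = sorted(faults, key=lambda f: f[1] + STAGE_OFFSET["WB"])
    return [name for name, _, _ in ranked]

# (name, issue cycle, stage where the fault is raised)
faults = [("load", 1, "MEM"), ("add", 2, "IF")]
print(order_detected(faults))          # the add's fault occurs first in time
print(order_handled_deferred(faults))  # the load's fault is handled first
```

The add faults in cycle 2 and the load in cycle 4, yet with the deferred check the load reaches the last stage first (cycle 5 versus cycle 6), so its fault is the one taken.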

151
Here's what happens on a data page fault.

                          1  2  3  4  5  6  7  8  9  10 11
i                         F  D  X  M  W
i+1                          F  D  X  M  W                   <- page fault
i+2                             F  D  X  M  W                <- squash
i+3                                F  D  X  M  W             <- squash
i+4                                   F  D  X  M  W          <- squash
i+5  trap ->                             F  D  X  M  W
i+6  trap handler ->                        F  D  X  M  W
152

Complex Addressing Modes and Instructions
Complex Instructions
Address modes: auto-increment causes a register change during instruction execution

Interrupts? Need to restore register state
Adds WAR and WAW hazards, since writes are no longer in the last stage
Memory-Memory Move Instructions

Must be able to handle multiple page faults
Long-lived instructions: partial state save on interrupt
Condition Codes
153
Datapath with Controls to Handle Exceptions
154
Exception Handling Example

overflow exception
155
Exception Handling Example

start of exception handling routine
156
