Saed R. Abed
[Computer Engineering Department, Hashemite University] [Adapted from Otmane Ait Mohamed Slides & Computer Organization and Design, Patterson & Hennessy, 2005, UCB]
Smaller is faster
- limited instruction set
- limited number of registers in register file
- limited number of addressing modes
- arithmetic operands from the register file (load-store machine)
- allow instructions to contain immediate operands
Our basic performance equation is then
CPU time = Instruction_count x CPI x clock_cycle
or
CPU time = (Instruction_count x CPI) / clock_rate
These equations separate the three key factors that affect performance
- Can measure the CPU execution time by running the program
- The clock rate is usually given in the documentation
- Can measure instruction count by using profilers/simulators without knowing all of the implementation details
- CPI varies by instruction type and ISA implementation, for which we must know the implementation details
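The performance equation can be checked with a few lines of Python. This is only a sketch: the function names and the sample workload (one billion instructions, CPI of 2.0, 1 GHz clock) are made up for illustration and do not come from the slides.

```python
def cpu_time(instruction_count, cpi, clock_cycle_s):
    """CPU time = Instruction_count x CPI x clock_cycle (seconds)."""
    return instruction_count * cpi * clock_cycle_s


def cpu_time_from_rate(instruction_count, cpi, clock_rate_hz):
    """Same equation written with the clock rate (1 / clock_cycle)."""
    return instruction_count * cpi / clock_rate_hz


if __name__ == "__main__":
    # Hypothetical workload, chosen only to keep the arithmetic simple.
    seconds = cpu_time_from_rate(1_000_000_000, 2.0, 1_000_000_000)
    print(f"CPU time: {seconds} s")
```

Halving any one of the three factors halves the CPU time, which is why the slides treat them as separate levers.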
Datapath design tended to just work. Control paths are where the system complexity lives. Bugs spawned from control path design errors reside in the microcode flow, the finite-state machines, and all the special exceptions that inevitably spring up in a machine design like thistles in a flower garden.
- memory-reference instructions: lw, sw
- arithmetic-logical instructions: add, sub, and, or, slt
- control flow instructions: beq, j
Generic implementation
- use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC)
- decode the instruction (and read registers)
- execute the instruction
All instructions (except j) use the ALU after reading the registers. How? memory-reference? arithmetic? control flow?
- elements that operate on data values (combinational)
- elements that contain state (sequential)
[Figure: Register File (Reg Addr, Read Data, Write Data ports) and ALU]
- Single cycle operation
- Split memory (Harvard) model: one memory for instructions and one for data
The clocking methodology defines when signals can be read and when they are written
An edge-triggered methodology
Typical execution
- read contents of state elements
- send values through combinational logic
- write results to one or more state elements
[Figure: State element 1 -> Combinational logic -> State element 2, driven by the clock]
Assumes state elements are written on every clock cycle; if not, need explicit write control signal
- write occurs only when both the write control is asserted and the clock edge occurs
- reading the instruction from the Instruction Memory
- updating the PC value to be the address of the next (sequential) instruction
[Figure: fetch hardware with the clocked PC and an Add unit with the constant 4]
- PC is updated every clock cycle, so it does not need an explicit write control signal, just a clock signal
- Reading from the Instruction Memory is a combinational activity, so it doesn't need an explicit read control signal
Decoding Instructions
- sending the fetched instruction's opcode and function field bits to the control unit
[Figure: Instruction feeding the Control Unit and the Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data, Read Data 1, Read Data 2)]
Note that both RegFile read ports are active for all instructions during the Decode cycle, using the rs and rt instruction field addresses
- Since we haven't decoded the instruction yet, we don't know what the instruction is!
- Just in case the instruction uses values from the RegFile, do work ahead by reading the two source operands
Also, all instructions (except j) use the ALU after reading the registers. Why? memory-reference? arithmetic? control flow?
R-type format: op (bits 31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0)
- perform operation (op and funct) on values in rs and rt
- store the result back into the Register File (into location rd)
[Figure: Register File and ALU with RegWrite and ALU control inputs; ALU outputs overflow and zero]
Note that the Register File is not written every cycle (e.g., sw), so we need an explicit write control signal for the Register File
Where does the 1 (or 0) come from to store into $t0 in the Register File at the end of the execute cycle?
- compute a memory address by adding the base register (in rs) to the 16-bit signed offset field in the instruction
  - base register was read from the Register File during decode
  - offset value in the low order 16 bits of the instruction must be sign extended to create a 32-bit signed value
- store value, read from the Register File during decode, must be written to the Data Memory
- load value, read from the Data Memory, must be stored in the Register File
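The address calculation for lw and sw can be sketched in Python as follows. The helper names `sign_extend16` and `effective_address` are hypothetical, and the result is wrapped to 32 bits on the assumption of a 32-bit address space.

```python
def sign_extend16(value):
    """Sign extend the low-order 16 bits of value to a Python int."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base, offset16):
    """Base register plus sign-extended 16-bit offset, wrapped to 32 bits."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF


if __name__ == "__main__":
    # lw $1, 4($2) with $2 = 0x1000 reads from 0x1004;
    # an offset field of 0xFFFC encodes -4.
    print(hex(effective_address(0x1000, 4)))
    print(hex(effective_address(0x1000, 0xFFFC)))
```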
[Figure: load/store datapath with Register File, Sign Extend (16 to 32 bits), and Data Memory; control signals RegWrite, MemWrite, MemRead]
- compare the operands read from the Register File during decode (rs and rt values) for equality (zero ALU output)
- compute the branch target address by adding the updated PC to the sign-extended 16-bit signed offset field in the instruction
  - base register is the updated PC
  - offset value in the low order 16 bits of the instruction must be sign extended to create a 32-bit signed value and then shifted left 2 bits to turn it into a word address
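As a rough illustration of the branch target calculation, the following Python sketch takes the address of the branch and its 16-bit offset field. The function name is hypothetical, and "updated PC" is taken to mean PC + 4, as in the fetch step.

```python
def branch_target(pc, offset16):
    """Branch target = (PC + 4) + (sign-extended offset << 2), wrapped to 32 bits."""
    offset = offset16 & 0xFFFF
    if offset & 0x8000:          # negative offset: sign extend
        offset -= 0x10000
    return (pc + 4 + (offset << 2)) & 0xFFFFFFFF


if __name__ == "__main__":
    # A beq at 0x00400000 with offset 2 skips two instructions past PC + 4.
    print(hex(branch_target(0x00400000, 2)))
```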
[Figure: branch datapath with PC, Register File, Sign Extend (16 to 32 bits), and ALU control]
replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits
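A small Python sketch of the jump address formation follows. The function name is hypothetical, and the upper four bits are taken from PC + 4, matching the PC+4[31-28] label used in the datapath figure later in the deck.

```python
def jump_target(pc, instruction):
    """Keep the upper 4 bits of PC + 4 and replace the lower 28 bits
    with the 26-bit target field shifted left by 2."""
    upper = (pc + 4) & 0xF0000000
    lower = (instruction & 0x03FFFFFF) << 2
    return upper | lower


if __name__ == "__main__":
    # 0x08100004 is a j instruction whose 26-bit target field is 0x0100004.
    print(hex(jump_target(0x00400000, 0x08100004)))
```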
[Figure: forming the 28-bit jump address]
Assemble the datapath segments and add control lines and multiplexors as needed
Single cycle design: fetch, decode and execute each instruction in one clock cycle
- no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders)
- multiplexors needed at the input of shared elements with control lines to do the selection
- write signals to control writing to the Register File and Data Memory
[Figure: assembled single-cycle datapath with Add 4, Register File, Sign Extend (16 to 32 bits), and control signals RegWrite, MemWrite, MemRead, MemtoReg]
Multiplexor Insertion
[Figure: the same datapath with multiplexors inserted]
Clock Distribution
[Figure: the same datapath with the System Clock distributed; one clock cycle marked]
- ALU might not produce the right answer right away
- Memory and RegFile reads are combinational (as are ALU, adders, muxes, shifter, sign extender)
- Use write signals along with the clock edge to determine when to write to the sequential elements (to the PC, to the Register File and to the Data Memory)
The clock cycle time is determined by the logic delay through the longest path
We are ignoring some details like register setup and hold times
- Selecting the operations to perform (ALU, Register File and Memory read/write)
- Controlling the flow of data (multiplexor inputs)
R-type: op (31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0)
I-type: op (31-26) | rs (25-21) | rt (20-16) | address offset (15-0)
J-type: op (31-26) | target address (25-0)
Observations
- addr of registers to be read are always specified by the rs field (bits 25-21) and rt field (bits 20-16); for lw and sw, rs is the base register
- addr. of register to be written is in one of two places: in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions
- offset for beq, lw, and sw always in bits 15-0
[Figure: single-cycle datapath with control signals PCSrc, MemRead, MemtoReg, MemWrite; ALU control driven by Instr[5-0]; ALU outputs ovf and zero]
ALU Control
- main control unit generates the ALUOp bits
- ALU control unit generates ALUcontrol bits
Instr   funct    ALUOp   action     ALUcontrol
lw      xxxxxx   00      add        0110
sw      xxxxxx   00      add        0110
beq     xxxxxx   01      subtract   1110
add     100000   10      add        0110
sub     100010   10      subtract   1110
and     100100   10      and        0000
or      100101   10      or         0001
xor     100110   10      xor        0010
nor     100111   10      nor        0011
slt     101010   10      slt        1111
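The ALUOp/funct-to-ALUcontrol mapping listed above can be written as a small decoder. This is a behavioral sketch in Python, not the gate-level logic; the names `alu_control` and `FUNCT_TO_ALUCONTROL` are made up for illustration, and the encodings are the ones in the list.

```python
# funct field -> ALUcontrol, used only when ALUOp = 10 (R-type)
FUNCT_TO_ALUCONTROL = {
    0b100000: 0b0110,  # add
    0b100010: 0b1110,  # subtract
    0b100100: 0b0000,  # and
    0b100101: 0b0001,  # or
    0b100110: 0b0010,  # xor
    0b100111: 0b0011,  # nor
    0b101010: 0b1111,  # slt
}


def alu_control(alu_op, funct):
    """Return the 4-bit ALUcontrol value for the given ALUOp and funct bits."""
    if alu_op == 0b00:       # ALUOp 00: always add, funct is a don't care
        return 0b0110
    if alu_op == 0b01:       # ALUOp 01: always subtract
        return 0b1110
    if alu_op == 0b10:       # ALUOp 10: operation selected by funct
        return FUNCT_TO_ALUCONTROL[funct & 0x3F]
    raise ValueError("ALUOp value not covered by the table")


if __name__ == "__main__":
    print(format(alu_control(0b10, 0b101010), "04b"))  # slt row
```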
[Truth table: ALUOp and funct bit inputs versus ALUcontrol outputs; output bit groups labeled Add/subt and Mux control]
From the truth table can design the ALU Control logic
Instr    Op       RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp
R-type   000000   1       0       0         1         0        0         0       10
lw       100011   0       1       1         1         1        0         0       00
sw       101011   X       1       X         0         0        1         0       00
beq      000100   X       0       X         0         0        0         1       01
Setting of the MemRd signal (for R-type, sw, beq) depends on the memory design (could have to be 0 or could be an X (don't care))
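The main control settings above amount to a lookup keyed by opcode. The Python sketch below assumes the columns follow the usual Patterson & Hennessy order (RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp); that ordering is an assumption, since the extracted slide lost its column headers. Don't-care entries are stored as None, and the names `MAIN_CONTROL` and `control_signals` are hypothetical.

```python
SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite",
           "MemRead", "MemWrite", "Branch", "ALUOp")

X = None  # don't care

# opcode -> control values, one row per instruction class
MAIN_CONTROL = {
    0b000000: (1, 0, 0, 1, 0, 0, 0, 0b10),  # R-type
    0b100011: (0, 1, 1, 1, 1, 0, 0, 0b00),  # lw
    0b101011: (X, 1, X, 0, 0, 1, 0, 0b00),  # sw
    0b000100: (X, 0, X, 0, 0, 0, 1, 0b01),  # beq
}


def control_signals(opcode):
    """Return a dict of control signal values for a 6-bit opcode."""
    return dict(zip(SIGNALS, MAIN_CONTROL[opcode]))


if __name__ == "__main__":
    print(control_signals(0b100011))  # settings for lw
```

Only lw asserts MemRead and only sw asserts MemWrite, which is what keeps the Data Memory untouched for the other instruction classes.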
From the truth table can design the Main Control logic
[Figure: Main Control logic with decoded R-type, lw, sw, and beq opcodes]
- replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits
J-type format: op (31-26) | jump target address (25-0)
[Figure: forming the 28-bit jump address]
[Figure: single-cycle datapath extended for jump; the 28-bit jump address is combined with PC+4[31-28] to form the 32-bit target]
Instr    Op       RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp  Jump
R-type   000000   1       0       0         1         0        0         0       10     0
lw       100011   0       1       1         1         1        0         0       00     0
sw       101011   X       1       X         0         0        1         0       00     0
beq      000100   X       0       X         0         0        0         1       01     0
j        000010   X       X       X         0         0        0         X       XX     1
Setting of the MemRd signal (for R-type, sw, beq) depends on the memory design
Unfortunately, though simple, the single cycle approach is not used because it is very slow
- Clock cycle must have the same length for every instruction
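Assumed delays: Instruction and Data Memory (4 ns), ALU and adders (2 ns), Register File access (reads or writes) (1 ns)
Instr.   I Mem   Reg Rd   ALU Op   D Mem   Reg Wr   Total
R-type   4       1        2        -       1        8
load     4       1        2        4       1        12
store    4       1        2        4       -        11
beq      4       1        2        -       -        7
jump     4       -        -        -       -        4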
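The totals 8, 12, 11, 7 and 4 ns can be reproduced by summing the delay of each unit an instruction class uses. In the Python sketch below, the mapping from instruction class to units (R-type, load, store, beq, jump) is an assumption inferred from the totals, and the 2 ns ALU delay is taken from the adder figure on the slide.

```python
DELAY_NS = {"I Mem": 4, "Reg Rd": 1, "ALU": 2, "D Mem": 4, "Reg Wr": 1}

# Assumed unit usage per instruction class (inferred, not stated on the slide).
UNITS_USED = {
    "R-type": ["I Mem", "Reg Rd", "ALU", "Reg Wr"],
    "load":   ["I Mem", "Reg Rd", "ALU", "D Mem", "Reg Wr"],
    "store":  ["I Mem", "Reg Rd", "ALU", "D Mem"],
    "beq":    ["I Mem", "Reg Rd", "ALU"],
    "jump":   ["I Mem"],
}


def instruction_delay(kind):
    """Total delay in ns along the path used by one instruction class."""
    return sum(DELAY_NS[unit] for unit in UNITS_USED[kind])


def single_cycle_time():
    """The single-cycle clock must fit the slowest instruction."""
    return max(instruction_delay(kind) for kind in UNITS_USED)


if __name__ == "__main__":
    for kind in UNITS_USED:
        print(kind, instruction_delay(kind), "ns")
    print("clock cycle:", single_cycle_time(), "ns")
```

Under these assumptions every instruction pays the 12 ns of a load, even a jump that needs only 4 ns.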
- Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
  - especially problematic for more complex instructions like floating point multiply
[Figure: clock waveform over Cycle 1 and Cycle 2; lw fills its cycle while sw leaves part of its cycle wasted]
- May be wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
but
- Is simple and easy to understand
Laundry Example
- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 30 minutes
- Folder takes 30 minutes
- Stasher takes 30 minutes to put clothes into drawers
Sequential Laundry
[Figure: timeline from 6 PM to 2 AM; loads A, B, C, D in task order, each taking four 30-minute steps back to back]
- Sequential laundry takes 8 hours for 4 loads
- If they learned pipelining, how long would laundry take?
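The question can be answered with simple arithmetic. The Python sketch below assumes four 30-minute steps per load and that a new load can start as soon as the washer is free; the function names are made up for illustration.

```python
def sequential_minutes(loads, stages=4, step_minutes=30):
    """Each load finishes all stages before the next load starts."""
    return loads * stages * step_minutes


def pipelined_minutes(loads, stages=4, step_minutes=30):
    """A new load starts every step; the last load needs `stages` steps."""
    if loads == 0:
        return 0
    return (stages + loads - 1) * step_minutes


if __name__ == "__main__":
    seq = sequential_minutes(4)
    pipe = pipelined_minutes(4)
    print(f"sequential: {seq / 60} hours, pipelined: {pipe / 60} hours")
```

With four loads the pipelined schedule needs seven 30-minute slots, or 3.5 hours, instead of 8 hours.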
[Figure: pipelined laundry timeline starting at 6 PM; loads A, B, C, D overlap across seven 30-minute slots]
Pipelining Lessons
- Pipelining doesn't help latency of single task, it helps throughput of entire workload
- Multiple tasks operating simultaneously using different resources
- Potential speedup = Number pipe stages
- Pipeline rate limited by slowest pipeline stage
- Unbalanced lengths of pipe stages reduces speedup
- Time to fill pipeline and time to drain it reduces speedup
- Stall for Dependences
Load: Ifetch | Reg/Dec | Exec | Mem | Wr
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Read the data from the Data Memory
- Wr: Write the data back to the register file
Pipelining
[Figure: instruction timing on an axis from 200 to 1400 ps (Instruction fetch, Reg, ALU, Data access, Reg), with 800 ps and 200 ps intervals marked]
Basic Idea
- IF: Instruction fetch
- ID: Instruction decode/register file read
- EX: Execute/address calculation
- MEM: Memory access
- WB: Write back
[Figure: single-cycle datapath divided into the five stages]
Start the next instruction before the current one has completed
- improves throughput: total amount of work done in a given time
- instruction latency (execution time, delay time, response time: time from the start of an instruction to its completion) is not reduced
[Figure: lw, sw, and R-type instructions overlapped across Cycle 1 to Cycle 8, each passing through IFetch, Dec, Exec, Mem, WB]
- clock cycle (pipeline stage time) is limited by the slowest stage
- for some instructions, some stages are wasted cycles
[Figure: pipelined datapath with pipeline registers (Dec/Exec, Exec/Mem, Mem/WB) driven by the System Clock]
[Figures: lw through the pipeline: IF and ID; EX; MEM and WB]
[Figures: sw through the pipeline: IF and ID; EX; MEM and WB]
[Figures: two worked examples]
- can fetch in the 1st stage and decode in the 2nd stage
- each MIPS instruction writes at most one result (i.e., changes the machine state) and does so near the end of the pipeline (MEM and WB)
- structural hazards: what if we had only one memory?
- control hazards: what about branches?
- data hazards: what if an instruction's input operands depend on the output of a previous instruction?
[Figure: pipeline stages IM, Reg, ALU, DM, Reg for a sequence of instructions in program order]
- How many cycles does it take to execute this code?
- What is the ALU doing during cycle 4?
- Is there a hazard, why does it occur, and how can it be fixed?
Once the pipeline is full, one instruction is completed every cycle so CPI = 1
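To see why CPI approaches 1, count cycles for an ideal pipeline with no stalls. The Python sketch below assumes the five stages used in this deck; the function names are hypothetical.

```python
def pipeline_cycles(n_instructions, n_stages=5):
    """Cycles to run n instructions with no stalls: fill time plus one per instruction."""
    if n_instructions <= 0:
        return 0
    return n_stages + n_instructions - 1


def effective_cpi(n_instructions, n_stages=5):
    """Average cycles per instruction, including the pipeline fill time."""
    return pipeline_cycles(n_instructions, n_stages) / n_instructions


if __name__ == "__main__":
    for n in (1, 5, 100, 10_000):
        print(n, pipeline_cycles(n), round(effective_cpi(n), 4))
```

The fill cost of n_stages - 1 cycles is paid once, so it only matters for very short instruction sequences.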
- structural hazards: attempt to use the same resource by two different instructions at the same time
- data hazards: attempt to use data before it is ready
  - An instruction's source operand(s) are produced by a prior instruction still in the pipeline
- control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated
- pipeline control must detect the hazard and take action to resolve hazards
[Figure: pipeline diagrams in instruction order; one with a single shared memory (Mem, Reg, ALU, Mem, Reg) and one with separate IM and DM]
Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half
[Figure: pipeline diagram illustrating register file reads and writes within the same cycle]
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for an instruction sequence ending in xor $4,$1,$5]
[Figure: load-use data hazard; lw $1,4($2) followed by dependent instructions, ending in xor $4,$1,$5]
[Figure: pipelined datapath with control signals RegDst, ALUSrc, ALUOp, ALU cntrl, MemRead, MemWrite, MemtoReg]
Control Settings
        EX Stage                  MEM Stage                 WB Stage
        ALUOp1  ALUOp0  ALUSrc    Brch  MemRead  MemWrite   RegWrite  MemtoReg
R       1       0       0         0     0        0          1         0
lw      0       0       1         0     1        0          1         1
sw      0       0       1         0     0        1          0         X
beq     0       1       0         1     0        0          0         X
[Figure: pipeline diagram of a data hazard; an add that writes $1 is followed by instructions that read $1, including sub $4,$1,$5 and xor $4,$1,$5]
- Take the result from the earliest point that it exists in any of the pipeline state registers and forward it to the functional units (e.g., the ALU) that need it that cycle
- For ALU functional unit: the inputs can come from any pipeline register rather than just from ID/EX by
  - adding multiplexors to the inputs of the ALU
  - connecting the Rd write data in EX/MEM or MEM/WB to either (or both) of the EX stage's Rs and Rt ALU mux inputs
  - adding the proper control hardware to control the new muxes
- Other functional units may need similar forwarding logic (e.g., the DM)
- With forwarding, can achieve a CPI of 1 even in the presence of data dependencies
1. EX/MEM hazard: forwards the result from the previous instr. to either input of the ALU
if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd != 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
  ForwardA = 10
if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd != 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
  ForwardB = 10
2. MEM/WB hazard: forwards the result from the second previous instr. to either input of the ALU
if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd != 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
  ForwardA = 01
if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd != 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
  ForwardB = 01
Forwarding Illustration
[Figure: an add writes $1; its result is forwarded to sub $4,$1,$5 and to and $6,$7,$1]
Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction. Which should be forwarded?
[Figure: back-to-back instructions that each write and read $1, ending in add $1,$1,$4]
MEM/WB hazard:
if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd != 0)
    and (EX/MEM.RegisterRd != ID/EX.RegisterRs)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
  ForwardA = 01
if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd != 0)
    and (EX/MEM.RegisterRd != ID/EX.RegisterRt)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
  ForwardB = 01
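The forwarding conditions can be collected into one function. The Python sketch below transcribes the EX/MEM condition and the corrected MEM/WB condition literally; the function and parameter names are hypothetical stand-ins for the pipeline register fields, and a return value of 00 means "no forwarding, use the ID/EX value".

```python
def forwarding_unit(ex_mem_regwrite, ex_mem_rd,
                    mem_wb_regwrite, mem_wb_rd,
                    id_ex_rs, id_ex_rt):
    """Return (ForwardA, ForwardB) as 2-bit values: 10, 01, or 00."""

    def select(src):
        # EX/MEM hazard: result of the previous instruction.
        if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src:
            return 0b10
        # MEM/WB hazard, corrected so the more recent EX/MEM result wins.
        if (mem_wb_regwrite and mem_wb_rd != 0
                and ex_mem_rd != src and mem_wb_rd == src):
            return 0b01
        return 0b00

    return select(id_ex_rs), select(id_ex_rt)


if __name__ == "__main__":
    # Third add of three back-to-back writers of $1, reading $1 and $4:
    # both EX/MEM and MEM/WB hold a $1 result, and EX/MEM is chosen.
    print(forwarding_unit(True, 1, True, 1, id_ex_rs=1, id_ex_rt=4))
```

The check against register 0 matters because $0 is hardwired to zero, so a "result" destined for it must never be forwarded.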
[Figure: pipelined datapath (Instruction Memory, Register File, Sign Extend, ALU with ALU cntrl, Data Memory)]
Memory-to-Memory Copies
- For loads immediately followed by stores (memory-to-memory copies), can avoid a stall by adding forwarding hardware from the MEM/WB register to the data memory input
  - Would need to add a Forward Unit and a mux to the memory access stage
[Figure: lw $1,4($2) followed by sw $1,4($3) in the pipeline]
[Figure: load-use hazard; lw $1,4($2) is followed by a stall before the dependent instructions (sequence ends in xor $4,$1,$5)]
[Figure: pipelined datapath with a Forward Unit; ID/EX.RegisterRt is labeled]
- Conditional branches (beq, bne)
- Unconditional branches (j, jal, jr)
- Exceptions
Possible solutions
- Stall (impacts performance)
- Move branch decision point as early in the pipeline as possible, thereby reducing the number of stall cycles
- Delay decision (requires compiler support)
- Predict and hope for the best!
Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards
[Figure: pipelined datapath with Forward Unit]
- To flush, set IF.Flush to zero the instruction field of the IF/ID pipeline register (turning it into a noop)
[Figure: j followed by a flushed instruction slot, then the instruction at the jump target]
- Fortunately, jumps are very infrequent: only 3% of the SPECint instruction mix
Move branch decision hardware back to as early in the pipeline as possible, i.e., during the decode cycle
[Figure: pipeline diagram in instruction order with the branch decision made during decode]
- predict that branches are always not taken, and take corrective action when wrong (i.e., taken)
[Figure: beq $1,$2,2 at address 4; the instruction at address 8 (sub $4,$1,$5) is flushed]
- To flush, set IF.Flush to zero the instruction field of the IF/ID pipeline register (turning it into a noop)
- Noop instruction (or bubble) inserted between two instructions in the pipeline (e.g., load-use hazards)
  - Keep the instructions earlier in the pipeline (later in the code) from progressing down the pipeline for a cycle (bounce them in place with write control signals)
  - Insert noop instruction by zeroing control bits in the pipeline register at the appropriate stage
  - Let the instructions later in the pipeline (earlier in the code) progress normally down the pipeline
- Flushes (or instruction squashing) where an instruction in the pipeline is replaced with a noop instruction (as done for instructions located sequentially after j and beq instructions)
- Make the clock twice as slow or let it take two cycles (since it doesn't use the DM stage)
[Figure: MUL instruction in the pipeline: IM, Reg, ALU (MUL), DM, Reg]
- Go back and examine your datapath and control diagram
- associate resources with states
- ensure that flows do not conflict, or figure out how to resolve
- assert control in appropriate stage
- Instruction Memory for the Ifetch stage
- Register File's Read ports (bus A and bus B) for the Reg/Dec stage
- ALU for the Exec stage
- Data Memory for the Mem stage
- Register File's Write port (bus W) for the Wr stage
R-type: Ifetch | Reg/Dec | Exec | Wr
[Figure: overlapped instructions in the pipeline, including 4-stage R-type instructions]
- Two instructions try to write to the register file at the same time!
- Only one write port
Important Observation
- Each functional unit can only be used once per instruction
- Each functional unit must be used at the same stage for all instructions:
  - Load uses Register File's Write Port during its 5th stage
    Load: 1 Ifetch | 2 Reg/Dec | 3 Exec | 4 Mem | 5 Wr
  - R-type uses Register File's Write Port during its 4th stage
    R-type: 1 Ifetch | 2 Reg/Dec | 3 Exec | 4 Wr
- Insert a bubble into the pipeline to prevent 2 writes at the same cycle
  - The control logic can be complex.
  - Lose instruction fetch and issue opportunity.
- Now R-type instructions also use Reg File's write port at Stage 5
  - Mem stage is a NOOP stage: nothing is being done.
    R-type: 1 Ifetch | 2 Reg/Dec | 3 Exec | 4 Mem | 5 Wr
Store: Ifetch | Reg/Dec | Exec | Mem | Wr
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Write the data into the Data Memory
Beq: Ifetch | Reg/Dec | Exec | Mem | Wr
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: compares the two register operands, selects the correct branch target address, and latches it into the PC
Pipelining Summary
- All modern day processors use pipelining
- Pipelining doesn't help latency of single task, it helps throughput of entire workload
- Potential speedup: a really fast clock cycle and able to complete one instruction every clock cycle (CPI = 1)
- Pipeline rate limited by slowest pipeline stage
  - Unbalanced pipe stages make for inefficiencies
  - The time to fill pipeline and time to drain it can impact speedup for deep pipelines and short code runs
- Stalling negatively affects CPI (makes CPI greater than the ideal of 1)
- MIPS has 2 coprocessors: Coprocessor 0 handles exceptions, including input and output interrupts; Coprocessor 1 handles floating point
- Coprocessors have their own register sets, so there are instructions to move values between these registers and the CPU's registers
Register   #    Use
BadVAddr   8    bad mem addr
Count      9    timer
Compare    11   timer compare
Status     12   intr mask & enable bits
Cause      13   excp type and pending intrs
EPC        14   addr of instr causing excp
mfc0 rd, rt    #move from coprocessor 0    fields: 0x10 | 0 | rt | rd | 0 | 0
mtc0 rt, rd    #move to coprocessor 0      fields: 0x10 | 4 | rt | rd | 0 | 0
- Input and output devices are very slow compared to the processor
- These time lags are simulated in SPIM, which measures time in instructions executed, not in real clock time
- After the transmitter starts to write a character, the transmitter's ready bit becomes 0. It doesn't become ready again until the processor has executed a (large) fixed number of instructions. (You don't want to single step the simulator!)
- Polling will execute the loop-until-ready code thousands of times. While the input or output is occurring, nothing else can be done: a waste of resources. There is a better way
I/O Interrupts
- Can have different urgencies (so may need to be prioritized)
- Need to identify the device generating the interrupt
- An I/O interrupt is not associated with any instruction and does not prevent any instruction from completion
  - You can pick your own convenient point to take an interrupt
Advantage
- User program progress is only halted during the actual transfer of I/O data to/from user memory space
- Cause an interrupt (I/O device)
- Detect an interrupt and save the proper information to resume after servicing the interrupt (processor)
Coprocessor 0 records the information the software needs to handle exceptions (including interrupts)
- EPC (register 14): holds the address+4 of the instruction that was executing when the exception occurred
- Status (register 12): exception mask and enable bits
  [Figure: Status register layout: Intr Mask in bits 15-8; User mode, Excp level (bit 1), and Intr enable (bit 0) in the low-order bits]
  - Intr Mask = 1 bit for each of 6 hw and 2 sw exception levels (1 enables exceptions at that level, 0 disables them)
  - User mode = 0 if running in kernel mode when exception occurred; 1 if running in user mode (fixed at 1 in SPIM)
  - Excp level = set to 1 (disable exceptions) when an exception occurs; should be reset by exception handler when done
  - Intr enable = 1 if exceptions are enabled; 0 if disabled
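The Status fields can be pulled apart with shifts and masks. The Python sketch below puts Intr Mask in bits 15-8, as on the slide, and assumes User mode at bit 4, Excp level at bit 1 and Intr enable at bit 0; the last three positions follow the layout SPIM documents and are assumptions here because the extracted slide only partly shows them. The function name is hypothetical.

```python
def decode_status(status):
    """Split a coprocessor 0 Status value into its fields (assumed bit layout)."""
    return {
        "intr_mask":   (status >> 8) & 0xFF,  # bits 15-8: one bit per level
        "user_mode":   (status >> 4) & 0x1,   # assumed bit 4
        "excp_level":  (status >> 1) & 0x1,   # assumed bit 1
        "intr_enable": status & 0x1,          # assumed bit 0
    }


def interrupts_allowed(status, level):
    """True if interrupts are enabled, no exception is in progress,
    and the mask bit for this level (0-7) is set."""
    fields = decode_status(status)
    return bool(fields["intr_enable"]
                and not fields["excp_level"]
                and (fields["intr_mask"] >> level) & 0x1)


if __name__ == "__main__":
    print(decode_status(0xFF11))
```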
[Figure: Cause register layout, including the Branch delay bit]
- Exception return sets the Excp level bit in coprocessor 0's Status register to 0 (reenabling exceptions) and returns to the instruction pointed to by coprocessor 0's EPC register
  eret (opcode field 0x10)
Exceptions in General
[Figure: normal control flow of the user program (sequential, jumps, branches, calls, returns) is interrupted by an Exception that transfers to the System Exception Handler]
- must record the address of the offending or next to execute instruction and save (and restore) user state
Interrupts
- caused by external events (i.e., request from I/O device)
- asynchronous to program execution
- may be handled between instructions
- simply suspend and resume user program
Traps
- caused by internal events
  - exceptional conditions (e.g., arithmetic overflow, undefined instr.)
  - errors (e.g., hardware malfunction, memory parity error)
  - faults (e.g., non-resident page: page fault)
- synchronous to program execution
- condition must be remedied by the trap handler
- instruction may be retried (or simulated) and program continued or program may be aborted
- Control signals to write EPC (EPCWrite), Cause and Status (Cause&StatusWrite)
- Hardware to record the type of interrupt in Cause
- Modify the finite state machine so that
  - the address of interrupt handler (8000 0180 hex) can be loaded into the PC, so must increase the size of PC mux, and save the address of the next instr in EPC
  - for traps, record the address of the current (offending) instruction in the EPC, so must undo the PC = PC + 4 done during fetch
- Undefined instruction (RI): detected when no next state is defined in state 1 (decode) for the opcode value
  - Define the next state value for all undefined op values as new state 10
- Arithmetic overflow (Ov): the overflow signal from the ALU is used in state 6 (if we don't want to complete RegWrite)
- Need to modify the FSM in a similar fashion for remaining traps
- Challenge is to handle the interactions between instructions and exception-causing events so that the control logic remains small and fast
  - Complex interactions make the control unit the most challenging aspect of hardware design, especially in pipelined processors
- Power failing, Arithmetic overflow, I/O device request, OS call, Page fault
- surprise jump (to vectored address)
- linking return address
- saving of PSW (including CCs)
- state change (e.g., to kernel mode)
Stage: IF, ID, EX, MEM
- Simultaneous exceptions in more than one pipeline stage, e.g.,
  - Load with data page fault in MEM stage
  - Add with instruction page fault in IF stage
  - Add fault will happen BEFORE load fault
- Solution #1
  - Interrupt status vector per instruction
  - Defer check until last stage, kill state update if exception
- Solution #2
  - Interrupt ASAP
  - Restart everything that is incomplete
Here's what happens on a data page fault.
                      1  2  3  4  5  6  7  8  9  10 11
i                     F  D  X  M  W
i+1                      F  D  X  M  W   <- page fault
i+2                         F  D  X  M  W   <- squash
i+3                            F  D  X  M  W   <- squash
i+4                               F  D  X  M  W   <- squash
i+5 (trap ->)                        F  D  X  M  W
i+6 (trap handler ->)                   F  D  X  M  W
Complex Instructions
- Address modes: Auto increment causes register change during instruction execution
  - Interrupts? Need to restore register state
  - Adds WAR and WAW hazards since writes are no longer the last stage.
- Must be able to handle multiple page faults
- Long-lived instructions: partial state save on interrupt
- Condition Codes