The Processor
(datapath and pipelining)
[Figure: abstract view of the datapath with PC, Instruction Memory, Register File, ALU and Data Memory]
Simplicity favors regularity: fixed-size instructions and 32-bit data.
Good design demands good compromises: only three instruction formats.
Smaller is faster: a limited instruction set, a limited number of registers in the register file, and a limited number of addressing modes.
Make the common case fast: arithmetic operands come from the register file (load-store machine), and instructions may contain immediate operands.
Generic implementation (the first two stages are the same for all instructions):
use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC);
decode the instruction (and read registers);
execute the instruction.
All instructions (except j) use the ALU after reading the registers.
Single cycle operation (multi-cycle presented later).
Split memory (Harvard) model: one memory for instructions (instruction cache) and one for data.
The figure shows how the PC is incremented or changed when a branch is taken; it does not show multiplexors or control lines.
Control Unit
Read Addr 1 Register Read Read Addr 2 Data 1 File Write Addr Read Write Data Data 2
overflow zero
Write Data
Clocking Methodologies
A clocking methodology defines when signals can be read and when they can be written. Do not read a signal while it is being written: the result is unpredictable.
[Figure: clock waveform, labelled with the falling (negative) edge and the cycle time]
We adopt an edge-triggered clocking methodology: state is updated on a clock edge. Values stored in the state elements are updated only on a clock edge. The inputs to a combinational element are the values stored by the state elements in a previous clock cycle, and the outputs of combinational elements can be used in the following clock cycle.
All signals must propagate from State element 1 through combinatorial block and to State element 2 in a single clock cycle. State element is read in half cycle and written second half. The time needed for the logic gates to settle determines the length of the clock cycle.
State element 1 Combinational logic State element 2
clock
Assumes state elements are written on every clock cycle; If not, need explicit write control signal - write occurs only when both the write control is asserted and clock edge occurs
The datapath assures that data travels between various memory units to registers and ALUs; The control unit regulates this transfer and determines what actions are to be taken on the data. It does so using control lines that are connected to various hardware units.
The edge triggered clock methodology allows a given state element to be both read and be written to in a single clock cycle. Either we have a long clock cycle for one instruction (length is determined by the slowest instruction), or multiple cycles per instruction.
What are the building blocks needed to implement a subset of the MIPS instructions (lw, sw, arithmetic and logic instructions - like add, sub, and, or, slt)? Any instruction needs to be fetched and the PC incremented by 4
Special ALU that only adds
Add 4
Instruction Memory
PC
Read Address
Instruction Memory is read every cycle, so it doesn't need an explicit read control signal. This will be extended later for j and beq.
Instruction
Decoding an instruction involves sending the fetched instruction's opcode and function field bits to the control unit
Control Unit
Decoding Instructions
5 5
Instruction
Read Addr 1 32 Read Register Read Addr 2 Data 1 File 32 Write Addr Read
Write Data Data 2
reading two values from the Register File Register File addresses are contained in the instruction
perform the indicated (by op and funct) operation on values in rs and rt store the result back into the Register File (into location rd)
ALU control
Read Addr 1 Register Read Read Addr 2 Data 1 File Write Addr Read Write Data Data 2
32
4
overflow zero
Instruction
ALU
32
Instruction
Read Addr 1 Register Read Read Addr 2 Data 1 File Write Addr Read Write Data Data 2
32
4
Address
ALU
overflow zero
32
32
MemRead
To execute lw or sw operations we need a Data Memory unit with two control signals for write into (MemWrite) and read from (MemRead)
Load and store operations compute a memory address by adding the base register (in rs) to the 16-bit signed offset field in the instruction. The 32 bits in the base register were read from the Register File during decode. The offset value in the low-order 16 bits of the instruction must be sign-extended to create a 32-bit signed value.
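The address calculation for lw and sw can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function names `sign_extend_16` and `effective_address` are hypothetical and not part of the slides.

```python
def sign_extend_16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base: int, offset_field: int) -> int:
    """Base register plus sign-extended 16-bit offset, wrapped to 32 bits."""
    return (base + sign_extend_16(offset_field)) & 0xFFFFFFFF


# Positive offset: 0x1000 + 20
print(hex(effective_address(0x1000, 0x0014)))  # 0x1014
# Negative offset: 0xFFFC is -4 as a 16-bit signed value
print(hex(effective_address(0x1000, 0xFFFC)))  # 0xffc
```

Without the sign extension, the second example would add 65532 instead of subtracting 4, which is why the datapath needs the Sign Extend unit in front of the ALU.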
MemWrite
Read Addr 1 Instruction Read Register Read Addr 2 Data 1 File Write Addr Read Write Data Data 2
16-bit offset
Sign Extend
MemRead
16
32
sw: the value read from the Register File during decode must be written to the Data Memory.
lw: the value read from the Data Memory must be stored in the Register File.
I-Type:
op
rs
rt
Branch operations have to:
compare the operands read from the Register File during decode (the rs and rt values) for equality (the ALU zero output is asserted when they are equal);
compute the branch target address by adding the updated PC (PC+4) to the sign-extended 16-bit offset field of the instruction. The offset in the low-order 16 bits of the instruction is sign-extended to a 32-bit signed value and then shifted left 2 bits, which converts the word offset into a byte offset.
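A small Python sketch of the beq behaviour described above, under the assumption that the PC and registers are 32-bit values. The helper names (`branch_target`, `next_pc`) are made up for illustration.

```python
def sign_extend_16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, offset_field: int) -> int:
    """(PC + 4) + (sign-extended offset << 2), wrapped to 32 bits."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    byte_offset = sign_extend_16(offset_field) << 2
    return (pc_plus_4 + byte_offset) & 0xFFFFFFFF


def next_pc(pc: int, offset_field: int, rs_value: int, rt_value: int) -> int:
    """beq: the ALU subtracts; the branch is taken when the zero line is high."""
    zero = ((rs_value - rt_value) & 0xFFFFFFFF) == 0
    if zero:
        return branch_target(pc, offset_field)
    return (pc + 4) & 0xFFFFFFFF


print(hex(next_pc(0x00400010, 3, 5, 5)))  # taken: 0x400020
print(hex(next_pc(0x00400010, 3, 5, 6)))  # not taken: 0x400014
```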
Shift left 2
Add
RegWrite
PC
ALU control (perform a subtraction) overflow zero (to branch control logic -take branch if zero line is high)
ALU
Instruction
Read Addr 1 Register Read Read Addr 2 Data 1 File Write Addr Read
Write Data Data 2
16-bit offset
16
Sign Extend
32
Jump operations have to replace the lower 28 bits of the PC+4 with the lower 26 bits of the fetched instruction shifted left by 2 bits
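As a rough illustration of the jump address assembly described above, the following Python sketch keeps the upper 4 bits of PC+4 and fills the lower 28 bits from the instruction. The function name `jump_target` is hypothetical.

```python
def jump_target(pc: int, instruction: int) -> int:
    """Concatenate PC+4[31-28] with the 26-bit address field shifted left 2."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    low_28 = (instruction & 0x03FFFFFF) << 2   # 26 bits become 28 bits
    return (pc_plus_4 & 0xF0000000) | low_28


# j with address field 0x0100004, fetched at PC 0x00400000
print(hex(jump_target(0x00400000, 0x08100004)))  # 0x400010
```

Because only 28 bits come from the instruction, a jump can only reach targets inside the 256 MB region selected by the upper 4 bits of PC+4.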
Add 4 4 Instruction Memory PC Read Instruction Address 26 Shift left 2
Jump address
28
We need to assemble the datapath segments, add control lines as needed, and design the control path.
Single-cycle design: fetch, decode and execute each instruction in one clock cycle.
Cycle time is determined by the length of the longest path.
No datapath resource can be used more than once per instruction, so some must be duplicated (that is why we have a separate Instruction Memory and Data Memory).
To share datapath elements between different instruction classes, we need multiplexors at the inputs of the shared elements, and control lines to select among the inputs.
RegWrite
MemWrite
lw
lw / sw R
Sign 16 Extend 32
MemRead
Multiplexor Insertion
Add 4
RegWrite
MemWrite
MemtoReg
Read Addr 1 Register Read Read Addr 2 Data 1 File Write Addr Read Data 2 Write Data
Sign 16 Extend
MemRead
32
RegWrite
Read Addr 1 Register Read Read Addr 2 Data 1 File Write Addr Read Write Data Data 2
lw R
lw / sw
Sign 16 Extend 32
MemRead
Datapath - review
We wait for everything to settle down; the ALU might not produce the right answer right away.
Cycle time is determined by the length of the longest path.
Split memory (Harvard) model, single cycle operation.
Simplified to contain only the instructions lw, sw, add, sub, and, or, slt, beq.
Sequential components (PC, RegFile, Memory) are edge triggered. State elements are written on every clock cycle; if not, an explicit write control signal is needed, and the write occurs only when both the write control is asserted and the clock edge occurs.
Observations:
the op field is always in bits 31-26;
the addresses of the two registers to be read are always specified by the rs and rt fields (bits 25-21 and 20-16);
the address of the register to be written is in one of two places: rt (bits 20-16) for lw, rd (bits 15-11) for R-type instructions;
the base register for lw and sw is always rs (bits 25-21);
the offset for beq, lw, and sw is always in bits 15-0.
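The fixed field positions listed above make decoding a matter of shifts and masks. The sketch below is illustrative only: `decode_fields` and `write_register` are hypothetical names, and the convention that RegDst = 1 selects rd is an assumption about the multiplexor wiring.

```python
def decode_fields(instr: int) -> dict:
    """Slice a 32-bit MIPS instruction word into its fixed-position fields."""
    return {
        "op": (instr >> 26) & 0x3F,     # bits 31-26
        "rs": (instr >> 21) & 0x1F,     # bits 25-21
        "rt": (instr >> 16) & 0x1F,     # bits 20-16
        "rd": (instr >> 11) & 0x1F,     # bits 15-11
        "funct": instr & 0x3F,          # bits 5-0
        "imm16": instr & 0xFFFF,        # bits 15-0
    }


def write_register(fields: dict, reg_dst: int) -> int:
    """Destination register mux: rd for R-type, rt for lw (assumed RegDst=1 -> rd)."""
    return fields["rd"] if reg_dst else fields["rt"]


add_fields = decode_fields(0x012A4020)   # add $8, $9, $10
lw_fields = decode_fields(0x8D280004)    # lw $8, 4($9)
print(write_register(add_fields, 1))     # 8 (from rd)
print(write_register(lw_fields, 0))      # 8 (from rt)
```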
RegDst
RegWrite
ALUSrc ovf
zero
ALU
0 1
Instr[ 15 -11]
Read Data 2
0 1
ALU control
Instr[15-0]
Sign 16 Extend
MemRead
32
Instr[5-0]
ALUOp
ALU Control
MIPS uses multiple control levels to increase speed of the main control unit and decrease its size
ALU control input (Ainvert + Binvert + Operation)   Function
0000                                                AND
0001                                                OR
0010                                                add
0110                                                subtract
0111                                                set on less than
1100                                                NOR
Multiple levels of control: the main control unit generates the ALUOp bits, and the ALU control unit generates the ALU control inputs (so the main control is smaller).
Instr op   funct    ALUOp   Desired action      ALU control input
lw         xxxxxx   00      add                 0010
sw         xxxxxx   00      add                 0010
beq        xxxxxx   01      subtract            0110
add        100000   10      add                 0010
sub        100010   10      subtract            0110
and        100100   10      AND                 0000
or         100101   10      OR                  0001
slt        101010   10      set on less than    0111
[Table: ALU control truth table, with inputs ALUOp1, ALUOp0 and funct bits F5-F0 (X = don't care) and the ALU control bits as outputs]
More don't-care entries can be used, because ALUOp never takes the encoding 11 and because F5 and F4 are always 10 for these instructions.
From the truth table can design the ALU Control logic
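The two-level decoding can be mirrored in software as a lookup. This is a behavioural sketch of the ALUOp/funct mapping given earlier, not a gate-level design; the function name `alu_control` is an assumption made for illustration.

```python
# funct field -> 4-bit ALU control input, used only when ALUOp = 10 (R-type)
FUNCT_TO_CONTROL = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # sub
    0b100100: 0b0000,  # and
    0b100101: 0b0001,  # or
    0b101010: 0b0111,  # slt
}


def alu_control(alu_op: int, funct: int) -> int:
    """Return the 4-bit ALU control input for the given ALUOp and funct bits."""
    if alu_op == 0b00:      # lw / sw: add base and offset
        return 0b0010
    if alu_op == 0b01:      # beq: subtract and test the zero output
        return 0b0110
    if alu_op == 0b10:      # R-type: the funct field decides
        return FUNCT_TO_CONTROL[funct]
    raise ValueError("ALUOp encoding 11 is not used")


print(format(alu_control(0b10, 0b101010), "04b"))  # 0111 (slt)
```

For lw, sw and beq the funct argument is ignored, which corresponds to the xxxxxx entries in the table.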
RegDst: source of the destination register for the operation.
RegWrite: enables writing a register in the register file.
ALUSrc: source of the second ALU operand; can be a register or part of the instruction.
PCSrc: source of the PC (increment [PC + 4] or branch).
MemRead/MemWrite: reading from / writing to data memory.
MemtoReg: source of the write register contents.
1
PCSrc MemRead MemtoReg
RegWrite
RegDst Instr[25-21] Read Addr 1 Register Read Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read
MemWrite
ovf zero
ALU
Address
Data Memory Read Data Write Data
1 0
0
1
ALU control
Write Data
Data 2
Sign 16 Extend
32
Instr[5-0]
[Table: main control truth table giving the control signal settings for R-type (op 000000), lw (op 100011), sw (op 101011) and beq (op 000100)]
From the truth table can design the Main Control logic
000000
100011
101011
000100
R-type
lw
sw
beq
1
28 32
PC
0 0
1
PCSrc
A new multiplexer and control line are needed to select this input for the program counter
000010
Jump RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOp0
R-type
lw
sw
beq
1
28 32 PC+4[31-28]
0 0 1
PCSrc MemRead MemtoReg MemWrite
ALUSrc
RegWrite RegDst Instr[25-21] Read Addr 1 Register Read Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read
ovf zero
ALU Address Data Memory Read Data Write Data
1 0
0 1
ALU control
Write Data
Data 2
Sign 16 Extend
32
Instr[5-0]
One more gate is needed in the Main Control logic for Jal (J type instruction which stores PC+4 in $ra $31)
Instr[31] Instr[30] Instr[29] Instr[28] Instr[27] Instr[26] 3 000000 000011 26-bit address J format
Jal
R-type
1
28 32 PC+4[31-28]
0 0 1
PCSrc MemRead MemtoReg MemWrite
PC+4
ALUSrc
RegWrite RegDst Instr[25-21] Read Addr 1 Register Read Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read
ovf zero
ALU Address Data Memory Read Data Write Data
1 0
0 1
ALU control
Instr[15 -11]
Write Data
Data 2
31
Instr[15-0]
Sign 16 Extend
32
Instr[5-0]
ALUSrc 0 1 1 0 x
MemtoReg 00 01 xx xx xx
Reg Write 1 1 0 0 0
JAL
10 R[31]
10
Assumed delays: Instruction and Data Memory 2 ns; ALU and adders 2 ns; Register File access (read or write) 1 ns; floating-point operations take even longer.
Floating-point add (add.d) = instruction fetch (2 ns) + register read (1 ns) + ALU add (8 ns) + register write (1 ns) = 12 ns.
Floating-point load (l.s) = 2 + 1 + 2 (ALU) + 2 (data memory) + 1 (register write) = 8 ns.
Floating-point store (s.s) = 2 + 1 + 2 (ALU) + 2 (data memory) = 7 ns.
The longest instruction is the floating-point multiply: instruction fetch (2 ns) + register read (1 ns) + ALU multiply (16 ns) + register write (1 ns) = 20 ns.
Branch (beq) = 5 ns; jump = 2 ns (fetch only).
If the clock period could vary in length, we would need to look at the instruction frequencies. For example: loads (31%), stores (21%), R-type (27%), beq (5%), j (2%), add.d and sub.d (7%), mul.d and div.d (7%). The average clock cycle is then 8 x 31% + 7 x 21% + 6 x 27% + 5 x 5% + 2 x 2% + 20 x 7% + 12 x 7% = 8.1 ns, compared with 20 ns for a fixed clock.
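The weighted sum can be checked with a few lines of Python. The latencies and frequencies are the ones quoted above (R-type taken as 6 ns, as in the sum); the dictionary keys and variable names are made up for this sketch. Summing the terms as listed gives about 8.1 ns.

```python
# Latency of each instruction class in ns, and its share of the instruction mix
latency_ns = {
    "load": 8, "store": 7, "r_type": 6, "beq": 5, "jump": 2,
    "fp_mul_div": 20, "fp_add_sub": 12,
}
frequency = {
    "load": 0.31, "store": 0.21, "r_type": 0.27, "beq": 0.05, "jump": 0.02,
    "fp_mul_div": 0.07, "fp_add_sub": 0.07,
}

# Fixed clock: every instruction pays for the slowest one
fixed_clock_ns = max(latency_ns.values())

# Hypothetical variable clock: each instruction pays only for its own latency
average_clock_ns = sum(latency_ns[k] * frequency[k] for k in latency_ns)

print(fixed_clock_ns)                                 # 20
print(round(average_clock_ns, 2))                     # 8.1
print(round(fixed_clock_ns / average_clock_ns, 2))    # 2.47
```

The ratio of roughly 2.5 is the speed-up a variable-length clock would offer over the fixed 20 ns clock for this instruction mix.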
Single cycle per instruction uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction, so we cannot make the common case fast. This is especially problematic for more complex instructions such as floating-point multiplication.
The single-cycle datapath is also wasteful of area, since some functional units must be duplicated because they cannot be shared during the execution of an instruction. For example, separate adders are needed for the PC update and the branch target address calculation, in addition to an ALU for R-type arithmetic/logic operations and data memory address calculations.
Uses the clock cycle inefficiently the clock cycle must be timed to accommodate the slowest instruction
Cycle 1 Clk Single Cycle Implementation: lw sw Waste Cycle 2
Is wasteful of area since some functional units must be duplicated since they can not be shared during a clock cycle (e.g., adders, memory units) But, it is simple(r) and easy to understand
We will consider only a subset of instructions (lw, sw, add, sub, and, or, slt, beq).
IFetch: instruction fetch and PC update
Dec: register read and instruction decode
Exec: calculate the memory address
Mem: read the data from the Data Memory
WB: write the data back to the register file
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5
lw
IFetch Dec RR 2 ns Exec Mem WB
Several instructions were worked on by the CPU at the same time Each major logic unit works on a different stage of a different instruction Like doing laundry for different roommates
What if.
Start the next instruction while still working on the current one
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8
lw sw R-type
IFetch Dec RR
Exec
Mem
Exec
WB
Mem Exec WB Mem WB
IFetch Dec RR
IFetch Dec RR
Pipelining improves throughput, the total amount of work done in a given time. If the pipeline is full (the ideal situation):
Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of pipeline stages
Instruction latency, the time from the start of an instruction to its completion, is not reduced.
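A quick numeric sketch of the throughput claim, assuming perfectly balanced stages and no hazards (both idealisations). The function names are hypothetical, and the 2 ns stage time with 5 stages is only an example.

```python
def unpipelined_time(n_instr: int, n_stages: int, stage_ns: float) -> float:
    """Each instruction runs all stages before the next one starts."""
    return n_instr * n_stages * stage_ns


def pipelined_time(n_instr: int, n_stages: int, stage_ns: float) -> float:
    """First instruction fills the pipeline; then one completes per cycle."""
    return (n_stages + n_instr - 1) * stage_ns


# Three instructions, five stages of 2 ns each
print(unpipelined_time(3, 5, 2))   # 30
print(pipelined_time(3, 5, 2))     # 14

# With many instructions the speed-up approaches the number of stages
n = 1_000_000
print(round(unpipelined_time(n, 5, 2) / pipelined_time(n, 5, 2), 3))  # 5.0
```

Each individual instruction still takes 5 x 2 ns = 10 ns from start to finish, which is the sense in which latency is not reduced.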
Cycle 2
Pipeline Implementation:
lw
Time savings
IFetch
sw
Dec
IFetch
Exec
Dec
Mem
Exec Dec
WB
Mem Exec WB Mem WB
wasted cycle
R-type IFetch
Pipelined Implementation:
What makes it easy:
all instructions are the same length (32 bits);
the first two pipeline stages are the same for all instructions;
there are few instruction formats (three), with symmetry across formats: register addresses are in the same location and thus can be read while instructions are being decoded (read reg file in the first of the clock cycle);
memory operations can occur only in loads and stores, thus the ALU can compute memory addresses in the EX stage;
operands are aligned in memory, so a single data transfer requires only one memory access.
What do we need to add or modify in our single-cycle-per-instruction datapath to make it pipelined? A MIPS instruction has (up to) five steps, thus the pipeline has 5 stages:
IFetch to fetch the instruction from Instruction Memory;
Dec to decode the instruction and read the Register File registers;
Exec to do the ALU operations;
Mem to read from or write into Data Memory;
WB to write back into the register file.
So we need a way to separate the datapath into five pieces without losing intermediate results. We will introduce pipeline registers between stages to isolate them and store intermediate results.
All instructions advance during one clock cycle between one pipeline register and the next
IF (fetch)
1 0
ID (decode)
EX (execute)
Mem
WB (write back)
Exec/Mem
Mem/WB
Read Address
File
Write Addr Write Data Read Data 2
Dec/Exec
Instruction Memory
PC
Read Addr 1
Register Read
Data 1 Read Addr 2 ALU
Address
Data Memory
Read Data
1 0
0 1
Write Data
System Clock
16
Sign Extend
32
Because all data is passed through the pipeline, the address of the register where data needs to be loaded (lw) also needs to be passed
IFetch
1 0
Dec
Exec
Mem
WB
Pipeline register sizes (for now): IF/ID 64 bits; ID/EX 32x4+5 = 133 bits; EX/MEM 32x3+5 = 101 bits; MEM/WB 32x2+5 = 69 bits
Add Shift left 2 IFetch/Dec Add
Exec/Mem
Mem/WB
Read Address
File
Write Addr Write Data Read Data 2
Dec/Exec
Instruction Memory
PC
Read Addr 1
Register Read
Data 1 Read Addr 2 ALU
Address
Data Memory
Read Data
1 0
0 1
Write Data
System Clock
16
Sign Extend
32
All control signals are determined during Decode and held in the pipeline registers between pipeline stages
IFetch
1 0
Control Add 4 Shift left 2 IFetch/Dec Add
Dec
Exec
Mem
WB
Exec/Mem
Mem/WB
Read Address
File
Write Addr Write Data Read Data 2
Dec/Exec
Instruction Memory
PC
Read Addr 1
Register Read
Data 1 Read Addr 2 ALU
Address
Data Memory
Read Data
1 0
0 1
Write Data
System Clock
16
Sign Extend
32
Pipeline Example
How does the non-dependent instruction sequence execute in a pipeline ? (no support for forwarding)
before <4>
before <3>
before <2>
before <1>
lw $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or $13, $6, $7
add $14, $8, $9
after <1>
after <2>
20
10
11
So far we have seen single-clock-cycle pipeline diagrams, which show the state of the entire datapath during one clock cycle (instructions are identified above the pipeline stages). Multi-clock-cycle pipeline diagrams are simpler and:
help answer questions such as how many cycles it takes to execute a piece of code, or what the ALU is doing during a certain cycle;
can represent multiple instructions in a single figure;
show why a hazard occurs and how it can be fixed.
[Figure: multi-clock-cycle pipeline diagram in instruction order, Inst 0 through Inst 4, each passing through IM, Reg, ALU, DM, Reg]
Pipelining the MIPS ISA: what makes it hard
structural hazards: what if we had only one memory? Then the pipeline could not have one instruction read from memory (fetch stage) while another instruction writes into memory (sw) at the same time.
control hazards: a decision must be made based on the results of one instruction while that instruction is still executing. What about branches?
Stalling
We assume that all instructions in the pipeline have a CPI of 1, except branches, which are always followed by a stall and so have a CPI of 2. In a typical program branches occur 13% of the time. Thus we can compute the aggregate CPI of the always-stall-on-branch architecture as:
\[ \mathrm{CPI} = \sum_{i=1}^{n} \mathrm{CPI}_i \times F_i \]
\[ \mathrm{CPI}_{\text{always stall}} = 1 \times 87\% + 2 \times 13\% = 1.13 \text{ cycles/instruction} \]
Thus
\[ \frac{\text{CPU Perform.}_{\text{always stall}}}{\text{CPU Perform.}_{\text{no stall}}} = \frac{\text{Inst. Count} \times \mathrm{CPI}_{\text{no stall}} \times \text{Clock}}{\text{Inst. Count} \times \mathrm{CPI}_{\text{always stall}} \times \text{Clock}} = \frac{1}{1.13} = 0.885 \ (88.5\%) \]
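The aggregate CPI is a frequency-weighted sum, which the following Python sketch reproduces for the branch-stall case. The function name `aggregate_cpi` is hypothetical.

```python
def aggregate_cpi(classes):
    """Sum CPI_i * F_i over (cpi, frequency) pairs."""
    return sum(cpi * freq for cpi, freq in classes)


# 87% of instructions take 1 cycle; 13% (branches) take 2 cycles
cpi_always_stall = aggregate_cpi([(1, 0.87), (2, 0.13)])
cpi_no_stall = 1.0

# Same instruction count and clock, so performance is inversely
# proportional to CPI
relative_performance = cpi_no_stall / cpi_always_stall

print(round(cpi_always_stall, 2))       # 1.13
print(round(relative_performance, 3))   # 0.885
```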
control hazards: Another approach is prediction - either static - always execute the instruction following a branch (assume always that the branch is not taken), or predict dynamically (keep a history of each branch as taken or not taken - accurate 90% of time).
Branch taken
The pipeline simplified representation is shading the blocks that are used in a given clock cycle.
data hazards: what if an instruction's input operands depend on the output of a previous instruction that has not finished? Example: an add followed by a sub.
Forwarding
Forwarding will fail for a lw followed immediately by an instruction that uses the results of the lw operation. Example lw followed by a sub.
Another solution is to have the compiler reorder the code so that lw is followed by an instruction that does not depend on the loaded word.
[Figure: pipeline diagram in instruction order, each instruction passing through IM, Reg, DM, Reg]
Can fix register file access hazard by doing reads in the second half of the cycle and writes in the first half.
[Figure: multi-clock-cycle pipeline diagrams plotted against time, in instruction order (including lw and Inst 3), each instruction passing through IM, Reg, ALU, DM, Reg]
Instruction order:
add r1,r2,r3
sub r4,r1,r5
and r6,r1,r7
or r8,r1,r9
xor r4,r1,r5
[Figure: pipeline diagram of the sequence above (IM, Reg, ALU, DM, Reg for each instruction), marking which reads of r1 are a data hazard and which are no data hazard]
[Figure: the same diagram with two stall cycles inserted after add r1,r2,r3, followed by sub r4,r1,r5 and and r6,r1,r7]
[Figure: two further pipeline diagrams covering and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5]
To avoid slowing down throughput, we need to add hardware that detects data hazards. We call this the forwarding unit. Data needs to be forwarded to the ALU when a data hazard is detected, so the forwarding unit controls forwarding through additional multiplexors at the ALU inputs. This logic unit needs input from the three pipeline registers. It also needs to detect whether the RegWrite control signal is asserted, so it needs input from the control lines as well. There is no forwarding from a stage whose destination is $0 (EX/MEM.RegisterRd = $0, MEM/WB.RegisterRd = $0).
It needs to detect one of four cases of data hazards:
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) Forward
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) Forward
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) Forward
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) Forward
0 1 2 2 2 0 1
ForwardA 00 ID/EX input to ALU1 - no fwd 01 MEM/WB input to ALU1 10 EX/MEM input to ALU1
ForwardB 00 ID/EX input to ALU2 01 MEM/WB input to ALU2 10 EX/MEM input to ALU2
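The four hazard conditions and the ForwardA/ForwardB encodings can be written as a small behavioural model. This is a sketch, not the actual logic circuit: the function and parameter names are invented, and giving the EX/MEM match priority over the MEM/WB match (so the most recent result wins) is an assumption not spelled out on the slide.

```python
def forwarding_unit(ex_mem_regwrite, ex_mem_rd,
                    mem_wb_regwrite, mem_wb_rd,
                    id_ex_rs, id_ex_rt):
    """Return (ForwardA, ForwardB) mux selects for the two ALU inputs."""

    def select(source_reg):
        # 10: take the value from the EX/MEM pipeline register
        if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == source_reg:
            return 0b10
        # 01: take the value from the MEM/WB pipeline register
        if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == source_reg:
            return 0b01
        # 00: no forwarding, use the value read in ID/EX
        return 0b00

    return select(id_ex_rs), select(id_ex_rt)


# "and $4, $2, $5" in EX while "sub $2, $1, $3" is in MEM
print(forwarding_unit(True, 2, False, 0, 2, 5))   # (2, 0)
# "or $4, $4, $2" in EX: $4 comes from EX/MEM, $2 from MEM/WB
print(forwarding_unit(True, 4, True, 2, 4, 2))    # (2, 1)
```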
How does the dependent instruction sequence execute in a pipeline with support for forwarding?
before <4>
before <3>
before <2>
before <1>
sub $2, $1, $3
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
after <1>
after <2>
EX/MEM.RegWrite is asserted
EX/MEM.RegisterRd=ID/EX.RegisterRs
EX/MEM.RegisterRd=ID/EX.RegisterRs
MEM/WB.RegisterRd=ID/EX.RegisterRt
EX/MEM.RegWrite is asserted
EX/MEM.RegisterRd=ID/EX.RegisterRs
Example (corrected)
Forwarding does not work when the instruction following a lw tries to read the destination register of the lw:
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
The pipeline needs to be stalled, and the data forwarded from the MEM/WB pipeline register.
Stalls happen in the EX stage, such that the subsequent two instructions in the pipeline both repeat what they were doing for one cycle This allows forwarding to work
Stall in CC4: the and and or instructions repeat what they did in CC3
We need a logic unit that detects hazards and then stalls. The hazard detection unit operates in the instruction decode stage. It tests whether the instruction in the execute stage is a load (the ID/EX.MemRead control line is asserted), and then checks whether either source register of the instruction currently being decoded is the same as the destination register of the lw being executed (that is, ID/EX.RegisterRt = IF/ID.RegisterRs or ID/EX.RegisterRt = IF/ID.RegisterRt).
During a stall the PC is prevented from incrementing and the instruction in the IF/ID pipeline register is preserved, so additional control lines are needed for the IF/ID register and for the PC.
The bubble is inserted by setting the pipelined control signals in the ID/EX pipeline register to 0, so we need a way to change the values of the control lines.
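The load-use check can be expressed as a short behavioural sketch in Python. The function names and the dictionary keys in `stall_controls` are illustrative assumptions; only the comparison itself comes from the description above.

```python
def load_use_stall(id_ex_mem_read: bool, id_ex_rt: int,
                   if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in decode needs the result of a lw in EX."""
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))


def stall_controls(stall: bool) -> dict:
    """Control outputs: freeze PC and IF/ID, and zero the ID/EX controls."""
    return {
        "PCWrite": not stall,          # PC keeps its value during a stall
        "IF/IDWrite": not stall,       # IF/ID keeps the same instruction
        "zero_id_ex_controls": stall,  # inserts the bubble
    }


# lw $2, 20($1) in EX, and $4, $2, $5 in decode: stall one cycle
print(load_use_stall(True, 2, 2, 5))    # True
# Same registers, but the instruction in EX is not a load
print(load_use_stall(False, 2, 2, 5))   # False
```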
ID/EX.RegisterRt
PCwrite
completes
completes
completes
0
0
IF/IDWrite is asserted
completes
completes
Forwarding unit sets ALUsrc multiplexer to use value from EX/MEM register
Example
Consider executing the following code:
add $5, $6, $7
lw $6, 100($7)
sub $7, $6, $8
How many cycles will it take to execute the code? Draw a diagram that illustrates the dependencies that need to be resolved.
CC 7 CC 8
add $5,$6,$7
lw $6,100($7)
Example - continued
Draw a diagram that illustrates how the code will actually be executed (incorporating any stalls or forwarding to solve the identified problems)
CC 7 CC 8
add..
lw $6,100($7)
forwarding
Pipeline Changes to accommodate Control Hazards Control hazards are due to branch hazards and to exceptions (I/O interrupts, requests from the OS, overflow, or an unknown instruction). A branch hazard occurs less frequently than data hazards, and is detected in the MEM stage of the pipeline. Assume branch not taken, the three instructions following a branch that is taken will be in the pipeline, and need to be flushed.
branch detected CC4
Pipeline Changes to accommodate Branch Hazards The pipeline throughput can be improved by moving the decision whether the branch is taken or not to the Decode stage of the pipeline; Then if the branch is taken, only one instruction needs to be flushed (discarded) - the instruction immediately after the branch instruction. Thus we need a new logic circuit which compares the contents of the register file outputs; Since the decision is taken in the decode stage, the branch address needs to be computed in the decode phase too, in case the branch is to be taken Thus we need a new adder in the decode phase, as well as add an IF Flush control line to flush the IF/ID pipeline register.
IF Flush
Pipeline Changes to accommodate Branch Hazards The above scheme will fail if we have the following series of instructions:
36 add $1, $6, $7
Because the correct value of register $1 is not in the decode stage (in the register file) at the time when the comparator needs it Pipeline needs to be stalled and the value of $1 needs to be forwarded from EX/Mem pipeline register
Example
How can the following code be modified to make use of a delayed branch slot?
Loop: lw $2, 100($3)
      addi $3, $3, 4
      beq $3, $4, Loop
We cannot put the addi after the beq, since it modifies register $3, which the beq compares.
We cannot simply put the lw after the beq, since register $3 has changed by then.
First we rewrite the code as:
Loop: addi $3, $3, 4
      lw $2, 96($3)
      beq $3, $4, Loop
Then we can move the lw after the beq:
Loop: addi $3, $3, 4
      beq $3, $4, Loop
      lw $2, 96($3)
Example 2
Consider the pipelined datapath that does not accommodate branch hazards. Can an attempt to flush and an attempt to stall occur simultaneously? You may want to consider the following code sequence to help you answer this question:
        beq $1, $2, TARGET   # assume the branch is taken
        lw $3, 40($4)
        add $3, $3, $3
        sw $3, 40($4)
TARGET: or $10, $11, $12
If the beq resolution is in the MEM stage, and the branch is taken, it requires a flush of the IF/ID pipeline register (means the register needs to be written to) and a change of the PC to the branch address; this happens in clock cycle 4.
Example 2 - continued
At the same time a hazard is detected between lw and the next instruction (add) which is dependent (due to $3 used as source register). Thus the hazard detection unit issues a stall, and requests that the PC and the IF/ID registers not be written to. The answer is YES, a flush and a stall are issued simultaneously.
If there are any conflicting actions, which should take priority?
Flush should take priority
Is there a simple change you can make to the datapath to ensure the necessary priority?
Example 2- continued
The hazard detection unit should be changed to see the RegWrite signal in the execution stage after it goes through the MUX used to flush the pipeline
RegWrite
The static scheme, which predicts that a branch will not be taken and flushes if it was taken, works for simple pipelines but wastes performance in aggressively pipelined architectures (such as the multiple-issue Pentium IV). One approach is to have a branch prediction buffer (a small memory unit indexed by the lower portion of the address of the branch instruction). It contains a bit that says whether the branch was recently taken or not. The value of the prediction bit is inverted if the prediction turns out to be wrong. When a branch is almost always taken, this 1-bit predictor will predict wrong twice (at the start and at the end of the run of branches).
A better approach is to use a two-bit scheme, which must be wrong twice to change the direction of prediction. The branch prediction is stored in a special buffer which is accessed with the beq instruction in the IF stage. If the beq is predicted as taken, then fetching begins from the target once beq is in ID.
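The difference between the 1-bit and 2-bit schemes shows up on a loop branch that is taken nine times and then falls through. The sketch below uses a 2-bit saturating counter, which is one common way to build a two-bit scheme and is an assumption here; the state diagram on the slide may differ in detail. The class and function names are hypothetical.

```python
class OneBitPredictor:
    """Predicts whatever the branch did last time."""

    def __init__(self, taken: bool = True):
        self.last = taken

    def predict(self) -> bool:
        return self.last

    def update(self, taken: bool) -> None:
        self.last = taken


class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""

    def __init__(self, state: int = 3):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def mispredictions(predictor, outcomes) -> int:
    """Count wrong predictions over a sequence of branch outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch: taken 9 times, then not taken, executed 3 times over
outcomes = ([True] * 9 + [False]) * 3
print(mispredictions(OneBitPredictor(), outcomes))   # 5
print(mispredictions(TwoBitPredictor(), outcomes))   # 3
```

The 1-bit predictor misses at both the exit and the re-entry of the loop, while the counter only misses at the exit, because a single not-taken outcome does not flip a strongly taken prediction.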
Not Taken
Taken
Further optimization with a global predictor taking into consideration the global behavior of recently executed branches. Each branch has two predictors, and tournament predictor keeps track and favors the one that was more accurate.
Dynamic branch prediction with compiler optimization Furthermore, compilers place instructions that always execute in the delay spot For mostly taken branches Best choice
Overflow is discovered at the end of the execute stage when the ALU sends a signal to the control unit. Following notification of an overflow the control unit has to flush the two instructions that followed the one causing the overflow. These instructions are now in the IF and ID stages of the pipeline. Thus we add an input to the MUX in the ID stage that 0s the control signals using an ID.Flush signal
8000180
ID.Flush
IF.Flush
Overflow
The instruction that causes the overflow (which is detected in the EX stage) needs to be flushed from the pipeline. This means that an EX.Flush signal needs to be sent to two multiplexers to zero the control signals for the last two stages of the pipeline. Overflow is only one of many possible exception causes. The cause is stored in the Cause register, using the codes below:
4: address error exception (load)
5: address error exception (store)
10: unknown or reserved instruction
12: arithmetic overflow
15: floating point exception
An additional input is added to the PC MUX that sends 8000 0180 hex to the PC (the system-reserved memory address for overflow). The address of the instruction following the offending instruction is saved in the Exception Program Counter (EPC) register, and the cause in the Cause register. If there are multiple exceptions, their causes are stored in the Cause register, so that hardware can interrupt based on later exceptions once the earliest exception has been serviced. In the case of an I/O interrupt, execution jumps to the system routine needed to deal with the I/O, followed by a return to the address stored in the EPC so the program can complete. The OS responds to an exception either by terminating the process that caused the exception or by performing some action. A process whose exception is due to an unimplemented instruction is killed by the OS.
80000180
0 0
Overflow
80000180
0 0
80000180
54
Overflow
OS instruction fetched
80000180
80000184
80000184
80000180
One way to speed up pipelines is to have more stages (up to eight), which results in shorter clock cycles. Another way is superscalar architectures, which have a CPI of less than 1. Multiple instructions can be launched at the same time (multiple issue), so the instruction execution rate exceeds the clock rate. We then talk of the number of Instructions per Clock Cycle (IPC instead of CPI). Architectures try to issue 3 to 8 instructions every clock cycle. A third way is to balance the load through dynamic pipeline scheduling, to avoid hazards (stalls). The price for these speed-ups is more hardware, more complicated control and a more complicated instruction execution model. If instructions are launched in pairs, only the first instruction is launched if dynamic conditions are not met.
Pipelining Speed-ups
Used in embedded processors and VLIW processors. Can improve performance by up to 200%. Layout is restricted to simplify decoding and instruction issue. Instructions are issued in pairs, aligned on a 64-bit boundary, with the ALU and branch portion operating first. If one of the instructions of the pair cannot be used, it is replaced by a no-op. The hardware detects data hazards and generates stalls between two issue packets, but the compiler is required to avoid all dependencies within the instruction pair. A load will cause the next two instructions to stall if they use the loaded word.
CC 7 add
CC 8
lw
beq
sw
sub
lw
We need two output ports for Instruction memory, two more read and one more write ports for the Register file, two ALUs (one handles address computation for Data memory access), and two sign-extending units
Dynamic pipeline scheduling chooses which instruction to execute next, re-ordering them to avoid stalls
Buffer holding all the operands and the operation
Results sent to other reservation stations or the commit unit Buffers results until it is safe to put them in the register file or in data memory (store)
Commit unit serves as a forwarding station For operands that are needed before they were written back in the register file
Register renaming removes antidependencies. In case of incorrect speculation, the mapping between architectural and physical registers is undone. Memory address calculation
http://www.extremetech.com/article2/0,1697,1988744,00.asp