
14:332:331

The Processor
(datapath and pipelining)
[Figure: datapath overview with PC, Instruction Memory, Register File, ALU, and Data Memory]

Review: Design Principles

Simplicity favors regularity
  fixed size instructions and data (32 bits)
Good design demands good compromises
  only three instruction formats
Smaller is faster
  limited instruction set
  limited number of registers in the register file
  limited number of addressing modes
Make the common case fast
  arithmetic operands from the register file (load-store machine)
  allow instructions to contain immediate operands

The Processor: Datapath & Control


We're ready to look at an implementation of the MIPS, simplified to contain only:
  memory-reference instructions: lw, sw
  arithmetic-logical instructions: add, sub, and, or, slt
  control flow instructions: beq, j

The Processor: Datapath & Control


Generic implementation (the first two stages are the same for every instruction):
  use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC: PC = PC+4)
  decode the instruction (and read registers)
  execute the instruction
All instructions (except j) use the ALU after reading the registers.
[Figure: Fetch -> Decode -> Exec cycle]

Abstract Implementation View


Two types of functional units:
  elements that operate on data (combinational, like the ALU)
  elements that contain state (sequential: registers and memory)

Abstract Implementation View


Single cycle operation (multi-cycle presented later).
Split memory (Harvard) model: one memory for instructions (instruction cache) and one for data.
The figure shows how the PC is incremented or changed when a branch is taken; it does not show multiplexors or control lines.
[Figure: abstract view with Control Unit, PC, Instruction Memory, Register File, ALU (overflow and zero outputs), and Data Memory]

Clocking Methodologies

A clocking methodology defines when signals can be read and when they can be written. Do not read a signal while it is being written: the result is unpredictable.
[Figure: clock waveform showing the cycle time, the rising (positive) edge, and the falling (negative) edge]

We adopt an edge-triggered clocking methodology: values stored in the state elements are updated only on a clock edge. The inputs to a combinational element are values stored by the state elements in a previous clock cycle, and the outputs of the combinational element can be used in the following clock cycle.

In a synchronous digital system, all signals must propagate from state element 1 through the combinational block to state element 2 in a single clock cycle. A state element is read in the first half of the cycle and written in the second half. The time needed for the logic gates to settle determines the length of the clock cycle.
[Figure: State element 1 -> Combinational logic -> State element 2, with the clock waveform marking one clock cycle]

This assumes state elements are written on every clock cycle. If not, an explicit write control signal is needed: the write occurs only when both the write control is asserted and the clock edge occurs.

Building the Datapath

The datapath carries data between the memory units, the registers, and the ALUs. The control unit regulates this transfer and determines what actions are taken on the data, using control lines connected to the various hardware units.

The edge-triggered clocking methodology allows a given state element to be both read and written in a single clock cycle. Either we have a long clock cycle for one instruction (its length is determined by the slowest instruction), or multiple cycles per instruction.

Building the Datapath

What are the building blocks needed to implement a subset of the MIPS instructions (lw, sw, and arithmetic and logic instructions such as add, sub, and, or, slt)?
Any instruction needs to be fetched and the PC incremented by 4:
  the PC is updated on every clock cycle
  a special ALU that only adds computes PC + 4
  the Instruction Memory is read every cycle, so it doesn't need an explicit read control signal
This is extended later for j and beq.
[Figure: PC -> Read Address of Instruction Memory -> Instruction, with an adder computing PC + 4]

Decoding Instructions

Decoding instructions involves:
  sending the fetched instruction's opcode and function field bits to the control unit
  reading two values from the Register File (the Register File addresses are contained in the instruction)
[Figure: Instruction feeding the Control Unit and the Register File: two 5-bit read addresses, a write address, 32-bit Write Data, and 32-bit Read Data 1 and Read Data 2]

Executing R Format Operations

R format operations (add, sub, slt, and, or):

R-type:  op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]

  perform the operation indicated by op and funct on the values in rs and rt
  store the result back into the Register File (into location rd)
[Figure: Register File outputs feeding a 32-bit ALU with a 4-bit ALU control input and overflow and zero outputs]
The fourth ALU control line supports NOR.

Executing R Format Operations


Note that the Register File is not written every cycle (e.g. sw), so we need an explicit write control signal for the Register File (RegWrite).
To execute lw or sw operations we need a Data Memory unit with two control signals, one for writing into it (MemWrite) and one for reading from it (MemRead).
[Figure: Register File (RegWrite), ALU (4-bit ALU control, overflow and zero outputs), and Data Memory (MemWrite, MemRead) with 32-bit data paths]

Executing Load and Store Operations


I-type:  op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]

Load and store operations:
  compute a memory address by adding the base register (in rs) to the 16-bit signed offset field in the instruction
  the 32 bits in the base register were read from the Register File during decode
  the offset value in the low order 16 bits of the instruction must be sign extended to create a 32-bit signed value
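The address calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names sign_extend16 and effective_address are hypothetical, and registers are modeled as plain integers wrapped to 32 bits.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base: int, offset16: int) -> int:
    """Base register plus sign-extended 16-bit offset, wrapped to 32 bits."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF


# lw $10, 20($1) with $1 = 0x1000 accesses address 0x1014
addr = effective_address(0x1000, 20)
```

A negative offset such as 0xFFFC (that is, -4) moves the address below the base register value.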

Executing Load and Store Operations


[Figure: the 16-bit offset passes through Sign Extend (16 -> 32 bits) to the ALU, which supplies the Data Memory address; control lines RegWrite, ALU control, MemWrite, MemRead]
sw: the value read from the Register File during decode must be written to the Data Memory
lw: the value read from the Data Memory must be stored in the Register File

Executing Branch Operations


I-type:  op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]

Branch operations have to:
  compare the operands read from the Register File during decode (rs and rt values) for equality (the zero ALU output is asserted)
  compute the branch target address by adding the updated PC to the sign-extended 16-bit offset field in the instruction; the offset in the low order 16 bits must be sign extended to a 32-bit signed value and then shifted left 2 bits to turn the word offset into a byte offset
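As a rough sketch of the branch target calculation just described (the helper names are hypothetical, and addresses are modeled as Python integers wrapped to 32 bits):

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, offset16: int) -> int:
    """(PC + 4) plus the sign-extended offset shifted left by 2 bits."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


# A beq at address 40 with offset field 7 targets address 72
target = branch_target(40, 7)
```

The example values match the beq at address 40 used later in these slides, whose taken path continues at address 72.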

Executing Branch Operations


[Figure: branch datapath. One adder produces PC + 4; a second adder adds it to the sign-extended (16 -> 32 bits) offset shifted left 2 to form the branch target address. The ALU performs a subtraction, and its zero output goes to the branch control logic: take the branch if the zero line is high]

Executing Jump Operations


J-type:  op [31-26] | jump target address [25-0]

Jump operations have to replace the lower 28 bits of PC+4 with the lower 26 bits of the fetched instruction shifted left by 2 bits.
[Figure: Instruction[25-0] (26 bits) -> Shift left 2 -> 28-bit jump address]
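The jump address formation can be sketched as follows. The function name jump_target is hypothetical, and the instruction word and PC are modeled as plain 32-bit integers.

```python
def jump_target(pc: int, instr: int) -> int:
    """Keep the top 4 bits of PC + 4 and fill the lower 28 bits with
    the 26-bit target field shifted left by 2."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    target_field = instr & 0x03FFFFFF          # Instr[25-0]
    return (pc_plus_4 & 0xF0000000) | (target_field << 2)


# op = 000010 (j), target field = 0x0100004
addr = jump_target(0x00400000, 0x08100004)
```

Because only the lower 28 bits are replaced, a jump stays within the 256 MB region selected by the top four bits of PC + 4.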

Creating a Single Datapath from the Parts


We need to assemble the datapath segments, add control lines as needed, and design the control path.
  Fetch, decode, and execute each instruction in one clock cycle (single cycle design). The cycle time is determined by the length of the longest path.
  No datapath resource can be used more than once per instruction, so some must be duplicated (that is why we have a separate Instruction Memory and Data Memory).
  To share datapath elements between different instruction classes, we need multiplexors at the inputs of the shared elements.
  We need control lines to select the inputs.

Fetch, R, and Memory Access Portions


[Figure: fetch, R-type, and memory-access datapath segments combined, with the paths used by R-type, lw, and sw marked; control lines RegWrite, ALU control, MemWrite, MemRead]

Multiplexor Insertion

[Figure: the same datapath with multiplexors inserted, selected by ALUSrc (second ALU operand) and MemtoReg (register write data)]

Adding the Branch Portion


[Figure: datapath with the branch adder (sign-extended offset shifted left 2) and a PCSrc multiplexor; PC + 4 is selected for a branch not taken, R-type, lw, and sw]

Datapath - review

We wait for everything to settle down: the ALU might not produce the right answer right away.
The cycle time is determined by the length of the longest path.
Split memory (Harvard) model, single cycle operation.
Simplified to contain only the instructions lw, sw, add, sub, and, or, slt, beq.
Sequential components (PC, RegFile, Memory) are edge triggered. State elements are written on every clock cycle; if not, an explicit write control signal is needed, and the write occurs only when both the write control is asserted and the clock edge occurs.

Single cycle datapath


R-type:  op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
I-type:  op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]

Observations:
  the op field is always in bits 31-26
  the addresses of the two registers to be read are always specified by the rs and rt fields (bits 25-21 and 20-16)
  the address of the register to be written is in one of two places: in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions
  the base register for lw and sw is always in rs (bits 25-21)
  the offset for beq, lw, and sw is always in bits 15-0
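Since every field sits at a fixed bit position, decoding reduces to shifts and masks. The sketch below extracts all fields from a 32-bit instruction word; the function name decode and the dictionary layout are illustrative choices, not part of the slides.

```python
def decode(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into its fixed-position fields."""
    return {
        "op": (instr >> 26) & 0x3F,     # bits 31-26
        "rs": (instr >> 21) & 0x1F,     # bits 25-21
        "rt": (instr >> 16) & 0x1F,     # bits 20-16
        "rd": (instr >> 11) & 0x1F,     # bits 15-11 (R-type only)
        "shamt": (instr >> 6) & 0x1F,   # bits 10-6  (R-type only)
        "funct": instr & 0x3F,          # bits 5-0   (R-type only)
        "imm16": instr & 0xFFFF,        # bits 15-0  (I-type offset)
    }


lw_fields = decode(0x8C2A0014)    # lw $10, 20($1)
add_fields = decode(0x01097020)   # add $14, $8, $9
```

All fields are extracted unconditionally, which mirrors the hardware: the register file is read using rs and rt before the control unit knows which instruction format applies.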

(Almost) Complete Single Cycle Datapath


[Figure: (almost) complete single cycle datapath. Instr[25-21] and Instr[20-16] drive the register read addresses; RegDst selects Instr[20-16] or Instr[15-11] as the write address; Instr[15-0] is sign extended; Instr[5-0] and ALUOp feed the ALU control; control lines PCSrc, RegDst, RegWrite, ALUSrc, MemRead, MemWrite, MemtoReg]
The ALU control sets the ALU's operation based on instruction type and function code.

ALU Control

MIPS uses multiple control levels to increase the speed of the main control unit and decrease its size.

ALU control input (Ainvert + Binvert + Operation)    Function
0000                                                 and
0001                                                 or
0010                                                 add
0110                                                 subtract
0111                                                 set on less than
1100                                                 NOR

ALU Control, continued

Multiple levels of control:
  the main control unit generates the ALUOp bits
  the ALU control unit generates the ALU control inputs (so the main control is smaller)
[Figure: ALU control block with inputs Instr[5-0] and ALUOp (from the control block), producing the ALU control input]

Instr op   funct    ALUOp   desired action      ALU control input
lw         xxxxxx   00      add                 0010
sw         xxxxxx   00      add                 0010
beq        xxxxxx   01      subtract            0110
add        100000   10      add                 0010
sub        100010   10      subtract            0110
and        100100   10      AND                 0000
or         100101   10      OR                  0001
slt        101010   10      set on less than    0111

ALU Control Truth Table


ALUOp1 ALUOp0   F5 F4 F3 F2 F1 F0   Op3 Op2 Op1 Op0
0      0        X  X  X  X  X  X    0   0   1   0
X      1        X  X  X  X  X  X    0   1   1   0
1      X        X  X  0  0  0  0    0   0   1   0
1      X        X  X  0  0  1  0    0   1   1   0
1      X        X  X  0  1  0  0    0   0   0   0
1      X        X  X  0  1  0  1    0   0   0   1
1      X        X  X  1  0  1  0    0   1   1   1

We can make use of more don't cares because ALUOp never uses the encoding 11 and because F5 and F4 are always 10.

ALU Control Combinational Logic

From the truth table we can design the ALU control logic.
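A behavioral sketch of that ALU control logic is shown below. It follows the ALUOp and funct encodings given in the tables above; the function name alu_control and the lookup-table structure are illustrative assumptions, and real hardware would implement the same mapping with a handful of gates.

```python
# funct field -> 4-bit ALU control input, for R-type instructions (ALUOp = 10)
FUNCT_TO_ALU = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # sub
    0b100100: 0b0000,  # and
    0b100101: 0b0001,  # or
    0b101010: 0b0111,  # slt
}


def alu_control(alu_op: int, funct: int) -> int:
    """Return the 4-bit ALU control input for a 2-bit ALUOp and 6-bit funct."""
    if alu_op == 0b00:      # lw / sw: add to form the address
        return 0b0010
    if alu_op == 0b01:      # beq: subtract to compare
        return 0b0110
    # ALUOp = 10: the funct field selects the operation
    # (an unsupported funct raises KeyError in this sketch)
    return FUNCT_TO_ALU[funct & 0x3F]
```

For lw, sw, and beq the funct argument is ignored, which corresponds to the xxxxxx entries in the table.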

Summary of control lines

RegDst: source of the destination register for the operation
RegWrite: enables writing a register in the register file
ALUSrc: source of the second ALU operand, which can be a register or part of the instruction
PCSrc: source of the PC (increment [PC + 4] or branch)
MemRead / MemWrite: reading from / writing to data memory
MemtoReg: source of the write register contents

Datapath with Control Unit


[Figure: single cycle datapath with the Control Unit. Instr[31-26] feeds the Control Unit, which drives RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, and ALUOp; Branch combined with the ALU zero output gives PCSrc; Instr[5-0] feeds the ALU control]

Main Control Unit


Instr           RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-type 000000   1       0       0         1         X        0         0       1       X
lw     100011   0       1       1         1         1        0         0       0       0
sw     101011   X       1       X         0         X        1         0       0       0
beq    000100   X       0       X         0         0        0         1       0       1

Control Unit Logic

From the truth table we can design the main control logic.
[Figure: control logic with inputs Instr[31] to Instr[26]; the opcodes 000000 (R-type), 100011 (lw), 101011 (sw), and 000100 (beq) are decoded and combined to drive the outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0]
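The main control truth table can also be written as a lookup keyed by opcode. The sketch below is one possible encoding, not the slides' gate-level design: the names MAIN_CONTROL and main_control are hypothetical, don't-care entries for RegDst and MemtoReg are stored as None, and the MemRead don't-cares are resolved to 0 as an assumption.

```python
X = None  # don't care

# opcode -> control signals (ALUOp is the 2-bit pair ALUOp1 ALUOp0)
MAIN_CONTROL = {
    0b000000: {"RegDst": 1, "ALUSrc": 0, "MemtoReg": 0, "RegWrite": 1,
               "MemRead": 0, "MemWrite": 0, "Branch": 0, "ALUOp": 0b10},  # R-type
    0b100011: {"RegDst": 0, "ALUSrc": 1, "MemtoReg": 1, "RegWrite": 1,
               "MemRead": 1, "MemWrite": 0, "Branch": 0, "ALUOp": 0b00},  # lw
    0b101011: {"RegDst": X, "ALUSrc": 1, "MemtoReg": X, "RegWrite": 0,
               "MemRead": 0, "MemWrite": 1, "Branch": 0, "ALUOp": 0b00},  # sw
    0b000100: {"RegDst": X, "ALUSrc": 0, "MemtoReg": X, "RegWrite": 0,
               "MemRead": 0, "MemWrite": 0, "Branch": 1, "ALUOp": 0b01},  # beq
}


def main_control(opcode: int) -> dict:
    """Look up the control line settings for a 6-bit opcode (Instr[31-26])."""
    return MAIN_CONTROL[opcode & 0x3F]
```

An opcode outside the four supported classes raises KeyError here; a hardware implementation would instead produce whatever the gates yield for that input.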

Adding the Jump Operation

J has the op field 000010 followed by a 26-bit jump address. We need to add this to the datapath: the instruction bits [25-0] are shifted left two bits and concatenated with the four MSBs of PC+4.
A new multiplexer and control line (Jump) are needed to select this input for the program counter.
One more gate is needed in the Main Control logic for J.
[Figure: Instr[25-0] (26 bits) -> Shift left 2 -> 28 bits, joined with PC+4[31-28] to form the 32-bit jump address; a Jump multiplexor after the PCSrc multiplexor selects the next PC]

Control Unit Logic

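[Figure: main control logic with inputs Instr[31] to Instr[26], extended with the opcode 000010 driving a new Jump output alongside the R-type, lw, sw, and beq terms and the outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0]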

Adding Jump Operation


[Figure: single cycle datapath with jump support. Instr[25-0] is shifted left 2 and joined with PC+4[31-28]; the Jump control line selects between this address and the output of the PCSrc multiplexor]

Control Unit Logic

One more gate is needed in the Main Control logic for jal (a J-type instruction that stores PC+4 in $ra, register $31).

J format:  op 000011 (3) [31-26] | 26-bit address [25-0]
[Figure: main control logic with inputs Instr[31] to Instr[26], decoding jal next to R-type (000000); outputs Jump, RegDst, MemtoReg, RegWrite, ALUOp1, ALUOp0]

Adding Jal Operation


[Figure: datapath with jal support: the RegDst multiplexor gains a constant 31 input and the MemtoReg multiplexor gains a PC+4 input]

Adding Control line Settings for Jal Operation


RegDst is now 2 bits. MemtoReg is now 2 bits.

          RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0  Jump
R-format  01      0       00        1         0        0         0       1       0       0
lw        00      1       01        1         1        0         0       0       0       0
sw        xx      1       xx        0         0        1         0       0       0       0
beq       xx      0       xx        0         0        0         1       0       1       0
J         xx      x       xx        0         0        0         x       x       x       1

JAL: RegDst = 10 selects R[31] as the write address, MemtoReg = 10 selects PC+4 as the write data, and the PC receives the jump address.

Single Cycle Implementation Cycle Time


Unfortunately, though simple, the single cycle approach is not used because it is inefficient: the clock cycle must have the same length for every instruction.
What is the longest path (slowest instruction)?
Calculate the cycle time assuming negligible delays (for muxes, control unit, sign extend, PC access, shift left 2, wires) except:
  Instruction and Data Memory (2 ns)
  ALU and adders (2 ns)
  Register File access (reads or writes) (1 ns)
  floating point operations take even longer

Instruction Critical Paths


Instr.   I Mem   Reg Rd   ALU Op   D Mem   Reg Wr   Total delay (ns)
R-type   2       1        2                1        6
load     2       1        2        2       1        8
store    2       1        2        2                7
beq      2       1        2                         5
jump     2                                          2
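The totals in the table are simply the sums of the unit delays along each instruction's path. A small sketch that recomputes them, using the delays stated above (the dictionary and function names are illustrative):

```python
# Unit delays in ns, as assumed in the slides
DELAY_NS = {"imem": 2, "reg_read": 1, "alu": 2, "dmem": 2, "reg_write": 1}

# Functional units on each instruction class's critical path
PATHS = {
    "R-type": ["imem", "reg_read", "alu", "reg_write"],
    "load":   ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "store":  ["imem", "reg_read", "alu", "dmem"],
    "beq":    ["imem", "reg_read", "alu"],
    "jump":   ["imem"],
}


def total_delay(instr: str) -> int:
    """Sum the unit delays along the critical path of one instruction class."""
    return sum(DELAY_NS[unit] for unit in PATHS[instr])


# The single cycle clock must fit the slowest instruction class
cycle_time = max(total_delay(name) for name in PATHS)
```

For this integer subset the load sets the clock period; adding floating point instructions lengthens it further.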

What about floating point operations?

A floating point add.d = instruction fetch (2 ns) + register read (1 ns) + ALU add (8 ns) + register write (1 ns) = 12 ns.
Floating point load l.s = 2 + 1 + 2 (ALU op) + 2 (data memory) + 1 (register write) = 8 ns.
Floating point store s.s = 2 + 1 + 2 (ALU) + 2 (data memory) = 7 ns.
The longest instruction is floating point multiply: mul = instruction fetch (2 ns) + register read (1 ns) + ALU multiply (16 ns) + register write (1 ns) = 20 ns.
Floating point branch = 5 ns; floating point jump = 2 ns (fetch only).
If the clock period is variable in length, we need to look at instruction frequency. For example: loads (31%), stores (21%), R-type (27%), beq (5%), j (2%), add.d and sub.d (7%), mult.d and div.d (7%). Combining these gives the average clock cycle:

$$8 \times 31\% + 7 \times 21\% + 6 \times 27\% + 5 \times 5\% + 2 \times 2\% + 20 \times 7\% + 12 \times 7\% = 2.48 + 1.47 + 1.62 + 0.25 + 0.04 + 1.40 + 0.84 = 8.1 \text{ ns}$$

What about variable cycle length?

Instead of a fixed cycle time, we allow the cycle time to depend on the instruction class. We can then compare performance, considering that the CPI is still 1 and the instruction count does not change.

$$\frac{\text{Perf. CPU}_{\text{variable cycle time}}}{\text{Perf. CPU}_{\text{fixed cycle time}}} = \frac{\text{CPU exec. time}_{\text{fixed cycle time}}}{\text{CPU exec. time}_{\text{var. cycle time}}} = \frac{\text{Clock period}_{\text{fixed}}}{\text{Clock period}_{\text{variable}}}$$

because

$$\text{performance} = \frac{1}{\text{Instr. Count} \times \text{CPI} \times \text{Clock Period}}$$

$$\text{Performance improvement} = \frac{20 \text{ ns (fixed cycle clock period)}}{8.1 \text{ ns (variable cycle clock period)}} \approx 2.47 \text{ times faster}$$

Where We are Headed

Single cycle per instruction uses the clock cycle inefficiently:
  the clock cycle must be timed to accommodate the slowest instruction, so we cannot make the common case fast
  this is especially problematic for more complex instructions like floating point multiplication
The single cycle per instruction datapath is wasteful of area, since some functional units must be duplicated because they cannot be shared during an instruction's execution:
  e.g., separate adders are needed for the PC update and the branch target address calculation, as well as an ALU for R-type arithmetic/logic operations and data memory address calculations

Single Cycle Disadvantages & Advantages

Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction.
[Figure: single cycle implementation timing; lw fills cycle 1, while sw leaves part of cycle 2 wasted]
Is wasteful of area, since some functional units (e.g., adders, memory units) must be duplicated because they cannot be shared during a clock cycle.
But it is simple(r) and easy to understand.

The Five Stages of a lw Instruction


We will consider only a subset of instructions (lw, sw, add, sub, and, or, slt, beq).
  IFetch: instruction fetch and PC update
  Dec: register read and instruction decode
  Exec: calculate the memory address
  Mem: read the data from the Data Memory
  WB: write the data back to the register file
[Figure: lw occupying cycles 1 to 5 as IFetch, Dec, Exec, Mem, WB]

What if several instructions were worked on by the CPU at the same time?
  each major logic unit works on a different stage of a different instruction
  like doing laundry for different roommates

Pipelined MIPS Processor

Start the next instruction while still working on the current one.
[Figure: lw, sw, and an R-type instruction overlapped, each starting one cycle after the previous one and passing through IFetch, Dec, Exec, Mem, WB]

Improves throughput: the total amount of work done in a given time.
If the pipeline is full (the ideal situation):
  Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of pipeline stages
Instruction latency, the time from the start of an instruction to its completion, is not reduced.
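The throughput relation can be explored with a small sketch. It assumes an ideal pipeline with no stalls and perfectly balanced stages, so that n instructions finish in (stages + n - 1) cycles; that cycle-count formula and both function names are assumptions added for illustration.

```python
def pipelined_cycles(n_instructions: int, stages: int = 5) -> int:
    """Cycles to finish n instructions in an ideal pipeline with no stalls:
    the first instruction takes `stages` cycles, each later one adds 1."""
    return stages + (n_instructions - 1)


def ideal_time_between(nonpipelined_time: float, stages: int = 5) -> float:
    """Time between instruction completions once the pipeline is full,
    assuming perfectly balanced stages."""
    return nonpipelined_time / stages


# lw, sw, and one R-type instruction in a 5-stage pipeline
cycles = pipelined_cycles(3)
```

Each instruction still spends five cycles in the pipeline, so latency is unchanged; only the rate of completion improves.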

Single Cycle vs. Pipelined


[Figure: single cycle implementation (Load in cycle 1; Store in cycle 2 with part of the cycle wasted) compared with the pipelined implementation (lw, sw, and R-type overlapped through IFetch, Dec, Exec, Mem, WB), with the time savings marked; the diagram also marks a wasted cycle]
The pipelined lw finishes later than the non-pipelined version.

Single Cycle, vs. Pipelined


Assume memory and ALU operations take 200 ps and register operations take 100 ps.
[Figure: single cycle implementation versus pipelined implementation timing; the register read happens in the second half of the cycle, and the time savings for instructions 2 and 3 are marked]

Designing MIPS Instructions for Pipelining

What makes it easy:
  all instructions are the same length (32 bits)
  the first two pipeline stages are the same for all instructions
  few instruction formats (three) with symmetry across formats: register addresses are in the same location and thus can be read while instructions are being decoded (the register file is read during the decode cycle)
  memory operations can occur only in loads and stores, thus the ALU can compute memory addresses in the EX stage
  operands are aligned in memory, so a single data transfer requires only one memory access

MIPS Pipeline Datapath Modifications

What do we need to add or modify in our single-cycle-per-instruction datapath to make it pipelined?
A MIPS instruction has (up to) five stages, thus the pipeline has 5 stages:
  IFetch to fetch the instruction from Instruction Memory
  Dec to decode the instruction and read the Register File registers
  Exec to do the ALU operations
  Mem to read from or write into Data Memory
  WB to write back into the register file
So we need a way to separate the datapath into five pieces without losing intermediate results. We will introduce pipeline registers between stages to isolate them and store intermediate results.

MIPS Pipeline Datapath Modifications

All instructions advance during one clock cycle between one pipeline register and the next
[Figure: pipelined datapath split into IF (fetch), ID (decode), EX (execute), Mem, and WB (write back) stages by the pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB, all driven by the system clock]

MIPS Pipeline Datapath Modifications

Because all data is passed through the pipeline, the address of the register where data needs to be loaded (lw) also needs to be passed
Pipeline register sizes (for now):
  IF/ID    64
  ID/EX    32x4+5 = 133
  EX/MEM   32x3+5 = 101
  MEM/WB   32x2+5 = 69
[Figure: pipelined datapath (IFetch, Dec, Exec, Mem, WB) in which each pipeline register is extended to hold the address of the destination register]

MIPS Pipeline Control Path Modifications

All control signals are determined during Decode and held in the pipeline registers between pipeline stages
[Figure: pipelined datapath with the Control unit in the Dec stage; its outputs travel through the Dec/Exec, Exec/Mem, and Mem/WB pipeline registers]

MIPS Pipeline Control Path Modifications


The modified control path is:
[Figure: pipelined datapath with the control signals carried in the pipeline registers]

Pipeline Example

How does this non-dependent instruction sequence execute in a pipeline (with no support for forwarding)?

before <4>
before <3>
before <2>
before <1>
lw  $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or  $13, $6, $7
add $14, $8, $9
after <1>
after <2>

[Figures: one datapath snapshot per clock cycle, as before <4>, before <3>, before <2>, before <1>, lw, sub, and, or, and add complete in turn. Annotations on the snapshots:]
  lw completes: data memory not used (MEM control lines are 0) by the instruction in the Mem stage; the destination register number is carried along the pipeline
  sub completes: data memory not used (MEM control lines are 0)
  and completes: normal PC+4 increment (PCSrc=0)
  add completes: the register is written in the first half of the cycle

Graphically Representing MIPS Pipeline


So far we saw single-clock-cycle pipeline diagrams, which show the state of the entire datapath during one clock cycle (instructions are identified above the pipeline stages).
Multi-clock-cycle pipeline diagrams are simpler and can help answer:
  how many cycles does it take to execute this code?
  what is the ALU doing during a certain cycle?
They can represent multiple instructions in a single figure. If there is a hazard, they show why it occurs and how it can be fixed.
[Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg]

Why Pipeline? For Throughput!


Once the pipeline is full, one instruction is completed every cycle.
[Figure: multi-clock-cycle diagram, time (clock cycles) across and instruction order down: Inst 0 to Inst 4 each pass through IM, Reg, ALU, DM, Reg, offset by one cycle; the first cycles are the time to fill the pipeline]

Example of graphical representation
[Figure: multi-clock-cycle pipeline diagram; it can be converted into a single-clock-cycle pipeline diagram]

Example of single-clock-cycle (cc 5) pipeline representation
[Figure: state of the datapath during clock cycle 5]

Pipelining the MIPS ISA

What makes it hard:
  structural hazards: what if we had only one memory? Then the pipeline could not have one instruction read from memory (fetch stage) while another instruction writes into memory (sw) at the same time
  control hazards: a decision must be made based on the result of one instruction while that instruction is still executing. What about branches? One approach is stalling.

Impact of branch stalling

We assume that all instructions in the pipeline have a CPI of 1, except branches, which are always followed by a stall and so have a CPI of 2. In a typical program branches occur 13% of the time. We can then compute the aggregate CPI of the always-stall-on-branch architecture:

$$\mathrm{CPI} = \sum_{i=1}^{n} \mathrm{CPI}_i \times F_i$$

$$\mathrm{CPI}_{\text{always stall}} = 1 \times 87\% + 2 \times 13\% = 1.13 \text{ cycles/instruction}$$

Thus

$$\frac{\text{CPU Perform.}_{\text{always stall}}}{\text{Perform.}_{\text{no stall}}} = \frac{\text{Inst. Count} \times \mathrm{CPI}_{\text{no stall}} \times \text{Clock}}{\text{Inst. Count} \times \mathrm{CPI}_{\text{always stall}} \times \text{Clock}} = \frac{1}{1.13} = 0.885 \ (88.5\%)$$
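The same calculation can be checked numerically. The sketch below evaluates the weighted CPI sum for the two instruction classes given above; the function name aggregate_cpi is an illustrative choice.

```python
def aggregate_cpi(classes):
    """Weighted CPI: sum of CPI_i * F_i over (cpi, frequency) pairs."""
    return sum(cpi * freq for cpi, freq in classes)


# 87% of instructions with CPI 1, 13% branches with CPI 2 (always stall)
cpi_always_stall = aggregate_cpi([(1, 0.87), (2, 0.13)])

# With equal instruction count and clock, relative performance is the CPI ratio
relative_performance = 1.0 / cpi_always_stall
```

Changing the branch frequency or the stall penalty in the list shows how quickly the penalty grows for branch-heavy code.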

control hazards: another approach is prediction. Static prediction always executes the instruction following a branch (it assumes that the branch is never taken). Dynamic prediction keeps a history of each branch as taken or not taken, and is accurate about 90% of the time.

Pipelining the MIPS ISA

[Figures: pipeline behavior when the branch is not taken and when the branch is taken]
The simplified pipeline representation shades the blocks that are used in a given clock cycle.
  data hazards: what if an instruction's input operands depend on the output of a previous instruction that has not finished? Example: an add followed by a sub. The fix is forwarding.
Forwarding fails for a lw followed immediately by an instruction that uses the result of the lw. Example: lw followed by a sub.
Solution: stall the pipeline one clock cycle, then forward.
Another solution: have the compiler reorder the code so that lw is followed by an instruction that does not depend on the loaded word.

How About Register File Access?


The register file access hazard can be fixed by doing reads in the second half of the cycle and writes in the first half.
[Figure: multi-clock-cycle diagram (add, Inst 1, Inst 2, add, Inst 4), time in clock cycles across and instruction order down, each instruction passing through IM, Reg, ALU, DM, Reg]

Branch Instructions Cause Control Hazards

Dependencies backward in time cause hazards.
[Figure: multi-clock-cycle diagram for add, beq, lw, Inst 3, Inst 4, each passing through IM, Reg, ALU, DM, Reg]

One Way to Fix a Control Hazard


The branch hazard can be fixed by waiting (stalling), but this affects throughput.
[Figure: multi-clock-cycle diagram with add, beq, two stall cycles, then lw and Inst 3]

Register Usage Can Cause Data Hazards

Dependencies backward in time cause hazards.

add r1,r2,r3
sub r4,r1,r5
and r6,r1,r7
or  r8,r1,r9
xor r4,r1,r5

[Figure: multi-clock-cycle diagram of the five instructions; some uses of r1 are marked as a data hazard and others as no data hazard]

One Way to Fix a Data Hazard


The data hazard can be fixed by waiting (stalling), but this affects throughput.
[Figure: multi-clock-cycle diagram with add r1,r2,r3, two stall cycles, then sub r4,r1,r5 and and r6,r1,r7]

Loads Can Cause Data Hazards

Dependencies backward in time cause hazards.

lw  r1,100(r2)
sub r4,r1,r5
and r6,r1,r7
or  r8,r1,r9
xor r4,r1,r5

[Figure: multi-clock-cycle diagram of the five instructions, each passing through IM, Reg, ALU, DM, Reg]

Stores Can Cause Data Hazards

Dependencies backward in time cause hazards.

add r1,r2,r3
sw  r1,100(r5)
and r6,r1,r7
or  r8,r1,r9
xor r4,r1,r5

[Figure: multi-clock-cycle diagram of the five instructions, each passing through IM, Reg, ALU, DM, Reg]

Pipeline Changes to accommodate Forwarding


To avoid slowing down throughput, we need to add hardware that detects data hazards. We call this the forwarding unit. Data needs to be forwarded to the ALU when a data hazard is detected, so the forwarding unit controls additional multiplexors at the ALU inputs. This logic unit needs input from the three pipeline registers. It also needs to know whether the RegWrite control signal is asserted, so it needs input from the control lines as well. There is no forwarding if EX/MEM.RegisterRd = $0 or MEM/WB.RegisterRd = $0.

Pipeline Changes to accommodate Forwarding

It needs to detect one of four cases of data hazards:
  if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) Forward
  if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) Forward
  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) Forward
  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) Forward
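The four conditions can be sketched as one function that is evaluated once per ALU source register (Rs for ForwardA, Rt for ForwardB). The function name and argument names are hypothetical. The returned codes follow the ForwardA/ForwardB encoding used in these slides (10 = EX/MEM, 01 = MEM/WB, 00 = no forwarding); checking EX/MEM first, so that the most recent result wins when both match, is an assumption not spelled out in the conditions above.

```python
def forward_select(src_reg: int,
                   ex_mem_regwrite: bool, ex_mem_rd: int,
                   mem_wb_regwrite: bool, mem_wb_rd: int) -> int:
    """Return the 2-bit forwarding mux select for one ALU source register."""
    # Hazard with the instruction one ahead: forward from EX/MEM
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return 0b10
    # Hazard with the instruction two ahead: forward from MEM/WB
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return 0b01
    # No hazard: use the value read from the register file (ID/EX)
    return 0b00


# and $4, $2, $5 right after sub $2, $1, $3: Rs = $2 is forwarded from EX/MEM
forward_a = forward_select(2, True, 2, False, 0)
forward_b = forward_select(5, True, 2, False, 0)
```

Register $0 is never forwarded, because it always reads as zero regardless of what an earlier instruction appears to write.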

Using Pipeline Registers to solve data hazards

[Figure: pipelined datapath with the forwarding unit and three-input multiplexors (inputs 0, 1, 2) at both ALU inputs, in addition to the ALUSrc multiplexor]

ForwardA
  00  ALU input 1 comes from ID/EX (no forwarding)
  01  ALU input 1 comes from MEM/WB
  10  ALU input 1 comes from EX/MEM
ForwardB
  00  ALU input 2 comes from ID/EX
  01  ALU input 2 comes from MEM/WB
  10  ALU input 2 comes from EX/MEM
  11  ALU input 2 comes from the sign extension unit (or add another multiplexer instead)

Forwarding Pipeline Example

How does this dependent instruction sequence execute in a pipeline with support for forwarding?

before <4>
before <3>
before <2>
before <1>
sub $2, $1, $3
and $4, $2, $5
or  $4, $4, $2
add $9, $4, $2
after <1>
after <2>

Forw. Pipeline Example - before <2> completes

Forw. Pipeline Example - before <1> completes


Use this value of $2 not the one fetched from register file

EX/MEM.RegWrite is asserted

EX/MEM.RegisterRd=ID/EX.RegisterRs

Forw. Pipeline Example - sub completes


Both $4 and $2 are forwarded

EX/MEM.RegWrite is asserted MEM/WB.RegWrite is asserted

EX/MEM.RegisterRd=ID/EX.RegisterRs

MEM/WB.RegisterRd=ID/EX.RegisterRt

Forwarding Pipeline Example - and completes


Use this value of $4 not the one fetched from register file

EX/MEM.RegWrite is asserted

EX/MEM.RegisterRd=ID/EX.RegisterRs

Example (corrected)


Forwarding does not work when an instruction immediately following a lw tries to read the value of the lw's destination register:

lw  $2, 20($1)
and $4, $2, $5
or  $8, $2, $6

The pipeline needs to be stalled, and the data is then forwarded from the MEM/WB pipeline register.

Pipeline Changes to accommodate Stalls

Forwarding does not work

How Stalls are inserted

Stalls happen in the EX stage, so that the two subsequent instructions in the pipeline both repeat for one cycle what they were doing. This allows forwarding to work.
The stall occurs in CC4: the and and the or repeat what they did in CC3.
The or is fetched in CC3 but stays in the IF stage for another cycle.

Pipeline Changes to accommodate Stalls


We need a logic unit that detects hazards and then stalls.
  The hazard detection unit operates in the instruction decode stage and tests whether the instruction ahead is a load (the ID/EX.MemRead control line is asserted).
  It then checks whether either source register of the instruction currently being decoded is the same as the target/destination register of the lw being executed (that is, ID/EX.RegisterRt = IF/ID.RegisterRs or ID/EX.RegisterRt = IF/ID.RegisterRt).
  During stalling the PC is prevented from incrementing and the instruction in the IF/ID pipeline register is preserved. This needs additional control lines for the IF/ID register and for the PC.
  The bubble is inserted by setting the pipelined control signals in the ID/EX pipeline register to 0, so we need a way to change the values of the control lines.
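The load-use test described above fits in a single boolean expression. In this sketch the function and argument names are hypothetical; a True result stands for "stall", meaning that the PC and the IF/ID register hold their values and the control lines entering ID/EX are zeroed.

```python
def load_use_stall(id_ex_memread: bool, id_ex_rt: int,
                   if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in EX is a load whose destination (Rt)
    is a source register of the instruction currently being decoded."""
    return bool(id_ex_memread) and id_ex_rt in (if_id_rs, if_id_rt)


# lw $2, 20($1) in EX while and $4, $2, $5 is in decode: stall one cycle
stall = load_use_stall(True, 2, 2, 5)
```

After the one-cycle bubble the loaded value is available in the MEM/WB register, so the forwarding unit can supply it to the ALU.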

Instruction source registers

Pipeline Changes to do Hazard Detection

ID/EX.RegisterRt

Pipeline Changes to do Hazard Detection


Stall by 0-ing all 9 control lines
IF/ID write

PCwrite

Pipeline stalling example before<3>

completes

Pipeline stalling example before<2>


Hazard is detected

completes

Pipeline stalling example before<1>


PCWrite is asserted Bubble inserted

completes

0
0

IF/IDWrite is asserted

Registers continue to be read

Pipeline stalling example lw

completes

Forwarding unit sets ALUsrc multiplexer to use value from WB register

Pipeline stalling example bubble

completes

Forwarding unit sets ALUsrc multiplexer to use value from EX/MEM register

Example

Consider executing the following code:

add $5, $6, $7
lw $6, 100($7)
sub $7, $6, $8

How many cycles will it take to execute the code? Draw a diagram that illustrates the dependencies that need to be resolved.
CC 7 CC 8

add $5,$6,$7

lw $6,100($7)

sub $7, $6, $8

Example - continued

Draw a diagram that illustrates how the code will actually be executed (incorporating any stalls or forwarding to solve the identified problems)
CC 7 CC 8

add..

lw $6,100($7)

Stall one cycle

forwarding

sub $7, $6, $8
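The cycle count for this example can be checked with a rough model. The sketch below assumes a 5-stage pipeline with full forwarding, where the only stall is a one-cycle bubble when an instruction reads the destination of the lw immediately before it. The tuple encoding of instructions and the function name are assumptions made for illustration.

```python
def count_cycles(program, stages=5):
    """Cycles to run `program` on an idealized pipeline with forwarding.

    Each instruction is (opcode, destination, sources). Only load-use
    hazards between adjacent instructions cost a stall in this model.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        opcode, dest, _ = prev
        if opcode == "lw" and dest in cur[2]:
            stalls += 1  # one bubble, then forward from MEM/WB
    # first instruction needs `stages` cycles, each later one adds a cycle
    return stages + (len(program) - 1) + stalls

program = [
    ("add", "$5", ("$6", "$7")),
    ("lw",  "$6", ("$7",)),         # base register $7; destination $6
    ("sub", "$7", ("$6", "$8")),    # reads $6 right after the lw
]
total = count_cycles(program)
```

Under these assumptions the model gives 5 + 2 + 1 = 8 cycles, which is consistent with the diagram running to CC 8 with a single stall.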

MIPS Pipeline Control Path Modifications


2

Branch decision in MEM stage

Pipeline Changes to accommodate Control Hazards

Control hazards are due to branches and to exceptions (I/O interrupts, requests from the OS, overflow, or an unknown instruction). Branch hazards occur less frequently than data hazards, and the branch outcome is determined in the MEM stage of the pipeline. If we assume branch not taken, then when a branch is in fact taken the three instructions following it are already in the pipeline and need to be flushed.
branch detected CC4

40 beq $1,$3,7
44 and $12,$2,$5
48 or $13,$6,$2
52 add $14,$2,$2
56 ...
72 lw $4,50($7)

Pipeline Changes to accommodate Branch Hazards

Pipeline throughput can be improved by moving the decision on whether the branch is taken to the decode stage. Then, if the branch is taken, only one instruction needs to be flushed (discarded): the instruction immediately after the branch. This requires a new logic circuit that compares the register file outputs for equality. Since the decision is made in the decode stage, the branch address must also be computed there in case the branch is taken. Thus we need a new adder in the decode stage, as well as an IF.Flush control line to flush the IF/ID pipeline register.
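As a sketch of what the extra comparator and adder in the decode stage compute, the Python below models a beq resolved in ID. It assumes the usual MIPS convention that the branch offset counts words relative to PC + 4. The function names and the returned dictionary are illustrative, not part of the datapath.

```python
def branch_target(pc, word_offset):
    """PC-relative target: address after the branch plus offset * 4."""
    return pc + 4 + (word_offset << 2)

def resolve_beq_in_id(pc, rs_value, rt_value, word_offset):
    """Model the ID-stage equality check and the IF.Flush decision."""
    taken = rs_value == rt_value           # the new comparator
    target = branch_target(pc, word_offset)  # the new adder
    return {
        "taken": taken,
        "next_pc": target if taken else pc + 4,
        "IF.Flush": taken,  # discard the one instruction fetched after the branch
    }

# beq $1, $3, 7 at address 40, with $1 == $3
result = resolve_beq_in_id(40, rs_value=5, rt_value=5, word_offset=7)
```

For the branch at address 40 with offset 7 this gives 40 + 4 + 7*4 = 72 as the next PC.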

Pipeline Changes to accommodate Branch Hazards


Branch

Switch to branch address Compute branch address

Check for equality

Pipelined branch example <before 2> completes


PC-relative branch 40+4+7*4=72
Branch

IF Flush

2 Flushing means instruction field is 0s

Pipelined branch example <before 1> completes

Pipeline Changes to accommodate Branch Hazards

The above scheme will fail if we have the following series of instructions:
36 add $1, $6, $7

40 beq $1, $3, 28


44 and $12, $2, $5 72 lw

The scheme fails because the correct value of register $1 is not yet available in the decode stage (in the register file) at the time the comparator needs it. The pipeline needs to be stalled and the value of $1 forwarded from the EX/MEM pipeline register.

Pipeline Changes to accommodate Branch Hazards

36 add $1, $6, $7

40 beq $1, $3, 28

44 and $12, $2, $5 72 lw

Pipeline Changes to accommodate Branch Hazards

36 add $1, $6, $7 Stall

40 beq $1, $3, 28


flush 72 lw

Pipeline Changes to accommodate Branch Hazards

Example

How can the following code be modified to make use of a delayed branch slot?

Loop: lw $2, 100($3)
      addi $3, $3, 4
      beq $3, $4, Loop

We cannot put the addi after the beq, since it modifies register $3, which the beq reads. We cannot simply put the lw after the beq either, since register $3 has changed by then. First we rewrite the code as

Loop: addi $3, $3, 4
      lw $2, 96($3)
      beq $3, $4, Loop

Then we can move the lw after the beq:

Loop: addi $3, $3, 4
      beq $3, $4, Loop
      lw $2, 96($3)
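One way to convince yourself that the rewrite preserves behavior is to compare the addresses each version loads from. The Python sketch below traces only the value of $3 and the effective load address. It ignores the branch condition and simply runs a fixed number of iterations, which is a simplifying assumption, and the function names are hypothetical.

```python
def original_loop_addresses(r3, iterations):
    """lw $2, 100($3) executes before addi $3, $3, 4."""
    addresses = []
    for _ in range(iterations):
        addresses.append(100 + r3)  # lw uses the old $3
        r3 += 4                     # addi
    return addresses

def delayed_slot_loop_addresses(r3, iterations):
    """addi first, then beq, then lw $2, 96($3) in the delay slot."""
    addresses = []
    for _ in range(iterations):
        r3 += 4                     # addi
        # beq is evaluated here and reads the updated $3, as in the original
        addresses.append(96 + r3)   # delay-slot lw: 96 + ($3 + 4) == 100 + old $3
    return addresses

before = original_loop_addresses(0x1000, 4)
after = delayed_slot_loop_addresses(0x1000, 4)
```

Changing the offset from 100 to 96 compensates for the addi having already run, so both versions touch the same memory words.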


Example 2
Consider the pipelined datapath that does not accommodate branch hazards. Can an attempt to flush and an attempt to stall occur simultaneously? You may want to consider the following code sequence to help you answer this question:

        beq $1, $2, TARGET   # assume the branch is taken
        lw $3, 40($4)
        add $3, $3, $3
        sw $3, 40($4)
TARGET: or $10, $11, $12

If the beq is resolved in the MEM stage and the branch is taken, the IF/ID pipeline register must be flushed (which means the register needs to be written to) and the PC changed to the branch address; this happens in clock cycle 4.

Example 2 - continued

At the same time a hazard is detected between lw and the next instruction (add) which is dependent (due to $3 used as source register). Thus the hazard detection unit issues a stall, and requests that the PC and the IF/ID registers not be written to. The answer is YES, a flush and a stall are issued simultaneously.
If there are any conflicting actions, which should take priority?
Flush should take priority

Is there a simple change you can make to the datapath to ensure the necessary priority?

Example 2- continued
The hazard detection unit should be changed to see the RegWrite signal in the execution stage after it goes through the MUX used to flush the pipeline
RegWrite

Dynamic branch prediction

Static branch prediction (predict not taken, then flush if the branch was taken) works for simple pipelines, but wastes performance on aggressively pipelined architectures (such as the multiple-issue Pentium IV). One approach is to have a branch prediction buffer: a small memory indexed by the lower portion of the address of the branch instruction. It contains a bit that says whether the branch was recently taken or not. The prediction bit is inverted if the prediction turns out to be wrong. When a branch is almost always taken, this 1-bit predictor will predict wrong twice (at the start and at the end of the run of branches).

Dynamic branch prediction


A better approach is to use a two-bit scheme, which must be wrong twice to change the direction of prediction. The branch prediction is stored in a special buffer which is accessed with the beq instruction in the IF stage. If the beq is predicted as taken, then fetching begins from the target once beq is in ID.
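The difference between the 1-bit and 2-bit schemes can be seen by simulating a single prediction-buffer entry. The sketch below uses a saturating counter, which is one common way to realize a two-bit scheme and is an assumption here, not necessarily the exact state machine on the slide. The initial "strongly taken" state and the loop-like branch pattern are also assumptions.

```python
def count_mispredictions(outcomes, bits):
    """Simulate one n-bit saturating-counter predictor entry.

    With bits=1 the entry simply remembers the last outcome, which is
    the same as inverting the bit on every wrong prediction.
    """
    max_state = (1 << bits) - 1
    threshold = 1 << (bits - 1)   # predict taken when state >= threshold
    state = max_state             # assumed starting point: strongly taken
    misses = 0
    for taken in outcomes:
        if (state >= threshold) != taken:
            misses += 1
        # move toward the observed outcome, saturating at both ends
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return misses

# A loop branch: taken 9 times, then not taken once; the loop runs 3 times.
pattern = ([True] * 9 + [False]) * 3
one_bit_misses = count_mispredictions(pattern, bits=1)
two_bit_misses = count_mispredictions(pattern, bits=2)
```

On this pattern the 1-bit entry mispredicts at the exit of each loop and again on re-entry, while the 2-bit entry mispredicts only at the exits, because a single not-taken outcome is not enough to flip its prediction.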

Figure: state diagram of the two-bit prediction scheme, with transitions labeled Taken and Not Taken.

A further optimization is a global predictor, which takes into consideration the global behavior of recently executed branches. With two predictors per branch, a tournament predictor keeps track of both and favors the one that has been more accurate.

Dynamic branch prediction with compiler optimization

Furthermore, compilers place instructions that always execute in the delay slot.
For mostly taken branches
Best choice

Pipeline Changes to accommodate Exceptions


Overflow is discovered at the end of the execute stage, when the ALU sends a signal to the control unit. Following notification of an overflow, the control unit has to flush the two instructions that followed the one causing the overflow. These instructions are now in the IF and ID stages of the pipeline. Thus we add an input to the MUX in the ID stage that zeros the control signals, driven by an ID.Flush signal.

80000180

ID.Flush

IF.Flush

Overflow

Pipeline Changes to accommodate Exceptions

The instruction that causes the overflow (which is detected in the EX stage) also needs to be flushed from the pipeline. This means that an EX.Flush signal needs to be sent to two multiplexers to zero the control signals for the last two stages of the pipeline. Overflow is only one of many possible exception causes. The cause is stored in the Cause register, using codes such as:
4 - address error exception (load)
5 - address error exception (store)
10 - unknown or reserved instruction
12 - arithmetic overflow
15 - floating point exception

Pipeline Changes to accommodate Exceptions

An additional input is added to the PC MUX that sends 8000 0180hex to the PC (the system-reserved memory address for overflow). The address of the instruction following the offending one is saved in the Exception Program Counter (EPC) register, and the cause in the Cause register. If there are multiple exceptions, their causes are stored in the Cause register, so that hardware can interrupt based on later exceptions once the earliest exception has been serviced. In case of an I/O interrupt, execution jumps to the system routine needed to deal with the I/O, followed by a return to the address stored in the EPC for program completion. The OS responds to an exception either by terminating the process that caused it or by performing some action. A process whose exception is due to an unimplemented instruction is killed by the OS.
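The bookkeeping for an overflow exception can be summarized in a small Python sketch. It models only the register updates and flush signals described on these slides. The function name, dictionary keys for the cause codes, and the example address 0x50 are assumptions for illustration.

```python
EXCEPTION_VECTOR = 0x80000180  # 8000 0180hex, as given on the slide

CAUSE_CODES = {
    "address_error_load": 4,
    "address_error_store": 5,
    "reserved_instruction": 10,
    "arithmetic_overflow": 12,
    "floating_point": 15,
}

def overflow_exception(offending_pc):
    """State changes when the instruction at `offending_pc` overflows in EX."""
    return {
        "EPC": offending_pc + 4,  # address of the following instruction
        "Cause": CAUSE_CODES["arithmetic_overflow"],
        "PC": EXCEPTION_VECTOR,   # extra PC MUX input selects the handler
        "IF.Flush": True,         # discard the instruction being fetched
        "ID.Flush": True,         # zero control signals in the ID stage
        "EX.Flush": True,         # squash the offending instruction itself
    }

# Hypothetical case: an add at address 0x50 overflows
state = overflow_exception(0x50)
```

All three flush signals are raised here because the overflow is detected in EX; an exception detected earlier in the pipeline would not need EX.Flush.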

Pipeline Changes to accommodate Overflow


Branch
EX.Flush

80000180

0 0

Overflow

Pipeline Changes to accommodate Unknown Instruction


Branch
EX.Flush (LOW)

80000180

0 0

Pipelined exception example: and completes

80000180

54

Overflow

50 add causes an overflow

OS instruction fetched

Pipelined exception example or completes

80000180

80000184

80000184

80000180

Pipelining Speed-ups

One way to speed up pipelines is to have more stages (up to eight), which results in shorter clock cycles. Another way is superscalar architectures, which have a CPI less than 1: multiple instructions can be launched at the same time (multiple issue), so the instruction execution rate exceeds the clock rate. We then talk of the number of Instructions per Clock Cycle (IPC) instead of CPI. Architectures try to issue 3 to 8 instructions every clock cycle. A third way is to balance load through dynamic pipeline scheduling, to avoid hazards (stalls). The price for these speed-ups is more hardware, more complicated control and a more complicated instruction execution model. If instructions are launched in pairs, only the first instruction is launched if dynamic conditions are not met.

Static Multiple Issue


Used in embedded processors and VLIW processors.
Can improve performance by up to 200%.
Layout is restricted to simplify decoding and instruction issue.
Instructions are issued in pairs, aligned on a 64-bit boundary, with the ALU or branch portion first. If one instruction of the pair cannot be used, it is replaced by a no-op.
The hardware detects data hazards and generates stalls between two issue packets, but the compiler is required to avoid all dependencies within an instruction pair.
A load will cause the next two instructions to stall if they use the loaded word.
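The pairing rule can be illustrated with a very simple packer. This is a naive in-order sketch, not a real compiler scheduler: it pairs an ALU instruction with the load/store that immediately follows it only when the two are independent, and otherwise fills the empty slot with a no-op. The tuple encoding, the choice never to pair a branch, and all names are assumptions.

```python
NOP = ("nop", None, ())
ALU_OPS = {"add", "sub", "and", "or", "slt", "addi"}
BRANCH_OPS = {"beq"}
MEMORY_OPS = {"lw", "sw"}

def pack_two_issue(program):
    """Group (opcode, destination, sources) tuples into (ALU/branch, load/store) packets."""
    packets = []
    i = 0
    while i < len(program):
        first = program[i]
        nxt = program[i + 1] if i + 1 < len(program) else None
        if first[0] in MEMORY_OPS:
            packets.append((NOP, first))      # no ALU work available for slot 1
            i += 1
            continue
        can_pair = (
            nxt is not None
            and first[0] in ALU_OPS           # branches are left unpaired here
            and nxt[0] in MEMORY_OPS
            and first[1] not in nxt[2]        # memory op must not read the ALU result
            and first[1] != nxt[1]            # and must not write the same register
        )
        if can_pair:
            packets.append((first, nxt))
            i += 2
        else:
            packets.append((first, NOP))
            i += 1
    return packets

program = [
    ("add", "$1", ("$2", "$3")),
    ("lw",  "$4", ("$5",)),
    ("sub", "$6", ("$4", "$1")),
    ("sw",  None, ("$6", "$7")),   # reads $6, so it cannot share a packet with the sub
]
packets = pack_two_issue(program)
```

Dependencies between different packets, such as the sub using the word loaded by the lw in the previous packet, are left to the hardware stall logic, as the slide describes.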

CC 7 add

CC 8

lw

beq

sw

sub

lw

Static two-issue datapath

We need two output ports for Instruction memory, two more read and one more write ports for the Register file, two ALUs (one handles address computation for Data memory access), and two sign-extending units

Three Primary Units of Dynamically Scheduled Pipeline

Dynamic pipeline scheduling chooses which instruction to execute next, re-ordering them to avoid stalls
Buffer holding all the operands and the operation

Results are sent to other reservation stations or to the commit unit. The commit unit buffers results until it is safe to put them in the register file or in data memory (for a store).

The commit unit also serves as a forwarding station for operands that are needed before they have been written back to the register file.

AMD Opteron X4 12-stage pipeline


Speculative pipeline that executes 3 instructions/clock cycle

Register renaming removes antidependencies. In case of incorrect speculation, the mapping between architectural and physical registers is undone.

Memory address calculation

Actual memory access

Intel Core pipeline


Each core can execute 4 instructions simultaneously
A Core Duo can execute 8 instructions simultaneously
Better branch prediction
Enhanced ALU
Lower power consumption

http://www.extremetech.com/article2/0,1697,1988744,00.asp
