CENG 331 Course Slides Chapter 5

Chapter - 5
The Processor:
Datapath and Control
11/99
Computer Organization & Architecture
Ch.5 - 1.0
Outline

Design a processor: step-by-step

Requirements of the Instruction Set
Components and Clocking
Assembling an Adequate Datapath
Controlling the Datapath
The Big Picture: Where are We Now?

The Five Classic Components of a Computer

Processor
Input
Control
Memory
Datapath
Output
This chapter's topic: design a single-cycle processor

[Figure: processor design draws on machine design, arithmetic, technology, and instruction set design]
The Big Picture: The Performance Perspective

Performance of a machine is determined by:

Instruction count
Clock cycle time
Clock cycles per instruction (CPI)

Processor design (datapath and control) will determine:

Clock cycle time
Clock cycles per instruction
In this chapter ...

Single cycle processor:

Advantage: One clock cycle per instruction
Disadvantage: long cycle time
How to Design a Processor: step-by-step

1. Analyze instruction set => datapath requirements

the meaning of each instruction is given by the register transfers

datapath must include storage element for ISA registers (possibly more)
datapath must support each register transfer
2. Select set of datapath components and establish clocking methodology
3. Assemble datapath meeting the requirements
4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer.
5. Assemble the control logic
Single Cycle Datapath
The MIPS Instruction Formats

All MIPS instructions are 32 bits long. The three instruction formats:

R-type:  op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | rd (15-11, 5 bits) | shamt (10-6, 5 bits) | funct (5-0, 6 bits)
I-type:  op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)
J-type:  op (bits 31-26, 6 bits) | target address (25-0, 26 bits)

The different fields are:

op: operation of the instruction

rs, rt, rd: the source and destination register specifiers
shamt: shift amount
funct: selects the variant of the operation in the op field
address / immediate: address offset or immediate value
target address: target address of the jump instruction
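
The field layout above can be sketched as a small decoder. This is only an illustrative sketch: the function name `decode` and the dictionary keys are assumptions made for this example, and the sample word is the standard encoding of `lw $1, 100($2)`.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into its named fields.

    All fields are extracted regardless of format; which ones are
    meaningful depends on whether the word is R-, I-, or J-type.
    """
    return {
        "op": (word >> 26) & 0x3F,      # bits 31-26
        "rs": (word >> 21) & 0x1F,      # bits 25-21
        "rt": (word >> 16) & 0x1F,      # bits 20-16
        "rd": (word >> 11) & 0x1F,      # bits 15-11 (R-type)
        "shamt": (word >> 6) & 0x1F,    # bits 10-6  (R-type)
        "funct": word & 0x3F,           # bits 5-0   (R-type)
        "immediate": word & 0xFFFF,     # bits 15-0  (I-type)
        "target": word & 0x3FFFFFF,     # bits 25-0  (J-type)
    }


fields = decode(0x8C410064)  # lw $1, 100($2)
print(fields["op"], fields["rs"], fields["rt"], fields["immediate"])
```

Extracting every field unconditionally mirrors the hardware, where the field wires are always present and the control logic decides which ones matter.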
Step 1a: The MIPS-lite Subset

ADD and SUB

addU rd, rs, rt
subU rd, rs, rt
Format: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | rd (15-11, 5 bits) | shamt (10-6, 5 bits) | funct (5-0, 6 bits)

OR Immediate

ori rt, rs, imm16
Format: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)

LOAD and STORE Word

lw rt, rs, imm16
sw rt, rs, imm16
Format: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)

BRANCH

beq rs, rt, imm16
Format: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)
Logical Register Transfers

RTL gives the meaning of the instructions

All start by fetching the instruction:
op | rs | rt | rd | shamt | funct = MEM[ PC ]
op | rs | rt | Imm16 = MEM[ PC ]

inst     Register Transfers
ADDU     R[rd] ← R[rs] + R[rt];                    PC ← PC + 4
SUBU     R[rd] ← R[rs] - R[rt];                    PC ← PC + 4
ORi      R[rt] ← R[rs] | zero_ext(Imm16);          PC ← PC + 4
LOAD     R[rt] ← MEM[ R[rs] + sign_ext(Imm16) ];   PC ← PC + 4
STORE    MEM[ R[rs] + sign_ext(Imm16) ] ← R[rt];   PC ← PC + 4
BEQ      if ( R[rs] == R[rt] ) then PC ← PC + 4 + ( sign_ext(Imm16) || 00 )
         else PC ← PC + 4
Step 1: Requirements of the Instruction Set

Memory: instruction & data
Registers (32 x 32-bit): read RS, read RT, write RT or RD
PC
Extender
Add and Sub register or extended immediate
Add 4 or extended immediate to PC
Step 2: Components of the Datapath

Combinational Elements
Storage Elements
Clocking methodology
Simple Implementation
Include the functional units we need for each instruction

[Figure: functional units. (a) instruction memory, (b) program counter, (c) adder; (a) data memory unit with MemWrite and MemRead, (b) sign-extension unit, 16 to 32 bits; (a) register file with RegWrite, (b) ALU with ALU control, Zero, and ALU result]

Why do we need this stuff?
Our Implementation
An edge-triggered methodology
Typical execution:

read contents of some state elements,
send values through some combinational logic,
write results to one or more state elements

[Figure: state element 1 feeds combinational logic, which feeds state element 2, all within one clock cycle]
More Implementation Details

Abstract / Simplified View:

[Figure: PC supplies the address to the instruction memory; the instruction supplies register numbers to the register file; register data feeds the ALU; the ALU output addresses the data memory; data returns to the register file]

Two types of functional units:

elements that operate on data values (combinational)
elements that contain state (sequential)
Combinational Logic Elements

[Figure: three combinational elements, all 32 bits wide. Adder: inputs A, B, CarryIn; outputs Sum, Carry. MUX: inputs A, B, Select; one output. ALU: inputs A, B, OP; output Result]
Storage Element: Register

Register
Similar to the D flip-flop except:

N-bit input and output
Write Enable input

Write Enable:
negated (0): Data Out will not change
asserted (1): Data Out will become Data In

[Figure: register with Data In, Data Out, Write Enable, and Clk]
Storage Element: Register File

Register File consists of 32 registers:

Two 32-bit output busses: busA and busB
One 32-bit input bus: busW

Register is selected by:

RA (number) selects the register to put on busA (data)
RB (number) selects the register to put on busB (data)
RW (number) selects the register to be written via busW (data) when Write Enable is 1

Clock input (CLK)

The CLK input is a factor ONLY during write operation
During read operation, behaves as a combinational logic block:
RA or RB valid => busA or busB valid after "access time."

[Figure: 32 x 32-bit register file with 5-bit RW, RA, RB inputs, 32-bit busW input, 32-bit busA and busB outputs, Write Enable, and Clk]
Register File
Built using D flip-flops
[Figure: register file read ports. Read register number 1 and Read register number 2 each drive a multiplexor that selects among Register 0, Register 1, ..., Register n-1, Register n, producing Read data 1 and Read data 2]
Register File
Note: we still use the real clock to determine when to write

[Figure: register file write port. An n-to-1 decoder on the register number, gated with Write, drives the clock input C of Register 0, Register 1, ..., Register n-1, Register n; Register data feeds every D input]
Storage Element: Idealized Memory

Memory (idealized)

One input bus: Data In
One output bus: Data Out

Memory word is selected by:

Address selects the word to put on Data Out
Write Enable = 1: address selects the memory word to be written via the Data In bus

Clock input (CLK)

The CLK input is a factor ONLY during write operation
During read operation, behaves as a combinational logic block:
Address valid => Data Out valid after "access time."

[Figure: idealized memory with Address, 32-bit Data In, 32-bit Data Out, Write Enable, and Clk]
Clocking Methodology
[Figure: timing diagram. Each clock edge is surrounded by a Setup window before it and a Hold window after it; inputs are "Don't Care" outside those windows]

All storage elements are clocked by the same clock edge
Cycle Time = CLK-to-Q + Longest Delay Path + Setup + Clock Skew
(CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time
Critical Path & Cycle Time

[Figure: storage elements clocked by Clk with combinational logic paths between them]

Critical path: the slowest path between any two storage devices
Cycle time is a function of the critical path and must be greater than:
Clock-to-Q + Longest Path through Combinational Logic + Setup
Clock Skew's Effect on Cycle Time

[Figure: the same logic clocked by Clk1 and Clk2, which are offset from each other by the clock skew]

The worst case scenario for cycle time consideration:

The input register sees CLK1
The output register sees CLK2

Cycle Time - Clock Skew ≥ CLK-to-Q + Longest Delay + Setup
Cycle Time ≥ CLK-to-Q + Longest Delay + Setup + Clock Skew
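
The cycle-time and hold-time constraints from the clocking slides can be checked numerically. The sketch below is illustrative only: the function names and all delay values (in picoseconds) are hypothetical, chosen just to exercise the two inequalities.

```python
def min_cycle_time(clk_to_q, longest_delay, setup, clock_skew):
    """Smallest cycle time satisfying
    Cycle Time >= CLK-to-Q + Longest Delay + Setup + Clock Skew."""
    return clk_to_q + longest_delay + setup + clock_skew


def hold_time_met(clk_to_q, shortest_delay, clock_skew, hold):
    """True when (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time."""
    return clk_to_q + shortest_delay - clock_skew > hold


# Hypothetical delays in picoseconds.
cycle = min_cycle_time(clk_to_q=100, longest_delay=1500, setup=50, clock_skew=30)
safe = hold_time_met(clk_to_q=100, shortest_delay=60, clock_skew=30, hold=40)
unsafe = hold_time_met(clk_to_q=100, shortest_delay=10, clock_skew=80, hold=40)
print(cycle, safe, unsafe)
```

Note the asymmetry: skew is added to the cycle-time bound but subtracted in the hold check, so a larger skew hurts both constraints.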
Control
Selecting the operations to perform (ALU, read/write, etc.)
Controlling the flow of data (multiplexor inputs)
Information comes from the 32 bits of the instruction
Example: add $8, $17, $18

Instruction Format:

000000   10001   10010   01000   00000   100000
op       rs      rt      rd      shamt   funct

ALU's operation based on instruction type and function code
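
As a quick check of the encoding of `add $8, $17, $18`, the fields can be packed into a 32-bit word. This is a minimal sketch; the helper name `encode_r_type` is an assumption made for this example.

```python
def encode_r_type(rs, rt, rd, shamt, funct, op=0):
    """Pack R-type fields into one 32-bit instruction word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# add $8, $17, $18  ->  rd = 8, rs = 17, rt = 18, funct = 100000
word = encode_r_type(rs=17, rt=18, rd=8, shamt=0, funct=0b100000)
bits = format(word, "032b")
print(hex(word), bits)
```

Reading `bits` in groups of 6, 5, 5, 5, 5, and 6 reproduces the op, rs, rt, rd, shamt, and funct fields of the slide.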
Step 3: Assemble DataPath

Register Transfer Requirements

Datapath Assembly
Instruction Fetch
Read Operands and Execute Operation
3a: Overview of the Instruction Fetch Unit

The common RTL operations

Fetch the Instruction: mem[PC]
Update the program counter:
Sequential Code: PC ← PC + 4
Branch and Jump: PC ← "something else"

[Figure: the PC, clocked by Clk, supplies the Address to the Instruction Memory, which outputs the 32-bit Instruction Word; Next Address Logic computes the new PC]
3b: Add & Subtract

R[rd] ← R[rs] op R[rt]

Example: addU rd, rs, rt

Ra, Rb, and Rw come from the instruction's rs, rt, and rd fields
ALUctr and RegWr: control logic after decoding the instruction

Format: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | rd (15-11, 5 bits) | shamt (10-6, 5 bits) | funct (5-0, 6 bits)

[Figure: Rd, Rs, Rt (5 bits each) drive Rw, Ra, Rb of the 32 x 32-bit register file; busA and busB (32 bits) feed the ALU, controlled by ALUctr; the ALU Result returns on busW; RegWr and Clk control the write]
Register-Register Timing: One complete cycle

[Figure: timing diagram for one clock cycle. After the clock edge, each signal changes from its old value to its new value in this order:
PC: after Clk-to-Q
Rs, Rt, Rd, Op, Func: after the Instruction Memory Access Time
ALUctr, RegWr: after the Delay through Control Logic
busA, busB: after the Register File Access Time
busW: after the ALU Delay
The register write occurs at the next clock edge]
3c: Logical Operations with Immediate

R[rt] ← R[rs] op ZeroExt[imm16]

Format: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)
ZeroExt[imm16]: 0000000000000000 (bits 31-16) | immediate (bits 15-0)

[Figure: a mux controlled by RegDst selects Rd or Rt as Rw; Rs drives Ra; busA feeds the ALU; a mux controlled by ALUSrc selects busB or ZeroExt(imm16) as the second ALU input; the ALU Result returns on busW]
3d: Load Operations

R[rt] ← Mem[R[rs] + SignExt[imm16]]

Example: lw rt, rs, imm16

Format: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)

[Figure: the RegDst mux selects Rd or Rt as Rw; the Extender, controlled by ExtOp, widens imm16 to 32 bits; the ALUSrc mux selects busB or the extended immediate; the ALU output drives Adr of the Data Memory (WrEn = MemWr, Data In, Clk); the W_Src mux selects the ALU output or the memory output for busW]
3e: Store Operations

Mem[ R[rs] + SignExt[imm16] ] ← R[rt]

Example: sw rt, rs, imm16

Format: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)

[Figure: same datapath as for load, with busB connected to Data In of the Data Memory and MemWr driving WrEn; control signals shown: RegDst, RegWr, ALUctr, ExtOp, ALUSrc, MemWr, W_Src]
3f: The Branch Instruction

beq rs, rt, imm16

Format: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)

mem[PC]                    Fetch the instruction from memory
Equal ← R[rs] == R[rt]     Calculate the branch condition
if (Equal)                 Calculate the next instruction's address
    PC ← PC + 4 + ( SignExt(imm16) x 4 )
else
    PC ← PC + 4
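
The next-address calculation for beq can be sketched in a few lines. The helper names `sign_extend_16` and `next_pc` and the example addresses are assumptions made for this illustration; the arithmetic follows the RTL above.

```python
def sign_extend_16(imm16):
    """Interpret the low 16 bits of imm16 as a two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc, imm16, equal):
    """PC + 4 + (SignExt(imm16) x 4) when the branch is taken, else PC + 4."""
    if equal:
        return (pc + 4 + (sign_extend_16(imm16) << 2)) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF


print(hex(next_pc(0x1000, 3, True)))       # forward branch over 3 instructions
print(hex(next_pc(0x1000, 0xFFFF, True)))  # imm16 = -1: branch back to itself
print(hex(next_pc(0x1000, 3, False)))      # not taken
```

Shifting left by 2 is the same as appending `00`, which is why the slides write the target both as `SignExt(imm16) x 4` and as `sign_ext(Imm16) || 00`.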
Datapath for Branch Operations

beq rs, rt, imm16

Datapath generates condition (equal)

Format: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)

[Figure: the PC feeds Inst Address and an adder that adds 4; a second adder adds PC Ext(imm16) with 00 appended; a mux controlled by nPC_sel selects the next PC. Rs and Rt read the 32 x 32-bit register file; busA and busB are compared to produce Cond (Equal?)]
Putting it All Together: A Single Cycle Datapath
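[Figure: complete single-cycle datapath. The instruction memory (Adr) outputs Instruction bits 31-0, split into Rs (bits 21-25), Rt (bits 16-20), Rd (bits 11-15), and Imm16 (bits 0-15). The fetch unit holds the PC, two adders, PC Ext, and the nPC_sel mux. The RegDst mux selects Rd or Rt as Rw of the 32 x 32-bit register file; busA and busB feed the comparator (Equal) and the ALU; the Extender (ExtOp) and ALUSrc mux supply the second ALU input; the ALU output addresses the Data Memory (WrEn = MemWr, Data In = busB); the MemtoReg mux drives busW. Control signals: nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg]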
Building the Datapath

Use multiplexors to stitch them together

[Figure: datapath built from the functional units. PC, instruction memory, and an adder (PC + 4); a Shift left 2 unit and a second adder for the branch target, selected by the PCSrc mux; register file (RegWrite); sign extend (16 to 32 bits); ALUSrc mux; ALU (3-bit ALU operation, Zero, ALU result); data memory (MemWrite, MemRead); MemtoReg mux]
An Abstract View of the Critical Path

Register file and ideal memory:

During read operation, behave as combinational logic:
Address valid => Output valid after "access time."

[Figure: Next Address logic and PC feed the Ideal Instruction Memory, which supplies Rd, Rs, Rt (5 bits each) and Imm (16 bits) to the 32 x 32-bit register file (Rw, Ra, Rb); the ALU output addresses the Ideal Data Memory (Data In, Data Out); all state elements are clocked by Clk]

Critical Path (Load Operation) =
PC's Clk-to-Q +
Instruction Memory's Access Time +
Register File's Access Time +
ALU to Perform a 32-bit Add +
Data Memory Access Time +
Setup Time for Register File Write +
Clock Skew
An Abstract View of the Implementation
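[Figure: Control and Datapath blocks. The Ideal Instruction Memory supplies the Instruction to Control and Rd, Rs, Rt (5 bits each) to the Datapath; Control sends Control Signals to the Datapath and receives Conditions back; the Datapath contains Next Address logic, PC, the 32 x 32-bit register file (Rw, Ra, Rb), the ALU, and the Ideal Data Memory (Address, Data In, Data Out), all clocked by Clk]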
Step 4: Given Datapath: RTL Control

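[Figure: the instruction memory (Adr) outputs Instruction bits 31-0, split into Op, Fun, Rs (bits 21-25), Rt (bits 16-20), Rd (bits 11-15), and Imm16 (bits 0-15). Control takes Op and Fun, plus Equal from the Datapath, and drives nPC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, and MemtoReg into the Datapath]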
Meaning of the Control Signals

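Rs, Rt, Rd and Imm16 hardwired into datapath

nPC_sel:
0 => PC ← PC + 4;
1 => PC ← PC + 4 + ( SignExt(Imm16) || 00 )

[Figure: instruction fetch unit. The PC drives Addr of the Inst Memory; one adder computes PC + 4, a second adds PC Ext(imm16) with 00 appended, and the nPC_sel mux selects the next PC]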
Meaning of the Control Signals

ExtOp:     "zero", "sign"
ALUsrc:    0 => regB; 1 => immed
ALUctr:    "add", "sub", "or"
MemWr:     write memory
MemtoReg:  1 => Mem
RegDst:    0 => "rt"; 1 => "rd"
RegWr:     write dest register

[Figure: execution part of the datapath, showing the RegDst mux (Rd or Rt), the 32 x 32-bit register file, the comparator producing Equal, the Extender (ExtOp), the ALUSrc mux, the ALU (ALUctr), the Data Memory (MemWr), and the MemtoReg mux driving busW]
Control Signals
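inst     Register Transfer

ADD      R[rd] ← R[rs] + R[rt]; PC ← PC + 4
         ALUsrc = RegB, ALUctr = "add", RegDst = rd, RegWr, nPC_sel = "+4"

SUB      R[rd] ← R[rs] - R[rt]; PC ← PC + 4
         ALUsrc = ___, Extop = __, ALUctr = ___, RegDst = ___, RegWr(?), MemtoReg(?), MemWr(?), nPC_sel = __

ORi      R[rt] ← R[rs] | zero_ext(Imm16); PC ← PC + 4
         ALUsrc = ___, Extop = __, ALUctr = ___, RegDst = ___, RegWr(?), MemtoReg(?), MemWr(?), nPC_sel = __

LOAD     R[rt] ← MEM[ R[rs] + sign_ext(Imm16) ]; PC ← PC + 4
         ALUsrc = ___, Extop = __, ALUctr = ___, RegDst = ___, RegWr(?), MemtoReg(?), MemWr(?), nPC_sel = __

STORE    MEM[ R[rs] + sign_ext(Imm16) ] ← R[rt]; PC ← PC + 4
         ALUsrc = ___, Extop = __, ALUctr = ___, RegDst = ___, RegWr(?), MemtoReg(?), MemWr(?), nPC_sel = __

BEQ      if ( R[rs] == R[rt] ) then PC ← PC + 4 + ( sign_ext(Imm16) || 00 ) else PC ← PC + 4
         ALUsrc = ___, Extop = __, ALUctr = ___, RegDst = ___, RegWr(?), MemtoReg(?), MemWr(?), nPC_sel = __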
Control Signals (Answer)

inst     Register Transfer

ADD      R[rd] ← R[rs] + R[rt]; PC ← PC + 4
         ALUsrc = RegB, ALUctr = "add", RegDst = rd, RegWr, nPC_sel = "+4"

SUB      R[rd] ← R[rs] - R[rt]; PC ← PC + 4
         ALUsrc = RegB, ALUctr = "sub", RegDst = rd, RegWr, nPC_sel = "+4"

ORi      R[rt] ← R[rs] | zero_ext(Imm16); PC ← PC + 4
         ALUsrc = Im, Extop = "Z", ALUctr = "or", RegDst = rt, RegWr, nPC_sel = "+4"

LOAD     R[rt] ← MEM[ R[rs] + sign_ext(Imm16) ]; PC ← PC + 4
         ALUsrc = Im, Extop = "Sn", ALUctr = "add", MemtoReg, RegDst = rt, RegWr, nPC_sel = "+4"

STORE    MEM[ R[rs] + sign_ext(Imm16) ] ← R[rt]; PC ← PC + 4
         ALUsrc = Im, Extop = "Sn", ALUctr = "add", MemWr, nPC_sel = "+4"

BEQ      if ( R[rs] == R[rt] ) then PC ← PC + 4 + ( sign_ext(Imm16) || 00 ) else PC ← PC + 4
         nPC_sel = EQUAL, ALUctr = "sub"
Step 5: Logic for each control signal

nPC_sel:   if (OP == BEQ) then EQUAL else 0
ALUsrc:    if (OP == "000000") then "regB" else "immed"
ALUctr:    if (OP == "000000") then funct
           elseif (OP == ORi) then "OR"
           elseif (OP == BEQ) then "sub"
           else "add"
ExtOp:     if (OP == ORi) then "zero" else "sign"
MemWr:     (OP == Store)
MemtoReg:  (OP == Load)
RegWr:     if ((OP == Store) || (OP == BEQ)) then 0 else 1
RegDst:    if ((OP == Load) || (OP == ORi)) then 0 else 1
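
The per-signal equations of Step 5 translate almost line for line into a function of the opcode. This is a sketch under stated assumptions: the function name `control`, the dictionary layout, and the `FUNCT_TO_ALUCTR` table are invented for this example, and the opcode and funct constants are the standard MIPS encodings for ori, lw, sw, beq, addu, and subu rather than values given on this slide.

```python
# Standard MIPS opcodes (assumed here; the slide uses symbolic names).
OP_RTYPE = 0b000000
OP_BEQ = 0b000100
OP_ORI = 0b001101
OP_LW = 0b100011
OP_SW = 0b101011

# Hypothetical mapping from funct to ALUctr for the two R-type instructions.
FUNCT_TO_ALUCTR = {0b100001: "add", 0b100011: "sub"}  # addu, subu


def control(op, funct=0, equal=False):
    """Return the MIPS-lite control signals for one instruction."""
    if op == OP_RTYPE:
        alu_ctr = FUNCT_TO_ALUCTR[funct]
    elif op == OP_ORI:
        alu_ctr = "or"
    elif op == OP_BEQ:
        alu_ctr = "sub"
    else:
        alu_ctr = "add"
    return {
        "nPC_sel": int(equal) if op == OP_BEQ else 0,
        "ALUsrc": "regB" if op == OP_RTYPE else "immed",
        "ALUctr": alu_ctr,
        "ExtOp": "zero" if op == OP_ORI else "sign",
        "MemWr": int(op == OP_SW),
        "MemtoReg": int(op == OP_LW),
        "RegWr": 0 if op in (OP_SW, OP_BEQ) else 1,
        "RegDst": 0 if op in (OP_LW, OP_ORI) else 1,
    }


print(control(OP_LW))
```

Signals that an instruction does not care about (for example RegDst for a store) still get a value here, just as they do in the slide's equations.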
Example: Load Instruction
[Figure: the complete single-cycle datapath annotated with the control settings for a load: nPC_sel = +4, RegDst = rt, RegWr = 1, ExtOp = sign ext, ALUSrc = 1, ALUctr = add, MemWr = 0, MemtoReg = 1]
An Abstract View of the Implementation

[Figure: Control and Datapath blocks. The Ideal Instruction Memory supplies the Instruction to Control and Rd, Rs, Rt (5 bits each) to the Datapath; Control sends Control Signals and receives Conditions; the Datapath contains Next Address logic, PC, the 32 x 32-bit register file (Rw, Ra, Rb), the ALU, and the Ideal Data Memory (Address, Data In, Data Out), all clocked by Clk]

Logical vs. Physical Structure
Summary
5 steps to design a processor

1. Analyze instruction set => datapath requirements
2. Select set of datapath components & establish clock methodology
3. Assemble datapath meeting the requirements
4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer.
5. Assemble the control logic

MIPS makes it easier

Instructions same size
Source registers always in same place
Immediates same size, location
Operations always on registers/immediates

Single cycle datapath => CPI = 1, CCT => long

Next topic: implementing control
Control

e.g., what should the ALU do with this instruction

Example: lw $1, 100($2)

35    2     1     100
op    rs    rt    16 bit offset

ALU control input (5 of the possible 8 input combinations):

000   AND
001   OR
010   add
110   subtract
111   set-on-less-than

Why is the code for subtract 110 and not 011?
Control
Must describe hardware to compute 3-bit ALU control input

given instruction type (ALUOp, computed from instruction type):
00 = lw, sw
01 = beq
10 = arithmetic
function code for arithmetic

Describe it using a truth table (can turn into gates):

ALUOp1  ALUOp0 | F5  F4  F3  F2  F1  F0 | Operation
0       0      | X   X   X   X   X   X  | 010
X       1      | X   X   X   X   X   X  | 110
1       X      | X   X   0   0   0   0  | 010
1       X      | X   X   0   0   1   0  | 110
1       X      | X   X   0   1   0   0  | 000
1       X      | X   X   0   1   0   1  | 001
1       X      | X   X   1   0   1   0  | 111
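
The ALU control truth table can be written as a small lookup function. This is an illustrative sketch: the name `alu_control` and the `FUNCT_LOW4` table are assumptions for this example. Rows with ALUOp0 = 1 are checked before rows with ALUOp1 = 1, which only matters for the unused combination ALUOp = 11.

```python
# Low four funct bits (F3..F0) -> 3-bit ALU operation, from the truth table.
FUNCT_LOW4 = {
    0b0000: 0b010,  # add
    0b0010: 0b110,  # subtract
    0b0100: 0b000,  # AND
    0b0101: 0b001,  # OR
    0b1010: 0b111,  # set-on-less-than
}


def alu_control(alu_op, funct):
    """Compute the 3-bit ALU control input from ALUOp and the funct field."""
    alu_op1 = (alu_op >> 1) & 1
    alu_op0 = alu_op & 1
    if alu_op1 == 0 and alu_op0 == 0:
        return 0b010            # lw, sw: add to form the address
    if alu_op0 == 1:
        return 0b110            # beq: subtract to compare
    return FUNCT_LOW4[funct & 0b1111]  # arithmetic: F5 and F4 are don't-cares


print(format(alu_control(0b10, 0b101010), "03b"))
```

Masking off F5 and F4 mirrors the X entries in those columns of the table.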
Control
[Figure: single-cycle datapath with the Control unit. Instruction bits 31-26 feed Control, which outputs RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite. Bits 25-21 and 20-16 select the read registers; the RegDst mux selects bits 20-16 or 15-11 as the write register; bits 15-0 go through sign extend; bits 5-0 feed ALU control]

Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format    | 1      | 0      | 0        | 1        | 0       | 0        | 0      | 1      | 0
lw          | 0      | 1      | 1        | 1        | 1       | 0        | 0      | 0      | 0
sw          | X      | 1      | X        | 0        | 0       | 1        | 0      | 0      | 0
beq         | X      | 0      | X        | 0        | 0       | 0        | 1      | 0      | 1
Control
Simple combinational logic (truth tables)

[Figure: two gate-level blocks. ALU control block: inputs ALUOp0, ALUOp1 and funct bits F3-F0 (from F (5-0)); outputs Operation2, Operation1, Operation0. Main control: inputs Op5-Op0, decoded into R-format, lw, sw, beq; outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0]
Our Simple Control Structure

All of the logic is combinational
We wait for everything to settle down, and the right thing to be done
ALU might not produce "right answer" right away
we use write signals along with clock to determine when to write
Cycle time determined by length of the longest path

[Figure: state element 1 feeds combinational logic, which feeds state element 2, all within one clock cycle]

We are ignoring some details like setup and hold times
Single Cycle Implementation

Calculate cycle time assuming negligible delays except:

memory (2ns), ALU and adders (2ns), register file access (1ns)
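
One way to work the exercise is to add up the delays along each instruction class's path. The sketch below is illustrative: the delay values come from the slide, but the `STEPS` table (which units each class uses) and all names are assumptions made for this example.

```python
# Delays from the slide, in nanoseconds; everything else is negligible.
DELAY_NS = {
    "instruction_fetch": 2,  # memory
    "register_read": 1,      # register file access
    "alu": 2,                # ALU and adders
    "data_memory": 2,        # memory
    "register_write": 1,     # register file access
}

# Assumed sequence of units on each instruction class's path.
STEPS = {
    "R-type": ["instruction_fetch", "register_read", "alu", "register_write"],
    "lw": ["instruction_fetch", "register_read", "alu", "data_memory", "register_write"],
    "sw": ["instruction_fetch", "register_read", "alu", "data_memory"],
    "beq": ["instruction_fetch", "register_read", "alu"],
}

path_ns = {name: sum(DELAY_NS[unit] for unit in units) for name, units in STEPS.items()}
cycle_time_ns = max(path_ns.values())  # the slowest instruction sets the clock
print(path_ns, cycle_time_ns)
```

Under these assumptions the load is the critical path, which matches the earlier critical-path slide where the load operation defines the cycle time.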
[Figure: single-cycle datapath used for the calculation: PC, instruction memory, adders for PC + 4 and the branch target (Shift left 2, PCSrc mux), RegDst mux, register file (RegWrite), sign extend, ALUSrc mux, ALU with ALU control (ALUOp, Instruction bits 5-0), data memory (MemWrite, MemRead), and MemtoReg mux]
Where we are headed

Single Cycle Problems:

what if we had a more complicated instruction like floating point?
wasteful of area

One Solution:

use a "smaller" cycle time
have different instructions take different numbers of cycles
a "multicycle" datapath:

[Figure: multicycle datapath. PC, a single Memory for instructions or data, Instruction register, Memory data register, register file, registers A and B, ALU, and ALUOut]
Multicycle Approach
We will be reusing functional units

ALU used to compute address and to increment PC
Memory used for instruction and data

Our control signals will not be determined solely by instruction

e.g., what should the ALU do for a "subtract" instruction?

We'll use a finite state machine for control
Review: finite state machines

Finite state machines:

a set of states and
next state function (determined by current state and the input)
output function (determined by current state and possibly input)

[Figure: the current state and the inputs feed the next-state function, which produces the next state, latched on the clock; the output function produces the outputs]

We'll use a Moore machine (output based only on current state)
Review: finite state machines

Example:
A friend would like you to build an "electronic eye" for use as a fake security device. The device consists of three lights lined up in a row, controlled by the outputs Left, Middle, and Right, which, if asserted, indicate that a light should be on. Only one light is on at a time, and the light "moves" from left to right and then from right to left, thus scaring away thieves who believe that the device is monitoring their activity. Draw the graphical representation for the finite state machine used to specify the electronic eye. Note that the rate of the eye's movement will be controlled by the clock speed (which should not be too great) and that there are essentially no inputs.
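
One possible solution to the electronic eye exercise is a four-state Moore machine, sketched below. The state names and the generator `electronic_eye` are assumptions made for this illustration. The middle light needs two states because the next state depends on the direction of travel.

```python
from itertools import islice

# Next-state function: no inputs, so it depends on the current state only.
NEXT_STATE = {
    "LEFT": "MIDDLE_GOING_RIGHT",
    "MIDDLE_GOING_RIGHT": "RIGHT",
    "RIGHT": "MIDDLE_GOING_LEFT",
    "MIDDLE_GOING_LEFT": "LEFT",
}

# Moore output function: (Left, Middle, Right) depends on the state only.
OUTPUT = {
    "LEFT": (1, 0, 0),
    "MIDDLE_GOING_RIGHT": (0, 1, 0),
    "RIGHT": (0, 0, 1),
    "MIDDLE_GOING_LEFT": (0, 1, 0),
}


def electronic_eye(start="LEFT"):
    """Yield the (Left, Middle, Right) outputs, one tuple per clock cycle."""
    state = start
    while True:
        yield OUTPUT[state]
        state = NEXT_STATE[state]


first_six = list(islice(electronic_eye(), 6))
print(first_six)
```

With four states, two state bits are enough to encode this machine.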
Multicycle Approach
Break up the instructions into steps, each step takes a cycle

balance the amount of work to be done
restrict each cycle to use only one major functional unit

At the end of a cycle

store values for use in later cycles (easiest thing to do)
introduce additional "internal" registers

[Figure: multicycle datapath. PC; a mux selecting the memory address; Memory (MemData, Write data); Instruction register with fields bits 25-21, 20-16, 15-11, and 15-0; Memory data register; register file; registers A and B; sign extend and Shift left 2; a 2-way mux and a 4-way mux (inputs B, 4, sign-extended, shifted) on the ALU inputs; ALU (Zero, ALU result); ALUOut]
Five Execution Steps

Instruction Fetch
Instruction Decode and Register Fetch
Execution, Memory Address Computation, or Branch Completion
Memory Access or R-type instruction completion
Write-back step

INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
Step 1: Instruction Fetch

Use PC to get instruction and put it in the Instruction Register.
Increment the PC by 4 and put the result back in the PC.
Can be described by using RTL "Register-Transfer Language"

IR = Memory[PC];
PC = PC + 4;

Think about these!

Can we figure out the values of the control signals?
What is the advantage of updating the PC now?
Step 2: Instruction Decode and Register Fetch

Read registers rs and rt in case we need them
Compute the branch address in case the instruction is a branch

RTL:
A = Reg[IR[25-21]];
B = Reg[IR[20-16]];
ALUOut = PC + (sign-extend(IR[15-0]) << 2);

We aren't setting any control lines based on the instruction type
(we are busy "decoding" it in our control logic)
Step 3 (instruction dependent)

ALU is performing one of three functions, based on instruction type

Memory Reference:
ALUOut = A + sign-extend(IR[15-0]);
R-type:
ALUOut = A op B;
Branch:
if (A==B) PC = ALUOut;
Step 4 (R-type or memory-access)

Loads and stores access memory

MDR = Memory[ALUOut];
or
Memory[ALUOut] = B;

R-type instructions finish

Reg[IR[15-11]] = ALUOut;

The write actually takes place at the end of the cycle on the edge.
Write-back step
Reg[IR[20-16]]= MDR;
What about all the other instructions?
Summary:

Step: Instruction fetch
  All instructions:  IR = Memory[PC]
                     PC = PC + 4

Step: Instruction decode/register fetch
  All instructions:  A = Reg[IR[25-21]]
                     B = Reg[IR[20-16]]
                     ALUOut = PC + (sign-extend(IR[15-0]) << 2)

Step: Execution, address computation, branch/jump completion
  R-type:            ALUOut = A op B
  Memory-reference:  ALUOut = A + sign-extend(IR[15-0])
  Branches:          if (A == B) then PC = ALUOut
  Jumps:             PC = PC[31-28] || (IR[25-0] << 2)

Step: Memory access or R-type completion
  R-type:            Reg[IR[15-11]] = ALUOut
  Memory-reference:  Load: MDR = Memory[ALUOut]
                     or
                     Store: Memory[ALUOut] = B

Step: Memory read completion
  Memory-reference:  Load: Reg[IR[20-16]] = MDR
Simple Questions
How many cycles will it take to execute this code?

Label:  lw  $t2, 0($t3)
        lw  $t3, 4($t3)
        beq $t2, $t3, Label    #assume not
        add $t5, $t2, $t3
        sw  $t5, 8($t3)
        ...

What is going on during the 8th cycle of execution?
In what cycle does the actual addition of $t2 and $t3 take place?
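
These questions can be answered by laying the instructions end to end. The sketch below assumes the step counts implied by the step summary (lw 5, sw 4, R-type 4, beq 3) and a branch that is not taken; the names `CYCLES`, `PROGRAM`, and `schedule` are made up for this example.

```python
# Cycles per instruction class in the multicycle design.
CYCLES = {"lw": 5, "sw": 4, "r-type": 4, "beq": 3}

PROGRAM = [
    ("lw $t2, 0($t3)", "lw"),
    ("lw $t3, 4($t3)", "lw"),
    ("beq $t2, $t3, Label", "beq"),  # assumed not taken
    ("add $t5, $t2, $t3", "r-type"),
    ("sw $t5, 8($t3)", "sw"),
]


def schedule(program):
    """Return (text, first_cycle, last_cycle) for each instruction in order."""
    spans, start = [], 1
    for text, kind in program:
        end = start + CYCLES[kind] - 1
        spans.append((text, start, end))
        start = end + 1
    return spans


spans = schedule(PROGRAM)
total_cycles = spans[-1][2]
# The ALU performs the add in the third step (execution) of the add instruction.
add_start = spans[3][1]
addition_cycle = add_start + 2
print(total_cycles, spans[1], addition_cycle)
```

Cycle 8 falls inside the second lw (cycles 6 to 10) as its third step, the memory address computation.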
Implementing the Control

Value of control signals is dependent upon:

what instruction is being executed
which step is being performed

Use the information we've accumulated to specify a finite state machine

specify the finite state machine graphically, or
use microprogramming

Implementation can be derived from specification
Graphical Specification of FSM

[Figure: state diagram of the control finite state machine, summarized below]

State 0, Instruction fetch (Start): MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00
State 1, Instruction decode/register fetch: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
State 2, Memory address computation: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
State 3, Memory access (load): MemRead, IorD = 1
State 4, Write-back step: RegDst = 0, RegWrite, MemtoReg = 1
State 5, Memory access (store): MemWrite, IorD = 1
State 6, Execution: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
State 7, R-type completion: RegDst = 1, RegWrite, MemtoReg = 0
State 8, Branch completion: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
State 9, Jump completion: PCWrite, PCSource = 10

Transitions:
0 => 1
1 => 2 if (Op = 'LW') or (Op = 'SW')
1 => 6 if (Op = R-type)
1 => 8 if (Op = 'BEQ')
1 => 9 if (Op = 'J')
2 => 3 if (Op = 'LW')
2 => 5 if (Op = 'SW')
3 => 4
6 => 7
4, 5, 7, 8, 9 => 0

How many state bits will we need?
Finite State Machine for Control

Implementation:

[Figure: a block of control logic. Inputs: Op5-Op0 from the instruction register opcode field and S3-S0 from the state register. Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and the next state NS3-NS0, which feeds the state register]
PLA Implementation
If I picked a horizontal or vertical line could you explain it?

[Figure: PLA. Inputs (AND plane): Op5-Op0 and S3-S0. Outputs (OR plane): PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1, PCSource0, ALUOp1, ALUOp0, ALUSrcB1, ALUSrcB0, ALUSrcA, RegWrite, RegDst, NS3, NS2, NS1, NS0]
ROM Implementation
ROM = "Read Only Memory"
A ROM can be used to implement a truth table

values of memory locations are fixed ahead of time
if the address is m-bits, we can address 2^m entries in the ROM.
our outputs are the bits of data that the address points to.

Address (m = 3) | Data (n = 4)
0 0 0           | 0 0 1 1
0 0 1           | 1 1 0 0
0 1 0           | 1 1 0 0
0 1 1           | 1 0 0 0
1 0 0           | 0 0 0 0
1 0 1           | 0 0 0 1
1 1 0           | 0 1 1 0
1 1 1           | 0 1 1 1

m is the "height", and n is the "width"
ROM Implementation
How many inputs are there?

6 bits for opcode, 4 bits for state = 10 address lines
(i.e., 2^10 = 1024 different addresses)

How many outputs are there?

16 datapath-control outputs, 4 state bits = 20 outputs

ROM is 2^10 x 20 = 20K bits (and a rather unusual size)
Rather wasteful, since for lots of the entries, the outputs are the same
i.e., opcode is often ignored
ROM vs PLA
Break up the table into two parts

4 state bits tell you the 16 outputs, 2^4 x 16 bits of ROM
10 bits tell you the 4 next state bits, 2^10 x 4 bits of ROM
Total: 4.3K bits of ROM

PLA is much smaller

can share product terms
only need entries that produce an active output
can take into account don't cares

Size is (#inputs x #product-terms) + (#outputs x #product-terms)
For this example = (10x17)+(20x17) = 510 PLA cells

PLA cells usually about the size of a ROM cell (slightly bigger)
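
The size comparison is plain arithmetic and can be reproduced directly. The function names below are assumptions for this sketch; the input counts (10 address bits, 20 outputs, 17 product terms) are the ones used on the ROM and PLA slides.

```python
def rom_bits(address_bits, width):
    """A ROM with m address bits and n-bit entries holds 2^m x n bits."""
    return (2 ** address_bits) * width


def pla_cells(inputs, outputs, product_terms):
    """(#inputs x #product-terms) + (#outputs x #product-terms)."""
    return inputs * product_terms + outputs * product_terms


single_rom = rom_bits(10, 20)                 # one ROM: opcode + state -> all outputs
split_rom = rom_bits(4, 16) + rom_bits(10, 4)  # outputs ROM + next-state ROM
pla = pla_cells(inputs=10, outputs=20, product_terms=17)
print(single_rom, split_rom, pla)
```

Dividing by 1024 gives 20K bits for the single ROM and about 4.3K bits for the split version.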
Another Implementation Style

Complex instructions: the "next state" is often current state + 1

[Figure: control unit built from a PLA or ROM. Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, BWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and AddrCtl. The State register is updated by address select logic, which uses an adder (state + 1), AddrCtl, and Op[5-0] from the instruction register opcode field]
Details
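Dispatch ROM 1
Op      | Opcode name | Value
000000  | R-format    | 0110
000010  | jmp         | 1001
000100  | beq         | 1000
100011  | lw          | 0010
101011  | sw          | 0010

Dispatch ROM 2
Op      | Opcode name | Value
100011  | lw          | 0011
101011  | sw          | 0101

State number | Address-control action    | Value of AddrCtl
0            | Use incremented state     | 3
1            | Use dispatch ROM 1        | 1
2            | Use dispatch ROM 2        | 2
3            | Use incremented state     | 3
4            | Replace state number by 0 | 0
5            | Replace state number by 0 | 0
6            | Use incremented state     | 3
7            | Replace state number by 0 | 0
8            | Replace state number by 0 | 0
9            | Replace state number by 0 | 0

[Figure: address select logic. A 4-input mux controlled by AddrCtl selects the next State from: 0 (input 0), Dispatch ROM 1 (input 1), Dispatch ROM 2 (input 2), or State + 1 from the adder (input 3); the dispatch ROMs are indexed by Op from the instruction register opcode field]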
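
The address select logic can be simulated with the dispatch ROM contents and AddrCtl values from this slide. This is a sketch: the Python names (`next_state`, `state_sequence`, and the table names) are assumptions for this example, and the AddrCtl list pairs states 0 to 9 with the values 3, 1, 2, 3, 0, 0, 3, 0, 0, 0 in order.

```python
# Dispatch ROM contents, keyed by opcode.
DISPATCH_ROM_1 = {0b000000: 6, 0b000010: 9, 0b000100: 8, 0b100011: 2, 0b101011: 2}
DISPATCH_ROM_2 = {0b100011: 3, 0b101011: 5}

# AddrCtl value for states 0..9.
ADDR_CTL = [3, 1, 2, 3, 0, 0, 3, 0, 0, 0]


def next_state(state, opcode):
    """Select the next state the way the AddrCtl-controlled mux does."""
    ctl = ADDR_CTL[state]
    if ctl == 0:
        return 0                       # replace state number by 0
    if ctl == 1:
        return DISPATCH_ROM_1[opcode]  # use dispatch ROM 1
    if ctl == 2:
        return DISPATCH_ROM_2[opcode]  # use dispatch ROM 2
    return state + 1                   # use incremented state


def state_sequence(opcode):
    """List the states visited by one instruction, starting at state 0."""
    states = [0]
    while True:
        following = next_state(states[-1], opcode)
        if following == 0:
            return states
        states.append(following)


print(state_sequence(0b100011))  # lw
print(state_sequence(0b000000))  # R-format
```

The length of each sequence is the instruction's cycle count, which is how the 3-to-5-cycle range arises from the sequencer.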
Microprogramming
[Figure: control unit built around a microcode memory. Outputs to the Datapath: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, BWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst; AddrCtl goes to the address select logic. The Microprogram counter is updated from an adder (+1) or from Op[5-0] of the instruction register opcode field]

What are the "microinstructions"?
Microprogramming (Maurice Wilkes)

Control is the hard part of processor design

Datapath is fairly regular and well-organized
Memory is highly regular
Control is irregular and global

Microprogramming:
-- A particular strategy for implementing the control unit of a processor by "programming" at the level of register transfer operations

Microarchitecture:
-- Logical structure and functional capabilities of the hardware as seen by the microprogrammer

Historical Note:
IBM 360 Series first to distinguish between architecture & organization
Same instruction set across wide range of implementations, each with different cost/performance
Macroinstruction Interpretation
[Figure: Main Memory holds the user program plus data (ADD, SUB, AND, ..., DATA); "this can change!" Each macroinstruction (one of these) is mapped into a microsequence in the CPU's control memory (one of these), for example the AND microsequence: Fetch, Calc Operand Addr, Fetch Operand(s), Calculate, Save Answer(s). The control memory drives the execution unit]
Microprogramming
A specification methodology

appropriate if hundreds of opcodes, modes, cycles, etc.
signals specified symbolically using microinstructions

Label    | ALU control | SRC1 | SRC2    | Register control | Memory    | PCWrite control | Sequencing
Fetch    | Add         | PC   | 4       |                  | Read PC   | ALU             | Seq
         | Add         | PC   | Extshft | Read             |           |                 | Dispatch 1
Mem1     | Add         | A    | Extend  |                  |           |                 | Dispatch 2
LW2      |             |      |         |                  | Read ALU  |                 | Seq
         |             |      |         | Write MDR        |           |                 | Fetch
SW2      |             |      |         |                  | Write ALU |                 | Fetch
Rformat1 | Func code   | A    | B       |                  |           |                 | Seq
         |             |      |         | Write ALU        |           |                 | Fetch
BEQ1     | Subt        | A    | B       |                  |           | ALUOut-cond     | Fetch
JUMP1    |             |      |         |                  |           | Jump address    | Fetch

Will two implementations of the same architecture have the same microcode?
What would a microassembler do?
Microinstruction format

Field name | Value | Signals active | Comment
ALU control | Add | ALUOp = 00 | Cause the ALU to add.
ALU control | Subt | ALUOp = 01 | Cause the ALU to subtract; this implements the compare for branches.
ALU control | Func code | ALUOp = 10 | Use the instruction's function code to determine ALU control.
SRC1 | PC | ALUSrcA = 0 | Use the PC as the first ALU input.
SRC1 | A | ALUSrcA = 1 | Register A is the first ALU input.
SRC2 | B | ALUSrcB = 00 | Register B is the second ALU input.
SRC2 | 4 | ALUSrcB = 01 | Use 4 as the second ALU input.
SRC2 | Extend | ALUSrcB = 10 | Use output of the sign extension unit as the second ALU input.
SRC2 | Extshft | ALUSrcB = 11 | Use the output of the shift-by-two unit as the second ALU input.
Register control | Read | none | Read two registers using the rs and rt fields of the IR as the register numbers and putting the data into registers A and B.
Register control | Write ALU | RegWrite, RegDst = 1, MemtoReg = 0 | Write a register using the rd field of the IR as the register number and the contents of the ALUOut as the data.
Register control | Write MDR | RegWrite, RegDst = 0, MemtoReg = 1 | Write a register using the rt field of the IR as the register number and the contents of the MDR as the data.
Memory | Read PC | MemRead, IorD = 0 | Read memory using the PC as address; write result into IR (and the MDR).
Memory | Read ALU | MemRead, IorD = 1 | Read memory using the ALUOut as address; write result into MDR.
Memory | Write ALU | MemWrite, IorD = 1 | Write memory using the ALUOut as address, contents of B as the data.
PC write control | ALU | PCSource = 00, PCWrite | Write the output of the ALU into the PC.
PC write control | ALUOut-cond | PCSource = 01, PCWriteCond | If the Zero output of the ALU is active, write the PC with the contents of the register ALUOut.
PC write control | jump address | PCSource = 10, PCWrite | Write the PC with the jump address from the instruction.
Sequencing | Seq | AddrCtl = 11 | Choose the next microinstruction sequentially.
Sequencing | Fetch | AddrCtl = 00 | Go to the first microinstruction to begin a new instruction.
Sequencing | Dispatch 1 | AddrCtl = 01 | Dispatch using the ROM 1.
Sequencing | Dispatch 2 | AddrCtl = 10 | Dispatch using the ROM 2.
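
One way to read the microinstruction format is as a lookup table that a microassembler expands into control signals. The sketch below is only an illustration under assumptions: the names FIELDS and assemble are hypothetical, the signal the slide spells "lorD" is written IorD, and the sample microinstruction is just one plausible fetch-style combination of field values.

```python
# Hypothetical microassembler sketch: each symbolic field value expands
# into the control signals listed in the microinstruction format table.
FIELDS = {
    "ALU control": {
        "Add": {"ALUOp": 0b00},
        "Subt": {"ALUOp": 0b01},
        "Func code": {"ALUOp": 0b10},
    },
    "SRC1": {"PC": {"ALUSrcA": 0}, "A": {"ALUSrcA": 1}},
    "SRC2": {
        "B": {"ALUSrcB": 0b00},
        "4": {"ALUSrcB": 0b01},
        "Extend": {"ALUSrcB": 0b10},
        "Extshft": {"ALUSrcB": 0b11},
    },
    "Register control": {
        "Read": {},  # no control signal is asserted for a register read
        "Write ALU": {"RegWrite": 1, "RegDst": 1, "MemtoReg": 0},
        "Write MDR": {"RegWrite": 1, "RegDst": 0, "MemtoReg": 1},
    },
    "Memory": {
        "Read PC": {"MemRead": 1, "IorD": 0},
        "Read ALU": {"MemRead": 1, "IorD": 1},
        "Write ALU": {"MemWrite": 1, "IorD": 1},
    },
    "PC write control": {
        "ALU": {"PCSource": 0b00, "PCWrite": 1},
        "ALUOut-cond": {"PCSource": 0b01, "PCWriteCond": 1},
        "jump address": {"PCSource": 0b10, "PCWrite": 1},
    },
    "Sequencing": {
        "Seq": {"AddrCtl": 0b11},
        "Fetch": {"AddrCtl": 0b00},
        "Dispatch 1": {"AddrCtl": 0b01},
        "Dispatch 2": {"AddrCtl": 0b10},
    },
}


def assemble(microinstruction):
    """Expand {field name: value} into a dict of control signal settings."""
    signals = {}
    for field, value in microinstruction.items():
        signals.update(FIELDS[field][value])
    return signals


# Illustrative fetch-style microinstruction (an assumption, not taken
# verbatim from the slides): read memory at PC and compute PC + 4.
fetch = assemble({
    "ALU control": "Add",
    "SRC1": "PC",
    "SRC2": "4",
    "Memory": "Read PC",
    "PC write control": "ALU",
    "Sequencing": "Seq",
})
print(fetch)
```

Fields left out of a microinstruction simply contribute no signals, which is roughly how a blank field in the symbolic microprogram would be treated.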
Horizontal vs. Vertical Microprogramming

NOTE: the previous organization is not TRUE horizontal microprogramming; the register decoders give a flavor of encoded microoperations
Most microprogramming-based controllers vary between:
  horizontal organization (1 control bit per control point)
  vertical organization (fields encoded in the control memory and must be decoded to control something)
Horizontal
  + more control over the potential parallelism of operations in the datapath
  - uses up lots of control store
Vertical
  + easier to program, not very different from programming a RISC machine in assembly language
  - extra level of decoding may slow the machine down

Maximally vs. Minimally Encoded

No encoding:
  1 bit for each datapath operation
  faster, requires more memory (logic)
  used for the VAX 780: an astonishing 400K of memory!
Lots of encoding:
  send the microinstructions through logic to get control signals
  uses less memory, slower
Historical context of CISC:
  Too much logic to put on a single chip with everything else
  Use a ROM (or even RAM) to hold the microcode
  It's easy to add new instructions

Designing a Microinstruction Set

Start with the list of control signals
Group signals together that make sense (vs. random): these groups are called "fields"
Place fields in some logical order (e.g., ALU operation & ALU operands first and microinstruction sequencing last)
Create a symbolic legend for the microinstruction format, showing the name of each field value and how it sets the control signals
  Use computers to design computers
To minimize the width, encode operations that will never be used at the same time

Microcode: Trade-offs

Distinction between specification and implementation is sometimes blurred
Specification advantages:
  Easy to design and write
  Design architecture and microcode in parallel
Implementation (off-chip ROM) advantages:
  Easy to change since values are in memory
  Can emulate other architectures
  Can make use of internal registers
Implementation disadvantages: SLOWER now that:
  Control is implemented on the same chip as the processor
  ROM is no longer faster than RAM
  No need to go back and make changes

Exceptions

Figure: a user program is interrupted by an exception, control passes to the system exception handler, and a return from exception resumes the user program.
Normal control flow: sequential, jumps, branches, calls, returns
Exception = unprogrammed control transfer
  system takes action to handle the exception
    must record the address of the offending instruction
  returns control to user
    must save & restore user state
Allows construction of a "user virtual machine"

What Happens to an Instruction with an Exception?

The MIPS architecture defines the instruction as having no effect if the instruction causes an exception.
When we get to virtual memory we will see that certain classes of exceptions must prevent the instruction from changing the machine state.
This aspect of handling exceptions becomes complex and potentially limits performance, which is why it is hard.

Two Types of Exceptions

Interrupts
  caused by external events
  asynchronous to program execution
  may be handled between instructions
  simply suspend and resume user program
Traps
  caused by internal events
    exceptional conditions (overflow)
    errors (parity)
    faults (non-resident page)
  synchronous to program execution
  condition must be remedied by the handler
  instruction may be retried or simulated and program continued, or program may be aborted

MIPS convention:
  exception means any unexpected change in control flow, without distinguishing internal or external;
  use the term interrupt only when the event is externally caused.

Type of event | From where? | MIPS terminology
I/O device request | External | Interrupt
Invoke OS from user program | Internal | Exception
Arithmetic overflow | Internal | Exception
Using an undefined instruction | Internal | Exception
Hardware malfunctions | Either | Exception or Interrupt

Addressing the Exception Handler

Traditional Approach: Interrupt Vector
  PC <= MEM[ IV_base + cause || 00 ]
  370, 68000, Vax, 80x86, . . .
RISC Handler Table
  PC <= IT_base + cause || 0000
  saves state and jumps
  Sparc, PA, M88K, . . .
MIPS Approach: fixed entry
  PC <= EXC_addr
  Actually very small table: RESET entry, TLB, other
Figure: in the two table schemes, cause indexes from iv_base either to a handler entry (vector) or directly to handler code (handler table).

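
The three addressing schemes differ only in how the new PC is computed from the cause value. The following is a minimal sketch, assuming that `||` denotes bit concatenation (so `cause || 00` is the cause shifted left by two bits) and modelling memory as a plain dictionary; the function names and sample addresses are hypothetical.

```python
def vector_entry(memory, iv_base, cause):
    """Interrupt vector: PC <= MEM[IV_base + cause || 00].

    The table holds handler addresses, one 4-byte word per cause.
    """
    return memory[iv_base + (cause << 2)]


def handler_table_entry(it_base, cause):
    """RISC handler table: PC <= IT_base + cause || 0000.

    The table holds handler code itself, 16 bytes per cause.
    """
    return it_base + (cause << 4)


def fixed_entry(exc_addr):
    """MIPS approach: PC <= EXC_addr, regardless of cause."""
    return exc_addr


# Hypothetical values, chosen only to exercise the formulas.
memory = {0x100 + (3 << 2): 0x4000}
print(hex(vector_entry(memory, 0x100, 3)))
print(hex(handler_table_entry(0x1000, 3)))
print(hex(fixed_entry(0x80000080)))
```

With a fixed entry point the handler has to read the Cause register in software to decide what happened, which is the trade-off against the table-based schemes.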
Saving State

Push it onto the stack
  Vax, 68k, 80x86
Save it in special registers
  MIPS EPC, BadVaddr, Status, Cause
Shadow Registers
  M88k
  Save state in a shadow of the internal pipeline registers

Additions to MIPS ISA to Support Exceptions

EPC: a 32-bit register used to hold the address of the affected instruction (register 14 of coprocessor 0).
Cause: a register used to record the cause of the exception. In the MIPS architecture this register is 32 bits, though some bits are currently unused. Assume that bits 5 to 2 of this register encode the two possible exception sources mentioned above: undefined instruction = 0 and arithmetic overflow = 1 (register 13 of coprocessor 0).
BadVAddr: register containing the memory address at which the memory reference occurred (register 8 of coprocessor 0).
Status: interrupt mask and enable bits (register 12 of coprocessor 0).
Control signals to write EPC, Cause, BadVAddr, and Status.
Be able to write the exception address into the PC: widen the PC mux to add as an input 10000000 00000000 00000000 10000000two (8000 0080hex).
May have to undo PC <= PC + 4, since we want EPC to point to the offending instruction (not its successor): PC <= PC - 4.

Big Picture: user / system modes

By providing two modes of execution (user/system) it is possible for the computer to manage itself
  the operating system is a special program that runs in the privileged mode and has access to all of the resources of the computer
  presents "virtual resources" to each user that are more convenient than the physical resources
    files vs. disk sectors
    virtual memory vs. physical memory
  protects each user program from others
Exceptions allow the system to take action in response to events that occur while the user program is executing
  O/S begins at the handler

Precise Interrupts

Precise: state of the machine is preserved as if the program executed up to the offending instruction
  All previous instructions completed
  Offending instruction and all following instructions act as if they have not even started
  Same system code will work on different implementations
  Position clearly established by IBM
  Difficult in the presence of pipelining, out-of-order execution, ...
  MIPS takes this position
Imprecise: system software has to figure out what is where and put it all back together
Performance goals often lead designers to forsake precise interrupts
  system software developers, users, markets etc. usually wish they had not done this
Modern techniques for out-of-order execution and branch prediction help implement precise interrupts

How Control Detects Exceptions in our FSD

Undefined instruction: detected when no next state is defined from state 1 for the op value.
  We handle this exception by defining the next state value for all op values other than lw, sw, 0 (R-type), jmp, beq, and ori as new state 12.
  Shown symbolically using "other" to indicate that the op field does not match any of the opcodes that label arcs out of state 1.
Arithmetic overflow
  Chapter 4 included logic in the ALU to detect overflow, and a signal called Overflow is provided as an output from the ALU.
  This signal is used in the modified finite state machine to specify an additional possible next state.
Note: the challenge in designing control of a real machine is to handle different interactions between instructions and other exception-causing events such that the control logic remains small and fast.
  Complex interactions make the control unit the most challenging aspect of hardware design.

Modification to the Control Specification

Figure: state diagram of the multicycle control with two added exception states. The register transfers on each path are:
  Fetch: IR <= MEM[PC]; PC <= PC + 4
  Decode: A <= R[rs]; B <= R[rt]
  R-type: S <= A fun B; then R[rd] <= S
    overflow (additional condition from Datapath): EPC <= PC - 4; PC <= exp_addr; cause <= 12 (Ovf)
  ORi: S <= A op ZX; then R[rt] <= S
  LW: S <= A + SX; then M <= MEM[S]; then R[rt] <= M
  SW: S <= A + SX; then MEM[S] <= B
  BEQ: S <= A - B (state label 0010); if Equal, PC <= PC + SX || 00 (state label 0011); if ~Equal, no PC update
  other (undefined instruction): EPC <= PC - 4; PC <= exp_addr; cause <= 10 (RI)

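
The two exception paths in the modified control specification perform the same three register transfers with a different cause value. The sketch below models that behaviour in software; it is an assumption-laden illustration, not the control logic itself: the state dictionary, the function names, and the opcode labels are hypothetical, while the cause codes 10 (RI) and 12 (Ovf) and the exception address 8000 0080hex are the values quoted on the slides.

```python
EXC_ADDR = 0x80000080   # exception entry point quoted on the slides
CAUSE_RI = 10           # undefined (reserved) instruction
CAUSE_OVF = 12          # arithmetic overflow

# Opcodes with a defined arc out of the decode state (hypothetical labels).
KNOWN_OPS = {"lw", "sw", "rtype", "j", "beq", "ori"}


def take_exception(state, cause):
    """EPC <= PC - 4; PC <= exp_addr; cause <= code."""
    state["EPC"] = state["PC"] - 4   # PC was already incremented in fetch
    state["Cause"] = cause
    state["PC"] = EXC_ADDR


def add_overflows(a, b):
    """Signed overflow of a 32-bit add, with a and b given as bit patterns."""
    total = (a + b) & 0xFFFFFFFF
    sign = lambda x: (x >> 31) & 1
    return sign(a) == sign(b) and sign(total) != sign(a)


def after_decode(state, op):
    """Take the 'other' arc when the opcode is not recognised."""
    if op not in KNOWN_OPS:
        take_exception(state, CAUSE_RI)
        return "exception"
    return op


state = {"PC": 0x00400004, "EPC": 0, "Cause": 0}
print(after_decode(state, "mul"), hex(state["EPC"]), hex(state["PC"]))
```

Subtracting 4 when writing EPC reflects the note that the PC has already been advanced during fetch, so the saved address points at the offending instruction rather than its successor.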
Summary

Specialized state diagrams are easily captured by a microsequencer
  simple increment & "branch" fields
  datapath control fields
Control design reduces to microprogramming
Exceptions are the hard part of control
  Need to find a convenient place to detect exceptions and to branch to the state or microinstruction that saves the PC and invokes the operating system
For pipelined CPUs that support page faults on memory accesses, it gets even harder:
  Need precise interrupts: the instruction cannot complete AND you must be able to restart the program at exactly the instruction with the exception

Summary: Microprogramming one inspiration for RISC

If simple instructions could execute at a very high clock rate
If you could even write compilers to produce microinstructions
If most programs use simple instructions and addressing modes
If microcode is kept in RAM instead of ROM so as to fix bugs
If the same memory used for control memory could be used instead as a cache for "macroinstructions"
Then why not skip instruction interpretation by a microprogram and simply compile directly into the lowest language of the machine?

The Big Picture

Step | Hardwired alternative | Microprogrammed alternative
Initial representation | Finite state diagram | Microprogram
Sequencing control | Explicit next state function | Microprogram counter + dispatch ROMs
Logic representation | Logic equations | Truth tables
Implementation technique | Programmable logic array | Read only memory

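Basic Components: CMOS Inverter

Figure: inverter symbol (In, Out) and circuit, with a PMOS transistor between Vdd and Out and an NMOS transistor between Out and ground.
Figure: inverter operation. With the NMOS open the output charges to Vdd; with the PMOS open the output discharges. Plot of Vout versus Vin, both ranging up to Vdd.
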
Basic Components: CMOS Logic Gates

NAND Gate
A | B | Out
0 | 0 | 1
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

NOR Gate
A | B | Out
0 | 0 | 1
0 | 1 | 0
1 | 0 | 0
1 | 1 | 0

Figure: transistor-level circuits for both gates, each with inputs A and B, output Out, and supply Vdd.

Gate Comparison

Figure: NAND gate and NOR gate transistor circuits (inputs A and B, output Out, supply Vdd).
If PMOS transistors are faster:
  It is OK to have PMOS transistors in series
  NOR gate is preferred
  NOR gate is preferred also if H -> L is more critical than L -> H
If NMOS transistors are faster:
  It is OK to have NMOS transistors in series
  NAND gate is preferred
  NAND gate is preferred also if L -> H is more critical than H -> L

Ideal versus Reality

When input 0 -> 1, output 1 -> 0 but NOT instantly
  Output goes 1 -> 0: output voltage goes from Vdd (5v) to 0v
When input 1 -> 0, output 0 -> 1 but NOT instantly
  Output goes 0 -> 1: output voltage goes from 0v to Vdd (5v)
Voltage does not like to change instantaneously
Figure: Vin and Vout versus time for an inverter, between 0 => GND and 1 => Vdd.

Fluid Timing Model

Figure: a reservoir at level Vdd feeds a tank (Cout) through switch SW1; the tank drains through switch SW2 into a bottomless sea at sea level (GND). The tank level is Vout.
Water <-> Electrical Charge
Tank Capacity <-> Capacitance (C)
Water Level <-> Voltage
Water Flow <-> Charge Flowing (Current)
Size of Pipes <-> Strength of Transistors (G)
Time to fill up the tank is proportional to C / G

Series Connection

Figure: two inverters in series (strengths G1 and G2), with Vin driving the first, node V1 (capacitance C1) between them, and Vout (capacitance Cout) at the end. Voltage versus time plot of Vin, V1, and Vout between GND and Vdd, with delays d1 and d2 measured at Vdd/2.
Total Propagation Delay = Sum of individual delays = d1 + d2
Capacitance C1 has two components:
  Capacitance of the wire connecting the two gates
  Input capacitance of the second inverter

Review: Calculating Delays

Figure: inverter G1 drives node V1 (capacitance C1), which fans out to inverters G2 (output V2) and G3 (output V3).
Sum delays along serial paths
Delay (Vin -> V2) != Delay (Vin -> V3)
  Delay (Vin -> V2) = Delay (Vin -> V1) + Delay (V1 -> V2)
  Delay (Vin -> V3) = Delay (Vin -> V1) + Delay (V1 -> V3)
Critical Path = The longest among the N parallel paths
C1 = Wire C + Cin of Gate 2 + Cin of Gate 3

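
The rule of summing delays along serial paths and taking the longest parallel path can be sketched in a few lines. The stage delays below are hypothetical numbers in nanoseconds, chosen only to show the calculation; the path names follow the slide.

```python
# Hypothetical stage delays in ns (not taken from the slides).
d_vin_v1 = 0.3
d_v1_v2 = 0.4
d_v1_v3 = 0.6

# Sum delays along each serial path.
paths = {
    "Vin -> V2": d_vin_v1 + d_v1_v2,
    "Vin -> V3": d_vin_v1 + d_v1_v3,
}

# The critical path is the longest among the parallel paths.
critical = max(paths, key=paths.get)
print(critical, paths[critical])
```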
Review: General C/L Cell Delay Model

Figure: combinational logic cell with inputs A, B, ... and output Vout loaded by Cout. Plot of delay (Va -> Vout) versus Cout: a straight line starting at the internal delay, with slope equal to the delay per unit load, up to Ccritical.
Combinational Cell (symbol) is fully specified by:
  functional (input -> output) behavior
    truth-table, logic equation, VHDL
  load factor of each input
  critical propagation delay from each input to each output for each transition
    THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
Linear model composes

Characterize a Gate

Input capacitance for each input
For each input-to-output path:
  For each output transition type (H->L, L->H, H->Z, L->Z ... etc.)
    Internal delay (ns)
    Load dependent delay (ns / fF)
Example: 2-input NAND Gate
  For A and B: Input Load (I.L.) = 61 fF
  For either A -> Out or B -> Out:
    Tlh = 0.5ns  Tlhf = 0.0021ns / fF
    Thl = 0.1ns  Thlf = 0.0020ns / fF
Figure: Delay A -> Out (Out: Low -> High) versus Cout, a line with intercept 0.5ns and slope 0.0021ns / fF.

A Specific Example: 2 to 1 MUX

Y = (A and !S) or (B and S)
Figure: 2 x 1 Mux built from an inverter on S (Wire 0), Gate 1 and Gate 2 (Wire 1 and Wire 2), and Gate 3 driving Y.
Input Load (I.L.)
  A, B: I.L. (NAND) = 61 fF
  S: I.L. (INV) + I.L. (NAND) = 50 fF + 61 fF = 111 fF
Load Dependent Delay (L.D.D.): Same as Gate 3
  TAYlhf = 0.0021 ns / fF    TAYhlf = 0.0020 ns / fF
  TBYlhf = 0.0021 ns / fF    TBYhlf = 0.0020 ns / fF
  TSYlhf = 0.0021 ns / fF

2 to 1 MUX: Internal Delay Calculation

Y = (A and !S) or (B and S)
Figure: 2 x 1 Mux built from an inverter on S (Wire 0), Gate 1 and Gate 2 (Wire 1 and Wire 2), and Gate 3 driving Y.
Internal Delay (I.D.):
  A to Y: I.D. G1 + (Wire 1 C + G3 Input C) * L.D.D. G1 + I.D. G3
  B to Y: I.D. G2 + (Wire 2 C + G3 Input C) * L.D.D. G2 + I.D. G3
  S to Y (Worst Case): I.D. Inv + (Wire 0 C + G1 Input C) * L.D.D. Inv + Internal Delay A to Y
We can approximate the effect of "Wire 1 C" by:
  Assume Wire 1 has the same C as all the gate C attached to it.

2 to 1 MUX: Internal Delay Calculation (continued)

Y = (A and !S) or (B and S)
Internal Delay (I.D.):
  A to Y: I.D. G1 + (Wire 1 C + G3 Input C) * L.D.D. G1 + I.D. G3
  B to Y: I.D. G2 + (Wire 2 C + G3 Input C) * L.D.D. G2 + I.D. G3
  S to Y (Worst Case): I.D. Inv + (Wire 0 C + G1 Input C) * L.D.D. Inv + Internal Delay A to Y
Specific Example:
  TAYlh = TPhl G1 + (2.0 * 61 fF) * TPhlf G1 + TPlh G3
        = 0.1ns + 122 fF * 0.0020 ns/fF + 0.5ns = 0.844 ns

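
The specific example can be reproduced with the linear delay model, Delay = Internal Delay + (Load Dependent Delay x Output Load). The sketch below uses the NAND figures quoted on the slides (61 fF input load, Thl = 0.1 ns, Thlf = 0.0020 ns/fF, Tlh = 0.5 ns) and the stated approximation that Wire 1 has the same capacitance as the gate input attached to it; the function and variable names are hypothetical.

```python
def gate_delay(internal_ns, ldd_ns_per_ff, load_ff):
    """Linear model: internal delay plus load-dependent delay times load."""
    return internal_ns + ldd_ns_per_ff * load_ff


NAND_INPUT_LOAD_FF = 61.0   # input load of one NAND input
T_HL_NS = 0.1               # internal delay, output high -> low
T_HLF_NS_PER_FF = 0.0020    # load-dependent delay, high -> low
T_LH_NS = 0.5               # internal delay, output low -> high

# Load seen by Gate 1: Wire 1 plus the Gate 3 input, with the wire
# assumed to equal the gate capacitance (hence the factor 2.0).
load_on_g1 = 2.0 * NAND_INPUT_LOAD_FF

# A rising Y needs Gate 1 to fall and then Gate 3 to rise. Only the
# internal delay of Gate 3 is counted, since the load on Y is external.
t_ay_lh = gate_delay(T_HL_NS, T_HLF_NS_PER_FF, load_on_g1) + T_LH_NS
print(round(t_ay_lh, 3))  # 0.844
```

The same function can be reused for the remaining paths by swapping in the rise or fall parameters of the gates involved.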
Abstraction: 2 to 1 MUX

Figure: the three-gate circuit (Gate 1, Gate 2, Gate 3, inputs A, B, S) is abstracted into a single 2 x 1 Mux cell.
Input Load: A = 61 fF, B = 61 fF, S = 111 fF
Load Dependent Delay:
  TAYlhf = 0.0021 ns / fF    TAYhlf = 0.0020 ns / fF
  TBYlhf = 0.0021 ns / fF    TBYhlf = 0.0020 ns / fF
  TSYhlf = 0.0020 ns / fF
Internal Delay:
  TAYlh = TPhl G1 + (2.0 * 61 fF) * TPhlf G1 + TPlh G3
        = 0.1ns + 122 fF * 0.0020ns/fF + 0.5ns = 0.844ns
  Fun Exercises: TAYhl, TBYlh, TSYlh, TSYhl

CS152 Logic Elements

NAND2, NAND3, NAND4
NOR2, NOR3, NOR4
INV1x (normal inverter)
INV4x (inverter with large output drive)
XOR2
XNOR2
PWR: Source of 1s
GND: Source of 0s
fast MUXes
D flip-flop, negative edge triggered

Storage Elements' Timing Model

Figure: timing diagram of Clk, D, and Q. D is "Don't Care" outside the Setup and Hold window around the clock edge; Q is Unknown until Clock-to-Q has elapsed.
Setup Time: Input must be stable BEFORE the trigger clock edge
Hold Time: Input must REMAIN stable after the trigger clock edge
Clock-to-Q time:
  Output cannot change instantaneously at the trigger clock edge
  Similar to delay in logic gates, two components:
    Internal Clock-to-Q
    Load dependent Clock-to-Q
Typical for class: 1ns Setup, 0.5ns Hold

Clocking Methodology

Figure: blocks of combination logic between banks of storage elements, all driven by Clk.
All storage elements are clocked by the same clock edge
The combination logic blocks:
  Inputs are updated at each clock tick
  All outputs MUST be stable before the next clock tick

Tricks to Reduce Cycle Time

Reduce the number of gate levels
  Figure: a multi-level circuit on inputs A, B, C redrawn with fewer gate levels.
  Review Karnaugh maps for prereq quiz!
Use esoteric/dynamic timing methods
Pay attention to loading
  One gate driving many gates is a bad idea
  Avoid using a small gate to drive a long wire
  Use multiple stages to drive a large load
  Figure: INV4x stages driving a large capacitance Clarge.

How to Avoid Hold Time Violation?

Figure: combination logic between two banks of storage elements clocked by Clk.
Hold time requirement:
  Input to register must NOT change immediately after the clock tick
This is usually easy to meet in the "edge trigger" clocking scheme
  Hold time of most FFs is <= 0 ns
CLK-to-Q + Shortest Delay Path must be greater than Hold Time

Clock Skew's Effect on Hold Time

Figure: combination logic between an input register and an output register clocked by Clk1 and Clk2, which are offset by the clock skew.
The worst case scenario for hold time consideration:
  The input register sees CLK2
  The output register sees CLK1
  fast FF2 output must not change input to FF1 for same clock edge
(CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time

Summary

Total execution time is the most reliable measure of performance
Amdahl's law: Law of Diminishing Returns
Performance and Technology Trends
  Keep the design simple (KISS rule) to take advantage of the latest technology
CMOS inverter and CMOS logic gates
Delay Modeling and Gate Characterization
  Delay = Internal Delay + (Load Dependent Delay x Output Load)
Clocking Methodology and Timing Considerations
  Simplest clocking methodology
    All storage elements use the SAME clock edge
  Cycle Time = CLK-to-Q + Longest Delay Path + Setup + Clock Skew
  (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time

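
The two timing relations in this summary can be checked mechanically. The sketch below is a minimal illustration: the 1 ns setup and 0.5 ns hold values are the "typical for class" figures from the slides, while the clock-to-Q, path delay, and skew numbers are hypothetical, as are the function names.

```python
def cycle_time(clk_to_q, longest_path, setup, skew):
    """Cycle Time = CLK-to-Q + Longest Delay Path + Setup + Clock Skew."""
    return clk_to_q + longest_path + setup + skew


def hold_ok(clk_to_q, shortest_path, skew, hold):
    """(CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time."""
    return (clk_to_q + shortest_path - skew) > hold


SETUP_NS = 1.0   # typical for class
HOLD_NS = 0.5    # typical for class

# Hypothetical design numbers in ns.
print(cycle_time(clk_to_q=1.0, longest_path=6.0, setup=SETUP_NS, skew=0.5))
print(hold_ok(clk_to_q=1.0, shortest_path=0.2, skew=0.5, hold=HOLD_NS))
print(hold_ok(clk_to_q=1.0, shortest_path=0.2, skew=0.8, hold=HOLD_NS))
```

Note that skew hurts both constraints: it lengthens the required cycle time and eats into the hold margin, so a design that passes with 0.5 ns of skew can fail with 0.8 ns.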
To Get More Information

A Classic Book that Started it All:
  Carver Mead and Lynn Conway, "Introduction to VLSI Systems," Addison-Wesley Publishing Company, October 1980.
A Good VLSI Circuit Design Book:
  Lance Glasser & Daniel Dobberpuhl, "The Design and Analysis of VLSI Circuits," Addison-Wesley Publishing Company, 1985.
  Mr. Dobberpuhl is responsible for the DEC Alpha chip design.
A Book on How and Why Digital ICs Work:
  David Hodges & Horace Jackson, "Analysis and Design of Digital Integrated Circuits," McGraw-Hill Book Company, 1983.
New Book:
  Jan Rabaey, "Digital Integrated Circuits: A Design Perspective," Prentice-Hall Publishers, 1998.

CS152
Computer Architecture and Engineering
Lecture 4
Cost and Design
September 8, 1999
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/
Review: Performance and Technology Trends

Figure: performance (0.1 to 1000) versus year (1965 to 2000) for supercomputers, mainframes, minicomputers, and microprocessors.
Technology Power: 1.2 x 1.2 x 1.2 = 1.7x / year
  Feature Size: shrinks 10% / yr. => Switching speed improves 1.2x / yr.
  Density: improves 1.2x / yr.
  Die Area: 1.2x / yr.
RISC lesson is to keep the ISA as simple as possible:
  Shorter design cycle => fully exploit the advancing technology (~3yr)
  Advanced branch prediction and pipeline techniques
  Bigger and more sophisticated on-chip caches

Review: General C/L Cell Delay Model

Figure: combinational logic cell with inputs A, B, ... and output Vout loaded by Cout. Plot of delay (Va -> Vout) versus Cout: a straight line starting at the internal delay, with slope equal to the delay per unit load, up to Ccritical.
Combinational Cell (symbol) is fully specified by:
  functional (input -> output) behavior
    truth-table, logic equation, VHDL
  load factor of each input
  critical propagation delay from each input to each output for each transition
    THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
Linear model composes

Review: Characterize a Gate

Input capacitance for each input
For each input-to-output path:
  For each output transition type (H->L, L->H, H->Z, L->Z ... etc.)
    Internal delay (ns)
    Load dependent delay (ns / fF)
Example: 2-input NAND Gate
  For A and B: Input Load (I.L.) = 61 fF
  For either A -> Out or B -> Out:
    Tlh = 0.5ns  Tlhf = 0.0021ns / fF
    Thl = 0.1ns  Thlf = 0.0020ns / fF
Figure: Delay A -> Out (Out: Low -> High) versus Cout, a line with intercept 0.5ns and slope 0.0021ns / fF.

Review: Technology, Logic Design and Delay

CMOS Technology Trends
  Complementary: PMOS and NMOS transistors
  CMOS inverter and CMOS logic gates
Delay Modeling and Gate Characterization
  Delay = Internal Delay + (Load Dependent Delay x Output Load)
Clocking Methodology and Timing Considerations
  Simplest clocking methodology
    All storage elements use the SAME clock edge
  Cycle Time = CLK-to-Q + Longest Delay Path + Setup + Clock Skew
  (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time

Overview: Cost and Design

Review from Last Lecture (2 minutes)
Cost and Price (18 minutes)
Administrative Matters (3 minutes)
Design process (27 minutes)
Break (5 minutes)
More Design process (15 minutes)
Online notebook (10 minutes)

Integrated Circuit Costs

Die cost = Wafer cost / (Dies per wafer * Die yield)
Dies per wafer = π * (Wafer_diam / 2)^2 / Die_Area - π * Wafer_diam / sqrt(2 * Die_Area) - Test dies ≈ Wafer Area / Die Area
Die yield = Wafer yield / (1 + Defects_per_unit_area * Die_Area / α)^α
Die cost goes roughly with the cube of the area.

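
The cost formulas are easy to evaluate numerically. The sketch below assumes the standard textbook reading of the garbled slide (π times radius squared over die area, minus an edge-loss term π times diameter over the square root of twice the die area, minus test dies; yield with exponent α), with areas in cm2 and diameters in cm. The function names are hypothetical; the sample inputs are the "typical CMOS process" parameters and the 386DX row quoted elsewhere in the slides.

```python
import math


def dies_per_wafer(wafer_diam_cm, die_area_cm2, test_dies=0):
    """Gross dies: wafer area over die area, minus edge loss and test dies."""
    wafer_area = math.pi * (wafer_diam_cm / 2) ** 2
    edge_loss = math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss - test_dies


def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha=2.0):
    """Die yield = wafer yield / (1 + defects * area / alpha) ** alpha."""
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2 / alpha) ** alpha


def die_cost(wafer_cost, dies, yield_fraction):
    """Die cost = wafer cost / (dies per wafer * die yield)."""
    return wafer_cost / (dies * yield_fraction)


# 15 cm wafer, 100 mm2 (1 cm2) die, 4 test sites, as in the die yield slide.
print(int(dies_per_wafer(15, 1.0, test_dies=4)))
# alpha = 2, wafer yield 90%, defect density 2 per cm2.
print(die_yield(0.9, 2.0, 1.0))
# 386DX row: $900 wafer, 360 dies per wafer, 71% yield.
print(round(die_cost(900, 360, 0.71)))
```

Because die area appears both in the dies-per-wafer term and inside the yield expression raised to the power α, cost grows much faster than linearly with area.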
Die Yield

Raw Dice Per Wafer
wafer diameter | die area (mm2): 100 | 144 | 196 | 256 | 324 | 400
6"/15cm | 139 | 90 | 62 | 44 | 32 | 23
8"/20cm | 265 | 177 | 124 | 90 | 68 | 52
10"/25cm | 431 | 290 | 206 | 153 | 116 | 90
die yield | 23% | 19% | 16% | 12% | 11% | 10%
typical CMOS process: α = 2, wafer yield = 90%, defect density = 2/cm2, 4 test sites/wafer

Good Dice Per Wafer (Before Testing!)
wafer diameter | die area (mm2): 100 | 144 | 196 | 256 | 324 | 400
6"/15cm | 31 | 16 | 9 | 5 | 3 | 2
8"/20cm | 59 | 32 | 19 | 11 | 7 | 5
10"/25cm | 96 | 53 | 32 | 20 | 13 | 9
typical cost of an 8", 4 metal layers, 0.5um CMOS wafer: ~$2000

Real World Examples

Chip | Metal layers | Line width | Wafer cost | Defects/cm2 | Area (mm2) | Dies/wafer | Yield | Die cost
386DX | 2 | 0.90 | $900 | 1.0 | 43 | 360 | 71% | $4
486DX2 | 3 | 0.80 | $1200 | 1.0 | 81 | 181 | 54% | $12
PowerPC 601 | 4 | 0.80 | $1700 | 1.3 | 121 | 115 | 28% | $53
HP PA 7100 | 3 | 0.80 | $1300 | 1.0 | 196 | 66 | 27% | $73
DEC Alpha | 3 | 0.70 | $1500 | 1.2 | 234 | 53 | 19% | $149
SuperSPARC | 3 | 0.70 | $1700 | 1.6 | 256 | 48 | 13% | $272
Pentium | 3 | 0.80 | $1500 | 1.5 | 296 | 40 | 9% | $417
From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15

Other Costs

IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield
Packaging Cost: depends on pins, heat dissipation

Chip | Die cost | Package pins | Package type | Package cost | Test & Assembly | Total
386DX | $4 | 132 | QFP | $1 | $4 | $9
486DX2 | $12 | 168 | PGA | $11 | $12 | $35
PowerPC 601 | $53 | 304 | QFP | $3 | $21 | $77
HP PA 7100 | $73 | 504 | PGA | $35 | $16 | $124
DEC Alpha | $149 | 431 | PGA | $30 | $23 | $202
SuperSPARC | $272 | 293 | PGA | $20 | $34 | $326
Pentium | $417 | 273 | PGA | $19 | $37 | $473

System Cost: 1995-96 Workstation

System | Subsystem | % of total cost
Cabinet | Sheet metal, plastic | 1%
Cabinet | Power supply, fans | 2%
Cabinet | Cables, nuts, bolts | 1%
Cabinet | (Subtotal) | (4%)
Motherboard | Processor | 6%
Motherboard | DRAM (64MB) | 36%
Motherboard | Video system | 14%
Motherboard | I/O system | 3%
Motherboard | Printed Circuit board | 1%
Motherboard | (Subtotal) | (60%)
I/O Devices | Keyboard, mouse | 1%
I/O Devices | Monitor | 22%
I/O Devices | Hard disk (1 GB) | 7%
I/O Devices | Tape drive (DAT) | 6%
I/O Devices | (Subtotal) | (36%)

Cost vs. Price

Figure: build-up from component cost to list price. Ranges in parentheses are labeled (WS-PC) in the figure.
  Component Cost (25-31% of list price). Input: chips, displays, ...
  +33% Direct Costs (8-10% of list price). Making it: labor, scrap, returns, ...
  +25-100% Gross Margin (33-14% of list price), giving the avg. selling price. Overhead: R&D, rent, marketing, profits, ...
  +50-80% Average Discount (33-45% of list price), giving the list price. Commission: channel profit, volume discounts
Q: What % of company income is spent on Research and Development (R&D)?

Cost Summary

Integrated circuits are driving the computer industry
Die cost goes up with the cube of die area
Economics ($$$) is the ultimate driver for performance!

Chapter - 4
Arithmetic

Arithmetic

Where we've been:
  Performance (seconds, cycles, instructions)
  Abstractions:
    Instruction Set Architecture
    Assembly Language and Machine Language
What's up ahead:
  Implementing the Architecture
Figure: ALU with 32-bit inputs a and b, an operation select input, and a 32-bit result.

Chapter Five
The Processor: Datapath & Control

We're ready to look at an implementation of the MIPS
Simplified to contain only:
  memory-reference instructions: lw, sw
  arithmetic-logical instructions: add, sub, and, or, slt
  control flow instructions: beq, j
Generic Implementation:
  use the program counter (PC) to supply instruction address
  get the instruction from memory
  read registers
  use the instruction to decide exactly what to do
All instructions use the ALU after reading the registers
  Why? memory-reference? arithmetic? control flow?

State Elements

Unclocked vs. Clocked
Clocks used in synchronous logic
  when should an element that contains state be updated?
Figure: clock waveform with the rising edge, falling edge, and cycle time marked.

An unclocked state element

The set-reset latch
  output depends on present inputs and also on past inputs

Latches and Flip-flops

Output is equal to the stored value inside the element
  (don't need to ask for permission to look at the value)
Change of state (value) is based on the clock
  Latches: state changes whenever the inputs change and the clock is asserted
    (asserted means "logically true", which could mean electrically low)
  Flip-flop: state changes only on a clock edge (edge-triggered methodology)
A clocking methodology defines when signals can be read and written
  wouldn't want to read a signal at the same time it was being written

D-latch

Two inputs:
  the data value to be stored (D)
  the clock signal (C) indicating when to read & store D
Two outputs:
  the value of the internal state (Q) and its complement
Figure: D latch circuit with inputs C and D and outputs Q and its complement.

D flip-flop

Output changes only on the clock edge
Figure: two D latches in series sharing the clock C; the first latch takes D, and the second latch produces Q and its complement.

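
The two-latch construction can be simulated to see why the output only changes on a clock edge. The sketch below is a behavioural model under assumptions: the class names are hypothetical, and the second latch is assumed to be clocked by the inverted clock, which gives a falling-edge-triggered flip-flop.

```python
class DLatch:
    """Level-sensitive latch: Q follows D while C is asserted."""

    def __init__(self):
        self.q = 0

    def update(self, c, d):
        if c:
            self.q = d
        return self.q


class DFlipFlop:
    """Two latches in series; the second uses the inverted clock (assumed)."""

    def __init__(self):
        self.master = DLatch()
        self.slave = DLatch()

    def update(self, c, d):
        m = self.master.update(c, d)         # transparent while C = 1
        return self.slave.update(1 - c, m)   # transparent while C = 0


ff = DFlipFlop()
# (C, D) pairs applied in sequence; Q is sampled after each step.
trace = [ff.update(c, d) for c, d in [(1, 1), (0, 1), (0, 0), (1, 0), (0, 0)]]
print(trace)  # [0, 1, 1, 1, 0]
```

In the trace, D changing while C is low has no effect on Q, because the first latch is closed; Q only takes a new value when C falls and the second latch opens on the value captured by the first.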
