
Lecture 2: Review of Performance/Cost/Power Metrics and Architectural Basics

Prof. Jan M. Rabaey


Computer Science 252
Spring 2000
“Computer Architecture in Cory Hall”

Review Lecture 1

• Class Organization
– Class Projects
• Trends in the Industry and Driving Forces

Computer Architecture Topics

[Figure: layered stack of single-machine topics]
• Input/Output and Storage: disks, WORM, tape, RAID; emerging technologies; DRAM interleaving; bus protocols
• Memory Hierarchy (L2 cache, L1 cache; VLSI): coherence, bandwidth, latency
• Instruction Set Architecture: addressing, protection, exception handling
• Pipelining and Instruction-Level Parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, VLIW, DSP, reconfiguration
Computer Architecture Topics (continued)

[Figure: multiple processor-memory (P M) nodes attached to an interconnection network through network interfaces]
• Multiprocessors: shared memory, message passing, data parallelism
• Networks and Interconnections: processor-memory-switch topologies, routing, bandwidth, latency, reliability
The Secret of Architecture Design:
Measurement and Evaluation

Architecture design is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems

[Figure: design loop. Creativity feeds Design; cost/performance Analysis sorts the results into good ideas, mediocre ideas, and bad ideas, which seed the next iteration.]
Computer Engineering Methodology

[Figure: design cycle. Evaluate existing systems for bottlenecks (informed by benchmarks and technology trends); simulate new designs and organizations (driven by workloads); implement the next-generation system; analyze implementation complexity; repeat.]
Measurement Tools

• Hardware: cost, delay, area, power estimation
• Benchmarks, Traces, Mixes
• Simulation (many levels)
  – ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental “Laws”/Principles
Review:
Performance, Cost, Power

Metric 1: Performance (in passenger-miles/hour)

Plane        DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747   6.5 hours     610 mph    470          286,700
Concorde     3 hours       1350 mph   132          178,200

• Time to run the task
  – Execution time, response time, latency
• Tasks per day, hour, week, sec, ns …
  – Throughput, bandwidth
The Performance Metric

"X is n times faster than Y" means

ExTime(Y) Performance(X)
--------- = ---------------
ExTime(X) Performance(Y)

• Speed of Concorde vs. Boeing 747

• Throughput of Boeing 747 vs. Concorde

Amdahl's Law

Speedup due to enhancement E:

                ExTime w/o E     Performance w/ E
  Speedup(E) =  ------------  =  -----------------
                ExTime w/ E      Performance w/o E

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected.
Amdahl’s Law

  ExTime_new = ExTime_old × ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

                    ExTime_old                              1
  Speedup_overall = ----------- = ---------------------------------------------------------
                    ExTime_new    (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced
Amdahl’s Law

• Floating point instructions improved to run 2X; but only 10% of actual instructions are FP

  ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old

  Speedup_overall = 1 / 0.95 = 1.053

Law of diminishing returns: focus on the common case!
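As a quick sanity check, here is a minimal Python sketch of Amdahl's Law (not from the slides) that reproduces the 1.053 figure above:

  def amdahl_speedup(fraction_enhanced, speedup_enhanced):
      # Overall speedup when fraction_enhanced of the time is sped up
      # by speedup_enhanced and the rest is unaffected.
      return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

  print(amdahl_speedup(0.10, 2.0))  # FP example: ~1.053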
Metrics of Performance

[Figure: levels of a system and the performance metric natural to each]
• Application: answers per month; operations per second
• Programming language / compiler
• ISA: (millions of) instructions per second (MIPS); (millions of) FP operations per second (MFLOP/s)
• Datapath / control: megabytes per second
• Function units; transistors, wires, pins: cycles per second (clock rate)
Aspects of CPU Performance

              Seconds     Instructions     Cycles         Seconds
  CPU time =  -------  =  ------------  ×  -----------  ×  -------
              Program       Program        Instruction      Cycle

                 Inst Count   CPI   Clock Rate
  Program            X
  Compiler           X        (X)
  Inst. Set          X         X
  Organization                 X        X
  Technology                            X
Cycles Per Instruction

“Average Cycles per Instruction”

  CPI = Cycles / Instruction Count = (CPU Time × Clock Rate) / Instruction Count

  CPU time = Cycle Time × Σ (i=1..n) CPI_i × I_i

“Instruction Frequency”

  CPI = Σ (i=1..n) CPI_i × F_i,   where F_i = I_i / Instruction Count

Invest resources where time is spent!
Example: Calculating CPI

Base Machine (Reg / Reg)
Op       Freq   CPI_i   CPI_i × F_i   (% Time)
ALU      50%    1       0.5           (33%)
Load     20%    2       0.4           (27%)
Store    10%    2       0.2           (13%)
Branch   20%    2       0.4           (27%)
                Total CPI: 1.5
Typical Mix
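A small Python sketch (not from the slides) that reproduces the table's weighted CPI and time breakdown:

  mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

  cpi = sum(freq * cycles for freq, cycles in mix.values())   # 1.5
  for op, (freq, cycles) in mix.items():
      print(f"{op}: {freq * cycles / cpi:.0%} of execution time")
  print("CPI =", cpi)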
Creating Benchmark Sets

• Real programs
• Kernels
• Toy benchmarks
• Synthetic benchmarks
– e.g. Whetstones and Dhrystones

SPEC: System Performance Evaluation
Cooperative
• First Round 1989
– 10 programs yielding a single number (“SPECmarks”)
• Second Round 1992
– SPECInt92 (6 integer programs) and SPECfp92 (14 floating point
programs)
» Compiler flags unlimited. Example: the March '93 flags for the DEC 4000 Model 610:
spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=
memcpy(b,a,c)”
wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
• Third Round 1995
– new set of programs: SPECint95 (8 integer programs) and
SPECfp95 (10 floating point)
– “benchmarks useful for 3 years”
– Single flag setting for all programs: SPECint_base95, SPECfp_base95
How to Summarize Performance
• Arithmetic mean (weighted arithmetic mean)
tracks execution time: Σ(Ti)/n or Σ(Wi*Ti)
• Harmonic mean (weighted harmonic mean) of
rates (e.g., MFLOPS) tracks execution time:
n/ Σ(1/Ri) or n/Σ(Wi/Ri)
• Normalized execution time is handy for scaling
performance (e.g., X times faster than
SPARCstation 10)
– Arithmetic mean impacted by choice of reference machine
• Use the geometric mean for comparison: (Π Ti)^(1/n)
  – Independent of the chosen reference machine
  – But not a good metric for total execution time
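A short Python sketch of the three summaries (the execution times are made-up numbers for illustration):

  import math

  times = [0.5, 2.0, 8.0]          # hypothetical execution times in seconds

  arith = sum(times) / len(times)                  # tracks total execution time
  geom = math.prod(times) ** (1 / len(times))      # reference-machine independent
  rates = [1.0 / t for t in times]                 # e.g., runs per second
  harm = len(rates) / sum(1.0 / r for r in rates)  # harmonic mean of rates; tracks time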
SPEC First Round
• One program: 99% of time in single line of code
• New front-end compiler could improve dramatically

[Figure: bar chart of SPECratios (0 to 800) on the IBM Powerstation 550 for two different compilers, across gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, and tomcatv; matrix300 improves dramatically with the new compiler.]
Impact of Means on SPECmark89 for IBM 550
(without and with special compiler option)

             Ratio to VAX     Time            Weighted Time
Program      Before  After    Before  After   Before  After
gcc          30      29       49      51      8.91    9.22
espresso     35      34       65      67      7.64    7.86
spice        47      47       510     510     5.69    5.69
doduc        46      49       41      38      5.81    5.45
nasa7        78      144      258     140     3.43    1.86
li           34      34       183     183     7.86    7.86
eqntott      40      40       28      28      6.68    6.68
matrix300    78      730      58      6       3.43    0.37
fpppp        90      87       34      35      2.97    3.07
tomcatv      33      138      20      19      2.01    1.94
Mean         54      72       124     108     54.42   49.99
             Geometric        Arithmetic      Weighted Arith.
             Ratio 1.33       Ratio 1.16      Ratio 1.09
Performance Evaluation
• “For better or worse, benchmarks shape a field”
• Good products created when have:
  – Good benchmarks
  – Good ways to summarize performance
• Since sales are partly a function of performance relative to the competition, companies invest in improving the product as reported by the performance summary
• If the benchmarks/summary are inadequate, a company must choose between improving its product for real programs and improving it to get more sales; sales almost always win!
• Execution time is the measure of computer performance!
Integrated Circuits Costs

  IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

  Die cost = Wafer cost / (Dies per wafer × Die yield)

                    π × (Wafer_diam / 2)²      π × Wafer_diam
  Dies per wafer =  ---------------------  −  ----------------  −  Test dies
                          Die_area            √(2 × Die_area)

  Die yield = Wafer_yield × (1 + (Defect_density × Die_area) / α)^(−α)

Die cost goes roughly with (die area)^4.
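A hedged Python sketch of this cost model. The wafer diameter (15 cm) and α = 3 are assumptions, not given on the slide, so the output only roughly approximates the table that follows:

  import math

  def dies_per_wafer(wafer_diam_cm, die_area_cm2, test_dies=0):
      return int(math.pi * (wafer_diam_cm / 2) ** 2 / die_area_cm2
                 - math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
                 - test_dies)

  def die_yield(defects_per_cm2, die_area_cm2, alpha=3.0, wafer_yield=1.0):
      return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** -alpha

  # 486DX2-like die: 81 mm^2 = 0.81 cm^2, 1.0 defects/cm^2, $1200 wafer
  n = dies_per_wafer(15.0, 0.81)          # ~181 dies
  y = die_yield(1.0, 0.81)                # ~0.49
  print(n, round(y, 2), "die cost ~$", round(1200 / (n * y)))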


Real World Examples

Chip          Metal   Line       Wafer   Defects   Area   Dies/   Yield   Die
              layers  width (µ)  cost    per cm²   (mm²)  wafer           cost
386DX         2       0.90       $900    1.0       43     360     71%     $4
486DX2        3       0.80       $1200   1.0       81     181     54%     $12
PowerPC 601   4       0.80       $1700   1.3       121    115     28%     $53
HP PA 7100    3       0.80       $1300   1.0       196    66      27%     $73
DEC Alpha     3       0.70       $1500   1.2       234    53      19%     $149
SuperSPARC    3       0.70       $1700   1.6       256    48      13%     $272
Pentium       3       0.80       $1500   1.5       296    40      9%      $417

From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15.
Cost/Performance
What is the Relationship of Cost to Price?

• Recurring costs
  – Component costs
  – Direct costs (add 25% to 40% to component costs): labor, purchasing, scrap, warranty
• Non-recurring costs or gross margin (add 82% to 186%): R&D, equipment maintenance, rental, marketing, sales, financing cost, pretax profits, taxes
• Average discount to get list price (add 33% to 66%): volume discounts and/or retailer markup

[Figure: stacked bar from component cost up to list price. As a share of list price: component cost 15% to 33%, direct cost 6% to 8%, gross margin 34% to 39%, average discount 25% to 40%.]
Chip Prices (August 1993)
• Assume purchase of 10,000 units

Chip          Area (mm²)  Mfg. cost   Price    Multiplier  Comment
386DX         43          $9          $31      3.4         Intense competition
486DX2        81          $35         $245     7.0         No competition
PowerPC 601   121         $77         $280     3.6
DEC Alpha     234         $202        $1231    6.1         Recoup R&D?
Pentium       296         $473        $965     2.0         Early in shipments
Summary: Price vs. Cost

[Figure: two charts comparing minicomputers, workstations, and PCs. The first shows the breakdown of list price (0% to 100%) into component costs, direct costs, gross margin, and average discount; the second shows the price/component-cost multiplier falling from minis (around 4.7) through workstations to PCs (as low as about 1.5).]
Power/Energy

[Figure: max power (watts, log scale from 1 to 100) of Intel lead processors (386, 486, Pentium, Pentium MMX, Pentium Pro, Pentium II) across process generations from 1.5µ to 0.13µ. Source: Intel.]

• Lead processor power increases every generation
• Compactions provide higher performance at lower power
Energy/Power

• Power dissipation: the rate at which energy is taken from the supply (power source) and transformed into heat

    P = E / t

• Energy dissipation for a given instruction depends upon the type of instruction (and the state of the processor):

    P = (1 / CPU time) × Σ (i=1..n) E_i × I_i
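A minimal Python sketch of this power model; the per-instruction energies and counts below are made-up numbers for illustration:

  energy_nj = {"ALU": 0.8, "Load": 1.5, "Store": 1.4, "Branch": 1.0}  # nJ per instruction (assumed)
  counts = {"ALU": 5e8, "Load": 2e8, "Store": 1e8, "Branch": 2e8}     # instructions executed
  cpu_time_s = 1.0                                                    # seconds

  total_energy_j = sum(energy_nj[op] * 1e-9 * counts[op] for op in counts)
  power_w = total_energy_j / cpu_time_s   # P = E / t
  print(f"{power_w:.2f} W")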
The Energy-Flexibility Gap

[Figure: energy efficiency (MOPS/mW or MIPS/mW, log scale from 0.1 to 1000) versus flexibility (coverage). Dedicated hardware sits at the top; reconfigurable processors/logic such as Pleiades reach 10-80 MOPS/mW; ASIPs and DSPs follow (a 2 V DSP: 3 MOPS/mW); general embedded processors sit at the bottom (SA110: 0.4 MIPS/mW).]
Summary, #1
• Designing to last through trends:

           Capacity          Speed
  Logic    2x in 3 years     2x in 3 years
  DRAM     4x in 3 years     2x in 10 years
  Disk     4x in 3 years     2x in 10 years

  SPEC rating: 2x in 1.5 years
• 6 yrs to graduate => 16X CPU speed, DRAM/Disk size
• Time to run the task
  – Execution time, response time, latency
• Tasks per day, hour, week, sec, ns, …
  – Throughput, bandwidth
• “X is n times faster than Y” means

    ExTime(Y)   Performance(X)
    --------- = --------------
    ExTime(X)   Performance(Y)
Summary, #2
• Amdahl’s Law:

                      ExTime_old                             1
    Speedup_overall = ----------- = ---------------------------------------------------------
                      ExTime_new    (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced

• CPI Law:

                Seconds     Instructions     Cycles         Seconds
    CPU time =  -------  =  ------------  ×  -----------  ×  -------
                Program       Program        Instruction      Cycle

• Execution time is the REAL measure of computer performance!
• Good products created when have:
  – Good benchmarks, good ways to summarize performance
• A different set of metrics applies to embedded systems
Review:
Instruction Sets, Pipelines, and Caches

Computer Architecture Is …

the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.

Amdahl, Blaauw, and Brooks, 1964
Computer Architecture’s Changing
Definition
• 1950s to 1960s:
Computer Architecture Course = Computer Arithmetic
• 1970s to mid 1980s:
Computer Architecture Course = Instruction Set
Design, especially ISA appropriate for compilers
• 1990s:
Computer Architecture Course = Design of CPU,
memory system, I/O system, Multiprocessors

Computer Architecture is ...

Instruction Set Architecture

Organization

Hardware

Instruction Set Architecture (ISA)

software

instruction set

hardware

Interface Design

A good interface:
• Lasts through many implementations (portability, compatibility)
• Is used in many different ways (generality)
• Provides convenient functionality to higher levels
• Permits an efficient implementation at lower levels

[Figure: one interface persisting over time, with many uses above it and successive implementations (imp 1, imp 2, imp 3) below it.]
Evolution of Instruction Sets

Single Accumulator (EDSAC 1950)
  ↓
Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
  ↓
Separation of Programming Model from Implementation
  ↓
High-level Language Based (B5000 1963)  /  Concept of a Family (IBM 360 1964)
  ↓
General Purpose Register Machines
  ↓
Complex Instruction Sets (VAX, Intel 432 1977-80)  /  Load/Store Architecture (CDC 6600, Cray-1 1963-76)
  ↓
RISC (MIPS, SPARC, HP-PA, IBM RS6000, PowerPC . . . 1987)
  ↓
LIW/”EPIC”? (IA-64 . . . 1999)
Evolution of Instruction Sets

• Major advances in computer architecture are typically associated with landmark instruction set designs
  – Ex: Stack vs. GPR (System 360)
• Design decisions must take into account:
  – technology
  – machine organization
  – programming languages
  – compiler technology
  – operating systems
  – applications
• And they in turn influence these
A "Typical" RISC

• 32-bit fixed format instruction (3 formats I,R,J)


• 32 32-bit GPR (R0 contains zero, DP take pair)
• 3-address, reg-reg arithmetic instruction
• Single address mode for load/store:
base + displacement
– no indirection
• Simple branch conditions (based on register values)
• Delayed branch

see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC,


CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

JR.S00 42
Example: MIPS (DLX)

Register-Register
  31-26: Op | 25-21: Rs1 | 20-16: Rs2 | 15-11: Rd | 10-0: Opx

Register-Immediate
  31-26: Op | 25-21: Rs1 | 20-16: Rd | 15-0: immediate

Branch
  31-26: Op | 25-21: Rs1 | 20-16: Rs2/Opx | 15-0: immediate

Jump / Call
  31-26: Op | 25-0: target
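A small Python sketch (a hypothetical helper, not from the slides) that pulls the register-register fields out of a 32-bit word using the bit positions above:

  def decode_rr(word):
      # Field boundaries follow the register-register format above.
      return {
          "op":  (word >> 26) & 0x3F,   # bits 31-26
          "rs1": (word >> 21) & 0x1F,   # bits 25-21
          "rs2": (word >> 16) & 0x1F,   # bits 20-16
          "rd":  (word >> 11) & 0x1F,   # bits 15-11
          "opx":  word        & 0x7FF,  # bits 10-0
      }

  print(decode_rr(0b000000_00001_00010_00011_00000000000))  # op=0, rs1=1, rs2=2, rd=3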
Pipelining: It's Natural!

• Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
Sequential Laundry

[Figure: task-order chart from 6 PM to midnight. A, B, C, and D each run washer (30 min), dryer (40 min), and folder (20 min) back-to-back, one person at a time.]

• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
Pipelined Laundry
Start work ASAP

[Figure: task-order chart from 6 PM to midnight. Each load starts as soon as the washer is free; after the first 30-minute wash, the 40-minute dryer stage paces the pipeline.]

• Pipelined laundry takes 3.5 hours for 4 loads


Pipelining Lessons

• Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
Computer Pipelines

• Execute billions of instructions, so throughput is what matters
• DLX desirable features: all instructions same length, registers located in same place in instruction format, memory operands only in loads or stores
5 Steps of DLX Datapath
Figure 3.1, Page 130

[Figure: unpipelined DLX datapath with five stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back. Next-PC mux and adder, instruction memory, register file, sign-extended immediate, ALU with zero test, data memory, and write-back mux.]
5 Steps of DLX Datapath
Figure 3.4, Page 134

[Figure: the same five-stage datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers separating the stages; the destination register RD flows down the pipe with the instruction.]

• Data stationary control
  – local decode for each instruction phase / pipeline stage
Visualizing Pipelining
Figure 3.3, Page 133

[Figure: time (clock cycles 1-7) on the horizontal axis, instruction order on the vertical axis. A new instruction enters each cycle; each flows through Ifetch, Reg, ALU, DMem, Reg, so up to five instructions overlap.]
It's Not That Easy for Computers

• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: HW cannot support this combination of instructions (two dogs fighting for the same bone)
  – Data hazards: instruction depends on result of prior instruction still in the pipeline
  – Control hazards: caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
One Memory Port / Structural Hazards
Figure 3.6, Page 142

[Figure: pipeline diagram of Load followed by Instr 1-4. In cycle 4, Load's DMem access and Instr 3's Ifetch both need the single memory port.]
One Memory Port / Structural Hazards
Figure 3.7, Page 143

[Figure: the same sequence with the conflict resolved by stalling: Instr 3's fetch waits one cycle, and a bubble propagates down the pipeline.]
Speed Up Equation for Pipelining

  CPI_pipelined = Ideal CPI + average stall cycles per instruction

             Ideal CPI × Pipeline depth       Cycle Time_unpipelined
  Speedup =  ----------------------------  ×  ----------------------
             Ideal CPI + Pipeline stall CPI    Cycle Time_pipelined

For the simple RISC pipeline, Ideal CPI = 1:

                Pipeline depth          Cycle Time_unpipelined
  Speedup =  -----------------------  ×  ----------------------
             1 + Pipeline stall CPI       Cycle Time_pipelined
Example: Dual-port vs. Single-port
• Machine A: dual-ported memory (“Harvard Architecture”)
• Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed

  SpeedUp_A = Pipeline depth / (1 + 0) × (clock_unpipe / clock_pipe)
            = Pipeline depth
  SpeedUp_B = Pipeline depth / (1 + 0.4 × 1) × (clock_unpipe / (clock_unpipe / 1.05))
            = (Pipeline depth / 1.4) × 1.05
            = 0.75 × Pipeline depth
  SpeedUp_A / SpeedUp_B = Pipeline depth / (0.75 × Pipeline depth) = 1.33

• Machine A is 1.33 times faster
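A small Python sketch of this comparison (the depth value is an assumption; the final ratio is independent of it):

  def pipeline_speedup(depth, stall_cpi, clock_gain=1.0):
      # Speedup over the unpipelined machine: depth/(1 + stalls), times any clock gain.
      return depth / (1.0 + stall_cpi) * clock_gain

  depth = 5                                # assumed pipeline depth
  a = pipeline_speedup(depth, 0.0)         # dual-ported memory: no structural stalls
  b = pipeline_speedup(depth, 0.4, 1.05)   # 40% loads stall 1 cycle; 5% faster clock
  print(a / b)                             # ~1.33: machine A wins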
Data Hazard on R1
Figure 3.9, page 147

[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) for the sequence below; sub, and, and or read r1 before add writes it back in WB.]

  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
Three Generic Data Hazards

• Read After Write (RAW)
  InstrJ tries to read operand before InstrI writes it

    I: add r1,r2,r3
    J: sub r4,r1,r3

• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
Three Generic Data Hazards

• Write After Read (WAR)
  InstrJ writes operand before InstrI reads it

    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in DLX 5 stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5
Three Generic Data Hazards
• Write After Write (WAW)
InstrJ writes operand before InstrI writes it.

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
• Can’t happen in DLX 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in later more complicated
pipes
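A toy Python sketch (illustrative only) that classifies the hazards between two instructions, each given as a destination register plus a list of source registers:

  def hazards(i, j):
      # Hazards created when instruction j follows instruction i in program order.
      dest_i, srcs_i = i
      dest_j, srcs_j = j
      found = []
      if dest_i in srcs_j: found.append("RAW")   # j reads what i writes
      if dest_j in srcs_i: found.append("WAR")   # j writes what i reads
      if dest_j == dest_i: found.append("WAW")   # both write the same register
      return found

  # WAW example above: I: sub r1,r4,r3  followed by  J: add r1,r2,r3
  print(hazards(("r1", ["r4", "r3"]), ("r1", ["r2", "r3"])))  # ['WAW']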
Forwarding to Avoid Data Hazard
Figure 3.10, Page 149

[Figure: the same add/sub/and/or/xor sequence, with forwarding paths from the ALU output and memory stage back to the ALU inputs; every use of r1 now gets the value in time, with no stalls.]
HW Change for Forwarding
Figure 3.20, Page 161

[Figure: datapath detail showing muxes at both ALU inputs fed from the ID/EX, EX/MEM, and MEM/WB pipeline registers, so results can bypass the register file.]
Data Hazard Even with Forwarding
Figure 3.12, Page 153

[Figure: pipeline diagram for the sequence below; the load's data arrives at the end of MEM, too late to forward to sub's EX in the very next cycle.]

  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9
Data Hazard Even with Forwarding
Figure 3.13, Page 154

[Figure: the same sequence with the hazard resolved by a one-cycle stall: a bubble is inserted ahead of sub, after which forwarding covers the remaining uses of r1.]
Software Scheduling to Avoid Load Hazards

Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:          Fast code:
  LW  Rb,b            LW  Rb,b
  LW  Rc,c            LW  Rc,c
  ADD Ra,Rb,Rc        LW  Re,e
  SW  a,Ra            ADD Ra,Rb,Rc
  LW  Re,e            LW  Rf,f
  LW  Rf,f            SW  a,Ra
  SUB Rd,Re,Rf        SUB Rd,Re,Rf
  SW  d,Rd            SW  d,Rd
Control Hazard on Branches: Three-Stage Stall

[Figure: pipeline diagram for the sequence below; the three instructions after the branch (at 14, 18, and 22) are fetched before the branch outcome is known, after which execution continues at 36.]

  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11
Branch Stall Impact

• If CPI = 1 and 30% of instructions are branches, a 3-cycle stall means new CPI = 1.9!
• Two-part solution:
  – Determine branch taken or not sooner, AND
  – Compute taken-branch address earlier
• DLX branch tests if register = 0 or ≠ 0
• DLX solution:
  – Move zero test to ID/RF stage
  – Adder to calculate new PC in ID/RF stage
  – 1 clock cycle penalty for branch versus 3
Pipelined DLX Datapath
Figure 3.22, page 163

[Figure: revised pipelined datapath with the zero test and branch-target adder moved into the Instr. Decode / Reg. Fetch stage. This is the correct 1-cycle-latency implementation!]
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% DLX branches not taken on average
– PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% DLX branches taken on average
– But haven’t calculated branch target address in DLX
» DLX still incurs 1 cycle branch penalty
» Other machines: branch target known before outcome

Four Branch Hazard Alternatives

#4: Delayed Branch
  – Define branch to take place AFTER a following instruction

      branch instruction
      sequential successor_1
      sequential successor_2
      ........                    } branch delay of length n
      sequential successor_n
      branch target if taken

  – 1 slot delay allows proper decision and branch target address in 5 stage pipeline
  – DLX uses this
Delayed Branch
• Where to get instructions to fill branch delay slot?
– Before branch instruction
– From the target address: only valuable when branch taken
– From fall through: only valuable when branch not taken
– Cancelling branches allow more slots to be filled

• Compiler effectiveness for single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots useful in computation
  – About 50% (60% × 80%) of slots usefully filled
• Delayed branch downside: loses its appeal with 7-8 stage pipelines and multiple instructions issued per clock (superscalar)
Evaluating Branch Alternatives

                               Pipeline depth
  Pipeline speedup = --------------------------------------
                     1 + Branch frequency × Branch penalty

Scheduling          Branch            Speedup vs.   Speedup vs.
scheme              penalty   CPI     unpipelined   stall
Stall pipeline      3         1.42    3.5           1.0
Predict taken       1         1.14    4.4           1.26
Predict not taken   1         1.09    4.5           1.29
Delayed branch      0.5       1.07    4.6           1.31

Conditional & unconditional branches = 14% of instructions; 65% of them change the PC.
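A Python sketch reproducing this table (a depth-5 pipeline is assumed; the slide's figures may differ slightly due to rounding):

  depth, branch_freq, taken_frac = 5, 0.14, 0.65

  stall_cycles = {
      "Stall pipeline":    3.0,               # always wait for the outcome
      "Predict taken":     1.0,
      "Predict not taken": taken_frac * 1.0,  # penalty only when actually taken
      "Delayed branch":    0.5,
  }
  base_cpi = 1 + branch_freq * stall_cycles["Stall pipeline"]
  for name, penalty in stall_cycles.items():
      cpi = 1 + branch_freq * penalty
      print(f"{name:18s} CPI={cpi:.2f}  vs. unpipelined={depth/cpi:.1f}  vs. stall={base_cpi/cpi:.2f}")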
Summary: Control and Pipelining

• Just overlap tasks; easy if tasks are independent
• Speedup ≤ Pipeline depth; if ideal CPI is 1, then:

                Pipeline depth          Cycle Time_unpipelined
  Speedup =  -----------------------  ×  ----------------------
             1 + Pipeline stall CPI       Cycle Time_pipelined

• Hazards limit performance on computers:
  – Structural: need more HW resources
  – Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  – Control: delayed branch, prediction
