ECE4680 Computer Organization and Architecture Lecture 2: Performance Evaluation

ECE4680 Lec2 Performance.1
January 23, 2003
Review: What is "Computer Architecture"

Coordination of levels of abstraction:
  Application
  Operating System
  Compiler
  Instruction Set Architecture (Instr. Set Proc., I/O system)
  Digital Design
  Circuit Design
Under a set of rapidly changing forces
CA = IS + CO

Review: Levels of Representation

High Level Language Program
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
    (Compiler)
Assembly Language Program
  lw $15, 0($2)
  lw $16, 4($2)
  sw $16, 0($2)
  sw $15, 4($2)
    (Assembler)
Machine Language Program
  0000 1010 1100 0101 1001 1111 0110 1000
  1100 0101 1010 0000 0110 1000 1111 1001
  1010 0000 0101 1100 1111 1001 1000 0110
  0101 1100 0000 1010 1000 0110 1001 1111
    (Machine Interpretation)
Control Signal Specification

Review: Levels of Organization

Example: SPARCstation 20
  Computer
    Processor (SPARC): Control, Datapath
    Memory
    Devices: Input, Output

Summary: Computer System Components

Proc
Caches
Busses
Adapters
Memory
Controllers
I/O Devices: Disks, Displays, Keyboards, Networks
All have interfaces & organizations

Review: Summary from Last Lecture

All computers consist of five components
  Processor: (1) datapath and (2) control
  (3) Memory
  (4) Input devices and (5) Output devices
Not all memory is created equal
  Cache: fast (expensive) memory is placed closer to the processor
  Main memory: less expensive memory, so we can have more of it
Input and output (I/O) devices have the messiest organization
  Wide range of speed: graphics vs. keyboard
  Wide range of requirements: speed, standard, cost, etc.
  Least amount of research (so far)

Integrated Circuits Costs --- manufacturing process (p24)

See "how chips are made" on the website.

Integrated Circuits Costs --- formula

\[ \text{Die cost} = \frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Die yield}} \]

\[ \text{Dies per wafer} = \frac{\text{Wafer area}}{\text{Die area}} \]

\[ \text{Die yield} = \frac{1}{\left(1 + \text{Defects per area} \times \text{Die area} / 2\right)^{2}} \]

Die cost goes roughly with the cube of the die area.

Real World Examples

Chip         Metal   Line   Wafer  Defects  Area   Dies/  Yield  Die
             layers  width  cost   /cm2     (mm2)  wafer         cost
386DX        2       0.90   $900   1.0      43     360    71%    $4
486DX2       3       0.80   $1200  1.0      81     181    54%    $12
PowerPC 601  4       0.80   $1700  1.3      121    115    28%    $53
HP PA 7100   3       0.80   $1300  1.0      196    66     27%    $73
DEC Alpha    3       0.70   $1500  1.2      234    53     19%    $149
SuperSPARC   3       0.70   $1700  1.6      256    48     13%    $272
Pentium      3       0.80   $1500  1.5      296    40     9%     $417

From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15

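As a quick check of the die cost formula against the Gwennap table, the sketch below recomputes the "Die cost" column from the wafer cost, dies per wafer, and yield columns. The names `CHIPS`, `die_cost`, and `DIE_COSTS` are illustrative only, and rounding to whole dollars is an assumption about how the table was produced.

```python
# Rows copied from the table: (chip, wafer cost in $, dies per wafer, die yield).
CHIPS = [
    ("386DX", 900, 360, 0.71),
    ("486DX2", 1200, 181, 0.54),
    ("PowerPC 601", 1700, 115, 0.28),
    ("HP PA 7100", 1300, 66, 0.27),
    ("DEC Alpha", 1500, 53, 0.19),
    ("SuperSPARC", 1700, 48, 0.13),
    ("Pentium", 1500, 40, 0.09),
]


def die_cost(wafer_cost, dies_per_wafer, die_yield):
    """Die cost = cost per wafer / (dies per wafer * die yield)."""
    return wafer_cost / (dies_per_wafer * die_yield)


# Rounded to whole dollars (an assumption) for comparison with the table.
DIE_COSTS = {name: round(die_cost(cost, dies, y)) for name, cost, dies, y in CHIPS}

for name, cost in DIE_COSTS.items():
    print(f"{name:12s} ${cost}")
```

With these inputs the rounded results line up with the table's die cost column, which suggests the table was built from the same relation.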
Other Costs

\[ \text{IC cost} = \frac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}} \]

Packaging cost: depends on pins, heat dissipation

Chip         Die cost  Package            Test &    Total
                       pins  type  cost   Assembly
386DX        $4        132   QFP   $1     $4        $9
486DX2       $12       168   PGA   $11    $12       $35
PowerPC 601  $53       304   QFP   $3     $21       $77
HP PA 7100   $73       504   PGA   $35    $16       $124
DEC Alpha    $149      431   PGA   $30    $23       $202
SuperSPARC   $272      293   PGA   $20    $34       $326
Pentium      $417      273   PGA   $19    $37       $473

CMOS improvements

Die size 2X / 3 years; line widths halve / 7 years
[Figure: transistors per unit area and die size, 1980 to 1992]

        Capacity        Speed
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   1.4x in 10 years
Disk    4x in 3 years   1.4x in 10 years

Processor Performance

[Figure: performance vs. year, 1987 to 1993, with growth trends of 1.35X/yr and 1.54X/yr. Machines plotted: Sun-4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC AXP 300, IBM Power 2/590]

The bottom line: Performance (and cost)

Airplane          DC to Paris  Range  Speed     Passengers  Throughput    Cost
                                      (m.p.h.)              (p x m.p.h.)
Boeing 747        6.5 hours    4150   610       470         286,700       ???
BAC/Sud Concorde  3 hours      4000   1350      132         178,200       ???
Douglas DC-8      7.3 hours    8720   544       146         79,424        ???

Different measurements lead to different results
  Time to do the task (Execution Time): execution time, response time, latency
  Tasks per day, hour, week, sec, ns, ... (Performance): throughput, bandwidth
Cost renders the measurement more complex
The bottom-line performance measurement is CPU execution time.

The bottom line: Performance (and cost)

"X is n times faster than Y" means

\[ n = \frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} \]

Time of Concorde vs. Boeing 747?
Throughput of Boeing 747 vs. Concorde?

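The two questions can be worked with the numbers from the airplane table. This is a small sketch; the function name `times_faster` is made up for illustration.

```python
def times_faster(ex_time_y, ex_time_x):
    """Return n such that X is n times faster than Y: n = ExTime(Y) / ExTime(X)."""
    return ex_time_y / ex_time_x


# DC-to-Paris flight times in hours, from the airplane table.
concorde_vs_747_time = times_faster(ex_time_y=6.5, ex_time_x=3.0)

# Throughput is a rate, so the ratio is Performance(X) / Performance(Y).
boeing_vs_concorde_throughput = 286_700 / 178_200

print(f"Concorde is {concorde_vs_747_time:.2f} times faster in flight time")
print(f"Boeing 747 has {boeing_vs_concorde_throughput:.2f} times the throughput")
```

So the Concorde wins on execution time (about 2.2 times faster) while the 747 wins on throughput (about 1.6 times higher), which is the point of the slide: the answer depends on the metric.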
Metrics of performance

Level                                     Metric
Application                               Answers per month, operations per second
Programming Language, Compiler, ISA       (millions) of instructions per second: MIPS
                                          (millions) of (F.P.) operations per second: MFLOP/s
Datapath, Control                         Megabytes per second
Function Units, Transistors, Wires, Pins  Cycles per second (clock rate)

Relating Processor Metrics (pp60-66)

\[ \text{CPU execution time} = \frac{\text{CPU clock cycles/pgm}}{\text{clock rate}} \]
or
\[ \text{CPU execution time} = \text{CPU clock cycles/pgm} \times \text{clock cycle time} \]

\[ \text{CPU clock cycles/pgm} = \text{Instructions/pgm} \times \text{avg. clock cycles per instr.} \]
or
\[ \text{CPI} = \frac{\text{CPU clock cycles/pgm}}{\text{Instructions/pgm}} \]

CPI tells us something about the Instruction Set Architecture, the implementation of that architecture, and the program measured

Aspects of CPU Performance

\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]

                  instr. count  CPI  clock rate
Program           X             X
Compiler          X             (x)
Instr. Set Arch.  X             X
Organization                    X    X
Technology                           X

Organizational Trade-offs

Application, Programming Language, Compiler, ISA, Datapath, Control, Function Units, Transistors, Wires, Pins
3 factors: Where are they? How are they related?
  Instruction Mix
  CPI
  Cycle Time

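CPI: How to compute?

Average cycles per instruction

\[ \text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} \]

CPU clock cycles summed up (CPI_i: cycles per instruction for class i; C_i: number of class-i instructions executed):

\[ \text{CPU time} = \text{ClockCycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times C_i \]

"Instruction frequency":

\[ \text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where} \quad F_i = \frac{C_i}{\text{Instruction Count}} \]

Invest resources where time is spent!
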
Example (page 60)
Our favorite program runs in 10 sec on machine A, which has a 400 MHz clock. We are trying to design a machine B with a faster clock rate so as to reduce the execution time to 6 sec. The increase in clock rate will affect the rest of the CPU design, causing B to require 1.2 times as many clock cycles as machine A for this program. What should the clock rate of B be?

Answer:
CPU time A = CPU clock cycles A / clock rate A ==> CPU clock cycles A = 10 sec x 400 x 10^6 = 4 x 10^9
Clock rate B = CPU clock cycles B / CPU time B = 1.2 x 4 x 10^9 / 6 sec = 800 MHz

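The same arithmetic, written out step by step as a minimal sketch (the variable names are illustrative only):

```python
time_a = 10.0          # seconds on machine A
clock_rate_a = 400e6   # Hz (400 MHz)
target_time_b = 6.0    # desired seconds on machine B
cycle_ratio = 1.2      # B needs 1.2 times as many cycles as A

# CPU time = clock cycles / clock rate, so cycles = time * rate.
cycles_a = time_a * clock_rate_a
cycles_b = cycle_ratio * cycles_a

# Clock rate that lets B finish cycles_b cycles in the target time.
clock_rate_b = cycles_b / target_time_b

print(f"Machine B needs a {clock_rate_b / 1e6:.0f} MHz clock")
```

Machine B must double the clock rate to gain only a 10/6 = 1.67 times reduction in execution time, because of the extra cycles.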
Example
Base Machine (Reg / Reg) and instruction frequencies in the execution of a program:

Op      Freq  Cycles
ALU     43%   1
Load    21%   2
Store   12%   2
Branch  24%   2

Question: What is the average CPI of the machine?
CPI = 1 x 43% + 2 x 21% + 2 x 12% + 2 x 24% = 1.57

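The frequency-weighted sum can be sketched as follows; `MIX` and `average_cpi` are hypothetical names chosen for this illustration.

```python
# (frequency, cycles) per instruction class, from the base machine table.
MIX = {
    "ALU": (0.43, 1),
    "Load": (0.21, 2),
    "Store": (0.12, 2),
    "Branch": (0.24, 2),
}


def average_cpi(mix):
    """CPI = sum over classes of CPI_i * F_i."""
    return sum(freq * cycles for freq, cycles in mix.values())


cpi = average_cpi(MIX)
print(f"Average CPI = {cpi:.2f}")
```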
Example: (page 62)
Suppose we have two implementations of the same instruction set. Machine A has a clock cycle time of 10 ns and an average CPI of 2.0 for some program. Machine B has a clock cycle time of 20 ns and an average CPI of 1.2 for the same program. Which is faster? And by how much?

Let I denote the number of instructions of the program.
CPU time A = I x 2.0 x 10 ns = 20 I ns
CPU time B = I x 1.2 x 20 ns = 24 I ns
Machine A is 24 / 20 = 1.2 times faster than B. Again we see that the 3 factors are related!

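A short sketch of the comparison. The instruction count is a made-up placeholder, since it cancels in the ratio; `cpu_time_ns` is an illustrative helper name.

```python
def cpu_time_ns(instruction_count, cpi, cycle_time_ns):
    """CPU time = instruction count * CPI * clock cycle time."""
    return instruction_count * cpi * cycle_time_ns


I = 1_000_000  # hypothetical instruction count; any positive value gives the same ratio

time_a = cpu_time_ns(I, cpi=2.0, cycle_time_ns=10)
time_b = cpu_time_ns(I, cpi=1.2, cycle_time_ns=20)

# "A is n times faster than B" means n = time_b / time_a.
speed_ratio = time_b / time_a
print(f"Machine A is {speed_ratio:.1f} times faster than machine B")
```

B has the lower CPI but still loses, because its cycle time is twice as long.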
Example (pp65-66)
ISA has 3 kinds of instructions:

Instruction class               A  B  C
CPI for this instruction class  1  2  3

One program has 2 code sequences:

               Instruction counts for instruction class
Code Sequence  A  B  C
1              2  1  2
2              4  1  1

Which code sequence has more instructions? Which will be faster? What is the CPI for each sequence?
S.1 has 5 instructions; S.2 has 6.
S.1 needs 2x1+1x2+2x3=10 cycles; S.2 needs 4x1+1x2+1x3=9 cycles, so S.2 is faster even though it executes more instructions.
S.1 has CPI=10/5=2; S.2 has CPI=9/6=1.5

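The counts, cycles, and CPI for both sequences can be tabulated with a few lines. This is a sketch; `CLASS_CPI`, `SEQUENCES`, and `summarize` are names invented here.

```python
CLASS_CPI = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for each code sequence.
SEQUENCES = {
    1: {"A": 2, "B": 1, "C": 2},
    2: {"A": 4, "B": 1, "C": 1},
}


def summarize(counts):
    """Return (instruction count, clock cycles, CPI) for one code sequence."""
    instructions = sum(counts.values())
    cycles = sum(n * CLASS_CPI[cls] for cls, n in counts.items())
    return instructions, cycles, cycles / instructions


for seq, counts in SEQUENCES.items():
    instructions, cycles, cpi = summarize(counts)
    print(f"S.{seq}: {instructions} instructions, {cycles} cycles, CPI = {cpi}")
```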
Marketing Metrics

MIPS: Million Instructions Per Second

\[ \text{MIPS} = \frac{\text{Instruction Count}}{\text{Time} \times 10^6} = \frac{\text{Clock Rate}}{\text{CPI} \times 10^6} \]

  Machines with different instruction sets?
  Programs with different instruction mixes? (dynamic frequency of instructions)
  Peak MIPS: impractical
  Uncorrelated with performance (see the next example)
  Many pitfalls

MFLOP/S: Million Floating-point Operations Per Second

\[ \text{MFLOP/S} = \frac{\text{FP Operations}}{\text{Time} \times 10^6} \]

  Machine dependent
  Often not where time is spent

Example: (similar to the example at pp78-79, but not the same)
Referring to the example at slide 23, assume we build an optimizing compiler for the load/store machine. The compiler discards 50% of the ALU instructions.
1) What is CPI_opt?
2) Ignoring system issues and assuming a 20 ns clock cycle time (50 MHz clock rate), what is the MIPS rating for optimized code versus unoptimized code? Does the MIPS rating agree with the ranking by execution time?

                Unoptimized  Optimizing compiler
Op      Cycles  Freq         New Freq
ALU     1       43%          27%
Load    2       21%          27%
Store   2       12%          15%
Branch  2       24%          31%
CPI             1.57         1.73
MIPS            31.8         28.9

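The sketch below recomputes the optimized mix and compares MIPS with execution time. It treats the original frequencies as relative instruction counts, so halving the ALU count shrinks the total to 78.5% of the original. All names are illustrative.

```python
CLOCK_RATE_HZ = 50e6   # 20 ns clock cycle time
CYCLE_TIME_NS = 20.0
CYCLES = {"ALU": 1, "Load": 2, "Store": 2, "Branch": 2}
FREQ = {"ALU": 0.43, "Load": 0.21, "Store": 0.12, "Branch": 0.24}

# Relative counts after the compiler discards 50% of the ALU instructions.
OPT = dict(FREQ, ALU=FREQ["ALU"] * 0.5)


def total_cycles(counts):
    """Clock cycles per original instruction executed."""
    return sum(counts[op] * CYCLES[op] for op in counts)


def cpi(counts):
    """Average CPI = total cycles / total instruction count."""
    return total_cycles(counts) / sum(counts.values())


def mips(cpi_value):
    """MIPS = clock rate / (CPI * 10^6)."""
    return CLOCK_RATE_HZ / (cpi_value * 1e6)


cpi_unopt, cpi_opt = cpi(FREQ), cpi(OPT)
mips_unopt, mips_opt = mips(cpi_unopt), mips(cpi_opt)

# Execution time in ns per original instruction.
time_unopt = total_cycles(FREQ) * CYCLE_TIME_NS
time_opt = total_cycles(OPT) * CYCLE_TIME_NS

print(f"CPI:  {cpi_unopt:.2f} -> {cpi_opt:.2f}")
print(f"MIPS: {mips_unopt:.1f} -> {mips_opt:.1f}")
print(f"Time: {time_unopt:.1f} ns -> {time_opt:.1f} ns per original instruction")
```

The optimized code runs faster (fewer total cycles) yet gets a lower MIPS rating, so MIPS disagrees with execution time here. The unrounded MIPS for the optimized code comes out near 29.0; the table's 28.9 follows from using the rounded CPI of 1.73.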
Why Do Benchmarks? Or How to evaluate an athlete?

Triathlon (3 sports): swimming, bicycling, running
Pentathlon (5 sports): sprinting, hurdling, long jumping, discus, javelin
Heptathlon (7 sports)
Decathlon (10 sports): 100-meter, 400-meter, 1,500-meter runs; 110-meter high hurdles; discus, javelin throws; shot-put; pole vault; high jump; long jump

Why Do Benchmarks?

How we evaluate differences
  Different systems
  Changes to a single system
Provide a target
  Benchmarks should represent a large class of important programs
  Improving benchmark performance should help many programs
For better or worse, benchmarks shape a field
Good ones accelerate progress
  Good target for development
Bad benchmarks hurt progress
  Help real programs vs. sell machines/papers?
  Inventions that help real programs don't help the benchmark

Programs to Evaluate Processor Performance (pp86-87)

(Toy) Benchmarks
  Small but easy to compile and run on simulators; convenient in the early design stage of a new machine
  No compiler for novel machines
  10-100 lines
  e.g., sieve, puzzle, quicksort
Synthetic Benchmarks
  Attempt to use a single benchmark to match the average frequencies of real workloads or of a set of benchmarks
  e.g., Whetstone (Algol60 -> Fortran), Dhrystone (Ada -> C)
Kernels
  Time-critical excerpts of real programs
  Popular in scientific computing to illuminate the performance of individual features of a machine
  e.g., Livermore loops (21 loops), Linpack (linear algebra)
Real programs
  e.g., gcc, spice

Successful Benchmark: SPEC (pp87-89)

1987: RISC industry mired in "bench marketing" ("That is an 8 MIPS machine, but they claim 10 MIPS!")
1988: EETimes (http://www.eet.com/) + 5 companies (Sun, MIPS, HP, Apollo, DEC) band together to form the Systems Performance Evaluation Committee (SPEC)
Create a standard list of programs, inputs, and reporting: some real programs, including OS calls and some I/O

SPEC first round

First round 1989; 10 programs = 4 for integer + 6 for FP; single number to summarize performance
One program (matrix300): 99% of time in a single line of code
A new front-end compiler could improve its result dramatically (Fig. 2.3)
[Figure: bar chart, scale 0 to 800, one bar per benchmark: gcc, doduc, li, eqntott, fpppp, spice, nasa7, matrix300, tomcatv, espresso]

SPEC Evolution

Second round: SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs)
  Compiler flags unlimited. Flags used in March 93 for the DEC 4000 Model 610:
    spice: unix.c:/def=(sysv,has_bcopy,bcopy(a,b,c)=memcpy(b,a,c )
    wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
    nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
  Add SPECbase: don't allow program-specific optimization flags.
Third round: 1995; new set of programs (8 int + 10 fp) (fig. 2.6, p72)
  Benchmarks useful for 3 years
  Base machine is changed from VAX-11/780 to Sun SPARC 10/40
Newer rounds include SPEC HPC96, SPEC JVM98, SPEC WEB99, SPEC OMP2001. See http://www.spec.org for more details.

How to Summarize Results?

            Machine A  Machine B
Program 1   1 sec      10 sec
Program 2   1000 sec   100 sec

What are your conclusions?
  A is 10 times faster than B for Program 1.
  B is 10 times faster than A for Program 2.
  Total execution time is a consistent summary measure: B is 1001/110 = 9.1 times faster than A.
  Workload: need to consider the percentage/frequency of each program in the total job.

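The three comparisons can be reproduced directly from the execution times. A minimal sketch with illustrative names:

```python
# Execution times in seconds: [Program 1, Program 2].
TIMES = {"A": [1, 1000], "B": [10, 100]}

# Per-program ratios: how many times faster is A than B?
per_program = [b / a for a, b in zip(TIMES["A"], TIMES["B"])]

# Total execution time assumes each program runs once.
total_a = sum(TIMES["A"])
total_b = sum(TIMES["B"])
b_over_a = total_a / total_b

print(f"A vs. B per program: {per_program}")
print(f"Totals: A = {total_a} s, B = {total_b} s")
print(f"B is {b_over_a:.1f} times faster than A on total time")
```

The total-time result depends on the assumption that both programs run equally often; a different workload mix would change the weights, which is why the last bullet matters.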
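How to Summarize Performance

Suppose n programs have execution times t_i, where i = 1, 2, ..., n.
Suppose the workload is (w_1, w_2, ..., w_n), where

\[ \sum_{i=1}^{n} w_i = 1 \]

Arithmetic Mean:
\[ AM(t_i) = \frac{1}{n} \sum_{i=1}^{n} t_i = \frac{1}{n}\,(t_1 + t_2 + \cdots + t_n) \]

Weighted Arithmetic Mean:
\[ WAM(t_i) = \sum_{i=1}^{n} t_i w_i = t_1 w_1 + t_2 w_2 + \cdots + t_n w_n \]

Geometric Mean:
\[ GM(t_i) = \sqrt[n]{\prod_{i=1}^{n} t_i} = \sqrt[n]{t_1 t_2 \cdots t_n} = t_1^{1/n}\, t_2^{1/n} \cdots t_n^{1/n} \]

Weighted Geometric Mean:
\[ WGM(t_i) = \prod_{i=1}^{n} t_i^{w_i} = t_1^{w_1}\, t_2^{w_2} \cdots t_n^{w_n} \]

Harmonic Mean:
\[ HM(t_i) = \frac{n}{\sum_{i=1}^{n} \frac{1}{t_i}} = \frac{n}{\frac{1}{t_1} + \frac{1}{t_2} + \cdots + \frac{1}{t_n}} \]

Weighted Harmonic Mean:
\[ WHM(t_i) = \frac{1}{\sum_{i=1}^{n} \frac{w_i}{t_i}} = \frac{1}{\frac{w_1}{t_1} + \frac{w_2}{t_2} + \cdots + \frac{w_n}{t_n}} \]
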
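The six means translate almost directly into code. The sketch below assumes weights that sum to 1 and strictly positive times, and uses `math.prod` (Python 3.8 or later). Function names are illustrative.

```python
import math


def arithmetic_mean(t):
    return sum(t) / len(t)


def weighted_arithmetic_mean(t, w):
    return sum(ti * wi for ti, wi in zip(t, w))


def geometric_mean(t):
    return math.prod(t) ** (1.0 / len(t))


def weighted_geometric_mean(t, w):
    return math.prod(ti ** wi for ti, wi in zip(t, w))


def harmonic_mean(t):
    return len(t) / sum(1.0 / ti for ti in t)


def weighted_harmonic_mean(t, w):
    # With weights summing to 1, the numerator is 1 rather than n.
    return 1.0 / sum(wi / ti for ti, wi in zip(t, w))


times = [1.0, 1000.0]   # execution times of two programs, in seconds
equal = [0.5, 0.5]      # equal weights reduce each weighted mean to the plain one

print(arithmetic_mean(times), weighted_arithmetic_mean(times, equal))
print(geometric_mean(times), weighted_geometric_mean(times, equal))
print(harmonic_mean(times), weighted_harmonic_mean(times, equal))
```

Setting every weight to 1/n is a handy sanity check: each weighted mean should then agree with its unweighted counterpart.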
How to Summarize Performance

3 means: arithmetic, geometric and harmonic. Which one is correct?
Arithmetic mean (or weighted arithmetic mean) tracks execution time.
Harmonic mean (or weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time.
Normalized execution time is handy for scaling performance (e.g., time on reference machine / time on measured machine).
But do not take the arithmetic mean of normalized execution times; use the geometric mean.
The geometric mean rewards all improvements equally: program A going from 2 seconds to 1 second is as important as program B going from 2000 seconds to 1000 seconds.
See example p81.

Impact of Means on SPECmark89 for IBM 550

            Ratio to VAX      Time              Weighted Time
Program     Before  After     Before  After     Before  After
gcc         30      29        49      51        8.91    9.22
espresso    35      34        65      67        7.64    7.86
spice       47      47        510     510       5.69    5.69
doduc       46      49        41      38        5.81    5.45
nasa7       78      144       258     140       3.43    1.86
li          34      34        183     183       7.86    7.86
eqntott     40      40        28      28        6.68    6.68
matrix300   78      730       58      6         3.43    0.37
fpppp       90      87        34      35        2.97    3.07
tomcatv     133     138       20      19        2.01    1.94
Mean        54      72        124     108       54.42   49.99
            Geometric         Arithmetic        Weighted Arith.
Ratio       1.33              1.16              1.09

Example (p81)

                 Time on A  Time on B  Normalized to A  Normalized to B
                                       A      B         A      B
program1         1          10         1      10        0.1    1
program2         1000       100        1      0.1       10     1
Arithmetic mean  500.5      55         1      5.05      5.05   1
Geometric mean   31.6       31.6       1      1         1      1

The difficulty arises from the use of the arithmetic mean of normalized times. The geometric mean is independent of which data series we use for normalization because

\[ \frac{GM(x_i)}{GM(y_i)} = GM\!\left(\frac{x_i}{y_i}\right) \]

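The table can be regenerated to see the inconsistency of the arithmetic mean first-hand. A small sketch with made-up helper names `am` and `gm`:

```python
import math


def am(xs):
    return sum(xs) / len(xs)


def gm(xs):
    return math.prod(xs) ** (1.0 / len(xs))


time_a = [1.0, 1000.0]   # program1, program2 on machine A
time_b = [10.0, 100.0]   # program1, program2 on machine B

b_norm_to_a = [b / a for a, b in zip(time_a, time_b)]   # B's times relative to A
a_norm_to_b = [a / b for a, b in zip(time_a, time_b)]   # A's times relative to B

# Arithmetic mean: each machine looks about 5 times slower than the other.
print(am(b_norm_to_a), am(a_norm_to_b))

# Geometric mean: both normalizations agree that the machines are equal.
print(gm(b_norm_to_a), gm(a_norm_to_b))

# The identity GM(x_i) / GM(y_i) = GM(x_i / y_i).
print(gm(time_b) / gm(time_a), gm(b_norm_to_a))
```

Under the arithmetic mean the verdict flips with the choice of reference machine; the geometric mean gives the same answer either way, although it no longer tracks total execution time.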
Amdahl's Law

Speedup due to enhancement E:

\[ \text{Speedup}(E) = \frac{\text{ExTime(without E)}}{\text{ExTime(with E)}} = \frac{\text{Performance(with E)}}{\text{Performance(without E)}} \]

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then

\[ \text{ExTime(with E)} = \left((1 - F) + \frac{F}{S}\right) \times \text{ExTime(without E)} \]

\[ \text{Speedup}(E) = \frac{\text{ExTime(without E)}}{\left((1 - F) + \frac{F}{S}\right) \times \text{ExTime(without E)}} = \frac{1}{(1 - F) + \frac{F}{S}} \]

Example: Suppose a person wants to travel from city A to city B by way of city C. The routes from A to C are in the mountains and the routes from C to B are in the desert. The distances from A to C and from C to B are 80 miles and 200 miles, respectively.
  From A to C: walk at a speed of 4 mph
  From C to B: walk, or drive at a speed of 100 mph
Questions:
  How long will the entire trip take?
  How much faster is the trip from A to B with a car than on foot?

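One way to work the trip, assuming the walking speed is 4 mph in the desert as well (the slide gives it only for the mountain leg). The last lines cross-check the result against Amdahl's Law, with the desert leg as the enhanced fraction.

```python
WALK_MPH = 4.0      # walking speed; assumed to apply to both legs
DRIVE_MPH = 100.0   # driving speed, desert leg only
DIST_AC = 80.0      # miles, mountains (must be walked)
DIST_CB = 200.0     # miles, desert (walk or drive)

time_ac = DIST_AC / WALK_MPH                  # hours, same in both cases
walk_total = time_ac + DIST_CB / WALK_MPH     # hours, walking all the way
drive_total = time_ac + DIST_CB / DRIVE_MPH   # hours, driving the desert leg
speedup = walk_total / drive_total

# Amdahl's Law view: F is the fraction of the all-walking time spent in the
# desert, S is how much faster driving is than walking.
F = (DIST_CB / WALK_MPH) / walk_total
S = DRIVE_MPH / WALK_MPH
amdahl = 1.0 / ((1.0 - F) + F / S)

print(f"Walking: {walk_total:.0f} h, with car: {drive_total:.0f} h")
print(f"Speedup: {speedup:.2f} (Amdahl: {amdahl:.2f})")
```

Even though the car is 25 times faster than walking, the trip as a whole speeds up only about 3.2 times, because the 20 hours in the mountains cannot be improved.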
Example: Suppose an enhancement runs 10 times faster than the original machine, but is only usable 40% of the time.
Question: What is the overall speedup?
Answer:
  Fraction_enhanced = 0.4
  Speedup_enhanced = 10
  Speedup_overall = 1 / (0.6 + 0.4/10) = 1.56

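As a sketch, Amdahl's Law fits in one small function; `amdahl_speedup` is an illustrative name. The second computation shows the upper bound reached when the enhanced part becomes infinitely fast.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


overall = amdahl_speedup(fraction_enhanced=0.4, speedup_enhanced=10)

# Limit as S grows without bound: only the unenhanced 60% remains.
upper_bound = 1.0 / (1.0 - 0.4)

print(f"Overall speedup: {overall:.2f}")
print(f"Best possible speedup with F = 0.4: {upper_bound:.2f}")
```

A tenfold enhancement yields only about 1.56 overall, and no enhancement of that 40% can push the total past roughly 1.67.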
Cost Summary
Integrated circuits are driving the computer industry
Die cost goes up with the cube of die area

Performance Evaluation Summary

\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]

Time is the measure of computer performance!
Good products are created when we have:
  Good benchmarks
  Good ways to summarize performance
Without good benchmarks and summaries, the choice is between improving the product for real programs and improving the product to get more sales; sales almost always wins.
Remember Amdahl's Law: speedup is limited by the unimproved part of the program.

Homework, due Feb. 3 class time (Monday)
  Questions 1.1 through 1.26
  Questions 2.1 through 2.4
  Questions 2.10 through 2.13
  Questions 2.18 through 2.20
  Question 2.41
