levels of abstraction
Compiler
SPARCstation 20
Networks
Die cost = Wafer cost / (Dies per wafer × Die yield)
Die Yield =
Chip          Metal   Line width  Wafer   Defects  Area    Dies/   Yield  Die
              layers  (µm)        cost    /cm2     (mm2)   wafer          cost
386DX         2       0.90        $900    1.0      43      360     71%    $4
486DX2        3       0.80        $1200   1.0      81      181     54%    $12
PowerPC 601   4       0.80        $1700   1.3      121     115     28%    $53
HP PA 7100    3       0.80        $1300   1.0      196     66      27%    $73
DEC Alpha     3       0.70        $1500   1.2      234     53      19%    $149
SuperSPARC    3       0.70        $1700   1.6      256     48      13%    $272
Pentium       3       0.80        $1500   1.5      296     40      9%     $417
From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15
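The die costs in the table above can be cross-checked with a short sketch. It assumes the relation die cost = wafer cost / (dies per wafer × die yield), which the slide does not spell out in full but which matches every listed value after rounding to whole dollars. The names `ROWS` and `die_cost` are illustrative only.

```python
# Each row: (wafer cost in $, dies per wafer, die yield, die cost listed in the table).
ROWS = [
    (900, 360, 0.71, 4),
    (1200, 181, 0.54, 12),
    (1700, 115, 0.28, 53),
    (1300, 66, 0.27, 73),
    (1500, 53, 0.19, 149),
    (1700, 48, 0.13, 272),
    (1500, 40, 0.09, 417),
]


def die_cost(wafer_cost, dies_per_wafer, die_yield):
    """Assumed relation: wafer cost spread over the good dies only."""
    return wafer_cost / (dies_per_wafer * die_yield)


# Recompute each die cost and round to whole dollars for comparison.
computed = [round(die_cost(w, d, y)) for w, d, y, _ in ROWS]
listed = [cost for _, _, _, cost in ROWS]

for got, want in zip(computed, listed):
    print(f"computed ${got}, listed ${want}")
```

The steep rise in cost comes from two effects at once: larger dies mean fewer dies per wafer, and a lower fraction of them work.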
Other Costs
IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield
Packaging Cost: depends on pins, heat dissipation
Chip          Pins   Package type
386DX         132    QFP
486DX2        168    PGA
PowerPC 601   304    QFP
HP PA 7100    504    PGA
DEC Alpha     431    PGA
SuperSPARC    293    PGA
Pentium       273    PGA
CMOS improvements
Die size 2X / 3 years; Line widths halve / 7 years
[Figure: die size and speed trends, 1980 to 1992]
Processor Performance
[Figure: performance versus year, 1987 to 1993. Machines plotted: Sun-4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750. Trend lines: 1.35X/yr and 1.54X/yr.]
Different measurements lead to different results
  Time to do the task (Execution Time)
    execution time, response time, latency
  Tasks per day, hour, week, sec, ns, ... (Performance)
    throughput, bandwidth
Cost renders the measurement more complex
The bottom-line performance measurement is CPU execution time.
"X is n times faster than Y" means
$$ n = \frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} $$
Time of Concorde vs. Boeing 747?
Throughput of Boeing 747 vs. Concorde?
Metrics of performance
Levels: Application, Programming Language, Compiler, ISA, Datapath, Control, Function Units, Transistors, Wires, Pins
Metrics:
  (millions of) instructions per second: MIPS
  (millions of) floating-point operations per second: MFLOP/s
[Table: factors affecting CPI and clock rate]
January 23, 2003
Organizational Trade-offs
Levels: Application, Programming Language, Compiler, ISA, Datapath, Control, Function Units, Transistors, Wires, Pins
Cycle Time
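$$ \text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} $$
$$ \text{CPU time} = \text{ClockCycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times C_i $$
$$ \text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where} \quad F_i = \frac{C_i}{\text{Instruction\_Count}} $$
Here $C_i$ is the number of executed instructions of class $i$ and $F_i$ is their frequency.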
Our favorite program runs in 10 sec on machine A, which has a 400 MHz clock. We are trying to design a machine B with a faster clock rate so as to reduce the execution time to 6 sec. The increase in clock rate will affect the rest of the CPU design, causing B to require 1.2 times as many clock cycles as machine A for this program. What clock rate should machine B have?
Answer:
  CPU time A = CPU clock cycles A / clock rate A
  ==> CPU clock cycles A = 10 sec × 400 × 10^6 = 4000 × 10^6 cycles
  Clock rate B = CPU clock cycles B / CPU time B = 1.2 × 4000 × 10^6 / 6 = 800 MHz
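The same arithmetic can be written as a small sketch. The function and parameter names are illustrative, not part of the slides.

```python
def required_clock_rate(time_a_s, clock_a_hz, cycle_ratio, time_b_s):
    """Clock rate machine B needs to finish in time_b_s seconds.

    cycle_ratio is how many times more clock cycles B needs than A.
    """
    cycles_a = time_a_s * clock_a_hz   # CPU time = cycles / clock rate
    cycles_b = cycle_ratio * cycles_a  # B needs 1.2x the cycles of A
    return cycles_b / time_b_s


clock_b_hz = required_clock_rate(10.0, 400e6, 1.2, 6.0)
print(f"Machine B clock rate: {clock_b_hz / 1e6:.0f} MHz")
```

Doubling the clock rate buys only a 10/6 = 1.67 times shorter execution time here, because the cycle count grows by 20%.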
Example
Base Machine (Reg / Reg) and instruction frequencies in the execution of a program:
Op       Freq   Cycles
ALU      43%    1
Load     21%    2
Store    12%    2
Branch   24%    2
Question: What is the average CPI of the machine?
CPI = 1 × 43% + 2 × 21% + 2 × 12% + 2 × 24% = 1.57
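As a quick check of this weighted sum, here is a minimal sketch using the frequencies and cycle counts of the base machine. The dictionary layout and the name `average_cpi` are illustrative choices.

```python
# Instruction class -> (frequency, cycles per instruction), from the base machine example.
MIX = {
    "ALU": (0.43, 1),
    "Load": (0.21, 2),
    "Store": (0.12, 2),
    "Branch": (0.24, 2),
}


def average_cpi(mix):
    """CPI = sum over instruction classes of CPI_i * F_i."""
    return sum(freq * cycles for freq, cycles in mix.values())


cpi = average_cpi(MIX)
print(f"Average CPI = {cpi:.2f}")
```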
Example (page 62): Suppose we have two implementations of the same instruction set. Machine A has a clock cycle time of 10 ns and an average CPI of 2.0 for some program. Machine B has a clock cycle time of 20 ns and an average CPI of 1.2 for the same program. Which is faster, and by how much?
Let I denote the number of instructions of the program.
  CPU time A = I × 2.0 × 10 ns = 20 I ns
  CPU time B = I × 1.2 × 20 ns = 24 I ns
Machine A is 24/20 = 1.2 times faster than B. Again we see that the 3 factors are related!
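The comparison can be sketched in a few lines. Since both machines run the same program, the instruction count cancels, so the sketch works with time per instruction. The function name is an illustrative choice.

```python
def time_per_instruction_ns(cpi, cycle_time_ns):
    """CPU time divided by instruction count: CPI * clock cycle time."""
    return cpi * cycle_time_ns


time_a = time_per_instruction_ns(2.0, 10.0)  # machine A
time_b = time_per_instruction_ns(1.2, 20.0)  # machine B

# "A is n times faster than B" means n = ExTime(B) / ExTime(A).
n = time_b / time_a
print(f"A: {time_a} ns/instr, B: {time_b} ns/instr, A is {n:.1f} times faster")
```

Machine B has the lower CPI but still loses, which is why no single factor can be judged in isolation.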
Example (pp65-66)
ISA has 3 kinds of instructions:
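Instruction class                  A   B   C
CPI for this instruction class     1   2   3
Code sequence S.1 uses 2 A, 1 B and 2 C instructions; S.2 uses 4 A, 1 B and 1 C instructions.
Which code sequence has more instructions? Which will be faster? What is the CPI for each sequence?
  S.1 has 5 instructions; S.2 has 6.
  S.1 needs 2×1 + 1×2 + 2×3 = 10 cycles; S.2 needs 4×1 + 1×2 + 1×3 = 9 cycles, so S.2 is faster even though it has more instructions.
  S.1 has CPI = 10/5 = 2; S.2 has CPI = 9/6 = 1.5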
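A short sketch reproduces these numbers. The per-class instruction counts are inferred from the cycle arithmetic in the answer (2, 1, 2 for S.1 and 4, 1, 1 for S.2), and the names `CLASS_CPI` and `analyze` are illustrative.

```python
# Cycles per instruction for each instruction class.
CLASS_CPI = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class, inferred from the cycle arithmetic in the answer.
S1 = {"A": 2, "B": 1, "C": 2}
S2 = {"A": 4, "B": 1, "C": 1}


def analyze(sequence):
    """Return (instruction count, total cycles, CPI) for a code sequence."""
    instructions = sum(sequence.values())
    cycles = sum(count * CLASS_CPI[cls] for cls, count in sequence.items())
    return instructions, cycles, cycles / instructions


for name, seq in (("S.1", S1), ("S.2", S2)):
    instructions, cycles, cpi = analyze(seq)
    print(f"{name}: {instructions} instructions, {cycles} cycles, CPI = {cpi}")
```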
ECE4680 Lec2 Performance.25 January 23, 2003
Marketing Metrics
MIPS
Million Instructions Per Second
  machines with different instruction sets?
  programs with different instruction mixes?
    dynamic frequency of instructions
  Peak MIPS: impractical
  uncorrelated with performance (see the next example)
Many pitfalls?
Example (similar to the example at pp. 78-79, but not the same): Referring to the example at slide 23, assume we build an optimizing compiler for the load/store machine. The compiler discards 50% of the ALU instructions.
1) What is the CPI_opt?
2) Ignoring system issues and assuming a 20 ns clock cycle time (50 MHz clock rate), what is the MIPS rating for optimized code versus unoptimized code? Does the MIPS rating agree with the ranking by execution time?
Op       Freq   Cycle
ALU      43%    1
Load     21%    2
Store    12%    2
Branch   24%    2
CPI and MIPS: compare the unoptimized code with the optimizing compiler.
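One way to work this example is sketched below. It assumes the unoptimized mix is the one from the base machine example (ALU 43%, Load 21%, Store 12%, Branch 24%, with 1, 2, 2, 2 cycles), counts instructions per 100 unoptimized instructions, and uses MIPS = clock rate in MHz / CPI. All names are illustrative.

```python
CLOCK_MHZ = 50.0  # 20 ns clock cycle time

# Class -> (instructions per 100 unoptimized instructions, cycles each).
BASE = {"ALU": (43.0, 1), "Load": (21.0, 2), "Store": (12.0, 2), "Branch": (24.0, 2)}

# The optimizing compiler discards half of the ALU instructions.
OPT = dict(BASE, ALU=(BASE["ALU"][0] * 0.5, BASE["ALU"][1]))


def summarize(mix):
    """Return (CPI, MIPS rating, execution time in microseconds) for a mix."""
    instructions = sum(count for count, _ in mix.values())
    cycles = sum(count * cpi for count, cpi in mix.values())
    cpi = cycles / instructions
    mips = CLOCK_MHZ / cpi       # millions of instructions per second
    time_us = cycles / CLOCK_MHZ  # cycles / (cycles per microsecond)
    return cpi, mips, time_us


cpi_base, mips_base, time_base = summarize(BASE)
cpi_opt, mips_opt, time_opt = summarize(OPT)

print(f"unoptimized: CPI {cpi_base:.2f}, MIPS {mips_base:.1f}, time {time_base:.2f} us")
print(f"optimized:   CPI {cpi_opt:.2f}, MIPS {mips_opt:.1f}, time {time_opt:.2f} us")
```

Under these assumptions the optimized code finishes sooner yet gets the lower MIPS rating, because the discarded ALU instructions were the cheapest ones.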
Triathlon (3 sports): swimming, bicycling, running
Pentathlon (5 sports): sprinting, hurdling, long jumping, discus, javelin
Heptathlon (7 sports)
Decathlon (10 sports): 100-meter, 400-meter, 1,500-meter runs; 110-meter high hurdle; discus, javelin throws; shot-put; pole vault; high jump; long jump
Why Do Benchmarks?
How we evaluate differences
  Different systems
  Changes to a single system
Provide a target
  Benchmarks should represent a large class of important programs
  Improving benchmark performance should help many programs
For better or worse, benchmarks shape a field
Good ones accelerate progress
  good target for development
Bad benchmarks hurt progress
  help real programs v. sell machines/papers?
  Inventions that help real programs don't help benchmark
Benchmark
SPEC Evolution
Second round: SPECInt92 (6 integer programs) and SPECfp92 (14 floating point programs)
  Compiler flags unlimited. Flags for the DEC 4000 Model 610, March 93:
    spice: unix.c:/def=(sysv,has_bcopy,bcopy(a,b,c)=memcpy(b,a,c )
    wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
    nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
  Add SPECbase: don't allow program-specific optimization flags.
Third round: 1995; new set of programs (8 int + 10 fp) (fig. 2.6, p72)
  Benchmarks useful for 3 years
  Base machine is changed from VAX-11/780 to Sun SPARC 10/40
Newer rounds include SPEC HPC96, SPEC JVM98, SPEC WEB99, SPEC OMP2001. See http://www.spec.org for more details.
How to Summarize Results?
  Program 1: 1 sec on machine A, 10 sec on machine B
  Program 2: 1000 sec on machine A, 100 sec on machine B
What are your conclusions?
  A is 10 times faster than B for Program 1.
  B is 10 times faster than A for Program 2.
  Total execution time, a consistent summary measure: B is 1001/110 = 9.1 times faster than A.
  Workload: need to consider the percentage/frequency of each program in the total job.
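The weights satisfy $\sum_{i=1}^{n} w_i = 1$, where $i = 1, 2, \ldots, n$.
Arithmetic mean:
$$ AM(t_i) = \frac{1}{n}\sum_{i=1}^{n} t_i = \frac{1}{n}\,(t_1 + t_2 + \cdots + t_n) $$
Weighted arithmetic mean:
$$ WAM(t_i) = \sum_{i=1}^{n} t_i w_i = t_1 w_1 + t_2 w_2 + \cdots + t_n w_n $$
Geometric mean:
$$ GM(t_i) = \left( \prod_{i=1}^{n} t_i \right)^{1/n} = t_1^{1/n}\, t_2^{1/n} \cdots t_n^{1/n} $$
Weighted geometric mean:
$$ WGM(t_i) = \prod_{i=1}^{n} t_i^{w_i} = t_1^{w_1}\, t_2^{w_2} \cdots t_n^{w_n} $$
Harmonic mean:
$$ HM(t_i) = \frac{n}{\sum_{i=1}^{n} \frac{1}{t_i}} = \frac{n}{\frac{1}{t_1} + \frac{1}{t_2} + \cdots + \frac{1}{t_n}} $$
Weighted harmonic mean:
$$ WHM(t_i) = \frac{1}{\sum_{i=1}^{n} \frac{w_i}{t_i}} = \frac{1}{\frac{w_1}{t_1} + \frac{w_2}{t_2} + \cdots + \frac{w_n}{t_n}} $$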
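The six means can be sketched as plain functions. This is a minimal illustration that assumes the weights sum to 1. The function names are illustrative, and the sample times are the 1 sec and 1000 sec runs of machine A from the two-program example.

```python
import math


def arithmetic_mean(t):
    return sum(t) / len(t)


def weighted_arithmetic_mean(t, w):
    return sum(ti * wi for ti, wi in zip(t, w))


def geometric_mean(t):
    return math.prod(t) ** (1.0 / len(t))


def weighted_geometric_mean(t, w):
    return math.prod(ti ** wi for ti, wi in zip(t, w))


def harmonic_mean(t):
    return len(t) / sum(1.0 / ti for ti in t)


def weighted_harmonic_mean(t, w):
    # Assumes the weights sum to 1.
    return 1.0 / sum(wi / ti for ti, wi in zip(t, w))


times = [1.0, 1000.0]  # machine A on program 1 and program 2
equal = [0.5, 0.5]     # equal weights reduce each weighted mean to the plain one

print(arithmetic_mean(times), geometric_mean(times), harmonic_mean(times))
```

With equal weights each weighted form gives the same value as its unweighted counterpart, which is a handy sanity check on the formulas.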
Program   Time     Time    Weighted time   Weighted time
          before   after   before          after
1         49       51      8.91            9.22
2         65       67      7.64            7.86
3         510      510     5.69            5.69
4         41       38      5.81            5.45
5         258      140     3.43            1.86
6         183      183     7.86            7.86
7         28       28      6.68            6.68
8         58       6       3.43            0.37
9         34       35      2.97            3.07
10        20       19      2.01            1.94
Summary means shown: geometric and arithmetic.
Arithmetic mean of times: 124 before, 108 after; ratio 1.16
Example (p81)
                  Time    Time    Normalized to A   Normalized to B
                  on A    on B    A      B          A      B
program1          1       10      1      10         0.1    1
program2          1000    100     1      0.1        10     1
Arithmetic mean   500.5   55      1      5.05       5.05   1
Geometric mean    31.6    31.6    1      1          1      1
The difficulty arises from the use of arithmetic mean of normalized time. Geometric mean is independent of which data series we use for normalized time because
$$ \frac{GM(x_i)}{GM(y_i)} = GM\!\left(\frac{x_i}{y_i}\right) $$
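This property can be checked numerically with the two-program example (1 and 1000 sec on A, 10 and 100 sec on B). The sketch below is illustrative, with hypothetical names.

```python
import math


def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))


def arithmetic_mean(values):
    return sum(values) / len(values)


time_a = [1.0, 1000.0]
time_b = [10.0, 100.0]

# Times of B normalized to A, and times of A normalized to B.
b_over_a = [b / a for a, b in zip(time_a, time_b)]
a_over_b = [a / b for a, b in zip(time_a, time_b)]

# Geometric mean: the same verdict whichever machine is the reference.
gm_of_ratios = geometric_mean(b_over_a)
ratio_of_gms = geometric_mean(time_b) / geometric_mean(time_a)

# Arithmetic mean of normalized times: each machine looks about 5 times
# slower than the other, depending on the reference chosen.
am_b_over_a = arithmetic_mean(b_over_a)
am_a_over_b = arithmetic_mean(a_over_b)

print(gm_of_ratios, ratio_of_gms, am_b_over_a, am_a_over_b)
```

Consistency is the only thing the geometric mean guarantees here. It does not track total execution time, for which B is 9.1 times faster.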
Amdahl's Law
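Speedup due to enhancement E:
$$ \text{Speedup}(E) = \frac{\text{ExTime without } E}{\text{ExTime with } E} $$
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then
$$ \text{ExTime with } E = \left( (1 - F) + \frac{F}{S} \right) \times \text{ExTime without } E $$
$$ \text{Speedup}(E) = \frac{1}{(1 - F) + \frac{F}{S}} $$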
Example: Suppose a person wants to travel from city A to city B via city C. The routes from A to C are in mountains and the routes from C to B are in desert. The distances from A to C and from C to B are 80 miles and 200 miles, respectively.
  From A to C: walk at a speed of 4 mph
  From C to B: walk, or drive at a speed of 100 mph
Questions:
  How long will the entire trip take?
  How much faster is the trip from A to B by car as opposed to walking?
Example: Suppose an enhancement runs 10 times faster than the original machine, but is only usable 40% of the time.
Question: What is the overall speedup?
Answer:
  Fraction_enhanced = 0.4
  Speedup_enhanced = 10
  Speedup_overall = 1 / (0.6 + 0.4/10) = 1.56
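Both Amdahl's Law examples can be worked with one small function. For the trip, the sketch assumes that walking in the desert is also at 4 mph, which the example does not state explicitly. The function and variable names are illustrative.

```python
import math


def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the time is accelerated by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Enhancement example: 10 times faster, usable 40% of the time.
overall = amdahl_speedup(0.4, 10.0)

# Trip example. Assumption: walking speed is 4 mph on both legs.
mountain_hours = 80.0 / 4.0       # always walked
desert_walk_hours = 200.0 / 4.0
desert_drive_hours = 200.0 / 100.0

walk_total = mountain_hours + desert_walk_hours
drive_total = mountain_hours + desert_drive_hours
trip_speedup = walk_total / drive_total

# The same result through Amdahl's Law: the desert leg is the enhanced fraction.
fraction = desert_walk_hours / walk_total
factor = desert_walk_hours / desert_drive_hours

print(f"overall speedup: {overall:.2f}")
print(f"trip: {walk_total:.0f} h walking, {drive_total:.0f} h with car, "
      f"speedup {trip_speedup:.2f}")
```

Under that assumption the car is 25 times faster on the desert leg, yet the whole trip speeds up by only about 3.2, because the 20 hours in the mountains are unchanged.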
Cost Summary
Integrated circuits are driving the computer industry
Die cost goes up with the cube of die area
Time is the measure of computer performance!
Good products are created when we have:
  Good benchmarks
  Good ways to summarize performance
Without good benchmarks and a good summary, the choice is between improving the product for real programs and improving the product to get more sales => sales almost always wins.
Remember Amdahl's Law: speedup is limited by the unimproved part of the program.
Homework, due Feb. 3 class time (Monday)
  Question 1.1 through 1.26
  Question 2.1 through 2.4
  Question 2.10 through 2.13
  Question 2.18 through 2.20
  Question 2.41