
ECE 462/562 Computer Architecture and Design

T-Th 12:30-1:45 in HARV210


www.ece.arizona.edu/~ece462
Instructor
Name: Ali Akoglu (ece.arizona.edu/~akoglu)
Office: ECE 356-B
Phone: (520) 626-5149
Email: akoglu@ece.arizona.edu
Office Hours: Tuesdays 11:00 AM – 12:00 PM
Thursdays 11:00 AM – 12:00 PM, or
by appointment
Computer Architecture

Abstraction Layers
Application
Algorithm
Programming Language
Operating System/Virtual Machines
Instruction Set Architecture (ISA)
Gates/Register-Transfer Level (RTL)
Circuits
Devices
Physics
Computer Architecture is Design and Analysis
Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems

[Figure: the design/analysis loop. Creativity feeds new designs;
cost/performance analysis sorts the results into good, mediocre,
and bad ideas, and the cycle repeats.]
Computer Architecture

• Applications suggest how to improve technology, and
provide the revenue to fund its development
• Improved technologies make new applications possible
• Cost of software development makes compatibility a major
force in the market
Trends: The End of the Uniprocessor Era

• Intel cancelled its high-performance uniprocessor project and
joined IBM and Sun in moving to multiple processors per chip
Crossroads: Conventional Wisdom
• Old Conventional Wisdom: Power is free, Transistors expensive
• New Conventional Wisdom: “Power wall” Power expensive, Xtors free
(Can put more on chip than can afford to turn on)
• Old CW: Can sufficiently increase Instruction Level Parallelism via
compilers and innovation (out-of-order, speculation, VLIW, …)
• New CW: “ILP wall” law of diminishing returns on more HW for ILP
• Old CW: Multiplies are slow, Memory access is fast
• New CW: “Memory wall” Memory slow, multiplies fast
(200 clock cycles to DRAM memory, 4 clocks for multiply)
• Old CW: Uniprocessor performance 2X / 1.5 yrs
• New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
• Uniprocessor performance now 2X / 5(?) yrs
•  Sea change in chip design: multiple “cores”
(2X processors per chip / ~ 2 years)
• More, simpler processors are more power efficient
Instruction Set Architecture: Critical Interface

      software
  ------------------
   instruction set
  ------------------
      hardware

• Properties of a good abstraction
– Lasts through many generations (portability)
– Used in many different ways (generality)
– Provides convenient functionality to higher levels
– Permits an efficient implementation at lower levels
ISA vs. Computer Architecture
• Old definition of computer architecture
= instruction set design
• Other aspects of computer design were called implementation
• Our view: computer architecture >> ISA
• The architect’s job is much more than instruction set design;
the technical hurdles today are more challenging than those in
instruction set design
• What really matters is the functioning of the complete system
– hardware, runtime system, compiler, operating system, and
application
• Computer architecture is not just about transistors, individual
instructions, or particular implementations
Course Focus

Understanding the design techniques, machine structures,
technology factors, and evaluation methods that will determine
the form of computers in the 21st century

[Figure: Computer Architecture (organization, hardware/software
boundary, interface design/ISA) at the center, surrounded by the
forces that shape it: technology, parallelism, programming
languages, applications, compilers, operating systems,
measurement & evaluation, and history]
Related Courses

• ECE369 (strong prerequisite): Basic computer organization,
first look at pipelines + caches
• ECE 462/562: Computer Architecture, first look at parallel
architectures
• ECE568: Parallel Processing
• ECE569: High Performance Computing, Advanced Topics
• ECE 474/574: Computer Aided Logic Design, FPGAs
• ECE 576: Computer Based Systems
Introduction
• Text for ECE462/562: Hennessy and Patterson’s
Computer Architecture: A Quantitative Approach, 5th Edition

• Topics
1. Simple machine design (ISAs, microprogramming,
unpipelined machines, Iron Law, simple pipelines)
2. Memory hierarchy (DRAM, caches, optimizations) plus
virtual memory systems, exceptions, interrupts
3. Complex pipelining (score-boarding, out-of-order issue)
4. Explicitly parallel processors (vector machines, VLIW
machines, multithreaded machines)
5. Multiprocessor architectures (memory models, cache
coherence, synchronization)
Your ECE462/562

• How would you like your ECE462/562?
• Mix of lecture vs. discussion
• Depends on how well reading is done before class
• Goal is to learn how to do good systems research
• Learn a lot from looking at good work in the past
• At commit point, you may choose to pursue your own
new idea instead.
Coping with ECE462/562

• Undergrads must have taken ECE274 and ECE369
• Grad students come with widely varied backgrounds
• Review Appendix A, B, C
 review of ISA, Datapath, Pipelining and Memory
Hierarchy
Policies
• Background: ECE369 or equivalent, based on Patterson
and Hennessy’s Computer Organization and Design
• Prerequisite: ECE274 & ECE369 & Programming in C
• 3 to 4 assignments, 2 exams, final project
• Grad students: extra exam questions, survey
paper and presentation
• NO LATE ASSIGNMENTS
• Make-ups may be arranged prior to the scheduled activity.
• Inquiries about graded material => within 3 days of
receiving a grade.
• You are encouraged to discuss the assignment
specifications with your instructor, and your fellow students.
However, anything you submit for grading must be unique and
should NOT be a duplicate of another source.
• Read before the class
• Participate and ask questions
• Manage your time
• Start working on assignments early
Grading

Distribution of Components:
Component                        Percentage
Assignments+Quiz+Participation   35
Exam-I                           15
Exam-II                          15
Project                          35
Total                            100

Grade Scale:
Percentage    Grade
90-100%       A
80-89%        B
70-79%        C
60-69%        D
Below 60%     E
Introduction

• Assignments and Project
 Pairs only
 Who is my partner? (email by 09/06)

• Assignment-0 due 08/28

• Announcements on the web
Research Paper Reading

• As graduate students, you are now researchers
• Most information of importance to you will be
in research papers
• Ability to rapidly scan and understand research
papers is key to your success
Project (Undergrad vs Grad)

• Transition from undergrad to grad student
• ECE wants you to succeed, but you need to show
initiative
 pick a topic (more on this later)
 meet 3 times with faculty to show progress
 give an oral presentation (grad students only)
 written report like a conference paper
 3 weeks of full-time work for 2 people
• Opportunity to do “research in the small” to help make the
transition from good student to research colleague
Project (Undergrad vs Grad)

• Recreate results from a research paper to see
 if they are reproducible
 if they still hold
 Papers from ISCA, HPCA, MICRO, IPDPS, ISC
• Performance evaluation of an architecture
 Using industry-sponsored tools
o GEM5: gem5.org
o Pin: pintool.org
o SimpleScalar: simplescalar.com
• A complete end-to-end processor (UGs !!)
 Take advantage of FPGAs!!
• Propose your own research project that is related to
computer architecture
Measuring Performance

• Topics: (Chapter 1)
 Technology trends
 Performance equations
Technology Trends and This Book

• 1996: When I took this class!
• 2002: Reduced emphasis on ILP; introduce thread-level parallelism
• 2009: Shift to multicore!
• 2011: Reduced ILP to 1 chapter! Request-, data-, thread-, and
instruction-level parallelism; introduce GPUs, cloud computing,
smart phones, tablets!
Problems

• Algorithms, Programming Languages, Compilers,
Operating Systems, Architectures, Libraries, … are not
ready to supply Thread Level Parallelism or
Data Level Parallelism for 1000 CPUs / chip
• Architectures are not ready for 1000 CPUs / chip
• Unlike Instruction Level Parallelism, this cannot be
solved by computer architects and compiler
writers alone, but it also cannot be solved without the
participation of computer architects
• The 5th Edition of Computer Architecture:
A Quantitative Approach explores the shift from
Instruction Level Parallelism to
Thread Level Parallelism / Data Level Parallelism
Classes of Parallelism

• In Applications (see the sketch below)
 Data-Level Parallelism
o Data items that can be operated on concurrently
 Task-Level Parallelism
o Tasks of a workload that can operate independently
• In Hardware
 ILP: exploits DLP with compiler, pipelining,
speculative execution
 Vector Architectures and GPUs: exploit DLP by
applying a single instruction to a collection of data
 Thread-level parallelism: exploits DLP and TLP,
tightly coupled hardware, interaction among threads
 Request-level parallelism: exploits largely
decoupled tasks specified by the programmer
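
To make the application-level distinction concrete, here is a minimal C sketch (the arrays and the two helper functions are illustrative assumptions, not from the slides): the loop is data-level parallel because every iteration touches a distinct element, and the two calls after it are task-level parallel because neither depends on the other.

#include <stdio.h>

#define N 8

/* Two independent "tasks": neither reads what the other writes. */
static double sum_array(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    return s;
}

static double max_array(const double *x, int n) {
    double m = x[0];
    for (int i = 1; i < n; i++) if (x[i] > m) m = x[i];
    return m;
}

int main(void) {
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8}, b[N];

    /* Data-level parallelism: each iteration operates on a distinct
       element, so all N iterations could execute concurrently. */
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i];

    /* Task-level parallelism: these two calls are independent and
       could run on different cores. */
    double s = sum_array(b, N);
    double m = max_array(b, N);

    printf("sum = %.1f, max = %.1f\n", s, m);
    return 0;
}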
Processor Technology Trends

• Shrinking of transistor sizes: 250nm (1997) 
130nm (2002)  65nm (2007)  32nm (2010) 
28nm (2011, AMD GPU, Xilinx FPGA) 
22nm (2011, Intel Ivy Bridge, die shrink of the
Sandy Bridge architecture)

• Transistor density increases by 35% per year and die size
increases by 10-20% per year… more cores!
(a quick compound-growth check follows)
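
As a quick sanity check on the 35%-per-year figure (plain arithmetic, not a claim from the slides), density growing at 1.35x per year doubles roughly every two years, the familiar Moore's law cadence:

#include <math.h>
#include <stdio.h>

int main(void) {
    double growth = 1.35;  /* transistor density growth per year */
    /* Years until density doubles: solve growth^t = 2. */
    double t = log(2.0) / log(growth);
    printf("density doubles every %.1f years\n", t);  /* ~2.3 years */
    return 0;
}

(Link with -lm when compiling with gcc.)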
Trends: Historical Perspective

[Figure: historical perspective on processor performance trends]
Power Consumption Trends

• Dynamic power ∝ activity × capacitance × voltage² × frequency

• Capacitance per transistor and voltage are decreasing,
but the number of transistors is increasing at a faster rate;
hence clock frequency must be kept steady

• Leakage power is also rising

• Power consumption is already between 100-150W in
high-performance processors today
 3.3GHz Intel Core i7: 130 watts
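
A worked instance of the proportionality above, using illustrative numbers (the 15% scaling step is an assumption, not a value from the slides): lowering voltage and frequency together by 15% cuts dynamic power to about 61% of the original, since power falls with the cube of the combined scaling.

#include <stdio.h>

/* Relative dynamic power: activity * C * V^2 * f.
   Activity and capacitance are held fixed; V and f are scaled. */
static double rel_dyn_power(double v_scale, double f_scale) {
    return v_scale * v_scale * f_scale;
}

int main(void) {
    /* Illustrative DVFS step: 15% lower voltage and frequency. */
    double p = rel_dyn_power(0.85, 0.85);
    printf("dynamic power after scaling: %.1f%% of original\n", p * 100.0);
    /* Prints about 61.4%: a cubic power win for a linear slowdown. */
    return 0;
}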
Recent Microprocessor Trends (2004-2010)

Transistors: 1.43x / year
Cores: 1.2 - 1.4x
Performance: 1.15x
Frequency: 1.05x
Power: 1.04x

Source: Micron University Symp.
Improving Energy Efficiency Despite Flat Clock Rate

• Turn off the clock of inactive modules
 Disable FP unit, core, etc.
• Dynamic Voltage-Frequency Scaling
 In periods of low activity, lower the clock rate
• Low power mode
 DRAMs have a lower power mode for extending battery life
• Overclocking
 Intel Turbo mode (2008): the chip decides a safe clock rate
 i7 at 3.3 GHz can run in short bursts at 3.6 GHz
Modern Processor Today

• Intel Core i7
 Clock frequency: 3.2 – 3.33 GHz
 45nm and 32nm products
 Cores: 4 – 6
 Power: 95 – 130 W
 Two threads per core
 3-level cache, 12 MB L3 cache
 Price: $300 - $1000
Other Technology Trends

• DRAM density increases by 40-60% per year; latency has
been reduced by 33% in 10 years; bandwidth
improves twice as fast as latency decreases

• Disk density improves by 100% every year; latency
improvement is similar to DRAM
First Microprocessor: Intel 4004, 1971

• 4-bit accumulator architecture
• 8μm pMOS
• 2,300 transistors
• 3 x 4 mm²
• 750kHz clock
• 8-16 cycles/inst.
Hardware
• Team from IBM building PC prototypes in 1979
• Motorola 68000 chosen initially, but the 68000 was late
• 8088 is the 8-bit bus version of the 8086 => allows a
cheaper system
• Estimated sales of 250,000
• 100,000,000s sold

[Personal Computing Ad, 11/81]


DYSEAC, the first mobile computer!

• Carried in two tractor trailers, 12 tons + 8 tons
• Built for the US Army Signal Corps
Measuring Performance

• Two primary metrics: wall clock time (response time for a
program) and throughput (jobs performed in unit time);
a minimal wall-clock timing sketch follows this list

• To optimize throughput, must ensure that there is minimal
waste of resources

• Performance is measured with benchmark suites: a
collection of programs that are likely relevant to the user
 SPEC CPU 2006: cpu-oriented programs (for desktops)
 SPECweb, TPC: throughput-oriented (for servers)
 EEMBC: for embedded processors/workloads
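
A minimal sketch of measuring wall clock time in C with the POSIX clock_gettime interface; the loop is a stand-in workload, not one of the benchmarks named above:

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);

    /* Placeholder workload: in practice this would be the
       benchmark program whose response time we care about. */
    volatile double acc = 0.0;
    for (long i = 0; i < 100000000L; i++)
        acc += (double)i;

    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("wall clock time: %.3f s\n", secs);
    return 0;
}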
Performance

CPU time = Seconds / Program
         = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

Which factors affect which term:

              Inst Count   CPI   Clock Rate
Program           X
Compiler          X          X
Inst. Set         X          X
Organization                 X        X
Technology                            X
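
A worked instance of the equation, with made-up numbers purely for illustration (10^9 instructions, CPI of 1.5, a 2 GHz clock):

#include <stdio.h>

int main(void) {
    /* Illustrative values, not measurements. */
    double instructions = 1e9;   /* instructions / program  */
    double cpi          = 1.5;   /* cycles / instruction    */
    double clock_hz     = 2e9;   /* cycles / second (2 GHz) */

    /* Iron Law: seconds = instructions x CPI x seconds/cycle */
    double cpu_time = instructions * cpi / clock_hz;
    printf("CPU time = %.3f s\n", cpu_time);  /* 0.750 s */
    return 0;
}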
Amdahl’s Law

• Architecture design is very bottleneck-driven: make the
common case fast, do not waste resources on a component
that has little impact on overall performance/power

• Amdahl’s Law: the performance improvement from an
enhancement is limited by the fraction of time the
enhancement comes into play
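
The slides state the law in words; the standard formula (a well-known identity, not copied from the deck) for an enhancement that speeds up a fraction f of execution time by a factor S is:

Speedup_overall = 1 / ((1 - f) + f / S)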
Amdahl’s Law

• Consider an enhancement that runs 10 times faster
than the original machine but is only usable 40% of
the time.
 Only 1.56x overall speedup

• An application is “almost all” parallel: 90%. Speedup
using
 10 processors => 5.3x
 100 processors => 9.1x
 1000 processors => 9.9x
(these numbers are checked in the sketch below)
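
A quick check of the slide's numbers against the formula above (the fractions and speedup factors are the ones stated on the slide):

#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction f of execution
   time is accelerated by a factor s. */
static double amdahl(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    /* Enhancement 10x faster, usable 40% of the time. */
    printf("f=0.4, s=10   -> %.2fx\n", amdahl(0.4, 10.0));   /* 1.56x */

    /* 90%-parallel application on 10, 100, 1000 processors. */
    printf("f=0.9, s=10   -> %.1fx\n", amdahl(0.9, 10.0));   /* 5.3x */
    printf("f=0.9, s=100  -> %.1fx\n", amdahl(0.9, 100.0));  /* 9.17x; slide rounds to 9.1x */
    printf("f=0.9, s=1000 -> %.1fx\n", amdahl(0.9, 1000.0)); /* 9.9x */
    return 0;
}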
Principle of Locality

• Most programs are predictable in terms of instructions
executed and data accessed

• Temporal locality: a program will shortly re-visit X

• Spatial locality: a program will shortly visit X+1
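
Both kinds of locality show up in code as simple as the loop below (a generic illustration, not an example from the slides):

#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];  /* zero-initialized static array */
    int sum = 0;

    /* Spatial locality: a[i] then a[i+1] touch adjacent addresses,
       so each cache line fill serves several iterations.
       Temporal locality: sum and i are re-used on every iteration,
       so they stay in registers or the nearest cache level. */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %d\n", sum);
    return 0;
}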
Exploit Parallelism

• Most operations do not depend on each other; hence,
execute them in parallel (see the sketch below)

• At the circuit level, simultaneously access multiple ways
of a set-associative cache

• At the organization level, execute multiple instructions at
the same time

• At the system level, execute a different program while one
is waiting on I/O
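
A minimal sketch of exposing independent operations to the hardware, using OpenMP as the vehicle (the array and its size are arbitrary assumptions; compile with -fopenmp on gcc or clang):

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];

    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    /* Every iteration writes a distinct b[i] from a distinct a[i],
       so no iteration depends on another: the loop can be split
       across cores (thread-level) and vectorized within each core
       (data-level). */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i] + 1.0;

    printf("b[N-1] = %.1f (using up to %d threads)\n",
           b[N - 1], omp_get_max_threads());
    return 0;
}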
