
Lecture 2:

Cache
A Quest for Speed
Dr. Eng. Amr T. Abdel-Hamid
Winter 2014

Textbook slides: Computer Architecture: A Quantitative Approach, 5th Edition, John L. Hennessy & David A. Patterson, with modifications.

Computer Architecture
Elect 707

Memory Hierarchy
Computer Architecture and Applications

Dr. Amr Talaat

Why Memory Organization?

[Figure: processor vs. DRAM performance, 1980-2000 (log scale, 1 to 1000). Processor performance grows ~60%/year (Moore's Law) while DRAM improves only ~7%/year, so the processor-memory performance gap grows ~50%/year.]

The Principle of Locality

Programs access a relatively small portion of the address space at any instant of time.

Two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).

Locality is the basic principle used to overcome the processor/memory performance gap.

What is a cache?

Small, fast storage used to improve average access time to slow memory.
Exploits spatial and temporal locality.
In computer architecture, almost everything is a cache!
Registers: a cache on variables
First-level cache: a cache on the second-level cache
Second-level cache: a cache on memory
Memory: a cache on disk (virtual memory)

The hierarchy, faster toward the top and bigger toward the bottom:
Proc/Regs
L1-Cache
L2-Cache
Memory
Disk, etc.

Cache Definitions

Hit: data appears in some block in the upper level (example: Block X).
Hit rate: the fraction of memory accesses found in the upper level.
Hit time: time to access the upper level, which consists of access time + time to determine hit/miss.
Miss: data needs to be retrieved from a block in the lower level (Block Y).
Miss rate = 1 - (hit rate).
Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor.
Hit time << miss penalty.

Review: Cache performance

Average Memory Access Time (AMAT) = Hit time + Miss rate x Miss penalty (ns or clocks)

Split into instruction and data accesses:
AMAT = (Hit Time_Inst + Miss Rate_Inst x Miss Penalty_Inst)
     + (Hit Time_Data + Miss Rate_Data x Miss Penalty_Data)

Cache performance

Miss-oriented approach to memory access:

CPUtime = IC x (CPI_Execution + MemAccess/Inst x Miss Rate x Miss Penalty) x Cycle Time

CPUtime = IC x (CPI_Execution + MemMisses/Inst x Miss Penalty) x Cycle Time

CPI_Execution includes ALU and memory instructions.

Separating out the memory component entirely (AMAT = Average Memory Access Time; CPI_AluOps does not include memory instructions):

CPUtime = IC x (AluOps/Inst x CPI_AluOps + MemAccess/Inst x AMAT) x Cycle Time

AMAT = Hit Time + Miss Rate x Miss Penalty
     = (Hit Time_Inst + Miss Rate_Inst x Miss Penalty_Inst)
     + (Hit Time_Data + Miss Rate_Data x Miss Penalty_Data)

Impact on Performance

Suppose a processor executes at:
Clock rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
50% arith/logic, 30% ld/st, 20% control

Suppose that 10% of memory operations incur a 50-cycle miss penalty, and that 1% of instructions incur the same miss penalty.

CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/ins)
    + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)]
    + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)]
    = (1.1 + 1.5 + 0.5) cycles/ins = 3.1

So (1.5 + 0.5)/3.1 ≈ 65% of the time the processor is stalled waiting for memory!

AMAT = Hit Time + Miss Rate x Miss Penalty

Four Questions for Memory Hierarchy Designers

Q1: Where can a block be placed in the upper level? (block placement)
Q2: How is a block found if it is in the upper level?
Q3: Which block should be replaced on a miss?
Q4: What happens on a write?

Block Placement

Q1: Where can a block be placed in the upper level?
Fully associative, set associative, or direct mapped.

1 KB Direct Mapped Cache, 32 B blocks

For a 2^N byte cache:
The uppermost (32 - N) bits are always the cache tag.
The lowest M bits are the byte select (block size = 2^M).

Example 32-bit address split: cache tag in bits 31-10 (example: 0x50), cache index in bits 9-5 (example: 0x01), byte select in bits 4-0 (example: 0x00). The tag is stored as part of the cache state, alongside a valid bit and the cache data (bytes 0-1023 in 32-byte blocks).

Set Associative Cache

N-way set associative: N entries for each cache index.
N direct mapped caches operate in parallel.
How big is the tag?

Example: two-way set associative cache.
The cache index selects a set from the cache.
The two tags in the set are compared to the address tag in parallel.
Data is selected based on the tag comparison result.

[Diagram: two banks of (valid bit, cache tag, cache data) entries; the address tag is compared against both stored tags, the compare outputs are ORed into Hit, and a mux (Sel1/Sel0) selects the matching cache block.]

Block Placement

[Figure: example placement of a block under fully associative, set associative, and direct mapped organizations.]

Q2: How is a block found if it is in the upper level?

The index identifies the set of possibilities.
The tag on each block is compared; there is no need to check the index or block-offset bits.
Increasing associativity shrinks the index and expands the tag.

Block address = Tag | Index | Block offset

Cache size = Associativity x 2^(index size) x 2^(offset size)

Q3: Which block should be replaced on a miss?

Easy for direct mapped.
Set associative or fully associative:
Random
LRU (Least Recently Used)

Miss rates:

           2-way          4-way          8-way
Size       LRU    Random  LRU    Random  LRU    Random
16 KB      5.2%   5.7%    4.7%   5.3%    4.4%   5.0%
64 KB      1.9%   2.0%    1.5%   1.7%    1.4%   1.5%
256 KB     1.15%  1.17%   1.13%  1.13%   1.12%  1.12%

Q4: What happens on a write?

Write-through: all writes update the cache and the underlying memory/cache.
Can always discard cached data; the most up-to-date data is in memory.
Cache control bit: only a valid bit.
Write-back: all writes simply update the cache.
Can't just discard cached data; it may have to be written back to memory.
Cache control bits: both valid and dirty bits.

Other advantages:
Write-through:
Memory (or other processors) always has the latest data.
Simpler management of the cache.
Write-back:
Much lower bandwidth, since data is often overwritten multiple times.
Better tolerance to long-latency memory.

Write Buffer for Write Through

Processor -> Cache, and Processor -> Write Buffer -> DRAM

A write buffer is needed between the cache and memory:
Processor: writes data into the cache and the write buffer.
Memory controller: writes the contents of the buffer to memory.

The write buffer is just a FIFO:
Typical number of entries: 4
Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle

The Cache Design Space

Several interacting dimensions:
cache size
block size
associativity
replacement policy
write-through vs. write-back

The optimal choice is a compromise:
depends on access characteristics
workload
use (I-cache, D-cache, TLB)
depends on technology / cost

Simplicity often wins.

[Figure: the design space sketched along axes of cache size, associativity, and block size; quality ranges from bad to good as factors A and B trade off.]

Review: Improving Cache Performance

CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Reducing Misses

Classifying misses: the 3 Cs
Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses.
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.

More recently, a 4th C:
Coherence: misses caused by cache coherence.

Classify Cache Misses, How?

(1) Infinite cache, fully associative:
Compulsory misses
(2) Finite cache, fully associative:
Compulsory misses + capacity misses
(3) Finite cache, limited associativity:
Compulsory misses + capacity misses + conflict misses

3Cs Miss Rate

[Figure: miss rate per type vs. cache size (16-128 KB) for 1-way through 8-way associativity. Conflict misses shrink as associativity grows, capacity misses dominate for small caches, and compulsory misses are extremely small.]

Reduce the miss rate

Larger cache
Reduce misses via larger block size
Reduce misses via higher associativity
Reducing misses via a victim cache
Reducing misses via pseudo-associativity
Reducing misses by HW prefetching of instructions and data
Reducing misses by SW prefetching of data
Reducing misses by compiler optimizations

Reduce Misses via Larger Block Size

[Figure: miss rate vs. block size for several cache sizes.]

Reduce Misses via Higher Associativity

[Figure: miss rate per type (conflict, capacity, compulsory) vs. cache size (16-128 KB) for 1-way through 8-way associativity.]

2:1 cache rule of thumb:
miss rate of a 1-way associative cache of size X
~= miss rate of a 2-way associative cache of size X/2

Reducing Misses via a Victim Cache

Add a buffer to hold data discarded from the cache.
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache.

[Diagram: a small fully associative buffer of (tag and comparator, one cache line of data) entries between the cache and the next lower level in the hierarchy.]

Reducing Misses via Pseudo-Associativity

How can we combine the fast hit time of direct mapped with the lower conflict misses of a 2-way set associative cache?
Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (slow hit).

Hit time < pseudo-hit time < miss penalty

Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles.
Better for caches not tied directly to the processor (L2).
Used in the MIPS R1000 L2 cache; similar in UltraSPARC.

Reducing Misses by Hardware Prefetching of Instructions & Data

E.g., instruction prefetching:
The Alpha 21064 fetches 2 blocks on a miss.
The extra block is placed in a stream buffer.
On a miss, check the stream buffer.

Works with data blocks too:
Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4 KB cache; 4 streams caught 43%.
Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64 KB, 4-way set associative caches.

Prefetching relies on having extra memory bandwidth that can be used without penalty.

Reducing Misses by Software Prefetching Data

Special prefetching instructions that cannot cause faults; a form of speculative execution.
Data prefetch:
Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC V9).

Issuing prefetch instructions takes time:
Is the cost of issuing prefetches < the savings in reduced misses?
Wider superscalar issue reduces the difficulty of issue bandwidth.

Relies on having extra memory bandwidth that can be used without penalty.

Reducing Misses by Compiler Optimizations

McFarling [1989] reduced cache misses by 75% on an 8 KB direct mapped cache with 4-byte blocks, in software.
How?
For instructions:
Look at conflicts (using tools they developed).
Reorder procedures in memory so as to reduce conflict misses.
For data:
Merging arrays: improve spatial locality with a single array of compound elements vs. 2 arrays.
Loop interchange: change the nesting of loops to access data in the order it is stored in memory.
Loop fusion: combine 2 independent loops that have the same looping and some variables in common.

Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val and key; improves spatial locality.

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Two misses per access to a and c vs. one miss per access; improves locality, since each a[i][j] and c[i][j] is reused while still in the cache.

Reduce the miss penalty

Read priority over write on miss
Early restart and critical word first on miss
Non-blocking caches (hit under miss, miss under miss)
Second-level cache

Read Priority over Write on Miss

[Diagram: the CPU's in/out path connects to the cache and to a write buffer; the write buffer drains to DRAM (or lower memory).]

Read Priority over Write on Miss

Write-through with write buffers => RAW conflicts between main memory reads on cache misses and buffered writes.
If we simply wait for the write buffer to empty, the read miss penalty can increase (by 50% on the old MIPS 1000).
Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue.

With write-back, we want the buffer to hold displaced blocks:
Read miss replacing a dirty block:
Normal: write the dirty block to memory, and then do the read.
Instead: copy the dirty block to a write buffer, then do the read, and then do the write.
The CPU stalls less, since it restarts as soon as the read completes.

Early Restart and Critical Word First

Don't wait for the full block to be loaded before restarting the CPU:
Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.

Generally useful only with large blocks.
Spatial locality is a problem: we tend to want the next sequential word soon, so it is not clear how much early restart benefits.

Non-blocking Caches to Reduce Stalls on Misses

A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss:
Requires full/empty bits on registers or out-of-order execution.
Requires multi-bank memories.

Hit under miss reduces the effective miss penalty by working during the miss instead of ignoring CPU requests.
Hit under multiple miss or miss under miss may further lower the effective miss penalty by overlapping multiple misses:
Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses.
Requires multiple memory banks (otherwise overlapping cannot be supported).
The Pentium Pro allows 4 outstanding memory misses.

Add a second-level cache

L2 equations:
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

Definitions:
Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2).
Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2).
The global miss rate is what matters.

Cache Optimization Summary

(MR = miss rate, MP = miss penalty, HT = hit time)

Technique                           MR  MP  HT  Complexity
Larger Block Size                   +           0
Higher Associativity                +           1
Victim Caches                       +           2
Pseudo-Associative Caches           +           2
HW Prefetching of Instr/Data        +           2
Compiler Controlled Prefetching     +           3
Compiler Reduce Misses              +           0
Priority to Read Misses                 +       1
Early Restart & Critical Word 1st       +       2
Non-Blocking Caches                     +       3
Second Level Caches                     +       2

Next Week

Start the project + Sheet 1 online
Quiz 1 covering pipelining and hazards
