
Lecture 2:

Cache
A Quest for Speed
Dr. Eng. Amr T. Abdel-Hamid
Winter 2014

Textbook slides: Computer Architecture: A Quantitative Approach, 5th Edition, John L. Hennessy & David A. Patterson, with modifications.

Computer Architecture
Elect 707

Memory Hierarchy
Computer Architecture and Applications

Dr. Amr Talaat

Why Memory Organization?

[Figure: processor vs. DRAM performance, 1980-2000 (log scale, 1 to 1000). Processor performance grows ~60%/year (Moore's Law) while DRAM improves only ~7%/year, so the processor-memory performance gap grows ~50%/year.]

The Principle of Locality

Programs access a relatively small portion of the address space at any instant of time.

Two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).

Locality is the basic principle used to overcome the processor/memory performance gap.

What is a cache?

Small, fast storage used to improve average access time to slow memory.
Exploits spatial and temporal locality.
In computer architecture, almost everything is a cache!
Registers: a cache on variables
First-level cache: a cache on the second-level cache
Second-level cache: a cache on memory
Memory: a cache on disk (virtual memory)

The hierarchy, faster toward the top and bigger toward the bottom:
Proc/Regs
L1-Cache
L2-Cache
Memory
Disk, etc.

Cache Definitions

Hit: data appears in some block in the upper level (example: Block X).
Hit rate: the fraction of memory accesses found in the upper level.
Hit time: time to access the upper level, which consists of access time + time to determine hit/miss.
Miss: data needs to be retrieved from a block in the lower level (Block Y).
Miss rate = 1 - (hit rate).
Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor.
Hit time << miss penalty.

Review: Cache performance

Average Memory Access Time (AMAT) = Hit time + Miss rate x Miss penalty (ns or clocks)

Split into instruction and data accesses:
AMAT = (Hit Time_Inst + Miss Rate_Inst x Miss Penalty_Inst)
     + (Hit Time_Data + Miss Rate_Data x Miss Penalty_Data)

Cache performance

Miss-oriented approach to memory access:

CPUtime = IC x (CPI_Execution + MemAccess/Inst x Miss Rate x Miss Penalty) x Cycle Time

CPUtime = IC x (CPI_Execution + MemMisses/Inst x Miss Penalty) x Cycle Time

CPI_Execution includes ALU and memory instructions.

Separating out the memory component entirely (AMAT = Average Memory Access Time; CPI_AluOps does not include memory instructions):

CPUtime = IC x (AluOps/Inst x CPI_AluOps + MemAccess/Inst x AMAT) x Cycle Time

AMAT = Hit Time + Miss Rate x Miss Penalty
     = (Hit Time_Inst + Miss Rate_Inst x Miss Penalty_Inst)
     + (Hit Time_Data + Miss Rate_Data x Miss Penalty_Data)

Impact on Performance

Suppose a processor executes at:
Clock rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
50% arith/logic, 30% ld/st, 20% control

Suppose that 10% of memory operations incur a 50-cycle miss penalty, and that 1% of instructions incur the same miss penalty.

CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/ins)
    + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)]
    + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)]
    = (1.1 + 1.5 + 0.5) cycles/ins = 3.1

So (1.5 + 0.5)/3.1 ≈ 65% of the time the processor is stalled waiting for memory!

AMAT = Hit Time + Miss Rate x Miss Penalty

Four Questions for Memory Hierarchy Designers

Q1: Where can a block be placed in the upper level? (block placement)
Q2: How is a block found if it is in the upper level?
Q3: Which block should be replaced on a miss?
Q4: What happens on a write?

Block Placement

Q1: Where can a block be placed in the upper level?
Fully associative, set associative, or direct mapped.

1 KB Direct Mapped Cache, 32 B blocks

For a 2^N byte cache:
The uppermost (32 - N) bits are always the cache tag.
The lowest M bits are the byte select (block size = 2^M).

Example 32-bit address split: cache tag in bits 31-10 (example: 0x50), cache index in bits 9-5 (example: 0x01), byte select in bits 4-0 (example: 0x00). The tag is stored as part of the cache state, alongside a valid bit and the cache data (bytes 0-1023 in 32-byte blocks).

Set Associative Cache

N-way set associative: N entries for each cache index.
N direct mapped caches operate in parallel.
How big is the tag?

Example: two-way set associative cache.
The cache index selects a set from the cache.
The two tags in the set are compared to the address tag in parallel.
Data is selected based on the tag comparison result.

[Diagram: two banks of (valid bit, cache tag, cache data) entries; the address tag is compared against both stored tags, the compare outputs are ORed into Hit, and a mux (Sel1/Sel0) selects the matching cache block.]

Block Placement

[Figure: example placement of a block under fully associative, set associative, and direct mapped organizations.]

Q2: How is a block found if it is in the upper level?

The index identifies the set of possibilities.
The tag on each block is compared; there is no need to check the index or block-offset bits.
Increasing associativity shrinks the index and expands the tag.

Block address = Tag | Index | Block offset

Cache size = Associativity x 2^(index size) x 2^(offset size)

Q3: Which block should be replaced on a miss?

Easy for direct mapped.
Set associative or fully associative:
Random
LRU (Least Recently Used)

Miss rates:

           2-way          4-way          8-way
Size       LRU    Random  LRU    Random  LRU    Random
16 KB      5.2%   5.7%    4.7%   5.3%    4.4%   5.0%
64 KB      1.9%   2.0%    1.5%   1.7%    1.4%   1.5%
256 KB     1.15%  1.17%   1.13%  1.13%   1.12%  1.12%

Q4: What happens on a write?

Write-through: all writes update the cache and the underlying memory/cache.
Can always discard cached data; the most up-to-date data is in memory.
Cache control bit: only a valid bit.
Write-back: all writes simply update the cache.
Can't just discard cached data; it may have to be written back to memory.
Cache control bits: both valid and dirty bits.

Other advantages:
Write-through:
Memory (or other processors) always has the latest data.
Simpler management of the cache.
Write-back:
Much lower bandwidth, since data is often overwritten multiple times.
Better tolerance to long-latency memory.

Write Buffer for Write Through

Processor -> Cache, and Processor -> Write Buffer -> DRAM

A write buffer is needed between the cache and memory:
Processor: writes data into the cache and the write buffer.
Memory controller: writes the contents of the buffer to memory.

The write buffer is just a FIFO:
Typical number of entries: 4
Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle

The Cache Design Space

Several interacting dimensions:
cache size
block size
associativity
replacement policy
write-through vs. write-back

The optimal choice is a compromise:
depends on access characteristics
workload
use (I-cache, D-cache, TLB)
depends on technology / cost

Simplicity often wins.

[Figure: the design space sketched along axes of cache size, associativity, and block size; quality ranges from bad to good as factors A and B trade off.]

Review: Improving Cache Performance

CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Reducing Misses

Classifying misses: the 3 Cs
Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses.
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.

More recently, a 4th C:
Coherence: misses caused by cache coherence.

Classify Cache Misses, How?

(1) Infinite cache, fully associative:
Compulsory misses
(2) Finite cache, fully associative:
Compulsory misses + capacity misses
(3) Finite cache, limited associativity:
Compulsory misses + capacity misses + conflict misses

3Cs Miss Rate

[Figure: miss rate per type vs. cache size (16-128 KB) for 1-way through 8-way associativity. Conflict misses shrink as associativity grows, capacity misses dominate for small caches, and compulsory misses are extremely small.]

Reduce the miss rate

Larger cache
Reduce misses via larger block size
Reduce misses via higher associativity
Reducing misses via a victim cache
Reducing misses via pseudo-associativity
Reducing misses by HW prefetching of instructions and data
Reducing misses by SW prefetching of data
Reducing misses by compiler optimizations

Reduce Misses via Larger Block Size

[Figure: miss rate vs. block size for several cache sizes.]

Reduce Misses via Higher Associativity

[Figure: miss rate per type (conflict, capacity, compulsory) vs. cache size (16-128 KB) for 1-way through 8-way associativity.]

2:1 cache rule of thumb:
miss rate of a 1-way associative cache of size X
~= miss rate of a 2-way associative cache of size X/2

Reducing Misses via a Victim Cache

Add a buffer to hold data discarded from the cache.
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache.

[Diagram: a small fully associative buffer of (tag and comparator, one cache line of data) entries between the cache and the next lower level in the hierarchy.]

Reducing Misses via Pseudo-Associativity

How can we combine the fast hit time of direct mapped with the lower conflict misses of a 2-way set associative cache?
Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (slow hit).

Hit time < pseudo-hit time < miss penalty

Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles.
Better for caches not tied directly to the processor (L2).
Used in the MIPS R1000 L2 cache; similar in UltraSPARC.

Reducing Misses by Hardware Prefetching of Instructions & Data

E.g., instruction prefetching:
The Alpha 21064 fetches 2 blocks on a miss.
The extra block is placed in a stream buffer.
On a miss, check the stream buffer.

Works with data blocks too:
Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4 KB cache; 4 streams caught 43%.
Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64 KB, 4-way set associative caches.

Prefetching relies on having extra memory bandwidth that can be used without penalty.

Reducing Misses by Software Prefetching Data

Special prefetching instructions that cannot cause faults; a form of speculative execution.
Data prefetch:
Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC V9).

Issuing prefetch instructions takes time:
Is the cost of issuing prefetches < the savings in reduced misses?
Wider superscalar issue reduces the difficulty of issue bandwidth.

Relies on having extra memory bandwidth that can be used without penalty.

Reducing Misses by Compiler Optimizations

McFarling [1989] reduced cache misses by 75% on an 8 KB direct mapped cache with 4-byte blocks, in software.
How?
For instructions:
Look at conflicts (using tools they developed).
Reorder procedures in memory so as to reduce conflict misses.
For data:
Merging arrays: improve spatial locality with a single array of compound elements vs. 2 arrays.
Loop interchange: change the nesting of loops to access data in the order it is stored in memory.
Loop fusion: combine 2 independent loops that have the same looping and some variables in common.

Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val and key; improves spatial locality.

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Two misses per access to a and c vs. one miss per access; improves locality, since each a[i][j] and c[i][j] is reused while still in the cache.

Reduce the miss penalty

Read priority over write on miss
Early restart and critical word first on miss
Non-blocking caches (hit under miss, miss under miss)
Second-level cache

Read Priority over Write on Miss

[Diagram: the CPU's in/out path connects to the cache and to a write buffer; the write buffer drains to DRAM (or lower memory).]

Read Priority over Write on Miss

Write-through with write buffers => RAW conflicts between main memory reads on cache misses and buffered writes.
If we simply wait for the write buffer to empty, the read miss penalty can increase (by 50% on the old MIPS 1000).
Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue.

With write-back, we want the buffer to hold displaced blocks:
Read miss replacing a dirty block:
Normal: write the dirty block to memory, and then do the read.
Instead: copy the dirty block to a write buffer, then do the read, and then do the write.
The CPU stalls less, since it restarts as soon as the read completes.

Early Restart and Critical Word First

Don't wait for the full block to be loaded before restarting the CPU:
Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.

Generally useful only with large blocks.
Spatial locality is a problem: we tend to want the next sequential word soon, so it is not clear how much early restart benefits.

Non-blocking Caches to Reduce Stalls on Misses

A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss:
Requires full/empty bits on registers or out-of-order execution.
Requires multi-bank memories.

Hit under miss reduces the effective miss penalty by working during the miss instead of ignoring CPU requests.
Hit under multiple miss or miss under miss may further lower the effective miss penalty by overlapping multiple misses:
Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses.
Requires multiple memory banks (otherwise overlapping cannot be supported).
The Pentium Pro allows 4 outstanding memory misses.

Add a second-level cache

L2 equations:
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

Definitions:
Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2).
Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2).
The global miss rate is what matters.

Cache Optimization Summary

(MR = miss rate, MP = miss penalty, HT = hit time)

Technique                           MR  MP  HT  Complexity
Larger Block Size                   +           0
Higher Associativity                +           1
Victim Caches                       +           2
Pseudo-Associative Caches           +           2
HW Prefetching of Instr/Data        +           2
Compiler Controlled Prefetching     +           3
Compiler Reduce Misses              +           0
Priority to Read Misses                 +       1
Early Restart & Critical Word 1st       +       2
Non-Blocking Caches                     +       3
Second Level Caches                     +       2

Next Week

Start the project + Sheet 1 online
Quiz 1 covering pipelining and hazards
