Cache
A Quest for Speed
Dr. Eng. Amr T. Abdel-Hamid
Winter 2014
Computer Architecture
Elect 707
Memory Hierarchy
Computer Architecture and Applications
[Figure: Processor-Memory Performance Gap, 1980 to 2000. Performance on a log scale (1 to 1000) versus year. CPU ("Moore's Law"): 60%/yr. DRAM: 7%/yr. The gap grows 50% / year.]
What is a cache?
[Figure: memory hierarchy, top to bottom: Proc/Regs, L1-Cache, L2-Cache, Memory, Disk, etc. Levels get bigger going down and faster going up.]
Cache Definitions
Hit: data appears in some block in the upper level (example: Block X)
Hit Rate: the fraction of memory accesses found in the upper level
Hit Time: time to access the upper level, which consists of access time + time to determine hit/miss
Miss: data needs to be retrieved from a block in the lower level (Block Y)
Miss Rate = 1 - (Hit Rate)
Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit Time << Miss Penalty
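The definitions above can be turned into a few lines of arithmetic. The following is a minimal sketch, assuming a hypothetical pair of hit/miss counters (`struct cache_stats` and the function names are illustrative, not from the slides); it derives hit rate and miss rate from the counts and combines them with hit time and miss penalty into an average access time.

```c
/* Hypothetical hit/miss counters for one cache level (names are assumptions). */
struct cache_stats {
    unsigned long hits;
    unsigned long misses;
};

/* Fraction of accesses found in the upper level. */
double hit_rate(const struct cache_stats *s)
{
    unsigned long total = s->hits + s->misses;
    return total ? (double)s->hits / (double)total : 0.0;
}

/* Fraction of accesses not found in the upper level (1 - hit rate when total > 0). */
double miss_rate(const struct cache_stats *s)
{
    unsigned long total = s->hits + s->misses;
    return total ? (double)s->misses / (double)total : 0.0;
}

/* Average access time: every access pays the hit time,
 * and the missing fraction additionally pays the miss penalty. */
double average_access_time(const struct cache_stats *s,
                           double hit_time, double miss_penalty)
{
    return hit_time + miss_rate(s) * miss_penalty;
}
```

With 95 hits and 5 misses, a 1-cycle hit time, and a 40-cycle miss penalty (all made-up numbers), the average access time comes to 3 cycles, which shows how a small miss rate still dominates when Hit Time << Miss Penalty.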
Cache performance
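$$\text{CPUtime} = IC \times \left( CPI_{\text{Execution}} + \frac{\text{MemMisses}}{\text{Inst}} \times \text{MissPenalty} \right) \times \text{CycleTime}$$

$$\text{CPUtime} = IC \times \left( \frac{\text{AluOps}}{\text{Inst}} \times CPI_{\text{AluOps}} + \frac{\text{MemAccess}}{\text{Inst}} \times \text{AMAT} \right) \times \text{CycleTime}$$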
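As a rough illustration of the two CPU-time formulations on this slide, here is a minimal sketch in C. The function and parameter names are assumptions chosen to mirror the slide's symbols (IC, CPI, MemMisses/Inst, MissPenalty, AluOps/Inst, MemAccess/Inst, AMAT, CycleTime); penalties and AMAT are taken to be in cycles and CycleTime in seconds.

```c
/* Miss-oriented form:
 * CPUtime = IC * (CPI_Execution + MemMisses/Inst * MissPenalty) * CycleTime */
double cpu_time_miss_form(double ic, double cpi_execution,
                          double mem_misses_per_inst, double miss_penalty,
                          double cycle_time)
{
    return ic * (cpi_execution + mem_misses_per_inst * miss_penalty) * cycle_time;
}

/* Form with the memory component separated out:
 * CPUtime = IC * (AluOps/Inst * CPI_AluOps + MemAccess/Inst * AMAT) * CycleTime */
double cpu_time_amat_form(double ic, double alu_ops_per_inst, double cpi_alu_ops,
                          double mem_access_per_inst, double amat,
                          double cycle_time)
{
    return ic * (alu_ops_per_inst * cpi_alu_ops + mem_access_per_inst * amat) * cycle_time;
}
```

In the first form the execution CPI covers all instructions and misses are added as stalls; in the second, the ALU CPI excludes memory instructions and every memory access is charged the average memory access time.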
Impact on Performance
Suppose a processor executes at:
Clock Rate = 200 MHz (5 ns per cycle), Ideal (no misses) CPI = 1.1
Instruction mix: 50% arith/logic, 30% ld/st, 20% control
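To make the impact concrete, the sketch below adds memory stall cycles to the ideal CPI. The clock rate, ideal CPI, and 30% ld/st fraction come from the slide; the miss rates and the miss penalty are not given in this text, so the values used in the usage note are purely hypothetical. The function name is also an assumption.

```c
/* Effective CPI = ideal CPI + data-access stalls + instruction-fetch stalls.
 * ldst_fraction : fraction of instructions that are loads/stores
 * data_miss_rate: misses per data access          (assumed input)
 * inst_miss_rate: misses per instruction fetch    (assumed input)
 * miss_penalty  : stall cycles per miss           (assumed input) */
double effective_cpi(double ideal_cpi, double ldst_fraction,
                     double data_miss_rate, double inst_miss_rate,
                     double miss_penalty)
{
    double data_stalls = ldst_fraction * data_miss_rate * miss_penalty;
    double inst_stalls = inst_miss_rate * miss_penalty;
    return ideal_cpi + data_stalls + inst_stalls;
}
```

For example, with an assumed 10% data miss rate, 1% instruction miss rate, and 50-cycle penalty, `effective_cpi(1.1, 0.30, 0.10, 0.01, 50.0)` gives 1.1 + 1.5 + 0.5 = 3.1, so under those assumptions the processor would spend well over half its time stalled on memory.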
Block Placement
[Figure: direct-mapped cache with 32 entries of 32 bytes each (Byte 0 to Byte 1023). The address splits into Cache Tag (bits above bit 9, example: 0x50), Cache Index (bits 9 to 5, ex: 0x01), and Byte Select (bits 4 to 0, ex: 0x00). Each entry holds a Valid Bit and a Cache Tag, stored as part of the cache state, next to its Cache Data.]
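The address split in this figure (a direct-mapped cache of 32 blocks, 32 bytes each, so 5 byte-select bits and 5 index bits) can be sketched in C as follows. The struct and function names are illustrative assumptions; only the field widths and the example values 0x50 / 0x01 / 0x00 come from the slide.

```c
#include <stdint.h>

#define BYTE_SELECT_BITS 5u  /* 32-byte blocks: address bits 4..0 */
#define INDEX_BITS       5u  /* 32 cache entries: address bits 9..5 */

/* The three fields a direct-mapped cache extracts from an address. */
struct cache_addr {
    uint32_t tag;          /* compared against the stored Cache Tag */
    uint32_t index;        /* selects one cache entry */
    uint32_t byte_select;  /* selects a byte within the block */
};

struct cache_addr split_address(uint32_t addr)
{
    struct cache_addr a;
    a.byte_select = addr & ((1u << BYTE_SELECT_BITS) - 1u);
    a.index = (addr >> BYTE_SELECT_BITS) & ((1u << INDEX_BITS) - 1u);
    a.tag = addr >> (BYTE_SELECT_BITS + INDEX_BITS);
    return a;
}
```

Address 0x14020 decodes to tag 0x50, index 0x01, byte select 0x00, which matches the example values on the slide.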
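[Figure: two-way set-associative cache. Each way holds Valid, Cache Tag, and Cache Data (Cache Block 0, ...). The Adr Tag is compared against the tags of both ways in parallel; the Compare outputs drive Sel0 and Sel1 of a Mux that selects the Cache Block, and their OR produces Hit.]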
Block Placement
[Figure: address fields, including Index and Block Offset.]
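Miss rates by associativity and replacement policy (LRU vs. Random):

Size   | 2-way LRU | 2-way Random | 4-way LRU | 4-way Random | 8-way LRU | 8-way Random
16 KB  | 5.2%      | 5.7%         | 4.7%      | 5.3%         | 4.4%      | 5.0%
64 KB  | 1.9%      | 2.0%         | 1.5%      | 1.7%         | 1.4%      | 1.5%
256 KB | 1.15%     | 1.17%        | 1.13%     | 1.13%        | 1.12%     | 1.12%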
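For readers who want to see what LRU replacement means in a 2-way set-associative cache, here is a minimal sketch. The number of sets, the block size, and all names are assumptions for illustration; the model tracks only tags and valid bits, not data.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS   16u  /* assumed number of sets */
#define BLOCK_BITS 5u   /* assumed 32-byte blocks */

struct way {
    bool valid;
    uint32_t tag;
};

struct set {
    struct way way[2];
    unsigned lru;  /* index of the least recently used way */
};

struct cache2 {
    struct set sets[NUM_SETS];
};

/* Returns true on a hit. On a miss, the block replaces the LRU way of its set. */
bool access_2way_lru(struct cache2 *c, uint32_t addr)
{
    uint32_t block = addr >> BLOCK_BITS;
    struct set *s = &c->sets[block % NUM_SETS];
    uint32_t tag = block / NUM_SETS;

    /* Compare the address tag against both ways of the selected set. */
    for (unsigned w = 0; w < 2; w++) {
        if (s->way[w].valid && s->way[w].tag == tag) {
            s->lru = 1u - w;  /* the other way is now least recently used */
            return true;
        }
    }

    /* Miss: evict the least recently used way. */
    unsigned victim = s->lru;
    s->way[victim].valid = true;
    s->way[victim].tag = tag;
    s->lru = 1u - victim;
    return false;
}
```

With two ways, a single index per set is enough to track LRU exactly. A Random policy would instead pick `victim` without consulting any history, which is cheaper in hardware and, per the table, only slightly worse for larger caches.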
[Figure: Processor, Cache, and DRAM, with a Write Buffer in front of DRAM.]
A Write Buffer is needed between the Cache and Memory
Processor: writes data into the cache and the write buffer
Memory controller: writes the contents of the buffer to memory
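The division of labor described above (the processor enqueues writes, the memory controller drains them) can be sketched as a small FIFO. The buffer depth, the word-addressed memory array, and all names here are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4u  /* assumed buffer depth */

struct wb_entry {
    uint32_t addr;
    uint32_t data;
};

struct write_buffer {
    struct wb_entry entry[WB_ENTRIES];
    unsigned head;   /* oldest pending write */
    unsigned count;  /* number of pending writes */
};

/* Processor side: queue a write. Returns false when the buffer is full,
 * in which case the processor would have to stall. */
bool wb_push(struct write_buffer *wb, uint32_t addr, uint32_t data)
{
    if (wb->count == WB_ENTRIES)
        return false;
    unsigned tail = (wb->head + wb->count) % WB_ENTRIES;
    wb->entry[tail].addr = addr;
    wb->entry[tail].data = data;
    wb->count++;
    return true;
}

/* Memory-controller side: retire the oldest write to memory
 * (modeled as an array of 32-bit words). Returns false if nothing is pending. */
bool wb_drain_one(struct write_buffer *wb, uint32_t *memory, unsigned mem_words)
{
    if (wb->count == 0)
        return false;
    const struct wb_entry *e = &wb->entry[wb->head];
    uint32_t word = e->addr / 4u;
    if (word < mem_words)
        memory[word] = e->data;
    wb->head = (wb->head + 1u) % WB_ENTRIES;
    wb->count--;
    return true;
}
```

The buffer only helps while writes arrive more slowly on average than memory can retire them; once `wb_push` starts returning false, the processor is again limited by DRAM write speed.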
[Figure: cache design space with axes Cache Size, Associativity, and Block Size.]
cache size
block size
associativity
replacement policy
write-through vs. write-back
The optimal choice is a compromise
    depends on access characteristics
        workload
        use (I-cache, D-cache, TLB)
    depends on technology / cost
[Figure: trade-off sketch plotting Good/Bad against Less/More for two competing factors, Factor A and Factor B.]
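$$\text{CPUtime} = IC \times \left( CPI_{\text{Execution}} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{MissRate} \times \text{MissPenalty} \right) \times \text{CycleTime}$$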
Reducing Misses
Classifying Misses: 3 Cs
Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses.
Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
Conflict: If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.
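A small simulation can make the difference between compulsory and conflict misses tangible. The sketch below models a direct-mapped cache that tracks only tags; the cache geometry (8 lines of 32 bytes) and all names are assumptions chosen to keep the example short.

```c
#include <stdbool.h>
#include <stdint.h>

#define DM_LINES      8u  /* assumed number of cache lines */
#define DM_BLOCK_BITS 5u  /* assumed 32-byte blocks */

struct dm_cache {
    bool valid[DM_LINES];
    uint32_t tag[DM_LINES];
};

/* Returns true on a hit; on a miss the block replaces whatever held its line. */
bool dm_access(struct dm_cache *c, uint32_t addr)
{
    uint32_t block = addr >> DM_BLOCK_BITS;
    uint32_t index = block % DM_LINES;
    uint32_t tag = block / DM_LINES;

    if (c->valid[index] && c->tag[index] == tag)
        return true;
    c->valid[index] = true;
    c->tag[index] = tag;
    return false;
}

/* Replays an address trace and counts the misses. */
unsigned dm_count_misses(struct dm_cache *c, const uint32_t *trace, unsigned n)
{
    unsigned misses = 0;
    for (unsigned i = 0; i < n; i++)
        if (!dm_access(c, trace[i]))
            misses++;
    return misses;
}
```

Alternating between addresses 0x000 and 0x100 (both map to line 0 in this geometry) misses on every access: the first two misses are compulsory, and the rest are conflict misses, since the cache has room for both blocks but the placement rule forces them into the same line. Alternating between 0x000 and 0x020 (different lines) incurs only the two compulsory misses.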
[Figure: miss rate (0.02 to 0.14) versus cache size for 1-way, 2-way, 4-way, and 8-way associativity, broken down into Compulsory, Capacity, and Conflict misses.]
Larger cache
Reduce Misses via Larger Block Size
Reduce Misses via Higher Associativity
Reducing Misses via Victim Cache
Reducing Misses via Pseudo-Associativity
Reducing Misses by HW Prefetching Instr, Data
Reducing Misses by SW Prefetching Data
Reducing Misses by Compiler Optimizations
[Figure: cache organization showing the TAGS and DATA arrays.]
How to combine the fast hit time of a Direct Mapped cache with the lower conflict misses of a 2-way SA cache?
Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (slow hit)
[Figure: timeline comparing Hit Time, Pseudo Hit Time, and Miss Penalty along the Time axis.]
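The pseudo-associative lookup can be sketched as a direct-mapped probe followed by a second probe in the other half of the cache. The slides do not say how the alternate location is chosen or what happens on a pseudo-hit; this sketch assumes the alternate line is found by flipping the top index bit, that a pseudo-hit swaps the two lines so the next access is fast, and that a miss demotes the old primary block to the alternate line. All names and sizes are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PA_LINES      8u  /* assumed number of lines (power of two) */
#define PA_BLOCK_BITS 5u  /* assumed 32-byte blocks */

enum pa_result { PA_HIT, PA_PSEUDO_HIT, PA_MISS };

struct pa_line {
    bool valid;
    uint32_t block;  /* full block number, since a block may sit in either half */
};

struct pa_cache {
    struct pa_line line[PA_LINES];
};

enum pa_result pa_access(struct pa_cache *c, uint32_t addr)
{
    uint32_t block = addr >> PA_BLOCK_BITS;
    unsigned primary = block % PA_LINES;
    unsigned alternate = primary ^ (PA_LINES / 2u);  /* assumed: flip top index bit */
    struct pa_line *p = &c->line[primary];
    struct pa_line *a = &c->line[alternate];

    /* Fast path: normal direct-mapped hit. */
    if (p->valid && p->block == block)
        return PA_HIT;

    /* Slow path: check the other half of the cache. */
    if (a->valid && a->block == block) {
        struct pa_line tmp = *p;  /* swap so the next access hits fast */
        *p = *a;
        *a = tmp;
        return PA_PSEUDO_HIT;
    }

    /* Miss: demote the old primary block, then install the new one. */
    *a = *p;
    p->valid = true;
    p->block = block;
    return PA_MISS;
}
```

Two blocks that collide in a direct-mapped cache (for instance addresses 0x000 and 0x100 in this geometry) end up sharing the primary and alternate lines, so alternating between them yields pseudo-hits rather than misses.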
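/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];  /* inner loop varies i: consecutive accesses are a full row apart */
/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];  /* inner loop varies j: consecutive accesses are adjacent in memory */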
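The interchange above only changes the order in which elements are visited, not the result. Here is a compilable sketch of both orders, assuming an `int` array and much smaller dimensions than the slide's 5000 x 100 so that it is cheap to run; the outer `k` repetition loop is dropped, and the function names are illustrative.

```c
#define ROWS 50  /* the slide uses 5000 */
#define COLS 10  /* the slide uses 100 */

/* "Before" order: the inner loop walks down a column, so successive
 * accesses are COLS elements apart in C's row-major layout. */
void scale_column_order(int x[ROWS][COLS])
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* "After" order: the inner loop walks along a row, so successive
 * accesses touch adjacent elements and reuse each fetched cache block. */
void scale_row_order(int x[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```

Both functions leave the array in the same state; the row-order version simply uses every word of a cache block before moving on, which is where the miss reduction comes from.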
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
2 misses per access to a & c vs. one miss per access; improves temporal locality, since a[i][j] and c[i][j] are reused while they are still in the cache
[Figure: CPU (in, out) connected through a write buffer to DRAM (or lower mem).]
L2 Equations
$$\text{AMAT} = \text{HitTime}_{L1} + \text{MissRate}_{L1} \times \text{MissPenalty}_{L1}$$

$$\text{MissPenalty}_{L1} = \text{HitTime}_{L2} + \text{MissRate}_{L2} \times \text{MissPenalty}_{L2}$$

$$\text{AMAT} = \text{HitTime}_{L1} + \text{MissRate}_{L1} \times \left( \text{HitTime}_{L2} + \text{MissRate}_{L2} \times \text{MissPenalty}_{L2} \right)$$

Definitions:
Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss RateL2)
Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss RateL1 x Miss RateL2)
Global Miss Rate is what matters
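The two-level relations can be checked numerically with a short sketch. The function names are assumptions; the formulas are the standard ones, where the L1 miss penalty is the L2 hit time plus the L2 local miss rate times the L2 miss penalty, and the L2 global miss rate is the product of the L1 miss rate and the L2 local miss rate.

```c
/* AMAT for a two-level hierarchy, in cycles.
 * local_miss_rate_l2 is measured against accesses that reach L2. */
double amat_two_level(double hit_time_l1, double miss_rate_l1,
                      double hit_time_l2, double local_miss_rate_l2,
                      double miss_penalty_l2)
{
    double miss_penalty_l1 = hit_time_l2 + local_miss_rate_l2 * miss_penalty_l2;
    return hit_time_l1 + miss_rate_l1 * miss_penalty_l1;
}

/* Fraction of all CPU memory accesses that miss in both L1 and L2. */
double global_miss_rate_l2(double miss_rate_l1, double local_miss_rate_l2)
{
    return miss_rate_l1 * local_miss_rate_l2;
}
```

With made-up numbers (1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit, 50% L2 local miss rate, 100-cycle L2 miss penalty), the AMAT is 1 + 0.04 x (10 + 0.5 x 100) = 3.4 cycles and the global L2 miss rate is 2%. The 50% local rate looks alarming in isolation, which is why the global rate is the more meaningful figure.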
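Summary (MR = miss rate, MP = miss penalty, HT = hit time; + marks an improvement):

Target       | Technique                         | MR | MP | HT | Complexity
miss rate    | Larger Block Size                 | +  |    |    | 0
miss rate    | Higher Associativity              | +  |    |    | 1
miss rate    | Victim Caches                     | +  |    |    | 2
miss rate    | Pseudo-Associative Caches         | +  |    |    | 2
miss rate    | HW Prefetching of Instr/Data      | +  |    |    | 2
miss rate    | Compiler Controlled Prefetching   | +  |    |    | 3
miss rate    | Compiler Reduce Misses            | +  |    |    | 0
miss penalty | Priority to Read Misses           |    | +  |    | 1
miss penalty | Early Restart & Critical Word 1st |    | +  |    | 2
miss penalty | Non-Blocking Caches               |    | +  |    | 3
miss penalty | Second Level Caches               |    | +  |    | 2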
Next WEEK