
ELECTRICAL AND ELECTRONIC ENGINEERING

Cache and Caching


Caching is an important optimization technique used to reduce the Von Neumann
bottleneck (the time spent performing memory accesses, which can limit overall performance)
and to improve the performance of any hardware or software system that retrieves information.
A cache acts as an intermediary between the requester and the data store.
Characteristics of Cache
A cache is small, active, transparent and automatic:
Small: Most caches are about 10% of the main memory size and hold an equally small
percentage of the data.
Active: A cache has an active mechanism that examines each request and decides how to
respond: whether the item is available or not available. If it is not available, the cache
retrieves a copy of the item from the data store. It also decides which items to keep
in the cache.
Transparent: A cache can be inserted without making changes to the requester or the data store.
The interface the cache presents to the requester is the same interface the data
store presents, and vice versa.
Automatic: The cache mechanism does not receive instructions on how to act or which data
items to store in the cache storage. Instead it implements an algorithm that
examines the sequence of requests and uses the requests to determine how to
manage the cache.
Importance
Flexibility in usage:
o Hardware, software, and combinations of the two
o Small, medium and large data items
o Generic data items
o Application-specific data
o Textual and non-textual data
o A variety of computers
o Systems designed to retrieve data (the web) or to store it (physical memories)
Cache terminologies
The terminology varies with the application:
Memory system: the data store is called the backing store
Web page caching: the cache is the browser's cache
Web server: the data store is the origin server
Database lookups: the requester is a client issuing requests to a database server (the
system that handles requests)
Hit: a request that can be satisfied without any need to access the
underlying data store
Miss: a request that cannot be satisfied by the cache alone
High locality of reference: a sequence containing many repetitions of the same request
Hit Ratio

hit ratio r = (number of requests that are hits) / (total number of requests)

The effective cost of an access is then

Cost = r·Ch + (1 - r)·Cm

where Ch and Cm are the costs of accessing the cache and the data store respectively.

miss ratio = 1 - hit ratio
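The cost formula is easy to check numerically. Below is a minimal sketch in C; the 2 ns cache and 60 ns store access times are assumed purely for illustration.

```c
#include <stdio.h>

/* Effective access cost of a single cache in front of a data store:
 * Cost = r*Ch + (1 - r)*Cm, where r is the hit ratio. */
static double effective_cost(double r, double c_hit, double c_miss)
{
    return r * c_hit + (1.0 - r) * c_miss;
}

int main(void)
{
    const double ch = 2.0, cm = 60.0;      /* assumed timings (ns) */
    for (int i = 5; i <= 10; i++) {
        double r = i / 10.0;
        printf("hit ratio %.1f -> cost %4.1f ns\n",
               r, effective_cost(r, ch, cm));
    }
    return 0;
}
```

Note how the cost only approaches the cache access time as the hit ratio approaches 1, which is why the replacement policy below matters.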
Replacement policy
A replacement policy is needed to increase the hit ratio:
1. The policy should retain those items that will be referenced most frequently
2. It should be inexpensive to implement
3. The LRU (least recently used) method is preferred (a minimal sketch follows this list)
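A minimal LRU sketch, assuming small integer keys and a tiny fixed capacity. A real cache does this in hardware; the point is only the reordering: every access moves an item to the front, and the victim is always taken from the back.

```c
#include <stdio.h>

#define CAP 4                  /* number of cache slots (assumed) */

static int slots[CAP];         /* slots[0] is the most recently used */
static int used = 0;

/* Returns 1 on hit, 0 on miss; the accessed key is moved to the
 * front, and on a miss with a full cache the last (least recently
 * used) entry is evicted. */
static int cache_access(int key)
{
    int i, hit = 0;
    for (i = 0; i < used; i++)
        if (slots[i] == key) { hit = 1; break; }
    if (!hit)
        i = (used < CAP) ? used++ : CAP - 1;  /* fill a slot or evict LRU */
    for (; i > 0; i--)                        /* shift entries back       */
        slots[i] = slots[i - 1];
    slots[0] = key;
    return hit;
}

int main(void)
{
    int trace[] = {1, 2, 3, 1, 4, 5, 1};      /* assumed request trace */
    int n = sizeof trace / sizeof trace[0], hits = 0;
    for (int i = 0; i < n; i++)
        hits += cache_access(trace[i]);
    printf("hits: %d of %d requests\n", hits, n);
    return 0;
}
```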
Multi-level cache
More than one cache may be used along the path from requester to data store. The cost of
accessing each newly inserted cache is lower than the cost of accessing the cache (or store)
behind it. For two levels:

Cost = r1·Ch1 + r2·Ch2 + (1 - r1 - r2)·Cm

Multi-level caches generally operate by checking the smallest Level 1 (L1) cache first; if it
hits, the processor proceeds at high speed. If the smaller cache misses, the next larger cache
(L2) is checked, and so on, before external memory is checked.
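The two-level formula in C, with assumed illustrative numbers (r1 and r2 are the fractions of all requests that hit in L1 and L2 respectively):

```c
#include <stdio.h>

/* Cost = r1*Ch1 + r2*Ch2 + (1 - r1 - r2)*Cm */
static double two_level_cost(double r1, double r2,
                             double ch1, double ch2, double cm)
{
    return r1 * ch1 + r2 * ch2 + (1.0 - r1 - r2) * cm;
}

int main(void)
{
    /* Assumed: 1 ns L1, 5 ns L2, 60 ns main memory; 90% of all
     * requests hit in L1 and a further 8% hit in L2. */
    printf("effective cost = %.2f ns\n",
           two_level_cost(0.90, 0.08, 1.0, 5.0, 60.0));
    return 0;
}
```

With these numbers the effective cost is 2.5 ns, even though 2% of the requests still reach the 60 ns memory.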
"reloading Caches
+uring start1up the hit ratio is very low since it has to fetch items from the store. ,his can )e
improved )y preloading the cache.
o 5sing anticipation of requests (repeated)
o -requently used pages
"re#fetch related data
*f a processor accesses a )yte of memory# the cache fetches 78 )ytes. ,hus if the processor
fetches the ne(t )yte# the value will come from the cache. $odern computer systems employ
multiple caches. Caching is used with )oth virtual and physical memory as well as secondary
memory.
The Translation Lookaside Buffer (TLB) contains digital circuits that move values into a
Content Addressable Memory (CAM) at high speed.
In this view the cache plays the role of the main memory, and the data store that of the
external storage.
Caches in multiprocessors
Write through and write back
Write through
A method of writing to memory in which the cache keeps a copy and forwards the
write operation to the underlying memory.
Write back scheme
The cache keeps the data item locally and only writes the value to memory when
necessary, i.e. when the value reaches the end of the LRU list and must be
replaced. To determine whether a value has to be written back, the cache keeps
a bit, termed the dirty bit, with each item.
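A sketch of the two policies for a single cached word; the one-entry "cache" and the plain array standing in for main memory are assumed simplifications.

```c
#include <stdbool.h>
#include <stdio.h>

static int  memory[1024];           /* the underlying data store       */
static int  cached_value;           /* a one-word "cache"              */
static int  cached_addr = -1;
static bool dirty       = false;    /* write back: still to be flushed */

/* Write through: update the cache copy AND forward to memory. */
static void write_through(int addr, int value)
{
    cached_addr = addr; cached_value = value;
    memory[addr] = value;
}

/* Write back: update only the cache copy and mark it dirty. */
static void write_back(int addr, int value)
{
    cached_addr = addr; cached_value = value;
    dirty = true;
}

/* On replacement, the value is flushed only if it is dirty. */
static void evict(void)
{
    if (dirty) { memory[cached_addr] = cached_value; dirty = false; }
    cached_addr = -1;
}

int main(void)
{
    write_through(3, 9);
    printf("write through: memory[3] = %d\n", memory[3]);  /* 9 at once */
    write_back(7, 42);
    printf("write back, before evict: memory[7] = %d\n", memory[7]);
    evict();
    printf("write back, after evict:  memory[7] = %d\n", memory[7]);
    return 0;
}
```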
Cache Coherence
Performance can be optimized by using a write-back scheme rather than a write-through
scheme. Performance can also be optimized by giving each processor its own cache.
Unfortunately the two methods (write back and multiple caches) conflict during READ and
WRITE operations on the same address.
To avoid conflicts, all devices that access memory must follow a cache coherence protocol
that coordinates the values: each processor must inform the other processors of its operations
so that the addressing is not confused.
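The notes do not name a specific protocol; MESI is a widely used example. The sketch below shows its four per-line states and one representative rule: what a cache must do when it observes another processor's write to an address it holds.

```c
#include <stdio.h>

/* MESI cache-coherence states for one cache line (a common protocol;
 * the notes above do not name a specific one). */
typedef enum {
    MODIFIED,    /* only copy, modified: main memory is stale  */
    EXCLUSIVE,   /* only copy, still identical to main memory  */
    SHARED,      /* clean copy; other caches may hold it too   */
    INVALID      /* line holds no usable data                  */
} mesi_state;

/* Observed write by another processor to an address we hold:
 * write our data back first if it was modified, then invalidate. */
static mesi_state on_remote_write(mesi_state s)
{
    if (s == MODIFIED) {
        /* the line would be written back to main memory here */
    }
    return INVALID;
}

int main(void)
{
    mesi_state s = on_remote_write(MODIFIED);
    printf("state after remote write: %s\n",
           s == INVALID ? "INVALID" : "?");
    return 0;
}
```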
"hysical memory cache
Demand paging as a form of cache
Cache )ehaves like physical memory and data storage as e(ternal memory
9age replacement policy as cache replacement policy
cache inserted )etween processor and memory need to understand physical address. ;e can
imagine cache receiving a read request# checking to see if the request can )e answered from
cache and then if the request is not present# to pass the request to underlying memory. <nce
the item is retrieved from memory# cache saves a copy locally and then returns the value to
processor.
Example: READ
The cache performs two tasks, passing the request to physical memory and
simultaneously searching locally:
If the answer is found locally, the memory operation is cancelled.
If there is no local answer, the cache waits for the underlying memory operation to complete.
When the answer arrives, the cache saves a copy and transfers the answer to the processor.
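The same sequence written out as a sketch in C. The one-entry cache and the dram[] array are hypothetical stand-ins for the hardware actions listed above.

```c
#include <stdio.h>
#include <stdbool.h>

static int dram[256] = { [5] = 99 };   /* backing store (assumed data) */
static int line_addr = -1, line_val;   /* a one-entry "cache"          */

static bool lookup_local(int addr, int *val)
{
    if (addr == line_addr) { *val = line_val; return true; }
    return false;
}

static int cache_read(int addr)
{
    int value;
    /* the request is passed to physical memory and, simultaneously,
     * looked up locally */
    if (lookup_local(addr, &value))
        return value;                   /* local answer: cancel memory op */
    value = dram[addr];                 /* else wait for memory           */
    line_addr = addr; line_val = value; /* save a copy locally            */
    return value;                       /* transfer answer to processor   */
}

int main(void)
{
    printf("%d\n", cache_read(5));      /* miss: answered from DRAM  */
    printf("%d\n", cache_read(5));      /* hit: answered from cache  */
    return 0;
}
```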
Instructions and Data caches
Should all memory references pass through a single cache? To understand the question,
imagine instructions being executed and data being accessed.
Instruction fetches tend to exhibit high locality, since in many cases the next instruction is
found at an adjacent memory address, and loops are small routines that can fit into a cache.
Data fetches may be at random addresses, and hence not necessarily adjacent in memory.
Also, every time memory is referenced the cache keeps a copy, even though the value may
never be needed again.
This reduces the overall performance of the cache. Architects differ in their choice between
separate instruction and data caches and one large cache that allows intermixing.
Virtual memory caching and cache flush
When the OS is running a program, the addresses issued are always the same, i.e. starting from
zero. If the OS switches to another program, it must also change that information in the cache,
since the new program uses the same addresses to refer to a new set of values. The cache must
have a way to resolve these multiple application address locations:
1. Cache flush operation
The cache is flushed whenever the OS changes to a new virtual address space.
2. Disambiguation
Extra bits are used to identify the address space:
The processor contains an extra hardware register holding an address space ID
Each program is allocated a unique number
On any application swap, the OS loads the application's ID into the address
space ID register
The processor creates artificially longer addresses, containing the ID, before
passing an address to the cache
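A sketch of the disambiguation scheme: the ID register's contents are simply prepended to the virtual address before the cache sees it. The 8-bit ID and 32-bit address widths are assumed for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Form the artificially longer address handed to the cache: an
 * 8-bit address-space ID in front of a 32-bit virtual address. */
static uint64_t cache_address(uint8_t asid, uint32_t vaddr)
{
    return ((uint64_t)asid << 32) | vaddr;
}

int main(void)
{
    /* The same virtual address 0x1000 used by two different
     * programs no longer collides in the cache: */
    printf("%llx\n", (unsigned long long)cache_address(1, 0x1000));
    printf("%llx\n", (unsigned long long)cache_address(2, 0x1000));
    return 0;
}
```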
Implementation of memory cache
Originally the memory cache contained two values per entry: a memory address and the content
found at that address. Newer methods are:
1. Direct-mapped cache
2. Set-associative memory cache
Both use powers of two to minimise computation.

Direct-mapped cache
The cache divides memory and the cache itself into blocks, where the block size is a power of two
To distinguish blocks, a unique tag value is assigned to each group of blocks
From figure 3, tag 2 can occupy block 0 in the cache
Tags identify a large group of bytes rather than a single byte
Cache look-up becomes extremely efficient (a sketch follows)
Newer technology involves the addressing shown in figure 4
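A direct-mapped lookup sketch. Because line size and line count are powers of two, the tag and index fall out of cheap divisions by constants (compiled to shifts and masks); the sizes here are assumed.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define LINE_BYTES 16               /* power of two (assumed) */
#define NUM_LINES  256              /* power of two (assumed) */

struct line { uint32_t tag; bool valid; uint8_t data[LINE_BYTES]; };
static struct line cache[NUM_LINES];

/* An address splits into tag | index | offset; each memory block
 * can live in exactly one cache line, so one comparison decides. */
static bool is_hit(uint32_t addr)
{
    uint32_t index = (addr / LINE_BYTES) % NUM_LINES;
    uint32_t tag   = addr / (LINE_BYTES * NUM_LINES);
    return cache[index].valid && cache[index].tag == tag;
}

int main(void)
{
    printf("%s\n", is_hit(0x1234) ? "hit" : "miss");  /* miss: cache empty */
    return 0;
}
```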
Associative memory cache
A set-associative cache uses hardware parallelism to provide more flexibility:
The associative approach provides hardware that can search all the parallel caches
(ways) simultaneously
A reference may therefore reside in any of the ways
A fully associative cache has underlying caches containing only one slot each, but the
slot can hold an arbitrary value; this is equivalent to a Content Addressable Memory (CAM)
Example to programmers
Programmers who understand the cache can write code that exploits it.
Array
Assume many operations on a large array:
Perform all the operations on one element of the array before moving to the
next element, so that the program iterates through the array only once (see the
sketch below)
Paging: a single iteration also means each page is demand-loaded only once
TLB
The register file in the CPU is accessible by both the integer and the floating-point units, or
each unit may have its own specialized registers. The out-of-order execution units are
intelligent enough to know the original order of the instructions in the program and re-impose
program order when the results are to be committed ("retired") to their final destination
registers.
Exclusive versus inclusive cache
Multi-level caches introduce new design decisions. For instance, in some processors, all data
in the L1 cache must also be somewhere in the L2 cache. Such caches are called strictly
inclusive. Other processors (like the AMD Athlon) have exclusive caches: data is
guaranteed to be in at most one of the L1 and L2 caches, never in both. Still other processors
(like the Intel Pentium II, III, and 4) do not require that data in the L1 cache also reside in the
L2 cache, although it may often do so. There is no universally accepted name for this
intermediate policy, although the term mainly inclusive has been used.
The advantage of exclusive caches is that they store more data. This advantage is larger when
the exclusive L1 cache is comparable in size to the L2 cache, and diminishes if the L2 cache is
many times larger than the L1 cache. When the L1 misses and the L2 hits on an access, the
hitting cache line in the L2 is exchanged with a line in the L1. This exchange is quite a bit more
work than just copying a line from L2 to L1, which is what an inclusive cache does.
One advantage of strictly inclusive caches is that when external devices or other processors in
a multiprocessor system wish to remove a cache line from the processor, they need only have
the processor check the L2 cache. In cache hierarchies which do not enforce inclusion, the L1
cache must be checked as well. As a drawback, there is a correlation between the
associativities of the L1 and L2 caches: if the L2 cache does not have at least as many ways as all
the L1 caches together, the effective associativity of the L1 caches is restricted.
Another advantage of inclusive caches is that the larger cache can use larger cache lines,
which reduces the size of the secondary cache tags. (Exclusive caches require both caches to
have cache lines of the same size, so that lines can be swapped on an L1 miss, L2 hit.) If
the secondary cache is an order of magnitude larger than the primary, and the cache data is an
order of magnitude larger than the cache tags, the tag area saved can be comparable to the
incremental area needed to store the L1 cache data in the L2.
Three-Level Cache Hierarchy
As the latency difference between main memory and the fastest cache has become larger,
some processors have begun to utilize as many as three levels of on-chip cache. The Itanium 2
(2003) had a 6 MB unified level 3 (L3) cache on-die; the AMD Phenom II (2008) has up to 6
MB of on-die unified L3 cache; and the Intel Core i7 (2008) has an 8 MB on-die unified L3
cache that is inclusive and shared by all cores. The benefits of an L3 cache depend on the
application's access patterns.
The memory hierarchy of Conroe was extremely simple, and Intel was able to concentrate on
the performance of the shared L2 cache, which was the best solution for an architecture that
was aimed mostly at dual-core implementations. But with Nehalem, the engineers started
from scratch and came to the same conclusions as their competitors: a shared L2 cache was
not suited to a native quad-core architecture. The different cores would too frequently flush
data needed by another core, and it would have involved too many problems in terms of
internal buses and arbitration to provide all four cores with sufficient bandwidth while
keeping latency sufficiently low. To solve the problem, the engineers gave each core a Level 2
cache of its own, dedicated to a single core and relatively small (256 KB). It is complemented
by an enormous Level 3 cache (8 MB) for managing communications between the cores.
If a core tries to access a data item and it is not present in the Level 3 cache, there is no need
to look in the other cores' private caches: the data item will not be there either. Conversely, if
the data are present, four bits associated with each line of the cache memory (one bit per core)
show whether or not the data are potentially present (potentially, but not with certainty) in the
lower-level cache of another core, and which one.
"ipelining in Microprocessors
$odern microprocessors are structured and hence they contain many internal processing
units. :ach unit performs a particular task. *n real sense each of these processing units is
actually a special purpose microprocessor. ,he processor can process several instructions
simultaneously at various stages of e(ecution. ,his a)ility is called pipelining. *ntel C%C7
was the first processor to make use of idle memory time )y fetching the ne(t instruction
while e(ecuting the current one. ,his process accelerates the overall e(ecution time of a
program.
Figure 9 shows how an Intel i486 executes instructions in a pipelined fashion. While one
instruction is being fetched, a second is decoded, a third is executed, and a fourth is written
back. All these activities take place within the same time interval, giving an overall execution
rate of one instruction per clock cycle. Compared with the conventional approach, which
requires 4 clock cycles to fetch, execute and write back one instruction, the pipelined
approach is much superior. If the start and end times of the whole operation are considered,
the overall (average) rate of processing comes out to be nearly one clock cycle (slightly
greater) per instruction.
Non-pipelined execution (8085):

µP bus:  FETCH 1 | Decode 1 | Execute 1 | FETCH 2 | Decode 2 | Execute 2 | FETCH 3 | Decode 3 | Execute 3
Bus:     BUSY      IDLE       BUSY        BUSY      IDLE       BUSY        BUSY      IDLE       BUSY

Pipelined execution (i486), stages shifted one clock per row:

Bus unit:         FETCH 1 | FETCH 2 | FETCH 3 | FETCH 4 | STORE 1 | FETCH 5 | FETCH 6 | READ | FETCH 7
Instruction unit:           Decode 1 | Decode 2 | Decode 3 | Decode 4 | IDLE | Decode 5 | Decode 6 | IDLE
Execution unit:                        Execute 1 | Execute 2 | Execute 3 | Execute 4 | IDLE | Execute 5 | Execute 6
Address unit:                                      Generate Address 1 ... Generate Address 2

Fig. 9  Pipelining of instructions
The pipelining approach is very much a part of the RISC architecture, besides being suitable for
CISC architectures. Other factors that contributed to the development of RISC features on the
i486 are the MMU and the 8 KB primary cache.
Additional Notes
40 MHz clock: 25 ns cycle time
DRAM chips: access time 60 - 100 ns
SRAM: access time 15 - 25 ns
SRAM (ECL): access time 12 ns, BUT expensive
For scale: an aircraft moving at 850 km/h (about 236 m/s) covers roughly 2.8 µm in 12 ns,
about a tenth of the diameter of a human hair.
A cache attempts to combine the speed advantage of SRAM with the cheapness of DRAM, to
achieve the most effective memory system.
CONTROLLER
The cache can be on-chip or separate. It is typically between 1/10 and 1/1000 of the size of
main memory.
A cache hit means the requested information is in the cache, while a cache miss indicates that
the requested information is not in the cache.
- On a miss, the cache controller disables the READY signal, forcing the CPU to insert
wait states, and reads a complete cache line from main memory (a cache line fill).
- The data bytes addressed by the CPU are passed on by the cache controller
immediately, before the whole cache line fill is completed.
"/$
CC':
C95 CC':
+/$
$:$</H
($*N)
A cache line is 16 or 32 bytes in size; the next CPU request is often part of the same cache line,
which increases the hit rate.
Cache controllers use burst mode, in which a block of data containing more bytes than the
width of the data bus is transferred in consecutive bus cycles. Burst mode doubles the bus
transfer rate.
Cache strategies:
- write through
- write back (copy back)
- write allocate
Write through: data is always transferred to main memory, even when there is a hit.
- Wait states result
- The use of fast write buffers tries to improve the write operations
- Main memory consistency is enhanced
Multiprocessors would have difficulty with this strategy unless an inquiry cycle is performed
to re-establish consistency.
Write back: writes go to the cache only, unless specified otherwise.
Write allocate: on a write miss, the cache controller fills a cache line with the data for the
address to be written.
Usually the data is first written through to main memory; the cache controller then reads into
the cache the applicable cache line containing the entry to be updated. The cache
controller performs the write allocate independently, in parallel with CPU operation.
Because of this complication, write misses are often simply switched through to main memory
and ignored by the cache (no write allocate).
CACHE ORGANIZATION AND ASSOCIATIVE MEMORY (CAM)
Types: direct mapping, 4-way (tag-based), associative memory.
Assume a cache memory of 16 KB and a cache line of 16 bytes. A 32-bit address then splits as:

  20 bits       8 bits        4 bits
  tag address   set address   byte address

A cache entry consists of a cache directory entry plus the cache memory itself.
Cache directory
The cache directory is stored internally in the cache controller or in an external SRAM (hence
more SRAMs are needed than the data alone would require).
Cache memory: stores the actual data.
E.g. the cache directory of a 4-way cache.
TAG: element of the cache directory
Determines whether a hit or a miss occurs
Valid bit: indicates a valid cache line
Flush: resets the valid bits of the cache lines
Write protect: the line may not be overwritten
":,
:very tag of corresponding cache line are elements of the set.
;ay -or a given set address# the tag address of all ways are simultaneously
compared with the tag part of the address given out of the C95 for a hitKmiss
criterion.
Capacity 8 way M set M cache line size L !7E)
$iss will check 4/5 for replacement.
Algorithms
Direct mapping: a cache line can go in exactly one position.
Associative: a cache line can be anywhere within the four ways, so overwriting a
recently used line can be avoided. A 2-way cache is faster than a 4-way cache.
The associative memory concept is also known as Content Addressable Memory
(CAM).
Cache hit determination
For a 32-bit microprocessor, the 4 GB address space is divided into 2^20 cache pages of 4 KB
each; each page is further divided into 256 sets of 16-byte cache lines.
Organization
There are no fundamental restrictions on the organization. An L2 cache may, for example, be
organized as 512 KB in a 2-way organization, with a cache line taken as 64 bytes and 8192
sets.
Large caches can be implemented with external SRAMs. The tags may be held in SRAMs with
a short access time (15 ns) while the other data resides in SRAMs with an access time of
20 ns, which can be external.
Replacement strategies
,he cache controller uses the 4/5 )its assigned to a set of cache lines for marking the last
addressed (most recently) way of the set.
Replacement policy (decision tree):

All lines valid?
  No  -> replace the invalid line
  Yes -> B0 = 0?
           Yes -> B1 = 0?  Yes -> replace way 0,  No -> replace way 1
           No  -> B2 = 0?  Yes -> replace way 2,  No -> replace way 3
Random replacement is also possible. Comprehensive statistical analyses have shown that
there is very little difference between the efficiency of the LRU and random replacement
algorithms; the choice of replacement policy rests solely with the cache designer.
Access and addressing
If the last access was to way 0 or way 1, the controller sets LRU bit B0. For an access to way 0,
bit B1 is set; addressing way 1 clears B1. On an access to way 2, bit B2 is set; addressing
way 3 clears B2 (and in both of the latter cases B0 is cleared). A sketch follows.
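A sketch of the three-bit pseudo-LRU scheme for one set of a 4-way cache, following the update and replacement rules above; the access sequence in main() is assumed.

```c
#include <stdio.h>

/* Three pseudo-LRU bits per set: B0 says whether ways 0/1 (B0=1) or
 * ways 2/3 (B0=0) were used last; B1 and B2 pick within each pair. */
struct plru { int b0, b1, b2; };

/* Update the bits on an access to the given way. */
static void touch(struct plru *p, int way)
{
    switch (way) {
    case 0: p->b0 = 1; p->b1 = 1; break;
    case 1: p->b0 = 1; p->b1 = 0; break;
    case 2: p->b0 = 0; p->b2 = 1; break;
    case 3: p->b0 = 0; p->b2 = 0; break;
    }
}

/* Pick a victim way following the decision tree above
 * (assumes all four lines are already valid). */
static int victim(const struct plru *p)
{
    if (p->b0 == 0)                    /* ways 0/1 used least recently */
        return p->b1 == 0 ? 0 : 1;
    else                               /* ways 2/3 used least recently */
        return p->b2 == 0 ? 2 : 3;
}

int main(void)
{
    struct plru p = {0, 0, 0};
    touch(&p, 0);                      /* assumed access sequence ...   */
    touch(&p, 2);                      /* ... most recent access: way 2 */
    printf("replace way %d\n", victim(&p));   /* -> way 1 */
    return 0;
}
```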