Caching is an optimization technique used to reduce memory access time and improve performance. A cache acts as an intermediary between the processor and main memory, storing frequently accessed data. Caches are small, active, transparent, and automatic; they use algorithms such as least recently used (LRU) to decide which data to replace. Caches improve flexibility and reduce cost by allowing faster access to nearby data than slower main memory provides. Multi-level caches of increasing size further improve performance: the smaller caches are checked before the larger ones and before external memory.
Caching is an important optimization technique used to reduce the Von Neumann bottleneck (the time spent performing memory accesses, which can limit overall performance) and to improve the performance of any hardware or software system that retrieves information. A cache acts as an intermediary between the requester and the data store.

Characteristics of a Cache
A cache is small, active, transparent and automatic.
- Small: most caches are about 10% of the main memory size and hold a correspondingly small percentage of the data.
- Active: a cache has an active mechanism that examines each request and decides how to respond: if the item is available, answer from the cache; if not, retrieve a copy of the item from the data store. The cache also decides which items to keep.
- Transparent: a cache can be inserted without making changes to the requester or the data store. The cache presents to the requester the same interface as the data store does, and vice versa.
- Automatic: the cache mechanism does not receive instructions on how to act or which data items to store. Instead it implements an algorithm that examines the sequence of requests and uses it to determine how to manage the cache.

Importance
Caching is flexible in usage:
- hardware, software, and combinations of the two
- small, medium and large data items
- generic data items
- application-specific types of data
- textual and non-textual data
- a variety of computers
- systems designed to retrieve data (the web) as well as those that store it (physical memories)

Cache Terminology
The terminology depends on the application:
- Memory system: the underlying storage is the backing store.
- Web pages: the cache is the browser; the store is the origin server.
- Database lookups: clients request data from database servers (systems that handle requests).
- Hit: a request that can be satisfied without any need to access the underlying data store.
- Miss: a request that cannot be satisfied by the cache alone.
- High locality of reference: a request sequence containing many repetitions of the same requests.

Hit ratio = (number of requests that are hits) / (total number of requests)
The effective access cost is

Cost = r·Ch + (1 − r)·Cm

where r is the hit ratio and Ch and Cm are the costs of accessing the cache and the store respectively. The miss ratio is 1 − hit ratio.

Replacement Policy
The policy is needed to increase the hit ratio:
1. The policy should retain those items that will be referenced most frequently.
2. It should be inexpensive.
3. The LRU (least recently used) method is preferred.

Multi-level Cache
More than one cache may be used along the path from requester to data store. The cost of accessing each new cache is lower than the cost of accessing the original store:

Cost = r1·Ch1 + (1 − r1)·r2·Ch2 + (1 − r1)·(1 − r2)·Cm

Multi-level caches generally operate by checking the smallest Level 1 (L1) cache first; if it hits, the processor proceeds at high speed. If the smaller cache misses, the next larger cache (L2) is checked, and so on, before external memory is checked.

Preloading Caches
During start-up the hit ratio is very low, since every item must first be fetched from the store. This can be improved by preloading the cache:
- using anticipation of (repeated) requests
- loading frequently used pages

Pre-fetching related data: if a processor accesses a byte of memory, the cache fetches, for example, 64 bytes. If the processor then fetches the next byte, the value comes from the cache. Modern computer systems employ multiple caches. Caching is used with both virtual and physical memory as well as secondary memory. The Translation Lookaside Buffer (TLB) contains digital circuits that move values into a Content Addressable Memory (CAM) at high speed. The cache can be viewed as the main memory, and the data store as the external storage.

Caches in Multiprocessors: Write Through and Write Back
- Write through: the cache keeps a copy and forwards each write operation to the underlying memory.
- Write back: the cache keeps the data item locally and only writes the value to memory when necessary, i.e. when the value reaches the end of the LRU list and must be replaced. To determine whether a value must be written back, the cache keeps a bit termed the dirty bit.
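The single-level and two-level cost formulas above can be checked with a small Python sketch. The timing numbers below are illustrative assumptions (they are not from the text):

```python
def single_level_cost(r, c_hit, c_mem):
    """Effective access cost with one cache: Cost = r*Ch + (1 - r)*Cm."""
    return r * c_hit + (1 - r) * c_mem

def two_level_cost(r1, c1, r2, c2, c_mem):
    """Two-level cost: Cost = r1*Ch1 + (1 - r1)*r2*Ch2 + (1 - r1)*(1 - r2)*Cm."""
    return r1 * c1 + (1 - r1) * r2 * c2 + (1 - r1) * (1 - r2) * c_mem

# Assumed numbers: 90% L1 hit ratio at 2 ns, 80% L2 hit ratio at 10 ns,
# 100 ns main memory.
print(single_level_cost(0.9, 2, 100))        # ~11.8 ns
print(two_level_cost(0.9, 2, 0.8, 10, 100))  # ~4.6 ns
```

Note how adding the second level cuts the effective cost well below the single-level figure even though its hit ratio is lower, because it only sees the 10% of requests that miss in L1.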
Cache Coherence
Performance can be optimized by using a write-back scheme rather than a write-through scheme. It can also be optimized by giving each processor its own cache. Unfortunately the two methods (write back and multiple caches) conflict during READ and WRITE operations on the same address. To avoid conflicts, all devices that access memory must follow a cache coherence protocol that coordinates the values: each processor must inform the other processors of its operations so that the addressing is not confused.

Physical Memory Cache
Demand paging can be viewed as a form of caching: the cache behaves like physical memory and the data store like external memory, and the page replacement policy plays the role of the cache replacement policy. A cache inserted between processor and memory must understand physical addresses. We can imagine the cache receiving a read request, checking whether the request can be answered from the cache, and, if the item is not present, passing the request to the underlying memory. Once the item is retrieved from memory, the cache saves a copy locally and then returns the value to the processor.

Example: on a READ, the cache performs two tasks, passing the request to physical memory while simultaneously searching locally:
- if the answer is found locally, cancel the memory operation;
- if there is no local answer, wait for the underlying memory operation to complete;
- when the answer arrives, save a copy and transfer the answer to the processor.

Instruction and Data Caches
Should all memory references pass through a single cache? To understand the question, imagine instructions being executed and data being accessed. Instruction fetches tend to exhibit high locality, since in many cases the next instruction is found at an adjacent memory address, and loops are small routines that can fit into a cache. Data fetches may be random and hence not necessarily adjacent in memory. Moreover, any time memory is referenced the cache keeps a copy, even if the value will never be needed again, which reduces the overall performance of the cache.
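The READ sequence described above (search locally, fall back to memory, save a copy) can be sketched in a few lines of Python. Class and attribute names here are illustrative assumptions, and the hardware parallelism of the simultaneous lookup is not modelled:

```python
class ReadThroughCache:
    """Minimal sketch of the READ sequence: check locally, else fetch and copy."""

    def __init__(self, backing_store):
        self.backing = backing_store  # stands in for physical memory
        self.lines = {}               # address -> cached value
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        if addr in self.lines:        # local answer: the parallel memory
            self.hits += 1            # operation would be cancelled here
            return self.lines[addr]
        self.misses += 1              # otherwise wait for the memory result,
        value = self.backing[addr]
        self.lines[addr] = value      # save a copy locally,
        return value                  # and transfer the answer to the processor

memory = {0x10: 42}
cache = ReadThroughCache(memory)
cache.read(0x10)                      # miss: fetched from memory
cache.read(0x10)                      # hit: answered locally
print(cache.hits, cache.misses)       # 1 1
```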
Architects therefore vary in their choice between separate caches and one large cache that allows intermixing.

Virtual Memory Caching and Cache Flush
When the OS runs a program, the addresses it uses are always the same, i.e. starting from zero. If the OS switches to another program, it must also change that information in the cache, since the new program uses the same addresses to refer to a new set of values. The cache must have a way to resolve these multiple per-application address spaces:
1. Cache flush: the cache is flushed whenever the OS changes to a new virtual address space.
2. Disambiguation: extra bits identify the address space. The processor contains an extra hardware register that holds an address space ID, and each program is allocated a unique number. On an application swap, the OS loads the application's ID into the address space ID register. The processor then creates artificially longer addresses, containing the ID, before passing an address to the cache.

Implementation of a Memory Cache
Originally the memory cache held two values per entry: a memory address and the content found at that address. Newer methods are:
1. Direct mapped cache
2. Set associative memory cache
"et associative memory cache Use power of two minimise computation +irect method cache Cache divides memory and cache into )locks where the )lock is in powers of two ,o distinguish )locks# a unique tag value is assigned to each group of the )locks -rom figure 3# tag0 can occupy )lock% in cache ,ags are used to identify a large group of )ytes than single )yte Cache look1up )ecomes e(tremely efficient Newer technology involve the addressing as shown in figure 8 ssociative memory cache set of associative cache use hardware parallelism to provide more fle(i)ility ssociative approach provides hardware that can search all of them simultaneously /eferencing is in different caches fully associative cache has the underlying caches containing only one slot# )ut the slot can hold an ar)itrary value equivalent to Content addressa)le $emory (C$) E+ample to programmers 9rogrammers who understand cache can write a code that e(ploits a cache Array ssume many operations on a large array 9erform all the operations on a single element of the array )efore moving to the ne(t element (program iterates through the array once) "aging "ingle iteration for demand of paging possi)le 7 TL, ,he register file in the C95 is accessi)le )y )oth the integer and the floating point units# or each unit may have its own specialized registers. ,he out1of1order e(ecution units are intelligent enough to know the original order of the instructions in the program and re1impose program order when the results are to )e committed (?retired@) to their final destination registers A E+clusie ersus inclusie cache $ulti1level caches introduce new design decisions. -or instance# in some processors# all data in the 4! cache must also )e somewhere in the 40 cache. ,hese caches are called strictly inclusie. <ther processors (like the $+ thlon) have e+clusie caches B data is guaranteed to )e in at most one of the 4! and 40 caches# never in )oth. 
"till other processors (like the *ntel 9entium **# ***# and 8)# do not require that data in the 4! cache also reside in the 40 cache# although it may often do so. ,here is no universally accepted name for this intermediate policy# although the term mainly inclusie has )een used. ,he advantage of e(clusive caches is that they store more data. ,his advantage is larger when the e(clusive 4! cache is compara)le to the 40 cache# and diminishes if the 40 cache is many times larger than the 4! cache. ;hen the 4! misses and the 40 hits on an access# the hitting cache line in the 40 is e(changed with a line in the 4!. ,his e(change is quite a )it more work than =ust copying a line from 40 to 4!# which is what an inclusive cache does. <ne advantage of strictly inclusive caches is that when e(ternal devices or other processors in a multiprocessor system wish to remove a cache line from the processor# they need only have the processor check the 40 cache. *n cache hierarchies which do not enforce inclusion# the 4! cache must )e checked as well. s a draw)ack# there is a correlation )etween the associativities of 4! and 40 caches2 if the 40 cache does not have at least as many ways as all 4! caches together# the effective associativity of the 4! caches is restricted. nother advantage of inclusive caches is that the larger cache can use larger cache lines# which reduces the size of the secondary cache tags. (:(clusive caches require )oth caches to have the same size cache lines# so that cache lines can )e swapped on a 4! miss# 40 hit). *f the secondary cache is an order of magnitude larger than the primary# and the cache data is an order of magnitude larger than the cache tags# this tag area saved can )e compara)le to the incremental area needed to store the 4! cache data in the 40. Three#Leel Cache -ierarchy s the latency difference )etween main memory and the fastest cache has )ecome larger# some processors have )egun to utilize as many as three levels of on1chip cache. 
The Itanium 2 (2003) had a 6 MB unified level 3 (L3) cache on-die; the AMD Phenom II (2008) has up to 6 MB of on-die unified L3 cache; and the Intel Core i7 (2008) has an 8 MB on-die unified L3 cache that is inclusive and shared by all cores. The benefits of an L3 cache depend on the application's access patterns. The memory hierarchy of Conroe was extremely simple, and Intel was able to concentrate on the performance of the shared L2 cache, which was the best solution for an architecture aimed mostly at dual-core implementations. But with Nehalem, the engineers started from scratch and came to the same conclusions as their competitors: a shared L2 cache was not suited to a native quad-core architecture. The different cores would too frequently flush data needed by another core, and it would have involved too many problems in terms of internal buses and arbitration to provide all four cores with sufficient bandwidth while keeping latency sufficiently low. To solve the problem, the engineers gave each core a Level 2 cache of its own, dedicated to a single core and relatively small (256 KB). Then comes an enormous Level 3 cache memory (8 MB) for managing communication between the cores. Because this L3 is inclusive, if a core tries to access a data item and it is not present in the 8 MB Level 3 cache, there is no need to look in the other cores' private caches; the data item will not be there either. Conversely, if the data are present, four bits associated with each line of the cache memory (one bit per core) show whether or not the data are potentially present (potentially, but not with certainty) in the lower-level cache of another core, and which one.

Pipelining in Microprocessors
Modern microprocessors are highly structured and contain many internal processing units, each performing a particular task. In a real sense, each of these processing units is a special-purpose microprocessor.
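The speed advantage of pipelining that this section goes on to describe can be sketched with a simple cycle count. This is a minimal sketch assuming an ideal 4-stage pipeline (fetch, decode, execute, write back) with one stage per clock and no stalls:

```python
def cycles_sequential(n_instr, stages=4):
    """Non-pipelined: every instruction runs all stages before the next starts."""
    return n_instr * stages

def cycles_pipelined(n_instr, stages=4):
    """Ideal pipeline: 'stages' clocks to fill, then one completion per clock."""
    return stages + (n_instr - 1)

n = 9
print(cycles_sequential(n))  # 36 clocks
print(cycles_pipelined(n))   # 12 clocks, i.e. about 1.33 clocks per instruction
```

With 9 instructions the pipelined count works out to slightly more than one clock per instruction, matching the "nearly one instruction per clock" observation below.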
The processor can work on several instructions simultaneously, each at a different stage of execution. This ability is called pipelining. The Intel 8086 was the first processor to make use of idle memory time by fetching the next instruction while executing the current one. This process accelerates the overall execution time of a program. Figure 9 shows how an Intel i486 executes instructions in a pipelined fashion: while one instruction is being fetched, another is decoded, a third is being executed and a fourth is being written back. All these activities take place within the same time slot, giving an overall execution rate of one instruction per clock cycle. Compared with the conventional approach, which requires 4 clock cycles to fetch, decode, execute and write back one instruction, the pipelined approach is much superior. If the start and end times of the 9 operations in the figure are considered, the overall (average) rate of processing comes out to be nearly one instruction per clock (slightly more than one clock per instruction).

[Fig. 9: Pipelining of instructions. The figure contrasts non-pipelined execution (8085), where the bus alternates BUSY/IDLE around each fetch-decode-execute sequence, with the i486, where the Bus Unit (FETCH 1-7, STORE 1, READ), Instruction Unit (Decode), Execution Unit (Execute) and Address Unit (Generate Address) all operate in parallel.]

The pipelining approach is very much a part of RISC architecture, besides being suitable for CISC architecture. Other factors that contributed to the development of RISC features on the i486 are the MMU and the 8 KB primary cache.

Additional Notes
- A 40 MHz clock corresponds to a 25 ns cycle.
- DRAM chips: access time 60-100 ns.
- SRAM: access time 15-25 ns.
- SRAM (ECL): access time 12 ns, but expensive.
- For scale: an aircraft moving at 850 km/h travels about 1/10 of the diameter of a hair in 12 ns.
- A cache attempts to combine the speed advantage of SRAM with the cheapness of DRAM
to achieve the most effective memory system.

[Diagram: CPU and cache controller with SRAM cache, in front of DRAM (main) memory]

The cache can be on-chip or separate, and is typically between 1/10 and 1/1000 the size of main memory. A cache hit means the requested information is in the cache, while a cache miss indicates that the requested information is not in the cache.
- On a miss, the cache controller disables the ready signal, forcing the CPU to insert wait states, and reads a complete cache line (a cache line fill). The data bytes addressed by the CPU are immediately passed on by the cache controller, before the whole cache line fill is completed.
- A cache line is 16 or 32 bytes in size; the next CPU request may fall within the same cache line, which increases the hit rate.
- Cache controllers use burst mode, in which a block of data containing more bytes than the data bus width is transferred. Burst mode doubles the bus transfer rate.

Cache strategies: write through, write back (copy back), and write allocate.
- Write through: always transfers data to main memory, even when there is a cache hit. This costs wait states; fast write buffers try to improve the write operations. Main memory consistency is enhanced. Multiprocessors would have difficulty with this strategy unless an inquiry cycle is performed to re-establish consistency.
- Write back (copy back): writes go to the cache only, unless otherwise specified.
- Write allocate: on a write miss, the cache controller allocates a cache line for the address to be written. Usually the data is also written through to main memory; the cache controller then reads the applicable cache line, with the entry to be updated, into the cache, performing the write allocate independently and in parallel with CPU operation. Because of this complication, write misses are often simply switched through to main memory and ignored by the cache.
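The difference between write through and write back can be illustrated with a short Python sketch of the write-back (copy-back) scheme and its dirty bit. The class and method names are assumptions for illustration, not from the text:

```python
class WriteBackCache:
    """Sketch of write back: keep writes local, flush on replacement if dirty."""

    def __init__(self, memory):
        self.memory = memory   # stands in for main memory
        self.lines = {}        # address -> (value, dirty bit)

    def write(self, addr, value):
        self.lines[addr] = (value, True)   # keep the item locally, mark it dirty

    def evict(self, addr):
        value, dirty = self.lines.pop(addr)
        if dirty:                          # write back only when necessary
            self.memory[addr] = value

memory = {0x20: 0}
cache = WriteBackCache(memory)
cache.write(0x20, 99)
print(memory[0x20])    # 0  -- main memory not yet updated
cache.evict(0x20)
print(memory[0x20])    # 99 -- value written back on replacement
```

A write-through cache would instead update `memory` inside `write()` on every store, which is why it needs wait states or write buffers, and why it keeps main memory consistent at all times.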
CACHE ORGANIZATION AND ASSOCIATIVE MEMORY (CAM)
Types: direct mapping, 4-way set associative (tag comparison), associative memory.

Assume a cache memory of 16 KB and a cache line of 16 bytes. A 32-bit address is then split as:

tag address (20 bits) | set address (8 bits) | byte address (4 bits)

A cache entry consists of a cache directory entry and the cache memory itself.
- Cache directory: stored internally in the cache controller or in an external SRAM, hence more SRAMs than actually necessary for the data alone.
- Cache memory: stores the actual data, e.g. a 4-way cache together with its cache directory.

Elements of a cache directory entry:
- Tag: determines whether an access is a hit or a miss.
- Valid bit: implies a valid cache line; a flush resets the valid bits of the cache lines.
- Write protect bit: no overwrite allowed.

Set: the tags of the corresponding cache lines in all ways are the elements of the set.
Way: for a given set address, the tag addresses of all ways are simultaneously compared with the tag part of the address issued by the CPU to decide hit or miss.
Capacity: 4 ways × 256 sets × 16-byte cache line = 16 KB. On a miss, the LRU bits are checked for replacement.

Algorithms:
- Direct mapping: a cache line can reside in exactly one position.
- Associative: a cache line can be anywhere within the four ways, so overwrites can be avoided. A 2-way cache would be faster than a 4-way cache. The associative memory concept is also known as Content Addressable Memory (CAM).
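The 20/8/4 address split for the 16 KB, 4-way, 16-byte-line cache described above can be sketched in Python; the constant and function names are illustrative assumptions:

```python
# 16 KB 4-way cache with 16-byte lines: 20-bit tag, 8-bit set, 4-bit byte offset.
TAG_BITS, SET_BITS, BYTE_BITS = 20, 8, 4
WAYS, LINE_SIZE = 4, 16

def split_address(addr):
    """Split a 32-bit address into (tag, set index, byte offset)."""
    byte = addr & ((1 << BYTE_BITS) - 1)
    set_index = (addr >> BYTE_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (BYTE_BITS + SET_BITS)
    return tag, set_index, byte

# Capacity check from the text: ways x sets x line size = 16 KB.
assert WAYS * (1 << SET_BITS) * LINE_SIZE == 16 * 1024

print(split_address(0x12345678))   # tag 0x12345, set 0x67, byte offset 0x8
```

The set index selects one set; the controller then compares the tag against all four ways of that set in parallel, and the byte offset selects the byte within the 16-byte line.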
Cache Hit Determination
For a 32-bit microprocessor, the 4 GB address space is divided into 2^20 cache pages of 4 KB each, each mapping through the 256 sets; a page is further divided into 16-byte cache lines.

Organization
There are no restrictions on the organization. An L2 cache, for example, may be organized as 512 KB in a 2-way organization, with a cache line taken as 64 bytes, giving 8192 cache lines. Large caches can be implemented with external SRAMs. The tags may be held in SRAMs with a short access time (15 ns) while the data reside in SRAMs with an access time of 20 ns, which can be external.

Replacement Strategies
The cache controller uses the LRU bits assigned to each set of cache lines to mark the most recently addressed way of the set. The replacement decision follows this flowchart:
- Are all lines valid? If no, replace an invalid line.
- If yes and B0 = 0 (ways 0/1 least recently used): if B1 = 0 replace way 0, else replace way 1.
- If yes and B0 = 1: if B2 = 0 replace way 2, else replace way 3.

Random replacement is also possible. Comprehensive statistical analyses have shown that there is very little difference between the efficiency of the LRU and random replacement algorithms, so the choice of replacement policy rests solely with the cache designer.

Access and Addressing of the LRU Bits
If the last access was to way 0 or way 1, the controller sets LRU bit B0. Within that pair, an access to way 0 sets bit B1, while addressing way 1 clears B1. Likewise, an access to way 2 sets LRU bit B2, and addressing way 3 clears B2.
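The replacement flowchart above can be summarised as a small decision function. This is a sketch of the decision only (the LRU bits B0, B1, B2 are passed in rather than tracked on each access), with names assumed for illustration:

```python
def choose_way(valid, b0, b1, b2):
    """Pick the way to replace in a 4-way set, mirroring the flowchart above."""
    if not all(valid):            # an invalid line is always filled first
        return valid.index(False)
    if b0 == 0:                   # ways 0/1 were least recently used
        return 0 if b1 == 0 else 1
    return 2 if b2 == 0 else 3    # otherwise replace in ways 2/3

print(choose_way([True, True, True, True], b0=0, b1=1, b2=0))   # way 1
print(choose_way([True, False, True, True], b0=1, b1=0, b2=0))  # way 1 (invalid)
```

This is a pseudo-LRU scheme: three bits per set approximate true LRU order, which is much cheaper in hardware than tracking the exact access order of four ways.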