
Burger, D., Goodman, J.R., Sohi, G.S. "Memory Systems"

Memory Systems
The Electrical Engineering Handbook
Ed. Richard C. Dorf
Boca Raton: CRC Press LLC, 2000
© 2000 by CRC Press LLC
88
Memory Systems
88.1 Introduction
88.2 Memory Hierarchies
88.3 Cache Memories
88.4 Parallel and Interleaved Memories
88.5 Virtual Memory
88.6 Research Issues
88.1 Introduction
A memory system serves as a repository of information (data) in a computer system. The processor [also called the central processing unit (CPU)] accesses (reads or loads) data from the memory system, performs computations on them, and stores (writes) them back to memory. The memory system is a collection of storage locations. Each storage location, or memory word, has a numerical address. A collection of storage locations forms an address space. Figure 88.1 shows the essentials of how a processor is connected to a memory system via address, data, and control lines.
When a processor attempts to load the contents of a memory location, the request is very urgent. In virtually all computers, the work soon comes to a halt (in other words, the processor stalls) if the memory request does not return quickly. Modern computers are generally able to continue briefly by overlapping memory requests, but even the most sophisticated computers will frequently exhaust their ability to process data and stall momentarily in the face of long memory delays. Thus, a key performance parameter in the design of any computer, fast or slow, is the effective speed of its memory.
Ideally, the memory system must be both infinitely large so that it can contain an arbitrarily large amount of information and infinitely fast so that it does not limit the processing unit. Practically, however, this is not possible. There are three properties of memory that are inherently in conflict: speed, capacity, and cost. In general, technology tradeoffs can be employed to optimize any two of the three factors at the expense of the third. Thus it is possible to have memories that are (1) large and cheap, but not fast; (2) cheap and fast, but small; or (3) large and fast, but expensive. The last of the three is further limited by physical constraints. A large-capacity memory that is very fast is also physically large, and speed-of-light delays place a limit on the speed of such a memory system.
The latency (L) of the memory is the delay from when the processor first requests a word from memory until that word arrives and is available for use by the processor. The latency of a memory system is one attribute of performance. The other is bandwidth (BW), which is the rate at which information can be transferred from the memory system. The bandwidth and the latency are related. If R is the number of requests that the memory can service simultaneously, then:

    BW = R / L    (88.1)

Doug Burger
University of Wisconsin-Madison
James R. Goodman
University of Wisconsin-Madison
Gurindar S. Sohi
University of Wisconsin-Madison
From Eq. (88.1) we see that a decrease in the latency will result in an increase in bandwidth, and vice versa, if R is unchanged. We can also see that the bandwidth can be increased by increasing R, if L does not increase proportionately. For example, we can build a memory system that takes 20 ns to service the access of a single 32-bit word. Its latency is 20 ns per 32-bit word, and its bandwidth is

    32 bits / (20 × 10⁻⁹ s)

or 200 Mbytes/s. If the memory system is modified to accept a new (still 20 ns) request for a 32-bit word every 5 ns by overlapping results, then its bandwidth is

    32 bits / (5 × 10⁻⁹ s)

or 800 Mbytes/s. This memory system must be able to handle four requests at a given time.
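The arithmetic above can be sketched directly from Eq. (88.1); the helper name below is ours, not the chapter's.

```python
# Peak bandwidth as BW = R / L (Eq. 88.1), scaled to bytes per second.

def bandwidth_bytes_per_sec(requests_in_flight: int, latency_sec: float,
                            word_bytes: int = 4) -> float:
    """Bandwidth of a memory that can service R requests at once."""
    return requests_in_flight * word_bytes / latency_sec

# One 32-bit word every 20 ns (R = 1): 200 Mbytes/s.
print(bandwidth_bytes_per_sec(1, 20e-9))

# A new request accepted every 5 ns means four overlapped 20 ns requests
# (R = 4): 800 Mbytes/s.
print(bandwidth_bytes_per_sec(4, 20e-9))
```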
Building an ideal memory system (infinite capacity, zero latency and infinite bandwidth, with affordable cost) is not feasible. The challenge is, given a set of cost and technology constraints, to engineer a memory system whose abilities match the abilities that the processor demands of it. That is, engineering a memory system that performs as close to an ideal memory system (for the given processing unit) as is possible. For a processor that stalls when it makes a memory request (some current microprocessors are in this category), it is important to engineer a memory system with the lowest possible latency. For those processors that can handle multiple outstanding memory requests (vector processors and high-end CPUs), it is important not only to reduce latency, but also to increase bandwidth (over what is possible by latency reduction alone) by designing a memory system that is capable of servicing multiple requests simultaneously.
Memory hierarchies provide decreased average latency and reduced bandwidth requirements, whereas parallel or interleaved memories provide higher bandwidth.
88.2 Memory Hierarchies
Technology does not permit memories that are cheap, large, and fast. By recognizing the nonrandom nature of memory requests, and emphasizing the average rather than worst-case latency, it is possible to implement a hierarchical memory system that performs well. A small amount of very fast memory, placed in front of a large, slow memory, can be designed to satisfy most requests at the speed of the small memory. This, in fact, is the primary motivation for the use of registers in the CPU: in this case, the programmer or compiler makes sure that the most commonly accessed variables are allocated to registers.
A variety of techniques, employing either hardware, software, or a combination of the two, can be employed to assure that most memory references are satisfied by the faster memory. The foremost of these techniques is the exploitation of the locality of reference principle. This principle captures the fact that some memory locations are referenced much more frequently than others. Spatial locality is the property that an access to a given memory location greatly increases the probability that neighboring locations will soon be accessed. This is largely, but not exclusively, a result of the tendency to access memory locations sequentially. Temporal locality is the property that an access to a given memory location greatly increases the probability that the same location
FIGURE 88.1 The memory interface.
will be accessed again soon. This is largely, but not exclusively, a result of the frequency of looping behavior of programs. Particularly for temporal locality, a good predictor of the future is the past: the longer a variable has gone unreferenced, the less likely it is to be accessed soon.
Figure 88.2 depicts a common construction of a memory hierarchy. At the top of the hierarchy are the CPU registers, which are small and extremely fast. The next level down in the hierarchy is a special, high-speed semiconductor memory known as a cache memory. The cache can actually be divided into multiple distinct levels; most current systems have between one and three levels of cache. Some of the levels of cache may be on the CPU chip itself, they may be on the same module as the CPU, or they may all be entirely distinct. Below the cache is the conventional memory, referred to as main memory, or backing storage. Like a cache, main memory is semiconductor memory, but it is slower, cheaper, and denser than a cache. Below the main memory is the virtual memory, which is generally stored on magnetic or optical disk. Accessing the virtual memory can be tens of thousands of times slower than accessing the main memory because it involves moving mechanical parts.
As requests go deeper into the memory hierarchy, they encounter levels that are larger (in terms of capacity) and slower than the higher levels (moving left to right in Fig. 88.2). In addition to size and speed, the bandwidth between adjacent levels in the memory hierarchy is smaller for the lower levels. The bandwidth between the registers and top cache level, for example, is higher than that between cache and main memory or main memory and virtual memory. Since each level presumably intercepts a fraction of the requests, the bandwidth to the level below need not be as great as that to the intercepting level.
A useful performance parameter is the effective latency. If the needed word is found in a level of the hierarchy, it is a hit; if a request must be sent to the next lower level, the request is said to miss. If the latency L_HIT is known in the case of a hit and the latency in the case of a miss is L_MISS, the effective latency for that level in the hierarchy can be determined from the hit ratio (H), the fraction of memory accesses that are hits:

    L_average = L_HIT × H + L_MISS × (1 - H)    (88.2)

The portion of memory accesses that miss is called the miss ratio (M = 1 - H). The hit ratio is strongly influenced by the program being executed, but is largely independent of the ratio of cache size to memory size. It is not uncommon for a cache with a capacity of a few thousand bytes to exhibit a hit ratio greater than 90%.
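Equation (88.2) can be sketched in a few lines; the numbers in the example are illustrative, not from the text.

```python
# Effective (average) latency of one level of the hierarchy, Eq. (88.2).

def effective_latency(l_hit: float, l_miss: float, hit_ratio: float) -> float:
    """L_average = L_HIT * H + L_MISS * (1 - H)."""
    return l_hit * hit_ratio + l_miss * (1.0 - hit_ratio)

# e.g. a 2 ns cache in front of a 50 ns main memory, with a 90% hit ratio:
print(effective_latency(2.0, 50.0, 0.90))  # 6.8 ns
```

Note how strongly the miss term dominates: even at a 90% hit ratio, most of the average latency comes from the 10% of accesses that miss.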
88.3 Cache Memories
The basic unit of construction of a semiconductor memory system is a module or bank. A memory bank, constructed from several memory chips, can service a single request at a time. The time that a bank is busy servicing a request is called the bank busy time. The bank busy time limits the bandwidth of a memory bank.
FIGURE 88.2 A memory hierarchy.
Both caches and main memories are constructed in this fashion, although caches have significantly shorter bank busy times than do main memory banks.
The hardware can dynamically allocate parts of the cache memory for addresses deemed most likely to be accessed soon. The cache contains only redundant copies of the address space. The cache memory is associative, or content-addressable. In an associative memory, the address of a memory location is stored, along with its content. Rather than reading data directly from a memory location, the cache is given an address and responds by providing data which may or may not be the data requested. When a cache miss occurs, the memory access is then performed with respect to the backing storage, and the cache is updated to include the new data.
The cache is intended to hold the most active portions of the memory, and the hardware dynamically selects portions of main memory to store in the cache. When the cache is full, bringing in new data must be matched by deleting old data. Thus a strategy for cache management is necessary. Cache management strategies exploit the principle of locality. Spatial locality is exploited by the choice of what is brought into the cache. Temporal locality is exploited in the choice of which block is removed. When a cache miss occurs, hardware copies a large, contiguous block of memory into the cache, which includes the word requested. This fixed-size region of memory, known as a cache line or block, may be as small as a single word, or up to several hundred bytes.
A block is a set of contiguous memory locations, the number of which is usually a power of two. A block is said to be aligned if the lowest address in the block is exactly divisible by the block size. That is to say, for a block of size B beginning at location A, the block is aligned if

    A modulo B = 0    (88.3)

Conventional caches require that all blocks be aligned.
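The alignment test of Eq. (88.3) is a one-line check; the addresses below are arbitrary examples.

```python
# Eq. (88.3): a block of size B starting at address A is aligned when
# A modulo B == 0. B is a power of two, as the text assumes.

def is_aligned(addr: int, block_size: int) -> bool:
    return addr % block_size == 0

print(is_aligned(0x1000, 64))  # True:  0x1000 is a multiple of 64
print(is_aligned(0x1010, 64))  # False: 0x1010 mod 64 = 16
```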
When a block is brought into the cache, it is likely that another block must be evicted. The selection of the evicted block is based on some attempt to capture temporal locality. Since prescience is so difficult to achieve, other methods are generally used to predict future memory accesses. A least-recently-used (LRU) policy is often the basis for the choice. Other replacement policies are sometimes used, particularly because true LRU replacement requires extensive logic and bookkeeping.
The cache often comprises two conventional memories: the data memory and the tag memory, shown in Fig. 88.3. The address of each cache line contained in the data memory is stored in the tag memory, as well as other information (state information), particularly the fact that a valid cache line is present. The state also keeps track of which cache lines the processor has modified. Each line contained in the data memory is allocated a corresponding entry in the tag memory to indicate the full address of the cache line.
FIGURE 88.3 Components of a cache memory. (Source: Adapted from M. D. Hill, "A case for direct-mapped caches," IEEE Computer, 21(12), 27, 1988.)
The requirement that the cache memory be associative (content-addressable) complicates the design. Addressing data by content is inherently more complicated than by its address. All the tags must be compared concurrently, of course, because the whole point of the cache is to achieve low latency. The cache can be made simpler, however, by introducing a mapping of memory locations to cache cells. This mapping limits the number of possible cells in which a particular line may reside. The extreme case is known as direct mapping, in which each memory location is mapped to a single location in the cache. Direct mapping makes many aspects of the design simpler, since there is no choice of where the line might reside, and no choice as to which line must be replaced. However, direct mapping can result in poor utilization of the cache when two memory locations are alternately accessed and must share a single cache cell.
A hashing algorithm is used to determine the cache address from the memory address. The conventional mapping algorithm consists of a function of the form

    A_cache = A_memory modulo (cache size / cache line size)    (88.4)

where A_cache is the address within the cache for main memory location A_memory, cache size is the capacity of the cache in addressable units (usually bytes), and cache line size is the size of the cache line in addressable units. Since the hashing function is simple bit selection, the tag memory need only contain the part of the address not implied by the hashing function. That is,

    A_tag = A_memory div (size of cache)    (88.5)

where A_tag is stored in the tag memory and div is the integer divide operation. In testing for a match, the complete address of a line stored in the cache can be inferred from the tag and its storage location within the cache.
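A small sketch of the bit-selection mapping. The sizes and helper names are ours, and Eq. (88.4) is read here at cache-line granularity, i.e. as selecting the line's slot in a direct-mapped cache, which is what simple bit selection computes.

```python
# Direct-mapped index and tag computation in the spirit of Eqs. (88.4)
# and (88.5). Sizes are illustrative assumptions.

CACHE_SIZE = 2048    # cache size, bytes
LINE_SIZE = 32       # cache line size, bytes

def cache_line_slot(a_memory: int) -> int:
    """Line slot: (A div line size) modulo (cache size / cache line size)."""
    return (a_memory // LINE_SIZE) % (CACHE_SIZE // LINE_SIZE)

def tag_of(a_memory: int) -> int:
    """A_tag = A_memory div (size of cache), Eq. (88.5)."""
    return a_memory // CACHE_SIZE

a = 0x12345
print(cache_line_slot(a), tag_of(a))  # 26 36
```

Because the divisors are powers of two, both operations reduce in hardware to selecting bit fields of the address, which is why the text calls the hash "simple bit selection".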
A two-way set-associative cache maps each memory location into either of two locations in the cache and can be constructed essentially as two identical direct-mapped caches. However, both caches must be searched at each memory access and the appropriate data selected and multiplexed on a tag match (hit). On a miss, a choice must be made between the two possible cache lines as to which is to be replaced. A single LRU bit can be saved for each such pair of lines to remember which line has been accessed more recently. This bit must be toggled to the current state each time either of the cache lines is accessed.
In the same way, an M-way associative cache maps each memory location into any of M memory locations in the cache and can be constructed from M identical direct-mapped caches. The problem of maintaining the LRU ordering of M cache lines quickly becomes hard, however, since there are M! possible orderings, and therefore it takes at least

    log2(M!)    (88.6)

bits to store the ordering. In practice, this requirement limits true LRU replacement to 3- or 4-way set associativity.
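The growth of the bound in Eq. (88.6) can be checked directly:

```python
# Minimum state for true LRU over M ways: ceil(log2(M!)) bits per set.

import math

def lru_bits(ways: int) -> int:
    return math.ceil(math.log2(math.factorial(ways)))

for m in (2, 3, 4, 8):
    print(m, lru_bits(m))  # 2 -> 1, 3 -> 3, 4 -> 5, 8 -> 16
```

The factorial growth is why true LRU bookkeeping becomes impractical beyond small associativities, as the text notes.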
Figure 88.4 shows how a cache is organized into sets, blocks, and words. The cache shown is a 2-Kbyte, 4-way set-associative cache, with 16 sets. Each set consists of four blocks. The cache block size in this example is 32 bytes, so each block contains eight 4-byte words. Also depicted at the bottom of Fig. 88.4 is a 4-way interleaved main memory system (see the next section for details). Each successive word in the cache block maps into a different main memory bank. Because of the cache's mapping restrictions, each cache block obtained from main memory will be loaded into its corresponding set, but may appear anywhere within that set.
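For the cache of Fig. 88.4 (2 Kbytes, 4-way set-associative, 16 sets, 32-byte blocks), an address splits into tag, set index, and block offset; this sketch uses the figure's geometry, with a made-up example address.

```python
# Tag / set / offset decomposition for a 16-set cache with 32-byte blocks.

BLOCK_SIZE = 32    # bytes -> 5 offset bits
NUM_SETS = 16      #       -> 4 set-index bits

def split_address(addr: int):
    offset = addr % BLOCK_SIZE
    set_index = (addr // BLOCK_SIZE) % NUM_SETS
    tag = addr // (BLOCK_SIZE * NUM_SETS)
    return tag, set_index, offset

print(split_address(0x1234))  # (9, 1, 20)
```

The set index names the set the block must go to; the 4-way associativity means the block may then occupy any of the four ways within that set.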
Write operations require special handling in the cache. If the main memory copy is updated with each write operation (a technique known as write-through or store-through) the writes may force operations to stall while the write operations are completing. This can happen after a series of write operations even if the processor is allowed to proceed before the write to the memory has completed. If the main memory copy is not updated

with each write operation (a technique known as write-back or copy-back or deferred writes) the main memory locations become stale, that is, memory no longer contains the correct values and must not be relied upon to provide data. This is generally permissible, but care must be exercised to make sure that main memory is always updated before the line is purged from the cache and that the cache is never bypassed. Such a bypass could occur with DMA (direct memory access), in which the I/O system writes directly into main memory without the involvement of the processor.
Even for a system that implements write-through, care must be exercised if memory requests bypass the cache. While the main memory is never stale, a write that bypasses the cache, such as from I/O, could have the effect of making the cached copy stale. A later access by the CPU could then provide an incorrect value. This can only be avoided by making sure that cached entries are invalidated even if the cache is bypassed. The problem is relatively easy to solve for a single processor with I/O, but becomes very difficult to solve for multiple processors, particularly so if multiple caches are involved as well. This is known in general as the cache coherence or consistency problem.
The cache exploits spatial locality by loading an entire cache line after a miss. This tends to result in bursty traffic to the main memory, since most accesses are filtered out by the cache. After a miss, however, the memory system must provide an entire line at once. Cache memory nicely complements an interleaved, high-bandwidth main memory (described in the next section), since a cache line can be interleaved across many banks in a regular manner, thus avoiding memory conflicts, and thus can be loaded rapidly into the cache. The example main memory shown in Fig. 88.4 can provide an entire cache line with two parallel memory accesses.
Conventional caches traditionally could not accept requests while they were servicing a miss request. In other words, they locked up or blocked when servicing a miss. The growing penalty for cache misses has made it necessary for high-end commodity memory systems to continue to accept (and service) requests from the processor while a miss is being serviced. Some systems are able to service multiple miss requests simultaneously. To allow this mode of operation, the cache design is lockup-free or non-blocking [Kroft, 1981]. Lockup-free caches have one structure for each simultaneous outstanding miss that they can service. This structure holds the information necessary to correctly return the loaded data to the processor, even if the misses come back in a different order than that in which they were sent.
Two factors drive the existence of multiple levels of cache memory in the memory hierarchy: access times and a limited number of transistors on the CPU chip. Larger banks with greater capacity are slower than smaller banks. If the time needed to access the cache limits the clock frequency of the CPU, then the first-level cache size may need to be constrained. Much of the benefit of a large cache may be obtained by placing a small first-level cache above a larger second-level cache; the first is accessed quickly and the second holds more data close to the processor. Since many modern CPUs have caches on the CPU chip itself, the size of the cache is limited by the CPU silicon real estate. Some CPU designers have assumed that system designers will add large off-chip caches to the one or two levels of caches on the processor chip. The complexity of this part of the memory hierarchy may continue to grow as main memory access penalties continue to increase.
FIGURE 88.4 Organization of a cache.
Caches that appear on the CPU chip are manufactured by the CPU vendor. Off-chip caches, however, are a commodity part sold in large volume. An incomplete list of major cache manufacturers is Hitachi, IBM Micro, Micron, Motorola, NEC, Samsung, SGS-Thomson, Sony, and Toshiba. Although most personal computers and all major workstations now contain caches, very high-end machines (such as multi-million dollar supercomputers) do not usually have caches. These ultra-expensive computers can afford to implement their main memory in a comparatively fast semiconductor technology such as static RAM (SRAM), and can afford so many banks that cacheless bandwidth out of the main memory system is sufficient. Massively parallel processors (MPPs), however, are often constructed out of workstation-like nodes to reduce cost. MPPs therefore contain cache hierarchies similar to those found in the workstations on which the nodes of the MPPs are based.
Cache sizes have been steadily increasing on personal computers and workstations. Intel Pentium-based personal computers come with 8 Kbyte each of instruction and data caches. Two of the Pentium chip sets, manufactured by Intel and OPTi, allow level-two caches ranging from 256 to 512 Kbyte and 64 Kbyte to 2 Mbyte, respectively. The newer Pentium Pro systems also have 8-Kbyte, first-level instruction and data caches, but they also have either a 256-Kbyte or a 512-Kbyte second-level cache on the same module as the processor chip. Higher-end workstations, such as DEC Alpha 21164-based systems, are configured with substantially more cache. The 21164 also has 8-Kbyte, first-level instruction and data caches. Its second-level cache is entirely on-chip, and is 96 Kbyte. The third-level cache is off-chip, and can have a size ranging from 1 to 64 Mbyte.
For all desktop machines, cache sizes are likely to continue to grow, although the rate of growth compared to processor speed increases and main memory size increases is unclear.
88.4 Parallel and Interleaved Memories
Main memories are composed of a series of semiconductor memory chips. A number of these chips, like caches, form a bank. Multiple memory banks can be connected together to form an interleaved (or parallel) memory system. Since each bank can service a request, an interleaved memory system with K banks can service K requests simultaneously, increasing the peak bandwidth of the memory system to K times the bandwidth of a single bank. In most interleaved memory systems, the number of banks is a power of two, that is, K = 2^k. An n-bit memory word address is broken into two parts: a k-bit bank number and an m-bit address of a word within a bank. Though the k bits used to select a bank number could be any k bits of the n-bit word address, typical interleaved memory systems use the low-order k address bits to select the bank number; the higher-order m = n - k bits of the word address are used to access a word in the selected bank. The reason for using the low-order k bits will be discussed shortly. An interleaved memory system which uses the low-order k bits to select the bank is referred to as a low-order or a standard interleaved memory.
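The low-order address split described above can be sketched as follows, with K = 4 banks as an illustrative choice:

```python
# Low-order interleaving: with K = 2**k banks, the low k bits of an n-bit
# word address select the bank; the remaining m = n - k bits select the
# word within that bank.

K_BANKS = 4    # so k = 2

def split_word_address(addr: int, banks: int = K_BANKS):
    bank = addr % banks            # low-order k bits
    within_bank = addr // banks    # high-order m bits
    return bank, within_bank

# Consecutive word addresses rotate through the banks:
for a in range(8):
    print(a, split_word_address(a))
```

Since consecutive addresses differ only in the low bits, they land in different banks, which is exactly the property the next paragraphs exploit for cache-block fetches and vector accesses.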
There are two ways of connecting multiple memory banks: simple interleaving and complex interleaving. Sometimes simple interleaving is also referred to as interleaving, and complex interleaving as banking.
Figure 88.5 shows the structure of a simple interleaved memory system. m address bits are simultaneously supplied to every memory bank. All banks are also connected to the same read/write control line (not shown in Fig. 88.5). For a read operation, the banks start the read operation and deposit the data in their latches. Data can then be read from the latches, one by one, by setting the switch appropriately. Meanwhile, the banks could be accessed again, to carry out another read or write operation. For a write operation, the latches are loaded, one by one. When all the latches have been written, their contents can be written into the memory banks by supplying m bits of address (they will be written into the same word in each of the different banks). In a simple interleaved memory, all banks are cycled at the same time; each bank starts and completes its individual operations at the same time as every other bank; a new memory cycle can start (for all banks) once the previous cycle is complete. Timing details of the accesses can be found in The Architecture of Pipelined Computers [Kogge, 1981].
One use of a simple interleaved memory system is to back up a cache memory. To do so, the memory must be able to read blocks of contiguous words (a cache block) and supply them to the cache. If the low-order k bits of the address are used to select the bank number, then consecutive words of the block reside in different banks, and they can all be read in parallel, and supplied to the cache one by one. If some other address bits are used for bank selection, then multiple words from the block might fall in the same memory bank, requiring multiple accesses to the same bank to fetch the block.
Figure 88.6 shows the structure of a complex interleaved memory system. In such a system, each bank is set up to operate on its own, independent of the operation of the other banks. For example, bank 1 could carry out a read operation on a particular memory address, while bank 2 carries out a write operation on a completely unrelated memory address. (Contrast this with the operation in a simple interleaved memory where all banks are carrying out the same operation, read or write, and the locations accessed within each bank represent a contiguous block of memory.) Complex interleaving is accomplished by providing an address latch and a read/write command line for each bank. The memory controller handles the overall operation of the interleaved memory. The processing unit submits the memory request to the memory controller, which determines which bank needs to be accessed. The controller then determines if the bank is busy (by monitoring a busy line for
FIGURE 88.5 A simple interleaved memory system. (Source: Adapted from P. M. Kogge, The Architecture of Pipelined Computers, 1st ed., New York: McGraw-Hill, 1981, p. 41.)
FIGURE 88.6 A complex interleaved memory system. (Source: Adapted from P. M. Kogge, The Architecture of Pipelined Computers, 1st ed., New York: McGraw-Hill, 1981, p. 42.)
each bank). The controller holds the request if the bank is busy, submitting it later when the bank is available to accept the request. When the bank responds to a read request, the switch is set by the controller to accept the request from the bank and forward it to the processing unit. Details of the timing of accesses can be found in The Architecture of Pipelined Computers [Kogge, 1981].
A typical use of a complex interleaved memory system is in a vector processor. In a vector processor, the processing units operate on a vector, for example a portion of a row or a column of a matrix. If consecutive elements of a vector are present in different memory banks, then the memory system can sustain a bandwidth of one element per clock cycle. By arranging the data suitably in memory and using standard interleaving (for example, storing the matrix in row-major order will place consecutive elements in consecutive memory banks), the vector can be accessed at the rate of one element per clock cycle as long as the number of banks is greater than the bank busy time.
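The effect of data layout on bank conflicts can be sketched as follows; the bank count and strides are illustrative assumptions.

```python
# Which banks a strided vector access touches, with low-order interleaving.

NUM_BANKS = 8

def banks_touched(start: int, stride: int, count: int):
    return [(start + i * stride) % NUM_BANKS for i in range(count)]

# Unit stride (row-major access): successive elements hit successive banks.
print(banks_touched(0, 1, 8))  # [0, 1, 2, 3, 4, 5, 6, 7]

# Stride equal to the number of banks (e.g. a column of an 8-wide matrix):
# every access hits the same bank and must wait out the bank busy time.
print(banks_touched(0, 8, 8))  # [0, 0, 0, 0, 0, 0, 0, 0]
```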
Memory systems that are built for current machines vary widely, the price and purpose of the machine being the main determinant of the memory system design. The actual memory chips, which are the components of the memory systems, are generally commodity parts built by a number of manufacturers. The major commodity DRAM manufacturers include (but certainly are not limited to) Hitachi, Fujitsu, LG Semicon, NEC, Oki, Samsung, Texas Instruments, and Toshiba.
The low end of the price/performance spectrum is the personal computer, presently typified by Intel Pentium systems. Three of the manufacturers of Pentium-compatible chip sets (which include the memory controllers) are Intel, OPTi, and VLSI Technologies. Their controllers provide for memory systems that are simply interleaved, all with minimum bank depths of 256 Kbyte, and maximum system sizes of 192 Mbyte, 128 Mbyte, and 1 Gbyte, respectively.
Both higher-end personal computers and workstations tend to have more main memory than the lower-end systems, although they usually have similar upper limits. Two examples of such systems are workstations built with the DEC Alpha 21164, and servers built with the Intel Pentium Pro. The Alpha systems, using the 21171 chip set, are limited to 128 Mbyte of main memory using 16-Mbit DRAMs, although they will be expandable to 512 Mbyte when 64-Mbit DRAMs are available. Their memory systems are 8-way simply interleaved, providing 128 bits per DRAM access. The Pentium Pro systems support slightly different features. The 82450KX and 82450GX chip sets include memory controllers that allow reads to bypass writes (performing writes when the memory banks are idle). These controllers can also buffer eight outstanding requests simultaneously. The 82450KX controller permits 1- or 2-way interleaving, and up to 256 Mbyte of memory when 16-Mbit DRAMs are used. The 82450GX chip set is more aggressive, allowing up to four separate (complex-interleaved) memory controllers, each of which can be up to 4-way interleaved and have up to 1 Gbyte of memory (again with 16-Mbit DRAMs).
Interleaved memory systems found in high-end vector supercomputers are slight variants on the basic complex interleaved memory system of Fig. 88.6. Such memory systems may have hundreds of banks, with multiple memory controllers that allow multiple independent memory requests to be made every clock cycle. Two examples of modern vector supercomputers are the Cray T-90 series and the NEC SX series. The Cray T-90 models come with varying numbers of processors, up to 32 in the largest configuration. Each of these processors is coupled with 256 Mbyte of memory, split into 16 banks of 16 Mbyte each. The T-90 has complex interleaving among banks. The largest configuration (the T-932) has 32 processors, for a total of 512 banks and 8 Gbyte of main memory. The T-932 can provide a peak of 800 Gbyte/s bandwidth out of its memory system.
NEC's SX-4 product line, their most recent vector supercomputer series, has numerous models. Their largest single-node model (with one processor per node) contains 32 processors, with a maximum of 8 Gbyte of memory, and a peak bandwidth of 512 Gbyte/s out of main memory. Although the sizes of the memory systems are vastly different between workstations and vector machines, the techniques that both use to increase total bandwidth and minimize bank conflicts are similar.
88.5 Virtual Memory
Cache memory contains portions of the main memory in dynamically allocated cache lines. Since the data portion of the cache memory is itself a conventional memory, each line present in the cache has two addresses associated with it: its main memory address and its cache address. Thus, the main memory address of a word can be divorced from a particular storage location and abstractly thought of as an element in the address space. The use of a two-level hierarchy, consisting of main memory and a slower, larger disk storage device, evolved by making a clear distinction between the address space and the locations in memory. An address generated during the execution of a program is known as a virtual address, which must be translated to a physical address before it can be accessed in main memory. The total address space is simply an abstraction.
A virtual memory address is mapped to a physical address, which indicates the location in main memory where the data actually reside [Denning, 1970]. The mapping is maintained through a structure called the page table, which is maintained in software by the operating system. Like the tag memory of a cache memory, the page table is accessed through a virtual address to determine the physical (main memory) address of the entry. Unlike the tag memory, however, the table is usually sorted by virtual addresses, making the translation process a simple matter of an extra memory access to determine the real physical address of the desired item. A system maintaining the page table in a way analogous to a cache tag memory is said to have inverted page tables. In addition to the real address mapped to a virtual page, and an indication of whether the page is present at all, a page table entry often contains other information. For example, the page table may contain the location on the disk where each block of data is stored when not present in main memory.
The viitual memoiy can be thought of as a collection of blocks. These blocks aie often aligned and of fxed
size, in which case they aie known as ages. Pages aie the unit of tiansfei between the disk and main memoiy,
and aie geneially laigei than a cache line-usually thousands of bytes. A typical page size foi machines in 1997
is 4 Kbyte. A page`s viitual addiess can be bioken into two paits, a viitual page numbei and an offset. The
page numbei specifes the page to be accessed, and the page offset indicates the distance fiom the beginning
of the page to the indicated addiess.
A physical address can also be broken into two parts, a physical page number (also called a page frame
number) and an offset. This mapping is done at the level of pages, so the page table can be indexed by means
of the virtual page number. The page frame number is contained in the page table and is read out during the
translation, along with other information about the page. In most implementations the page offset is the same
for a virtual address and the physical address to which it is mapped.
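As an illustrative sketch of this split-and-map process (the page table contents and the 4-Kbyte page size below are assumptions for the example, not values from any particular machine):

```python
# Sketch of virtual-to-physical translation with 4-Kbyte pages.
PAGE_SIZE = 4096
OFFSET_BITS = 12  # log2(PAGE_SIZE)

# Hypothetical page table: virtual page number -> page frame number.
page_table = {0x0: 0x123, 0x1: 0x456}

def translate(virtual_address):
    vpn = virtual_address >> OFFSET_BITS        # virtual page number
    offset = virtual_address & (PAGE_SIZE - 1)  # offset is unchanged by mapping
    if vpn not in page_table:
        raise LookupError("page fault")         # the OS would fetch the page
    return (page_table[vpn] << OFFSET_BITS) | offset
```

For example, `translate(0x1ABC)` keeps the offset `0xABC` and replaces virtual page `0x1` with frame `0x456`, yielding `0x456ABC`.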
The virtual memory hierarchy differs from the cache/main memory hierarchy in a number of respects,
resulting primarily from the fact that there is a much greater difference in latency between accesses to the disk
and the main memory. While a typical latency ratio for cache and main memory is one order of magnitude
(main memory has a latency ten times larger than the cache), the latency ratio between disk and main memory
is often four orders of magnitude or more. This large ratio exists because the disk is a mechanical device, with
a latency partially determined by velocity and inertia, whereas main memory is limited only by electronic and
energy constraints. Because of the much larger penalty for a page miss, many design decisions are affected by
the need to minimize the frequency of misses. When a miss does occur, the processor could be idle for a period
during which it could execute tens of thousands of instructions. Rather than stall during this time, as may
occur upon a cache miss, the processor invokes the operating system and may switch to a different task. Because
the operating system is being invoked anyway, it is convenient to rely on the operating system to set up and
maintain the page table, unlike cache memory, where this is done entirely in hardware. The fact that this accounting
occurs in the operating system enables the system to use virtual memory to enforce protection on the memory.
This ensures that no program can corrupt the data in memory that belong to any other program.
Hardware support provided for a virtual memory system generally includes the ability to translate the virtual
addresses provided by the processor into the physical addresses needed to access main memory. Thus, only on
a virtual address miss is the operating system invoked. An important aspect of a computer implementing virtual
memory, however, is the necessity of freezing the processor at the point where a miss occurs, servicing the page
table fault, and later returning to continue the execution as if no page fault had occurred. This requirement
means either that it must be possible to halt execution at any point, including possibly in the middle of a
complex instruction, or that it must be possible to guarantee that all memory accesses will be to pages resident in
main memory.
As described above, virtual memory requires two memory accesses to fetch a single entry from memory: one
into the page table to map the virtual address into the physical address, and a second to fetch the actual data.
This process can be sped up in a variety of ways. First, a special-purpose cache memory to store the active
portion of the page table can be used to speed up the first access. This special-purpose cache is usually called
a translation lookaside buffer (TLB). Second, if the system also employs a cache memory, it may be possible to
overlap the access of the cache memory with the access to the TLB, ideally allowing the requested item to be
accessed in a single cache access time. The two accesses can be fully overlapped if the virtual address supplies
sufficient information to fetch the data from the cache before the virtual-to-physical address translation has
been accomplished. This is true for an M-way set-associative cache of capacity C if the following relationship
holds:

C/M <= Page size    (88.7)
For such a cache, the index into the cache can be determined strictly from the page offset. Since the virtual
page offset is identical to the physical page offset, no translation is necessary, and the cache can be accessed
concurrently with the TLB. The physical address must be obtained before the tag can be compared.
An alternative method applicable to a system containing both virtual memory and a cache is to store the
virtual address in the tag memory instead of the physical address. This technique introduces consistency
problems in virtual memory systems that permit more than a single address space or allow a single physical
page to be mapped to more than a single virtual page. This problem is known as the aliasing problem.
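The condition in Eq. (88.7) can be checked for a few configurations; the cache sizes below are illustrative assumptions, not examples from the text:

```python
# Eq. (88.7): the cache index fits within the page offset when C/M <= page size,
# i.e., each way of the cache is no larger than one page.
def can_overlap(capacity_bytes, ways, page_size=4096):
    """True if an M-way cache of capacity C can be indexed by page offset alone."""
    return capacity_bytes // ways <= page_size

# 16-Kbyte, 4-way cache: 16384/4 = 4096 <= 4096, so overlap is possible.
# 32-Kbyte, 2-way cache: 32768/2 = 16384 > 4096, so translation must finish first.
```

This is why several processors of the era paired larger caches with higher associativity: raising M keeps C/M within the page size as C grows.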
88.6 Research Issues
Research is occurring on all levels of the memory hierarchy. At the register level, researchers are exploring
techniques to provide more registers than are architecturally visible to the compiler. A large volume of work
exists (and is occurring) for cache optimizations and alternate cache organizations. For instance, modern
processors now commonly split the top level of the cache into separate physical caches, one for instructions
(code) and one for program data. Due to the increasing cost of cache misses (in terms of processor cycles),
some research trades off increased cache complexity for a reduced miss rate. Two examples of cache
research from opposite ends of the hardware/software spectrum are blocking [Lam, 1991] and skewed-associative
caches [Seznec, 1993]. Blocking is a software technique in which the programmer or compiler reorganizes
algorithms to work on subsets of data that are smaller than the cache, instead of streaming entire large data
structures repeatedly through the cache. This reorganization greatly improves temporal locality. The skewed-associative
cache is one example of a host of hardware techniques that map blocks into the cache differently,
with the goal of reducing misses from conflicts within a set. In skewed-associative caches, either one of two
hashing functions may determine where a block should be placed in the cache, as opposed to just the one
hashing function (low-order index bits) that traditional caches use. An important cache-related research topic
is prefetching [Mowry, 1992], in which the processor issues requests for data well before the data are actually
needed. Speculative prefetching is also a current research topic. In speculative prefetching, prefetches are issued
based on guesses as to which data will be needed soon. Other cache-related research examines placing special
structures in parallel with the cache, trying to optimize for workloads that do not lend themselves well to
caches. Stream buffers [Jouppi, 1990] are one such example. A stream buffer automatically detects when a
linear access through a data structure is occurring. The stream buffer issues multiple sequential prefetches upon
detection of a linear array access.
FIGURE 88.7 Virtual-to-real address translation.
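The blocking technique described above can be sketched with a minimal tiled matrix multiply; the tile size B is an arbitrary illustrative choice, and in practice it would be tuned to the cache capacity:

```python
# Blocked (tiled) matrix multiply: each B x B tile of the operands is reused
# many times while it is still resident in the cache, improving temporal
# locality compared with streaming whole rows and columns through the cache.
def matmul_blocked(A, X, n, B=2):
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, B):
        for kk in range(0, n, B):
            for jj in range(0, n, B):
                # Work entirely within one set of cache-sized tiles.
                for i in range(ii, min(ii + B, n)):
                    for k in range(kk, min(kk + B, n)):
                        a_ik = A[i][k]
                        for j in range(jj, min(jj + B, n)):
                            C[i][j] += a_ik * X[k][j]
    return C
```

The result is numerically identical to an unblocked multiply; only the order in which the elements are touched changes.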
Much of the ongoing research on main memory involves improving the bandwidth from the memory system
without greatly increasing the number of banks. Multiple banks are expensive, particularly with the large and
growing capacity of modern DRAM chips. Rambus [Rambus Inc., 1992] and RamLink [IEEE Computer Society,
1993] are two such examples.
Research issues associated with improving the performance of the virtual memory system fall under the
domain of operating system research. One proposed strategy for reducing page faults allows each running
program to specify its own page replacement algorithm, enabling each program to optimize the choice of page
replacements based on its reference pattern [Engler et al., 1995]. Other recent research focuses on improving
the performance of the TLB. Two techniques for doing this are the use of a two-level TLB (the motivation is
similar to that for a two-level cache), and the use of superpages [Talluri, 1994]. With superpages, each TLB
entry may represent a mapping for more than one consecutive page, thus increasing the total address range
that a fixed number of TLB entries may cover.
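The payoff from superpages is easy to quantify with a back-of-the-envelope calculation; the 64-entry TLB and the 4-Kbyte/4-Mbyte page sizes below are illustrative assumptions:

```python
# Address range ("coverage") a TLB can map without taking a TLB miss.
def tlb_coverage(entries, page_size):
    return entries * page_size

base = tlb_coverage(64, 4 * 1024)         # 64 entries, 4-Kbyte pages
huge = tlb_coverage(64, 4 * 1024 * 1024)  # same TLB, 4-Mbyte superpages
# base covers 256 Kbyte; huge covers 256 Mbyte, a 1024x larger range.
```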
Summary
A computer's memory system is the repository for all the information that the CPU uses and produces. A
perfect memory system is one that can immediately supply any datum that the CPU requests. This ideal memory
is not implementable, however, as the three factors of memory capacity, speed, and cost are directly in opposition.
By staging smaller, faster memories in front of larger, slower, and cheaper memories, the performance of the
memory system may approach that of a perfect memory system, at a reasonable cost. The memory hierarchies
of modern general-purpose computers generally contain registers at the top, followed by one or more levels of
cache memory, main memory, and virtual memory on a magnetic or optical disk.
Performance of a memory system is measured in terms of latency and bandwidth. The latency of a memory
request is how long it takes the memory system to produce the result of the request. The bandwidth of a memory
system is the rate at which the memory system can accept requests and produce results. The memory hierarchy
improves average latency by quickly returning results that are found in the higher levels of the hierarchy. The
memory hierarchy generally reduces bandwidth requirements by intercepting a fraction of the memory requests
at higher levels of the hierarchy. Some machines, such as high-performance vector machines, may have fewer
levels in the hierarchy, increasing memory cost for better predictability and performance. Some of these
machines contain no caches at all, relying on large arrays of main memory banks to supply very high bandwidth,
with pipelined accesses to operands that mitigate the adverse performance impact of long latencies.
Cache memories are a general solution for improving the performance of a memory system. Although caches
are smaller than typical main memory sizes, they ideally contain the most frequently accessed portions of main
memory. By keeping the most heavily used data near the CPU, caches can service a large fraction of the requests
without accessing main memory (the fraction serviced is called the hit rate). Caches assume locality of reference
to work well transparently: they assume that accessed memory words will be accessed again quickly (temporal
locality), and that memory words adjacent to an accessed word will be accessed soon after the access in question
(spatial locality). When the CPU issues a request for a datum not in the cache (a cache miss), the cache loads
that datum and some number of adjacent data (a cache block) into itself from main memory.
To reduce cache misses, some caches are associative: a cache may place a given block in one of several places,
collectively called a set. This set is content-addressable; a block may or may not be accessed based on an address
tag, one of which is coupled with each block. When a new block is brought into a set and the set is full, the
cache's replacement policy dictates which of the old blocks should be removed from the cache to make room
for the new block. Most caches use an approximation of least-recently-used (LRU) replacement, in which the
block last accessed farthest in the past is the one that the cache replaces.
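A minimal sketch of true LRU replacement within a single set follows; it is a behavioral model for illustration, not a description of the hardware, which typically uses an approximation:

```python
from collections import OrderedDict

# One set of an M-way set-associative cache with true LRU replacement.
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()  # ordered least- to most-recently used

    def access(self, tag):
        """Return True on a hit, False on a miss (after filling the block)."""
        if tag in self.tags:
            self.tags.move_to_end(tag)     # mark as most recently used
            return True
        if len(self.tags) == self.ways:
            self.tags.popitem(last=False)  # evict the least recently used block
        self.tags[tag] = None              # bring the new block into the set
        return False
```

In a 2-way set, accessing A, B, A, C evicts B (the least recently used block), so a later access to B misses.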
Main memory, or backing store, consists of banks of dense semiconductor memory. Since each memory chip
has a small off-chip bandwidth, rows of these chips are placed together to form a bank, and multiple banks are
used to increase the total bandwidth from main memory. When a bank is accessed, it remains busy for a period
of time, during which the processor may make no other accesses to that bank. By increasing the number of
interleaved (parallel) banks, the chance that the processor issues two conflicting requests to the same bank is
reduced.
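Low-order interleaving, a common bank-assignment scheme, can be sketched as follows; the bank count of 8 is an illustrative assumption (real machines range from a handful of banks to the hundreds described earlier):

```python
NUM_BANKS = 8  # illustrative bank count

def bank_of(word_address):
    # Low-order interleaving: consecutive words fall in consecutive banks,
    # so a unit-stride stream keeps every bank busy in turn.
    return word_address % NUM_BANKS

sequential = [bank_of(a) for a in range(8)]          # touches banks 0..7
conflicting = [bank_of(a) for a in range(0, 64, 8)]  # stride 8: all bank 0
```

The second stream shows the classic pathology: a stride equal to the number of banks directs every access to the same busy bank, serializing the requests.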
Systems generally require a greater number of memory locations than are available in the main memory
(i.e., a larger address space). The entire address space that the CPU uses is kept on large magnetic or optical
disks; this is called the virtual address space, or virtual memory. The most frequently used sections of the virtual
memory are kept in main memory (physical memory), and are moved back and forth in units called pages.
The place at which a virtual address lies in main memory is called its physical address. Since a much larger
address space (virtual memory) is mapped onto a much smaller one (physical memory), the CPU must translate
the memory addresses issued by a program (virtual addresses) into their corresponding locations in physical
memory (physical addresses). This mapping is maintained in a memory structure called the page table. When
the CPU attempts to access a virtual address that does not have a corresponding entry in physical memory, a
page fault occurs. Since a page fault requires an access to a slow mechanical storage device (such as a disk), the
CPU usually switches to a different task while the needed page is read from the disk.
Every memory request issued by the CPU requires an address translation, which in turn requires an access
to the page table stored in memory. A translation lookaside buffer (TLB) is used to reduce the number of page
table lookups. The most frequent virtual-to-physical mappings are kept in the TLB, which is a small associative
memory tightly coupled with the CPU. If the needed mapping is found in the TLB, the translation is performed
quickly and no access to the page table needs to be made. Virtual memory allows systems to run larger or more
programs than are able to fit in main memory, enhancing the capabilities of the system.
Defining Terms
Bandwidth: The rate at which the memory system can service requests.
Cache memory: A small, fast, redundant memory used to store the most frequently accessed parts of the main memory.
Interleaving: Technique for connecting multiple memory modules together in order to improve the bandwidth of the memory system.
Latency: The time between the initiation of a memory request and its completion.
Memory hierarchy: Successive levels of different types of memory, which attempt to approximate a single large, fast, and cheap memory structure.
Virtual memory: A memory space implemented by storing the most frequently accessed parts in main memory and less frequently accessed parts on disk.
Related Topics
80.1 Integrated Circuits (RAM, ROM) • 80.2 Basic Disk System Architectures
References
P. J. Denning, "Virtual memory," Computing Surveys, vol. 2, no. 3, pp. 153-170, Sept. 1970.
D. R. Engler, M. F. Kaashoek, and J. O'Toole, Jr., "Exokernel: An operating system architecture for application-level resource management," Proc. 15th Symposium on Operating Systems Principles, pp. 251-266, 1995.
J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, San Mateo, Calif.: Morgan Kaufmann Publishers, 1990.
M. D. Hill, "A case for direct-mapped caches," IEEE Computer, 21(12), 1988.
IEEE Computer Society, IEEE Standard for High-Bandwidth Memory Interface Based on SCI Signaling Technology (RamLink), Draft 1.00 IEEE P1596.4-199X, 1993.
N. Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers," Proc. 17th Annual International Symposium on Computer Architecture, pp. 364-373, 1990.
P. M. Kogge, The Architecture of Pipelined Computers, New York: McGraw-Hill, 1981.
D. Kroft, "Lockup-free instruction fetch/prefetch cache organization," Proc. 8th Annual International Symposium on Computer Architecture, pp. 81-87, 1981.
M. S. Lam, E. E. Rothberg, and M. E. Wolf, "The cache performance and optimizations of blocked algorithms," Proc. 4th Annual Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 63-74, 1991.
T. C. Mowry, M. S. Lam, and A. Gupta, "Design and evaluation of a compiler algorithm for prefetching," Proc. 5th Annual Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 62-73, 1992.
Rambus, Inc., Rambus Architectural Overview, Mountain View, Calif.: Rambus, Inc., 1992.
A. Seznec, "A case for two-way skewed-associative caches," Proc. 20th International Symposium on Computer Architecture, pp. 169-178, 1993.
A. J. Smith, "Bibliography and readings on CPU cache memories and related topics," ACM SIGARCH Computer Architecture News, 14(1), 22-42, 1986.
A. J. Smith, "Second bibliography on cache memories," ACM SIGARCH Computer Architecture News, 19(4), 154-182, June 1991.
M. Talluri and M. D. Hill, "Surpassing the TLB performance of superpages with less operating system support," Proc. Sixth International Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 171-182, 1994.
Further Information
Some general information on the design of memory systems is available in High-Speed Memory Systems by
A. V. Pohm and O. P. Agrawal.
Computer Architecture: A Quantitative Approach by John Hennessy and David Patterson contains a detailed
discussion of the interaction between memory systems and computer architecture.
For information on memory system research, the recent proceedings of the International Symposium on
Computer Architecture contain annual research papers in computer architecture, many of which focus on the
memory system. To obtain copies, contact the IEEE Computer Society Press, 10662 Los Vaqueros Circle, P.O. Box
3014, Los Alamitos, CA 90720-1264.