
Lecture 15

Main Memory

Main Memory Background

Performance of Main Memory:


Latency: determines the cache miss penalty
Access Time (AT): time from when a request is issued until the word arrives
Cycle Time (CT): minimum time between successive requests

Bandwidth: matters for I/O and for large-block miss penalties (e.g., L2 blocks)

Main memory, organized as a 2D matrix of cells, is DRAM:


Dynamic: must be refreshed periodically (roughly every 8 ms)
AT and CT differ: AT < CT

The address is divided into two halves, multiplexed onto the memory's address pins:


RAS or Row Access Strobe
CAS or Column Access Strobe

Caches use SRAM:


No refresh needed (6 transistors/bit vs. 1 transistor/bit)
No difference between AT and CT: AT = CT

Address is not divided (no multiplexing)

Main Memory Background

Size ratio: DRAM/SRAM is about 4~8 (DRAM is denser)


Cost and cycle-time ratio: SRAM/DRAM is about 8~16
DRAM capacity grows 4x every 3 years, i.e., about 60% per year
RAS access time improves only about 7% per year

Main Memory Organization

Simple: CPU, cache, bus, and memory are all the same width (32 bits)

Wide: CPU/Mux is 1 word; Mux/cache, bus, and memory are N words wide (Alpha: 64 bits & 256 bits)

Interleaved: CPU, cache, and bus are 1 word wide; memory has N modules (4 modules in the figure), word-interleaved

[Figure: three organizations - one-word-wide memory (CPU - cache - bus - memory); wide memory (CPU - MUX - cache - bus - wide memory); interleaved memory (CPU - cache - bus - Banks 0, 1, 2, 3)]

Main Memory Performance


Timing model: 1 cycle to send the address, 6 cycles of access time, 1 cycle to send the data

Block access time, assuming a 4-word cache block:

Simple miss penalty      = 4 x (1 + 6 + 1) = 32 cycles
Wide miss penalty        = 1 + 6 + 1       = 8 cycles
Interleaved miss penalty = 1 + 6 + (4 x 1) = 11 cycles
[Figure: the interleaved case - one address sent to Banks 0-3, which return their words in consecutive cycles]
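
A minimal sketch (C) of the three miss-penalty calculations above, using the assumed timing parameters (1 cycle to send the address, 6 to access, 1 to send data, 4-word blocks):

#include <stdio.h>

/* Assumed timing parameters from the lecture's model. */
enum { SEND_ADDR = 1, ACCESS = 6, SEND_DATA = 1, BLOCK_WORDS = 4 };

int main(void) {
    int simple      = BLOCK_WORDS * (SEND_ADDR + ACCESS + SEND_DATA); /* 4 x 8 = 32 */
    int wide        = SEND_ADDR + ACCESS + SEND_DATA;                 /* 8          */
    int interleaved = SEND_ADDR + ACCESS + BLOCK_WORDS * SEND_DATA;   /* 11         */
    printf("simple = %d, wide = %d, interleaved = %d cycles\n",
           simple, wide, interleaved);
    return 0;
}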

Technique for Higher BW:

1. Wider Main Memory

Alpha AXP 21064 : 256-bit wide L2, Memory Bus, Memory


Drawbacks
expandability: doubling the width also doubles the minimum memory increment (capacity)

bus width: a multiplexer is needed between the cache and the CPU to pick the desired word out of a block

error correction: need separate error correction every 32 bits


otherwise, on a WRITE of one word: read the block -> modify the word -> calculate the
new ECC -> store the block back
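
A minimal sketch (C) of that ECC drawback, when one check code covers a whole wide block; compute_ecc() is a stand-in (plain XOR parity, not a real ECC), and the block width and names are assumptions for illustration:

#include <stdint.h>

#define WORDS_PER_BLOCK 8                /* e.g., a 256-bit block of 32-bit words */

typedef struct {
    uint32_t word[WORDS_PER_BLOCK];
    uint32_t ecc;                        /* one check code over the whole block */
} wide_block;

/* Stand-in check code: XOR parity over the block (not a real ECC). */
static uint32_t compute_ecc(const uint32_t *w, int n) {
    uint32_t p = 0;
    for (int i = 0; i < n; i++)
        p ^= w[i];
    return p;
}

/* On WRITE of one word: read block -> modify word -> calculate new ECC -> store. */
void write_word(wide_block *mem, int block, int word, uint32_t value) {
    wide_block b = mem[block];                        /* read the whole block */
    b.word[word] = value;                             /* modify one word      */
    b.ecc = compute_ecc(b.word, WORDS_PER_BLOCK);     /* recompute the ECC    */
    mem[block] = b;                                   /* store the block back */
}

With separate error correction every 32 bits, a word write touches only that word and its own check bits, so the extra block read disappears.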

Technique for Higher BW:

2. Interleaved Memory
Example: Interleaved Memory vs. Wide Memory
Consider the following description of a machine and its cache performance:

memory bus width = 1 word = 32 bits

block size (words)   1    2    4
miss rate (%)        3    2    1

memory accesses per instruction = 1.2
cache miss penalty = 8 cycles (1 + 6 + 1)
average CPI (ignoring cache misses) = 2

What is the improvement over the base machine (block size = 1) of 2-way and 4-way
interleaving versus doubling the width of the memory and the bus?

Interleaved Memory
Answer
CPI = base CPI + (memory refs/instr. x miss rate x miss penalty)
    = 2 + (1.2 x (0.03 for 1-word, 0.02 for 2-word, or 0.01 for 4-word blocks) x miss penalty)

CPI for the base machine (simple memory, BM):


2 + (1.2 x 0.03 x 8) = 2.288

2-word wide memory


32-bit bus and memory, no interleaving: 2 + (1.2 x 0.02 x (2 x 8))           = 2.384  (slower than BM)
32-bit bus and memory, interleaving:    2 + (1.2 x 0.02 x (1 + 6 + (2 x 1))) = 2.216  (faster than BM)
64-bit bus and memory, no interleaving: 2 + (1.2 x 0.02 x 8)                 = 2.192  (faster than BM)

4-word wide memory


32-bit bus and memory, no interleaving: 2 + (1.2 x 0.01 x (4 x 8))           = 2.384  (slower than BM)
32-bit bus and memory, interleaving:    2 + (1.2 x 0.01 x (1 + 6 + (4 x 1))) = 2.132  (faster than the 2-word case)
64-bit bus and memory, no interleaving: 2 + (1.2 x 0.01 x (2 x 8))           = 2.192  (same as the 2-word case)
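
A minimal sketch (C) that reproduces the CPI arithmetic above, with the lecture's assumptions hard-coded (base CPI = 2, 1.2 memory accesses per instruction, 8 cycles per access of one memory width):

#include <stdio.h>

/* CPI = base CPI + memory accesses/instr x miss rate x miss penalty */
static double cpi(double miss_rate, double miss_penalty) {
    return 2.0 + 1.2 * miss_rate * miss_penalty;
}

int main(void) {
    printf("base machine, 1-word blocks:       %.3f\n", cpi(0.03, 8));             /* 2.288 */
    printf("2-word blocks, 32-bit, no interl.: %.3f\n", cpi(0.02, 2 * 8));         /* 2.384 */
    printf("2-word blocks, 32-bit, interl.:    %.3f\n", cpi(0.02, 1 + 6 + 2 * 1)); /* 2.216 */
    printf("2-word blocks, 64-bit:             %.3f\n", cpi(0.02, 8));             /* 2.192 */
    printf("4-word blocks, 32-bit, no interl.: %.3f\n", cpi(0.01, 4 * 8));         /* 2.384 */
    printf("4-word blocks, 32-bit, interl.:    %.3f\n", cpi(0.01, 1 + 6 + 4 * 1)); /* 2.132 */
    printf("4-word blocks, 64-bit:             %.3f\n", cpi(0.01, 2 * 8));         /* 2.192 */
    return 0;
}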

Technique for Higher BW:

3. Independent Memory Banks

Interleaved memory: faster sequential accesses; independent memory banks: faster independent accesses

Motivation for interleaving: higher BW for sequential accesses by interleaving sequential addresses
across the banks - all banks share the address lines
Memory banks for independent accesses: each bank has its own bank controller and separate address lines
e.g., 1 bank serving I/O, 1 bank serving a cache read, 1 bank serving a cache write, etc.
If a single controller drives all the banks, it can only provide fast access time for one operation at a time
Memory banks also benefit miss-under-miss in non-blocking caches
Superbank: all the memory banks that are active on one block transfer
Bank: the portion within a superbank that is word interleaved
Address layout:
  | Superbank number | Superbank offset            |
                     | Bank number   | Bank offset |

Independent Memory Banks


How many banks?
For sequential accesses, a new bank delivers a word on each clock
So, for sequential accesses: number of banks >= number of clocks to access a word in a bank
Otherwise the access stream returns to a bank before it has the next word ready
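For example, with the 6-clock access time used in the earlier timing model, at least 6 banks would be needed to sustain one word per clock on a long sequential stream.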

Increasing the capacity of a DRAM chip => fewer chips are needed to build the


same-capacity memory system => harder to have many banks

Technique for Higher BW:

4. Avoiding Bank Conflicts


Even with many banks, bank conflicts still occur for certain regular access patterns
- e.g., a 256 x 512 array stored across 128 banks, processed column by column
(512 is an even multiple of 128, so successive elements of a column map to the same bank)

int x[256][512];
for (j = 0; j < 512; j = j + 1)
    for (i = 0; i < 256; i = i + 1)   /* inner loop walks down a column */
        x[i][j] = 2 * x[i][j];

The inner loop is column processing, which causes bank conflicts

[Figure: array elements distributed across Banks 0-127; all elements of one column fall in the same bank]

Avoiding Bank Conflicts


SW approaches (a sketch of both follows this list)
Loop interchange, so successive accesses do not hit the same bank
Declaring an array dimension that is not a power of 2 (the number of banks is a power of 2), so
that successive addresses map to different banks, i.e., the elements of a column are
spread across different banks
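
A minimal sketch (C) of these two software fixes, applied to the 256 x 512 array and 128 word-interleaved banks from the previous slide; the function names and the padding to 513 columns are illustrative assumptions:

int x[256][512];

/* 1. Loop interchange: the inner loop now walks along a row, so successive
 *    accesses go to successive banks instead of hitting the same bank. */
void scale_interchanged(void) {
    for (int i = 0; i < 256; i = i + 1)
        for (int j = 0; j < 512; j = j + 1)
            x[i][j] = 2 * x[i][j];
}

/* 2. Padding: declare 513 columns instead of 512, so successive elements of a
 *    column are 513 words apart and land in different banks
 *    (513 is not a multiple of 128; the extra column is never used). */
int y[256][513];

void scale_padded(void) {
    for (int j = 0; j < 512; j = j + 1)
        for (int i = 0; i < 256; i = i + 1)
            y[i][j] = 2 * y[i][j];
}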

HW: Prime number of banks


bank number = (address) MOD (number of banks)
address within bank = (address) / (number of banks)
To avoid computing a divide on every memory access, use instead:

address within bank = (address) MOD (number of words in bank)


e.g., 3 = (31) MOD (7)

How to compute bank number and address within bank?


Easy if both the number of banks and the words per bank are powers of 2 (just select address bits);
with a prime number of banks, the Chinese Remainder Theorem (next slide) shows the
MOD-based mapping is still unambiguous

Fast Bank Number


Chinese Remainder Theorem
As long as two sets of integers ai and bi follow these rules:

bi = (x) MOD (ai),  0 <= bi < ai,  0 <= x < a0 x a1 x a2 x ...


and ai and aj are co-prime whenever i != j, then the integer x has
only one solution for each tuple (b0, b1, ...) - an unambiguous mapping:
bank number = b0 = (x) MOD (a0);
number of banks = a0 (= 3 in the example), 0 <= b0 < a0
address within a bank = b1 = (x) MOD (a1);
size of a bank = a1 (= 8 in the example), 0 <= b1 < a1
N words get addresses 0 to N-1;
prime number of banks (3);
words per bank a power of 2 (8)

Fast Bank Numbers


Example: Address = 5, with 3 banks and 8 words per bank

                     Seq. Interleaved            Modulo Interleaved
Addr within bank     Bank 0  Bank 1  Bank 2      Bank 0  Bank 1  Bank 2
       0                0       1       2           0      16       8
       1                3       4       5           9       1      17
       2                6       7       8          18      10       2
       3                9      10      11           3      19      11
       4               12      13      14          12       4      20
       5               15      16      17          21      13       5
       6               18      19      20           6      22      14
       7               21      22      23          15       7      23

For address 5:
Bank number = (5) MOD (3) = 2
Seq. interleaved address within bank    = 5 / 3 = 1
Modulo interleaved address within bank  = (5) MOD (8) = 5
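
A minimal sketch (C) of the modulo-interleaved mapping above (bank number = address MOD 3, address within bank = address MOD 8); printing the table reproduces the Modulo Interleaved columns:

#include <stdio.h>

#define NUM_BANKS      3   /* prime number of banks        */
#define WORDS_PER_BANK 8   /* words per bank, a power of 2 */

int main(void) {
    int table[WORDS_PER_BANK][NUM_BANKS];
    for (int addr = 0; addr < NUM_BANKS * WORDS_PER_BANK; addr++) {
        int bank   = addr % NUM_BANKS;        /* bank number                   */
        int offset = addr % WORDS_PER_BANK;   /* address within bank           */
        table[offset][bank] = addr;           /* CRT: no two addresses collide */
    }
    for (int offset = 0; offset < WORDS_PER_BANK; offset++) {
        for (int bank = 0; bank < NUM_BANKS; bank++)
            printf("%4d", table[offset][bank]);
        printf("\n");
    }
    return 0;
}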

Technique for Higher BW:

5. DRAM Specific Interleaving

A DRAM access has two steps: Row Access (RAS) and Column Access (CAS)


Multiple accesses to the row (RAS) buffer go under several names, e.g., page mode

64 Mbit DRAM: cycle time = 100 ns, page-mode access = 20 ns
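As a rough example with these numbers, reading a 4-word block from one open row costs about 100 + 3 x 20 = 160 ns in page mode, versus 4 x 100 = 400 ns if every access pays the full cycle time.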

New DRAMs aim to address the CPU-DRAM speed gap;


what will they cost, and will they survive?

Synchronous DRAM: provide a clock signal to the DRAM, so transfers are


synchronous to the system clock
RAMBUS (a startup company): reinvent the DRAM interface

Each chip acts as a complete module rather than a slice of memory (i.e., a bank per chip)


Short bus between the CPU and the chips
Does its own refresh
Returns a variable amount of data per request
1 byte / 2 ns (500 MB/s per chip)
Niche memory only, or main memory?

e.g., video RAM for frame buffers: DRAM plus a fast serial output port

Main Memory Summary

Wider memory: higher bandwidth by transferring more bits per access


Interleaved memory: for sequential or independent accesses
Avoiding bank conflicts: SW & HW techniques
DRAM-specific optimizations: page mode & specialty DRAMs
