
Lecture 15

Main Memory

Main Memory Background

Performance of Main Memory:


Latency: determines the cache miss penalty
Access Time (AT): time from when a request is issued until the word arrives
Cycle Time (CT): minimum time between successive requests

Bandwidth: matters for I/O and for large-block miss penalties (e.g., L2 blocks)

Main memory, organized as a 2D matrix of cells, is DRAM:


Dynamic: must be refreshed periodically (roughly every 8 ms)
AT and CT differ: AT < CT

The address is divided into two halves, multiplexed onto the memory's address pins:


RAS or Row Access Strobe
CAS or Column Access Strobe

Caches use SRAM:


No refresh needed (6 transistors/bit vs. 1 transistor/bit)
No difference between AT and CT: AT = CT

Address is not divided (no multiplexing)

Main Memory Background

Size ratio: DRAM/SRAM is about 4~8 (DRAM is denser)


Cost and cycle-time ratio: SRAM/DRAM is about 8~16
DRAM capacity grows 4x every 3 years, i.e., about 60% per year
RAS access time improves only about 7% per year

Main Memory Organization

Simple: CPU, cache, bus, and memory are all the same width (32 bits)

Wide: CPU/Mux is 1 word; Mux/cache, bus, and memory are N words wide (Alpha: 64 bits & 256 bits)

Interleaved: CPU, cache, and bus are 1 word wide; memory has N modules (4 modules in the figure), word-interleaved

[Figure: three organizations - one-word-wide memory (CPU - cache - bus - memory); wide memory (CPU - MUX - cache - bus - wide memory); interleaved memory (CPU - cache - bus - Banks 0, 1, 2, 3)]

Main Memory Performance


Timing model: 1 cycle to send the address, 6 cycles of access time, 1 cycle to send the data

Block access time, assuming a 4-word cache block:

Simple miss penalty      = 4 x (1 + 6 + 1) = 32 cycles
Wide miss penalty        = 1 + 6 + 1       = 8 cycles
Interleaved miss penalty = 1 + 6 + (4 x 1) = 11 cycles
[Figure: the interleaved case - one address sent to Banks 0-3, which return their words in consecutive cycles]
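
A minimal sketch (C) of the three miss-penalty calculations above, using the assumed timing parameters (1 cycle to send the address, 6 to access, 1 to send data, 4-word blocks):

#include <stdio.h>

/* Assumed timing parameters from the lecture's model. */
enum { SEND_ADDR = 1, ACCESS = 6, SEND_DATA = 1, BLOCK_WORDS = 4 };

int main(void) {
    int simple      = BLOCK_WORDS * (SEND_ADDR + ACCESS + SEND_DATA); /* 4 x 8 = 32 */
    int wide        = SEND_ADDR + ACCESS + SEND_DATA;                 /* 8          */
    int interleaved = SEND_ADDR + ACCESS + BLOCK_WORDS * SEND_DATA;   /* 11         */
    printf("simple = %d, wide = %d, interleaved = %d cycles\n",
           simple, wide, interleaved);
    return 0;
}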

Technique for Higher BW:

1. Wider Main Memory

Alpha AXP 21064 : 256-bit wide L2, Memory Bus, Memory


Drawbacks
expandability: doubling the width also doubles the minimum memory increment (capacity)

bus width: a multiplexer is needed between the cache and the CPU to pick the desired word out of a block

error correction: need separate error correction every 32 bits


otherwise, on a WRITE of one word: read the block -> modify the word -> calculate the
new ECC -> store the block back
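
A minimal sketch (C) of that ECC drawback, when one check code covers a whole wide block; compute_ecc() is a stand-in (plain XOR parity, not a real ECC), and the block width and names are assumptions for illustration:

#include <stdint.h>

#define WORDS_PER_BLOCK 8                /* e.g., a 256-bit block of 32-bit words */

typedef struct {
    uint32_t word[WORDS_PER_BLOCK];
    uint32_t ecc;                        /* one check code over the whole block */
} wide_block;

/* Stand-in check code: XOR parity over the block (not a real ECC). */
static uint32_t compute_ecc(const uint32_t *w, int n) {
    uint32_t p = 0;
    for (int i = 0; i < n; i++)
        p ^= w[i];
    return p;
}

/* On WRITE of one word: read block -> modify word -> calculate new ECC -> store. */
void write_word(wide_block *mem, int block, int word, uint32_t value) {
    wide_block b = mem[block];                        /* read the whole block */
    b.word[word] = value;                             /* modify one word      */
    b.ecc = compute_ecc(b.word, WORDS_PER_BLOCK);     /* recompute the ECC    */
    mem[block] = b;                                   /* store the block back */
}

With separate error correction every 32 bits, a word write touches only that word and its own check bits, so the extra block read disappears.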

Technique for Higher BW:

2. Interleaved Memory
Example: Interleaved Memory vs. Wide Memory
Consider the following description of a machine and its cache performance:

memory bus width = 1 word = 32 bits

block size (words)   1    2    4
miss rate (%)        3    2    1

memory accesses per instruction = 1.2
cache miss penalty = 8 cycles (1 + 6 + 1)
average CPI (ignoring cache misses) = 2

What is the improvement over the base machine (block size = 1) of 2-way and 4-way
interleaving versus doubling the width of the memory and the bus?

Interleaved Memory
Answer
CPI = base CPI + (memory refs/instr. x miss rate x miss penalty)
    = 2 + (1.2 x (0.03 for 1-word, 0.02 for 2-word, or 0.01 for 4-word blocks) x miss penalty)

CPI for the base machine (simple memory, BM):


2 + (1.2 x 0.03 x 8) = 2.288

2-word wide memory


32-bit bus and memory, no interleaving: 2 + (1.2 x 0.02 x (2 x 8))           = 2.384  (slower than BM)
32-bit bus and memory, interleaving:    2 + (1.2 x 0.02 x (1 + 6 + (2 x 1))) = 2.216  (faster than BM)
64-bit bus and memory, no interleaving: 2 + (1.2 x 0.02 x 8)                 = 2.192  (faster than BM)

4-word wide memory


32-bit bus and memory, no interleaving: 2 + (1.2 x 0.01 x (4 x 8))           = 2.384  (slower than BM)
32-bit bus and memory, interleaving:    2 + (1.2 x 0.01 x (1 + 6 + (4 x 1))) = 2.132  (faster than the 2-word case)
64-bit bus and memory, no interleaving: 2 + (1.2 x 0.01 x (2 x 8))           = 2.192  (same as the 2-word case)
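
A minimal sketch (C) that reproduces the CPI arithmetic above, with the lecture's assumptions hard-coded (base CPI = 2, 1.2 memory accesses per instruction, 8 cycles per access of one memory width):

#include <stdio.h>

/* CPI = base CPI + memory accesses/instr x miss rate x miss penalty */
static double cpi(double miss_rate, double miss_penalty) {
    return 2.0 + 1.2 * miss_rate * miss_penalty;
}

int main(void) {
    printf("base machine, 1-word blocks:       %.3f\n", cpi(0.03, 8));             /* 2.288 */
    printf("2-word blocks, 32-bit, no interl.: %.3f\n", cpi(0.02, 2 * 8));         /* 2.384 */
    printf("2-word blocks, 32-bit, interl.:    %.3f\n", cpi(0.02, 1 + 6 + 2 * 1)); /* 2.216 */
    printf("2-word blocks, 64-bit:             %.3f\n", cpi(0.02, 8));             /* 2.192 */
    printf("4-word blocks, 32-bit, no interl.: %.3f\n", cpi(0.01, 4 * 8));         /* 2.384 */
    printf("4-word blocks, 32-bit, interl.:    %.3f\n", cpi(0.01, 1 + 6 + 4 * 1)); /* 2.132 */
    printf("4-word blocks, 64-bit:             %.3f\n", cpi(0.01, 2 * 8));         /* 2.192 */
    return 0;
}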

Technique for Higher BW:

3. Independent Memory Banks

Interleaved memory: faster sequential accesses; independent memory banks: faster independent accesses

Motivation for interleaving: higher BW for sequential accesses by interleaving sequential addresses
across the banks - all banks share the address lines
Memory banks for independent accesses: each bank has its own bank controller and separate address lines
e.g., 1 bank serving I/O, 1 bank serving a cache read, 1 bank serving a cache write, etc.
If a single controller drives all the banks, it can only provide fast access time for one operation at a time
Memory banks also benefit miss-under-miss in non-blocking caches
Superbank: all the memory banks that are active on one block transfer
Bank: the portion within a superbank that is word interleaved
Address layout:
  | Superbank number | Superbank offset            |
                     | Bank number   | Bank offset |

Independent Memory Banks


How many banks?
For sequential accesses, a new bank delivers a word on each clock
So, for sequential accesses: number of banks >= number of clocks to access a word in a bank
Otherwise the access stream returns to a bank before it has the next word ready
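For example, with the 6-clock access time used in the earlier timing model, at least 6 banks would be needed to sustain one word per clock on a long sequential stream.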

Increasing the capacity of a DRAM chip => fewer chips are needed to build the


same-capacity memory system => harder to have many banks

Technique for Higher BW:

4. Avoiding Bank Conflicts


Even with many banks, bank conflicts still occur for certain regular access patterns
- e.g., a 256 x 512 array stored across 128 banks, processed column by column
(512 is an even multiple of 128, so successive elements of a column map to the same bank)

int x[256][512];
for (j = 0; j < 512; j = j + 1)
    for (i = 0; i < 256; i = i + 1)   /* inner loop walks down a column */
        x[i][j] = 2 * x[i][j];

The inner loop is column processing, which causes bank conflicts

[Figure: array elements distributed across Banks 0-127; all elements of one column fall in the same bank]

Avoiding Bank Conflicts


SW approaches (a sketch of both follows this list)
Loop interchange, so successive accesses do not hit the same bank
Declaring an array dimension that is not a power of 2 (the number of banks is a power of 2), so
that successive addresses map to different banks, i.e., the elements of a column are
spread across different banks
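
A minimal sketch (C) of these two software fixes, applied to the 256 x 512 array and 128 word-interleaved banks from the previous slide; the function names and the padding to 513 columns are illustrative assumptions:

int x[256][512];

/* 1. Loop interchange: the inner loop now walks along a row, so successive
 *    accesses go to successive banks instead of hitting the same bank. */
void scale_interchanged(void) {
    for (int i = 0; i < 256; i = i + 1)
        for (int j = 0; j < 512; j = j + 1)
            x[i][j] = 2 * x[i][j];
}

/* 2. Padding: declare 513 columns instead of 512, so successive elements of a
 *    column are 513 words apart and land in different banks
 *    (513 is not a multiple of 128; the extra column is never used). */
int y[256][513];

void scale_padded(void) {
    for (int j = 0; j < 512; j = j + 1)
        for (int i = 0; i < 256; i = i + 1)
            y[i][j] = 2 * y[i][j];
}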

HW: Prime number of banks


bank number = (address) MOD (number of banks)
address within bank = (address) / (number of banks)
To avoid computing a divide on every memory access, use instead:

address within bank = (address) MOD (number of words in bank)


e.g., 3 = (31) MOD (7)

How to compute bank number and address within bank?


Easy if both the number of banks and the words per bank are powers of 2 (just select address bits);
with a prime number of banks, the Chinese Remainder Theorem (next slide) shows the
MOD-based mapping is still unambiguous

Fast Bank Number


Chinese Remainder Theorem
As long as two sets of integers ai and bi follow these rules:

bi = (x) MOD (ai),  0 <= bi < ai,  0 <= x < a0 x a1 x a2 x ...


and ai and aj are co-prime whenever i != j, then the integer x has
only one solution for each tuple (b0, b1, ...) - an unambiguous mapping:
bank number = b0 = (x) MOD (a0);
number of banks = a0 (= 3 in the example), 0 <= b0 < a0
address within a bank = b1 = (x) MOD (a1);
size of a bank = a1 (= 8 in the example), 0 <= b1 < a1
N words get addresses 0 to N-1;
prime number of banks (3);
words per bank a power of 2 (8)

Fast Bank Numbers


Example: Address = 5, with 3 banks and 8 words per bank

                     Seq. Interleaved            Modulo Interleaved
Addr within bank     Bank 0  Bank 1  Bank 2      Bank 0  Bank 1  Bank 2
       0                0       1       2           0      16       8
       1                3       4       5           9       1      17
       2                6       7       8          18      10       2
       3                9      10      11           3      19      11
       4               12      13      14          12       4      20
       5               15      16      17          21      13       5
       6               18      19      20           6      22      14
       7               21      22      23          15       7      23

For address 5:
Bank number = (5) MOD (3) = 2
Seq. interleaved address within bank    = 5 / 3 = 1
Modulo interleaved address within bank  = (5) MOD (8) = 5
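
A minimal sketch (C) of the modulo-interleaved mapping above (bank number = address MOD 3, address within bank = address MOD 8); printing the table reproduces the Modulo Interleaved columns:

#include <stdio.h>

#define NUM_BANKS      3   /* prime number of banks        */
#define WORDS_PER_BANK 8   /* words per bank, a power of 2 */

int main(void) {
    int table[WORDS_PER_BANK][NUM_BANKS];
    for (int addr = 0; addr < NUM_BANKS * WORDS_PER_BANK; addr++) {
        int bank   = addr % NUM_BANKS;        /* bank number                   */
        int offset = addr % WORDS_PER_BANK;   /* address within bank           */
        table[offset][bank] = addr;           /* CRT: no two addresses collide */
    }
    for (int offset = 0; offset < WORDS_PER_BANK; offset++) {
        for (int bank = 0; bank < NUM_BANKS; bank++)
            printf("%4d", table[offset][bank]);
        printf("\n");
    }
    return 0;
}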

Technique for Higher BW:

5. DRAM Specific Interleaving

A DRAM access has two steps: Row Access (RAS) and Column Access (CAS)


Multiple accesses to the row (RAS) buffer go under several names, e.g., page mode

64 Mbit DRAM: cycle time = 100 ns, page-mode access = 20 ns
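As a rough example with these numbers, reading a 4-word block from one open row costs about 100 + 3 x 20 = 160 ns in page mode, versus 4 x 100 = 400 ns if every access pays the full cycle time.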

New DRAMs aim to address the CPU-DRAM speed gap;


what will they cost, and will they survive?

Synchronous DRAM: provide a clock signal to the DRAM, so transfers are


synchronous to the system clock
RAMBUS (a startup company): reinvent the DRAM interface

Each chip acts as a complete module rather than a slice of memory (i.e., a bank per chip)


Short bus between the CPU and the chips
Does its own refresh
Returns a variable amount of data per request
1 byte / 2 ns (500 MB/s per chip)
Niche memory only, or main memory?

e.g., video RAM for frame buffers: DRAM plus a fast serial output port

Main Memory Summary

Wider memory: higher bandwidth by transferring more bits per access


Interleaved memory: for sequential or independent accesses
Avoiding bank conflicts: SW & HW techniques
DRAM-specific optimizations: page mode & specialty DRAMs
