Computer Architecture

DAT105: Computer Architecture
Study Period 2, 2009
Exercise 5
Chapter 5: Memory Hierarchy Design and
Chapter 4: Multiprocessors and Thread-Level Parallelism
Mafijul Islam
Department of Computer Science and Engineering
December 3, 2009

Goals: To understand the impact of
cache optimization techniques on performance
cache coherence in simple, bus-based multiprocessors
Case Studies/Assignments:
Case Study 2: 5.6, 5.7
Case Study 1: 4.1, 4.2

Case Study 2: 5.6
Optimizing cache performance via advanced techniques
Assume
A matrix transpose is performed; matrices are stored in row-major order
256x256 double-precision transpose on a processor with a 16KB L1 D cache
The L1 D cache is fully associative with LRU replacement and 64-byte blocks
An L1 miss or prefetch requires 16 cycles and always hits in the L2 cache
The L2 cache can process a request every 2 processor cycles
Each iteration of the inner loop requires 4 cycles if the data is in the L1 D cache
The cache has a write-allocate, fetch-on-write policy for write misses
Writing back dirty cache blocks requires 0 cycles (unrealistic)

Case Study 2: 5.6
Optimizing cache performance via advanced techniques
Example C code:
for (i = 0; i < 256; i++) {
    for (j = 0; j < 256; j++) {
        output[j][i] = input[i][j];
    }
}
How does loop interchange optimization affect the execution of the
above code?

Case Study 2: 5.6(a)
Block size to completely fill the data cache with one input block and one output block
Total cache capacity: 16KB
Split equally between input and output: 8KB each
Each element of the matrix is 8 bytes
Number of elements in 8KB: 8KB / 8 bytes = 1024 elements
Block size: 32 x 32 (32 x 32 = 1024 elements)

Case Study 2: 5.6(b)
Relative number of misses of the blocked and unblocked versions
The level 1 cache is direct mapped
Blocked version:
fetch each input cache block once, fetch each output cache block once
2 misses for every 8 elements (compulsory misses only)
Unblocked version:
one cache miss for every 8 row elements (64-byte block, 8-byte elements)
each column occupies 256 cache blocks: 256 x 64 bytes = 16KB, the entire cache
column elements will be replaced in the cache before they are used again
9 misses (1 row, 8 column) for every 8 elements, versus 2 in the blocked version

Case Study 2: 5.6(c)
Rewriting the code with a block size parameter B
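Original C code:
for (i = 0; i < 256; i++) {
    for (j = 0; j < 256; j++) {
        output[j][i] = input[i][j];
    }
}
Modified C code (B x B blocks; B is assumed to divide 256):
for (i = 0; i < 256; i = i + B) {          /* first row of the input block */
    for (j = 0; j < 256; j = j + B) {      /* first column of the input block */
        for (m = 0; m < B; m++) {          /* row within the block */
            for (n = 0; n < B; n++) {      /* column within the block */
                output[j+n][i+m] = input[i+m][j+n];
            }
        }
    }
}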

Case Study 2: 5.7
Designing a prefetcher for the unblocked matrix transposition
Prefetches write directly into the cache and cause no pollution
The L1 D cache is fully associative with LRU replacement and 64-byte blocks
An L1 miss or prefetch requires 16 cycles and always hits in the L2 cache
The L2 cache can process a request every 2 processor cycles
Each iteration of the inner loop requires 4 cycles if the data is in the L1 D cache
The cache has a write-allocate, fetch-on-write policy for write misses
Writing back dirty cache blocks requires 0 cycles (unrealistic)
Performance (in cycles per iteration) using an ideal nonunit stride prefetcher in the steady state of the inner loop

Case Study 2: 5.7
Designing a prefetcher for the unblocked matrix transposition
each cache block of the row supplies 8 elements (64-byte blocks / 8-byte elements) and is fetched once
each column element lies in its own cache block, which is fetched once
9 prefetches for processing 8 elements
8 inner-loop iterations require at most 9 prefetches, and the L2 cache accepts one prefetch every 2 cycles:
18 cycles
8 inner-loop iterations require 2 cycles of operations each:
16 cycles
The prefetches are the bottleneck, so the number of cycles per iteration is 18/8 = 2.25

Case Study 1: 4.1
Simple bus-based multiprocessor: symmetric shared-memory architecture
Each processor has a single, private cache
Cache coherence is maintained using the snooping coherence protocol
Each cache is direct-mapped, with four blocks each holding two words
The cache-address tag contains the full address, and each word shows only two hex characters
Coherence states are denoted M, S, and I for Modified, Shared, and Invalid
The multiprocessor, along with the initial cache and memory state, is illustrated in Figure 4.37 of the textbook
Determine the resulting state (coherence state, tags, and data) of the caches and memory after a specific sequence of one or more CPU operations. Also, determine the value returned by each read operation.
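Chapter 4: Problem No. 4.1
Initial cache and memory state (Figure 4.37 of the textbook); each block holds two words

P0
Block   Coherence state   Address tag   Data
B0      I                 100           00 10
B1      S                 108           00 08
B2      M                 110           00 30
B3      I                 118           00 10

P1
Block   Coherence state   Address tag   Data
B0      I                 100           00 10
B1      M                 128           00 68
B2      I                 110           00 10
B3      S                 118           00 18

P15
Block   Coherence state   Address tag   Data
B0      S                 120           00 20
B1      S                 108           00 08
B2      I                 110           00 10
B3      I                 118           00 10

Memory
Address   Data
100       00 00
108       00 08
110       00 10
118       00 18
120       00 20
128       00 28
130       00 30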
Prepared by: Mafijul Islam

CHALMERS University of Technology, Göteborg, Sweden
December 10, 2008
Chapter 4: Problem No. 4.1, parts a to f
Resulting cache and memory state after each operation (tables in the layout of Figure 4.37). Entries that differ from the initial state, in slide order:
P0.B0: S, tag 120, data 00 20
P0.B0: M, tag 120, data 00 80
P15.B0: I
P15.B0: M, tag 120, data 00 80
P0.B2 and P1.B2: S, tag 110, data 00 30; memory[110]: 00 30
P0.B1: M, tag 108, data 00 48
P15.B1: I
P0.B2: M, tag 130, data 00 78; memory[110]: 00 30
P15.B2: M, tag 130, data 00 78

Case Study 1: 4.2
Performance of a snooping cache-coherent multiprocessor
CPU read and write hits generate no stall cycles
CPU read and write misses generate Nmemory and Ncache stall cycles if satisfied by memory and cache, respectively
CPU write hits that generate an invalidation incur Ninvalidate stall cycles
A writeback of a block (due to a conflict or another processor's request to an exclusive block) incurs an additional Nwriteback stall cycles
Consider two implementations:
Parameter     Implementation 1   Implementation 2
Nmemory       100                100
Ncache        70                 130
Ninvalidate   15                 15
Nwriteback    10                 10
Determine the number of stall cycles generated by each implementation
Initial cache and memory state: as in Figure 4.37 of the textbook
Chapter 4: Problem No. 4.2 (a)
P0: read 120
Read miss, satisfied by memory
Implementation 1: 100 stall cycles
Implementation 2: 100 stall cycles
P0: read 128
Read miss, satisfied by P1's cache (Ncache + Nwriteback)
Implementation 1: 70+10 stall cycles
Implementation 2: 130+10 stall cycles
P0: read 130
Read miss, satisfied by memory; writeback of [110]
Implementation 1: 100+10 stall cycles
Implementation 2: 100+10 stall cycles
Total
Implementation 1: 100 + (70+10) + (100+10) = 290 stall cycles
Implementation 2: 100 + (130+10) + (100+10) = 350 stall cycles
Chapter 4: Problem No. 4.2 (b)
P0: read 100
Read miss, satisfied by memory
Implementation 1: 100 stall cycles
Implementation 2: 100 stall cycles
P0: write 108 <--- 48
Write hit, send invalidation
Implementation 1: 15 stall cycles
Implementation 2: 15 stall cycles
P0: write 130 <--- 78
Write miss, satisfied by memory; writeback of [110]
Implementation 1: 10+100 stall cycles
Implementation 2: 10+100 stall cycles
Total
Implementation 1: 100 + 15 + (10+100) = 225 stall cycles
Implementation 2: 100 + 15 + (10+100) = 225 stall cycles
Chapter 4: Problem No. 4.2 (c)
P1: read 120
Read miss, satisfied by memory
Implementation 1: 100 stall cycles
Implementation 2: 100 stall cycles
P1: read 128
Read hit
Implementation 1: 0 stall cycles
Implementation 2: 0 stall cycles
P1: read 130
Read miss, satisfied by memory
Implementation 1: 100 stall cycles
Implementation 2: 100 stall cycles
Total
Implementation 1: 100 + 0 + 100 = 200 stall cycles
Implementation 2: 100 + 0 + 100 = 200 stall cycles
Chapter 4: Problem No. 4.2 (d)
P1: read 100
Read miss, satisfied by memory
Implementation 1: 100 stall cycles
Implementation 2: 100 stall cycles
P1: write 108 <--- 48
Write miss, satisfied by memory; writeback of [128]
Implementation 1: 100+10 stall cycles
Implementation 2: 100+10 stall cycles
P1: write 130 <--- 78
Write miss, satisfied by memory
Implementation 1: 100 stall cycles
Implementation 2: 100 stall cycles
Total
Implementation 1: 100 + (100+10) + 100 = 310 stall cycles
Implementation 2: 100 + (100+10) + 100 = 310 stall cycles
