
Brock Wilcox

CS470
Homework 6
2012.04.12
1. Exercises 5.2.1 through 5.2.4, parts a and b
"In this exercise we look at memory locality properties of matrix computation. The following code is written in C,
where elements within the same row are stored contiguously."
Assume that the elements of both the arrays A and B contain single-precision floating-point values, declared with
the codes below:

a.

float A[4000][4000], B[8][8];

for (int I = 0; I < 4000; I++)
    for (int J = 0; J < 8; J++)
        A[I][J] = B[J][0] + A[I][J];

b.

float A[4000][4000], B[8][8];

for (int J = 0; J < 8; J++)
    for (int I = 0; I < 4000; I++)
        A[I][J] = B[J][0] + A[I][J];

Hint: Note that these codes differ slightly from those provided in the textbook exercise.
5.2.1 How many single-precision floating-point values can be stored in a 16-byte cache line? Show your work!

Single-precision floats are 32 bits each; at 8 bits per byte, that is 4 bytes per float.
So a 16-byte cacheline holds 16 / 4 = 4 floats.
5.2.1.EXTRA.A How many cachelines total would the array A need, if it could be stored completely in cache? Show
your work!

A is 4000 * 4000 = 16,000,000 floats. Dividing by 4 floats per cacheline gives
16,000,000 / 4 = 4,000,000 cachelines.
5.2.1.EXTRA.B Since not all of the elements of A are used in the codes above, how many cachelines would the
required elements of the array A actually need, if they could be stored completely in cache? Show your work!

For (a) and (b) we use A[0..3999][0..7], i.e. the first 8 elements of each of the 4000
rows. That is 4000 * 8 = 32,000 floats, which is 32,000 / 4 = 8,000 cachelines.
5.2.1.EXTRA.C How many cachelines would the array B require, if it could be stored completely in cache? Show
your work!

B is 8 * 8 = 64 floats. Since we have 4 floats per cacheline, that is 64 / 4 = 16 cachelines.


5.2.1.EXTRA.D Since not all of the elements of B are used in the codes above, how many cachelines would the
required elements of the array B actually need, if they could be stored completely in cache? Show your work!

In both (a) and (b) we use B[0..7][0], which is 8 * 1 = 8 floats. If those 8 floats could
be packed together, 8 / 4 = 2 cachelines would hold them. (In memory they actually sit one
per row, 32 bytes apart; the question asks only about the capacity needed.)

5.2.2 References to which variables exhibit temporal locality? Why?

In both (a) and (b) the loop indices I and J exhibit temporal locality, since they are
accessed over and over again (they would most likely be kept in registers).
In (b), B[J][0] has strong temporal locality, since each value is accessed 4000 times in a
row. In (a) each B[J][0] is also re-read on every pass of the outer loop, just not
consecutively.
5.2.3 References to which variables exhibit spatial locality? Why?

In (a), A[I][J] has spatial locality: as J is incremented, adjacent elements of the row
are accessed. In (b), consecutive accesses to A[I][J] are a full row (4000 floats) apart,
so they exhibit no spatial locality.
5.2.4 (Using the code provided for exercises 5.2.1 through 5.2.3 above)
How many 16-byte cachelines are needed to store all 32-bit matrix elements being referenced -- as the code
executes? Show your work!
For this question only, assume that only one cacheline from the array A and only one cacheline from the
array B can be in the cache at the same time.
Hint: There will be a difference between the two sets of codes, as one of the codes will use all four floats in a
cacheline from one of the two arrays consecutively and the other code will not.
Note that only the first 8 elements of each row in A are used and only the first element of each of the eight
rows of B is used.

For (a), the 8 floats used from each row of A are contiguous, so 2 cachelines (4 floats
each) hold a whole row, and each fetched A line serves 4 consecutive accesses.
For (b), each B[J][0] lies in a different row of B, so a separate cacheline is needed for
each of the 8 rows.
5.2.4.EXTRA.E
Since fetching values from memory (to put into the cache) is far more expensive than fetching values directly from
the cache, which of the two codes above is more efficient? Why?

The first is significantly better. It walks each row of A in order, so every fetched A
cacheline supplies 4 consecutive accesses before the next line is needed. In (b),
consecutive accesses to A are 4000 floats (1,000 cachelines) apart, so the single cached
A line is never reused and every access to A misses.
2. Exercise 5.3.1, using the list of references printed in red below:
"Caches are important to providing a high performance memory hierarchy to processors. Below is a list of 32-bit
memory address references, given as word addresses."
5, 215, 174, 7, 82, 215, 66, 175, 65, 104, 87, 214, 5
5.3.1 "For each of these references,
a) identify the binary address,
b) identify the tag, and
c) identify the index given a direct-mapped cache with 16 one-word blocks.
d) Also list if each reference is a hit or a miss, assuming the cache is initially empty."

Word addresses mean each reference names a whole 4-byte word, so there is no byte offset
to account for. Since we have 16 one-word blocks, we need 4 bits for the index. The tag is
therefore the remaining 32 - 4 = 28 high bits of the address (the addresses here are
small, so only the low 4 tag bits are shown below).
Reference   Binary address   Tag    Index   Hit/miss
5           0000 0101        0000   0101    miss
215         1101 0111        1101   0111    miss
174         1010 1110        1010   1110    miss
7           0000 0111        0000   0111    miss (evicts 215)
82          0101 0010        0101   0010    miss
215         1101 0111        1101   0111    miss (evicts 7)
66          0100 0010        0100   0010    miss (evicts 82)
175         1010 1111        1010   1111    miss
65          0100 0001        0100   0001    miss
104         0110 1000        0110   1000    miss
87          0101 0111        0101   0111    miss (evicts 215)
214         1101 0110        1101   0110    miss
5           0000 0101        0000   0101    hit

(Only the low 8 bits of each 32-bit address are shown; the upper 24 bits are zero.)

3. Exercise 5.3.4
"There are many different design parameters that are important to a cache's overall performance. The table below
lists parameters for one direct-mapped cache design."

Cache data size   Cache block size   Cache access time
64 KB             2 words            2 cycles
5.3.4 "Calculate the total number of bits required for the cache listed in the table, assuming a 32-bit address. Given
that total size, find the total size of the closest direct-mapped cache with 16-word blocks of equal size or greater.
Explain why the second cache, despite its larger data size, might provide slower performance than the first cache."
Hint: Compute n and m first. See Section 5.2.

64 KB of data is 2^16 bytes = 2^14 words. With 2-word (2^1-word) blocks, that is 2^13
blocks, so n = 13 and m = 1.
So we get 2^n (2^m * 32 + (32 - n - m - 2) + 1) = 2^13 (64 + 16 + 1) = 8192 * 81 =
663,552 total bits.
For 16-word blocks, m = 4 and each entry needs 16 * 32 + (32 - n - 4 - 2) + 1 = 539 - n
bits. The smallest n with 2^n (539 - n) >= 663,552 is n = 11, giving 2^11 * 528 =
1,081,344 total bits and 2^11 * 16 words = 128 KB of data.
Despite its larger data size, the second cache can be slower: the bigger cache takes
longer to access, and each miss must fetch a 16-word block, so its miss penalty is larger.
4. Exercises 5.4.1 and 5.4.2, parts a and b
"For a direct-mapped cache design with 32-bit address, the following bits of the address are used to access the
cache."

    Tag     Index   Offset
a.  31-10   9-4     3-0
b.  31-12   11-5    4-0
5.4.1 "What is the cache line size (in words)?"

(a) The tag is 22 bits, the index is 6 bits (n), and the offset is 4 bits. So 22 =
32 - (n + m + 2) = 32 - (8 + m), and m = 32 - 22 - 8 = 2. The cache line size is
2^2 = 4 words.
(b) The tag is 20 bits, the index is 7 bits (n), and the offset is 5 bits. So 20 =
32 - (n + m + 2) = 32 - (7 + m + 2), and m = 32 - 20 - 9 = 3. The cache line size is
2^3 = 8 words.
5.4.2 "How many entries does the cache have?"

This can be determined from the index. In (a) the index is 6 bits, so the cache is 2^6 = 64
blocks. In (b) the index is 7 bits, so the cache is 2^7 = 128 blocks.

5. Exercises 5.4.4 and 5.4.5


"Starting from power on, the following byte-addressed cache references are recorded." (12 references)
Address 0 4 16 132 232 160 1024 30 140 3100 180 2180
Hint: You may assume that m=2 and n = 6.
5.4.4 "How many blocks are replaced?"
Hint: First fill out the table below:
With 2 byte-offset bits and n = 6 index bits (one word per block), the tag is 24 bits
(bits 31-8), the index is bits 7-2, and the byte offset is bits 1-0.

Address   Binary (low 12 bits)   Line ID   Hit/Miss   Replace?
0         0000 0000 0000         0         M          N
4         0000 0000 0100         1         M          N
16        0000 0001 0000         4         M          N
132       0000 1000 0100         33        M          N
232       0000 1110 1000         58        M          N
160       0000 1010 0000         40        M          N
1024      0100 0000 0000         0         M          Y (evicts 0)
30        0000 0001 1110         7         M          N
140       0000 1000 1100         35        M          N
3100      1100 0001 1100         7         M          Y (evicts 30)
180       0000 1011 0100         45        M          N
2180      1000 1000 0100         33        M          Y (evicts 132)

So 3 blocks are replaced: 1024 evicts address 0 (line 0), 3100 evicts 30 (line 7), and
2180 evicts 132 (line 33).

5.4.5 "What is the hit ratio?"

The hit ratio is 0/12 = 0


6. Exercise 5.5.2, using values from table below:
"Recall that we have two write policies and two write-allocation policies; their combinations can be implemented
in either the L1 or the L2 cache.

L1:  Write-back, write allocate
L2:  Write-through, non-write allocate
5.5.2 "Describe the procedure of handling an L1 write miss, considering the component involved and the possibility
of replacing a dirty block."

First we check whether the victim block (the L1 block the new address maps to) is dirty.
If it is, we write it out to L2; because L2 is write-through and non-write allocate, L2
passes that write on to main memory (and does not allocate a line if it misses in L2).
Next we fetch the target block (from L2 or memory) into L1, replacing the victim.
Finally we perform the write in the L1 block and mark it dirty; the new data reaches the
lower levels only when this block is itself evicted later.
