
Final Exam (Takehome Portion)

name: Raymond Zhou
course: Computer Architecture 3

Problem Setting
Consider the following game: The game is represented by a 72 × 72 grid, where each cell in the grid is a bit and has 8 neighbors. Wrap-around connections are used for the neighbors of edge cells. At each point in time, the following occur in parallel:

1. Any cell with a 1, and fewer than k1 neighboring 1s, becomes a 0.
2. Any cell with a 1, and more than k2 neighboring 1s, becomes a 0.
3. Any cell with a 1, and between k1 and k2 (inclusive) neighboring 1s, stays a 1.
4. Any cell with a 0, and exactly k3 neighboring 1s, becomes a 1.
5. Any other cell with a 0 stays a 0.

The constants k1, k2, and k3 are initialized to some value and updated as follows:

1. At every 100th time step, set k1 = f1(grid) (i.e., f1 is a function of every element in the entire grid). You may assume that f1 is an expensive function to compute and that it has appropriate range.
2. At every 200th time step, set k2 = f2(grid). You may assume that f2 is also expensive to compute and has appropriate range.
3. At every 300th time step, set k3 = (f1(grid) f2(grid)) mod 9.

For example, the following 4 × 4 grid:

    1 1 0 0
    0 0 0 0
    0 1 0 1
    1 1 0 0

would be transformed into the following at the next point in time, if k1, k2, k3 = 2, 3, 3:

    1 1 0 0
    0 1 1 0
    0 1 1 0
    0 0 0 1

Note that the 4 × 4 grid is only an example; the actual problem is for a bigger grid.
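For concreteness, here is a minimal Python sketch of one update step under these rules. It is an illustrative reading of the problem statement, not part of the assignment; the function name and representation (a list of lists of 0/1 values) are my own choices.

    def step(grid, k1, k2, k3):
        """One synchronous update of the game on an n x n grid of 0/1 values."""
        n = len(grid)
        new = [[0] * n for _ in range(n)]
        for r in range(n):
            for c in range(n):
                # Count the 8 neighbors, using wrap-around (toroidal) indexing.
                live = sum(grid[(r + dr) % n][(c + dc) % n]
                           for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                           if (dr, dc) != (0, 0))
                if grid[r][c] == 1:
                    new[r][c] = 1 if k1 <= live <= k2 else 0   # rules 1-3
                else:
                    new[r][c] = 1 if live == k3 else 0         # rules 4-5
        return new

Applying step() to the 4 × 4 example above with (k1, k2, k3) = (2, 3, 3) reproduces the transformed grid shown.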

Assignment
Part I

In class, we gave a 4-step approach to developing MIMD programs. Apply this technique to run this game on a MIMD system with P processors, where P = 16. The interconnection network allows access to non-local memories, but a non-local access takes 20 times as long as a local one. In your writeup, you should clearly state and justify the results of each step (partitioning, communication, agglomeration, mapping). You are required to say something about every goal we listed in class under the four steps (if a goal is not realizable, you should state why). You may need to make some assumptions about problem parameters beyond what is given; make sure you state these assumptions. Finally, estimate your speedup.

1. Partitioning
   (a) many more tasks than the anticipated number of processors
   (b) avoid redundant computation/storage
   (c) balance the tasks (somewhat)
   (d) scalability with problem size
   (e) alternate partitions recommended
2. Communication
   (a) balance intertask communication
   (b) each task communicates with a small group of other tasks
   (c) communications can be done in parallel
   (d) tasks can perform computations while waiting for communication
3. Agglomeration
   (a) reduce intergroup communication overhead
   (b) load balancing (static): balance both computation and communication
   (c) number of groups scales with problem size
4. Mapping
   (a) exploit the interconnection network to minimize communication overhead
   (b) static load balancing

We first identify the amount of time required to compute each time step for the serial program. The unit of time will be the time required for one local memory access. We will assume that all arithmetic and conditional operations take a negligible amount of time. We will also assume that every access of a grid cell requires a single memory access. Without any optimization, we will need to store two copies of the grid in memory: the grid at the current time step and at the previous one. For a single grid cell, there are nine memory accesses: eight to adjacent cells in the previous grid and one to write to the current grid. Let t1 and t2 represent the execution times of functions f1 and f2, respectively, and let f3 denote the computation that updates k3 at every 300th time step. For the calculation of f3 we do not need to recalculate f1, since every 300th time step is also a 100th time step and f1 has already been computed there. Depending on which time step we are on, the calculation of f3 will require calculating f2 if we have not already done so.

To see when f2 must actually be evaluated, consider the first 1000 time steps (f1 is computed at every step listed; the f2 column marks the steps at which f2 must be evaluated, and the f3 column marks the steps at which k3 is updated):

    time step    f2              f3
    100          no              no
    200          yes             no
    300          yes (for f3)    yes
    400          yes             no
    500          no              no
    600          yes             yes
    700          no              no
    800          yes             no
    900          yes (for f3)    yes
    1000         yes             no

Thus, for i iterations, the serial program will take

    T_serial(i) = 9 * 72^2 * i + (i/100) t1 + (i/200) t2 + (i/300) t2 - (i/600) t2
                = 46656 i + (i/100) t1 + (i/200 + i/300 - i/600) t2

For large i, the cost per iteration approaches

    46656 + t1/100 + 2 t2/300
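A small Python sketch of this cost model (illustrative only; the function name and the convention that one local memory access costs 1 time unit are assumptions carried over from above):

    def serial_time(i, t1, t2):
        """Serial running time, in local-memory-access units, for i iterations."""
        grid_updates = 9 * 72**2 * i            # 9 memory accesses per cell per time step
        f1_evals = i // 100                     # k1 is updated every 100th step
        # f2 is needed at every 200th step (for k2) and every 300th step (for k3),
        # but a step that is a multiple of both (every 600th) reuses one evaluation.
        f2_evals = i // 200 + i // 300 - i // 600
        return grid_updates + f1_evals * t1 + f2_evals * t2

For example, over the first 600 steps this counts 6 evaluations of f1 and 4 evaluations of f2, matching the table above.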

1. Partitioning

Without too much trouble, we can decompose the grid into horizontal strips, vertical strips, or square regions. We will use square regions, as they best minimize the amount of communication between processing elements. For the 4 × 4 grid above, this would mean dividing it into four square 2 × 2 regions, one per processing element.
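A minimal sketch of this block decomposition, assuming a row-major assignment of square blocks to processing elements (the function and variable names are illustrative, not part of the assignment):

    def owner(r, c, n, b):
        """Processing element owning cell (r, c) of an n x n grid split into b x b blocks."""
        blocks_per_side = n // b
        return (r // b) * blocks_per_side + (c // b)

    # Sanity check on the 4 x 4 example: four processing elements, four cells each.
    counts = {}
    for r in range(4):
        for c in range(4):
            pe = owner(r, c, 4, 2)
            counts[pe] = counts.get(pe, 0) + 1
    assert len(counts) == 4 and all(v == 4 for v in counts.values())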

Extrapolating this to the 72 × 72 grid with P = 16 results in square regions of 18 × 18 = 324 cells each. Tasks are balanced between processors.

2. Communication and Agglomeration

To account for adjacencies, we will need to store a total grid of 19 × 19. There are 2 × 19 + 2 × 17 = 72 adjacency cells which are calculated in neighboring processing elements, thereby requiring 72 non-local memory accesses. These accesses will take 20 × 72 = 1440 time units.

There are 2 × 18 + 2 × 16 = 68 perimeter cells, of which there are two types: corner (4) and edge (64) cells. They must wait until all non-local memory accesses are complete. There are 17 × 17 = 289 interior cells whose next time step can be calculated entirely in local memory. This will take 9 × 289 = 2601 time units and can be done while waiting for the non-local memory accesses to complete. Since the time required for the non-local memory accesses is less than that required to calculate the interior cells, we can ignore it (computation is done in parallel with communication).

For f1, f2, and f3, a single processing element must retrieve the entire grid. This will take (72^2 - 18^2) * 20 = 97200 time units and must be done every 100 time steps.

Thus, for i iterations, the parallel program will take

    T_parallel(i) = 9 * 18^2 * i + 97200 (i/100) + (i/100) t1 + (i/200) t2 + (i/300) t2 - (i/600) t2
                  = 2916 i + 97200 (i/100) + (i/100) t1 + (i/200 + i/300 - i/600) t2

For large i, the cost per iteration approaches

    2916 + 97200/100 + t1/100 + 2 t2/300
    = 3888 + t1/100 + 2 t2/300
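The arithmetic behind these per-iteration costs can be checked with a few lines (a sketch under the same assumptions as above: one local access = 1 time unit, a non-local access = 20 units, and t1, t2 left as parameters):

    def serial_per_step(t1=0, t2=0):
        return 9 * 72**2 + t1 / 100 + 2 * t2 / 300        # 46656 when t1 = t2 = 0

    def parallel_per_step(t1=0, t2=0):
        update = 9 * 18**2                                # 2916: each PE updates its own 18 x 18 block
        gather = (72**2 - 18**2) * 20 / 100               # 972: full-grid gather for f1/f2, amortized over 100 steps
        return update + gather + t1 / 100 + 2 * t2 / 300  # 3888 when t1 = t2 = 0

    print(serial_per_step() / parallel_per_step())        # 12.0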

As the influence of t1 and t2 becomes negligible, the speedup is

    46656 / 3888 = 12

Part II

Now suppose that P = 20. Repeat those parts of Part I that change.

For P = 20, it is no longer possible to split the 72^2 grid cells evenly into square regions, so we can no longer exploit the benefits of the interconnection network in the same way. We turn off four processors and keep the P = 16 decomposition from Part I. The speedup does not change.
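A quick illustrative check of the divisibility claim (the helper below is my own, not part of the assignment):

    import math

    def square_block_side(n, p):
        """Block side length if an n x n grid splits into p equal square blocks, else None."""
        side = math.isqrt(n * n // p)
        if p * side * side == n * n and n % side == 0:
            return side
        return None

    print(square_block_side(72, 16))   # 18 -> the partition used in Part I
    print(square_block_side(72, 20))   # None -> no even split into square regions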
