Systolic arrays are a specialized form of parallel computing: multiple processors connected by short wires. Unlike many forms of parallelism, which lose speed through their interconnections, the cells (processors) compute data and store it independently of each other.
Systolic Unit (cell)
Each unit is an independent processor. Every processor has some registers and an ALU. After performing the needed operations on the data, the cells share information with their neighbors.
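The behavior of one cell can be sketched in a few lines of Python (the class and method names are my own; this is a behavioral sketch, not a hardware description):

```python
class Cell:
    """One systolic cell: a result register plus a multiply-add ALU.

    Behavioral sketch only: each clock tick the cell consumes one operand
    from its left neighbor and one from its top neighbor, accumulates
    their product, and latches both operands for forwarding onward.
    """

    def __init__(self):
        self.acc = 0      # result register
        self.a_out = 0    # operand forwarded to the right neighbor
        self.b_out = 0    # operand forwarded to the bottom neighbor

    def tick(self, a_in, b_in):
        self.acc += a_in * b_in             # multiply-accumulate in the ALU
        self.a_out, self.b_out = a_in, b_in
```

For example, feeding the cell 3 and 3 on one tick, then 4 and 2 on the next, leaves 9 + 8 = 17 in its register.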
Matrix Multiplication
a11 a12 a13
a21 a22 a23
a31 a32 a33
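For comparison, the conventional single-processor multiply needs n multiply-adds for each of the n^2 entries, i.e. O(n^3) sequential steps. A minimal Python reference (function name is my own):

```python
def matmul(A, B):
    """Plain O(n^3) matrix multiply on a single processor."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```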
Systolic Method
This will run in O(n) time! To run in n time we need n x n processing units; in this case we need 9:

P1 P4 P7
P2 P5 P8
P3 P6 P9

We need to modify the input data: flip columns 1 & 3 of A (reversing each row), and feed B in bottom row first:

a13 a12 a11        b31 b32 b33
a23 a22 a21        b21 b22 b23
a33 a32 a31        b11 b12 b13

Each reversed row of A streams into the array from the left, and each column of B streams in from the top (a11 and b11 arrive first):

             b31
             b21
             b11
a13 a12 a11  P1 P4 P7
a23 a22 a21  P2 P5 P8
a33 a32 a31  P3 P6 P9

At every tick of the global system clock, data reaches each processor from two different directions; the processor multiplies the two values and adds the product to its result register.
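The whole scheme can be simulated in Python. This is a sketch under my own naming; the index arithmetic below plays the role of the flipped, staggered input streams, and the three loop nests model the shift-then-multiply that happens in hardware within one tick:

```python
def systolic_matmul(A, B):
    """Simulate an n x n systolic array computing A * B.

    Each cell (i, j) holds an accumulator C[i][j] plus the two operands
    currently passing through it. Per tick: operands shift one cell to
    the right (A) and one cell down (B), then every cell multiplies its
    two operands and adds the product to its accumulator.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]      # result registers
    a_reg = [[0] * n for _ in range(n)]  # A operand sitting in each cell
    b_reg = [[0] * n for _ in range(n)]  # B operand sitting in each cell
    ticks = 3 * n - 2                    # 7 ticks for n = 3
    for t in range(ticks):
        for i in range(n):               # A values shift right...
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i                    # row i is delayed by i ticks
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):               # ...B values shift down
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j                    # column j is delayed by j ticks
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        for i in range(n):               # every cell multiply-accumulates
            for j in range(n):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C
```

Cell (i, j) sees A[i][k] and B[k][j] together on tick i + j + k, so after the final tick its accumulator holds exactly the dot product c_ij.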
Example: multiply two 3 x 3 matrices.

A = 3 4 2    B = 3 4 2
    2 5 3        2 5 3
    3 2 5        3 2 5

Expected product:

23 36 28
25 39 34
28 32 37

[Slide animation: the reversed rows of A and the rows of B stream into the P1..P9 grid, one element per clock tick.]
Clock ticks 1 through 7 [slide animation]: at each tick the operands advance one cell; every processor multiplies the two values arriving at it and adds the product to its register. The first product, 3*3, is formed in P1 at tick 1; by tick 7 every register holds its final entry of the product:

23 36 28
25 39 34
28 32 37
Same answer, in 2n + 1 time! Can we do better? The answer is yes: there is an optimization.
[Slide: the register grid after the final tick. Each column of the P1..P9 grid holds one row of the result:]

23 25 28
36 39 32
28 34 37

Read out of the registers, this is the same product matrix:

23 36 28
25 39 34
28 32 37
Why Systolic?
Advantages:
-Extremely fast
-Easily scalable architecture
-Can perform many tasks that single-processor machines cannot
-Turns some exponential problems into linear or polynomial time
Drawbacks:
-Expensive
-Not needed for most applications; systolic arrays are a highly specialized processor type
-Difficult to implement and build
Flynn's taxonomy distinguishes multiprocessor architectures by instruction and data streams:
-SISD: Single Instruction, Single Data
-SIMD: Single Instruction, Multiple Data
-MISD: Multiple Instruction, Single Data
-MIMD: Multiple Instruction, Multiple Data
MIMD machines can execute different instructions on different data elements; this is the most common type of parallel computer.
Shared memory: all processors access all memory as a single global address space. Data sharing is fast, but there is a lack of scalability between memory and CPUs.
Distributed memory: each processor has its own memory. This is scalable, with no overhead for cache coherency, but the programmer is responsible for many details of communication between processors:
-Load balancing problems
-Deadlock in message passing
-Need to physically copy data between processes
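The last point can be made concrete with a small sketch of two simulated nodes (the Node class is my own invention for illustration): each node has private memory, and data moves only as serialized messages, so it is physically copied on every transfer.

```python
import pickle

class Node:
    """A simulated distributed-memory node with an inbox for messages."""

    def __init__(self):
        self.memory = {}   # private local memory
        self.inbox = []    # incoming message buffer

    def send(self, other, key, value):
        # Copy #1: serialize the data out of the sender's memory.
        other.inbox.append(pickle.dumps((key, value)))

    def receive(self):
        # Copy #2: deserialize into the receiver's own memory.
        key, value = pickle.loads(self.inbox.pop(0))
        self.memory[key] = value

a, b = Node(), Node()
a.memory["row"] = [3, 4, 2]
a.send(b, "row", a.memory["row"])
b.receive()
b.memory["row"][0] = 99      # mutating b's copy...
print(a.memory["row"][0])    # ...leaves a's original untouched: 3
```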
Shared memory has its own drawbacks:
-Synchronized access to shared data in memory is needed
-Lack of scalability due to the (memory) contention problem
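Both drawbacks show up in a minimal Python sketch (names are mine): four threads share one counter, every increment must be synchronized with a lock, and that lock is exactly the contention point that limits scaling.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:           # synchronized access to the shared data
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)               # 40000 with the lock; racy without it
```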
Best of Both Worlds (multicomputer using virtual shared memory)
Also called distributed shared memory architecture. The local memories of the multicomputer are components of a global address space: any processor can access the local memory of any other processor.
Three approaches:
-Non-uniform memory access (NUMA) machines
-Cache-only memory architecture (COMA) machines
-Cache-coherent non-uniform memory access (CC-NUMA) machines
THANK YOU