
Systolic Arrays & Their Applications

What Is a Systolic Array?


A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions. At each step, every processor takes in data from one or more neighbors (e.g. north and west), processes it, and in the next step outputs results in the opposite directions (south and east).
H. T. Kung and Charles Leiserson published the first paper on systolic arrays in 1978, and coined the name.

What Is a Systolic Array?

A specialized form of parallel computing in which multiple processors are connected by short wires, unlike many forms of parallelism that lose speed through their interconnections. Cells (processors) compute data and store it independently of each other.

Systolic Unit (Cell)

Each unit is an independent processor. Every processor has some registers and an ALU. The cells share information with their neighbors after performing the needed operations on the data.

Some simple examples of systolic array models.

Matrix Multiplication
A = | a11 a12 a13 |   B = | b11 b12 b13 |   C = | c11 c12 c13 |
    | a21 a22 a23 |       | b21 b22 b23 |       | c21 c22 c23 |
    | a31 a32 a33 |       | b31 b32 b33 |       | c31 c32 c33 |

Conventional Method: O(N^3)

For I = 1 to N
  For J = 1 to N
    For K = 1 to N
      C[I,J] = C[I,J] + A[I,K] * B[K,J];
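The conventional triple loop can be written directly in Python (the function name is ours; this is a transcription of the pseudocode above, nothing more):

```python
def matmul_conventional(A, B):
    """O(N^3) matrix multiplication: N multiplies for each of the N^2 cells of C."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
print(matmul_conventional(A, A))  # [[23, 36, 28], [25, 39, 34], [28, 32, 37]]
```

This is the result the systolic array must reproduce in the worked example below.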

Systolic Method
This will run in O(n) time!
To run in N time we need N x N processing units; in this case we need 9:

P1 P4 P7
P2 P5 P8
P3 P6 P9

We need to modify the input data. Flip columns 1 & 3 of A:

a13 a12 a11
a23 a22 a21
a33 a32 a31

Flip rows 1 & 3 of B:

b31 b32 b33
b21 b22 b23
b11 b12 b13

and finally stagger the data sets for input.

Row i of A is delayed i-1 ticks entering from the west; column j of B is delayed j-1 ticks entering from the north:

                          b33
                     b32  b23
                b31  b22  b13
                b21  b12
                b11

        a13 a12 a11 ->  P1  P4  P7
    a23 a22 a21     ->  P2  P5  P8
a33 a32 a31         ->  P3  P6  P9
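The staggering can be expressed as time-ordered input streams (a sketch; `input_streams` is our hypothetical helper, and `None` marks a tick with no operand). Note that the "flips" on the slide only put the first-needed element nearest the array on paper; as streams, the elements simply appear in natural order, delayed one tick per row or column:

```python
def input_streams(M, by_rows=True):
    """Build staggered input streams: stream i is row (or column) i of M,
    delayed by i clock ticks; None means no operand that tick."""
    n = len(M)
    vecs = M if by_rows else [list(col) for col in zip(*M)]
    T = 2 * n - 1  # ticks needed to feed all the data in
    return [[vecs[i][t - i] if i <= t < i + n else None for t in range(T)]
            for i in range(n)]

A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
for s in input_streams(A):
    print(s)
# [3, 4, 2, None, None]
# [None, 2, 5, 3, None]
# [None, None, 3, 2, 5]
```

Passing `by_rows=False` produces the corresponding streams for B's columns entering from the north.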

At every tick of the global system clock, each processor receives data from two directions (north and west), multiplies the two operands, adds the product to its result register, and passes the operands on (south and east).
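This tick behaviour can be simulated in a few lines of Python (a sketch with names of our choosing; the register layout is illustrative, not a hardware description):

```python
def systolic_matmul(A, B):
    """Simulate an n x n systolic array multiplying A and B.

    A's rows flow in from the west, B's columns from the north;
    cell (i, j) accumulates C[i][j] in its result register.
    """
    n = len(A)
    acc = [[0] * n for _ in range(n)]       # result register per cell
    a_reg = [[None] * n for _ in range(n)]  # operand moving east
    b_reg = [[None] * n for _ in range(n)]  # operand moving south
    for t in range(3 * n - 2):              # 7 ticks when n = 3
        # pass operands to the east/south neighbours, then inject the
        # staggered inputs at the west/north edges (None = no data yet)
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            a_reg[i][0] = A[i][t - i] if i <= t < i + n else None
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            b_reg[0][j] = B[t - j][j] if j <= t < j + n else None
        # multiply and accumulate wherever a cell holds both operands
        for i in range(n):
            for j in range(n):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc

A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
print(systolic_matmul(A, A))  # [[23, 36, 28], [25, 39, 34], [28, 32, 37]]
```

Each loop iteration is one clock tick; the shift loops model the operand wires, and the final loop models the multiply-accumulate done in every active cell.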

Example:

A = | 3 4 2 |   B = | 3 4 2 |   expected C = A x B = | 23 36 28 |
    | 2 5 3 |       | 2 5 3 |                        | 25 39 34 |
    | 3 2 5 |       | 3 2 5 |                        | 28 32 37 |

Let's try this using a systolic array. Tracing the computation tick by tick, the table below shows each processor's result register after every clock tick, with the product it performs that tick in parentheses. P1, P2, P3 hold column 1 of C (c11, c21, c31); P4, P5, P6 hold column 2; P7, P8, P9 hold column 3.

Tick  P1        P2        P3        P4        P5        P6        P7        P8        P9
 1    9 (3*3)
 2    17 (4*2)  6 (2*3)             12 (3*4)
 3    23 (2*3)  16 (5*2)  9 (3*3)   32 (4*5)  8 (2*4)             6 (3*2)
 4    23        25 (3*3)  13 (2*2)  36 (2*2)  33 (5*5)  12 (3*4)  18 (4*3)  4 (2*2)
 5    23        25        28 (5*3)  36        39 (3*2)  22 (2*5)  28 (2*5)  19 (5*3)  6 (3*2)
 6    23        25        28        36        39        32 (5*2)  28        34 (3*5)  12 (2*3)
 7    23        25        28        36        39        32        28        34        37 (5*5)

After clock tick 7, P1 through P9 hold the nine entries of C.

Same answer! And in only 2n + 1 = 7 clock ticks. Can we do better? The answer is yes, there is an optimization.

Reading the result registers off the array:

P1 P4 P7     23 36 28
P2 P5 P8  =  25 39 34
P3 P6 P9     28 32 37

Why Systolic?

Extremely fast. Easily scalable architecture. Can perform many tasks that single-processor machines cannot. Turns some exponential problems into linear or polynomial time.

Why Not Systolic?

Expensive. Not needed for most applications; they are a highly specialized processor type. Difficult to implement and build.

MIMD: Multiple Instruction, Multiple Data

Flynn's Classical Taxonomy

Distinguishes multi-processor architectures by instruction and data streams:

SISD: Single Instruction, Single Data
SIMD: Single Instruction, Multiple Data
MISD: Multiple Instruction, Single Data
MIMD: Multiple Instruction, Multiple Data

Flynn's Classical Taxonomy: MIMD

Can execute different instructions on different data elements. The most common type of parallel computer.

Multi-computer: Structure of Distributed Memory MIMD Architectures

Parallel Computer Memory Architectures: Shared Memory Architecture

All processors access all memory as a single global address space. Data sharing is fast. Lacks scalability between memory and CPUs.

Parallel Computer Memory Architectures: Distributed Memory

Each processor has its own memory. Scalable, with no overhead for cache coherency. The programmer is responsible for many details of communication between processors.
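A minimal sketch of this distributed-memory style, using Python's multiprocessing (the helper names are ours): each process owns its data, and the programmer moves it explicitly with messages.

```python
from multiprocessing import Process, Pipe

def worker(conn):
    chunk = conn.recv()       # data is physically copied into this process
    conn.send(sum(chunk))     # the result is copied back out
    conn.close()

def scatter_sum(data):
    """Ship `data` to a worker process and collect the partial result."""
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send(data)     # explicit communication; no shared memory
    result = parent_end.recv()
    p.join()
    return result

if __name__ == "__main__":
    print(scatter_sum([1, 2, 3, 4]))  # 10
```

Every `send`/`recv` pair is a physical copy between address spaces, which is exactly the overhead listed as a disadvantage below.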

Multi-computer (distributed memory system): Advantages and Disadvantages

+ Highly scalable
+ Message passing solves the memory access synchronization problem

- Load balancing problem
- Deadlock in message passing
- Need to physically copy data between processes

Multi-processor (shared memory system): Advantages and Disadvantages

+ May use uniprocessor programming techniques
+ Communication between processors is efficient

- Synchronized access to shared data in memory is needed
- Lack of scalability due to (memory) contention problem
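The synchronization requirement can be illustrated with a small Python sketch (names are ours): threads update one shared counter, and access must be serialized or increments can be lost.

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:            # synchronized access to the shared data
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000; without the lock some increments could be lost
```

The lock is also where the contention problem shows up: with many processors, serialized access to shared memory becomes the bottleneck.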

Best of Both Worlds (Multi-computer Using Virtual Shared Memory)

Also called distributed shared memory architecture. The local memories of the multi-computer are components of a global address space: any processor can access the local memory of any other processor.

Three approaches:
Non-uniform memory access (NUMA) machines
Cache-only memory access (COMA) machines
Cache-coherent non-uniform memory access (CC-NUMA) machines

Structure of NUMA Architectures

NUMA: remote load

Structure of COMA Architectures

Structure of CC-NUMA Architectures

Classification of MIMD computers

THANK YOU
