
Systolic Arrays & Their Applications

What Is a Systolic Array?


A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions. At each step, every processor takes in data from one or more neighbors (e.g. north and west), processes it, and in the next step outputs results in the opposite directions (south and east).
H. T. Kung and Charles Leiserson published the first paper on systolic arrays in 1978, and coined the name.

What Is a Systolic Array?

A specialized form of parallel computing in which multiple processors are connected by short wires, unlike many forms of parallelism that lose speed through their interconnections. Cells (processors) compute data and store it independently of each other.

Systolic Unit (Cell)

Each unit is an independent processor. Every processor has some registers and an ALU. The cells share information with their neighbors after performing the needed operations on the data.

Some simple examples of systolic array models.

Matrix Multiplication
A = | a11 a12 a13 |   B = | b11 b12 b13 |   C = | c11 c12 c13 |
    | a21 a22 a23 |       | b21 b22 b23 |       | c21 c22 c23 |
    | a31 a32 a33 |       | b31 b32 b33 |       | c31 c32 c33 |

Conventional Method: O(N^3)

For I = 1 to N
  For J = 1 to N
    For K = 1 to N
      C[I,J] = C[I,J] + A[I,K] * B[K,J];
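The conventional triple loop can be written directly in Python (the function name is ours; this is a transcription of the pseudocode above, nothing more):

```python
def matmul_conventional(A, B):
    """O(N^3) matrix multiplication: N multiplies for each of the N^2 cells of C."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
print(matmul_conventional(A, A))  # [[23, 36, 28], [25, 39, 34], [28, 32, 37]]
```

This is the result the systolic array must reproduce in the worked example below.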

Systolic Method
This will run in O(n) time!
To run in N time we need N x N processing units; in this case we need 9:

P1 P4 P7
P2 P5 P8
P3 P6 P9

We need to modify the input data. Flip columns 1 & 3 of A:

a13 a12 a11
a23 a22 a21
a33 a32 a31

Flip rows 1 & 3 of B:

b31 b32 b33
b21 b22 b23
b11 b12 b13

and finally stagger the data sets for input.

Row i of A is delayed i-1 ticks entering from the west; column j of B is delayed j-1 ticks entering from the north:

                          b33
                     b32  b23
                b31  b22  b13
                b21  b12
                b11

        a13 a12 a11 ->  P1  P4  P7
    a23 a22 a21     ->  P2  P5  P8
a33 a32 a31         ->  P3  P6  P9
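The staggering can be expressed as time-ordered input streams (a sketch; `input_streams` is our hypothetical helper, and `None` marks a tick with no operand). Note that the "flips" on the slide only put the first-needed element nearest the array on paper; as streams, the elements simply appear in natural order, delayed one tick per row or column:

```python
def input_streams(M, by_rows=True):
    """Build staggered input streams: stream i is row (or column) i of M,
    delayed by i clock ticks; None means no operand that tick."""
    n = len(M)
    vecs = M if by_rows else [list(col) for col in zip(*M)]
    T = 2 * n - 1  # ticks needed to feed all the data in
    return [[vecs[i][t - i] if i <= t < i + n else None for t in range(T)]
            for i in range(n)]

A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
for s in input_streams(A):
    print(s)
# [3, 4, 2, None, None]
# [None, 2, 5, 3, None]
# [None, None, 3, 2, 5]
```

Passing `by_rows=False` produces the corresponding streams for B's columns entering from the north.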

At every tick of the global system clock, each processor receives data from two directions (north and west), multiplies the two operands, adds the product to its result register, and passes the operands on (south and east).
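This tick behaviour can be simulated in a few lines of Python (a sketch with names of our choosing; the register layout is illustrative, not a hardware description):

```python
def systolic_matmul(A, B):
    """Simulate an n x n systolic array multiplying A and B.

    A's rows flow in from the west, B's columns from the north;
    cell (i, j) accumulates C[i][j] in its result register.
    """
    n = len(A)
    acc = [[0] * n for _ in range(n)]       # result register per cell
    a_reg = [[None] * n for _ in range(n)]  # operand moving east
    b_reg = [[None] * n for _ in range(n)]  # operand moving south
    for t in range(3 * n - 2):              # 7 ticks when n = 3
        # pass operands to the east/south neighbours, then inject the
        # staggered inputs at the west/north edges (None = no data yet)
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            a_reg[i][0] = A[i][t - i] if i <= t < i + n else None
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            b_reg[0][j] = B[t - j][j] if j <= t < j + n else None
        # multiply and accumulate wherever a cell holds both operands
        for i in range(n):
            for j in range(n):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc

A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
print(systolic_matmul(A, A))  # [[23, 36, 28], [25, 39, 34], [28, 32, 37]]
```

Each loop iteration is one clock tick; the shift loops model the operand wires, and the final loop models the multiply-accumulate done in every active cell.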

Example:

A = | 3 4 2 |   B = | 3 4 2 |   expected C = A x B = | 23 36 28 |
    | 2 5 3 |       | 2 5 3 |                        | 25 39 34 |
    | 3 2 5 |       | 3 2 5 |                        | 28 32 37 |

Let's try this using a systolic array. Tracing the computation tick by tick, the table below shows each processor's result register after every clock tick, with the product it performs that tick in parentheses. P1, P2, P3 hold column 1 of C (c11, c21, c31); P4, P5, P6 hold column 2; P7, P8, P9 hold column 3.

Tick  P1        P2        P3        P4        P5        P6        P7        P8        P9
 1    9 (3*3)
 2    17 (4*2)  6 (2*3)             12 (3*4)
 3    23 (2*3)  16 (5*2)  9 (3*3)   32 (4*5)  8 (2*4)             6 (3*2)
 4    23        25 (3*3)  13 (2*2)  36 (2*2)  33 (5*5)  12 (3*4)  18 (4*3)  4 (2*2)
 5    23        25        28 (5*3)  36        39 (3*2)  22 (2*5)  28 (2*5)  19 (5*3)  6 (3*2)
 6    23        25        28        36        39        32 (5*2)  28        34 (3*5)  12 (2*3)
 7    23        25        28        36        39        32        28        34        37 (5*5)

After clock tick 7, P1 through P9 hold the nine entries of C.

Same answer! And in only 2n + 1 = 7 clock ticks. Can we do better? The answer is yes, there is an optimization.

Reading the result registers off the array:

P1 P4 P7     23 36 28
P2 P5 P8  =  25 39 34
P3 P6 P9     28 32 37

Why Systolic?

Extremely fast. Easily scalable architecture. Can perform many tasks that single-processor machines cannot. Turns some exponential problems into linear or polynomial time.

Why Not Systolic?

Expensive. Not needed for most applications; they are a highly specialized processor type. Difficult to implement and build.

MIMD: Multiple Instruction, Multiple Data

Flynn's Classical Taxonomy

Distinguishes multi-processor architectures by instruction and data streams:

SISD: Single Instruction, Single Data
SIMD: Single Instruction, Multiple Data
MISD: Multiple Instruction, Single Data
MIMD: Multiple Instruction, Multiple Data

Flynn's Classical Taxonomy: MIMD

Can execute different instructions on different data elements. The most common type of parallel computer.

Multi-computer: Structure of Distributed Memory MIMD Architectures

Parallel Computer Memory Architectures: Shared Memory Architecture

All processors access all memory as a single global address space. Data sharing is fast. Lacks scalability between memory and CPUs.

Parallel Computer Memory Architectures: Distributed Memory

Each processor has its own memory. Scalable, with no overhead for cache coherency. The programmer is responsible for many details of communication between processors.
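A minimal sketch of this distributed-memory style, using Python's multiprocessing (the helper names are ours): each process owns its data, and the programmer moves it explicitly with messages.

```python
from multiprocessing import Process, Pipe

def worker(conn):
    chunk = conn.recv()       # data is physically copied into this process
    conn.send(sum(chunk))     # the result is copied back out
    conn.close()

def scatter_sum(data):
    """Ship `data` to a worker process and collect the partial result."""
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send(data)     # explicit communication; no shared memory
    result = parent_end.recv()
    p.join()
    return result

if __name__ == "__main__":
    print(scatter_sum([1, 2, 3, 4]))  # 10
```

Every `send`/`recv` pair is a physical copy between address spaces, which is exactly the overhead listed as a disadvantage below.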

Multi-computer (distributed memory system): Advantages and Disadvantages

+ Highly scalable
+ Message passing solves the memory access synchronization problem

- Load balancing problem
- Deadlock in message passing
- Need to physically copy data between processes

Multi-processor (shared memory system): Advantages and Disadvantages

+ May use uniprocessor programming techniques
+ Communication between processors is efficient

- Synchronized access to shared data in memory is needed
- Lack of scalability due to (memory) contention problem
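The synchronization requirement can be illustrated with a small Python sketch (names are ours): threads update one shared counter, and access must be serialized or increments can be lost.

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:            # synchronized access to the shared data
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000; without the lock some increments could be lost
```

The lock is also where the contention problem shows up: with many processors, serialized access to shared memory becomes the bottleneck.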

Best of Both Worlds (Multi-computer Using Virtual Shared Memory)

Also called distributed shared memory architecture. The local memories of the multi-computer are components of a global address space: any processor can access the local memory of any other processor.

Three approaches:
Non-uniform memory access (NUMA) machines
Cache-only memory access (COMA) machines
Cache-coherent non-uniform memory access (CC-NUMA) machines

Structure of NUMA Architectures

NUMA: remote load

Structure of COMA Architectures

Structure of CC-NUMA Architectures

Classification of MIMD computers

THANK YOU
