
VLSI Digital Signal Processing Systems

Systolic Architecture Design


Lan-Da Van, Ph.D.
Department of Computer Science
National Chiao Tung University
Taiwan, R.O.C.
Fall, 2010
ldvan@cs.nctu.edu.tw
http://www.cs.nctu.tw/~ldvan/
Outline
Introduction
Systolic Array Design Methodology
FIR Systolic Arrays
Selection of Scheduling Vector
Conclusion
Systolic Architecture
What is a systolic architecture (also called a systolic array)?
A network of processing elements (PEs) that rhythmically compute and pass
data through the system.
Used as a coprocessor in combination with a host computer.
Its behavior is analogous to the flow of blood through the heart,
hence the name systolic.
Characteristics of Systolic Arrays
Synchronization
Modularity
Regularity
Locality
Finite Connection
Parallel/Pipeline
Extendibility
Some relaxations are introduced to increase the
utility of systolic arrays
Neighbor interconnection (near, but not nearest)
Data broadcast operations
Different PEs, especially at the boundaries
Systolic Array Design Methodology
Represent the Algorithm as a Dependence Graph
Applying Projection, Processor, and Scheduling Vectors
(Space-Time Representation)
Edge Mapping
Construct the Final Systolic Architecture
Design Methodology: Basic Vectors
Projection vector $d^T = [d_1\ d_2]$
Determines how the DG is compressed.
Two nodes that are displaced by $d$ or multiples of $d$ are executed by the same processor.
Processor space vector $p^T = [p_1\ p_2]$
Any node with index $I^T = [i, j]$ is executed by processor $p^T I$.
Scheduling vector $s^T = [s_1\ s_2]$
Any node with index $I^T = [i, j]$ is executed at time $s^T I$.
Hardware utilization efficiency: $\mathrm{HUE} = 1/|s^T d|$
This is because two tasks executed by the same processor are spaced $|s^T d|$ time units apart.
Feasibility constraints
The processor space vector and the projection vector must be orthogonal to each other: $p^T d = 0$.
If A and B differ by the projection vector, i.e., $I_A - I_B = d$, then they must be executed by the same processor: $p^T I_A = p^T I_B \Rightarrow p^T (I_A - I_B) = 0 \Rightarrow p^T d = 0$.
If A and B are mapped to the same processor, then they cannot be executed at the same time, i.e., $s^T I_A \neq s^T I_B \Rightarrow s^T d \neq 0$.
Edge mapping: if an edge $e$ exists in the DG, then an edge $p^T e$ exists in the systolic array with $s^T e$ delays.
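The feasibility constraints and the HUE formula can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the helper names `dot` and `check_design` are hypothetical and chosen only for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length integer vectors."""
    return sum(x * y for x, y in zip(a, b))


def check_design(d, p, s):
    """Return (feasible, hue) for projection d, processor p, and scheduling s.

    feasible is True when p^T d == 0 and s^T d != 0.
    hue is 1/|s^T d|, or None when s^T d == 0 (HUE is undefined there).
    """
    sd = dot(s, d)
    feasible = dot(p, d) == 0 and sd != 0
    hue = 1 / abs(sd) if sd != 0 else None
    return feasible, hue


# d^T = [1 0], p^T = [0 1], s^T = [1 0]: orthogonal p and d, s^T d = 1
print(check_design(d=(1, 0), p=(0, 1), s=(1, 0)))
# Same d and p, but s^T = [0 1] gives s^T d = 0, which is infeasible
print(check_design(d=(1, 0), p=(0, 1), s=(0, 1)))
```

The second call shows why the constraint $s^T d \neq 0$ matters: all nodes along $d$ would otherwise be scheduled on the same processor at the same time.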
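Space to Space-Time Representation
Space-time representation
Interpreting one of the spatial dimensions as the temporal dimension
$$
\begin{pmatrix} i' \\ j' \\ t' \end{pmatrix}
= T \begin{pmatrix} i \\ j \\ t \end{pmatrix}
= \begin{pmatrix} 0 & 0 & 1 \\ \multicolumn{2}{c}{p^T} & 0 \\ \multicolumn{2}{c}{s^T} & 0 \end{pmatrix}
\begin{pmatrix} i \\ j \\ t \end{pmatrix}
$$
$j'$: processor axis, $t'$: scheduling time instance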
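As a small illustration of the space-time mapping, the Python sketch below applies a transformation matrix to a few DG nodes. It assumes the matrix has rows [0 0 1], [p^T 0], and [s^T 0], and uses the example vectors p^T = [0 1] and s^T = [1 0]; the function names are hypothetical.

```python
def space_time_matrix(p, s):
    """Build the 3x3 matrix with rows [0 0 1], [p^T 0], [s^T 0] (assumed layout)."""
    return [[0, 0, 1], [p[0], p[1], 0], [s[0], s[1], 0]]


def transform(T, node):
    """Multiply the 3x3 matrix T by the column vector node = (i, j, t)."""
    return tuple(sum(T[r][c] * node[c] for c in range(3)) for r in range(3))


T = space_time_matrix(p=(0, 1), s=(1, 0))
for i in range(2):
    for j in range(3):
        # Each DG node (i, j) starts with t = 0; j' is the processor, t' the time
        print((i, j), "->", transform(T, (i, j, 0)))
```

With these vectors, node (i, j) is sent to processor j' = j at time t' = i.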
Outline
Introduction
Systolic Array Design Methodology
FIR Systolic Arrays
Selection of Scheduling Vector
Matrix-Matrix Multiplication and 2D Systolic
Array Design
Systolic Design for Space Representations
Containing Delays
Conclusion
DG of FIR Filter
Dependence Graph (DG)
Ex: FIR filter: $y(n) = w_0 x(n) + w_1 x(n-1) + w_2 x(n-2)$
[Figure: dependence graph of the FIR filter in the (i, j) index space]
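For reference, the 3-tap FIR filter can be evaluated directly in software. This is a minimal Python sketch that assumes x(n) = 0 for n < 0; the function name `fir_direct` is hypothetical and not from the slides.

```python
def fir_direct(w, x):
    """Directly evaluate y(n) = sum_k w[k] * x(n - k), taking x(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k, wk in enumerate(w):
            if n - k >= 0:  # samples before n = 0 are assumed to be zero
                acc += wk * x[n - k]
        y.append(acc)
    return y


print(fir_direct([1, 2, 3], [1, 0, 0, 2]))  # impulse at n = 0, then a sample at n = 3
```

Each product w[k] * x(n - k) corresponds to one node of the dependence graph, which is what the systolic designs distribute over processors and time.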
Applying Projection and Scheduling (1/2)
Projection vector $d^T = [1\ 0]$
Processor vector $p^T = [0\ 1]$
Scheduling vector $s^T = [1\ 0]$
Part of DG: apply $p^T I$ and $s^T I$ to each node $I = [i\ j]^T$

Node I       Processor p^T I    Time s^T I
[0 0]^T      0                  0
[1 0]^T      0                  1
[0 1]^T      1                  0
[1 1]^T      1                  1
[0 2]^T      2                  0
[1 2]^T      2                  1

[Figure: part of the DG with its nodes assigned to processor 0, processor 1, and processor 2, and the resulting SFG with PE0, PE1, and PE2]
Applying Projection and Scheduling (2/2)
[Figure: the dependence graph and the space-time representation obtained by applying projection and scheduling]
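The node-by-node mapping can be sketched in a few lines of Python. The example below uses p^T = [0 1] and s^T = [1 0] on a small 2-by-3 patch of the DG; the patch size and the helper name `map_nodes` are assumptions made for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length integer vectors."""
    return sum(x * y for x, y in zip(a, b))


def map_nodes(nodes, p, s):
    """Group DG nodes by processor p^T I; each entry is (time s^T I, node)."""
    schedule = {}
    for node in nodes:
        schedule.setdefault(dot(p, node), []).append((dot(s, node), node))
    return {pe: sorted(tasks) for pe, tasks in schedule.items()}


nodes = [(i, j) for i in range(2) for j in range(3)]  # assumed patch of the DG
schedule = map_nodes(nodes, p=(0, 1), s=(1, 0))
for pe in sorted(schedule):
    print("PE", pe, schedule[pe])
```

Every processor receives one task per time step here, which matches HUE = 1 for this choice of vectors.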
Edge Mapping
For an edge $e = [i\ j]^T$ in the DG:
Edge mapping: $e' = p^T e$
Delay: $\mathrm{delay}(e) = s^T e$
Example: $p^T = [0\ 1]$, $s^T = [1\ 0]$, $d^T = [1\ 0]$
Input: $e_1 = [0\ 1]^T$, $e_1' = p^T e_1 = 1$, $\mathrm{delay}(e_1) = s^T e_1 = 0$
Weight: $e_2 = [1\ 0]^T$, $e_2' = p^T e_2 = 0$, $\mathrm{delay}(e_2) = s^T e_2 = 1$
Output: $e_3 = [1\ -1]^T$, $e_3' = p^T e_3 = -1$, $\mathrm{delay}(e_3) = s^T e_3 = 1$
Edge mapping
Edge mapping table for $p^T = [0\ 1]$, $s^T = [1\ 0]$, $d^T = [1\ 0]$

e                   p^T e    s^T e
Input  [0 1]^T       1        0
Weight [1 0]^T       0        1
Output [1 -1]^T     -1        1
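The edge mapping table can be reproduced with a short Python sketch. The edge vectors for input, weight, and output are taken from the table; the names `FIR_EDGES` and `edge_table` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length integer vectors."""
    return sum(x * y for x, y in zip(a, b))


# Fundamental edges of the FIR dependence graph, as listed in the table
FIR_EDGES = {"input": (0, 1), "weight": (1, 0), "output": (1, -1)}


def edge_table(edges, p, s):
    """Map each DG edge e to (p^T e, s^T e): array direction and delay count."""
    return {name: (dot(p, e), dot(s, e)) for name, e in edges.items()}


table = edge_table(FIR_EDGES, p=(0, 1), s=(1, 0))
for name, (direction, delay) in table.items():
    print(f"{name:7s} p^T e = {direction:2d}  s^T e = {delay}")
```

A direction of 0 means the signal stays in its PE, and a delay of 0 means it is broadcast in the same time step.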
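Construct the Final Systolic Architecture
[Figure: systolic array with PE0, PE1, and PE2. The input is broadcast to all PEs, Weight 0, Weight 1, and Weight 2 stay in their PEs, and the output moves between PEs through delay elements D.]
This is called the B1 design.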
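To make the behavior of the B1 design concrete, here is a cycle-by-cycle software sketch in Python. It assumes PE j holds weight w[j], the input is broadcast in every cycle, and partial results move from PE j+1 to PE j through one delay, consistent with p^T e = -1 and s^T e = 1 for the output edge. The function names are hypothetical and this is an illustration, not the slides' own implementation.

```python
def fir_direct(w, x):
    """Reference: y(n) = sum_k w[k] * x(n - k), with x(n) = 0 for n < 0."""
    return [sum(wk * x[n - k] for k, wk in enumerate(w) if n - k >= 0)
            for n in range(len(x))]


def simulate_b1(w, x):
    """Simulate the B1 array: broadcast inputs, moving results, fixed weights."""
    k = len(w)
    regs = [0] * k  # regs[j]: delayed partial result arriving at PE j
    y = []
    for xt in x:  # one clock cycle per input sample
        # Every PE sees the same broadcast input in this cycle
        outs = [regs[j] + w[j] * xt for j in range(k)]
        y.append(outs[0])  # PE0 delivers the finished output
        # The output of PE j+1 reaches PE j after one delay; the last PE is fed 0
        regs = outs[1:] + [0]
    return y


w, x = [1, 2, 3], [1, 0, 0, 2, 5]
print(simulate_b1(w, x))
print(fir_direct(w, x))
```

Both calls print the same sequence, which is a quick sanity check that the mapped array computes the original filter.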
Alternative Designs
B1 (Broadcast inputs, Move results, Weights stay)
B2 (Broadcast inputs, Move weights, Results stay)
F (Fan-in results, Move inputs, Weights stay)
R1 (Results stay, Inputs and weights move in opposite directions)
R2 and Dual R2 (Results stay, Inputs and weights move in the same direction but at different speeds)
W1 (Weights stay, Inputs and results move in opposite directions)
W2 and Dual W2 (Weights stay, Inputs and results move in the same direction but at different speeds)
Relating systolic designs using transformations
B2: Broadcast Inputs, Move Weights, Results Stay
$d^T = [1\ -1]$, $p^T = [1\ 1]$, $s^T = [1\ 0]$
$\mathrm{HUE} = 1/|s^T d| = 1$

e                   p^T e    s^T e
wt     [1 0]^T       1        1
input  [0 1]^T       1        0
result [1 -1]^T      0        1
F: Fan-in Results, Move Inputs, Weights Stay
$d^T = [1\ 0]$, $p^T = [0\ 1]$, $s^T = [1\ 1]$
$\mathrm{HUE} = 1/|s^T d| = 1$

e                   p^T e    s^T e
wt     [1 0]^T       0        1
input  [0 1]^T       1        1
result [1 -1]^T     -1        0
R1: Results Stay, Inputs and Weights Move in Opposite Directions
$d^T = [1\ -1]$, $p^T = [1\ 1]$, $s^T = [1\ -1]$
$\mathrm{HUE} = 1/|s^T d| = 1/2$
The input edge is reversed to $[0\ -1]^T$ so that $s^T e$ is positive.

e                   p^T e    s^T e
wt     [1 0]^T       1        1
input  [0 -1]^T     -1        1
result [1 -1]^T      0        2
R2 and Dual R2: Results Stay, Inputs and Weights Move in the Same Direction but at Different Speeds
R2: $d^T = [1\ -1]$, $p^T = [1\ 1]$, $s^T = [2\ 1]$, $\mathrm{HUE} = 1/|s^T d| = 1$

e                   p^T e    s^T e
wt     [1 0]^T       1        2
input  [0 1]^T       1        1
result [1 -1]^T      0        1

Dual R2: $d^T = [1\ -1]$, $p^T = [1\ 1]$, $s^T = [1\ 2]$, $\mathrm{HUE} = 1/|s^T d| = 1$
The result edge is reversed to $[-1\ 1]^T$ so that $s^T e$ is positive.

e                   p^T e    s^T e
wt     [1 0]^T       1        1
input  [0 1]^T       1        2
result [-1 1]^T      0        1

[Figure: R2 and Dual R2 arrays with PE0, PE1, and PE2. Inputs and weights move in the same direction, one through D delays and the other through 2D delays, while the results stay in the PEs.]
W1: Weights Stay, Inputs and Results Move in Opposite Directions
$d^T = [1\ 0]$, $p^T = [0\ 1]$, $s^T = [2\ 1]$
$\mathrm{HUE} = 1/|s^T d| = 1/2$

e                   p^T e    s^T e
wt     [1 0]^T       0        2
input  [0 1]^T       1        1
result [1 -1]^T     -1        1

[Figure: W1 array with PE0, PE1, and PE2. The weights stay in the PEs with 2D delays, while inputs and results move in opposite directions through D delays.]
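W2 and Dual W2: Weights Stay, Inputs and Results Move in the Same Direction but at Different Speeds
W2: $d^T = [1\ 0]$, $p^T = [0\ 1]$, $s^T = [1\ 2]$, $\mathrm{HUE} = 1/|s^T d| = 1$
The result edge is reversed to $[-1\ 1]^T$ so that $s^T e$ is positive.

e                   p^T e    s^T e
wt     [1 0]^T       0        1
input  [0 1]^T       1        2
result [-1 1]^T      1        1

Dual W2: $d^T = [1\ 0]$, $p^T = [0\ 1]$, $s^T = [1\ -1]$, $\mathrm{HUE} = 1/|s^T d| = 1$
The input edge is reversed to $[0\ -1]^T$ so that $s^T e$ is positive.

e                   p^T e    s^T e
wt     [1 0]^T       0        1
input  [0 -1]^T     -1        1
result [1 -1]^T     -1        2

[Figure: W2 and Dual W2 arrays with PE0, PE1, and PE2. The weights stay in the PEs, while inputs and results move in the same direction, one through D delays and the other through 2D delays.]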
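The projection, processor, and scheduling vectors of all the FIR designs can be tabulated and checked in one pass. The Python sketch below copies the vectors from the preceding slides and recomputes p^T d, s^T d, and HUE; the dictionary name `DESIGNS` and the helper `dot` are hypothetical.

```python
from fractions import Fraction


def dot(a, b):
    """Inner product of two equal-length integer vectors."""
    return sum(x * y for x, y in zip(a, b))


# name: (d, p, s), copied from the design slides
DESIGNS = {
    "B1": ((1, 0), (0, 1), (1, 0)),
    "B2": ((1, -1), (1, 1), (1, 0)),
    "F": ((1, 0), (0, 1), (1, 1)),
    "R1": ((1, -1), (1, 1), (1, -1)),
    "R2": ((1, -1), (1, 1), (2, 1)),
    "Dual R2": ((1, -1), (1, 1), (1, 2)),
    "W1": ((1, 0), (0, 1), (2, 1)),
    "W2": ((1, 0), (0, 1), (1, 2)),
    "Dual W2": ((1, 0), (0, 1), (1, -1)),
}

hue = {}
for name, (d, p, s) in DESIGNS.items():
    # Feasibility: p orthogonal to d, and s^T d nonzero
    assert dot(p, d) == 0 and dot(s, d) != 0, name
    hue[name] = Fraction(1, abs(dot(s, d)))
    print(f"{name:8s} s^T d = {dot(s, d):2d}  HUE = {hue[name]}")
```

Only R1 and W1 come out at HUE = 1/2; every other design keeps each PE busy in every cycle.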
Relating Systolic Designs Using
Transformations
Designs obtained with the same projection vector and processor space vector
but different scheduling vectors
can be derived from each other using transformations:
Edge reversal: reverse the edge direction in the DG when there are no precedence
constraints
Associativity: when accumulating, (a+b)+c = a+(b+c)
Slow-down
Retiming
Pipelining
Cutset Retiming Transformation
[Figure: designs F and B1 related by cutset retiming]
Scheduling Inequalities (1/3)
Based on the selected scheduling vector $s^T$, the projection vector $d$ and the processor space vector $p^T$ can be selected so that
$p^T (I_A - I_B) = 0 \Rightarrow p^T d = 0$
$s^T I_A \neq s^T I_B \Rightarrow s^T d \neq 0$
Consider the dependence relation $X \to Y$, where $I_x = [i_x\ j_x]^T$ and $I_y = [i_y\ j_y]^T$ are the indices of node X and node Y, respectively. The scheduling inequality for this dependence is defined by Eq. (1).
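Scheduling Inequalities (2/3)
Scheduling inequality:
$$S_y \ge S_x + T_x \qquad \text{Eq. (1)}$$
where $T_x$ is the time to compute node X and $S_x$, $S_y$ are the scheduling times for nodes X and Y, respectively.
Linear scheduling:
$$S_x = s^T I_x = (s_1\ s_2) \begin{pmatrix} i_x \\ j_x \end{pmatrix}, \qquad S_y = s^T I_y = (s_1\ s_2) \begin{pmatrix} i_y \\ j_y \end{pmatrix} \qquad \text{Eq. (2)}$$
Affine scheduling:
$$S_x = s^T I_x + \gamma_x = (s_1\ s_2) \begin{pmatrix} i_x \\ j_x \end{pmatrix} + \gamma_x, \qquad S_y = s^T I_y + \gamma_y = (s_1\ s_2) \begin{pmatrix} i_y \\ j_y \end{pmatrix} + \gamma_y \qquad \text{Eq. (3)}$$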
Scheduling Inequalities (3/3)
Define the edge from node X to node Y as
$$e_{x \to y} = I_y - I_x$$
Combining Eq. (1) with the affine scheduling times gives
$$s^T e_{x \to y} + \gamma_y - \gamma_x \ge T_x$$
Hence the selection of the scheduling vector consists of two steps:
Capture all the fundamental edges. The reduced dependence graph (RDG) is used to capture the fundamental edges, and the regular iterative algorithm (RIA) description of the corresponding problem is used to construct RDGs.
Construct the scheduling inequalities and solve them for a feasible $s^T$.
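The edge form of the scheduling inequality follows in two lines from the node form. The derivation below is a sketch that assumes the affine schedule $S_x = s^T I_x + \gamma_x$, where the offset symbol $\gamma_x$ is the notation assumed here for the per-node constant.

```latex
\begin{aligned}
S_y &\ge S_x + T_x \\
s^T I_y + \gamma_y &\ge s^T I_x + \gamma_x + T_x \\
s^T (I_y - I_x) + \gamma_y - \gamma_x &\ge T_x \\
s^T e_{x \to y} + \gamma_y - \gamma_x &\ge T_x
\end{aligned}
```

For linear scheduling the offsets are zero, so the condition reduces to $s^T e_{x \to y} \ge T_x$.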
Regular Iterative Algorithm (RIA)
The regular iterative algorithm (RIA) description is used to
construct the reduced dependence graph (RDG).
The RIA has two standard forms:
The RIA is in standard input RIA form if the indices of the
inputs are the same for all equations.
The RIA is in standard output RIA form if the output indices are
the same for all equations.
FIR example:
$$
\begin{aligned}
W(i+1, j) &= W(i, j) \\
X(i, j+1) &= X(i, j) \\
Y(i+1, j-1) &= Y(i, j) + W(i+1, j-1)\, X(i+1, j-1)
\end{aligned}
$$
Output RIA form:
$$
\begin{aligned}
W(i, j) &= W(i-1, j) \\
X(i, j) &= X(i, j-1) \\
Y(i, j) &= Y(i-1, j+1) + W(i, j)\, X(i, j)
\end{aligned}
$$
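The output RIA form can be evaluated directly as a set of recurrences. The Python sketch below assumes the recurrences W(i,j) = W(i-1,j), X(i,j) = X(i,j-1), and Y(i,j) = Y(i-1,j+1) + W(i,j)X(i,j), together with boundary conditions that are not stated on the slide: weights enter at i = 0 as w[j], inputs enter at j = 0 as x[i], Y is zero outside the index space, and y(n) is read from Y(n, 0). The function name is hypothetical.

```python
def fir_output_ria(w, x):
    """Evaluate the FIR filter through the output-form RIA recurrences."""
    k, n = len(w), len(x)
    W, X, Y = {}, {}, {}
    for i in range(n):
        for j in range(k):
            # W(i, j) = W(i-1, j); assumed boundary: w[j] enters at i = 0
            W[i, j] = W[i - 1, j] if i > 0 else w[j]
            # X(i, j) = X(i, j-1); assumed boundary: x[i] enters at j = 0
            X[i, j] = X[i, j - 1] if j > 0 else x[i]
            # Y(i, j) = Y(i-1, j+1) + W(i, j) X(i, j); zero outside the index space
            Y[i, j] = Y.get((i - 1, j + 1), 0) + W[i, j] * X[i, j]
    return [Y[i, 0] for i in range(n)]


print(fir_output_ria([1, 2, 3], [1, 0, 0, 2]))
```

The three assignments correspond to the fundamental edges (1, 0) for W, (0, 1) for X, and (1, -1) for Y.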
Scheduling Vector and Systolic Array
Design Using RDG
Constructing scheduling inequalities using RDG
Determine the scheduling vector using
scheduling inequalities
Systolic mapping using the scheduling vector
This formulation can accommodate different
computation times for various operations due to
its generality.
Example 7.4.1 (1/4)
[Figure: reduced dependence graph (RDG)]
There are 5 edges in the above RDG.
Example 7.4.1 (2/4)
Scheduling inequalities for the edges of the RDG:
$W \to Y$: $e = [0\ 0]^T$, $\gamma_y - \gamma_w \ge 0$
$X \to X$: $e = [0\ 1]^T$, $s_2 + \gamma_x - \gamma_x \ge 1$
$W \to W$: $e = [1\ 0]^T$, $s_1 + \gamma_w - \gamma_w \ge 1$
$X \to Y$: $e = [0\ 0]^T$, $\gamma_y - \gamma_x \ge 0$
$Y \to Y$: $e = [1\ -1]^T$, $s_1 - s_2 + \gamma_y - \gamma_y \ge 5 + 2 + 1$
Example 7.4.1 (3/4)
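Example 7.4.1 (4/4)
Linear scheduling
$s_1 \ge 1$, $s_2 \ge 1$, $s_1 - s_2 \ge 8$
$s_2 = 1$, $s_1 = 9$, $s^T = (9\ 1)$
$d = (1, -1)^T$, $p^T = (1, 1)$

e                 p^T e    s^T e
wt (1, 0)          1        9
i/p (0, 1)         1        1
Result (1, -1)     0        8

Systolic array architecture
[Figure: systolic array in which X moves through D delays, W moves through 9D delays, and the results stay in the PEs with 8D delays]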
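The linear schedule in this example can be found by a small exhaustive search. The Python sketch below encodes the three inequalities that involve s (the two edges with e = (0, 0) place no constraint on s) and, as an assumption not stated on the slide, picks the feasible vector with the smallest s_1 + s_2. The names `CONSTRAINTS`, `feasible`, and `best` are hypothetical.

```python
from itertools import product


def dot(a, b):
    """Inner product of two equal-length integer vectors."""
    return sum(x * y for x, y in zip(a, b))


# (edge e, minimum value of s^T e) under linear scheduling:
# s_2 >= 1, s_1 >= 1, s_1 - s_2 >= 5 + 2 + 1
CONSTRAINTS = [((0, 1), 1), ((1, 0), 1), ((1, -1), 8)]


def feasible(s):
    """True when s satisfies every scheduling inequality."""
    return all(dot(s, e) >= bound for e, bound in CONSTRAINTS)


# Search a small integer box; the bound of 16 is an arbitrary assumption
candidates = [s for s in product(range(16), repeat=2) if feasible(s)]
best = min(candidates, key=lambda s: (s[0] + s[1], s))
print("s^T =", best)

# Edge mapping with p^T = (1, 1) for the weight, input, and result edges
p = (1, 1)
edges = {"wt": (1, 0), "i/p": (0, 1), "result": (1, -1)}
table = {name: (dot(p, e), dot(best, e)) for name, e in edges.items()}
print(table)
```

The search returns the same scheduling vector and delay counts as the table in the example.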
Conclusion
Systolic architecture
Massively parallel processing with limited I/O
communication with the host computer
Suitable for many regular iterative operations
Design methodology
Maps an N-dimensional DG to an (N-1)-dimensional
space-time representation
Requires determining three critical vectors:
Projection vector
Processor space vector
Scheduling vector
