
Architecture Synthesis

SS 2015
Jun.-Prof. Dr. Christian Plessl
Custom Computing
University of Paderborn
Version 1.4.0 2015-05-28

Motivation
translate program/algorithm into dedicated hardware
units of translation
single basic block (combinational)
complete program (sequential)

 1: int gcd(int a, int b) {
 2:   while(a != b) {
 3:     if(a > b) {
 4:       a = a - b;
 5:     } else {
 6:       b = b - a;
 7:     }
 8:   }
 9:   return a;
10: }

[block diagram: finite state machine (controller) next to registers and functional units (FU) forming the data path]

FSM controls the data path (FSM actions)
data path sends feedback to FSM (FSM conditions)

Overview
translation of basic blocks
formal modeling
sequence graph, resource graph
cost function, execution times
synthesis problems: allocation, binding, scheduling

scheduling algorithms
ASAP, ALAP
extended ASAP, ALAP
list scheduling

optimal scheduling
introduction to integer linear programming (ILP)
scheduling with ILP

translation of complete programs


finite state machine and data path
micro-coded controllers

Sequencing Graph

GS(VS, ES)
VS = {v0, v1, …, v12}

[sequencing graph: start node v0 (NOP), operation nodes v1 … v11 labeled with their operations (*, +, -, <), end node v12 (NOP)]

Resource Graph

GR(VR, ER)
set of nodes VR = VS ∪ VT
VS are the nodes of the sequencing graph (without NOPs)
VT represent resource types (adder, multiplier, ALU, …)
set of edges (vS, vT) ∈ ER with vS ∈ VS, vT ∈ VT
an instance of resource type vT can be used to implement operation vS

Resource Graph Example

VT = {vMUL, vALU}

[bipartite resource graph: the multiplication nodes v1, v2, v3, v4, v6, v7 connect to resource type MUL; the +, - and < nodes v5, v8, v9, v10, v11 connect to resource type ALU]

Cost, Execution Time

cost function c: VT → Z
assigns a cost value to each resource type
example:
c(vMUL) = 8
c(vALU) = 4

execution times w: ER → Z+
assigns the execution time of operation vS ∈ VS on resource type vT ∈ VT to the edge (vS, vT) ∈ ER
example:
w(v2, vMUL) = 1

Allocation, Binding

allocation α: VT → Z+
assigns a number α(vT) of available instances to each resource type vT

binding is given by the two functions
β: VS → VT and γ: VS → Z+
β(vS) = vT means that operation vS is implemented by resource type vT
(possible βs are shown in the resource graph)
γ(vS) = r denotes that vS is implemented by the r-th instance of vT; r ≤ α(vT)

Allocation, Binding Example (1)

c(vMUL) = 8
α(vMUL) = 3

w(v1, vMUL) = 1
w(v2, vMUL) = 1
w(v3, vMUL) = 1
w(v4, vMUL) = 1
w(v6, vMUL) = 1
w(v7, vMUL) = 1

β(v1) = vMUL
β(v2) = vMUL
β(v3) = vMUL
β(v4) = vMUL
β(v6) = vMUL
β(v7) = vMUL

γ(v1) = 1
γ(v2) = 2
γ(v3) = 3
γ(v4) = 1
γ(v6) = 2
γ(v7) = 3

[resource graph: multiplication nodes v1, v2, v3, v4, v6, v7 bound to the three MUL instances]

10

Allocation, Binding Example (2)

c(vALU) = 4
α(vALU) = 2

w(v5, vALU) = 1
w(v8, vALU) = 1
w(v9, vALU) = 1
w(v10, vALU) = 1
w(v11, vALU) = 1

β(v5) = vALU
β(v8) = vALU
β(v9) = vALU
β(v10) = vALU
β(v11) = vALU

γ(v5) = 1
γ(v8) = 1
γ(v9) = 1
γ(v10) = 2
γ(v11) = 1

[resource graph: nodes v5, v8, v9, v10, v11 bound to the two ALU instances]

11

Schedule
schedule τ: VS → Z+
assigns a start time to each operation under the constraint

τ(vj) - τ(vi) ≥ w(vi, β(vi))  ∀(vi, vj) ∈ ES

latency L of a scheduled sequencing graph
difference in start times between end node and start node
L = τ(vn) - τ(v0)

12
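The precedence constraint above can be checked mechanically. As a plain-Python illustration (the graph, weights, and schedule below are made-up toy values, not the lecture's example):

```python
# Sketch: verify the schedule constraint tau(v_j) - tau(v_i) >= w(v_i, beta(v_i))
# for every sequencing edge. Graph, weights and schedule are toy values.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]   # hypothetical sequencing graph
w = {0: 0, 1: 1, 2: 1, 3: 0}               # w(v_i, beta(v_i)); NOPs take no time
tau = {0: 1, 1: 1, 2: 1, 3: 2}             # candidate schedule

valid = all(tau[j] - tau[i] >= w[i] for (i, j) in edges)
print(valid)  # True: every edge respects the execution time of its source
```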

Schedule - Example
NOP

time

*
*
-

10

+
*
+

<

11

NOP

12

L = (v12) - (v0) = 5-1= 4


13

Synthesis
allocation, binding, scheduling
finding (α, β, γ, τ) that optimize latency and cost under resource and
timing constraints
algorithms for architecture synthesis are discussed in, e.g.
J. Teich, C. Haubelt, Digitale Hardware/Software-Systeme, Springer 2007
G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill 1994

synthesis problem variants


multicycle operations, operator chaining
several possible resource types for an operation
iterative schedules, pipelining

in the following
scheduling without resource constraints
ASAP, ALAP

scheduling under resource constraints


extended ASAP, ALAP
list scheduling
14

Overview
translation of basic blocks
formal modeling
sequence graph, resource graph
cost function, execution times
synthesis problems: allocation, binding, scheduling

scheduling algorithms
ASAP, ALAP
extended ASAP, ALAP
list scheduling

optimal scheduling
introduction to integer linear programming (ILP)
scheduling with ILP

translation of complete programs


finite state machine and data path
micro-coded controllers

15

Scheduling without Resource Constraints


ASAP (as soon as possible)
determines the earliest possible start times for the operations
minimal latency

ALAP (as late as possible)


determines the latest possible start times for the operations
under a given latency bound

slack (mobility) of operations

difference of start times: (ALAP with ASAP latency bound) - ASAP
if slack = 0, the operation is on the critical path

16

ASAP Scheduling
ASAP: as soon as possible scheduling
algorithm

ASAP( GS(V,E) ) {
  schedule v0 by setting τ(v0) = 1
  repeat {
    select a vertex vj whose predecessors are all scheduled
    schedule vj by setting τ(vj) = max over i:(vi,vj)∈E of { τ(vi) + w(vi, β(vi)) }
  } until (vn is scheduled)
  return τ
}

17
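The procedure above can be sketched in Python. The edge list below encodes the lecture's example sequencing graph (the operation-to-operation edges are taken from the sequencing constraints in the ILP listing later in this chapter; the fan-out edges from the NOP source v0 and unit execution times are assumptions):

```python
# ASAP sketch for the example sequencing graph (unit delays, NOPs take no time)
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
         (1, 6), (2, 6), (3, 7), (4, 8), (5, 9),
         (6, 10), (7, 11), (10, 11), (8, 12), (9, 12), (11, 12)]
w = {v: 1 for v in range(13)}
w[0] = w[12] = 0                          # NOP source/sink take no time

def asap(edges, w, v0=0, vn=12):
    preds = {v: [i for (i, j) in edges if j == v] for v in range(vn + 1)}
    tau = {v0: 1}                         # schedule v0 by setting tau(v0) = 1
    while vn not in tau:                  # until the end node is scheduled
        for v in range(vn + 1):
            if v not in tau and all(p in tau for p in preds[v]):
                tau[v] = max(tau[p] + w[p] for p in preds[v])
    return tau

tau = asap(edges, w)
print(tau[12] - tau[0])  # minimal latency L = 4
```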

ASAP Example

[sequencing graph scheduled as early as possible: time steps 1-4; end node v12 (NOP) in step 5]

minimal latency: L = 4
18

ALAP Scheduling
ALAP: as late as possible scheduling
requires a latency bound L
otherwise nodes could be arbitrarily delayed
typically the schedule length of the ASAP schedule is used as latency bound

algorithm

ALAP( GS(V,E), L ) {
  schedule vn by setting τ(vn) = L + 1
  repeat {
    select a vertex vi whose successors are all scheduled
    schedule vi by setting τ(vi) = min over j:(vi,vj)∈E of { τ(vj) } - w(vi, β(vi))
  } until (v0 is scheduled)
  return τ
}

19

ALAP Example

latency bound: L = 4

[sequencing graph scheduled as late as possible: start node v0 (NOP) at time 0; operations shifted to their latest start times; end node v12 (NOP) in step 5]
20

Slack (mobility)

[ASAP and ALAP schedules side by side, annotated with the slack of each operation: v1, v2, v6, v10, v11 have slack 0 (critical path); v3, v7 have slack 1; v4, v5, v8, v9 have slack 2]

21
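Combining ASAP and ALAP start times yields the mobility directly. The sketch below uses the same example graph as before (v0 fan-out edges and unit delays are assumptions) and the ASAP latency as the ALAP bound:

```python
# Sketch: slack = ALAP start time - ASAP start time, per operation.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
         (1, 6), (2, 6), (3, 7), (4, 8), (5, 9),
         (6, 10), (7, 11), (10, 11), (8, 12), (9, 12), (11, 12)]
w = {v: 1 for v in range(13)}
w[0] = w[12] = 0                          # NOP nodes take no time

def asap(edges, w, v0=0, vn=12):
    preds = {v: [i for (i, j) in edges if j == v] for v in range(vn + 1)}
    tau = {v0: 1}
    while vn not in tau:
        for v in range(vn + 1):
            if v not in tau and all(p in tau for p in preds[v]):
                tau[v] = max(tau[p] + w[p] for p in preds[v])
    return tau

def alap(edges, w, L, v0=0, vn=12):
    succs = {v: [j for (i, j) in edges if i == v] for v in range(vn + 1)}
    tau = {vn: L + 1}                     # schedule vn at L + 1
    while v0 not in tau:
        for v in range(vn + 1):
            if v not in tau and succs[v] and all(s in tau for s in succs[v]):
                tau[v] = min(tau[s] for s in succs[v]) - w[v]
    return tau

early = asap(edges, w)
late = alap(edges, w, L=early[12] - early[0])  # ASAP latency as bound
slack = {v: late[v] - early[v] for v in range(1, 12)}
print(slack)  # operations with slack 0 lie on the critical path
```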

Scheduling under Resource Constraints (1)

Extended ASAP, ALAP


first ASAP or ALAP
then move operations down (ASAP) or up (ALAP) until resource
constraints are satisfied

22

Extended ALAP

resource constraints: α(vMUL) = 2, α(vALU) = 2
latency bound: L = 4

[ALAP schedule with operations moved up until at most 2 MUL and 2 ALU operations execute per time step; end node v12 (NOP)]

23

Scheduling under Resource Constraints (2)

list scheduling
operations are prioritized according to some criterion, e.g. slack

time = 1
repeat
  for each resource type (vT, α(vT))
    determine all ready operations vS with β(vS) = vT and
    schedule the one with the highest priority
  time++
until (vn is scheduled)

24
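The loop above can be sketched in Python. This is a simplified reading, not the lecture's reference implementation: it uses the number of direct successors as priority (the example on the next slide counts successors slightly differently) and assumes unit delays; with the example graph and α(vMUL) = α(vALU) = 1 it reproduces the latency of 7 shown on the following slides.

```python
# List scheduling sketch: one operation per resource type per time step.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
         (1, 6), (2, 6), (3, 7), (4, 8), (5, 9),
         (6, 10), (7, 11), (10, 11), (8, 12), (9, 12), (11, 12)]
w = {v: 1 for v in range(13)}
w[0] = w[12] = 0                                  # NOP nodes take no time

def list_schedule(edges, w, restype, alloc, v0=0, vn=12):
    preds = {v: [i for (i, j) in edges if j == v] for v in range(vn + 1)}
    succs = {v: [j for (i, j) in edges if i == v] for v in range(vn + 1)}
    prio = {v: len(succs[v]) for v in range(vn + 1)}   # priority criterion
    tau = {v0: 1}
    def done(v, t):                                    # finished by time t
        return v in tau and tau[v] + w[v] <= t
    t = 1
    while vn not in tau:
        for rt, k in alloc.items():                    # each resource type
            ready = [v for v in range(vn + 1)
                     if v not in tau and restype.get(v) == rt
                     and all(done(p, t) for p in preds[v])]
            for v in sorted(ready, key=lambda u: -prio[u])[:k]:
                tau[v] = t                             # schedule k best ops
        if vn not in tau and all(done(p, t) for p in preds[vn]):
            tau[vn] = t                                # end NOP
        t += 1
    return tau

restype = {v: "MUL" for v in (1, 2, 3, 4, 6, 7)}
restype.update({v: "ALU" for v in (5, 8, 9, 10, 11)})
tau = list_schedule(edges, w, restype, {"MUL": 1, "ALU": 1})
print(tau[12] - tau[0])  # latency L = 7
```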

List Scheduling (1)

example
criterion: number of successor nodes
resource constraints: α(vMUL) = 1, α(vALU) = 1

[sequencing graph annotated with the priority of each operation and the resulting execution time steps 1-7]

25

List Scheduling (2)

[resulting schedule: time steps 1-7, at most one MUL and one ALU operation per step; end node v12 in step 8]

L = τ(v12) - τ(v0) = 8 - 1 = 7

Exercise 5.1: Architecture Synthesis
26

Overview
translation of basic blocks
formal modeling
sequence graph, resource graph
cost function, execution times
synthesis problems: allocation, binding, scheduling

scheduling algorithms
ASAP, ALAP
extended ASAP, ALAP
list scheduling

optimal scheduling
introduction to integer linear programming (ILP)
scheduling with ILP

translation of complete programs


finite state machine and data path
micro-coded controllers

27

ILP Basics (1)

Linear Program (LP)
is a special kind of optimization problem

max cᵀx
A x ≤ b
x ≥ 0

with x ∈ Rⁿ, A ∈ R^(m×n), b ∈ R^m, c ∈ Rⁿ

real variables
linear constraints modeled by (in)equations
linear cost function

Integer Linear Program (ILP)
LP where all variables must be integer: x ∈ Zⁿ

0-1 LP (binary LP, ZOLP)
LP where all variables must be binary: x ∈ {0,1}ⁿ

Mixed Integer Linear Program (MILP)
LP where some variables must be integer
28

ILP Basics (2)

example
3 binary variables x1, x2, x3 ∈ {0,1}
constraint that at least two variables must be set to 1:
x1 + x2 + x3 ≥ 2
minimize cost function:
min 5 x1 + 6 x2 + 4 x3

solution (feasible assignments and their costs)

x1 x2 x3 | cost
 1  0  1 |  9   (optimal)
 0  1  1 | 10
 1  1  0 | 11
 1  1  1 | 15

29
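For a model this small the optimum can be verified by brute force; the sketch below simply enumerates all 2³ assignments:

```python
from itertools import product

# Enumerate all 0/1 assignments of (x1, x2, x3), keep the feasible ones
# (x1 + x2 + x3 >= 2) and pick the one minimizing 5*x1 + 6*x2 + 4*x3.
cost, best = min(
    (5 * x1 + 6 * x2 + 4 * x3, (x1, x2, x3))
    for x1, x2, x3 in product((0, 1), repeat=3)
    if x1 + x2 + x3 >= 2
)
print(cost, best)  # 9 (1, 0, 1)
```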

ILP Basics (3)

solving ILPs
ILPs are NP-complete, exponential runtime in the worst case
solved to optimality by branch-and-bound algorithms
(runtime depends on the efficiency of the model)
popular solvers

CPLEX (commercial)
http://www.ibm.com/software/commerce/optimization/cplex-optimizer/
Gurobi (commercial, free academic license)
http://www.gurobi.com/
lp_solve (open source)
http://lpsolve.sourceforge.net/5.5/

challenges
translate the problem into an efficient ILP model
use only linear constraints and cost function
non-linearities of the problem can (possibly) be modeled by several
linear constraints, but this increases the complexity of the ILP

30

Example: Optimal DFG Scheduling

GS(VS, ES)
VS = {v0, v1, …, v12}

[sequencing graph as before: NOP source v0, operation nodes v1 … v11, NOP sink v12]

all operations take one time step
resource constraints:
2 MUL
2 ALU (+, -, <)
find schedule with minimal latency

alternative problem:
minimize resources under latency constraint

31

ILP for Optimal DFG Scheduling (1)

binary variables
x_{i,l} = 1 iff operation vi starts in time step l, i = 1,…,n; l = 1,…,L
we need to define a latency bound L

uniqueness constraints
each operation starts in exactly one time step

Σ_l x_{i,l} = 1,  i = 1,…,n

starting time
the starting time τ(vi) of operation vi can be expressed as

τ(vi) = Σ_l l · x_{i,l}
32

ILP for Optimal DFG Scheduling (2)

simple resource constraints (execution time for operations = 1)
in each time step, use no more resources than available

Σ_{i: β(vi) = vMUL} x_{i,l} ≤ α(vMUL),  l = 1,…,L
Σ_{i: β(vi) = vALU} x_{i,l} ≤ α(vALU),  l = 1,…,L

sequencing constraints
an operation cannot start earlier than the finishing time of its predecessors

Σ_l l · x_{j,l} ≥ w(vi, β(vi)) + Σ_l l · x_{i,l},  ∀ edges (vi, vj)

actually, this basic problem formulation requires the assumption that:
w(vi, β(vi)) = 1
33

ILP for Optimal DFG Scheduling (3)

cost function
minimize start time of last operation

min τ(vn) = min ( Σ_l l · x_{n,l} )

34
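The formulation on the last three slides can be generated mechanically. This sketch emits the model in lp_solve's LP format for the example graph (the edge list is taken from the sequencing constraints shown on the following slides; the variable naming xi_l matches that listing, and the per-variable "xi_l <= 1" bounds are omitted here for brevity):

```python
# Sketch: generate the DFG-scheduling ILP in lp_solve LP format.
L = 7                                     # latency bound
mul_ops = [1, 2, 3, 4, 6, 7]              # operations bound to MUL
alu_ops = [5, 8, 9, 10, 11]               # operations bound to ALU
edges = [(1, 6), (2, 6), (6, 10), (10, 11), (11, 12), (3, 7),
         (7, 11), (4, 8), (8, 12), (5, 9), (9, 12)]

def start(i):                             # tau(v_i) as a linear expression
    return " + ".join(f"{l} x{i}_{l}" for l in range(1, L + 1))

model = [f"min: {start(12)};"]            # objective: minimize tau(v12)
for i in range(1, 13):                    # uniqueness: exactly one start time
    model.append(" + ".join(f"x{i}_{l}" for l in range(1, L + 1)) + " = 1;")
for l in range(1, L + 1):                 # resources: <= 2 MUL, <= 2 ALU per step
    model.append(" + ".join(f"x{i}_{l}" for i in mul_ops) + " <= 2;")
    model.append(" + ".join(f"x{i}_{l}" for i in alu_ops) + " <= 2;")
for i, j in edges:                        # sequencing: tau(vj) >= 1 + tau(vi)
    model.append(f"{start(j)} >= 1 + {start(i)};")
model.append("int " + ", ".join(f"x{i}_{l}" for i in range(1, 13)
                                for l in range(1, L + 1)) + ";")
print(model[1])  # x1_1 + x1_2 + x1_3 + x1_4 + x1_5 + x1_6 + x1_7 = 1;
```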

Example: Scheduling of DFG (1)

problem description in LP format (lp_solve); latency bound = 7

/*********************/
/* integer variables */
/*********************/
int x1_1, x1_2, x1_3, x1_4, x1_5, x1_6, x1_7;
int x2_1, x2_2, x2_3, x2_4, x2_5, x2_6, x2_7;
int x3_1, x3_2, x3_3, x3_4, x3_5, x3_6, x3_7;
int x4_1, x4_2, x4_3, x4_4, x4_5, x4_6, x4_7;
int x5_1, x5_2, x5_3, x5_4, x5_5, x5_6, x5_7;
int x6_1, x6_2, x6_3, x6_4, x6_5, x6_6, x6_7;
int x7_1, x7_2, x7_3, x7_4, x7_5, x7_6, x7_7;
int x8_1, x8_2, x8_3, x8_4, x8_5, x8_6, x8_7;
int x9_1, x9_2, x9_3, x9_4, x9_5, x9_6, x9_7;
int x10_1, x10_2, x10_3, x10_4, x10_5, x10_6, x10_7;
int x11_1, x11_2, x11_3, x11_4, x11_5, x11_6, x11_7;
int x12_1, x12_2, x12_3, x12_4, x12_5, x12_6, x12_7;

/**********************/
/* variables in {0,1} */
/**********************/
x1_1 <= 1; x1_2 <= 1; x1_3 <= 1; x1_4 <= 1; x1_5 <= 1; x1_6 <= 1; x1_7 <= 1;
x2_1 <= 1; x2_2 <= 1; x2_3 <= 1; x2_4 <= 1; x2_5 <= 1; x2_6 <= 1; x2_7 <= 1;
x3_1 <= 1; x3_2 <= 1; x3_3 <= 1; x3_4 <= 1; x3_5 <= 1; x3_6 <= 1; x3_7 <= 1;
x4_1 <= 1; x4_2 <= 1; x4_3 <= 1; x4_4 <= 1; x4_5 <= 1; x4_6 <= 1; x4_7 <= 1;
x5_1 <= 1; x5_2 <= 1; x5_3 <= 1; x5_4 <= 1; x5_5 <= 1; x5_6 <= 1; x5_7 <= 1;
x6_1 <= 1; x6_2 <= 1; x6_3 <= 1; x6_4 <= 1; x6_5 <= 1; x6_6 <= 1; x6_7 <= 1;
x7_1 <= 1; x7_2 <= 1; x7_3 <= 1; x7_4 <= 1; x7_5 <= 1; x7_6 <= 1; x7_7 <= 1;
x8_1 <= 1; x8_2 <= 1; x8_3 <= 1; x8_4 <= 1; x8_5 <= 1; x8_6 <= 1; x8_7 <= 1;
x9_1 <= 1; x9_2 <= 1; x9_3 <= 1; x9_4 <= 1; x9_5 <= 1; x9_6 <= 1; x9_7 <= 1;
x10_1 <= 1; x10_2 <= 1; x10_3 <= 1; x10_4 <= 1; x10_5 <= 1; x10_6 <= 1; x10_7 <= 1;
x11_1 <= 1; x11_2 <= 1; x11_3 <= 1; x11_4 <= 1; x11_5 <= 1; x11_6 <= 1; x11_7 <= 1;
x12_1 <= 1; x12_2 <= 1; x12_3 <= 1; x12_4 <= 1; x12_5 <= 1; x12_6 <= 1; x12_7 <= 1;

35

Example: Scheduling of DFG (2)


/**************************/
/* uniqueness constraints */
/**************************/
x1_1 + x1_2 + x1_3 + x1_4 + x1_5 + x1_6 + x1_7 = 1;
x2_1 + x2_2 + x2_3 + x2_4 + x2_5 + x2_6 + x2_7 = 1;
x3_1 + x3_2 + x3_3 + x3_4 + x3_5 + x3_6 + x3_7 = 1;
x4_1 + x4_2 + x4_3 + x4_4 + x4_5 + x4_6 + x4_7 = 1;
x5_1 + x5_2 + x5_3 + x5_4 + x5_5 + x5_6 + x5_7 = 1;
x6_1 + x6_2 + x6_3 + x6_4 + x6_5 + x6_6 + x6_7 = 1;
x7_1 + x7_2 + x7_3 + x7_4 + x7_5 + x7_6 + x7_7 = 1;
x8_1 + x8_2 + x8_3 + x8_4 + x8_5 + x8_6 + x8_7 = 1;
x9_1 + x9_2 + x9_3 + x9_4 + x9_5 + x9_6 + x9_7 = 1;
x10_1 + x10_2 + x10_3 + x10_4 + x10_5 + x10_6 + x10_7 = 1;
x11_1 + x11_2 + x11_3 + x11_4 + x11_5 + x11_6 + x11_7 = 1;
x12_1 + x12_2 + x12_3 + x12_4 + x12_5 + x12_6 + x12_7 = 1;

/************************/
/* resource constraints */
/************************/
x1_1 + x2_1 + x3_1 + x4_1 + x6_1 + x7_1 <= 2;
x5_1 + x8_1 + x9_1 + x10_1 + x11_1 <= 2;
x1_2 + x2_2 + x3_2 + x4_2 + x6_2 + x7_2 <= 2;
x5_2 + x8_2 + x9_2 + x10_2 + x11_2 <= 2;
x1_3 + x2_3 + x3_3 + x4_3 + x6_3 + x7_3 <= 2;
x5_3 + x8_3 + x9_3 + x10_3 + x11_3 <= 2;
x1_4 + x2_4 + x3_4 + x4_4 + x6_4 + x7_4 <= 2;
x5_4 + x8_4 + x9_4 + x10_4 + x11_4 <= 2;
x1_5 + x2_5 + x3_5 + x4_5 + x6_5 + x7_5 <= 2;
x5_5 + x8_5 + x9_5 + x10_5 + x11_5 <= 2;
x1_6 + x2_6 + x3_6 + x4_6 + x6_6 + x7_6 <= 2;
x5_6 + x8_6 + x9_6 + x10_6 + x11_6 <= 2;
x1_7 + x2_7 + x3_7 + x4_7 + x6_7 + x7_7 <= 2;
x5_7 + x8_7 + x9_7 + x10_7 + x11_7 <= 2;

/**********************/
/* objective function */
/**********************/
min: 1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7;

36

Example: Scheduling of DFG (3)

/**************************/
/* sequencing constraints */
/**************************/
/* 1->6 */
1 x6_1 + 2 x6_2 + 3 x6_3 + 4 x6_4 + 5 x6_5 + 6 x6_6 + 7 x6_7 >=
1 + 1 x1_1 + 2 x1_2 + 3 x1_3 + 4 x1_4 + 5 x1_5 + 6 x1_6 + 7 x1_7;
/* 2->6 */
1 x6_1 + 2 x6_2 + 3 x6_3 + 4 x6_4 + 5 x6_5 + 6 x6_6 + 7 x6_7 >=
1 + 1 x2_1 + 2 x2_2 + 3 x2_3 + 4 x2_4 + 5 x2_5 + 6 x2_6 + 7 x2_7;

/* 6->10 */
1 x10_1 + 2 x10_2 + 3 x10_3 + 4 x10_4 + 5 x10_5 + 6 x10_6 + 7 x10_7 >=
1 + 1 x6_1 + 2 x6_2 + 3 x6_3 + 4 x6_4 + 5 x6_5 + 6 x6_6 + 7 x6_7;
/* 10->11 */
1 x11_1 + 2 x11_2 + 3 x11_3 + 4 x11_4 + 5 x11_5 + 6 x11_6 + 7 x11_7 >=
1 + 1 x10_1 + 2 x10_2 + 3 x10_3 + 4 x10_4 + 5 x10_5 + 6 x10_6 + 7 x10_7;
/* 11->12 */
1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7 >=
1 + 1 x11_1 + 2 x11_2 + 3 x11_3 + 4 x11_4 + 5 x11_5 + 6 x11_6 + 7 x11_7;
/* 3->7 */
1 x7_1 + 2 x7_2 + 3 x7_3 + 4 x7_4 + 5 x7_5 + 6 x7_6 + 7 x7_7 >=
1 + 1 x3_1 + 2 x3_2 + 3 x3_3 + 4 x3_4 + 5 x3_5 + 6 x3_6 + 7 x3_7;
/* 7->11 */
1 x11_1 + 2 x11_2 + 3 x11_3 + 4 x11_4 + 5 x11_5 + 6 x11_6 + 7 x11_7 >=
1 + 1 x7_1 + 2 x7_2 + 3 x7_3 + 4 x7_4 + 5 x7_5 + 6 x7_6 + 7 x7_7;
/* 4->8 */
1 x8_1 + 2 x8_2 + 3 x8_3 + 4 x8_4 + 5 x8_5 + 6 x8_6 + 7 x8_7 >=
1 + 1 x4_1 + 2 x4_2 + 3 x4_3 + 4 x4_4 + 5 x4_5 + 6 x4_6 + 7 x4_7;
/* 8->12 */
1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7 >=
1 + 1 x8_1 + 2 x8_2 + 3 x8_3 + 4 x8_4 + 5 x8_5 + 6 x8_6 + 7 x8_7;
/* 5->9 */
1 x9_1 + 2 x9_2 + 3 x9_3 + 4 x9_4 + 5 x9_5 + 6 x9_6 + 7 x9_7 >=
1 + 1 x5_1 + 2 x5_2 + 3 x5_3 + 4 x5_4 + 5 x5_5 + 6 x5_6 + 7 x5_7;
/* 9->12 */
1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7 >=
1 + 1 x9_1 + 2 x9_2 + 3 x9_3 + 4 x9_4 + 5 x9_5 + 6 x9_6 + 7 x9_7;

37

Example: DFG - Optimal Schedule

latency = 4

[optimal schedule found by the ILP: time steps 1-4; start node v0 (NOP) at time 0, end node v12 (NOP) in step 5]
38

Generalized and Optimized Formulation (1)

compute the earliest (li = τ_ASAP(vi)) and latest (hi = τ_ALAP(vi)) scheduling time for each node vi using ASAP and ALAP scheduling (without resource constraints)

binary variables
x_{i,t} ∈ {0,1}  ∀vi ∈ V, ∀t: li ≤ t ≤ hi

uniqueness constraints
Σ_{t=li}^{hi} x_{i,t} = 1  ∀vi ∈ V

scheduling time
Σ_{t=li}^{hi} t · x_{i,t} = τ(vi)  ∀vi ∈ V

Question: How to compute an ASAP/ALAP schedule with ILP?

39

Generalized and Optimized Formulation (2)

sequencing
τ(vj) - τ(vi) ≥ di  ∀(vi, vj) ∈ E

resource constraints (di = 1 not required anymore)

Σ_{i:(vi,rk) ∈ ER}  Σ_{p=max{0, t-hi}}^{min{di-1, t-li}} x_{i,t-p} ≤ α(rk)
∀rk ∈ VT,  min_{i=1,…,|V|}{li} ≤ t ≤ max_{i=1,…,|V|}{hi}

cost function

min Σ_l l · x_{n,l}

40

Generalized and Optimized Formulation (3)

explanation of the generalized resource constraints (slightly simplified: more complex summation bounds are used to avoid summing undefined variables)

Σ_{p=0}^{di-1} x_{i,t-p} = 1 if τ(vi) ≤ t ≤ τ(vi) + di - 1, 0 otherwise

e.g. assume node v1 is executed on a resource with d1 = 3; then the resource usage caused by v1 in time steps t = 1, 2, 3, … is:

t = 1: x_{1,1} (resource busy if v1 has been started at t=1)
t = 2: x_{1,2} + x_{1,1} (resource busy if v1 has just been started at t=2 or still occupied if started at t=1)
t = 3: x_{1,3} + x_{1,2} + x_{1,1} (resource busy if v1 has just been started at t=3 or still occupied if started at t=1 or t=2)
t = 4: x_{1,4} + x_{1,3} + x_{1,2} (resource busy if v1 has just been started at t=4 or still occupied if started at t=2 or t=3)

Exercise 5.1: Architecture Synthesis

41
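The windowed sum can be played through in a few lines of Python (d1 = 3 as above; the start time t = 2 is chosen arbitrarily):

```python
# Sketch: occupancy of one resource by a single operation v1 with d1 = 3.
# x[t] = 1 iff v1 starts in step t; usage(t) = sum_{p=0}^{d1-1} x[t-p].
d1 = 3
x = {t: 0 for t in range(1, 8)}
x[2] = 1                                   # assume v1 starts in step t = 2
usage = {t: sum(x.get(t - p, 0) for p in range(d1)) for t in range(1, 8)}
busy = [t for t in range(1, 8) if usage[t] == 1]
print(busy)  # [2, 3, 4]: the resource is occupied for d1 consecutive steps
```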

Iterative Scheduling
most applications execute a dataflow graph not just once but in a loop
generation and optimization of periodic schedules
execution in pipelined iterations for optimizing performance

parameters
period / iteration interval / initiation interval (P): time between the starts of operation vi in two successive iterations
latency (L): time between start of first and last operation within one iteration

[two Gantt charts of iterations of operations v1 … v4 over time: left with L = 4, P = 1; right with L = 4, P = 2]

42

Iterative Scheduling

iterative schedule examples, visualized with Gantt charts (resources r1, r2 over time; from Teich & Haubelt 2007):

sequential processing of successive iterations (P = L = 9; Abb. 4.17)

schedule with functional pipelining, i.e. operations from different iterations are processed in one period (P = 7; Abb. 4.18)

overlapping schedule with functional pipelining, i.e. operations from an iteration may overlap the border of the period (P = 6; Abb. 4.19)

Teich & Haubelt 2007

43

Overview
translation of basic blocks
formal modeling
sequence graph, resource graph
cost function, execution times
synthesis problems: allocation, binding, scheduling

scheduling algorithms
ASAP, ALAP
extended ASAP, ALAP
list scheduling

optimal scheduling
introduction to integer linear programming (ILP)
scheduling with ILP

translation of complete programs


finite state machine and data path
micro-coded controllers

44

Motivation
translate program/algorithm into dedicated hardware
units of translation
single basic block (combinational)
complete program (sequential)

 1: int gcd(int a, int b) {
 2:   while(a != b) {
 3:     if(a > b) {
 4:       a = a - b;
 5:     } else {
 6:       b = b - a;
 7:     }
 8:   }
 9:   return a;
10: }

[block diagram: finite state machine (controller) next to registers and functional units (FU) forming the data path]

FSM controls the data path (FSM actions)
data path sends feedback to FSM (FSM conditions)

45

Translating SW to HW: High-Level Synthesis

basic idea: execute the control data flow graph (CDFG)
control flow graph
describes possible execution sequences of basic blocks
implement CFG as finite state machine (FSM)

data flow graph
describes arithmetic operations on basic block level
implement data flow graphs as data paths
use registers for storing and transferring data between basic blocks

[block diagram: finite state machine (controller) next to registers and functional units (FU) forming the data path]

46

Example Program: Greatest Common Divisor (GCD)

 1: int gcd(int a, int b) {
 2:   while(a != b) {
 3:     if(a > b) {
 4:       a = a - b;
 5:     } else {
 6:       b = b - a;
 7:     }
 8:   }
 9:   return a;
10: }

47

Control Data Flow Graph for GCD

BB0:
  br label %BB1

BB1:
  %b1 = phi i32 [ %b2, %BB5 ], [ %b, %BB0 ]
  %a1 = phi i32 [ %a2, %BB5 ], [ %a, %BB0 ]
  br label %BB2

BB2:
  %a2 = phi i32 [ %a1, %BB1 ], [ %a3, %BB4 ]
  %whilecond = icmp eq i32 %b1, %a2
  br i1 %whilecond, label %BB6, label %BB3

BB3:
  %ifcond = icmp sgt i32 %a2, %b1
  br i1 %ifcond, label %BB4, label %BB5

BB4:
  %a3 = sub i32 %a2, %b1
  br label %BB2

BB5:
  %b2 = sub i32 %b1, %a2
  br label %BB1

BB6:
  ret i32 %a2

generated with Clang/LLVM

48

Control Data Flow Graph for GCD

[same control data flow graph as on slide 48, with the synthesis-relevant parts highlighted]

data path
predicates (flags) for conditional jumps (input to controller FSM)
conditional state transitions (evaluated by controller FSM)
phi nodes define registers + input multiplexers (other values are temporary)
49

Data Path for GCD

[data path schematic: registers a1, a2, b1 with input multiplexers (phi nodes), driven by the controller's enable/select signals ena1/sela1, ena2/sela2, enb1/selb1; comparators a2 > b1 (ifcond) and b1 == a2 (whilecond) compute the predicate flags]

input signals from controller: enable/select signals of the phi-node registers
output signals to controller: predicates (flags) ifcond, whilecond
50

Data Path for GCD (Block Diagram)

[block diagram: data path with control inputs sela1, ena1, sela2, ena2, selb1, enb1; outputs ifcond, whilecond and the result (a2)]

51

Control Data Flow Graph for GCD

BB0:
  br label %BB1

BB1:
  %b1 = phi i32 [ %b2, %BB5 ], [ %b, %BB0 ]
  %a1 = phi i32 [ %a2, %BB5 ], [ %a, %BB0 ]
  br label %BB2

BB2:
  %a2 = phi i32 [ %a1, %BB1 ], [ %a3, %BB4 ]
  %whilecond = icmp eq i32 %b1, %a2
  br i1 %whilecond, label %BB6, label %BB3

BB3:
  %ifcond = icmp sgt i32 %a2, %b1
  br i1 %ifcond, label %BB4, label %BB5

BB4:
  %a3 = sub i32 %a2, %b1
  br label %BB2

BB5:
  %b2 = sub i32 %b1, %a2
  br label %BB1

BB6:
  ret i32 %a2

52

Control FSM for GCD

states 0-6, one per basic block; transitions:

0 -> 1: a01 (unconditional)
1 -> 2: a12 (unconditional)
2 -> 6: whilecond / a26
2 -> 3: !whilecond / a23
3 -> 4: ifcond / a34
3 -> 5: !ifcond / a35
4 -> 2: a42 (unconditional)
5 -> 1: a51 (unconditional)

edge labels: c / a, where c = condition for the transition and a = action executed during the transition


53
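The FSM and data path can be simulated together. The sketch below encodes the transition table in plain Python; the register updates are my reading of the CDFG's phi nodes, so treat the per-state actions as an illustration rather than the exact hardware:

```python
# Sketch: FSMD simulation of the GCD controller + data path.
def gcd_fsmd(a, b):
    a1 = a2 = b1 = 0            # data path registers; a2 also holds the result
    state = 0
    while state != 6:
        if state == 0:          # a01: load inputs (phi values from BB0)
            a1, b1 = a, b
            state = 1
        elif state == 1:        # a12: phi in BB2 selects a1
            a2 = a1
            state = 2
        elif state == 2:        # whilecond = (b1 == a2)
            state = 6 if b1 == a2 else 3
        elif state == 3:        # ifcond = (a2 > b1)
            state = 4 if a2 > b1 else 5
        elif state == 4:        # a42: a3 = a2 - b1, fed back via phi
            a2 = a2 - b1
            state = 2
        elif state == 5:        # a51: b2 = b1 - a2; phi in BB1 selects a2, b2
            a1, b1 = a2, b1 - a2
            state = 1
    return a2                   # BB6: ret a2

print(gcd_fsmd(12, 8))  # 4
```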

Transition Table and Actions

current state | condition  | next state | action | sela1 | ena1 | sela2 | ena2 | selb1 | enb1
0             | -          | 1          | a01    | a     | 1    | -     | -    | b     | 1
1             | -          | 2          | a12    | -     | -    | a1    | 1    | -     | -
2             | whilecond  | 6          | a26    | -     | -    | -     | -    | -     | -
2             | !whilecond | 3          | a23    | -     | -    | -     | -    | -     | -
3             | ifcond     | 4          | a34    | -     | -    | -     | -    | -     | -
3             | !ifcond    | 5          | a35    | -     | -    | -     | -    | -     | -
4             | -          | 2          | a42    | -     | -    | a3    | 1    | -     | -
5             | -          | 1          | a51    | a2    | 1    | -     | -    | b2    | 1

the sel/en columns are the input signals to the data path (mux selections follow the phi nodes of the CDFG)

transition table and data path can be easily translated to a hardware description language (e.g. VHDL or Verilog)
54

Control FSM for GCD (Block Diagram)

[block diagram: state register plus next-state and output logic; inputs ifcond, whilecond; outputs sela1, ena1, sela2, ena2, selb1, enb1]

55

Limitations and Potential Improvements


limitations of this basic high-level synthesis approach

restricted subset of C only (no function calls, memory access, …)


limited amount of parallelism (CDFG is executed sequentially)
no sharing of operators in data path
single cycle execution model (each basic block is executed in exactly one
cycle)
FSMs can become very complex

features of more advanced high-level synthesis methods

sharing of operators in data path


program transformations to increase parallelism
loop pipelining and retiming

many tools available


commercial: e.g. Impulse C, Mentor Catapult C, AutoESL
free/open source: e.g. ROCCC, SPARK, LegUp, C-to-Verilog

56

Microprogrammed Architectures
two ends of the spectrum for implementing applications
CPU: generic fully programmable architecture, application can be easily
varied after fabrication
ASIC/FPGA: highly specialized, application is fixed at design time

middle ground: Microprogrammed Architectures


also tailored to particular application or class of applications
but the controller is programmable (instead of a fixed FSM)

57

FSMD vs. Microprogrammed Architecture

finite state machine with data path (FSMD): fixed architecture
[block diagram: state register + next-state logic drive the data path; the data path returns status signals]

micro-programmed machine: programmable architecture
[block diagram: control store holds microinstructions, addressed by the CSAR (control store address register); next-address logic evaluates the jump field and status signals to compute the next CSAR value; the command field drives the data path]

Fig. 5.3: In contrast to FSM-based control, microprogramming uses a flexible control scheme.

CSAR (control store address register)
corresponds to the instruction counter in a CPU

Schaumont 2010

58

Microinstruction Encoding
microinstruction specifies
commands for the data path (datapath command field)
jump field:
how to compute the address of the next microinstruction based on feedback from the data path
optional address constant

32-bit microinstruction (Fig. 5.4): datapath command | next (4 bits) | address (12 bits)

address: absolute address of a microinstruction
here: address field width is 12 bits, hence 4096 microinstructions can be addressed

jump field (next):
0000  Default           CSAR = CSAR + 1
0001  Jump              CSAR = address
0010  Jump if carry     CSAR = cf ? address : CSAR + 1
1010  Jump if no carry  CSAR = cf ? CSAR + 1 : address
0100  Jump if zero      CSAR = zf ? address : CSAR + 1
1100  Jump if not zero  CSAR = zf ? CSAR + 1 : address

[block diagram: next-address logic + CSAR + control store; the microinstruction splits into datapath command and jump field; flags cf, zf feed back from the data path]

Fig. 5.4 Sample format for a 32-bit microinstruction word

Schaumont 2010

59
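The jump-field semantics from the table above can be sketched directly (4-bit next code, carry flag cf, zero flag zf):

```python
# Sketch: next-address logic for the CSAR, following the jump-field table.
def next_csar(nxt, csar, address, cf, zf):
    if nxt == 0b0000: return csar + 1                     # default
    if nxt == 0b0001: return address                      # jump
    if nxt == 0b0010: return address if cf else csar + 1  # jump if carry
    if nxt == 0b1010: return csar + 1 if cf else address  # jump if no carry
    if nxt == 0b0100: return address if zf else csar + 1  # jump if zero
    if nxt == 0b1100: return csar + 1 if zf else address  # jump if not zero
    raise ValueError("unused jump-field encoding")

print(next_csar(0b0100, 10, 100, cf=0, zf=1))  # 100 (jump if zero taken)
```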

Command Field: Horizontal vs. Vertical Microcode

(from the figure caption: each of the instructions enumerated below can be implemented as a combination of control bit values for the multiplexers and the adder/subtractor; the controller shows two possible encodings of the three instructions, a horizontal encoding and a vertical encoding)

tradeoff between code density and decoding effort

horizontal microcode
microinstruction directly contains the bits to control the data path
no decoding required
low code density

vertical microcode
microinstructions contain an encoded form of the data path control signals
decoding required (decoder between controller and data path)
higher code density

example (Fig. 5.5): data path with multiplexers sel1, sel2 and an adder/subtractor

micro-instruction | horizontal microcode (3 bits) | vertical microcode (2 bits)
a = 2*a           | 0 0 0                         | 0 0
a = a - 1         | 0 1 1                         | 0 1
a = IN            | 1 0 0                         | 1 0

Fig. 5.5 Example of vertical versus horizontal microprogramming

Schaumont 2010

60

Example: A Microcoded Datapath

the ALU combines an operand from the accumulator (ACC) with an operand from the register file or the input port; the result of the operation is returned to the register file or the accumulator

microinstruction format (unused | SBUS | ALU | Shifter | Dest | Nxt | Address)

[datapath: register file, input port and ACC drive the S-Bus; ALU + shifter; next-address logic, CSAR and control store form the controller; flags feed back]

Fig. 5.7 A microprogrammed datapath

same microinstruction format as in the previous example
4 data path units to be controlled
SBUS: select operand
ALU: choose ALU operation
Shifter: optional bit shift of ALU result
Dest: select location for storing the result
Schaumont 2010

61

Example: Microinstruction Encoding

Table 5.1 Microinstruction encoding of the example machine

SBUS (4 bits): selects the operand that will drive the S-Bus
0000 R0    0101 R5
0001 R1    0110 R6
0010 R2    0111 R7
0011 R3    1000 Input
0100 R4    1001 Address/Constant

ALU (4 bits): selects the operation performed by the ALU
0000 ACC           0110 ACC | S-Bus
0001 S-Bus         0111 not S-Bus
0010 ACC + S-Bus   1000 ACC + 1
0011 ACC - S-Bus   1001 S-Bus - 1
0100 S-Bus - ACC   1010 0
0101 ACC & S-Bus   1011 1

Shifter (3 bits): selects the function of the programmable shifter
000 logical SHL(ALU)   100 arith SHL(ALU)
001 logical SHR(ALU)   101 arith SHR(ALU)
010 rotate left ALU    111 ALU (no shift)
011 rotate right ALU

Dest (4 bits): selects the target that will store the S-Bus
0000 R0    0101 R5
0001 R1    0110 R6
0010 R2    0111 R7
0011 R3    1000 ACC
0100 R4    1111 unconnected

Nxt (4 bits): selects the next value for CSAR
0000 CSAR + 1                  1010 cf ? CSAR + 1 : Address
0001 Address                   0100 zf ? Address : CSAR + 1
0010 cf ? Address : CSAR + 1   1100 zf ? CSAR + 1 : Address

available microinstructions and their encoding

Schaumont 2010

As an example, let us develop a microprogram that reads two numbers from the input port and that evaluates their greatest common divisor (GCD) using Euclid's algorithm. The first step is to develop a microprogram in terms of register transfers.

62

Example: How to Encode a Microinstruction

how to encode ACC <- R2?

RT-level instruction: ACC <- R2

micro-instruction field encoding:
SBUS = 0010 (R2)
ALU = 0001 (S-Bus)
Shifter = 111 (ALU)
Dest = 1000 (ACC)
Nxt = 0000 (CSAR + 1)
Address = 000000000000

micro-instruction formation (unused bit + fields):
{0, 0010, 0001, 111, 1000, 0000, 000000000000}
regrouped into nibbles:
{0001, 0000, 1111, 1000, 0000, 0000, 0000, 0000}

micro-instruction encoding: 10F80000 (hex)

Fig. 5.8 Forming microinstructions from register-transfer instructions

Schaumont 2010
63
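The nibble regrouping above is just bit concatenation; the same word can be assembled by shifting the fields together (field order and widths as on this slide: 1 unused bit, SBUS 4, ALU 4, Shifter 3, Dest 4, Nxt 4, Address 12):

```python
# Assemble the 32-bit microinstruction word for ACC <- R2 from its fields.
fields = [(0, 1),        # unused bit
          (0b0010, 4),   # SBUS = R2
          (0b0001, 4),   # ALU = S-Bus
          (0b111, 3),    # Shifter = ALU (no shift)
          (0b1000, 4),   # Dest = ACC
          (0b0000, 4),   # Nxt = CSAR + 1
          (0, 12)]       # Address = 0
word = 0
for value, width in fields:
    word = (word << width) | value
print(f"{word:08X}")  # 10F80000
```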

Example: Implementing GCD on this Architecture

 1: int gcd(int a, int b) {
 2:   while(a != b) {
 3:     if(a > b) {
 4:       a = a - b;
 5:     } else {
 6:       b = b - a;
 7:     }
 8:   }
 9:   return a;
10: }

;          Command Field    || Jump Field
; --------------------------------------------------------------------------
1:         IN -> R0                             ; read a, store in R0
2:         IN -> ACC                            ; read b, store in ACC
3: Lcheck: R0 - ACC         || JUMP_IF_Z Ldone  ; check while condition
4:         (R0 - ACC) << 1  || JUMP_IF_C Lsmall ; check whether R0 < ACC
5:         R0 - ACC -> R0   || JUMP Lcheck      ; if not, R0 - ACC -> R0
6: Lsmall: ACC - R0 -> ACC  || JUMP Lcheck      ; else ACC - R0 -> ACC
7: Ldone:                   || JUMP Ldone       ; infinite loop, end of prog

Schaumont 2010

64
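The register transfers of the microprogram above can be simulated step by step. The sketch models only R0, ACC and the two flags; reading the carry as the sign bit shifted out in line 4 is my interpretation of the listing, not taken verbatim from the source:

```python
# Sketch: simulate the GCD microprogram's register transfers.
def microprogram_gcd(a, b, width=32):
    mask = (1 << width) - 1
    R0, ACC = a, b                             # lines 1-2: IN -> R0, IN -> ACC
    while True:
        if (R0 - ACC) & mask == 0:             # line 3: zero flag -> Ldone
            return R0                          # line 7: Ldone (result in R0)
        if (((R0 - ACC) & mask) << 1) > mask:  # line 4: shifted-out sign bit
            ACC = (ACC - R0) & mask            # line 6: Lsmall (R0 < ACC)
        else:
            R0 = (R0 - ACC) & mask             # line 5 (R0 > ACC)

print(microprogram_gcd(12, 8))  # 4
```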

Changes
2015-05-28 (v1.4.0)
updated for SS2015

2014-05-22 (v1.3.1)
explain on slide 41 that di=1 no longer required

2014-05-06 (v1.3.0)
updated for SS2014

65

Changes
2013-06-18 (v1.2.4)
fixed label of basic blocks B4 and B5 also on p.52

2013-06-06 (v1.2.3)
cosmetic changes
fixed index in equation on p.41
fixed label of basic blocks B4 and B5 on p.48 + p.49

2013-05-23 (v1.2.2)
clarified assumption of unit delay on p.33

2013-05-16 (v1.2.1)
fix typo on slide 19, terminate algorithm when v0 is scheduled (not vn)

2013-05-13 (v1.2.0)
updated for SS2013, merged all architecture synthesis materials into a
single presentation

66
