
Architecture Synthesis

SS 2015
Jun.-Prof. Dr. Christian Plessl
Custom Computing
University of Paderborn
Version 1.4.0 2015-05-28

Motivation
translate program/algorithm into dedicated hardware
units of translation
single basic block (combinational)
complete program (sequential)

 1: int gcd(int a, int b) {
 2:   while(a != b) {
 3:     if(a > b) {
 4:       a = a - b;
 5:     } else {
 6:       b = b - a;
 7:     }
 8:   }
 9:   return a;
10: }

[block diagram: finite state machine (controller) next to registers and functional units (FU) forming the data path]

FSM controls the data path (FSM actions)
data path sends feedback to FSM (FSM conditions)

Overview
translation of basic blocks
formal modeling
sequence graph, resource graph
cost function, execution times
synthesis problems: allocation, binding, scheduling

scheduling algorithms
ASAP, ALAP
extended ASAP, ALAP
list scheduling

optimal scheduling
introduction to integer linear programming (ILP)
scheduling with ILP

translation of complete programs


finite state machine and data path
micro-coded controllers

Sequencing Graph

GS(VS, ES)
VS = {v0, v1, …, v12}

[sequencing graph: start node v0 (NOP), operation nodes v1 … v11 labeled with their operations (*, +, -, <), end node v12 (NOP)]

Resource Graph

GR(VR, ER)
set of nodes VR = VS ∪ VT
VS are the nodes of the sequencing graph (without NOPs)
VT represent resource types (adder, multiplier, ALU, …)
set of edges (vS, vT) ∈ ER with vS ∈ VS, vT ∈ VT
an instance of resource type vT can be used to implement operation vS

Resource Graph Example

VT = {vMUL, vALU}

[bipartite resource graph: the multiplication nodes v1, v2, v3, v4, v6, v7 connect to resource type MUL; the +, - and < nodes v5, v8, v9, v10, v11 connect to resource type ALU]

Cost, Execution Time

cost function c: VT → Z
assigns a cost value to each resource type
example:
c(vMUL) = 8
c(vALU) = 4

execution times w: ER → Z+
assigns the execution time of operation vS ∈ VS on resource type vT ∈ VT to the edge (vS, vT) ∈ ER
example:
w(v2, vMUL) = 1

Allocation, Binding

allocation α: VT → Z+
assigns a number α(vT) of available instances to each resource type vT

binding is given by the two functions
β: VS → VT and γ: VS → Z+
β(vS) = vT means that operation vS is implemented by resource type vT
(possible βs are shown in the resource graph)
γ(vS) = r denotes that vS is implemented by the r-th instance of vT; r ≤ α(vT)

Allocation, Binding Example (1)

c(vMUL) = 8
α(vMUL) = 3

w(v1, vMUL) = 1
w(v2, vMUL) = 1
w(v3, vMUL) = 1
w(v4, vMUL) = 1
w(v6, vMUL) = 1
w(v7, vMUL) = 1

β(v1) = vMUL
β(v2) = vMUL
β(v3) = vMUL
β(v4) = vMUL
β(v6) = vMUL
β(v7) = vMUL

γ(v1) = 1
γ(v2) = 2
γ(v3) = 3
γ(v4) = 1
γ(v6) = 2
γ(v7) = 3

[resource graph: multiplication nodes v1, v2, v3, v4, v6, v7 bound to the three MUL instances]

10

Allocation, Binding Example (2)

c(vALU) = 4
α(vALU) = 2

w(v5, vALU) = 1
w(v8, vALU) = 1
w(v9, vALU) = 1
w(v10, vALU) = 1
w(v11, vALU) = 1

β(v5) = vALU
β(v8) = vALU
β(v9) = vALU
β(v10) = vALU
β(v11) = vALU

γ(v5) = 1
γ(v8) = 1
γ(v9) = 1
γ(v10) = 2
γ(v11) = 1

[resource graph: nodes v5, v8, v9, v10, v11 bound to the two ALU instances]

11

Schedule
schedule τ: VS → Z+
assigns a start time to each operation under the constraint

τ(vj) - τ(vi) ≥ w(vi, β(vi))  ∀(vi, vj) ∈ ES

latency L of a scheduled sequencing graph
difference in start times between end node and start node
L = τ(vn) - τ(v0)

12
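The precedence constraint above can be checked mechanically. As a plain-Python illustration (the graph, weights, and schedule below are made-up toy values, not the lecture's example):

```python
# Sketch: verify the schedule constraint tau(v_j) - tau(v_i) >= w(v_i, beta(v_i))
# for every sequencing edge. Graph, weights and schedule are toy values.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]   # hypothetical sequencing graph
w = {0: 0, 1: 1, 2: 1, 3: 0}               # w(v_i, beta(v_i)); NOPs take no time
tau = {0: 1, 1: 1, 2: 1, 3: 2}             # candidate schedule

valid = all(tau[j] - tau[i] >= w[i] for (i, j) in edges)
print(valid)  # True: every edge respects the execution time of its source
```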

Schedule - Example
NOP

time

*
*
-

10

+
*
+

<

11

NOP

12

L = (v12) - (v0) = 5-1= 4


13

Synthesis
allocation, binding, scheduling
finding (α, β, γ, τ) that optimize latency and cost under resource and
timing constraints
algorithms for architecture synthesis are discussed in, e.g.
J. Teich, C. Haubelt, Digitale Hardware/Software-Systeme, Springer 2007
G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill 1994

synthesis problem variants


multicycle operations, operator chaining
several possible resource types for an operation
iterative schedules, pipelining

in the following
scheduling without resource constraints
ASAP, ALAP

scheduling under resource constraints


extended ASAP, ALAP
list scheduling
14

Overview
translation of basic blocks
formal modeling
sequence graph, resource graph
cost function, execution times
synthesis problems: allocation, binding, scheduling

scheduling algorithms
ASAP, ALAP
extended ASAP, ALAP
list scheduling

optimal scheduling
introduction to integer linear programming (ILP)
scheduling with ILP

translation of complete programs


finite state machine and data path
micro-coded controllers

15

Scheduling without Resource Constraints


ASAP (as soon as possible)
determines the earliest possible start times for the operations
minimal latency

ALAP (as late as possible)


determines the latest possible start times for the operations
under a given latency bound

slack (mobility) of operations

difference of start times: (ALAP with ASAP latency bound) - ASAP
if slack = 0, the operation is on the critical path

16

ASAP Scheduling
ASAP: as soon as possible scheduling
algorithm

ASAP( GS(V,E) ) {
  schedule v0 by setting τ(v0) = 1
  repeat {
    select a vertex vj whose predecessors are all scheduled
    schedule vj by setting τ(vj) = max over i:(vi,vj)∈E of { τ(vi) + w(vi, β(vi)) }
  } until (vn is scheduled)
  return τ
}

17
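The procedure above can be sketched in Python. The edge list below encodes the lecture's example sequencing graph (the operation-to-operation edges are taken from the sequencing constraints in the ILP listing later in this chapter; the fan-out edges from the NOP source v0 and unit execution times are assumptions):

```python
# ASAP sketch for the example sequencing graph (unit delays, NOPs take no time)
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
         (1, 6), (2, 6), (3, 7), (4, 8), (5, 9),
         (6, 10), (7, 11), (10, 11), (8, 12), (9, 12), (11, 12)]
w = {v: 1 for v in range(13)}
w[0] = w[12] = 0                          # NOP source/sink take no time

def asap(edges, w, v0=0, vn=12):
    preds = {v: [i for (i, j) in edges if j == v] for v in range(vn + 1)}
    tau = {v0: 1}                         # schedule v0 by setting tau(v0) = 1
    while vn not in tau:                  # until the end node is scheduled
        for v in range(vn + 1):
            if v not in tau and all(p in tau for p in preds[v]):
                tau[v] = max(tau[p] + w[p] for p in preds[v])
    return tau

tau = asap(edges, w)
print(tau[12] - tau[0])  # minimal latency L = 4
```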

ASAP Example

[sequencing graph scheduled as early as possible: time steps 1-4; end node v12 (NOP) in step 5]

minimal latency: L = 4
18

ALAP Scheduling
ALAP: as late as possible scheduling
requires a latency bound L
otherwise nodes could be arbitrarily delayed
typically the schedule length of the ASAP schedule is used as latency bound

algorithm

ALAP( GS(V,E), L ) {
  schedule vn by setting τ(vn) = L + 1
  repeat {
    select a vertex vi whose successors are all scheduled
    schedule vi by setting τ(vi) = min over j:(vi,vj)∈E of { τ(vj) } - w(vi, β(vi))
  } until (v0 is scheduled)
  return τ
}

19

ALAP Example

latency bound: L = 4

[sequencing graph scheduled as late as possible: start node v0 (NOP) at time 0; operations shifted to their latest start times; end node v12 (NOP) in step 5]
20

Slack (mobility)

[ASAP and ALAP schedules side by side, annotated with the slack of each operation: v1, v2, v6, v10, v11 have slack 0 (critical path); v3, v7 have slack 1; v4, v5, v8, v9 have slack 2]

21
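Combining ASAP and ALAP start times yields the mobility directly. The sketch below uses the same example graph as before (v0 fan-out edges and unit delays are assumptions) and the ASAP latency as the ALAP bound:

```python
# Sketch: slack = ALAP start time - ASAP start time, per operation.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
         (1, 6), (2, 6), (3, 7), (4, 8), (5, 9),
         (6, 10), (7, 11), (10, 11), (8, 12), (9, 12), (11, 12)]
w = {v: 1 for v in range(13)}
w[0] = w[12] = 0                          # NOP nodes take no time

def asap(edges, w, v0=0, vn=12):
    preds = {v: [i for (i, j) in edges if j == v] for v in range(vn + 1)}
    tau = {v0: 1}
    while vn not in tau:
        for v in range(vn + 1):
            if v not in tau and all(p in tau for p in preds[v]):
                tau[v] = max(tau[p] + w[p] for p in preds[v])
    return tau

def alap(edges, w, L, v0=0, vn=12):
    succs = {v: [j for (i, j) in edges if i == v] for v in range(vn + 1)}
    tau = {vn: L + 1}                     # schedule vn at L + 1
    while v0 not in tau:
        for v in range(vn + 1):
            if v not in tau and succs[v] and all(s in tau for s in succs[v]):
                tau[v] = min(tau[s] for s in succs[v]) - w[v]
    return tau

early = asap(edges, w)
late = alap(edges, w, L=early[12] - early[0])  # ASAP latency as bound
slack = {v: late[v] - early[v] for v in range(1, 12)}
print(slack)  # operations with slack 0 lie on the critical path
```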

Scheduling under Resource Constraints (1)

Extended ASAP, ALAP


first ASAP or ALAP
then move operations down (ASAP) or up (ALAP) until resource
constraints are satisfied

22

Extended ALAP

resource constraints: α(vMUL) = 2, α(vALU) = 2
latency bound: L = 4

[ALAP schedule with operations moved up until at most 2 MUL and 2 ALU operations execute per time step; end node v12 (NOP)]

23

Scheduling under Resource Constraints (2)

list scheduling
operations are prioritized according to some criterion, e.g. slack

time = 1
repeat
  for each resource type (vT, α(vT))
    determine all ready operations vS with β(vS) = vT and
    schedule the one with the highest priority
  time++
until (vn is scheduled)

24
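The loop above can be sketched in Python. This is a simplified reading, not the lecture's reference implementation: it uses the number of direct successors as priority (the example on the next slide counts successors slightly differently) and assumes unit delays; with the example graph and α(vMUL) = α(vALU) = 1 it reproduces the latency of 7 shown on the following slides.

```python
# List scheduling sketch: one operation per resource type per time step.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
         (1, 6), (2, 6), (3, 7), (4, 8), (5, 9),
         (6, 10), (7, 11), (10, 11), (8, 12), (9, 12), (11, 12)]
w = {v: 1 for v in range(13)}
w[0] = w[12] = 0                                  # NOP nodes take no time

def list_schedule(edges, w, restype, alloc, v0=0, vn=12):
    preds = {v: [i for (i, j) in edges if j == v] for v in range(vn + 1)}
    succs = {v: [j for (i, j) in edges if i == v] for v in range(vn + 1)}
    prio = {v: len(succs[v]) for v in range(vn + 1)}   # priority criterion
    tau = {v0: 1}
    def done(v, t):                                    # finished by time t
        return v in tau and tau[v] + w[v] <= t
    t = 1
    while vn not in tau:
        for rt, k in alloc.items():                    # each resource type
            ready = [v for v in range(vn + 1)
                     if v not in tau and restype.get(v) == rt
                     and all(done(p, t) for p in preds[v])]
            for v in sorted(ready, key=lambda u: -prio[u])[:k]:
                tau[v] = t                             # schedule k best ops
        if vn not in tau and all(done(p, t) for p in preds[vn]):
            tau[vn] = t                                # end NOP
        t += 1
    return tau

restype = {v: "MUL" for v in (1, 2, 3, 4, 6, 7)}
restype.update({v: "ALU" for v in (5, 8, 9, 10, 11)})
tau = list_schedule(edges, w, restype, {"MUL": 1, "ALU": 1})
print(tau[12] - tau[0])  # latency L = 7
```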

List Scheduling (1)

example
criterion: number of successor nodes
resource constraints: α(vMUL) = 1, α(vALU) = 1

[sequencing graph annotated with the priority of each operation and the resulting execution time steps 1-7]

25

List Scheduling (2)

[resulting schedule: time steps 1-7, at most one MUL and one ALU operation per step; end node v12 in step 8]

L = τ(v12) - τ(v0) = 8 - 1 = 7

Exercise 5.1: Architecture Synthesis
26

Overview
translation of basic blocks
formal modeling
sequence graph, resource graph
cost function, execution times
synthesis problems: allocation, binding, scheduling

scheduling algorithms
ASAP, ALAP
extended ASAP, ALAP
list scheduling

optimal scheduling
introduction to integer linear programming (ILP)
scheduling with ILP

translation of complete programs


finite state machine and data path
micro-coded controllers

27

ILP Basics (1)

Linear Program (LP)
is a special kind of optimization problem

max cᵀx
A x ≤ b
x ≥ 0

with x ∈ Rⁿ, A ∈ R^(m×n), b ∈ R^m, c ∈ Rⁿ

real variables
linear constraints modeled by (in)equations
linear cost function

Integer Linear Program (ILP)
LP where all variables must be integer: x ∈ Zⁿ

0-1 LP (binary LP, ZOLP)
LP where all variables must be binary: x ∈ {0,1}ⁿ

Mixed Integer Linear Program (MILP)
LP where some variables must be integer
28

ILP Basics (2)

example
3 binary variables x1, x2, x3 ∈ {0,1}
constraint that at least two variables must be set to 1:
x1 + x2 + x3 ≥ 2
minimize cost function:
min 5 x1 + 6 x2 + 4 x3

solution (feasible assignments and their costs)

x1 x2 x3 | cost
 1  0  1 |  9   (optimal)
 0  1  1 | 10
 1  1  0 | 11
 1  1  1 | 15

29
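For a model this small the optimum can be verified by brute force; the sketch below simply enumerates all 2³ assignments:

```python
from itertools import product

# Enumerate all 0/1 assignments of (x1, x2, x3), keep the feasible ones
# (x1 + x2 + x3 >= 2) and pick the one minimizing 5*x1 + 6*x2 + 4*x3.
cost, best = min(
    (5 * x1 + 6 * x2 + 4 * x3, (x1, x2, x3))
    for x1, x2, x3 in product((0, 1), repeat=3)
    if x1 + x2 + x3 >= 2
)
print(cost, best)  # 9 (1, 0, 1)
```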

ILP Basics (3)

solving ILPs
ILPs are NP-complete, exponential runtime in the worst case
solved to optimality by branch-and-bound algorithms
(runtime depends on the efficiency of the model)
popular solvers

CPLEX (commercial)
http://www.ibm.com/software/commerce/optimization/cplex-optimizer/
Gurobi (commercial, free academic license)
http://www.gurobi.com/
lp_solve (open source)
http://lpsolve.sourceforge.net/5.5/

challenges
translate the problem into an efficient ILP model
use only linear constraints and cost function
non-linearities of the problem can (possibly) be modeled by several
linear constraints, but this increases the complexity of the ILP

30

Example: Optimal DFG Scheduling

GS(VS, ES)
VS = {v0, v1, …, v12}

[sequencing graph as before: NOP source v0, operation nodes v1 … v11, NOP sink v12]

all operations take one time step
resource constraints:
2 MUL
2 ALU (+, -, <)
find schedule with minimal latency

alternative problem:
minimize resources under latency constraint

31

ILP for Optimal DFG Scheduling (1)

binary variables
x_{i,l} = 1 iff operation vi starts in time step l, i = 1,…,n; l = 1,…,L
we need to define a latency bound L

uniqueness constraints
each operation starts in exactly one time step

Σ_l x_{i,l} = 1,  i = 1,…,n

starting time
the starting time τ(vi) of operation vi can be expressed as

τ(vi) = Σ_l l · x_{i,l}
32

ILP for Optimal DFG Scheduling (2)

simple resource constraints (execution time for operations = 1)
in each time step, use no more resources than available

Σ_{i: β(vi) = vMUL} x_{i,l} ≤ α(vMUL),  l = 1,…,L
Σ_{i: β(vi) = vALU} x_{i,l} ≤ α(vALU),  l = 1,…,L

sequencing constraints
an operation cannot start earlier than the finishing time of its predecessors

Σ_l l · x_{j,l} ≥ w(vi, β(vi)) + Σ_l l · x_{i,l},  ∀ edges (vi, vj)

actually, this basic problem formulation requires the assumption that:
w(vi, β(vi)) = 1
33

ILP for Optimal DFG Scheduling (3)

cost function
minimize start time of last operation

min τ(vn) = min ( Σ_l l · x_{n,l} )

34
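The formulation on the last three slides can be generated mechanically. This sketch emits the model in lp_solve's LP format for the example graph (the edge list is taken from the sequencing constraints shown on the following slides; the variable naming xi_l matches that listing, and the per-variable "xi_l <= 1" bounds are omitted here for brevity):

```python
# Sketch: generate the DFG-scheduling ILP in lp_solve LP format.
L = 7                                     # latency bound
mul_ops = [1, 2, 3, 4, 6, 7]              # operations bound to MUL
alu_ops = [5, 8, 9, 10, 11]               # operations bound to ALU
edges = [(1, 6), (2, 6), (6, 10), (10, 11), (11, 12), (3, 7),
         (7, 11), (4, 8), (8, 12), (5, 9), (9, 12)]

def start(i):                             # tau(v_i) as a linear expression
    return " + ".join(f"{l} x{i}_{l}" for l in range(1, L + 1))

model = [f"min: {start(12)};"]            # objective: minimize tau(v12)
for i in range(1, 13):                    # uniqueness: exactly one start time
    model.append(" + ".join(f"x{i}_{l}" for l in range(1, L + 1)) + " = 1;")
for l in range(1, L + 1):                 # resources: <= 2 MUL, <= 2 ALU per step
    model.append(" + ".join(f"x{i}_{l}" for i in mul_ops) + " <= 2;")
    model.append(" + ".join(f"x{i}_{l}" for i in alu_ops) + " <= 2;")
for i, j in edges:                        # sequencing: tau(vj) >= 1 + tau(vi)
    model.append(f"{start(j)} >= 1 + {start(i)};")
model.append("int " + ", ".join(f"x{i}_{l}" for i in range(1, 13)
                                for l in range(1, L + 1)) + ";")
print(model[1])  # x1_1 + x1_2 + x1_3 + x1_4 + x1_5 + x1_6 + x1_7 = 1;
```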

Example: Scheduling of DFG (1)

problem description in LP format (lp_solve); latency bound = 7

/*********************/
/* integer variables */
/*********************/
int x1_1, x1_2, x1_3, x1_4, x1_5, x1_6, x1_7;
int x2_1, x2_2, x2_3, x2_4, x2_5, x2_6, x2_7;
int x3_1, x3_2, x3_3, x3_4, x3_5, x3_6, x3_7;
int x4_1, x4_2, x4_3, x4_4, x4_5, x4_6, x4_7;
int x5_1, x5_2, x5_3, x5_4, x5_5, x5_6, x5_7;
int x6_1, x6_2, x6_3, x6_4, x6_5, x6_6, x6_7;
int x7_1, x7_2, x7_3, x7_4, x7_5, x7_6, x7_7;
int x8_1, x8_2, x8_3, x8_4, x8_5, x8_6, x8_7;
int x9_1, x9_2, x9_3, x9_4, x9_5, x9_6, x9_7;
int x10_1, x10_2, x10_3, x10_4, x10_5, x10_6, x10_7;
int x11_1, x11_2, x11_3, x11_4, x11_5, x11_6, x11_7;
int x12_1, x12_2, x12_3, x12_4, x12_5, x12_6, x12_7;

/**********************/
/* variables in {0,1} */
/**********************/
x1_1 <= 1; x1_2 <= 1; x1_3 <= 1; x1_4 <= 1; x1_5 <= 1; x1_6 <= 1; x1_7 <= 1;
x2_1 <= 1; x2_2 <= 1; x2_3 <= 1; x2_4 <= 1; x2_5 <= 1; x2_6 <= 1; x2_7 <= 1;
x3_1 <= 1; x3_2 <= 1; x3_3 <= 1; x3_4 <= 1; x3_5 <= 1; x3_6 <= 1; x3_7 <= 1;
x4_1 <= 1; x4_2 <= 1; x4_3 <= 1; x4_4 <= 1; x4_5 <= 1; x4_6 <= 1; x4_7 <= 1;
x5_1 <= 1; x5_2 <= 1; x5_3 <= 1; x5_4 <= 1; x5_5 <= 1; x5_6 <= 1; x5_7 <= 1;
x6_1 <= 1; x6_2 <= 1; x6_3 <= 1; x6_4 <= 1; x6_5 <= 1; x6_6 <= 1; x6_7 <= 1;
x7_1 <= 1; x7_2 <= 1; x7_3 <= 1; x7_4 <= 1; x7_5 <= 1; x7_6 <= 1; x7_7 <= 1;
x8_1 <= 1; x8_2 <= 1; x8_3 <= 1; x8_4 <= 1; x8_5 <= 1; x8_6 <= 1; x8_7 <= 1;
x9_1 <= 1; x9_2 <= 1; x9_3 <= 1; x9_4 <= 1; x9_5 <= 1; x9_6 <= 1; x9_7 <= 1;
x10_1 <= 1; x10_2 <= 1; x10_3 <= 1; x10_4 <= 1; x10_5 <= 1; x10_6 <= 1; x10_7 <= 1;
x11_1 <= 1; x11_2 <= 1; x11_3 <= 1; x11_4 <= 1; x11_5 <= 1; x11_6 <= 1; x11_7 <= 1;
x12_1 <= 1; x12_2 <= 1; x12_3 <= 1; x12_4 <= 1; x12_5 <= 1; x12_6 <= 1; x12_7 <= 1;

35

Example: Scheduling of DFG (2)


/**************************/
/* uniqueness constraints */
/**************************/
x1_1 + x1_2 + x1_3 + x1_4 + x1_5 + x1_6 + x1_7 = 1;
x2_1 + x2_2 + x2_3 + x2_4 + x2_5 + x2_6 + x2_7 = 1;
x3_1 + x3_2 + x3_3 + x3_4 + x3_5 + x3_6 + x3_7 = 1;
x4_1 + x4_2 + x4_3 + x4_4 + x4_5 + x4_6 + x4_7 = 1;
x5_1 + x5_2 + x5_3 + x5_4 + x5_5 + x5_6 + x5_7 = 1;
x6_1 + x6_2 + x6_3 + x6_4 + x6_5 + x6_6 + x6_7 = 1;
x7_1 + x7_2 + x7_3 + x7_4 + x7_5 + x7_6 + x7_7 = 1;
x8_1 + x8_2 + x8_3 + x8_4 + x8_5 + x8_6 + x8_7 = 1;
x9_1 + x9_2 + x9_3 + x9_4 + x9_5 + x9_6 + x9_7 = 1;
x10_1 + x10_2 + x10_3 + x10_4 + x10_5 + x10_6 + x10_7 = 1;
x11_1 + x11_2 + x11_3 + x11_4 + x11_5 + x11_6 + x11_7 = 1;
x12_1 + x12_2 + x12_3 + x12_4 + x12_5 + x12_6 + x12_7 = 1;

/************************/
/* resource constraints */
/************************/
x1_1 + x2_1 + x3_1 + x4_1 + x6_1 + x7_1 <= 2;
x5_1 + x8_1 + x9_1 + x10_1 + x11_1 <= 2;
x1_2 + x2_2 + x3_2 + x4_2 + x6_2 + x7_2 <= 2;
x5_2 + x8_2 + x9_2 + x10_2 + x11_2 <= 2;
x1_3 + x2_3 + x3_3 + x4_3 + x6_3 + x7_3 <= 2;
x5_3 + x8_3 + x9_3 + x10_3 + x11_3 <= 2;
x1_4 + x2_4 + x3_4 + x4_4 + x6_4 + x7_4 <= 2;
x5_4 + x8_4 + x9_4 + x10_4 + x11_4 <= 2;
x1_5 + x2_5 + x3_5 + x4_5 + x6_5 + x7_5 <= 2;
x5_5 + x8_5 + x9_5 + x10_5 + x11_5 <= 2;
x1_6 + x2_6 + x3_6 + x4_6 + x6_6 + x7_6 <= 2;
x5_6 + x8_6 + x9_6 + x10_6 + x11_6 <= 2;
x1_7 + x2_7 + x3_7 + x4_7 + x6_7 + x7_7 <= 2;
x5_7 + x8_7 + x9_7 + x10_7 + x11_7 <= 2;

/**********************/
/* objective function */
/**********************/
min: 1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7;

36

Example: Scheduling of DFG (3)

/**************************/
/* sequencing constraints */
/**************************/
/* 1->6 */
1 x6_1 + 2 x6_2 + 3 x6_3 + 4 x6_4 + 5 x6_5 + 6 x6_6 + 7 x6_7 >=
1 + 1 x1_1 + 2 x1_2 + 3 x1_3 + 4 x1_4 + 5 x1_5 + 6 x1_6 + 7 x1_7;
/* 2->6 */
1 x6_1 + 2 x6_2 + 3 x6_3 + 4 x6_4 + 5 x6_5 + 6 x6_6 + 7 x6_7 >=
1 + 1 x2_1 + 2 x2_2 + 3 x2_3 + 4 x2_4 + 5 x2_5 + 6 x2_6 + 7 x2_7;

/* 6->10 */
1 x10_1 + 2 x10_2 + 3 x10_3 + 4 x10_4 + 5 x10_5 + 6 x10_6 + 7 x10_7 >=
1 + 1 x6_1 + 2 x6_2 + 3 x6_3 + 4 x6_4 + 5 x6_5 + 6 x6_6 + 7 x6_7;
/* 10->11 */
1 x11_1 + 2 x11_2 + 3 x11_3 + 4 x11_4 + 5 x11_5 + 6 x11_6 + 7 x11_7 >=
1 + 1 x10_1 + 2 x10_2 + 3 x10_3 + 4 x10_4 + 5 x10_5 + 6 x10_6 + 7 x10_7;
/* 11->12 */
1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7 >=
1 + 1 x11_1 + 2 x11_2 + 3 x11_3 + 4 x11_4 + 5 x11_5 + 6 x11_6 + 7 x11_7;
/* 3->7 */
1 x7_1 + 2 x7_2 + 3 x7_3 + 4 x7_4 + 5 x7_5 + 6 x7_6 + 7 x7_7 >=
1 + 1 x3_1 + 2 x3_2 + 3 x3_3 + 4 x3_4 + 5 x3_5 + 6 x3_6 + 7 x3_7;
/* 7->11 */
1 x11_1 + 2 x11_2 + 3 x11_3 + 4 x11_4 + 5 x11_5 + 6 x11_6 + 7 x11_7 >=
1 + 1 x7_1 + 2 x7_2 + 3 x7_3 + 4 x7_4 + 5 x7_5 + 6 x7_6 + 7 x7_7;
/* 4->8 */
1 x8_1 + 2 x8_2 + 3 x8_3 + 4 x8_4 + 5 x8_5 + 6 x8_6 + 7 x8_7 >=
1 + 1 x4_1 + 2 x4_2 + 3 x4_3 + 4 x4_4 + 5 x4_5 + 6 x4_6 + 7 x4_7;
/* 8->12 */
1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7 >=
1 + 1 x8_1 + 2 x8_2 + 3 x8_3 + 4 x8_4 + 5 x8_5 + 6 x8_6 + 7 x8_7;
/* 5->9 */
1 x9_1 + 2 x9_2 + 3 x9_3 + 4 x9_4 + 5 x9_5 + 6 x9_6 + 7 x9_7 >=
1 + 1 x5_1 + 2 x5_2 + 3 x5_3 + 4 x5_4 + 5 x5_5 + 6 x5_6 + 7 x5_7;
/* 9->12 */
1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7 >=
1 + 1 x9_1 + 2 x9_2 + 3 x9_3 + 4 x9_4 + 5 x9_5 + 6 x9_6 + 7 x9_7;

37

Example: DFG - Optimal Schedule

latency = 4

[optimal schedule found by the ILP: time steps 1-4; start node v0 (NOP) at time 0, end node v12 (NOP) in step 5]
38

Generalized and Optimized Formulation (1)

compute the earliest (li = τ_ASAP(vi)) and latest (hi = τ_ALAP(vi)) scheduling time for each node vi using ASAP and ALAP scheduling (without resource constraints)

binary variables
x_{i,t} ∈ {0,1}  ∀vi ∈ V, ∀t: li ≤ t ≤ hi

uniqueness constraints
Σ_{t=li}^{hi} x_{i,t} = 1  ∀vi ∈ V

scheduling time
Σ_{t=li}^{hi} t · x_{i,t} = τ(vi)  ∀vi ∈ V

Question: How to compute an ASAP/ALAP schedule with ILP?

39

Generalized and Optimized Formulation (2)

sequencing
τ(vj) - τ(vi) ≥ di  ∀(vi, vj) ∈ E

resource constraints (di = 1 not required anymore)

Σ_{i:(vi,rk) ∈ ER}  Σ_{p=max{0, t-hi}}^{min{di-1, t-li}} x_{i,t-p} ≤ α(rk)
∀rk ∈ VT,  min_{i=1,…,|V|}{li} ≤ t ≤ max_{i=1,…,|V|}{hi}

cost function

min Σ_l l · x_{n,l}

40

Generalized and Optimized Formulation (3)

explanation of the generalized resource constraints (slightly simplified: more complex summation bounds are used to avoid summing undefined variables)

Σ_{p=0}^{di-1} x_{i,t-p} = 1 if τ(vi) ≤ t ≤ τ(vi) + di - 1, 0 otherwise

e.g. assume node v1 is executed on a resource with d1 = 3; then the resource usage caused by v1 in time steps t = 1, 2, 3, … is:

t = 1: x_{1,1} (resource busy if v1 has been started at t=1)
t = 2: x_{1,2} + x_{1,1} (resource busy if v1 has just been started at t=2 or still occupied if started at t=1)
t = 3: x_{1,3} + x_{1,2} + x_{1,1} (resource busy if v1 has just been started at t=3 or still occupied if started at t=1 or t=2)
t = 4: x_{1,4} + x_{1,3} + x_{1,2} (resource busy if v1 has just been started at t=4 or still occupied if started at t=2 or t=3)

Exercise 5.1: Architecture Synthesis

41
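The windowed sum can be played through in a few lines of Python (d1 = 3 as above; the start time t = 2 is chosen arbitrarily):

```python
# Sketch: occupancy of one resource by a single operation v1 with d1 = 3.
# x[t] = 1 iff v1 starts in step t; usage(t) = sum_{p=0}^{d1-1} x[t-p].
d1 = 3
x = {t: 0 for t in range(1, 8)}
x[2] = 1                                   # assume v1 starts in step t = 2
usage = {t: sum(x.get(t - p, 0) for p in range(d1)) for t in range(1, 8)}
busy = [t for t in range(1, 8) if usage[t] == 1]
print(busy)  # [2, 3, 4]: the resource is occupied for d1 consecutive steps
```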

Iterative Scheduling
most applications execute a dataflow graph not just once but in a loop
generation and optimization of periodic schedules
execution in pipelined iterations for optimizing performance

parameters
period / iteration interval / initiation interval (P): time between the starts of operation vi in two successive iterations
latency (L): time between start of first and last operation within one iteration

[two Gantt charts of iterations of operations v1 … v4 over time: left with L = 4, P = 1; right with L = 4, P = 2]

42

Iterative Scheduling

iterative schedule examples, visualized with Gantt charts (resources r1, r2 over time; from Teich & Haubelt 2007):

sequential processing of successive iterations (P = L = 9; Abb. 4.17)

schedule with functional pipelining, i.e. operations from different iterations are processed in one period (P = 7; Abb. 4.18)

overlapping schedule with functional pipelining, i.e. operations from an iteration may overlap the border of the period (P = 6; Abb. 4.19)

Teich & Haubelt 2007

43

Overview
translation of basic blocks
formal modeling
sequence graph, resource graph
cost function, execution times
synthesis problems: allocation, binding, scheduling

scheduling algorithms
ASAP, ALAP
extended ASAP, ALAP
list scheduling

optimal scheduling
introduction to integer linear programming (ILP)
scheduling with ILP

translation of complete programs


finite state machine and data path
micro-coded controllers

44

Motivation
translate program/algorithm into dedicated hardware
units of translation
single basic block (combinational)
complete program (sequential)

 1: int gcd(int a, int b) {
 2:   while(a != b) {
 3:     if(a > b) {
 4:       a = a - b;
 5:     } else {
 6:       b = b - a;
 7:     }
 8:   }
 9:   return a;
10: }

[block diagram: finite state machine (controller) next to registers and functional units (FU) forming the data path]

FSM controls the data path (FSM actions)
data path sends feedback to FSM (FSM conditions)

45

Translating SW to HW: High-Level Synthesis

basic idea: execute the control data flow graph (CDFG)
control flow graph
describes possible execution sequences of basic blocks
implement CFG as finite state machine (FSM)

data flow graph
describes arithmetic operations on basic block level
implement data flow graphs as data paths
use registers for storing and transferring data between basic blocks

[block diagram: finite state machine (controller) next to registers and functional units (FU) forming the data path]

46

Example Program: Greatest Common Divisor (GCD)

 1: int gcd(int a, int b) {
 2:   while(a != b) {
 3:     if(a > b) {
 4:       a = a - b;
 5:     } else {
 6:       b = b - a;
 7:     }
 8:   }
 9:   return a;
10: }

47

Control Data Flow Graph for GCD

BB0:
  br label %BB1

BB1:
  %b1 = phi i32 [ %b2, %BB5 ], [ %b, %BB0 ]
  %a1 = phi i32 [ %a2, %BB5 ], [ %a, %BB0 ]
  br label %BB2

BB2:
  %a2 = phi i32 [ %a1, %BB1 ], [ %a3, %BB4 ]
  %whilecond = icmp eq i32 %b1, %a2
  br i1 %whilecond, label %BB6, label %BB3

BB3:
  %ifcond = icmp sgt i32 %a2, %b1
  br i1 %ifcond, label %BB4, label %BB5

BB4:
  %a3 = sub i32 %a2, %b1
  br label %BB2

BB5:
  %b2 = sub i32 %b1, %a2
  br label %BB1

BB6:
  ret i32 %a2

generated with Clang/LLVM

48

Control Data Flow Graph for GCD

[same control data flow graph as on slide 48, with the synthesis-relevant parts highlighted]

data path
predicates (flags) for conditional jumps (input to controller FSM)
conditional state transitions (evaluated by controller FSM)
phi nodes define registers + input multiplexers (other values are temporary)
49

Data Path for GCD

[data path schematic: registers a1, a2, b1 with input multiplexers (phi nodes), driven by the controller's enable/select signals ena1/sela1, ena2/sela2, enb1/selb1; comparators a2 > b1 (ifcond) and b1 == a2 (whilecond) compute the predicate flags]

input signals from controller: enable/select signals of the phi-node registers
output signals to controller: predicates (flags) ifcond, whilecond
50

Data Path for GCD (Block Diagram)

[block diagram: data path with control inputs sela1, ena1, sela2, ena2, selb1, enb1; outputs ifcond, whilecond and the result (a2)]

51

Control Data Flow Graph for GCD

BB0:
  br label %BB1

BB1:
  %b1 = phi i32 [ %b2, %BB5 ], [ %b, %BB0 ]
  %a1 = phi i32 [ %a2, %BB5 ], [ %a, %BB0 ]
  br label %BB2

BB2:
  %a2 = phi i32 [ %a1, %BB1 ], [ %a3, %BB4 ]
  %whilecond = icmp eq i32 %b1, %a2
  br i1 %whilecond, label %BB6, label %BB3

BB3:
  %ifcond = icmp sgt i32 %a2, %b1
  br i1 %ifcond, label %BB4, label %BB5

BB4:
  %a3 = sub i32 %a2, %b1
  br label %BB2

BB5:
  %b2 = sub i32 %b1, %a2
  br label %BB1

BB6:
  ret i32 %a2

52

Control FSM for GCD

states 0-6, one per basic block; transitions:

0 -> 1: a01 (unconditional)
1 -> 2: a12 (unconditional)
2 -> 6: whilecond / a26
2 -> 3: !whilecond / a23
3 -> 4: ifcond / a34
3 -> 5: !ifcond / a35
4 -> 2: a42 (unconditional)
5 -> 1: a51 (unconditional)

edge labels: c / a, where c = condition for the transition and a = action executed during the transition


53
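The FSM and data path can be simulated together. The sketch below encodes the transition table in plain Python; the register updates are my reading of the CDFG's phi nodes, so treat the per-state actions as an illustration rather than the exact hardware:

```python
# Sketch: FSMD simulation of the GCD controller + data path.
def gcd_fsmd(a, b):
    a1 = a2 = b1 = 0            # data path registers; a2 also holds the result
    state = 0
    while state != 6:
        if state == 0:          # a01: load inputs (phi values from BB0)
            a1, b1 = a, b
            state = 1
        elif state == 1:        # a12: phi in BB2 selects a1
            a2 = a1
            state = 2
        elif state == 2:        # whilecond = (b1 == a2)
            state = 6 if b1 == a2 else 3
        elif state == 3:        # ifcond = (a2 > b1)
            state = 4 if a2 > b1 else 5
        elif state == 4:        # a42: a3 = a2 - b1, fed back via phi
            a2 = a2 - b1
            state = 2
        elif state == 5:        # a51: b2 = b1 - a2; phi in BB1 selects a2, b2
            a1, b1 = a2, b1 - a2
            state = 1
    return a2                   # BB6: ret a2

print(gcd_fsmd(12, 8))  # 4
```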

Transition Table and Actions

current state | condition  | next state | action | sela1 | ena1 | sela2 | ena2 | selb1 | enb1
0             | -          | 1          | a01    | a     | 1    | -     | -    | b     | 1
1             | -          | 2          | a12    | -     | -    | a1    | 1    | -     | -
2             | whilecond  | 6          | a26    | -     | -    | -     | -    | -     | -
2             | !whilecond | 3          | a23    | -     | -    | -     | -    | -     | -
3             | ifcond     | 4          | a34    | -     | -    | -     | -    | -     | -
3             | !ifcond    | 5          | a35    | -     | -    | -     | -    | -     | -
4             | -          | 2          | a42    | -     | -    | a3    | 1    | -     | -
5             | -          | 1          | a51    | a2    | 1    | -     | -    | b2    | 1

the sel/en columns are the input signals to the data path (mux selections follow the phi nodes of the CDFG)

transition table and data path can be easily translated to a hardware description language (e.g. VHDL or Verilog)
54

Control FSM for GCD (Block Diagram)

[block diagram: state register plus next-state and output logic; inputs ifcond, whilecond; outputs sela1, ena1, sela2, ena2, selb1, enb1]

55

Limitations and Potential Improvements


limitations of this basic high-level synthesis approach

restricted subset of C only (no function calls, memory access, …)


limited amount of parallelism (CDFG is executed sequentially)
no sharing of operators in data path
single cycle execution model (each basic block is executed in exactly one
cycle)
FSMs can become very complex

features of more advanced high-level synthesis methods

sharing of operators in data path


program transformations to increase parallelism
loop pipelining and retiming

many tools available


commercial: e.g. Impulse C, Mentor Catapult C, AutoESL
free/open source: e.g. ROCCC, SPARK, LegUp, C-to-Verilog

56

Microprogrammed Architectures
two ends of the spectrum for implementing applications
CPU: generic fully programmable architecture, application can be easily
varied after fabrication
ASIC/FPGA: highly specialized, application is fixed at design time

middle ground: Microprogrammed Architectures


also tailored to particular application or class of applications
but the controller is programmable (instead of a fixed FSM)

57

FSMD vs. Microprogrammed Architecture

finite state machine with data path (FSMD): fixed architecture
[block diagram: state register + next-state logic drive the data path; the data path returns status signals]

micro-programmed machine: programmable architecture
[block diagram: control store holds microinstructions, addressed by the CSAR (control store address register); next-address logic evaluates the jump field and status signals to compute the next CSAR value; the command field drives the data path]

Fig. 5.3: In contrast to FSM-based control, microprogramming uses a flexible control scheme.

CSAR (control store address register)
corresponds to the instruction counter in a CPU

Schaumont 2010

58

Microinstruction Encoding
microinstruction specifies
commands for the data path (datapath command field)
jump field:
how to compute the address of the next microinstruction based on feedback from the data path
optional address constant

32-bit microinstruction (Fig. 5.4): datapath command | next (4 bits) | address (12 bits)

address: absolute address of a microinstruction
here: address field width is 12 bits, hence 4096 microinstructions can be addressed

jump field (next):
0000  Default           CSAR = CSAR + 1
0001  Jump              CSAR = address
0010  Jump if carry     CSAR = cf ? address : CSAR + 1
1010  Jump if no carry  CSAR = cf ? CSAR + 1 : address
0100  Jump if zero      CSAR = zf ? address : CSAR + 1
1100  Jump if not zero  CSAR = zf ? CSAR + 1 : address

[block diagram: next-address logic + CSAR + control store; the microinstruction splits into datapath command and jump field; flags cf, zf feed back from the data path]

Fig. 5.4 Sample format for a 32-bit microinstruction word

Schaumont 2010

59
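The jump-field semantics from the table above can be sketched directly (4-bit next code, carry flag cf, zero flag zf):

```python
# Sketch: next-address logic for the CSAR, following the jump-field table.
def next_csar(nxt, csar, address, cf, zf):
    if nxt == 0b0000: return csar + 1                     # default
    if nxt == 0b0001: return address                      # jump
    if nxt == 0b0010: return address if cf else csar + 1  # jump if carry
    if nxt == 0b1010: return csar + 1 if cf else address  # jump if no carry
    if nxt == 0b0100: return address if zf else csar + 1  # jump if zero
    if nxt == 0b1100: return csar + 1 if zf else address  # jump if not zero
    raise ValueError("unused jump-field encoding")

print(next_csar(0b0100, 10, 100, cf=0, zf=1))  # 100 (jump if zero taken)
```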

Command Field: Horizontal vs. Vertical Microcode

(from the figure caption: each of the instructions enumerated below can be implemented as a combination of control bit values for the multiplexers and the adder/subtractor; the controller shows two possible encodings of the three instructions, a horizontal encoding and a vertical encoding)

tradeoff between code density and decoding effort

horizontal microcode
microinstruction directly contains the bits to control the data path
no decoding required
low code density

vertical microcode
microinstructions contain an encoded form of the data path control signals
decoding required (decoder between controller and data path)
higher code density

example (Fig. 5.5): data path with multiplexers sel1, sel2 and an adder/subtractor

micro-instruction | horizontal microcode (3 bits) | vertical microcode (2 bits)
a = 2*a           | 0 0 0                         | 0 0
a = a - 1         | 0 1 1                         | 0 1
a = IN            | 1 0 0                         | 1 0

Fig. 5.5 Example of vertical versus horizontal microprogramming

Schaumont 2010

60

Example: A Microcoded Datapath

the ALU combines an operand from the accumulator (ACC) with an operand from the register file or the input port; the result of the operation is returned to the register file or the accumulator

microinstruction format (unused | SBUS | ALU | Shifter | Dest | Nxt | Address)

[datapath: register file, input port and ACC drive the S-Bus; ALU + shifter; next-address logic, CSAR and control store form the controller; flags feed back]

Fig. 5.7 A microprogrammed datapath

same microinstruction format as in the previous example
4 data path units to be controlled
SBUS: select operand
ALU: choose ALU operation
Shifter: optional bit shift of ALU result
Dest: select location for storing the result
Schaumont 2010

61

Example: Microinstruction Encoding

Table 5.1 Microinstruction encoding of the example machine

SBUS (4 bits): selects the operand that will drive the S-Bus
0000 R0    0101 R5
0001 R1    0110 R6
0010 R2    0111 R7
0011 R3    1000 Input
0100 R4    1001 Address/Constant

ALU (4 bits): selects the operation performed by the ALU
0000 ACC           0110 ACC | S-Bus
0001 S-Bus         0111 not S-Bus
0010 ACC + S-Bus   1000 ACC + 1
0011 ACC - S-Bus   1001 S-Bus - 1
0100 S-Bus - ACC   1010 0
0101 ACC & S-Bus   1011 1

Shifter (3 bits): selects the function of the programmable shifter
000 logical SHL(ALU)   100 arith SHL(ALU)
001 logical SHR(ALU)   101 arith SHR(ALU)
010 rotate left ALU    111 ALU (no shift)
011 rotate right ALU

Dest (4 bits): selects the target that will store the S-Bus
0000 R0    0101 R5
0001 R1    0110 R6
0010 R2    0111 R7
0011 R3    1000 ACC
0100 R4    1111 unconnected

Nxt (4 bits): selects the next value for CSAR
0000 CSAR + 1                  1010 cf ? CSAR + 1 : Address
0001 Address                   0100 zf ? Address : CSAR + 1
0010 cf ? Address : CSAR + 1   1100 zf ? CSAR + 1 : Address

available microinstructions and their encoding

Schaumont 2010

As an example, let us develop a microprogram that reads two numbers from the input port and that evaluates their greatest common divisor (GCD) using Euclid's algorithm. The first step is to develop a microprogram in terms of register transfers.

62

Example: How to Encode a Microinstruction

how to encode ACC <- R2?

RT-level instruction: ACC <- R2

micro-instruction field encoding:
SBUS = 0010 (R2)
ALU = 0001 (S-Bus)
Shifter = 111 (ALU)
Dest = 1000 (ACC)
Nxt = 0000 (CSAR + 1)
Address = 000000000000

micro-instruction formation (unused bit + fields):
{0, 0010, 0001, 111, 1000, 0000, 000000000000}
regrouped into nibbles:
{0001, 0000, 1111, 1000, 0000, 0000, 0000, 0000}

micro-instruction encoding: 10F80000 (hex)

Fig. 5.8 Forming microinstructions from register-transfer instructions

Schaumont 2010
63
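The nibble regrouping above is just bit concatenation; the same word can be assembled by shifting the fields together (field order and widths as on this slide: 1 unused bit, SBUS 4, ALU 4, Shifter 3, Dest 4, Nxt 4, Address 12):

```python
# Assemble the 32-bit microinstruction word for ACC <- R2 from its fields.
fields = [(0, 1),        # unused bit
          (0b0010, 4),   # SBUS = R2
          (0b0001, 4),   # ALU = S-Bus
          (0b111, 3),    # Shifter = ALU (no shift)
          (0b1000, 4),   # Dest = ACC
          (0b0000, 4),   # Nxt = CSAR + 1
          (0, 12)]       # Address = 0
word = 0
for value, width in fields:
    word = (word << width) | value
print(f"{word:08X}")  # 10F80000
```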

Example: Implementing GCD on this Architecture

 1: int gcd(int a, int b) {
 2:   while(a != b) {
 3:     if(a > b) {
 4:       a = a - b;
 5:     } else {
 6:       b = b - a;
 7:     }
 8:   }
 9:   return a;
10: }

;          Command Field    || Jump Field
; --------------------------------------------------------------------------
1:         IN -> R0                             ; read a, store in R0
2:         IN -> ACC                            ; read b, store in ACC
3: Lcheck: R0 - ACC         || JUMP_IF_Z Ldone  ; check while condition
4:         (R0 - ACC) << 1  || JUMP_IF_C Lsmall ; check whether R0 < ACC
5:         R0 - ACC -> R0   || JUMP Lcheck      ; if not, R0 - ACC -> R0
6: Lsmall: ACC - R0 -> ACC  || JUMP Lcheck      ; else ACC - R0 -> ACC
7: Ldone:                   || JUMP Ldone       ; infinite loop, end of prog

Schaumont 2010

64
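The register transfers of the microprogram above can be simulated step by step. The sketch models only R0, ACC and the two flags; reading the carry as the sign bit shifted out in line 4 is my interpretation of the listing, not taken verbatim from the source:

```python
# Sketch: simulate the GCD microprogram's register transfers.
def microprogram_gcd(a, b, width=32):
    mask = (1 << width) - 1
    R0, ACC = a, b                             # lines 1-2: IN -> R0, IN -> ACC
    while True:
        if (R0 - ACC) & mask == 0:             # line 3: zero flag -> Ldone
            return R0                          # line 7: Ldone (result in R0)
        if (((R0 - ACC) & mask) << 1) > mask:  # line 4: shifted-out sign bit
            ACC = (ACC - R0) & mask            # line 6: Lsmall (R0 < ACC)
        else:
            R0 = (R0 - ACC) & mask             # line 5 (R0 > ACC)

print(microprogram_gcd(12, 8))  # 4
```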

Changes
2015-05-28 (v1.4.0)
updated for SS2015

2014-05-22 (v1.3.1)
explain on slide 41 that di=1 no longer required

2014-05-06 (v1.3.0)
updated for SS2014

65

Changes
2013-06-18 (v1.2.4)
fixed label of basic blocks B4 and B5 also on p.52

2013-06-06 (v1.2.3)
cosmetic changes
fixed index in equation on p.41
fixed label of basic blocks B4 and B5 on p.48 + p.49

2013-05-23 (v1.2.2)
clarified assumption of unit delay on p.33

2013-05-16 (v1.2.1)
fix typo on slide 19, terminate algorithm when v0 is scheduled (not vn)

2013-05-13 (v1.2.0)
updated for SS2013, merged all architecture synthesis materials into a
single presentation

66
