SS 2015
Jun.-Prof. Dr. Christian Plessl
Custom Computing
University of Paderborn
Version 1.4.0 2015-05-28
Motivation
translate program/algorithm into dedicated hardware
units of translation
  single basic block (combinational)
  complete program (sequential)

int gcd(int a, int b) {
    while (a != b) {
        if (a > b) {
            a = a - b;
        } else {
            b = b - a;
        }
    }
    return a;
}

[figure: target architecture, a finite state machine (controller) driving a data path of registers and functional units (FUs)]
Overview
translation of basic blocks
  formal modeling
    sequence graph, resource graph
    cost function, execution times
  synthesis problems: allocation, binding, scheduling
  scheduling algorithms
    ASAP, ALAP
    extended ASAP, ALAP
    list scheduling
  optimal scheduling
    introduction to integer linear programming (ILP)
    scheduling with ILP
Sequencing Graph
GS = (VS, ES)
[figure: data-flow graph with operations (*, -, +, <) as nodes v1..v11, framed by a NOP source node v0 and a NOP sink node v12; edges model data dependences]
Resource Graph
GR = (VR, ER)
set of nodes VR = VS ∪ VT
  VS are the nodes of the sequencing graph (without NOPs)
  VT represent resource types (adder, multiplier, ALU, ...)
set of edges (vS, vT) ∈ ER with vS ∈ VS, vT ∈ VT
  an instance of resource type vT can be used to implement operation vS
execution times w: ER → Z+
  assigns the execution time of operation vS ∈ VS on resource type vT ∈ VT to the edge (vS, vT) ∈ ER
  example: w(v2, vMUL) = 1
[figure: resource graph for VT = {vMUL, vALU}; the operations v1, v2, v3, v4, v6, v7 map to vMUL with cost c(vMUL) = 8, the operations v5, v8, v9, v10, v11 map to vALU with cost c(vALU) = 4]
Allocation, Binding
allocation α: VT → Z+
  assigns a number α(vT) of available instances to each resource type vT
binding β: VS → VT and γ: VS → Z+
  β(vS) = vT means that operation vS is implemented by resource type vT (possible βs shown in the resource graph)
  γ(vS) = r denotes that vS is implemented by the r-th instance of vT; r ≤ α(vT)
example:
  multiplier: c(vMUL) = 8, α(vMUL) = 3
    w(vi, vMUL) = 1 and β(vi) = vMUL for i ∈ {1, 2, 3, 4, 6, 7}
    γ(v1) = 1, γ(v2) = 2, γ(v3) = 3, γ(v4) = 1, γ(v6) = 2, γ(v7) = 3
  ALU: c(vALU) = 4, α(vALU) = 2
    w(vi, vALU) = 1 and β(vi) = vALU for i ∈ {5, 8, 9, 10, 11}
    γ(v5) = 1, γ(v8) = 1, γ(v9) = 1, γ(v10) = 2, γ(v11) = 1
Schedule
schedule τ: VS → Z+
  assigns a start time to each operation under the constraint
  τ(vj) ≥ τ(vi) + w(vi, β(vi))  ∀(vi, vj) ∈ ES
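This constraint is easy to check mechanically. A minimal Python sketch; the graph, weights, and schedules below are made-up illustration data, not the example from the slides:

```python
def is_valid_schedule(tau, edges, w):
    """Check tau(vj) >= tau(vi) + w(vi) for every sequencing edge (vi, vj)."""
    return all(tau[vj] >= tau[vi] + w[vi] for vi, vj in edges)

# tiny made-up chain v1 -> v2 -> v3 with unit execution times
edges = [("v1", "v2"), ("v2", "v3")]
w = {"v1": 1, "v2": 1, "v3": 1}

print(is_valid_schedule({"v1": 1, "v2": 2, "v3": 3}, edges, w))  # True
print(is_valid_schedule({"v1": 1, "v2": 1, "v3": 3}, edges, w))  # False: v2 starts too early
```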
Schedule - Example
[figure: the example sequencing graph drawn against a time axis, each operation v1..v11 placed at its start time between the NOP source and the NOP sink v12]
Synthesis
allocation, binding, scheduling
  finding (α, β, γ, τ) that optimize latency and cost under resource and timing constraints
algorithms for architecture synthesis discussed in, e.g.
  J. Teich, C. Haubelt, Digitale Hardware/Software-Systeme, Springer 2007
  G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill 1994
in the following
  scheduling without resource constraints: ASAP, ALAP
ASAP Scheduling
ASAP: as-soon-as-possible scheduling
algorithm

ASAP( GS(V,E) ) {
    schedule v0 by setting τ(v0) = 1
    repeat {
        select a vertex vj whose predecessors are all scheduled
        schedule vj by setting τ(vj) = max_{i:(vi,vj) ∈ E} { τ(vi) + w(vi, β(vi)) }
    } until (vn is scheduled)
    return τ
}
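The algorithm above translates almost directly into code. A Python sketch assuming a connected DAG with a single NOP source v0 and sink vn; the node names and the example graph are illustrative, not from the slides:

```python
def asap(nodes, pred, w, v0, vn):
    """ASAP scheduling: the source NOP starts at time 1; every other
    operation starts as soon as all its predecessors have finished."""
    tau = {v0: 1}
    while vn not in tau:
        for v in nodes:
            if v not in tau and all(p in tau for p in pred[v]):
                tau[v] = max(tau[p] + w[p] for p in pred[v])
    return tau

# illustrative diamond-shaped graph: v0 (NOP) -> a, b -> vn (NOP)
pred = {"v0": [], "a": ["v0"], "b": ["v0"], "vn": ["a", "b"]}
w = {"v0": 0, "a": 1, "b": 1, "vn": 0}   # unit delays, zero-delay NOPs
print(asap(["v0", "a", "b", "vn"], pred, w, "v0", "vn"))
# {'v0': 1, 'a': 1, 'b': 1, 'vn': 2}
```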
ASAP Example
[figure: ASAP schedule of the example graph on a time axis; every operation starts as early as its predecessors allow]
minimal latency: L = 4
ALAP Scheduling
ALAP: as-late-as-possible scheduling
requires a latency bound L
  otherwise nodes could be arbitrarily delayed
  typically the schedule length of the ASAP schedule is used as latency bound
algorithm

ALAP( GS(V,E), L ) {
    schedule vn by setting τ(vn) = L + 1
    repeat {
        select a vertex vi whose successors are all scheduled
        schedule vi by setting τ(vi) = min_{j:(vi,vj) ∈ E} { τ(vj) } - w(vi, β(vi))
    } until (v0 is scheduled)
    return τ
}
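ALAP is the mirror image of ASAP. A Python sketch under the same illustrative assumptions (single source/sink, made-up node names and graph):

```python
def alap(nodes, succ, w, vn, v0, L):
    """ALAP scheduling: the sink NOP starts at L + 1; every other
    operation starts as late as its successors allow."""
    tau = {vn: L + 1}
    while v0 not in tau:
        for v in nodes:
            if v not in tau and all(s in tau for s in succ[v]):
                tau[v] = min(tau[s] for s in succ[v]) - w[v]
    return tau

# same illustrative diamond graph as before: v0 -> a, b -> vn
succ = {"v0": ["a", "b"], "a": ["vn"], "b": ["vn"], "vn": []}
w = {"v0": 0, "a": 1, "b": 1, "vn": 0}
print(alap(["v0", "a", "b", "vn"], succ, w, "vn", "v0", L=1))
# {'vn': 2, 'a': 1, 'b': 1, 'v0': 1}
```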
ALAP Example
latency bound: L = 4
[figure: ALAP schedule of the example graph on a time axis; every operation starts as late as the latency bound allows]
Slack (mobility)
slack μ(vi) = τALAP(vi) - τASAP(vi)
[figure: the example graph annotated with the slack of each operation; the operations on the critical path have slack 0, the remaining operations have slack 1 or 2]
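Given the two schedules, the mobility is just their difference. A minimal Python sketch with made-up schedule values (not the example from the slides):

```python
def slack(asap_tau, alap_tau):
    """Mobility of each operation: ALAP start time minus ASAP start time."""
    return {v: alap_tau[v] - asap_tau[v] for v in asap_tau}

# illustrative ASAP/ALAP start times for four operations
mob = slack({"v1": 1, "v2": 1, "v3": 2, "v4": 2},
            {"v1": 1, "v2": 3, "v3": 2, "v4": 4})
print(mob)  # {'v1': 0, 'v2': 2, 'v3': 0, 'v4': 2}; v1 and v3 are critical
```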
Extended ALAP
resource constraints: α(vMUL) = 2, α(vALU) = 2
latency bound: L = 4
[figure: ALAP schedule of the example graph under these resource constraints, with at most two multiplications and two ALU operations per time step]
List Scheduling
operations are prioritized according to some criterion, e.g. slack

time = 1
repeat
    for each resource (vT, α(vT))
        determine all ready operations vS with β(vS) = vT and
        schedule the ones with the highest priority (at most α(vT) in parallel)
    time++
until (vn is scheduled)
[figure: list-scheduling example; each operation annotated with its slack-based priority, and the resulting schedule drawn on a time axis]
Exercise 5.1: Architecture Synthesis
[figure: schedule constructed in the exercise]
L = τ(v12) - τ(v0) = 8 - 1 = 7
Integer Linear Programming (ILP)
linear program (LP):
    max c^T x
    s.t. A x ≤ b, x ≥ 0
    with x ∈ R^n, A ∈ R^(m×n), b ∈ R^m, c ∈ R^n
  real variables
  linear constraints modeled by (in)equations
  linear cost function
integer linear program (ILP): x ∈ Z^n
0/1 (binary) ILP: x ∈ {0,1}^n
example
    x1, x2, x3 ∈ {0,1}
    x1 + x2 + x3 ≥ 2
    min 5 x1 + 6 x2 + 4 x3
solution
    x1  x2  x3  cost
    0   1   1   10
    1   0   1   9   (optimal)
    1   1   0   11
    1   1   1   15
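Because the example has only three binary variables, its optimum can be verified by exhaustive enumeration. A quick Python sketch:

```python
from itertools import product

# enumerate all 0/1 assignments for the example ILP:
#   min 5*x1 + 6*x2 + 4*x3   subject to   x1 + x2 + x3 >= 2
feasible = [(x, 5 * x[0] + 6 * x[1] + 4 * x[2])
            for x in product((0, 1), repeat=3)
            if sum(x) >= 2]
best = min(feasible, key=lambda p: p[1])
print(best)  # ((1, 0, 1), 9)
```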
Solving ILPs
ILPs are NP-complete, exponential runtime in the worst case
solved to optimality by branch-and-bound algorithms
  (runtime depends on the efficiency of the model)
popular solvers
CPLEX (commercial)
http://www.ibm.com/software/commerce/optimization/cplex-optimizer/
Gurobi (commercial, free academic license)
http://www.gurobi.com/
lp_solve (open source)
http://lpsolve.sourceforge.net/5.5/
challenges
translate the problem into an efficient ILP model
use only linear constraints and cost function
non-linearities of the problem can (possibly) be modeled by several
linear constraints, but this increases the complexity of the ILP
Scheduling with ILP
[figure: the example sequencing graph with NOP source and sink]
resource constraints:
  2 MUL
  2 ALU (+, -, <)
find schedule with minimal latency
alternative problem: minimize resources under latency constraint
ILP formulation
binary variables
    x_{i,l} ∈ {0,1},  i = 1,...,n,  l = 1,...,L
uniqueness constraints
  each operation starts in exactly one time step:
    Σ_l x_{i,l} = 1,  i = 1,...,n
starting time
  the starting time τ(vi) of operation vi can be expressed as
    τ(vi) = Σ_l l·x_{i,l}
resource constraints
  at most α(vT) operations bound to type vT may run in any time step:
    Σ_{i: β(vi)=vMUL} x_{i,l} ≤ α(vMUL),  l = 1,...,L
    Σ_{i: β(vi)=vALU} x_{i,l} ≤ α(vALU),  l = 1,...,L
sequencing constraints
  an operation cannot start earlier than the finishing time of its predecessors:
    Σ_l l·x_{j,l} ≥ Σ_l l·x_{i,l} + w(vi, β(vi))  ∀(vi, vj) ∈ ES
  (in the example, all execution times are w(vi, β(vi)) = 1)
cost function
    min τ(vn) = min Σ_l l·x_{n,l}
problem description in LP format (lp_solve), latency bound L = 7

/*********************/
/* integer variables */
/*********************/
int x1_1;
int x1_2;
/* ... one variable xi_l per operation i = 1..12 and time step l = 1..7 ... */
int x12_7;

/**********************/
/* variables in {0,1} */
/**********************/
x1_1 <= 1;
x1_2 <= 1;
/* ... one upper bound per variable ... */
x12_7 <= 1;
/**************************/
/* uniqueness constraints */
/**************************/
x1_1 + x1_2 + x1_3 + x1_4 + x1_5 + x1_6 + x1_7 = 1;
x2_1 + x2_2 + x2_3 + x2_4 + x2_5 + x2_6 + x2_7 = 1;
/* ... analogous constraints for x3 .. x11 ... */
x12_1 + x12_2 + x12_3 + x12_4 + x12_5 + x12_6 + x12_7 = 1;
/************************/
/* resource constraints */
/************************/
x1_1 + x2_1 + x3_1 + x4_1 + x6_1 + x7_1 <= 2;
x5_1 + x8_1 + x9_1 + x10_1 + x11_1 <= 2;
x1_2 + x2_2 + x3_2 + x4_2 + x6_2 + x7_2 <= 2;
x5_2 + x8_2 + x9_2 + x10_2 + x11_2 <= 2;
x1_3 + x2_3 + x3_3 + x4_3 + x6_3 + x7_3 <= 2;
x5_3 + x8_3 + x9_3 + x10_3 + x11_3 <= 2;
x1_4 + x2_4 + x3_4 + x4_4 + x6_4 + x7_4 <= 2;
x5_4 + x8_4 + x9_4 + x10_4 + x11_4 <= 2;
x1_5 + x2_5 + x3_5 + x4_5 + x6_5 + x7_5 <= 2;
x5_5 + x8_5 + x9_5 + x10_5 + x11_5 <= 2;
x1_6 + x2_6 + x3_6 + x4_6 + x6_6 + x7_6 <= 2;
x5_6 + x8_6 + x9_6 + x10_6 + x11_6 <= 2;
x1_7 + x2_7 + x3_7 + x4_7 + x6_7 + x7_7 <= 2;
x5_7 + x8_7 + x9_7 + x10_7 + x11_7 <= 2;
/**********************/
/* objective function */
/**********************/
min: 1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7;
/**************************/
/* sequencing constraints */
/**************************/
/* 1->6 */
1 x6_1 + 2 x6_2 + 3 x6_3 + 4 x6_4 + 5 x6_5 + 6 x6_6 + 7 x6_7 >=
1 + 1 x1_1 + 2 x1_2 + 3 x1_3 + 4 x1_4 + 5 x1_5 + 6 x1_6 + 7 x1_7;
/* 2->6 */
1 x6_1 + 2 x6_2 + 3 x6_3 + 4 x6_4 + 5 x6_5 + 6 x6_6 + 7 x6_7 >=
1 + 1 x2_1 + 2 x2_2 + 3 x2_3 + 4 x2_4 + 5 x2_5 + 6 x2_6 + 7 x2_7;
/* 6->10 */
1 x10_1 + 2 x10_2 + 3 x10_3 + 4 x10_4 + 5 x10_5 + 6 x10_6 + 7 x10_7 >=
1 + 1 x6_1 + 2 x6_2 + 3 x6_3 + 4 x6_4 + 5 x6_5 + 6 x6_6 + 7 x6_7;
/* 10->11 */
1 x11_1 + 2 x11_2 + 3 x11_3 + 4 x11_4 + 5 x11_5 + 6 x11_6 + 7 x11_7 >=
1 + 1 x10_1 + 2 x10_2 + 3 x10_3 + 4 x10_4 + 5 x10_5 + 6 x10_6 + 7 x10_7;
/* 11->12 */
1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7 >=
1 + 1 x11_1 + 2 x11_2 + 3 x11_3 + 4 x11_4 + 5 x11_5 + 6 x11_6 + 7 x11_7;
/* 3->7 */
1 x7_1 + 2 x7_2 + 3 x7_3 + 4 x7_4 + 5 x7_5 + 6 x7_6 + 7 x7_7 >=
1 + 1 x3_1 + 2 x3_2 + 3 x3_3 + 4 x3_4 + 5 x3_5 + 6 x3_6 + 7 x3_7;
/* 7->11 */
1 x11_1 + 2 x11_2 + 3 x11_3 + 4 x11_4 + 5 x11_5 + 6 x11_6 + 7 x11_7 >=
1 + 1 x7_1 + 2 x7_2 + 3 x7_3 + 4 x7_4 + 5 x7_5 + 6 x7_6 + 7 x7_7;
/* 4->8 */
1 x8_1 + 2 x8_2 + 3 x8_3 + 4 x8_4 + 5 x8_5 + 6 x8_6 + 7 x8_7 >=
1 + 1 x4_1 + 2 x4_2 + 3 x4_3 + 4 x4_4 + 5 x4_5 + 6 x4_6 + 7 x4_7;
/* 8->12 */
1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7 >=
1 + 1 x8_1 + 2 x8_2 + 3 x8_3 + 4 x8_4 + 5 x8_5 + 6 x8_6 + 7 x8_7;
/* 5->9 */
1 x9_1 + 2 x9_2 + 3 x9_3 + 4 x9_4 + 5 x9_5 + 6 x9_6 + 7 x9_7 >=
1 + 1 x5_1 + 2 x5_2 + 3 x5_3 + 4 x5_4 + 5 x5_5 + 6 x5_6 + 7 x5_7;
/* 9->12 */
1 x12_1 + 2 x12_2 + 3 x12_3 + 4 x12_4 + 5 x12_5 + 6 x12_6 + 7 x12_7 >=
1 + 1 x9_1 + 2 x9_2 + 3 x9_3 + 4 x9_4 + 5 x9_5 + 6 x9_6 + 7 x9_7;
ILP solution
latency = 4
[figure: the schedule found by the ILP drawn on a time axis, from the NOP source to the NOP sink v12; it meets the resource constraints (2 MUL, 2 ALU)]
improved formulation with mobility intervals
binary variables
    x_{i,t} ∈ {0,1}  ∀vi ∈ V,  t: li ≤ t ≤ hi
  (li, hi: earliest/latest possible start time of vi, e.g. from ASAP/ALAP)
uniqueness constraints
    Σ_{t=li}^{hi} x_{i,t} = 1  ∀vi ∈ V
scheduling time
    Σ_{t=li}^{hi} t·x_{i,t} = τ(vi)  ∀vi ∈ V
sequencing constraints
    τ(vj) - τ(vi) ≥ di  ∀(vi, vj) ∈ E
  (di: execution time of operation vi)
resource constraints
    Σ_{i: β(vi)=rk} Σ_{p=0}^{di-1} x_{i,t-p} ≤ α(rk)  ∀rk ∈ VT,  min{li} ≤ t ≤ max{hi}
cost function
    min Σ_l l·x_{n,l}
the inner sum in the resource constraints acts as an indicator for "operation vi is executing in time step t":
    Σ_{p=0}^{di-1} x_{i,t-p} = { 1 : τ(vi) ≤ t ≤ τ(vi) + di - 1
                               { 0 : otherwise
example (d1 = 2):
    t = 1: x_{1,1}
    t = 2: x_{1,2} + x_{1,1}
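The indicator identity can be checked numerically. A minimal sketch with made-up values for the start time, duration, and horizon:

```python
# check: sum_{p=0}^{d-1} x[t-p] == 1 exactly for tau <= t <= tau + d - 1,
# where x is the one-hot start-time encoding of a single operation
tau, d, H = 3, 2, 8                      # illustrative start time, duration, horizon
x = {t: int(t == tau) for t in range(1, H + 1)}
for t in range(1, H + 1):
    busy = sum(x.get(t - p, 0) for p in range(d))
    assert busy == (1 if tau <= t <= tau + d - 1 else 0)
print("indicator identity holds for all t")
```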
Iterative Scheduling
most applications execute a dataflow graph not just once but in a loop
  generation and optimization of periodic schedules
  execution in pipelined iterations for optimizing performance
parameters
  period / iteration interval / initiation interval (P): time between the starts of operation vi in two successive iterations
  latency (L): time between the starts of the first and the last operation within one iteration
[figure: two periodic schedules of operations v1..v4 over time, with successive iterations overlapping; left: L = 4, P = 1; right: L = 4, P = 2]
iterative schedule examples (visualized with Gantt charts)
[figures from Teich & Haubelt 2007: resource occupation over time for three schedules of v1..v4 on resource types r1, r2; sequential processing of successive iterations (P = L = 9, Abb. 4.17), functional pipelining (P = 7, Abb. 4.18), and an overlapping schedule where an execution interval crosses the iteration boundary (P = 6, Abb. 4.19)]
translation of complete programs
[figure: target architecture, a controller (finite state machine) driving a data path of registers and functional units (FUs)]
[figure: LLVM control-flow graph with basic blocks BB1..BB6, conditional branches with true/false edges, and a `br label %BB1` back edge]
data path
  predicates (flags) for conditional jumps (input to controller FSM)
  conditional state transitions (evaluated by controller FSM)
  phi nodes define registers + input multiplexers (other values are temporary)
[figure: GCD data path, registers for a and b with enable (ena1, enb1) and select (sela1, selb1) inputs, a subtractor, and comparators (>, ==) computing the predicates ifcond and whilecond]
data path
  predicates (flags) are output signals to the controller
[figure: interface between controller and data path, the controller drives sela1, ena1, sela2, ena2, selb1, enb1; the data path returns ifcond and whilecond; the result is read from a2]
[figure: controller FSM derived from the control-flow graph, one state per basic block; transitions are guarded by the predicates (whilecond / !whilecond, ifcond / !ifcond) and labeled with the action to perform (a12, a23, a26, a34, a35, a42, a51)]
[table: controller output logic, giving for each state and condition the next state and the control signals sela1, ena1, sela2, ena2, selb1, enb1]
Microprogrammed Architectures
two ends of the spectrum for implementing applications
CPU: generic fully programmable architecture, application can be easily
varied after fabrication
ASIC/FPGA: highly specialized, application is fixed at design time
FSMD vs. micro-programmed machine
[figure, Schaumont 2010 (Fig. 5.3): an FSMD (finite state machine with data path), next-state logic driven by status flags from the data path, shown next to a micro-programmed machine, where a control store addressed by the CSAR holds micro-instructions whose jump field feeds the next-address logic and whose command field drives the data path; "In contrast to FSM-based control, microprogramming uses a flexible control scheme"]
FSMD: fixed architecture; micro-programmed machine: programmable
CSAR (control store address register)
  corresponds to the instruction counter in a CPU
Microinstruction Encoding
microinstruction specifies (here: 32 bits)
  datapath command: controls the data path
  jump field: next + optional address constant
    next: how to compute the address of the next microinstruction based on feedback from the data path
    address: absolute address of a microinstruction; here the address field width is 12 bits, hence 4096 microinstructions can be addressed

next   meaning             effect
0000   Default             CSAR = CSAR + 1
0001   Jump                CSAR = address
0010   Jump if carry       CSAR = cf ? address : CSAR + 1
1010   Jump if no carry    CSAR = cf ? CSAR + 1 : address
0100   Jump if zero        CSAR = zf ? address : CSAR + 1
1100   Jump if not zero    CSAR = zf ? CSAR + 1 : address

[figure, Schaumont 2010: the CSAR addresses the control store; the fetched microinstruction's jump field and the data-path flags (cf, zf) feed the next-address logic, while the datapath command drives the data path]
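The jump-field semantics can be modeled behaviorally. A Python sketch of the next-address logic; the encodings follow the table above, while the function and argument names are mine:

```python
def next_csar(next_field, csar, address, cf, zf):
    """Next-address logic for the jump-field encodings in the table above
    (behavioral sketch; codes and effects follow the slide)."""
    table = {
        0b0000: csar + 1,                       # default
        0b0001: address,                        # jump
        0b0010: address if cf else csar + 1,    # jump if carry
        0b1010: csar + 1 if cf else address,    # jump if no carry
        0b0100: address if zf else csar + 1,    # jump if zero
        0b1100: csar + 1 if zf else address,    # jump if not zero
    }
    return table[next_field]

print(next_csar(0b0100, 10, 42, cf=0, zf=1))  # 42: jump if zero, zf is set
```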
Horizontal vs. Vertical Microcode
[figure, Schaumont 2010: a micro-programmed controller driving a small data path (input IN, register a, adder/subtractor, multiplexers sel1, sel2); each instruction is a combination of control bit values for the multiplexers and the adder/subtractor; the vertical variant adds a decoder between controller and data path]

micro-instruction   horizontal microcode   vertical microcode
a = 2*a             0 0 0                  0 0
a = a - 1           0 1 1                  0 1
a = IN              1 0 0                  1 0

vertical microcode
  microinstructions contain an encoded form of the data path control signals
  decoding required
  higher code density
A Micro-programmed Machine (Schaumont 2010)
[figure: data path with register file, input port, accumulator (ACC), ALU, shifter, and S-bus; controller with control store, CSAR, and next-address logic; the micro-instruction fields are SBUS, ALU, Shifter, Dest, Nxt, Address]
the ALU combines an operand from the accumulator register with an operand from the register file or the input port; the result of the operation is returned to the register file or the accumulator
[table residue: field encodings; SBUS selects the operand source (registers such as R5..R7, Input, Address/Constant), ALU selects the operation (e.g. ACC + S-Bus, not S-Bus), Shifter selects the shift, Dest selects the destination (R5..R7, ACC, unconnected), Nxt selects the next-address behavior (1010: cf ? CSAR + 1 : Address; 0100: zf ? Address : CSAR + 1; 1100: zf ? CSAR + 1 : Address)]
As an example, let us develop a microprogram that reads two numbers from the input port and evaluates their greatest common divisor (GCD) using Euclid's algorithm. The first step is to develop the microprogram in terms of register transfers.
Micro-Instruction Formation
RT-level instruction: ACC <- R2
field encoding:
  SBUS = 0010, ALU = 0001, Shifter = 111, Dest = 1000, Nxt = 0000, Address = 000000000000
micro-instruction formation (concatenate the fields):
  {0, 0010, 0001, 111, 1000, 0000, 000000000000}
regrouped into 4-bit nibbles:
  {0001, 0000, 1111, 1000, 0000, 0000, 0000, 0000}
micro-instruction encoding (hex): 0x10F80000
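The concatenation step can be reproduced mechanically. A sketch assuming the field widths above (1 + 4 + 4 + 3 + 4 + 4 + 12 = 32 bits); the function name is mine:

```python
def encode(unused, sbus, alu, shifter, dest, nxt, address):
    """Concatenate the micro-instruction fields into one 32-bit word."""
    word = ((unused << 31) | (sbus << 27) | (alu << 23) | (shifter << 20)
            | (dest << 16) | (nxt << 12) | address)
    return f"{word:08X}"

# ACC <- R2 from the example above
print(encode(0, 0b0010, 0b0001, 0b111, 0b1000, 0b0000, 0))  # 10F80000
```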
;          Command Field       || Jump Field
; ------------------------------------------------------------------------
1:         IN -> R0                               ; read a, store in R0
2:         IN -> ACC                              ; read b, store in ACC
3: Lcheck: R0 - ACC            || JUMP_IF_Z Ldone  ; check while condition
4:         (R0 - ACC) << 1     || JUMP_IF_C Lsmall ; check whether R0 < ACC
5:         R0 - ACC -> R0      || JUMP Lcheck      ; if not, R0 - ACC -> R0
6: Lsmall: ACC - R0 -> ACC     || JUMP Lcheck      ; else ACC - R0 -> ACC
7: Ldone:                      || JUMP Ldone       ; infinite loop, end of prog

Schaumont 2010
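The register transfers above can be simulated to convince ourselves that the microprogram computes the GCD. A Python sketch assuming positive inputs:

```python
def gcd_microprogram(a, b):
    """Simulate the register transfers of the microprogram above:
    R0 holds a, ACC holds b; subtract the smaller from the larger until
    both are equal (Euclid's algorithm by repeated subtraction).
    Assumes a > 0 and b > 0."""
    r0, acc = a, b
    while r0 - acc != 0:          # line 3: JUMP_IF_Z Ldone
        if r0 < acc:              # line 4: carry from R0 - ACC means R0 < ACC
            acc = acc - r0        # line 6: ACC - R0 -> ACC
        else:
            r0 = r0 - acc         # line 5: R0 - ACC -> R0
    return r0                     # line 7: Ldone (result in R0 == ACC)

print(gcd_microprogram(35, 21))  # 7
```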
Changes
2015-05-28 (v1.4.0)
updated for SS2015
2014-05-22 (v1.3.1)
explain on slide 41 that di = 1 is no longer required
2014-05-06 (v1.3.0)
updated for SS2014
Changes
2013-06-18 (v1.2.4)
fixed label of basic blocks B4 and B5 also on p.52
2013-06-06 (v1.2.3)
cosmetic changes
fixed index in equation on p.41
fixed label of basic blocks B4 and B5 on p.48 + p.49
2013-05-23 (v1.2.2)
clarified assumption of unit delay on p.33
2013-05-16 (v1.2.1)
fix typo on slide 19, terminate algorithm when v0 is scheduled (not vn)
2013-05-13 (v1.2.0)
updated for SS2013, merged all architecture synthesis materials into a
single presentation