Documente Academic
Documente Profesional
Documente Cultură
Keshab K. Parhi
Buy Textbook:
http://www.bn.com
http://www.amazon.com
http://www.bestbookbuys.com
Chap. 2
DSP System
out
Non-terminating
Example:
for
n = 1
to
y ( n ) = a x ( n ) + b x ( n 1) + c x ( n 2 )
end
3T 2T T 0
nT
Algorithms
out
signals
Chap. 2
Area-Speed-Power Tradeoffs
3-Dimensional Optimization (Area, Speed, Power)
Achieve Required Speed, Area-Power Tradeoffs
Power Consumption
P = C V 2 f
Latency reduction Techniques => Increase in speed or
power reduction through lower supply voltage operation
Since the capacitance of the multiplier is usually dominant,
reduction of the number of multiplications is important
(this is possible through strength reduction)
Chap. 2
x(n-1)
D
b
x(n-2)
c
y(n)
Chap. 2
z1
x(n)
z1
b
c
y(n)
Chap. 2
x(n)
a
c
y(n)
Chap. 2
Examples of DFG
Nodes are complex blocks (in Coarse-Grain DFGs)
Adaptive
filtering
FFT
IFFT
Decimator
Expander
Chap. 2
N samples
N/2 samples
N/2 samples
N samples
2
9
Introduction
Loop Bound
Important Definitions and Examples
Iteration Bound
Important Definitions and Examples
Techniques to Compute Iteration Bound
Chap. 2
10
Introduction
Iteration: execution of all computations (or functions) in an algorithm
once
Example 1:
A
2 times
2
B
2 times
C
3 times
Chap. 2
x(n)
y(n-1)
a
11
Introduction (contd)
Assume the execution times of multiplier and adder are Tm & Ta, then the
iteration period for this example is Tm+ Ta (assume 10ns, see the red-color
box). so for the signal, the sample period (Ts ) must satisfy:
Ts Tm + Ta
Definitions:
Iteration rate: the number of iterations executed per second
Sample rate: the number of samples processed in the DSP system per
second (also called throughput)
Chap. 2
12
Iteration Bound
Definitions:
Loop: a directed path that begins and ends at the same node
Loop bound of the j-th loop: defined as Tj/Wj, where Tj is the loop
computation time & Wj is the number of delays in the loop
Example 1: a b c a is a loop (see the same example in Note 2,
PP2), its loop bound:
Tloopbound = Tm + Ta = 10 ns
x(n)
2D
+
y(n-2)
Tloopbound =
Tm + Ta
= 5 ns
2
a
Chap. 2
13
L3: 2D
10ns
2ns
L1: D
3ns
C
L2: 2D
5ns
Definitions (Important):
Critical Loop: the loop with the maximum loop bound
Iteration bound of a DSP program: the loop bound of the critical loop, it is
defined as
T
T = max j
j L W
j
T = max{12, 5, 7.5}
lL
Chap. 2
14
T = TL 0 =
B = A Z non causal
1
causal
A = B Z
Speed of the DSP system: depends on the critical path comp. time
Paths: do not contain delay elements (4 possible path locations)
Critical path of a DFG: the path with the longest computation time among
all paths that contain zero delays
Clock period is lower bounded by the critical path computation time
Chap. 2
15
D
b
26
c
26
d
22
18
14
y(n)
Chap. 2
16
Precedence Constraints
Each edge of DFG defines a precedence constraint
Precedence Constraints:
Intra-iteration edges with no delay elements
Inter-iteration edges with non-zero delay
elements
Acyclic Precedence Graph(APG) : Graph obtained
by deleting all edges with delay elements.
Chap. 2
17
y(n)=ay(n-1) + x(n)
+
A
x(n)
D
10
13
A
19
3
B
6
C
D
APG of this graph is
10
A
2D
Chap. 2
18
B
(3)
(3)
(6)
D
B
C
2D
(21)
D
Tloop = 13ut
19
Chap. 2
20
Chap. 2
21
Example :
(1) 1
D d1
(2)
(1) 2
D d4
-1
-1
-1
-1
-1
d2
D d3
(2)
5
(2)
6
(1) 3
L(3)
L(1)
L(2)
L(4)
-1
-1 -1
-1
-1
-1
-1
-1
-1 -1
-1
-1
-1
-1 -1
-1
-1 -1
-1
10
10
-1
T = max{4/2,4/2,5/3,5/3,5/3,8/4,8/4,5/4,5/4} = 2.
Chap. 2
22
23
Chap. 2
24
Example :
-4
2
Gd to Gd
-5
0
0
0
-5
m=0
m=1
m=2
i=1
-2
-2
-3
-2
i=2
-5/3
-1
-1
i=3
-2
-2
i=4
- -
25
Outline
Introduction
Pipelining of FIR Digital Filters
Parallel Processing
Pipelining and Parallel Processing for Low Power
Pipelining for Lower Power
Parallel Processing for Lower Power
Combining Pipelining and Parallel Processing for Lower Power
Chap. 3
Introduction
Pipelining
Comes from the idea of a water pipe: continue sending water without
waiting the water in the pipe to be out
water pipe
leads to a reduction in the critical path
Either increases the clock speed (or sampling speed) or reduces the power
consumption at same speed in a DSP system
Parallel Processing
Multiple outputs are computed in parallel in a clock period
The effective sampling speed is increased by the level of parallelism
Can also be used to reduce the power consumption
Chap. 3
Introduction (contd)
Example 1: Consider a 3-tap FIR filter: y(n)=ax(n)+bx(n-1)+cx(n-2)
x(n)
x(n-1)
D
b
x(n-2)
c
y(n)
The critical path (or the minimum time required for processing a new
sample) is limited by 1 multiply and 2 add times. Thus, the sample period
(or the sample frequency ) is given by:
T
f
Chap. 3
sample
sample
+ 2 T
1
+ 2T
Introduction (contd)
If some real-time application requires a faster input rate (sample rate),
then this direct-form structure cannot be used! In this case, the critical
path can be reduced by either pipelining or parallel processing.
Chap. 3
Introduction (contd)
Example 2
Figure (a): A data path
Figure (b): The 2-level pipelined structure of (a)
Figure (c): The 2-level parallel processing structure of (a)
Chap. 3
Input
x(0)
x(1)
x(2)
x(3)
x(n)
Node 1
ax(0)+bx(-1)
ax(1)+bx(0)
ax(2)+bx(1)
ax(3)+bx(2)
Node 3
cx(-2)
cx(-1)
cx(0)
Output
y(0)
y(1)
y(2)
Chap. 3
Node 2
ax(0)+bx(-1)
ax(1)+bx(0)
ax(2)+bx(1)
3
y(n)
Important Notes:
The speed of a DSP architecture ( or the clock period) is limited by the
longest path between any 2 latches, or between an input and a latch, or
between a latch and an output, or between the input and the output
Chap. 3
Chap. 3
Original SFG
A cutset
A feed-forward cutset
Assume:
The computation time
for each node is assumed
to be 1 unit of time
Chap. 3
10
Z 1
x(n)
a
Z 1
b
Z 1
y(n)
c
y(n)
Z 1
b
x(n)
b
D
Chap. 3
a
D
y(n)
11
Chap. 3
12
Parallel Processing
Parallel processing and pipelining techniques are duals each other: if a
computation can be pipelined, it can also be processed in parallel. Both of
them exploit concurrency available in the computation in different ways.
Chap. 3
13
SISO
y(n)
Sequential System
y(3k)
x(3k+1)
MIMO
x(3k+2)
y(3k+1)
y(3k+2)
3-Parallel System
In this parallel processing system, at the k-th clock cycle, 3 inputs x(3k),
x(3k+1) and x(3K+2) are processed and 3 samples y(3k), y(3k+1) and
y(3k+2) are generated at the output
14
x(2k)
x(2k-2)
x(10k)
Chap. 3
X(10k-10)
15
Chap. 3
16
T clock T M + 2 T A
T iteration = T sample =
T clock
T + 2T A
M
L
3
Chap. 3
17
T/4
T/4
T/4
T
x(4k+3)
T
x(4k+2)
T
x(4k+1)
4k+3
x(4k)
A parallel-to-serial converter
y(4k+3)
y(4k+2)
y(4k+1)
y(4k)
4k
0 T
D
T/4
Chap. 3
D
T/4
y(n)
T/4
18
bound (output-pad delay plus input-pad delay and the wire delay between
the two chips), we say this system is communication bounded
So, we know that pipelining can be used only to the extent such that the
critical path computation time is limited by the communication (or I/O)
bound. Once this is reached, pipelining can no longer increase the speed
Chip 1
output
pad
Chip 2
Tcommunicat ion
input
pad
Tcomputatio n
Chap. 3
19
Titeration = Tsample
Tclock
=
LM
Chap. 3
20
Chap. 3
21
T pd =
C ch arg e V0
k (V0 Vt ) 2
PCMOS = Ctotal V0 f
2
Chap. 3
22
Pseq = Ctotal V0 f ,
2
f = 1 Tseq
For a M-level pipelined system, its critical path is reduced to 1/M of its
original length, and the capacitance to be charged/discharged in a single
clock cycle is also reduced to 1/M of its original capacitance
If the same clock speed (clock frequency f) is maintained, only a fraction
(1/M) of the original capacitance is charged/discharged in the same
amount of time. This implies that the supply voltage can be reduced to
Vo (0< <1). Hence, the power consumption of the pipelined filter is:
Chap. 3
23
Tseq
(V0 )
Tseq
Chap. 3
Tseq
Tseq
(V0 )
24
Tseq =
C ch arg e V 0
k (V 0 V t )
T pip =
( C ch arg e M ) V 0
k ( V0 Vt ) 2
Since the same clock speed is maintained in both filters, we get the
equation to solve :
M ( V0 Vt )2 = (V0 Vt ) 2
Example: Please read textbook for Example 3.4.1 (pp.75)
Chap. 3
25
Chap. 3
26
Tseq
(V0 )
3Tseq
3Tseq
3Tseq
( V 0 )
The propagation delay of the original system is still same, but the
propagation delay of the L-parallel system is given by
L T seq =
Chap. 3
C ch arg e V 0
k ( V0 Vt ) 2
27
L ( V 0 V t ) 2 = (V 0 V t ) 2
Once is computed, the power consumption of the L-parallel system
can be calculated as
P para = ( LC total )( V 0 ) 2
f
= 2 Pseq
L
Examples: Please see the examples (Example 3.4.2 and Example 3.4.3) in
the textbook (pp.77 and pp.80)
Chap. 3
28
Chap. 3
29
Chap. 3
30
T
(a)
3T
3T
3T
3T
3T
3T
(b)
Chap. 3
31
LT
pd
(C
ch arg e
M ) V 0
k ( V0 Vt )
L C ch arg e V 0
k (V 0 V t ) 2
LM ( V 0 V t ) 2 = (V 0 V t ) 2
Example: please see the example in the textbook (pp.82)
Chap. 3
32
Chapter 4: Retiming
Keshab K. Parhi
Retiming :
Moving around existing delays
Does not alter the latency of the system
Reduces the critical path of the system
Node Retiming
D
5D
3D
3D
2D
Cutset Retiming
D
B
A
C
Chap. 4
D
2D
D
D
F
D
D
E
2
Retiming
Generalization of Pipelining
Pipelining is Equivalent to Introducing
Many delays at the Input followed by
Retiming
Chap. 4
Retiming Formulation
r(U)
U
Source node
Retiming
r(V)
V
Destination node
= + r(V) - r(U)
Properties of retiming
The weight of the retimed path p = V 0 --> V1 --> ..Vk is given by
r(p)= (p) + r(Vk) - r(V0)
Retiming does not change the number of delays in a cycle.
Retiming does not alter the iteration bound in a DFG as the
number of delays in a cycle does not change
Adding the constant value j to the retiming value of each node
does not alter the number of delays in the edges of the retimed
graph.
Retiming is done to meet the following
Clock period minimization
Register minimization
Chap. 4
Chap. 4
Chap. 4
K-slow transformation
Replace each D by kD
(1)
B
D
(1)
Clock
0
1
2
A0 B0
A1 B1
A2 B2
Titer= 2ut
Tclk= 2ut
Titer= 22ut=4ut
B
D
Tclk = 1ut
Titer = 21=2ut
*Hardware Utilization = 50 %
*Hardware can be fully utilized if
two independent operations are
available.
Chap. 4
A 100 stage Lattice Filter with critical path 2 multiplications and 101 additions
Chap. 4
10
Chap. 4
11
12
Chapter 5: Unfolding
Keshab K. Parhi
(1)
A
2D
(1) 0,2,4,.
B0
(1)
A0
T = 2ut
(1)
A1
T = 2ut
D
(1) 1,3,5,.
B1
D
4 nodes & 4 edges
T = 2/2 = 1ut
Chap. 5
37D
w = 37
(i+w)/4 = 9, i = 0,1,2
=10, i = 3
U0
9D
V0
U1 9D
V1
U2
V2
U3 10D
9D
V3
2D
D
V
U
3-unfolded
6D
5D
T
DFG
U0
V0 2D
U1
V1 2D T1
U2 D
D
T0
V2 2D T2
2D
Properties of unfolding :
Unfolding preserves the number of delays in a DFG.
This can be stated as follows:
w/J + (w+1)/J + + (w + J - 1)/J = w
J-unfolding of a loop l with wl delays in the original DFG
leads to gcd(wl , J) loops in the unfolded DFG, and each
of these gcd(wl , J) loops contains wl/ gcd(wl , J) delays
and J/ gcd(wl , J) copies of each node that appears in l.
Unfolding a DFG with iteration bound T results in a Junfolded DFG with iteration bound JT .
Chap. 5
Applications of Unfolding
Sample Period Reduction
Parallel Processing
Sample Period Reduction
Case 1 : A node in the DFG having computation
time greater than T .
Case 2 : Iteration bound is not an integer.
Case 3 : Longest node computation is larger
than the iteration bound T, and T is not an
integer.
Chap. 5
Case 1 :
The original DFG cannot
have sample period equal to
the iteration bound
because a node computation
time is more than iteration
bound
Chap. 5
Case 2 :
The original DFG cannot
have sample period equal
to the iteration bound
because the iteration
bound is not an integer.
Parallel Processing :
Word- Level Parallel Processing
Bit Level Parallel processing
vBit-serial processing
vBit-parallel processing
vDigit-serial processing
Chap. 5
b0
b1
b2
b3
Bit-parallel
a2 a0
a3 a2 a1 a0
b3 b2 b1 b0
4l+0
Bit-serial
adder
0
b3 b2 b1 b0
b2 b0
Digit-Serial
(Digit-size = 2)
a3 a1
Chap. 5
a3 a2 a1 a0 Bit-serial
b3 b1
s3 s2 s1 s0
D
4l+1,2,3
9
Chap. 5
10
Example :
4l + 3
U0
12l + 1, 7, 9, 11
V0
4l + 0,2
Unfolding by 3
U1
U2
4l + 3
V1
V2
Chap. 5
11
D D0
D
D1
D2
B0
B1
Chap. 5
B2
2l + 0
C0
2l + 1
A2 D
2l + 0
C1
2l + 1
2l + 1
C2
2l + 0
B0
B1
B2
A0
C0
2l + 0
C1
2l + 1
2l + 0
C2
2l + 1
12
Chap. 5
13
Chapter 6: Folding
Keshab K. Parhi
Folding is a technique to reduce the silicon area by timemultiplexing many algorithm operations into single functional
units (such as adders and multipliers)
Folding Transformation :
Chap. 6
Chap. 6
Note : Linear lifetime chart uses the convention that the variable is not
live during the clock cycle when it is produced but live during the clock
7
cycle when it is consumed.
Due to the periodic nature of DSP programs the lifetime chart can
be drawn for only one iteration to give an indication of the # of
registers that are needed. This is done as follows :
Let N be the iteration period
Let the # of live variables at time partitions n N be the # of
live variables due to 0-th iteration at cycles n-kN for k 0. In
the example, # of live variables at cycle 7 N (=6) is the sum
of the # of live variables due to the 0-th iteration at cycles 7
and (7 - 16) = 1, which is 2 + 1 = 3.
i|h|g|f|e|d|c|b|a
Chap. 6
Matrix
Transposer
i|f|c|h|e|b|g|d|a
Sample
Tin
Tzlout
Tdiff
Tout
Life
04
17
10
210
-2
35
48
11
511
-4
66
-2
79
12
812
Chap. 6
10
11
For variables that reach the last register and are still alive,
they are allocated in a backward manner on a first come first
serve basis.
12
Chap. 6
13
Chap. 6
14
in
out
49
--
33
11
22
44
56
34
15
Step 5 : Forward-backward
register allocation
Chap. 6
16
Chap. 6
17
Definitions :
d1
Projection vector (also called iteration vector), d =
d 2
pT I = ( p1
p 2 )
Chap. 7
i'
i 0
j ' = T j =
t'
t
1 i
p' 0 j
s' 0 t
0
j processor axis
t scheduling time instance
Chap. 7
Weights Stay)
dT = (1 0), pT = (0 1), sT = (1 0)
Any node with index IT = (i , j)
is mapped to processor pTI=j.
is executed at time sTI=i.
Since sTd=1 we have HUE = 1/|sTd| = 1.
Edge mapping : The 3 fundamental edges corresponding to
weight, input, and result can be mapped to corresponding
edges in the systolic array as per the following table:
Chap. 7
pTe
sTe
wt(1 0)
i/p(0 1)
result(1 1)
-1
Chap. 7
Chap. 7
pTe
sTe
wt(1 0)
i/p(0 1)
result(1 1)
10
pTe
sTe
wt(1 0)
i/p(0 1)
result(1 1)
-1
11
12
Chap. 7
pTe
sTe
wt(1 0)
i/p(0 -1)
-1
result(1 1)
13
14
Dual R2
pTe
sTe
pTe
sTe
wt(1, 0)
wt(1, 0)
i/p(0,1)
i/p(0,1)
result(1, -1)
result(-1, 1)
15
dT = (1 0), pT = (0 1), sT = (2 1)
Since sTd=2 for both of them we have HUE = 1/|sTd| = .
Edge mapping :
Chap. 7
pTe
sTe
wt(1 0)
i/p(0 -1)
result(1 1)
-1
16
Dual W2
pTe
sTe
pTe
sTe
wt(1, 0)
wt(1, 0)
i/p(0,1)
i/p(0,-1)
-1
result(1, -1)
result(1, -1)
-1
Chap. 7
17
Chap. 7
18
Example 2:
19
Chap. 7
20
Chap. 7
21
22
Example :
23
i/p(0 1)
result(1 1)
24
25
Chap. 7
26
Solution 2 :
sT = (1,1,1), dT = (1,1,-1), p1 = (1,0,1),
p2 = (0,1,1), PT = (p1 p2)T
Sol. 1
Sol. 2
pTe
sTe
a(0, 1, 0)
(0, 1)
b(1, 0, 0)
C(0, 0, 1)
Chap. 7
pTe
sTe
e
a(0, 1, 0)
(0, 1)
(1, 0)
b(1, 0, 0)
(1, 0)
(0, 0)
C(0, 0, 1)
(1, 1)
1
27
Chap. 8
Introduction
Fast Convolution: implementation of convolution algorithm using fewer
multiplication operations by algorithmic strength reduction
e c d a
f = d c b
ac bd = a(c d ) + d (a b)
ad + bc = b(c + d ) + d (a b)
Chap. 8
0
c
+
d
0
b
f 0 1 1
0
d 1 1
0
Where C is a post-addition matrix (requires 2 additions), D is a pre-addition
matrix (requires 1 addition), and H is a diagonal matrix (requires 2 additions to
get its diagonal elements)
Chap. 8
Cook-Toom Algorithm
A linear convolution algorithm for polynomial multiplication based on
the Lagrange Interpolation Theorem
Lagrange Interpolation Theorem:
Let
0 ,...., n
be a set of
of degree n or less
f ( i ) when evaluated at i
f ( p) = f ( i )
i =0
(p
j i
(
j i
Chap. 8
f ( p)
j)
h = {h 0 , h1 ,..., h N 1 }
x = {x 0 , x1 ,..., x L 1 } . The linear
and
s( p ) = h ( p ) x( p)
+ ... + h1 p + h0
multiplication as follows:
h ( p ) = hN 1 p N 1
where
x ( p ) = x L 1 p L 1 + ... + x1 p + x 0
s ( p ) = s L + N 2 p L + N 2 + ... + s1 p + s 0
L+ N 2
The output polynomial s ( p ) has degree
L + N 1 different points.
Chap. 8
and has
(continued)
L + N 1
different points. Let { 0 , 1 ,..., L + N 2 } be L + N 1
different real numbers. If s ( i ) for i = {0 ,1,..., L + N 2 } are
known, then s ( p ) can be computed using the Lagrange interpolation
s ( p)
theorem as:
s( p) =
(p )
s( ) ( )
L+ N 2
i =0
j i
ji
It can be proved that this equation is the unique solution to compute linear
convolution
for
s( p)
given
i = {0,1,..., L + N 2}.
Chap. 8
the
values
of
s( i ) ,
for
4.
Compute
s ( p)
Algorithm Complexity
by using
s ( p) =
(p )
s( ) ( )
L+ N 2
i =0
j i
j i
Chap. 8
s ( p ) = s0 + s1 p + s2 p 2
s0 h0
s = h
1 1
s 2 0
Chap. 8
0
x
h0 0
x
h1 1
Example-1 (continued)
Next we use C-T algorithm to get an efficient convolution implementation
with reduced multiplication number
0 = 0 , h ( 0 ) = h0 , x ( 0 ) = x 0
1 = 1, h ( 1 ) = h0 + h1 , x ( 1 ) = x 0 + x1
2 = 2, h ( 2 ) = h0 h1 , x ( 2 ) = x 0 x1
Then, s(0), s(1), and s( 2) are calculated, by using 3 multiplications, as
s ( 0 ) = h( 0 ) x ( 0 ) s ( 1 ) = h( 1 ) x ( 1 ) s ( 2 ) = h( 2 ) x ( 2 )
From the Lagrange Interpolation theorem, we get:
( p 1 )( p 2 )
( p 0 )( p 1 )
+ s( 1)
( 0 1 )( 0 2 )
( 1 0 )( 1 2 )
( p 0 )( p 1 )
+ s ( 2)
( 2 0 )( 2 1 )
s( p) = s( 0)
s(1) s( 2 )
s(1 ) + s( 2 )
) + p 2 ( s ( 0 ) +
)
2
2
= s 0 + ps 1 + p 2 s 2
= s( 0) + p(
Chap. 8
10
Example-1 (continued)
The preceding computation leads to the following matrix form
s0
1
s = 0
1
s 2
1
1
= 0
1
0
1
1
s( 0 )
s ( 1 ) 2
s ( 2 ) 2
0 h( 0 )
0
1 0
h( 1) 2
1 0
0
1
= 0
0
1
1
0
1
1
1
0
1
1
h0
0
0
( h 0 + h1 ) 2
0
x ( 0 )
x( )
1
h ( 2 ) 2 x ( 2 )
0
0
1
0
1
1
( h 0 h1 ) 2 1 1
0
0
x
0
x1
Chap. 8
11
Comments
Some additions in the preaddition or postaddition matrices can be
shared. So, when we count the number of additions, we only count one
instead of two or three.
If we take h0, h1 as the FIR filter coefficients and take x0, x1 as the signal
(data) sequence, then the terms H0, H1 need not be recomputed each
time the filter is used. They can be precomputed once offline and stored.
So, we ignore these computations when counting the number of
operations
From Example-1, We can understand the Cook-Toom algorithm as a
matrix decomposition. In general, a convolution can be expressed in
matrix-vector forms as s0 h0 0
s = h h x0
or s = T x
1
1
0
x
s 2 0 h1 1
Chap. 8
12
s =T x =C H D x
Chap. 8
13
s' ( p) = s( p) S L+ N 2 p L+N 2 .
s( p)
is
Chap. 8
14
3.
4.
L+N2
, for i = {0,1, , L + N 3}
( p )
s' ( p) = s' ( )
5. Compute s' ( p) by using
( )
L+N2
i=0
ji
ji
6.
Chap. 8
L+N 2
15
Example-3 (Example 8.2.3, p.234) Derive a 2X2 convolution algorithm using the
modified Cook-Toom algorithm with ={0,-1}
Consider
the
Lagrange
{0 = 0, 1 = 1}.
interpolation
for
and
s' ( p) = s( p) h1 x1 p2
at
0 = 0,
h(0 ) = h0 ,
x(0 ) = x0
1 = 1, h(1) = h0 h1, x(1) = x0 x1
s ' ( 0 ) = h( 0 ) x( 0 ) h1x1 0 = h0 x0
2
( p 1)
( p 0 )
s'( p) = s'( 0 )
+ s'(1)
(0 1)
( 1 0 )
Chap. 8
16
Example-3 (contd)
Therefore, s ( p ) = s ' ( p ) + h1x1 p 2 = s0 + s1 p + s2 p 2
Finally, we have the matrix-form expression:
Notice that
Therefore:
s0 1
s = 1
1
s 2 0
1
= 1
0
Chap. 8
s0
1
s = 1
1
s 2
0
s'( 0 )
1
s'( ) = 0
1
h 1 x 1
0
0
1
0
0
1
0
0
1
0
0
1
0
0
0
1
s'( 0 )
s ' ( 1 )
h 1 x 1
0
1
1
s( 0 )
s ( 1 )
h 1 x 1
0 s ( 0 )
1 0
0 1 1 s ( 1 )
0 0
1 h 1 x 1
0 h0
0
0 1
1 0
h 0 h1
0 1
1 0
0
h1 0
0
0
1
0
1
1
x
0
x1
17
Example-3 (contd)
The computation is carried out as the follows:
1.
H 0 = h0 ,
H 1 = h0 h1 ,
H 2 = h1
(pre-computed)
2.
X 0 = x0 ,
X 1 = x0 x1 ,
X 2 = x1
3. S 0 = H 0 X 0 ,
S1 = H 1 X 1 ,
S2 = H 2 X 2
4.
s0 = S 0 ,
s1 = S 0 S1 + S 2 ,
s2 = S2
Chap. 8
18
Winograd Algorithm
The Winograd short convolution algorithm: based on the CRT (Chinese
Remainder Theorem) ---Its possible to uniquely determine a nonnegative integer given
only its remainder with respect to the given moduli, provided that the moduli are relatively
prime and the integer is known to be smaller than the product of the moduli
ci = Rmi [c]
i = 0,1,...,k ,
where
mi
mi
), for
c = ci Ni Mi modM , where M = k mi , Mi = M mi ,
i =0
i =0
19
k (i )
k
M ( p) = i =0 m (i ) ( p) , M (i ) ( p) = M ( p ) m (i ) ( p) , and N (i ) ( p) is the solution of
N (i ) ( p ) M (i ) ( p ) + n (i ) ( p) m (i ) ( p ) = GCD ( M (i ) ( p ), m (i ) ( p)) = 1 Provided that the
degree of c( p ) is less than the degree of M ( p)
Example-5 (Example 8.3.1, p.239): using the CRT for integer, Choose
20
Example-5 (contd)
The integer c can be calculated as
(a).
[
]
R [5x + 3x + 5] = 5(2) + 3x + 5 = 3x 5
[5x + 3x + 5] = 5(x 2) + 3x + 5 = 2x 5
Rx+2 5x2 + 3x + 5 = 5(2)2 + 3(2) + 5 = 19
(b).
x2 +2
(c). Rx2 + x+2
Chap. 8
21
Winograd Algorithm
1. Choose a polynomial m ( p ) with degree higher than the degree of
h ( p ) x ( p ) and factor it into k+1 relatively prime polynomials with real
( 0)
(1)
( k)
coefficients, i.e., m ( p ) = m ( p ) m ( p ) m ( p )
(i)
(i)
2. Let M ( p ) = m( p ) m ( p ). Use the Euclidean GCD algorithm to
solve N ( i ) ( p ) M (i ) ( p ) + n ( i ) ( p ) m ( i ) ( p ) = 1 for N ( i ) ( p ) .
(i )
(i)
h
(
p
)
=
h
(
p
)
mod
m
( p ),
3. Compute:
for i = 0,1, , k
x ( i ) ( p ) = x ( p ) mod m ( i ) ( p )
4. Compute: s ( i ) ( p ) = h ( i ) ( p ) x ( i ) ( p ) mod m ( i ) ( p ),
for
i = 0,1, , k
5. Compute s ( p ) by using:
k
s ( p ) = s ( i ) ( p ) N (i ) ( p ) M ( i ) ( p ) mod m ( i ) ( p )
i=0
Chap. 8
22
m (i ) ( p )
p
M (i ) ( p )
p3 p 2 + p 1
n (i ) ( p )
p2 p +1
p 1
p3 + p
12 p 2 + p + 2
p2 +1
p2 p
12 ( p 2)
N ( i ) ( p)
1
1
2
1
2
( p 1)
h ( 0 ) ( p ) = h0 ,
h (1) ( p ) = h0 + h1 ,
h ( 2 ) ( p ) = h0 + h1 p ,
Chap. 8
x ( p ) = x0 + x1 p + x2 p 2 :
x ( 0 ) ( p ) = x0
x (1) ( p ) = x0 + x1 + x2
x( 2 ) ( p ) = ( x0 x 2 ) + x1 p
23
Example-7 (contd)
s ( 0 ) ( p ) = h0 x0 = s0 ( 0 ) ,
s (1) ( p ) = ( h0 + h1 )( x0 + x1 + x 2 ) = s0 (1)
s ( 2 ) ( p ) = ( h0 + h1 p )(( x 0 x 2 ) + x1 p ) mod( p 2 + 1)
= h0 ( x0 x2 ) h1 x1 + ( h0 x1 + h1 ( x0 x 2 )) p = s0 ( 2 ) + s1( 2 ) p
(0 )
(1)
( 2)
Notice, we need 1 multiplication for s ( p ), 1 for s ( p ), and 4 for s ( p )
However it can be further reduced to 3 multiplications as shown below:
s 0 ( 2 ) 1
(2) =
s1 1
0
1
h
1 0
0
0
0
x 0 + x1 x 2
0 x 0 x 2
h 0 + h1
x1
h 0 h1
0
Then:
2
s ( p ) = s ( i ) ( p ) N ( i ) ( p ) M (i ) ( p ) mod m (i ) ( p )
i=0
= s
(0 )
( p )( p p + p 1) +
3
S (1 ) ( p )
2
( p + p) +
3
S(2) ( p)
2
( p3 2 p 2 + p)
mod( p 4 p 3 + p 2 p )
Chap. 8
24
Example-7 (contd)
Substitute s ( 0 ) ( p ),
table
p0
s0
(0)
s (1) ( p ),
p1
s0
p2
(0)
s0
p3
s0
(0)
1
2
s0
( 1)
1
2
s0
(2)
s0
1
2
s1
(2)
1
2
(2)
1
2
s0
s0
1
2
(0)
(1 )
(2)
s1
(2)
Therefore, we have
s0
s
1
s2
s3
Chap. 8
1
=
1
0
1
2
0
1
0
s0(0)
1
(1 )
s
12 0 ( 2 )
s0
21 ( 2 )
2 s 1
25
Example-7 (contd)
Notice that
s0
1
1 (1 )
2 s0 = 0
1 s (2)
0
21 0 ( 2 )
0
2 s 1
(0 )
0
1
0
0
0
1
0
0
0
0
0
1
s3
1
0
1
0
0
2
2
0
1
1
1
0
Chap. 8
h0
0
0
0
h0 + h1
2
0
0
h0
2
0
0
h1 h 0
2
h0
0
0
1
0
2
0
1 1
0
0
0
1
1 x 0
1 1 x 1
0 1 x 2
1
0
0
1
0
0
0
0
0
h 0 + h1
h 0 + h1
2
0
0
h0
2
0
0
h1 h0
2
x +
0
x0 +
x0
x0
x 1 + x 2
x1 x 2
x2
x1
0
0
0
0
h 0 + h1
26
Example-7 (contd)
In this example, the Winograd convolution algorithm requires 5
multiplications and 11 additions compared with 6 multiplications and 2
additions for direct implementation
Notes:
The number of multiplications in Winograd algorithm is highly dependent
on the degree of each m ( i ) ( p ). Therefore, the degree of m(p) should be as
small as possible.
More efficient form (or a modified version) of the Winograd algorithm
can be obtained by letting deg[m(p)]=deg[s(p)] and applying the CRT to
s '( p ) = s ( p ) h
Chap. 8
N 1
L 1
m ( p )
27
x ( i ) ( p ) = x ( p ) mod m ( i ) ( p )
6. Compute
Chap. 8
s ( p ) = s ' ( p ) + h N 1 x L 1 m( p )
28
m (i) ( p )
p
M (i) ( p )
p 2 1
p 1
p2 + p
p +1
p2 p
n (i ) ( p )
p
h ( 0 ) ( p ) = h0 ,
h (1) ( p ) = h0 + h1 ,
h ( 2 ) ( p ) = h0 h1 ,
s '( 0 ) ( p ) = h0 x0 ,
( p + 2)
1
( p 2)
2
N ( i ) ( p)
1
1
2
1
2
1
2
x ( p ) = x0 + x1 p + x2 p 2 :
x ( 0 ) ( p ) = x0
x (1) ( p ) = x0 + x1 + x2
x( 2 ) ( p ) = x0 x1 + x 2
s '(1) ( p ) = ( h0 + h1 )( x0 + x1 + x 2 ) ,
s '( 2 ) ( p ) = ( h0 h1 )( x0 x1 + x 2 )
Chap. 8
29
Example-8 (contd)
Since the degree of m ( i ) ( p ) is equal to 1, s '( i ) ( p ) is a polynomial of
degree 0 (a constant). Therefore, we have:
s( p) = s' ( p) + h1 x2 m( p)
= [ s'
( 0)
( p + 1) +
2
S '(1)
2
( p + p) +
2
S '( 2 )
2
( p 2 p) + h1 x2 ( p 3 p)
( 2)
s0 1
s 0
1 =
s2 1
s3 0
Chap. 8
0 s '( 0 )
(1)
1 s '2
s '( 2 )
2
1 h1 x 2
30
Example-8 (contd)
(matrix form)
0
s0 1
s 0
1
1 =
s2 1 1
0
s3 0
0
1
1
1
1 1
0
0
0
1
1
0
0 h0
1 0
0 0
1 0
h 0 + h1
2
h 0 h1
2
0
0
0
h1
0
x0
1
x1
1
x 2
1
Chap. 8
31
Iterated Convolution
Iterated convolution algorithm: makes use of efficient short-length
convolution algorithms iteratively to build long convolutions
Does not achieve minimal multiplication complexity, but achieves a
good balance between multiplications and addition complexity
Iterated Convolution Algorithm (Description)
1. Decompose the long convolution into several levels of short
convolutions
2. Construct fast convolution algorithms for short convolutions
3. Use the short convolution algorithms to iteratively (hierarchically)
implement the long convolution
Note: the order of short convolutions in the decomposition affects the
complexity of the derived long convolution
Chap. 8
32
x ( p ) = x0 + x1 p + x2 p 2 + x3 p 3
x '0 ( p) = x0 + x1 p,
x'1 ( p ) = x2 + x3 p
Then, we have:
s( p) = h( p) x( p) = h( p, q) x( p, q)
= [h'0 ( p) + h'1 ( p)q] [x '0 ( p) + x'1 ( p) q]
= h'0 ( p) x'0 ( p) + [h'0 ( p) x'1 ( p) + h'1 ( p) x'0 ( p)]q + h'1 ( p) x'1 ( p)q 2
= s'0 ( p) + s'1 ( p) q + s'2 ( p)q 2 = s( p, q)
Chap. 8
33
Example-9 (contd)
Therefore, the 4X4 convolution is decomposed into two levels of nested
2X2 convolutions
Let us start from the first convolution s'0 ( p) = h'0 ( p) x'0 ( p) , we have:
= h0 x0 + h1 x1 p 2 + p[(h0 + h1 ) (x0 + x1 ) h0 x0 h1 x1 ]
= h2 x2 + h3 x3 p 2 + p[(h2 + h3 ) ( x2 + x3 ) h2 x2 h3 x3 ]
: multiplication
Chap. 8
: addition
34
Example-9 (Contd)
For [(h'0 + h'1 ) ( x'0 + x'1 )], we have the following expression:
If we rewrite the three convolutions as the following expressions, then we can get
the following table (see the next page):
35
Example-9 (contd)
p0
a1
p1
a2
p2
a3
c1
b1
a1
p3
p4
b1
c2
c3
b2
b3
a2
a3
p5
b2
p6
b3
Chap. 8
36
Cyclic Convolution
Cyclic convolution: also known as circular convolution
Let the filter coefficients be h = {h0 , h1 , , hn1} , and the data
sequence be x = {x0 , x1, , xn1} .
The cyclic convolution can be expressed as
s( p) = h n x = [h( p) x( p)]mod( p n 1)
si = h((i k ) ) xk ,
i = 0,1, , n 1
k =0
where
37
M (i ) ( p)
n(i) ( p)
m(i) ( p)
p 1
p3 + p2 + p 1
14 ( p2 + 2p + 3)
p +1
p3 p2 + p 1
p2 +1
p2 1
1
4
N (i ) ( p)
( p 2p + 3)
2
1
2
1
4
14
12
38
Example-10 (contd)
x ( p ) = x 0 + x1 + x 2 + x3 = x 0 ,
x (1) ( p ) = x0 x1 + x 2 x3 = x0 (1) ,
x ( 2 ) ( p ) = (x 0 x 2 ) + (x1 x3 ) p = x0 ( 2 ) + x1 ( 2 ) p
: multiplication
(0 )
( 0)
s (0 ) ( p ) = h ( 0) ( p) x( 0 ) ( p ) = h0 x0 = s0 ,
(1)
(1)
(1 )
s (1) ( p ) = h (1) ( p ) x (1) ( p ) = h0 x0 = s0 ,
( 2)
( 2)
s ( 2) ( p) = s0 + s1 p = h ( 2 ) ( p ) x ( 2) ( p) mod( p 2 + 1)
( 0)
= h0
( 2)
x0
( 2)
h1 x1
( 2)
( 2)
)+ p(h
( 0)
( 0)
( 2)
(2 )
x1
+ h1 x0
( 2)
( 2)
0
Since
s 0 ( 2 ) = h0 ( 2 ) x0 ( 2 ) h1( 2 ) x1 ( 2 ) = h0 ( 2 ) x 0 ( 2 ) + x1( 2 ) h0 ( 2 ) + h1 ( 2 ) x1( 2 ) ,
s1 ( 2 ) = h0 ( 2 ) x1( 2 ) + h1( 2 ) x0 ( 2 ) = h0 ( 2 ) x0 ( 2 ) + x1 ( 2 ) + h1( 2 ) h0 ( 2 ) x0 ( 2 ) ,
or in matrix-form
s 0 ( 2 ) 1
(2) =
s1 1
0
1
h0( 2 )
1
0
0
0
(
(
) (
) (
x 0( 2 ) + x1( 2 )
(2)
0
x
0
(2)
(2)
+ h1 x1
h1( 2 ) h0( 2 )
0
)
)
h0( 2 )
39
Example-10 (contd)
Then
s( p) = s( i ) ( p) N ( i ) ( p)M ( i ) ( p) mod m( i ) ( p)
i=0
[
=(
=s (
( 0)
0
s0( 0 )
4
p 3 + p 2 + p +1
4
s0(1)
4
s0( 2 )
2
)+s (
(1)
0
)+ p (
p 3 p 2 + p 1
4
s0( 0 )
4
s0( 1)
4
] (p )
)+ p ( + )
)+s (
(2)
0
s1( 2 )
2
p 2 1
2
p 2 1
2
) +s
( 0)
2 s0
4
( 2)
1
s0(1)
4
s0( 2 )
2
So, we have
s 0 1
s 1
1 =
s 2 1
s 3 1
Chap. 8
1
1
1
0
0 14 s 0( 0 )
1 14 s 0(1 )
1 (2)
0 2 s0
1 (2)
1 2 s1
40
Example-10 (contd)
Notice that:
14 s0(0) 1
1 (1)
4 s0 = 0
12 s0(2) 0
1 (2 )
2 s1 0
14 h0(0)
0
0
0
0
Chap. 8
0
1 0 0 0
0 1 0 1
0 1 1 0
0 0 0
1
4
h0(1)
0
0
1 (2 )
h
2 0
0
0
0
0
0
0
0
0
1
2
(h
h0(2)
0
( 2)
1
)
1
2
0
h0(2) h1( 2)
x0(0)
(1)
x
0
x0( 2) + x1(2)
( 2)
x0
x( 2 )
41
Example-10 (contd)
Therefore, we have
s0
1
s
1
1
=
s2
1
s
3
1
1
1
1
1
1
0
h 0 + h1 +4 h 2 + h 3
1
1
1
0
Chap. 8
1
1
1
1
1
0
h 0 h1 + h 2 h 3
4
0
0
0
0
0
h 0 h2
2
h0 + h1 + h 2 h3
2
1
1
1
0
1
h 0 + h1 h 2 h 3
2
x0
x
1
x2
x3
42
Example-10 (contd)
This algorithm requires 5 multiplications and 15 additions
The direct implementation requires 16 multiplications and 12 additions
(see the following matrix-form. Notice that the cyclic convolution matrix
is a circulant matrix)
s 0 h0
s h
1 = 1
s2 h2
s 3 h3
h3
h0
h2
h3
h1
h2
h0
h1
h1 x 0
h 2 x 1
h3 x 2
h0 x3
43
Example-11 (contd)
h0 x0
h x +h x
1 0
0 1
h x = h2 x0 + h1x1 + h0 x2
h
x
+
h
x
2 1
1 2
h2 x2
h0 x0 + h2 x2
h x +h x
1 0
0 1
h4 x =
h2 x0 + h1x1 + h0 x2
h2 x1 + h1x2
Chap. 8
h 4 x , is:
44
Example-11 (contd)
Therefore, we have s( p ) = h( p) x( p) = h n x + h2 x2 ( p 1)
Using the result of Example-10 for h 4 x , the following convolution
algorithm for 3X3 linear convolution is obtained:
4
s 0 1
s 1
1
s 2 = 1
s 1
3
s 4 0
1
1
1
1
0
1
1
0
1
1
1
1
0
1
1
0
1
0
0
1
Chap. 8
45
Example-11 (contd)
h 0 + h1 + h 2
4
h 0 h1 + h 2
4
h0 h2
2
h 0 + h1 + h 2
2
h 0 + h1 h 2
2
1
1
1
0
0
Chap. 8
1
1
1
0
1
0
1
1
1
0
0
0
0
0
h 2
x0
x1
x 2
46
Example-11 (contd)
So, this algorithm requires 6 multiplications and 16 additions
Comments:
In general, an efficient linear convolution can be used to obtain an
efficient cyclic convolution algorithm. Conversely, an efficient
cyclic convolution algorithm can be used to derive an efficient
linear convolution algorithm
Chap. 8
47
h0 x0
s0
s h x + h x
1 0
0 1
1
s2 = h2 x0 + h1 x1 + h0 x2
s3 h2 x1 + h1 x2
s4
h2 x2
Chap. 8
48
Example-12 (contd)
Using the following identities:
s2 = h2 x0 + h1 x1 + h0 x2 = (h0 + h2 )( x0 + x2 ) h0 x0 + h1x1 h2 x2
s3 = h2 x1 + h1x2 = (h1 + h2 )( x1 + x2 ) h1x1 h2 x2
The 3X3 linear convolution can be written as:
s0 1
s 1
1
s2 = 1
s 0
3
s 4 0
Chap. 8
0
1
0
0
0
1
0
0
1
1
1
1
0
0
1
0
0
0
1
0
(continued on
the next page)
49
Example-12 (contd)
h0
0
0
0
h1
h2
0
0
0
0
h0 + h1
0
0
h0 + h 2
1
0 0
0 0
0 1
0 1
h1 + h2 0
0
0
1
0
1
0
1
0
0
x0
1
x1
0
x 2
Chap. 8
50
Outline
Introduction
Parallel FIR Filters
Formulation of Parallel FIR Filter Using Polyphase
Decomposition
Fast FIR Filter Algorithms
Introduction
Strength reduction leads to a reduction in hardware complexity by
exploiting substructure sharing and leads to less silicon area or power
consumption in a VLSI ASIC implementation or less iteration period
in a programmable DSP implementation
Strength reduction enables design of parallel FIR filters with a lessthan-linear increase in hardware
DCT is widely used in video compression. Algorithm-architecture
transformations and the decimation-in-frequency approach are used to
design fast DCT architectures with significantly less number of
multiplication operations
Chapter 9
y ( n ) = h ( n ) x ( n ) = h (i ) x ( n i ) ,
n = 0,1, 2 , ,
i =0
N 1
n
Y ( z) = H ( z) X ( z) = h(n) z x( n) z n
n =0
n= 0
Chapter 9
(contd)
Y ( z ) = Y0 ( z 2 ) + z 1Y1 ( z 2 )
)(
) H ( z ) + z [X ( z
[X ( z ) H ( z )]
= X 0 ( z 2 ) + z 1 X 1 ( z 2 ) H 0 ( z 2 ) + z 1 H 1 ( z 2 )
= X 0 (z 2
+ z 2
i.e.,
)H1(z 2 ) + X 1(z 2 )H 0 ( z 2 )
Y0 ( z 2 ) = X 0 ( z 2 ) H 0 ( z 2 ) + z 2 X 1 ( z 2 ) H 1 ( z 2 )
Y1 ( z 2 ) = X 0 ( z 2 ) H 1 ( z 2 ) + X 1 ( z 2 ) H 0 ( z 2 )
where Y0(z2) and Y 1(z2) correspond to y(2k) and y(2k+1) in time domain,
respectively. This 2-parallel filter processes 2 inputs x(2k) and x(2k+1)
and generates 2 outputs y(2k) and y(2k+1) every iteration. It can be
written in matrix-form as:
Y = H X
Chapter 9
Y0 H 0
or =
Y1 H 1
z 2 H 1 X 0
H 0 X1
(9.1)
The following figure shows the traditional 2-parallel FIR filter structure,
which requires 2N multiplications and 2(N-1) additions
x(2k)
y(2k)
H0
H1
x(2k+1)
y(2k+1)
H0
H1
X (z) = X 0 ( z3 ) + z 1 X1 (z 3 ) + z 2 X 2 ( z3 ),
H (z) = H 0 (z 3 ) + z 1H1 (z 3 ) + z 2 H 2 (z 3 )
where {X 0(z3), X 1(z3), X 2(z3)} correspond to x(3k),x(3k+1) and x(3k+2)
in time domain, respectively; and {H 0(z3), H 1(z3), H 2(z3)} are the three
sub-filters of H(z) with length N/3.
Chapter 9
(
= [X
)(
= X 0 + z 1 X 1 + z 2 X 2 H 0 + z 1 H 1 + z 2 H 2
0
H 0 + z 3 ( X 1 H 2 + X 2 H 1 ) + z 1 X 0 H 1 + X 1 H 0 + z 3 X 2 H 2
+ z 2 [X 0 H 2 + X 1 H 1 + X 2 H 0 ]
In every iteration, this 3-parallel FIR filter processes 3 input samples x(3k),
x(3k+1) and x(3k+2), and generates 3 outputs y(3k), y(3k+1) and y(3k+2),
and can be expressed in matrix form as:
3
3
Y0 H0 z H2 z H1 X0
X
Y = H
3
1 1 H0 z H2 1
Y2 H2 H1
H0 X 2
Chapter 9
(9.2)
The following figure shows the traditional 3-parallel FIR filter structure,
which requires 3N multiplications and 3(N-1) additions
x(3k)
H0
y(3k)
H1
y(3k+1)
H2
y(3k+2)
H0
x(3k+1)
H1
H2
:z
H0
x(3k+2)
H1
H2
Chapter 9
D
D
Generalization:
The outputs of an L-Parallel FIR filter can be computed as:
L1
k
L
Yk = z Hi X L+k i + H i xk i , 0 k L 2
i =k +1
i =0
(9. 3)
L1
YL1 = Hi X L 1i
i =0
Y =HX
H
z
HL1
Y0
0
Y1 H
H0
1
=
Y H
L1 L1 HL2
zL H1 X0
z L H2 X1
X
H0 L1
(9. 4)
10
Y0 = H 0 X 0 + z 2 H1 X 1
(9. 5)
Y1 = (H 0 + H1 ) ( X 0 + X 1 ) H 0 X 0 H1 X1
This 2-parallel fast FIR filter contains 3 sub-filters. The 2 subfilters H0X0 and H 1X1 are shared for the computation of Y0 and Y1
y(2k)
H0
-
x(2k)
H0+H1
y(2k+1)
x(2k+1)
H1
Chapter 9
D
11
H1 = {h1 , h3 , h5 , h7 }
H 0 + H1 = {h0 + h1 , h2 + h3 , h4 + h5 , h6 + h7 }
The subfilter H0 + H1 can be precomputed
The 2-parallel filter can also be written in matrix form as
Y2 = Q2 H2 P2 X2
(9.6)
Q2 is a post-processing matrix which determines the manner in which the filter outputs
are combined to correctly produce the parallel outputs and P 2 is a pre-processing
matrix which determines the manner in which the inputs should be combined
Chapter 9
12
(matrix form)
H0 1 0
Y0 1 0 z
X0
diag H0 + H1 1 1
Y =
1 1 1 1
H 0 1 X1
1
(9.7)
Note: the application of FFA diagonalizes the original pseudocirculant matrix H. The entries on the diagonal of H2 are the subfilters required in this parallel FIR filter
Many different equivalent parallel FIR filter structures can be
obtained. For example, this 2-parallel filter can be implemented
using sub-filters {H0, H0 -H1, H 1} which may be more attractive in
narrow-band low-pass filters since the sub-filter H 0 -H1 requires
fewer non-zero bits than H 0 +H1. The parallel structure containing
H0 +H1 is more attractive for narrow-band high-pass filters.
Chapter 9
13
(9.8)
[(H0 + H1 )( X0 + X1 ) H1 X1]
[(H1 + H2 )( X1 + X2 ) H1X1 ]
The 3-parallel FIR filter is constructed using 6 sub-filters of length
N/3, including H0X0, H1X1, H 2X2, (H0 + H1 )( X 0 + X1 ),
(H1 + H2 )( X1 + X2 ) and (H0 + H1 + H2 )( X0 + X1 + X2 )
With 3 pre-processing and 7 post-processing additions, this filter
requires 2N multiplications and 2N+4 additions which is 33% less
than a traditional 3-parallel filter
Chapter 9
14
Y3 = Q3 H 3 P3 X 3
1 0 z 3
1 0 z
0
Y0
0 1 0
Y3 = Y1 , Q3 = 1 1 0 0
0 1 0
0 1 1 1
Y2
0 0
0
H0
1 0
0 1
H1
0 0
H2
H 3 = diag
, P3 =
H 0 + H1
1 1
H1 + H 2
0 1
H 0 + H1 + H 2
1 1
Chapter 9
0
0
1
,
0
1
0 0 0
1 0 0
0 1 0
0 0 1
(9.9)
X0
X3 = X1
X 2
15
H0
H1
x(3k+2)
H2
y(3k)
y(3k+1)
H0+H1
H1+H2
y(3k+2)
H0+H1+H2
Chapter 9
16
Chapter 9
Y =H X
T
X L2 X 0 ]
where X F = [ X L 1
T
Y F = [Y L 1 Y L 2 Y 0 ]
(9.10)
17
Examples:
the 2-parallel FIR filter in (9.1) can be reformulated by using transposition
as follows:
H X
Y H
Y = z 2 H
0
1
1
H0 X 0
1
Y2 =Q2 H2 P2 X2
Y2F = (Q2 H2 P2 ) X 2F
H0 1 1
Y1 1 1 0
X1
Y = 0 1 1 diag H0 + H1 0 1 X (9.11)
0
H z 2 1 0
1
18
x0
H0
y0
-
H0+H1
x1
y0
H1
y1
-
z-2
H0
x0
-
H0+H1
y1
Chapter 9
H1
Fig. (a)
x1
Fig. (b)
z-2
19
x0
H0
y0
x1
H0+H1
H1
y1
Fig. (c)
Chapter 9
20
s2 h1
s = h
1 0
s0 0
(9. 12)
Chapter 9
0
x1
h1
x
0
h0
Y0 H 1
Y = 0
1
H0
H1
z 2 X 1
0
X
0
H 0
X1
(9.13)
21
s = C H A X
s2 1 0 0
h1 1 0
s = 1 1 1 diag h + h 1 1 x1
0 1
1
x
h 0 1 0
s0 0 0 1
0
(9.14)
Note: Flipping the samples in the sequences {s}, {h}, and {x} preserves
the convolution formulation (i.e., the same C and A matrices can be used
with the flipped sequences)
Taking the transpose of this algorithm, we can get the matrix form of the
reduced-complexity 2-parallel filtering structure:
T
(
)
Y = C H A X = Q H P X
Chapter 9
22
Y0 1 1 0
(9.15)
Y = 0 1 1 diag H0 + H1 0 1 0 X 0
1
H 0 1 1 X
0
H0
x(2k+1)
y(2k+1)
x(2k)
H0+H1
-
D
Chapter 9
H1
y(2k)
23
Chapter 9
24
M=
i =1
i =1 i
(9.16)
r
i 1
A = Ai Li + Ai L j M k
i =2
i =2
j =i +1 k =1
r
N
+ M i
1
r
i
=
1
i =1 Li
r
Chapter 9
(9.17)
25
For example: consider the case of cascading two 2-parallel reducecomplexity FFAs, the resulting 4-parallel filtering structure would
require a total of 9N/4 multiplications and 20+9(N/4-1) additions.
Compared with the traditional 4-parallel filter which requires 4N
multiplications. This results in a 44% hardware (area) savings
Example: (Example 9.2.1, p.268) Calculating the hardware complexity
Calculate the number of multiplications and additions required to
implement a 24-tap filter with block size of L=6 for both the cases
{L1 = 2, L 2 = 3} and{L1 = 3, L 2 = 2} :
For the case {L1 = 2 , L 2 = 3}:
M1 = 3, A1 = 4, M 2 = 6, A2 = 10,
24
24
M=
(3 6) = 72, A = (4 3) + (10 3) + (3 6)
1 = 96
(2 3)
(2 3)
Chapter 9
26
M1 = 6, A1 = 10, M 2 = 3, A2 = 4 ,
M =
24
(6 3) = 72,
(3 2)
24
A = (10 2 ) + (4 6 ) + (6 3)
1 = 98
(3 2)
= X 0 + z 1 X1 + z 2 X 2 + z 3 X 3
(H
1
2
3
+
z
H
+
z
H
+
z
H3
0
1
2
(9.18)
27
)(
1
1
Y
=
X
'
+
z
X
'
H
'
+
z
H'1
(contd)
0
1
0
where
2
2
H'0 = H0 + z H2, H'1 = H1 + z H3
Application-1
[(
)(
Y = X0' H0' + z1 X0' + X1' H0' + H1' X0' H0' X1' H1' + z 2 X1'H1' (9.19)
)(
X0' H0' = X0 + z2 X2 H0 + z2 H2
Chapter 9
28
Filtering Operation
)(
{X'1 H'1}
X1'H1' = X1 + z2 X3 H1 + z2 H3
Filtering Operation
{(X'0+X'1)(H'0+H'1)}
(
X
+
X
)(
H
+
H
)
(
X
+
X
)(
H
+
H
)
0 1 0 1
2
3
2
3
The second application of the 2-parallel FFA leads to the 4-parallel
filtering structure (shown on the next page), which requires 9
filtering operations with length N/4
Chapter 9
29
Chapter 9
30
n =0
IDCT: N 1
2
(2n +1)k
x(n) = e(k ) X (k ) cos
, n = 0,1, , N 1 (9.21)
N k =0
2
N
Chapter 9
31
where
1 2 , k = 0
e(k ) =
1, otherwise
Chapter 9
32
n =0
1 2 ,
k=0
where e( k ) =
otherwise
1,
It can be written in matrix form as follows: (where ci = cos i 16)
X (0) c4 c4 c4 c4 c4 c4 c4 c4 x(0 )
c c
c5 c7 c9 c11 c13 c15 x (1)
X
(
1
)
1
3
X
(
3
)
c
c
c
c
c
c
c
c
x
(
3
)
9
15
21
27
1
7
13
= 3
X
(
5
)
c
c
c
c
c
c
c
c
x
(
5
)
5 15
25
3
13
23
1
11
X ( 6) c c
c30 c10 c22 c2 c14 c26 x(6 )
6
18
33
X ( 0) c 4
c
X
(
1
)
1
X ( 2) c 2
X
(
3
)
= c 3
X ( 4) c 4
X ( 5) c 5
X ( 6) c
6
X (7) c7
Chapter 9
c4
c4
c4
c4
c4
c4
c3
c5
c7
c7
c5
c3
c6
c6
c2
c2
c6
c6
c7
c1
c5
c5
c1
c7
c4
c4
c4
c4
c4
c4
c1
c7
c3
c3
c7
c1
c2
c2
c6
c6
c2
c2
c5
c3
c1
c1
c3
c5
c 4 x ( 0)
c1 x(1)
c 2 x ( 2)
c3 x (3)
c 4 x ( 4)
c5 x(5)
c 6 x ( 6)
c7 x (7)
(9.22)
34
(continued)
(9.23)
M 0 = x0 x7 , M1 = x3 x4 , M 2 = x1 x6 , M 3 = x2 x5 ,
P0 = x0 + x7 , P1 = x3 + x4 , P2 = x1 + x6 , P3 = x2 + x5 ,
M 10 = P0 P1 , M 11 = P2 P3 , P10 = P0 + P1 , P11 = P2 + P3 ,
M100 = P10 P11, P100 = P10 + P11
(9.24)
The following figure (on the next page) shows the DCT
architecture according to (9.23) and (9.24) with 22 multiplications.
Chapter 9
35
Chapter 9
36
Second step, the DCT structure (see Fig. 9.10, p.279) is grouped into
different functional units represented by blocks and then the whole
DCT structure is transformed into a block diagram
Two major blocks are defined as shown in the following figure
x(0)
x(0)+x(1)
x(1)
x(0)-x(1)
X
a b
x(0)
x(1)
a
b
b
a-
ax(0)+bx(1)
bx(0)-ax(1)
XC
Chapter 9
37
Chapter 9
38
39
x
y
XC
bx-ay
rot
x
y
ax+by
x
y
for
a = sin
b = cos
rot(1 + 2 )
rot1 rot 2
rot( 4)
x
y
c4
c4 = cos( 4)
y X
c4
Chapter 9
40
From the three steps, we obtain the final structure where only 13
multiplications are required (also see Fig. 9.14, p.282)
x(0)
x(7)
c4
3
rot
16
x(3)
x(4)
x(1)
x(6)
x(2)
x(5)
X(5)
rot
16
X(3)
X
c4
-
X(7)
X(6)
3
rot
X(2)
c4
X(0)
X
c4
Chapter 9
X(1)
X(4)
41
{( N 2 ) log 2 N }
We only derive the fast IDCT computation (The fast DCT structure can
be obtained from IDCT by transposition according to their
computation symmetry). For simplicity, the 2/N scaling factor in (9.21)
is ignored in the derivation.
Chapter 9
42
2N
k =0
N 2 1
N 2 1
(
2
n
+
1
)
(
2
k
)
(2 n + 1) (2 k + 1)
(
)
(
)
= X 2 k cos
+ X 2k + 1 cos
2
N
2N
k =0
k =0
N 1
1
(2 n + 1) (2k )
X
(
2
k
)
cos
+
2N
2 cos[(2 n + 1) 2 N ]
k =0
N 2 1
(2 k + 1) cos (2 n + 1) (2 k + 1) cos (2 n + 1)
2
X
2 N
2N
k =0
N 2 1
Notice
(2n + 1) (2k + 1) cos (2n + 1) = cos (2n + 1) (k + 1)
2 cos
2N
2N
N
(2n + 1)k
+ cos
N
Chapter 9
43
2N
2N
k =0
N 2 1
N 2 1
(
2
n
+
1
)
(
k
+
1
)
N
N
k =0
k =0
N 2 2
N 2 1
(
2
n
+
1
)
(
k
+
1
)
N
N
k =0
k =0
Substitute k=k+1 into the first term, we obtain
N 2 2
(2n + 1) (k + 1)
X
(
2
k
+
1
)
cos
k =0
N 2 1
N 2 1
(
2
n
+
1
)
k
'
= X (2 k '1) cos
=
X
(
2
k
'
1
)
cos
N
N
k =0
k '=1
where X ( 1) = 0
Chapter 9
44
x ( n) = X (2 k ) cos
+
k =0
2( N 2 ) 2 cos [(2 n + 1) 2 N ]
(2 k + 1) + X (2 k 1)]cos (2 n + 1) k
[
X
2 (N 2 )
N 2 1
k =0
k = 0,1, , N 2 1 (9.25)
(2 n + 1)k
g
(
n
)
X
(
2
k
)
cos
, n = 0,1, , N 2 1
k =0
2( N 2)
(9.26)
N 2 1
h( n)
(2k + 1) + X (2k 1) cos (2n + 1)k
X
2
(
N
2
)
k
=
0
Clearly, G(k) & H(k) are the DCTs of g(n) & h(n), respectively.
Chapter 9
45
(2( N 1 n ) + 1)k
(2n + 1)k
cos
=
cos
N
N
N
N
Since
x
(
n
)
=
g
(
n
)
+
h ( n ),
2 cos [(2 n + 1) (2 N )]
n = 0,1, , N 2 1
1
x( N 1 n ) = g ( n )
h( n ),
2 cos [(2 n + 1) (2 N )]
(9.27)
Chapter 9
46
Example (see Example 9.3.2, p.284) Construct the 2-point IDCT butterfly
architecture.
The 2-point IDCT can be computed as
x (0) =
x (1) =
X ( 0 ) +
X ( 0 )
X (1) cos ( 4 ),
X (1) cos ( 4 ),
X ( 0 )
X (1)
Chapter 9
x(0)
C4
x(1)
-1
47
G(k ) X (2k ),
k = 0,1,2,3
Chapter 9
(2 n + 1)k
g
(
n
)
=
G
(
k
)
cos
, n = 0,1, , N 2 1
k =0
3 1
h(n) = H (k ) cos (2n + 1)k
k =0
1
x
(
n
)
=
g
(
n
)
+
h( n ),
2 cos [(2 n + 1) 16 ]
1
x ( N 1 n) = g ( n )
h ( n),
2 cos[(2 n + 1) 16 ]
48
The 8-point fast IDCT is shown below (also see Fig.9.16, p.285), where
only 13 multiplications are needed. This structure can be transposed to get
the fast 8-point DCT architecture as shown on the next page (also see Fig.
9.17, p.286) (Note: for N=8, C 4 = 1 [2 cos(4 16 )] = cos( 4 ) in both
figures)
X ( 0 )
G ( 0)
X (0)
X ( 4 )
X (4)
X ( 2 )
X (2)
X ( 6 )
X (6)
X (1)
X (1)
X ( 5)
G ( 2)
C4
Chapter 9
C2
X (3 )
C6
X ( 2)
G (3 )
C4
H (0 )
C1
X ( 7)
C3
X ( 6)
H ( 2)
C4
X (3)
H (1)
X (7)
H (3)
X ( 7 )
X (1)
G (1)
X (5)
X ( 3)
X ( 0)
C4
C2
C7
X ( 4)
C6
C5
X (5 )
49
C4
C1
C2
C4
X ( 2 )
X (6)
C3
C6
C4
Chapter 9
X (6)
X (1)
C7
C2
C4
X ( 3)
X ( 7 )
X (2 )
X (4 )
X (1)
X ( 5)
X (0)
X ( 3)
X ( 5)
C5
C6
C4
X (7 )
50
Keshab K. Parhi
Outline
Introduction
Pipelining in 1st-Order IIR Digital Filters
Pipelining in Higher-Order IIR Digital Filters
Parallel Processing for IIR Filters
Combined Pipelining and Parallel Processing for IIR
Filters
Chapter 10
Look-Ahead Computation
First-Order IIR Filter
Consider a 1st-order linear time-invariant recursion (see Fig. 1)
(10.1)
y(n +1) = a y(n) + b u(n)
The iteration period of this filter is {Tm +Ta}, where {Tm,Ta} represent
word-level multiplication time and addition time
In look-ahead transformation, the linear recursion is first iterated a few
times to create additional concurrency.
By recasting this recursion, we can express y(n+2) as a function of y(n)
to obtain the following expression (see Fig. 2(a))
(10.2)
y(n + 2) = a[ay(n) + bu(n)] + bu(n + 1)
The iteration bound of this recursion is 2 (Tm + Ta ) 2 , the same as the
original version, because the amount of computation and the number of
logical delays inside the recursive loop have both doubled
Chapter 10
y ( n + M ) = a M y ( n) + a i b u ( n + M 1 i) (10.4)
M i=0
Chapter 10
(
}
u(n)
b
Fig. 1
a
y(n)
D
y(n+1)
u(n+1)
u(n)
a
b
Fig.2.(a)
y(n+1)
a
2D
y(n)
y(n+2)
u(n+1)
u(n)
D
Fig.2.(b)
ab
a2
2D
y(n)
y(n+2)
Chapter 10
u(n+M-1)
b
u(n+M-2)
u(n)
ab
aM-1b
aM
y(n)
MD
y(n+M)
H (z) =
1
1 a z 1
(10.5)
The output sample y(n) can be computed using the input sample
u(n) and the past output sample as follows:
y ( n ) = a y ( n 1) + u ( n )
(10.6)
1 + a z 1 + a 2 z 2
H ( z) =
1 a 3 z 3
Chapter 10
H ( z) =
b z
log2 M 1
i =0
(1 + a
1 a z
M
z 2
(10.7)
The pipelined transfer function has poles at the following locations (see
Fig.4(b) for M=8):
{a, ae
j 2 M
, , ae j (M 1)(2 ) M
( ))
(10.8)
(log 2 M
+ 2)
10
Chapter 10
11
Fig. 4
1
H(z) =
1 a z 1
A 12-stage pipelined decomposed implementation is given by
i
i
a
z
i = 0
11
(
1 + az )(1 + a z
=
2
)(
+ a 4 z 4 1 + a 6 z 6
H ( z) =
1 a12 z 12
1 a12 z 12
- This implementation is based on a 232
decomposition (see Fig. 5)
Chapter 10
12
{ ja, ae
j 6
, ae j 5
(
1 + az )(1 + a
H ( z) =
)(
z 1+ a z + a z
1 a12 z 12
)(
)(
j 3
j 6
Chapter 10
j 5 6
13
Fig.5(a)
Pole-zero location of a
12-stage pipelined
1ST-order IIR
Fig.5(b)
Decomposition of the zeros
of the pipelined filter for
a 2X3X2 decomposition
Chapter 10
14
Chapter 10
15
i
i=0
H (z) =
N
(10.9)
1
a z i
i =1
y(n) =
i =1
i=1
a i y (n i) +
i=0
biu ( n i )
(10.10)
ai y (n i) + z (n )
The sample rate of this IIR filter realization is limited by the throughput of
1 multiplication and 1 addition, since the critical path contains a single
delay element
Chapter 10
16
Chapter 10
17
Example: (Example 10.4.1, p.327) Consider the all-pole 2nd-order IIR filter
with poles at {1 2 , 3 4} . The transfer function of this filter is
1
H (z) =
1 54 z 1 + 38 z 2
(10.11)
A 2-stage pipelined equivalent IIR filter can be obtained by eliminating
the z 1 term in the denominator (i.e., multiplying both the numerator and
denominator by 1 + 5 4 z 1 ). The transformed transfer function is
given by:
1
5
H (z) =
5
4
z 1 +
3
8
z 2
1 + 54 z 1
=
2
3
15
1 19
z
+
z
16
32
1+
1+
4
5
4
z
z 1
(10.12)
From, the transfer function, we can see that the coefficient of z 1 in the
denominator is zero. Hence, the critical path of this filter contains 2 delay
elements and can be pipelined by 2 stages
Chapter 10
18
1+
H (z) =
1
z
z
+
+
z
z
Stability: The canceling poles and zeros are utilized for pipelining IIR filters.
However, when the additional poles lie outside the unit circle, the filter
becomes unstable. Note that the filters in (10.12) and (10.13) are unstable.
Chapter 10
19
H ( z) =
1
1 1.5336 z 1 + 0.6889 z 2
(10.14)
By the stability analysis, it is shown that (M=5) does not meet the stability
condition. Thus M is increased to M=6 to obtain the following stable
pipelined filter as
20
i =1
Chapter 10
21
(continued)
N ( z)
H ( z) =
D( z )
=
N ( z ) i =1 k =1 1 pi e j 2 k
M 1
(1 p e
N
M 1
i =1
k =0
j 2 k M
z 1
z 1
N ' ( z)
=
D ' ( z M ) (10.17)
Example (Example 10.4.5, p.332) Consider the 2nd-order filter with complex
conjugate poles at z = re j. The filter transfer function is given by
H (z) =
1
1 2 r cos z 1 + r 2 z 2
22
1 r z
1 r z
H ( z) =
=
1
2
2
1
1+ r z + r z 1 r z
1 r 3 z 3
)(
Example (Example 10.4.6, p.332) Consider the 2nd-order filter with real
poles at {z = r1 , z = r2 }. The transfer function is given by
1
H ( z) =
1 (r1 + r2 ) z 1 + r1r2 z 2
Chapter 10
23
H ( z) =
1 + (r1 + r2 )z 1
{
+ (r
2
3
4
+
r
r
+
r
z
+
r
r
(
r
+
r
)
z
+
r
r
z
1
1 2
2
1 2 1
2
1 2
3
3
3 3
1 r1 + r2 z 3 + r1 r2 z 6
2
24
Im[z]
1
2
3
4
Re[z]
25
H ( z) =
1
i
b
z
i
i =0
N
i
a
z
i =1 i
N
N ( z)
D( z )
(10.18)
1
)
a
z
i
i =1
denominator. The equivalent 2-stage pipelined implementation is
described by
H ( z) =
)(1 ( 1) a z ) = N ' ( z)
a z )(1 (1) a z ) D' ( z )
bz
i=0 i
(1
i =1 i
Chapter 10
i =1
i =1
(10.19)
26
27
Y ( z)
b0 + b1 z 1 + b2 z 2
H ( z) =
=
U ( z ) 1 2 r cos z 1 + r 2 z 2
H ( z) =
1 2r cos M z
M
log2 M 1
i =0
(1 + 2r
2i
+r z
2M
cos 2 z
i
2 M
2i
+r
2i+1 2i+1
z = re
Chapter 10
, i = 0,1,2, (M 1)
28
Fig. 7: (also see Fig.10.11, p.336) Pole-zero representation for Example 10.4.7,
the decompositions of the zeros is shown in (c)
Chapter 10
29
Chapter 10
30
(10.20)
Substituting n = 4k
y (4k + 4) = a 4 y ( 4k ) + a 3u( 4k ) + a 2 u( 4k + 1)
+ au( 4k + 2) + u ( 4k + 3)
The corresponding architecture is shown in Fig.8.
a2
u(4k)
a4
a3
y(4k +4)
D
y(4k)
31
The pole of the original filter is at z = a , whereas the pole for the
parallel system is at z = a4, which is much closer to the origin since
{a
a , since
a 1
Chapter 10
32
u ( 4 k + 3)
u (4 k + 4 )
u ( 4k + 5)
u (4 k + 6 )
a
D
y (4 k + 7 ) a
a
y (4 k + 6 )
a
a
y ( 4 k + 3)
y (4 k + 2 )
D
a
y ( 4 k + 5) a
y (4 k + 6 )
y ( 4 k + 1)
y ( 4k )
Chapter 10
33
Trick:
Instead of computing y(4k+1), y(4k+2) & y(4k+3) independently,
we can use y(4k) to compute y(4k+1), use y(4k+1) to compute
y(4k+2), and use y(4k+2) to compute y(4k+3), at the expense of an
increase in the system latency, which leads to a significant
reduction in hardware complexity.
This method is referred as incremental block processing, and
y(4k+1), y(4k+2) and y(4k+3) are computed incrementally.
Example (Example 10.5.2, p.341) Consider the same 1st-order filter in last
example. To derive its 4-parallel filter structure with the minimum hardware
complexity instead of simply repeating the hardware 4 times as in Fig.15, the
incremental computation technique can be used to reduce hardware complexity
First, design the circuit for computing y(4k) (same as Fig.14)
Then, derive y(4k+1) from y(4k), y(4k+2) from y(4k+1), y(4k+3) from
y(4k+2) by using
y ( 4 k + 1 ) = ay ( 4 k ) + u ( 4 k )
y ( 4 k + 2 ) = ay ( 4 k + 1 ) + u ( 4 k + 1 )
y ( 4 k + 3 ) = ay ( 4 k + 2 ) + u ( 4 k + 2 )
Chapter 10
34
a2
u (4 k )
a4
a3
y(4k + 4)
y (4 k )
a
y ( 4 k + 1)
a
y ( 4k + 2)
a
y (4 k + 3)
Fig.10: (also see Fig.10.16, p.342) Incremental block filter structure with L=4
Chapter 10
35
(1 + z 1 ) 2
H ( z) =
1 54 z 1 + 38 z 2
(10.21)
y ( n ) = 54 y ( n 1) 83 y (n 2) + f ( n );
f ( n ) = u ( n) + 2u ( n 1) + u ( n 2)
(10.22)
The loop update process for the 3-parallel system is shown in Fig.12
where y(3k+3) and y(3k+4) are computed using y(3k) and y(3k+1)
Chapter 10
36
1
2
y(3k+3)
3
4
y(3k)
D
y(3k+4)
Chapter 10
y(3k+1)
D
37
The computation of y(3k+3) using y(3k) & y(3k+1) can be carried out if
y(n+3) can be computed using y(n) & y(n+1). Similarly y(3k+4) can be
computed using y(3k) & y(3k+1) if y(n+4) can be expressed in terms of
y(n) & y(n+1) (see Fig.13). These state update operations correspond to
clustered look-ahead operation for M=2 and 3 cases. The 2-stage and 3stage clustered look-ahead equations are derived as:
y ( n) = 54 y ( n 1) 38 y( n 2 ) + f ( n )
=
5 5
4 4
y( n 2 ) 83 y ( n 3) + f (n 1)] 38 y (n 2)
+ f (n )
5
3
15
= 19
[
y
(
n
3
)
y
(
n
4
)
+
f
(
n
2
)
]
y ( n 3)
16 4
8
32
+ 54 f (n 1) + f ( n)
(10.23)
Chapter 10
38
y
(
3
k
)
+
f (3k + 2)
y(3k + 3) = 19
16
32
4
(10.24)
+ f (3k + 2)
+ 54 f (3k + 3) + f (3k + 4)
The output y(3k+2) can be obtained incrementally as follows:
Chapter 10
39
u(3k + 4) u(3k + 3)
51
128
u(3k + 2)
65
64
2
f1(3k + 4)
f 2 (3k + 4)
y(3k +1)
D
15
32
D
D
5
4
f1(3k +3)
19
16
f 2(3k +3)
19
16
y(3k)
5
4
f1(3k + 2)
3
8
5
4
y(3k + 2)
40
Comments
The original sequential system has 2 poles at {1/2, 3/4}. Now consider the
pole locations of the new parallel system. Rewrite the 2 state update
equations in matrix form: Y(3k+3)=AY(3k)+F, i.e.
128
y(3k) f1
+
y(3k +1) f2
19
16
65
64
(10.25)
( )3 ( )3
41
H ( z) =
1
1 a z 1
(10.26)
Chapter 10
42
Since pipelining level M=4, the loop must contain 4 delay elements
(shown in Fig.15). Since the block size L=3, each delay element
represents a block delay (corresponds to 3 sample delays). Therefore,
y(3k+12) needs to be expressed in terms of y(3k) and inputs (see Fig. 15).
a 12
4D
y(3k+12)
y(3k)
Chapter 10
43
y ( n ) = ay ( n 1) + u ( n )
=
= a 12 y ( n 12 ) + a 11 u ( n 11 ) + + u ( n )
Substituting n=3k+12, we get:
y ( 3 k + 12 ) = a 12 y ( 3 k ) + a 11 u ( 3 k + 1) + + u ( 3 k + 12 )
where
= a 12 y ( 3 k ) + a 6 f 2 ( 3 k + 6 ) + a 3 f 1 ( 3 k + 9 ) + f 1 ( 3 k + 12 )
f 1 ( 3 k + 12 ) = a 2 u ( 3 k + 10 ) + au ( 3 k + 11 ) + u ( 3 k + 12 )
f 2 ( 3 k + 12 ) = a 3 f 1 ( 3 k + 9 ) + f1 ( 3 k + 12 )
Finally, we have:
y ( 3 k + 12 ) = a 12 y ( 3 k ) + a 6 f 2 ( 3 k + 6 ) + a 3 f 1 ( 3 k + 9 ) + f 1 ( 3 k + 12 )
y ( 3 k + 1) = ay ( 3 k ) + u ( 3 k + 1)
y ( 3 k + 2 ) = ay (3 k + 1) + u ( 3 k + 2 )
(10.27)
Chapter 10
44
u(3k +10)
u(3k + 11)
u(3k +12)
D
a
2D
f1 (3k +12)
3D
a3
f2 (3k +12)
a 12
y(3k )
3D
4D
u(3k + 1)
u(3k + 2)
y(3k + 1)
y(3k + 2)
45
Comments
(L 1) + log 2 M
+ 1 + (L 1) = 2 L 1 + log 2 M
(10.28)
y ( n ) = 54 y ( n 1) 83 y (n 2) + f ( n );
f ( n ) = u ( n) + 2u ( n 1) + u ( n 2)
Chapter 10
(10.29)
46
( ) ( )
(2 )
(2 ) (4 )
(4 )
(i )
Chapter 10
1
M
1 i N
47
Outline
Introduction
Scaling and Round-off Noise
State Variable Description of Digital Filters
Scaling and Round-off Noise Computation
Round-off Noise Computation Using State
Variable Description
Slow-Down, Retiming, and Pipelining
Chapter 11
Introduction
Chapter 11
D(z)
OUT
IN
F(z)
x
(a)
G(z)
D(z)
OUT
IN
F(z)/ x
G(z)
(b)
Fig.11.1 (a) A filter with unscaled node x, (b) A filter with scaled node x
Chapter 11
l1 scaling :
(11.2)
= f (i ) ,
l 2 scaling : =
i= 0
i =0
f 2 (i ) ,
(11.3)
x( n ) =
i =0
f (i )u ( n i ) i = 0 f (i )
(11.4)
Chapter 11
E x (n) = i =0 f 2 (i)
2
(11.5)
Chapter 11
Chapter 11
a
15-bits
8-bits
u(n)
x(n)
8-bits
Fig.11.2 A 1 ST-order IIR filter (W=8)
u(n)
x(n)
Let the wordlength of the output be W-bits, then the round-off error
e(n) can be given by
2 (W 1)
2 (W 1)
e ( n)
2
2
(11.6)
10
Pe(x)
1
1 x 2 2
2
=0
E[e(n)] = xPe ( x)dx =
(11.7)
2
2 2
3
2
2W
1
x
2
2
2
2
2
E
e
(
n
)
=
x
2 Pe (x)dx = 3 2 = 12 = 3
(11.8)
2
(11.8) can be rewritten as (11.9), where e is the variance of the round-
e2 = 2 2W 3
Chapter 11
(11.9)
11
2W
Chapter 11
12
(11.10)
(11.11)
where x is the state vector, u is the input, and y is the output of the filter; x,
b and c are N1 column vectors; A is NN matrix; d, u and y are scalars.
Let { fi (n)} be the unit-sample response from the input u(n) to the state
xi (n) and let yi (n) be the unit-sample response from the state xi (n)
to the output {gi (n)} . It is necessary to scale the inputs to multipliers in
order to avoid internal overflow
Chapter 11
13
d
A
u(n)
b
x ( n + 1)
x(n)
y (n )
e(n)
X ( z)
b z 1
=
U ( z) I z 1 A
Chapter 11
(11.12)
14
F ( z ) = X ( z ) U ( z) = ( I + Az 1 + A2 z 2 + )bz 1 ,
n 1
f (n) = A
b, n 1.
(11.13)
(11.14)
We can compute f(n) by substituting u(n) by (n) and using the recursion
(11.15) and initial condition f(0)=0:
f (n + 1) = A f (n) + b (n)
(11.15)
The unit-sample response g(n) from the state x(n) to the output y(n) can be
computed similarly with u(n)=0. The corresponding SFG is shown in
Fig.11.6, which represents the following transfer function G(z),
T
c
G(z) =
,
1
I Az
g (n) = c A ,
T
Chapter 11
(11.16)
n0
(11.17)
15
A z 1
STATE
STATE
OUT
OUT
1
I A z 1
K E x (n ) xT (n )
(11.18)
Chapter 11
16
x (n ) = [ x1 ( n ), x 2 ( n ), , x N ( n )] T
= f (n ) u ( n ) =
(11.19)
A b u ( n l 1)
l
l=0
Therefore
(11.20)
l
m
K = E ( A b ) u ( n l 1 ) u ( n m 1 )( A b ) T
l= 0
m =0
= E
l =0
=
l=0 m =0
m =0
l
m
A b u ( n l 1 ) u ( n m 1 )( A b ) T
A b E [u ( n l 1) u ( n m 1 ) ]( A b ) T
l
(11.21)
E [ u 2 ( n )] = 1
E [ u ( n ) u ( n k )] = 0 ,
Chapter 11
(11.22)
k 0
(11.23)
17
l =0
l =0
l =0 m=0
l =1
K +1
K +1
K=0
K
K
T T
= bb + A A b ( A b) A = bb + A A b ( A b) A (11.24)
K=0
K =0
K = b b + A K A
T
(11.25)
If for some state x i, E[xi2 ] has a higher value than other states, then
x i needs to be assigned more bits, which leads to extra hardware
and irregular design.
By scaling, we can ensure that all nodes have equal power, and the
same word-length can be assigned to all nodes.
Chapter 11
18
W = g (n)g(n) = (c A )T c A
T
n=0
(11.27)
n=0
W = A W A + c c
T
Chapter 11
(11.28)
19
x S , we can write,
x S ( n) = T 1 x(n ) x( n) = T x S ( n)
(11.29)
Substituting for x from (11.29) into (11.10) and solving for x S , we get
T x S ( n + 1) = A T x S ( n) + b u ( n)
1
(11.30)
1
x S (n + 1) = T A T x S (n ) + T b u (n )
(11.31)
x S (n + 1) = AS x S (n) + b S u (n)
(11.32)
Chapter 11
20
where
(A
= T A T , bS = T b
y ( n) = c T x S ( n) + d u ( n)
T
= c S x S ( n ) + d S u ( n)
T
cS = c T , dS = d
T
1 T
(11.33)
1
1 T
K S = E[ x S x ] = E[T x x (T ) ] = T E x x (T )
T
S
K S = T K (T )T
(11.34)
Chapter 11
21
T = diag [t11 , t 22 , t NN ],
(11.35)
1 1
1
, ,
] = (T 1 ) T
(11.36)
t11 t 22
t NN
From (11.34) and (11.35) and let ( K S ) ii = 1, we can obtain:
K
( K S ) ii = 2ii = 1,
(11.37)
t ii
T 1 = diag [
t ii =
K ii
(11.38)
Chapter 11
22
1 2
z 1
u (n )
x 2 (n )
z 1
1 16
x1 ( n )
y (n)
1 16
0
A=1
16
1
,
0
b = ,
1
161
c = 1,
2
d =0
K 11
K
21
Chapter 11
K 12 0 1 K 11 K 12 0
=1
K 22 16 0 K 21 K 22 1
1
K 22
16 K 21
=1
1
K
K
+
1
16 12 256 11
0
+
0 0
1
16
0
1
23
Thus, we get:
For
{K 11 = K 22 = 256
255 ,
K12 = K 21 = 0}
255
0 1
0
1
AS = T A T = 1
, b S = T b = 255 ,
16 0
16
1
255
T
c S = T c = 8 ,
dS = 0
255
1
Chapter 11
0 256
255
255
16
0
255
0 16
256
255
0
0 1
=
255
16
0
0
1
24
-0.501
0.998
u(n)
x2(n)
-0.0626
y(n)
x1(n)
1/16
Chapter 11
25
Consider the mean and the variance of yi (n) . Since ei (n) is white noise with
zero mean, so we have:
E [ yi (n)] = 0,
(11.40)
26
] g (n l )
(contd) E y 2 (n) =
i
2
e
lm g i (n m)
= e2 g i2 (n l ) = e2 g i2 (n )
l
(11.41)
Expanding W in its explicit matrix form, we can observe that all its diagonal entries are of the form sum_n g_i^2(n):

W = sum_n g(n) g^T(n) = sum_n [g1(n), ..., gN(n)]^T [g1(n), ..., gN(n)]    (11.42)

  = [ sum_n g1^2(n)       sum_n g1(n) g2(n)   ...   sum_n g1(n) gN(n) ]
    [ sum_n g2(n) g1(n)   sum_n g2^2(n)       ...         ...         ]
    [       ...                 ...           ...         ...         ]
    [ sum_n gN(n) g1(n)   sum_n gN(n) g2(n)   ...   sum_n gN^2(n)     ]    (11.43)

Chapter 11
Using (11.41), we can write the total output round-off noise in terms of the trace of W:

total_roundoff_noise = sigma_e^2 sum_{i=1..N} sum_n g_i^2(n) = sigma_e^2 sum_{i=1..N} W_ii = sigma_e^2 Trace(W)    (11.44)

Note: (11.44) is valid in general, but when there is no round-off operation at some node, the W_ii corresponding to that node should not be included when computing the noise power.

(11.44) can be extended to compute the total round-off noise of the scaled system, which is simply the trace of the scaled W matrix:

total_roundoff_noise(scaled) = sigma_e^2 sum_{i=1..N} (W_S)_ii    (11.45)

W_S = T W T    (11.46)

total_roundoff_noise(scaled) = sigma_e^2 sum_{i=1..N} t_ii^2 W_ii    (11.47)
                             = sigma_e^2 sum_{i=1..N} K_ii W_ii    (11.48)

Chapter 11
Example (Example 11.4.2, p.390): To find the output round-off noise of the scaled filter in Fig.11.8, W_scaled can be calculated using (11.28) as

[ W11 W12 ]   [ 0  1/16 ] [ W11 W12 ] [ 0    1 ]   [ 1/255  8/255  ]
[ W21 W22 ] = [ 1  0    ] [ W21 W22 ] [ 1/16 0 ] + [ 8/255  64/255 ]

            = [ W22/256  W21/16 ]   [ 1/255  8/255  ]
              [ W12/16   W11    ] + [ 8/255  64/255 ]

Chapter 11
Thus,

[ W11 W12 ]   [ 0.0049  0.0332 ]
[ W21 W22 ] = [ 0.0332  0.2559 ]    (scaled)

For the unscaled filter, W = A^T W A + c c^T gives

[ W11 W12 ]   [ 0  1/16 ] [ W11 W12 ] [ 0    1 ]   [ 1/256  1/32 ]
[ W21 W22 ] = [ 1  0    ] [ W21 W22 ] [ 1/16 0 ] + [ 1/32   1/4  ]

            = [ W22/256  W21/16 ]   [ 1/256  1/32 ]
              [ W12/16   W11    ] + [ 1/32   1/4  ]

Thus,

[ W11 W12 ]   [ 0.0049  0.0333 ]
[ W21 W22 ] = [ 0.0333  0.2549 ]    (unscaled)

Chapter 11
Notice: the scaled filter suffers from larger round-off noise, which can also be observed by comparing the unscaled and scaled filter structures:

roundoff_noise(unscaled) / roundoff_noise(scaled) = 0.2598 / 0.2608 < 1

In the scaled filter, the input is scaled down by multiplying it by 0.998 to avoid overflow (see Fig.11.8). Therefore, to keep the transfer functions of both filters the same, the output path of the scaled filter must have a gain that is 1/0.998 times the gain of the output path of the unscaled filter. Thus the round-off noise of the scaled filter is 1/0.998^2 times that of the unscaled filter.
This observation reflects the tradeoff between overflow and round-off noise: more stringent scaling reduces the possibility of overflow but increases the effect of round-off noise.

Notice: (11.48) can be confirmed by

(K11 W11 + K22 W22)_unscaled = (256/255)(0.0049 + 0.2549) = 0.2608 = (W11 + W22)_scaled

Chapter 11
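These noise figures can be reproduced numerically. A small Python sketch (assuming the example matrices from this section): trace(W) gives the unscaled noise ~0.2598 sigma_e^2, and sum K_ii W_ii gives the scaled noise ~0.2608 sigma_e^2, confirming (11.48):

import numpy as np

def solve_lyapunov(A, Q, iters=200):
    # fixed-point iteration of X = A X A^T + Q (converges for stable A)
    X = np.zeros_like(Q)
    for _ in range(iters):
        X = A @ X @ A.T + Q
    return X

A = np.array([[0.0, 1.0], [1.0 / 16.0, 0.0]])
b = np.array([[0.0], [1.0]])
c = np.array([[1.0 / 16.0], [0.5]])

K = solve_lyapunov(A, b @ b.T)          # K = A K A^T + b b^T, eq. (11.25)
W = solve_lyapunov(A.T, c @ c.T)        # W = A^T W A + c c^T, eq. (11.28)

print(np.trace(W))                      # ~0.2598: unscaled noise / sigma_e^2
print(np.sum(np.diag(K) * np.diag(W)))  # ~0.2608: scaled noise, eq. (11.48)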
31
Chapter 11

An algorithm to compute K (doubling):
1. Initialize: F <- A,  K <- b b^T
2. Loop: K <- F K F^T + K,  F <- F F   (repeat until F ~ 0)

Algorithm analysis -- after the 1st loop iteration:

K = A (b b^T) A^T + b b^T,   F = A^2    (11.49)

After the 2nd iteration:

K = A^3 b b^T (A^3)^T + A^2 b b^T (A^2)^T + A b b^T A^T + b b^T,   F = A^4    (11.50)

Thus, each iteration doubles the number of terms in the sum of (11.24).
The algorithm converges as long as the filter is stable (the eigenvalues of the matrix A are the poles of the transfer function).
This algorithm can also be used to compute W after small changes:
1. Initialize: F <- A^T,  W <- c c^T
2. Loop: W <- F W F^T + W,  F <- F F   (repeat until F ~ 0)

Chapter 11
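A compact Python rendering of the doubling loop (a sketch, not the book's code; the function name is illustrative). Calling it with (A, b) yields K, and with (A^T, c) yields W:

import numpy as np

def gram_doubling(A, v, iters=30):
    # After k iterations: X = sum_{l=0}^{2^k-1} A^l v v^T (A^l)^T, F = A^(2^k)
    F = A.copy()
    X = np.outer(v, v)         # X <- v v^T
    for _ in range(iters):     # stop when F ~ 0 (guaranteed for a stable filter)
        X = X + F @ X @ F.T    # doubles the number of terms in (11.24)
        F = F @ F              # F <- F F
    return X

# K = gram_doubling(A, b);  W = gram_doubling(A.T, c)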
34
Chapter 11
35
Example: 3rd-order scaled-normalized lattice filter.

[Figure 11.9: 3rd-order scaled-normalized lattice filter with states #1, #2, #3 (one per delay z^-1) and multiplier coefficients 0.3695, 0.2252, 0.9743, 0.9293, 0.532, 0.3209, 0.8467, 0.0569, 0.9471, 0.9984, 0.323, 0.0029; input u(n), output y(n).]

Chapter 11
For this filter the state covariance matrix is the identity:

K = [ 1 0 0 ]
    [ 0 1 0 ]
    [ 0 0 1 ]

Conclusion:
Using the state variable description method, we can compute the signal power or round-off noise of a digital filter easily and directly. However, it cannot be used for nodes that are not connected to unit-delay branches, because such nodes do not appear in the state variable description.

Chapter 11
37
Chapter 11
38
Slow-down transformation: replacing each delay z^-1 by z^-3 transforms H(z) = F(z)G(z) into F(z^3)G(z^3).    (11.51)

Thus, if the unit-sample response from the input to the internal node x in Fig.11.10(a) is defined by

f(n) = { f(0), f(1), f(2), ... },    (11.52)

then the unit-sample response from the input to the internal node x in Fig.11.10(b) is

f'(n) = { f(0), 0, 0, f(1), 0, 0, f(2), 0, 0, ... }    (11.53)

We can get:

K'_xx = K_xx    (11.54)
W'_xx = W_xx    (11.55)

Chapter 11
[Figure 11.10: (a) A filter with transfer function H(z) = F(z)G(z) and internal node x between G(z) and F(z). (b) Transformed filter obtained by the 3-slow-down transformation, G'(z) = G(z^3) and F'(z) = F(z^3), so that H'(z) = F(z^3)G(z^3).]

Chapter 11
40
Pipelining:
Consider the filter in Fig.11.11(a), which has a non-state-variable node x on the feed-forward path. This node cannot be converted into a state variable node by the slow-down transformation.
However, since x is on the feed-forward path, a delay can be placed at a proper cut-set location, as shown in Fig.11.11(b). This pipelining operation converts the non-state-variable node x into a state variable node. The output sequence of the pipelined filter equals that of the original filter except for a one-clock-cycle delay.
Hence the pipelined filter has the same possibility of overflow and the same round-off noise behavior as the original filter; pipelining does not change the filter's finite word-length behavior.

Chapter 11
41
[Figure 11.11: (a) Filter with multipliers a, b, c, one delay D, input u(n), output y(n), and a non-state-variable node on the feed-forward path. (b) Pipelined filter with added cut-set delays; the output becomes y(n-1).]

Chapter 11
42
Retiming:
In a linear array, if either all the left-directed or all the right-directed edges between modules carry at least one delay each, the cut-set localization procedure can be applied to transfer some delays, or a fraction of the delays, to the edges in the opposite direction (see Chapter 4). This is called retiming.
Example (Example 11.7.1, p.407): Consider the filter shown in Fig.11.12. It is the same as the 3rd-order scaled-normalized lattice filter in Fig.11.9 except that it has five more delays. The SFG in Fig.11.12 is obtained by a 2-slow transformation followed by retiming or cut-set transformation. (contd)

Chapter 11
43
[Figure 11.12: 2-slow, retimed version of the 3rd-order scaled-normalized lattice filter; its eight delays define states #1-#8, with multiplier coefficients 0.323, 0.0029, 0.9984, 0.9471, 0.0569, 0.8467, 0.3209, 0.3695, 0.2252, 0.532, 0.9293, 0.9743; input u(n), output y(n).]
(contd) Notice that the signal power or round-off noise at every internal node of this filter can be computed using the state variable description, since each node is connected to a unit-delay branch. Since there are 8 states, the dimensions of the matrices A, b, c, and d are 8x8, 8x1, 8x1, and 1x1, respectively. From Fig.11.12, the state equations can be written as follows:

x1(n+1) = 0.532 x7(n) + 0.8467 u(n)
x2(n+1) = 0.3695 x1(n) - 0.9293 x8(n)
x3(n+1) = 0.2252 x2(n) + 0.9743 x4(n)
x4(n+1) = x3(n)
x5(n+1) = 0.0569 x1(n) + 0.9984 x6(n)
x6(n+1) = 0.3209 x2(n) + 0.9471 x4(n)
x7(n+1) = 0.9293 x1(n) + 0.3695 x8(n)
x8(n+1) = 0.9743 x2(n) + 0.2252 x4(n)
y(n) = 0.323 x5(n) + 0.0029 u(n)

Chapter 11
45
Thus, the total output round-off noise is sigma_e^2 sum_i K_ii W_ii = 1.191 sigma_e^2.
Note: no round-off operation is associated with node 4 (state x4); therefore W44 is not included in Trace(W) for the round-off noise computation.

Chapter 11
46
Chap. 13
Parallel Multipliers:
For W-bit two's-complement fractions,

A = a_{W-1}.a_{W-2}...a_1 a_0 = -a_{W-1} + sum_{i=1..W-1} a_{W-1-i} 2^-i
B = b_{W-1}.b_{W-2}...b_1 b_0 = -b_{W-1} + sum_{i=1..W-1} b_{W-1-i} 2^-i

Chap. 13

The negation of A follows by complementing every bit and adding one unit in the last position:

-A = a_{W-1} - sum_{i=1..W-1} a_{W-1-i} 2^-i
   = a_{W-1} + sum_{i=1..W-1} (1 - a_{W-1-i}) 2^-i - sum_{i=1..W-1} 2^-i
   = a_{W-1} + sum_{i=1..W-1} (1 - a_{W-1-i}) 2^-i - 1 + 2^-(W-1)
   = -(1 - a_{W-1}) + sum_{i=1..W-1} (1 - a_{W-1-i}) 2^-i + 2^-(W-1)

Chap. 13
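A quick Python check of this identity (a sketch, with the LSB weight scaled from 2^-(W-1) to 1 so that integers can be used):

W = 8
mask = (1 << W) - 1
for a in range(-(1 << (W - 1)), 1 << (W - 1)):
    pattern = a & mask               # W-bit two's-complement pattern of a
    neg = ((~pattern) + 1) & mask    # complement all bits, add one LSB
    assert neg == ((-a) & mask)      # matches the derivation above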
4
Baugh-Wooley Multipliers:
Handles the sign bits of the multiplicand and multiplier
efficiently.
Chap. 13
10
Radix-4 modified Booth recoding:

b_{2i+1} b_{2i} b_{2i-1} | recoded digit | operation | comment
   0       0       0     |       0       |    +0     | string of 0s
   0       0       1     |       1       |    +A     | end of 1s
   0       1       0     |       1       |    +A     | a single 1
   0       1       1     |       2       |    +2A    | end of 1s
   1       0       0     |      -2       |    -2A    | beginning of 1s
   1       0       1     |      -1       |    -A     | a single 0
   1       1       0     |      -1       |    -A     | beginning of 1s
   1       1       1     |      -0       |    +0     | string of 1s
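The table corresponds to the closed form digit_i = -2 b_{2i+1} + b_{2i} + b_{2i-1}. A hedged Python sketch of the recoding (not from the slides; the function name and interface are illustrative):

def booth_radix4(b, W):
    # Recode a W-bit two's-complement multiplier (W even) into
    # radix-4 digits in {-2,-1,0,1,2}, least-significant digit first.
    bits = [0] + [(b >> i) & 1 for i in range(W)]   # prepend b_{-1} = 0
    return [-2 * bits[i + 2] + bits[i + 1] + bits[i]
            for i in range(0, W, 2)]

# e.g. booth_radix4(0b0111, 4) -> [-1, 2], and -1 + 2*4 = 7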
11
12
Bit-Serial Multipliers
Lyon's bit-serial multiplier using Horner's rule:
Chap. 13
[Figure: bit-level dependence graph of the bit-serial multiplier with edges a(0,1), b(1,0), carry(1,0), x(-1,1).]
16
Mapping: d^T = [0 1], s^T = [0 1], and p^T = [1 0]

edge e      | p^T e | s^T e
a (0,1)     |   0   |   1
b (1,0)     |   1   |   0
carry (1,0) |   1   |   0
x (-1,1)    |  -1   |   1

Chap. 13
17
Mapping: d^T = [1 0], s^T = [1 1], and p^T = [0 1]

edge e      | p^T e | s^T e
a (0,1)     |   1   |   1
b (1,0)     |   0   |   1
carry (1,0) |   0   |   1
x (-1,1)    |   1   |   0

Chap. 13
18
Chap. 13
19
Mapping: d^T = [0 1], s^T = [0 1], and p^T = [1 0]

edge e          | p^T e | s^T e
a (0,1)         |   0   |   1
carry (0,1)     |   0   |   1
b (1,0)         |   1   |   0
x (1,1)         |   1   |   1
carry-vm (-1,0) |  -1   |   0
20
Chap. 13
Note:
To compute the total number of delays in the bit-level architecture, the paths with the largest number of delay elements through the switching elements should be counted.
Input synchronizing delays are also referred to as shimming delays or skewing delays.
The loops in the intermediate bit-level pipelined architecture may contain more than W N_D bit-level delay elements, in which case the word-length needs to be increased.
The architecture without the two loop synchronizing delays can function correctly with a signal word-length of 6, which is the minimum word-length for the bit-level pipelined bit-serial architecture.
Chap. 13
28
Associativity transformation:

Chap. 13
Two's-complement to CSD conversion:

a_{-1} = 0;
gamma_{-1} = 0;
a_W = a_{W-1};
for (i = 0 to W-1)
{
    theta_i = a_i XOR a_{i-1};
    gamma_i = (1 - gamma_{i-1}) * theta_i;
    a'_i = (1 - 2*a_{i+1}) * gamma_i;
}
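A direct Python transcription of this recursion (a sketch; bits[0] is the LSB and the resulting digit set is {-1, 0, 1}):

def to_csd(bits):
    # theta_i = a_i XOR a_{i-1}; gamma_i = (1 - gamma_{i-1}) * theta_i;
    # a'_i = (1 - 2*a_{i+1}) * gamma_i, with a_{-1} = gamma_{-1} = 0.
    W = len(bits)
    a = bits + [bits[-1]]        # a_W = a_{W-1} (sign extension)
    a_prev, g_prev = 0, 0
    csd = []
    for i in range(W):
        theta = a[i] ^ a_prev
        gamma = (1 - g_prev) * theta
        csd.append((1 - 2 * a[i + 1]) * gamma)
        a_prev, g_prev = a[i], gamma
    return csd

# e.g. to_csd([1, 1, 1, 0]) -> [-1, 0, 0, 1], i.e. 7 = -1 + 8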
[Figure: bit-serial hardware for the two's-complement to CSD conversion; each stage forms theta_i and gamma_i and outputs the recoded digit a'_i = (1 - 2 a_{i+1}) gamma_i.]

Chap. 13
33
CSD Multiplication
Chap. 13
Redundancy Factor
For a radix-r signed-digit system with digit set {-a, ..., a}, define the redundancy factor rho = a/(r-1):

class                | largest digit a | rho
incomplete           | a < (r-1)/2     | rho < 1/2
non-redundant        | a = (r-1)/2     | rho = 1/2
redundant            | a >= r/2        | rho > 1/2
minimally redundant  | a = r/2         |
maximally redundant  | a = r-1         | rho = 1
over-redundant       | a > r-1         | rho > 1

Chap. 14
Hybrid radix-2 addition (x_i signed-digit, y_i unsigned):

digit               | value set     | binary code
x_i                 | {-1, 0, 1}    | x_i = x_i+ - x_i-
y_i                 | {0, 1}        | y_i = y_i+
p_i = x_i + y_i     | {-1, 0, 1, 2} | p_i = 2 t_i + u_i
u_i                 | {-1, 0}       | u_i = -u_i-
t_i                 | {0, 1}        | t_i = t_i+
s_i = u_i + t_{i-1} | {-1, 0, 1}    | s_i = s_i+ - s_i-

[Figures: LSD-first adder and MSD-first adder cell implementations.]

Chap. 14
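A behavioral Python sketch of the LSD-first addition (interface assumed; the slides define only the digit sets and codes). Because u_i is in {-1, 0} and t_{i-1} in {0, 1}, the sum digit u_i + t_{i-1} stays in {-1, 0, 1} and the transfer never propagates more than one position:

def hybrid_add_lsd(x, y):
    # x: radix-2 signed digits in {-1,0,1}; y: bits in {0,1}; both LSD first.
    n = max(len(x), len(y))
    x = x + [0] * (n - len(x))
    y = y + [0] * (n - len(y))
    s, t_prev = [], 0
    for xi, yi in zip(x, y):
        p = xi + yi              # p_i in {-1, 0, 1, 2}
        t = 1 if p >= 1 else 0   # transfer digit t_i in {0, 1}
        u = p - 2 * t            # interim digit u_i in {-1, 0}
        s.append(u + t_prev)     # s_i = u_i + t_{i-1} in {-1, 0, 1}
        t_prev = t
    s.append(t_prev)             # final transfer becomes the MSD
    return s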
Hybrid radix-2 subtraction (p_i = x_i - y_i):

digit               | value set      | binary code
x_i                 | {-1, 0, 1}     | x_i = x_i+ - x_i-
y_i                 | {0, 1}         | y_i = y_i-
p_i = x_i - y_i     | {-2, -1, 0, 1} | p_i = 2 t_i + u_i
u_i                 | {0, 1}         | u_i = u_i+
t_i                 | {-1, 0}        | t_i = -t_i-
s_i = u_i + t_{i-1} | {-1, 0, 1}     | s_i = s_i+ - s_i-

Chap. 14
Chap. 14
Maximally redundant hybrid radix-4 addition:

digit               | value set                  | binary code
x_i                 | {-3, -2, -1, 0, 1, 2, 3}   |
y_i                 | {0, 1, 2, 3}               | y_i = 2 y_i^{2+} + y_i^{+}
p_i = x_i + y_i     | {-3, -2, ..., 5, 6}        | p_i = 4 t_i + u_i
u_i                 | {-3, -2, -1, 0, 1, 2}      |
t_i                 | {0, 1}                     | t_i = t_i+
s_i = u_i + t_{i-1} | {-3, -2, -1, 0, 1, 2, 3}   |

Four-digit MRHY4A (maximally redundant hybrid radix-4 adder)

Chap. 14
Minimally redundant hybrid radix-4 addition:

digit               | value set             | binary code
x_i                 | {-2, -1, 0, 1, 2}     |
y_i                 | {0, 1, 2, 3}          | y_i = 2 y_i^{2+} + y_i^{+}
p_i = x_i + y_i     | {-2, -1, ..., 4, 5}   | p_i = 4 t_i + u_i
u_i                 | {-2, -1, 0, 1}        |
t_i                 | {0, 1}                | t_i = t_i+
s_i = u_i + t_{i-1} | {-2, -1, 0, 1, 2}     |

Four-digit mrHY4A (minimally redundant hybrid radix-4 adder)

Chap. 14
Chap. 14
19
Radix-4 representation:
X is a radix-4 maximally redundant number when its digits are drawn from {-3, ..., 3}.

X is a radix-4 minimally redundant number when X is a radix-4 complement number whose digits x_i are encoded using 2 wires as x_i = 2 x_i^{2+} + x_i^{+}. Its corresponding minimally redundant number Y is encoded using y_i = -2 y_i^{2-} + y_i^{+} + y_i^{++}.

To convert a radix-r number x to a redundant number y, the digits in the range [r/2, r-1] are encoded using a transfer digit 1 and a corresponding digit x_i - r, where x_i is the i-th digit of x. Thus,

2 x_i^{2+} + x_i^{+} = 4 x_i^{2+} - 2 x_i^{2+} + x_i^{+}
                     = y_{i+1}^{++} - 2 y_i^{2-} + y_i^{+}

Chap. 14
21
Chap. 15
Iterative matching example:

Updated set of constants, 1st iteration:

constant    | unsigned
Rem. of a   | 10100000
b           | 10110110
Rem. of c   | 00010000
Red. of a,c | 01001101

Updated set of constants, 2nd iteration:

constant        | unsigned
Rem. of a       | 00000000
Rem. of b       | 00010110
Rem. of c       | 00010000
Red. of a,c     | 01001101
Red. of Rem a,b | 10100000

Chap. 15
Linear Transformations
A general form of linear transformation is given as:

y = T * x

where T is an m-by-n matrix, y is a length-m vector, and x is a length-n vector. It can also be written element-wise as:

y_i = sum_{j=1..n} t_ij x_j,   i = 1, ..., m

Chap. 15
Example:

T = [  7   8   2  13 ]
    [ 12  11   7  13 ]
    [  5   8   2  15 ]
    [  7  11   7  11 ]

[Table: binary representations of the entries of T, listed column by column, used to identify common bit patterns (shared subexpressions) across the columns.]

Chap. 15
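For reference, a small Python sketch (the matrix is the one reconstructed above) computing y = T x and counting the adders of a naive shift-and-add realization -- the baseline that subexpression sharing improves on:

import numpy as np

T = np.array([[ 7,  8,  2, 13],
              [12, 11,  7, 13],
              [ 5,  8,  2, 15],
              [ 7, 11,  7, 11]])
x = np.array([1, 2, 3, 4])
y = T @ x                       # y_i = sum_j t_ij * x_j

# Naive cost: one add per extra nonzero bit of each constant,
# plus (n - 1) adds to combine the n products in each row.
adds = sum(bin(v).count("1") - 1 for v in T.flatten())
adds += T.shape[0] * (T.shape[1] - 1)
print(y, adds)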
Polynomial Evaluation
Chap. 15
Example:
y(n) = 1.000100000*x(n) + 0.101010010*x(n-1) + 0.000100001*x(n-2)

[Table: signed-digit representation of the three coefficients; each nonzero entry (+1 or -1) marks an add or subtract of a shifted input.]

Chap. 15
Example:
y(n) = 1.01010000010*x(n) + 0.10001010101*x(n-1) + 0.10010000010*x(n-2) + 1.00000101000*x(n-4)

The substructure matching procedure for this design is as follows: start with the table containing the (signed-digit) coefficients of the FIR filter. An entry with absolute value 1 in this table denotes an add or subtract of the corresponding shifted input. Identify the best sub-expression of size 2, as in the sketch below.
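A simplified Python sketch of one matching step (a hypothetical helper, not the book's algorithm): it counts how often each pattern of two nonzero digits at a given shift occurs across the coefficient digit vectors and returns the most frequent one. Overlap resolution and rewriting of the table are omitted:

from collections import Counter

def best_pair(coeffs):
    # coeffs: list of lists of signed digits in {-1, 0, 1}, MSB first.
    counts = Counter()
    for row in coeffs:
        for i, d1 in enumerate(row):
            if d1 == 0:
                continue
            for j in range(i + 1, len(row)):
                if row[j] != 0:
                    counts[(j - i, d1, row[j])] += 1  # (shift, sign pattern)
    return counts.most_common(1)[0] if counts else None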
[Tables: coefficient digit matrices over successive matching iterations; after each step, matched digit pairs are replaced by new symbols (entries +-2, then +-3), reducing the number of adders.]

Chap. 15
Chap. 15
[Figure: the classical IC design space spans speed and complexity (area); adding power as a third axis defines the new low-power design space.]

Chapter 17
Challenges:
Non-portable systems: workstations, communication systems.
Example: DEC Alpha at 1 GHz, 120 watts.
Concerns: packaging costs, system reliability.

Chapter 17
Power Dissipation
Two measures are important:
- Peak power (sets dimensions)
- Average power:

P_av = (V_DD / T) * integral_{0..T} i_DD(t) dt

Charging the load capacitance stores

E_C = C V^2 / 2 = C_L V_DD^2 / 2

and the energy E_C is also dissipated on discharge, i.e.

E_tot = C_L V_DD^2

Power consumption:

P = C_L V_DD^2 f

Chapter 17
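A one-minute illustration of the quadratic V_DD dependence (Python; the capacitance and frequency values are assumed, purely illustrative):

C_L = 20e-12           # 20 pF switched capacitance (assumed)
f = 100e6              # 100 MHz switching frequency (assumed)
for VDD in (5.0, 3.3, 2.4):
    P = C_L * VDD**2 * f                  # P = C_L VDD^2 f
    print(VDD, "V ->", round(P * 1e3, 2), "mW")
# 5.0 V -> 50 mW; 3.3 V -> 21.78 mW; 2.4 V -> 11.52 mW
# Halving VDD roughly quarters dynamic power (at the cost of speed).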
Example (signal probabilities):
[Figure: two AND gates feeding an OR gate, z = x + y.]

With independent inputs P_a = P_b = P_c = P_d = 0.5 and x = a.b, y = c.d:
P_x = P_y = 0.25
P_z = 1 - (1 - P_x)(1 - P_y) = 7/16 = 0.4375

If instead x = a.c and y = c.d share the input c, then due to correlation:
P_z = P(c) P(a + d) = (1/2)(3/4) = 3/8 = 0.375
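Both probabilities can be verified by exhaustive enumeration, as in this Python sketch (the circuit structure is assumed from the figure):

from itertools import product

def prob_z(shared_input):
    count = 0
    for a, b, c, d in product((0, 1), repeat=4):
        x = a & (c if shared_input else b)   # x = a.c or x = a.b
        y = c & d
        count += (x | y)                     # z = x + y
    return count / 16.0

print(prob_z(False))  # 7/16 = 0.4375 (independent gate inputs)
print(prob_z(True))   # 3/8  = 0.375  (x and y share input c)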
[Figure: glitching -- unequal path delays through a gate cause an extra transition (race) at the output z, which dissipates energy.]

Chapter 17
Clock gating and power-down:
[Figure: modules B and C with gated clocks.]
Control circuitry is needed for clock gating and power-down, and powered-down modules need wake-up time.
Carry-Ripple Adder
[Figure: cells Add_i ... Add_{i+3}; carries C_{i+1} ... C_{i+4} ripple through the chain while the sums S_i ... S_{i+3} settle, producing unbalanced paths.]
Balancing Operations
Example: addition of operands A, B, C, D, F, G, H.
[Figure: an unbalanced adder chain vs. a balanced adder tree computing the same sum S; balancing the path delays reduces glitching.]

Chapter 17
Dual-VT Technology
- Reduced V_DD -> increased delay.
- Low V_T -> faster, but increased leakage.

[Figure: high-V_T standby transistors gate V_DD for the fast, high-leakage low-V_T logic; leakage is low in standby when the high-V_T transistors are turned off.]

Chapter 17
Gate resizing:
- Systematic capture and elimination of slack using fictitious entities called Unit Delay Fictitious Buffers (UDFs).
- Replace unnecessarily fast gates by slower, lower-power gates from an underlying gate library.
- Use a simple relation between a gate's speed and power and the UDFs in its fanout nets. Model the problem as an efficiently solvable ILP, similar to retiming.
(In Proceedings of ARVLSI'99, Georgia Tech.)
[Figure: example circuit with critical path = 8; UDF counts shown in boxes and UDF displacement variables annotated on the non-critical nets.]

Chapter 17
Dual supply voltages:
- Systematic capture and elimination of slack using Unit Delay Fictitious Buffers.
- Switch unnecessarily fast gates to the lower supply voltage V_DDL, thereby saving power; critical-path gates keep the high supply voltage V_DDH.
- Use a simple relation between a gate's speed/power and its supply voltage together with the UDFs in its fanout nets. Model the problem as an approximately solvable ILP.
[Figure: the same example circuit, critical path = 8, UDFs in boxes; non-critical gates are switched to V_DDL while critical-path gates stay at V_DDH; LC = level converter at V_DDL-to-V_DDH boundaries; UDF displacement variables shown.]

Chapter 17
Dual threshold voltages:
- Systematic capture and elimination of slack using Unit Delay Fictitious Buffers.
- Gates on the critical path keep a low threshold voltage V_TL; unnecessarily fast gates are switched to a high threshold voltage V_TH.
- Use a simple relation between a gate's speed/power and its threshold voltage together with the UDFs in its fanout nets. Model the problem as an efficiently approximable 0-1 ILP.
[Figure: the same example circuit, critical path = 8, UDFs in boxes; critical-path gates at V_TL, off-critical gates switched to V_TH; UDF displacement variables shown.]

Chapter 17
Experimental Results
Table: ISCAS85 benchmark circuits -- resizing (20 sizes), dual V_DD (5 V, 2.4 V), and dual V_t.

Ckt   | #Gates | Resizing power savings | CPU(s)  | Dual-VDD power savings | CPU(s)  | Dual-Vt power savings
c1908 |   880  | 15.27%                 |   87.5  | 49.5%                  |  739.05 | 84.92%
c2670 |  1211  | 28.91%                 |  164.38 | 57.6%                  | 1229.37 | 90.25%
c3540 |  1705  | 37.11%                 |  312.51 | 57.7%                  | 1743.75 | 83.36%
c5315 |  2351  | 41.91%                 |  660.56 | 62.4%                  | 4243.63 | 91.56%
c6288 |  2416  |  5.57%                 |   69.58 | 62.7%                  | 7736.05 | 61.75%
c7552 |  3624  | 54.05%                 | 1256.76 | 59.6%                  | 9475.1  | 90.90%

V. Sundararajan and K.K. Parhi, "Low Power Synthesis of Dual Threshold Voltage CMOS VLSI Circuits," Proc. of 1999 IEEE Int. Symp. on Low-Power Electronics and Design, pp. 139-144, San Diego, Aug. 1999.

Chapter 17
22
Chapter 17
23
Theoretical Background
Signal probability:

p1_xi = lim_{NS->inf} ( sum_{j=1..NS} x_i(j) ) / NS
p0_xi = 1 - p1_xi

Transition probability (0 -> 1):

p01_xi = lim_{NS->inf} ( sum_{j=1..NS} (1 - x_i(j)) x_i(j+1) ) / NS

p01_xi + p10_xi + p11_xi + p00_xi = 1

Conditional probability:

p(1/0)_xi = p01_xi / (p01_xi + p00_xi)

S = T_clk / T_gd, where T_clk is the clock period and T_gd is the gate delay.

Chapter 17
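Over a finite sample stream these limit definitions become simple frequency counts; a Python sketch (interface assumed):

def signal_stats(samples):
    # samples: list of 0/1 values of a node over NS+1 clock cycles
    NS = len(samples) - 1
    p1 = sum(samples) / len(samples)          # signal probability p1
    trans = {"01": 0, "10": 0, "00": 0, "11": 0}
    for cur, nxt in zip(samples, samples[1:]):
        trans[f"{cur}{nxt}"] += 1
    probs = {k: v / NS for k, v in trans.items()}
    denom = probs["01"] + probs["00"]
    p1_given0 = probs["01"] / denom if denom else 0.0   # conditional prob.
    return p1, probs, p1_given0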
Performance Comparison
[Bar charts: power (uW) and run-time (sec) of SPICE vs. HEAT on the BW4, HY4, BW8, and HY8 circuits.]

J. Satyanarayana and K.K. Parhi, "Power Estimation of Digital Datapaths using HEAT Tool," IEEE Design and Test Magazine, 17(2), pp. 101-110, April-June 2000.

Chapter 17
27
[Figure: parallel vs. digit-serial implementation -- four instructions scheduled on MAC2 and DEGRED2 units.]

Chapter 17
Polynomial multiplication
Polynomial modulo operation
Array-type multiplication
Fully parallel multiplication
Digit-serial/parallel multiplication
30
Finite field dot product:

s = [A0 A1 ... A_{n-1}] [B0 B1 ... B_{n-1}]^T = (A0 B0 + ... + A_{n-1} B_{n-1}) mod p(x)

Total Energy(parallel) = Eng * n
Total Energy(MAC-D7) = 0.25 Eng * n + 0.75 Eng
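A bit-level Python sketch of the GF(2^m) operations behind this dot product (assumptions: field elements are integers whose bits are polynomial coefficients, and p is the irreducible polynomial including its x^m term):

def gf_mul(a, b, p, m):
    # Carry-free polynomial multiplication followed by mod-p(x) reduction.
    r = 0
    for i in range(m):
        if (b >> i) & 1:
            r ^= a << i              # XOR = addition over GF(2)
    for i in range(2 * m - 2, m - 1, -1):
        if (r >> i) & 1:
            r ^= p << (i - m)        # replace x^i using x^m = lower terms of p
    return r

def gf_dot(A, B, p, m):
    s = 0
    for a, b in zip(A, B):
        s ^= gf_mul(a, b, p, m)      # accumulate with XOR
    return s

# e.g. GF(2^4), p = 0b10011 (x^4+x+1): gf_mul(0b0010, 0b1000, 0b10011, 4) == 0b0011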
Data-paths
One parallel finite field multiplier
Digit-serial multiplication: MACx and DEGREDy
Chapter 17
32
[Chart: energy-delay comparison of datapath configurations MAC8 + DEGRED1/2/4 and MAC4 + DEGRED1/2.]

L. Song, K.K. Parhi, I. Kuroda, and T. Nishitani, "Hardware/Software Codesign of Finite Field Datapath for Low-Energy Reed-Solomon Codecs," IEEE Trans. on VLSI Systems, 8(2), pp. 160-172, Apr. 2000.

Chapter 17
33
DSP Applications
DSP applications are often real time
but with a wide variety of sample rates
High rates
Radar
Video
Medium rates
Audio
Speech
Low rates
Weather
Finance
Chap. 18
DSP features
Fast Multiply/Accumulate (MAC):
[Figure: direct-form FIR filter -- x(n) through delays D, tap coefficients h0, h1, h2, h3, products accumulated into y(n); the same MAC pattern dominates FIR filters, FFTs, etc.]
Addressing Modes
- Implied addressing: P = X*Y  (the operation sets the location)
- Immediate data: AX0 = 1234
- Memory direct: R1 = Mem[101]
- Register direct: sub R1, R2
- Register indirect: A0 = A0 + *R5
- Register indirect with increment/decrement: A0 = A0 + *R5++,  A0 = A0 + *R5--

Chap. 18
Standard Processor Cores vs. Special-Purpose Processors

standard processor cores  | special-purpose processors
high calculation capacity | domain specific
programmable              | low power
standard interface        | user-defined interface
good supply of tools      | variable word-length
low price at volume       | low design cost

Chap. 18
Architectural Partitioning
[Figure: system partitioned into processor cores with distributed memory, main memory, and ASIC blocks.]

Conflicting requirements: throughput, flexibility, power consumption, time to market.
- Flexibility is obtained by using a programmable processor core.
- Local busses and distributed memory decrease data transfers.
- MIPS-intensive algorithms go into dedicated hardware to increase throughput and save power.

Chap. 18
8
Motorola DSP56000x
[Figure: datapath -- 24-bit operand registers X0, X1, Y0, Y1 feed the multiplier and 56-bit ALU; results accumulate in the 56-bit accumulators A and B (including guard bits); a shifter/limiter performs scaling back to 24 bits.]
An alternative is a multiplier with reduced word-length output, e.g., 24 bits.

Chap. 18
Memory architectures:
[Figure: single memory with one address bus and one data bus.]
[Figure: Harvard-style architecture -- separate memories (one data, one program) or dual banks Memory A and Memory B, each with its own address bus and data bus.]

Chap. 18
TI, C64
Chap. 18
14
TI, C55
Chap. 18
15
Processor Architectures
SIMD: Single Instruction Multiple Data
[Figure: one program controls multiple datapaths.]

Chap. 18
Processor Architectures
MIMD: Multiple Instruction Multiple Data
[Figure: each datapath runs its own program.]

Chap. 18
Processor Architectures
VLIW: Very Long Instruction Words
[Figure: a control unit issues one VLIW instruction that drives several functional units in parallel.]

Chap. 18
Split Processors
Functional units can be split into submodules, e.g., for image data (8 bits).
Example: TI TMS320C80 -- 1 RISC plus 4 x 32-bit DSPs that can be split into 8-bit modules.

Chap. 18
19
Vector Processors
Chap. 18