
VLSI DSP 2008 Y.T. Hwang 5-1
Chapter 4
Pipelining and Parallel
Processing
Introduction (1)
Pipelining
Reduction in critical path
Increase the clock speed
Reduce power consumption at same speed
Parallel processing
Parallelism
Increase effective sampling speed
Reduction of power consumption
Introduction (2)
A 3-tap FIR filter
y(n)=ax(n)+bx(n-1)+cx(n-2)
Critical path: 1 multiply and 2 add
$T_{sample} \ge T_M + 2T_A$
$f_{sample} \le \frac{1}{T_M + 2T_A}$
Introduction (3)
Pipelining or parallel processing to increase the sampling
frequency
Critical path: 2 add
Pipelining
Parallel processing
Pipelining of FIR digital filters (1)
Feed forward cut set
Two iterations are computed concurrently
Critical path reduced from $T_M + 2T_A$ to $T_M + T_A$
Latency increased from 1 to 2
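The effect of the feed-forward cutset can be sketched in Python. This is a minimal illustration, not the slide's circuit: the coefficient values and the function names `fir_direct` and `fir_pipelined` are assumptions, and the latches are placed on the edge carrying the partial sum and on the edge feeding the third multiplier.

```python
def fir_direct(x, a, b, c):
    """Reference 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2), zero initial state."""
    y = []
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0
        x2 = x[n - 2] if n >= 2 else 0
        y.append(a * x[n] + b * x1 + c * x2)
    return y


def fir_pipelined(x, a, b, c):
    """Same filter with one latch on each edge of a feed-forward cutset.

    Stage 1 forms a*x(n) + b*x(n-1) (one multiply and one add per path).
    Stage 2 multiplies the latched x(n-2) by c and adds the latched sum.
    """
    latch_sum = 0  # latched a*x(n) + b*x(n-1)
    latch_x2 = 0   # latched x(n-2)
    x1 = x2 = 0    # delay line of the original filter
    out = []
    for xn in x:
        # Stage 2 works on the values latched at the previous clock edge.
        out.append(latch_sum + c * latch_x2)
        # Stage 1 computes new values, which are latched for the next clock.
        latch_sum = a * xn + b * x1
        latch_x2 = x2
        # Advance the filter's own delay line.
        x2, x1 = x1, xn
    return out


x = [1, 2, 3, 4, 5]
y_ref = fir_direct(x, 2, 3, 5)
y_pip = fir_pipelined(x, 2, 3, 5)
```

The pipelined output equals the reference output delayed by one clock, which is the extra latency noted above; the function itself is unchanged.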
Pipelining of FIR digital filters (2)
Drawbacks of pipelining
Increase in the number of latches and in system latency
Observations
The clock period is limited by the longest path between
Two latches
An input and a latch
A latch and an output
An input and an output
Critical path can be reduced by suitably placing the
pipelining latches
Pipelining latches can be placed across any feed-
forward cutset of the graph
Pipelining of FIR digital filters (3)
Cut set
A set of edges of a graph such that if these edges are
removed from the graph, the graph becomes disjoint
Feed-forward cut set
The data move in the forward direction on all the edges
of the cut set
We can arbitrarily place latches on a feed-forward cut
set without affecting the functionality of the algorithm
Pipelining of FIR digital filters (4)
Example 3.2.1
[Figure panels: incorrect pipelining, correct pipelining]
Original critical path: A3 → A5 → A4 → A6
After pipelining: A3 → A5 or A4 → A6
Critical path is reduced by one half
Direct vs. transpose form
Direct form with long critical path
Transpose form with data broadcast structure
Critical path is reduced to $T_M + T_A$
Fine-Grain pipelining
Pipelining the function unit
Assume $T_M = 10$ units, $T_A = 2$ units
After pipelining, the critical path is 6 units
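The 6-unit figure can be checked with a short Python sketch. It assumes the multiplier is cut into a 6-unit and a 4-unit part (the split used in the low-power example of this chapter) and that the starting point is the $T_M + T_A$ path of the transpose form; `critical_path` is a hypothetical helper, not part of the slides.

```python
def critical_path(stages):
    """Longest combinational delay among pipeline stages.

    Each stage is a list of unit delays chained between two latches.
    """
    return max(sum(stage) for stage in stages)


T_A = 2
T_M1, T_M2 = 6, 4  # assumed split of the 10-unit multiplier

# Before fine-grain pipelining: whole multiplier followed by one adder.
before = critical_path([[T_M1 + T_M2, T_A]])

# After: latch inside the multiplier; second part shares a stage with the adder.
after = critical_path([[T_M1], [T_M2, T_A]])

print(before, after)
```

Both stages come out at 6 units, so the split is balanced and the clock period drops from 12 to 6 units.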
Parallel processing of FIR filter (1)
Block processing of size L
y(n)=ax(n)+bx(n-1)+cx(n-2)
y(3k)=ax(3k)+bx(3k-1)+cx(3k-2)
y(3k+1)=ax(3k+1)+bx(3k)+cx(3k-1)
y(3k+2)=ax(3k+2)+bx(3k+1)+cx(3k)
Block delay (L-slow): placing a latch at any line of MIMO
structures produces an effective delay of L clocks at the
sample rate
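The block equations above can be sketched as a small Python model. The function names and test coefficients are assumptions for illustration; the two state variables stand in for the block delays holding x(3k-1) and x(3k-2).

```python
def fir_serial(x, a, b, c):
    """y(n) = a*x(n) + b*x(n-1) + c*x(n-2) with zero initial state."""
    y = []
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0
        x2 = x[n - 2] if n >= 2 else 0
        y.append(a * x[n] + b * x1 + c * x2)
    return y


def fir_3_parallel(x, a, b, c):
    """Produce three outputs per loop iteration (one 'clock'); len(x) must be a multiple of 3."""
    y = []
    prev1 = prev2 = 0  # x(3k-1) and x(3k-2), each held by one block delay
    for k in range(0, len(x), 3):
        x0, x1, x2 = x[k], x[k + 1], x[k + 2]
        y.append(a * x0 + b * prev1 + c * prev2)  # y(3k)
        y.append(a * x1 + b * x0 + c * prev1)     # y(3k+1)
        y.append(a * x2 + b * x1 + c * x0)        # y(3k+2)
        # One block delay later these become x(3k-1) and x(3k-2).
        prev1, prev2 = x2, x1
    return y


x = [1, -2, 3, 0, 5, 7, -1, 4, 2]
assert fir_3_parallel(x, 2, 3, 5) == fir_serial(x, 2, 3, 5)
```

Each stored value is updated once per block, so a single latch in the parallel structure corresponds to a delay of L = 3 sample periods.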
Parallel processing of FIR filter (2)
Block size 3
3 times hardware
Critical path remains unchanged: $T_M + 2T_A$
$T_{clk} \ge T_M + 2T_A$
3 samples are produced in 1 clock cycle
Effective iteration period is
$T_{iter} = T_{sample} = \frac{1}{L} T_{clk} \ge \frac{1}{3}(T_M + 2T_A)$
Note: $T_{clk} \ne T_{sample}$
Parallel processing of FIR filter (3)
MIMO system
Complete parallel processing
System with block size 4
A serial-to-parallel
converter
A parallel-to-serial converter
Pipelining vs. parallel processing
Limitation of pipelining
Input/output bottleneck, i.e. communication bounded
system
Pipelining period cannot be smaller than the
communication or I/O bound
pipelining & parallel processing
Combined fine grain
pipelining and
parallel processing
for 3-tap FIR filter
L = 3, M = 2
$T_{iter} = T_{sample} = \frac{1}{LM} T_{clk} = \frac{1}{6}(T_M + 2T_A) = \frac{14}{6}$
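The sample-period arithmetic can be reproduced with exact fractions. This sketch assumes $T_M = 10$ and $T_A = 2$ from the fine-grain pipelining slide, and it assumes that M-level pipelining divides the unpipelined critical path evenly; `sample_period` is a hypothetical helper.

```python
from fractions import Fraction


def sample_period(t_m, t_a, L=1, M=1):
    """T_sample = (T_M + 2*T_A) / (L*M) for the 3-tap direct-form FIR.

    L is the block size (parallelism), M the number of pipeline levels.
    Assumes the pipeline stages are perfectly balanced.
    """
    return Fraction(t_m + 2 * t_a, L * M)


serial = sample_period(10, 2)             # no pipelining, no parallelism
parallel = sample_period(10, 2, L=3)      # 3-parallel only
combined = sample_period(10, 2, L=3, M=2) # 3-parallel with 2-level pipelining

print(serial, parallel, combined)
```

The combined value is 14/6 time units, which `Fraction` prints in lowest terms as 7/3.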
Pipelining & parallel processing for low power
Advantages of pipelining and parallel processing
High speed
Low power
CMOS circuit model
1st order analysis
Propagation delay
$T_{pd} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
Power consumption
$P = C_{total} V_0^2 f$
Pipelining for low power (1)
Sequential version
M-level pipelined version
Working at the same frequency, i.e. $f = 1/T_{seq}$ remains unchanged
Capacitance in each pipeline stage is reduced to $C_{charge}/M$
Only $\beta V_0$ ($\beta < 1$) is needed to charge $C_{charge}/M$ in $T_{seq}$
Sequential: $P_{seq} = C_{total} V_0^2 f, \quad f = 1/T_{seq}$
Pipelined: $P_{pip} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$
Pipelining for low power (2)
Calculation of $\beta$
$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
$T_{pip} = \frac{(C_{charge}/M)\,\beta V_0}{k (\beta V_0 - V_t)^2}$
Let $T_{seq} = T_{pip}$:
$M (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$
Pipelining for low power (3)
Example
3-tap FIR filter
$T_m = 10$, $T_a = 2$, $C_m = 5C_a$
Pipelined multiplier: $T_{m1} = 6$, $T_{m2} = 4$, $C_{m1} = 3C_a$, $C_{m2} = 2C_a$
$V_0 = 5$ V, $V_t = 0.6$ V
Supply voltage calculation
Sequential: $C_{charge} = C_m + C_a = 6C_a$
Pipelined: $C_{charge} = C_{m1} = C_{m2} + C_a = 3C_a$
$50\beta^2 - 31.36\beta + 0.72 = 0 \Rightarrow \beta = 0.6033$
$V_{pip} = \beta V_0 = 3.0165$ V
Power consumption ratio $= \beta^2 = 36.4\%$
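The quadratic in this example can be solved numerically as a cross-check. The sketch below assumes the general form $r(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$, where $r$ is the ratio of sequential to pipelined charging capacitance (here $6C_a / 3C_a = 2$); the function name `solve_beta` is hypothetical.

```python
import math


def solve_beta(r, v0, vt):
    """Solve r*(beta*v0 - vt)**2 = beta*(v0 - vt)**2 for the feasible beta.

    Expanding gives a*beta**2 + b*beta + c = 0 with the coefficients below.
    Only a root with beta*v0 > vt leaves the transistors able to switch.
    """
    a = r * v0 ** 2
    b = -(2 * r * v0 * vt + (v0 - vt) ** 2)
    c = r * vt ** 2
    disc = math.sqrt(b * b - 4 * a * c)
    roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
    feasible = [root for root in roots if root * v0 > vt]
    return max(feasible)


beta = solve_beta(r=2, v0=5.0, vt=0.6)
v_pip = beta * 5.0
power_ratio = beta ** 2

print(round(beta, 4), round(v_pip, 4), round(power_ratio, 3))
```

With these inputs the coefficients are 50, -31.36 and 0.72, matching the equation on the slide.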
Parallel processing for low power (1)
L-parallel version
Working at 1/L of the original frequency, i.e. $f = 1/(L T_{seq})$
Total capacitance is increased to $L C_{charge}$
Since each $C_{charge}$ is charged in $L T_{seq}$, only $\beta V_0$ ($\beta < 1$) is
needed to charge it
Parallel processing for low power (2)
Calculation of $\beta$
$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}, \quad L T_{seq} = \frac{C_{charge}\,\beta V_0}{k (\beta V_0 - V_t)^2}$
$L (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$
$P_{par} = (L C_{charge})(\beta V_0)^2 \frac{f}{L} = \beta^2 C_{charge} V_0^2 f = \beta^2 P_{seq}$
Parallel processing for low power (3)
Example of 2-parallel version
4-tap FIR filter
$T_m = 8$, $T_a = 1$, $C_m = 8C_a$
$T_{seq} = 9$
$V_0 = 3.3$ V, $V_t = 0.45$ V
Parallel processing for low power (4)
2-parallel FIR filter design
Note each delay is 2-slow
[Figure: 2-parallel FIR filter; labeled delayed inputs x(2k-1), x(2k-2)]
Parallel processing for low power (5)
Supply voltage calculation
Sequential: $C_{charge} = C_m + C_a = 9C_a$
2-parallel: $C_{charge} = C_m + 2C_a = 10C_a$
$T_{seq} = \frac{9 C_a V_0}{k (V_0 - V_t)^2}, \quad T_{par} = \frac{10 C_a\,\beta V_0}{k (\beta V_0 - V_t)^2}$
Let $T_{par} = 2T_{sample} = 2T_{seq}$:
$9 (\beta V_0 - V_t)^2 = 5\beta (V_0 - V_t)^2$
$98.01\beta^2 - 67.3425\beta + 1.8225 = 0$
$\beta = 0.6589$ or $0.0282$ (rejected, since it gives $\beta V_0 < V_t$)
$V_{par} = \beta V_0 = 2.17437$ V
Power consumption ratio $= \beta^2 = 43.41\%$
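The same supply voltage can be found without expanding the quadratic, by searching directly on the delay model. This is a sketch under stated assumptions: capacitances are expressed in units of $C_a$, the constant $k$ is set to 1 because it cancels, and bisection is used only because the delay falls monotonically as the voltage rises above $V_t$. The names `delay` and `supply_for_delay` are hypothetical.

```python
def delay(c_charge, v, vt, k=1.0):
    """First-order CMOS propagation delay: C*V / (k*(V - Vt)^2)."""
    return c_charge * v / (k * (v - vt) ** 2)


def supply_for_delay(c_charge, target, v0, vt, tol=1e-9):
    """Lowest supply in (vt, v0] whose delay does not exceed target (bisection)."""
    lo, hi = vt + 1e-6, v0  # delay is huge just above vt and smallest at v0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if delay(c_charge, mid, vt) > target:
            lo = mid  # too slow, raise the voltage
        else:
            hi = mid  # fast enough, try a lower voltage
    return (lo + hi) / 2


V0, VT = 3.3, 0.45
t_seq = delay(9, V0, VT)  # sequential filter: C_charge = 9 C_a
# 2-parallel filter: C_charge = 10 C_a, and twice the time is available.
v_par = supply_for_delay(10, 2 * t_seq, V0, VT)
beta = v_par / V0

print(round(v_par, 4), round(beta, 4), round(beta ** 2, 4))
```

The search lands on the feasible root only, because it never looks below $V_t$.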
Parallel processing for low power (6)
Area efficient 2-parallel version
Multipliers: 8 → 6, adders: 6 → 7, delays: 3 → 4
Parallel processing for low power (7)
Architecture verification
$y(n) = h_0 x(n) + h_1 x(n-1) + h_2 x(n-2) + h_3 x(n-3)$
$y_A = h_0 x(2k) + h_2 x(2k-2)$
$y_B = (h_0 + h_1)(x(2k) + x(2k+1)) + (h_2 + h_3)(x(2k-2) + x(2k-1))$
$y_C = h_1 x(2k+1) + h_3 x(2k-1)$
$y(2k) = y_A + y_C\ [\text{after 1 block delay}]$
$\quad = h_0 x(2k) + h_1 x(2k-1) + h_2 x(2k-2) + h_3 x(2k-3)$
$y(2k+1) = y_B - y_A - y_C$
$\quad = h_0 x(2k+1) + h_1 x(2k) + h_2 x(2k-1) + h_3 x(2k-2)$
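The verification can also be run numerically. The sketch below assumes the three sub-filter outputs are $y_A = h_0 x(2k) + h_2 x(2k-2)$, $y_B = (h_0+h_1)(x(2k)+x(2k+1)) + (h_2+h_3)(x(2k-2)+x(2k-1))$ and $y_C = h_1 x(2k+1) + h_3 x(2k-1)$, with samples before time zero taken as zero; the function names and test data are illustrative.

```python
def fir4(x, h, n):
    """Direct 4-tap FIR output y(n); samples with negative index are zero."""
    return sum(h[i] * (x[n - i] if n - i >= 0 else 0) for i in range(4))


def fast_2_parallel(x, h):
    """Area-efficient 2-parallel 4-tap FIR: 6 multiplications per block of 2 outputs."""
    h0, h1, h2, h3 = h

    def xs(n):
        return x[n] if n >= 0 else 0

    y = []
    yc_prev = 0  # y_C from the previous block (one block delay)
    for k in range(len(x) // 2):
        ya = h0 * xs(2 * k) + h2 * xs(2 * k - 2)
        yb = ((h0 + h1) * (xs(2 * k) + xs(2 * k + 1))
              + (h2 + h3) * (xs(2 * k - 2) + xs(2 * k - 1)))
        yc = h1 * xs(2 * k + 1) + h3 * xs(2 * k - 1)
        y.append(ya + yc_prev)  # y(2k)   = y_A + delayed y_C
        y.append(yb - ya - yc)  # y(2k+1) = y_B - y_A - y_C
        yc_prev = yc
    return y


x = [3, -1, 4, 1, -5, 9, 2, 6]
h = (1, 2, 3, 4)
assert fast_2_parallel(x, h) == [fir4(x, h, n) for n in range(len(x))]
```

The pre-added coefficients $h_0 + h_1$ and $h_2 + h_3$ are constants, so only the two input pre-additions and the post-subtractions cost extra adders at run time.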
Parallel processing for low power (8)
Supply voltage calculation
Sequential: $C_{charge} = C_m + C_a = 9C_a$
2-parallel: $C_{charge} = C_m + 4C_a = 12C_a$
$T_{seq} = \frac{9 C_a V_0}{k (V_0 - V_t)^2}, \quad T_{par} = \frac{12 C_a\,\beta V_0}{k (\beta V_0 - V_t)^2}$
Let $T_{par} = 2T_{sample} = 2T_{seq}$:
$2 \cdot \frac{9 C_a V_0}{k (V_0 - V_t)^2} = \frac{12 C_a\,\beta V_0}{k (\beta V_0 - V_t)^2}$
$32.67\beta^2 - 25.155\beta + 0.6075 = 0$
$\beta = 0.745$ or $0.025$ (rejected, since it gives $\beta V_0 < V_t$)
$V_{par} = \beta V_0 = 2.4585$ V
Parallel processing for low power (9)
Power consumption ratio
$C_{total}^{(seq)} = 4C_m + 3C_a = 35C_a, \quad P_{seq} = C_{total}^{(seq)} V_0^2 f_{seq}$
$C_{total}^{(par)} = 6C_m + 7C_a = 55C_a, \quad P_{par} = C_{total}^{(par)} \beta^2 V_0^2 f_{par}$
$f_{par} = \frac{1}{2} f_{seq} = \frac{1}{2} f_s$
$P_{seq} = 35 C_a V_0^2 f_s, \quad P_{par} = 55 C_a \beta^2 V_0^2 \cdot \frac{1}{2} f_s$
$\text{ratio} = \frac{P_{par}}{P_{seq}} = \frac{0.5 \times 55\,\beta^2}{35} = 43.6\%$
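The ratio can be recomputed from the power formula $P = C_{total} V^2 f$. The sketch assumes $C_m = 8C_a$, $V_0 = 3.3$ V and $\beta = 0.745$ from this example, and normalizes $C_a$ and $f_s$ to 1 because both cancel in the ratio; `power` is a hypothetical helper.

```python
def power(c_total, v, f):
    """Dynamic power of a CMOS circuit: C_total * V^2 * f."""
    return c_total * v ** 2 * f


C_A = 1.0        # normalized adder capacitance (cancels in the ratio)
C_M = 8 * C_A    # multiplier capacitance from the example
V0 = 3.3
BETA = 0.745     # supply scaling factor found for the area-efficient design
F_S = 1.0        # normalized sample rate (cancels in the ratio)

# Sequential 4-tap filter: 4 multipliers, 3 adders, clocked at f_s.
p_seq = power(4 * C_M + 3 * C_A, V0, F_S)
# Area-efficient 2-parallel filter: 6 multipliers, 7 adders, clocked at f_s / 2.
p_par = power(6 * C_M + 7 * C_A, BETA * V0, F_S / 2)

ratio = p_par / p_seq
print(f"{ratio:.1%}")
```

The result is close to the 43.41% of the plain 2-parallel design, so the area saving costs almost nothing in power under this model.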
Combining pipelining and parallel processing
Pipelining
Reduces the capacitance to be charged/discharged in 1
clock period
Parallel processing
Increases the clock period for charging/discharging the
original capacitance
3-parallel
2-stage pipelining
pipelining + parallel processing
Propagation delay of the parallel pipelined filter
$L T_{pd} = \frac{(C_{charge}/M)\,\beta V_0}{k (\beta V_0 - V_t)^2} = \frac{L C_{charge} V_0}{k (V_0 - V_t)^2}$
Solution of $\beta$
$ML (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$
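The quadratic in $\beta$ can be written out explicitly. This is a worked rearrangement using only the symbols already defined ($M$, $L$, $V_0$, $V_t$, $\beta$); the abbreviation $B$ is introduced here for convenience and is not part of the slides.

```latex
\begin{align*}
ML(\beta V_0 - V_t)^2 &= \beta (V_0 - V_t)^2 \\
ML V_0^2\,\beta^2 - B\,\beta + ML V_t^2 &= 0,
  \qquad B = 2ML V_0 V_t + (V_0 - V_t)^2 \\
\beta &= \frac{B + \sqrt{B^2 - 4M^2L^2 V_0^2 V_t^2}}{2ML V_0^2}
\end{align*}
```

The product of the two roots is $(V_t/V_0)^2$, so exactly one root satisfies $\beta V_0 > V_t$, and it is the larger one shown here. Setting $L = 1$ recovers the pipelining-only equation, and setting $M = 1$ recovers the parallel-only equation.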
