Ch4 - Pipelining and Parallel Processing

VLSI DSP 2008, Y.T. Hwang
Introduction (1)
Pipelining
Reduction in critical path
Increase the clock speed
Reduce power consumption at same speed
Parallel processing
Parallelism
Increase effective sampling speed
Reduction of power consumption
Introduction (2)
A 3-tap FIR filter
y(n)=ax(n)+bx(n-1)+cx(n-2)
Critical path: 1 multiply and 2 add
$T_{sample} \ge T_M + 2T_A$
$f_{sample} \le \dfrac{1}{T_M + 2T_A}$
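A minimal Python sketch of the 3-tap filter and the sampling-rate bound may help fix the notation. The function names `fir3` and `max_sample_rate` are illustrative and not from the slides; a zero initial state is assumed.

```python
def fir3(x, a, b, c):
    """Direct-form 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2).

    Assumes zero initial state, i.e. x(-1) = x(-2) = 0.
    """
    y = []
    x1 = x2 = 0  # delay elements holding x(n-1) and x(n-2)
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x2, x1 = x1, xn  # shift the delay line
    return y


def max_sample_rate(t_m, t_a):
    """Upper bound on f_sample when the critical path is 1 multiply + 2 adds."""
    return 1.0 / (t_m + 2 * t_a)


impulse_response = fir3([1, 0, 0, 0], a=2, b=3, c=4)
bound = max_sample_rate(t_m=10, t_a=2)
```

Feeding a unit impulse returns the coefficients in order, which is a quick sanity check on the delay line.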
Introduction (3)
Use pipelining or parallel processing to increase the sampling
frequency
Critical path: 2 add
Pipelining
Parallel processing
Pipelining of FIR digital filters (1)
Feed forward cut set
Two iterations are computed concurrently
Critical path reduced from $T_M + 2T_A$ to $T_M + T_A$
Latency increased from 1 to 2
Drawbacks of pipelining
Increase in the number of latches and in system latency
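The effect of one pipelining stage on the 3-tap filter can be sketched in Python. The cut-set placement here is an assumption, since the figure is not reproduced in the text: one latch on the partial sum a*x(n) + b*x(n-1) and one on the x(n-2) tap, so each stage contains one multiply and one add. The function names are illustrative.

```python
def fir3_direct(x, a, b, c):
    """Unpipelined reference: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    x1 = x2 = 0
    out = []
    for xn in x:
        out.append(a * xn + b * x1 + c * x2)
        x2, x1 = x1, xn
    return out


def fir3_pipelined(x, a, b, c):
    """One latch on every edge of an assumed feed-forward cut set.

    Stage 1 computes a*x(n) + b*x(n-1); stage 2 adds c*x(n-2).
    The output at clock n is y(n-1): one extra cycle of latency.
    """
    x1 = x2 = 0
    latch_sum = latch_tap = 0  # the two pipelining latches
    out = []
    for xn in x:
        # Stage 2 works on the values latched at the previous clock edge.
        out.append(latch_sum + c * latch_tap)
        # Stage 1 computes the values that will be latched at this edge.
        latch_sum = a * xn + b * x1
        latch_tap = x2
        x2, x1 = x1, xn
    return out


x = [1, 2, 3, 4, 5, 6]
ref = fir3_direct(x, 2, 3, 4)
pip = fir3_pipelined(x, 2, 3, 4)
```

The pipelined output is the reference output delayed by one clock, which illustrates both the unchanged functionality and the latency penalty.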
Observations
The clock period is limited by the longest path between
Two latches
An input and a latch
A latch and an output
An input and an output
Critical path can be reduced by suitably placing the
pipelining latches
Pipelining latches can be placed across any feed-forward
cut set of the graph
Cut set
A set of edges of a graph such that if these edges are
removed from the graph, the graph becomes disjoint
Feed-forward cut set
The data move in the forward direction on all the edges
of the cut set
We can arbitrarily place latches on a feed-forward cut
set without affecting the functionality of the algorithm
Example 3.2.1
[Figure: incorrect pipelining vs. correct pipelining]
Original critical path: A3 → A5 → A4 → A6
After pipelining: A3 → A5 or A4 → A6
Critical path is reduced by one half
Direct vs. transposed form
Direct form with long critical path
Transposed form with data broadcast structure
Critical path is reduced to $T_M + T_A$
Fine-Grain pipelining
Pipelining the function unit
Assume $T_M = 10$ units, $T_A = 2$ units
After pipelining, the critical path is 6 units
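The 6-unit figure can be reproduced with a short calculation. This sketch assumes the transposed (data broadcast) form, where each multiplier feeds one adder, and a multiplier split into a 6-unit part and a 4-unit part; `clock_period_bound` is an illustrative helper, not something defined in the slides.

```python
def clock_period_bound(stages):
    """Each stage lists the unit delays in series between two latches.

    The clock period is bounded by the slowest stage.
    """
    return max(sum(stage) for stage in stages)


T_M, T_A = 10, 2
T_M1, T_M2 = 6, 4  # assumed split of the multiplier into two pipelined parts

# Multiplier and adder in one stage.
unpipelined = clock_period_bound([[T_M, T_A]])
# Latch inside the multiplier: first part alone, second part plus the adder.
fine_grain = clock_period_bound([[T_M1], [T_M2, T_A]])
```

With this split both stages take 6 units, so the latch is placed where the two halves balance.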
Parallel processing of FIR filter (1)
Block processing of size L
y(n)=ax(n)+bx(n-1)+cx(n-2)
y(3k)=ax(3k)+bx(3k-1)+cx(3k-2)
y(3k+1)=ax(3k+1)+bx(3k)+cx(3k-1)
y(3k+2)=ax(3k+2)+bx(3k+1)+cx(3k)
Block delay (L-slow): placing a latch at any line of MIMO
structures produces an effective delay of L clocks at the
sample rate
Block size 3
3 times hardware
Critical path remains unchanged: $T_M + 2T_A$
$T_{clk} \ge T_M + 2T_A$
3 samples are produced in 1 clock cycle
Effective iteration period is
$T_{iter} = T_{sample} = \dfrac{1}{L} T_{clk} \ge \dfrac{1}{3}(T_M + 2T_A)$
Note: $T_{clk} \ne T_{sample}$
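The block equations for L = 3 can be checked against the serial filter with a small Python sketch. Names are illustrative, a zero initial state is assumed, and the input length is assumed to be a multiple of 3.

```python
def fir3_serial(x, a, b, c):
    """Reference: one output sample per clock."""
    xs = lambda i: x[i] if i >= 0 else 0
    return [a * xs(n) + b * xs(n - 1) + c * xs(n - 2) for n in range(len(x))]


def fir3_block3(x, a, b, c):
    """3-parallel version: each clock k consumes x(3k), x(3k+1), x(3k+2)
    and produces y(3k), y(3k+1), y(3k+2)."""
    assert len(x) % 3 == 0
    y = []
    prev1 = prev2 = 0  # x(3k-1), x(3k-2): one latch = 3 sample delays
    for k in range(len(x) // 3):
        x0, x1, x2 = x[3 * k: 3 * k + 3]
        y.append(a * x0 + b * prev1 + c * prev2)  # y(3k)
        y.append(a * x1 + b * x0 + c * prev1)     # y(3k+1)
        y.append(a * x2 + b * x1 + c * x0)        # y(3k+2)
        prev1, prev2 = x2, x1                     # block delay
    return y


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
serial_out = fir3_serial(x, 2, 3, 4)
block_out = fir3_block3(x, 2, 3, 4)
```

Each loop iteration stands for one clock cycle of the parallel hardware, so the loop runs one third as many times as the serial version.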
MIMO system
Complete parallel processing
System with block size 4
A serial-to-parallel
converter
A parallel-to-serial converter
Pipelining vs. parallel processing
Limitation of pipelining
Input/output bottleneck, i.e. a communication-bounded
system
The pipelining period cannot be smaller than the
communication or I/O bound
Pipelining & parallel processing
Combined fine-grain pipelining and parallel processing for 3-tap FIR filter
L = 3, M = 2
$T_{iter} = T_{sample} = \dfrac{1}{LM} T_{clk} = \dfrac{1}{6}(T_M + 2T_A) = \dfrac{14}{6}$
Pipelining & parallel processing for low power
Advantages of pipelining and parallel processing
High speed
Low power
CMOS circuit model
1st order analysis
Propagation delay
$T_{pd} = \dfrac{C_{charge} V_0}{k (V_0 - V_t)^2}$
Power consumption
$P = C_{total} V_0^2 f$
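The first-order model is easy to experiment with numerically. The sketch below encodes the two formulas directly; the function names and the default k = 1 are assumptions made for illustration, and all quantities are in arbitrary consistent units.

```python
def propagation_delay(c_charge, v0, vt, k=1.0):
    """T_pd = C_charge * V0 / (k * (V0 - Vt)^2)."""
    return c_charge * v0 / (k * (v0 - vt) ** 2)


def dynamic_power(c_total, v0, f):
    """P = C_total * V0^2 * f."""
    return c_total * v0 ** 2 * f


# Lowering the supply from 5 V to 3 V (Vt = 0.6 V) slows the circuit
# but reduces power quadratically.
delay_5v = propagation_delay(6.0, 5.0, 0.6)
delay_3v = propagation_delay(6.0, 3.0, 0.6)
power_5v = dynamic_power(1.0, 5.0, 1.0)
power_3v = dynamic_power(1.0, 3.0, 1.0)
```

This trade-off is what pipelining and parallel processing exploit: the speed margin they create is spent on a lower supply voltage.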
Pipelining for low power (1)
Sequential version
M-level pipelined version
Working at the same frequency, i.e. $f = 1/T_{seq}$ remains unchanged
Capacitance in each pipeline stage is reduced to $C_{charge}/M$
Only $\beta V_0$ ($\beta < 1$) is needed to charge $C_{charge}/M$ in $T_{seq}$
$P_{seq} = C_{total} V_0^2 f, \quad f = 1/T_{seq}$
$P_{pip} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$
Calculation of $\beta$
$T_{seq} = \dfrac{C_{charge} V_0}{k (V_0 - V_t)^2}$
$T_{pip} = \dfrac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2}$
Let $T_{seq} = T_{pip}$:
$M (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$
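Expanding the condition $M (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$ gives a quadratic in $\beta$. The expansion and the root-selection remark are added here as a worked step and are not spelled out in the slides:

```latex
M V_0^2 \beta^2 - \left[ 2 M V_0 V_t + (V_0 - V_t)^2 \right] \beta + M V_t^2 = 0

\beta = \frac{ 2 M V_0 V_t + (V_0 - V_t)^2
        \pm \sqrt{ \left[ 2 M V_0 V_t + (V_0 - V_t)^2 \right]^2 - 4 M^2 V_0^2 V_t^2 } }
        { 2 M V_0^2 }
```

The product of the two roots is $(V_t / V_0)^2$, so only the larger root satisfies $\beta V_0 > V_t$; the smaller root would put the supply below the threshold voltage and is discarded. With $M = 2$, $V_0 = 5$ V, $V_t = 0.6$ V the quadratic becomes $50\beta^2 - 31.36\beta + 0.72 = 0$.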
Example
3-tap FIR filter
$T_m = 10$, $T_a = 2$, $C_m = 5C_a$
Pipelined multiplier: $T_{m1} = 6$, $T_{m2} = 4$, $C_{m1} = 3C_a$, $C_{m2} = 2C_a$
$V_0 = 5$ V, $V_t = 0.6$ V
Supply voltage calculation
Sequential: $C_{charge} = C_m + C_a = 6C_a$
Pipelined: $C_{charge} = C_{m1} = C_{m2} + C_a = 3C_a$
$50\beta^2 - 31.36\beta + 0.72 = 0 \Rightarrow \beta = 0.6033$
$V_{pip} = \beta V_0 = 3.0165$ V
Power consumption ratio $= \beta^2 = 36.4\%$
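The numbers in this example can be reproduced with a few lines of Python. The helper `solve_beta` is an illustrative name; it solves $M(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$ with the quadratic formula and keeps the root for which the scaled supply stays above the threshold.

```python
import math


def solve_beta(m, v0, vt):
    """Solve m*(beta*v0 - vt)^2 = beta*(v0 - vt)^2 for beta.

    Returns the larger root; the smaller one gives beta*v0 < vt.
    """
    a = m * v0 ** 2
    b = -(2 * m * v0 * vt + (v0 - vt) ** 2)
    c = m * vt ** 2
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)


V0, VT, M = 5.0, 0.6, 2  # values from the example
beta = solve_beta(M, V0, VT)
v_pip = beta * V0
power_ratio = beta ** 2
print(f"beta = {beta:.4f}, V_pip = {v_pip:.4f} V, ratio = {power_ratio:.1%}")
```

The unrounded supply voltage differs from 3.0165 V in the fourth decimal place because the slide multiplies the rounded value 0.6033 by 5.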
Parallel processing for low power (1)
L-parallel version
Working at 1/L of the frequency, i.e. $f_{par} = f/L = 1/(L T_{seq})$
Total capacitance is increased to $L C_{charge}$
Since each $C_{charge}$ is charged in $L T_{seq}$, only $\beta V_0$ ($\beta < 1$) is needed to charge it
Calculation of $\beta$
$T_{seq} = \dfrac{C_{charge} V_0}{k (V_0 - V_t)^2}$, $\quad L T_{seq} = \dfrac{C_{charge}\, \beta V_0}{k (\beta V_0 - V_t)^2}$
$L (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$
$P_{par} = (L C_{charge})(\beta V_0)^2 \dfrac{f}{L} = \beta^2 C_{charge} V_0^2 f = \beta^2 P_{seq}$
Example of 2-parallel version
4-tap FIR filter
$T_m = 8$, $T_a = 1$, $C_m = 8C_a$
$T_{seq} = 9$
$V_0 = 3.3$ V, $V_t = 0.45$ V
2-parallel FIR filter design
Note each delay is 2-slow
[Figure: 2-parallel FIR filter with delayed inputs x(2k-1), x(2k-2)]
Sequential: $C_{charge} = C_m + C_a = 9C_a$
2-parallel: $C_{charge} = C_m + 2C_a = 10C_a$
$T_{seq} = \dfrac{9 C_a V_0}{k (V_0 - V_t)^2}$, $\quad T_{par} = \dfrac{10 C_a\, \beta V_0}{k (\beta V_0 - V_t)^2}$
Let $T_{par} = 2T_{sample} = 2T_{seq}$:
$9 (\beta V_0 - V_t)^2 = 5 \beta (V_0 - V_t)^2$
$98.01\beta^2 - 67.3425\beta + 1.8225 = 0$
$\beta = 0.6589$ or $0.0282$ (rejected, since $\beta V_0 < V_t$)
$V_{par} = \beta V_0 = 2.17437$ V
Power consumption ratio $= \beta^2 = 43.41\%$
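When the charging capacitance differs between the sequential and parallel designs, the delay condition becomes $L\, C_{seq} (\beta V_0 - V_t)^2 = C_{par}\, \beta (V_0 - V_t)^2$. The sketch below solves it for the 2-parallel example; `solve_beta_parallel` and its parameter names are illustrative, with capacitances given in units of $C_a$.

```python
import math


def solve_beta_parallel(c_seq, c_par, l, v0, vt):
    """Solve l*c_seq*(beta*v0 - vt)^2 = c_par*beta*(v0 - vt)^2 for beta.

    Derived from T_par = l * T_seq with the first-order delay model
    (the constant k cancels). Returns the root with beta*v0 > vt.
    """
    a = l * c_seq * v0 ** 2
    b = -(2 * l * c_seq * v0 * vt + c_par * (v0 - vt) ** 2)
    c = l * c_seq * vt ** 2
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)


V0, VT = 3.3, 0.45  # values from the example
beta = solve_beta_parallel(c_seq=9, c_par=10, l=2, v0=V0, vt=VT)
v_par = beta * V0
power_ratio = beta ** 2
print(f"beta = {beta:.4f}, V_par = {v_par:.4f} V, ratio = {power_ratio:.2%}")
```

The coefficients computed here are twice those of the slide's quadratic, so the roots are identical.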
Area-efficient 2-parallel version
Multiplier: 8 → 6, adder: 6 → 7, delay: 3 → 4
Architecture verification
$y(n) = h_0 x(n) + h_1 x(n-1) + h_2 x(n-2) + h_3 x(n-3)$
$y_A = h_0 x(2k) + h_2 x(2k-2)$
$y_B = (h_0 + h_1)(x(2k) + x(2k+1)) + (h_2 + h_3)(x(2k-2) + x(2k-1))$
$y_C = h_1 x(2k+1) + h_3 x(2k-1)$
$y(2k) = y_A + y_C$ [$y_C$ after 1 block delay]
$\quad = h_0 x(2k) + h_1 x(2k-1) + h_2 x(2k-2) + h_3 x(2k-3)$
$y(2k+1) = y_B - y_A - y_C$
$\quad = h_0 x(2k+1) + h_1 x(2k) + h_2 x(2k-1) + h_3 x(2k-2)$
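The verification can also be done numerically. The Python sketch below builds the three sub-filter outputs $y_A$, $y_B$, $y_C$ (six multiplications per block) and compares the result with the plain 4-tap filter. Function names are illustrative; a zero initial state and an even input length are assumed.

```python
def fir4(x, h):
    """Reference 4-tap FIR with zero initial state."""
    xs = lambda i: x[i] if 0 <= i < len(x) else 0
    return [sum(h[j] * xs(n - j) for j in range(4)) for n in range(len(x))]


def fir4_area_efficient(x, h):
    """2-parallel filter built from y_A, y_B, y_C."""
    assert len(x) % 2 == 0
    xs = lambda i: x[i] if 0 <= i < len(x) else 0
    h0, h1, h2, h3 = h
    y = []
    yc_prev = 0  # y_C of the previous block (1 block delay)
    for k in range(len(x) // 2):
        e0, o0 = xs(2 * k), xs(2 * k + 1)      # x(2k), x(2k+1)
        e1, o1 = xs(2 * k - 2), xs(2 * k - 1)  # x(2k-2), x(2k-1)
        ya = h0 * e0 + h2 * e1
        yb = (h0 + h1) * (e0 + o0) + (h2 + h3) * (e1 + o1)
        yc = h1 * o0 + h3 * o1
        y.append(ya + yc_prev)   # y(2k)
        y.append(yb - ya - yc)   # y(2k+1)
        yc_prev = yc
    return y


x = [1, 2, 3, 4, 5, 6, 7, 8]
h = [1, -2, 3, 4]
reference = fir4(x, h)
fast = fir4_area_efficient(x, h)
```

The saving in multipliers comes from sharing the sums $(h_0 + h_1)$ and $(h_2 + h_3)$, at the cost of extra adders and one more block delay.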
Sequential: $C_{charge} = C_m + C_a = 9C_a$
2-parallel: $C_{charge} = C_m + 4C_a = 12C_a$
$T_{seq} = \dfrac{9 C_a V_0}{k (V_0 - V_t)^2}$, $\quad T_{par} = \dfrac{12 C_a\, \beta V_0}{k (\beta V_0 - V_t)^2}$
Let $T_{par} = 2T_{sample} = 2T_{seq}$:
$2 \cdot \dfrac{9 C_a V_0}{k (V_0 - V_t)^2} = \dfrac{12 C_a\, \beta V_0}{k (\beta V_0 - V_t)^2}$
$32.67\beta^2 - 25.155\beta + 0.6075 = 0$
$\beta = 0.745$ or $0.025$ (rejected, since $\beta V_0 < V_t$)
$V_{par} = \beta V_0 = 2.4585$ V
Power consumption ratio
$C_{total}^{(seq)} = 4C_m + 3C_a = 35C_a$, $\quad P_{seq} = C_{total}^{(seq)} V_0^2 f_{seq} = 35 C_a V_0^2 f_s$
$C_{total}^{(par)} = 6C_m + 7C_a = 55C_a$, $\quad P_{par} = C_{total}^{(par)} (\beta V_0)^2 f_{par} = 55 C_a \beta^2 V_0^2 \cdot \dfrac{1}{2} f_s$
$f_{par} = \dfrac{1}{2} f_{seq} = \dfrac{1}{2} f_s$
ratio $= \dfrac{P_{par}}{P_{seq}} = \dfrac{0.5 \times 55\, \beta^2}{35} = 43.6\%$
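The power ratio for the area-efficient design can be recomputed end to end in Python. The variable names are illustrative; capacitances are in units of $C_a$, and the values $C_m = 8C_a$, $V_0 = 3.3$ V, $V_t = 0.45$ V are those of the 4-tap example.

```python
import math

V0, VT = 3.3, 0.45
C_M, C_A = 8.0, 1.0  # C_m = 8*C_a, in units of C_a

# Delay condition 2 * T_seq = T_par with C_charge of 9*C_a and 12*C_a:
# 2*9*(beta*V0 - VT)^2 = 12*beta*(V0 - VT)^2
a = 18 * V0 ** 2
b = -(2 * 18 * V0 * VT + 12 * (V0 - VT) ** 2)
c = 18 * VT ** 2
beta = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)  # root with beta*V0 > VT

c_total_seq = 4 * C_M + 3 * C_A  # 4 multipliers, 3 adders
c_total_par = 6 * C_M + 7 * C_A  # 6 multipliers, 7 adders

# P = C_total * V^2 * f, with f_par = f_seq / 2; f_seq cancels in the ratio.
ratio = (c_total_par * (beta * V0) ** 2 * 0.5) / (c_total_seq * V0 ** 2)
print(f"beta = {beta:.3f}, P_par / P_seq = {ratio:.1%}")
```

The ratio is slightly above $\beta^2 \cdot 0.5 \cdot 2$ because the area-efficient structure has 55/35 of the sequential capacitance rather than exactly twice as much.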
Combining pipelining and parallel processing
Pipelining
Reduces the capacitance to be charged/discharged in 1
clock period
Parallel processing
Increases the clock period for charging/discharging the
original capacitance
3-parallel
2-stage pipelining
pipelining + parallel processing
Propagation delay of the parallel pipelined filter
$L T_{pd} = \dfrac{L C_{charge} V_0}{k (V_0 - V_t)^2} = \dfrac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2}$
Solution of $\beta$
$M L (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$
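To see how the combined condition behaves, the sketch below solves $ML(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$ for a few (M, L) pairs. The supply and threshold values ($V_0 = 5$ V, $V_t = 0.6$ V) are assumptions chosen for illustration, and `solve_beta` is a hypothetical helper name.

```python
import math


def solve_beta(ml, v0, vt):
    """Solve ml*(beta*v0 - vt)^2 = beta*(v0 - vt)^2, keeping beta*v0 > vt."""
    a = ml * v0 ** 2
    b = -(2 * ml * v0 * vt + (v0 - vt) ** 2)
    c = ml * vt ** 2
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)


V0, VT = 5.0, 0.6  # assumed values, not tied to a specific slide example

results = {}
for m, l in [(2, 1), (1, 3), (2, 3)]:
    beta = solve_beta(m * l, V0, VT)
    results[(m, l)] = beta
    print(f"M={m}, L={l}: beta={beta:.4f}, V={beta * V0:.3f} V, "
          f"power ratio beta^2={beta ** 2:.3f}")
```

Only the product ML enters the equation, so under this first-order model 2-stage pipelining with 3-parallel processing scales the supply as far as any other design with ML = 6.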
