
A Bit-Serial Method of Improving

Computational Efficiency of Dot-Products

1
DA (distributed arithmetic) is a bit-serial technique that greatly reduces the resource requirements of the dot-product calculation

So called because the arithmetic resources are not easily recognizable: where's the MAC module?

Takes advantage of small tables of precomputed coefficient sums and a clever rearrangement of the math

2
In signal processing the most common
operation is the dot product

DA lends itself well to FPGA implementation due to its use of lookup tables

DA can reduce gate count by 50%-80% in signal processing arithmetic!

3
It turns out that the dot product is used
extensively in DSP (FIR, FFT, etc)

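Recall that the dot product is a sum of products:

$$y = \mathbf{x} \cdot \mathbf{A} = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix} \begin{bmatrix} A_1 \\ A_2 \\ A_3 \end{bmatrix} = A_1 x_1 + A_2 x_2 + A_3 x_3$$

Written as a summation:

$$y = \sum_{k=1}^{K} A_k x_k$$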
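As a baseline for what DA replaces, the summation can be sketched as a plain multiply-accumulate loop. The function name `dot_product` is only illustrative; it is not taken from the slides.

```python
def dot_product(coeffs, samples):
    """Return sum_k coeffs[k] * samples[k]; each term is one multiply-accumulate."""
    if len(coeffs) != len(samples):
        raise ValueError("coeffs and samples must have the same length")
    y = 0
    for a_k, x_k in zip(coeffs, samples):
        y += a_k * x_k  # one MAC operation per term
    return y


print(dot_product([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

Every pass through the loop needs a multiplier, which is the resource DA sets out to eliminate.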

4
Simple example: smoothing data via DSP (low-pass
filter)

Accomplished with an FIR filter. General form:

$$h[n] = \sum_{k=0}^{K-1} A_k \, \delta[n-k]$$

So we could implement a 3-tap (K=3) moving average filter:

$$h[n] = \tfrac{1}{3}\delta[n] + \tfrac{1}{3}\delta[n-1] + \tfrac{1}{3}\delta[n-2]$$

(In this special case, $A_0 = A_1 = A_2 = 1/3 \approx 0.33$)
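A minimal sketch of the 3-tap moving average follows. It assumes that samples before the start of the data are zero, and it uses `Fraction` so the 1/3 taps stay exact. The helper name `fir_filter` is hypothetical.

```python
from fractions import Fraction


def fir_filter(taps, samples):
    """y[n] = sum_k taps[k] * samples[n - k], treating samples before n = 0 as 0."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, a_k in enumerate(taps):
            if n - k >= 0:
                acc += a_k * samples[n - k]  # one term of the dot product
        out.append(acc)
    return out


taps = [Fraction(1, 3)] * 3           # three equal taps of 1/3
smoothed = fir_filter(taps, [3, 6, 9, 0, 0])
print(smoothed)
```

Each output sample is a dot product of the taps with the three most recent inputs, which is why the dot product dominates the cost of the filter.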


5
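Recall the goal:

$$y = \sum_{k=1}^{K} A_k x_k$$

x is the filter input (digital!), so let's consider its two's complement representation (scaled so that $|x_k| < 1$ for cleanliness), with N total bits:

$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n}$$

Putting them together:

$$y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right]$$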
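The two's complement fraction can be checked with a short sketch. It assumes each input is held as a signed integer scaled by 2^(N-1); the helper names `to_bits` and `from_bits` are illustrative.

```python
from fractions import Fraction


def to_bits(value, n_bits):
    """Two's complement bits [b0, ..., b_{N-1}] of a signed integer; b0 is the sign bit."""
    if not -(1 << (n_bits - 1)) <= value < (1 << (n_bits - 1)):
        raise ValueError("value does not fit in n_bits")
    raw = value & ((1 << n_bits) - 1)
    return [(raw >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]


def from_bits(bits):
    """Evaluate x = -b0 + sum_{n=1}^{N-1} b_n * 2**-n exactly."""
    x = Fraction(-bits[0])
    for n, b in enumerate(bits[1:], start=1):
        x += Fraction(b, 1 << n)
    return x


print(to_bits(-5, 4), from_bits(to_bits(-5, 4)))  # bits 1011 represent -5/8
```

The sign bit carries weight -1 and every other bit a positive power of 1/2, which is what makes the later split into a negative term and a positive sum possible.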

6
Expand the summation:

$$y = -\sum_{k=1}^{K} A_k b_{k0} + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n}$$

Each bit $b_{kn}$ has two possible values (0 or 1), so the bracketed sum has only $2^K$ possible values

We can precompute all terms that depend on the input data bits ($b_{1n}, \dots, b_{Kn}$) and store them in a ROM of size $2^{K+1}$

The x inputs can then be used to address the ROM directly: LUT!
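The rearrangement can be sanity-checked numerically. The sketch below assumes integer coefficients and inputs stored as signed integers scaled by 2^(N-1); all function names are illustrative.

```python
from fractions import Fraction


def to_bits(value, n_bits):
    """Two's complement bits [b0, ..., b_{N-1}]; b0 is the sign bit."""
    raw = value & ((1 << n_bits) - 1)
    return [(raw >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]


def dot_direct(coeffs, samples, n_bits):
    """Reference result: sum_k A_k * x_k with x_k = samples[k] / 2**(n_bits - 1)."""
    scale = 1 << (n_bits - 1)
    return sum(a * Fraction(x, scale) for a, x in zip(coeffs, samples))


def dot_bit_planes(coeffs, samples, n_bits):
    """Same result, grouped by bit position n as in the expanded equation."""
    bits = [to_bits(x, n_bits) for x in samples]
    # Sign-bit term: -sum_k A_k * b_k0
    y = Fraction(-sum(a * b[0] for a, b in zip(coeffs, bits)))
    for n in range(1, n_bits):
        # Bracketed sum: depends only on the K bits at position n
        plane_sum = sum(a * b[n] for a, b in zip(coeffs, bits))
        y += Fraction(plane_sum, 1 << n)
    return y


coeffs, samples = [3, -2, 5], [5, -3, 7]
print(dot_direct(coeffs, samples, 4), dot_bit_planes(coeffs, samples, 4))
```

Both functions return the same value. Only the order of summation changes, and the inner sum is the quantity a lookup table can replace.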

7
Non-DA Hardware Implementation

Let $A = [C_1, C_2, C_3, C_4]$ and $x = [A, B, C, D]$ ($K = 4$)

Based on the original equation:

$$y = \sum_{k=1}^{K} A_k x_k$$

[Figure: non-DA hardware, with blocks labeled "8-bit Multiplier" and "8-bit Adder"]

8
We said this is a bit-serial technique, so how can we perform multiplication?

[Figure: scaling accumulator, with labels "AND with 1 parallel and 1 serial input", "Shift right by 1", and "Result register"]

Example multiplication (A × x), with x = 1011 and A = 1011001
Here, x is a 4-bit input and A is an 8-bit constant (leading zero omitted)

    x bit (LSB first)   partial product
    1                        1011001
    1                       1011001
    0                      0000000
    1                   + 1011001
                          ----------
                          1111010011
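The shift-and-add behaviour of the scaling accumulator can be sketched as follows. This assumes unsigned operands and a result register wide enough that the right shift never drops a bit; the name `serial_multiply` is hypothetical.

```python
def serial_multiply(a_const, x, n_bits):
    """Multiply the parallel constant a_const by x, consuming x one bit per cycle (LSB first)."""
    acc = 0
    for cycle in range(n_bits):
        bit = (x >> cycle) & 1                 # serial input bit
        partial = a_const if bit else 0        # AND of the constant with the serial bit
        # Shift the feedback right by 1, then add the new partial product at the top
        acc = (acc >> 1) + (partial << (n_bits - 1))
    return acc


product = serial_multiply(0b1011001, 0b1011, 4)
print(product, bin(product))  # 89 * 11 = 979
```

One adder and one shift per cycle replace the full multiplier, at the cost of taking as many cycles as x has bits.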

9
So, now we substitute the scaling accumulator
into our original design. Getting closer...

$$y = \sum_{k=1}^{K} A_k x_k$$
10
Let's rearrange the hardware to match our expanded equation:

$$y = -\sum_{k=1}^{K} A_k b_{k0} + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n}$$

We first sum the products of each input bit and its constant

Then we add and scale each of those terms

11
Now recall that we had the clever idea to use precomputed sums in a LUT for the bitwise addition

Address   Data
0000      0
0001      C0
0010      C1
0011      C0+C1
...       ...
1110      C1+C2+C3
1111      C0+C1+C2+C3
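The table contents can be generated mechanically. This sketch assumes that address bit i selects coefficient Ci, matching the 0001 and 0010 rows; `build_lut` is an illustrative name.

```python
def build_lut(coeffs):
    """Entry at address a is the sum of coeffs[i] over every bit i set in a."""
    k = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]


# Decimal place values make each entry show which coefficients were summed.
lut = build_lut([1, 10, 100, 1000])
print(lut[0b0011], lut[0b1110], lut[0b1111])  # 11 1110 1111
```

With K = 4 coefficients the table has 2^4 = 16 entries, one for every combination of input bits.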
12
$$y = -\sum_{k=1}^{K} A_k b_{k0} + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n}$$

We need to accommodate the negative term, so we add one more address line to the LUT, called Ts. ROM size is now $2^{K+1}$

Ts is a timing signal. Ts = 1 during sign-bit time, 0 otherwise

We also need this bit to know when the final result is ready

For all Ts = 1 the ROM contains the negative of the appropriate sum:

Address   Data
10000     0
10001     -C0
...       ...
11111     -(C0+C1+C2+C3)

13
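A sketch of the enlarged ROM, assuming Ts is the most significant address bit (as in the 1xxxx addresses) and that address bit i selects Ci. The function name is hypothetical.

```python
def build_rom_with_ts(coeffs):
    """ROM of 2**(K+1) words: the lower half holds the sums, the upper half (Ts = 1) their negatives."""
    k = len(coeffs)
    sums = [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]
    return sums + [-s for s in sums]


rom = build_rom_with_ts([1, 10, 100, 1000])
print(len(rom), rom[0b10001], rom[0b11111])  # 32 -1 -1111
```

Doubling the table is the price of handling the sign bit with a plain adder.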
This is an example of K=4 DA dot-product hardware

ROM size = $2^{K+1} = 2^5 = 32$

Here is our scaling accumulator

Switch SWA to position 2 after Ts = 1, at which point y contains the final result
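The whole datapath can be simulated in a few lines. This is a sketch under stated assumptions: integer coefficients, inputs held as signed integers scaled by 2^(N-1), bits presented LSB first, and `Fraction` standing in for a register wide enough that the right shift is exact. None of the names come from the slides.

```python
from fractions import Fraction


def build_rom(coeffs):
    """2**(K+1) words: sums for Ts = 0, negated sums for Ts = 1."""
    sums = [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << len(coeffs))
    ]
    return sums + [-s for s in sums]


def da_dot_product(coeffs, samples, n_bits):
    """Bit-serial DA: one ROM lookup and one accumulate per cycle, n_bits cycles in total."""
    k = len(coeffs)
    rom = build_rom(coeffs)
    mask = (1 << n_bits) - 1
    raw = [x & mask for x in samples]          # two's complement bit patterns
    acc = Fraction(0)
    for cycle in range(n_bits):                # cycle 0 is the LSB, the last is the sign bit
        ts = 1 if cycle == n_bits - 1 else 0
        addr = ts << k
        for i in range(k):
            addr |= ((raw[i] >> cycle) & 1) << i
        if ts:
            acc = acc + rom[addr]              # sign-bit time: add the negated sum, no shift
        else:
            acc = (acc + rom[addr]) / 2        # scaling accumulator: add, then shift right
    return acc


print(da_dot_product([3, -2, 5, 1], [5, -3, 7, -8], 4))  # (15 + 6 + 35 - 8) / 8 = 6
```

No multiplier appears anywhere in the loop: the ROM supplies the bracketed sum and the accumulator applies the power-of-two weights.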

14
Computes an N-bit dot product in N cycles

Reduced area and high speed due to the ROM

However, requires a ROM of size $2^{K+1}$ (grows exponentially with the number of input lines)

Input counts of K = 16 are common -> need a $2^{17}$ = 128K-word ROM!
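The exponential growth is easy to tabulate. The helper `rom_words` is illustrative and counts ROM words, not bits.

```python
def rom_words(num_inputs):
    """Words needed by the Ts-addressed ROM: 2**(K + 1) for K input lines."""
    return 1 << (num_inputs + 1)


for k in (4, 8, 12, 16):
    print(f"K = {k:2d}: {rom_words(k):6d} words")
```

At K = 16 the count reaches 131072 words, which is the 128K figure.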

15
Bit-serial means an N-bit dot product requires N cycles... Slower than parallel?

A dedicated HW multiplier for every term is not generally practical due to large area and power!

Time-multiplexing a single parallel HW multiplier means you lose the speed gain: K cycles (one per term) vs. N cycles for DA

Example: K=8, N=8 takes the same time on time-multiplexed parallel HW as on bit-serial DA

16
We can reduce the ROM size to $2^K$ with some tricks

Replace the adder with an adder/subtractor

Ts becomes the control line for the adder/subtractor

ROM size is reduced by half

There are other math tricks to reduce the size further, to $2^{K-1}$
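A sketch of the adder/subtractor variant, under the same assumptions as a plain DA simulation (integer coefficients, inputs scaled by 2^(N-1), LSB first, `Fraction` modelling an exact right shift). The names are hypothetical, and the further reduction to 2^(K-1) is not shown.

```python
from fractions import Fraction


def build_half_rom(coeffs):
    """Only 2**K words: the sums themselves, with no negated copies."""
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << len(coeffs))
    ]


def da_dot_product_addsub(coeffs, samples, n_bits):
    """Bit-serial DA where Ts selects subtract instead of addressing a second ROM half."""
    rom = build_half_rom(coeffs)
    mask = (1 << n_bits) - 1
    raw = [x & mask for x in samples]
    acc = Fraction(0)
    for cycle in range(n_bits):                # LSB first, sign bit last
        addr = sum(((r >> cycle) & 1) << i for i, r in enumerate(raw))
        if cycle == n_bits - 1:                # Ts = 1: subtract the sum, no shift
            acc = acc - rom[addr]
        else:                                  # Ts = 0: add, then shift right by 1
            acc = (acc + rom[addr]) / 2
    return acc


print(da_dot_product_addsub([3, -2, 5, 1], [5, -3, 7, -8], 4))  # 6
```

The result matches the Ts-addressed version while the table shrinks from 32 to 16 words for K = 4.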

17
Speed is determined by the serial nature of the input: one bit at a time (1 BAAT)

We can expand the HW to process multiple bits at a time

Introduce the input as bit pairs: x10 x11, x12 x13, etc.

Shift the result for the LSB of each pair by 1

Shift the accumulator feedback by 2

Requires 2 ROMs instead of 1
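A sketch of the two-bits-at-a-time arrangement. It assumes an even word length, two ROMs with identical contents, integer coefficients, inputs scaled by 2^(N-1), and that the sign bit is handled by subtracting the upper ROM's output on the last pair. That last point is an assumption for illustration; the slides do not say how the sign is handled here.

```python
from fractions import Fraction


def build_rom(coeffs):
    """2**K words of coefficient sums, indexed by the K input bits."""
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << len(coeffs))
    ]


def address(raw_words, bit_pos):
    """ROM address formed from bit bit_pos of every input word."""
    return sum(((r >> bit_pos) & 1) << i for i, r in enumerate(raw_words))


def da_two_bits_at_a_time(coeffs, samples, n_bits):
    """DA consuming one bit pair per cycle, so n_bits / 2 cycles in total."""
    if n_bits % 2:
        raise ValueError("this sketch assumes an even word length")
    rom_low, rom_high = build_rom(coeffs), build_rom(coeffs)
    mask = (1 << n_bits) - 1
    raw = [x & mask for x in samples]
    acc = Fraction(0)
    for pos in range(0, n_bits, 2):            # least significant pair first
        low = Fraction(rom_low[address(raw, pos)], 2)   # LSB of the pair, shifted by 1
        high = rom_high[address(raw, pos + 1)]
        if pos + 1 == n_bits - 1:              # upper bit of the last pair is the sign bit
            high = -high
        acc = acc / 4 + low + high             # accumulator feedback shifted by 2
    return acc


print(da_two_bits_at_a_time([3, -2, 5, 1], [5, -3, 7, -8], 4))  # 6
```

The cycle count halves, and the second ROM and extra adder are the area paid for it.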

18
DA lends itself easily to DSP because it applies so directly to the dot product

DA is easy to implement on an FPGA because the architectures are similar: both are built on LUTs (custom hardware is, of course, better still)

DA is not limited to the dot product; it will work for any algorithm where precomputed values can be leveraged

19
DA is a very efficient means of mechanizing
the dot product

The use of DA can save 50-80% area over the parallel approach

Like everything, DA has tradeoffs:

ROM size vs. number of input lines
Speed vs. area (multi-ROM)

20
Application of Distributed Arithmetic to Digital Signal Processing: A Tutorial
Review. White, Stanley. IEEE ASSP Magazine July 1989
(I pulled most of the basic talk info from here)

Parallel and Pipelined Architecture Designs for Distributed Arithmetic-Based Recursive Digital Filters. Hwang, H. and Su, C. IEEE Xplore, VLSI Signal Processing IX, 1996, 35-44
(this has some slight remarks about bit-parallel vs bit-serial, also an auto-regressive moving average filter example)

Distributed Arithmetic for Efficient Base-Band Processing in Real-Time GNSS Software Receivers. Waelchli, G. et al. Journal of Electrical and Computer Engineering, volume 2010
(application to GPS)

An FPGA-Based Parallel Distributed Arithmetic Implementation of the 1-D Discrete Wavelet Transform. Al-Haj, Ali. Informatica 29 (2005) 241-247
(DSP example using a Virtex FPGA)

21
