ENG6530 DistributedArithmetic NickBoyd

A Bit-Serial Method of Improving
Computational Efficiency of Dot-Products
1
Distributed arithmetic (DA) is a bit-serial technique that greatly reduces the resource requirements of the dot-product calculation
So called because the arithmetic resources are not easily recognizable: where's the MAC (multiply-accumulate) module?
Takes advantage of small tables of precomputed coefficient sums and a clever rearrangement of the math
2
In signal processing, the most common operation is the dot product
DA lends itself well to FPGA implementation due to its use of lookup tables
DA can reduce gate count by 50%-80% in signal processing arithmetic!
3
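It turns out that the dot product is used extensively in DSP (FIR, FFT, etc.)
Recall that the dot product is a sum of products:

$$y = \mathbf{x} \cdot \mathbf{A} = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix} \begin{bmatrix} A_1 \\ A_2 \\ A_3 \end{bmatrix} = A_1 x_1 + A_2 x_2 + A_3 x_3$$

Written as a summation:

$$y = \sum_{k=1}^{K} A_k x_k$$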
4
Simple example: smoothing data via DSP (low-pass filter)
Accomplished with an FIR filter. General form:

$$h[n] = \sum_{k=0}^{K-1} A_k \, \delta[n-k]$$

So we could implement a 3-tap (K=3) moving average filter:

$$h[n] = \tfrac{1}{3}\,\delta[n] + \tfrac{1}{3}\,\delta[n-1] + \tfrac{1}{3}\,\delta[n-2]$$

(In this special case, every coefficient is 1/3, approximately 0.33)
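As a rough sketch of the moving-average idea (not part of the original slides), the snippet below computes each output sample as a dot product between the coefficient vector and the most recent inputs. The names `fir_output` and `moving_average` are made up for illustration, and samples before the start of the data are assumed to be zero.

```python
from fractions import Fraction

def fir_output(coeffs, samples, n):
    """y[n] = sum over k of coeffs[k] * samples[n - k].

    Samples before index 0 are treated as zero (an assumption of this sketch).
    """
    return sum(a * samples[n - k] for k, a in enumerate(coeffs) if n - k >= 0)

def moving_average(samples, taps=3):
    """Apply a moving-average FIR filter whose taps all equal 1/taps."""
    coeffs = [Fraction(1, taps)] * taps
    return [fir_output(coeffs, samples, n) for n in range(len(samples))]

data = [3, 6, 9, 6, 3]
smoothed = moving_average(data)
print([str(v) for v in smoothed])  # ['1', '3', '6', '7', '6']
```

Exact fractions are used here only so the 1/3 coefficients do not introduce rounding noise into the example.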

5
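Recall the goal:

$$y = \sum_{k=1}^{K} A_k x_k$$

x is the filter input (digital!), so let's consider its two's complement representation (scaled so that $|x_k| < 1$ for cleanliness), with N total bits:

$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n}$$

Putting them together:

$$y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right]$$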
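To make the fractional two's complement form concrete, here is a minimal sketch that converts between a bit list and the value it represents. The helper names `bits_to_fraction` and `fraction_to_bits` are hypothetical, and the bit list is assumed to be ordered sign bit first.

```python
from fractions import Fraction

def bits_to_fraction(bits):
    """Decode [b0, b1, ..., b(N-1)]: b0 is the sign bit, bn has weight 2^-n."""
    value = -Fraction(bits[0])
    for n in range(1, len(bits)):
        value += Fraction(bits[n], 2 ** n)
    return value

def fraction_to_bits(x, n_bits):
    """Encode a Fraction x in [-1, 1) that is exactly representable in n_bits."""
    scaled = x * 2 ** (n_bits - 1)
    if scaled.denominator != 1 or not -2 ** (n_bits - 1) <= scaled < 2 ** (n_bits - 1):
        raise ValueError("x is not representable in the requested width")
    code = int(scaled) % (2 ** n_bits)  # wrap negative values into two's complement
    return [(code >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]

print(bits_to_fraction([1, 0, 1, 1]))       # -5/8
print(fraction_to_bits(Fraction(3, 4), 4))  # [0, 1, 1, 0]
```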
6
Expand the summation:

$$y = \sum_{k=1}^{K} A_k (-b_{k0}) + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n}$$

Each $b_{kn}$ has two possible values (0 or 1), so the bracketed sum has only $2^K$ possible values
We can precompute all terms that depend on the input data bits ($b_{k0} \ldots b_{k,N-1}$) and store them in a ROM of size $2^{K+1}$
The x inputs can then be used to address the ROM directly: LUT!
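A quick numerical check can make the rearrangement easier to trust. The sketch below evaluates the dot product directly and again in the expanded, bit-by-bit form, then compares the two. The function names and the sample coefficients are illustrative assumptions, and each input is given as a bit list with the sign bit first.

```python
from fractions import Fraction

def value(bits):
    """Fractional two's complement value: -b0 + sum of bn * 2^-n."""
    return -bits[0] + sum(Fraction(b, 2 ** n) for n, b in enumerate(bits) if n >= 1)

def direct_dot(A, X):
    """y = sum over k of A_k * x_k, using the decoded input values."""
    return sum(a * value(bits) for a, bits in zip(A, X))

def rearranged_dot(A, X):
    """Same y, computed as a sign term plus one scaled sum per bit position."""
    n_bits = len(X[0])
    sign_term = sum(a * -bits[0] for a, bits in zip(A, X))
    bit_terms = sum(
        sum(a * bits[n] for a, bits in zip(A, X)) * Fraction(1, 2 ** n)
        for n in range(1, n_bits)
    )
    return sign_term + bit_terms

A = [3, -2, 5]
X = [[0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]  # 5/8, -1/2, 3/8
print(direct_dot(A, X), rearranged_dot(A, X))   # 19/4 19/4
```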
7
Non-DA Hardware Implementation
Let A = [C1, C2, C3, C4] and x = [A, B, C, D] (K = 4)
Based on the original equation:

$$y = \sum_{k=1}^{K} A_k x_k$$

[Figure: non-DA hardware block diagram; labeled blocks: 8-bit Multiplier, 8-bit Adder]
8
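We said this is a bit-serial technique, so how can we perform multiplication?
With a scaling accumulator: AND the parallel input (A) with the serial input (one bit of x per cycle), add the result to the accumulator, and shift right by 1

[Figure: scaling accumulator; labels: "AND with 1 parallel and 1 serial input", "Shift right by 1", "Result register"]

Example multiplication, where x is a 4-bit input and A is an 8-bit constant (leading zero omitted):

x = 1011
A = 1011001

x bit (LSB first)   partial product
1                      1011001
1                     1011001
0                    0000000
1                 + 1011001
                    ----------
                    1111010011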
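The add-and-shift behavior can be modeled in a few lines. This is only a sketch under stated assumptions: x is treated as unsigned, the accumulator is modeled as a register wide enough that the right shifts never drop a set bit, and `serial_multiply` is a hypothetical name.

```python
def serial_multiply(a, x, x_bits):
    """Multiply constant a by an unsigned x_bits-wide x, one bit of x per cycle."""
    acc = 0
    for i in range(x_bits):
        bit = (x >> i) & 1         # serial input, LSB first
        partial = a if bit else 0  # AND of the parallel input with the serial bit
        # Add at the top of the wide register, then shift right by 1.
        acc = (acc + (partial << x_bits)) >> 1
    return acc

product = serial_multiply(0b1011001, 0b1011, 4)
print(product, bin(product))  # 979 0b1111010011
```

For the operands x = 1011 (11) and A = 1011001 (89), the model returns 979, which is 89 times 11.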
9
So, now we substitute the scaling accumulator into our original design. Getting closer...

$$y = \sum_{k=1}^{K} A_k x_k$$
10
Let's rearrange the hardware to match our expanded equation:

$$y = \sum_{k=1}^{K} A_k (-b_{k0}) + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n}$$

We first sum the products of each input bit and its constant
Then we add and scale each of those terms
11
Now recall that we had the clever idea to use precomputed sums in a LUT for the bitwise addition

Address   Data
0000      0
0001      C0
0010      C1
0011      C0+C1
...       ...
1110      C1+C2+C3
1111      C0+C1+C2+C3
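Filling such a table is mechanical: each address bit selects whether the matching constant is included in the stored sum. The sketch below assumes address bit k selects Ck, and uses placeholder constants (1, 10, 100, 1000) chosen only so the contents are easy to read. `build_lut` is a hypothetical helper name.

```python
def build_lut(coeffs):
    """Return a table where entry `addr` is the sum of coeffs[k] for each set bit k."""
    k_inputs = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(2 ** k_inputs)
    ]

# Placeholder constants so each stored sum shows which terms it contains.
C = [1, 10, 100, 1000]  # C0, C1, C2, C3
lut = build_lut(C)
print(len(lut))      # 16
print(lut[0b0011])   # 11   -> C0 + C1
print(lut[0b1110])   # 1110 -> C1 + C2 + C3
```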
12
$$y = \sum_{k=1}^{K} A_k (-b_{k0}) + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n}$$

We need to accommodate the negative term, so we add one more address line to the LUT, called Ts. The ROM size is now $2^{K+1}$
Ts is a timing signal: Ts = 1 during the sign-bit time, 0 otherwise
We also need this bit to know when the final result is ready
For all addresses with Ts = 1, the ROM contains the negative of the appropriate sum:

Address   Data
10000     0
10001     -C0
...       ...
11111     -(C0+C1+C2+C3)
13
This is an example of K=4 DA dot-product hardware
ROM size = $2^{K+1} = 2^5 = 32$
Here is our scaling accumulator
Switch SWA to position 2 after Ts = 1, at which point y contains the final result
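The whole datapath can be simulated in software to check the idea. This is a behavioral sketch, not a description of the exact circuit on the slide: inputs are assumed to arrive LSB first as N-bit two's complement codes (value = signed code / 2^(N-1)), address bit k comes from input k, Ts is the top address bit, and the accumulator halves its contents before each add. The names `build_rom` and `da_dot` and the sample numbers are illustrative.

```python
from fractions import Fraction

def build_rom(coeffs):
    """2^(K+1) words: the 2^K coefficient sums, then their negatives (Ts = 1 half)."""
    k_inputs = len(coeffs)
    sums = [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(2 ** k_inputs)
    ]
    return sums + [-s for s in sums]

def da_dot(coeffs, x_codes, n_bits):
    """Bit-serial dot product: one ROM lookup and one scaled add per cycle."""
    k_inputs = len(coeffs)
    rom = build_rom(coeffs)
    acc = Fraction(0)
    for n in range(n_bits):                    # LSB first, sign bit last
        ts = 1 if n == n_bits - 1 else 0       # timing signal for the sign bit
        addr = sum(((x >> n) & 1) << k for k, x in enumerate(x_codes))
        addr |= ts << k_inputs
        acc = acc / 2 + rom[addr]              # scaling accumulator
    return acc

coeffs = [3, -2, 5, 1]
x_codes = [0b0101, 0b1100, 0b0011, 0b1000]     # 5/8, -1/2, 3/8, -1
print(len(build_rom(coeffs)))                  # 32
print(da_dot(coeffs, x_codes, 4))              # 15/4
```

The result agrees with the direct calculation 3(5/8) + (-2)(-1/2) + 5(3/8) + 1(-1) = 15/4, reached after N = 4 cycles.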
14
Computes an N-bit dot product in N cycles
Reduced area and high speed due to the ROM
However, it requires a ROM of size $2^{K+1}$ (grows exponentially with the number of input lines)
With 16 input lines (K = 16), that means a $2^{17}$ = 128K-word ROM!
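The growth is easy to tabulate. The small sketch below assumes one address line per input plus the optional Ts line, and `rom_words` is a hypothetical helper.

```python
def rom_words(k_inputs, sign_line=True):
    """Number of ROM words for k_inputs address lines, plus Ts if sign_line is set."""
    return 2 ** (k_inputs + (1 if sign_line else 0))

for k in (4, 8, 16):
    print(k, rom_words(k))
# 4 32
# 8 512
# 16 131072
```

131072 words is 128 x 1024, which is where the 128K figure comes from.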
15
Bit-serial means an N-bit dot product requires N cycles... Slower than parallel?
A fully parallel design needs one hardware multiplier per term, which is generally not practical due to large area/power!
Time-multiplexing one parallel HW multiplier means you lose the speed gain: K cycles for the multiplexed multiplier vs. N cycles for DA
Example: K=8, N=8 takes the same time on time-multiplexed parallel HW as on bit-serial DA
16
We can reduce the ROM size to $2^K$ with some tricks
Replace the adder with an adder/subtractor
Ts becomes the control line for the adder/subtractor
ROM size is reduced by half
There are other math tricks to reduce the size further, to $2^{K-1}$
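A behavioral sketch of the adder/subtractor variant: the ROM holds only the 2^K positive sums, and Ts selects subtraction during the sign-bit cycle. The same assumptions apply as for any software model here (LSB-first N-bit two's complement codes, address bit k from input k), and `da_dot_addsub` is an illustrative name. The further reduction to 2^(K-1) is not modeled.

```python
from fractions import Fraction

def da_dot_addsub(coeffs, x_codes, n_bits):
    """DA dot product with a 2^K-word ROM and an adder/subtractor controlled by Ts."""
    rom = [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(2 ** len(coeffs))
    ]
    acc = Fraction(0)
    for n in range(n_bits):                 # LSB first, sign bit last
        ts = n == n_bits - 1                # True only during the sign-bit cycle
        addr = sum(((x >> n) & 1) << k for k, x in enumerate(x_codes))
        if ts:
            acc = acc / 2 - rom[addr]       # subtract the sign-bit sum
        else:
            acc = acc / 2 + rom[addr]       # add every other bit's sum
    return acc

coeffs = [3, -2, 5, 1]
x_codes = [0b0101, 0b1100, 0b0011, 0b1000]  # 5/8, -1/2, 3/8, -1
print(da_dot_addsub(coeffs, x_codes, 4))    # 15/4
```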
17
Speed is determined by the serial nature of the input: 1 BAAT (one bit at a time)
We can expand the HW to process multiple bits at a time
Introduce the input as bit pairs: x10 x11, x12 x13, etc.
Shift the result for the LSB of each pair by 1
Shift the accumulator feedback by 2
Requires 2 ROMs instead of 1
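The two-bits-at-a-time scheme can be sketched the same way. Assumptions in this model: N is even, inputs are LSB-first two's complement codes, both ROMs hold the same 2^K sums, and the sign bit (the upper bit of the last pair) is handled by subtraction. `da_dot_2baat` is a hypothetical name.

```python
from fractions import Fraction

def da_dot_2baat(coeffs, x_codes, n_bits):
    """DA dot product consuming two input bits per cycle (N/2 cycles in total)."""
    if n_bits % 2:
        raise ValueError("this sketch assumes an even number of bits")
    rom = [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(2 ** len(coeffs))
    ]  # in hardware this table exists twice, one copy per bit of the pair

    def address(n):
        return sum(((x >> n) & 1) << k for k, x in enumerate(x_codes))

    acc = Fraction(0)
    for lo in range(0, n_bits, 2):
        hi = lo + 1
        hi_term = rom[address(hi)]
        if hi == n_bits - 1:
            hi_term = -hi_term                        # sign-bit cycle: subtract
        pair = hi_term + Fraction(rom[address(lo)], 2)  # LSB result shifted by 1
        acc = acc / 4 + pair                          # feedback shifted by 2
    return acc

coeffs = [3, -2, 5, 1]
x_codes = [0b0101, 0b1100, 0b0011, 0b1000]  # 5/8, -1/2, 3/8, -1
print(da_dot_2baat(coeffs, x_codes, 4))     # 15/4
```

The same 4-bit example now finishes in 2 cycles instead of 4, at the cost of a second ROM lookup per cycle.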
18
DA lends itself easily to DSP because of its easy application to the dot product
DA is easily implementable on an FPGA because of the similar architecture: LUTs (of course, better on custom hardware)
DA is not limited to the dot product; it will work for any algorithm where precomputed values can be leveraged
19
DA is a very efficient means of mechanizing the dot product
The use of DA can save 50-80% area over the parallel approach
Like everything, DA has tradeoffs:
ROM size vs. input lines
Speed vs. area (multi-ROM)
20
Application of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review. White, Stanley. IEEE ASSP Magazine, July 1989.
(I pulled most of the basic talk info from here)
Parallel and Pipelined Architecture Designs for Distributed Arithmetic-Based Recursive Digital Filters. Hwang, H. and Su, C. IEEE Xplore, VLSI Signal Processing IX, 1996, 35-44.
(this has some slight remarks about bit-parallel vs. bit-serial, plus an auto-regressive moving average filter example)
Distributed Arithmetic for Efficient Base-Band Processing in Real-Time GNSS Software Receivers. Waelchli, G. et al. Journal of Electrical and Computer Engineering, volume 2010.
(application to GPS)
An FPGA-Based Parallel Distributed Arithmetic Implementation of the 1-D Discrete Wavelet Transform. Al-Haj, Ali. Informatica 29 (2005) 241-247.
(DSP example using a Virtex FPGA)
21
