Design and Implementation of High-Performance RNS Wavelet Processors Using Custom IC Technologies

Journal of VLSI Signal Processing 34, 227–237, 2003
© 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.

JAVIER RAMÍREZ
Department of Electronics and Computer Technology, University of Granada, Spain
UWE MEYER-BÄSE
Department of Electrical and Computer Engineering, Florida State University, Tallahassee, FL 32310-6046, USA
FRED TAYLOR
High-Speed Digital Architecture Laboratory, University of Florida, Gainesville, FL 32611-6130, USA
ANTONIO GARCÍA AND ANTONIO LLORIS
Department of Electronics and Computer Technology, University of Granada, Spain
Received September 6, 2001; Revised August 14, 2002; Accepted August 14, 2002
Abstract. The design of high-performance, high-precision, real-time digital signal processing (DSP) systems, such
as those associated with wavelet signal processing, is a challenging problem. This paper reports on the innovative
use of the residue number system (RNS) for implementing high-end wavelet filter banks. The disclosed system
uses an enhanced index transformation defined over Galois fields to efficiently support different wavelet filter
instantiations without adding any extra cost or additional look-up tables (LUTs). The selection of a small-wordwidth
modulus set is the key to attaining low complexity and high throughput. An exhaustive comparison against
existing two's complement (2C) designs for different custom IC technologies was carried out. Results reveal a
performance improvement of up to 100% for high-precision RNS-based systems. These structures proved to
be well suited for field-programmable logic (FPL) assimilation as well as for CBIC (cell-based integrated circuit)
technologies.
Keywords: discrete wavelet transform, RNS arithmetic, custom integrated circuit, field-programmable logic
devices
1. Introduction
There is a growing demand that digital image processing be performed at greater real-time bandwidths,
with higher precision, and lower complexity. Since these systems are intrinsically SAXPY (S = AX + Y)
dominant, advanced solutions must overcome existing arithmetic limitations. An arithmetic system capable
of surmounting this barrier is the residue number system, or RNS. Computer arithmeticians have long held
that the RNS offers a distinct MAC (multiply and accumulate) speed-area advantage [1] in SAXPY-intensive
applications. The development of new RNS structures to better build signal processing systems with custom
IC technologies is a field of continuous interest and study.
The evolution of the DSP market and technology makes it necessary to consider not only cell-based
ASICs but also modern CPLD (complex programmable logic device) families, such as Altera FLEX10K [2] or
Virtex [3], in the design and implementation of signal processing systems. ASICs are becoming the dominant
technology, with the Y-2000 DSP CBIC (cell-based integrated circuit) ASIC market valued in excess of $13B,
compared to $8B for PDSPs (programmable digital signal processors). The FPL ASIC market is expected to
expand at a rate of 20% per annum, with DSP applications leading the way. While FPL houses champion
their technology as a provider of system-on-a-chip (SOC) DSP solutions, engineers have historically
viewed FPLs as a prototyping technology. It should be noted that 40% of the current FPL design starts are
rated at 1,500 gates. This figure falls well below the reported 50,000+ gates that account for 50% of
standard cell ASIC designs [1]. When one considers that an FPGA typically requires 10× more gates than a CBIC
to implement a common logic function, a typical 50k gate standard cell ASIC design would require a large
500k gate FPGA. In order for FPL to begin to compete in areas currently controlled by low-end standard cell,
a means must be found to implement DSP objects more efficiently.
In [4], the RNS was used to design a wavelet transform using field-programmable logic (FPL). The design
was compared to a two's complement (2C) and a distributed arithmetic (DA) implementation. The RNS
solution was found to be superior to the 2C case and compared favourably with the DA instantiation but,
unlike a DA design, was fully programmable. An enhanced RNS implementation of FIR filters by means
of the DA method can be found in [5]. Later, this RNS-DA mechanization was enhanced and applied to
wavelet filter banks [6, 7]. A DWT filter bank having a 14-bit input, designed by means of the reported
RNS-DA methodology, achieved a performance improvement over the equivalent 2C system of up to
156.27%, with the conversion stage not degrading the throughput of the overall system. The RNS speed
advantage is gained by reducing arithmetic to a set of concurrent operations that reside in small-wordlength,
non-communicating channels. This attribute makes the RNS potentially attractive for implementing DSP
objects with commercially available FPL and CBIC technologies. Another demonstration of the RNS
benefits is found in [8] for use in orthogonal wavelet filter bank applications. The filter banks were designed
to accept 8-bit input signals, process them using 10-bit coefficients, and ran 23.45% and 96.58% faster than a
2C design for one and two octaves, respectively. A weakness of the reported RNS solution was that fixed
coefficient multiplication was mapped into look-up tables (LUTs). Consequently, the tables needed to be
re-programmed whenever a different set of wavelet coefficients was selected. This paper explores an
efficient means of obtaining discrete wavelet transform (DWT) architectures defined over multiple
filter coefficient sets by means of the RNS. The paper extends these ideas and develops a mechanism for
achieving synergy within FPL-defined environments and cell-based CMOS IC technologies to better
implement arithmetic-intensive DSP solutions. The quantifiable benefits of this approach are studied in the
context of a programmable wavelet filter bank. The work builds upon previous works and RNS-FPL
design studies [6, 7, 9–13].
2. Index-Based Arithmetic over Galois Fields
There is emerging evidence that an arithmetic technology, called the RNS, can avoid the throughput
degradation that accompanies an increase in precision and become a custom IC enabling technology
[3, 6, 7, 12, 13]. Computer arithmeticians have long held that the RNS offers the best MAC speed-area
advantage [1]. In the RNS, numbers are represented in terms of a relatively prime basis set (moduli set)
$P = \{m_1, \ldots, m_L\}$. Any number $X \in Z_M = \{0, 1, \ldots, M-1\}$, where $M = \prod_{i=1}^{L} m_i$,
has a unique RNS representation $X \leftrightarrow \{X_1, \ldots, X_L\}$, where $X_i = X \bmod m_i$. Like the
2C system, RNS arithmetic is exact as long as the final result is bounded within the system's dynamic
range $Z_M$. Mapping from the RNS back to the integer domain is defined by the Chinese Remainder
Theorem (CRT) [1]. RNS arithmetic is defined by pair-wise modular operations:

$$
\begin{aligned}
Z = X \pm Y &\leftrightarrow \left\{ |X_1 \pm Y_1|_{m_1}, \ldots, |X_L \pm Y_L|_{m_L} \right\} \\
Z = X \times Y &\leftrightarrow \left\{ |X_1 \times Y_1|_{m_1}, \ldots, |X_L \times Y_L|_{m_L} \right\}
\end{aligned}
\qquad (1)
$$

where $|Q|_{m_j}$ denotes $Q \bmod m_j$. The individual modular arithmetic operations are typically performed as
LUT calls to small memories. The RNS differs from traditional weighted numbering systems in that
RNS arithmetic is carry-free and can operate at a constant speed over a wide range of precisions.
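The channel-wise arithmetic of Eq. (1) and the CRT mapping back to the integers can be illustrated with a minimal Python sketch. The function names (`to_rns`, `rns_add`, `rns_mul`, `from_rns`) and the sample operands are illustrative assumptions, not part of the reported design; the modulus set is the 5-bit set used later in the paper.

```python
from math import prod

def to_rns(x, moduli):
    """Forward conversion: X -> {X_1, ..., X_L} with X_i = X mod m_i."""
    return [x % m for m in moduli]

def rns_add(xs, ys, moduli):
    """Channel-wise modular addition, first line of Eq. (1)."""
    return [(a + b) % m for a, b, m in zip(xs, ys, moduli)]

def rns_mul(xs, ys, moduli):
    """Channel-wise modular multiplication, second line of Eq. (1)."""
    return [(a * b) % m for a, b, m in zip(xs, ys, moduli)]

def from_rns(residues, moduli):
    """Reverse conversion by the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m                      # product of all other moduli
        total += r * Mi * pow(Mi, -1, m) # pow(.., -1, m) is the modular inverse
    return total % M

MODULI = [31, 29, 23, 19, 17]            # pairwise relatively prime
x, y = 1234, 567                         # illustrative operands
xr, yr = to_rns(x, MODULI), to_rns(y, MODULI)

# A multiply-accumulate x*y + x carried out without any inter-channel carries.
mac = from_rns(rns_add(rns_mul(xr, yr, MODULI), xr, MODULI), MODULI)
assert mac == x * y + x                  # exact because the result is below M
```

No channel ever exchanges information with another one until the final CRT step, which is the property the text refers to as carry-free arithmetic.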
A variety of RNS multipliers are available, including pure LUT multipliers, square law multipliers [14],
index-transform multipliers [15, 16], and array multipliers [17]. Pure LUT multipliers require a
double-precision LUT and are only a good choice for small moduli. Square law multipliers require two LUTs, two
adders and a modulo adder. Galois field multipliers are based on index transformation and require a
single LUT to implement modulo multiplication in a DSP system [13]. Array multipliers are used, for
instance, in cryptographic systems, since large moduli are required and any LUT-based multiplier would
require very large LUTs.
The index-transformation multiplier [15, 16] constitutes an efficient means of designing high-performance,
reduced-complexity DSP systems. It is based on the mathematical properties of Galois fields, denoted
GF($p$), where $p$ is prime. All the non-zero elements in a Galois field can be generated by exponentiating a
primitive element, denoted $g_j$. This property can be exploited for multiplication in GF($m_j$) through the
use of a well-known isomorphism between the multiplicative group $Q = \{1, 2, \ldots, m_j - 1\}$, with
multiplication performed modulo $m_j$, and the additive group $I = \{0, 1, \ldots, m_j - 2\}$, with addition
performed modulo $(m_j - 1)$. The mapping is given by:

$$ q = \phi_j^{-1}(i) = g_j^{\,i} \bmod m_j \qquad (2) $$

with $q \in Q$ and $i \in I$, and multiplication, using index arithmetic, is based on:

$$ |q_j q_k|_{m_j} = \left| g_j^{\,|i_j + i_k|_{m_j - 1}} \right|_{m_j} \qquad (3) $$

Thus, the multiplication of two numbers, say $q_j$ and $q_k$, can be performed by adding exponents in a
modular sense. The exponents, or indexes, $i_j$ and $i_k$ can be pre-computed and stored in a lookup table.
Adding the indexes can be performed with a modulo $(m_j - 1)$ adder, and the inverse index transformation
of $i_j$ into $q_j$ can again be performed using a LUT.
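A small Python sketch may help to make Eqs. (2) and (3) concrete. It builds the two index tables for a prime modulus by brute force and multiplies by adding indexes. The helper names (`find_primitive_element`, `build_tables`, `index_multiply`) are hypothetical, and the explicit test for zero operands is an assumption of this sketch, since zero has no index.

```python
def find_primitive_element(p):
    """Smallest g whose powers generate all non-zero residues modulo the prime p."""
    for g in range(2, p):
        seen = set()
        q = 1
        for _ in range(p - 1):
            q = (q * g) % p
            seen.add(q)
        if len(seen) == p - 1:
            return g
    raise ValueError("no primitive element found; p must be an odd prime")

def build_tables(p):
    """Return (forward, inverse): residue -> index and index -> residue."""
    g = find_primitive_element(p)
    inverse = [pow(g, i, p) for i in range(p - 1)]   # Eq. (2): q = g^i mod p
    forward = {q: i for i, q in enumerate(inverse)}  # index of each non-zero q
    return forward, inverse

def index_multiply(a, b, p, forward, inverse):
    """Modular product via index addition, Eq. (3)."""
    if a == 0 or b == 0:
        return 0                                     # zero is handled separately
    return inverse[(forward[a] + forward[b]) % (p - 1)]

forward31, inverse31 = build_tables(31)
product = index_multiply(12, 25, 31, forward31, inverse31)
assert product == (12 * 25) % 31
```

In hardware terms, `forward` and `inverse` play the role of the two LUTs and the `% (p - 1)` addition that of the modulo $(m_j - 1)$ adder.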
3. Discrete Wavelet Transform
Interest in the wavelet transform [18, 19] has grown dramatically during the last decade [20–25].
Wavelet transforms are routinely used in speech, image and video signal processing, and other
applications. Discrete wavelet transforms (DWT) are defined over a sequence of embedded closed subspaces,
$V_J \subset V_{J-1} \subset \cdots \subset V_1 \subset V_0$, where $V_0 = l_2(\mathbb{Z})$ is the
space of square-summable sequences. These subspaces satisfy the upward completeness property,
$\bigcup_j V_j = l_2(\mathbb{Z})$, $j \in [0, J]$. Assume that any element in $V_j$ can be
uniquely expressed as the sum of two elements from $V_{j+1}$ and $W_{j+1}$, where
$V_j = V_{j+1} \oplus W_{j+1}$. For orthogonal wavelets, $W_{j+1}$ is defined as the orthogonal
complement of $V_{j+1}$ in $V_j$. Assuming a sequence $\tilde{g}_n \in V_0$ exists such that
$\{\tilde{g}_{n-2k}\}_{k \in \mathbb{Z}}$ is a basis for $V_1$, a sequence $\tilde{h}_n \in V_0$ can then be
found such that $\{\tilde{h}_{n-2k}\}_{k \in \mathbb{Z}}$ is a basis for $W_1$. Thus, $V_0$ can be decomposed as
$V_0 = W_1 \oplus W_2 \oplus \cdots \oplus W_J \oplus V_J$ by simply iterating
the decomposition rule $J$ times. An attractive feature of the wavelet series expansion is that the underlying
multiresolution structure leads to an efficient discrete-time algorithm based on a filter bank implementation.
The octave-band analysis filter bank computes the inner products with the basis functions for
$W_1, W_2, \ldots, W_J$ and $V_J$. The orthogonal projection of the input signal onto
$W_1, W_2, \ldots, W_J$ and $V_J$ is computed after convolution with the synthesis filters. Then, the sequence
is decomposed into a coarse resolution version in $V_J$ with added details in $W_i$ ($i = 1, 2, \ldots, J$).
Thus a 1-D $N$th-order DWT decomposition of a sequence $x_n$ is defined by the recurrent equations:
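$$
\begin{aligned}
a_n^{(i)} &= \sum_{k=0}^{N-1} g_k \, a_{2n-k}^{(i-1)}, \qquad i = 1, 2, \ldots, J \\
d_n^{(i)} &= \sum_{k=0}^{N-1} h_k \, a_{2n-k}^{(i-1)}, \qquad a_n^{(0)} \equiv x_n
\end{aligned}
\qquad (4)
$$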
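The recurrence of Eq. (4) can be sketched directly in Python. The boundary handling (samples outside the sequence taken as zero) and the unnormalized integer Haar-like filters in the example are assumptions made for illustration only; the function names are hypothetical.

```python
def dwt_analysis_level(a_prev, g, h):
    """One octave of Eq. (4): filter with g and h, keeping every second output."""
    def sample(idx):
        # Assumption of this sketch: samples outside the sequence are zero.
        return a_prev[idx] if 0 <= idx < len(a_prev) else 0

    n_out = len(a_prev) // 2
    approx = [sum(g[k] * sample(2 * n - k) for k in range(len(g)))
              for n in range(n_out)]
    detail = [sum(h[k] * sample(2 * n - k) for k in range(len(h)))
              for n in range(n_out)]
    return approx, detail

def dwt(x, g, h, levels):
    """Iterate Eq. (4) for J = levels octaves, starting from a^(0) = x."""
    a = list(x)
    details = []
    for _ in range(levels):
        a, d = dwt_analysis_level(a, g, h)
        details.append(d)
    return a, details

# Illustrative integer filters (unnormalized Haar), not the paper's 8-tap set.
approx1, detail1 = dwt_analysis_level([1, 2, 3, 4], [1, 1], [1, -1])
assert approx1 == [1, 5] and detail1 == [1, 1]
```

Each output sample is a sum of products, which is why the filter bank maps so naturally onto MAC-oriented arithmetic.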
where $a_n^{(i)}$ and $d_n^{(i)}$ are the level-$i$ approximation and detail sequences, respectively,
and $g_k$ and $h_k$ ($k = 0, 1, \ldots, N-1$) correspond to the low-pass and high-pass analysis
filter coefficients. On the other hand, the signal $x_n$ can be perfectly recovered through its
multiresolution decomposition $\{a_n^{(J)}, d_n^{(J)}, d_n^{(J-1)}, \ldots, d_n^{(1)}\}$ by
iteration on:
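$$
a_m^{(i-1)} =
\begin{cases}
\displaystyle \sum_{k=0}^{N/2-1} \tilde{g}_{2k} \, a_{\frac{m}{2}-k}^{(i)} + \sum_{k=0}^{N/2-1} \tilde{h}_{2k} \, d_{\frac{m}{2}-k}^{(i)}, & m \text{ even} \\[2ex]
\displaystyle \sum_{k=0}^{N/2-1} \tilde{g}_{2k+1} \, a_{\frac{m-1}{2}-k}^{(i)} + \sum_{k=0}^{N/2-1} \tilde{h}_{2k+1} \, d_{\frac{m-1}{2}-k}^{(i)}, & m \text{ odd}
\end{cases}
\qquad (5)
$$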
where $\tilde{g}_k$ and $\tilde{h}_k$ represent the low-pass and high-pass synthesis filter coefficients.
In order to ensure perfect recovery of the input signal, the coefficients of the analysis and synthesis
filter banks are related to each other according to the perfect reconstruction condition [18, 19].
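As a hedged illustration of the synthesis recurrence (even output samples use the even-indexed synthesis taps, odd output samples the odd-indexed ones), the following Python sketch rebuilds a short sequence. The input values are those obtained by hand from an unnormalized Haar analysis (low-pass [1, 1], high-pass [1, -1]) of x = [1, 2, 3, 4] with zero samples assumed before the start; the matching synthesis pair [0.5, 0.5] and [-0.5, 0.5] and the function name are assumptions of this example, not filters taken from the paper.

```python
def idwt_synthesis_level(approx, detail, g_syn, h_syn):
    """One octave of reconstruction, following the even/odd split of Eq. (5)."""
    def sample(seq, idx):
        # Assumption of this sketch: samples outside the sequence are zero.
        return seq[idx] if 0 <= idx < len(seq) else 0

    half = len(g_syn) // 2
    out = []
    for m in range(2 * len(approx)):
        base, parity = divmod(m, 2)   # m even: taps 2k; m odd: taps 2k+1
        acc = 0
        for k in range(half):
            acc += g_syn[2 * k + parity] * sample(approx, base - k)
            acc += h_syn[2 * k + parity] * sample(detail, base - k)
        out.append(acc)
    return out

# approx/detail of x = [1, 2, 3, 4] under the assumed analysis pair.
rebuilt = idwt_synthesis_level([1, 5], [1, 1], [0.5, 0.5], [-0.5, 0.5])
assert rebuilt == [0.0, 1.0, 2.0, 3.0]   # x delayed by one sample
```

With causal filters the reconstruction is perfect up to a delay, here of one sample, which is why the output starts with the assumed zero that precedes x.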
4. DWT Solutions Enhanced by the RNS
The design of wavelet filter banks using the RNS presents new opportunities. If the wavelet filter
coefficients are fixed a priori, the LUT-based modulo multiplier represents the most efficient solution for
meeting low-latency and hardware-efficiency requirements [8]. However, if the wavelet filter coefficients
are to be run-time programmable, then the solution may require an unacceptably large number of LUTs to
cover all coefficient instances [13].
The use of index-transformation multipliers [15, 16] and re-timing techniques leads to DWT filter bank
designs requiring a single $2^{n_j} \times n_j$ LUT for each filter coefficient, where
$n_j = \lceil \log_2(m_j) \rceil$ is the modulus wordwidth. Figure 1 shows the design, based on
index transformations, of a modulo $m_j$ channel for an octave-$i$ 8-tap decomposition filter bank. The
input sequence $|a_n^{(i-1)}|_{m_j}$ is decomposed into even and odd sequences that are converted to the
index domain by means of two LUTs storing the $\phi_j$ function. Some circuitry is added at the input to
detect zero values of the input sequences. Notice that clearable registers have been added to force the
filter products to zero when a zero is detected in the even- or odd-indexed sequences. The reason for this
is that multiplication by zero is not defined in the index domain and must be treated as a special case.
After the filter products are computed in the index domain, the LUT storing the function $\phi_j^{-1}$
maps the indices back to the RNS domain, and the remaining filtering or addition stage is carried out by a
modular adder tree. The system exhibits symmetry for the computation of the approximation and detail
sequences. The complete RNS design consists of a number of parallel channels whose combined wordwidth
is sufficient to ensure that the dynamic range requirements are met [18, 19].
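A software model of one such channel, under simplifying assumptions, is sketched below in Python. Each filter product is formed by adding indexes and is skipped (cleared) when a zero operand is detected; the products are then accumulated modulo $m_j$. Running several channels in parallel and applying the CRT recovers the ordinary integer result. The function names, the integer test coefficients and the zero boundary condition are hypothetical choices for this sketch; pipelining, re-timing and the even/odd sequence split of the actual architecture are not modelled.

```python
from math import prod

def primitive_element(p):
    """Smallest generator of the non-zero residues modulo the prime p."""
    return next(g for g in range(2, p)
                if len({pow(g, i, p) for i in range(p - 1)}) == p - 1)

def index_tables(p):
    g = primitive_element(p)
    inverse = [pow(g, i, p) for i in range(p - 1)]   # index -> residue
    forward = {q: i for i, q in enumerate(inverse)}  # residue -> index
    return forward, inverse

def channel_filter(samples, coeffs, p):
    """Decimated FIR output of one modulo-p channel using index arithmetic."""
    forward, inverse = index_tables(p)
    coeff_res = [c % p for c in coeffs]              # negative taps wrap modulo p
    out = []
    for n in range(len(samples) // 2):
        acc = 0
        for k, c in enumerate(coeff_res):
            idx = 2 * n - k
            s = samples[idx] % p if 0 <= idx < len(samples) else 0
            if s == 0 or c == 0:                     # zero detection: product cleared
                continue
            acc = (acc + inverse[(forward[s] + forward[c]) % (p - 1)]) % p
        out.append(acc)
    return out

def rns_filter(samples, coeffs, moduli):
    """Run all channels, then map each output back to a signed integer by CRT."""
    M = prod(moduli)
    channels = [channel_filter(samples, coeffs, p) for p in moduli]
    results = []
    for residues in zip(*channels):
        value = sum(r * (M // p) * pow(M // p, -1, p)
                    for r, p in zip(residues, moduli)) % M
        results.append(value - M if value > M // 2 else value)
    return results

x = [3, 0, 7, 12, 5, 9, 0, 4]          # illustrative input samples
taps = [2, -3, 0, 5]                   # illustrative integer coefficients
y = rns_filter(x, taps, [31, 29, 23, 19, 17])
```

The result `y` matches a direct integer evaluation of the same decimated convolution as long as every output stays within half the dynamic range $M$.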
In a similar manner, an index-based architecture may be derived for the reconstruction (synthesis)
filter bank. The resulting architecture for the 1-D IDWT is shown in Fig. 2. The two input sequences
$|a_n^{(i)}|_{m_j}$ and $|d_n^{(i)}|_{m_j}$ are converted into their index representations by means of
two parallel LUTs storing the $\phi_j$ function. The filter products are computed by parallel, efficient
index-based multipliers, with each filter product requiring a single LUT storing $\phi_j^{-1}$ and a
modulo $(m_j - 1)$ adder. Additional logic and clearable registers are used to detect zero input values and
force the corresponding filter products to zero. Finally, two separate modulo $m_j$ addition stages are
used to compute the output sequence $|a_n^{(i-1)}|_{m_j}$ in even and odd clock cycles, as required by Eq. (5).

Table 1. Total area and maximum sampling rate obtained for an 8-tap DWT filter bank. [x, y, z] denotes x-bit input, y-bit coefficients and z-bit output. Area is given in µm² for the 0.8 µm technology and in number of modules for the 0.35 µm technology.

                                          Two's complement                      RNS
                                          Area               F (MHz)            Area               F (MHz)
Wordwidths and modulus set                0.8 µm    0.35 µm  0.8 µm   0.35 µm   0.8 µm    0.35 µm  0.8 µm   0.35 µm
[8, 10, 21]  {31, 29, 23, 19, 17}         748608    19810    106.38   367.65    820360    29500    209.64   584.80
[10, 10, 23] {61, 59, 53, 47}             855016    21849    105.71   353.36    910688    36436    188.32   515.46
[12, 12, 27] {61, 59, 53, 47, 43}         1026864   25507    86.43    293.26    1138360   45545    188.32   515.46
[14, 12, 29] {61, 59, 53, 47, 43, 41}     1111376   28441    84.89    223.21    1366032   54654    188.32   515.46
5. Results and Discussion
An 8-tap 1-D DWT filter bank was used to illustrate the design of 2C and RNS-based systems. The
comparison was carried out using VHDL models over Altera FLEX10KE field-programmable logic (FPL) devices
and two standard cell ASIC technologies. The selected ASIC reference libraries were the 0.8 µm MSU
SCMOS and the Chip Express 0.35 µm triple-level metal CX3003 CMOS technologies. The 0.8 µm MSU
SCMOS cell library consists of a set of gates implementing low-level logic functions. The Chip Express
0.35 µm CMOS CX3003 technology is based on the definition of a high-level module that can be configured
to implement a very wide range of simple and complex circuit functions and combinations. The logic module
is a universal function composed of three multiplexers and one AND gate. It is based on the fact that a
multiplexer can implement any logic function, which may be either combinatorial or sequential.
Figure 1. Design of an RNS-based 1-D DWT architecture with index-transformation.

Figure 2. Design of an RNS-based 1-D IDWT architecture with index-transformation.

Figure 3. Sampling rate as a function of the output precision for index-based and 2C arithmetic 1-D DWT filter banks implemented by means of CBIC technologies. (Two panels, MSU CMOS Technology and Chip Express CMOS Technology; each plots sampling rate in MHz against output precision for 5-bit RNS, 6-bit RNS, 7-bit RNS and 2C designs.)

Table 1 shows the total area and maximum sampling rate obtained for 8-tap RNS and 2C designs using
0.8 µm and 0.35 µm CBIC technologies. The solution adopted here for the 2C arithmetic DWT architecture
was to use pipelined 2C multipliers based on Booth encoding and Wallace trees [26]. Hardware complexity
and delay rapidly increase as the precision of the input and coefficients increases. These facts are shown
in Table 1 and Fig. 3. Note that performance is considerably higher for an RNS-based solution than for a
2C design. In order to maximize the sample rate gain, small wordwidth channels are desirable. However, only
prime moduli are suitable for use in an index arithmetic system. For a 5-bit modulus set, the only admissible
moduli are {17, 19, 23, 29, 31}, which leads to a 22.7-bit maximum dynamic range. With a 6-bit modulus set, the
dynamic range can be up to 39 bits using the moduli set {37, 41, 43, 47, 53, 59, 61}. The use of a 6-bit modulus
set was found to be attractive for the designs demanding 23-, 27- and 29-bit outputs, while for the design with
a 21-bit output a 5-bit modulus set is more efficient in terms of area and speed. The efficient hardware
implementation of modulo multiplication by means of index transformations reveals 2C and RNS-based systems to
have similar hardware complexities, while an RNS solution takes advantage of higher speed and better
ASIC routability inside each channel. For instance, a DWT filter bank enhanced by RNS arithmetic and
having a 21-, 23-, 27- or 29-bit output is about 97%, 78%, 118% and 122% faster, respectively, than a 2C
design when using the MSU SCMOS 0.8 µm technology. Note that using a six-bit modulus set for RNS wavelet
filter banks with outputs of 23 bits or more means that the overall throughput improvement does not
increase steadily with the wordlength (the gain drops from 97% at 21 bits to 78% at 23 bits). On the other
hand, filters with 27- and 29-bit outputs are about twice as fast as the equivalent 2C designs.
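The dynamic range figures quoted above follow from the product of the admissible prime moduli, and can be checked with a short Python sketch. The helper names are hypothetical; the primality test is a plain trial division, which is adequate for moduli of a few bits.

```python
from math import log2, prod

def primes_with_wordwidth(bits):
    """All primes p with 2^(bits-1) <= p < 2^bits, i.e. exactly `bits` bits wide."""
    lo, hi = 2 ** (bits - 1), 2 ** bits
    return [n for n in range(max(lo, 2), hi)
            if all(n % d for d in range(2, int(n ** 0.5) + 1))]

def dynamic_range_bits(moduli):
    """log2 of the dynamic range M, the product of the moduli."""
    return log2(prod(moduli))

five_bit = primes_with_wordwidth(5)    # the 5-bit prime moduli
six_bit = primes_with_wordwidth(6)     # the 6-bit prime moduli
range5 = dynamic_range_bits(five_bit)
range6 = dynamic_range_bits(six_bit)
```

A design then only needs to pick enough of these moduli for the product to cover the required output precision, as the modulus sets listed in Table 1 do.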
FPL devices have recently generated interest for use in DSP systems due to their ability to implement
custom solutions while maintaining flexibility through device reprogramming. FPL technologies providing
embedded LUTs and dedicated logic blocks are potential solutions for MAC-intensive RNS-based DSP systems.
Table 2. Total resources required and maximum sampling rate obtained for a 4-tap DWT filter bank on an Altera FLEX10KE device (grade -1). [x, y, z] denotes x-bit input, y-bit coefficients and z-bit output.

                                        Two's complement                    RNS
Wordwidths and modulus set              No. of LEs  No. of EABs  F (MHz)    No. of LEs  No. of EABs (memory bits)  F (MHz)
[8, 9, 19]   {61, 59, 53, 47}           3470        0            39.06      4 × 314     4 × 10 (15360)             135.13
[8, 10, 20]  {61, 59, 53, 47}           3440        0            38.16      4 × 314     4 × 10 (15360)             135.13
[9, 10, 21]  {61, 59, 53, 47}           3647        0            34.24      4 × 314     4 × 10 (15360)             135.13
[10, 10, 22] {61, 59, 53, 47}           4354        0            30.67      4 × 314     4 × 10 (15360)             135.13
[12, 12, 26] {61, 59, 53, 47, 43}       5446        0            27.93      5 × 314     5 × 10 (19200)             135.13
[14, 12, 28] {61, 59, 53, 47, 43}       7972        0            26.95      5 × 314     5 × 10 (19200)             135.13
Modern CPLDs consist of LUTs (frequently called logic elements) and dedicated memory blocks.
Depending on the family, each LE (logic element) includes one or more variable input size LUTs (typically
$2^5 \times 1$ or $2^4 \times 1$), fast carry propagation logic and one or more flip-flops. Specifically,
each LE included in the Altera FLEX10K [2] device consists of a $2^4 \times 1$ LUT, an output register and
dedicated logic for fast carry and cascade chains in arithmetic mode; a number of embedded array blocks
(EABs), each providing a 2K-bit RAM or ROM configurable as $2^8 \times 8$, $2^9 \times 4$, $2^{10} \times 2$
or $2^{11} \times 1$, are the cores for the implementation of RNS LUT-based multipliers. Likewise, LUTs
allow building specialized memory functions such as ROM or RAM. Table 2 shows the total resources required
and the maximum sampling rate obtained for a 4-tap DWT filter bank using a grade -1 Altera FLEX10KE FPL
device, as well as the moduli selected to cover the dynamic range. Hardware requirements were assessed in
terms of the number of LEs and EABs, while performance was evaluated in terms of the maximum
register-to-register path delay. Figure 4 shows the sampling rate as a function of the output precision.
The use of 5- and 6-bit modulus sets was found to be an attractive choice since performance is only limited
by the LUT operation. Thus, the presented RNS-enhanced DWT filter banks, with 19-, 20-, 21- and 22-bit
outputs, are about two times faster than a 2C implementation. This dramatic increase in system performance
is due to the fast implementation of the index multipliers, which takes advantage of the FPL embedded
resources. Thus, LUTs storing the $\phi_j^{-1}$ function were able to operate at 135 MHz (when mapped on
EABs), and 5- and 6-bit modulo adders took advantage of the fast carry propagation paths inside the 8-bit
LEs of a logic array block (LAB). In contrast to a 2C design, the presented RNS-enabled DWT solutions do
not need long propagation paths, nor do they communicate information or carries between LABs, since carry
chains are no longer than 7 bits. This is a consequence of the reduced wordlength of the RNS channels, which
makes it possible to mask the architectural limitations of the FPL device.

Figure 4. Sampling rate as a function of the output precision for index-based and 2C arithmetic 1-D DWT filter banks implemented with FPL devices. (Altera FLEX10KE, grade -1; sampling rate in MHz against output precision for 5,6-bit RNS, 7-bit RNS and 2C designs.)

Table 3. Area required for binary-to-RNS and ε-CRT RNS-to-binary converters. [x, y, z] denotes x-bit input, y-bit coefficients and z-bit output. Entries give CA/NCA, with the number of pipeline stages in parentheses.

                                           0.8 µm MSU, CA/NCA (µm²)                      0.35 µm CX3003, CA/NCA (no. of modules)
Wordwidths and modulus set                 2C-to-RNS          RNS-to-2C, 16-bit output   2C-to-RNS      RNS-to-2C, 16-bit output
[8, 10, 21]  {31, 29, 23, 19, 17}          8696/4200 (2)      24252/15845 (4)            180/85 (2)     502/361 (4)
[10, 10, 23] {61, 59, 53, 47}              14893/6045 (2)     43378/23457 (4)            314/119 (2)    910/534 (4)
[12, 12, 27] {61, 59, 53, 47, 43}          19545/7582 (2)     54878/29345 (4)            412/152 (2)    1137/668 (4)
[14, 12, 29] {61, 59, 53, 47, 43, 41}      23587/9280 (2)     64897/34920 (4)            490/186 (2)    1364/795 (4)

CA: Combinational area.
NCA: Non-combinational area.
6. Binary-to-RNS and RNS-to-Binary
Converters
A historical barrier to the use of the RNS at the system level has been the overhead penalty associated
with binary-to-RNS and RNS-to-binary conversion. Binary-to-RNS conversion can be carried out
efficiently by decomposing the $B$-bit 2C word, say $x$, into a weighted sum of smaller words $x_i$ (e.g.,
4-bit words). Equation (6) exemplifies the case of a 4-bit decomposition, namely:

$$
|x|_{m_j} = \left| -2^{B-1} x_{B-1} + \sum_{l=0}^{B-2} 2^l x_l \right|_{m_j}
          = \left| -2^{B-1} x_{B-1} + \sum_{i=0}^{p-1} x_i \, 2^{4i} \right|_{m_j}
\qquad (6)
$$

where $x_l$ in the first form are the individual bits and $x_i$ in the second form are the $p$ 4-bit words
covering the $B-1$ magnitude bits. This requires only $2^4 \times n_j$ LUTs and a modulo addition stage.
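The decomposition of Eq. (6) can be modelled in a few lines of Python. Each 4-bit slice of the magnitude bits addresses a 16-entry table holding its weighted residue, the sign bit contributes the residue of $-2^{B-1}$, and the partial results are summed modulo $m_j$. The function name and the choice to rebuild the tables on every call are simplifications of this sketch; in hardware the tables would be fixed LUT contents.

```python
def twos_complement_to_residue(word, bits, modulus):
    """Residue of a `bits`-wide two's complement bit pattern, as in Eq. (6)."""
    sign = (word >> (bits - 1)) & 1
    magnitude = word & ((1 << (bits - 1)) - 1)       # the B-1 low-order bits
    n_slices = (bits - 1 + 3) // 4                   # number p of 4-bit words

    # One 16-entry table per slice: tables[i][v] = |v * 2^(4i)| mod modulus.
    tables = [[(v << (4 * i)) % modulus for v in range(16)]
              for i in range(n_slices)]

    acc = (-(sign << (bits - 1))) % modulus          # sign-bit weight -2^(B-1)
    for i in range(n_slices):
        nibble = (magnitude >> (4 * i)) & 0xF
        acc = (acc + tables[i][nibble]) % modulus    # modulo addition stage
    return acc

# A 14-bit input word holding the value -1234, converted for modulus 61.
pattern = -1234 & 0x3FFF
residue = twos_complement_to_residue(pattern, 14, 61)
assert residue == (-1234) % 61
```

Because every table has only 16 entries, the converter cost grows with the number of slices rather than with $2^B$.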
RNS-to-binary conversion implies the use of a CRT (Chinese Remainder Theorem)-based converter.
However, CRT conversion can often be a barrier in certain applications. The auto-scaling RNS-to-binary
converter (ε-CRT) proposed by Griffin et al. [27] can overcome these drawbacks by using a few LUTs and
binary (modulo $2^n$) adders. For a scaled $n$-bit binary output and an $n_j$-bit modulus set, this
converter needs one $2^{n_j} \times n$ LUT for each modulus of the RNS and an $n$-bit adder tree. This
solution is better suited to most applications demanding high data rates [28]. Implementation data for
the 2C-to-RNS and RNS-to-2C converters, using cell-based integrated circuits, are provided in Table 3 for
5- and 6-bit modulus sets. The design of the 2C-to-RNS converter was derived from Eq. (6), while the ε-CRT
algorithm with a 16-bit output was used for the RNS-to-2C converter. The operating frequency of both
converters was matched to the system performance by inserting a number of pipeline stages (shown in
Table 3), so the high throughput of the presented index-based RNS architectures for forward and inverse
wavelet transforms was not degraded when the converters were inserted in the system.
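The structure described above (one LUT per modulus followed by an ordinary modulo $2^n$ adder tree) can be illustrated with a scaled-CRT sketch in Python. This is written in the spirit of the auto-scaling converter and is not claimed to reproduce the algorithm of [27]: the rounding rule, the function names and the restriction to unsigned values are assumptions of this example. Each table entry approximates $2^n \, |X_i M_i^{-1}|_{m_i} / m_i$, so the wrapped sum approximates $2^n X / M$.

```python
from math import prod

def scaled_crt_tables(moduli, n):
    """One table per modulus; entry r approximates 2^n * |r * Mi^-1|_m / m."""
    M = prod(moduli)
    tables = []
    for m in moduli:
        Mi = M // m
        inv = pow(Mi, -1, m)                         # modular inverse of Mi
        tables.append([round((2 ** n) * ((r * inv) % m) / m) % (2 ** n)
                       for r in range(m)])
    return tables

def scaled_crt(residues, tables, n):
    """Sum the looked-up words with an ordinary modulo-2^n adder."""
    return sum(t[r] for t, r in zip(tables, residues)) % (2 ** n)

MODULI_6BIT = [61, 59, 53, 47]
N_OUT = 16                                           # 16-bit scaled output
TABLES = scaled_crt_tables(MODULI_6BIT, N_OUT)

X = 1234567                                          # illustrative value below M
approx = scaled_crt([X % m for m in MODULI_6BIT], TABLES, N_OUT)
exact = X * 2 ** N_OUT / prod(MODULI_6BIT)           # ideal scaled result
```

Since each table entry is rounded by at most half a unit, `approx` differs from `exact` by at most half a unit per modulus, which is the price paid for avoiding a modulo-$M$ adder.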
7. Conclusion
This paper reports on the design and implementation, using FPL devices and CBIC technologies, of forward
and inverse wavelet filter banks by means of the RNS. The architecture is based on index transformation over
Galois fields and requires a single LUT for each filter coefficient multiplication. Efficient circuitry is
used to detect a zero value in the input sequence, a requirement of the design paradigm. The RNS design was
compared to a 2C architecture of comparable size. The reported methodology demonstrated a performance
improvement over a 2C design.
Acknowledgments
J. Ramírez, A. García and A. Lloris were supported by the Comisión Interministerial de Ciencia y
Tecnología (Spain) under project PB98-1354. CAD tools and supporting material were provided by Altera Corp.,
San Jose, CA, under the Altera University Program, and Synopsys Inc., Mountain View, CA, under the
Synopsys University Program. We would like to thank the anonymous reviewers for their valuable comments
and suggestions, which helped to improve the material presented in this paper.
References
1. M.A. Soderstrand, W.K. Jenkins, G.A. Jullien, and F.J. Taylor, Residue Number System Arithmetic: Modern Applications in Digital Signal Processing. New York: IEEE Press, 1986.
2. Altera Corporation, FLEX10K Embedded Programmable Logic Device Family, ver. 4.1, 2001.
3. Xilinx Inc., The Programmable Logic Data Book, 1999.
4. U. Meyer-Baese, J. Buros, W. Trautmann, and F. Taylor, "Fast Implementation of Orthogonal Wavelet Filterbanks Using Field-Programmable Logic," in Proc. of the 1999 IEEE International Conference on Acoustics, Speech and Signal Processing, 1999, vol. 4, pp. 2119–2122.
5. A. García, U. Meyer-Bäse, A. Lloris, and F. Taylor, "RNS Implementation of FIR Filters based on Distributed Arithmetic Using Field-Programmable Logic," in Proc. of the 1999 IEEE International Symposium on Circuits and Systems, 1999, vol. 1, pp. 486–489.
6. J. Ramírez, A. García, U. Meyer-Baese, F. Taylor, and A. Lloris, "Implementation of RNS-Based Distributed Arithmetic Discrete Wavelet Transform Architectures Using Field-Programmable Logic," Journal of VLSI Signal Processing (Special Issue on Computer Arithmetic and Applications), 2003, vol. 33, pp. 171–190.
7. J. Ramírez, A. García, U. Meyer-Bäse, F. Taylor, P.G. Fernández, and A. Lloris, "Design of RNS-Based Distributed Arithmetic DWT Filterbanks," in Proc. of the 2001 International Conference on Acoustics, Speech and Signal Processing ICASSP 2001, May 2001, vol. 2, pp. 1193–1196.
8. J. Ramírez, A. García, P.G. Fernández, L. Parrilla, and A. Lloris, "RNS-FPL Merged Architectures for the Orthogonal DWT," Electronics Letters, vol. 36, no. 14, 2000, pp. 1198–1199.
9. V. Hamann and M. Sprachmann, "Fast Residual Arithmetic with FPGAs," in Proc. of the Workshop on Design Methodologies for Microelectronics, Slovakia, Sept. 1995.
10. E. Di Claudio, F. Piazza, and G. Orlandi, "Fast Combinational RNS Processors for DSP Applications," IEEE Transactions on Computers, May 1995, pp. 624–633.
11. H. Sari, H. Ahamadi, G. Jullien, and V. Dimitrov, "Design and FPGA Implementation of Systolic FIR Filters Using the Fermat ALU," Proc. of the Asilomar Conference on Signals, Systems and Computers, Pacific Grove, 1996.
12. U. Meyer-Bäse, A. García, and F. Taylor, "Implementation of a Communications Channelizer Using FPGAs and RNS Arithmetic," Journal of VLSI Signal Processing, May 2001, vol. 28, no. 1/2, pp. 115–128.
13. J. Ramírez, P.G. Fernández, U. Meyer-Bäse, F. Taylor, A. García, and A. Lloris, "Index-based RNS DWT Architectures for Custom IC Designs," in Proc. of the IEEE Workshop on Signal Processing Systems SiPS 2001, Oct. 2001, pp. 70–79.
14. F. Taylor, "Large Moduli Multipliers for Signal Processing," IEEE Transactions on Circuits and Systems, vol. CAS-28, no. 7, 1981, pp. 731–736.
15. G.A. Jullien, "Implementation of Multiplication, Modulo a Prime Number, with Applications to Number Theoretic Transforms," IEEE Trans. on Computers, vol. C-29, no. 10, 1980, pp. 899–905.
16. D. Radhakrishnan and Y. Yuan, "Fast and Highly Compact RNS Multipliers," International Journal of Electronics, vol. 70, no. 2, 1991, pp. 281–293.
17. A.A. Hiasat, "New Efficient Structure for a Modular Multiplier for RNS," IEEE Transactions on Computers, vol. 49, no. 2, 2000, pp. 170–174.
18. M. Vetterli and J. Kovacevic, Wavelets and Subband Coding, Prentice Hall, 1995.
19. G. Strang and T. Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, 1997.
20. K.K. Parhi and T. Nishitani, "VLSI Architectures for Discrete Wavelet Transforms," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 1, June 1993, pp. 191–202.
21. J. Fridman and E.S. Manolakos, "Distributed Memory and Control VLSI Architectures for the 1-D Discrete Wavelet Transform," VLSI Signal Processing, vol. VII, 1994, pp. 388–397.
22. C. Chakrabarti and M. Vishwanath, "Efficient Realizations of the Discrete and Continuous Wavelet Transform: From Single Chip Implementations to Mappings on SIMD Array Computers," IEEE Transactions on Signal Processing, vol. 43, March 1995, pp. 759–771.
23. M. Vishwanath, R.M. Owens, and M.J. Irwin, "VLSI Architectures for the Discrete Wavelet Transform," IEEE Transactions on Circuits and Systems II, vol. 42, no. 5, May 1995, pp. 305–316.
24. T.C. Denk and K.K. Parhi, "VLSI Architectures for Lattice Structure Based Orthogonal Discrete Wavelet Transforms," IEEE Transactions on Circuits and Systems II, vol. 44, no. 2, Feb. 1997, pp. 129–132.
25. F. Marino, "A Double-Face Bit-Serial Architecture for the 1-D Discrete Wavelet Transform," IEEE Transactions on Circuits and Systems II, vol. 47, no. 1, Jan. 2000, pp. 65–71.
26. J. Pihl and E.J. Aas, "A Multiplier and Squared Generator for High Performance DSP Applications," in Proc. of the 39th Midwest Symposium on Circuits and Systems, 1996.
27. M. Griffin, F.J. Taylor, and M. Sousa, "New scaling algorithms for the Chinese Remainder Theorem," in Proc. of the 22nd Asilomar Conf. on Signals, Syst. and Comp., CA, 1988.
28. J. Ramírez, A. García, P.G. Fernández, L. Parrilla, and A. Lloris, "A New Architecture to Compute the Discrete Cosine Transform using the Quadratic Residue Number System," in Proc. of the 2000 International Symposium on Circuits and Systems, vol. 5, May 2000, pp. 321–324.
Javier Ramírez received the M.A.Sc. degree in Electronic Engineering
in 1998 and the Ph.D. degree in Electronic Engineering in 2001,
both from the University of Granada. Since 2001, he has been an
Assistant Professor at the Department of Electronics and Computer
Technology of the University of Granada (Spain). His research
interests include residue number system arithmetic, high performance
digital signal processing, and FPGA and VLSI signal processing
systems. He is the author of more than 50 technical journal and
conference papers in these areas. He has served as a reviewer for
several international journals and conferences and is a member of IEEE.
jramirez@ieee.org
Uwe Meyer-Baese received his BSEE, MSEE, and Ph.D. Summa
cum Laude from the Darmstadt University of Technology in 1987,
1989, and 1995, respectively. In 1994 and 1995 he held a postdoctoral
position at the Institute of Brain Research in Magdeburg. In 1996 and
1997 he was a Visiting Professor at the University of Florida. From
1998 to 2000 Dr. Meyer-Baese worked in the ASIC industry. He
is now a Professor in the Electrical and Computer Engineering
Department at Florida State University. During his graduate studies he
worked part time for TEMIC, Siemens, Bosch, and Blaupunkt. He
holds 3 patents, has supervised more than 60 master's thesis projects
in the DSP/FPGA area, and gave four lectures at the University of
Darmstadt in the DSP/FPGA area. He is the author of three books,
including Digital Signal Processing with Field Programmable Gate
Arrays and Fast Digital Signal Processing, published by
Springer-Verlag. In 1997 he received the Max-Kade Award in
Neuroengineering. Dr. Meyer-Baese is an IEEE, BME, SP and C&S
society member.
Uwe.Meyer-Baese@ieee.org
Fred J. Taylor received his Ph.D. from the University of Colorado
in 1969. Since then he has held professional positions at Texas In-
struments and the University of Texas at El Paso, Cincinnati, and
Florida where he is currently a Professor of Electrical and Com-
puter Engineering and Computer and Information Science, along
with being president of the Athena Group, Inc. He has authored
over 100 archived papers, nine books, contributed chapters to four
monographs and encyclopedias, and holds four U.S. patents. His
professional interests include digital design and architecture, digital
signal processing, and engineering education.
fjt@hsdal.ufl.edu
Antonio García received the M.A.Sc. degree in Electronic Engineering
(being awarded the Nation Best Academic Record) in 1995,
the M.Sc. degree in Physics (majoring in Electronics) in 1997 and
the Ph.D. degree in Electronic Engineering in 1999, all from the
University of Granada (Spain). He was an Associate Professor at the
Department of Computer Engineering of the Universidad Autónoma
de Madrid before joining the Department of Electronics and Computer
Technology at the University of Granada as an Associate Professor.
His research interests include Residue Number System arithmetic,
the application of RNS to high-performance digital signal processing,
VLSI and FPL implementation of RNS-based systems and the
use of RNS for low-power VLSI systems. He has authored over
50 technical papers in international journals and conferences and
has served as a reviewer for several international journals and
conferences. He is a member of IEEE and a C, C&S and SP Society
member.
agarcia@ieee.org
Antonio Lloris received the M.Sc. degree and the Ph.D. degree
from the Universidad Complutense (Madrid). He was at the Centro
de Investigaciones Técnicas de Guipúzcoa (Spain) as a researcher
and, as a lecturer, at the Escuela Técnica Superior de Ingenieros
Industriales de San Sebastian. He was at the Universities of Malaga
and Murcia (Spain). He is now a Full Professor at the University of
Granada (Spain). His research interests include multiple-valued logic,
testing of digital circuits and signal processing using the residue
number system.
lloris@ditec.ugr.es
