A 36-bit Balanced Moduli MAC Architecture
A. P. Preethy and D. Radhakrishnan

Division of Computer Engineering, School of Applied Science
Nanyang Technological University, Nanyang Ave.
Singapore 639798
email: asdrkrishnan@ntu.edu.sg

Abstract - Recently a renewed interest has been seen in the Residue Number System (RNS), which stems from the fact that these systems are inherently parallel and modular and thus fast and simple. In many DSP applications the Multiply-Accumulate (MAC) operation turns out to be the most basic one, and hence an RNS-based 36-bit MAC architecture is presented in this paper to speed up the whole operation. A further speed-up is achieved by exploiting the logarithmic properties of Galois fields and integer rings. The choice of forward and reverse converters used in the design results in considerable savings in silicon real estate. The adder cells used are based on a pass-transistor design, which contributes to very low power consumption.

I. INTRODUCTION

The Residue Number System (RNS) eliminates the speed-draining carry propagation in arithmetic computations, which makes it attractive in computation-intensive applications. The parallel arithmetic nature of RNS offers a potential speed-up in FIR filtering, discrete Fourier transforms, matrix multiplications, and similar computations [1]. A Multiply-Accumulate (MAC) unit is an integral part of an arithmetic processor used in such applications. In a MAC unit, multiplication speed is of paramount importance, and it becomes necessary to search for new strategies for increasing the speed. One such strategy is the use of index calculus for multiplication. Even though it speeds up multiplication by converting it to addition, its domain was limited to the set of prime moduli. When a large range such as 32 or 36 bits is targeted with a set of prime moduli, the moduli chosen will eventually result in non-uniform word lengths. This necessitates exploring additional moduli with similar properties that provide extra flexibility with a uniform word length. It has been shown that index calculus techniques can be extended to non-prime moduli that are powers of 2 [2,3]. This was further extended to powers of odd primes in [4,5]. Thus, by combining both prime and power-of-prime moduli, a very fast and less complex multiplier can be implemented.

Evaluation of the summation of the products obtained is the next phase of the computation. This is achieved by performing addition in the residue domain using modulo adders. Since the proposed MAC is a multiplierless unit, the main logic units are modulo adders, and the performance of the design depends mainly on the choice of modulo adder. Hence a modulo adder that outperforms its counterparts is chosen [6]. Another design issue, equally important, is the type of binary adder cell employed in the modulo adder. To make the MAC efficient, a new low-power CMOS adder cell is used, which plays a pivotal role in scaling down the power requirements of the entire MAC unit [7].

The MAC operation is performed entirely in the residue domain. The operands therefore have to be transformed into the residue domain first (forward conversion), and after completion of the MAC operation the sum residues have to be converted back to binary form (reverse conversion). To effect these conversions, a forward and a reverse converter are also incorporated in the proposed design. With these, the design represents a complete MAC unit with both inputs and outputs in standard binary form.

II. RNS OVERVIEW

In RNS, an integer $X$ is uniquely represented by an $r$-tuple of integers $(x_1, x_2, \ldots, x_r)$, which is called the residue representation of $X$. The integers $x_i$, $i = 1, 2, \ldots, r$, are called the residues and are obtained as remainders when the number $X$ is divided by a set of distinct relatively prime integers $m_i$, $i = 1, 2, \ldots, r$, which are called the moduli of the residue number system. Thus, $x_i = X \bmod m_i$, denoted by $|X|_{m_i}$, where $0 \le x_i < m_i$. It follows from the Chinese Remainder Theorem (CRT) that, for any given $r$-tuple satisfying the above relationships, there exists one and only one integer $X$ such that $0 \le X < M$, where $M = \prod_{i=1}^{r} m_i$ defines the range of the number system. The number $X$ can be evaluated from the $r$-tuple $(x_1, x_2, \ldots, x_r)$ using (1):

$$X = \left| \sum_{i=1}^{r} \hat{M}_i \left| \hat{M}_i^{-1} x_i \right|_{m_i} \right|_M \qquad (1)$$

where $\hat{M}_i = M / m_i$ and $\hat{M}_i^{-1}$ is the multiplicative inverse of $\hat{M}_i$ modulo $m_i$.

Arithmetic operations on two operands $X$ and $Y$ are defined as $Z = X \circ Y$, where $X = (x_1, x_2, \ldots, x_r)$, $Y = (y_1, y_2, \ldots, y_r)$, $Z = (z_1, z_2, \ldots, z_r)$, and $z_i = |x_i \circ y_i|_{m_i}$ for $i = 1, \ldots, r$. The symbol $\circ$ denotes any of the operations of addition, subtraction, or multiplication. From this definition it is seen that these operations are performed in parallel in each of the residue channels, independent of one another. This inherent parallelism and the carry-free arithmetic between different residue channels provide the speed-up during arithmetic processing.
0-7803-5491-5/99/$10.00 © 1999 IEEE
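
As a small worked illustration of the residue representation, channel-wise multiplication, and CRT reconstruction, consider the moduli set $\{3, 5, 7\}$. This set and the operands 23 and 4 are assumptions chosen only for readability; they are not the moduli used in this design. Here $\hat{M}_i = M / m_i$ and $|\hat{M}_i^{-1}|_{m_i}$ is its multiplicative inverse modulo $m_i$.

```latex
\begin{align*}
M &= 3 \cdot 5 \cdot 7 = 105 \\
X = 23 &\mapsto (|23|_3,\ |23|_5,\ |23|_7) = (2, 3, 2) \\
Y = 4 &\mapsto (|4|_3,\ |4|_5,\ |4|_7) = (1, 4, 4) \\
Z = X \cdot Y &\mapsto (|2 \cdot 1|_3,\ |3 \cdot 4|_5,\ |2 \cdot 4|_7) = (2, 2, 1) \\
\hat{M}_1 &= 35,\quad \hat{M}_2 = 21,\quad \hat{M}_3 = 15 \\
|\hat{M}_1^{-1}|_3 &= 2,\quad |\hat{M}_2^{-1}|_5 = 1,\quad |\hat{M}_3^{-1}|_7 = 1 \\
Z &= \left| 35 \cdot |2 \cdot 2|_3 + 21 \cdot |1 \cdot 2|_5 + 15 \cdot |1 \cdot 1|_7 \right|_{105}
   = |35 + 42 + 15|_{105} = 92
\end{align*}
```

The reconstructed value 92 equals $23 \cdot 4$, and each channel was computed without reference to the others.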

III. MULTIPLICATION USING INDEX CALCULUS TECHNIQUES

The relatively prime moduli that give a maximum dynamic range of 36 bits using 5-bit words take one of three forms: $p$, $2^m$, and $p^m$, where $p$ is prime and $m$ is an integer. It follows from number theory that the groups formed by $p$, $2^m$, and $p^m$ integer elements fall into the category of the Galois field GF($p$) and the integer rings $Z/(2^m)$ and $Z/(p^m)$. When the modulus is a prime number $p$, the normal index mapping in GF($p$) is used and multiplication is done by index addition. This can be sped up in cases where $p-1$ is factorable into relatively prime submoduli, these submoduli being chosen with an equal number of bits wherever possible [10]. The only primes where $p-1$ cannot be factored into relatively prime integers are those of the form $2^m + 1$, where $m$ is an integer.

A procedure for finding an index set for the elements of $Z/(2^m)$ is given in [2,3]. Using this, any integer $X \in [1, 2^m - 1]$ can be coded using a triplet index code $\langle \alpha, \beta, \gamma \rangle$ with the relationship $X = \left| 2^{\alpha} 5^{\beta} (-1)^{\gamma} \right|_{2^m}$, where $\alpha \in \{0, 1, \ldots, m-1\}$, $\beta \in \{0, 1, \ldots, 2^{m-2}-1\}$, and $\gamma \in \{0, 1\}$. Multiplication of two integers can now be carried out as follows. Let $X_1, X_2 \in Z/(2^m)$, $X_1 \ne 0$, $X_2 \ne 0$, with

$$X_1 = \left| 2^{\alpha_1} 5^{\beta_1} (-1)^{\gamma_1} \right|_{2^m}, \qquad X_2 = \left| 2^{\alpha_2} 5^{\beta_2} (-1)^{\gamma_2} \right|_{2^m};$$

then the product is

$$|X_1 X_2|_{2^m} = \left| 2^{\alpha_1 + \alpha_2} 5^{\beta_1 + \beta_2} (-1)^{\gamma_1 + \gamma_2} \right|_{2^m}.$$

The indices are added subject to the following constraints: $\beta_1$ and $\beta_2$ are added mod $2^{m-2}$, $\gamma_1$ and $\gamma_2$ are added mod 2, and $\alpha_1$ and $\alpha_2$ are added in normal binary mode. When $\alpha_1 + \alpha_2$ equals $m-1$ the corresponding $\beta$ and $\gamma$ are made zero, and when it exceeds $m-1$ the final result is made zero. Thus, by storing the index and inverse index tables, multiplication of the nonzero elements can be replaced by index addition.

Similarly, in the case of $Z/(p^m)$, where $p$ is odd, an index pair coding $\langle \alpha, \beta \rangle$ is used, where $X$ is given by $X = (g^{\alpha} p^{\beta}) \bmod p^m$ [4,5]. The product of two numbers $(X_1 X_2) \bmod p^m$ can now be calculated as follows. Let $X_1 = (g^{\alpha_1} p^{\beta_1}) \bmod p^m$ and $X_2 = (g^{\alpha_2} p^{\beta_2}) \bmod p^m$; then their product is given by

$$|X_1 X_2|_{p^m} = (g^{\alpha_1} p^{\beta_1} g^{\alpha_2} p^{\beta_2}) \bmod p^m = (g^{\alpha_1 + \alpha_2} p^{\beta_1 + \beta_2}) \bmod p^m.$$

The indices are added subject to the following constraints: $\alpha_1$ and $\alpha_2$ are added mod $\phi(p^m)$, and $\beta_1$ and $\beta_2$ are added in normal binary mode. When $\beta_1 + \beta_2$ exceeds $m-1$, the final result is made zero. It may be noted that more than one index pair $\langle \alpha, \beta \rangle$ generated during index addition may correspond to the same residue value.

From the above it can be seen that prime and power-of-prime moduli emerge as ideal candidates for the design of index-transform-based multipliers. For the design of a 36-bit multiplier we decided to go with moduli of up to 5 bits in length. The prime moduli are 17, 19, 23, 29, and 31, and the power-of-prime moduli are 25, 27, and 32. All these moduli together give an overall range of 36 bits. The above multiplication is valid for all the nonzero elements of the corresponding field or ring. When the residue is zero, the index cannot be defined; hence extra logic is incorporated in the design as proposed in [8]. By performing multiplication exclusively in the logarithmic domain, considerable savings in ROM requirements are achieved. The formulae for the computation of the ROM requirements for the three types of moduli are shown in Table I. In this table, ROM 1 stores the index table and ROM 2 stores the inverse index table, where $s_i$ refers to the submoduli of $p-1$ and $s_1$ is the smallest of these. The $(s_1 + 1)$ term used in the prime modulus formula accounts for the extra code added to handle the zero operands.

TABLE I. MULTIPLIER ROM REQUIREMENTS (rows, by modulus type: prime $p = 2^m + 1$; prime $p \ne 2^m + 1$; $2^m$; $p^m$)

IV. MAC ARCHITECTURE

The basic operation involved in the MAC is the summation of a number of products, given by $\sum_{i=1}^{N} X_i Y_i$. Consider that the operands $X_i$ and $Y_i$ are represented in residue form using a moduli set $\{m_1, m_2, \ldots, m_r\}$ as $X_i = \{x_{i1}, x_{i2}, \ldots, x_{ir}\}$ and $Y_i = \{y_{i1}, y_{i2}, \ldots, y_{ir}\}$. In residue notation, the above summation thus becomes $r$ independent summations,

$$\sum_{i=1}^{N} x_{i1} y_{i1},\ \sum_{i=1}^{N} x_{i2} y_{i2},\ \ldots,\ \sum_{i=1}^{N} x_{ir} y_{ir}.$$

Hence a MAC unit is implemented by using $r$ separate channels, each working independently and in parallel. As mentioned earlier, the moduli set is chosen so as to implement multiplication by transforming it to logarithmic addition. For our 36-bit processor, we selected the 5-bit balanced moduli set {17, 19, 23, 25, 27, 29, 31, 32}, which enables exclusive logarithmic addition. A block schematic of a MAC unit for one of the residue channels is shown in Fig. 1. The two residues $x_{ij}$ and $y_{ij}$ of channel $j$ address two index ROMs that are used to find the logarithms corresponding to these operands. The outputs of the index ROMs are added using a modulo $m_j$ adder.
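
The index-ROM lookup, index addition, and inverse lookup of a single prime channel can be sketched in software as follows. This is a minimal model under stated assumptions, not the authors' circuit: the function names are hypothetical, the generator is found by brute force, the index ROM is a single table (the submoduli factoring of $p-1$ is not modeled), and zero operands are handled by a plain branch rather than by the extra code described for Table I.

```python
def primitive_root(p):
    """Smallest generator of the multiplicative group of GF(p), by brute force."""
    for g in range(2, p):
        if len({pow(g, k, p) for k in range(p - 1)}) == p - 1:
            return g
    raise ValueError("no primitive root found")


def build_roms(p):
    """Return (index_rom, inverse_rom) for a prime modulus p.

    inverse_rom[k] = g**k mod p   (index -> residue)
    index_rom[x]   = k            (nonzero residue -> index)
    """
    g = primitive_root(p)
    inverse_rom = [pow(g, k, p) for k in range(p - 1)]
    index_rom = {value: k for k, value in enumerate(inverse_rom)}
    return index_rom, inverse_rom


def mac_channel(xs, ys, p):
    """Accumulate sum(x * y) mod p using only table lookups and modular additions."""
    index_rom, inverse_rom = build_roms(p)
    acc = 0  # accumulator register, initialized to zero
    for x, y in zip(xs, ys):
        x %= p
        y %= p
        if x == 0 or y == 0:
            product = 0  # zero has no index (assumption: handled by a branch here)
        else:
            # Multiplication becomes addition of indices modulo p - 1.
            product = inverse_rom[(index_rom[x] + index_rom[y]) % (p - 1)]
        acc = (acc + product) % p  # accumulate stage: modulo adder plus register
    return acc
```

For example, `mac_channel([3, 0, 18], [5, 9, 18], 19)` returns the same value as `(3*5 + 0*9 + 18*18) % 19`, without performing any multiplication inside the loop.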
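
For the mod 32 channel, the triplet index code $\langle \alpha, \beta, \gamma \rangle$ of Section III, with $X = |2^{\alpha} 5^{\beta} (-1)^{\gamma}|_{2^m}$, can be checked exhaustively with a short script. The sketch below assumes $m = 5$ and builds the index table by enumeration; the names `INDEX`, `decode`, and `multiply` are illustrative only, and the table construction is not the procedure of [2,3].

```python
M_BITS = 5
MODULUS = 1 << M_BITS          # 2**m = 32
BETA_MOD = 1 << (M_BITS - 2)   # beta indices are added mod 2**(m-2) = 8


def decode(alpha, beta, gamma):
    """Inverse index mapping: triplet -> residue modulo 2**m."""
    return (2**alpha * 5**beta * (-1)**gamma) % MODULUS


# Index table: nonzero residue -> one triplet that encodes it.
# Several triplets can decode to the same even residue; the first one found is kept.
INDEX = {}
for alpha in range(M_BITS):
    for beta in range(BETA_MOD):
        for gamma in range(2):
            INDEX.setdefault(decode(alpha, beta, gamma), (alpha, beta, gamma))


def multiply(x1, x2):
    """Multiply two residues modulo 2**m by adding their index triplets."""
    if x1 == 0 or x2 == 0:
        return 0                      # zero has no index
    a1, b1, g1 = INDEX[x1]
    a2, b2, g2 = INDEX[x2]
    alpha = a1 + a2                   # added in normal binary mode
    if alpha > M_BITS - 1:
        return 0                      # 2**alpha is a multiple of 2**m
    beta = (b1 + b2) % BETA_MOD       # added mod 2**(m-2)
    gamma = (g1 + g2) % 2             # added mod 2
    if alpha == M_BITS - 1:
        beta, gamma = 0, 0            # 2**(m-1) times any odd number is 2**(m-1)
    return decode(alpha, beta, gamma)
```

Running `multiply` over all pairs of 5-bit residues and comparing against `(x1 * x2) % 32` confirms the three addition rules, including the two special cases for $\alpha_1 + \alpha_2 \ge m - 1$.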
Fig. 1. A single channel of the MAC unit (blocks: residues $x_{ij}$ and $y_{ij}$, two index ROMs, modulo adder, inverse index ROM, modulo adder, register)

The adder output addresses an inverse index ROM to generate the residue product. In the non-prime modulo channels, the modulo adder comprises two or three submodules, depending on whether $p$ is odd or even respectively. These perform the individual index additions of $\alpha$, $\beta$, and $\gamma$. The accumulate stage consists of a modulo adder and a register (accumulator) that is initialized to zero.

The modulo adders shown in Fig. 1 employ the standard biased-addition scheme given in [6]. An offset value of $2^n - m_i$ is added to a $2^n$ bit adder to make a modulo $m_i$ adder, when $m_i$ is less than $2^n - 1$. A modification is also made based on the fact that the offset value to be added is known a priori, so the logic requirements of the second adder in a cascade of two can be brought down to half. Thus the primitive cells of the basic mod $m_i$ adder are constructed from a full adder (FA) followed by a zero-primitive (one-primitive) for a zero-offset (one-offset). To bring down the power and area requirements, the transmission gate adder used in the above adder is replaced with a new low-power counterpart given in [7]. This 14-transistor pass logic adder implementation, which outperforms all other counterparts, is shown in Fig. 2.

Fig. 2. A Fully Restored CMOS Full Adder Cell ($H = A \oplus B$, $S = H \oplus C_{in}$)

V. FORWARD AND REVERSE CONVERTERS

To generate the residues in the forward conversion stage, the logic shown in Fig. 3 is used. The $k$ bits of Register 1, which stores the input binary operand, are partitioned into $l$ partitions, each $n$ bits wide. When $k \ne nl$, the width of the last (leftmost) partition will be $k \bmod n$ bits. Each partition group of Register 1 addresses a ROM that is programmed to produce its residue with respect to $m_j$. These residues are generated simultaneously and are added using modulo adders to get the final residue. A total of $r-1$ similar stages are needed to generate the residues $x_1, x_2, \ldots, x_{r-1}$. A modulus of the form $2^n$ (32) does not require extra logic, and for moduli of the form $2^n - 1$ and $2^n + 1$ (31 and 17) the ROMs otherwise needed can be eliminated. For mod 31, the most significant partition is a single bit, and hence simple logic circuitry is used instead of a ROM to generate its equivalent residue.

Fig. 3. Residue Generation Logic for $x_j$

The reverse converter design chosen here is based on the CRT and is shown in Fig. 4 [9]. The conversion is performed by evaluating the quotient alone, so that it always uses less hardware. The only requirement imposed is that one of the moduli is of the form $2^n$. By storing the contents of the ROM with respect to a reduced modulus instead of $M$, the ROM widths have been reduced by 5 bits. This amounts to an overall reduction of $2^n \times n \times r$ (1280) bits of ROM for the entire design. The reduction in total hardware requirements due to this is about 14% compared to earlier designs. There is also a 14% delay reduction, assuming the use of carry propagate adders.
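
The partitioned residue generation used in the forward converter can be modeled as one lookup table per partition followed by modulo additions. The sketch below assumes $k = 36$ and $n = 5$, as in this design; the function names are hypothetical, and every partition uses a table here, whereas the text notes that moduli such as 32, 31, and 17 allow the tables to be simplified or removed.

```python
def make_partition_roms(m_j, k=36, n=5):
    """Build one table per partition: rom[t][v] = (v * 2**(n*t)) mod m_j.

    The k-bit operand is split into n-bit partitions starting from the least
    significant bit; the last (leftmost) partition holds the remaining bits.
    """
    roms = []
    num_partitions = -(-k // n)  # ceiling of k / n
    for t in range(num_partitions):
        width = min(n, k - n * t)
        roms.append([(v << (n * t)) % m_j for v in range(1 << width)])
    return roms


def forward_convert(x, m_j, k=36, n=5):
    """Residue of a k-bit operand x modulo m_j from table lookups and modulo adds."""
    residue = 0
    for t, rom in enumerate(make_partition_roms(m_j, k, n)):
        # Table sizes are powers of two, so len(rom) - 1 is the partition bit mask.
        part = (x >> (n * t)) & (len(rom) - 1)
        residue = (residue + rom[part]) % m_j
    return residue
```

With these parameters there are eight partitions and the most significant one is a single bit, so its table has only two entries. For mod 31, $2^5 \equiv 1$, so every table entry reduces to the partition value itself modulo 31, which is consistent with the remark that the ROMs can be eliminated for that modulus.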
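
The biased-addition idea behind the modulo adders of Section IV can also be illustrated in a few lines. This is a sketch of the general principle only (add the constant offset $2^n - m_i$ and use the carry out to select the result); it is an assumption that this matches the scheme of [6] at the functional level, and the halved second adder and the zero/one primitives are not modeled. The name `biased_mod_add` is hypothetical.

```python
def biased_mod_add(a, b, m, n=5):
    """Return (a + b) mod m for residues 0 <= a, b < m <= 2**n.

    Two sums are formed: the plain sum a + b, and the biased sum
    a + b + (2**n - m). A carry out of bit n-1 in the biased sum
    means a + b >= m, in which case the low n bits of the biased
    sum are exactly a + b - m.
    """
    mask = (1 << n) - 1
    offset = (1 << n) - m      # constant, known in advance for each modulus
    plain = a + b
    biased = plain + offset
    if biased >> n:            # carry out: the sum reached or exceeded m
        return biased & mask
    return plain
```

For the modulus 32 the offset is zero, so the function degenerates to an ordinary 5-bit addition with the carry discarded.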
Fig. 4. Residue-to-Binary Converter

Finally, a block schematic of a 36-bit MAC unit complete with the forward and reverse converters is shown in Fig. 5.
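
The complete data path (binary to residue, per-channel multiply-accumulate, residue to binary) can be modeled end to end as below. This is a behavioural sketch under stated assumptions: the per-channel product uses ordinary modular multiplication as a stand-in for the index ROMs, the reverse conversion uses the textbook CRT sum rather than the quotient-based converter of Fig. 4, and all function names are hypothetical.

```python
from math import prod

MODULI = (17, 19, 23, 25, 27, 29, 31, 32)  # the 5-bit balanced moduli set
M = prod(MODULI)                           # dynamic range of the unit


def forward(x):
    """Binary-to-residue conversion: one residue per channel."""
    return tuple(x % m for m in MODULI)


def mac(pairs):
    """Multiply-accumulate over (x, y) pairs, independently in each channel."""
    acc = [0] * len(MODULI)                # accumulators start at zero
    for x, y in pairs:
        xr, yr = forward(x), forward(y)
        for j, m in enumerate(MODULI):
            # Stand-in for index lookup, index addition, and inverse lookup.
            acc[j] = (acc[j] + xr[j] * yr[j]) % m
    return tuple(acc)


def reverse(residues):
    """Residue-to-binary conversion by the CRT sum; result lies in [0, M)."""
    total = 0
    for x_j, m_j in zip(residues, MODULI):
        m_hat = M // m_j
        inv = pow(m_hat, -1, m_j)          # inverse of m_hat modulo m_j
        total += m_hat * ((inv * x_j) % m_j)
    return total % M
```

As long as the true sum of products stays below `M`, `reverse(mac(pairs))` equals `sum(x * y for x, y in pairs)`; larger sums wrap around modulo `M`, so operand sizes and the number of accumulated terms have to be chosen with the dynamic range in mind.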
VI. CONCLUSIONS
A high-speed 36-bit MAC unit is presented in this paper. It uses eight 5-bit moduli that are primes or powers of primes, by virtue of which multiplication is done exclusively by index calculus techniques. This makes the multiplier faster than conventional ones, thereby enhancing the performance of the entire system. The chosen forward and reverse converters make the design more space-efficient, and the CMOS pass logic adder cells used contribute to a reduction in power consumption. The total hardware requirement is less than 3 Kbytes of ROM, sixty-four 5-bit modulo adders, and a few registers and other logic elements.
Fig. 5. A 36-bit 8-moduli MAC unit (blocks: binary-to-residue converter, MAC, residue-to-binary converter)
REFERENCES
[1] M. A. Soderstrand, W. K. Jenkins, G. A. Jullien, and F. J. Taylor, Residue Number System Arithmetic: Modern Applications in Digital Signal Processing. New York: IEEE Press, 1986.
[2] I. M. Vinogradov, Elements of Number Theory. New York: Dover Publications, 1954.
[3] G. C. Cardarilli, R. Lojacono, G. Martinelli, and M. Salerno, "Structurally passive digital filters in residue number systems," IEEE Trans. Circuits Syst., vol. 35, pp. 149-158, 1988.
[4] D. Radhakrishnan, "Modulo multipliers using polynomial rings," IEE Proc. Circuits Devices Syst., vol. 145, no. 6, pp. 443-445, Dec. 1998.
[5] D. Radhakrishnan and A. P. Preethy, "A novel 36-bit single fault tolerant multiplier using 5-bit moduli," IEEE TENCON 98, vol. I, pp. 128-130, New Delhi, India, Dec. 1998.
[6] J. C. Smith and F. J. Taylor, "A fault-tolerant GEQRNS processing element for linear systolic array DSP applications," IEEE Trans. Comput., vol. 44, no. 9, pp. 1121-1130, Sep. 1995.
[7] D. Radhakrishnan, "Formal design procedures for low-power CMOS full adder cells," IEEE Trans. Circuits Syst. II, unpublished.
[8] M. Dugdale, "Residue multipliers using factored decomposition," IEEE Trans. Circuits Syst. II, vol. 41, no. 9, pp. 623-627, Sept. 1994.
[9] J. Mathew, D. Radhakrishnan, and T. Srikanthan, "Using the $2^n$ property to implement an efficient general purpose residue-to-binary converter," Intl. Symp. Signals, Circuits Syst., Iasi, Romania, July 1999.
[10] D. Radhakrishnan and Y. Yuan, "Novel approaches to the design of VLSI RNS multipliers," IEEE Trans. Circuits Syst., vol. 39, no. 1, pp. 52-57, Jan. 1992.