Low Power and High Speed Multiplication Design Through Mixed Number Representations

Menghui Zheng and Alexander Albicki Department of Electrical Engineering University of Rochester Rochester, NY 14627, USA
Proceedings of the International Conference on Computer Design: VLSI in Computers & Processors (ICCD '95)
1063-6404/95 $4.00 © 1995 IEEE

This work was supported in part by the National Science Foundation under award No. MIP-9300936.

Abstract

A low power multiplication algorithm and its VLSI architecture using a mixed number representation are proposed. The reduced switching activity and low power dissipation are achieved through the Sign-Magnitude (SM) notation for the multiplicand and through a novel design of the Redundant Binary (RB) adder and Booth decoder. The high speed operation is achieved through the Carry-Propagation-Free (CPF) accumulation of the Partial Products (PP) by using the RB notation. Analysis showed that the switching activity in the PP generation process can be reduced on average by 90%. Compared to the same type of multipliers [1, 2, 3], the proposed design dissipates much less power and is 18% faster on average.

1: Introduction

We shall show that by the use of the SM notation for the multiplicand, the use of the Two's Complement (2's C) representation for the multiplier, and the use of the RB representation for the PP accumulation, the Expected Switching Activity (ESA), and therefore the power dissipation, can be significantly reduced. The ESA reduction occurs any time the negation of the multiplicand is needed in order to generate the PPs with the radix-4 Booth's algorithm. High speed operation is sustained through the RB notation for accumulating the PPs, since a CPF addition can be executed with RB numbers. The inputs and outputs of the multiplication unit are assumed to be in 2's C notation. Although we only consider integer multiplications and the radix-4 Booth's algorithm here, the proposed techniques can be easily extended to floating point multiplications and higher radix Booth's algorithms. It is interesting to point out that although the proposed algorithm and its VLSI architecture are complex in terms of the number conversions, the design is more energy efficient and has an operating speed close to the Wallace tree architecture [4] and faster than the multipliers proposed in [1, 2, 3].

The paper is organized as follows. The ESA in a multiplication unit is addressed in Section 2. Then, in Section 3, a novel method to reduce the ESA and increase the operation speed is presented. The corresponding multiplication algorithm and the VLSI architecture are discussed in Section 4. Finally, some concluding remarks are drawn in Section 5.

2: The ESA in 2's C representation

2's C numbers and the radix-4 Booth's algorithm are predominantly used for multiplier design, since the arithmetic operations can be easily carried out with 2's C numbers and the Booth's algorithm can largely reduce the number of PPs. But the Booth's algorithm often requires the negation of the multiplicand, and the negation of a 2's C number requires many bits to be switched, which results in high switching activity. Without loss of generality, we use the radix-4 Booth's algorithm to demonstrate the probability that a negation of the multiplicand has to be generated and how many bits on average have to be switched. This gives us the ESA during the PP generation.

As shown in Table I, the radix-4 Booth's algorithm requires -Y and -2Y, where Y is the multiplicand. For the 2's C representation, $-Y = \bar{Y} + 1$, and, to generate -Y given Y, all the bits of Y have to be switched and then a '1' has to be added to get the correct 2's C result. The same operations are needed to generate -2Y, except that a left shift is needed before the bit complementation takes place. The negation process is highly energy consuming, as it requires the charging and discharging of all the nodes associated with the PP. Indeed, let an n-bit multiplier be $X = x_{n-1}x_{n-2} \ldots x_1x_0$. The radix-4 Booth's algorithm takes a triplet $x_{2k+1}x_{2k}x_{2k-1}$ as input and generates a PP

$P_k$ according to Table I, where k = 0, 1, ..., and $x_{-1} \equiv 0$. So, it scans 3 bits for one PP with one bit of overlap between two adjacent triplets. If n is odd, then the largest index 2k+1 = n. Therefore, an extra bit $x_n = x_{n-1}$ (sign extension) must be appended to the left of $x_{n-1}$ to make the triplet $x_n x_{n-1} x_{n-2}$. If n is even, then the largest index 2k+1 = n-1. Therefore, the multiplier X can be exactly grouped into n/2 triplets and no sign extension is needed. For parallel multiplication, all triplets can be scanned at the same time.

Table I: Radix-4 Booth's algorithm

x_{2k+1} x_{2k} x_{2k-1}    PP P_k
0 0 0                       +0Y (string of 0s)
0 0 1                       +Y
0 1 0                       +Y
0 1 1                       +2Y
1 0 0                       -2Y
1 0 1                       -Y
1 1 0                       -Y
1 1 1                       0Y

From Table I, when the radix-4 Booth's algorithm catches the multiplier patterns '110', '101' and '100', it has to generate -Y or -2Y. These patterns, referred to hereafter as the NEG (negation) patterns, are directly related to the ESA in the Booth PP generator. The average probability of a NEG pattern occurring in any given triplet $x_{2k+1}x_{2k}x_{2k-1}$ of the multiplier can be analyzed as follows. Assume an n-bit 2's C number $X = x_{n-1}x_{n-2} \ldots x_1x_0$, where each bit of the multiplier is '1' with probability 0.5.

Case 1: n is even. n/2 triplets are needed to cover all the bits of the multiplier and the sign extension is not needed. For $x_1x_0$, since the Booth's algorithm assumes bit $x_{-1}$ to be always zero, there are only four choices for the triplet $x_1x_0x_{-1}$: 000, 010, 100 and 110. Two of them are NEGs. Hence, the probability of a NEG appearing in the $x_1$, $x_0$ and $x_{-1}$ positions is 1/2. For the remaining (n-2) bits, each triplet $x_{2k+1}x_{2k}x_{2k-1}$ has 8 possible patterns and 3 of them are NEGs. So, the probability of a NEG appearing in the remaining (n-2) bits is 3/8. Therefore, the average probability of a NEG appearing in a triplet $x_{2k+1}x_{2k}x_{2k-1}$ is

$$N_{p\text{-even}} = \frac{1}{2} \times \frac{2}{n} + \frac{3}{8} \times \frac{n-2}{n} = \frac{3}{8} + \frac{1}{4n} \qquad (2)$$

Case 2: n is odd. The number must be sign extended and (n+1)/2 triplets are needed to cover all the bits of the multiplier. Based on the sign extension rule, the triplet $x_n x_{n-1} x_{n-2}$ has four possible patterns: 000, 001, 110 and 111. Among them there is just one NEG. So, the probability of a NEG occurring in the triplet $x_n x_{n-1} x_{n-2}$ is 1/4. For $x_1x_0$, as in the case when n is even, the probability of a NEG occurring in the triplet $x_1x_0x_{-1}$ is 1/2. For the remaining (n-3) bits, the probability of a NEG occurring in a triplet $x_{2k+1}x_{2k}x_{2k-1}$ is 3/8. Therefore, the average probability of a NEG appearing in a triplet $x_{2k+1}x_{2k}x_{2k-1}$ is

$$N_{p\text{-odd}} = \frac{1}{4} \times \frac{2}{n+1} + \frac{1}{2} \times \frac{2}{n+1} + \frac{3}{8} \times \frac{n-3}{n+1} = \frac{3}{8} \qquad (3)$$

Combining cases 1 and 2, the average probability of a NEG appearing in a triplet $x_{2k+1}x_{2k}x_{2k-1}$ is

$$N_p = \begin{cases} \dfrac{3}{8} + \dfrac{1}{4n} & \text{if } n \text{ is even} \\ \dfrac{3}{8} & \text{if } n \text{ is odd} \end{cases} \qquad (4)$$

Since, for 2's C numbers, $-Y = \bar{Y} + 1$ and the generation of $\bar{Y}$ requires the complementation of every bit of Y, the ESA in the PP generation process¹ is

$$ESA_{2's\,C} = \frac{N_p \times n}{n} = N_p \qquad (5)$$

Table II: Average ESA in Booth partial product generation

Operand length    4-bit     8-bit     16-bit    32-bit    64-bit
ESA               0.4375    0.4063    0.3906    0.3828    0.3789

Table II lists the ESA values for some typical operand lengths. On average, the ESA in the partial product generation process is about 0.40, which results in a large power dissipation.
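The counting argument of Section 2 can be checked by brute force. The following is a minimal sketch, not part of the original paper: it enumerates every n-bit multiplier, forms the Booth triplets (with $x_{-1} = 0$ and sign extension for odd n), counts the NEG patterns 100, 101 and 110, and compares the result with the closed form $N_p$. The function names are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

# Triplets (x_{2k+1}, x_{2k}, x_{2k-1}) that require -2Y or -Y (the NEG patterns).
NEG_PATTERNS = {(1, 0, 0), (1, 0, 1), (1, 1, 0)}


def booth_triplets(x, n):
    """Return the radix-4 Booth triplets of the n-bit 2's C pattern x, lowest first."""
    bits = [(x >> i) & 1 for i in range(n)]      # bits[i] = x_i
    if n % 2 == 1:
        bits.append(bits[-1])                    # sign extension: x_n = x_{n-1}
    padded = [0] + bits                          # padded[0] = x_{-1} = 0
    return [(padded[2 * k + 2], padded[2 * k + 1], padded[2 * k])
            for k in range(len(bits) // 2)]


def neg_probability_bruteforce(n):
    """Average NEG probability per triplet over all 2^n equally likely multipliers."""
    neg = total = 0
    for x in range(1 << n):
        for triplet in booth_triplets(x, n):
            total += 1
            neg += triplet in NEG_PATTERNS
    return Fraction(neg, total)


def neg_probability_closed_form(n):
    """N_p: 3/8 + 1/(4n) for even n, 3/8 for odd n."""
    if n % 2 == 0:
        return Fraction(3, 8) + Fraction(1, 4 * n)
    return Fraction(3, 8)


if __name__ == "__main__":
    for n in (4, 5, 8, 9):
        print(n, neg_probability_bruteforce(n), neg_probability_closed_form(n))
```

Exhaustive enumeration is only practical for short operands; for the longer lengths of Table II the closed form is evaluated directly.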

¹ In this paper we only consider the ESA associated with the complementation process. The ESA associated with the adding '1' process is not included here, since in the VLSI implementation the adding '1' process is implemented through the adder tree.

3: Reducing the switching activity

Clearly, the high switching activity in the Booth PP generator is caused by the generation of -Y and -2Y and the fact that the 2's C representation is chosen for the multiplicand Y. The latter holds because the negation of a given 2's C number is equivalent to the complementation of all its bits followed by adding '1'. On the other hand, the negation of an SM number is simple: just complement the sign bit. Hence, if one uses the SM representation instead of 2's C for the multiplicand Y, a significant reduction of the ESA during the Booth PP generation process should be expected. Consequently, we propose the SM representation for the multiplicand Y, yet keep the multiplier X in the 2's C form.

The correctness of the radix-4 Booth's algorithm applied to this mixed number representation can be proved as follows: according to [5], the radix-4 Booth's algorithm gives correct results when applied to 2's C numbers, and the validity of the Booth coding results depends exclusively on the pattern of the multiplier. Since the multiplier is kept in 2's C notation, the radix-4 Booth's algorithm remains valid for our mixed number representation.

Now, let us evaluate the ESA of SM numbers. Since the multiplier is in its 2's C form, the average probability of a NEG pattern appearing in any triplet $x_{2k+1}x_{2k}x_{2k-1}$ of an n-bit multiplier is the same as in (4). Also, the negation of an SM number only complements the sign bit. Therefore, the ESA for SM numbers in the Booth PP generation process is

$$ESA_{SM} = \frac{N_p \times 1}{n} = \begin{cases} \dfrac{3}{8n} + \dfrac{1}{4n^2} & \text{if } n \text{ is even} \\ \dfrac{3}{8n} & \text{if } n \text{ is odd} \end{cases} \qquad (6)$$

A comparison of the ESA for SM and 2's C numbers is reported in Table III. The reduction of the ESA is significant, ranging from 87.5% for 8-bit operands to 98.4% for 64-bit operands. As the operand length increases, the ESA for even-length 2's C numbers decreases towards the asymptotic value of 3/8, and the ESA for odd-length 2's C numbers is a constant 3/8. For SM numbers, the ESA decreases at the rate of O(1/n) and asymptotically reaches zero. Thus, for longer operands the ESA reduction, and therefore the power saving, is more pronounced.

Table III: ESA for 2's C and SM in [...]
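The two ESA expressions can be evaluated side by side. The sketch below is an illustration only: the values are computed from formulas (4), (5) and (6), not recovered from the original Table III, and the function names are hypothetical.

```python
from fractions import Fraction


def neg_probability(n):
    """N_p from equation (4)."""
    if n % 2 == 0:
        return Fraction(3, 8) + Fraction(1, 4 * n)
    return Fraction(3, 8)


def esa_twos_complement(n):
    """Equation (5): a negation complements all n bits of an n-bit multiplicand."""
    return neg_probability(n) * n / n


def esa_sign_magnitude(n):
    """Equation (6): a negation complements only the sign bit."""
    return neg_probability(n) * 1 / n


def esa_reduction(n):
    """Relative ESA reduction obtained by switching the multiplicand to SM."""
    return 1 - esa_sign_magnitude(n) / esa_twos_complement(n)


for n in (8, 16, 32, 64):
    print(f"{n:3d}-bit  2's C: {float(esa_twos_complement(n)):.4f}  "
          f"SM: {float(esa_sign_magnitude(n)):.4f}  "
          f"reduction: {float(esa_reduction(n)):.1%}")
```

Because both expressions share the factor $N_p$, the relative reduction is 1 - 1/n, which gives the 87.5% (8-bit) and 98.4% (64-bit) figures quoted in the text.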

4: The algorithm and architecture

4.1: Conversion from 2's C to SM notation

An SM number can be expressed as

$$Y = (-1)^{y_{n-1}} \sum_{i=0}^{n-2} y_i 2^i$$

and a 2's C number can be expressed as

$$Y = -y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} y_i 2^i .$$

For positive numbers, the 2's C and SM notations are identical, so no conversion is needed. For negative numbers, the conversion from 2's C to SM can be implemented by complementing all the bits except the sign bit $y_{n-1}$ and adding '1' to the final result. If one assumes a uniform distribution of positive and negative numbers, then the probability that the number has to be converted is 0.5. Although the conversion adds some delay, it does not offset the power dissipation gain due to the SM representation for the multiplicand. Indeed, if the multiplicand is in 2's C notation, one has to execute the negation process for about 40% of all the PPs needed, and the number of negation processes increases as the operand length increases, while the conversion from 2's C to SM takes place only once for any operand length.

For the adding '1' operation, instead of using an n-bit adder, which introduces delay and power overhead, we generate a correction term associated with each PP and then add this correction term to all PPs through the binary addition tree, as shown in Figure 3. In this manner, only one more input for the addition tree is added while the whole n-bit addition operation is avoided. The correction term can be generated according to Table IV. The logic for C1 and C2 is trivial: $C1 = y_{n-1} \cdot 1Y$ and $C2 = y_{n-1} \cdot 2Y$. The block diagram in Figure 1 indicates that the 2's C-to-SM conversion adds only one inverter delay, or about 0.5 gate-delay², which comes from the complementation operation of the 2's C number. The correction term does not introduce extra power overhead compared to the traditional 2's C implementation, since the traditional 2's C implementation also needs a similar correction term generator (adding '1') to generate the negation of the multiplicand.

Table IV: Correction terms for 2's C to SM conversion

Figure 1: 2's C-to-SM conversion

² We will refer to one gate-delay as one 2-input NAND gate delay.
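The conversion rule of Section 4.1 can be modelled in a few lines. This is a behavioural sketch only, under two assumptions that are not from the paper: the '+1' is applied immediately instead of being deferred to a correction term in the addition tree, and the magnitude is returned as an unbounded integer so that the most negative value, whose magnitude $2^{n-1}$ does not fit in n-1 bits, is still handled. The function names are hypothetical.

```python
def twos_complement_to_sm(word, n):
    """Convert an n-bit 2's C bit pattern into a (sign, magnitude) pair."""
    sign = (word >> (n - 1)) & 1
    low_mask = (1 << (n - 1)) - 1          # selects bits y_{n-2} .. y_0
    low = word & low_mask
    if sign == 0:
        return 0, low                      # positive: both notations coincide
    # negative: complement every bit except the sign bit, then add 1
    return 1, ((~low) & low_mask) + 1


def sm_value(sign, magnitude):
    """Integer value of a sign-magnitude pair."""
    return -magnitude if sign else magnitude


def negate_sm(sign, magnitude):
    """Negation in SM notation touches the sign bit only."""
    return sign ^ 1, magnitude


if __name__ == "__main__":
    n = 8
    for value in (15, -15, -98, 127, -128):
        sm = twos_complement_to_sm(value & ((1 << n) - 1), n)
        print(value, sm, sm_value(*negate_sm(*sm)))
```

The `negate_sm` helper shows why the SM multiplicand is cheap for the Booth decoder: generating -Y or -2Y changes one bit regardless of the operand length.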

4.2: Speeding up the PP accumulation


We have substantially reduced the ESA in the PP generation, but SM numbers are hard to manipulate in arithmetic operations, since the signs of the operands have to be identified separately through a sequence of decisions, costing extra control logic, execution time and power dissipation. On the other hand, RB numbers, represented in the form $R = \sum_{i=0}^{n-1} r_i 2^i$ with digits $r_i \in \{\bar{1}, 0, 1\}$ (1̄ denotes the digit -1), are more suitable for high speed parallel arithmetic computations [1, 6]. Due to the redundancy in RB numbers, one can perform CPF addition by selecting among the different digit strings that represent the same value. Hence, we further convert the PPs into the RB representation. We adopt the selection rule proposed by Takagi in [1] to perform CPF addition for the PP accumulation. The rule is shown in Table VI. Let us give an example. The CPF addition of X = 1̄0101̄110 = -98 and Y = 001̄1̄1111 = 15 is shown in Figure 2. One can see that the carry is limited to adjacent digits and there is no global carry propagation.

Augend                  1̄ 0 1 0 1̄ 1 1 0   = -98
Addend              +   0 0 1 1̄ 1̄ 1 1 1   =  15
Intermediate sum        1̄ 0 0 1 0 0 0 1̄
Intermediate carry  + 0 0 1 1̄ 1̄ 1 1 1
Sum                   0 1̄ 1 1̄ 0 1 1 1 1̄   = -83

Figure 2: CPF RB addition.
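The example of Figure 2 can be reproduced in software. The sketch below assumes the commonly cited form of Takagi's selection rule (the body of Table VI is not reproduced in this text): the intermediate sum and carry at position i are chosen from the digits at position i and from whether both digits at position i-1 are non-negative. Digits are written as -1, 0, 1, most significant first; all names are hypothetical.

```python
def rb_value(digits):
    """Integer value of an RB number given MSB-first digits in {-1, 0, 1}."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def cpf_add(x, y):
    """Carry-propagation-free addition of two equal-length RB numbers.

    Returns (sum, intermediate_sum, intermediate_carry), all MSB-first.
    """
    n = len(x)
    xs, ys = x[::-1], y[::-1]                    # work LSB-first internally
    inter_sum = [0] * n
    carry = [0] * (n + 1)                        # carry[i] enters position i
    for i in range(n):
        # Are both digits one position lower non-negative?
        lower_nonneg = i == 0 or (xs[i - 1] >= 0 and ys[i - 1] >= 0)
        total = xs[i] + ys[i]
        if total == 2:
            c, s = 1, 0
        elif total == 1:
            c, s = (1, -1) if lower_nonneg else (0, 1)
        elif total == 0:
            c, s = 0, 0
        elif total == -1:
            c, s = (0, -1) if lower_nonneg else (-1, 1)
        else:                                    # total == -2
            c, s = -1, 0
        inter_sum[i], carry[i + 1] = s, c
    # Each final digit depends only on positions i, i-1 and i-2.
    result = [inter_sum[i] + carry[i] for i in range(n)] + [carry[n]]
    return result[::-1], inter_sum[::-1], carry[:0:-1]


if __name__ == "__main__":
    X = [-1, 0, 1, 0, -1, 1, 1, 0]               # -98
    Y = [0, 0, 1, -1, -1, 1, 1, 1]               #  15
    total, inter_sum, inter_carry = cpf_add(X, Y)
    print(inter_sum, inter_carry, total, rb_value(total))
```

The choice between the two encodings of +1 and -1 guarantees that the intermediate sum and the incoming carry never have the same sign, so their digit-wise sum always stays within {-1, 0, 1}.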

The SM-to-RB conversion can be carried out as follows: as the RB representation uses the digit set {1̄, 0, 1}, one needs two bits $(r_i^s r_i^m)$ to represent one digit $r_i$. If we use an SM coding to represent an RB digit, that is, $r_i^s$ to represent the sign and $r_i^m$ to represent the magnitude, we can easily convert an SM number into an RB number. For an SM number $X = x_{n-1}x_{n-2} \ldots x_i \ldots x_1x_0$, the sign of the number is decided by the sign bit $x_{n-1}$. Therefore, we can group the sign bit $x_{n-1}$ with each of the remaining bits in a pair-by-pair fashion, $(x_{n-1}x_{n-2}), (x_{n-1}x_{n-3}), \ldots, (x_{n-1}x_1), (x_{n-1}x_0)$, and interpret the pairs according to the SM coding rule shown in Table V. Clearly, we do not need any operations except some wiring.

Table V: Conversion rules for SM to RB

4.3: Converting the RB number into 2's C number

The summation of the PPs is in RB form and has to be converted back into 2's C form. This conversion is carried out easily in the following manner: from Table V, every digit $x_{RB,i} = (r_i^s r_i^m)$ of the RB number $X_{RB}$ is composed of two bits. The left bit $r_i^s$ represents the sign and the right bit $r_i^m$ represents the magnitude. One can easily form a number $X_{RB}^{+}$ from the positive digits of $X_{RB}$ and another number $X_{RB}^{-}$ from the negative digits of $X_{RB}$. Then, subtracting $X_{RB}^{-}$ from $X_{RB}^{+}$, one gets the result in the 2's C form. The process can be implemented using a fast adder. Since a fast adder is essential for all the multiplication algorithms to produce the final result, the RB-to-2's C conversion does not introduce any extra overhead.

4.4: The algorithm and its VLSI architecture

THE ALGORITHM:

Step 1: Convert the multiplicand from 2's C into the SM representation and keep the multiplier in 2's C form.
Step 2: Apply the radix-4 Booth's algorithm to generate all the PPs represented in SM notation.
Step 3: Convert all the partial products from SM into RB representation.
Step 4: Sum up all the PPs through an RB adder tree.
Step 5: Convert the final result from RB into 2's C notation.
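The five steps can be traced with a small behavioural model. This is a sketch under stated assumptions, not the authors' circuit: operands are signed Python integers rather than bit vectors, the RB adder uses the commonly cited form of Takagi's rule, the SM-to-RB digit coding (sign bit paired with each magnitude bit) stands in for the missing body of Table V, and the '+1' of Step 1 is applied directly instead of through correction terms. All function names are hypothetical.

```python
def booth_digits(x, n):
    """Radix-4 Booth digits in {-2,-1,0,1,2} of the n-bit 2's C multiplier x."""
    bits = [(x >> i) & 1 for i in range(n)]
    if n % 2:
        bits.append(bits[-1])                    # sign extension for odd n
    padded = [0] + bits                          # x_{-1} = 0
    return [padded[2 * k] + padded[2 * k + 1] - 2 * padded[2 * k + 2]
            for k in range(len(bits) // 2)]


def sm_to_rb(sign, magnitude, width):
    """Step 3: pair the sign bit with every magnitude bit (LSB-first digits)."""
    return [(-1 if sign else 1) * ((magnitude >> i) & 1) for i in range(width)]


def rb_add(a, b):
    """Carry-propagation-free RB addition, LSB-first, equal-length inputs."""
    n = len(a)
    s, c = [0] * n, [0] * (n + 1)
    for i in range(n):
        nonneg = i == 0 or (a[i - 1] >= 0 and b[i - 1] >= 0)
        t = a[i] + b[i]
        if t == 1:
            c[i + 1], s[i] = (1, -1) if nonneg else (0, 1)
        elif t == -1:
            c[i + 1], s[i] = (0, -1) if nonneg else (-1, 1)
        else:
            c[i + 1] = t // 2                    # t in {-2, 0, 2}, sum digit 0
    return [s[i] + c[i] for i in range(n)] + [c[n]]


def rb_to_int(digits):
    """Step 5: subtract the negative-digit number from the positive-digit one."""
    plus = sum(1 << i for i, d in enumerate(digits) if d == 1)
    minus = sum(1 << i for i, d in enumerate(digits) if d == -1)
    return plus - minus


def multiply(x, y, n):
    """Multiply two n-bit signed integers following Steps 1 to 5."""
    sign_y, mag_y = (1 if y < 0 else 0), abs(y)              # Step 1
    width = 2 * n + 2
    pps = []
    for k, d in enumerate(booth_digits(x, n)):               # Step 2
        pp_sign = sign_y ^ (1 if d < 0 else 0)               # negation: sign bit only
        pp_mag = (abs(d) * mag_y) << (2 * k)
        pps.append(sm_to_rb(pp_sign, pp_mag, width))         # Step 3
    while len(pps) > 1:                                      # Step 4: 2-to-1 tree
        level = [rb_add(pps[i], pps[i + 1]) for i in range(0, len(pps) - 1, 2)]
        if len(pps) % 2:
            level.append(pps[-1] + [0])                      # pad the unpaired PP
        pps = level
    return rb_to_int(pps[0])                                 # Step 5


if __name__ == "__main__":
    print(multiply(-98, 15, 8), multiply(127, -128, 8))
```

In the model, the only place where a negative Booth digit affects a partial product is the single XOR on `pp_sign`, which mirrors the paper's argument that the SM multiplicand removes the bit-wide complementation from PP generation.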

The corresponding VLSI architecture for the algorithm is shown in Figure 3. It is composed of two major parts: the PP generator and the redundant binary addition tree. The key components in this architecture are the RB adder in the addition tree and the Booth decoder in the PP generator. A novel design for the RB adder and Booth decoder based on SM coding has been developed. The new RB adder has a critical path delay of 4 gate-delays, while the previously reported fastest RB adder has a critical path delay of 5 gate-delays [3]. The new Booth decoder needs only 3 transistors and 3 control lines instead of 9 transistors and 5 control lines for the 2's C based Booth decoder [7].

Figure 3: Low power and high speed multiplier architecture.

4.5: Comparisons

Since speed is always at a premium, let us compare our algorithm with some reported fast multiplication algorithms [1, 2, 3] and the Wallace tree architecture, which is commonly thought to be the fastest architecture for multipliers [4]. The comparison in Table VII is made in terms of the equivalent gate-delays along the critical path required by the partial product addition tree for different operand lengths. The extra 0.5 gate-delay overhead introduced by the 2's C-to-SM conversion is included. It is assumed that the delays of the partial product generator and the final fast adder are the same for all the architectures. The delay of a full adder, which is used in the Wallace tree, is assumed to be 3 gate-delays. From Table VII, the gate-delay count of our architecture is close to that of the Wallace tree when the operand length is less than 64 bits. When the operand length exceeds 64 bits, our architecture becomes faster than the Wallace tree architecture. Furthermore, the 2-to-1 binary reduction tree of our architecture implies much simpler layout and routing than the 3-to-2 Wallace tree; this advantage will be more pronounced when the technology goes into deep submicron. As compared with other reported RB binary tree multiplication schemes, ours is on average 18% faster in terms of gate-delays.

Table VII: A comparison of the speed (gate-delays) of different architectures

Operand length (bits)    8    16    32    64    128
[Only fragments of the table body survive. Row labels: Makino, Wallace. Entries, in extraction order: 9; 22, 33, 38.5; 12, 18, 24, 30.]

5: Conclusion remarks

A low power multiplication algorithm and its VLSI architecture are proposed. The reduced switching activity and low power dissipation are achieved through the SM representation for the multiplicand and through a novel design of the RB adder and the Booth decoder for the SM numbers. The high speed operation is achieved through the CPF accumulation of the PPs by using RB numbers. The SM-to-RB conversion is carried out by grouping the sign bit with all other bits, which does not require any operation except some wiring. Analytical study indicates that the ESA in the multiplicand negation process for PP generation can be reduced on average by 90 percent. The design of a low power redundant binary addition tree is under further investigation.

References

[1] N. Takagi, et al., High-Speed VLSI Multiplication Algorithm with a Redundant Binary Addition Tree, IEEE Trans. on Computers, Vol. C-34, No. 9, pp. 789-796, September 1985.
[2] H. Makino, et al., A 8.8-ns 54x54-bit Multiplier Using New Redundant Binary Architecture, Proceedings of 1993 International Conference on Computer Design, Cambridge, MA, USA, pp. 202-205, October 3-6, 1993.
[3] X. Huang, et al., A High-Performance CMOS Redundant Binary Multiplication-and-Accumulation (MAC) Unit, IEEE Trans. on Circuits and Systems-I: Fundamental Theory and Applications, Vol. 41, No. 1, pp. 33-39, January 1994.
[4] C. Wallace, A Suggestion for a Fast Multiplier, IEEE Trans. on Electronic Computers, Vol. EC-13, pp. 14-17, February 1964.
[5] L. P. Rubinfield, A Proof of the Modified Booth's Algorithm for Multiplication, IEEE Trans. on Computers, Vol. C-24, No. 10, pp. 1014-1015, October 1975.
[6] A. Avizienis, Signed-Digit Number Representations for Fast Parallel Arithmetic, IRE Trans. on Electronic Computers, Vol. EC-10, pp. 389-400, September 1961.
[7] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, 2nd Edition, pp. 555, Addison-Wesley Publishing Company, 1993.
