
PARALLEL DECIMAL MULTIPLIERS USING BINARY MULTIPLIERS

Mário P. Véstias
INESC-ID/ISEL/IPL
email: mvestias@deetc.isel.ipl.pt

Horácio C. Neto
INESC-ID/IST/UTL
email: hcn@inesc-id.pt

ABSTRACT

Human-centric applications, such as financial and commercial ones, depend on decimal arithmetic, since their results must match exactly those obtained by human calculations. The IEEE 754-2008 standard for floating-point arithmetic has definitely recognized the importance of decimal computer arithmetic. A number of hardware approaches have already been proposed for decimal arithmetic operations, including addition, subtraction, multiplication and division. However, few efforts have been made to develop decimal IP cores able to take advantage of the binary multipliers available in most reconfigurable computing architectures. In this paper, we analyze the trade-offs involved in the design of a parallel decimal multiplier for decimal operands with 8 and 16 digits, using existing coarse-grained embedded binary arithmetic blocks. The proposed circuits were implemented in a Xilinx Virtex-4 FPGA. The results indicate that the proposed parallel multipliers are very competitive when compared to decimal multipliers implemented with direct manipulation of BCD numbers.

1. INTRODUCTION

Computer arithmetic is predominantly based on binary arithmetic, since the hardware implementations of the binary operations are simpler than those for decimal computation. However, decimal arithmetic is becoming a necessity in many applications, such as financial and commercial ones, where the results must be exact, matching those obtained by human calculations. Since many decimal numbers cannot be represented exactly as binary numbers with a finite number of bits, these types of applications require that the arithmetic operations be performed directly over decimal numbers.
Until very recently, the adopted solution was to implement decimal operations using software algorithms based on binary arithmetic. However, these software solutions are typically three or four orders of magnitude slower than binary arithmetic implemented in hardware [1]. To speed up the execution of decimal arithmetic, a few processors, such as the IBM Power6 [2], already include dedicated hardware for decimal floating-point operations.
Decimal multiplication is a fundamental operation for many decimal algorithms and is also needed to implement decimal division. Decimal multiplication is much more complicated than binary multiplication due to the inherent difficulty of representing decimal numbers in a binary number system. Both bit and digit carries, as well as invalid results, must be considered in order to produce the correct result.
Two main approaches have been considered in the design of a decimal multiplier: iterative and parallel. In the iterative approach [3], the multiplicand is iteratively multiplied by one digit of the multiplier to generate a partial product. The partial products are then added to produce the final decimal result. [4] and [5] present examples of decimal multipliers based on sequential units. In these, a set of multiplicand multiples is generated in a preprocessing step. Then, in the processing step, the multiples are selectively added according to the values of the multiplier digits. Parallel decimal multipliers were recently proposed to improve performance [6], [7], [8]. For example, in [6] the partial products are generated in parallel and then reduced using a decimal carry-save addition tree. Parallel multipliers can be significantly faster than iterative ones, but require more area for their implementation.
All the decimal multipliers referenced above work directly with decimal numbers, that is, they produce decimal partial products using decimal multiplication of a number by a digit and add the partial products using decimal adders. An alternative to this approach is to convert the BCD inputs to binary, perform a binary multiplication, and then convert the result back to BCD. This approach is particularly attractive when binary multipliers are already available in the hardware architecture, such as in coarse-grained reconfigurable hardware architectures or in microprocessors with binary arithmetic units.
A major problem of this solution is that the binary to BCD conversion may be too slow. The most usual solution is a serial implementation of the shift and add-3 algorithm [9], which can be implemented with a small amount of resources, but at the cost of long latencies. A fast parallel implementation of this algorithm is a possible alternative but, for large numbers, it may consume a considerable amount of hardware resources.

Recently, in [10], a new binary to decimal converter has been proposed that is faster and uses about half of the resources of the parallel implementation of the traditional shift and add-3 algorithm. With faster converters, decimal multiplication using binary multipliers becomes competitive with the more common approaches based on direct BCD manipulation.
The approach considered in [10] has been applied only to small multipliers. Also, the decimal to binary converter, while simpler, must still be optimized to improve area and performance. To be effective, this new approach must be applicable to larger operands with 8, 16 or even 34 digits (according to the IEEE 754-2008 standard).
In this paper, we analyze and develop efficient architectures for parallel BCD multiplication using binary multipliers. Section 2 analyzes decimal to binary conversion and proposes a new architecture for it. Section 3 analyzes binary to decimal conversion, considering the algorithm proposed in [10] applied to larger operand sizes. Section 4 discusses the trade-offs involved in the design of decimal multipliers based on the results of the previous sections and proposes parallel decimal multipliers for operands with 8 and 16 digits. Section 5 provides area and performance results with and without embedded multipliers and a comparison with a state-of-the-art parallel multiplier. Section 6 concludes the paper and proposes future directions for decimal multiplication based on binary multipliers.

2. DECIMAL TO BINARY CONVERSION

BCD to binary conversion can be performed by the well-known process of successively dividing the decimal number by two and storing the remainder of each division. Division by two is easily achieved with a shift towards the least significant bit. However, whenever a bit shifts across a digit boundary, the lower digit must be corrected by subtracting three (or, equivalently, by adding five and discarding the most significant bit of the digit).
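As an illustration of this shift-based method, the following C sketch (a behavioral model, not the authors' VHDL; the function name and the packed-BCD input format are our own assumptions) converts an 8-digit packed-BCD value to binary by repeatedly halving it and correcting the digits as described above.

#include <stdint.h>
#include <stdio.h>

/* 8-digit packed BCD (one digit per nibble) to binary, by successive
 * division by two: shift the whole BCD register one bit to the right and
 * subtract three from every digit whose most significant bit became set,
 * i.e., every digit that received a bit from the digit above it.  The bits
 * shifted out form the binary result, least significant bit first.        */
static uint32_t bcd8_to_bin(uint32_t bcd)
{
    uint32_t bin = 0;

    for (int bit = 0; bit < 27; bit++) {        /* 10^8 - 1 fits in 27 bits  */
        bin |= (bcd & 1u) << bit;               /* remainder of the division */
        bcd >>= 1;                              /* divide the BCD value by 2 */
        for (int d = 0; d < 8; d++) {           /* correct each 4-bit digit  */
            uint32_t digit = (bcd >> (4 * d)) & 0xFu;
            if (digit >= 8)                     /* a bit crossed the boundary */
                bcd -= 3u << (4 * d);
        }
    }
    return bin;
}

int main(void)
{
    printf("%u\n", bcd8_to_bin(0x12345678u));   /* prints 12345678 */
    return 0;
}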

An alternative to this method is to directly compute the binary value of the decimal number using binary arithmetic, as defined in equation (1).

D_{n-1} ... D_0 = D_{n-1}·10^{n-1} + ... + D_0·10^0    (1)

Because of the multiplications by powers of ten, a direct implementation of this equation would require multipliers whose size would increase significantly with the size of the numbers to be converted. To overcome this problem, (1) can be rearranged by applying Horner's rule (see equation (2)).

D = (...((D_{n-1}·10 + D_{n-2})·10 + ...)·10 + D_0    (2)
This new equation requires only multiplications by 10, but the operations cannot be executed in parallel. Since a multiplication by 10 is the same as a multiplication by 8 + 2, each multiplication reduces to an addition of two shifted values.
Other rearrangements can be obtained by manipulating equation (2). A few examples are given in equations (3)-(5).

D = (...((D_{n-1}·10 + D_{n-2})·100 + ...)·100 + (D_1·10 + D_0)    (3)

D = (...((D_{n-1}·100 + D_{n-2}·10 + D_{n-3})·10^3 + ...)·10^3 + (D_2·100 + D_1·10 + D_0)    (4)

D = ((D_{n-1}·10 + D_{n-2})·100 + (D_{n-3}·10 + D_{n-4}))·10^{n-4} + ... + ((D_3·10 + D_2)·100 + (D_1·10 + D_0))    (5)
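For illustration, the C sketch below evaluates the Horner arrangement of equation (2) digit by digit, with each multiplication by 10 implemented as the sum of two shifted values (8x + 2x). It is a behavioral model with illustrative names, not the synthesized architecture; the grouped forms in equations (3)-(5) follow the same idea, processing two or three digits at a time.

#include <stdint.h>
#include <stdio.h>

/* Packed-BCD (one digit per nibble, most significant digit in the top
 * nibble) to binary conversion following equation (2):
 *     D = (...((D_{n-1}*10 + D_{n-2})*10 + ...)*10 + D_0.
 * Each multiplication by 10 is realized as (x << 3) + (x << 1) = 8x + 2x. */
static uint32_t bcd_to_bin_horner(uint32_t bcd, int ndigits)
{
    uint32_t acc = 0;

    for (int i = ndigits - 1; i >= 0; i--) {
        uint32_t digit = (bcd >> (4 * i)) & 0xFu;
        acc = (acc << 3) + (acc << 1) + digit;  /* acc = acc*10 + digit */
    }
    return acc;
}

int main(void)
{
    printf("%u\n", bcd_to_bin_horner(0x12345678u, 8));  /* prints 12345678 */
    return 0;
}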

These equations have more available parallelism compared to the original Horner arrangement.
To determine which representation has the lowest-cost implementation, the corresponding architectures were described in VHDL and synthesized for a Virtex-4 FPGA (see the results in figure 1).
The shift and add approach is by far the worst in terms of area. The implementations of equations (4) and (5) are the best alternatives, with area reductions ranging from 50% to almost 75% with respect to the shift and add algorithm. A few other rearrangements of the Horner equation were tested, including other powers of ten, but the results obtained were worse.
The delays associated with the implementations follow a similar relation (see figure 2). The best solution achieves up to almost 50% improvement in delay compared to that of the shift and add solution.

Fig. 1. Size of the decimal to binary converter for different implementations (number of LUTs versus number of digits, for the shift and add approach and for equations (2)-(5)).


Fig. 2. Delay (in ns) of the decimal to binary converter for different implementations (versus number of digits, for the shift and add approach and for equations (2)-(5)).


3. BINARY TO DECIMAL CONVERSION


Fig. 3. Converter of a binary number upto 99999999 to base
1000

Binary to decimal conversion is fundamentally the calculation of equation (6) in decimal.

b_{n-1}·2^{n-1} + b_{n-2}·2^{n-2} + ... + b_0·2^0    (6)

Multiplication by two is achieved with a shift towards the most significant bit. However, since the operations are performed in decimal, whenever a bit shifts across a digit boundary the digit must be corrected by adding three before the shift takes place [9] (or six after the shift). This algorithm is usually known as the shift and add-3 algorithm.
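A behavioral C sketch of this shift and add-3 conversion is shown below (again a software model with illustrative names, not the parallel hardware implementation): before every left shift, three is added to every BCD digit that is five or greater, so that the doubling produces a correct decimal carry.

#include <stdint.h>
#include <stdio.h>

/* Binary (b < 10^8, 27 bits used) to 8-digit packed BCD with the
 * shift and add-3 algorithm of [9].                                      */
static uint32_t bin_to_bcd8(uint32_t bin)
{
    uint32_t bcd = 0;

    for (int bit = 26; bit >= 0; bit--) {
        for (int d = 0; d < 8; d++) {               /* add-3 correction    */
            uint32_t digit = (bcd >> (4 * d)) & 0xFu;
            if (digit >= 5)
                bcd += 3u << (4 * d);
        }
        bcd = (bcd << 1) | ((bin >> bit) & 1u);     /* shift in next bit   */
    }
    return bcd;
}

int main(void)
{
    printf("%08x\n", bin_to_bcd8(12345678u));       /* prints 12345678 */
    return 0;
}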
An alternative approach was proposed in [10]. The idea is to start by converting numbers in base 1024 to numbers in base 1000, i.e., to start with b = b1·2^10 + b0 and obtain b = d1·10^3 + d0 = d.
The algorithm uses a basic module (b2TOb1000) that converts binary values up to 999,999 to base 1000. From base 1000 to decimal, the shift and add algorithm is used. For larger numbers, the algorithm follows the Horner rule to determine the digits of the final decimal number. The converter uses three fundamental modules: b2TOb1000, a module that calculates 24·x + y, and an adder. The number of each of these units utilized in a converter depends on the size of the number and increases more than linearly with the size of the binary number to be converted.
In this paper, we propose an alternative, also very efficient, solution for the converter that considers 2^17 instead of 2^10. In this case, we have to calculate b = b1·2^17 + b0.
Considering a converter for 8-digit numbers,

b = b1·2^17 + b0 = 131·b1·10^3 + (b1·72 + b0),  with b1 ≤ (10^8 − 1)/2^17 (i.e., b1 ≤ 762) and b0 < 2^17    (7)

and

c = b1·72 + b0 ≤ (2^6 + 2^3)·b1 + b0 < 2^18    (8)

The b2TOb1000 module can therefore already be applied to c to determine the least significant base-1000 digit, d0, and part of the following digit, d1:

d = d1·10^3 + d0,  where d1 ≤ 185 < 2^8 (8 bits)    (9)


b is now

b = (131·b1 + d1)·10^3 + d0,  where e = 131·b1 + d1 ≤ 10^5 + 7 < 2^17 (17 bits)    (10)

Since e is representable with 17 bits, the b2TOb1000 unit can once again be used to determine the two most significant base-1000 digits, that is,

e = d2·10^3 + d1    (11)

A hardware implementation of this converter can be designed using a set of adders and the b2TOb1000 unit (see Fig. 3). Instead of two modules computing 24·x + y and two adders, this new approach uses only one module that calculates 72·x + y and one adder.
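To make the data flow of equations (7)-(11) concrete, the following C sketch models the proposed converter for 8-digit numbers. The b2TOb1000 block is emulated here simply by a division by 1000, and all names are illustrative assumptions rather than the actual module interfaces.

#include <stdint.h>
#include <stdio.h>

/* Software stand-in for the b2TOb1000 block: splits a binary value below
 * 10^6 into two base-1000 digits.                                         */
static void b2TOb1000(uint32_t x, uint32_t *hi, uint32_t *lo)
{
    *hi = x / 1000u;
    *lo = x % 1000u;
}

/* Binary b < 10^8 to three base-1000 digits, following equations (7)-(11). */
static void bin_to_base1000(uint32_t b, uint32_t d[3])
{
    uint32_t b1 = b >> 17;             /* b1 <= 762 for b < 10^8           */
    uint32_t b0 = b & 0x1FFFFu;        /* low 17 bits                      */
    uint32_t c  = 72u * b1 + b0;       /* 72*x + y, equations (7)-(8)      */
    uint32_t c_hi;

    b2TOb1000(c, &c_hi, &d[0]);        /* equation (9): d0 and partial d1  */
    uint32_t e = 131u * b1 + c_hi;     /* equation (10): e = 131*b1 + d1   */
    b2TOb1000(e, &d[2], &d[1]);        /* equation (11): two upper digits  */
}

int main(void)
{
    uint32_t d[3];
    bin_to_base1000(99999999u, d);
    printf("%u %u %u\n", d[2], d[1], d[0]);   /* prints 99 999 999 */
    return 0;
}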
Both architectures were evaluated for operands of different sizes and compared to a parallel implementation of the shift and add-3 algorithm (see the results in figure 4).
For certain operand sizes, the new approach achieves an area reduction of almost 50% compared to the shift and add-3 algorithm. Compared to the solution presented in [10], the improvement is almost 10% for numbers with 16 digits.


We also observe that, while the shift and add algorithm increases the resources almost linearly with the number of digits, the other two solutions increase more than linearly. For example, with 16 digits, all solutions are very close to each other in terms of area occupation. This may indicate that, for operands with more than 16 digits, the shift and add-3 approach will become the most efficient, but further experiments are required.
The delays associated with the implementations can be observed in figure 5. Pipelining has not been considered for this evaluation. The fastest solution is the shift and add-3, with differences of more than 50% for the numbers with more digits. To reduce these timing differences, the alternative solutions can be improved by using, for example, carry-save adders.

Fig. 4. Comparison between the different binary to decimal converters in terms of area (number of LUTs versus number of digits, for the shift and add approach, the converter of [10] and the new converter).

Fig. 5. Delay (in ns) of the binary to decimal converter for different implementations (versus number of digits, for the shift and add approach, the converter of [10] and the new converter).

4. PARALLEL DECIMAL MULTIPLIER

To implement decimal multiplication with binary multipliers, we convert the numbers to binary, do the multiplication and convert the result back to decimal. Two design approaches can be considered:
- Use a complete binary multiplier and converters of the size of the operands and of the result;
- Subdivide the operands and consider partial products to be added.

The first approach only uses the converters, besides the multiplier, while the second approach needs extra adders, to add the partial products, and several converters. However, the converters utilized in the first approach are larger and, given the analysis of the previous sections, will utilize much more area than the converters used in the second approach.
To see the difference between the two solutions in terms of area and delay, we have implemented an 8×8 decimal multiplier using both approaches.
In the first case, the architecture uses two 8-digit to 27-bit decimal to binary converters, one 27×27 binary multiplier, and one 54-bit to 16-digit binary to decimal converter (see figure 6).
In the second case, the 8-digit operands are divided into two groups of 4 digits each (see figure 7). In this case, there are four multiplications implemented with binary multipliers, that is, each 4-digit number is converted to binary and then multiplied. The inner partial products are added in binary before being converted to decimal to be added to the other partial decimal products (after binary to decimal conversion).
Table 1 shows the results obtained by implementing both alternatives in a Virtex-4 FPGA, with and without the embedded DSP blocks.
As expected, the larger binary to decimal converters are very expensive in terms of area, and so the second approach, using partial products and smaller converters, is better both in terms of area and performance.
A 16×16 decimal multiplier was designed using the (more efficient) approach with partial decimal products (see figure 8). After converting the sub-groups of digits of the operands to binary and performing the cross multiplications, the aligned partial products are added in binary and then converted to decimal. The three partial products in the figure indicate the operations performed and the number of digits. After this alignment, the three final partial products are added in decimal.
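As a behavioral illustration of the second approach, the C sketch below multiplies two 8-digit BCD operands by splitting each into two 4-digit groups, multiplying the groups with binary multipliers and recombining the weighted partial products. For brevity, the recombination is done with ordinary 64-bit binary arithmetic instead of the binary to decimal converters and decimal adders of the actual design, and all names are illustrative.

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

/* 4-digit packed-BCD group to binary (the result fits in 14 bits). */
static uint32_t bcd4_to_bin(uint32_t bcd)
{
    uint32_t v = 0;
    for (int i = 3; i >= 0; i--)
        v = v * 10u + ((bcd >> (4 * i)) & 0xFu);
    return v;
}

/* 8x8-digit decimal multiplication with partial products:
 *   A = A1*10^4 + A0,  B = B1*10^4 + B0
 *   A*B = A1*B1*10^8 + (A1*B0 + A0*B1)*10^4 + A0*B0
 * Each Ai*Bj is a small binary product (one embedded multiplier), and the
 * two inner products are added in binary before conversion, as in figure 7. */
static uint64_t dec_mul_8x8(uint32_t a_bcd, uint32_t b_bcd)
{
    uint32_t a1 = bcd4_to_bin(a_bcd >> 16), a0 = bcd4_to_bin(a_bcd & 0xFFFFu);
    uint32_t b1 = bcd4_to_bin(b_bcd >> 16), b0 = bcd4_to_bin(b_bcd & 0xFFFFu);

    uint64_t hi  = (uint64_t)a1 * b1;                       /* weight 10^8 */
    uint64_t mid = (uint64_t)a1 * b0 + (uint64_t)a0 * b1;   /* weight 10^4 */
    uint64_t lo  = (uint64_t)a0 * b0;                       /* weight 10^0 */

    return hi * 100000000ull + mid * 10000ull + lo;
}

int main(void)
{
    /* 12345678 * 87654321 = 1082152022374638 */
    printf("%" PRIu64 "\n", dec_mul_8x8(0x12345678u, 0x87654321u));
    return 0;
}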
Fig. 6. Decimal multiplier without decimal partial products: the 8-digit BCD operands A7-0 and B7-0 (32 bits each) are converted to 27-bit binary by two BCDtoBIN blocks, multiplied in a 27×27 binary multiplier, and the 54-bit product is converted by a BINtoBCD block into the 16-digit decimal result D15-0 = A7-0 × B7-0.
Fig. 7. Decimal multiplier with decimal partial products: each 8-digit operand is split into two 4-digit groups, the groups are converted to binary and multiplied, and the partial products are converted back and added to form the 16-digit result.

Fig. 8. 16×16 decimal multiplier with decimal partial products: the operands are split into 4-digit groups A3-A0 and B3-B0, the cross products (such as A0×B0, with 8 digits, and A0×B1 + A1×B0, with 9 digits) are computed and aligned, and the three resulting partial products are added to form the 32-digit result.

Table 1. Decimal multiplier

Solution               LUTs   DSPs   Delay
W/o partial products   3127   0      73 ns
W/o partial products   2023   4      76 ns
W/ partial products    2061   0      45 ns
W/ partial products    1176   4      47 ns

5. RESULTS

We have implemented two decimal multipliers, with sizes 8×8 and 16×16, and compared them with the parallel decimal multiplier from [7]. The decimal multipliers and their sub-units were specified in VHDL, synthesized with Xilinx ISE 10.1 and implemented in a Virtex-4 SX35-12 FPGA (see the results in tables 2 and 3).
Our 8×8 decimal multiplier utilizes about 20% fewer resources in FPGA technology than the solution proposed in [7]. However, the delay is about 30% higher.


pect that the performance of our solution can be improved
to values near those of the reference.
When using the DSP blocks, the number of utilized LUT
drops almost 50%, making our solution much more efficient
than that from [7] for FPGAs with embedded binary multipliers.
Compared to a binary multiplier, our decimal multiplier
uses about 2.5 times the number of LUTs (a binary multiplier of size 27 27 consumes 842 LUTs).
For the 16 16 decimal multiplier, our solution utilizes
about 25% less resources than the solution proposed in [7].
However, once again the delay is higher by about 17% (the
difference is smaller than that of the 8 8 decimal multiplier).
When using the DSP blocks, the number of utili-zed
LUTs drops almost 75% and, for FPGAs with embedded
binary multipliers, our solution is again much more efficient
than that from [7].
Once again, compared to a binary multiplier, our decimal multiplier uses about 2 times the number of LUTs (a

Table 2. Decimal multiplier 8×8

Solution          LUTs   DSPs   Delay
[7]               2609   0      34 ns
Our with DSP      1176   4      47 ns
Our without DSP   2061   0      45 ns
Table 3. Decimal multiplier 16×16

Solution          LUTs   DSPs   Delay
[7]               8729   0      54 ns
Our with DSP      3005   16     68 ns
Our without DSP   6493   0      65 ns


6. CONCLUSION

We have implemented an 8×8 and a 16×16 decimal multiplier using binary multiplications. The results show that this approach is better than those based on direct manipulation of decimal operands when implemented in a Virtex-4 FPGA.
An important advantage of the approach proposed herein is that it can effectively use the embedded binary multipliers available in current FPGAs and in other coarse-grained reconfigurable architectures.
For future work, we plan to analyze the effect of other subdivisions of the initial operands on the performance and on the consumed area.
It would also be important to test the designs with other technologies besides FPGAs, namely coarse-grained reconfigurable architectures with binary arithmetic units of different complexities.

7. ACKNOWLEDGMENT

This work was partially supported by the Portuguese Foundation for Science and Technology (FCT) through the project Reconfigurable Hardware using MTJ Memories (PTDC/EEA-ELC/72933/2006).

8. REFERENCES

[1] M. F. Cowlishaw, "Decimal floating-point: Algorism for computers," in Proceedings of the 16th IEEE Symposium on Computer Arithmetic, June 2003, pp. 104-111.
[2] IBM Corporation, "IBM Power6," May 2007, http://www2.hursley.ibm.com/decimal/.
[3] T. O. et al., "Apparatus for decimal multiplication," U.S. Patent 4,677,583, June 1987.
[4] R. D. Kenney, M. J. Schulte, and M. A. Erle, "High-frequency decimal multiplier," in Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, Oct. 2004, pp. 26-29.
[5] M. A. Erle and M. J. Schulte, "Decimal multiplication via carry-save addition," in Proceedings of the 14th IEEE International Conference on Application-Specific Systems, Architectures and Processors, June 2003, pp. 348-358.
[6] T. Lang and A. Nannarelli, "A radix-10 combinational multiplier," in Proceedings of the 40th Asilomar Conference on Signals, Systems, and Computers, Oct. 2006, pp. 313-317.
[7] A. Vázquez, E. Antelo, and P. Montuschi, "A new family of high-performance parallel decimal multipliers," in Proceedings of the 18th IEEE Symposium on Computer Arithmetic, June 2007, pp. 195-204.
[8] L. Dadda and A. Nannarelli, "A variant of a radix-10 combinational multiplier," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), May 2008, pp. 3370-3373.
[9] P. Alfke and B. New, "Serial code conversion between BCD and binary," Xilinx Application Note XAPP 029, Oct. 1997.
[10] H. Neto and M. Véstias, "Decimal multiplier on FPGA using embedded binary multipliers," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), 2008, pp. 197-202.

