
PARALLEL DECIMAL MULTIPLIERS USING BINARY MULTIPLIERS

Mário P. Véstias
INESC-ID/ISEL/IPL
email: mvestias@deetc.isel.ipl.pt

Horácio C. Neto
INESC-ID/IST/UTL
email: hcn@inesc-id.pt

ABSTRACT

Human-centric applications, such as financial and commercial ones, depend on decimal arithmetic, since their results must match exactly those obtained by human calculations. The IEEE 754-2008 standard for floating-point arithmetic has definitely recognized the importance of decimal computer arithmetic. A number of hardware approaches have already been proposed for decimal arithmetic operations, including addition, subtraction, multiplication and division. However, few efforts have been made to develop decimal IP cores able to take advantage of the binary multipliers available in most reconfigurable computing architectures. In this paper, we analyze the trade-offs involved in the design of a parallel decimal multiplier for decimal operands with 8 and 16 digits, using existing coarse-grained embedded binary arithmetic blocks. The proposed circuits were implemented in a Xilinx Virtex-4 FPGA. The results indicate that the proposed parallel multipliers are very competitive when compared to decimal multipliers implemented with direct manipulation of BCD numbers.

1. INTRODUCTION

Computer arithmetic is predominantly based on binary arithmetic, since the hardware implementations of the binary operations are simpler than those for decimal computation. However, decimal arithmetic is becoming a necessity in many applications, such as financial and commercial ones, where the results must be exact, matching those obtained by human calculations. Since many decimal numbers cannot be represented exactly as binary numbers with a finite number of bits, these types of applications require that the arithmetic operations be performed directly over decimal numbers.
Until very recently, the adopted solution was to implement decimal operations using software algorithms based on binary arithmetic. However, these software solutions are typically three or four orders of magnitude slower than binary arithmetic implemented in hardware [1]. To speed up the execution of decimal arithmetic, a few processors, such as the IBM Power6 [2], already include dedicated hardware for decimal floating-point operations.
Decimal multiplication is a fundamental operation for many decimal algorithms and is also needed to implement decimal division. Decimal multiplication is much more complicated than binary multiplication due to the inherent difficulty of representing decimal numbers in a binary number system. Both bit and digit carries, as well as invalid results, must be considered in order to produce the correct result.
Two main approaches have been considered in the design of a decimal multiplier: iterative and parallel. In the iterative approach [3], the multiplicand is iteratively multiplied by one digit of the multiplier to generate a partial product. The partial products are then added to produce the final decimal result. [4] and [5] present examples of decimal multipliers based on sequential units. In these, a set of multiplicand multiples is generated in a preprocessing step. Then, in the processing step, the multiples are selectively added according to the values of the multiplier digits. Parallel decimal multipliers were recently proposed to improve performance [6], [7], [8]. For example, in [6] the partial products are generated in parallel and then reduced using a decimal carry-save addition tree. Parallel multipliers can be significantly faster than iterative ones, but require more area for their implementation.
All the decimal multipliers referenced above work directly with decimal numbers, that is, they produce decimal partial products using decimal multiplication of a number by a digit and add the partial products using decimal adders. An alternative to this approach is to convert the BCD inputs to binary, perform a binary multiplication, and then convert the result back to BCD. This approach is particularly attractive when binary multipliers are already available in the hardware architecture, such as in coarse-grained reconfigurable hardware architectures or in microprocessors with binary arithmetic units.
A major problem of this solution is that the binary to BCD conversion may be too slow. The most usual solution is a serial implementation of the shift and add-3 algorithm [9], which can be implemented with a small amount of resources, but at the cost of long latencies. A fast parallel implementation of this algorithm is a possible alternative but, for large numbers, it may consume a considerable amount of hardware resources.

Recently, in [10], a new binary to decimal converter has been proposed that is faster and uses about half of the resources of the parallel implementation of the traditional shift and add-3 algorithm. With faster converters, decimal multiplication using binary multipliers becomes competitive with the more common approaches based on direct BCD manipulation.
The approach considered in [10] has been applied only to small multipliers. Also, the decimal to binary converter, while simpler, must still be optimized to improve area and performance. To be effective, this new approach must be applicable to larger operands with 8, 16 or even 34 digits (according to the IEEE 754-2008 standard).
In this paper, we analyze and develop efficient architectures for parallel BCD multiplication using binary multipliers. Section 2 analyzes decimal to binary conversion and proposes a new architecture for it. Section 3 analyzes binary to decimal conversion, considering the algorithm proposed in [10] applied to larger operand sizes. Section 4 discusses the trade-offs involved in the design of decimal multipliers based on the results of the previous sections and proposes parallel decimal multipliers for operands with 8 and 16 digits. Section 5 provides area and performance results with and without embedded multipliers and a comparison with a state-of-the-art parallel multiplier. Section 6 concludes the paper and proposes future directions for decimal multiplication based on binary multipliers.

2. DECIMAL TO BINARY CONVERSION

BCD to binary conversion can be performed by the well-known process of successively dividing the decimal number by two and storing the remainder of each division. Division by two is easily achieved with a shift towards the least significant bit. However, whenever a bit shifts across a digit boundary, the lower digit must be corrected by subtracting three (or, equivalently, by adding five and discarding the most significant bit of the digit).
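As an illustration of this shift-based method, the following C sketch (a behavioral model, not the authors' VHDL; the function name and the packed-BCD input format are our own assumptions) converts an 8-digit packed-BCD value to binary by repeatedly halving it and correcting the digits as described above.

#include <stdint.h>
#include <stdio.h>

/* 8-digit packed BCD (one digit per nibble) to binary, by successive
 * division by two: shift the whole BCD register one bit to the right and
 * subtract three from every digit whose most significant bit became set,
 * i.e., every digit that received a bit from the digit above it.  The bits
 * shifted out form the binary result, least significant bit first.        */
static uint32_t bcd8_to_bin(uint32_t bcd)
{
    uint32_t bin = 0;

    for (int bit = 0; bit < 27; bit++) {        /* 10^8 - 1 fits in 27 bits  */
        bin |= (bcd & 1u) << bit;               /* remainder of the division */
        bcd >>= 1;                              /* divide the BCD value by 2 */
        for (int d = 0; d < 8; d++) {           /* correct each 4-bit digit  */
            uint32_t digit = (bcd >> (4 * d)) & 0xFu;
            if (digit >= 8)                     /* a bit crossed the boundary */
                bcd -= 3u << (4 * d);
        }
    }
    return bin;
}

int main(void)
{
    printf("%u\n", bcd8_to_bin(0x12345678u));   /* prints 12345678 */
    return 0;
}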

An alternative to this method is to directly compute the binary value of the decimal number using binary arithmetic, as defined in equation (1).

D_{n-1} ... D_0 = D_{n-1}·10^{n-1} + ... + D_0·10^0    (1)

Because of the multiplications by powers of ten, a direct implementation of this equation would require multipliers whose size would increase significantly with the size of the numbers to be converted. To overcome this problem, (1) can be rearranged by applying Horner's rule (see equation (2)).

D = (...((D_{n-1}·10 + D_{n-2})·10 + ...)·10 + D_0    (2)
This new equation requires only multiplications by 10, but the operations cannot be executed in parallel. Since a multiplication by 10 is the same as a multiplication by 8 + 2, each multiplication reduces to an addition of two shifted values.
Other rearrangements can be obtained by manipulating equation (2). A few examples are given in equations (3)-(5).

D = (...((D_{n-1}·10 + D_{n-2})·100 + ...)·100 + (D_1·10 + D_0)    (3)

D = (...((D_{n-1}·100 + D_{n-2}·10 + D_{n-3})·10^3 + ...)·10^3 + (D_2·100 + D_1·10 + D_0)    (4)

D = ((D_{n-1}·10 + D_{n-2})·100 + (D_{n-3}·10 + D_{n-4}))·10^{n-4} + ... + ((D_3·10 + D_2)·100 + (D_1·10 + D_0))    (5)
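For illustration, the C sketch below evaluates the Horner arrangement of equation (2) digit by digit, with each multiplication by 10 implemented as the sum of two shifted values (8x + 2x). It is a behavioral model with illustrative names, not the synthesized architecture; the grouped forms in equations (3)-(5) follow the same idea, processing two or three digits at a time.

#include <stdint.h>
#include <stdio.h>

/* Packed-BCD (one digit per nibble, most significant digit in the top
 * nibble) to binary conversion following equation (2):
 *     D = (...((D_{n-1}*10 + D_{n-2})*10 + ...)*10 + D_0.
 * Each multiplication by 10 is realized as (x << 3) + (x << 1) = 8x + 2x. */
static uint32_t bcd_to_bin_horner(uint32_t bcd, int ndigits)
{
    uint32_t acc = 0;

    for (int i = ndigits - 1; i >= 0; i--) {
        uint32_t digit = (bcd >> (4 * i)) & 0xFu;
        acc = (acc << 3) + (acc << 1) + digit;  /* acc = acc*10 + digit */
    }
    return acc;
}

int main(void)
{
    printf("%u\n", bcd_to_bin_horner(0x12345678u, 8));  /* prints 12345678 */
    return 0;
}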

These equations have more available parallelism compared to the original Horner arrangement.
To determine which representation has the lowest-cost implementation, the corresponding architectures were described in VHDL and synthesized for a Virtex-4 FPGA (see the results in figure 1).
The shift and add approach is by far the worst in terms of area. The implementations of equations (4) and (5) are the best alternatives, with area reductions ranging from 50% to almost 75% with respect to the shift and add algorithm. A few other rearrangements of the Horner equation were tested, including other powers of ten, but the results obtained were worse.
The delays associated with the implementations follow a similar relation (see figure 2). The best solution achieves up to almost 50% improvement in delay compared to that of the shift and add solution.

Fig. 1. Size of the decimal to binary converter for different implementations (number of LUTs versus number of digits, for the shift and add approach and for equations (2)-(5)).


Fig. 2. Delay (in ns) of the decimal to binary converter for different implementations (versus number of digits, for the shift and add approach and for equations (2)-(5)).


3. BINARY TO DECIMAL CONVERSION


Fig. 3. Converter of a binary number upto 99999999 to base
1000

Binary to decimal conversion is fundamentally the calculation of equation (6) in decimal.

b_{n-1}·2^{n-1} + b_{n-2}·2^{n-2} + ... + b_0·2^0    (6)

Multiplication by two is achieved with a shift towards the most significant bit. However, since the operations are performed in decimal, whenever a bit shifts across a digit boundary the digit must be corrected by adding three before the shift takes place [9] (or six after the shift). This algorithm is usually known as the shift and add-3 algorithm.
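A behavioral C sketch of this shift and add-3 conversion is shown below (again a software model with illustrative names, not the parallel hardware implementation): before every left shift, three is added to every BCD digit that is five or greater, so that the doubling produces a correct decimal carry.

#include <stdint.h>
#include <stdio.h>

/* Binary (b < 10^8, 27 bits used) to 8-digit packed BCD with the
 * shift and add-3 algorithm of [9].                                      */
static uint32_t bin_to_bcd8(uint32_t bin)
{
    uint32_t bcd = 0;

    for (int bit = 26; bit >= 0; bit--) {
        for (int d = 0; d < 8; d++) {               /* add-3 correction    */
            uint32_t digit = (bcd >> (4 * d)) & 0xFu;
            if (digit >= 5)
                bcd += 3u << (4 * d);
        }
        bcd = (bcd << 1) | ((bin >> bit) & 1u);     /* shift in next bit   */
    }
    return bcd;
}

int main(void)
{
    printf("%08x\n", bin_to_bcd8(12345678u));       /* prints 12345678 */
    return 0;
}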
An alternative approach was proposed in [10]. The idea is to start by converting numbers in base 1024 to numbers in base 1000, i.e., to start with b = b1·2^10 + b0 and obtain b = d1·10^3 + d0 = d.
The algorithm uses a basic module (b2TOb1000) that converts binary values up to 999,999 to base 1000. From base 1000 to decimal, the shift and add algorithm is used. For larger numbers, the algorithm follows the Horner rule to determine the digits of the final decimal number. The converter uses three fundamental modules: b2TOb1000, a module that calculates 24·x + y, and an adder. The number of each of these units utilized in a converter depends on the size of the number and increases more than linearly with the size of the binary number to be converted.
In this paper, we propose an alternative, also very efficient, solution for the converter that considers 2^17 instead of 2^10. In this case, we have to calculate b = b1·2^17 + b0.
Considering a converter for 8-digit numbers,

b = b1·2^17 + b0 = 131·b1·10^3 + (b1·72 + b0),  with b1 ≤ (10^8 − 1)/2^17 (i.e., b1 ≤ 762) and b0 < 2^17    (7)

and

c = b1·72 + b0 ≤ (2^6 + 2^3)·b1 + b0 < 2^18    (8)

The b2TOb1000 module can therefore already be applied to c to determine the least significant base-1000 digit, d0, and part of the following digit, d1:

d = d1·10^3 + d0,  where d1 ≤ 185 < 2^8 (8 bits)    (9)


b is now

b = (131·b1 + d1)·10^3 + d0,  where e = 131·b1 + d1 ≤ 10^5 + 7 < 2^17 (17 bits)    (10)

Since e is representable with 17 bits, the b2TOb1000 unit can once again be used to determine the two most significant base-1000 digits, that is,

e = d2·10^3 + d1    (11)

A hardware implementation of this converter can be designed using a set of adders and the b2TOb1000 unit (see Fig. 3). Instead of two modules computing 24·x + y and two adders, this new approach uses only one module that calculates 72·x + y and one adder.
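To make the data flow of equations (7)-(11) concrete, the following C sketch models the proposed converter for 8-digit numbers. The b2TOb1000 block is emulated here simply by a division by 1000, and all names are illustrative assumptions rather than the actual module interfaces.

#include <stdint.h>
#include <stdio.h>

/* Software stand-in for the b2TOb1000 block: splits a binary value below
 * 10^6 into two base-1000 digits.                                         */
static void b2TOb1000(uint32_t x, uint32_t *hi, uint32_t *lo)
{
    *hi = x / 1000u;
    *lo = x % 1000u;
}

/* Binary b < 10^8 to three base-1000 digits, following equations (7)-(11). */
static void bin_to_base1000(uint32_t b, uint32_t d[3])
{
    uint32_t b1 = b >> 17;             /* b1 <= 762 for b < 10^8           */
    uint32_t b0 = b & 0x1FFFFu;        /* low 17 bits                      */
    uint32_t c  = 72u * b1 + b0;       /* 72*x + y, equations (7)-(8)      */
    uint32_t c_hi;

    b2TOb1000(c, &c_hi, &d[0]);        /* equation (9): d0 and partial d1  */
    uint32_t e = 131u * b1 + c_hi;     /* equation (10): e = 131*b1 + d1   */
    b2TOb1000(e, &d[2], &d[1]);        /* equation (11): two upper digits  */
}

int main(void)
{
    uint32_t d[3];
    bin_to_base1000(99999999u, d);
    printf("%u %u %u\n", d[2], d[1], d[0]);   /* prints 99 999 999 */
    return 0;
}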
Both architectures were evaluated for operands of different sizes and compared to a parallel implementation of the shift and add-3 algorithm (see the results in figure 4).
For certain operand sizes, the new approach achieves an area reduction of almost 50% compared to the shift and add-3 algorithm. Compared to the solution presented in [10], the improvement is almost 10% for numbers with 16 digits.


We also observe that, while the shift and add algorithm increases the resources almost linearly with the number of digits, the other two solutions increase more than linearly. For example, with 16 digits, all solutions are very close to each other in terms of area occupation. This may indicate that, for operands with more than 16 digits, the shift and add-3 approach will become the most efficient, but further experiments are required.
The delays associated with the implementations can be observed in figure 5. Pipelining has not been considered for this evaluation. The fastest solution is the shift and add-3, with differences of more than 50% for the numbers with more digits. To reduce these timing differences, the alternative solutions can be improved by using, for example, carry-save adders.

Fig. 4. Comparison between the different binary to decimal converters in terms of area (number of LUTs versus number of digits, for the shift and add approach, the converter of [10] and the new converter).

Fig. 5. Delay (in ns) of the binary to decimal converter for different implementations (versus number of digits, for the shift and add approach, the converter of [10] and the new converter).

4. PARALLEL DECIMAL MULTIPLIER

To implement decimal multiplication with binary multipliers, we convert the numbers to binary, do the multiplication and convert the result back to decimal. Two design approaches can be considered:
- Use a complete binary multiplier and converters of the size of the operands and of the result;
- Subdivide the operands and consider partial products to be added.

The first approach only uses the converters, besides the multiplier, while the second approach needs extra adders, to add the partial products, and several converters. However, the converters utilized in the first approach are larger and, given the analysis of the previous sections, will utilize much more area than the converters used in the second approach.
To see the difference between the two solutions in terms of area and delay, we have implemented an 8×8 decimal multiplier using both approaches.
In the first case, the architecture uses two 8-digit to 27-bit decimal to binary converters, one 27×27 binary multiplier, and one 54-bit to 16-digit binary to decimal converter (see figure 6).
In the second case, the 8-digit operands are divided into two groups of 4 digits each (see figure 7). In this case, there are four multiplications implemented with binary multipliers, that is, each 4-digit number is converted to binary and then multiplied. The inner partial products are added in binary before being converted to decimal to be added to the other partial decimal products (after binary to decimal conversion).
Table 1 shows the results obtained by implementing both alternatives in a Virtex-4 FPGA, with and without the embedded DSP blocks.
As expected, the larger binary to decimal converters are very expensive in terms of area, and so the second approach, using partial products and smaller converters, is better both in terms of area and performance.
A 16×16 decimal multiplier was designed using the (more efficient) approach with partial decimal products (see figure 8). After converting the sub-groups of digits of the operands to binary and performing the cross multiplications, the aligned partial products are added in binary and then converted to decimal. The three partial products in the figure indicate the operations performed and the number of digits. After this alignment, the three final partial products are added in decimal.
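As a behavioral illustration of the second approach, the C sketch below multiplies two 8-digit BCD operands by splitting each into two 4-digit groups, multiplying the groups with binary multipliers and recombining the weighted partial products. For brevity, the recombination is done with ordinary 64-bit binary arithmetic instead of the binary to decimal converters and decimal adders of the actual design, and all names are illustrative.

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

/* 4-digit packed-BCD group to binary (the result fits in 14 bits). */
static uint32_t bcd4_to_bin(uint32_t bcd)
{
    uint32_t v = 0;
    for (int i = 3; i >= 0; i--)
        v = v * 10u + ((bcd >> (4 * i)) & 0xFu);
    return v;
}

/* 8x8-digit decimal multiplication with partial products:
 *   A = A1*10^4 + A0,  B = B1*10^4 + B0
 *   A*B = A1*B1*10^8 + (A1*B0 + A0*B1)*10^4 + A0*B0
 * Each Ai*Bj is a small binary product (one embedded multiplier), and the
 * two inner products are added in binary before conversion, as in figure 7. */
static uint64_t dec_mul_8x8(uint32_t a_bcd, uint32_t b_bcd)
{
    uint32_t a1 = bcd4_to_bin(a_bcd >> 16), a0 = bcd4_to_bin(a_bcd & 0xFFFFu);
    uint32_t b1 = bcd4_to_bin(b_bcd >> 16), b0 = bcd4_to_bin(b_bcd & 0xFFFFu);

    uint64_t hi  = (uint64_t)a1 * b1;                       /* weight 10^8 */
    uint64_t mid = (uint64_t)a1 * b0 + (uint64_t)a0 * b1;   /* weight 10^4 */
    uint64_t lo  = (uint64_t)a0 * b0;                       /* weight 10^0 */

    return hi * 100000000ull + mid * 10000ull + lo;
}

int main(void)
{
    /* 12345678 * 87654321 = 1082152022374638 */
    printf("%" PRIu64 "\n", dec_mul_8x8(0x12345678u, 0x87654321u));
    return 0;
}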
Fig. 6. Decimal multiplier without decimal partial products: the 8-digit BCD operands A7-0 and B7-0 (32 bits each) are converted to 27-bit binary by two BCDtoBIN blocks, multiplied in a 27×27 binary multiplier, and the 54-bit product is converted by a BINtoBCD block into the 16-digit decimal result D15-0 = A7-0 × B7-0.
Fig. 7. Decimal multiplier with decimal partial products: each 8-digit operand is split into two 4-digit groups, the groups are converted to binary and multiplied, and the partial products are converted back and added to form the 16-digit result.

Fig. 8. 16×16 decimal multiplier with decimal partial products: the operands are split into 4-digit groups A3-A0 and B3-B0, the cross products (such as A0×B0, with 8 digits, and A0×B1 + A1×B0, with 9 digits) are computed and aligned, and the three resulting partial products are added to form the 32-digit result.

Table 1. Decimal multiplier

Solution               LUTs   DSPs   Delay
W/o partial products   3127   0      73 ns
W/o partial products   2023   4      76 ns
W/ partial products    2061   0      45 ns
W/ partial products    1176   4      47 ns

5. RESULTS

We have implemented two decimal multipliers, with sizes 8×8 and 16×16, and compared them with the parallel decimal multiplier from [7]. The decimal multipliers and their sub-units were specified in VHDL, synthesized with Xilinx ISE 10.1 and implemented in a Virtex-4 SX35-12 FPGA (see the results in tables 2 and 3).
Our 8×8 decimal multiplier utilizes about 20% fewer resources in FPGA technology than the solution proposed in [7]. However, the delay is about 30% higher.


pect that the performance of our solution can be improved
to values near those of the reference.
When using the DSP blocks, the number of utilized LUT
drops almost 50%, making our solution much more efficient
than that from [7] for FPGAs with embedded binary multipliers.
Compared to a binary multiplier, our decimal multiplier
uses about 2.5 times the number of LUTs (a binary multiplier of size 27 27 consumes 842 LUTs).
For the 16 16 decimal multiplier, our solution utilizes
about 25% less resources than the solution proposed in [7].
However, once again the delay is higher by about 17% (the
difference is smaller than that of the 8 8 decimal multiplier).
When using the DSP blocks, the number of utili-zed
LUTs drops almost 75% and, for FPGAs with embedded
binary multipliers, our solution is again much more efficient
than that from [7].
Once again, compared to a binary multiplier, our decimal multiplier uses about 2 times the number of LUTs (a

Table 2. Decimal multiplier 8×8

Solution          LUTs   DSPs   Delay
[7]               2609   0      34 ns
Our with DSP      1176   4      47 ns
Our without DSP   2061   0      45 ns
Table 3. Decimal multiplier 16×16

Solution          LUTs   DSPs   Delay
[7]               8729   0      54 ns
Our with DSP      3005   16     68 ns
Our without DSP   6493   0      65 ns


6. CONCLUSION

We have implemented an 8×8 and a 16×16 decimal multiplier using binary multiplications. The results show that this approach is better than those based on direct manipulation of decimal operands when implemented in a Virtex-4 FPGA.
An important advantage of the approach proposed herein is that it can effectively use the embedded binary multipliers available in current FPGAs and in other coarse-grained reconfigurable architectures.
For future work, we plan to analyze the effect of other subdivisions of the initial operands on the performance and on the consumed area.
It would also be important to test the designs with other technologies besides FPGAs, namely coarse-grained reconfigurable architectures with binary arithmetic units of different complexities.

7. ACKNOWLEDGMENT

This work was partially supported by the Portuguese Foundation for Science and Technology (FCT) through the project Reconfigurable Hardware using MTJ Memories (PTDC/EEA-ELC/72933/2006).

8. REFERENCES

[1] M. F. Cowlishaw, "Decimal floating-point: Algorism for computers," in Proceedings of the 16th IEEE Symposium on Computer Arithmetic, June 2003, pp. 104-111.
[2] IBM Corporation, "IBM Power6," May 2007, http://www2.hursley.ibm.com/decimal/.
[3] T. O. et al., "Apparatus for decimal multiplication," U.S. Patent 4,677,583, June 1987.
[4] R. D. Kenney, M. J. Schulte, and M. A. Erle, "High-frequency decimal multiplier," in Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, Oct. 2004, pp. 26-29.
[5] M. A. Erle and M. J. Schulte, "Decimal multiplication via carry-save addition," in Proceedings of the 14th IEEE International Conference on Application-Specific Systems, Architectures and Processors, June 2003, pp. 348-358.
[6] T. Lang and A. Nannarelli, "A radix-10 combinational multiplier," in Proceedings of the 40th Asilomar Conference on Signals, Systems, and Computers, Oct. 2006, pp. 313-317.
[7] A. Vázquez, E. Antelo, and P. Montuschi, "A new family of high-performance parallel decimal multipliers," in Proceedings of the 18th IEEE Symposium on Computer Arithmetic, June 2007, pp. 195-204.
[8] L. Dadda and A. Nannarelli, "A variant of a radix-10 combinational multiplier," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), May 2008, pp. 3370-3373.
[9] P. Alfke and B. New, "Serial code conversion between BCD and binary," Xilinx Application Note XAPP 029, Oct. 1997.
[10] H. Neto and M. Véstias, "Decimal multiplier on FPGA using embedded binary multipliers," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), 2008, pp. 197-202.

