
A 13.3ns Double-precision Floating-point ALU and Multiplier


H. Yamada, T. Hotta†, T. Nishiyama, F. Murabayashi, T. Yamauchi, and H. Sawamoto
General Purpose Computer Division, Hitachi Ltd. †Hitachi Research Laboratory, Hitachi Ltd.
1, Horiyamashita, Hadano City, Kanagawa Prefecture, 259-13 Japan
Abstract

One-bit pre-shifting before alignment shift,
normalization with anticipated leading '1' bit and
pre-rounding techniques have been developed for a
floating-point arithmetic logic unit (ALU). In addition,
carry select addition and pre-rounding techniques have
been developed for a floating-point multiplier. A noise
tolerant precharge (NTP) circuit was designed and
applied to the ALU and multiplier. These techniques
reduced the delay time of the critical path by 24%. Each
unit was fabricated in 0.3µm 2.5V four-layer-metal
CMOS technology and achieved a two-cycle latency at
150 MHz.

1. Introduction
Scientific and engineering applications demand
exceptionally high floating-point performance which in
turn requires high speed floating-point ALUs and
multipliers to reduce execution time. In recent years a
number of high speed floating-point execution units
have been presented [1] - [6].
A floating-point ALU and multiplier were designed
which are each capable of 13.3ns execution. The ALU
and multiplier can each individually produce a result in a
one-cycle pipelined pitch, achieving a peak execution
rate of 300MFLOPS at 150MHz. The units are in full
compliance with the IEEE Standard for Binary
Floating-point Arithmetic (Std. 754-1985) [7].
The ALU performs add, subtract, compare, convert
to smaller/larger floating-point precision value, and
convert floating to/from integer instructions for both
double and single precision operands. The ALU can
produce a denormalized number without requiring an
additional cycle.
The multiplier performs floating-point
multiplication for both double and single precision
operands and integer multiplication for single precision
operands. The multiplier is unable to produce a
denormalized number, but it can optionally generate a
correctly signed zero instead of a denormalized number
to avoid the performance loss caused by a trap.
To achieve the 13.3ns execution time, these
execution units were designed with several new
arithmetic and circuit techniques and fabricated with the
most advanced silicon technology. This paper describes
the arithmetic and circuit techniques developed for the
ALU and multiplier.

2. ALU
A block diagram of the floating-point ALU is
shown in Figure 1. It is a two stage pipelined machine.
In the first stage, the exponent of the larger operand is
selected as the common exponent and the fraction of the
operand with the smaller exponent is shifted to the right
by the alignment shifter. In the second stage,
addition/subtraction of the fraction of the larger
exponent operand and the right shifted fraction, as well
as normalization, IEEE rounding, and correction of the
common exponent are performed.
Three arithmetic techniques are used in the ALU.
The first is one-bit pre-shifting of both fractions in
effective addition cases. This technique is useful for
making the rounding process easier. The second is
normalization with the anticipated leading '1' bit of
addition/subtraction results. This normalization process
is fast even if the anticipated bit is wrong, because the
incorrectly shifted fraction can be adjusted by a simple
one-bit right shift. The third technique utilized is
pre-rounding, which prepares all possible rounded
results in parallel with addition/subtraction of aligned
fractions and selects the correct one with the leading '1'
bit of the add/subtract result. By using this technique,
the rounding process is accelerated by 51%.

1063-6404/95 $4.00 © 1995 IEEE


[Figure 1 labels: S1, E1, S2, E2, F1, F2; SWAP; shift number (Ediff); SELECT; ST, ET, FT(1:51), FT(52). Legend: one-bit pre-shifting; normalization with anticipated leading '1' bit; pre-rounding]

Figure 1. ALU block diagram

2.1 One-bit pre-shifting
When effective addition is performed, both fractions
of the operands are shifted right by one bit first, and
then the shifted fraction of the smaller exponent operand
is right shifted by the amount of the operand exponent
difference (Ediff). The addition result of the aligned
fractions lies between 0.1 and 1.111... (represented in
binary) and may exceed the IEEE format bit length.
Normalization shift left by one/zero bit position and
rounding are performed if necessary.
When effective subtraction is performed, the
fraction of the smaller exponent operand is right shifted
by the amount of Ediff. If Ediff=0 or 1, the subtraction
result of the aligned fraction is less than or equal to 1,
so performing a large normalization shift is necessary.
However, the normalized result already complies with the
IEEE format bit length, so rounding is not performed. If
Ediff>1, the subtraction result lies between 0.1 and
1.111... and may exceed the IEEE format bit length. In
such cases, normalization shift left by zero/one bit
position and rounding are performed if necessary.

2.2 Normalization with anticipated leading '1' bit
The normalization process consists of the
following four steps: 1) leading '1' bit anticipation of
the add/subtract result; 2) shift control signal generation
with priority encoding of the leading '1' bit anticipation
result; 3) left shift of the add/subtract result by the shift
control signal; 4) one-bit right shift adjustment if the
anticipated leading '1' bit is incorrect.
The algorithm used for the leading '1' bit
anticipation is as follows. The leading '1' bit
anticipation signal Z is:
$$Z = z_0 z_1 z_2 \cdots z_i \cdots z_{52} \qquad (1)$$
where the i-th bit of signal Z is defined as
$$z_i = (a_{i-1} \oplus b_{i-1}) \oplus (a_i \mid b_i) \qquad (2)$$
and $a_i$ and $b_i$ are the i-th bits of the fractions to be added
($0 \le i \le 52$). In equation (2), "$\oplus$" represents an
EXCLUSIVE-OR and "$\mid$" represents an OR. Producing
signal Z requires a maximum of 2 gate delays (2
EXCLUSIVE-ORs) which is far smaller than the 7-8
gate delays necessary for a 55 bit carry lookahead adder.
The leading '1' bit position of signal Z is equal to or
only one bit lower than that of the add/subtract result. If
the anticipated bit is wrong, the normalization shift is
incorrect by one bit position and can be adjusted by a
simple one bit right shift. If the anticipated bit is
correct, no further shifting is required. Table 1 shows
examples of the leading '1' bit anticipation.

Table 1. Examples of leading '1' bit anticipation

(a) Correct anticipation
A     01.0100011000111
B     11.0001101010001
Z      0.0111000011100
(sum   0.0110000011000)
shift number=2 (adjustment shift=0)
(b) Incorrect anticipation
A     01.0110011000111
B     11.0101101010001
Z      0.0110000011100
(sum   0.1100000011000)
shift number=2 (adjustment shift=1)
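
The leading '1' bit anticipation of equation (2) can be checked against the two examples of Table 1 with a short sketch. The function names below are hypothetical, and the indexing is an assumption read off the table: each operand is written as a bit string with the binary point removed, and its first character is taken as the bit to the left of position 0. The sum is formed modulo the operand width, which is how the table's sums come out.

```python
def anticipate(a: str, b: str) -> str:
    """Leading '1' anticipation signal Z for two equal-length bit strings (MSB first).

    The first character of each string is the bit to the left of position 0,
    so z_i = (a_{i-1} XOR b_{i-1}) XOR (a_i OR b_i) is defined for every i.
    """
    abits = [int(ch) for ch in a]
    bbits = [int(ch) for ch in b]
    z = [
        (abits[i - 1] ^ bbits[i - 1]) ^ (abits[i] | bbits[i])
        for i in range(1, len(abits))
    ]
    return "".join(str(bit) for bit in z)


def add_fractions(a: str, b: str) -> str:
    """Sum modulo the operand width, with the leftmost bit dropped to align with Z."""
    width = len(a)
    total = (int(a, 2) + int(b, 2)) % (1 << width)
    return format(total, f"0{width}b")[1:]


def shift_and_adjust(a: str, b: str):
    """Return (left shift taken from Z, one-bit right shift adjustment)."""
    shift = anticipate(a, b).index("1")  # priority encoding of Z
    adjustment = shift - add_fractions(a, b).index("1")
    return shift, adjustment


# Operands of Table 1 (a) and (b), binary point removed.
TABLE_1A = ("010100011000111", "110001101010001")
TABLE_1B = ("010110011000111", "110101101010001")

print(anticipate(*TABLE_1A), shift_and_adjust(*TABLE_1A))  # 00111000011100 (2, 0)
print(anticipate(*TABLE_1B), shift_and_adjust(*TABLE_1B))  # 00110000011100 (2, 1)
```

In case (b) the anticipated position is one bit lower than the true leading '1', so the shift of 2 overshoots by one position and the adjustment of 1 is the single right shift described in the text.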

2.3 Pre-rounding
Figure 2 shows the pre-rounding scheme. The
pre-rounding process of the ALU calculations consists
of four steps.
The first step involves incrementing the
add/subtract result at the 52nd decimal place by one.
This incrementation is performed in parallel with the
addition/subtraction, and the result is ignored if no carry
arises from rounding. In the second step, three
independent pre-roundings are performed for the three
possible positions of the leading '1' bit (type 1, type 2,
type 3). Type 1, 2, and 3 are the cases when the leading
'1' bit is located one bit left, one bit right, and two or
more bits right of the decimal point. Bits 52 to 55 of
the add/subtract result, sign bit, and rounding mode
signals are used to calculate the three rounding carries
and the three least significant bits of the rounded results
in pre-rounders R0, R1, and R2. In the third step, the
correct pre-rounded result is selected according to the
most significant two bits of the add/subtract result. If
the two bits are '10' or '11', the results of R0 are used.
If the bits are '01', the results of R1 are used.
Otherwise, the results of R2 are used. In the fourth step,
the selected carry is used to select either the incremented
result calculated in the first step, or the add/subtract
result.
Calculation of the most significant two bits of the
add/subtract result followed by the selection of the
rounding carry signal is one of the most critical paths,
so normalization shifters were intentionally removed
from the critical path. In this way they can execute in
parallel with the rounding carry calculation.

add/sub result   S0. S1  S2
type 1 (R0)      1   x   x
type 2 (R1)      0   1   x
type 3 (R2)      0   0   x

[Figure 2 labels: add/subtract result bits s52, s53, s54, s55; L: least, R: round, S: sticky; selection 01 -> R1, 00 -> R2; rounding carry; r52]

Figure 2. Pre-rounding scheme

3. Multiplier
A block diagram of the floating-point multiplier is
shown in Figure 3. Like the ALU, it is also a two stage
pipelined machine. In the first stage, one of the
fractions is encoded using a Radix 4 Booth algorithm.
The generated twenty seven 54-bit partial products are
summed by the Wallace tree [8]. The partial product
array utilizes a 4-2 compressor tree rather than a 3-2 full
adder in order to reduce tree depth and to simplify
layout. Exponent addition and rebias are also performed
in the first stage. In the second stage, carry propagate
addition of the partial product sum (carry save form), as
well as normalization, IEEE rounding, and exponent
correction are performed.
Two arithmetic techniques are used in the
multiplier. The first involves splitting the Wallace tree
sum and performing the upper 52-bit and lower 54-bit
addition calculations in parallel. The second technique is
pre-rounding which is similar to that of the
floating-point ALU.

[Figure 3 labels: E1, E2, F2; Radix 4 BOOTH ENCODER; PARTIAL PRODUCTS; ET, FT(1:51), FT(52). Legend: carry select addition; pre-rounding]

Figure 3. Multiplier block diagram

3.1 Carry select addition
Partial product sum (carry save form) is divided into
two pairs (one is a pairing of the upper 52-bits and the
other is a pairing of the lower 54-bits). With-carry and
without-carry cases are calculated for the upper 52-bits,
and the correct sum is selected by the carry from the
lower 54-bit sum. Addition of the lower pair is also
performed in parallel with the upper pair calculation,
and the signal P (propagate carry from the most
significant bit), L (least bit), G (guard bit), R (round
bit), and S (sticky bit) are output.

4.2 Performance
Figure 5 shows the delay time of the floating-point
ALU. Each delay time was calculated by a circuit
simulation. By using the above arithmetic techniques,
the delay time of the maximum critical path is reduced
by 15.4%. Moreover, by using the NTP circuit, the
delay time of carry propagation in addition/subtraction
and leading '1' bit anticipation in normalization is
reduced as well, reducing the total delay time by 24%.

[Figure 5 labels: horizontal axis "Delay time (ns)" with ticks 0, 5, 10, 15; bar (1) "Without ..." marked 17.5ns; stage labels: format align etc., alignment shift, add/sub, normalize]

3.2 Pre-rounding
The pre-rounding of multiplication results consists
of three steps. In the first step, the rounding carries C0,
C1 and the rounded results L0, G0, L1 are calculated. C0,
L0 and G0 are the results when G is the least significant
bit, and C1 and L1 are the results when L is the least
significant bit. In the second step, the correct rounding
carry and the correct rounded L and G results are
selected; the selected G result has no meaning when L is
the least significant bit. Finally, either the added result
or the incremented result is selected by the selected
carry signal as the upper portion of the rounded result.
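
The multiplier datapath of Section 3 (radix-4 Booth recoding, carry-save reduction, carry select addition, and pre-rounding) can be sketched end to end on integers. This is a minimal illustration under stated assumptions, not the circuit itself: all function names are hypothetical; the partial products are reduced with sequential 3-2 carry-save steps rather than the 4-2 compressor Wallace tree; L, G and R are assumed to be bits 53, 52 and 51 of the lower 54-bit sum with S the OR of the bits below; only round-to-nearest-even is modeled; and exponent handling is omitted.

```python
FRACTION_BITS = 53            # double-precision significand including the hidden bit
LOW_BITS, HIGH_BITS = 54, 52  # split of the 106-bit partial product sum


def booth_radix4_digits(x, nbits=FRACTION_BITS):
    """Recode unsigned x into radix-4 Booth digits in {-2..2}, least significant first."""
    padded = x << 1                    # implicit 0 below the LSB
    digits = []
    for k in range(nbits // 2 + 1):    # 27 digits for a 53-bit operand
        lo, mid, hi = ((padded >> (2 * k + j)) & 1 for j in range(3))
        digits.append(lo + mid - 2 * hi)
    return digits


def carry_save_sum(partial_products):
    """Reduce the partial products to a (sum, carry) pair, 106 bits each."""
    s, c = 0, 0
    for pp in partial_products:
        s, c = s ^ c ^ pp, ((s & c) | (s & pp) | (c & pp)) << 1
    mask = (1 << (LOW_BITS + HIGH_BITS)) - 1
    return s & mask, c & mask


def carry_select_add(sum_vec, carry_vec):
    """The lower 54-bit add selects between the two precomputed upper 52-bit sums."""
    low_mask, high_mask = (1 << LOW_BITS) - 1, (1 << HIGH_BITS) - 1
    low = (sum_vec & low_mask) + (carry_vec & low_mask)
    high_without = ((sum_vec >> LOW_BITS) + (carry_vec >> LOW_BITS)) & high_mask
    high_with = (high_without + 1) & high_mask   # computed in parallel in hardware
    high = high_with if low >> LOW_BITS else high_without
    return high, low & low_mask


def preround(high, low):
    """Round to nearest even by preparing both LSB cases, then selecting one."""
    l, g, r = (low >> 53) & 1, (low >> 52) & 1, (low >> 51) & 1
    s = int((low & ((1 << 51) - 1)) != 0)
    # Case 1: L is the least significant bit (product in [2, 4)).
    inc1 = g & (r | s | l)
    c1, l1 = l & inc1, l ^ inc1
    # Case 0: G is the least significant bit (product in [1, 2)).
    inc0 = r & (s | g)
    g0, carry_g = g ^ inc0, g & inc0
    c0, l0 = l & carry_g, l ^ carry_g
    if high >> (HIGH_BITS - 1):        # top bit set: L is the LSB
        return ((high + c1) << 1) | l1
    return ((high + c0) << 2) | (l0 << 1) | g0


def multiply_fractions(x, y):
    """Rounded significand product of two 53-bit fractions with the top bit set."""
    digits = booth_radix4_digits(x)
    partial_products = [(d * y) << (2 * k) for k, d in enumerate(digits)]
    return preround(*carry_select_add(*carry_save_sum(partial_products)))
```

For example, `multiply_fractions(1 << 52, 1 << 52)` (1.0 times 1.0) returns `1 << 52`. Adding the selected carry to `high` stands in for choosing between the added and the incremented upper result.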

4. Design methodology

4.1 Circuit
The noise tolerant precharge (NTP) circuit, a high
speed and high noise tolerance CMOS circuit, was
developed and adopted for critical paths of the ALU and
multiplier [9]. Figure 4 shows a block diagram of the
NTP circuit. The NTP circuit has a noise tolerant
PMOS logic which provides high noise immunity. The
NTP circuit is precharged when the clock is low, and
the circuit is evaluated when the clock is high. The
delay time of the circuit is determined by the NMOS
logic. The NTP circuit has a 30-36% delay time
advantage over a conventional CMOS circuit. Three
types of NTP circuits were designed in order to
accelerate the time critical paths in carry lookahead
adders and the leading '1' bit anticipator.

[Figure 4 labels: Noise-tolerant; CK; OUT; IN2; IN3; Discharge NMOS]

Figure 4. NTP circuit block diagram

4.3 Floating-point unit
A floating-point unit utilizing the ALU and the
multiplier was fabricated in 0.3µm four-layer-metal
CMOS technology. A block diagram of the
floating-point unit is shown in Figure 6. The
floating-point unit contains four major sub-units: a
128x64-bit register file, an ALU, a multiplier, and a
divide/square root unit (Div/Sqrt). The register file has
four write ports and four read ports, which allows
parallel execution of a load, an ALU, and a multiply
operation. A microphotograph of the floating-point unit
is shown in Figure 7. All of the cells were placed
manually to shorten the wire length, and the routing of
the macro was done automatically except for the critical
parts. Table 2 summarizes the floating-point latency and
throughput.

5. Conclusion
One-bit pre-shifting before alignment shift,
normalization with the anticipated leading '1' bit and
pre-rounding techniques have been developed for a
floating-point ALU. Carry select addition and
pre-rounding techniques have been developed for a
floating-point multiplier. A high speed and high noise
tolerance precharge (NTP) CMOS circuit was developed
in order to accelerate critical paths of the ALU and
multiplier. These techniques reduced the delay time of
the critical path by 24%. Each unit was fabricated in 0.3µm
four-layer-metal CMOS technology and achieved a two
cycle latency at 150 MHz.

Acknowledgements
The authors would like to thank A. Anzai, M.
Hashimoto, R. Yamagata, T. Kumagai, E. Kamada, T.
Nakano, K. Kaneko, N. Ido, Y. Kiyoshige, S. Muto, S.
Tanaka, K. Shimamura, K. Matsuo, T. Shimizu, and S.
Nakahara of Hitachi Ltd. for their technical support,
discussions, and guidance.

Figure 6. Floating-point unit block diagram

References
[1] R. K. Montoye et al., "Design of the IBM RISC
System/6000 Floating-Point Execution Unit," IBM J. Res.
Develop., Vol. 34, No. 1, pp. 59-70, January 1990.
[2] J. Yetter, "A 100-MHz Superscalar PA-RISC
CPU/Coprocessor Chip," Digest of Technical Papers,
Symp. VLSI Circuits, pp. 12-13, 1992.
[3] D. W. Dobberpuhl et al., "A 200-MHz 64-b Dual-Issue
CMOS Microprocessor," IEEE J. Solid-State Circuits, Vol.
27, No. 11, pp. 1555-1557, November 1992.
[4] L. Gwennap, "Digital Leads the Pack with 21164,"
Microprocessor Report, Vol. 8, No. 12, pp. 6-10,
September 1994.
[5] L. Gwennap, "MIPS R10000 Uses Decoupled
Architecture," Microprocessor Report, Vol. 8, No. 14, pp.
18-22, October 1994.
[6] L. Gwennap, "PA-8000 Combines Complexity and
Speed," Microprocessor Report, Vol. 8, No. 15, pp. 6-9,
November 1994.
[7] IEEE Standard for Binary Floating-point Arithmetic,
ANSI/IEEE Standard No. 754, 1988.
[8] C.S. Wallace, "A Suggestion for a Fast Multiplier,"
Trans. IEEE Electronic Computers, Vol. EC-13, pp. 14-17,
February 1964.
[9] F. Murabayashi et al., "2.5V NOVEL CMOS CIRCUIT
TECHNIQUES FOR A 150MHz SUPERSCALAR RISC
PROCESSOR," to be published in ESSCIRC'95, September
1995.

[Figure labels: Register File; Multiplier; ALU; Div/Sqrt]

Figure 7. Floating-point unit microphotograph

Table 2. Floating-point latency and throughput

Double-precision   Latency (Cycle/ns)   Throughput (Cycle/ns)
(label lost)       2/13.3               1/6.7
Multiply           2/13.3               1/6.7
Divide             18/120.0             17/113.3
(label lost)       31/206.7             30/200.0
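
The nanosecond figures in Table 2 follow directly from the cycle counts at the 150 MHz clock, and the 300MFLOPS peak rate follows from two pipelined units each finishing one result per cycle. A small check, with a hypothetical helper name:

```python
CLOCK_MHZ = 150  # clock frequency stated in the paper


def cycles_to_ns(cycles: int) -> float:
    """Convert a cycle count to nanoseconds at CLOCK_MHZ, rounded to 0.1ns."""
    return round(cycles * 1e3 / CLOCK_MHZ, 1)


# (latency, throughput) cycle counts as they appear in Table 2.
for latency, throughput in [(2, 1), (18, 17), (31, 30)]:
    print(f"{latency}/{cycles_to_ns(latency)}  {throughput}/{cycles_to_ns(throughput)}")

# ALU and multiplier each produce one result per cycle.
peak_mflops = 2 * CLOCK_MHZ
print(peak_mflops)  # 300
```

The two-cycle latency of the ALU and multiplier gives the 13.3ns of the title.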
