
MITSUBISHI ELECTRIC ITE VI-Lab

Internal Reference: VIL04-D098
Publication Date: Dec. 2003
Title: Double Precision Floating-Point Arithmetic on FPGAs
Authors: S. Paschalakis, P. Lee
Reference: Paschalakis, S., Lee, P., "Double Precision Floating-Point Arithmetic on FPGAs", in Proc. 2003 2nd IEEE International Conference on Field-Programmable Technology (FPT '03), Tokyo, Japan, Dec. 15-17, 2003, pp. 352-358

Double Precision Floating-Point Arithmetic on FPGAs


Stavros Paschalakis, Peter Lee

Abstract We present low cost FPGA floating-point arithmetic circuits for all the common operations, i.e. addition/subtraction, multiplication, division and square root. Such circuits can be extremely useful in the FPGA implementation of complex systems that benefit from the reprogrammability and parallelism of the FPGA device but also require a general purpose arithmetic unit. While previous work has considered circuits for low precision floating-point formats, we consider the implementation of 64-bit double precision circuits that also provide rounding and exception handling.

2004 Mitsubishi Electric ITE B.V. - Visual Information Laboratory. All rights reserved.

Double Precision Floating-Point Arithmetic on FPGAs

Stavros Paschalakis
Mitsubishi Electric ITE BV VI-Lab
E-mail: Stavros.Paschalakis@vil.ite.mee.com

Peter Lee
University of Kent at Canterbury
E-mail: P.Lee@kent.ac.uk


1. Introduction

FPGAs have established themselves as invaluable tools in the implementation of high performance systems, combining the reprogrammability advantage of general purpose processors with the speed and parallel processing advantages of custom hardware. However, a problem that is frequently encountered with FPGA-based system-on-a-chip solutions, e.g. in signal processing or computer vision applications, is that the algorithmic frameworks of most real-world problems will, at some point, require general purpose arithmetic processing units, which are not standard components of the FPGA device. Therefore, various researchers have examined the FPGA implementation of floating-point operators [1-7] to alleviate this problem. The earliest work considered the implementation of operators in low precision custom formats, e.g. 16 or 18 bits in total, in order to reduce the associated circuit costs and increase their speed. More recently, the increasing size of FPGA devices has allowed researchers to efficiently implement operators in the 32-bit single precision format, the most basic format of the ANSI/IEEE 754-1985 binary floating-point arithmetic standard [8], and also to consider features such as rounding and exception handling.

In this paper we consider the implementation of FPGA floating-point arithmetic circuits for all the common operations, i.e. addition/subtraction, multiplication, division and square root, in the 64-bit double precision format, which is most commonly used in scientific computations. All the operators presented here provide rounding and exception handling. We have used these circuits in the implementation of a high-speed object recognition system which performs the extraction, normalisation and classification of moment descriptors and relies partly on custom parallel processing structures and partly on floating-point processing. A detailed description of the system is not given here but can be found in [9].

2. Floating-Point Numerical Representation

This section examines only briefly the double precision floating-point format. More details and discussions can be found in [8,10]. In a floating-point representation system of radix β, a real number N is represented in terms of a sign s, with s=0 or s=1, an exponent e and a significand S, so that N = (-1)^s · β^e · S. The IEEE standard specifies that double precision floating-point numbers are comprised of 64 bits, i.e. a sign bit (bit 63), 11 bits for the exponent E (bits 62 down to 52) and 52 bits for the fraction f (bits 51 down to 0). E is an unsigned biased number and the true exponent e is obtained as e = E - Ebias, with Ebias = 1023. The fraction f represents a number in the range [0,1) and the significand S is given by S = 1.f and is in the range [1,2). The leading 1 of the significand is commonly referred to as the hidden bit. This is usually made explicit for operations, a process usually referred to as unpacking. When the MSB of the significand is 1 and is followed by the radix point, the representation is said to be normalised.

For double precision numbers, the range of the unbiased exponent e is [-1022,1023], which translates to a range of only [1,2046] for the biased exponent E. The values E=0 and E=2047 are reserved for special quantities. The number zero is represented with E=0 and f=0; the hidden significand bit is also 0 and not 1, and zero has a positive or negative sign like normal numbers. When E=0 and f≠0, the number has e = -1022 and a significand S = 0.f. The hidden bit is 0 and not 1 and the sign is determined as for normal numbers. Such numbers are referred to as denormalised. Because of the additional complexity and costs, this part of the standard is not commonly implemented in hardware. For the same reason, our circuits do not support denormalised numbers. An exponent E=2047 and a fraction f=0 represent infinity.
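As an illustration of the field layout described above (our own sketch, not part of the paper's hardware), a Python double can be split into its sign, biased exponent and fraction fields and reassembled:

```python
import struct

def unpack_double(x):
    """Split a float into the IEEE-754 double precision fields:
    sign s (bit 63), biased exponent E (bits 62..52), fraction f (bits 51..0)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = bits >> 63
    E = (bits >> 52) & 0x7FF
    f = bits & ((1 << 52) - 1)
    return s, E, f

def value(s, E, f, bias=1023):
    """Value of a normal number: N = (-1)^s * 2^(E - bias) * 1.f"""
    S = 1 + f / 2**52          # significand with the hidden bit made explicit
    return (-1)**s * 2**(E - bias) * S
```

For example, unpacking -1.5 yields s=1, E=1023 (true exponent 0) and a fraction whose single set bit is the MSB, i.e. S = 1.1 in binary.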

Table 1. Double precision floating-point operator statistics on a XILINX XCV1000 Virtex FPGA device*.

                             Adder          Multiplier     Divider        Square Root
Slices                       675 (5.49%)    495 (4.03%)    343 (2.79%)    347 (2.82%)
Slice flip-flops             336            460            400            316
4-input LUTs                 1,118          604            463            399
Total equivalent gate count  10,334         8,426          6,464          5,366

* Device utilisation figures include I/O flip-flops: 194 for adder, 193 for multiplier, 193 for divider and 129 for square root.

The sign of infinity is determined as for normal numbers. Finally, an exponent E=2047 and a fraction f≠0 represent the symbolic unsigned entity NaN (Not a Number), which is produced by invalid operations such as 0/0 and 0 × ∞. The standard does not specify any NaN values, allowing the implementation of multiple NaNs. Here, only one NaN is provided, with E=2047 and f = .00...01.

Finally, a note should be made on the issue of rounding. It is clear that arithmetic operations on the significands can result in values which do not fit in the chosen representation and need to be rounded. The IEEE standard specifies four rounding modes. Here, only the default mode is considered, which is the most difficult to implement and is known as round-to-nearest-even (RNE). This is implemented by extending the relevant significands by three bits beyond their LSB (L) [10]. These bits are referred to, from the most significant to the least significant, as guard (G), round (R) and sticky (S). The first two are normal extension bits, while the last one is the OR of all the bits that are lower than the R bit. Rounding up, by adding a 1 to the L bit, is performed when (i) G=1 and R∨S=1 for any L, or (ii) G=1 and R∨S=0 for L=1. In all other cases, truncation takes place.
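The RNE rule above can be captured in a few lines of Python (an illustrative model; the 56-bit packing of the G, R and S bits below the 53 result bits follows the text, the function name is ours):

```python
def round_rne(sig56):
    """Round-to-nearest-even a 56-bit significand pattern:
    53 result bits (MSB..L) followed by the guard G, round R and sticky S bits."""
    top53 = sig56 >> 3            # MSB down to the L bit
    G = (sig56 >> 2) & 1
    R = (sig56 >> 1) & 1
    S = sig56 & 1
    L = top53 & 1
    # Round up when G=1 and (R or S)=1, or on a tie (G=1, R=S=0) when L=1
    if G and (R | S or L):
        top53 += 1
    return top53
```

On a tie with L=0 the value is already even, so truncation is chosen; with L=1 rounding up makes the L bit even.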

3. Addition/Subtraction
The main steps in the calculation of the sum or difference R of two floating-point numbers A and B are as follows. First, calculate the absolute value of the difference of the two exponents, i.e. |EA - EB|, and set the exponent ER of the result to the value of the larger of the two exponents. Then, shift right the significand which corresponds to the smaller exponent by |EA - EB| places. Add or subtract the two significands SA and SB, according to the effective operation, and make the result positive if it is negative. Normalise SR, adjusting ER as appropriate, and round SR, which may require ER to be readjusted. Clearly, this procedure is quite generic and various modifications exist.

Because addition is the most frequent operation in scientific computations, our circuit aims at a low implementation cost combined with a low latency. The circuit is not pipelined, so that key components may be reused, and has a fixed latency of three clock cycles. Its overall organisation is shown in Figure 1. In the first cycle, the operands A and B are unpacked

and checks for zero, infinity or NaN are performed. For now we can assume that neither operand is infinity or NaN. Based on the sign bits sA and sB and the original operation, the effective operation when both operands are made positive is determined, e.g. (-|A|) + (-|B|) becomes -(|A| + |B|), which results in the same effective operation but with a sign inversion of the result. From this point, it can be assumed that both A and B are positive.

The absolute difference |EA - EB| is calculated using two cascaded adders and a multiplexer. Both adders are fast ripple-carry adders, using the dedicated carry logic of the device (here, fast ripple-carry will always refer to such adders). Implicit in this is also the identification of the larger of the two exponents, and this provisionally becomes the exponent ER of the result. The relation between the two operands A and B is determined based on the relation between EA and EB and by comparing the significands SA and SB, which is required if EA = EB. This significand comparison deviates from the generic algorithm given earlier but has certain advantages, as will be seen. The significand comparator was implemented using seven 8-bit comparators that operate in parallel and an additional 8-bit comparator which processes their outputs. All the comparators employ the fast carry logic of the device. If B > A then the significands SA and SB are swapped. Both significands are then extended to 56 bits, i.e. by the G, R and S bits as discussed earlier, and are stored in registers. Swapping SA and SB is equivalent to swapping A and B and making an adjustment to the sign sR. This swapping requires only multiplexers.

In the second cycle, the significand alignment shift is performed and the effective operation is carried out. The advantage of swapping the significands is that it is always SB which will undergo the alignment shift. Hence, only the SB path needs a shifter. A modified 6-stage barrel shifter wired for alignment shifts performs the alignment.
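A behavioural Python sketch of the swap, alignment and significand addition just described (an effective addition only, ignoring subtraction, rounding and special values; exponents are biased integers, significands are 56-bit patterns with the MSB at bit 55, and the names are ours):

```python
def fp_add_core(EA, SA, EB, SB):
    """Sketch of the adder datapath for an effective addition."""
    # Swap so that the smaller operand always sits on the SB path;
    # exponents are compared first, significands break the tie.
    if (EB, SB) > (EA, SA):
        EA, EB = EB, EA
        SA, SB = SB, SA
    ER = EA                          # provisional result exponent
    d = EA - EB                      # alignment shift size |EA - EB|
    if d:
        lost = SB & ((1 << min(d, 56)) - 1)
        SB = (SB >> d) | (1 if lost else 0)   # fold shifted-out bits into sticky
    SR = SA + SB                     # never negative, since SA >= SB
    if SR >> 56:                     # supernormal result in [2,4)
        SR = (SR >> 1) | (SR & 1)    # right shift once, recompute sticky
        ER += 1
    return ER, SR
```

For instance, adding 1.0 and 1.0 (equal exponents) produces a supernormal sum whose single right shift increments the exponent.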
Each stage in the barrel shifter can clear the bits which rotate back from the MSB to achieve the alignment. Also, each stage calculates the OR of the bits that are shifted out and cleared. This allows the sticky bit S to be calculated as the OR of these six partial sticky bits along with the value that is in the sticky bit position of the output pattern. The barrel shifter is organised so that the 32-bit stage is followed by the 16-bit stage and so on, so that the large ORs do not become a significant factor with

respect to the speed of the shifter. The shifter output is the 56-bit significand SB, aligned and with the correct values in the G, R and S positions. A fast ripple-carry adder then calculates either SA + SB or SA - SB according to the effective operation. The advantage of the significand comparison earlier is that the result of this operation will never be negative, since SA ≥ SB after alignment. The result of this operation is the provisional significand SR of the result and is routed back to the SB path.

Figure 1. Floating-point adder

Figure 2. Leading-1 detection

It is clear that SR will not necessarily be normalised. More specifically, setting aside SR = 0, there are three cases: (a) SR is normalised, (b) SR is subnormal and requires a left shift by one or more places, and (c) SR is supernormal and requires a right shift by just one place. For the first two cases, a leading-1 detection component examines SR and calculates the appropriate normalisation shift size, equal to the number of 0 bits that are above the leading 1. Figure 2 shows a straightforward design for a leading-1 detector. The 56-bit leading-1 detector is comprised of seven 8-bit components and some simple connecting logic. For the final normalisation case, i.e. a supernormal SR in the range [2,4), the output of the leading-1 detector is overridden. This case is easily detected by monitoring the carry-out of the significand adder. Finally, if SR = 0 then normalisation is, obviously, inapplicable. In this case, the leading-1 detector produces a zero shift size and SR is treated as a normalised significand.

The normalisation of SR takes place in the third and final cycle of the operation. SR is normalised by the alignment barrel shifter, which is wired for right shifts. If SR is normalised, then it passes straight through the shifter unmodified. If SR is subnormal, it is shifted right and rotated by the appropriate number of places so that the normalisation left shift can be achieved. If SR is supernormal, it is right shifted by one place, feeding a 1 at the MSB, and the sticky bit S is recalculated. The output of the shifter is the normalised 56-bit SR with the correct values in the G, R and S positions. Then, rounding is performed as discussed earlier. The rounding addition is performed by the significand adder, with SR on the SB path. The normalised and rounded SR is given by the top 53 bits of the result, i.e. MSB down to the L bit, of which the MSB will become hidden during packing. A complication that could arise is the rounding addition producing the supernormal significand SR = 10.00...0. However, no actual normalisation takes place, because the bits MSB-1 down to L already represent the correct fraction fR for packing.

Each time SR requires normalisation, the exponent ER needs to be adjusted. This relies on the same hardware used for the processing of EA and EB in the first cycle. One adder performs the adjustment arising from the normalisation of SR. If SR is normalised, ER passes through unmodified. If SR is subnormal, ER is reduced by the left shift amount required for the SR normalisation. If SR is supernormal, ER is incremented by 1. The second cascaded adder increments this result by 1. The two results are multiplexed. If the rounded SR is supernormal then the result of the second adder is the correct ER. Otherwise, the result of the first adder is the correct ER.

The calculation of the sign sR is performed in the first cycle and is trivial, requiring only a few logic gates. Checks are also performed on the final ER to detect an overflow or underflow, whereby R is forced to the appropriate patterns before packing. Another check that is performed is for an effective subtraction with A = B, whereby R is set to a positive zero, according to the IEEE standard. Finally, infinity or NaN operands result in an infinity or NaN value for R according to a set of rules. These are not included here but can be found in [9].

Table 1 shows the implementation statistics of the double precision floating-point adder on a XILINX XCV1000 Virtex FPGA device of -6 speed grade. At 5.49% usage, the circuit is quite small. These figures also include 194 I/O synchronisation registers. The circuit can operate at up to ~25 MHz, the critical path lying on the significand processing path and comprised of 41.1% and 58.9% logic and routing delays respectively. Since the design is not pipelined and has a latency of three cycles, this gives rise to a performance of ~8.33 MFLOPS. Obviously, the circuit is small enough to allow multiple instances to be incorporated in a single FPGA if required.
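The leading-1 detection scheme of Figure 2 can be mimicked in Python (a behavioural sketch mirroring the seven 8-bit components and their connecting logic, not a netlist; the function names are ours):

```python
def l1d_8bit(byte):
    """8-bit leading-1 detector component: (found, position from the MSB)."""
    for i in range(8):
        if byte & (0x80 >> i):
            return True, i
    return False, 0

def leading_one_shift(SR):
    """Normalisation left-shift size for a 56-bit SR: the number of 0 bits
    above the leading 1; zero when SR = 0, so SR is treated as normalised."""
    for blk in range(7):                         # most significant byte first
        found, pos = l1d_8bit((SR >> (48 - 8 * blk)) & 0xFF)
        if found:
            return 8 * blk + pos
    return 0
```

In the circuit the seven components operate in parallel and a further stage selects among them; the sequential scan above only models the combined function.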


4. Multiplication
The most significant steps in the calculation of the product R of two numbers A and B are as follows. Calculate the exponent of the result as ER = EA + EB - Ebias. Multiply SA and SB to obtain the significand SR of the result. Normalise SR, adjusting ER as appropriate, and round SR, which may require ER to be readjusted. After addition and subtraction, multiplication is the most frequent operation in scientific computing. Thus, our double precision floating-point multiplier aims at a low implementation cost while maintaining a low latency, relative to the scale of the significand multiplication involved. The circuit is not pipelined and has a latency of ten cycles. Unlike the floating-point adder, which operates on a single clock, this circuit operates on two clocks: a primary clock (CLK1), to which the ten cycle latency corresponds, and an internal secondary clock (CLK2), which is twice as fast as the primary clock and is used by the significand multiplier. Figure 3 shows the overall organisation of this circuit.

In the first CLK1 cycle, the operands A and B are unpacked and checks are performed for zero, infinity or NaN operands. For now we can assume that neither operand is zero, infinity or NaN. The sign sR of the result is easily determined as the XOR of the signs sA and sB. From this point, A and B can be considered positive. As the first step in the calculation of ER, the sum EA + EB is calculated using a fast ripple-carry adder. In the second CLK1 cycle, the excess bias is removed from EA + EB using the same adder that was used for the exponent processing of the previous cycle. This completes the calculation of ER.

The significand multiplication also begins in the second CLK1 cycle. Since both SA and SB are 53-bit normalised numbers, SR will initially be 106 bits long and in the range [1,4). The significand multiplier is based on the modified Booth 2-bit parallel multiplier recoding method and has been implemented using a serial carry-save adder array and a fast ripple-carry adder for the assimilation of the final carry bits into the final sum bits.
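Radix-4 (2-bit) modified Booth recoding halves the number of partial products by recoding overlapping 2-bit groups of the multiplier into digits in {-2, -1, 0, 1, 2}. A small Python model of the recoding itself (our illustration; it models neither the carry-save array nor the partial assimilations):

```python
def booth2_digits(multiplier, bits=53):
    """Scan overlapping bit triplets (b[2i+1], b[2i], b[2i-1]) of an
    unsigned multiplier and emit radix-4 Booth digits in {-2..2}."""
    table = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
             0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    x = multiplier << 1              # append the implicit 0 below the LSB
    return [table[(x >> i) & 0b111] for i in range(0, bits, 2)]

def booth2_multiply(a, b, bits=53):
    """Sum the partial products a * digit * 4^i; equals a * b for 0 <= b < 2^53."""
    return sum(d * a * 4**i for i, d in enumerate(booth2_digits(b, bits)))
```

In the hardware, each digit selects 0, ±SA or ±2·SA as a partial product, so only shifts and negations are needed.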

Figure 3. Floating-point multiplier

With respect to the carry-save array, this contains two cascaded carry-save adders, which retire four sum and two carry bits in each CLK2 cycle. For the 53-bit SA and SB, 14 CLK2 cycles are required to produce all the sum and carry bits, i.e. until the end of the eighth CLK1 cycle. Alongside the carry-save array, a 4-bit fast ripple-carry adder performs a partial assimilation, i.e. processes the four sum and two carry bits produced in the previous carry-save addition cycle, taking a carry-in from the previous partial assimilation. Taking into account the logic levels associated with the generation of the partial products required by Booth's method and the carry-save additions, the latency of this component is so small that it has no effect on the clocking of the carry-save array, while it greatly accelerates the speed of the subsequent carry-propagate addition. The results of these partial assimilations need not be stored; all that needs to be stored is their OR, since they would eventually have been ORed into the sticky bit S.

In the ninth CLK1 cycle, the final sum and carry bits produced by the carry-save array are added together, taking a carry-in from the last partial assimilation. Since SR is in the range [1,4), it can be written as y1y2.y3y4y5...y104y105y106. Bits y1 to y56 are produced by this final carry assimilation. Bits y57 to y106 don't exist as such; all we have is their OR, which we can write as y57+, calculated during the partial assimilations of the previous cycles. Now, if y1=0, then SR is normalised and the 56-bit SR for rounding is given by y2.y3y4y5...y54y55y56y57+. If y1=1, SR is supernormal and requires a 1-bit right shift for normalisation, and the final 56-bit SR for rounding is given by y1.y2y3y4...y53y54y55y56+, where y56+ is the OR of y56 and y57+. The normalisation of a supernormal SR is achieved

using multiplexers switched by y1. If SR is supernormal, ER is adjusted, i.e. incremented by 1, using the same adder that was previously used for the exponent processing. This adjustment does not wait until it is determined that SR is supernormal, but is performed speculatively at the beginning of this cycle, and then the adjusted ER either replaces the old ER or is discarded. The rounding decision is also made in the ninth CLK1 cycle, without waiting for the final carry assimilation to finish. That is, a rounding decision is reached for both a normal and a supernormal SR. Then, the correct decision is chosen once y1 has been calculated.

In the tenth and final CLK1 cycle, the rounding of SR is performed using the same fast ripple-carry adder that is used by the significand multiplier. The result of this addition is the final normalised and rounded 53-bit significand SR. As for the adder of the previous section, the complication that might arise is a supernormal SR after rounding. As before, no actual normalisation is needed because it would not change the fraction fR for packing. The exponent ER, however, is adjusted, i.e. incremented by 1, using the same exponent processing adder. This adjustment is performed at the beginning of this cycle and then the correct ER is chosen between the previous ER and the adjusted ER based on the final SR. Checks are also performed on the final ER to detect an overflow or underflow, whereby R is forced to the correct bit patterns before packing. Finally, zero, infinity or NaN operands result in a zero, infinity or NaN value for R according to a simple set of rules.

Table 1 shows the implementation statistics of the double precision floating-point multiplier. The circuit is quite small, occupying only 4.03% of the device. The figures also include 193 I/O synchronisation registers. The primary clock CLK1 can be set to a frequency of up to ~40 MHz, its critical path comprised of 36.4% and 63.6% logic and routing delays respectively, while the secondary clock CLK2 can be set to a frequency of up to ~75 MHz, its critical path comprised of 36.8% and 63.2% logic and routing delays respectively. Since the circuit is not pipelined and has a fixed latency of ten CLK1 cycles, a frequency of 37 MHz and 74 MHz for CLK1 and CLK2 respectively gives rise to a performance in the order of 3.7 MFLOPS. Obviously, the circuit is small enough to allow multiple instances to be placed in a single chip.

5. Division

In general, division is a much less frequent operation than the previous ones. The most significant steps in the calculation of the quotient R of two numbers A (the dividend) and B (the divisor) are as follows. Calculate the exponent of the result as ER = EA - EB + Ebias. Divide SA by SB to obtain the significand SR. Normalise SR, adjusting ER as appropriate, and round SR, which may require ER to be readjusted. Our double precision floating-point divider aims solely at a low implementation cost. A non-pipelined design is adopted, incorporating an economic significand divider, with a fixed latency of 60 clock cycles. Figure 4 shows the overall organisation of this circuit.

Figure 4. Floating-point divider

In the first cycle, the operands A and B are unpacked. For now, we can assume that neither operand is zero, infinity or NaN. The sign sR of the result is the XOR of sA and sB. As the first step in the calculation of ER, the difference EA - EB is calculated using a fast ripple-carry adder. In the second cycle, the bias is added to EA - EB, using the same exponent processing adder of the previous cycle. This completes the calculation of ER.

The significand division also begins in the second cycle. The division algorithm employed here is the simple non-performing sequential algorithm and the division proceeds as follows. First, the remainder of the division is set to the value of the dividend SA. The divisor SB is subtracted from the remainder. If the result is positive or zero, the MSB of the quotient SR is 1 and this result replaces the remainder. Otherwise, the MSB of SR is 0 and the remainder is not replaced. The remainder is then shifted left by one place. The divisor SB is subtracted from the remainder for the calculation of the next bit below the MSB, and so on. The significand divider calculates one SR bit per cycle and its main components are two registers for the remainder and the divisor, a fast ripple-carry adder, and a shift register for SR. The divider operates for 55 cycles, i.e. during the cycles 2 to 56, and produces a 55-bit SR, the two least significant bits being the G and R bits. In cycle 57, the sticky bit S is calculated as the OR of all the bits of the final remainder. Since both SA and SB are

normalised, SR will be in the range (0.5,2), i.e. if not already normalised, it will require a left shift by just one place. This normalisation is also performed in cycle 57. No additional hardware is required, since SR is already stored in a shift register. If SR requires normalisation, the exponent ER is decremented by 1 in cycle 58. This exponent adjustment is performed using the same adder that is used for the exponent processing of the previous cycles. Also in cycle 58, SR is transferred to the divisor register, which is connected to the adder of the significand divider. In cycle 59, the 56-bit SR is rounded to 53 bits using the significand divider adder. For a supernormal SR after rounding, no normalisation is actually required, but the exponent ER is incremented by 1; this takes place in cycle 60, using the same adder that is used for the exponent processing of the previous cycles. Checks are also performed on ER for an overflow or underflow, whereby the result R is appropriately set before packing. As for zero, infinity and NaN operands, R will also be zero, infinity or NaN according to a simple set of rules.

Table 1 shows the implementation statistics of the double precision floating-point divider. The circuit is very small, occupying only 2.79% of the device, which also includes 193 I/O synchronisation registers. This circuit can operate at up to ~60 MHz, the critical path comprised of 42.8% and 57.2% logic and routing delays respectively. Since the design is not pipelined and has a fixed latency of 60 clock cycles, this gives a performance in the order of 1 MFLOPS. As for the previous circuits, the implementation considered here is small enough to allow multiple instances to be incorporated in a single FPGA device if needed.

6. Square Root

The square root function is much less frequent than the previous operations. Thus, our floating-point square root circuit aims solely at a low implementation cost. A non-pipelined design is adopted with a fixed latency of 59 cycles. Figure 5 shows the organisation of this circuit.

Figure 5. Floating-point square root

With the circuit considered here, the calculation of the square root R of the floating-point number A proceeds as follows. In the first cycle, the operand A is unpacked. For now we can assume that A is positive and not zero, infinity or NaN. The biased exponent ER of the result is calculated directly from the biased exponent EA using [9]

    ER = (EA + 1022)/2    if EA is even (and SA is left shifted by one place)
    ER = (EA + 1023)/2    if EA is odd                                          (1)

ER is calculated using a fast ripple-carry adder, while the division by 2 is just a matter of discarding the LSB of the numerator, which will always be even. The calculation of the significand SR, i.e. of the square root of SA, starts in the second clock cycle. According to (1), SA will be in the range [1,4). Consequently, its square root SR will be in the range [1,2), i.e. it will always be normalised. Denoting SR as y1.y2y3y4..., each bit yn is calculated using [9]

    yn = 1 if (Xn - Tn) ≥ 0, and yn = 0 otherwise                               (2)

    Xn+1 = 2(Xn - Tn) if yn = 1, and Xn+1 = 2Xn if yn = 0

    Tn+1 = y1.y2y3...yn01

with X1 = SA/2, T1 = 0.1 (binary) and n = 1, 2, 3, ...

From (2) it can be seen that the adopted square root calculation algorithm is quite similar to the division method examined in the previous section. Based on this algorithm, a significand square root circuit calculates one bit of SR per clock cycle. The main components of this part of the circuit are two registers for Xn and Tn, and a fast ripple-carry adder. The Tn register has been implemented so that each flip-flop has its own enable signal, which allows each individual bit yn to be stored in the correct position in the register and also controls the reset and set inputs of the next two flip-flops in the register, so that the correct Tn is formed for each cycle of the process. After the significand square root calculation process is complete, the contents of register Tn form the significand SR for rounding. Thus, the significand square root circuit operates for 55 cycles, i.e. during the clock cycles 2 to 56, and produces a 55-bit SR. The last two bits are the guard bit G and the round bit R. In cycle 57, the sticky bit S is calculated as the OR of all the bits of the final remainder of the significand square root calculation. Thus, a 56-bit SR for rounding is formulated. In cycle 58, SR is rounded to 53 bits using the

same adder that is used by the significand square root circuit. For a supernormal SR after rounding, no actual normalisation is required, but the exponent ER is incremented by 1; this adjustment takes place in cycle 59 and is performed using the same adder that is used for the exponent processing during the first cycle. From the definition of ER in (1) it is easy to see that an overflow or underflow will never occur. Finally, for zero, infinity, NaN or negative operands, simple rules apply with regard to the result R.

Table 1 shows the implementation statistics of our double precision floating-point square root function. The circuit is very small, occupying only 2.82% of the device, which includes 129 I/O synchronisation registers. The circuit can operate at up to ~80 MHz, the critical path comprised of 53.0% and 47.0% logic and routing delays respectively. Since the circuit is not pipelined and has a fixed latency of 59 clock cycles, this gives rise to a performance in the order of 1.36 MFLOPS.
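Recurrence (2) can be checked with a small Python model using exact rational arithmetic (an illustration of the algorithm only, not of the register-level implementation: Tn is recomputed from the partial root each iteration instead of being formed with per-bit enables as in the circuit):

```python
from fractions import Fraction

def significand_sqrt(SA_num, SA_den, bits=55):
    """Bit-per-cycle square root of a significand SA in [1,4),
    following (2): X1 = SA/2, T1 = 0.1 (binary) = 1/2."""
    X = Fraction(SA_num, SA_den) / 2
    S = Fraction(0)              # partial root y1.y2y3...
    w = Fraction(1)              # weight of the bit being computed
    for _ in range(bits):
        T = S + w / 2            # Tn = y1.y2...y_{n-1} followed by 0 1
        if X - T >= 0:           # yn = 1
            X = 2 * (X - T)
            S += w
        else:                    # yn = 0
            X = 2 * X
        w /= 2
    return S, X                  # root bits and final remainder (for sticky)
```

Exact squares terminate with a zero remainder, so the sticky bit is 0; otherwise S is the root truncated to 55 bits.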

8. Discussion

We have presented low cost FPGA floating-point arithmetic circuits in the 64-bit double precision format and for all the common operations. Such circuits can be extremely useful in the FPGA-based implementation of complex systems that benefit from the reprogrammability and parallelism of the FPGA device but also require a general purpose arithmetic unit. In our case, these circuits were used in the implementation of a high-speed object recognition system which relies partly on custom parallel processing structures and partly on floating-point processing. The implementation statistics of the operators show that they are very economical in relation to contemporary FPGAs, which also facilitates placing multiple instances of the desired operators in the same chip. Although non-pipelined circuits were considered here to achieve low circuit costs, the adder and multiplier, with a latency of three and ten cycles respectively, are suitable for pipelining to increase their throughput. For the divider and square root operators, pipelining the existing designs may not be the most efficient option and different designs and/or algorithms should be considered, e.g. a high radix SRT algorithm for division [6].

Clearly, there are significant speed and circuit size tradeoffs to consider when deciding on the range and precision of FPGA floating-point arithmetic circuits. A direct comparison with other floating-point unit implementations is very difficult to perform, not only because of floating-point format differences, but also due to other circuit characteristics, e.g. all the circuits presented here incorporate I/O registers, which would eventually be absorbed by the surrounding hardware. As an indication, and with some caution, the double precision floating-point adder presented here occupies approximately the same number of slices as the single precision floating-point adder in [7]. That circuit has a higher latency than the adder presented here, but also a higher maximum clock speed, which results in both circuits having approximately the same performance. The adder of [7], however, is fully pipelined and has a much higher peak throughput. In conclusion, the circuits presented here provide an indication of the costs of FPGA floating-point operators using a long format. The choice of floating-point format for any given problem ultimately rests with the designer.

References

1. Shirazi, N., Walters, A., Athanas, P., "Quantitative Analysis of Floating Point Arithmetic on FPGA Based Custom Computing Machines", in Proc. IEEE Symposium on FPGAs for Custom Computing Machines, 1995, pp. 155-162
2. Loucas, L., Cook, T.A., Johnson, W.H., "Implementation of IEEE Single Precision Floating Point Addition and Multiplication on FPGAs", in Proc. IEEE Symposium on FPGAs for Custom Computing Machines, 1996, pp. 107-116
3. Li, Y., Chu, W., "Implementation of Single Precision Floating Point Square Root on FPGAs", in Proc. 5th IEEE Symposium on Field-Programmable Custom Computing Machines, 1997, pp. 226-232
4. Ligon, W.B., McMillan, S., Monn, G., Schoonover, K., Stivers, F., Underwood, K.D., "A Re-Evaluation of the Practicality of Floating Point Operations on FPGAs", in Proc. IEEE Symposium on FPGAs for Custom Computing Machines, 1998, pp. 206-215
5. Belanovic, P., Leeser, M., "A Library of Parameterized Floating-Point Modules and Their Use", in Proc. Field Programmable Logic and Applications, 2002, pp. 657-666
6. Wang, X., Nelson, B.E., "Tradeoffs of Designing Floating-Point Division and Square Root on Virtex FPGAs", in Proc. 11th IEEE Symposium on Field-Programmable Custom Computing Machines, 2003, pp. 195-203
7. Digital Core Design, Alliance Core Data Sheets for Floating-Point Arithmetic, 2001, www.xilinx.com
8. ANSI/IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic, 1985
9. Paschalakis, S., Moment Methods and Hardware Architectures for High Speed Binary, Greyscale and Colour Pattern Recognition, Ph.D. Thesis, Department of Electronics, University of Kent at Canterbury, UK, 2001
10. Goldberg, D., "What Every Computer Scientist Should Know About Floating-Point Arithmetic", ACM Computing Surveys, 1991, vol. 23, no. 1, pp. 5-47
