Fall 2013

Computer Arithmetic

H. Wu

Chapter 4. Binary Floating-Point Numbers

4.1. Floating-Point Representations

Machine representation of real numbers:
- Fixed-point (discussed in Chapter 1)
- Floating-point

Compared to fixed-point numbers, floating-point numbers are efficient in representing both very large and very small numbers. A floating-point representation F has three parts:
- The sign, s;
- The significand (or mantissa), f;
- The exponent, e.

The bits of the floating-point number F are stored in a register or a memory unit as follows:

F = | s : sign | e : exponent | f : significand |

The value of such a floating-point number F is given by

$F = (-1)^s \times f \times \beta^e$,

where $\beta$ is the base of e, which is implied in a system. Usually, $\beta$ is chosen as a power of 2. The representation range is $[-F_{max}, -F_{min}]$ and $[F_{min}, F_{max}]$, where $F_{max}$ and $F_{min}$ are given by

$F_{max} = f_{max} \times \beta^{e_{max}}$, $F_{min} = f_{min} \times \beta^{e_{min}}$.

An overflow occurs if a result is larger than $F_{max}$ or smaller than $-F_{max}$. An underflow occurs if a result is nonzero and belongs to the interval $(-F_{min}, F_{min})$.

4.2. IEEE 754-2008 Standard

There are four formats in the IEEE 754-2008 floating-point standard:
1. Single-precision format (32-bit)
2. Double-precision format (64-bit)
3. Single extended format (≥ 44-bit)
4. Double extended format (≥ 80-bit)

Here we will introduce the first two formats, IEEE single-precision format and IEEE double-precision format.

4.2.1. IEEE single-precision (32 bits)

An IEEE single-precision floating-point number F has three parts, s, e and f, as shown below.

| s : sign | e : exponent | f : significand or mantissa |
|---|---|---|
| 1 bit | 8 bits | 23 bits |

The three fields together occupy 32 bits. An IEEE floating-point number F can be evaluated by

$F = (-1)^s \times 1.f \times 2^{e-127}$.    (4.1)

Note in (4.1) that there is a hidden 1 that is not shown in the representation of F. The maximal and minimal values of F can be decided by

$F_{max} = (2 - 2^{-23}) \times 2^{254-127} = (1 - 2^{-24}) \times 2^{128}$,
$F_{min} = 1 \times 2^{1-127} = 2^{-126}$.
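
As a check on (4.1) and on the two extreme values, the following is a minimal sketch that evaluates a 32-bit pattern field by field and compares it with the native interpretation of the same bits. The function names and the two constants are illustrative and not part of the notes; the sketch covers ordinary numbers only.

```python
import struct

def decode_single(bits: int) -> float:
    """Evaluate an ordinary single-precision pattern using (4.1)."""
    s = (bits >> 31) & 0x1        # 1-bit sign
    e = (bits >> 23) & 0xFF       # 8-bit biased exponent
    f = bits & 0x7FFFFF           # 23-bit fraction field
    if not 1 <= e <= 254:
        raise ValueError("reserved exponent: not an ordinary number")
    significand = 1 + f / 2**23   # hidden 1 plus the fraction 0.f
    return (-1) ** s * significand * 2.0 ** (e - 127)

def native_single(bits: int) -> float:
    """Reinterpret the same 32 bits as a native IEEE single."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

F_MAX_BITS = 0x7F7FFFFF  # s = 0, e = 254, f = all ones
F_MIN_BITS = 0x00800000  # s = 0, e = 1,   f = 0

print(decode_single(F_MAX_BITS) == (2 - 2**-23) * 2.0**127)
print(decode_single(F_MIN_BITS) == 2.0**-126)
```

Both comparisons are exact because every single-precision value is also representable in the double-precision floats that Python uses.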

How to represent zero? Use the reserved formats as shown in the table below.

Value or meaning of an IEEE single-precision floating-point representation:

| e | f | Value or meaning |
|---|---|---|
| e = 0 | f = 0 | F = 0 |
| e = 0 | f ≠ 0 | F is a subnormal number (= $(-1)^s \times 0.f \times 2^{-126}$) |
| e = 255 | f = 0 | F = ±∞ |
| e = 255 | f ≠ 0 | F is NaN (Not a Number) |
| 1 ≤ e ≤ 254 | any | F is an ordinary number and $F = (-1)^s \times 1.f \times 2^{e-127}$ |

Example 4.1 (omitted)
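
The reserved encodings in the single-precision table can be expressed as a short classifier. This is a sketch only: the function names and the returned labels are chosen here for illustration.

```python
def classify_single(bits: int) -> str:
    """Classify a 32-bit pattern by its exponent and fraction fields."""
    e = (bits >> 23) & 0xFF   # 8-bit biased exponent
    f = bits & 0x7FFFFF       # 23-bit fraction field
    if e == 0:
        return "zero" if f == 0 else "subnormal"
    if e == 255:
        return "infinity" if f == 0 else "NaN"
    return "ordinary"

def subnormal_value(bits: int) -> float:
    """Value of a pattern with e = 0: no hidden 1, exponent fixed at -126."""
    s = (bits >> 31) & 0x1
    f = bits & 0x7FFFFF
    return (-1) ** s * (f / 2**23) * 2.0**-126

print(classify_single(0x00000001))         # smallest positive subnormal pattern
print(subnormal_value(0x00000001) == 2.0**-149)
```

The smallest positive subnormal has f = 2^-23 as a fraction, so its value is 2^-23 × 2^-126 = 2^-149.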

4.2.2. IEEE double-precision (64 bits)

An IEEE double-precision floating-point number F also has three parts, s, e and f, as shown below.

| s : sign | e : exponent | f : significand or mantissa |
|---|---|---|
| 1 bit | 11 bits | 52 bits |

The three fields together occupy 64 bits. An IEEE floating-point number F can be evaluated by

$F = (-1)^s \times 1.f \times 2^{e-1023}$.    (4.2)

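Note that a hidden 1, not shown in the representation of F, also exists in the double-precision case. The maximal and minimal values of F can be decided by

$F_{max} = (2 - 2^{-52}) \times 2^{2046-1023} = (1 - 2^{-53}) \times 2^{1024}$,
$F_{min} = 1 \times 2^{1-1023} = 2^{-1022}$.

The value or meaning of an IEEE double-precision number F is given in the following table. With an 11-bit exponent field, the reserved all-ones exponent is e = 2047, consistent with the largest ordinary exponent 2046 used in $F_{max}$ above.

Value or meaning of an IEEE double-precision floating-point representation:

| e | f | Value or meaning |
|---|---|---|
| e = 0 | f = 0 | F = 0 |
| e = 0 | f ≠ 0 | F is a subnormal number (= $(-1)^s \times 0.f \times 2^{-1022}$) |
| e = 2047 | f = 0 | F = ±∞ |
| e = 2047 | f ≠ 0 | F is NaN (Not a Number) |
| 1 ≤ e ≤ 2046 | any | F is an ordinary number and $F = (-1)^s \times 1.f \times 2^{e-1023}$ |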
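
Since Python floats are IEEE double-precision numbers on common platforms, the double-precision limits and field values can be inspected directly. The sketch below assumes such a platform; `fields_double`, `F_MAX` and `F_MIN` are names introduced here for illustration.

```python
import struct
import sys

def fields_double(x: float) -> tuple[int, int, int]:
    """Split a double into its (s, e, f) fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63                 # 1-bit sign
    e = (bits >> 52) & 0x7FF       # 11-bit biased exponent
    f = bits & ((1 << 52) - 1)     # 52-bit fraction field
    return s, e, f

# Use the first form of F_max: 2.0**1024 itself is not representable.
F_MAX = (2 - 2**-52) * 2.0**1023
F_MIN = 2.0**-1022

print(F_MAX == sys.float_info.max, F_MIN == sys.float_info.min)
print(fields_double(F_MAX))          # largest ordinary exponent, f all ones
print(fields_double(float("inf")))   # reserved all-ones exponent, f = 0
```

The form $(1 - 2^{-53}) \times 2^{1024}$ is mathematically equal to $F_{max}$ but cannot be evaluated term by term in double precision, which is why the sketch uses $(2 - 2^{-52}) \times 2^{1023}$.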

Example 4.2 (omitted)

4.3. Floating-Point Operations

4.3.1. Multiplication

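Given two numbers $F_1 = (-1)^{S_1} \times M_1 \times \beta^{E_1 - bias}$ and $F_2 = (-1)^{S_2} \times M_2 \times \beta^{E_2 - bias}$, assume that both are in normalized form. Then the product $F_3 = F_1 \times F_2$, with $F_3 = (-1)^{S_3} \times M_3 \times \beta^{E_3 - bias}$, can be obtained as follows.

1. Calculate $E_3 = E_1 + E_2 - bias$. The bias is subtracted once because both $E_1$ and $E_2$ already include it. IF ($E_3 > E_{max}$) THEN overflow ELSE IF ($E_3 < E_{min}$) THEN underflow.
2. Calculate $M_3 = M_1 \times M_2$. IF ($M_3 < 1/\beta$) THEN $M_3 = M_3 \times \beta$ and $E_3 = E_3 - 1$.
3. IF ($E_3 < E_{min}$) THEN underflow.

The second step above is called post-normalization.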
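
The three multiplication steps can be sketched as follows. This is an illustration under stated assumptions, not a hardware description: the toy parameters (β = 2, bias = 8, biased exponents 0 to 15) are hypothetical, significands are exact fractions in [1/β, 1) so no rounding of $M_3$ is modelled, and the sign rule $S_3 = S_1 \oplus S_2$ is added here because the steps do not state it.

```python
from fractions import Fraction

BETA, BIAS = 2, 8        # hypothetical toy format
E_MIN, E_MAX = 0, 15     # hypothetical range of the biased exponent

def value(s: int, m: Fraction, e: int) -> Fraction:
    """Value of the triple (S, M, E): (-1)^S * M * beta^(E - bias)."""
    return (-1) ** s * m * Fraction(BETA) ** (e - BIAS)

def fp_multiply(f1, f2):
    (s1, m1, e1), (s2, m2, e2) = f1, f2
    s3 = s1 ^ s2                      # sign of the product (assumed rule)
    e3 = e1 + e2 - BIAS               # step 1: add exponents, remove one bias
    if e3 > E_MAX:
        raise OverflowError("exponent overflow")
    if e3 < E_MIN:
        raise ArithmeticError("exponent underflow")
    m3 = m1 * m2                      # step 2: multiply significands
    if m3 < Fraction(1, BETA):        # post-normalization: one left shift
        m3 *= BETA
        e3 -= 1
    if e3 < E_MIN:                    # step 3: re-check after the shift
        raise ArithmeticError("exponent underflow")
    return s3, m3, e3

a = (0, Fraction(1, 2), 9)    # value  1
b = (1, Fraction(3, 4), 10)   # value -3
p = fp_multiply(a, b)
print(p, value(*p))
```

Because both significands lie in [1/β, 1), their product lies in [1/β², 1), so a single shift is always enough in step 2.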

4.3.2. Addition/Subtraction

Let two floating-point numbers be given by

$F_1 = (-1)^{S_1} \times M_1 \times \beta^{E_1 - bias}$,
$F_2 = (-1)^{S_2} \times M_2 \times \beta^{E_2 - bias}$;

then the addition/subtraction operation can be performed as follows.

1. Assume that $E_1 \ge E_2$ and compute

   $F_3 = F_1 \pm F_2 = \left[ (-1)^{S_1} M_1 \pm (-1)^{S_2} M_2 \times \beta^{-(E_1 - E_2)} \right] \times \beta^{E_1 - bias}$.

2. Let $(-1)^{S_1} M_1 \pm (-1)^{S_2} M_2 \times \beta^{-(E_1 - E_2)}$ be denoted as $M_3$. If $|M_3| < 1/\beta$ or $|M_3| \ge 1$, then post-normalization is needed.
3. If post-normalization takes place, we need to check whether the final exponent $E_3$ overflows or underflows.
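
The addition steps can be sketched in the same spirit. The toy parameters (β = 2, bias = 8, biased exponents 0 to 15) are hypothetical, significands are exact fractions so the bits lost during alignment are not modelled, and the handling of an exactly zero result is an assumption made here because the steps do not cover it. Subtraction is obtained by flipping $S_2$ before calling the function.

```python
from fractions import Fraction

BETA, BIAS = 2, 8        # hypothetical toy format
E_MIN, E_MAX = 0, 15     # hypothetical range of the biased exponent

def value(s: int, m: Fraction, e: int) -> Fraction:
    return (-1) ** s * m * Fraction(BETA) ** (e - BIAS)

def fp_add(f1, f2):
    (s1, m1, e1), (s2, m2, e2) = f1, f2
    if e1 < e2:                       # make sure E1 >= E2
        (s1, m1, e1), (s2, m2, e2) = (s2, m2, e2), (s1, m1, e1)
    # Step 1: align the smaller operand and add the signed significands.
    m3 = (-1) ** s1 * m1 + (-1) ** s2 * m2 * Fraction(1, BETA ** (e1 - e2))
    e3 = e1
    if m3 == 0:
        return 0, Fraction(0), E_MIN  # assumed encoding of an exact zero
    s3, m3 = (1 if m3 < 0 else 0), abs(m3)
    # Step 2: post-normalize so that 1/beta <= M3 < 1.
    while m3 >= 1:
        m3 /= BETA
        e3 += 1
    while m3 < Fraction(1, BETA):
        m3 *= BETA
        e3 -= 1
    # Step 3: check the final exponent.
    if e3 > E_MAX:
        raise OverflowError("exponent overflow")
    if e3 < E_MIN:
        raise ArithmeticError("exponent underflow")
    return s3, m3, e3

a = (0, Fraction(3, 4), 10)   # value 3
b = (0, Fraction(1, 2), 9)    # value 1
c = (1, Fraction(5, 8), 10)   # value -5/2
print(fp_add(a, b), fp_add(a, c))
```

Adding a and b needs one right shift of the sum (carry out of the significand), while adding a and c cancels leading bits and needs two left shifts.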

4.4. Rounding Schemes

Rounding is a technique to obtain a low-precision representation from a given high-precision representation:

High-precision → Rounding → Low-precision

We assume that the input to a rounding scheme has m integer bits and d fractional bits, $x_{m-1} \dots x_0 . x_{-1} \dots x_{-d}$, and the output usually contains only integer bits, $y_{m-1} \dots y_0$:

$x_{m-1} \dots x_0 . x_{-1} \dots x_{-d}$ → Rounding → $y_{m-1} \dots y_0 .$

Let X and Y(X) be the input and the output of a rounding scheme, respectively. The rounding error is defined as R(X) = Y(X) − X. We measure the accuracy of the rounding results by computing the maximum errors and the bias of the scheme, where the bias is defined as the average error for a block of $2^d$ numbers including all the possible inputs to the rounding scheme.

Criteria for choosing a good rounding scheme:
1. Accuracy of results: small maximum errors; small bias; small variation.
2. Cost of implementation and speed.

4.4.1. Truncation or chopping: chop(x)

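1. Definition: chop(x) (= [x]) = $\lfloor x \rfloor$.
2. Truth table: chopping scheme with d = 2, where × stands for the integer bits, which are left unchanged.

| Input: x | Output: chop(x) | Error: chop(x) − x |
|---|---|---|
| ×.00 | ×. | 0 |
| ×.01 | ×. | −1/4 |
| ×.10 | ×. | −1/2 |
| ×.11 | ×. | −3/4 |

3. Error and bias: It can be seen from the above table that the maximal error is $e^-_{max} = -3/4$ and

   Bias = $\frac{1}{4}\left(0 - \frac{1}{4} - \frac{1}{2} - \frac{3}{4}\right) = -\frac{3}{8}$.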

4. Implementation: It is fast and cost free.
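
The error column and the bias for chopping with d = 2 can be recomputed with exact fractions. This is a small sketch; the variable names are illustrative, and only the fractional part is enumerated because the integer bits do not affect the error.

```python
import math
from fractions import Fraction

def chop(x: Fraction) -> int:
    """Truncation: drop the fractional bits, i.e. floor(x)."""
    return math.floor(x)

D = 2                                               # number of fractional bits
inputs = [Fraction(k, 2**D) for k in range(2**D)]   # .00, .01, .10, .11
errors = [chop(x) - x for x in inputs]
bias = sum(errors) / len(errors)

print(errors)        # one error per table row
print(min(errors))   # maximal (negative) error
print(bias)
```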

4.4.2. Round to nearest integer: round(x)

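1. Definition: round(x) = $\lfloor x + 0.5 \rfloor$.
2. Truth table: round-to-nearest scheme with d = 2, where × stands for the integer bits.

| Input: x | Output: round(x) | Error: round(x) − x |
|---|---|---|
| ×.00 | ×. | 0 |
| ×.01 | ×. | −1/4 |
| ×.10 | ×. + 1 | +1/2 |
| ×.11 | ×. + 1 | +1/4 |

3. Error and bias: It can be seen from the above table that the maximal error is $e^+_{max} = +1/2$ and

   Bias = $\frac{1}{4}\left(0 - \frac{1}{4} + \frac{1}{2} + \frac{1}{4}\right) = +\frac{1}{8}$.

4. Implementation: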
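
As a check on the round-to-nearest truth table and bias, the definition $\lfloor x + 0.5 \rfloor$ can be evaluated with exact fractions. The helper name `round_half_up` is chosen here for illustration; Python's built-in `round` is deliberately not used for the scheme itself because it resolves ties toward the even integer.

```python
import math
from fractions import Fraction

def round_half_up(x: Fraction) -> int:
    """Round to nearest, ties upward: floor(x + 1/2)."""
    return math.floor(x + Fraction(1, 2))

D = 2                                               # number of fractional bits
inputs = [Fraction(k, 2**D) for k in range(2**D)]   # .00, .01, .10, .11
errors = [round_half_up(x) - x for x in inputs]
bias = sum(errors) / len(errors)

print(errors)
print(max(errors))   # maximal (positive) error
print(bias)
```

The positive bias comes entirely from the tie case .10, which is always rounded up.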

4.4.3. Round to nearest even integer: rtne(x)

1. Definition: Round to the nearest integer; in a tie case, round to the nearest even integer.
2. Truth table: round-to-nearest-even scheme with d = 2.

| Input: x | Output: rtne(x) | Error: rtne(x) − x |
|---|---|---|
| 0.00 | 0. | 0 |
| 0.01 | 0. | −1/4 |
| 0.10 | 0. | −1/2 |
| 0.11 | 1. | +1/4 |
| 1.00 | 1. | 0 |
| 1.01 | 1. | −1/4 |
| 1.10 | 1. + 1 | +1/2 |
| 1.11 | 1. + 1 | +1/4 |

3. Errors and bias: It can be seen from the above table that the maximal errors are $e^+_{max} = +1/2$ and $e^-_{max} = -1/2$, and

   Bias = $\frac{1}{8}\left(0 - \frac{1}{4} - \frac{1}{2} + \frac{1}{4} + 0 - \frac{1}{4} + \frac{1}{2} + \frac{1}{4}\right) = 0$.

4. Implementation:
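
The round-to-nearest-even table can be reproduced with a short sketch. The function `rtne` below is an illustrative implementation written for this check; it enumerates the eight inputs 0.00 through 1.11 so that both an even and an odd integer part are covered.

```python
import math
from fractions import Fraction

def rtne(x: Fraction) -> int:
    """Round to nearest integer; on a tie, pick the even neighbour."""
    lower = math.floor(x)
    frac = x - lower
    half = Fraction(1, 2)
    if frac > half or (frac == half and lower % 2 == 1):
        return lower + 1
    return lower

inputs = [Fraction(k, 4) for k in range(8)]   # 0.00, 0.01, ..., 1.11
errors = [rtne(x) - x for x in inputs]
bias = sum(errors) / len(errors)

print(errors)
print(min(errors), max(errors))   # the two maximal errors
print(bias)
```

The two tie cases 0.10 and 1.10 are rounded in opposite directions, so their errors cancel and the bias is zero.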

4.4.4. ROM Rounding: ROM(x)

1. Definition: ROM rounding is given by

   $x_{m-1} \dots x_{\ell-1}\, x_{\ell-2} \dots x_0 . x_{-1} x_{-2} \dots x_{-d}$ → ROM(x) → $x_{m-1} \dots x_{\ell-1}\, y_{\ell-2} \dots y_0 .$

   Note that ROM(x) takes only $\ell$ bits as input, $x_{\ell-2} \dots x_0 . x_{-1}$, and generates $\ell - 1$ output bits, $y_{\ell-2} \dots y_0$.

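2. Truth table: ROM scheme with $\ell = 3$.

| Input: x | Output: ROM(x) | Error: ROM(x) − x |
|---|---|---|
| 00.0 | 00. | 0 |
| 00.1 | 01. | +1/2 |
| 01.0 | 01. | 0 |
| 01.1 | 10. | +1/2 |
| 10.0 | 10. | 0 |
| 10.1 | 11. | +1/2 |
| 11.0 | 11. | 0 |
| 11.1 | 11. | −1/2 |

3. Error and bias: It can be seen from the above table that the maximal errors are $e^+_{max} = +1/2$ and $e^-_{max} = -1/2$, and

   Bias = $\frac{1}{8}\left(0 + \frac{1}{2} + 0 + \frac{1}{2} + 0 + \frac{1}{2} + 0 - \frac{1}{2}\right) = +\frac{1}{8}$.

4. Implementation: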
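
The ROM truth table for ℓ = 3 can be generated and checked with a short sketch. The rule used to fill the table (round up when the fractional bit is 1, except for the all-ones input, which is truncated so that no carry leaves the ROM) is inferred from the table entries; `build_rom` is a name introduced here for illustration.

```python
from fractions import Fraction

def build_rom(ell: int) -> list[int]:
    """Lookup table addressed by the ell bits x_{ell-2} ... x_0 . x_{-1}."""
    all_ones = 2 ** (ell - 1) - 1          # largest (ell-1)-bit integer part
    table = []
    for addr in range(2**ell):
        integer, frac_bit = addr >> 1, addr & 1
        if frac_bit == 1 and integer != all_ones:
            table.append(integer + 1)      # round up
        else:
            table.append(integer)          # keep, or truncate the all-ones case
    return table

ELL = 3
ROM = build_rom(ELL)
# The address read as a fixed-point number with one fractional bit is addr / 2.
errors = [ROM[addr] - Fraction(addr, 2) for addr in range(2**ELL)]
bias = sum(errors) / len(errors)

print([format(y, "02b") for y in ROM])   # output column of the table
print(errors)
print(bias)
```

Because the output never exceeds ℓ − 1 bits, the upper bits $x_{m-1} \dots x_{\ell-1}$ pass through unchanged and no carry-propagate adder is needed.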