CH 04

Fall 2013
Computer Arithmetic
H. Wu
Chapter 4. Binary Floating-Point Numbers

4.1. Floating-Point Representations
Machine representation of real numbers:
- Fixed-point (discussed in Chapter 1)
- Floating-point

Compared to fixed-point numbers, floating-point numbers are efficient in representing both very large and very small numbers. A floating-point representation F has three parts:
- the sign, s;
- the significand (or mantissa), f;
- the exponent, e.

The bits of the floating-point number F are stored in a register or a memory unit as follows:

| s : sign | e : exponent | f : significand |

The value of such a floating-point number F is given by

$$F = (-1)^s \times f \times \beta^{e},$$

where $\beta$ is the base of the exponent, which is implied by the system. $\beta$ is usually chosen as a power of 2. The representation range is $[-F_{\max}, -F_{\min}]$ and $[F_{\min}, F_{\max}]$, where $F_{\max}$ and $F_{\min}$ are given by

$$F_{\max} = f_{\max} \times \beta^{e_{\max}}, \qquad F_{\min} = f_{\min} \times \beta^{e_{\min}}.$$
An overflow occurs if a result is larger than $F_{\max}$ or smaller than $-F_{\max}$. An underflow occurs if a result is nonzero and lies in the interval $(-F_{\min}, F_{\min})$.
4.2. IEEE 754-2008 Standard
There are four formats in the IEEE 754-2008 floating-point standard:
1. Single-precision format (32-bit)
2. Double-precision format (64-bit)
3. Single extended format (≥ 44-bit)
4. Double extended format (≥ 80-bit)
Here we introduce the first two formats: the IEEE single-precision format and the IEEE double-precision format.
4.2.1. IEEE single-precision (32 bits)
An IEEE single-precision floating-point number F has three parts, s, e and f, as shown below (32 bits in total).

| s : sign (1 bit) | e : exponent (8 bits) | f : significand or mantissa (23 bits) |

An IEEE floating-point number F can be evaluated by

$$F = (-1)^s \times 1.f \times 2^{e-127}. \qquad (4.1)$$
Note in (4.1) that there is a hidden 1 that is not stored in the representation of F. The largest and smallest positive normalized values of F are

$$F_{\max} = (2 - 2^{-23}) \times 2^{254-127} = (1 - 2^{-24}) \times 2^{128}, \qquad F_{\min} = 1 \times 2^{1-127} = 2^{-126}.$$
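How to represent zero? Use the reserved formats as shown in the table below.

Value or meaning of an IEEE single-precision floating-point representation:

| Exponent e | Fraction f | Value or meaning of F |
|---|---|---|
| $e = 0$ | $f = 0$ | $F = 0$ |
| $e = 0$ | $f \ne 0$ | F is a subnormal number ($= 0.f \times 2^{-126}$) |
| $e = 255$ | $f = 0$ | $F = \pm\infty$ |
| $e = 255$ | $f \ne 0$ | F is NaN (Not a Number) |
| $1 \le e \le 254$ | any | F is an ordinary number and $F = (-1)^s \times 1.f \times 2^{e-127}$ |

Example 4.1 (omitted)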
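The single-precision table can be turned into a small decoder. The following is a minimal Python sketch, not part of the original notes: the function names `decode_single` and `hardware_single` are hypothetical, and the code assumes that Python's `float` is an IEEE double, which holds every single-precision value exactly.

```python
import struct

def decode_single(bits: int) -> float:
    """Evaluate a 32-bit pattern case by case, following the table."""
    s = (bits >> 31) & 0x1        # 1-bit sign
    e = (bits >> 23) & 0xFF       # 8-bit biased exponent
    f = bits & 0x7FFFFF           # 23-bit fraction field
    sign = -1.0 if s else 1.0
    if e == 255:                  # reserved exponent: infinity or NaN
        return sign * float("inf") if f == 0 else float("nan")
    if e == 0:                    # reserved exponent: zero or subnormal, 0.f * 2^-126
        return sign * (f / 2**23) * 2.0**-126
    # ordinary number: hidden 1, i.e. 1.f * 2^(e-127)
    return sign * (1 + f / 2**23) * 2.0**(e - 127)

def hardware_single(bits: int) -> float:
    """Reference value: reinterpret the same 32 bits as a machine float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

For instance, the pattern `0x7F7FFFFF` (e = 254, f all ones) decodes to the $F_{\max}$ of Section 4.2.1, and `0x00800000` (e = 1, f = 0) decodes to $F_{\min}$.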
4.2.2. IEEE double-precision (64 bits)
An IEEE double-precision floating-point number F also has three parts, s, e and f, as shown below (64 bits in total).

| s : sign (1 bit) | e : exponent (11 bits) | f : significand or mantissa (52 bits) |

An IEEE floating-point number F can be evaluated by

$$F = (-1)^s \times 1.f \times 2^{e-1023}. \qquad (4.2)$$
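Note that a hidden 1, not stored in the representation of F, also exists in the double-precision case. The largest and smallest positive normalized values of F are

$$F_{\max} = (2 - 2^{-52}) \times 2^{2046-1023} = (1 - 2^{-53}) \times 2^{1024}, \qquad F_{\min} = 1 \times 2^{1-1023} = 2^{-1022}.$$

The value or meaning of an IEEE double-precision number F is given in the following table. With an 11-bit exponent field, the reserved all-ones exponent is $e = 2047$ and ordinary numbers use $1 \le e \le 2046$, consistent with the exponent 2046 used in $F_{\max}$ above.

Value or meaning of an IEEE double-precision floating-point representation:

| Exponent e | Fraction f | Value or meaning of F |
|---|---|---|
| $e = 0$ | $f = 0$ | $F = 0$ |
| $e = 0$ | $f \ne 0$ | F is a subnormal number ($= 0.f \times 2^{-1022}$) |
| $e = 2047$ | $f = 0$ | $F = \pm\infty$ |
| $e = 2047$ | $f \ne 0$ | F is NaN (Not a Number) |
| $1 \le e \le 2046$ | any | F is an ordinary number and $F = (-1)^s \times 1.f \times 2^{e-1023}$ |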
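The double-precision limits and field values can be checked on a real machine. This is a minimal sketch under the assumption that Python's `float` is an IEEE double (true on mainstream platforms); the helper name `fields_double` and the constants `F_MAX` and `F_MIN` are hypothetical.

```python
import struct
import sys

def fields_double(x: float):
    """Split a double into its (s, e, f) fields: 1, 11 and 52 bits."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = bits >> 63
    e = (bits >> 52) & 0x7FF
    f = bits & ((1 << 52) - 1)
    return s, e, f

# Limits computed from the formulas of this section.
F_MAX = (2 - 2.0**-52) * 2.0**(2046 - 1023)
F_MIN = 1.0 * 2.0**(1 - 1023)
```

Comparing `F_MAX` and `F_MIN` with `sys.float_info.max` and `sys.float_info.min`, and inspecting `fields_double(float("inf"))`, gives a quick cross-check of the exponent values in the table.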
Example 4.2 (omitted)
4.3. Floating-Point Operations

4.3.1. Multiplication
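Given two numbers $F_1 = (-1)^{S_1} \times M_1 \times \beta^{E_1 - \text{bias}}$ and $F_2 = (-1)^{S_2} \times M_2 \times \beta^{E_2 - \text{bias}}$, assume that both are in normalized form, i.e. $1/\beta \le M_1, M_2 < 1$. Then the product $F_3 = F_1 \times F_2 = (-1)^{S_3} \times M_3 \times \beta^{E_3 - \text{bias}}$, with sign $S_3 = S_1 \oplus S_2$, can be obtained as follows.

1. Calculate $E_3 = E_1 + E_2 - \text{bias}$. The bias is subtracted because each of $E_1$ and $E_2$ already contains it once. IF ($E_3 > E_{\max}$) THEN overflow ELSE IF ($E_3 < E_{\min}$) THEN underflow.
2. Calculate $M_3 = M_1 \times M_2$. IF ($M_3 < 1/\beta$) THEN $M_3 = M_3 \times \beta$ and $E_3 = E_3 - 1$.
3. IF ($E_3 < E_{\min}$) THEN underflow.

The second step above is called post-normalization.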
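The three multiplication steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's implementation: a number is modelled as a hypothetical triple `(S, M, E)` with an exact `Fraction` significand in $[1/\beta, 1)$, so no significand rounding takes place, and the constants `BETA`, `BIAS`, `E_MIN` and `E_MAX` are chosen arbitrarily for the example.

```python
from fractions import Fraction

BETA, BIAS = 2, 127        # assumed base and exponent bias (illustrative)
E_MIN, E_MAX = 1, 254      # assumed range of the biased exponent

def fp_multiply(a, b):
    """Multiply two (S, M, E) triples with 1/BETA <= M < 1."""
    (s1, m1, e1), (s2, m2, e2) = a, b
    s3 = s1 ^ s2                       # sign of the product
    e3 = e1 + e2 - BIAS                # step 1: the bias is counted twice, remove one
    if e3 > E_MAX:
        raise OverflowError("exponent overflow")
    if e3 < E_MIN:
        raise ArithmeticError("exponent underflow")
    m3 = m1 * m2                       # step 2: product lies in [1/BETA^2, 1)
    if m3 < Fraction(1, BETA):         # post-normalization: one left shift suffices
        m3 *= BETA
        e3 -= 1
    if e3 < E_MIN:                     # step 3: the shift may push E3 below E_MIN
        raise ArithmeticError("exponent underflow")
    return s3, m3, e3

def value(t):
    """Exact value (-1)^S * M * BETA^(E - BIAS) of a triple."""
    s, m, e = t
    return (-1) ** s * m * Fraction(BETA) ** (e - BIAS)
```

For example, multiplying `(0, Fraction(1, 2), 128)` (the value 1) by `(1, Fraction(3, 4), 129)` (the value -3) first yields the significand 3/8, which is below 1/2 and therefore triggers the post-normalization shift.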
4.3.2. Addition/Subtraction
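Let two floating-point numbers be given by

$$F_1 = (-1)^{S_1} \times M_1 \times \beta^{E_1 - \text{bias}}, \qquad F_2 = (-1)^{S_2} \times M_2 \times \beta^{E_2 - \text{bias}}.$$

Then the addition/subtraction operation can be performed as follows.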
1. Assume that $E_1 \ge E_2$ and compute

$$F_3 = F_1 \pm F_2 = \left[ (-1)^{S_1} M_1 \pm (-1)^{S_2} M_2 \times \beta^{-(E_1 - E_2)} \right] \times \beta^{E_1 - \text{bias}}.$$

2. Let $(-1)^{S_1} M_1 \pm (-1)^{S_2} M_2 \times \beta^{-(E_1 - E_2)}$ be denoted as $M_3$. If $|M_3| < 1/\beta$ or $|M_3| \ge 1$, then post-normalization is needed.
3. If post-normalization is performed, check whether the final exponent $E_3$ overflows or underflows.
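The alignment and post-normalization steps for addition can be sketched in the same style. This is a minimal sketch, assuming the hypothetical triple `(S, M, E)` with an exact `Fraction` significand in $[1/\beta, 1)$ and illustrative constants `BETA`, `BIAS`, `E_MIN`, `E_MAX`. Returning an all-zero triple on exact cancellation is a choice made here for illustration, not something the notes specify. Subtraction is obtained by flipping the sign bit of the second operand.

```python
from fractions import Fraction

BETA, BIAS = 2, 127        # assumed base and exponent bias (illustrative)
E_MIN, E_MAX = 1, 254      # assumed range of the biased exponent

def fp_add(a, b):
    """Add two (S, M, E) triples with 1/BETA <= M < 1."""
    (s1, m1, e1), (s2, m2, e2) = a, b
    if e1 < e2:                        # step 1 assumes E1 >= E2, so swap if needed
        (s1, m1, e1), (s2, m2, e2) = (s2, m2, e2), (s1, m1, e1)
    # Align the smaller operand, then add the signed significands.
    m3 = (-1) ** s1 * m1 + (-1) ** s2 * m2 * Fraction(1, BETA) ** (e1 - e2)
    e3 = e1
    if m3 == 0:                        # exact cancellation (illustrative convention)
        return 0, Fraction(0), 0
    s3 = 0 if m3 > 0 else 1
    m3 = abs(m3)
    while m3 >= 1:                     # step 2: post-normalize with a right shift
        m3 /= BETA
        e3 += 1
    while m3 < Fraction(1, BETA):      # step 2: post-normalize with left shifts
        m3 *= BETA
        e3 -= 1
    if e3 > E_MAX:                     # step 3: check the final exponent
        raise OverflowError("exponent overflow")
    if e3 < E_MIN:
        raise ArithmeticError("exponent underflow")
    return s3, m3, e3

def value(t):
    """Exact value (-1)^S * M * BETA^(E - BIAS) of a triple."""
    s, m, e = t
    return (-1) ** s * m * Fraction(BETA) ** (e - BIAS)
```

A sum of two same-sign operands needs at most one right shift, while a subtraction of nearly equal operands may need several left shifts, which is why the second loop is not limited to one iteration.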
4.4. Rounding Schemes
Rounding is a technique to obtain a low-precision representation from a given high-precision representation:

High-precision → Rounding → Low-precision

We assume that the input to a rounding scheme has m integer bits and d fractional bits, $x_{m-1} \ldots x_0 . x_{-1} \ldots x_{-d}$, and that the output usually contains only integer bits, $y_{m-1} \ldots y_0$:

$$x_{m-1} \ldots x_0 . x_{-1} \ldots x_{-d} \;\xrightarrow{\ \text{Rounding}\ }\; y_{m-1} \ldots y_0 .$$

Let X and Y(X) be the input and the output of a rounding scheme, respectively. The rounding error is defined as $R(X) = Y(X) - X$. We measure the accuracy of the rounding results by computing the maximum errors and the bias of the scheme, where the bias is defined as the average error over a block of $2^d$ numbers that includes all the possible inputs to the rounding scheme.

Criteria for choosing a good rounding scheme:
1. Accuracy of results: small maximum errors; small bias; small variation.
2. Cost of implementation and speed.
4.4.1. Truncation or chopping: chop(x)
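1. Definition: $\text{chop}(x)\ (= [x]) = \lfloor x \rfloor$.
2. Truth table: chopping scheme with d = 2.

| Input: fractional bits of x | Output: chop(x) | Error: chop(x) − x |
|---|---|---|
| .00 | $\lfloor x \rfloor$ | 0 |
| .01 | $\lfloor x \rfloor$ | −1/4 |
| .10 | $\lfloor x \rfloor$ | −1/2 |
| .11 | $\lfloor x \rfloor$ | −3/4 |

3. Error and bias: It can be seen from the above table that the maximal error is $e^{-}_{\max} = -3/4$ and

$$\text{Bias} = \frac{1}{4}\left(0 - \frac{1}{4} - \frac{1}{2} - \frac{3}{4}\right) = -\frac{3}{8}.$$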
4. Implementation: It is fast and cost free.
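The error and bias figures for chopping can be reproduced by enumerating all $2^d$ fractional patterns. The following is a minimal sketch using exact fractions; the helper name `errors_and_bias` is hypothetical and simply applies the definition of bias from Section 4.4.

```python
from fractions import Fraction
from math import floor

def chop(x):
    """Truncation: drop the fractional bits, i.e. take the floor of x."""
    return floor(x)

def errors_and_bias(scheme, d):
    """Errors scheme(x) - x over the 2^d fractional patterns, and their average."""
    inputs = [Fraction(k, 2**d) for k in range(2**d)]
    errors = [scheme(x) - x for x in inputs]
    return errors, sum(errors) / len(errors)

errors, bias = errors_and_bias(chop, 2)
```

With d = 2 the list `errors` holds the four entries of the error column and `bias` equals −3/8. Increasing d moves the bias toward −1/2, for example −7/16 for d = 3.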
4.4.2. Round to nearest integer: round(x)
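1. Definition: $\text{round}(x) = \lfloor x + 0.5 \rfloor$.
2. Truth table: round-to-nearest scheme with d = 2.

| Input: fractional bits of x | Output: round(x) | Error: round(x) − x |
|---|---|---|
| .00 | $\lfloor x \rfloor$ | 0 |
| .01 | $\lfloor x \rfloor$ | −1/4 |
| .10 | $\lfloor x \rfloor + 1$ | +1/2 |
| .11 | $\lfloor x \rfloor + 1$ | +1/4 |

3. Error and bias: It can be seen from the above table that the maximal error is $e^{+}_{\max} = +1/2$ and

$$\text{Bias} = \frac{1}{4}\left(0 - \frac{1}{4} + \frac{1}{2} + \frac{1}{4}\right) = +\frac{1}{8}.$$

4. Implementation: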
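The definition $\lfloor x + 0.5 \rfloor$ suggests one possible realization: add a 1 at the position of bit $x_{-1}$ and then chop. The sketch below is an assumption about how this might be done, not the implementation from the notes. The names `round_nearest` and `round_nearest_bits` are hypothetical, and `raw` denotes the fixed-point number read as an integer with `d >= 1` fractional bits.

```python
from fractions import Fraction
from math import floor

def round_nearest(x):
    """Round to nearest by the definition floor(x + 1/2); ties go upward."""
    return floor(x + Fraction(1, 2))

def round_nearest_bits(raw, d):
    """Same operation on a raw fixed-point integer with d >= 1 fractional bits:
    add 1 at the position of x_{-1}, then drop the d fractional bits."""
    return (raw + (1 << (d - 1))) >> d

# Errors and bias over the 2^d fractional patterns for d = 2.
errors = [round_nearest(Fraction(k, 4)) - Fraction(k, 4) for k in range(4)]
bias = sum(errors) / len(errors)
```

The positive bias of 1/8 comes entirely from the tie case .10, which is always rounded upward.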
4.4.3. Round to nearest even integer: rtne(x)
1. Definition: Round to the nearest integer; in a tie case, round to the nearest even integer.
2. Truth table: round-to-nearest-even scheme with d = 2.

| Input: x | Output: rtne(x) | Error: rtne(x) − x |
|---|---|---|
| 0.00 | 0 | 0 |
| 0.01 | 0 | −1/4 |
| 0.10 | 0 | −1/2 |
| 0.11 | 1 | +1/4 |
| 1.00 | 1 | 0 |
| 1.01 | 1 | −1/4 |
| 1.10 | 1 + 1 | +1/2 |
| 1.11 | 1 + 1 | +1/4 |

3. Errors and bias: It can be seen from the above table that the maximal errors are $e^{+}_{\max} = +1/2$ and $e^{-}_{\max} = -1/2$, and

$$\text{Bias} = \frac{1}{8}\left(0 - \frac{1}{4} - \frac{1}{2} + \frac{1}{4} + 0 - \frac{1}{4} + \frac{1}{2} + \frac{1}{4}\right) = 0.$$

4. Implementation:
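The tie-breaking rule can be written out explicitly. The following is a minimal sketch with exact fractions; the function name `rtne` follows the section title, while the decomposition into `lower` and `frac` is an illustrative choice rather than the implementation from the notes.

```python
from fractions import Fraction
from math import floor

def rtne(x):
    """Round to nearest; on a tie, pick the even neighbour."""
    lower = floor(x)
    frac = x - lower
    half = Fraction(1, 2)
    # Round up when clearly above the midpoint, or on a tie with an odd lower neighbour.
    if frac > half or (frac == half and lower % 2 == 1):
        return lower + 1
    return lower

# Errors over the eight inputs 0.00 ... 1.11 of the truth table (d = 2).
inputs = [Fraction(k, 4) for k in range(8)]
errors = [rtne(x) - x for x in inputs]
bias = sum(errors) / len(errors)
```

The two tie cases 0.10 and 1.10 contribute −1/2 and +1/2, so they cancel and the bias is zero. Python's built-in `round` applied to a `Fraction` uses the same tie rule, which gives an independent check.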
4.4.4. ROM Rounding: ROM(x)
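1. Definition: ROM rounding is given by

$$x_{m-1} \ldots x_{\ell-1}\, x_{\ell-2} \ldots x_0 . x_{-1} x_{-2} \ldots x_{-d} \;\xrightarrow{\ \text{ROM}(x)\ }\; x_{m-1} \ldots x_{\ell-1}\, y_{\ell-2} \ldots y_0 .$$

Note that ROM(x) takes only $\ell$ bits as input, $x_{\ell-2} \ldots x_0 . x_{-1}$, and generates $\ell - 1$ output bits, $y_{\ell-2} \ldots y_0$.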
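2. Truth table: ROM scheme with $\ell = 3$.

| Input: x | Output: ROM(x) | Error: ROM(x) − x |
|---|---|---|
| 00.0 | 00 | 0 |
| 00.1 | 01 | +1/2 |
| 01.0 | 01 | 0 |
| 01.1 | 10 | +1/2 |
| 10.0 | 10 | 0 |
| 10.1 | 11 | +1/2 |
| 11.0 | 11 | 0 |
| 11.1 | 11 | −1/2 |

3. Error and bias: It can be seen from the above table that the maximal errors are $e^{+}_{\max} = +1/2$ and $e^{-}_{\max} = -1/2$, and

$$\text{Bias} = \frac{1}{8}\left(0 + \frac{1}{2} + 0 + \frac{1}{2} + 0 + \frac{1}{2} + 0 - \frac{1}{2}\right) = \frac{1}{8}.$$

4. Implementation: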
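The truth table for $\ell = 3$ can be stored directly as a lookup table. The following is a minimal sketch, not the implementation from the notes: the names `ROM_TABLE` and `rom_round` are hypothetical, and `raw` is assumed to be a non-negative fixed-point number read as an integer with `d >= 1` fractional bits.

```python
from fractions import Fraction

ELL = 3   # number of ROM address bits: x1 x0 . x-1

# ROM contents copied from the truth table: address (x1 x0 x-1) -> data (y1 y0).
# Every entry rounds to nearest except address 111, which is truncated so that
# no carry has to propagate into the untouched upper bits.
ROM_TABLE = [0b00, 0b01, 0b01, 0b10, 0b10, 0b11, 0b11, 0b11]

def rom_round(raw, d):
    """ROM-round a non-negative fixed-point integer with d >= 1 fractional bits."""
    addr = (raw >> (d - 1)) & ((1 << ELL) - 1)   # bits x_{ELL-2} ... x_0 . x_{-1}
    upper = raw >> (d + ELL - 1)                 # bits passed through unchanged
    return (upper << (ELL - 1)) | ROM_TABLE[addr]

# Errors over the eight table inputs 00.0 ... 11.1 (d = 1).
errors = [rom_round(raw, 1) - Fraction(raw, 2) for raw in range(8)]
bias = sum(errors) / len(errors)
```

Because the upper bits bypass the table, an input such as 111.1 yields 111 rather than 1000, which is what keeps the scheme free of carry propagation at the cost of the single −1/2 error.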
