1972

Finite Word Length Effects in Digital Filters
When digital signal processing operations are implemented on a computer or with special-purpose
hardware, the numbers and coefficients are stored in finite-length registers. The errors and
constraints due to finite word length are unavoidable. The coefficients and numbers are quantized
by truncation or rounding when they are stored.
The main errors that arise from the quantization of numbers are
a) Input quantization error
b) Product quantization error
c) Coefficient quantization error
Input quantization error : It arises when a continuous-time signal is converted into digital
values, because each input sample is represented by a fixed number of digits in the A/D
conversion process.
Product quantization error : Registers are the basic storage devices in a digital system. The
maximum size of the binary information that can be stored in a register is called the register word
length. If a register stores 8-bit data, then its word length is 8 bits. While performing
calculations, the size of a result may exceed the size of the register used for storing it.
Example: Multiplying b-bit data by a b-bit coefficient gives a product having 2b bits. Since a
b-bit register is used, the multiplier output must be truncated or rounded to fit the register,
which produces the product quantization error. This error appears at the output of the multiplier.
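The product quantization step described above can be sketched in a few lines. This is only an illustration under assumed values: the operands 5/8 and 3/8, the word length b = 3 and the helper name `quantize` are not from the notes, and exact fractions are used so that the dropped bits are visible.

```python
from fractions import Fraction
import math

def quantize(value, b, mode="round"):
    """Quantize a non-negative fraction to b fraction bits (hypothetical helper)."""
    scaled = Fraction(value) * 2**b
    if mode == "truncate":
        q = math.floor(scaled)                    # drop the extra bits
    else:
        q = math.floor(scaled + Fraction(1, 2))   # round half up
    return Fraction(q, 2**b)

b = 3
x = Fraction(5, 8)    # 0.101, b-bit data
a = Fraction(3, 8)    # 0.011, b-bit coefficient
product = x * a       # 0.001111, needs 2b = 6 fraction bits

truncated = quantize(product, b, "truncate")
rounded = quantize(product, b, "round")
print("product  :", product)
print("truncated:", truncated, "error", truncated - product)
print("rounded  :", rounded, "error", rounded - product)
```

The difference between the stored value and the exact 2b-bit product is the product quantization error for that multiplication.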
Coefficient quantization error : The filter coefficients are computed to infinite precision in
theory, but they must be stored in finite-length registers. Quantizing the filter coefficients
disturbs the locations of the filter poles and zeros. Therefore, the frequency response of the
resulting filter may differ from the desired response, and sometimes the filter may fail to meet
the desired specifications. If the poles of the desired filter are close to the unit circle, then
those of the filter with quantized coefficients may lie just outside the unit circle, leading to
instability. This deterministic frequency response error is referred to as coefficient
quantization error.
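A small numerical sketch of this effect, under assumed values: the second-order denominator 1 + a1 z^-1 + a2 z^-2 with a1 = -1.9, a2 = 0.99 and the 4 fraction bits are chosen only for illustration, and `quantize_coeff` and `pole_radius` are hypothetical helper names.

```python
import cmath

def quantize_coeff(c, frac_bits):
    """Round a coefficient to the nearest multiple of 2**-frac_bits."""
    step = 2.0 ** -frac_bits
    return round(c / step) * step

def pole_radius(a1, a2):
    """Largest pole magnitude of H(z) = 1 / (1 + a1 z^-1 + a2 z^-2)."""
    disc = cmath.sqrt(a1 * a1 - 4 * a2)
    return max(abs((-a1 + disc) / 2), abs((-a1 - disc) / 2))

a1, a2 = -1.9, 0.99                       # assumed ideal coefficients
q1, q2 = quantize_coeff(a1, 4), quantize_coeff(a2, 4)

print("ideal pole radius    :", pole_radius(a1, a2))
print("quantized coeffs     :", q1, q2)
print("quantized pole radius:", pole_radius(q1, q2))
```

In this example the ideal poles sit just inside the unit circle, while rounding a2 up to 1.0 places the poles of the quantized filter on the unit circle, so the filter is no longer strictly stable.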
Representation of a Number
The number N can be represented to any desired accuracy by the following finite series
$$N = \sum_{i=-A}^{B} d_i\, r^{-i} \quad \text{or} \quad N = \sum_{i=n_1}^{n_2} d_i\, r^{i}$$
where r is the radix, A = No. of integer bits, B = No. of fraction bits, and
$d_i$ = ith digit of the number. The binary digit $d_{-A}$ is the MSB and $d_B$ is the LSB of the binary number N.
r=10, Decimal number representation,
r=16, Hexadecimal number representation,
r=8, Octal number representation,
r=2, Binary number representation
Types of Number representation

In digital signal processing, the three common forms used to represent a number in a digital
computer are
i. Fixed point representation
ii. Floating point representation
iii. Block floating point representation
Finite Word Length Effects in Digital Filters : VRSEC ECE - GHK
Fixed point representation: In this representation the digits allotted to the integer part and
the fraction part are fixed, so the position of the binary point is fixed. The bits to the right of
the binary point represent the fractional part of the number and the bits to the left of the binary
point represent the integer part.
Ex1: 978.125 (978 Integer part and 125 Fractional part)
Ex2: 11011.111 (11011 Integer part and 111 Fractional part)
In fixed point representation there are three different formats for representing negative binary
fraction numbers. They are
a. Sign-magnitude form
b. Ones-complement form
c. Twos-complement form
Sign-magnitude form: The most significant bit is set to 1 to represent the negative sign.
Ex:
$(0.875)_{10} = (0.111000)_2$
$(1.75)_{10} = (01.110000)_2$
$(-0.875)_{10} = (1.111000)_2$
$(-1.75)_{10} = (11.110000)_2$
In sign-magnitude form, the number 0 has two representations, i.e., 00.000000 or 10.000000.
With b bits only $(2^b - 1)$ numbers can be represented.
Ones-complement form: The positive number is represented as in the sign-magnitude form. The
negative number is obtained by complementing all the bits of the positive number.
Ex:
$(0.875)_{10} = (0.111000)_2$
$(-0.875)_{10} = (1.000111)_2$
The magnitude of the negative number is given by
$$1 - \sum_{i=1}^{b} c_i\, 2^{-i} - 2^{-b}$$
In this type of representation with b bits only $(2^b - 1)$ numbers can be represented exactly.
Twos-complement form: The positive numbers are represented as in sign-magnitude and ones-
complement form. The negative number is obtained by complementing all the bits of the positive
number and adding one to the LSB.
Ex:
$(0.875)_{10} = (0.111000)_2$
$(-0.875)_{10} = (1.001000)_2$
The magnitude of the negative number is given by
$$1 - \sum_{i=1}^{b} c_i\, 2^{-i}$$
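The three negative-number formats can be compared with a short sketch. The function name `encode` and its arguments are assumptions made for this illustration; it handles only fractions with magnitude below 1 that are exactly representable in the chosen number of fraction bits.

```python
from fractions import Fraction

def encode(value, frac_bits, fmt):
    """Return 'sign.fraction' bits of value in the given format (hypothetical helper)."""
    value = Fraction(value)
    full = 2 ** frac_bits
    mag = int(abs(value) * full)              # magnitude as an integer count of LSBs
    if value >= 0:
        word = mag                            # positive numbers look the same in all formats
    elif fmt == "sign-magnitude":
        word = full + mag                     # set the sign bit, keep the magnitude
    elif fmt == "ones":
        word = (2 * full - 1) - mag           # complement every bit
    elif fmt == "twos":
        word = (2 * full - mag) % (2 * full)  # complement and add one LSB
    else:
        raise ValueError(fmt)
    return f"{word >> frac_bits}.{word & (full - 1):0{frac_bits}b}"

for fmt in ("sign-magnitude", "ones", "twos"):
    print(fmt, encode(Fraction(-7, 8), 6, fmt))
```

For -0.875 with six fraction bits this prints the same bit patterns as the examples given for each form.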
Floating point representation: This representation is employed to represent a larger range of
numbers in a given binary word size. The floating point number is represented as $F = 2^c \cdot M$,
where M, called the mantissa, is a fraction such that $1/2 \le M < 1$, and c, the exponent, can be
either positive or negative. The leftmost bit in the mantissa and in the exponent is used to
represent the sign.
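As a quick check of the form F = 2^c · M, Python's standard `math.frexp` returns a mantissa with magnitude in [1/2, 1) and an integer exponent. The wrapper name `decompose` and the sample values are assumptions for this sketch.

```python
import math

def decompose(x):
    """Split x into (M, c) with x = M * 2**c and 1/2 <= |M| < 1."""
    return math.frexp(x)

for value in (6.5, -0.15625):
    m, c = decompose(value)
    print(f"{value} = {m} * 2**{c}")
```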
IEEE-754 32-bit Single-Precision Floating-Point Numbers
IEEE-754 64-bit Double-Precision Floating-Point Numbers
Comparison of fixed and floating point arithmetic

Sl. No. | Fixed point arithmetic | Floating point arithmetic
1. | The accuracy of the result is less due to small dynamic range | The accuracy of the result will be higher due to larger dynamic range
2. | Speed of processing is high | Speed of processing is low
3. | Hardware implementation is cheaper | Hardware implementation is costlier
4. | It can be used for real time computations | It cannot be used for real time computations
5. | Quantization error occurs only in multiplication | Quantization error occurs with both multiplication and addition
6. | Overflow occurs in addition | Overflow does not arise
7. | Used in small computers | Used in larger, general purpose computers
Truncation : Truncation is the process of reducing the size of the binary number by discarding all
bits less significant than the least significant bit that is retained.
Errors due to Truncation in Fixed point representation:

For the truncation of a negative number represented in sign magnitude and ones complement
form, the error is always positive because the truncation reduces the magnitude of the numbers.
In Twos complement representation, the effect of truncation on a negative number is to
increase the magnitude of the negative number and so the truncation error is always negative.
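Twos complement representation
The magnitude of the negative number is
$$x = 1 - \sum_{i=1}^{b} c_i\, 2^{-i}$$
When the number is truncated to N bits, then
$$x_T = 1 - \sum_{i=1}^{N} c_i\, 2^{-i}$$
The change in magnitude due to truncation is
$$x_T - x = \sum_{i=N+1}^{b} c_i\, 2^{-i} \ge 0$$
Since the magnitude increases with truncation, the error in the negative number itself is
negative. With $x$ and $x_T$ taken as the signed values, it satisfies the inequality
$0 \ge x_T - x > -2^{-N}$ (the sections that follow write this bound as $2^{-b}$, with b denoting
the retained word length).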
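Ones complement representation
The magnitude of the negative number is
$$x = 1 - \sum_{i=1}^{b} c_i\, 2^{-i} - 2^{-b}$$
When the number is truncated to N bits, then
$$x_T = 1 - \sum_{i=1}^{N} c_i\, 2^{-i} - 2^{-N}$$
The change in magnitude due to truncation is
$$x_T - x = \sum_{i=N+1}^{b} c_i\, 2^{-i} - \left(2^{-N} - 2^{-b}\right) \le 0$$
Since the magnitude decreases with truncation, the error in the negative number itself is
positive. With $x$ and $x_T$ taken as the signed values, it satisfies the inequality
$0 \le x_T - x < 2^{-N}$ (written as $2^{-b}$ when b denotes the retained word length).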
Sign magnitude representation:
In Sign magnitude representation the magnitude decreases with truncation, which implies that
the error is positive. Therefore, the above inequality condition holds for Sign magnitude
representation.
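The sign of the truncation error in the three formats can be checked with a small sketch. It models the effect of dropping bits arithmetically rather than manipulating bit strings: dropping twos-complement bits is a floor operation, while dropping sign-magnitude or ones-complement bits shrinks the magnitude toward zero. The function name `truncate`, the sample value -27/64 and the word lengths 6 and 3 are assumptions for illustration.

```python
from fractions import Fraction

def truncate(x, b, N, fmt):
    """Truncate x (exact in b fraction bits) to N fraction bits in the given format."""
    n = int(Fraction(x) * 2**b)      # signed integer count of LSBs
    drop = b - N
    if fmt == "twos":
        kept = n >> drop             # arithmetic shift: rounds toward minus infinity
    else:
        # sign-magnitude and ones complement: magnitude is cut toward zero
        kept = -((-n) >> drop) if n < 0 else n >> drop
    return Fraction(kept, 2**N)

x = Fraction(-27, 64)                # -0.011011 in binary
for fmt in ("sign-magnitude", "ones", "twos"):
    xt = truncate(x, 6, 3, fmt)
    print(f"{fmt:15s} x_T = {xt}, error = {xt - x}")
```

For this negative input the sign-magnitude and ones-complement errors are positive and the twos-complement error is negative, each smaller in magnitude than 2^-3.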
Errors due to Truncation in Floating point representation:

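In a floating point system the effect of truncation is visible only in the mantissa.
Let $x = 2^c M$.
When the mantissa is truncated to b bits, $x_T = 2^c M_T$ (only the mantissa is truncated).
The error due to truncation is $e = x_T - x = 2^c (M_T - M)$.
Twos complement representation: from the inequality condition $0 \ge x_T - x > -2^{-b}$, the
twos complement mantissa satisfies
$$0 \ge M_T - M > -2^{-b} \quad \Rightarrow \quad 0 \ge e > -2^{-b}\, 2^{c}$$
The relative error is
$$\varepsilon = \frac{x_T - x}{x} = \frac{e}{x}, \quad \text{so} \quad e = \varepsilon\, x = \varepsilon\, 2^{c} M \quad \text{and} \quad 0 \ge \varepsilon M > -2^{-b}$$
If M = 1/2, the maximum range of the relative error is $0 \ge \varepsilon > -2 \cdot 2^{-b}$.
If M = -1/2, the range of the relative error is $0 \le \varepsilon < 2 \cdot 2^{-b}$.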
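Ones complement representation:
The truncation of positive values of the mantissa gives
$$0 \ge M_T - M > -2^{-b} \quad \text{or} \quad 0 \ge e > -2^{-b}\, 2^{c}$$
The relative error is
$$\varepsilon = \frac{x_T - x}{x} = \frac{e}{x}, \quad e = \varepsilon\, x = \varepsilon\, 2^{c} M$$
With M = 1/2, the maximum range of the relative error for positive M is $0 \ge \varepsilon > -2 \cdot 2^{-b}$.
The truncation of negative values of the mantissa gives
$$0 \le M_T - M < 2^{-b} \quad \text{or} \quad 0 \le e < 2^{-b}\, 2^{c}$$
With M = -1/2, the maximum range of the relative error for negative M is $0 \ge \varepsilon > -2 \cdot 2^{-b}$,
which is the same as for positive M.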
Rounding : Rounding is the process of reducing the size of a binary number to finite word size of
b-bits such that the rounded b-bit number is closest to the original unquantized number.
The rounding process consists of Truncation and Addition. In rounding of a number to b-bits,
first the unquantized number is truncated to b-bits by retaining the most significant b-bits. Then a
zero or one is added to the least significant bit of the truncated number depending on the bit that is
next to the least significant bit that is retained.
If the bit next to the least significant bit that is retained is zero then zero is added to the least
significant bit of the truncated number. If the bit next to the least significant bit that is retained is one
then one is added to the least significant bit of the truncated number. (Here adding one is called
rounding up).
Rounding up or down will have negligible effect on accuracy of computation.
Errors due to Rounding in Fixed point representation:
In fixed point arithmetic, rounding a number to b bits produces an error
$e = x_T - x$ which satisfies the inequality
$$-\frac{2^{-b}}{2} \le x_T - x \le \frac{2^{-b}}{2}$$
This is because, with rounding, a value that lies halfway between two levels can be approximated
by either the nearest higher level or the nearest lower level. For fixed point numbers the
inequality
$$-\frac{2^{-b}}{2} \le x_T - x \le \frac{2^{-b}}{2}$$
is satisfied regardless of whether sign-magnitude, ones-complement or twos-complement form is
used for negative numbers.
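The half-LSB bound on the rounding error can be verified exhaustively for a small word length. This sketch assumes 6-bit fractions rounded to b = 3 bits with ties rounded up; the helper name `round_to_bits` is hypothetical.

```python
from fractions import Fraction
import math

def round_to_bits(x, b):
    """Round x to the nearest multiple of 2**-b (ties go to the higher level)."""
    return Fraction(math.floor(Fraction(x) * 2**b + Fraction(1, 2)), 2**b)

b = 3
half_lsb = Fraction(1, 2 ** (b + 1))          # 2**-b / 2
values = [Fraction(k, 64) for k in range(-63, 64)]
errors = [round_to_bits(v, b) - v for v in values]

print("smallest error:", min(errors))
print("largest error :", max(errors))
print("bound holds   :", all(abs(e) <= half_lsb for e in errors))
```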
Errors due to Rounding in Floating point representation:
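In floating point arithmetic, only the mantissa is affected by quantization.

Let $x = 2^c M$.
When the mantissa is rounded to b bits, $x_T = 2^c M_T$ (only the mantissa is rounded).
The error due to quantization is
$$e = x_T - x = 2^{c}\,(M_T - M) \qquad \text{(1)}$$
but for rounding,
$$-\frac{2^{-b}}{2} \le M_T - M \le \frac{2^{-b}}{2} \qquad \text{(2)}$$
Using equation (1), equation (2) can be written as
$$-2^{c}\,\frac{2^{-b}}{2} \le x_T - x \le 2^{c}\,\frac{2^{-b}}{2} \qquad \text{(3)}$$
or, since $x = 2^c M$ and $x_T - x = \varepsilon x$, where $\varepsilon$ is the relative error,
$$-2^{c}\,\frac{2^{-b}}{2} \le \varepsilon\, 2^{c} M \le 2^{c}\,\frac{2^{-b}}{2} \qquad \text{(4)}$$
Dividing by $2^c$, equation (4) becomes
$$-\frac{2^{-b}}{2} \le \varepsilon M \le \frac{2^{-b}}{2}$$
If M = 1/2, the maximum range of the relative error is $-2^{-b} \le \varepsilon \le 2^{-b}$.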
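The relative-error bound for mantissa rounding can be checked numerically. This is a sketch under assumptions: `round_mantissa` is a hypothetical helper built on the standard `math.frexp`, the mantissa is rounded to b = 4 fraction bits, and the test values are arbitrary.

```python
import math
from fractions import Fraction

def round_mantissa(x, b):
    """Round only the mantissa of x = M * 2**c to b fraction bits."""
    m, c = math.frexp(x)                       # 1/2 <= |M| < 1
    m_q = Fraction(math.floor(Fraction(m) * 2**b + Fraction(1, 2)), 2**b)
    return m_q * Fraction(2) ** c

b = 4
bound = Fraction(1, 2**b)                      # |relative error| <= 2**-b
for x in (0.3, 5.7, -0.011, 100.0, 0.5):
    x_t = round_mantissa(x, b)
    rel = (x_t - Fraction(x)) / Fraction(x)
    print(f"x = {x:8}  x_T = {float(x_t):10.6f}  relative error = {float(rel):+.5f}")
    assert abs(rel) <= bound
```

The bound is tightest for mantissas near 1/2, which is why M = 1/2 gives the maximum range of the relative error.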
Input Quantization Error: The input quantization error arises when a continuous signal is
converted into digital values. The A/D converter consists of a sampler and a quantizer.
The sampler samples the analog signal x(t) at regular intervals t = nT to produce a sequence of
unquantized values x(n).
The quantizer quantizes the analog values (unquantized values of x(n)) and produces the
corresponding binary codes.
If the ADC used to convert a sinusoidal signal employs (b+1) bits including the sign bit, the
number of levels available for quantizing x(n) is $2^{b+1}$.
The interval between successive levels is
$$q = \frac{2}{2^{b+1}} = 2^{-b}$$
where q is the quantization step size.
The common methods of quantization are Truncation and Rounding.
The errors produced by A/D conversion process are Quantization error and Saturation error.
Quantization error: It is due to the representation of the sampled signal by a fixed number of digits
Saturation error: It occurs when the analog signal exceeds the dynamic range of A/D converter.
Let
x(n) be the sampled unquantized value

xq(n) be the sampled quantized value
The quantization error is given by e(n) = xq(n) - x(n)
In A/D converters quantization can be performed by truncation or rounding. Quantization by
rounding is preferred in A/D converters because the quantization error then has zero mean and a
smaller mean-square value than with truncation.
The quantization error for rounding of a number satisfies the relation
$$-\frac{q}{2} \le e(n) \le \frac{q}{2}$$
For truncation of a number in twos complement representation, the error is always negative and
satisfies the inequality $-q \le e(n) \le 0$.
Steady State Input Noise Power
In digital signal processing applications, the quantization error is commonly viewed as an

additive error signal. i.e., xq(n) = x(n) + e(n). Therefore, the output of the A/D converter is the sum
of the input signal x(n) and the error signal e(n).
We assume that the A/D conversion error e(n) has the following properties
1. The error sequence e(n) is a sample sequence of a stationary random process.

2. The error sequence e(n) is uncorrelated with x(n) and other signals in the system.
3. The error is a white noise process with uniformly distributed amplitude.
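Case 1:
If rounding is used for quantization, then the quantization error e(n) = xq(n) - x(n) is bounded
by
$$-\frac{q}{2} \le e(n) \le \frac{q}{2}$$
The error e(n) lies between -q/2 and +q/2 with equal probability.
For a random variable X uniformly distributed in the interval $(X_1, X_2)$, the expected value
(mean value) and variance are given by
$$E[X] = \int_{X_1}^{X_2} X\, \frac{1}{X_2 - X_1}\, dX, \qquad \sigma_X^2 = E[X^2] - E^2[X]$$
Let E[e(n)] be the expected value (mean value) of the error signal:
$$E[e(n)] = \int_{-q/2}^{q/2} e(n)\, \frac{1}{q}\, de = \frac{1}{q}\left[\frac{e^2(n)}{2}\right]_{-q/2}^{q/2} = \frac{1}{2q}\left[\frac{q^2}{4} - \frac{q^2}{4}\right] = 0$$
Variance of the error signal, $\sigma_e^2 = E[e^2(n)] - E^2[e(n)]$:
$$\sigma_e^2 = \int_{-q/2}^{q/2} e^2(n)\, \frac{1}{q}\, de - 0 = \frac{1}{q}\left[\frac{e^3(n)}{3}\right]_{-q/2}^{q/2} = \frac{1}{3q}\left[\frac{q^3}{8} + \frac{q^3}{8}\right] = \frac{q^2}{12}$$
The quantization step size is $q = 2^{-b}$. Substituting this value of q in the above equation,
$$\sigma_e^2 = \frac{(2^{-b})^2}{12} = \frac{2^{-2b}}{12}$$
Case 2: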
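If truncation is used for quantization, then the quantization error e(n) = xq(n) - x(n) is
bounded by $-q \le e(n) \le 0$.
In twos complement truncation the error e(n) lies between 0 and -q.
Let E[e(n)] be the expected value (mean value) of the error signal:
$$E[e(n)] = \int_{-q}^{0} e(n)\, \frac{1}{q}\, de = \frac{1}{q}\left[\frac{e^2(n)}{2}\right]_{-q}^{0} = -\frac{q}{2}$$
The variance or power of the error signal e(n) is given by $\sigma_e^2 = E[e^2(n)] - E^2[e(n)]$:
$$\sigma_e^2 = \int_{-q}^{0} e^2(n)\, \frac{1}{q}\, de - \left(-\frac{q}{2}\right)^2 = \frac{1}{q}\left[\frac{e^3(n)}{3}\right]_{-q}^{0} - \frac{q^2}{4} = \frac{q^2}{3} - \frac{q^2}{4} = \frac{q^2}{12}$$
The quantization step size is $q = 2^{-b}$. Substituting this value of q in the above equation,
$$\sigma_e^2 = \frac{(2^{-b})^2}{12} = \frac{2^{-2b}}{12}$$
In both cases $\sigma_e^2 = \frac{2^{-2b}}{12}$, which is also known as the steady state noise power
due to input quantization.
If the input signal is x(n) and its variance is $\sigma_x^2$, then the ratio of signal power to
noise power, which is known as the signal to noise ratio, is for rounding
$$\frac{\sigma_x^2}{\sigma_e^2} = \frac{\sigma_x^2}{2^{-2b}/12} = 12\,(2^{2b})\,\sigma_x^2$$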
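The mean and variance results for rounding and truncation can be checked numerically. The sketch below replaces the uniform-distribution assumption with a fine, evenly spaced grid of inputs in [0, 1); the function name `error_stats`, the choice b = 3 and the grid density are assumptions made for illustration.

```python
import math

def error_stats(b, mode, points_per_step=1000):
    """Mean and variance of e = xq - x over an even grid of inputs in [0, 1)."""
    q = 2.0 ** -b
    n = points_per_step * 2**b
    errs = []
    for k in range(n):
        x = k / n
        if mode == "round":
            xq = math.floor(x / q + 0.5) * q   # nearest level
        else:
            xq = math.floor(x / q) * q         # truncation of a non-negative value
        errs.append(xq - x)
    mean = sum(errs) / n
    var = sum(e * e for e in errs) / n - mean * mean
    return mean, var

b = 3
q = 2.0 ** -b
for mode in ("round", "truncate"):
    mean, var = error_stats(b, mode)
    print(f"{mode:8s} mean/q = {mean / q:+.4f}   var/(q^2/12) = {var / (q * q / 12):.4f}")
```

The mean comes out close to 0 for rounding and close to -q/2 for truncation, while the variance is close to q²/12 in both cases.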
Steady State Output Noise (Variance) Power
Fig. Representation of input quantization noise in an LTI system
The quantized input signal of a digital system can be represented as the sum of the unquantized
signal x(n) and the error signal e(n).
h(n) is the impulse response of the system and y'(n) is the response of the system to the
quantized input. The response of the system is given by the convolution of the input and the
impulse response.
y'(n) = xq(n) * h(n)
      = [x(n) + e(n)] * h(n)
      = [x(n) * h(n)] + [e(n) * h(n)]
y'(n) = y(n) + ε(n)
Let
y(n) = [x(n) * h(n)]    Output due to the input signal x(n)
ε(n) = [e(n) * h(n)]    Output due to the error signal e(n)
The variance of the signal ε(n) is called the output noise power or steady state output noise
power due to the quantization error signal.
The steady state output noise power is given by
$$\sigma_{\epsilon}^2 = \sigma_e^2 \sum_{n=0}^{\infty} h^2(n)$$
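Using Parseval's theorem, the steady state output noise variance due to the quantization error
is given by
$$\sigma_{\epsilon}^2 = \sigma_e^2 \sum_{n=0}^{\infty} h^2(n) = \frac{\sigma_e^2}{2\pi j} \oint_c H(z)\, H(z^{-1})\, z^{-1}\, dz$$
Prove that
$$\sum_{n=0}^{\infty} h^2(n) = \frac{1}{2\pi j} \oint_c H(z)\, H(z^{-1})\, z^{-1}\, dz$$
The z-transform of h(n) is
$$H(z) = \sum_{n=0}^{\infty} h(n)\, z^{-n} \qquad \text{(1)}$$
The sum to be evaluated can be written as
$$\sum_{n=0}^{\infty} h^2(n) = \sum_{n=0}^{\infty} h(n)\, h(n) \qquad \text{(2)}$$
The integral formula for the inverse z-transform is given by
$$h(n) = Z^{-1}[H(z)] = \frac{1}{2\pi j} \oint_c H(z)\, z^{n-1}\, dz \qquad \text{(3)}$$
Sub. Eqn. (3) in eqn (2),
$$\sum_{n=0}^{\infty} h^2(n) = \sum_{n=0}^{\infty} h(n) \left[\frac{1}{2\pi j} \oint_c H(z)\, z^{n-1}\, dz\right] \qquad \text{(4)}$$
Interchanging the order of summation and integration,
$$\sum_{n=0}^{\infty} h^2(n) = \frac{1}{2\pi j} \oint_c H(z) \left[\sum_{n=0}^{\infty} h(n)\, z^{n}\right] z^{-1}\, dz \qquad \text{(5)}$$
From the definition of the z-transform,
$$\sum_{n=0}^{\infty} h(n)\, z^{n} = \sum_{n=0}^{\infty} h(n)\, (z^{-1})^{-n} = H(z^{-1}) \qquad \text{(6)}$$
Using (6), rewrite equation (5) as
$$\sum_{n=0}^{\infty} h^2(n) = \frac{1}{2\pi j} \oint_c H(z)\, H(z^{-1})\, z^{-1}\, dz \qquad \text{(7)}$$
This expression is a form of Parseval's relation.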
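For a first-order filter the output noise power can be evaluated directly from the impulse response. The sketch below assumes h(n) = a^n for n ≥ 0 (the filter y(n) = a y(n-1) + x(n)), for which the sum of h²(n) has the closed form 1/(1 - a²); the values a = 0.5 and b = 8 and the function name `output_noise_power` are illustrative assumptions.

```python
import math

def output_noise_power(a, b, terms=200):
    """sigma_e^2 * sum of h(n)^2 for h(n) = a**n, with q = 2**-b."""
    sigma_e2 = 2.0 ** (-2 * b) / 12.0                   # input noise power q^2 / 12
    energy = sum((a ** n) ** 2 for n in range(terms))   # truncated sum of h^2(n)
    return sigma_e2 * energy

a, b = 0.5, 8
numeric = output_noise_power(a, b)
closed_form = (2.0 ** (-2 * b) / 12.0) / (1.0 - a * a)
print("numeric sum :", numeric)
print("closed form :", closed_form)
```

The closed form is what the contour integral gives for this H(z), so the two printed values agree to within rounding of the floating point sum.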
Limit cycles: When a stable IIR filter is excited by a finite input sequence, the output will ideally
decay to zero. However, the nonlinearities due to finite precision arithmetic operations cause periodic
oscillations in the output. These oscillations are called limit cycles. (OR)
In recursive systems, the nonlinearities due to the finite-precision arithmetic operations often
cause periodic oscillations to occur in the output, even when the input sequence is zero or some
nonzero constant value. Such oscillations in recursive systems are called limit cycles and are directly
attributable to round-off errors in multiplications and overflow errors in addition.
The limit cycles occur as a result of the quantization effects in multiplications.
Types of limit cycles: Zero input limit cycles and overflow limit cycles.
Zero input limit cycle:
Consider a first order IIR filter with difference equation
$$y(n) = x(n) + a\, y(n-1)$$
Let us assume a = 1/2 and the data register length is 3 bits plus a sign bit.
The input is x(n) = 0.875 for n = 0 and x(n) = 0 otherwise.

n | x(n) | y(n-1) | a y(n-1) | Q[a y(n-1)] | y(n) = x(n) + Q[a y(n-1)]
0 | 0.875 | 0.0 | 0.0 | 0.000 | 7/8
1 | 0 | 7/8 | 7/16 | 0.100 | 1/2
2 | 0 | 1/2 | 1/4 | 0.010 | 1/4
3 | 0 | 1/4 | 1/8 | 0.001 | 1/8
4 | 0 | 1/8 | 1/16 | 0.001 | 1/8
5 | 0 | 1/8 | 1/16 | 0.001 | 1/8

The rounding is applied after the arithmetic operation. For n ≥ 3 the output remains constant at
1/8, which is the limit cycle behaviour.
From the table it can be observed that for zero input, the unquantized y(n) decays exponentially to
zero with increasing n. However, the rounded (quantized) output y(n) gets stuck at a value of 1/8
and never decays further. Thus the output is nonzero even when no input is applied. This is
referred to as the zero input limit cycle effect.
Let us assume a = -1/2

n | x(n) | y(n-1) | a y(n-1) | Q[a y(n-1)] | y(n) = x(n) + Q[a y(n-1)]
0 | 0.875 | 0.0 | 0.0 | 0.000 | 7/8
1 | 0 | 7/8 | -7/16 | 1.100 | -1/2
2 | 0 | -1/2 | 1/4 | 0.010 | 1/4
3 | 0 | 1/4 | -1/8 | 1.001 | -1/8
4 | 0 | -1/8 | 1/16 | 0.001 | 1/8
5 | 0 | 1/8 | -1/16 | 1.001 | -1/8
6 | 0 | -1/8 | 1/16 | 0.001 | 1/8

When a = -1/2 the output oscillates between 0.125 and -0.125.
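Both zero-input limit cycles can be reproduced with a short simulation of y(n) = x(n) + Q[a y(n-1)]. This is a sketch under assumptions: products are rounded to 3 fraction bits in sign-magnitude fashion (half values rounded away from zero), and the names `q_round` and `simulate` are hypothetical.

```python
from fractions import Fraction
import math

def q_round(v, b=3):
    """Round v to b fraction bits; the sign is kept and the magnitude is rounded."""
    v = Fraction(v)
    mag = Fraction(math.floor(abs(v) * 2**b + Fraction(1, 2)), 2**b)
    return mag if v >= 0 else -mag

def simulate(a, x, b=3):
    """Run y(n) = x(n) + Q[a * y(n-1)] from rest and return the outputs."""
    y_prev, out = Fraction(0), []
    for xn in x:
        y = Fraction(xn) + q_round(Fraction(a) * y_prev, b)
        out.append(y)
        y_prev = y
    return out

x = [Fraction(7, 8)] + [0] * 6            # 0.875 at n = 0, zero afterwards
for a in (Fraction(1, 2), Fraction(-1, 2)):
    print("a =", a, "->", [str(y) for y in simulate(a, x)])
```

With a = 1/2 the output settles at 1/8, and with a = -1/2 it alternates between -1/8 and 1/8, as in the tables.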
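Dead Band: The amplitude of the output during a limit cycle is confined to a range of values, and
this range of values is called the dead band.
Let us consider a single pole IIR system whose difference equation is given by
$$y(n) = a\, y(n-1) + x(n), \quad n > 0$$
After rounding the product term we have
$$y_q(n) = Q[a\, y(n-1)] + x(n)$$
During the limit cycle oscillation
$$Q[a\, y(n-1)] = \begin{cases} y(n-1), & a > 0 \\ -y(n-1), & a < 0 \end{cases}$$
By the definition of rounding
$$\left| Q[a\, y(n-1)] - a\, y(n-1) \right| \le \frac{2^{-b}}{2}$$
Substituting $Q[a\, y(n-1)] = \pm y(n-1)$,
$$|y(n-1)| - |a\, y(n-1)| \le \frac{2^{-b}}{2}$$
$$|y(n-1)| \le \frac{2^{-b}/2}{1 - |a|}$$
which gives the dead band of the filter.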
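The dead band of a first-order section can be evaluated with a one-line helper. The sketch assumes the standard bound |y(n-1)| ≤ (2^-b / 2) / (1 - |a|) for a product rounded to b fraction bits; the function name `dead_band` is hypothetical.

```python
from fractions import Fraction

def dead_band(a, b):
    """Bound on |y(n-1)| inside a limit cycle: (2**-b / 2) / (1 - |a|)."""
    a = Fraction(a)
    return Fraction(1, 2 ** (b + 1)) / (1 - abs(a))

# a = 1/2 or -1/2 with 3 fraction bits, as in the zero input limit cycle example
print(dead_band(Fraction(1, 2), 3))
print(dead_band(Fraction(-1, 2), 3))
```

Both calls give 1/8, which matches the amplitude at which the zero input limit cycle example gets stuck.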
Overflow Limit Cycle Oscillations: In fixed point addition, overflow occurs when the sum exceeds
the finite word length of the register used to store the sum. The overflow in addition makes the
output oscillate between the maximum and minimum amplitudes. Such limit cycles are called overflow
limit cycle oscillations.
The overflow in the addition of two or more binary numbers occurs when the sum exceeds the
word size available in the digital implementation of the system.
The overflow occurs when the sum exceeds the dynamic range of the number system. When
the binary fraction format is used for computing, the dynamic range is (-1, 1).
Let us consider two positive numbers +3/8 and +5/8 in twos complement addition:
(+3/8) + (+5/8) → 0.011 + 0.101 = 1.000, which is read as -8/8 = -1
The actual sum is +1, but due to overflow the sum is wrongly interpreted as a negative number.
The overflow limit cycle oscillations can be eliminated if saturation arithmetic is performed. In
saturation arithmetic, when an overflow is sensed, the output is set equal to the maximum allowable
value, and when an underflow is sensed, the output is set equal to the minimum allowable value.
The saturation arithmetic causes undesirable signal distortion due to the nonlinearity of the
clipper.
How overflow limit cycles can be eliminated:
The overflow limit cycles can be eliminated either by using saturation arithmetic or by scaling the
input signal to the adder.
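The difference between wrap-around overflow and saturation arithmetic can be seen on the +3/8 and +5/8 example. This sketch assumes 3 fraction bits plus a sign bit and operands that are exact multiples of 1/8; the function names `add_wrap` and `add_saturate` are hypothetical.

```python
from fractions import Fraction

def add_wrap(x, y, b=3):
    """Twos complement addition with b fraction bits plus sign: wraps on overflow."""
    n = int((Fraction(x) + Fraction(y)) * 2**b)
    n = (n + 2**b) % 2 ** (b + 1) - 2**b      # fold into [-2**b, 2**b - 1]
    return Fraction(n, 2**b)

def add_saturate(x, y, b=3):
    """Same addition, but the sum is clipped to the representable range."""
    n = int((Fraction(x) + Fraction(y)) * 2**b)
    n = max(-2**b, min(2**b - 1, n))
    return Fraction(n, 2**b)

x, y = Fraction(3, 8), Fraction(5, 8)
print("wrap-around:", add_wrap(x, y))
print("saturation :", add_saturate(x, y))
```

Wrap-around turns the true sum +1 into -1, while saturation returns the largest representable value 7/8, which removes the sign flip at the cost of clipping distortion.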
The study of limit cycle oscillations is important for two reasons.
1. In a communication environment, when no signal is transmitted, limit cycles can occur which
are extremely undesirable.
Example: In a telephone no one would like to hear unwanted noise when no signal is put in
from the other end. Consequently, when digital filters are used in telephone exchanges, care
must be taken regarding this problem.
2. The limit cycles effect can be effectively used in digital waveform generators. By producing
desirable limit cycles in a reliable manner, these limit cycles can be used as a source in digital
signal processing.
V R Siddhartha Engineering College (Autonomous) : E C E Department : Vijayawada

Digital Signal Processing EC 6002
Finite Word Length Effects in Digital Filters
1. What are the advantages of floating point arithmetic?
2. What are the effects of finite word length in digital filters?
3. Explain in detail the effects of finite word length in digital filters.
4. What are the quantization errors due to finite word length registers in digital filters?
5. What is truncation? What is the error that arises due to truncation in floating point numbers?
6. Differentiate between fixed point and floating point arithmetic.
7. What are the methods to prevent overflow?
8. What are zero-input limit cycle oscillations? Explain zero-input limit cycle oscillations.
9. What is rounding? Discuss round-off effects in digital filters.
10. Discuss in detail the errors resulting from rounding and truncation.
11. What is the need for quantization? How is quantization noise reduced?
12. What are the different quantization methods?
13. What is the dead band of a filter?
14. Explain the input quantization error and the product quantization error.
15. Write short notes on fixed-point and floating-point representations.
16. Write short notes on finite word length effects in FIR digital filters.
17. Explain the following: (a) Quantization noise (b) Fixed point numbers.
18. Explain the following: (a) Zero-input limit cycle and (b) Overflow limit cycle oscillations.
19. Derive the expression for the variance of the output noise of a digital system which is fed with a
quantized input signal.
20. An LTI system is characterized by the difference equation y(n) = 0.875 y(n-1) + x(n).
Determine the limit cycle behavior and the dead band of the system when x(n)=0 and y(-1)=0.61.
Assume that the product is quantized to 4 bits by rounding.
21. A digital system is characterized by the difference equation y(n) = 0.95 y(n-1) + x(n). Determine
the dead band of the system when x(n)=0 and y(-1)=13.
22. A digital system is characterized by the difference equation y(n) = 0.9 y(n-1) + x(n) with x(n)=0
and initial condition y(-1)=12. Determine the dead band of the system.
23. The output of a 12-bit A/D converter is passed through a digital filter which is described by the
difference equation y(n) = 0.2 y(n-1) + x(n). Calculate the steady state output noise power due to
A/D converter quantization.
24. The transfer function of a discrete time filter is $H(z) = \frac{0.245 + 0.245 z^{-1}}{1 - 0.509 z^{-1}}$.
If it is realized by using the direct form-II structure, find the scaling factor to avoid overflow
in the 1st adder of the realization.
25. An LTI system is characterized by the difference equation y(n)=0.68y(n-1)+0.15x(n). The input
signal x(n) has a range of -5V to +5V, represented by 8 bits. Find the quantization step size,
the variance of the error signal and the variance of the quantization noise at the output.