Papers on Digital Signal Processing

... the journals and the authors for granting permission to have these articles
reproduced in this way.
Alan V. Oppenheim
August 1969
INTRODUCTION
The collection of articles divides roughly into four major categories:
z-transform theory and digital filter design, the effects of finite word
length, the fast Fourier transform and spectral analysis, and hardware
considerations in the implementation of digital filters.
The first six papers deal with several issues in z-transform theory
and digital filter design. Specifically, in the paper by Steiglitz, one
attitude toward the relationship between digital and analog signals is
offered. This attitude is illuminating partly because it is an alternative
to the relationship usually depicted, of a sequence as derived by
sampling a continuous time function. A representation of time functions
in terms of sequences offered in this paper is as coefficients in a
Laguerre series. This representation, as with the representation of
band-limited functions as a series of sin x/x functions, has the property
that the representation of the convolution of two functions is the
discrete convolution of the sequences for each. The discussion by Steiglitz
also provides a basis for carrying over the results on optimum linear
systems with continuous signals to analogous results for discrete signals.
Many of the theoretical developments in digital signal processing
have been directed toward rephrasing and paralleling results related to
analog processing within the context of sequences and z-transforms. An
example of this is the discussion of the Hilbert transform relations by
Gold, Oppenheim, and Rader. This paper presents the Hilbert transform
relations in terms of both the z-transform and the discrete Fourier
transform. In addition, the design and realization of digital 90° phase
splitters is discussed.
Detailed design techniques for recursive digital filters appear to
be well established. With the disclosure of the fast Fourier transform
(FFT), nonrecursive filters, i.e., filters with a finite duration impulse
response, took on a new significance. Some design procedures, supplied
by previous work on tapped delay line filters and phased array antennas,
have been available for some time. We feel in general that in the next
several years there is likely to be more formal work done toward defining
the limitations and design procedures for such filters. The first paper
by Gold and Jordan and the paper by Oppenheim and Weinstein both relate
to nonrecursive filters. The first of these discusses a new design
procedure for these filters from a frequency domain point of view. The
second discusses a bound and consequently a scaling strategy for use in
implementing such filters using the FFT.
When the use of the fast Fourier transform in digital filtering was
first discussed, it was assumed that it was theoretically limited to
finite duration impulse responses. The note by Gold and Jordan on
digital filter synthesis describes a procedure for using this technique
... does not appear to be computationally efficient, the fact that this can
... then also quantization of the input. The next several papers are
concerned with ... using the FFT. The early papers on quantization and
roundoff effects in digital filters were by Kaiser,65 Knowles and
Edwards,* and Gold and Rader.43 The first of these dealt primarily with
the problem of coefficient accuracy, while the second two dealt primarily
with the effects ... Edwards and Gold and Rader focused primarily on the
effect of quantization and multiplier roundoff noise. The approach taken
is to view the errors
... experimental verification ...

* Reference numbers refer to the Bibliography.
... that combines some of the aspects of both floating-point and fixed-point
... One of the first analyses, carried out by Gentleman and Sande,37 focused
... upper bound on the square root of the sum of the squares of the error
... to the radix 2 FFT algorithm and deals mainly with arithmetic using
... procedures for digital filters which blend into the approximation problem
... the practical requirements of finite word length coefficients are not yet
available.
... points by making use of the periodicity of exp(2πjnk/N), may be traced
back to about the turn of the century (Cooley et al.26), and it appears
in the form of a detailed algorithmic procedure in papers by Good48 and
Danielson and Lanczos.31 Nevertheless, almost all current interest in
the FFT derives from the publication of the paper by Cooley and Tukey in
1965. This paper reveals the algorithm in terms of the general structure
... goes into some detail about the case where all the radices are equal.
... sine and cosine functions rather than the periodicities. Cooley and
Tukey did recognize the special interest in the case of radix two
algorithms.
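The decomposition that all of these viewpoints describe can be stated compactly in code. Below is a minimal, illustrative radix-2 decimation-in-time FFT (our own sketch, not code from any of the reprinted papers), checked against the direct DFT it accelerates:

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two.

    The Cooley-Tukey idea in its simplest form: an N-point DFT is split
    into two N/2-point DFTs of the even- and odd-indexed samples, which
    are then combined with N/2 twiddle-factor multiplications.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def dft(x):
    """Direct O(N^2) DFT, for comparison."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n)
                for m in range(n)) for k in range(n)]
```

Applying the split recursively replaces the N² operations of the direct computation by roughly N log N.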
... of indices to write about the fast Fourier transform, but this point of
view, while satisfying to some people, has been unsatisfying to some
others. Other points of view can be taken to derive the algorithm and
... take the Cooley-Tukey point of view. The Cooley and Tukey paper, in
... further reducing the computation required for a DFT (and probably the
error, though this has not been investigated), namely the development
... savings are to be found when special factors are segregated and handled
... sized DFT computations but they are a small saving for large computations,
say N greater than 10^5. Higher radix FFT algorithms also permit a better
... systems; this implies, of course, some high speed storage in the
arithmetic unit.
The paper by Singleton provides several valuable FFT tricks. For
the case of arbitrary and mixed radix FFT programs, bit reversing and
in-place computation (including in-place bit reversing) can be complicated
unless the algorithm is appropriately structured. Singleton also shows
how to program a DFT efficiently when one or more of the prime factors
of the number of points is odd (special attention being given to a
factor of 5). Another important section of the paper deals with the
computation of the required constants by iterative techniques, which
can reduce the storage requirements for an FFT program. This must be
done carefully, as it can be a grave source of error. An appendix of
the paper ... has been deleted.
The papers by Bluestein and Rabiner et al. describe the implications,
for DFT computation, of the discrete equivalent of chirp filtering.
Bluestein derives an algorithm which is structured as a digital network
but is competitive with the FFT when the number of points is the square
of a prime, or the square of a small composite integer. This network
involves the convolution of an input signal with a chirp sequence.
The Rabiner paper details the considerable flexibility which results
when the convolution is implemented by means of the FFT.
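The chirp structure rests on the identity nk = [n² + k² − (k − n)²]/2: the input is premultiplied by a chirp, convolved with the conjugate chirp, and postmultiplied. The following is our own illustrative sketch of that structure, with the convolution evaluated directly; the point of the papers above is that this convolution may instead be carried out with an FFT, for any number of points:

```python
import cmath

def chirp_dft(x):
    """DFT of arbitrary length N computed as a chirp convolution.

    Uses exp(-2j*pi*n*k/N) = w(n) * w(k) * conj_chirp(k - n), with
    w(m) = exp(-1j*pi*m*m/N): premultiply by the chirp w, convolve
    with its conjugate, postmultiply by w.
    """
    n = len(x)
    w = [cmath.exp(-1j * cmath.pi * m * m / n) for m in range(n)]
    a = [x[m] * w[m] for m in range(n)]
    conj_chirp = lambda m: cmath.exp(1j * cmath.pi * m * m / n)
    return [w[k] * sum(a[m] * conj_chirp(k - m) for m in range(n))
            for k in range(n)]
```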
The early papers about the FFT have concentrated on how to program
it. An exception was the description of the use of the algorithm to
compute lagged products.122 As time passes we expect to see more papers
dealing with applications of the algorithm, and with quantization effects.
The last two papers discuss hardware considerations. The understanding
of the z-transform, filter theory, and quantization problems
forms the theoretical basis underlying the construction of digital signal
processing equipment. Within the past few years, quite a few devices
have been built for performing digital filtering and the discrete Fourier
transform; however, only two papers of this collection deal with the
hardware question. While the various devices built have been of
considerable interest, only these two papers dealt at sufficient length
with general hardware issues. The Jackson, Kaiser, McDonald paper
described the well-known digital filtering forms, proposed a way of
modularizing the components of a digital filtering system, discussed
the important multiplexing question, and combined these general points
into the description of a digital touch-tone receiver which they built.
The particular form that appears to be most useful is the cascade or
serial form; both the parallel form and the direct form seem to be
more sensitive to parameter errors. Our own experience verifies this;
an additional point is that each link in the cascade could beneficially
be a coupled-form digital network, especially if spectral accuracy at
low frequencies is desired.
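A coupled-form second-order section of the kind alluded to can be sketched as follows. This is our own illustration, not code from the Jackson, Kaiser, McDonald paper; its appeal is that the two coefficients set the pole position directly, so coefficient quantization places the poles on a uniform grid, which is favorable for low-frequency poles:

```python
def coupled_form(x, re, im):
    """Coupled-form (Rader-Gold) resonator with poles at re +/- j*im.

    State update per sample:
        u[n] = re*u[n-1] - im*v[n-1] + x[n]
        v[n] = im*u[n-1] + re*v[n-1]
    The output taken here is u[n].
    """
    u = v = 0.0
    y = []
    for s in x:
        # both right-hand sides use the previous u, v
        u, v = re * u - im * v + s, im * u + re * v
        y.append(u)
    return y
```

With re = cos θ and im = sin θ the impulse response of u is cos(nθ), i.e., an oscillator with poles exactly on the unit circle at angle ±θ.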
Whereas the choice of the proper form is dependent on digital
network principles, modularity considerations are also influenced by
the state of component technology. Since the latter changes relatively
rapidly with time, modularity proposals that may be optimum now could
conceivably be less so in the future. Jackson, Kaiser, and McDonald
have devised an elegant set of modules consistent with present-day MSI
(medium-scale integration) component techniques. For example, the
multiplier has been conceived to be modular on a bit-by-bit basis, each
bit containing all the required gating and carry logic. Thus, for
example, the word length can be extended by adding one module. If, in
the future, a sufficiently high packaging density leads to economic
array multipliers, the resultant speed increase and simplification of
control could greatly change modularization tactics.
The other hardware-oriented contribution, by Bergland, reviews
general aspects of special purpose fast Fourier transform systems.
Bergland is primarily concerned with the theoretically possible
computational speed of an FFT as a function of the number of parallel
memory and arithmetic modules in the system. This is an interesting and
important aspect of FFT device design, but it is well to remember that
it is only the bare beginning of the use of an algorithm which has a
startling number of diverse forms. We think, for example, that the
relation between the hardware design and the radix used, the bit-reversal
question, the in-place versus not-in-place algorithm, and the relative
speeds and costs of fast and slow memories are fertile areas for general
investigation.
Unfortunately, as yet, no thorough study of the effect of digital
processing algorithms on the design of general purpose computers is
available. We would like to make the point that, as Bergland has
demonstrated in one particular case, parallelism leads directly to
increased speed; this is true for general as well as special purpose
hardware. Computer architects have, in the past, had great conceptual
difficulty in justifying any given parallel processing structure; in our
opinion, the study of the many interesting structural variations of
digital signal processing algorithms could help set up useful criteria
for the effectiveness of general purpose parallel computation.
Bernard Gold
Alan V. Oppenheim
Charles Rader
CONTENTS
K. Steiglitz, "The Equivalence of Digital and Analog Signal Processing,"
Information and Control, Vol. 8, No. 5, October 1965, pp. 455-467.  1

G. D. Bergland, "A Fast Fourier Transform Algorithm Using Base 8
Iterations," Mathematics of Computation, Vol. 22, No. 102, April 1968,
pp. 275-279.  158

Bibliography  187
Papers on Digital Signal Processing

Reprinted from INFORMATION AND CONTROL, Volume 8, No. 5, October 1965
Copyright © by Academic Press Inc. Printed in U.S.A.

The Equivalence of Digital and Analog Signal Processing*

K. STEIGLITZ
Department of Electrical Engineering, Princeton University, Princeton, New Jersey
LIST OF SYMBOLS

f(t), g(t)      continuous-time signals
F(jω), G(jω)    Fourier transforms of continuous-time signals
A               continuous-time filters, bounded linear transformations of L2(−∞, ∞)
{fn}, {gn}      discrete-time signals
F(z), G(z)      z-transforms of discrete-time signals
A               discrete-time filters, bounded linear transformations of l2
μ               isomorphic mapping from L2(−∞, ∞) to l2
ℱL2(−∞, ∞)      space of Fourier transforms of functions in L2(−∞, ∞)
𝒵l2             space of z-transforms of sequences in l2
λn(t)           nth Laguerre function
* This work is part of a thesis submitted in partial fulfillment of requirements
for the degree of Doctor of Engineering Science at New York University, and
was supported partly by the National Science Foundation and partly by the Air
Force Office of Scientific Research under Contract No. AF 49-(638)-586 and Grant
No. AF-AFOSR-62-321.
STEIGLITZ
I. INTRODUCTION
The parallel between linear time-invariant filtering theory in the
continuous-time and the discrete-time cases is readily observed. The
theory of the z-transform, developed in the 1950's for the analysis of
sampled-data control systems, follows closely classical Fourier transform
theory in the linear time-invariant case. In fact, it is common practice
to develop in detail a particular result for a continuous-time problem
and to pay less attention to the discrete-time case, with the assumption
that the derivation in the discrete-time case follows the one for
continuous-time signals without much change. Examples of this can be
found in the fields of optimum linear filter and compensator design,
system identification, and power spectrum measurement.

The main purpose of this paper is to show, by the construction of a
specific isomorphism between the signal spaces L2(−∞, ∞) and l2, that the
theories of processing signals with linear time-invariant realizable filters
are identical in the continuous-time and the discrete-time cases. This
will imply the equivalence of many common optimization problems
involving quadratic cost functions. In addition, the strong link that is
developed between discrete-time and continuous-time filtering theory
will enable the data analyst to carry over to the digital domain many of
the concepts which have been important to the communications and
control engineers over the years. In particular, all the approximation
techniques developed for continuous-time filters become available for
the design of digital filters.
In the engineering literature, the term digital filter is usually applied
to a filter operating on samples of a continuous signal. In this paper,
however, the term digital filter will be applied to any bounded linear
operator on the signal space l2, and these signals will not in general
represent samples of a continuous signal. For example, if {xn} and {yn}
are two sequences, the recursive filter

yn = xn − 0.5 yn−1

will represent a digital filter whether or not the xn are samples of a
continuous signal. The important property is that a digital computer
can be used to implement the filtering operation; the term numerical
filter might in fact be more appropriate.
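The recursion above is, as the text emphasizes, directly a computer program. A minimal sketch (in modern notation, with a zero initial condition assumed):

```python
def recursive_filter(x):
    """First-order recursive digital filter y[n] = x[n] - 0.5*y[n-1],
    the example from the text, applied to any input sequence."""
    y_prev = 0.0
    y = []
    for xn in x:
        y_prev = xn - 0.5 * y_prev
        y.append(y_prev)
    return y
```

Its impulse response is (−0.5)^n, whether or not the input samples came from a continuous-time signal.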
II. PRELIMINARIES

The Hilbert space L2(−∞, ∞) of complex valued, square integrable,
Lebesgue measurable functions f(t) will play the role of the space of ...
DIGITAL AND ANALOG SIGNAL PROCESSING
where a(t), the impulse response of the filter A, need not belong to
L2(−∞, ∞). Similarly, a digital filter A will be called time-invariant if

A{fn} = {gn} (4)

implies

A{fn−P} = {gn−P} (5)

for every integer P. Time-invariant digital filters can be represented by
the convolution summation

gn = Σ_{m=−∞}^{∞} a_{n−m} f_m (6)

where the sequence {an}, the impulse response of the filter A, need not
belong to l2.
F(s) = l.i.m._{R→∞} ∫_{−R}^{R} f(t) e^{−st} dt (7)
(f, f) = ∫_{−∞}^{∞} |f(t)|² dt = (1/2πj) ∫_{−j∞}^{j∞} F(s) F*(s) ds (8)

and

f(t) = l.i.m._{R→∞} (1/2πj) ∫_{−jR}^{jR} F(s) e^{st} ds. (9)

Analytic extension of F(jω) to the rest of the s-plane (via (7) when it
exists, for example) gives the two-sided Laplace transform.
THEOREM 2 (Parseval). If f(t), g(t) ∈ L2(−∞, ∞), then

(f, g) = ∫_{−∞}^{∞} f(t) g*(t) dt = (1/2πj) ∫_{−j∞}^{j∞} F(s) G*(s) ds. (10)
exists for z = e^{jωT}, and F(e^{jωT}) ∈ L2(0, 2π/T), where ω is the
independent variable of L2(0, 2π/T), and this ω is unrelated to the ω used in
the s-plane. Furthermore,

(f, g) = Σ_{n=−∞}^{∞} fn gn* = (T/2π) ∫_0^{2π/T} F(e^{jωT}) G*(e^{jωT}) dω (12)

and

fn = (1/2πj) ∮ F(z) z^n dz/z (13)

where integrals in the z-plane are around the unit circle in the
counterclockwise direction.

As in the analog case, the analytic extension of F(e^{jωT}) to the rest
of the z-plane will coincide with the ordinary z-transform, which is
usually defined only for digital signals of exponential order.
We denote the space L2(0, 2π/T) of z-transforms of digital signals by 𝒵l2.

III. A SPECIFIC ISOMORPHISM BETWEEN THE ANALOG AND DIGITAL
SIGNAL SPACES

... s = (z − 1)/(z + 1) ...

There is an additional factor required so that the transformation will
preserve inner products. Accordingly, the image {fn} ∈ l2 corresponding
to f(t) ∈ L2(−∞, ∞) will be defined as the sequence with the z-transform

F(z) = (√2/(z + 1)) F((z − 1)/(z + 1)). (16)

μ: f(t) → F(s) → (√2/(z + 1)) F((z − 1)/(z + 1)) = F(z) → {fn}. (17)

μ⁻¹: {fn} → F(z) → (√2/(1 − s)) F((1 + s)/(1 − s)) = F(s) → f(t). (18)

We then have

THEOREM 5. The mapping

μ: L2(−∞, ∞) → l2

defined by (17) and (18) is an isomorphism.
Proof: μ is obviously linear and onto. To show that it preserves inner
product, let z = (1 + s)/(1 − s) in Parseval's relation (10), yielding

(f, g) = (1/2πj) ∮ F(z) G*(z) dz/z. (20)

By (13), the formula for the inverse z-transform, we have

fn = (1/2πj) ∮ (√2/(z + 1)) F((z − 1)/(z + 1)) z^n dz/z. (21)

Letting z = (1 + s)/(1 − s), this integral becomes

fn = (1/2πj) ∫_{−j∞}^{j∞} F(s) (√2/(1 − s)) ((1 + s)/(1 − s))^{n−1} ds. (22)

By Parseval's relation (10) this can be written in terms of time functions
where the λn(t) are given by the following inverse two-sided Laplace
transform

λn(t) = 𝓛⁻¹[(√2/(1 − s)) ((1 − s)/(1 + s))^n]. (24)

We see immediately that, depending on whether n > 0 or n ≤ 0, λn(t)
vanishes for negative time or positive time. By manipulating a standard
transform pair involving Laguerre polynomials we find:

λn(t) = √2 e^{−t} L_{n−1}(2t) u(t),   n = 1, 2, 3, ...
λn(t) = √2 e^{t} L_{−n}(−2t) u(−t),   n = 0, −1, −2, ... (25)

where u(t) is the Heaviside unit step function, and Ln(t) is the Laguerre
polynomial of degree n, defined by

Ln(t) = (e^t/n!) (dⁿ/dtⁿ)(tⁿ e^{−t}),   n = 0, 1, 2, ... (26)

The set of functions λn(t), n = 1, 2, 3, ..., is a complete orthonormal
set on (0, ∞); these are called Laguerre functions. They have been
employed by Lee (1931-2), Wiener (1949), and others for network synthesis,
and are tabulated in Wiener (1949) and, with a slightly different
normalization, in Head and Wilson (1956). The functions λn(t), n =
0, −1, −2, ..., are similarly complete and orthonormal on (−∞, 0),
so that the orthonormal expansion corresponding to (23) is

f(t) = Σ_{n=−∞}^{∞} fn λn(t). (28)
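The orthonormality claim is easy to probe numerically. The sketch below assumes the normalization λn(t) = √2 e^{−t} L_{n−1}(2t) u(t), n = 1, 2, 3, ... (our reading of the transform pair, possibly differing from the paper's convention by a sign, which does not affect orthonormality), and checks inner products on a truncated grid:

```python
import numpy as np
from numpy.polynomial.laguerre import lagval

def laguerre_fn(n, t):
    """Laguerre function lambda_n(t) = sqrt(2)*exp(-t)*L_{n-1}(2t), t >= 0,
    for n = 1, 2, 3, ...  (assumed normalization; see lead-in)."""
    coeffs = [0.0] * (n - 1) + [1.0]   # selects the degree-(n-1) polynomial
    return np.sqrt(2.0) * np.exp(-t) * lagval(2.0 * t, coeffs)

def inner(m, n, t):
    """Numerical inner product over (0, inf), truncated to the grid t."""
    dt = t[1] - t[0]
    return float(np.sum(laguerre_fn(m, t) * laguerre_fn(n, t)) * dt)
```

On a fine grid the matrix of inner products is close to the identity, as the completeness/orthonormality statement requires.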
(√2/(z + 1)) A((z − 1)/(z + 1)) F((z − 1)/(z + 1)) = A((z − 1)/(z + 1)) F(z). (30)

Therefore, the image in 𝒵l2 of A, and hence of A, is multiplication by

A(z) = A((z − 1)/(z + 1)). (31)

Similarly, a time-invariant digital filter A has an image in 𝒵l2 given by
multiplication by A(z), the z-transform of the impulse response {an};
an image in L2(−∞, ∞) given by

A = μ⁻¹ A μ; (32)

and an image in ℱL2(−∞, ∞) given by multiplication by

A(s) = A((1 + s)/(1 − s)). (33)

We have therefore proved

THEOREM 6. The isomorphism μ always matches time-invariant analog
filters A with time-invariant digital filters A. Furthermore,

A(z) = A((z − 1)/(z + 1)) (34)
and

A(s) = A((1 + s)/(1 − s)). (35)

... filter. The same argument works the other way, and this establishes:

THEOREM 7. The mapping μ always matches time-invariant realizable
analog filters with time-invariant realizable digital filters.
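The substitution in Theorems 6 and 7 can be exercised numerically. The sketch below is ours: it maps an analog filter of our own choosing, A(s) = 1/(s + 1) (not an example from the paper), into its digital image via (34):

```python
def analog(s):
    """A realizable analog filter: A(s) = 1/(s + 1), pole at s = -1
    in the left half plane (our illustrative choice)."""
    return 1.0 / (s + 1.0)

def digital(z):
    """Its image under the isomorphism: A(z) = A((z - 1)/(z + 1)).
    Algebraically this is (z + 1)/(2z): a pole at z = 0, inside the
    unit circle, so the digital image is realizable too."""
    return analog((z - 1.0) / (z + 1.0))
```

The left-half-plane pole at s = −1 maps to z = (1 + s)/(1 − s) = 0, illustrating how the mapping carries realizable analog filters to realizable digital ones.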
H₀(s) = (1/y)[(R + N)* R / y*]_LHP (38)

where

y y* = (R + N)(R + N)*; (39)

y has only left-half plane poles and zeros, and y* has only right-half
plane poles and zeros. The notation [ ]_LHP indicates that a partial
fraction expansion is made and only the terms involving left-half plane
poles are retained.
The fact that a least integral-square-error criterion is used means
that the optimization criterion (36) can be expressed within the
axiomatic framework of Hilbert space. Thus, in L2(−∞, ∞), (36) becomes

‖r − H(r + n)‖ = min. (40)

If we now apply the isomorphism μ to the signal r − H(r + n), we have

‖r − H(r + n)‖ = ‖μ[r − H(r + n)]‖ = ‖r − Π(r + n)‖, (41)

since μ preserves norm. Hence Π₀ is the solution to the optimization
problem

‖r − Π(r + n)‖ = min. (42)

Π₀(z) = (1/Q)[(R + N)* R / Q*]_in (43)

where

QQ* = (R + N)(R + N)*. (44)

Now R, N, and Π₀ are functions of z; ( )* means that z is replaced by
z⁻¹; Q and Q* have poles and zeros inside and outside the unit circle
respectively; and the notation [ ]_in indicates that only terms in a
partial fraction expansion with poles inside the unit circle have been
retained.
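As a numerical illustration of this family of quadratic-cost problems, here is our own sketch. For simplicity it uses the noncausal per-frequency Wiener gain S/(S + N) rather than the causal, spectrally factored solution of the text, but it exhibits the same least-integral-square-error behavior:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pts = 4096
t = np.arange(n_pts)
signal = np.sin(2 * np.pi * 8 * t / n_pts)   # narrowband "r"
noise = rng.standard_normal(n_pts)           # broadband "n"
observed = signal + noise

# Noncausal Wiener gain built from the known spectra: pass frequencies
# where the signal power dominates, suppress the rest.
S = np.abs(np.fft.fft(signal)) ** 2                 # signal power spectrum
N = np.full(n_pts, float(n_pts))                    # white noise: flat E|N(k)|^2
gain = S / (S + N)
estimate = np.real(np.fft.ifft(gain * np.fft.fft(observed)))

err_raw = np.mean((observed - signal) ** 2)
err_wiener = np.mean((estimate - signal) ** 2)
```

Because the sinusoid occupies only two DFT bins, the quadratic-cost optimum discards essentially all of the broadband noise while passing the signal nearly unchanged.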
In other optimization problems we may wish to minimize the norm
of some error signal while keeping the norm of some other system signal
within a certain range. In a feedback control system, for example, we
may want to minimize the norm of the error with the constraint that
the norm of the input to the plant be less than or equal to some
prescribed number. Using Lagrange's method of undetermined multipliers,
this problem can be reduced to the problem of minimizing a quantity of
the form

‖e‖² + λ‖i‖² (45)

where e is an error signal, i is some energy limited signal, and both e
and i depend on an undetermined filter H. Again, if H₀(s) is the
time-invariant realizable solution to such an analog problem, then Π₀(z) is
the time-invariant realizable solution to the analogous digital problem
determined by the mapping μ.
More generally, we can state

THEOREM 8. Let ν be an isomorphism between L2(−∞, ∞) and l2 ...

Πi(z) = Hi((z − 1)/(z + 1)),   i = 1, 2, 3, ..., n. (46)
VII. RANDOM SIGNALS AND STATISTICAL OPTIMIZATION PROBLEMS

... order properties. In the analog case these are the correlation function
φxy(t) and its Fourier transform Φxy(s). In the digital case these are the
correlation sequence φxy(n) and its z-transform Φxy(z).

We define the mapping μ for correlation functions in the following
way, motivated by mapping the signals in the ensembles by the
isomorphism μ for signals:

(47)
Presented at the 1969 Polytechnic Institute of Brooklyn Symposium on
Computer Processing in Communications. To appear in the symposium
proceedings.
by
B. Gold
A. V. Oppenheim
C. M. Rader
Lexington, Massachusetts
ABSTRACT
... tion, as well as its use in relating the real and imaginary components, and the
... similar role in digital signal processing. In this paper, the Hilbert transform
relations, as they apply to sequences and their z-transforms, and also as they
... These relations are identical only in the limit as the number of data samples
... sequences usually takes the form of digital linear networks with constant
coefficients ...
1. Introduction
Hilbert transforms have played a useful role in signal and network theory
and have also been of practical importance in various signal processing systems.
Analytic signals, bandpass sampling, minimum phase networks, and much of spectral
analysis theory are based on Hilbert transform relations. Systems for performing
Hilbert transform operations have proved useful in diverse fields such as radar
moving target indicators, analytic signal rooting [1], measurement of the voice
fundamental frequency [2, 3], envelope detection, and generation of the phase of
a spectrum given its amplitude [4, 5, 6].
2. Convolution Theorems

X(z) = Σ_{n=−∞}^{∞} x(n) z^{−n}

Given two such sequences x(n) and h(n) and their corresponding z-transforms
X(z) and H(z), then, if Y(z) = X(z)H(z), we have the convolution theorem

y(n) = Σ_{m=−∞}^{∞} x(n − m) h(m) = Σ_{m=−∞}^{∞} x(m) h(n − m) (1)
Similarly, if y(n) = x(n)h(n), we have the complex convolution theorem

Y(z) = (1/2πj) ∮ X(v) H(z/v) v⁻¹ dv (2)

where v is the complex variable of integration and the integration path chosen is
the unit circle, taken counterclockwise.
The spectrum of a signal is defined as the value of its z-transform on the unit
circle in the z-plane. Thus, the spectrum of x(n) can be written as X(e^{jθ}),
where θ is the angle of the vector from the origin to a point on the unit circle.
If x(n) is a sequence of finite length N then it can be represented by its
discrete Fourier transform X(k); we have

X(k) = Σ_{n=0}^{N−1} x(n) W^{−nk}

x(n) = (1/N) Σ_{k=0}^{N−1} X(k) W^{nk} (3)

with W = e^{j2π/N}.
The convolution theorems for these finite sequences specify that if
Y(k) = H(k)X(k), then

y(n) = Σ_{m=0}^{N−1} x(((n − m))) h(m) = Σ_{m=0}^{N−1} x(m) h(((n − m))) (4)

where the double parentheses around the expressions k − ℓ and n − m refer
to these expressions modulo N; i.e., ((x)) = the unique integer x + kN satisfying
0 ≤ x + kN ≤ N − 1.
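The circular convolution (4) and its frequency-domain counterpart Y(k) = H(k)X(k) are easy to check against each other; a short sketch of ours:

```python
def circular_convolution(x, h):
    """Direct evaluation of (4): y(n) = sum_m x((n - m) mod N) h(m)."""
    n_pts = len(x)
    return [sum(x[(n - m) % n_pts] * h[m] for m in range(n_pts))
            for n in range(n_pts)]
```

Multiplying the DFTs of x and h and inverse-transforming gives the same sequence, which is exactly the statement of the theorem.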
First, we will derive an expression for X(z) outside (not on) the unit circle
given R(e^{jθ}) (on the unit circle), beginning with the physically appealing
concept of causality. A causal sequence can always be reconstructed from its
even part, defined as

x_e(n) = ½ [x(n) + x(−n)].
Now, consider X(z) outside the unit circle, that is, for z = re^{jθ} with
r > 1. Then,

X(re^{jθ}) = Σ_{n=−∞}^{∞} x(n) r^{−n} e^{−jnθ}
           = 2 Σ_{n=−∞}^{∞} x_e(n) s(n) r^{−n} e^{−jnθ} − x(0) (9)

Thus, using (2),

X(z)|_{z=re^{jθ}} = (1/2πj) ∮ R(z/v) (v + 1)/(v(v − 1)) dv (10)

(In this and subsequent contour integrals, the contour of integration is always
taken to be the unit circle.)

Equation (10) expresses X(z) outside the unit circle in terms of its real
part on the unit circle. Equation (10) was written as a contour integral to
stress the fact that in the physically most interesting case, when R(z) is a
rational fraction, evaluation of (10) is most easily performed by contour
integration using residues.

Similarly, we may construct X(re^{jθ}) from I(e^{jθ}) by noting that
x(n) = 2x_o(n)s(n) + x(0)δ(n), where x_o(n) denotes the odd part of x(n) and
δ(n) is the unit pulse, defined as unity for n = 0 and zero elsewhere. The
result obtained is

X(z)|_{z=re^{jθ}} = (1/2πj) ∮ jI(z/v) (v + 1)/(v(v − 1)) dv + x(0). (11)

Now, Eqs. (10) and (11) also hold in the limit as r → 1, provided care is taken
to evaluate the integral correctly in the presence of a pole on the unit circle. This
can be done formally by changing the integrals in (10) and (11) to the Cauchy
principal values of these integrals, where the latter is defined as:
P (1/2πj) ∮ f(z)/(z − z₀) dz = f(z₀)    for |z₀| < 1
                             = 0        for |z₀| > 1 (12)
                             = ½ f(z₀)  for |z₀| = 1
From (10), (11) and (12), it is a simple matter to construct explicit relations
between R(e^{jθ}) and I(e^{jθ}). Alternately, these results could have been
derived by appealing directly to Fig. 1, which shows the explicit relation
between the real and imaginary parts of a causal function to be

x(n) = lim_{α→1} [x_e(n) w_I(n)] + δ(n) x(0) (13)

Figure 2 shows the ring of convergence for the z-transform W_I(z) of w_I(n)
and Fig. 3 shows the poles and zeros of W_I(z).
The results obtained are

jI(z)|_{z=e^{jθ}} = P (1/2πj) ∮ R(z/v) (v + 1)/(v(v − 1)) dv (14)

R(z)|_{z=e^{jθ}} = P (1/2πj) ∮ jI(z/v) (v + 1)/(v(v − 1)) dv + x(0) (15)

By setting z = e^{jθ} and v = e^{jφ}, we change the contour integrals (14) and
(15) to line integrals, yielding

I(e^{jθ}) = −(1/2π) P ∫_0^{2π} R(e^{j(φ−θ)}) cot(φ/2) dφ (16)
Similar results can be obtained for the real and imaginary parts of the
discrete Fourier transform of a finite duration sequence provided that the
sequence is 'causal' in the sense that, if the sequence x(n) is considered to be
of duration N, then x(n) = 0 for n > N/2. Defining the even and odd parts of
x(n) as

x_e(n) = ½ (x[((n))] + x[((−n))])

and

x_o(n) = ½ (x[((n))] − x[((−n))])

it follows (for N even) that

x(n) = x_e(n) f(n)

where

f(n) = 1,   n = 0, N/2
     = 2,   n = 1, 2, ..., N/2 − 1
     = 0,   n = N/2 + 1, N/2 + 2, ..., N − 1
From these relations, we can then derive that the real and imaginary parts,
R(k) and I(k), of the DFT of x(n) are related by

jI(k) = (1/N) Σ_{r=0}^{N−1} R(r) F[((k − r))] (17)

R(k) = (1/N) Σ_{r=0}^{N−1} jI(r) F[((k − r))] + x(0) + (−1)^k x(N/2) (18)

where

F(k) = −2j cot(πk/N),   k odd
     = 0,               k even
Note that (17) and (18) are circular convolutions which can be numerically
evaluated by fast Fourier transform methods. Similar but not identical relations
can also be derived if N is odd.
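The reconstruction that these relations express can be carried out with FFTs. The sketch below is our own: it recovers the full DFT of a 'causal' sequence (in the above sense) from the real part of its DFT alone, using the f(n) weighting (values of 1 at n = 0 and n = N/2 are our reading of the definition above):

```python
import numpy as np

def real_to_full_dft(R):
    """Given R(k), the real part of the DFT of a real sequence with
    x(n) = 0 for n > N/2 (N even), reconstruct the full DFT X(k)."""
    N = len(R)
    xe = np.real(np.fft.ifft(R))   # inverse DFT of R is the even part x_e(n)
    f = np.zeros(N)
    f[0] = 1.0
    f[N // 2] = 1.0
    f[1:N // 2] = 2.0              # x(n) = x_e(n) * f(n) for a 'causal' x
    return np.fft.fft(xe * f)
```

Both directions of the relation are circular convolutions, so, as the text notes, they can be evaluated entirely with fast Fourier transforms.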
If, instead of working with the z-transform of a sequence, we choose to work
with the logarithm of the z-transform, then comparable Hilbert transform relations
can be derived between the log magnitude of the spectrum and its phase. However,
certain theoretical restrictions arise, due to the fact that (a) the logarithm of
zero diverges and (b) the definition of phase is ambiguous. However, the derivative
of the phase (with respect to z) is not ambiguous; this leads to relationships
based on the definition

D(z) = (1/X(z)) dX(z)/dz (19)
(log|X(e^{jθ})|)′ = (1/2π) P ∫_0^{2π} Ψ′(e^{j(φ−θ)}) cot(φ/2) dφ (20)

Ψ′(e^{jθ}) = −(1/2π) P ∫_0^{2π} (log|X(e^{j(φ−θ)})|)′ cot(φ/2) dφ (21)

where |X| is the magnitude of the spectrum and Ψ its phase, and the primes denote
differentiation with respect to θ.

If we impose the condition that Ψ is an odd function, then it must be zero for
θ = 0 and (20) and (21) may be integrated to give
log|X(e^{jθ})| = (1/2π) P ∫_0^{2π} Ψ(e^{j(φ−θ)}) cot(φ/2) dφ (22)

Ψ(e^{jθ}) = −(1/2π) P ∫_0^{2π} log|X(e^{j(φ−θ)})| cot(φ/2) dφ (23)

The requirement that the inverse z-transform of D(z) be zero for n < 0 imposes
a restriction on the pole and zero locations of X(z). Since the poles of D(z)
occur wherever there are either poles or zeros of X(z), and since the inverse
transform of D(z) is zero for n < 0 only if the poles of D(z) are all within the
unit circle, it follows that both poles and zeros of X(z) must be within the unit
circle in order for Eqs. (20) through (23) to be valid. This is the well-known
minimum phase condition [11].
It is also possible to relate the log magnitude and the phase of the DFT by
analogous relations provided that the inverse DFT of the logarithm of the DFT is
causal. The difficulty in applying this notion is that the logarithm of X(k) is
ambiguous since X(k) is complex. For the previous case of the z-transform,
this ambiguity was resolved in effect by considering the phase to be a continuous,
odd, periodic function; this definition of the phase cannot be applied in this case.
Nevertheless, it has been useful computationally for constructing a phase function
from the log magnitude of a DFT by computing the inverse DFT of the log magnitude,
multiplying by the function f(n) and then transforming back [4, 5]. The real
part of the result is the log magnitude as before and the imaginary part is an
approximation to the phase.
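The computational procedure just described (inverse DFT of the log magnitude, multiplication by f(n), forward DFT) can be sketched as follows. The two-point minimum-phase sequence is our own test input, not an example from the paper; for such a sequence the imaginary part of the result is essentially the exact phase:

```python
import numpy as np

N = 512
x = np.zeros(N)
x[0], x[1] = 1.0, 0.5   # minimum phase: its only zero, at z = -0.5, is inside the unit circle
X = np.fft.fft(x)

# Inverse DFT of the log magnitude, weighted by f(n), forward DFT:
c = np.real(np.fft.ifft(np.log(np.abs(X))))   # "real cepstrum" of x
f = np.zeros(N)
f[0] = 1.0
f[N // 2] = 1.0
f[1:N // 2] = 2.0
log_X = np.fft.fft(c * f)
phase_est = np.imag(log_X)                    # approximation to the phase
```

The real part of log_X reproduces the log magnitude, and the imaginary part approximates the (minimum) phase, the approximation error here being only cepstral aliasing.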
The relations of Section 3 were derived via the complex convolution theorem
(2) and the requirement of causality. By interchanging time and frequency and
using the convolution theorem (1), further relations can be found which are of
practical and theoretical interest. One way of obtaining such relations is by the
introduction of the 'ideal' Hilbert transformer, which has a spectrum defined as
having the value +j for 0 < φ < π and −j for π < φ < 2π, or equivalently, a
spectrum with flat magnitude vs. frequency and a phase of ±π/2. Thus, a Hilbert
transformer is a (non-realizable) linear network with this transfer function and,
as shown in Fig. 4a, the output of the network is the Hilbert transform of the
input. Hilbert transform relations can also be realized by having two all-pass
networks with a phase difference of π/2, as shown in Fig. 4c; such a configuration
is useful for synthesis of realizable approximate Hilbert transformers.
The unit pulse response of a Hilbert transformer can be derived by
evaluating its inverse z-transform. Thus

h(n) = (1/2πj) ∮ H(z) z^{n−1} dz = (1/2π) ∫_0^{2π} H(e^{jθ}) e^{jθn} dθ

which yields

h(n) = (1 − e^{jπn})/(πn)   for n ≠ 0
     = 0                    for n = 0 (24)

y(n) = (1/π) Σ_{m=−∞, m≠n}^{∞} x(m) [1 − e^{jπ(n−m)}]/(n − m) (25)

Equation (25) can be inverted by noting that X(z) = H*(z)Y(z) (where H*(z) is
the complex conjugate of H(z)); this yields

x(n) = −(1/π) Σ_{m=−∞, m≠n}^{∞} y(m) [1 − e^{jπ(n−m)}]/(n − m) (26)

Thus (25) and (26) can be said to be a Hilbert transform signal pair. The graph
of (1/π)h(n) is shown in Fig. 5.
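A realizable approximation is obtained simply by truncating the pulse response (24) to finitely many terms. The sketch below is ours: it applies such a truncated Hilbert transformer to a cosine and, away from the ends of the data, recovers (to within the truncation error) the corresponding sine:

```python
import math

def hilbert_h(n):
    """Unit pulse response (24): h(n) = (1 - e^{j*pi*n})/(pi*n), which is
    2/(pi*n) for n odd and 0 for n even (including n = 0)."""
    if n % 2 == 0:
        return 0.0
    return 2.0 / (math.pi * n)

def hilbert_approx(x, half_len):
    """Truncated (hence approximate, non-ideal) Hilbert transformer:
    y(n) = sum over |m| <= half_len of h(m) x(n - m)."""
    out = []
    for n in range(len(x)):
        acc = 0.0
        for m in range(-half_len, half_len + 1):
            if 0 <= n - m < len(x):
                acc += hilbert_h(m) * x[n - m]
        out.append(acc)
    return out
```

The slow 1/n decay of h(n) is why the truncation must be fairly long for good accuracy; windowing the taps (not shown) improves matters.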
The complex signal s(n) = x(n) + j y(n) (where x(n) and y(n) are a Hilbert
transform pair) has been called the analytic signal and has the useful property that
its spectrum is zero along the bottom half of the unit circle. One application of the
analytic signal is to the bandpass sampling problem. Consider the problem of
sampling a real signal having the spectrum of Fig. 6a. If this signal is passed
through the phase splitter of Fig. 4c, the resulting analytic signal has the spectrum
shown in Fig. 6b, and thus can be sampled at intervals of 1/B. To reconstruct the
original signal requires that the samples be applied to the unity gain bandpass filter
shown in Fig. 6c. The real part of the filtered signal corresponds to the original
signal.
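The analytic-signal construction can be sketched directly with the DFT. This is a minimal illustration in which the phase splitter of Fig. 4c is idealized by simply zeroing the bottom half of the unit circle:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64
x = rng.standard_normal(N)              # a real test signal

X = np.fft.fft(x)
S = np.zeros(N, dtype=complex)
S[0], S[N // 2] = X[0], X[N // 2]       # endpoints kept once
S[1:N // 2] = 2 * X[1:N // 2]           # top half of the unit circle, doubled
s = np.fft.ifft(S)                      # analytic signal s(n) = x(n) + j y(n)

assert np.allclose(s.real, x)                       # real part is the original
assert np.allclose(np.fft.fft(s)[N // 2 + 1:], 0)   # bottom half is zero
```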
Another application of Hilbert transformers is to help create a bandpass
spectrum which is arithmetically symmetric about an arbitrary center frequency.
Effectively, the ability to do this allows us to design bandpass filters which are
linear translations in frequency of prototype low pass filters, thus avoiding the
distortions inherent in the standard low pass-bandpass transformations. Figure
7a illustrates a symmetric low pass. When a conventional transformation is
applied, the non-symmetric bandpass of Fig. 7b results. Symmetry may be
attained with the filter of Fig. 7c; however, we note that the output of such a filter
is a sequence of complex numbers and, also, that by merely taking the real part
we must introduce the complex conjugate pole, thus destroying the symmetry.
Symmetry of a real output over the range 0 through π can be maintained
by the configuration of Fig. 8, where H1(z) and H2(z) are all-pass phase splitters
such as shown in Fig. 4. If only the real part of the signal is desired, then a single
phase splitter (rather than two) is needed. A filter satisfying the pole-zero pattern
of Fig. 7c is easily made and well known, and has usually been referred to as the
coupled form [12, 9].
a. Recursive Networks
Analog phase splitting networks have been extensively analyzed and
synthesized [13, 9]. Since the desired networks are all-pass with constant phase
difference over a frequency band, it is feasible to use the bilinear transformation
[14, 15, 9] to carry analog designs into the digital domain. The resulting
networks are all-pass, so that each pole at, say, z = a, has a matching zero at
z = 1/a. An equiripple approximation to a constant 90° phase difference is
obtained by the use of Jacobian elliptic functions [16], with the added advantage
that all the poles and zeros lie on the real axis.
Let the two networks comprising the phase splitter be H1(z) and H2(z).
To synthesize the all-pass networks H1(z) and H2(z) in an efficient manner,
we note that the first-order difference equation

    y(n) = x(n-1) + a [y(n-1) - x(n)]

corresponds to the digital network

    H(z) = (z^{-1} - a)/(1 - a z^{-1}).    (28)

This shows that an all-pass network with a pole at z = a and a zero at z = 1/a
can be synthesized with a single multiply in a first-order difference equation. In
Fig. 9 is shown a complete digital 90° phase splitter which meets the requirement
that the phase difference deviates from 90° in an equiripple manner by ±1° in
the range 10° through 120° along the unit circle. From (28) we see that the
coefficients in Fig. 9 are equal to the pole positions. The nomenclature of Fig. 9
is the following: the box z^{-1} signifies a unit delay, the plus signifies addition,
and a number-arrow combination signifies both direction of data flow and
multiplication by the number. Arrows without numbers signify only data flow.
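The single-multiply section can be sketched as follows; this is a minimal implementation of the difference equation y(n) = x(n-1) + a[y(n-1) - x(n)] implied by (28), with a hypothetical pole value:

```python
import numpy as np

def allpass(x, a):
    # y(n) = x(n-1) + a * (y(n-1) - x(n)): one multiply per output sample
    y = np.zeros(len(x))
    x_prev = y_prev = 0.0
    for i, xn in enumerate(x):
        y[i] = x_prev + a * (y_prev - xn)
        x_prev, y_prev = xn, y[i]
    return y

a = 0.7                                  # hypothetical pole position
imp = np.zeros(32)
imp[0] = 1.0
h = allpass(imp, a)

# impulse response of (z^-1 - a)/(1 - a z^-1)
assert np.isclose(h[0], -a)
m = np.arange(1, 32)
assert np.allclose(h[1:], (1 - a * a) * a ** (m - 1))

# the truncated response is all-pass to within the truncation error
Hf = np.fft.fft(h, 4096)
assert np.allclose(np.abs(Hf), 1.0, atol=1e-3)
```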
Now it is well known [9, 12, 17, 18, 19, 20, 21, 22] that because of finite
register length, the performance of the actual filter deviates somewhat from that
of the design. These effects can be categorized as follows:
a) Quantization of the input signal
b) Roundoff noise caused by the multiplications
(fixed point arithmetic is assumed)
c) Deadband effect
d) A fixed deviation in the filter characteristic caused by
inexact coefficients.
The analysis of these effects is simplified because the networks H1(z)
and H2(z) are all-pass. Thus, signal to noise ratios caused by (a) are the same
at the output of the networks as at the input. Item (b) can be analyzed for H1(z)
or H2(z) by inserting noise generators at all adder nodes following multiplications.
But each noise is then filtered by a cascade combination of the pole of the section
in which the noise is generated, and an all-pass network. The well-known formula
for the output variance of a network which has been subjected to a white noise input
with uniform probability density of amplitude is given by

    σ² = (E₀²/12) Σ_{n=0}^{∞} h²(n)

where E₀ is a quantum step and h(n) is the network impulse response. For a
single pole at z = a, h(n) = aⁿ; assuming m independent noise generators and
m poles at a₁, a₂, ..., a_m causes a total variance

    σ² = (E₀²/12) Σ_{i=1}^{m} 1/(1 - a_i²).    (29)

We see that only values of a_i near unity cause much noise. Thus, for our
numerical design example, only about 1 bit of noise is generated. Item (c) can
be analyzed by similar considerations but it is probably not important for
bandpass phase splitters anyway, since it is only an effect when the input is a constant.
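Equation (29) follows because noise filtered by a single pole at z = a has variance proportional to Σ a^{2n} = 1/(1 - a²). A numerical check, with hypothetical quantum step and pole values:

```python
import numpy as np

E0 = 2.0 ** (-15)              # quantum step (hypothetical 16-bit word)
poles = [0.9, 0.95]            # pole positions a_i (hypothetical)
n = np.arange(20000)

# direct sum of the squared impulse response a^n for each noise source
direct = sum((E0**2 / 12) * np.sum(a ** (2 * n)) for a in poles)
# closed form of equation (29)
closed = (E0**2 / 12) * sum(1.0 / (1 - a * a) for a in poles)

assert np.isclose(direct, closed)
```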
For small errors in coefficients, item (d) can be analyzed in a manner
similar to that of reference [12]. The realization chosen in Fig. 9 guarantees
that even though a given coefficient is in error, the poles and zeros of the networks
remain reciprocals, so that only the phase response of the network can be affected.
Let the phase response due to a pole-zero pair at a, 1/a be

    W(ω, a) = 2 tan^{-1} ( sin ω / (a - cos ω) ).

The phase error for a coefficient error Δa is then approximated by the early
terms of the Taylor series of W(ω, a + Δa) in Δa. Using this approximation for
each of the poles one can estimate how many bits are necessary to keep the phase
error within a given tolerance. Of course, once a coefficient has been specified,
the phase difference can be computed precisely.
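The bit estimate can be sketched as follows, assuming the phase function W(ω, a) above and a hypothetical tolerance: a B-bit coefficient is in error by at most 2^{-B}, so B is chosen to make |∂W/∂a|·2^{-B} no larger than the allowed phase error, and the result is then checked exactly.

```python
import numpy as np

def W(w, a):
    # phase due to a pole-zero pair at a, 1/a
    return 2 * np.arctan(np.sin(w) / (a - np.cos(w)))

a, w = 0.9, 1.0                  # hypothetical coefficient and frequency
dWda = -2 * np.sin(w) / ((a - np.cos(w))**2 + np.sin(w)**2)   # dW/da

tol = 1e-3                       # allowed phase error in radians (hypothetical)
B = int(np.ceil(np.log2(abs(dWda) / tol)))   # bits so |dW/da| * 2^-B <= tol
da = 2.0 ** (-B)

# once the coefficient is specified, the phase error can be computed precisely
assert abs(W(w, a + da) - W(w, a)) <= tol * 1.01
```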
A phase splitter can also be realized nonrecursively, based on the
frequency-sampling network

    H(z) = (1 - z^{-N})/N Σ_{k=0}^{N-1} H_k/(1 - z^{-1} W^{-k}),   W = e^{-j2π/N}    (30)

where the H_k are the values of the frequency response at N equally spaced
points on the unit circle.
Exact 90° phase can be attained at every frequency by specifying that the
H_k be purely imaginary. However, for a real unit pulse response, such a phase
shifter must have a magnitude characteristic which passes through zero at 0, π,
2π, etc., as shown in Fig. 11. Thus, an ideal phase can be attained only by further
degrading the all-pass property of the network near 0 and π.
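A sketch of such a phase shifter, with the frequency samples chosen purely imaginary and H₀ = H_{N/2} = 0 forced by the real pulse response:

```python
import numpy as np

N = 16
Hk = np.zeros(N, dtype=complex)
Hk[1:N // 2] = -1j        # samples on 0 < w < pi
Hk[N // 2 + 1:] = 1j      # conjugate-symmetric samples on pi < w < 2 pi
# Hk[0] = Hk[N//2] = 0: the magnitude must pass through zero at 0 and pi

h = np.fft.ifft(Hk)
assert np.allclose(h.imag, 0)                  # real unit pulse response
assert np.allclose(np.fft.fft(h.real), Hk)     # exactly +-90 deg at every sample
```

Between the sample frequencies the magnitude droops toward the forced zeros at 0 and π, which is the degradation of the all-pass property noted above.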
References
3. C.M. Rader, "Vector Pitch Detection," J. Acoust. Soc. Am., vol. 36, 1963.
8. E.I. Jury, "Theory and Application of the z-Transform Method," Wiley, 1964.
11. H.W. Bode, "Network Analysis and Feedback Amplifier Design," D. Van
Nostrand Company, 1945.
12. C.M. Rader and B. Gold, "Effects of Parameter Quantization on the Poles of a
Digital Filter," Proc. IEEE, May 1967, p. 688.
15. C.M. Rader and B. Gold, "Digital Filter Design Techniques in the Frequency
Domain," Proc. IEEE, vol. 55, no. 2, February 1967, pp. 149-171.
16. E.T. Whittaker and G.N. Watson, "Modern Analysis," Cambridge University
Press, 1952 (4th edition).
18. J.B. Knowles and R. Edwards, "Effect of a Finite-Wordlength Computer in a
Sampled-Data Feedback System," Proc. IEE (London), vol. 112, pp. 1197-1207,
June 1965.
20. J.B. Knowles and E.M. Olcayto, "Coefficient Accuracy and Digital Filter
Response," IEEE Trans. Circuit Theory, vol. CT-15, pp. 31-41, March 1968.
21. T. Kaneko and B. Liu, "Round-off Error of Floating Point Digital Filters,"
6th Ann. Allerton Conf. on Circuit and System Theory, Oct. 2-4, 1968.
23. B. Gold and K.L. Jordan, Jr., "Linear Programming Procedure for Designing
Finite Duration Impulse Response Filters" (to be published in the IEEE
Trans. on Audio and Electroacoustics).
[Fig. 1: the sequences x(n), x_e(n), x_o(n), and w(n).]
[Fig. 2: region of convergence in the z-plane.]
[Fig. 3: z-plane diagram.]
[Fig. 4: Hilbert transformer networks H1(z) and H2(z), input x(n), output y(n).]
[Fig. 5: unit pulse response of the Hilbert transformer.]
[Fig. 6: spectra (a)-(c) for the bandpass sampling example, bandwidth B, frequencies π and 2π marked.]
[Fig. 7: z-plane pole-zero patterns: (a) symmetrical low pass, (b) nonsymmetrical band pass, (c) symmetrical band pass.]
[Fig. 8: configuration using phase splitters H1(z) and H2(z); the real and imaginary parts of the output of the filter of Fig. 7c are combined to give a real output.]
[Fig. 9: complete digital 90° phase splitter network.]
[Fig. 10: network diagram.]
[Fig. 11: phase and magnitude characteristics of the exact 90° phase shifter over 0 to 2π.]
Reprinted from the PROCEEDINGS OF THE IEEE, vol. 56, no. 10, October 1968, pp. 1717-1718.
Copyright © 1968, The Institute of Electrical and Electronics Engineers, Inc. Printed in the U.S.A.
A Note on Digital Filter Synthesis

Abstract: It is commonly assumed that digital filters with both poles and
zeros in the complex z-plane can be synthesized using only recursive techniques,
while filters with zeros alone can be synthesized by either direct convolution
or via the discrete Fourier transform (DFT). In this letter it is shown
that no such restrictions hold and that both types of filters (those with zeros
alone or those with both poles and zeros) can be synthesized using any of the
three methods, namely, recursion, DFT, or direct convolution.

I. INTRODUCTION

A digital filter can be synthesized either by direct convolution, by
linear recursive equations, or by the use of the discrete Fourier transform
(DFT), usually via the fast Fourier transform (FFT). Kaiser¹ has used the
terms "recursive" and "nonrecursive" to distinguish between filters that
are defined by an impulse response of finite duration (nonrecursive) and
those defined by an impulse response of infinite duration. The former type
contains only zeros while the latter has both poles and zeros in the complex
z-plane. It is well known that the design methods for filters with only zeros
differ markedly from the design methods for filters with zeros and poles.

A finite duration impulse response h(nT) has the transfer function

    H(z) = Σ_{n=0}^{N-1} h(nT) z^{-n}    (1)

where T is the sampling interval. Since h(nT) is of finite duration, it has a
DFT given by

    H_k = Σ_{n=0}^{N-1} h(nT) e^{-j(2π/N)nk},   k = 0, 1, 2, ..., N - 1    (2)

where the set of H_k are precisely the values of H(z) at equally spaced points
on the unit circle in the z-plane. The impulse response h(nT) can then be
expressed as the inverse DFT of the samples H_k, and if this inverse DFT is
substituted into (1), with the order of the resulting double summation
inverted, we easily arrive at the result

    H(z) = (1 - z^{-N})/N Σ_{k=0}^{N-1} H_k/(1 - e^{j2πk/N} z^{-1}).    (3)

Equation (3) can be physically represented by a comb filter with transfer
function 1 - z^{-N} cascaded with a parallel arrangement of first-order
recursive equations. In general, the coefficients of these equations (e^{j2πk/N})
are complex, but if in (3) we specify that H_{N-k} and H_k are complex conjugates,
then combinations of second-order recursive networks with real
coefficients can be derived. An interesting special case of (3) is the case
θ_k = kπ and M_k = 1, where H_k = M_k e^{jθ_k}. This leads to a bandpass filter with
linear phase, which has been called "frequency sampling" by Rader and
Gold.² If two such filters are designed so that the frequency response of one
is a linear translation of the other, then it is easy to show that the two outputs
have a constant phase difference over the passbands common to both filters.

Further, F(z) and G(z) are z-transforms which depend solely on the initial
conditions and are given by

    F(z) = Σ_j a_j z^{-j},   G(z) = Σ_j b_j z^{-j},    (7)

where the coefficients a_j and b_j are linear combinations of the initial
condition values y(-T - iT) and x(-T - iT), respectively. If we define h₁(nT) to
be the inverse z-transform of R(z)/D(z), h₂(nT) to be the inverse z-transform
of 1/D(z), f(nT) to be the inverse z-transform of F(z), and g(nT) to be the
inverse z-transform of G(z), then the solution contains the recursive term
Σ_i h₂(iT) y(nT - iT). Sections II and III demonstrate the claim made in the
Introduction, namely, that filters with zeros alone can be synthesized recursively
while

Manuscript received June 24, 1968.
¹ J.F. Kaiser, in System Analysis by Digital Computer, J.F. Kaiser and F. Kuo, Eds. New York: Wiley, 1966.
² C.M. Rader and B. Gold, Proc. IEEE, vol. 55, pp. 149-171, February 1967.
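The equivalence claimed for (3) can be sketched numerically: a finite impulse response synthesized recursively as a comb 1 - z^{-N} cascaded with a parallel bank of first-order complex resonators, compared against direct convolution. All values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
h = rng.standard_normal(N)               # finite impulse response (zeros only)
Hk = np.fft.fft(h)                       # its frequency samples
x = rng.standard_normal(40)

direct = np.convolve(x, h)               # nonrecursive synthesis
L = len(direct)

# recursive synthesis via (3): comb 1 - z^-N, then N first-order sections
xp = np.concatenate([x, np.zeros(L - len(x))])
v = xp - np.concatenate([np.zeros(N), xp[:-N]])       # comb output
y = np.zeros(L, dtype=complex)
for k in range(N):
    pole = np.exp(2j * np.pi * k / N)
    s = 0j
    for m in range(L):
        s = pole * s + v[m]              # 1/(1 - e^{j2 pi k/N} z^-1)
        y[m] += Hk[k] * s / N

assert np.allclose(y.imag, 0)
assert np.allclose(y.real, direct)       # both syntheses agree
```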
Linear Programming Procedure for Designing Finite Duration Impulse Response Filters

BERNARD GOLD, Member, IEEE
K.L. JORDAN, JR., Member, IEEE
M.I.T. Lincoln Laboratory,¹ Lexington, Mass. 02173

Abstract: We introduce an approach to the design of low-pass (and, by
extension, bandpass) digital filters containing only zeros. This approach is
that of directly searching for transition values of the sampled frequency
response function to reduce the sidelobe level of the response. It is
shown that the problem is a linear program, and a search algorithm is
derived which makes it easier to obtain the experimental results.

Introduction

In (1), T is the sampling interval, x(nT) is the input sequence, y(nT) is the
output sequence, and h(nT) is the impulse response. For filters whose impulse
response has finite duration N, the limit in (1) can be replaced by N - 1.
For such filters, there exist a variety of design techniques, which have been
reviewed in some detail by Kaiser [1]. In particular, the frequency domain
technique which Kaiser has called the "Fourier series" method has received
much attention and has led to the invention of various spectral "window"
functions which, when convolved with the ideal spectrum, tend to reduce
the sidelobes of the filter frequency response characteristic. The idea is
illustrated in Fig. 1. In Fig. 1(A) is shown a specification of an "ideal"
filter. Since the impulse response is of length N, only N points of the
(continuous) frequency response can be specified. If the specification is as
shown in Fig. 1(A), then the actual response exhibits strong sidelobes, as
shown in Fig. 1(B). If, for example, H_k is convolved with W_k of Fig. 1(C),
producing the samples F_k of Fig. 1(D), the sidelobes decrease as shown in
Fig. 1(E). This reduction is, of course, paid for by the increased transition
band. The design problem reduces to one of finding "good" windows which
either minimize the out-of-band sidelobes, or minimize the in-band ripple,
or result in some compromise between these two criteria.

Finding the best transition values is, fundamentally, a minimization in M
dimensions. We observe, however, that the continuous frequency response
H(e^{jωT}) of the filter can be expressed as a linear function of its uniformly
spaced samples H_k. The required interpolation formula can be derived, via
the discrete Fourier transform, to be

    H(e^{jωT}) = (1/N) Σ_{k=0}^{N-1} H_k e^{-jπk/N} sin(NωT/2) / sin(ωT/2 - πk/N).    (2)

The out-of-band response at any frequency is thus a linear function of the M
transition values T_k and, therefore, defines an M-dimensional hyperplane in
the (M+1)-dimensional space. The M+1 dimensions consist of the M transition
values and the frequency response H(e^{jωT}). We wish to solve the minimization
problem

    min_{T_k} max_l |G_l|

which can alternately be expressed in terms of the sets {G_l} and {-G_l}. The
expression max_l (G_l, -G_l) is the upper envelope of the hyperplanes formed by
the sets {G_l} and {-G_l} and therefore describes a flat-sided convex polyhedron.
A local minimum is thus a global minimum, and the problem described is a form
of linear programming problem.

As a practical matter, the local maxima or sidelobes in the out-of-band
region are nearly periodic and their positions are nearly constant. If only one
of the T_k is varied, this corresponds to a slice through the polyhedron as
shown, for example, in Fig. 5. As the particular T_k is varied, the maximum
out-of-band response switches from one peak to another so that the frequency
position changes drastically; this corresponds to a sudden change in the slope
of the curve.

The above properties can be used to derive a reasonably efficient search
procedure. The idea is to reduce an M-dimensional search to a sequence of
one-dimensional searches. To illustrate the procedure, we assume that the
transition region contains two transition points x and y and search for the
point x_m, y_m which minimizes the maximum out-of-band response. For a given
value of x, say x₁, the search for an optimum y is one-dimensional; it
corresponds to a vertical plane intersecting the polyhedron, and the procedure
is to solve the one-dimensional problem of finding the minimum along this
intersection.

[Fig. 1. Ripple in finite-impulse response digital filter.]
[Fig. 2. Model of finite-impulse response filter with a transition band.]

Manuscript received August 14, 1968; revised December 18, 1968.
¹ Operated with support from the U.S. Air Force.
IEEE TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS, vol. AU-17, no. 1, March 1969
[Fig. 4. Variation in filter sidelobe amplitude versus bandwidth (N = 256).]
[Fig. 5. Sidelobe amplitude versus percent error in y.]
[Fig. 6. Sidelobe and in-band ripple as a function of x with y optimized for minimum sidelobe ripple.]
tion of zero sample values is not shown. For N = 256 and BW = 32, the optimum
is x = 0.7, y = 0.225, and z = 0.01995. As can be seen from Fig. 3, the maximum
out-of-band response is about 85 dB below the in-band response. This compares
favorably with the "Blackman" window, which yields about the same out-of-band
response but which corresponds to 4 rather than 3 transition points [2]. No
information is presently available to us on the in-band ripple of the Blackman
window.

A given optimum is in general valid for given values of two parameters, N
and BW. It is laborious to find a different M-dimensional optimum for many
values of these parameters. However, if BW is not very small and not too near
N/2, the optimum values do not change too drastically. This is illustrated in
Fig. 4, which, for the same values of x, y, and z, shows the relative amplitude
of the minimax sidelobe as a function of BW for N = 256. It is seen that for most
bandwidths, the sidelobes are still 80 dB below the in-band response.

It was also of interest to study how sensitive the sidelobes were to deviations
from optimum settings. Fig. 5 shows this for a case of a two-dimensional
optimization with a fixed value of z = 0.8. It is seen that a 1 percent error in y
does not unduly disturb the minimax sidelobe.

For the two-dimensional case with z = 1 and x and y variable, Fig. 6 shows
a result which also takes into account in-band as well as out-of-band ripple.
Each point on the maximum sidelobe curve corresponds to an optimum y given x.
For the value of y, the in-band ripple was also measured. The designer now can
choose his 2 transition points according to the relative value he places on
in-band versus out-of-band ripple.

Fig. 7 shows a plot of y versus x corresponding to the two-transition case
of Fig. 6.
[Fig. 7: y versus x, with x from 0.5 to 1.0.]
Conclusions

using linear programming techniques. The resulting computations are
facilitated by utilizing the fact that the

ACKNOWLEDGMENT

REFERENCES
[1] F.F. Kuo and J.F. Kaiser, System Analysis by Digital Computer.
[2] T.G.
I. Introduction

metic is to be used since it assists in determining register lengths
necessary to prevent overflow. In this paper we consider the class of
digital filters which have an impulse response of finite duration and are
implemented by means of circular convolutions performed using the
discrete Fourier transform. A least upper bound is obtained for the
maximum possible output of a circular convolution for the general
case of complex input sequences. For the case of real input sequences,
a lower bound on the least upper bound is obtained. The use of these
results in the implementation of this class of digital filters is discussed.

II. Problem Statement

According to the above discussion, we would like to determine an upper
bound on the maximum modulus of an output value that can result from an
N-point circular convolution. With {x_n} denoting the input sequence,
{h_n} denoting the kernel, and {y_n} denoting the output sequence, we have

    y_n = Σ_{k=0}^{N-1} x_k h_{(n-k) mod N},   n = 0, 1, ..., N - 1    (1)

    Y_k = (1/N) Σ_{n=0}^{N-1} y_n W^{nk},   k = 0, 1, ..., N - 1    (4)
so that the values of X_k do not overflow in the fixed point word.

In the typical cases, the sequence h_n is known and, consequently, so is the
sequence H_k. Therefore it is not necessary to continually evaluate (5); that
is, the sequence H_k is computed, normalized, and stored in advance. Thus it
is reasonable to only apply a normalization to H_k and not to h_n, so that we
require

    max_k |H_k| = 1.    (7)

A normalization of the transform of the kernel so that the maximum modulus
is unity allows maximum energy transfer through the filter, consistent with
the requirement that Y_k does not overflow the register length.

Our objective is to obtain an upper bound on |y_n| for all sequences {x_n}
and {H_k} consistent with (6) and (7). This bound will specify, for example,
the scaling factor to be applied in computing the inverse of (4) to guarantee
that no value of y_n overflows the fixed point word. The following results
will be obtained.

Result A: With the above constraints, the result of the N-point circular
convolution of (1) is bounded by |y_n| ≤ √N.

Substituting (2) into (8) and using (7), together with Parseval's relation

    Σ_{n=0}^{N-1} |x_n|² = N Σ_{k=0}^{N-1} |X_k|²,    (9)

we obtain

    Σ_{n=0}^{N-1} |y_n|² ≤ Σ_{n=0}^{N-1} |x_n|²    (11)

with equality if and only if |H_k| = 1. However, (6) requires that

    Σ_{n=0}^{N-1} |x_n|² ≤ N    (12)

with equality if and only if |x_n| = 1. Combining (11) and (12),

    Σ_{n=0}^{N-1} |y_n|² ≤ N.    (13)

But

    |y_n|² ≤ Σ_{n=0}^{N-1} |y_n|²    (14)

and therefore

    |y_n| ≤ √N.    (15)

Proof of Result B

To show that √N is a least upper bound on |y_n|, we review the conditions
for equality in the inequalities used above. We observe that for equality to
be satisfied in (15), it must be satisfied in (11), (12), and (14), where we
have used the fact that |H_k| = 1. For (16) to be satisfied, then
As an additional observation, we note that for any input sequence {x_n},

    |y_n| ≤ Σ_{k=0}^{N-1} |H_k| |X_k|

with equality for some value of n if and only if |H_k| = 1 and the phase of
H_k is chosen on the basis of (18). Therefore, for any {x_n} the output modulus
is maximized when H_k is chosen in this manner. This maximum value will only
equal √N, however, if, in addition, |x_n| = 1 and |X_k| = constant. For N odd,
a sequence with |x_n| = 1 and |X_k| = constant is (see Appendix)

    x_n = exp[j2πn²/N] = W^{n²}.    (20)

Using one of these sequences as the input, and choosing H_k = e^{jθ_k}, with
θ_k given by (18), equality in (15) can be achieved for any N. Thus the bound
given in Result A is a least upper bound.

Proof of Result C

We will demonstrate only the case where N is even, since the argument for
N odd is identical. Consider the complex sequence

    x_n = cos(πn²/N) + j sin(πn²/N)

with DFT denoted by

    F_k = R_k + j I_k

where R_k and I_k are real valued and (see Appendix)

    R_k² + I_k² = 1/N.    (21)

Since exp[jπn²/N] is an even function of n, i.e.,

    exp[jπn²/N] = exp[jπ(N - n)²/N],

R_k is the DFT of cos(πn²/N) and I_k is the DFT of sin(πn²/N). Now, if we
choose x_n = cos(πn²/N), then we can choose {H_k} in such a way that

    y₀ = Σ_{k=0}^{N-1} |R_k|.    (22)

Similarly, if we choose x_n' = sin(πn²/N), then we can choose {H_k} in such
a way that

    y₀' = Σ_{k=0}^{N-1} |I_k|.

We note that since {x_n} and {x_n'} are both real, the values y₀ and y₀' will
be obtained with {H_k} having even magnitude and odd phase, corresponding to
real {h_n}. Now, if β is the least upper bound for |y_n|, then

    β ≥ Σ_{k=0}^{N-1} |R_k|    (26a)

and

    β ≥ Σ_{k=0}^{N-1} |I_k|.    (26b)

Adding (26a) and (26b) and using (21),

    2β ≥ √N

or

    β ≥ √N/2.    (27)
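Result A and its achievability can be sketched numerically. The chirp input and matched phase choice below follow the construction above (N even); the random inputs are hypothetical:

```python
import numpy as np

N = 16                                   # N even
n = np.arange(N)
x = np.exp(1j * np.pi * n**2 / N)        # chirp: |x_n| = 1, |X_k| = sqrt(N)
X = np.fft.fft(x)
assert np.allclose(np.abs(X), np.sqrt(N))

H = np.conj(X) / np.abs(X)               # |H_k| = 1, phase matched to the input
y = np.fft.ifft(X * H)
assert np.isclose(abs(y[0]), np.sqrt(N))     # the bound sqrt(N) is achieved

rng = np.random.default_rng(0)           # no unit-modulus input can exceed it
for _ in range(100):
    xr = np.exp(2j * np.pi * rng.random(N))
    yr = np.fft.ifft(np.fft.fft(xr) * H)
    assert np.max(np.abs(yr)) <= np.sqrt(N) + 1e-9
```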
IV. Discussion

The bound obtained in the previous sections can be utilized in several ways.
If the DFT computation is carried out using a block floating-point strategy so
that arrays are rescaled only when overflows occur, then a final rescaling must
be carried out after each section is processed so that it is compatible with the
results from previous sections. For general input and filter characteristics,
the final rescaling can be chosen based on the bounds given here to insure that
the output will not exceed the available register length.

The use of block floating-point computation requires the incorporation of an
overflow test. In some cases we may wish instead to incorporate scaling in the
computation in such a way that we are guaranteed never to overflow. For example,
when we realize the DFT with a power of two algorithm, overflows in the FFT
computation of {X_k} will be prevented by including a scaling of 1/2 at each
stage, since the maximum modulus of an array in the computation is nondecreasing
and increases by at most a factor of two as we proceed from one stage to the
next [2]. With this scaling, the bound derived in this paper guarantees that
with a power of two computation, scaling is not required in more than half the
arrays in the inverse FFT computation. Therefore, including a scaling of 1/2 in
the first half of the stages in the inverse FFT will guarantee that there are no
overflows in the remainder of the computation. The fact that β ≥ √N/2 indicates
that if we restrict ourselves to only real input data, at most one rescaling
could be eliminated for some values of N.

The bounds derived and method of scaling mentioned above apply to the general
case; that is, except for the normalization of (7), they do not depend on the
filter characteristics. This is useful when we wish to fix the scaling strategy
without reference to any particular filter. For specific filter characteristics,
the bound can be reduced. Specifically, it can be verified from (1) and (6)
that the bound can be expressed in terms of {h_n}.

Appendix

We wish to demonstrate that for N even, the sequence

    x_n = exp[jπn²/N],   n = 0, 1, ..., N - 1   (N even)    (29)

has a discrete Fourier transform with constant modulus, and that for N odd,
the sequence

    x_n = exp[j2πn²/N],   n = 0, 1, ..., N - 1   (N odd)    (30)

has a discrete Fourier transform with constant modulus.

We consider first the case of (29). Letting X_k denote the DFT of x_n,

    X_k = (1/N) Σ_{n=0}^{N-1} exp[jπn²/N] exp[j2πnk/N]

or

    X_k = (1/N) exp[-jπk²/N] Σ_{n=0}^{N-1} exp[jπ(n + k)²/N].    (31)

We wish to show first that Σ_{n=0}^{N-1} exp[jπ(n + k)²/N] is a constant. It
is easily verified by a substitution of variables that

    Σ_{n=0}^{2N-1} exp[jπ(n + k)²/N] = constant ≜ B.    (32)

But

    Σ_{n=0}^{2N-1} exp[jπ(n + k)²/N]
        = Σ_{n=0}^{N-1} exp[jπ(n + k)²/N] + Σ_{n=N}^{2N-1} exp[jπ(n + k)²/N]
        = Σ_{n=0}^{N-1} exp[jπ(n + k)²/N] + Σ_{n=0}^{N-1} exp[jπ(n + k)²/N] exp[jπN]

or, since N is even,

    Σ_{n=0}^{2N-1} exp[jπ(n + k)²/N] = 2 Σ_{n=0}^{N-1} exp[jπ(n + k)²/N].    (33)

Therefore |X_k| = |B|/(2N); since Σ_k |X_k|² = (1/N) Σ_n |x_n|² = 1, it follows
that 4N = |B|², or

    |X_k| = 1/√N.

It can be verified by example (try N = 3) that the sequence of (29) does not
have a DFT with constant modulus if N is odd.
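The three claims (constant modulus for (29) with N even and (30) with N odd, failure of (29) for N odd) are easy to check numerically; a minimal sketch:

```python
import numpy as np

def modulus_spread(x):
    m = np.abs(np.fft.fft(x))
    return m.max() - m.min()

N = 8                                     # (29), N even: constant modulus
n = np.arange(N)
assert modulus_spread(np.exp(1j * np.pi * n**2 / N)) < 1e-9

N = 9                                     # (30), N odd: constant modulus
n = np.arange(N)
assert modulus_spread(np.exp(2j * np.pi * n**2 / N)) < 1e-9

n = np.arange(3)                          # (29) fails for N = 3
assert modulus_spread(np.exp(1j * np.pi * n**2 / 3)) > 1e-3
```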
Consider next the sequence of (30). We will show that X_k has constant modulus
by showing that the circular autocorrelation of x_n, which we denote by c_r,
is nonzero only at r = 0. Specifically, consider

    c_r = Σ_{n=0}^{N-1} x_n x*_{(n+r) mod N},   r ≠ 0, N odd
        = Σ_{n=0}^{N-1} exp[j2πn²/N] exp[-j2π((n + r)² mod N)/N]
        = exp[-j2πr²/N] Σ_{n=0}^{N-1} exp[-j4πrn/N].

But

    Σ_{n=0}^{N-1} exp[-j4πrn/N] = N,   r = 0
                                = N,   r = N/2, N even
                                = 0,   otherwise.

Since N is odd and r ≠ 0, the sum vanishes; hence c_r = 0 for r ≠ 0, and
therefore X_k has constant modulus.

[1] 1966 Spring Joint Computer Conf., AFIPS Proc., vol. 28. Washington, D.C.:
Spartan, 1966, pp. 229-233.
[2] P.D. Welch, "A fixed-point fast Fourier transform error analysis," this
issue, pp. 151-157.
SOME PRACTICAL CONSIDERATIONS IN THE
REALIZATION OF LINEAR DIGITAL FILTERS

J.F. KAISER
Bell Telephone Laboratories, Incorporated, Murray Hill, New Jersey

ABSTRACT
The high speed general purpose digital computer has become a powerful
and widely used tool for the simulation of large and complex dynamic
systems¹ and for the processing and reducing of large amounts of data
by filter methods. The increased computational accuracy of the
machines, the broader dynamic ranges in both amplitude and frequency
of the system variables, and the increasing order or complexity of
the dynamic systems themselves have made it necessary to take a much
closer look at the computational and realization details of the
designed digital filters. Many of the problems now coming to light
were not noticed before 2,3,5,6,7 either because the filters were
of low order with low (two or three decimal) accuracy requirements
or because the sampling frequencies were comparable to the dynamic
system and signal frequencies. An understanding of these computational
problems and realization considerations is of vital interest to the
users of the different digital filter design methods as their presence
may often spell the success or failure of a particular application.

The two most widely used methods for the design of digital filters
that approximate continuous linear filters with rational transfer
characteristics are the bilinear z transformation and the standard
z transformation methods:

    H*(z^{-1}) = H(s) |_{s = (2/T)(1 - z^{-1})/(1 + z^{-1})}    (1)

where

    z = e^{sT}.    (2)
The bilinear z transform simply maps the imaginary axis of the s-plane
into the unit circle of the z^{-1} plane, with the left half of the s-plane
mapping into the exterior of the unit circle in the z^{-1} plane. The
mapping is one-to-one and thus unique. Thus if the transformation
indicated by (1) is carried out exactly, then H*(z^{-1}) will be stable
if H(s) is stable and will be of precisely the same order. The
bilinear z form can theoretically be applied directly to a rational
transfer characteristic H(s) in either polynomial or factored form.
It will be shown later which form is to be preferred.

Under the standard z transform, each simple pole term transforms as

    1/(s + a)  →  T/(1 - e^{-aT} z^{-1}).    (3)

The denominator of the resulting digital filter is

    D_d(z^{-1}) = Σ_{k=0}^{n} b_k z^{-k}    (4)

where b₀ has been set to unity with no loss in generality. The question
now arises as to what accuracy the coefficients b_k must be known to
insure that the zeros of D_d(z^{-1}) all lie external to the unit circle,
the requirement for a stable digital filter. First a crude bound will
be established, to be followed by a more refined evaluation of the
coefficient sensitivity.

The polynomial D_d(z^{-1}) can be written in factored form as

    D_d(z^{-1}) = Π_{k=1}^{n} (1 - z^{-1}/z_k).    (5)
For ease of presentation only simple poles are assumed for the
basically low-pass transfer characteristic H(s), there being no
difficulty in extending the analysis to the multiple order pole and
non low-pass filter cases. If the standard z transform is used then
D_d(z^{-1}) becomes

    D_d(z^{-1}) = Π_{k=1}^{n} (1 - e^{p_kT} z^{-1})    (6)

where p_k represents the kth pole of H(s) and may be complex. For the
bilinear z transform there results

    (8)

    (10)

Thus as the sampling frequency is increased the quantities defined by (8)
and (10) decrease from unity and approach zero. Then one can write for the
standard z transform case
    [1 - e^{p_kT} z^{-1}]  →  [1 - (1 + p_kT) z^{-1}]   as T → 0    (11)

and for the bilinear z transform case

    [1 - ((1 + p_kT/2)/(1 - p_kT/2)) z^{-1}]  →  [1 - (1 + p_kT) z^{-1}]   as T → 0    (12)

which illustrates that the two design methods yield essentially the
same characteristic polynomials D_d(z^{-1}) in the limit as T is made
small.

Inspection of (11) and (12) shows that the zeros of D_d(z^{-1}) tend to
cluster about the point z^{-1} = +1 in the z^{-1} plane, i.e.,

    z_k ≈ 1 - p_kT

where for a stable system the p_k has a negative real part. Now the
filter H*(z^{-1}) will become unstable if any of its poles move across
the unit circle to the interior as a result of some perturbation or
change in the coefficients b_j. To estimate the order of this effect
one computes the change necessary to cause a zero of D_d(z^{-1}) to occur
at the point z^{-1} = 1. From (5) there results

    (13)

    (14)
But

    D_d(z^{-1}) |_{z^{-1}=1} = 1 + Σ_{k=1}^{n} b_k z^{-k} |_{z^{-1}=1} = 1 + Σ_{k=1}^{n} b_k        (15)
The right hand side of this expression is an important quantity and is
therefore defined as
(16)
Thus by combining (14) and (15) it is immediately seen that if any of
the b_k are changed by the amount given by (16) then D_d(z^{-1}) will
have a zero at z^{-1} = 1 and the filter H*(z^{-1}) will thus have a
singularity on the stability boundary. A zero of D_d(z^{-1}) at z^{-1} = 1
causes H*(z^{-1}) to behave as if an integration were present in H(s).
Any further change in the magnitudes of any combination of the b_k in
such a manner as to cause D_d(z^{-1}) |_{z^{-1}=1} to change sign will result
in an unstable filter, i.e., with some of the zeros of D_d(z^{-1}) lying
inside the unit circle. Hence (14) is the desired crude bound on
coefficient accuracy.
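The crude bound can be made concrete with a small numerical sketch (modern Python/NumPy, not from the paper; the pole radius and filter order are hypothetical). It evaluates D_d(z^{-1}) at z^{-1} = 1 for poles clustered near z = 1 — the resulting value is the coefficient change that places a zero on the stability boundary:

```python
import numpy as np

def stability_margin(b):
    """Evaluate D_d(z^-1) = 1 + sum_k b_k z^-k at z^-1 = 1.  A change of
    this amount in any single b_k puts a zero of D_d on the boundary."""
    return 1.0 + sum(b)

# Hypothetical high-sampling-rate case: n poles all near z = 1.
r, n = 0.99, 4
poly = np.poly1d([1.0])
for _ in range(n):
    poly = poly * np.poly1d([-r, 1.0])        # factor (1 - r z^-1)
b = poly.coeffs[::-1][1:]                     # b_1 .. b_n (b_0 = 1)
margin = stability_margin(b)
print(margin)                                 # mathematically (1 - r)^n = 1e-8
```

With only four poles at radius 0.99 the margin is already 10^-8, which is why so many decimal digits are needed for the b_k.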
    max Δb_k ≈ …                                                   (17)
Hence from (14), (16), and (17) an absolute minimum bound on the number
of decimal digits required for representing the b_k is found as
(18)
    ∂z_i/∂b_k = …                                                  (19)
from which the total differential change in any zero may be evaluated
as
(20)
    (21)

where

    (22)

or

    1 + P_k (z^{-1})^k / Π_{i=1}^{n} (1 − z^{-1}/z_i) = 0
This has the appearance of the standard root locus problem for a
single feedback loop having the loop transmission poles at the z_i, a
kth-order zero at the origin, and a loop gain factor of P_k. The
parameter P_k is simply the "gain" required when the root locus
passes through the point z^{-1} = 1. Thus all the techniques of the
root locus method and the insight gained thereby can be brought to
bear on the problem.
By viewing the coefficient sensitivity problem in terms of root loci
the effects of both increasing filter order and especially increasing
the sampling rate can be easily observed. Increasing the sampling
rate tends to cluster the poles of H*(z^{-1}) even more compactly about
the point z^{-1} = 1, as Fig. 2 shows for a third order filter. As
filter order increases so does the possible order k of the zero at the
origin of the z^{-1} plane. All n branches of the root loci begin at the
roots z_i; as P_k increases, k branches converge on the kth-order zero at
the origin and n − k branches move off toward infinity with eventually
radial symmetry. The angles the loci make as they leave the z_i are
simply the angles given by evaluating (19) at each z_i. The value of
P_k at which a branch of the locus first crosses the unit circle
(the stability boundary) gives the measure of total variation that can
be made in b_k and still keep the filter stable. Clearly the closer the
roots z_i are to the unit circle initially the smaller will be the value
of P_k necessary to move them to lie on the boundary. Thus by varying
the P_k (the changes in b_k) the extent of the stability problem can be
viewed.
Realization Schemes
The three basic forms for realizing linear digital filters are the
direct, the cascade and the parallel forms as shown in Fig. 4. As far
as the stability question goes the two variations of the direct form,
Fig. 4(a) and Fig. 4(b), are entirely equivalent, with the configuration
of Fig. 4(a) requiring fewer delay elements. The stability results
developed in the previous section indicate clearly that the coefficient
accuracy problem will be by far the most acute for the direct form
realization. For any reasonably complex filter with steep transitions
between pass and stop bands the use of the direct form should be
avoided.
The choice between the utilization of either the cascade, Fig. 4(c),
or parallel, Fig. 4(d), forms is not clear cut but depends somewhat
on the initial form of the continuous filter and the transformation
scheme to be used. In any case the denominator of H(s) must be known
in factored form. If the parallel form is desired then a partial
fraction expansion of H(s) must first be made. This is followed by a
direct application of either (1) or (3) if the bilinear or standard
z transforms are used respectively. For bandpass or bandstop
structures the midfrequency gains of the individual parallel sections
may vary considerably in magnitude, introducing a small problem of the
differencing of large numbers. The parallel form is perhaps the most
widely used of the realization forms.
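The direct-versus-cascade contrast can be illustrated numerically. In this sketch (modern Python/NumPy; the filter, pole radius, and word length are hypothetical assumptions, not taken from the paper), the same set of second-order sections is quantized once as an expanded direct-form denominator and once section by section:

```python
import numpy as np

def quantize(c, bits):
    """Round each coefficient to 'bits' fractional bits."""
    q = 2.0 ** -bits
    return np.round(np.asarray(c) / q) * q

# Hypothetical 8th-order narrowband filter: four second-order sections
# with poles of radius 0.995 at small angles (a steep low-pass case).
sections = [[1.0, -2 * 0.995 * np.cos(th), 0.995 ** 2]
            for th in (0.05, 0.10, 0.15, 0.20)]

direct = np.poly1d([1.0])
for s in sections:
    direct = direct * np.poly1d(s)

bits = 12
# Direct form: quantize the expanded denominator, then refactor it.
direct_poles = np.roots(quantize(direct.coeffs, bits))
# Cascade form: quantize each second-order section separately.
cascade_poles = np.concatenate([np.roots(quantize(s, bits)) for s in sections])

print("direct  max pole radius:", np.abs(direct_poles).max())
print("cascade max pole radius:", np.abs(cascade_poles).max())
```

Because the direct-form zeros cluster near z^{-1} = 1, a perturbation of half a quantization step scatters them (possibly across the unit circle), while each quantized section keeps its pole radius within roughly one quantization step of the design value.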
(24)
Summary
Fig. 1. (z^{-1} plane)

Fig. 2. (z^{-1} plane; pole locations for T = 3 and T = 1)

Fig. 3.

Fig. 4. (realization forms (a)–(d))
PROCEEDINGS
FIRST-ORDER CASE
… the random variables ξ_n and ε_n account for the roundoff errors due to the
floating point multiplication and addition. Following Kaneko and Liu, we define
the error e_n = y_n − w_n, subtract (1) from (2), neglect second-order terms in
e, ξ and ε, and obtain a difference equation for the error e_n:

    (4)
Manuscript received February 10, 1969. This work was sponsored by the U.S. Air Force.
¹ T. Kaneko and B. Liu, "Round-off error of floating-point digital filters," presented
PROCEEDINGS OF THE IEEE, JUNE 1969
(6)

If, instead, x_n is taken to be a sine wave of the form A sin(ω₀n + φ) with
φ uniformly distributed in (0, 2π), then

    (7)

To test the model, σ_e² was measured experimentally for white noise and
sine wave inputs. Each input was applied to a filter using a 27-bit mantissa,
and also a filter with the same coefficient a, but using a shorter (e.g., 12-bit)
mantissa in the computation. The outputs of the two filters were then sub-
tracted, squared, and averaged over a sufficiently long period to obtain a
stable estimate of σ_e². Kaneko and Liu assumed that ε_n and ξ_n were both
uniformly distributed in (−2^{-t}, 2^{-t}) with variances σ_ε² = σ_ξ² = (1/3)·2^{-2t}.
Actual measurements of the noise due to a multiply and an add verified that
ε_n and ξ_n have zero mean, but indicated that the variances

    σ_ε² ≈ σ_ξ² ≈ (0.23)·2^{-2t}                                   (8)

would better represent these noise sources. Using (1), (6), (7), and (8), we
can compute the output noise-to-signal ratio for both white noise and
sinusoidal inputs for the first-order case as

    σ_e²/σ_w² = (0.23)·2^{-2t} (1 + a)/(1 − a).                    (9)

In Fig. 1, experimental curves for noise-to-signal ratio are compared with
the theoretical curve of (9). For the case of sinusoidal input, we obtain
… with the variances as given by (8).

When x_n is stationary white noise, we obtain for the variance of the
noise e_n

    σ_e² = σ_ε² σ_w² [ … 3r⁴ + 12r² cos²θ … ]                      (11)

where

    G = Σ_{n=0}^{∞} h_n² = (1 + r²) / [(1 − r²)(r⁴ + 1 − 4r² cos²θ + 2r²)].      (12)

For the case of a high gain filter, with r = 1 − Δ, (11) becomes approximately

    σ_e²/σ_w² ≈ (0.23)·2^{-2t} (3 + 4 cos²θ)/(4Δ sin²θ).           (13)

                                  TABLE I
     THEORETICAL AND EXPERIMENTAL NOISE-TO-SIGNAL RATIO FOR A SECOND-
             ORDER FILTER, AS A FUNCTION OF POLE POSITION

                      ½ log₂ [2^{2t} σ_e²/σ_w²] (bits)
      r      θ         White Noise               Sine Wave
                  Theoretical  Experimental  Theoretical  Experimental
    0.55   22.5      1.48         1.66          1.54         1.64
    0.7    22.5      2.16         2.33          2.23         2.38
    0.9    22.5      3.32         3.33          3.35         3.45
    0.55   45.0      0.93         1.08          0.97         0.94
    0.7    45.0      1.36         1.44          1.37         1.51
    0.9    45.0      2.28         2.51          2.22         2.14
    0.55   67.5      0.42         0.46          0.39         0.33
    0.7    67.5      0.75         0.88          0.65         0.62
    0.9    67.5      1.63         1.97          1.45         0.99

The statistical model of floating point roundoff noise proposed by
Kaneko and Liu and one of fixed point roundoff noise as presented for
example by Gold and Rader² provide the framework for comparing these
two structures on the basis of the resulting noise-to-signal ratio. We con-
sider only the case of white noise input.

For the fixed point case, the register length must be chosen sufficiently
long so that the output cannot overflow the fixed point word. If h_n denotes
the impulse response of the filter, then the output w_n is bounded according to

² B. Gold and C. M. Rader, "Effect of quantization noise in digital filters,"
1966 Spring Joint Computer Conf., AFIPS Proc., vol. 28. Washington, D.C.:
Spartan, 1966, pp. 213–219.
For a comparison of floating and fixed point arithmetic in the case of a
first-order filter, Fig. 2(a) presents curves of ½ log₂ (σ_e²/(σ_ε²σ_w²)) as
determined from (8), (17), and (19). These curves represent a comparison of
the rms noise-to-signal ratio for the two cases, in units of bits. In Fig. 2(b),
a similar comparison is illustrated for the second-order case. For the purpose
of the illustration, θ was kept fixed and only r varied.

Fig. 2(a) and (b) indicates that floating point arithmetic leads to a lower
noise-to-signal ratio than fixed point if the floating point mantissa is equal
in length to the fixed point word. We notice that for high gain filters, as a
increases toward unity in the first-order case, and as r increases toward
unity for θ fixed in the second-order case, the noise-to-signal ratio for fixed
point increases faster than for floating point.

However, this comparison does not account for the number of bits
needed for the characteristic in floating point. If c denotes the number of
bits in the characteristic, this would be accounted for in Fig. 2 by numerically
adding the constant c to the floating point data. This shift will cause
the floating and fixed point curves to cross at a point where the noise-to-
signal ratios are equal for equal total register lengths.

For the sake of the comparison, we provide just enough bits in the
characteristic to allow the same dynamic range for both the floating and
the fixed point filters. If t_F denotes the fixed point word length, then the
requirement of identical dynamic range requires that

    (21)

Assuming for example that t_F = 16 so that c = 4, crossover points in the
noise-to-signal ratio will occur at a = 0.996 in the first-order case, and at
r = 0.99975, θ = 20°, in the second-order case depicted by Fig. 2(b).

CLIFFORD WEINSTEIN
ALAN V. OPPENHEIM
M.I.T. Lincoln Lab.
Lexington, Mass. 02173

Fig. 2. Comparison of fixed point and floating point noise-to-signal ratios.
(a) First-order filter. (b) Second-order filter, θ = 20°.
    max(|w_n|) = max(|x_n|) Σ_{n=0}^{∞} |h_n|.                     (15)

    −1 / Σ_{n=0}^{∞} |h_n|  <  x_n  <  +1 / Σ_{n=0}^{∞} |h_n|      (16)

With x_n white and uniformly distributed between the limits in (16), the
resulting output noise-to-signal ratio for a first-order filter is

    σ_e²/σ_w² = ¼ · 2^{-2t} ( Σ_{n=0}^{∞} |h_n| )² = 2^{-2t} / [4(1 − a)²],      (17)

and for the second-order case

    σ_e²/σ_w² = ½ · 2^{-2t} ( Σ_{n=0}^{∞} |h_n| )²
              = ½ · 2^{-2t} ( (1/sin θ) Σ_{n=0}^{∞} r^n |sin[(n + 1)θ]| )².      (18)
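The first-order fixed-point result — a noise-to-signal ratio of 2^{-2t}/[4(1 − a)²] when the white input is scaled per (16) — can be checked by direct simulation. In the sketch below (modern Python/NumPy; the values a = 0.9 and t = 12 are arbitrary choices, not from the letter), the product a·y_{n-1} is rounded to t fractional bits each step:

```python
import numpy as np

rng = np.random.default_rng(0)
a, t = 0.9, 12
q = 2.0 ** -t                        # fixed-point quantization step

S = 1.0 / (1.0 - a)                  # sum of |h_n| for the first-order filter
x = rng.uniform(-1.0 / S, 1.0 / S, 100_000)   # input scaled to avoid overflow

w = np.zeros_like(x)                 # ideal (double-precision) output
y = np.zeros_like(x)                 # rounded-arithmetic output
for n in range(len(x)):
    wp = w[n - 1] if n else 0.0
    yp = y[n - 1] if n else 0.0
    w[n] = x[n] + a * wp
    y[n] = x[n] + np.round(a * yp / q) * q    # round the product to t bits

measured = np.var(y - w) / np.var(w)
predicted = 2.0 ** (-2 * t) / (4 * (1 - a) ** 2)
print(measured, predicted)
```

The measured ratio tracks the prediction because the per-multiply roundoff behaves like white noise of variance 2^{-2t}/12, exactly the statistical model used in the letter.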
ERRATA
Submitted for publication to IEEE Transactions on Audio and
Electroacoustics.
by
Alan V. Oppenheim
ABSTRACT
*This work was sponsored by the Department of the Air Force. JULY 1969
Introduction
Recently, statistical models for the effects of roundoff noise in fixed-point and
floating-point realizations of digital filters have been proposed and verified, and a
comparison between these realizations has been suggested. (1), (2), (3) In general terms
the comparison revolves around the fact that while floating-point arithmetic has a larger
dynamic range than fixed-point, the latter is more accurate when the full register length
can be utilized. Because of the limited dynamic range of fixed-point arithmetic, for high
gain filters, the input signal must be attenuated to prevent overflow in the output. Thus,
for sufficiently high gain, floating-point arithmetic leads to a lower noise-to-signal ratio
than fixed-point. On the other hand, floating-point arithmetic implies a more complex
hardware structure than fixed-point arithmetic.

An alternative realization, block-floating-point, has some of the advantages of both
fixed-point and floating-point. In this paper a structure for implementing digital filters
using block-floating-point arithmetic is proposed and a statistical analysis of the effects
of roundoff noise presented. On the basis of this analysis, block-floating-point is com-
pared to fixed-point and floating-point arithmetic with regard to roundoff noise effects.
    (1)

    A_n = 1 / IP[ max(|x_n|, |y_{n-1}|) ]                          (2)

where IP[M] is used to denote the integer power of two so that
1/2 ≤ M · IP[M] < 1. Thus A_n represents the power-of-two scaling which will
jointly normalize x_n and y_{n-1}. Thus with block-floating-point we can
compute y_n as

    (3)

where the multiplications and addition in (3) are carried out in a fixed
point manner.
Because of the recursive nature of the computation for a digital filter it is
advantageous to modify (3) as

    (4)
with

    …

and

    Δ_n = A_n / A_{n-1}.
The difference between (3) and (4) is meant to imply that the number A_n y_n rather than
y_n is stored in the delay register of the filter. Because of (2), A_n y_n is always at
least as accurate as y_n, since multiplication by A_n corresponds to a left shift
of the register.
A disadvantage with (4) is that y_{n-1} must be available to compute A_n, and Δ_n must
then be obtained from A_n and A_{n-1}. An alternative is represented by the set of equations

    ŷ_n = Δ_n x̂_n + a_1 Δ_n ŷ_{n-1}                               (5a)

with                                                               (5b)

                                                                   (5c)

and                                                                (5d)

In this case, we first scale x_n by A_{n-1} to form x̂_n and then determine the incremental
scaling using (5d). As in (4) the scaled value ŷ_n is stored in the delay register and the
output value y_n is determined from ŷ_n using (5c). If we consider the general case of an
Nth-order filter of the form

    y_n = x_n + a_1 y_{n-1} + a_2 y_{n-2} + … + a_N y_{n-N}
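The rescaling bookkeeping of Eqs. (4)–(5) can be sketched in a few lines (modern Python; the IP[·] sign convention, the zero-input handling, and the first-order form are assumptions paraphrased from the text, and the sketch keeps full precision, so it demonstrates only the scaling algebra, not the roundoff):

```python
import math

def scale_factor(*vals):
    """Power-of-two scale A with 1/2 <= A * max|v| < 1 (a sketch of the
    IP[.] joint normalization; returns 1 when all inputs are zero)."""
    m = max(abs(v) for v in vals)
    if m == 0.0:
        return 1.0
    return 2.0 ** -(math.floor(math.log2(m)) + 1)

def bfp_first_order(x, a1):
    """First-order block-floating recursion: the delay register holds the
    scaled state A_n * y_n, and Delta_n = A_n / A_{n-1} rescales it."""
    y_prev, A_prev, out = 0.0, 1.0, []
    for xn in x:
        A = scale_factor(xn, y_prev)
        delta = A / A_prev                    # incremental rescale Delta_n
        y_scaled = A * xn + a1 * delta * (A_prev * y_prev)   # equals A * y_n
        out.append(y_scaled / A)              # recover the output y_n
        y_prev, A_prev = out[-1], A
    return out

ys = bfp_first_order([1.0, 0.0, 0.0, 0.0], 0.5)
print(ys)   # impulse response of y_n = x_n + 0.5 y_{n-1}: [1.0, 0.5, 0.25, 0.125]
```

Since all scale factors are powers of two, the rescaling is exact here; in a fixed-point register it would be a pure shift, which is the point of the block-floating structure.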
then the block-floating-point realization corresponding to (5) and represented in
Fig. 1 uses

    A_n = 1 / IP[ max( |x̂_n|, |w_{1n}|, |w_{2n}|, …, |w_{Nn}| ) ]          (6)

and

    A_n = A_{n-1} Δ_n = 1 / IP[ max( |x_n|, |y_{n-1}|, |y_{n-2}|, …, |y_{n-N}| ) ]     (7)

With this choice of scaling there is the possibility of overflow in the addition,
which cannot be avoided by an attenuation of the input. This possibility of
overflow can be avoided by decreasing the scaling to

    (6')

and

    (7')

where α is a constant that may be changed depending on the filter to be implemented.
In a first order filter, for example, α need never be greater than two. In a second
order filter it need never be greater than three, and in many cases can be chosen
as two.
…ence of roundoff noise, we will restrict attention to the implementation of Eqs. (5) and
Fig. 1 for the first and second-order cases. We will assume that no roundoff occurs in
the computation of x̂_n from x_n and the subsequent multiplication by A_n. Since A_{n-1} and
Δ_n are always non-negative powers of two, that is, they always correspond to a
positive scaling, the above assumption corresponds to allowing more bits in the repre-
sentation of the intermediate variable x̂_n. This is reasonable if we take the attitude that
it is primarily in the variables used in the arithmetic computations that the register
length is important.
For the first-order case, roundoff noise is introduced in the multiplication of w_{1n}
by Δ_n, the multiplication by a_1, and the final multiplication by 1/A_n. The effects of
multiplier roundoff will be modeled by representing the roundoff by additive white noise
sources. We consider, for convenience, the fixed point numbers in the registers to
represent signed fractions, with the register length excluding sign denoted by t bits.
Each of the roundoff noise generators is assumed to be white, mutually independent and
independent of the input, and to have a variance σ_ε² equal to (1/12)·2^{-2t}. The network for
the first-order filter including the noise sources representing roundoff error is presented
in Fig. 2(a). In Fig. 2(b) an equivalent representation is shown, where the noise sources
are at the filter input. If we consider the input to be a stationary random signal then the
noise source will be white stationary random noise with variance

    (8)

and the variance of the output noise is

    η̄² = … Σ_{n=0}^{∞} h_n² …                                     (9)
For the case of a second order filter a similar procedure can be followed. Figure 3(a)
shows a second-order filter with the roundoff noise sources included. In Fig. 3(b) an
equivalent representation is shown, where equivalent noise sources are introduced at
the filter input. Again, considering the input to be a stationary random signal, then

    σ_u² = … [4r² cos²θ + 2 + r⁴ + …] σ_ε² … = k² σ_ε² [4r² cos²θ + 2 + 2r⁴]     (10)
where we assume that the mean square values of the scaling terms at n and n − 1 are
equal. Hence the variance of the output noise η_n is

    η̄² = σ_ε² + k² σ_ε² G [4r² cos²θ + 2 + 2r⁴]                   (11)

where

    G = (1 + r²) / [(1 − r²)(r⁴ + 1 − 4r² cos²θ + 2r²)].           (12)
Experimental Verification

To verify the validity of Eqs. (9) and (11), the values of k² were measured and the
values of η̄² computed from (9) and (11) using these measured values. These results
were taken as the theoretical results since they incorporate the assumptions of the model.
The variance of the roundoff noise was then measured experimentally. This was done
by simulating the block-floating-point filter with a signed mantissa of 12 bits and com-
paring the output values with the output of an identical filter simulated with 36 bit fixed
point arithmetic. In all of these measurements the input was white noise with a uniform
amplitude distribution. For the first order filter, the value of α in Eqs. (6') and (7')
was taken as two. For the second order filter, the value of α was taken as four.

In Table I, measured values of k² and the theoretical and experimental values of
the variance of the roundoff noise for the first order case are given. In a similar manner,
theoretical and experimental results for the second order case are summarized in Table II.
A Comparison of Block-Floating-Point, Floating-Point and Fixed-Point Realizations

Using the model presented in the previous section, the block-floating-point realiza-
tion of digital filters can be compared with fixed-point and floating-point realizations.
The comparison to be presented here will be on the basis of the output noise-to-signal
ratio when the input is a random signal with a flat spectrum, using results presented by
Gold and Rader (1), Kaneko and Liu (2), and Weinstein and Oppenheim (3). With η² denot-
ing the variance of the roundoff noise as it appears in the output we have for the first-
order filter

    (η²) fixed-point = (1/12) · 2^{-2t}                            (13)

    (η²) floating-point = .23 × 2^{-2t} …                          (14)
and for the second-order filter

    (η²) fixed-point = …                                           (15)

    (η²) floating-point = .23 × 2^{-2t} [ … + G(3r⁴ + 12r² cos²θ − …/(1 + r²)) ] σ_y²     (16)
where t is the number of bits in the mantissa, not including sign, σ_y² is the variance
of the output signal, and G is given by (12). In the fixed-point case the output noise is
independent of the output signal variance and in the floating-point case the output noise
is proportional to the output signal variance. The expression for block-floating-point noise has
a term independent of the signal and a term which depends on the signal through the factor
k². In both the fixed-point and block-floating-point cases, the dynamic range for the
output is constrained by the register length. Consequently, as the filter gain increases
the input must be scaled down to prevent the output from overflowing the register
length. Since the output is given by
    y_n = Σ_{k=0}^{∞} h_k x_{n-k}

then

    max(|y_n|) ≤ max(|x_n|) Σ_{k=0}^{∞} |h_k|.

To insure that the output fits within a register length, we require that, with x_n and y_n
interpreted as fractions,

    |y_n| ≤ 1

so that

    −1 / Σ_{k=0}^{∞} |h_k|  ≤  x_n  ≤  1 / Σ_{k=0}^{∞} |h_k|.      (17)
With this constraint on the input, we can then compute an output noise-to-signal ratio
for fixed-point, floating-point and block-floating realizations. Specifically for the first-
order case,

    (η̄²/σ_y²) fixed-point = (1/12) · 2^{-2t} · 3/(1 − a)²         (18)

    (η̄²/σ_y²) floating-point = …                                  (19)

    (η̄²/σ_y²) block-floating = (1/12) · 2^{-2t} [ … ]             (20)

where k̄² is the value for k² when x_n is uniformly distributed between plus and minus
unity.

In a similar manner, for the second-order case,

    (η̄²/σ_y²) fixed-point = … 2^{-2t} (1/sin²θ) …                 (21)

    (η̄²/σ_y²) floating-point = .23 × 2^{-2t} … (1 + r²) …         (22)
In Fig. 4, Eqs. (18), (19) and (20) are compared. In Fig. 5, Eqs. (21), (22) and (23) are
compared. In these figures the noise-to-signal ratios are plotted in bits so that the dif-
ference between two of the curves reflects the number of bits that the mantissas should
differ by to achieve the same noise-to-signal ratio. In each of the cases, the difference
between floating-point and block-floating point is approximately constant as the filter
gain (or the proximity of the poles to the unit circle) increases. This difference is ap-
proximately one bit in the first-order case and two bits in the second order case. In
contrast, the fixed-point noise-to-signal ratio increases at a faster rate than floating
point or block-floating point, and for low gain is better and for high gain is worse than
block-floating point.
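The bit scale used in Figs. 4 and 5 is simply half the base-2 logarithm of the noise-to-signal power ratio, so a factor of 4 in noise power corresponds to exactly one mantissa bit. A minimal sketch (modern Python, illustrative values only):

```python
import math

def ratio_in_bits(nsr):
    """The bit scale of Figs. 4 and 5: (1/2) log2 of the noise-to-signal
    power ratio. A difference of 1 on this scale means the mantissas must
    differ by one bit to reach the same noise-to-signal ratio."""
    return 0.5 * math.log2(nsr)

# Two hypothetical realizations whose noise powers differ by a factor of 4
# differ by exactly one bit on this scale.
gap = ratio_in_bits(4e-8) - ratio_in_bits(1e-8)
print(gap)
```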
In evaluating the comparison between fixed-point, floating-point and block-floating
point filter realizations it is important to note that Figs. 4 and 5 are based only on the
mantissa length and do not reflect the additional bits needed to represent the character-
istic in either floating-point or block-floating-point arithmetic.

An additional consideration, which is not reflected in these curves, is that in both fixed-
point and block-floating-point the noise-to-signal ratio is computed on the assumption
that the input signal is as large as possible consistent with the requirement that the out-
put fit within the register length. If the input signal is in fact smaller than permitted
then the noise-to-signal ratio for the fixed-point case will be proportionately higher.
For block-floating-point, as the input signal decreases, k² decreases, thus reducing the
output noise. From Eqs. (9) and (12) we observe that as the input signal decreases the
output noise variance asymptotically approaches σ_ε².

For the case of high-gain filters, Eqs. (18) through (23) can be approximated by
asymptotic expressions which place in evidence the relationship between them. For the
high gain case, that is for a close to unity in the first order filter and r close to unity
and θ small in the second order filter, we will assume that |x_n| is always smaller than
|y_n| so that (1/A_n) ≅ 2|y_n| for the first-order filter and (1/A_n) ≅ 4|y_n| for the second-
order filter. Then, if we consider y_n as a random variable with a symmetric probability
density, … in Eq. (23).
Representing a as 1 − δ for the first-order case and r as 1 − δ for the second-order
case, with δ small we can approximate Eqs. (18) through (20) as

    2^{2t} (η̄²/σ_y²) fixed-point ≈ …                              (24)

    2^{2t} (η̄²/σ_y²) floating-point ≈ .23 · …                     (25)

    2^{2t} (η̄²/σ_y²) block-floating ≈ …                           (26)
For the second-order case we will want to bracket the expression

    (1/sin θ) Σ_{n=0}^{∞} r^n |sin[(n + 1)θ]| .

Furthermore, for the high-gain case we approximate G as

    G ≅ 1/(4 δ sin²θ).

We can then write that
    2^{2t} (η̄²/σ_y²) fixed-point ≈ (1/8) · 1/(δ² sin²θ)           (27)

    2^{2t} (η̄²/σ_y²) floating-point ≈ .23 [ 1 + (3 + 4 cos²θ)/(4 δ sin²θ) ]     (28)

    2^{2t} (η̄²/σ_y²) block-floating ≈ (1/δ) [ 1/4 + 4(1 + cos²θ)/(3 sin²θ) ]    (29)
80
ACKNOWLEDG MENT
81
REF E R ENCE S
82
TABLE I

     a       k²        ½ log₂ [2^{2t} η̄²] (bits)
                      Theoretical    Experimental
    .1     .0136        −1.780         −1.780
TABLE II

     r      θ        k²        ½ log₂ [2^{2t} η̄²] (bits)
                              Theoretical    Experimental
    .55   22.5     .011         −1.724         −1.661
    .55   45.0     .008         −1.765         −1.735
    .55   67.5     .006         −1.780         −1.753
    .7    22.5     .020         −1.528         −1.440
    .7    45.0     .010         −1.736         −1.696
    .7    67.5     .004         −1.781         −1.757
    .9    22.5     .068          −.231          −.222
    .9    45.0     .023         −1.430         −1.357
    .9    67.5     .015         −1.665         −1.584
    .95   22.5     .129           .716           .652
    .95   45.0     .045          −.863          −.768
    .95   67.5     .029         −1.384         −1.207
    .99   22.5     .150          1.992          2.050
    .99   45.0     .053           .244           .350
    .99   67.5     .035          −.540          −.150
FIGURE CAPTIONS

Table I. Measured values of k² and theoretical and experimental values of
output noise variance for a first-order filter with white noise input
in the range |x_n| ≤ 1/16.

Table II. Measured values of k² and theoretical and experimental values of output
noise variance for a second-order filter with white noise input in the range
|x_n| ≤ 1/128.

Fig. 1. Network for block-floating-point realization of an Nth-order filter.
85
>,C
N Z
I I
-
I
>,c >, C
>f
-
C I
-
C
I
C
"j'
C
E!
- c:( c:( c:(
II II
C C
II
C C .N C C
•z .
.- <J <J <J
-
• • • bO
....
r.z..
.
N z
.,
-
., .,
• • •
C
<J
C
OC
j'
C
c:(
C
ac
86
" .Ii
"If)
c
<I
N
bO
a �
.....
"N �
c
IU
C\I
+
c
\U
c
<l
....£..., .-
C
c I
�
.... CI
•
�c
- -
c ..c
, -
c
CI -
-
C
M
87
c
�
-
N
�
I
c
N
\II
c
<l
88
-
I C
c
It)
\II
'"
�
I
c
<X:
� c c
.....::.,.. <J _ <J
I t--..-...
... --t I
,.....--, N
c
.q-
'"
+ Q) ..0
en
M
0 t>O
c
rr> ('t.
'" u II-<
.....
�
I
C\J C\I
....
c
C\J
'"
+
Ct) c
en <l
0
0
....
C\J
c
-
� I
c c
lCl:
�
.--
II
C
o\u
89
c:n
c:n
C!) c:n
z 0
ti �
g�
�-�
z
�� �
u C!)
9
m
z
-
�
9
� c:n
�
/' 0
Q
L&J
X .
ii:
""
.�
�
10
o
C\J
o
90
10 8 = 22.50
U) 8
-
..c
SlOCK -FLOATING
�6 POINT
tr
�
b 4
�
N
o
.2
C\J
...... FLOATING POINT
,.... 2
o
0.99 0.999
r
Fig. Sa
..0
.....
0)
II 0)
a
/
2
o
a..
o
w
X �
I.L
.a
V)
to �
0) �
....
0)
<.!)
2 a
I-
«
01- ex)
-12
I.L -� a
, 0
�a.. I"-
u 0
0
-1
CD to
0
92
en
en
en
0
�
z
-
�
<!)
z
-
0
an
�
�
U) 9
lL en
II en
Cb / 0
0
L&J
X
LL 0
t- tl)
t>O
....
It) �
en
0
<!)
z en
� 0
91-
lL�
. 0 CX!
�Q. 0
U
9
m
'"':
0
�
0
93
ROUND-OFF ERROR OF FLOATING-POINT DIGITAL FILTERS
T. KANEKO and B. LIU
Princeton University
Princeton, New Jersey
ABSTRACT

This paper is concerned with the accumulation of round-off
error in a floating-point digital filter. The error committed
at each arithmetic operation is assumed to be an independent
random variable uniformly distributed in (−2^{-t}, 2^{-t}) where t is
the length of the mantissa. An expression for the mean square
error is derived and a numerical example is given.
INTRODUCTION
Consider a digital filter specified by the input-output
relationship:

    w_n = Σ_{k=0}^{M} b_k x_{n-k} − Σ_{k=1}^{N} a_k w_{n-k}        (1)
fl(x+y) is the calculated sum of x+y, and fl(ax+by) is the cal-
culated sum of two terms; one is the calculated product of a
and x, the other is the calculated product of b and y. It is
known (3) that

    fl(x + y) = (x + y)(1 + ε)                                     (2)

with

    |ε| ≤ 2^{-t}

where t is the number of bits of the mantissa. Also,

    fl(xy) = xy(1 + ε)                                             (3)

again with

    |ε| ≤ 2^{-t}.

That is, for each addition or multiplication, an error is com-
mitted which is proportional to the ideal results obtained with
infinite precision. We shall assume that all numbers a_k, b_k,
x_n are machine numbers.
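The rounding model of (2) and (3) is easy to confirm on modern hardware, where IEEE single precision plays the role of a t = 24 bit mantissa (counting the hidden bit). In the sketch below (Python/NumPy, not from the paper), the operands are made exactly representable first so that the only error is the final rounding:

```python
import numpy as np

rng = np.random.default_rng(1)
t = 24   # single-precision mantissa length, counting the hidden bit

# Operands already representable in single precision, so fl(x+y) and
# fl(xy) commit only one round-to-nearest error each.
x = rng.uniform(0.1, 1.0, 10_000).astype(np.float32).astype(np.float64)
y = rng.uniform(0.1, 1.0, 10_000).astype(np.float32).astype(np.float64)

fl_sum  = (x.astype(np.float32) + y.astype(np.float32)).astype(np.float64)
fl_prod = (x.astype(np.float32) * y.astype(np.float32)).astype(np.float64)

eps_sum  = np.abs(fl_sum / (x + y) - 1.0)     # the (1 + eps) relative error
eps_prod = np.abs(fl_prod / (x * y) - 1.0)
print(eps_sum.max() <= 2.0 ** -t, eps_prod.max() <= 2.0 ** -t)
```

Every observed relative error satisfies |ε| ≤ 2^{-t}, exactly the bound assumed in the analysis (the paper's uniform-distribution assumption is a further statistical idealization of this bound).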
CALCULATION OF ACTUAL OUTPUT SEQUENCE

To illustrate the approach of this paper, consider a second
order filter specified by

    w_n = b_0 x_n − (a_1 w_{n-1} + a_2 w_{n-2}).

Figure 1

¹ It is assumed here that the accumulator is of double precision.
For single precision accumulators, slight modification is
necessary.

² Slight modifications are necessary when some of the
coefficients are one or zero.
The quantities δ_{n,0}, ε_{n,1}, ε_{n,2}, ξ_n, η_n are all bounded in ab-
solute value by 2^{-t}. Since these are errors caused by round-
off at each arithmetic step, we may assume that they are inde-
pendent random variables uniformly distributed in (−2^{-t}, 2^{-t}).³
Therefore the actual output {y_n} is seen to be given explicitly
by

    y_n = b_0 θ_{n,0} x_n − Σ_{k=1}^{2} a_k φ_{n,k} y_{n-k}

where:

    θ_{n,0} = (1 + δ_{n,0})(1 + ξ_n)
    φ_{n,1} = (1 + ε_{n,1})(1 + η_n)(1 + ξ_n)
    φ_{n,2} = (1 + ε_{n,2})(1 + η_n)(1 + ξ_n)

Figure 2

³ For single precision accumulators, a slight modification is
needed.
The actual output sequence is therefore given by

    y_n = Σ_{k=0}^{M} b_k θ_{n,k} x_{n-k} − Σ_{k=1}^{N} a_k φ_{n,k} y_{n-k}      (4)

where

    θ_{n,0} = Π_{i=1}^{M} (1 + ζ_{n,i})

    θ_{n,j} = Π_{i=j}^{M} (1 + ζ_{n,i}),   j = 1, 2, …, M          (5)

    φ_{n,1} = Π_{i=2}^{N} (1 + η_{n,i})

    φ_{n,j} = (1 + ξ_n)(1 + ε_{n,j}) Π_{i=j}^{N} (1 + η_{n,i}),   j = 2, 3, …, N

By defining a_0 = 1, φ_{n,0} = 1, we may write Eq. (4) as

    Σ_{k=0}^{N} a_k φ_{n,k} y_{n-k} = Σ_{k=0}^{M} b_k θ_{n,k} x_{n-k}            (6)

where ξ_n, δ_{n,k}, ζ_{n,k}, ε_{n,k}, η_{n,k} are independent random varia-
bles, each uniformly distributed in (−2^{-t}, 2^{-t}).
To solve for the actual output sequence {y_n} from Eq. (6),
we note first that the random variables φ_{n,k} and θ_{n,k} are
essentially one and that their difference from one is of the
order of 2^{-t} in magnitude. We rewrite Eq. (6) as

    Σ_{k=0}^{N} a_k y_{n-k} = Σ_{k=0}^{M} b_k x_{n-k} + Σ_{k=0}^{M} b_k (θ_{n,k} − 1) x_{n-k}
                              − Σ_{k=0}^{N} a_k (φ_{n,k} − 1) y_{n-k}            (7)

and y_n as a sum of terms in order of decreasing size, viz.,

    y_n = y′_n + y″_n + y‴_n + …                                   (8)

On substituting Eq. (8) into Eq. (7) and equating terms of like
order of magnitude, the following set of equations is obtained.

    Σ_{k=0}^{N} a_k y′_{n-k} = Σ_{k=0}^{M} b_k x_{n-k}             (9)

    Σ_{k=0}^{N} a_k y″_{n-k} = Σ_{k=0}^{M} b_k (θ_{n,k} − 1) x_{n-k} − Σ_{k=0}^{N} a_k (φ_{n,k} − 1) y′_{n-k}     (10)

    Σ_{k=0}^{N} a_k y^{(p)}_{n-k} = − Σ_{k=0}^{N} a_k (φ_{n,k} − 1) y^{(p-1)}_{n-k},   p = 3, 4, 5, …            (11)
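The ordering of terms in (8)–(11) can be verified by simulation. The sketch below (modern Python/NumPy; the first-order filter, its coefficient, and the mantissa length are hypothetical choices) drives the exact relation (6) with random (1 + ε)-type factors and checks that y′ from (9) plus y″ from (10) accounts for y up to second-order terms:

```python
import numpy as np

rng = np.random.default_rng(2)
t = 10                        # short mantissa so second-order terms are visible
u = 2.0 ** -t
a = [1.0, -0.9]               # a_0 = 1, N = 1 (first-order recursion)
b = [1.0]                     # M = 0
L = 500

x = rng.standard_normal(L)
theta = 1.0 + rng.uniform(-u, u, L)    # theta_{n,0} factors
phi = 1.0 + rng.uniform(-u, u, L)      # phi_{n,1} factors

y = np.zeros(L); yp = np.zeros(L); ypp = np.zeros(L)
for n in range(L):
    prev, prev_p, prev_pp = (y[n-1], yp[n-1], ypp[n-1]) if n else (0.0, 0.0, 0.0)
    # exact recursion (6): y_n + a_1 phi_n y_{n-1} = b_0 theta_n x_n
    y[n] = b[0] * theta[n] * x[n] - a[1] * phi[n] * prev
    # (9): ideal output y'
    yp[n] = b[0] * x[n] - a[1] * prev_p
    # (10): first-order error term y''
    ypp[n] = b[0] * (theta[n] - 1) * x[n] - a[1] * (phi[n] - 1) * prev_p \
             - a[1] * prev_pp

residual = np.abs(y - yp - ypp).max()
print(residual)               # left over is O(u^2): much smaller than y''
```

The residual y − y′ − y″ is driven only by (φ − 1)·y″ terms, i.e. it is one factor of 2^{-t} smaller than y″, which justifies identifying y″ with the round-off error e_n in (12).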
{w_n} sequence, the ideal output. We may thus identify y″_n
with the error due to round-off error and denote it by e_n.
That is,

    e_n = y_n − w_n = y″_n.                                        (12)

The power spectral density of y′ is

    Φ_{y′y′}(z) = [N(z) N(1/z) / D(z) D(1/z)] Φ_{xx}(z)            (13)

and its autocorrelation function R_{y′y′}(n) is given by

    R_{y′y′}(n) = (1/2πj) ∮ Φ_{y′y′}(z) z^n dz/z                   (14)

where

    N(z) = Σ_{k=0}^{M} b_k z^{-k}   and   D(z) = Σ_{k=0}^{N} a_k z^{-k}

with

    u_n = Σ_{k=0}^{M} b_k (θ_{n,k} − 1) x_{n-k} − Σ_{k=0}^{N} a_k (φ_{n,k} − 1) y′_{n-k}.     (16)

To calculate the statistics of u_n, we need the statistics of
θ_{n,k} and φ_{n,k}. These can be evaluated in a straightforward
manner and the result is summarized in the Appendix. It can be
shown that

    E(u_n u_m) = 0,   n ≠ m                                        (18)
    E(u_n²) = Σ_{k=0}^{M} Σ_{i=0}^{M} b_k b_i B_{k,i} R_{xx}(k−i)
             + Σ_{k=1}^{N} Σ_{i=1}^{N} a_k a_i A_{k,i} R_{y′y′}(k−i)
             − 2 Σ_{k=0}^{M} Σ_{i=1}^{N} b_k a_i q R_{xy′}(k−i)    (19)

where q = 2^{-2t}/3 is the variance of a random variable uniformly
distributed in (−2^{-t}, 2^{-t}), R_{xy′}(k−i) is the cross correla-
tion function between {x_n} and {y′_n}, and A_{k,i} and B_{k,i} are given by:
    A_{k,i} = E{(φ_{n,k} − 1)(φ_{n,i} − 1)}
            = (1+q)^{N+2−max(k,i)} − 1,   k ≠ i or k = i = 1
            = (1+q)^{N+3−k} − 1,          k = i ≠ 1                (20)

    B_{k,i} = E{(θ_{n,k} − 1)(θ_{n,i} − 1)}
            = (1+q)^{M+2−max(k,i)} − 1,   k ≠ i or k = i = 0
            = (1+q)^{M+3−k} − 1,          k = i ≠ 0                (21)
Thus we see {u_n} is white and w.s.s. with variance of u_n given
by Eq. (19), which may be rewritten as

    E(u_n²) = (1/2πj) ∮ [ |B(z)|² Φ_{xx}(z) + |A(z)|² Φ_{y′y′}(z) − 2 C(z) Φ_{xy′}(z) ] dz/z     (22)

where

    |B(z)|² = Σ_{k=0}^{M} Σ_{i=0}^{M} b_k b_i B_{k,i} z^{k−i}      (23)

    |A(z)|² = Σ_{k=1}^{N} Σ_{i=1}^{N} a_k a_i A_{k,i} z^{k−i}

    C(z) = q Σ_{k=0}^{M} Σ_{i=1}^{N} b_k a_i z^{k−i}

By using Eq. (13) and the relationship

    R(n) = (1/2πj) ∮ Φ(z) z^n dz/z                                 (25)
OUTPUT ERROR TO SIGNAL RATIO

Quite often, one is interested in the error-to-signal
ratio at the output. In terms of our notations, this quantity
is E(e_n²)/E(w_n²), or E(y″_n²)/E(y′_n²), which, by using Eqs. (14),
(23), and (25), can be written as

    (26)
with a_1 = −√2 (1 − .001) and a_2 = (1 − .001)². Thus N(z) = 1, D(z) =
1 + a_1 z^{-1} + a_2 z^{-2}, |B(z)|² = q, and |A(z)|² = (a_1² + a_2²)[(1 + q)³ − 1] + …
ACKNOWLEDGMENT
This research is sponsored by the Air Force Office of
Scientific Research, Office of Aerospace Research, United
States Air Force under AFOSR Grant 1333-67 and by the National
Science Foundation under Grant GK-1439.
APPENDIX

STATISTICS OF θ_{n,k} AND φ_{n,k}

    E(θ²_{n,0}) = (1+q)^{M+2},    E(θ_{n,0}θ_{n,k}) = (1+q)^{M+2−j},   M ≥ j > k ≥ 0

    E(θ²_{n,j}) = (1+q)^{M+3−j},   j = 1, …, M

    φ_{n,0} = φ²_{n,0} = 1,    E(φ_{n,0}φ_{n,j}) = 1

    E(φ²_{n,1}) = (1+q)^{N+1},    E(φ_{n,j}φ_{n,k}) = …,   k = 0
REFERENCES
n:EE TR.\XSM.'TIOXS ON CllICl.:"1T TIIEOR", VOL. CT-15, No.1, l\URCH 19G8
Abstract-The frequency response of a digital filter realized by a finite word-length machine deviates from that which would have been obtained with an infinite word-length machine. An "ideal" or "errorless" filter is defined as a realization of the required pulse transfer function by an infinite word-length machine. This paper shows that quantization of a digital filter's coefficients in an actual realization can be represented by a "stray" transfer function in parallel with the corresponding ideal filter. Also, by making certain statistical assumptions, the statistically expected mean-square difference between the real frequency responses of the actual and ideal filters can be readily evaluated by one short computer program for all widths of quantization. Furthermore, the same computations may be used to evaluate the rms value of output noise due to data quantization and multiplicative rounding errors. Experimental measurements verify the analysis in a practical case. The application of the results to the design of the digital filters is also considered.

I. INTRODUCTION

SYSTEMS which are used to spectrally shape … toward digital filtering techniques. In particular, methods[1],[2] have been developed which enable the conversion of certain well-established analog filter designs into digital filters having essentially the same frequency response over half the Nyquist interval. These techniques are significant in that the resulting digital filters largely maintain the frequency response of their analog "parents" even when relatively small sampling frequencies are employed. That is to say, frequency aliasing effects[1] are largely eliminated.

An electronic digital computer processes information in binary-number format. When the so-called "scientific programming" convention is employed in a fixed-point machine, a number is represented to M bits of significance as

γ = γ₀γ₁ ⋯ γ_{M-1} ≡ -γ₀ + Σ_{i=1}^{M-1} γᵢ 2^{-i}    (1)

where

γᵢ = 0 or 1 for all i    (2)

and M is the word-length of the computer. In connection with (1), the width of quantization (q) is defined as

q = 2^{-(M-1)}.    (3)

By virtue of the finite computer word-length, the performance of a digital filter is inevitably degenerated. First, when continuous data are read into the computer, a quantization error[3] is incurred. Second, a roundoff error arises in the evaluation of each arithmetic product in the computation.[4],[6] Third, the coefficients of the difference equation which represents the digital filter are subject to amplitude quantization errors. In consequence, the frequency response of the actual filter realized deviates from that which would have been obtained with an infinite word-length machine. For the purposes of comparison, it is convenient to define "the ideal" or "errorless" filter as a realization of the required pulse transfer function by an infinite word-length computer.

… of a digitally controlled feedback system, the loss of performance is important in digital filter applications due to the absence of overall negative feedback and greater digital system complexity. The effect of coefficient errors on digital filter performance was first investigated by Kaiser.[7] An absolute accuracy bound was derived for the difference-equation coefficients within which the realization was still asymptotically stable. However, such an error bound is inevitably pessimistic, because coefficient quantization errors are intrinsically statistical. Furthermore, this analysis does not enable the computer word-length to be selected so that the degeneration in the real frequency response of the actual digital filter is maintained within a given significance.

This paper shows that the quantization of a digital filter's coefficients can be represented by a "stray" transfer function in parallel with the corresponding ideal filter. Also, by making certain justifiable statistical assumptions, the statistically expected mean-square difference between the real frequency responses of the actual and corresponding errorless filter can be readily evaluated by one short computer program for all widths of quantization. Further, the same computational results may be used to evaluate the rms value of the filter's output noise due to data quantization and multiplicative roundoff errors.[4],[10] Unfortunately, while the method is always applicable to direct and parallel programming, it is generally unsuitable for cascade programming.

Manuscript received July 3, 1967; revised October 3, 1967.
J. B. Knowles is with the Control and Instrument Group, United Kingdom Atomic Energy Authority, Winfrith, Dorset, England.
E. M. Olcayto was formerly with the University of Manchester Institute of Science and Technology, Manchester, England. He is now with the Turkish Post Office, Izmir, Turkey.
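The fixed-point convention of (1)-(3) is easy to sketch numerically: a sign bit plus M - 1 fractional bits gives a quantization width q = 2^{-(M-1)}, and representing a coefficient means snapping it to the nearest multiple of q. A minimal Python illustration (the helper names are ours, not the paper's):

```python
def quantization_width(M):
    # Eq. (3): q = 2^{-(M-1)} for an M-bit fixed-point word.
    return 2.0 ** -(M - 1)

def to_fixed_point(x, M):
    # Eq. (1): a sign bit gamma_0 plus M-1 fractional bits, i.e. x is
    # represented by the nearest multiple of q.
    q = quantization_width(M)
    return q * round(x / q)

q = quantization_width(12)            # 12-bit word
xq = to_fixed_point(0.614532, 12)
assert abs(xq - 0.614532) <= q / 2    # representation error never exceeds q/2
```

The final assertion is exactly the error interval quoted later in (21): rounding to the grid leaves at most q/2 of error per coefficient.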
Justification for a Statistical Analysis

In the transformation of an analog filter to a digital filter by Golden and Kaiser's method[1] a matrix of coefficients is derived. For representation in the computer these coefficients are quantized to a specific number of bits. That is to say, for representation within the computer these coefficients must be processed by the following gain characteristic.

Fig. 1. Quantizer characteristic relating the ideal, or designed, coefficients to the actual filter coefficients for realization.

In other words, the justification for a statistical analysis of filter coefficient errors is the same as for a statistical analysis of quantization in A/D converters. From one viewpoint, the output of a quantizer to a specific input is deterministic and not really a statistical problem at all. Alternatively, the exact value of the actual input viewed in terms of the quantized data is uncertain. In fact, for a realized filter coefficient of nq, the true coefficient value lies in the interval nq - ε to nq + ε, where |ε| < q/2. It is this uncertainty regarding the exact value of the input to the quantizer that justifies, in the authors' opinion, the statistical analysis of all amplitude quantization effects. Such applications of statistical techniques … then the numerator and denominator coefficients may be written as

a′_k = a_k + α_k,  b′_k = b_k + β_k,    (5)

and the error quantities α_k, β_k are statistically independent. It should be observed that no error at all is involved in the representation of unity and zero magnitude coefficients within the computer.

Neglecting multiplicative roundoff errors, the direct programming realization of (4) is specified by the recursion equation¹

y′(n) = Σ_{k=0}^{N} a′_k x(n - k) - Σ_{k=1}^{N} b′_k y′(n - k)    (6)

where x(i) and y′(i) represent the input and output sequences of the actual filter. However, the ideal digital filter produces an output sequence y(n) according to the recursion equation

y(n) = Σ_{k=0}^{N} a_k x(n - k) - Σ_{k=1}^{N} b_k y(n - k).    (7)

Defining the computational error quantity

e(n) = y′(n) - y(n),    (8)

then substituting (5), (6), and (7) into (8) yields

e(n) = Σ_{k=0}^{N} α_k x(n - k) - Σ_{k=1}^{N} b_k e(n - k) - Σ_{k=1}^{N} β_k y(n - k) - Σ_{k=1}^{N} β_k e(n - k),    (9)

and neglecting second-order quantities, one obtains

e(n) = Σ_{k=0}^{N} α_k x(n - k) - Σ_{k=1}^{N} b_k e(n - k) - Σ_{k=1}^{N} β_k y(n - k).    (10)

E(z⁻¹) = Σ_{k=0}^{∞} e(k) z⁻ᵏ.

As

Y(z⁻¹) = H_m(z⁻¹) X(z⁻¹),    (13)

then (11) may be reduced to

E(z⁻¹) = {[α(z⁻¹) - β(z⁻¹) H_m(z⁻¹)] / B_m(z⁻¹)} X(z⁻¹).    (14)

¹ If multiplicative roundoff errors are considered, then an error term ε(n) must be added to the right-hand side of (6) (see Knowles and Edwards[5],[6]). However, a more complete description of degenerative effects in a finite word-length computer is given in Section VI of this paper.
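The actual recursion (6), the ideal recursion (7), and the error sequence (8) can be simulated directly. The sketch below uses a small second-order filter of our own choosing (the coefficients are illustrative, not the paper's):

```python
def direct_form(a, b, x):
    # y(n) = sum_k a_k x(n-k) - sum_{k>=1} b_k y(n-k): direct programming,
    # as in recursions (6) and (7).
    y = []
    for n in range(len(x)):
        acc = sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
        acc -= sum(b[k] * y[n - k] for k in range(1, len(b)) if n - k >= 0)
        y.append(acc)
    return y

a, b = [0.2, 0.4, 0.2], [1.0, -0.5, 0.25]   # ideal coefficients (stable poles)
q = 2.0 ** -7                               # 8-bit quantization width, eq. (3)
aq = [q * round(c / q) for c in a]          # rounded coefficients, as in (50)
bq = [q * round(c / q) for c in b]
x = [1.0] + [0.0] * 63                      # unit-pulse input
e = [ya - yi for ya, yi in zip(direct_form(aq, bq, x), direct_form(a, b, x))]
# e(n) = y'(n) - y(n), eq. (8): nonzero once any coefficient fails to land
# exactly on the quantization grid.
```

Here e(0) equals the rounding error of a₀ alone, and the remaining samples stay small because the pole positions are only slightly perturbed, which is the situation the statistical analysis assumes.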
KNOWLES AND OLCAYTO: COEFFICIENT ACCURACY AND DIGITAL FILTER RESPONSE
It is evident that the quantity σ̄²_m only exists when the coefficient rounding errors are such that the actual filter H(z⁻¹) is still asymptotically stable. This limitation does not appear to represent a practical difficulty, because one is generally interested in determining the computer word-length for the actual filter frequency response to be within only a few percent of the ideal stable filter. Under these conditions, the stability of the filter will have been hardly affected by coefficient rounding. That is to say, it is contended that loss of stability in a realization will occur only after the deviation between the actual and ideal frequency responses has become intolerable. This is illustrated in Fig. 2 for a lowpass filter.

Fig. 2. Degeneration in filter frequency response due to the coefficient rounding.

Assuming filter stability, then the integrand of (18) can be reasonably expected to satisfy the conditions of the Lebesgue-Fubini theorem.[6] Hence, interchanging the order of integration and substituting (14) into (18), one obtains

σ̄²_m = (1/2πj) ∮ …    (19)

Using the polynomial forms for α(z⁻¹) and β(z⁻¹) and the stipulated statistical assumptions, (19) reduces to

…    (20)

…response terms for the digital filters [1/B_m(z⁻¹)] and [A_m(z⁻¹)]/[B_m(z⁻¹)]², respectively. Consequently, these integrals may be readily evaluated to any degree of accuracy using a short digital computer program based on Fig. 3.

On-line computers generally operate on a fixed-point basis, and in this case with rounded quantization, the error quantities in (5) lie in the interval

|α_k| ≤ q/2,  |β_k| ≤ q/2  for all k.    (21)

By virtue of the uniform statistical distribution assumed for these error quantities, it follows that the summations involved in (20) reduce to

Σ_{k=0}^{N} ᾱ²_k = (N + 1) q²/12
Σ_{k=1}^{N} β̄²_k = N q²/12.    (22)
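The per-coefficient mean square q²/12 behind (22) is just the second moment of a uniform distribution on [-q/2, q/2]; a quick Monte-Carlo check (our own sketch, sample count arbitrary):

```python
import random

random.seed(1)
q = 2.0 ** -9
# Rounding a uniformly distributed coefficient to the grid of multiples of q
# leaves an error uniform on [-q/2, q/2]; its mean square is q^2/12, the
# value summed per coefficient in eq. (22).
errors = []
for _ in range(200_000):
    c = random.uniform(-1.0, 1.0)
    errors.append(q * round(c / q) - c)
mean_square = sum(e * e for e in errors) / len(errors)
assert abs(mean_square - q * q / 12) < 0.05 * q * q / 12
```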
(1/2πj) ∮_{|z|=1} dz / [B_m(z⁻¹) B_m(z) z]

δ(k) = 1 if k = 0
     = 0 if k ≠ 0

…

It should be observed, however, that H_i(z⁻¹) includes the case of a real pole by setting

a_{1i} = b_{2i} = 0.    (27)

As shown in Ragazzini and Franklin[10] and Jury,[11] the actual parallel programming of H(z⁻¹) is implemented by direct programming each of the elementary transfer functions R(z⁻¹), H_i(z⁻¹) and then summing their responses. This programming technique is illustrated in Fig. 4. By virtue of (15), which is illustrated in Fig. 1, it follows that the actual filter transfer function realized is

H′(z⁻¹) = R′(z⁻¹) + Σ_{i=1}^{N} H′_i(z⁻¹)    (28)

where

a′_{ki} = a_{ki} + α_{ki}
b′_{ki} = b_{ki} + β_{ki}    (29)
r′ = r + ρ.

Substituting (28) into (18), one obtains

…    (30)
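The parallel realization of (28) is simply a constant path plus a bank of elementary second-order sections whose outputs are summed; a self-contained sketch (section coefficients are illustrative, not taken from the paper):

```python
def sos_output(a0, a1, b1, b2, x):
    # One elementary section H_i(z^-1) = (a0 + a1 z^-1)/(1 + b1 z^-1 + b2 z^-2),
    # as in (35), programmed directly.
    y = []
    for n in range(len(x)):
        acc = a0 * x[n] + (a1 * x[n - 1] if n >= 1 else 0.0)
        acc -= (b1 * y[n - 1] if n >= 1 else 0.0) + (b2 * y[n - 2] if n >= 2 else 0.0)
        y.append(acc)
    return y

def parallel_filter(r, sections, x):
    # Eq. (28): H(z^-1) = r + sum_i H_i(z^-1); section outputs are summed.
    total = [r * v for v in x]
    for a0, a1, b1, b2 in sections:
        total = [t + s for t, s in zip(total, sos_output(a0, a1, b1, b2, x))]
    return total

x = [1.0] + [0.0] * 7
y = parallel_filter(0.5, [(1.0, 0.2, -0.9, 0.81), (0.3, 0.0, 0.4, 0.0)], x)
```

Quantizing each section's coefficients independently is what makes the per-section error terms of (29) statistically independent in the analysis that follows.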
Assuming that the systems are stable and that the coefficient errors are statistically independent with zero mean, (30) reduces to

σ̄²_m = [ρ̄² + Σ_i (ᾱ²_{0i} + ᾱ²_{1i})] (1/2πj) ∮ dz / [B_{mi}(z⁻¹) B_{mi}(z) z]
       + Σ_i (β̄²_{1i} + β̄²_{2i}) (1/2πj) ∮ A_{mi}(z⁻¹) A_{mi}(z) dz / {[B_{mi}(z⁻¹)]² [B_{mi}(z)]² z}    (31)

The mean-square quantities ᾱ², β̄², and ρ̄² are calculated exactly as in Section II, assuming uniform probability density functions. It should be noted again that zero and unity filter coefficients may be represented in the computer without error. The contour integrals involved …

… = (B₀B₂ + B₁²)(…) / [2(B₀B₁B₂)³] …    (32)

where

B₀ = 1 + b_{1i} + b_{2i}
B₁ = 2(1 - b_{2i})
B₂ = 1 - b_{1i} + b_{2i}
A₀ = (a_{0i} + a_{1i})²    (33)

… results in b_{2i} = 1 causes loss of asymptotic stability.

H_i(z⁻¹) = (a_{0i} + a_{1i}z⁻¹) / (1 + b_{1i}z⁻¹ + b_{2i}z⁻²)    (35)

…programming of each elementary transfer function, H_i(z⁻¹), with the output from the ith transfer function forming the input to the (i + 1)th. This programming technique is shown in Fig. 5.

Fig. 5. Cascade programming of the elementary transfer functions.

By virtue of (15), it follows that, due to the filter coefficient errors, the actual transfer function realized is

H′(z⁻¹) = Π_{i=1}^{N} [1 + ε_i(z⁻¹)] H_{mi}(z⁻¹)    (36)

then (36) can be written as

H′(z⁻¹) = {Π_{i=1}^{N} [1 + ε_i(z⁻¹)]} H_m(z⁻¹).    (39)

Substituting (39) into (18) yields

σ̄²_c = (1/2πj) ∮_{|z|=1} H_m(z⁻¹) H_m(z) {Π_{i=1}^{N} [1 + ε_i(z⁻¹)][1 + ε_i(z)] - 1} dz/z.    (42)
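Cascade programming chains the sections of (35), the ith output feeding the (i + 1)th input, so per-section coefficient errors enter multiplicatively as in (36); a sketch with the same illustrative section coefficients as above:

```python
def cascade_filter(sections, x):
    # Cascade programming: the output of the ith elementary section of (35)
    # forms the input to the (i + 1)th section.
    signal = x
    for a0, a1, b1, b2 in sections:
        out = []
        for n in range(len(signal)):
            acc = a0 * signal[n] + (a1 * signal[n - 1] if n >= 1 else 0.0)
            acc -= (b1 * out[n - 1] if n >= 1 else 0.0) + (b2 * out[n - 2] if n >= 2 else 0.0)
            out.append(acc)
        signal = out
    return signal

x = [1.0] + [0.0] * 7
y = cascade_filter([(1.0, 0.2, -0.9, 0.81), (0.3, 0.0, 0.4, 0.0)], x)
```

Because the stray terms multiply rather than add, the product form of (42) does not separate into per-section integrals, which is why the authors note the method is generally unsuitable for cascade programming.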
Substituting (37) into (42), one obtains

σ̄²_c = (1/2πj) ∮ { [A_m(z⁻¹)A_m(z)B_m(z⁻¹)B_m(z) + B_m(z⁻¹)B_m(z)] (…) - A_m(z⁻¹)A_m(z)B_m(z⁻¹)B_m(z) } dz/z,    (44)

or, in terms of the word-length of a fixed-point machine, … (48), which is also shown in Fig. 6.

As a means of verifying (48), the coefficients of the bandstop filter shown in Table I were appropriately rounded using the equations

ā_k = q × [integral part (a_k/q + 0.5)]
b̄_k = q × [integral part (b_k/q + 0.5)],    (50)

and the quantity … was measured directly by means of the technique shown in Fig. 7.
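The rounding rule of (50) translates directly into code; note that "integral part" is read here as a floor, which agrees with round-half-up for the positive coefficients of Table I (a sketch, helper name ours):

```python
import math

def round_to_grid(c, q):
    # Eq. (50): quantized value = q * [integral part of (c/q + 0.5)];
    # floor() is one reading of "integral part" (they agree for c >= 0).
    return q * math.floor(c / q + 0.5)

q = 2.0 ** -11                        # 12-bit word, eq. (3)
cq = round_to_grid(0.614532, q)
assert abs(cq - 0.614532) <= q / 2    # rounding error within +/- q/2, eq. (21)
```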
TABLE I
DIRECT PROGRAMMING COEFFICIENTS

 k   a_k                                  b_k
 0   6.14532 02400 45660 95935 ×10⁻¹      1.00000 00000 00000 00000
 1   1.82862 83743 34044 30234            2.84644 10689 74009 00222
 2   9.18598 76393 23345 37501            1.36544 18024 06057 35998 ×10⁺¹
 3   2.01785 43858 30049 83632 ×10⁺¹      2.86863 12827 03054 60931 ×10⁺¹
 4   5.65074 01304 61323 06076 ×10⁺¹      7.68668 01556 51020 20537 ×10⁺¹
 5   9.77616 97277 77455 67929 ×10⁺¹      1.26978 24238 74008 50583 ×10⁺¹
 6   1.95560 58002 77163 75480 ×10⁺¹      2.42719 14138 13632 85357 ×10⁺¹
 7   2.74444 52024 17107 32360 ×10⁺²      3.25744 40025 55813 19723 ×10⁺¹
 8   4.27221 29208 22686 15701 ×10⁻¹      4.84590 46018 12943 21722 ×10⁺¹
 9   4.95065 95407 14507 65312 ×10⁺¹      5.36954 24793 46719 51875 ×10⁺¹
10   6.24128 70904 97272 04930 ×10⁺¹      6.46875 76125 18105 92119 ×10⁺¹
11   6.00062 71331 12317 55061 ×10⁺¹      5.94574 02388 97171 95950 ×10⁺¹
12   6.24128 70905 09213 35302 ×10⁺¹      5.90817 75064 19863 11885 ×10⁺¹
13   4.95065 95407 34381 27910 ×10⁺¹      4.47868 92604 45618 46370 ×10⁺¹
14   4.27221 29208 46950 36973 ×10⁺¹      3.69085 95309 78270 04033 ×10⁺¹
15   2.74444 52924 38924 26919 ×10⁺¹      2.26468 72853 40520 74943 ×10⁺¹
16   1.95560 58902 95258 27851 ×10⁺¹      1.54004 37815 20019 23587 ×10⁺¹
17   9.77616 97278 91991 46783 ×10⁻¹      7.34798 81860 20413 00382 ×10⁺¹
18   5.65974 01305 31808 57200 ×10⁺¹      4.05582 38655 76315 12082 ×10⁺¹
19   2.01785 43858 69746 80847 ×10⁺¹      1.37867 64204 74181 15588 ×10⁺¹
20   9.18598 76393 60091 74744            5.97585 13278 52671 08000
21   1.82862 83743 68430 88781            1.13257 95031 45398 63928
22   6.14532 02401 45188 00124 ×10⁻¹      3.61729 37140 82771 44243 ×10⁻¹
[Figure: theoretical and measured error versus computer word-length M, bits.]
TABLE II
PARALLEL PROGRAMMING COEFFICIENTS

[Four coefficient columns for sections i = 1, …, 11; the numerical values are not legible in this reproduction.]
TABLE III

[Cascade programming coefficients; the entries, and the accompanying plot of error versus computer word-length (M), bits, are not legible in this reproduction.]
Fig. 9. Effect of coefficient accuracy on the real frequency response of the parallel programmed filter: gain (dB) versus frequency (kHz) for 10-, 12-, and 40-bit coefficient word-lengths.
…which is also shown in Fig. 8. Examination of Fig. 8 suggests that, for all practical purposes, a 40-bit realization may be considered "ideal" relative to less than 20-bit realizations. Further, coefficient rounding errors can be isolated from the data quantization and multiplicative rounding errors in the actual filter by performing all multiplications to a 40-bit accuracy. Using this experimental technique, the value of σ_m was computed for coefficient word-lengths in the range of 4 to 16 bits and the results obtained are included in Fig. 8. It is observed that σ_m is in close agreement with its expected value (σ̄²_m)^{1/2}, except for word-lengths of less than 8 bits, where an apparently deterministic deviation of theoretical and experimental values occurs. This phenomenon is attributed to the deletion of second-order quantities in (9). For illustrative purposes, the effect of coefficient accuracy on the real frequency response of the parallel programmed filter is shown in Fig. 9.

The cascade version of the digital filter under consideration is specified by the coefficients in Table III. For reasons of completeness, the degeneration in filter performance with this method of programming is shown in Fig. 10. In this case the ideal filter and all multiplications were performed to an accuracy of 40 bits, which may be justified in the same manner as before. It is seen that the actual degeneration for cascade programming is less than that for direct programming but greater than that for parallel programming.

Fig. 10. Cascade programming errors: theoretical and measured error versus computer word-length (M), bits.
VI. CONCLUSIONS

Due to inherent quantization errors on the coefficients … can be achieved by employing only one extra bit.³

With the analysis just presented and that available elsewhere,[5],[6] it is possible to specify a specific linear model of the basically nonlinear effects occurring in the finite word-length representation of a pulse transfer function. In the case of a fixed-point operation with direct programming, this specific statistical model is as shown in Fig. 11. It will be observed that the parallel transfer function arrangement

α(z⁻¹)/B_m(z⁻¹) - β(z⁻¹)A_m(z⁻¹)/[B_m(z⁻¹)]²

gives the statistically expected mean-square deviation (σ̄²_m) between the actual and ideal frequency responses. As the width of quantization (q) would be chosen in practice so that the deviation in the real frequency responses of H(z⁻¹) and H_m(z⁻¹) is extremely small, then the noise shaping filter 1/B(z⁻¹) may evidently be replaced by 1/B_m(z⁻¹) with only second-order errors. Quantitatively, the power transmitted by 1/B_m(z⁻¹) and

1 / (1 + Σ_{k=1}^{N} b̄_k z⁻ᵏ)

can be considered identical for a satisfactory actual realization. … programming. Further, parallel programmed filters are apparently less susceptible to multiplicative and coefficient quantization than directly programmed filters (see Knowles and Edwards,[5],[6] Kaiser,[7] and this paper).

After checking the real frequency responses of the analog and parallel programmed digital filters for reasonable agreement, a computation of σ̄²_m can be effected using (31). These same calculations can also be utilized to obtain the rms value of the output noise due to data and multiplicative rounding errors. As σ̄²_m is the mathematical expectation of

(T/2π) ∫₀^{2π/T} |H*(jω) - H*_m(jω)|² dω,

then it is evidently a measure of the deviation between the actual and the ideal filter responses at any frequency. For many bounded functions, plus or minus three times its rms value contains most, and in some cases all, probable observations. It seems reasonable, therefore, to select the computer word-length according to

3 σ̄_m < Acceptable Gain Fluctuation    (55)

provided that the output noise due to data quantization and multiplicative roundoff errors is also acceptable with this word-length. This procedure evidently enables the designer to assess whether a practical realization of the given digital filter can be made on a smaller word-length computer.

As an example of the use of this analysis in digital filter design, the number of bits required for the Golden-Kaiser filter to meet the following specification will be considered:

Ripple in passband = 0.5 dB.
Minimum attenuation in rejection band = -75 dB.

± [antilog₁₀(0.5/20) - 1] = ±0.059

± [antilog₁₀(-…/20) - antilog₁₀(-…/20)] = ±0.266 × 10⁻⁴

³ Private correspondence with Dr. J. F. Kaiser.

ACKNOWLEDGMENT

The authors wish to acknowledge the stimulus and encouragement received from their correspondence with Dr. J. F. Kaiser of Bell Telephone Laboratories, Murray Hill, N. J.

REFERENCES

[1] R. M. Golden and J. F. Kaiser, "Design of wideband sampled-data filters," Bell Sys. Tech. J., vol. 43, pp. 1533-1546, July 1964.
[2] J. F. Kaiser, "Design methods for sampled data filters," Proc. 1st Allerton Conf. on Circuit and System Theory, pp. 221-236, November 1963.
[3] B. Widrow, "Statistical analysis of amplitude-quantized sampled-data systems," Trans. AIEE (Applications and Industry), vol. 79, pp. 555-568, January 1961.
[4] J. F. Kaiser, "Some practical considerations in the realization of linear digital filters," Proc. 3rd Allerton Conf. on Circuit and System Theory (Monticello, Ill.), pp. 621-633, October 1965.
[5] …, "Investigation of quantization errors," M.Sc. dissertation, University of Manchester, England, 1966.
[6] L. M. Graves, The Theory of Functions of Real Variables, 2nd ed. New York: McGraw-Hill, 1956.
[10] J. R. Ragazzini and G. F. Franklin, Sampled-Data Control Systems. New York: McGraw-Hill, 1958.
[11] E. I. Jury, Sampled-Data Control Systems. New York: Wiley, 1958.
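The selection rule (55) used in the design example above amounts to searching for the smallest word-length whose predicted 3σ̄_m falls below the acceptable gain fluctuation. In the sketch below, sigma_for_wordlength is a stand-in of our own for the full σ̄²_m computation of (31); only its proportionality to q = 2^{-(M-1)} is taken from the paper:

```python
def sigma_for_wordlength(M, scale=4.0):
    # Placeholder for sqrt(sigma-bar-squared) from eq. (31); for coefficient
    # rounding it scales with the quantization width q = 2^{-(M-1)}.  The
    # factor `scale` is an assumed, filter-dependent constant.
    return scale * 2.0 ** -(M - 1)

def choose_wordlength(acceptable_fluctuation, M_max=40):
    # Eq. (55): smallest M with 3*sigma < acceptable gain fluctuation.
    for M in range(2, M_max + 1):
        if 3.0 * sigma_for_wordlength(M) < acceptable_fluctuation:
            return M
    return M_max

M = choose_wordlength(0.059)   # passband tolerance from the design example
```

Under the assumed scale factor the search settles on a 9-bit word; a real design would substitute the contour-integral evaluation of (31) for the placeholder.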
Reprinted from IEEE TRANSACTIONS ON AUTOMATIC CONTROL, Volume AC-13, Number 3, June, 1968, pp. 263-269.
COPYRIGHT © 1968, THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, INC. PRINTED IN THE U.S.A.

Eigenvalue Sensitivity and State-Variable Selection

P. E. MANTEY
Abstract-The first part of this paper presents a new measure of sensitivity specifically applicable to the realization of a linear discrete system on a digital computer. It is also shown that the sensitivity of the eigenvalues to parameter inaccuracies in the realization depends strongly on the choice of state variables. From these considerations, a realization is obtained which is "best" for a large class of systems of interest with regard to minimizing storage requirements, arithmetic operations, parameter accuracy, and eigenvalue sensitivity. The second half of the paper considers the very practical problem of determining the number of bits accuracy required in the computer-stored parameters of the system to achieve satisfactory performance. For the realization found to be a best compromise, equations are obtained for determining these bit requirements. Examples are given showing the application of this realization to the computer implementation of a discrete filter, and a comparison is given to other possible realizations.

INTRODUCTION

IN ANY PROBLEM where data are to be processed … between computer word length, that is, bits required, as determined by considerations of eigenvalue sensitivity to parameter inaccuracy, and the number of arithmetic operations required.

Other authors have considered the choice of realization with regard to parameter accuracy requirements for stability. Golden and Kaiser[3],[4] have pointed out, with regard to realization of scalar input sampled-data filters, that simulations corresponding to the direct or reduced transfer-function form of difference equations generally require considerably greater accuracy than is required for parallel, partial-fraction expansion, or cascade, factored transfer-function, forms. In later papers, Kaiser[4],[6] has obtained a lower bound on the number of bits required for the stability of a digital filter. He has also shown how root-locus techniques can be applied …
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, JUNE 1968
where

Φ = T⁻¹Φ′T
Γ = T⁻¹Γ′    (4)
H = H′T.

…tions for P-dimensional input, N-dimensional state, and R-dimensional output. For simplicity, the derivations in this paper will consider only scalar input and output. All results obtained can be extended to cases of vector-valued input and output.

INFINITESIMAL EIGENVALUE SENSITIVITY

Eigenvalue sensitivity is defined as the expected change in the location of an eigenvalue of Φ for a change in a parameter of Φ. Such parameter variations in digital realizations with fixed word length occur due to truncation, or rounding, of the specified parameters to this word length. An infinitesimal approximation to this sensitivity, valid for small parameter inaccuracies, is given by

Δλ_k/Δφ_ij ≈ ∂λ_k/∂φ_ij.    (5)

…

∂λ_k/∂φ_ij = - [(∂f/∂φ_ij) / (∂f/∂λ)]_{λ=λ_k}    (8)

Using (7), this becomes …

By definition, ∂λ_k/∂φ_ij ≡ 0 if φ_ij is unity or zero, because the computer will realize these parameters exactly. The eigenvalue λ_k corresponding to S is then the most sensitive eigenvalue for this realization, and S is taken as a measure of the sensitivity of this Φ. This sensitivity measure can be used to evaluate the sensitivity of any proposed realization.

SENSITIVITY OF VARIOUS FORMS OF Φ

To obtain the minimally sensitive form requires a search over all Φ related to Φ′ by (4), comparing these choices on the basis of (10). If the possible choices are limited to those having only N parameters different from zero or unity, the number of possibilities is reduced, but no orderly procedure has been devised for selection …
MANTEY: EIGENVALUE SENSITIVITY AND STATE-VARIABLE SELECTION
realization, the minimum number of parameters are required: N for Φ, and N for Γ and H, for a total of 2N+1, since J requires one. Thus the minimum number of multiplications is used per input. Clearly, under these conditions no better form is possible. For multiple real eigenvalues, the analogous form is the Jordan form, and similar results can be obtained, although the infinitesimal sensitivity measure of (9) is not defined. For the case of complex eigenvalues, the diagonal or Jordan form has complex entries, and the corresponding state vector is complex. This form is undesirable, as the effective number of storage locations and multiplications is greatly increased.

Companion Matrix Form

If the system described by (1) is controllable and/or observable, it can be shown[9] that there exists a T which will reduce Φ′ to companion matrix form in the scalar input-output case, and will also reduce Γ′ (or H′) to N-1 zeros, the other element being unity. Similar forms exist for the vector input-output case.[10] The resulting system again requires the minimum of 2N+1 storage locations for the parameters, which corresponds to the form

Φ = [ 0    1      0    ⋯  0
      0    0      1    ⋯  0
      ⋮                    ⋮
      b_N  b_{N-1}     ⋯  b_1 ]    (11)

and the characteristic equation is then

f(λ) = det[λI - Φ] = λ^N - b_1λ^{N-1} - ⋯ - b_N = 0.    (12)

For stability, all eigenvalues of the system must, of course, have absolute value less than unity. For this consideration alone, the question would be the minimum number of bits required to specify the b_i of (12) to keep all λ_i inside the unit circle. Kaiser[4],[5] has shown a lower bound on the number of bits required for stability of this form.

For the companion matrix form of Φ, the sensitivity measure S is, from (9), (10), and (12), … (13)

For systems with eigenvalues which are reasonably close together, S is very large for this form. For instance, with two real eigenvalues λ = ¾, λ = ½, (13) yields S = 7, where λ = ¾ is the more sensitive eigenvalue. This means that a sevenfold increase in accuracy is required over that for the diagonal form.

COMPLEX CONJUGATE EIGENVALUES

From the preceding discussion, the diagonal form emerges as the most attractive form for Φ in the case of real eigenvalues. However, for complex eigenvalues, the handling of complex quantities is not desirable. If the original system has real coefficients, then any complex eigenvalues occur in conjugate pairs. Suppose that the system to be implemented has both real eigenvalues and complex conjugate eigenvalues. Let the number of real eigenvalues be M. Then Φ can be made to have an M×M submatrix with real entries on the diagonal, and this submatrix is in an ideal form for realization. The N-M complex conjugate eigenvalues remain, and note that N-M must be even. Partition this matrix into 2×2 matrices, one for each complex conjugate pair. Call each pair Λ_j, j = 1, 2, …, (N-M)/2, where

Λ_j = [ λ_{M+2j-1}  0
        0           λ_{M+2j} ],    Φ_j = [ b_{1j}  1
                                           b_{2j}  0 ]    (15)

where, for Φ_j and Λ_j to have the same eigenvalues requires that

b_{2j} = -λ_{M+2j-1} λ_{M+2j}
b_{1j} = λ_{M+2j-1} + λ_{M+2j};    (16)

then T will be real for Φ of the form

Φ = diag(λ_1, …, λ_M, Φ_1, …, Φ_{(N-M)/2}).    (17)
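Both sensitivity figures in this comparison can be reproduced with a finite-difference version of (5). The sketch below is ours and assumes the example's two real eigenvalues are 3/4 and 1/2, a choice that reproduces the quoted S = 7 for the companion form:

```python
import cmath

def eigs2(m):
    # Eigenvalues of a 2x2 matrix [[p, q], [r, s]] via the quadratic formula.
    (p, q), (r, s) = m
    tr, det = p + s, p * s - q * r
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

def sensitivity(m, i, j, k, h=1e-7):
    # Finite-difference form of eq. (5): |d(lambda_k) / d(phi_ij)|.
    mp = [row[:] for row in m]
    mp[i][j] += h
    return abs(eigs2(mp)[k] - eigs2(m)[k]) / h

# Diagonal form: each eigenvalue's sensitivity to its own entry is 1.
phi_diag = [[0.75, 0.0], [0.0, 0.5]]
sd = sensitivity(phi_diag, 0, 0, 0)

# Companion form of the same eigenvalues: phi = [[b1, b2], [1, 0]] with
# b1 = lambda1 + lambda2 and b2 = -lambda1*lambda2; the entries 1 and 0
# are realized exactly and contribute nothing, per the text.
phi_comp = [[1.25, -0.375], [1.0, 0.0]]
sc = sensitivity(phi_comp, 0, 0, 0) + sensitivity(phi_comp, 0, 1, 0)
# sd comes out near 1 and sc near 7: the sevenfold accuracy penalty of the
# companion form for closely spaced eigenvalues.
```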
It can easily be shown that, for stable eigenvalues, (18) has a lower bound of unity.

From these considerations, realization of Φ as in (17) is preferable to the companion matrix form whenever S of (18) is less than (13). Define Sc(λ_k) as the sensitivity of λ_k in the companion matrix form, and SD(λ_k) as the sensitivity of the same eigenvalue in the "decoupled" form of (17). For λ_k complex,

…    (19)

while for λ_k real, SD(λ_k) = 1.

Assuring that the ratio Sc(λ_k)/SD(λ_k) exceeds unity for each λ_k of Φ is sufficient to assure that the measure from (13) exceeds (18). For many conditions this obviously is satisfied; for example: 1) systems with no real eigenvalues, and all eigenvalues with Im(λ_k) > 0 are contained within a circle of radius ½; 2) all eigenvalues are within a circle of radius ½; or 3) the distance from any eigenvalue to any other eigenvalue not its conjugate is less than unity. Although such conditions are sufficient but not necessary, they are satisfied for a large class of systems, including most systems which are sampled-data versions of continuous systems with sampling rate chosen to avoid spectral folding, and in many such cases these conditions can be used instead of computing (13) or (18) to select the form desired. However, the calculation required to actually compare (13) and (18) is exceedingly trivial.

An example shows a comparison of the sensitivity in the companion matrix form of (11) and decoupled form of (17) for a system with five eigenvalues as shown in Table I.

TABLE I
INFINITESIMAL SENSITIVITY OF LOW-PASS SYSTEM

Real     Imaginary   Magnitude   Angle     Sc(λ)     SD(λ)
0.8000    0.0000     0.8000       0.0000   52.2871   1.0000
0.6928    0.3999     0.8000       0.5235   49.8833   2.2500
0.6928   -0.3999     0.8000      -0.5235   49.8833   2.2500
0.5656    0.5656     0.8000       0.7853   23.8559   1.5909
0.5656   -0.5656     0.8000      -0.7853   23.8559   1.5909

From these infinitesimal considerations, Sc of (13) is 52.28, while SD of (18) is 2.25, and it is estimated that the companion matrix realization will require at least five more bits for the same accuracy in eigenvalue location.

The proposed decoupled form, besides yielding lower sensitivity for a wide class of systems, has the important advantage that the bit requirements for realization of the system to any desired accuracy of the eigenvalues can be computed directly, without depending on infinitesimal arguments. This direct computation of bit requirements is covered in the next section.

It should be noted that the form of Φ given in (17) represents a realization in terms of (M+N)/2 parallel subsystems. Equivalently, placing ones in appropriate locations above the main diagonal in (17) results in cascade form, with the same characteristic equation. From the aspect of sensitivity, multiplicative operations, and storage, the two forms are equivalent. However, the different forms do have different effects on arithmetic roundoff error, with the cascade form being slightly better in most cases.

BIT REQUIREMENTS

Consideration of the Φ_j blocks of (17) will indicate the number of bits required for this desired realization; that is, attention can be focused on the bit requirements for each of the 2×2 matrices Φ_j of (15) to keep the eigenvalues within a circle of radius γ centered on their desired location. Let

λ_{M+2j-1} = α_j + iβ_j = ρ_j e^{iθ_j}
λ_{M+2j} = α_j - iβ_j = ρ_j e^{-iθ_j}.    (20)

Then from (16)

b_{1j} = 2α_j
b_{2j} = -(α_j² + β_j²)    (21)

and the factor of the characteristic polynomial of Φ related to Φ_j is

λ² - b_{1j}λ - b_{2j}.    (22)

Suppose that the eigenvalues λ_{M+2j-1}, λ_{M+2j} are moved, by the bit limitation, to the new locations

λ′_{M+2j-1} = (α_j + δ) + i(β_j + ε)
λ′_{M+2j} = (α_j + δ) - i(β_j + ε)    (23)

so that again λ′_{M+2j-1} is the conjugate of λ′_{M+2j}, and thus the characteristic polynomial retains real coefficients. Now the coefficients of Φ′_j are

b′_{1j} = 2α_j + 2δ
b′_{2j} = -[(α_j + δ)² + (β_j + ε)²].    (24)

Define

Δb_{1j} ≜ b′_{1j} - b_{1j} = 2δ
Δb_{2j} ≜ b′_{2j} - b_{2j} = -2α_jδ - δ² - 2β_jε - ε².    (25)

Now, if the changes in the eigenvalues of Φ_j are to be confined to a circle of radius γ, it is required that

δ² + ε² ≤ γ².    (26)
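Relations (21), (24), and (25) are easy to verify numerically. A small sketch, borrowing a pole of the low-pass system in Table I and using arbitrary perturbation sizes:

```python
def pair_coeffs(lam):
    # Eq. (21): b1 = 2*alpha, b2 = -(alpha^2 + beta^2) for lambda = alpha + i*beta.
    return 2 * lam.real, -abs(lam) ** 2

lam = 0.6928 + 0.3999j            # a pole of the low-pass system in Table I
b1, b2 = pair_coeffs(lam)

# Move the pair by (delta, eps) and form the perturbed coefficients, eq. (24).
delta, eps = 1e-3, -2e-3
lam2 = (lam.real + delta) + 1j * (lam.imag + eps)
b1p, b2p = pair_coeffs(lam2)

# Eq. (25): the coefficient changes in closed form.
db1 = 2 * delta
db2 = -2 * lam.real * delta - delta ** 2 - 2 * lam.imag * eps - eps ** 2
assert abs((b1p - b1) - db1) < 1e-12
assert abs((b2p - b2) - db2) < 1e-12
```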
Fig. 1 illustrates the eigenvalues and variations in the complex plane. Now consider the changes in Δb_{1i}, Δb_{2i} as the eigenvalues λ_{M+2i−1}, λ_{M+2i} are moved an amount γ in any direction φ. Then

    δ = γ cos φ   (27)
    ε = γ sin φ.   (28)

Using (25) this becomes

    Δb_{1i} = 2γ cos φ
    Δb_{2i} = −2γ√ρ_i cos(θ_i − φ) − γ².   (29)

The center of the ellipse is at (Δb_{1i}, Δb_{2i}) = (0, −γ²). It can be shown that, for all γ₁ < γ₂ < β_i, all changes in the eigenvalues corresponding to a circle of radius γ₁ map to an ellipse in the Δb_{1i}, Δb_{2i} plane which is interior to that for γ₂. Thus, for all γ < β_i, if the coefficients are specified to an accuracy such that (Δb_{1i}, Δb_{2i}) is interior to the ellipse for a particular γ, the eigenvalues will be a distance less than γ from their desired values. However, note that if γ ≥ β_i, the location of the eigenvalues can reverse with respect to the real axis. For stability, γ must be less than 1 − ρ. Fig. 2 illustrates the case where |Δb_{1i}| = |Δb_{2i}| = Δ. For a given γ, Δ is one-half the side of the largest square which can be centered at the origin in this plane and fitted inside the corresponding ellipse. Specifying the parameters to the accuracy given by Δ assures eigenvalues within γ of their prescribed values.

    |Δb_{1i}| ≤ Δ,   |Δb_{2i}| ≤ Δ,   Δ > 0.   (30)

The problem of determining the bits required to keep the eigenvalues within γ is then equivalent to determining Δ so that the zeros of (22) are within γ of λ_{M+2i−1} and λ_{M+2i}. The zeros of the perturbed factor are

    λ'_{M+2i−1, M+2i} = α_i + Δb_{1i}/2 ± (α_iΔb_{1i} + Δb_{1i}²/4 − β_i² + Δb_{2i})^{1/2}.   (34)

Then the bit requirement is determined by finding the Δ such that if Δb_{1i}, Δb_{2i} satisfy (30), then

    max_{Δb_{1i},Δb_{2i}} |λ'_{M+2i−1} − (α_i + iβ_i)| ≤ γ.   (35)

Combining (34) and (35), the problem becomes one of finding Δ such that if Δb_{1i}, Δb_{2i} satisfy (30) and γ is given, then
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, JUNE 1968
    max_{Δb_{1i},Δb_{2i}} | Δb_{1i}/2 ± (−β_i² + α_iΔb_{1i} + Δb_{1i}²/4 + Δb_{2i})^{1/2} − iβ_i | ≤ γ.   (36)

The problem is handled most simply, from the aspect of computation, by choosing Δ and evaluating (36) according to the restrictions of (30) for the corresponding γ_max. These values of γ_max can be tabulated for a range of Δ and the required bits easily determined. For a given Δ, if Δb_{1i} and Δb_{2i} satisfy (30), then two cases arise.

Case 1:

    −β_i² + α_iΔb_{1i} + Δb_{1i}²/4 + Δb_{2i} < 0.   (38)

For this case, again restricting Δb_{1i}, Δb_{2i} according to (30),

    γ_max = max_{Δb_{1i},Δb_{2i}} { Δb_{1i}²/4 + [β_i − (β_i² − α_iΔb_{1i} − Δb_{1i}²/4 − Δb_{2i})^{1/2}]² }^{1/2}.   (39)

The maximum of (39) occurs for Δb_{1i} = Δ sgn(α_i), Δb_{2i} = Δ, and is

    γ_max = { Δ²/4 + [β_i − (β_i² − |α_i|Δ − Δ²/4 − Δ)^{1/2}]² }^{1/2}.   (40)

The Case 1 requirement (38) becomes

    −β_i² + |α_i|Δ + Δ²/4 + Δ < 0.   (41)

Case 2:

    −β_i² + α_iΔb_{1i} + Δb_{1i}²/4 + Δb_{2i} ≥ 0.   (42)

For this case, again restricting Δb_{1i}, Δb_{2i} according to (30),

    γ_max = max_{Δb_{1i},Δb_{2i}} { [|Δb_{1i}|/2 + (−β_i² + α_iΔb_{1i} + Δb_{1i}²/4 + Δb_{2i})^{1/2}]² + β_i² }^{1/2}.   (43)

The maximum of (43), subject to the constraint of (30), again occurs for Δb_{1i} = Δ sgn(α_i), Δb_{2i} = Δ, and is

    γ_max = { [Δ/2 + (−β_i² + |α_i|Δ + Δ²/4 + Δ)^{1/2}]² + β_i² }^{1/2}.   (44)

To summarize: for a given Δ (Δ > 0), if

    −β_i² + Δ|α_i| + Δ²/4 + Δ ≥ 0,   (45)

γ_max is given by (44); otherwise, γ_max is given by (40).

ALTERNATE REALIZATIONS

Each complex pair may instead be realized by a block of the form

    Φ_i = [  α_i   β_i
            −β_i   α_i ]   (46)

where α_i and β_i define the eigenvalues as given in (20). This form of Φ_i again can be shown to yield a real transformation T. Again, only N coefficients need be stored. Here, to keep the eigenvalues within a circle of radius γ, it is required that δ and ε, which are now the tolerances in α_i and β_i, respectively, must satisfy (26). For a fixed word length, this makes δ = ε = 2^{−k}, where k is the number of bits used. This is a less stringent requirement than that imposed with regard to realization of Φ_i of the form of (15). However, for each pair of eigenvalues realized according to (46), two additional multiplicative operations are required for the computation of each output; this violates the earlier restriction to realizations using a minimum of multiplications. To determine whether the saving in coefficient bit length justifies the increase in multiplicative operations, in terms of efficient computation, consider the machine time required by each form, related to the corresponding bit requirements.

From consideration of (40) and (44), the smallest Δ₁, for any stable eigenvalue, that can result in a change γ in the eigenvalue locations for Φ_i of (15), is related to γ by γ = (2Δ₁ + Δ₁²/2)^{1/2}. For small γ, and hence small Δ₁, Δ₁ ≈ γ²/2. Now Δ₁ corresponds to a requirement of k₁ bits, and since b_{1i}, b_{2i} of (15) are, in magnitude, less than two, k₁ = 1 − log₂Δ₁ = 1 − log₂(γ²/2).

For Φ_i of (46), the correspondingly smallest Δ₂ for the same γ is related to γ by γ = (2Δ₂²)^{1/2}, or Δ₂ = γ/√2. Since α_i and β_i are in magnitude less than unity, k₂ = −log₂Δ₂ = −log₂(γ/√2).

Thus, for the same γ, k₂ ≈ (k₁ − 1)/2, and about half the number of bits is required for the parameters of (46).
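The relation k₂ ≈ (k₁ − 1)/2 follows directly from the two small-γ formulas above; a brief check (function names are my own, not the paper's):

```python
import math

def bits_companion(gamma):
    # k1 = 1 - log2(gamma^2 / 2): coefficients b1, b2 bounded by 2 in magnitude
    return 1 - math.log2(gamma ** 2 / 2)

def bits_coupled(gamma):
    # k2 = -log2(gamma / sqrt(2)): coefficients alpha, beta bounded by 1
    return -math.log2(gamma / math.sqrt(2))

for gamma in (1e-2, 1e-3, 1e-4):
    k1, k2 = bits_companion(gamma), bits_coupled(gamma)
    print(gamma, round(k1, 2), round(k2, 2), round((k1 - 1) / 2, 2))
```

For these values the identity k₂ = (k₁ − 1)/2 holds exactly, since both expressions reduce to −log₂γ plus a constant.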
Table II shows the eigenvalue locations for this system and the respective infinitesimal eigenvalue sensitivities (S_C(λ) and S_D(λ)) as computed for the forms of (11) and (17).

TABLE II
INFINITESIMAL SENSITIVITY OF NARROW-BAND SYSTEM

Real      Imaginary   Magnitude   Angle     S_C(λ)           S_D(λ)
0.9919     0.0197     0.9921      0.0199    15807784.2968    50.4099
0.9919    −0.0197     0.9921     −0.0199    15807784.2968    50.4099
0.9833     0.0463     0.9844      0.0471    9909651.1249     21.3965
0.9833    −0.0463     0.9844     −0.0471    9909651.1406     21.3965
0.9894     0.0736     0.9921      0.0743    2352116.9062     13.5188
0.9894    −0.0736     0.9921     −0.0743    2352116.9062     13.5188

Realization of this filter on the IBM 7090 with Φ in the companion matrix form of (11) resulted in an unstable system in single precision arithmetic (27 bits accuracy). Choosing Φ of the form of (17) and using the simplified analysis related to uniform word length, it was computed that the filter realized in this form would require only 14-bit coefficients for stability, to keep γ_max < (1 − ρ), and it would yield essentially ideal performance with 18-bit coefficients, where ideal performance was taken to mean that no eigenvalue had a shift γ, due to the use of a finite number of bits, of more than 10 percent of the distance between the eigenvalue and the unit circle in the complex plane. These results, and many others [11], were verified by simulation and illustrate the desirability of a realization in the form of (17), as well as the ability of the analysis developed here to predict within one or two bits the accuracy needed in the coefficients to achieve essentially "ideal" performance. This empirical criterion of ideal performance applied to γ for this system, using the calculated bit accuracy, resulted in frequency and transient responses with essentially no discernible differences from those using more bits, while use of fewer bits resulted in very deleterious changes in both transient response and frequency response.

REFERENCES

…tion problems," Trans. ASME, J. Basic Engrg., ser. D, vol. 82, pp. 35–45, March 1960.
[3] R. M. Golden and J. F. Kaiser, "Design of wideband sampled-data filters," Bell Sys. Tech. J., vol. 43, pt. 2, pp. 1533–1546, July 1964.
[4] J. F. Kaiser, "Digital filters," in System Analysis by Digital Computer, F. F. Kuo and J. F. Kaiser, Eds. New York: Wiley, 1966, pp. 218–285.
[5] J. F. Kaiser, "Some practical considerations in the realization of linear digital filters," Proc. 3rd Ann. Allerton Conf. on Circuit and System Theory (Urbana, Ill., October 1965), pp. 621–633.
[6] E. Bodewig, Matrix Calculus. New York: Interscience, 1956.
[7] C. G. J. Jacobi, Crelle's Journal, vol. 30. Berlin: de Gruyter, 1846, pp. 51–95.
[8] B. S. Morgan, Jr., "Sensitivity analysis and synthesis of multivariable systems," IEEE Trans. Automatic Control, vol. AC-11, pp. 506–512, July 1966.
[9] W. M. Wonham and C. D. Johnson, "Optimal bang-bang control with quadratic performance index," Preprints, 4th Joint Automatic Control Conf. (Minneapolis, Minn., June 1963), pp. 101–112.
[10] W. C. Tuel, "Canonical forms for linear systems—I," IBM Research Rept. RJ 175, March 1966.
[11] C. S. Weaver, P. E. Mantey, R. W. Lawrence, and C. A. Cole, "Digital spectrum analyzers," Stanford Electronics Laboratories, Stanford, Calif., Rept. SEL-66-059 (TR 1809-1/1810-1), June 1966.
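The narrow-band result above can be miniaturized to a single pole pair from Table II. The sketch below (word length, variable names, and the reduction to one second-order section are my own assumptions, not the paper's IBM 7090 experiment) compares quantizing the direct second-order coefficients b₁, b₂ with quantizing α and β directly as in the coupled form (46):

```python
import cmath

def quant(x, bits):
    # round to `bits` fractional bits
    return round(x * 2 ** bits) / 2 ** bits

# one narrow-band pole pair from Table II: lambda = 0.9894 + j0.0736
alpha, beta = 0.9894, 0.0736
BITS = 10  # illustrative word length only

# direct second-order section stores b1 = 2*alpha, b2 = -(alpha^2 + beta^2)
b1 = quant(2 * alpha, BITS)
b2 = quant(-(alpha * alpha + beta * beta), BITS)
pole_direct = b1 / 2 + cmath.sqrt(complex(b1 * b1 / 4 + b2))
shift_direct = abs(pole_direct - complex(alpha, beta))

# coupled form (46) stores alpha and beta themselves
pole_coupled = complex(quant(alpha, BITS), quant(beta, BITS))
shift_coupled = abs(pole_coupled - complex(alpha, beta))

print(shift_direct, shift_coupled)
```

Because the pole sits close to z = 1, recovering the small imaginary part from b₁ and b₂ amplifies the coefficient rounding error, while the coupled form's error is simply the rounding error in α and β.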
Abstract: This paper contains an analysis of the fixed-point accuracy of the power of two, fast Fourier transform algorithm. This analysis leads to approximate upper and lower bounds on the root-mean-square error. Also included are the results of some accuracy experiments on a simulated fixed-point machine and their comparison with the error upper bound.

I. Introduction

If X(j), j = 0, 1, ..., N−1, is a sequence of complex numbers, then the finite Fourier transform of X(j) is the sequence

    A(n) = (1/N) Σ_{j=0}^{N−1} X(j) exp(−2πijn/N),   n = 0, 1, ..., N−1.   (1)

The inverse transform is

    X(j) = Σ_{n=0}^{N−1} A(n) exp(2πijn/N),   j = 0, 1, ..., N−1,   (2)

and Parseval's theorem takes the form

    Σ_{j=0}^{N−1} |X(j)|² = N Σ_{n=0}^{N−1} |A(n)|².   (3)
IEEE TRANSAcrlONS ON AUDIO AND ELEcrROACOUSTICS VOL. Au- 17, N o. 2 JUNE 1969
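The transform pair (1)-(2) and the Parseval relation (3) are easy to verify numerically; a minimal sketch with this paper's 1/N normalization (function names are my own):

```python
import cmath, random

def fft_def(x):
    # finite Fourier transform per (1), with the 1/N factor out front
    N = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * n / N) for j in range(N)) / N
            for n in range(N)]

def ifft_def(a):
    # inverse transform per (2)
    N = len(a)
    return [sum(a[n] * cmath.exp(2j * cmath.pi * j * n / N) for n in range(N))
            for j in range(N)]

random.seed(1)
x = [complex(random.random(), random.random()) for _ in range(8)]
A = fft_def(x)
lhs = sum(abs(v) ** 2 for v in x)
rhs = len(x) * sum(abs(v) ** 2 for v in A)      # Parseval, eq. (3)
err_rt = max(abs(a - b) for a, b in zip(x, ifft_def(A)))
print(lhs, rhs, err_rt)
```

With this normalization the factor N appears on the right side of (3), which is the form used later when the mean square of the final array is computed.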
Let X_m(i) and X_m(j) be the original complex numbers. Then, the new pair X_{m+1}(i), X_{m+1}(j) are given by

    X_{m+1}(i) = X_m(i) + X_m(j)W
    X_{m+1}(j) = X_m(i) − X_m(j)W.   (4)

In terms of real and imaginary parts, the first of these is

    Re X_{m+1}(i) = Re X_m(i) + Re X_m(j) Re W − Im X_m(j) Im W
    Im X_{m+1}(i) = Im X_m(i) + Re X_m(j) Im W + Im X_m(j) Re W,   (5)

with similar equations for X_{m+1}(j). From (4),

    [ (|X_{m+1}(i)|² + |X_{m+1}(j)|²)/2 ]^{1/2} = √2 [ (|X_m(i)|² + |X_m(j)|²)/2 ]^{1/2}.   (6)

Hence, in the root-mean-square sense, the numbers (both real and complex) are increasing by √2 at each stage. Consider next the maximum modulus of the complex numbers. From (4) one can easily show that

    max { |X_m(i)|, |X_m(j)| } ≤ max { |X_{m+1}(i)|, |X_{m+1}(j)| } ≤ 2 max { |X_m(i)|, |X_m(j)| }.   (7)

Hence the maximum modulus of the array of complex numbers is nondecreasing.

In what follows, we will assume that the numbers are scaled so that the binary point lies at the extreme left. With this assumption the relationships among the numbers are as shown in Fig. 1. The outside square gives the region of possible values, Re{X_m(i)} < 1 and Im{X_m(i)} < 1. The circle inscribed in this square gives the region |X_m(i)| < 1. The inside square gives the region Re{X_m(i)} < 1/2, Im{X_m(i)} < 1/2. Finally, the circle inscribed in this latter square gives the region |X_m(i)| < 1/2. Now if X_m(i) and X_m(j) are inside the smaller circle, then (7) tells us that X_{m+1}(i) and X_{m+1}(j) will be inside the larger circle and hence not result in an overflow. Consequently, if we control the sequence at the mth stage so that |X_m(i)| < 1/2, we are certain we will have no overflow at the (m+1)st stage.

a simple right shift is not a sufficient correction. The above results and observations suggest a number of alternative ways of keeping the array properly scaled. The three that seem most reasonable are the following.

1) Shifting Right One Bit at Every Iteration: If the initial sequence, X₀(i), is scaled so that |X₀(i)| < 1/2 for all i and if there is a right shift of one bit after every iteration (excluding the last) then there will be no overflows.

2) Controlling the Sequence so that |X_m(i)| < 1/2: Again assume the initial sequence is scaled so that |X₀(i)| < 1/2 for all i. Then at each iteration we check |X_m(i)| and if it is greater than one half for any i we shift right one bit before each calculation throughout the next iteration.

3) Testing for an Overflow: In this case the initial sequence is scaled so that Re{X₀(i)} < 1 and Im{X₀(i)} < 1. Whenever an overflow occurs in an iteration the entire sequence (part of which will be new results, part of which will be entries yet to be processed) is shifted right by one bit and the iteration is continued at the point at which the overflow occurred. In this case there could be two overflows during an iteration.

The first alternative is the simplest, but the least accurate. Since it is not generally necessary to rescale the sequence at each iteration, there is an unnecessary loss in accuracy. The second alternative is also not as accurate as possible because one less than the total number of bits available is being used for the representation of the sequence. This alternative also requires the computation of the modulus of every member of the sequence at each iteration.
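Scaling alternative 1) is simple enough to sketch directly. The following is a toy model, not the author's program: the 12-bit word length and all names are assumptions, and the shift is applied at every stage (rather than excluding the last) so that the output matches the 1/N-normalized transform of (1).

```python
import cmath, random

B = 12               # assumed fractional word length for this sketch
SCALE = 1 << B

def quant(v):
    # round a real value to B fractional bits
    return round(v * SCALE) / SCALE

def fft_fixed(x):
    # radix-2 FFT with a right shift (divide by 2) at every stage;
    # with |x(i)| < 1/2 at the input, no intermediate value can overflow
    a = list(x)
    N = len(a)
    j = 0
    for i in range(1, N):               # bit-reversal permutation
        bit = N >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    size = 2
    while size <= N:
        half = size // 2
        for start in range(0, N, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)
                p = w * a[start + k + half]
                t = complex(quant(p.real), quant(p.imag))        # rounded product
                u = a[start + k]
                a[start + k] = complex(quant((u + t).real / 2),
                                       quant((u + t).imag / 2))  # butterfly + shift
                a[start + k + half] = complex(quant((u - t).real / 2),
                                              quant((u - t).imag / 2))
        size *= 2
    return a

random.seed(2)
N = 8
x = [complex(0.4 * random.random(), 0.4 * random.random()) for _ in range(N)]
X = fft_fixed(x)
ref = [sum(x[j] * cmath.exp(-2j * cmath.pi * j * n / N) for j in range(N)) / N
       for n in range(N)]
err = max(abs(a - b) for a, b in zip(X, ref))
print(err)
```

Shifting unconditionally at every stage wastes accuracy exactly as the text describes, but it guarantees that every intermediate value stays inside the unit square.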
The third alternative is the most accurate. It has the disadvantage that one must process through the sequence an additional time whenever there is an overflow. The indexing for this processing is, however, straightforward. It would not be the complex indexing required for the algorithm. In comparing the speed of the second and third alternatives one would be comparing the speed of two overflow tests, two loads, two stores, and a transfer with that of the calculation or approximation of the modulus and a test of its magnitude. This comparison would depend greatly upon the particular machine and the particular approximation to the magnitude function.

A modification of the second alternative was adopted by Shively [3]. In this modification, if |X_m(i)| > 1/2, the right shift was made after each calculation in the next iteration. Provision was made for possible overflow. We will give an error analysis of the third alternative below. A microcoding performance study of this third alternative for the IBM 360/40 can be found in [4]. Although this error analysis applies to the third alternative it can be easily modified to apply to the second. In addition, the upper bound given applies directly to the first alternative. The analysis can also be modified for the power of four algorithm.

IV. A Fixed-Point Error Analysis

2) When two B bit numbers are added together and there is an overflow, then the sum must be shifted right and a bit lost. If this bit is a zero, there is no error. If it is a one, there is an error of ±2^{−B} depending upon whether the number is positive or negative. The variance of this error (it is unbiased assuming there are an equal number of positive and negative numbers) is

    Δ₂² = 2^{−2B}/2.   (10)

It has a standard deviation

    Δ₂ = 2^{−B−1/2} ≈ 0.7(2^{−B}).   (11)

In addition, we will consider the effects of the propagation of errors present in the initial sequence. The variance of these errors we designate by δ². In the simplest case these errors would be the quantization errors resulting from the A/D conversion of an analog signal.

B. Upper Bound Analysis

In this section, we give an upper bound analysis of the ratio of the rms error to the rms of the completed transform. This upper bound is obtained by assuming that during each step of the calculation there is an overflow and a need to rescale. We let X_k(j) be a typical real element at the kth stage (i.e., the real or imaginary part of a complex element) and let
In going from the second stage to the third stage, we have multiplications and we have them in all subsequent stages. In generating the third stage, half the inner loops have multiplications. Consider the first equation of (5). All the other equations are identical in terms of error propagation. Remember that X₃(i) is complex:

    Re{X₃(i)} = Re{X₂(i)} + Re{X₂(j)} Re{W} − Im{X₂(j)} Im{W}.   (16)

Equation (16) yields, with rounding to B bits after the addition and with rescaling,

    V'(X₃) = V(X₂) + [Re²{X₂(j)} + Im²{X₂(j)}]V(W) + [Re²(W) + Im²(W)]V(X₂) + (4³Δ²/2) + 6·4³Δ²
           = V(X₂) + |X₂(j)|²Δ² + V(X₂) + (4³Δ²/2) + 6·4³Δ².   (17)

In (17), the first term is the variance of the first term of (16). The second and third terms of (17) are the variance of the full 2B bit products given by the second and third terms of (16). The fourth term of (17) is the result of rounding after the addition. The fifth term is the rescaling term. Finally, we saw in (6) that the average modulus squared of the complex numbers is increasing by a factor of 2 every stage. Hence, if we let K equal the average modulus squared of the initial array, i.e.,

    K = (1/N) Σ_{j=0}^{N−1} |X₀(j)|²,   (18)

then

    V(X₃) = 2V(X₂) + 2KΔ² + 4³Δ²/2 + 6·4³Δ²
          = 2³(6Δ²) + 2³δ² + 2²(6·4Δ²) + 2(6·4²Δ²) + 6·4³Δ² + 2KΔ² + 4³Δ²/2.   (19)

In the next stage, three quarters of the inner loops require multiplications and these multiplications get progressively more numerous as the stages increase. Hence, from here on, we will assume all stages have multiplications in all the inner loops. Thus, applying the above techniques, we get

    V(X₄) = 2⁴(6Δ²) + 2⁴δ² + 2³(6·4Δ²) + 2²(6·4²Δ²) + 2(6·4³Δ²) + 6·4⁴Δ²
            + 2²KΔ² + 2³KΔ² + 4³Δ² + 4⁴Δ²/2   (20)

and, generally, if M is the last stage,

    V(X_M) = 2^M(6Δ²) + 2^Mδ² + 2^{M−1}(6·4Δ²) + ⋯ + 2(6·4^{M−1}Δ²)
             + 2^{M−2}KΔ² + (M − 3)2^{M−1}KΔ²
             + 2^{M−4}(4³Δ²) + 2^{M−5}(4⁴Δ²) + ⋯ + 4^MΔ²   (21)

    = (1.5)2^{M+2}Δ²(1 + 2 + ⋯ + 2^{M−1}) + 2^Mδ² + (M − 2.5)2^{M−1}KΔ²
      + 2^{M+2}Δ² + 2^{M+4}(1 + 2 + ⋯ + 2^{M−4})Δ²

or

    V(X_M) ≈ (1.5)2^{2M+2}Δ² + 2^Mδ² + (M − 2.5)2^{M−1}KΔ² + 2^{M+2}Δ² + 2^{2M+1}Δ²
           ≈ 2^{2M+3}Δ² + 2^Mδ² + (M − 2.5)2^{M−1}KΔ² + 2^{M+2}Δ².   (22)

K is the average of the square of the absolute values of the initial complex array. Hence, applying Parseval's theorem (3), the average of the square of the absolute values of the final array will be 2^M K. What is most meaningful in this case, however, is the mean square of the real numbers, which is 2^M K/2. Hence we have

    V(X_M)/(2^M K/2) ≈ 2^{M+3}Δ²/(K/2) + 2δ²/(K/2) + (M − 2.5)Δ² + 2²Δ²/(K/2)   (23)

and, finally, for large M,

    rms(error)/rms(result) ≈ 2^{(M+3)/2}(0.3)2^{−B}/(K/2)^{1/2}.   (24)

C. Lower Bound Analysis

We will now obtain an approximate lower bound for the ratio of the rms of the error to the rms of the answer. We obtain this lower bound by assuming that there are no overflows in the calculation and, hence, no shifts of the array. In this case,

    V(X₀) = δ²
    V(X₁) = 2δ²   (25)
    V(X₂) = 2²δ².

In the third stage, half of the inner loops involve a multiplication and, hence,

    V(X₃) = (1/2)(2²KΔ²) + (1/2)(Δ²) + 2³δ².   (26)

This can be seen by considering the first term of (17).
The first term of (26) comes from the second term of the first of equations (17). The second term of (26) is caused by the rounding to B bits. Now, as before,

    V(X₄) = 2V(X₃) + 2³KΔ² + Δ².   (27)

Finally,

    V(X_M) = 2^{M−2}KΔ² + (M − 3)2^{M−1}KΔ² + 2^{M−3}Δ² + 2^{M−5}Δ² + 2^{M−6}Δ² + ⋯ + Δ² + 2^Mδ²
           = (M − 2.5)2^{M−1}KΔ² + 2^{M−3}Δ² + (1 + ⋯ + 2^{M−5})Δ² + 2^Mδ²   (28)
           ≈ (M − 2.5)2^{M−1}KΔ² + 2^{M−3}Δ² + 2^{M−4}Δ² + 2^Mδ².

As in Section IV-B, the mean square of the final sequence of real numbers is 2^M K/2. Hence, we have

    V(X_M)/(2^M K/2) ≈ (M − 2.5)Δ² + (Δ²/8)/(K/2) + (Δ²/16)/(K/2) + δ²/(K/2).   (29)

Now one has to be careful in interpreting (29) to obtain an approximate lower bound. In actuality, the only way to have a situation in which there are no shifts is to have a small K and, in fact, one which approaches zero as N (or M) becomes large. However, if we assume that the word size expands to the left as necessary rather than overflowing, then this analysis does provide a lower bound to the error. With this interpretation, as M becomes large, we have

    rms(error)/rms(result) ≈ (M − 2.5)^{1/2}(0.3)2^{−B}.   (30)

The lower bound increases as M^{1/2} = (log₂N)^{1/2}. This is the rate of increase which has been observed for the floating-point calculation [5], [6].

D. Some Experimental Results

An IBM 7094 program was written to perform a fixed-point calculation using the fast Fourier transform algorithm, as described above. The program was capable of simulating a fixed-point machine of any word size up to 35 bits plus a sign. Experiments were run with fixed-point numbers of 17 bits plus a sign. This corresponds to B = 17 in the analysis of Section IV-B and C.

We will now describe some experimental results. In these experiments we did not consider the propagation of the error present in the original sequence. Thus we considered the case where δ² = 0. The experiments were performed as follows. Floating-point input was fixed to 17 bits plus a sign. This fixed input was then transformed with the fixed-point program. The fixed-point output was then floated. Next, the fixed-point 17-bit input was floated and a floating-point transform taken. Since this floating-point transform uses a floating-point word with a 27-bit mantissa, it was considered the correct answer. Finally, the rms of the difference between the fixed-point and floating-point answers was taken. We also obtained the maximum absolute error and average error.

Fig. 2 contains the result of transforming random numbers which lie between zero and one (placed in both the real and imaginary parts). In this and subsequent tests, three runs were made for every power of two from 8 to 2048. Since these random numbers have a dc component of one-half, the fixed-point program must rescale at least M − 1 times. Hence, one would expect the error to lie close to the theoretical upper bound as given by (24). This theoretical upper bound is also plotted in Fig. 2 and the results are seen to lie slightly above it. The rms of the original array, √(K/2), is approximately 0.58.

Fig. 3 contains the results of transforming three sine waves plus random numbers between zero and one-half in the real part and all zeros in the imaginary part. Specifically,

    Re{X(j)} = 1/2[Y(j) + (1/2) sin(2π8j/N) + (1/4) sin(2π4j/N) + (1/4) sin(2π8j/N)]
    Im{X(j)} = 0

where the Y(j) are random numbers between zero and one. Again, there is a dc component of magnitude one-fourth and the array must be rescaled at least M − 2 times. Thus, one would expect these results to be lower relative to the theoretical upper bound than the case depicted in Fig. 2. From Fig. 3 one can see that this is in fact the case. The rms of the original array √(K/2) is, in this case, approximately 0.35. This is the reason the upper bound curve is higher than that of Fig. 2.

Fig. 4 contains the results of transforming random numbers from minus one to one (in both real and imaginary parts). In this case, the dc component is zero and there is no other strong component. The number of shifts should be approximately (log₂N)/2 or one-half shift per stage. Hence, one would expect the error curve to lie well below the theoretical upper bound, as is the case. In this case, √(K/2) = 0.58.

Fig. 5 contains the results of an experiment identical to that used for Fig. 3, except that the random numbers are between ±1/2. The results are as expected. In this case, √(K/2) ≈ 0.35.

Finally, Fig. 6 contains the results of transforming a sine wave in the real part and zero in the imaginary part. The sine wave was sin(2πj/8). Although in this case the array must be rescaled at least M − 2 times, the error is well below the upper bound. Here, √(K/2) = 0.5.

In all these calculations the bias, as reflected by the average error, was negligible compared with the rms error. Furthermore, the maximum error was of the same order of magnitude as the rms error and hence the error was not due to the effect of a few, highly inaccurate terms.
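The two bound curves plotted in the figures can be tabulated directly from the reconstructed expressions (24) and (30); the following is a sketch under those reconstructions (C = 0.3, and K = 2/3 corresponding to the Fig. 2 input, for which √(K/2) ≈ 0.58):

```python
import math

def upper_bound(M, B, K, C=0.3):
    # eq. (24)/(31): rms(error)/rms(result) for a shift at every stage
    return 2 ** ((M + 3) / 2) * C * 2 ** (-B) / math.sqrt(K / 2)

def lower_bound(M, B):
    # eq. (30): no shifts at all
    return math.sqrt(M - 2.5) * 0.3 * 2 ** (-B)

B, K = 17, 2.0 / 3.0
for M in range(3, 12):          # N = 8 ... 2048, as in the experiments
    print(2 ** M, upper_bound(M, B, K), lower_bound(M, B))
```

The upper bound grows by √2 per stage (a factor of 2 per stage in variance), while the lower bound grows only as M^{1/2}, which is why the experimental curves spread apart as N increases.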
Figs. 2 through 6 plot the measured rms error against N, together with the theoretical upper bound.

Fig. 2. Experimental error results: random numbers between 0 and 1; B = 17.
Fig. 3. Experimental error results: random numbers plus 3 sine waves; B = 17.
Fig. 4. Experimental error results: random numbers between −1 and 1; B = 17.
Fig. 5. Experimental error results: random numbers plus 3 sine waves; −1 < random numbers < 1; B = 17.
Fig. 6. Experimental error results: sine-wave input; B = 17.
E. Conclusions and Additional Comments

The upper bound obtained in Section IV-B is of the form

    rms(error)/rms(result) ≤ 2^{(M+3)/2} 2^{−B} C / rms(initial sequence)   (31)

where C = 0.3. On the basis of the experimental results we would recommend a bound with C = 0.4.

We also carried through the analysis for a sign magnitude machine with truncation rather than rounding. In this case, the analytical upper bound was of the form given by (31) but with C = 0.4. However, the experimental results were again higher and we would recommend a bound with C = 0.6. The case of a twos-complement machine with truncation was not analyzed as the analysis became exceedingly complex. However, experimental results indicated a bound of the form given by (31) with C = 0.9.

It should be pointed out that if we are taking the transform to estimate spectra then we will be either averaging over frequency in a single periodogram or over time in a sequence of periodograms and this averaging will decrease the error discussed here as well as the usual statistical error. Finally, if we are taking a transform and then its inverse, Oppenheim and Weinstein have shown [7] that the errors in the two transforms are not independent.

Acknowledgment

The author would like to thank R. Ascher for assistance in programming the fixed-point calculations. He would also like to thank the referee for a number of corrections and valuable suggestions.

References

[1] J. W. Cooley and J. W. Tukey, "An algorithm for machine calculation of complex Fourier series," Math. Comp., vol. 19, pp. 297-301, April 1965.
[2] J. W. Cooley, "Finite complex Fourier transform," SHARE Program Library: PK FORT, October 6, 1966.
[3] R. R. Shively, "A digital processor to generate spectra in real time," 1st Ann. IEEE Computer Conf., Digest of Papers, pp. 21-24, 1967.
[4] "Experimental signal processing system," IBM Corp., 3rd Quart. Tech. Rept., under contract with the Directorate of Planning and Technology, Electronic Systems Div., AFSC, USAF, Hanscom Field, Bedford, Mass., Contract F19628-67-C-0198.
[5] J. W. Cooley, P. A. W. Lewis, and P. D. Welch, "The fast Fourier transform algorithm and its applications," IBM Corp., Res. Rept. RC 1743, February 9, 1967.
[6] W. M. Gentleman and G. Sande, "Fast Fourier transforms for fun and profit," 1966 Fall Joint Computer Conf., AFIPS Proc., vol. 29. Washington, D.C.: Spartan, 1966, pp. 563-578.
[7] A. V. Oppenheim and C. Weinstein, "A bound on the output of a circular convolution with application to digital filtering," this issue, pp. 120-124.
To appear in the IEEE Transactions on Audio and Electroacoustics, Vol. AU-17, No. 3, September 1969.

Clifford J. Weinstein
Lexington, Massachusetts

ABSTRACT
Introduction

Recently, there has been a great deal of interest in the fast Fourier transform (FFT) algorithm and its application¹. Of obvious practical importance is the issue of what accuracy is to be expected when the FFT is implemented on a finite-word-length computer. This note studies the effect of roundoff errors when the FFT is implemented using floating point arithmetic. Rather than deriving an upper bound on the roundoff noise, as Gentleman and Sande² have done, a statistical model for roundoff errors is used to predict the output noise variance. The statistical approach is similar to one used previously³,⁴ to predict output noise variance in digital filters implemented via difference equations. The predictions are tested experimentally, with excellent agreement.
The FFT Algorithm for N = 2^ν

The discrete Fourier transform (DFT) of the complex N point sequence x(n) is defined as

    X(k) = Σ_{n=0}^{N−1} x(n) W^{−nk},   k = 0, 1, ..., N−1   (1)

where W = e^{j2π/N}. For large N, the FFT offers considerable time savings over direct computation of (1). We restrict attention to radix 2 FFT algorithms; thus we consider N = 2^ν, where ν = log₂N is an integer. Here the DFT is computed in ν stages. At each stage, the algorithm passes through the entire array of N complex numbers, two at a time, generating a new N number array. The νth computed array contains the desired DFT. The basic numerical computation operates on a pair of numbers in the mth array, to generate a pair of numbers in the (m+1)st array. This computation, referred to as a "butterfly," is defined by

    X_{m+1}(i) = X_m(i) + W̄ X_m(j)   (2a)
    X_{m+1}(j) = X_m(i) − W̄ X_m(j).   (2b)

Here X_m(i), X_m(j) represent a pair of numbers in the mth array, and W̄ is some appropriate integer power of W, that is,

    W̄ = W^p = e^{j2πp/N}.

At each stage, N/2 separate butterfly computations like (2) are carried out to produce the next array. The integer p varies with i, j, and m in a complicated way, which depends on the specific form of the FFT algorithm which is used. Fortunately, the specific way in which p varies is not important for our analysis. Also, the specific relationship between i, j, and m, which determines how we index through the mth array, is not important for our analysis. Our derived results will be valid for both decimation in time and decimation in frequency FFT algorithms¹, except in the section entitled "Modified Output Noise Analysis," where we specialize to the decimation in time case.
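The butterfly recursion (2) can be exercised directly; below is a minimal decimation-in-time sketch (the recursive structure and function names are assumptions of this illustration, not the note's own program):

```python
import cmath, random

def fft_butterfly(x):
    # decimation-in-time radix-2 FFT assembled from butterflies (2a)/(2b)
    N = len(x)
    if N == 1:
        return list(x)
    even = fft_butterfly(x[0::2])
    odd = fft_butterfly(x[1::2])
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * odd[k]   # W-bar times X_m(j)
        out[k] = even[k] + t                             # (2a)
        out[k + N // 2] = even[k] - t                    # (2b)
    return out

random.seed(0)
x = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(8)]
X = fft_butterfly(x)
direct = [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / 8) for n in range(8))
          for k in range(8)]
err = max(abs(a - b) for a, b in zip(X, direct))
print(err)
```

Each of the ν stages performs N/2 such butterflies, which is exactly the counting used in the noise analysis below.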
In the error analysis to be presented, we will need some results governing the propagation of signals and noise in the FFT. These results, specialized to correspond to the statistical model of signal and roundoff noise which we will use, are given in this section.

We assume a simple statistical model for the signal being processed by the FFT. Specifically, we consider the case where the signal X_m(i) present at the mth array is white, in the sense that all 2N real random variables composing the array of N complex numbers are mutually uncorrelated, with zero means, and equal variances. More formally, we specify for i = 0, 1, ..., N−1 that

    E{[Re X_m(i)]²} = E{[Im X_m(i)]²} = (1/2)E{|X_m(i)|²} = const. = (1/2)σ_{X_m}².   (3)

Using (2), one can deduce the statistics of the (m+1)st array. First, the signal at the (m+1)st array is also white; that is, equations (3) all remain valid if we replace m by m+1. In verifying this fact, it is helpful to write out (2) in terms of real and imaginary parts. Secondly, the expected value of the squared magnitude of the signal at the (m+1)st array is just double that at the mth array, or

    σ_{X_{m+1}}² = 2σ_{X_m}².   (4)
This relationship between the statistics at the mth and (m+1)st arrays allows us to deduce two additional results, which will be useful below. First, if the initial signal array X₀(i) is white, then the mth array X_m(i) is also white, and

    σ_{X_m}² = 2^m σ_{X_0}².   (6)

Finally, let us assume that we add, to the signal present at the (m+1)st array, a signal-independent, white noise sequence E_m(i) (which might be produced by roundoff errors) having properties as described in (3). This noise sequence will propagate to the νth, or output, array, independently of the signal, producing at the νth array white noise with variance

    2^{ν−m−1} σ_{E_m}².   (7)
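The doubling in (4) is in fact an identity for any unit-magnitude twiddle factor, since |a + W̄b|² + |a − W̄b|² = 2|a|² + 2|b|². A quick numerical check on a white array (sample size and variable names are my own choices):

```python
import cmath, random

random.seed(0)
N = 4096
# a "white" array: 2N real components, uncorrelated, zero mean, equal variance
x = [complex(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(2 * N)]
y = []
for i in range(N):
    a, b = x[2 * i], x[2 * i + 1]
    w = cmath.exp(-2j * cmath.pi * random.random())  # arbitrary unit twiddle
    y += [a + w * b, a - w * b]                      # one butterfly per pair
ms_in = sum(abs(v) ** 2 for v in x) / len(x)
ms_out = sum(abs(v) ** 2 for v in y) / len(y)
print(ms_in, ms_out)  # ms_out = 2 * ms_in up to float rounding, per (4)
```

Because the identity holds pairwise, the mean squared magnitude doubles exactly at every stage, not merely in expectation.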
To begin our FFT error analysis, we first analyze how roundoff noise is generated in, and propagated through, the basic butterfly computation. For reference, let the variables in (2) represent the results of perfectly accurate computation. Actually, however, roundoffs through the mth stage will cause the mth array to consist of the inaccurate results

    X̂_m(i) = X_m(i) + E_m(i),   i = 0, 1, ..., N−1,   (8a)

and these previous errors, together with the roundoff errors incurred in computing (2), will cause us to obtain

    X̂_{m+1}(i) = X_{m+1}(i) + E_{m+1}(i),   (8b)

    Re X̂_{m+1}(i) = Re X̂_m(i) + Re W̄ · Re X̂_m(j) − Im W̄ · Im X̂_m(j)   (9a)
    Im X̂_{m+1}(i) = Im X̂_m(i) + Re W̄ · Im X̂_m(j) + Im W̄ · Re X̂_m(j),   (9b)

and a similar pair of equations results for (2b). Let fl(·) represent the result of a floating point computation. We can write that

    fl(x + y) = (x + y)(1 + ε),   (10)

with |ε| ≤ 2^{−t}, where t is the number of bits retained in the mantissa. Also

    fl(x·y) = x·y(1 + ε),   (11)

with again |ε| ≤ 2^{−t}. Thus, one could represent the actual floating point computation corresponding to (9) by the flow graphs of Fig. 1, or by the equations

    Re X̂_{m+1}(i) = {Re X̂_m(i) + [Re W̄ · Re X̂_m(j)(1 + ε₁) − Im W̄ · Im X̂_m(j)(1 + ε₂)](1 + ε₃)}(1 + ε₄)   (12a)
    Im X̂_{m+1}(i) = {Im X̂_m(i) + [Re W̄ · Im X̂_m(j)(1 + ε₅) + Im W̄ · Re X̂_m(j)(1 + ε₆)](1 + ε₇)}(1 + ε₈).   (12b)

Now we subtract (9) from (12) to obtain (using (8)) an equation governing noise generation and propagation in the butterfly. Neglecting terms of second or higher order in the ε_i and E_m, we obtain

    E_{m+1}(i) = E_m(i) + W̄ E_m(j) + U_m(i)   (13)

where

    U_m(i) = Re[X_m(i)]ε₄ + Re W̄ · Re[X_m(j)](ε₁ + ε₃ + ε₄) − Im W̄ · Im[X_m(j)](ε₂ + ε₃ + ε₄).   (14)

Equations similar to (13) and (14) can be derived for E_{m+1}(j). Equation (13) is the basic equation governing noise generation and propagation in the FFT. Comparing (13) and (2a), we see that the noise E_m already present at the mth array propagates through the butterfly, as if it were signal, to the next array. But also we have additional roundoff noise, represented by U_m, introduced due to errors in computing the (m+1)st array from the mth array. Note that the noise source U_m(i) is a function of the signal, as well as of the roundoff variables ε_i. This signal dependence of roundoff errors is inherent in the floating point computation, and requires that we assume a statistical model for the signal, as well as for the ε_i, in order to obtain statistical predictions for the noise. We should note that the validity of the neglect of second order error terms in obtaining (13) and (14) needs to be verified experimentally.
Now we introduce a statistical model for the roundoff variables ε_i, and for the signal, which will allow us to derive the statistics of U_m and eventually predict output noise variance. We assume that the random variables ε_i are uncorrelated with each other and with the signal, and have zero mean and equal variances, which we call σ_ε². We also assume, for simplicity of analysis, that the signal x(n) to be transformed is white, in the sense described above (see (3)). Thus, we have that all 2N real random variables (in the N point complex sequence) are mutually uncorrelated, with zero means, and equal variances, which we call ½σ_x², so that

(15)

    σ_{u_m}² ≡ E[|U_m(i)|²] = 4σ_ε²·E[|X_m(i)|²].    (16)

In obtaining (16), one must take note of (4), and of the fact (see discussion preceding Eq. (6)) that the whiteness assumed for the initial signal array X_0(n) implies whiteness for the m-th array, so that Re X_m(i), Im X_m(i), Re X_m(j), and Im X_m(j) are mutually uncorrelated, with equal variance. One can use (6) to express the variance at the m-th array in terms of the initial signal variance as

    E[|X_m(i)|²] = 2^m σ_x²,    (17)

so that the variance of each noise source introduced in computing the (m+1)-st array from the m-th array becomes

    σ_{u_m}² = 2^{m+2} σ_ε² σ_x².    (18)

The argument leading to (18) implies that all the noise sources U_m(i) in a particular array have equal variance. A slight refinement of this argument would include the fact that the butterflies for which W = 1 or W = j introduce less noise; this refinement is carried out below.
Output Noise Variance for FFT
In this section, our basic result for output noise-to-signal ratio in the FFT is derived. Because we are assuming that all butterflies (including those for which W = 1 and W = j) are equally noisy, the analysis is valid for both decimation in time and decimation in frequency algorithms. Later we will refine the model for the decimation in time case, to take into account the reduced butterfly noise variance introduced when W = 1 or W = j. But the quantitative change in the results produced by this modification is very slight.

Given the assumptions of independent roundoff errors and white signal, the variance of the noise at an FFT output point can be obtained by adding the variances due to all the (independent) noise sources introduced in the butterfly computations leading to that particular output point.
Consider the contribution to the variance of the noise E_ν(i) at a particular point in the ν-th, or output, array, from just the noise sources U_m(i) introduced in computing the (m+1)-st array. These noise sources U_m(i) enter as additive noise of variance σ_{u_m}² at the (m+1)-st array, which (as implied by (13)) propagates to the output array as if it were signal. One can deduce (see (7)) that the resulting output noise variance is*

    E[|E_ν(i)|²]_m = 2^{ν-m-1} σ_{u_m}²,    i = 0, 1, ..., N-1,    (19)

or, using (18),

    E[|E_ν(i)|²]_m = 2^{ν+1} σ_ε² σ_x².    (20)
th
(20) states that the output noise variance, due to the m array of noise sources, does not
depend on m. This results from the opposing effects (18), and (19). By (18), the noise source
2 m
variance 0' increases as 2 , as we go from stage to stage; this is due to the increase in
u
m
signal variance, and the fact that the variance of floating point roundoff errors is proportional
to Signal variance. But (19) states that the amplification which O' � goes through in propagating
m
*Note that it is not quite true that the noise sequence U (1), which is added to the signal
at the (m+l)St array, is white, for in computing the m two outputs of a butterfly. the
same multiplications are carried out. and thus the same roundoff errors are committed.
Thus. the pair of noise sources U (1), u 0> associated with each particular butterfly.
will be correlated. However, all tie noisH'sCX1rces U (i) which affect a particular CXltput
point. are uncorrelated. since (as one could verify fran an FFT flow-gra�) noise
sources introduced at the top and bottom CXltputs of the butterfly never affect the same
point in the output array.
134
-m
to the output has a 2 dependence, that is the later a noise source is introduced, the
less gain it will go through.
To obtain the output noise variance, we sum (20) over m to include the variance due to the computation of each array. Since ν arrays are computed, we obtain

    σ_E² = E[|E_ν(i)|²] = ν·2^{ν+1} σ_ε² σ_x².    (21)

We can recast (21) in terms of output noise-to-signal ratio, noting that (6) implies that

    σ_X² = E[|X_ν(i)|²] = 2^ν σ_x²,    (22)

so that

    σ_E²/σ_X² = 2ν σ_ε².    (23)

Note the linear dependence on ν = log₂N in the expression (23) for expected output mean-squared noise-to-signal ratio. For comparison, the bounding argument of Gentleman and Sande [2] led to a bound on output mean-squared noise-to-signal ratio which increased as ν² rather than as ν. (Actually, they obtained a bound on rms noise-to-signal ratio, which increased as ν.) Certainly, the fact that the bound on output noise-to-signal ratio is much higher than its expected value is not surprising, since in obtaining a bound one must assume that all roundoff errors take on their maximum possible values, and add up in the worst possible way.
To express (23) quantitatively in terms of the register length used in the computation, we need an expression for σ_ε². Recall that σ_ε² characterizes the error due to rounding a floating point multiplication or addition (see (5) and (6)). Rather than assume that ε is uniformly distributed in (-2^{-t}, 2^{-t}), with variance (1/3)·2^{-2t}, σ_ε² was measured experimentally, and it was found that

    σ_ε² = (0.21)·2^{-2t}    (24)

matched more closely the experimental results. Actually, σ_ε² for an addition was found to be slightly different from that for a multiplication, and σ_ε² for multiplication was found to vary slightly as the constant coefficient (Re W or Im W) in the multiplication was changed. (24) represents essentially an empirical average σ_ε² for all the multiplications and additions used in computing the FFT of white noise inputs.
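A measurement of this kind can be imitated in a few lines. The sketch below is our own, with an assumed operand distribution, so the constant it produces need not match 0.21; it estimates the variance of the relative rounding error for t-bit additions and compares it with the uniform-distribution value (1/3)·2^{-2t}.

```python
import math
import random

def round_to_t_bits(x, t):
    """Round x to the nearest floating point number with a t-bit mantissa."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    scale = 2.0 ** (t - 1 - e)
    return round(x * scale) / scale

t = 10
random.seed(7)
errs = []
for _ in range(20000):
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    exact = x + y
    if exact == 0.0:
        continue
    # relative rounding error eps for this addition
    errs.append((round_to_t_bits(exact, t) - exact) / exact)

sigma_eps_sq = sum(e * e for e in errs) / len(errs)
uniform_model = (1.0 / 3.0) * 2.0 ** (-2 * t)  # eps ~ uniform(-2^-t, 2^-t)
```

For this operand distribution the measured variance comes out below the uniform-distribution value, consistent with the paper's observation that the naive model overestimates σ_ε².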
(23) and (24) summarize explicitly our predictions thus far for output noise-to-signal ratio. In the next section, the argument leading to (23) is refined to include the reduced noise variance of the butterflies for which W = 1 or W = j.

When W = 1, one can see (from Eq. (7)) that the roundoff variables associated with the multiplications in the butterfly are zero, since multiplication by 1 or 0, or adding a number to 0, is accomplished noiselessly. Thus (14) gives, in place of (16),

    σ_{u_m}² = 2σ_ε²·E[|X_m(i)|²],    (16)′

so that when W = 1, the butterfly error variance is half the variance introduced when W ≠ 1 and W ≠ j. One can easily verify that the variance in (16)′ is valid for W = j, also.
st
Now. not all the noise sources introduced in computing the (m+ 1 ) array from
th
the m array will have equal variance. However, if F (m) represents the fraction of
the m
th
array of butterflies which involve either = 1 or W W
= j. then one can express
the average noise variance for all butterflies used in this array of computations as
The dependence of F(m) on m depends on the form of the FFT algorithm which is used. We will consider the case of a decimation in time algorithm. For this case, only W = 1 is used in the first array of computations, so F(0) = 1. Only W = 1 and W = j are used in computing array 2 from array 1, so F(1) = 1. In computing array 3 from array 2, half the butterflies involve W = 1 or W = j; in the next array ¼ of the butterflies involve W = 1 or W = j, and so on. Summarizing, we have

    F(m) = 1,            m = 0
         = (½)^{m-1},    m = 1, 2, ..., ν-1,    (26)

so that

    σ̄_{u_m}² = ½·σ_{u_m}²,              m = 0
             = [1 - (½)^m]·σ_{u_m}²,    m = 1, 2, ..., ν-1.    (27)

Carrying these modified noise source variances through the argument of the preceding section, the output noise-to-signal ratio becomes

    σ_E²/σ_X² = 2σ_ε²·[ν - 3/2 + (½)^{ν-1}].    (28)
As ν becomes moderately large (say ν ≥ 6), one sees that (28) and (23) predict essentially the same linear rate of increase of σ_E²/σ_X² with ν.

One further result which can be derived using our model is an expression for the final expected output noise-to-signal ratio which results after performing an FFT and an inverse FFT on a white signal x(n). The inverse FFT introduces just as much roundoff noise as the FFT itself, and thus one can convince oneself that the resulting output noise-to-signal ratio is

    σ_E²/σ_x² = 4σ_ε²·[ν - 3/2 + (½)^{ν-1}].    (29)
Experimental Verification

The results of the above analysis of FFT roundoff noise, as summarized in (28), (29), and (24), have been verified experimentally with excellent agreement. To check (28), a white noise sequence (composed of uniformly distributed random variables) was generated and transformed twice, once using rounded arithmetic with a short (e.g., 12 bit) mantissa, and once using a much longer (27 bit) mantissa. A decimation in time FFT algorithm was used. The results were subtracted, squared, and averaged to estimate the noise variance. For each N = 2^ν, this process was repeated for several white noise inputs to obtain a stable estimate of roundoff noise variance. The results, as a function of ν, are represented by the small circles on Fig. 2, which also displays the theoretical curve of (28).
To check (29), white noise sequences were put through an FFT and inverse, and the mean-squared difference between the initial and final sequences was taken. The results of this experiment (divided by a factor of 2, since (29) is twice (28)) are also plotted on Fig. 2.
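The two-precision experiment just described can be sketched in a few dozen lines. This is not the authors' program: it uses a recursive decimation in time FFT, rounds only the butterfly product and sums to a t-bit mantissa, and so reproduces only the qualitative behavior — a noise-to-signal ratio that grows slowly with ν.

```python
import cmath
import math
import random

def round_t(x, t):
    """Round a real number to a t-bit mantissa."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    s = 2.0 ** (t - 1 - e)
    return round(x * s) / s

def c_round(z, t):
    return complex(round_t(z.real, t), round_t(z.imag, t))

def fft(x, t=None):
    """Radix-2 decimation in time FFT; if t is given, round after each
    butterfly multiply and add to simulate a t-bit mantissa."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], t), fft(x[1::2], t)
    out = [0j] * n
    for k in range(n // 2):
        prod = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        if t is not None:
            prod = c_round(prod, t)
        a, b = even[k] + prod, even[k] - prod
        if t is not None:
            a, b = c_round(a, t), c_round(b, t)
        out[k], out[k + n // 2] = a, b
    return out

random.seed(3)
t = 12
ratios = []
for nu in (5, 6, 7):
    n = 2 ** nu
    num = den = 0.0
    for _ in range(20):   # average over several white noise inputs
        x = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
        exact = fft(x)
        rounded = fft(x, t)
        num += sum(abs(a - b) ** 2 for a, b in zip(exact, rounded))
        den += sum(abs(a) ** 2 for a in exact)
    ratios.append(num / den)
```

With the rounding switched to truncation, the same harness reproduces the much faster error growth reported below.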
To clarify the experimental procedure used, we should define carefully the convention used to round the results of floating point additions and multiplications. The results were rounded to the closest (t-bit mantissa) machine number, and if a result (say of an addition) lay midway between two machine numbers, a random choice was made as to whether to round up or down. If one, for example, merely truncates the results to t bits, the experimental noise-to-signal ratios have been observed to be significantly higher than in Fig. 2, and to increase more than linearly with ν. Sample results (to be compared with Fig. 2) of performing the first of the experiments described above, using truncation rather than rounding, are as follows: for ν = 7, 8, 9, 10, and 11,

For ν = 11, for example, this represents an increase by a factor of 32 over the result obtained using rounding. This increased output noise can be partially explained by the fact that truncation introduces a correlation between signal and noise, in that the sign of the truncation error depends on the sign of the signal being truncated.
Some experimental investigation has been carried out as to whether the predictions of (28) and (29) are anywhere near valid when the signal is non-white. Specifically, sinusoidal signals of several frequencies were put through the experiment corresponding to (28), for ν = 8, 9, 10, and 11. The results, averaged over the input frequencies used, were within 15% of those predicted by (28).

A linear scale is chosen for the vertical axis of Fig. 2, in order to display the essentially linear dependence of output noise-to-signal ratio on ν = log₂N. To evaluate how many bits of noise are actually represented by the curve of Fig. 2, or equivalently by Eq. (28), one can use the expression
(30)

to represent the number of bits by which the rms noise-to-signal ratio increases in passing through a floating point FFT. For example, for ν = 8, this represents 1.89 bits, and for ν = 11, 2.12 bits. One can use (30) to decide on a suitable register length for performing the computation.

According to (30), the number of bits of rms noise-to-signal ratio increases essentially as log₂(log₂N), so that doubling the number of points in the FFT produces a very mild increase in output noise, significantly less than the ½ bit per stage increase predicted and observed by Welch [6] for fixed point computation. In fact, to obtain a ½ bit increase in the result (30), one would essentially have to double ν = log₂N, or square N.
Summary and Discussion

A statistical model has been used to predict output noise-to-signal ratio in a floating point FFT computation, and the result has been verified experimentally. The essential result is (see (23) and (24))

    σ_E²/σ_X² = 2ν σ_ε² = (0.21)·2^{-2t}·2ν;    (31)

that is, the ratio of output noise variance to output signal variance is proportional to ν = log₂N; actually the slightly modified result (28) was used for comparison with experiment.
In order to carry out the analysis, it was necessary to assume very simple (i.e., white) statistics for the signal. A question of importance is whether our result gives a reasonable prediction of output noise when the signal is not white. A few experiments with sinusoidal signals seem to indicate that it does, but further work along these lines would be useful.

It was found that the analysis, and in particular the linear dependence on ν in (31), checked closely with experiment only when rounded arithmetic was used. Some results for truncated arithmetic, showing the greater than linear increase of σ_E²/σ_X² with ν, have been given. In rounding, it was found to be important that a random choice be made as to whether to round up or down when an unrounded result lay exactly midway between two machine numbers. When, for example, results were simply rounded up in this midway situation, a greater than linear increase of σ_E²/σ_X² with ν was observed. Such a rounding procedure, it seems, introduces enough correlation between roundoff noise and signal to make the experimental results deviate noticeably from the predictions of our model, which assumed signal and noise to be uncorrelated.
ACKNOWLEDGEMENT
REFERENCES
1. W. T. Cochran et al., "What is the fast Fourier transform?," Proc. IEEE, vol. 55, pp. 1664-1674, October 1967.
FIGURE CAPTIONS

Fig. 2. Output noise-to-signal ratio (output noise variance / output signal variance) as a function of ν = log₂N: theoretical curve, experimental results for the FFT, and experimental results for the FFT followed by the inverse FFT (result ÷ 2).
Published in Mathematics of Computation, Vol. 19, April 1965, pp. 297-301.

An Algorithm for the Machine Calculation of Complex Fourier Series

By J. W. Cooley and J. W. Tukey
An efficient method for the calculation of the interactions of a 2^m factorial experiment was introduced by Yates and is widely known by his name. The generalization to 3^m was given by Box et al. [1]. Good [2] generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series. In their full generality, Good's methods are applicable to certain problems in which one must multiply an N-vector by an N × N matrix which can be factored into m sparse matrices, where m is proportional to log N. This results in a procedure requiring a number of operations proportional to N log N rather than N². These methods are applied here to the calculation of complex Fourier series. They are useful in situations where the number of data points is, or can be chosen to be, a highly composite number. The algorithm is here derived and presented in a rather different form. Attention is given to the choice of N. It is also shown how special advantage can be obtained in the use of a binary computer with N = 2^m and how the entire calculation can be performed within the array of N data storage locations used for the given Fourier coefficients.
Consider the problem of calculating the complex Fourier series

    X(j) = Σ_{k=0}^{N-1} A(k)·W^{jk},    j = 0, 1, ..., N-1,    (1)

where the given Fourier coefficients A(k) are complex and W is the principal Nth root of unity,

    W = e^{2πi/N}.    (2)

A straightforward calculation using (1) would require N² operations, where "operation" means, as it will throughout this note, a complex multiplication followed by a complex addition.
The algorithm described here iterates on the array of given complex Fourier amplitudes and yields the result in less than 2N log₂N operations without requiring more data storage than is required for the given array A. To derive the algorithm, suppose N is composite, i.e., N = r₁·r₂. Then let the indices in (1) be expressed as

    j = j₁r₁ + j₀,    j₀ = 0, 1, ..., r₁ - 1,    j₁ = 0, 1, ..., r₂ - 1,    (3)
    k = k₁r₂ + k₀,    k₀ = 0, 1, ..., r₂ - 1,    k₁ = 0, 1, ..., r₁ - 1.    (4)

Received August 17, 1964. Research in part at Princeton University under the sponsorship of the Army Research Office (Durham). The authors wish to thank Richard Garwin for his essential role in communication and encouragement.
Since

    W^{jk} = W^{jk₁r₂}·W^{jk₀}    and    W^{jk₁r₂} = W^{j₀k₁r₂},    (5)

the inner sum, over k₁, depends only on j₀ and k₀ and can be defined as a new array,

    A₁(j₀, k₀) = Σ_{k₁} A(k₁, k₀)·W^{j₀k₁r₂}.    (6)

The result can then be written

    X(j₁, j₀) = Σ_{k₀} A₁(j₀, k₀)·W^{(j₁r₁+j₀)k₀}.    (7)

There are N elements in the array A₁, each requiring r₁ operations, giving a total of Nr₁ operations to obtain A₁. Similarly, it takes Nr₂ operations to calculate X from A₁. Therefore, this two-step algorithm, given by (6) and (7), requires a total of

    T = N(r₁ + r₂)    (8)

operations.
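The two-step algorithm of (6) and (7) can be sketched directly. The code below is an illustration, not the authors' program; it checks the N(r₁ + r₂)-operation decomposition for N = 12 = 3·4 against direct evaluation of (1).

```python
import cmath

def dft_direct(a):
    """Direct evaluation of X(j) = sum_k A(k) W^{jk}, W = exp(2*pi*i/N)."""
    n = len(a)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(a[k] * w ** (j * k) for k in range(n)) for j in range(n)]

def dft_two_step(a, r1, r2):
    """Two-step algorithm of (6) and (7) for N = r1*r2."""
    n = r1 * r2
    w = cmath.exp(2j * cmath.pi / n)
    # inner sums (6): A1(j0, k0) = sum_{k1} A(k1*r2 + k0) * W^{j0*k1*r2}
    a1 = [[sum(a[k1 * r2 + k0] * w ** (j0 * k1 * r2) for k1 in range(r1))
           for k0 in range(r2)] for j0 in range(r1)]
    # outer sums (7): X(j1*r1 + j0) = sum_{k0} A1(j0, k0) * W^{(j1*r1+j0)*k0}
    out = [0j] * n
    for j1 in range(r2):
        for j0 in range(r1):
            j = j1 * r1 + j0
            out[j] = sum(a1[j0][k0] * w ** (j * k0) for k0 in range(r2))
    return out

a = [complex(i, -i) for i in range(12)]   # N = 12 = 3 * 4
err = max(abs(x - y) for x, y in zip(dft_direct(a), dft_two_step(a, 3, 4)))
```

The two results agree to machine precision, while the two-step form uses N(r₁ + r₂) = 84 operations instead of N² = 144.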
It is easy to see how successive applications of the above procedure, starting with its application to (6), give an m-step algorithm requiring

    (9)    T = N(r₁ + r₂ + ⋯ + r_m)

operations, where

    (10)    N = r₁·r₂ ⋯ r_m.

If r_j = s_j·t_j with s_j, t_j > 1, then s_j + t_j < r_j unless s_j = t_j = 2, when s_j + t_j = r_j. In general, then, using as many factors as possible provides a minimum of (9), but factors of 2 can be combined in pairs without loss. If we are able to choose N to be highly composite, we may make very real gains. If all r_j are equal to r, then, from (10) we have

    (11)    m = log_r N

and the total number of operations is

    (12)    T(r) = rN log_r N.
If N = r^m·s^n·t^p ⋯, then

    (13)    T/N = m·r + n·s + p·t + ⋯    and    log₂N = m·log₂r + n·log₂s + p·log₂t + ⋯

so that T/(N log₂N) is a weighted mean of the quantities r/log₂r, s/log₂s, t/log₂t, ⋯, whose values run as follows:

    r     r/log₂r
    2     2.00
    3     1.88
    4     2.00
    5     2.15
    6     2.31
    7     2.49
    8     2.67
    9     2.82
    10    3.01
The use of r_j = 3 is formally most efficient, but the gain is only about 6% over the use of 2 or 4, which have other advantages. If necessary, the use of r_j up to 10 can increase the number of computations by no more than 50%. Accordingly, we can find "highly composite" values of N within a few percent of any given large number.

Whenever possible, the use of N = r^m with r = 2 or 4 offers important advantages for computers with binary arithmetic, both in addressing and in multiplication economy.
The algorithm with r = 2 is derived by expressing the indices in the form

    (14)    j = j_{m-1}·2^{m-1} + ⋯ + j₁·2 + j₀,
            k = k_{m-1}·2^{m-1} + ⋯ + k₁·2 + k₀,

where j_r and k_r are equal to 0 or 1 and are the contents of the respective bit positions in the binary representation of j and k. All arrays will now be written as functions of the bits of their indices. With this convention (1) is written

    (15)    X(j_{m-1}, ..., j₁, j₀) = Σ_{k₀} Σ_{k₁} ⋯ Σ_{k_{m-1}} A(k_{m-1}, ..., k₁, k₀)·W^{jk}.
Proceeding to the next innermost sum, over k_{m-2}, and so on, and using

    (18)

one obtains successive arrays,

    (19)    A_l(j₀, ..., j_{l-1}, k_{m-l-1}, ..., k₀)
                = Σ_{k_{m-l}=0}^{1} A_{l-1}(j₀, ..., j_{l-2}, k_{m-l}, ..., k₀)·W^{(j_{l-1}·2^{l-1}+⋯+j₀)·k_{m-l}·2^{m-l}}

for l = 1, 2, ..., m.

Writing out the sum this appears as

    (20)    A_l(j₀, ..., j_{l-1}, k_{m-l-1}, ..., k₀)
                = A_{l-1}(j₀, ..., j_{l-2}, 0, k_{m-l-1}, ..., k₀)
                + (-1)^{j_{l-1}}·i^{j_{l-2}}·A_{l-1}(j₀, ..., j_{l-2}, 1, k_{m-l-1}, ..., k₀)·W^{(j_{l-3}·2^{l-3}+⋯+j₀)·2^{m-l}},
            j_{l-1} = 0, 1.

According to the indexing convention, this is stored in a location whose index is

    (21)    j₀·2^{m-1} + ⋯ + j_{l-1}·2^{m-l} + k_{m-l-1}·2^{m-l-1} + ⋯ + k₀.

It can be seen in (20) that only the two storage locations with indices having 0 and 1 in the 2^{m-l} bit position are involved in the computation. Parallel computation is permitted since the operation described by (20) can be carried out with all values of j₀, ..., j_{l-2}, and k₀, ..., k_{m-l-1} simultaneously. In some applications* it is convenient to use (20) to express A_l in terms of A_{l-2}, giving what is equivalent to an algorithm with r = 4.

The last array calculated gives the desired Fourier sums,

    (22)    X(j_{m-1}, ..., j₁, j₀) = A_m(j₀, j₁, ..., j_{m-1}),

in such an order that the index of an X must have its binary bits put in reverse order to yield its index in the array A_m.
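The in-place binary algorithm can be illustrated with a small sketch. This is a modern restatement, not the original program; the butterfly below is one conventional in-place form whose output is left in bit-reversed order, as in (22).

```python
import cmath

def dft_direct(a):
    """Direct evaluation of X(j) = sum_k A(k) W^{jk}, W = exp(2*pi*i/N)."""
    n = len(a)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(a[k] * w ** (j * k) for j_, k in ((j, k) for k in range(n)))
            for j in range(n)] if False else \
           [sum(a[k] * w ** (j * k) for k in range(n)) for j in range(n)]

def fft_in_place_bitrev(a):
    """In-place FFT of X(j) = sum_k A(k) W^{jk}; each stage touches only the
    pair of locations differing in one bit, and the result is left in
    bit-reversed index order."""
    a = list(a)
    n = len(a)
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for i in range(span):
                w = cmath.exp(2j * cmath.pi * i / (2 * span))
                u, v = a[start + i], a[start + i + span]
                a[start + i] = u + v
                a[start + i + span] = (u - v) * w
        span //= 2
    return a

def bit_reverse(i, m):
    r = 0
    for _ in range(m):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

n, m = 16, 4
a = [complex(k % 5, (k * k) % 7) for k in range(n)]
br = fft_in_place_bitrev(a)
unscrambled = [br[bit_reverse(j, m)] for j in range(n)]
err = max(abs(x - y) for x, y in zip(dft_direct(a), unscrambled))
```

Reading the array back through `bit_reverse` recovers the Fourier sums in natural order, confirming the bit-inversion property.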
In some applications, where Fourier sums are to be evaluated twice, the above procedure could be programmed so that no bit-inversion is necessary. For example, consider the solution of the difference equation,

    (23)    aX(j + 1) + bX(j) + cX(j - 1) = F(j).

The present method could be first applied to calculate the Fourier amplitudes B(k) of F(j) from the formula

    (24)

Then the Fourier amplitudes of the solution are

    (25)    A(k) = B(k)/(aW^k + b + cW^{-k}).

The B(k) and A(k) arrays are in bit-inverted order, but with an obvious modification of (20), A(k) can be used to yield the solution with correct indexing.
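The application in (23)-(25) can be sketched for hypothetical coefficients a, b, c; the bit-reversal bookkeeping is omitted here by using direct transforms.

```python
import cmath

def dft(a, sign):
    """X(j) = sum_k a(k) * exp(sign * 2*pi*i*j*k/N)."""
    n = len(a)
    return [sum(a[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                for k in range(n)) for j in range(n)]

# periodic difference equation a*X(j+1) + b*X(j) + c*X(j-1) = F(j), Eq. (23)
a_c, b_c, c_c = 1.0, -2.5, 1.0          # hypothetical coefficients
n = 8
f = [cmath.sin(2 * cmath.pi * j / n) + 0.3 for j in range(n)]

bk = [v / n for v in dft(f, -1)]        # Fourier amplitudes B(k) of F
w = cmath.exp(2j * cmath.pi / n)
ak = [bk[k] / (a_c * w ** k + b_c + c_c * w ** (-k))   # Eq. (25)
      for k in range(n)]
x = dft(ak, +1)                         # synthesize the solution X(j)

# check the difference equation at every point (indices taken mod n)
resid = max(abs(a_c * x[(j + 1) % n] + b_c * x[j] + c_c * x[(j - 1) % n] - f[j])
            for j in range(n))
```

Note that the chosen coefficients keep the denominator aW^k + b + cW^{-k} = 2 cos(2πk/n) - 2.5 away from zero for all k.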
A computer program for the IBM 7094 has been written which calculates three-dimensional Fourier sums by the above method. The computing time taken for computing three-dimensional 2^a × 2^b × 2^c arrays of data points was as follows:

    a     b    c    No. Pts.    Time (minutes)
    4     4    3    2^11        .02
    11    0    0    2^11        .02
    4     4    4    2^12        .04
    12    0    0    2^12        .07
    5     4    4    2^13        .10
    5     5    3    2^13        .12
    13    0    0    2^13        .13
Princeton University
Princeton, New Jersey
IEEE TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS, VOL. AU-17, NO. 2, JUNE 1969

I. Introduction
z-plane values of k merely repeat the same N values of z_k, which are the Nth roots of unity. The discrete Fourier transform has assumed considerable importance, partly because of its nice properties, but mainly because since 1965 it has become widely known that the computation of (6) can be achieved, not in the N² complex multiplications and additions called for by direct application of (6), but in something of the order of N log₂N operations if N is a power of two, or N·Σm_i operations if the integers m_i are the prime factors of N. Any algorithm which accomplishes this is called an FFT. Much of the importance of the FFT is that the DFT may be used as a stepping stone to computing lagged products such as convolutions, autocorrelations, and cross correlations more rapidly than before [3], [4]. The DFT has, however, some limitations which can be eliminated using the CZT algorithm which we will describe. We shall investigate the computation of the z-transform on a more general contour, of the form

    z_k = A·W^{-k},    k = 0, 1, ..., M - 1,    (7a)

where M is an arbitrary integer and both A and W are arbitrary complex numbers of the form

    A = A₀e^{j2πθ₀}    (7b)

and

    W = W₀e^{j2πφ₀}.    (7c)

(See Fig. 2.)

Fig. 1. The correspondence of (A) a z-plane contour to (B) an s-plane contour through the relation z = e^{sT}.

The case A = 1, M = N, and W = exp(-j2π/N) corresponds to the DFT. The general z-plane contour begins with the point z = A and, depending on the value of W, spirals in or out with respect to the origin. If W₀ = 1, the contour is an arc of a circle. The angular spacing of the samples is 2πφ₀. The equivalent s-plane contour begins with the point

    s₀ = σ₀ + jω₀ = (1/T) ln A    (8)

and the general point on the s-plane contour is

    s_k = s₀ - (k/T) ln W = (1/T)(ln A - k ln W).

Values of the z-transform are usually computed along the path corresponding to the jω axis, namely the unit circle. This gives the discrete equivalent of the Fourier transform and has many applications including the estimation of spectra, filtering, interpolation, and correlation. The applications of computing z-transforms off the unit circle are fewer, but one is presented elsewhere [6], namely the enhancement of spectral resonances in systems
Fig. 2. An illustration of the independent parameters of the CZT algorithm. (A) How the z-transform is evaluated on a spiral contour starting at the point z = A. (B) The corresponding straight line contour and independent parameters in the s-plane.

similar waveform used in some radar systems has the picturesque name "chirp," we call the algorithm we are about to present the chirp z-transform (CZT). Since the CZT permits computing the z-transform on a more general contour than the FFT permits, it is more flexible than the FFT, although it is also considerably slower. The additional freedoms offered by the CZT include the following:

1) The number of time samples does not have to equal the number of samples of the z-transform.

2) Neither M nor N need be a composite integer.

3) The angular spacing of the z_k is arbitrary.

4) The contour need not be a circle but can spiral in or out with respect to the origin. In addition, the point z₀ is arbitrary, but this is also the case with the FFT if the samples x_n are multiplied by z₀^{-n} before transforming.

II. Derivation of the CZT

Along the contour of (7a), (4) becomes

    X_k = Σ_{n=0}^{N-1} x_n A^{-n} W^{nk},    k = 0, 1, ..., M - 1,    (10)

which, at first appearance, seems to require NM complex multiplications and additions, as we have already observed. Using the substitution

    nk = [n² + k² - (k - n)²]/2,    (11)

we can write (10) as

    X_k = Σ_{n=0}^{N-1} x_n A^{-n} W^{n²/2} W^{k²/2} W^{-(k-n)²/2},    k = 0, 1, ..., M - 1;    (12)

but, in fact, (12) can be thought of as a three-step process consisting of:

1) forming a new sequence y_n by weighting the x_n according to the equation

    y_n = x_n A^{-n} W^{n²/2},    n = 0, 1, ..., N - 1;    (13)

2) convolving y_n with the sequence v_n defined as

    v_n = W^{-n²/2}    (14)

to give a sequence g_k,

    g_k = Σ_{n=0}^{N-1} y_n v_{k-n},    k = 0, 1, ..., M - 1;    (15)

3) multiplying g_k by W^{k²/2} to give X_k,

    X_k = g_k W^{k²/2},    k = 0, 1, ..., M - 1.    (16)

The three-step process is illustrated in Fig. 3. Steps 1) and 3) require N and M multiplications, respectively, and step 2) is a convolution which may be computed by the high-speed technique disclosed by Stockham [3], based on the use of the FFT. Step 2) is the major part of the computational effort and requires a time roughly proportional to (N + M) log (N + M).

Bluestein [5] employed the substitution of (11) to convert a DFT to a convolution as in Fig. 3. The linear system to which the convolution is equivalent can be called a chirp filter which is, in fact, also sometimes used to resolve a spectrum. Bluestein showed that for N a perfect square, the chirp filter could be synthesized recursively with √N multipliers and the computation of a DFT could then be proportional to N^{3/2}.

The flexibility and speed of the CZT algorithm are related to the flexibility and speed of the method of high-speed convolution using the FFT. The reader should recall that the product of the DFT's of two sequences is the DFT of the circular convolution of the two sequences.
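The three-step process of (13)-(16) can be checked with a naive (non-FFT) convolution; in the sketch below the contour parameters are hypothetical and the function names are ours.

```python
import cmath

def czt_three_step(x, m_out, a, w):
    """Steps (13)-(16): weight, convolve with W^{-n^2/2}, weight again."""
    n = len(x)
    y = [x[i] * a ** (-i) * w ** (i * i / 2.0) for i in range(n)]    # (13)
    g = [sum(y[i] * w ** (-((k - i) ** 2) / 2.0) for i in range(n))  # (15)
         for k in range(m_out)]
    return [g[k] * w ** (k * k / 2.0) for k in range(m_out)]         # (16)

def z_transform_direct(x, m_out, a, w):
    """X_k = sum_n x_n z_k^{-n} with z_k = A W^{-k}, i.e. Eq. (10)."""
    return [sum(xn * (a * w ** (-k)) ** (-i) for i, xn in enumerate(x))
            for k in range(m_out)]

x = [complex(1, 0), complex(0, 2), complex(-1, 1), complex(2, -1)]
a = 1.1 * cmath.exp(0.2j)        # hypothetical spiral contour parameters
w = 0.98 * cmath.exp(-0.5j)
m_out = 6
err = max(abs(p - q) for p, q in
          zip(czt_three_step(x, m_out, a, w), z_transform_direct(x, m_out, a, w)))
```

The two evaluations agree to machine precision, confirming the algebraic identity (11) behind the factorization.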
Begin with a waveform in the form of N samples x_n and seek M samples of X_k where A and W have also been chosen.

1) Choose L, the smallest integer greater than or equal to N + M - 1 which is also compatible with our high-speed FFT program; for most users this will mean L is a power of two. Note that while many FFT programs will work for arbitrary L, they are not equally efficient for all L. At the very least, L should be highly composite.

2) Form an L point sequence y_n from x_n by

    y_n = A^{-n} W^{n²/2} x_n,    n = 0, 1, 2, ..., N - 1
        = 0,    n = N, N + 1, ..., L - 1.    (17)

3) Compute the L point DFT of y_n by the FFT. Call this Y_r, r = 0, 1, ..., L - 1.

4) Define an L point sequence v_n by the relation

    v_n = W^{-n²/2},    0 ≤ n ≤ M - 1
        = W^{-(L-n)²/2},    L - N + 1 ≤ n < L
        = arbitrary,    other n, if any.    (18)

Of course, if L is exactly equal to M + N - 1, the region in which v_n is arbitrary will not exist. If the region does exist, an obvious possibility is to increase M, the desired number of points of the z-transform we compute, until the region does not exist.

Note that v_n could be cut into two with a cut between n = M - 1 and n = L - N + 1, and, if the pieces were abutted together the other way, the resulting sequence would be a slice out of the indefinite length sequence W^{-n²/2}. This is illustrated in Fig. 4. The sequence v_n is defined the way it is in order to force the circular convolution to give us the desired convolution of (15).
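Steps 1)-4) can be completed by multiplying the two transforms, inverse transforming, and weighting by W^{k²/2}, following the high-speed convolution recipe described earlier; the sketch below is our own completion along those lines (with hypothetical parameters), checked against direct evaluation of (10).

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT (length a power of two); a stand-in for the
    high-speed FFT program assumed in the text."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + tw, even[k] - tw
    return out

def ifft(x):
    n = len(x)
    return [v.conjugate() / n for v in fft([z.conjugate() for z in x])]

def czt(x, m_out, a, w):
    """CZT via steps 1)-4) plus the high-speed convolution completion."""
    n = len(x)
    L = 1                      # step 1: L = power of two >= N + M - 1
    while L < n + m_out - 1:
        L *= 2
    # step 2: weighted, zero-padded sequence y_n of (17)
    y = [a ** (-i) * w ** (i * i / 2.0) * x[i] if i < n else 0j
         for i in range(L)]
    # step 4: chirp filter sequence v_n of (18); "arbitrary" region set to 0
    v = [0j] * L
    for i in range(m_out):
        v[i] = w ** (-(i * i) / 2.0)
    for i in range(L - n + 1, L):
        v[i] = w ** (-((L - i) ** 2) / 2.0)
    # step 3 and completion: multiply the DFT's, invert, weight by W^{k^2/2}
    g = ifft([p * q for p, q in zip(fft(y), fft(v))])
    return [g[k] * w ** (k * k / 2.0) for k in range(m_out)]

x = [cmath.exp(0.3j * i) + 0.5 * i for i in range(5)]   # N = 5 samples
a = cmath.exp(0.1 + 0.2j)                               # hypothetical A
w = cmath.exp(-0.05 - 0.7j)                             # hypothetical W
m_out = 7
direct = [sum(x[i] * a ** (-i) * w ** (i * k) for i in range(len(x)))
          for k in range(m_out)]
err = max(abs(p - q) for p, q in zip(czt(x, m_out, a, w), direct))
```

Here N = 5 and M = 7, neither composite in any useful way for a radix-2 FFT, yet the computation uses only power-of-two transforms — the flexibility claimed for the CZT.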
data, the CZT does not require M = N. Furthermore, the z_k need not stretch over the entire unit circle but can be equally spaced along an arc. Let us assume, however, that we are really interested in computing the N point DFT of N data points. Still the CZT permits us to choose any value of N, highly composite, somewhat composite, or even prime, without strongly affecting the computation time. An important application of the CZT may be computing DFT's when N is not a power of two and when the program or special-purpose device available for computing DFT's by FFT is limited to the case of N a power of two.

There is also no reason why the CZT cannot be extended to the case of transforms in two or more dimensions with similar considerations. The two-dimensional DFT becomes a two-dimensional convolution which is computable by FFT techniques.

We caution the reader to note that for the ordinary FFT the starting point of the contour is still arbitrary; merely multiply the waveform x_n by A^{-n} before using the FFT, and the first point on the contour is effectively moved from z = 1 to z = A. However, the contour is still restricted to a circle concentric with the origin. The angular spacing of the z_k for the FFT can also be controlled to some extent by appending zeroes to the end of x_n before computing the DFT (to decrease the angular spacing of the z_k), or by choosing only P of the N points x_n and adding together all the x_n for which the n are congruent modulo P, i.e., wrapping the waveform around a cylinder and adding together the pieces which overlap (to increase the angular spacing).

IV. Limitations

One limitation in using the CZT algorithm to evaluate the z-transform off the unit circle stems from the fact that we may be required to compute W₀^{±n²/2} for large n. If W₀ differs very much from 1.0, W₀^{±n²/2} can become very large or very small when n becomes large. (We require a large n when either M or N become large, since we need to evaluate W^{n²/2} for n in the range -N < n < M.) For example, if W₀ ≈ 0.999749 and n = 1000, W₀^{±n²/2} ≈ e^{±125.7}, which exceeds the single precision floating point capability of most computers by a large amount. Hence the tails of the functions W^{±n²/2} can be greatly in error, thus causing the tails of the convolution (the high frequency terms) to be grossly inaccurate. The low frequency terms of the convolution will also be slightly in error, but these errors are negligible in general.

The limitation on contour distance in or out from the unit circle is again due to computation of W^{±n²/2}. As W₀ deviates significantly from 1.0, the number of points for which W^{±n²/2} can be accurately computed decreases. It is of importance to stress, however, that for W₀ = 1, there is no limitation of this type since W^{±n²/2} is always of magnitude 1.

The other main limitation on the CZT algorithm stems from the fact that two L point, and one L/2 point, FFT's must be evaluated, where L is the smallest convenient integer greater than N + M - 1, as mentioned previously. We need one FFT and 2L storage locations for the transform of x_n A^{-n} W^{n²/2}; one FFT and L + 2 storage locations for the transform of W^{-n²/2}; and one FFT for the inverse transform of the product of these two transforms. We do not know a way of computing the transform of W^{-n²/2} either recursively or by a specific formula (except in some trivial cases). Thus we must compute this transform and store it in an extra L + 2 storage locations. Of course, if many transforms are to be done with the same value of L, we need not compute the transform of W^{-n²/2} each time.

We can compute the quantities A^{-n} W^{n²/2} recursively as they are needed to save computation and storage. This is easily seen from the fact that

    A^{-(n+1)} W^{(n+1)²/2} = (A^{-n} W^{n²/2})·W^n·W^{1/2}·A^{-1}.    (19)

If we define

    C_n = A^{-n} W^{n²/2}    (20)

and

    D_n = W^n·W^{1/2}·A^{-1},    (21)

then

    D_{n+1} = W·D_n    (22)

and

    C_{n+1} = C_n·D_n.    (23)

Setting A = 1 in (19) to (23) provides an algorithm for the coefficients required for the output sequence. A similar recursion formula can be obtained for generating the sequence A^{-n} W^{(n-N)²/2}. The user is cautioned that recursive computation of these coefficients may be a major source of numerical error, especially when W₀ ≠ 1 or φ₀ ≠ 0.

V. Summary

A computational algorithm for numerically evaluating the z-transform of a sequence of N time samples was presented. This algorithm, entitled the chirp z-transform algorithm, enables the evaluation of the z-transform at M equi-angularly spaced points on contours which spiral in or out (circles being a special case) from an arbitrary starting point in the z-plane. In the s-plane the equivalent contour is an arbitrary straight line.

The CZT algorithm has great flexibility in that neither N nor M need be composite numbers; the output point spacing is arbitrary; the contour is fairly general; and N need not be the same as M. The flexibility of the CZT algorithm is due to being able to express the z-transform on the above contours as a convolution, permitting the use of well-known high-speed convolution techniques to evaluate the convolution.

Applications of the CZT algorithm include enhancement of poles for use in spectral analysis; high resolution, narrowband frequency analysis; and time interpolation of data from one sampling rate to any other sampling rate. These applications are explained in detail elsewhere [6]. The CZT algorithm also permits use of a radix-2 FFT program or device to compute the DFT of an arbitrary
number of samples. Examples illustrating how the CZT
algorithm is used in specific cases are included elsewhere
[6]. It is anticipated that other applications of the CZT
algorithm will be found.
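As a concrete illustration of expressing the DFT on such a contour as a convolution, the sketch below evaluates an N-point DFT for arbitrary N with radix-2 FFTs, in the manner of the CZT restricted to the unit circle (W0 = 1). The function names (`fft_pow2`, `czt_dft`) are illustrative, not from the paper.

```python
import cmath

def fft_pow2(a, sign=-1):
    # Recursive radix-2 FFT; len(a) must be a power of 2.
    n = len(a)
    if n == 1:
        return list(a)
    even = fft_pow2(a[0::2], sign)
    odd = fft_pow2(a[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def czt_dft(x):
    # N-point DFT for arbitrary N, using nk = (n^2 + k^2 - (k-n)^2)/2:
    # X_k = C_k * sum_n (x_n C_n) conj(C_{k-n}),  C_n = exp(-i*pi*n^2/N),
    # a circular convolution done with radix-2 FFTs of length L >= 2N-1.
    N = len(x)
    chirp = [cmath.exp(-1j * cmath.pi * n * n / N) for n in range(N)]
    L = 1
    while L < 2 * N - 1:
        L *= 2
    a = [x[n] * chirp[n] for n in range(N)] + [0j] * (L - N)
    b = [0j] * L
    for n in range(N):
        b[n] = chirp[n].conjugate()        # h_n for n = 0 .. N-1
    for n in range(1, N):
        b[L - n] = chirp[n].conjugate()    # h_{-n} wrapped to index L-n
    fa, fb = fft_pow2(a), fft_pow2(b)
    conv = fft_pow2([fa[i] * fb[i] for i in range(L)], sign=+1)
    return [chirp[k] * conv[k] / L for k in range(N)]
```

This is only a sketch of the idea; a production CZT would also handle the spiral contour (A and W parameters) and the W^(±n²/2) precision issues discussed above.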
Appendix
for k= 1, 2, .. ., L/2-1. The remaining values of Xi
The purpose of this Appendix is to show how the and Y. are obtained from the relations
FFT's of two real, symmetric L point sequences can be
L-I
obtained using one L/2 point FFT.
Let x. and y" be two real, symmetric L point sequences
Xo = LX.
"-0
with corresponding DFT's Xk and Yk• By definition, L-I
Xk = XL k YL/I - Ly.(-I)".
0, 1,2, ..., L - 1.
_
"-0
k =
Reference,
Define a complex L/2 point sequence u. whose real and [I) J. W. Cooley and J. W. Tukey, "An algorithm fOI the machine
imaginary parts are calculation of complex Fourier series," Math. Comp., vol. 19,
from Uk using the relations (4) H. D. Helms, "Fast Fourier transform method of computing
difference equations and simulating filters," IEEE TrailS.
AudiO alld Electroacoustics, vol. AU-IS, pp. 85-90, June 1967.
Xk = H He [Uk] + He [(h/2-dl (5) L. I. Bluestein, "A linear filtering approach to the computation
1 of the discrete Fourier transform," /968 NEREM Rec., pp.
--- {Re [Uk] - Re [uL,2-dl 218-219.
271" (6) L. R. Rabiner, R. W. Schafer, and C. M. Rader, "The chirp
4 sin-k z-transform algorithm and its applications," Bell Sys. Tech. J.,
L vol. 48, pp. 1249-1292, May 1969.
157
Reprinted from Mathematics of Computation, Vol. 22, No. 102, April 1968, pp. 275-279.

A Fast Fourier Transform Algorithm Using Base 8 Iterations

By G. D. Bergland
1. Introduction. Cooley and Tukey stated in their original paper [1] that the Fast Fourier Transform algorithm is formally most efficient when the number of samples in a record can be expressed as a power of 3 (i.e., N = 3^m). Later, however, it was recognized that the symmetries of the sine and cosine weighting functions made the base 4 algorithms more efficient than either the base 2 or the base 3 algorithms [2], [3]. Making use of this observation, Gentleman and Sande have constructed an algorithm which performs as many iterations of the transform as possible in a base 4 mode and then, if required, performs the last iteration in a base 2 mode.
Although this "4 + 2" algorithm is more efficient than base 2 algorithms, it is now apparent that the techniques used by Gentleman and Sande can be profitably carried one step further to an even more efficient base 8 algorithm. The base 8 algorithms described in this paper allow one to perform as many base 8 iterations as possible and then finish the computation by performing a base 4 or a base 2 iteration if one is required. This combination preserves the versatility of the base 2 algorithm while attaining the computational advantage of the base 8 algorithm.
(4)

In some cases the total computation required to evaluate these equations can be reduced by grouping them in a slightly different manner. For N = r_1 r_2 ... r_n, this regrouping takes the form

(5)  A_p(j_0, ..., j_{p-1}, k_{n-p-1}, ..., k_0)
       = [ sum_{k_{n-p}=0}^{r_p - 1} A_{p-1}(j_0, ..., j_{p-2}, k_{n-p}, ..., k_0) W_{r_p}^{j_{p-1} k_{n-p}} ]
         * W_N^{j_{p-1} (k_{n-p-1}(r_{p+2} ... r_n) + ... + k_1 r_n + k_0)(r_1 r_2 ... r_{p-1})},
     p = 1, 2, ..., n.
Note that the bracketed term in (5) represents a set of r_p-point Fourier transforms and that the complex exponential weights outside the brackets simply rereference each set of results to a common time origin. (In Gentleman and Sande's paper this rereferencing is termed "twiddling.") The term

    W_{r_p} = W_N^{N/r_p} = e^{2*pi*i/r_p}

forms the basis for the complex exponential weights required in evaluating each r_p-point transform, and j_{p-1} and k_{n-p} are the two indices of the transform.
An analogous regrouping can be performed on the original Cooley-Tukey recursive equations. For N = r_1 r_2 ... r_n, these take the form [4]

(6)  A_p(j_0, j_1, ..., j_{p-1}, k_{n-p-1}, ..., k_0)
       = sum_{k_{n-p}=0}^{r_p - 1} A_{p-1}(j_0, j_1, ..., j_{p-2}, k_{n-p}, ..., k_0)
         * W_N^{(j_{p-1}(r_1 r_2 ... r_{p-1}) + ... + j_0) k_{n-p}(r_{p+1} ... r_n)},
     p = 1, 2, ..., n,

(7)  A_p(j_0, j_1, ..., j_{p-1}, k_{n-p-1}, ..., k_0)
       = [ sum_{k_{n-p}=0}^{r_p - 1} A_{p-1}(j_0, ..., j_{p-2}, k_{n-p}, ..., k_0) W_{r_p}^{j_{p-1} k_{n-p}} ]
         * W_N^{(j_{p-1}(r_1 r_2 ... r_{p-1}) + ... + j_1 r_1 + j_0) k_{n-p-1}(r_{p+2} ... r_n)}.

This expression is valid for p = 1, 2, ..., n provided that we define (r_{p+2} ... r_n) = 1 for p > n - 2, and define k_{-1} = 0. Note that the bracketed term of (7) is identical to the bracketed term of (5).
Each iteration of both (5) and (7) is thus conveniently divided into two steps. The first step involves performing a set of r_p-point Fourier transforms, and the second step involves use of the Fourier transform shifting theorem to rereference the resulting spectral estimates to the correct time origins. These two operations are performed during each iteration.

Of particular interest in this paper are the algorithms which result when as many of the r_p terms as possible are set to 8. When this is done, the bracketed terms represent a large number of 8-point Fourier transforms. The complex exponential weights
TABLE I
Comparison of arithmetic operations required for base 2, base 4, base 8, and base 16 algorithms.

Base 2 algorithm, for N = 2^m, m = 0, 1, 2, ...
  Evaluating (N/2)m 2-term Fourier transforms:  0 real multiplications,  2Nm real additions
  Referencing:  ((m/2 - 1)N + 1)(4) multiplications,  ((m/2 - 1)N + 1)(2) additions
  Complete analysis:  (2m - 4)N + 4 multiplications,  (3m - 2)N + 2 additions

Base 4 algorithm, for N = (2^2)^(m/2), m/2 = 0, 1, 2, ...
  Evaluating (N/4)(m/2) 4-term Fourier transforms:  0 real multiplications,  2Nm real additions
  Referencing:  ((3m/8 - 1)N + 1)(4) multiplications,  ((3m/8 - 1)N + 1)(2) additions
  Complete analysis:  (1.5m - 4)N + 4 multiplications,  (2.75m - 2)N + 2 additions

Base 8 algorithm, for N = (2^3)^(m/3), m/3 = 0, 1, 2, ...
  Evaluating (N/8)(m/3) 8-term Fourier transforms:  Nm/6 real multiplications,  13Nm/6 real additions
  Referencing:  ((7m/24 - 1)N + 1)(4) multiplications,  ((7m/24 - 1)N + 1)(2) additions
  Complete analysis:  (1.333m - 4)N + 4 multiplications,  (2.75m - 2)N + 2 additions

Base 16 algorithm, for N = (2^4)^(m/4), m/4 = 0, 1, 2, ...
  Evaluating (N/16)(m/4) 16-term Fourier transforms:  3Nm/8 real multiplications,  9Nm/4 real additions
  Referencing:  ((15m/64 - 1)N + 1)(4) multiplications,  ((15m/64 - 1)N + 1)(2) additions
  Complete analysis:  (1.3125m - 4)N + 4 multiplications,  (2.71875m - 2)N + 2 additions
required in performing these transforms are ±1, ±i, ±exp(i*pi/4), and ±exp(-i*pi/4). Since use of the first two weights requires no multiplications, and the product of a complex number and either of the last two weights requires only two real multiplications, a total of only four real multiplications is required in evaluating each 8-point Fourier transform.

Thus the weights used in evaluating an 8-point Fourier transform all have symmetries which can be profitably exploited. Considerable use of these symmetries is being made, since the base 8 algorithm forces us to compute N/8 8-point transforms during each iteration.
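The weight symmetries can be checked with a short sketch: the 8-point DFT below (an illustrative `dft8`, not code from the paper) uses one radix-2 step over two 4-point DFTs, so the only nontrivial twiddles are exp(-i*pi/4) and exp(-3i*pi/4).

```python
import math

def dft8(x):
    # 8-point DFT via one radix-2 step over two 4-point DFTs. The 4-point
    # DFTs and the final butterflies use only the weights +-1 and +-i
    # (sign changes and real/imaginary swaps); the only nontrivial twiddles
    # are exp(-i*pi/4) and exp(-3i*pi/4), whose products cost two real
    # multiplications each (by r = 1/sqrt(2)) when evaluated by hand, i.e.,
    # four real multiplications in all. Python's complex multiply is used
    # here only to verify correctness, not to model the operation count.
    def dft4(a):
        t0, t1 = a[0] + a[2], a[0] - a[2]
        t2, t3 = a[1] + a[3], a[1] - a[3]
        return [t0 + t2, t1 - 1j * t3, t0 - t2, t1 + 1j * t3]
    g = dft4(x[0::2])                        # even-indexed samples
    h = dft4(x[1::2])                        # odd-indexed samples
    r = math.sqrt(0.5)
    w = [1, r - 1j * r, -1j, -r - 1j * r]    # W8^k for k = 0..3
    X = [0j] * 8
    for k in range(4):
        t = w[k] * h[k]
        X[k] = g[k] + t
        X[k + 4] = g[k] - t
    return X
```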
Given the structure of the algorithm, we can compare the computation required by base 2, base 4, base 8, and base 16 algorithms for N being any power of 2. The number of real multiplications and real additions required for various values of m is expressed in Table I. Although these expressions are only exact for values of N which are integral powers of 2, 4, 8, and 16, respectively, they are good approximations for any integral value of m.

The real multiplications and additions required for m = 12 are given exactly by these expressions and are shown in Table II.
TABLE II
Real multiplications and additions required in performing base 2, base 4, base 8, and base 16 Fast Fourier Transform algorithms for N = 4096.

    Algorithm    Number of real multiplications    Number of real additions
    Base 2       81,924                            139,266
    Base 4       57,348                            126,978
    Base 8       49,156                            126,978
    Base 16      48,132                            125,442
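These m = 12 counts follow directly from the "complete analysis" rows of Table I; the short check below reproduces them for N = 4096, with the base-8 coefficient 1.333 read as 4/3 so the counts come out exactly.

```python
# "Complete analysis" expressions from Table I, evaluated for N = 4096
# (m = 12); the base-8 multiplication coefficient 1.333 is read as 4/3.
N, m = 4096, 12

real_mults = {
    2: (2 * m - 4) * N + 4,
    4: (1.5 * m - 4) * N + 4,
    8: (4 * m / 3 - 4) * N + 4,
    16: (1.3125 * m - 4) * N + 4,
}
real_adds = {
    2: (3 * m - 2) * N + 2,
    4: (2.75 * m - 2) * N + 2,
    8: (2.75 * m - 2) * N + 2,
    16: (2.71875 * m - 2) * N + 2,
}
for base in (2, 4, 8, 16):
    print(base, int(real_mults[base]), int(real_adds[base]))
```

For example, the base 2 algorithm requires 81,924 real multiplications and 139,266 real additions, while base 8 needs 49,156 and 126,978.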
A FAST FOURIER TRANSFORM ALGORITHM
1. J. W. Cooley & J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comp., v. 19, 1965, pp. 297-301. MR 31 #2843.
2. W. M. Gentleman & G. Sande, "Fast Fourier transforms for fun and profit," Fall Joint Computer Conference Proceedings, Vol. 29, 1966, pp. 563-578.
3. R. E. Miller & S. Winograd, private communication.
4. G. D. Bergland, "The fast Fourier transform recursive equations for arbitrary length records," Math. Comp., v. 21, 1967, pp. 236-238.
An Algorithm for Computing the Mixed Radix Fast Fourier Transform

RICHARD C. SINGLETON, Senior Member, IEEE
Stanford Research Institute
Menlo Park, Calif. 94025

Manuscript received December 2, 1968. This work was supported out of Stanford Research Institute Research and Development funds.

IEEE TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS, VOL. AU-17, NO. 2, JUNE 1969

Introduction

Computing the discrete Fourier transform

    a_k = sum_{j=0}^{n-1} x_j exp(i*2*pi*j*k/n)

for k = 0, 1, ..., n - 1, where {x_j} and {a_k} are both complex valued, by factoring n and then decomposing the transform into m steps with n/n_i transformations of size n_i within each step, is the method of Cooley and Tukey [1]. Most subsequent authors have directed their attention to the special case of n = 2^m. ... where F_i is the transform step corresponding to the factor n_i of n and P is a permutation matrix. The matrix F_i has only n_i nonzero elements in each row and column and
can be partitioned into n/n_i square submatrices of dimension n_i; it is this partition and the resulting reduction in multiplications that is the basis for the FFT algorithm. The matrices F_i can be further factored to yield a diagonal matrix of rotation factors and a transform step T_i; the diagonal matrix has diagonal elements

    r_j = exp{ i (2*pi/kk) (j mod k) [(j mod kk)/k] }

for j = 0, 1, ..., n - 1, where

    k = n/(n_1 n_2 ... n_i)  and  kk = n_i k,

and the square brackets [ ] denote the greatest integer <= the enclosed quantity. The rotation factors multiplying each transform of dimension n_i within T_i have angles 0, theta, 2*theta, ..., (n_i - 1)*theta.

The reordering permutation may be performed in place by pair interchanges if n is factored so that n_i = n_{m+1-i} for i < m - i. In this case, we can count j in natural order and j' in digit-reversed order, e.g., for seven factors,

    j' = j_1 n_6 n_5 ... n_1 + j_2 n_5 n_4 ... n_1 + ... + j_6 n_1 + j_7,

then exchange a_j and a_{j'} if j < j'. This method is a generalization of a well-known method for reordering the radix-2 FFT result.

Before computing the Fourier transform, we first decompose n into its prime factors. The square factors are arranged symmetrically about the factors of the square-free portion of n. Thus n = 270 is factored as 3 x 2 x 3 x 5 x 3.
For n a power of 2, we note that a complex Fourier transform of dimension 2 or 4 can be computed without multiplication, and that a transform of dimension 8 requires only two real multiplications, equivalent to one-half a complex multiplication. Going one step further, a transform of dimension 16, computed as two factors of 4, requires the equivalent of six complex multiplications. Combining these results with the number of rotation factor multiplications, and assuming that n = 2^m is a power of the radix, the total number of complex multiplications is as follows:

    Radix    Number of Complex Multiplications
    2        nm/2 - (n - 1)
    4        3nm/8 - (n - 1)
    8        nm/3 - (n - 1)
    16       21nm/64 - (n - 1)

These results have been given previously by Bergland [6]. The savings for 16 over 8 is small, considering the added complexity of the algorithm. As Bergland points out, radix 8, with provision for an additional factor of 4 or 2, is a good choice for an efficient FFT program for powers of 2. For the mixed radix FFT, we transform with factors of 4 whenever possible, but also provide for factors of 2.

We now consider the number of complex multiplications for a radix-p transform of n = p^m complex data values. ... This result holds, in fact, for any odd value of p. Thus the transform steps for n = p^m require the equivalent of ... complex multiplications for the mixed radix FFT. The results of this section, neglecting the reduction by n - 1 for theta = 0, yield the following comparison:

    Radix    Relative Efficiency
    2        0.500
    4        0.375
    8        0.333
    16       0.328
    3        0.631
    5        0.689
    7        0.763
    11       0.920
    13       0.998
    17       1.151
    19       1.227
    23       1.374

The general term for an odd prime p is

    (p - 1)(p + 3) / (4p log2(p)).

Decomposition of a Complex Fourier Transform

In the previous section, we promised to show that a complex transform of dimension p, for p odd, can be computed with (p - 1)^2 real multiplications. Consider the complex transform

    a_k + i b_k = sum_{j=0}^{p-1} (x_j + i y_j) exp(i*2*pi*j*k/p)
                = x_0 + sum_{j=1}^{(p-1)/2} (x_j + x_{p-j}) cos(2*pi*j*k/p)
                      - sum_{j=1}^{(p-1)/2} (y_j - y_{p-j}) sin(2*pi*j*k/p)
                  + i { y_0 + sum_{j=1}^{(p-1)/2} (y_j + y_{p-j}) cos(2*pi*j*k/p)
                      + sum_{j=1}^{(p-1)/2} (x_j - x_{p-j}) sin(2*pi*j*k/p) }

for k = 0, 1, ..., p - 1. We note first that

    a_0 + i b_0 = sum_{j=0}^{p-1} (x_j + i y_j)

requires no multiplications, and that

    a_k = a_k+ - a_k-,    a_{p-k} = a_k+ + a_k-,
    b_k = b_k+ + b_k-,    b_{p-k} = b_k+ - b_k-

for k = 1, 2, ..., (p - 1)/2, where

    a_k+ = x_0 + sum_{j=1}^{(p-1)/2} (x_j + x_{p-j}) cos(2*pi*j*k/p),
    a_k- = sum_{j=1}^{(p-1)/2} (y_j - y_{p-j}) sin(2*pi*j*k/p),
    b_k+ = y_0 + sum_{j=1}^{(p-1)/2} (y_j + y_{p-j}) cos(2*pi*j*k/p),
    b_k- = sum_{j=1}^{(p-1)/2} (x_j - x_{p-j}) sin(2*pi*j*k/p).

Altogether there are 2(p - 1) series to sum, each with (p - 1)/2 multiplications, for a total of (p - 1)^2 real multiplications.

For p an odd prime and for fixed j, the multipliers cos(2*pi*j*k/p), k = 1, 2, ..., (p - 1)/2, have no duplications of magnitude; thus no further reduction in multiplications appears possible.(2) The same condition holds for the multipliers sin(2*pi*j*k/p), k = 1, 2, ..., (p - 1)/2.

For even values of p, a decomposition similar to the above yields 4(p/2 - 1) series to sum, each with (p/2 - 1) multiplications. Thus a complex Fourier transform for p even can be computed with at most (p - 2)^2 real multiplications. For p > 2, we know that this result can be improved. Combining results for the odd and even cases, we can state that a Fourier transform of dimension p can be computed with the equivalent of [(p - 1)/2]^2 or fewer complex multiplications, where the square brackets [ ] denote the largest integer value <= the enclosed quantity.

Because the rotation factors are equally spaced values on the unit circle, it is useful to have accurate methods of generating them by complex multiplication, rather than by repeated use of the library sine and cosine functions. For very short sequences, we use the simple method

    xi_{k+1} = xi_k exp(i*theta),    xi_0 = 1,

where {xi_k} is the sequence of computed values exp(i*k*theta). This method suffers, however, from rapid accumulation of error; for longer sequences we use instead the equation

    xi_{k+1} = xi_k + eta xi_k,

where the multiplier

    eta = exp(i*theta) - 1 = 2i sin(theta/2) exp(i*theta/2) = -2 sin^2(theta/2) + i sin(theta)

decreases in magnitude with decreasing theta. This method gives good accuracy on a computer using rounded floating-point arithmetic (e.g., the Burroughs B5500). However, with truncated arithmetic (as on the IBM 360/67), the value of xi_k tends to spiral inward from the unit circle with increasing k.
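A direct sketch of the odd-p decomposition above (illustrative code, not the paper's FORTRAN; it calls the library cos/sin, which the generation method just described deliberately avoids):

```python
import math

def odd_p_dft(z):
    # Complex DFT X_k = sum_j z_j exp(+2*pi*i*j*k/p) for odd p, via the
    # a+/a-, b+/b- split: 2(p-1) cosine/sine series, each summed with
    # (p-1)/2 real multiplications, i.e. (p-1)^2 in all; the k = 0 term
    # and the symmetric outputs X_{p-k} come without extra multiplies.
    p = len(z)
    assert p % 2 == 1
    x = [w.real for w in z]
    y = [w.imag for w in z]
    h = (p - 1) // 2
    X = [complex(sum(x), sum(y))] + [0j] * (p - 1)   # a_0 + i b_0
    for k in range(1, h + 1):
        ap, am, bp, bm = x[0], 0.0, y[0], 0.0
        for j in range(1, h + 1):
            c = math.cos(2 * math.pi * j * k / p)
            s = math.sin(2 * math.pi * j * k / p)
            ap += (x[j] + x[p - j]) * c
            am += (y[j] - y[p - j]) * s
            bp += (y[j] + y[p - j]) * c
            bm += (x[j] - x[p - j]) * s
        X[k] = complex(ap - am, bp + bm)         # a_k+ - a_k-, b_k+ + b_k-
        X[p - k] = complex(ap + am, bp - bm)     # a_k+ + a_k-, b_k+ - b_k-
    return X
```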
In Table I, we show the accumulated errors from extrapolating to pi/2 in 2^k increments, using rounded arithmetic (machine language) and truncated arithmetic (FORTRAN) on a CDC 6400 computer; identical initial values, from the library sine and cosine functions, were used in computing the results in each of the three pairs of columns. In examining the second pair of columns, we find that the angle after 2^k extrapolation steps is very close to pi/2, but that the magnitude has shrunk through truncation. To compensate for this shrinkage, we modify the above method to restore the extrapolated value to unit magnitude. We first compute a trial value

    gamma_k = xi_k + eta xi_k,

where

    eta = -2 sin^2(theta/2) + i sin(theta)  and  xi_0 = 1,

then multiply by a scalar,

    xi_{k+1} = mu_k gamma_k,

where

    mu_k ~ 1/sqrt(gamma_k gamma_k*),

to obtain the new value. Since gamma_k gamma_k* is very close to 1, we can avoid the library square-root function and use the approximation

    mu_k = (1/2)(1/(gamma_k gamma_k*) + 1).

Or, if division is more costly than multiplication, we can alternatively use the approximation ...; the loss in accuracy is small. The subroutines in Appendixes I and II include comment cards indicating the changes to remove the rescaling. On the other hand, the number of multiplications may be reduced by one when using truncated arithmetic, through using the overcorrection multiplier

    mu_k = 2 - gamma_k gamma_k*.

In this case, the truncation bias stabilizes a method that mathematically borders on instability. On the CDC 6400 computer, this multiplier gives comparable accuracy to the multiplier suggested above.

A FORTRAN Subroutine for the Mixed Radix FFT

In Appendix I, we list a FORTRAN subroutine for computing the mixed radix FFT or its inverse, using the algorithm described above. This subroutine computes either a single-variate complex Fourier transform or the calculation for one variate of a multivariate transform. To compute a single-variate transform (1) of n data values,

    CALL FFT(A, B, n, n, n, 1).

(2) C. M. Rader (private communication) has proposed an alternative decomposition of a 5-point transform using the equivalent of 3 complex multiplications (12 real multiplications) instead of the 4 complex multiplications used in the algorithm described in this paper. In Appendix III we give a FORTRAN coding of Rader's method. When substituted in subroutine FFT (Appendix I), times were unchanged on the CDC 6600 computer and improved by about 5 percent for radix-5 transforms on the CDC 6400 computer (the 6400 has a relatively slower multiply operation). Rader's method looks advantageous for coding in machine language on a computer having multiple arithmetic registers available for temporary storage of intermediate results.
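The rescaled extrapolation can be sketched as below; `unit_circle_points` is an assumed helper name, and the division-based approximation mu_k = (1/2)(1/(gamma gamma*) + 1) is the one used.

```python
import math

def unit_circle_points(theta, count):
    # Generate xi_k ~ exp(i*k*theta) by the extrapolation
    # gamma_k = xi_k + eta*xi_k with eta = -2*sin(theta/2)^2 + i*sin(theta),
    # restoring each trial value to unit magnitude with the square-root-free
    # approximation mu_k = (1/(gamma*conj(gamma)) + 1)/2.
    eta = complex(-2.0 * math.sin(theta / 2.0) ** 2, math.sin(theta))
    xi = 1.0 + 0.0j
    out = [xi]
    for _ in range(count - 1):
        gamma = xi + eta * xi                    # trial value
        g2 = (gamma * gamma.conjugate()).real    # |gamma|^2, very close to 1
        mu = 0.5 * (1.0 / g2 + 1.0)              # ~ 1/sqrt(g2)
        xi = mu * gamma
        out.append(xi)
    return out
```

Extrapolating to pi/2 in 1024 increments, as in the Table I experiment, keeps both the angle and the magnitude accurate to roughly machine precision.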
[Table: Timing and accuracy tests of subroutine FFT on a CDC 6400 computer, for numbers n <= 100000 containing no prime factor greater than 5.]
The "inverse" transform, after scaling by 1/2, evaluates the Fourier series and leaves the time domain values stored

    A(1), B(1), A(2), B(2), ..., A(n), B(n)

as originally.

The subroutine REALTR, called with ISN = 1, separates the complex transforms of the even- and odd-numbered data values, using the fact that the transform of real data has the complex conjugate symmetry for k = 1, 2, ..., n - 1, then performs a final radix-2 step to complete the transform for the 2n real values. If called with ISN = -1, the inverse operation is performed. The pair of calls

    CALL REALTR (A, B, n, 1)
    CALL REALTR (A, B, n, -1)

return the original values multiplied by 4, except for round-off errors. Time on the CDC 6400 for n = 1000 is 0.100 second, and for n = 2000, 0.200 second. Time for REALTR is a linear function of n for other numbers of data values. The rms error for the above pair of calls of REALTR was 1.6 x 10^-14 for both n = 1000 and n = 2000.

Conclusion

... that it could have been used here to transform the square-free factors of n. This alternative has not been tried, but the potential gain, if any, appears small.

References

[1] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comp., vol. 19, pp. 297-301, April 1965.
[2] W. M. Gentleman and G. Sande, "Fast Fourier transforms for fun and profit," 1966 Fall Joint Computer Conf., AFIPS Proc., vol. 29. Washington, D.C.: Spartan, 1966, pp. 563-578.
[3] R. C. Singleton, "An ALGOL procedure for the fast Fourier transform with arbitrary factors," Commun. ACM, vol. 11, pp. 776-779, Algorithm 339, November 1968.
[4] N. M. Brenner, "Three FORTRAN programs that perform the Cooley-Tukey Fourier transform," M.I.T. Lincoln Lab., Lexington, Mass., Tech. Note 1967-2, July 1967.
[5] R. C. Singleton, "On computing the fast Fourier transform," Commun. ACM, vol. 10, pp. 647-654, October 1967.
[6] G. D. Bergland, "A fast Fourier transform algorithm using base 8 iterations," Math. Comp., vol. 22, pp. 275-279, April 1968.
[7] R. C. Singleton, "ALGOL procedures for the fast Fourier transform," Commun. ACM, vol. 11, pp. 773-776, Algorithm 338, November 1968.
[8] I. J. Good, "The interaction algorithm and practical Fourier series," J. Roy. Stat. Soc., ser. B, vol. 20, pp. 361-372, 1958; Addendum, vol. 22, pp. 372-375, 1960.
A Linear Filtering Approach to the Computation of the Discrete Fourier Transform

L. I. BLUESTEIN
Sylvania, Waltham, Mass.

This paper begins with the sampled data analogue of that work and develops two results. The first permits the computation of the discrete Fourier transform with the number of required operations proportional to N log2 N for any N. The second develops an especially simple algorithm for this computation when N = m^2; the number of computations is then proportional to N^(3/2).

An Algorithm Suggested by Chirp Filtering

For analog waveforms, a chirp filter is one whose frequency response is the bandpass equivalent of exp(jf^2) over a range of frequencies. The impulsive response of this filter is proportional to e^(j*pi*t^2). Now let us enter the realm of sampled systems. Motivated by the above discussion, ...

Thus, H(z) may be realized by a bank of m filters as shown in Figure 2. (A box labeled z^(-x) represents a delay of x units.) Since the impulsive response after 2m^2 - 1 units does not concern us, we may open-circuit the link connected to the box labeled z^(-2m^2). The number of operations required is about 3Nm, since we need only count operations which take place during the crucial interval between 0 and 2N - 1. This number is of the same form (within a multiplicative constant) as that required by the Cooley-Tukey algorithm for N = m^2 and m prime. Because of the apparent simplicity of Figure 2, it is suggested that an algorithm based on a combination of Figures 1 and 2 be investigated when the DFT is to

218-NEREM RECORD-1968
be computed on a machine especially constructed for that purpose. The application that we have in mind here is when N, the number of points being transformed, is of moderate size (around 256) and is fixed.

Original manuscript received July 29, 1968.

1. Cochran, W. T., et al., "What is the Fast Fourier Transform?" Proc. IEEE, vol. 55, no. 10, pp. 1664-1674; October 1967.
2. Brigham, E. O. and Morrow, R. E., "The Fast Fourier Transform," IEEE Spectrum, vol. 4, no. 12, pp. 63-70; December 1967.
3. Cooley, J. W. and Tukey, J. W., "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Comput., vol. 19, pp. 297-301; April 1965.
4. Good, I. J., "The Interaction Algorithm and Practical Fourier Analysis," J. Roy. Statist. Soc., Ser. B, vol. 20, pp. 361-372; 1958.
5. Gentleman, W. M. and Sande, G., "Fast Fourier Transforms for Fun and Profit," 1966 Fall Joint Computer Conference, AFIPS Proc., vol. 29, pp. 563-578.
6. Morrow, W. E., Jr., et al., "A Real Time Fourier Transformer," M.I.T. Lincoln Laboratory Group Report 36 G-4; July 16, 1963.
7. Stockham, T. G., "High Speed Convolution and Correlation," 1966 Spring Joint Computer Conference, AFIPS Proc., vol. 28. Washington, D.C.: Spartan, 1966; pp. 229-233.
8. We usually take an operation to mean multiplication by a complex number plus all attendant computations, since multiplication of this sort on a computer requires an inordinate amount of time.
9. This observation is due to C. Rader of the M.I.T. Lincoln Laboratory.
[Figure: chirp-filter realization of the DFT; input x_K, K <= N - 1, output taken for N <= K <= 2N - 1.]
... and small read-only memories for coefficient storage. The arithmetic circuits are readily multiplexed to process multiple data inputs or to effect multiple, but different, filters (or both), thus providing for efficient hardware utilization. Up to 100 filter sections can be multiplexed in audio-frequency applications using presently available digital circuits in the medium-speed range. The filters are also easily modified to realize a wide range of filter forms, transfer functions, multiplexing schemes, and round-off noise levels by changing only the contents of the read-only memory and/or the timing signals and the length of the shift-register delays. A simple analog-to-digital converter, which uses delta modulation as an intermediate encoding process, is also presented for audio-frequency applications.

Introduction

1) The filters are constructed from a small set of relatively simple digital circuits, primarily shift registers and adders.
2) The configuration of the digital circuits is highly modular in form and thus well suited to LSI construction.
3) The configuration of the digital circuits has the flexibility to realize a wide range of filter forms, coefficient accuracies, and round-off noise levels (i.e., data accuracies).
4) The digital filter may be easily multiplexed to process multiple data inputs or to effect multiple, but different, filters with the same digital circuits, thus providing for efficient hardware utilization.
Canonical Forms
IEEE TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS, VOL. AU-16, NO. 3, SEPTEMBER 1968
    H(z) = ( sum_{i=0}^{n} a_i z^(-i) ) / ( 1 + sum_{i=1}^{n} b_i z^(-i) )    (1)

where z^(-1) is the unit delay operator. There are a multitude of equivalent digital circuit forms in which (1) may be realized, but three canonical forms, or variations thereof, are most often employed. These forms are canonical in the sense that a minimum number of adders, multipliers, and delays are required to realize (1) in the general case. The first of these forms, shown in Fig. 1, is a direct realization of (1) and as such is called the direct form. It has been pointed out by Kaiser [5] that use of the direct form is usually to be avoided because the accuracy requirements on the coefficients {a_i} and {b_i} are often severe. Therefore, although the implementation techniques presented here are applicable to any filter form, we will not specifically consider the direct form.

The second canonical form corresponds to a factorization of the numerator and denominator polynomials of (1) to produce an H(z) of the form

    H(z) = A prod_{i=1}^{m} (1 + alpha_{1i} z^(-1) + alpha_{2i} z^(-2)) / (1 + beta_{1i} z^(-1) + beta_{2i} z^(-2)),    (2)

where m is the integer part of (n + 1)/2. This is the cascade form for a digital filter, depicted in Fig. 2. Second-order factors (with real coefficients) have been chosen for (2), rather than a mixed set of first- and second-order factors for real and complex roots, respectively, to simplify the implementation of the cascade form, especially when multiplexing is employed. If n is odd, then the coefficients alpha_{2i} and beta_{2i} will equal zero for some i. The alpha_{2i} multipliers are shown in dotted lines in Fig. 2 because, for the very common case of zeros on the unit circle in the z-plane (corresponding to zeros of transmission in the frequency response of the filter), the associated alpha_{2i} coefficients are unity. Thus, for these alpha_{2i} coefficients, no multiplications are actually required.

The third canonical form is the parallel form, shown in Fig. 3, which results from a partial fraction expansion of
(1) to produce

    H(z) = gamma_0 + sum_{i=1}^{m} (gamma_{1i} + gamma_{2i} z^(-1)) / (1 + beta_{1i} z^(-1) + beta_{2i} z^(-2)),    (3)

where gamma_0 = a_n/b_n and we have again chosen to use all second-order (denominator) factors. Note that all three canonical forms are entirely equivalent with regard to the amount of storage required (n unit delays) and the number of arithmetic operations required (2n + 1 multiplications and 2n additions per sampling period). As previously noted, however, the cascade form requires significantly fewer multiplications for zeros on the unit circle and is thus especially appropriate for filters of the bandpass and band-stop variety (including low-pass and high-pass filters).

Another interesting filter form may be derived for the special case of an all-pass filter (APF), i.e., a filter or "equalizer" with unity gain at all frequencies. The transfer function for a discrete APF has the general form [6]

    H_A(z) = ( sum_{i=0}^{n} b_{n-i} z^(-i) ) / ( sum_{i=0}^{n} b_i z^(-i) ).    (4)

Second-order sections for the cascade form of the APF are shown in Fig. 4. Fig. 4(A) is a straightforward modification of the standard cascade form in Fig. 2. Note that because the beta_{1i} multiplier may be shared by both the feedforward and feedback paths, only three multiplications are required per second-order section rather than four. The number of multiplications may be further reduced by using the form of Fig. 4(B), which requires only two multiplications per second-order section. But now, two additional delays are required preceding the first second-order section to supply appropriately delayed inputs to the first section. Therefore, the cascade form of Fig. 4(B) requires a total of n multiplications and n + 2 delays for an nth-order APF.

Serial Arithmetic

Using any of the canonical forms described in the preceding section, all of the coefficient multiplications and many of the additions during a given Nyquist interval may be performed simultaneously. Therefore, a high degree of parallel processing is possible in the implementation of a digital filter, and this may be achieved by providing multiple adders and multipliers with appropriate interconnections. Economy is then realized by using serial arithmetic and by sharing the adders and multipliers (using the multiplexed circuit configurations to be described) insofar as circuit speed will allow.

In addition to a significant simplification of the hardware, serial arithmetic provides for an increased modularity and flexibility in the digital circuit configurations. Also, the processing rate is limited only by the speed of the basic digital circuits and not by carry-propagation times in the adders and multipliers. Finally, with serial arithmetic, sample delays are realized simply as single-input single-output shift registers.

The two's-complement representation [7] of binary numbers is most appropriate for digital filter implementation using serial arithmetic because additions may proceed (starting with the least significant bits) with no advance knowledge of the signs or relative magnitudes of the numbers being added (and with no later corrections of the obtained sums, as with one's-complement). We will assume a two's-complement representation of the form

    u = -u_0 + sum_{i=1}^{N-1} u_i 2^(-i),    (6)

so that

    -1 <= u < 1,    (8)

with the sign of the number u being given by the last bit (in time), u_0.

An extremely useful property of two's-complement representation is that in the addition of more than two numbers, if the magnitude of the correct total sum is small enough to allow its representation by the N available bits, then the correct total sum will be obtained regardless of the order in which the numbers are added, even if an overflow occurs in some of the partial sums. This property is illustrated in Fig. 5, which depicts numbers in two's-complement representation as being arrayed in a circle, with positive full scale (1 - 2^(-N+1)) and negative full scale (-1) being adjacent. The addition of positive addends produces counterclockwise rotation about the circle, whereas negative addends produce clockwise rotation. Thus, if the correct total sum satisfies (8), no information is lost with positive or negative overflows and the correct total sum will be obtained.

This overflow property is important for digital filter implementation because the summation points in the filters often contain more than two inputs (see Fig. 3);
[Figure: serial adder and subtractor circuits (Figs. 6 and 7); carry-clear and addend inputs shown.]
and then adding the complemented subtrahend to the
minuend. To complement a number in two's-complement
representation, each bit of the representation is inverted
and a one is then added to the least significant bit of the
although it may be possible to argue that because of gain inverted representation (i.e., 2-N+l is added to the inverted
considerations the output of the summation point cannot number). The corresponding serial subtractor circuit is
overflow, there is no assurance that an overflow will not shown in Fig. 7. The subtrahend is inverted and a one is
occur in the process of performing the summation. Note added to the least significant bit by clearing the initial
that this property also applies when one of the inputs carry bit to one, rather than to zero as in the adder. This is
to the summation has itself overflowed as a result of a accomplished by means of two inverters in the carry feed
multiplication by a coefficient of magnitude greater than back path, as shown.
one. A separate two's complementer (apart from a subtrac
tor) may also be constructed ; such circuits are required in
the multiplier to be described. This operation is imple
Arithmetic Unit
mented with a simple sequential circuit which, for each
The three basic operations to be realized in the imple sample, passes unchanged all initial (least significant) bits
mentation of a digital filter are delay, addition (or subtrac up to and including the first "I" and then inverts all suc
tion), and multiplication. As previously mentioned, serial ceeding bits. A corresponding circuit is depicted in Fig. 8.
delays (Z-I) are realized simply as single-input single A serial multiplier may be realized in a variety of con
output shift registers. Realizations for a serial adder (sub figurations, but a special restriction imposed by this im
tractor) and a serial multiplier are described in this sec plementation approach makes one configuration most ap
tion. The adders and multipliers, including their inter propriate. This restriction is that no more than N bits per
connections, will be said to comprise the arithmetic unit sample may be generated at any point in the digital net
of the digital filter. work because successive samples are immediately adja
A serial adder for two's-complement addition is ex cent in time and there are no "time slots" available for
tremely simple to construct [7]. As shown in Fig. 6, it more than N bits per sample. Hence, the full (N+K)-bit
co nsists of a full binary adder with a single-bit delay product of the multiplication of an N-bit sample by a
(flip-flop) to transfer the carry output back to the carry K-bit (fractional) coefficient may not be accumulated
input. A gate is also required in the carry feedback path before rounding is performed. However, using the multi
to clear the carry to zero at the beginning of each sample. plication scheme described below, it is possible to obtain
Accordingly, the carry-clear input is a timing signal, the same rounded N-bit product without ever generating
which is zero during the least significant bit of each sample more than N bits per sample. Rounding is usually prefer
and is one otherwise. able, rather than truncation, to limit the introduct ion of
A serial two's-complement subtractor is implemented extraneous low-frequency components (de drift) into the
by first complementing (negating) the subtrahend input filter.
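The serial adder, subtractor, and two's complementer described above are easy to check in software. The sketch below (Python; the helper names are ours, not from the paper) models bits arriving least-significant first: addition feeds a single carry bit back each cycle, subtraction inverts the subtrahend bits and presets the carry to one, and the sequential complementer passes bits through unchanged up to and including the first 1 and inverts the rest.

```python
def serial_add(a_bits, b_bits, subtract=False):
    """Bit-serial two's-complement add/subtract (bits LSB first).
    For subtraction the subtrahend bits are inverted and the carry
    flip-flop is preset to 1 instead of cleared to 0 (Fig. 7)."""
    carry = 1 if subtract else 0
    out = []
    for a, b in zip(a_bits, b_bits):
        if subtract:
            b ^= 1                    # invert each subtrahend bit
        s = a + b + carry
        out.append(s & 1)             # sum bit emitted this cycle
        carry = s >> 1                # carry fed back through the flip-flop
    return out                        # any carry beyond N bits is discarded

def serial_negate(bits):
    """Sequential two's-complementer: pass bits unchanged up to and
    including the first 1, then invert all succeeding bits (Fig. 8)."""
    out, seen_one = [], False
    for b in bits:
        out.append(b ^ 1 if seen_one else b)
        seen_one = seen_one or b == 1
    return out

def to_bits(x, n):
    """Two's-complement bits of x, least significant first."""
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    x = sum(b << i for i, b in enumerate(bits))
    return x - (1 << len(bits)) if bits[-1] else x

N = 8
print(from_bits(serial_add(to_bits(23, N), to_bits(9, N), subtract=True)))  # 14
print(from_bits(serial_negate(to_bits(9, N))))                              # -9
```

Note that the complementer's pass-through-then-invert rule is exactly equivalent to inverting every bit and adding one, since the trailing zeros and the first 1 are unchanged by that addition.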
Fig. 9. Serial multiplication using no more than N bits per data word (multiplicand (data) δ; multiplier (coeff.) α = a₀.a₁a₂⋯a_K; partial sums b₀.b₁⋯b_{N−1}, …, v₀.v₁⋯v_{N−1}; product (data) p₀.p₁⋯p_{N−1}). Fig. 11. A multiplier bit section.
Although (9) is not necessarily applicable in the general case, it does hold for the denominator coefficients of the cascade and parallel forms, and usually for the numerator coefficients of the cascade form as well. The magnitude of the multiplier is thus represented in Fig. 9 as

$|\alpha| = a_0 \,.\, a_1 a_2 \cdots a_K$    (10)

which represents a value of

$|\alpha| = \sum_{i=0}^{K} a_i 2^{-i}.$    (11)

The restriction in (9) and the resulting representation in (10) and (11) are in no way essential to the serial multipliers to be described, but are meant only to be representative of the multiplication scheme. The sign of the multiplier is also stored separately as SGN α.

The multiplication scheme in Fig. 9 proceeds as follows. The multiplicand is successively shifted (delayed) and multiplied (gated) by the appropriate bit ($a_i$) of the multiplier. These delayed gated instances of the multiplicand are then added in the order indicated. After each addition (including the "addition" of $a_K\delta$ to 0), the least significant bit of the resulting partial sum (i.e., $b_{N-1}, \ldots, v_{N-1}$) is truncated to prevent the succeeding partial sum from exceeding N bits in length. Note that these bits may be truncated because the full unrounded product would be

…    (12)

and to round (12) to N bits, only the value of the bit $v_{N-1}$ is required. Thus, before truncating $v_{N-1}$, its value is stored elsewhere to be added in the final step, as shown in Fig. 9, to obtain the rounded product (p).

The serial multiplier corresponding to the scheme described above is shown in Fig. 10. The absolute value of each incoming datum (δ) is taken and its sign (SGN δ) is added modulo-2 to the coefficient sign (SGN α) to determine the product sign (SGN α·δ). The (positive) multiplicand is then successively delayed and gated by the appropriate multiplier bits ($a_i$) and the partial sums are accumulated in the multiplier bit sections. A single multiplier bit section is shown in Fig. 11. The least significant bit of each partial sum is truncated (gated to zero) by the appropriate timing signal $t_{i+1}$. Rounding is accomplished by adding in the last truncated bit ($v_{N-1}$) via the * input to the last bit section. Finally, the sign of the product is inserted using a two's-complementer such as that in Fig. 8. At high data rates, it may be necessary to insert extra flip-flops between some or all of the multiplier bit sections, as shown in dotted lines in Fig. 11, to keep the propagation delay through the adder circuits from becoming excessive.

Several observations concerning the serial multiplier should be made at this point. First, there is a delay of K bits in going through the multiplier and this delay must be deducted from a delay (z⁻¹) that precedes the multiplier. (If the extra flip-flops in Fig. 11 are required, then the multiplier will yield a delay of up to 2K bits.) In addition, the absolute value operation at the first of the multiplier requires a delay of N bits (to determine the sign of each incoming datum) and this must be deducted from a preceding delay as well. Thus, to use this serial multiplier, the z⁻¹ delays of the digital filter must be at least N+K (or up to N+2K) bits in length. This in turn implies, as we shall see in the next section, that some form of multiplexing is required if the multipliers are to be implemented in this manner.
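The truncate-then-round trick is worth seeing numerically. The sketch below (Python; the variable names are ours, and the datum is modeled as a non-negative integer rather than a serial bit stream) follows the scheme of Fig. 9: each partial sum loses its least significant bit before the next gated copy of the multiplicand is added, and only the last truncated bit is saved and added back at the end. Because a chain of per-stage truncations collapses into a single truncation of the full (N+K)-bit product, the saved bit is exactly the first discarded product bit, and the result equals the conventionally rounded product.

```python
def serial_multiply(delta, a_bits):
    """Multiply a datum by a coefficient magnitude a0.a1a2...aK
    (binary point after a0), never widening beyond ~N bits.

    delta  : non-negative integer datum
    a_bits : [a0, a1, ..., aK], the coefficient bits of (10)
    """
    K = len(a_bits) - 1
    s = a_bits[K] * delta            # the "addition" of aK*delta to 0
    t = 0
    for i in range(K - 1, -1, -1):
        t = s & 1                    # least significant bit of the partial sum...
        s >>= 1                      # ...is truncated before the next addition
        s += a_bits[i] * delta       # next delayed, gated multiplicand
    return s + t                     # add back the last truncated bit: rounding

# Compare against rounding the full (N+K)-bit product in one step.
delta, a_bits = 397, [0, 1, 0, 1, 1, 0, 1]   # alpha = 0.0101101 (binary)
K = len(a_bits) - 1
full = sum(a * delta * 2**(K - i) for i, a in enumerate(a_bits))
assert serial_multiply(delta, a_bits) == (full + (1 << (K - 1))) >> K
print(serial_multiply(delta, a_bits))        # 279
```

The assertion holds for any non-negative datum and coefficient pattern, which is the paper's claim that the same rounded N-bit product is obtained without ever generating more than N bits per sample.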
particular, the arithmetic unit containing the adders and multipliers is the same; it just operates M times faster. The output samples emerge in the same interleaved order as the input and are thus easily separated. Type-1 multiplexing is depicted in Fig. 12.

If the M channels in Fig. 12 are to be filtered differently or if type-2 multiplexing is also employed, the filter coefficients are stored in a separate read-only coefficient memory and are read out as required by the multiplexed …

… so the bit rate in the filter must be (at least) 6N bits per Nyquist interval. During the first N bits of each Nyquist interval, the input sample is introduced into and is processed by the arithmetic unit with the multiplying coefficients (α₁, α₂, β₁, β₂) of the first subfilter in the cascade form. The resulting output is delayed by N bits ($z^{-1/M}$) and fed back via the input routing switch to become the input to the filter during the second N-bit portion of the Nyquist interval. This feedback process is repeated four more times, with the filter coefficients from the ROM being changed each time to correspond to the appropriate subfilter in the cascade form. The sixth (last) filter output during each Nyquist interval is the desired 12th-order filter output. The parallel form, or a combination of cascade and parallel filters, may be realized using the filter in Fig. 13 by simply changing the bits in the ROM which control the switching sequences of the input and output routing switches.

Fig. 13. General second-order filter for type-1 and type-2 multiplexing (with shift-register delay unit). Fig. 14. Digital touch-tone receiver, showing multiplexed filters and nonlinear units.

Sample System

As an example of this approach to the implementation of digital filters, we will take an experimental, all-digital touch-tone receiver (TTR) which has been designed and constructed at Bell Telephone Laboratories, Inc. The digital TTR is depicted in block-diagram form in Fig. 14. This is a straightforward digital version of the standard analog TTR described elsewhere [8]. Without going into the detailed operation of the system, we simply note that the combined high-pass filters (HPF's) are third order, the band-rejection filters (BRF's) are each sixth order, the bandpass filters (BPF's) are each second order, and the low-pass filters (LPF's) are each first order. The other signal-processing units required are the limiters (LIM's), half-wave rectifiers (HWR's), and level detectors. These nonlinear operations are, of course, easily implemented in digital form.

A multiplexing factor of M = 8 is employed in the experimental TTR to combine all of the units enclosed in dotted lines into single multiplexed units. In particular, all of the HPF's and BRF's are multiplexed into one second-order filter (combined type-1 and type-2 multiplexing), the eight BPF's are multiplexed into another second-order filter (type-1 multiplexing with ROM coefficients), and the eight LPF's are multiplexed into one first-order section (type-1 multiplexing with wired-in coefficients). The nonlinear units are readily multiplexed as well and operate directly upon the interleaved output samples from the filters.

Some of the parameters of the experimental TTR design are as follows: the sampling rate is 10 K samples/second with an initial quantization (A/D conversion) of 7 bits/sample; the data word length (N) within the filter is 10 bits/sample; the filter coefficients have 6-bit fractional parts (K); and, as previously stated, the multiplexing factor (M) is eight. Thus, the bit rate within the filter (sampling rate × bits/sample × M) is 800 K bits/second. The numbers of bits required to represent the data and the coefficients of the TTR were determined through computer simulation of the system. The hardware required to implement this design consists primarily of about 40 serial adders and 400 bits of shift-register storage.

Analog-to-Digital Converter

In most applications of digital filters, the initial input signal is in analog form and must be converted to digital form for processing. It may or may not be necessary to reconvert the digital output signal to analog form, depending upon the application. Digital-to-analog (D/A) conversion is a relatively straightforward and inexpensive process, but the initial analog-to-digital (A/D) conversion …
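In software, the type-2 multiplexing described above amounts to one shared second-order section whose coefficients and delay state are swapped on every pass. The sketch below (Python; the coefficient values and function names are placeholders of our own, not TTR design values) runs one arithmetic unit six times per input sample, routing each output back as the next input, which is the software analogue of the six N-bit time slots per Nyquist interval.

```python
def biquad_step(x, coeffs, state):
    """One pass through the shared second-order section (direct form).
    coeffs = (a1, a2, b1, b2): feedback and feedforward terms, standing in
    for the (alpha1, alpha2, beta1, beta2) of one subfilter.
    state  = [w1, w2]: this subfilter's two delay values."""
    a1, a2, b1, b2 = coeffs
    w1, w2 = state
    w = x - a1 * w1 - a2 * w2        # feedback path
    y = w + b1 * w1 + b2 * w2        # feedforward path
    state[0], state[1] = w, w1       # shift the delay line
    return y

# "ROM" of coefficients: six subfilters forming a 12th-order cascade.
# Placeholder values only -- a real design would load each subfilter's own set.
rom = [(0.1, 0.2, 0.3, 0.4)] * 6
states = [[0.0, 0.0] for _ in rom]   # separate delay storage per subfilter

def filter_sample(x):
    """One Nyquist interval: six passes through the single shared section,
    each output fed back (via the 'input routing switch') as the next input."""
    for coeffs, st in zip(rom, states):
        x = biquad_step(x, coeffs, st)
    return x
```

Only the coefficient and state memories grow with the number of subfilters; the arithmetic itself, like the hardware arithmetic unit, is instantiated once and simply runs M times per sample.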
REFERENCES

[1] J. F. Kaiser, "Digital filters," in System Analysis by Digital Computer, J. F. Kaiser and F. F. Kuo, Eds. New York: Wiley, 1966, pp. 218-285.
[2] C. M. Rader and B. Gold, "Digital filter design techniques in the frequency domain," Proc. IEEE, vol. 55, pp. 149-171, February 1967.
[3] R. M. Golden, "Digital filter synthesis by sampled-data transformation," this issue, pp. 321-329.
[4] B. Gold and C. M. Rader, Digital Processing of Signals. New York: McGraw-Hill, 1969.
[5] J. F. Kaiser, "Some practical considerations in the realization of linear digital filters," 1965 Proc. 3rd Allerton Conf. on Circuit and System Theory, pp. 621-633.
[6] R. B. Blackman, unpublished memorandum.
[7] Y. Chu, Digital Computer Design Fundamentals. New York: McGraw-Hill, 1962.
[8] R. N. Battista, C. G. Morrison, and D. H. Nash, "Signaling system and receiver for touch-tone calling," IEEE Trans. Communications and Electronics, vol. 82, pp. 9-17, March 1963.
[9] E. N. Protonotarios, "Slope overload noise in differential pulse code modulation systems," Bell Sys. Tech. J., vol. 46, pp. 2119-2161, November 1967.
Introduction
Options
IEEE TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS, VOL. AU-17, NO. 2, JUNE 1969
Fig. 1. Fast Fourier transform flow diagram for N=8.
The first design choice often made concerns constraining the number of data points to be analyzed to being a …

Fig. 2. The functional block diagram of a sequential fast Fourier transform …
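The computation that Fig. 1 draws as three stages of butterflies for N = 8 can be written in a few lines. The recursive sketch below (Python; a real processor would use the in-place, bit-reversed form of the flow diagram rather than recursion) assumes the usual radix-2 constraint that the number of points is a power of two.

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])           # two half-length DFTs
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor
        out[k] = even[k] + w                         # the two outputs of
        out[k + n // 2] = even[k] - w                # one butterfly
    return out

X = fft([1, 0, 0, 0, 0, 0, 0, 0])                    # impulse in -> flat spectrum
```

Each level of recursion corresponds to one column of butterflies in the N = 8 flow diagram, so the operation count is proportional to N log2 N rather than N².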
TABLE I. Ambitious Processing Rates for N = 1024 with 8-…
Machine Organization | Arith. Units | Execution Time (µs) | Processing Rate (samples/s)
…mon to the whole array).

A more complete list of design options is given in the FFT processor survey [20].

Other Considerations

In many cases, people tend to focus on only the FFT hardware since it is the best defined part of the system. Those parts of the problem which should not be overlooked …

[2] R. Klahn, R. R. Shively, E. Gomez, and M. J. Gilmartin, "The time-saver: FFT hardware," Electronics, pp. 92-97, June 24, 1968.
[3] J. W. Cooley, "Complex finite Fourier transform subroutine," SHARE Doc. 3465, September 8, 1966.
[4] W. M. Gentleman and G. Sande, "Fast Fourier transforms for fun and profit," 1966 Fall Joint Computer Conf., AFIPS Proc., vol. 29. Washington, D. C.: Spartan, 1966, pp. 563-578.
[5] G. D. Bergland, "The fast Fourier transform recursive equations for arbitrary length records," Math. Comp., vol. 21, pp. 236-238, April 1967.
[6] N. M. Brenner, "Three FORTRAN programs that perform the Cooley-Tukey Fourier transform," M.I.T. Lincoln Lab., Lexington, Mass., Tech. Note 1967-2, July 1967.
[7] R. C. Singleton, "On computing the fast Fourier transform," Commun. ACM, vol. 10, pp. 647-654, October 1967.
[8] G. D. Bergland, "A fast Fourier transform algorithm using base 8 iterations," Math. Comp., vol. 22, pp. 275-279, April 1968.
[9] M. C. Pease, "An adaptation of the fast Fourier transform for parallel processing," J. ACM, vol. 15, pp. 252-264, April 1968.
[10] G. D. Bergland, "A fast Fourier transform algorithm for real-valued series," Commun. ACM, vol. 11, pp. 703-710, October 1968.
[11] L. I. Bluestein, "A linear filtering approach to the computation of the discrete Fourier transform," 1968 NEREM Rec., pp. 218-219.
[12] J. W. Cooley, P. A. W. Lewis, and P. D. Welch, "The fast Fourier transform algorithm and its applications," IBM Research Paper RC-1743, February 1967.
[13] R. R. Shively, "A digital processor to generate spectra in real time," IEEE Trans. Computers, vol. C-17, pp. 485-491, May 1968.
[14] G. D. Bergland and H. W. Hale, "Digital real-time spectral analysis," IEEE Trans. Electronic Computers, vol. EC-16, pp. 180-185, April 1967.
[15] R. A. Smith, "A fast Fourier transform processor," Bell Telephone Labs., Inc., Whippany, N. J., 1967.
[16] G. Sande, University of Chicago, Chicago, Ill., private communication.
[17] M. C. Pease, III, and J. Goldberg, "Feasibility study of a special-purpose digital computer for on-line Fourier analysis," Advanced Research Projects Agency, Order 989, May 1967.
[18] G. D. Bergland and D. E. Wilson, "An FFT algorithm for a global, highly-parallel processor," this issue, pp. 125-127.
[19] R. B. McCullough, "A real-time digital spectrum analyzer," Stanford Electronics Labs., Stanford, Calif., Sci. Rept. 23, November 1967.
[20] G. D. Bergland, "Fast Fourier transform hardware implementations – a survey," this issue, pp. 109-119.
BIBLIOGRAPHY
Journal Articles
17. H. W. Briscoe and P. L. Fleck, "A Real Time Computing System for LASA," presented at Spring Joint Computer Conference, Vol. 28, pp. 221-228, 1966.
18. P. W. Broome, "Discrete Orthonormal Sequences," J. Assoc. Computing Machinery, Vol. 12, No. 2, pp. 151-168, April 1965.

19. P. Broome, "A Frequency Transformation for Numerical Filters," Proc. IEEE, Vol. 52, No. 2, pp. 326-327, February 1966.

24. J. W. Cooley, P. Lewis, and P. D. Welch, "The Use of the Fast Fourier Transform Algorithm for the Estimation of Spectra and Cross Spectra," Proceedings of the 1969 Polytechnic Institute of Brooklyn Symposium on Computer Processing in Communications.
33. R. Edwards, J. Bradley, and J. Knowles, "Comparison of Noise Performances of Programming Methods in the Realization of Digital Filters," Proceedings of the 1969 Polytechnic Institute of Brooklyn Symposium on Computer Processing in Communications.

36. G-AE Concepts Subcommittee, "On Digital Filtering," IEEE Trans. Audio, Vol. 16, No. 3, pp. 303-315, September 1968.

37. W. M. Gentleman and G. Sande, "Fast Fourier Transforms - For Fun and Profit," presented at 1966 Fall Joint Computer Conference, AFIPS Proc., 29, pp. 563-578, 1966.
40. B. Gold and K. Jordan, "A Direct Search Procedure for Designing Finite Duration Impulse Response Filters," IEEE Trans. Audio, Vol. AU-17, No. 1, pp. 33-36, March 1969.
50. R. J. Graham, "Determination and Analysis of Numerical Smoothing
Weights," NASA Technical Report No. TR-R-179, December 1963.
66. J. B. Knowles and R. Edwards, "… Associated Computational Errors," Electronics Letters, Vol. 1, No. 6, pp. 160-161, August 1965.

74. Mark Tsu-Han Ma, "A New Mathematical Approach for Linear Array Analysis and Synthesis," Ph.D. Thesis, Syracuse University, University Microfilms, No. 62-3040, 1961.
83. A. M. Noll, "Cepstrum Pitch Determination," J. Acoust. Soc. Am., Vol. 41, pp. 293-309, February 1967.

92. R. K. Otnes, "An Elementary Design Procedure for Digital Filters," IEEE Trans. Audio, Vol. 16, No. 3, pp. 330-336, September 1968.
100. C. M. Rader and B. Gold, "Effects of Parameter Quantization on the Poles of a Digital Filter," Proc. IEEE, Vol. 55, No. 5, pp. 688-689, May 1967.

111. R. C. Singleton, "A Method for Computing the Fast Fourier Transform with Auxiliary Memory and Limited High-Speed Storage," IEEE Trans. Audio, Vol. 15, No. 2, pp. 91-98, June 1967.

113. R. C. Singleton, "An ALGOL Procedure for the Fast Fourier Transform with Arbitrary Factors," Algorithm 339, Comm. Assoc. Computing Machinery, Vol. 11, No. 11, pp. 776-779, November 1968.

115. R. C. Singleton, "An Algorithm for Computing the Mixed Radix Fast Fourier Transform," IEEE Trans. Audio, Vol. AU-17, No. 2, pp. 93-103, June 1969.
117. D. Slepian and H. O. Pollak, "Prolate Spheroidal Wave Functions, Fourier Analysis and Uncertainty - I and II," Bell System Tech. J., Vol. 40, No. 1, pp. 43-84, January 1961.
132. C. Weinstein, "Quantization Effects in Frequency Sampling Filters," NEREM Record, p. 222, 1968.
Books

1. M. Abramowitz and I. Stegun, Handbook of Mathematical Functions, Dover Publications, Inc., New York, 1965.

17. E. J. Hannan, Time Series Analysis, John Wiley & Sons, Inc., New York, 1960.
21. E. I. Jury, Sampled-Data Control Systems, John Wiley & Sons, Inc., New York, 1958.