(Alan V. Oppenheim (Ed.) ) Papers On Digital Signal (B-Ok - CC) PDF

Papers on Digital Signal Processing
Papers on
DIGITAL SIGNAL PROCESSING
Alan V. Oppenheim, Editor
The M.I.T. Press CAMBRIDGE, MASSACHUSETTS, AND LONDON, ENGLAND

Copyright © 1969 by
The Massachusetts Institute of Technology
Second printing, November 1970

Third printing, April 1973
Printed and bound in the United States of America

by Halliday Lithograph Corp.
All rights reserved. No part of this book may be reproduced in

any form or by any means, electronic or mechanical, including
photocopying, recording, or by any information storage and
retrieval system, without permission in writing from the publisher.
ISBN 0 262 65004 5 (paperback)
Library of Congress catalog card number: 79-101414

PREFACE
This collection of papers is the result of a desire to make
available reprints of articles o� digital signal processing for use in
a graduate cours e to be offered at M. I.T. The primary objective is to
make these rep.::-ints available in an inexp·ensive and easily accessible
fonn. At the same time it appeared that this collection might be use
ful for a wider audience, and consequently it was decided to rep:'oduce
the articles through the M. I.T. press rather than in a mo re infonTh�l
manner.
The literature in this area is extensive, as evidenced by the
bibliography included at the end of this collectio�. To limit its size,
the articles included were restricted to those that would supplement
rather th.';ln duplicate material available in the book, Digital..R.rocessil'!1l
of S�gn';ll.�, by Gold and Rader. Furthermore, articles were included only
if they would be required reading for the students. Thus, m';lny ��pers
of histo-::oical significance or of practical importance were omitted only
because they did not have to be as accessible to the students.
The articles were selected and the introduction prepared in collab
oration with B. Gold and C. Rader, and I a� extremely grateful for their
assistance. In addition, I would like to exp.::-ess my app-::oeciation to the
journals and the authors for granting permission to have these articles
reproduced in this way.
Alan V. Oppenheim
August 1969
v
INTR01)UCIION
The collectio:1 of articles divides rO·.lghly into four ffi3.jO= catego:::'ies:
z-transfonn theory and digital fIlter design, the effects of finite wo:::,d
length, the fast Fourier transform and spectral analysis, and hardware
considerations in the implementation of digital filters.
The first six papers deal with several issues in z-transfonn theory
and digital filter design. Specifically, in the paper by Steiglitz, one
attitude tOl-lard the relationship between digital and analog signals is
offered. This attitude is illuminating partly because it is an alterna
tive to the relationship usually depicted of a sequence as derived by
sampling a continuous time function. A representation of time functions
in terms of sequences offered in this paper is as coefficients in a
Laguerre series. This representation, as with the representation of
band-limited function:. as a se-cies of sin x/x functions, has the property
that the representation of the convolution of two functions is the dis
crete convolution of the sequences fo:::, each. The discussion by Steiglitz
also provides a basis for carrying o'ler the results on optimum linear
systems with continuous signals to analogo�s results for discrete signals.
Much of the theoretical develop"l1ents in digital signal p:ocessing
have been directed tOlvard reph:::'asing and paralleling results related to
analog processing within the co·ntext of sequences and z-transfonns. An
example of this is the discussion of the Hilbert transform �elations by
Gold, Oppl�nheim, and Rader. This paper p::esents the Hilbert transfo:m
relations in terms of both the z-transform and the discrete Fourier
transform. In addition the design and realization of digital 90° phase
splitters is discussed.
Detailed design techniques for recursive digital filters app·<!ar to
be well established. With the disclosure of the fast Fourier transform
(FFT) , nonrecurs ive filters, i.e. filters with a finite duration impulse
response, took on a new significance. Some design proced.ures, suP?lied
by previous work on tapped delay line filters and phased array antennas,
have been available for some time. We feel in general that in the next
several years there is likely to be more form'll work done to\vard defining
the limitations and design procedures for such filters. The first paper
by Gold and Jordan and the paper by Oppenheim and Weinstein both rehte
to nonrecursive filters. The first of these discusses a new design
procedure for these filters from a frequency domain point of view. The
second discusses a bound and consequently a scaling strategy for use in
implementing such filters using the FFT.
When the use of the fast Fourier transform in digital filtering was
first discussed, it was assumed that it was theoretically limited to
finite duration impulse responses. The note by Gold and Jordan on
vii
digital filter synthesis describes a procedure for using this technique
for implementing recursive filters. While this approach to implementation
does not ap�ear to be computationally efficient, the fact that this can
be done is an important and satisfying conceptual contribution.

One of the important practical issues in the realization of digital
signal processing techniques focuses on the effect of the finite register
length. The finite register length manifests itself in terms of co
efficient inaccuracies, roundoff or truncation of the result of mUltipli
cation, and, if one thinks of the original signal as an analog signal,
then also quantization of the inp'.lt. The next severa l papers are con
cerned with these issues. Specifically, the class of problems which
these papers deal with is the quantization and roundoff effects in
digital filtering and in computation of the discrete Fourier transform
using the FFT. The early papers on quantization and roundoff effects in
65
digital filters were by Kaiser, Knowles and Edwards, * and Gold and
Rader.43 The first of these dealt p=��rily with the p=oblem of co
efficient accuracy, while the second two dealt prim�rily with the effects
of multiplier roundoff noise. Kaiser studies the coefficient sensitivity
problem in terms of sm�ll perturbations of the coefficients and also
proposes an ap?roach, not restricted to small perturbations, which
utilizes the techniques of the standard root-locus problem. Knowles and
Edwards and Gold and Rader focused prim�rily on the effect of quantization
and multiplier roundoff noise. The approach ta�en is to view the errors
statistically and exp�ess the variance of the roundoff noi se and it
appears in the output. In his paper with Olcayto, Knowles p�oposes a
statistical approach to studying coefficient errors. One of the concerns
in all of these papers is the comparison of digital filters implemented
by means of different network configurations such as the direct fonn,
the cascade form, and the parallel form.
In discussing the effects of multiplier roundoff noise, the above
pap.;rs consi,der digital filter realization using fixed-point arithmetic.
When floating-point arithmetic is used, roundoff errors occur in both

l
multiplication and addition. Sandberg 07 has presented one analysis of
the effect of roundoff errors when floating-point arithmetic is used.
His approach is to provide bounds on the errors that occur. In contrast,
Kaneko and Liu present a statistical app:coach to studying roundoff noise
in floating-point filters. In the note by �einstein and Oppenheim, some
experime
· ntal verification
presented, together with a comparison between filter implementation using
fixed-point and floating-point arithmetic. An alternative realization
� --
Referen�e nmnbers refer to the Bibliograp�y
-- ---
viii
that combines some of the aspects of both floating-point and fixed-point
arithmetic is the use of a block-floating-point structure as proposed in
the paper by Oppenheim. This approach to realizing digital filters may
potentially offer an aqvantage on sm�ll word length computers when high
gain filters are being simulated.
In each of the p�eceding papers the representation of digital filters
is in terms of digital networks. In the paper by Mantey the formulation
is in terms of state-space equations, and in p.�rticular
coefficient accuracy is phrased in terms of the sensitivity of the
e igenvalues of a matrix. One of the asp':!cts of this paper is the rep··
resentation of the various digital filter forms in state-space terms.
In analyzing roundoff errors in the comp�tation of the fast Fourier
transform, approaches similar to those already discussed have been taken.
One of the first analyses, carried out by Gentleman and Sande, 3 7 focused
on floating-p·oint computation. The approach taken was to derive an
upper bound on the square root of the sum of the squares of the error
sequence. In contrast, Weinstein presents a statistical analysis of
noise in the floating-point comp:.ltation. This discussion is restricted
to the radix 2 FFT algorithm and deals mainly with arithmetic using
roundoff rather than truncation. A detailed statistical analysis of

the FFT with fixed-point arithmetic appears in the paper by Welch. In
this and the analysis by Weinstein, the output error is represented in
a mean square sense.
With regard to the effect of qu.antization and arithmetic roundoff
in both digital filtering and FfT comp�tation, it appears in general
that upper bounds derived tend to 1;)e overly pessimistic. In fixed-point,
floating-point, and block-floating-point realizations, experiments have
substantiated the statistical analyses presented and this statistical
approach app·:aars to p�ovide a more analytically tractable means for
analyzing a given configuration. It should be stressed that the outcome
of these analyses should be viewed as pro'Jiding techniques that can be
applied to a particular p=oblem ::ather than as a set of definitive
statements regarding the o'lerall acceptability of one configuration or
kind of arithmetic as opposed to another. In particular, hardware
considerations, the details of the application, etc., introduce tradeoffs
which m�st be considered together with roundoff issues.
In contrast with the ap?arently successful and useful results with
regard to arithmetic roundoff and input quantization, the problem of
designing within the frame,vO'=k of coefficient truncation does not yet
appear to be totally resolved. In a general sense, the relative merits
of the direct form as compared �ith the cascade or p�rallel realizations
of digital filters have been defined. More spl�cifically, hmvever, design
ix
procedures for digital filters which blend into the approximation problem
the p�actica1 requirements of finite word length coefficients are not yet
available.
The literature on the fast Fourier transform suffers from an
abundance of p�ints of view. The basic idea leading to efficient com
putation of a discrete Fourier transform (DFT) of a composite number of
points by making use of the periodicity of exp (2n nk/N) , may be traced
26
back to about the turn of the century (Cooley et a1. ),
and it appears
8
in the form of a detailed algorithmic procedure in papers by Good4 and
3l
Danielson and Lanczos. Nevertheless, almost all current interest in
the FFT derives from the publication of the pap·er by Cooley and Tu�ey in
1965. This p·:1per reveals the algo�ithm in terms of the general structure
of a mixed radix representation of the time and frequency indices and
goes into some detail about the case where all the radices are equal.
The optimum radix is found to be about 3, although the sensitivity of
the dependence is so small that subsequent authors have been able to
prove that other radices are superior by examining the p�ssibi1ities of
small savings neglected by Cooley and Tukey, based on the synunetries of
sine and cosine functions rather than the periodicities. Cooley and
Tukey did recognize the special interest in the case of radix two
algorithms .
Many other subsequent papers made use of the radix representation
of indices to write about the fast Fourier transform, but this point of
·
viet." while satisfying to s ome people, has been unsatisfying to some
others. Other p�ints of view can be taken to derive the algorithm and
suggest modifications, including that of �.trix factorization, expressing
a DFT comp'J.tation in terms of a combination of the results of several
smaller DFT com�inations, exp�essing a one-dimensional DFT as (almost)
a multidimensional DFT, and others. The papers in this collection all
take the Coo1ey-Tukey p�int of view. The Cooley and Tukey paper, in
addition to presenting the basic algorit�, discusses the in-place
properties and the perplexing bit reversing p=operties of the algorithms.
Bergland's pap·er discusses one common way to exploit synunetries in
further reducing the computation required for a DFT (and probably the
error, though this has not been investigated), namely the development
of radix 4, radix 8, and higher radix FFT algorithms. Still other
savings are to �e found when sp.:;cia1 factors are segregated and handled
by sp�cial subroutines. These latter savings are important for moderate
sized DFT computations but they are a small saving for large computations,
5
say N greater than 10 • Higher radix FFT algorithms also permit a better
match between arithmetic sp.:;ed and memory sp,:;ed in sp·ecia1 purp·�se
systems; this implies, of course, some high speed storage in the arithmetic
unit.
x
The paper by Singleton provides several valuable FFT tricks. For
the case of arbitrary and mixed radix FFT programs, bit reversing and
in-place comp�tation (including in-place bit reversing) can be complicated
unless the algorithm is appropriately structured. Singleton also shm'ls
how to program a DFT efficiently when one or more of the prime factors
of the number of points is odd (special attention being given to a
factor of 5). Another important section of the paper deals with the
compl.1tation of the required constants by iterative techniques, which
can reduce the storage requirements for an FFT program. This must be
done carefully, as it can be a grave source of error. An appandix of
the p·ap.er,
of the paper, has been deleted.
The papers by Bluestein and Rabiner et a1. describe the implications
for DFT comp�tation, of the discrete equivalent of chirp filtering.
Bluestein derives an algorithm which is structured as a digital network
but is comp'3titive with the FFT when the number of points is the square
of a p::-ime, or the square of a small composite integer. This network
involves the convolution of an input signal with a chirp sequence.
The Rabiner paper details the considerable flexibility which results
when the convolution is implemented by means of the FFT.
The early papars about the FFT have concentrated on ho\'l to program
it. An exception was the description of the use of the algorithm to
12 2
compute lagged products. As time passes we expect to see more papers
dealing with applications of the algorithm, and with quantization effects.
The last two papers discuss hard\'lare considerations. The under
standing of the z-transfo�, filter theory, and quantization problems
form the theoretical basis underlying the construction of digital signal
processing equipment. Within the past few years, quite a few devices
have been built: for p.erforming digital filtering and the discrete Fourier
transform; hO\'lever, only two papers of this collection deal with the
hard,,,are question. While the various devices built have been of con
siderable interest, only these two p.'3.pers dealt at sufficient length
with &.�nera1 hardware issues. The Jackson, Kaiser, McDonald paper
described the well-knmm digital filtering forms, proposed a way of
modu1arizing the components of a digital filtering system, discussed
the important multiplexing question, and combined these general points
into the description of a digital touch-tone receiver which they built.
The particular form that appears to be most useful is the cascade or
serial form; both the parallel form and the direct form seem to be
more sensitive to parameter errors. Our o'l1n exp erience verifies this;
an additional point is that each 1in� in the cascade could beneficially
be a coupled-form digital network, especially if spectral accuracy at
lOT" frequencies is desired.
xi
Whereas the choice of the proper form is dependent on digital
network principles, modularity considerations are also influenced by
the state of component technology. Since the latter changes relatively
rapidly with time, modularity prop:>sals that m:l.y be optimum no�oJ could
conceivably be less so in the future. Jackson, Kaiser, and McDonald
have devised an elegant set of modules consistent with present-day MSI
(medium-sized integration) component techniques. For example, the
multiplier has been conceived to be modular on a bit by bit basis, each
bit containing all the required gating an d carry logic. Thus, for
example, the word length can be extended by adding one module. If, in
the future, a sufficiently high packaging density leads to economic
array multipliers, the resultant speed increase and simplification of
control could greatly change modularization tactics.
The other harcrN'are-oriented contribution, by Bergland, reviews
general aspects of special purp:>se fast Fourier transfo:::m systems.
Bergland is pri.'1Iarily concerned �oJith the relation between the theoreti
cally possible computation.:!.l sp,aed of an FFT as a function of the number
of parallel memory and arithmetic modules in the system. This is an
interesting and important aspect of FFT device design, but it is well
to remember that it is only the bare beginnings of tl1e use of an
algorithm which has a startling number of diverse forms. We think,
for example, that the relation between the hardware design and the radix
used, the bit-reversal question, the in-place versus not-in-place
algorithm, and the relative speeds and costs of fast and slorN memo::-ies
are fertile area for general investigations.
Unfortunately, as yet, no thorough study of the effect of digital
processing algorithm.3 on the design of general p'.1rp:>se computers is
available. We wo'ald like to make the point that, as Bergland has
demonstrated in one particular case, parallelism leads directly to
increased speed; this is true for general as well as special purpose
hardTN'are. Comp'.1ter architects have, in the past, had great conceptual
difficulty in justifying any given parallel p:o:ocessing structure; in our
opinion, the study of the llk1.ny interesting structural variations of
digital signal processing algorithms could help set up useful criteria
for the effectiveness of general purpose p,3.rallel comp'.ltation.
Bernard Gold
Alan V. Oppenheim
Charles Rader
xii
CONTENTS
K. Steiglitz, "The Equivalence of Digital and Analog Signal 1
Processing," Infonnatio':1 and Control, Vol. 8, No. 5, October 1965,
pp. 455-467.
B. Gold, A. V. Opp·2nheim, and C. M. Rader, "Theory and Implementa

14
tion of the Discrete Hilbert Transform, " Polytechnic Institute of
Brooklyn, Proceedings of the Symposium on Comp'.lter Processing in
Communications, 1969.
B. Gold and K. Jord.an, "A Note on Digital Filter Synthesis " 43

Proc��EEE, Vol. 56, No. 10, October 1968, pp. 1717-1718 ( i etters) .
B. Gold and K. Jordan, I � Direct Search Procedure for Designing

Finite Duration Imp'.llse Response Filters," IEEE Transactions on
45
Audio and Electroacou�tics, Vol. AU-17, No. 1, March 1969, pp. 33-36.
A. V. Oppenheim .:md C. Weinstein, "A Bound on the OutP'.lt of a

Circular Convolution with Application to Digital Filtering,"
IEEE Transactions on Audio and Electroacoustics, Vol. AU-17, No. 2, 49
June 1969, pp. 120-124.
J. F. Kaiser, "Som,= Practical Considerations in the Realization of

Linear Digital Filters, " Proceedings Third Allerton Conference on
54
Circuit and System Theory, October 1965, pp. 621-633.
C. Weinstein and A. V. Oppenheim, I � Comparison of Roundoff Noise

"
in Flo.ating Point and Fixed Point Digital Filter Realizations �
�roc. IEE�, Vol. 57, No. 6, June 1969, P? 1181-1183 (letters). 67
A. V. Oppenheim, "Realization of Digital Filters Using Block-Floating-

Point Arithmetic, " Submitted for p·.lblication to IEEE Transactions 70
on Audio and Electroacoustics.
T. Kaneko and B. Liu, "Roundoff Error of Floating-Point Digital

Filters, " Proceedings of the Sixth Allerton Conference on Circuit
94
and System Theoi:Y, October 2-4, 1968, pp. 219-227.
J. B. Knowles and E. M. 01cayto "Coefficient Accuracy and Digital

Filter Response " IEEE Transact Ions on Circuit Theory, Vol. CT-15,
!
No. 1, March 19b8, pp. 31-41. 103
P. E. Mantey, "Eigenvalue Sensitivity and State-Variable Selection, "

IEEE Transactio:J.s on Automatic Control, " Vol. AC-13, No. 3, June 1968, 114
pp. 263-269.
P. Welch, "A Fixed-Point Fast Fourier Transform Error Analysis, "

IEEE Transactions on Audio and Electroaco·.lstics, Vol. AU-17, No. 2,
June 1969, pp. 151-157. 121
C. weinstein "Ro'.lndoff Noise in Floating Point Fast Fourier Transform

,
Computation, I IEEE Transactions on Audio and Electroacoustics,
Vol. AU-17, No. 3, September 1969. 128
J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine

Calculation of Complex Fourier Series," Mathematics of Computation, 146
Vol. 19, No. 90, April 1965, pp. 297-301.
L. R. Rabiner, R. W. Schafer, and C. M. Rader, "The Chirp z-Transform

Algorithm, " IEEE Transactions on Audio and Electroacoustics, Vol.
AU-17, No. 2, June 1969, pp. 86-92. 151
xiii
G. D. Bergland, I� Fast Fourier Transfo� Algorithm Using Base 8 158
Iterations " Mathematics of Comp=.Itatlon, Vol. 22, No. 102, Ap=il 1968,
!
pp. 275-27';1.
R. C. Singleton, I�n Algorithm fo� Computing the Mixed Radix Fast

Fourier Transform, " IEEE Transactions on Audio and Electroacoustics, 163
Vol. AU-17, No. 2, June 1969, pp. 93-100.
L. 1. Bluestein, "A Linear Filtering App:;:ooach to the Computation

of the Discrete Fourier Transform, " 1968 NEREM Record, pp. 218-219. 171
L. Jackson, J. F. Kaiser, and H. S . McDonald, "An Approach to the

Implementation of Digital Filters, " IEEE Transactions on Audio and 173
Electroacou�tic�, Vol. AU-16, No. 3, Septem�er 1968, pp. 413-421.
G. Bergland, "Fast Fourier Transfo:':l1l Hard'Hare Implementations-

An Overview, " IEEE Transactions on A�dio .:lnd Electroacoustics,
Vol. AU-17, No. 2, June 1969, pp. 104-108. 182
Bibliography 187
xiv
Papers on Oigital Signal Processing
Reprinted from I�FoRMATIO)oi AXD CO)oiTROL. Volume 8. No.5. October 19135
Copyright © by Academic Press Inc. Printed in U.S.A.
INFORMATION AND CONTROL 8, 455-4Q7 (19Q5)
The Equivalence of Digital and Analog

Signal Processing*
K. STEIGLITZ
Department of Electrical Engineering, Princeton University, Princeton, New Jer.�ey
A specific isomorphism is constructed via the transform domains

between the analog signal space L2 ( - 00, 00 ) and the digital signal
space l2. It is then shown that the class of linear time-invariant
realizable filters is invariant under this isomorphism, thus demon
strating that the theories of processing signals with such filters are
identical in the digital and analog cases. This means that optimi
zation problems involving linear time-invariant realizable filters
and quadratic cost functions are equivalent in the discrete-time
and the continuous-time cases, for both deterministic and random
signals. Finally, applications to the approximation problem for
digital filters are discussed.
LIST OF SYMBOLS
ret), get) continuous-time signals
F(jw), G(jw) Fourier transforms of continuous-time signals
A continuous-time filters, bounded linear transformations
of r} ( - O(), 00)
{fn}, {gnl discrete-time signals
F(z), G(Z) z-transforms of discrete-time signals
A discrete-time filters, bounded linear transformations of
�
f.L isomorphic mapping from L2 ( - 00, 00 ) to 12
'ijL2 ( - 00, 00 ) space of Fourier transforms of functions in L2( - 00 ,00)
312 space of z-transforms of sequences in 12
An(t) nth Laguerre function
* This work is part of a thesis submitted in partial fulfillment of requirements
for the degree of Doctor of Engineering Science at New York University, and
was supported partly by the National Science Foundation and partly by the Air
Force Office of Scientific Research under Contract No. AF 49-(638)-586 and Grant
No. AF-AFOSR-62-321.
© 1965 by Academic Press, Inc., New York, New York 10003.
1
STEIGLITZ
I. INTRODUCTION
The parallel between linear time-invariant filtering theory in the
continuous-time and the discrete-time cases is readily observed. The
theory of the z-transform, developed in the 1950's for the analysis of
sampled-data control systems, follows closely classical Fourier transform
theory in the linear time-invariant ca.'5e. In fact, it is common practice
to develop in detail a particular result for a continuous-t.ime problem
and to pay less attention to the discrete-time case, with the assumption
that the derivation in the discrete-time case follows the one for con
tinuous-time signals without much change. Examples of this ("an be
found in the fields of optimum linear filter and compensator design,
syst.em identification, and power spectrum measurement.
The main purpose of this paper is to show, by the construction of a
specific isomorphism between signal spaces L2 ( - co, co ) and l2 , that the
theories of processing signals with linear time-invariant realizable filters
are identical in t.he continuous-time and the discrete-time cases. This
will imply the equivalence of many common optimization problems
involving quadratic cost functions. In addition, the strong link that is
developed between discrete-time and continuous-time filtering theory
wiiI enable the data analyst to carry over to the digital domain many of
the concepts which have been important to the communications and
control engineers over the years. In particular, all the approximation
techniques developed for continuous-time filters become available for
the design of digital filters.
In the engineering literature, the term digital filter is usually applied
to a filter operating on samples of a continuous signal. In this paper,
however, the term digital filter will be applied to any bounded linear
operator on the signal space l2, and these signals will not in general
represent samples of a continuous signal. For example if {xn} and {Yn}
are two sequences, the recursive filter
Yn = Xn - 0. 5 Yn-l
will represent a digital filter \vhether or not the Xn are samples of a
continuous signal. The important property is that a digital computer
can be used to implement the filtering operation; the term numerical
filter might in fact be more appropriate.
II. PHELIMINAHIES
The Hilbert space L2 ( - co, co ) of complex valued, square integrable,
Lebesgue measurable functions f(t) will play the role of the space of
2
DIGITAL AND ANALOG SIGNAL PROCESSING
continuous-time signals. The Hilbert space 12 of double-ended sequences

of complex numbers {fn}:-_oo that are square summable will play the
role of the space of discrete-time signals. A function in L2 ( - 00 , 00 )
will be called an analog signal, and a sequence in 12 wiI] be cn.Iled a dig
ital signal. Similarly, a bounded linear transformation A of L2 ( -00,00 )
will be called an analog filter, and a bounded linear transformation
A of /2 will be called a dig�tal filter. An analog filter A will be called
time-invariant if
A: f(t) --. g(t), f(O, get) E L2 ( -00, 00 ), ( 1)
imp1ies
.1:f(t + T) --. get + T) (2)
for every realllumber T. Time-invariant analog filters can be represented
by the cOllvolution integral
get) = i:f(T)a(t - T) dT, (3)
where aCt), the impulse re!'>ponse of the filter A, need not belong to
[} ( - 00, 00 ) . Similarly, a digital filter A will be called time-inmriant if
(4)
implies
(;'; )
for every integer P. Time-invariant digital filters can be represented by
the convolution summation
(6)
where the sequence {an}, the impulse response of the filter A, need not
belong to l2 •
Our program is to construct a specific isomorphism between the

analog signal space and the digital signal space via their isomorphic
transform domains. Hence, we now define the Fourier transform on the
analog signal space, mapping [} ( -00, 00 ) to another space L2 ( -00, 00 )
called the Fourier transform domain and denoted by fjL2 ( -00, 00. ) .
.We need the following key results (Wiener, 1933; Titchmarsh, 1948):
TIIEORE�t 1 (Plancherel). [ff(t) E L2 ( -00, 00 ), then
1
R
F(s) l.i.m. f(t)e-at £It (7)
n-+oo -R
=
3
STEIGLITZ
exists Jar s jw, and F(jw) E L2 ( -00, 00 ) . Purthermore,

l ioo I F(s W ds,
=
-
1
(j,J) =
Icc I J(t)12 dt
-00
= 2 '
7rJ -JOO
(8)
and
J(t) = l.i.m.
R-+OO
fiR F(s)e·t ds.
J-iR
(9)
Analytic extension of F(jw) to the rest of the s-plane (via (7) when it
exists, Jar example) gives the two-sided Laplace transform.
THEORE:\I2 (Parseval). If f(t), get) E L2 ( - oc, 00 ), then
-
1 1iOO. F(s)G*(t) ds.
(j, g) =
Icc f(t)g*(t) dt
-DC
= 2
7rJ
'
-JOO
(10)
The theory required for the analogous construction of a z-tmnsform

domain for di gital sign als is really no more than the theory of Fourier
series. Consider the di gital signal as a s eq uen ce of Fourier coe fficients ,
and consider the periodic function with these Fourier coefficients as the
z-transform evaluated on the unit c ircle in the z-plane. The Riesz
Fischer Theorem (Wiener, 1933) then reads:
TlIEORE:\13 (F. Riesz-Fischer). If {fnl E l2, then
N
F(z) = l.i.m. L: fnz-n (11)

N"OO n--N
exists Jar z ei"'T, and F(ei"'T) E L2 (0, 27r/T), where (,) is the inde�
=
pendent variable of L2 (0, 27r/T), and this (,) is unrelated to the w used in
the s-plane. Furthermore,
(12)
and
dz ,
in 2
1
. i F(z)zn (13)
'!
=
7rJ z
where integrals in the z-plane are around the unit circle in the counter
clockwise direction.
As in the analog case, the analytic extension of F ( ei"'T) to the rest
of the z-plane will coincide with the ordinary z-transform, which is
usually defined only for digital signals of exponen tia l order.
4
DIGITAL AND ANALOG SIG�AL PROCESSING
TUEORElI4 (Parseval). If {fn}, {gn} E l,.z, then
({fn }, {gn }) f fn gn* 2..!... 1 F(z)G*(Z) liz. (14)

n--""
'f
=
z
=
7rJ
We denote the space L2 (0, 27r/T) of z-transfonns of digital signals by �l2 .
III. A SPECIFIC ISO:\lOUPIIISl\l BETWEE� TIlE AJ:\ALOG AXD DIGITAL
SIGNAL SPACES
Intuitively, if we wish to connect the space of analog signals with the

space of digital signals in such a way as to preserve the time-invariance
and realizability of filters, we should �omehow connect the jw-axis ill
the s-plane with the unit circle in the z-plane. The natural correspondence
provided by the instantaneous sampling of analog signals matches e'T
with Z, but is not one-to-one and hence cannot be an bomorphism.
The next natural choice is the familiar bilinear transformation
z - 1 1+ s
s= z (15)
1 -s
= --.
z+I'
There is an additional factor required so that the transformation will
preserve inner products. Accordingly, the image {fn} E 12 corresponding
to f(t) E L2 ( - co, co ) will·be defined as the sequence with the z-trans
form
V2 z - 1 ( (16)
)
=Z+lF z+l .
Thus the mapping L2 ( - co, co ) � 12 is defined by a chain which goes

from L2 ( - co, co ) to 'ijL2 ( - co, co ) to �12 to 12 as follows:
p.:f(t) � F(s) � z
V2 z 1 ( ) F(z) � {fn}. (17)
+1 F z +1
=
The inverse mapping is casily defined, since each of these steps is

uniquely reversible:
p.-l: {fn l � F( z) �
l-s
V2 F (1l-s
+s
) = F(s) � f(t). (18)
We then have
THEORElI 5. The mappl:ng
p.: L2 ( - co, co ) � l2
defined by (17) and (18) is an isomorphism.
5
S'fEIGLITZ
Proof: J.L is obviously linear and onto To show that it preserves inner
.
product, let z = (1 + s)/(1 - s) in Parseva1's relation (10), yielding
(j, g) = ./:-.lioo F(s) G*(s) ds = J:-. 1 F(z)G*(z) d

z
_7rJ - i'Xl _7rJ l' Z (1 g)
= ({f"l, {gIl})'
We can show that J.L is one-to-one in the following way: if f ¢ g, then
(j - g, f - g) ({f..} - {gIl}, If,,} - {gn}) ¢ OJ which implies t ha t
=
Ifni ¢ {gIl}, and hence that J.L is one-to-one.

We note here that under the i:,omorphisms J.L and J.L-1 si gnals with rational
transforms arc always matched with si gnals ,,,ith rational transforms, a
convenience when dealing with the many signals ('ommonly encountered
in engineering problems with tra n:,forms which are rational functions of
s or z.
IV. THE ORTIIOXOIUIAL EXPANSIO� ATTACHED TO p.
2
The usual way of defining an i:-;omorphism from L ( - 00, (0 ) to 12 is
to map an arbitrary fu nction in /} ( - 00, (0 ) to the seque nce in 12 of
its coefficients in some orthonormal exp an sion. It comes as no surprise,
then, that the isomorphism J.L ('ould have been so generated. This section
will be d evot ed to finding this o rthon ormal expnn:,ion.
We start with the z transform of the digital ::: i g nal If,,} which is the
-
image under J.L of an arbitrary analog signal f(t):
_r 00
v 2
, (-) z 1
n�oo f"
-"
1 [.
F(z) = = z (20)
+ +
•
z z 1
By (13), the formula for the nverse
i z-transform, we have
�. 1 v2 F ( z - 1 n dz.
fll =
) z
2;rJ l' +
(�l)
+ z 1 z 1 z
Letting z = (1 + s)/(1 - s), this i n tegral becomes
1
f = ()""
. _7rJ
. Iioo. F(s) -
-J'Xl
V2 ( + s ) " £Is.
+ - 1 S
1
1 _ S
(22)
By Par�eval's relation (10) thi::; can be written in terms of time f unc tions
f" = L: f(t)An(t) £it, (23)
6
where the An(t) are given by the following inverse two-sided Laplace
transform
�n (t) =
£-1 [ V2 (�)nJ. (24)
1 -8 1 +8
We see immediately that, depending on whether n > ° or n � 0, An (t)
vanishes for negative time or positive time. By manipulating fi standard
transform pair involving Laguerre polynomials we find:
n = 1, 2, 3,
. (2<";)
n = 0, -1, -2, ..
where u(t) is the Heaviside unit step function, and Ln(t) is the Laguerre
polynomial of degree n, defined by
t n
Z
Ln (t) =
:... _
c n (tn -I)
e , n 0, 1, 2, .. .
=
(26)
n! di
The set of functions An (t), n = 1, 2, 3, ... , is a complete orthonormal
set on (0, 00 ) and are called Laguerre functions. They have been em
ployed by Lee (1931-2), Wiener (1949); find others for network synthe�
sis; and are tabulated in Wiener (1949), and, with a slightly different
normalization, in Head and Wilson (19:>6). The functions An(t), n =
0, -1, -2, . . . , arc similn.rly complete find orthonormal on ( .:..... 00, 0),
so that the orthonormal expansion corresponding to (23) is
00
J(t) == L fnAn(t) . (27)

n-=-:)()
We see that the values of the digital signal {fn} for n > ° correspond to
the coefficients in the Laguerre expansion of J(t) for positive t; and that
the values of {fn} for n � ° correspond to the coefficients in the Laguerre
expansion of J(t) for negative t.
v. THE INDUCED MAPPING FOR FILTEIlS

Thus far, we have explicitly defined four isomorphic Hilbert spaces fiS
follows
(28)
7
STEIGLlTZ
Therefore an analog or a digital filter as a bounded linear transformation

has image transformations induced on the remaining three spaces.
A time-invariant analog filter A, defined by the convolution integral
2
(3), has an image in fjL ( - (x;) , (x;) , in 12, and in 512. Its image in
fjI} ( - (x;) , (x;) is multiplication by A (s), the Fourier transform of
a(t). Its image A in 12 can be found in the following way: let x be any
digital signal. There corresponds to x a unique analog signal IL -I(X).
The result of operating on this analog signal by the analog filter A,
AIL-1(X), is also uniquely defined. This new analog signal can then be
mapped by IL into a unique digital signal ILAIL-l(X), which we designate
as the result of operating by A on x. Thus we define A to be the composite
operator
(29)
To find the image of the analog filter A in 512 , notice that the Fourier
transform of the analog signal Ai is A(s)F(s) and the z-tmnsform of the
digital signallLAi is
V2 1 z - 1 ( ) (z - 1) . (30)
z+ 1.l z+ 1 F z+ 1
Therefore, the image in 512 of A and hence of A is multiplication by
.-\(z) = A
(z - 1) . (31)
z+ 1
Similarly, a time-invariant digital filter A has an image in 5l.z given by
multiplication by A(Z), the z-transform of the impulse response {anl;
2
an image in L ( - IX), (x;) given by
A = IL- AIL� ) ;
I
(32)
and an image in �L2 ( - (x;) , IX» given by multiplication by
A(s ) = A
(1 + ) s
(33)
1 _ s .
We have therefore proved
TUEoHEM U. The isomorphism IL always matches time-invariant analog
filters A with time-invariant digital filters A. Furthermore.
A(Z) = A
(z 1) '
-
( 3 4)
z+ 1
8
and
A(s) A
(1 + ) s
(35)
1 s
=
_ •
Those time-invariant filters which are physically realizable in the

sense that they are nonanticipatory are of great importance in many
fields. A time-invariant analog filter A wi ll be called 1'ealizable if A J 0 =
for t < 0 whenever J 0 for t < O. Similarly, a time-invariant digital

=
filter A will be called 1'ealizable if A{fn} 0 for n � 0 whenever IfnI

= 0 =
for n � O. It is an important property of the mapping JJ. that it preserves

the realizability of time-invariant filters. To see this, suppose first that
A is a time-invariant realizable analog filter. Let IfnI be any digital
signal for which IfnI 0 for n � O. Then its analog image J(t) is such
=
that J (t) = 0 for t < 0, by (27). Thus A J 0 when t is negative, which

=
implies that A{fn} 0 for n � 0, by (23). Hence A is a realizable digital

=
filter. The same argument works the other way, and this establishes:
THEOREM 7. The mapping JJ. always matches time-invariant 1'eaHzable
analog filters with Hme-invariant realizable digital fillers.
VI. OPTIl\nZATIO� PROBLEl\IS FOIl SYSTEl\IS WITH DETEIU\IINISTIC

SIGNALS
We are no,Y in a position to see how some optimization problems can

be solved simultaneously for analog and digital signals. Suppose, for
example, that a certain one-sided analog ret) is corrupted by a known
additive noise n(t), and that we are required to filter out the noise with
a stable, realizable time-invariant filter II whose Laplace transform is,
say, lI(s). If we adopt a least integral-square-error criterion, we require
that
100 [r H(r+ n)}2 dt = min. ( 3G )
As described by Chang ( l9Gl ) , this can be transformed by l'a rseval 's

relation to the requirement
1
_
1iOO [R - H(R + N)][R - H(R+ N)]* ds min., (37)
2trJ. j-x>
=
where R, H, and N are functions of s, and ( ) * means that s is replaced

by -s. It can be shown, using an adaptation of the calculus of variations,
that the realizable solution for H(s), say Ho(s), is given by
9
STEIGLITZ
11o s)
( =
�y [(R +y*N)*R] ' (38)
LHP
where
yy* =
(R + N)(R + N)* (39)
Y has only left-half plane poles and zeros, and y* has only right-half
plane poles and zeros. The notation [ lLHP indicates that a partial
fraction expansion is made and only the terms involving left-half plane
poles are retained.
The fact that a least integral-square-error criterion is used means
that the optimization criterion (36) can be expressed within the axio
matic framework of Hilbert space. Thus, in L2 ( - 00, 00 ), (36) becomes
111' - lI(r + n) 1I =
min. (40)
If we now apply the isomorphism p. to the signal l' - lI(r + n), we have
II r - lI(r + n)1I =
II p.[/, - lI(r + n)lll =
II r - n(r + n)lI, (41)
since p. preserves norm. Hence lIo is the solution to the optimization
problem
II r - n(r + n)1I =
min. (42)
Furthermore, since JL match es one-sided analog signals with one-sided

digital signals and realizable time-invariant analog filters with realizable
time-invariant digital filters, we see that Ho is the solution to a digital
problem that is completely analogous to the original analog problem.
Thus
(z - )
1 Z [ (R + N) *R]
Ho(Z)
Q
lIo
Z+ 1 n
= =
ZQ * i '
where
QQ* = (R + N)(R + N) *. (44)
Now R, N, and 110 are functions of Z; ( )* means that Z is replaced by
Z-l; Q and Q* have poles and zeros inside and outside the unit circle
ref';pectively; an d the notation [ lin indicates that only terms i n a
partial fraction expansion with poles inside the unit circle have been
retained.
In other optimization problems we may wish to minimize the norm
10
L
DIGITAL AND ANA OG SIGNAL PROCESSING
of some error signal while keeping the norm of some other system signal
within a certain range. In a feedback control system, for example, we
may want to minimize the norm of the error with the constraint that
the norm of the input to the plant be less than or equal to some pre
scribed number. Using Lagrange's method of undetermined multipliers,
this problem can be reduced to the problem of minimizing a quantity of
the form
(45)
where e is an error signal, i is some energy limited signal, and both e
and i depend on an undetermined filter II. Again, if Ho ( s) is the time
invariant realizable solution to such an analog p roblem , then IIo(Z) is the
time-invariant realizable solution to the analogous digital problem
determined by the mapping p,.
M ore general ly, we can s tate
THEOREM 8. Let v be an isomorphism between L2 ( - 00, 00 ) ' and 12 •
Further, let the following optimization problem be posed in the analog

signal space L2 ( - 00, 00 ) : Find analog filters Hl, H2, " ', II" which
?ninimize some given function of some norms in a given analog signal
transmission system and which are in a class of filters X. Then if the class
of filters X is invariant under v, the corresponding digital 'problem, is
equivalent to the original analog problem, 'In the sense that, whene�'er one
can be solved, the other can be also. In part'ICular, when v 1S p" X can be
taken as the class of tune-invariant filters or the class of time-invariant
realizable filters. In this situation, the optimum filters are related by
Hj(Z) = Hi
(z - 1) '
+
i = 1,2, 3, .. . ,n. (46)
Z 1
VII. RANDOM SIGNALS AND STATISTICAL OPTBIIZATIOX PROBLEl\IS
While the considera t ion of systems ,,"ith deterministic signals is

import.ant for many theoretical and p ractical reasons, it is often the
case that the en g i neer knows only the statistical properties of the input
and disturbing sig nals. For this reason the design of systems on a sta
tistical basis has become increasingly important in recent years. The
method of connecting continuous-time theory with discrete-time theory
described above can be extended to the random case in a natural way if
we restrict ourselves to random processes which are stationary with zero
mean, ergodic, and have eorrelation functions of e xpone ntial order.
For our purposes, such processes will be characterized by their second
11
STEIGLlTZ
order properties. In the analog case these are the correlation function
¢ZI/(t) and its Fourier transform 4> I/( s ) . In the digital case these are the
Z
correlation sequence 9"I/(n) and its z-transform �ZIJ(z).
We define the mapping p. for correlation functions in the following
way, motivated by mapping the signals in the ensembles by the iso
morphism p. for signals:
(·17)
The inverse mapping is

-I 1+s
p. : 9,rI/(n) -+ (1',ry(Z) -+ �
I-s-
(I',rl/ (
l-s
) = 4>zy{s) -+ cf>,ry(t). (48)
The important invariants under p. are the quantities

f/>ZII(O) == E[z(t)y(t)], (49)
and
9�,,(O) == E[x.y.], (50)
which correspond to the inner products in the deterministic case. As
before, time-invariant filters are matched with time-invariant filters,
and time-invariant realizable filters are matched with time-invariant
realizable filters. Hence, we have
TlIEORE�l 9. Let the following optimization problem be posed for random
analog signals: Find analog filters III, II2, " ', lIn which minimize
some gil'en function of the mean-square values of some signals 1'n an analog
signallransmission system and which are in a class of filters X. Then if X
is the class of time-invariant filters, or the class of time-invar£ant realizable
filters, the corresponding d£gital problem is equivalent to the or£ginal analog
problem 1'n the sense that, whenever one can be solved, the other can be also.
If the correlation functions and power spectml densities are related b!J p,
the optimum filters are again related b!J (46).
In :summary, we have shown that in the time-invariant case the
theory of proces�ing analog signals and the theory of processing digital
signals are the same.
YIII. TIlE .APPHOXIl\IATlO� PHOBLE:\I FOR DIGITAL FILTER:=;
The mapping p. can be used to reduce the approximation problem for
digital filters to that for analog filters (Steiglitz, 1962; Golden and
Kaiser, 1064). Suppose that ,,"e wish to design a digital filter with a
rational transform and a desired magnitude or phase characteristic as a
12
function of w, -7r/T � w � 7r/T. For real frequencies the transfor

mation p. relates the frequency axes by
'" = tan wT /2. (51)
We can therefore transform the desired characteristic to a function of '"
simply by stretching the abscissa according to (51). This new character
istic can be interpreted as the frequency characteristic of an analog
filter, and we can approximate this with the rational analog filter A(s).
A(Z) = A«z - l)/(z + 1» will then be a rational function digital
filter with the appropriate frequency characteristic. l\Iany of the widely
used approximation criteria, such as equal-ripple or maximal flatness,
are preserved under this compression of the abscissa. Also, by Theorems
6 and 7, the time-invariant or the time-invariant realizable character
of the approximant is preserved. Applications to the design of windows
for digital spectrum measurement are discussed elsewhere (Steiglitz,
1963).
ACKNOWLEDGMENT
The author wishes to express his gratitude to Professor S. S. L. Chang for his
many valuable comments during the course of this work.
RECEIVED: January 22, 1964

REFERENCES
CHANG, S. S. L. (1961), "Synthesis of Optimum Control Systems," Chaps. 2-6.

McGraw-Hill, New York.
GOLDEN, R. 1\1., AND KAISER, J. F. (1964), Design of wideband sampled-data
filters. Bell System Tech. J. 43, No.4, Pt. 2, 1533-1546.
HEAD, J. W., AND WILSON, W. P. (1956), "Laguerre Functions: Tables and Prop
erties." Monograph No. 183-R of the Institution of Electrical Engineers.
LEE, Y. W. (1931-2), Synthesis of electrical networks by means of the Fourier
transforms of Laguerre's functions. J. Math. Physi 11, 83-113.
STEI GLITZ , K. (1962), "The Approximation Problem for Digital Filters," Tech.
Rept. no. 400-56 (Department of Electrical Engineering, New York Uni
versity).
STE I GLI T Z , K. (1963), "The General Theory of Digital Filters with Applications
to Spectral Analysis," AFOSR Heport No. 64-1664 (Eng. Sc.D. Dissertation,
New York University, New York).
TITCHMARSII, E. C. (1948), "Introduction to the Theory of Fourier Integrals."
Oxford Univ. Press, Oxford.
WIENER, N. (1933), "The Fourier Integral and Certain of Its Applications."
Reprinted by Dover, New York.
WIENER, N. (1949), "Extrapolation, Interpolation and Smoothing of Stationary
Time Series." Wiley, New York.
13
Presented at the 1969 Polytechnic Institute of Brooklyn Symposium on
Computer Processing in Communications. To appear in the symposium
proceedings.
THEORY AND IMPLEMENTA nON OF THE DISCRETE HILBERT TRANSFORM*
by
B.Gold
A. V. Oppenheim
C. M. Rad e r
Lincoln Laboratory. Massachusetts Institute of TechnOlogy
Lexington, Massachusetts
ABSTRACT
The Hilbert transform has traditionally played an important part in the
theory and practice of signal proceSSing operations in continuous system theory
because of its relevance to such problems as envelope detection and demodula
tion, as well as its use in reiating the real and imaginary components, and the
magnitude and phase components of spectra. The Hilbert transform plays a
similar role in digital signal processing. In this paper, the Hilbert transform
relations, as they apply to sequences and their z-transforms, and also as they
apply to sequences and their Discrete Fourier Transforms, will be discussed.
These relations are identical only in the limit as the number of data samples
taken in the Discrete Fourier Transforms becomes infinite.
The implementation of the Hilbert transform operation as applied to
sequences usually takes the form of digital linear networks with constant co
efficients, either recursive or non-recursive. which approximate an all-pass

0
network with 900 phase shift, or two-output digital networks which have a 90
phase difference over a wide range of frequencies. Means of implementing
such phase shifting and phase splitting networks are presented.
*1bts work was sponsored by the U. S. Air Force.
14
1. Introduction
Hilbert transforms have played a useful role in signal and network theory
and have also been of practical importance in various signal processing systems.
Analytic signals, bandpass sampling, minimum phase networks and much of spectral
analysis theory is based on Hilbert transform relations. Systems for performing
Hilbert transform operations have proved useful in diverse fields such as radar
moving target indicators, analytic signal rooting (1 J, measurement of the voice
fundamental frequency (2, 3], envelope detection, and generation of the phase of
a spectrum given its amplitude (4, 5, 6 J •
The present paper is a survey of Hilbert transform relations in digital systems,

and of the design of linear digital systems for performing the Hilbert transform of an
input signal. These subjects have been treated in the published literature for con
tinuous signals and systems (7]. In this paper we present a treatment of the
subject for digital systems. We will first present various Hilbert transform
relationships followed by several design techniques for Hilbert transformers
and a few examples and applications.
2. Convolution Theorems
In this section, some notation is introduced and some well-known convolution

theorems are quoted; the interested reader can find proofs of these and other theorems
of z-transform theory in various books (8, 9 J. Let x (n) be a stable infinite
sequence and define the z- transform of x (n) as,
-n
x (z) = x (n) z
n=-CI>
Given two such sequences x (n) and h (n) and their corresponding z- transforms
X (z) and H (z), then, if Y ( z ) = X (z) H (z) , we have the convolution theorem
m=CI> m=CI>
y (n) = :E x (n- m) h (m) = :E x (m) h (n - m) (1)
m=-CI> m=-CI>
15
Similarly, if Y (n) = x (n) h (n ) , we have the complex convolution theorem
Y (z) = -2
1
m�
. -I 1
m
-1 (2)
where v is the complex variable of integration and the integration path chosen is
the unit circle, taken counterclockwise.
The spectrum of a signal is defined as the value of its z-transform on the unit
j
circle in the z-plane. Thus, the spectrum of x (n) can be written as X (e 9) where
e is the angle of the vector from the origin to a point on the unit circle.
If x (n) is a sequence of finite length N then it can be represented by its
(k). we have
•
discrete Fourier transform (DFT). Denoting the DFT values by X
X (k):::: N-I
!:
n =o
x (n) W
-nk
x (n ) = l
N-l
(k)
!: X W nk
N
k=o (3)
with W = eJ"21T /N
The convolution theorems for these finite sequences specify that if
Y (k) = H (k) X (k). then
yen) = mN-l=!:o X (m) h � (n- ») = m=o!: x ( (n-m») h (m)
m
N-l -
(4)
and if y (n) = x (n) h (n). then

1 N-l N-l (, "' 1 .\
(i) (k =
(k) ( -i») i=o!: X \.« k-i».J (i)
Y !: X (5)
::::
N
i=o H
N
H
16
where the double parenthesis around the expressions k - fl. and n - m rcf9r
tel these expressions modulo N; i.e., «x» = the unique integer x + kN, satisfying
o < x + kN � N -1 •
Finally we define an infinite sequence x (n) as 'causal' if x (n) = 0 for

n < 0, A finite duration sequence of length N is 'causal' if x (n) is zero in the
latter half of the period 0, 1 . • . N - 1 , i.c. , for n> � •
3. !:! i1bert Transform Relations for Causal Signals
The z-transform X (z) of the impulse response x (n) of a linear stable

causal digital system is analytic outside the unit circle. Under these conditions,
interesting relations between components of the complex function X (z) can be
derived, these relations being a consequence of the Cauchy integral theorem [10].
For example, X (z) can be explicitly found outside the unit circle given either the
real or imaginary components of X (z) on the unit circle. These relations also
hold on the unit circle; if we wr ite
(6)
where R (e j B) and I (e j B) are the real and imaginary parts respectively of X (e j B)

tl."11 X (e j B) can be explicitly found in terms of R (e j B) or in terms of I (e j B)
and therefore R (e j B) and I (e j B) can be expressed as functions of each othe r.
These various integral relationships will be referred to as Hilbert transform
relations between components of X (z) •
First, we will derive an expression for X (z) outside (not on) the unit circle
j
given R (e 9) (on the unit circle), beginning with the physically appealing concept
of causality. A causal sequence can always be reconstructed from its even part,
defined as,
x (n);; � [x (n) + x (-n) J (7)

e
Since x (n) is causal, it can be written
x (n) = 2 x (n) s (n)

e
where s (n) ;; 1 for n> o (8)

=0 for n<o
;; � for n=o
17
j
Now. consider X (z) outside the unit circle, that is, for z:; re e with
r> 1 . Then,
-e 0> 0>
n
X (reJ ) = k x (n) r- e- jn f) = 2 k x (n) s (n) r-n e - jn e (9)
e
n=-o> n=-o>
But on the unit circle, the z-transform of x (n) is R (e j 9) and the

e
n 1 r 1 z -1
z- transform of the sequence s (n) r- is given by + -1 -1
-
Thus, using •
the complex convolution theorem (2), we can rewritk- 9) s

1 �
X (z) I = .! f R (z/v) (v+r -1) dv (10)

v (v-r - 1)
z=re j 8 trj
(In this and subsequent contour integrals, the contour of integration is always
taken to be the unit circle).
Equation ( 1 0) expresses X (z) outside the unit circle in terms of its real
part on the unit circle. Equation (�O) was written as a contour integral to stress
the fact that in the physically most interesting case when R (z) is a rational
fraction, evaluation of (10) is most easily performed by contour integration using
residues.
j
Similarly, we may construct X (re j 8)
from I (e 9) by noting that
x (n) :; 2 x (n) s (n) + x (0) <5 (n) where x (n) denotes the odd part of x (n) and
o 0
<5 (n) is the unit pulse, defined as unity for n = 0 and zero elsewhere. The
result obtained is,
1
X (z) _ :;.! f -
�
I (z/v) (v+ - ) dv + x(o) (11)
z=reJ fl 17 V (v-r )
Now, Eg. (1 0) and ( 1 1) also hold in the limit as r -+ 1, provided care i s taken
to evaluate the integral correctly in the presence of a pole on the unit circle. This
can be done formally by changing the integrals in (10) and (1 1) to the Cauchy principal
values of these integrals, where the latter is defined as:
18
1 � f(z)
P dz = f(z ) for /z / < 1
21Tj z-z o 0
0
= 0 for Izo 1>1 (12)
= tf(z ) for Iz 1 = 1
o 0
From (10), (11) and (12), it is a simple matter to construct explicit relations
between R (ej B) and 1(e H). Alternately, these results could have been derived
by appealing directly to Fig. 1, which shows the explicit relations between the real
and imaginary part of a causal function to be,
x (n) .Qim x (n) w (n)

o e I
=
(l- l
(13)
x (n) = .Qim [x (n) w (n) J + 6 (n) x(0)
e o I
tl- 1
Figure 2 shows the ring of convergence for the z-transform W (z) of w (n)
I I
and Fig . 3 shows the poles and zeros of W (z).
I
The results obtained are,
� R (z/v) (v+l) dv
J'1(z) I . = p ,f! (14)
z= e Je 21TJ .'f1' v(v-I)
(z v (v+l)
R (z) I = � p ,f! I / ) dv (15)
JB 21T j v (v-I)
.
z=e
9
By setting z = ej and v = ej CP, we change the contour integrals (14) anq
(15) to line integrals, yielding
9 . 21T
1 ( )
-I(eJ ) = -p r R (eJ CP-B)cot(cp/2)dcp
•
21T .,'0
(16)
'9 21T '(9 cp)

1 - )cot (cp/2) dco+x(0)
R(eJ ) = -p
21T
r I(eJ
o
•
19
Similar results can be obtained for the real and imaginary parts of the
discrete Fourier transform of a finite duration sequence provided that the sequence
is 'causal in the sense that, if the sequence x(n) is considered to be of duration
N , then x(n) = 0 for n> �. Defining the even and odd parts of x(n) as
x (n) =
e t(x [«n»] +x[« -n»J)
and
X (n) =
o
t ex [«n»] - x [«-n» J)
it follows (for N even) that
x (n) = xe (n) f(n)

and
x(n) = Xo(n) f(n)+ x

N
(0) 0 (n) +x("2) 0 (n- 2')
N
where
1
f(n) = 2 n = 1 , 2,
N N
o n = "2+1'"2+2, • • • N- l
From these relations, we can then derive that the real and imaginary parts, R (k)
and I(k), of the DFT of x(n) are related by
N-l
j I(k) = � r=o R (r)
� F [«k- r »] (17)
N-l
1 k N
R(k) = � jI(r) F C«k - r »] +x(o) +( - 1) x("2) (18)
N
r=o
where F(k) is given by
1.
k
J cot 'lT k odd
N
-
F (k) =
,0 k even
20
Note that (17) and (18) are circular convolutions which can be numerically
evaluated by fast Fourier transform methods. Similar but not identical relations
can also be derived if N is odd.
If, instead of working with the z-transform of a sequence, we choose to work
with the logarithm of the z-transform, then comparable Hilbert transform relations
can be derived between the log magnitude of the spectrum and its phase. However,
certain theoretical restrictions arise, due to the fact that a) the logarithm of zero
diverges and b) the definition of phase is ambiguous. However, the derivative of
the phase (with respect to z ) is not ambiguous; this Jeads to relationships based on
the definition
� dX (z)
D (z) = ( 1 9)
X (z) dz
The imaginary part of D (ej 6) is the derivative of log I D (e j 9) I and its

9
real part is the negativp. of the derivative of the phase of O(ej ). If the inverse
z-transform of D(z) is causal, reiatinns simiLlr to our previous real-imaginary
relations can be der.ived. For example, results analogous to (16)
(20j
.e 1
21T .( e)
'l"(eJ ) = --p
21T
f
F(eJ ctJ- ) cot (CD/2) d<,Cl (21)
o
where Ixi is the magnitude of the spectrum and 'l' its phase and the primes denote
differentiation with respect to e.
If we impose the condition that 'l' is an odd function, then it must be zero for
CD = 0 and (20) and (21) may be integrated to give
.e
log l X(eJ )1=
1 21T 'l'(eJ( ctJ- e) ) cot (<,Cl/2) d<,Cl (22)
P
21T oJ
•
21
( )
°A 1 21T
\y (eJ ) 10glX (eJ cD-A Icot (C[)/2) dco
O
=--
0 (23)
21T o
0
The requ irement that the inverse z -transform of D (z) be zero for n < 0 imposes
a restriction on the pole and zero locations of X (z). Since the poles of D (z)
occur whenever there are either poles or zeros of X (z) and since the inverse
transform of D (z) is zero for n < 0 only if the poles of D (z) are all within the
unit circle, it follows that both poles and zeros of X (z) must be within the unit
circle in order for Eq. (20) through (23) to be valid. This is the well-known
minimum phase condition [11 J •
It is also possible to relate the log magnitude and the phase of the DFT by
analogous relations provided that the inverse DFT of the logarithm of the DFT is
causal. The difficulty in applying this notion is that the logarithm of X (k) is
ambiguous since X (k) is complex. For the previous case of the z -transform,
this ambiguity was resolved in effect by considering the phase to be a continuous,
odd, periodic function; this definition of the phase cannot be applied in this case.
Nevertheless, it has been useful computationally for constructing a phase function
from the log magnitude of a DFT by computing the inverse DFT of the log magnitude,
multiplying by the function f (n) and then transforming back [4, 5). The real
part of the result is the log magnitude as before and the imaginary part is an approxi
mation to the phase.
4. Hilbert Transform Relations Between Real Signals, and a Few Applications
The relations of Section 3 were derived via the complex convolution theorem
(2) and the requirement of causality. By interchanging time and frequency and
using the convolution theorem (1), further relations can be found which are of
practical and theoretical interest. One way of obtaining such relations is by the
introduction C?f the 'ideal' Hilbert transformer which has a spectrum defined as having
the value + j for 0 < cD < 1T and - j for 1T < cP < 21T , or equivalently, a spectrum
with flat magnitude vs. frequency and having a phase of ±1T/2 . Thus, a Hilbert
transformer is a (non-realizable) linear network with this transfer function and, as
shown in Fig.4a, the output of the network is the Hilbert transform of the input.
Hilbert transform relations can also be realized by having two aU- pass networks
having a phase difference of 1T/2 , as shown in Fig. 4c; such a configuration is useful
for synthesis of realizable approximate Hilbert transformers.
22
The unit pulse response of a Hilbert transformer can be derived by
evaluating its inverse z-transform. Thus
1 .R n-1
7T :r H (z) z
2 j
h (n) = dz
h (n) = -
1
27T
l 7T
e e de - r
"r j J
.
27T
.A
j e J dEl �
o 1T
.
1 - e j 1Tn
h (n) = for
1Tn
(24)
= 0 for
From (1) the input-output relations can be written down;
x (m) [1- e
1 j 7T (n-m)
Y (n) = 7T- J (25)
-ex> (n-m)
m=
m,;tn
Equation (25) can be inverted by noting that X (z) = H* (z) Y (z) (where H* (z) is
the complex conjugate of H (z) ); this yields
1 j 1T (n-m) 1
x (n) = - y (m) [ l- e (26)
7T n-m
-
m=-ex>
m,;tn
Thus (25) and (26) can be said to be a Hilbert transform signal pair. The graph 'of
.!. h (n) is shOwn in Fig. 5.
1T
The complex signal s (n) = x (n) + j Y (n) (where x (n) and y (n) are a Hilbert
transform pair) has been called the analytic signal and has the useful property that
its spectrum is zero along the bottom half of the unit circle. One application of the
analytic signal is to the bandpass sampling problem. Consider the problem of
sampling a real signal having the spectrum of Fig. 6a. If this signal is passed
through the phase splitter of Fig. 4c, the resulting analytic signal has the spectrum
23
shown in Fig. 6b, and thus can be sampled at the rate li B . To reconstruct the
original signal requires that the samples be applied to the unity gain bandpass filter
shown in Fig. 6c. The real part of the filtered signal corresponds to the original
signal.
Another application of Hilbert transformers is to help create a bandpass
spectrum which is arithmetically symmetric about an arbitrary center frequency.
Effectively, the ability to do this allows us to design bandpass filters which are
linear translations in frequency of prototype low pass filters, thus avoiding the
distortions inherent in the standard low pass-bandpass transformations. Figure
7a illustrates a symmetric low pass. When a conventional transformation is
applied, the non-symmetric bandpass of Fig. 7b results. Symmetry may be
attained with the filter of Fig. 7c; however. we note that the output of such a filter
is a sequence of complex numbers and, also, that by merely taking the real part
we must introduce the complex conjugate pole, thus destroying the symmetry. To
maintain symmetry of a real output over the range 0 through 1T can be accomplished
by the configuration of Fig. 8, where H (z) and H (z) are all pass phase splitters
I 2
such as shown in Fig. 4. If only the real part of the signal is desired then a single
phase splitter (rather than two) is needed. A filter satisfying the pole-zero pattern
of Fig. 7c is easily made and well-known and has usually been referred to as the
coupled form [ 12,9].
5. Digital Hilbert Transform Networks
a. Recursive Networks
Analog phase splitting networks have been extensively analyzed and
synthesized rI3 , 9]. Since the desired networks are all-pass with constant phase
difference over a frequency band, it is feasible to use the bilinear transformation
r.I4, IS, 9] to carry analog designs into the digital domain. The resulting net
works are all-pass, so that each pole at, say, z a , has a matChing zero at
=
0
z = I/a . An equiripple approximation to a constant 90 phase difference is
obtained by the use of Jacobian elliptic functions [ 161, with the added advantage
that all the poles and zeros lie on the real axis.
Let the two networks comprising the phase splitter be H (z) and H (z) .
I 2
To synthesize the all pass networks H (z) and H (z) in an efficient manner,
I 2
we note that the first order difference equation
y (n) = x (n-I) + a [y (n- l) - x (n)] (27 )
24
corresponds to the digital network
-1
z -a
H (z) = (2S)
-1
l - az
This shows that an all pass network with a pole at z = a and a zero at z = l/a
can be synthesized with a single multiply in a first order difference equation. In
0
Fig. 9 is shown a complete digital 90 phase splitter which meets the requirements
0 0
that the phase difference deviates from 90 in an equiripple manner by ± 1 in
0 0
the range 10 through 120 along the unit circle. From (2S) we see that the
coefficients in Fig. 9 are equal to the pole positions. The nomenclature of Fig. 9
is the following: the box z -1 signifies a unit delay, the plus signifies addition
and a number-arrow combination Signifies both direction of data flow and multi
plication by the number. Arrows without numbers signify only data flow.
Now it is well known [9, 12, 17, IS, 19, 20, 21, 22] that because of finite
register length, the performance of the actual tilter deviates s0111ewhat from that
of the design. These effects can be categorized as follows:
a) Quantization of the input signal
b) Roundoff noise caused by the multiplications
(fixed point arithmetic is assumed)
c) Deadband effect
d) A fixed deviation in the filter characteristic caused by
inexact coefficients.
The analysis of these effects is Simplified because the networks HI (z)
and H (z) are all-pass. Thus, signal to noise ratios caused by (a) are the same
2
at the output of the networks as at the input. Item (b) can be analyzed for HI (z)
or H (z) by inserting noise generators at all adder nodes following multiplications.
2
But each noise is then filtered by a cascade combination of the pole of the section
in which the noise is generated, and an all-pass network. The well -known formula
for the output variance of a network which has been subjected to a white noise input
with uniform probability density of amplitude is given by
25
where E is a quantum step and h (n) is the network impulse response. For a
o
n
single pole at z;::; a , h (n) ;::; a ; assuming m independent noise generators and
m poles at a 1 ' a a causes a total variance
2 m
• • •
m
2 2 1
CJ E / 12 (29)
� -=---z
=
o
i;::;1 I - a.1
We see that only values of a. near unity cause mu,:h noise. Thus, for our
1
1
rumerical design example, only about bit of noise is generated. Item (c) can
be analyzed by similar considerations but it is probably not important: for band
pass phase splitters anyway, since it is only an effect when the input is a constant.
For small errors in coefficients. item (d) can be analyzed in a manner
similar to that of reference [121. The realization chosen in Fig. 9 guarantees
that even though a given coefficient is in error, the poles and zeros of the networks
remain reciprocals so that only the phase response of the network can be effected.
Let the phase response due to a pole zero pair at a, I/a be W(CD,a) . The phase
error for a coefficient error Aa is approximated by
It is easily shown, using (28), that
-1 sin cp
;::; 2 tan ( )
a - cos CD
o'lt -2 sin c,o

oa 2
_
-Za cos c,o+ 1

-
a
;::; 2 sin cp(a-cos cp)

2 2
(a -2a coscp+ 1)
26
Therefore a good approximation to the error in phase, is given by the early term�
of the series
- 2 sin ctJ A 2 sin (a - c� ('A 2

ctJ +
D. \I' �
�
2 Lola + 2 � Lola) • • •
a - 2 a cos cp + 1 (a - 2a cos ctJ+ 1)
Using this approximation for each of the poles one can estimate how many bits
are necessary to keep the phase error within a given tolerance. Of course, once
a coeffi,cient has been specified, the phase difference can be computed precisely.
b. Non- Ret ul-sive Networks

The phase splitting network we have described above is called recursive
because, in the computation (27), a new output depends on a previous output. Recur
sive networks always have poles and unit pulse response of infinite duration. By
contrast, a non-recursive network has only zeros and a finite duration unit pulse
response. As we have shown, perfectly all pass, recursive phase splitters with
equiripple phase characteristics can be constructed. Such criteria cannot, in general,
be met by non-recursive networks. However, useful constructions are certainly
possible. We have studied one method of nonrecursive design which is based on
sampling in the frequency domain (19 J. The sampling formula which relates the
z-transform of a finite duration sequence of length N to the values H of its DFT
k
is given by
.27T
-N N-l H -j-
k N
H (z) l: (.'30)
l-z
W =e
N
=
-I k
k=o l- z W
Thus, an approach to the design of a non -recursive Hilbert transformer is to specify

the phase and magnitude of the network at the N equally spaced points around the
unit :::ircle, that is, the N values of Hk• If the unit pulse response of H (z) is
constrained to be real, then we must specify that H has even magnitude and odd
k
phase. For example, if we specify the magnitude of H to be unity for all k and
k
the phase to be 7T /2 in the first half of the period and -7T /2 in the second half, as
in Fig. 10, then an interpolated spectrum results, such as shown by the dotted curve
of Fig. 10.
27
Exact 90° phase can be attained at every frequency, by specifying that the
H be purely imaginary. However, for real unit pulse response, such a phase
k
shifter must have a magnitude characteristic which passes through zero at 0, 1T,
21T etc. , as shown in Fig. 11. Thus, an ideal phase can he attained by further
degrading the all pass property of the network near 0 and 1T.
Non-recursive networks can also be arranged as phase splitters, which

has the advantage that the two components of the resulting analytic signal can have
the same delay. It has been found experimentally that the arrangement indicated
in Fig. 12, whereby each arm of the splitter has nominal phase of ±1T/4 leads to
a low value of ripple of the interpolated phase difference. Further reduction of
ripple is possible at the expense of bandwidth, by specifying intermediate values
of H in the transition region when the sign of the phase is changing.
k
28
References
1. M.R. Schroeder, J.L. Flanagan and E.A. Lundry, "I3andwidth Compression

of Speech by Analytic-Signal Rooting," Proc. IEEE, pp 3 96-40 1, March 1967,
vol. 55 No.3.
2. R. M. Lerner, "A Method of Speech Compression,"Ph.D Thesis, 1957.
3. C.M. Rader, "Vector Pitch Detection,"J. Acoust. Soc. Am. 36, 1963 rLl.
4. A.V.Oppenheim, R.W. Schafer, T.G. Stockham, Jr., Proc. IEEE, vol. 56

No.8, August 1968, pp 1264-1291.
5. A.V. Oppenheim, "Speech Analysis-Synthesis System Based on Homomorphic

Filtering,"J. Acoust. Soc. Am., vol. 45 No.2, pp 458-465, 1969.
6. D.J. Sakrison, W.T. Ford andJ.H. Hearne, "The z-Transform of a Realizable

Time Function,"IEEE Trans. on Geoscience Electronics, vol. GE-5 No. 2,
Sept. 1967, pp 33-41.
7. A. Papoulis, "The Fourier Integral and Its Application,"McGraw-Hill, 1962 .
8. E.!. Jury, "Theory and Application of the z- Transform Method," Wiley, 1964.
9. B. Gold and C. M. Rader, "Digital ProceSSing of Signals," McGraw-Hill,

Spring 1 969.
10. P. M. Morse and H. Feshback, "Methods of Theoretical Physics," McGraw-Hill,

1953.
11. H.W. Bode, "Network Analysis and Feedback Amplifier Design," D. Van
N ostrand Company, 1945.
12. C. M. Rader and I3. Gold, "Effects of Parameter Quantization on the Poles of a
Digital Filter, " Proc. IEEE, May 1967, P 688.
13. J. E. Storer, "Passive Network Synthesis,"McGraw -Hill, 1957.
14. R. M. Golden and J. F. Kaiser, "Design of Wideband Sampled-Data Filters,"

Bell Sys. Tech. J., vol. 48, July 1964.
15. C. M. Rader and B. Gold, "Digital Filter Design Techniques in the Frequency
Domain,"Proc. IEEE, vol. 55 No. 2, February 1967, pp 149 -171.
16. E.T. Whittaker and G.N. Watson, " Modern Analysis,"Cambridge University
Press, 1952, (4th Edition).
17. J. F. Kaiser, "Some Practical qp nsiderations in the Realization of Linear

Digital Filters,"1965, Proc. 3 Allerton Conf. I PP 621-633.
29
18. J.B. Knowles and R. Edwards. "Effect of a Finite-Wordlength Computer in a
Sampled-Data Feedback System," Proc. lEE (London), vol. 112, pp 1197-1207,
June 1965.
19. B. Gold and C. M. Rader, "Effects of Quantization Noise in Digital Filters,"

1966, Proc. Spring Joint Computer Conf., pp 213-219.
20. J.B. Knowles and E. M. Olcayto, "Coefficient Accuracy and Digital Filter
Response," IEEE Trans. Circuit Theory, vol. CT-15, pp 31-41. March 1968.
21. T. Kaneoko and B. Liu, "Round-off Error of Floating Point Digital Filters,"
6th Ann. Allerton ConL on Circuit and System Theory," Oct. 2-4, 1968.
22. C. Weinstein and A. V. Oppenhei m . " A Comparison of Roundoff Noise in

Floating Point and Fixed Point Digital Filter Realization" (to be published).
23. B. Gold and K.L. Jordan. Jr., "Linear Programming Procedure for Designing
Finite Duration Impulse Response Filters" (to be published) in the IEEE
Trans. on Audio and Electroacoustics.
30
(a) x (n)
I I
0 n
..
( b) Xe(n)
n
(e) xo(n)
I
I I
I n
(d) wen)
• • •
• • • 0 n
(e) W1(n)=an wen)

• (a<1)
• •
• • •
o n
Fig. 1
31
w
u
z
w
(!) ,.,...
0:: N
W ....."
> �
Z
0
3
u
0:
u-
0
0 L1-
(!)
Z
0::
W C'I
Z .�
�
<l:
.-I
a..
32
,.....
�
I
tj
� I
N
'-'
C\J ,--
N tJ
I
N
'-'
II
"......
w N
Z '-'
«
--1
��
a.. M
.�
IJ,.,
33
•
•
Q)
•
�
C\J
�
.. .
.....
I�
- ..
- -
I �
,-.. J ,-..H
c: c:
-
� C\J
0 �
,-.. ,-..
N N
- -
� N
t:: ::t: ::r:
I
�l A�
)( ,-.. A�
t:::
c:
C\J
,
-
)(
c u
......
•
•
•
34
c
•
r--�o
�------
•
•
•
35
SPECTRUM
I I
(0 ) ''/
I
I ..
I I I
.
I I
o -+i 14- B 7T B� t4-- 27T e
SPECTRUM
( b)
-
I I ,
--"1 �B 7T 27T 8
(c)
--
I
-
7T 27T 8
Fig. 6
36
z - PLANE
SYMMETRICAL NONSYMMETRICAL S Y M M ETR I CAL
LOW PASS BAND PA SS BAND PASS
(a ) (b) (c)
Fig. 7
W
-..I
w
(Xl
H (z)1 ...
1
H2 (z)
Y1 (n)
,....----
� I I I IMAGINARY REAL PART
,. tn)
'T" � O
F I L I � r\ I y� ( }
FOR Fig.7c �n •
!) OF ..
PART OF - OUTPUT
Hi (z) OU TPUT
Io.....-�.I H2 (z ) I '
Fig. 8
-
-
�
*c
-
c
-
....
- I
I
-
�
c
-
"
39
-
--------. .......
- ....
_)
-
....
<
' )
\
•
-
•
Cb
•
..
- <
� � --
....
--
_)
- --
---
- - --,
-
<--.
::>
T
C--,
(\J
0
�
.-<
N II ......
� . ....biJ
�') _ ....
u..
/1-----;/
('- -:;. I
0
-- I 0
... 1 ---+
�
0
I � +-'-
< ..
I )
(\J •
......
t:: '-)
•
.... •
< ..... •
t:: -,
- - --
--
I _
--
- -
C .a
- -
40
�
�
c,
.,..,-'"
c:. ... ___ _

--
- ---
•
c_
�
•
,) •
( •
) •
<--.,
•
•
-
•
<-----
---
---
.....
.....
<--
) bO
....
�:.
( "
<-.)
_r
c ......
__ _
1_- -
-
(-14--
•
.. �)
,.
•
-.-.e
<,�
•
�, •
- -� •
< ....--.--.
.. •
- --
-.-
c..
41
• • • • • • • •
Tr
e
1T/4 2Tr
-1--- - . . . . - .
- - . . • • •
Tr/4
e
Tr/4 2Tr
-T- • • • • • • • •
r
Tr/2
8
217"
Tr/2
J_
12
42
Reprinted from the PROCEEDI:\GS OF THE 1E1':E
VOL. 56, 1\'0. 10, OCTOBER, 1968
pp. 1717-1718
COPYRIGHT © 1968-THE INSTITUTE OF EI.F.CTRICAL AND EU:CTNONICS ENGINEERS, INC.
PRINTED IN THE '!.S.A.
where the set of II. are precisely the values of lI(z) at equally spaced points
on the unit circlc in the z-plane. The impulse response h(nT) can then be
A Note on Digital Filter Synthesis
expressed as the inverse DFT of the samples II. and if this inverse OFT is
Abslracl-It is commonly assumed that digital filters with both poles and substituted into (I), with the order of the resulting doublc summation
zeros in the complex z-plane can be synthesized using only recursive tech inverted. we easily arrive at the result
niques while filters with zeros alone can be synthesized by either direct con
volution or via the discrete Fourier transform (OFf). In this letter it is shown (3)
that no such restrictions hold and that both types of filters (those "'ith zeros
alone or those with both poles and zeros) can be synthesized using any of the
three methods, namely. recursion, OFT. or direct convolution. Equation (3) can be physically reprcsen[c'<l by a comb filter with transfer
function I - Z-N cascaded with a parallel arrangement of first-ordcr re
I. INTRODUCTION cursive equations. In general. the coefficients of these equations (ej"d,N,)
A digital filter can be synthesized either by direct convolution. by are complex. but if in (3) we specify that I1N_. and II. are complex conju
linear recursive equations. or by the use of the discrete Fourier transform gates. then combinations of second-order recursive networks with real
(DFT). usually via the fast Fourier transform (FFT). Kaiserl has used the coefficients can be derived. An interesting special case of (3) is the case
terms "recursive" and "non recursive" to distinguish between filters that 0. = k7! and M. = 1. where II. = IIJ.ej8,. This leads to a bandpass filter with
are defined by an impulse response of finite duration (nonrecursive) and linear phase. which has been called "frequency sampling" by Rader and
those defined by an impulse response of infinite duration. The former type Gold'> If two such filters arc designed so that the frequency response of one
contains only zeros while the latter has both poles and zeros in the complex is a linear translation of the other. then it is cas v to show that the two out
z-plane. It is well known that the design methods for filters with only zeros puts have a constant phase difference over the passbands common to both
differ markedly from the design methods for filters with zeros and poles. filters.
The term "recursive" is descriptive of the filter synthesis procedure

whereby a new output sample can be computed as a linear combination of
previous output samples as well as the latest and previous input samples. 111. DIRECT CONVOLUTtON AND FFT SYNTHI'5IS OF FtLTERS
Similarly. "nonrecursive" describes a computation for which the new out WITH POt.ES AND ZEROS
put sample is a linear function of only the input samples. Stockham1 has
We consider the defining equation of an 11 R filter to be
shown that filters with zeros alone can be synthesized using the FFT
m ,
algorithm.
The purpose of this letter is to show how I) filters with only zeros can be L K;y(nT- i1') - L L;x(nT- iT) (4)
1-0 ;-0
synthesized recursively and how 2) filters with both poles and zeros can be
synthesized using either direct convolution or the FFT. As a result, it be where the set of X(IIT) are the input samples. y(nT) are the output samples.
comes desirable to define our terms more carefully, and we propose that and K, and L, are constant coefficients. Taking the z-transform of (4) and
the terms "recursive," "direct convolution." and "FFT " be used to specify including the initial condition yields
only the synthesis procedure. To distinguish between filters with zeros
alone and those with poles and zeros. perhaps the terms "finite impulse X(z)R(z) + F(z) + G(z)
Y(z) = (5)
response" (FIR) and "infinite impulse response" (IIR) could be used. D(z) D(z)
where X(z) is the z-transform of x(nT), Y(z) is the z-tran,form of y(nT),

R(z) and D(z) are given by
II. ReCURSIvE SYNTHr:SIS OF FILTERS WITH ZEROS ALONE
, ..
The z-transform of an FIR filter having impulse response (of duration R(z) = L L,z-', D(z) = L Kiz-', (6)
NT) h(nT) is given by '·0 i-O
N-' and F(z) and G(z) are z-transforms which depend solely on the initial eondi-
}
H(z) = L h(nT)z-' (J) lions and arc siven by
,-0
.,-. .-)-1
where T is the sampling interval. FI%)- L ajz-) with u) = L K,_'.'''-T -<n
Since hInT) is of finite duration, it has a DFT given by )-0 ;-0
,-I ,-)-1
(7)
N-' GI%)- L bjz - ) with bj = L Lj_,+.x(-T- iT).
H. = L h(nT)e-)'20IN"., k - O. 1,2, . • • N - I (2) j-O '-0
,-0
If we define h.(nT) to be the inverse z-transform of R(z)/D(z), h1(nT) to
be the inverse-transform of I/D(z), l(nT) to be the inverse z-transform of
Manuscript received June 24. 1968. F(z), and genT) to be the inverse z-transform of G(z), then the solution
I J. F. Kaiser. in SYSIt'M Anal,-sls by D;gital Compu",. J. F. Kaiser and F. Kuo. Ed$.
New York: Wiley. 1966. pp. 218-277.

J T. G. Siockham. Jr .• "High speed correlation and convolution." presenled at the 1 C. M. Rader and B. Gold. "Digilal filler design techniques in the rrequency domain,"
Spring Joint Computer Conr.• 1966. Proc.t£££. vot. 55. pp. t49-171. February 1968.
43
PROCEEDINGS OF THE IEEE, OCTOBER 1968
y(nT) can be written as

Since a finite convolution can always be computed via DFT, it.follows
. -
that (8) can be synthesized using either direct convolution or FFT.
y(nT) - L h,(iT) x (nT - iT) + L h2(iT)/(nT - iT)
1-0 '-0
(8) IV. CONCLUSION
r
+ L h2(iT)y(nT - IT). II and III demonstrate the claim made in the Introduction
Sections
namely, that filters with zeros alone can be synthesized recursively whil;
'-0
filters with zeros and poles can be synthesized by direct convolution or

Eq uation (8) shows that the output can be wrillen as the sum of three
FFT. Thus, any of the known digital filter synthesis procedures can, in
finite convolutions for any section of length n + I. We as.,ume that the
indices for that section are 0, I, 2, . . . , n. Thus, we imagine that each new
fact, be used to synthesize either type of filter.
B. GOl. D
section resets the origin to the beginning of that sc'Ction, the contribution
from previous sections being determined solely by the initial conditions K. L. JORDAN, JR.
M. I.T. Lincoln Lab.'
and the system function. As can be seen in (5), the output for any con
Lexington, Mass. 02 I 73
tiguous section of samples can be viewed as the sum of an output deter
mined entirely by the input signal and another output determined entirely
by the initial conditions of the system. .. O perated with support (rom the U. S. Air Force.
44
Introduction
Digital filters can be synthesized by direct convolution,

via the formula
A Direct Search Proce

.
y(nT) = L: h(mT)x(nT - mT) (I)

.. -0
x(nT)
dure for Designing
y(nT) h(nT)
where T is the sampling interval, is the input se
(I),
Finite Duration Im quence, is the output sequence, and is the
x(nT) 0 for n<O and

response of the filter to a unit pulse at the origin. In
pulse Response Filters
stable. If h(nT) is, in addition, of finite duration, then the
it is assumed that = that the filter is
N I, where NT is the dura

filter has only zeros in the complex z plane and the upper
h(nT).
BERNARD GOLD, Member, IEEE limit in (I) can be replaced by -
K. L. JORDAN, JR., Member, IEEE tion of For such filters, there exist a variety of
M.I.T. Lincoln Laboratory' design techniques, which have been reviewed in some
Lexington, Mass. 02173 detail by Kaiser [1]. In particular the frequency domain
technique which Kaiser has called the "Fourier series"
method has received much attention and has led to the
invention of various spectral "window" functions, which
when convolved with the ideal spectrum, tend to reduce
Abstract
the sidelobes of the filter frequency response characteris
We introduce an approach to the design of low-pass (and, by exten tic. The idea is illustrated in Fig. 1. In Fig. leA) is shown
N points
sion, bandpass) digital filters containing only zeros. This approach is a specification of an "ideal" filter. Since the impulse
that of directly searching for transition values of the sampiL-d frequency response is of length N, then only of the (con
response function to reduce the sidclobe le,'cl of the response. It is tinuous) frequency response can be specified. If specifica
shown that the problem is a linear program and a search algorithm is tion is as shown in Fig. I(A), then the actual response
I(C), producing
exhibits strong sidc\obes, as shown in Fig. I(B). If, for
derived which makes it easier to obtain the experimental results.
example, Hk is convolved with Wk of Fig.
I(E).
the samples Fk of Fig. I(D), the sidelobes decrease as
shown in Fig. This reduction is, of course, paid for'
by the increased transition band. The design problem
reduces .to one of finding "good" windows which either
minimize the out-of-band sidelobes, or minimize the in
band ripple, or result in some compromise between these
two criteria.
Optimization by Direct Search
In this paper we approach the problem by directly

searching for "optimum" values of Hk in the transition
band. The model is shown in Fig. 2. All passband values
of Hi are specified to be unity and all stop band values to
be zero. In the transition band, there are, in general, M
values of Hi which can be chosen in any way so as to ful
fill the designer's criterion. For simplicity, we shall take Hi
to be real and even and use it to synthesize a filter with
a real, even transfer fnnction.
The major apparent difficulty with the above approach
is that of determining, either theoretically or experi
H(ei"T)
mentally, a minimization in M dimensions. We observe,
linear
however, that the continuous frequency response
of the filter can be expressed as a function of its
uniformly spaced samples Hk• The required interpolation
Manuscript received August 14, 1968; revised December 18, 1968. formula can be derived, via the discrete Fourier trans
, Operated with support from the U. S. Air Force. fonn, to be
IEEE TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS VOL. Au-17, No.1 MARCH 1969
45
linear function of the MTk and, therefore, defines an M
dimensional hyperplane in the M+I dimensional space.
H(eiwT).
The M + I dimensions consist of the M transition values
and the frequency response We wish to solve the
minimization problem
(A)
min max I Gil
IT.I I
which can alternately be expressed as
mill max {max (G" -

G, ) } .
(B) IT.I I
The expression
max {max (G" -G,)}

I
(e)
is the upper envelope of the hyperplanes formed by the
sets {Gz} and {-Gz} and therefore describes a flat
sided convex polyhedron. A local minimum is thus a
global minimum and occurs at a vertex where M+I of

(0) the {G1} are equal to the minimum. Such a problem as
D.. .11
described is a form of linear programming problem.
As a practical matter, the local maxima or sidelobes in
I".'·�
the out-of-band region are nearly periodic and their posi
I tions are nearly constant. If only one of the Tk is varied,
(E) o t l.
2T T this corresponds to a slice through the polyhedron as
fREOUENCY
shown, for example, in Fig. 5. As the particular Tk is
Fig. 1. Ripple In flnile·lmpulse response digilol fUler. varied, the maximum out-of-band response switches from
one peak to another so that the frequency position changes
drastically; this corresponds to a sudden change in the
slope of the curve.
The above properties can be used to derive a reasonably
Fig. 2. Model of flnile·impulse response flller
efficient search procedure. The idea is to reduce an M
with a transition band.
dimensional search to a sequence of one-dimensional
searches. To illustrate the procedure, we assume that the
transition region contains two transition points and y
..�
••• x
and search for the point x.., Ym which minimizes the max
imum out-of-band response. For a given value of x, say
Xl, the search for an optimum Y is one-dimensional, and,
o��D-��*��----�t as mentioned earlier, the optimum y (or Yl) will occur at

FREQU£NCY the intersection of the plane X=Xl and the line formed by
the intersection of two faces of the polyhedron. If the
fixed value of X is changed slightly to X2, the optimum
point, Y2, will move along this line. This line determines a
l/(eiwT) � exp [ -i w.�T (1 �) ]

relationship between X and y given by
(Y2 - Yl)(X - Xl)

= -
(X2
(2)
Ihe-irklN
y= YI+ ------
N-I
� wNT/2
sin
•
- Xl )
k-O sin [(wT/2) - (lI-k/N)]
which determines a vertical plane intersecting the poly
Now, assume that the criterion chosen by the designer hedron. The procedure is to solve the one-dimensional
is that of minimizing the maximum value of the out-of problem of finding the minimum along this intersection.
H(eiwT) in the out-of-band region to be

band response. First, let us consider the set of points of We then have a new value of X and we repeat the proce
a discrete set con dure. In the case of three or more transition points we
taining a large number of samples (as should be the case iterate the above procedure. For our problem, the proce
in solving the problem on a digital computer), we shall dure seems to converge quite rapidly.
denote this set as {Gz}. In addition we shall denote the Fig. 3 shows the envelope of the result obtained from a
set of MH� in the transition region as {Tk}. Each GI is a three-dimensional search; the ripple created by specifica-
IEEE TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS MARCH 1969
46
I • C18, • • 0.:5411
10
... "
N·251
.·at ,-o.aat. I-�
.. olH
.. oil * ''I(, ERIIOR IN ,
104
Fig. S. Variation In fllter sidelobe amplitude

- ,...., versus error in a transition value.
Fig. 3. Experimental frequency response of fllter

with three transition points.
fig. 4. Variation in filter sidelobe amplitude versus Fig. 6. Sidelobe and In·band ripple as a funcllan of
bandwidth. x with y optimized for minimum sidelobe ripple.
x • 0.7, y. 0.225, , . 0.01995

N 0256
oL------��--�'oon--J
ew
tion of zero sample values is not shown. For N=256 seen that for most bandwidths, the sidclobes are still
and BW=32, the optimum is x=0. 7, y=0.225 and 80 dB below the in-band response.
z=0.01995. As can be seen from Fig. 3, the maximum It was also of interest to study how sensitive the side
out-of-band response is about 85 dB below the in-band lobes were to deviations from optimum settings. Fig. 5
I percent error
response. This compares favorably with the "Blackman" shows this for a case of a two-dimensional optimization
window which yields about the same out-of-band re with a fixed value of Z = 0.8. It is seen that a
sponse but which corresponds to 4 rather than 3 transi in y does not unduly disturb the minimax sidelobe.
tion points [2]. No information is presently available to For the two dimensional case with Z= 1 and x and y
us on the in-band ripple of the Blackman window. variable, Fig. 6 shows a result which also takes into
A given optimum is in general valid for given values of account in-band as well as out-of-band ripple. Each point
two parameters, Nand BW. It is laborious to find a dif on the maximum sidclobe curve corresponds to an opti
ferent M-dimensional optimum for many values of these mum y given x. For the value of y, the in-band ripple was
parameters. However, if BWis not very small and not too also measured. The designer now can choose his 2 transi
near NI2, the optimum values do not change too dras tion points according to the relative value he places on
tically. This is illustrated in Fig. 4, which, for the same in-band versus out-of-band ripple.
values of x, y, and z, shows the relative amplitude of the Fig. 7 shows a plot of y versus x corresponding to the
minimax sidelobe as a function of BW for N=256. It is two-transition case of Fig. 6.
GOLD AND JORDAN: DESIGNING fiNITE DURATION IMPULSE RESPONSE FILTERS
47
0.4, r-------..,
O �-L��
0.5 0.6 0.7 O.B 0.9 1.0
X
fig. 7. y versus x for case of Fig. 6.
Conclusions ACKNOWLEDGMENT
The authors wish to thank L. Rabiner of the Bell Tele

Direct empirical search for optimum transition values of
phone Laboratories for pointing out a numerical error
a finite duration impulse response filter may lead to some
and supplying us with the correct answer.
what more efficient filter designs than the well-known
window techniques. This search is greatly facilitated by
[I) F. F.
REFERENCES
using linear programming techniques. The resulting com
Kuo and J. F. Kaiser, Syslell/ AI/t/II·sis by Digi"'l COlI/
(2) T. G.
putations are facilitated by utilizing the fact that the •
PilI!!,. New York: Wiley, 1966, ch. 7.

search procedure is a linear program. Stockham. Jr., private communication.
IEEE TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS MARCH 1969
48
I. Introduction
When implementing a digital filter, either in hardware

or on a computer, it is important to utilize in the design a
A Bound on the Output bound or estimate of the largest output value which will
be obtained. Such a bound is particularly useful when
of a Circular Convolution fixed point arithmetic is to be used, since it assists in de
termining register lengths necessary to prevent overflow.
with Application to This paper considers the class of digital filters which have
an impulse response of finite duration and are imple
Digital Filtering mented by means of convolution sums performed using
the discrete Fourier transform (DFT). The outpUt sam
ples of such a filter are obtained from the results of N
ALAN V. OPPENHEIM. Member, IEEE point circular convolutions of the filter impulse response
(kernel) with sections of the input. These circular convolu
CLIFFORD WEINSTEIN. Student Member, IEEE
tions are obtained by computing the DFT of the input
Lincoln Laboratory
section, multiplying by the DFT of the impulse response,
Massachusetts Institute of Technology
Lexington, Mass. 02173 and inverse transforming the result. Stockham [1] has dis
cussed procedures for utilizing the results of these circular
convolutions to perform linear convolutions, rationales
for choosing the transform length N, and speed advan
tages to be gained by using the fast Fourier transform
Abstract (FFT) to implement the DFT. We concern ourselves here
When implementing a digital filter, it is important to utilize in the only with bounding the output of the N-point circular
design a bound or estimate of the largest output value which will be convolutions.
obtained. Such a bound is particularly useful when fixed point arith
metic is to be used since it assists in determining register lengths II. Problem Statement
necessary to prevent overflow. In this paper we consider the class of According to the above discussion, we would like to
digital filters which have an impulse response of finite duration and are determine an upper bound on the maximum modulus of
implemented by means of circular convolutions performed using the an output value that can result from an N-point circular
discrete Fourier transform. A least upper bound is obtained for the convolution. With {x. I denoting the input sequence,
maximum possible output of a circular convolution for the general
{h. I denoting the kernel, and {yo I denoting the output
sequence, we have
case of complex input sequences. For the case of real input sequences,
N-I
a lower bound on the least upper bound is obtained. The use of these
y. = L Xkh(n-k) mod.v n = 0, I, . . , .V 1 (I)
k-O
. -
results in the implementation of this class of digital filters is discussed.
where it is understood that, in general, each of the three

sequences may be complex. The circular convolution is
accomplished by forming the product
(2)
where
1 N-I
X" = -
L X.W·k k = 0, 1, . . . , .V - 1 (3)
,V n-O
1 N-l
Y" = -
L IJ"Wn" k = 0, I, . . . ,N - 1 (4)
N .-0
N-l
lh =IL It.Jrn" k = 0, I,'" ,N - 1 (5)
with W defined as W=exp (j2'l1/N].

For convenience in notation, we imagine the computa
tion to be carried out on fixed point fractions. Thus we
bound the input values so that
I Xn I � I. (6)
Manuscript received November 26, 1968. By virtue of (3) we are then assured that
This work was sponsored by the U. S. Air Force. I x,,1 � 1
IEEE TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS VOL. Au-17, No.2 JUNE 1969
49
so that the values of Xt do not overflow in the fixed point with equality if and only if 1 Hkl = 1. However, (6) re
word. quires that
In the typical cases, the sequence h .. is known and, con
sequently, so is the sequence Hk• Therefore it is not neces (12)
sary to continually evaluate (5); that is, the sequence Hk .._0
is computed, normalized, and stored in advance. Thus it
is reasonable to only apply a normalization to Hk and not
with equality if and only if 1 x.. I = 1. Combining (11) and
(12),
to h .., so that we requirel
N-I
(i) L: 1 Y.. 12:::;; N. (13)
.. -0
A normalization of the transform of the kernel so that But
the maximum modulus is unity allows maximum energy 1\'-1
transfer through the filter, consistent with the require 1 y.. 12:::;; L: 1 Ynl2 (14)
ment that Yk does not overflow the register length.
Our objective is to obtain an upper bound on ly.. 1 for and therefore
all sequences {x,,} and {Hk} consistent with (6) and (7). 1 Ynl :::;; -ylY. (15)
This bound will specify, for exa m ple, the scaling factor to
Proof of Result B
be applied in computing the inverse of (4) to guarantee
that no value of y.. overflows the fixed point word. The To show that -yiN is a least upper bound on ly"l, we
following results will be obtained. review the conditions for equality in the inequalities used
Result A: With the above constraints, the result of the above. We observe that for equali ty to be satisfied in (15),
N-point circular convolution of (I) is bounded by it must be satisfied in (11), (12). and (14), requiring that
Iy.. I � -ylY. 1) I Hkl=1

2) Ix" I=I
Result B: In the general case where {X,.} and {h.}
3) Any output sequence {Y.. } which has a point whose
are allowed to be complex, the bound in Result A is a modulus is equal to -yiN can contain only one non
least upper bound. This will be shown by demonstra ting zero point.
a sequence that can achieve the bound.
Result C: If we restrict {x .. } and/or {h.} to be real The third requirement can be rephrased as a requirement
valued, the bound of Result A is no longer a least upper on the inp ut sequence and on the sequence Hk• Specif
bound for every N. However, the least upper bound (3(N) ically, if the output sequence contains only one nonzero
is itself bounded by point then Yk for this sequence must be of the form
-yiN
-2- :::;; (3(N) � -yiN.
_
Y k = A lV·ok = I A 1 exp [i [�� noli: + ]] p
where p is a real constant and no is an integer so that, from

III. Derivation of Results
(2),
Proof of Result A
Parseval's relation requires that (16)
We can express Hk and Xk as

(8)
and 1/. = ei'k
JI'-I .'\'-1
and
L: I X.. 12 N L: 1 Xk12• (9)
i.\, i
=
n_O k_O .\k = l"�'
Substituting (2) into (8) and using (7), where we have used the fact that I Hki = I. For (\6) to be
satisfied. then
N-I N-I
L: ly.12:::;; N L: Ixkl2, (10) ,! I (I 7)

.. -0
and
or, using (9),
:211"
JI'-I ,v-I l'Jk - 11. + - '1101: + p, (18)
(11 )
=
L: ly.. 12:::;; L: I x .. 12 .\'
Therefore, requirement 3) can be replaced by the require

1 The restrictions of (6) and (7) do not impose any loss of gen ment that:
erality, and are introduced only for convenience. The bounds to be
derived on max Iy" I can be interpreted in a more general sense as 3') I xkl =constant and the phase of Hk be chosen to
bounds on the ratio max ly.l/(maxlx.1 maxlHkll. satisfy (18).
OPPENHEIM AND WEINSTEIN: OUTPUT BOUND ON CIRCULAR CONVOLUTION
50
As an additional observation, we note that for allY where Rk and Ik are real valued and (see Appendix)
input sequence {Xn } ,
N-l Rk2 + h2 =-
1 . (21)
iV
I YnI � E I Ih I I X. I
t-o
Since exp [jrIl2jN] is an even function of n, i.e.,
.with equality for some value of If if and only if IHkl = 1
and the phase of Hk is chosen on the basis of (18). There
cxp (jrn2jN] = cxp [jr(:V - n)2j.V],
fore, for any {x.} the output modulus is maximized Rk is the DFT of cos (rIl2jN) and It is the DFT of
when Hk is chosen in this manner. This maximum value sin (r/NN). Now, if we choose
will only equal ,/N, however, if, in addition, I x.1 =1 and
Ixkl =constant. x.
rU2
cos
( )
N
=
For N even, a sequence having the property that

Ix.I= 1 and I xkl =constant is (see Appendix) the and
sequence
Ilk = f 1
1.-1
(10) then
N-I
For N odd, a sequence with Ix.1 = 1 and I xkl =constant Yo = L I R.I· (22)
k_O
is2 (see Appendix)
Similarly, if we choose x ..'=sin (rn2jN), then we can
2 2 choose {Hk } in such a way that
x .. = cxp [i ""; J = W .. •. (20)
JI"-I
'
Using one of these sequences as the input, and choosing yo = L II.I·
Hk=ei•• , with 11k given by (18), equality in (IS) can be
achieved for any N. Thus the bound given in Result A is We note that since {x. } and {x..'} are both real, the
a least upper bound. values Yo and Yo' will be obtained with {Hk} having even
magnitude and odd phase, corresponding to real {h .. } .
Now, if {3 is the least upper bound for ly.. l, then
Proof of Result C
Consider first the case where {x.} is restricted to be {3 � Yo (24a)

real. It can be verified by consideration of all possibilities {3 � yo' (24b)
that for N=2 and N=3, no real sequence exists for which and, from (2 1),
both the sequence and its transform have constant mod
ulus. Therefore, for these values of N at least, ( 15) does not
provide a least upper bound, since requirements 2) and
3') cannot be satisfied simultaneously. Note, however,
that for N=4 the real sequence {x .. } = {I, 1, -1, t}
satisfies 2) and 3'), and therefore for N=4, (IS) i s a least
and, hence,
upper bound.
If {h.} is required to be real (with no such restriction
1 Rk 1 � ...jN 1 Rk 12 (25a)
on {x.. } ), then one can verify for N=2 that if {x.. } is
chosen to satisfy 2) and 3'), then the phase of Hk cannot 1 Ikl � ...jN 1 hi!· (25b)
be chosen to satisfy 3'), and thus ( 15) is not a least upper Combining (22), (23), (24), and (25),
bound for this case.
X-I N-I
To show that (3(N)�...jN/2 for {x.} and/or Ih .. }
restricted to be real valued, it suffices to show that
/3 � L 1 RkI � ...jN L IRk 12 (26a)
{3(N)�...jN/2 for both Ix. } and {h.. } real valued. This N-I N-l
we will demonstrate only for the case where N is even, {3 � L 1 Ik I � ...jN L I h 1 2• (2Gb)
since the argument for N odd is identical.
Consider the complex sequence Adding (26a) and (26b) and using (21),
2{3 � ...jN
or
with DFT denoted by ...jN
Fk = Rk + ilk
/3>
-_. 2
(27)
t The sequences (19) and (20) were suggested to the authors by

Since we argued previously that /3�...jN, Result C is
C. M. Rader of the M.I.T. Lincoln Laboratory. proved.
IEEE TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS JUNE 1969
51
IV. Discussion has a discrete Fourier transform with constant modulus.
The bound obtained in the previous sections can be uti We consider first the case of (29). Letting Xk denote the
lized in several ways. If the OFT computation is carried OFT of x .. ,
out using a block floating-point strategy so that arrays
=r
1 N I
-
[j -
?rn2 ] [j--
27rnk ]
are rescaled only when overflows occur, then a final re Xk
,\
L
..
cxp
N
cxp
N
_0
scaling must be carried out after each section is processed
or
so that it is compatible with the results from previous
=-
1 [ -j -
7rk2 ] N-I
(j7r(n + k)2].
sections. For general input and filter characteristics, the Xk exp r L exp (31)
final rescaling can be chosen based on the bounds given N ,\ n-O
here to insure that the output will not exceed the available We wish to show first that
register length.
The use of block floating-point computation requires
N-I 7r [ ]
the incorporation of an overflow test. In some cases we L cxp j-
,V
(n + k)2
n_O
may wish instead to incorporate scaling in the computa
tion in such a way that we are guaranteed never to over is a constant. It is easily verified by a substitution of
flow. For example, when we realize the OFT with a power variables that
of two algorithm, overflows in the FFT computation of IN-I
{ Xk} will be prevented by including a scaling of! at each l: cxp [j7r(n + k)2/.V] = constant 6. B. (3 2)
..-0
stage, since the maximum modulus of an array in the
computation is nondecreasing and increases by at most a But
factor of two as we proceed from one stage to the next 2.\"-1
[2]. With this scaling, the bound derived in this paper L cxp (j7r(n + k)2/N]
guarantees that with a power of two computation, scaling
is not required in more than half the arrays in the inverse Itt-I 2N-l
FFT computation. Therefore, including a scaling of! in = L cxp (j7r(n + ltV/N] + L cxp (j7r(n + k)2/N]
the first half of the stages in the inverse FFT will guaran
N-I
tee that there are no overflows in the remainder of the
computation. The fact that fJC:. v'FI/2 indicates that if we l: exp [j... (n + k)t/N]
II-I
restrict ourselves to only real input data, at most one N-I
rescaling could be eliminated for some values of N. + l: exp [j...(n + k)t/N] exp [j...N]
The bounds derived and method of scaling mentioned 11_0
above apply to the general case; that is, except for the or, since N is even,
normalization of (7), they do not depend on the filter
characteristics. This is useful when we wish to fix the scal 2.\'-1
L cxp [j7r(n + k)2/NJ
ing strategy without reference to any particular filter. For
specific filter characteristics, the bound can be reduced. N-I
(33)
Specifically, it can be verified from (I) and (6) that in = 2 L cxp [j7r(n + k)2/N].
terms of {h.. } .. _0
.1/-1 Combining (31), (32), and (33),

! y.. ! 5 L I hd (28)
1
1-0
X" -.B·exp[ -j...kt/N].
N
=
where M denotes the length of the impulse response. This

is a least upper bound since a sequence {x .. I can be To determine the modulus of B, Parseval's relation re
selected which will result in this value in the output. This quires that
will be significantly lower than the bound represented in
N-I X-I
(15) if, for example, the filter is very narrow band, or if
the kernel has many points with zero value. l: I·on 1 2 = .v L I xd2
11-0 __ 0
or
Appendix
.V =
I B 12•
We wish to demonstrate that for N even, the sequence Therefore
Xn = exp
[j 7rn2 ] n = 0, 1, . . . , .V - 1
(20)
N N even or
1
has a discrete Fourier transform with constant modulus I X k!
and that for N odd, the sequence
=
vN .
It can be verified by example (try N = 3) that the sequence
27rn2 ]
[jN n = 0, 1, . . . , .V - 1
of (29) does not have a OFT with constant modulus if N
Xn = exp (30)
Nodd is odd.
OPPENHEIM AND WEINSTEIN: OUTPUT BOUND ON CIRCULAR CONVOLUTION
52
Consider next the sequence of (30). We will show that But
Xk has constant modulus by showing that the circular
autocorrelation of x, which we denote by
.. en, is nonzero
"'-1
L CXP -j JV
[
471"rn
-
]
only at 11=0. Specifically, consider .-0
N-I
n=O
c" = L X,X7ra+r)rnOd N o n ¢O, N odd
._0
N
�
L.... cxp
[. 271"r2 ] [. 271"[(n + r)2]mod N]
J exp -J .
n=-,
2
N even
N N
--
-0
N
•
Now, o n ¢O,n ¢-, N e\"cn.

2
exp
[ -J.271"[(n+r)2]mOtlN]= [ exp -J
.271"(n + r)2 ] . Since we are considering the case of N odd,
N N
n=O
Therefore,
n ¢O.
N-I [. 271"r2 ] [ . 271"(r + n)2] Since I xkl is constant, we may again use Parseval's
en = L cxp J exp -J
theorem to show that I xkl =l/vN.
----
N N
--
• =0
N-I [ j 21r1l2] [ -j-- 471"rn] References

L - exp exp
[I)
N N T. G. Stockham. "High speed convolution and correlation."
--
._0
1966 Sprillg Joillt Compllter COIlf., AFlPS Proc., vol. 28.
[ -j 271"n2] N-I [ -j 471"rn] Washington. D. c.: Spartan, 1 9 66. pp. 229-233.
= exp L --
N
exp
N
-- . [21 P. D. Welch. "A fixed-point fast Fourier transform error anal
ysis," this issue. pp. 151-157.
.-0
53
SOME PRAcrI CAL CONSIDERATIONS IN THE
REALIZATION OF LINEAR DIGITAL FILTERS
J. F. KAISER
Bell Telephone Laboratories, Incorporated, Murray Hill, New Jersey
ABSTRAcr
The literature on sampled-data filters, although extensive on design

methods, has not treated adequately the important problems connected
with the act'ual realization of the obtained filters with finite
arithmetic elements. Beginning with a review of the traditional
design procedures a comparison is made between the different
canonical realization forms and their related computational procedures.
Special attention is directed to the problems of coefficient accuracy
and of rounding and trucation effects. A simple expression is
derived which yields an estimate of the required coefficient accuracy
and which shows clearly the relationship of this accuracy to both
sampling rate and filter comPlexity.
54
SOME PRACl'ICAL CONSIDERATIONS IN THE
REALIZATION OF LINEAR DIGITAL FILTERS
J. F. KAISER
Bell Telephone Laboratories, Incorporated, Murray Hill, New Jersey
The high speed general purpose digital computer has become a powerful
and widely used tool for the simulation of large and complex dy�amic
systemsl and for the processing and reducing of large amounts of data
by filter methods. The increased computational accuracy of the
machines, the broader dynamic ranges in both amplitude and frequency
of the system variables, and the increasing order or complexity of
the dynamic systems themselves have made it necessary to take a much
closer look at the computational and realization details of the
designed digital filters. �ny of the problems now coming to light
were not noticed before2, 3, ,5,6,7 either because the filters were
of low order with low ( two or three decimal ) accuracy requirements
or because the sampling frequencies were comparable to the dynamic
system and signal frequencies. An understanding of these computational
problems and realization considerations is of vital interest to the
users of the different digital filter design methods as their presence
may often spell the success or failure of a particular application.
The problems to be treated in this paper relate to the numerical

determination of the digital filter coefficients, to the stability of
the digital filters themselves, to the precision of the arithmetic
necessary to carry out the desired filtering operation, and to the
choice of filter design methods. The choice of a satisfactory
realization scheme is discussed in detail as it pertains to the
previously mentioned problems.
Stability and Coefficient Accuracy
The two most widely used methods for the design of digital filters
that approximate continuous linear filters with r �tional transfer
characteristics are the bilinear z transformation and the standard
z transformation 2 methods.
The bilinear z transform is algebraic in nature and consists simply

of a substitution or change of variable in the continuous filter
transfer characteristic H (s ) ; i. e., the digital filter H* (z-l ) is
formed as
(1)
where
sT (2 )
z e
the unit advance operator.
55
The bilinear z transform simply maps the imaginary axis of the s-plane
into the unit circle of z-l plane with the left half of the s-plane
mapping into the exterior of the unit circle in the z-l plane. The
mapping is one- to-one and thus unique. Thus if the transformation
indicated by (1) is carried out exactly then H* ( z-l ) will be stable
if H ( s ) is stable and will be of precisely the same order. The
bilinear z form can theoretically be applied directly to a rational
transfer characteristic, H ( s ) , in either polynominal or factored form.
It will be shown later which form is to be preferred.
The standard z transformation method consists first of obtaining a

partial fraction expansion of H ( s ) in its poles and then z transforming
each partial fraction by making use of the transform pair
1 T
s+a -aT -l
l_e z
or transform pairs derived therefrom. The mapping is again such that

the left half of the s-plane maps into the exterior of the unit circle
in the z-l plane with the portion of the imaginary axis in the s-plane
from n� /T to ( n+2 ) � /T, for n any integer, mapping into the unit circle.
Thus the standard z transformation when applied to a stable transfer
characteristic H ( s ) always yields an H* ( z- l ) that is also stable
provided the arithmetic is carried out with infinite precision.
l
For both transforms the· resulting H* ( z- ) for linear lumped continuous
filters is of the general form
k=l,n
(4)
IT ( l-z-l /zk )
where bO has been set unity with no loss in generality. The question
now arises as to what accuracy must the coefficients b k be known to
insure that the zeros of Dd ( z-l ) all lie external to the unit circle1
the requirement for a stable digital filter. First a crude bound will
be established to be followed by a more refined evaluation of the
coefficient sensitivity.
l
The polynomial Dd ( Z- ) can be written in factored form as follows
l l
n
Dd( z- ) IT (l-z - /zk)

k=l
•
56
For ea se of pre sentation only simple poles are a ssumed for the
ba sically low-pa s s transfer characteri stic R ( s ), there being no
difficulty in extending the analysi s to the multiple order pole and
non low-pa ss filter ca ses. If the standard z transform is u sed then
D ( z-l ) become s
d
( 6)
th
where Pk repre sents the k pole of R ( s ) and may be complex. For the
bilinear z transform there re sult s
A s suming now that the sampling rate, l iT, ha s been chosen to be at

lea st twice the highest critical frequency in the R ( s ) and in the
signals to be processed by the R ( s ), it is of intere st to see how
D ( z-l ) behaves as the sampling rate is increa sed further.
d
The N yqui st limit con straint dictates that for the standard
z transformation
(8)
The critical frequencies are normalized with re spect to half the

sampling frequency a s
where IDzl i s the Nyqui st frequency or one-half the sampling frequency.

Normally
(10)
Thus a s the sampling frequency is increa sed the � decrea se from unity
and approach zero. Then one can write for the standard z transform
case
57
[1 -
e
P kT
Z_' :J > [1 -
(l+,&')Z-
l :J (11)
as T-+ 0
and for the bilinear case
(12)
as T-+ 0
which illustrates that the' two design methods yield essentially the
same characteristic polynomials, Dd(Z-l), in the limit as T is made
small.
l
Inspection of (11) and (12i show that the zeros of Dd (z- ) tend to
cluster about the pOint z- = +1 in the z-l plane, i.e.,
where for a itable system the � has a negative real part. Now the
filter H*(z- ) will become unstable if any of its poles move �cross
the unit circle to the interior as a result of some perturbation or
change in the coefficients bj' To estimate the order of this effect
one computes the change necessary to cause a zero of Dd(z-l) to occur
at the point z-l = 1.From (5) and there results (13)
(14 )
But
n
= 1 +
f' b z-
L... k
k=l
k
I z -1 =1.
= 1 +
\' b
L k
It=l
The right hand side of this expression is an important quantity and is
therefore defined as
(16)
58
Thus by combining (14) and (15) it is immediately seen that i ): any of
the bk are changed by the amount given by (16) then the Dd(z- ) will
have a zero at z-l = 1 and the filter H*(z-l) will thus have a
singularity on the stability boundary. A zero of Dd(z-l) at z-l = 1
causes H*(z-l) to behave as if an integration were present in the H(s).
Any further change in the magnitudes of any combination of the bk in
such a manner as to cause D (z-l) -1 to change sign will result
d
Iz =1
l
in an unstable filter, i.e., with some of the zeros of Dd(z- ) lying
inside the unit circle. Hence (14) is the desired crude bound on
coefficient accuracy.
Equation (14) has a significant interpretation; it states that for

small J.lk (large sampling rates) the bound on coefficient accuracy is
dependent on both the order n of the filter and the sam�fiing rate or
normalized filter pole locations. Thus going from an n order filter
to a (2n)t h order filter at the same normalized frequency will require
approximately twice as many digits accuracy for the representation of
the bk. Similarly doubling the sampling rate for an nth order filter
requires nxloglO 2 or 0.3xn additional decimal digits in the
representation of each of the bk•
Equation (15) has the interpretation that it represents the return

difference9 at zero frequency when the filter H*(z-l) is realized in
direct form as shown in Fig. 1. This expression iS �lso recognized as
simply the reciprocal of Blackman's deadband factorl >.. Thus for
complex filters with fairly large sampling rates the quantity FO will
usually be very small. For example a fifth order Butterworth low
pass filter with its break frequency at 1/10 the sampling frequency
yields an H*(z-l) having FO 7.9xl O-5.
=
The coefficient accuarcy problem is somewhat further aggravated by the

fact that as T is made smaller the bk tend to approach in magnitude
the binomial coefficients ,11 (R), and tend to alternate in sign. Thus
the evaluation of D �(z-l) involves the perennial computational problems
associated with the differencing of large numbers. A better bound
on coefficient accuracy is obtained by dividing FO as obtained from
(16) by the magnitude of the largest bk• The largest bk is given
approximately by
n
4 2
maxbk :::::-*--
5 In
59
Hence from (14), (16), and (17) an absolute minimum bound on the number
of decimal digits � required for representing the bk is found as
(18)
where [ x] denotes the "greatest integer in x.

"
While the foregoing analysis has yielded an easily computable absolute

accuracy bound on the denominator coefficients of the recursive digital
filter, the bound is not necessarily the best possible nor does it
say anything about what happens to the zeros of Dd(z-l) as small
perturbations, less than FO' are introduced in the values of the bk•
It is not enough to say that the digital filter H*(z-l) is simply
stable; what is necessary is that the obtained digital filter have
response characteristics close in some sense to those of the continuous
filter it is a�proximating. This means that the sensitivity of the
zeros of Dd(Z- ) to changes in the bk must be determined. The most
direct way to establish this relationship is to equate the two forms
of the denominator of (4) and then to compute dz /dbk• There results
i
for filters with simple poles only
k+l
dZi z
n ( z� i
(19)
II 1 - Z
�=
k=l
k,h
from which the total differential change in any zero may be evaluated
as
(20)
These results ex�end directly to the multiple order pole case.

12
Utilizing the fact that a pole, Pk' of H(s) transforms to a zero of
1 -PkT
D (z ) at e
-
for the standard z transform and that for T such
-
t�at PkT « 1 the zero becomes approximately equal to 1 PkT, as

given previously by (13) the fractional change 51 in a zero location z1
can be expressed in terms of the fractional change Ek in a coefficient
bk of the polynomial Dd(z-l). Using (19) there resuIts
60
(21)
&.
i
where =--
and 11 - zi l « 1, i.e., tightly clustered zeros have been assumed.

For the purpose of quick ly estimating the value required for ck'
the product of the (n-l) factors can be approximated coarsely by the
product (14) which is simply FO . Thus (21) illustrates that the
precision required for the representation of the bk is increased by
the factor log O (l/O ) over that given by (18).
l i
Returning to (19) and (20) it is seen that the detailed changes in
the positions of the zeros resulting from changes in the bk are in
general complex functions as the zi may be complex. The changes in
the bk can occur as a result of imprecise arithmetic used in their
computation or as a result of truncating or rounding the obtained bk
coefficients to a smaller number of significant digits. The
qualitative evaluation of (19) can also be carried out by using the
well developed ideas of the root 1 cus.13 For example the changes in
location of all the zeros of Dd (Z-�) as a result of a change Pk
in bk are found from
(22)
or
fi (l _ z-
z
)
l
+
i=l i
This has the appearance of the standard root locus problem for a
single feedback loop having the loop transmission poles of the zi' a
kth order zero at the origin, and a loop gain factor of PIt. The
parameter F O is simply the "gain" PIt required when the root locus
passes through the point z-l = 1. Thus all the techniques of the
root locus method and the insight gained thereby can be brought to
bear on the problem.
By viewing the coefficient sensitivity problem in terms of root loci
the effects of both increasing filter order and especially increasing
the sampling rate can be easily observed. Increasing the sampling
rate tends t cluster the poles of H* (z-l) even more compactly about
the point z-� = 1 as Fig. 2 shows for a third order filter. As
628
61
filter order increases s o does the possible order k o f the zero a t the
origin of the z-l plane. All n branches of the root loci begin at the
t
roots Zij as Pk increases k branches converge on the k h order zero at
the origin and n -k branches move off toward infinity with eventually
radial symmetry. The angles the 10 c i make as they leave the zi are
simply the angles given by evaluating (19) at each zi. The value of
P.k at which a branch of the locus first crosses the unit circle
( the stabi1ity�oundary ) gives the measure of total variation that can
be made in bk and still keep the filter stable. Clearly the closer the
roots z i are to the unit circle initially the smaller will be the value
of Pk necessary to move them to lie on the boundary. Thus by varying
the P k ( the changes in b k) the extent of the stability problem can be
viewed.
The development up to this point assumed that the H ( s ) was basically

low pass with simple poles. Extension to filters of high pass,
bandpass, or bandstop types and with multiple order poles presents no
real problems when viewed using the root locus idea. For example a
digital version of a narrow bandpass filter with center frequency at
We would have its poles and zeros located as shown in Fig. 3 about the
radial lines at ±wcT radians. The coefficient sensitivity analysis
proceeds in the same way as before except that now the points on the
unit circle in the vicinity of z-l = cos wcT ± j sin wcT replace the
point z-l = 1in the stability computations. This is easily seen for
the standard z transform where a continuous pole of H ( s ) at
Pk = Ok + j� transforms to a zero of Dd ( z- l ) at
P T O T
e- k , i.e., at e- k L CUkT when written in polar form. Thus it
follows that it is primarily the smallness of the real parts of the
filter poles of H ( s ) that cause the z-l plane poles to be very near
the unit circle and as a result to contribute measurably to the
coefficient accuracy and the related sensitivity problems. It can be
shown that expressions quite similar to (14) and
(18) can be developed
for digital bandpass and bandstop filters. The expressions differ
primarily in that for an nth order filter only n/ 2 terms in the product
fi PkT will in general be small and thus contribute to this measure of

sensitivity.
It is interesting to note that in the construction of continuous

filters from RLC elements performance is limited primarily by the
obtainable Q's of the inductors. The representation of the continuous
filter by a digital filter is also strongly influenced by the Q's
required of the filter section as it has been shown this directly
establishes the number of digits needed to represent the digital filter
coefficients.
In this section some of the relationships between filter order, pole

locations, sampling frequency and digital filter coefficient accuracy
have been established. The question then logically arises, how do
these results affect the form chosen for realizing the digital filter?
This question is discusse� in the following section.
62
Realization Schemes
The three basic forms for realizing linear digital 'filters are the
direct, the cascade and the parallel forms as shown in Fig. 4. As far
as the stability question goes the two variations of the dil'ect form,
Fig. 4 (a) and Fig. 4 (b), are entirely equivalent with the coUfiguration
of Fig. 4 (a) requiring fewer delay elements. The stability results
developed in the previous section indicate clearly that the coefficient
accuracy problem will be by far the most acute for the direct form
realization. For any reasonably complex filter with steep transitions
between pass and stop bands the use of the direct form should be
avoided.
The choice between the utilization of either the cascade, Fig. 4 (c),
or parallel, Fig. 4 (d), forms is not clear cut but depends somewhat
on the initial form of the continuous filter and the transformation
scheme to be used. In any case the denominator of H (s) must be k nown
in factored form. If the parallel form is desired then a partial
fraction expansion of H(s) must first be made. This is followed by a
direct application of either (1) or (3) if the bilinear or standard
z transforms are used respectively. For bandpass or bandstop
structures the midfrequency gains of the individual parallel sections
may vary considerably in magnitude introducing a small problem of the
differencing of large numbers. This parallel form is perhaps the most
Widely used realization forms.
For cascade realization the bilinear z form requires that the

numerator in addition to the denominator of H(s) be k nown in factored
form. The splitting into simpler cascaded forms can then be done
rather arbitrarilY since the bilinear z operator has the property that
(24)
If the standard z transform is utilized, a partial fraction expansion

must first be made followed by z transforming term by term. Then the
fractions must be collapsed to yj.eld an Nd (z-l) which must then be
factored to permit the cascade realization. This more involved
procedure is necessary because the standard z transform does not
possess the transform property given by (24).
The discussion up to this pOint has centered on satisfactory means for

obtaining the digital filter coefficients required for the desired
realization form. In actually using the digital filter to process data
streams the performance of the filter will also be affected by the
quantization of the data and by roundoff in the multiPlication and
add1.tj.on operations. The recent paper by Knowles and Edwardsl4 treats
this aspect of the problem in some detail. Their results tend to
indicate also that the parallel form realization exhibits slightly
less performance error than the cascade form and that the direct form
is definitely inferior to both the cascade and parallel forms.
63
Summary
After reviewing briefly two design procedures for digital filters

expressions were derived e stimati n g the accuracy required in the
obtained filter coefficients. The expressions showed clearly the
relationship of coefficient accuracy to filter complexity and sampling
rate. These results also indicated which of the canonical realization
forms are to be preferred.
BIBLIOORAPHY
1. Golden, R. M., "Digital Computer Simulation of Communication

Systems Using the Block Diagram COmPiler: BLODIB, " Third
Annual Allerton Conference on Circuit and System Theory,
Monticello, Illinois, October 1965.
2. Ragazzini, J. R. and G. F. Franklin, "Sampled Data Control

Systems," McGraw Hill, 1958.
3. Monroe, A. J., "Digital Processes for Sampled Data Systems,"

John Wiley, 1962.
4. Jury, E. 1., "Sampled-Data Control Systems, " John Wiley, 1958.

5. Tou, J. T., "Digital and Sampled- Data Control Systems, "
McGraw Hill, 1959.
6. Jury, E. 1., " Theory and Application of the z-Transform Method,"

John Wiley, 1964.
7. Freeman, H., "Discrete Time Systems," John Wiley, 1965.

8. Kaiser, J. F., "Design Methods for Sampled-Data Filters,"
Proceedings First Allerton Conference on Circuit and System Theory,
November 1963.
9. Bode, H. W., "Network Analysis and Feedback Amplifier Design,"

Van Nostrand, 1945, pp. 47-49.
10. Blackman, R. B., "Linear Data -Smoothing and Prediction in Theory

and Practice," Addison-Wesley, 1965, p. 76.
11. Mansour, M., "Instability Criteria of Linear Discrete Systems,"

Automatica Vol. 2, n. 3, January 1965, pp. 167-178.
12. Maley, C. E., "The Effect of Parameters on the Roots of an

Equation System," Computer Journal Vol. 4, 1961-2, pp. 62-63.
13. 'Truxal, J. G., "Automatic Feedback Control System SyntheSiS, "

McGraw Hill Book Co., Inc., New York 1955, pp. 223-250.
14. Knowles, J. B. and R. Edwards, "Effect of a Finite-Word Length

Computer in a Sampled-Data Feedback System, Proc. lEE Vol. 112,
No. 6 June 1965, pp. 1197-1207·
64
Fig. I
Z-I plone
j1
1t T=3
� T=I
1t
•
•
1t
Fig.2
Z-1 plone
Fig. 3
65
--�
4(d)
Fig.4
66
PROCEEDINGS
A Comparison of Roundoff Noise in Floating Point and

'FlY"" Pn;nt nit'l:t�1 r;tf'n .. D,. ... ': ........ = __ _
Abstract-A statistical model for roundoff noise in floating point

�igital filters. proposed by Kaneko and Liu. is tested experimentally
for first- and second-order digital filters. Good agreement between
theory and experiment is obtained. The model is used to specify a
:omparison between floating point and fixed pOint digital filter
'ealizations on the basis of their output noise-to-signal ratio. and
:urves representing this comparison are presented. One can find
,alues of the filter parameters at which the fixed end the floating
-. . � . -,.-- ._-_ • • -1S ................ 1:fL .. �.
Recently, Kaneko and Liu' used a statistical model to predict theo

'etically the effect of roundoff noise in digital filters realized with floating
loint arithmetic. This letter is concerned with providing an e�perimental
'erification of the model, and the use of the model in specifying a quantita
ive comparison between fi�ed point and floating point realizations. We
cstrict attention to first- and second-order filters, both in the interest of
implicity and because more complicated digital filters arc often Con-
..... ,.. .... � .. _ ___ L!_ .. ,' _
FIRST-ORDER CASE
J:'np '.!II It .."" .............. 4;:t. ___ I'.L. r__ _
w" = aw,._! + X., (I)
Ihere x. is the input and w. is the output, the computed output y. is

\J ,.... rnu 11..L #0 \ I � " • • I: \
be random variables t. and {. account for the roundoff errors due to the
n�tin� �"'int ""ult;:-,I�I O:!O .......................:-...�: •• :�:•• :::� ::-: ::..::.: •• ..!_� :".1
1r..I<r', If 1< ,-, ,,,
0110 wing Kaneko and Liu, we define the error e.=y.-w., subtract (I)
om (2� neglect second-order terms in e, to and e, and obtain a difference
!U3lion fnr thp ?_� prrnr �Q'
(4)
Manuscript received February 10. 1969. This work was sponsored by ,he U.S. Air Force.
I T. Kaneko and B. Liu. "Round-olf error of ftoating-poine digital fillers.u oresenled
�.!�X!� ��.�.I.�I�e��n �onf. '!". Ci�ui� and Syster
67
PROCEEDINGS OF THE IEEE, JUNE 9
1 69
Assuming that t. and �. are independent from sample to sample (white),

and that t., �., and the signal x. are mutually independent, u. in (4) is white
--THEORETICAL
noise with va�iance dictated by the statistics of x. and the variances ,,; and
": of t. and e•. The variance ": of the output noise e. is obtained easily o WHITE NOISE
from the variance ,,� of u. as .. SINE WAVE ("'0 • t6.3")

D SINE WAVE ("'0 • 54.3°)
(5) Nb.'
: wlNb-
I;)
where h.=a" is the filter impulse response.
'----' <!I"
For example, if we assume that x. is stationary white noise of variance g
(T�, we obtain -IN
(6)
0.5 07 09 0.95 0."

For the case of a high gain filter, with a= I-�, and � small, (6) becomes a
Fig. I. Theoretical and experimental noise-to-signal ratio for a first-order filler. as a
function of pole position. The noise-to-signal ratio is represented in bits.
(6a)
TABLE I
If, instead, x. is taken to be a sine wave of the form A sin (won + 4» with
THEORETICAL AND EXPERIMENTAL NOISE-TO-SIGNAL RATIO FOR A SECOND
4> uniformly distributed in (0, 2><), then
ORDI,R FILTER, AS A FUNCTION OF POLE POSITION
(7)
1082
� [;ik] (bits)
To test the model, ": was measured experimentally for white noise and
sine wave inputs. Each input was applied to a filter using a 27-bit mantissa, , 9
and also a filter with the same coefficient a, but using a shorter (e.g., 12-bit) White Noise Sine Wave
mantissa in the computation. The outputs of the two filters were then sub
TbeoretK:aI Experimental TbeoretK:aI Experimental
tracted, squared, and averaged over a sufficiently long period to obtain a
stable estimate of ,,:. Kaneko and Liu assumed that e. and t. were both 0.55 22.5 1.48 1.66 1.54 1.64
uniformly distributed in (- r', r') with variances ,,; =": =!r". Actual 0.7 22.5 2.16 2.33 2.23 2.38
measurements of the noise due to a multiply and an add verified that t. and 0.9 22.5 3.32 3.33 3.35 3.45
e. have zero mean, but indicated that the variances 0.55 45.0 0.93 1.08 0.97 0.94
0.7 45.0 1.36 1.44 1.37 1.51
(8) 0.9 45.0 2.28 2.51 2.22 2.14
0.55 67.5 0.42 0.46 0.39 0.33
0.7 67.5 0.75 0.88 0.65 0.62
would better represent these noise sources. Using (I), (6), (7), and (8), we
0.9 67.5 1.63 1.97 1.45 0.99
can compute the output noise-to-signal ratio for both white noise and
sinusoidal inputs for the first-order case as
For the case of a high gain filter, with r= I-�, (II) becomes approximately
'
,,'-1- = (I )
(0.23)2-" --,
+ a
• (9)
"w I -a , ( 3+ 4 cos' 0 )
" = """ . (13)
. , . 1M' sin' 0
I,
In Fig. experimental curves for noise-to-signal ratio are compared with
the theoretical curve of (9). For the case of sinusoidal input, we obtain
": = A'G";[�r411:1I' + 6r' cos' 01111' + ! - 4r31111' cos 0 cos wo

SECOND-ORDER CASE
- r' I1:I1 cos (4) - 2wo) + 2rl1:l 1 cos 0 cos (4) - wo)] (14)
An analysis similar to the above can be carried out for a second-order
filter of the form
where 1111 and 4> represent the magnitude and phase.of the filter system
function at the input frequency woo In Table I, a comparison of theoretical
"'n = - ,Z"',,_2 + 2r cosOw,,_1 + XII' (10)
and experimental values for output noise-to-signal ratio are displayed for
a second-order filter.
with a complex conjugate pole pair at z=retjO Based on experimental
'
verification of (8) in the first-order case, we assume here that the t S and
fs representing the errors in the second-order case have the same variance, FIXED VERSUS FLOATING POINT COMPARISON
as given by (8).
The statistical model of floating point roundoff noise proposed by
When x. is stationary white noise, we obtain for the variance of the Kaneko and Liu and one of fixed point roundoff noise as presented for
noise e.,
example by Gold and Rader' provide the framework for comparing these
two structures on the basis of the resulting noise-to-signal ratio. We con
r' cos'O
)J
,,' = ,,',,' G + G'
, C :K 3r' + 12r' --- (II) sider only the case of white noise inpul.
[ ( 1 For the fixed point case, the register length must be chosen sufficiently
long so that the output cannot overflow the fixed point word. If II. denotes
where the impulse response of the filter, Ihen output 11'. is bounded according to
,,'
- ..! -
" + r') (
(--
I I ).
G
- ,,�
-
�o h' -
• •
-
I - r' r' + I - 4r' cos' 0 + 2r'
(12) :I B. Gold and C. M. Rader� "Effect of quantization noise in digital fihers.·· /966 Spring
Joint Computer Conf.. AFIPS Proc., vol. 28, Washington. D. C. Spartan, 1966. pp. 213-219.
68
PROCEEDINGS LETTERS
ar-------� first-order filter, Fig. 2(a) presents curves of! log, (a!/a!a!) as determined
from (8), (17), and (19� These curves represent a comparison of the rms
noise-to-signal ratio for the two cases, in units of bits. In Fig. 2(b), a similar
comparison is illustrated for the second-order case. For the purpose orthe
iIIustration,O was kept fixed and only r varied.
Fig. 2(a) and (b) indicates that floating point arithmetic leads to a lower
noise-to-signal ratio than fixed point if the floating point mantissa is equal
in length to the fixed point word. We notice that for high gain filters, as a
increases toward unity in the first-order case, and as r increases toward
unity for 0 fixed in the second-order case, the noise-to-signal ratio for fixed
point increases faster than for floating point.
However, this comparison does not account for the number of bits
needed for the characteristic in floating point. If c denotes the number of
bits in the characteristic, this would be accounted for in Fig. 2 by numeri
(a)
cally adding the constant c to the floating point data. This shift will cause
. 2 .----- --------,
the floating and fixed point curves to cross at a point where the noise-to
signal ratios are equal for equal total register lengths.
For the sake of the comparison, we provide just enough bits in the
8 FIXED AT 20·
characteristic to allow the same dynamic range for both the floating and
the fixed point filters. If '" denotes the fixed point word length, then the
Nr:--'
b-JNb. 6
requirement of identical dynamic range requires that
� (2 1 )
gN
..J
Assuming for example that ',,= 16 so that c=4, crossover points in the
_IN
noise-to-signal ratio will occur at a=0.996 in the first-order case, and at
r=0.99975, 0=20°, in the second-order case depicted by Fig. 2(b).
0 .. 0 ...
CLIFFORD WEINSTEIN
ALAN V. OPPENUEIM
M.I.T. Lincoln Lab.
(b)
Fig. 2. Comparison or fixed point and float ina point noise-to-si,"al ratios.
Lexington, Mass. 02173
(a) Firsl-order filter. (b) Second-ordcr filter. 8 -20'"
..
max ( l w.l) = max (lx.1> L Ih.l· (15)
.-0
Interpreting the fixed point numbers as signed fractions, we require for

no overflows that l w.1 < I, restricting x. to the range
1 I
- - ..--- '
..-- < XII < + -- (16)
L Ih.1 L Ih.1
,,-0 .-0
With x. white and uniformly distributed between the limits in (16), the
resulting output noise-to-signal ratio for a first-order filter is
a'
--i
I (
_2-2, L
..
Ih.1
)' 1 2-"
= - -----.' (17)
aw
=
4 . -0 4 (I - a)
and for a second-order filter
� 1 (
= - 2-"
(Jw 2
f Ih.I)' .-0
=
1
- 2-"
2
(� f
Sin 0 l1li-0
r"lsin [In + 1)0]1
) "
(18)
The variance of the roundofTnoise due to a multiplication is taken as n 2 - "

with' denoting the fixed point register length.
For the case of floating point computation, the noise-to-signal ratio for
the first-order filter is
ERRATA
(\9) A numerical error was committed in plotting

Figure 2(b). All points on the curve rep�esenting
fixed point noise-to-signal ratio should be 1.57 bits
where, is the number of bits in the mantissa. For the second-order filter,
we have higher than the values displayed in the figure.
Because of this erro�, the crossover p�int in
a2 [( r4 cos2 0 )] noise-to-signal ratio given in the text is incorrect.

� = (0.23)2-2' I + G 3r4 + 12r' cos2 0 - 16
�
. (20) The last phrase of the text should read, " • • • and
0
would occur at r - 0.998, 8 - 20 , in the second
For a comparison of floating and fixed point arithmetic in the case of a order case depicted by Figure 2(b)."
69
Submitted for publication to IEEE Transactions on Audio and
Electroacoustics.
Realization of Digital Filters Using Block-Floating-Point Arithmetic*
by
Alan V. Oppenheim
Lincoln Laboratory, Massachusetts Institute of Technology

ABSTRACT
Recently, statistical models for the effects of roundoff noise

in fixed-point and floating-pOint realizations of digital filters have
been proposed and verified, and a comparison betv.e en these reali
zations presented. In this paper a structure for implementing
digital filters using block-floating-point arithmetic is proposed and
a statistical analysis of the effects of roundoff noise carried out.
On the basis of this analysis, block-floating-point is compared to
fixed-point and floating-pOint arithmetic with regard to roundoff
noise effects.
*This work was sponsored by tfie Department of tfie AIr Force. JULY 1969
70
Introduction
Recently, statistical models for the effects of roundoff noise in fixed-point and
floating-point realizations of digital filters have been proposed and verified, and, a
comp3rison between these realizations has been suggested. (I), (2), (3) In general terms
the comparison revolves around the fact that while floating-point arithmetic has a larger
dynamic r.mge than fixed-point, the latter is more accurate when the full register length
can be utilized. Because of the limited dynamic range of fixed-point arithmetic, for high
gain filters, the input signal must be attenuated to prevent overflow in the output. Thus,
for suffiCiently high gain. floating-paint arithmetic leads to lower noise-to-signal ratio
than fixed point. On the other hand, floating-paint arithmetic implies a more complex
hardware structure than fixed -point arithmetic.
An alternative realization, block-floatmg-point, has some of the advantages of both
fixed point and floating-paint. In this paper a structure for implementing digital filters
using block-floating-point arithmetic is proposed and a statistical analysis of the effects
of roundoff noise presented. On the basis of this analysis, block-floating point is com
pared to fixed-point and floating-point arithmetic with regard to roundoff noise effects.
A Structure for Block-Floating-Point Realization

In block-floating-point arithmetic the input and filter states (1. e. the outputs of the
delay registers) are jointly normalized before the multiplications and adds are performed
using fixed-point arithmetic. The scale factor obtained during the normalization is then
applied to the final output to produce a fixed-point result. To illustrate, consider a first
order filter described by the difference equation
(1
)
To perform the computation in a block-floating-point manner, we define
1
(2)
A =
n
where IP [M] is used to denote the integer power of two so that i � {M' IP (M)} < 1.
Thus A represents the power-of-two scaling which will jomtly normalize x and Y l'
n n n-
Thus with block-floating-point we can compute y as
n
(3)
where the multiplications and addition in (3) are carried out in a fixed point manner.
71
Because of the recursive nature of the computation for a digital filter it is advan
tageous to modify (3) as
(4)
with
and
t:.
n ::; AnfAn- 1 .
The difference between (3) and (4) is meant to imply that the number An Yn rather than
y n is stored in the delay register of the filter. Because of (2), An yn is always more
accurate (or as accurate) as yn since multiplication by A n corresponds to a left shift
of the register.
A disadvantage with (4) is that yn- 1 must be available to compute A n, and t:. n must
then be obtained from A n and An- 1. An alternative is represented by the set of equations
" " "
yn ::; t:.
n x n + a1 t:. n yn- 1 (Sa)
with (5b)
(5c)
1
and (5d)
In this case, we first ·scale xn by An- 1 to form Qn and then determine the incremental
scaling using (Sd). As in (4) the scaled value �n is stored in the delay register and the
output value yn is determined from �n using S(c). If we consider the general case of an
N th order filter of the form
yn ::; x n + a1 yn- 1 + a2 yn-2 + . . . + aN yn- N
72
then the block-floating-point realization corresponding to (5) and represented in the
direct form is depicted in Fig. 1. For the general case,
1
A
n
= (6)
IP [max £l�nl. Iwini. lw2nI .. . . . lwNni J]
and
1
A
n
= =A I A
n- n
(7)
[max ( IX I y n -1' y y
IP
n I, 1 n_ 21 ' • • • 1 n_ Nil]
As an additional consideration, we note that because of the block normalization,
there is the possibility of overflow in the addition, which cannot be avoided by an at
tenuation of the input. This possibility of overflow can be avoided by decreasing the
normalization constant A by a fixed amount. Thus we modify (6) and (7) as

n
1
(6')
and
1
(7')
where ex. is a constant that may be changed depending on the filter to be implemented.
In a first order filter, for example, ex. need never be greater than two. In a second
order filter it need never be greater than three, and in many cases can be chosen
as two.
The Effect of Roundoff Noise in Block-Floating-Point Filters_
In evaluating the performance of the block-floaring-point realization in the pres
ence of roundoff noise we will restrict attention to the implementation of Eqs. (5) and
Fig. 1 for the first and second-order cases. We will assume that no roundoff occurs in
the computation of � from xn and the subsequent multiplication by A ' Since A - and
n n 1
A _I A are always non-negative powers of two, that is, they always correspond to a
n n
positive scaling, the above assumption corresponds to allowing more bits in the repre
sentation of the intermediate variable �n' This is reasonable if we take the attitude that
it is primarily in the variables used in the arithmetic computations that the register
length is important.
73
For the first orde r case, roundoff noise is introduced in the multiplication of w
-1:- .
In
by � ' the multiplication by a and the final multiplication by The effects of
n 1
n
multiplier roundoff will be modeled by representing the roundoff by additive white noise
sources. We consider, for convenience, the fixed point numbers in the registers to
represent signed fractions, with the register length excluding sign denoted by t bits.
Each of the roundoff noise generators is assumed to be white, mutually independent and
independent of the input and to have a variance (I
2
.�
equal to . 2
-2t
A
. The network for
the first-order filter including the noise sources representing roundoff error is presented
in Fig. 2 (a). In Fig.2(b) an equivalent representation is shown, where the noise sources
are at the filter input. If we consider the input to be a stationary random signal then the
noise source'; will be white stationary random noise with variance
n
(8)
2 denotes the expected value of 1 2

where, k (-X-) Letting 11 denote the noise in the
n
n
filter output due to the noise 'n' the variance of the output noise 11 n will be
-;;z =.;
2
�
Q)
hn2
-=-z
+ E = (I
2 [1 + (9)
n n 3n �
n= 0
For the case of a second order filter a similar procedure can be followed. Figure 3(a)
shows a second-order filter with the roundoff noise sources included. In Fig.3(b) an
equivalent representation is shown, where equivalent noise sources are introduced at
the filter input. Again, conSidering the input to be a stationary random Signal, then
� (1 ,2 r
4 r2 cos2 9+2+r4 +r4 (IE
(. 1 ,2
J �)
2 2
';U =
�,,) (IE (10)
n L
4J
2 2r 2 2
= k (IE L4 r cos 9+2+2 r
74
where we assume that the mean square values of ( -f:- ) and (� ) are equal. Hence
the variance of the output noise '11 n is n n-l
� 4
'10
:; O'
2
E
+k
2
O'
2
E
G [4 r2 cos2 e + 2 + 2r ] (11)
where
G= C 1
(12)
Experimental
- Verification
-
To verify th; validity of Eqs. (9) and (1 1) the values of k2 were measured and the
values of �n2 computed from (9) and (11) using these measured values. These results
were taken as the theoretical results since they incorporate the assumptions of the model.
The variance of the roundoff noise � was then measured experimentally. This was done
n
by Simulating the block-floating-point filter with a Signed mantissa of 12 bits and com-
paring the output values with the output of an identical filter simulated with .'36 bit fixed
point arithmetic. In all of these measurements the input was white noise with a uniform
amplitude distribution. For the first order filter. the value of a in Eqs. (6') and (7')
was taken as two. For the second order filter. the value of a was taken as four.
In Table I. measured values of k2 and the theoretical and experimental values of
the variance of the roundoff noise for the first order case are given. In a similar manner,
theoretical and experimental results for the second order case are summarized in Table II.
A Comparison Of Block-Floating-Point, Floating-Point and Fixed-Point realizations.
Using the model presented in the previous section, the block-floating-point realiza
tion of digital filters can be compared with fixed-pOint and floating-point realizations.
The comparison to be presented here will be on the basis of the output noise-to-signal
ratio when the input is a random signal with a flat spectrum, using results presented by
Gold and Rader ( 1), Kaneko and Liu (2). and Weinstein and Oppenheim (3). With � denot
ing the vaiance of the roundoff noise as it appears in the output we have for the first-
order filter
(1}2) fixed -point = 1
TI
. 2
-2t
(13)
2
(1} ) floating-point= . 23 x 2
-2t
( 14)
75
and for the second-order filter
1
(1')2) fixed -point = "0
(1 5)
r cos26
164 2
(.;?) floating-point . 2 3 x 2 -2t r.� +G(3r4 +12r2 cos26-
l+r2 J
cr
y (16 )
where t is the number of bits in the mantissa, not including sign, cr y2 is the variance
of the output signal, and G is given by (12). In the fixed-point case the output noise is
independent of the output signal variance and in the floating-point case the output noise
is proportional to the output signal variance. The expression for block-floating-point noisE' has
a term independent of the signal and a term which depends on the signal through the factor
k2• In both the fixed-point and block-floating-point cases, the dynamic range for the
output is constrained by the register length. Consequently. as the filter gain increases
the input must be scaled down to prevent the output from overflowing the register
length. Since the output is given by
ex>
yn = Z hk xn -k
k= O
then co
To insure that the output fits within a register length, we require that, with xn and yn
interpreted as fractions,
I yn I s; 1
so that
1 1 (1 7)
S; x S;
ex> ex>
n
76
With this constraint on the input, we can then compute an out put noise-to-signal ratio
for fixed-point. floating-point and block-floating realizations. Specifically for the first
order case,
1 2-2t 3 <Is)
TI
(�) fixed-point •
(:;2 ) floating-point . 23 x 2-2t

(19)
(::. ) block-floating
=
1 -rr .
-- 2
-2t [3(1(1-_aa/)
1
)2
( 20)
-2 2
where k is the value for k when x is uniformly distributed between plus and tninus
n
unity .
In a similar manner, for the second-order case,
1
= -Z 2-2t ( 1
SIiilf
(21)
(:::-) fixed-point
(.23) 2-2t r1 + G(3 r4 + 12 r2 cos2 e -16 r cos e) ]

4 2
(a:�) floating-point •
� 1 +rZ
(22)
77
In Fig.4 Eqs. ( 18), ( 19) and (20) are compared. In Figs. 5 Eqs. (2 1), (22) and (23) are
compared. In these figures the noise-to-signal ratIos are plotted in bits so that the dif
ference b�tween two of the curves reflects the number of bits that the mantissas should
differ by to achieve the same noise-to-signal ratio. In each of the cases, the difference
between floating-point and block-floating point is approximately constant as the filter
gain (or the proximity of the poles to the unit circle) increases. This difference is ap
proximately one bit in the first-order case and two bits in the second order case. In
contrast, the fixed-point noise-to-signal ratio increases at a faster rate than floating
point or block-floating point and for low gain is better and for high gain is worse than
block-floating point.
In evaluating the comparison between fixed-point, floating-point and block-floating
point filter realizations it is important t o note that Figs. 4 and 5 are based only on the
mantissa length and do not reflect the additional bits needed to represent the character
istic in either floating-point or block-floating-point arithmetic.
An additional consideration, which is not reflected in these curves is that in both fixed·
point and block-floating-point the noise-to-signal ratio is computed on the assumption
that the input Signal is as large as possible consistent with the requirement that the out-
put fit within the register length. If the input signal is in fact smaller than permitted
then the noise-to-signal ratio for t he fixed-point case will be proportionately higher.
For block-floating-point, as the input signal decreases, k2 decreases thus reducing the
output noise. From Eqs. (9) and ( 12) we observe that as the input signal decreases the
output noise variance asymptotically approaches oi.
For the case of high-gain filters, Eqs. ( 18) through (23) can be approximated by'
asymptotic expressions which place in evidence the relationship between them. For the
high gain case, that is for � close to unity in the first order filter and r close to unity
and e small in the second order filter, we will assume that I xnl is always smaller than
I Ynl so that (-i-) 2!! 2 1 Y nl for the first-order filter and (-i-)2!! 4 1 Ynl for the second
order filter. The'1. , if we consider Yn as a random variable wit'fi a symmetric probability
density,
in Eq. (20) and

-2
k 2!! �
16
G
in Eq. (23).
78
Representing a as 1- 0 for the first-order case and r as 1-0 for the second-order
case, with {j small we can approximate Eqs. ( 18) through (20) as
22 t (--
7
))
0y2
-
�
7
.2 5
(24 )
�
fixed-point
2 2t -Z
.23 (51 (25)
oy
':'!
floating-point
�
22t 8� '0
1
(26)
�-r)
0y block-floating
For the second-order case we will want to bracket the expression
•
1
rn I sin (n+ 1)9\.
CD
k This sum is the sum of the absolute values of the i mpul se

sine
n= 0
response and as such Is an upper bound on the filter output with a maximum input of
unity. Consequently, it mUl;t be greaLer than or equal to the response of the second-
order filter to a sinusoid of unity amplitude at the resonant frequency. This resonance
response is given by 1 or, with the high g ain
2
(l -r) (1+ r -2r cos 2 9) 1/2
approximation, 26 sin 1
6
An upper bound is easily obtained on the sum as
•
( �6Si
6
sm
� 6
. Furthermore for the high-gain case we approximate
1
G as G =:!
2 We can then write that
4 0 sin 9
79
:'-r
1 1 2t 1
..,2 . 2 2 (27)
8
u sm 9 �
(""""T)
ay fixed-point
�
"!
v
. 2
�2 SIn
9
,-
.2,3 L l + 3 +4 cO229
22t(a:�) J (28)
�
floating -point =
46 sin 9
1
-6
[1 '4 +
4(l+cos29
2
3 sin A
� ( 29)
block-floating
We note that the behavior of these expressions as a function of 6 is consistent with

the results plotted in Figs. 4 and 5.
80
ACKNOWLEDG MENT
I would like to express my appreciation to

Dr. Clifford Weinstein for his contributions to this
work through our many discussions.
81
REF E R ENCE S
1. 13. Gold and C. M. Rader, "Effect of Quantization Noise In Digital Filters"

Proceedings Spring Joint Computer Conference, Vol.28, pp.213-219, 1966.
2. T. Kaneko and 13. Liu, "Round-off Error of Floating-point Digital Filters"

Proceedings of the Sixth Annual Allerton Conference on Circuit and System
Theory, Oct2-4, 1968.
3. C.Weinstein and A. V.Oppenheim, "A Comparison of Roundoff Noise in

Floating-Point and Fixed-Point Digital Filter Realizations" IEEE Proceed
ings Letters, June 1969.
82
TABLE I
2t -2]
1 /2 log [2 l1 (bits)
2 n
a k
2
Theoretical Experimental
1
.1 .0136 -1. 780 -1.780
.3 .0162 -1.775 -1.780
.4 .0175 -1.77 -1.775
.55 .0202 -1.765 -1.765
.7 .0251 -1.750 -1.740
.9 .0591 -1.465 -1.470
.95 .1152 - .980 .945
83
TABLE II
----
2t
1/21og [2 11
-2 ] (bits)
2 n
e
2
r k Theoretical Experimental
.55 22.5 .Oll -1.724 - 1.661
.55 45.0 .008 - 1.765 - 1.735
.55 67.5 .006 - 1.780 - 1.753
.7 22.5 .020 -1. 528 - 1.440
.7 45.0 .0 10 - 1.736 - 1.696
.7 67.5 .004 - 1.78 1 - 1.757
.9 22.5 .068 - .23 1 - .222
.9 45.0 .023 - 1.430 - 1.357
.9 67.5 .0 15 - 1. 665 - 1.584
.95 22.5 . 129 .7 16 .652
.95 45.0 .045 - .863 - .768
.95 67.5 .029 - 1.384 - 1.207
.99 22.5 . 150 1.992 2.050
.99 45.0 .053 .244 .350
.99 67.5 .035 - .540 - .150
84
FIGURE CAPTIONS
2
Table 1. M(?asured values of k and theoretical and experimental values of
output noise variance for a first-order filter with white noise input
in the range I x I s: .!..
16
•
n
2
Table II. Measured values of k and theoretical and experimental values of output
noise variance for a second-order filter with white noise input in the range
I xn I s: 1
-128 '
th
Fig. 1. Network for block-floating-point realization of an N -order filter.
Fig. 2. (a) Noise model for first-order filter

(b) Equivalent noise model.
Fig.3. (a) Noise model for second-order filter

(b) Equivalent noise model.
Fig. 4. Comparison of noise-to-signal ratios for first-order filter using fixed

point, floating-point and block-floating-point arithmetic.
Fig.5. Comparison of nOise-to-signal ratios for second-order filter using fixed

point, floating-point and block-floating-point arithmetic.
85
>,C
N Z
I I
-
I
>,c >, C
>f
-
C I
-
C
I
C
"j'
C
E!
- c:( c:( c:(
II II
C C
II
C C .N C C
•z .
.- <J <J <J
-
• • • bO
....
r.z..
.
N z
.,
-
., .,
• • •
C
<J
C
OC
j'
C
c:(
C
ac
86
" .Ii
"If)
c
<I
N
bO
a �
.....
"N �
c
IU
C\I
+
c
\U
c
<l
....£..., .-
C
c I
�
.... CI
•
�c
- -
c ..c
, -
c
CI -
-
C
M
87
c
�
-
N
�
I
c
N
\II
c
<l
88
-
I C
c
It)
\II
'"
�
I
c
<X:
� c c
.....::.,.. <J _ <J
I t--..-...
... --t I
,.....--, N
c
.q-
'"
+ Q) ..0
en
M
0 t>O
c
rr> ('t.
'" u II-<
.....
�
I
C\J C\I
....
c
C\J
'"
+
Ct) c
en <l
0
0
....
C\J
c
-
� I
c c
lCl:
�
.--
II
C
o\u
89
c:n
c:n
C!) c:n
z 0
ti �
g�
�-�
z
��
u C!)
9
m
z
-
�
9
� c:n
�
/' 0
Q
L&J
X .
ii:
""
.�
�
10
o
C\J
o
90
10 8 = 22.50
U) 8
-
..c
SlOCK -FLOATING
�6 POINT
tr
�
b 4
�
N
o
.2
C\J
...... FLOATING POINT
,.... 2
o
0.99 0.999
r
Fig. Sa
..0
.....
0)
II 0)
a
/
2
o
a..
o
w
X �
I.L
.a
V)
to �
0) �
....
0)
<.!)
2 a
I-
«
01- ex)
-12
I.L -� a
, 0
�a.. I"-
u 0
0
-1
CD to
0
92
en
en
en
0
�
z
-
�
<!)
z
-
0
an
�
�
U) 9
lL en
II en
Cb / 0
0
L&J
X
LL 0
t- tl)
t>O
....
It) �
en
0
<!)
z en
� 0
91-
lL�
. 0 CX!
�Q. 0
U
9
m
'"':
0
�
0
93
ROUND-OFF ERROR OF FLOATING-POINT DIGITAL FILTERS
T. KANEKO and B. LIU
Princeton University
princeton, New Jersey
ABS T RACT
This paper is concerned with the accumulation of round-off
error in a floating-point digital fi lter. The error committed
at each arithmetic operation is assumed to be an independent
random variab le uniformly distributed in (-2 -t, 2-t) where t is
the length of the mantissa. Expression for the mean square
error is derived arrl a numerical example is given.
INTRODUCTION
Consider a digital filter specified by the input-output
re lationship:
wn (1)
where { x } is the input sequence and { w } is the output se

quence. nThere are three common sourcesnof errors associated
with the fi lter of Eq. (1), namely
(1) quantization error - caused by the quanti zation of the
input signa l into steps of finite range,
(2) coefficient accuracy - caused by the fact that the co
efficients ak and b k are rea lized with finite word
length,
(3) round-off accumu lation - caused by the accumu lation of
errors committed at each arithmetic operation because
these operations are carried out with on ly finite bit
accuracy.
The actua l computed output reference is therefore in general
different from { w n } . We shal l denote the actual output by { yn }
and we sha ll call { w n } the ideal output sequence.
When the filter of Eq. (1) is realized with fixed-point
arithmetic, the above error problems have been studied exten
sive ly [1]. Recently, Sandberg [2 ] studied the error prob lem
for f loating-point arithmetic and derived an absolute upper
bound of the error accumulation due to round-off. This paper
derives expression of the mean squared error caused by the
round -off accumulation.
Throughout this paper we shall be concerned with binary
machines. Thus, a floating point number is stored in the form of
(sgn). a. 2 b where a and b are binary numbers each ot fixed num
ber of bits. The number a is cal led the mantissa and b the
exponent. We ·sha l l assume that enough bits are allowed to the
exponent b so that no computed number wil l lie outside the
permissible range. We sha l l use the notation fl(·) to denote
the machine number resulted from the arithmetic operation
specified by the parenthesis in some designated order. Thus
PlLc�en.ted a..t :the S-i.rth Annual AUelL:t.on Con6e1Lenc.e on C-i.!z.c.u.U and
*
SU6:tem TheM!!, OaobeIL 2-4,. 1968. To be pubwhed bt :the PILOc.eed-i.ngl.>

06 :the Con6e1Lellc.e.
94
fl (x+y) is the calculated sum of x+y, and fl (ax+ by) is the cal
culated Sl� of two terms; one is the calcu lated product o f a
and x, the ot her is the calculated product of b and y. It is
known (3) tha t
f l ( x+ y) (x+y) (l+E) (2)
with l
-t
lEI � 2
where t is the number of bits of the mantissa. A lso,
fl(xy) = xy ( I+E) (3)
Again with
I E I s: 2-t
That is, for each addition or multip lication, an error is com
mitted which is proportional to the ideal results obtained with
infinite precision. We shall assume that all numbers a n' bn,
xn are machine numbers.
CALCULATION OF A CTUAL OUTPUT SEQUENCE
To i llustrate the approach of this paper, consider a second
order filter specified by
wn b O xn - (al w n_l + a2 Wn_2)
=
The actual computed output is

Yn fl [bO xn - (al Yn- l + a2 Yn 2»)
=
-
The ca lcu lation is to be performed in the fo llowing manner.
First, the products a l Yn-l, a2 Yn-2' and b Oxn are ca lcu lated
separately. al Y n- l and a2 Yn-2 are added next. Finally, this
sum is subtracted from bOxn to obtain Y n'
Fo l lowing Sandberg [2], a flow diagram of Figure I 2 may
be drawn by using Eqs. (2) and (3).
l+S n
Figure I
l
It is assumed here that the accumulator is of double precision.
For sing le precision accumulators, slight modification is
necessary.
2 S light modifications are necessary when some of the
coe fficients are one or zero.
95
The quantities 6n, O, En, l' En,2 ' �n' � n arc all bounded in ab
solute value by 2-t • Since these are errors caused by round
off at each arithmetic step, we may assume that they are indc-3
pendent random variables uniformly distributed in (_2 -t ,2-t ).
Therefore the actual output (Yn} is seen to be given explicitly
by 2
Yn = b x � a ¢ k Y n- k
O 8n, O n k=1 k n,
-
where:
(1+6 n, 0) (l+Sn )
(HE n, 1) (1+�n) (l+S n)
�n,2 (HEn, 2)(1+T1n) (l+Sn)

In general, for the digital filter of Eq. (1), we have
M N
Yn = fl( �
a k xn_k � b Y )
k=O k=l k n- k
-
and the corresponding flow graph showing the effect of round

off e rror is given by Figure 2.
1+" n, 2
1+" 3
n,
Figure 2
3
For s1ng
. 1e preC1. S1on
. accumulators, a slight modification is
needed.
96
'l'h(! ilctua 1 output sequence is therefore given by
M N
Yn = ! � 8n,k Xn-k - k� ak �n, k Yn-k (4)
k O l
where M
e II (1+' n, l..)
n, O i=l
M
II (1+, n, l..) j=l, 2, • • • ,M
i=j (5)
N
�n, l [) (1+'11
n, .)
l.
i=2
N
�n, . = (l+� n ) (l+ En, . ) II (1+"n, l..) j =2, 3, I N
) )
• • •
i=j
By defining a O = 1, �n, 0 =, 1, we may write Eq. (4) as
N M
I: a k ¢n, k yn k = I: (6)
k=O - k=O
where �n, 6n, k, 'n, k' En, k' "n, k are independent random varia
bles, each uniformly distributed in ( _2-t, 2-t ).
To solve for the actual output sequence {y 1 from Eq. (6) ,
we note first that the random variables �n k an3 en k are
essentially one and that their difference from one 1s of the
order of 2-t in magnitude. We rewrite Eq. (6) as
N M M
: akYn k = : b X � b (e -l)Xn-k
k O k n-k k o k n, k
- + -
k O
and Yn as a sum of te nns in order of decreasing size, viz. ,
Yn y'n + y"
n + y"
=
n ' + (8)
On substituting Eq. (8) into Eq. (7) and equating terms of like
order of magnitude, the following set of equations is obtained.
N M
I: akYn' k = I:
b kx n k (9)
k=O - k=O -
N M N
I: a y" = �: bk( e n, k-1)xn k - I: a k( �n, k-l ) y'
n _k (10 )
k=O k n-k k=O - k=O
N N
1: ak yep)k = - I: a k(� (p-l)
n, k 1) yn-k p=3, 4,5 ( 11)
-
k:O n- k=O
• • •
It is to be expected that the order of magnitude of y�p)

is smaller than that of y�P- l) by a ratio of roughly 2-t . This
can indeed be shown to be true. Therefore we will only be
in terestcd i n the y;' and the y;' I sequences. A compariso n of
F:q. (9) with Eq. (1) shows that {y;'} is identical with the
97
[w111 s<"tluencc, the ideal output. We may thus identif y y�'
with the error due to round-off error and denote it by e n.
'I'hilt is,
e n = yn - wn = y"n ( 12)
STATISTIC S OF ERROR SEQUENCES {en}

Assume the input {xn} is zero mean, and wide-sense station
ary (w. s. s) with autocorrelation function Rvx(n ) and power
spectral density 2xx{Z). Then it is clear from Eg. (9) that
y'n or wn is zero mean and w. s. s. Its power spectral density
�Y'y'{z) is related to that of {xnJ by
�
- N ( Z) N ( 1 2 7�
y , y ,(z ) - n(z) n ( 1 z) xx(z ) ( l3)
4
and its autocorrelation function Ryy�n) is given by
1 n dz
Ry'y'(n) = 2TTj i \'y'(z) z Z
,f'. ( 14)
where
k k
N(z) z- and n(z} z-
Furthermore, Y� is independent of 8m,j and �m,j. From Egs.

(10) and (12), we have
N
a en_k = un ( I S)
k=O k
1:
M
with
N
un (9n,k-l) xn-k - a (� -l}y�-k ( 16)
k:O� k o k n, k
= :
To calculate the statistics of u n' we need the statistics of
9n,k and � These can be evaluated in a straightforward
manner and nt �e result is summarized in the Appendix. It can be
•
shown easily then that

E( Un } = 0 (17)
E( u nu m } o nlm (18)
M M
2
E(u n } L L
xx
= bkb1.'
,1.
k=O i=O
N N
+ L L R , ,(k- i)
Y Y
ak a, Ak
1.
'
1. ,
k=l i=l
M N
-2 L � q R ,(k-i) ( 19)
xy
bka,
1.
k=O i=l
where q 2-2t/3 is the variance of a random variable uniformly
4Un]ess oth(;!rwise specified, CO l to ur integration in the z-plane

shall be along Izl = 1.
,
98
distributed in (_2 -t , 2 -t ), R x¥,(k-i) is the cross correla
t ion function betwecll {xnl and (Yn }, Ak, i and B k,i are given by:
E(
Ak ,
l.
' f! (¢
n, k-1) ( ¢n,l.,-1»)
(1+q)N+2-max(k, i) kfi or k=i=l (20)
= k
{ (1+q) N+ 3- 1 _ k=if1
B k, � E{ ( e n, k-1) (e n, I ) }
l. l.
-
' ,
k
(1+q)M+2-max( ,i) _
1 kfi or k=i=O (21)
= k
{ (1+q)M+3 - 1_ k=ifo
Thus we see {u 1 is white and w.s.s. with variance of un given
by Eq. (19) wh�ch may be rewritten as
E(u2
n} =- nJ xx Y Y xY z
(22)
where
M M
IB(Z)I 2 E E h b, B k , ' zk-i
k=O i= O --k l. l.
N N
IA(z)12 = E E a
k
a. JL ,z k -i
). - -k , ).
k=l i= l
M N k
C(z) = q E E � ai z -i
k=O i= l
By using Eq. (13 ) and the relationship
t (z) = 1tl.!l t (z)

xy' DTZT xx
we have
. 2
2
rIB(Z) 1 +IA(Z) �� J I �� Jlxx(Z) �Z
-2C(Z)
(2 3)
From Eq. (15), we see that e has zero mean and is w.s.s. Its
power spectral density tee(z f is given by
I (z)
uu E{un2 )
I (z) = = (24)
ee I (Z)12
D ID(Z)12
The mean square value is
dz
-
z (25 )
99
OUTPUT ERROR TO SIGNAL RATIO
Quite often, one is interested in the error- to-signa1
ratio at the output . In terms �f our notations, this quantity
is E (e � } /E{ w2 } , or E{y�'2 }/E(YA }, which, by using Eqs. (14),
(23), and (2g), can be written as
(26)
UPPER BOUND FOR ROUND-OFF ERROR ACCUMULATION

From Eqs. (23) and (25), we may derive an upper bound of
E(e � } . Since
E(U2 � tR i
dz
nJ �
2n] � xx(z) Z
•
,:,=1 [IB(z) 12+IA(z)� 2 �

D (z) 1 -C(z) D(z)
/ N(l Z}
-C ( 1 z) D (1 z)] �
2
= Rxx( O) ":' [IB(z ) 1 +1A(z)� � \2_C(Z)� ��J �� C(l/Z)� �� )
=1
2} 1 tR dz
E£en � Rxx (0) 2nj 2
____
W Z\D(Z) 1
•
Max [IB(Z) 12+IA(Z)�«�)} 1 - C(l/Z)� �� )

Izl= l Z Z
(27)
Similarly, an upper bound of the error - to-signal ratio a t the
output can be derived:
,: , = 1 (I � f �� 12 . [I B(z) \ 2+ I A(z)�f � f 12-C(z)� f � � -C (1/ z) � � ��J ] }

(28)
EXAMPLE
Here we take the second order filter considered in Sand
berg I s paper [2] :
100
with <l1=-J2(l- . 001) and a2=(1-.001)2 . Thus N(z)=l, D(Z)=
-2 2 2 2 2
1+"1 z-I+.:1
2 z IB(z) 1 =q, and I A(Z) I =(al+a2 ) [ (1+q) 3 -II +
,
ala2[(1+q)2_l](z -1+z) , C(z) =al z-1+ a2 z -2 . Since (1+q) 3_�3q

and (1+q)2_�2q, we have
\�f�� \2. [/B(z) \2+IA (Z)�� 12_ C(Z) � { �� -C(l/Z)�ii/�� I

= q [l +2 (a2 2 -1
l+a2)+ala2(z +z )I
Its maximum on \zl=l is seen easily to be approximately
(7+2./2)q . The integral 1/ (2rrj) � dz/(z\n(z ) 12 ) can be easily
evaluated; its value is 0.5 x 10 . with tc27 the upper bound of
v'E (e � }/ E (w � } as given by Eq . (28) is 3.3 x 10-7•
ACKNOWLEDGMENT
This research is sponsored by the Air Force Office of
Scientific Research, Office of Aerospace Research, United
states Air Force under AFOSR Grant 1333 -67 and by the National
Science Foundation under Grant GK-143 9.
APPENDIX
STATISTICS OF 8n,k AND ¢ n,k
The first and second moments of en k and ¢n k as given by

Eq. (5) Cdn be evaluated straightforwarCily . The'result is
tabu la te d be low.
E (& } = E( ¢ n,Jo } 1 'If n,j
n, J
=
0
1 M+
E( 82 O} ( +q) 2 E [8 08 I = (l+q) M+2-j , M<:j>k<:O
II, n, J n, k
+
E(e2 o} (l+ q)M 3-j j=l, ... ,M
n, J
rp = ¢2 O = 1 E [¢ O¢ J = 1
n,O n, n, n,J
0
l
E[¢ � ,l J (l+q)N+
E[¢2 oj (1+q)N+3-j j=2 ,3, ••. ,N

n,J
E[¢ o¢ 1
n,J n, k
E[8 o¢ l+q kiO

n, J n, k
J
o k=O
E[9n,j9m,k] = E[� n,j�m,k] nim

= E[9n,j�m,kJ = 1
q = 1/3(r2t)
10 1
REFERENCES
1. See, for example, J . B . Knowles and E . M . Olcayto, " Coef

fici�nt Accuracy in Digital Filter Response", IEEE Trans .
Circuit Theory, Vol . CT-15, pp . 3 1-4 1, March, 1968.
2. I. W. sandberg, "Floating-Point Round -off Accumulation in

Digi tal Filter Realization", B STJ, Vol. XLVI, pp . 1775 -
1791, October, 1 967 .
--
3. J . H. Wilkinson, Rounding Errors in Algebraic Processes,

Prentice -Hall, Englewood Cliffs, New Jersey, 1963 .
102
n:EE TR.\XSM.'TIOXS ON CllICl.:"1T TIIEOR", VOL. CT-15, No.1, l\URCH 19G8
Coefficient Accuracy and Digital Filter Response

J. n. I\:XOWLES, MElfDER, IEEE, AXD E.:\1. OLCAYTO
Abstract-The frequency response of a digital filter realized by a with (I), the width of qua ntiz ation (q) is defined as:
finite word-length machine deviates from that which would have
q
been obtained with an infinite word-length machine. An "ideal" or = 2-(·11-1). (3)
"errorless" filter is defined as a realization of the required pulse
transfer function by an infinite word-length machine. This paper By yirtue of the fi ni te computer word-length, the
shows that quantization of a digital filter's coefficients in an actual p erformance of a digital fil ter is inevitably degenerated .
realization can be represented by a "strav" transfer function in First, when continuous data are read into the compute rs ,
parallel with the corresponding ideal filter. A lso, by making certain
a quan tizatioll errorl"1 is incurrcd. Second, a roundoff
statistical assumptions, the statistically expected mean-square dif
ference between the real frequency responses of the actual and
error a rise s in the ev al u ation of each arithmet ic product
ideal filters can be readily evaluated by one short computer program in the com pu t ation YI . I GI Thil'd, th e coefficients of the
for all widths of quantization. Furthermore, the same computations di fTe ren ce equation which represcnts the digi tal filter are
may be used to evaluate the rms value of output noise due to data s ub jec t to a mpli tud e quantization errors. In conse quence,
quantization and multiplicative rounding errors. Experimental mea
the freque ncy rcsponse of the actu al filter realized deviates
surements verify the analysis in a practical case. The application of
the results to the design of the digital filters is also considered.
from that which would hav e been obtained wi t h an
infinite word-length m achine . For the purposes of comp ar
ison, it is convenient to defin e "the ideal" or "errorlcss"
I. hTRODUCTIO:-; filter as a realization of t he required p ulse transfer func
YSTK\IS which are uscd to spc ct rally shap e in tion by an infinite wonl-Iength computer.
S fo rm ation are known as filters. Until quite recently,

linear analog elements have b ee n used almost ex
The first t wo types of error described above have
already been thoroughly investigated,131-IOI and this paper
is conc erned with the third. A lthough the p roblem of
cl u�i\'e 1y in this context. However, with the increasi ng
flexibil ity of digi t al equipment, attention has been directe d coefficient errors is not practically signifi can t in the case
to\\":\l'd di git al filtering techniques. In pa rticul ar, meth of a digitally contro lled feedback system, the loss of
OdS"I"'1 have been deve lo ped which enable the conv ersion perflJrmance is important in digital filt e r a pp li cat ions due
of certain well-established analog fil ter des igns into digita l to the absence of o\"erall negative fce dback and greater
filter::! hav ing essenti:\lly the same fre que ncy response over digital system compl exi ty. The efTe ct of coefficient errors
half the Xyqu is t interva l. These te chni ques are significant on digital fi lter performance was first investigated by
in that the resul ting digi tal filters large ly maintain the Kai�er. '71 ,\n ab sol ut e accuracy bound was derived for
frequency response of their a nalog "parents" even when the di fTerence equ ation coefficients within which the real
ization was still asymptotically stable. However, s uc h an
relath'ely small sampling fre que ncies are employed. That
is to say, f re que ncy al iasing efTects III are l argely error bound is incvitably pessimistic, because coefficient
quantization errors are intrinsically statistical. Further
eli mi nated.
,\n electronic di gi tal computer processes information more, this a n alysis docs not en able the compu ter word
in binary-number format. When the so-called "sci entific length to be selected so that the degeneration in the real
pmgmmmin g" convention is employed in a fixed-point f requ ency response of the actual digita l filter is m ain tai ned
machine, the binary number "10"11 "1M-I has the

• • •
within a cceptable p racti cal l i mi ts.
s ignifi canc e
This paper shows that the qu antizat ion of a d igi tal
JIf-l
filter's coefficients can be represented by a "s tray" transfer
+ L "1,2-' (1) function in p aralle l with the corre spond in g ideal filter.
"10"11 "1M-I -"10
t:-1
• • •
Also, by m aki ng certain justifiable statistical as su mptions,
where the st ati stical ly exp ect ed mean-square difTerence between
the real fre que ncy responses of the actual and correspond
'YI - 0 or 1 for all i (2) ing errorless filter can be readily evaluated by one short
computer program for all widths of quanti za tion. Further,
and M is the word-length of the computer. In connection
the same computational results may be used to eva lua te
the rms value of the filter's output noise due to data
ManWJCript r�cei�ed July 3, 1967; reviled October 3, 1967. quantization and multiplicative roundofT errors.'41.IOI Un
J B. KnoWles IS With the Control and IDBtrument GrouPt. United
KiDKdom Atomic Energy Authority, Winfrith, Dol'lJet,EoJdand. fortun ate ly , while the mcthod is always app l icable to
iT. M. OIcayto was formerly with t he University 01 Mancme.t8r dire ct and parallcl pro gramm in g, it is gene ral ly un
Institute of S"cience and Technology, M anches ter, England. He
ia now with the Turkish Post OfIice, hmir, Turkey. suitable for cascade p rogra mmi ng.
103
IEEE TRANSACTIOXS OX CIRCUIT TlJEORY, MARCil 19G8
Justification for a Statistical Analysis roundofT errors, the direct programming realization of (4)
In the transformation of an analog filter to a digi tal is specifie d by the recursion equation'
filter by Golden and Kaiser's method 11 I a matrix of co N N
efficients is derived. For representation in the computer y'(n) = L: a.x(n - k) - L: b.y'(n - k) (0)
k-O '-1
these coefficients are quantized to a specific number of
bits. That is to say, for representation within the computer where xU) and y'(i) represent the i nput and output
these coefficients must be processed by the following gain sequences of t he actual filter. However, the ideal digital
.
characteristic. filter produces an output sequence yen) according to the
recursion equation
N ."1
, yen) = L: u.x(n - k) - L: b.y(n - k). (7)
.t-o .-1
Ideal. 01" destcned
coefticiente 7 Actual tiller coefficients
tor realizaUon D efining the computational error quantity
Quantber
e(n) = y'(n) - yen), (8)
In o ther words, the justi fic ation for a statistical analysis then substituting (5), (0), an d (7) into (8) yielc.l s
of filter coefficient errors is the same as for a statistical
N N N
analysis of quantization in AID converte rs . From one
e(n) = L: a.x (n - k) - L: b.e(n - k) - L: p.y(n - k)
viewpoint, the output of a quantizer to a speci fic i nput .-u .t-l .-1
is deterministic and not rea lly a statistical problem at all. N
Alternatively, the exact value of the actual i nput viewed - L: P.e(n - k), (fl)
0-,
in terms of the quanti ze d data is uncertain. In fact, for
a realized fil ter coefficient of nq, the hue coefficient value and neglecting second-order quantities, o ne obtains
lies in the interval nq - E to nq + E, where lEI < q12 . N N
It is this uncertainty regardin g the exact value of the e(n) = L: a.x(n - k) - L: b.c(n - k)
'-0 .t-l
input to the quanti z er that j usti fi es, in the authors'
N
opi nion , the statistical analysis of all ampli tud e quantiza
- L: p.y(n - k). (10)
tion efTects. Such applications of statistical techniques 0-,
seem to have a strong experim ental confirmation which

Taking the z transformation of (10) give s
is a further justification for this form of analysis.
0= a(z·')X(z·I) - p(Z·I)Y(Z·I) - E(z·I)Be(z·I) (11)
II. ANALYSIS FOR DIRECT rROGRA�DIIXG where
N N
In the following analysis, the errors on the difTerence " -k
equation coefficients will be assumed to be statistically L..J ao2 , P(Z·I) = L: Po2·O
0-0 0-1
independent and uniformly distributed with zero mean.
(12)
N N
These assump tions for the case of rounded quantization Ae(Z-I) = L: o..Z·" Be(z·I) = 1 + L: b.z·k
.-0 k-o
have been confirmed by careful experimental investiga
tions. (01.(81 Thus, if the actual pulse transfer function and
reali z ed is e
E(Z·I) = L: e(k)z··.
0-0
As
Y(z·') = IIm(Z·I)X(z·I), (13)
then (11) may be reduced to
then the numerat or and denominator coefficients may be
wri tten as E',Z.1) = a(z·l) - (1(Z·I)[[m(Z·I) X(Z .1) (14)
Bm(z.')
•
lIence, from (8) and (14) one ob tains
where u. and b. are the desired values of the coefficient s,

(5)
Y'(Z·I) = [ IIm(Z·I) +a
(z·I) - B��:=:� '
J
[[m(Z· ) X(z·I). (15)
and the error quantities a., P. are statistically independent. , If multiplicative roundoff errors are considered. then an error
It should be observed that no error at all is involved in term c(n) must be added to the right-hand side of (6) (see Knowles
and Edwards('I.I>I). However, a more complete description of de
the representation of unity an d zero magnitude co effi generative effects in a finite word-length computer is given in Section
cients within the computer. Neglectin g multipli cative VI of this paper.
104
KNOWLES AND OLCAYTO: COEFFICIE.... T ACCURACY AND DlGITAL FILTER RESPONSE
Thus, the physical significance of the analysis is that the

actual filter may be represented as the ideal filter in
parallel with the "stray" transfer function specified by
(14). This is illustrated in Fig.!.
A convenient measure of the degeneration in filter
performance due to coefficient errors is given by the
following statistical mean-square convergence criterion:
0' : = -- lh'T III*(jw) - II!(jw) 12

1
2 tr/T 0
dw (1 G)
Fig. 1. "Stray" transfer function due to the coefficient rounding.
where II*(jw) and II! (jw) represent the frequency re

sponses of the actual and ideal filters and T is the sampling
period. Setting
(17)
and substituting (15) into (16) yields
0' : = _1_ 1 E(Z-I) !i(z) �. (18)

2trj r X(Z-I) X(z) z
101-1
It i,; evident that the quantity 0': only exists when the
coefficient rounding errors are such that the actual filter
Il(z-I) is stilI asymptotically stable. This limitation does log w
not appear to represent a practical difficulty, because Fig. 2. Degeneration in filter frequency response due to the coeffi
cient rounding.
one is generally interested in determining the computer
word-length for the actual filter frequency response to
be within only a few percent of the ideal stable filter. sponse terms for the digital filters [l/B .. (z-I)] and
Under these conditions, the stability of the filter will [A .. (Z-I)]/[B.. (Z-I)]', respectively. Consequently, these
have been hardly affected by coefficient rounding. That integrals may be readily evaluated to any degree of
is to say, it is contended that loss of stability in a realiza accuracy using a short digital computer program based
tion will occur only after the deviation between the actual on Fig. 3.
and ideal frequency responses has become intolerable. On-line computers generally operate on a fixed-point
This is illustrated in l�ig. 2 for a lowpass filter. Assuming basis, and in this case with rounded quantization, the
}
filter stability, then the integrand of (18) can be reasonably error quantities in (5) lie in the interval
expected to satisfy the conditions of the Lcbesgue-Fubini
theorem.I"1 lIenee, interchanging the order of integration l a.1 :::; q/2 for all k. (21)
and substituting (14) into (18), one obtains 113.1 :::; q/2
• 1
0'
.. = 2trj ny virtue of the uniform statistical distribution assumed
for these error quantities, it follows that the summations
involved in (20) reduce to
N _ q.)
La! =
.-0
p.--
12
(22 )
'["sing the polynomial forms for a(z-I) and f3(Z-I) and N 2_ .!l
the stipulated statistical assumptions, (19) reduces to L 13. - " 12
.-1
O'! = [t. 2] ; 2 j f B.. () B..(Z) d:

1.'-1
where J.I and" represent the number of nonzero and non
unity coefficients in the numerator and denominator,
respectively. Thus, (20) to (2 2 ) specify 0' : for all widths
(20) of quantization.
Numbers in a floating-point machine are sometimes·
represented according to
The quantities
x = US" (23)
...!... 1 1 � and 1 1 A .. (Z-I)A .. (z) dz
27rj r B .. (z I)B.. (z) z 27rj r [B.. (Z-I)]"[B..(z)t z where U and v are sign conscious binary numbers. If the
hl-l 1.,-'
mantissa (U) has 'AI bits, then with rounded operations
represent the sum of the squares of the impulse re- the error quantities in (5) satisfy
105
IEEE TRANSACTIONS ON CIRCUIT THEORY, MARCH 1968
,.r:. dz
2IJ'j' 8.It'J8.IzJ T
1 1
IZI=I
.(k)=1 ij k=O
=0 it 1c;JC>
Fig. 3. A computational technique for evaluating
and
lal � !2-CM-1l8" } (24)

IPI � !2-c.v-1l8'·
where 1/ is the particular exponent value used in the
representation of a given coefficient. Although the error
on each coefficient depends now upon the magnitude of
the coefficient, it is still relatively easy to evaluate the X(z�)
summations I:2
and I:P! once the numerical values INPUT
of the filter coefficients are specified.

1
I
I
III. ANA LYSIS FOR PARALLEL PROGR.UlMIXG
_ I
If the pulse transfer function of the actual filter H(Z-')
does not contain multiple poles, then for parallel pro
Fig. 4.
�
Parallcl programming of a digital filter.
gramming it is first expressed in the form
N
H(Z-I) = R(Z-') + I: H;(z-I) (25) P
.-1
H(Z-') = H.(Z-') + I: PiZ-'
where i-O
(28)
where
N
1I.(z-') = R.(z-') + I: II,,,,(z-')
It should be observed, however, that H;(Z-') includes i-I
the case of a real pole by setting
au = b" = O. (27)
As shown in Ragazzini and Franklinllol and Jury,!1I1 the (29)
actual parallel programming of H(Z-I) is implemented
by direct programming each of the elementary transfer
a., - cI.," + au
functions R(Z-'), H,(Z-') and then summing their re
sponses. This programming technique is illustrated in b., ... 6., + �.,
Fig. 4.
r, ', + p,.
(15), which is illustrated in Fig. I, it
"
By virtue of
follows that the actual filter transfer function realized is BubetitutiDg (28) into (18), one obtains
(30)
106
KNOWLES AND OLCAYTO: COEFFICIENT ACCURACY AND DIGITAL FILTER RESPONSE
Assuming that the systcms are stable and that the co gramming of each elementary transfer function, H , (Z-I),
efficient errors are statistically independcnt with zero with the output from ith transfer function forming the
input to the (i + l)th. This programming technique is
[ dz]
mean, (30) reduces to
- f.- � - - 1 1
shown in Fig. 5. By virtue of (15), it follows that, due
� p� + .f.1 (ao! + al!)21rj j
1
o-! =
Bmm(Z-')Bmm(z) Z
[
'-I-I
+
�
\}J,.. + 13----
.f.1 f'/i2 1 1 Amm(z-')Amm(z) dZ
2..) 21rj j [Bn(Z -I»)'[B.,m(Z))2 Z :
] (31)
INPUT
Ln."
�--------�
� . . ')1 rY'{z')
Fig. 5.
OUTPUT
Cascade programming of a digital filter.

1_1-'
The mean-square quantities �, �, and p;J are calculated to the filter coefficient errors, the actual transfer function
exactly as in Section II, assuming uniform probability realized is
density functions. It should be noted again that zero If
and unity filter coefficients may be represented in the H(.-I) - II [1 + .,c.-')]H,.c.-') (36)
computer without error. The contour integrals involved ,-,
in (31) may be evaluated by the computing technique where

illustrated in Fig. 3. However, considerable computing
time may be saved by using the following results, which
include the case of a simple pole by virtue of (27).
1 1 1 dz
211-j j B, ..(Z-')B, ..(z) Z
1.11-1
aki a.. + a..
+ 6"
=
1
_
(1 - 6 .. + 6,,)(1 + 6.. + 6,,)(1 - 6,,) bki = bki + f3ki'

Since
1 1 A, ..(Z-')A'm(z) dz (32)
21rj j [B, ..(Z-I»)'[B, ..(Z»)2 Z N
hi-I Il ..(z-') = IT Ili ..(z-'). (38)
'-1
- (BoB. + B�)(m .
2(BoB,B.)3 then (36) can be written as
N
IT (1 + Ei
where
Il(z-') H..(z-') (39)
0" + 0..
=
'-1
Bo = 1 +
Substituting (39) into (18) yields
B, = 2(1 - b2,)
Ao
B. =
=
1 - 0" + O2,
(au, + ali)2 (33)
0":
=
;
2 j
h'-1
f Il ..(z-')Il.. (z)
Al = M:. + 2c1o.4.. + 3�,

A. = 3� - 2c1o,cJlI + 3�,
2 or
A3 = (ao, - al i) .
Further, it is apparent that for the case of a simple
quadratic transfer function, coefficient rounding which
{ IT
re;;ults in b.. 1 causes loss of asymptotic stability.
=
IV. ANALYSIS FOR CASCADE PROGRAMMI:-;'G

. (1 + E..(z-'»(I + E.. (z))
.,-1
For cascade programming, the pulse transfer function

of the actual filter lI(z-') is expressed in the form (41)
N
lI(Z-I) = IT Il,(z-') (34) As the coefficient errors are assumed to be statistically
_-I
independent with zero mean, it follows that (41) reduces to
where
lI. z
•
+ -I
+ -2
1 + bliz-' of b.iz-'
(35) 0-:
=
;
2 j f
'_1-1
Il.. (Z-I)Il .. (z)
As previously shO\vn,"01.11I1 the cascade programming of

(42 )
N(Z-I) is implemented by the successive direct pro-
107
IEEE TRANSACTIOXS ox CIRCUIT THEORY, MARCIl 1968
Substituting (37) into (42) one obtains terms of the word-length of a fixed-point machine, as
defined in (3), one obtains

� =
2!j f II..(z-l)II..(z)
vt;! 1.98 X lO'··2-(M-O (47)
'.1-1
=
or
log,. vt;! = 10.60 - 0.311f, (48)

which is shown in Fig. 6. Using a previous analysis, I'J. Ie)
the rms value of the filter's output noise due to multi
or
plicative rounding errors was derived as
log,. vrn: = 10.62 - 0.3M, (49)
. [n <
which is also shown in Fig. 6 .
As a means of verifying (4S) the eoefficients of the
A .... (Z-l)A .... (z)B....(Z-l)B....(z) + B.... (Z-l)B....(z) bandstop filter shown in Table I were appropriat ely
. (± ) ( ± p:..)
rounded using the equations
a:.. + A ....(Z-l)A ....(Z)
.-0 i-I a. = q X [in tegra l part (a./q + 0.5)] (50)
- A .. (Z-l)A .. (z)B.. (Z-l)B.. (z) ] �. (44)
and the quantity
b. = q X [int <'gral part (ii./q + 0.5»),
In order to automatically compute the contour integral

in (43), it is first necessary to implement the factorization (I'
1 l'·,T 11I*(.iw) - II!(jw) 12 d", (51)
! 27r/T .
=
n[ A .... (z-')A .... (z)B.... (z-')B.. .. (z) + B.... (Zl)B.... (z) was measured directly
in Fig. 7.
by mea ns of the technique shown
Examination of Fig. 6 suggests that, for all practical

purp oses, SO-bit realization can be considered "ideal" or
" errorless" relative to the realizations using less than
- A .. (z-')A .. (z)B.. (Z-l)B.. (z) = P (z-' )P(z). (45) 50 bits. In consequence, the "ideal" filter was realiz<,d
The poles and zeros of P( z-') are evidently dependent using SO-bit coefficients. Further, all multiplications in
both act ual and ideal filters were per formed with this
upon the mean-square values of the coefficient errors.
Therefore, the width of quantization is no longer a simple accuracy. l3y this procedure, coefficient rounding errors
were isolated from data and multiplicative rounding errors
scaling factor as in the previous cases. As practical wide
band digital filters may involve polynomials of high in the actual filter. The values of (1'" thus computed are
degree (22 poles in Golden and Kaiser!l'), the factorization shown in Fig. 6 for coefficient word-lengths in t he range
indicated by (45) is extremely laborious and gen erally of 3!l to 47 b its. It is observed that (1'.. is never grenter
not worthwhile. A more economic computational proce t han twi ce its exp ected value � which confirms, to W
some extent, the analysis.

dure is to measure the loss of performance direct ly by
the technique adopted in S ection V of this paper. It is also possible to realize Golden and Kaiser's band
stop filter example in the parallel programmed form
speci fi ed by (25). The r eleva nt filter coefficients for t h is
V. EXPERn!EXTAL RES'GLTS
parallel form of programming are shown in Table II.
Dy a slight modification to the design procedure, it is Assuming the coefficients have identical error bounds,
possible to synthesize the bandstop filter example of the degeneration in filter performance was calculated using
Golden and Kaiser!l' in the form of (4). The relev ant double-length arithmetic for all com p uter word-lengths
filter coefficients for this direct programming realization from (22) and (32) as
are shown in Table 1. Assuming the coefficien ts to possess
identical error bounds, as in a fixed-point machine, the vt;! = 18.65q (52)
degeneration in filter performance was calculated for all or
computer word-lengths from (20) and (22) as
log,. vt;! = 1.57 - 0.3.M, (5 3)
(I'
! = (3.91 X 102.)q2. (46)
whi('h is shown in Fig. S. Using a previ ous analysi s1 'I , 1&1 the
The numerical value in (46) was che eked by di rect and rms value of the output noise for parallel programming
cascade programming of the appropriate transfer func was derived as
tions of Fig. 3 using SO-bit accuracy. In both cases the
same numerical value was obtained. Expressi ng (46) 'in log,. � = 1.62 - 0.3 ..111, (54)
108
·KNOWLES AND OLCAYTO: COEFFICIEXT ACCURACY AND DIGITAL FILTER RESPOXSE
TABLE!
DIRECT PROGRAMMING COEFnCIENTS
k Ok 15k
0 6.14532 02400 45660 95935 10-1 1.00000 00000 00000 00000
1 1.82862 83743 34044 30234 2.84644 10689 74009 00222
2 9.18598 76393 23345 37501 1.36544 18024 06057 35998 10+1
3 2.01785 43858 30049 83632 10+1 2.86863 12R27 03054 60931 10+1
4 5.65074 01304 61323 06076 10+1 7. 68668 01556 51020 20537 10+1
5 9.77616 97277 77455 67929 10+1 1.26978 24238 74008 50583 10+1
6 1.95560 58002 77163 75480 10+1 2.42719 14138 13632 85357 10+1
7 2.74444 52024 17107 32360 10+2 3.25744 40025 55813 19723 10+1
8 4.27221 29208 22686 15701 10-' 4.84590 46018 12943 21722 10+1
9 4.!liiOfi� 95407 14507 65312 10+1 5.36954 24793 46719 51875 10+'
10 6.24128 701lCM 97272 04930 10+1 6.46875 76125 18105 92119 10+1
11 6.00062 '1331i 1231 7 550G1 10+1 5.94574 02388 97171 95950 10+1
12 6.24128 701l0S OU213 35302 10+1 5.90817 75064 19863 11885 10+'
13 4 !l.�Ofl� 95407 34381 27910 10+' 4.47868 926 04 45618 46370 10+1
14 4.27221 29208 46950 36973 10+1 3.69085 95309 78270 04033 10+'
15 2.74444 52!l24 38924 26919 10+1 2.26468 72853 40520 74943 10+1
16 1. 95ii60 58902 95258 27851 10+1 1.54004 37815 20019 23587 10+1
17 9.77616 97278 91991 46783 10-1 7.34798 81860 20413 00382 10+1
18 5.65974 01305 31808 57200 10+1 4.05582 38655 76315 12082 10+1
19 2.01785 43858 69746 80847 10 +1 1.37867 64204 74181 15588 10+1
!l.lSii!lS 7fi�n",
1 . 82862
20 60091 74744 5.97585 13278 52671 08000
21 83743 68430 88781 1.13257 95031 45398 63928
22 6.14532 02401 45188 00124 10-1 3.61729 371 40 82771 44243 10-1
,\,
10
�,
I---t---"<c'f-\,--+-�-t---t- --
K,
10
.
I<
a
10'
I
. �---+----�I -- . i __ __
l\,
--�.--��' .---i
'\,
I
I
I043S1,---13 I
COMPUTER
I
WORD'" LENGTH M( ) 'SITS
Key: -- � - - - -W; 0 v�.
Fig. 6. Direct programming errors.
M('):lit 11-0
=0 if k.o
Fig. 7. A computational technique for evaluating "w·'
109
_ 'I'U1fIUD1'JD1U ON 0IB0UlT TIIBOBY, IIABOB 1968
TABLED
PAIWoLIIL paoq.npm'g eo.riiu....... r. - L8II88'1H&'18
- -
I: ... Get liM li;;
1 -7."'1O'J'1O'M8 10-' 7.3162391858 10-' 9.11890799167 10-' 4.28617114740 10-'
2 9.8l1li1937100 10-' 7.476«71104 10-' 9 . 1l8883'72nO 10-' 1.IOM'126159 10-'
8 4.11&63620191 11)-1 -4.� 11)-1 9.96727889'J7 10-' 4. 31711124881 10-'
4 -4.888tM2202 11)-1 -4.mat24710 10-' 9.9lIII80678'111 10-' 1.063'1363(08 10-'
II -1.3328884702 11)-1 -7._11471331 10-' 9.8829l89882 10-' 4.4149478013 10-'
-7.6702309337
6
7
1.18780288'10
2.83M703827
11)-1
11)-1 '.0192757396
10-'
11)-1
9.8800730189 10-'
9.11614183M9 10-&
9.4074003990 11)-1
4.1I63II0628t8 10- '
8 -1.3308M1404 11)-1 4.1GtOOOtM 11)-1 9.651810179 10-' 6.1_ 11)-1
9 4.31161__ 10-' -1.G576739tO 10-' 8.7f2293M511 10-' 5.l8II03'79MO 10-'
8.09t4I0CI6357 10-'
10
11
-8.8t47288870
-1.16780741143
10-'
10-'
-1.47lI60II7472
-8.1101837180
10-'
10-' 1I.2II36IiCN87O 10-'
-2.MllM7f227
2.0746921909
10-'
10- '
TABLEm
I: ... - ... ... 6.. 6a.

1 0.91189079917 0.4183952392 0.91189079917 0.42851711474
2 1.I«N8M1087 0.1380f27242 0.9118883'1272 0.llOM'12616
3 0.99672781198 0.4129474324 0.9957278693 0.4317512488
4 1.27CN093'l36 0.1S924WJ111 0.99116306788 0.1063'136M1
5 0.11882918988 0.398911a08 0.11882918988 0."14947801
6 1.7233973619 0.2368636857 0.98801 0730 9 0.094074OCKO
7 0.911614l8385 0.368M23086 0.91161418335 0.4663Ii06285
8 3.6768896818 0.5911024&813 0.9651814018 0.0616263087
9 0.87422931146 0.2928381810 0.87422931146 O.III880379M
10 24.91171114807 11.1873798969 0.86IN8OO636 -0.0'JMIIM742
11 2.92118251702 O. '1M2988IIII5 0.lI2836II0497 0.2074692191
" .. ---
,\ �
,
---..
.\ -_ .. - - .... -
1r1
I
I
i
,
r\ f\
! \ ,
10
•
I �.
I ,
-
---_. .
i
!
,
10
. I 16
• • • 10 12 I•
CC!MPUTBI _D-LENGTN (III). am
K.,: -- � - - - W; e ....
Fie. 8. PanIIel p!Op8IIIIDiDJ __
110
·.OWLE AIID OLOAYTO: CODTIClIINT AOCUBACY AIID DIGITAL I'ILTBB BllilPOlIISB
+20
o
�
-20
- f-
�
��
.IITS
..
..
-40
101"S- � \� /') ---
I 12 IITS - � "- 7�
2 ---
"060
40 IITS -
-10 1/\ A /'\ /"\r'I
204 ·S ·6
�REQUENCY
� V V Y�
IN KIIZ
2·1 ·9 1-0
...-�-
-....,..--- ----r_-_,
10
, which .is also shown in Fig. 8. Examination of Fig. 8
,,
I ! suggests that, for all practical purposes, a 40-bit realization
may be considered "ideal" relative to less than 20-bit
realizations. Further, coefficient rounding errors can be
isolated from the data quantization and multiplicative
rounding errors in the actual filter by performing all
i 'i'\�-�
multiplications to a 40-bit accuracy. Using this experi
mental technique, the value of u .. was computed for
coefficient word-lengths in the range of 4 to 16 bits and
\ I . I the results obtained are included in Fig. 8. It is observed.
i• ,I I
\ i I that u .. is in close agreement with its expected value
!. i ', I
·'
i
Ii>
.--r-'�-+-
' -�--I
. Y:!, except for word-lengths of less than 8 bits, where
\1�---�--�
10
i
an apparently deterministic deviation of theoretical and
experimental values occur. This phenomenon is attributed
<1.
r---r-i -
, - ,\ . to the deletion of second-order quantities in (9). For
i ii '\1
illustrative purposes, the effect of coefficient accuracy
R 10
I
on the real frequency response of the parallel programmed
filter is shown in Fig. 9.
.. ,, The cascade version of the digital filter under considern.
i GI ,
tion is specified by the coefficients in Table III. For reasons
,,
of completeness, the degeneration in filter performance
with this method of programming is shown in Fig. 10.
In this case the ideal filter and all multiplications were
J().L-_-L __..l-_---''-_--L__.J..._
. -l performed to an accuracy of 40 bits, which may be
14 '6 II
It
10 12 20 22
COMPuTER WORD-LE.NGTH (M),BITS justified in the same manner as before. is seen that
the actual degeneration for cascade programming is less
- r-=;:
Key'e - - - v I.
n ,�i o "
... than that for direct programming but greater than that
Fig. 10. Cascade progTamming errors. for parallel programming.
111
lEa TRANSACTIONS ON CIBCUIT THIlORY, lURCH 1968
'DIAL PlLTIlI
VI. COXCLUSIOXS �--- -----,
, r------,
I
Due to inherent quantization errors on the cocfficients I
of the difference equation representing a digital filter,

the frequency response of an actual realization deviates
from that which would have been obtained using an
infinite word-length computer. This paper specifies this
loss of filter performance in terms of the statistically
expected mean-square deviation in the frequency re
sponses of the actual and ideal filters. Because the width
of quantization for the filter coefficients appears as a
simple scaling factor in the results for direct and parallel
programming, one simple computer program may be used 'II'UT OU1I'UT
to evaluate the statistically expected degeneration in

fi lter performance for all computer word-lengths. This
computation appears to be both shorter and more sig
nificant than several evaluations of IT,. for different widths
of quantization using the technique illustrated in Fig. 7.
Further, the computations involved may also be used
to evaluate the rms value of the output noise due to
data quantization and multiplicative rounding errors.
However, for cascade programming, direct evaluation of Fig. 11.
IT! using the technique illustrated in Fig. 7 appears to be
the simplest procedure in general, by virtue of the factor
ization problem involved. Experimental determinations
of IT,. as defined by (4\) for the cases of direct and parallel An apparently reasonable method of designing digital
programming are within twice its statistical expectation filters using IT! as a eriterion is as follows. A fi lter design
�, as computed through (20) and (31). This agree by the Golden and Kaiser technique is generally best
ment between theoretical and experimental results evi initially implemented in parallel programmed form on
dently provides confirmation of the analysis. An im a general purpose eomputer. The development of a
portant aspect of the results obtained is to show that a parallel realization involves only one factorization" in the
significant improvement in the accuracy of a realization 8 plane as opposed to the two required with cascade
can be achieved by employing only one extra bit. programming. Further, parallel programmed filters are
'With the analysis just presented and that available apparently less susceptible to multiplicative and coeffi
elsewhere,'·,·16] it is possible to specify a specific linear cient quantization than directly programmed filters (see
model of the basically nonlinear effects occurring in the Knowles and Edwards,'·,·I5 I Kaiser,1 7 I and this paper).
finite word-length representation of a pulse transfer func After checking the real frequency responses of the
tion. In the case of a fixed-point ope ratio n with direct analog and parallel programmed digital filters for reason
programming, this specific statistical model is as shown able agreement, a computation of IT! can be effected
in Fig. 1 1. It will be observed that the parallel transfer using (31). These same calculations can also be utilized
function arrangement to obtain the rms value of the output noise due to data
.Jfu + .J&
q qAm(Z-')
and multiplicative rounding errors. As � is the math
ematical expectation of
Bm(Z-') [Bm(Z-'»)" 1 12r'T
11I*(jw) - lI:(jw) 12 dw,
'27r/1' 0
gives the statistically expected mean-square deviation
(IT!) between the actual and ideal frequency responses. then it is evidently a measure of the deviation between
As the width of quantization (q) would be chosen in the actual and the ideal filter responses at any frequency.
practice so that t he deviation in the real frequency For many bounded functions, plus or minus three times
responses of II ( z· ' ) and II., (Z·I ) is extremely small, then its rms value contains most, and in some cases all, probable
the noise shaping filter 1/B (Z·') may be evidently replaced observations. It seems reasonable. therefore, to select the
by I/Bm(z·') with only second-order errors. Quantita
l
computer word-length according to
tively, the power transmitted by I/ B.,(z·') and
I
N
3 v? Acceptable Gain
IT (55)
1
:E fhz·1 ,. < Fluctuation
0-'
B m(Z ·') - [B.,(z·')]" provided that the output noise due to data quanti zation
and multiplicative roundoff errors is also acceptable with
can be considered identical for a satisfactory actual
r a
e lization. • Private correspondence with Dr. J. F. Kaiser.
112
KNOWLES AND OLCAYTO: COEFFICIE:li"T ACCt;RACY AND DIGITAL FILTER RESPONSE
TABLE IV respectively. Substituting these values into (55) yields
Muimum P_band MUUmum Attenuaticm in vi;·.. < 0.020

II Ripple Rejection Band and
18 -0.4997 dB at 2140 H. -71.12 dB at 2610 H.
v';'iw
< 0.89 X 10-6•
19 -0.49116 dB M 2140 H. -76.48 dB at 2630 Hs
For parallel programming, the minimum consistent
20 -0.4997 dB at 2140 H. -76.61 dB at 2730 H. number of bits is obtained from Fig. 8 as 19 bits. This
21 -0.4997 dB at 2140 H. -76.43 dB at 2680 Hs result was confirmed by experimental frequency response
• -0.4997 dB at 2140 H• -76.50 dB at 2730 Hs
measurements for computer word-lengths of 18, 19, 20,
and 21 bits. The maximum passband ripple together with
least rejection band attenuation is presented in Table IV.
A CKxoWLEDmlENT
this word-length. This procedure evidently enables the
designer to assess whether a practical realization of the The authors wish to acknowledge the stimulus and
given digital filter can be made on a smaller word-lcngth encouragement received from their correspondence with
computer. Dr. J. F. Kaiser of Dell Telephone Laboratories, Murray
As an example of the use of this analysis in digital HilI, N. J.
filter design, the number of bits required for the Golden
Kaiser filter to meet the following specification will be
R.Bn:uNCBS
III R. M. Golden and J. F. xu. ''DeBip of wideband �
considered: data &lten," Bell Bp. T_ J.., 43, pp. 11133-1_ J� 19M.
;;t
1II J. F. :s:.u.r, ..� III8UIOCII for IIIDI� &I..... ' Proe.
Ripple in passband = 0.5 dB. til Allrim CtIII/. 1m CirCtriI .." s".,. T...., pp. 221--' Ncw
ember 1963.
Minimum attenuation in rejection band -75 dD. 1II B. Widrow: ''Statiatical anaIYIIia of �tude-cruantiad
� aya{.em., " TNJu. AIBB (A� .." tllllualr)r ,
=
The idcal realization of this filter has a minimum attenua

tion of -76.5 dD in the rejection band. The allowance
p'-;rrt��ectaoffiai�
�ter in a IIIDIpied-data feedback�" Proe.IBB (LondoD),
of 1.5-dD fluctuation about this particular point evidently 'fOl. 112, pp. 1197-1207 JUDe 1966.
III --, ''Finite � elfecta in a multirate diJect diaitaI
constitutes a worst case design for the rejection band. oonUol BYBtem. " Proe. IB6 (London), vol 112, pp. 2376-�
Thus, the allowable gain fluctuations in the pass and the n-mbefl9M.
III c. M. Rader and B. Gold. '�tIl &Iter cJeeIa techniquee in
rejection bands are the freq\lellCI,Y domain," Proe.IBBB, VoL l1li, pp. 1�171, Felmaary
[ (�o5) ]
1967.
111 J. F. KaiIeI: ''Some vractical OODBiderr.ticma in the reaIiation
± antilog, ° - 1 = ±0.059 of IineIIr diaitll iIl-." Proc. IJrrI Allrim CtIII/. l1li CirniJ ..
� T� (Montioello, m), pp. 621� October 1965.
III O. 8oni1DOOD� "ID� of quantiution erron,"
and M.Bc. � Uniftnit of Manelii!llt.s EulaDd, 1966.
It! L. M. Graftl, � oJ � oj'1l-J. Vllliablet, 2nd eeL
[
± antilogu (- ;�) - antiIog,o ( - 7;(/)] N_ York: MClGraw-iiiii;" flNl8.
1111 J. R. Raaaini and O. F. J'ranldin, S-,w-Data CMnI
�. N_ York: MoGraw-HiD, 1968.
lUI E. L Jury, S-pW-DataCortlnIl� New York: Wiley,
= ±0.266 X 10-·, 1968.
1 13
Rep,inted !,om IEEE TRAXSACTlOXS
0:>1 AUTOMATIC CO.YTROL
Volume AC-13, !l;'umber 3, June, 1968
pp. 263-269
COPYRIGHT © 1968-THE INSTITl:TE OF ELECT RICA L AND ELECTRONICS ENGINEERS, INC.
PRINTEO IN THE U.S.A.
Eigenvalue Sensitivity and State-Variable Selection

PATRICK E. �IANTEY, MEMBER, IEEE
Abstract-The first part of this paper presents a new measure of between computer word length, that is, bits required,
sensitivity specifically applicable to the realization of a linear discrete
as determined by considerations of eigenvalue sensi
system on a digital computer. It is also shown that the sensitivity of
the eigenvalues to parameter inaccuracies in the realization depends
tivity to parameter inaccuracy, and the number of
strongly on the choice of state variables. From these considerations, arithmetic operations required.
a realization is obtained which is "best" for a large class of systems of Other authors have considered the choice of realiza
interest with regard to minimizing storage requirements, arithmetic tion with regard to parameter accuracy requirements
operations, parameter accuracy, and eigenvalue sensitivity. The
for stability. Golden and Kaiser(3).(4) have pointed out
second half of the paper considers the very practical problem of de
termining the number of bits accuracy required in the computer
with regard to realization of scalar input sampled-data
stored ,parameters of the system to achieve satisfactory performance. filters. that simulations corresponding to the direct or
For the realization found to be a best compromise, equations are reduced transfer-function form of difference equations
obtained for determining these bit requirements. Example s are generally require considerably greater accuracy than is
given showing the application of this realization to the computer im
required for parallel, partial-fraction expansion, or cas
plementation of a discrete filter, and a comparison is given to other
cade. factored transfer-function. forms. In later papers,
possible realizations.
Kaiser(4).(6) has obtained a lower bound on the number
I NTRODI:CTIOX of bits required for the stability of a digital filter. I Ie has
X AX'\" PROBLE�I where data arc to be processed
also shown hqw root-locus techniques can be applied
Iin real time on a digital computer by a linear dis

system, consideration must be given to the
crete
to studying the effect on stability of changes in a single
filter parameter.
The results in this paper yield parameter bit require
selection of the appropriate state variables to achieve
an efficient computer realization of the system. Exam ments not just for stability, but also for keeping the
ples of such real-time computer applications include eigenvalues of the computer-implemented system to
estimation, filtering and prediction with linear filters, within a prescribed maximum excursion from those
including discrete \Viener[l) and Kalman(2) filters, digi specified for the realization.
tal controllers and system compensators for real-time STATE-SPACE EQUATIONS FOR A LDIEAR
control, and real-time modeling or simulation. The DISCRETE SYSTE�I
resulting computer requirements dictate the cost of such
operations-or in deed their overall feasibility. There are
The relationship between sampled input tt(n) and
four fa ctors which ultimately determine the computer
output yen) for a linear discrete system can be described
requirements. These are : 1) sampling rate, 2) storage in state space by the equations
required, 3) arithmetic requi red to generate each out x
'
( 1t + 1) = <I" x'(n) + r'u(n)
put, and 4) computer word length required. (1)
y(n) = lI'x(n) + Pu(n)
The sampling rate is deter m in ed hy considerations of
system and signal requirements, and computer realiza where x'(n) is the state vector. and prime denotes the
tions will be studied for a fixed sa m pling rate. An ideal state-vec tor and system matrices for the initial choice
realization is here considered to be one which achie ves a of state variables. The p'rimed quantities are assumed
given input-output relationship, while minimizing the to be given.
number of storage locations, arithmetic operations, There exists a wide range of choices for the state
computer word length, and eigenvalue sensitivity. vector x(n) and for <1>, r, and II, all of which will yield
These requirements are interdependent, and a "best" the same i npu t-outpu t relationship between !ten) and
compromise is sought. Consideration will be limited to yen). Define
those realizations which use the minimum number cf
distinct parameters, excluding zeros and ones, and
x(n) = T-1x'(1I) (2)
storage locations required for an Nth-order system. where T is a matrix of rank N, x is an NX 1 column
The "best" realization will then represent a tradeoff vector. and N is the order of the linear difference equa
tion relating y en) and u(n). Thus
ManullCript received January 23, 1967; revised October 3, 1967,
and February 12, 1968. ( + 1)
x n = T-1'Tx(n) + T-lr'u(ll) = <l>x(ll) + rue,,)
The author was with the Dept. of Electrical Engineering, Stan
ford University, Stanford, Calif. He is now with the IBM Research (3)
Lab•• San Joee. Calif. yen) = lI'T"'(Il) + Ju(n) = lIx(lI) + Jlt(lI)
114
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, JUNE 1968
where
if> = T-1if>'T
aAk
=
[ajfOcf>i,] (8)
aq,ii aj/aA A.A·
r = T-1r' (4)
Using (7), this becomes
II = II'T.
Thus it is noted that the relationship between u(n)' and

y (n) is maintained for any choice of states x(n) rel�ted
'
(9)
to x'(n) by T,
the initial 4>', r', H' by (4). Various choices of T, and
hence of x(n), 4>, r, and H, will yield different com
puter requirements. 1
Considering the form of the discrete system with which is an estimate of the change in the kth eigenvalue
regard to storage required for implementation, there due to a change in t/>ij. If all parameters of cP have
are two demands for storage. The state vector x(n), I At/>ijl �A, then the maximum change in any eigen
regardless of T, will require N storage locations. In value of 4> is A· S, where S is given by
addition, the elements of 4>, r, H, and J must be stored. aAk
I I
4
This can require as many as N2+NP+RN+RP loca S max L:-- (10)
i.; aq,i;
-
Ak
tions for P-dimensional input, N-dimensional state, and
R-dimensional output. For simplicity , the derivations By definition, oA/iJt/>i, == 0 if t/>iJ is unity or zero, because
in this paper wiil consider only scalar input and output. the computer wiII realize these parameters exactly. The
All results obtained can be extended to cases of vector eigenvalue Xt corresponding to S is then the most sensi
valued input and output. tive eigenvalue for this realization, and S is taken as a
INFINITESnrAL EIGENVALUE SENSITIVITY measure of the sensitivity of this 4>. This sensitivity
measure can be used to evaluate the sensitivity of any
Eigenvalue sensitivity is defined as the expected proposed realization.
change in the location of an eigenvalue of if> for a
change in a parameter of CPo Such parameter variations SENSITIVITY OF VARIOUS FOR�IS OF 4>
in digital realizations with fixed word length occur due To obtain the minimally sensitive form requires a
to truncation, or rounding, of the specified parameters search over all cP related to cP' by (4), comparing these
to this word length. An infinitesimal approximation to choices on the basis of (10). If the possible choices are
this sensitivity, valid for small parameter inaccuracies, limited to those having only N parameters different
is given by from zero or unity, the number of possibilities is reduced,
flAk aAk but no orderly procedure has been devised for selection
� ( 5)
flcf>i; acf>i;
-- --
of the matrix T to obtain these forms. A further restric

tion can be imposed: that either r or H have, in the
where t/>;; is a parameter of 4> and X is an eigenvalue scalar case, all parameters zero or unity. 2 The corre
satisfying the characteristic equation spondence of these restrictions to T is entirely unclear.
I(A, {cf>i;}) = det [AI - if>] = O. (6) Therefore, certain possible forms for cP are considered
with regard to the sensitivity measure S of (10), and
Other useful forms of this equation are from these a compromise form is evolved which has
N N attractive properties in terms of sensitivity and com
f(A) = II (A - Am) = AN - L:b;AN-1 (7) puter requirements and represents a "best" form for a
1-1
large class of systems.
where, of course, the Xi and bi are functions of the t/>ij.
Bodewigl61 notes that the problem of determining the Diagonal Form
change in the eigenvalues of a matrix resulting from If the eigenvalues are distinct and real, then 4> may be
slight changes in the parameters was examined by transformed into diagonal form. For this form it can
Jacobil71 in 1846, with regard to the effect that small easily be shown that for stable systems the sensitivity
inaccuracies in the masses of the planets would have on measure S of (10) is unity, Aj�O, and the required word
their calculated orbits. A general method for calculating length for the coefficients is easily discerned. To keep all
such sensitivities is given in l\lorgan.!81 eigenvalues Xj within'Y of their ideal locations is equiva
For any matrix Tin (2) and therefore for any realiza lent to specifying the A, within 'Y-hence the number of
tion of 4>, the infinitesimal sensitivity of the kth eigen bits must be such that the specification of the Xi by this
value to a parameter t/>i; of if> is given by number of bits results in an error less than 'Yo For this
• [t should be noted that each parameter represents a required

I In actual computation, the transformation T must be carried multiplication in forming the output, and for speed and simplicity of
out with sufficient precision to assure that 4> and 4>' have essentially computation it is desirable to keep the number of multiplications to a
the same eigenvalues. minimum.
115
MANTEY: EIGENVALUE SENSITIVITY AND STATE-VARIABLE SELECTION
realization, the minimum number of parameters are For systems with eigenvalues which are reasonably close
required: N for <1>, and N for rand Il, for a total of together, S is very large for this form. For instance, with
2N+l, since J requires one. Thus the minimum num two real eigenvalues X=!, X=1, (13) yields S =7, where
ber of multiplications is used per input. Clearly, under X = 1 is the more sensitive eigenvalue. This means that a
these conditions no better form is possible. For multiple sevenfold increase in accuracy is required over that for
real eigenvalues, the analogous form is the Jordan form, the diagonal form.
and similar results can be obtained, although the
CO�IPLEX CO�Jl:GATE EIGE::-;VALUES
infinitesimal sensitivity measure of (9) is not defined.
For the case of complex eigenvalues, the diagonal or From the preceding discussion, the diagonal form
Jordan form has complex entries, and the corresponding emerges as the most attractive form for in the case of
state vector is complex. This form is undesirable, as the real eigenvalues. However, for complex eigenvalues, the
effective number of storage locations and multiplica handling of complex quantities is not desirable. If the
tions is greatly increased. original system has real coefficients, then any complex
eigenvalues occur in conjugate pairs. Suppose that the
Companion }'falrix Form
system to be implemented has both real eigenvalues and
If the system described by (1) is controllable and/or complex conjugate eigenvalues. Let the number of real
observable, it can be shown (9) that there exists a T eigenvalues be M. Then can be made to have an
which will reduce <1>' to companion matrix form in the MX AI submatrix with real entries on the diagonal, and
scalar input-output case, and will also reduce r' (or this submatrix is in an ideal form for realization. The
II') to N -1 zeros, the other element being unity. Simi N - �\l complex conjugate pairs remain, and note that
lar forms exist for the vector input-output case.IIO) The N-M must be even. Partition this matrix into 2X2
resulting system again requires the minimum of 2N + 1 matrices, one for each complex conjugate pair. Call each
storage locations for the parameters, which corresponds pair Ajo j= 1, 2� , (N -AI)/2, where
AI [ 0 ]
. . .
to the z-transform transfer function of the system in

reduced form, and it is a realization of the type sought. . - hM+2j+!
• (14)
o h,\f+2j
-
If the given system has real parameters, then the reali

zation in this form also has only real parameters. The
If, instead of the preceding, the Aj are replaced by <l>j of
companion matrix form of is
[
the form
�l
0
o <l>j = (15)
bu
(11)
where, for <l>j and Aj to have the same eigenvalues re
1
�: : :
o ° . . .
quires that
bN-l . . . b1
b2j = - hM+2j;\,\f+2j
and the characteristic equation is then (16)
blj = AM+�j + ;\,\(+2j
!(X)=det [Al-]=XN-b1XN-l- ... -bN=O. (12)
then T will be real for of the form
For stability, all eigenvalues of the system must, of
course, have absolute value less than unity. For this o
consideration alone, the question would be the minimum
number of bits required to specify the bi of (12) to keep
A.If_
all Xi inside the unit circle. Kaiser(4).li) has shown a
lower bound on the number of bits required for sta
<1>= (17)
bility of this form.
For the companion matrix form of , the sensitivity
measure S is from (9), (10), and (12)
which, of course, has the same eigenvalues as the origi

nal <1>'. This form of is attractive, as it involves again
m_l only N parameters to be stored for <1>. Again for repeated
....k (13)
eigenvalues, a form of related to the Jordan form
1 I xklN 1
max
-
results. The sensitivity measure S of (17) becomes
�. l - Ixkl N
I r I Xk - Xm I 1+lhM+�j l
S = max ..,---"':""-;"';":, (18)
�M +2; 1 2 1 m ( XM+2j) I
116
It can easily be shown that. for stable eigenvalues, (18) advantage that the bit requirements for realization of
has a lower bound of unity. the system to any desired accuracy of the eigenvalues
From these considerations, realization of as in (17) can be computed directly. without depending on
is preferable to the companion matrix form whenever S infinitesimal arguments. This direct computation of bit
of (18) is less than (13). Define S.O\k) as the sensitivity requirements is covered in the next section.
of Xk in the companion matrix form, and SD(Xk) as the It should be noted that the form of given in (17)
sensitivity of the same eigenvalue in the "decoupled" represents a realization in terms of (J/ +N)/2 parallel
form of (17). For Xk complex subsystems. Equivalently, placing ones in appropriate
locations above the main diagonal in (17) results in cas
cade form, with the same characteristic equation. From
the aspect of sensitivity, multiplicative operations, and
storage. the two forms are equivalent. lIowever. the
difTerent forms do have difTerent effects on arithmetic
while for X. real, SD(Xk) = 1, and roundoff error, with the cascade form being slightly
better in most cases.
BIT REQL'I1m�IE:-ITS
Consideration of the <l>j blocks of (17) will indicate
Assuring that the ratio S.(Xk)/ SD(Xk) exceeds unity for the number of bits required for this desired realization,
each Xk of is sufficient to assure that the measure from that is, attention can be focused on the bit requirements
(13) exceeds (17). For many conditions this obviously is for each of the 2 X2 matrices <l>j of (15) to keep the
satisfied; for example: 1) systems with no real eigen eigenvalues within a circle of radius 'Y centered on their
desired location.
values, and all eigenvalues with Im (Xk) > 0 are con
Let
tained within a circle of radius t; 2) all eigenvalues are
within a circle of radius !; or 3) the distance from any >",\{+2,-1 = aj + i/3j = pje"i
eigenvalue to any other eigenvalue not its conjugate is (20)
less than unity. Although such conditions are sufficient >".U.. li = a, - i/3i = Pie-iIi.
but not necessary, they are satisfied for a large class of Then from (16)
systems, including most systems which are sampled-data
blj = 2al
versions of continuous systems with sampling rate (21)
chosen to avoid spectral folding, and in many such b2i = - (alj + /3;2)
cases these conditions can be used instead of computing
and the factor of the characteristic polynomial of re
(13) or (18) to select the form desired. However, the
lated to <1>1 is
calculation required to actually compare (13) and (18) is
exceedingly trivial.
An example shows a comparison of the sensitivity in
the companion matrix form of (11 ) and decoupled form Suppose that the eigenvalues X.\(+2;_I, X.UHi are moved,
of (17) for a system with five eigenvalues as shown in by the bit limitation, to the new locations
Tahle 1. >":V+2i-1 = (a; + oS) + i({3i + E)
(23)
TABLE I
Il'OFDIITESI)IAL SEXSITIVITY OF LOW-PASS SYSTEM
so that again X'.\f+2i--1 =X'M+2it and thus the character
Real Imaginary Magnitude Angle Se(X) SD(X) istic polynomial retains real coefficients. Now the
0.8000 0.0000 0.8000 0.0000 52.2871 1.0000
coefficients of <1>/ are
0.6928 0.3999 0.8000 0.5235 49.8833 2.2500
0.6928 -0.3999 0.8000 -0.5235 49.8833 2.2500 bu' = 2ai + 2a
0.5656 0.5656 0.8000 0.7853 23.8559 1.5909 (24)
0.5656 -0.5656 0.8000 -0.7853 23.8559 1.5909 b2.' = - [ (ai + a)2 + ({3i + E)2J.
Define
From these infinitesimal considerations, S. of (13) Ab'i � bl/ - bIJ = 2a
is 52.28, while SD of (18) is 2.25, and it is estimated that (25)
the companion matrix realization will require at least Ab2i � b2/ - b2j = 2aja - oS2 - 2PjE - E2.
five more bits for the same accuracy in eigenvalue Now, if the changes in the eigenvalues of <l>j are to be
location. confined to a circle of radius 'Y, it is required that
The proposed decoupled form, besides yielding lower
sensitivity for a wide class of systems, has the importan t (26)
1 17
Iy
Fig. 2. Contours of equal eigenvalue shift.
SDIPLIFIED ANALYSIS WITH UNIFORM \VORD LENGTH

If the coefficients are to be specified to the same ac
curacy, corresponding to the use of the same word
Fig. 1. Eigenvalues and variations in complex plane. length for all parameters, then
I Ilb1j I ::;; 11
Fig. 1 illustrates the eigenvalues and variations in the (30)
complex plane. 11 > O.
Now consider the changes in I::.b. 1h I::.b. 2j as the eigen
values AM+2i-h X M+2 i are moved an amount 'Y in any The problem of determining the bits required to keep the
direction cpo Then eigenvalues within 'Y is then equivalent to determining
I::.. so that the zeros of (22) are within 'Y of XM+2i-l and
Ii 'Y cos </>
=
(27) X.II+2j when, from (25)

E = 'Y sin </>
btl = blj + Ilblj
and (25) becomes
(31)
Ilblj 2')' cos </>
=
b2/ = b2j + Ilb2j.
(28)
Ilb2i - 2')'(aj cos </> + Pi sin </» - ')'2.
= Equation (22) becomes
Equations (28) are parametric equations of an ellipse, (32)

which can be written from (20) as
Now the zeros of (32) are given by
Ilb1j 2')' cos <f>
=
(29)
Ilb2j - 2..fPi cos (9j -
= </» - 'Y2.
The center of the ellipse is at (I::.b. lj, I::.b. 2j) = (0, _'Y 2). It
can be shown that, for all 'Yl <'Y2 <(3h all changes in the
Using (25) this becomes
eigenvalues corresponding to a circle of radius 'Yl map to
an ellipse in the I::..b1;. I::..b2i plane which is interior to that , .1611
for 'Y2. AJI+IJ-I.JI+1J - till +
T
Thus, for all 'Y <(3), if the coefficients are specified to an (34)
accuracy such that I::.b. 1i> I::.b. 2J is interior to the ellipse for
a particular 'Y, the eigenvalues will be a distance less
than'Y from their desired values. However, note that if
'Y "?,(3i> the location of the eigenvalues can reverse with Then the bit requirement is determined by finding the
respect to the real axis. For stability, 'Y must be less I::.. such that if I::.b. 1;. I::.b. 2j satisfy (30), then
than 1-p. Fig. 2 illustrates the case where II::..b1JI
II::.b. 2JI =1::... For a given 'Y, I::.. is one-half the side of the max I }.�(+2j-l - (ai + i{3j) I ::;; 'Y. (35)
=
6.bli .{U2J
largest square which can be centered at the origin in this
plane and fitted inside the corresponding ellipse. Speci Combining (34) and (35), the problem becomes one of
fying the parameters to the accuracy given by I::.. assures finding I::.. such that if I::.b. 1i> I::..b2j satisfy (30) and 'Y is
eigenvalues within 'Y of their prescribed values. given, then
1 18
max
Illbll
-±
( Ilbll
-P,2+ajllbIj+--+llb21
) '
IHli.l1lJtJ 2 4
1
(44)
+ iP! � 'Y. (36)
The Case 2 requirement (42) becomes

(
This problem can be approached more conveniently,
�»
from the aspect of computation, by choosing d and
evaluating (36) according to the restrictions of (30) for -Pl+Il laJI+1+ O. (45)
the corresponding 'Yma,.. These values of 'Yma,. can be
tabulated for a range of d and the required bits easily
To summarize-for a given d (d > O), if
determined. For a given d, if dbl! and db2! satisfy (30),
then
'Ymax is given by (40). Otherwise.'Yma,. is found from (44).
ALTERNATE REALIZATIONS
Anotherpossible realization for the AI of (15), resulting

There are two cases to consider, depending on the sign in the same eigenvalues as those of (15), is a <1>, of the
of the inner radical term.
[
form
' 1
Case
�,1 -
al P/ J (46)
IlbIj2 _P; a;
_p,2 + aAbl! + -- + Ilb21 < O.
-
(38)
4 where a; and {3; define the eigenvalues as given in (20).
For this case, again restricting db" dbJ+I according to This form of <1>; again can be shown to yield a real
transformation T. Again. only N coefficients need be
{ [ (
(30).
stored. Here. to keep the eigenvalues within a circle of
llblJ2 radius 'Y, it is required that 8 and E, which are now the
'Ymax = max + P! - Pi -aAbIj
tolerances in a, and {3/o respectively. must satisfy (26).
)'J2} ,
--
l1b'i,l1b" 4
(39) For a fixed word length, this makes 8 = E =2-\ where k
Ilb,i
--- -llb2,1 is the number of bits used. This is a less stringent re
4 quirement than that imposed with regard to realization
of <1>; of the form of (15). However. for each pair of
The maximum of (39) occurs for dbll =d sgn (a;),
eigenvalues realized according to (46). two additional
db2/ =d, and is
multiplicative operations are required for the computa
tion of each output; this violates the earlier restriction
to realizations using a minimum of multiplications. To
determine whether the saving in coefficient bit length
The Case1 requirement (38) becomes justifies the increase in multiplicative operations, in
terms of efficient computation, consider the machine
time required by each form. related to the corresponding
bit requirements.
From consideration of (40) and (44). the smallest
Case 2 dl, for any stable eigenvalue. that can result in a change
Ilb1J2 'Y is the eigenvalue locations for of (15), is related to
- Pl + a;llbll + -- + Ilb21 � O. (42) 'Y by 'Y =v2dl+dN2. For small 'Y. and hence small
4
dl. dl�2/2. Now dl corresponds to a requirement of
kl bits. and since bl;. b2; of (15) are. in magnitude, less
than two. kl = 1-log2 dl = 1-log2 (-r2/2).
+ T +Ilb2;)
Ilbl,2 'J2 } ,
+ Pi .
(43) For <1>; of (46). the correspondingly smallest d2 for the
same'Y is related to'Y by 'Y = v'21l22, or d2 ='YIv2. Since
OIJ and b; are in magnitude less than unity. k2 = -log2d2
= - log2 (-r/v2).
The maximum of (40), subject to the constraint of (30), Thus. for the same 'Y. k2�(kl-1)/2, and about half
again occurs for dblJ =d sgn (a;) , db2/ =d, and is, the number of bits is required for the parameters of (46)
1 19
as for those of (15), but the number of corresponding CO:>lCLUSION

multiplications in forming the next state is doubled. The methods presented provide a rationale for selec·
Thus in most cases with "conventional" computers tion of state variables based on considerations of eigen.
where two multiplications of k bits length take longer value sensitivity. The future efforts in this area should
than one of length 2k, the form given in (15) is definitely include consideration of roundoff errors as well as the
superior. choice of configuration. Other criteria than eigenvalue
ApPLICATlO� TO A DIGITAL FILTER
sensitivity are also subjects for further study.
A sIx-pole discrete version oC a bandpass Butterworth ACKXOWLEDGllENT

filter was chosen as an example to illustrate the savings The helpful discussions with R. A. Singer regarding
in coefficient bit requirements that can be obtained by this problem are gratefully acknowledged , as wel l as the
choosing the proper state variables. The particular simulation efforts of Philip :\Iantey.
filter chosen has center frequency at 75/10 000 of the
sampling frequency, has 3-dB bandwidth equal to REFERENCES
1/200 oC the sampling frequency, and is one of a bank of ,I, N. Wiener, TIle &d,apolatitm, lme,poloJioa. aM Smoot"'"
100 filters applied to digital spectral analysis.(111 Table 0/ SIDlitma,)' Ti_ &rus. New York: Wiley, 1949.
'". R. E. Kalman, A new approach to linear filtering and predic·
•
II shows the eigenvalue locations Cor this system and the tion problems," T,am. ASME, J. Basic E""g., ser. D, vol. 82, pp.
respective infinitesimal eigenvalue sensitivities (S.(X) 35-45, March 1960.
I'l R. M. Golden and J. F. Kaiser, "Design of wideband sampled·
and SD(X» as computed for the forms of (1 1 ) and (1 7). data filters," Bell S,s. Tee"- J., vol. 43, pt. 2, pp. 1533-1546, July
1964.
,4, J. F. Kaiser, ·Digital filters," in S),stem AIIIIl,s" 6)' D�«iIGl
.
TABLE II ComJ>vUr. F. F. Kuo and J. F. KaIser, Eds. New York: Wiley,
1966, pp. 218-285.
INFINITESI�IAL SENSITIVITY OF NARROW-BAND SYSTE)( 'i'
-- ·Some practical considerations in the realization of linear
digital filters," P,oc. J,d A ..... AllefolofJ Cn/. n C;,cwit aM S)'sUm
magi TUM, (Urbana, 111., October (965), pp. 621-633.
Real I nary - Magnitude Angle SD(X) (I. E. Bodewig, MtJJriz CsklllfU. New York: Interscience, 1950.
I"� C. G.J. Jacobi, C,el/e's JlIIlrlll, vol. 30. Berlin: de Gruyter,
0.9919 0.0197 0.9921 0.0199 15807784.2968 50.4099 1846. !>p. 51-95.
0.9919 -0.0197 0.9921 -0.0199 15807784.2968 $0.4099 ,I. B. S. Morgan,Jr., "Sensitivity analysis and synthesis of multi.
0.9833 0.0463 0.9844 0.0471 9909651. 1249 21.�965 variable systems," IEEE T,am. AulOIfIIJIic C",." ol, vol. AC·II, pp.
0.9833 -0.0463 0.9844 -0.0471 9909651.1406 21.3965 506-512, July 1966.
0.9894 0.0736 0.9921 0.0743 2352116.9062 13.5188 ,I, W. M. Wonham and C. D. J ohnson, ·Optimal bang-bang con.
0.9894 -0.0736 0.9921 -0.0743 2352116.9062 13.5188 trol with quadratic performance mdex," Preprints. 4th Joint Auto
matic Control Conf. (Minneapolis, Minn., June 1963), pp. 101-112.
,II, W. C. Tuel, ·Canonical forms for linear systems-I," IBM
Research Rept. RJ .175, March 1966.
(II' C. S. Weaver, P. E. Mantey, R. W. Lawrence, and C. A. Cole,
Realization oC this filter on the IB�( 7090 with in
• Digital spectrum analyzers," Stanford Electronics Laboratories,
the companion matrix form of (1 1 ) resulted in an un Stanford, Calif., Rept. SEL 66-059 (TR 1809·1/1810-1), June 1966.
stable system in single precision arithmetic (27 hits
accuracy). Choosing of the form of (17) and usin:.{ the
simplified analysis related to uniform word len!{th, it
was computed that the lilter realized in this form would
require only 14 hit coefficients for stability, to keep
'Ymax < (l-p), and it would yield essentially ideal pl'r
formance with 18 bit coefficients, where ideal per
formance was taken to mean that no eigenvalue had a
shift'Y, due to the use of a finite number of bits, no more
than 10 percent of the distance between the eigenvalue
and the unit circle in the complex plane. These results,
and many others,(111 were verified by simulation
and illustrate the desirability oC a realization in the
form oC (17), as well as the ability of the analysis de
veloped here to predict within one or two bits the
accuracy needed in the coefficients to achieve essentially
"ideal" performance. This empirical criterion of ideal
performance applied to'Y for this system, using the cal
culated bit accuracy, resulted in frequency and transient
responses with essentially no discernible differences
from those using more bits, while use of fewer bits
resulted in very deleterious changes in both transient
response and frequency response.
120
I. Introduction
In many situations there is interest in implementing the

fast Fourier transform using fixed-point arithmetic. In
this case the efTect of the word size on the accuracy of the
A Fixed-Point Fast calculation is of obvious importance both with regard to
the design of special-purpose machines and with regard
Fourier Transform to the accuracy attainable from existing machines. This
paper contains an analysis of the fixed-point accuracy of
Error Analysis the power of two, fast Fourier transform (FFT) algorithm.
This analysis leads to approximate upper and lower
bounds on the root-mean-square error. Also included are
PETER D. WELCH. Member, IEEE the results of some accuracy experiments on a simulated
IBM Watson Research Center fixed-point machine and their comparison with the error
Yorktown Heights, N. Y. upper bound.
II. The Finite Fourier Transform
Abstract
If X(j), j=O, I, ... ,N-I, is a sequence of complex
numbers, then the finite Fourier transform of X(j) is the
This paper contains an analysis of the fixed-point accuracy of the
sequence
power of two, fast Fourier transform algorithm. This analysis leads
X-I
to approximate upper and lower bounds on the root-mean-square error.
A(n) (lIN) L: XU) cxp -27fijnj.\'
(1)
=
Also included are the results of some accuracy experiments on a sim j.O
ulated fixed-point machine and their comparison with the error upper n = 0, I, . . . , N - 1.
bound.
The inverse transform is
/It-I
XU) = L: A(n) cxp 27fijnIN. (2)

n-O
In both of the above equations, i= (_1)1/2. We will be

considering a fixed-point calculation of these transforms
using the fast Fourier transform algorithm [I], [2]. In
connection with(I), we will consider the calculation of
NA(n) from XU). N-l would then be included as an over
all scale factor at the end. Now considering the calcula
tion of NA(n) from X(j) or X(j) from A(n), Parseval's
theorem states:
N-l N-l
L: / XU) /2 = N L: / A(n) /2
J-O
or
N-l N-I
L: / NA(n) /2 = N L: / XU) 12 (3)
and we see that the mean-square value of the result is N

times the mean-square value of the initial sequence. This
fact will be used below.
III. The Inner Loop of the Fast Fourier Transform

Algorithm: Step-by-Step Scaling
The inner loop of the power of two FFT algorithm

Manuscript received February 26, 1969; revised April 9, 1969. operates on two complex numbers from the sequence. It
This work was supported in part by the Advanced Research Projects
takes these two numbers and produces two new complex
Agency, Dept. of Defense, Contract AF I 9-67-C-O I 98. numbers which replace the original ones in the sequence.
IEEE TRANSAcrlONS ON AUDIO AND ELEcrROACOUSTICS VOL. Au- 17, N o. 2 JUNE 1969
121
Let X..{i) and X..(j) be the original complex numbers.
Then, the new pair X.. +1(i), Xm+1(j) are given by
Xm+l(i) X.. (i) + X m V) IV
(4)
=
XmdJ) = X.. {i) - X.. {J)1V

where W is a complex root of unity. If we write these
equations out in terms of their real and imaginary parts,
l�R'{Xm(j)}
we get
Re {Xm�l(i)1 = He {x..{i)1 + He {x..(i)1 He {wI

- 1m {x.. (i)1 1m {wI
(5) Fig. 1. Regions important to the rescaling of the
sequence.
1m {x.. u{J)1 = 1m {x..(t)I":" He {x..{i)1 1m {wI
- 1m {x.. v)1 He {wI.
At each stage the algorithm goes through the entire se

m+lst stage. However, if X.. (i) and Xm(j) are inside the
quence of N numbers in this fashion, two at a time. If
N= 2Af, then the number of such stages in the computa
smaller square, then it is possible for X..+1(1) or Xm+1(j) to
be outside the larger square and hence result in an over
tion is M.
flow. Consequently, we cannot control the sequence to
As we move from stage to stage through the calculation,
prevent overflow by keeping the absolute values of the
the magnitudes of the numbers in the sequence generally
real and imaginary parts less than one-half. Furthermore,
increase which means that it can be kept properly scaled
the maximum absolute value of the real and imaginary
by right shifts. Consider first the root-mean square of the
parts can increase by more than a factor of two and hence
complex numbers. From (4) we have
[I ;I ]
a simple right shift is not a sufficient correction.
Xm+1{i) 12 X..+1V) 1 /2 The above results and observations suggest a number of
alternative ways of keeping the array properly scaled. The
= v'2
[I X.. {i) 12 ; I X.. 12T'2.
W
(6) three that seem most reasonable are the following.
I Xo{i)1 <
1) Shifting Right One Bit At Ecery Iteration: If the
initial sequence, Xo(i), is scaled so that 1/2 for
Hence, in the root-mean-square sense, the numbers (both all i and if there is a right shift of one bit after every
real and complex) are increasing by v'2 at each stage. iteration (excluding the last) then there will be no over
Consider next the maximum modulus of the complex flows.
numbers. From (4) one can easily show that 2) Controlling the Sequence so that I x..(i)1 < 1/2:
Again assume the initial sequence is scaled so that
max { IX.. {i) I , I XmW II i.
I
Xowl < 1/2 for all Then at each iteration we check
� max { I X..+1{i) I, I X"+l{J) II (7) x..wl and if it is greater than one half for any i we shift
right one bit before each calculation throughout the next
� 2 max { I X.. m I , I X.. {J) II· iteration.
3) Testing Jar an Overflow: In this case the initial se
Hence the maximum modulus of the array of complex
numbers is nondecreasing.
quence is scaled so that Re I Xo(i)I< I and 1m { Xo(i) I < 1.
Whenever an overflow occurs in an iteration the entire
In what follows, we will assume that the numbers are
sequence (part of which will be new results, part of which
scaled so that the binary point lies at the extreme left.
will be entries yet to be processed) is shifted right by one
With this assumption the relationships among the num
bit and the iteration is continued at the point at which the
bers is as shown in Fig. I. The outside square gives the
overflow occurred. In this case there could be two over
region of possible values, Re { X.,(l)I< 1 and 1m { X.,(i)I
flows during an iteration.
< 1. The circle inscribed in this square gives the region
I x..(i)1 < I. The inside square gives the region Re I X..(i)I The first alternative is the simplest, but the least ac
< 1/2, 1m I X..(i) 1< 1/2. Finally, the circle inscribed in curate. Since it is not generally necessary to rescale the
this latter square gives the region I xmwl < 1/2. Now if sequence at each iteration, there is an unnecessary loss in
Xm(i) and X.,U) are inside the smaller circle, then (7) tells accuracy. The second alternative is also not as accurate
us that X..+l(i) and X.. +1(J) will be inside the larger circle as possible because one less than the total number of bits
and hence not result in an overflow. Consequently, if we available is being used for the representation of the se
control the sequence at the mth stage so that I X.. (i)1 quence. This alternative also requires the computation
< 1/2, we are certain we will have no overflow at the of the modulus of every member of the sequence at each
IZZ
iteration. The third alternative is the most accurate. It has 2) When two B bit numbers are added together and
the disadvantage that one must process through the se there is an overflow, then the sum must be shifted right
quence an additional time whenever there is an overflow. and a bit lost. If this bit is a zero, there is no error. If it is
The indexing for this processing is, however, straight a one, there is an error of ± 2-B depending upon whether
forward. It would not be the complex indexing required the number is positive or negative. The variance of this
for the algorithm. In comparing the speed of the second error (it is unbiased assuming there are an equal number
and third alternatives one would be comparing the speed of positive and negative numbers) is
of two overflow tests, two loads, two stores, and a trans
fer with that of the calculation or approximation of the
AZ2 = 2-28/2. (10)
modulus and a test of its magnitude. This comparison It has a standard deviation
would depend greatly upon the particular machine and
the particular approximation to the magnitude function.
A2 = 2-B-I/Z "" 0.i(2-B). (11)
A modification of the second alternative was adopted In addition, we will consider the effects of the propaga
by Shively [3] . In this modification, if I xm(i)i > 1 /2 , the tion of errors present in the initial sequence. The variance
right shift was made aJler each calculation in the next of these errors we designate by 02• In the simplest case
iteration. Provision was made for possible overflow. We these errors would be the quantization errors resulting
will give an error analysis of the third alternative below. from the A/D conversion of an analog signal.
A microcoding performance study of this third alterna
tive for the IBM 360/40 can be found in [4]. Although B. Upper Bound Analysis
this error analysis applies to the third alternative it can In this section, we give an upper bound analysis of the
be easily modified to apply to the second. In addition, the ratio of the rms error to the rms of the completed trans
upper bound given applies directly to the first alternative. form. This upper bound is obtained by assuming that
The analysis can also be modified for the power of four during each step of the calculation there is an overflow
algorithm. and a need to rescale. We let Xi,,}) be a typical real ele
ment at the kth stage (i.e., the real or imaginary part of a
complex element) and let
IV. A Fixed-Point Error Analysis
A. Introduction V(Xk) = variance {Xk(j)f

1 N-I (12)
We will assume, in this analysis, that the inputs [ i.e. , = - L variance I Xk(j) }.
the real and imaginary parts of X(j) or A(n)] are repre
N. ;-0
sented by B bits plus a sign. We assume the binary point
lies to the left of the leftmost bit. We showed earlier that (This notation, a bar over the symbol indicating an av
erage over the sequence, will be carried throughout the
the magnitudes of the members of the sequence would
generally increase as we moved from stage to stage in paper. ) We will, in what follows, replace A22 by 6�12. We
calculation. Hence, the method of operation is to test

will also let A2=AI2.
Since the first stage gives an overflow, the original data
for overflow within the inner loop. If there is no overflow,
must be rescaled or truncated by one bit. Hence,
the calculation proceeds as usual. If there is an overflow,
then the two inputs producing the overflow are shifted (13)
right until there is no overflow. The amount of the shift
is recorded (it will be either one or two bits) and the entire In going from the original data to the results of the first
sequence is shifted right this same amount. In this scheme, stage, W=I and, hence, there is no multiplication and
we shift not only those elements we have already calcu we either add or subtract. Furth<!(, we assume that the
lated but also those yet to be done. The total number of next stage will result in an overflow and hence we will
shifts is accumulated and the power of two, raised to the have to rescale. This gives
negative of this total number of shifts, constitutes an
V(X1) = 2V(Xo) + 4·QA2
overall scale factor to be applied to the final sequence. (1-1 )
There are two operations which produce errors which V(Xl) = 2 (QAZ ) + 202 + 4.QA2.
are propagated through the calculation:
In these expressions and this entire discussion, we are
I) When two B bit numbers are multiplied together a assuming all errors to be independent and, hence, that
2B bit product results. If this product is rounded to B bits, the variance of the sum is the sum of the variances. Going
an error whose variance is from the first stage to the second stage, we have
W=(-I)I/2 and again there are only additions and sub
(8)
tractions. Thus, with the rescaling,
is created. This error has a standard deviation of

V(XZ) 2V(Xl) + 42·6Az
(15)
=
(D) = 22(QA2) + 2202 + 2(4.{jA2) + 42 .QA2.
WELCH: FIXED-POINT FFT ERROR ANALYSIS
123
In going from the second stage to the third stage, we have and, generally, if M is the last stage
multiplications and we have them in all subsequent
Y(X,u)
stages. In generating the third stage, half the inner loops
have multiplications. Consider the first equation of (5). = 2M(6�2) + 2M62 + 2M-l(6·4�2) + ...
All the other equations are identical in terms of error + 2(6.4M-l�2) + 2M-2](�2 + (M - 3)2·11-1](�2
propagation. Remember that XaCi) is complex: + 2.l1-4(-!3�2) + 2J1-4(44�2) + ... + (4M�2) (21)
Re I X3(i)} = Re I X2(i)} + Re I X20)} Re lTV} = (1.5)2M+2�2(1 + 2 + ... + 2.11-1) + 2.l!62
(16) + (M - 2.5)2·1f-l](�2
- 1m I X20)} 1m lTV}.
+ 2·11+2�2 + 2M+4(1 + 2 . . . + 2M-4)�2
Equation (16) yields, with rounding to B bits after the
or
addition and with rescaling,
V(XM)
V'(X3)
... (1.5)22ftl+2�2 + 2M62 + (M - 2.5)2·1f-l](�2
V(X2) + [Re2/X2lf)} + Im2/X20)}]V(JV) (22)
+ 2·11+2�2 + 22M+l�2
=
+ [Re 2 (lJ') + 1m2 (1V)1V(X2) (17) "'" 22.1fH�2 + 2.l!62 + (M - 2.5)2·1I-1](�2 + 2M+2�2.
+ (_P�2) + 43.M2
K is the average of the square of the absolute values of the
= y(X2) + -I X2(j) 12�2 + V(X2) + (_P�2) + 43·M2. initial complex array. Hence, applying Parseval's theorem
(3), the average of the square of the absolute values of the
In (17), the first term is the variance of the first term of final array will be 2MK. What is most meaningful in this
(16). The second and third terms of(17) are the variance case, however, is the mean square of the real numbers,
of the full 2B bit products given by the second and third which is 2.11K/2. Hence we have
terms of (16). The fourth term of (17) is the result of
V(XM) 2MH�2 262
rounding after the addition. The fifth term is the rescaling +
term. Finally, we saw in (6) that the average modulus 2.11](/2 "'" ](/2 ](/2
(23)
squared of the complex numbers is increasing by a factor (M - 2.5)�2/2 22�2
of 2 every stage. Hence, if we let K equal the average + +
modulus squared of the initial array, i.e.,
1/2 ](/2'
and, finally, for large M,
rms (error) 2(MH)f2� 2(MH)f22-B(0.3)

... ... • (24)
rms (result) v'](/2 rms (initial array)
then we have
Equation (24) gives an approximate upper bound for
V'(X3) = 2V(X2) + 22](�2 + 43�2 + 43·M2. (18) the ratio of the rms of the error to the rms of the answer.
Notice that this bound increases as the v'N or ! bit per
Equation (18) would be correct for V(Xa) if all the inner
stage.
loops involved multiplications. However, at this stage
only half of them do and, hence, c. Lower Bound Analysis
V(X 3) = 2V(X2) + 2](�2 + 43�2/2 + 43.6�2 We will now obtain an approximate lower bound for
the ratio of the rms of the error to the rms of the answer.
23(6�2) + 2362 + 22(-!·M2) + 2(-!2·M2) (19)
=
We obtain this lower bound by assuming that there are
+ 43.6�2 + 2](�2 + 43�2/2. no overflows in the calculation and, hence, no shifts of
the array. In this case,
In the next stage, three quarters of the inner loops re
quire multiplications and these multiplications get pro
V(Xo) = 62
gressively more numerous as the stages increase. Hence, VeX!) 262
= (25)
from here on, we will assume all stages have multiplica V(X2) = 22152•
tions in all the inner loops. Thus, applying the above
In the third stage, half of the inner loops involve a mul
techniques, we get
tiplication and, hence,
V(X4) 24(6�2) + 2462 + 23(6'4�2) + 22(6 .42�2)
=
V(Xa) = (1/2)(22K�2) + 1/2(�2) + 23152. (26)
+ 2(6.43�2) + 6.44�2 (20) This can be seen by considering the first term of (17).
4
+ 2 2](�2 + 23](�2 + 43�2 + 4 �2 The first term of(26) comes from the second term of the
IEEE TRANSACfIONS ON AUDIO AND ELECTROACOUSTICS JUNE 1969
124
first of equations (17). The second term of (26) is caused Finally, the rms of the difference between the fixed-point
by the rounding to B bits. Now, as before, and floating-point answers was taken. We also obtained
the maximum absolute error and average error.
V(X�) =2V(X3) + 23Kt:,.2 + t:,.2
(27) Fig. 2 contains the result of transforming random num
bers which lie between zero and one (placed in both the
real and imaginary parts). In this and subsequent tests,
Finally,
three runs were made for every power of two from 8 to
v (XM) 2048. Since these random numbers have a dc component
= 2·lf-2Kt:,.2 + (..lJ 3)2·lf-IKt:,.2 + 2·1[-3t:,.2 + 2.1f-5t:,.2
-
of one-half, the fixed-point program must rescale at least
N- I times. Hence, one would expect the error to lie close
+ 2M-6t:,.2 + ... + t:,.2 + 2.1[152
(28) to the theoretical upper bound as given by (24). This
= (U - 2.5)2·11-IKt:,.2 + 2.1£-3t:,.2 theoretical upper bound is also plotted in Fig. 2 and the
+ (1 + ... + 2·ll-5)t:,.2 + 2.1£152 results are seen to lie slightly above it. The rms of the
original array, yKj2, is approximately 0.58.
"" (M 2.5)2.1[-IKt:,.2 + 2M-3t:,.2 + 2·lf-4t:,.2 + 2Mc52•
-
Fig. 3 contains the results of transforming three sine
As in Section IV-B, the mean square of the final sequence waves plus random numbers between zero and one-half
of real numbers is 2M• K/2. Hence, we have in the real part and all zeros in the imaginary part. Specif
ically,
v (XM) t:,.2/8 t:,.2/G 152
- 2.5)t:,.2 + + + . (29)
. I
2 [K/2 "" (JJ K /2 K /2 1\./2 Re {xwl =1/2[Y(j) + (l/2) sin (27f8j/N)
+ (1/4) sin (27f4j/N)
Now one has to be careful in interpreting (29) to obtain
an approximate lower bound. In actuality, the only way + (1/4) sin (27f8j/N)]
to have a situation in which there are no shifts is to have a
' 1m {X(j)} =0
small K and, in fact, one which approaches zero as N(or
M) becomes large. However, if we assume that the word where the Y(j) are random numbers between zero and
size expands to the left as necessary rather than over
one. Again, there is a dc component of magnitude one
flowing, then this analysis does provide a lower bound to
fourth and the array must be rescaled at least N - 2 times.
the error. With this interpretation, as M becomes large, Thus, one would expect these results to be lower relative
we have
to the theoretical upper bound than the case depicted in
rms (error) Fig. 2. From Fig. 3 one can see that this is in fact the case.
---- "" (JJ -
2.5) 1/2(.3)2-B. (30) The rms of the original array yKj2 is, in this case, ap
rms (result)
proximately 0.35. This is the reason the upper bound
The lower bound increases as MI/2=! log2N. This is curve is higher than that of Fig. 2.
the rate of increase which has been observed for the Fig. 4 contains the results of transforming random
floating-point calculation [5], [6]. numbers from minus one to one (in both real and imag
inary parts). In this case, the dc component is zero and
D. Some Experimental Results
there is no other strong component. The number of shifts
An IBM 7094 program was written to perform a fixed should be approximately (log2N)j2 or one-half shift per
point calculation using the fast Fourier transform al stage. Hence, one would expect the error curve to lie well
gorithm, as described above. The program was capable below the theoretical upper bound, as is the case. In this
of simulating a fixed-point machine of any word size up case, yK!i 0.58. =
to 35 bits plus a sign. Experiments were run with fixed Fig.5 contains the results of an experiment identical
point numbers of 17 bits plus a sign. This corresponds to to that used for Fig. 3, except that the random numbers
B= 17 in the analysis of Section IV-B and C. are between ±!. The results are as expected. In this case,
We will now describe some experimental results. In yk/2-0.35.
these experiments we did not consider the propagation of Finally, Fig. 6 contains the results of transforming a
the error present in the original sequence. Thus we con sine wave in the real part and zero in the imaginary part.
sidered the case where 152= O. The experiments were per The sine wave was sin (27fj/8). Although in this case the
formed as follows. Floating-point input was fixed to 17 array must be rescaled in at least N-2 times, the error is
bits plus a sign. This fixed input was then transformed well below the upper bound. Here, yKj2= 0.5.
with the fixed-point program. The fixed-point output was In all these calculations the bias, as reflected by the
then floated. Next, the fixed-point 17-bit input was average error, was negligible compared with the rms
floated and a floating-point transform taken. Since this error. Furthermore, the maximum error was of the same
floating-point transform uses a floating-point word with order of magnitude as the rms error and hence the error
a 27-bit mantissa, it was considered the correct answer. was not due to the effect of a few, highly inaccurate terms.
WELCH: FIXED-POINT FFT ERROR ANALYSIS
125
2
O- L
/A
'" b 1 J
THE RETIC L up ER
BOUND ..0' ./
y..
�/ � �
.. l.-V
v
0
0
0 ..,.4 JR/
Af ntBlRETICAL UPPER
r'
0
BOUND
... � .-
""-0
./'
IV ../
.. /0
0
0
0 0
0
2 2
I
ICT 10 20 40 eo 100 200 400 1000 2000 4000 20 40 60 100 200 400 1000 2000 4000
N- N-
Fig. 2. Experimental error results: random numbers between 0 ond Fig. 3. Experimental error resulls: random numbers plus 3 sine
I; 8=17. waves; O<random numbers< ; l 8= 17.
IQ'3
6 L
5 II
�Ll
BOUND /"
......
L
/,...- ./
5 4 THEORETICAL UPPER ,/" i1l 4
::> w
en BOUND ./' !S
/'
V
ll! 2 � 2
u; V Q:
"- V
:::IE / /
� IO'4 8 �IO"'
Q: ..
,/" ......
..
� /V 8> b
!!!
6
&.
&. � II -,,-
..
./' :::IE
4 Q: 4
�
Q:
0
..0 ""
0
0
&
0
2 0
2
IO�
O 20 40 60 100 200 400 1000 2000 4000 20 40 60 100 200 400 1000 2000 4000
N- N-
Fig. 4. Experimental error results: random numbers between -1 Fig. S. Experimental error results: random numbers plus 3 sine
and I ; 8= 17. waves; -I <random numbers<l ; 8= 17.
Fig. 6. Experimental error results: single sine wave; 8= 17.
II �w
BOUND "/
./
5 4
i1l V......
ll! 2 /'
�
./
V
ilO"4
Q:
w
II
./'
./
",
0
0
�Q: 4
0
2
20 40 60 100 200 400 1000 2000 4000

N-
126
E. Conclusions and Additional Comments Acknowledgment
The upper bound obtained in Section IV-B is of the The author would like to thank R. Ascher for assistance
form in programming the fixed-point calculations. He would
rms (error) 2(MH)/lZ-BC also like to thank the referee for a number of corrections
---- < (31) and valuable suggestions.
rms (result) - rms (initial sequence)
where C=0.3. On the basis of the experimental results we
would recommend a bound with C=O.4. Reference,
We also carried through the analysis for a sign mag [I] J. W. Cooley and J. W. Tukey, "An algorithm for machine cal
nitude machine with truncation rather than rounding. In culation of complex Fouricr series," Mal". Comp., vol. 19, pp.
this case, the analytical upper bound was of the form 297-301, April 1965.
[2] J. W. Cooley, "Finite complex Fourier transform," SHARE
given by (31) but with C=O.4. However, the experimental Program Library: PK FORT, October 6, 1966.
results were again higher and we would recommend a [3] R. R. Shively, "A digital processor to genl!rate spectra in real
bound with C=0.6. The case of a twos-complement ma time," lSI AIIII. IEEE Compuler COIlf., Digesl (If Papers, pp.
21-24_ 1967_
chine with truncation was not analyzed as analysis became [4] "Experimental signal processing system," IBM Corp., 3rd
exceedingly complex. However, experimental results in Quart. Tech. Rep!., under contract with the DirfCtorate of
dicated a bound of the form given by (3 1) with C=O.9. Planning and Technology, Ek'Ctronic Systems Div., AFSC,
USAF, Hanscom Field, Bedford, Mass., Contract FI9628-67-
It should be pointed out that if we are taking the tran� C-OI98.
form to estimate spectra then we will be either averagin g [5] J. W. Cooley, P. A. W. Lewis, and P. D. Welch, "The fast
over frequency in a single periodogram or over time in a
Fourier transform algorithm and its applications," IBM Corp.,
Res. Rep!. RC 1743, February 9, 1 967 .
sequence of periodograms and this averaging will decrease [6] W. M. Gentleman and G. Sande, "Fast Fourier transforms for
the error discussed here as well as the usual statistical fun and profit," 1966 Fall Joilll C()mpUler COIlf., AFlPS Proc.,
error. Finally, if we are taking a transform and then its vol. 29. Washington, D.C.: Spartan, 1966, pp. 563-578.
[7] A. V. Oppenheim and C. Weinstcin, "A bound on the output
inverse, Oppenheim and Weinstein have shown [7] that of a circular convolution with application to digital filtering,"
the errors in the two transforms are not independent. this issue, pp. 120-124.
WELCH: FrXED-POrNT FFT ERROR ANALYSIS
127
To appear in the IEEE Transactions on Audio and Electroacoustics,
Vol. AU-17, No. 3, September 1969.
ROUNDOFF NOISE IN FLOATINfJ POINT FAST FOURIER TRANSFORM COMPUTATION*

by
Clifford J Weinstein
•.
, incoln Laboratory, Massachusetts Institute of Technology
ABSTRACT
A statistical model for roundoff errors is used to predict

output noisc-to-signal ratio when a fast Fourier transform is
computed using floating point arithmetic. The result, derived for
the case of white input signal, is that the ratio of mean-squared
output noise to mean-squared output signal varies essentially as
1'= log N, where N is the number of points transformed.
2
This predicted result is Significantly lower than bounds previously
derived on mean-squared output noise-to-signal ratio, which are
proportional to 1'2. The predictions are verified experimentally,
with excellent agreement. The model applies to rounded arith
metic, and it is found experimentally that if one truncates, rather
than rounds, the results of floating point additions and multiplica
tions, the output noise increases significantly (for a given v).
Also, for truncation, a greater than linear increase with v of the
output noise-to-signal ratio is observed.
*This work was sponsored by the U. S. Air Force.
128
Introduction
Recently, there has been a great deal of interest in the Fast Fourier transform (FFT)
1
algorithm and its application . Of obvious practical importance is the issue of what accuracy
is to be expected when the FFT is implemented on a finite-word-Iength computer. This note
studies the effect of roundoff errors when the FFT is implemented using floating point arith
2
metic. Rather than deriving an upper bound on the roundoff noise, as Gentleman and Sande
have done, astatistical model for roundoff errors is used to predict the output noise variance.
The statistical approach is similar to one used previously 4 to predict output noise variance
3,
in digital filters impkmt-'ntc'.l via difference: equations. The predictions are tested experimen
tally, with excellent agreement.
IJ
The FFT Algorithm for N = 2
The discrete Fourier transform (OFT) of the complex N point sequence x(n) is
defined as
N-I
2::
-nk
X(k) x(n) W k = 0, 1, . . . , N-I (1)
n=o
.
where W = eJ 211"IN • For large N, the FFT offers considerable time savings over direct
computation of (1). We restrict attention to radix 2 FFT algorithms; thus we consider
N 2 v, where IJ = log N is an integer. Here the OFT is computed in IJ stages. At each
2
=
stage, the algorithm passes through the entire array of N complex numbers, two at a time,
th
generating a new N number array. The v computed array contains the desired DFT. The
th
basic numerical computation operates on a pair of numbers in the m array, to generate
st
a pair of numbers in the (m+ I) array. This computation, referred to as a "butterfly",
is defined by
(2a)
X (i)-W X (j) (2b)

m m
th
Here X (i), X (j) represent a pair of numbers in the m array, and W is some ap
m m
propriate integer power of W, that is
j 211" piN
W= wP = e
At each stage, N/2 separate butterfly computations like (2) are carried out to produce the
next array. The integer p varies with i, j, and m in a complicated way, which depends on
the specific form of the FFT algorithm which is used. Fortunately, the specific way in
129
which p varil's is not important for our analysis. Also, the spl!cific relationship between
th
i, j, and m. which liL'terminl'H how we inlil!x through the marray, is not important for
our analysis. Our derive d results will be valid for both dl!cimation in time and decimation
1
in frequency FFT algorithms , except in the sl!ction entitled "modified output noise
analysis", whl!re we specialize to the decimation in time case.
Propagation of Signals and Noise in the FFT
In the error analysis to be presente d, we will need some results governing the propa
gation of signals and noise in the FFT. These results, speCialized to correspond to the
statistical model of signal and roundoff noise which we will use, are given in this section.
We assume a simple statistical model for the signal being processed by the FFT.
Specifically, we consider the case where the signal Xm (i) present at the mth array is
white, in the sense that all 2N real random variables composing the array of N complex
numbers are mutually uncorrelated, with zero means, and equal variances. More formally,
we specify for i= 0, 1••• ••N-I that
2 2
£ ((Re Xm(i») ) = t1 (Im X m(i)]2) = t tIl Xm(i)\ ) = const. = � C1X2
m
( c.= expected value of ).

th
Given this model for the statistics of the m array, we can use (2), and the fact that
..... ..... 2 + m ..... 2 = 1 •

I wi 2 = [Re W] (l W] (4)
st
to deduce the statistics of the (m + 1) array. First, the signal at the (m + 1)st array is
also white; that is, equations (3) all remain valid if we replace m by m+l. In verifying
this fact, it is helpful to write out (2) in terms of real and imaginary parts. Secondly,
the expected value of the squared magnitude of the signal at the (m+ 1)st array is just
th
double that at the m array, or
i=O,l •• • • ,N-l. (5)
130
th
This rehltionship between the statistics at the m and (m+ l)st array allows us
to deduce two additional results, which will be useful below. First, if the initial signal
th
array X (i) is white, then the m array X (i) is also white, and
o m
i = 0,1, • • • ,N-l (6)
st
Finally, let us assume that we add, to the signal present at the (m+ l) array, a signal
independent. white noi se sequence E (i) (which might be produced by roundoff errors)
m
th
having properties as described in (3). This noise sequence will propagate to the lI , or
-
th
output, array, independently of the signal, producing at the lI array white noise with
variance
(7)
Butterfly Error An alysis
Tn begin our FFT error analysiS, we first analyze how roundoff noise is generated
in, and propagated through, the basic butterfly computation. For reference, let the
variables in (2) represent the results of perfectly accurate computation. Actually, however,
th th
roundoffs through the m stage will cause the m array to consist of the inaccurate
results
A
X (i) X m (i) + E (i) i = 0, I, • • • , N-l • (8a)
m m
and these previous errors, together with the roundoff errors incurred in computing (2),
will cause us to obtain
(8b)
To analyze roundoff errors, we first express (2) in terms of real arithmetic.

Thus (2a) becomes
(9a)
(9b)
and a similar pair of equations results for (2b). Let fl ( .) represent the result of a
S
floating point computation. We can write that
131
fl (x+y) = (x+y) (1+ E) , (10)
t
with l EI � 2- , where t is the number of bits retained in the mantissa. Also
n (xy) = xy (1+ E) , (11)
-t
with again IE I � 2 . Thus, one could represent the actual floating point computation
corresponding to (9) by the flow graphs of Fig. I, or by the equations
(12a)
(12b)
Now we subtract (9) from (12) to obtain (using (8» an equation governing noise generation
and propagation in the bu tterfly. Neglecting terms of second or higher order in the E and
i
E ' we obtain
m
(13)
where
U (i) = Re[X (i)] (E4) + Re VI Re [X (j)] ( E + E + E4) - 1m W 1m [X (j)] (E2+ E3 +E 4)
m m m 1 3 m
Equations similar to (1 3) and (14) can he:' derived for E (j). Equation (13) is the basic equation
m+ 1
governing noise generation and propagation in the FFT. Comparing (1 3) and (:la), we see
th
that the noise E already present at the m array propagates through the butterfly, as if it
m
were signal. to the next array. But also we have additional roundoff noise. represented by
st th
U , introduced due to errors in computing the (m+ l) array from the m array. Note that
m
the noise source U (i) is a function of the signal. as well as of the roundoff variables E •
m i
This signal dependence of roundoff errors is inherent in the floating point computation. and
requires that we assume a statistical model for the signal, as well as for the E·., in order to
1
132
obtain statistical predictions for the noise. We should note that the validity of the neglect
of second order error terms in obtaining (13) and (14) needs to be verified experimentally.
Now we introduce a statistical mOdel for the roundoff variables E., and for the
1
signal, which will allow us to derive the statistics of U and eventually predict output
m
noise variance. We assume that the random variables E. are uncorrelated with each other
1
and with the Signal, and have zero mean and e� ual variances, which we call cr� • We also
assume, for simplicity of analysis, that the signal x(n) to be transformed is white, in
the sense described above (see (3». Thus, we have that all 2N real random variables
(in the N point complex sequence) are mutually uncorrelated, with zero means, and equal
variances, which we call � a; so that
(15)
Given these assumptions, one can derive that
(16)
In obtaining (16), one must take note of (4), and of the fact (see discussion preceding Eq. (6»
th
that the whiteness assumed for the initial Signal array X (n) implies whiteness for the m
o
m m m
�
array, so that Re X (i), Im X (i), Re X (j), and Im X (j) are mutU IlY uncorrelated,
m
t
with equal variance. One can use (6) to express the variance at the m array in terms of
the initial signal variance as
(17)
st
so that the variance of each noise source introduced in computing the (m + I) array from
th
m array becomes
(18)
The argument leading to (18) implies that all the noise sources U (i) in a particular
m
array have equal variance. A slight refinement of this argument would include the fact
...., ....,
that a reduced noise variance is introduced when W= 1 or W = j, but this refinement is

neglected for the moment. As indicated in (18), the noise variance depends on the signal
variance. However, due to the fact that the roundoff variables E are signal independent,
i
the noise samples U (i) are uncorrelated with the signal. Thus we can assume, in deriving
m
output noise, that the roundoff noise propagates independently of the signal.
133
Output Noise Variance for FFT
In this section, our basic result for output noise-to-signal ratio in the FFT is derived.
- -
Because we arc assuming that all butterflies (including where W = 1 and W = j) are equally
noisy, the analysis is valid for both decim ation in time and decimation in frequency algorithms.
Later we will refine the model for the decimation in time case, to take into account the
- -
reduced butterfly noise variance intro�uced when W = 1 or W = j. But the quantitative change
in the results produced by this m odification is very slight.
Given the assumptions of independent roundoff errors and white signal, the variance
of the noise at an FFT output point can be obtained by adding the variances due to all the
(independent) noise sources introduced in the butterfly computations leading to that particular
output point.
Consider the contribution to the variance of the noise EII(i) at a particular point in the
th st
11 , or output array, from just the noise sources U (i} introduced in computing the (m+l)
m
st
array. These noise sources U (i) enter as additive noise of variance O'
m
at the (m+l) �
array, which (as implied by (13» propagates to the output array as if it Wl1re signal. One
can deduce (see (7» that the resulting output noise variance is*
2 11-m-1 2
[t I EII(i) \ ]m =
2 O' i = 0, 1, , N-1 • (19)
u
• • •
or using (18),
[ fJ EII(i) \
2
] 2
11+1
O'
� O' � (20)
m
th
(20) states that the output noise variance, due to the m array of noise sources, does not
depend on m. This results from the opposing effects (18), and (19). By (18), the noise source
2 m
variance 0' increases as 2 , as we go from stage to stage; this is due to the increase in
u
m
signal variance, and the fact that the variance of floating point roundoff errors is proportional
to Signal variance. But (19) states that the amplification which O' � goes through in propagating
m
*Note that it is not quite true that the noise sequence U (1), which is added to the signal
at the (m+l)St array, is white, for in computing the m two outputs of a butterfly. the
same multiplications are carried out. and thus the same roundoff errors are committed.
Thus. the pair of noise sources U (1), u 0> associated with each particular butterfly.
will be correlated. However, all tie noisH'sCX1rces U (i) which affect a particular CXltput
point. are uncorrelated. since (as one could verify fran an FFT flow-gra�) noise
sources introduced at the top and bottom CXltputs of the butterfly never affect the same
point in the output array.
134
-m
to the output has a 2 dependence, that is the later a noise source is introduced, the
less gain it will go through.
To obtain the output noise variance, we sum (20) over m to include the variance
due to the computation of each array. Since II arrays are computed, we obtain
(21)
We C2n recast (21) in terms of output noise-to-signal ratio, noting that (6) implies that
(22)
so that
(23)
Note the linear dependence on 11= log N in the expression (23) for expected output
ll
mean-squared noise-to-signal ratio. For comparison, the bounding argument of
2
Gentleman and Sande led to a bound on output mean-squared noise-tO-signal ratio which
2
increased as 11 rather than as II. (Actually, they obtained a bound on rms noise-to-signal
ratio, which increased as II). Certainly, the fact that the bound on output signal-to-noise
ratio is much higher than its expected value, is not surprising: since in obtaining a bound
one must assume that all roundoff errors take on their maximum possible values, and
add up in the worst possible way.
To express (23) quantitatively in terms of the register length used in the computa
tion, we need an expression for C1 � • Recall that C1 � characterizes the error due to
3
rounding a floating point multiplication or addition (see (5) and (6». Rather than assume
-t -t
that E is uniformly distributed in (_2 , 2 ) with variance = 2
-2t
cr� �
, C1 was �
measured experimentally, and it was found that
C1 � = (.21)2
-2t
(24)
matched more closely the experimental results. Actually, C1 � for an addition was found
to be slightly different from that for a multiplication, and C1 � for multiplication was
found to vary slightly as the constant coefficient (Re Vi or Im W) in the multiplication was
changed. (24) represents essentially an empirical average C1� for all the multiplications
and additions used in computing the FFT of white noise inputs.
135
(23) and (24) summarize explicitly our predictions thus far for output noise-to
signal ratio. In the next section. the argument leading to (23) is refined to include the
-
reduced butterfly error variance intrOduced when W = 1 or W = j. We should remark

again that the modification is slight. and that the essential argument and essential
character of the results has been already given in this. section.
Modified Output Noise Analysis

As mentioned above. we have so far not considered in our analysis the reduced
- --
error variance introduced by butterfly computations involving W = 1 or W = j. To take

these cases into account. we first need an equation corresponding to (16) for the butter
fly error variance when W = 1 or W j. Observe that for W 1. we have in Fig. 1 (or
= =
7
Eq. ( » that EJ= E = E�= t:S= E6= (7= O. since multiplication by lor O. or adding a number
2
to 0, is accvn.plishcd noiselessly. ThUd (14) becomes
and (16) becomes (14)'
(16)'
so that when W = 1. the butterfly error variance is half the variance introduced when
W I: 1 and VI I: j. One can easily verify that the variance in (16)' is valid for W = j.
also.
st
Now. not all the noise sources introduced in computing the (m+ 1 ) array from
th
the m array will have equal variance. However, if F (m) represents the fraction of
the m
th
array of butterflies which involve either = 1 or W W
= j. then one can express
the average noise variance for all butterflies used in this array of computations as
0'u2 = [l-F (m)] O' 2 + F (m) O' 2 ' (25)

u u
m,ave m m
= [1-F (m)] 0'u2 + F (m) 0'u2 /2

m m
= [l-F (m)/2 ] O' �

m
136
The dependence of F(m) on m depends on the form of the FFT algorithm which is
used. We will consider the case of a decimation in time algorithm. For this case, only
W = 1 is used in the first array of computations, so F(O) = 1. Only W=1 and W = j are
used in computing array 2 from array 1. so F(I) = 1. In computing the array 3 from
array 2, half the butterflies involve W = 1 or W = j. in the next array � of the butter
flies involve \V.:: 1 or W = j, and so on. Summarizing, we have
F(m) � 1 mmi =
0 (26)
l (�) -
=
m = 1, 2 . • . , II-I
and combining (25) and (26) we obtain
O' m =O
� �m
_ m
[I (�) ]
0-2 m= 1, 2, • • . , II-I, (27)
u
rn
where 0'u2 is given in (18).

rn
To derive our modified expression for output noise-to-signal ratio, we carry

through the argument corresponding to Eqs. (19) through (23), but replace O' in (19) �
m
O'�
?
by • Two observations will be made before stating the result. First, the right
m,ave
hand side of (20) will now depend on m, and must actually be summed over m to obtain
the equation corresponding to (21). Secondly, the fact that, in general, not all butter
flies in a given array introduce the same roundoff noise variance, implies that there
will be a slight variation of noise variance over the output array. Our result, which is
thus to be interpreted as an average (over the output array) ratio of noise variance to
signal variance, is
II-I
20' 2E [ II '2'
3
- + (�) ] (28)
As II becomes moderately large (say II � 6). one sees that (28) and (23) predict essen
tially the same linear rate of increase of O' /O' � � with II.
137
One further result. which can be derived using our model, is an expression for
the final expected output noise-to-signal ratio which results after performing an FFT
and an inverse FFT on a white signal x(n). The inverse FFT introduces just as much
roundoff noise as the FFT itself, and thus one can convince oneself that the resulting
output noisc-to-signal ratio is
2 2 2 3 ,,-1
cr E/cr x =
4 cr E [" -"2' + (t) ] (29)
or just double the result in (28).
FFT versus direct Fourier Transform accuracy

An issue of some interest is the question of how the accuracy obtained in an FFT
computation compares with that obtained in computing a OFT by "straightforward" means,
2
i. e. direct summing of products as implied by (1). Gentleman and Sande treated this
problem by obtaining bounds on the mean-squared error for both methods, and the bound
they obtain on the error in computing a OFT directly, increases much faster with N than
their corresponding bound for the FFT. fu deriving their bound for the direct Fourier
transform case, they assume that the summing of products required by (1) is done in a
st
cumulative way, that is the (n+ l) product is added to the sum of the first n products,
and so on. However a slightly different; more error free, technique could be imagined
for summing the products. Suppose instead, one were to sum (1) in a treelike fashion,
that is the N products were summed in pairs, the N/2 results summed in pairs, and so
on. Then the bound one would derive on outPl,lt mean-squared noise-to-signal ratio would
have the same dependence on N as the corresponding bound for the FFT (namely, a linear
2 2
dependence on ,, =(log N ) ). The treelike summation technique thus makes the accuracy
2
for the direct Fourier transform essentially the same as for the FFT.
This argument carries over for the statistical approach to roundoff error analysis
which has been used here. That is, a statistical analysis of errors incurred in implement
ing (1) using the treelike summation technique, predicts a linear dependence of the output
noise-to-signal ratio on ", similar to (23). The treelike summation requires some extra
memory to store partial sums, but for large N (when memory becomes more important)
this issue is academic, since actually the FFT would be used to perform the computation.
Experimental Verification
The results of the above analysis of FFT roundoff noise, as summarized in (28), (29).
and (24), have been verified experimentally with excellent agreement. To check (28). a
white noise sequence (composed of uniformly distributed random variables) was generated
and transformed twice, once using rounded arithmetic with a short (e. g. 12 bit) mantissa,
138
and once using a much longer (27 bit) mantissa. A decimation in time FFT algorithm
was used. The results were su btract ed, squared, and averaged to estimate the noise
11
variance. For each N = 2 , this process was repeated for several white noise inputs to
obtain a stable estimate of roundoff noise variance. The results, as a function of II, are
represented by the small circles on Fig. 2, which also displays the theoretical curve of (28).
To check (29), white noise sequences were put through an FF T and inverse, and the
mean-squared difference between the initial and final sequences was taken. The results
of this experiment (divided by a factor of 2 since (29) is twice (28»are also plotted on
Fig. 2.
To clarify the experimental procedure used, we should define carefully the conven
tion used to round the results of floating point additions and multiplications. The results
were rounded to the closest (t-bit mantissa) machine number, and if a result (say of an
addition) lay midway between two machine numbers, a random choice was made as to
whether to round up or down. If one, for example, merely truncates the results to t bits,
the experimental noise-to-signal ratios have been observed to be significantly higher than
in Fig. 2, and to increase more than linearly with II. Sample results (to be compared with
Fig. 2) of performing the first of the experiments described above, using truncation
rather than rounding, are as follows: for II = 7, 8, 9,10, and 11 ,
O' � /2-2 t O'� =

40,63.5, 80, 1 01, and 128, respectively.
For II =
11, for example, this represents an increase by a factor of 32 over the result
obtained using rounding. This increased output noise can be partially explained by the
fact that truncation introduces a correlation between signal and noise, in that the sign of
the truncation error depends on the sign of the Signal being truncated.
Some experimental investigation has been carried out as to whether the prediction
of (28) and (29) are anywhere near valid when the signal is non-white. Specifically, sinu
soidal signals of several frequencies were put through the experiment corresponding
to (28), for II = 8 , 9, 10 and, 1 1. The results, averaged over the input frequencies used,
were within 15% of those predicted by (28).
Comment on Register Length Considerations
A linear scale is chosen for the vertical axis of Fig. 2, in order to display the es
sentially linear dependence of output noise-to-signal ratio on log N. To evaluate how II =
2
many bits of noise are actually represented by the curve of Fig. 2; or equivalently by
Eq. (28). one can use the expression
139
(30)
to represent the number of bits by which the rms noise-to-slgnal ratio increases in
passing through a floating point FFT. For example, for" = 8, this represents 1.89 bits,
and for" = 11,2.12 bits. Once can use (30) to decide on a suitable register length for
performing the computation.
According to (30), the number of bits of rms noise-to-signal ratio increases essen
tially as log (log N). so that doubling the number of points in the FFT produces a very
2 2
mUd increase in output noise, Significantly less than the. bit per stage increase pre
6
dicted and observed by Welch for fixed point computation. Tn fact, to obtain a. bit
increase in the result (30). one would essentially have to double" = log N, or square N.
2
Summary and Discussion
A statistical model has been used to predict output noise-to-signal ratio in a floating
point FFT computation, and the result has been verified experimentally. The essential
result is (see(23) and (24»
-2t
= (. 21) 2 " . (31)
that is the ratio of output noise variance to output signal variance is proporti(lnal to
" = log2 N; actually a slightly modified result was used for comparison with
experiment.
Tn order to carry out the analysis, it was necessary to assume very simple (i. e.
white) statistics for the signal. A question of importance is whether our result gives
reasonable prediction of output noise when the signal is not white. A few experiments
with sinusoidal signals seem to indicate that it does, but further work along these lines
would be useful.
It was found that the analysis, and in particular the linear dependence on" in (31),
checked closely with experiment only when rounded arithmetic was used. Some results
2 2
for truncated arithmetic, showing the greater than linear increase of aElaX with II,
have been given. In rounding, it was found to be important that a random choice be
made as to whether to round up or down, when an unrounded result lay equally between
two machine numbers. When. for example, results were simply rounded up in this mid
way Situation, a greater than linear increase of a /a � i with II, was observed. Such a
rounding procf'dure, it seems, introduces enough correlation between roundoff noise an:1
signal. to make the experimental results deviate noticeably from the predictions of our
model. which assumed signal and noise to be uncorrelated.
140
ACKNOWLEDGEMENT
Discussions with Professor Alan V. Oppenheim

of M. 1. T. contributed significantly to this
work.
14 1
REFERENCES
1. W. T. Cochran, et.al., "What is the fast Fourier transform, " Proc. IEEE,
vol. 55. pp. 1664-1674, October. 1967.
2. W. M. Gentleman and G. Sande, "Fast Fourier transforms - for fun and

profit, " Proceedings Fall Joint Computer Conference, pp 563 578 1966.
.
-
,
3 . T. Kaneko and B. Liu, "Round-off error of floating-point digital filters, ..

presented at the Sixth Annual Allerton Conference on Circuit and System
Theory, October 2-4, 1968. To be published in the Proceedings of the
Conference.
4. C. Weinstein and A. V. Oppenheim, "A Comparison of roundoff noise in

floating point and fixed point digital filter realizations, " submitted to
Proceedings IEEE (letters).
5. J. H.Wilkinson, Rounding Errors in Algebraic Processes, Englewood Cliffs.

N.J.: Prentice-Hall, 1963
6. P. O . Welch, "Fixed point FFT execution," notes distributed at IEEE Arden

House workshop on the FFT. October, 1968.
142
FIGURE CAPTIONS
Fig. 1. Flow graphs for noisy butterfly computation.
Fig. 2. Theoretical and experimental output noise-to-signal ratios

for floating point FFT computations.
143
144
5
- - THEORETICAL
C\J
1(\1
-
o EXPERIMENTAL . FFT
0
VI
-
o EXPERIMENTAL. FFT AND INVERSE
c: (result + 2)
34
w
u
Z
<l:
c:::
g
...J
<l:
z 3
(!)
CJ)
I-
:::>
0...
I-
:::>
0
,2
w
u
Z
<l:
c:::
�
w
CJ)
0
Z
I-
:::>
0...
I-
:::>
0
12
II = log N
2
Fig. 2.
145
Published in Ma�he�ati� of Comp:l�t�tion, Vol. 19, April 1965,
pp. 297-301.
An Algorithm for the Machine Calculation of

Complex Fourier Series
By James W. Cooley and John W. Tukey
An efficient method for the calculation of the interactions of a 2'" factorial ex
periment was introduced by Yates and is wid ely known by his name. The generaliza
tion to 3m was given by Box et al. [1].Good [2} generalized these methods and gave
elegant algorithms for which one class of applie ations is the calculation of Fourier
series. In their full generality, Good's methods are applic able to certain problems in
which one must multiply an N-vector by an N X N matrix which can be factored
into 1n spars e matrices, where 111, is proportional to log N. This results in a procedure
2
requiring a number of operations proportio nal to N log N rather than N These •
methods are applied here to the calculation of complex Fourier series. They are
useful in situations where the number of data points is, or can be chosen to be, a
highly composite number. The algorithm is here d erived and present ed in a rather
d ifferent form. Attention is given to the choice of N. It is also shown how special
advantage can be obtained in the use of a binary computer ,....ith N = 2"' and how
the entire cal c u latio n can be performed within the array of N data storage locations
used for the given Fourier coefIicients.
Consider the problem of calc u lati ng the complex Fourier series
N-I
(1) X(j) = L A(k)·Wi\ j = 0, 1, ... , N - 1,

k=O
where the giv en Fourier coefficients A (Ie) are complex and Tr is the principal
Nth root of unity,
(2)
2
A straightforward calculation usi n g (I) would require N operations where "opera
tion" means, as it will throughout this note, a complex multiplication followed by a
complex addition.
The al gor ithm described h e re iterates on the array of given complex Fourier
a mplitu des and yields the result in less than 2N Iog2 N operations without r e quiring
more data storage than is req uired for the given array A. To derive the algorithm,
suppose N is a composi te, i.e., N = rl· r2 • Theil let the i ndices in (1) be expressed
j = jlrl + jo , jo = 0, 1, . . . , rl - 1, j l = 0, 1, . . . , 1"2 - 1,
(3)
k = k1r2 + ko , ko = 0, 1, . . . , 1"2 - 1, kl = 0, 1, . . , 1·1
. - 1.
Then, one can write
(4)
Received August 17, 1964. Hesearch in part at Princeton University under the sponsorship
of the Army Research Office (Durham). The authors wish to thank Hichard Garwin for his
essential role in communication and encouragement.
Since
(5)
the inner sum, over kl ,depends only on jo and ko and can be defined as a new array,
(6) Al(jo ,ko) = L A(kl, ko) ' TV;ok1'2.
kl
The result can then be written

(7)
There are N elements in the array Al , each requiring 1'1 operations, giving a total
of N1'l operations to obtain AI. Similarly, it takes N1'2 operations to calculate X
from Al. Therefore, this two-step algorithm, given by (6) and (7), requires a total
of
(8)
operations.
It is easy to see how successive applications of the above procedure, starting with
its application to (6), give an m-step algorithm requiring
(9) T = N(1'l + 1'2 + ... + 1'm)
operations, where
(10) N = 1'1' 1'2 • • • 1'm •
If 1'j = s;t; wi th Sj ,tj > 1, then S; + tj < 1'j unless S; = tj = 2, when S; + t; = 1';.
In general, then, using as many factors as possible provides a minimum to (9) , but
factors of 2 can be combined in pairs without loss. If we are able to choose N to be
highly composite, we may make very real gains. If all 1'j are equal to 1', then, from
(10) we have
(11) m = 10g,N
and the total number of operations is
(12) T(1') = 1'N log, N.
T
= m·r + n·s + p·t +
(13) N
log2 N = m·log2 l' + n·log2 s + p·log2 t +
so t.hat
T
is a weighted mean of the quantities

l' S
' '
log2 1" log2 S log2 t
147
whose values run as follows
r
l' 10g2 r
2 2.00
3 1.88
-1 2.00
t.J 2.15
(i 2.31
7 2.49
8 2.(i7
9 2.82
10 3.01.
The usc of 1'j = :3 is formally most efficient, but the gain is only about 6% over
the use of 2 or 4, which have other advantages. If necessary, the usc of 1'j up to 10
can incrcase the number of computations by no more than 50% . Accordingly, we
can find " high ly composite" values of N within a few percent of any given large
JlUl\lber.
Whene\"cr possible, the use of N = )'m with r = 2 or -:1 offers important advantages
for computers with binary arithmetic, both in addressing and in multiplication
eeonomy.
The algorithm with l' = 2 is derived by exprcssing the indices in the form
"".-1 +
" "+ .
J
"
= . +
Jm-I·;t.
•
. . Jl ';:' Jo,
(14)
whcrcjr and I.... :1I'C ('qual to 0 or 1 and arc the contents of the respective bit positions
in the binary represcntation of.i and /;. All arrays will now be written as functions
of the bits of their indices. With this convention (I) is written
(15) X( jm-l, ... , jo) = L L ... L .1(/':",-1 ,

ko kl k .. _1
where the sums are over kv = 0, I. Since

(16)
the innermost sum of (15), over !'-m-l , depends only on jo , !.-m-2, ... , /':0 and can
be written
(17) 1.-0) = L AUm
L·",.I
Proceeding to the next innermost sum, over "",-2 • and so on, and using
(18)
one obtains successive arrays,
A I(jO , . ,jl_1 , k"._I_1 , ... , /':0)

• .
( 10) '"' . k 1"r(h_t.21-1+

. , k'0)
+jo).k"._1·2".-1
= L... j1 I-I ( )0,
. ... ,J 1-2, 'm-I,
•• •
. , • r
km_l
for l = 1, 2, . . . , m.
148
Writing out the sum this appears as
A,(jo, ... ,j'-l, km-1-1, • • • , leo)
= A,-1(jO, ... ,j,-2 ,0, km-1-1, • . . , leo)
(20) + ;'-1 ·iI-2A . .
(- 1) t l-1 (Jo, ... ,JI-2, 1, k"m-l-l, . . . , k0 )
. lVCh -8 .2'-3+...+iol .2m-1 j,-l 0, 1.
, =
According to the indexing convention, this is stored in a location whose index is
(21) . 2m-1 +
JO· • • •
+
J1-1 • 2
0 ",-1 + k
Om_I_I· 2
m-1-1 +
• • •
+ 7
'CO.
It can be seen in (20) that only the two storage locations with indices having 0 and
-1
1 in the 2m bit position are involved in the computation. Parallel computation is
permitted since the operation described by (20) can be carried out with all values of
jo, ... ,j'-2, and leo, , lem-1-1 simultaneously. In some applications* it is con
• • •
venient to use (20) to express A, in terms of A 1-2 , giving what is equivalent to an

algorithm with r = 4.
The last array calculated gives the desired Fourier sums,
( 22)
in such an order that the index of an X must have its binary bits put in reverse
order to yield its index in the array Am .
In some applications, where Fourier sums are to be evaluated twice, the above
procedure could be programmed so that no bit-inversion is necessary. For example,
consider the solution of the difference equation,
(23) aX(j + 1) + bX(j) + cX(j - 1) = Jt'(j).
The present method could be first applied to calculate the Fourier amplitudes of
F(j) from the formula
(24)
The Fourier amplitudes of the solution are, th en,
B(k)
(25) A(k) = + +
aJVk b CTV-k·
The B(k) and A(k) arrays are in bit-inverted order, but with an obvious modifi
cation of (20), A(k) can be used to yield the solution with correct indexing.
A computer program for the IBM 7094 has been written which calculates three
dimensional Fourier sums by the above method. The computing time taken for com
puting three-dimensional 2" X 2b X 2� arrays of data points was as follows:
* A multiple-processing circuit using this algorithm.was designed by R. E. Miller and S.

Winograd of the IBM Watson Research Center. In this case r = 4 was found to be most practi
cal.
149
a b c No. Pts. Time (minutes)
4 4 3 2 11 . 02
11 0 0 2 11 . 02
4 4 4 212 .04
12 0 0 212 .07
3
5 4 4 21 .10
3
5 5 3 21 . 12
3
13 0 0 21 .13
IBl\1 Watson Research Center

Yorktown Heights, New York
Bell Telephone Laboratories,

Murray Hill, New Jersey
Princeton University
Princeton, New Jersey
1. G. E. P. Box, L. R. CONNOR, W. R. COUSINS, O. L. DAVIES (Ed.),F. R. HIRNSWORTH &

G. P. SILITTO, The Design and Analysis of Industrial Experiments, Oliver & Boyd, Edinburgh,
1954.
2. I. J. GOOD, "The interaction algorithm and practical Fourier series," J. Roy. Statist.
Soc. Ser. B., v. 20, 1958, p. 361-372; Addendum, v. 22,1960, p. 372-375. MR 21 61674; MR 23
6 A4231.
150
I. Introduction
In dealing with sampled data the z-transform plays the

role which is played by the Laplace transform in contin
uous time systems. One example of its application is
The Chirp z-Transform spectrum analysis.We shall see that the computation of
sampled z-transforms, which has been greatly facilitated
Algorithm by the fast Fourier transform (FFT) [1], [2] algorithm,
is still further facilitated by the chirp z-transform (CZT)
algorithm to be described in this paper.
L. R. RABINER, Mem ber, IEEE The z-transform of a sequence of numbers Xn is de
fined as
R. W. SCHAFER, Mem ber, IEEE
Bell Telephone Laboratories, Inc. 00
Murray Hill, N. J. X(z) = E xnz-", (1)

c. M. RADER, Member, IEEE
Lincoln Laboratoryl a function of the complex variable z. In general, both Xn
Massachusetts Institute of Technology and X(z) could be complex. It is assumed that the sum on
Lexington, Mass. the right side of (1) converges for at least some values of
z.We restrict ourselves to the z-transform of sequences
with only a finite number N of nonzero points.In this
Abstract
case, we can rewrite (I) without loss of generality as
A computational algorithm for numerically evaluating the z-transform 1\'-1
of a sequence of N samples is discussed. This algorithm has been X(z) = E XnZ-" (2)
named the chirp z-transform (CZT) algorithm. Using the CZT al ... 0
gorithm one can efficiently evaluate the z-transform at M points in the

where the sum in (2) converges for all z except z=O.
z-plane which lie on circular or spiral contours beginning at any arbi Equations (1) and (2) are like the defining expressions
trary point in the z-plane. The angular spacing of the points is an arbi for the Laplace transform of a train of equally spaced
trary constant, and M and N are arbitrary integers. impulses of magnitudes xn• Let the spacing of the impulses
The algorithm is based on the fact that the values of the z-trans be Tand let the train of impulses be J:xn6U-nn. Then
form on a circular or spiral contour can be expressed as a discrete the Laplace transform is Ex"e-·nT wh ich is the same as
•
convolution. Thus one can use well-known high-speed convolution X(z) if we let
techniques to evaluate the transform efficiently. For M and N moder
ately large, the computation time is roughly proportional to (N+M>
(3)
log2(N+M) as opposed to being proportional to N·M for direct If we ar e dea ling with sampled waveforms the relation
evaluation of the z-transform at M points. between t he original waveform and the train of impulses
is well understood in terms of the phenomenon of aliasing.
Thus the z-transform of the sequence of samples of a time
waveform is representative of the Laplace transform of
.
the original waveform in a way which is well understood.
The Laplace transform of a train of impulses repeats its
values taken in a horizontal strip of the s-plane of width
27r/T i n every other strip parallel to it. The z-transform
maps each such strip into the enti re z-plane, or conversely,
the entire z-plane corresponds to any horizontal strip of
the s-plane, e.g., the region - 00 <0-< 00, -7r/T5,.w<7r/T,
where s=o-+jw. In the same correspondence, thejw axis
of t he s-plane, along which we generally equate the
Laplace transform wit h the Fourier transform, is the unit
circle i n th e z- pl ane, and the origin of the s-p la ne cor
resp onds to z= 1. The interior of the z- plan e unit circle
corresponds to the left half of the s-plane, and the exterior
corresponds to the rig ht half plane. Straight lines in the
s-pl ane correspond to circles or spirals in the z-plane.
Fig. I shows the correspondence of a contour in the s
Manuscript received December 23, 1968; revised January 16, 1969.
pl an e to a c ont ou r in the z - plane . To evaluate the L apl ace
This is a condenscd version of a paper published in the Bell System transform of the impulse train along t he linear contour is
Technical JOllrnal, May 1969. to evaluate the z-transform of the sequence along the
I Operated with support from the U. S. Air Force. spiral contour.
IEEE TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS VOL. AU-l7, No.2 JUNE 1969
151
z-plane values of k merely repeat the same N values of Zk, which
are the Nth roots of unity. The discrete Fourier transform
has assumed considerable importance, partly because of
its nice properties, but mainly because since 1965 it has
become widely known that the computation of (6) can be
achieved, not in the N2 complex multiplications and addi
tions called for by direct application of (6), but in some
thing of the order of N log2N operations if N is a power
of two, or N 1;mi operations if the integers mi are the
prime factors of N. Any algorithm which accomplishes
this is called an FFT. Much of the importance of the FFT
is that OFT may be used as a stepping stone to computing
lagged products such as convolutions, autocorrelations,
(A)
and cross convolutions more rapidly than before [3], [4].
The OFT has, however, some limitations which can be
eliminated using the CZT algorithm which we will de
(B)
scribe. We shall investigate the computation of the z
transform on a more general contour, of the form
s-plane
1c. = 0, 1, . . . , M-1 (7a)
where M is an arbitrary integer and both A and W are
arbitrary complex numbers of the form
(7b)
and
(7e)
(See Fig. 2.) The case A = I, M=N, and W=exp( -j2."./N)
Fig. 1. The correspond ence of (A) a z-plone contour corresponds to the OFT. The general z-plane contour
to (B) on '-Dlone contour through the relation z = e·T • begins with the point z=A and, depending on the value
of W, spirals in or out with respect to the origin.If Wo= I,
the contour is an arc of a circle.The angular spacing of
Values of the z-transform are usually computed along the samples is 2.".c/l0. The equivalent s-plane contour begins
the path corresponding to the jw axis, namely the unit with the point
circle.This gives the discrete equivalent of the Fourier 1
transform and has many applications including the esti 80 = 0'0 + J'wo = - In A (8)
'1'
mation of spectra, filtering, interpolation, and correla
tion. The applications of computing z-transforms off the and the general point on the a-plane contour is
unit circle are fewer, but one is presented elsewhere [6],
1
namely the enhancement of spectral resonances in systems Bt Bo + k(� + j&") (In A - k In W),
T
= =
for which one has some foreknowledge of the locations (9)

of poles and zeroes. k = 0, 1, . . . ,M - 1.
Just as we can only compute (2) for a finite set of
samples, so we can only compute (2) at a finite number of Since A and Ware arbitrary complex numbers we see
points, say Zk. that the points Sk lie on an arbitrary straight line segment
of arbitrary length and sampling density. Clearly the
N-I
contour indicated in (7a) is not the most general contour
X. = X(Zk) = L XnZk-n• (.t)
n-O
but it is considerably more general than that for which
the OFT applies. In Fig. 2, an example of this more gen
The special case which has received the most attention eral contour is shown in both the z-plane and the s-plane.
is the set of points equally spaced around the unit circle, To compute the z-transform along this more general
zk=exp{j2.".k/N), k=O,l,···, N-l (5) contour would seem to require NM multiplications and
additions as the special symmetries of exp(j2.".k/ N) which
for which are exploited in the derivation of the FFT are absent in
N-I the more general case. However, we shall see that by using
X.=Lxnexp(-j2.".k/N), k=O,l,·· ·,N-I. (6) "
the sequence Wn /
n_O
to the computation of the z-transform along the contour
Equation (6) is called the discrete Fourier transform of (7a). Since for Wo= I, the sequence Wn"/ 2 is a complex
(OFT). The reader may easily verify that, in (5), other sinusoid of linearly increasing frequency, and since a
RABINER £'101.: CHIRP Z-TRANSFORM ALGORITHM
152
,.
...O,I •..•• N·'
Fig. 3. An iIIuslralion of Ihe steps involved in compuling values of Ihe

z-Iransform using Ihe CZT algorithm.
served.But, let us use the ingenious substitution, due to

Bluestein [5],
n2 + k2 - (k - n) 2
nk = ----..:..--...:.. (11)
2
(A)
for the exponent of W in (10). This produces an ap
(B) parently more unwieldly expression
s-plane
N-\
•. +... Xk = L x"A-"lVCn2f2)IVCk2'2)lV-Ck-n)2'2,
� �== ==:=::=:-- l'
(12)
-
: ---------
1
-
JL'�lT
,,_0
k=O,I,···,M-l;
II but, in fact, (12) can be thought of as a three-step process
1
:: '111- 16101
II
consisting of:
II .tt.I.
II I) forming a new sequence v,. by weighting the
acr·t·· .. II...
-..
x,.
� according to the equation
II
---------��---+-----+------��
� .,- x,.A -nlV,,2/2, n = 0, 1, . ,N
....t
y" = . . - 1; (13)
.. Ae
2) convolvin g y" with the sequence v,. defined as
Fig. 2. An iIIuslralian of Ihe Independent parameters of the CZT
algarilhm. (A) How the z-Iransform is evalualed on a spiral contour t'" = 1V-,.2'2 (14)
starling at Ihe point z=A. (B) The corresponding straighl line can·
tour and independent paramelers in the ,-plane_ to give a sequence gk,
N-l
similar waveform used in some radar systems has the gk = L Y"Vk_". k = 0, 1, ..., M - 1; {I 5)
picturesque name "chirp," we call the algorithm we are
about to present the chirp z-transform (CZT)_ Since the 3) multiplying gk by Wk',2 to give Xk,
CZT permits computing the z-transform on a more gen S
Xk = gklVk '2, k = 0, I, . . ., M - 1. (16)
eral tour than the FFT permits it is more flexible than
the FFT, although it is also considerably slower.The addi The three-step process is illustrated in Fig.3. Steps I
tional freedoms offered by the CZT include the following: and 3 require Nan d M multiplications, respectively,
I) The n umber of time samples does not have to equal and step 2 is a convolution which may be computed by
the number of samples of the z-transform. the high-speed technique disclosed by Stockham [3],
2) Neither M nor N need be a composite integer. based on the use of the FFT. Step 2 is the major part of
3) The angular spacing of the Zk is arbitrary. the computational effort and requires a time roughly pro
4) The contour need not be a circle but can spiral in or portional to (N+M) log (N+M).
out with respect to the origin. In addition, the point Zo Bluestein employed the substitution of (11) to convert
is arbitrary, but this is also the case with the FFT if the a OFT to a convolution as in Fig.3. The linear system to
samples Xn are multiplied by zo-" before transforming. which the convolution is equivalent can be called a chirp
filter which is, in fact, also sometimes used to resolve a
spectrum. Bluestein [5] showed that for N a perfect
II. Derivation of the CIT square, the chirp filter could be synthesized recursively
Along the contour of (7a), (4) becomes with VN multipliers and the computation of a OFT could
then be proportional to !{4/2.
N-\ The flexibility and speed of the CZT algorithm are
'Yk = L a'"A-nll'"\ k = 0, 1, .. ., M - 1 (10) related to the flexibility and speed of the method of high
,,_0
speed convolution using the FFT. The reader should re
which, at first appearance, seems to require NM complex call that the product of the OFT's of two sequences is
multiplications and additions, as we have already ob- the OFT of the circular convolution of the two sequences

and, therefore, a circular convolution is computable as
two OFT's, the multiplication of two arrays of complex
numbers, and an inverse discrete Fourier transform (A)
(10FT), which can also be computed by the FFT. Ordi I
•
nary convolutions can be computed as circular convolu
tions by appending zeroes to the end of one or both se
quences so that the correct numerical answers for the
ordinary convolution can result from a circular convolu
tion.
We shall now summarize the details of the CZT al (8)
gorithm on the assumption that an already existing FFT I

•
program (or special-purpose machine) is available to
compute OFT's and lOFT's.
�,
Begin with a waveform in the form of N samples Xn and
seek M samples of Xk where A and W have also been
(C)
chosen.
1) Choose L, the smallest integer greater than or equal L-I r
to N+M- I which is also compatible with our high

speed FFT program. For most users this will mean L is a
n1111ffirrnm,
power of two. Note that while many FFT programs will
work for arbitrary L, they are not equally efficient for (D)
all L. At the very least, L should be highly composite.
2/
2) Form an L point sequence Yn from Xn by n
Yn=
{A-nlrn 2xn n=O,1,2,'" ,N-l
�
(17)
° n=N, N+l, ... ,L-1.
(E)
3) Compute the L point OFT of Yn by the FFT. CalI lll.I: ::.rbl r"',--ti11f1
this Yr, r=O, 1, ..., L-1. M-I L-N+I L-I n
4) Oefine an L point sequence Vn by the relation
j
Will
lV-n2/2 0 � n � M -1
t'n = lV-CL-n)'/2 L - N + 1 � n <L (18)
If)
arbitrary other n, if any.
Il
Of course, if L is exactly equal to M+N- I, the region
in which Vn is arbitrary will not exist. If the region does
exist an obviOl';:; possibility is to increase M, the desired
number of points of the z-transform we compute, until
(0) t _ �
the region does not exist.
Note that t'n could be cut into two with a cut between urrmwr-r II-n . J.' •
II= M- I and II=L-N+ I and if the two pieces were abut

ted together differently, the resulting sequence would be a
b
slice out of the indefinite length sequence W-·'/2. This is
illustrated in Fig. 4. The sequence v. is defined the way ""----
(:-I) -------------
it is in order to force the circular convolution to give us not UloId •
the desired numerical results of an ordinary convolution. M-I L-I

5) Compute the DFT of t'" and call it V r=O, I, ... , .
L-1.
6) Multiply Vr and Yr point by point, giving Gr:
Gr = VrY" r = 0, 1, . . . ,L - 1. (I)
7) Compute the L point IDFT (lk, of Gr.

8) Multiply (lk by Wk'/2 to give the desired Xk: Fig. 4. Schematic representation of the various sequences involved
in Ihe eZT algorithm. (A) The input seque�ce Xn with N values.
Xi Wk2/2(1k, k
= 0,1. 2, . . ,M - 1.
= . (8) The weighted input sequence Yn= A-nW" /2� n. (C) The OFT of y ••
The (lk for k�M are discarded. (D) The values of the indeflnite sequence
W-" " . (E) The sequence
'
v" formedappropriately from segments of W-" /'. (F) The OFT of v••
Fig. 4 represents typical waveforms (magnitudes shown, (G) The product G, = Yr' V (H) The 10FT of Gr. (I) The desired M
.
phase omitted) involved in each step of the process. values of the z·transform.
RABINER el 01.: CHIRP Z-TRANSFORM ALGORITHM

III. Fine Points of the Computation having been shifted by No samples to the left; e.g., Xo is
weighted by WN,'/2 instead of Woo The region over which
Operation Count and Timing Considerations
W-,,'/2 must be formed, in order to obtain correct results
An operation count can be made, roughly, from the from the convolution, is
eight steps just presented. We will give it step by step
-N + 1 + No :::; n :::; M -1 + N o.
because there are, of course, many possible variations to
be considered. By choosing No=(N-M)/2 it can be seen that the limits
I) We assume that step I, choosing L, is a negligible over which W-"'/2 is evaluated are symmetric; i.e.,
operation. W-"'/2 is a symmetric function in both its real and
2) Forming Yn from Xn requires N complex multiplica imaginary parts. (It follows thus that the transform of
tions, not counting the generation of the constants w-n2/2 is also symmetric in both its real and imaginary
A-·W..'/2. The constants may be prestored, computed as parts.) It can be shown that using this special value of No,
needed, or generated recursively as needed. The recursive only (L/2+1) points of w-n'/2 need be calculated and
computation would require two complex multiplications stored and these (L/2+ I) complex points can be trans
per point. formed using an L/2 point transform. 2 Hence the total
3) An L point DFT requires a time kFFTL log2 L for L storage required for the transform of W-,,2/2 is L+2
a power of two, and a very simple FFT program. More locations.
complicated (but faster) programs have more complicated The only other modifications to the detailed procedures
computing time formulas. for evaluating the CZT presented in Section II of this
4), 5) Vn is computed for either M or N points, which paper are: 1) following the L point IDFT of step 7, the
ever is greater.The symmetry in W-·2/2 permits the other data of array Uk must be rotated to the left by No locations;
values of Vn to be obtained without computation. Again, and 2) the weighting factor of the Uk is Wk2/2WN,k rather
Vn can be computed recursively. The FFT takes the same than Wk·/2. The additional factor W.vok represents a data
time as that in step 3. If the same contour is used for shift of No samples to the right, thus compensating the
many sets of data, VT need only be computed once, and initial shift and keeping the effective positions of the data
stored. invariant to the value of No used.
6) This step requires L complex multiplications. An estimate of the storage required to perform the
7) This is another FFT and requires the same time as CZT can now be made. Assuming that the entire process
step 3. is to take place in core, storage is required for VT which
8) This step requires M complex multiplications. takes L+2 locations; for Yn, which takes 2L locations;
and perhaps for some other quantities which we wish to
As the number of samples of x or Xk grow large, the
save, e.g., the input, or values of W+n'/2 or A-nW,,'/2.
..
computation time for the CZT grows asymptotically as

something proportional to L log2 L. This is the same sort Additional Considerations
of asymptotic dependence of the FFT, but the constant
of proportionality is bigger for the CZT because two or Since the CZT permits M�N, it is possible that occa
three FFT's are required instead of one, and because L sions will arise where M»N or N»M. In these cases, if
is greater than N or M. StilI, the CZT is faster than the the smaller number is small enough, the direct method of
direct computation of (10) even for relatively modest (10) is called for. However, if even the smaller number is
values of M and N, of the order of 50. large it may be appropriate to use the methods of section
ing described by Stockham [3]. Either the lap-save or
'
Reduction in Storage lap-add methods may be used. Sectio
used when problems too big to be handled in core memory
The CZT can be put into a more useful form for com
arise. We have not actually encountered any of these
putation by redefining the substitution of ( I I) to read
problems and have not programmed the CZT with pro
(n - No)2 + k2 - (k - n + NO)2 + 2.Vok vision for sectioning.
nk = -------
Since the contour for the CZT is a straight line segment
2
in the s-plane, it is apparent that repeated application of
Equation (12) can now be rewritten as the CZT can compute the z-transform along a contour
N-l which is piecewise spiral in the z-plane or piecewise linear
X" = W,I/IWN" l: z"A-"W(II-N.)'IIW-(...."+N.)'/I in the s-plane.
.
,,-0 Let us briefly consider the CZT algorithm for the case
of Zk all on the unit circle. This means that the z-transform
The form of the new equation is similar to (12) in that
is like a Fourier transform. Unlike the DFT, which by
the input data x" are pre-weighted by a complex sequence
definition gives N points of transform for N points of
(A-n WC,,-Nol '/2), convolved with a second sequence
(W-Cn-N,)'/2), and post-weighted by a third sequence
(W"'/2WNok) to compute the output sequence Xk• How • The technique for transforming two real symmetric L point se
quences using one L/2 point FFT was demonstrated by J. Cooley
ever, there are differences in the detailed procedures for at the FFT Workshop, Arden House, October 1968. A summary of
realizing the CZT. The input data Xn can be thought of as this technique is presented in the Appendix.
155
data, the CZT does not require M= N. Furthermore, the We need one FFT and 2L storage locations for the trans
Zk need not stretch over the entire unit circle but can be form of XnA-nWn2/2; one FFT and L+2 storage locations
equally spaced along an arc. Let us assume, however, for the transform W-n'/2; and one FFT for the inverse
that we are really interested in computing the N point transform of the product of these two transforms. We
DFT of N data points .Still the CZT permits us to choose do not know a way of computing the transform of W-n2/2
any value of N, highly composi te, somewhat composite, either recursively or by a specific formula (except in some
or even prime, without strongly affecting the computation trivial cases). Thus we must compute this transform and
time. An important application of the CZT may be com store it in an extra L+2 storage locations. Of course, if
puting DFTs when N is not a power of two and when many transforms are to be done with the same value of L,
the program or special-purpose device available for we need not compute the transform of W-n'/2 each time.
computing DFT's by FFT is limited to the case of N a We can compute the quantities A-n Wn'/2 recursively as
power of two. they are needed to save computation and storage. This is
There is also no reason why the CZT cannot be ex easily seen from the fact that
tended to the case of transforms in two or more dimen
sions with similar considerations. The two-dimensional
A -(n+\)!V(n+\)'/2 /
= (A -nlVn' 2) lVnlV 1/2 A -I.
• (l!))
DFT becomes a two-dimensional convolution which is If we define
computable by FFT techniques.
We caution the reader to note that for the ordinary (20)
FFT the starting point of the contour is still arbitrary; and
merely mUltiply the waveform x� by A-� before using the
FFT, and the first point on the contour is effectively (21)
moved from Z= 1 to z=A. However, the contour is still then
restricted to a circle concentric with the origin. The angu
lar sp acing of Zk for the FFT can also be controlled to /)"tl = TV· D" (22)
some extent by appending zeroes to the end of Xn before and
computing the DFT (to decrease the angular spacing of
the Zk) or by choosing only P of the N points Xn and adding Cntl = Cn• Dn. (23)
together all the Xn for which the n are congruent modulo
Setting A= 1 in ( 19) to (23) provides an algorithm for the
P; i.e., wrapping the waveform around a cylinder and
coefficients required for the output sequence. A similar
adding together the pieces which overlap (to increase the recursion formula can be obtained for generating the se
angular spacing).
quence A-nW(n-N,)'/2. The user is cautioned that recursive
IV. Limitations computation of these coefficients may be a major source
of numerical error, especially when Wo"" 1, or cPo"" 0.
One limitation in using the CZT algorithm to evaluate
the z-transform off the unit circle stems from the fact
V. Summary
that we may be required to compute Wo±n2/2 for large 11.
If Wo differs very much from 1.0, Wo±n2/2 can become very A computational algorithm for numerically evaluating
large or very small when 11 becomes large. (We require a the z-transform of a sequence of N time samples was pre
large II when either M or N become large, since we need sented. This algorithm, entitled the chirp z-transform
to evaluate Wn'/2 for II in the range -N<II<M.) For algorithm, enables the evaluation of the z-transform at M
example, if Wo=e-·2�/I00o",,0.999749, and 11=1000 , equi-angularly spaced points on contours which spiral in
Wo±n'/2=e±12� which exceeds the single precision floating or out (circles being a special case) from an arbitrary
point capability of most computers by a large amount. starting point in the z-plane. In the s-plane the equivalent
Hence the tails of the functions W±n'/2 can be greatly in contour is an arbitrary straight line.
error, thus causing the tails of the convolution (the high The CZT algorithm has great flexibility in that neither
frequency terms) to be grossly inaccurate. The low fre N or M need be composite numbers; the output point
quency terms of the convolution wiIJ also be slightly in spacing is arbitrary; the contour is fairly general and N
error but these errors are negligible in general. need not be the same as M. The flexibility of the CZT
The limitation on contour distance in or out from the algorithm is due to being able to express the z-transform
unit circle is again due to computation of W±>o2/2. As Wo on the above contours as a convolution, pcrmitting the
deviates significantly from 1.0, the number of points for use of well-known high-speed convolution techniques to
which W±n'/2 can be accurately computed decreases. It is evaluate the convolution.
of importance to stress, however, that for Wo= I, there is Applications of the CZT algorithm include enhance
no limitation of this type since W±n'/2 is always of magni ment of poles for use in spectral analysis; high resolution,
tude 1. narrowband frequency analysis; and time interpolation of
The other main limitation on the CZT algorithm stems data from one sampling rate to any other sampling rate.
from the fact that two L point, and one LI2 point, FFTs These applications are explained in detail elsewhere [6].
must be evaluated where L is the smallest convenient The CZT algorithm also permits use of a radix-2 FFT
integer greater than N+M-l as mentioned previously. program or device to compute the DFT of an arbitrary
RABINER et af.: CHIRP Z-TRANSFORM ALGORITHM
156
number of samples. Examples illustrating how the CZT
algorithm is used in specific cases are included elsewhere
[6]. It is anticipated that other applications of the CZT
algorithm will be found.
Appendix
for k= 1, 2, .. ., L/2-1. The remaining values of Xi
The purpose of this Appendix is to show how the and Y. are obtained from the relations
FFT's of two real, symmetric L point sequences can be
L-I
obtained using one L/2 point FFT.
Let x. and y" be two real, symmetric L point sequences
Xo = LX.
"-0
with corresponding DFT's Xk and Yk• By definition, L-I
x" = XL-" Yo - LY"

n = 0, 1,2, . ..,I, - 1, .-0
y" = YL-n /.-1
and it is easily shown that Xk and Yk are real, symmetric

XLII = L x.(-I)"
"-0
L point sequences, so that L-I
Xk = XL k YL/I - Ly.(-I)".
0, 1,2, ..., L - 1.
_
"-0
k =
Reference,
Define a complex L/2 point sequence u. whose real and [I) J. W. Cooley and J. W. Tukey, "An algorithm fOI the machine
imaginary parts are calculation of complex Fourier series," Math. Comp., vol. 19,
TIe [u.] =X2.-Y2.+1+Y2.-1 } n=O, I, ...,L/2-1.

(2)
pp. 297-301, 1965.
G-AE Subcommittee on Measurement Concepts, "What is the
fast Fourier transform?" IEEE TrailS. Audio alld Electroacous
1m [U.]=Y2.+X2>1+1-X2.-1 tics, vol. AU-IS, pp. 45-55, June 1967.
(3) T. G. Stockham, Jr., "High speed convolution and correla
The L/2 point DFT of II. is denoted Uk and is calculated tion," 1966 Sprillg Joint Complller COIlf., AFlPS Proc., vol. 28.
by the FFT.The values of Xk and Yk may be computed Washington, D. C.: Spartan, 1966, pp. 229-233.
from Uk using the relations (4) H. D. Helms, "Fast Fourier transform method of computing
difference equations and simulating filters," IEEE TrailS.
AudiO alld Electroacoustics, vol. AU-IS, pp. 85-90, June 1967.
Xk = H He [Uk] + He [(h/2-dl (5) L. I. Bluestein, "A linear filtering approach to the computation
1 of the discrete Fourier transform," /968 NEREM Rec., pp.
--- {Re [Uk] - Re [uL,2-dl 218-219.
271" (6) L. R. Rabiner, R. W. Schafer, and C. M. Rader, "The chirp
4 sin-k z-transform algorithm and its applications," Bell Sys. Tech. J.,
L vol. 48, pp. 1249-1292, May 1969.
157
Reprinted from Math�_matic��utation, Vol. 22, No. 102,
April 1968, pp. 275-279.
A Fast Fourier Transform Algorithm

Using Base 8 Iterations
By G. D. Bergland
1. Introduction. Cooley and Tukey stated in their original paper [1] that the
Fast Fourier Transform algorithm is formally most efficient when the number of
samples in a record can be expressed as a power of 3 (Le., N 3m), and further that =
there is little efficiency lost by using N 2m or N = 4m. =
Later, however, it was recognized that the symmetries of the sine and cosine
weighting functions made the base 4 algorithms more efficient than either the base
2 or the base 3 algorithms [2], [3]. Making use of this observation, Gentleman and
Sande have constructed an algorithm which performs as many iterations of the
transform as possible in a base 4 mode, and then, if required, performs the last itera
tion in a base 2 mode.
Although this "4 + 2" algorithm is more efficient than base 2 algorithms, it is
now apparent that the techniques used by Gentleman and Sande can be profitably
carried one step further to an even more efficient, base 8 algorithm. The base 8
algorithms described in this paper allow one to perform as many base 8 iterations
as possible and then finish the computation by performing a base 4 or a base 2 itera
tion if one is required. This combination preserves the versatility of the base 2 algo
rithm while attaining the computational advantage of the base 8 algorithm.
2. The Basic Algorithms. Fast Fourier Transform recursive equations for

N = TIT2' • ·r n can be d er i ved in either of two different forms [4]. The second of these
forms is
Ap(jo,jI, .. " jp-I, kn-p-1, • . " ko)

'1'-1
(1) == � ..1·P-1(jo,jI, .. ·,jp-2,kn-p, "',ko)

kn_p_O
.1V iP-1(kn-p('P+1'" 'n)+,,·+kO)('l"· 'p-1)

N
p = 1, 2, .. " n , 1VN = exp (2-rri/N)

where
j = jn_1 (Tlr2" ·rn-1 ) + jn-2(Tlr2" ·rn-2) + . . . + itTl + jo

(2)
k = kn_l(r2r3" ·rn) + kn_2(r3r4" ·rn) + . . . + klTn + ko
subject to the constraints
j.-l = 0, 1, 2, " ' , r. - 1,

(3)
k. = 0, 1 , 2, .. " r,,_. - 1 , ° ;;; v ;;; n -
The last array calculated gives the Fourier sums as
Received June 16, 1967.
158
G. D. BERGLAND
(4)
In some cases the total computation required to evaluate these equations can
be reduced by grouping these equations in a slightly different manner. For N =
r1r2' ·r", this regrouping takes the form
•
Ap (jol
[ �'I jP-lI
• • k.._p-11 ' ' I 'ko} ,
r l
]
_
i P 1k p
(5) - LJ A�p-1 (3.0, '" I 3'P-
. 2, 1.
""-1'1 ' , 'I k0 ) TVrp- ,,-
k,,_,,-O
, TV ip-1(kia_p-l(rp+2· .. r,,)+ ... +k1rn+ko)(r1r2· .. rp-1) ,
N
Note that the bracketed term in (5) represents a set of rp-point Fourier trans
forms and that the complex exponential weights outside the brackets simply re
reference each set of results to a common time origin. (In Gentleman and Sande's
paper this rereferencing is termed twiddling.) The term
21tilrp
TV,. = TV NNlrp = e
l'
forms the basis for the complex exponential weights required. in evaluating each rp
point transform, and jp-1 and k.._p are the two indices of the transform.
An analogous regrouping can be performed on the original Cooley-Tukey re
cursive equations. For N = r1r2' , ·r", these take the form [4]
AI' (jol j l l " '1 jP-lI k..-p-1, " '1 ko}

:'1'-1
(6) =
k
2:
.._p-0
Ap-1 (j01 j11 ' " I jP-2 1 kn-1'1 ' "
I ko)
, TV (ip-1(r1r2 .• ,rp-1)+... +iO)kn -p(r.fIT
--.L1· ··rn)
N P = 11 2I " 'In,
I
When these equations are regroupedI the rereferencing of iteration p + 1 must be

performed before the rp+1 point transform instead of after it. This can be done by
performing the p + 1 iteration rereferencing at the end of the pth iteration. If these
rereferenced AI' terms are labeled AI" we have
Ap (jol . 'I jp-11 k"-P-lI . 'I ko)

]
, ,
(7) = [fA
k"-FO
p-1 (jO, " ' l jp-21 k.. -pl "
k
" ko) l V�� l "_p
, TV (jp-1(r1r2· .. rp-1)+"·+hr1+io)kn_p-l(rp+2·""n)
N ,
This expression is valid for p = 1,2, . , 'In provided that we define (rp+2' , 'r..) = 1
for p > n - 2 ,and define lc_1 = O. Note that the bracketed term of (7) is identical
to the bracketed term of (5).
Each iteration of both (5) and (7) is thus conveniently divided into two steps.
The first step involves performing a set of Tp-point Fourier transforms, and the
second step involves use of the Fourier transform shifting theorem to rereference
the resulting spectral estimates to the correct time origins. These two operations
are performed during each iteration.
Of particular interest in this paper are the algorithms which result when as many
of the rp terms as possible are set to 8. When this is done, the bracketed terms repre
sent a large number of 8-point Fourier transforms. The complex exponential weights
159
-
0-
o
TAllLE I
Comparisoll of arithmetic opemtiolH:; required for base �, base ·1, bai:ie �, and bai:ie 10 u.IgoriUllll�.
Real
Algorithm Requirecl computation for J.1Iultiplications Real Additions
Base 2 algorithm Evaluating (N/2)m, 0 2Nm, :.-
for N 2m, 2 term Fourier transforms.
�
=
m = 0,1,2,... Referencing «m/2 - I)N + 1)(4) «mj2 - I)N + 1)(2) Ul
t-:l
Complete analysis (2m - 4)N + 4 (3m - 2)N + 2 '"J
0
q
l:::1
.....
Base 4 algorithm Evaluating (N/4)(m/2), 0 2Nm l:'j
for N= (22)m/2, 4 term Fourier transforms. l:::1
m/2=0, 1 , 2 , ··· Referencing «3m/8 - I)N + 1)(4) «3m/8 - I)N + 1)(2) t-:l
l:::1
:.-
7-
Complete analysis (1.5m - 4)N + 4 (2.75m, - 2)N + 2 Ul
'Xl
0
:::l
Base 8 algorithm Evaluating (N/8)(m/3), Nm/6 13Nm/6 i:::
for N = (23)m/3 , 8 term Fourier transforms. :.-
m/3 0, 1,2, ... Referencing «7m/24 - I)N + 1)(4) «7m/24 - I)N + 1)(2) t"
=
0
0
l:::1
.....
Complete analysis (1.333m - 4)N + 4 (2.75m - 2)N + 2 t-:l
:=:
...
....
Base 16 algorithm Evaluating (N/16)(m/4), 3Nm/8 9Nm/4
for N = (24)m/4 , 16 term Fourier transforms.
m/4 = 0, 1,2, ... Referencing «15m/64 - I)N + 1)(4) «15m/64 - I)N + 1)(2)
Complete analysis (1.3125m - 4)N + 4 (2.71875m - 2)N + 2
G. D. BERGLAND
required in performing these transforms are: ±1, ±i, ± exp (+i7r/4), and
± exp (- i7r/4) Since use of the first 2 weights requires no multiplications and
.
the product of a complex number and either of the last 2 weights requires only 2
real multiplications, a total of only 4 real multiplications is required in evaluating
each 8-point Fourier transform.
Thus the weights used in evaluating an 8-point Fourier transform all have
symmetries which can be profitably exploited. Considerable use of these symmetries
is being made since the base 8 algorithin forces us to compute N/8, 8-point trans
forms during each iteration.
3. Computation Required by Base 2, Base 4, Base 8, and Base 16 Algorithms.

When N can be expressed as a power of 2, the recursive equations (5) and (7) can
be specialized such that rl = rz = . . . = rn-l = 2q and rn = 2r, where N = 2'" and
r = m - q(n - 1) � q. That is N = 2 q2q ·2q2r• If 2q is referred to as the base of
• •
the algorithm, we can compare the computation required by base 2, base 4, base 8,
and base 16 algorithms for N being any po\ver of 2. The number of real multiplica
tions and real additions required for various values of m is expressed in Table I.
Although these expressions are only exact for values of N which are integral powers
of 2, 4, 8, and 16, respectively, they are good approximations for any integral value
ofm.
The real multiplications and additions required for m = 12 are given exactly
by these expressions and are expressed in Table II.
TABLE II
Real multiplications and additions required in performing base 2, base 4, base 8
and base 16 Fast Fourier Transform algorithms for N - 4096.
Numher
Number of real of real
Algorithm Multiplications Additions
Base 2 (N = (2)12) 81,924 139,266
Base 4 (N = (22)6) 57,348 126,978
BaseS (N = (23)4) 49,156 126,978
Base 16 (N = (24)3) 48,�32 125,442
In counting the number of multiplications and additions required by each of

these algorithms, it is assumed that each rereferencing operation requires one com
plex multiplication except when the multiplier is WD.* It is also assumed that the
basic transform of each algorithm is performed in the most efficient manner possible.
Thus as the number base gets progressively higher, the 2q point transform becomes
more and more specialized but the total computation required decreases.
* Rereferencing operations involving a multiplication by ±i or ± exp (±i1l'"/4) could be per

formed with less arithmetic operations at the expense of having to locate these special pr oducts.
Since only the rereferencing products involving WO are easily located, present implementations of
these algorithms only treat rereferencing products involving TV' as special cases.
161
A FAST FOUHIEIt THANSFOR:\[ ALGORITH�[
4. Conclusions. It is apparent from Table I that performing as many high base

iterations as possible, reduces the total computation. As the base of the algorithm
increases, however, the basic 2q point transform becomes more involved, and it be
comes increasingly difficult to let N be an arbitrary power of 2. A reasonable eompro
mise appears to be the base 8 algorithm. The computation required is very nearly
minimized while the 8-point transforms required are still relatively easy to compute.
.\.. Fortran II subroutine, using the base 8 algorithm, has been written for use on
the G.E. 635 computer. The execution time is approximately 60Nm microseconds
(where N =2"'), which is more than 40% faster than the base 2 programs previously
used and 20% faster than Gentleman and Sande's 114 + 2" subroutine.
5. Acknowledgments. The author wishes to thank W. 1\1. Gentleman, G. Sande,

and W. L. Zweig for their suggestions which led to what appears to be the mini
mum number of computations required in executing the base 16 algorithm.
Bell Telephone Laboratories, Incorporated

Whippany , New Jersey
1. J. W. COOLf;Y & J. W. T UK.;Y, "An algorithm for the machine calculation of complex
Fourier series," Math. Comp., v. 19, 1965, �p. 297-301. MR 31 #2843.
2. W. M. GENTLEMAN & G. SANDE, "Fast Fourier transforms for fun and profit," Fall Joint
Computer Conference Proceedings, Vol. 29, 1966, pp. 563-578.
3. R. E. M I LL ER & S. WINOGRAD Private Communication.
4. G. D. BERGLAND, "The fast Fourier transform recursive equations for arbitrary length
records," Math. Comp., v. 21, 1967, pp. 236-238.
162
Introduction
The fast Fourier transform (FFT) algorithm is an

efficient method for computing the transformation
n-I
An Algorithm for ak = L: Xjexp (i2Irjk/n) (1)
j-O
Computing for k=O, 1,·· . II-I, where {Xj) and lad are both
,
complex-valued. The basic idea of the current form of the

the Mixed Radix Fast fast Fourier transform algorithm, that of factoring II,
Fourier Transform m
n = II n"
RICHARD C. SINGLETON, Senior Member, IEEE and then decomposing the transform into In steps with
Stanford Research Institute II/II, transformations of size IIi within each step, is that of
Menlo Park, Calif. 94025 Cooley and Tukey [I]. Most subsequent authors have
directed their attention to the special case of II 2"', =
Explanation and programming are simpler for 11= 2"' than

Abstract
for the general case, and the restricted choice of values of
This paper presents an algorithm for computing the fast Fourier II is adequate for a majority of applications. There are,
transform, based on a method proposed by Cooley and Tukey, As in however, some applications in which a wider choice of
their algorithm, the dimension n of the transform is factored (if possi values of II is needed.The author has encountered this
ble), and n/p elementary transforms of dimension p are computed for need in spectral analysis of speech and economic time
each factor p of n, An improved method of computing a transform step
series data.
corresponding to an odd factor of n is given; with this method, the
Gentleman and Sande [2] have extended the develop
ment of the general case and describe possible variations
number of complex muitiplicatioll& for an elementary transform of di
in organizing the algorithm. They mention the existence
mension p is reduced from (P-l)2 to (P-l)2/4 for odd p, The fast
of a mixed radix FFT program written by Sande.Avail
Fourier transform, when computed in place, requires a final permuta
able mixed radix programs include one in ALGOL by
tion step to arrange the results in normal order, This algorithm in Singleton [3] and another in FORTRAN by Brenner [4].
cludes an efficient method for permuting the results in place. The al A FORTRAN program based on the algorithm discussed
gorithm is described mathematically and illustrated by a FORTRAN here is included in Appendix I; this program was com
subroutine. pared with Brenner's on several computers (CDC 6400,
CDC 6600, IBM 360/67, and Burroughs B5500) and
found to be significantly faster.
The Mixed Radix FFT
The complex Fourier transform ( 1) can be expressed

as a matrix multiplication
a = Tx,
where T is an II XII matrix of complex exponentials
tjk = cxp (i27rjk/n),
In decomposing the matrix T, we use the factoring of
Sande [2], rather than the original factoring of Cooley
[I]. However, if the data are first permuted to digit
reversed order and then transformed, Cooley's factoring
leads to an equally efficient algorithm.
In computing the fast Fourier transform, we factor T
as
T = PFmFm-I • • • F2FI,
Manuscript received December 2, 1968. where F, is the transform step corresponding to the factor
This work was supported by Stanford Research Institute, out of
of II and P is a permutation matrix. The matrix F, has
IIi
Research and Development funds. only II, nonzero elements in each row and column and
1EEB TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS VOL. Au-17, No.2 JUNE 1969
163
can be partitioned into n/llisquare submatrices of dimen- and
sion IIi; it is this partition and the resulting reduction in
multiplications that is the basis for the FFT algorithm. j' = jln61l6 •.. nl + j2n6n4 •.. nl + j6n4n3n2nl
The matrices Fican be further factored to yield + j4n3n2nl + j3n2nl + j.nl + je.
The permutation PI in this case leaves each result ele

ment in its correct segment of length 11/111112, grouped in
where Ri is a diagonal matrix of rotation factors (called subsequences of 111112 consecutive elements.The permuta
twiddle factors by Gentleman and Sande [ 2]) and Tican tion P2 then completes the reordering by permuting the
be partitioned into II /IIi identical square submatrices, 113114116 subsequences within each segment of length
each the matrix of a complex Fourier transform of dimen 11/111112' In the FORTRAN subroutine given in Appendix I,
sion Ili. Although it might appear that this step increases P2 is done by first determining the permutation cycles for
the number of complex multiplications, it in fact enables digit reversal of the digits corresponding to the square
us to exploit trigonometric function symmetries and free factors, then permuting the data following these
multipliers of simple form (e.g., eir, eirl2, and eir/4) in cycles.The permutation can be done using as few as two
computing Ti that more than compensate for the fewer elements of temporary storage to hold a single complex
than II multiplications in applying the rotation Ri• This result, but the program uses its available array space to
point will be discussed furthel in later sections. permute the subsequences of length 111112 if possible.
The permutation P is required because the transformed
result is initially in digit-reversed order, Le., the Fourier
The Rotation Factor
coefficient a;, with
In the previous section, we described the factoring of
the transform step Fi, corresponding to a factor l1i of 11,
is found in location into a product RiT, of a matrix T, of 11/11, identical Fourier
transforms of dimension 11, and a diagonal rotation factor
matrix R,.Here we specify the elements of R, for the Sande
version of the FFT.
As mentioned previously by the author [5], the permuta The rotation factor R, following the transform step T,
{ 2lr
tion may be performed in place by pair interchanges if 11 has diagonal elements
is factored so that
[ j mod kk }
for i<lI-i. In this case, we can countj in natural order
rj = exp i- (j mod
kk
k)
k J
andj' in digit-reversed order, then exchange aj and aj' if forj=O, 1, . . . ,11-1 where
j<j' . This method is a generalization of a well-known k = n/nln2 ..• n, and kk = n ik,
method for reordering the radix- 2 FFT result.
Before computing the Fourier transform, we first de and the square brackets [ ] denote the greatest integer
compose II into its prime factors.The square factors are ::; the enclosed quantity. The rotation factors multiply
arranged symmetrically about the factors of the square ing each transform of dimension 11, within T, have angles
3X
2 X 3 X 5 X 3.
free portion of 11. Thus 11= 27 0 is factored as
0, 6, 26, . . . (ni - 1)6,
where 6 may differ from one transform to another. No

Then the permutation P is factored into two steps, mUltiplication is needed for the zero angle, thus there are
at most
P = P2P1•
n(n, - 1)/n,
The permutation PI is associated with the square factors
of 11 and is done by pair interchanges as described above, complex multiplications to apply the rotation factor
except that the digits of 11 corresponding to the square following the transform step T,. In addition, 6= 0 for
free factors are held constant and the digits of the square U mod k)=O or U mod kk)=O, allowing the number of
factors are exchanged symmetrically.Thus if complex multiplications to be reduced by
(nl - 1) + nl(n2 - 1) + nln2(n3 - 1)
+ . . .+ nln2 ... n.._l(n.. - 1) = n - 1.

with III =117, 112=116, and l1a, 114, and 116 are relatively
prime, we interchange We note that the number of rotation factor multiplica
tions is independent of the order of arrangement of the
j = hn6n6 ... nl + j6n6n4 ... nl + j6n4nan2nl
factors of 11. The final rotation factor R". has 6= 0 for all
+ j4nan2nl + j3n2nl + j2nl + it elements, and thus is omitted.

164
Counting Complex Multiplications mn(p - ])2
A complex multiplication, requiring four real multipli 4p
cations and two real additions, is a relatively slow opera complex multiplications.Adding the
tion on most computers.1 To a first approximation, the
mn( -1)
p
speed of an FFT algorithm is proportional to the number ---- - (n-I)
of complex multiplications used. The number of times we p
index through the data array is, however, an important multiplications for rotation factors, we obtain a total of
secondary factor.
mn(p
--- I)(p + 3)
Using the results of the previous section, the number of --'- --- (n-])
complex mUltiplications for the rotation factors R; is 4p
n(n complex multiplications for a radix-p transform.Since p
f ;- 1) _ (n- 1), is assumed here to be an odd prime, we have no lotations
i-I ni with (J an integer multiple of 1f/4 to reduce further the
assuming we avoid multiplication for all rotations of zero number of complex multiplications.
angle. To this number we must add the multiplications The ratio of the number of complex multiplications to
for the transform steps T;. /I log2 /I can serve as a measure of relative efficiency for
For n a power of 2, we note that a complex Fourier the mixed radix FFT. The results of this section, neglect
transform of dimensi on 2 or 4 can be computed without ing the reduction by /I-I for 8= 0, yield the following
multiplicati on and that a transform of dimensi on 8 re c omparis on :
quires only two real multiplications , equivalent to one
Radix Relative Efficiency
half a complex multiplication. Going one step further, a
transform of dimension 16, computed as two factors of 2 0.500
4, requires the equivalent of six complex multiplications. 4 0.375
8 0.333
Combining these results with the number of rotation fac 16 0.328
tor multiplications and assuming that n= 2m is a power of 3 0.631
the radix, the total number of complex multiplications is 5 0.689
7 0.763
as follows: II 0.920
13 0.998
Radix Number of Complex Multiplications 17 1. 151
19 1.227
2 nlll/2-(II-I) 23 1.374
4 311111/8-(11-1)
8 11111/3-(11-1)
16 2111111/64-(11-1) The general term for an odd prime p is
-------.------
(p - I)(p + a)
These results have been given previously by Bergland [6]. 4p log2 (p)
The savings for 16 over 8 is small, cons idering the added
complexity of the algorithm. As Bergland points out, Decomposition of a Complex Fourier Transform
radix 8, with provision for an additional factor of 4 or 2, In the previous section, we promised to show that a com
is a good choice for an efficient FFT program for powers plex transform of dimension p, for p odd, can be com
of 2. For the mixed radix FFT, we transform with factors puted with (p_I)2 real multiplications. Consider the
of 4 whenever possible, but also provide for factors of 2. complex transform
We now consider the number of complex multiplica
tions for a radix-p transform of /I= pm complex data
,,-1
ak + ibk = L (;rj + iYi) cos --

{ (21fjk)
(21fjk)}
values, where p is an odd prime. While at first it might ap J-O P
pear that an elementary transform of dimension p re
quires (p_I)2 complex mul tiplications, we show in the + isin p
next section that (p_I)2 real multiplications suffice,
eq u i valent to (p-I)2/4 complex multiplications. This re
,,-1
= .1'0 + L.fjC:OS --
(21fjk)
sult holds, in fac t, for any odd value of p. Thus the trans i_I P
form steps for /I pm require the equivalent of
=
I G. Golub (private communication) has pointed out that a

complex multiplication can alternatively be done with three real
multiplications and five real additions, as indicated by the following:
(a+ih) . «'+id) = [(a+b) ·(c-d)+a·d-h·(' ]+i[a·d+h·c J.

This method does not appear advantageous for FORTRAN coding,
as the number of statements is increased from two to four.
SINGLETON: ALGORITHM FOR MIXED RADIX FFT
165
=
(,,-1l/2
Xo + L (Xi
i-1
+ Xp-i) cos (271"ik)
--
P
For even values of p, a decomposition similar to the
above yields 4(P/2-1) series to sum, each with (P/2 - 1)
multiplications. Thus a complex Fourier transform for p
even can be computed with at most (p- 2)� real multipli
cations. For p>2, we know that this result can be im
{(�p-Il/2 ( 2 7r"ik)
proved.Combining results for the odd and even cases, we
can state that a Fourier transform of dimension p can be
+ iYo+ (Yi + Yp-i) COS --
J-1 computed with the equivalent of
(�1'-1)/2 (.fi Xp-j) sin (2--
P
+ -
7r"ik)} [p ; IJ
J-1 P
for k=O. I• . . .
• p-1. We note first that or fewer complex multiplications, where the square
brackets [ ] denote the largest integer value :S the en
,,-\
ao + ibo = L (Xi + iyj) closed quantity.
i-O
A Method for Computing Trigonometric Function Values

is computed without multiplications. The remaining
Fourier coefficients can be expressed as The trigonometric function values used in the fast
Qk = Qk+ - a,,
Fourier transform can all be represented in terms of in
teger powers of
Qp-k = ak+ + ak cxp (i27T/n),
bk bk+ + bk-
the nth root of unity. Since we often use a sequence of
=
bp_k = bk+ - bk- equally spaced values on the unit circle, it is useful to have
for k = I, 2, ... ,(P-1)/2, where accurate methods of generating them by complex multi
(2rik)
plication, rather than by repeated use of the library sine
(p-1)/1 and cosine functions. For very short sequences, we use the
a,+ = Xo + 1: (XI + Xp-/) cos -- simple method
i-I
(2rik)
P
(11-1)/1 �k+1 = �k exp (i8),
a,,- = 1: (111
•
where
1Ip-i) 81n --
/-1 �O =
-
P 1
(p-1l/2
27Tjk
b,,+ = 110 + L (Yj + Yp-j) cos --
i-1
( ) P
and {�k I is the sequence of computed values exp (ik8).
This method suffers, however, from rapid accumulation
(Xj - Xp-j) sm (--).

(p-1)/1 of round-off errors. A better method, proposed by the
27Tjk
b,,- = 1: author in an earlier paper [5], is to use the difference
•
1-1 p . equation
Altogether there are 2(p-I) series to sum. each with
(P-1)/2 multiplications, for a total of (P_ 1)2 real multi where the multiplier
plications.
'I = exp (i8) - 1
For p an odd prime and for"fixedj, the multipliers
= 2i sin ( 8/2) exp (i8/2)
cos (27Tjk/p) for k = 1,2, ... (p - 1)/2
= - 2sin2 (8/2) + isin (8)
have no duplications of magnitude, thus no further re decreases in magnitude with decreasing 8. This method
duction in multiplications appears possible.2 The same gives good accurcay on a computer using rounded float
condition holds for the multipliers ing-point arithmetic (e.g., the Burroughs B55(0). How
sin (27Tjk/p) for k = 1,2,... (p - 1)/2. ever, with truncated arithmetic (as on the IBM 360/67),
the value of �k tends to spiral inward from the unit circle
with increasing k.
• C. M. Rader (private communication) has proposed an al In Table I, we show the accumulated errors from extrap
ternative d�'Composition of a
using the equivalent of 3 complex multiplications (12 real multi
olating to 7T/2 in 2k increments, using rounded arithmetic
plications) instead of the 4 complex multiplications used in the (machine language) and truncated arithmetic (FORTRAN)
algorithm described in this paper. In Appendix III we give a on a CDC 6400 computer; identical initial values, from
FORTRAN coding of Rader's method. When substituted in subroutine
FFT (Appendix I), times were unchanged on the CDC 6600 com
the library sine and cosine functions, -were used in com
puter and improved by about 5 percent for radix-5 transforms on puting the results in each of the three pairs of columns.
the CDC 6400 computer (the 6400 has a relatively slower multiply In examining the second pair of columns, we find that the
operation). Rader's method looks advantageous for coding in
machine language on a computer having mUltiple arithmetic regis
angle after 2" extrapolation steps is very close to 7T/2, but
ters available for temporary storage of intermediate results. that the magnitude has shrunk through truncation. To
IEEE TRANSACftONS ON AUDIO AND ELECfROACOUSTICS JUNE 1969

166
TABLE I
Extrapolated Values of cos ../2 and sin ../2- 1 an a CDC 6400 Computer, Using Rounded and Truncated Arithmetic Operations
(Values in Units of 10-'"
---- - -. -. _.- .-
Rounded Arithmetic Truncated Arithmetic Truncated Arithmetic

Number of Without Correction Without Correction With Correction
Extrapolations
cos ,,/2 sin 7</2-1 cos ../2 sin 7</2-1 cos 7</2 sin 7</2-1
2' 2.6 0.0 2.9 -3.6 2.8 -0.4

25 3.0 -0.7 4.2 -8.2 3.9 -0.4
2' 2.5 0.0 3.8 -13.1 4.1 -0.4
27 5.1 -0.7 4.0 -26.3 3.6 -0.4
28 2.7 -0.7 4.5 -51.9 4.5 -0.4
2" 4.9 -1.1 5.6 -104.4 5.2 -0.4
2'· 8.8 -1.1 5.2 -213.2 4.7 -0.4
2" 14.2 2.1 2.8 -426.0 0.0 -0.4
212 13.2 -0.4 -1.1 -866.2 -5.4 -0.4
2'3 -0.2 -0.4 4.7 -1705.7 4.2 -0.4
2" -7.4 1.4 -9.6 -3440.1 1.2 -0.4
2" -10.3 -1.8 -9.4 -6841.5 36.9 -0.4
216 19.8 2.8 4.4 -13707.1 -33.3 -0.4
217 \8.7 5.7 -8.9 -27416.7 3.5 -0.4
compensate for this shrinkage, we modify the above accuracy is small. The subroutines in Appendixes I and II
method to restore the extrapolated value to unit magni include comment cards indicating the changes to remove
tude. We first compute a trial value the rescaling. On the other hand,the number of multipli
cations may be reduced by· one when using truncated
'Yk = �k + lI�k
arithmetic, through using the overcorrection multiplier
where
Ilk = 2 - 'Yk'Yk"'.
11 = - 2 sin! (8/2) + i sin (8)
In this case, the truncation bias stabilizes a method that
and
mathematically borders on instability. On the CDC
�o = 1, 6400 computer,this multiplier gives comparable accuracy
then multiply by a scalar to the multiplier suggested above.
�k+l = Ilk'Yk,
A FORTRAN Subroutine for the Mixed Radix FFT
where
Ilk""
1 ,
In Appendix I, we list a FORTRAN subroutine for com
puting the mixed radix FFT or its inverse, using the
V'Yk'Yk'"
algorithm described above. This subroutine computes
to obtain the new value. Since 'Yk'Yk'" is very close to 1,we either a single-variate complex Fourier transform or the
can avoid the library square-root function and use the calculation for one variate of a multivariate transform.
approximation To compute a single-variate transform (I) of n data
�(_1_+ 1).
values,
lik = CALL FFT(A, B, n, n, n, 1).
2 'Yk'Yk'"
Or if division is more costly than multiplication, we can The "inverse" transform
alternatively use the approximation ft-I
lik = H3 - 'Yk'Yk"'). Xi = :E O/k exp ( - i27rjk/n)

k-O
On the CDC 6400 computer, both approximations give is computed by
the results shown in the third pair of columns of Table I.
This rescaling of magnitude uses four real multiplications CALL FFT(A, B, n, n, n, - 1).
and a divide (or five real multiplications) in addition to
the. four real multiplications to compute the trial value of . Scaling is left to the user. The two calls in succession give
'Yk.However, on most computers, these calculations will the transformation
take less time than computing the values using the library
T*Tx = nx,
trigonometric function.
The added step of rescaling the extrapolated trigono i.e., II times the original values, except for round-off er
metric function values to the unit circle can also be used rors. The arrays A and B originally hold the real and
when computing with rounded arithmetic,but the gain in imaginary components of the data, indexed from I to Il;
SINGLETON: ALGORITHM fOR MIXED RADIX fIT
167
the data values are replaced by the complex Fourier co loss in efficiency compared with the present algorithm. In
efficients. Thus the real component of (){k is found in the ALGOL algorithm, the Cooley version of the FFT al
A(k+l), and the imaginary component in B(k+I), for gorithm is used, with a simulated recursive structure;
k=O, I, . .., 11-1. rotation factor multiplication is included within the trans
The difference betwecn the transform and inverse cal form phase, requiring two additional arrays with dimen
culation is primarily one of changing the sign of a variable sion equal to the largest prime factor in 11. The transform
holding the value 2lr. The one additional change is to method for odd factors is like that used here. The permu
follow an alternative path within the radix-4 section of tation for square factors of II also has a simulated recur
the program, using the angle -lr/2 rather than lr/2. ' sive structure, with one level of "recursion" for each
The usc of the subroutine for multivariate transforms is square factor in II; in the present algorithm, this permuta
described in the comments at the beginning of the pro tion is consolidated into a single step. The permutation
gram. To compute a bivariate transform on data stored for square-free factors is identical in both algorithms.
in rectangular arrays A and B, the subroutine is called The ALGOL algorithm contains a number of dynamic ar
once to transform the columns and again to transform rays, which is an obstacle to translation to FORTRAN. On
the rows. A multivariate transform is essentially a single the other hand, the FORTRAN subroutine given here can
variate transform with modified indexing. easily be translated to ALGOL, with the addition of dy
The subroutine as listed permits a maximum prime namic upper bounds on all arrays other than NFAC; in
factor of 23, using four arrays of this dimension. The making this translation, it would be desirable to modify
dimension of these arrays may be reduced to I if 11 con the data indexing to go from 0 to 11-1 to correspond with
tains no prime factors greater than 5. An array NP(209) is the mathematical notation.
used in permuting the results to normal order; the present
value permits a maximum of 210 for the product of the
Timing and Accuracy
square-free factors of II, If II contains at most one square
free factor, the dimcnsion of this array can be reduced to The subroutine FFT was tested for time and accuracy
j+I, where j is the maximum number of prime factors of on a CDC 6400 computer at Stanford Research Institute.
II. A sixth arrray NFAC(II) holds the factors of 11. This is The results are shown in Table II.The times are central
ample for any transform that can be done on a computer processor times, which are measured with 0.002 second
with core storage for 217 real values (216 complex values); resolution; the times measured on successive runs rarely
differ by more than 0.002 to 0.004 second. Furthermore,
52 "\88 = 2 X 34 X 2 X 34 X 2
calling the subroutine with 11 = 2 yields a timing result of
is the only number <216 with as many as II factors, given o or 0.002 second; thus the time is apparently measured
the factoring used in this algorithm. The existing array with negligible bias.
dimensions do not permit unrestricted choice of 11, but The data used in the trials were random normal deviates
they rule out only a very small percentage of the possible ",ith a mean of zero and a standard deviation of one (i.e.,
values. an expected rms value of one). The subroutine was called
The transform portion of the subroutine includes sec twice:
tions for factors of 2, 3, 4, and 5, as well as a general sec
CALL FFT(A, B, n, n, n, 1)
tion for odd factors. The scctions for 2 and 4 include
multiplication of each result value by the rotation factor; CALL FFT(A, B, n, n, n, - 1);
combining the two steps gives about a 10 percent speed
improvement over using the general rotation factor sec then the result was scaled by lin. The squared deviations
tion in the program, due to reduced indexing. The sec from the original data values were summed, the real and
tions for 3 and 5 are similar to the general odd factors imaginary quantities separately, then divided by II and
section, and they improve speed substantially for these square roots taken to yield an rms error value. The two
factors by reducing indexing operations. The odd factors values were in all cases comparable in magnitude, and an
section is used for odd primes> 5, but can handle any odd average is reported in Table II.
factor. The rotation factor section works for any factor The measured times were normalized in two ways, first
but is used only for odd factors. by dividing by
The permutation for square factors of 11 contains special
code for single-variate transforms, sinc(.' less indexing is
required. However, the permutation for multivariate
transforms also works on single-variate transforms. and'second by dividing by
The author has previously published an ALGOL proce
n logdn) .
dure [3] of the same name and with a similar function.
One significant difference between the two algorithms is To a first approximation, computing time for the mixed
that the ALGOL one is organized for computing large trans radix FFT is proportional to II times the sum of the factors
form') on a virtual core memory system (e.g., the Bur of n, and we observe in the present case that a propor
roughs B5500 computer). This constraint leads to a small tionality constant of 25 p'S gives a fair fit to this model.

168
TABLE II TABLE III
Timing and Accuracy Tests of Subroutine FFT on a CDC 6400 Compuler N_ben S 100000 Containing No PrIM. focIar Gr...... Than 5
Time Time 2 ?II 19?0 6750 19440 411001)

3 100 1944 6912 196n 4R600
rms
4
IlL II;
Time 320 1000 1200 20000 49152
Factoring of II II)Og.1I Error 5 324 2025 1290 20250 50000
(seconds)
(X10-13) 6 360 ?048 7500 204110 0;062'1
(pS) (pS) I 315 2160 7680 20116 51200
., )84 2lR7 1776 21600 '511140
10 400 2250 1000 2187C 0;?4111
512=4'X2X4' 0.188 20.4 40.8 1.1 12 405 2104 I1CO 22500 54000
1024=4'X4X42 0.398 19.4 38.9 1 :2 15 432 7400 8192 23040 54615
2048=4'X2X2X2X4' 0.928 20.6 41.2 1.4 16 450 2430 8640 23ne '5'5296
4096=4'X43 1.864 19.0 37.9 1.5 18 480 2500 8748 24COO 56750
20 486 25tO 9000 24300 57600
2187=33X3X3' 1.494 32.5 61.6 1.6
24 500 259? 9216 24576 511320
312S=5'X5X52 1.898 24.3 52.3 2.3 25 'lIZ 2700 9375 25000 '59"49
2401=72X7' 2.310 34.4 85.7 2.6 21 5 40 21180 9600 25600 60000
1331=IIXIIXII 1.324 30.1 95.9 2.5 30 '516 2916 '1120 25920 607'50
2197=13X13X13 2.478 28.9 101.6 3.5 n 600 3000 10000 26244 61440
36 625 1012 10125 27000 621M
289=17X17 0.272 27.7 115.1 2.5 40 640 3125 10240 2164e 61'500
361=19X19 0.372 27.1 121.3 3.2 45 648 1200 10368 2812� 64"00
529=23)<23 0.636 26.1 132.9 3.5 U 675 3240 10100 28800 641100
IOOO=2X5x2X5X2X5 0.546 26.0 54.8 1.6 50 120 3315 10935 29160 65536
1.042 54 129 34'16 11250 300"0
2000=4X5X5X5X4 22.6 47.5 1.7 6'1610
60· 150 3600 11520 3 0 37 5 67500
64 168 3645 lU64 30720 69120
12 800 3 750 12000 31104 69984
75 810 311 4031250 ll 1 50 nooo
On the basis of counting complex multiplications, we 80 864 311118
]2000 12288 n900
12500
would expect a decline in this proportionality constant II
9C
900
960
4000
40'10
324IlC
]276@ ll800
U1211
1'1000
with increasing radix; a decline is observed for odd 96 972 4096 3280'5 ll9tO 7foll00
100 1000 4120 337'50 13122 777foO
primes> 5.Factors of 5 or less are of course favored by 108 1024 4114 34560 13500 111115
special coding in the program. The second normalized 125 120 1080
1 115
4500
4fo08
13824
31t 9 q � 711132
36000 110400 110000
time value places all times on a comparable scale, allow 128 115 2 41100361050 14580 Rlonn
ing one to assess the relative efficiency of using values of II 144
135 1200
121'5
4860
5000
36864
37500
15000
15ltO
'1920
1I?944
other than powers of 2; these results follow closely the 150 1250 5120 184CO 15552 84375
160 1280 '1184388RO 15625 116400
relative efficiency values derived in an earlier section by 162 1296 540n 39366 16000 814110
counting complex multiplications, except that radix 5 is 110 192
1350
1440
5625
5160
40000 16200
163810
901100
40500 911?5
substantially better than predicted. 200 1458 5832100960 16875 92160
216 1500 6000
In Table III, we list the numbers up to 100 000 contain 225 1536 6075
1011012
43200
11280
17496
9"t?
93750
ing no prime factor greater than 5 to aid the user in select 240 1600 61444374e 18000 96000
243 1620 615045000 18225 91200
ing efficient values of II. 250 1128 6400460110 181032 98104
25 6
When compared with Brenner's FORTRAN subroutine 270 1800
1875
648046656
6'161
46815
18150
19200
911415
100000
[6] on the CDC 6400 computer, FFT was about 8 percent
faster for radix 2, about 50 percent faster for radix 3 and
5, and about 22 percent faster for odd prime radix�7.
similar to an ALGOL procedure REALTRAN given elsewhere
Brenner's subroutine also requires working storage array
[7] by the author.
space equal to that used for data when computing other
The real data values are stored alternately in the arrays
than radix- 2 transforms.
A and B,
The FORTRAN style in the subroutine FFT was designed
to simplify hand compiling into assembly language for the A(1), B(l), A(2), B(2), · ..A(n), B(n),
CDC 6600 to gain improved efficiency.Times on the CDC
then we
6600 for the assembly language version are approximately
l/lO of those shown in Table II.The register arrangement CALL FFT(A, B, n, n, n, 1)
of the CDC 6600 is well suited to the radix- 2 FFT; the
CALL REALTH (A, B, n, 1).
author has written a subroutine occupying 59 words of
storage on this machine, including the constants used to After scaling by 0.5/11, the results in A and B are the
generate all needed trigonometric function values, that Fourier cosine and sine coefficients, i.e.,
computes a complex FFT for 11 = 1024 in 4 2 ms.
a., = A(k + 1)
Transforming Real Data
b. = B(k + 1)
As others have pointed out previously, a single-variate fork-O, 1,·· . ,1I,withb-b.=O.1heinverseoperation,

Fourier transform of 211 real data values can be computed
CALL REALTR (A, B, ft, - 1)
by use of a complex Fourier transform of dimension II. In
Appendix II, we include a FORTRAN subroutine REALTR, CALL FFT(A, B, ft, ft, ft, - 1),
SINGLETON: ALGORITHM FOR MIXED RADIX FFT
169
after scaling by 1/2, evaluates the Fourier series and leaves that it could have been used here to transform the square
the time domain values stored free factors of n. This alternative has not been tried, but
the potential gain,if any, appears small.
A(l), B(l), A(2), B(2) ... A(n), B(n)
References
as originally.
The subroutine REALTR, called with ISN=1, separates (I] J. W. Cooley and J. W. Tukey, "An algorithm for the machine
calculation of complex Fourier series," Math. Comp., vol. 19,
the complex transforms of the even-and odd-numbered pp. 297-301, April 1965.
data values,using the fact that the transform of real data (2] W. M. Gentleman and G. Sande, "Fast Fourier transforms for
fun and profit," 1966 Fall Joilll Comp"ter COllf , AFlPS Proc.,
has the complex conjugate symmetry vol. 29. Washington, D. C.: Spartan, 1966, pp. 563-578.
(3) R. C. Singleton, "An ALGOL procedure for the fast Fourier trans
form with arbitrary factors," COIllIIIIIII ACM, vol. II, pp. 776-
for k=1, 2, .. . ,n-I,then performs a final radix-2 step 779, Algorithm 339, November 1968.
[41 N. M. Brenner, "Three FORTRAN programs that perform the
to complete the transform for the 211 real values.If called Cooley-Tukey Fourier transform," M.LT. Lincoln Lab., Lex
with ISN= -I. the inverse operation is performed. The ington, Mass., Tech. Note 1967-2, July 1967.
pair of calls (5) R. C. Singleton, "On computing the fast Fourier transform,"
Commllll. ACM, vol. 10, pp. 647-654, October 1967.
CALL HEALTH (A, B, n, 1) (6) G. D. Bergland, "A fast Fourier transform algorithm using
base 8 iterations," Math. Comp., vol. 22, pp. 275-279, April
CALL HEALTH (A, B, n, - 1) 1968.
"
(7) R. C. Singleton, ALGOL procedures for the fast Fourier trans
form," COIllIllUII. ACM, vol. II, pp. 773-776, Algorithm 338,
return the original values multiplied by 4, except for
November 1968.
round-off errors. (8) I. J. Good, "The interaction algorithm and practical Fourier
Time on the CDC 6400 for 11=1000 is 0.100 second, series," J. Roy. Stat. Soc., ser. B, vol. 20, pp. 361-372, 1958;
Addendum, vol. 22, pp. 372-375, 1960.
and for 11=2000, 0.200 second. Time for REAtTR is a
linear function of n for other numbers of data values.
The rms error for the above pair of calls of REALTR was
1.6X 10-14 for both 11=1000 and n=2ooo.
Conclusion
We have described an efficient algorithm for computing

the mixed radix fast Fourier transform and have illus
trated this algorithm by a FORTRAN subroutine FFT for
computing multivariate transforms.The principal means
of improving efficiency is the reduction in the number of
complex multiplications for an odd prime factor p of II to
approximately
n(p - l)(p + 3)/tp.
The algorithm also permutes the result in place by pair

interchanges for the square factors of n, using additional
temporary storage during permutation only when n has
two or more square-free factors.
A second subroutine REALTR for completing the trans
form of 2n real values is given,allowing efficient use of a
complex transform of dimension II for the major portion
of the computing in this case.
By use of these two subroutines, Fourier transforms
can be computed for many possible values of II, with
nearly as good efficiency as for II a power of 2. This ex
panded range of values has been found useful by the
author in speech and economic time series analysis work.
Before Cooley and Tukey's paper [1]. Good [8] pre
sented a fast Fourier transform method based on decom
posing n into mutually prime factors.This algorithm uses
a more complicated indexing scheme than the Cooley
Tukey algorithm, but avoids the rotation factor mUltipli
cations. While the restriction to mutually prime factors is
an obstacle to general use of Good's algorithm, we note
170
A Linear Filtering Approach to the Computation of the
Discrete Fourier Transform
L. I. BLUESTEIN
Sylvania, Waltham, Mass.
cussion, we consider the effect of a sampled filter whose

IIiIitII W....... Proceuina impulsive response is
CIIIIrmall: C.II. RIder hr = exp (+j7Tr2/N) (2)

MIT Uncoln Laboratory, Lexington, Mass.
FRIDAY, NOVEMBER S. 1968,2:30 P.M.

on a sampled time function. After some mathematical
manipulation we find it is possihle to generate the dis
FAIRFAX ROOM, SHERATON BOSTON
crete Fourier transform provided that:
• The X
k ,Ire multiplied by a time varying gain, e-}rrk"'\',
before entering the filter.
Introducfion • N zeros are appended to input sequence.
Recently there has been a good deal of interest'" in • The n'h output of the filter is multiplied by another
algorithms which rapidly compute the discrete Fourier time varying gain, e-}""'''.
transform (OFl') of an array of points. This transform is • Only the outputs of the device at instants being be
the array of N numbers, A., n = O. 1 •• • . • N - 1. de6ned tween N and 2N -1 are interpreted as heing the DFT
by the relation array A,I'
These considerations are depicted in Figure 1.
N-'
An = L Xk exp (-27Tjnk/N) (1) Therefore, we have reduced the prohlem to one of
k=O elliciently realizing a sampled chirp filter, whose im
pulsive response is given by (2). :,\Ioreover, the mathe
in which Xk is the k'h sample of a time series consisting matical development indicates that we need only match
of N (possibly complex) samples andj = V-l. The current Eq. (2) for 0 .. r .. 2N - 1 ,
interest stems from the application of the OFT in such One implementation utilizes existing fast Fourier
signal processing applications as the evaluation of power transform techniques (which are usually programmed for
spectra, autocorrelation functions, and the responses of N a power of two) to synthesize this sampled chirp filter,·
digital filters. since any digital filter can be so realized.? The number
A number of elegant algorithms have been devised3.'.' of operations required is of the order of 3N' log! N',
which systematize the formation of the sums and products where N is the smallest power of two greater than or
involved in computing the A •. The result of the investi equal to 2N. This technique is especially applicable for
gations is that the A. may be computed in about N log! N use on a large general-purpose computer on which a
operations' (when N = 2·) rather than the N2 operations fast Fourier transform technique is usually available.
which would be required by a direct approach to Eq. (1). Thus we arrive at the first result of our study.
These techniques, in general, depend on N being highly An alternate technique is to realize the sampled chirp
composite and are simplest to program or implement filter as a special-purpose digital device. To pursue this
when N is a power of two or four. aim, we take the z-transform of Eq. (2) and, as mentioned
The approach adopted in this paper is to find a digital previously, constrain II, to be zero for r ;;. 2N. The resul
filter which, when operating on a sequence that is gen tant, 11(z), can only be conveniently summed for N=m',
erated from the Xk sequence will produce a sequence where m is an integer. Under these conditions, it may be
that is related to the A. array. A similar sequence of shown that
operations has been proposed for continuous signals.8
This technique mixes the function to be transformed with m-' 1- z-2m2
11 (z) � z-t eJTrt2/N (3)
a chirp signal and passes the result through a chirp filter. 1 + eJ2Trtim z-m
=
t�
This paper begins with the sampled data analogue of that
work and develops two results. The first permits the Thus, lI(z) may he re,llized by a bank of m filters liS
computation of the discrete Fourier transform with the shown in Figure 2. (A box labeled z-Z represents a delay
number of required operations proportional to N log. N of x units.) Since the impulsive response after 2m2 -1
for any N. The second develops an especially simple units does not concern us, we may open circuit the link
algorithm for this computation when N = m2; the number connected to the box labeled z-t-.m'. The number of
of computations is then proportional to N3/2. opemtions required is about 3N m, since we need only
count opemtions which take place during the crucial
An Algorithm Suggested by Chirp Filtering interval between 0 and 2N-1, This number is of the
For analog waveforms, a chirp filter is one whose fre same form (within a multiplicative constant) as that re
quency response is the bandpass equivalent of exp (jf!) quired by the Cooley-Tukey algorithm for N = 711' and m
over a range of frequencies. The impulsive response of a prime. Because of the apparent simplicity of Figure 2,
this filter is proportional to e}lrr"'. Now let us enter the it is suggested that an algorithm based on a combination
realm of sampled systems. :'\Iotivated by the above dis- of Figures 1 and 2 be investigated when the DFT is to
218-NEREM RECORD-I968
171
he computed on <\ machine especially l'ollstructed for 5. Gcutl('mun. \\'. �t. and Sunde. C .• "Foi,st Fourier Tmnsfol"llls fur Fun
mul Profit." HJfi6 F..,II Joint CUlliputer C()lIf(·rc.�nc."t. AFIPS Pruc., vol. 29.
that purpose. The application that we have in mind here pp. 563·5711.
is when N, the numher of points being transformed, is 6. �I()rruw, \\'. E., Jr., �Iilx J.. r"tril'Ck., J.. ""d K"rp. D., .. " Heill Time
of moderate size (around 256) and is fixt,d. I-�()urh!r Tmnsformcr/' �1.I.T. Lin<.'oln 1 ...,humlory Group Heport 36 (:·4;
July 16, HJ63.
7. Stm·kham. T. G., "High Speed Convulution imel Correhltion." 1966
Original MiUluscript Hec(�ived J til}' 2�). H)68. Spring Joint Cnmputer Conrcrenc.·e, AFIPS Pro,,·.• vol. 2R. \\'ashingtoll.
1. Co(.·hrdu. J. \"., et. ai., '·'\\'h"t is the FOIst Fouri,'r Tmnsform?" Proc. D.C.. S"arlan WOO; 1>1>. 22!)·233.
IEEE, vo\. 55, no. 10, pp. 1664·1674; Odoher 1967. K. \\'e us'nully lake 1m operation to nwan IIm1tiplic.'atioll hy a cOllllJlex
2. Brighiun, E. O. und �1()rr()w. R. E., "The Fust Fouril·rTrnnsform.lEEl-: numht'r pins nil attendant t,'Ol11pututions. sin<.'e IIlnltipli(,"ntjoll of this sort
S,JCctrrml, vol. 4, 110. l� PI). 63·70; Dc('.'cmhcr H)67. on U COllllmh!r requires un inordinate amcmnt of time.
3. Cooley, J. \\', cmd Tukcy, j. \V., "An Alf,!orithm (or the �Iildline C••leu 9. This ohscr\'utiol1 is due to C. Had(!r uf the �(.I.T. Liu<"uin Lahomtory.
lation of Complex Fotlri(�r Seril's," Melila. Com,JU'., vol. 19. P)l. 297·301;
April 1965.
4. Cood, I. J .• "TheInh�mdi()n Algorithm llud Pmcti(."ul Fourier Amll),sis:'
]. Roy. St"ti.t. Soc., Ser. II, vo\. 20, pp. 361·372; 19511.
J---+
-_-tl
= \ xK K � N-I
INPUT
10 NSK� 2N·1
TIME VA,ctYING m FILTERS

PHASf SHIFnfl.S
FIGURE I-A Fourier Transformer. FIGURE 2-A Realization of lI(z).
NEREM RECORD·1968 - 219
172
Introduction
The basic theory underlying the analysis and design of

digital filters is weIl advanced (although by no means
An Approach to the complete) and quite a few summaries of theoretical results
are now available in the engineering literature [I]-[4).
However, the impact of digital filtering theory has not yet
Implementation of been felt by most of the engineers and technicians who
design and use the wide variety of filters presently con
Digital Filters structed from RLC or crystal circuits. This has been due,
in part, to a general unawareness of the possibilities of
digital filtering and also, until recently, to the prohibitive
LELAl'iD B. JACKSON, Member, IEEE
complexity and cost of constructing most digital filters.
JAMES F. KAISER, Associate Member, IEEE
Hence, digital filter implementation has been confined
HENRY S. McDONALD, Member,IEEE
primarily to computer programs for simulation work or
Bell Telephone Laboratories,Inc. for processing relatively smaIl amounts of data, usuaIly
Murray Hill, N. J.
not in real time. However, the rapid development of the
integrated-circuit technology and especiaIly the potential
for large-scale integration (LSI) of digital circuits now
promise to reverse this situation in many instances and to
Abstract make many digital filters more attractive than their ana
An approach to the implementation of digital filters is presented that log counterparts, from the standpoints of cost, size, and
employs a small set of relatively simple digital circuits in a highly regu reliability.
lar and modular configuration, well suited to LSI construction. Using In this paper, we present an approach to the physical
parallel processing and serial, two's-complement arithmetic, the implementation of digital filters which has the following
required arithmetic circuits (adders and multipliers) are quite simple, features.
as are the remaining circuits, which consist of shift registers for delay
and small read-only memories for coefficient storage. The arithmetic I) The filters are constructed from a smaIl set of rela
circuits are readily multiplexed to process mUltiple data inputs or to tively simple digital circuits, primarily shift registers and
elTect multiple, but dilTerent, filters (or both), thus providing for effi adders.
cient hardware utilization. Up to 100 filter sections can be multiplexed 2) The configuration of the digital circuits is highly
in audio-frequency applications using presently available digital cir modular in form and thus well suitcd to LSI construction.
cuits in the medium-speed range. Thc filters arc also easily modified to 3) The configuration of the digital circuits has the flexi
realize a wide range of filter forms, transfer functions, multiplcxing bility to realize a wide range of filter forms, coefficient
schcmes, and round-olT noise le\'cls by changing only thc contents of accuraices, and round-off noise levels (i.e., data accura
the read-only memory and/or thc timing signals and the length of the cies).
shift-register delays. A simple analog-to-digital converter, which uses 4) The digital filter may be easily multiplexed to pro
delta modulation as an intermediate encoding process is also pre cess multiple data inputs or to effect multiple, but differ
sented for audio-frequency applications. ent, filters with the same digital circuits, thus providing
for efficient hardware utilization.
After a brief review of general digital filter forms, the

advantages of serial, two's-complcment, binary arithmetic
in the implemcntation of digital filters are discussed. The
rcquired arithmetic circuits (adder/subtractor, comple
menter, and multiplier) are then described, followed by
the techniques for multiplexing. Finally, several examples
are presented of multiplexed digital filters that have been
constructed and tested. A description of a simple analog
to-digital converter for relatively low-frequency applica
tions is also included.
Canonical Forms
The transfer characteristics of a digital filter are com

monly described in terms of its z-domain transfer func-
Mi.IIlUS\!riPI rc..:ciwd M.IY 6, 1968. tion (I],
IH.& TRANSACTIO!'\S 0:-1 AUJ)IO A!'\J) I,L[CTROACOUSTICS VOL. Au-16, No.3 SEPIHIIII'R 1968
173
"'0
fig. !. The direct form for a digital filter.
fig. 3. The parallel farm for a digital filter

fig. 2. The cascade form for a digital filter.
fig. 4. (A) Second-order section for digital 011-

pass filter In cascade farm. (B) Alternate con
figuration for digital all· pass filter In cascade
farm.
(A)
"
-
Lair;
- -
i.O
II(z) = ---- (I)
"
1+ Lb,z-;
i-I
where Z-l is the unit delay operator. There are a multitude (B)
of eqUIvalent digital circuit forms in which (I) may be __ + I-
~ ��
realized, but three canonical forms, or variations thereof,
are most often employed. These forms are canonical in the �It
,-I ,-I
sense that a minimum number of adders, multipliers, and
delays are required to realize (I) in the general case. The
first of these forms, shown in Fig. I, is a direct realization
+
of (I) and as such is called the direct form. It has been
pointed out by Kaiser [5] that use of the direct form is
usually to be avoided because the accuracy requirements
on the coefficients {aj I and {hi I are often severe. There
fore, although the implementation techniques presented rather than a mixed set of first- and second-order factors
here are applicable to any filter form, we will not spe for real and complex roots, respectively, to simplify the
cifically consider the direct form. implementation of the cascade form, especially when mul
The second canonical form corresponds to a factoriza tiplexing is employed. If n is odd, then the coefficients a!2.
tion of the numerator and denominator polynomials of and {32i will equal zero for some i. The a!2i multipliers are
(I) to produce an H(z) of the form shown in dotted lines in Fig. 2, because for the very com
mon case of zeros on the unit circle in the z-plane (cor
responding to zeros of transmission in the frequency re
sponse of the filter) the associated a!2i coefficients are
unity. Thus, for these a!2i coefficients, no multiplications
where m is the integer part of (n+ 1)/2. This is the cascade are actually required.
form for a digital filter,
depicted in Fig. 2. Second-order The third canonical form is the parallelform, shown in
actors (with real coefficients) have been chosen for (2) Fig. 3, which results from a partial faction expansion of
IEEE TRANSACfIONS ON AUDIO AND ELECfROACOUSTICS SEPTEMBER 1968
174
(I) to produce gree of parallel processing is possible in the implementa
tion of a digital filter and this may be achieved by provid
(3) ing multiple adders and multipliers with appropriate inter
connections. Economy is then realized by using serial
arithmetic. and by sharing the adders and multipliers
where "Yo=o,./b. and we have again chosen to use all sec
(using the multiplexed circuit configurations to be de
ond-order (denominator) factors. Note that all three
scribed) insofar as circuit speed will allow.
canonical forms. are entirely equivalent with regard to the
In addition to a significant simplification of the hard
amount of storage required (II unit delays) and the l1um
ware, serial arithmetic provides for an increased modu
bcr of arithmetic operations required (211+ 1 multipli
larity and flexibility in the digital circuit configurations.
clltions and 211 additions pcr sampling period). As previ
Also, the processing rate is limited only by the speed of
ously noted. however. the cascade form requires signifi
the basic digital circuits and not by carry-propagation
cantly fewer multiplications for zeros on the unit circle
times in the adders and multipliers. Finally, with serial
and is thus especially appropriate for filters of the band
arithmetic, sample delays are realized simply as single
pass and the band-stop variety (including low-pass and
input single-output shift registers.
high-pass filters).
The two's-complement representation [7] of binary
Another interesting filter form may be derived for the
numbers is most appropriate for digital filter implementa
special case of an all-pass filter (APF), i.e., a filter or
tion using serial arithmetic because additions may pro
"equalizer" with unity gain at lill frequencies. The transfer
ceed (starting with the least significant bits) with no ad
function for a discrete APF has the general form [6]
vance knowledge of the signs or relative magnitudes of the
.. numbers being added (and with no later corrections of the
2: bn_.z-i obtained sums as with one's-complement). We will
IIA(Z)
i-O
assume a two's-complement representation of the form
ft
(4)
2: b,Z-i (6)
i-O
which represents a number (Il) having a value of

Thus, with oi=b.-i and bo= I, the direct form can be used
to implement (4). However, to reduce the accuracy re N-l
quirements on the filter coefficients, a modified cascade Ii = - Ilo + 2: Ii;Z-i, (7)

i_I
form can be derived for the APF, corresponding to an
H,I(Z) of the form
where each Ili is either 0 or I. Thus, the data is assumed to
lie in the interval
-1 � /l < 1, (8)
Second-order sections for the cascade form of the APF with the sign of the number /l being given by the last bit
are shown in Fig. 4. Fig. 4(A) is a straightforward modifi (in time) /lo.
cation of the standard cascade form in Fig. 2. Note that An extremely useful property of two's-complement
because the (3li multiplier may be shared by both the feed representation is that in the addition of more than two
forward and feedback paths, only three multiplications numbers, if the magnitude of the correct total sum is
are required per second-order section rather than four. small enough to allow its representation by the N avail
The number of multiplications may be further reduced by able bits, then the correct total sum will be obtained
using the form of Fig. 4(B), which requires only two mul regardless of the order in which the numbers are added,
tiplications per second-order section. But now, two addi even if an overflow occurs in some of the partial sums.
tional delays are required preceding the first second-order This property is illustrated in Fig. 5, which depicts num
section to supply appropriately delayed inputs to the first bers in two's-complement representation as being arrayed
section. Therefore. the cascade form of Fig. 4(B) requires in a circle, with positive full scale (1- 2-N+1) and negative
a total of II multiplications and 11+2 delays for an I1th full scale (-1) being adjacent. The addition of positive
order APF. addends produces counterclockwise rotation about the
circle, whereas negative addends produce clockwise rota
tion. Thus, if the correct total sum satisfies (8), no infor
Serial Arithmetic
mation is lost with positive or negative overflows and the
Using any of the canonical forms described in the pre correct total sum will be obtained.
ceding section, all of the coefficient multiplications and This overflow property is important for digital filter
many of the additions during a given Nyquist interval implementation because the summation points in the
may be performed simultaneously. Therefore, a high de- filters often contain more than two inputs (see Fig. 3);
JACKSON et 01.: IMPLEMENTATION Of DIGITAL FILTERS
175
o
Fig. 7. Serial twoOs-complement subtractor.
Fig. 5. Illustration of overflow

property in two's-complement
binary representation.
Fig. 8. Serial twaO, complementer.
-8
Fig. 6. Seriol twoos·complement adder.
CAIIIW
CLEAI!
....
AUIIIID )--
- >---L...!... __ ...J
and then adding the complemented subtrahend to the
minuend. To complement a number in two's-complement
representation, each bit of the representation is inverted
and a one is then added to the least significant bit of the
although it may be possible to argue that because of gain inverted representation (i.e., 2-N+l is added to the inverted
considerations the output of the summation point cannot number). The corresponding serial subtractor circuit is
overflow, there is no assurance that an overflow will not shown in Fig. 7. The subtrahend is inverted and a one is
occur in the process of performing the summation. Note added to the least significant bit by clearing the initial
that this property also applies when one of the inputs carry bit to one, rather than to zero as in the adder. This is
to the summation has itself overflowed as a result of a accomplished by means of two inverters in the carry feed
multiplication by a coefficient of magnitude greater than back path, as shown.
one. A separate two's complementer (apart from a subtrac
tor) may also be constructed ; such circuits are required in
the multiplier to be described. This operation is imple
Arithmetic Unit
mented with a simple sequential circuit which, for each
The three basic operations to be realized in the imple sample, passes unchanged all initial (least significant) bits
mentation of a digital filter are delay, addition (or subtrac up to and including the first "I" and then inverts all suc
tion), and multiplication. As previously mentioned, serial ceeding bits. A corresponding circuit is depicted in Fig. 8.
delays (Z-I) are realized simply as single-input single A serial multiplier may be realized in a variety of con
output shift registers. Realizations for a serial adder (sub figurations, but a special restriction imposed by this im
tractor) and a serial multiplier are described in this sec plementation approach makes one configuration most ap
tion. The adders and multipliers, including their inter propriate. This restriction is that no more than N bits per
connections, will be said to comprise the arithmetic unit sample may be generated at any point in the digital net
of the digital filter. work because successive samples are immediately adja
A serial adder for two's-complement addition is ex cent in time and there are no "time slots" available for
tremely simple to construct [7]. As shown in Fig. 6, it more than N bits per sample. Hence, the full (N+K)-bit
co nsists of a full binary adder with a single-bit delay product of the multiplication of an N-bit sample by a
(flip-flop) to transfer the carry output back to the carry K-bit (fractional) coefficient may not be accumulated
input. A gate is also required in the carry feedback path before rounding is performed. However, using the multi
to clear the carry to zero at the beginning of each sample. plication scheme described below, it is possible to obtain
Accordingly, the carry-clear input is a timing signal, the same rounded N-bit product without ever generating
which is zero during the least significant bit of each sample more than N bits per sample. Rounding is usually prefer
and is one otherwise. able, rather than truncation, to limit the introduct ion of
A serial two's-complement subtractor is implemented extraneous low-frequency components (de drift) into the
by first complementing (negating) the subtrahend input filter.
IEEE TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS SEPTEMBER 1968
176
MULTIPLICAND (DATA): 8 0·8 18 a . • . . 8._1
MULT'PLIER �IT SECT'ONS
MULTIPLIER (COEFF.): " 00'0,0." "aK
aK " 80.8,8•.... 8._1

(So -0) OK _I" 80·8,8a··.. 8._1
.a
°0°°. OZ' . . '0"_1
OK_a" 80.8,8..... S._,
•
bo·b.bz,·,:bN_1
. .
fo'f, fa' ".f._1

Fig. 10. Serial multiplier, showing modularity.
00" 80.8,8..... 8._,
vo·v'Va....VN-1
+ f"_�
PRODUCT(DATA): Po' Ii; i>; . . . . p._,
.
Fig 9. Serial multiplication using no more than Fig. 11. A multiplier bit section.
N bits per data word.
The serial multiplication :;cheme is depicted in Fig. 9.

To simplify the required hardware, both the multipJicand
(data) a nd the multiplier (coefficient) are constrained to be OLD
PARTIAL
positive, with appropriate sign changes being made before SUM '---+--+-____...J
and after the multipli;ation. Thus ao=o in Fig. 9 and the

sign of the multiplicand (0) is stored separately as SON a.
For convenience, the multiplier (ex) will be assumed to lie
in the interval.
-2 < a < 2. (9) (12)
Although (9) is not necessarily applicable in the general and to round (12) to N bits, only the value of the bit iv-I
case, it does hold for the denominator coefficients of the is required. Thus, before truncating 1,,-1. its value is
cascade and parallel forms, and usually for the numerator stored elsewhere to be added in the final step to g, as
coefficients of the cascade form as well. The magnitude of shown in Fig. 9, to obtain the rounded product (p).
the multiplier is thus represented in Fig. 9 as The serial multiplier corresponding to the scheme de
scribed above is shown i n Fig. 10. The absolute value of
(10) each incoming datum (0) is taken and its sign (SON 0) is
added modulo-2 to the coefficient sign (SON a) to de
which represents a value of termine the product sign (SON a·c5). The (positive) multi
plicand is then successively delayed and gated by the ap
K
propriate multiplier bits (ai) and the partial sums are ac
I ex l = Lai2-i• (11)
cumulated in the multiplier bit sections. A single multi
i.O
plier bit section is shown in Fig. II. The least significant
The restriction in (9) and the resulting representation i n bit of each partial sum is truncated (gated to zero) by the
(10) and (11) are in n o way essential t o the serial multi appropriate timing signal 'i+I' Rounding is accomplished
pliers to be described, but are meant only to be representa by adding in the last truncated bit (!v-I) via the * input to
tive of the mUltiplication scheme. The sign of the multi the last bit section. Finally, the sign of the product is in
plier is also stored separately as SON a. serted using a two's-complementer such as that in Fig. 8.
The multiplication scheme in Fig. 9 proceeds as follows. At high data rates, it may be necessary to insert extra
The multiplicand is successively shifted (delayed) and flip-flops between some or all of the multiplier bit sections,
multiplied (gated) by the appropriate bit (exi ) of the multi as shown in dotted lines in Fig. 11, to keep the propaga
plier. These delayed gated instances of the multiplicand tion delay through the adder circuits from becoming ex
are then added in the order indicated. After each addi cessive.
tion (including the "addition" of aKa to 0), the least Several observations concerning the serial multiplier
significant bit of the resulting partial sum (i.e., a.\·_ha.\'_h should be made at this point. First, there is a delay of K
b... -h • ,Iv-I) is truncated to prevent the succeeding
•
• bits in going through the multiplier and this delay must
partial sum from exceeding N bits in length. Note that be deducted from a delay (Z-I) that precedes the multi
these bits may be truncated because the full unrounded plier. (If the extra flip-flops in Fig. 1 1 are required, then
product would be the multiplier will yield a delay of up to 2K bits.) In addi-
lACKSON et al.: IMPLEMENTATION OF DIGITAL FILTERS
177
tion, the absolute value operation at the first of the multi
plier requires a delay of N bits (to determine the sign of
each incoming datum) and this must be deducted from a
preceding delay as well. Thus, to use this serial multiplier,
the rl delays of the digital filter must be at least N+K
(or up to N+2K) bits in length. This in turn implies, as we
shall see in the next section, that some form of multiplex
ing is required if the multipliers are to be implemented i n RUD-oNLY 1
this manner. I COEfFICIENT I
I IIE�
�---------�
I
Another observation is that the adders in the multiplier

bit sections do not require carry-clear inputs because only Fig. 12. Type-I multiplexing for M input
positive numbers are being added. However, output prod channels.
uct overflows (in the sense of Fig. 5) are possible with

coefficients (a) of magnitude greater than one. It may thus
be necessary to restrict the amplitude of the data into cer
tain multipliers to prevent output overflows; while in cer
tain other multipliers, these overflows may be perfectly
allowable as discussed in the preceding section. In gen filter. A diode matrix provides a very fast and inexpensive
eral, however, the inputs to a summation must be scaled form of read-only memory (ROM) for this purpose. If,
so that an overflow will not occur in the final output of however, all M channels are to be filtered identically and
the summation. Such overflows represent a severe non no type-2 multiplexing is employed, the coefficients may
linearity i n the system, and very undesirable effects can be wired into the multipliers and no ROM is then re
result i n the output of the filter. quired. This is indicated by the dotted lines enclosing the
ROM in Fig. 12. In this case, adders must be included
only in those multiplier bit sections of the arithmetic unit
Multiplexing
for which the corresponding multiplier bits (ai) equal one.
Having realized the three basic digital filter components In many cases, a number of different, but similar, filters
(delays, adders, and multipliers), the filter itself may be or subfilters are required for the same input signal. For
implemented by simply interconnecting these compo example, all of the second-order subfilters comprising the
nents in a configuration corresponding to one of the digital cascade or parallel forms are similar in form, differing
forms, canonical or otherwise, for the filter. However, if only in the values for the multiplying coefficients (see
the input rate bit (sampling rate times bits per sample) is Figs. 2 and 3). Type-2 multiplexing refers to the imple
significantly below the capability of the digital circuits, mentation of these different (sub)filters with a signal mul
the digital filter can be multiplexed to utilize the circuits tiplexed filter. An example of a multiplexed second-order
more efficiently. The various multiplexing schemes are of filter is shown in Fig. 13. As with type- l multiplexing. (he
two mai n types: 1) the multiplexed filter may operate combining of M separate filters into one multiplexed filter
upon a number of input signals simultaneously or 2) the requires that the bit rate in the filter be increased by a fac
multiplexed filter may effect a number 01 (different) filters tor of M and that the shift-register delays (rl ) also be in
for a single input signal. A combination of these two types creased by a factor of M to MN bits in length. The co
is also possible. efficients are supplied from the read-only coefficient mem
To multiplex the filter to process M simultaneous in ory, which cycles through M values for each coefficient
puts (type I), the input samples from the M sources are during every Nyquist interval. Data are routed in, around,
interleaved sample by sample and fed (serially) into the and out of the filter by external routing switches, which
filter. The bit rate in the filter is thus increased by a factor are also controlled from the ROM.
of M. The shift-register delays must also be increased by As an example of type-2 multiplexing, consider the im
a factor of M to a length of MN bits. Otherwise, the filter plementation of a 12th-order filter in cascade form, using
is identical in its construction to the single-input case. In the multiplexed second-order filter in Fig. 13. Here M 6, =
particular, the arithmetic unit containing the adders and so the bit rate in the filter must be (at least) 6N bits per
multipliers is the same; it just operates M times faster. Nyquist interval. During the first N bits of each Nyquist
The output samples emerge in the same interleaved order interval, the input sample is introduced into and is pro
'
as the input and are thus easily separated. Type-I multi cessed by the arithmetic unit with the mUltiplying coeffi
plexing is depicted in Fig. 12. cients (al> a2, (310 (32) of the first subfilter in the cascade
If the M channels in Fig. 12 are to be filtered differently form. The resulting output is delayed by N bits (rl/.\!)
or if type-2 multiplexing is also employed, the filter co and fed back via the input routing switch to become the
efficients are stored in a separate read-only coefficient input to the filter during the second N-bit portion of the
memory and are read-out as required by the multiplexed Nyquist interval. This feedback process is repeated four
178
SHIFT - REGISTER DELAY UI;IT
"' .. BITS "'N BITS I; BITS
()-+o
IN :
I
I
I
I
I
I
I
I I
I
I
I I
I
I I r
I l.-, I I .---- ..
L ___ _ _ ___ -. : : : : • _ _ __ _ _____ _
,..-l--L...l-l-.L-I.....,
fig. 13. General second-order filter for type-I and type-2 fig. 14. Digital touch-tone receiver. showing multiplexed filters and
multiplexing. nonlinear units.
more times, with the filter coefficients from the ROM all of the HPF's and BRF's ar e multiplexed into one sec
being changed each time to correspond to the appropriate ond-order filter (combi ned type-I and type-2 multiplex
subfilter in th e cascade form. The sixth (last) filter output ing), the eight BPF's are multiplexed into another seeond
during each Nyquist interval is the desired 12th-order order filter (type-I multiplexi ng with ROM coefficie nts),
filter output. The parallel form, or a combination of cas and the eight LPF's are m ultiplexed i nto on e first -order
cade and parallel filters may be realized using the filter in
, section (type- l multiplexi ng with wired-in coe fficients) .
Fig. 13 by simply changing the bits in the ROM which The nonlinear units are readily m ultiplexed as well and
control the switching sequences of the input and output operate directly upon the interleaved o utput samples from
routi ng sw itches. the filters.
Some of the parameters of the experimental TTR de
sign are as follows: the sampling rate is 10 K samples/se c
Sample System ond with a n initial quantization (A/D conversion) of7
bits/sample; the data word length (N) with in the filter is
As an example of th is approach to the impkmentation
10 bits/sample ; the filter coefficients have 6-bit fractional
of digital filters, we . will take a n experimental, all-digital
parts (K) ; and, as previously stated, the m ultiplexi n g
touch-tone receiver (TTR) which has been designed and
factor (M) is eight. Thus, the bit rate w ithin the filter
constructed at Bell Telephone Laboratories, Inc. The
(sampling rate X bi ts/sample X M) is 800 K bits/seco nd .
digital TTR is depicted in block-diagram form in Fig. 14
The number of bits required to represen t the data and the
This is a straightforward digital version of the standard
coefficients of the TTR were determined through comput
analog TTR described elsewhere [8]. Without going into
er simulation of the system. The hardware required to
the dctailcd operation of the system, we simply note that
implement this des ign consists primarily of about 40 serial
the combined high-pass filters (HPF's) are third order,
adders and 400 bits of shift-registe r storage.
the band-rejection filters (BRF's) are each sixth order, the
bandpass filters (BPF's) are each second order, and the
low-pass filters (LPF's) are each first order . The other
Analog-to-Digital Converter
signal-processing units r equired are the limiters (LIM's),
half-wave rectifiers (HWR's), and level detectors. These In most applications of digital filters, the initial inpu t
nonlinear operations are, of course, easily implemented signal is in analog form and must be converted to digital
in digital form. form for processing. It mayor may not be necessary to
A mUltiplexing factor of M 8 is employed in the ex
= recon ver t the digital output signal to analog form, de
perimental TTR to combine all of the units enclosed in pending upon the application. Digital -to -analog (D/A)
dotted lines into single multiplexed units. In particular, conversion is a relatively straightfo rward and inexpensive
JACKSON el af.: IMPLEMENTATION OF DJOITAL FILTERS
179
process, but the initial analog-to-digital (A/D) conversion RESET
is often quite a different situation. For audio-frequency

applications, however, a simple and very accurate A/D
converter may be implemented using delta modulation
(,i-mod) as an intermediate encoding process. This A/D
converter will now be described.
The A/D converter is depicted in Fig. 15, with a
,i-mod encoder being shown in Fig. 16. The ,i-mod en
coder produces a series of bivalued pulses (O's and I's) FI,. 15. SIMple AID _811.. using delta lIIOCIulalion.
which, when integrated, constitute an approximation to
the input analog signal. The number of I's (or O's) oc
curring during each (eventual) sampling interval is ac
cumulated in the counter as a measure of the change in
signal amplitude during that interval. At the end of the fig. 16. Deltaotnadulallan -......
interval, this number is transferred to the storage register
and the counter is then reset to its initial value to begin
counting during the next interval. The appropriate initial
value for the counter is minus one-half the number of
,i-mod pulses per sampling interval. .-
The number stored in the storage register for each

sampling interval is the difference between the desired
sample value and the preceding sample value. Thus, if
these difference samples are accumulated in a simple first
order accumulator, as shown, the full digital sample
values result. A small "leak" is introduced into the ac
cumulator by making the feedback gain slightly less than
ratio of the ,i-mod rate to the sampling rate. Thus, for
one (1 -2-K) to keep the dc gain of the accumulator from
example, to effect IO-bit A/D conversion using this
being infinite. This prevents a small dc bias in the ,i-mod
scheme, the ,i-mod rate must be approximately 1000
output from generating an unbounded accumulator out
times the sampling rate.
put. The accumulator leak should be matched by a simi
lar leak in the integrator of the ,i-mod encoder.
Since the accumulator is itself a first-order digital filter,
Conclusions
it can be implemented and multiplexed using the same
circuits and techniques as previously described. The An approach to the implementation of digital filters has
multiplexing may be either with other A/D conversion been described that employs a small set of relatively simple
channels or with other filters, or both. Note, however, digital circuits in a highly regular and modular configura
that if the digital filter following the A/D converter has, tion, well suited to LSI construction. By using parallel
or can have, a zero at Z= I (corresponding to zero of processing and serial, two's-complement arithmetic, the
transmission at dc), this zero would cancel the pole sup required arithmetic circuits (adders and multipliers) are
plied by the accumulator (with no leak). Therefore, in this greatly simplified, and the processing rate is limited only
case, the accumulator may be eliminated from the A/D by the speed of the basic digital circuits and not by carry
converter (along with the zero at Z= 1 from the following propagation times. The resulting filters are readily multi
filter), making the A/D conversion even simpler. plexed to process multiple data inputs or to effect multi
The accuracy of A/D conversion implemented in this ple, but different, filters (or both) using the same arithme
manner is a function of the following factors: I) the ratio tic circuits, thus providing for efficient hardware utiliza
of the ,i-mod rate to the sampling rate, 2) the sensitivity tion. A multiplexing factor of 100 or so is possible in
of the input comparison amplifier in the �-mod encoder, audio-frequency applications, using presently available
and 3) the match between the accumulator leak (if an digital logic in the medium-speed range. The filters are
accumulator is required) and the integrator leak in the also easily modified to realize a wide range of filter forms,
,i-mod encoder. There is also a maximum-slope limitation transfer functions, multiplexing schemes, and round-off
with delta modulation and an accompanying slope over noise levels (i.e., data accuracies) by changing only the
load noise results if this slope limitation is exceeded [9]. contents of the read-only coefficient memory and/or the
Assuming that the error resulting from 2) and 3) and timing signals and the length of the shift-register delays.
from slope overload is negligible, a useful rule of thumb For audio-frequency applications, a simple A/D convert
for the A/D conversion accuracy is that the number of er may be implemented using delta modulation as an
quantization levels effected equals approximately the i ntermediate encoding process.
180
REFERENCES [5] J. F. Kaiser, "Some practical considerations in the realization
of linear digital tilters," 1965 Proc. lrd Allertoll COIlf. 011 Circuit
[ 1] J. F. Kaiser, "Digital tilters," in System Analysis by Digital alld System Theory, pp. 621-633.
Computer, J. F. Kaiser and F. F. Kuo,Eds. New York: Wiley, [6] R. B. Blackman,unpublished memorandum.
1966, pp. 218-85. [7] Y. Chu, Digital Computer Design FUI/damentals. New York:
i2] C. M. Rader and B. Golj, "Digital tilter design techniques in McGraw-HiII,1962.
the frequency domain," Proc. IEEE, vol. 55, pp. 149-171, [8] R. N. Battista, C. G. Morrison, and D. H. Nash: "Signaling
February 1967. system and receiver for touch-tone calling," IEEE Tral/s. Com
( 3] R. M. Golden,"Digital tilter synthesis by sampled-data trans mUllications alld Electrollics, vol. 82, pp. 9-17, March 1963.
formation," this issue, pp. 321-329. [9] E. N. Protonotarios,"Slope overload noise in differential pulse
[4] B. Gold and C. M. Rader, Digital Processillg ofSigllals. New code modulation systems," Bell Sys. Tech. J., vol. 46,pp. 2119-
York: McGraw-Hili, 1969. 2161,November 1967.
JACKSO:-ll!t ul.: NI'LEMENTATION OF DIGITAL FILTERS
18 1
Introduction
Software implementations of the Cooley-Tukey fast

Fourier transform (FFT) algorithm [I] have in many
cases reduced the time required to perform Fourier anal
Fast Fourier Trans ysis by nearly two orders of magnitude. Even greater
gains can be realized through special-purpose hardware
form Hardware designed specifically for performing the FFT algorithm.
In order to design this hardware one should examine:
Implementations
I) The reasons for building special-purpose hardware;
An Overview 2) The options that should be considered;
3) The tradeoffs that must be made;
4) O ther considerations.
GLENN D. BERGLAND. Member, IEEE
Bell Telephone Laboratories, Inc

Whippany, N. J. 07981 Special-Purpose Hardware
Most applications for special-purpose FFT processors

result from signal processing problems which have an in
herent real-time constraint. Examples include digital
vocoding, synthetic-aperture radar mapp ing, sonar signal
Abstract processing, radar signal processing, and digital filtering .
This discussion served as an introduction to the Hardware Implemen In these examples, a processing rate slower than real-time
tations Session of the IEEE Workshop on Fast Fourier Transform would overload the system with input data or lead to
Processing. It introduces the problems associated with implementing
worthless results.
the FFT algorithm in hardware and provides a frame of reference for
Other applications involve off-line processing where
the volume of data makes processing impractical unless
characterizing specific implementations. Many of the design options
a dedicated machine is used. Studies in radio astronomy
applicable to an FFT processor are described, and a brief comparison
and crystallography have involved Fourier analysis tak
of several machine organizations is given.
ing nearly a month to perform on a general-purpose com
puter. In several cases, the use of special-purpose hard
ware could reduce this time to less than a day and make
a corresponding reduction in cost.
Experience with the FFT processor built by Bell Tele
phone Laboratories [2] indicates that the cost reduction
resulting from special-purpose hardware is nearly as
great as the reduction which came with the Cooley-Tukey
algorithm. The FFT signal processing system costs 5
times less per hour than a large general-purpose computer
while performing the fast Fourier transform algorithm 20
times faster. Thus, on the FFT part of the processing, a
100 to 1 cost saving is possible. As a result of this reduc
tion in cost, people who had not even heard of the fast
Fourier transform three years ago are now finding that
they cannot get along without it.
Options
The many and varied forms of the FFT algorithm have

been described at length in the literature [3]-[1l]. The
execution times and memory requirements of software
implementations of these algorithms can be evaluated
rather conveniently. The criteria which apply to evaluat
ing hardware implementations, however, are not as
Manuscript received February 14, 1969. easily specified.
IEEE TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS VOL. AU-17, No.2 JUNE 1969
18Z
Fig. 1. Fast Fourier transform flow diagram for N=8.
The first design choice often made concerns constrain Fig. 2. The functional block dia·
ing the number of data points to be analyzed to being a gram of a sequential fast Fourier
power of 2 (i.e., N 2m for m = 0, I, 2, ... ). This repre

=
transform processor.
sents a tradeoff of versatility for cost and performance. TABLE

This choice is not as limiting as it was previously. The MEMORY
convolutional form of the FFT [II] can be used to find DATA

MEMORY r---+
the discrete Fourier transform for any value of N even
;
though the FFT processor performing the convolutions ARITHMETIC
requires that N be a power of 2. UNIT
If the input time series consists of real numbers, the INDEXING

So CONTROL
second option which should be considered involves the
use of a real-input algorithm [JO] or a modified complex
input algorithm [12]. By exercising this option, a two-to
one improvement in performance and a two-to-one
reduction in storage can be achieved.
simplified block diagram of the resulting processor is
While a radix-8 algorithm may be near optimum for
shown in Fig. 2.
a software implementation, the simplicity of the radix-4
This organization is similar to that of a small general
and radix-2 algorithms is a considerable advantage when
purpose computer except that the table memory, data
dealing with hardware. Since the cost is proportional to
the number of options included, the use of only one basic memory, arithmetic unit, and control unit can usually all
operate concurrently. Since the processor operates on
operation in the radix-2 algorithm in many cases offsets
batches of data, a real-time environment would usually
the additional computation required.
dictate that some buffering precede or be incorporated
The organization of an FFT processor is usually dic
into the data memory.
tated by the performance and cost requirements and the
For this discussion, the sequential processor will be
technology assumed. Four families of machine organiza
characterized as having: I) one arithmetic unit, 2)
tions, which have appeared in some form in the literature,
(N/2) log2 N operations performed sequentially, and 3) an
will be described and characterized.
execution time of B(N/2) log2 N �s where B is the time
required for performing one basic operation. The reorder
The Sequential Processor ing of the Fourier coefficients can be done either in place
or during I/O.
The first hardware implementation considered involves
implementing the basic operation shown in Fig. I, in
The Cascade Processor
hardware [l3]. This basic operation (i.e., one complex
multiplication followed by an addition and a subtraction) To improve the performance of the processor. parallel
can be applied sequentially to the 12 sets of data shown ism can be introduced into the flow diagram shown in
in the diagram. The same memory can be used to store Fig. I [14J, [IS]. By using a separate arithmetic unit for
the input data, the intermediate results, and the resulting each iteration, the throughput can be increased by a factor
Fourier coefficients. Since only one basic operation is in of log2 N. In the diagram of Fig. I this means that the
volved and the accessing pattern is very regular, the first arithmetic unit performs the operations labeled
amount of hardware involved can be relatively small. A I through 4, the second performs operations 5 through 8,
BERGLAND: fFT HARDWARE IMPLE�IENTATlONS-AN OVERVIEW
183
A.U. A.U. A.U.
2 ..
A.U. A.U. A.U.
Fig. 3. The functional block diagram of a cascade

processor. A.U. A.U, A.U,
A.U. A.U. A.U.
Fig. 5. The functional block diagram of

Fig. 4. The functianal block diagram an arraY processor.
of a parallel-iterative processor.
MEM
within each iteration in parallel. A simplified block dia

gram of the resulting processor is shown in Fig. 4.
In practice, this organization would often be combined
with the sequential processor so that only the degree of
parallelism actually needed is implemented. Without the
use of large-scale integration, the cost of N/2 arithmetic
units (as shown in Fig. 4) would usually be prohibitive.
The problem areas associated with this organization
involve the communication bctween the arithmetic units
and the generation of the required sine and cosine func
tions. Both of these problems, however, have been dealt
and the third performs operations 9 through 12. A sim with successfully [17], [18].
plified block diagram of the resulting processor is shown For this discussion, the parallel iterative processor will
in Fig. 3. be characterized as having I) N/2 arithmetic units, 2)
In this processor, the buffering required for processing N/2 operations performed in parallel, 3) m iterations per
a continuous stream of data is incorporated directly into formed sequentially, and 4) an execution time of B(log2 N)
the organization and takes the form of a delay line. p.s.
For this discussion, the cascade processor will be char
acterized as having: 1) m arithmetic units, 2) m iterations The Array Analyzer
performed in parallel, 3) N/2 operations performed se
For completeness, a processor can be considered in
quentially, 4) an execution time of (B·N) p.S per record,
which all 12 of the operations of Fig. 1 are performed in
and 5) buffering incorporated within the processor in
parallel [19]. By pipelining three different sets of data
the form of time delays.
through this processor simultaneously, the effective ex
Although this discussion is directed toward radix-2
ecution time is simply the time required for performing
algorithms, versions of this processor for radix-4 algo
one basic operation. A simplified block diagram is shown
rithms, real-valued algorithms, and arbitrary-radix algo
in Fig. 5 for the example of N 8. =
rithms should be apparent. It should also be apparent

For this discussion, the array analyzer will be charac
that this organization lends itself well to multichannel
terized as having: 1) (N/2) (I0g2 N) arithmetic units, 2)
operation where sets of samples are interleaved. Although
(N/2) (log2 N) operations performed in parallel, and 3)
this adds to the amount of buffering within the processor,
an execution time of B p.s. At this point in time, the cost
in many applications it allows one to realize this buffering
of this approach severely limits its application.
via relatively inexpensive drum or disk storage used in
the form of time delays. As recently shown by Sande [16],
this same magnetic drum can also be used to perform the Summary
reordering. By assuming the relatively fast processing rate of one

p.s per basic operation, the approximate processing rates
The Parallel Iterative Processor
of the different families of analyzers can be compared in
A third alternative for improving performance involves Table I. The time saving of special-purpose hardware is
introducing parallelism within each iteration. By using apparent when these times are compared to the typical
four arithmetic units, the operations labeled 1 through 4 500 000 p's execution time of a large general-purpose com
can be performed in parallel before performing operations puter. Note that you can introduce as much parallelism
5 through 8 in parallel, etc. The processor performs the into the fast Fourier transform algorithm as your prob
iterations sequentially, but performs all of the operations lem demands.
IEEE TRANSACTIONS OR AlIDIO AND ELECTROACOllSTICS JUNE 1969
184
TABLE I
Ambitious Processing Rotes for N=1024 With 8-1llS
Execution Processing
Machine Arith.
Time Rate
Organization Units
(}tS) (samples/s)
Sequential 1 5000 200 000

Cascade 10 1000 1 000 000·
Parallel Iterative 512 10 100 000 000
Array 5120 I 1 000 000 000
• Capable of doing two channels at this rate simultaneously.
Tradeoffs looked are preprocessing, postprocessing, data reduction,

and diagnostics.
In designing a special-purpose fast Fourier transform
The preprocessing function often involves applying
processor, there are many possible tradeoffs which can be
an appropriate data window, forming redundant records,
made in terms of cost, speed, and accuracy. These trade
and buffering. Postprocessing frequently involves smooth
offs are related specifically to the arithmetic unit, the
ing, interpolating, automatic gain controlling, convolving,
memory, the control unit, and the algorithm being im
and correlating.
plemented.
Since every set of N numbers put into an FFT processor
As described previously, there is usually a cost penalty
results in N Fourier coefficients coming out, any data re
associated with allowing N to be other than a power of 2.
duction functions which can be defined are usually worth
If enough options are added, the control unit can rapidly
building into the hardware.
overtake the arithmetic unit in complexity.
The problem of designing a set of diagnostic checks for
When versatility is more important than speed, it may
the FFT processor is helped considerably when the pro
be worth changing from a hard-wired control unit to a
. cessor is a computer attachment. Even then, however, the
software or microprogrammed control unit.
testing function should be given at least as much thought
The choice of a core memory, a semiconductor mem
as the FFT algorithm.
ory, or some combination of the two in a sequential
processor will generally be dictated by the performance
and cost requirements placed on the system. In the cas
Conclusions
cade processor, a plot of cost versus speed can be very
erratic since the internal storage could be anything from Four families of FFT machine organizations have been
magnetic drum storage to semiconductor shift registers. defined which represent increasing degrees of parallelism,
The arithmetic units for each type of processor can vary performance, and cost. Within each family, however, the
considerably. The cost ratio between building five inex cost and performance of two processors can still differ
pensive arithmetic units and building one very expensive widely due to choices of arithmetic unit, memory, and
one can dictate the organization of the processor. One control. It is clear that, given enough parallelism and
also has the option of building one fast real-input arith enough money, FFT processing can be done digitally at
metic unit, or four slower arithmetic units tied together rates less than one sample per second or as high as a
to form a complex-input arithmetic unit. In some cases a billion samples per second.
combinatorial (or static) multiplier is required. In other
cases the less expensive iterative multiplier is fast enough.
References
The speed is also highly dependent on whether the num
[I] J. W. Cooley and J. W. Tukey, "An algorithm for the machine
bers are represented in fixed point, floating point, or calculation of complex Fouricr series," Math. Comp., vol. 19,
"poor man's floating point" (i.e., with an exponent com pp. 297-301, April 1965. . .
mon to the whole array). [2] R. Klahn, R. R. Shively, E. Gomez, and M. J. Gilmartin,
"The time-saver: FFT hardware," Electronics, pp. 92-97,
A more complete list of design options is given in the June 24. 1968.
FFT processor survey [20]. [3] J. W. Cooley, "Complex finite Fourier transform subroutine,"
SHARE Doc. 3465, September 8, 1966.
[4] W. M. Gentleman and G. Sande, "Fast Fourier transforms
for fun and profit," 1966 Fall Joint Complller Conf, AFlPS
Other Considerations
Proc., vol. 29. Washington, D. C.: Spartan, 1966, pp. 563-
In many cases, people tend to focus on only the FFT 578.
[5] G. D. Bergland, "The fast Fouricl transform recursive equa
hardware since it is the best defined part of the system. tions for arbitrary length records," Mat". Comp., vol. 21, pp.
Those parts of the problem which should not be over- 236-238, April 1967.
BERGLAND: FFT HARDWARE IMPLEMENT.\TIONS-AN OVERVIEW
185
[6] N. M. Brenner. "Three FORTRAN programs that perform the [13] R. R. Shively, "A digital processor to generate spectra in real
Cooley-Tukey Fourier transform," M.I.T. Lincoln Lab., Lex time," IEEE TrailS. Computers, vol. C-l7, pp. 485-491, May
ington, Mass., Tech. Note 1967-2. July 1967. 1968.
[7] R. C. Sillglcton, "On computing the fast Fourier transform," [14] G. D. Bergland and H. W. Hale, "Digital real-time spectral
COIllIllIIII. ACM. vol. 10, PD. 647-·654, October 1967. analysis," IEEE TrailS. Electrollic Computers, vol. EC-16, pp.
'
[8) G. D. Bergland, "A fast Fourier transform algorithm using 18D-185, April 1967.
base 8 iterations," Math. Comp., vol. 22, pp. 275-279, April [15] R. A. Smith, "A fast Fourier transform processor," Bell Tele
1968. phone Labs., Inc., Whippany, N. J., 1967.
[9] M. C. Pease, "An adapt ion of the fast Fourier transform [16) G. Sande, University of Chicago, Chicago, Ill., private com
for parallel processing," J. ACM, vol. 15, pp. 252-264, April munication.
1961l. [17] M. C. Pease, 1II, and J. Goldberg, "Feadbility study of a
[10) G. D. Bergland, "A fast Fourier transform algorithm for real special-purpose digital computer for on-line Fourier analysis,"
valued series," Conllmlll. ACM, vol. II, pp. 703-710, October Advanced Research Projects Agency, Order 989, May 1967.
1968. [18] G. D. Bergland and D. E. Wilson, "An FFT algorithm for a
[II) L. I. Bluestein, "A linear filtering approach to the computation globp I, highly-parallel proce3sol," this issue, pp. 125-127.
of the discrete Fourier transform," 1968 NEREM Rec., pp. [19] R. B. McCullough, "A real-time digital spectrum analyzer,"
218-219. Stanford Electronics Labs., Stanford, Calif., Sci. Rept. 23,
[12) J. W. Cooley, P. A. W. Lewis, and P. D. Welch, "The fast November 1967.
Fourier transform algorithm and its applications," IBM Re [20] G. D. Bergland, "Fast Fourier transform hardware implemen
search Paper RC-1743. February 1967. tations-a survey," this issue, pp. 109-119.
186
BIBLIOGRAPHY
Journal Articles
1. E. B. Anders et al., "Digital Filters," NASA ContractoT Report CR-136,

December 1964.
2. Archambeau et al., "Data Processing Techniques for the Detection

and Interpretation of Teleseismic Signals," Proc. IEEE, Vol. 53,
No. 12, pp. 1860-1884, December 1965.
3. J. S. Bailey, "A Fast Fourier Transfonn Without Multiplications,"

Proceedings of the 1969 Polytechnic Institute of Brooklyn Symposium
on Comp�ter Processing in Communications.
4. D. C. Baxter, "The Digital Simulation of Transfer Functions,"

National Research Laboratories, Ottawa, Canada, DME Report No. MK-13,
April 1964.
5. W. R. Bennett, "Spectra of Quantized Signals," Bell System Tech. J.,

Vol. 27, pp. 446-472, July 1948.
6. G. D. Bergland, "The Fast Fourier Transfonn Recursive Equations for

Arbitrary Length Records," Math. Comp., Vol. 21, PI' .236-238, 1967.
7. G. Bergland, "Fast Fourier Transfonn Hard\vare Implementations-An

Overvie\v," IEEE Tra:,s. Audio., Vol. 17, P? 104-103, June 1969.
8. G. D. Bergland, "A Fast FOllrier Transform Algorithm �sing Base 8

Iterations," Math.Somp., Vol. 22, P? 275-279, April 1968.
9. J. E. Bertram, "The Effect of Quantization in Sampled-Feedback

Systems," Trans.--Amer. Inst.
-- ----
Elect. Engrs.,
-------. - Vol. 77, Pt. 11, p. ]]7,
1958.
10. C. Bingham, M. D. Godfrey, and J. W. Tukey, "Modern Techniques of

POT'I7er Sp,ectrum Estimation," IEEIL!.rans-! A:.tdio., Vol. 15, No. 2,
pp. 56-66, June 1967.
11. L. 1. Bluestein, "A Linear Filtering App.::oach to the Co:nputation of

the Discrete Fou':-ieo: Transform," 1968liERfu':LJi�cor4., pp. 218-219.
12. B. Bogert, M. Healy, and J. Tu�ey, The Quefrency Alanysis of Time

Series for Echoes, in M. Rosenblatt (ed.), Procee�ings of_the
�nposi�"!.J.imp. Series Analy�'!:.�, Ch�p. 15, pp. 209-243,
John !viley & Sons, Inc., New York, 1963.
13. R. Boxer, "A Note on :-:Iumerical Transform Calculus," ------

Proc. IRE,
Vol. 45, No. 10, pp. 1401-1406, October 1957.
14. R. Boxer and S. Thaler " A Silnplified Method of Solving Linear

,
and Nonlinear Systems, I Prq,�..!.-l��, Vol. 44, P? 89-101, Janu.3.ry 1956.
15. N. M. Brenner, "Three Fortran Programs that perfm:"Ill the Cooley

Tu:<ey Fourier Transform," Technical Note 1967-2, Lincoln Laboratory,
M.I.T., Lexingto�, Mass., July 28, 1967.
16. E. O. Brigham and R. E. MorrO\v, "The Foist Fourie:c TransfO'!:'m,"

IEE!E!��':!£!!, Vol. 4, No. 12, P? 63-70, December 1967.
17. H. W. Briscoe and P. L. Fleck, "A Real Time Comp'.1ting System for
LASA," presented at Sprin3 .Joint Comp'.1ter Conference, Vol. 28,
pp. 221-228, 1966.
187
18. P. W. Broome, "Discrete Orthononnal Sequences," Asso�. Computer..
��in��, Vol. 12, No.2, pp. 151-168, April 1965.
19. P. Broome, "A Frequency Transfonnation for Numerical Filters,"
Proc. IEE E, Vol. 52, No. 2, pp. 326-327, Febru�ry 1966.
20. V. CapP'3llini, "Design of Som·e Digiti'll Filters with Ap;,:>lication to

Spectral Estimatio:l and Data Compression," Proceedings of the 1969
Polytechnic Institute of Brooklyn Symp::>sium on Computer Pro�essing
in Co��nications.
21. E. W. Cheney and H. L. Loeb, "Generalized Ratio:lal Ap;:>:oximatio:l,"

J. SLA�, Numerical Analysis, B, Vol. 1, pp. 11-25, 1964.
22. W. T. Cochran et al., ''What Is the Fast Fourier Trans fonn?,"

�EEE�ran�. _��dio�, Vol. 15, No. 2, pp. 45-55, June 1967.
23. J. W. Cooley, "Application of the Fast FO'.lrier Transfonn Method,"

Proceedings of the IBM Scientific Computing Symposium, Thomas J.
Watson Research Center, Yorktmvn Heights, New Yo:'.:'k, 1966.
24. J. W. Cooley, P. Le\Jis, an:! P. D. Welch, "The Use of the Fast FOllrier
Transform Algorithm fo:'.:' the Estimation of Spectra and Cross Spectra,"
Proceedings of the 1969 Polytechnic Institute of Brooklyn Symp::>sium
on Compater Processing in Communications.
25. J. W. Co::>ley, P. Lewis, and P. Welch, "The Fast FO'.lrier Transfo:::m

and Its Applications," IBM Res. Pap·er RG-1743, Febru..:J.r y 9, 1967.
26. J. W. Cooley, P. Le\vis, and P. Welch, "Historical Notes on the

Fast Fourier Transform," IEEE 'l)::.<!.ns. t\.udio., Vol. 15, No. 2,
pp. 76-79, June 1967.
27. J. W. Cooley, P. Lewis, and P. Welch, '�pplication of the Fast

Fourier Transfonn to Compatation of FOllrier Integrals, Fourier
Series, and Convolution Integrals," IEEE Trans. Audio ._, Vol. 15,
No. 2, pp. 79-84, June 1967.
28. J. W. Co::>ley and J. W. Tu�ey, "An Algorithm for the Machine

Computation of Complex Fourier Series," ---
Math. Com...E.!."
-
-
Vol. 19,
pp. 297-301, April 1965.
29. M. J. Corinthios, "A Time-Series Analyzer," Proceedings of the

1969 Polytechnic Institute of Brooklyn S�nposiwn on Computer
Processing in Communications.
30. T. H. Crystal and L. Ehrman, "The Design and Applications of Digital

Filters with Complex Coefficients ). " IEEE Trans. Audio. , Vol. 16,
No. 3, pp. 315-321, September 1961S.
31. G. C. Danielson a nd C. Lanczos, "Some Improvements in Practical

Fourier Analysis, and Their Ap;,:>lication to X-ray Scattering from
Liquids," J. Fran�din lnst�, 233, pp. 365-380, 435-1+52, April-May 1942.
32. C. J. Drane, "Directivity and BeaIll\vidth Approximations for Large

Scanning Dolph-Chebyshev Arrays," AFCRL Physical Science Research
Papers No. 117, AFCRL-65-472, June 1965.
188
33. R. Edwards, J. Bra:lley, a41d J. Knm;Tles, "Comp.arison of Noise Per
formances of Progra�ning Methods in the R�alization of Digital
Filte:::-s," Proceedings of the 1969 Polytechnic Institute of Broo�dyn
Symposium on Computer Processing in Comm'.lnications.
34. M. J. Ferguson and P. E. Ma.ntey, "Automatic Frequency Control via

Digital Filtering," IE�ILlr..a.!!:"-!.-A_udio!.., Vol. 16, No. 3, pp. 392-398,
September 1968.
35. P. E. Fleischer, "Digital Realization of Complex Transfer Functions,"

Simu�.!:.i�ll, Vol. 6, No. 3, Pi>. 171-180, March 1966.
36. G-AE CO':1cep':s Subconunitt ee, "On Digital Filtering," .!.EEE Tr..a�_�u::liC2..!..,
Vol. 16, No. 3, pp. 303-315, September 1968.
37. W. M. Gent leman and G. Sande, "Fast Fourier Transfo=ms-Fo-::' Fun and
Profit," presented at 1966 Fall Joint Computer Conference, AFIPS
Proc., 29, P? 563-578, 1966.
38. G. Goertzel, "A� Algorithm for the Evaluation of Finite Trigonometric

Series," A-�·ricaIl.Math�t:..!:..c_ Vol. 65, pp. 34-35,
January 1958.
39. B. Gold and K. Jo:::-dan, "A Note on l)igital Filter Synthesis,"

Pr��EE�, Vol. 56, No. 10, pp. 1717-1718, Octo�er 1968.
40. B. Gold .and K. Jordan, "A Direct Search Procedure fo:;:- Designing
Finite DUTation Impillse Response Filters," IEEE Trans. Au:lio.,
--.---.-----
Vol. 17, No. 1, pp. 33-36, Ma.rch 1969.
41. B. Gold, A. V. Oppenheim, and C. M. Rader, "Theory and Implementation of

the Discrete Hilbert Transforms," Polytechnic Institute of Brooklyn, Pro
ceedings of the Symposium on Computer Processing in Communications, 1969.
42. B. Gold and L. R. Rabiner, '�nalysis of Digital and Analog FO-::'ffiant
Synthesizers," IEEE Trans. Audio., Vol. 16, March 1968.
43. B. Gold and C. M. Rader, "Effects of Qua.nt ization Noise in Digital

Filters," p:-esented at 1966 Spring Joint Comput er Conference,
AFIPS Proc., 28, p;>. 213-219, 1966.
44. R. M. Golden, "Digital Filter Synthesis by Sampled-Data Transfo:tn.'ltion,"

.!.�EE. Tr..a..£s..,.!.. Aud�, Vol. 16, No. 3, Sep::ember 1968.
45. R. M. Golden, "Digital Comp1lter Simulation of a S ampled Data Voice
excitedVocoder," J. Acoust. Soc. Am., 35, P? 1358-1366, 1963.
46. R. M. Golden and J. F. Kaiser, "Design of Wideband Sampled Data

Filters," Bell Syst em Tech. J., Vol. 43, No. 4, pp. 1533-1546, Pt. 2,
July 1961+.
47. R. M. Golden and S. A. White, "A Holding Technique to Reduce Num�er

of Bits in Digital Transfer Functions," IEEE Trans. Audio., Vol. 16,
No. 3, pp. 433-437, September 1968.
48. 1. J. Good, "The Interaction Algo:-ithm .'lnd Practical Fourier Series,"

J. Royal Statistical Soc., Ser. B, 20 (1958), 361-372; Addendum,
--
22 (1960), 372-375:--'
49. D. J. Goodman, "Optimum Digital Filters for the Estimation of

Continuous Signals in Noise," Proceedings of the 1969 Polytechnic
Institute of Brooklyn Symposium on Computer Processing in
Communications
189
50. R. J. Graham, "Determination and Analysis of Numerical Smoothing
Weights," NASA Technical Report No. TR-R-179, December 1963.
51. B. F. Green, J. E. K. Smith, and L. Klem, '�pirical Tests of an

Additive Random Number Generator, " J. Assoc. Computer Machinery,
Vol. 6, No. 4, pp. 527-537, October 1959.
52. H. D. Helms, I�ast Fourier Transform Method of Computing Difference

Equations and Simulating Filters, " IEEE Trans. Audio., Vol. 15,
No. 2, pp. 85-90, June 1967.
53. H. D. Helms, '�onrecursive Digital Filters: Design Methods fo:::

Achieving Specifications on Frequency Response, " IEEE Trans. Audio • •
Vol. 16, No. 3, September 1968.
54. G. E. Hyliger, "The Scanning Function Approach to the Design of

Numerical Filters," Rep::>rt R-63-2, Martin Co. , Denver, Colorado,
April 1963.
55. F. B. Hills, "A Study of Incremental Comp'J.tation by Difference

Equation," Servomechanisms Labo:::-atory Rep::>:::-t No. 7849-R-l,
Massachusetts Institute of Technology, Cambridge, Massachusetts,
May 1958.
56. H. Holtz and C. T. Leondes, "The Synthesis of Recursive Filters,"

J. Assoc. Computer Machinery, Vol. 13, No. 2, pp. 262-280, April 1966.
57. L. Jackson, J. F. Kaiser, and H. S. McDonald, "An App:i:'oach to the

Implementation of Digital Filters," IEEE Trans. Audio. , Vol. 16,
No. 3, pp. 413-421, September 1968.
58. J. F. Kaiser, "Design Methods fOT Sampled Data Filters," Proe.

First Allerton Conf. on Circuit and System Theory, pp. 221-236,
November 1963.
59. J. F. Kaiser, "A Family of WindO·N Functio:1s Having Nearly Ideal

Properties," unp'J.blished memorandum, November 1964.
60. J. F. Kaiser, "Some Practical Considerations in the R�alization of

Linear Digital Filters," Proceedings of the Third Allerton Conference
on Circuit and System TheoTY, pp. 621-633, October 20-22, 1965.
61. J. F. Kaiser, "Digital Filters," in F. F. Kuo and J. F. Kaiser (eds. ),

"System Analysis by Digital Comp'J.ter," Chap. 7, John Wiley & SO:1S, Inc. ,
New York, 1966.
62. T. Kaneko and B. Liu, "Roundoff Error of Floating-Point Digital

Filters," Proceedings of the Sixth Allerton Conference on Circuit
and System Theory, pp. 219-227, October 1968.
63. J. L. Kelly, C. L. Lochbaum, and V. A. Vyssotsky, "A Block Diagram

Compiler," Bell S.Ist�_l��, Vol. 40, pp. 669-676, May 1961.
64. J. B. KnO<,vles and R. Ed'Nards, "Simplified An.alysis of Computational

Erro';:'s in a Feedbilck System Incorporating a Digital Comp'J.ter,"
S. I. T. Symposium on Direct Digital Control, London, April 22, 1965.
65. J. B. Knoi.,r1es and R. Edwards, "Effects of a Finite WO!:'d Length

Comp'J.ter in a Sampled-data Feedback System," Proc. Inst. Elec. Engrs. ,
_. -. --'
Vol. 112, No. 6, June 1965.
190
66. J. B. Knm171es and R. Ed-ilards,
Associated Comp'.l�ational Erro::-s," Elect��ic:.
_ s_.h.e_ Vol. 1, No. 6,
pp. 160-161, August 1965.
67. J. B. Knmlles and R. Ed'Nards, "Finite Wo::-d-length Effects in a

Multirate Direct Digital Contt'ol System," Proc....!- IEE, LO':ldon, Vol. 112,
pp. 2376-2384, December 1965.
68. J. B. Knmvles 'in:! S. M. Olcayto, "Coefficient Accuracy and Digital

Filter Respo':lse," IE�'tr_C!.£!-s_�_C_irc:.�it Th�Q!Y., Vo!.. 15, No. 1,
pp. 31-41, March 1903.
69. A. A. Kosya',dn, "The Statistical Theory of Amplitude Quantization,"

Av��i_Teleme�, Vol. 22, No. 6, p. 722, 1961.
70. 1. M. Langenthal, "The Synthesis of Symmetrical Bandpass Digital

Filters," Proceedings of the 1969 Polytechnic Institute of Broo�dyn
Symposiwn on Comt".lter Processing in Com::lunications.
71. R. M. Lerne!.", "Band-p-'3.3s Filters Ttlith Linear Phase," Proc. IEEE,

Vol. 52, pp. 249-268, 1964.
72. P. M. Le,l7is, "Synthesis of Sampled Signal Netwox-ks," ------

IRE Trans.
CircE_it Til��, Vol. 5, No. 1, Pt'. 74-77, March 1958.
73. W. K. LimTill, "Sampled-d.�ta CO:ltrol Systems Studied through

Comparison of S.'3.mpling .... ith Amplitude Modulation," Tran�. AJE:J:�_,
• __
Vol. 70, Part II, pp. 1779-1788, 1951.
74. Mark Tsu-Han Ma, "A Ne\v Mathematical App:-oach for Linear Array
Analysis and Synthesis," Ph. D. Thesis, Syracuse University,
University Microfilms, No. 62-3040, 1961.
75. M. D. MacLaren and G. Marsaglia, "Uniform Random Nwnber Generators,"

:L....ft3SOC., Co�-!.ter MacJlin�, Vol. 12, Pp. 83-89, 1965.
76. C. E. Maley, "The Effect of Parameters on the Roots of an Equatio:l
System," Comp'.lt;.�r Jou-::-nal, Vol. 4, pp. 62-63, 1961-2.
77. M. Mansour, "Ins tability Criteria of Linear Discrete Systerns,"

��tomilti�, Vol. 2, No. 3, pp. 167-178, January 1965.
78. P. E. Mantey, "Convergent Autollk1.tic-Synthesis Pro�edures for

Sampled-Data Networks with Feedback," Stanford Electronics Laboratories,
Technical Report No. 6773-1, SU-SEL-64-112, Stanford University,
Octo�er 1964.
79. P. E. Mantey, "Eigenvalue Sensitivity and State-Variable Selection,"

IEEE Trans. Autollk1.tic C�ntrol, Vol. 13, No. 3, pp. 263-269, June 1968.
80. M. A. Martin, "Frequency Domain Applicatio':ls to Data Processing,"

'IRE Tra� Space Electronic:.��d-1e��etry, Vol. 5, No. 1, Pt'. 33-41,
�
March 1::15:1.
81. M. A. Martin, "Digital Filters fo::- Data Processing," General Electric

Co. , Missile and Space Division, Tech. Info. , Series Repo::-t No.
62-S0484, 1962.
82. H. T. Nagle, Jr. , and C. C. Carroll, "Organizing a Special-Purpose

Computer to Realize Digital Filters fo::- Sampled-Data Systems,"
IEEE Trans. Audio. , Vol. 16, No. 3, pp. 398-413, September 1968.
19 1
83. A. M. Noll, "Cepstrum Pitch Determination, " J. Acoust. Soc. Am. ,
Vol. 41, pp. 293-309, February 1967.
84. D. J. Nor..mk and P. E. Schmid, "A Nonrecursive Digital Filter for

Data Transmission, " IEEE Trans. Audio, Vol. 16, No. 3, p? 343-350,
September 1968.
85. A. V. Oppenheim, "Nonlinear Filtering of Convolved Signals, " Mass.

Inst. Technol. , Res. Lab. Electron. , Quart. Progr. Rept. 80,
pp. 168-175, January 15, 1966. '
86. A. V. Oppenheim, "Speech Analysis-Synthesis System Based on

Homomorphic Filtering, " J. Acoust. Soc. Am., Vol. 45, No. 2,
pp. 458-465, Febru'lry 19�
87. A. V. Oppenheim, "Realization of Digital Filters Using Block-Floating

Point Arithmetic," Submitted for publication to IEEE Transactions
on Audio and Electroacoustic.3_.
88. A. V. Oppenheim and R. W. Sch9.fer, "Homomo::phic Analysis of Speech,"

IEEE Trans. A�dio. , Vol. 16, pp. 221-226, June 1968
89. A. V. Oppenheim, R. W. Schafer, and T. G. Stockham, "The Nonlinear

Filtering of Multiplied and Convolved Signals," Proc. IEEE, Vol. 56,
------
No. 8, Pi'. 1264-1291, August 1968.
90. A. V. Oppenheim and G. Weinstein, "A Bound on the Output of a

Circular Convolution l,01ith Application to Digital Filtering, "
I�EE Tra�s. Audio. , Vol. 17, No. 2, pp. 33-36, June 1969.
91. J. F. A. Ormsby, "Design of Numerical Filters with Applications to

Missile Data Pro,::essing, Ass.oc' Vol. 8, No. 3,
pp. 440-466, July 1961.
92. F. K. Otnes, "An Elementary Design Procedure for Digital Filters, "
IEEE Trans. Audi�, Vol. 16, No. 3, pp. 330-336, September 1964.
93. E. Parzen, "Notes on Fourier Analysis and Spectral Windmys,"

Applied Mathematics and Statistical Laborato::oies, Technical Rep::>rt
No. 48, Stanford University, California, May 15, 1963.
94. M. J. Piovo:3o ,lnd L. P. Bolgiano, Jr. , "Digital Sim'.llation Using

Poisson Transfo:::m Sequances," Proceedings of the 1969 Polytechnic
Institute of Brooklyn Symposium on Computer Processing in Communications.
95. C. Pottle, "On the Partial-Fraction Expansion of a Rational Function

with Multiple Poles by a Digital Computer, " IEEE Trans. Circuit
Theor�, Vol. 11, P? 161-162, March 1964.
96. L. R. Rabiner, R. W. Sch9.fer and C. M. Rader, "The Chirp z-Transform

Algorithm, " IEEf£]rans-!..A..udi�, Vol. 17, No. 2, Pi'. 86-92, June 1969.
97. C. M. Rader, "Speech Compression Sim'.llation Co�piler, " presented at

Spring Meeting of Acoustical Society of America, 1965, ��Acou��
Soc. ftm�.(AJL�, Vol. 37, No. 6, p. 1199, June 1965.
98. C. R'lder, "Discrete Fourier Transfo:::ms When the Number

of Data Samples Is Prime," Proc�IEEE, Vol. 56, No. 6, pp. 1107-1108,
June 1968.
99. C. Rader and B. Gold, "Digital Filter Design Techniques in the

Frequency Dom'iin, " Pro.£.L-!.�E-'�" Vol. 55, No. 2, pp. 149-171,
Febru9.ry 1967.
192
100. C. M. Rader and B. Gold, "Effects of Parameter QU:lntization on the
Poles of a Digital Filter," Pro�. ITr.:E, Vol. 55, No. 5, pp. 688-689,
__
May 1967.
101. R. A. Roberts !:I.nd J. Tooley, "Signal Processing with Limited Memo:'y,"

Pro�eedlngs of the 1969 Polytechnic Institute of Brooklyn Symposium
on Computer Processing in COID:.l1unications.
102. H. H. Robe'l:'tson, "Approxim.:lte .IechnQ.metr!.�s_,

Vol. 7, No. 3, P? 337-403, Augu st 1965.
103. D. T. Ross "Imp�oved Computational Techniques for Fourier Trans

fo�tion, ,I Servomechanisms Laboratory Report, No. 7138-R-5,
Massachusetts Institute of Technology, Cambridge, Massachusetts,
June 25, 1954.
10,4.. P. Rudnick, "Note on the Calcubttion of Fourier Series," Math.

--
� omr::lt ., Vol. 20,

P? 429-430, July 1965.
lOS. D. J. Sakrison, W. T. Fo:,:d, and J. H. Hearne, "The z-Transform of

a Realizable Time Function," I�EE Tr�!!.s. Geo;'3ciellCe Electro:lics,
Vol. 5, No. 2, pp. 33-41, Septeml)er 1967.
106. J. M. Salzer, "Frequency An'i

. lysis
in Real Time," P�Q.c_�IRE, Vol. 42, pp. 457-466, February 1954.
107. 1. W. Sandberg, "F1oi3.ting-Point-Roundoff Accum·.llation in Digital

Filter Realization," BeU System Te�� Vol. 46, P? 1775-1791,
Octobe'i:'
lOB. R. W. Schafer, "Echo �emova1 by Generalized Linear Filtering,"

NEREM Record, pp. llB-119, 1967.
109. R. W. Schafer, "Echo Removal by D;i,screte Generalized Linear Filtering,"

Ph.D. Thesis, Mass!:l.chusetts Institute of Technology, Department of
Electrical Engineering, Cambridge, Mass., February 1968.
110. S. A. Schelkunof£, "A Mathem:ttical Theory of Linear Arrays,"

Bet�stem Tech. ��, Vol. 22, pp. BO-l07, January 1943.
Ill. R. C. Singleton, "A Method for Computing the Fast Fourier Transform
with Auxiliary Memory and Limited High Speed Storage," IEEE Trans_._
Audio., Vol. 15, No. 2, pp. 91-9B, June 1967.
112. R. C. Singleton, "On Computing the Fast Fourier Transform," Comm.

Assoc. � m :lting Machi�ery, Vol. 10, No. 10, P? 647-654,
l�
October 9 7.
113. R. C. Singleton, I�n ALGOL Procedure fo:,: the Fast Fourier Transfonn
with Arbitrary Fsctors," Algo:oithm 339, Comm. Assoc. Computing
Machinery, Vol. 11, No; 11, pp. 776-779, November 1968
114. R. C. Singleton, "ALGOL Procedures for the Fast FO'.1rier Transfo:'m,"

Algorithm 338, Comm. Assoc. Comp·.lting Machin�,
---- Vol. 11, No 11, • .
p. 338, November 1968.
U5. R. C. Singleton, "An Algorithm for Computing the Mixed Radix Fast
Fourier Transform," IEEE Trims. Audio., Vol. AU-17, No. 2, pp. 93-.103,
June 1969
116. J. B. Slaughter, "Quantization Errors in Digital Control Systems,"

IEEE Trans. Auto��tic Control, Vol.9, No. 1, pp. 70-74, January 1964.
193
1l7. D. Slepian and H. o. Pollak, "Prolate Spheroidal Wave Functions,
Fourier Analysis and Uncertainty-I and II, " Bell System Tech. J.,
Vol. 40, No. 1, pp. 43-84, January 1961.
1l8. O. Sorrunoonpin, "Investigation of Quantization Errors," M.Sc. Thesis,

University of Manchester, England, 1966.
119. K. Steiglitz, "The Approximation Problem for Digital Filters," Tech.

Rept., No. 400-56, Department of Electrical Engineering, New York
University, 1962.
12C. K. Steiglitz, "The General Theory of Digital Filters with Applica

tions to Spectral Analysis, " AFOSR Report.No. 64-1664, Eng. Sc.D.
Dissertation, New York University, New York, 1963.
121. K. Steiglitz, '�e Equivalence of Digital and Analog Signal

Processing ! " Information and Control, Vol. 8, No. 5, pp. 455-467,
October 19b5.
122. T. G. Stockham Jr., ''High Speed Convolution and Correlation, "

9
presented at 1 66 Spring Joint Computer Conference, AFIPS Proc.,
Vol. 28, pp. 229-233, 1966.
123. T. G. Stockham, Jr., ''The Application of Generalized Linearity to

Automatic Gain Control, " IEEE Trans. Audio., Vol. 16, No. 2,
pp. 267-270, June 1968.
124. Josef Stoer, ,� Direct Method for Chebyshev Approximation by

Rational Functions, " J. Assoc. Computing Machinery, Vol. 11, No. 1
pp. 59-69, January 1964.
125. D. J. Thomson, "Generation of Gegenbauer Prewhitening Filters by

Fast Fourier Transforming," Proceedings of the 1969 Polytechnic
Institute of Brooklyn Symp�sium on Computer Processing in
Communications.
126. Ya. Z. Tsypkin, "An Estimate of the Influence of Amplitude

Quantization on Processes in Digital Automatic Control Systems,"
Avtom�t. i Telemekh., Vol. 21, No. 3, p. 195, 196�.
127. A. Tustin, I� Method of Analyzing the Behavior of Linear Systems

in Terms of Time Series," Proc. _!.li�, Vol. 94, Part IIA, pp. 130-142,
May 1947.
128. D. G. Watts, "A General Theory of Amplitude Quantizations Tllith

Applications to Correlation Determination," Pro<:..!...-IEE, Vol. 109C,
p. 209, 1962.
129. D. G. Watts, "Optimal Windo�vs for Power Spectra Estimation,"

Mathematics Research Center, University of Wisconsin, MRC-TSR-506,
September 1964.
130. C. S. Weaver, P. E. Mantey, R. W. Lawrence, and C. A. Cole,

"Digital Spectrum Analyzers," Stanford Electronics Laboratories,
Stanford, California, Rept. SEL 66-059 (TR-l09-l/l8l0-l), June 1966.
131. C. S. Weaver, J. von der Groeben, P. E. Mantey, J. G. Toole,

C. A. Cole, J. W. Fitzgerald, and R. W. Lawrence, "Digital Filtering
with Applications to Electrocardiogram Processing," IEEE Trans.
�u:iio., Vol. 16, No. 3, pp. 350-392, Sep�ember 1968.- ----
--
194
132. C. Weinstein, "Quantization Effects in Frequency Sampling Filters,"
222-NEREM Record, p. 222, 1968.
l33. C. Weinstein, "Roundoff Noise in Floating Point Fast Fourier Trans

form Computation," IEEE Trans. Audio. , Vol. 17, No. 3, September 1969.
134. C. Weinstein and A. V. Oppenheim, '� Comparison of Roundoff Noise

in Floating Point and Fixed Point Digital Filter Realizations,"
Proc. IEEE, Vol. 57, No. 6, pp. 1181-1183 (letters), June 1969.
135. P. Welch, "A Fixed-Point Fast Fourier Transform Error Analysis,"

IEEE Trans. Audio. , Vol. 17, No. 2, pp. 151-157, June 1969.
136. W. D. White and A. E. Ruvin, "Recent Advances in the Synthesis of

Comb Filters," IRE Natl. Conv. Record, pp. 186-199, 1957.
137. J. R. B. Whittlesey, "A Rapid �ethod fo:::, Digital Filtering,"

Comm. As�C�u�ing_Machinery, Vol. 7, No. 9, pp. 552-556,
September 190.4".
l38. B. Widrmv, "Statistical Analysis of Amplitude-Quantized Sampled

Data Systems," Trans. AlEE (Applications al!.<i. Industry).., Part 2,
Vol. 79, pp. 555-568, January 1961.
l39. B. Widro'v, "A Study of Rough Amplitude Quantization by Mean of

Nyquist Sampling Theory," Trans. IRE Circuit
- --
Theor:.y, Vol. 3, No. 4,
---
pp. 266-276, December 1956.
140. F. Yates, "The Design and Analysis of Factorial Experiments,"

Harpenden: Imperial Bureau of Soil Service.
141. T. Y. Young, "Representation and Analysis of Signals, Part X.

Signal Theory and Electrocardiography," Department of Electrical
Engineering, Johns Hopkins University, .May 1962.
142. A. 1. Zverev, "Digital MTI Radar Filters," IEEE Trans. Audio. ,

Vol. 16, No. 3, pp. 422-432, September 1968-.----- ---
--
195
1. M. Abramovitz and I. Stegun, Handbook of Mathematical Functions,
Dover Publications, Inc., N�� York, 1965.
2. R. B. Blackman and J. W. Tukey, The Measurement of Power Spectra,

Dover Publications, Inc., N�� York, 1958.
-- -
3. R. B. Blackman, Linear Data-smoothing and Prediction in Theorv and

ti65�ice, Addison-Wesley pUblishing Company, Inc., Reading, Mass.,
4. H. W. Bode, Network Analr.is and Feedback-

Am..2,lifier Design,
D. Van Nostrand, New Yo!: , 1945.
-
5. D. A. Calahan, Modern N�tw0rk _�thesis, Hayden, New York, 1964.
6. G. A. Campbell and R. M. Foster, Fourier Integrals for Practical

-
�li��o�_, D. Van Nostrand, N�� York, 1948.
7. P. M. DeRusso, R. J. Roy, C. M. Close! State Variables fq,r Engineers,

John Wiley & Sons, Inc., New York, 19b5.
8. J. L. Flanagan, Speech Analysis. SfSthesi�and Perceptio�,

Academic Press, Inc., N�� York, 19 •
9. H. Freeman Discrete Time Syst�, John Wiley & Sons, Inc.,

New York, ! 965.
10. H. L. Garabedian (ed.), Approximation of Functions, Elsevier

Publishing Co., 1965. �-- -
11. J. E. Gibson, Nonlinear Automatic CO'!ltrol, McGraw-Hill Book Company,

Inc., N�� York, 1963.
-
12. B. Gold and C. Rader, ru.&ita1 Processing of Signals, McGraw-Hill

Book Company, Inc., N��rk, 1969.
13. U. Grenander and M. Rosenblatt, Statistical Ana1rSiS of -Stationary

Time Series, John Wiley & Sons, Inc., Ne� York, 967.
14. E. A. Guillemin, Theo, of Linear Physical Sys�, John Wiley &

Sons, Inc., New York, 963.
15. E. A. Gui1lemin, S17thesis of Passive Networks, John Wiley & Sons,
Inc., New York, 19 •
16. R. W. Hamming, Numerical Methods for Scientists and Engineers,

McGraw-Hill Book Com�any, Inc., New York, 1962.
17. E. J. Hannan! Time Series Analysis. John Wiley & Sons, Inc.,
New York, 19bO.
18. C. Hasting, Jr., Approximation for D �ital Computers, Princeton

University Press, Princeton, N.J., 1 5.
19. H. M. James N. B. Nichols, and R. S. Phillips, Theorv of
Servomechanlsms, Chap. 5, pp. 231-261, McGraw-Hill Book Company, Inc.,
New York, 1941.
20. C. Jordan, Calculus of Finite Differences, Chelsea Publishing Company,

New York, 1960 (ReprInt of 1939 Edition).
196
21. E. I. Jury, ��led-Dat�Co��ol_�stems, Joh� Niley & Sons, In=.,
New Yo-::-k, 1950.
22. E. 1. Ju::y, Theo':y an:i-A-E.E.l��a_t:.�0!L0f the z-TraIlsfQ!!ll ...!'1etho<!,

John Wiley & Som;, Ine., New York, [got;.
23. F. F. Kuo, �e �'!Q..!'� All"!..lysi��nthes....�s_, John Wiley & Sons, Inc.,

New Yo::k, 176 _.
24. B. KU?, Ana!ysiS2 ::l Synt�e�is of Sample�-Da�a... Control Sys_tems ,

1 _
Prent1ce-Hall, Eng e�ood�rrrs, N.J., [9b.�
25. F. F. �uo and J. F. Kaiser, ��t:.�s Am.lys is b....Y.J!.!&!.t:.a1 Computer,

John W1ley & Sons, In(�., Ne�11 York, 196"6.
26. C. Lanczos, �li�d Analys��, Prentice-Hall, Inc., Engle�ood Cliffs,

N.J., 195 6.
27. J. H. Laning and R. B�ttin, Random Pr��esses in_Automati�Control,

McGra'il-Hill Book Comp:;tny, Inc., Ne'.v YorI<;"1956":""
28. D. P. Lindorff, Theory 0f_Sa�)ed-�at�Control �ste��, John Niley &

Sonc, Inc., New York, 1905.
29. L. M. Milne-Thomson, The Calculus ofJ[inite Diff�!.��_��, Macmillan

and Co., Ltd., Londo�, 1933.
30. E. Mishkin and L. Braun, Jr., Ad ptive Control System�, McGraw-Hill

6
Book Company, Inc., New York, 19 1.
31. A. J. Monroe, Digital Processes fo� Sampled �at��ms, John Wiley

& Sons, Inc., New York, f962.
32. A. Papoulis, The_�ourier Integral_and Its �plicati�, McGraw-Hill

Book Company, Inc., New York, 1962.
33. J. R. Ragazzini and G. F. Franklin, Sampled-data Control Systems,

-
McGraw-Hill Book Company, Inc., Ne� Yo�k, 1958.
34. J. E. Storer, Passive Network Synthesis, McGraw-Hill Book Company,

Inc., New York:-T951.- - --
35. A. Sus�kind, Notes on An�l�-Digital Conversion Teshniques ,

John ·.v�l�y & Sons, Inc., Ne� York;-r�
36. J. T. To!.1, Digit_al and SaEipled-Data Contr��, McGraw-Hill

Book Comp.:my, Inc., Ne\v York, 1959.
37. J. G. Truxal, AU�Q.ITl<lticJe�d:'ack..£.ontr_o��tem Synthesis,

McGraw-Hill Book Company, Inc., New York, 1955.
38. L. Weinberg, Network Analys s and Synthesis, McGraw-Hill Book

�
Compilny, Inc., Ne�v York, 19 2.
39. E. T. Whittaker and G. N. Watson, A Course of Modern Analysis,

-
Cam1:>ridge University Press, Ne�v York, 4th ed., 1927.
40. J. H. Wilkinson, Rounding.Errors in Algebraic Processes, Prentice

Hall, Englewood Cliffs, N.J., 1963.
41. C. H. Wilts, Principles of Feedback Control, Addison-Wesley

Publishing Company, Inc., Reading, Mass., 1960.
197

(Alan V. Oppenheim (Ed.) ) Papers On Digital Signal (B-Ok - CC) PDF

Încărcat de

Informații document

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

(Alan V. Oppenheim (Ed.) ) Papers On Digital Signal (B-Ok - CC) PDF

Încărcat de

Drepturi de autor:

Formate disponibile

Papers on Digital Signal Processing

DIGITAL SIGNAL PROCESSING

Alan V. Oppenheim, Editor

The M.I.T. Press CAMBRIDGE, MASSACHUSETTS, AND LONDON, ENGLAND

Second printing, November 1970

Printed and bound in the United States of America

All rights reserved. No part of this book may be reproduced in

ISBN 0 262 65004 5 (paperback)

Library of Congress catalog card number: 79-101414

for implementing recursive filters. While this approach to implementation

be done is an important and satisfying conceptual contribution.

signal processing techniques focuses on the effect of the finite register

length. The finite register length manifests itself in terms of co­

efficient inaccuracies, roundoff or truncation of the result of mUltipli­

cation, and, if one thinks of the original signal as an analog signal,

cerned with these issues. Specifically, the class of problems which

these papers deal with is the quantization and roundoff effects in

digital filtering and in computation of the discrete Fourier transform

of multiplier roundoff noise. Kaiser studies the coefficient sensitivity

problem in terms of sm�ll perturbations of the coefficients and also

proposes an ap?roach, not restricted to small perturbations, which

utilizes the techniques of the standard root-locus problem. Knowles and

statistically and exp�ess the variance of the roundoff noi se and it

appears in the output. In his paper with Olcayto, Knowles p�oposes a

statistical approach to studying coefficient errors. One of the concerns

in all of these papers is the comparison of digital filters implemented

by means of different network configurations such as the direct fonn,

the cascade form, and the parallel form.

In discussing the effects of multiplier roundoff noise, the above

pap.;rs consi,der digital filter realization using fixed-point arithmetic.

When floating-point arithmetic is used, roundoff errors occur in both

the effect of roundoff errors when floating-point arithmetic is used.

His approach is to provide bounds on the errors that occur. In contrast,

Kaneko and Liu present a statistical app:coach to studying roundoff noise

in floating-point filters. In the note by �einstein and Oppenheim, some

presented, together with a comparison between filter implementation using

fixed-point and floating-point arithmetic. An alternative realization

arithmetic is the use of a block-floating-point structure as proposed in

the paper by Oppenheim. This approach to realizing digital filters may

potentially offer an aqvantage on sm�ll word length computers when high

gain filters are being simulated.

In each of the p�eceding papers the representation of digital filters

is in terms of digital networks. In the paper by Mantey the formulation

is in terms of state-space equations, and in p.�rticular

coefficient accuracy is phrased in terms of the sensitivity of the

e igenvalues of a matrix. One of the asp':!cts of this paper is the rep··

resentation of the various digital filter forms in state-space terms.

In analyzing roundoff errors in the comp�tation of the fast Fourier

transform, approaches similar to those already discussed have been taken.

on floating-p·oint computation. The approach taken was to derive an

sequence. In contrast, Weinstein presents a statistical analysis of

noise in the floating-point comp:.ltation. This discussion is restricted

roundoff rather than truncation. A detailed statistical analysis of

this and the analysis by Weinstein, the output error is represented in

a mean square sense.

With regard to the effect of qu.antization and arithmetic roundoff

in both digital filtering and FfT comp�tation, it appears in general

that upper bounds derived tend to 1;)e overly pessimistic. In fixed-point,

floating-point, and block-floating-point realizations, experiments have

substantiated the statistical analyses presented and this statistical

approach app·:aars to p�ovide a more analytically tractable means for

analyzing a given configuration. It should be stressed that the outcome

of these analyses should be viewed as pro'Jiding techniques that can be

length. The finite register length manifests itself in terms of co

efficient inaccuracies, roundoff or truncation of the result of mUltipli

abundance of p�ints of view. The basic idea leading to efficient com

by sp�cial subroutines. These latter savings are important for moderate

B. Gold, A. V. Opp·2nheim, and C. M. Rader, "Theory and Implementa

(j, g) = ./:-.lioo F(s) G(s) ds = J:-. 1 F(z)G(z) d