
Twiddle-Factor-Based FFT Algorithm with Reduced Memory Access

Yingtao Jiang, Ting Zhou, Yiyan Tang, and Yuke Wang

Yingtao Jiang
Department of Electrical & Computer Engineering
University of Nevada, Las Vegas
Las Vegas, NV 89119, USA
yingtao@eng.unlv.edu

Ting Zhou
ASIC Design Division
Gennum Corporation
Ontario, Canada
tzhou@gennum.com

Yiyan Tang and Yuke Wang
Department of Computer Science
University of Texas at Dallas
Richardson, TX 75083, USA
{yiyan, yuke}@utdallas.edu

Abstract

In microprocessor-based systems, memory access is expensive due to long latency and high power consumption. In this paper, we present a novel FFT algorithm that reduces the frequency of memory access as well as the number of multiplication operations. For an N-point FFT, we design the FFT with two distinct sections: (1) the first section of the FFT structure computes the butterflies involving twiddle factors W_N^j (j != 0) through a computation/partitioning scheme similar to Huffman coding; in this section, all the butterflies sharing the same twiddle factor are clustered and computed together, so that redundant memory accesses to load twiddle factors are avoided. (2) In the second section, the remaining (N − 1) butterflies, which involve only the twiddle factor W_N^0, are computed with a register-based breadth-first tree traversal algorithm. This twiddle-factor-based FFT is tested on the TI TMS320C62x digital signal processor. The results show that, for a 32-point FFT, the new algorithm yields as much as a 20% reduction in clock cycles and an average 30% reduction in memory accesses compared with the conventional DIF FFT.

1. Introduction

In the field of digital signal processing, the Discrete Fourier Transform (DFT) plays an important role in the analysis, design, and implementation of discrete-time signal-processing algorithms and systems [1]-[10][12]-[20]. For instance, the DFT can be used to calculate a signal's frequency response, find a system's frequency response from the system's impulse response, and serve as an intermediate step in more elaborate signal processing techniques. The Fast Fourier Transform (FFT) is an efficient class of algorithms for computing the DFT. FFT algorithms are based on the fundamental principle of decomposing the computation of the DFT of a sequence of length N into successively smaller DFTs, with corresponding improvements in computational speed.

The study of FFT algorithms not only has a long history and a large bibliography; it is still an active research field whose new results are used in practical applications. Efficient FFT algorithms were first discovered by Gauss [7] and later by Runge and Konig [13]. The importance of FFT algorithms was not fully recognized until their rediscovery by Cooley and Tukey [4] in the 1960s. Since then, research on the FFT has proliferated; to name a few results, there are higher-radix algorithms [2], mixed-radix [15], prime-factor [8], Winograd (WFTA) [20], and split-radix Fourier transform algorithms [16][17], a recursive FFT algorithm [19], and a combination of decimation-in-time and decimation-in-frequency FFT algorithms [14]. The structures of these FFT computations are all organized in the way defined in [4].

There are many ways to measure the complexity and efficiency of the proposed FFT algorithms, and a final assessment depends on both the available technology and the intended applications. By careful analysis, however, we can see that there is a memory access problem with previously proposed approaches. For instance, unless the processor where the FFT runs provides a large number of registers, repeated memory accesses to load some twiddle factors are unavoidable under the proposed FFT algorithms. It has been recognized that memory access is expensive due to long latency and high power consumption. In this paper, we propose an algorithm that removes the redundant memory accesses in the calculation of the DFT.

For an N-point FFT, we consider two distinct cases: W_N^0 and W_N^j (j != 0). The FFT structure is, therefore, organized as two concatenated sections. The first section computes those butterflies involving twiddle factors W_N^j (j != 0). In this section, once a twiddle factor W_N^j is loaded, it is used until its value is no longer needed in the following computation. In this way, we show that for an N-point Radix-2 FFT only (N/2 − 1) memory accesses are needed, whereas classical approaches may require (N − 1) memory accesses to load twiddle factors for computation. The power saving can be quite significant when N is a very large number.

In the second section, where the remaining butterflies involving the twiddle factor W_N^0 (a total of (N − 1) butterflies) are computed, the main concern is to construct a tree structure that minimizes the frequency of the read/write operations used to store the intermediate results. To this end, we propose a breadth-first traversal algorithm. Since W_N^0 = 1, no multiplication operation is needed in the computation of these (N − 1) butterflies.

This twiddle-factor-based algorithm leads to efficient implementations and a wide range of applications, such as low-power, high-performance ASIC designs. We test the proposed algorithm on the TI TMS320C62x fixed-point digital signal processor (DSP). The experimental results show that the new algorithm requires fewer clock cycles to compute an N-point FFT than conventional FFT approaches. Furthermore, we can expect the power consumption of the new approach to be significantly lower than that of conventional FFT schemes, owing to the reduction of power-hungry memory accesses and multiplication operations.

The rest of this paper is organized as follows. In Section 2, the conventional Radix-2 FFT algorithm is briefly reviewed. The new twiddle-factor-based FFT algorithm is described in Section 3. Some practical issues are addressed in Section 4. Experimental results are presented in Section 5, and the conclusions are summarized in Section 6.

2. Discrete Fourier Transform and FFT

The Discrete Fourier Transform (DFT) of a discrete signal x(n) can be computed directly as

   X(k) = Σ_{n=0}^{N-1} x(n) W_N^{nk},   k = 0, 1, ..., N − 1                                      (3)

where W_N = e^{−j2π/N} = cos(2π/N) − j sin(2π/N) is known as the phase or twiddle factor, j^2 = −1, and x(n) and X(k) are sequences of complex numbers.

An efficient method of computing the DFT that significantly reduces the number of required arithmetic operations is called the FFT [1]-[10][12]-[20]. An FFT algorithm divides the DFT calculation into many short-length DFTs and results in huge savings of computation. If the DFT length is N = R^v, i.e., a product of identical factors, the corresponding FFT algorithms are called Radix-R algorithms. Assume the FFT length is 2^M, where M is the number of stages. The radix-2 DIF FFT divides an N-point DFT into two N/2-point DFTs, then into four N/4-point DFTs, and so on. That is, the radix-2 DIF FFT expresses the DFT equation as two summations and then splits it into two equations, one for the even-indexed and one for the odd-indexed output samples. To arrive at this two-point decomposition, let W_N^{2nk} = W_{N/2}^{nk}; the following equations are then derived:

   X(2k)   = Σ_{n=0}^{N/2-1} [x(n) + x(n + N/2)] W_{N/2}^{nk},          k = 0, 1, ..., N/2 − 1     (4)

   X(2k+1) = Σ_{n=0}^{N/2-1} [x(n) − x(n + N/2)] W_N^n W_{N/2}^{nk},    k = 0, 1, ..., N/2 − 1     (5)

The above equations are frequently represented in butterfly form. The butterfly of a Radix-2 algorithm is shown in Fig. 1.a. The complete flow graph of an N-point Radix-2 FFT, N = 2, 4, 8, ..., is constructed by applying the basic butterfly structure (Fig. 1.a) recursively. An N-point Radix-2 FFT has log2 N stages; within stage s, for s = 1, 2, ..., log2 N, there are N/2^s groups of butterflies, with 2^{s−1} butterflies per group. The computation of the 8-point DFT, for instance, can be accomplished by the algorithm depicted in Fig. 1.b.

(Figure omitted: (a) the basic Radix-2 butterfly, which produces x(n) + x(n + N/2) and [x(n) − x(n + N/2)] W_N^j; (b) the flow graph of an 8-point Radix-2 DIF FFT with twiddle factors W_8^0, W_8^1, W_8^2, W_8^3.)
Fig. 1 Flow graph of FFT
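To make the butterfly of Fig. 1.a concrete, the following minimal C sketch (ours, not taken from the paper's listings) computes a single radix-2 DIF butterfly as prescribed by Eqs. (4) and (5); the twiddle factor W_N^j is assumed to be supplied as its two components wr and wi.

    /* One radix-2 DIF butterfly (Fig. 1.a): the upper output is
     * x(n) + x(n+N/2); the lower output is (x(n) - x(n+N/2)) multiplied
     * by the twiddle factor W_N^j, supplied here as wr + j*wi. */
    static void dif_butterfly(float *ar, float *ai,   /* x(n), updated in place     */
                              float *br, float *bi,   /* x(n+N/2), updated in place */
                              float wr, float wi)     /* twiddle factor W_N^j       */
    {
        float sr = *ar + *br, si = *ai + *bi;         /* x(n) + x(n+N/2)            */
        float dr = *ar - *br, di = *ai - *bi;         /* x(n) - x(n+N/2)            */
        *ar = sr;
        *ai = si;
        *br = dr * wr - di * wi;                      /* complex multiply by W_N^j  */
        *bi = dr * wi + di * wr;
    }

Applying this butterfly recursively with the appropriate twiddle factors yields the flow graph of Fig. 1.b.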
3. The New Twiddle-Factor-Based FFT Algorithm

It can be seen from Fig. 1.b that, unless a sufficiently large number of registers is available, in most practical situations the twiddle factor W_8^2 will be loaded from memory to the CPU twice, once in Stage 1 and again in Stage 2. Such redundant memory access occurs repeatedly when loading the other twiddle factors and therefore becomes a serious problem when computing a large FFT. In this section, we present the twiddle-factor-based FFT algorithm, which reduces the number of memory accesses as well as the number of multiplication operations.

Theorem 1: For an N-point Radix-2 FFT, the total number of butterflies involving the twiddle factor W_N^0 is N − 1.

Proof: The proof is carried out on an N-point DIF (Decimation-In-Frequency) FFT. At stage 1, there is only one butterfly requiring W_N^0; at stage 2, there are two butterflies requiring W_N^0; and, in general, at stage k there are 2^{k−1} butterflies requiring W_N^0. There are in total log2 N stages in the FFT structure. Therefore, the total number of butterflies that require W_N^0 is 1 + 2 + ... + 2^{log2 N − 1} = N − 1.

Theorem 2: For an N-point Radix-2 FFT, the total number of butterflies involving the same twiddle factor W_N^j (j != 0) is 2^{k+1} − 1, where k satisfies

   (j / 2^k) mod 2 = 1  and  (j / 2^{k−1}) mod 2 = 0,   k = 0, 1, 2, ..., log2 N − 1,              (6)

i.e., 2^k is the largest power of two that divides j.

Proof: (1) If j = 1, 3, 5, 7, ..., N/2 − 1, then W_N^j appears only in the first stage. More generally, from Eqs. (5) and (6) we can see that, under condition (6), the appearance of W_N^j (j != 0) spans from Stage 1 up to Stage k + 1. (2) A twiddle factor W_N^j (j != 0) appears once in the first stage; it appears twice in the second stage, as there are two butterflies requiring this twiddle factor. This doubling continues up to the (k + 1)-th stage, where 2^k butterflies share the same twiddle factor. Therefore, the total number of butterflies that require W_N^j (j != 0) is 1 + 2 + ... + 2^k = 2^{k+1} − 1.
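Both counts can also be checked mechanically. The small C program below is our own verification sketch, not part of the original paper: it tallies, for an N-point radix-2 DIF FFT, how many butterflies use each twiddle factor and asserts the totals stated in Theorems 1 and 2.

    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
        enum { N = 32 };
        int count[N / 2] = { 0 };

        /* At the stage whose butterfly span is `span` (span = N, N/2, ..., 2) there
         * are N/span groups; butterfly m of a group uses twiddle factor W_N^(m*N/span). */
        for (int span = N; span >= 2; span /= 2)
            for (int g = 0; g < N / span; g++)
                for (int m = 0; m < span / 2; m++)
                    count[m * (N / span)] += 1;

        assert(count[0] == N - 1);                  /* Theorem 1 */
        for (int j = 1; j < N / 2; j++) {
            int k = 0;
            while (((j >> k) & 1) == 0)             /* k such that j/2^k is odd */
                k++;
            assert(count[j] == (2 << k) - 1);       /* Theorem 2: 2^(k+1) - 1 */
        }
        printf("Theorems 1 and 2 hold for N = %d\n", N);
        return 0;
    }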
The C-like pseudo code of the proposed twiddle-factor-based algorithm is shown in Fig. 2. For an N-point FFT, the algorithm consists of two concatenated sections that deal with the two distinct cases W_N^0 and W_N^j (j != 0).

Section 1: In the first section of the FFT structure, the butterflies involving twiddle factors W_N^j (j != 0) are computed. The major concern in this section is to minimize the number of memory accesses needed to load the twiddle factors: once a twiddle factor W_N^j is loaded, it is reused until its value is no longer needed in the following computation.

For an N-point FFT, the binary index of a data sample has the form (B_k B_{k−1} ... B_0), where k = log2 N − 1. All butterflies involving the twiddle factor W_N^j (j != 0) are computed at super-stage (k + 1), where k is determined by condition (6). The proposed twiddle-factor-based algorithm can thus be viewed as a skewed version of the popular DIF FFT structure; we use the term "super-stage" to reflect the fact that at super-stage ss the butterflies to be computed span stages 1, 2, ..., ss of the classical DIF FFT. Section 1 consists of (log2 N − 1) super-stages.

(1) At the first super-stage, all the data samples with binary indices of the form (B_k B_{k−1} ... B_2 B_1 1) are processed. Among these N/2 data samples, any two whose indices differ only in the leading bit, i.e., (0 B_{k−1} ... B_2 B_1 1) and (1 B_{k−1} ... B_2 B_1 1), pair together to form a butterfly. The twiddle factor involved in this butterfly is W_N^j, where j is the decimal value of the binary index (0 B_{k−1} ... B_2 B_1 1). In total, N/4 butterflies are computed in this super-stage.

(2) At the second super-stage, all the data samples with binary indices of the form (B_k B_{k−1} ... B_2 1 0) and (B_k B_{k−1} ... B_2 B_1 1) are processed; there are (N/4 + N/2) such data samples. Any two data samples whose indices differ only in the bit B_k, i.e., (0 B_{k−1} ... B_2 1 0) and (1 B_{k−1} ... B_2 1 0), or only in the bit B_{k−1}, i.e., (B_k 0 B_{k−2} ... B_1 1) and (B_k 1 B_{k−2} ... B_1 1), pair together to form a butterfly. The twiddle factor involved in such a butterfly is W_N^j, where j is the decimal value of the binary index (0 B_{k−1} ... B_2 1 0). In total, (N/8 + N/4) butterflies are computed in this super-stage; that is, a quarter of the butterflies of the first stage and half of the butterflies of the second stage of the original DIF FFT are computed.

(3) In general, within the ss-th super-stage, for ss = 1, 2, ..., log2 N − 1, all the data samples with binary indices of the form (B_k B_{k−1} ... B_{ss} 1 0 ... 0), (B_k B_{k−1} ... B_{ss} B_{ss−1} 1 0 ... 0), ..., (B_k B_{k−1} ... B_2 1 0), and (B_k B_{k−1} ... B_2 B_1 1) are processed. In this case, the butterflies originate from Stage 1 all the way up to Stage ss of the corresponding DIF FFT, and N/2^{ss+1} different twiddle factors are involved in this super-stage.

Under this algorithm, since the (N − 1) butterflies involving W_N^0 need no twiddle-factor load at all (Theorem 1), only (N/2 − 1) non-redundant memory accesses are needed to load twiddle factors. The classical approach, the DIF FFT shown in Fig. 1, may in contrast require as many as (N − 1) memory accesses to load twiddle factors, unless the size of the register file is comparable to the size of the input sample set; register files of that size are rarely seen in current microprocessor designs.

As an illustrative example, Table 1 lists the computation order of a 16-point FFT, with the indexing and pairing information presented in binary format. Altogether, there are 3 super-stages and 17 butterflies in Section 1. The overall computation structure of this 16-point FFT under the proposed algorithm is shown in Fig. 3.

Table 1. The computation order of a 16-point twiddle-factor-based FFT: indexing and pairing. Each entry gives the index pattern of a butterfly pair; the two indices of a pair differ in exactly one bit position (written as 0 in one index and 1 in the other), and the free bits B3, B2, B1 range over 0 and 1.

   Super-Stage 1   (original FFT stage 1):   (0 B2 B1 1)  with  (1 B2 B1 1)
   Super-Stage 2   (original FFT stage 1):   (0 B2 1 0)   with  (1 B2 1 0)
                   (original FFT stage 2):   (B3 0 B1 1)  with  (B3 1 B1 1)
   Super-Stage 3   (original FFT stage 1):   (0 1 0 0)    with  (1 1 0 0)
                   (original FFT stage 2):   (B3 0 1 0)   with  (B3 1 1 0)
                   (original FFT stage 3):   (B3 B2 0 1)  with  (B3 B2 1 1)

// n:  n-point FFT
// x:  input data samples
//     x[2k]     -- real part of the kth sample
//     x[2k + 1] -- imaginary part of the kth sample
// w:  pre-computed twiddle factors, stored so that w[2k + 1] holds the
//     cosine component and w[2k] the sine component of the kth factor
void radix2_fft(int n, float* x, float* w)
{
    int n2 = 0;
    int start = 1;
    int step = 2;

    // Section 1: compute the butterflies with twiddle factors W(N, j), j != 0
    for (proc = n; proc > 2; proc /= 2) {                  // super-stage
        n2++;
        for (twiddle = start; twiddle < n/2; twiddle += step) {
            // load one twiddle factor and reuse it repeatedly
            co = w[twiddle*2 + 1];                         // cosine component
            si = w[twiddle*2];                             // sine component
            n3 = n4 = n;
            for (stage = 0; stage < n2; stage++) {         // stage
                n4 /= 2;
                for (i0 = twiddle >> stage; i0 < n; i0 += n3) {
                    // butterfly computation
                    i1 = i0 + n4;
                    re0 = x[2*i0] + x[2*i1];
                    im0 = x[2*i0 + 1] + x[2*i1 + 1];
                    re1 = x[2*i0] - x[2*i1];
                    im1 = x[2*i0 + 1] - x[2*i1 + 1];
                    x[2*i0]     = re0;
                    x[2*i0 + 1] = im0;
                    x[2*i1]     = re1*co - im1*si;
                    x[2*i1 + 1] = re1*si + im1*co;
                }
                n3 = n4;
            }
        }
        start *= 2;
        step *= 2;
    }
    n2++;

    // Section 2: compute the butterflies with the twiddle factor W(N, 0)
    n3 = n4 = n;
    for (stage = 0; stage < n2; stage++) {
        n4 /= 2;
        for (i0 = 0; i0 < n; i0 += n3) {
            i1 = i0 + n4;
            re0 = x[2*i0] + x[2*i1];
            im0 = x[2*i0 + 1] + x[2*i1 + 1];
            re1 = x[2*i0] - x[2*i1];
            im1 = x[2*i0 + 1] - x[2*i1 + 1];
            x[2*i0]     = re0;
            x[2*i0 + 1] = im0;
            x[2*i1]     = re1;
            x[2*i1 + 1] = im1;
        }
        n3 = n4;
    }
}
/*
   Note that in Section 2 no multiplication operation is needed. The Section 2
   code above is, however, not optimized in terms of memory access: if a small
   number of registers is allocated to hold temporary values, the algorithm of
   Fig. 4 partitions the computation so that the intermediate read/write
   accesses are significantly reduced.
*/
Fig. 2 Pseudo code of the proposed twiddle-factor-based FFT algorithm
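To make the load count concrete, the short C program below (ours) replays only the super-stage and twiddle-factor loop bounds of the Fig. 2 pseudo code for the 16-point example of Table 1, printing which twiddle-factor indices are loaded at each super-stage; exactly N/2 − 1 = 7 loads occur, versus up to N − 1 = 15 in the classical DIF FFT of Fig. 1.

    #include <stdio.h>

    int main(void)
    {
        int n = 16, loads = 0, start = 1, step = 2, ss = 1;

        for (int proc = n; proc > 2; proc /= 2, ss++) {   /* super-stage loop of Fig. 2 */
            printf("super-stage %d loads W_N^j for j =", ss);
            for (int j = start; j < n / 2; j += step) {   /* one memory load per j */
                printf(" %d", j);
                loads++;
            }
            printf("\n");
            start *= 2;
            step  *= 2;
        }
        printf("total twiddle-factor loads: %d (= N/2 - 1)\n", loads);
        return 0;
    }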

(Figure omitted: the overall computation structure of the 16-point FFT under the proposed algorithm, with the inputs x(0)-x(15) on the left, the three super-stages of Section 1 followed by Section 2, the twiddle factors W[0]-W[7] annotated on the butterflies, and bold lines denoting multiplication by −1.)
Fig. 3 Structure of a 16-point FFT based on the proposed algorithm

From the above discussion, we can see that the FFT structure is computed in a way similar to Huffman coding: the clustering resolves the data dependences, and the verification of this algorithm can be viewed as "decoding" the Huffman codes.

It can also be seen that four nested loops are required in this section of the computation, whereas traditional approaches may require three. This loop overhead, however, can easily be absorbed by current processors with multiple data paths, such as the TI TMS320C62x DSP [18].

Section 2: In the second section, the remaining butterflies, which involve only the twiddle factor W_N^0, are computed. Note that no multiplication is needed in computing these (N − 1) butterflies (Theorem 1), since W_N^0 = 1. All these butterflies are organized as a binary tree with log2 N stages. The memory access of this section can be reduced significantly if a few user-visible data registers are available. Depending on the number of register pairs (M) available to hold intermediate results, we traverse the binary tree with the algorithm shown in Fig. 4, where a visit to a node corresponds to a 2-point butterfly computation.

// N:  N-point FFT
// x:  input samples
//     x[2k]     -- real part of the kth data sample, k = 0, 1, 2, ..., (N-1)
//     x[2k + 1] -- imaginary part of the kth data sample
// r:  number of register pairs
// j:  r = 2^j
void section2(int N, float* x, float* w, int n1)
{
    // Number of stages to be calculated in the prolog:
    // the top (log2 N) mod (log2 r) levels of the binary tree
    int n5 = n1 % j;            // n1: N = 2^n1
    int n2 = N;

    // The prolog of the tree structure
    for (proc = 0; proc < n5; proc++) {
        n3 = n2;
        n2 >>= 1;
        for (bu = 0; bu < N; bu += n3) {
            // butterfly_cal(x0r, x0i, x1r, x1i): calculate the butterfly of two
            // specified points; x0r, x1r are the real parts and x0i, x1i the
            // imaginary parts of the two points
            butterfly_cal(x[2*bu], x[2*bu + 1],
                          x[2*(bu + n2)], x[2*(bu + n2) + 1]);
        }
    }

    // The kernel of the tree structure.
    // If r pairs of registers, denoted Reg_real[0:(r-1)] and Reg_im[0:(r-1)],
    // are available, (r-1) butterflies can be computed per node; the
    // intermediate results stay in the registers rather than being written
    // back to memory.
    for (proc = N >> n5; proc > 1; proc >>= j) {
        int n4 = proc >> j;
        int index = 0;
        for (group = 0; group < (1 << (n1 - n3)); group++) {
            int base = index;
            // Fetch the points from memory into the registers
            for (i = 0; i < r; i++) {
                Reg_real[i] = x[2*index];
                Reg_im[i]   = x[2*index + 1];
                index = index + n4;
            }
            m = r;
            // Calculate the butterflies: (r-1) in total, over j levels
            for (i = 1; i <= j; i++) {
                p = m;
                m = m / 2;
                for (q = 0; q < r; q = q + p) {
                    butterfly_cal(Reg_real[q], Reg_im[q],
                                  Reg_real[q + m], Reg_im[q + m]);
                }
            }
            // Store the points back to the same memory locations
            index = base;
            for (i = 0; i < r; i++) {
                x[2*index]     = Reg_real[i];
                x[2*index + 1] = Reg_im[i];
                index = index + n4;
            }
        }
    }
}
Fig. 4 Tree traversal algorithm for the Section 2 computation in the algorithm shown in Fig. 2
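Fig. 4 relies on a helper butterfly_cal() that it does not define. The following minimal C sketch (ours) shows one possible form of this helper: the W_N^0 butterfly of Section 2 of Fig. 2, involving no multiplication. It is written here with pointer arguments so that the two points can be updated in place, whereas the pseudo code above passes the real and imaginary components directly.

    /* W_N^0 butterfly: no twiddle-factor multiplication is involved. */
    static void butterfly_cal(float *re_a, float *im_a,   /* first point,  in/out */
                              float *re_b, float *im_b)   /* second point, in/out */
    {
        float re0 = *re_a + *re_b, im0 = *im_a + *im_b;   /* sum        */
        float re1 = *re_a - *re_b, im1 = *im_a - *im_b;   /* difference */
        *re_a = re0;  *im_a = im0;
        *re_b = re1;  *im_b = im1;
    }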

Here we assume that M is a power of 2 (i.e., M = 2^k), and there are two cases to consider: (1) (log2 N) mod k = 0, and (2) (log2 N) mod k != 0. In the first case, the partition algorithm (Fig. 4) transforms the original binary tree into a complete tree in which each parent node has M immediate children. In the second case, except for the top [(log2 N) mod k] levels, all the remaining nodes of the binary tree are merged to construct a reduced tree in which each parent node has M immediate children. The reduced tree is traversed in a breadth-first manner. Although a depth-first traversal algorithm could also be designed here, owing to the weak data dependence among the butterflies, we feel a breadth-first approach is more suitable for easy parallel computation and fewer memory accesses.

The second section of the 16-point FFT can be seen in the Section 2 part of Fig. 3 and is redrawn in Fig. 5.a, where each node indicates a butterfly. If four pairs of registers (M = 4) are allocated to store the intermediate data, the binary tree is transformed into a quad-tree, in accordance with Case 1 above. This new quad-tree consists of 5 nodes, as opposed to the 15 nodes of the original binary tree. Each node of the quad-tree is made up of three butterflies, and the intermediate results within these three butterflies are kept in the dedicated register file, not in main memory. For a 32-point FFT, the merged tree consists of 3 levels and 11 nodes, as demonstrated in Fig. 5.b.
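A quick check of these counts (our arithmetic, consistent with the figures quoted above): Section 2 contains N − 1 butterflies, and with M = 4 register pairs each merged node absorbs M − 1 = 3 of them. For N = 16, (log2 N) mod (log2 M) = 4 mod 2 = 0 (Case 1), so the 15 butterflies merge exactly into 15/3 = 5 nodes arranged in 4/2 = 2 levels. For N = 32, 5 mod 2 = 1 (Case 2), so the top level (one butterfly) is kept as is and the remaining 30 butterflies merge into 30/3 = 10 nodes, giving 1 + 10 = 11 nodes in 1 + 4/2 = 3 levels, in agreement with Fig. 5.b.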
(Figure omitted: (a) for a 16-point FFT, the 15-butterfly binary tree of Section 2 merged into 5 groups (Group 1 through Group 5); (b) for a 32-point FFT, the 31-butterfly tree merged into 11 groups (Group 1 through Group 11).)
Fig. 5 An example of the merging of the binary tree

4. Practical Considerations

In this section, we present a variety of details that should be considered when implementing the proposed algorithm on various platforms.

4.1. Memory Interleaving and Addressing Patterns

There are two main ways in which memory systems are usually designed to match a high-performance processor. The first is to reduce the effective memory access time by reducing the number of accesses that actually reach the memory; caches, sets of registers, or some other form of buffering are designed for this purpose. The second approach, memory interleaving, replaces a single memory unit by several memory units (banks), organized in such a way that independent and concurrent access to several units is possible. The technique works best with predictable memory-access patterns [11].

For the interleaving to work, every M temporally adjacent words accessed by the processor must reside in different banks. In our proposed FFT algorithm, it is therefore desirable to store the (N/2) twiddle-factor words interleaved across M banks. We assign memory address j to memory bank [(j/2) mod M] to achieve maximum memory bandwidth for concurrent access with the least amount of conflict. For instance, at super-stage 1, the twiddle factors W_N^j, j = 1, 3, 5, ..., N/2 − 1, are loaded for computation (Figs. 2 and 3), and it is preferable that they be stored in different memory banks. Fig. 6 illustrates the storage pattern of the twiddle factors used in a 16-point FFT with 4 memory banks; these twiddle factors can be fetched in parallel at each super-stage.

                 Bank 0    Bank 1    Bank 2    Bank 3
   Word 0:       W_16^0    W_16^2    W_16^4    W_16^6
   Word 1:       W_16^1    W_16^3    W_16^5    W_16^7
Fig. 6 Assignment of addresses in memory interleaving

To access a memory location, the memory system interprets each processor-generated address as a pair <bank_address, displacement_in_bank>. In our case, for a memory of (N/2) words interleaved across M banks, the least significant bit together with the high-order (log2(N/2) − log2 M − 1) bits selects a word within a bank, and the bank is selected by the remaining log2 M bits.
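As a sketch of this addressing scheme (ours; the names n, m, bank_bits and disp are not from the paper), the following C fragment maps each twiddle-factor word address j to the pair <bank_address, displacement_in_bank> described above and, for N = 16 and M = 4, reproduces exactly the layout of Fig. 6.

    #include <stdio.h>

    int main(void)
    {
        int n = 16, m = 4;                 /* N-point FFT, M memory banks */
        int bank_bits = 2;                 /* log2 M                      */

        for (int j = 0; j < n / 2; j++) {  /* word addresses of W_N^0 .. W_N^(N/2-1) */
            int bank = (j >> 1) & (m - 1);                       /* (j/2) mod M      */
            int disp = ((j >> (bank_bits + 1)) << 1) | (j & 1);  /* high bits + LSB  */
            printf("W_N^%d -> bank %d, displacement %d\n", j, bank, disp);
        }
        return 0;
    }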

4.2. In-Order Computation

The idea behind our proposed twiddle-factor-based FFT algorithm can, in fact, be borrowed to modify many existing FFT algorithms so as to squeeze out their redundant memory accesses.

For instance, the FFT structure shown in Fig. 1 can easily be modified to meet an in-order computation requirement, at the cost of more memory usage during the computation. The data dependence can still be viewed and checked using a timing diagram such as the one shown in Table 1.

4.3. Scaling

In practice, the arithmetic operations involved in FFTs are sometimes carried out using fixed-point or block floating-point arithmetic. Although fixed-point arithmetic leads to a fast and inexpensive implementation, it is limited in the range of numbers that can be represented and is susceptible to overflow, which may occur when the result of an addition exceeds the permissible number range. To deal with this problem, scaling has to be performed to prevent overflow. Although the new algorithm introduces the concept of the super-stage, the scaling in our algorithm still has to be performed at the end of each butterfly computation, not at the super-stage level (Figs. 2 and 3). This simple scaling scheme can easily be embedded into the main flow of the algorithms shown in Figs. 2 and 4.
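As an illustration of such per-butterfly scaling (our sketch, not code from the paper), the fragment below shows a 16-bit fixed-point W_N^0 butterfly whose outputs are shifted right by one bit immediately after the add/subtract, so that the stored results cannot overflow the 16-bit range; the same shift would follow the complex multiplication in the Section 1 butterflies.

    #include <stdint.h>

    /* Fixed-point W_N^0 butterfly with unconditional scaling by 1/2:
     * the sum/difference of two int16_t values fits in int32_t, and the
     * final shift brings the result back into the int16_t range. */
    static void butterfly_scaled(int16_t *re_a, int16_t *im_a,
                                 int16_t *re_b, int16_t *im_b)
    {
        int32_t re0 = (int32_t)*re_a + *re_b, im0 = (int32_t)*im_a + *im_b;
        int32_t re1 = (int32_t)*re_a - *re_b, im1 = (int32_t)*im_a - *im_b;
        *re_a = (int16_t)(re0 >> 1);  *im_a = (int16_t)(im0 >> 1);
        *re_b = (int16_t)(re1 >> 1);  *im_b = (int16_t)(im1 >> 1);
    }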
5. The Experimental Results

In this section, we conduct experiments to evaluate the performance of our twiddle-factor-based FFT algorithm against a classical DIF FFT algorithm. The test platform is the TI TMS320C6211 fixed-point digital signal processor [18], which has an enhanced VLIW (Very Long Instruction Word) architecture.

Note that our algorithm depends on the number of register pairs, denoted M, available for temporary storage (Fig. 4). We have chosen three sizes in the test: M = 4, M = 8, and M = 16. Altogether, we have written four FFT programs in C: three based on our approach and one based on the DIF FFT. The sizes of the FFTs under test range from 8 points, 16 points, and 32 points all the way up to 1024 points. Since the trend of the results is similar across FFT sizes, for the sake of space we report only the results collected from the 32-point FFTs (Table 2). Four different compilation options are considered. It can be seen that the memory access of our approach is around 30 per cent lower than that of the conventional DIF FFT. Across the different compiler optimization options, we have observed clock-cycle reductions of as much as 20 per cent. Furthermore, we have seen even larger performance improvements and memory-access reductions for larger FFTs.

Table 2. Results of 32-point FFTs: twiddle-factor-based approach vs. DIF

   FFT                         Total cycle    Cycle          Load/Store    Load/Store
                               count          reduction (%)                reduction (%)
   No optimization by compiler
   Traditional DIF FFT         88825          0              480/320       0
   Novel FFT (4 pair reg.)     88247          1              312/280       26
   Novel FFT (8 pair reg.)     86025          3              294/272       30
   Novel FFT (16 pair reg.)    84012          5              286/264       31
   Optimization option -o0 (register)
   Traditional DIF FFT         29860          0              480/320       0
   Novel FFT (4 pair reg.)     24257          19             312/280       26
   Novel FFT (8 pair reg.)     24058          20             294/272       30
   Novel FFT (16 pair reg.)    28112          6              286/264       31
   Optimization option -o1 (local)
   Traditional DIF FFT         21745          0              480/320       0
   Novel FFT (4 pair reg.)     18068          17             312/280       26
   Novel FFT (8 pair reg.)     18788          14             294/272       30
   Novel FFT (16 pair reg.)    29620          -36            286/264       31
   Optimization option -o2 (function)
   Traditional DIF FFT         26081          0              480/320       0
   Novel FFT (4 pair reg.)     24172          7              312/280       26
   Novel FFT (8 pair reg.)     23477          10             294/272       30
   Novel FFT (16 pair reg.)    26669          -2             286/264       31
   Optimization option -o3 (file)
   Traditional DIF FFT         26081          0              480/320       0
   Novel FFT (4 pair reg.)     24172          7              312/280       26
   Novel FFT (8 pair reg.)     23477          10             294/272       30
   Novel FFT (16 pair reg.)    26669          -2             286/264       31

6. Conclusions

In this paper, we have presented a novel twiddle-factor-based FFT algorithm in which redundant memory accesses are removed. The first section of the new FFT structure computes those butterflies with twiddle factors W_N^j (j != 0). In this section, once a twiddle factor W_N^j is loaded, it is repeatedly used until its value is no longer needed in the following computation. In this way, we show that for an N-point Radix-2 FFT only (N/2 − 1) twiddle-factor memory accesses are needed, whereas the classical approach may require as many as (N − 1). This new FFT structure can be viewed as similar in spirit to Huffman coding. We also show that 4 nested loops are needed, compared to 3 in the classical method; this has little impact on computing speed with careful design or with the help of an efficient compiler. In the section that computes the remaining butterflies, which involve the twiddle factor W_N^0 and account for a total of (N − 1) butterflies, the main concern is to construct a tree structure that minimizes the number of accesses needed to store and fetch data in the intermediate arrays of the FFT. We have shown that, depending on the number of temporary registers available and the number of input samples, different optimized computation structures are required. This novel structure should lead to efficient implementations and a wide range of applications, as demonstrated by an implementation on the TI TMS320C62x DSP: the results show that the new algorithm requires significantly fewer clock cycles to compute an N-point FFT. Thanks to the substantial reduction in memory access, considerable power savings can also be expected.

According to the transposition theorem [12], we can obtain a transposed structure in which Sections 1 and 2, as well as the directions of all branches in the network (Figs. 2-4), are reversed. In more general terms, the idea behind our proposed algorithm can be borrowed to modify many existing FFT algorithms so as to squeeze out redundant memory accesses and arithmetic operations.

7. References

[1] D. H. Bailey, "FFTs in External or Hierarchical Memory," NASA Tech. Report RNR-89-004, 1989.
[2] G. D. Bergland, "A Radix-Eight Fast-Fourier Transform Subroutine for Real-Valued Series," IEEE Trans. Audio Electroacoust., vol. 17, no. 2, pp. 138-144, June 1969.
[3] C. S. Burrus and T. W. Parks, DFT/FFT and Convolution Algorithms and Implementation, NY: John Wiley & Sons, 1985.
[4] J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Comput., vol. 19, pp. 297-301, 1965.
[5] P. Duhamel and H. Hollmann, "Split Radix FFT Algorithm," Electronics Letters, vol. 20, pp. 14-16, Jan. 5, 1984.
[6] P. Duhamel, "Implementation of 'Split-Radix' FFT Algorithms for Complex, Real, and Real-Symmetric Data," IEEE Acoustics, Speech, and Signal Processing Magazine, vol. 34, pp. 285-295, Apr. 1986.
[7] M. Frigo and S. G. Johnson, "The Fastest Fourier Transform in the West," Tech. Rep. MIT-LCS-TR-728, Laboratory for Computer Science, MIT, Cambridge, MA, Sept. 1997.
[8] M. T. Heideman, D. H. Johnson, and C. S. Burrus, "Gauss and the History of the FFT," IEEE Acoustics, Speech, and Signal Processing Magazine, vol. 1, pp. 14-21, Oct. 1984.
[9] D. P. Kolba and T. W. Parks, "A Prime Factor FFT Algorithm Using High-Speed Convolution," IEEE Trans. Acoust., Speech, Signal Processing, vol. 25, no. 4, pp. 281-294, Aug. 1977.
[10] K.-S. Lin, ed., Digital Signal Processing Applications with the TMS320 Family, vol. 1, Englewood Cliffs, NJ: Prentice Hall, 1987.
[11] A. R. Omondi, The Microarchitecture of Pipelined and Superscalar Computers, Boston: Kluwer Academic Publishers, 1999.
[12] A. V. Oppenheim and C. M. Rader, Discrete-Time Signal Processing, 2nd ed., Upper Saddle River, NJ: Prentice-Hall, 1989.
[13] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes: The Art of Scientific Computing, Cambridge University Press, 1986.
[14] A. Saidi, "Decimation-in-Time-Frequency FFT Algorithm," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. III:453-456, April 19-22, 1994.
[15] R. C. Singleton, "An Algorithm for Computing the Mixed Radix Fast Fourier Transform," IEEE Trans. Audio Electroacoust., vol. 1, no. 2, pp. 93-103, June 1969.
[16] H. V. Sorensen and C. S. Burrus, "A New Efficient Algorithm for Computing a Few DFT Points," IEEE Trans. Acoust., Speech, Signal Processing, vol. 35, no. 6, pp. 849-863, June 1987.
[17] D. Takahashi, "An Extended Split-Radix FFT Algorithm," IEEE Signal Processing Letters, vol. 8, no. 5, pp. 145-147, May 2001.
[18] Texas Instruments, TMS320C62x DSP Library Programmer's Reference, SPRU402.
[19] A. R. Varkonyi-Koczy, "A Recursive Fast Fourier Transform Algorithm," IEEE Trans. Circuits and Systems II, vol. 42, pp. 614-616, Sep. 1995.
[20] S. Winograd, "On Computing the Discrete Fourier Transform," Math. Comput., vol. 32, no. 141, pp. 175-199, Jan. 1978.

