C 39

A HIGH PERFORMANCE VLSI FFT ARCHITECTURE

K. Babionitakis, K. Manolopoulos, K. Nakos, D. Reisis, N. Vlassopoulos
Electronics Laboratory, Department of Physics,
National and Kapodistrian University of Athens,
Athens, Greece
dreisis@phys.uoa.gr

V.A. Chouliaras
Department of Electronic and Electrical Engineering,
Loughborough University, Loughborough,
LEICS, LE11 3TU, UK
V.A.Chouliaras@lboro.ac.uk

Abstract— High performance VLSI-based FFT architectures are key to signal processing and telecommunication systems since they meet hard real-time constraints at low silicon area and low power compared to CPU-based solutions. In order to meet these goals, this paper presents a novel VLSI FFT architecture based on combining three consecutive radix-4 stages to result in a 64-point FFT engine. Cascading these 64-point FFT engines results in an improved architecture with the following characteristics. First, it can efficiently accommodate large input data sets in real time. It also simplifies processing requirements due to the radix-4 calculations. Finally, it reduces memory requirements and latency to one third compared to the fully unfolded radix-4 architecture. Two different implementations are used to validate the efficiency of the architecture: an FPGA implementation of a 4096-point FFT achieving a throughput of 4096 points/20.48 usec, and a VLSI implementation sustaining a throughput of 4096 points/3.89 usec.

I. INTRODUCTION
Signal processing and telecommunication applications require FFT implementations that can perform large size, low latency computations while exhibiting low power consumption [12]. These demanding computational tasks are executed either by a single, high frequency embedded processor [5] or by an Application Specific Integrated Circuit (ASIC). A number of FFT architectures have been proposed in the literature [2], [3], [7] with varying levels of parallelism, sustained throughput rate, memory usage, hardware resources and power dissipation. Fully unfolded FFT architectures [10] achieve their maximum throughput at lower clock rates while occupying more VLSI area and using larger memory arrays between their successive stages.
Cascade FFT topologies [2], [6], [7] have reduced memory requirements compared to the unfolded case but are less efficient for higher than radix-2 architectures. Cascade architectures become complicated when high speed performance is sought by increasing the depth of the pipeline within the butterflies. In such cases, the cascade architectures must include larger memory within each butterfly processing element and complicated control specifically designed for different pipeline depths. Higher radix techniques reduce the number of stages of the FFT at an increased cost in terms of VLSI area for each stage. Interesting results have been presented by an alternative approach, which utilizes asynchronous FFT designs [11] with fully unrolled FFT circuits but occupies more VLSI area.
The present paper addresses an efficient FFT architecture maximizing throughput and keeping the control and the memory organizations simple compared to cascade and unfolded FFT architectures. Moreover, it is efficient compared to the aforementioned architectures with respect to the scalability of the maximal operating frequency, the pipeline depth and the data and twiddle widths. To improve latency and memory requirements, particularly for large input data sets, the proposed architecture combines three (3) radix-4 circuits to result in a 64-point FFT engine.
In order to demonstrate the efficiency of combining 64-point radix-4 FFT engines we discuss the architecture of a 4096-complex point design implemented on a Xilinx Virtex II FPGA. This FPGA implementation requires only two (2) memory banks, each 4096 complex words deep, and achieves a maximum operating frequency of 200 MHz, sustaining a throughput of 4096 points/20.48 us while consuming 6.4 Watts for a typical workload. The architecture has also been implemented in a high performance 0.13 um standard cell library from TSMC, where it achieved a worst-case (0.9V, 125C) post-route frequency of 604.5 MHz while consuming 4.4 Watts. It is interesting to point out that the design exceeded the 1 GHz rate for typical conditions (1.0V, 25C).
The remainder of the paper consists of three sections: section 2 describes the derivation of the Radix-4^3 schema from the FFT equation, section 3 gives a detailed architecture description and section 4 concludes the paper.

II. ANALYSIS OF THE RADIX-4^3 ALGORITHM
The Discrete Fourier Transform (DFT) of a signal x[n] of length N is given by the series

$$
X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{kn}
$$

with $W_N = e^{-j 2\pi / N}$. In order to derive the Radix-4^3 algorithm, the first 3 steps in the cascade decomposition are considered. The linear index mapping transforms into a four-dimensional index map
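as follows:

$$
\begin{aligned}
n &= n_1 + \frac{N}{64} n_2 + \frac{N}{16} n_3 + \frac{N}{4} n_4 \\
k &= 64 k_1 + 16 k_2 + 4 k_3 + k_4
\end{aligned}
\tag{1}
$$

Applying equation 1 to the DFT equation yields

$$
X(64k_1 + 16k_2 + 4k_3 + k_4) = \sum_{n_1=0}^{\frac{N}{64}-1} \sum_{n_2=0}^{3} \sum_{n_3=0}^{3} \sum_{n_4=0}^{3} x\left(n_1 + \frac{N}{64} n_2 + \frac{N}{16} n_3 + \frac{N}{4} n_4\right) W_N^{nk}
\tag{2}
$$

With the cascade decomposition, the composite twiddle factors can be expressed as follows

$$
\begin{aligned}
W_N^{kn} &= W_N^{\left[n_1 + \frac{N}{64} n_2 + \frac{N}{16} n_3 + \frac{N}{4} n_4\right]\left[64k_1 + 16k_2 + 4k_3 + k_4\right]} \Rightarrow \\
W_N^{kn} &= (-j)^{n_2 k_2 + n_3 k_3 + n_4 k_4}\; W_{16}^{n_3 k_4}\; W_{64}^{n_2 (4k_3 + k_4)} \times W_N^{n_1 (16k_2 + 4k_3 + k_4)}\; W_{\frac{N}{64}}^{n_1 k_1}
\end{aligned}
\tag{3}
$$

Applying equation 3 into equation 2 and expanding the summation with index $n_4$ yields

$$
\begin{aligned}
X(64k_1 + 16k_2 + 4k_3 + k_4) = \sum_{n_1=0}^{\frac{N}{64}-1} \sum_{n_2=0}^{3} \sum_{n_3=0}^{3}
& B_{\frac{N}{4}}^{k_4}\left(n_1 + \frac{N}{64} n_2 + \frac{N}{16} n_3\right) \times \\
& \times (-j)^{n_2 k_2 + n_3 k_3}\; W_{16}^{n_3 k_4}\; W_{64}^{n_2 (4k_3 + k_4)} \times \\
& \times W_N^{n_1 (16k_2 + 4k_3 + k_4)}\; W_{\frac{N}{64}}^{n_1 k_1}
\end{aligned}
\tag{4}
$$

where $B_{\frac{N}{4}}^{k_4}\left(n_1 + \frac{N}{64} n_2 + \frac{N}{16} n_3\right)$ denotes the first butterfly unit and can be written as follows

$$
\begin{aligned}
B_{\frac{N}{4}}^{k_4}\left(n_1 + \frac{N}{64} n_2 + \frac{N}{16} n_3\right)
&= x\left(n_1 + \frac{N}{64} n_2 + \frac{N}{16} n_3\right) \\
&+ (-j)^{k_4}\, x\left(n_1 + \frac{N}{64} n_2 + \frac{N}{16} n_3 + \frac{N}{4}\right) \\
&+ (-1)^{k_4}\, x\left(n_1 + \frac{N}{64} n_2 + \frac{N}{16} n_3 + \frac{N}{2}\right) \\
&+ j^{k_4}\, x\left(n_1 + \frac{N}{64} n_2 + \frac{N}{16} n_3 + \frac{3N}{4}\right)
\end{aligned}
\tag{5}
$$

Expanding equation 4 with respect to the next summation with index $n_3$ yields

$$
\begin{aligned}
X(64k_1 + 16k_2 + 4k_3 + k_4) = \sum_{n_1=0}^{\frac{N}{64}-1} \sum_{n_2=0}^{3}
& H_{\frac{N}{16}}^{k_3 k_4}\left(n_1 + \frac{N}{64} n_2\right) (-j)^{n_2 k_2} \times \\
& \times W_{64}^{n_2 (4k_3 + k_4)}\; W_N^{n_1 (16k_2 + 4k_3 + k_4)}\; W_{\frac{N}{64}}^{n_1 k_1}
\end{aligned}
\tag{6}
$$

where $H_{\frac{N}{16}}^{k_3 k_4}\left(n_1 + \frac{N}{64} n_2\right)$ is the secondary butterfly structure and can be expressed as

$$
\begin{aligned}
H_{\frac{N}{16}}^{k_3 k_4}\left(n_1 + \frac{N}{64} n_2\right)
&= B_{\frac{N}{4}}^{k_4}\left(n_1 + \frac{N}{64} n_2\right) \\
&+ (-j)^{k_3}\, W_{16}^{k_4}\, B_{\frac{N}{4}}^{k_4}\left(n_1 + \frac{N}{64} n_2 + \frac{N}{16}\right) \\
&+ (-1)^{k_3}\, W_{8}^{k_4}\, B_{\frac{N}{4}}^{k_4}\left(n_1 + \frac{N}{64} n_2 + \frac{N}{8}\right) \\
&+ j^{k_3}\, W_{8}^{k_4}\, W_{16}^{k_4}\, B_{\frac{N}{4}}^{k_4}\left(n_1 + \frac{N}{64} n_2 + \frac{3N}{16}\right)
\end{aligned}
\tag{7}
$$

Finally, expanding the summation of equation 6 with respect to index $n_2$ provides a set of 64 DFTs of length N/64.

$$
X(64k_1 + 16k_2 + 4k_3 + k_4) = \sum_{n_1=0}^{\frac{N}{64}-1} \left[ T_{\frac{N}{64}}^{k_2 k_3 k_4}(n_1)\; W_N^{n_1 (16k_2 + 4k_3 + k_4)} \right] W_{\frac{N}{64}}^{n_1 k_1}
\tag{8}
$$

where $T_{\frac{N}{64}}^{k_2 k_3 k_4}(n_1)$ represents the third butterfly and has the expression of equation 9

$$
\begin{aligned}
T_{\frac{N}{64}}^{k_2 k_3 k_4}(n_1)
&= H_{\frac{N}{16}}^{k_3 k_4}(n_1) \\
&+ (-j)^{k_2}\, W_{16}^{k_3}\, W_{64}^{k_4}\, H_{\frac{N}{16}}^{k_3 k_4}\left(n_1 + \frac{N}{64}\right) \\
&+ (-1)^{k_2}\, W_{8}^{k_3}\, W_{32}^{k_4}\, H_{\frac{N}{16}}^{k_3 k_4}\left(n_1 + \frac{N}{32}\right) \\
&+ j^{k_2}\, W_{64}^{3(4k_3 + k_4)}\, H_{\frac{N}{16}}^{k_3 k_4}\left(n_1 + \frac{3N}{64}\right)
\end{aligned}
\tag{9}
$$

Equations 4 to 9 describe a radix-64 based FFT. Further, equations 5, 7 and 9 describe the internal structure of the radix-64 butterfly, which is based on three radix-4 butterflies and constitutes the Radix-4^3 (R4^3).

III. ARCHITECTURE
The architecture of the 4096-point FFT is depicted in figure 1. The FFT processor consists of two (2) R4^3 processing cores, two (2) 4096-word dual bank memory elements, one (1) 4096-point read-only memory that stores the $W_{4096}$ twiddle factors and one (1) complex multiplier.

Fig. 1. Overall FFT Architecture
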
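The decomposition of equations 5, 7, 9 and 8 can be checked numerically. The following is a minimal floating-point sketch, not the authors' fixed-point RTL: it assumes NumPy is available, and the names `w` and `r43_fft` are hypothetical helpers introduced only for this illustration. Each pass applies the three radix-4 butterflies (with the W16 and W64 twiddles), multiplies by the W_N twiddle of equation 8, and then computes the remaining N/64-point DFT, either with a second R4^3 stage (as in the 4096-point processor) or, for the final stage, with `numpy.fft.fft` as a stand-in.

```python
import numpy as np


def w(n_points, exponent):
    """Twiddle factor W_n^e = exp(-2*pi*j*e / n)."""
    return np.exp(-2j * np.pi * exponent / n_points)


def r43_fft(x):
    """DFT of x (length N, a positive multiple of 64) via one Radix-4^3 stage."""
    x = np.asarray(x, dtype=complex)
    N = x.size
    if N == 0 or N % 64 != 0:
        raise ValueError("length must be a positive multiple of 64")
    M = N // 64                      # length of the remaining DFTs in eq. (8)
    n1 = np.arange(M)
    # Cascade another R4^3 stage when possible, otherwise fall back to NumPy.
    inner = r43_fft if M % 64 == 0 else np.fft.fft
    X = np.empty(N, dtype=complex)
    for k4 in range(4):
        # Eq. (5): first radix-4 butterfly, no twiddle multiplications.
        B = sum((-1j) ** (n4 * k4) * x[n4 * (N // 4):(n4 + 1) * (N // 4)]
                for n4 in range(4))
        for k3 in range(4):
            # Eq. (7): second butterfly, W_16 twiddles.
            H = sum((-1j) ** (n3 * k3) * w(16, n3 * k4)
                    * B[n3 * (N // 16):(n3 + 1) * (N // 16)]
                    for n3 in range(4))
            for k2 in range(4):
                # Eq. (9): third butterfly, W_64 twiddles.
                T = sum((-1j) ** (n2 * k2) * w(64, n2 * (4 * k3 + k4))
                        * H[n2 * M:(n2 + 1) * M]
                        for n2 in range(4))
                # Eq. (8): W_N twiddle, then an N/64-point DFT over n1.
                kappa = 16 * k2 + 4 * k3 + k4
                X[kappa::64] = inner(T * w(N, n1 * kappa))
    return X


# Compare against NumPy's FFT on a random 4096-point complex input.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096) + 1j * rng.standard_normal(4096)
print(np.allclose(r43_fft(x), np.fft.fft(x)))
```

For N = 4096 the recursion depth is two, which mirrors the two cascaded R4^3 cores with the W4096 twiddle multiplication between them. The sketch recomputes the butterflies for every output residue and so says nothing about the hardware's scheduling, memory banking or word lengths.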
A. R4^3 Engine
Figure 2 depicts the internal structure of the R4^3 engine. Each engine consists of three (3) radix-4 butterflies, two (2) complex multipliers, two (2) Dual Bank (DB) memory elements (one consisting of 2x16 words and one consisting of 2x64 words) and two (2) Read Only Memories where the W16 and W64 twiddles are stored. The R4^3 control unit is responsible for generating the signals that control the individual modules. Each R4^3 engine is capable of operating as a stand-alone 64-point FFT processor, and the engines can be easily combined to form large scale FFT architectures (N > 2^7).

Fig. 2. R4^3 Butterfly Architecture

Note that each radix-4 processor uses an "unrolled" architecture by implementing a fully systolic, tree-based layout in order to avoid the use of single-clock feedback loops. Figures 3 and 4 depict the radix-4 butterfly and the accumulator structures respectively.

Fig. 3. Radix-4 Butterfly Architecture

Fig. 4. Accumulator architecture

B. Architecture Performance and Advantages
The architecture presented in this paper realizes a 4096-point FFT by cascading two successive R4^3 stages. An additional stage of R4^3 would result in a 256K-point FFT engine with a latency of 3 R4^3 stages. This is in contrast to existing cascade FFT architectures, which require 6 R-2^2 [6], [7] or R-4 stages to perform a 4K-point FFT, or 9 stages for a 256K-point FFT. The proposed architecture demonstrates improved latency compared to the other cascade architectures because it requires data buffering only between the two R4^3 stages instead of between 6 plain R-4 stages. Also, unfolded FFT implementations require memories of size N between contiguous stages. Therefore, in order to perform a 4K-point FFT, memory of size 4K × 6 × 2 (points, stages, dual bank) is required, while our proposed architecture requires only 1/3 of that memory.
The proposed architecture has been implemented in RTL VHDL using fixed point arithmetic. We have targeted both high capacity, state-of-the-art FPGA devices such as the XILINX Virtex II 6000 and a high speed 0.13 um, 8M standard cell process. The Xilinx implementation on the 6000 part resulted in just 20% logic area utilization and 25% Block RAM utilization. Through the use of optimized Xilinx components (CoreLib multipliers), we managed to reduce the area to 13% of the total resources of the 6000 part. The standard cell implementation, on the other hand, included 96K standard cells and 64 RAMs (1361 standard cell rows), occupied a silicon area of 2630x5129 um (1.42 × 10^7 um^2) at 84.2% utilization and achieved a worst-case (0.9V, 125C) post-route performance of 604.5 MHz with a power consumption of 4.4 Watts. When using the typical process parameters (1V, 25C), the post-route frequency reported by the Place and Route tool exceeded 1 GHz, which makes the proposed architecture the fastest standard-cell FFT implementation reported in the literature. Figure 5 shows the final layout of the chip.

C. Comparison to Related Results
Several FFT schemes have been proposed in the literature. In [14] the 2K complex point FFT processor performs at 76 MHz and sustains a throughput of 2K points/26 us. [15] implements 1-D and 2-D FFTs of 1024 points at 80 MHz, at a computation time of 68 us. The 64-point Fourier transform chip presented in [17] operates at 20 MHz with 3.85 us latency. Comparatively, the architecture of the R4^3 processor presented in this work performs a 64 complex point FFT operating at the frequency of 200 MHz with a 0.32 us latency. ALTERA designs [19] utilize FFT cores with transform length varying from 64 points up to 4K points, presenting a maximum operating frequency of 300 MHz. Comparatively, the R4^3 processor when implemented in ALTERA FPGAs achieved a maximum operating frequency of 350 MHz. The corresponding XILINX
designs achieve a maximum operating frequency of 200 MHz but occupy considerably larger chip area than our approach.

Fig. 5. VLSI implementation layout of the 4K FFT processor

IV. CONCLUSION
This paper presented a new, very high speed FFT architecture based on the Radix-4^3 algorithm. A fully pipelined, systolic processing core of a 4096-point FFT has been implemented in both FPGA and standard cell technologies and validated in the former case. The results demonstrate the very high operating frequencies and the low latencies of both the FPGA and VLSI implementations. The proposed FFT architecture demonstrates a significant latency reduction compared to existing solutions and, at the same time, has reduced data memory requirements and improved multiplier utilization while occupying a smaller silicon area and consuming less power compared to similar solutions. The modular design of the Radix-4^3 engines allows them to be easily incorporated into larger systems for computing large scale FFTs, while a fully-registered, systolic architecture assures maximum operating frequency. Future research by our group will focus on the implementation of a reconfigurable FFT architecture, capable of performing the FFT of 64, 4K, 256K or 16M complex points.

REFERENCES
[1] A. Oppenheim, R. Schafer, “Digital Signal Processing”, Prentice Hall, 1975.
[2] Clark D. Thompson, “Fourier Transform in VLSI”, IEEE Transactions on Computers, 1983.
[3] E.H. Wold and A.M. Despain, “Pipeline and Parallel FFT Processors for VLSI Implementations”, IEEE Transactions on Computers, vol. C-33, 1984.
[4] J.W. Cooley and J.W. Tukey, “An algorithm for the machine calculation of complex Fourier series”.
[5] J. Lee, J. Lee, M. H. Sunwoo, S. Moh and S. Oh, “A DSP Architecture for High-Speed FFT in OFDM Systems”, ETRI Journal, 2002.
[6] S. He and M. Torkelson, “Design and Implementation of a 1024-point Pipeline FFT Processor”, IEEE 1998 Custom Integrated Circuits.
[7] S. He and M. Torkelson, “A New Approach to Pipeline FFT Processor”, Proceedings of the IPPS, 1996.
[8] G. Bi and E.V. Jones, “A pipelined FFT processor for word-sequential data”, IEEE Trans. Acoust., Speech, Signal Processing, 37(12):1982-1985, Dec. 1989.
[9] S. Choi, G. Govindu, J.-W. Jang, V. K. Prasanna, “Energy-Efficient and Parameterized Designs of Fast Fourier Transforms on FPGAs”, The 28th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2003.
[10] L. R. Rabiner and B. Gold, “Theory and Application of Digital Signal Processing”, Prentice-Hall.
[11] B. Suter and K. S. Stevens, “A Low Power, High Performance approach for Time-Frequency / Time-Scale Computations”, Proceedings SPIE98 Conference on Advanced Signal Processing Algorithms, Architectures and Implementations VIII, vol. 3461, pp. 86-90, July 1998.
[12] I. Saarinen, G. Coppola, A. Polydoros, J.L. Garcia, M. Lobeira, P. Dallas, M. Gertou, R. Cusani and G. Razzano, “High Bit Rate Adaptive WIND-FLEX Modem Architectures for Wireless Ad-Hoc Networking in Indoor Environments”.
[13] S. Hong, S. Kim, C. Papaeftymiou and W.E. Stark, “Power-Complexity Analysis of Pipelined VLSI FFT Architectures for Low Energy Wireless Communication Applications”, 42nd Midwest Symposium on Circuits and Systems, August 1999.
[14] T. Lenart and V. Owall, “A 2048 Complex Point FFT Processor Using a Novel Data Scaling Approach”, IEEE ISCAS 2003.
[15] S. Bouguezel, M. O. Ahmad and M.N. Swamy, “Arithmetic complexity of the split-radix FFT algorithms”, IEEE ICASSP 2005.
[16] I. S. Uzun, A. Amira and A. Bouridane, “FPGA implementations of fast Fourier transforms for real-time signal and image processing”, IEEE Vision, Image and Signal Processing, 2005.
[17] K. Maharatna, E. Grass, and U. Jagdhold, “A 64-Point Fourier Transform Chip for High-Speed Wireless LAN Applications Using OFDM”, IEEE Journal of Solid State Circuits, vol. 39, no. 3, March 2004.
[18] J. Y. Oh and M. S. Lim, “New Radix-2 to the 4th Power Pipeline FFT Processor”, IEICE Trans. Electron., vol. E88-C, no. 8, August 2005.
[19] FFT MegaCore Function Errata Sheet, http://www.altera.com
[20] FFT Cores for FPGA Product Information Sheet, http://www.xilinx.com
