ARCHITECTURE
K. Babionitakis, K. Manolopoulos, K. Nakos, D. Reisis, N. Vlassopoulos
Electronics Laboratory, Department of Physics
National and Kapodistrian University of Athens
Athens, Greece
dreisis@phys.uoa.gr

V.A. Chouliaras
Department of Electronic and Electrical Engineering
Loughborough University, Loughborough, LEICS, LE11 3TU, UK
V.A.Chouliaras@lboro.ac.uk
Abstract: High performance VLSI-based FFT architectures are key to signal processing and telecommunication systems since they meet hard real-time constraints at low silicon area and low power compared to CPU-based solutions. In order to meet these goals, this paper presents a novel VLSI FFT architecture based on combining three consecutive radix-4 stages to result in a 64-point FFT engine. Cascading these 64-point FFT engines results in an improved architecture design featuring certain characteristics. First, it can efficiently accommodate large input data sets in real time. It also simplifies processing requirements due to the radix-4 calculations. Finally, it reduces memory requirements and latency to one third compared to the fully unfolded radix-4 architecture. Two different implementations are used to validate the efficiency of the architecture: an FPGA implementation of a 4096-point FFT achieving a throughput of 4096 points/20.48 µs, and a VLSI implementation sustaining a throughput of 4096 points/3.89 µs.

I. INTRODUCTION

Signal processing and telecommunication applications require FFT implementations that can perform large size, low latency computations while exhibiting low power consumption [12]. These demanding computational tasks are executed either by a single, high frequency embedded processor [5] or by using an Application Specific Integrated Circuit (ASIC). A number of FFT architectures have been proposed in the literature [2], [3], [7] with varying levels of parallelism, sustained throughput rate, memory usage, hardware resources and power dissipation. Fully unfolded FFT architectures [10] achieve their maximum throughput at lower clock rates while occupying more VLSI area and using larger memory arrays between their successive stages.

Cascade FFT topologies [2], [6], [7] have reduced memory requirements compared to the unfolded case but are less efficient for higher than radix-2 architectures. Cascade architectures become complicated when high speed performance is sought by increasing the depth of the pipeline within the butterflies. In such cases, the cascade architectures must include larger memory within each butterfly processing element and complicated control specifically designed for different pipeline depths. Higher radix techniques reduce the number of stages of the FFT at an increased cost in terms of VLSI area for each stage. Interesting results have been presented by an alternative approach, which utilizes asynchronous FFT designs [11] with fully unrolled FFT circuits but occupies more VLSI area.

The present paper addresses an efficient FFT architecture maximizing throughput and keeping the control and the memory organizations simple compared to cascade and unfolded FFT architectures. Moreover, it is efficient compared to the aforementioned architectures with respect to the scalability of the maximal operating frequency, the pipeline depth and the data and twiddle widths. To improve latency and memory requirements, particularly for large input data sets, the proposed architecture combines three (3) Radix-4 circuits to result in a 64-point FFT engine.

In order to demonstrate the efficiency of combining 64-point Radix-4 FFT engines we discuss the architecture of a 4096-complex point design implemented on a Xilinx Virtex II FPGA. The particular FPGA implementation requires only two (2) memory banks, each 4096 complex words deep, and achieved a maximum operating frequency of 200 MHz, sustaining a throughput of 4096 points/20.48 µs while consuming 6.4 Watts for a typical workload. The architecture has also been implemented in a high performance 0.13 µm standard cell library from TSMC, where it achieved a worst-case (0.9 V, 125 C), post-route frequency of 604.5 MHz while consuming 4.4 Watts. It is interesting to point out that the design exceeded the 1 GHz rate for typical conditions (1.0 V, 25 C).

The paper consists of three sections: section 2 describes the derivation of the Radix-$4^3$ schema from the FFT equation, section 3 gives a detailed architecture description and section 4 concludes the paper.

II. ANALYSIS OF THE RADIX-$4^3$ ALGORITHM

The Discrete Fourier Transform (DFT) of a signal $x[n]$ of length $N$ is given by the series

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{kn}$$

In order to derive the Radix-$4^3$ algorithm, the first 3 steps in the cascade decomposition are considered. The linear index mapping transforms into a four-dimensional index map
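as follows:

$$
\begin{aligned}
n &= n_1 + \frac{N}{64} n_2 + \frac{N}{16} n_3 + \frac{N}{4} n_4 \\
k &= 64 k_1 + 16 k_2 + 4 k_3 + k_4
\end{aligned}
\tag{1}
$$

Applying equation 1 to the DFT equation yields

$$
X(64k_1 + 16k_2 + 4k_3 + k_4) = \sum_{n_1=0}^{\frac{N}{64}-1} \sum_{n_2=0}^{3} \sum_{n_3=0}^{3} \sum_{n_4=0}^{3} x\!\left(n_1 + \frac{N}{64} n_2 + \frac{N}{16} n_3 + \frac{N}{4} n_4\right) W_N^{nk}
\tag{2}
$$

With the cascade decomposition, the composite twiddle factors can be expressed as follows

$$
W_N^{kn} = W_N^{\left[n_1 + \frac{N}{64} n_2 + \frac{N}{16} n_3 + \frac{N}{4} n_4\right]\left[64k_1 + 16k_2 + 4k_3 + k_4\right]} \Rightarrow
$$

ture and can be expressed as

$$
\begin{aligned}
H_{N/16}^{k_3 k_4}\!\left(n_1 + \frac{N}{64} n_2\right) ={}& B_{N/4}^{k_4}\!\left(n_1 + \frac{N}{64} n_2\right) \\
&+ (-j)^{k_3}\, W_{16}^{k_4}\, B_{N/4}^{k_4}\!\left(n_1 + \frac{N}{64} n_2 + \frac{N}{16}\right) \\
&+ (-1)^{k_3}\, W_{8}^{k_4}\, B_{N/4}^{k_4}\!\left(n_1 + \frac{N}{64} n_2 + \frac{N}{8}\right) \\
&+ j^{k_3}\, W_{8}^{k_4}\, W_{16}^{k_4}\, B_{N/4}^{k_4}\!\left(n_1 + \frac{N}{64} n_2 + \frac{3N}{16}\right)
\end{aligned}
\tag{7}
$$

Finally, expanding the summation of equation 6 with respect to index $n_2$ provides a set of 64 DFTs of length $N/64$.

$$
X(64k_1 + 16k_2 + 4k_3 + k_4) = \sum_{n_1=0}^{\frac{N}{64}-1} \left[ T_{N/64}^{k_2 k_3 k_4}(n_1)\, W_N^{n_1 (16k_2 + 4k_3 + k_4)} \right] W_{N/64}^{n_1 k_1}
\tag{8}
$$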
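The index map of equation 1 and the factorization of equation 8 can be checked numerically. The following is a minimal Python sketch, not the hardware architecture of the paper. It assumes the usual convention $W_N = e^{-j 2\pi / N}$, and it assumes that $T(n_1)$ is the 64-point DFT of the decimated sequence $x(n_1 + \frac{N}{64} q)$ with $q = n_2 + 4n_3 + 16n_4$, which is what substituting equation 1 into equation 2 implies. The 64-point engine is computed here by a direct DFT rather than by three radix-4 stages, and the function names (`twiddle`, `dft`, `index_map`, `radix4_cubed_dft`) are illustrative only.

```python
import cmath


def twiddle(N, e):
    """Return W_N^e = exp(-2*pi*j*e/N) (assumed sign convention)."""
    return cmath.exp(-2j * cmath.pi * (e % N) / N)


def dft(x):
    """Direct O(N^2) DFT, used both as reference and as a stand-in engine."""
    N = len(x)
    return [sum(x[n] * twiddle(N, k * n) for n in range(N)) for k in range(N)]


def index_map(N):
    """Sorted list of all n produced by the four-dimensional map of equation 1."""
    M = N // 64
    return sorted(
        n1 + M * n2 + (N // 16) * n3 + (N // 4) * n4
        for n1 in range(M)
        for n2 in range(4)
        for n3 in range(4)
        for n4 in range(4)
    )


def radix4_cubed_dft(x):
    """N-point DFT organised as in equation 8 (N must be a multiple of 64)."""
    N = len(x)
    assert N % 64 == 0, "N must be a multiple of 64"
    M = N // 64
    # T[n1][r]: 64-point DFT of x[n1 + M*q], q = n2 + 4*n3 + 16*n4,
    # evaluated at r = 16*k2 + 4*k3 + k4 (direct DFT stands in for the
    # three cascaded radix-4 stages).
    T = [dft([x[n1 + M * q] for q in range(64)]) for n1 in range(M)]
    X = [0j] * N
    for r in range(64):
        # Inter-stage twiddle W_N^{n1 * r}, then a length-N/64 DFT over n1.
        column = [T[n1][r] * twiddle(N, n1 * r) for n1 in range(M)]
        for k1, value in enumerate(dft(column)):
            X[64 * k1 + r] = value
    return X


if __name__ == "__main__":
    N = 256
    x = [complex((3 * n) % 7 - 3, (5 * n) % 11 - 5) for n in range(N)]
    err = max(abs(a - b) for a, b in zip(dft(x), radix4_cubed_dft(x)))
    print("index map covers 0..N-1:", index_map(N) == list(range(N)))
    print("max abs difference vs direct DFT:", err)
```

For each of the 64 values of $16k_2 + 4k_3 + k_4$, the loop applies one twiddle multiplication per $n_1$ followed by a DFT of length $N/64$, which mirrors the statement that equation 8 yields a set of 64 DFTs of length $N/64$.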