FPGA/ASIC Implementation
Todd E. Schmuland and Mohsin M. Jamali
I. INTRODUCTION
System-on-Chip (SoC) solutions using Field Programmable
Gate Arrays (FPGAs) or Application-Specific Integrated Circuits (ASICs) are desirable to lower hardware costs and keep
circuit sizes small. The component building blocks used in
an SoC are typically hand-coded in Hardware Description
Language (HDL). This endeavor is both time-consuming and
prone to implementation errors. A key component used in
many SoC solutions is the Fast Fourier Transform (FFT) [1].
However, the performance of the FFT component in terms of
occupied slices, maximum frequency, and dynamic range, is
not known until the HDL for the FFT component is coded,
synthesized, and measured. It would be desirable to have a
software tool autogenerate HDL for an FFT component where
an engineer simply provides the targeted characteristics of the
FFT. In addition, the software tool should give feedback to
the engineer on the performance of the autogenerated FFT
component. Using this feedback, the engineer can focus on
the overall SoC performance and make adjustments to the FFT
component as necessary.
This paper describes a software tool, written as a MATLAB
function, that allows an FFT to be specified via its programmable/selectable parameters such as input word size, FFT
This work was sponsored by the Dayton Area Graduate Studies Institute
(DAGSI) with support from the Air Force Research Lab, Sensors Directorate.
Fig. 1. FFT flowgraph of a 16-point DIT radix-4 phase factor map with
radix-4 butterflies constructed using four generalized 2-input butterflies, as
highlighted by the box.
[Flowgraph: inputs x[0] to x[15] in natural order; phase factors W0, W1, W2, W3, W4, W6, and W9 on the butterfly branches; outputs in bit-reversed order X[0], X[8], X[4], X[12], X[2], X[10], X[6], X[14], X[1], X[9], X[5], X[13], X[3], X[11], X[7], X[15].]
[Fig. 2, caption not recovered. Labels: butterfly arithmetic (parallel/serial operators); phase factor multipliers; butterfly equation (phase factor case); code elements: signals, process, variables, begin, operators, label, port maps.]

[Fig. 3, caption not recovered. Two-pass butterfly generation (Pass 1, Pass 2) with the four butterfly cases:]

Case  W_U      W_L      U1         L1         Butterfly output
1     1 + 0j   1 + 0j   Pass-thru  Pass-thru  U2: add/add, L2: subtract/subtract
2     1 + 0j   Complex  Pass-thru  Multiply   round(L1), U2: add/add, L2: subtract/subtract
3     Complex  Complex  Multiply   Multiply   U2: round(add/add), L2: round(subtract/subtract)
4     1 + 0j   0 - 1j   Pass-thru  Pass-thru  U2: add/subtract, L2: subtract/add

[Equations (1) and (2) survive only in part: $L_s = 1 + N_i + N_f + N_s + \ldots$ and $\sum_{n=1}^{N_s} L(n)$.]
component. The files include the top level entity for the FFT
component and individual files for each FFT stage. The top
level entity consists of concurrent statements for each FFT
stage, plus a bit-reversed signal assignment from the last FFT
stage to the FFT output, thus creating a component with in-order input and in-order output signals.
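The bit-reversed signal assignment can be illustrated in software. The following is a minimal Python sketch, not the tool's generated VHDL; the function names are hypothetical. It assumes a power-of-two FFT size and that position p of the last stage holds the result whose index is the bit reversal of p, which is the output ordering shown in Fig. 1.

```python
def bit_reverse(index: int, width: int) -> int:
    """Return index with its lowest `width` bits in reversed order."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n_points: int) -> list:
    """Map the natural-order indices 0..n_points-1 through bit reversal."""
    if n_points <= 0 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two")
    width = n_points.bit_length() - 1  # log2(n_points)
    return [bit_reverse(i, width) for i in range(n_points)]


def reorder_outputs(last_stage: list) -> list:
    """Software analogue of the bit-reversed assignment:
    output k is taken from last-stage position bit_reverse(k)."""
    order = bit_reversed_order(len(last_stage))
    return [last_stage[order[k]] for k in range(len(last_stage))]


if __name__ == "__main__":
    # For 16 points this reproduces the output ordering of Fig. 1.
    print(bit_reversed_order(16))
```

Because bit reversal is its own inverse, the same index map serves both to describe the scrambled order and to undo it, so in hardware the reordering is a fixed wiring pattern rather than logic.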
The tool autogenerates each generalized 2-input butterfly
using a two-pass approach, as shown in Fig. 3. Pass one
computes the needed variables and bit vector sizes, while pass
two generates the function or entity arithmetic operators for the
butterfly. Every butterfly in an FFT falls into one of four cases,
depending on whether the upper and lower phase factors
are trivial or non-trivial complex multiplications. The
resulting case determines whether the input data is passed through
as is or requires a complex multiplication, and
how the phase factor multiplications are combined into the
butterfly's output.
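The case selection can be sketched in Python as follows. This is an illustrative model rather than the tool's MATLAB code: the function names are hypothetical, the case numbering follows the labels of Fig. 3 as read here (case 1: both phase factors equal 1; case 2: only the lower one is non-trivial; case 3: general multiplication; case 4: lower phase factor equal to -j), rounding is omitted, and any combination not listed is assumed to fall back to the general case 3.

```python
import cmath


def twiddle(k: int, n_points: int) -> complex:
    """Phase factor W_N^k = exp(-j*2*pi*k/N)."""
    return cmath.exp(-2j * cmath.pi * k / n_points)


def classify(k_upper: int, k_lower: int, n_points: int) -> int:
    """Pick the butterfly case from the phase factor exponents."""
    upper_is_one = k_upper % n_points == 0
    lower = k_lower % n_points
    if upper_is_one and lower == 0:
        return 1  # both inputs pass through
    if upper_is_one and 4 * lower == n_points:
        return 4  # W = -j: folded into the adders, no multiplier
    if upper_is_one:
        return 2  # only the lower input is multiplied
    return 3      # assumed fallback: both inputs are multiplied


def butterfly(u1: complex, l1: complex, k_upper: int, k_lower: int, n_points: int):
    """Return (U2, L2) using the arithmetic selected by the case."""
    case = classify(k_upper, k_lower, n_points)
    if case == 1:
        return u1 + l1, u1 - l1
    if case == 4:
        # -j * (a + jb) = b - ja, so the real and imaginary parts swap roles.
        u2 = complex(u1.real + l1.imag, u1.imag - l1.real)  # add / subtract
        l2 = complex(u1.real - l1.imag, u1.imag + l1.real)  # subtract / add
        return u2, l2
    if case == 2:
        l1 = l1 * twiddle(k_lower, n_points)
    else:
        u1 = u1 * twiddle(k_upper, n_points)
        l1 = l1 * twiddle(k_lower, n_points)
    return u1 + l1, u1 - l1
```

Classifying on the integer exponent instead of on the floating-point phase factor avoids missing the trivial cases through rounding error, and it shows why cases 1 and 4 need no multiplier at all.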
An FFT component using the serial-parallel butterfly architecture requires pipeline timing circuitry to:
1) Time the data as it passes through the FFT
2) Indicate when each FFT result is ready to be latched
3) Reset each FFT stage at the correct time
The following equations are used by the software tool
to determine the timing shift register length, tap points for
FFT stage resets, and minimum number of pipeline clocks
required for continuous operation of the FFT component. Each
FFT stage requires a specific number of clocks for its fixed-point result to appear on its output, with the same fractional
(4)
where the timing shift register is indexed as $0 \ldots L_s - 1$. Note
that (3) includes one extra clock for each FFT stage, which is
required to grow the word size of the data value by
one bit. Also, $T(1)$ should be connected directly to the load
signal from the circuit outside the FFT component.
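The one-bit growth per stage follows from two's-complement addition: the sum or difference of two B-bit signed values always fits in B + 1 bits. The Python sketch below checks this by enumerating the extreme values. The function names are hypothetical, and the model covers only the butterfly adder; it ignores any growth or rounding from the phase factor multiplication.

```python
def signed_range(bits: int):
    """Inclusive (min, max) of a two's-complement word of `bits` bits."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def bits_needed(lo: int, hi: int) -> int:
    """Smallest two's-complement width that holds every value in [lo, hi]."""
    bits = 1
    while True:
        rmin, rmax = signed_range(bits)
        if rmin <= lo and hi <= rmax:
            return bits
        bits += 1


def adder_output_bits(input_bits: int) -> int:
    """Width needed for both a + b and a - b with a, b of `input_bits` bits."""
    lo, hi = signed_range(input_bits)
    worst_lo = min(lo + lo, lo - hi)
    worst_hi = max(hi + hi, hi - lo)
    return bits_needed(worst_lo, worst_hi)


def word_size_per_stage(input_bits: int, n_stages: int):
    """Word size at the output of each stage, growing one bit per stage."""
    return [input_bits + stage for stage in range(1, n_stages + 1)]


if __name__ == "__main__":
    print(adder_output_bits(11))
    print(word_size_per_stage(11, 4))
```

In a bit-serial datapath each additional bit costs one additional clock, which is consistent with the extra clock per stage noted above.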
V. PERFORMANCE AND PORTABILITY
To evaluate the performance of the autogenerated VHDL
FFT component, various FFTs were generated and synthesized
using Xilinx ISE 13.2, as shown in Table I, targeting our largest
available Virtex-6 LXT FPGA (-3 speed grade). The parallel
arithmetic FFT F1 has a much higher throughput than the
serial-parallel FFT F2 (23.04 Gs/s versus 2.369 Gs/s), but it
uses 2.66 times as many slices. Moreover, the Virtex-6 LXT using
SelectIO [9] can only input data at 1.4 Gs/s, so FFT
F1 is highly data starved. The throughput of serial-parallel
FFT F2 is closer to the maximum I/O data rate of the FPGA
part, making it the more appropriate solution.
To analyze how the throughput of the serial-parallel butterfly
architecture varies with FFT size, FFT F3 was compared to FFT F2 in Table I.
The latency for data to flow through the
pipeline increases from 81 to 125 clocks, but the throughput
remains approximately the same (2.369 Gs/s versus 2.426 Gs/s). This occurs
because the larger front-end of FFT F3 compensates for the
longer pipeline. Also, the hardware cost of serial-parallel FFT
F3 is approximately the same as that of parallel arithmetic FFT F1,
even though 512-point FFT F3 has four times the number of
data paths and butterflies.
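The comparisons above can be reproduced from the Table I figures with a few lines of Python. This sketch only restates the published numbers; the dictionary layout, field names, and helper functions are illustrative assumptions.

```python
# Values copied from Table I; the field names are illustrative.
TABLE_I = {
    "F1": {"points": 256, "radix": 4, "math": "parallel",
           "slices": 23862, "pipeline_clocks": 1, "gsps": 23.04},
    "F2": {"points": 256, "radix": 4, "math": "ser-par",
           "slices": 8963, "pipeline_clocks": 81, "gsps": 2.369},
    "F3": {"points": 512, "radix": 2, "math": "ser-par",
           "slices": 24677, "pipeline_clocks": 125, "gsps": 2.426},
}

# Input data rate quoted in the text for the Virtex-6 LXT using SelectIO.
SELECTIO_LIMIT_GSPS = 1.4


def ratio(numerator: str, denominator: str, field: str) -> float:
    """Ratio of one Table I column between two FFT configurations."""
    return TABLE_I[numerator][field] / TABLE_I[denominator][field]


def io_oversupply(name: str) -> float:
    """How many times the FFT throughput exceeds the I/O input rate."""
    return TABLE_I[name]["gsps"] / SELECTIO_LIMIT_GSPS


if __name__ == "__main__":
    print("F1/F2 slices:     %.2f" % ratio("F1", "F2", "slices"))
    print("F1/F2 throughput: %.2f" % ratio("F1", "F2", "gsps"))
    print("F3/F1 slices:     %.2f" % ratio("F3", "F1", "slices"))
    print("F3/F2 throughput: %.2f" % ratio("F3", "F2", "gsps"))
    for name in TABLE_I:
        print("%s throughput / I/O limit: %.2f" % (name, io_oversupply(name)))
```

The slice ratio of F1 to F2 comes out at about 2.66, matching the figure in the text, while F1 can consume data more than sixteen times faster than the quoted input rate can supply it.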
TABLE I
VARIOUS FFTS WITH 11-BIT INPUT WORDS AND 14-BIT PHASE FACTORS

FFT  Point Size  Butterfly Radix  Math Type  Slices Used  Pipeline Clocks  Sample Rate
F1   256         4                parallel   23862        1                23.04 Gs/s
F2   256         4                ser-par    8963         81               2.369 Gs/s
F3   512         2                ser-par    24677        125              2.426 Gs/s
Fig. 4. Normalized comparison of various Q1.8 64-point radix-2 FFT
implementations using 8-bit fixed length phase factors (288 DSP slices were
used for both Parallel DSP and SPIRAL DSP).
[Chart: normalized scale from 0 to 1; groups: Maximum Throughput, SLICEs Used; series: Parallel Logic (unscaled), Parallel Logic (scaled), Parallel DSP (scaled), SPIRAL DSP (scaled), Serial Logic, CoreGen.]
VI. CONCLUSION
This paper has described an FFT autogeneration tool that
accepts a set of FFT parameters and generates a VHDL
component to be synthesized for use in FPGAs. Unlike propri-
REFERENCES
[1] J.W. Cooley and J.W. Tukey, "An Algorithm for the Machine Calculation
of Complex Fourier Series," Math. Computation, Vol. 19, 1965, pp. 297-301.
[2] DSP Cores from IP Cores, Inc. (http://www.ipcores.com).
[3] FFT IP from Dillon Engineering, Inc. (http://www.dilloneng.com).
[4] C. Yu, K. Irick, C. Chakrabarti, and V. Narayanan, "Multidimensional
DFT IP Generator for FPGA Platforms," IEEE T. Circuits-I, Vol. 58,
No. 4, Apr. 2011, pp. 755-764.
[5] DFT/FFT IP Core Generator from Carnegie Mellon University (http://www.spiral.net).
[6] J.G. Proakis and D.G. Manolakis, Digital Signal Processing, 4th ed.,
Upper Saddle River: Pearson Prentice Hall, 2007, pp. 449-461.
[7] L. Wenqi, W. Xuan, and S. Xiangran, "Design of Fixed-Point High-Performance FFT Processor," ICETC 2010, Vol. 5, 2010, pp. V5-139 to V5-143.
[8] K. Maharatna, E. Grass, and U. Jagdhold, "A 64-Point Fourier Transform
Chip for High-Speed Wireless LAN Application Using OFDM," IEEE
J. Solid-St. Circ., Vol. 39, No. 3, Mar. 2004, pp. 484-493.
[9] Virtex-6 LXT FPGAs from Xilinx, Inc. (http://www.xilinx.com).