
CAD Tool Autogeneration of VHDL FFT for FPGA/ASIC Implementation

Todd E. Schmuland and Mohsin M. Jamali
Department of Electrical Engineering & Computer Science
The University of Toledo
Toledo, OH, 43606
todd.schmuland@utoledo.edu, mohsin.jamali@utoledo.edu

Matthew B. Longbrake and Peter E. Buxa
AFRL/RYDR
Wright-Patterson AFB
Dayton, OH, 45433
matthew.longbrake2@wpafb.af.mil, peter.buxa@wpafb.af.mil

Abstract: Hand-coding Fast Fourier Transforms (FFTs) in
Hardware Description Language (HDL) is time consuming and
prone to errors. Proprietary IP cores are available; however,
they are closed-source and unviewable. The open-source FFT
generator SPIRAL is available; however, it only produces parallel
arithmetic solutions and thus limits the maximum FFT size that
will fit in available Field Programmable Gate Arrays (FPGAs).
An autogenerator of VHDL FFTs is described that takes a
set of FFT parameters and generates an FFT component with
feedback of occupied slices, maximum frequency, and dynamic
range performance. Both parallel arithmetic and serial-parallel
butterfly architectures can be generated where serial-parallel
allows larger sized FFTs to fit inside available FPGA parts.
Emphasis is placed on large sized serial-parallel FFTs and
portability to Application-Specific Integrated Circuits (ASICs)
using Cadence Encounter. Serial-parallel FFT pipeline control
and FPGA hardware reduction are also investigated.
Index Terms: FFT; fixed-point; VHDL; FPGA; autogeneration

I. INTRODUCTION
System-on-Chip (SoC) solutions using Field Programmable
Gate Arrays (FPGAs) or Application-Specific Integrated Circuits (ASICs) are desirable to lower hardware costs and keep
circuit sizes small. The component building blocks used in
an SoC are typically hand-coded in Hardware Description
Language (HDL). This endeavor is both time consuming and
prone to implementation errors. A key component used in
many SoC solutions is the Fast Fourier Transform (FFT) [1].
However, the performance of the FFT component in terms of
occupied slices, maximum frequency, and dynamic range, is
not known until the HDL for the FFT component is coded,
synthesized, and measured. It would be desirable to have a
software tool autogenerate HDL for an FFT component where
an engineer simply provides the targeted characteristics of the
FFT. In addition, the software tool should give feedback to
the engineer on the performance of the autogenerated FFT
component. Using this feedback, the engineer can focus on
the overall SoC performance and make adjustments to the FFT
component as necessary.
This paper describes a software tool, written as a MATLAB
function, that allows an FFT to be specified via its programmable/selectable parameters such as input word size, FFT
size, Decimation-in-Time (DIT) or Decimation-in-Frequency
(DIF), butterfly radix, phase factor (twiddle factor) quantization, and stage scaling. The software tool provides a bit-true
MATLAB simulation of the FFT architecture and an option to
autogenerate Very-high-speed integrated circuit HDL (VHDL)
for vendor-independent FPGA or ASIC implementation. The
generated FFT structure is fully parallel where every butterfly
is instantiated, however each butterfly is optimized by leveraging constant multipliers.

This work was sponsored by the Dayton Area Graduate Studies Institute
(DAGSI) with support from the Air Force Research Lab, Sensors Directorate.

Unlike commercial FFT IP cores [2]-[4], which are delivered as black box FFT implementations that conform to a set
of parameters, this software tool generates open source VHDL
of parameterized FFTs whose contents are fully exposed and
portable to other hardware platforms. The popular SPIRAL
FFT generator [5] only produces HDL with parallel arithmetic
using DSP blocks, which limits the FFT size that can fit into
a given FPGA part. Our software tool, however, gives the
option of autogenerating a serial-parallel constant multiplier
butterfly architecture, thus allowing larger FFT sizes to fit in
available FPGA parts. The targeted use of the resulting VHDL
entity is for SoC solutions where the bulk of DSP and RAM
blocks [6]-[8] are reserved for components other than the FFT
component.
The paper is organized as follows. In Section II, the FFT
parameters the VHDL autogeneration tool accepts as input
are described. Section III discusses the functions, macros, and
entities developed as a VHDL package for constructing FFTs.
The autogeneration of VHDL for parallel arithmetic and serial-parallel butterfly architectures is covered in Section IV. Section
V looks at the performance characteristics of the autogenerated
VHDL code and the portability of the VHDL code. Section VI
concludes the paper.
II. SOFTWARE DEFINED FFT PARAMETERS
The VHDL autogenerator takes many FFT parameters as
input. This offers great flexibility in designing an FFT component for an SoC without having to write a single line of
VHDL code. The FFT parameters entered into the VHDL
autogenerator are:
- Number of input samples
- DIT or DIF
- Butterfly radix of 2, 4, or split (auto-determined)
- Butterfly architecture: serial-parallel or parallel arithmetic
- Input word size
- Scaling between FFT stages (output word size)
- Fixed or variable length phase factors
- Complex multiplier built using 3 or 4 multipliers

978-1-4673-0859-5/12/$31.00 ©2012 IEEE

[Figure 1: flowgraph with inputs x[0] to x[15] in natural order, four columns of phase factors, and outputs in bit-reversed order (X[0], X[8], X[4], X[12], X[2], X[10], X[6], X[14], X[1], X[9], X[5], X[13], X[3], X[11], X[7], X[15]). Phase factor exponents by column, top to bottom: column 1 is all W0; column 2 is W0 for the first twelve rows and W4 for the last four; column 3 is W0, W0, W0, W0, W0, W2, W4, W6, W0, W1, W2, W3, W0, W3, W6, W9; column 4 is W0, W0, W0, W4 repeated four times.]

Fig. 1. FFT flowgraph of a 16-point DIT radix-4 phase factor map with radix-4 butterflies constructed using four generalized 2-input butterflies as highlighted by the box

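The choice between 3 and 4 multipliers in the list above is a standard area trade-off. The following Python sketch compares the direct four-multiplier product with one common three-multiplier rearrangement. The paper does not state which identity the tool uses, so the particular rearrangement and the function names `cmul4` and `cmul3` are assumptions made for illustration.

```python
def cmul4(a, b, c, d):
    """(a + jb)(c + jd) using four real multiplies and two add/subtracts."""
    return a * c - b * d, a * d + b * c


def cmul3(a, b, c, d):
    """Same product using three real multiplies and five add/subtracts.

    When (c, d) is a constant phase factor, (d - c) and (c + d) can be
    precomputed, leaving three add/subtracts at run time.
    """
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2


if __name__ == "__main__":
    print(cmul4(3, -2, 5, 7))
    print(cmul3(3, -2, 5, 7))
```

Both functions return the real and imaginary parts as a tuple, so they can be compared directly on integer (fixed-point) operands.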
The software tool uses one master FFT flowgraph and
determines the correct phase factor map while leveraging
radix-4 butterflies whenever possible. For example, Fig. 1
shows a 16-point DIT with radix-4 butterflies constructed with
four generalized 2-input butterflies. Fully parallel structured
FFTs have constant phase factors per butterfly, therefore the
complex multipliers reduce to a series of adders shifted by
position. Both fixed and variable length phase factors are
offered as choices for autogenerated VHDL.
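As an illustration of how a constant multiplier reduces to adders shifted by position, the Python sketch below multiplies a sample by a quantized constant using only shifts and adds. It is a minimal model under stated assumptions: the names `quantize`, `const_mul`, and `adders_needed` are hypothetical, the constant is stored in sign-magnitude form, and the result is truncated; the generated VHDL may organize the adders differently.

```python
import math


def quantize(value, frac_bits):
    """Round a real coefficient to an integer with frac_bits fraction bits."""
    return round(value * (1 << frac_bits))


def const_mul(x, coeff, frac_bits):
    """Multiply integer x by the fixed-point constant coeff with shifts and adds."""
    sign = -1 if coeff < 0 else 1
    mag = abs(coeff)
    acc = 0
    shift = 0
    while mag:
        if mag & 1:
            acc += x << shift      # one shifted copy of x per '1' bit
        mag >>= 1
        shift += 1
    return (sign * acc) >> frac_bits   # drop the fraction bits (truncation)


def adders_needed(coeff):
    """Adders required to sum the shifted copies: one fewer than the '1' bits."""
    return max(bin(abs(coeff)).count("1") - 1, 0)


if __name__ == "__main__":
    c = quantize(math.cos(math.pi / 8), 8)   # 8 fraction bits, illustrative
    print(c, const_mul(100, c, 8), adders_needed(c))
```

Because the adder count depends on the number of set bits in each constant, the hardware cost of a fully parallel FFT varies with the phase factor values and their word length.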
Butterfly architecture can either be serial-parallel, where
data is operated on one bit at a time, or it can be processed
in parallel as bit vectors. Serial-parallel butterfly architecture
lends itself well to partitioning very large FFTs across several
FPGA parts, since each data path is only one signal/pin.
In addition, the ratio of serial-parallel throughput versus
occupied slices is favorable since every butterfly of the FFT
is instantiated and clocked simultaneously.
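To make "one bit at a time" concrete, the Python sketch below models a bit-serial adder: a single full adder with a carry register, fed least significant bit first, one bit per clock. The helper names and the 12-bit word width are illustrative assumptions and are not taken from the generated VHDL.

```python
def to_bits(value, width):
    """Two's complement bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Interpret an LSB-first bit list as a signed two's complement integer."""
    raw = sum(bit << i for i, bit in enumerate(bits))
    return raw - (1 << len(bits)) if bits[-1] else raw


def serial_add(a_bits, b_bits):
    """Add two LSB-first bit streams; each loop iteration models one clock."""
    carry = 0                      # carry flip-flop, cleared before each word
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out


if __name__ == "__main__":
    width = 12                     # e.g. an 11-bit sample sign-extended by one bit
    total = serial_add(to_bits(700, width), to_bits(-300, width))
    print(from_bits(total))
```

The adder occupies the same logic regardless of word width; a wider word only costs more clocks, which is why the serial-parallel form trades latency for area.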
As data flows through an FFT, the word size grows by one
bit per stage due to the carry bit of the add/subtract operations inside
each butterfly. Depending on the application, either the output
can be truncated after the FFT, or the data can be truncated
internal to the FFT, also known as scaling, by truncating one
bit between each FFT stage.
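The effect of word growth and scaling can be checked with a small Python model of one add/subtract butterfly. This is a sketch under assumptions: the function names are hypothetical, phase factor multiplication is omitted, and scaling is modeled as an arithmetic right shift by one bit.

```python
def butterfly(upper, lower, scale):
    """Add/subtract butterfly; with scale=True one bit is truncated."""
    total, diff = upper + lower, upper - lower
    if scale:
        total >>= 1                # arithmetic shift drops the LSB
        diff >>= 1
    return total, diff


def fits(value, bits):
    """True if value is representable as a signed word of the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def output_width(input_bits, stages, scaled):
    """Word size after all stages: one extra bit per stage unless scaled."""
    return input_bits if scaled else input_bits + stages


if __name__ == "__main__":
    print(butterfly(1023, 1023, scale=False))   # needs 12 bits
    print(butterfly(1023, 1023, scale=True))    # still fits in 11 bits
    print(output_width(11, 8, scaled=False))
```

With 11-bit inputs and 8 stages, the unscaled model ends at 19 bits while the scaled model stays at 11 bits, at the price of discarding one least significant bit per stage.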

[Figure 2: flow diagram. Block labels: User supplied input parameters (FFT size 2^n, DIT/DIF, scaling); Generate phase factor constants; Generate FFT flowgraph (butterfly data paths, phase factor locations); Analyze: trivial / non-trivial multiplies; Determine phase factor case; Butterfly arithmetic (parallel/serial operators); Phase factor multipliers; Butterfly equation (phase factor case); parallel output: signals, process, variables, begin, operators; serial output: label, operators, port maps; Developed VHDL library: functions, macros, entities (ADD, SUB, MUL); Stage progression of data; Adaptive scaling between stages; Timing shift register taps (serial); Bit-reversal output signal map; Mux/Demux entity wrapper.]

Fig. 2. Autogeneration software tool showing FFT parameter input, flowgraph generation, trivial/non-trivial phase factor determination for each FFT stage, generation of each FFT butterfly, stage scaling, and bit-reversal output with mux/demux wrapper

III. DEVELOPED VHDL LIBRARY PACKAGE
The VHDL autogenerator uses our developed VHDL library package consisting of functions, macros, and entities
to construct the VHDL files for a specified FFT as shown
in Fig. 2. Functions were used for the parallel arithmetic
since they allow the recursive function calling necessary to
construct the constant multipliers which consist of adders
shifted by position. Functions also facilitate easier construction
of pipelining where each FFT stage is clocked using VHDL
processes containing arithmetic operators for each butterfly in
a behavioral fashion.
Entities were used for the serial-parallel butterfly architecture since they require a structural model where each
arithmetic operator clocks and processes one bit of data at
a time. The use of entities also allows generics to be used to
provide the constant bit-string (phase factor) to the constant
multiplier entity. A second generic is used to dictate whether
a buffering delay line should be placed on the entity's output
and how many clocks the output should be delayed by. This
is necessary to keep all the butterflies of a given FFT stage
in sync such that trivial and non-trivial phase factor butterfly
results appear at the next stage with the same latency. A trivial
butterfly is one where the phase factor is either 1 + 0j or 0 - 1j,
thus reducing the complex multiply to a simple identity or
complex conjugate with swap operation, respectively.
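The trivial cases can be expressed without any multiplier. The Python sketch below classifies a phase factor W_N^k = exp(-2*pi*j*k/N) by its exponent and applies the two trivial cases named in the text; the function names are hypothetical, and only 1 + 0j and 0 - 1j are treated as trivial, following the definition given here.

```python
def is_trivial(k, n):
    """True when W_N^k is 1 + 0j (k = 0) or 0 - 1j (k = N/4), modulo N."""
    k %= n
    return k == 0 or 4 * k == n


def apply_trivial(re, im, k, n):
    """Multiply (re + j*im) by a trivial phase factor without a multiplier."""
    k %= n
    if k == 0:
        return re, im              # identity
    if 4 * k == n:
        return im, -re             # swap the parts, then negate the new imaginary part
    raise ValueError("phase factor is not trivial")


def stage_is_all_trivial(exponents, n):
    """True when every phase factor exponent in a stage is trivial."""
    return all(is_trivial(k, n) for k in exponents)


if __name__ == "__main__":
    print(apply_trivial(3, 5, 4, 16))                       # (3 + 5j) * (0 - 1j)
    print(stage_is_all_trivial([0] * 12 + [4] * 4, 16))     # a column of W0 and W4
```

Classifying whole stages this way is what lets a generator decide whether a stage needs only pass-through wiring or full complex multipliers with matching delay lines.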
IV. AUTOGENERATION OF FFT BUTTERFLIES
After the VHDL autogeneration tool has determined the
FFT flowgraph and phase factor locations, the tool creates
a set of VHDL files that will synthesize into a usable FFT
component. The files include the top level entity for the FFT
component and individual files for each FFT stage. The top
level entity consists of concurrent statements for each FFT
stage, plus a bit-reversed signal assignment from the last FFT
stage to the FFT output, thus creating a component with in-order input and in-order output signals.
The tool autogenerates each generalized 2-input butterfly
using a two pass approach as shown in Fig. 3. Pass one
computes the needed variables and bit vector sizes while pass
two generates the function or entity arithmetic operators for the
butterfly. Every butterfly in an FFT falls into one of four cases
depending on the upper and lower phase factors and whether
they are a trivial or non-trivial complex multiplication. The
resulting case determines if the input data is passed through
as is or requires a complex multiplication, and determines
how the phase factor multiplications are combined into the
butterfly's output.

[Figure 3: flowchart of butterfly generation with inputs U1, L1, phase factors WU, WL, and outputs U2, L2. Pass 1 computes the needed variables and bit vector sizes. Pass 2 selects one of four cases:]

Case  WU         WL         U1         L1         Butterfly output
1     1 + 0j     1 + 0j     Pass-thru  Pass-thru  U2: add/add, L2: subtract/subtract
2     1 + 0j     Complex *  Pass-thru  Multiply   round(L1), U2: add/add, L2: subtract/subtract
3     Complex *  Complex *  Multiply   Multiply   U2: round(add/add), L2: round(subtract/subtract)
4     1 + 0j     0 - 1j     Pass-thru  Pass-thru  U2: add/subtract, L2: subtract/add

Fig. 3. Autogeneration of FFT butterflies using a two pass approach where pass one determines the variables required for the butterfly and pass two generates the arithmetic operator statements based on trivial/non-trivial phase factor cases

An FFT component using serial-parallel butterfly architecture requires pipeline timing circuitry to control:
1) Timing the data as it passes through the FFT
2) Indicating when each FFT result is ready to be latched
3) Resetting each FFT stage at the correct time
The following equations are used by the software tool
to determine the timing shift register length, tap points for
FFT stage resets, and minimum number of pipeline clocks
required for continuous operation of the FFT component. Each
FFT stage requires a specific number of clocks for its fixed-point result to appear on its output, with the same fractional
precision, such that

$$L(n) = \begin{cases} 1 & \text{stage } n \text{ is all trivial} \\ T_f + 2 & \text{stage } n \text{ has some non-trivial} \end{cases} \tag{1}$$

where $L(n)$ is the number of clocks required for FFT stage $n$
to have its result appear on its output and $T_f$ is the number
of fraction bits that represent the phase factors of the FFT.
Therefore, the overall timing shift register length is

$$L_s = 1 + N_i + N_f + N_s + \sum_{n=1}^{N_s} L(n) \tag{2}$$

where $N_i$ and $N_f$ are the number of integer and fraction bits
respectively that represent the input data words to the FFT,
$N_s$ is the number of FFT stages, and $L(n)$ is taken from (1).
The time it takes for the first load signal to propagate
through the timing shift register and appear as the ready signal
is $L_s$, however subsequent data sets only require $L_s - N_s$
clocks between load signals.
The number of clocks each FFT stage requires for its input
to start appearing on its output is

$$K(n) = \begin{cases} 2 & \text{stage } n \text{ is all trivial} \\ 4 & \text{stage } n \text{ has some non-trivial} \end{cases} \tag{3}$$

therefore each FFT stage $n$ reset tap point is given by

$$T(n) = \begin{cases} 1 & \text{for } n = 1 \\ T(n-1) + K(n-1) & \text{for } n = 2 \ldots N_s \end{cases} \tag{4}$$

where the timing shift register is indexed as $0 \ldots L_s - 1$. It
should be noted that (3) includes one extra clock required for
each FFT stage to grow the word size of the data value by
one bit. Also, $T(1)$ should be connected directly to the load
signal from the circuit outside the FFT component.
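The timing relations (1) through (4) are simple enough to evaluate directly. The Python sketch below is one reading of them, with each stage described by a flag saying whether all of its phase factors are trivial. The function names and the example values (a four-stage FFT with a 1-bit integer part, 7 fraction bits, and 8 phase factor fraction bits) are illustrative assumptions, not figures reported in the paper.

```python
def stage_latency(all_trivial, tf):
    """Equation (1): clocks for a stage result to appear on its output."""
    return 1 if all_trivial else tf + 2


def shift_register_length(ni, nf, trivial_flags, tf):
    """Equation (2): overall timing shift register length."""
    ns = len(trivial_flags)
    return 1 + ni + nf + ns + sum(stage_latency(t, tf) for t in trivial_flags)


def stage_start_delay(all_trivial):
    """Equation (3): clocks before a stage's input starts appearing on its output."""
    return 2 if all_trivial else 4


def reset_taps(trivial_flags):
    """Equation (4): reset tap point T(n) for stages 1..Ns, returned as a list."""
    taps = [1]                                   # T(1) = 1
    for flag in trivial_flags[:-1]:              # K(1) .. K(Ns - 1)
        taps.append(taps[-1] + stage_start_delay(flag))
    return taps


if __name__ == "__main__":
    flags = [True, True, False, True]            # hypothetical four-stage FFT
    print(shift_register_length(1, 7, flags, 8))
    print(reset_taps(flags))
```

Each tap index returned by `reset_taps` should fall inside the shift register index range, which gives a quick sanity check on a chosen parameter set.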
V. PERFORMANCE AND PORTABILITY
To evaluate the performance of the autogenerated VHDL
FFT component, various FFTs were generated and synthesized
using Xilinx ISE 13.2, as shown in Table I, with our largest
available Virtex-6 LXT -3 speed grade FPGA. One can clearly
see that the parallel arithmetic FFT F1 has a much higher
throughput of 23.04 Gs/s versus 2.369 Gs/s for the serial-parallel FFT F2; however, its hardware cost is 2.66 times
that of FFT F2. Moreover, the Virtex-6 LXT using
SelectIO [9] can only input data at 1.4 Gs/s, therefore FFT
F1 is highly data starved. The throughput of serial-parallel
FFT F2 is closer to the maximum I/O data rate of the FPGA
part and is therefore a more appropriate solution.
To analyze serial-parallel butterfly architecture throughput
versus FFT size, FFT F3 was compared to FFT F2 in Table I.
One can see that the latency for the data to flow through the
pipeline increases from 81 to 125 clocks, however the throughput remains approximately the same at 2.426 Gs/s. This occurs
because the larger front-end of FFT F3 compensates for the
longer pipeline. Also, the hardware cost of serial-parallel FFT
F3 is approximately the same as parallel arithmetic FFT F1,
even though 512-point FFT F3 has four times the number of
data paths and butterflies.


TABLE I
VARIOUS FFTS WITH 11-BIT INPUT WORDS AND 14-BIT PHASE FACTORS

FFT  Point Size  Butterfly Radix  Math Type  Slices Used  Pipeline Clocks  Sample Rate
F1   256         4                parallel   23862        1                23.04 Gs/s
F2   256         4                ser-par    8963         81               2.369 Gs/s
F3   512         2                ser-par    24677        125              2.426 Gs/s

[Figure 4: bar chart with a normalized vertical axis from 0 to 1 and two bar groups, "Maximum Throughput" and "SLICEs Used". Series: Parallel Logic (unscaled), Parallel Logic (scaled), Parallel DSP (scaled), SPIRAL DSP (scaled), Serial Logic, CoreGen.]

Fig. 4. Normalized comparison of various Q1.8 64-point radix-2 FFT implementations using 8-bit fixed length phase factors (288 DSP slices were used for both Parallel DSP and SPIRAL DSP)

Various 64-point radix-2 FFTs were also generated
and compared to SPIRAL and Xilinx CoreGen in terms
of maximum throughput and slices used. Variations of
scaled/unscaled, and Logic/DSP multipliers were generated.
As Fig. 4 shows, the number of slices used for unscaled Logic multipliers
is larger than for scaled; however, the throughput is the same,
indicating the number of adders required to implement the
FFT has a greater impact on hardware cost than the size of the
adders. By simply changing one line in the developed VHDL
library, DSP multipliers were generated instead of Logic
multipliers, resulting in a DSP based FFT that outperformed
SPIRAL DSP both in terms of higher throughput and lower
hardware cost.
To test the portability of the autogenerated VHDL, FFT
F1 from Table I was imported without modification into
Cadence Encounter 6.1.5 using the IBM 65 nm 6-metal digital
ASIC process. The ASIC cell layout completed successfully
as shown in Fig. 5. The FFT cell measures 1.8 mm per side,
contains 482,875 standard cells, has a clock fanout of 63,488
gates, and took approximately 5 hours to complete. The key
for successful importation into Cadence was generated VHDL
code consistency and modularity of the developed VHDL
library package.

Fig. 5. 256-point radix-4 FFT imported without modification to Cadence Encounter using IBM 65 nm 6-metal process (the cell size is 1.8 mm per side)

VI. CONCLUSION
This paper has described an FFT autogeneration tool that
accepts a set of FFT parameters and generates a VHDL
component to be synthesized for use in FPGAs. Unlike proprietary FFT IP cores, the VHDL code generated is completely
viewable and modifiable with no reliance on any particular
FPGA vendor. Both parallel arithmetic and serial-parallel
butterfly architectures can be generated where serial-parallel
allows larger sized FFTs to fit inside available FPGA parts.
A 512-point serial-parallel FFT was successfully generated
and easily fits available FPGA parts with regards to I/O
throughput and slices used. A comparison of 64-point radix-2
FFTs was performed and has shown that the autogenerated
parallel arithmetic DSP based FFT had higher throughput and
lower hardware cost than the equivalent SPIRAL generated
FFT. In addition, the generated VHDL code can be imported
without modification into Cadence Encounter for rapid FFT
cell creation.


REFERENCES
[1] J.W. Cooley and J.W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Computation, Vol. 19, 1965, pp. 297-301.
[2] DSP Cores from IP Cores, Inc. (http://www.ipcores.com).
[3] FFT IP from Dillon Engineering, Inc. (http://www.dilloneng.com).
[4] C. Yu, K. Irick, C. Chakrabarti, and V. Narayanan, "Multidimensional DFT IP Generator for FPGA Platforms," IEEE T. Circuits-I, Vol. 58, No. 4, Apr. 2011, pp. 755-764.
[5] DFT/FFT IP Core Generator from Carnegie Mellon University (http://www.spiral.net).
[6] J.G. Proakis and D.G. Manolakis, Digital Signal Processing, 4th ed., Upper Saddle River: Pearson Prentice Hall, 2007, pp. 449-461.
[7] L. Wenqi, W. Xuan, and S. Xiangran, "Design of Fixed-Point High-Performance FFT Processor," ICETC 2010, Vol. 5, 2010, pp. V5-139 to V5-143.
[8] K. Maharatna, E. Grass, and U. Jagdhold, "A 64-Point Fourier Transform Chip for High-Speed Wireless LAN Application Using OFDM," IEEE J. Solid-St. Circ., Vol. 39, No. 3, Mar. 2004, pp. 484-493.
[9] Virtex-6 LXT FPGAs from Xilinx, Inc. (http://www.xilinx.com).
