
SYNTHESIS OF FPGA-BASED FFT IMPLEMENTATIONS

Hojin Kee1, Newton Peterson2, Jacob Kornerup2, Shuvra S. Bhattacharyya1
1Department of Electrical and Computer Engineering, University of Maryland, College Park, 20742, USA.
2National Instruments Corporation, Austin, 78759, USA.

Overview
Propose a systematic approach for synthesizing field-programmable gate array (FPGA) implementations of fast Fourier transform computations.
Proposed approach is composed of two orthogonal techniques, FFT inner loop unrolling and outer loop unrolling, to perform design space exploration in terms of cost and performance.
Achieve cost-optimized FFT implementations, subject to user-specified performance levels.
Proposed techniques can be retargeted to different kinds of FPGA devices.

Introduction
Fast Fourier transform (FFT) computation potentially requires multi-cycle processing blocks, as its computational complexity is O(N*logN), where N is the number of inputs.
Proposed approaches:
Outer loop unrolling: Realizing pipelining by instantiating multiple processing cores across FFT butterfly stages.
Inner loop unrolling: Realizing parallelism by allocating multiple cores within each stage.
Our synthesis approach is prototyped in National Instruments LabVIEW FPGA 8.5.
Cost metrics:
Usage of FPGA slices
Usage of Block RAMs

Related Works
Ma [2] developed an efficient method for in-place memory management in FFT implementation, but this approach is restricted to a single butterfly unit.
Nordin et al. [4] presented a parameterized soft core generator for the FFT based on the Pease FFT algorithm with the stride permutation approach proposed by Takala et al. [5].
Jackson et al. [6] proposed a systolic structure to provide for high-throughput FFT implementation.
Distinguishing aspects of our approach:
Realization of data parallelism and pipelining with a carefully configured address generator.
No special permutation structures for butterfly operations.
Efficient utilization of FPGA slices subject to user-defined performance.

Unrolling techniques
A basic FFT core (BFC) provides dedicated hardware for one butterfly operation.
K-times throughput improvement:
Running BFCs simultaneously across stages.
Incorporating parallelism inside the BFC within a given stage.

The two unrolling techniques have different cost functions in terms of usage of FPGA slices or BRAMs. The two approaches should be considered jointly for cost-efficient FPGA-based FFT implementation.
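For orientation, the two loops that the unrolling techniques refer to can be seen in a plain software radix-2 FFT. The sketch below is an illustration only, not the hardware design described here: it assumes a decimation-in-time ordering with a bit-reversed input permutation, and the name fft_radix2 is hypothetical. The outer loop walks over the log2(N) butterfly stages, and the inner loop performs the N/2 butterflies of one stage, each of which is the work of one BFC.

```python
import cmath


def fft_radix2(samples):
    """Iterative radix-2 FFT (decimation in time); len(samples) must be a power of two."""
    x = list(samples)
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    stages = n.bit_length() - 1

    # Bit-reversal permutation (an assumption of this sketch, see lead-in).
    for i in range(n):
        j = int(format(i, "b").zfill(stages)[::-1], 2) if stages else 0
        if i < j:
            x[i], x[j] = x[j], x[i]

    for p in range(stages):            # outer loop: one iteration per stage
        half = 1 << p
        for u in range(n):             # inner loop: butterflies of stage p
            if u & half:
                continue               # u is the input whose p-th bit is 0
            l = u | half               # l differs from u only in the p-th bit
            w = cmath.exp(-2j * cmath.pi * (u % half) / (2 * half))
            t = w * x[l]
            x[u], x[l] = x[u] + t, x[u] - t   # one butterfly operation
    return x
```

Outer loop unrolling corresponds to running several iterations of the loop over p on separate cores, and inner loop unrolling to running several iterations of the loop over u at once.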

Outer Loop Unrolling


For an unrolling factor k > 1:
Instantiates k BFCs.
The first (k-1) BFCs take the same number of loop iterations each.
The last BFC takes the remaining loop iterations.

This approach introduces k identical copies of the sub-FFT core, so a k-fold increase in hardware cost is expected. Trade-offs associated with outer loop unrolling are complemented by inner loop unrolling.
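As a rough illustration of how the stage loop might be divided among k BFCs, the sketch below splits the log2(N) stages into k consecutive groups. The exact per-core iteration counts are not spelled out above, so the ceiling-based split used here is an assumption, and the function name partition_stages is hypothetical.

```python
def partition_stages(num_points, k):
    """Split the log2(num_points) FFT stages among k pipelined cores.

    Assumed rule (not taken from the source): the first k-1 cores each
    take ceil(stages / k) stages and the last core takes the remainder.
    """
    if num_points < 2 or num_points & (num_points - 1):
        raise ValueError("num_points must be a power of two >= 2")
    if k < 1:
        raise ValueError("k must be at least 1")
    stages = num_points.bit_length() - 1
    share = (stages + k - 1) // k          # ceil(stages / k)
    last = stages - share * (k - 1)        # whatever is left for the last core
    if last < 1:
        raise ValueError("k is too large for this split rule")
    return [share] * (k - 1) + [last]
```

For a 1024-point FFT (10 stages), partition_stages(1024, 4) gives three cores with 3 stages each and a last core with 1 stage under this assumed rule.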

Inner Loop Unrolling (Read)


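Indices of the two inputs, u and l, for a butterfly unit in the p-th stage are identical, except for the p-th bit in their binary patterns.
Define two circular rotation functions, RL (rotate left) and RR (rotate right), together with concatenation, CONCAT. Let x1 = 110 and x2 = 01100. Then RL(x2, 2) = 10001, RR(x1, 1) = 011, and CONCAT(x1, x2) = 11001100.
Read 2k inputs for k BFCs with a single address:
$A_p = a_{n-r-2}\, a_{n-r-3} \cdots a_0$ : Address for all inputs.
$B_p^0 = b_r\, b_{r-1} \cdots b_1\, 0$ : Index of 1st DM bank for BFC.
$B_p^1 = b_r\, b_{r-1} \cdots b_1\, 1$ : Index of 2nd DM bank for BFC.

[Figure: a BFC reads its two inputs from the BRAMs with indices $b_r\, b_{r-1} \cdots b_1\, 0$ and $b_r\, b_{r-1} \cdots b_1\, 1$, both at address $a_{n-r-2}\, a_{n-r-3} \cdots a_0$.]

$$u \text{ or } l = RL(CONCAT(RR(A_p, p), B_p), p) = a_{n-r-2}\, a_{n-r-3} \cdots a_p\, b_r\, b_{r-1} \cdots b_0\, a_{p-1}\, a_{p-2} \cdots a_0 \qquad (1)$$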
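The mapping in (1) can be tried out on bit strings. The following is a minimal sketch, assuming that bit patterns are written most significant bit first and that RL and RR are circular rotations, which is what the worked examples above imply. The Python function names are hypothetical.

```python
def rl(bits, s):
    """Rotate the bit string left by s positions (circular)."""
    s %= len(bits)
    return bits[s:] + bits[:s]


def rr(bits, s):
    """Rotate the bit string right by s positions (circular)."""
    cut = len(bits) - s % len(bits)
    return bits[cut:] + bits[:cut]


def concat(x1, x2):
    """Concatenate two bit strings, x1 in the high-order position."""
    return x1 + x2


def data_index(addr, bank, p):
    """Index of the sample held at address addr in DM bank `bank` for stage p, per (1)."""
    if not 0 <= p <= len(addr):
        raise ValueError("this sketch only covers 0 <= p <= len(addr)")
    return rl(concat(rr(addr, p), bank), p)


# The two banks of one BFC differ only in their last bit b_0 ...
u = data_index("101", "10", 1)
l = data_index("101", "11", 1)
# ... so u and l differ only in bit p of the sample index.
differing_bits = [i for i in range(len(u)) if u[i] != l[i]]
```

With a 3-bit address 101, banks 10 and 11, and p = 1, the two indices come out as 10101 and 10111, which differ only in bit 1 (counting from the least significant bit).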

Inner Loop Unrolling (Write)


Outputs in the p-th stage should be written to a DM bank so that they will be ready for the read in the (p+1)-th stage. The destined DM bank index and its associated address for writing butterfly output data can be generated by an inverse mapping of (1).
$$u \text{ or } l = RL(CONCAT(RR(A_p, p), B_p), p) = a_{n-r-2}\, a_{n-r-3} \cdots a_p\, b_r\, b_{r-1} \cdots b_0\, a_{p-1}\, a_{p-2} \cdots a_0 = RL(CONCAT(RR(A_{p+1}, p+1), B_{p+1}), p+1)$$
$$A_{p+1} = a_{n-r-2}\, a_{n-r-3} \cdots a_{p+1}\, b_0\, a_{p-1}\, a_{p-2} \cdots a_0$$
$$B_{p+1} = a_p\, b_r\, b_{r-1} \cdots b_1$$
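The inverse mapping can be checked in the same bit-string style. This is a sketch under the same assumptions as before (most significant bit first, circular rotations, hypothetical function names): the write location for stage p+1 swaps address bit a_p with bank bit b_0, and the sample index given by (1) must stay the same.

```python
def rotate_left(bits, s):
    s %= len(bits)
    return bits[s:] + bits[:s]


def rotate_right(bits, s):
    cut = len(bits) - s % len(bits)
    return bits[cut:] + bits[:cut]


def data_index(addr, bank, p):
    """Sample index for (address, bank) in stage p, following (1)."""
    return rotate_left(rotate_right(addr, p) + bank, p)


def next_stage_location(addr, bank, p):
    """Return (A_{p+1}, B_{p+1}) for data read from (A_p, B_p) in stage p."""
    pos = len(addr) - 1 - p                  # string position of bit a_p
    if not 0 <= pos < len(addr):
        raise ValueError("p must index a bit of the address")
    next_addr = addr[:pos] + bank[-1] + addr[pos + 1:]   # b_0 replaces a_p
    next_bank = addr[pos] + bank[:-1]                    # a_p b_r ... b_1
    return next_addr, next_bank


# 4-bit address, 4-bit bank index, stage p = 2.
addr_u, bank_u = next_stage_location("1010", "1100", 2)
addr_l, bank_l = next_stage_location("1010", "1101", 2)
```

With address 1010 and banks 1100 and 1101 at p = 2, both outputs go to bank 0110 (6), at addresses 1010 and 1110 respectively.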

Inner Loop Unrolling (Write) cont.


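[Figure: write-side example. A BFC reads at address $a_{n-r-2}\, a_{n-r-3} \cdots a_0$ from the BRAMs with indices $b_r\, b_{r-1} \cdots b_1\, 0$ = 1100 (12) and $b_r\, b_{r-1} \cdots b_1\, 1$ = 1101 (13), so $b_r\, b_{r-1} \cdots b_1$ = 110. The destined BRAM index is $B_{p+1} = a_p\, b_r\, b_{r-1} \cdots b_1$, which is 0110 (6) for $a_p = 0$ and 1110 (14) for $a_p = 1$. The destined address is $A_{p+1} = a_{n-r-2}\, a_{n-r-3} \cdots a_{p+1}\, b_0\, a_{p-1}\, a_{p-2} \cdots a_0$, giving output addresses 1010 and 1110. The figure shows the BFC, the signal $a_p$, a switch, and two registers.]

Simple interconnection network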

Cost/Performance Analysis
Cost model for outer loop unrolling and inner loop unrolling. We calibrate the model using synthesis results.
$$u_{inner} = s_{inner} \cdot u_{initial} \cdot (k_{inner} - 1) + u_{initial}$$
$$u_{outer} = s_{outer} \cdot u_{initial} \cdot (k_{outer} - 1) + u_{initial}$$
$u_{inner}$ / $u_{outer}$ : Amount of utilization after inner/outer loop unrolling
$u_{initial}$ : Amount of utilization without loop unrolling
$k_{inner}$ / $k_{outer}$ : Unrolling factors
$s_{inner}$ / $s_{outer}$ : The slope of the linear plots from synthesis for inner (outer) loop unrolling

Combined analytic cost function.

$$u_{combined} = s_{outer} \cdot u_{inner} \cdot (k_{outer} - 1) + u_{inner}$$
$$k_{combined} = k_{outer} \cdot k_{inner}$$
$u_{combined}$ : Amount of utilization after a combination of inner/outer loop unrolling
$k_{combined}$ : Speedup resulting from such a combination
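The cost functions above lend themselves to a small design space search: enumerate the factor pairs whose product equals the target speedup and evaluate the combined utilization of each. The sketch below does this for a single resource. All function names are hypothetical, and the slope and utilization values in the example are made-up placeholders rather than calibrated synthesis results, so the ranking they produce says nothing about the reported measurements.

```python
def unrolled_utilization(u_initial, slope, k):
    """Linear cost model: utilization after unrolling by factor k."""
    return slope * u_initial * (k - 1) + u_initial


def combined_utilization(u_initial, s_inner, s_outer, k_inner, k_outer):
    """Apply inner unrolling first, then outer unrolling on top of it."""
    u_inner = unrolled_utilization(u_initial, s_inner, k_inner)
    return s_outer * u_inner * (k_outer - 1) + u_inner


def factor_pairs(target_speedup):
    """All (k_inner, k_outer) with k_inner * k_outer == target_speedup."""
    return [(k, target_speedup // k)
            for k in range(1, target_speedup + 1)
            if target_speedup % k == 0]


def cheapest_configuration(u_initial, s_inner, s_outer, target_speedup):
    """Factor pair with the lowest modeled utilization for one resource."""
    return min(factor_pairs(target_speedup),
               key=lambda pair: combined_utilization(
                   u_initial, s_inner, s_outer, pair[0], pair[1]))


# Placeholder numbers only: 100 units initially, slopes 0.5 and 1.0.
candidates = {pair: combined_utilization(100, 0.5, 1.0, *pair)
              for pair in factor_pairs(6)}
```

In practice the search would be run once per resource type (slices, BRAMs) with the slopes obtained from synthesis, and the candidates compared against the critical target resource.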

Experimental Results
Figure 3 reports the FPGA resource utilization when the target speedup is 6. The configuration (kinner, kouter) = (3, 2) gives the best utilization at this target speedup, which matches the result predicted by the analytic cost function. For streaming FFT performance, our approach requires 23% fewer FPGA slices than the Xilinx core, but 140% more BRAMs. For the sequential performance level, our approach requires 30% fewer slices and 17% more BRAMs.


Conclusion
Our approach incorporates efficient FFT address generation and memory management, and applies two orthogonal loop unrolling methods to provide a tunable trade-off between performance and FPGA resource costs.
We also develop an analytical approach for high-level design space exploration, which allows one to estimate the most resource-efficient FFT architecture configuration for a given throughput constraint and a given critical target resource.
A distinguishing characteristic of our approach, compared to commercially available FFT IP cores, is that we provide a systematic method to generate an FPGA-based FFT architecture while taking into account trade-offs between performance and cost.


References
[1] J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation, Vol. 19, No. 90, pp. 297-301, 1965.
[2] Y. Ma, An Effective Memory Addressing Scheme for FFT Processors, IEEE Transactions on Signal Processing, Vol. 47, Issue 3, pp. 907-911, March 1999.
[3] W. Wolf, FPGA-Based System Design, Prentice Hall, 2004.
[4] G. Nordin, P. A. Milder, J. C. Hoe, and M. Puschel, Automatic Generation of Customized Discrete Fourier Transform IPs, Design Automation Conference, pp. 471-474, 2005.
[5] J. Takala, T. Jarvinen, P. Salmela, and D. Akopian, Multi-port interconnection networks for radix-r algorithms, In Proc. IEEE Intl. Conf. Acoustics, Speech, Signal Processing, 2001.
[6] P. A. Jackson, C. P. Chan, J. E. Scalera, C. M. Rader, and M. M. Vai, A Systolic FFT Architecture for Real Time FPGA Systems, High Performance Embedded Computing Workshop, 2004.

