SYNTHESIS OF FPGA-BASED FFT IMPLEMENTATIONS

Hojin Kee1, Newton Peterson2, Jacob Kornerup2, Shuvra S. Bhattacharyya1
1Department of Electrical and Computer Engineering, University of Maryland, College Park, 20742, USA.
2National Instruments Corporation, Austin, 78759, USA.
Overview
Propose a systematic approach for synthesizing field-programmable gate array (FPGA) implementations of fast Fourier transform computations.
The proposed approach is composed of two orthogonal techniques, FFT inner loop unrolling and outer loop unrolling, to perform design space exploration in terms of cost and performance.
Achieve cost-optimized FFT implementations, subject to user-specified performance levels.
The proposed techniques can be retargeted to different kinds of FPGA devices.
Introduction
Fast Fourier transform (FFT) computation potentially requires multi-cycle processing blocks, as its computational complexity is O(N log N), where N is the number of inputs.
Proposed approaches:
Outer loop unrolling: realizing pipelining by instantiating multiple processing cores across FFT butterfly stages.
Inner loop unrolling: realizing parallelism by allocating multiple cores within each stage.
Our synthesis approach is prototyped in National Instruments LabVIEW FPGA 8.5.
Cost metrics:
Usage of FPGA slices
Usage of Block RAMs
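To make the two loops concrete, here is a minimal software sketch of a textbook radix-2 decimation-in-time FFT. It is not the hardware design described here; the function names are illustrative. The outer loop walks the log2(N) butterfly stages and the inner loops perform the N/2 butterflies of each stage, which is where the O(N log N) cost comes from.

```python
import cmath

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (i & 1)
        i >>= 1
    return out

def fft_radix2(x):
    """Radix-2 DIT FFT. Returns (spectrum, number of butterflies executed)."""
    n = len(x)
    assert n >= 1 and n & (n - 1) == 0, "length must be a power of two"
    stages = n.bit_length() - 1
    # Bit-reversed input order lets every stage work on adjacent groups.
    data = [x[bit_reverse(i, stages)] for i in range(n)]
    butterflies = 0
    for s in range(stages):                  # outer loop: one pass per stage
        half = 1 << s
        for start in range(0, n, 2 * half):  # inner loops: butterflies of a stage
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / (2 * half))
                u = data[start + j]
                t = w * data[start + j + half]
                data[start + j] = u + t
                data[start + j + half] = u - t
                butterflies += 1
    return data, butterflies
```

In this picture, outer loop unrolling corresponds to giving different stages their own core, and inner loop unrolling corresponds to running several butterflies of one stage at the same time.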
Related Works
Ma [2] developed an efficient method for in-place memory management in FFT implementation, but this approach is restricted to a single butterfly unit.
Nordin et al. [4] presented a parameterized soft core generator for the FFT based on the Pease FFT algorithm with the stride permutation approach proposed by Takala et al. [5].
Jackson et al. [6] proposed a systolic structure to provide for high-throughput FFT implementation.
Distinguishing aspects of our approach:
Realization of data parallelism and pipelining with a carefully configured address generator.
No special permutation structures for butterfly operations.
Efficient utilization of FPGA slices subject to user-defined performance.
Unrolling techniques
A basic FFT core (BFC) provides dedicated hardware for one butterfly operation.
K-times throughput improvement:
Outer loop unrolling: running BFCs simultaneously across stages.
Inner loop unrolling: incorporating parallelism inside the BFC within a given stage.
The two unrolling techniques have different cost functions in terms of usage of FPGA slices or BRAMs.
The two approaches should be considered jointly for cost-efficient FPGA-based FFT implementation.
Outer Loop Unrolling

For an unrolling factor k > 1:
Instantiates k BFCs.
(k-1) BFCs take ... loop iterations each. The last BFC takes ... loop iterations.
This approach introduces k identical copies of the sub-FFT core, so hardware cost is expected to increase by a factor of k.
Trade-offs associated with outer loop unrolling are complemented by inner loop unrolling.
Inner Loop Unrolling (Read)

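The indices of the two inputs, u and l, of a butterfly unit in the p-th stage are identical except for the p-th bit of their binary patterns.
Define the bit-string functions RL (rotate left), RR (rotate right), and CONCAT (concatenation). Let $x_1 = 110$ and $x_2 = 01100$. Then RL($x_2$, 2) = 10001, RR($x_1$, 1) = 011, and CONCAT($x_1$, $x_2$) = 11001100.
Read 2k inputs for k BFCs with a single address.
$A_p = a_{n-r-2} a_{n-r-3} \cdots a_0$: address for all inputs.
$B^0_p = b_r b_{r-1} \cdots b_1 0$: index of the 1st DM bank for a BFC.
$B^1_p = b_r b_{r-1} \cdots b_1 1$: index of the 2nd DM bank for a BFC.
(Figure: one BFC reading from BRAM banks $b_r b_{r-1} \cdots b_1 0$ and $b_r b_{r-1} \cdots b_1 1$ at the shared address $a_{n-r-2} a_{n-r-3} \cdots a_0$.)
$u \text{ or } l = RL(CONCAT(RR(A_p, p), B_p), p) = a_{n-r-2} a_{n-r-3} \cdots a_p \, b_r b_{r-1} \cdots b_0 \, a_{p-1} a_{p-2} \cdots a_0$    (1)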
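The mapping in (1) can be checked with a short sketch. The helper names `rl`, `rr`, `concat`, and `data_index` are illustrative, and the code assumes n-bit data indices, 2^(r+1) banks, an (n-r-1)-bit address, and 0 <= p <= n-r-1, since (1) as written only makes sense in that range. Bit strings are held as integers with an explicit width.

```python
def rl(x, s, width):
    """Rotate the width-bit value x left by s positions."""
    if width == 0:
        return 0
    s %= width
    mask = (1 << width) - 1
    return ((x << s) | (x >> (width - s))) & mask

def rr(x, s, width):
    """Rotate the width-bit value x right by s positions."""
    if width == 0:
        return 0
    s %= width
    mask = (1 << width) - 1
    return ((x >> s) | (x << (width - s))) & mask

def concat(hi, lo, lo_width):
    """Concatenate bit strings: hi followed by the lo_width-bit value lo."""
    return (hi << lo_width) | lo

def data_index(address, bank, p, n, r):
    """Data index of the word stored at (address, bank) in stage p, per (1)."""
    a_width = n - r - 1          # bits in A_p
    b_width = r + 1              # bits in B_p (the last bit selects u or l)
    assert 0 <= p <= a_width     # assumption: range in which (1) is well defined
    joined = concat(rr(address, p, a_width), bank, b_width)
    return rl(joined, p, n)

# The worked values from the text: x1 = 110, x2 = 01100.
print(format(rl(0b01100, 2, 5), "05b"))        # 10001
print(format(rr(0b110, 1, 3), "03b"))          # 011
print(format(concat(0b110, 0b01100, 5), "08b"))  # 11001100
```

With these helpers, the two banks of one BFC (bank indices differing only in the last bit) give indices that differ only in bit p, which is the butterfly pairing stated above.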
Inner Loop Unrolling (Write)

Outputs of the p-th stage should be written to a DM bank so that they are ready to be read in the (p+1)-th stage.
The destined DM bank index and its associated address for writing butterfly output data can be generated by an inverse mapping of (1).
$u \text{ or } l = RL(CONCAT(RR(A_p, p), B_p), p) = a_{n-r-2} a_{n-r-3} \cdots a_p \, b_r b_{r-1} \cdots b_0 \, a_{p-1} a_{p-2} \cdots a_0 = RL(CONCAT(RR(A_{p+1}, p+1), B_{p+1}), p+1)$
$A_{p+1} = a_{n-r-2} a_{n-r-3} \cdots a_{p+1} \, b_0 \, a_{p-1} a_{p-2} \cdots a_0$
$B_{p+1} = a_p \, b_r b_{r-1} \cdots b_1$
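A minimal sketch of the write-side mapping follows, using plain bit slicing instead of rotations. The function names are illustrative, and the parameters in the usage lines (n = 8, r = 3, p = 2, and the sample addresses) are assumptions chosen so that the bank indices line up with the 12/13 to 6/14 example on the next slide. The idea is to rebuild the data index from the stage-p location and then split it again at bit position p+1.

```python
def join_index(address, bank, p, r):
    """Data index for (A_p, B_p): high address bits, bank bits, low p address bits."""
    low = address & ((1 << p) - 1)            # a_{p-1} ... a_0
    high = address >> p                       # a_{n-r-2} ... a_p
    return (high << (p + r + 1)) | (bank << p) | low

def split_index(index, p, r):
    """Inverse of join_index: recover (address, bank) for stage p."""
    low = index & ((1 << p) - 1)
    bank = (index >> p) & ((1 << (r + 1)) - 1)
    high = index >> (p + r + 1)
    return (high << p) | low, bank

def write_target(address, bank, p, n, r):
    """(A_{p+1}, B_{p+1}) for data that was read from (A_p, B_p) in stage p."""
    assert p + 1 <= n - r - 1   # assumption: the next stage still fits the mapping
    return split_index(join_index(address, bank, p, r), p + 1, r)

# Assumed example: 16 banks (r = 3), 4-bit addresses (n = 8), stage p = 2.
print(write_target(0b1010, 12, 2, 8, 3))  # a_p = 0, b_0 = 0
print(write_target(0b1010, 13, 2, 8, 3))  # a_p = 0, b_0 = 1
print(write_target(0b1110, 12, 2, 8, 3))  # a_p = 1, b_0 = 0
```

Both inputs of a butterfly land in the same destination bank, selected by $a_p$, and their destination addresses differ only in the bit that receives $b_0$.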
Inner Loop Unrolling (Write) cont.

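(Figure: worked example of the write path for one BFC.)
Read address: $A_p = a_{n-r-2} a_{n-r-3} \cdots a_0$.
Source BRAM for the 1st input: $b_r b_{r-1} \cdots b_1 0 = 1100_2 = 12$.
Source BRAM for the 2nd input: $b_r b_{r-1} \cdots b_1 1 = 1101_2 = 13$, with $b_r b_{r-1} \cdots b_1 = 110$.
Destined BRAM index: $B_{p+1} = a_p \, b_r b_{r-1} \cdots b_1$, which is $0110_2 = 6$ when $a_p = 0$ and $1110_2 = 14$ when $a_p = 1$.
Destined address: $A_{p+1} = a_{n-r-2} a_{n-r-3} \cdots a_{p+1} \, b_0 \, a_{p-1} a_{p-2} \cdots a_0$. The figure shows output addresses 1010 and 1110 for the two outputs.
The figure labels $a_p$, a switch, and two registers, which together form a simple interconnection network.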
Cost/Performance Analysis
Cost model for outer loop unrolling and inner loop unrolling. We calibrate the model using synthesis results.
$u_{inner} = s_{inner} \cdot u_{initial} (k_{inner} - 1) + u_{initial}$
$u_{outer} = s_{outer} \cdot u_{initial} (k_{outer} - 1) + u_{initial}$
$u_{inner}$ / $u_{outer}$: amount of utilization after inner/outer loop unrolling
$u_{initial}$: amount of utilization without loop unrolling
$k_{inner}$ / $k_{outer}$: unrolling factors
$s_{inner}$ / $s_{outer}$: slope of the linear plots from synthesis for inner (outer) loop unrolling
Combined analytic cost function:
$u_{combined} = s_{outer} \cdot u_{inner} (k_{outer} - 1) + u_{inner}$
$k_{combined} = k_{outer} \cdot k_{inner}$
$u_{combined}$: amount of utilization after a combination of inner/outer loop unrolling
$k_{combined}$: speedup resulting from such a combination
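The cost model lends itself to a small design space search. The sketch below evaluates the combined cost function for every factor pair that reaches a target speedup. The function names are illustrative, and the numeric slopes and initial utilization in the usage lines are hypothetical placeholders, not the calibrated values from synthesis.

```python
def utilization(u_base, slope, k):
    """Linear cost model: u = slope * u_base * (k - 1) + u_base."""
    return slope * u_base * (k - 1) + u_base

def combined_utilization(u_initial, s_inner, s_outer, k_inner, k_outer):
    """Apply inner unrolling first, then outer unrolling on top of it."""
    u_inner = utilization(u_initial, s_inner, k_inner)
    return utilization(u_inner, s_outer, k_outer)

def candidates(target_speedup):
    """All (k_inner, k_outer) pairs with k_inner * k_outer == target_speedup."""
    return [(k, target_speedup // k)
            for k in range(1, target_speedup + 1)
            if target_speedup % k == 0]

def cheapest(u_initial, s_inner, s_outer, target_speedup):
    """Factor pair with the lowest modeled utilization for the target speedup."""
    return min(candidates(target_speedup),
               key=lambda c: combined_utilization(u_initial, s_inner, s_outer, *c))

# Hypothetical calibration values, for illustration only.
U_INITIAL, S_INNER, S_OUTER = 100.0, 0.5, 0.9
for k_inner, k_outer in candidates(6):
    u = combined_utilization(U_INITIAL, S_INNER, S_OUTER, k_inner, k_outer)
    print(k_inner, k_outer, u)
```

Because slices and BRAMs have different slopes, the search would be run once per resource with that resource's calibrated slopes, and the winning pair can differ between them.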
Experimental Results
Figure 3 reports the FPGA resource utilization when the target speedup is 6. The configuration $(k_{inner}, k_{outer}) = (3, 2)$ gives the best utilization at this target speedup, which matches the result of the analytic cost function.
At the streaming FFT performance level, our approach requires 23% fewer FPGA slices than the Xilinx core, but 140% more BRAMs.
At the sequential performance level, our approach requires 30% fewer slices and 17% more BRAMs.
Conclusion
Our approach incorporates efficient FFT address generation and memory management, and applies two orthogonal loop unrolling methods to provide a tunable trade-off between performance and FPGA resource costs.
We also develop an analytical approach for high-level design space exploration, which allows one to estimate the most resource-efficient FFT architecture configuration for a given throughput constraint and a given critical target resource.
A distinguishing characteristic of our approach, compared to commercially available FFT IP cores, is that we provide a systematic method to generate an FPGA-based FFT architecture while taking into account trade-offs between performance and cost.
References
[1] J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation, vol. 19, no. 90, pp. 297-301, 1965.
[2] Y. Ma, An Effective Memory Addressing Scheme for FFT Processors, IEEE Transactions on Signal Processing, vol. 47, no. 3, pp. 907-911, March 1999.
[3] W. Wolf, FPGA-Based System Design, Prentice Hall, 2004.
[4] G. Nordin, P. A. Milder, J. C. Hoe, M. Puschel, Automatic Generation of Customized Discrete Fourier Transform IPs, Design Automation Conference, pp. 471-474, 2005.
[5] J. Takala, T. Jarvinen, P. Salmela, and D. Akopian, Multi-port interconnection networks for radix-r algorithms, in Proc. IEEE Intl. Conf. Acoustics, Speech, Signal Processing, 2001.
[6] P. A. Jackson, C. P. Chan, J. E. Scalera, C. M. Rader, and M. M. Vai, A Systolic FFT Architecture for Real Time FPGA Systems, High Performance Embedded Computing Workshop, 2004.
