
High speed eight-parallel mixed-radix FFT Processor for OFDM systems

Eun Ji Kim and Myung Hoon Sunwoo


School of Electrical and Computer Engineering
Ajou University
San 5, Wonchun-Dong, Yeungtong-Gu,
Suwon, 443-749 Korea
sunwoo@ajou.ac.kr

978-1-4244-9474-3/11/$26.00 ©2011 IEEE

Abstract: This paper presents a novel eight-parallel 128/256-point mixed-radix multi-path delay commutator (MRMDC) FFT processor for orthogonal frequency-division multiplexing (OFDM) systems. The proposed FFT architecture provides a higher throughput rate and low hardware complexity by using an eight-parallel data-path scheme, a multi-path delay commutator structure, and an efficient scheduling scheme for complex multiplications. Using a modified radix-4 butterfly unit that can perform either one radix-4 butterfly or two radix-2 butterflies, the proposed FFT processor supports both 128-point and 256-point FFT computations. The processor has been designed and implemented in a 90 nm CMOS technology and provides a throughput rate of up to 27.5 Gsample/s at 430 MHz.

I. INTRODUCTION

Orthogonal frequency-division multiplexing (OFDM) has emerged as the leading modulation technology for wireless and wireline communications, and has been incorporated into many communications standards such as IEEE 802.11a/g/n, IEEE 802.16e, DAB, and DVB-T/H. OFDM transmits data through many parallel orthogonal subcarriers and provides channel equalization with a relatively simple frequency-domain solution that would otherwise be quite complex with conventional time-domain equalization [1]. OFDM transceivers involve fast Fourier transform (FFT) computation, which requires a large number of arithmetic operations. High-speed OFDM systems such as ultra-wideband (UWB), wireless personal area network (WPAN), and optical OFDM (O-OFDM) require high-speed FFT implementations to meet continuing demands for ever higher data rates [2], [5], [7].

However, the FFT processor is one of the most difficult parts of an OFDM modem to implement, and its hardware complexity is very high. Hence, various FFT processors have been proposed to meet real-time processing requirements and to reduce hardware complexity [3]-[8], [10].

For high-throughput applications, pipeline architectures have been proposed. Pipeline architectures are classified as multi-path delay commutator (MDC), single-path delay commutator (SDC), or single-path delay feedback (SDF). This paper implements an MDC pipelined FFT processor, which operates at high speed and is easy to control, so that the proposed processor can satisfy the computational demand of OFDM systems.

Throughput depends not only on the architecture but also on the degree of parallelization. Moreover, the parallel data-path approach can reduce the data sampling rate of the analog-to-digital converter. As the number of parallel data paths increases, the data sampling rate of each path can decrease. However, the hardware cost also increases significantly, because more complex multipliers and more memory are needed to allow multiple data to be computed simultaneously.

The FFT size is normally a power of two, and common values for the FFT size in high-speed OFDM systems are 128 and 256 [2], [5], [7]. In general, a higher-radix algorithm can complete FFT operations in fewer cycles than a lower-radix algorithm, because it reduces the number of multiplications and the number of stages. Since 128 and 256 are not powers of eight, the proposed FFT processor requires a mixed-radix algorithm to perform these FFTs.

This paper is organized as follows. Section II describes the 128/256-point mixed-radix FFT algorithm. Section III provides details of the proposed FFT architecture. Section IV deals with the design and implementation results of the proposed FFT processor. Finally, conclusions are presented in Section V.

II. MIXED-RADIX FFT ALGORITHM

The N-point discrete Fourier transform (DFT) is defined as

\[
X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \tag{1}
\]

where x(n) is the input sequence, X(k) is the output sequence, and N is the transform length. W_N denotes the Nth primitive root of unity, with its exponent evaluated modulo N.
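As a point of reference, the definition in (1) can be evaluated directly. The following is a minimal sketch, not part of the proposed design; it assumes the usual convention W_N = exp(-j2π/N), which agrees with W_4 = -j in the radix-4 butterfly of this section. The function name `dft` is illustrative.

```python
import cmath

def dft(x):
    """Direct evaluation of (1): X(k) = sum over n of x(n) * W_N^(n*k)."""
    N = len(x)
    # Assumed convention: W_N = exp(-j*2*pi/N); table holds W_N^m for m = 0..N-1.
    W = [cmath.exp(-2j * cmath.pi * m / N) for m in range(N)]
    # The exponent n*k is reduced modulo N, as noted in the text.
    return [sum(x[n] * W[(n * k) % N] for n in range(N)) for k in range(N)]

# Usage: a unit impulse transforms to a flat, all-ones spectrum.
X = dft([1, 0, 0, 0, 0, 0, 0, 0])
```

Each of the N outputs needs N complex multiplications, which is the quadratic cost that FFT algorithms avoid.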

In (1), direct evaluation has a computational complexity of O(N^2). With an FFT algorithm, the computational complexity is reduced to O(N log_r N), where r is the radix of the FFT. In general, a higher-radix algorithm should be used to reduce the number of multiplications and the number of stages. The mixed-radix algorithm allows fast FFT computation for sizes that are not powers of eight. It is described below. Assume that

\[
\begin{aligned}
N &= 256, \\
n &= 64 n_1 + n_2, \qquad 0 \le n_1 < 4, \quad 0 \le n_2 < 64, \\
k &= k_1 + 4 k_2, \qquad 0 \le k_1 < 4, \quad 0 \le k_2 < 64.
\end{aligned} \tag{2}
\]

Substituting (2) into (1) yields

\[
\begin{aligned}
X(k) = X(k_1 + 4k_2)
&= \sum_{n_2=0}^{63} \sum_{n_1=0}^{3} x(64n_1 + n_2)\, W_{256}^{(64n_1 + n_2)(k_1 + 4k_2)} \\
&= \underbrace{\sum_{n_2=0}^{63} \Biggl[ \Biggl\{ \underbrace{\sum_{n_1=0}^{3} x(64n_1 + n_2)\, W_4^{n_1 k_1}}_{\text{4-point DFT}} \Biggr\} \underbrace{W_{256}^{n_2 k_1}}_{\text{twiddle factor}} \Biggr] W_{64}^{n_2 k_2}}_{\text{64-point DFT}} \\
&= \sum_{n_2=0}^{63} \bigl\{ BF_4(n_2, k_1) \bigr\}\, W_{64}^{n_2 k_2},
\end{aligned} \tag{3}
\]

where the butterfly structure is

\[
\begin{aligned}
BF_4(n_2, k_1)
&= \Biggl\{ \sum_{n_1=0}^{3} x(64n_1 + n_2)\, W_4^{n_1 k_1} \Biggr\} W_{256}^{n_2 k_1} \\
&= \bigl\{ x(n_2) + (-j)^{k_1} x(n_2 + 64) + (-1)^{k_1} x(n_2 + 128) + (j)^{k_1} x(n_2 + 192) \bigr\}\, W_{256}^{n_2 k_1}.
\end{aligned} \tag{4}
\]

Equation (4) corresponds to the butterfly of the radix-4 algorithm. Equation (3) can be regarded as a two-dimensional DFT: one dimension is a 64-point DFT and the other is a 4-point DFT. The 256-point mixed-radix FFT algorithm is derived from (3) by recursively decomposing the remaining 64-point DFT into 8-point DFTs twice. Similarly, the 128-point mixed-radix FFT algorithm is easily obtained and consists of a 2-point DFT and two 8-point DFTs.

III. PROPOSED FFT ARCHITECTURE

The main objective of this paper is to design a novel eight-parallel mixed-radix MDC (MRMDC) FFT processor that offers high throughput and low hardware complexity. A higher throughput rate is obtained by using eight parallel data paths. Fig. 1 shows the proposed eight-parallel 128/256-point MRMDC FFT architecture, which consists of butterfly processing units, delay commutators, and twiddle factor multipliers.

Figure 1. Proposed FFT processor. [Block diagram labels: Data in, MUX, Radix 2/4, Radix 8, Radix 8, Data out; legend: eight-data path, one-data path.]

Fig. 2 shows the detailed structure of the dotted box in Fig. 1 when the FFT size is 256. Fig. 2(a) shows the conventional MRMDC. Fig. 2(b) and (c) describe the proposed MRMDC architectures. These architectures are composed of the input buffer, butterflies, and commutator. Fig. 2(b) shows the proposed architecture, which reduces the number of butterfly units in the first stage from two to one. Because the number of FFT clock cycles does not increase, the architecture in Fig. 2(b) saves hardware by reducing the number of butterfly units. In the conventional MRMDC architecture in Fig. 2(a) [6], the input sequence of the eight-parallel data path is split into eight data streams. Each data item is then delayed in the delay elements to keep the proper cycle alignment. It takes 28 cycles to start the first radix-4 butterfly computation and 3 clocks to deliver the data sequence from the first stage to the second stage in the delay commutator when the FFT size is 256.

Figure 2. (a) Conventional architecture. (b) Proposed architecture reducing the number of butterflies. (c) Proposed architecture employing the scheduling scheme based on Fig. 2(b).
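As a numerical sanity check on the index mapping and decomposition in (2) to (4) of Section II, the following sketch computes a 256-point DFT as radix-4 butterflies followed by 64-point DFTs and compares it with direct evaluation of (1). It assumes W_N = exp(-j2π/N). The names `w`, `dft`, `bf4`, and `fft256_mixed` are illustrative, and the 64-point stage is evaluated directly here rather than being split into two radix-8 stages as in the proposed hardware.

```python
import cmath

def w(N, e):
    """Twiddle factor W_N^e, assuming W_N = exp(-j*2*pi/N)."""
    return cmath.exp(-2j * cmath.pi * (e % N) / N)

def dft(x):
    """Direct DFT per (1); used as the reference and as the 64-point stage."""
    N = len(x)
    return [sum(x[n] * w(N, n * k) for n in range(N)) for k in range(N)]

def bf4(x, n2, k1):
    """Radix-4 butterfly of (4), including the W_256^(n2*k1) twiddle factor."""
    a, b, c, d = x[n2], x[n2 + 64], x[n2 + 128], x[n2 + 192]
    s = a + (-1j) ** k1 * b + (-1) ** k1 * c + (1j) ** k1 * d
    return s * w(256, n2 * k1)

def fft256_mixed(x):
    """256-point DFT via (3): one 64-point DFT of butterfly outputs per k1."""
    X = [0j] * 256
    for k1 in range(4):
        col = dft([bf4(x, n2, k1) for n2 in range(64)])  # indexed by k2
        for k2 in range(64):
            X[k1 + 4 * k2] = col[k2]  # output index k = k1 + 4*k2 from (2)
    return X

# Usage: compare against the direct DFT on an arbitrary deterministic input.
x = [complex((7 * n) % 13 - 6, (3 * n) % 5 - 2) for n in range(256)]
err = max(abs(p - q) for p, q in zip(fft256_mixed(x), dft(x)))
```

The maximum difference `err` should be at the level of floating-point rounding, which confirms that the input offsets 0, 64, 128, and 192 and the output index k_1 + 4k_2 are mutually consistent.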

In the proposed architecture, by contrast, the input sequence is split into four data streams, and it takes 24 cycles to start the first radix-4 butterfly computation and 7 cycles to deliver the data sequence from the first stage to the second stage in the delay commutator. The total of 24 + 7 = 31 cycles equals the 28 + 3 cycles of the conventional architecture. Therefore, without increasing the number of clock cycles, the proposed architecture, which uses one radix-2/4 butterfly in the first stage, reduces the hardware complexity. To implement a butterfly processing unit suitable for the pipeline architecture, the proposed architecture uses the radix-4 and radix-8 butterfly units proposed by Jaber and Massicotte [9]. With these butterfly units, the critical path delay of the proposed FFT architecture is reduced compared with that of an architecture using the conventional butterfly [9]. The first stage implements the radix-2/4 butterfly, which supports either one radix-4 butterfly or two radix-2 butterflies. When the FFT size is 256, the radix-4 butterfly unit [9] requires multiplying input data by the twiddle factor -j. This multiplication can be implemented efficiently by interchanging the real and imaginary parts of the inputs for subtraction, as shown in Fig. 3. Delay elements are used to generate the correct distance between parallel data sequences, and the commutators send the correct data sequence into the butterfly elements [3]. Since the number of delay elements can change the distance based on the FFT size, the proposed processor can support two FFT sizes, 128 and 256.

Figure 3. Proposed radix-2/4 DIF butterfly unit.

Fig. 2(c) shows the proposed architecture, which employs the scheduling scheme based on Fig. 2(b). The radix-8 butterfly in [9] requires 11 complex multipliers and 28 complex adders/subtractors to perform the radix-8 calculation. Among the 8 inputs of the radix-8 butterfly [9], four inputs are multiplied by two different twiddle factors at the same time. Complex multipliers dominate the overall hardware complexity. To improve area efficiency, this paper proposes an architecture that reduces the number of multipliers from 11 to 5 by scheduling the twiddle factor multiplications in the second stage. The proposed architecture in Fig. 2(c) performs the complex multiplications before the delay commutator between the first and second stages. To multiply data by two different twiddle factors simultaneously, the proposed architecture shares one multiplier with two other paths, instead of providing two multipliers for each path. By adding two multiplexers, the input data of the shared multipliers are selected from the data of two different paths.

Therefore, the proposed architecture needs only 5 complex multipliers for the second stage in each parallel path, as shown in Fig. 2(c). Fig. 4 shows the structure of the dotted box in Fig. 2(c). M1 and M3 are the multipliers of each path, and M2 is the multiplier shared by two other paths. a(k) and b(k) are the kth input sequences of each path. The multiplexer selects either a(k) or b(k), and the selected data are then multiplied by the proper twiddle factor in M2. Hence, eight inputs of a(k) and b(k) can be multiplied by two different twiddle factors at the same time. Fig. 4(a) shows the first input data arriving in the delay elements before entering the multipliers. Two cycles after arriving in the delay elements, the input data are multiplied by the twiddle factors in M1, as shown in Fig. 4(b). Fig. 4(c) and (d) show the multiplexer choosing the input data of M2 from two different paths.

Figure 4. Scheduling scheme of complex multiplications.

Commutators are required in the pipeline architecture to arrange the data in the suitable order. When the FFT size is 256, eight operation modes are needed in the commutator, as shown in Fig. 2(c). Since the delay elements can change the distance as the FFT size changes, the proposed processor can also support the 128-point FFT. The commutator operates in four different modes for the 128-point FFT. Although our architecture requires more delay elements than other existing architectures [3], [6], it reduces both the number of multipliers and the size of the twiddle factor ROM, which occupy a large portion of the area.

The input data, multiplied by the appropriate twiddle factor, are fed through the delay commutator to the second stage for the radix-8 butterfly operation. In the second stage, the remaining radix-8 calculation, excluding the multiplication, is performed.

The structure of the third stage differs from that of the second stage, because the eight data items required by the radix-8 butterfly in the third stage are on different parallel paths. Thus, a suitable structure is needed to ensure the correctness of the FFT output data. All output data generated by the radix-8 butterfly in the second stage are fed to the third stage in a specific order, expressed by

\[
in_3(p, l) = out_2(l, p) \tag{5}
\]

where in_k and out_k represent the input data and output data of the kth stage, respectively; the indices are p = 0, 1, ..., P-1 and l = 0, 1, ..., L-1; P is the number of parallel data paths (P = 8); and L is the number of outputs from a parallel data path (L = 8). With the input sequence from the second stage ordered according to (5), the radix-8 butterfly calculation is performed in the third stage.
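The reordering in (5) amounts to transposing an 8 x 8 block of second-stage outputs. The sketch below illustrates only this index swap on nested lists; the name `reorder_for_stage3` and the list-of-lists representation are assumptions for illustration and say nothing about how the commutator hardware realizes the ordering.

```python
def reorder_for_stage3(out2):
    """Apply (5): in3[p][l] = out2[l][p], with P = L = 8."""
    P = L = 8
    return [[out2[l][p] for l in range(L)] for p in range(P)]

# Usage: tag each second-stage output with its two indices to see the swap.
out2 = [[(first, second) for second in range(8)] for first in range(8)]
in3 = reorder_for_stage3(out2)
```

After the swap, the eight values needed by one third-stage radix-8 butterfly, which were spread over different parallel paths, share the same first index.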

IV. IMPLEMENTATION RESULTS

The eight-parallel MRMDC FFT processor is designed using a hardware description language (HDL) and synthesized with the IBM 90 nm standard CMOS technology. The internal word length is 10 bits each for the real and imaginary parts, and the SQNR of the proposed processor is about 31 dB.

Table I presents the performance comparison between the proposed eight-parallel MRMDC FFT processor and other existing FFT processors [5], [8]. Both the performance and the hardware cost of a pipelined FFT processor increase with the multiple data-path approach. In general, conventional pipeline FFT architectures employ a four-parallel data-path approach, and conventional memory-based architectures employ multiple processing elements to meet the high data rate requirement [5], [7], [8]. The proposed FFT processor, however, employs the eight-parallel data-path approach and the MRMDC architecture to improve the throughput rate. While the proposed processor provides up to fifteen times higher throughput than the other architectures [5], [8], the hardware complexity increases only about five times, because the scheduling scheme reduces the number of complex multipliers by sharing one multiplier with two other paths. Moreover, the proposed processor using the proposed radix-2/4 butterfly supports multiple FFT sizes, namely 128 and 256.

The proposed FFT processor consists of 668,000 gates excluding memories, and the operating clock frequency is about 430 MHz. The highest throughput rate of the proposed architecture is up to 27.5 Gsample/s at 430 MHz.

Table I. Performance comparisons

                                   Proposed        [5]             [8]
Technology                         90 nm           90 nm           180 nm
Architecture                       Pipeline        Memory-based    Pipeline
FFT size                           128, 256        512             128
Algorithm                          Radix-2, 4, 8   Radix-16        Radix-2^4
Word length (I, Q)                 10 bits         12 bits         10 bits
SQNR                               31 dB           -               33 dB
Clock rate                         430 MHz         324 MHz         450 MHz
Throughput                         27.5 GS/s       2.5 GS/s        1.8 GS/s
Total gate count (excl. memory)    668,000         -               130,000

V. CONCLUSIONS

This paper proposes an eight-parallel 128/256-point MRMDC FFT processor using a novel MDC architecture and a scheduling scheme for complex multiplications. The proposed scheduling scheme reduces the number of complex multipliers. In addition, the eight-parallel MRMDC FFT processor uses only one butterfly unit in the first stage on each parallel data path, which can perform either one radix-4 butterfly or two radix-2 butterflies. The proposed processor using the proposed radix-2/4 butterfly supports 128-point and 256-point FFTs. The performance results show that the data processing rate can be as high as 27.5 Gsample/s at 430 MHz. The proposed FFT processor can be used in O-OFDM and other OFDM systems that require a high data rate.

ACKNOWLEDGMENT

This work was supported by the IT R&D program of MKE/KEIT [KI002145, High Speed Digital Signal Processing based CMOS Circuit Design for Next-generation Optical Communication].

REFERENCES

[1] W. Shieh, Q. Yang, and Y. Ma, "107 Gb/s coherent optical OFDM transmission over 1000-km SSMF fiber using orthogonal band multiplexing," Opt. Express, vol. 16, no. 9, pp. 6378-6386, Apr. 2008.
[2] S. L. Jansen, I. Morita, T. C. W. Schenk, N. Takeda, and H. Tanaka, "Coherent optical 25.8-Gb/s OFDM transmission over 4160-km SSMF," IEEE J. Lightw. Technol., vol. 26, no. 1, pp. 6-15, Jan. 2008.
[3] S. He and M. Torkelson, "Designing pipeline FFT processor for OFDM (de)modulation," in Proc. Int. Symp. Signals, Systems, and Electronics, 29 Sep.-2 Oct. 1998, pp. 257-262.
[4] B. G. Jo and M. H. Sunwoo, "New continuous-flow mixed radix (CFMR) FFT using novel in-place strategy," IEEE Trans. Circuits Syst., vol. 52, pp. 911-919, May 2005.
[5] S. Huang and S. Chen, "A green FFT processor with 2.5-GS/s for IEEE 802.15.3c (WPANs)," in Proc. Int. Conf. Green Circuits and Systems, Jun. 2010, pp. 9-13.
[6] Y. Jung, H. Yoon, and J. Kim, "New efficient FFT algorithm and pipeline implementation results for OFDM/DMT applications," IEEE Trans. Consum. Electron., vol. 49, no. 1, pp. 14-20, 2003.
[7] Y. W. Lin, H. Y. Liu, and C. Y. Lee, "A 1 GS/s FFT/IFFT processor for UWB applications," IEEE J. Solid-State Circuits, vol. 40, pp. 1726-, 2005.
[8] M. Shin and H. Lee, "A high-speed four-parallel radix-2^4 FFT/IFFT processor for UWB applications," in Proc. IEEE Int. Symp. Circuits and Systems, May 2008, pp. 960-963.
[9] M. Jaber and D. Massicotte, "A New FFT Concept for Efficient VLSI Implementation: Part I - Butterfly Processing Element," in Proc. IEEE Int. Conf. Digital Signal Processing, Jul. 2009, pp. 1-6.
[10] M. Jaber and D. Massicotte, "A New FFT Concept for Efficient VLSI Implementation: Part II - Parallel Pipelined Processing," in Proc. IEEE Int. Conf. Digital Signal Processing, Jul. 2009, pp. 1-6.

