
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 23, NO. 9, SEPTEMBER 2015, p. 1793

Area-Efficient 128- to 2048/1536-Point Pipeline FFT Processor for LTE and Mobile WiMAX Systems
Chu Yu, Member, IEEE, and Mao-Hsu Yen
Abstract: Fast Fourier transform (FFT) is widely used in digital signal processing and telecommunications, particularly in orthogonal frequency division multiplexing systems, to overcome the problems associated with orthogonal subcarriers. This paper presents a novel 128/256/512/1024/1536/2048-point single-path delay feedback (SDF) pipeline FFT processor for long-term evolution and mobile worldwide interoperability for microwave access systems. The proposed design employs a low-cost computation scheme to enable 1536-point FFT, which significantly reduces hardware costs as well as power consumption. In conjunction with this 1536-point FFT computation scheme, the proposed design includes an efficient three-stage SDF pipeline architecture on which to implement a radix-3 FFT. The new radix-3 SDF pipeline FFT processor has a simple data flow and is easy to control, and the complexity of the resulting hardware is lower than that of existing structures. This paper also formulates a hardware-sharing mechanism to reduce the memory space requirements of the proposed 1536-point FFT computation scheme. The proposed design was implemented using 90-nm CMOS technology. Postlayout simulation results revealed a die area of approximately 1.44 × 1.44 mm² with power consumption of only 9.3 mW at 40 MHz.
Index Terms: 1536-point fast Fourier transform (FFT), long-term evolution (LTE), orthogonal frequency division multiplexing (OFDM), radix-3 FFT, single-path delay feedback (SDF), worldwide interoperability for microwave access (WiMAX).

I. INTRODUCTION

Discrete Fourier transform (DFT) is indispensable in modern telecommunications and discrete signal processing; however, this technique tends to be computationally intensive. To overcome this issue, Cooley and Tukey [1] developed the fast Fourier transform (FFT), which has proven particularly valuable for applications involving orthogonal frequency division multiplexing (OFDM), such as IEEE 802.11a/g/n, Worldwide Interoperability for Microwave Access (WiMAX), long-term evolution (LTE), HiPerLAN/2, asymmetric digital subscriber line (DSL), very-high-speed DSL, and digital audio/video broadcasting (DAB/DVB) systems. This paper proposes an area-efficient FFT processor in compliance with LTE and mobile WiMAX standards.

Manuscript received November 5, 2013; revised March 16, 2014; accepted August 3, 2014. Date of publication September 9, 2014; date of current version August 21, 2015. This work was supported by the National Science Council of Taiwan under Grants NSC-102-2221-E-197-035 and NSC-102-2218-E-197-001.
C. Yu is with the Department of Electronic Engineering, National Ilan University, Yilan 260, Taiwan (e-mail: chu@niu.edu.tw).
M.-H. Yen is with the Department of Computer Science and Engineering, National Taiwan Ocean University, Keelung 202, Taiwan (e-mail: ymh@email.ntou.edu.tw).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2014.2350017

A variety of FFT processors have been developed to reduce power consumption and hardware costs [2]–[16]. The memory-based architecture in [2]–[4] provides a low-power solution; however, this approach suffers from long latency and may require additional buffer space for system synchronization. A single-path delay feedback (SDF) pipeline FFT architecture was proposed in [5]–[15] to reduce the memory required by memory-based architectures. This method includes N − 1 delay elements (where N denotes the processing size), wherein multiplication accounts for less than 50% of the computation and the design of the control unit is relatively straightforward. These features are particularly advantageous in high-performance designs involving portable digital signal processing devices. Yang et al. [16] proposed a multipath delay commutator (MDC)-based architecture, which enables butterflies and multipliers to work at 100% utilization. The inherent parallelism of this method makes it suitable for applications in multiple-input multiple-output (MIMO) OFDM systems; however, it requires more memory and multipliers of greater complexity than those used in an SDF-based architecture [17]. This paper opted for the SDF pipeline FFT architecture in the design of the proposed FFT processor because of its low power and low latency.
The processing size of FFT is generally expressed as a power of 2; however, LTE systems also require 1536-point FFT computation, which increases the difficulty of hardware design. Peng et al. [4] proposed a memory-based FFT architecture to support 1536-point FFT computation. Yang et al. [14] proposed a design methodology capable of minimizing the power and area of variable-length FFT processors. The design was based on an L-parallel M-point SDF pipeline FFT architecture with N = L × M. The final stage included eight- and six-point FFTs (L = 6 for 1536-point FFT), which were combined with the previous stage of M-point FFTs to produce a parallel pipeline FFT processor capable of variable-length FFT computations (128- to 2048/1536-point). The variable-length architectures in [4] and [14] enable 1536-point FFT computation; however, these designs suffer from long latency [4] and high hardware costs [14].
This paper proposes a novel 128- to 2048/1536-point SDF pipeline FFT architecture with low latency and low hardware cost. The proposed 1536-point FFT computation scheme retains the use of the conventional single-input single-output (SISO) SDF pipeline architecture, which considerably reduces the number of complex/constant multipliers. To match the nonpower-of-2 computation scheme, an efficient radix-3 FFT processor is proposed and implemented using radix-2 hardware modules. The proposed radix-3 FFT processor requires only one complex-constant multiplier, thereby reducing hardware complexity. The proposed design also employs a hardware-sharing mechanism to reduce the memory required for 1536-point FFT computation. The unused memory from the preceding idle pipeline stages is combined with the memory of the active pipeline stages to provide the required memory capacity. These strategies result in an FFT processor with reduced chip area and lower power consumption. The proposed FFT design can also be used in mobile WiMAX systems, which typically require the ability to perform 128/512/1024/2048-point FFT computations.
The remainder of this paper is organized as follows. Section II provides a brief review of FFT and presents the proposed SDF pipeline FFT architecture for use in LTE and mobile WiMAX systems. Section III presents a performance evaluation of two FFT architectures, and Section IV concludes the paper.

1063-8210 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

II. REVIEW OF FFT ALGORITHM AND PROPOSED ARCHITECTURE
A. Review of FFT Algorithm
Let an input signal x(n) be a discrete-time sequence, the N-point DFT X[k] of which is defined as follows:

$$X[k] = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad 0 \le k \le N-1 \tag{1}$$

where the twiddle factor $W_N^{nk} = e^{-j2\pi nk/N}$. To derive an efficient DFT algorithm from (1), let

$$n = 4n_1 + 2n_2 + n_3, \quad \text{where } n_1, n_2, \text{ and } n_3 = 0, 1 \tag{2}$$

and

$$k = k_1 + 2k_2 + 4k_3, \quad \text{where } k_1, k_2, \text{ and } k_3 = 0, 1. \tag{3}$$

Substituting the index mapping in (2) and (3) into (1), the DFT algorithm in (1) for N = 8 can be rewritten as follows:

$$\begin{aligned}
X[k_1 + 2k_2 + 4k_3]
&= \sum_{n_3=0}^{1}\sum_{n_2=0}^{1}\sum_{n_1=0}^{1} x(4n_1 + 2n_2 + n_3)\, W_8^{(k_1+2k_2+4k_3)(4n_1+2n_2+n_3)} \\
&= \sum_{n_3=0}^{1}\sum_{n_2=0}^{1}\sum_{n_1=0}^{1} x(4n_1 + 2n_2 + n_3)\, W_2^{k_1 n_1}\, W_8^{(2n_2+n_3)k_1}\, W_8^{(2n_2+n_3)(2k_2+4k_3)} \\
&= \sum_{n_3=0}^{1}\sum_{n_2=0}^{1} \left[ x(2n_2 + n_3) + (-1)^{k_1} x(2n_2 + n_3 + 4) \right] W_8^{(2n_2+n_3)k_1}\, W_8^{(2n_2+n_3)(2k_2+4k_3)}.
\end{aligned} \tag{4}$$

Fig. 1 shows an example of an eight-point radix-2/4/8 decimation-in-frequency (DIF) FFT signal-flow graph (SFG) based on (4), which is a general form that can be extended to a larger number of points. This paper used DIF decomposition for its ability to deal with the manipulation of the SDF pipeline structure.
For 1536-point FFT computation, the DFT X[k] can be evaluated using the following expression:

$$\begin{aligned}
X[k] &= \sum_{n=0}^{1535} x(n)\, W_{1536}^{kn} \\
&= \sum_{m=0}^{511} x(3m)\, W_{1536}^{k(3m)} + \sum_{m=0}^{511} x(3m+1)\, W_{1536}^{k(3m+1)} + \sum_{m=0}^{511} x(3m+2)\, W_{1536}^{k(3m+2)} \\
&= \underbrace{\sum_{m=0}^{511} x(3m)\, W_{512}^{km}}_{512\text{-point FFT}} + W_{1536}^{k} \underbrace{\sum_{m=0}^{511} x(3m+1)\, W_{512}^{km}}_{512\text{-point FFT}} + W_{1536}^{2k} \underbrace{\sum_{m=0}^{511} x(3m+2)\, W_{512}^{km}}_{512\text{-point FFT}}.
\end{aligned} \tag{5}$$

The preceding equation is decomposed into three 512-point FFT computations, where the second and third terms require multiplication by the twiddle factors $W_{1536}^{k}$ and $W_{1536}^{2k}$, respectively. The resulting X[k] is then obtained by summing the three terms. The twiddle factors $W_{1536}^{k}$ for 512 ≤ k ≤ 1535 can be rewritten as

$$W_{1536}^{512+k} = W_{1536}^{512}\, W_{1536}^{k} = W_3^1\, W_{1536}^{k}, \quad \text{for } 0 \le k < 512 \tag{6}$$

$$W_{1536}^{1024+k} = W_{1536}^{1024}\, W_{1536}^{k} = W_3^2\, W_{1536}^{k}, \quad \text{for } 0 \le k < 512 \tag{7}$$

where

$$W_3^1 = -0.5 - 0.866j \tag{8}$$

$$W_3^2 = -0.5 + 0.866j. \tag{9}$$

Using (5)–(9), a radix-3 FFT is required to compute the resulting outcomes X[k] as follows:

$$X[0] = x(0) + x(1) + x(2) \tag{10}$$

$$X[1] = x(0) + (-1/2)(x(1) + x(2)) + (x(1) - x(2))(-j\alpha) \tag{11}$$

$$X[2] = x(0) + (-1/2)(x(1) + x(2)) + (x(1) - x(2))(j\alpha) \tag{12}$$

where the constant α = sin(2π/3) = 0.866 and j is the imaginary unit. With (10)–(12), Fig. 2 shows the corresponding SFG, which comprises three stages. Each stage includes only one butterfly structure (an incomplete topology); however, it remains unaffected when implemented using a classic SDF pipeline scheme. Compared with the design in [18], the proposed modification of radix-3 SFG simplifies hardware implementation by combining the complex multiplications by α and −j. The original design [18] requires two complex multiplications, which increases the number of multiplexers and control signals for hardware implementation. The mapping of the modified radix-3 FFT SFG to hardware implementation is described in Section II-E.
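
As a numerical illustration of (5)–(12), the following Python sketch builds a DFT of length 3M from three M-point DFTs, the common twiddle factors, and the radix-3 butterfly. It is only a check of the algebra under stated assumptions: a direct O(N²) DFT stands in for the 512-point FFT, floating-point arithmetic is used, and the function names (`dft`, `radix3`, `dft_3m`) are hypothetical rather than taken from the paper.

```python
import cmath

ALPHA = 0.8660254037844386  # sin(2*pi/3), the constant alpha in (11) and (12)


def dft(x):
    """Direct O(N^2) DFT following (1): X[k] = sum_n x(n) W_N^{nk}."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def radix3(a, b, c):
    """Three-point DFT written as in (10)-(12)."""
    s = a - 0.5 * (b + c)          # x(0) + (-1/2)(x(1) + x(2))
    d = (b - c) * (-1j * ALPHA)    # (x(1) - x(2)) * (-j*alpha)
    return a + b + c, s + d, s - d


def dft_3m(x):
    """DFT of length N = 3*M assembled from three M-point DFTs as in (5)-(9)."""
    n_pts = len(x)
    m_pts = n_pts // 3
    # Sub-sequences x(3m), x(3m+1), x(3m+2); a direct DFT is assumed here
    # in place of the 512-point FFT hardware.
    sub = [dft(x[r::3]) for r in range(3)]
    out = [0j] * n_pts
    for k in range(m_pts):
        w1 = cmath.exp(-2j * cmath.pi * k / n_pts)  # W_N^k
        w2 = w1 * w1                                 # W_N^{2k}
        a, b, c = sub[0][k], w1 * sub[1][k], w2 * sub[2][k]
        # The three radix-3 outputs are X[k], X[k+M], X[k+2M], see (6) and (7).
        out[k], out[k + m_pts], out[k + 2 * m_pts] = radix3(a, b, c)
    return out
```

Calling `dft_3m` on a sequence of length 1536 reproduces the 1536-point case; shorter lengths such as 24 or 48 exercise the same index arithmetic much faster.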

YU AND YEN: AREA-EFFICIENT 128- TO 2048/1536-POINT PIPELINE FFT PROCESSOR

Fig. 1. Radix-2/4/8 DIF FFT signal-flow graph of length 8.

Fig. 2. Radix-3 FFT SFG.

B. Finite Wordlength Simulation


In the implementation of fixed-point FFT processors, the choice of finite wordlength is essential to achieving a suitable tradeoff between output signal-to-noise ratio (SNR) and hardware cost. This paper adopted the fixed-point simulation environment in [3] for the selection of wordlength, in which an input signal sequence with additive white Gaussian noise is fed into the fixed-point FFT processor. After performing this simulation, the output SNR is obtained under various input SNR conditions and wordlengths. Fig. 3 shows the results of the output SNR simulation, which indicate that a high input SNR requires a large data wordlength to maintain an acceptable output SNR, owing to the high computational precision required for low input noise levels. Wordlengths exceeding 11 bits provide an output SNR suitable for most cases. On the basis of the simulation results and previous works [9], [14], this paper selected a data wordlength of 12 bits.
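
The wordlength tradeoff can be illustrated with a small experiment. The following Python sketch is not the simulation environment of [3]; it assumes a much simpler model in which a radix-2 DIF FFT scales every butterfly output by 1/2 and rounds it to a given number of fractional bits, with a noiseless random input, exact twiddle factors, and no overflow or saturation modeling. All function names and parameters are hypothetical.

```python
import cmath
import math
import random


def quantize(z, frac_bits):
    """Round the real and imaginary parts to a grid of 2**-frac_bits."""
    scale = 1 << frac_bits
    return complex(round(z.real * scale) / scale, round(z.imag * scale) / scale)


def dif_fft(x, frac_bits=None):
    """Radix-2 DIF FFT with 1/2 scaling per stage (output in bit-reversed order).

    If frac_bits is given, every butterfly output is rounded to that many
    fractional bits; this rounding model is an assumption of the sketch.
    """
    data = list(x)
    n_pts = len(data)
    half = n_pts // 2
    while half >= 1:
        for start in range(0, n_pts, 2 * half):
            for i in range(half):
                w = cmath.exp(-1j * cmath.pi * i / half)  # W_{2*half}^i
                a, b = data[start + i], data[start + i + half]
                upper, lower = (a + b) / 2, (a - b) * w / 2
                if frac_bits is not None:
                    upper = quantize(upper, frac_bits)
                    lower = quantize(lower, frac_bits)
                data[start + i], data[start + i + half] = upper, lower
        half //= 2
    return data


def output_snr_db(wordlength, n_pts=64, seed=1):
    """Output SNR (dB) of the rounded FFT relative to the unrounded one."""
    rng = random.Random(seed)
    x = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_pts)]
    frac_bits = wordlength - 1  # assumed format: one sign bit, rest fractional
    ref = dif_fft(x)
    fix = dif_fft(x, frac_bits)
    signal = sum(abs(r) ** 2 for r in ref)
    noise = sum(abs(r - f) ** 2 for r, f in zip(ref, fix))
    return 10 * math.log10(signal / noise)
```

Evaluating `output_snr_db` for several wordlengths (for example 8, 12, and 16) shows the expected trend of higher output SNR with longer words; the absolute numbers depend on the assumed rounding model and are not comparable with Fig. 3.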
Fig. 3. Output SNR simulation results under various data wordlengths.

C. Proposed Architecture
Fig. 4 shows the proposed 128/256/512/1024/1536/2048-point SDF pipeline FFT architecture for LTE and mobile WiMAX systems. The proposed architecture comprises four types of processing element (PE) modules (marked as PE1–PE4), delay-line (DL) buffers of various sizes (indicated by numbers enclosed in rectangles), and four full-complex multipliers. This architecture uses the radix-2 butterfly SDF pipeline structure for the radix-2 FFT computation, with the radix-2 butterfly SDF pipeline structure running asymmetric operations in the last three stages for radix-3 FFT. By means of multiplexer switching, the proposed design is able to perform 128/256/512/1024/1536/2048-point FFT computations in the 14 PE stages ST1–ST14, as shown in Fig. 4. The sizes of the buffers in stages ST3–ST11 are powers of 2 for use in 128/256/512/1024/2048-point FFT computations, with values ranging from 256 down to 1. The 1536-point FFT computation is performed using the 12 PE stages ST3–ST14, in which the sizes of the buffers in the nine PE stages ST3–ST11 are 768, 384, 192, 96, 48, 24, 12, 6, and 3, respectively. The outputs of the 1536-point and other sizes of FFTs are separated by the two-input multiplexer located in the last stage.
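
The buffer sizes listed above follow a simple pattern: each stage in ST3–ST11 holds a power-of-2 number of words in general mode and three times as many in 1536-point mode. A minimal Python sketch of this bookkeeping follows; the function and variable names are hypothetical, and only the sizes themselves come from the text.

```python
def dl_buffer_sizes():
    """Return {stage: (general_mode_words, mode_1536_words)} for ST3..ST11."""
    sizes = {}
    words = 256  # ST3 buffer size in general mode
    for stage in range(3, 12):
        sizes[f"ST{stage}"] = (words, 3 * words)
        words //= 2  # each later stage halves the delay
    return sizes


sizes = dl_buffer_sizes()
general_total = sum(g for g, _ in sizes.values())
mode_1536_total = sum(m for _, m in sizes.values())
# Additional words needed by 1536-point mode, per real or imaginary component.
extra_per_component = mode_1536_total - general_total

for stage, (g, m) in sizes.items():
    print(stage, g, m)
print(general_total, mode_1536_total, extra_per_component)
```

Under these sizes, stages ST3–ST11 hold 511 words in general mode and 1533 words in 1536-point mode, a difference of 1022 words for each of the real and imaginary components.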
Compared with previous works based on the design in [6], our design is more efficient because of its SDF pipeline FFT hardware structure. The first stage performs multiplications using only −j or 1 (Fig. 1), whereas most conventional designs, such as those in [6] and [9], require the computation of $W_8^1$ and $W_8^3$ in their first stages, thereby incurring higher hardware costs.

Fig. 4. Proposed 128/256/512/1024/1536/2048-point SDF pipeline FFT architecture.

Fig. 5. Practical example of the proposed 1536-point computation scheme used in ST5 shown in Fig. 4.

D. Proposed 1536-Point FFT Computation Scheme


As shown in (5), the 1536-point FFT comprises three 512-point FFTs, in which the three contiguous input data are multiplied using the same twiddle factor. Thus, (5) can be rewritten as follows:

$$X[k] = \sum_{n=0}^{1535} x(n)\, W_{1536}^{kn} = \sum_{r=0}^{2} \left[ \sum_{m=0}^{511} x(3m + r)\, W_{512}^{km} \right] W_{1536}^{rk}. \tag{13}$$

Although (13) is not a classic radix-3 FFT containing a kernel core of 512-point FFT, it still essentially performs a modified 512-point FFT followed by subsequent radix-3 FFT computations. This modified 512-point FFT must perform three times as many operations, using the same twiddle factor on three contiguous inputs. An additional preprocessing step is required before performing radix-3 FFT computations. This preprocessing step involves multiplications using the common twiddle factor $W_{1536}^{k}$ [see (6) and (7)]. Finally, radix-3 FFT is used to complete the 1536-point FFT computation. In this manner, the proposed computation scheme combines modified 512-point FFTs with 3-point FFTs to achieve the required 1536-point FFT.
The conventional 512-point SDF pipeline structure can easily be modified for the preceding modified 512-point FFT computations to map the proposed computation scheme to hardware implementations. To match the flow of computation data, three contiguous input data are multiplied by the same twiddle factor for a total processing length of 512 × 3 = 1536. Fig. 5 shows a practical example of the proposed 1536-point computation scheme used in ST5 shown in Fig. 4. The buffer size in stages ST3–ST11 must be three times as large as that used in the original SDF pipeline structure. Thus, in practical situations, the DL buffers in stages ST3–ST11 require a switching mechanism to select the correct buffer size for the general mode and 1536-point FFT mode. In this case, the general mode refers to 128/256/512/1024/2048-point FFT computations. The resulting output is finally obtained by performing a radix-3 FFT computation, as shown in (10)–(12). In the proposed 1536-point FFT computation scheme, a classic SISO SDF pipeline architecture can still be used to realize a 128/256/512/1024/1536/2048-point FFT processor, because the 1536-point FFT, including its 3-point FFTs, can also be implemented using an SDF pipeline architecture. This classic SISO SDF pipeline architecture with 1536-point FFT reduces the costs associated with hardware implementation.
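
The data flow of (13) can be sketched in software. The Python code below keeps the samples in their original stream order, so every butterfly spans three times the usual distance and each twiddle factor is applied to three contiguous samples, followed by the common-twiddle preprocessing and the radix-3 step of (10)–(12). It is an illustrative model only: plain radix-2 DIF stages are assumed in place of the radix-2/4/8 grouping, the array stands in for the delay-line buffers, and all names are hypothetical.

```python
import cmath

ALPHA = 0.8660254037844386  # sin(2*pi/3)


def direct_dft(x):
    """Reference O(N^2) DFT used only for checking the result."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def fft_3m_stream(x):
    """DFT of length N = 3*M (M a power of 2) computed in stream order."""
    n_pts = len(x)
    m_pts = n_pts // 3
    bits = m_pts.bit_length() - 1
    data = list(x)
    half = m_pts // 2
    while half >= 1:
        distance = 3 * half  # butterfly span in the stream: 3x the usual delay
        for start in range(0, m_pts, 2 * half):
            for i in range(half):
                w = cmath.exp(-1j * cmath.pi * i / half)  # W_{2*half}^i
                for r in range(3):  # same twiddle for three contiguous samples
                    top = 3 * (start + i) + r
                    a, b = data[top], data[top + distance]
                    data[top], data[top + distance] = a + b, (a - b) * w
        half //= 2
    out = [0j] * n_pts
    for p in range(m_pts):
        k = bit_reverse(p, bits)  # DIF leaves the M-point results bit-reversed
        w1 = cmath.exp(-2j * cmath.pi * k / n_pts)  # common twiddle W_N^k
        a, b, c = data[3 * p], data[3 * p + 1] * w1, data[3 * p + 2] * w1 * w1
        s, d = a - 0.5 * (b + c), (b - c) * (-1j * ALPHA)
        out[k], out[k + m_pts], out[k + 2 * m_pts] = a + b + c, s + d, s - d
    return out
```

For a length-1536 input, the first stage of this model uses a butterfly span of 768 samples, and the span shrinks to 3 in the last radix-2 stage, which matches the buffer sizes quoted for ST3–ST11. Comparing `fft_3m_stream(x)` with `direct_dft(x)` on a short input (for example, 24 samples) is a quick way to check the index arithmetic.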
E. Radix-3 FFT Design
A radix-3 FFT is used to sum the three 512-point FFT outputs to produce the final outcome of the 1536-point FFT. As described previously, the preprocessing unit must be activated prior to radix-3 FFT computation. This requires a full-complex multiplier to perform multiplications using the common twiddle factor $W_{1536}^{k}$ (k < 512). This complex multiplier is placed in front of ST12, as shown in Fig. 4.
To elucidate the operation of the radix-3 FFT SFG, Table I presents the calculation of each output node in the three stages shown in Fig. 2. With this table, the calculations in each stage can be mapped to a radix-2 butterfly SDF pipeline hardware structure. For example, Node 1 of stage 1 directly passes the input data to the output, such that calculations at Nodes 2 and 3 can be mapped to a butterfly structure. Thus, stage 1 is implemented using a classic radix-2 PE hardware module known as PE3. The operations in stage 2 are similar to those in stage 1; however, the output at Node 3 is obtained by multiplying input values by −jα. This necessitates a new type of PE hardware module, called PE4, to perform these operations. The operations in stage 3 are similar to those in stage 1; therefore, a PE3 hardware module can also be used in this stage. The preceding three-stage SDF pipeline FFT hardware structure is presented in ST12–ST14 shown in Fig. 4.

TABLE I
CALCULATION OF EACH NODE IN EACH STAGE SHOWN IN FIG. 4

F. Processing Elements
The FFT SFGs shown in Figs. 1 and 2 require that the proposed architecture include four types of PE to cope with FFT computations of various sizes. The first three types of PE hardware modules (PE1–PE3) were described in [12]. On the basis of the radix-3 FFT SFG shown in Fig. 2 and Table I, this paper proposes the novel PE4 hardware module shown in Fig. 6. In addition to performing butterfly operations, this module is able to perform multiplications using −jα. Multiplication by α requires only a complex-constant multiplier, comprising two constant real-value multipliers. This complex-constant multiplier design is detailed in the following section.

Fig. 6. Circuit diagram of proposed PE4 module.

Fig. 7. Circuit diagram of multiplication by 0.866.
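
The multiplication by 0.866 in Fig. 7 relies on shifts and additions only. The following Python sketch, which assumes integer samples and arithmetic right shifts in place of the hardware wiring, contrasts the direct CSD form 1 − 2⁻³ − 2⁻⁷ − 2⁻¹⁰ of (14) with the nested Horner form 1 − 2⁻³(1 + 2⁻⁴(1 + 2⁻³)) of (15). The function names are hypothetical, and the truncation behavior of the actual circuit may differ.

```python
def mul_0866_csd(x):
    """y ~= (1 - 2^-3 - 2^-7 - 2^-10) x, each shifted term truncated separately."""
    return x - (x >> 3) - (x >> 7) - (x >> 10)


def mul_0866_horner(x):
    """y ~= (1 - 2^-3 (1 + 2^-4 (1 + 2^-3))) x, evaluated from the inside out."""
    inner = x + (x >> 3)        # adder 1: (1 + 2^-3) x
    middle = x + (inner >> 4)   # adder 2: (1 + 2^-4 (1 + 2^-3)) x
    return x - (middle >> 3)    # adder 3: final subtraction


# Both forms represent the same constant, which approximates sin(2*pi/3).
CSD_CONSTANT = 1 - 2**-3 - 2**-7 - 2**-10
HORNER_CONSTANT = 1 - 2**-3 * (1 + 2**-4 * (1 + 2**-3))

if __name__ == "__main__":
    for sample in (1024, 2047):
        exact = sample * CSD_CONSTANT
        print(sample, exact, mul_0866_csd(sample), mul_0866_horner(sample))
```

In this model the Horner form needs three additions and keeps the truncation error of the nested shifts small, because the later shifts also scale down the error of the earlier ones; the direct form accumulates one truncation per shifted term.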

G. Complex and Constant Multipliers
The proposed design uses four full-complex multipliers, referred to as M1–M4 in Fig. 4, as well as four complex-constant multipliers (three in the PE1 modules and one in the PE4 module). The details of full-complex multiplication and complex-constant multiplication of $W_8^1$ were outlined in [12] and [15], respectively. Implementing complex-constant multiplications of $W_3^1$ and $W_3^2$ is relatively straightforward; however, high hardware costs make it impractical. According to (8) and (9), the complex multiplications of $W_3^1$ and $W_3^2$ can be replaced using real-value multiplications with a constant value of 0.866, which is transformed into canonical signed digit (CSD) representation to reduce hardware costs [19], [20]. This CSD can be represented as follows:

$$y \approx (1 - 2^{-3} - 2^{-7} - 2^{-10})\,x \tag{14}$$

where x is an input value and y is an output value. The resulting CSD operation reduces the number of adders by approximately 57%, compared with conventional binary representation. However, the use of (14) can lead to inaccurate results because of truncation error [20]. Horner's rule can be used to ensure the precision required to increase the SNR, as follows:

$$y \approx (1 - 2^{-3}(1 + 2^{-4}(1 + 2^{-3})))\,x. \tag{15}$$

Fig. 7 shows a circuit diagram of the multiplication by 0.866, in accordance with (15). This circuit uses only three adders for the implementation of real-value multipliers.
H. DL Buffers
The FFT stage of the SDF pipeline generally contains DL buffers, which can use either registers or random-access memory (RAM) in actual implementation. Registers are the most intuitive type of implementation; however, they consume approximately twice the power and chip area of RAM with the same capacity. RAM consumes less power and the hardware is less expensive; however, it requires additional address generators and decoders.
Fig. 8 shows the power-area performance of registers, single-port memory, and two-port memory using 90-nm CMOS technology. In this case, the size of the storage ranges from 4 to 1024 words. Clearly, the power-area performance of registers is inferior to that of memory when the storage is greater than or equal to 64 words. Moreover, the performance of single-port memory smaller than 512 words is unable to match its two-port counterpart. Thus, the DL buffers in the proposed design use two-port memory when the buffer size is greater than or equal to 32 words but no more than 512 words. Otherwise, registers are used.

Fig. 8. Power-area performance of registers and two other types of memory.

As shown in Fig. 4, the DL buffers used in 1536-point FFT computation require a buffer size three times that of the original in stages ST3 to ST11. In practical implementations, the original storage can be cascaded with storage of double size, which necessitates (512 + 256 + 128 + 64 + 32 + 16 + 8 + 4 + 2) × 2 = 2044 words in those stages and incurs high cost in terms of hardware and power consumption. Thus, this paper adopts a memory hardware-sharing mechanism to reduce chip area and power. This approach involves borrowing unused memory from PE stages that are not being used for computation to provide the necessary buffer capacity. A total of 1024 + 512 words of unused memory is made available from PE stages ST1 and ST2 for the proposed mechanism when 1536-point FFT computation mode is initiated. As mentioned in Section II-D, the proposed design has two operating modes: general FFT and 1536-point FFT. Fig. 9 shows the hardware-sharing mechanism used in ST3. The dashed lines indicate the signal routing of 1536-point FFT computation mode, whereas the solid lines mark the signal routing of general FFT computation mode. In this example, multiplexer switching leads to the accumulation of 768 words of memory space in stage ST3, for use in the 1536-point FFT computation mode. In general FFT computation mode, the memory space in stage ST3 is reset to the original 256 words. To facilitate sharing, the original memory space in stage ST1 is partitioned into two parts of equal size. One part can be combined with the memory space of stage ST3 to form the desired 768-word buffer for use in the 1536-point FFT. The other part is reserved exclusively for use in the general computation mode.

Fig. 9. Partition and allocation of memory from ST1 to ST3.

The memory hardware-sharing structure in stages ST4–ST11 is similar to that shown in Fig. 9. The memory space of these PE stages required for the 1536-point FFT is partly obtained from the memory in stage ST2. The memory sharing structure is shown in Fig. 10. To enhance the efficiency of hardware implementation, memory space requirements of eight words or less are met through the use of additional registers, rather than the proposed sharing mechanism.
As mentioned previously, the total buffer size is 4094 + 34 words for the real as well as the imaginary parts, of which 34 words are reserved for the 1536-point FFT. The proposed hardware-sharing mechanism reduces the number of words in the DL buffers by 2044 − 34 = 2010 words. The need for additional memory can be entirely eliminated using the memory space remaining in stage ST2. However, considerations of hardware implementation efficiency and the size limitations of commercial memory compilers mean that registers sometimes present a better alternative.
III. PERFORMANCE EVALUATION

The processing size of the 1536-point FFT is not a power of 2; therefore, the classic radix-2 FFT structures are insufficient to deal with the requirements of implementing 1536-point FFT, thereby necessitating the use of other radix-size FFT processors. Radix-6 or radix-3 FFT must be used in this case; therefore, additional hardware costs cannot be avoided.
Table II compares the performance of two 128- to 2048/1536-point SDF pipeline FFT designs. This table does not include other related FFT designs that do not support 1536-point FFT. The design by Peng et al. [4] is able to support 1536-point FFT processing; however, it is a memory-based architecture. In Table II, normalized energy per FFT is defined as follows [2], [17], [16]:

$$\text{Normalized energy/FFT} = \frac{\text{Power} \times T_{\text{clock}} \times N_{\text{execution}}}{(\text{Voltage})^2 \times (L_{\min}/65\ \text{nm}) \times PL} \times 10^3 \tag{16}$$

where $T_{\text{clock}}$ is the clock period, $L_{\min}$ is the minimum channel length of the MOS transistor, $N_{\text{execution}}$ is the number of clock cycles required to process an FFT, Voltage is the supply voltage of the chip, and PL is the level of parallelism. The normalized area is computed as follows [2], [9], [16]:

$$\text{Normalized area} = \frac{\text{Core Size}}{(L_{\min}/65\ \text{nm})^2 \times PL}. \tag{17}$$

In addition, the multiplexers listed in Table II include only those used to achieve a reconfigurable FFT mechanism; those used in the PEs are excluded.
As shown in Table II, the normalized power of the proposed architecture is less than that of the design proposed by Yang et al. [14], which adopted a parallel SDF pipeline architecture to achieve reconfigurability to accommodate FFT computations using various processing sizes, especially for supporting 1536-point FFT. As a result, their proposed architecture requires numerous multiplexers and complex/constant multipliers (specifically 16 full-complex multipliers, 760 real-valued equivalent adders, and 76 multiplexers). It should be noted that although the design in [14] employs a parallel architecture, it processes FFT in a single input stream.

Fig. 10. Proposed memory sharing structure in stage ST2.

TABLE II
COMPARISON OF TWO FFT DESIGNS

TABLE III
SUMMARY OF PROPOSED CHIP DESIGN

By contrast, the proposed method uses only four full-complex multipliers, 38 real-value equivalent adders, and ten multiplexers. The method proposed in this paper therefore incurs lower hardware costs with regard to ROM size and the number of equivalent adders.

The proposed FFT architecture is based on radix-2/4/8 SFG structures; therefore, the size of its ROM for storing twiddle factors is smaller than that of the method proposed in [14]. Leveraging the symmetry of twiddle factors, the proposed design requires only one quarter as much ROM space for both real and imaginary parts. The twiddle factor generator used in the proposed design is outlined in [12].
Table III details the chip implementing the proposed design. I/O pads are counted in all of the parameters, with the exception of gate count and power for 128- to 2048/1536-point FFT, in which only the core of the chip is included. As shown in the power consumption distribution in Table III, memory consumes approximately half of the total power. If the registers used as delay-line buffers are also considered, the power consumption of all the delay-line buffers is far more than half of the total power consumption. Thus, reducing power consumption requires the efficient implementation of delay-line buffers. Through the analysis of power-area product performance, the implementation of delay-line buffers in the proposed design strikes a good balance between power consumption and chip area.
IV. CONCLUSION
This paper presents an area-efficient SDF pipeline FFT processor for use in LTE and mobile WiMAX systems. The proposed low-cost nonpower-of-2 FFT computation scheme facilitates the use of a conventional SDF pipeline structure for 1536-point FFT. This enables the use of a SISO-type SDF pipeline architecture for the hardware implementation of the 128- to 2048/1536-point FFT. According to the performance evaluation presented in Table II, the proposed design achieves a better tradeoff between memory-based and parallel processing designs with regard to actual chip area (rather than normalized chip area), latency, and actual power consumption (rather than normalized power consumption). Moreover, the FFT processor in most inner OFDM receivers consumes approximately half of the total power. Therefore, the flexibility of the proposed design makes it suitable for many practical applications such as portable devices. Generally, hardware devices achieve lower power consumption at the cost of a lower data rate. In cases of low data rates, a SISO scheme is preferable; in cases where a high data rate is desired, a MIMO scheme is preferred. The use of frequency multiplication techniques in the proposed design to support an array antenna enables high throughput; however, it incurs a penalty with regard to power consumption. Furthermore, if the supply voltage in low-power designs cannot be lowered, then the use of parallel architecture will increase power consumption and push up the costs of hardware implementation. Under this circumstance, the proposed design has a distinctive advantage over parallel architectures.
This paper employed the proposed 1536-point FFT computation scheme in conjunction with a modified radix-2 512-point SDF pipeline FFT architecture cascaded with a low-cost radix-3 SDF pipeline FFT architecture to reduce hardware complexity. The complexity of the proposed radix-3 FFT architecture can be reduced, because the proposed PE4 module requires only one complex-constant multiplier. To further reduce the chip area, this paper included a hardware-sharing mechanism that combines unused memory with memory in active stages to provide the additional space required for 1536-point FFT computation. The proposed design is also applicable to MDCs in MIMO systems.
ACKNOWLEDGMENT
The authors would like to thank the Chip Implementation
Center of the National Applied Research Laboratories in
Taiwan for EDA tool support.
REFERENCES
[1] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, no. 90, pp. 297–301, Apr. 1965.
[2] B. M. Baas, "A low-power, high-performance, 1024-point FFT processor," IEEE J. Solid-State Circuits, vol. 34, no. 3, pp. 380–387, Mar. 1999.
[3] J.-C. Kuo, C.-H. Wen, C.-H. Lin, and A.-Y. Wu, "VLSI design of a variable-length FFT/IFFT processor for OFDM-based communication systems," EURASIP J. Appl. Signal Process., vol. 2003, no. 13, pp. 1306–1316, Dec. 2003.
[4] S.-Y. Peng, K.-T. Shr, C.-M. Chen, and Y.-H. Huang, "Energy-efficient 128–2048/1536-point FFT processor with resource block mapping for 3GPP-LTE system," in Proc. Int. Conf. Green Circuits Syst., Jun. 2010, pp. 14–17.
[5] S. He and M. Torkelson, "Designing pipeline FFT processor for OFDM (de)modulation," in Proc. URSI Int. Symp. Signals, Syst., Electron., vol. 29, Oct. 1998, pp. 257–262.
[6] L. Jia, Y. Gao, J. Isoaho, and H. Tenhunen, "A new VLSI-oriented FFT algorithm and implementation," in Proc. IEEE ASIC Conf., Sep. 1998, pp. 337–341.
[7] Y.-N. Chang and K. K. Parhi, "An efficient pipelined FFT architecture," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 50, no. 6, pp. 322–325, Jun. 2003.
[8] Y.-W. Lin, H.-Y. Liu, and C.-Y. Lee, "A 1-GS/s FFT/IFFT processor for UWB applications," IEEE J. Solid-State Circuits, vol. 40, no. 8, pp. 1726–1735, Aug. 2005.
[9] Y.-T. Lin, P.-Y. Tsai, and T.-D. Chiueh, "Low-power variable-length fast Fourier transform processor," IEE Proc. Comput. Digit. Techn., vol. 152, no. 4, pp. 499–506, Jul. 2005.
[10] M. Shin and H. Lee, "A high-speed four-parallel radix-2^4 FFT/IFFT processor for UWB applications," in Proc. IEEE Int. Symp. Circuits Syst., May 2008, pp. 960–963.
[11] M. S. Patil, T. D. Chhatbar, and A. D. Darji, "An area efficient and low power implementation of 2048 point FFT/IFFT processor for mobile WiMAX," in Proc. Int. Conf. Signal Process. Commun., Jul. 2010, pp. 1–4.
[12] C. Yu, M.-H. Yen, P.-A. Hsiung, and S.-J. Chen, "A low-power 64-point pipeline FFT/IFFT processor for OFDM applications," IEEE Trans. Consum. Electron., vol. 57, no. 1, pp. 40–45, Feb. 2011.
[13] H.-Y. Lee and I.-C. Park, "Balanced binary-tree decomposition for area-efficient pipelined FFT processing," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 4, pp. 889–900, Apr. 2007.
[14] C.-H. Yang, T.-H. Yu, and D. Markovic, "Power and area minimization of reconfigurable FFT processors: A 3GPP-LTE example," IEEE J. Solid-State Circuits, vol. 47, no. 3, pp. 757–768, Mar. 2012.
[15] C. Yu, "A 128/512/1024/2048-point pipeline FFT/IFFT architecture for mobile WiMAX," in Proc. 2nd IEEE Global Conf. Consum. Electron., Oct. 2013, pp. 243–244.
[16] K.-J. Yang, S.-H. Tsai, and G. C. H. Chuang, "MDC FFT/IFFT processor with variable length for MIMO-OFDM systems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 4, pp. 720–731, Apr. 2013.
[17] T.-D. Chiueh and P.-Y. Tsai, OFDM Baseband Receiver Design for Wireless Communications. New York, NY, USA: Wiley, 2007.
[18] J. Lofgren and P. Nilsson, "On hardware implementation of radix 3 and radix 5 FFT kernels for LTE systems," in Proc. NORCHIP Conf., Nov. 2011, pp. 1–4.
[19] G. W. Reitwiesner, "Binary arithmetic," Adv. Comput., vol. 1, pp. 231–308, 1960.
[20] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York, NY, USA: Wiley, 1999.

Chu Yu (M'11) received the B.S. and M.S. degrees in electronic engineering from the National Taiwan University of Science and Technology, Taipei, Taiwan, in 1991 and 1993, respectively, and the Ph.D. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1999.
He has been a faculty member with the Department of Electronic Engineering, National Ilan University, Yilan, Taiwan, since 2000, where he is currently an Associate Professor. His current research interests include IC design for digital communications and digital signal processing.
Dr. Yu is a member of the IEEE Consumer Electronics Society and the IEEE Circuits and Systems Society.

Mao-Hsu Yen received the B.S., M.S., and Ph.D. degrees in electronic engineering from the National Taiwan University of Science and Technology, Taipei, Taiwan, in 1991, 1993, and 2000, respectively.
He has been a faculty member with the Department of Computer Science and Engineering, National Taiwan Ocean University, Keelung, Taiwan, since 2005, where he is currently an Associate Professor. His current research interests include the design of application-specific integrated circuit and FPGA architectures.
