Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm

1A. MALASHRI, 2C. PARAMASIVAM
Department of ECE, K.S.Rangasamy College of Technology, Tiruchengode.
1malashri67@gmail.com, 2sivamvlsi@gmail.com
Abstract-- This paper presents a pipelined, reduced memory and low power CORDIC-based architecture for fast Fourier transform implementation. The proposed algorithm utilizes a new addressing scheme and the associated angle generator logic in order to remove any ROM usage for storing twiddle factors. CORDIC is implemented by simple hardware through repeated shift-add operations. Low power is achieved by using the Coordinate Rotation Digital Computer algorithm in place of conventional multiplication; furthermore, dynamic power consumption is reduced with no delay penalties.

Index Terms-- FFT, CORDIC, VLSI, Low power

I. INTRODUCTION

Fast Fourier transform (FFT) is among the most widely used operations in digital signal processing. Often, a high performance FFT processor is the key component and determines most of the design metrics in many applications such as Orthogonal Frequency-Division Multiplexing (OFDM), Synthetic Aperture Radar (SAR) and software defined radio. For embedded systems, in particular portable devices, efficient hardware realization of FFT with small area, low power dissipation and real-time computation is a significant challenge.

A typical FFT processor is composed of butterfly calculation units, memory banks and control logic (address generator for data and twiddle factor accesses). In most cases, an FFT processor uses only one butterfly unit to realize all calculations iteratively, and the "in-place" memory access strategy is required for the least amount of memory. With the "in-place" strategy, the outputs of a butterfly operation are stored back to the same memory locations as the inputs, reducing the memory usage by one half. However, a correct memory addressing scheme is required to avoid data conflicts. This study implements an efficient addressing scheme to realize parallel, pipelined and "in-place" memory accessing. It produces an output at every clock cycle; furthermore, the memory banks and the butterfly unit are utilized with 100% efficiency within the pipeline.

In FFT processors, the butterfly operation is the most computationally demanding stage. Traditionally, a butterfly unit is composed of complex adders and multipliers, and the multiplier is usually the speed bottleneck in the pipeline of the FFT processor. The Coordinate Rotation Digital Computer (CORDIC) [5] algorithm is an alternative method to realize the butterfly operation without using any dedicated multiplier hardware. The CORDIC algorithm is very versatile and hardware efficient since it requires only add and shift operations, making it very suitable for the butterfly operations in FFT [6]. Instead of storing actual twiddle factors in a ROM, the CORDIC-based FFT processor needs to store only the twiddle factor angles in a ROM for the butterfly operation. Additionally, the CORDIC-based butterfly can be twice as fast as traditional multiplier-based butterflies in VLSI implementations.

In this paper, we propose a modified CORDIC algorithm for FFT processors which eliminates the need for storing the twiddle factor angles. The algorithm generates the twiddle factor angles successively by an accumulator. With this approach, the total memory requirement of an FFT processor can be reduced by more than 20%. The memory reduction improves with increasing radix size. Since the critical path is not modified by the CORDIC angle calculation, system throughput does not change.

II. FAST FOURIER TRANSFORM

The N-point discrete Fourier transform is defined by

    X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad k = 0, 1, \ldots, N-1    (1)

where W_N = e^{-j 2\pi / N}.

Figure 1 shows the signal flow graph of the 16-point decimation-in-frequency (DIF) radix-2 FFT. The FFT algorithm is composed of butterfly calculation units:

    x_{m+1}(p) = x_m(p) + x_m(q)    (2)
    x_{m+1}(q) = [x_m(p) - x_m(q)] W_N^r    (3)

where p and q are the indices of the two butterfly inputs and W_N^r is the twiddle factor. Equations (2), (3) describe the radix-2 butterfly operation at stage m as shown in Fig. 1. Each butterfly operation needs four data accesses (two reads and two writes); however, hardware realization of four-port memory units is difficult and costly.
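As an illustration of the radix-2 DIF butterfly and the "in-place" strategy described above, the following is a minimal software sketch, not the hardware design of this paper. The function names `dif_butterfly`, `fft_dif_inplace` and `bit_reverse` are hypothetical, and the twiddle factor is computed with an ordinary complex exponential rather than CORDIC. Each butterfly reads two locations and writes its two results back to the same locations, so no second data buffer is needed.

```python
import cmath


def dif_butterfly(a, b, w):
    """One radix-2 DIF butterfly: the sum, and the difference times the twiddle factor w."""
    return a + b, (a - b) * w


def fft_dif_inplace(x):
    """In-place radix-2 DIF FFT (illustrative sketch).

    len(x) must be a power of two. The list is overwritten stage by stage,
    and the result is left in bit-reversed order.
    """
    n = len(x)
    span = n // 2                      # distance between the two butterfly inputs
    while span >= 1:
        step = n // (2 * span)         # twiddle exponent step for this stage
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k * step / n)
                p, q = start + k, start + k + span
                # Two reads and two writes to the same addresses ("in-place").
                x[p], x[q] = dif_butterfly(x[p], x[q], w)
        span //= 2
    return x


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r
```

After `fft_dif_inplace` returns, the DFT coefficient X(k) is found at index `bit_reverse(k, log2(N))` of the list.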
To overcome this challenge, multi-bank memory units can be used to realize the parallel and "in-place" data accesses. Two two-port memory banks can provide four data accesses in each clock cycle, but in this case a special data addressing scheme is required to prevent data conflicts.

Fig. 1. Signal flow graph of a 16-point radix-2 FFT

A new addressing scheme has been proposed to realize this function, and it can be easily adopted for CORDIC-based FFT implementation.

III. CORDIC ALGORITHM

The CORDIC algorithm is an iterative algorithm to calculate the rotation of a vector by using only additions and shifts. It calculates trigonometric functions [5], rotation of a vector and angle of a vector by realizing two-dimensional vector rotation in circular coordinate systems. The CORDIC algorithm involves rotation of a vector v on the XY-plane in circular, linear and hyperbolic coordinate systems depending on the function to be evaluated. This is an iterative convergence algorithm that performs a rotation iteratively using a series of specific incremental rotation angles selected so that each iteration is performed by shift and add operations. The norm of a vector (x, y) in these coordinate systems is defined as \sqrt{x^2 + m y^2}, where m \in \{1, 0, -1\} represents a circular, linear or hyperbolic coordinate system respectively.

The norm-preserving rotation trajectory is a circle defined by x^2 + y^2 = 1 in the circular coordinate system. Similarly, the norm-preserving rotation trajectory in the hyperbolic and linear coordinate systems is defined by the function x^2 - y^2 = 1 and x = 1, respectively. The CORDIC method can be employed in two different modes, namely 1) rotation mode and 2) vectoring mode. The rotation mode is used to perform the general rotation by a given angle θ. The vectoring mode computes the unknown angle θ of a vector by performing a finite number of microrotations.

It can be shown that rotation can be simplified to:

    x_{i+1} = x_i - d_i \, 2^{-i} \, y_i    (4)
    y_{i+1} = y_i + d_i \, 2^{-i} \, x_i    (5)

The direction of each rotation is defined by d_i, and the sequence of all d_i's determines the final vector. d_i is given as:

    d_i = +1 \text{ if } z_i \geq 0, \quad d_i = -1 \text{ if } z_i < 0    (6)

where z_i is called the angle accumulator and is given by

    z_{i+1} = z_i - d_i \arctan(2^{-i})    (7)

All operations described through (4)-(7) can be realized by only additions and shifts; therefore, the CORDIC algorithm does not require dedicated multipliers.

Fig. 2. Basic structure of a pipelined CORDIC unit

Although CORDIC may not be the fastest technique to perform these operations, it is attractive due to the simplicity of its hardware implementation, since the same iterative algorithm can be used for all these applications using basic shift-add operations of the form a ± b·2^{-i}. Keeping the requirements and constraints of different application environments in view, the development of the CORDIC algorithm and architecture has aimed at a high throughput rate and a reduction of hardware complexity as well as of the latency of implementation. Parallel and pipelined CORDIC have
been suggested for high-throughput computation. The CORDIC algorithm is often realized by pipeline structures, leading to high processing speed. Figure 2 shows the basic structure of the pipelined CORDIC unit.

As shown in (1), the key operation of the FFT processing is x(n)·W_N^{nk}. This is equivalent to a "rotate x(n) by angle -2πnk/N" operation, which can be realized easily by the CORDIC algorithm. Without normal complex multiplication, a CORDIC-based butterfly can be very fast. An FFT processor needs to store the twiddle factors in memory. A CORDIC-based FFT doesn't have twiddle factors but needs a memory bank to store the rotation angles. For a radix-2, N-point, m-bit FFT, mN/2 bits of memory are needed to store N/2 angles. In the next section, a new CORDIC-based FFT design which does not require any twiddle factor or angle memory units is presented. This design uses a single accumulator for generating all the necessary angles instantly and does not have any precision loss.

IV. PROPOSED CORDIC BASED FFT

The proposed architecture is built from a CORDIC-based butterfly structure, an angle generator, RAM, multiplexers, demultiplexers and registers. This architecture (Figure 3) can be divided into an input block, a core block and an output block. The input block contains the RAM, demultiplexer, register and multiplexer arrangement; the input to the system is binary data. The input block has a RAM in which the data are saved by incremental addressing. The data then enter the demux unit, and the output of the demux unit is saved in a register. The register chosen for saving the data is determined by the select line of the demux. The outputs of the registers are applied to the muxing unit. The multiplexer unit produces the input for the core block.

[Block diagram: INPUT -> INPUT BLOCK -> CORE BLOCK (BUTTERFLY STRUCTURE, ANGLE GENERATOR) -> OUTPUT BLOCK -> OUTPUT]
Fig. 3. General block diagram of the proposed FFT

The core block consists of the butterfly structure of the FFT, which is designed using the CORDIC algorithm to replace the complex multipliers. An angle generator is used to supply the twiddle factor angle for rotation to the pipelined CORDIC structure. The core block is designed for the radix-2 and radix-4 FFT structures. The output of the core block also has a demux-mux arrangement with registers, in which the output data are stored before being sent out.

Although several multi-bank addressing schemes have been used to realize parallel and pipelined FFT processing, these methods are not suitable for the reduced memory CORDIC FFT. In these schemes, the twiddle factor angles are not in regular increasing order, resulting in a more complex design for angle generators.
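The rotation-mode CORDIC iteration of Section III, and its use in place of the twiddle factor multiplication x(n)·W_N^{nk}, can be sketched as follows. This is a floating-point model under stated assumptions, not the pipelined fixed-point hardware of the paper: the names `cordic_rotate` and `twiddle_multiply_cordic` are hypothetical, the multiplications by 2^{-i} stand in for hardware right shifts, the constant CORDIC gain is divided out at the end (in hardware it would be a precomputed constant), and a pre-rotation by ±π/2 is assumed so that all twiddle angles fall inside the convergence range of the iteration.

```python
import math


def cordic_rotate(x, y, angle, iterations=16):
    """Rotate the vector (x, y) by `angle` radians (|angle| <= pi) using CORDIC micro-rotations."""
    # Assumed pre-rotation by +/- pi/2 so the residual angle is within the convergence range.
    if angle > math.pi / 2:
        x, y, angle = -y, x, angle - math.pi / 2
    elif angle < -math.pi / 2:
        x, y, angle = y, -x, angle + math.pi / 2

    z = angle          # angle accumulator
    gain = 1.0         # accumulated CORDIC gain, a constant for a fixed iteration count
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0            # rotation direction from the sign of z
        shift = 2.0 ** -i                      # a right shift by i bits in hardware
        x, y = x - d * shift * y, y + d * shift * x
        z -= d * math.atan(shift)              # arctan(2^-i) is a small constant table in hardware
        gain *= math.sqrt(1.0 + shift * shift)
    return x / gain, y / gain


def twiddle_multiply_cordic(value, k, n, iterations=16):
    """Approximate value * W_N^k, with W_N = exp(-j*2*pi/N), by rotating value by -2*pi*k/N."""
    re, im = cordic_rotate(value.real, value.imag, -2.0 * math.pi * k / n, iterations)
    return complex(re, im)
```

With 16 iterations the residual angle error is bounded by arctan(2^{-15}), so the result agrees with the complex multiplication to roughly four decimal places for inputs of unit magnitude; more iterations give higher accuracy at the cost of more pipeline stages.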
Table 1. Address generation table of the proposed design for 16-point radix-2 FFT

Butterfly    STAGE 0             STAGE 1             STAGE 2             STAGE 3
counter B    RAM      Twiddle    RAM      Twiddle    RAM      Twiddle    RAM      Twiddle
(b2b1b0)     address  factor     address  factor     address  factor     address  factor
             b2b1b0   angle      b0b2b1   angle      b1b0b2   angle      b2b1b0   angle
000          000      0          000      0          000      0          000      0
001          001      π/8        100      0          010      0          001      0
010          010      2π/8       001      2π/8       100      0          010      0
011          011      3π/8       101      2π/8       110      0          011      0
100          100      4π/8       010      4π/8       001      4π/8       100      0
101          101      5π/8       110      4π/8       011      4π/8       101      0
110          110      6π/8       011      6π/8       101      4π/8       110      0
111          111      7π/8       111      6π/8       111      4π/8       111      0
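The address and angle columns of Table 1 can be reproduced with a few lines of code. The sketch below is an illustrative software model, not the VHDL design: the function names are hypothetical, angles are expressed as integer multiples of the basic angle 2π/N, and the latch behaviour (update only when the lowest S bits of the butterfly counter are zero) is an assumption derived from the "S 0's" control pattern described in the text.

```python
def rotate_right(value, shift, width):
    """Rotate a `width`-bit value right by `shift` bits."""
    shift %= width
    mask = (1 << width) - 1
    return ((value >> shift) | (value << (width - shift))) & mask


def address_table(n_bits):
    """For an N = 2**n_bits point radix-2 FFT, list (B, RAM address, angle index) per stage.

    The angle index is in units of the basic angle 2*pi/N.
    """
    width = n_bits - 1                       # the butterfly counter B has n - 1 bits
    table = []
    for stage in range(n_bits):
        rows = []
        for b in range(1 << width):
            address = rotate_right(b, stage, width)
            angle_index = (b >> stage) << stage   # counter with its lowest `stage` bits forced to 0
            rows.append((b, address, angle_index))
        table.append(rows)
    return table


def angle_generator(stage, n_bits):
    """Model of an accumulator plus output latch producing the angle indices for one stage."""
    acc = 0        # accumulator: advances by one basic angle per clock cycle
    latched = 0    # output latch
    out = []
    for b in range(1 << (n_bits - 1)):
        if b % (1 << stage) == 0:            # latch enabled only when the low `stage` bits are 0
            latched = acc
        out.append(latched)
        acc += 1
    return out
```

For the 16-point case, `address_table(4)[1]` yields the stage-1 addresses 000, 100, 001, 101, 010, 110, 011, 111, and `angle_generator(1, 4)` yields the same angle indices as the masked counter, which is why no angle ROM is needed in this model.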

Using a new addressing scheme as shown in Table 1, the twiddle factor angles follow a regular, increasing order, which can be generated by a simple accumulator. Table 1 shows the address generation table of the 16-point radix-2 FFT. It can be seen that the twiddle factor angles are sequentially increasing, and every angle is a multiple of the basic angle 2π/N, which is π/8 for the 16-point FFT. For all FFT stages, the angles always increase by one step per clock cycle. Hence, an angle generator circuit composed of an accumulator and an output latch can realize this function, as shown in Figure 4.

Fig. 4. Angle generator for the CORDIC based FFT

The accumulator consists of a simple adder and a register. It adds the value fed back by the register to the input angle value. The control signal for the latch that enables or disables the accumulator output is simple, and it is based on the current FFT butterfly stage and the RAM address bits b2b1b0.

Figure 5 shows the architecture of the proposed no-twiddle-factor-memory design for radix-2 FFT.

Fig. 5. Proposed design for radix-2 CORDIC FFT processor

Four registers and eight 2-to-1 multiplexers are used. Registers are needed before and after the butterfly unit to buffer the intermediate data in order to group two sequential butterfly operations together. Therefore, conflict-free "in-place" data accessing can be realized. This register-buffer design can be extended to FFTs of any radix. For radix-2, the structure can be simplified by using just 4 registers, but for a radix-r FFT, 2 × r^2 registers are needed.

For an N = 2^n-point FFT, the addressing and control logic are composed of several components: An (n − 1)-bit butterfly counter B = b_{n-2} b_{n-3} ... b_0 provides the address sequences and the control logic of the angle generator. In stage S, the memory address is given by b_{S-1} ... b_0 b_{n-2} ... b_S, which is a rotate right by S bits of butterfly counter B. Meanwhile, the control logic of the latch of the angle generator is determined by the sequence of the pattern b_{n-2} ... b_S 0 ... 0 (S 0's).

For a 16-point FFT (N = 2^4), the addressing and control logic are composed of several components:

• A 3-bit (n − 1 = 4 − 1 = 3) butterfly counter B = b2b1b0 provides the address sequences and the control logic of the angle generator.
• In stage 0, the memory address is given by b2b1b0, which is a rotate right by 0 bits of butterfly counter B.
• In stage 1, the memory address is given by b0b2b1, which is a rotate right by 1 bit of butterfly counter B, and the control logic of the latch of the angle generator is determined by the sequence of the pattern b2b10 (S 0's, where S = 1).
• In stage 2, the memory address is given by b1b0b2, which is a rotate right by 2 bits of butterfly counter B, and the control logic of the latch of the angle generator is determined by the sequence of the pattern b200 (S 0's, where S = 2).
• In stage 3, the memory address is given by b2b1b0, which is a rotate right by 3 bits of butterfly counter B, and the control logic of the latch of the angle generator is determined by the sequence of the pattern 000 (S 0's, where S = 3).

Due to finite wordlength, as the accumulator operates, the precision loss will accumulate as well. In order to address this issue, more bits (a wider wordlength) can be used for the fundamental angle 2π/N and the accumulator logic.

V. RESULTS AND CONCLUSION

The proposed designs for both radix-2 and radix-4 FFT algorithms have been realized in VHDL. For FFT processors, the butterfly operation is the most computationally demanding stage. Traditionally, a butterfly unit is composed of complex adders and multipliers, and the multiplier is usually the speed bottleneck in the pipeline of the FFT processor.

In order to avoid these problems with the butterfly unit, a modified CORDIC algorithm associated with angle generator logic is proposed. The simulated output of the angle generator is shown in Fig. 6. For the 16-point FFT, there are 8 angles in total, as shown in Table 1. The output waveforms of the radix-2 and radix-4 CORDIC FFT architectures are shown in Fig. 7 and Fig. 8 respectively.

HDL synthesis has been performed for both architectures and the results are given in Table 2.
Table 2. Synthesis results for radix-2 and radix-4 CORDIC FFT architectures

PARAMETERS                   Radix-2 CORDIC FFT         Radix-4 CORDIC FFT
Delay                        13.762 ns                  18.041 ns
Power consumption            121 mW                     319 mW
Number of LUTs               8159 (utilization 12%)     17,793 (utilization 26%)
Number of slice flip-flops   4066 (utilization 6%)      8998 (utilization 11%)

Fig. 6. Simulation output of angle generator unit

The proposed design can reduce memory usage for FFT processors without any tangible increase in the number of logic elements used.
Fig. 7. Output waveform of radix-2 CORDIC FFT

Fig. 8. Output waveform of radix-4 CORDIC FFT

REFERENCES
[1] Xiao, X., Oruklu, E., & Saniie, J. (2012). Low power and reduced memory architecture for CORDIC-based FFT. Journal of Signal Processing Systems.
[2] Wey, C., Lin, S., & Tang, W. (2007). Efficient memory-based FFT processors for OFDM applications. In IEEE International Conf. on Electro-Information Technology, 345–350, May.
[3] Mittal, S., Khan, M., & Srinivas, M. B. (2007). On the suitability of Bruun's FFT algorithm for software defined radio. In 2007 IEEE Sarnoff Symposium, 1–5, Apr.
[4] Bi, G., & Jones, E. V. (1989). A pipelined FFT processor for word sequence data. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37, 1982–1985, December.
[5] Volder, J. (1959). The CORDIC trigonometric computing technique. IEEE Transactions on Electronic Computers, 8(8), 330–334.
[6] Despain, A. M. (1974). Fourier transform computers using CORDIC iterations. IEEE Transactions on Electronic Computers, 23(10), 993–1001.
[7] Abdullah, S. S., Nam, H., McDermot, M., & Abraham, J. A. (2009). A high throughput FFT processor with no multipliers. In IEEE International Conf. on Computer Design, 485–490.
[8] Lin, C., & Wu, A. (2005). Mixed-scaling-rotation CORDIC (MSRCORDIC) algorithm and architecture for high-performance vector rotational DSP applications. IEEE Transactions on Circuits and Systems I, 52(11), 2385–2396.
[9] Jiang, R. M. (2007). An area-efficient FFT architecture for OFDM digital video broadcasting. IEEE Transactions on Consumer Electronics, 53(4), 1322–1326.
[10] Garrido, M., & Grajal, J. (2007). Efficient memory-less CORDIC for FFT computation. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2, 113–116, Apr.
[11] Xiao, X., Oruklu, E., & Saniie, J. (2009). Fast memory addressing scheme for radix-4 FFT implementation. In IEEE International Conference on Electro/Information Technology, EIT 2009, 437–440, June.
[12] Xiao, X., Oruklu, E., & Saniie, J. (2010). Reduced memory architecture for CORDIC-based FFT. In IEEE International Symposium on Circuits and Systems, 2690–2693.
