On the Design of the FFT Butterfly Units
Fotis Douskas and Kiamal Pekmestzi

Microprocessors and Digital Systems Lab (MicroLab)
School of Electrical and Computer Engineering, National Technical University of Athens (NTUA), Athens, Greece
fotisdou@microlab.ntua.com, pekmes@cs.ntua.gr
Abstract

In this paper, efficient designs of the Fast Fourier Transform (FFT) Decimation-in-Time (DIT), radix-2 Butterfly Unit are proposed. Several techniques are incorporated to achieve higher performance. The operations are fused by keeping the intermediate variables in Carry-Save (CS) format. Besides the conventional complex multiplication algorithm, the Gauss complex multiplication algorithm is also explored. Considering that the twiddle factors (cos, sin) used in the FFT algorithm are constants, we apply to them a special NR4SD encoding scheme with the digit set {-2, -1, 0, +1}. Finally, to increase the operation speed, one level of pipelining is introduced in all designs. We implement four designs: one conventional and three new Butterfly Units. In all cases, the three proposed schemes are superior to the conventional one in terms of operation speed, area and power.

Main Objectives

Propose a Butterfly Computation Unit (BCU) based on a special NR4SD encoding scheme, in order to increase the operation speed, and a novel architecture based on the Gauss Complex Number Multiplication Algorithm, to further improve area and power consumption.
Utilize Carry-Save form in order to shorten the critical path and speed up calculations.
Evaluate the proposed architecture by comparing it with the conventional BCU architecture.

Building Blocks

A. CL Adders, CS Adders and 4:2 Adders
Carry Look-ahead (CL) Adders were implemented with the Synopsys DesignWare IP (DW01_add).
Carry-Save (CS) Adders were implemented with the Synopsys DesignWare IP (DW02_tree).
4:2 Adders were also implemented with the Synopsys DesignWare IP (DW02_tree).

B. Multipliers
Two types of multipliers have been designed for the different architectures: a Modified Booth (MB) Multiplier and an NR4SD pre-encoded multiplier, both with CS output, as shown in Fig. 3.
Modified Booth Multiplier with CS output in Fig. 3(a).
NR4SD pre-encoded multiplier with CS output in Fig. 3(b).
Coefficients are stored in 2's complement and NR4SD form, respectively.
Result in Carry-Save form.
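The benefit of keeping a result in carry-save form can be illustrated with a small behavioural sketch in Python. It is not the DesignWare implementation: the function names `carry_save_add` and `carry_save_sum` are hypothetical, and the operands are assumed to be non-negative integers. Each 3:2 compression step needs no carry propagation, so only the final conversion to a single number requires a carry-propagate addition.

```python
def carry_save_add(a, b, c):
    """3:2 compression of three non-negative integers into a (sum, carry) pair.

    The pair satisfies a + b + c == sum + carry, and every bit position is
    computed independently (no carry ripples across bit positions).
    """
    partial_sum = a ^ b ^ c                      # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1   # majority bits, weighted by 2
    return partial_sum, carry


def carry_save_sum(operands):
    """Accumulate operands while keeping the running total in carry-save form."""
    s, c = 0, 0
    for x in operands:
        s, c = carry_save_add(s, c, x)
    return s, c


operands = [13, 7, 22, 5]
s, c = carry_save_sum(operands)
total = s + c   # the only carry-propagate addition in the whole accumulation
print(total)
```

In hardware terms, the loop corresponds to a chain or tree of carry-save adders, and the last line corresponds to the final carry look-ahead adder.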
Fig. 3: (a) Modified Booth Multiplier, (b) NR4SD pre-encoded multiplier
[Block diagrams: the encoded digits drive the partial product generators PP0 to PPk-1, whose outputs are summed in a CSA tree that produces P = AB in carry-save form (C, S). In (a) the operand B is MB encoded; in (b) the twiddle factors are pre-encoded in NR4SD form.]

We have implemented four designs. The first two use the conventional architecture, shown in Fig. 2(a), and differ in the type of multiplier they use (Modified Booth or NR4SD). The last two use the proposed Gauss Algorithm architecture, shown in Fig. 2(b), and likewise differ in the type of multiplier they use.

FFT BUTTERFLY COMPUTATION UNIT

The Butterfly Computation Unit (BCU) is essential to the implementation of the FFT algorithm. It involves a complex number multiplication between a variable $y_k$ and the twiddle factors (cos, sin), and the addition and subtraction of the result with a second variable $x_k$, as shown in Fig. 1.

[Fig. 1 diagram: inputs $x_k$ and $y_k$, outputs $X_k$ and $Y_k$, twiddle factor $W_N^a = e^{-j 2\pi a / N}$. Legend: complex number multiplier and complex number adder (+).]
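As a reference model for the butterfly described in this section, the following Python sketch computes both outputs in floating point. It assumes the usual DIT convention $X_k = x_k + W y_k$, $Y_k = x_k - W y_k$ with $W_N^a = e^{-j 2\pi a / N}$; the function names `twiddle` and `butterfly` are hypothetical, and none of the fixed-point, carry-save or pipelining details of the hardware designs are modelled.

```python
import cmath


def twiddle(a, n):
    """Twiddle factor W_N^a = exp(-j*2*pi*a/N) (assumed sign convention)."""
    return cmath.exp(-2j * cmath.pi * a / n)


def butterfly(x, y, w):
    """Radix-2 DIT butterfly: one complex multiplication, one add, one subtract."""
    t = w * y            # complex multiplication with the twiddle factor
    return x + t, x - t  # (X_k, Y_k)


# A 2-point DFT is a single butterfly with W_2^0 = 1.
X0, X1 = butterfly(3 + 1j, 1 - 2j, twiddle(0, 2))
print(X0, X1)
```

The single product `t` is shared by the adder and the subtractor, which is why the complex multiplier dominates the cost of the unit.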

Fig. 1: FFT radix-2 DIT Butterfly Computation Unit

Results

All BCUs were implemented with N ∈ {16, 24, 32, 48, 64}, where N is the bit-width of the inputs, in order to be compared in terms of delay, area and power.
All the circuits were described in Verilog, and ModelSim was used for functional simulation. The designs were then implemented in Faraday 90 nm technology using Synopsys tools. Synthesis was performed with Design Compiler; the synthesis constraints were set for optimal results, without preserving the hierarchy of the designs. PrimeTime was used for Static Timing Analysis (STA), and post-synthesis simulation was again carried out in ModelSim. Since the purpose of this work is to compare the performance of the aforementioned designs, we synthesized each design at its highest achievable frequency. We also synthesized all designs at lower frequencies in order to explore how they behave under different timing constraints in terms of area, timing and power consumption.

Proposed Techniques

NR4SD encoding scheme. The NR4SD encoding scheme uses the digit set {-2, -1, 0, +1}. In order to cover the dynamic range of the 2's complement form, all digits of the representation are encoded according to NR4SD, except the most significant one, which is MB encoded.
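A behavioural Python sketch of such an encoding is given below. It is an assumption-laden illustration rather than the gate-level encoder used in the designs: the function name `nr4sd_encode` and the carry-based arithmetic formulation are hypothetical, and the input is assumed to be a 2's complement integer with an even bit-width. Each pair of bits plus an incoming carry is mapped to a digit in {-2, -1, 0, +1}; the most significant pair is weighted as a signed value, so its digit may take any MB value in {-2, -1, 0, +1, +2}.

```python
def nr4sd_encode(value, n_bits):
    """Encode a 2's complement integer into radix-4 digits, least significant first.

    All digits except the last lie in {-2, -1, 0, +1}; the most significant
    digit lies in the MB set {-2, -1, 0, +1, +2}.
    """
    if n_bits < 2 or n_bits % 2:
        raise ValueError("n_bits must be an even number >= 2")
    if not -(1 << (n_bits - 1)) <= value < (1 << (n_bits - 1)):
        raise ValueError("value does not fit in n_bits")

    bits = value & ((1 << n_bits) - 1)   # raw 2's complement bit pattern
    digits, carry = [], 0
    for j in range(n_bits // 2 - 1):
        v = ((bits >> (2 * j)) & 0b11) + carry   # 0..4
        if v >= 2:
            digits.append(v - 4)   # 2 -> -2, 3 -> -1, 4 -> 0, carry out
            carry = 1
        else:
            digits.append(v)       # 0 or +1, no carry out
            carry = 0

    # Most significant digit: the top bit carries negative weight.
    hi = (bits >> (n_bits - 1)) & 1
    lo = (bits >> (n_bits - 2)) & 1
    digits.append(-2 * hi + lo + carry)
    return digits


def nr4sd_decode(digits):
    """Recover the integer value from radix-4 signed digits."""
    return sum(d * 4 ** j for j, d in enumerate(digits))


print(nr4sd_encode(7, 4))
```

Because the twiddle factors are constants, an encoding of this kind can be computed once and stored, which is what allows the multiplier to skip the on-the-fly encoding stage.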
Gauss Complex Number Multiplication Algorithm. The Gauss algorithm for multiplying two complex numbers reduces the number of multiplications required to three, through algebraic operations, at a cost of three additions. For the multiplication of two complex numbers, $A_R + jA_I$ and $B_R + jB_I$, we produce the following products:

$M_0 = B_I (A_R - A_I)$
$M_1 = A_R (B_R - B_I)$
$M_2 = A_I (B_R + B_I)$

The result of the complex multiplication is obtained by adding $M_0$ and $M_1$ for the real part ($P_R = M_0 + M_1$), and $M_0$ and $M_2$ for the imaginary part ($P_I = M_0 + M_2$).

[Results chart: Critical Time Delay. Critical time (ns), 0.0 to 1.6, versus input bit-width (16, 24, 32, 48, 64) for the Conventional MB, Conventional NR4SD, Gauss MB and Gauss NR4SD designs.]
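The Gauss three-multiplication scheme can be checked with a short Python sketch. The grouping of terms used here is an assumption: it is one grouping that satisfies "real part = M0 + M1, imaginary part = M0 + M2" and matches the cos-sin, sin and cos+sin coefficients that appear in Fig. 2(b). The function name `gauss_multiply` is hypothetical, and plain integers stand in for the fixed-point operands.

```python
def gauss_multiply(a_re, a_im, b_re, b_im):
    """Complex product (a_re + j*a_im) * (b_re + j*b_im) with three multiplications.

    When b is a constant twiddle factor, (b_re - b_im) and (b_re + b_im) can be
    precomputed, leaving one subtraction and two additions at run time.
    """
    m0 = b_im * (a_re - a_im)
    m1 = a_re * (b_re - b_im)
    m2 = a_im * (b_re + b_im)
    return m0 + m1, m0 + m2   # (real part, imaginary part)


p_re, p_im = gauss_multiply(3, 4, 5, -2)
print(p_re, p_im)   # (3 + 4j) * (5 - 2j) = 23 + 14j
```

Compared with the conventional four-multiplication form, one multiplier is traded for additions, which is consistent with the area and power orientation of the Gauss-based designs.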
Conventional and Proposed Architectures

The Conventional and Proposed Architectures are pipelined as shown in Fig. 2.

Fig. 2: (a) BCU Conventional Design, (b) BCU Gauss Algorithm Design
[Block diagrams. (a): four multipliers with CS output multiply $y_R$ and $y_I$ by cos and sin. (b): three multipliers with CS output use the coefficients cos-sin, sin and cos+sin, together with a subtract-multiply unit. In both designs the products are combined with $x_R$ and $x_I$ through 4:2 adders and CSAs, and CLA adders produce the outputs $X_R$, $Y_R$, $X_I$ and $Y_I$.]

[Results chart: Gain in Conventional architecture of NR4SD over MB. Gain (%), 0 to 25, in area and power versus input bit-width (16, 24, 32, 48 and 64 bits).]

[Results chart: Gain of Gauss with NR4SD over Conventional with MB. Gain (%), 0 to 25, in area and power versus input bit-width (16, 24, 32, 48 and 64 bits).]
Conclusions

We have presented efficient designs for the FFT decimation-in-time (DIT), radix-2 Butterfly Unit in terms of delay, area and power consumption. The proposed design BCU_Conv_2, which uses the conventional complex number multiplication algorithm and CS NR4SD multipliers, yields on average a 5% gain in delay, 18% less area and 18% less power dissipation compared to the conventional design (BCU_Conv_1). We have also presented an alternative, efficient design (BCU_Gauss_2), based on the Gauss complex multiplication algorithm, which is more suitable for low-power and area-efficient applications and presents a mean gain of 18% in area complexity and 13% in power dissipation.
