OFDM baseband transceiver data chain: Implementation of a 64-point IFFT module for the 802.11a OFDM specification

Team: Hemanth Bettachar (4388014), Saurabh Patodia (4450683)

Abstract: OFDM is the preferred multi-carrier transmission scheme used in most communication systems today. In the OFDM scheme, QAM, QPSK, etc. symbols in the spectral space are mapped to orthogonal sub-carriers in the time domain. Hence, it enables efficient spectrum utilization and high data throughput with low inter-symbol interference. The orthogonality of the sub-carriers is achieved by the IFFT/FFT module in the Transmitter/Receiver data chain. The primary goal of this project is to explore various architectures for an efficient implementation of the 64-point IFFT core for the 802.11 Wireless Standard specifications. We aim to compare the performance, in terms of area, power and speed, of the flat 6-stage 64-point IFFT core architecture with the 1-stage 64-point folded IFFT core architecture. Our implementation scope covers the architectural conception of the IFFT core, System Verilog implementation and verification of the fundamental blocks, integration of the fundamental blocks to generate the top-level IFFT core, functional verification of the top-level IFFT core using MATLAB co-simulation, and synthesis and testability analysis of the IFFT core.

INTRODUCTION

OFDM Transmitter Chain

The above figure represents the OFDM architecture for most applications like 802.11 WLAN, GPS and Communication Systems. The main focus of the project is to develop the radix-2 64-point 16-bit IFFT/FFT core, which forms the heart of the OFDM transmitter chain. The 64-point IFFT/FFT core developed as part of this project can be used in the OFDM Transmitter chain for all the mentioned applications. We are targeting the 802.11a Wireless Standard as its primary use-case. The core receives 64 complex samples of 16-bit data in the frequency domain from the system data bus, in parallel, performs the 64-point IFFT (Inverse Fast Fourier Transform) and passes the 64-point 16-bit time-domain complex outputs to the next block in the OFDM Transmitter data-path, in parallel.

OFDM is a multi-carrier transmission scheme, which divides the available spectrum into multiple carriers, where each one of them is modulated by a low data rate encoding scheme. The primary advantage of using OFDM as the means of data transmission at the physical link layer is the fact that it is highly efficient in terms of spectrum reuse. The sub-carriers used in the transmission scheme are synthesized orthogonal to each other, thus enabling them to be placed very close to each other in the spectrum, boosting the throughput of the system, while avoiding interference. The orthogonality of the carriers is guaranteed by the IFFT module in the transmitter chain.

With the introduction of powerful and cheap DSP processors, it is now feasible to efficiently implement the IFFT algorithm in hardware. The IFFT is computationally far more efficient than a direct IDFT (on the order of N log2 N instead of N^2 complex multiplications), and the incoming data stream can be processed in parallel, which makes it the preferred implementation scheme. This project aims to develop an efficient complex 64-point 16-bit IFFT core. This core could be used in any OFDM application for generating the 64-sample time-domain output from the 64-sample frequency-domain input. It uses the butterfly structure (shown below) as its fundamental building block to generate the 64-point DIT-IFFT efficiently.

Butterfly structure
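To make the role of the butterfly concrete, the following is a minimal floating-point sketch (not the fixed-point hardware) of a radix-2 decimation-in-time IFFT built only from 2-point butterflies, checked against a direct IDFT. The function names are illustrative. The stimulus follows the pattern of the comparison table later in this report; extending that pattern to X[51]..X[63] is an assumption, since the table is only listed up to X[50].

```python
import cmath

def idft_direct(X):
    """Direct inverse DFT: N*N complex multiplications."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def butterfly(x1, x2, w):
    """2-point butterfly: Y1 = X1 + X2*W, Y2 = X1 - X2*W."""
    t = x2 * w
    return x1 + t, x1 - t

def ifft_dit(X):
    """Recursive radix-2 decimation-in-time inverse FFT, without the 1/N scaling."""
    N = len(X)
    if N == 1:
        return list(X)
    even = ifft_dit(X[0::2])
    odd = ifft_dit(X[1::2])
    out = [0j] * N
    for k in range(N // 2):
        w = cmath.exp(2j * cmath.pi * k / N)  # inverse-transform twiddle (positive exponent)
        out[k], out[k + N // 2] = butterfly(even[k], odd[k], w)
    return out

def ifft64(X):
    """64-point IFFT: six levels of butterflies followed by scaling by 1/64."""
    assert len(X) == 64
    return [v / 64 for v in ifft_dit(X)]

# Stimulus pattern: X[k] = (2k+1) + j(2k+2) for k < 32, conjugated and repeated for k >= 32
# (assumed to continue beyond the rows listed in the report).
X = [complex(2 * k + 1, 2 * k + 2) for k in range(32)] + \
    [complex(2 * k + 1, -(2 * k + 2)) for k in range(32)]
y = ifft64(X)
print(round(y[0].real), round(y[0].imag))
print(round(y[1].real), round(y[1].imag))
```

The recursion depth for 64 points is six, which corresponds to the six butterfly stages of the flat architecture.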

We have streamlined our core for the 802.11a standard and the specifications are as below:

Features                  Our FFT/IFFT core                                 802.11a Spec
64 point FFT              Supported                                         Supported
Forward/Inverse FFT       Supported                                         Supported
Data Rates (Mbits/sec)    89600 (for flat 6 stage DIT-FFT architecture)     6-54
Input Word Length         16 bits supported                                 12 bits
Output Word Length        16 bits supported                                 14 bits
Twiddle Word Length       16 bits supported                                 10 bits
Self-Sorted Outputs       Supported                                         Supported

The above analysis shows that our FFT/IFFT core can be used for 802.11a WLAN applications. The block receives 64 complex samples in parallel from the Serial-to-Parallel converter, processes these samples in parallel through the 6-stage, 16-bit fixed-point butterfly stages and generates the time-domain output. This output is then fed to the parallel-to-serial converter in the OFDM transmitter data-path.

Our 64-point IFFT core architecture:

6-stage 64-point, 16-bit FFT/IFFT core flat architecture

Previous Work: The previous work in this domain has been aimed primarily towards architectural exploration of efficient FFT/IFFT cores for various use-cases and applications. The work has ranged from developing low-power architectures for 64-point IFFT/FFT cores to developing highly efficient parallel-pipelined architectures for high-speed and high-performance applications. Mixed-radix and multi-path delay FFT architectures have also been explored in some detail to demonstrate the trade-off between area, hardware reuse and processing time, while meeting the system specifications. Considerable research effort has also gone into developing prototypes for FFT architectures employing folding transformation and register minimization techniques.

References:
i. An Approach to Simplify the Design of IFFT/FFT Cores of OFDM Systems, Ainhoa Cortés, Igone Vélez, Juan F. Sevillano, and Andoni Irizar, IEEE Transactions on Consumer Electronics, Vol. 52, No. 1, February 2006
ii. A Low-Power 64-Point FFT/IFFT Design for IEEE 802.11a WLAN Application, Chin-Teng Lin, Yuan-Chu Yu, and Lan-Da Van, ISCAS 2006
iii. A Novel 64-Point FFT/IFFT Processor for IEEE 802.11(a) Standard, K. Maharatna, E. Grass, U. Jagdhold, ICASSP 2003
iv. An Efficient OFDM Transceiver Design Suitable to IEEE 802.11a WLAN Standard, T. Suresh, K. L. Shunmuganathan, IJCSIS 2010
v. Design of an FFT/IFFT Processor for MIMO OFDM Systems, Yu-Wei Lin and Chen-Yi Lee, IEEE Transactions on Circuits and Systems, 2007
vi. Pipelined Parallel FFT Architectures via Folding Transformation, Manohar Ayinala, Michael Brown, Keshab K. Parhi, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2011
vii. http://www.ie.u-ryukyu.ac.jp/~wada/design07/spec_e.html
viii. Keshab K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation
ix. Milos D. Ercegovac, Tomas Lang, Digital Arithmetic
x. John G. Proakis, Dimitris G. Manolakis, Digital Signal Processing: Principles, Algorithms and Applications

PROPOSED ARCHITECTURE

Overall system block diagram of 802.11 OFDM:

Our Proposed FFT/IFFT Core:

Interface of our FFT/IFFT core to overall OFDM Transmitter chain is shown below:

[Figure: block diagram of the 64pt IFFT block. Inputs: real_input_ (64x16bit), imag_input_ (64x16bit), rst_n, clk. Internal blocks: Controller; Twiddle register array (32x2x16bit); Stage1 to Stage6, each with 32 butterfly modules; Multiplier 16x16 with rounding to 16 bit. Outputs: real_output_ (64x16bit), imag_output_ (64x16bit).]

FFT/IFFT CORE BLOCK DESCRIPTION

TABLE OF PINS:

Signal name     I/O                      Description
clk             Input                    Clock signal
rst_n           Input                    Asynchronous reset signal to reset the state machine and FFT computation
real_input_     Input (64x16bit bus)     Real values for complex input samples
imag_input_     Input (64x16bit bus)     Imaginary values for complex input samples
real_output_    Output (64x16bit bus)    Real values for complex output samples
imag_output_    Output (64x16bit bus)    Imaginary values for complex output samples

FUNCTIONAL DESCRIPTION OF MAJOR BLOCKS:

64-POINT IFFT BLOCK (TOP MODULE):

This is the top-level module of our design. It computes the 64-point inverse FFT on complex inputs and provides complex outputs. The twiddle coefficients are read in the reset state from the Twiddle-ROM file. The 64-point IFFT is computed in parallel using 192 butterfly structures connected in 6 stages of 32 butterflies each. The final output is computed by multiplying the 6th-stage outputs by the value 1/64 stored in an internal register. Once the IFFT computation is done, the 64 complex IFFT values are sent out in parallel in the same clock cycle as Y[0].real, Y[0].imaginary, Y[1].real, Y[1].imaginary, ..., Y[63].real and Y[63].imaginary.
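The stage structure described above can be modelled in software as follows. This is a floating-point sketch under the assumption of a standard in-place radix-2 DIT flow with bit-reversed input ordering, which is consistent with the stage-1 input pairing (X[0] with X[32], X[16] with X[48], ...) in the Verilog listing; the function names are illustrative and the counter only exists to confirm the 6 x 32 = 192 butterfly count.

```python
import cmath

N, STAGES = 64, 6

def bit_reverse(i, bits=STAGES):
    """Reverse the lowest `bits` bits of i (input reordering for DIT)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def ifft64_flat(X):
    """Six stages of 32 butterflies each, followed by scaling by 1/64."""
    data = [X[bit_reverse(i)] for i in range(N)]
    butterflies = 0
    for s in range(1, STAGES + 1):
        span = 1 << s            # size of each sub-transform in this stage
        half = span // 2
        for base in range(0, N, span):
            for k in range(half):
                w = cmath.exp(2j * cmath.pi * k / span)  # IFFT twiddle
                a = data[base + k]
                b = data[base + k + half] * w
                data[base + k], data[base + k + half] = a + b, a - b
                butterflies += 1
    return [v / N for v in data], butterflies

# A single tone in bin 1 with amplitude 64 gives one cycle of a unit complex exponential.
y, count = ifft64_flat([0j, 64 + 0j] + [0j] * 62)
print(count)
```

In hardware all 32 butterflies of a stage operate side by side; the nested loops here only enumerate them sequentially.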

SUBMODULES OF 64-POINT IFFT BLOCK (TOP MODULE):

Controller: The controller block controls the two states of the IFFT block: 'reset_state' and 'processing_state'. In the 'reset_state', active low rst_n is asserted and the IFFT block is idle. In the 'processing_state', the rst_n is de-asserted and the IFFT block computes the 64-point inverse Fourier transform every clock cycle.

Multiplier16x16: This is the final stage of the IFFT block. It normalizes the 64-point 16-bit complex output samples of stage 6 by a factor of 1/64, using a 16x16 signed Booth multiplier.
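As a small numerical check of this normalization, the sketch below assumes that the constant 16'h0200 used in the Verilog listing is 1/64 in Q15 format (512 / 32768) and that the product is truncated back to 16 bits. The multiplier_16bit module itself is not listed in this report, so its exact rounding behaviour is an assumption here.

```python
NORM_Q15 = 0x0200  # 512 / 32768 = 1/64, assuming Q15 scaling

def scale_by_norm(y):
    """Multiply a signed 16-bit sample by 1/64 in Q15 and truncate (floor)."""
    return (y * NORM_Q15) >> 15

def scale_by_shift(y):
    """Arithmetic right shift by 6 bits, i.e. floor(y / 64)."""
    return y >> 6

# Under these assumptions the two agree for every signed 16-bit input.
same = all(scale_by_norm(v) == scale_by_shift(v) for v in range(-32768, 32768))
print(same)
```

Under the truncation assumption the multiplication is equivalent to a 6-bit arithmetic shift, which may be worth considering when comparing area between architectures.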

Stage 1 to Stage 6: Each stage employs 32 instances of 2-point complex butterfly computation units.

2-point_butterfly structure: The butterfly structure computes the complex outputs Y1 and Y2 (Y1 = X1 + X2 * Wx and Y2 = X1 - X2 * Wx). This structure is instantiated 192 times to compute the 64-point 16-bit complex IFFT values.
[Figure: butterfly datapath. Inputs X1.r, X1.i, X2.r, X2.i, W.r and W.i. Four multipliers produce the sum/carry pairs {S1,C1}, {S2,C2}, {S3,C3} and {S4,C4}; two 32-bit 4:2 compressors reduce them to {S11,C11} and {S21,C21}; four 32-bit 3:2 compressors add the X1.r or X1.i term; four 32-bit Kogge-Stone adders with round-to-16-bit produce Y1.r, Y1.i, Y2.r and Y2.i.]
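A behavioural sketch of the butterfly arithmetic on integers is given below. It assumes Q15 twiddle factors and aligns X1 with the 32-bit products by a 15-bit left shift, as the Verilog listing does; for brevity it truncates instead of applying the listing's rounding scheme and does not wrap results to 16 bits. The function name is illustrative.

```python
def butterfly_q15(x1, x2, w):
    """Y1 = X1 + X2*W and Y2 = X1 - X2*W on (real, imag) integer pairs.

    x1, x2: signed 16-bit samples; w: Q15 twiddle (assumed format).
    Results are truncated with an arithmetic shift; no 16-bit wrap-around is modelled.
    """
    (x1r, x1i), (x2r, x2i), (wr, wi) = x1, x2, w
    ar = x2r * wr - x2i * wi          # real part of X2*W, 32-bit product scale
    ai = x2r * wi + x2i * wr          # imaginary part of X2*W
    y1 = (((x1r << 15) + ar) >> 15, ((x1i << 15) + ai) >> 15)
    y2 = (((x1r << 15) - ar) >> 15, ((x1i << 15) - ai) >> 15)
    return y1, y2

# W = -j is exactly representable in Q15 as (0, -32768):
# (20 - 10j) * (-j) = -10 - 20j, so Y1 = 90 + 30j and Y2 = 110 + 70j.
y1, y2 = butterfly_q15((100, 50), (20, -10), (0, -32768))
print(y1, y2)
```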

SUB-MODULES OF BUTTERFLY STRUCTURE:

16bit Multiplier block: The block gets the complex inputs X1.r, X1.i, X2.r and X2.i from the complex input buffer and the complex twiddle values W.r and W.i from the Twiddle register array.

This block computes the signed multiplications X2.r * W.r, X2.i * W.i, X2.r * W.i and X2.i * W.r. The intermediate product of each multiplier is kept in 32-bit Sum and Carry format {S1,C1}, {S2,C2}, {S3,C3} and {S4,C4} to reduce overflow and rounding errors.
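The multipliers in the Verilog listing (module booth16f) use radix-4 Booth encoding, which recodes the 16-bit multiplier into eight digits from {-2, -1, 0, 1, 2}. The sketch below illustrates only the recoding rule and checks it against ordinary multiplication; it collapses the partial products into one integer rather than producing the Sum and Carry pair, and the function names are illustrative.

```python
def booth_radix4_digits(y, bits=16):
    """Recode a signed `bits`-bit multiplier into bits/2 radix-4 Booth digits.

    digit_i = -2*y[2i+1] + y[2i] + y[2i-1], with y[-1] = 0.
    """
    y &= (1 << bits) - 1          # two's-complement view of the multiplier
    digits = []
    prev = 0                      # implicit bit y[-1]
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits

def booth_multiply(x, y, bits=16):
    """Sum of the Booth partial products digit_i * x * 4**i."""
    return sum(d * x * (4 ** i) for i, d in enumerate(booth_radix4_digits(y, bits)))

print(booth_radix4_digits(6)[:2])
print(booth_multiply(12345, -6789) == 12345 * -6789)
```

Each digit selects 0, x, 2x or their negatives as one partial product, so eight partial products replace the sixteen of a plain array multiplier.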

4:2 Compressor 32bit: This block compresses the 32-bit Sum and Carry outputs of the multipliers to compute X2.r * W.r - X2.i * W.i and X2.r * W.i + X2.i * W.r in 32-bit Sum and Carry format {S11,C11} and {S21,C21} respectively.

3:2 Compressor 32bit: This block compresses the 32-bit Sum and Carry outputs of the 4:2 compressors together with the X1 real and imaginary inputs, to compute X1.r + (X2.r * W.r - X2.i * W.i), X1.r - (X2.r * W.r - X2.i * W.i), X1.i + (X2.r * W.i + X2.i * W.r) and X1.i - (X2.r * W.i + X2.i * W.r) in 32-bit Sum and Carry format {S31,C31}, {S32,C32}, {S33,C33} and {S34,C34} respectively.
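The compressors rely on the carry-save principle: several operands are reduced to a Sum word and a Carry word without propagating carries, and a single adder resolves them at the end. The sketch below shows the generic technique on 32-bit words; the wallace4to2 and wallace3to2 modules are not listed in this report, so this is not their exact structure, and the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def compress_3to2(a, b, c):
    """Carry-save adder: three 32-bit words in, (sum, carry) out."""
    s = a ^ b ^ c                                   # bitwise sum without carries
    cy = ((a & b) | (a & c) | (b & c)) << 1         # majority bits, shifted to their weight
    return s & MASK32, cy & MASK32

def compress_4to2(a, b, c, d):
    """Four words in, (sum, carry) out, built here from two 3:2 levels."""
    s1, c1 = compress_3to2(a, b, c)
    return compress_3to2(s1, c1, d)

def resolve(s, cy):
    """Final carry-propagate addition (the vector-merging step), modulo 2**32."""
    return (s + cy) & MASK32

# Subtraction reuses the same hardware: a - b = a + ~b + 1 (two's complement).
s, cy = compress_3to2(1000, ~250 & MASK32, 1)
print(resolve(s, cy))
```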

KoggeStone Adder 32bit: This high-speed prefix adder computes the final real and imaginary outputs by vector merging of the {S31,C31}, {S32,C32}, {S33,C33} and {S34,C34} 3:2 compressor outputs. This block also performs 32-bit to 16-bit rounding and outputs the 16-bit butterfly values Y1.r, Y1.i, Y2.r and Y2.i.
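The parallel-prefix idea behind the Kogge-Stone adder can be sketched on Python integers as below. This models the generic algorithm (generate/propagate signals combined over distances 1, 2, 4, 8 and 16), not the report's kogge_stone_adder module, whose source is not listed; the 33-bit result width matches the wire widths used in the butterfly listing.

```python
def kogge_stone_add(a, b, width=32):
    """Add two unsigned `width`-bit words with a Kogge-Stone prefix network.

    Returns a (width+1)-bit result; bit `width` is the carry-out.
    """
    mask = (1 << width) - 1
    g = a & b                 # generate: this bit position produces a carry
    p = a ^ b                 # propagate: this bit position passes a carry on
    p0 = p                    # keep the bitwise sum for the final XOR
    d = 1
    while d < width:          # log2(width) prefix levels
        g = (g | (p & (g << d))) & mask   # uses the previous level's p
        p = (p & (p << d)) & mask
        d <<= 1
    carries = g << 1          # carry into bit i is the group generate of bits [i-1:0]
    return p0 ^ carries

print(hex(kogge_stone_add(0xFFFFFFFF, 1)))
```

For 32-bit operands the loop runs five times, which reflects the logarithmic depth that makes this adder attractive for the final vector-merging step.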

VERILOG CODES
/**********************************************************************
Module butterfly:
This module computes the 16-bit signed complex outputs
Y1 = X1 + X2 * Wx and Y2 = X1 - X2 * Wx from 16-bit signed complex inputs
X1 and X2. The block uses four 16-bit signed multiplier units, two 4:2
32-bit compressors, four 3:2 32-bit compressors and four Kogge-Stone adders.
The multiplier outputs are 32-bit sum and carry outputs which are combined
using 4:2 32-bit compression followed by 3:2 32-bit compression. Then the
final 16-bit signed outputs Y1.r, Y1.i, Y2.r and Y2.i are computed by
32-bit vector merging using a high-speed Kogge-Stone adder and a 32-bit to
16-bit rounding stage.
It is to be noted that for maintaining precision, rounding to 16-bit is
postponed to the very last stage. Also, subtractions are optimized by using
the same hardware by applying complemented inputs and adding 1'b1 at LSB
positions at available empty slots in the Wallace tree structure or vector
merging structure.
inputs  : 16-bit signed complex X1.r + jX1.i, X2.r + jX2.i,
          W.r + jW.i (twiddle_coefficient)
outputs : 16-bit signed complex Y1.r + jY1.i, Y2.r + jY2.i
***********************************************************************/
typedef struct {            // Structure definition
  reg signed [15:0] r, i;   // real and imaginary parts
} complex;                  // complex data type

module butterfly(input complex X1, X2, W, output complex Y1, Y2);
  wire [31:0] S1,C1,S2,C2,S3,C3,S4,C4,S11,C11,S21,C21,
              S31,C31,S32,C32,S33,C33,S34,C34;
  wire [32:0] Y1_r_temp1, Y1_i_temp1, Y2_r_temp1, Y2_i_temp1;
  reg  [16:0] Y1_r_temp2, Y1_i_temp2, Y2_r_temp2, Y2_i_temp2;
  wire [32:0] Y2_i_temp1_r;
  reg  [32:0] Y2_i_temp2_r;

  //Computing real part A.r = X2.r * W.r - X2.i * W.i
  //Subtraction is done by inverting S2 and C2 and adding 1'b1 to LSB position,
  //the other 1'b1 to be added is postponed to the next Wallace tree compression
  //as there is no slot left.
  booth16f mult1(X2.r, W.r, S1, C1); //16-bit multiplication generates 32-bit Sum and Carry
  booth16f mult2(X2.i, W.i, S2, C2); //16-bit multiplication generates 32-bit Sum and Carry
  wallace4to2 wallace11(S1, ~S2, {C1[31:1],1'b1}, ~C2, S11, C11); //32-bit Sum and Carry for 4 input 32-bit PPs

  //Computing real output Y1.r = X1.r + A.r
  //The 1'b1 which is postponed from the previous stage is also added at the LSB position.
  //Input X1.r is added to the bit positions [30:15] and sign extension is
  //added at bit position 31.
  wallace3to2 wallace31(S11, {C11[31:1],1'b1}, {X1.r[15],X1.r,15'h0000}, S31, C31); //3:2 32-bit compression
  kogge_stone_adder KS31(Y1_r_temp1, S31, C31); //Vector merging of 32-bit inputs
  //Rounding Scheme : 32 bit output to 16bit output (future scope).
  assign Y1.r = (Y1_r_temp1[14:0] >= (15'h4000) ?
                  ((Y1_r_temp1[30] == 1'b0) ? (Y1_r_temp1[30:15] + 1) : (Y1_r_temp1[30:15])) :
                  Y1_r_temp1[30:15]);

  //Computing real output Y2.r = X1.r - A.r
  //Subtraction is done by inverting S11 and C11 and adding 1'b1 to LSB positions twice.
  //Input X1.r is added to the bit positions [30:15] and sign extension is
  //added at bit position 31.
  wallace3to2 wallace32(~S11, {~C11[31:1],1'b1}, {X1.r[15],X1.r,15'h0000}, S32, C32); //3:2 32-bit compression
  kogge_stone_adder KS32(Y2_r_temp1, S32, {C32[31:1],1'b1}); //Vector merging of 32-bit inputs
  //Rounding Scheme : 32 bit output to 16bit output (future scope)
  assign Y2.r = (Y2_r_temp1[14:0] >= (15'h4000) ?
                  ((Y2_r_temp1[30] == 1'b0) ? (Y2_r_temp1[30:15] + 1) : (Y2_r_temp1[30:15])) :
                  Y2_r_temp1[30:15]);

  //Computation of imaginary part A.i = X2.r * W.i + X2.i * W.r
  booth16f mult3(X2.r, W.i, S3, C3); //16-bit multiplication generates 32-bit Sum and Carry
  booth16f mult4(X2.i, W.r, S4, C4); //16-bit multiplication generates 32-bit Sum and Carry
  wallace4to2 wallace21(S3, S4, C3, C4, S21, C21); //32-bit Sum and Carry for 4 input 32-bit PPs

  //Computing imaginary output Y1.i = X1.i + A.i
  //Input X1.i is added to the bit positions [30:15] and sign extension is
  //added at bit position 31.
  wallace3to2 wallace33(S21, C21, {X1.i[15],X1.i,15'h0000}, S33, C33); //3:2 32-bit compression
  kogge_stone_adder KS33(Y1_i_temp1, S33, C33); //Vector merging of 32-bit inputs
  //Rounding Scheme : 32 bit output to 16bit output (future scope).
  assign Y1.i = (Y1_i_temp1[14:0] >= (15'h4000) ?
                  ((Y1_i_temp1[30] == 1'b0) ? (Y1_i_temp1[30:15] + 1) : (Y1_i_temp1[30:15])) :
                  Y1_i_temp1[30:15]);

  //Computing imaginary output Y2.i = X1.i - A.i
  //Subtraction is done by inverting S21 and C21; one 1'b1 is added in the LSB of
  //the X1.i operand and the other in the LSB of the carry word.
  //Input X1.i is added to the bit positions [30:15] and sign extension is
  //added at bit position 31.
  wallace3to2 wallace34(~S21, ~C21, {X1.i[15],X1.i,15'h0001}, S34, C34); //3:2 32-bit compression
  kogge_stone_adder KS34(Y2_i_temp1, S34, {C34[31:1],1'b1}); //Vector merging of 32-bit inputs
  //Rounding Scheme : 32 bit output to 16bit output (future scope).
  assign Y2.i = (Y2_i_temp1[14:0] >= (15'h4000) ?
                  ((Y2_i_temp1[30] == 1'b0) ? (Y2_i_temp1[30:15] + 1) : (Y2_i_temp1[30:15])) :
                  Y2_i_temp1[30:15]);
endmodule
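The 32-bit to 16-bit rounding expression used four times in the butterfly module can be checked in software with a small behavioural model. The sketch below mirrors that expression on the adder output word: bits [30:15] form the result, and 1 is added when bits [14:0] are at least 15'h4000 and bit 30 is zero. The function name is illustrative, and the 16-bit wrap-around on increment is an assumption about how the assignment to the 16-bit field behaves.

```python
def round_to_16(t):
    """Model of the listing's rounding of an adder output word `t` to 16 bits.

    Returns the raw 16-bit pattern (two's complement when read as signed).
    """
    low = t & 0x7FFF              # bits [14:0]
    top = (t >> 15) & 0xFFFF      # bits [30:15]
    sign = (t >> 30) & 1          # bit 30, the sign bit of the 16-bit result
    if low >= 0x4000 and sign == 0:
        top = (top + 1) & 0xFFFF  # wraps at 16 bits, as the assignment would
    return top

print(round_to_16((5 << 15) | 0x4000))       # threshold reached, non-negative value
print(round_to_16((5 << 15) | 0x3FFF))       # just below the threshold
print(hex(round_to_16((0x8001 << 15) | 0x7FFF)))  # negative value is left truncated
```

The model makes two properties of the scheme visible: negative results are always truncated, and a result of 0x7FFF that is rounded up wraps to 0x8000, which may be relevant to the "future scope" note on rounding in the listing.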

/**********************************************************************
Module booth16f :
This module computes the product of two 16-bit signed numbers and outputs
two 32-bit sum and carry values. This is required in the butterfly module
for doing intermediate 32-bit computation for preserving precision and for
improving the computation speed of the butterfly, as vector merging is
postponed to the last stage of the butterfly computation.
This module uses the Booth encoding scheme followed by a 4-stage Wallace
tree structure. The Wallace tree structure is optimized by avoiding sign
extension of the partial products, using the concept that
a4a4a4a4a4a3a2a1 = 0000a4'a3a2a1 + 11110000 and adjusting the 11110000
addition in the last row along with partial product u.
Signed inputs : x[15:0], y[15:0]
Signed output : sum_op[31:0], carry_op[31:0]
**********************************************************************/
module booth16f(x, y, sum_op, carry_op);
  input  [15:0] x, y;
  output [31:0] sum_op, carry_op;
  reg [16:0] a, b, c, d, e, f, g, h;
  reg [31:0] u;
  wire [14:0] s11; wire [15:1] c11; wire [14:0] s12; wire [15:1] c12;
  wire [11:0] s13; wire [12:1] c13; wire [14:0] s21; wire [15:1] c21;
  wire [15:0] s22; wire [16:1] c22; wire [18:0] s31; wire [19:1] c31;
  wire [31:0] s41; wire [31:0] c41;

  //BOOTH ENCODING ALGO
  // b(2i+1)  b(2i)  b(2i-1)   bi' (recoded value)
  //    0       0       0        0
  //    0       0       1        1
  //    0       1       0        1
  //    0       1       1        2
  //    1       0       0       -2
  //    1       0       1       -1
  //    1       1       0       -1
  //    1       1       1       -0

  always @(x or y)
  begin
    u = 32'hAAAB0000;
    case (y[1:0]) //Generation of encoded PP1
      2'b00 : begin a = 17'b00000000000000000; u[0] = 0; end // 0
      2'b01 : begin a = {x[15], x[15:0]};      u[0] = 0; end // 1
      2'b10 : begin a = {~x[15:0], 1'b1};      u[0] = 1; end // -2
      2'b11 : begin a = {~x[15], ~x[15:0]};    u[0] = 1; end // -1
    endcase
    case (y[3:1]) //Generation of encoded PP2
      3'b000 : begin b = 17'b00000000000000000; u[2] = 0; end // 0
      3'b001 : begin b = {x[15], x[15:0]};      u[2] = 0; end // 1
      3'b010 : begin b = {x[15], x[15:0]};      u[2] = 0; end // 1
      3'b011 : begin b = {x[15:0], 1'b0};       u[2] = 0; end // 2
      3'b100 : begin b = {~x[15:0], 1'b1};      u[2] = 1; end // -2
      3'b101 : begin b = {~x[15], ~x[15:0]};    u[2] = 1; end // -1
      3'b110 : begin b = {~x[15], ~x[15:0]};    u[2] = 1; end // -1
      3'b111 : begin b = 17'b00000000000000000; u[2] = 0; end // 0
    endcase
    case (y[5:3]) //Generation of encoded PP3
      3'b000 : begin c = 17'b00000000000000000; u[4] = 0; end // 0
      3'b001 : begin c = {x[15], x[15:0]};      u[4] = 0; end // 1
      3'b010 : begin c = {x[15], x[15:0]};      u[4] = 0; end // 1
      3'b011 : begin c = {x[15:0], 1'b0};       u[4] = 0; end // 2
      3'b100 : begin c = {~x[15:0], 1'b1};      u[4] = 1; end // -2
      3'b101 : begin c = {~x[15], ~x[15:0]};    u[4] = 1; end // -1
      3'b110 : begin c = {~x[15], ~x[15:0]};    u[4] = 1; end // -1
      3'b111 : begin c = 17'b00000000000000000; u[4] = 0; end // 0
    endcase
    case (y[7:5]) //Generation of encoded PP4
      3'b000 : begin d = 17'b00000000000000000; u[6] = 0; end // 0
      3'b001 : begin d = {x[15], x[15:0]};      u[6] = 0; end // 1
      3'b010 : begin d = {x[15], x[15:0]};      u[6] = 0; end // 1
      3'b011 : begin d = {x[15:0], 1'b0};       u[6] = 0; end // 2
      3'b100 : begin d = {~x[15:0], 1'b1};      u[6] = 1; end // -2
      3'b101 : begin d = {~x[15], ~x[15:0]};    u[6] = 1; end // -1
      3'b110 : begin d = {~x[15], ~x[15:0]};    u[6] = 1; end // -1
      3'b111 : begin d = 17'b00000000000000000; u[6] = 0; end // 0
    endcase
    case (y[9:7]) //Generation of encoded PP5
      3'b000 : begin e = 17'b00000000000000000; u[8] = 0; end // 0
      3'b001 : begin e = {x[15], x[15:0]};      u[8] = 0; end // 1
      3'b010 : begin e = {x[15], x[15:0]};      u[8] = 0; end // 1
      3'b011 : begin e = {x[15:0], 1'b0};       u[8] = 0; end // 2
      3'b100 : begin e = {~x[15:0], 1'b1};      u[8] = 1; end // -2
      3'b101 : begin e = {~x[15], ~x[15:0]};    u[8] = 1; end // -1
      3'b110 : begin e = {~x[15], ~x[15:0]};    u[8] = 1; end // -1
      3'b111 : begin e = 17'b00000000000000000; u[8] = 0; end // 0
    endcase
    case (y[11:9]) //Generation of encoded PP6
      3'b000 : begin f = 17'b00000000000000000; u[10] = 0; end // 0
      3'b001 : begin f = {x[15], x[15:0]};      u[10] = 0; end // 1
      3'b010 : begin f = {x[15], x[15:0]};      u[10] = 0; end // 1
      3'b011 : begin f = {x[15:0], 1'b0};       u[10] = 0; end // 2
      3'b100 : begin f = {~x[15:0], 1'b1};      u[10] = 1; end // -2
      3'b101 : begin f = {~x[15], ~x[15:0]};    u[10] = 1; end // -1
      3'b110 : begin f = {~x[15], ~x[15:0]};    u[10] = 1; end // -1
      3'b111 : begin f = 17'b00000000000000000; u[10] = 0; end // 0
    endcase
    case (y[13:11]) //Generation of encoded PP7
      3'b000 : begin g = 17'b00000000000000000; u[12] = 0; end // 0
      3'b001 : begin g = {x[15], x[15:0]};      u[12] = 0; end // 1
      3'b010 : begin g = {x[15], x[15:0]};      u[12] = 0; end // 1
      3'b011 : begin g = {x[15:0], 1'b0};       u[12] = 0; end // 2
      3'b100 : begin g = {~x[15:0], 1'b1};      u[12] = 1; end // -2
      3'b101 : begin g = {~x[15], ~x[15:0]};    u[12] = 1; end // -1
      3'b110 : begin g = {~x[15], ~x[15:0]};    u[12] = 1; end // -1
      3'b111 : begin g = 17'b00000000000000000; u[12] = 0; end // 0
    endcase
    case (y[15:13]) //Generation of encoded PP8
      3'b000 : begin h = 17'b00000000000000000; u[14] = 0; end // 0
      3'b001 : begin h = {x[15], x[15:0]};      u[14] = 0; end // 1
      3'b010 : begin h = {x[15], x[15:0]};      u[14] = 0; end // 1
      3'b011 : begin h = {x[15:0], 1'b0};       u[14] = 0; end // 2
      3'b100 : begin h = {~x[15:0], 1'b1};      u[14] = 1; end // -2
      3'b101 : begin h = {~x[15], ~x[15:0]};    u[14] = 1; end // -1
      3'b110 : begin h = {~x[15], ~x[15:0]};    u[14] = 1; end // -1
      3'b111 : begin h = 17'b00000000000000000; u[14] = 0; end // 0
    endcase
  end

  /************ WALLACE Tree structure: sign extension is avoided by using the
  concept signed number a4a4a4a4a4a3a2a1 = 0000a4'a3a2a1 + 11110000 ************/
  //stage1
  //Generating the full-adder array for stage-1 (9:6 compression)
  //iteration 1
  //Compression of rows 1,2 and 3 of PPs
  genvar i;
  generate
  for(i=0; i <= 11 ;i = i+1)
  begin: fulladd11
    fulladd fa11_(s11[i], c11[i + 1], a[i + 4], b[i + 2], c[i] );
  end
  endgenerate
  fulladd fa11_12(s11[12], c11[13], ~a[16], b[14], c[12] );
  halfadd ha11_13(s11[13], c11[14], b[15], c[13]);
  halfadd ha11_14(s11[14], c11[15], ~b[16], c[14]);
  //iteration 2
  //Compression of rows 4,5 and 6 of PPs
  generate
  for(i=0;i<=11 ;i=i+1)
  begin: fulladd12
    fulladd fa12_(s12[i],c12[i+1], d[i + 4], e[i + 2],f[i] );
  end
  endgenerate
  fulladd fa12_12(s12[12], c12[13], ~d[16], e[14], f[12] );
  halfadd ha12_13(s12[13], c12[14], e[15], f[13]);
  halfadd ha12_14(s12[14], c12[15], ~e[16], f[14]);
  //iteration 3
  //Compression of rows 7,8 and 9 of PPs
  fulladd fa13_0(s13[0], c13[1], g[2], h[0], u[14]);
  halfadd ha13_1(s13[1], c13[2], g[3], h[1] );
  fulladd fa13_2(s13[2], c13[3], g[4], h[2], 1'b1 );
  generate
  for(i=3;i<=11 ;i=i+2)
  begin: fulladd13
    fulladd fa13_(s13[i], c13[i+1], g[i+2], h[i], 1'b1);
  end
  endgenerate
  generate
  for(i=4;i<=10 ;i=i+2)
  begin: halfadd13
    halfadd ha13_(s13[i], c13[i+1], g[i + 2], h[i] );
  end
  endgenerate

  //stage2
  //Generating the full-adder array for stage-2 (6:4 compression)
  //iteration 1
  //Compression of rows 1,2 and 3 of Stage-1 Outputs
  generate
  for(i=0; i <= 3 ;i = i+1)
  begin: fulladd21_a
    fulladd fa21_(s21[i], c21[i + 1], s11[i+2], c11[i + 2], d[i] );
  end
  endgenerate
  generate
  for(i=4; i <=12; i = i+1)
  begin: fulladd21_b
    fulladd fa21_(s21[i], c21[i + 1], s11[i+2], c11[i + 2], s12[i-4] );
  end
  endgenerate
  fulladd fa21_13(s21[13], c21[14], c[15], c11[15], s12[9] );
  halfadd ha21_14(s21[14], c21[15], ~c[16], s12[10]);
  //iteration 2
  //Compression of rows 4,5 and 6 of Stage-1 Outputs
  fulladd fa22_0(s22[0], c22[1], c12[2], g[0], u[12] );
  halfadd ha22_1(s22[1], c22[2], c12[3], g[1]);
  halfadd ha22_2(s22[2], c22[3], c12[4], s13[0]);
  generate
  for(i=3; i <= 13 ;i = i+1)
  begin: fulladd22
    fulladd fa22_(s22[i], c22[i+1], c12[i+2], s13[i-2], c13[i-2] );
  end
  endgenerate
  fulladd fa22_14(s22[14], c22[15], g[14], h[12], c13[12] );
  fulladd fa22_15(s22[15], c22[16], g[15], h[13], 1'b1 );

  //stage 3
  //Generating the full-adder array for stage-3 (4:3 compression)
  //iteration 1
  //Compression of rows 1, 2 and 3 of Stage-2 Outputs
  fulladd fa31_0(s31[0], c31[1], s21[2], c21[2], e[0] );
  fulladd fa31_1(s31[1], c31[2], s21[3], c21[3], e[1] );
  fulladd fa31_2(s31[2], c31[3], s21[4], c21[4], u[10] );
  fulladd fa31_3(s31[3], c31[4], s21[5], c21[5], c12[1] );
  generate
  for(i=4; i <= 12 ;i = i+1)
  begin: fulladd31
    fulladd fa31_(s31[i], c31[i+1], s21[i+2], c21[i+2], s22[i-4] );
  end
  endgenerate
  fulladd fa31_13(s31[13], c31[14], s12[11], c21[15], s22[9] );
  halfadd ha31_14(s31[14], c31[15], s12[12], s22[10] );
  halfadd ha31_15(s31[15], c31[16], s12[13], s22[11] );
  halfadd ha31_16(s31[16], c31[17], s12[14], s22[12] );
  halfadd ha31_17(s31[17], c31[18], f[15], s22[13] );
  halfadd ha31_18(s31[18], c31[19], ~f[16], s22[14] );

  //stage 4
  //Generating the full-adder array for stage-4 (3:2 compression)
  //Compression of rows 1,2 and 3 of Stage-3 Outputs
  halfadd ha41_0(s41[0], c41[1], a[0], u[0] );
  assign s41[1] = a[1];
  assign c41[2] = 1'b0;
  fulladd ha41_2(s41[2], c41[3], a[2], b[0], u[2] );
  halfadd ha41_3(s41[3], c41[4], a[3], b[1] );
  halfadd ha41_4(s41[4], c41[5], s11[0], u[4] );
  halfadd ha41_5(s41[5], c41[6], s11[1], c11[1] );
  halfadd ha41_6(s41[6], c41[7], s21[0], u[6] );
  halfadd ha41_7(s41[7], c41[8], s21[1], c21[1] );
  halfadd ha41_8(s41[8], c41[9], s31[0], u[8] );
  halfadd ha41_9(s41[9], c41[10], s31[1], c31[1] );
  halfadd ha41_10(s41[10], c41[11], s31[2], c31[2] );
  halfadd ha41_11(s41[11], c41[12], s31[3], c31[3] );
  halfadd ha41_12(s41[12], c41[13], s31[4], c31[4] );
  generate
  for(i=13; i <= 26 ;i = i+1)
  begin: fulladd41
    fulladd fa41_(s41[i], c41[i+1], s31[i-8], c31[i-8], c22[i-12] );
  end
  endgenerate
  fulladd fa41_27(s41[27], c41[28], c31[19], s22[15], c22[15] );
  fulladd fa41_28(s41[28], c41[29], ~g[16], h[14], c22[16] );
  halfadd ha41_29(s41[29], c41[30], h[15], 1'b1 );
  assign s41[30] = ~h[16];
  assign c41[31] = 1'b0;
  assign s41[31] = 1'b1;
  assign c41[0] = 1'b0;
  assign carry_op = {c41[31:1],1'b0}; // Final signed 32-bit SUM and CARRY generated
  assign sum_op = s41;
endmodule

/**********************************************************************
Module ifftcore_64pt:
This module computes the 64-point complex DIT-IFFT. This is the top-level
module. This module implements a flat architecture of the radix-2 DIT-IFFT
Cooley-Tukey algorithm and is computed with 6 stages of butterfly
structures, each stage containing 32 butterfly instances, totaling 192
butterfly instances. The output of the final butterfly stage is normalized
by 64 using sixty-four 16-bit Booth multipliers.
The input to the module is a 128x16-bit parallel input [consisting of 64
real and 64 imaginary samples of 16-bit]. The 128x16-bit parallel output
[consisting of 64 real and 64 imaginary samples of 16-bit] is computed in a
single clock cycle.
Complex Signed inputs : 128x16-bit X[63:0], X[k] = X[k].r + j X[k].i
Complex Signed output : 128x16-bit ifft_out[63:0],
                        ifft_out[k] = ifft_out[k].r + j ifft_out[k].i
**********************************************************************/
module ifftcore_64pt(input complex ifft_in[64], input clk, input rst_n, output complex ifft_out[64]);
  complex W[64];
  reg [15:0] norm = 16'h0200; // Final IFFT Normalization (1/64 in hex)

  //intermediate nodes
  complex X[64], X1[64], X2[64], X3[64], X4[64], X5[64], Y[64];

  //Stage 1
  // 32-Butterfly instances in Stage-1 of DIT-IFFT
  genvar i;
  generate
  for(i=0; i <= 1 ; i = i+1)
  begin: butt1_
    butterfly butt1_1(X[0+i], X[32+i], W[0], X1[0+32*i], X1[1+32*i]);
    butterfly butt1_2(X[16+i], X[48+i], W[0], X1[2+32*i], X1[3+32*i]);
    butterfly butt1_3(X[8+i], X[40+i], W[0], X1[4+32*i], X1[5+32*i]);
    butterfly butt1_4(X[24+i], X[56+i], W[0], X1[6+32*i], X1[7+32*i]);
    butterfly butt1_5(X[4+i], X[36+i], W[0], X1[8+32*i], X1[9+32*i]);
    butterfly butt1_6(X[20+i], X[52+i], W[0], X1[10+32*i], X1[11+32*i]);
    butterfly butt1_7(X[12+i], X[44+i], W[0], X1[12+32*i], X1[13+32*i]);
    butterfly butt1_8(X[28+i], X[60+i], W[0], X1[14+32*i], X1[15+32*i]);
    butterfly butt1_9(X[2+i], X[34+i], W[0], X1[16+32*i], X1[17+32*i]);
    butterfly butt1_10(X[18+i], X[50+i], W[0], X1[18+32*i], X1[19+32*i]);
    butterfly butt1_11(X[10+i], X[42+i], W[0], X1[20+32*i], X1[21+32*i]);
    butterfly butt1_12(X[26+i], X[58+i], W[0], X1[22+32*i], X1[23+32*i]);
    butterfly butt1_13(X[6+i], X[38+i], W[0], X1[24+32*i], X1[25+32*i]);
    butterfly butt1_14(X[22+i], X[54+i], W[0], X1[26+32*i], X1[27+32*i]);
    butterfly butt1_15(X[14+i], X[46+i], W[0], X1[28+32*i], X1[29+32*i]);
    butterfly butt1_16(X[30+i], X[62+i], W[0], X1[30+32*i], X1[31+32*i]);
  end
  endgenerate

  //#### BUTTERFLY STAGE 2 ###########
  generate
  for(i=0; i <= 15 ; i = i+1)
  begin: butt2_
    butterfly butt2_1_(X1[4*i], X1[4*i+2], W[0], X2[4*i+0], X2[4*i+2]);
    butterfly butt2_2_(X1[4*i+1], X1[4*i+3], W[16], X2[4*i+1], X2[4*i+3]);
  end
  endgenerate

  //#### BUTTERFLY STAGE 3 ###########
  generate
  for(i=0; i <= 7 ; i = i+1)
  begin: butt3_
    butterfly butt3_1_(X2[8*i], X2[8*i+4], W[0], X3[8*i+0], X3[8*i+4]);
    butterfly butt3_2_(X2[8*i+1], X2[8*i+5], W[8], X3[8*i+1], X3[8*i+5]);
    butterfly butt3_3_(X2[8*i+2], X2[8*i+6], W[16], X3[8*i+2], X3[8*i+6]);
    butterfly butt3_4_(X2[8*i+3], X2[8*i+7], W[24], X3[8*i+3], X3[8*i+7]);
  end
  endgenerate

  //#### BUTTERFLY STAGE 4 ###########
  generate
  for(i=0; i <= 7 ;i = i+1)
  begin: butt41_
    butterfly butt41_(X3[i], X3[i+8], W[i*4], X4[i], X4[i+8]);
    butterfly butt42_(X3[i+16], X3[i+24], W[i*4], X4[i+16], X4[i+24]);
    butterfly butt43_(X3[i+32], X3[i+40], W[i*4], X4[i+32], X4[i+40]);
    butterfly butt44_(X3[i+48], X3[i+56], W[i*4], X4[i+48], X4[i+56]);
  end
  endgenerate

  //#### BUTTERFLY STAGE 5 ###########
  generate
  for(i=0; i <= 15 ;i = i+1)
  begin: butt51_
    butterfly butt51_(X4[i], X4[i+16], W[i*2], X5[i], X5[i+16]);
    butterfly butt52_(X4[i+32], X4[i+48], W[i*2], X5[i+32], X5[i+48]);
  end
  endgenerate

  //#### BUTTERFLY STAGE 6 ###########
  generate
  for(i=0; i <= 31 ;i = i+1)
  begin: butt6_
    butterfly butt6_(X5[i], X5[i+32], W[i], Y[i], Y[i+32]);
  end
  endgenerate

  //multiplication by 1/N
  //(genvar i is already declared above; a second declaration would be a redeclaration error)
  generate
  for(i=0;i<=63 ;i=i+1)
  begin: mult16bit
    multiplier_16bit mult16bit_(norm, Y[i].r, ifft_out[i].r);
    multiplier_16bit mult16bitx_(norm, Y[i].i, ifft_out[i].i);
  end
  endgenerate

  integer j;
  parameter size = 128;
  // Initializes the Twiddle Co-efficient ROM File
  parameter filein = "twiddle_ifft_64pt_fixed.txt"; // W[0].r, W[0].i, W[1].r, W[1].i...
  reg [15:0] read_mem[0:size-1]; //ROM implementation (Future Scope)

  always @(posedge clk, negedge rst_n)
  begin
    if (~rst_n) // Reset-State, reads the co-efficients from the ROM file into the internal memory
    begin
      //initialize twiddle co-efficients.
      $readmemh (filein, read_mem , 0 , size-1); //read coeffs from file
      for (j = 0; j<size/2; j = j+1)
      begin
        W[j].r = read_mem[(2*j)];
        W[j].i = read_mem[(2*j+1)];
      end
    end
    else // compute ifft
      X = ifft_in;
  end
endmodule
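The contents of the twiddle file read by $readmemh are not shown in this report. The sketch below is one way such a file could be generated, under several assumptions that should be checked against the actual design: Q15 scaling with saturation of +1.0 to 0x7FFF, the inverse-transform twiddle W[k] = cos(2*pi*k/64) + j sin(2*pi*k/64), and one 16-bit hex word per line in the order W[0].r, W[0].i, W[1].r, W[1].i, ... for 128 words in total. The helper names are illustrative.

```python
import math

def q15(x):
    """Quantize a value in [-1, 1] to signed Q15, saturating at the range limits."""
    return max(-32768, min(32767, round(x * 32768)))

def twiddle_rom_lines(n=64):
    """Hex words in the order W[0].r, W[0].i, W[1].r, W[1].i, ... (2*n lines)."""
    lines = []
    for k in range(n):
        angle = 2 * math.pi * k / n          # positive exponent for the inverse FFT
        for value in (math.cos(angle), math.sin(angle)):
            lines.append(format(q15(value) & 0xFFFF, "04x"))
    return lines

lines = twiddle_rom_lines()
rom_text = "\n".join(lines) + "\n"   # text that could be saved as the ROM file
print(len(lines), lines[0], lines[1])
```

Only W[0] to W[31] are referenced by the butterfly stages in the listing, although the module reads 64 complex entries.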

VERIFICATION STRATEGY

The verification was done using the bottom-up strategy. The fundamental building blocks for the 64-point IFFT TOP module were tested stand-alone using exhaustive linear test-benches (1-bit Half-adder/Full-adder, wallace4to2, wallace3to2, booth_multiplier_16bit, mult_prod_output, koggestoneadder_32bit). These blocks were then integrated together to develop the butterfly module (butterfly2), which was again tested stand-alone using random vectors fed from a File-I/O based testbench (file generated from MATLAB). This butterfly structure was then instantiated 192 times in 6 stages to form the wrapper for the 64-point IFFT core.

The top-level testing was again done with the File-I/O based testbench. The input stimulus is dumped in a file by the MATLAB tool, and this file is read by the IFFT_TOP testbench. The real and imaginary parts of the complex samples are picked up by the testbench and sent to the DUT (IFFT core). The output of the core is again dumped in a file, and this dump is compared to the dump from MATLAB for the same set of inputs.

File-I/O Based Verification Strategy for the IFFT Core
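The file-based flow described above needs three small helpers on the software side: writing the stimulus as 16-bit hex words, reading back a dump, and measuring the deviation between two result sets. The sketch below shows one possible shape for them. It assumes one two's-complement hex word per line in the order X[0].r, X[0].i, X[1].r, X[1].i, ..., which matches how the testbench indexes its memories; the function names are illustrative.

```python
def to_hex16(v):
    """Signed 16-bit integer to a 4-digit two's-complement hex word."""
    return format(v & 0xFFFF, "04x")

def from_hex16(text):
    """4-digit hex word back to a signed 16-bit integer."""
    v = int(text, 16)
    return v - 0x10000 if v & 0x8000 else v

def dump_samples(samples):
    """List of (real, imag) pairs to hex lines: re0, im0, re1, im1, ..."""
    lines = []
    for re, im in samples:
        lines += [to_hex16(re), to_hex16(im)]
    return lines

def load_samples(lines):
    """Inverse of dump_samples."""
    values = [from_hex16(t) for t in lines]
    return list(zip(values[0::2], values[1::2]))

def max_abs_error(a, b):
    """Largest per-component difference between two lists of (real, imag) pairs."""
    return max(max(abs(ar - br), abs(ai - bi)) for (ar, ai), (br, bi) in zip(a, b))

# Round trip of a 64-sample stimulus with positive and negative imaginary parts.
stimulus = [(2 * k + 1, 2 * k + 2) for k in range(32)] + \
           [(2 * k + 1, -(2 * k + 2)) for k in range(32)]
print(load_samples(dump_samples(stimulus)) == stimulus)
```

A tolerance on max_abs_error of one or two LSBs would be a natural starting point for automating the comparison, although the acceptable bound depends on the rounding scheme finally chosen.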


//A Testbench for IFFT 64point modules
/********************************************************************************
This testbench module tests the top-level ifftcore_64pt module. This uses a
file I/O based strategy for testing. The frequency domain complex input stimuli
fft_input.txt is generated from Matlab and is used in testbench to provide
complex data-type test inputs to ifftcore_64pt module. The resulting complex
data-type output samples from the ifftcore_64pt module is dumped to an output
file fft_output.txt which is compared with the corresponding Matlab IFFT output,
in the co-simulation environment.
*********************************************************************************/
module tb_ifft64;
  complex X[64], Y[64]; // complex inputs and outputs

  //file I/O related
  integer i;
  parameter size = 128; //64 * 2 samples: real and imaginary samples
  parameter filein = "fft_input.txt";   // X[0].r, X[0].i, X[1].r, X[1].i...........
  parameter fileout = "fft_output.txt"; // Y[0].r, Y[0].i, Y[1].r, Y[1].i...........
  reg [15:0] read_mem[0:size-1];      // Memory for Complex Input samples
  reg [15:0] write_out_mem[0:size-1]; // Memory for Complex Output samples
  reg clk;
  reg rst_n;

  //instantiation of 64-point IFFT core
  ifftcore_64pt ifft64pt(X, clk, rst_n, Y);

  initial begin
    #0 clk = 1'b0;
    #0 rst_n = 1'b0;
    #0 // assign input values
    $readmemh (filein, read_mem, 0, size-1); //read complex samples from file dumped by MATLAB
    for (i = 0; i<size/2; i = i+1)
    begin
      X[i].r = read_mem[(2*i)]; //Reading Real and Imaginary Values into the input array
      X[i].i = read_mem[(2*i+1)];
    end
    #7 rst_n = 1'b1;
    // write output samples computed by the IFFT core into a file for Verification
    #10
    for (i = 0; i<size/2; i = i+1)
    begin
      //Writing Real and Imaginary Values into the output array
      write_out_mem[(2*i)] = Y[i].r;
      write_out_mem[(2*i+1)] = Y[i].i;
    end
    $writememh (fileout, write_out_mem, 0, size-1); //write output to file from the output array
    $finish;
  end

  always #2 clk=~clk; // Driving Clock with Period 4ns
endmodule
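The testbench expects fft_input.txt to hold 128 hexadecimal 16-bit words, with the real and imaginary parts of each sample interleaved. The report generates this file from MATLAB; the sketch below shows the same packing in Python purely for illustration. All function names are hypothetical, negative samples are assumed to be stored in two's complement, and the parser skips `//` comments and `@address` entries, which the $readmemh file format allows. It does not handle x/z digits.

```python
def to_hex16(value):
    """Encode a signed integer as a 4-digit two's-complement hex word."""
    if not -32768 <= value <= 32767:
        raise ValueError("sample does not fit in 16 bits")
    return format(value & 0xFFFF, "04x")


def from_hex16(word):
    """Decode a hex word back into a signed 16-bit integer."""
    raw = int(word.replace("_", ""), 16)
    return raw - 0x10000 if raw & 0x8000 else raw


def stimulus_words(samples):
    """Interleave (real, imag) pairs as real word first, then imag word."""
    words = []
    for real_part, imag_part in samples:
        words.append(to_hex16(real_part))
        words.append(to_hex16(imag_part))
    return words


def parse_words(text):
    """Read a hex dump back into a list of (real, imag) pairs."""
    values = []
    for line in text.splitlines():
        line = line.split("//")[0]        # drop trailing comments
        for token in line.split():
            if token.startswith("@"):     # skip address entries
                continue
            values.append(from_hex16(token))
    return list(zip(values[0::2], values[1::2]))


samples = [(1, 2), (3, 4), (1, -2)]
text = "\n".join(stimulus_words(samples))
print(text.split())
print(parse_words(text))
```

Writing one word per line keeps the file easy to compare against the output dump with ordinary text tools.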

RTL SIMULATION SNAPSHOT

The final comparison is currently done manually, because the 16-bit fixed-point arithmetic in our SystemVerilog IFFT core introduces truncation and rounding errors, whereas MATLAB produces double-precision outputs. The deviation is nevertheless small: for the vector tabulated below, each real and imaginary component differs from the MATLAB value by at most 1. The following table compares the output of our SystemVerilog 64-point IFFT module with that of the MATLAB 64-point IFFT model:

INPUT    INPUT DATA          MATLAB IFFT MODEL     SYSTEM VERILOG IFFT
PINS                         OUTPUT                CORE OUTPUT
         Real   Imaginary    Real   Imaginary      Real   Imaginary
X[0]     1      2            32     0              32     0
X[1]     3      4            -22    -12            -22    -13
X[2]     5      6            -1     -10            -2     -11
X[3]     7      8            -7     0              -8     -1
X[4]     9      10           -1     -5             -2     -6
X[5]     11     12           -4     1              -5     1
X[6]     13     14           -1     -3             -2     -4
X[7]     15     16           -3     1              -3     1
X[8]     17     18           -1     -2             -2     -2
X[9]     19     20           -2     1              -2     1
X[10]    21     22           -1     -2             -2     -2
X[11]    23     24           -2     1              -2     1
X[12]    25     26           -1     -1             -2     -2
X[13]    27     28           -1     1              -2     1
X[14]    29     30           -1     -1             -2     -2
X[15]    31     32           -1     1              -2     1
X[16]    33     34           -1     -1             -2     -1
X[17]    35     36           -1     1              -1     1
X[18]    37     38           -1     -1             -2     -1
X[19]    39     40           -1     1              -1     1
X[20]    41     42           -1     -1             -2     -1
X[21]    43     44           -1     1              -1     1
X[22]    45     46           -1     -1             -2     -1
X[23]    47     48           -1     1              -1     1
X[24]    49     50           -1     0              -2     -1
X[25]    51     52           0      1              -1     1
X[26]    53     54           -1     0              -2     -1
X[27]    55     56           0      1              -1     1
X[28]    57     58           -1     0              -2     -1
X[29]    59     60           0      1              -1     1
X[30]    61     62           -1     0              -2     -1
X[31]    63     64           0      1              -1     1
X[32]    1      -2           -1     0              -1     0
X[33]    3      -4           0      1              0      1
X[34]    5      -6           -1     0              -2     0
X[35]    7      -8           0      1              0      1
X[36]    9      -10          -1     0              -2     0
X[37]    11     -12          0      1              0      1
X[38]    13     -14          -1     0              -2     0
X[39]    15     -16          0      1              0      1
X[40]    17     -18          -1     0              -2     0
X[41]    19     -20          1      1              0      1
X[42]    21     -22          -1     1              -2     1
X[43]    23     -24          1      1              1      1
X[44]    25     -26          -1     1              -2     1
X[45]    27     -28          1      1              1      1
X[46]    29     -30          -1     1              -2     1
X[47]    31     -32          1      1              1      1
X[48]    33     -34          -1     1              -1     1
X[49]    35     -36          1      1              1      1
X[50]    37     -38          -1     1              -2     1
X[51]    39     -40          1      1              1      1
X[52]    41     -42          -1     1              -2     2
X[53]    43     -44          2      1              2      1
X[54]    45     -46          -1     2              -2     2
X[55]    47     -48          2      1              2      1
X[56]    49     -50          -1     2              -1     2
X[57]    51     -52          3      1              3      1
X[58]    53     -54          -1     3              -2     3
X[59]    55     -56          4      1              4      1
X[60]    57     -58          -1     5              -1     5
X[61]    59     -60          7      0              7      -1
X[62]    61     -62          -1     10             -1     10
X[63]    63     -64          22     -12            22     -13
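The MATLAB column can be reproduced independently of MATLAB. The sketch below is a minimal reference model, assuming the stimulus follows the pattern visible in the input columns (X[k] = (2k+1) + j(2k+2) for k = 0..31, and the same real parts with negated imaginary parts for k = 32..63) and assuming the usual inverse DFT definition with 1/N scaling, which is what MATLAB's ifft uses. The function names are hypothetical, and a direct O(N^2) sum is used to keep the example free of dependencies.

```python
import cmath


def reference_ifft(x):
    """Direct inverse DFT with 1/N scaling (double precision)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(2j * cmath.pi * k * m / n) for k in range(n)) / n
        for m in range(n)
    ]


def table_stimulus():
    """Rebuild the 64 input samples from the pattern in the table."""
    first_half = [complex(2 * k + 1, 2 * k + 2) for k in range(32)]
    second_half = [complex(2 * k + 1, -(2 * k + 2)) for k in range(32)]
    return first_half + second_half


y = reference_ifft(table_stimulus())
for m in (0, 1, 63):
    # Rounded values, for comparison with the MATLAB column of the table
    print(m, round(y[m].real), round(y[m].imag))
```

For the rows printed here, the rounded values should agree with the MATLAB column of the table; the same loop over all 64 indices gives a quick check of the remaining rows.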

The line coverage for the top level is currently 19.43%. Testing is to be automated in the next phase.

Convergence between MATLAB and System Verilog Models:

MATLAB-SYSTEM VERILOG IFFT CO-SIMULATION

SYNTHESIS REPORTS
The IFFT top level was synthesized using the following script (compile_ifft.tcl):

#/* Compile Script for Synopsys */
set my_verilog_files [list ~/labs/ofdm/synthesis/full_adder_half_adder_1bit.v \
    ~/labs/ofdm/synthesis/koggestoneadder_32bit.v \
    ~/labs/ofdm/synthesis/booth_multiplier_16bit.v \
    ~/labs/ofdm/synthesis/mult_Product_output.v \
    ~/labs/ofdm/synthesis/wallace4to2.v \
    ~/labs/ofdm/synthesis/wallace3to2.v \
    ~/labs/ofdm/synthesis/butterfly_latest.v \
    ~/labs/ofdm/synthesis/ifft_64pt.v]
set my_toplevel ifftcore_64pt
set my_clock_pin clk
set my_clk_freq_MHz 10
set my_input_delay_ns 0.1
set my_output_delay_ns 0.1
set verilogout_show_unconnected_pins "true"
set hdlin_sv_ieee_assignment_patterns 2

analyze -f sverilog $my_verilog_files
elaborate $my_toplevel

set my_period [expr 1000 / $my_clk_freq_MHz]
set find_clock [ find port [list $my_clock_pin] ]
if { $find_clock != [list] } {
    set clk_name $my_clock_pin
    create_clock -period $my_period $clk_name
} else {
    set clk_name vclk
    create_clock -period $my_period -name $clk_name
}
set_input_delay $my_input_delay_ns -clock $clk_name [remove_from_collection [all_inputs] $my_clock_pin]
set_output_delay $my_output_delay_ns -clock $clk_name [all_outputs]

current_design wallace4to2
link
uniquify
compile
set_dont_touch [get_designs wallace4to2]

current_design wallace3to2
link
uniquify
compile
set_dont_touch [get_designs wallace3to2]

current_design kogge_stone_adder
link
uniquify
compile
set_dont_touch [get_designs kogge_stone_adder]

current_design butterfly
link
uniquify
compile
set_dont_touch [get_designs butterfly]

current_design multiplier_16bit
link
uniquify
compile
set_dont_touch [get_designs multiplier_16bit]

current_design $my_toplevel
set_scan_configuration -style multiplexed_flip_flop
compile -scan -exact_map -map_effort medium
check_design
report_constraint -all_violators

set_scan_configuration -replace false
create_test_protocol -infer_async -infer_clock
dft_drc
insert_dft
set_dft_signal -view spec -type ScanEnable -port test_se
set_dft_signal -view spec -type ScanDataIn -port test_si
set_drive 2 test_se
set_drive 2 test_si
report_constraint -all_violators

set filename [format "%s%s" $my_toplevel ".vh"]
write -format verilog -hierarchy -output $filename
set filename [format "%s%s" $my_toplevel ".sdc"]
write_sdc $filename

report_area > area.rpt
report_timing -significant_digits 4 -tran -delay max -max_path 10 -input -net > timing.rpt
report_power > power.rpt
quit

The following were the reports generated:

a. Area Report:

****************************************
Report : area
Design : ifftcore_64pt
****************************************
Number of ports:            4098
Number of nets:             33335
Number of cells:            15269
Number of references:       8

Combinational area:         1849277.203472
Noncombinational area:      9261.055664
Net Interconnect area:      642175.925509

Total cell area:            1858538.259136
Total area:                 2500714.184646

b. Power Report

****************************************
Report : power -analysis_effort low
Design : ifftcore_64pt
****************************************
Global Operating Voltage = 0.95
Power-specific unit information :
    Voltage Units = 1V
    Capacitance Units = 1.000000pf
    Time Units = 1ns
    Dynamic Power Units = 1mW (derived from V,C,T units)
    Leakage Power Units = 1pW

Cell Internal Power  = 620.4071 uW   (62%)
Net Switching Power  = 372.7429 uW   (38%)
                       ---------
Total Dynamic Power  = 993.1500 uW  (100%)
Cell Leakage Power   =  10.3207 mW

c. Timing Report

The most timing-critical path in the IFFT core:

****************************************
Report : timing -path full -delay max
Design : ifftcore_64pt
****************************************
Startpoint: ifft_in[63][r][11] (rising edge-triggered flip-flop clocked by clk)
Endpoint: ifft_out[0][r][15] (output port clocked by clk)
Path Group: clk
Path Type: max

clock clk (rise edge)              100.0000   100.0000
clock network delay (ideal)          0.0000   100.0000
output external delay               -0.1000    99.9000
data required time                              99.9000
data arrival time                               89.7406
--------------------------------------------------------------------
slack (MET)                                     10.1594

Timing closed at 100ns (10 MHz)
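The three reports can be cross-checked with a few lines of arithmetic. The sketch below is not part of the synthesis flow; it re-derives the totals and the slack from the figures quoted above. The maximum-frequency value is a rough estimate that assumes the reported critical path and the 0.1 ns output delay are the only limits, and the area numbers are left in the library's own units because the report does not name them.

```python
# Area report: cell area is combinational + non-combinational,
# total area adds the net interconnect estimate.
combinational = 1849277.203472
noncombinational = 9261.055664
net_interconnect = 642175.925509
cell_area = combinational + noncombinational
total_area = cell_area + net_interconnect

# Power report: dynamic power is internal + switching (in uW);
# leakage is reported separately (in mW).
internal_uw = 620.4071
switching_uw = 372.7429
leakage_mw = 10.3207
dynamic_uw = internal_uw + switching_uw
total_power_mw = dynamic_uw / 1000.0 + leakage_mw

# Timing report: slack is required time minus arrival time (in ns).
clock_period_ns = 100.0
output_delay_ns = 0.1
arrival_ns = 89.7406
required_ns = clock_period_ns - output_delay_ns
slack_ns = required_ns - arrival_ns

# Rough upper bound on clock rate if only this path mattered (assumption).
min_period_ns = arrival_ns + output_delay_ns
fmax_mhz = 1000.0 / min_period_ns

print(f"cell area     : {cell_area:.6f}")
print(f"total area    : {total_area:.6f}")
print(f"dynamic power : {dynamic_uw:.4f} uW")
print(f"total power   : {total_power_mw:.4f} mW")
print(f"slack         : {slack_ns:.4f} ns")
print(f"approx. fmax  : {fmax_mhz:.2f} MHz")
```

With these figures the leakage term is roughly ten times the dynamic term, so the total power is dominated by leakage at this 10 MHz operating point.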

CONCLUSION

Our IFFT core accepts 64 complex samples of 16-bit frequency-domain data from the system bus in parallel and generates the corresponding 64 complex samples of 16-bit time-domain data in parallel. The sections above have described the architecture of the core and its implementation in terms of its fundamental blocks (half adders, full adders, Wallace4to2, Wallace3to2, Kogge-Stone adder, Booth-encoded multiplier and butterfly structure). The verification is based on a file-I/O testbench scheme. We have explored the flat 6-stage architecture for the DIT-IFFT algorithm and observe that it synthesizes to a large area but is power-efficient compared to recent implementations for 802.11 WLAN, and that its timing meets the 802.11 WLAN specifications by a large margin. We therefore see scope to reduce the area of the IFFT core by trading off some of the power and speed gain, using other architectures.

NEXT STEPS FOR THE PROJECT:

Folding transformation of the 6 butterfly stages into 1 stage, which will enable us to reuse the hardware. This can reduce the area of the IFFT core considerably compared to the present implementation, but it will also affect the timing of the critical path, as there will be extra muxing and inter-stage glue logic on that path.

Pipelining the butterfly stages, so that we can further increase the frequency of operation of the IFFT core.

Automating the testing environment for rigorous testing of the IFFT core using random stimuli dumped by MATLAB. The co-simulation can also be automated: we plan to mask the 3 LSBs of both the rounded MATLAB output and the SystemVerilog core output before comparing them. Thus, we can get a PASS/FAIL assertion from the testbench without any human intervention.
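The planned PASS/FAIL check can be prototyped outside the simulator. The sketch below is one possible reading of the masking idea, with hypothetical function names and samples represented as (real, imag) integer pairs. It also includes a plain tolerance comparison as an alternative, because masking alone can still report a mismatch when two nearly equal values fall on either side of a multiple of 8, for example 0 and -1.

```python
def mask_lsbs(value, bits=3):
    """Clear the lowest `bits` bits of a 16-bit two's-complement sample."""
    word = value & 0xFFFF
    word &= ~((1 << bits) - 1) & 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def masked_mismatches(reference, dut, bits=3):
    """Indices where the masked reference and DUT samples differ."""
    return [
        i
        for i, ((ref_re, ref_im), (dut_re, dut_im)) in enumerate(zip(reference, dut))
        if mask_lsbs(ref_re, bits) != mask_lsbs(dut_re, bits)
        or mask_lsbs(ref_im, bits) != mask_lsbs(dut_im, bits)
    ]


def tolerance_mismatches(reference, dut, tol=1):
    """Alternative check (an assumption, not from the report):
    indices where either component differs by more than `tol`."""
    return [
        i
        for i, ((ref_re, ref_im), (dut_re, dut_im)) in enumerate(zip(reference, dut))
        if abs(ref_re - dut_re) > tol or abs(ref_im - dut_im) > tol
    ]


# Rows X[0], X[1] and X[25] of the comparison table
matlab_rows = [(32, 0), (-22, -12), (0, 1)]
core_rows = [(32, 0), (-22, -13), (-1, 1)]

print("masked   :", masked_mismatches(matlab_rows, core_rows))
print("tolerance:", tolerance_mismatches(matlab_rows, core_rows))
print("PASS" if not tolerance_mismatches(matlab_rows, core_rows) else "FAIL")
```

On these three rows the masked comparison flags the X[25] row (0 against -1), while the tolerance comparison with tol=1 accepts all of them, so the choice between the two schemes affects which deviations count as failures.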

POTENTIAL DESIGN ISSUES:

Our 64-point IFFT core uses 16-bit fixed-point arithmetic. Thus, after multiplication and accumulation at every butterfly stage, the 32-bit results are rounded to 16 bits before being passed to the next stage. This operation introduces rounding and truncation errors. These errors can accumulate through the 6 butterfly stages in the data path, and after the final normalization of the IFFT samples the rounded sample values can deviate from the expected output in the MATLAB dump. We have implemented a rounding scheme that compares the discarded low-order bits of the 32-bit result (p_temp[14:0] in multiplier_16bit) with 0.5 (0x4000) and, for non-negative results, rounds the sample up to the next integer. We need to explore better rounding/truncation schemes in order to ensure we don't lose precision in the final IFFT outputs.

The testing environment, at the moment, just reads a single stimulus file and there is no mechanism for automated regressions. Thus, our coverage is low and we plan to automate the verification environment to boost the line coverage. Also, the twiddle memory testing needs to be done exhaustively.
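The effect of the rounding expression can be explored with a bit-level model. The sketch below mirrors the `assign p = ...` statement of multiplier_16bit as it appears in the appendix; the Python function names are hypothetical, and the model covers only this final rounding step, not the multiplier or the butterfly around it.

```python
def round_32_to_16(p_temp):
    """Model of the rounding expression in multiplier_16bit.

    Keeps p_temp[30:15] and adds 1 when p_temp[14:0] >= 0x4000
    and p_temp[30] is 0; otherwise the low bits are simply dropped.
    """
    low_bits = p_temp & 0x7FFF                  # p_temp[14:0]
    kept = (p_temp >> 15) & 0xFFFF              # p_temp[30:15]
    bit30_clear = ((p_temp >> 30) & 1) == 0     # p_temp[30] == 1'b0
    if low_bits >= 0x4000 and bit30_clear:
        kept = (kept + 1) & 0xFFFF              # 16-bit wrap, as in the RTL
    return kept


def to_signed16(word):
    """Interpret a 16-bit word as a two's-complement integer."""
    return word - 0x10000 if word & 0x8000 else word


# Values expressed in units of 2**15, i.e. 1.5 is 0xC000
for scaled in (0xC000, 0xBFFF, -0xC000, -0x8001):
    result = to_signed16(round_32_to_16(scaled))
    print(f"{scaled / 2**15:+.5f} -> {result}")
```

In this model a value of +1.5 becomes 2, whereas -1.5 becomes -2 rather than -1, because results with bit 30 set are never incremented. Negative results are therefore always rounded toward minus infinity, which is one candidate explanation for core outputs that sit one count below the MATLAB values.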

APPENDIX
/**********************************************************************
Module wallace3to2: This module computes the sum of three 32 bit vectors
and generates two 32 bit vectors sum and carry.
inputs  : sum_ip1[31:0], sum_ip2[31:0], carry_ip1[31:0]
outputs : sum_out[31:0], carry_out[31:0]
***********************************************************************/
module wallace3to2(input [31:0] sum_ip1, sum_ip2, carry_ip1,
                   output [31:0] sum_out, carry_out);
  genvar i;
  wire [32:0] carry_out_wr;

  //Computes the 3:2 compression of inputs with 32 full adder modules.
  generate
    for(i=0; i <= 31; i = i+1)
    begin: fulladdsc32
      fulladd fasc32_(sum_out[i], carry_out_wr[i+1], sum_ip1[i], carry_ip1[i], sum_ip2[i]);
    end
  endgenerate

  assign carry_out_wr[0] = 0;
  assign carry_out[31:0] = carry_out_wr[31:0];
endmodule

/**********************************************************************
Module wallace4to2 : This module computes the sum of four 32 bit vectors
and generates two 32 bit vectors sum and carry.
inputs  : sum_ip1[31:0], sum_ip2[31:0], carry_ip1[31:0], carry_ip2[31:0]
outputs : sum_out[31:0], carry_out[31:0]
**********************************************************************/
module wallace4to2(input [31:0] sum_ip1, sum_ip2, carry_ip1, carry_ip2,
                   output [31:0] sum_out, carry_out);
  wire [32:0] carry_out_wr;
  wire [32:0] carry1;
  wire [31:0] sum1;
  genvar i; // declared once and shared by both generate loops

  //iteration 1
  generate
    for(i=0; i <= 31; i = i+1)
    begin: fulladdsc32
      fulladd fasc32_(sum1[i], carry1[i+1], sum_ip1[i], carry_ip1[i], sum_ip2[i]);
    end
  endgenerate

  //iteration 2
  generate
    for(i=0; i <= 31; i = i+1)
    begin: fulladdsc32_iter_2
      fulladd fasc32_iter_2_(sum_out[i], carry_out_wr[i+1], sum1[i], carry1[i], carry_ip2[i]);
    end
  endgenerate

  assign carry1[0] = 0;
  assign carry_out_wr[0] = 0;
  assign carry_out[31:0] = carry_out_wr[31:0];
endmodule

/**********************************************************************
1 bit full-adder Module fulladd :
inputs  : a, b, c_in
outputs : sum, c_out
1 bit half-adder Module halfadd :
inputs  : a, b
outputs : sum, c_out
**********************************************************************/
// define 1-bit full adder
module fulladd(sum, c_out, a, b, c_in);
  output sum, c_out; // I/O declaration, wire type by default
  input a, b, c_in;
  wire s1, c1, c2;   // internal ports
  xor(s1, a, b);
  and(c1, a, b);
  xor(sum, s1, c_in);
  and(c2, s1, c_in);
  or(c_out, c2, c1);
endmodule

// define 1-bit Half adder
module halfadd(sum, c_out, a, b);
  output sum, c_out; // I/O declaration, wire type by default
  input a, b;
  wire s1, c1, c2;   // internal ports
  xor(sum, a, b);
  and(c_out, a, b);
endmodule

/**********************************************************************
Module kogge_stone_adder: This module computes the 32bit fast addition
using high-speed Kogge-Stone Prefix adder architecture. An initial stage
computes the propagate and generate outputs from inputs ai[31:0] and
bi[31:0], and 6 stages follow. In stages 1 to 5 the group-propagate and
group-generate outputs are computed. In the last XOR stage (stage 6) the
final output is computed. The implementation is optimized for CMOS
implementation by using bubbled logic in alternate stages in-order to
reduce the logic on the critical path.
Signed inputs : ai[31:0], bi[31:0]
Signed output : si[32:0]
**********************************************************************/
//32 bit Kogge-Stone Prefix adder
module xor_2(y,a,b); //xor gate
  input a,b;
  output y;
  xor(y,a,b);
endmodule

module xnor_2(y,a,b); //xnor gate
  input a,b;
  output y;
  xnor(y,a,b);
endmodule

//bubbled output PG Generate cell
module pg_generate_bubbled(_pi,_gi,ai,bi);
  output _pi,_gi;
  input ai,bi;
  xnor(_pi,ai,bi);
  nand(_gi,ai,bi);
endmodule

//Group PG generator
// Propagate and Generate for a group of signals (i:j)
module group_pg_generator(pij, gij, _pik, _pk_1j, _gik, _gk_1j);
  output pij,gij;
  input _pik,_pk_1j,_gik,_gk_1j;
  wire t1,t2;
  nor(t1,_pik,_gk_1j);
  not(t2,t1);
  nand(gij,t2,_gik);    //Gi:j = Gi:k + Pi:k . Gk-1:j
  nor(pij,_pik,_pk_1j); //Pi:j = Pi:k and Pk-1:j
endmodule

//Group G generator
// Generate for a group of signals (i:j)
module group_g_generator(gij,_pik,_gik,_gk_1j);
  output gij;
  input _pik,_gik,_gk_1j;
  wire t1,t2;
  nor(t1,_pik,_gk_1j);
  not(t2,t1);
  nand(gij,t2,_gik); //Gi:j = Gi:k + Pi:k . Gk-1:j
endmodule

//Group PG generator bubbled
// Inverted Propagate and Generate for a group of signals (i:j)
module group_pg_generator_bubbled(_pij,_gij,pik,pk_1j,gik,gk_1j);
  output _pij,_gij;
  input pik,pk_1j,gik,gk_1j;
  wire t1,t2;
  nand(t1,pik,gk_1j);
  not(t2,t1);
  nor(_gij,t2,gik);     //_Gi:j = (Gi:k + Pi:k and Gk-1:j)'
  nand(_pij,pik,pk_1j); //_Pi:j = (Pi:k and Pk-1:j)'
endmodule

//Group G generator bubbled
// Inverted Generate for a group of signals (i:j)
module group_g_generator_bubbled(_gij,pik,gik,gk_1j);
  output _gij;
  input pik,gik,gk_1j;
  wire t1,t2;
  nand(t1,pik,gk_1j);
  not(t2,t1);
  nor(_gij,t2,gik); //_Gi:j = (Gi:k + Pi:k and Gk-1:j)'
endmodule

/// KOGGE STONE ADDER Tree
module kogge_stone_adder(si,ai,bi);
  //Generates the 32-bit signed sum from 32-bit signed operands
  input [31:0] ai,bi;
  output [32:0] si;
  wire [31:0] _pi,_gi,p1_ij,g1_ij,
              _p2_ij,_g2_ij,
              p3_ij,g3_ij,_p4_ij,_g4_ij,p5_ij,g5_ij;
  genvar i; // declared once and shared by all generate loops

  // bubbled PG generate stage
  // Inverted Propagate and Generate for a group of signals (i:j)
  generate
    for(i=0;i<32;i=i+1)
    begin: pg_generate
      pg_generate_bubbled pg_gen_bubbled(_pi[i],_gi[i],ai[i],bi[i]);
    end
  endgenerate

  //Group PG generator stage 1
  // Propagate and Generate for a group of signals (i:j)
  assign g1_ij[0]= ~_gi[0];
  assign p1_ij[0]= 1'b0;
  group_g_generator group1_g_gen1(g1_ij[1],_pi[1],_gi[1],_gi[0]);
  assign p1_ij[1]=1'b0;
  generate
    for(i=2;i<32;i=i+1)
    begin: group1_pg_generate
      group_pg_generator group1_pg_gen(p1_ij[i],g1_ij[i],_pi[i],_pi[i-1],_gi[i],_gi[i-1]);
    end
  endgenerate

  //Group PG generator stage 2
  // Inverted Propagate and Generate for a group of signals (i:j)
  assign _g2_ij[0]= ~g1_ij[0];
  assign _p2_ij[0]= 1'b0;
  assign _g2_ij[1]= ~g1_ij[1];
  assign _p2_ij[1]= 1'b0;
  group_g_generator_bubbled group2_g_gen_bubbled2(_g2_ij[2],p1_ij[2],g1_ij[2],g1_ij[0]);
  assign _p2_ij[2]= 1'b0;
  group_g_generator_bubbled group2_g_gen_bubbled3(_g2_ij[3],p1_ij[3],g1_ij[3],g1_ij[1]);
  assign _p2_ij[3]= 1'b0;
  generate
    for(i=4;i<32;i=i+1)
    begin: group2_pg_generate_bubbled
      group_pg_generator_bubbled group2_pg_gen_bubbled(_p2_ij[i],_g2_ij[i],p1_ij[i],p1_ij[i-2],g1_ij[i],g1_ij[i-2]);
    end
  endgenerate

  //Group PG generator stage 3
  // Propagate and Generate for a group of signals (i:j)
  assign g3_ij[0]= g1_ij[0];
  assign p3_ij[0]= 1'b0;
  assign g3_ij[1]= g1_ij[1];
  assign p3_ij[1]= 1'b0;
  assign g3_ij[2]= ~_g2_ij[2];
  assign p3_ij[2]= 1'b0;
  assign g3_ij[3]= ~_g2_ij[3];
  assign p3_ij[3]= 1'b0;
  generate
    for(i=4;i<8;i=i+1)
    begin: group3_g_generate
      group_g_generator group3_g_gen(g3_ij[i],_p2_ij[i],_g2_ij[i],_g2_ij[i-4]);
    end
  endgenerate
  assign p3_ij[4]=1'b0;
  assign p3_ij[5]=1'b0;
  assign p3_ij[6]=1'b0;
  assign p3_ij[7]=1'b0;
  generate
    for(i=8;i<32;i=i+1)
    begin: group3_pg_generate
      group_pg_generator group3_pg_gen(p3_ij[i],g3_ij[i],_p2_ij[i],_p2_ij[i-4],_g2_ij[i],_g2_ij[i-4]);
    end
  endgenerate

  //Group PG generator stage 4
  // Inverted Propagate and Generate for a group of signals (i:j)
  assign _g4_ij[0]= _g2_ij[0];
  assign _p4_ij[0]= 1'b0;
  assign _g4_ij[1]= _g2_ij[1];
  assign _p4_ij[1]= 1'b0;
  assign _g4_ij[2]= _g2_ij[2];
  assign _p4_ij[2]= 1'b0;
  assign _g4_ij[3]= _g2_ij[3];
  assign _p4_ij[3]= 1'b0;
  assign _g4_ij[4]= ~g3_ij[4];
  assign _p4_ij[4]= 1'b0;
  assign _g4_ij[5]= ~g3_ij[5];
  assign _p4_ij[5]= 1'b0;
  assign _g4_ij[6]= ~g3_ij[6];
  assign _p4_ij[6]= 1'b0;
  assign _g4_ij[7]= ~g3_ij[7];
  assign _p4_ij[7]= 1'b0;
  generate
    for(i=8;i<16;i=i+1)
    begin: group4_g_generate_bubbled
      group_g_generator_bubbled group4_g_gen_bubbled(_g4_ij[i],p3_ij[i],g3_ij[i],g3_ij[i-8]);
    end
  endgenerate
  assign _p4_ij[8]= 1'b0;
  assign _p4_ij[9]= 1'b0;
  assign _p4_ij[10]= 1'b0;
  assign _p4_ij[11]= 1'b0;
  assign _p4_ij[12]= 1'b0;
  assign _p4_ij[13]= 1'b0;
  assign _p4_ij[14]= 1'b0;
  assign _p4_ij[15]= 1'b0;
  generate
    for(i=16;i<32;i=i+1)
    begin: group4_pg_generate_bubbled
      group_pg_generator_bubbled group4_pg_gen_bubbled(_p4_ij[i],_g4_ij[i],p3_ij[i],p3_ij[i-8],g3_ij[i],g3_ij[i-8]);
    end
  endgenerate

  //stage 5
  // Propagate and Generate for a group of signals (i:j)
  generate
    for(i=16;i<32;i=i+1)
    begin: group5_g_generate
      group_g_generator group5_g_gen(g5_ij[i],_p4_ij[i],_g4_ij[i],_g4_ij[i-16]);
    end
  endgenerate

  //stage 6
  // Final SUM GENERATION (SUM.i = P.i xor C.i-1)
  assign si[0] = ~_pi[0];
  generate
    for(i=1;i<17; i=i+1)
    begin: xor_2input
      xor_2 xor_2ip(si[i],_g4_ij[i-1],_pi[i]);
    end
  endgenerate
  generate
    for(i=17;i<32; i=i+1)
    begin: xnor_2input2
      xnor_2 xnor_2ip(si[i],g5_ij[i-1],_pi[i]);
    end
  endgenerate
  assign si[32] = g5_ij[31];
endmodule

/**********************************************************************
Module multiplier_16bit: This module computes the 16bit product and
outputs the 32bit to 16bit rounded result.
Signed inputs : inputs x[15:0], y[15:0]
Signed output : outputs p[15:0]
**********************************************************************/
module multiplier_16bit(x, y, p);
  input [15:0] x, y;
  output [15:0] p;
  reg [16:0] a, b, c, d, e, f, g, h;
  reg [31:0] u;
  wire [31:0] carry, sum;
  wire [32:0] p_temp;
  wire[14:0] s11;
  wire[15:1] c11;
  wire[14:0] s12;
  wire[15:1] c12;
  wire[11:0] s13;
  wire[12:1] c13;
  wire[14:0] s21;
  wire[15:1] c21;
  wire[15:0] s22;
  wire[16:1] c22;
  wire[18:0] s31;
  wire[19:1] c31;
  wire[31:0] s41;
  wire[31:0] c41;

  /////Code here is same as in module boot16f from line 6 to last but 3lines/////

  assign carry = {c41[31:1],1'b0};
  assign sum = s41;
  //vector merging.
  kogge_stone_adder KSSSSS(p_temp,carry,sum); //32 bit prefix adder

  //rounding scheme for 32bit to 16bit conversion(future scope)
  assign p = (p_temp[14:0] >= (15'h4000)? ((p_temp[30]== 1'b0)? ( p_temp[30:15]+1): (p_temp[30:15])) : p_temp[30:15]);
endmodule
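The prefix structure of kogge_stone_adder can be checked against ordinary integer addition with a small behavioural model. The sketch below is not a translation of the gate-level netlist: it ignores the bubbled (inverted) signalling and works on whole words at once, using the recurrences quoted in the RTL comments (G[i:j] = G[i:k] + P[i:k].G[k-1:j], P[i:j] = P[i:k].P[k-1:j]). The function name is hypothetical, and the operands are treated as unsigned 32-bit words with the carry-out placed in bit 32, as si[32] is in the RTL.

```python
import random


def kogge_stone_add(a, b, width=32):
    """Behavioural parallel-prefix adder returning a (width+1)-bit result."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    bit_propagate = a ^ b        # P[i] for every bit position
    generate = a & b             # G[i] for every bit position
    propagate = bit_propagate
    distance = 1
    while distance < width:      # 5 prefix levels when width is 32
        # Combine each group with the group `distance` bits below it
        generate = (generate | (propagate & (generate << distance))) & mask
        propagate = (propagate & (propagate << distance)) & mask
        distance *= 2
    carries = (generate << 1) & mask              # carry into bit i is G[i-1:0]
    sum_bits = bit_propagate ^ carries            # SUM.i = P.i xor C.i-1
    carry_out = (generate >> (width - 1)) & 1     # G[31:0]
    return sum_bits | (carry_out << width)


print(hex(kogge_stone_add(0x0000FFFF, 0x00000001)))
print(hex(kogge_stone_add(0xFFFFFFFF, 0x00000001)))

random.seed(0)
for _ in range(1000):
    x = random.getrandbits(32)
    y = random.getrandbits(32)
    assert kogge_stone_add(x, y) == x + y
print("1000 random additions match")
```

The loop runs five times for 32-bit operands, with the combining distance doubling from 1 to 16, which corresponds to the offsets i-1, i-2, i-4, i-8 and i-16 used in stages 1 to 5 of the RTL.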
