
FFT Parallelization for OFDM Systems

9th Semester Project, AAU Applied Signal Processing and Implementation (ASPI)

Group 943
Jeremy LERESTEUX Jean-Michel LORY Olivier LE JACQUES

AALBORG UNIVERSITY
INSTITUTE FOR ELECTRONIC SYSTEMS
Fredrik Bajers Vej 7
DK-9220 Aalborg East
Phone 96 35 80 80
http://www.esn.aau.dk

TITLE: FFT Parallelization for OFDM Systems
THEME: Parallel Architecture Processing, FFT implementation
PROJECT PERIOD: 9th Semester, September 2008 to January 2009
PROJECT GROUP: ASPI 08gr943
PARTICIPANTS: Jeremy Leresteux (jlereste@kom.aau.dk), Jean-Michel Lory (jlory@es.aau.dk), Olivier le Jacques (olivedk@es.dk)
SUPERVISORS: Yannick Le Moullec (AAU), Ole Mikkelsen (Rohde & Schwarz), Jes Toft Kritensen (Rohde & Schwarz)
PUBLICATIONS: 8
NUMBER OF PAGES: 46
APPENDICES: 1 CD-ROM
FINISHED: 5th of January 2009

Abstract
This 9th semester project for the Applied Signal Processing and Implementation specialization at Aalborg University is a study of the parallelization of FFT algorithms for OFDM receivers on the Cell BE. The project focuses on mobile applications, which require efficient bandwidth utilization, as in LTE. This can be achieved by means of OFDM technology. A significant part of the computation in OFDM lies in the IFFT/FFT operations. This can be exploited by parallelizing specialized FFT algorithms to obtain a lower operation count and thereby reduce the computation time. This project investigates the possibilities and differences, with regard to execution time, when computing FFT algorithms on multiple processors of the Cell BE. First of all, LTE and OFDM are defined and explained. Then, two Fast Fourier Transform algorithms, a Radix-2 DIT FFT and a Sørensen FFT (SFFT), are examined and mapped onto the Cell BE processor architecture. Afterwards, tests are performed and the results are discussed for both algorithms. It appears that the SFFT algorithm outperforms the Radix-2 DIT algorithm in terms of execution time and performance. In the conclusion, an assessment is given and future perspectives are discussed.

Preface

This report is the documentation for a 9th semester project in Applied Signal Processing and Implementation (ASPI) entitled FFT Parallelization for OFDM Systems at Aalborg University (AAU). It is prepared by group 08gr943 and spans from September 2nd, 2008 to January 5th, 2009. The project is supervised by Yannick Le Moullec, Assistant Professor at AAU, and by Jes Toft Kritensen and Ole Mikkelsen from the company Rohde & Schwarz Technology Center A/S in Aalborg. The report is divided into four parts, corresponding to the introduction of the project, the analysis, the implementation and the conclusion.

Jeremy Leresteux

Jean-Michel Lory

Olivier Le Jacques

Aalborg, January 5th 2009

Contents

Preface

1 Introduction
1.1 Context
1.1.1 Long Term Evolution (LTE)
1.1.2 Orthogonal Frequency-Division Multiplexing (OFDM)
1.1.3 Conclusion on the context
1.2 Project subject
1.2.1 Fast Fourier Transformation (FFT)
1.2.2 Cell Broadband Engine
1.2.3 Parallelization
1.3 Problem Definition
1.4 Project Delimitations

2 Analysis
2.1 Overview
2.2 Design Methodology
2.2.1 Design Model
2.3 Cell BE
2.3.1 Architecture
2.3.2 Programming of the CBE
2.4 FFT algorithms
2.4.1 Overview
2.4.2 Discrete Fourier Transform
2.4.3 Cooley-Tukey
2.4.4 Sørensen
2.5 Conclusion of the Analysis section

3 Implementation
3.1 Overview
3.2 Cooley-Tukey Implementation
3.2.1 Overall Approach
3.2.2 Results
3.2.3 Optimizations
3.3 Sørensen Implementation
3.3.1 Overall Approach
3.3.2 Results
3.3.3 Comparison with the CT algorithm

4 Conclusion & Perspectives
4.1 Conclusions
4.2 Perspectives
4.2.1 Short term perspectives
4.2.2 Long term perspectives

Bibliography
List of Figures

Chapter 1

Introduction
1.1 Context
In 1981, Nordic Mobile Telephony (NMT) led to the commercialization of the first mobile phone (referred to as the 1st Generation¹). By the 29th of November 2007, 3.3 billion mobile phones had been identified worldwide [1]. Most of these phones are GSM phones (2G), but the 3rd Generation phones, which provide features like web browsing or video conferencing, approached half a billion devices at the end of September 2007. 3G phones have good capabilities, but a new generation (4G), with even better capabilities including higher bandwidth and more flexibility, is approaching. See Figure 1.1 for a summary of the history of the mobile phone generations.

1.1.1 Long Term Evolution (LTE)

LTE (Long Term Evolution) [2] is the next major step in mobile radio communication and one of the best candidates for the 4th Generation of mobile wireless data transfer. Its development was started in 2004 by 3GPP [3] and several European mobile manufacturers and operators [4]. 802.16m WiMAX (Worldwide Interoperability for Microwave Access) is another candidate [5] for the 4G appellation; it is developed by the IEEE and headed by Intel [6]. The last candidate is Ultra Mobile Broadband (UMB), developed by 3GPP2 [7] and headed by Qualcomm (it was decided on November 13, 2008 to stop UMB development to the benefit of LTE [8]). This project only considers LTE; WiMAX and UMB are therefore disregarded. LTE's major aim is to improve on 3G UMTS (Universal Mobile Telecommunication System). It has ambitious requirements regarding spectrum efficiency, lower costs, higher capacity, improved services such as video conferencing and VoIP (Voice over Internet Protocol) communication, lower latency, and better integration with other standards. The 3GPP Release 8 [9] states what the LTE requirements shall be (only the most significant ones are listed here):

Peak data rate
- Instantaneous downlink peak data rate of 100 Mb/s within a 20 MHz downlink spectrum allocation (5 bps/Hz)
¹ Generation: term used to define the technology used in mobile communication. 1G is NMT, 2G is GSM and 3G is UMTS/HSPA.


Figure 1.1: Standardization evolution track, where GSM is Global System for Mobile communications, GPRS is General Packet Radio Service, UMTS is Universal Mobile Telecommunications System, WCDMA is Wideband Code Division Multiple Access, EDGE is Enhanced Data Rates for GSM Evolution, HSPA is High Speed Packet Access, HSDPA is High-Speed Downlink Packet Access, HSUPA is High-Speed Uplink Packet Access and LTE is Long-Term Evolution. Modified from [10].

- Instantaneous uplink peak data rate of 50 Mb/s within a 20 MHz uplink spectrum allocation (2.5 bps/Hz)

Latency
- Transition time of less than 100 ms from a camped state (Idle Mode) to an active state
- Less than 5 ms in unloaded conditions (i.e. a single user with a single data stream) for a small IP packet

User capacity and throughput
- At least 200 users per cell should be supported in the active state for spectrum allocations up to 5 MHz
- Downlink: average user throughput per MHz of 3 to 4 times HSDPA (High-Speed Downlink Packet Access, the 3.5G downlink protocol)
- Uplink: average user throughput per MHz of 2 to 3 times HSUPA (High-Speed Uplink Packet Access, the 3.5G uplink protocol)

Spectrum efficiency
- Downlink: in a loaded network, target spectrum efficiency (bits/sec/Hz/site) of 3 to 4 times HSDPA


- Uplink: in a loaded network, target spectrum efficiency (bits/sec/Hz/site) of 2 to 3 times HSUPA

Coverage
- The throughput, spectrum efficiency and mobility targets above should be met for 5 km cells, and with a slight degradation for 30 km cells. Cell ranges up to 100 km should not be precluded.

Complexity
- Minimize the number of options
- No redundant mandatory features

These characteristics are achieved thanks to the E-UTRA air interface. E-UTRA stands for Evolved Universal Terrestrial Radio Access. It is the successor of UTRAN/GERAN (UMTS Terrestrial Radio Access Network / GSM EDGE Radio Access Network), the 2G/3G air interface. Also designed by 3GPP, its requirements are as follows:

Mobility
- E-UTRAN should be optimized for low mobile speeds from 0 to 15 km/h
- Higher mobile speeds between 15 and 120 km/h should be supported with high performance
- Mobility across the cellular network shall be maintained at speeds from 120 km/h to 350 km/h (or even up to 500 km/h depending on the frequency band)

Spectrum flexibility
- E-UTRA shall operate in spectrum allocations of different sizes, including 1.25 MHz, 1.6 MHz, 2.5 MHz, 5 MHz, 10 MHz, 15 MHz and 20 MHz, in both the uplink and downlink. Operation in paired and unpaired spectrum shall be supported.

Co-existence and inter-working with 3GPP Radio Access Technology (RAT)
- Co-existence in the same geographical area and co-location with GERAN/UTRAN on adjacent channels
- E-UTRAN terminals that also support UTRAN and/or GERAN operation should be able to support measurement of, and handover from and to, both 3GPP UTRAN and 3GPP GERAN
- The interruption time during a handover of real-time services between E-UTRAN and UTRAN (or GERAN) should be less than 300 ms

E-UTRA is the air interface which enables the communication between a BTS (Base Transceiver Station) and a UE (User Equipment). The signal modulation used at the BTS and the demodulation used at the UE are slightly different but rely on the same technology, namely Frequency-Division Multiplexing, which gives them many similarities. SC-FDM (Single Carrier Frequency-Division Multiplexing) is used for the transmitter part and OFDM (Orthogonal Frequency-Division Multiplexing) for the receiver part. This has been decided by the 3GPP members and summarized in Release 8. This project is related to the OFDM aspect at the receiver side. Section 1.1.2 gives an overview of OFDM fundamentals.

1.1.2 Orthogonal Frequency-Division Multiplexing (OFDM)

This section is based on the information provided by two papers, [11] and [12]. OFDM is a modulation technique used in most new wireless technologies such as IEEE 802.11 a/b/g, 802.16, HiperLAN-2, DVB (digital TV) and DAB [13]. The 3GPP members selected it as the LTE/E-UTRA downlink protocol, i.e. the system which receives data and communication packets from a transmitter. As indicated at the end of section 1.1.1, the selected uplink protocol, SC-FDM, presents similarities to OFDM, which is why this section only introduces OFDM on the transmitter and receiver sides.

1.1.2.1 Overview

With standard single-carrier transmitters, the signal is spread over multiple transmission paths. Because of the environment (buildings, cars, distance), the signal becomes weaker and distorted. This phenomenon, called fading, appears when signals are reflected off buildings, for example. The reflected signals arrive at the receiver later than the main signal, which results in distortions, as illustrated in Figure 1.2.


Figure 1.2: Multipath propagation. A transmitted signal is spread over different paths and, depending on these paths, the obstacles met and the distance covered, the distortion is more or less pronounced. Modified from [12].

These distortions are a major problem when establishing secure high-speed data transfers such as those used by 3G UMTS cell phones. OFDM addresses this distortion problem. It does not avoid reflections, but its characteristics make a transmission safer, in the sense that data packets are always present, by allowing multiple signals to be sent over a single radio channel. OFDM is a multi-carrier transmitter/receiver scheme, i.e. it can send/receive signals to/from several users. The next subsections describe the main principles of OFDM on the transmitter and receiver sides.

1.1.2.2 OFDM Principles

OFDM distributes the data over a large number of carriers at different frequencies. This spacing provides the orthogonality which prevents the receiver from seeing wrong frequencies. In contrast to other multi-carrier techniques, like CDMA, OFDM prevents Inter-Symbol Interference (ISI) by adding a cyclic prefix, which is explained in the section on Inter-Symbol Interference. One of the key features of OFDM is the IFFT/FFT pair. These two mathematical tools are used here to transform several signals on different carriers from the frequency domain to the time domain in the IFFT (or FFT⁻¹) and from the time domain to the frequency domain in the FFT. See in Figure 1.3 the principle with the main parts of an OFDM system.

Figure 1.3: Main principle of an OFDM transmitter/receiver.

The Transmitter. Figure 1.4 shows a representation of the transmitter. OFDM divides the spectrum into N sub-carriers, each on a different frequency and each carrying a part of the signal, by means of the IFFT (also noted FFT⁻¹). In contrast to FDM, where there is no coordination or synchronization between the sub-carriers, OFDM links them through the principle of orthogonality. This results in an overlapping of the sub-carriers, see Figure 1.5, where all the sub-carriers can be transmitted simultaneously at tightly spaced frequencies but without interfering with each other.

Figure 1.4: Representation of the OFDM transmitter [14]. The digital signal s[n] represents the data to transfer. It is modulated with QPSK, 16-QAM or 64-QAM to create symbols. The spectrum then goes through an IFFT to transform it into the time domain. The real and imaginary components are converted to the analog domain to modulate a cosine and a sine at the carrier frequency fc. They are then summed into s(t), which is transferred to the receiver via the antenna.

Signals are orthogonal if they are mutually independent of each other. Orthogonality is based on the fact that any sub-carrier, a sine or cosine wave, integrates to zero over one period. Let us assume two sine sub-carriers of frequencies m and n, both integers, and multiply them together:

f(x) = sin(mωt) · sin(nωt)    (1.1)

Its expansion yields a sum of two sinusoids of frequencies (m − n) and (m + n):

f(x) = (1/2)·cos((m − n)ωt) − (1/2)·cos((m + n)ωt)    (1.2)

As these two components are sinusoids, their integral is equal to zero over one period:

∫₀^{2π} f(x) dt = ∫₀^{2π} (1/2)·cos((m − n)ωt) dt − ∫₀^{2π} (1/2)·cos((m + n)ωt) dt = 0    (1.3)

It follows that when two sinusoids of different frequencies, m and n, are multiplied, the area under the product is zero. For all integers n and m, sin(mx), cos(mx), sin(nx) and cos(nx) are all orthogonal to each other. These frequencies are called harmonics. Overlapping gives a better spectrum usage than an FDM modulator, which simply places each carrier next to the others, resulting in interference between them.
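As a quick numerical sanity check of equation (1.3), the short C sketch below integrates the product of two harmonics over one period with a simple rectangle rule; the values of m, n and the step count are arbitrary illustrative choices, not taken from the report.

/* Numerical check of equation (1.3): the integral over one period of the
 * product of two harmonics of different integer frequencies is (close to) zero.
 * m, n and the number of steps are arbitrary illustrative values. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const int m = 3, n = 5, steps = 100000;
    const double dt = 2.0 * M_PI / steps;
    double sum = 0.0;

    for (int i = 0; i < steps; i++) {
        double t = i * dt;
        sum += sin(m * t) * sin(n * t) * dt;   /* rectangle-rule integration */
    }
    printf("integral over one period = %g\n", sum);  /* prints a value near 0 */
    return 0;
}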

Figure 1.5: Spectrum efficiency difference between FDM and OFDM. With OFDM, the signals on the sub-carriers are overlapped but still orthogonal to each other. With FDM, the sub-carriers are simply placed next to each other.

The Receiver. OFDM symbols are transmitted over the channel to the receiver on a single frequency. Basically, the receiver performs the same operations as the transmitter, but in the inverse order. By means of an FFT, an approximation of the source signal is retrieved, as illustrated in Figure 1.6.

Figure 1.6: Representation of the OFDM receiver [14]. The antenna receives each part of the spectrum as one signal r(t). It is demodulated and, after the cyclic prefix has been eliminated with filters, an FFT algorithm transforms the samples back to the frequency domain. Then, each symbol is detected to create an approximation of the original data signal.

Inter-Symbol Interference (ISI). As seen in Figure 1.5, the signals are overlapped. This overlapping introduces a problem known as Inter-Symbol Interference (ISI): the delayed spread of symbol N−1 onto symbol N, where, in the example of Figure 1.5, the last element of symbol 0 is overlapped by the first element of symbol 1 because of the channel.

Spread Delay. The spread delay corresponds to the propagation of a transmitted signal onto the next one. It is the echo of the first signal on the second one, as illustrated in Figure 1.7 (a). This physical effect depends on the channel and on the distance between the two signals.

To avoid this problem, a distance called the guard interval, longer than the spread delay, is needed. As it is impossible to send nothing at all, samples from the tail of the symbol are copied to its front, as illustrated in Figure 1.7 (b). This principle, explained in [15], is called the cyclic prefix. In theory, this protective prefix should be added after each sub-carrier, but in practice the OFDM signal is a linear combination, thus only one cyclic prefix is added, as illustrated in Figure 1.7 (c). The short sketch below illustrates this operation.
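The sketch below shows this insertion in C: the last cp_len time-domain samples of an OFDM symbol are copied in front of it. The function name, argument list and buffer layout are illustrative assumptions, not taken from a specific OFDM implementation.

/* Minimal sketch of cyclic-prefix insertion: the tail of the time-domain OFDM
 * symbol is copied to the front, as in Figure 1.7 (b)/(c). The names and the
 * buffer layout are illustrative assumptions. */
#include <complex.h>
#include <string.h>

/* symbol: n IFFT output samples; out: buffer of cp_len + n samples. */
void add_cyclic_prefix(const float complex *symbol, unsigned n,
                       unsigned cp_len, float complex *out)
{
    memcpy(out, symbol + (n - cp_len), cp_len * sizeof *out); /* copy of the tail */
    memcpy(out + cp_len, symbol, n * sizeof *out);            /* followed by the symbol */
}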


Figure 1.7: The cyclic prefix, which makes it possible to avoid the ISI problem. (a) shows the spread delay problem. (b) shows the addition of the cyclic prefix in the guard interval, according to the theory. (c) shows the addition of the cyclic prefix in practice, thanks to the linear combination property of OFDM.

1.1.2.3 Advantages

OFDM provides better spectrum flexibility by overlapping the signals on orthogonal frequencies, the harmonics. It is less sensitive to noise than a single-carrier system, and the ISI problem is solved thanks to the guard interval and the cyclic prefix.

1.1.2.4 Drawbacks

OFDM is sensitive to frequency offset and synchronization problems, which can destroy the orthogonality of the carriers. Also, after the IFFT, OFDM can produce very high amplitudes, which can lead to a large power consumption. This high amplitude, characterized by the Peak-to-Average Power Ratio (PAPR), can be reduced with correction vectors applied to the transmitted signals, but this adds complexity to the OFDM transmitter.

1.1.3 Conclusion on the context

The next step in mobile communication, LTE, focuses on the quality and speed of data transfer. This is realized with the help of the OFDM modulator/demodulator, which is the most widely used solution to problems such as ISI and fading. One of the key features in OFDM is the IFFT/FFT pair. In this project, the focus is on the receiver side, hence on the FFT block. In section 1.2, the FFT concept is presented by means of three FFT algorithms, and the issue of parallelizing them is introduced.

1.2 Project subject


1.2.1 Fast Fourier Transformation (FFT)
The group members have selected three FFT algorithms which will be compared. These three algorithms are presented below:

- Radix-2 DIT (Decimation In Time) Fast Fourier Transform: this algorithm is chosen because it is the simplest form of the Cooley-Tukey algorithm. Many other algorithms compute the DFT faster than radix-2 (radix-4 and split-radix, for example), but it is important for the project to be able to compare this basic algorithm with better ones (Sørensen, Edelman) to show the difference in computation and complexity (explained in 2.4.3).

- "Sørensen" Fast Fourier Transform (SFFT): the second algorithm under test is a mix of a Cooley-Tukey algorithm, like split-radix, and Horner's polynomial evaluation scheme. It takes into account the fact that not all outputs are of interest for the final result, so only some chosen outputs are computed. This makes it possible to avoid many operations which are expensive in time and memory. The Sørensen FFT is well known, so the project results can be compared with other studies. It is an interesting algorithm, in terms of complexity and challenge, to implement and compare with other algorithms like Radix-2 DIT or Edelman.

- "Edelman" Fast Fourier Transform: this algorithm computes the DFT approximately, introducing errors which are small compared to the reduction in the number of computations. This kind of algorithm increases the speed of computation at the cost of some errors. The Edelman algorithm is useful for parallel computing.

All the algorithms mentioned above are further developed in section 2.4. However, because of a lack of documentation about the Edelman algorithm, it is disregarded in the rest of the project.

1.2.2 Cell Broadband Engine

The purpose of this project is to examine the implementation of FFT algorithms for the OFDM application presented in section 1.1.2 on a multiprocessor platform, namely the Cell Broadband Engine architecture. In this project, the Cell BE is used for:

- the implementation of parallelized FFT algorithms;
- the evaluation of the performance, in particular the execution time, of the implementation of the parallelized FFT algorithms.

The Cell BE is constructed as a heterogeneous processor architecture, with multiple executions and memory transfers active at the same time. This architecture is composed of a PowerPC unit (PPU) with two hardware threads and eight simpler processors, the Synergistic Processing Units (SPUs), which are designed to perform calculations, whereas the PowerPC performs control, data management and scheduling of operations. The SPUs contain a RISC processor and are constructed with two pipelines that can each execute an instruction every cycle. Moreover, the data paths of the arithmetic functional units inside the SPUs are wide (128 bits), allowing the use of Single Instruction Multiple Data (SIMD) instructions. This produces a processor optimized for computation.

1.2.3 Parallelization

Parallelization is an important part of this project. Indeed, the OFDM receiver requires an FFT as an integral part of the wireless communication, and it is essential that the computation of this FFT be as fast as possible so that the achievable throughput is maximised. To obtain the best performance from the application running on the Cell BE processor, the use of multiple SPUs concurrently is evaluated. The application creates at least as many threads as concurrent SPU contexts are required; each of these threads runs a single SPU context at a time. With this method, the FFT is parallelized and uses some of the features of the Cell BE to accelerate the computation.

1.3 Problem Definition


This work seeks to answer the following question: "How efficient, in terms of computation speed, is the Cell BE processor for the execution of parallelized FFT algorithms?"

1.4 Project Delimitations


In-depth research on LTE and OFDM is not the purpose of this project, nor is a complete mathematical examination of the FFT algorithms. This project focuses on the use of the Cell BE for the FFT, "probably the single most important tool in digital signal processing (DSP)", according to Sørensen and Burrus [16].


Chapter 2

Analysis
2.1 Overview
The purpose of this chapter is, first, to introduce the design methodology chosen by the project group, in terms of project methodology (the A3 design model) and of the way to parallelize an algorithm following established procedures. The chapter then introduces the platform under test, the Cell BE, in section 2.3, followed, in section 2.4, by an analysis of the chosen FFT algorithms, with an explanation of the reasons for choosing them.

2.2 Design Methodology


2.2.1 Design Model
The design process is divided into three parts, as in the A3 model [17]: Application, Algorithm and Architecture. First of all, Figure 2.1 shows the generic A3 model. Then, this methodology is applied to the specific project presented in this report, as shown in Figure 2.2.

- Application: the application is any system with specifications and constraints (time constraints, power consumption, area limitations, ...). It is the main purpose of a project.

- Algorithm: at this level, existing algorithms are developed, and special algorithms can be created for the application. The algorithms are optimized from a purely mathematical point of view, i.e. the optimizations are only done on the parts of the algorithms directly related to the application.

- Architecture: the mapping of the previous algorithms onto the selected platform (DSP, FPGA, Cell BE, ...). In case of incompatibility between the specifications/constraints of the application and the results, modifications have to be made. On one hand, if the algorithms are implemented on a fixed architecture, the program can be modified in its architecture-related parts (bus control, data transfer control, memory allocation, ...) for the specified architecture. On the other hand, if the algorithms are fixed, then the architecture can be modified (e.g. the VHDL program for an FPGA platform).

Application: in the application domain, a presentation of LTE and OFDM is given in the context section 1.1.


Figure 2.1: The generic A3 design methodology.

Algorithm: in the algorithm domain, three Fast Fourier Transform algorithms are compared. First of all, an analysis of their derivation is done. Then, the complexity, i.e. the number of computations needed to execute the Fourier transform, of each algorithm is analysed. Finally, the algorithms are implemented in C twice: once for sequential execution and a second time for parallel execution.

Architecture: in the architecture domain, the platform used to implement the different algorithms is analysed. The available hardware and the system limitations are studied. Then, it is examined how the compiler is used in order to parallelize programs and how to measure the computation cost in terms of resource utilisation, execution speed, etc.

2.3 Cell BE
2.3.1 Architecture
This section presents the architecture used throughout the project, the Cell Broadband Engine. According to the A3 design model, this section belongs to the analysis of the architecture, as illustrated in Figure 2.3.

2.3.1.1 Architecture Overview

The Cell Broadband Engine (CBE) is a multicore processor. It has a Power Processing Element (PPE), which is a dual-thread PowerPC Architecture core, and eight Synergistic Processing Elements (SPEs), which are SIMD (Single Instruction Multiple Data) processor elements. The communication path for commands and data between all processor elements and all chip controllers for memory access or input/output is provided by the Element Interconnect Bus (EIB) [18, p. 41]. An overview of the architecture is presented in Figure 2.4. In the PlayStation 3, 6 of the 8 SPEs can be used for computation, because one is used by the OS virtualization layer and another has been disabled for wafer yield reasons [19, p. 5]. That means that when running the operating system, 6 SPEs are available for computation, as shown in Figure 2.4.

Figure 2.2: The A3 model applied to this project.

2.3.1.2 Power Processing Element (PPE)

The PPE contains a 64-bit, dual-thread PowerPC Architecture RISC core. It has 32 KB level-1 (L1) instruction and data caches and a 512 KB level-2 (L2) unified (instruction and data) cache. It can run existing PowerPC architecture software and is well suited for executing system-control code. In this project, however, it is used as a managing controller for the SPE threads, and it is assumed that the PPE is fast enough to manage the threads executing on the SPEs. The PPE consists of two main units: the PowerPC processor unit (PPU), which performs instruction execution, and the PowerPC processor storage subsystem (PPSS), which handles memory requests from the PPU and external requests to the PPE from the SPEs [18, p. 41]. The architecture overview of the PPE is presented in Figure 2.5. In the PlayStation 3, the PPE is clocked at 3.2 GHz, thus it can theoretically reach 2 x 3.2 = 6.4 GFLOP/s of IEEE-compliant double-precision floating-point performance. It can also reach 4 x 2 x 3.2 = 25.6 GFLOP/s of non-IEEE-compliant single-precision floating-point performance using 4-way single instruction multiple data (SIMD) fused multiply-add operations [19, p. 5].

The SPE is a single instruction multiple data (SIMD) processor element that is optimized for the data-rich operations (such as the computation of FFT butterflies) allocated to it by the PPE. Each SPE has a Synergistic Processor Unit (SPU), which fetches instructions and data from its 256 KB Local Store (LS), and a single register file with 128 entries, each 128 bits wide. Each SPE has a Direct Memory Access (DMA)

Figure 2.3: The A3 model for the project. Highlighted in red, the architecture analyzed in this section.

interface and a channel interface for communicating with its Memory Flow Controller (MFC) and with all the other processors (PPE and SPEs). The SPE is intended to run its own program, which resides in the LS, and not to run an operating system [18, p. 63]. The architecture overview of the SPE is presented in Figure 2.6. The SPU functional unit, shown in Figure 2.7, consists of a Local Store (LS), where all instructions and data used by the SPU are stored, a Synergistic Execution Unit (SXU), which executes all the instructions, and an SPU Register File Unit (SRF), which stores all data types, return addresses and results of comparisons. The SXU includes 6 execution units:

- SPU Odd Fixed-Point Unit (SFS), which executes byte-granularity shift, rotate-mask and shuffle operations on quadwords.
- SPU Even Fixed-Point Unit (SFX), which executes arithmetic instructions, logical instructions, word shifts and rotates, floating-point compares, and floating-point reciprocal and reciprocal square-root estimates.
- SPU Floating-Point Unit (SFP), which executes single-precision and double-precision floating-point instructions, integer multiplies and conversions, and byte operations. It can perform fully pipelined single-precision (32-bit) floating-point instructions and partially pipelined double-precision (64-bit) instructions.
- SPU Load and Store Unit (SLS), which executes load and store instructions. It also handles DMA requests to the LS.
- SPU Control Unit (SCN), which fetches and issues instructions to the two pipelines, executes branch instructions, arbitrates access to the LS and register file, and performs other control functions.
- SPU Channel and DMA Unit (SSC), which enables communication, data transfer, and control into and out of the SPU, together with the associated DMA controller in the Memory Flow Controller (MFC).

Figure 2.4: Architecture overview of the Cell Broadband Engine. The Element Interconnect Bus (EIB) connects all processor elements and all chip controllers for memory access and input/output access. The Cell Broadband Engine has 1 PowerPC Processor Element and 8 Synergistic Processor Elements. Adopted from [18, p. 37].

The Synergistic Execution Unit (SXU) is divided into an even and an odd pipeline (pipeline 0 and 1, respectively) and can complete up to two instructions per cycle, one on each pipeline [18, p. 68]. Within the SXU, the odd pipeline provides the data-moving units and the even pipeline provides the data-processing units. Furthermore, each unit of the SXU has a datapath 128 bits wide, resulting in the capability to use Single Instruction Multiple Data (SIMD). If the SXU works on 32-bit-wide data, it can thus perform 4 operations per instruction. On the PlayStation 3, the SPU runs at 3.2 GHz, so each SPU can theoretically provide 2 x 4 x 3.2 GFLOPS with 32-bit-wide data (one instruction on each pipeline and 4 operations per instruction). With 6 SPUs available, this yields a total of 153.6 GFLOPS [18, p. 5]. It must be noted that single-precision floating-point operations do not conform to IEEE 754 because of the following differences:

- Truncation is used in rounding.
- Denormal numbers are treated as zero.
- NaNs are interpreted as normalized numbers.

Double-precision floating-point operations do not have this problem [18, p. 68-69].
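The 4-way SIMD capability mentioned above can be illustrated with a small SPU-only sketch using the SDK intrinsics: spu_madd() performs a fused multiply-add on four packed single-precision values in a single instruction. The values are arbitrary and the snippet is not part of the project code.

/* SPU-only sketch (compiled with spu-gcc) of 4-way SIMD on 128-bit registers:
 * one spu_madd() computes four single-precision fused multiply-adds at once.
 * The values are arbitrary. */
#include <spu_intrinsics.h>
#include <stdio.h>

int main(void)
{
    vector float a = { 1.0f, 2.0f, 3.0f, 4.0f };
    vector float b = { 0.5f, 0.5f, 0.5f, 0.5f };
    vector float c = { 1.0f, 1.0f, 1.0f, 1.0f };
    vector float r = spu_madd(a, b, c);          /* r[i] = a[i] * b[i] + c[i] for the 4 lanes */

    printf("%f %f %f %f\n",
           spu_extract(r, 0), spu_extract(r, 1),
           spu_extract(r, 2), spu_extract(r, 3));
    return 0;
}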


Figure 2.5: Architecture overview of the PPE, which consists of a PowerPC processor unit (PPU) and a PowerPC processor storage subsystem (PPSS). It has 32 KB level-1 (L1) instruction and data caches and a 512 KB level-2 (L2) unified (instruction and data) cache. Adopted from [18, p. 49].

Figure 2.6: Architecture overview of the SPE which consists of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC). The SPU has a LS of 256 KB. Adopted from [18, p. 63].


Figure 2.7: Architecture overview of the SPU functional unit. The 256 KB Local Store (LS) is filled from the Element Interconnect Bus (EIB) via the MFC. The SXU contains two fixed-point units and a floating-point unit. The odd pipeline takes care of moving data (fetching instructions to the pipelines, loading and storing data between the LS and the register file of 128 entries, each 128 bits wide), while the even pipeline takes care of data processing (arithmetic and logic instructions). Adopted from [18, p. 64].


2.3.1.4 Element Interconnect Bus (EIB)

One of the main components of the CBE is the EIB, which connects all the components together, including the PPE, the SPEs, the main memory and all inputs/outputs. The bus has a bandwidth of 25.6 GB/s (96 bytes per clock cycle) and enables multiple concurrent data transfers [18, p. 42].

2.3.1.5 Memory Interface Controller (MIC)

The Memory Interface Controller (MIC) provides the interface between the EIB and the physical memory. It supports one or two Rambus extreme data rate (XDR) memory interfaces, which together support between 64 MB and 64 GB of XDR DRAM memory [18, p. 42].

2.3.1.6 Memory System

The PlayStation 3 has dual-channel Rambus extreme data rate (XDR) memory; however, the platform provides a modest amount of 256 MB, of which only about 200 MB are available for the Linux OS and the applications [19, p. 7]. The SPU accesses the RAM through the EIB and moves the data to its LS via DMA transfers, with the MFC of the SPU acting as a DMA controller.

2.3.1.7 Cell Broadband Engine Interface (BEI)

The Cell Broadband Engine Interface (BEI) unit supports I/O interfacing. It manages data transfers between the EIB and I/O devices. The BEI supports two Rambus FlexIO interfaces. One of the two interfaces (IOIF1) supports only a non-coherent I/O interface (IOIF) protocol, which is suitable for I/O devices. The other interface (IOIF0) is software-selectable between the non-coherent IOIF protocol and the memory-coherent Cell Broadband Engine interface protocol [18, p. 42].

2.3.2 Programming of the CBE

The programming of the CBE is split into two main tasks: the programming of the PPE, which manages the utilization of the SPUs, and the programming of what is executed on the SPEs.

2.3.2.1 Development platform

The platform used for the project is a PlayStation 3 with a monitor, keyboard, mouse and LAN connection for remote access. The PlayStation 3 is set up with a Linux operating system and a set of development tools:

- Fedora 8
- Linux kernel 2.6.23.1-42.fc8
- IBM SDK 3.0 for the CBE architecture, including:
  - gcc compiler toolchain for the CBE (ppu-gcc and spu-gcc ver. 4.1.1)
  - libspe2 - SPE runtime management library ver. 2.2
  - Makefile from the SDK

2.3.2.2 Creating a simple application on an SPE

Generally, applications do not have physical control of the SPEs; the operating system manages these resources. Applications use software constructs called SPE contexts, which are a logical representation of an SPE. The SPE Runtime Management Library (libspe) provides all the functions to manage the SPEs, as well as the means for communication and data transfer between the SPEs and the PPE. The flow of running a single SPU program context, as shown in Figure 2.8, is: create an SPE context; load an SPE executable object into the SPE context local store (LS); run the SPE context (this is done by the operating system, which requests the actual scheduling of the SPE context onto a physical SPE); and lastly destroy the SPE context in order to free the memory resources used by the context. It must be noted that running the SPE context is a synchronous call to the operating system, and thus the calling application blocks until the SPE stops executing [20, p. 1]. All functions for SPE context management are described in [20].


Figure 2.8: The flow for running a simple application using one SPE.
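A minimal PPU-side sketch of this four-step flow, using the libspe2 calls described in [20], is given below. The embedded program handle fft_spu matches the Makefile target shown in Figure 2.10; error handling is reduced to simple checks, and the code is an illustration rather than the project's implementation.

/* PPU-side sketch of the flow in Figure 2.8: create an SPE context, load the
 * embedded SPE executable, run it (blocking call), then destroy the context.
 * The handle name fft_spu follows the Makefile target of Figure 2.10. */
#include <stdio.h>
#include <libspe2.h>

extern spe_program_handle_t fft_spu;   /* SPE executable embedded with ppu-embedspu */

int main(void)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;

    spe_context_ptr_t ctx = spe_context_create(0, NULL);        /* 1. create the SPE context   */
    if (ctx == NULL) { perror("spe_context_create"); return 1; }

    if (spe_program_load(ctx, &fft_spu) != 0) {                  /* 2. load the program into LS */
        perror("spe_program_load"); return 1;
    }
    if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0) { /* 3. run (blocks until done)  */
        perror("spe_context_run"); return 1;
    }
    spe_context_destroy(ctx);                                    /* 4. free the context         */
    return 0;
}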

2.3.2.3 Creating an application on several SPEs

In order to run faster, the project needs to use multiple SPEs concurrently. To achieve this, the application must create at least as many threads as concurrent SPE contexts are required. The library used for this is libspe2, together with POSIX (Portable Operating System Interface) threads [20, p. 41]. The flow of running an application on several SPEs is shown in Figure 2.9. Each of these threads may run a single SPE context at a time. If N concurrent contexts are required, it is common to have a main application thread and, besides it, N threads dedicated to SPE context execution. Each such thread issues a request for its context to be run and becomes blocked until the context has finished executing. The blocking of these threads is not a problem, because the main program thread can still create as many threads as needed. If all SPEs are busy, the threads are queued up and executed in the same order as they were created. Finally, when all the threads have finished, the main program thread destroys the SPE contexts that are no longer needed. A minimal sketch of this flow is given below.
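The sketch below follows this flow with one POSIX thread per SPE context; each thread blocks in spe_context_run() while the main thread remains free. The constant NUM_SPE = 6 and the handle fft_spu are assumptions consistent with the rest of the report.

/* PPU-side sketch of the flow in Figure 2.9: N SPE contexts are created and
 * loaded, one pthread per context calls the blocking spe_context_run(), and
 * the main thread joins the threads and destroys the contexts afterwards. */
#include <pthread.h>
#include <libspe2.h>

#define NUM_SPE 6                       /* SPEs available under Linux on the PS3 */

extern spe_program_handle_t fft_spu;

static void *spe_thread(void *arg)
{
    spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    /* Blocks this thread, not the main PPU thread, until the SPE stops. */
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);
    return NULL;
}

int main(void)
{
    spe_context_ptr_t ctx[NUM_SPE];
    pthread_t tid[NUM_SPE];

    for (int i = 0; i < NUM_SPE; i++) {
        ctx[i] = spe_context_create(0, NULL);               /* create N SPE contexts      */
        spe_program_load(ctx[i], &fft_spu);                 /* load the SPE executable    */
        pthread_create(&tid[i], NULL, spe_thread, ctx[i]);  /* run one context per thread */
    }
    for (int i = 0; i < NUM_SPE; i++) {
        pthread_join(tid[i], NULL);                         /* wait for all N threads     */
        spe_context_destroy(ctx[i]);                        /* destroy the SPE contexts   */
    }
    return 0;
}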

2.3.2.4 Project directory structure

In order to program the Cell Broadband Engine, the source code is arranged into two folders, one for the PPU code and one for the SPU code. Furthermore, to use the makefile definitions supplied by the SDK for producing programs, the line "include $(CELL_TOP)/buildutils/make.footer" has to be included in each makefile. The project directory structure is shown in Figure 2.10.

2.3.2.5 Program compilation

Building the application for the Cell BE requires several steps, as shown in Figure 2.11. First, all .c files in the ppu folder are compiled using ppu-gcc for the PPE programs, and all .c files in the spu folder are compiled using spu-gcc for the SPE programs. Next, spu-gcc creates SPE executables from the compiled SPE programs. These executables are embedded into the PPE programs by first creating embedded PPE images of the SPE executables (using ppu-embedspu), next creating PPE libraries (using ppu-ar), and finally compiling the PPE programs again, merging them with the SPE libraries to obtain the final program FFT (using ppu-gcc).


Figure 2.9: The flow of running an application using several SPEs.


2.4 FFT algorithms


This section introduces the selected FFT algorithms with a complete analysis of each algorithm that will be parallelized in Chapter 3. According to the A3 design model, this section belongs to the algorithm domain, as illustrated in Figure 2.12. First of all, the selection of the algorithms is discussed. Then, the different mathematical forms of the algorithms are developed. At last, the computational costs are compared. An FFT algorithm computes the Discrete Fourier Transform (DFT) with a minimum of complexity: a direct application of the DFT definition has a computational complexity of O(n²), and the purpose of FFT algorithms is to split the transform to obtain a complexity of O(n·log(n)).

Makefile in the program directory:
    # Subdirectories
    DIRS = ppu spu
    # make.footer
    include $(CELL_TOP)/buildutils/make.footer

Makefile in directory ppu:
    # Target
    PROGRAM_PPU = main
    # make.footer
    include $(CELL_TOP)/buildutils/make.footer

Makefile in directory spu:
    # Target
    PROGRAM_spu = fft_spu
    # make.footer
    include $(CELL_TOP)/buildutils/make.footer

Figure 2.10: Project directory structure which yields two subfolders : one for the ppu program code and one for the spu program code.

2.4.1 Overview

There exist many algorithms to compute the FFT. The most common one is called Cooley-Tukey (CT). It relies on a divide-and-conquer approach based on recursion: the recursion divides a Discrete Fourier Transform into several smaller DFTs. This algorithm needs O(n) multiplications by twiddle factors, which are trigonometric constant coefficients multiplied in during the course of the algorithm developed in 2.4.2. James Cooley and John Tukey published this method in 1965, but the algorithm was originally devised by Carl Friedrich Gauss in 1805. The most well-known use of the CT algorithm is a division of the transform into two parts of similar size.

2.4.2 Discrete Fourier Transform

The Discrete Fourier Transform (DFT) presented in [21] is a mathematical tool for digital signal processing (spectral analysis, data compression, partial differential equations, ...), similar to the Continuous Fourier Transform (CFT), which is used for analog signal processing. The formula is shown below:

X[k] = Σ_{n=0}^{N−1} x[n] · exp(−2πj·n·k / N)    (2.1)

X[k] = Σ_{n=0}^{N−1} x[n] · W_N^{nk}    (2.2)

W_N^{nk} = exp(−2πj·(n·k) / N)    (2.3)

where W_N^{nk} is known as the twiddle factor. The time-domain input data x[n] is a finite series of N samples, n = [0, 1, ..., N − 1], and is transformed into the frequency-domain signal X[k], where k = [0, 1, ..., N − 1].
2.4.3 Cooley-Tukey

This section presents a theoretical analysis of the Radix-2 DIT FFT. First of all, the DFT formula is developed to obtain the Radix-2 DIT formula. Then, a data path derivation is shown to optimize the implementation in program code.


Figure 2.11: Flow of the CBE program compilation: first, the .c files for the PPE programs and the SPE programs are compiled using ppu-gcc and spu-gcc, respectively. The compiled SPE programs are used to create SPE executables, which are turned into embedded PPE images and PPE libraries, and finally linked with the PPE programs to obtain the final program FFT.

2.4.3.1 Radix-2 DIT FFT

This section presents the radix-2 FFT implementation [22] used for testing against the Edelman and Sørensen FFT algorithms. It is used because it is one of the simplest FFT algorithms, for two reasons: it is well studied, so it can be used for comparison, and it is a good way to get acquainted with the FFT. First, the analytical derivation of the radix-2 computation of a DFT is presented. The radix-2 decimation-in-time rearranges the DFT equation into two parts: a sum over the even-numbered indices n = [0, 2, 4, ..., N − 2] and a sum over the odd-numbered indices n = [1, 3, 5, ..., N − 1], as in the following equations:

X_k = Σ_{m=0}^{N/2−1} x_{2m} · e^{−(2πj/N)·(2m)·k} + Σ_{m=0}^{N/2−1} x_{2m+1} · e^{−(2πj/N)·(2m+1)·k}    (2.4)

X_k = Σ_{m=0}^{N/2−1} x_{2m} · e^{−(2πj/N)·(2m)·k} + e^{−(2πj/N)·k} · Σ_{m=0}^{N/2−1} x_{2m+1} · e^{−(2πj/N)·(2m)·k}    (2.5)

X_k = DFT_{N/2}[x(0), x(2), ..., x(N−2)] + W_N^k · DFT_{N/2}[x(1), x(3), ..., x(N−1)]    (2.6)

X_k = Even(k) + W_N^k · Odd(k)    (2.7)


Figure 2.12: The A3 model for the project. Highlighted in red, the algorithms analyzed in this section.

where k = [0, 1, ..., N − 1]. The previous simplifications show that the radix-2 DIT DFT can be computed as the sum of two DFTs of length N/2: one over the even indexes and one over the odd indexes, the latter being multiplied by the twiddle factor W_N^k = e^{−2πjk/N}. Whereas the direct DFT computation requires N² complex multiplications and N² − N complex additions, the radix-2 DIT rearrangement costs only N²/2 + N complex multiplications and N²/2 complex additions.

2.4.3.2 Data path derivation

One can notice that the radix-2 DIT decomposition is recursive. This kind of expression is simple, but not optimal to implement in program code because of memory consumption and scheduling; that is why iterative algorithms are generally preferable. Another property is described below. The even and odd parts are periodic with period N/2, so Even(k + N/2) = Even(k) and Odd(k + N/2) = Odd(k). In addition, the twiddle factor satisfies W_N^{k+N/2} = −W_N^k. The equations may now be expressed as:

X_k = Even(k) + W_N^k · Odd(k)    (2.8)

X_{k+N/2} = Even(k) − W_N^k · Odd(k)    (2.9)

where k = 0, 1, ..., N/2 − 1.

The decimation of the data sequence can be repeated again and again until the resulting sequences are reduced to one-point sequences. Thus, for N = 2^n, this decimation can be performed n = log₂ N times. Therefore, the total number of complex multiplications is reduced to (N/2)·log₂ N and the number of additions to N·log₂ N. An iterative sketch of this computation is given below.
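The following generic C sketch shows the iterative form: the input is first placed in bit-reversed order and the butterflies of equations (2.8) and (2.9) are then applied stage by stage. This is a textbook-style sketch, not the project's Cell BE implementation, and N must be a power of two.

/* Iterative radix-2 decimation-in-time FFT (single precision), a generic
 * textbook sketch: bit-reversal reordering followed by log2(N) butterfly
 * stages applying equations (2.8) and (2.9). N must be a power of two. */
#include <complex.h>
#include <math.h>

/* Reorder x[] into bit-reversed index order, in place. */
static void bit_reverse(float complex *x, unsigned n)
{
    for (unsigned i = 1, j = 0; i < n; i++) {
        unsigned bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) {
            float complex t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
    }
}

void fft_radix2_dit(float complex *x, unsigned n)
{
    bit_reverse(x, n);
    for (unsigned len = 2; len <= n; len <<= 1) {            /* stage: 2-, 4-, ..., N-point DFTs */
        float ang = -2.0f * (float)M_PI / (float)len;
        float complex wlen = cosf(ang) + I * sinf(ang);      /* stage twiddle factor */
        for (unsigned i = 0; i < n; i += len) {
            float complex w = 1.0f;
            for (unsigned k = 0; k < len / 2; k++) {
                float complex even = x[i + k];
                float complex odd  = w * x[i + k + len / 2];
                x[i + k]           = even + odd;             /* equation (2.8) */
                x[i + k + len / 2] = even - odd;             /* equation (2.9) */
                w *= wlen;
            }
        }
    }
}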


Figure 2.13: Eight-point decimation-in-time algorithm. One can observe that the computation is divided into three stages: four two-point DFTs, then two four-point DFTs and finally one eight-point DFT. Another important observation concerns the order of the input data: it has to be bit-reversed to obtain the correct order of the corresponding output data.

2.4.4 Sørensen

The Sørensen FFT [16] (SFFT) algorithm is used in the project as a test algorithm, like the Radix-2 DIT and Edelman algorithms. It is also known as Transform Decomposition. Its principle differs from that of standard algorithms, like Radix-2 DIT or split-radix, in terms of the numbers of input and output data points. Standard algorithms assume that both numbers of data points are equal, as seen in Figure 2.13, where all the outputs are computed. The SFFT proceeds differently: only some output points are said to be of interest, and only these points are computed, as illustrated in Figure 2.14. Consider the DFT definition (2.2):
X[k] = Σ_{n=0}^{N−1} x[n] · W_N^{nk}    (2.10)

where k = [0, 1, ..., N − 1]. The SFFT supposes that only L output points are of interest. It introduces two lengths P and Q, where N divided by P defines Q, as given in equations (2.11) and (2.12) below.


Figure 2.14: There are N inputs, but only one output (X(k) in this example) is computed and used in further operations. The way this is done is explained in the following paragraphs. Modified from Sørensen and Burrus, 1993, Figure 4 [16].

Q = N/P    (2.11)

n = Q·n₁ + n₂, with n₁ = [0, ..., P − 1] and n₂ = [0, ..., Q − 1]    (2.12)

So the DFT equation (2.2) becomes:
X[k] = Σ_{n₂=0}^{Q−1} Σ_{n₁=0}^{P−1} x[Q·n₁ + n₂] · W_N^{(Q·n₁+n₂)·k}    (2.13)

X[k] = Σ_{n₂=0}^{Q−1} W_N^{n₂·k} · Σ_{n₁=0}^{P−1} x[Q·n₁ + n₂] · W_P^{n₁·<k>_P}    (2.14)

where <k>_P is k modulo P.

X[k] = Σ_{n₂=0}^{Q−1} X_{n₂}[<k>_P] · W_N^{n₂·k}    (2.15)

X_{n₂}[j] = Σ_{n₁=0}^{P−1} x[Q·n₁ + n₂] · W_P^{n₁·j}    (2.16)

X_{n₂}[j] = Σ_{n₁=0}^{P−1} x_{n₂}[n₁] · W_P^{n₁·j}    (2.17)

x_{n₂}[n₁] = x[Q·n₁ + n₂]    (2.18)

Equation (2.17) is the equation of a length-P DFT and can be computed with any FFT algorithm, such as Radix-2 DIT or split-radix. Sørensen's paper states that a split-radix FFT is better in terms of the number of operations, but as the Radix-2 FFT has already been used in the project, it is preferable to reuse it for comparison with the previous results. Equation (2.15) shows that Q FFTs of length P have to be computed, as illustrated in Figure 2.14 and sketched below.
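A sequential C sketch of this transform decomposition is given below: the input mapping of equation (2.18), the Q length-P FFTs of equation (2.17) (reusing the radix-2 sketch of section 2.4.3.2), and the recombination of equation (2.15) for the L wanted outputs only. It assumes N = P·Q with P a power of two; all names are illustrative.

/* Sequential sketch of Sørensen's transform decomposition: input mapping
 * (2.18), Q length-P FFTs (2.17) and recombination (2.15) for the L wanted
 * output bins only. Assumes N = P * Q with P a power of two. */
#include <complex.h>
#include <math.h>
#include <stdlib.h>

void fft_radix2_dit(float complex *x, unsigned n);    /* from the radix-2 sketch above */

void sfft(const float complex *x, unsigned N, unsigned P,
          const unsigned *wanted_k, unsigned L, float complex *X_out)
{
    unsigned Q = N / P;                                        /* equation (2.11) */
    float complex *sub = malloc((size_t)Q * P * sizeof *sub);  /* Q subsequences of length P */

    /* Input mapping (2.18) followed by Q length-P FFTs (2.17). */
    for (unsigned n2 = 0; n2 < Q; n2++) {
        for (unsigned n1 = 0; n1 < P; n1++)
            sub[n2 * P + n1] = x[Q * n1 + n2];
        fft_radix2_dit(&sub[n2 * P], P);
    }

    /* Recombination (2.15): only the L wanted outputs are computed. */
    for (unsigned l = 0; l < L; l++) {
        unsigned k = wanted_k[l];
        float complex acc = 0.0f;
        for (unsigned n2 = 0; n2 < Q; n2++) {
            float ang = -2.0f * (float)M_PI * (float)(n2 * k) / (float)N;
            acc += sub[n2 * P + (k % P)] * (cosf(ang) + I * sinf(ang));
        }
        X_out[l] = acc;
    }
    free(sub);
}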

2.4.4.1 Complexity

The SFFT complexity depends on the number P, which in turn yields Q, the number of length-P FFTs that have to be performed. The complexity then also depends on the complexity of the FFT algorithm used, Radix-2 DIT or split-radix.

2.5 Conclusion of the Analysis section


The Analysis chapter presents the theoretical basis of the subject developed in this project. The A3 design methodology is used to organize the project and makes it possible to establish simple, well-defined parts. The application is defined as developing an OFDM receiver for LTE. The algorithm part then describes the 2 FFT algorithms to be used: Radix-2 DIT and Sørensen. The last part corresponds to the architecture on which the algorithms are implemented, namely the Cell Broadband Engine. The analysis of the Cell Broadband Engine shows a multiprocessor architecture containing one PPE managing the communication between the 6 SPEs available out of the 8 on the PlayStation 3 platform. The instructions and data flow through the Element Interconnect Bus (EIB), which connects the PPE, the SPEs and the memories. The SPUs contain a RISC processor and are constructed with two pipelines that can each execute an instruction every cycle. Moreover, the data paths of the arithmetic functional units inside the SPUs are wide (128 bits), allowing the use of Single Instruction Multiple Data (SIMD) instructions. This produces a processor optimized for computation. The last part of the Analysis chapter concerns the FFT algorithms. An FFT (or its inverse, the IFFT, also known as FFT⁻¹) is a tool used in digital signal processing that performs the transformation from the time domain to the frequency domain, which is used to separate the useful frequencies from the added noise. In the case of OFDM, the transmitter contains an IFFT which transforms the digital symbols into an analog signal for transmission, and the receiver performs the inverse computation to retrieve the transmitted data among the noise. These operations are time-expensive, and an efficient multicore architecture designed for computation can reduce the computation time. The project group chose the Radix-2 DIT as the first algorithm to implement on the CBE; it is considered one of the simplest FFT algorithms and is based on the paper by J. Cooley and J. Tukey. The second algorithm to be implemented on the CBE is the Transform Decomposition algorithm, known as the Sørensen FFT. This algorithm, a bit more complex than Radix-2 DIT, makes it possible to speed up the computation. The next chapter deals with the experiments of implementing the FFT algorithms, first on the PPU, then on one SPU and finally on several SPUs. The Radix-2 DIT is the first algorithm used, followed by the Sørensen algorithm.


Chapter 3

Implementation
3.1 Overview
This chapter puts into practice the theoretical analysis developed in Chapter 2. It contains the results of the tests on one or several processors, with the different FFT algorithms. All these results are evaluated, compared and discussed. According to the A3 design model, this chapter belongs to both the algorithm and the architecture domains, i.e. the mapping of the algorithms onto the architecture, as illustrated in Figure 3.1.

3.2 Cooley-Tukey Implementation


3.2.1 Overall Approach
The tests are carried out with the CT algorithm. First of all, Matlab is used to obtain reference results: the fft function is used to verify that the results of the implementations are correct. This verification only concerns the computed values; the Matlab computation time is of no interest. As mentioned in Chapter 2, Section 2.4.3, the CT algorithm is one of the simplest existing FFT algorithms; it is therefore selected for the initial tests, as its sequential implementation is straightforward. These results also provide elements of comparison for the subsequent implementations.

Then, various types of tests are performed. All the following tests are carried out 10000 times to ensure that the results are meaningful (since the execution is not fully deterministic due to architectural and OS hazards). The first one is a sequential execution on the main processor (PPU). The second one is also a sequential computation, but on one SPU (without data transfer). These two tests show the computation difference between the two processors. Then, the parallel implementation on 6 SPUs is performed to evaluate the potential improvement.

Two parameters are evaluated during these tests: the computation time and the number of operations per second. The time is measured with the function gettimeofday [23] and covers the execution of the bit-reverse function, the twiddle-factor computation and the butterfly computation. The number of operations per second is computed as

    Number of operations per second = (10 * (N/2) * log2(N)) / total time        (3.1)

where N is the length of the FFT.
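As an illustration of how this measurement can be set up, the sketch below times a run with gettimeofday and converts the elapsed time into the MFLOPS figure of equation 3.1. It is only a minimal sketch: the function fft_radix2 is a placeholder for the timed work (bit reversal, twiddle factors and butterflies), not the project's actual source code.

#include <math.h>
#include <stdio.h>
#include <sys/time.h>

/* Placeholder for the timed work (bit reversal, twiddle factors and
   butterflies of an N-point radix-2 DIT FFT); not the project's code. */
extern void fft_radix2(float *re, float *im, int N);

/* Elapsed wall-clock time in seconds between two gettimeofday() samples. */
static double elapsed_seconds(struct timeval start, struct timeval stop)
{
    return (double)(stop.tv_sec - start.tv_sec)
         + (double)(stop.tv_usec - start.tv_usec) * 1e-6;
}

void measure_fft(float *re, float *im, int N, int runs)
{
    struct timeval start, stop;
    double total = 0.0;

    for (int i = 0; i < runs; i++) {
        gettimeofday(&start, NULL);
        fft_radix2(re, im, N);
        gettimeofday(&stop, NULL);
        total += elapsed_seconds(start, stop);
    }

    /* Equation 3.1: 10 floating-point operations per butterfly and
       (N/2) * log2(N) butterflies per FFT. */
    double flops = 10.0 * (N / 2.0) * log2((double)N) * (double)runs;
    printf("average time: %g s, rate: %g MFLOPS\n", total / runs, flops / total / 1e6);
}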


Figure 3.1: A3 model for the project (Application: OFDM receiver, LTE/4G; Algorithms: Radix-2, Sørensen, Edelman; Architecture: Cell BE). Highlighted in red, the mapping developed in this section.

This formula corresponds to the complexity of the CT butterflies seen in Section 2.4.3.2. The computation of the twiddle factors and the bit-reverse function does not affect equation 3.1, because these functions contain no floating-point operations: bit reversal consists only of data moves, and the twiddle-factor computation only calls cosine and sine, which are not counted as floating-point operations in this measure.
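For reference, the three timed pieces correspond to the classic building blocks of a sequential radix-2 DIT FFT. The sketch below is a generic textbook version written for illustration; the names and data layout are assumptions made here and do not necessarily match the project's source code.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef struct { float re, im; } cplx;

/* In-place bit-reversal permutation: pure data moves, no floating-point
   operations, so it does not count in equation 3.1. */
static void bit_reverse(cplx *x, int n)
{
    for (int i = 1, j = 0; i < n; i++) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j |= bit;
        if (i < j) { cplx t = x[i]; x[i] = x[j]; x[j] = t; }
    }
}

/* Twiddle-factor table W_N^k = exp(-j*2*pi*k/N) for k = 0 .. N/2-1:
   only calls to cosine and sine, no operations counted in equation 3.1. */
static void make_twiddles(cplx *w, int n)
{
    for (int k = 0; k < n / 2; k++) {
        w[k].re = cosf(-2.0f * (float)M_PI * (float)k / (float)n);
        w[k].im = sinf(-2.0f * (float)M_PI * (float)k / (float)n);
    }
}

/* Butterfly passes on bit-reversed data: log2(N) stages of N/2 butterflies,
   10 real floating-point operations per butterfly. */
static void butterflies(cplx *x, const cplx *w, int n)
{
    for (int len = 2; len <= n; len <<= 1) {             /* one stage */
        int step = n / len;                               /* stride into w[] */
        for (int i = 0; i < n; i += len)                  /* one butterfly group */
            for (int k = 0; k < len / 2; k++) {
                cplx a = x[i + k];
                cplx b = x[i + k + len / 2];
                cplx t = { b.re * w[k * step].re - b.im * w[k * step].im,
                           b.re * w[k * step].im + b.im * w[k * step].re };
                x[i + k].re           = a.re + t.re;
                x[i + k].im           = a.im + t.im;
                x[i + k + len / 2].re = a.re - t.re;
                x[i + k + len / 2].im = a.im - t.im;
            }
    }
}

A full transform then consists of calling bit_reverse, make_twiddles (once per FFT length) and butterflies in sequence, which is the split that the timing above measures.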

3.2.2 Results

A graphical representation of the results can be seen in Figure 3.2. This graph shows the computation time of the sequential executions on the PPU and on the SPU. Both curves are almost linear, which is expected: when the FFT length is doubled, the computation time is almost doubled as well. Another observation is that the SPU computation times are larger than the PPU ones, and the difference between the two increases with the FFT length.

The graph in Figure 3.3 depicts the computation time of the parallel implementation according to the FFT length, with the CT algorithm parallelized on the 6 SPUs of the Cell BE. The larger the FFT length, the larger the computation time. This is an unexpected result: firstly, the computation time of the parallelized version is larger than that of the sequential one; secondly, the execution time grows with the FFT length. The explanation is that the data transfers between the main storage (RAM) and the Local Storage (LS) are very long compared to the computation, i.e. the data transfers are a bottleneck and the SPUs remain idle for significantly long periods of time. Moreover, no optimizations have been implemented at this point.

Finally, the number of operations per second is plotted against the number of processors in Figure 3.4. For an FFT length of 1024, it can be observed that the number of operations per second (in MFLOPS) decreases when the number of processors increases. Considering the results of the previous test (computation on 6 SPUs, Figure 3.3), this result was expected: much time is spent on data transfers when the number of processors increases, so the number of operations per second (i.e. actual computations) is very low compared to the transfer times.

Figure 3.2: Computation time of a sequential radix-2 FFT implemented on the PPU (dashed blue) and on one SPU (continuous red) for different FFT lengths (ranging from 4 to 1024).


Figure 3.3: Computation time of a parallel radix-2 FFT implemented on 6 SPUs for different FFT lengths, ranging from 4 to 1024.

Figure 3.4: Number of operations per second for a parallel radix-2 FFT implemented on different numbers of processors (from 1 to 6).

3.2.3 Optimizations

Intuitively, one would expect that increasing the parallelism would increase the number of operations per second. However, the opposite effect has been observed in the results described above. Therefore, the group members have decided to evaluate whether it is possible to reduce the computation time by means of several optimization techniques, as described in what follows. The underlying problem is the data transfers: the time spent transferring data between the PPU and the SPU is higher than the computation time. Several methods have been used to reduce this time, as described in the following paragraphs.

3.2.3.1 Deterministic twiddle factors

The twiddle factors are stored as constants on the SPU. If the FFT length is fixed, the twiddle factors are always deterministic (they can be computed in advance). Instead of passing them as arguments to the SPU, they are stored directly in the Local Storage of the SPU. The twiddle factors are complex values with real and imaginary parts. Assuming 32-bit floats and a 1024-point FFT, the size of these data is:

    512 (twiddle factors) x 2 (floats: real and imaginary parts) x 4 (bytes per float) = 4096 bytes

This is not a problem for the LS because it is only 4096 bytes out of the 256 KB available. This technique avoids wasting precious EIB bandwidth.
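In practice this can be done by compiling a precomputed table directly into the SPU program. The snippet below is a hypothetical example of such a header (the file name, type name and layout are ours, not the project's; only the first two of the 512 entries are written out, the rest being generated offline):

/* spu/twiddle_1024.h -- hypothetical header generated offline by a small
   host program, holding the 512 twiddle factors of a fixed 1024-point FFT
   directly in the SPU Local Store: 512 x 2 floats x 4 bytes = 4096 bytes,
   as computed above. */
typedef struct { float re, im; } twiddle_t;

static const twiddle_t twiddle_1024[512] __attribute__((aligned(16))) = {
    { 1.000000f,  0.000000f },   /* W_1024^0 */
    { 0.999981f, -0.006136f },   /* W_1024^1 */
    /* ... remaining 510 entries generated offline ... */
};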

3.2.3.2 Double Buffering

One of the methods to transfer data to (from) the PPU from (to) the SPU uses Direct Memory Access (DMA). This section presents a technique called double buffering. To perform computations on the SPU, the program has to transfer data from the main storage to the LS using DMA transfers. For example, consider an SPU program that repeats the following steps:

1. transfer data using DMA from the main storage to the LS buffer B,
2. wait for the transfer to complete,
3. compute on the data in buffer B.

This sequence is not efficient because the SPU has to wait for the transfer to complete before it can compute on the data in the buffer, which wastes much time. Figure 3.5 illustrates this scenario.

Figure 3.5: Serial computation and data transfer. Modified from [24]

This process can be significantly accelerated by using double buffering. Two buffers, B0 and B1, are allocated, allowing computation on one buffer to overlap with the data transfer into the other one. The scheme is shown in Figure 3.6. Double buffering is achieved by using tag-group identifiers [25]: all transfers involving buffer B0 (respectively B1) are issued with tag-group ID 0 (respectively ID 1). The software then sets the tag-group mask to include only tag ID 0 (respectively tag ID 1) and requests a conditional tag status update, which ensures that the computation does not begin before the transfer into that buffer is complete. Figure 3.7 shows the resulting execution in time. Double buffering is used in the project to transfer the data structure from the PPU to the SPU; this structure is described below.
Figure 3.6: Double buffering scheme (initiate DMA transfer from EA to LS buffer B0; initiate DMA transfer to buffer B1; wait for the transfer to B0 to complete; compute on the data in B0; wait for the transfer to B1 to complete; compute on the data in B1; initiate the next transfer to B0; and so on). Modified from [24]

Figure 3.7: Parallel computation and transfer. Double buffering is more efficient than the approach presented in Figure 3.5, as the SPU does not have to wait for the data: one part can be computed in buffer B0 while the next data is being transferred by DMA into B1. Modified from [24]

typedef struct complex { float real; float imag; } complex_t;

typedef struct { complex_t *input; complex_t *output; complex_t *twiddle; int count; } spe_arg_t;

The structure spe_arg_t is passed as argument from the PPU to the SPU. While the computation of one butterfly is being performed on the data brought in by the first buffer transfer, the second buffer is receiving the data for the computation of the next butterfly. Although the constant twiddle factors and the double buffering method have both been implemented, no significant improvement of the data transfer time has been observed (since the results are the same with or without these two methods, the corresponding numbers are not repeated here).
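For completeness, a minimal sketch of the double-buffering pattern on the SPU side is given below, using the MFC DMA intrinsics from spu_mfcio.h. The buffer size, the process_chunk function and the assumption that the total size is a multiple of the chunk size are illustrative choices made here, not the project's exact code.

#include <spu_mfcio.h>

#define CHUNK 1024   /* bytes per DMA transfer (illustrative assumption) */

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process_chunk(volatile char *data, unsigned int size);

/* Stream 'total' bytes (a multiple of CHUNK) from main storage at effective
   address 'ea', overlapping the DMA into one buffer with computation on the
   other.  The tag-group ID equals the buffer index, as described above. */
void stream_and_process(unsigned long long ea, unsigned int total)
{
    unsigned int offset = 0;
    int cur = 0;

    /* Prime the pipeline: start the first transfer into buffer 0. */
    mfc_get(buf[cur], ea + offset, CHUNK, cur, 0, 0);
    offset += CHUNK;

    while (offset < total) {
        int next = cur ^ 1;

        /* Start the next transfer into the other buffer. */
        mfc_get(buf[next], ea + offset, CHUNK, next, 0, 0);
        offset += CHUNK;

        /* Wait only for the transfer into the current buffer, then compute. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);

        cur = next;
    }

    /* Drain the last outstanding transfer and process it. */
    mfc_write_tag_mask(1 << cur);
    mfc_read_tag_status_all();
    process_chunk(buf[cur], CHUNK);
}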

3.2.3.3 Large amount of data

After further considerations, the group members wanted to evaluate whether a larger amount of data must be transferred per DMA request to gain anything from using double buffering; the EIB only becomes efficient if it can work for longer durations of time. So, in a new experiment, instead of sending the input data in 1024 individual transfers, half of the data was sent to the SPU, and the calculations started while the other half was being sent. Although this method has been implemented, no improvement of the computation time has been measured.

3.2.3.4 Computation of several stages on the same SPU

The goal of all the previous optimizations is to reduce the data transfer time. Regarding Figure 3.8, the first four input values (x(0), x(4), x(2), x(6)) are used together in stages 1 and 2, which means that only one transfer from the PPU to the SPU is necessary to compute these four values through stages 1 and 2. If this method is applied to a 1024-point FFT on 4 SPUs, 256 values (1024/4) are transferred to each SPU. Each SPU then computes 128 butterflies (2^7) per stage, so each SPU can compute the first seven stages with only one transfer of 256 values. This optimization is only possible for a power-of-2 number of processors. The last three stages are then computed on the 4 SPUs like the method described in part [].

Figure 3.8: Eight-point decimation-in-time algorithm, implemented on two SPUs. The first two stages are computed with only one transfer from the main storage to the LS. Modified from [24]

The results are interesting: by means of this method, the computation time is improved. For a 1024-point FFT on 2 SPUs, the time without this optimization is 30 ms; with it, the computation time is 7 ms. This result shows two things. Firstly, the data transfer time is indeed the problem (until this point, this was only a supposition): the time is divided by 4.3 simply by sending less data. Secondly, the improvement is not sufficient, because the computation time of the parallel implementation is still larger than that of the sequential one. Another algorithm (Sørensen) has been analysed in Section 2.4.4 and its implementation is developed in the following section; it performs better than Radix-2 DIT in terms of computation time, as shown in Section 3.3.
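To make the "several stages per transfer" idea concrete, the generic sketch below (reusing the cplx type from the earlier radix-2 sketch; the naming is ours, not the project's) runs all the stages that only combine points inside one contiguous, already bit-reversed block of m points, so that a block transferred once to an SPU can be processed through its first log2(m) stages without further transfers.

/* Run the stages of a radix-2 DIT FFT that stay inside one contiguous,
   already bit-reversed block of m points held in Local Store (m and the
   full length n are powers of two, m <= n).  The remaining log2(n/m)
   stages combine points from different blocks and need further transfers. */
static void local_stages(cplx *block, const cplx *w, int m, int n)
{
    for (int len = 2; len <= m; len <<= 1) {       /* only up to the block size */
        int step = n / len;                         /* stride into the full W_n table */
        for (int i = 0; i < m; i += len)
            for (int k = 0; k < len / 2; k++) {
                cplx a = block[i + k];
                cplx b = block[i + k + len / 2];
                cplx t = { b.re * w[k * step].re - b.im * w[k * step].im,
                           b.re * w[k * step].im + b.im * w[k * step].re };
                block[i + k].re           = a.re + t.re;
                block[i + k].im           = a.im + t.im;
                block[i + k + len / 2].re = a.re - t.re;
                block[i + k + len / 2].im = a.im - t.im;
            }
    }
}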

3.3 Sørensen Implementation


3.3.1 Overall Approach
The following tests are carried out on the Sørensen algorithm. The reference results come from Matlab and are the same as those presented in Section 3.2. This implementation allows comparing the differences with the CT algorithm. According to the theoretical analysis in Section 2.4.4, the results should be better (in terms of execution time) with Sørensen than with CT Radix-2, since the Sørensen algorithm divides a large FFT into small FFTs, which facilitates the parallelization. Various tests are performed on Sørensen. However, in order to compare the results with those of CT, the same types of tests as those used for CT

are carried out. Two sequential implementations on the PPU are performed: one with Q set to 2 and the other with Q set to 4 (Q is the number of small FFTs, as seen in Section 2.4.4). Then, the parallel implementation is tested to evaluate the potential improvement. Since there are also two different values for Q (2 and 4), the parallel implementation is performed on 2 and on 4 SPUs. The same parameters as for the CT algorithm are measured. The measurement of the computation time covers the reordering, compute_fft and recombination functions; the function used to measure the time is still gettimeofday [23]. The number of operations per second is evaluated as well, but with a different formula, because the complexity of the computation is not the same as for CT. The formula is given in equation 3.2:

    GFLOPS = (5 * Q * P * log2(P) + 8 * (Q - 1) * L) / total time        (3.2)

where Q is the number of small FFTs, P is the number of input points of each small FFT and L is the desired number of output points. The number of operations per second only covers the computation of the small FFTs and the recombination function; the reordering function has no floating-point computations, as it only consists of data moves.
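To make the roles of Q, P and L concrete, the sketch below shows a generic version of the recombination step of the transform decomposition, reusing the cplx type from the earlier sketch: the Q small P-point FFTs are computed on the decimated subsequences x[n*Q + q], and each of the L desired outputs is then assembled with Q - 1 complex multiply-accumulates, i.e. 8(Q - 1) real operations per output as in equation 3.2. The data layout and naming are assumptions made here, not the project's code.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Recombination step of the transform decomposition.  Xq[q] points to the
   P-point FFT of the decimated subsequence x[n*Q + q]; the first L outputs
   of the full N-point FFT (N = Q*P) are assembled as
       X[k] = sum_{q=0}^{Q-1} W_N^(q*k) * Xq[q][k mod P].
   Each output costs 8*(Q-1) real operations, matching equation 3.2. */
static void recombine(cplx *X, cplx * const *Xq, int Q, int P, int L)
{
    int N = Q * P;
    for (int k = 0; k < L; k++) {
        cplx acc = Xq[0][k % P];                  /* q = 0 term, W_N^0 = 1 */
        for (int q = 1; q < Q; q++) {
            float ang = -2.0f * (float)M_PI * (float)(q * k) / (float)N;
            cplx w = { cosf(ang), sinf(ang) };
            cplx s = Xq[q][k % P];
            acc.re += s.re * w.re - s.im * w.im;  /* complex multiply-accumulate */
            acc.im += s.re * w.im + s.im * w.re;
        }
        X[k] = acc;
    }
}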

3.3.2 Results

The graph in Figure 3.9 shows the computation time of the sequential execution on the PPU. There are two curves: one (continuous red) for a division of the large FFT into two smaller ones (Q = 2) and another (dashed blue) for a division into four smaller ones (Q = 4). The execution time for Q = 2 is always smaller than for Q = 4. This is expected, because the complexity depends on the chosen subdivision factor P, which defines Q, the number of small FFTs performed: the larger the factor Q, the larger the number of computations, so for a sequential execution the time increases with the number of calculations. This explains the behaviour of these measurements. Moreover, the two curves are almost linear, which is also expected, since increasing the number of input points increases the computation time.

Figure 3.10 shows the computation time of the two parallel implementations (Q = 2 and Q = 4, i.e. 2 and 4 SPUs, respectively) according to the FFT length (from 4 to 1024). It appears that the execution time is always larger for the parallelization on 4 SPUs. It can thus be deduced that the problem still comes from the time needed to transfer the data between the PPU and the SPUs, as for the parallel implementation of the CT algorithm. However, the positive aspect in this case is that the computation time becomes almost constant when the FFT length is increased (due to the effect of the pipeline). The computation time of the parallel implementations is still always larger than that of the sequential one. Moreover, the 4-SPU execution is slower than the 2-SPU one; this is an expected result because only 2 data transfers are needed for Q = 2 whereas 4 are needed for Q = 4.

3.3.3 Comparison with the CT algorithm

The goal of this section is to compare the results of the Sørensen implementation with the measurements obtained for the CT Radix-2 DIT implementation. Indeed, although several optimizations have been applied to the CT implementation, its parallel computation times, dominated by the data transfers between the PPU and the SPUs, remain larger than the sequential execution time. Figure 3.11 shows the parallel implementation of the Sørensen algorithm (Q = 2, i.e. 2 SPUs) and of CT (on 6 SPUs). Although the parallel implementation of Sørensen also suffers from the data transfer problem, it is better than Radix-2 DIT in terms of computation time. The explanation is simple: in Sørensen's algorithm, the data are transferred once and only once for the computation of one small FFT, whereas for Radix-2 DIT, even with the optimizations, the data are transferred four times (for N = 1024), as seen in Section 3.2.2. The conclusion is that Sørensen is better suited for parallel implementation than CT, due to its design.

Figure 3.9: Computation time of a sequential Sørensen FFT implemented on the PPU for Q = 2 (continuous red) and Q = 4 (dashed blue) with different FFT lengths (ranging from 4 to 1024).


Figure 3.10: Computation time of a parallel Sørensen FFT implemented on 2 SPUs (Q = 2, continuous red) and 4 SPUs (Q = 4, dashed blue) with different FFT lengths (ranging from 4 to 1024).

Figure 3.11: Comparison of the computation time of a parallel Sørensen FFT for Q = 2 (i.e. 2 SPUs, dashed green), for Q = 4 (i.e. 4 SPUs, dotted blue) and a parallel CT Radix-2 DIT FFT on 6 SPUs (continuous red), with FFT lengths from 4 to 1024.

Figure 3.12: Comparison of the number of operations per second for a parallel Sørensen FFT (Q = 2, dashed blue) and a parallel Radix-2 DIT FFT on 6 SPUs (continuous red) with different FFT lengths (ranging from 4 to 1024).

The time needed for the data transfers thus appears to be the main limiting factor, but with a better handling of these transfers the Sørensen algorithm is most likely well suited to this kind of parallel architecture. Another element of comparison is the number of operations per second. Figure 3.12 shows this quantity (in MFLOPS) for the Sørensen implementation on 2 processors and for the CT Radix-2 DIT algorithm on 6 SPUs, according to the FFT length (from 4 to 1024). Note that this is a different type of measure compared to the execution time: here, larger numbers indicate better performance. The number of floating-point operations per second is larger for the Sørensen implementation than for the Radix-2 one. This can be explained by the fact that the Sørensen algorithm performs more computations than Radix-2: the recombination step performed after the computation of the small FFTs adds extra computations. Furthermore, Figure 3.11 has shown that the Sørensen FFT is faster than CT in terms of computation time. These two observations, combined, explain the trend observed in Figure 3.12.


Chapter 4

Conclusion & Perspectives


4.1 Conclusions
The next step in mobile communication, LTE, focuses on the quality and speed of the data transfer. This is achieved with the help of the OFDM modulator/demodulator, which is the most widely used solution to problems such as ISI or fading. One of the key features of OFDM is the IFFT/FFT pair. To speed up the data transfer provided by OFDM, an improvement of the computation speed of the IFFT/FFT can be sought. With the latest multiprocessor platforms, the speed-up can be improved even further, provided that the data transfers between the different parts of the architecture are well managed.

The goal of this 9th semester ASPI project is to answer the problem defined in Section 1.3 as follows: "How efficient, in terms of computation speed, is the Cell BE processor for the execution of parallelized FFT algorithms?"

First of all, an analysis of the Cell BE has been carried out to determine whether this multicore architecture can speed up a common tool of digital signal processing, the FFT. It appears that the Cell BE, combined with the SIMD approach, constitutes a processor optimized for computation, and is therefore able to improve the execution speed of FFT algorithms. To evaluate the efficiency of parallelized FFT algorithms on the Cell BE processor, two FFT algorithms are used (Radix-2 DIT and Sørensen). The first uses log2(N) steps with N/2 butterflies at each step, which means that each step can be parallelized; it is considered one of the simplest FFT algorithms, not the most efficient but the easiest to establish. The Sørensen algorithm splits the FFT into smaller FFTs, which means that each smaller FFT can be computed on one processor in a parallel schedule.

During the implementation, these algorithms are first executed only on the PPE of the Cell BE. Running them only on the PPE provides a computation without any parallelization, which serves as a reference against the parallelized versions of the same algorithms. Radix-2 DIT is implemented first. The comparison between the PPU-only implementation and the multiple-SPU implementation shows that the data transfers between the PPU and the SPUs cause a waste of time, and the results are unexpected in the sense that they show less efficiency than the unparallelized algorithm. Optimizations, such as the double-buffering method, are applied to reduce the data transfer time, but without any improvement. The Sørensen algorithm is implemented next and shows an improvement of the computation time compared with the Radix-2 DIT implementation. However, the results of this implementation are still below the theoretical computation power that the Cell BE can provide.

4.2 Perspectives
4.2.1 Short term perspectives
The short-term perspective for this project concerns the use of another optimization of the source code, namely pipelining. This method allows data to be transferred while the previously received data are being computed; it is applicable on each SPU and makes it possible to approach one operation per cycle instead of losing cycles on data transfers alone. Another short-term perspective is the optimization of the Sørensen implementation; however, based on the outcome of the Radix-2 DIT optimizations, the hypothesis that Sørensen optimizations will not offer any improvement can be advanced. The implementation of another algorithm dedicated to parallelization could then be studied. This other algorithm could be the Edelman algorithm, which has not been discussed in this report because of a lack of documentation; the Edelman algorithm is assumed to be dedicated to parallelization. In any case, the main issue in this project is that the data transfers between the PPE and the SPEs have to be reduced, because they are the cause of the results explained in Chapter 3.

4.2.2 Long term perspectives

During the implementation of the algorithms, the data have been processed using scalar operations. Another method, using vectors, is assumed to be more efficient. For further use of the Cell BE, a complete OFDM system could be realized on this architecture: the Cell BE is powerful enough and, with good data transfer management, all the features of a transmitter or a receiver could be implemented. Each SPU could handle one part of the transmitter, such as the parallel-to-serial conversion or the addition of the cyclic prefix.
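As a hint of what the vector-based approach mentioned above could look like on an SPU, the sketch below adds blocks of complex samples four at a time with the 128-bit SIMD intrinsics from spu_intrinsics.h, assuming the real and imaginary parts are stored in separate, 16-byte-aligned arrays. This is only an illustration of the direction, not a worked-out vectorized FFT.

#include <spu_intrinsics.h>

/* Add n complex samples (n a multiple of 4) stored as separate, 16-byte
   aligned arrays of real and imaginary parts, four samples per instruction. */
void cadd_simd(vector float *are, vector float *aim,
               const vector float *bre, const vector float *bim, int n)
{
    for (int i = 0; i < n / 4; i++) {
        are[i] = spu_add(are[i], bre[i]);
        aim[i] = spu_add(aim[i], bim[i]);
    }
}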


Bibliography

[1] http://www.reuters.com/article/marketsNews/idINL2917209520071129?rpc=44m (visited on 28/10/2008)

[2] Long Term Evolution: technology developed by 3GPP and other mobile manufacturers and operators. http://www.3gpp.org/article/lte (visited on 10/11/2008)

[3] 3GPP: http://www.3gpp.org/ (visited on 10/11/2008)

[4] Such as Ericsson, Motorola, Nokia-Siemens, Nortel, Alcatel-Lucent, Orange, T-Mobile, Vodafone, Telia Sonera and Telecom Italia Mobile. http://en.wikipedia.org/wiki/3GPP_Long_Term_Evolution#cite_note-9 (visited on 10/11/2008)

[5] WiMAX: http://www.instat.com/catalog/wcatalogue.asp?id=311 (visited on 21/11/2008)

[6] http://www.instat.com/catalog/wcatalogue.asp?id=311 (visited on 21/11/2008)

[7] http://www.3gpp2.org/ (visited on 21/11/2008)

[8] http://www.reuters.com/article/marketsNews/idUSN1335969420081113?rpc=401 (visited on 21/11/2008)

[9] http://www.3gpp.org/article/release-8

[10] http://www.gsmworld.com/technology/hspa.htm (visited on 02/12/2008). Source: Wireless Intelligence, September 2008.

[11] Rohde & Schwarz. UMTS Long Term Evolution (LTE) Technology Introduction. Application Note 1MA111.

[12] Mérouane Debbah. Short introduction to OFDM principles. Mobile Communications Group, Institut Eurecom, 2229 Route des Crêtes, B.P. 193, 06904 Sophia Antipolis Cedex, France (debbah@eurecom.fr).

[13] Chang, R. W. and Gibby, R. A. (1968). A theoretical study of performance of an orthogonal multiplexing data transmission scheme. IEEE Transactions on Communications Technology, 16(4), 529-540.

[14] Originally from en.wikipedia, 2006-09-02. Licensed under the GFDL by Oli Filth; released under the GNU Free Documentation License. (visited on 01/10/2008)

[15] A. Peled and A. Ruiz. Frequency domain data transmission using reduced computational complexity algorithms. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, April 1980, pp. 964-967.

[16] Henrik V. Sørensen and C. Sidney Burrus. Efficient computation of the DFT with only a subset of input or output points. IEEE Transactions on Signal Processing, Vol. 41, No. 3, March 1993.

[17] Yannick Le Moullec. DSP Design Methodology. AAU, 2007. Lecture notes for mm1 of the DSP Design Methodology course, ASPI8-4. http://kom.aau.dk/~ylm/aspi8-4-part1-2007.pdf

[18] Cell Broadband Engine Programming Handbook. IBM Systems and Technology Group, version 1.1, April 2007. Available from IBM.com.

[19] scop3

[20] IBM. SPE Runtime Management Library. IBM Systems and Technology Group, version 2.2, September 2007. Available from IBM.com.

[21] http://en.wikipedia.org/wiki/Discrete_Fourier_transform (visited on November 1st, 2008)

[22] Fast Fourier Transform. http://www.cmlab.csie.ntu.edu.tw/cml/dsp/training/coding/transform/fft.html (visited on 15th of October 2008)

[23] http://www.linux-kheops.com/doc/man/manfr/man-html-0.9/man2/gettimeofday.2.html

[24] Programming on the Cell Broadband. PDF, Section 4.3.7, p. 160.

[25] IBM. Cell Broadband Engine Programming Handbook, Section 19.2.5, p. 513.

[26] IBM. Cell Broadband Engine Programming Handbook, Section 22.1, p. 622.


List of Figures

1.1  Standardization evolution track . . . . . . . . . . . . . . . . . . . . . . . . .  8
1.2  Multipath propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3  Main principle of an OFDM transmitter/receiver . . . . . . . . . . . . . . . . . . 11
1.4  Representation of the OFDM transmitter . . . . . . . . . . . . . . . . . . . . . . 11
1.5  Spectrum efficiency difference (Δf) between FDM and OFDM . . . . . . . . . . . . . 12
1.6  Representation of the OFDM receiver . . . . . . . . . . . . . . . . . . . . . . . 12
1.7  The cyclic prefix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1  The generic A3 design methodology . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2  A3 model for project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3  A3 model for project. Highlighted in red, the algorithms analyzed in this section . 19
2.4  Architecture overview of the Cell Broadband Processor . . . . . . . . . . . . . . 20
2.5  Architecture overview of the PPE . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6  Architecture overview of the SPE . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7  Architecture overview of the SPU Functional Unit . . . . . . . . . . . . . . . . . 22
2.8  The flow for running a simple application using a SPE . . . . . . . . . . . . . . 24
2.9  The flow of running an application using several SPEs . . . . . . . . . . . . . . 25
2.10 Project directory structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.11 Flow for CBE program compilation . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.12 A3 model for project. Highlighted in red, the algorithms analyzed in this section . 28
2.13 Eight point decimation in time algorithm . . . . . . . . . . . . . . . . . . . . . 29
2.14 Example of Sørensen computation principle . . . . . . . . . . . . . . . . . . . . 30
3.1  A3 model for project. Highlighted in red, the mapping developed in this section . . 33
3.2  Computation time of radix 2 FFT on the PPU and one SPU . . . . . . . . . . . . . . 34
3.3  Computation time of a parallel radix 2 FFT on 6 SPUs . . . . . . . . . . . . . . . 35
3.4  Number of operations per second for a parallel radix 2 FFT . . . . . . . . . . . . 36
3.5  Serial computation and data transfer . . . . . . . . . . . . . . . . . . . . . . . 37
3.6  Double Buffering scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.7  Parallel Computing and Transfer . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.8  Eight-point decimation in time algorithm . . . . . . . . . . . . . . . . . . . . . 39
3.9  Computation time of a sequential Sørensen FFT on PPU . . . . . . . . . . . . . . . 41
3.10 Computation time of a parallel Sørensen FFT . . . . . . . . . . . . . . . . . . . 42
3.11 Comparison of the computation time of a Sørensen FFT and CT Radix 2 DIT FFT . . . 43
3.12 Comparison of the number of operations per second for a Sørensen FFT and a Radix 2 DIT FFT . . 44
