05681463

FPGA Based Soft-core SIMD Processing: A MIMO-OFDM Fixed-Complexity Sphere Decoder Case Study
Xuezheng Chu #1, John McAllister #2

# Institute of Electronics, Communications and Information Technology (ECIT), Queen's University Belfast, Belfast, Northern Ireland, UK
1 xchu01@qub.ac.uk
2 j.mcallister@ecit.qub.ac.uk
Abstract: To enable reliable data transfer in next generation Multiple-Input Multiple-Output (MIMO) communication systems, terminals must be able to react to fluctuating channel conditions by having flexible modulation schemes and antenna configurations. This creates a challenging real-time implementation problem: to provide the high performance required of cutting edge MIMO standards, such as 802.11n, with the flexibility for this behavioural variability. FPGA softcore processors offer a solution to this problem, and in this paper we show how heterogeneous SISD/SIMD/MIMD architectures can enable programmable multicore architectures on FPGA with similar performance and cost to traditional dedicated circuit-based architectures. When applied to a 4 × 4 16-QAM Fixed-Complexity Sphere Decoder (FSD) detector we present the first soft-processor based solution for real-time 802.11n MIMO.
978-1-4244-8983-1/10/$26.00 ©2010 IEEE

I. INTRODUCTION

The increasing need for higher data transmission rates across wireless communication channels has seen MIMO schemes [1] adopted in most emerging wireless standards (e.g. 802.11n [2]) due to their ability to obtain superior channel capacity, throughput and diversity over single antenna solutions. The channels across which these systems communicate fluctuate rapidly, and as such future MIMO detectors will require the ability to support variable modulation schemes (QPSK, 16-QAM, 64-QAM, etc.) and antenna configurations to enable reliable signal transmission. From an implementation perspective, this implies the need not only to achieve high performance real-time implementations, but also to have architectures which are flexible enough to support these different system configurations and behaviours.

Current MIMO detectors are mostly static, custom circuit architectures [3]-[9] with only some consideration of the need for behavioural flexibility [10]. As such either flexibility or high real-time performance is offered, but rarely both at once. FPGA provides an ideal platform with which to address this problem; it has already been proven to enable very high performance detector implementations [11], and the ability to host customisable, software programmable softcore processors [12], [13] means FPGA offers more flexibility for application specific customisation of embedded architectures than other technologies. To date, however, these softcore processor architectures have failed to match the performance and resource efficiency of traditional dedicated circuit architectures, or to achieve real-time performance in applications as demanding as 802.11n MIMO.

In this paper, we study the potential for a softcore processor-based implementation strategy to overcome this barrier by exploring a Fixed-Complexity Sphere Decoding (FSD) MIMO detector algorithm case study. We show that, by employing a heterogeneous mix of parallel processors, not only can such architectures achieve real-time performance for cutting edge MIMO systems, they do so in a manner which is comparable in terms of efficiency with custom circuit implementations. This results in the first known real-time processor based implementation of such algorithms on any embedded device technology.

The remainder of this paper is organized as follows. Section II introduces related background about MIMO detection and the FSD algorithm, and analyses current approaches for developing flexible real-time MIMO detector architectures. This motivates the need for a softcore processor based solution. Section III describes the construction of a heterogeneous multi-softcore architecture for 4 × 4 16-QAM FSD. Section IV discusses the real-time performance and efficiency of this architecture with respect to previous comparable work.

II. BACKGROUND AND MOTIVATION

The topology of a generic MIMO communication system is formulated in (1) and shown in Fig. 1. A transmitter sends data s through M transmit antennas across an N × M complex communications channel H, where it is corrupted by multipath distortion and white Gaussian noise v. The received signal y is then sensed by N antennas at the receiver.
Typically, the communications channel is used as a set of parallel flat fading subchannels via Orthogonal Frequency Division Multiplexing (OFDM) at the transmitter, with each subchannel decoded separately at the receiver; for an uncoded system, this imposes a 480 Mbps real-time performance constraint on detection for 4 × 4 16-QAM MIMO systems [2].

$$y = Hs + v \tag{1}$$

Fig. 1. A generic MIMO communication system. [Figure: modulation and mapping feed transmit antennas TX1 to TXM; the wireless channel H connects them to receive antennas RX1 to RXN, where noise v1 to vN is added; the received signals y1 to yN pass through the detector, followed by demodulation and separation.]
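The system model in (1) can be made concrete with a few lines of code. The following is only a minimal sketch using the Python standard library; the function name `mimo_receive`, the random channel and the noise level are illustrative assumptions and do not come from the paper.

```python
import random

# 16-QAM alphabet: real and imaginary parts drawn from {-3, -1, +1, +3}
LEVELS = (-3, -1, 1, 3)
QAM16 = [complex(re, im) for re in LEVELS for im in LEVELS]

def mimo_receive(H, s, noise_std, rng):
    """Return y = H s + v for an N x M channel H and M transmitted symbols s."""
    y = []
    for row in H:
        clean = sum(h * x for h, x in zip(row, s))      # one row of H s
        v = complex(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std))
        y.append(clean + v)                              # add white Gaussian noise
    return y

# Illustrative 4 x 4 example (assumed values, not measured data)
rng = random.Random(0)
M = N = 4
H = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(M)] for _ in range(N)]
s = [rng.choice(QAM16) for _ in range(M)]
y = mimo_receive(H, s, noise_std=0.1, rng=rng)
```

Each transmitted vector carries 4 symbols of 4 bits each, which is why one 4 × 4 16-QAM detection delivers 16 bits.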
The decoding algorithms employed in the receiver typically suffer either from high complexity or, where complexity is reduced, from degraded Bit Error Rate (BER) performance [6], [14]. FSD is a highly effective MIMO detection algorithm [15] which overcomes these problems: it achieves near ideal (quasi-ML) decoding performance with fixed complexity, and is well suited to parallel/pipelined dedicated hardware implementation [11]. It uses a three step tree search scheme, which is illustrated in Fig. 2.
Fig. 2. An FSD tree search architecture. [Figure: tree search in three stages: i. Preprocessing, ii. Metric Calculation, iii. Sorting.]
The three phases of operation described in Fig. 2 are:

1) Preprocessing: Generate an upper triangularized version R of the channel matrix H; initialise the center of the FSD sphere using Zero Forcing (ZF) detection.
2) Metric calculation: Launch a series of iterative Data Slicing (DS) (2) and Accumulated Partial Euclidean Distance (APED) (3) calculations on each estimated detected symbol.
3) Sorting: Sort detected symbols by APED value to find the closest detected symbol.

$$\tilde{s}_i = \hat{s}_i - \sum_{j=i+1}^{M} \frac{r_{ij}}{r_{ii}} \left( s_j - \hat{s}_j \right) \tag{2}$$

$$D_i = r_{ii}^2 \left| s_i - \tilde{s}_i \right|^2 + \sum_{j=i+1}^{M} r_{jj}^2 \left| s_j - \tilde{s}_j \right|^2 = d_i + D_{i+1} \tag{3}$$

Here $\hat{s}_i$ denotes the ZF estimate and $s_j$ a detected (sliced) symbol.
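The three phases can be sketched in software as follows. This is a minimal illustration rather than the authors' implementation: it assumes the conventional FSD node distribution of [15] (every constellation point is tried at the first level, i = M, and a single sliced survivor is kept at each lower level), and it takes the upper-triangular matrix `R` and the ZF estimate `s_zf` as given. All function and variable names are hypothetical.

```python
LEVELS = (-3, -1, 1, 3)                        # per-dimension 16-QAM levels
QAM16 = [complex(a, b) for a in LEVELS for b in LEVELS]

def slice_qam16(z):
    """Data slicing: map z to the nearest 16-QAM constellation point."""
    nearest = lambda x: min(LEVELS, key=lambda lv: abs(lv - x))
    return complex(nearest(z.real), nearest(z.imag))

def fsd_detect(R, s_zf):
    """R: M x M upper-triangular matrix (list of rows); s_zf: ZF estimate."""
    M = len(s_zf)
    best_s, best_D = None, float("inf")
    for cand in QAM16:                         # one tree branch per constellation point
        s = [0j] * M
        s[M - 1] = cand
        # Level i = M: the sum in (2) is empty, so the centre is the ZF estimate
        D = abs(R[M - 1][M - 1]) ** 2 * abs(cand - s_zf[M - 1]) ** 2
        for i in range(M - 2, -1, -1):         # remaining levels, working backwards
            # Equation (2): constrained centre for level i
            centre = s_zf[i] - sum(R[i][j] / R[i][i] * (s[j] - s_zf[j])
                                   for j in range(i + 1, M))
            s[i] = slice_qam16(centre)
            # Equation (3): D_i = d_i + D_{i+1}
            D += abs(R[i][i]) ** 2 * abs(s[i] - centre) ** 2
        if D < best_D:                         # sorting phase: keep the lowest APED
            best_s, best_D = s, D
    return best_s, best_D
```

Note that the loop over `QAM16` is fully data independent: every branch executes the same instruction sequence, which is the property that later makes a 16-way SIMD mapping natural.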
In (2) and (3), $r_{ij}$ refers to an entry in R, $\tilde{s}_i$ is the center of the constrained FSD sphere, $D_{i+1}$ can be considered as the APED at tree level j = i + 1 and $d_i$ as the PED at level i. Thus, the APED in (3) can be obtained recursively by starting from level i = M and working backwards until level i = 1.

The intriguing aspect of the FSD algorithm is how it changes with varying channel conditions. In particular, the number of tree branches depends on the constellation size of the modulation scheme employed; accordingly, when reconfigured for 64-QAM, the tree topology extends to 64 branches. Furthermore, the number of antennas defines the depth of the tree. Hence, the behaviour of the FSD algorithm is entirely independent of the actual received data, but totally dependent on the parameters (modulation scheme, number of antennas) of the communications scheme employed. By extension, the change in modulation scheme and number of antennas necessitated by variable channel conditions changes the dimensionality of the FSD algorithm, and the demands of the real-time implementation, in a highly regular fashion.

The regular, parallel FSD tree structure enables implementations with high real-time performance [3], [4], but the resulting architectures can only realise specific algorithm configurations; when reconfigurability is added to such an approach there is a significant increase in hardware cost [7], [9]-[11]. Some work has investigated Xilinx Microblaze [5] and DSP processor [8] based implementation, but these suffer from real-time performance restrictions. There is significant scope for a new implementation approach which can achieve both high real-time performance and flexibility.

Since such implementations will require flexibility in architecture and behaviour, softcore processors offer a very attractive solution. Current softcore processors such as Xilinx Microblaze do not provide high enough performance, and are too resource expensive to consider for such architectures given the high real-time demands of standards such as 802.11n; further, whilst Single Instruction Multiple Data (SIMD) accelerators such as VIPERS [12] or VESPA [13] boost the performance of these processors, they quickly experience performance and cost bottlenecks which restrict their applicability.

As such the challenge remains for a softcore processor-based design approach to present a credible alternative which provides high enough performance and efficiency to be competitive with custom circuit-based implementations, but which also has similar resource cost and enables sufficient flexibility to realise variable topologies for algorithms such as FSD by reconfiguring and reprogramming arrays of soft processors. In this paper we address the first issue; we wish to show that a softcore processor based implementation can match dedicated hardware architectures in terms of both performance and cost. We propose to examine the use of softcore processors for construction of flexible, real-time implementations of FSD for MIMO-OFDM systems. In Section III, we describe the evolution of such an architecture for FSD, and in Section IV we examine the performance of the results against state-of-the-art dedicated circuit based solutions.

III. PARALLEL SOFT-CORE PROCESSING FOR FSD

A. SISD Softcore Processing for FSD

We propose a simple, highly resource efficient Single Instruction Single Data (SISD) processor architecture, known as the FPGA Processing Element (FPE), as the basis of the parallel processor array motivated in Section II. In order to maximise its flexibility it exhibits two key softcore processor characteristics: configurability, i.e. the ability of the designer to customise the processor architecture to their application in a set of pre-defined ways, and programmability, i.e. the ability to make the architecture mimic any given functionality by issuing sequences of instructions to a central arithmetic unit. Fig. 3 illustrates the FPE architecture.
Fig. 3. The FPE architecture. [Figure: PC, PM, ID, RF, ALU built around a DSP48E with SWITCH and MIN units, MEM, COMM and PAUSE blocks.]

TABLE II
FPE INSTRUCTION SET

Type  Instruction   Function                                Vec.
CTRL  LOOP/RPT      loop/repeat                             Yes
CTRL  BEQ/BGT/BLT   branch if equal/greater/less than       No
CTRL  JMP           jump                                    Yes
CTRL  GET           load data from channel (static)         Yes
CTRL  PUT           push/broadcast data onto channel        Yes
CTRL  GETCH         load data from channel (dynamic)        Yes
CTRL  CLRCH         clear all channels                      Yes
CTRL  NOP           no operation                            Yes
ALU   MULSUB        multiply-subtract                       Yes
ALU   MULADD        multiply-add                            Yes
ALU   MULSUBFWD     MULSUB and forward                      Yes
ALU   MULADDFWD     MULADD and forward                      Yes
ALU   SWITCH        two operand switch                      Yes
ALU   MIN           two operand minimum                     Yes
MEM   LD            load data from memory                   Yes
MEM   ST            store data to memory                    Yes
The FPE is a four-stage pipelined Reduced Instruction Set Computing (RISC) architecture composed of a Program Counter (PC), Program Memory (PM), Instruction Decoder (ID), Register File (RF), Arithmetic Logic Unit (ALU), Memory (MEM), Communication Adapter (COMM), Pause Control (PAUSE) and FIFO components. The configurable aspects of the FPE are described in Table I. The FPE supports a set of 20 instructions (minimising the number of distinct instruction types leads to simple hardware in the ID unit), described in terms of three instruction classes (Processor Control (CTRL), Arithmetic (ALU), Memory (MEM)) in Table II.

TABLE I
FPE CONFIGURATION PARAMETERS

Parameter  Meaning                          Values
DataWidth  Wordsize of data processed       16/32 bits
DataType   Type of data processed           Real/complex
ALUWidth   No. DSP48E slices in the ALU     1-4
PMDepth    No. program memory locations     Unlimited
PWWidth    Size of instruction word         Unlimited
DMDepth    No. data memory locations        Unlimited
RFDepth    No. registers in RF              Unlimited
TxCOMM     Tx port number                   1024
RxCOMM     Rx port number                   1024

The ALU is based around the DSP48E slice on Xilinx FPGA. As Table I shows, the processor can be configured for either 16 or 32 bit real or complex arithmetic modes, and each mode can be specified to exploit 1-4 DSP48E slices. Table III describes the performance of the FPE on Virtex 5 VLX110T FPGA. The 483 (431) MMACs/s for 16 (32) bit arithmetic are unmatched by any other FPGA softcore architecture; if all DSP48Es on this device were exploited by FPEs, between 31 and 970 GMACs/s computing capacity would potentially be available on Virtex 5/6 FPGA, 40 times more than that on the TI C647+ series multi-core DSP processor [16].

TABLE III
SISD DATAPATH AND PERFORMANCE

Datapath         Speed (MHz)  Resource (LUTs)  DSP48Es  Latency (Cycles)  Throughput (MMACs/s)
16-bit real      483          90               1        4                 483
16-bit cplx (1)  476          132              1        7                 119
16-bit cplx (2)  453          172              2        5                 226.5
16-bit cplx (4)  474          140              4        5                 474
32-bit real (1)  466          215              1        6                 153.3
32-bit real (2)  431          185              2        6                 215.5
32-bit real (3)  431          182              3        7                 431
The FPE offers very high performance when performing simple arithmetic operations (e.g. multiplications, additions and subtractions). This performance reduces when data dependencies necessitate branch instructions, but in many cases the impact of such problematic instructions may be reduced by exploiting datapath configurability, i.e. the ability to extend the datapath architecture and define custom instructions for it. Consider two such problematic operations which arise in FSD: the need for four-way variable selection during slicing, and the need to select the lowest of a series of values during the sort operation. Each of these operations can be accelerated by employing custom datapath units as shown in Fig. 3: a SWITCH unit is used to accelerate the four-way comparison in slicing, and a MIN unit to accelerate the two-way comparison required during the sort operation (the MIN unit is implemented in the output multiplexer). This architectural manipulation is accompanied by an extension of the instruction set to include SWITCH and MIN instructions, which access these accelerators rather than the DSP48E. The extra resource cost associated with these two units is only 20 LUTs per FPE but, as shown in Table IV, they can enable a 68% reduction in memory requirements, and a throughput increase by a factor of 2.3 for a 4 × 4 16-QAM FSD relative to a SISD FPE without this capability.
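To illustrate why four-way selection is costly without hardware support, the sketch below contrasts a branching slicer with a select-style slicer for one real dimension of 16-QAM. It is a hypothetical software analogy only: the paper does not specify the operand semantics of the SWITCH instruction, and the decision thresholds (-2, 0, +2) assume the unnormalised levels {-3, -1, +1, +3}.

```python
def slice_level_branching(x):
    """Four-way selection written as a chain of data-dependent branches."""
    if x > 2:
        return 3
    if x > 0:
        return 1
    if x > -2:
        return -1
    return -3

def slice_level_select(x):
    """The same decision as a table lookup indexed by three comparisons."""
    region = (x > -2) + (x > 0) + (x > 2)    # 0..3, no branching on the data
    return (-3, -1, 1, 3)[region]
```

The second form maps onto a single selection step, which is the kind of operation a custom datapath unit can absorb in place of several branch instructions.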
Fig. 4. SIMD Processor Architecture. [Figure: a shared PC, PM and vector instruction decoder (ID_VEC) broadcast decoded instructions and mapped read/write addresses to 16 processing elements, each with its own register file (RF1 to RF16) and ALU (ALU1 to ALU16).]

TABLE IV
PERFORMANCE COMPARISON FOR SWITCH/MIN UNIT (4 × 4 16-QAM FSD)

Configuration     Instructions  Memory (LUTs)  Clock Cycles  Throughput (Mbps)
SWITCH/MIN        1420          805            1420          4.5
SWITCH/no-MIN     1519          840            1489          4.2
no-SWITCH/MIN     4486          2485           3206          2.0
no-SWITCH/no-MIN  4591          2520           3281          1.9
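The 68% memory reduction and factor 2.3 throughput gain quoted for the SWITCH and MIN units follow directly from the first and last rows of Table IV. The short check below reproduces them; the dictionary layout is just an illustrative way of holding the table values.

```python
# Values copied from Table IV (4 x 4 16-QAM FSD on a single SISD FPE)
table_iv = {
    "SWITCH/MIN":       {"instructions": 1420, "memory_luts": 805,  "cycles": 1420},
    "SWITCH/no-MIN":    {"instructions": 1519, "memory_luts": 840,  "cycles": 1489},
    "no-SWITCH/MIN":    {"instructions": 4486, "memory_luts": 2485, "cycles": 3206},
    "no-SWITCH/no-MIN": {"instructions": 4591, "memory_luts": 2520, "cycles": 3281},
}

base = table_iv["no-SWITCH/no-MIN"]
best = table_iv["SWITCH/MIN"]

# Fraction of program memory saved by adding both units
memory_reduction = 1 - best["memory_luts"] / base["memory_luts"]
# At a fixed clock rate, throughput scales inversely with cycles per detection
cycle_speedup = base["cycles"] / best["cycles"]

print(f"memory reduction: {memory_reduction:.1%}")
print(f"speed-up: {cycle_speedup:.2f}x")
```

The table also shows that almost all of the gain comes from SWITCH: removing MIN alone costs only 69 cycles, whereas removing SWITCH more than doubles the cycle count.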
For 802.11n using 4 × 4 16-QAM FSD based on OFDM transmission, independent FSD calculations on 108 data subcarriers are required [2]. The maximum throughput of the SISD FPE (4.5 Mbps) is sufficient to perform the FSD on a single subcarrier, since the 480 Mbps constraint shared across 108 subcarriers amounts to approximately 4.44 Mbps each; this implies that simply replicating this processor in a 108 processor Multiple Instruction Multiple Data (MIMD) array would be sufficient to achieve real-time performance for the entire subcarrier spectrum. However, such a MIMD strategy will duplicate the same program 108 times in 108 different program memories; as such, a SIMD architecture offers a cost saving by centralising this program into a single resource. Very large SIMD architectures, on the other hand, suffer performance degradation on FPGA due to the high fan-out of instructions to multiple datapaths. Given this, and the mixed parallelism in the FSD algorithm (task parallel subcarriers, data parallel APED/Slicing and serial Sorting), an arrangement where the 108 subcarriers are mapped across a MIMD array of smaller SIMD processors (computing the APED/Slicing calculations for each subcarrier), each paired with a serial SISD FPE (to implement the sequential sort functionality), may offer an ideal compromise for real-time implementation. Such a solution is outlined in Section III-B.

B. SIMD Processor Arrays for FSD

The SIMD processors used are composed of a configurable number of FPE units, as shown in Fig. 4; the PMDepth parameter of each FPE is set to zero, allowing the configurable SIMD processor to exploit its own centralized PM, but all remaining aspects of FPE configuration are open to user control. Table V defines the configurable aspects of the SIMD processor. All of the FPE instructions (except the branch instructions BEQ, BGT and BLT) are vectorisable to match the expansion of this unit to a SIMD architecture.
TABLE V
SIMD PROCESSOR CONFIGURATION PARAMETERS

Parameter  Meaning                                         Values
SIMDways   No. parallel FPE elements                       Unlimited
Pipeline   Instruction broadcast pipeline tree breadth     Unlimited

Fig. 5 shows how the performance of the SIMD processor changes with the number of ways. As this shows, performance scales linearly for SIMDways < 20 approximately, but steadily diminishing returns are observed beyond this point. The 4-way SIMD variant offers throughput of around 1,483 MMACs/s, rising to 28,736 MMACs/s when the processor scales to 120 ways. This is a factor of four increase in performance over the FPGA-based SIMD soft processor in [17], with much smaller hardware resources (only 9% and 50% of the LUT and DSP slice cost respectively).

Fig. 5. SIMD Processor Performance. [Plot "SIMD Performance vs. SIMD issues": SIMD processor throughput (MMACs/s, ×10^4) against the number of SIMD issues, from 0 to 120.]
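The diminishing returns can be quantified from the two throughput figures quoted above. The calculation below is a simple sketch; taking the 4-way configuration as the reference for ideal linear scaling is an assumption made here for illustration.

```python
# Throughput figures quoted in the text (MMACs/s)
throughput = {4: 1483, 120: 28736}

per_way_4 = throughput[4] / 4          # MMACs/s delivered by each way at 4 ways
per_way_120 = throughput[120] / 120    # MMACs/s delivered by each way at 120 ways

# Fraction of ideal linear scaling retained at 120 ways
scaling_efficiency = per_way_120 / per_way_4

print(f"{per_way_4:.1f} vs {per_way_120:.1f} MMACs/s per way "
      f"({scaling_efficiency:.1%} of linear scaling)")
```

On these numbers each way of the 120-way processor delivers roughly two thirds of what a way delivers in the 4-way processor, which is consistent with the instruction fan-out penalty discussed in Section III-A.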
To maintain a high performance/resource balance, we choose to exploit 16-way SIMD processors. Accordingly, to implement the 802.11n OFDM-MIMO 4 × 4 16-QAM FSD decoder, we propose an architecture composed of 12 groups, each combining a 16-way SIMD processor and a single SISD FPE, communicating via simple FIFOs through convergent communication ports. The resulting architecture, and the mapping of OFDM subcarriers to processor groups, is shown in Fig. 6. The fixed point measurement in [11] shows that 16 bit data is sufficient for reliable decoder performance, and as such this is chosen as the data width of the ALU in both the SIMD processor and the SISD FPE, which exploit a real-valued datapath. Both processors have PMDepth = 128, RFDepth = 32 and DMDepth = 0. The SIMD processor is configured to support SWITCH instructions to accelerate the performance of the slicing operation, whilst the SISD is configured with the MIN instruction and ALU extension to accelerate the sort operation. Fig. 7 and Fig. 8 show code fragments for the programs implemented on the SIMD and SISD processors respectively.
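The distribution of the 108 data subcarriers over the 12 processor groups can be sketched as below. The round-robin assignment (subcarriers 1, 13, ..., 97 to the first group, and so on) is an assumption read from the labels in Fig. 6, and the names `group_for` and `schedule` are hypothetical.

```python
NUM_SUBCARRIERS = 108
NUM_GROUPS = 12            # each group: one 16-way SIMD processor plus one SISD FPE

def group_for(subcarrier):
    """Map a 1-based subcarrier index to a 1-based processor group (round-robin)."""
    return (subcarrier - 1) % NUM_GROUPS + 1

# Subcarriers served by each group, in the order they would be processed
schedule = {g: [k for k in range(1, NUM_SUBCARRIERS + 1) if group_for(k) == g]
            for g in range(1, NUM_GROUPS + 1)}

# Share of the 480 Mbps constraint carried by one subcarrier and by one group
per_subcarrier_mbps = 480 / NUM_SUBCARRIERS
per_group_mbps = 480 / NUM_GROUPS
```

Each group therefore serves 9 subcarriers and must sustain 40 Mbps, compared with the 4.5 Mbps of a single SISD FPE, which is why the 16 tree branches of each detection are spread across the 16 SIMD ways.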
Fig. 6. SISD/SIMD Processor Array Architecture and FSD Mapping. [Figure: the ZF data of the 108 OFDM subcarriers are distributed across 12 processor arrays (Array 1 to Array 12), with subcarriers 1, 13, ..., 97 assigned to one array through to subcarriers 12, 24, ..., 108 assigned to another. Each array pairs a 16-way SIMD core, whose ways receive CH_IDX, CONST, ZF1 to ZF4 and CH_COEF data and produce APEDs, with an SISD core; both are programmed from an assembly code file.]

    //SISD Core Init
    #1   GET ch0 r1             //get ch0 idx0
    #2   GET ch1 r2             //get ch1 idx1
    ...
    #17  GET ch15 r16           //get ch15 idx15
    //run
    #18  GET ch0 r17            //get ch0 aped0
    #19  GET ch1 r18            //get ch1 aped1
    ...
    #24  SUB r27, r31, r18, r17
    #25  MIN r29, r17, r18      //min aped0,aped1
    #26  SUB r27, r31, r18, r17
    #27  MIN r30, r1, r2        //min idx0,idx1
    ...
    #94  SUB r27, r31, r17, r29
    #95  MIN r29, r29, r17      //min aped_min,aped15
    #96  SUB r27, r31, r17, r29
    #97  MIN r30, r30, r16      //min idx_min,idx15
    ...
    #101 GET r17, r30           //get detected data
    ...
    #106 PUT r17                //put detected data
    ..
    #110 CLRCH                  //clear channel
    #111 JMP #18                //jump to decode new symbol

Fig. 8. SISD Assembly Code
    //SIMD Core Init
    #1   VGET chidx r19                  //get ch index
    #2   VGET U11 r1                     //get ch coefficient
    #3   VGET U12_real r2
    #4   VGET U12_imag r3
    ...
    //run
    #23  VGET zf4_real r25               //get zf symbols
    #24  VGET zf4_imag r26
    ...
    #28  VSUB r25, r31, r25, r17         //Compute c41-zf4
    #29  VSUB r26, r31, r26, r18
    ...
    #33  VMUL r27, r25, r25, r0
    #34  VMULADDFWD r27, r26, r26, r0    //Compute |c41-zf4|^2
    #35  VMULSUB r28, r14, r25, r23
    #36  VMULADDFWD r28, r15, r26, r0
    #37  VMULSUB r29, r14, r26, r24
    #38  VMULSUBFWD r29, r15, r25, r0    //Compute tempzf1
    #39  VMUL r30, r16, r27, r0          //Compute aped1
    #40  VMUL r21, r11, r25, r0
    #41  VNOP
    #42  VSWITCH r19, r28                //Slicing c31
    #43  VSWITCH r20, r29
    ...
    #117 VPUT r30
    ...
    #121 VJMP #23

Fig. 7. 16-way SIMD Assembly Code

The operation of the SIMD core (Fig. 7) is composed of three main stages.

1) In every OFDM frame period, the SIMD processor loads the latest channel matrix coefficients and constellation points from the input FIFOs (lines #1 to #22 in Fig. 7).
2) The SIMD then loads the ZF symbols as the center of the sphere for the current target symbols to decode (#23 - #27), computes the APED and slicing for 4 levels (#28 - #116), and pushes the results into the respective FIFOs to the scalar core (#117 - #121).
3) The process jumps either to #23 to reload new ZF data for the next symbol, or to #1 to reload the new channel coefficients when a new OFDM frame is received.

The operation of the SISD core (Fig. 8) is also composed of three stages.

1) The SISD sequentially loads data from the FIFOs to initialise the channel index when the decoder receives a new OFDM frame (#1 - #17 in Fig. 8).
2) From #18 to #111, APEDs are read from each SIMD issue via the input FIFOs and compared, with the lowest retained along with its channel index; the index is then used to guide the SISD to read from the channel housing the appropriate result (i.e. that with the lowest APED).
3) The SISD jumps either to #18 to reload new APED data (for a new symbol) or to #1 to reload the channel index when a new OFDM frame is received.

The number of cycles assigned to specific tasks for both the SIMD and SISD cores is described in Table VI.

TABLE VI
SIMD/SISD CYCLE-BY-CYCLE OPERATION BREAKDOWN

Processor  Function                      No. Cycles  % Of Total
SIMD       Data Initialisation           22          19%
SIMD       APED/Slicing/Communication    99          81%
SISD       Data Initialisation           17          22%
SISD       Sorting                       94          78%
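The sorting stage of the SISD core amounts to a running minimum over the 16 APEDs that also tracks which channel produced the winner. A minimal software sketch follows; the function name is hypothetical and the comparison is written as an ordinary conditional, whereas the FPE performs it with paired SUB and MIN instructions as in Fig. 8.

```python
def select_min_aped(apeds):
    """Return (channel index, APED) of the lowest accumulated distance.

    Mirrors the sequential compare-and-keep pattern of the SISD program:
    one pass, keeping the running minimum and the channel it came from.
    """
    best_idx, best = 0, apeds[0]
    for idx in range(1, len(apeds)):
        if apeds[idx] < best:          # strict comparison: the first minimum wins ties
            best_idx, best = idx, apeds[idx]
    return best_idx, best

# Example with four channels; the SISD core would then read the detected
# symbols from the FIFO of the winning channel.
winner, distance = select_min_aped([5.0, 2.5, 7.0, 2.5])
```

Because each comparison depends on the result of the previous one, this stage is inherently serial, which is the reason it is assigned to a scalar core rather than to the SIMD processor.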
IV. RESULTS AND DISCUSSION

Table VII summarizes the performance and cost of this implementation on Xilinx Virtex 5 VSX240T FPGA; as this shows, 513.5 Mbps throughput can be achieved, which exceeds the 480 Mbps requirement of 802.11n. The processor array takes 83 ns for matrix coefficient initialisation and 730 ns for decoding, well within the 16 µs Short Inter-Frame Space (SIFS) period for 802.11n. To the best of our knowledge, this is the first time these real-time performance landmarks have been achieved by any software programmable processor, and certainly by any softcore architecture on FPGA.
TABLE VII
COMPARISON OF 4 × 4 16-QAM FSD IMPLEMENTATIONS

Ref        Device     Clock (MHz)  Resource                                 Throughput (Mbps)  Programmability
[11]       Virtex II  150          16,119 LUTs, 160 DSP48s, 82 BRAMs        600                No
[11]       Virtex 5   150          13,197 LUTs, 160 DSP48Es, 49 BRAMs       600                No
[7]        Virtex 4   35           10,745 LUTs                              140                No
This work  Virtex 5   265          23,728 LUTs, 204 DSP48Es                 513.5              Yes
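The reported timing and throughput figures can be approximately reproduced from the cycle counts in Table VI and the 265 MHz clock in Table VII. The derivation below is one plausible reading, not one stated in the paper: it assumes that each of the 12 groups delivers one 16-bit detection (4 symbols of 4 bits) every 99 SIMD cycles, and that the decoding latency is the SIMD metric phase followed by the SISD sort.

```python
CLOCK_MHZ = 265                 # Table VII
GROUPS = 12                     # SIMD/SISD processor groups
BITS_PER_DETECTION = 4 * 4      # 4 antennas x 4 bits per 16-QAM symbol

SIMD_INIT_CYCLES = 22           # Table VI
SIMD_RUN_CYCLES = 99            # Table VI
SISD_SORT_CYCLES = 94           # Table VI

# cycles / MHz gives microseconds; multiply by 1e3 for nanoseconds
init_ns = SIMD_INIT_CYCLES / CLOCK_MHZ * 1e3
decode_ns = (SIMD_RUN_CYCLES + SISD_SORT_CYCLES) / CLOCK_MHZ * 1e3

# bits x MHz / cycles gives Mbps
throughput_mbps = GROUPS * BITS_PER_DETECTION * CLOCK_MHZ / SIMD_RUN_CYCLES

print(f"init {init_ns:.1f} ns, decode {decode_ns:.1f} ns, "
      f"throughput {throughput_mbps:.1f} Mbps")
```

Under these assumptions the sketch gives about 83 ns, 728 ns and 513.9 Mbps, close to the reported 83 ns, 730 ns and 513.5 Mbps; the small differences would be consistent with the clock rate being rounded in Table VII.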
Table VII compares the performance of the FPE-based 4 × 4 16-QAM FSD with other recently published implementations. The implementation in [7] does not represent a real-time solution, with [11] the current real-time standard; when retargeted to Virtex 5 VSX240T, [11] requires 13,197 LUTs, 160 DSP48Es and 49 BRAMs. Based on [18] we can create equivalent LUT estimates of this and our FSD implementations; the work in [11] translates to 168,039 equivalent LUTs, but since the FPE-based solution avoids BRAMs, its 139,804 equivalent LUTs represent a saving of 16.2%, whilst achieving uncoded real-time performance for 802.11n MIMO. As such, this softcore-based array offers lower resource cost to meet the 802.11n throughput standard, in addition to being flexible enough to reach several lower cost solutions which [11] cannot.

V. CONCLUSION

A heterogeneous parallel soft-core processor architecture and implementation strategy for FPGA has been presented and applied to MIMO FSD decoding in this paper. We have shown that by combining the use of SISD, SIMD and MIMD architecture styles, we can enable architectures which not only meet the demanding real-time requirements of such systems, but are highly competitive with dedicated circuits in terms of resource efficiency. We believe that employing this kind of heterogeneous multicore implementation approach represents the most effective way not only to exploit the high performance on-chip processing resources on modern FPGA, but also to customize the implementation architecture to the variety of flavours of parallelism in any such application. This technique offers the potential to reduce the architecture design problem for DSP systems in general, and dynamic DSP design problems in particular, from a custom circuit design problem to a more tractable processor array configuration and programming problem, with no reduction in implementation efficiency or performance.
ACKNOWLEDGMENTS

This work is part of the Islay Project, and is supported through EPSRC Research Grant EP/F031017/1. The authors would like to thank Prof Roger Woods, Dr John Thompson, Chengwei Zheng, Dr Sujit Bhattacharya and Matthew Milford for their valuable assistance in this work.

REFERENCES
[1] P. Wolniansky, G. Foschini, G. Golden, and R. Valenzuela, "V-BLAST: An Architecture for Realizing Very High Data Rates Over The Rich-Scattering Wireless Channel," in 1998 URSI International Symposium on Signals, Systems, and Electronics. Conference Proceedings, 1998, pp. 295-300.
[2] "802.11n-2009 IEEE Local and metropolitan area networks - Specific requirements Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications Amendment 5: Enhancements for Higher Throughput," p. 536, 2009.
[3] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bolcskei, "VLSI Implementation of MIMO Detection Using The Sphere Decoding Algorithm," Solid-State Circuits, IEEE Journal of, vol. 40, no. 7, pp. 1566-1577, Jul. 2005.
[4] J. Antikainen, P. Salmela, O. Silven, M. Juntti, J. Takala, and M. Myllyla, "Application-Specific Instruction Set Processor Implementation of List Sphere Detector," in Conference Record of the Forty-First Asilomar Conference on Signals, Systems and Computers ACSSC 2007, Nov. 2007, pp. 943-947.
[5] X. Huang, C. Liang, and J. Ma, "System Architecture and Implementation of MIMO Sphere Decoders on FPGA," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 16, no. 2, pp. 188-197, 2008.
[6] M. Li, B. Bougard, W. Xu, D. Novo, L. Van Der Perre, and F. Catthoor, "Optimizing Near-ML MIMO Detector for SDR Baseband on Parallel Programmable Architectures," Design, Automation and Test in Europe, 2008. DATE '08, pp. 444-449, Mar. 2008.
[7] P. Bhagawat, R. Dash, and G. Choi, "Architecture for Reconfigurable MIMO Detector and its FPGA Implementation," in 2008 15th IEEE International Conference on Electronics, Circuits and Systems, Aug. 2008, pp. 61-64.
[8] J. Janhunen, O. Silvén, and M. Juntti, "Programmable Processor Implementations of K-best List Sphere Detector for MIMO Receiver," Signal Process., vol. 90, no. 1, pp. 313-323, 2009.
[9] M. S. Khairy, M. M. Abdallah, and S. E.-D. Habib, "Efficient FPGA Implementation of MIMO Decoder for Mobile WiMAX System," in 2009 IEEE International Conference on Communications, Jun. 2009, pp. 1-5.
[10] P. Bhagawat, R. Dash, and G. Choi, "Array Like Runtime Reconfigurable MIMO Detectors for 802.11n WLAN: A Design Case Study," 2009 Asia and South Pacific Design Automation Conference, pp. 751-756, Jan. 2009.
[11] L. Barbero, "Rapid Prototyping of a Fixed-Complexity Sphere Decoder and its Application to Iterative Decoding of Turbo-MIMO Systems," Ph.D. dissertation, The University of Edinburgh, 2006.
[12] J. Yu, C. Eagleston, C. H.-Y. Chou, M. Perreault, and G. Lemieux, "Vector Processing as a Soft Processor Accelerator," ACM Trans. Reconfigurable Technol. Syst., vol. 2, no. 2, pp. 1-34, 2009.
[13] P. Yiannacouras, J. G. Steffan, and J. Rose, "Fine-Grain Performance Scaling of Soft Vector Processors," in International Conference on Compilers, Architecture and Synthesis for Embedded Systems, 2009.
[14] Z. Guo and P. Nilsson, "Algorithm and Implementation of The K-best Sphere Decoding for MIMO Detection," Selected Areas in Communications, IEEE Journal on, vol. 24, no. 3, pp. 491-503, Mar. 2006.
[15] L. G. Barbero and J. S. Thompson, "Fixing the Complexity of the Sphere Decoder for MIMO Detection," Wireless Communications, IEEE Transactions on, vol. 7, no. 6, pp. 2131-2142, Jun. 2008.
[16] M. Milford and J. McAllister, "An Ultra-fine Processor for FPGA DSP Chip Multiprocessors," in 2009 Conference Record of the Forty-Third Asilomar Conference on Signals, Systems and Computers, 2009, pp. 226-230.
[17] J. Cho, H. Chang, and W. Sung, "An FPGA Based SIMD Processor With A Vector Memory Unit," 2006 IEEE International Symposium on Circuits and Systems, p. 4, 2006.
[18] D. Sheldon, R. Kumar, R. Lysecky, F. Vahid, and D. Tullsen, "Application-Specific Customization of Parameterized FPGA Soft-Core Processors," in Computer-Aided Design, 2006. ICCAD '06. IEEE/ACM International Conference on, 2006, pp. 261-268.
