
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 60, NO. 5, MAY 2013

VLSI Implementation of a High-Throughput Iterative Fixed-Complexity Sphere Decoder


Xi Chen, Guanghui He, Member, IEEE, and Jun Ma
Abstract: By exchanging soft information between the multiple-input multiple-output (MIMO) detector and the channel decoder, an iterative receiver can significantly improve performance compared with a noniterative receiver. In this brief, a soft-input soft-output fixed-complexity sphere-decoding algorithm and its very large scale integration architecture are proposed for the iterative MIMO receiver. The deeply pipelined architecture employs an optimized hybrid enumeration to search for the best child node estimate efficiently. By adding the counterhypotheses in parallel with other candidates, the proposed iterative MIMO detector improves the detection performance significantly with low detection latency. An iterative detector for a 4 × 4 64-quadrature amplitude modulation (QAM) MIMO system based on the proposed architecture is designed and implemented in a 90-nm CMOS technology. The detector achieves a maximum throughput of 2.2 Gbit/s with an area efficiency of 3.96 Mbit/s/kGE, which is more efficient than other iterative MIMO detectors.

Index Terms: Fixed-complexity sphere decoding (SD) (FSD), multiple-input multiple-output (MIMO), soft-input soft-output (SISO) MIMO detection, very large scale integration (VLSI).

I. INTRODUCTION

Multiple-input and multiple-output (MIMO) technology has been widely applied in wireless communications since it offers significant increases in data throughput and link range without additional bandwidth or increased transmit power. By incorporating MIMO with bit-interleaved coded modulation with iterative detection and decoding (BICM-IDD), the channel capacity can be approached [1] at the cost of much higher complexity and lower throughput compared with noniterative schemes. Thus, it is very important to develop a high-speed iterative detector to meet the increasing demand for gigabit-per-second wireless systems such as the IEEE 802.11ac wireless local area network (WLAN) and 3GPP LTE-Advanced.

Due to its practical importance, the very large scale integration (VLSI) design of soft-input soft-output (SISO) detectors has recently received a lot of attention. The first reported implementation of a SISO MIMO detector is based on the minimum mean square error parallel interference cancellation (MMSE-PIC) algorithm [2], but it cannot fully exploit the spatial diversity provided by MIMO. To overcome this limitation, implementations of SISO single tree search (STS) sphere decoding (SD) [3], [4] have been presented, which achieve max-log maximum a posteriori (MAP) performance. However, like other depth-first tree-search algorithms, STS-SD suffers from variable throughput and complexity depending on the signal-to-noise ratio (SNR). More recently, a novel SISO detection algorithm based on trellis search and its VLSI architecture have been proposed in [5], which provides a peak throughput of 1.7 Gbit/s, but it consumes a large silicon area and has difficulty supporting high-order modulation [e.g., 64 quadrature amplitude modulation (QAM)].

Fixed-complexity SD (FSD) is a breadth-first tree-search algorithm previously proposed for hard-output MIMO detection. It is capable of providing near maximum likelihood (ML) detection performance with fixed and low complexity [6]. A highly efficient silicon implementation of FSD is reported in [7], which achieves a 1.98-Gbit/s detection throughput with a parallel multistage VLSI architecture. It is therefore attractive to extend the hard-output base architecture to support iterative MIMO detection.

In this brief, we present, to the best of our knowledge, the first VLSI architecture of list-based SISO FSD. Based on the architecture presented in [7], we propose an optimized hybrid enumeration (HE) for iterative SISO FSD to find the best child estimate with low complexity. Meanwhile, candidates with counterhypotheses are added by bit flipping of the best child of the MAP estimate to improve the quality of the generated soft information. Implemented in a 90-nm CMOS technology, our proposed architecture for a 4 × 4 64-QAM spatial multiplexing iterative MIMO detector achieves a constant throughput of 2.2 Gbit/s per iteration independent of the SNR while maintaining near max-log-MAP detection performance.

II. SYSTEM MODEL


Manuscript received August 26, 2012; revised November 19, 2012; accepted February 2, 2013. Date of publication March 27, 2013; date of current version May 13, 2013. This work was supported in part by the Research Fund for the Doctoral Program of Higher Education of China under Grant 20110073110055 and in part by the Shanghai Natural Science Foundation under Grant 10ZR1416500. This brief was recommended by Associate Editor Z. Wang.
The authors are with the School of Microelectronics, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: chenxi_good@sjtu.edu.cn; guanghui.he@sjtu.edu.cn; majun@sjtu.edu.cn).
Color versions of one or more of the figures in this brief are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSII.2013.2251954

Consider a MIMO system based on the BICM-IDD scheme [1] with $N_t$ transmit antennas and $N_r$ receive antennas ($N_r \ge N_t$). Assuming transmission over a flat fading channel, the received symbol vector $\tilde{r}$ can be written as $\tilde{r} = \tilde{H}\tilde{s} + \tilde{n}$, where $\tilde{H}$ is an $N_r \times N_t$ channel matrix and $\tilde{s}$ is an $N_t \times 1$ transmit symbol vector whose entries are taken from some set $C$ of $M$-QAM Gray mapped constellation points with $M = 2^{M_c}$. The vector $\tilde{n}$ contains zero-mean independent and identically distributed Gaussian noise samples with variance $N_0$ per complex entry.
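As a rough illustration of this system model, the following sketch builds a Gray-labelled 64-QAM alphabet from two 8-level real dimensions and forms one received vector. The function names, the unnormalized odd-integer levels, the particular Gray labelling, and the Rayleigh channel draw are assumptions made for the example, not details taken from the brief.

```python
import random

def gray_pam(bits_per_dim):
    """Return {Gray bit tuple: odd-integer PAM level} for one real dimension."""
    size = 1 << bits_per_dim
    table = {}
    for k in range(size):
        gray = k ^ (k >> 1)                      # binary-reflected Gray code
        bits = tuple((gray >> (bits_per_dim - 1 - b)) & 1 for b in range(bits_per_dim))
        table[bits] = 2 * k - (size - 1)         # -7, -5, ..., +7 when size == 8
    return table

def received_vector(H, s, n0, rng):
    """Compute r = H s + n with i.i.d. complex Gaussian noise of variance n0 per entry."""
    sigma = (n0 / 2) ** 0.5                      # standard deviation per real dimension
    return [
        sum(h * x for h, x in zip(row, s)) + complex(rng.gauss(0, sigma), rng.gauss(0, sigma))
        for row in H
    ]

rng = random.Random(0)
pam = list(gray_pam(3).values())                 # M_c = 6, so 3 bits per real dimension
s = [complex(rng.choice(pam), rng.choice(pam)) for _ in range(4)]      # N_t = 4
H = [[complex(rng.gauss(0, 0.5 ** 0.5), rng.gauss(0, 0.5 ** 0.5)) for _ in range(4)]
     for _ in range(4)]                          # N_r = 4, assumed Rayleigh fading
r = received_vector(H, s, n0=0.1, rng=rng)
print(len(r), "received samples")
```

Neighbouring levels in the table differ in exactly one bit, which is the Gray property the model relies on; a practical simulation would also normalize the constellation energy.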

1549-7747/$31.00 © 2013 IEEE

CHEN et al.: VLSI IMPLEMENTATION OF HIGH-THROUGHPUT ITERATIVE FIXED-COMPLEXITY SPHERE DECODER


In general, to avoid hardware-consuming operations in the complex domain, the orthogonal version of real-value decomposition (ORVD) [7] can be adopted to transform the $N_r \times N_t$ complex system model into its equivalent $2N_r \times 2N_t$ real system represented as $r = Hs + n$. The ORVD also transforms the complex constellation $C$ of $M$ points to its equivalent real constellation $P$ of $\sqrt{M}$ points. For the tree-search algorithm, $H$ is typically QR decomposed with $H = QR$, where $Q$ is a unitary matrix and $R$ is an upper triangular matrix. Then, the system model can be rewritten as $y = Q^H r = Rs + Q^H n$. As given in (1) and (2), metric increments $M_C(s_i)$ and $M_A(s_i)$ for channel-based and a priori-based information, respectively, are summed up to a total increment $M_P(s_i) = M_C(s_i) + M_A(s_i)$

$$M_C(s_i) = \left| y_i - \sum_{j=i}^{2N_t} R_{ij} s_j \right|^2 = \left| \hat{y}_i - R_{ii} s_i \right|^2 \tag{1}$$

$$M_A(s_i) = -\frac{1}{2} \sum_{b=1}^{M_c/2} x_{i,b} N_0 L^A_{i,b} \tag{2}$$

where $\hat{y}_i = y_i - \sum_{j=i+1}^{2N_t} R_{ij} s_j$ is the received component after the contribution of the previously detected symbols has been removed, $x_{i,b} \in \{+1, -1\}$ denotes the $b$th bit of the bit-level vector $x_i$ associated with the $i$th level, and $L^A_{i,b}$ denotes the a priori log-likelihood ratio (LLR) of $x_{i,b}$. The sum of the metric increments along a path from the root node to node $s_i$ yields the partial metric $M_P(s^{(i)})$ for a partial symbol vector $s^{(i)} = [s_i, \ldots, s_{2N_t}]^T$. The extrinsic LLR $L^E_{i,b}$ of bit $x_{i,b}$ is computed as

$$L^E_{i,b} \approx \frac{1}{N_0} \left( \min_{s \in P^{2N_t} :\, x_{i,b} = -1} M_P(s) - \min_{s \in P^{2N_t} :\, x_{i,b} = +1} M_P(s) - N_0 L^A_{i,b} \right). \tag{3}$$

Making an exhaustive search of the two minima in (3) is impractical. A typical modification is to generate a list $\mathcal{L}$ of $N_{cand}$ candidates. Then, the extrinsic LLR can be computed as

$$L^E_{i,b} \approx \frac{1}{N_0} \left( \min_{s \in \mathcal{L} :\, x_{i,b} = -1} M_P(s) - \min_{s \in \mathcal{L} :\, x_{i,b} = +1} M_P(s) - N_0 L^A_{i,b} \right). \tag{4}$$

III. PROPOSED SISO FSD ALGORITHM

The proposed SISO FSD algorithm is an extension of the hard-output FSD algorithm in [7], where an imbalanced-expansion scheme is applied to avoid inefficient full expansion at the $(2N_t - 1)$th level. The method introduces a polygon-shaped admissible region to reduce unnecessary visits to some nodes by introducing an extension number limitation, $L^m_{2N_t-1}$, with which only the $L^m_{2N_t-1}$ best nodes are extended from the $m$th father node at the $(2N_t - 1)$th level. In order to extend the hard-output FSD to SISO FSD and provide near max-log-MAP performance, three methods are proposed: the optimized HE (OHE), parallel candidate adding (PCA) using a bit-flipping strategy, and incorporation of the compensation of self-interference into the tree search.

A. OHE

Since soft inputs prevent the use of simplified methods relying on the geometry of the constellation $P$, finding the exact Schnorr-Euchner (SE) order requires exhaustively computing and sorting the $\{M_P\}$ of all the $\sqrt{M}$ children, which is computationally very expensive. In [3], an efficient solution called HE is proposed for SISO STS-SD, where the two best nodes based on $M_C$ and $M_A$ are enumerated concurrently, and then, the one with the minimum $M_P$ is selected for the next tree-search step. Unfortunately, when HE is applied to FSD to find the best child node of a certain parent node, some performance degradation is introduced, because HE cannot guarantee that it finds the child node with the minimum $M_P$ among all the children.

The proposed OHE adds an additional step after enumerating the two best nodes $s^{(1)}_{C,i}$ and $s^{(1)}_{A,i}$ based on $M_C$ and $M_A$, respectively. It replaces $s^{(1)}_{A,i}$ with another appropriate node $s_{CA,i}$ which is more likely to have a smaller $M_P$ than $s^{(1)}_{A,i}$ does. The steps of OHE can be described as follows.

1) Enumerate $s^{(1)}_{C,i}$ by quantizing $\hat{y}_i / R_{ii}$.
2) Enumerate $s^{(1)}_{A,i}$, whose bit vector $x_i = [x_{i,1}, \ldots, x_{i,M_c/2}]$ satisfies $x_{i,b} = \mathrm{sign}(L^A_{i,b})$ for $1 \le b \le M_c/2$.
3) Obtain $M_c/2$ sibling nodes $\{s^{(1)}_{A,i,b} \mid 1 \le b \le M_c/2\}$ by flipping each bit of $s^{(1)}_{A,i}$ in turn, and choose $s_{CA,i}$, which is nearest to $s^{(1)}_{C,i}$ in geometry but not equal to $s^{(1)}_{C,i}$, among the candidate set $\{s^{(1)}_{A,i}, s^{(1)}_{A,i,b} \mid 1 \le b \le M_c/2\}$.
4) Expand $s^{(1)}_{C,i}$ and $s_{CA,i}$, and select the node with the smaller $M_P$ as the best child estimate.

Note that $s^{(1)}_{A,i,b}$ denotes the sibling node whose $b$th bit is the flipped $b$th bit of $s^{(1)}_{A,i}$. As shown in Fig. 1, since at most one bit differs between $s_{CA,i}$ and $s^{(1)}_{A,i}$ and $s_{CA,i}$ is very close to $s^{(1)}_{C,i}$ in geometry, both $M_A$ and $M_C$ of $s_{CA,i}$ are small. Therefore, $s_{CA,i}$ is more likely to have a smaller $M_P$ than $s^{(1)}_{A,i}$, particularly in the first few iterations.

Fig. 1. Illustration of the proposed OHE method in the equivalent real-value system for Gray mapped 64-QAM modulation.

TABLE I
NUMBER OF VISITED NODES

Fig. 2. BER performance of various algorithms for 4 × 4 64-QAM MIMO system with turbo code rate of 1/2.

B. Parallel Candidate Adding Scheme

Although FSD can achieve near ML detection performance for hard-output MIMO systems, it cannot provide accurate soft information [8] because counterhypotheses are missing. To solve this problem, our proposed PCA scheme introduces another candidate list $\mathcal{L}^+$ which contains the counterhypotheses of the best child estimates of the partial MAP nodes. In our PCA scheme, after expanding the upper two levels of the real-value tree, only the best child estimates are extended for the rest of the levels. In addition, from the $(2N_t - 2)$th level, the PCA locates the partial MAP parent node $s_{\mathrm{PMAP}}$ by searching for the node with the minimum $M_P$ in the original candidate list $\mathcal{L}$ and then, using the bit-flipping operation, adds $M_c/2$ sibling nodes of the best child estimate of $s_{\mathrm{PMAP}}$ as counterhypotheses to the expanded list $\mathcal{L}^+$ before proceeding to the next level. Finally, the extrinsic LLR is computed based on $\mathcal{L} \cup \mathcal{L}^+$.
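The effect of the added counterhypotheses on the list-based LLR can be sketched as follows. This is only a toy illustration under stated assumptions: candidates are reduced to a single three-bit symbol, `toy_metric` is a hypothetical stand-in for the accumulated metric $M_P$, and the LLR follows the max-log form in which the minimum over candidates with bit $-1$ minus the minimum over candidates with bit $+1$ is scaled by $1/N_0$ and the a priori LLR is subtracted.

```python
def flip_siblings(bits):
    """Return one sibling per bit position, with that bit flipped (+1 <-> -1)."""
    return [bits[:b] + (-bits[b],) + bits[b + 1:] for b in range(len(bits))]

def extrinsic_llrs(candidates, n0, prior_llrs):
    """List-based max-log LLRs; candidates is a list of (bits, metric) pairs."""
    llrs = []
    for b, la in enumerate(prior_llrs):
        minus = [m for bits, m in candidates if bits[b] == -1]
        plus = [m for bits, m in candidates if bits[b] == +1]
        if not minus or not plus:
            llrs.append(None)          # counterhypothesis missing for this bit
        else:
            llrs.append((min(minus) - min(plus)) / n0 - la)
    return llrs

def toy_metric(bits, target=(+1, +1, -1)):
    """Hypothetical metric: each bit that differs from the target costs 1.0."""
    return sum((x - t) ** 2 for x, t in zip(bits, target)) / 4

# Original list L: the best candidate and one other candidate.
base = [(b, toy_metric(b)) for b in [(+1, +1, -1), (+1, -1, -1)]]
best_bits = min(base, key=lambda c: c[1])[0]

# Expanded list L+: bit-flipped siblings of the best candidate.
added = [(b, toy_metric(b)) for b in flip_siblings(best_bits)]

print(extrinsic_llrs(base, 0.5, (0.0, 0.0, 0.0)))
print(extrinsic_llrs(base + added, 0.5, (0.0, 0.0, 0.0)))
```

With the original list alone, two of the three bits have no counterhypothesis and their LLRs are undefined; after the flipped siblings are added, every bit has both hypotheses available.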
Fig. 3. Proposed VLSI architecture of SISO FSD.

C. Compensation of Self-Interference

Like other tree-search algorithms, the SISO FSD benefits significantly from the use of column sorting and regularization of the channel matrix [9]. However, the self-interference caused by channel-matrix regularization incurs performance degradation. In order to recover this performance loss, we adopt the method developed in [9], where the compensation of self-interference is incorporated into the tree search. The self-interference term $M_{SI}(s_i)$ is subtracted from the metric increment $M_P(s_i)$ as follows:

$$M_P(s_i) = M_C(s_i) + M_A(s_i) - M_{SI}(s_i) \tag{5}$$

where $M_{SI}(s_i) = \alpha^2 |s_i|^2$ and $\alpha$ is the regularization parameter.

IV. SIMULATION RESULTS AND COMPLEXITY ANALYSIS

In this section, the proposed SISO FSD algorithm is evaluated and compared with other algorithms. We considered a coded 4 × 4 MIMO system utilizing 64-QAM modulation over a spatially uncorrelated Rayleigh MIMO channel with additive white Gaussian noise. The 3GPP-LTE turbo code was used, with constraint length = 4, polynomial (feedback, redundancy) $(13, 15)_{\text{octal}}$, block size = 1024 bits, code rate = 1/2, and eight internal iterations of log-MAP decoding. Fig. 2 shows the detection performance of the proposed SISO FSD with $L^m_{2N_t-1} = [7, 7, 5, 5, 3, 3, 1, 1]$, the K-best detector with $K = 50$, the list FSD (LFSD) [8] with $n_S = [1, 1, 1, 1, 2, 2, 8, 8]$, and the STS-SD with $L_{max} = 8$. The performance of STS-SD is given as the baseline reference since it has been demonstrated to be capable of achieving max-log-MAP optimality if the LLR clipping value $L_{max}$ is sufficiently large.

As shown in Fig. 2, for the noniterative (iteration number $I = 1$) detection, our proposed SISO FSD outperforms MMSE-PIC, the K-best detector, and LFSD, with only a 0.1-dB performance degradation compared with STS-SD at a bit error rate (BER) of $10^{-4}$. For the iterative ($I = 4$) detection, the proposed SISO FSD shows only a very small performance loss compared to the iterative K-best detector, with a 0.35-dB degradation against STS-SD. The performance gap grows slightly as the iteration number increases because, in the presence of nonzero a priori information, the OHE cannot find the best child in the later iterations as exactly as it does in the first iteration.

The computational complexity in terms of the number of visited nodes per vector detection pertaining to a single receiver iteration for a 4 × 4 64-QAM MIMO system is given in Table I. The total detection complexity is proportional to the number of iterations in the detector/decoder loop. The proposed SISO FSD algorithm visits the fewest nodes among all listed tree-search algorithms. By employing the efficient OHE method, SISO FSD avoids the brute-force search for the best child and thus significantly reduces the number of visited nodes.

V. VLSI ARCHITECTURE FOR PROPOSED SISO FSD

Our proposed VLSI architecture for the list-based SISO FSD in a 4 × 4 64-QAM MIMO system is illustrated in Fig. 3. The architecture is based on the multistage architecture of the hard-output FSD [7] and is extended by the OHE strategy and PCA scheme described in Section III to support SISO processing.

A. High-Level Architecture

By employing the ORVD, the $M_P$ computation [i.e., $M_P(s_i)$ and $M_P(s_{i+1})$] in two adjacent levels can be conducted in parallel, with $R_{i,i+1}$ being zero for $i = 1, 3, \ldots, 2N_t - 1$ [7].
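The zero entries $R_{i,i+1}$ can be checked numerically with a small sketch. It assumes one common form of the orthogonal real-value decomposition, in which the two real columns belonging to the same transmit antenna are placed next to each other, and it uses a plain modified Gram-Schmidt QR; the function names and the column ordering are assumptions for illustration and may differ in detail from [7].

```python
import random

def orvd(Hc):
    """Real-valued channel matrix with the two real columns of each antenna adjacent."""
    nr, nt = len(Hc), len(Hc[0])
    H = [[0.0] * (2 * nt) for _ in range(2 * nr)]
    for i in range(nr):
        for k in range(nt):
            h = Hc[i][k]
            H[i][2 * k], H[i][2 * k + 1] = h.real, -h.imag            # real part of r
            H[nr + i][2 * k], H[nr + i][2 * k + 1] = h.imag, h.real   # imaginary part of r
    return H

def gram_schmidt_r(H):
    """Return the upper triangular R of H = QR via modified Gram-Schmidt."""
    rows, cols = len(H), len(H[0])
    V = [[H[i][j] for i in range(rows)] for j in range(cols)]   # columns as vectors
    R = [[0.0] * cols for _ in range(cols)]
    for j in range(cols):
        R[j][j] = sum(v * v for v in V[j]) ** 0.5
        q = [v / R[j][j] for v in V[j]]
        for k in range(j + 1, cols):
            R[j][k] = sum(a * b for a, b in zip(q, V[k]))
            V[k] = [b - R[j][k] * a for a, b in zip(q, V[k])]
    return R

rng = random.Random(7)
Hc = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(4)] for _ in range(4)]
R = gram_schmidt_r(orvd(Hc))
# Zero-based rows 0, 2, 4, 6 correspond to i = 1, 3, 5, 7 in the text.
print([abs(R[i][i + 1]) < 1e-9 for i in range(0, 8, 2)])
```

Because the paired columns are orthogonal and the span of all earlier pairs is preserved by the pairing, the entry coupling each pair vanishes, which is what allows two adjacent tree levels to be evaluated independently.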


Fig. 4. Timing schedule of the proposed VLSI architecture.

As a consequence, the number of processing element (PE) stages is reduced by half compared with pipelined detectors using traditional real-value decomposition [10]. The architecture supports both hard outputs and soft outputs. The hard-output module generates the original hard-output FSD candidate list $\mathcal{L}$, in which the best path with the minimum $M_P$ is found. The soft-output module generates an expanded list $\mathcal{L}^+$ by employing the PCA scheme and calculates the LLRs based on the union of the two lists $\mathcal{L} \cup \mathcal{L}^+$.

The PEs in our design are divided into three types: PE-A, PE-B, and PE-C. PE-A is located in the first stage, where multiple child nodes are expanded. PE-B performs the single expansion in the remaining three stages. PE-C in the soft-output module adopts the bit-flipping strategy to add the counterhypotheses to the expanded list $\mathcal{L}^+$. To identify the partial MAP node in $\mathcal{L}$, a minimum (MIN) search block in the soft-output module selects the node with the smallest $M_P$.

With $L^m_{2N_t-1} = [7, 7, 5, 5, 3, 3, 1, 1]$, the number of candidates in $\mathcal{L}$ is $N_{cand} = 32$. In the hard-output module, we instantiate eight PE-Bs at each stage so that eight nodes can be processed simultaneously, and thus, four cycles are needed to complete the processing of all the candidates in $\mathcal{L}$. The candidate generation unit (CGU) is adopted to generate all possible values of $|R_{i,j} s_j|$, which are shared by the $M_P(s_i)$ calculations at the same level. Additionally, the $M_A(s_i)$ and $M_{SI}(s_i)$ of all possible symbols are also precomputed to further enhance the hardware sharing. Moreover, the best node $s^{(1)}_{A,i}$ with the minimum $M_A$ at each level is also identified and buffered in the CGU according to the sign of $L^A_{i,b}$, which avoids full sorting of the set $\{M_A(s_i)\}$. The LLR calculation unit (LCU) in the last stage calculates the LLRs of each transmitted bit according to (4) based on the candidate lists $\mathcal{L}$ and $\mathcal{L}^+$. Fig. 4 shows the timing schedule of the proposed VLSI architecture.
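The candidate count and the four-cycle figure can be reproduced with a short sketch of the imbalanced expansion. It assumes, as the PE-A description states for levels 8 and 7, that the two upper levels can be sorted independently; the function name `select_candidates` and the example metric values are hypothetical.

```python
def select_candidates(metrics_l8, metrics_l7, limits):
    """Pair the m-th best level-8 node with its limits[m] best level-7 children."""
    order8 = sorted(range(len(metrics_l8)), key=metrics_l8.__getitem__)
    order7 = sorted(range(len(metrics_l7)), key=metrics_l7.__getitem__)
    candidates = []
    for m, parent in enumerate(order8):
        for child in order7[:limits[m]]:
            # Partial metric of the two-level path is the sum of both increments.
            candidates.append((parent, child, metrics_l8[parent] + metrics_l7[child]))
    return candidates

limits = [7, 7, 5, 5, 3, 3, 1, 1]                       # extension limits from the text
metrics_l8 = [4.1, 0.3, 9.8, 2.2, 6.5, 1.0, 7.7, 3.4]   # example increments, level 8
metrics_l7 = [5.0, 0.9, 2.8, 0.1, 8.3, 3.6, 1.7, 6.2]   # example increments, level 7

candidates = select_candidates(metrics_l8, metrics_l7, limits)
pe_per_stage = 8
cycles = -(-len(candidates) // pe_per_stage)            # ceiling division
print(len(candidates), "candidates,", cycles, "cycles per symbol vector")
```

The list length equals the sum of the extension limits, and dividing it by the eight PE-Bs available per stage gives the number of cycles needed per symbol vector.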
The latency to detect one symbol vector is 36 cycles. The whole architecture works in a deeply pipelined fashion and outputs a detected symbol vector every four cycles after the initial latency.

B. PE-A

The imbalanced-expansion scheme needs to determine the SE order in the upper two levels (i.e., levels 8 and 7). As the presence of a priori information prevents the use of the well-known zigzag enumeration used in hard-output FSD, finding the exact SE order requires the full computation and sorting of the $\{M_P(s_i)\}$. However, utilizing the property that $R_{7,8} = 0$ and $M_P(s_i) = |y_i - R_{ii} s_i|^2 + M_A(s_i) - M_{SI}(s_i)$ for $7 \le i \le 8$, the computation and sorting of $\{M_P(s_i)\}$ in levels 8 and 7 can be carried out independently and simultaneously. Thus, the number of $M_P(s_i)$ computations is reduced to 16, saving 77.8% compared to the straightforward approach of computing $M_P(s_i)$ of 8 + 64 = 72 nodes in the upper two levels. The number of sorters is also reduced to two. Moreover, the complexity of PE-A can be further reduced by time-multiplexed hardware sharing: given the number of cycles per symbol vector $N_{cycle} = 4$, only two $M_P(s_i)$ computation blocks are instantiated in each level to compute the metric increments of the eight candidates serially. Fig. 5 gives the architecture of PE-A using the techniques described above. To save more area, two folded bubble sorters are used, each of which can sort the eight candidates in four cycles. The path selecting and combining unit (PSCU) receives the sorted candidates and then selects and combines them to form $M_P(s^{(7)}) = M_P(s_8) + M_P(s_7)$ for the next stage according to $L^m_{2N_t-1} = [7, 7, 5, 5, 3, 3, 1, 1]$.

Fig. 5. Architecture of PE-A.

Fig. 6. Architecture of PE-B in stage 2.

C. PE-B

PE-B is used to implement the single expansion, where only the best node estimate is selected and preserved using the proposed OHE. As shown in Fig. 6, the interference cancellation unit (ICU) in PE-B computes $\hat{y}_i$ in (1) to eliminate the interantenna interference introduced by previously detected symbols. To enumerate the best child node $s^{(1)}_{C,i}$ with the minimum $M_C$, a quantization step Q is required to find the symbol nearest to $\hat{y}_i / R_{i,i}$. The HE unit (HEU) chooses $s_{CA,i}$ according to step 3) of the OHE method. The MIN block compares the $M_P$ of $s^{(1)}_{C,i}$ and $s_{CA,i}$ and then selects the node with the smaller $M_P$.
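The four OHE steps that PE-B implements can be sketched in software as follows. This is a behavioral sketch under stated assumptions, not the hardware datapath: the Gray labelling in `gray_map`, the tie-breaking rule, the treatment of a zero a priori LLR as $+1$, and the negative sign of the a priori metric (so that it is smallest when every bit agrees with the sign of its LLR, as step 2 requires) are choices made for the example, and self-interference compensation is omitted.

```python
def gray_map(nbits):
    """Gray-labelled PAM: tuple of +/-1 bits -> odd-integer level (assumed labelling)."""
    size = 1 << nbits
    table = {}
    for k in range(size):
        g = k ^ (k >> 1)
        bits = tuple(1 if (g >> (nbits - 1 - b)) & 1 else -1 for b in range(nbits))
        table[bits] = 2 * k - (size - 1)
    return table

def ohe_best_child(y_hat, r_ii, prior_llrs, n0, table):
    """Return the best child level chosen by the four OHE steps."""
    bits_of = {level: bits for bits, level in table.items()}

    def m_p(level):
        m_c = (y_hat - r_ii * level) ** 2                                    # channel metric
        m_a = -0.5 * sum(x * n0 * la for x, la in zip(bits_of[level], prior_llrs))
        return m_c + m_a

    # Step 1: quantize y_hat / r_ii to the nearest constellation level.
    s_c = min(bits_of, key=lambda s: abs(y_hat / r_ii - s))
    # Step 2: bits follow the sign of the a priori LLRs (zero treated as +1).
    bits_a = tuple(1 if la >= 0 else -1 for la in prior_llrs)
    # Step 3: s_A and its single-bit-flipped siblings; keep the one nearest to s_c.
    cand_bits = [bits_a] + [bits_a[:b] + (-bits_a[b],) + bits_a[b + 1:]
                            for b in range(len(bits_a))]
    cand = [table[c] for c in cand_bits if table[c] != s_c]
    s_ca = min(cand, key=lambda s: abs(s - s_c))
    # Step 4: expand both and keep the smaller total metric.
    return min((s_c, s_ca), key=m_p)

table = gray_map(3)                                      # 8 levels per real dimension
print(ohe_best_child(2.6, 1.0, (0.0, 0.0, 0.0), 1.0, table))
print(ohe_best_child(0.2, 1.0, (-4.0, 4.0, -4.0), 1.0, table))
```

In the first call there is no a priori information, so the quantized node wins; in the second call the a priori LLRs favor a neighboring level, and the node selected in step 3 ends up with the smaller total metric.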


TABLE II
IMPLEMENTATION RESULTS AND COMPARISON

Fig. 7. Architecture of PE-C in stage 2.

D. PE-C

Fig. 7 shows the architecture of PE-C, which implements the PCA scheme. The best child estimate selection unit (BCSU) receives the partial MAP node $s^{(i+1)}_{\mathrm{PMAP}}$ and finds its best child estimate $s^{(1)}_i$ by employing the OHE, in the same way as PE-B. The candidate adding unit (CAU) uses the bit-flipping strategy to add three sibling nodes of $s^{(1)}_i$, which are fed forward to a multiplexer, and only one of them is selected per cycle for $M_P$ computation. Compared with the parallel method, the serial computation method reduces the number of $M_P$ computation blocks in the CAU by 66.7% and reduces the number of PE-Bs in the subsequent stages, without impacting the throughput of the whole architecture.

VI. IMPLEMENTATION RESULTS

The proposed SISO FSD architecture has been implemented in a 90-nm CMOS technology with a standard-performance standard-cell library. As shown in Fig. 2, the fixed-point detector shows about 0.1-dB performance loss compared to the floating-point detector. The core area of the chip is 2.61 mm². At the nominal 1.0-V supply voltage, the detector can work at a maximum frequency $f_{max}$ of 370 MHz, achieving a 2.2-Gbit/s peak throughput per iteration. The throughput is given by

$$\Theta = \frac{M_c N_t f_{clk}}{N_{cycle}}. \tag{6}$$
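As a quick check of (6), the reported figures can be substituted directly: 64-QAM carries $M_c = 6$ bits per symbol, $N_t = 4$, the clock is 370 MHz, and one symbol vector is produced every $N_{cycle} = 4$ cycles. The function name below is only illustrative.

```python
def throughput_bps(bits_per_symbol, n_tx, f_clk_hz, cycles_per_vector):
    """Throughput per iteration: bits per symbol vector times vectors per second."""
    return bits_per_symbol * n_tx * f_clk_hz / cycles_per_vector

theta = throughput_bps(bits_per_symbol=6, n_tx=4, f_clk_hz=370e6, cycles_per_vector=4)
print(f"{theta / 1e9:.2f} Gbit/s")
```

The result, 2.22 Gbit/s, matches the 2.2-Gbit/s peak throughput per iteration quoted in the text.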

We compare the proposed SISO FSD with recently reported MIMO detectors in Table II. The proposed SISO FSD MIMO detector achieves a significant increase in data throughput and much lower latency compared with other detectors. Additionally, the SISO FSD achieves a 3.96-Mbit/s/kGE area efficiency, which is the highest among all the reported iterative detectors. Unlike the depth-first tree-search algorithms, whose throughput and area efficiency degrade substantially when operating in the low-SNR regime, the proposed SISO FSD has fixed throughput and area efficiency per iteration while preserving near max-log-MAP detection performance.

VII. CONCLUSION

This brief presents the algorithm optimization and VLSI implementation of a SISO FSD. Based on the hard-output imbalanced FSD in [7], the proposed SISO FSD algorithm employs the efficient OHE to avoid the exhaustive search for the best child in the soft-input scenario and adopts the simple PCA scheme to improve the quality of the output LLRs. In addition, the compensation of the self-interference caused by channel-matrix regularization is incorporated in the tree search, leading to further performance gain. These proposed techniques reduce the complexity significantly and provide near max-log-MAP performance. At the architecture level, the proposed multistage architecture with time-multiplexed hardware sharing further reduces the area cost. Implementation results show that our SISO FSD outperforms other reported iterative MIMO detectors in terms of throughput and area efficiency.

REFERENCES

[1] B. M. Hochwald and S. Brink, "Achieving near-capacity on a multiple antenna channel," IEEE Trans. Commun., vol. 51, no. 3, pp. 389–399, May 2003.
[2] C. Studer, S. Fateh, and D. Seethaler, "ASIC implementation of soft-input soft-output MIMO detection using MMSE parallel interference cancellation," IEEE J. Solid-State Circuits, vol. 46, no. 7, pp. 1754–1765, Jul. 2011.
[3] E. M. Witte, F. Borlenghi, G. Ascheid, R. Leupers, and H. Meyr, "A scalable VLSI architecture for soft-input soft-output single tree-search sphere decoding," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 9, pp. 706–710, Sep. 2010.
[4] F. Borlenghi, E. M. Witte, G. Ascheid, H. Meyr, and A. Burg, "A 772 Mbit/s 8.81 bit/nJ 90 nm CMOS soft-input soft-output sphere decoder," in Proc. ASSCC, Jeju, Korea, 2011, pp. 297–300.
[5] Y. Sun and J. R. Cavallaro, "Trellis-search based soft-input soft-output MIMO detector: Algorithm and VLSI architecture," IEEE Trans. Signal Process., vol. 60, no. 5, pp. 2617–2627, May 2012.
[6] L. G. Barbero and J. S. Thompson, "Fixing the complexity of the sphere decoder for MIMO detection," IEEE Trans. Wireless Commun., vol. 7, no. 6, pp. 2131–2142, Jun. 2008.
[7] L. Liu, J. Lofgren, and P. Nilsson, "Area-efficient configurable high-throughput signal detector supporting multiple MIMO modes," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 9, pp. 2085–2096, Sep. 2012.
[8] L. G. Barbero and J. S. Thompson, "Extending a fixed-complexity sphere decoder to obtain likelihood information for turbo-MIMO systems," IEEE Trans. Veh. Technol., vol. 57, no. 5, pp. 2804–2814, Sep. 2008.
[9] C. Studer and H. Bolcskei, "Soft-input soft-output single tree-search sphere decoding," IEEE Trans. Inf. Theory, vol. 56, no. 10, pp. 4827–4842, Oct. 2010.
[10] D. Patel, V. Smolyakov, M. Shabany, and P. G. Gulak, "VLSI implementation of a WiMAX/LTE compliant low-complexity high-throughput soft-output K-best MIMO detector," in Proc. IEEE ISCAS, May 2010, pp. 593–596.
