IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 45, NO. 2, FEBRUARY 2010

Abstract—In this paper, the VLSI implementation of a configurable, soft-output MIMO detector is presented. The proposed chip can support up to 8×8 64-QAM spatial-multiplexing MIMO communications, which surpasses all reported MIMO detector ICs in antenna number and modulation order. Moreover, this chip provides a configurable antenna number from 2×2 up to 8×8 and a modulation order from QPSK to 64-QAM. Its outputs include bit-wise log-likelihood ratios (LLRs) and a candidate list, making it compatible with powerful soft-input channel decoders and iterative decoding systems. The MIMO detector adopts a novel sphere decoding algorithm with high decoding efficiency and superior error rate performance, called modified best-first with fast descent (MBF-FD). Moreover, a low-power pipelined quad-dual-heap (quad-DEAP) circuit for efficient node pool management and several circuit techniques are implemented in this chip. When this chip is configured as 4×4 64-QAM and 8×8 64-QAM soft-output MIMO detectors, it achieves average throughputs of 431.8 Mbps and 428.8 Mbps with only 58.2 mW and 74.8 mW respective power consumption and reaches $10^{-5}$ coded bit error rate (BER) at signal-to-noise ratio (SNR) of 24.2 dB and 22.6 dB, respectively.

Index Terms—Multiple-input multiple-output (MIMO) detection, soft-output sphere decoder, VLSI implementation.

Manuscript received February 02, 2009; revised July 06, 2009. Current version published February 05, 2010. This paper was approved by Associate Editor Bevan Baas. This work was supported in part by the National Science Council, Taiwan, R.O.C., under Grant NSC98-2752-M-002-002-PAE and NSC97-2219-E-002-011. The work of Chun-Hao Liao is also partially sponsored by the Institute for Integrated Signal Processing Systems, RWTH Aachen University, Aachen, Germany.
The authors are with the Graduate Institute of Electronics Engineering and the Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan 10617 (e-mail: chiueh@cc.ee.ntu.edu.tw).
Digital Object Identifier 10.1109/JSSC.2009.2037292
0018-9200/$26.00 © 2010 IEEE
Authorized licensed use limited to: BANGALORE INSTITUTE OF TECHNOLOGY. Downloaded on March 30,2010 at 02:20:45 EDT from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

MULTIPLE-INPUT multiple-output (MIMO) techniques have recently enjoyed a high degree of popularity in wireless communications as they significantly enhance spectrum resource utilization [1]. In particular, a MIMO technique called spatial multiplexing can increase the data throughput almost linearly with the number of antennas [2]. Hence, the spatial multiplexing MIMO technique has been adopted in many current wireless communication standards. For example, the IEEE 802.11n wireless LAN standard adopts MIMO configurations with up to 4×4 spatial multiplexing, and the latest IEEE 802.16e mobile WiMAX standard also includes a 4-stream spatial multiplexing mode.

Systems with a higher number of antennas are on the horizon. For instance, it was proposed that 8×8 spatial multiplexing may be necessary in the next-generation (4G) mobile communication standard to achieve peak spectrum efficiency [3]. Recently, in an amazing demonstration, 5 Gbps downlink wireless communication was achieved using spatial multiplexing on a 12×12 antenna configuration and 100 MHz bandwidth, reaching a record-breaking 50 bps/Hz spectrum efficiency [4]. Moreover, along with the development of the RF front-end circuit technology, short-range data communication through the extremely high frequency (EHF) band is no longer infeasible. In this frequency band, antenna size can be shrunk to several millimeters, making MIMO systems with a large number of antennas practical even for portable devices.

One of the most challenging tasks in MIMO communication systems is data detection at the receiver when spatial multiplexing is applied. Multiple streams of signals, coupled with noise and channel fading, interfere with each other when traveling in space and are received by a plurality of antennas. The optimal detection solution mandates exhaustive search among the entire transmitted signal space and requires complexity that scales exponentially with the number of antennas. To reduce the search complexity, sphere decoding (SD) was proposed and it is capable of achieving optimal detection performance with much reduced complexity [5]. To further improve the error rate performance, the original hard-output sphere decoding has been modified to provide soft-valued outputs, making it applicable in iterative detection and decoding architectures to attain significantly enhanced detection performance [5]. The complexity of the hard-output and soft-output sphere decoding algorithms depends to a large extent on the adopted search method. Several previous research works proposed a variety of search algorithms, such as K-best [6]–[11], depth-first [5], [12], [13], etc. However, owing to the limitation in search scalability, these algorithms are mainly applicable to MIMO systems with either fewer antenna elements or lower-order modulation.

In light of the trend in spatial-multiplexing MIMO communications toward higher-order modulation, more spatial streams and soft-valued output, we propose, in this paper, a configurable soft-output MIMO detector IC based on a novel complex-plane sphere decoding algorithm. In this IC, several architecture and circuit techniques are proposed and implemented to achieve the following advanced features:
• First MIMO detector IC supporting 8×8 64-QAM spatial multiplexing.
• Support for antenna configuration from 2×2 to 8×8 and modulation from QPSK to 64-QAM.
• Provision of soft-valued outputs and candidate list, making it compatible with soft-input error-correction-code (ECC) decoders and iterative detection and decoding systems.
• Novel modified best-first with fast descent (MBF-FD) MIMO detection algorithm enhancing detection efficiency and performance.
• Low-latency, pipelined quad-dual-heap (quad-DEAP) circuit facilitating node pool maintenance.
• Tabular enumeration scheme providing fast and efficient enumeration.
• Optimized node processing circuit enabling high clock rate and low power consumption.
• Average throughput of 431.8 Mbps with 58.2 mW in 4×4 64-QAM configuration and 428.8 Mbps with 74.8 mW in 8×8 64-QAM configuration.

The rest of this paper is organized as follows. After introducing the MIMO detection problem and the conventional sphere decoding algorithms in Section II, we give the main idea of the MBF-FD algorithm for MIMO detection in Section III and expound on related simulation results and comparison with existing solutions. Section IV presents circuit design and implementation of the proposed IC, including hardware architecture and circuit techniques. Then, Section V reports the chip measurements and compares the proposed chip with several reported MIMO detector chips. Finally, the paper is concluded in Section VI.

II. DETECTION IN SPATIAL-MULTIPLEXING MIMO SYSTEMS

Let us consider $N_T$ spatial streams, each transmitting $M_c$-bit data per symbol using $2^{M_c}$-QAM modulation over an $N_R \times N_T$ MIMO system with $N_T$ transmitting and $N_R$ receiving antennas. Denote these $N_T$ $M_c$-bit data as a binary row vector $\mathbf{x}$ with $\mathbf{x} = [x_1, \ldots, x_{N_T M_c}]$, $x_k \in \{+1, -1\}$, and let $\mathbf{s}$ be the QAM-mapped complex constellation symbol vector having $N_T$ complex symbols, $\mathbf{s} = [s_0, \ldots, s_{N_T-1}]^T$. The received complex symbol vector, $\mathbf{y}$, is then given by

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n} \tag{1}$$

where $\mathbf{H}$ is the $N_R \times N_T$ channel matrix that is assumed known beforehand and $\mathbf{n}$ is the complex Gaussian noise vector. For simplicity, in the rest of the paper we assume $N_T = N_R = N$.

Hard-output MIMO detectors try to find the symbol vector $\mathbf{s}$ (and correspondingly the binary vector $\mathbf{x}$) that maximizes the likelihood of the received vector. On the other hand, soft-output MIMO detectors compute the extrinsic bit-wise log-likelihood ratio (LLR) of each bit in $\mathbf{x}$ under the max-log maximum a posteriori (MAP) criterion according to [5]

$$L_E(x_k) \approx \frac{1}{N_0}\left[\min_{\mathbf{x} \in \mathcal{X}_k^{-1}} J(\mathbf{x}) - \min_{\mathbf{x} \in \mathcal{X}_k^{+1}} J(\mathbf{x})\right] - L_A(x_k) \tag{2}$$

$$J(\mathbf{x}) = \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 - \frac{N_0}{2}\,\mathbf{x}\,\mathbf{L}_A^T \tag{3}$$

where $L_E(x_k)$ and $L_A(x_k)$ are respectively the extrinsic and a priori LLR of the bit $x_k$; $\mathcal{X}_k^{\pm 1}$ is the search space $\{\mathbf{x} \mid x_k = \pm 1\}$; $N_0$ is the noise power spectral density, and $\mathbf{L}_A$ is the row vector that contains the a priori information about $\mathbf{x}$. In iterative detection and decoding systems, the MIMO detector first computes the LLR outputs without the a priori information; the LLRs are then passed to a soft-in-soft-out ECC decoder, whose outcome is then fed back to the MIMO detector as a priori information; and the iteration goes on.

Generally speaking, sphere decoders are very effective as soft-output MIMO detectors due to the efficient search strategy that confines the search space to include only the vectors whose costs are smaller than the sphere constraint. However, as searching the whole space under the sphere constraint in each iteration is still time-consuming, we adopted a compromise solution proposed in [5]. This scheme generates a candidate list during the first MIMO detection and afterwards confines the search to the solutions in that candidate list. In the candidate-list-based MIMO detectors, where the a priori information can be ignored, we can rewrite the cost function as

$$J(\mathbf{s}) = \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 = \|\hat{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2 \tag{4}$$

where $\hat{\mathbf{y}} = \mathbf{Q}^H \mathbf{y}$, and $\mathbf{Q}$ and $\mathbf{R}$ are respectively a unitary matrix and an upper-triangular matrix that satisfy $\mathbf{H} = \mathbf{Q}\mathbf{R}$. Since $\mathbf{R}$ is an upper-triangular matrix, the complex symbols in $\mathbf{s}$ can be determined sequentially from bottom to top. The decoding can then be mapped into a search over an $N$-layer $2^{M_c}$-ary tree, whose leaf nodes correspond to the $2^{N M_c}$ possible solution vectors. By expanding (4), we recursively define the partial cost of an intermediate node in layer $i$ with partial solution $\mathbf{s}^{(i)} = [s_i, \ldots, s_{N-1}]^T$ as

$$T_i(\mathbf{s}^{(i)}) = T_{i+1}(\mathbf{s}^{(i+1)}) + \Big|\hat{y}_i - \sum_{j=i+1}^{N-1} R_{ij} s_j - R_{ii} s_i\Big|^2 \tag{5}$$

where $R_{ij}$ and $\hat{y}_i$ are respectively elements in $\mathbf{R}$ and $\hat{\mathbf{y}}$, and $T_N = 0$.

Several tree search schemes have been studied in the context of sphere decoding MIMO detection. Among them, the breadth-first, depth-first and best-first algorithms are the most popular. Breadth-first algorithms [6]–[11] are favored due to their regular memory arrangement and amenability to pipelined and paralleled implementation. However, for systems with more antennas and/or higher modulation order, breadth-first algorithms tend to require enormous computational complexity to achieve acceptable performance. On the other hand, depth-first algorithms [5], [12], [13] have better search efficiency, although their tree traversing strategy still leaves room for improvement. In [14], a best-first algorithm is proposed and shown to be a better search method. This best-first algorithm maintains a pool of nodes to visit, which are not necessarily in the same sub-tree. When the current best node with the lowest partial cost has been visited and processed, the best-first algorithm starts from the next best node in the pool. Namely, it can hop within the tree without being restricted by the structure and connectivity of the tree and always looks into the most promising nodes.
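As an illustration of the partial cost recursion in (5) and of the original best-first strategy, the following is a minimal Python sketch rather than the implementation used on the chip. The function names, the QPSK alphabet, and the small 3×3 example are assumptions chosen for brevity, and the matrix `R` is taken to be already upper-triangular, so the QR decomposition itself is omitted.

```python
import heapq
import itertools

# Illustrative 4-point alphabet (QPSK); the chip itself supports up to 64-QAM.
QPSK = [complex(a, b) for a in (-1, 1) for b in (-1, 1)]


def partial_cost(R, y_hat, layer, tail, parent_cost):
    """Partial cost of a node in `layer`, where tail = [s_layer, ..., s_{N-1}]."""
    n = len(R)
    interference = sum(R[layer][j] * tail[j - layer] for j in range(layer + 1, n))
    residual = y_hat[layer] - interference - R[layer][layer] * tail[0]
    return parent_cost + abs(residual) ** 2


def full_cost(R, y_hat, s):
    """Squared distance ||y_hat - R s||^2 for a complete symbol vector s."""
    n = len(R)
    return sum(abs(y_hat[i] - sum(R[i][j] * s[j] for j in range(i, n))) ** 2
               for i in range(n))


def best_first(R, y_hat, alphabet):
    """Original best-first search: always expand the cheapest node in the pool."""
    n = len(R)
    tie = itertools.count()            # tie-breaker so the heap never compares lists
    pool = [(0.0, next(tie), n, [])]   # (partial cost, tie, layer, tail); root is above layer N-1
    while pool:
        cost, _, layer, tail = heapq.heappop(pool)
        if layer == 0:
            return cost, tail          # costs never decrease downward, so this leaf is optimal
        for s in alphabet:             # every child is evaluated before moving on
            child = [s] + tail
            child_cost = partial_cost(R, y_hat, layer - 1, child, cost)
            heapq.heappush(pool, (child_cost, next(tie), layer - 1, child))
    raise ValueError("empty search tree")


# Hypothetical 3x3 example: upper-triangular R with a real, positive diagonal.
R = [[1.5, 0.3 + 0.2j, -0.4j],
     [0.0, 1.2, 0.5 - 0.1j],
     [0.0, 0.0, 0.9]]
s_sent = [1 + 1j, -1 + 1j, 1 - 1j]
noise = [0.10 - 0.05j, -0.08 + 0.02j, 0.03 + 0.07j]
y_hat = [sum(R[i][j] * s_sent[j] for j in range(3)) + noise[i] for i in range(3)]

cost, s_best = best_first(R, y_hat, QPSK)
print(s_best, round(cost, 4))
```

Because every term added in the recursion is non-negative, the first leaf taken from the pool has the minimum full cost, which is why the sketch can stop there. Note that each expansion pushes as many nodes as there are constellation points, so the pool grows quickly for large constellations.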
LIAO et al.: A 74.8 mW SOFT-OUTPUT DETECTOR IC FOR 8×8 SPATIAL-MULTIPLEXING MIMO COMMUNICATIONS

Fig. 1. Operation of the best-first algorithms: (a) original best-first, (b) modified best-first (MBF), and (c) modified best-first with fast descent (MBF-FD).

III. A LOW-COMPLEXITY SEARCH ALGORITHM

A. Algorithm Description

By continuously starting the search from the current best node with the lowest partial cost, the aforementioned best-first algorithm avoids the limitation of adjacency in the tree suffered by the depth-first algorithm and thus achieves a better search efficiency. However, in the original best-first algorithm, the nodes are connected in a traditional $2^{M_c}$-ary tree: each node has $2^{M_c}$ children and each child node can be reached only from its parent. So the individual partial costs of all child nodes must be evaluated before the search can move downward to the next level, as indicated in Fig. 1(a). Evaluation of all child nodes' partial costs often makes the best-first algorithm's efficiency less than desirable. What is worse, in a tree with high degree, pushing in many nodes and removing only one parent node can quickly bloat the node pool with useless nodes.

In the modified best-first (MBF) algorithm [15], the original $2^{M_c}$-ary tree is converted into an equivalent binary tree, as illustrated in Fig. 1(b). When a node is visited, we can replace this node in the pool by only two new nodes: its best child node in the next layer and its best yet-to-visit sibling. Afterwards, the next best node in the sorted node pool is examined and visited, and so on. By adding these two nodes into the pool (and deleting the current node), the legacy of the current node is preserved, downward by its child node and horizontally by its sibling node. This procedure is similar to encoding a general ordered $2^{M_c}$-ary tree (e.g., 4-ary, 16-ary, or 64-ary) into a binary tree by a method called first-child/next-sibling binary tree [16]. The MBF algorithm greatly reduces the degree of a node by introducing horizontal connections and thus effectively decreases the complexity of child node evaluation in the original best-first algorithm. It also makes the node pool more efficient in capturing promising nodes for future visit.

Although the MBF algorithm successfully resolves the complexity and node pool issues of the traditional best-first algorithm, it still has the problem of spending too much time searching on higher layers and may not reach even one leaf node (for a full-length solution) under a time constraint. In order to reach more leaf nodes, the MBF algorithm is further modified to include the flavor of depth-first tree search. The final algorithm, called modified best-first with fast descent (MBF-FD) [15], continuously searches downward for the best child nodes and pushes best sibling nodes along the search path into the node pool until a leaf node is reached. Then a new forward search is started from the best node in the node pool. The MBF-FD algorithm preserves the benefits of the MBF algorithm while guaranteeing enough full-length solutions for soft-output MIMO detection. Fig. 1(c) illustrates the operation of the MBF-FD algorithm.

B. Simulation Results

We compare the proposed MBF-FD algorithm with the modified K-best Schnorr-Euchner (MKSE) algorithm [7] and the single tree search (STS) algorithm [13], which are popular breadth-first-based and depth-first-based algorithms, respectively. To make a fair comparison, we evaluate them in terms of the computational complexity measured in the average number of required partial cost calculations (PCC) to reach a coded bit error rate (BER) of $10^{-5}$ at a certain SNR. The data are coded with a systematic convolutional code with constraint length 3, and interleaved with a 128×72 row-in-column-out interleaver. A spatially uncorrelated Rayleigh channel matrix is assumed in each case and its elements are complex zero-mean Gaussian random variables with variance 0.5 per dimension. The sphere constraint is set to 2 in all algorithms initially, which leads to a fair search space reduction while maintaining good error rate performance from extensive simulations. The different sphere decoding algorithms are compared under various run-time constraint settings, e.g., the maximum number of visited nodes in MBF-FD and STS, and $K$ in MKSE.

Fig. 2 depicts the required average number of PCC and minimum SNR to achieve $10^{-5}$ coded BER for each algorithm under different run-time constraints, where the SNR is defined as
(6)
Fig. 2. Comparison of MBF-FD, STS, and MKSE algorithms under different run-time constraints.
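To make the search strategy of Section III-A concrete, the following is a minimal Python sketch of an MBF-FD-style traversal. It is an illustration under stated assumptions, not the chip's implementation: children are ranked by an exact sort instead of tabular enumeration, the run-time constraint is expressed as a hypothetical `max_leaves` parameter, and the QPSK alphabet, function names, and 3×3 example are chosen only for brevity.

```python
import heapq
import itertools

QPSK = [complex(a, b) for a in (-1, 1) for b in (-1, 1)]


def full_cost(R, y_hat, s):
    """Squared distance ||y_hat - R s||^2 for a complete symbol vector s."""
    n = len(R)
    return sum(abs(y_hat[i] - sum(R[i][j] * s[j] for j in range(i, n))) ** 2
               for i in range(n))


def ranked_children(R, y_hat, layer, parent_tail, parent_cost, alphabet):
    """Children in `layer` of one parent, cheapest first, as (cost, index, symbol)."""
    n = len(R)
    b = y_hat[layer] - sum(R[layer][j] * parent_tail[j - layer - 1]
                           for j in range(layer + 1, n))
    return sorted((parent_cost + abs(b - R[layer][layer] * s) ** 2, k, s)
                  for k, s in enumerate(alphabet))


def mbf_fd(R, y_hat, alphabet, max_leaves):
    """Return up to max_leaves full-length solutions as (cost, [s_0, ..., s_{N-1}])."""
    n = len(R)
    tie = itertools.count()            # tie-breaker so the heap never compares lists
    pool = []

    def push(layer, parent_tail, parent_cost, rank):
        """Put the rank-th best child of the given parent into the node pool."""
        kids = ranked_children(R, y_hat, layer, parent_tail, parent_cost, alphabet)
        if rank < len(kids):
            cost, _, s = kids[rank]
            heapq.heappush(pool, (cost, next(tie), layer, parent_tail, parent_cost, rank, s))

    push(n - 1, [], 0.0, 0)            # best node of the top layer
    leaves = []
    while pool and len(leaves) < max_leaves:
        cost, _, layer, parent_tail, parent_cost, rank, s = heapq.heappop(pool)
        push(layer, parent_tail, parent_cost, rank + 1)   # best yet-to-visit sibling
        tail = [s] + parent_tail
        while layer > 0:               # fast descent towards a leaf
            layer -= 1
            push(layer, tail, cost, 1)                    # second-best child goes to the pool
            cost, _, s = ranked_children(R, y_hat, layer, tail, cost, alphabet)[0]
            tail = [s] + tail
        leaves.append((cost, tail))
    return leaves


# Hypothetical 3x3 example: upper-triangular R with a real, positive diagonal.
R = [[1.5, 0.3 + 0.2j, -0.4j],
     [0.0, 1.2, 0.5 - 0.1j],
     [0.0, 0.0, 0.9]]
y_hat = [2.1 + 1.25j, -1.38 + 0.52j, 0.93 - 0.83j]

candidates = mbf_fd(R, y_hat, QPSK, max_leaves=8)
print(min(candidates)[0], len(candidates))
```

Every pass of the outer loop ends at a leaf, so the number of full-length solutions grows with the number of forward searches. The returned list plays the role of the candidate list from which soft outputs are derived.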
be easily maintained through node exchanges propagating from one end to the other.

With the above DEAP structure, we next present several techniques for efficient circuit implementation of DEAP. First, DEAP suffers from possibly long latency, which scales linearly with the number of layers due to propagation of node exchanges. To reduce the number of layers, we propose to use quad trees instead of binary trees. Fig. 4 depicts the adopted 6-layer quad-DEAP structure, which contains 42 nodes to guarantee satisfactory BER performance.

Moreover, an interlaced pipelining scheme is implemented in the node exchange operations to improve the node processing rate and circuit utilization, as illustrated in Fig. 4. To implement node exchanges, the pipelining stages operate in a period of two clock cycles. Specifically, in the first clock cycle, the two root nodes in layers 1 and 6 update their values with the respective inputs if necessary, while nodes in layers 2, 3, 4, 5 that have been updated in the previous cycle compare with associated nodes in layers 3, 2, 5, 4 and exchange values whenever necessary. In the second clock cycle, similar node exchanges are performed between layers 1 and 2, layers 3 and 4, and layers 5 and 6, respectively. Note that both upward and downward propagation of node exchanges are possible. In addition, these two types of propagation can happen simultaneously in the pipelining stages. Therefore, the circuit is designed to handle upward and downward node exchanges concurrently. Finally, the comparators are shared between the two phases (even-cycle phase and odd-cycle phase) to increase circuit utilization.

Although pipelining improves the processing rate of the node pool, it also introduces possible incoherence in the best node value when the node exchanges associated with a best node replacement are not completed in time, leading to degradation in error rate performance. To address this, we include a best node cache, which holds the best node of the pool while the quad-DEAP handles the other nodes in the pool. Armed with this cache, the aforementioned incoherence and possible BER degradation are avoided. Finally, we introduce two more low-power circuit techniques for the node pool. First, for the idle nodes which are not on the path of propagation, we turn off the associated circuits by clock gating. Second, when the node exchange procedure halts at a certain stage, the inputs of the comparators in the ensuing stages are frozen to minimize signal switching. A 38.2% saving in power is achieved by these techniques according to gate-level power simulation.

B. Node Processing

This part performs the main operations of MBF-FD tree traversal, including identifying the child and sibling nodes and calculating their partial costs. A dedicated pipelining strategy is introduced to cut down the possibly long delay path. We partition the computation involved with a child node into three stages: the inter-antenna interference cancellation (IAIC) block first cancels the interference from the QAM symbols that have been decided in the previous layers; the child node processing (CNP) block then finds the best child node; and finally the partial cost calculation (PCC) block computes and accumulates the squared error. The operation involved with a sibling node is similarly
partitioned into the sibling node processing (SNP) block and PCC as well. With the above partitioning and pipelining, the clock speed of the proposed chip can be close to 200 MHz.

Referring to Fig. 1, note that the best child and the best sibling of a child node depend on the decision of that child node. Thus, we propose a pipelining schedule as shown in Fig. 5. The top three blocks refer to the processing of the child node. At this point, the decision of the child node is already available though its partial cost is yet to be computed. Consequently, we parallelize the processing of its best child and best sibling with the PCC of the child node itself. Tree traversal for the following layers is similarly performed until a leaf node is reached. Note that in parallel another SNP block determines the best sibling of the best node retrieved from the node pool [see Fig. 1(c)].

Fig. 5. Pipelining schedule of node processing.

The above scheduling has several advantages. First, only one set of IAIC, CNP, SNP, and PCC circuits is implemented. Next, since the tree is traversed sequentially, by adjusting the schedule this architecture can be configured to support different numbers of antennas (layers), different modulations, and run-time constraints. Also, the rate of node processing matches that of the node pool, i.e., two clock cycles per node, thus enhancing circuit utilization. We next introduce the circuit techniques adopted in these blocks.

1) Inter-Antenna Interference Cancellation: Assuming that the current node is in layer $i$, IAIC computes the first two terms inside the square norm of (5):

$$b_i = \hat{y}_i - \sum_{j=i+1}^{N-1} R_{ij} s_j \tag{7}$$

To reduce the critical path delay, the associated terms inside the summation in (7) are computed and accumulated as early as possible, namely, during the processing of nodes at the preceding layers. Hence, $b_i$ can be computed with only one final multiplication and addition. Rearranging the calculation of $b_i$ greatly facilitates design configurability over the number of antennas. For the proposed 8×8 MIMO detector IC, seven such IAIC units are implemented to compute $b_0$ through $b_6$. In configurations with a smaller number of antennas, fewer IAIC units are needed and unused ones are simply turned off.

To further reduce the critical path delay, two more circuit techniques are adopted. First, as $s_j$ is a QAM constellation point and thus its real part (as well as imaginary part) has at most eight possible values, the multiplier uses a simplified radix-4

2) Child Node Processing:

$$s_i = \mathcal{Q}\!\left(\frac{b_i}{R_{ii}}\right) \tag{8}$$

$$e_i = b_i - R_{ii} s_i \tag{9}$$

where $i$ runs from layer $N-1$ down to layer 0 and $\mathcal{Q}(\cdot)$ is the quantization function that converts its argument to the nearest constellation point. To avoid the division in (8), we adopt a search over all constellation points $\mathcal{O}$ instead:

$$s_i = \arg\min_{s \in \mathcal{O}} |b_i - R_{ii} s|^2 \tag{10}$$

By the orthogonality between the real and imaginary parts, we can search the real and the imaginary part independently for the closest constellation point. In addition, as the signs of the real/imaginary parts of the closest point are identical to those of $b_i$, we search only constellation points in the first quadrant. In summary, we compare $|\mathrm{Re}(b_i)|$ and $|\mathrm{Im}(b_i)|$ with $2R_{ii}$, $4R_{ii}$, $6R_{ii}$ in parallel to find the real and imaginary parts of $s_i$ simultaneously. Concurrently, all possible combinations of the corresponding difference $e_i$ are computed and then selected by the results of the aforementioned comparison to reduce the path delay. Fig. 6 depicts the circuit diagram of the CNP block. From synthesis results, the proposed simplified child node search circuit saves 70.4% of area and 39.7% of circuit delay when compared with the straightforward implementation.

3) Sibling Node Processing and Tabular Enumeration: Finding the next sibling node requires sorting the yet-to-visit constellation points according to their partial costs, which can account for a significant portion of the complexity in tree-search MIMO detection hardware. To avoid that, we apply the tabular enumeration (TE) technique proposed in [15] for fast node order look-up. Fig. 7 illustrates how this technique works. First, suppose the constellation point closest to the equalized and interference-cancelled signal, $b_i / R_{ii}$, has been found and denoted as $s_i$. The region around this constellation point is then divided into eight triangular sub-regions. For each sub-region, the most likely visiting order of all other constellation points is computed in advance and stored in a table. Extensive simulation indicates that TE introduces negligible BER degradation when compared to the exact enumeration order.

Direct implementation of TE requires eight tables for each constellation point, each with one entry per remaining constellation point (63 for 64-QAM). To reduce the required storage, we further unify these tables into one by utilizing the symmetry in the eight sub-regions and the shift invariance property of the partial cost function. Fig. 7 shows the unified node order table, whose maximum offset supports up to 64-QAM. Note that the node order and
the offset to the current constellation point are listed inside and around the table, respectively. As there is only one table for all possible constellation points, a boundary check is necessary to skip those offsets that lead to points outside the constellation. For different QAM modulations, the same table can be reused by merely modifying the boundary. In sum, the unified table is implemented in only 1.76 K bits, which is 0.88% of the straightforward design. Although the unified table significantly reduces the storage, repeated table look-up to skip the invalid offsets can be a speed bottleneck. To prevent this, eight parallel boundary check units are implemented.

Fig. 7. Illustration of triangular partitions and node ordering table in tabular enumeration.

Note that, except for the sibling of the node retrieved from the node pool in Fig. 1, all other sibling nodes in MBF-FD are always the second best among all nodes of the same parent. Therefore, we further propose simplified TE (STE) for processing these sibling nodes during fast descent. Assume, without loss of generality, that the residual with respect to the closest constellation point falls in sub-region 0. Then there are only three possible cases of TE for these sibling nodes, as illustrated in Fig. 8, and hence the table can be reduced to only three entries: (0, 2), (2, 0), (−2, 0). These entries are processed and boundary checked in parallel so that a sibling node can be found in one clock cycle. Fig. 9(a) and (b) show the circuit diagrams of the proposed TE and STE, respectively, where the index TN is the sub-region index and the flip block handles the symmetry processing of the offsets according to TN. Finally, the SNP circuit adopting STE is depicted in Fig. 10. Note that the first two bits of TN are simply the sign values of the real and imaginary parts of the difference, while the third bit of TN needs one more comparison. To reduce the critical path, two STE blocks are implemented to process the two possible cases of the third bit. Moreover, many possible partial results for the difference are available from the CNP unit. These two techniques result in a 56.9% saving in critical path.

Fig. 8. Three possible cases of the second best sibling node assuming sub-region 0 is considered.

4) Partial Cost Computation: The PCC block squares the differences obtained in the CNP and SNP blocks and accumulates the partial cost according to

$$T_i(\mathbf{s}^{(i)}) = T_{i+1}(\mathbf{s}^{(i+1)}) + |e_i|^2 \tag{11}$$

where $e_i$ denotes the difference supplied by the CNP or SNP block. To reduce complexity and shorten the critical path, a special squarer is designed and its outcomes are fed to a carry-save adder that updates the partial cost according to (11).
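The division-free child node search of Section IV-B and the accumulation in (11) can be sketched in a few lines of Python. This is a software illustration only, assuming constellation coordinates on the odd integers (±1, ±3, ±5, ±7 per axis for 64-QAM) and a real, positive diagonal element `r_ii`; the names `slice_axis` and `child_node` are hypothetical.

```python
def slice_axis(b, r_ii, n_levels=8):
    """Nearest odd-integer level to b / r_ii, found by threshold comparisons only."""
    mag = abs(b)
    # Thresholds 2*r_ii, 4*r_ii, 6*r_ii for eight levels; hardware compares them in parallel.
    k = sum(mag > 2 * t * r_ii for t in range(1, n_levels // 2))
    level = 2 * k + 1
    return level if b >= 0 else -level   # the sign of the result follows the sign of b


def child_node(b, r_ii, parent_cost, n_levels=8):
    """Best child symbol, its difference, and the updated partial cost."""
    s = complex(slice_axis(b.real, r_ii, n_levels), slice_axis(b.imag, r_ii, n_levels))
    e = b - r_ii * s
    return s, e, parent_cost + abs(e) ** 2


# Hypothetical interference-cancelled sample and diagonal element.
s, e, cost = child_node(complex(3.9, -8.2), 1.3, parent_cost=0.5)
print(s, e, cost)
```

Real and imaginary parts are sliced independently, and only magnitudes are compared against the thresholds, which mirrors the first-quadrant search described above. Passing `n_levels=4` or `n_levels=2` gives the corresponding 16-QAM or QPSK slicer under the same assumptions.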
Fig. 9. Circuit diagram of (a) tabular enumeration (TE) and (b) simplified tabular enumeration (STE).
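The idea behind the unified table, the flip step, and the boundary check can be illustrated with a short Python sketch. It is an assumption-laden approximation, not the table stored on the chip: the visiting order for the canonical sub-region is generated here by sorting offsets by distance from a representative point (the centroid of the triangle), the sub-region is identified by two sign tests plus one magnitude comparison, and all names are hypothetical.

```python
def build_table(max_steps=7, rep=(2 / 3, 1 / 3)):
    """Offsets (dx, dy), nearest first, for residuals with 0 <= imag <= real.

    `rep` is an assumed representative point of that triangular sub-region.
    Grid spacing is 2, so max_steps=7 covers 64-QAM.
    """
    grid = range(-2 * max_steps, 2 * max_steps + 1, 2)
    offsets = [(dx, dy) for dx in grid for dy in grid if (dx, dy) != (0, 0)]
    return sorted(offsets, key=lambda o: (o[0] - rep[0]) ** 2 + (o[1] - rep[1]) ** 2)


TABLE = build_table()   # one table shared by every constellation point and sub-region


def siblings(center, residual, side=8):
    """Yield the other constellation points in approximate nearest-first order.

    center:   closest constellation point (odd-integer coordinates)
    residual: equalized received point minus center
    side:     points per axis (8 for 64-QAM, 4 for 16-QAM, 2 for QPSK)
    """
    sign_x = 1 if residual.real >= 0 else -1
    sign_y = 1 if residual.imag >= 0 else -1
    swap = abs(residual.imag) > abs(residual.real)   # the one extra comparison
    limit = side - 1
    for dx, dy in TABLE:
        if swap:                                     # flip back from the canonical sub-region
            dx, dy = dy, dx
        x = center.real + sign_x * dx
        y = center.imag + sign_y * dy
        if abs(x) <= limit and abs(y) <= limit:      # boundary check
            yield complex(x, y)


order = list(siblings(complex(1, 1), complex(0.6, 0.3)))
print(order[:3])
```

Changing `side` reuses the same table for a smaller constellation, with only the boundary modified. The sketch walks the table sequentially, whereas the text describes parallel boundary check units to avoid that bottleneck.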
to provide detail information about the nodes for future visit. For node processing, we see the signals flow through IAIC, CNP, SNP and PCC. The outputs of the two SNP blocks are fed into the node pool. Finally, SOG and CL receive the full-length solutions, generate the soft-output LLR values, and maintain the list of candidate solutions.

Significant saving in power, delay, and circuit complexity has been attained through several circuit techniques adopted in designing the proposed MIMO detector IC. Table I summarizes all the techniques used and their improvements in power reduction, clock speed-up, and circuit/storage complexity.

TABLE I. SUMMARY OF CIRCUIT TECHNIQUES

V. EXPERIMENTAL RESULTS

The proposed IC is fabricated in a 0.13-micron CMOS technology. To validate the feasibility of the proposed IC for high-speed MIMO receivers, two copies of the circuit in Fig. 11, each called a processing element (PE), are integrated in this IC. Each PE can independently execute MBF-FD MIMO detection for a received $N$-element signal vector, $\mathbf{y}$. The core area of the IC is mm$^2$. Fig. 12 depicts the chip microphotograph. The maximum operating clock rates of the chip under different supply voltages are plotted in Fig. 13. At the nominal 1.3 V supply voltage, the chip can operate up to 198 MHz, about 1% less than the post-simulation result. Fig. 14 depicts power consumption of the IC when it is configured in four different modes and operating at the maximum frequencies under several supply voltages. As expected, more power is consumed when the detector IC operates with more antennas and/or higher-order QAM constellations.

Fig. 12. Chip microphotograph.

Fig. 13. Maximum clock rate of the proposed IC versus supply voltage.

Fig. 14. Power consumption versus supply voltage of the proposed detector IC in four operation modes.

The throughput of the proposed IC is formulated as

$$\text{Throughput} = \frac{N M_c \, N_{PE} \, f_{clk}}{\bar{N}_{v} \, \bar{N}_{cyc}} \tag{12}$$

where $N M_c$ is the number of bits carried by one received vector (e.g., 24 for 4×4 64-QAM), $f_{clk}$ is the clock rate, $\bar{N}_{v}$ is the average number of visited nodes, $N_{PE}$ is the number of PEs, and $\bar{N}_{cyc}$ is the average number of clock cycles to visit a node, which is 2.53 in the proposed IC. Operating at the maximum frequency and under good channel conditions, the proposed IC achieves 431.8 Mbps and 428.8 Mbps throughput in 4×4 64-QAM and 8×8 64-QAM configurations by setting the run-time constraint to 8 and 16, respectively. Specifically, these configurations can reach $10^{-5}$ coded BER at SNR of 24.2 dB and 22.6 dB. However, when the channel conditions are poor, the MIMO detector may need to visit more nodes and require longer run time to obtain more precise soft-output values for acceptable BER. As such, the achievable throughput can become lower.
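The throughput relation can be explored numerically with the short Python sketch below. It assumes one plausible reading of (12), namely bits per received vector times vectors completed per second, with the four quantities named in the text; the function name, argument names, and the example values are hypothetical and are not measurements from the chip.

```python
def throughput_bps(n_antennas, bits_per_symbol, f_clk_hz, n_pe,
                   avg_visited_nodes, cycles_per_node):
    """Detected bits per second under the assumed form of the throughput formula."""
    bits_per_vector = n_antennas * bits_per_symbol
    # Each PE finishes one vector every avg_visited_nodes * cycles_per_node clock cycles.
    vectors_per_second = n_pe * f_clk_hz / (avg_visited_nodes * cycles_per_node)
    return bits_per_vector * vectors_per_second


# Illustrative inputs only: 4x4 64-QAM (6 bits per symbol), two PEs, 200 MHz clock,
# 10 visited nodes per vector on average, 2 clock cycles per node.
print(throughput_bps(4, 6, 200e6, 2, 10, 2.0) / 1e6, "Mbps")
```

The inverse dependence on the average number of visited nodes shows why the achievable throughput drops when poor channel conditions force the detector to visit more nodes per received vector.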
TABLE II. COMPARISON OF SPHERE DECODING MIMO DETECTOR IMPLEMENTATIONS

Table II lists the overall performance of the proposed IC and several reported sphere decoding MIMO detector ICs. The proposed IC is the only one that supports a maximum of eight antennas and 64-QAM modulation. The detector IC in [8] supports 8×8 MIMO systems, but only for QPSK modulation. In contrast, the proposed IC, capable of providing 21 configurations from 2×2 to 8×8 and from QPSK to 64-QAM, is the most configurable among all reported ICs. Only one other implementation provides some degree of configurability, but with less flexibility in antenna number [9]. Moreover, the proposed chip is one of the very few chips that provide both soft LLR and candidate list output, which are indispensable for MIMO detectors in advanced iterative detection and decoding systems. Therefore, although its throughput is not the highest, satisfactory BER performance is guaranteed. In other words, if one is willing to sacrifice BER performance, the proposed IC can achieve even higher throughput by setting a smaller maximum number of visited nodes. Finally, the proposed IC has the best measured power performance. Note that the power consumption is normalized considering supply voltage and adopted technology using
(13)

VI. CONCLUSIONS

This paper presents the design of a novel configurable soft-output MIMO detector IC. From the algorithmic aspect, a new and efficient sphere decoding algorithm, the MBF-FD algorithm, is shown to be very effective in soft-output MIMO detection, especially when the antenna number or the modulation order is high. In terms of VLSI implementation, new hardware architectures are proposed for a better hardware design. These include the pipelined quad-DEAP and tabular enumeration. Several circuit techniques are adopted in the design of function blocks to further improve the performance of the IC. Measurement results show that the proposed IC outperforms all the other implementations in terms of normalized power. Moreover, the proposed IC is the first soft-output sphere decoding MIMO detector IC that can support up to 8×8 64-QAM MIMO systems. When the chip is configured in 4×4 64-QAM and 8×8 64-QAM with the run-time constraint set to 8 and 16, its throughput can reach 431.8 Mbps and 428.8 Mbps, respectively. With such performance, the proposed IC is very competitive among all soft-output MIMO detectors.

REFERENCES

[14] A. Murugan, H. Gamal, M. Damen, and G. Caire, “A unified framework for tree search decoding: Rediscovering the sequential decoder,” IEEE Trans. Information Theory, vol. 52, no. 3, pp. 933–953, 2006.
[15] T.-P. Wang, T.-H. Lee, and T.-D. Chiueh, “Low-complexity soft-output MIMO detection for iterative decoding using modified best-first tree search,” IEEE Trans. Wireless Commun., submitted for publication.
[16] D. E. Knuth, The Art of Computer Programming, 3rd ed. Reading, MA: Addison Wesley, 1997, vol. 1, Fundamental Algorithms.
[17] A. Carlsson, “The DEAP: A double-ended heap to implement double-ended priority queues,” Information Process. Lett., vol. 26, pp. 33–36, 1987.
[18] P. Salmela, J. Antikainen, O. Silven, and J. Takala, “Memory-based list updating for list sphere decoders,” in Proc. IEEE Workshop on Signal Processing Systems (SiPS), 2007, pp. 633–638.
[19] C. Hess, M. Wenk, A. Burg, P. Luethi, C. Studer, N. Felber, and W. Fichtner, “Reduced-complexity MIMO detector with close-to ML error rate performance,” in Proc. 17th ACM Great Lakes Symp. VLSI (GLSVLSI), 2007, pp. 200–203.
[20] M. Shabany and P. Gulak, “Scalable VLSI architecture for K-best lattice decoders,” in Proc. ISCAS, 2008, pp. 940–943.