2015 LDPC

This article has been accepted for inclusion in a future issue of this journal.
Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSI: REGULAR PAPERS, VOL. 62, NO. 7, JULY 2015
Byte-Recongurable LDPC Codec Design With

Application to High-Performance ECC of NAND
Flash Memory Systems
Yu-Min Lin, Huai-Ting Li, Ming-Han Chung, and An-Yeu (Andy) Wu, Fellow, IEEE
AbstractThe reliability of NAND Flash memory deteriorates

due to multi-level cell technique and advanced manufacturing
technology. To deal with more errors, LDPC codes show superior performance to conventional BCH codes as ECC of NAND
Flash memory systems. However, LDPC codec for NAND Flash
memory systems faces problems of high redesign effort, high
on-chip memory cost and high-throughput demand. This paper
presents a byte-recongurable cost-effective high-throughput
QC-LDPC codec design for NAND Flash memory systems. Recongurable codec design is proposed to support various QC-LDPC
codes for different Flash memories. To save on-chip memory
cost, shared-memory architecture and rescheduling architecture are presented for encoder and decoder, respectively. The
shared-memory architecture can save 23% area cost of the encoder and the rescheduling architecture reduces 15% area cost of
decoder. In addition, the proposed sub-iteration based early termination (SIB-ET) scheme reduces 29.6% decoding iteration counts
compare with the state-of-the-art early termination scheme when
. Finally, the QC-LDPC
raw BER of Flash memory is
codec for NAND Flash memory systems is implemented in TSMC
90 nm technology. The post-layout result shows that the core size
is only 6.72
at 222 MHz operating frequency.
Index
TermsCost-effective,
early-termination,
highthroughput, NAND ash memory, quasi-cyclic low-density
parity-check (QC-LDPC) codes, recongurable VLSI design.
I. INTRODUCTION
AND FLASH memory, one of prevailing of non-volatile

memory systems (NVMS), is widely used in many consumer electronic products. High-capacity NAND Flash memory
which employs multi-level cell (MLC) technique, enables multiple bits to be stored in one cell, can signicantly increase the
storage density. However, they suffer severe degradation of reliability due to higher error rate of MLC [1]. Error-correcting
code (ECC) is usually applied in memory controller to deal
with errors. Currently, BCH codes are widely used to perform
ECC functions of Flash memory. However, the advanced manufacturing technology, together with the use of MLC, further
deteriorates the reliability of NAND Flash memory. Recently,
Manuscript received December 12, 2014; revised March 24, 2015; accepted
April 08, 2015. This work was supported in part by the National Science
Council, Taiwan, under Grant NSC-100-2220-E-002-013, and by VIA Labs,
Inc. under Grant 100-S-C47. This paper was recommended by Associate Editor
A. Ashra.
The authors are with the Graduate Institute of Electronics Engineering and
Department of Electrical Engineering, National Taiwan University, Taipei, 106,
Taiwan (e-mail: yumin@access.ee.ntu.edu.tw; wesli@access.ee.ntu.edu.tw;
cmh@access.ee.ntu.edu.tw; andywu@ntu.edu.tw).
Color versions of one or more of the gures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identier 10.1109/TCSI.2015.2423798
Fig. 1. Proposed QC-LDPC codec for NAND ash memory.
low-density parity-check (LDPC) codes with soft-decision have

been proposed as ECC to solve the reliability issues of NAND
Flash memory [2], [3]. Compared to the BCH codes, LDPC
codes show much superior performance with the same parity
bits.
Although LDPC codec design had been deeply researched for
communication applications [7][9], there are still three major
issues need to be addressed when applying LDPC codes as ECC
to NAND Flash memory (as shown in Fig. 1):
1) High redesign effort: NAND Flash cell information remains a trade secret, and the page size varies with different
vendors and different technology. Therefore, unlike communication applications, there is no standard of ECC for
NAND Flash system and it causes high redesign effort to
implement specic codec for particular Flash systems.
2) High memory cost: The codeword length of LDPC codes
for NAND Flash memory ranges from 1 KB to 4 KB, which
is much longer than codeword length of LDPC codes for
communication applications. Because the sequential logic
is proportion to codeword length, LDPC codec for NVMS
suffers from high memory cost.
3) Demand of decoder throughput: Prior to the Flash memories suffer from serious wear; the throughput of decoder
should be higher than memory accessing speed to avoid
becoming system bottleneck. Unlike communication systems, Flash memories guarantee to have low error rate
under few program erase counts. Hence, we make use of
early termination to increase decoder throughput.
Recently, a rate-0.96 (68254, 65536) decoder was presented
by [4] with low energy consumption design, and a drift-immune
soft-sensing rate-0.875 (9376, 8204) decoder was developed by
[5]. However, both [4] and [5] did not consider redesign effort
1549-8328 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
2
and memory cost of posterior messages. Although [5] implemented early termination in the decoder, the parity-check operation only performs at the end of iteration, which is not efcient
enough.
In this paper, we aim to develop a recongurable cost-effective high-throughput LDPC codec for NAND Flash memory
systems. Since quasi-cyclic LDPC codes signicantly reduce
the hardware complexity and had been adopted as standard in
many communication systems [7][9], our recongurable design supports QC-LDPC codes. As depicted in Fig. 1, the main
contributions of this paper are as follows:
1) A recongurable codec is desirable to reduce redesign effort. In addition, because the codeword length of LDPC
equals page size and page size of memory is multiples of
byte, a fully-recongurable LDPC codec is overdesigned.
Therefore, we propose the byte-recongurable QC-LDPC
encoder and decoder design. The codec is able to support
various QC-LDPC codes, with the constraint that size of
circulant sub-matrix be multiples of eight. Hence, the design is named byte-recongurable.
2) We propose two techniques to decrease memory cost of
encoder and decoder, respectively. For QC-LDPC encoder, we propose a shared-memory architecture to reduce
memory cost of generator matrix which saves 23% area
cost of encoder. Moreover, the proposed rescheduling
architecture is able to eliminate FIFO memory and reduce
the soft message memory of QC-LDPC decoder. The
rescheduling architecture is able to reduce 15% area cost
of decoder.
3) A novel sub-iteration based early termination (SIB-ET)
scheme is proposed. The proposed SIB-ET performs the
parity-check equation more frequently according to the decoding schedule of decoding algorithm. To our best knowledge, the SIB-ET scheme is the rst scheme that guarantees the correctness of decoding result and is able to terminate decoding procedure within iteration processing. Compare with the state-of-the-art early termination scheme that
guarantees the correctness of decoding result, the SIB-ET
scheme reduces 29.6% decoding iteration counts when raw
BER is
.
At last, we implement and validate the byte-recongurable
high-throughput QC-LDPC codec in TSMC 90 nm technology.
The core size is 6.72
at 222 MHz operating frequency.
The rest of this paper is organized as follows. Section II
briey introduces the fundamental of QC-LDPC codes. Section III presents the proposed QC-LDPC encoder architecture.
The proposed QC-LDPC decoder architecture is illustrated in
Section IV. Section V presents the proposed sub-iteration based
early termination scheme. Section VI shows the implementa-
..
.
..
.
..
..
.
..
.
tion results of proposed QC-LDPC codec. Finally, we conclude

this paper in Section VII.
II. FUNDAMENTALS OF QC-LDPC CODES
A. QC-LDPC Codes
An
LDPC code represents the LDPC code with codeword of length , message of length , and parity of length
. Also, LDPC code can be specied by a parity-check
matrix which is denoted as . Parity-check matrix is
and LDPC codeword satises
(1)
where
is
and is a
zero vector.
Quasi-cyclic LDPC (QC-LDPC) codes are an essential
branch of LDPC codes which signicantly reduce hardware
requirements. The parity-check matrix of QC-LDPC code can
be represented by an
array of circulants sub-matrix
as
..
.
..
..
.
(2)
where
and
. Each sub-matrix
is a
square matrix, where each row of a sub-matrix is the
cyclical right-shift of previous row, and the rst row is cyclically
right shifted from the last row. The structured parity-check matrix and the cyclical-shift property of circulant signicantly simplify the hardware implementation of QC-LDPC codec compared to other non-structure LDPC codes.
B. Encoding of QC-LDPC Codes
The encoding of an
LDPC codes is identical to that
of linear block codes by generator matrix, . Given an input
message sequence , there exists codeword that satised
(3)
For QC-LDPC codes, [10] shows that the desired generator matrix can be shown in systematic-circulant (SC) form, as shown
in the equation at the bottom of the page.
In generator matrix with the form given by (4), is a
identity matrix, and
is a
zero matrix, and
is a
with
and
circulant of size
. Therefore, QC-LDPC encoding of parity part utilizes
the cyclic shift property of circulants in . The proposed parity
sub-parity generators, and
generator in [10] is composed of
the th sub-parity generator calculates the th section of parity
bit as follows,
(5)
..
.
..
..
.
(4)
LIN et al.: BYTE-RECONFIGURABLE LDPC CODEC DESIGN WITH APPLICATION TO HIGH-PERFORMANCE ECC
Fig. 2. (a) Architecture of conventional QC-LDPC encoder. (b) Timing diagram of sub-parity generators with private memory architecture.
where
is consecutive bits of th section of , and
consecutive bits of th section of parity part .
is
C. Decoding of QC-LDPC Codes

The iterative decoding algorithm of LDPC codes are based
on belief propagation, in which messages from bit nodes and
check nodes are updated and passed along the edges iteratively.
Layered decoding, which is also called turbo decoding message
passing (TDMP) [11], has been widely employed in many
QC-LDPC decoder designs. Compare to standard decoding
algorithms, layered decoding algorithm has longer processing
delay to nish an iteration because more bit node updating
(BNU) is needed. However, [15] proposed an overlapped
QC-LDPC decoder with layered min-sum (LMS) decoding
algorithm which shortened the processing delay through reasonable pipelining between check node updating (CNU) and
BNU. Since the CNU and BNU can process simultaneously
with optimized schedule, the overhead of BNU processing can
be negligible. Compare to standard decoding algorithm, the
LMS decoding algorithm has twice the convergence speed in
terms of decoding iterations [12] and is able to reduce decoder
complexity and memory requirements. Therefore, we apply
LMS decoding algorithm to implement our QC-LDPC decoder.
Decoding process of layered decoding executes the updating
operations in a layer-by-layer fashion. Hence, an iteration decoding is composed of several sub-iterations decoding in layered decoding algorithm. For each sub-iteration decoding, the
check node updating and the bit node updating are executed
locally within the corresponding row. The iterative updating
process of th layer is composed of following three steps.
1) Read & subtract: The prior messages will be calculated
as follows,
(6)
where is the posterior messages,
denotes the set of
represents the exindexes of position in the row , and
trinsic messages corresponding to the nonzero entries in
row . In addition, is initialized with log-likelihood ratio
(LLR) and is initialized as zero.
Fig. 3. (a) Architecture of proposed QC-LDPC encoder. (b) Timing diagram

of sub-parity generators with proposed shared-memory architecture.
2) Decode: The soft-input-soft-output (SISO) function decodes the check node by check node updating processor.
In this paper, we adopt layered min-sum decoding algorithm as SISO function which is formulated below:
(7)
denotes the prior messages from
except the corresponding bit node connecting, and is the normalization
factor.
3) Write back: The partial posterior messages
are updated by adding corresponding to as
(8)
III. PROPOSED BYTE-RECONFIGURABLE ENCODER
ARCHITECTURE
An online recongurable encoder is desirable to reduce the
redesign cost of varying QC-LDPC encoder design. Moreover,
the proposed shared memory architecture is able to reduce
memory cost of generator matrix.
A. Conventional Encoder Design
Hardware architecture of QC-LDPC encoder from [10] is
shown in Fig. 2(a). The encoder is composed of a circulant
memory and
parallel fully-serial sub-parity generators. The
is the rst row of circulant
and is stored in circulant memory. Besides, every fully-serial sub-parity generator is
a shift-register-adder-accumulator with feedback shift register
circuit. The input sequence is serially shifted into the encoder,
and the sub-parity generator computes (5) in a fully-serial way.
First,
to
are parallel loaded into the -bit feedback
shift-register of each sub-parity generator. The contents of the
feedback shift-register are multiplied by the input bit through
the AND gates, and the outputs of AND gates are accumulated
with the content of -bit register by the XOR gates. Outputs of
the XOR gates are stored in the -bit registers for next row opto
are generated by
eration. The second rows of
cyclically right shifting the
to
, which are stored in
feedback shift register. Then, the procedure proceeds as those of
the rst row. The computation is nished after performing the
4
mentioned operation times to compute

,
. Afterwards, the
to
are parallel loaded into
the feedback shift-registers and it takes another cycles to generate the partial sum, which is denoted as
. At
last, (5) is determined after ( - ) repetitive processing. Furto
, are generated by
thermore, all the parity bits, from
sub-parity generators.
The conventional encoder applies private memory architecture, where each sub-parity generator has a unique memory
bank. The sub-parity generators access their memory bank in a
parallel and independent fashion. The corresponding memory
access strategy is shown in Fig. 2(b). Because each sub-parity
generator requires a private memory, many shallow-depth
memories are contained in the encoder. The periphery circuits
of a memory, such as row decoder, column decoder, and sense
amplier, contribute a xed cost of the memory. Therefore,
those small-sized memories increase the xed cost of the systematic-circulant memory bank and leads to high memory cost.
B. Byte-Reconfigurable Encoder Design
Since the codeword length of LDPC codes will be multiple
of byte, we propose the byte-recongure design for QC-LDPC
encoder. The architecture of the proposed byte-recongurable
encoder is shown in Fig. 3(a). The design includes conguring
mode and encode mode. The controller feeds in the generator
matrix and related , , in conguring mode. Then the encoder will generate parities in encoding mode.
In the conguring mode, the , , and can be congured
in the proposed encoder. Due to xed encoder hardware, the
, , are restricted to a predened parameter space
,
,
, and should be multiple of eight, so-called it is
byte-recongurable. Moreover, the rst row of ciculant
,
, should also be updated. By (4), the
which is denoted as
corresponding number of
is
.
The
will be serially inputted through input ports and be
stored in SRAM memory. Hence, the contents of the circulant
memory can be dynamically programmable for switching to different generator matrix. Moreover, the recongurablility of the
parameter
is simply achieved by changing the boundary of
memory addresses.
In the encoding mode, input become segment of message
and the encoder procedure is similar to the mentioned procedure
in Section II. The controller of proposed encoder will encode
information message based on set , , and as below:
out of
sub-parity generators are going to be ac1)
tivated by enable signals
,
.
2) Each sub-parity generator takes
cycles
to complete a parity section.
3) The internal data width of congurable sub-parity generator is controlled by .
C. Shared-Memory Architecture for Systematic-Circulant
Memory
The codeword length of LDPC codes for Flash memory
system is used to be multiple KBs. As mentioned before,
is
stored in a systematic-circulant memory. According to (4), total
size of systematic-circulant is
bits. Hence,
the memory cost of generator matrix takes a large proportion
of encoder area.
TABLE I
COMPARISON OF QC-LDPC ENCODERS
Due to regulation of memory generator, 688

with 4 useless words.
Encoding cycles are calculated by
288 SRAM are generated
The conventional encoder design contains many shallowdepth memories to implement private memory architecture.
Those memories increase the xed cost of the systematic-circulant memory bank. Therefore, it is desired to use a deep-depth
memory instead of many shallow-depth memories to store the
generator matrix. Based on the few deep-depth congurations,
we proposed a shared-memory architecture for the systematic-circulant memory bank.
As shown in Fig. 3(a), all the sub-parity generators share a
single memory bank, which is composed of one deep-depth
memory. Since deep-depth memory serially reads out the
, only one sub-parity generator can access the output of
systematic-circulant memory bank at a time. To avoid data
conict among the sub-parity generators, serial memory access
is also proposed for the shared-memory architecture. Fig. 3(b)
illustrates the timing diagram of sub-parity generators with
proposed serial memory accessing. Sub-parity generators sequentially load the data from the systematic-circulant memory
without conicts. The overhead of the proposed shared-memory
architecture is merely additional
cycles in the latency of
encoding.
D. Implementation Results of Proposed Encoder
A byte-recongurable cost-effective high-throughput
QC-LDPC encoder based on the proposed architecture is
designed and synthesized with TSMC 90 nm technology.
The implemented QC-LDPC encoder is able to support
and
.
Table I shows the comparisons between the conventional encoder and the proposed encoder. The proposed shared-memory
architecture applies a deep-depth memory to substitute six
shallow-depth memories. Gate count of the proposed encoder is
23% smaller than the total gate count of conventional encoder
with negligible overhead of encoding cycles. The synthesis
result shows that the proposed encoder can achieve 273 MB/s
throughput under 221.3 K gate count.
IV. PROPOSED BYTE-RECONFIGURABLE DECODER
ARCHITECTURE
An online byte-recongurable decoder is proposed to support various QC-LDPC codes for diverse NAND Flash memories. Moreover, based on the conventional block-serial decoder
However, [13] contains two permutation networks. Therefore, [14] proposed a reduced complexity architecture which
merges two permutation network into one by permuting differential offset value
between consecutive layers as
(9)
and
. Therefore, one permutawhere
tion network is unnecessary without deinterleaving during decoding. The architecture of conventional block-serial layered
min-sum decoder is illustrates in Fig. 4(b). The permutation
cyclically shifts posterior messages according to differential
offset. Then, -parallel CNUs and -parallel BNUs will execute
in parallel to decode codeword. When the CNU processor in
the conventional decoder is processing, the prior message has
to be buffered in the FIFO memory for BNU processor to update later. Then will be added with and be stored to posterior message memory. To store those soft messages, the conventional decoder contains large size memories.
B. Byte-Reconfigurable Decoder Design
Fig. 4. (a) Illustration of block-serial decoding schedule. (b) Architecture of

block-serial layered min-sum decoder.
architecture, we propose a rescheduling architecture to reduce

memory cost of decoder.
A. Conventional Decoder Design
Layered min-sum algorithm divides
matrix into several
layers. For QC-LDPC code, it often regards a block row as a
layer to maximize parallelism while avoiding data conicting
[13]. In more detail, parity matrix is composed of
cirand the offset of each circulant is
. The
culant matrices
rows of non-zero entries of circulants are non-overlapping.
Those non-overlapping rows represent that there is no data coniction among the BNU and CNU in a row of circulants. Therefore, the rows of circulant can be parallel processed to maximize parallelism.
The hardware decoding schedule of [13], which is called
block-serial decoding, is shown in Fig. 4(a). The block-serial
decoding schedule of layered decoder processes one circulant at
a time. Therefore, a sub-iteration is nished when circulants
of a layer are all processed. Besides, it takes
sub-iterations
to nish single iteration. The decoding process is depicted as
following paragraph.
First, a bit node to check node permutation network interleaves the LLRs according to the offset. Second, CNU is applied to execute (6) and (7). Third, the updated LLRs are deinterleaved by check node to bit node permutation network. Fourth,
the deinterleaved LLRs will be inputted to BNU and BNU will
calculate (8). At last, after decoding of single layer is accomplished, the system will check whether whole
is processed
or not. If not, the decoding will start another iterative layer decoding refer to sub-iteration. Otherwise, the system will check
the termination condition. Since the termination condition meet,
the decoding procedure is going to be terminated. Contrarily,
new iteration should be processed.
The architecture of the proposed QC-LDPC decoder is shown

in Fig. 5. The byte-recongurable design is similar to previous
encoder. In conguring mode, ,
and will be congured
within the predened parameter space constrained by
,
,
, and should be multiple of eight. Besides, offsets
of the QC-LDPC code will be serially inputted and stored in
offset memory. The controller of proposed decoder will arrange
the procedure with set , , as below:
1) The boundary address of prior message memory and sign
memory bank is controlled by
and .
2) out of
BNU and CNU processors will be activated.
3) Shifting range of recongurable permutation network is
bounded by .
4) It takes
cycles to nish a sub-iteration decoding
sub-iteration to accomplish a iteration decoding.
and
To carry out the recongurable decoder, hardware modication includes BNU processors, CNU processors and permutation network are proposed based on the design concept of [15].
C. BNU Design
-parallel BNU proThe structure of byte-recongurable
cessors is shown in Fig. 6 and each processor is activated by
an enable signal. A BNU performs the bit node updating of
each block row to calculate (8). The rst and second minimum
value (
and
, respectively), the position index of
the rst minimum (is_min_pos), and the summation of sign bits
are passed from CNU to the BNU to recover the
extrinsic messages , which is a prevailing method to reduce
memory requirement [15]. If the position index of the
is
equal to the position of the exact prior message, the
is
used to update . At last, posterior messages are updated by
adding to .
D. CNU Design
-parallel CNU proThe structure of byte-recongurable
cessors is shown in Fig. 7. Similarly, enable signal is controlled
by to activate each CNU. The prior messages will be calculated through minus . Due to the serial processing, the CNU
performs the check node updating simply using comparing and
6
Fig. 5. Architecture of the proposed byte-recongurable cost-effective high-throughput QC-LDPC decoder.
Fig. 6. Structure of byte-recongurable
-parallel BNU processors.

Fig. 8. Modied byte-recongurable permutation network.
Fig. 7. Structure of byte-recongurable
-parallel CNU processors.
selecting to nd the
and
. On the other hand, an
accumulator is applied to compute the summation of sign bits
. All updated extrinsic messages pertaining to CNU
,
,
are condensed into the component set, including
position of
, and sign bits. Through CNU updating, all mentioned information of each layer will be obtained
and preserved. In our architecture, sign bit of the extrinsic messages will be store separately in SIGN memory because the
amount of sign bits is relatively large and other information will
be stored in controller of CNU processors for extrinsic message
recovery.
E. Permutation Network Design
A recongurable permutation network is able to cyclically
. Refleft shift input data vector equal to or less than
erence [15] had proposed a congurable permutation network
which needs two cycles to complete a permutation. Since our

design is not dual-path as [15], we propose a modied congurable permutation network which is able to accomplish the permutation in one cycle. As shown in Fig. 8, the input data is partitioned into , and according to the parameters and . The
last
bits are zero padding. The input is cyclically
left shifted by d' with the 1st
-way barrel shifter [22] and
with the 2nd
-way
cyclically left-shifted by
barrel shifter. Parts of the output of both barrel shifters are selected and combined by multiplexer to generate the output.
F. Rescheduling Architecture
The conventional block-serial QC-LDPC decoder architecture of layered min-sum algorithm is shown in Fig. 4(b). We
notice that the content of prior message is stored two times in
FIFO buffer and posterior message. Therefore, a rescheduling
architecture is proposed to isolate prior message memory. Instead of storing the posterior messages in the posterior message memory and the prior messages in the FIFO memory,
the proposed architecture only stores prior messages in a prior
message memory. As a result, FIFO memory is eliminated and
the proposed architecture replaces posterior message memory
with prior message memory.
The dataow of conventional decoder follows by permutation network, CNU and BNU, respectively. We call the conventional decoder as
structure. The timing
TABLE II
MEMORY SIZE OF QC-LDPC DECODER
34560) QC-LDPC codes, which

and
, the total saving memory bits is
Fig. 9. Timing diagram of the QC-LDPC decoding. (a) Conventional decoder.

(b) Proposed rescheduling decoder.
diagram of CNU processor and BNU processor in the conventional decoder is depicted in Fig. 9(a). It takes cycles for CNU
processor to perform (7) for each row in block row . After the
CNU nishes processing, the BNU processor starts to update
the posterior messages with other cycles.
However, to keep consistent with the data ow in the layered
min-sum algorithm with block-serial schedule, the positions of
permutation network, CNU processors, and BNU processors are
also rearranged as shown in Fig. 5. The proposed decoder is regarded as
structure. The detailed decoding
procedures of rescheduling architecture are as follows:
1) After the log-likelihood ratio (LLR) is inputted and stored
in prior message memory, out of
BNUs processors
will operate the bit node updating.
2) Modied recongurable permutation network will cyclically left shift LLR values by shift-value . A differential offset calculator block is added to calculate the according to [14].
3) After shifting, out of
CUNs processors will update
the extrinsic message and prior message .
4) The extrinsic message is forwarded to BNUs and prior
message is inputted to prior message memory. Then, go
back to Step 1).
In addition, the timing diagram of the CNU processor and
the BNU processor in the proposed rescheduling architecture
is shown in Fig. 9(b). When the CNU processor nishes processing block row , the BNU processor loads the prior messages from prior message memory and updates the posterior
messages of block row . Without writing back, the posterior
messages of block row is directly passed to the permutation
.
network and CNU processor for row updating of block row
Compared to the conventional decoder, the proposed architecture not only reduces the area cost of memory, but also lets the
timing schedule of decoder be fully overlapped.
G. Memory Size of Rescheduling Architecture
The memory size of conventional architecture and
rescheduling architecture are listed in Table II. It can be
observed that the total memory saving includes the FIFO
memory and reduced soft message memory. Because
QC-LDPC coeds for Flash memory systems are long codeword length, the proposed rescheduling architecture receives
huge amount of memory reduction. For rate-0.95 (32832,
and is equivalent to 256 K gate

counts.
V. SUB-ITERATION BASED EARLY TERMINATION SCHEME FOR
QC-LDPC DECODER
The redundant calculation of decoder leads to decreased
throughput and increased power consumption. Since the proposed decoder is designed for future NAND Flash systems
with advanced manufacturing and triple level cell, maximum
decoding iteration of proposed decoder is set to be a high value.
When NAND Flash memories are healthy, only few iterations
of decoding are needed. Hence, LDPC decoder with efcient
early termination scheme is necessary.
A. Proposed Sub-Iteration Based Early Termination Scheme
Several studies have been proposed to early terminate the
decoding process for LDPC codes. Hard-decision-aided (HDA)
criterion is proposed by [8] to early terminate the decoding
process if two consecutive iteration decoding results are the
same. Although HDA is simple for implementation with negligible degrading BER [23], it doesn't guarantee the correctness
of decoding result while incorrect termination is forbidden for
NVMS. On the other hand, [16] and [17] checked the parity
check equation (PCE) that terminate decoding processes and
guarantee the correctness of decoding result. If a codeword has
been corrected, the hard-decision bits of the valid codeword
satisfy the PCE, which can be formulated as
.
However, early termination scheme proposed by [17] is unable
to terminate decoding operations within iteration processing.
Currently, frozen technique is proposed by [25] and is able to
terminate within iteration processing. Nevertheless, stopping
parameters of frozen techniques should be set appropriately
according to channel state. Since decoder will be applied for
various NAND Flash systems in the scope of this paper, frozen
technique needs extra effort to determine stopping parameters
for specic NAND Flash system in advance.
We propose a sub-iteration based early termination (SIB-ET)
scheme which computes PCE corresponding to decoding procedure. The proposed SIB-ET is universal and guarantees the
correctness of decoding result. We notice that layered decoding
updates posterior messages every layer, which infers that
will be updated every sub-iteration. Fig. 10 shows the checking
schedule of proposed SIB-ET scheme, which is performed in
a column-by-column manner along a block-by-block decoding
process. The parity check can be regarded as checking whether
or not every cycle, where is accumulated from 1
8
Fig. 10. Checking schedule of sub-iteration based early termination scheme.
Fig. 12. Block diagram of the layered decoder with proposed SIC-ET scheme.
TABLE III
AVERAGE SUB-ITERATION NUMBERS ANALYSIS
Fig. 11. Timing diagram of the proposed SIC-ET scheme.
to . The timing diagram of proposed SIB-ET scheme in the

block-serial decoder is shown in Fig. 11. Since the SIB-ET is
working concurrently with block-serial decoding process, the
parity check for decoding result of the layer will be done when
BNU nished processing next layer. Therefore, it takes only
single extra sub-iteration to do the PCE checking, which make
it possible to early terminate within iteration processing.
B. Hardware Design of Proposed SIB-ET
A cost-effective SIB-ET design is proposed to implement
parity check equation
. Because the elements of
parity check matrix and hard-decision of posterior messages are
all binary, we can use XOR operation to implement the matrix
multiplication. Due to the structure of QC-LDPC, there is only
one input for each block-serial decoding. Therefore, we can use
XOR-register pair to accumulate parity check equation.
The block diagram of layered decoder with proposed SIB-ET
scheme is shown in Fig. 12. Every sign bits of posterior mesparallel
sages from BNU processor are forwarded to
1-bit byte-recongurable permutation network (BR-PN), which
is controlled by . The permutation network will cyclic shift
sign bits according to the difference of shift value between decoding block and checking block. Then, the accumulators do
the XOR operation between accumulated register and permuted
data. At last, OR-tree checks if the content of the registers are
all-zero after all the block columns are processed. If the output
of the OR-tree is zero, the parity check is satised and the decoding procedure can be early terminated.
To verify the proposed SIB-ET scheme, we follow [18]
to construct a rate-0.90 (8320, 9280) QC-LDPC code with
. Layered min-sum decoder with
SIB-ET scheme is implemented with Xilinx ML605 for emulation. The maximum iteration is set to 8, and each iteration
consists of 6 sub-iterations. Table III shows the average sub-iteration numbers of proposed SIC-ET scheme compared with
low-complexity iteration control algorithm (LC-ICA) scheme
which is proposed by [17]. The quantity of emulation data
is
pages for each raw BER. It shows that the proposed
SIB-ET reduces 29.6% of sub-iteration decoding numbers than
LC-ICA when raw BER of Flash memory is
.
VI. SYSTEM PERFORMANCE AND IMPLEMENTATION RESULTS
A. System Performance
The designed QC-LDPC codec supports maximum 34560
bits codeword length whilst the supported block rows
are 6 and block columns are 120. Moreover, the codec
is evaluated with reasonable rate-0.90 (8320, 9280) and
rate-0.95 (32 832, 34 560) QC-LDPC codes based on [18]
for prototypical 1 KB page size and 4 KB page size, respectively. The specication of rate-0.90 QC-LDPC code
is
. The specication of rate-0.95
.
QC-LDPC code is
A performance simulation is given in Fig. 13. The horizontal
axis is raw bit error rate (Raw BER) and the vertical axis is the
bit error rate after channel coding (Coded BER). The ideal inputted intrinsic messages are calculated according to voltage
distribution by [19] with 4 bits word length. The maximum
number of iterations is set to 8. The normalized factor in the
layered min-sum algorithm is set to be 0.5 since it can be easily
implemented by hardware. It can be observed that performance
of QC-LDPC code is better than BCH code with the same code
rate.
In addition, xed-point analysis is performed to determine
the word length of the messages in the decoder. The result of
xed-point analysis leads us to set to be 8 bits. According to
word length of , corresponding is set to be 7 bits and is set
to be 6 bits.
TABLE IV
SUMMARY OF DIFFERENT QC-LDPC DECODERS
Fig. 13. The simulation results of proposed QC-LDPC codec.
B. Implementation Results of Proposed QC-LDPC Decoder

The proposed byte-recongurable cost-effective highthroughput QC-LDPC decoder is implemented with TSMC 90
nm technology. The synthesis results of the decoder are 1443
K gate counts, including 95.4 K gate counts for early termination, at 263 MHz operating frequency. The overhead of early
termination is only 6.6%. On the other hand, the total saving
memory bits due to rescheduling architecture is equivalent to
256 K gate counts, which implies 15% area cost reduction with
the rescheduling architecture. The throughput of decoder is
dened as
Reference [5] only provides the synthesis results, we estimate the core area
through area of NAND2 gate (gate counts).
(10)
We particularly place and route the QC-LDPC decoder and
the post-layout results are shown in Table IV. At last, the comparisons of some other state-of-the-art QC-LDPC decoders for
NAND Flash memory are also summarized. To make a fair comparison, we normalize the area to 90 nm process and 8 bits word
length, normalize the throughput to 90 nm process and 8 iterations, and dene the throughput-to-area ratio (TAR) [4],
(11)
as a performance index. It can be observed that our design
has the highest TAR. Furthermore, the power consumption of
post-layout design is estimated by PrimePower. For rate-0.95
QC-LDPC code and 8 iterations decoding, the power consumption is 700 mW with the operating voltage of 0.9 V.
C. Implementation Results of Proposed QC-LDPC Codec
A QC-LDPC codec chip composed of the proposed
QC-LDPC encoder and decoder is implemented in TSMC 90
nm CMOS process by using Cadence SoC encounter tool. The
implemented QC-LDPC encoder is able to support
,
, and
(
,
).
Fig. 14 shows the layout of the proposed QC-LDPC codec
for NAND Flash memory. The block of encoder and decoder
with memory allocation is identied. The post-layout chip
features are summarized in Table V. The core size is 6.72
at 222 MHz operating frequency. The encoder performs 222
MB/s throughput, which is faster than minimum demand of
Fig. 14. The chip layout of the proposed QC-LDPC codec.
Flash products. On the other hand, the minimum throughput of

decoder is 163 MB/s.
VII. CONCLUSIONS
We have developed a byte-recongurable cost-effective
high-throughput QC-LDPC codec which can be applied to
NAND Flash memory systems. The proposed byte-recongurable codec design is able to support various NAND Flash
memories with 6% area overhead. Besides, the share-memory
architecture saves 23% of encoder area and the rescheduling
reduces 15% area cost of decoder. To increase decoder
throughput, the SIB-ET is proposed to efciently terminate
decoding procedure. Compare to the state-of-the-art early
termination scheme, the proposed SIB-ET reduces 29.6% of
decoding time when Raw BER of Flash memory is
.
10
TABLE V
POST-LAYOUT CHIP SUMMARY
The implemented VLSI achieves chip size of 6.72

. The
throughput of encoder is 222 MB/s. The minimum throughput
of decoder is 163 MB/s.
REFERENCES
[1] N. Mielke, T. Marquart, N. Wu, J. Kessenich, H. Belgal, E. Schares, F.
Trivedi, E. Goodness, and L. R. Nevill, Bit error rate in NAND ash
memories, in Proc. IEEE Int. Rel. Phys. Symp., Apr. 2008, pp. 919.
[2] G. Dong, N. Xie, and T. Zhang, On the use of soft-decision errorcorrection codes in nand ash memory, IEEE Trans. Circuits Syst. I,
Reg. Papers, vol. 58, no. 2, pp. 429439, Feb. 2011.
[3] D. Lee and W. Sung, Estimation of NAND ash memory threshold
voltage distribution for optimum soft-decision error correction, IEEE
Trans. Signal Process., vol. 61, no. 2, pp. 440449, Jan. 2013.
[4] J. Kim and W. Sung, Rate-0.96 LDPC decoding VLSI for soft-decision error correction of NAND ash memory, IEEE Trans. Very Large
Scale Integr. (VLSI) Syst., vol. 22, no. 5, pp. 10041015, May 2014.
[5] K. C. Ho, P. C. Fang, H. P. Li, C.-Y. M. Wang, and H. C. Chang, A 45
nm 6 b/cell charge-trapping ash memory using LDPC-based ECC and
drift-immune soft-sensing engine, in IEEE Int. Solid-State Circuits
Conf. Dig. Tech. Papers, Feb. 2013, pp. 222223.
[6] D. Kim, B. Chung, and R. E. Kim, Improved hard-decision decoding
LDPC Codec IP design, in Proc. IEEE Int. Symp. Circuits Syst., May
2012, pp. 416419.
[7] Y. C. Liao, C. C. Lin, H. C. Chang, and C. W. Liu, Self-compennsation
technique for simplied belief-propagation algorithm, IEEE Trans.
Signal Process., vol. 55, no. 6, pp. 30613072, Jun. 2007.
[8] X. Y. Shih, C. Z. Zhan, C. H. Lin, and A. Y. Wu, An 8.29
52
mW multi-mode LDPC decoder design for mobile WiMAX system in
CMOS process, IEEE J. Solid-State Circuits, vol. 43, no. 3,
0.13
pp. 672683, Mar. 2008.
[9] C. Roth, P. Meinerzhagen, C. Studer, and A. Burg, A 15.8 pJ/bit/iter
quasi-cyclic LDPC decoder for IEEE 802.11n in 90 nm CMOS, in
Proc. IEEE Asian Solid-State Circuits Conf., Nov. 2010, pp. 14.
[10] Z. W. Li, L. Chen, L.-Q. Zeng, S. Lin, and W. H. Fong, Efcient
encoding of quasi-cyclic low-density parity-check codes, IEEE Trans.
Commun., vol. 54, no. 1, pp. 7181, Jan. 2006.
[11] M. M. Mansour, A turbo-decoding message-passing algorithm for
sparse parity-check matrix codes, IEEE Trans. Signal Process., vol.
54, pp. 43764392, Nov. 2006.
[12] E. Sharon, S. Litsyn, and J. Goldberger, Efcient serial message-passing schedules for LDPC decoding, IEEE Trans. Inf. Theory,
vol. 53, pp. 40764091, Nov. 2007.
[13] K. Zhang, X. Huang, and Z. Wang, High-throughput layered decoder
implementation for quasi-cyclic LDPC codes, IEEE J. Sel. Areas
Commun., vol. 27, no. 6, pp. 985994, Aug. 2009.
[14] S. Kim, G. E. Sobelman, and H. Lee, A reduced-complexity architecture for LDPC layered decoding schemes, IEEE Trans. Very Large
Scale Integr. (VLSI) Syst., vol. 19, no. 6, pp. 10991103, Jun. 2011.
[15] B. Xiang, D. Bao, S. Huang, and X. Zeng, An 847955 Mb/s 342397

mw dual-path fully-overlapped QC-LDPC decoder for WiMAX
system in 0.13 CMOS, IEEE J. Solid-State Circuits, vol. 46, no. 6,
pp. 14161432, Jun. 2011.
[16] A. Darabiha, A. C. Carusone, and F. R. Kschischang, Power reduction
techniques for LDPC decoders, IEEE J. Solid-State Circuits, vol. 43,
no. 8, pp. 18351845, Aug. 2008.
[17] Y. H. Chien, M. K. Ku, and J. B. Liu, Low-complexity iteration control algorithm for multi-rate partially parallel layered LDPC decoders,
IET Electron. Lett., vol. 48, no. 22, pp. 14061407, Oct. 2012.
[18] W. E. Ryan and S. Lin, Channel Codes Classical and Modern. Cambridge, U.K.: Cambridge Univ. Press, 2009.
[19] Y. Maeda and H. Kaneko, Error control coding for multilevel cell
ash memories using nonbinary low-density parity-check codes, in
Proc. IEEE Int. Symp. Defect Fault Tolerance VLSI Syst., Oct. 2009,
pp. 367375.
[20] M. H. Chung, Y. M. Lin, C. Z. Zhan, and A. Y. Wu, Cost-effective scalable QC-LDPC decoder designs for non-volatile memory systems, in Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., May
2013, pp. 26252628.
[21] Y. M. Lin, Y. H. Chen, M. H. Chung, and A. Y. Wu, High-throughput
QC-LDPC decoder with cost-effective early termination ccheme for
non-volatile memory systems, in Proc. IEEE Int. Symp. Circuits Syst.,
Jun. 2014, pp. 27322735.
[22] H. T. Tran, K. Cameron, and B. Z. Shen, Message passing memory
and barrel shifter arrangement in LDPC (low density parity check) decoder supporting multiple LDPC codes, U.S. Patent 20060085720,
Jun. 2006.
[23] J. Li, G. He, H. Hou, Z. Zhang, and J. Ma, Memory efcient layered
decoder design with early termination for LDPC codes, in Proc. IEEE
Int. Symp. Circuits Syst., May 2011, pp. 26972700.
[24] J. Li, K. Zhao, J. Ma, and T. Zhang, Realizing unequal error correction
for nand ash memory at minimal read latency overhead, IEEE Trans.
Circuits Syst. II, Exp. Briefs, vol. 61, no. 5, pp. 354358, May 2014.
[25] Y. S. Park, Y. Tao, and Z. Zhang, A fully parallel nonbinary LDPC
decoder with ne-grained dynamic clock gating, IEEE J. Solid-State
Circuits, vol. 50, no. 2, pp. 464475, Feb. 2015.
Yu-Min Lin received the B.S. degree in Electrical

Engineering from National Chiao Tung University,
Hsinchu, Taiwan, in 2011. He is currently working
toward the Ph.D. degree from the Graduate Institute
of Electronics Engineering, National Taiwan University, Taipei. His research interests are in the areas of
VLSI implementation of DSP algorithms and errorcorrecting codes.
Huai-Ting Li received the B.S. degree in electrical

engineering from National Taiwan University,
Taipei, in 2013. He is currently pursuing the Ph.D.
degree in the Graduate Institute of Electronics
Engineering, National Taiwan University. His research interests are in the areas of architecture and
algorithm design of 3-D networks-on-chip, advance
arithmetic unit design, and error-resilient system
design.
Ming-Han Chung was born in Taipei, Taiwan,

in 1988. He received B.S. degree in electrical
engineering and the M.S. degree in electronic
engineering, in 2010 and 2012, respectively, both
from Nation Taiwan University, Taipei. His research
interests include VLSI architecture designs and
algorithm development for error-correcting codes.
An-Yeu (Andy) Wu (M'96SM'12F'15) received

the B.S. degree from National Taiwan University
(NTU), Taipei, in 1987, and the M.S. and Ph.D.
degrees from the University of Maryland, College
Park, MD, USA, in 1992 and 1995, respectively,
all in electrical engineering. In August 2000, he
joined the faculty of the Department of Electrical Engineering and the Graduate Institute of
Electronics Engineering, NTU, where he is currently a Professor. His research interests include
low-power/high-performance VLSI architectures for
DSP and communication applications, adaptive/multirate signal processing,
recongurable broadband access systems and architectures, and system-on-chip
(SoC)/network-on-chip (NoC) platform for software/hardware co-design. He
has published more than 190 refereed journal and conference papers in above
research areas, together with ve book chapters and 16 granted US patents.
Dr. Wu had served as the Associate Editors of leading IEEE Transactions
in the circuits and systems area and signal processing area, such as IEEE
TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, IEEE
11
TRANSACTIONS ON CIRCUITS AND SYSTEMSPART I: REGULAR PAPERS,

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSPART II: EXPRESS BRIEFS,
and IEEE TRANSACTIONS ON SIGNAL PROCESSING. He served as the General
Co-Chair of 2013 International Symposium on VLSI Design, Automation &
Test (VLSI-DAT), and 2013 IEEE Workshop on Signal Processing Systems
(SiPS). He also served as Technical Program Co-Chair of 2014 International
SoC Design Conference (ISOCC) and 2014 IEEE Asia Pacic Conference on
Circuits and Systems (APCCAS). From 2012 to 2014, he served as the Chair of
VLSI Systems and Applications (VSA) Technical Committee (TC), one of the
largest TCs in IEEE Circuits and Systems (CAS) Society. From August 2007
to Dec. 2009, he was on leave from NTU and served as the Deputy General
Director of SoC Technology Center (STC), Industrial Technology Research
Institute (ITRI), Hsinchu, TAIWAN, supervising Parallel Core Architecture
(PAC) VLIW DSP Processor and Multicore/Android SoC platform projects. In
2010, Dr. Wu received the Outstanding EE Professor Award from The Chinese
Institute of Electrical Engineering (CIEE), Taiwan. Prof. Wu was elevated to
IEEE Fellow in 2015 for his contributions to DSP algorithms and VLSI designs
for communication IC/SoC.

2015 LDPC

Încărcat de

Informații document

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

2015 LDPC

Încărcat de

Drepturi de autor:

Formate disponibile

This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

Byte-Recongurable LDPC Codec Design With

AbstractThe reliability of NAND Flash memory deteriorates

AND FLASH memory, one of prevailing of non-volatile

Fig. 1. Proposed QC-LDPC codec for NAND ash memory.

low-density parity-check (LDPC) codes with soft-decision have

tion results of proposed QC-LDPC codec. Finally, we conclude

C. Decoding of QC-LDPC Codes

Fig. 3. (a) Architecture of proposed QC-LDPC encoder. (b) Timing diagram

mentioned operation times to compute

Due to regulation of memory generator, 688

288 SRAM are generated

Fig. 4. (a) Illustration of block-serial decoding schedule. (b) Architecture of

architecture, we propose a rescheduling architecture to reduce

The architecture of the proposed QC-LDPC decoder is shown

Fig. 5. Architecture of the proposed byte-recongurable cost-effective high-throughput QC-LDPC decoder.

Fig. 6. Structure of byte-recongurable

-parallel BNU processors.

Fig. 7. Structure of byte-recongurable

-parallel CNU processors.

which needs two cycles to complete a permutation. Since our

34560) QC-LDPC codes, which

Fig. 9. Timing diagram of the QC-LDPC decoding. (a) Conventional decoder.

and is equivalent to 256 K gate

Fig. 10. Checking schedule of sub-iteration based early termination scheme.

to . The timing diagram of proposed SIB-ET scheme in the

Fig. 13. The simulation results of proposed QC-LDPC codec.

B. Implementation Results of Proposed QC-LDPC Decoder

Fig. 14. The chip layout of the proposed QC-LDPC codec.

Flash products. On the other hand, the minimum throughput of

The implemented VLSI achieves chip size of 6.72

[15] B. Xiang, D. Bao, S. Huang, and X. Zeng, An 847955 Mb/s 342397

Yu-Min Lin received the B.S. degree in Electrical

Huai-Ting Li received the B.S. degree in electrical

Ming-Han Chung was born in Taipei, Taiwan,

An-Yeu (Andy) Wu (M'96SM'12F'15) received

TRANSACTIONS ON CIRCUITS AND SYSTEMSPART I: REGULAR PAPERS,

S-ar putea să vă placă și