IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 3, MARCH 2011

Matrix Codes for Reliable and Cost Efficient Memory Chips
Abstract—This paper presents a method to protect memories against multiple bit upsets and to improve manufacturing yield. The proposed method, called a Matrix code, combines Hamming and Parity codes to assure the improvement of reliability and yield of the memory chips in the presence of high defects and multiple bit-upsets. The method is evaluated using fault injection experiments. The results are compared to well-known techniques such as Reed–Muller and Hamming codes. The proposed technique performs better than the Hamming codes and achieves comparable performance with Reed–Muller codes, with very favorable implementation gains such as a 25% reduction in area and power consumption. It also achieves a reliability increase of more than 50% in some cases. Further, the yield benefits provided by the proposed method, measured by the yield-improvement-per-cost metric, are up to 300% better than the ones provided by Reed–Muller codes.

Index Terms—Error correcting codes (ECCs), memories, reliability, yield.

Manuscript received May 27, 2009; revised September 09, 2009. First published December 11, 2009; current version published February 24, 2011. C. Argyrides and D. K. Pradhan are with the Department of Computer Science, University of Bristol, Bristol, BS8 1UB, U.K. (e-mail: costas@computer.org; pradhan@cs.bris.ac.uk). T. Kocak is with the Department of Computer Engineering, Bahcesehir University, Istanbul 34353, Turkey (e-mail: taskin.kocak@bahcesehir.edu.tr). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2009.2036362. 1063-8210/$26.00 © 2009 IEEE.

I. INTRODUCTION

AS CMOS process technology scales, high-density, low-cost, high-speed integrated circuits with low voltage levels and small noise margins will be increasingly susceptible to temporary faults [1]. In very deep submicrometer technologies, single-event upsets (SEUs) caused by atmospheric neutrons and alpha particles severely impact field-level product reliability, not only for memory but for logic as well. When these particles hit the silicon bulk, they create minority carriers which, if collected by the source/drain diffusions, can change the voltage level of the node.

This issue has drawn growing attention from the fault-tolerance community due to the recent increase in the soft error rate of combinational logic circuits [2]. While effective solutions to protect memory elements have already been devised [3], the low probability of soft errors affecting CMOS combinational circuits being latched at the output of the circuit kept this subject a secondary research point; therefore, not many techniques to cope with this problem have been proposed until now. Similar concerns are also expressed for critical applications such as space, where there can be potentially serious consequences for the spacecraft, including loss of information, functional failure, or loss of control [4]. Although SEU is the major concern for some critical applications, multiple bit upsets (MBUs) are also becoming an important problem in designing memories, mostly for the following reasons.

1) The error rate of memories increases due to technology shrinkage [5], [6]; therefore, the probability of having multiple errors increases.
2) MBUs can be induced by direct ionization or nuclear recoil after the passage of a high-energy ion [7].
3) The probability of having multiple errors increases with the size of the memory, as demonstrated by the experiments in [8] and [9].

Unfortunately, packaging and shielding cannot effectively be used to shield against SEUs and MBUs, since these may be caused by neutrons, which can easily penetrate through packages [6], [10]. The most common approach to maintain a good level of reliability for memory cells is to use error correcting codes. Hamming and odd-weight codes are largely used to protect memories against SEUs because of their ability to correct single upsets efficiently with reduced area and performance overhead [11]. However, multiple upsets caused by a single charged particle can provoke errors in a system protected by these single-error correcting codes. On the other hand, there are advanced error correcting codes, such as the Reed–Muller code [12], which can cope with multiple upsets; however, this is achieved at the expense of high area and power consumption.

The most common approach to deal with multiple errors has been the use of interleaving in the physical arrangement of the memory cells, so that cells that belong to the same logical word are separated. Since the errors in an MBU are physically close, as discussed in [13], they will cause single errors in different words, which can be corrected by single-error-correction double-error-detection (SEC-DED) codes.

However, interleaving cannot be used, for example, in small memories or register files, and in other cases its use may have an impact on floor-planning, access time, and power consumption, as discussed in [14]. For those reasons, the use of more sophisticated codes, or the combination of different codes, has been proposed in [15] to deal with MBUs when interleaving is not a valid option. More recently, codes that can correct multiple errors only when they are adjacent have also been proposed in [14]. These codes are tailored to the specific patterns of errors in an MBU (again, the errors will tend to be physically close) and can therefore achieve effective protection at a reduced cost.

An alternative approach to protect memories is the use of built-in current sensors (BICS) that are able to detect the occurrence of errors by detecting changes in the current, as proposed
ARGYRIDES et al.: MATRIX CODES FOR RELIABLE AND COST EFFICIENT MEMORY CHIPS 421
in [16] and [17]. The sensors are placed in the columns of the memory block and detect unexpected current variations on each of the memory bit positions.

The protection can be optimized with protection codes that are tailored to the specific problem of a memory that suffers MBUs. This is the objective of this paper, in which such codes are proposed.

Another major issue in the design of memories for new technologies is coping with defects. As reported in [18], yield is decreasing while technology is scaling. Many different techniques have been proposed based on the use of redundant elements to replace the defective ones. These techniques vary from those applied during the manufacturing process, in the test phase, to the use of built-in circuits that are able to repair the memory chips even during normal operation in the field, with different tradeoffs in terms of cost and speed.

In this paper, a high-level method for the detection and correction of multiple faults is proposed. The method is based on combining Hamming codes and Parity codes in a matrix format so that the detection and correction of multiple faults is achieved. The fault detection/correction coverage, mean-error-to-fail (METF), and reliability in terms of mean-time-to-fail (MTTF) of the proposed approach are analyzed and compared to those of the Reed–Muller code and the Hamming code. The results show that the proposed approach can detect and correct more multiple faults than the Hamming code and slightly fewer than the Reed–Muller code, but with higher MTTF than Reed–Muller in most cases. Moreover, the area and power consumption of the proposed method are 25% and 21% better than those of the Reed–Muller code, respectively. We also introduce different metrics for comparing the protection methods in terms of their correction and detection coverage, as well as their cost when incorporated into a memory chip for yield improvement. Based on the experimental results, the detection/correction coverage of our proposed method is better than Reed–Muller's and provides better yield results with respect to the cost of the methods. This paper is an extension of [19]. Here we show results based on memory-chip simulation, while the results of [19] were portrayed using experiments and analytical results based on the codeword's coverage. In this version we also discuss different codeword organizations; additionally, a yield and cost-per-chip analysis is presented, and METF results are portrayed as well.

The remainder of this paper is organized as follows. Section II provides some background and related work. The matrix code (MC) is introduced in Section III. The experimental study based on fault injection and the reliability analysis based on the METF of the proposed method are provided in Section IV. Section V explains the area and power consumption analysis of the proposed method. The yield and cost-per-chip analysis are portrayed in Section VI, and finally Section VII concludes this paper.

II. BACKGROUND AND RELATED WORK

Concurrent error detection (CED) is one of the major approaches for transient-error mitigation. In its simpler forms, CED allows only the detection of errors, requiring the use of additional techniques for error correction. Nevertheless, the implementation of CED usually requires the duplication of the area of the circuit to be protected. One of the simpler examples of CED is called duplication with comparison [20]–[22], which duplicates the circuit to be protected and compares the results generated by both copies to check for errors. This technique imposes an area overhead higher than 100%, and when an error is detected the outputs of the circuit must be recomputed.

Hamming codes and odd-weight codes are largely used to protect memories against SEUs because of their ability to correct single upsets efficiently with reduced area and performance overhead [23]. A Hamming code implementation is composed of a combinational block responsible for encoding the data (encoder block), extra bits in the word that hold the parity (extra latches or flip-flops), and another combinational block responsible for decoding the data (decoder block). The encoder block calculates the parity bits, and it can be implemented by a set of two-input XOR gates. The decoder block is more complex than the encoder block, because it must not only detect the fault but also correct it. It is basically composed of the same logic used to compute the parity bits, plus a decoder that indicates the bit address that contains the upset. The decoder block can also be composed of a set of two-input XOR gates and some AND and INVERTER gates. Studies in [23] show the area efficiency of using the Hamming code to protect memories. However, it does not cope with multiple upsets; consequently, more complex correcting codes must be investigated.

The Reed–Muller code [12], [24] is another protection code that is able to detect and correct more errors than a Hamming code and is suitable for tolerating multiple upsets in SRAM memories. Although this type of protection code is more efficient than the Hamming code in terms of multiple-error detection and correction, its main drawback is its high area and power penalties (since the encoding and decoding circuitry of this code is more complex than for a Hamming code). Hence, a low-overhead error detection and correction code for tolerating multiple upsets is required.

Many different techniques have been proposed for coping with defects, all of them based on the use of redundant elements to replace defective ones. Those techniques vary from those applied during the manufacturing process, in the test phase, to the use of built-in circuits able to repair the memory chips even during normal operation in the field, with different tradeoffs in terms of cost and speed. No matter when the chip repair is performed, those techniques rely on a few basic redundancy schemes. For instance, in the redundant-rows or redundant-columns approach, only redundant rows (or redundant columns) are included in the memory array and are used to replace defective rows (or defective columns) detected during test. Also known as 1-D redundancy, the main advantage of this approach is that its implementation is very straightforward, requiring no complex redundant-row (or column) allocation algorithms. However, its repair efficiency can be low, since a defective column (row) containing multiple defective cells cannot be replaced by a single redundant row (column) [25]–[27].

Both redundant rows and columns can be incorporated into the memory array, which is more efficient than having either one alone, especially when multiple faulty cells exist in the 2-D memory array. The main drawback of this approach is that the optimal redundancy allocation problem becomes NP-complete [28], [29]. Although many heuristic algorithms have been
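The Hamming encoder/decoder structure described above (XOR parity trees in the encoder, the same trees plus a syndrome decoder pointing at the upset bit in the decoder) can be sketched in software. The following uses the classic Hamming(7,4) code purely as an illustration; the paper's codes operate on larger words, and this parameter choice is the sketch's assumption, not the paper's.

```python
# Minimal Hamming(7,4) single-error-correcting sketch.
# Codeword positions are 1-indexed; check bits sit at positions 1, 2, 4.

def hamming74_encode(d):
    """d: list of 4 data bits -> 7-bit codeword [p1, p2, d0, p4, d1, d2, d3]."""
    p1 = d[0] ^ d[1] ^ d[3]            # parity over positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]            # parity over positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]            # parity over positions 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]

def hamming74_decode(c):
    """c: 7-bit codeword -> corrected 4 data bits (corrects one flipped bit)."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]     # recompute each parity tree ...
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4    # ... the syndrome is the upset position
    c = list(c)
    if syndrome:
        c[syndrome - 1] ^= 1           # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]]
```

The syndrome decoder here is the software analogue of the AND/INVERTER address decoder mentioned in the text: a nonzero syndrome directly names the codeword position to invert.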
The read and write procedures for the memory with the error correcting technique can be explained as follows. First, each word in the modules is segmented into multiple bit segments. Then each bit segment is encoded into a segment which contains the data bits and the check bits. Algorithms 2 and 3 show the procedures for reading/writing words from/to the memory, respectively.

Fig. 2. MCs—32 bits.

Algorithm 1 MATRIX code verification algorithm (X: data)
1: Read the saved data bits of X
2: Generate check bits using saved data bits
3: Generate syndrome bits of check bits
4: Generate parity bits using saved data bits
5: Generate syndrome bits of parity bits
6: Correct every saved bit if it is erroneous using (7)
7: Output the corrected word

Algorithm 2 Algorithm MEMORY READ
1: Read the word which contains the desired bits.
2: Correct for any errors.
3: Route the desired bits on the tree to the root node

Algorithm 3 Algorithm MEMORY WRITE
1: Read the word which includes the desired bit.
2: Check for errors and correct them (if any)
3: Compare the value of the bit to be written against the value stored in the memory.
4: if bits are different then
5: Recompute the check bits based on this new value.
6: Write back the data and the newly computed check bits
7: else
8: Write back the data and the newly computed check bits
9: end if

Let us give an example of the Matrix method. Suppose the codeword "10111001 01100111 11001100 11011100" is saved to the memory. The check bits will be equal to "11101 01111 10100 01001" and, according to the parity equations, the parity bits will be equal to "1100 1110". The physical layout of this codeword with the check and parity bits is shown in Fig. 2. Suppose that while reading the codeword from the memory two bits are erroneous: one changed from 1 to 0 and the other from 0 to 1. Using the decoding algorithm, one can easily correct the two erroneous bits. This procedure is shown in Example 1. With this technique we can correct any kind of single or double error in each row.

IV. EXPERIMENTAL STUDY AND RELIABILITY ESTIMATION
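To illustrate the row/column organization of Fig. 2, here is a deliberately simplified sketch that replaces the per-row Hamming check bits with plain row parity, so that a single flipped bit is located at the intersection of the failing row and failing column. The `encode`/`correct` helpers are illustrative, not the paper's exact Matrix code (which, as stated above, corrects single and double errors in each row); only the column-parity part coincides with the paper's construction.

```python
# Cross-parity sketch of the matrix organization: a 32-bit word is
# arranged as 4 rows x 8 columns with one even-parity bit per row and
# per column. A single upset is found where a failing row parity
# crosses a failing column parity.

ROWS, COLS = 4, 8

def encode(bits):
    """bits: 32-element 0/1 list -> (matrix, row parities, column parities)."""
    m = [bits[r * COLS:(r + 1) * COLS] for r in range(ROWS)]
    row_par = [sum(row) % 2 for row in m]
    col_par = [sum(m[r][c] for r in range(ROWS)) % 2 for c in range(COLS)]
    return m, row_par, col_par

def correct(m, row_par, col_par):
    """Flip back a single upset located at the failing row/column crossing."""
    bad_rows = [r for r in range(ROWS) if sum(m[r]) % 2 != row_par[r]]
    bad_cols = [c for c in range(COLS)
                if sum(m[r][c] for r in range(ROWS)) % 2 != col_par[c]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        m[bad_rows[0]][bad_cols[0]] ^= 1
    return m
```

Encoding the example word "10111001 01100111 11001100 11011100" with this sketch gives the column parities 1100 1110 quoted in the example above, and flipping any one stored bit is undone by `correct`.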
[TABLE I: MTTF IN DAYS OF PROPOSED TECHNIQUE FOR = 10 , wordsize = 32. Table data not recovered.]

[TABLE II: MTTF IN DAYS OF PROPOSED TECHNIQUE FOR = 10 , wordsize = 64. Table data not recovered.]

[TABLE III: REQUIRED REDUNDANT BITS. Table data not recovered.]

[TABLE IV: AREA, POWER, AND DELAY ANALYSIS. Table data not recovered.]
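The read-modify-write flow of Algorithm 3 can be sketched as follows; `ecc_encode` and `ecc_correct` are hypothetical stand-ins for whatever code protects the stored word, and the dictionary-based memory is purely illustrative.

```python
# Sketch of Algorithm 3 (MEMORY WRITE): read and correct the word that
# holds the target bit, then recompute the check bits only when the
# stored value actually changes.

def memory_write_bit(memory, addr, bit_index, value, ecc_encode, ecc_correct):
    word, check = memory[addr]            # step 1: read the word
    word = ecc_correct(word, check)       # step 2: correct any errors
    if word[bit_index] != value:          # steps 3-4: compare with stored bit
        word = list(word)
        word[bit_index] = value
        check = ecc_encode(word)          # step 5: recompute the check bits
    memory[addr] = (word, check)          # steps 6/8: write back
```

The point of the compare in steps 3-4 is that an unchanged bit needs no re-encoding, so the (relatively expensive) check-bit computation is skipped on writes that do not alter the stored value.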
Dhiraj K. Pradhan (S'70–M'72–SM'80–F'88) is currently a Professor with the Department of Computer Science, University of Bristol, Bristol, U.K. Previously, he was a Professor with the Department of Electrical and Computer Engineering, Oregon State University, Corvallis, and held the COE Endowed Chair Professorship in Computer Science with Texas A&M University, College Station, where he also served as the founder of the Laboratory of Computer Systems, and he held a professorship at the University of Massachusetts, Amherst, where he also served as a coordinator of computer engineering. He also worked at the University of California, Berkeley, Oakland University, Rochester, MI, and the University of Regina, Saskatchewan, Canada. He was also a Visiting Professor with Stanford University, Stanford, CA. In the past, he worked as a Staff Engineer with IBM. More recently, he served as the founding CEO of Reliable Computer Technology, Inc. He continues to serve as an Editor of prestigious journals, including IEEE transactions. He has also served as the general chair and program chair for various major conferences. He is also the inventor of two patents, one of which was licensed to Mentor Graphics and Motorola. The verification tool FormalPro, by Mentor Graphics, is based on his patent. He has contributed to VLSI computer-aided design and test, as well as to fault-tolerant computing, computer architecture, and parallel processing research, with major publications in journals and conferences spanning more than 30 years. During this long career, he has been well funded by various agencies in Canada, the United States of America, and the United Kingdom. He is also the coauthor and editor of various books, including Fault-Tolerant Computing: Theory and Techniques, volumes I and II (Prentice-Hall, 1986), Fault-Tolerant Computer Systems Design (Prentice-Hall, 1996; second print, 2003), IC Manufacturability: The Art of Process and Design Integration (IEEE Press, 2000), Practical Design Verification (Cambridge University Press, 2009), and Fault and Defect Tolerance in Nanotechnology Circuits (Cambridge University Press, 2009). His research interests include low-power designs, soft-error protection, and variability issues.

Dr. Pradhan is a fellow of the ACM and the Japan Society for the Promotion of Science. He was a recipient of the Humboldt Prize in Germany, the Fulbright-Flad Chair in Computer Science in 1997, and Best Paper Awards, including the 1996 IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems Best Paper Award, with W. Kunz, on "Recursive Learning: A New Implication Technique for Efficient Solutions to CAD Problems: Test, Verification and Optimization."