
420 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 3, MARCH 2011

Matrix Codes for Reliable and Cost Efficient Memory Chips
Costas Argyrides, Member, IEEE, Dhiraj K. Pradhan, Fellow, IEEE, and Taskin Kocak

Abstract—This paper presents a method to protect memories against multiple bit upsets and to improve manufacturing yield. The proposed method, called a Matrix code, combines Hamming and Parity codes to assure the improvement of reliability and yield of the memory chips in the presence of high defects and multiple bit-upsets. The method is evaluated using fault injection experiments. The results are compared to well-known techniques such as Reed–Muller and Hamming codes. The proposed technique performs better than the Hamming codes and achieves comparable performance with Reed–Muller codes, with very favorable implementation gains such as a 25% reduction in area and power consumption. It also increases reliability by more than 50% in some cases. Further, the yield benefit provided by the proposed method, measured by the yield improvement per cost metric, is up to 300% better than that provided by Reed–Muller codes.

Index Terms—Error correcting codes (ECCs), memories, reliability, yield.

I. INTRODUCTION

AS CMOS process technology scales, high-density, low-cost, high-speed integrated circuits with low voltage levels and small noise margins will be increasingly susceptible to temporary faults [1]. In very deep submicrometer technologies, single-event upsets (SEUs) caused by atmospheric neutrons and alpha particles severely impact field-level product reliability, not only for memory, but for logic as well. When these particles hit the silicon bulk, they create minority carriers which, if collected by the source/drain diffusions, can change the voltage level of the node.

This issue has drawn growing attention from the fault tolerance community due to the recent increase of the soft error rate of combinational logic circuits [2]. While effective solutions to protect memory elements have already been devised [3], the low probability of soft errors affecting CMOS combinational circuits being latched at the output of the circuit kept this subject a secondary research point. Therefore, not many techniques to cope with this problem have been proposed until now. Similar concerns are also expressed for critical applications such as space, where there can be potentially serious consequences for the spacecraft, including loss of information, functional failure, or loss of control [4]. Although SEU is the major concern for some critical applications, multiple bit upsets (MBUs) are also becoming important problems in designing memories, mostly because of the following.

1) The error rate of memories increases due to technology shrinkage [5], [6]. Therefore, the probability of having multiple errors increases.
2) MBUs can be induced by direct ionization or nuclear recoil after the passage of a high-energy ion [7].
3) The probability of having multiple errors increases as the size of the memory increases, as demonstrated by the experiments in [8] and [9].

Unfortunately, packaging and shielding cannot effectively be used to shield against SEUs and MBUs, since they may be caused by neutrons which can easily penetrate through packages [6], [10]. The most common approach to maintain a good level of reliability for memory cells is to use error correcting codes. Hamming and odd-weight codes are largely used to protect memories against SEU because of their ability to correct single upsets efficiently with reduced area and performance overhead [11]. However, multiple upsets caused by a single charged particle can provoke errors in a system protected by these single-error correcting codes. On the other hand, there are advanced error correcting codes such as the Reed–Muller code [12], which can cope with multiple upsets. However, this is achieved at the expense of high area and power consumption.

The most common approach to deal with multiple errors has been the use of interleaving in the physical arrangement of the memory cells, so that cells that belong to the same logical word are separated. As the errors in an MBU are physically close, as discussed in [13], they will cause single errors in different words that can be corrected by single error correction-double error detection (SEC-DED) codes.

However, interleaving cannot be used, for example, in small memories or register files, and in other cases its use may have an impact on floor-planning, access time, and power consumption, as discussed in [14]. For those reasons, the use of more sophisticated codes or the combination of different codes has been proposed in [15] to deal with MBUs when the use of interleaving is not a valid option. More recently, codes that can correct multiple errors only when they are adjacent have also been proposed in [14]. These codes are tailored to the specific patterns of errors in an MBU (again, the errors will tend to be physically close) and can therefore achieve effective protection at a reduced cost.

An alternative approach to protect memories is the use of built-in current sensors (BICS) that are able to detect the occurrence of errors by detecting changes in the current, as proposed

Manuscript received May 27, 2009; revised September 09, 2009. First published December 11, 2009; current version published February 24, 2011.
C. Argyrides and D. K. Pradhan are with the Department of Computer Science, University of Bristol, Bristol, BS8 1UB, U.K. (e-mail: costas@computer.org; pradhan@cs.bris.ac.uk).
T. Kocak is with the Department of Computer Engineering, Bahcesehir University, Istanbul 34353, Turkey (e-mail: taskin.kocak@bahcesehir.edu.tr).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2009.2036362
1063-8210/$26.00 © 2009 IEEE
in [16] and [17]. The sensors are placed in the columns of the memory block and they detect unexpected current variations on each of the memory bit positions.

The protection can be optimized with protection codes that are tailored to the specific problem of a memory that suffers MBUs. This is the objective of this paper, in which such codes are proposed.

Another major issue in the design of memories for new technologies is to cope with defects. As reported in [18], yield is decreasing while technology is scaling. Many different techniques have been proposed based on the use of redundant elements to replace the defective ones. These techniques vary from those applied during the manufacturing process, in the test phase, to the use of built-in circuits that are able to repair the memory chips even during normal operation in the field, with different tradeoffs in terms of cost and speed.

In this paper, a high-level method for the detection and correction of multiple faults is proposed. This method is based on combining Hamming codes and Parity codes in a matrix format so that the detection and correction of multiple faults is achieved. The fault detection/correction coverage, mean-errors-to-failure (METF), and reliability in terms of mean-time-to-failure (MTTF) of the proposed approach are analyzed and compared to those of the Reed–Muller code and the Hamming code. The results show that the proposed approach can detect and correct more multiple faults than the Hamming code and slightly fewer than the Reed–Muller code, but with higher MTTF than Reed–Muller in most of the cases. Moreover, the area and power consumption of the proposed method are 25% and 21% better than those of the Reed–Muller code, respectively. We also introduce different metrics for comparing the protection methods in terms of their correction and detection coverage as well as their cost when incorporated into a memory chip for the improvement of yield. Based on the experimental results, the detection/correction coverage of our proposed method is better than Reed–Muller and provides better yield results with respect to the cost of the methods. This paper is an extension of [19]. Here, we show results based on memory chip simulation, while the results of [19] were portrayed using experimental and analytical results based on the codeword's coverage. Also, in this version we discuss different codeword organizations. Additionally, a yield and cost per chip analysis is presented here, and METF results are portrayed as well.

The remainder of this paper is organized as follows. Section II provides some background and related work. The matrix code (MC) is introduced in Section III. An experimental study based on fault injection and a reliability analysis based on the METF of the proposed method are provided in Section IV. Section V explains the area and power consumption analysis of the proposed method. The yield and cost per chip analysis are portrayed in Section VI, and finally Section VII concludes this paper.

II. BACKGROUND AND RELATED WORK

Concurrent error detection (CED) is one of the major approaches for transient error mitigation. In its simpler forms, CED allows only the detection of errors, requiring the use of additional techniques for error correction. Moreover, the implementation of CED usually requires duplication of the area of the circuit to be protected. One of the simpler examples of CED is called duplication with comparison [20]–[22], which duplicates the circuit to be protected and compares the results generated by both copies to check for errors. This technique imposes an area overhead higher than 100%, and when an error is detected the outputs of the circuit must be recomputed.

Hamming codes and odd-weight codes are largely used to protect memories against SEU because of their ability to correct single upsets efficiently with reduced area and performance overhead [23]. The Hamming code implementation is composed of a combinational block responsible for encoding the data (encoder block), extra bits in the word that hold the parity information (extra latches or flip-flops), and another combinational block responsible for decoding the data (decoder block). The encoder block calculates the parity bits, and it can be implemented by a set of two-input XOR gates. The decoder block is more complex than the encoder block, because it needs not only to detect the fault, but must also correct it. It is basically composed of the same logic used to compute the parity bits, plus a decoder that indicates the bit address that contains the upset. The decoder block can also be composed of a set of two-input XOR gates and some AND and INVERTER gates. Studies in [23] show the area efficiency of using the Hamming code to protect memories. However, it does not cope with multiple upsets. Consequently, more complex correcting codes must be investigated.

The Reed–Muller code [12], [24] is another protection code that is able to detect and correct more errors than a Hamming code and is suitable for tolerating multiple upsets in SRAM memories. Although this type of protection code is more efficient than the Hamming code in terms of multiple error detection and correction, its main drawback is its high area and power penalties (since the encoding and decoding circuitry of this code is more complex than for a Hamming code). Hence, a low-overhead error detection and correction code for tolerating multiple upsets is required.

Many different techniques have been proposed for coping with defects, all of them based on the use of redundant elements to replace defective ones. These techniques vary from those applied during the manufacturing process, in the test phase, to the use of built-in circuits able to repair the memory chips even during normal operation in the field, with different tradeoffs in terms of cost and speed. No matter when the chip repair is performed, these techniques rely on a few basic redundancy schemes. For instance, in the redundant-rows or redundant-columns approach, only redundant rows (or redundant columns) are included in the memory array, and they are used to replace defective rows (or defective columns) detected during test. Also known as 1-D redundancy, the main advantage of this approach is that its implementation is very straightforward, requiring no complex redundant row (or column) allocation algorithms. However, its repair efficiency can be low, since a defective column (row) containing multiple defective cells cannot be replaced by a single redundant row (column) [25]–[27].

Both redundant rows and columns can be incorporated into the memory array, which is more efficient than having either one alone, especially when multiple faulty cells exist in the 2-D memory array. The main drawback of this approach is that the optimal redundancy allocation problem becomes NP-complete [28], [29]. Although many heuristic algorithms have been
proposed to solve this problem, it is still difficult to develop on-chip implementations for these algorithms. Due to the high bandwidth requirement of today's system-on-chip (SoC) designs, we usually have long bit lines and word lines for embedded memories. The repair efficiency decreases, since an entire (and long) redundant row (column) is required to repair a faulty row (column) containing only a small number of faulty cells. Examples of such techniques are presented in [28]–[33].

Another method is called divide by half. In this approach (apart from the redundant rows and columns that are incorporated into the memory array during manufacturing), if there are more defects than can be mitigated using the additional redundancy, the most significant bit (of the address) is ignored in order to use one half of the chip (upper part or lower part). When a faulty cell is detected, we can use a redundant row or a redundant column to replace it. Due to the high bandwidth requirement of current SoC designs, embedded memories usually have long bit and word lines, which decreases the repair efficiency, since an entire (and long) redundant row (column) is required to repair a defective row (column) containing only a small number of defective cells. For both 1-D and 2-D redundancy approaches, when the number of defective cells in the array exceeds the repair capability of the redundant elements, the last alternative before discarding the defective chip is to try to use it as a downgraded version of the memory. For example, when all remaining defective cells are located in one half of the array, the other half can still be used as a memory with reduced capacity. This is done by permanently setting the most significant bit of the addresses either to 0 or 1, depending on which part of the memory is to be used. However, in most cases the remaining defective cells are evenly distributed across the whole array, and not clustered in one half of it, making this technique useless. Examples of such techniques are presented in [34], [35]. More recently, another divide-by-half technique has been proposed to cope with this problem [36]; it divides the memory chip in half using a set of multiplexers and demultiplexers. This technique provides better yield results compared to the other techniques but, like them, it has an important drawback: it does not provide soft error tolerance and will fail in the presence of soft errors.

In order to improve the efficiency of repairing embedded memories, a novel redundancy mechanism that copes not only with defects but also with transients is required. Our technique, in comparison with the previous techniques, dramatically reduces the cost per chip while improving the overall system reliability.

III. MATRIX CODES

The proposed detection/correction scheme is called MC since the protection bits are used in a matrix format. The k-bit code word is divided into k1 subwords of width k2 (i.e., k = k1 × k2). A k1 × k2 matrix is formed, where k1 and k2 represent the numbers of rows and columns, respectively. For each of the k1 rows, check bits are added for single error correction/double error detection. Another k2 bits are added as vertical parity bits. We explain the basic technique by considering a 32-bit word length memory, which is divided into a matrix format as shown in Fig. 1, where X0 through X31 are the data bits, C0 through C19 are the horizontal check bits, and P0 through P7 are the vertical parity bits.

Fig. 1. 32-bit logical organization of MCs.

Hamming codes are applied to each row. For 8-bit data, five Hamming check bits are required, so five check bits are added at the end of the 8 data bits.

The check bits are calculated as follows:

C0 = X0 ⊕ X1 ⊕ X3 ⊕ X4 ⊕ X6   (1)
C1 = X0 ⊕ X2 ⊕ X3 ⊕ X5 ⊕ X6   (2)
C2 = X1 ⊕ X2 ⊕ X3 ⊕ X7   (3)
C3 = X4 ⊕ X5 ⊕ X6 ⊕ X7   (4)
C4 = X0 ⊕ X1 ⊕ X2 ⊕ X3 ⊕ X4 ⊕ X5 ⊕ X6 ⊕ X7   (5)

Accordingly, we calculate all check bits for all rows using C(i+5r) and X(j+8r), where 5 is the number of check bits per row, r is the row number, i is the corresponding check bit's position in the first row, and j is the corresponding data bit's position in the first row.

For the parity row we use the following formula:

Pj = Xj ⊕ X(j+8) ⊕ X(j+16) ⊕ X(j+24)   (6)

where j is the column number, from 0 to 7, for the eight parity bits.

A Hamming decoder is used to decode each row. Decoding is done in two steps. First, the horizontal check bits are recalculated using the saved data bits and compared with the saved horizontal check bits. This procedure is called syndrome bit generation, and S(Ci) is called the syndrome bit of check bit Ci. Second, using the syndrome bits S(Ci), the single error detection (SED)/double error detection (DED)/no error (NE) signals are generated for each row. If DED is activated (a double error is detected in a row), we use the vertical syndrome bits and the saved value of the bit; we can correct any single or double erroneous bits in each row using (7)

Xi = X'i ⊕ (DEDr ∧ S(Pj))   (7)

where X'i is the saved (erroneous) bit, Xi the corrected decoder output for that bit, DEDr the DED signal of row r, and S(Pj) the syndrome of the vertical parity bit corresponding to the bit's column, e.g., for X9 we have S(P1).

It is important to mention that if more than two errors are present in the code word, MCs can correct two errors in any row assuming that there is only one error in the other rows. If only two errors occur, these can be corrected without any restriction. Algorithm 1 shows the procedure of detection and correction in the proposed Matrix method, which is applied to a code word X, where C and P are the check bits and the parity bits calculated using the saved data bits in the memory. These are then compared with the check bits and parity bits saved in memory to calculate the syndrome bits S(Ci) and S(Pj).
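The encoding equations (1)–(6) and the correction rule (7) can be sketched in a few lines. The following Python is an illustrative model only (the authors' implementations were in C and VHDL); it reproduces the check and parity bits of the worked 32-bit example given with Fig. 2, and corrects a double error injected into one row.

```python
# Illustrative model of the 4 x 8 Matrix code: per-row SEC-DED check bits
# (eqs. (1)-(5)), vertical parity bits (eq. (6)), and correction per eq. (7).

def check_bits(x):
    """Five check bits for one 8-bit row, per eqs. (1)-(5); the last one
    is the overall row parity used for double-error detection."""
    return [
        x[0] ^ x[1] ^ x[3] ^ x[4] ^ x[6],                        # (1)
        x[0] ^ x[2] ^ x[3] ^ x[5] ^ x[6],                        # (2)
        x[1] ^ x[2] ^ x[3] ^ x[7],                               # (3)
        x[4] ^ x[5] ^ x[6] ^ x[7],                               # (4)
        x[0] ^ x[1] ^ x[2] ^ x[3] ^ x[4] ^ x[5] ^ x[6] ^ x[7],   # (5)
    ]

def parity_bits(rows):
    """Eight vertical parity bits over the four rows, per eq. (6)."""
    return [rows[0][j] ^ rows[1][j] ^ rows[2][j] ^ rows[3][j] for j in range(8)]

def encode(word32):
    rows = [word32[8 * r:8 * (r + 1)] for r in range(4)]
    return rows, [check_bits(r) for r in rows], parity_bits(rows)

def correct(rows, saved_checks, saved_parity):
    """Eq. (7): in every row whose check-bit syndrome is nonzero, flip the
    bits in the columns whose vertical parity syndrome is set."""
    s_p = [a ^ b for a, b in zip(parity_bits(rows), saved_parity)]
    for r in range(4):
        s_c = [a ^ b for a, b in zip(check_bits(rows[r]), saved_checks[r])]
        if any(s_c):
            for j in range(8):
                rows[r][j] ^= s_p[j]
    return rows

word = [int(b) for b in "10111001011001111100110011011100"]
rows, checks, parity = encode(word)

# Inject a double error in row 0 (a 1->0 and a 0->1 upset), then correct it.
corrupted = [row[:] for row in rows]
corrupted[0][3] ^= 1
corrupted[0][5] ^= 1
fixed = correct(corrupted, checks, parity)
```

As in the worked example, the computed check bits come out to "11101 01111 10100 01001" and the parity bits to "1100 1110", and the double error in row 0 is corrected.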
Algorithm 1 MATRIX code verification algorithm (X: data)

1: Read the saved data bits of X
2: Generate check bits using the saved data bits
3: Generate syndrome bits of the check bits
4: Generate parity bits using the saved data bits
5: Generate syndrome bits of the parity bits
6: Correct every saved bit if it is erroneous using (7)
7: Output the corrected word

Let us give an example of the Matrix method. Suppose the codeword "10111001 01100111 11001100 11011100" is saved to the memory. The check bits will be equal to "11101 01111 10100 01001" and, according to the parity equations, the parity bits will be equal to "1100 1110". The physical layout of this codeword with check and parity bits is shown in Fig. 2.

Fig. 2. MCs—32 bits.

Suppose that, while reading the codeword from the memory, two bits of one row are erroneous: one changed from 1 to 0 and the other from 0 to 1. Using the decoding algorithm, one can easily correct the two erroneous bits. This procedure is shown in Example 1. With this technique we can correct any kind of single or double error in each row.

Example 1 The procedure of error detection/correction in Matrix code

1: Read the saved codeword
2: Calculate the check bits using the saved data bits
3: Generate the syndrome bits for the check bits
4: Generate the parity bits using the saved data bits
5: Generate the syndrome bits of the parity bits
6: Correct every saved bit if it is erroneous using (7)
7: Output the corrected codeword

The read and write procedures for a memory with the error correcting technique can be explained as follows. First, each word in the memory modules is segmented into multiple segments. Then each segment is encoded into a longer segment which contains the data bits and the check bits. Algorithms 2 and 3 show the procedures for reading/writing words from/to the memory, respectively.

Algorithm 2 Algorithm MEMORY READ

1: Read the word which contains the desired bits.
2: Correct any errors.
3: Route the desired bits on the tree to the root node.

Algorithm 3 Algorithm MEMORY WRITE

1: Read the word which includes the desired bit.
2: Check for errors and correct them (if any).
3: Compare the value of the bit to be written against the value stored in the memory.
4: if the bits are different then
5: Recompute the check bits based on this new value.
6: Write back the data and the newly computed check bits.
7: else
8: Write back the data and the existing check bits.
9: end if

IV. EXPERIMENTAL STUDY AND RELIABILITY ESTIMATION

A. Detection and Correction Coverage

In order to estimate the error detection and correction coverage of the proposed technique, we use a fault injection method. Without loss of generality, we consider the coverage of the proposed technique for a 32-bit data word, since the protection code can be applied to each data word of a given memory; the size of a word is assumed to be 32 bits. Both single and multiple faults were injected. For each number of faulty bits in the case of multiple fault injection, about one million experiments were conducted. The obtained values are shown in Fig. 3. For each protection code, there are two lines, for fault detection and for fault correction coverage. The horizontal axis shows the number of faulty bits in a codeword. As the number of faulty bits increases, the fault detection or correction coverage decreases. As can be observed from this figure, the fault detection and correction coverage of the matrix and Reed–Muller methods are better than those of the Hamming code. It is also shown that the matrix method has the ability to provide coverage for multiple faults, though it is less than that of the Reed–Muller code for three simultaneous faults in the same codeword.

Fig. 3. Detection and correction coverage of different techniques.

B. Reliability Estimation

A memory chip protected with the proposed technique, Reed–Muller codes, and Hamming codes has also been described using the C language. Random faults were thrown into the memory and the METF for each technique was calculated. The METF of each technique was calculated using 15 000 trials for each memory size; for more details, refer to [37]. We have used two different codeword sizes, 32 and 64 bits. The results are portrayed in Figs. 4 and 5. "Matrix Codes A" and "Matrix Codes B" denote two different word organizations (i.e., different values of k1 and k2) of the proposed technique, both for the 32-bit and for the 64-bit codeword. The METF improvement for the 32-bit codeword is more than 10% compared to Reed–Muller codes.

Fig. 4. 32 bits codeword.

Fig. 5. 64 bits codeword.

Using the results obtained from the METF and (8), we can obtain the MTTF of these techniques:

MTTF = METF / (λ × memorysize)   (8)

The MTTF results for the 32-bit and 64-bit codewords for fault rate λ (λ is the number of bit upsets per day) are portrayed in Tables I and II, respectively. The MTTF of "Matrix Codes B" for 2 Mb is increased by 50% compared to Reed–Muller codes.

TABLE I
MTTF IN DAYS OF PROPOSED TECHNIQUE FOR A GIVEN λ, wordsize = 32

TABLE II
MTTF IN DAYS OF PROPOSED TECHNIQUE FOR A GIVEN λ, wordsize = 64

The redundant bits required by these techniques are portrayed in Table III. For the 32-bit codeword, the extra redundant bits required for MCs are 78% and 125% for "Matrix Codes A" and "Matrix Codes B," respectively, versus 100% for Reed–Muller codes. For the 64-bit codeword, the redundant bits required for Matrix Codes are 62.5% and 75% for "Matrix Codes A" and "Matrix Codes B," respectively; the redundant bits required for Reed–Muller codes are 100%. The results of Tables I–III will be discussed further in the following section for the cost analysis.
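The METF simulation described above can be sketched as follows. This is an illustrative Monte Carlo model, not the authors' C simulator: the memory size, the trial count, and the simplified failure criterion (a row accumulating more than two upset bits, or a double error appearing in more than one row of the same word, which exceeds the correction capability stated in Section III) are assumptions made for the sketch.

```python
import random

def metf_trial(num_words=1024, rows=4, cols=8, rng=random):
    """Inject random bit upsets until some word exceeds the Matrix code's
    correction capability; return the number of upsets injected, i.e., one
    METF sample. Failure model (an assumption): a row holding more than two
    upset bits, or more than one row of a word holding a double error."""
    upset = {}  # (word, row) -> set of currently upset columns
    count = 0
    while True:
        count += 1
        w, r, c = rng.randrange(num_words), rng.randrange(rows), rng.randrange(cols)
        bits = upset.setdefault((w, r), set())
        bits ^= {c}  # a second hit on the same cell flips it back
        doubles = sum(1 for (w2, _), b in upset.items() if w2 == w and len(b) > 1)
        if len(bits) > 2 or doubles > 1:
            return count

def metf(trials=100, seed=1):
    """Average many trials, mirroring the repeated-trial estimation above."""
    rng = random.Random(seed)
    return sum(metf_trial(rng=rng) for _ in range(trials)) / trials

# With (8) and an assumed upset rate, the MTTF then follows directly, e.g.:
# mttf_days = metf() / (fault_rate * memory_size)
```

At least three upsets must land in one row before this model can fail, so every METF sample is at least 3; larger memories spread the upsets over more words and therefore yield larger METF values, matching the trend reported in Tables I and II.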
TABLE III
REQUIRED REDUNDANT BITS

TABLE IV
AREA, POWER, AND DELAY ANALYSIS

V. IMPLEMENTATION AND COST ANALYSIS

All methods described in the previous sections were coded in VHDL. The results reported here are for a register file; however, the design is generic in datapath width. The design was simulated using ModelSim and was tested for functionality with various inputs. The outputs from the VHDL-coded architecture were validated against a standard MATLAB output. The architectures were synthesized using the Synopsys tools in 0.18-μm technology. Synopsys Design Power was used to estimate the power consumption. Table IV shows the area, power, and critical path delay of the different protection schemes. The area, power, and delay overheads of the proposed method are 53%, 57%, and 10%, respectively, compared to those of the Hamming code. However, these overheads are less than the overheads imposed by the Reed–Muller code. Note that, although the Hamming code has less overhead than the proposed method and Reed–Muller, it can detect only two errors and correct one error, as depicted in Fig. 3.

In order to compare the efficiency of error detection and correction of the protection codes with respect to the implementation overheads, we provide a new metric. We call this new metric correction coverage per cost (CCC) and calculate it as follows:

CCC = Correction Coverage / Cost   (9)

where Cost is defined in terms of the implementation overheads by

Cost = Area × Power   (10)

Designers would prefer the detection coverage and the cost of the detection method to take high and low values, respectively. This implies that designs should have a high value of the new metric, detection coverage per cost.

Fig. 6 shows the CCC metric against the number of faults. Note that the Hamming code is not included in this analysis, as it offers no correction when the number of faults is more than one. The proposed matrix method has better coverage per cost than the Reed–Muller code. Thus, based on the results given in Table IV and Fig. 6, the matrix method is better suited for low-power and high-performance applications.

Fig. 6. Correction coverage per cost for Reed–Muller and matrix codes versus number of faults in memory.

TABLE V
MTTF PER COST, FOR 32-bit CODEWORD

TABLE VI
MTTF PER COST, FOR 64-bit CODEWORD

We have also introduced another metric for further comparison of the efficiency of the memory using the proposed techniques. We refer to this metric as "MTTF per Cost". As Cost we define here the redundant bits that are required to implement the specific technique per code word, as in Table III. In this metric we can see that the MCs perform better than Reed–Muller codes. These results are portrayed in Tables V and VI.

From Tables V and VI, we can see that the MC with a 64-bit codeword always performs better than Reed–Muller codes. For a 32-bit codeword, the organization with the higher number of redundant bits performs better than Reed–Muller codes for memory sizes smaller than 128 Mbits.

VI. YIELD AND COST PER CHIP ANALYSIS

Traditional memory repair techniques for yield improvement rely only on the addition of redundant rows and columns, which are then used to replace defective ones in the fabricated chip when necessary. As the number of redundant rows and columns is increased to allow higher repair capability, the fabrication yield also increases. However, since full-size rows or columns must be used to replace defective ones, which usually contain
only a few cells, this technique also increases the fabrication


cost. Moreover, when all redundant rows and columns are ex-
hausted, only a few chips can be used with degraded capacity
by setting the most significant address bit to a constant value,
due to the scattered distribution of defects in the array.
The technique proposed in this paper aims to reduce the cost
per chip and increase yield by using coding techniques that al-
lows you to save some of the faulty memory chips with small
defects instead of traditional redundant rows and columns. In the
analysis of yield and cost per chip presented here, the following
assumptions have been adopted for the following simulations.
1) All defective memory chips have only spot defects, and no
global defects, which are those defects affecting complete
sections of a chip or wafer. For traditional techniques,
when no redundant rows/columns are included in the
chip, any single spot defect will result in the chip being
discarded.
2) A 1024 32 bits memory array is used for area calcula- Fig. 7. Relative cost per chip for scenario 2).
tion, which has been obtained by modeling the chips in C
programming language.
3) Each wafer can hold 1000 chips without any redundancy. calculated by dividing total number of chips that could be pro-
If redundant rows/columns are included on the chip, the duced in the ideal scenario where no redundancy is used and
number of chips per wafer is reduced. For example, if we no chips have defects (1000 in our simulations) by the effective
add two redundant rows and two columns in each memory number of chips that were considered good for sale after repair
array the number of chips per wafer will be 969. (This is in each case. Using this criteria to evaluate the effectiveness of
used only for the cost per chip.) the proposed approach, for the ideal scenario the relative cost
A. Yield Analysis

In order to confirm the yield benefits provided by our technique, we have performed several simulations of a production run of 1000 chips, with different numbers of defects per array and different quantities of redundant rows and columns in each run, compared to our technique. In the yield analysis, the considered number of defects per chip in each simulated production run has been randomly distributed in the array, and simulations have been performed assuming the following scenario: 1) chips are repaired using the coding techniques (Reed–Muller and Matrix), and chips with remaining defects after all redundant elements have been allocated are discarded. For each simulation run, the yield has been calculated by dividing the effective number of chips considered good for sale after repair by the total number of chips produced (1000 in our simulations). Using this criterion, we evaluated the effectiveness of the proposed approach and calculated the yield for each technique.

B. Cost per Chip Analysis

In order to confirm the cost benefits provided by our technique, we have performed several simulations of a production run of 1000 chips, with different numbers of defects per array, using the same coding techniques. In the cost analysis, the considered number of defects per chip in each simulated production run has been randomly distributed in the array, and simulations have been performed assuming two different scenarios: 2) chips are repaired using the coding techniques (Reed–Muller codes and Matrix codes), and chips with remaining defects after all redundant elements have been allocated are discarded. For each simulation run, the relative cost per chip has been computed against the ideal scenario, in which all 1000 produced chips can be sold and the relative cost per chip would be 1, while in a case where only 750 working chips are left after repair the relative cost per chip would be

    relative cost per chip = 1000/750 ≈ 1.33.    (11)

This means a 33% cost overhead when compared to the ideal scenario. For scenario 2), 32 different runs have been simulated, with the number of defects per chip varying from 0 to 15, in increments of 1. The results of the simulations are shown in the chart of Fig. 7, where one can see that even when no defects are found in the chips, the use of redundant rows and columns increases the relative cost per chip when compared to the ideal scenario. This happens because a constant wafer capacity has been assumed in the simulations, which means that, when redundant elements are added, the total number of chips per wafer decreases compared to the ideal scenario. In our case, the coding techniques increase the size of the chip by 53% and 100% for the Matrix and Reed–Muller codes, respectively. Clearly, the lower the cost per chip, the greater the benefit obtained.

As said before, both "Yield" and "Cost per Chip" are important in the design of memory chips. Using the results obtained for both techniques, we evaluate the "Yield per Cost" (YIC) metric, which weights both factors equally:

    YIC = Yield / (relative cost per chip).    (12)

The results of this metric are shown in Fig. 8. In all cases (i.e., regardless of the number of faults), the memory with embedded Matrix codes shows better yield per cost than the one with the Reed–Muller code.
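The yield, relative-cost, and YIC computations described above can be condensed into a short Monte Carlo sketch. The number of words per chip, the defect-placement model, and the repair criterion (a chip is saved as long as no single word collects more defects than the code can correct) are illustrative assumptions, not the exact simulator of this paper; the area factors follow the 53% and 100% size increases quoted in the text.

```python
import random

CHIPS = 1000                  # production run size, as in the text
WORDS = 256                   # words per chip array (assumed)

def run(defects_per_chip, area_factor, correctable_per_word=1):
    """Return (yield, relative cost per chip, YIC) for one simulated run."""
    good = 0
    for _ in range(CHIPS):
        hits = [0] * WORDS
        for _ in range(defects_per_chip):
            hits[random.randrange(WORDS)] += 1    # defects land uniformly at random
        # hypothetical repair criterion: the code saves the chip as long as
        # no single word collects more defects than it can correct
        if all(h <= correctable_per_word for h in hits):
            good += 1
    chip_yield = good / CHIPS                     # yield = good chips / chips produced
    # constant wafer capacity: a bigger chip means fewer chips per wafer, and
    # the wafer cost is shared only among the sellable chips, cf. (11)
    rel_cost = area_factor * CHIPS / good if good else float("inf")
    return chip_yield, rel_cost, chip_yield / rel_cost   # YIC, cf. (12)

for defects in (0, 5, 15):
    y, c, yic = run(defects, area_factor=1.53)    # 53% size increase for Matrix codes
    print(f"Matrix, {defects:2d} defects/chip: yield={y:.2f} cost={c:.2f} YIC={yic:.3f}")
```

With zero defects the sketch reproduces the observation above: the yield is 1.0, yet the relative cost per chip already equals the area factor, because the redundancy alone shrinks the number of chips per wafer.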
The improvement provided by the proposed Matrix method in YIC can reach up to 300%, as in the case of 15 defects per chip.

Fig. 8. Yield per cost.

VII. CONCLUSION

This paper presented a high-level error detection and correction method called the Matrix code (MC). The proposed protection code combines a Hamming code and a Parity code, so that multiple errors can be detected and corrected. The fault-injection-based experimental results show that the proposed Matrix method provides detection and correction coverage between those of the Hamming and Reed–Muller codes. However, the Hamming code is not adequate for more than two errors, and the Reed–Muller code imposes significant area and power consumption compared to the proposed method. Cost analysis showed that the proposed method is significantly better than the Reed–Muller code in both detection/correction coverage per cost and yield per cost.

Future research will address further improvement of the reliability of the proposed technique and reduction of its overheads.
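The combined Hamming/parity arrangement can be illustrated with a short sketch: a 32-bit word is laid out as a 4 × 8 matrix, each row is protected by a Hamming(12,8) SEC code, and each column by one even-parity bit. These parameters and the simplified decoding (row correction, then a column-parity check that flags residual errors) are illustrative assumptions rather than the exact construction evaluated in this paper.

```python
K1, K2 = 4, 8                 # 4 rows x 8 columns = one 32-bit data word
PARITY_POS = (1, 2, 4, 8)     # Hamming check-bit positions (1-indexed) in a 12-bit codeword
DATA_POS = [i for i in range(1, 13) if i not in PARITY_POS]

def hamming_encode(row):
    """Encode 8 data bits into a Hamming(12,8) SEC codeword."""
    cw = [0] * 13                              # 1-indexed; cw[0] unused
    for pos, bit in zip(DATA_POS, row):
        cw[pos] = bit
    for p in PARITY_POS:                       # check bit p covers positions with bit p set
        cw[p] = sum(cw[i] for i in range(1, 13) if (i & p) and i != p) % 2
    return cw[1:]

def hamming_correct(codeword):
    """Correct at most one flipped bit in a row; return its 8 data bits."""
    cw = [0] + list(codeword)
    syndrome = sum(p for p in PARITY_POS
                   if sum(cw[i] for i in range(1, 13) if i & p) % 2)
    if syndrome:
        cw[syndrome] ^= 1                      # the syndrome names the flipped position
    return [cw[i] for i in DATA_POS]

def mc_encode(bits32):
    rows = [bits32[r * K2:(r + 1) * K2] for r in range(K1)]
    coded = [hamming_encode(r) for r in rows]
    col_parity = [sum(r[c] for r in rows) % 2 for c in range(K2)]
    return coded, col_parity

def mc_decode(coded, col_parity):
    rows = [hamming_correct(cw) for cw in coded]
    # column parity flags residual errors that row-level Hamming missed
    bad_cols = [c for c in range(K2)
                if sum(r[c] for r in rows) % 2 != col_parity[c]]
    return [b for r in rows for b in r], bad_cols

# demo: one upset in each of two different rows is corrected transparently
word = [1, 0] * 16
enc, cp = mc_encode(word)
enc[0][2] ^= 1                # flip a data bit in row 0 (codeword position 3)
enc[3][10] ^= 1               # flip a data bit in row 3 (codeword position 11)
decoded, bad = mc_decode(enc, cp)
assert decoded == word and bad == []
```

Because each row has its own SEC code, a multiple-bit upset spread across different rows is corrected row by row, while the column parities catch the cases a single Hamming code would miscorrect.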
REFERENCES

[1] ITRS 2002. [Online]. Available: http://public.itrs.net
[2] P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi, "Modeling the effect of technology trends on the soft error rate of combinational logic," in Proc. Int. Conf. Dependable Syst. Netw. (DSN), 2002, pp. 389–398.
[3] G. Cardarilli, A. Leandri, P. Marinucci, M. Ottavi, S. Pontarelli, M. Re, and A. Salsano, "Design of a fault tolerant solid state mass memory," IEEE Trans. Reliab., vol. 52, no. 4, pp. 476–491, Dec. 2003.
[4] B. Cooke, "Reed Muller error correcting codes," MIT Undergraduate J. Math., vol. 1, pp. 21–26, 1999.
[5] P. A. Ferreyra, C. A. Marques, R. T. Ferreyra, and J. P. Gaspar, "Failure map functions and accelerated mean time to failure tests: New approaches for improving the reliability estimation in systems exposed to single event upsets," IEEE Trans. Nucl. Sci., vol. 52, no. 1, pp. 494–500, Jan. 2005.
[6] P. Hazucha and C. Svensson, "Impact of CMOS technology scaling on the atmospheric neutron soft error rate," IEEE Trans. Nucl. Sci., vol. 47, no. 6, pp. 2586–2594, Dec. 2000.
[7] R. Hentschke, R. Marques, F. Lima, L. Carro, A. Susin, and R. Reis, "Analyzing area and performance penalty of protecting different digital modules with Hamming code and triple modular redundancy," in Proc. Symp. Integr. Circuits Syst. Des., 2002, pp. 95–100.
[8] J. Karlsson, P. Liden, P. Dahlgren, R. Johansson, and U. Gunneflo, "Using heavy-ion radiation to validate fault-handling mechanisms," IEEE Micro, vol. 14, pp. 8–23, 1994.
[9] R. Reed, M. Carts, P. Marshall, C. J. Marshall, O. Musseau, P. McNulty, D. Roth, S. Buchner, J. Melinger, and T. Corbiere, "Heavy ion and proton-induced single event multiple upset," IEEE Trans. Nucl. Sci., vol. 44, no. 6, pp. 2224–2229, Dec. 1997.
[10] N. Seifert, D. Moyer, N. Leland, and R. Hokinson, "Historical trend in alpha-particle induced soft error rates of the Alpha microprocessor," in Proc. 39th Annu. IEEE Int. Reliab. Phys. Symp., 2001, pp. 259–265.
[11] M. Y. Hsiao, "A class of optimal minimum odd-weight column SEC-DED codes," IBM J. Res. Develop., vol. 14, pp. 395–401, 1970.
[12] D. K. Pradhan and S. M. Reddy, "Error-control techniques for logic processors," IEEE Trans. Comput., vol. C-21, no. 12, pp. 1331–1336, Dec. 1972.
[13] S. Satoh, Y. Tosaka, and S. A. Wender, "Geometric effect of multiple-bit soft errors induced by cosmic ray neutrons on DRAM's," IEEE Electron Device Lett., vol. 21, no. 6, pp. 310–312, 2000.
[14] A. Dutta and N. A. Touba, "Multiple bit upset tolerant memory using a selective cycle avoidance based SEC-DED-DAEC code," in Proc. IEEE VLSI Test Symp. (VTS), 2007, pp. 349–354.
[15] G. Neuberger, F. D. Lima, L. Carro, and R. Reis, "A multiple bit upset tolerant SRAM memory," ACM Trans. Des. Autom. Electron. Syst. (TODAES), vol. 8, no. 4, pp. 577–590, 2003.
[16] M. Nicolaidis, F. Vargas, and B. Courtois, "Design of built-in current sensors for concurrent checking in radiation environments," IEEE Trans. Nucl. Sci., vol. 40, no. 6, pp. 1584–1590, Dec. 1993.
[17] J. Lo, "Analysis of a BICS-only concurrent error detection method," IEEE Trans. Comput., vol. 51, no. 3, pp. 241–253, 2002.
[18] A. J. S. D. Ciplickas and S. F. Lee, "A new paradigm for evaluating IC yield loss," Solid State Technol., vol. 44, no. 10, pp. 47–52, Oct. 2001.
[19] C. Argyrides, H. Zarandi, and D. K. Pradhan, "Matrix codes: Multiple bit upsets tolerant method for SRAM memories," in Proc. 22nd IEEE Int. Symp. Defect Fault Toler. VLSI Syst. (DFT), Sep. 2007, pp. 340–348.
[20] C. Argyrides, C. Lisboa, L. Carro, and D. K. Pradhan, "A soft error robust and power aware memory design," in Proc. 20th Annu. Symp. Integr. Circuits Syst. Des. (SBCCI), Sep. 2007, pp. 300–305.
[21] E. F. Assmus and J. D. Key, Designs and Their Codes. Cambridge, U.K.: Press Syndicate of the University of Cambridge, 1992.
[22] J. F. Wakerly, Error Detecting Codes, Self-Checking Circuits and Applications. New York: North-Holland, 1978.
[23] A. D. Houghton, The Engineer's Error Coding Handbook. London, U.K.: Chapman and Hall, 1997.
[24] K. Tokiwa, "New decoding algorithm for Reed-Muller codes," IEEE Trans. Inf. Theory, vol. IT-28, no. 5, pp. 114–122, Sep. 1982.
[25] D. K. Bhavsar, "An algorithm for row-column self-repair of RAM's and its implementation in the ALPHA 21264," in Proc. Int. Test Conf., 1999, pp. 311–318.
[26] K. Kim, C. Kim, and K. Roy, "TFT-LCD application specific low power SRAM using charge-recycling technique," in Proc. 6th Int. Symp. Quality Electron. Des., 2005, pp. 59–64.
[27] C. A. Lisboa, M. I. Erigson, and L. Carro, "System level approaches for mitigation of long duration transient faults in future technologies," in Proc. 12th Eur. Test Symp. (ETS), May 2007, pp. 165–170.
[28] S. K. Lu, "Efficient built-in redundancy analysis for embedded memories with 2-D redundancy," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 1, pp. 34–42, Jan. 2006.
[29] S. K. Lu and S.-C. Huang, "Built-in self-test and repair (BISTR) techniques for embedded RAMs," in Proc. Int. Workshop Memory Technol. Des. Test., Aug. 2004, pp. 60–64.
[30] M. Horiguchi, J. Etoh, M. Aoki, K. Itoh, and T. Matsumoto, "A flexible redundancy technique for high-density DRAM's," IEEE J. Solid-State Circuits, vol. 26, no. 1, pp. 12–17, Jan. 1991.
[31] W. K. Huang, Y. H. Shen, and F. Lombardi, "New approaches for the repairs of memories with redundancy by row/column deletion for yield enhancement," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 9, no. 3, pp. 323–328, May 1990.
[32] H. C. Kim, D. S. Yi, J. Y. Park, and C. H. Cho, "A BISR (built-in self repair) circuit for embedded memory with multiple redundancies," in Proc. Int. Conf. VLSI CAD, Oct. 1999, pp. 602–605.
[33] P. Mazumder and Y. S. Jih, "A new built-in self-repair approach to VLSI memory yield enhancement by using neural-type circuits," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 12, no. 1, pp. 24–36, Jan. 1993.
[34] B. Elmer, W. Tchon, A. Denboer, S. Kohyama, K. Hirabayashi, and I. Nojima, "Fault tolerant 92160 bit multiphase CCD memory," in IEEE Int. Conf. Solid-State Circuits Dig. Tech. Papers, Feb. 1977, pp. 116–117.
[35] C. H. Stapper, A. N. McLaren, and M. Dreckmann, "Yield model for productivity optimization of VLSI memory chips with redundancy and partially good product," IBM J. Res. Develop., vol. 24, no. 3, pp. 398–409, May 1980.
[36] C. Argyrides, A. A. Al-Yamani, C. Lisboa, L. Carro, and D. K. Pradhan, "Increasing memory yield in future technologies through innovative design," in Proc. 8th Int. Symp. Quality Electron. Des. (ISQED), Mar. 2009, pp. 622–626.
[37] J. A. Maestro and P. Reviriego, "Study of the effects of MBUs on the reliability of a 150 nm SRAM device," in Proc. 45th Annu. Des. Autom. Conf. (DAC), 2008, pp. 930–935.

Costas Argyrides (S'07–M'10) received the B.Sc. degree in informatics and computer science from the Moscow Power Engineering Institute-Technical University (MPEI-TU), Moscow, Russia, with distinction, entering the list of the top 10 students of MPEI-TU, and the M.Sc. degree in advanced computing and the Ph.D. degree in computer science from the University of Bristol (UoB), Bristol, U.K.
Currently, he is a Postdoctoral Research Associate with the Department of Computer Science, UoB. Prior to this, he served as a Research Assistant with the Universities of Warwick and Cambridge. He is the author or coauthor of more than 30 technical papers. His research interests include fault-tolerant computer systems, software fault tolerance, reliability improvement, error correcting codes, algorithmic-based fault tolerance, and nanotechnology-based designs.
Dr. Argyrides was a recipient of the Best Paper Award for his paper "Reliability Aware Yield Improvement Technique for Nanotechnology Based Circuits," with C. Lisboa, L. Carro, and D. K. Pradhan, presented at the 22nd Symposium on Integrated Circuits and Systems Design (SBCCI 2009).

Dhiraj K. Pradhan (S'70–M'72–SM'80–F'88) is currently a Professor with the Department of Computer Science, University of Bristol, Bristol, U.K. Previously, he was a Professor with the Department of Electrical and Computer Engineering, Oregon State University, Corvallis, and held the COE Endowed Chair Professorship in Computer Science with Texas A&M University, College Station, where he also served as the founder of the Laboratory of Computer Systems, and he held a professorship at the University of Massachusetts, Amherst, where he also served as a coordinator of computer engineering. He also worked at the University of California, Berkeley, Oakland University, Rochester, MI, and the University of Regina, Saskatchewan, Canada. He was also a Visiting Professor with Stanford University, Stanford, CA. In the past, he worked as a Staff Engineer with IBM. More recently, he served as the founding CEO of Reliable Computer Technology, Inc. He continues to serve as an editor of prestigious journals, including IEEE transactions, and has served as the general chair and program chair for various major conferences. He is also the inventor of two patents, one of which was licensed to Mentor Graphics and Motorola; the verification tool FormalPro, by Mentor Graphics, is based on his patent. He has contributed to VLSI computer-aided design and test, as well as to fault-tolerant computing, computer architecture, and parallel processing research, with major publications in journals and conferences spanning more than 30 years. During this long career, he has been well funded by various agencies in Canada, the United States of America, and the United Kingdom. He is also the coauthor and editor of various books, including Fault-Tolerant Computing: Theory and Techniques, volumes I and II (Prentice-Hall, 1986), Fault-Tolerant Computer Systems Design (Prentice-Hall, 1996; second print, 2003), IC Manufacturability: The Art of Process and Design Integration (IEEE Press, 2000), Practical Design Verification (Cambridge University Press, 2009), and Fault and Defect Tolerance in Nanotechnology Circuits (Cambridge University Press, 2009). His research interests include low-power designs, soft-error protection, and variability issues.
Dr. Pradhan is a Fellow of the ACM and the Japan Society for the Promotion of Science. He was a recipient of the Humboldt Prize in Germany, the Fulbright-Flad Chair in Computer Science in 1997, and Best Paper Awards, including the 1996 IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems Best Paper Award, with W. Kunz, for "Recursive Learning: A New Implication Technique for Efficient Solutions to CAD Problems — Test, Verification and Optimization."

Taskin Kocak received B.S. degrees in electrical and electronic engineering and in physics (as a double major) from Bogazici University, Istanbul, Turkey, in 1996, and the M.S. and Ph.D. degrees in electrical and computer engineering from Duke University, Durham, NC, in 1998 and 2001, respectively.
He is currently an Associate Professor and Chairman of the Computer Engineering Department, Bahcesehir University, Istanbul, Turkey. Previously, he was a Senior Lecturer (Associate Professor) with the Electrical and Electronic Engineering Department, University of Bristol, U.K. (2007–2009). Prior to that, he was an Assistant Professor with the Department of Computer Engineering, University of Central Florida, Orlando (2001–2007). Before joining academia, he worked as a Design Engineer with Mitsubishi Electronics America's Semiconductor Division in Raleigh-Durham, NC (1998–2000). His research interests include computer networks and communications, and hardware design (computer architecture and VLSI). His research activities have produced over 85 peer-reviewed publications, including 29 journal papers, and have been supported by both American and British funding agencies and companies, including Northrop Grumman, Toshiba Research Europe, Great Western Research, ClearSpeed Technology, and the U.K. Engineering and Physical Sciences Research Council. He founded and organized the Advanced Networking and Communications Hardware Workshop series (2004–2006), which was supported by both IEEE and ACM. He is the founding Editor-in-Chief of the ICST Transactions on Network Computing. He also served as an associate editor for the Computer Journal (2007–2009). He is currently serving as a guest editor for special issues of the ACM Journal on Emerging Technologies in Computing Systems and the EURASIP Journal on Wireless Communications and Networking.