
A NOVEL SEU, MBU AND SHE HANDLING STRATEGY FOR XILINX VIRTEX-4 FPGAS

X. Iturbe, M. Azkarate, I. Martínez and J. Perez

A. Astarloa

Embedded System-on-Chip Group


IKERLAN-IK4 Research Alliance
J. M. Arizmendiarrieta, 2, 20500
Arrasate-Mondragon,
Basque Country (Spain)
email: xiturbe@ikerlan.es

Applied Electronics Research Team


University of the Basque Country
Dept. of Electronics & Telecommun.
Urquijo s/n, 48013, Bilbao,
Basque Country (Spain)
email: armando.astarloa@ehu.es

ABSTRACT
This paper presents a new Single Event Upset (SEU), Multiple Bit Upset (MBU) and Single Hardware Error (SHE) mitigation strategy to be used in Virtex-4 FPGAs. This strategy aims to increase not only the effectiveness of traditional Triple Module Redundancy (TMR), but also the overall system availability. Frame readback with ECC detection and frame scrubbing are combined in a dynamically reconfigurable TMR architecture, designed under both spatial and implementation diversification premises. Moreover, since the strategy works in the device's bitstream domain, the basics of the Virtex-4 FPGA bitstream definition are also presented.
1. INTRODUCTION
Since the introduction of SRAM-based Field Programmable Gate Arrays (FPGAs) around two decades ago, this technology has progressively consolidated its place in the competitive electronics market. Nowadays FPGAs are routinely used to embed complex digital systems in a single chip (SoC). The increase in density brought about by smaller SRAM cells and logic structures makes these state-of-the-art reconfigurable logic devices very attractive for modern applications.
However, SRAM technology is especially sensitive to radiation-induced SEUs. According to the data released from the latest Rosetta experiment [1], Virtex-4 FPGA devices exhibit 238 FIT/Mb in the configuration memory and 379 FIT/Mb in Block RAM memories due to SEU effects (one FIT is equal to one failure per 10^9 hours of system operation). Moreover, MBUs are also becoming a concern in very deep sub-micron process technologies such as those used for FPGA fabrication. The NASA Jet Propulsion Laboratory (JPL) points out in [2] that MBUs are nearly three times more likely to occur in the Virtex-4 family (90 nm) than in the Virtex-II family (130 nm), and 69 times more likely to occur than in the Virtex family (220 nm). Finally, the continuous technology scaling and the increase in switching activity experienced by integrated circuits make recent FPGA devices more vulnerable to aging effects (e.g. electromigration and time-dependent dielectric breakdown), which may cause permanent SHEs in the chip's substratum [3]. The Xilinx Reliability Report [1] assigns 3 FIT to the Virtex-4 process technology. Nevertheless, the study carried out in [4], which demonstrates that an FPGA device usually starts to fail much earlier under continuous operating conditions, and the fact that no FPGA has been tested for more than 30 years suggest that this value could be much higher.

978-1-4244-3892-1/09/$25.00 ©2009 IEEE

Therefore, in order to enable the use of SRAM-based FPGAs in safety-critical applications, in which an extremely low FIT is required, it is mandatory to define new techniques and strategies for handling the fault types and frequencies presented here.
This paper presents a novel fault handling strategy which increases the system availability and, unlike other mitigation strategies (presented in Section 2), addresses SEU, MBU and SHE occurrence in Xilinx Virtex-4 FPGAs. As it is the basis for the proposed strategy, the details of the Virtex-4 bitstream are given in Section 3, while the strategy itself is described in Section 4. Section 5 presents the system in which the strategy has been tested and validated. The measured results are integrated into an analytical model in Section 6 in order to characterize the overall strategy performance. Finally, the conclusions are presented in Section 7.
2. PREVIOUS WORK
Currently, TMR is one of the most widely used fault tolerance approaches, despite the amount of logic resources it needs, the performance limitation it involves and the vulnerability that the voter introduces. Proof of this is the XTMR Tool recently developed by Xilinx, which automatically hardens a design by triplicating the inputs/outputs and the throughput logic, and by inserting feedback logic for registered data correction [5].
However, the inherent flexibility of this kind of device has opened the door in recent years to new methods for fault mitigation. Xilinx has developed the so-called scrubbing technique, based on continuous FPGA reconfiguration, which is claimed to be the definitive solution for radiation-induced bit upsets.
Along these lines, the Xilinx SEU controller macro [6] is considered to be the best existing working solution. This controller continuously reads every configuration frame, each of which is protected by 12 SEC-DED ECC bits. Thus, if a bit-flip has occurred, it is possible to identify its position and correct it. This scrubbing-derived technique is called Readback with Correction. The SEU controller macro internally instantiates the Internal Configuration Access Port (ICAP). Consequently, this port is unfortunately not available for extending the mitigation strategy to other types of faults. Nevertheless, the SEU controller is the only known solution for correcting the initial value of the frame ECC parity bits, since they are not currently set properly by the Xilinx Bitgen tool.
As stated in [7], when TMR is combined with scrubbing, the device failure rate due to SEUs can be reduced to single-digit FITs, but this is not the case when multiple bits are flipped. Since several upsets may potentially affect multiple redundant modules, and since multiple bit-flips in the same frame cannot be corrected by using SEC-DED ECC bits, neither TMR nor readback with correction is a good candidate for dealing with MBUs. The NASA-GSFC Radiation Effects and Analysis Group has developed one of the few known scrubbers capable of mitigating MBUs [8]. It does not use ECC circuitry to correct the faulty bits; instead, a golden configuration is stored. An external device periodically refreshes the content of the FPGA's configuration memory with the golden information through the external SelectMAP or JTAG ports, whether or not upsets have occurred (Blind Scrubbing). Nevertheless, this intrusiveness excessively affects the Dynamically Changing Information (DCI) in the user design (i.e. the content of LUT RAMs, SRL16s and Block RAMs), since the correct value could be erroneously overwritten.
As opposed to SEUs and MBUs, SHEs are not correctable because they involve permanent damage in the device's substratum. Evolutionary techniques try to adapt the system to damage in the chip's substratum at runtime by taking advantage of the flexibility of FPGAs. However, this adaptation has not yet been successfully achieved for complex systems, and the amount of time required to run the evolution drastically reduces the availability. Hence, the most feasible solution when dealing with this type of fault seems to be simply to avoid using the faulty resources, as presented in [9].
3. XILINX VIRTEX-4 FPGAS
Fig. 1 shows the Virtex-4 FPGA bitstream definition, focused on the XC4VFX12 part, which has been used in this work. The device is divided into two halves, top (identified by 0) and bottom (identified by 1), and each of these halves is in turn divided into clock regions or rows. In contrast to previous generations of Virtex families, the Virtex-4 bitstream is composed of fixed-length configuration frames, each consisting of 41 words (1312 bits) and spanning the height of a row. There are different types of frames for each type of logic resource (00 for IOBs, CLBs, Vertical Clock and DSP48s; 01 for BRAM interconnect; and 10 for BRAM content), and each type of resource requires a different number of frames to be configured. Every frame is addressed by a five-part address according to the position of the resources to which it refers: (a) logic resource type, (b) top/bottom half, (c) row, (d) major column address, which identifies a column within a row, and (e) minor intra-column address, which identifies a specific frame within each column. Table 1 shows the major and minor addresses for the frames constituting an XC4VFX12 device row. By using this information it is possible to address every frame when performing the readback and also to physically locate the resources affected by a fault in the device.
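The five-part addressing scheme can be illustrated with a short Python sketch. It only enumerates the addresses of one row using the major and minor ranges listed in Table 1; the `FrameAddress` type, the `ROW_LAYOUT` constant and the `frames_in_row` function are hypothetical names introduced here for illustration, and the sketch deliberately does not model how the fields are packed into the device's frame address register.

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass(frozen=True)
class FrameAddress:
    """Five-part frame address as described in the text (illustrative type)."""
    block_type: int  # 0b00 logic/IOB/VCLK/DSP48, 0b01 BRAM interconnect, 0b10 BRAM content
    half: int        # 0 = top, 1 = bottom
    row: int         # clock region (row) within the half
    major: int       # column within the row
    minor: int       # frame within the column

# (block type, first major, last major, frames per column), read from Table 1.
ROW_LAYOUT = (
    (0b00, 0, 0, 30),    # IOB
    (0b00, 1, 12, 22),   # CLB
    (0b00, 13, 13, 30),  # IOB
    (0b00, 14, 14, 3),   # VCLK
    (0b00, 15, 18, 22),  # CLB
    (0b00, 19, 19, 21),  # DSP48
    (0b00, 20, 27, 22),  # CLB
    (0b00, 28, 28, 30),  # IOB
    (0b01, 0, 2, 20),    # BRAM interconnect
    (0b10, 0, 2, 64),    # BRAM content
)

def frames_in_row(half: int, row: int) -> Iterator[FrameAddress]:
    """Yield every frame address of one row, in Table 1 column order."""
    for block_type, first_major, last_major, frames_per_column in ROW_LAYOUT:
        for major in range(first_major, last_major + 1):
            for minor in range(frames_per_column):
                yield FrameAddress(block_type, half, row, major, minor)
```

A readback loop could iterate over `frames_in_row(half, row)` for each row, and the inverse lookup (from a corrupted frame's address to the column type it configures) follows from the same table.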
As already mentioned, the data bits of each configuration frame are protected by a 12-bit SEC-DED Hamming code, located in the 21st word, at bit positions 640 to 651. The ability of the code to protect 2036 data bits allows bit positions to be shifted so as to avoid the power-of-2 positions, which are reserved for the ECC parity bits. Consequently, the data block organization can be referred to by using the D(32·N + i) nomenclature, where N ranges from 22 through 63 (skipping N = 32, since it includes position 1024, which is a power of 2) and i ranges from 0 to 31, except when N = 43, in which case the data only extends from i = 0 to 19 because the ECC bits are placed in the remaining positions. Virtex-4 FPGAs incorporate a built-in block, called Frame ECC [10], which automatically checks the frame ECC bits when a frame is read through the ICAP port. If an upset has occurred, the Frame ECC block gives the position of the flipped bit (through the so-called syndrome) or warns that an MBU has occurred (the SEC-DED code only allows the detection of 2 bit-flips, or of an odd number of bit-flips greater than 2). The correspondence between the ECC syndrome and the bitstream bit position is shown in Fig. 1. This information has been used for frame ECC bit calculation and syndrome decoding.
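The D(32·N + i) indexing can be made concrete with a small Python sketch. The functions `word_to_n`, `code_position` and `is_data_bit` follow the ranges given in the text; `classify_syndrome` is a generic SEC-DED decision rule, and its assumption that bit 11 of a 12-bit syndrome carries the overall parity check while bits 10..0 carry the Hamming position is illustrative only and not taken from this paper. All names are hypothetical.

```python
WORDS_PER_FRAME = 41
BITS_PER_WORD = 32
ECC_WORD = 20      # the 21st word of the frame (0-based index)
ECC_FIRST_I = 20   # for N = 43, i = 20..31 hold the 12 ECC bits

def word_to_n(word: int) -> int:
    """Map a frame word index (0..40) to N in 22..63, skipping N = 32."""
    if not 0 <= word < WORDS_PER_FRAME:
        raise ValueError("word index out of range")
    n = 22 + word
    return n if n < 32 else n + 1

def code_position(word: int, i: int) -> int:
    """Position 32*N + i of bit i (0..31) of the given frame word."""
    if not 0 <= i < BITS_PER_WORD:
        raise ValueError("bit index out of range")
    return 32 * word_to_n(word) + i

def is_data_bit(word: int, i: int) -> bool:
    """True unless (word, i) is one of the 12 ECC slots of the 21st word."""
    return not (word == ECC_WORD and i >= ECC_FIRST_I)

def classify_syndrome(syndrome: int) -> str:
    """Generic SEC-DED decision (assumed layout: bit 11 = overall parity)."""
    overall = (syndrome >> 11) & 1
    position = syndrome & 0x7FF
    if overall == 0 and position == 0:
        return "no error"
    if overall == 1:
        # position == 0 would mean the overall parity bit itself flipped
        return "single-bit error"
    return "multiple-bit error"
```

Under this mapping the 41 words cover positions 704 to 2047 and never land on a power of 2, which is the property the shifted layout is meant to guarantee.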
Finally, the readback (.rbb) file, automatically generated by the Xilinx Bitgen tool, provides the system configuration information and is used for verifying the correctness of the readback data stream at runtime. Besides the .rbb file, a logic allocation file (.ll) and a mask file (.msk) are generated, which specify the location of the DCI within the bitstream. This information has been used in order to prevent the DCI from being overwritten.
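As a rough illustration of how a mask can keep DCI out of the verification, the following Python sketch compares a readback frame against its golden copy while ignoring masked bits. It assumes that a mask bit set to 1 marks a dynamically changing bit that must be skipped, and that bits are numbered MSB-first within each byte; both conventions, and the function name `mismatched_bits`, are assumptions made for this example rather than details taken from the paper.

```python
from typing import List

def mismatched_bits(readback: bytes, golden: bytes, mask: bytes) -> List[int]:
    """Return the bit indices where readback differs from golden,
    ignoring every bit whose mask bit is 1 (assumed to be DCI)."""
    if not (len(readback) == len(golden) == len(mask)):
        raise ValueError("readback, golden and mask must have equal length")
    bad = []
    for byte_index, (r, g, m) in enumerate(zip(readback, golden, mask)):
        diff = (r ^ g) & ~m & 0xFF          # differences outside the mask
        for bit in range(8):
            if diff & (0x80 >> bit):        # MSB-first bit numbering
                bad.append(8 * byte_index + bit)
    return bad
```

The same mask could be applied when writing a frame back, so that only static configuration bits are restored and the live contents of LUT RAMs, SRL16s and Block RAMs are left alone.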

Fig. 1: XC4VFX12 FPGA Device bitstream definition.


Table 1: XC4VFX12 FPGA Device configuration frame indexing in a row.

Column type    Major addr.    Minor addr.
IOB            0              0-29
CLB            1-12           0-21
IOB            13             0-29
VCLK           14             0-2
CLB            15-18          0-21
DSP48          19             0-20
CLB            20-27          0-21
IOB            28             0-29
BRAM Conn.     0-2            0-19
BRAM Data      0-2            0-63

4. FAULT HANDLING STRATEGY PROPOSAL

This section describes the proposed fault handling strategy, whose procedure is depicted in Fig. 2. The description follows the dependability terminology defined by Avizienis et al. [11]. According to these definitions, a fault (SEU, MBU or SHE) is active when it produces an error, which is defined as a deviation from the correct service state; otherwise, it is dormant. A system failure occurs when an error is propagated to the system output and causes the service delivered by the system to deviate from correct service. Specifically for FPGAs, Asadi and Tahoori have defined the following terms in [12]: (i) Mean Time To Manifest errors (MTTM) is the mean period of time a fault is dormant, which varies depending on the functionality assigned to the faulty configuration bit, and (ii) Mean Time To Detect an error (MTTD) is the time elapsed between the instant a configuration bit is affected by a fault and the instant the erroneous configuration bit is detected.
To achieve the best results, the proposed strategy implements both system failure prevention mechanisms (aimed at increasing the Mean Time Between Errors, MTBE) and module error correction methods (aimed at reducing the Mean Time To Repair, MTTR).
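To give a feel for the magnitudes behind these definitions, the following Python sketch evaluates MTTD for a full-device readback and the resulting fault activation probability. It uses the figures reported in Section 6 of this paper (3,848 frames, 2,545 clock ticks to read back one frame, a 100 MHz clock, and the relation PR_FE = (MTTD - MTTM)/MTTD); MTTM is taken as 5 clock ticks, and the constant and function names are hypothetical.

```python
F_CLK_HZ = 100e6          # system clock used in the reported measurements
N_FRAMES_TOTAL = 3848     # total frames in the XC4VFX12 (Section 6)
T_READBACK_TICKS = 2545   # clock ticks to read back one frame (Section 6)
MTTM_TICKS = 5            # mean time to manifest, taken as 5 ticks

def mttd_ticks(n_frames: int, t_readback: int = T_READBACK_TICKS) -> int:
    """MTTD in clock ticks when n_frames are read back in sequence."""
    return n_frames * t_readback

def ticks_to_ms(ticks: float, f_clk_hz: float = F_CLK_HZ) -> float:
    """Convert clock ticks to milliseconds."""
    return 1e3 * ticks / f_clk_hz

def activation_probability(mttd: float, mttm: float) -> float:
    """PR_FE = (MTTD - MTTM) / MTTD, both arguments in the same unit."""
    return (mttd - mttm) / mttd

if __name__ == "__main__":
    mttd = mttd_ticks(N_FRAMES_TOTAL)
    print(f"MTTD  = {ticks_to_ms(mttd):.2f} ms")   # prints "MTTD  = 97.93 ms"
    print(f"PR_FE = {activation_probability(mttd, MTTM_TICKS):.7f}")
```

Because a full readback takes millions of ticks while a fault manifests within a few, the activation probability is very close to 1 for a dormant fault left to preventive readback alone, which is why the voter-triggered correction path matters.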

4.1. System Failure Prevention


TMR is used in order to mask a single error while the corrective action takes place. We have designed a majority voter that also identifies the module which gives a differing output, and consequently activates the scrubbing procedure before more errors accumulate and make the system fail. Being the most critical element, the voter is implemented asynchronously by using DSP48 blocks, which are more robust against radiation effects since they are built into the device's substratum, instead of having their functionality defined by radiation-sensitive SRAM-based configuration memory.
The TMR modules have been floor-planned separately within the FPGA substratum in order to reduce the probability that a fault affecting one module corrupts other modules (Spatial Diversity). This placement also makes it possible to identify the faulty module by knowing the position of the corrupted bit within the bitstream. In order to perform this detection, aiming to correct any dormant fault, the configuration memory is continuously read (Preventive Readback, Pr Rb). When there is a single bit-flip, that bit is inverted again and written back to the configuration memory. MBUs, on the other hand, are overcome by scrubbing the affected frame, whose correct content is taken from the .rbb data stream, stored in a flash memory. If these actions are not successful, two alternatives arise: if the fault affects one TMR module, the SHE recovery actions described in the next section are undertaken; otherwise, access to the ICAP port is disabled in order to prevent a potential self-corruption of the configuration memory content due to system malfunctioning.

Fig. 2: Proposed fault handling strategy flow chart.

Fig. 3: Fault handling strategy running system. (a) System architecture (gray intensity matches radiation sensitivity). (b) Fault propagation in the system.
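The behaviour of the voter described in Section 4.1 (majority output plus identification of the dissenting module) can be modelled in a few lines of Python. This is only a behavioural sketch: the actual voter is an asynchronous circuit built from DSP48 blocks, and the function name `tmr_vote` and its return convention are assumptions made for illustration.

```python
from typing import List, Tuple

def tmr_vote(a: int, b: int, c: int) -> Tuple[int, List[int]]:
    """Bitwise 2-out-of-3 majority vote over three module outputs.

    Returns the voted word and the indices (0, 1, 2) of the modules
    whose output differs from it.
    """
    voted = (a & b) | (a & c) | (b & c)
    dissenters = [index for index, out in enumerate((a, b, c)) if out != voted]
    return voted, dissenters
```

A single dissenting index identifies the module to scrub; two dissenting indices mean that errors have already accumulated in more than one module, which is the situation the early scrubbing trigger is meant to prevent.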

4.2. Module Error Correction


When a TMR fault becomes active, the voter is immediately aware of it (MTTD ≈ 0, since it is asynchronous) and initiates the recovery process (darkly shaded branch in Fig. 2). As SEU and MBU faults are more probable, the first action consists of partially reconfiguring the faulty module with its correct configuration (Blind Module Scrubbing, BM Scr). Thus, the MTTR is reduced, as the overhead of executing configuration commands is minimal, and the intrusiveness is not as high as in a standard (complete system) Blind Scrubbing.
If the error is not corrected, a SHE is assumed to have occurred and thus the partial bitstream of the affected module is analyzed (Module-focused Readback, Mf Rb) to pinpoint the corrupted frame and identify the position of the erroneous bit, if possible. To do so we use the information given by the Frame ECC primitive, which is usable because we have worked around the existing bug in the Xilinx Bitgen tool when computing frame ECC codes: the proper frame ECC bits are calculated online and written back to the configuration memory before the system starts running.
Once the permanent damage in the device is located, a different module implementation is loaded in the TMR slot. We have implemented up to 3 distinct versions of each module, each of them using the underlying logic resources in 3 out of 4 quadrants (Implementation Diversity, see Fig. 3), but this idea could be extended to other diversity-based strategies. Therefore, by knowing the fault location, it is possible to select the module implementation which does not use the faulty logic resources; in this case, the module version that does not use the slot quadrant where the fault is located. However, if all the implementations need to use the faulty resources, the system is degraded, as only two TMR modules keep working.
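The escalation described in this section (blind module scrubbing first, then module-focused readback, then loading an implementation that avoids the damaged quadrant) can be summarized in a short Python sketch. The hardware actions are passed in as callables so that the control flow stays self-contained; `recover_module`, `pick_implementation`, `Outcome` and the parameter names are hypothetical, and the sketch assumes that each module version is characterized by the single quadrant it leaves unused.

```python
from enum import Enum
from typing import Callable, Optional, Sequence

class Outcome(Enum):
    REPAIRED_BY_SCRUBBING = 1     # BM Scr was enough (SEU or MBU)
    REPLACED_IMPLEMENTATION = 2   # SHE avoided by another module version
    DEGRADED = 3                  # no version avoids the fault

def pick_implementation(unused_quadrants: Sequence[int],
                        faulty_quadrant: int) -> Optional[int]:
    """Return the index of the version that leaves faulty_quadrant unused."""
    for version, unused in enumerate(unused_quadrants):
        if unused == faulty_quadrant:
            return version
    return None

def recover_module(scrub_module: Callable[[], None],
                   module_ok: Callable[[], bool],
                   locate_fault_quadrant: Callable[[], Optional[int]],
                   load_version: Callable[[int], None],
                   unused_quadrants: Sequence[int]) -> Outcome:
    scrub_module()                          # Blind Module Scrubbing
    if module_ok():
        return Outcome.REPAIRED_BY_SCRUBBING
    quadrant = locate_fault_quadrant()      # Module-focused Readback
    if quadrant is None:                    # fault could not be located
        return Outcome.DEGRADED
    version = pick_implementation(unused_quadrants, quadrant)
    if version is None:                     # every version uses that quadrant
        return Outcome.DEGRADED
    load_version(version)                   # load the alternative version
    return Outcome.REPLACED_IMPLEMENTATION
```

With three versions that leave quadrants 0, 1 and 2 unused, a fault in quadrant 3 cannot be avoided, which corresponds to the degraded case in which only two TMR modules keep working.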
5. SYSTEM ARCHITECTURE
Fig. 3 shows the system architecture on which the fault handling strategy has been tested, together with the proposed fault propagation model. The system consists of two parts: the Partially Reconfigurable Functional Region (PRFR) and the Static Reconfiguration Controller Region (SRCR). The PRFR is itself divided into three TMR slots, each of them hosting one partially reconfigurable module which implements system functionality and incorporates bus-macro-based interfaces. The reconfiguration controller is based on the built-in PowerPC-405 processor (PPC-405). Both program and data memories are protected by means of ECC codes, which temporarily mask the faults, while the scrubbing mechanism is responsible for preventing fault accumulation. Furthermore, we have used buses that require a small amount of logic resources (e.g. DCR or OCM) in order to minimize the probability of fault occurrence.

6. STRATEGY CHARACTERIZATION
The module correction techniques described previously ensure that a faulty TMR module will be operative again within an average time interval of MTTR, but if a second module fails before the first module is repaired, the system will fail (when TTR > TBE). Both the MTTR and MTBE parameters are expressed analytically in equations (1) and (2), and represented graphically in Fig. 4 for various module sizes.

$$
MTTR = \overbrace{N_f\,T_{Wb_f}}^{\text{BM Scr}} \cdot \overbrace{\frac{FIT_{RAD}}{FIT_{RAD}+FIT_{SHE}}}^{PR_{SEU+MBU}} + \Big( \overbrace{\frac{N_f}{2}\,T_{Rb_f}}^{\text{Mf Rb}} + \overbrace{N_f\,T_{Wb_f}}^{\text{Mi Rp}} \Big) \cdot \overbrace{\frac{FIT_{SHE}}{FIT_{RAD}+FIT_{SHE}}}^{PR_{SHE}} \qquad (1)
$$

$$
MTBE\ [\text{in hours}] = \underbrace{\frac{N_{fT}}{N_f}}_{\text{Spatial Diversity}} \Big( \frac{10^9}{FIT_{RAD}} + \underbrace{\frac{4}{3}}_{\text{Imp. Diversity}} \cdot \frac{10^9}{FIT_{SHE}} \Big) \cdot \underbrace{\frac{N_f\,T_{Rb_f}}{N_f\,T_{Rb_f} - 2\,MTTM}}_{PR_{FE}} \qquad (2)
$$

In these equations, $N_f$ is the number of frames that configure each TMR module; $T_{Wb_f}$ is the time needed to write a frame to the configuration memory when performing module partial reconfiguration (it also includes the flash memory access time for reading the .rbb data); $T_{Rb_f}$ is the time needed to read back a frame; and $N_{fT}$ is the total number of frames in the device. We have experimentally measured $T_{Wb_f}$ = 5,077, $T_{Rb_f}$ = 2,545 and MTTM ≈ 5 clock ticks. Based on [1], we have selected for the XC4VFX12 device ($N_{fT}$ = 3,848) $FIT_{SEU+MBU} = FIT_{RAD}$ = 1,142 and the worst-case, overdimensioned $FIT_{SHE}$ = 30. MTTR is the time elapsed when executing the module correction mechanisms (darkly shaded branch in Fig. 2), weighted by the probability of occurrence of each fault type. The faults are considered to occur randomly in the configuration frames and thus, on average, $N_f/2$ frames have to be analyzed in order to find their location. MTBE is inversely proportional to both the FIT and the probability that a fault occurs within the frames of a TMR module ($N_f/N_{fT}$). For SHEs, it also depends on the fault occurring in one of the three quadrants used by the module implementation loaded at a given time. Finally, the probability that the fault becomes active is $PR_{FE}$ = (MTTD - MTTM)/MTTD [12], where MTTD = $N_f\,T_{Rb_f}$. When the system runs at 100 MHz and for $N_f = N_{fT}$, we have experimentally measured MTTD = 97.93 ms, which agrees with $N_{fT}\,T_{Rb_f}$ = 3,848 × 2,545 = 9,793,160 clock ticks.
7. CONCLUSIONS
In this paper a novel strategy for SEU, MBU and SHE handling in Virtex-4 FPGAs has been described, as well as the basis used for its development. The strategy has first been validated in a self-reconfigurable TMR scheme and then characterized by means of a single analytical model which incorporates experimentally obtained results.
Fig. 4: Measured MTTR and MTBE running at 100 MHz.

8. REFERENCES
[1] Xilinx Inc., "Reliability report, 4th quarter 2008," UG116.
[2] M. Berg, "Assessing and mitigating radiation effects in Xilinx FPGAs," JPL Publication, 2008.
[3] C. Constantinescu, "Trends and challenges in VLSI circuit reliability," IEEE Micro, vol. 23, no. 4, pp. 14-19, 2003.
[4] S. Suresh, M. Prasanth, X. Yuan, N. Vijaykrishnan, and S. Karthik, "FLAW: FPGA lifetime awareness," in IEEE Design Automation Conference, 2006, pp. 630-635.
[5] Xilinx Inc., "Xilinx TMRTool user guide," UG156, 2006.
[6] K. Chapman and L. Jones, "SEU strategies for Virtex-5 devices," Xilinx Inc. XAPP864, 2009.
[7] A. Lesea and P. Alfke, "Xilinx FPGAs overcome the side effects of sub-90 nm technology," Xilinx Inc. WP256, 2007.
[8] M. Berg, C. Poivey, D. Petrick, D. Espinosa, A. Lesea, K. A. LaBel, M. Friendlich, H. Kim, and A. Phan, "Effectiveness of internal versus external SEU scrubbing mitigation strategies in a Xilinx FPGA: Design, test, and analysis," IEEE Trans. on Nuclear Science, vol. 55, no. 4, pp. 2259-2266, 2008.
[9] S. Pontarelli, M. Ottavi, V. Vankamamidi, G. Cardarilli, F. Lombardi, and A. Salsano, "Analysis and evaluations of reliability of reconfigurable FPGAs," Journal of Electronic Testing, vol. 24, no. 1, pp. 105-116, 2008.
[10] W. E. Cory, D. P. Schultz, and S. P. Young, "Error checking parity and syndrome of a block of data with relocated parity bits," U.S. Patent 7 426 678, Sept. 16, 2008.
[11] A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr, "Basic concepts and taxonomy of dependable and secure computing," IEEE Trans. on Dependable Secure Comput., vol. 1, no. 1, pp. 11-33, 2004.
[12] G. Asadi and M. B. Tahoori, "Soft error rate estimation and mitigation for SRAM-based FPGAs," ACM International Symposium on Field-Programmable Gate Arrays, pp. 149-160, 2005.