International Journal of Reliability, Quality and Safety Engineering

Vol. 18, No. 1 (2011) 61-98
© World Scientific Publishing Company
DOI: 10.1142/S0218539311004093
REDUNDANCY ISSUES IN SOFTWARE AND HARDWARE SYSTEMS: AN OVERVIEW
MADHU JAIN and RITU GUPTA

Department of Mathematics
I.I.T. Roorkee-247667, India

Department of Mathematics, Institute of Basic Science
Khandari, Agra-282002, India
madhufma@iitr.ernet.in
gupta ritu84@yahoo.co.in
Received 22 September 2010

Revised 10 January 2011
Redundancy is a widely used technique for building computing systems that continue
to operate satisfactorily in the presence of faults in hardware and software
components. The principal objective of applying redundancy is to achieve reliability
goals subject to techno-economic constraints. Because applications arise in virtually
all industrial and military organizations, especially in embedded fault-tolerant
systems such as telecommunication, distributed computer systems and automated
manufacturing systems, the reliability and related dependability measures of redundant
computer-based systems have attracted the attention of system designers and
production engineers. However, even with the best design of redundant computer-based
systems, software and hardware failures may still occur through many failure
mechanisms, leading to serious consequences such as huge economic losses and risk to
human life. The objective of the present survey article is to discuss the key aspects,
failure consequences and methodologies of redundant systems, along with the software
and hardware redundancy techniques that have been developed at the reliability
engineering level. The methodological aspects, which depict the steps required to
build a block diagram composed of components in different configurations as well as
Markov and non-Markov state transition diagrams representing the structure of the
system, are elaborated. Furthermore, we describe the reliability of a specific
redundant system and compare it with a non-redundant system to demonstrate the
tractability of the proposed models and their performance analysis.
Keywords: Redundancy; software and hardware system; reliability; survey; fault-tolerance.
Nomenclature and Notations

NHPP : Non-homogeneous Poisson process
RBD : Reliability block diagram
TMR : Triple modular redundancy
NMR : N-modular redundancy
DWC : Duplication with comparison

SS : Standby sparing
PAS : Pair-and-a-spare
NMRS : N-modular redundancy with spares
SP : Self purging
SOM : Sift-out modular
TD : Triplex-duplex
RB : Recovery blocks
AT : Acceptance Test
NVP : N-version programming
NSCP : N-self-checking programming
RtB : Retry block

NCP : N-copy programming
DM : Decision mechanism
MTTF : Mean time-to-failure
N : Number of Modules
X(t) : States of the system at time t
$\lambda$ : Failure rate parameter
$p_i(t)$ : Probability that the system is in state i, i = 1, 2, ..., N
$R_i$ : Reliability of module i, i = 1, 2, ..., N
$R_s(t)$ : Reliability of the system at time t
1. Introduction
Advances in information technology over the past decades have resulted in an
exponential growth of computer-based systems. With fast-growing technology, a computer
system now consists of hundreds or thousands of interacting software and hardware
components. The evolutionary improvements in hardware and software performance include
changes in computer architecture, vast increases in memory and storage capacity, and
a wide variety of exotic input and output options. During the last few years, computer
applications have become ever more complex in both design and architecture; however,
they are often built from unreliable hardware and software components. The primary
goals of hardware/software engineering are to improve the quality of products and to
increase productivity in terms of product reliability and efficiency.
Redundancy issues in both hardware and software systems play a key role in the
successful design and functioning of sophisticated computer systems. For embedded
computer systems consisting of software and hardware components, redundancy can be
achieved by adding extra copies of hardware and software components in parallel,
which provide an alternate path for successful operation. Redundancy techniques based
on reliability theory are commonly applied to achieve high reliability,
maintainability, availability and safety of the system, and also to handle the system
workload. Highly reliable systems are required in situations in which repair is not
possible (e.g. spacecraft), and in situations where computers perform a critical
function in which even a small amount of time lost due to repairs cannot be tolerated,
as in the case of flight-control systems, nuclear missiles, etc.
System redundancy is common in many real-time systems, as it is practically
impossible to build a perfect system in which no component failure (hardware or
software) leads to system failure. Redundancy techniques using redundant hardware and
software components were conceived in the early 1950s. A vast literature exists on
hardware and software redundancy techniques and on reliability modeling of redundant
systems.11,55,82,97,98,101 Hardware redundancy techniques can be broadly categorized
into two forms, static and dynamic redundancy. Static redundancy involves fault
masking, error detecting and correcting codes, etc., and takes effect immediately.
Dynamic redundancy uses standby spare components, which may be hot, cold or warm, to
replace the faulty ones.
Any amount of redundant hardware can still fail because of faulty software; hardware
redundancy alone is therefore not sufficient to achieve a highly reliable or available
computer system. The technology developed for software redundant systems yields a
high degree of satisfaction in terms of reliability and availability. As a result,
software redundancy techniques have become an implicit requirement in most
applications. Software redundancy is achieved by incorporating additional software
components that are not exactly identical but are similar in functionality.
Hardware and software reliability are both important factors affecting system
reliability and have been discussed by Yamada and Osaki,83 Dhillon and Ugwu,19
Nassar,67 Levitin,59 Immonen and Niemela,39 Yang et al.101,102 and many others. It
is widely recognized that software reliability differs from hardware reliability,
since the causes of failure in hardware and software are different and software
reliability reflects design perfection rather than manufacturing perfection. Hardware
reliability changes during certain periods, such as at initial use or at the end of
useful life, whereas software reliability varies during the development and testing
phases. Moreover, unlike hardware, software does not degrade physically as a function
of time or environmental stress. However, it is not reasonable to expect a software
system to operate constantly on the same input data, computing environment and user
requirements. Thus, the emphasis on achieving high reliability is now shifting from
hardware to software, because software faults are the root cause of a high percentage
of operational system failures in real-time embedded computer systems.
An adequate representation of the reliability model is also important and should be
chosen carefully. The most widely used reliability modeling technique is
combinatorial modeling, such as reliability block diagrams (RBD) and fault/event
trees. These are useful modeling tools; however, they cannot depict the dynamic
behavior of a system, such as load sharing, standby redundancy, imperfect fault
coverage, complex repair policies, etc. This is mainly because the interaction between
two or more components/subsystems can affect the system performance significantly. To
overcome this problem, state space methods such as Markov models, Petri nets and
hybrid models (combinations of combinatorial and state space methods) may be adopted
to analyze more complicated systems.
In this article, we provide a comprehensive survey of redundancy issues associated
with the design and analysis of software and hardware redundant systems. The
remainder of this article is organized as follows. In Sec. 2, we discuss software
and hardware faults and their classification. In Sec. 3, some elementary techniques
related to the fault/failure phenomenon that are used to improve the performance of
redundant systems, and various failure-related issues, are addressed. Section 4 is
concerned with the methodological aspects of solving reliability problems. Section 5
highlights the basic features of system redundancy. Section 6 presents hardware
redundancy techniques. Software redundancy techniques are discussed in Sec. 7.
Finally, Sec. 8 summarizes the contributions of the present work.
2. Classification of Hardware and Software Faults

Generally, a fault is defined as a physical defect, imperfection or design flaw in a
hardware or software component. An error is a manifestation of a fault, caused by its
activation or execution. A failure is the incorrect performance of some function in
the system. Specifically, faults are the causes of errors, and errors are the causes
of failures.73 The terms error and fault are often used as synonyms, and both can
propagate to a system failure. Faults can occur during a computer's specification,
design, implementation, modification and installation, and throughout its operational
life. One way to classify faults is by two attributes: the nature and the duration of
the fault.24 The nature of a fault refers to whether it is a hardware fault or a
software fault, whereas the duration of a fault refers to the way it is activated.
With respect to fault duration, hardware faults can be classified into the following
three forms.
(i) Permanent faults: A fault is said to be permanent if it remains active until it
is repaired. Physical defects in the hardware such as short circuits and connection
disruptions, as well as design errors, are examples of permanent faults.
(ii) Transient faults: Transient faults remain active for a short period of time and
disappear quickly. They are often detected through the errors that result from their
propagation. They are usually referred to as soft faults or glitches and are mostly
induced by random environmental disturbances such as voltage fluctuations,
electromagnetic interference, radiation, etc.
(iii) Intermittent faults: Intermittent faults activate, deactivate and reactivate
repeatedly. They are often attributed to design errors that result in marginal or
unstable hardware. Intermittent faults are difficult to predict, but their effects
are highly correlated. In the presence of these faults, the system works well most of
the time but fails under certain environmental conditions, as in the case of a fault
due to a loose wire.
The nature of software faults is different from that of hardware faults. The source
of failure in software is a design fault, while hardware failure may be caused by
physical deterioration, a manufacturing defect or poor quality of material. Software
faults are all design faults, which are harder to visualize, classify, detect and
correct. They may be due to programming errors, specification errors, etc. Software
cannot physically break after being installed in a computer system; however, latent
faults in the program code may be activated during operation. This type of problem can
be seen under heavy or unusual workloads, which eventually lead to system failure.
Software faults that are activated on every execution are usually detected in the
testing phase and corrected before the software is released. A few software faults
that are rarely activated escape the testing and debugging processes and remain in
the software when it is released. These latent faults are not activated every time;
the faulty piece of code is executed only when a certain external event, the fault
trigger event, occurs.
A typical way to classify software faults (bugs) is by the type of failure they cause
in the software. Software faults can be classified into the following three
categories, which are defined in the context of software operation.
(i) Heisenbugs: Heisenbugs are bugs that may or may not cause a failure for a given
operation. If a Heisenbug is present in the system, the error may disappear on
retrying the operation. Heisenbugs are also called transient or intermittent
faults.31
(ii) Bohrbugs: Bohrbugs are bugs that always cause a failure when a particular
operation is performed. In the presence of a Bohrbug, retrying the operation that
caused the failure always fails again. Bohrbugs are also called permanent faults.31
Most industrial software systems are released after passing design reviews, quality
assurance, and alpha, beta and gamma tests. They are mostly free from Bohrbugs, but
Heisenbugs may persist. In this case, the software system is likely to function
correctly if it is restarted.
(iii) Aging-related bugs: These bugs appear when software systems run continuously
for a long time; in such cases, due to aging, the software systems tend to show
degraded performance and an increased failure occurrence rate. These bugs depend on
the internal environment of the system, such as physical memory left unreleased
because of memory leaks.32
3. Some Failure/Fault Related Issues

A system is said to have failed if it does not perform as expected. It is worthwhile
to mention some important failure/fault related issues that arise when examining the
reliability of redundant systems under active and standby configurations.
3.1.
In practice, a number of failure factors can significantly reduce the reliability of
redundant systems. Common cause failures and load sharing are the failure issues of
most concern in active redundant systems. In standby redundant systems, switching
failure is a common consideration.
3.1.1. Common cause failure

In a common cause failure, multiple redundant components in the system fail
simultaneously due to some common reason. Such failures have a strong potential to
reduce the benefit gained from redundant configurations. Common cause failures may be
produced by common electric connections, shared environmental stresses such as dust,
vibration and humidity, or common maintenance problems. Important examples are power
failure in manufacturing systems and common communication buses in computer networks.
A great deal of engineering effort is expended on identifying possible common cause
failure mechanisms and eliminating them. However, in some cases it may be impossible
to eliminate the causes entirely, and reliability modeling must therefore take them
into account. Accordingly, a considerable amount of work on this concept has been
done by many researchers. An analytical methodology and examples of common cause
failure have been presented by Mosleh.65 Jain and Ghimire42 obtained the reliability
of a k-r-out of-n: G system subject to random and common cause failure. The
reliability of a two-unit system with common cause shock failures was computed by
Jain.41 Jain et al.43 suggested loading policies for an M-r-out of-N: G system
subject to common cause failure. Kang et al.46 investigated standby safety systems
consisting of more than two redundant components.
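As a minimal illustration of why common cause failures limit the benefit of redundancy, the sketch below assumes a parallel system of n identical, independent components, each with reliability r, together with a single common cause event of probability q_cc that disables all components at once. This lumped common cause model and the names r, n and q_cc are assumptions made for illustration; they are not taken from the models cited above.

```python
def parallel_with_common_cause(r, n, q_cc):
    """Reliability of n identical parallel units plus one lumped common cause event.

    Assumption (illustrative, not from the survey): the common cause event is
    independent of the individual failures and, when it occurs, fails every
    unit simultaneously.
    """
    # Probability that at least one unit survives its own independent failures.
    independent_part = 1.0 - (1.0 - r) ** n
    # The system also needs the common cause event not to occur.
    return (1.0 - q_cc) * independent_part
```

Under these assumptions the system reliability can never exceed 1 - q_cc, however many redundant units are added, which is why the common cause term tends to dominate in highly redundant configurations.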
3.1.2. Load sharing

Another cause of reliability degradation in active redundant systems is load sharing.
In load sharing systems, the failure of one component increases the stress level on
the others and therefore increases the failure rates of the surviving components. The
stress may be an electrical load, a load caused by high temperature or an information
load. This introduces failure dependency between the load sharing components, which
increases the complexity of analyzing redundant systems. Fortunately, in a redundant
system with sufficient capacity, the increased failure rate need not lead to
unacceptable failure probabilities. When the system experiences the first failure,
and the failure is detected, the system may be required to work for only a short
period of time before repairs are completed. From this viewpoint, the degradation due
to load sharing is less serious than common cause failures. Some research
investigations on load sharing redundant systems have appeared in the literature. A
flow-graph based approach for analyzing multi-state k-out-of-n: G/F load sharing
systems has been suggested by Jenab and Dhillon.44 Singh et al.86 investigated
k-component load sharing systems and obtained the load sharing parameters under
classical and Bayesian setups. In this sequence, an optimal load allocation for load
sharing k-out of-n: F systems has been given by Yamamoto et al.100 A recent
contribution in this field is due to Deshpande et al.17
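The dependency introduced by load sharing can be made concrete with a two-unit example. The sketch below assumes two identical units with constant failure rate lam while both share the load, a higher constant rate lam_single for the survivor after the first failure, and no repair; these rates and the function name are illustrative assumptions rather than a model from the cited works. Conditioning on the time of the first failure gives $R(t) = e^{-2\lambda t} + \frac{2\lambda}{2\lambda - \lambda_s}\left(e^{-\lambda_s t} - e^{-2\lambda t}\right)$ for $\lambda_s \ne 2\lambda$, where $\lambda_s$ denotes lam_single.

```python
import math


def load_sharing_pair_reliability(t, lam, lam_single):
    """Reliability at time t of a two-unit load sharing system without repair.

    Assumptions (illustrative): each unit fails at rate `lam` while both work;
    after the first failure the survivor fails at rate `lam_single`.
    """
    both_alive = math.exp(-2.0 * lam * t)
    if math.isclose(lam_single, 2.0 * lam):
        # Limiting case of the general formula when lam_single == 2 * lam.
        return both_alive * (1.0 + 2.0 * lam * t)
    coeff = 2.0 * lam / (2.0 * lam - lam_single)
    # Second term: exactly one unit has failed and the survivor is still working.
    return both_alive + coeff * (math.exp(-lam_single * t) - both_alive)
```

Setting lam_single equal to lam removes the load effect and recovers the ordinary two-unit parallel result $2e^{-\lambda t} - e^{-2\lambda t}$; any larger lam_single gives a lower reliability.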
3.1.3. Switching failure

In many systems with standby components, a standby component automatically becomes
active when the active component fails. However, in some standby systems a switching
device is also present and is used to switch over to the standby component when an
active component fails, so that the system can resume operation. The presence of a
switching device has a significant effect on the reliability of a standby system, and
the failure and reliability properties of the switch must be included when analyzing
redundant systems. Standby systems are inherently superior to active systems, but
most of this superiority depends on the reliability of the standby switch. Many
models have been developed for analyzing standby redundant systems with switching
failures, including notable contributions due to Alidrisi,3 Chung,14 Ke et al.,49
Pan,69 Wang and Chen94 and others.
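To illustrate how strongly the benefit of standby redundancy depends on the switch, the sketch below compares a two-unit cold standby system with a two-unit active parallel system. It assumes identical units with constant failure rate lam, a standby unit that cannot fail while idle, no repair, and a switch that works on demand with probability p_switch; all of these names and assumptions are illustrative and are not drawn from the cited models. Under them the standby system has reliability $R(t) = e^{-\lambda t}(1 + p\,\lambda t)$, where $p$ denotes p_switch.

```python
import math


def cold_standby_reliability(t, lam, p_switch):
    """Two-unit cold standby system with an imperfect switch (no repair).

    Either the primary survives to t, or it fails at some earlier time, the
    switch succeeds (probability p_switch) and the standby survives the rest.
    """
    return math.exp(-lam * t) * (1.0 + p_switch * lam * t)


def active_parallel_reliability(t, lam):
    """Two identical, independent units operating in active parallel."""
    single = math.exp(-lam * t)
    return 1.0 - (1.0 - single) ** 2
```

For example, at lam * t = 1 a perfect switch makes the standby arrangement more reliable than the active parallel one, whereas with p_switch = 0.5 the ordering is reversed.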
3.2.
Achieving a highly reliable system from the customer's perspective is very demanding
for all software developers and system designers. To tackle fault related issues, the
following techniques are commonly used; see Refs. 58, 64, 50.
3.2.1. Fault prevention/avoidance

Fault prevention techniques are used to prevent the occurrence or introduction of
faults in a dependable system working in a computing environment. Fault prevention is
achieved by quality control techniques employed during the design and development of
hardware and software. Rigorous design rules, component screening and testing
techniques are employed in hardware, while structured programming, modularization and
formal verification techniques are employed in software. Another common approach to
fault prevention is shielding against operational physical faults caused by
radiation, humidity, heat, etc. User and operation faults are prevented by training
and rigorous maintenance procedures. Malicious faults are prevented by firewalls and
similar security measures.
3.2.2. Fault tolerance

Fault tolerance is one of the important approaches to achieving highly reliable
computing systems. Fault tolerance techniques deliver continuous service in the
presence of hardware and software faults by providing redundant hardware and software
components. A typical fault tolerant system generally includes fault detection,
location, containment and subsequent recovery.
(i) Fault detection generates an error signal or message when a fault occurs within
a system. Detection is performed by techniques such as an acceptance test or a
comparator; in general, detection alone cannot predict which component or module has
failed.
(ii) Fault location is a mechanism to determine where a fault has occurred.
(iii) Fault containment is the process of isolating a fault so that its effects do
not propagate throughout the system, for example by exception handling routines that
treat unsuccessful operations. Isolation can be achieved by multiple request
protocols, by consistency checks between modules and by frequent fault detection.
(iv) Fault recovery is a process that transforms a system state containing one or
more isolated faults into an operational state without faults that may be activated
again. This mechanism recovers system operation from erroneous conditions, for
example by checkpointing and rollback.
(v) Fault masking is another fault tolerance technique, which hides the occurrence of
faults and prevents them from resulting in errors. It provides continuous system
operation and prevents faults in the system from introducing errors into the
informational structure of that system.
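Fault masking, detection and location can be illustrated together by a majority voter over redundant module outputs, as used in the triple modular redundancy listed in the nomenclature. The following is a minimal sketch, assuming a non-empty list of hashable module outputs that can be compared for exact equality; the function name and return convention are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Vote over redundant module outputs.

    Returns (value, suspects): `value` is the output agreed on by a strict
    majority of modules (None if there is no such majority), and `suspects`
    lists the indices of modules that disagree with it.
    Assumes `outputs` is non-empty and its items are hashable.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        # No strict majority: the fault is detected but cannot be masked.
        return None, list(range(len(outputs)))
    # Masking: the majority value is delivered; location: dissenters are flagged.
    suspects = [i for i, out in enumerate(outputs) if out != value]
    return value, suspects
```

With three modules, a single faulty output is masked and the faulty module is located, while three mutually different outputs are only detected as an unrecoverable disagreement.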
3.2.3. Fault removal

The fault removal technique is mainly used to reduce the number of faults present in
the system. Fault removal is performed during both the development and the
operational phases of the system. Fault removal during the development phase is
completed in three steps:
(i) Verification
(ii) Diagnosis
(iii) Correction
Verification is the process of checking whether the system satisfies the
pre-specified conditions. If it does not, the next step is to diagnose the faults
that prevented the verification conditions from being fulfilled and then to perform
the necessary corrections. During the operational phase, fault removal is performed
in the following two forms:
(i) Corrective maintenance
(ii) Preventive maintenance
Corrective maintenance is performed to remove faults that have produced one or more
errors and have been reported. In preventive maintenance, adjustments are made, or
parts that are likely to become faulty during normal operation are replaced, before
a system failure occurs. In addition, preventive maintenance helps to avoid a high
cost of replacement and damage to the surroundings of the system components. It
should be mentioned that the corrective and preventive forms of fault removal are
applied to fault tolerant as well as non-fault tolerant systems, which can be
maintained either without interrupting service delivery or during a service outage.
3.2.4. Fault/failure forecasting

The system behavior is evaluated with respect to fault occurrence or activation in
order to forecast the faults in the system. The evaluation can be either qualitative
or quantitative. The aim of qualitative evaluation is to identify, classify and rank
the failure modes, or the event combinations in terms of component failures, that may
result in system failures. Failure mode and effect analysis is an example of
qualitative evaluation. Quantitative evaluation is performed in terms of the
probabilities of the extent to which dependability attributes such as availability,
reliability, safety and maintainability are satisfied. Reliability techniques such as
Markov chains and stochastic Petri nets are used for quantitative evaluation. Some
methods, such as reliability block diagrams and fault trees, can be used to perform
both qualitative and quantitative evaluation.
4. Reliability Modeling
The analysis of redundant systems in the literature focuses mainly on determining
reliability, which is an important quality measure of a system. A computer-based
system is viewed as one of many system components. System analysts often consider the
estimation of hardware and software reliability essential in order to estimate the
full system reliability. Hardware reliability encompasses a wide spectrum of analyses
that strive systematically to reduce or eliminate system failures which adversely
affect performance. Software reliability likewise strives systematically to reduce or
eliminate failures which adversely affect the performance of a software program.
Hardware reliability can be improved by better design, better material, applying
redundancy and accelerated life testing, while software reliability can be improved
by increasing the testing effort and by correcting detected faults.
Reliability is the probability of failure-free operation of a computer program in a
specified environment for a specified period of time.
Mathematically, the system reliability $R_s(t)$ is defined as the conditional
probability that the system operates correctly throughout the interval $[t_0, t]$
given that it was operating correctly at $t_0$, i.e. $R_s(t) = P(z > t)$, $t \ge t_0$,
where $z$ is a random variable denoting the time-to-failure. A measure of failure,
$F(t)$, is defined as the conditional probability that the system fails by time $t$,
referred to as the unreliability or failure time distribution, $F(t) = P(z \le t)$,
$t \ge t_0$. If $f(t)$ is the probability density function of the time-to-failure
random variable $z$, then $R_s(t) = \int_t^{\infty} f(u)\,du$. If the lifetime of the
system is exponentially distributed with failure rate $\lambda$, then
$R_s(t) = e^{-\lambda t}$.
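The definitions above can be checked numerically for the exponential case. The sketch below is illustrative only: the function names are hypothetical, and it uses the standard relation MTTF $= \int_0^{\infty} R_s(t)\,dt$, which for an exponential lifetime equals $1/\lambda$, evaluated here by a simple trapezoidal rule over a truncated horizon.

```python
import math


def reliability(t, lam):
    """R_s(t) = exp(-lam * t) for an exponentially distributed lifetime."""
    return math.exp(-lam * t)


def unreliability(t, lam):
    """F(t) = P(z <= t) = 1 - R_s(t)."""
    return 1.0 - reliability(t, lam)


def mttf_numeric(lam, horizon_factor=50.0, steps=100_000):
    """Approximate MTTF as the integral of R_s(t) from 0 to horizon_factor / lam.

    Trapezoidal rule; the truncated tail beyond the horizon is negligible for
    the default horizon_factor.
    """
    upper = horizon_factor / lam
    h = upper / steps
    total = 0.5 * (reliability(0.0, lam) + reliability(upper, lam))
    for k in range(1, steps):
        total += reliability(k * h, lam)
    return total * h
```

For a failure rate of 0.002 per hour, for instance, mttf_numeric returns a value very close to the analytical 1/0.002 = 500 hours.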
The overall system reliability can be evaluated by developing reliability models.
The complexity of reliability models depends on various factors such as mission
profile, function criticality and redundancy characteristics. The main techniques
used for reliability modeling are:
(i) Combinatorial modeling

(ii) Markov modeling
(iii) Non-Markovian models
4.1. Combinatorial modeling

Combinatorial modeling is an analytical approach in which governing equations
describe the system behavior; it has been discussed by many researchers. Recent
developments in this field include Carrasco and Suñé,10 Choi and Seong,12 Distefano
and Puliafito,21 and others. In this technique, we enumerate all possible ways in
which a system can continue to operate, given the probability of failure of its
individual components. In other words, the failures of the individual components,
which are assumed to be mutually independent, are enumerated to estimate the system's
reliability. Many configurations are used to model the interconnection among the
system's components, such as series, parallel, series-parallel, parallel-series,
M-out of-N, etc. Combinatorial modeling of system reliability includes mainly two
qualitative approaches: (i) reliability block diagrams and (ii) fault trees. Here we
discuss the reliability block diagram, which is the oldest and most common
reliability model.
Reliability block diagram (RBD): Reliability block diagrams are widely used in
engineering and other industrial settings to describe the behavior of a system's
components. The components are represented as blocks that show the operational
dependency between them from the reliability viewpoint. Systems can be broadly
categorized into two configurations, series and parallel. More complex redundant
systems may be built from mixed configurations such as series-parallel,
parallel-series, M-out of-N systems, etc.
(i) Series configuration: A system is said to be in series configuration when all the
components (blocks) are necessary for the system to be operational, i.e. the failure
of a single component leads to system failure. The graphical representation of a
series system is shown in Fig. 2(a).
Fig. 1. Failure sequence of hardware and software faults. (Diagram labels: design
specification, design, process, human, wearout, data corruption, overstress,
electrical interference; hardware fault, software fault, error, error recovery, no
failure, system failure, undetected failure.)
(ii) Parallel configuration: A parallel configuration is shown in Fig. 2(b). In this
configuration, the system fails only when all of its components fail. The reliability
of a parallel system is higher than the reliability of any single component.
(iii) Series-parallel/parallel-series configuration: Some systems are made up of
combinations of several series and parallel configurations, as shown in Figs. 2(c)
and 2(d). To obtain the system reliability in such cases, one way is to break the
total system configuration into subsystems, treat each of these subsystems as a
single component and calculate its reliability. Finally, these subsystem
reliabilities are combined as components of a single system to obtain the overall
reliability.
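The decomposition just described can be sketched in a few lines. The following is a minimal illustration, assuming mutually independent components whose reliabilities at the time of interest are given as plain numbers; the function names are hypothetical. Here a series-parallel system means parallel subsystems connected in series, and a parallel-series system means series subsystems connected in parallel.

```python
from math import prod


def series(reliabilities):
    """All components must work: product of the component reliabilities."""
    return prod(reliabilities)


def parallel(reliabilities):
    """At least one component must work: one minus the product of unreliabilities."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def series_parallel(subsystems):
    """Parallel subsystems connected in series; `subsystems` is a list of lists."""
    return series(parallel(sub) for sub in subsystems)


def parallel_series(subsystems):
    """Series subsystems connected in parallel; `subsystems` is a list of lists."""
    return parallel(series(sub) for sub in subsystems)
```

For two subsystems with component reliabilities [0.9, 0.9] and [0.8, 0.8], the series-parallel arrangement gives 0.99 x 0.96 and the parallel-series arrangement gives 1 - (1 - 0.81)(1 - 0.64).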
(iv) M-out of-N system configuration: The term M-out of-N system is often used to
indicate either a G system or an F system or both. Both parallel (1-out of-N: G or
N-out of-N: F) and series (1-out of-N: F or N-out of-N: G) systems are special cases
of the M-out of-N system. The M-out of-N structure is a very popular type of
redundancy in fault tolerant systems, including industrial and military systems.
Fig. 2. (a) Series configuration with N components; (b) Parallel configuration with
N components; (c) Series-parallel configuration; (d) Parallel-series configuration.
An N-component system that works (is good) if and only if at least M of its N
components work (are good) is called an M-out of-N: G system. An M-out of-N: F system
fails if and only if at least M of its N components fail. A triple modular redundant
system uses a 2-out of-3: G voting configuration. Several variants of the M-out of-N
system are described below.
Consecutive M-out of-N system: A consecutive M-out of-N system consists of N linearly
or cyclically ordered components such that the system fails if and only if at least
M consecutive components fail.
Weighted M-out of-N system: A weighted M-out of-N system is an N-component system in
which each component carries its own positive integer weight, such that the system is
good if and only if the total weight of the working components is at least M, a
pre-specified value. Mathematically, in a weighted M-out of-N system, component i
carries a weight $w_i$, $w_i > 0$ for i = 1, 2, ..., N, such that
$w = \sum_{i=1}^{N} w_i$, where $w$ is the total weight of all the components. Thus,
the M-out of-N: G system can be seen as a special case of the weighted M-out of-N: G
system in which each component has a weight of 1.
M-K-out of-N system: An M-K-out of-N system fails if fewer than M or more than K
components function simultaneously, i.e. for successful operation of the system, at
least M and at most K components must function properly.
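For small N, the three variants above can be evaluated by brute-force enumeration of all component state vectors. The sketch below is illustrative only: it assumes mutually independent components, treats the consecutive system as linear (not cyclic) and of type F, and all function names are hypothetical. Its cost grows as 2^N, so it is a check for small examples rather than a practical algorithm.

```python
from itertools import product


def system_reliability(reliabilities, works):
    """Sum the probabilities of all state vectors for which works(states) is True."""
    total = 0.0
    for states in product((True, False), repeat=len(reliabilities)):
        if works(states):
            prob = 1.0
            for up, r in zip(states, reliabilities):
                prob *= r if up else 1.0 - r
            total += prob
    return total


def consecutive_f(m):
    """Linear consecutive M-out of-N: F system: fails iff m adjacent components are down."""
    def works(states):
        run = 0
        for up in states:
            run = 0 if up else run + 1
            if run >= m:
                return False
        return True
    return works


def weighted_g(m, weights):
    """Weighted M-out of-N: G system: works iff the working weight is at least m."""
    def works(states):
        return sum(w for up, w in zip(states, weights) if up) >= m
    return works


def m_k_g(m, k):
    """M-K-out of-N system: works iff between m and k components (inclusive) work."""
    def works(states):
        return m <= sum(states) <= k
    return works
```

With unit weights, weighted_g(2, [1, 1, 1]) reduces to the ordinary 2-out of-3: G system, which matches the special case noted above.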
The system reliabilities for dierent congurations are summarized in Table 1 under
consideration of non-identical components except M-out of-N. The simplest case of
Table 1. Reliability models.
Serial no. System System reliability

conguration
Q
1 Series Rseries (t) = N i=1 Ri , where Ri is the reliability of ith
component
Q
2 Parallel Rparallel (t) = 1 N i=1 (1 Ri ), where Ri is the
reliability of ith component
QN
3 Series-Parallel Rsp (t) = i=1 Rparallel , Rparallel is the reliability of ith
subsystem
Q
4 Parallel-Series Rps (t) = 1 N i=1 (1 Rseries ), Rseries is the reliability
of ith subsystem
P N
5 M-out of-N: G RM -outof -N (t) = NMi=0 (1 R)i RNi , where R is
i
a reliability of a component
P N
6 M-K-out of-N: G RM -K -outof -N (t) = KMi=0 (1 R)NK+i RKi ,
i
where R is a reliability of a component
S0218539311004093
components in M-out of-N conguration while analyzing system reliability is consid-

ered when the components are mutually independent and identical. The reliability
of M-out of-N system can be evaluated by using the binomial distribution.
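The combinatorial models of Table 1 translate directly into code. The sketch below is illustrative only: it assumes independent components, the function names are hypothetical, and the M-K-out of-N sum counts the configurations with exactly K - i working components for i = 0, ..., K - M.

```python
from math import comb, prod


def series(rs):
    """Series system: every component must work."""
    return prod(rs)


def parallel(rs):
    """Parallel system: at least one component must work."""
    return 1.0 - prod(1.0 - r for r in rs)


def m_out_of_n_good(m, n, r):
    """M-out of-N: G with identical components; i counts failed components."""
    return sum(comb(n, i) * (1 - r) ** i * r ** (n - i) for i in range(n - m + 1))


def m_k_out_of_n_good(m, k, n, r):
    """M-K-out of-N: G; exactly k - i components work, for i = 0..k-m."""
    return sum(
        comb(n, k - i) * (1 - r) ** (n - k + i) * r ** (k - i)
        for i in range(k - m + 1)
    )


# A TMR arrangement is the 2-out of-3: G case.
print(m_out_of_n_good(2, 3, 0.9))
```

Two useful checks follow from the definitions: a 1-out of-N: G system coincides with the parallel system, and an N-out of-N: G system coincides with the series system.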
Reliability block diagrams are gaining popularity because they are easy to understand and can be used for modeling real-time redundant systems. However, RBDs, like other combinatorial reliability models, have a number of serious limitations, as given below.
• RBDs assume that the system components are limited to operational and failed states and that the system configuration does not change during the mission.
• The failures of the individual components are assumed to be independent. Thus system reliability cannot be adequately represented when it is affected by the sequence of component failures.
4.2. Markov modeling

Markov modeling is one of the most powerful tools available to system engineers and designers for analyzing complex redundant systems. Markov models are preferred when the system is complex and the reliability expressions cannot easily be modeled combinatorially. They give results for both the time-dependent evolution and the steady state of the system. In Markov modeling, the system is represented by a number of states and state transitions. The next transition depends only on the present state of the system, not on its past states. Transitions may be triggered by a variety of possible events (e.g. failure, repair, etc.) and are characterized by a probability distribution under reasonable conditions. In a large number of Markov models in reliability analysis, the transition times follow exponential distributions with constant failure or repair rates. Markov models are also useful for describing electronic systems or systems with repairable components which either function or fail. Computer-based systems use various components, namely CPUs, RAM, network cards, hard disk controllers and hard disks; such systems can be described by a Markov model.
A simple Markov model for a one-component repairable system is depicted by the transition flow diagram shown in Fig. 3. Reibman79 presented an overview of numerical approaches for transient analysis of Markov as well as Markov reward models in fault tolerant systems and derived many instantaneous and cumulative reliability measures. Reliability modeling for multiple repairable systems based on a Markov process has been done by Islamov.40 Sharma and Kumar85 proposed a Markovian approach to model the behavior of safety engineering systems.
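For the one-unit repairable system of Fig. 3, the two-state Markov model has a well-known closed-form solution. The sketch below assumes a constant failure rate and a constant repair rate, with the unit operational at t = 0; the symbol names `lam` and `mu` and the function names are illustrative assumptions.

```python
import math


def availability(t, lam, mu):
    """Probability that the unit is operational at time t (operational at t = 0).

    Solves dA/dt = -lam * A + mu * (1 - A) with A(0) = 1.
    """
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)


def steady_state_availability(lam, mu):
    """Limit of availability(t, lam, mu) as t grows large."""
    return mu / (lam + mu)


# Example with assumed rates: one failure per 2 time units, repair rate 2.
print(availability(1.0, 0.5, 2.0), steady_state_availability(0.5, 2.0))
```

The transient term decays at rate `lam + mu`, so the faster the repair, the sooner the availability settles at its steady-state value.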
A more complicated example in Markov modeling would be a combined hardware and software system. In this situation, the repair time is the time required to bring the software back into service, not the time required to detect and remove the bug. The state transition diagram of a combined hardware and software system is shown in Fig. 4. A Markov model for availability analysis of distributed software/hardware systems has been developed by Lai et al.56 Dominguez-Garcia et al.23 suggested an integrated methodology for the reliability evaluation and dynamic performance analysis of fault-tolerant systems.
Fig. 3. Transition diagram of a one unit repairable system. [States: Operational and Failed; failure rate ($\lambda$) from Operational to Failed, repair rate ($\mu$) from Failed to Operational.]
Fig. 4. Markov model for combined hardware and software repairable system. [States: Hardware failed, Operational, Software failed; $\lambda_h$: hardware failure rate; $\lambda_s$: software failure rate; $\mu_h$: hardware repair rate; $\mu_s$: software repair rate.]
Reliability modeling of TMR system: Here we illustrate the concept of the Markov model as well as the RBD for a TMR redundant system consisting of three modules, two of which are required for the system to function properly. It is assumed that the component failures are mutually independent and the voter is perfect.
(a) RBD model: Let us denote the module reliabilities by $R_1$, $R_2$ and $R_3$. The reliability of a TMR system is given by
$$R_{\mathrm{TMR}} = R_1 R_2 R_3 + (1 - R_1) R_2 R_3 + (1 - R_2) R_1 R_3 + (1 - R_3) R_1 R_2 \tag{1}$$
where
$R_1 R_2 R_3$ = Prob{module 1 functions correctly and module 2 functions correctly and module 3 functions correctly},
$(1 - R_i) R_j R_k$ = Prob{module i has failed and module j functions correctly and module k functions correctly} for distinct i, j, k = 1, 2, 3.
If all the components are identical, then $R_1 = R_2 = R_3 = R$ (say). Then Eq. (1) yields
$$R_{\mathrm{TMR}} = 3R^2 - 2R^3 \tag{2}$$
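Equations (1) and (2) can be checked against a brute-force enumeration of the eight module states. This is a small illustrative sketch under the same assumptions as the text (independent module failures, perfect voter); the function names are hypothetical.

```python
from itertools import product


def tmr_reliability(r1, r2, r3):
    """Eq. (1): all three modules work, or exactly one has failed."""
    return (r1 * r2 * r3
            + (1 - r1) * r2 * r3
            + (1 - r2) * r1 * r3
            + (1 - r3) * r1 * r2)


def tmr_identical(r):
    """Eq. (2): identical modules with reliability r."""
    return 3 * r ** 2 - 2 * r ** 3


def tmr_by_enumeration(rs):
    """Sum the probabilities of all states with at least two working modules."""
    total = 0.0
    for state in product([True, False], repeat=3):
        if sum(state) >= 2:
            p = 1.0
            for works, r in zip(state, rs):
                p *= r if works else (1 - r)
            total += p
    return total


print(tmr_reliability(0.9, 0.8, 0.7), tmr_by_enumeration([0.9, 0.8, 0.7]))
```

Note that Eq. (2) exceeds the single-module reliability R only when R > 0.5, which is why TMR pays off only for sufficiently reliable modules.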
(b) A Markov model: The TMR system can be modeled by assuming an exponentially distributed lifetime for each module. Taking $\lambda$ as the failure rate of a module, the system can be represented by three states as follows:
State 1 - State in which only one module or no module is working (failure state)
State 2 - State in which two modules are operational (operational state)
State 3 - State in which three modules are operational (operational state)
The aim of Markov modeling of the TMR system is to calculate $p_i(t)$, the probability that the system is in state i (i = 1, 2, 3) at time t. From the state transition diagram shown in Fig. 5, we can construct the state transition equations as follows.
$$\frac{d}{dt} p_1(t) = 2\lambda p_2(t) \tag{3}$$
$$\frac{d}{dt} p_2(t) = 3\lambda p_3(t) - 2\lambda p_2(t) \tag{4}$$
$$\frac{d}{dt} p_3(t) = -3\lambda p_3(t) \tag{5}$$
On solving the above system of Eqs. (3)-(5) with the initial condition $p_3(0) = 1$, we get
$$p_1(t) = 1 - 3e^{-2\lambda t} + 2e^{-3\lambda t}, \quad p_2(t) = 3e^{-2\lambda t} - 3e^{-3\lambda t}, \quad p_3(t) = e^{-3\lambda t}. \tag{6}$$
The reliability of the TMR system is obtained as
$$R_{\mathrm{TMR}}(t) = p_2(t) + p_3(t) = 1 - p_1(t) = 3e^{-2\lambda t} - 2e^{-3\lambda t} \tag{7}$$
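The state equations of the TMR Markov model can also be integrated numerically and compared with the closed-form state probabilities. The sketch below uses a classical fourth-order Runge-Kutta step, which is an illustrative choice rather than a method prescribed by the text; it assumes the system starts with all three modules operational, and the function names are hypothetical.

```python
import math


def derivatives(p, lam):
    """Right-hand sides of the state equations for (p1, p2, p3)."""
    p1, p2, p3 = p
    return (2 * lam * p2, 3 * lam * p3 - 2 * lam * p2, -3 * lam * p3)


def integrate_tmr(lam, t_end, steps=1000):
    """Integrate the state equations from t = 0 with p3(0) = 1 using RK4."""
    h = t_end / steps
    p = (0.0, 0.0, 1.0)
    for _ in range(steps):
        k1 = derivatives(p, lam)
        k2 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k1)), lam)
        k3 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k2)), lam)
        k4 = derivatives(tuple(x + h * k for x, k in zip(p, k3)), lam)
        p = tuple(
            x + h / 6 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)
        )
    return p


def closed_form(lam, t):
    """Closed-form state probabilities (p1, p2, p3)."""
    e2 = math.exp(-2 * lam * t)
    e3 = math.exp(-3 * lam * t)
    return (1 - 3 * e2 + 2 * e3, 3 * e2 - 3 * e3, e3)


numeric = integrate_tmr(0.5, 2.0)
print("R_TMR(2) numeric:", numeric[1] + numeric[2])
print("R_TMR(2) closed form:", sum(closed_form(0.5, 2.0)[1:]))
```

The same integration loop applies unchanged to larger models for which no closed form is available; only `derivatives` needs to be replaced.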
We now compare the TMR (redundant) and simplex (non-redundant) systems in terms of reliability as a function of time (see Fig. 6(a)) and MTTF as a function of failure rate (see Fig. 6(b)).
Fig. 5. State transition diagram of a TMR system. [State 3 to state 2 at rate $3\lambda$; state 2 to state 1 at rate $2\lambda$.]
Fig. 6. Comparison of TMR and simplex systems for (a) reliability (b) MTTF. [Panel (a): reliability versus time (t); panel (b): MTTF versus failure rate ($\lambda$); curves: simplex (s) and TMR.]
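The reliability of a single component system (i.e. simplex system) with an exponentially distributed lifetime and failure rate $\lambda$ is given by
$$R_s(t) = e^{-\lambda t}. \tag{8}$$
The MTTF for the system is obtained as
$$\mathrm{MTTF}_s = \int_0^{\infty} R_s(t)\,dt = \frac{1}{\lambda} \tag{9}$$
and the MTTF for the TMR system is given by
$$\mathrm{MTTF}_{\mathrm{TMR}} = \int_0^{\infty} R_{\mathrm{TMR}}(t)\,dt = \frac{5}{6\lambda}. \tag{10}$$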
It is seen from Fig. 6(a) that the reliability of the TMR system is higher than that of the simplex system in the time period between 0 and approximately 1.4 when the failure rate is set to $\lambda = 0.5$ as a default parameter. Beyond this period, the reliability of the simplex system is higher. Moreover, Fig. 6(b) shows that the mean time to failure of both the simplex and TMR systems decreases as $\lambda$ increases. It is also observed that the MTTF is higher for the simplex system than for the TMR system. It is therefore concluded that the TMR system is suitable for short missions, since the reliability of the TMR system ultimately falls below that of the simplex system for long missions (t > z). Spare provisioning is a better option for long missions, where failure is likely to occur only when all the spares are exhausted.
Consequently, for $z \approx 1.4$, we can say that
$$R_{\mathrm{TMR}}(t) \ge R_{\mathrm{simplex}}(t), \quad 0 \le t \le z,$$
$$R_{\mathrm{TMR}}(t) \le R_{\mathrm{simplex}}(t), \quad z \le t < \infty, \quad \text{and} \quad \mathrm{MTTF}_{\mathrm{TMR}} < \mathrm{MTTF}_{\mathrm{simplex}}.$$
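The crossover time z can be obtained in closed form. Setting $x = e^{-\lambda t}$ in the condition that the TMR and simplex reliabilities are equal gives $3x^2 - 2x^3 = x$, i.e. $2x^2 - 3x + 1 = 0$, whose root inside (0, 1) is $x = 1/2$, so $z = \ln 2 / \lambda$. The sketch below evaluates this for the failure rate 0.5 used in the comparison; the function names are illustrative.

```python
import math


def r_simplex(t, lam):
    """Reliability of a single exponential unit."""
    return math.exp(-lam * t)


def r_tmr(t, lam):
    """Reliability of a TMR system with identical exponential modules."""
    return 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)


def crossover_time(lam):
    """Time z at which the TMR and simplex reliabilities are equal."""
    return math.log(2) / lam


def mttf_simplex(lam):
    return 1 / lam


def mttf_tmr(lam):
    return 5 / (6 * lam)


z = crossover_time(0.5)
print("z =", z)
print("MTTF simplex, TMR:", mttf_simplex(0.5), mttf_tmr(0.5))
```

At the crossover both reliabilities equal 0.5, so TMR is advantageous only while the individual module reliability is still above one half.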
4.3. Non-Markovian models

In stochastic Markov modeling, the most idealized assumption is that the repair/service times of the components are exponentially distributed, i.e. they possess the memoryless property. When this assumption is removed, the resulting stochastic process is said to be non-Markovian.30 In non-Markovian modeling, the transitions are performed according to Markovian laws, and holding times such as life time, repair time, service time, etc. are described by random variables which follow general probability distributions.20
Let $\{X(t), t \in T\}$ be the stochastic process denoting the states of the system at time t. If the stochastic process $\{X(t); t \ge 0\}$ is non-Markovian with non-exponential repair/service times, then the future time-dependent evolution of the system depends on the elapsed or remaining repair time and service time. In reliability engineering/modeling, the most popular analytical technique, known as the supplementary variable technique and originated by Cox in 1955, is widely used for such non-Markovian models; it converts a non-Markovian process into a Markovian one. Recent developments in reliability modeling using the supplementary variable technique include Garg et al.,29 Oliveira et al.,68 Wang and Chen,94 Zhang,106 Zhang and Wang107 and many others. There are other analytical methods, namely the embedded Markov chain and dummy states methods, which are also used for analysis and are based on reduction to the Markovian case. Reliability models based on the embedded Markov chain technique have been developed by Agarwal,2 El-Karaksy et al.,26 Huang and Chang,37 Schoenig et al.84 and many others.
5. System Redundancy
Redundancy is a fundamental prerequisite for a system that must either recover from or hide failures. For a redundant system to continue correct operation in the presence of faults, the redundancy must be properly managed. The redundancy issues in any system are deeply interrelated and ultimately determine system reliability. We now describe the different types of redundancy that can be built into a system. In the literature, there are various methods, techniques and terminologies to categorize system redundancy. Here we outline the most common means of providing redundancy.
5.1. Active system redundancy

A system with active redundancy has all components operating simultaneously in parallel. All the components are in use at the same time, even though only one, or fewer than the number of operating components, is required for successful functioning of the system. In active redundancy, the failure of a component has no effect on the failure rate of the surviving components. We now cite some research works that have incorporated the concept of active redundancy in reliability models. Carr and Savage9 proposed a unified methodology for systems with active redundancy to determine the reliability index. Valdes and Zequeira91,92 presented the optimal allocation of components in a two-component series system when they are used as active redundant components. In the same direction, active redundancy allocation for a k-out of-n: F system has been examined by Bueno and Carmo.8 Most recently, Valdes et al.93 discussed some stochastic comparisons in series systems with active redundancy.
5.2. Standby system redundancy

Systems with standby redundancy consist of two or more components. Only one component, called the primary component, operates at a time to accomplish the system function, and the other components may be in hot, cold or warm standby mode. In standby redundancy, the failure of one component strongly affects the failure characteristics of the others, in terms of increased failure rates, because they are now under load. Standby redundant systems can be divided into three categories as follows.
• Hot standby: The hot standby components have the same failure rate as the primary component. The failure rate of one component is not affected by the other components, whether they are operating or not. Hence, the components are statistically independent.
• Warm standby: The warm standby components have a lower failure rate than the primary component, since warm standby components are subject to a lower load until the primary component fails.
• Cold standby: The cold standby components have zero failure rates. They never fail while they are in standby mode and thus preserve the component reliability. When the primary component fails, a cold standby component takes over the primary component's responsibilities, and its failure characteristic is then the same as that of the primary component.
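The three standby modes can be compared with a small two-unit model. The sketch below is an illustration under stated assumptions that are not part of the text: one primary and one standby unit, exponential lifetimes, perfect and instantaneous switching, no repair, an active failure rate `lam`, and a standby failure rate `lam_standby` (zero for cold, equal to `lam` for hot, in between for warm). The function names are hypothetical.

```python
import math


def standby_mttf(lam, lam_standby):
    """MTTF of a primary unit with one standby unit.

    Time until the first failure (rate lam + lam_standby) plus the remaining
    life of the surviving unit, which then runs at the active rate lam.
    """
    return 1 / (lam + lam_standby) + 1 / lam


def standby_reliability(t, lam, lam_standby):
    """Reliability at time t of the same two-unit standby system."""
    if lam_standby == 0:
        # Cold standby: limit of the warm-standby formula as lam_standby -> 0.
        return math.exp(-lam * t) * (1 + lam * t)
    return math.exp(-lam * t) + (lam / lam_standby) * (
        math.exp(-lam * t) - math.exp(-(lam + lam_standby) * t)
    )


for name, rate in [("hot", 0.5), ("warm", 0.1), ("cold", 0.0)]:
    print(name, standby_mttf(0.5, rate), standby_reliability(2.0, 0.5, rate))
```

Under these assumptions the cold standby gives the longest MTTF and the hot standby the shortest, which reflects the trade-off against switchover delay discussed for standby redundancy.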
In general, standby redundancy has received more attention in the past. Kapur and Kapoor,47 and Yearout et al.104 presented surveys on standby redundancy in reliability. Apart from theoretical interest, standby redundancy has also been used extensively in reliability models, to which a notable contribution was made by Azaron et al.6 They applied a shortest path approach in stochastic networks, called E-networks, to evaluate the reliability function of time-dependent systems with standby redundancy. Redundant systems with warm standby components have been studied by Papageorgiou and Kokolakis,70 Wang et al.,95 Zhang et al.105 and many others.
In standby redundancy, there is an inevitable period of disruption between the occurrence of a failure and the redundant component being brought into operation. During this period, the system or application may stop functioning or the system response may be delayed. This is the main disadvantage of standby redundancy. Such an approach is rarely satisfactory for critical systems in modern commercial and industrial situations. Compared to standby redundancy, active redundancy tends to have a shorter switchover time when a failure occurs. Thus, active redundancy is suitable for computer installations. The use of mirrored disks in a server computer is an example of active redundancy.
6. Hardware Redundancy
Hardware redundancy can be achieved by providing two or more physical copies of a
hardware component. A typical computer system can include redundant processors,
disk drives, memories, power supplies or buses which can be switched automatically to replace the failed components. The three basic forms of hardware redundancy are as follows66,90:
• Passive (Static) redundancy
• Active (Dynamic) redundancy
• Hybrid redundancy
6.1. Passive redundancy

Passive redundancy techniques employ fault masking to hide the occurrence of faults within a set of redundant hardware modules. As soon as a fault is detected, the effect of the faulty module is immediately masked by permanently connected and continually operational redundant modules. In this technique, a number of identical modules execute the same functions and their outputs are voted on to remove the errors created by a faulty module. Most of the passive approaches are developed around the concept of N-modular redundancy (NMR), wherein N copies of a module operate simultaneously.
Triple modular redundancy (TMR), in which three identical modules are arranged in parallel, is the basic and most common arrangement of passive redundancy, as shown in Fig. 7(a). In TMR, all the modules perform the same task at the same time, and their results or outputs are sent to a majority voter which determines the correct result. If one of the modules gives a wrong result, the majority voter masks the fault that caused the incorrect result by taking the correct result of the remaining two fault-free modules. This redundancy ensures that a single faulty module out of three does not corrupt the performance of the system. Thus a TMR system can mask or tolerate the fault of only one module; moreover, even with fault-free modules, the system fails if the voter fails and produces an erroneous result. The voter is therefore called a single point of failure. This is the primary weakness of TMR. To overcome this difficulty with TMR, three voters which provide three independent outputs can be used (see Fig. 7(b)).
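The masking behavior of a majority voter can be illustrated in software. The following is only a sketch: real TMR voters are hardware circuits, and the function `majority_vote` and the example modules are hypothetical stand-ins for three replicated modules computing the same function.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules.

    Raises RuntimeError when no value has a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count > len(outputs):
        return value
    raise RuntimeError("no majority among module outputs")


def good_module(x):
    return x * x


def faulty_module(x):
    return x * x + 1   # hypothetical fault: corrupts the result


modules = [good_module, faulty_module, good_module]
# The single faulty module is outvoted by the two fault-free modules.
print(majority_vote([m(4) for m in modules]))
```

The voter itself remains a single point of failure in this arrangement: if `majority_vote` were wrong, correct module outputs would not help, which is the weakness that triplicated voters address.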
To achieve a higher level of fault tolerance, the N-modular redundancy (NMR) technique, which is a generalization of TMR, can be used. In an NMR system, N = 2n + 1 redundant modules can mask or tolerate n module faults, assuming perfect voting. The NMR technique is simple but expensive, and it provides uninterrupted service in the presence of faults, since a fault in a redundant module does not delay the results unless the number of faulty modules exceeds the tolerance of the voting. These techniques are suitable for real-time applications with short mission times, such as the space shuttle computer control system.87,88 During flight-critical phases of a mission, a system of four redundant computers with majority voting is used to achieve high reliability. This system is designed to cope with two successive failures. If a computer fails, it is outvoted by the other three computers. If another computer then becomes defective, it is also outvoted by the remaining two. In case all four computers fail, another computer, running an independently developed programme, performs the critical functions.
Fig. 7. (a) TMR with single voter; (b) TMR with triplicate voters. [Panel (a): Inputs 1-3 feed Modules 1-3, whose outputs go to a single Voter that produces the Output.]
6.2. Active redundancy

In active redundancy, fault tolerance is achieved by detecting the existence of faults and performing some action to remove the faulty hardware from the system. Active redundancy techniques involve fault detection, fault location and fault recovery in order to achieve fault tolerance. In this approach, no attempt at fault masking is made; the system may therefore produce an erroneous result, which must be acceptable in the application. After fault detection, the system is reconfigured and restored to its original status.
There are many techniques which can be used for fault detection. The most common form of fault detection is duplication with comparison (DWC), as shown in Fig. 8(a). In DWC, two identical hardware modules perform the same computations in parallel. The computation results of both modules are compared using a comparator device. If the results do not match, an error signal is generated. Thus a system using duplication with comparison operates correctly only when both modules operate correctly. This technique can detect a fault in only one module. When a fault occurs, the comparator detects it and the normal functioning of the system stops. To avoid this interruption, an external action may sometimes be taken to switch modules when a fault is detected.
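Duplication with comparison can be sketched in a few lines. The example below is a software illustration only, with hypothetical names (`ComparisonError`, `duplicate_with_comparison`); the two callables stand in for the duplicated hardware modules and the inequality test for the comparator.

```python
class ComparisonError(Exception):
    """Error signal raised by the comparator when the two results differ."""


def duplicate_with_comparison(module_a, module_b, x):
    """Run both modules on the same input and compare their results."""
    result_a = module_a(x)
    result_b = module_b(x)
    if result_a != result_b:
        # The comparator detects the mismatch but cannot tell which module is wrong.
        raise ComparisonError(f"mismatch: {result_a!r} != {result_b!r}")
    return result_a


def square(x):
    return x * x


def faulty_square(x):
    return x * x - 1   # hypothetical faulty module


print(duplicate_with_comparison(square, square, 3))
try:
    duplicate_with_comparison(square, faulty_square, 3)
except ComparisonError as error:
    print("error signal:", error)
```

The sketch shows why DWC detects but does not mask a fault: on a mismatch no output is produced, and recovery has to come from outside the duplicated pair.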
Standby-sparing (SS) technique is another form of active redundancy.73 It is also known as the standby-replacement technique, in which one module, called the primary module, is operational and one or more modules serve as standbys or spares, as illustrated in Fig. 8(b). If a fault is detected and located, the faulty primary module is removed and replaced by a standby module. In standby sparing, the reconfiguration is done by a switching device which monitors the primary module and switches operation to a standby if an error is found. Standby-sparing schemes are categorized as hot and cold standby sparing.73

In hot standby sparing, all spare modules are powered up and ready to be switched into operation immediately after the primary module fails. In cold standby sparing, the spare modules are powered down and are powered on only when they are needed to replace a faulty module. The advantage of cold standby sparing is that the spares do not consume power until needed. An example of hot standby sparing can be seen in a process control system that controls a chemical reaction, where the reconfiguration time needs to be minimized. Satellite applications, where power consumption is extremely critical, use the cold standby sparing technique.
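The switching logic of standby sparing can be illustrated as follows. This is a simplified software sketch with hypothetical names: `modules[0]` plays the primary module, the remaining entries are spares, and `fault_detector` is an assumed plausibility check standing in for the fault detection and location hardware. Spares are invoked only when switched in, which loosely mirrors cold standby sparing.

```python
import math


def run_with_standby_sparing(modules, fault_detector, x):
    """Return (result, index of the module that produced it).

    modules[0] is the primary; each spare is switched in only after a fault
    has been detected in the module currently in use.
    """
    for position, module in enumerate(modules):
        try:
            result = module(x)
        except Exception:
            continue                      # a crash counts as a detected fault
        if not fault_detector(x, result):
            return result, position       # no fault detected: keep this module
    raise RuntimeError("primary and all spare modules have failed")


def stuck_at_zero(x):
    return 0.0                            # hypothetical faulty primary module


def sqrt_fault_detector(x, result):
    """Flag a fault when result squared does not reproduce the input."""
    return abs(result * result - x) > 1e-9


print(run_with_standby_sparing([stuck_at_zero, math.sqrt], sqrt_fault_detector, 9.0))
```

In a hot standby variant, all modules would compute continuously and the switch would only select which output to forward, trading power consumption for a shorter reconfiguration time.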
Pair-and-a-spare (PAS) active redundancy approach is a combination of the duplication with comparison and standby sparing techniques. Figure 8(c) represents the basic structure of the PAS scheme. In this approach, two modules are arranged in parallel and always kept in operation. Their output results are compared to provide the error detection capability required in standby sparing. When a fault is detected, the system is reconfigured by removing the faulty module and replacing it with a spare one. This approach is used in a commercial system known as the Stratus computer system.
6.3. Hybrid redundancy

The combination of both passive and active redundancy techniques is hybrid redundancy.45 The implementation of a system under the hybrid approach is usually very expensive, and it is used in real-time applications that require high integrity of computations and the highest levels of reliability. Hybrid redundancy techniques use fault masking, fault detection, fault location and fault recovery processes to reconfigure the system after the occurrence of a fault. There are three hybrid redundancy approaches, namely (i) N-modular redundancy with spares, (ii) self-purging redundancy and (iii) triplex-duplex redundancy. Figure 9(a) shows the basic arrangement of N-modular redundancy with spares (NMRS).
Fig. 8. (a) Duplication with comparison; (b) Standby-sparing redundancy; (c) Pair-and-a-spare redundancy.
Fig. 9. (a) N-modular redundancy with spares; (b) Triplex-duplex redundancy.
The NMRS approach consists of N identical modules and S additional spare modules. All the modules are arranged in a voting configuration. Initially the N modules take the input in parallel and their results are compared. If no fault is detected, the result is passed on as output. In case of disagreement, i.e. when a fault is detected, two possibilities may arise:
(a) The majority voter masks the erroneous result of the active modules, and the result is then passed on as output.
(b) The switching device replaces the faulty module by one of the spare modules, and the system then continues using the N main modules and S-1 spares.
Another form of hybrid redundancy is self-purging (SP) redundancy, introduced by Lombardi60 and Losq.63 In the SP configuration, N active modules working in parallel are provided to the system and are arranged under a voting scheme. In this approach, an individual switch is provided for each module. As a result, each module is capable of removing itself from the system when it is found to be faulty. The voter produces the system output and masks any fault when it occurs. When the disagreement detector signals a fault, the corresponding switch opens and removes (purges) the faulty module from the system.

Sift-out modular (SOM) redundancy is another example of hybrid redundancy and was developed by De Sousa and Mathur in 1978. The system has N identical modules and three basic elements: a comparator, a detector and a collector. The outputs of the N modules are compared with each other by the comparator, which reports the result of each comparison. If any disagreement is found by the comparator, the detector removes the module which disagrees with a majority of the remaining modules. The collector element is responsible for reporting the output of each module comparison and for reporting the output from the detector that indicates which module is faulty.
Figure 9(b) shows the basic arrangement for the hybrid redundancy based on the triple-modular and duplication with comparison redundancy techniques, which is known as triplex-duplex (TD) redundancy. In this arrangement, there are three primary modules and each module has a duplicate module. Thus, a total of six identical modules, grouped in three pairs, compute in parallel. The computation result of each pair is compared using a comparator. The output given by the majority voter is passed on as the final output of the system.
A brief summary of work done on hardware redundancy techniques is presented in Table 2.
7. Software Redundancy
Software redundancy can be implemented by adding software components which are not exactly identical but are similar in functionality. Software redundancy techniques are designed to allow a system to tolerate software faults that remain in the system after its development. These techniques provide the software system with a mechanism to prevent the occurrence of faults from causing system failure.
Table 2. Hardware redundancy techniques.
Serial no. | Technique | Authors | Findings
1 | Duplication with comparison | Hohl et al. (1993) | To achieve fault tolerance in multiprocessor systems, a detailed comparison between watchdog processors and master-checker type duplication has been shown from the view point of fault coverage, hardware and time overhead
  |  | Tahir et al. (1995) | Proposed fault tolerant arithmetic unit using DWC and residue code techniques to find the best design in terms of lower cost and better error coverage
  |  | Hashimoto et al. (2000) | Developed a scheduling algorithm to tolerate a single processor failure in multiprocessor systems. This algorithm duplicates all tasks of a program which reduces high overheads of communication
  |  | Kim and Somani (2001) | Used component-level duplication to examine the dynamic control signals in microprocessor control logic
2 | N-modular redundancy | Lombardi and Ratheal (1983) | Discussed steady state availabilities of static and dynamic N-modular redundant fault tolerant systems
  |  | Koutny and Mancini (1989) | A software system which permits redundant systems to be robust with respect to failures in replicated processors has been constructed
  |  | Flammini et al. (2009) | Proposed a new approach to the safety evaluation of N-modular redundant computer systems in presence of imperfect maintenance
3 | Triple-modular redundancy | Pham et al. (1996) | Examined the reliability and mean time to failure (MTTF) for the TMR system
  |  | Krstic et al. (2005) | A fault-tolerant voter under TMR scheme was presented which is capable of selecting the mid value from the correct consensus
4 | N-modular redundancy with spares | Lombardi et al. (1982) | System reliability of duplex-hybrid systems with standby spares has been discussed
  |  | Krishna (1993) | Discussed NMR with a spare processor approach to show the impact of workload on reliability of real-time processor triads
  |  | Dabney et al. (2008) | Designed a simple dual-redundant fault tolerant test control system architecture and presented a survey of existing fault tolerant control systems
5 | Standby-sparing redundancy | Schmitz et al. (2004) | A significant contribution has been provided to reduce the energy consumption in standby-sparing. To accomplish this, dynamic voltage scaling (DVS) and dynamic power management (DPM) have been presented for primary module and spare module, respectively
  |  | Eljali et al. (2009) | Developed an online energy-management method to analyze hard real time systems which uses a slack reclamation scheme to reduce the energy consumption of both the primary and spare modules
6 | Self purging redundancy | Razavi (1993) | A modified self-purging system was developed which contains a digital voter to adjust the threshold of the voter automatically as failed modules are purged
  |  | Quintana et al. (2001) | An efficient implementation of the voter was presented which is helpful in reducing the circuit complexity and delay
Traditional hardware redundancy techniques are designed to tackle manufacturing faults first and environmental or other faults second. Software redundancy techniques are also based on the approach of hardware redundancy techniques, but hardware redundancy techniques do not protect the system against software design and specification faults. For example, the TMR system was developed to tolerate many single errors by replicating the same hardware module, but a similar approach cannot be applied when we develop a software implementation with triplicate modules and voting on their outputs; a fault in the module cannot be tolerated because all copies have identical faults. Software redundancy techniques attempt to leverage the experience of hardware redundancy techniques to solve a different problem. To accomplish this task, and to create a proper software redundant system, the concept of design diversity has to be applied. Design diversity is the fault tolerance approach which has the capability to handle common mode design faults. Under this approach, it is considered more efficient to vary a design at a high level of abstraction, since varying the function (algorithm) is more efficient than varying implementation details of a design, e.g. using different programming languages. One way of looking at software redundancy techniques is along the lines of diversity75 as follows:
(i) Design diversity based software redundancy
(ii) Data diversity based software redundancy.
7.1. Design diversity based software redundancy

Design diversity is a solution for software redundant systems insofar as it is possible to create diverse and equivalent specifications, so that developers can design software versions which do not share common faults. Design diversity approaches are used in a multiple version software environment. The software versions are functionally equivalent, independently developed programs which provide tolerance of software design faults. The most common examples of software redundancy techniques based on design diversity are recovery blocks, N-version programming and N-self-checking programming systems.
The recovery blocks (RB) scheme evolved as a software redundancy technique; it was initiated by Horning36 and further developed by Brian Randell in the early 1970s. The basic structure of the recovery block scheme is shown in Fig. 10(a). In this technique, multiple software versions are implemented for the same program, of which one version is primary and the others are alternate versions. RB uses three mechanisms: (i) acceptance test (AT), (ii) checkpoint and (iii) restart. In the beginning, the primary version is executed and the acceptance test is then applied to its result. If the version passes the AT, its result is considered reliable. If an error is detected by the AT, a roll back signal is sent to the switch, which switches execution to another version of the software program (module). This process is repeated until some version passes the AT or all versions fail. The checkpoint is created before the execution of a version and is needed to recover the system state after a version fails.
N-version programming (NVP) is another software redundancy technique based on design diversity and was investigated by Elmendorf in 1972. In this technique, all N software versions are executed simultaneously and their results are sent to a decision mechanism, referred to as the majority voter, which selects the correct output result (see Fig. 10(b)). The goal of N-version programming systems is to minimize the probability of similar errors at decision points.
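The control flow of a recovery block (checkpoint, execute a version, apply the acceptance test, roll back and try the next alternate) can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the use of `copy.deepcopy` as the checkpoint mechanism, and the example versions are all assumptions made for the sketch.

```python
import copy


def recovery_block(data, versions, acceptance_test):
    """Try the primary version first, then the alternates, until one passes."""
    checkpoint = copy.deepcopy(data)            # checkpoint taken before execution
    for version in versions:
        working = copy.deepcopy(checkpoint)     # restart from the checkpointed state
        try:
            result = version(working)
        except Exception:
            continue                            # a crash is treated as a failed version
        if acceptance_test(checkpoint, result):
            return result
    raise RuntimeError("all versions failed the acceptance test")


def faulty_sort(values):
    values.pop()                                # hypothetical design fault: loses an element
    values.sort()
    return values


def fallback_sort(values):
    return sorted(values)


def sorted_and_complete(original, result):
    """Acceptance test: same length as the input and in non-decreasing order."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return len(result) == len(original) and in_order


print(recovery_block([3, 1, 2], [faulty_sort, fallback_sort], sorted_and_complete))
```

Unlike N-version programming, the versions here run one after another, so the alternates cost execution time only when the primary fails the acceptance test.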
N-self-checking programming (NSCP), proposed by Laprie et al.57 and Yau and Cheung,103 is also a design diverse software redundancy technique, which combines the features of both recovery blocks and N-version programming. The NSCP approach uses program redundancy to check its own behavior during execution, which can be done by using either an acceptance test or a comparator. An NSCP scheme that uses comparators resembles triplex-duplex hardware redundancy.
7.2. Data diversity based software redundancy

Data diversity was introduced by Ammann4 and Ammann and Knight.5 These approaches are used in a multiple data representation environment and use only one version of the software. Data diversity approaches utilize different representations of the input data to provide tolerance of software design faults and are cheaper to implement than the design-diversity approaches. Examples of software redundancy techniques based on data diversity include retry blocks and N-copy programming systems.
Fig. 10. (a) Recovery-block; (b) N-version programming. [Panel (a): Input, Checkpoint Memory, Versions 1 to N, Selection Switch, Acceptance Test, Output; panel (b): Input, Versions 1 to N, Selection Algorithm, Output.]
Retry block (RtB) is a software redundancy technique that enhances
the recovery block scheme: as shown in Fig. 11(a), it uses only one algorithm
rather than the multiple algorithms used in a recovery block. The execution results of the retry
block are evaluated by an acceptance test; if the test fails, the input data are
re-expressed and the same algorithm is executed again. Another approach to software
redundancy based on data diversity is N-copy programming (NCP), shown in
Fig. 11(b). NCP resembles the N-modular hardware redundancy scheme. This
technique uses one or more data re-expression algorithms. An NCP system consists of N
copies of a program executing in parallel, each copy running on a different input set
produced by re-expression. The best output of the system is selected
by a modified voting scheme. In this technique, the data re-expression
algorithms are first executed concurrently to re-express the input data, then the N copies
are executed concurrently. The results of the N copies are sent to the
decision mechanism (DM). If the DM adjudicates a correct result, it is
returned; otherwise an error signal is raised.

Fig. 11. (a) Retry-block; (b) N-copy programming.
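The two data-diversity schemes can be sketched in Python as follows. The sketch makes several assumptions that are not in the source: the re-expressions are exact (they do not change the expected output), a plain strict-majority vote stands in for the modified voting scheme, checkpointing is omitted, the copies run sequentially, and all names (`retry_block`, `n_copy`, `faulty_mean`, `RE_EXPRESSIONS`) are hypothetical.

```python
from collections import Counter


def retry_block(algorithm, acceptance_test, re_expressions, x):
    """Run one algorithm; on AT failure, retry it on re-expressed input."""
    for re_express in re_expressions:         # first entry is the identity
        try:
            result = algorithm(re_express(x))
            if acceptance_test(x, result):
                return result
        except Exception:
            pass                              # a crash counts as an AT failure
    raise RuntimeError("retry block exhausted all re-expressions")


def n_copy(algorithm, re_expressions, x):
    """Run N copies on re-expressed inputs and vote on their results."""
    results = []
    for re_express in re_expressions:         # one copy per re-expression
        try:
            results.append(algorithm(re_express(x)))
        except Exception:
            continue                          # a failed copy casts no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes > len(re_expressions) // 2:  # strict majority of the N copies
            return value
    raise RuntimeError("decision mechanism found no majority")


def faulty_mean(values):
    # Simulated design fault: the program fails when the first sample is zero.
    if values[0] == 0:
        raise ZeroDivisionError("simulated design fault")
    return sum(values) / len(values)


def mean_acceptance_test(x, result):
    return min(x) <= result <= max(x)


# The mean does not depend on sample order, so reordering is an exact re-expression.
RE_EXPRESSIONS = [
    lambda v: tuple(v),                        # identity
    lambda v: tuple(reversed(v)),              # reversed order
    lambda v: tuple(sorted(v, reverse=True)),  # descending order
]
```

For the input `(0, 2, 4)` the original representation triggers the simulated fault, while the reordered inputs avoid it, so both `retry_block` and `n_copy` still deliver the mean. An input such as `(0, 0, 0)` defeats every re-expression, and both schemes signal an error.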
Many researchers have done a considerable amount of work on
software redundancy techniques; some of these contributions are summarized in Table 3.
Table 3. Software redundancy techniques.

Serial no. | Technique | Authors | Findings
1 | Recovery blocks | Berman and Kumar (1999) | Optimization models for a fault tolerant software system, for both independent and consensus recovery block schemes under cost and reliability constraints, were discussed
1 | Recovery blocks | Abulnaja (2005) | Presented a component-based recovery block technique
1 | Recovery blocks | Wattanapongskorn and Coit (2007) | Discussed embedded system design and optimization issues considering component redundancy and uncertainty in the component reliability estimates
2 | N-version programming | Pham (1994) | The optimization of the cost of NVP systems subject to a desired reliability level was discussed
2 | N-version programming | Kapur et al. (2007) | The optimal release policy for a 3VP system, minimizing cost subject to a reliability constraint under a fuzzy environment, was presented
2 | N-version programming | Proenza et al. (2009) | An improved design of the NVP execution architecture, resolving some potential inconsistencies, was suggested
3 | N-self-checking programming | Romanovsky (1997) | Introduced a general concept of the N-SCP scheme
3 | N-self-checking programming | Djordjevic et al. (2004) | Provided an approach to partially self-checking combinatorial circuit design which is similar to DWC, wherein a comparator works as a checker that detects any erroneous result
4 | Retry blocks | Huang and Kintala (1995) | Provided a C-style construction of the retry-block scheme for programming purposes; the construction was implemented using macros
5 | N-copy programming | Christmansson et al. (1994) | To tolerate software design faults in a flight control system, data diversity via N-copy programming was applied, which gave more correct computation results and higher reliability than other techniques
8. Concluding Remarks
The complexity of embedded computer systems is increasing day by day to meet the
requirements of safety-critical, mission-critical and business-critical applications. In
these applications, a system failure may cause heavy losses in terms of people's lives or
environmental disaster. Redundancy has become one of the best ways to make
computer systems highly reliable and available in different configurations.
Redundancy technology makes systems capable of tolerating both expected and
unexpected software/hardware faults. In the literature, various software/hardware
redundancy techniques have been developed to manage these faults. Hardware
redundancy allows recovery (repair) from failures rather than preventing them and
thus provides fault tolerance in the system with respect to operational faults.
On the other hand, software redundancy plays a key role in most redundant computer
systems, since computers that recover from failures mainly by hardware
means use software to control their recovery and decision-making processes. In
the present survey article we have discussed some of the fundamental issues of
redundant computer-based systems related to design and reliability analysis in a
different framework. An overview of classified software/hardware faults and various
hardware/software redundancy techniques is presented. Some methodological
aspects discussed for reliability modeling of redundant systems may provide
insight to system designers and developers to improve software and hardware
systems subject to techno-economic constraints. Our study may be helpful in
resolving the problems introduced by hardware failures and software faults in many
computer-based engineering systems.
References
1. O. A. Abulnaja, Component-based recovery block technique, AIML Journal 5(2)
(2005) 15.
2. M. Agarwal, Imbedded semi-Markov process applied to stochastic analysis of a two-unit
standby system with two types of failures, Microelectronics Reliability 25(3)
(1985) 561-571.
3. M. M. Alidrisi, The reliability of a dynamic warm standby redundant system of
n components with imperfect switching, Microelectronics Reliability 32(6) (1992)
851-859.
4. P. E. Ammann, Data diversity: An approach to software fault tolerance, Proceedings
of FTCS-17, Pittsburgh, PA (1987), pp. 113-117.
5. P. E. Ammann and J. C. Knight, Data diversity: An approach to software fault
tolerance, IEEE Transactions on Computers 37(4) (1988) 418-425.
6. A. Azaron, H. Katagiri, M. Sakawa and M. Modarres, Reliability function of a class of
time-dependent systems with standby redundancy, European Journal of Operational
Research 164(2) (2005) 378-386.
7. O. Berman and U. D. Kumar, Optimization models for recovery block schemes,
European Journal of Operational Research 115(2) (1999) 368-379.
8. V. D. C. Bueno and I. M. D. Carmo, Active redundancy allocation for a k-out-of-n: F
system of dependent components, European Journal of Operational Research 176(2)
(2007) 1041-1051.
9. S. M. Carr and G. J. Savage, A unified methodology for reliability assessment of systems
with active redundancy, Reliability Engineering & System Safety 34(2) (1991)
181-219.
10. J. A. Carrasco and V. Suñé, Combinatorial methods for the evaluation of yield and
operational reliability of fault-tolerant systems-on-chip, Microelectronics Reliability
44(2) (2004) 339-350.
11. R. J. Chevance, Hardware and software solutions for high availability,
Server-Architecture (2005), 609-652.
12. J. G. Choi and P. H. Seong, Reliability assessment of embedded digital system using
multi-state function, Reliability Engineering & System Safety 91(3) (2006) 261-269.
13. J. Christmansson, Z. Kalbarczyk and J. Torin, Dependable flight control system by
data diversity and self-checking components, Microprocessing and Microprogramming
40(2-3) (1994) 207-222.
14. W. K. Chung, Reliability of imperfect switching of cold standby systems with multiple
non-critical and critical errors, Microelectronics and Reliability 35(12) (1995)
1479-1482.
15. D. R. Cox, The analysis of non-Markovian stochastic process by inclusion of supplementary
variables, Mathematical Proceedings of the Cambridge Philosophical Society
51(3) (1955) 433-441.
16. R. W. Dabney, L. Etzkorn and G. W. Cox, A fault tolerant approach to test control
utilizing dual-redundant processors, Advances in Engineering Software 39(5) (2008)
371-383.
17. J. V. Deshpande, I. Dewan and U. V. Naik-Nimbalkar, A family of distributions to
model load sharing systems, Journal of Statistical Planning and Inference 140(6)
(2010) 1441-1451.
18. P. T. De-Sousa and F. P. Mathur, Sift-out modular redundancy, IEEE Transactions
on Computers C-27(7) (1978) 624-627.
19. B. S. Dhillon and K. I. Ugwu, Bibliography of literature on computer hardware
reliability, Microelectronics and Reliability 26(1) (1986) 131-153.
20. A. Di-Macro, A semi-Markov model of a three state generating unit, IEEE Transactions
on Power Apparatus and Systems PAS-91(5) (1972) 2154-2160.
21. S. Distefano and A. Puliafito, Reliability and availability analysis of dependent
dynamic systems with DRBDs, Reliability Engineering & System Safety 94(9) (2009)
1381-1393.
22. G. L. Djordjevic, M. K. Stojcev and T. R. Stankovic, Approach to partially
self-checking combinatorial circuits design, Microelectronics Journal 35(12) (2004)
945-952.
23. A. D. Dominguez-Garcia, J. G. Kassakian, J. E. Schindall and J. J. Zinchuk, An
integrated methodology for the dynamic performance and reliability evaluation of
fault tolerant systems, Reliability Engineering and System Safety 93(11) (2008)
1628-1649.
24. E. Dubrova, Fault Tolerant Design: An Introduction, Kluwer Academic Publishers
(2007).
25. A. Ejlali, B. M. Al-Hashimi and P. Eles, A standby-sparing technique with low
energy overhead for fault tolerant hard real time systems, Proceedings of the 7th
IEEE/ACM International Conference on Hardware/Software Codesign and System
Synthesis, Power-aware design methodology (2009), pp. 193-202.
26. M. R. El-Karaksy, A. S. Nouh and A. R. Al-Obaidan, Performance analysis of timed
Petri net models for communication protocols: A methodology and a package, Computer
Communications 13(2) (1990) 73-82.
27. W. R. Elmendorf, Fault-tolerant programming, Proceedings of FTCS-2, Newton, MA
(1972), pp. 79-83.
28. F. Flammini, S. Marrone, N. Mazzocca and V. Vittorini, A new modeling approach
to the safety evaluation of N-modular redundant computer systems in the presence
of imperfect maintenance, Reliability Engineering and System Safety 94(9) (2009)
1422-1432.
29. S. Garg, J. Singh and D. V. Singh, Availability analysis of crank-case manufacturing
in a two-wheeler automobile industry, Applied Mathematical Modelling 34(6) (2010)
1672-1683.
30. R. German, Non-Markovian Analysis, Springer-Verlag (2002), 156-182.
31. J. Gray, Why do computers stop and what can be done about it? in Proceedings of the
5th Symposium on Reliability in Distributed Software and Database Systems (1986),
pp. 3-12.
32. M. Grottke and K. S. Trivedi, Fighting bugs: Remove, retry, replicate and rejuvenate,
Software Technologies (2007), 107-109.
33. H. Guo and X. Yang, Automatic creation of Markov models for reliability assessment
of instrumented systems, Reliability Engineering and System Safety 93(6) (2008)
829-837.
34. K. Hashimoto, T. Tsuchiya and T. Kikuno, A new approach to fault-tolerant scheduling
using task duplication in microprocessor systems, Journal of Systems and Software
53(2) (2000) 159-171.
35. W. Hohl, E. Michel and A. Pataricza, Hardware support for error detection in multiprocessor
systems - a case study, Microprocessors and Microsystems 17(4) (1993)
201-206.
36. J. J. Horning, A program structure for error detection and recovery, New York:
Springer-Verlag 16 (1974) 171-187.
37. C. Y. Huang and Y. R. Chang, An improved decomposition scheme for assessing the
reliability of embedded systems by using dynamic fault trees, Reliability Engineering
& System Safety 92(10) (2007) 1403-1412.
38. Y. Huang and C. Kintala, Software Fault Tolerance in the Application Layer, John
Wiley & Sons Ltd (1995).
39. A. Immonen and E. Niemela, Survey of reliability and availability prediction methods
from the view point of software architecture, Software Systems Modeling 7 (2008)
49-65.
40. R. T. Islamov, Using Markov reliability modeling for multiple repairable systems,
Reliability Engineering and System Safety 44(2) (1994) 113-118.
41. M. Jain, Reliability of a two-unit system with common cause shock failures, International
Journal of Pure & Applied Mathematics 29(12) (1998) 1281-1289.
42. M. Jain and R. P. Ghimire, Reliability of k-r-out-of-n: G system subject to random
and common cause failure, Performance Evaluation 29 (1997) 213-218.
43. M. Jain, S. Maheshwari and Rakhee, Study of loading policies for k-r-out-of-N: G
system subject to common cause failure, R & D Quality Quest 4(2) (2002) 15-23.
44. K. Jenab and B. S. Dhillon, Assessment of reversible multi-state k-out-of-n:G/F/load
sharing systems with flow-graph models, Reliability Engineering and System Safety
91(7) (2006) 765-771.
45. B. W. Johnson, Design and analysis of fault tolerant digital systems, Addison Wesley
(1989).
46. D. L. Kang, M. J. Hwang and S. H. Han, Estimation of common cause failure probabilities
of the components under mixed testing scheme, Annals of Nuclear Energy
36(4) (2009) 493-497.
47. P. K. Kapur and K. R. Kapoor, Effect of standby redundancy in system reliability,
Microelectronics Reliability 15(5) (1976) 376.
48. P. K. Kapur, A. Gupta and P. C. Jha, Reliability growth modeling and optimal
release policy under fuzzy environment of an N-version programming system incorporating
the effect of fault removal efficiency, International Journal of Automation
and Computing 4(4) (2007) 369-379.
49. J. B. Ke, W. C. Lee and K. H. Wang, Reliability and sensitivity analysis of a system
with multiple unreliable service stations and standby switching failures, Physica A:
Statistical Mechanics and its Applications 380 (2007) 455-469.
50. J. Kienzle, Software Fault Tolerance: An Overview, Springer Berlin/Heidelberg
(2003).
51. S. Kim and A. K. Somani, On-line integrity monitoring of microprocessor control
logic, Microelectronics Journal 32(12) (2001) 999-1007.
52. M. Koutny and L. V. Mancini, Synchronizing events in replicated systems, Journal
of Systems and Software 9(3) (1989) 183-190.
53. C. M. Krishna, The impact of workload on the reliability of real-time processor
triads, Microelectronics Reliability 33(8) (1993) 1169-1178.
54. M. D. Krstic, M. K. Stojcev, G. L. Djordjevic and I. D. Andrejic, A mid-value select
voter, Microelectronics and Reliability 45(3-4) (2005) 733-738.
55. K. Kuspert, Principles of error detection in storage structures of database systems,
Reliability Engineering 14(4) (1986) 275-290.
56. C. D. Lai, M. Xie, K. L. Poh, Y. S. Dai and P. Yang, A model for availability analysis
of distributed software/hardware systems, Information and Software Technology
44(6) (2002) 343-350.
57. J. C. Laprie, Definition and analysis of hardware and software fault tolerant architecture,
IEEE Computer 23(7) (1990) 39-51.
58. J. C. Laprie, Dependability: Basic concepts and terminology, Springer-Verlag (1992).
59. G. Levitin, Reliability and performance analysis of hardware-software systems with
fault tolerant software components, Reliability Engineering and System Safety 91(5)
(2006) 570-579.
60. F. Lombardi, Availability modeling of ring microcomputer systems, Microelectronics
Reliability 22(2) (1982) 295-308.
61. F. Lombardi and S. Ratheal, Analysis of series deviance in a parallel state transition
diagram and applications to fault tolerant computing, Microelectronics Reliability
23(5) (1983) 963-980.
62. F. Lombardi, V. Obac-Roda and M. M. Islam, Reliability study of duplex-hybrid
systems, Microelectronics and Reliability 22(3) (1982) 457-470.
63. J. Losq, A highly efficient redundancy scheme: self-purging redundancy, IEEE Transactions
on Computers C-25(6) (1976) 569-578.
64. M. R. Lyu, Handbook of software reliability engineering, IEEE Computer Society
Press and McGraw-Hill (1996).
65. A. Mosleh, Common cause failure: An analysis methodology and examples, Reliability
Engineering 34 (1991) 249-292.
66. M. Mukurani, Task-based dynamic fault tolerance for humanoid robot applications
and its hardware implementation, Journal of Computers 3(8) (2008) 40-48.
67. S. M. Nassar, Software reliability, Computers & Industrial Engineering 11(1-4)
(1986) 613-618.
68. E. A. Oliveira, A. C. M. Alvim and P. F. Frutuoso-e-Melo, Unavailability analysis of
safety systems under aging by supplementary variables with imperfect repair, Annals
of Nuclear Energy 32(2) (2005) 241-252.
69. J. N. Pan, Reliability prediction of imperfect switching systems subject to Weibull
failures, Computers & Industrial Engineering 34(2) (1998) 481-492.
70. E. Papageorgiou and G. Kokolakis, Reliability analysis of a two-unit general parallel
system with (n - 2) warm standbys, European Journal of Operational Research
201(3) (2010) 821-827.
71. H. Pham, On the optimal design of N-version software systems subject to constraints,
Journal of Systems and Software 27(1) (1994) 55-61.
72. H. Pham, A. Suprasad and R. B. Misra, Reliability analysis of k-out-of-n systems
with partially repairable multi-state components, Microelectronics and Reliability
36(10) (1996) 1407-1415.
73. D. K. Pradhan, Fault-tolerant computer system design, Prentice Hall, Englewood
Cliffs, NJ (1996).
74. J. Proenza, J. Miro-Julia and H. Hansson, Managing redundancy in CAN-based networks
supporting N-version programming, Computer Standards & Interfaces 31(1)
(2009) 120-127.
75. L. L. Pullum, Software fault tolerance techniques and implementation, Artech House
(2001).
76. J. M. Quintana, M. J. Avedillo and J. L. Huertas, Efficient realization of a threshold
voter for self-purging redundancy, Journal of Electronic Testing: Theory and
Applications 17(1) (2001) 69-73.
77. B. Randell, System structure for software fault tolerance, IEEE Transactions on Software
Engineering SE-1(2) (1975) 220-232.
78. H. M. Razavi, Self-purging redundancy with automatic threshold adjustment, IEE
Proceedings of Circuits, Devices and Systems 140(4) (1993) 233-236.
79. A. Reibman, R. Smith and K. Trivedi, Markov and Markov reward model transient
analysis: An overview of numerical approaches, European Journal of Operational
Research 40(2) (1989) 257-267.
80. A. Romanovsky, Practical exception handling and resolution in concurrent programs,
Computer Languages 23(1) (1997) 43-58.
81. O. Rooks and M. Armbruster, Duo duplex drive-by-wire computer system, Reliability
Engineering and System Safety 89(1) (2005) 71-80.
82. R. Samet, Design and implementation of highly reliable dual-computer systems,
Computers & Security 28(7) (2009) 710-722.
83. M. T. Schmitz, B. M. Al-Hashimi and P. Eles, System-level design techniques for
energy-efficient embedded systems, Norwell, MA: Kluwer (2004).
84. R. Schoenig, J. F. Aubry, T. Cambois and T. Hutinet, An aggregation method of
Markov graphs for the reliability analysis of hybrid systems, Reliability Engineering
& System Safety 91(2) (2006) 137-148.
85. R. K. Sharma and S. Kumar, Performance modeling in critical engineering systems
using RAM analysis, Reliability Engineering and System Safety 93(6) (2008)
913-919.
86. B. Singh, K. K. Sharma and A. Kumar, A classical and Bayesian estimation of a
k-components load sharing parallel system, Computational Statistics & Data Analysis
52(12) (2008) 5175-5185.
87. J. R. Sklaroff, Redundancy management technique for space shuttle computers, IBM
JRD 20(1) (1976) 20-28.
88. A. Spector and D. Gifford, The space shuttle primary computer system, Communications
of the ACM 27(9) (1984) 872-900.
89. J. M. Tahir, S. S. Dlay, R. N. G. Naguib and O. R. Hinton, Fault tolerant arithmetic
unit using duplication and residue codes, Integration, the VLSI Journal 18(2-3)
(1995) 187-200.
90. W. Torres-Pomales, Software fault tolerance: A tutorial, NASA/TM-2000-210616
(2000).
91. J. E. Valdes and R. I. Zequeira, On the optimal allocation of an active redundancy in
a two-component series system, Statistics & Probability Letters 63(3) (2003) 325-332.
92. J. E. Valdes and R. I. Zequeira, On the optimal allocation of two active redundancies
in a two-component series system, Operations Research Letters 34(1) (2006) 49-52.
93. J. E. Valdes, G. Arango, R. I. Zequeira and G. Brito, Some stochastic comparisons
in series systems with active redundancy, Statistics & Probability Letters 80(11-12)
(2010) 945-949.
94. K. H. Wang and Y. J. Chen, Comparative analysis of availability between three systems
with general repair times, reboot delay and switching failures, Applied Mathematics
and Computation 215(1) (2009) 384-394.
95. K. H. Wang, W. L. Dong and J. B. Ke, Comparison of reliability and the availability
between four systems with warm standby components and standby switching failures,
Applied Mathematics and Computation 183(2) (2006) 1310-1322.
96. N. Wattanapongskorn and D. W. Coit, Fault-tolerant embedded system design and
optimization considering reliability estimation uncertainty, Reliability Engineering &
System Safety 92(4) (2007) 395-407.
97. J. Wu, Y. Wang and E. B. Fernandez, A uniform approach to software and hardware
fault tolerance, Journal of Systems and Software 26(2) (1994) 117-127.
98. K. Wu, P. Mishra and R. Karri, Concurrent error detection of fault-based side channel
cryptanalysis of 128-bit RC6 block cipher, Microelectronics Journal 34(1) (2003)
31-39.
99. S. Yamada and S. Osaki, Reliability growth models for hardware and software systems
based on non-homogeneous Poisson processes: A survey, Microelectronics Reliability
23(1) (1983) 91-112.
100. W. Yamamoto, L. Jin and K. Suzuki, Optimal allocations for load-sharing k-out-of-n:
F systems, Journal of Statistical Planning and Inference 139(5) (2009) 1777-1781.
101. B. Yang, H. Hu and S. Guo, Cost oriented task allocation and hardware redundancy
policies in heterogeneous distributed computing systems considering software
reliability, Computers & Industrial Engineering 56(4) (2009) 1687-1696.
102. B. Yang, X. Li, M. Xie and T. Feng, A generic data-driven software reliability
model with model mining technique, Reliability Engineering and System Safety 95(6)
(2010) 671-678.
103. S. S. Yau and R. C. Cheung, Design of self-checking software, Proceedings of the
International Conference on Reliable Software 10(6) (1975) 450-455.
104. R. D. Yearout, P. Reddy and D. L. Grosh, Standby redundancy in reliability: A
review, Microelectronics Reliability 27(5) (1987) 937.
105. T. Zhang, M. Xie and M. Horigome, Availability and reliability of k-out-of-(M + N):G
warm standby systems, Reliability Engineering and System Safety 91(4) (2006)
381-387.
106. Y. L. Zhang, A geometrical process repair model for a repairable system with delayed
repair, Computers & Mathematics with Applications 55(8) (2008) 1629-1643.
107. Y. L. Zhang and G. J. Wang, A deteriorating cold standby repairable system with
priority in use, European Journal of Operational Research 183(1) (2007) 278-295.
About the Authors

Madhu Jain, Faculty, Dept. of Mathematics, I.I.T. Roorkee, received her M.Sc.,
M.Phil., Ph.D. and D.Sc. degrees in Mathematics from the University of Agra. She was
a gold medalist of Agra University at the M.Phil. level. She has more than 250
research publications in refereed international/national journals and more than
20 books to her credit, in addition to two reference books. She was a recipient of
the Young Scientific award and a SERC visiting fellow of the Department of Science and
Technology (India), and received the Career award of the University Grants Commission (India). She
has successfully completed six sponsored major research projects of DST, UGC and
CSIR. Her current research interests include Stochastic Modelling, Soft Computing,
Bio-informatics, Reliability and Queueing Theory. Twenty five candidates have
received their Ph.D. degrees under her supervision. She has visited more than 25
reputed universities/institutes in the USA, Canada, UK, Germany, France, Holland
and Belgium. She has participated and presented her research work in more than
30 international and 75 national conferences/seminars.
Ritu Gupta is presently conducting research leading to a Ph.D. degree at the Institute
of Basic Science, Dr. B. R. Ambedkar University, Agra. She is a first-division scholar
at the degree and postgraduate levels. She has published three research papers in
reputed journals and refereed proceedings. She has attended and presented her work
at two international and eight national conferences/seminars. Her areas of interest
are software and hardware reliability and performance modeling of redundant
systems.
