Most people will have some concept of what reliability is from everyday life. For example, people may discuss how reliable their washing machine has been over the length of time they have owned it. Similarly, a car that doesn't need to go to the garage for repairs often during its lifetime would be said to have been reliable. It can be said that reliability is quality over time. Quality is associated with workmanship and manufacturing, so if a product doesn't work, or breaks as soon as you buy it, you would consider the product to have poor quality. However, if over time parts of the product wear out before you expect them to, this would be termed poor reliability. The difference between quality and reliability is therefore concerned with time, and more specifically with product lifetime.
Reliability is the probability of performing a specific function, without failure, under given conditions, for a specified period of time.
The five elements are:
i. Probability
ii. Performance without failure
iii. A specified function
iv. Given operating conditions
v. A specified period of time
However, if, as shown in Fig. 2, there is an overlap of the two distributions, then failures will occur. There therefore needs to be a safety margin to ensure that there is no overlap of these distributions.
It is clear that to ensure good reliability, the causes of failure need to be identified and eliminated. Indeed, the objectives of reliability engineering are:
Infant Mortality: This stage is also called early failure or debugging stage. The failure
rate is high but decreases gradually with time. During this period, failures occur because
engineering did not test products or systems or devices sufficiently, or manufacturing
made some defective products. Therefore the failure rate at the beginning of infant
mortality stage is high and then it decreases with time after early failures are removed by
burn-in or other stress screening methods. Some typical early failures are: poor welds, poor connections, contamination on surfaces or in materials, incorrect positioning of parts, etc.
Useful Life Period: As the product matures, the weaker units die off, the failure rate
becomes nearly constant, and modules have entered what is considered the normal life
period. This period is characterized by a relatively constant failure rate. The length of this
period is referred to as the system life of a product or component. It is during this period
of time that the lowest failure rate occurs. Notice how the amplitude on the bathtub curve
is at its lowest during this time. The useful life period is the most common time frame for
making reliability predictions.
Wear-out Period: This is the final stage, where the failure rate increases as the products begin to wear out because of age or lack of maintenance. When the failure rate becomes high, repair or replacement of parts should be carried out.
RELIABILITY MEASURES
Reliability is the probability that a product or part will operate properly for a specified period of time (design life) under the design operating conditions (such as temperature, voltage, etc.) without failure. In other words, reliability may be used as a measure of the system's success in providing its function properly. Reliability is one of the quality characteristics that consumers require from the manufacturer of products.
Many mathematical concepts apply to reliability engineering, particularly from the
areas of probability and statistics. Likewise, many mathematical distributions can be used for
various purposes, including the Gaussian (normal) distribution, the log-normal distribution,
the exponential distribution, the Weibull distribution and a host of others.
Failure rate: The purpose for quantitative reliability measurements is to define the rate of
failure relative to time and to model that failure rate in a mathematical distribution for the
purpose of understanding the quantitative aspects of failure. The most basic building block is
the failure rate, which is estimated using the following equation:
λ = r / T
Where:
λ = Failure rate (sometimes referred to as the hazard rate)
T = Total running time/cycles/miles/etc. during an investigation period, for both failed and non-failed items
r = The total number of failures occurring during the investigation period
For example, if five electric motors operate for a collective total time of 50 years with five functional failures during the period, then the failure rate, λ, is 0.1 failures per year.
Another very basic concept is the mean time between/to failure (MTBF/MTTF). The
only difference between MTBF and MTTF is that we employ MTBF when referring to items
that are repaired when they fail. For items that are simply thrown away and replaced, we use
the term MTTF. The computations are the same.
The basic calculation to estimate mean time between failure (MTBF) and mean time
to failure (MTTF), both measures of central tendency, is simply the reciprocal of the failure
rate function. It is calculated using the following equation:
θ = T / r
Where:
θ = Mean time between/to failure
T = Total running time/cycles/miles/etc. during an investigation period, for both failed and non-failed items
r = The total number of failures occurring during the investigation period
The MTBF for the industrial electric motor mentioned in the previous example is 10
years, which is the reciprocal of the failure rate for the motors. Incidentally, we would
estimate MTBF for electric motors that are rebuilt upon failure. For smaller motors that are
considered disposable, we would state the measure of central tendency as MTTF. The failure
rate is a basic component of many more complex reliability calculations. Depending upon the
mechanical/electrical design, operating context, environment and/or maintenance effectiveness, a machine's failure rate as a function of time may decline, remain constant, increase linearly or increase geometrically.
Failure rate calculations are based on complex models which include factors using
specific component data such as temperature, environment, and stress. In the prediction
model, assembled components are structured serially. Thus, calculated failure rates for
assemblies are a sum of the individual failure rates for components within the assembly.
There are three common basic categories of failure rates:
a) Mean Time Between Failures (MTBF): MTBF is a basic measure of reliability for
repairable items. MTBF can be described as the time passed before a component,
assembly, or system fails, under the condition of a constant failure rate. Another way of
stating MTBF is the expected value of time between two consecutive failures, for
repairable systems. It is a commonly used variable in reliability and maintainability
analyses.
MTBF can be calculated as the inverse of the failure rate, λ, for constant failure rate systems. For example, for a component with a failure rate of 2 failures per million hours, the MTBF would be the inverse of that failure rate, i.e.:
MTBF = 1/λ = 1,000,000 hours / 2 failures = 500,000 hours
b) Mean time to failure (MTTF): MTTF is a basic measure of reliability for non-repairable
systems. It is the mean time expected until the first failure of a piece of equipment. MTTF
is a statistical value and is intended to be the mean over a long period of time and with a
large number of units. For constant failure rate systems, MTTF is the inverse of the failure rate, λ. If the failure rate, λ, is in failures/million hours, then MTTF = 1,000,000 / λ for components with exponential life distributions.
MTTF is the total hours of service of all devices divided by the number of devices. It is only when all the parts fail with the same failure mode that MTBF converges to MTTF.
MTTF = 1/λ = T/N
(T = total time; N = number of units under test)
For example, the item above fails, on average, once every 4,000 hours, so the probability of failure for each hour is 1/4000 = 0.00025. This depends on the failure rate being constant, which is the condition for the exponential distribution.
This equation can also be written the other way round:
MTBF (or MTTF) = 1/λ
For example, if the failure rate is λ = 0.00025 failures/hour, then
MTBF (or MTTF) = 1/0.00025 = 4,000 hours.
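These relationships are easy to check numerically. Below is a minimal Python sketch (using the motor figures from the example above) that computes the failure rate and its reciprocal:

```python
# Failure rate and MTBF from observed totals: five motors, 50 collective
# motor-years of running time, five functional failures.
total_time = 50.0   # T: total running time (years), failed and non-failed items
failures = 5        # r: failures observed during the investigation period

failure_rate = failures / total_time   # λ = r / T
mtbf = 1.0 / failure_rate              # MTBF = 1/λ = T / r

print(failure_rate)  # 0.1 failures per year
print(mtbf)          # 10.0 years
```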
c) Mean Time to Repair (MTTR): Mean time to repair (MTTR) is defined as the total
amount of time spent performing all corrective or preventative maintenance repairs
divided by the total number of those repairs. It is the expected span of time from a failure
(or shut down) to the repair or maintenance completion. This term is typically only used
with repairable systems.
i. NORMAL DISTRIBUTION
In reliability engineering, the normal distribution primarily applies to measurements of product susceptibility and external stress. This two-parameter distribution is used to describe systems in which failure results from some wear-out effect, as in many mechanical systems. Normal distributions are applied to single-variable continuous data (e.g. heights of plants, weights of lambs, lengths of time, etc.). The normal distribution is the most important
distribution in statistics, since it arises naturally in numerous applications. The key reason is
that large sums of (small) random variables often turn out to be normally distributed.
The normal distribution takes the well-known bell shape and is symmetrical about the mean, with the spread measured by the variance: the larger the variance, the flatter the distribution. The pdf is given by
f(t) = (1/(σ√(2π))) · exp[−(t − μ)²/(2σ²)], −∞ < t < ∞
where μ is the mean value and σ is the standard deviation. The cumulative distribution function (cdf) is
F(t) = ∫_{−∞}^{t} f(s) ds
The reliability function is
R(t) = 1 − F(t) = ∫_{t}^{∞} f(s) ds
There is no closed form solution for the above equation. However, tables for the standard normal density function are readily available and can be used to find probabilities for any normal distribution. If
z = (t − μ)/σ
then
φ(z) = (1/√(2π)) · e^(−z²/2)
This is the so-called standard normal pdf, with a mean value of 0 and a standard deviation of 1. The standardized cdf is given by
Φ(z) = ∫_{−∞}^{z} φ(u) du
where Φ((t − μ)/σ) = F(t) yields the relationship necessary if standard normal tables are to be used.
The hazard function h(t) = f(t)/R(t) for a normal distribution is a monotonically increasing function of t, which can be shown by proving that h′(t) ≥ 0 for all t.
The normal distribution is flexible enough to make it a very useful empirical model. It
can be theoretically derived under assumptions matching many failure mechanisms. Some of
these are corrosion, migration, crack growth, and in general, failures resulting from chemical
reactions or processes. That does not mean that the normal is always the correct model for
these mechanisms, but it does perhaps explain why it has been empirically successful in so
many of these cases.
Example: A component has a normal distribution of failure times with μ = 2000 hours and σ = 100 hours. Find the reliability of the component and the hazard function at 1900 hours.
Solution: The reliability function is related to the standard normal deviate z by
R(t) = P(z > (t − μ)/σ)
so
R(1900) = P(z > (1900 − 2000)/100) = P(z > −1) = Φ(1) = 0.8413
and the hazard function is
h(1900) = f(1900)/R(1900) = (φ(−1)/σ)/R(1900) = (0.2420/100)/0.8413 ≈ 0.0029 failures/hour
Example: A part has a normal distribution of failure times with μ = 40,000 cycles and σ = 2,000 cycles. Find the reliability of the part at 38,000 cycles.
Solution: The reliability at 38,000 cycles is
R(38000) = P(z > (38000 − 40000)/2000) = P(z > −1) = Φ(1) = 0.8413
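As a cross-check, these normal-distribution reliability and hazard values can be reproduced with a short Python sketch, assuming scipy is available (R(t) = 1 − Φ(z) and h(t) = f(t)/R(t)):

```python
# Normal-distribution reliability and hazard for the two worked examples above.
from scipy.stats import norm

def normal_reliability(t, mu, sigma):
    return norm.sf(t, loc=mu, scale=sigma)  # survival function, 1 - CDF

def normal_hazard(t, mu, sigma):
    return norm.pdf(t, loc=mu, scale=sigma) / normal_reliability(t, mu, sigma)

print(normal_reliability(1900, 2000, 100))     # ~0.8413
print(normal_hazard(1900, 2000, 100))          # ~0.0029 failures/hour
print(normal_reliability(38000, 40000, 2000))  # ~0.8413
```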
ii. EXPONENTIAL DISTRIBUTION
The exponential distribution, the most basic and widely used reliability prediction
formula, models machines with the constant failure rate, or the flat section of the bathtub
curve. Most industrial machines spend most of their lives in the constant failure rate, so it is
widely applicable. Below is the basic equation for estimating the reliability of a machine that follows the exponential distribution, where the failure rate is constant as a function of time:
R(t) = e^(−λt)
Where:
R(t) = Reliability estimate for a period of time t
e = Base of the natural logarithms (2.718281828)
λ = Failure rate
In the electric motor example, if you assume a constant failure rate, the likelihood of running a motor for six years without a failure, or the projected reliability, is 55 percent. This is calculated as follows:
R(6) = e^(−0.1 × 6) = 0.5488 ≈ 55%
This is a projection for the population, not a guarantee for any one unit: a given individual from the population could fail on the first day of operation while another individual could last 30 years. That is the nature of probabilistic reliability projections.
A characteristic of the exponential distribution is that the MTBF occurs at the point at which the calculated reliability is 36.78%, i.e., the point at which 63.22% of the machines have
already failed. In our motor example, after 10 years, 63.22% of the motors from a population
of identical motors serving in identical applications can be expected to fail. In other words,
the survival rate is 36.78% of the population.
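A minimal sketch of this model in Python, reproducing the six-year (55 percent) and MTBF-point (36.78 percent) figures from the motor example:

```python
import math

lam = 0.1  # failures per year, from the motor example

def reliability(t, lam):
    """Exponential survival probability R(t) = exp(-λt)."""
    return math.exp(-lam * t)

print(reliability(6, lam))   # ~0.549: the projected six-year reliability
print(reliability(10, lam))  # ~0.368: survival at t = MTBF (36.78%)
```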
The probability density function (pdf), or life distribution, is a mathematical equation
that approximates the failure frequency distribution. It is the pdf, or life frequency
distribution, that yields the familiar bell-shaped curve in the Gaussian, or normal,
distribution. Below is the pdf for the exponential distribution.
f(t) = λe^(−λt)
Where:
f(t) = Life frequency distribution for a given time t (failure density)
e = Base of the natural logarithms (2.718281828)
λ = Failure rate
In our electric motor example, the actual likelihood of failure at three years is calculated as
follows:
f(t) = λe^(−λt)
f(3) = 0.1 × e^(−0.1 × 3) = 0.0741, i.e., about 7.4% per year
In the example, if we assume a constant failure rate, which follows the exponential
distribution, the life distribution, or pdf for the industrial electric motors, is expressed in
Figure 6. The failure rate is constant, but the pdf mathematically assumes failure without replacement, so the population from which failures can occur is continuously reducing, asymptotically approaching zero.
The cumulative distribution function (cdf) is simply the cumulative number of failures
one might expect over a period of time. For the exponential distribution, the failure rate is
constant, so the relative rate at which failed components are added to the cdf remains
constant. However, as the population declines as a result of failure, the actual number of
mathematically estimated failures decreases as a function of the declining population. Much
like the pdf asymptotically approaches zero, the cdf asymptotically approaches one.
The declining failure rate portion of the bathtub curve, which is often called the infant
mortality region, and the wear out region will be discussed in the following section
addressing the versatile Weibull distribution.
Hazard Rate: Sometimes it is difficult to specify the distribution function of T directly from
the physical information that is available. A function found useful in clarifying the
relationship between physical modes of failure and the probability distribution of T is the
conditional density function h(t), called the hazard function or failure rate. The hazard
function for the exponential distribution is given as:
h(t) = f(t)/R(t) = λe^(−λt) / e^(−λt) = λ
For a constant failure rate, the hazard rate is also constant and equal to the failure rate:
h(t) = λ
Notice that the hazard function is not a function of time; it is in fact a constant equal to λ.
iii. WEIBULL DISTRIBUTION
Many data sets will exhibit two or even three distinct regions. It is common for reliability engineers to plot, for example, one curve representing the shape parameter during run-in (the infant mortality period), another curve representing the constant or gradually increasing failure rate, and a third distinct linear slope identifying the wear-out region. In these instances, the pdf of the failure data does in fact assume the familiar bathtub curve shape.
The 3-parameter Weibull pdf is given by:
f(t) = (β/η) · ((t − γ)/η)^(β−1) · e^(−((t − γ)/η)^β)
where
f(t) ≥ 0; t ≥ γ; β > 0; η > 0; −∞ < γ < +∞
η : scale parameter, or characteristic life
β : shape parameter (or slope)
γ : location parameter (or failure-free life)
Frequently, the location parameter is not used, and its value can be set to zero, which reduces the pdf to the 2-parameter form:
f(t) = (β/η) · (t/η)^(β−1) · e^(−(t/η)^β)
There is also a form of the Weibull distribution known as the 1-parameter Weibull distribution. This in fact takes the same form as the 2-parameter Weibull pdf, the only difference being that the value of β is assumed to be known beforehand. This assumption means that only the scale parameter η needs to be estimated, allowing for analysis of small data sets. It is recommended that the analyst have a very good and justifiable estimate for β before using the 1-parameter Weibull distribution for analysis.
The Weibull reliability, cdf and hazard functions are:
R(t) = e^(−((t − γ)/η)^β)
F(t) = 1 − e^(−((t − γ)/η)^β)
h(t) = (β/η) · ((t − γ)/η)^(β−1)
Example: The failure time of a component follows a Weibull distribution with shape parameter β = 1.5 and scale parameter η = 10,000 h. When should the component be replaced if the minimum required reliability for the component is 0.95?
Solution:
R(t) = e^(−(t/η)^β)
0.95 = e^(−(t/10000)^1.5)
(t/10000)^1.5 = −ln(0.95) = 0.05129
t = 10000 × (0.05129)^(1/1.5)
t = 1380.38 h
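The same replacement time can be obtained by inverting the Weibull reliability function directly; a minimal Python sketch with the example's parameters:

```python
import math

beta, eta = 1.5, 10_000.0  # shape and scale from the example above
r_min = 0.95               # minimum acceptable reliability

# Invert R(t) = exp(-(t/eta)**beta) for t:
t_replace = eta * (-math.log(r_min)) ** (1.0 / beta)
print(t_replace)  # ~1380.4 h

# Sanity check: reliability at the computed replacement time.
print(math.exp(-(t_replace / eta) ** beta))  # 0.95
```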
Example: The failure time of a certain component has a Weibull distribution with β = 4, η = 2000, and γ = 1000. Find the reliability of the component and the hazard rate for an operating time of 1500 hours.
Solution: Direct substitution into the equations above yields
R(1500) = e^(−((1500 − 1000)/2000)^4) = e^(−0.0039) = 0.9961
h(1500) = (4/2000) · ((1500 − 1000)/2000)^3 = 3.125 × 10⁻⁵ failures/hour
Note that the Rayleigh and exponential distributions are special cases of the Weibull distribution, at β = 2, γ = 0 and β = 1, γ = 0, respectively. For example, when β = 1 and γ = 0, the reliability of the Weibull distribution reduces to
R(t) = e^(−t/η)
and the hazard function reduces to 1/η, a constant. Thus, the exponential is a special case of the Weibull distribution. Similarly, when γ = 0 and β = 2, the Weibull probability density function becomes the Rayleigh density function:
f(t) = (2t/η²) · e^(−(t/η)²)
iv. GAMMA DISTRIBUTION
The gamma distribution can be used as a failure probability function for components whose failure distribution is skewed. The failure density function for a gamma distribution is
f(t) = t^(α−1) · e^(−t/θ) / (Γ(α) · θ^α), t ≥ 0
and the reliability function is
R(t) = ∫_{t}^{∞} f(s) ds
The gamma density function has shapes that are very similar to the Weibull distribution. At α = 1, the gamma distribution becomes the exponential distribution with constant failure rate 1/θ. The gamma distribution can also be used to model the time to the nth failure of a system if the underlying failure distribution is exponential. Thus, if Xi is exponentially distributed with parameter λ = 1/θ, then T = X1 + X2 + … + Xn is gamma distributed with parameters θ and n.
The gamma model is a flexible lifetime model that may offer a good fit to some sets
of failure data. It is not, however, widely used as a lifetime distribution model for common
failure mechanisms. A common use of the gamma lifetime model occurs in Bayesian
reliability applications.
Example: The time to failure of a component has a gamma distribution with α = 3 and θ = 5. Determine the reliability of the component and the hazard rate at 10 time-units.
Solution: For integer α, R(t) = e^(−t/θ) · Σ_{k=0}^{α−1} (t/θ)^k / k!, so we compute
R(10) = e^(−10/5) · (1 + 2 + 2²/2) = 5e^(−2) = 0.677
h(10) = f(10)/R(10) = [10² · e^(−2) / (Γ(3) · 5³)] / 0.677 = 0.0541/0.677 = 0.08 failures per time-unit
The other form of the gamma probability density function can be written as follows:
f(t) = λ · (λt)^(α−1) · e^(−λt) / Γ(α), t ≥ 0
This pdf is characterized by two parameters: shape parameter α and scale parameter λ. When 0 < α < 1, the failure rate monotonically decreases; when α > 1, the failure rate monotonically increases; when α = 1, the failure rate is constant. The mean, variance and reliability of this density function are, respectively,
mean = α/λ
variance = α/λ²
R(t) = ∫_{t}^{∞} λ · (λs)^(α−1) · e^(−λs) / Γ(α) ds
Example: A mechanical system's time to failure is gamma distributed with α = 3 and 1/λ = 120. Find the system reliability at 280 hours.
Solution: With λt = 280/120 = 2.333, the system reliability at 280 hours is
R(280) = e^(−2.333) · (1 + 2.333 + 2.333²/2) = 0.0970 × 6.056 = 0.587
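Both gamma examples can be verified with scipy's gamma survival function; a sketch assuming scipy is available (scipy uses a = shape α and scale = θ = 1/λ):

```python
from scipy.stats import gamma

# Example 1: α = 3, θ = 5, t = 10
print(gamma.sf(10, a=3, scale=5))    # ~0.677
print(gamma.pdf(10, a=3, scale=5)
      / gamma.sf(10, a=3, scale=5))  # hazard ~0.08 per time-unit

# Example 2: α = 3, 1/λ = 120, t = 280
print(gamma.sf(280, a=3, scale=120)) # ~0.587
```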
v. LOG NORMAL DISTRIBUTION
The log normal pdf is given by
f(t) = (1/(σt√(2π))) · exp[−(ln t − μ)²/(2σ²)], t > 0
where μ and σ are parameters such that −∞ < μ < ∞ and σ > 0. Note that μ and σ are not the mean and standard deviation of the distribution, as they are in the normal distribution.
The relationship to the normal (just take natural logarithms of all the data and time
points and you have normal data) makes it easy to work with many good software analysis
programs available to treat normal data.
Mathematically, if a random variable X is defined as X = ln T, then X is normally distributed with a mean of μ and a variance of σ². That is,
E(X) = E(ln T) = μ
and
V(X) = V(ln T) = σ²
Since T = e^X, the mean of the log normal distribution can be found by using the normal distribution; it is
E(T) = e^(μ + σ²/2)
The log normal lifetime model, like the normal, is flexible enough to make it a very useful empirical model. The figure above shows the reliability of the log normal vs. time. It can be theoretically derived under assumptions matching many failure mechanisms, some of which are corrosion and crack growth, and in general, failures resulting from chemical reactions or processes.
Example: The failure time of a certain component is log normal distributed with μ = 5 and σ = 1. Find the reliability of the component and the hazard rate for a life of 50 time units.
Solution: Substituting the numerical values of μ, σ, and t into the reliability equation, we compute
R(50) = 1 − Φ((ln 50 − 5)/1) = 1 − Φ(−1.09) = Φ(1.09) = 0.862
h(50) = f(50)/R(50) = [φ(−1.09)/(1 × 50)] / 0.862 ≈ 0.0051 failures per time-unit
Thus, values for the log normal distribution are easily computed by using the standard normal tables.
Example: The failure time of a part is log normal distributed with μ = 6 and σ = 2. Find the part reliability for a life of 200 time units.
Solution: The reliability of the part at 200 time units is
R(200) = 1 − Φ((ln 200 − 6)/2) = 1 − Φ(−0.35) = Φ(0.35) = 0.637
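The same log normal computations in a short Python sketch, assuming scipy is available (scipy parameterizes the log normal with s = σ and scale = e^μ):

```python
import math
from scipy.stats import lognorm

def lognormal_reliability(t, mu, sigma):
    return lognorm.sf(t, s=sigma, scale=math.exp(mu))  # 1 - F(t)

print(lognormal_reliability(50, mu=5, sigma=1))   # ~0.862
print(lognormal_reliability(200, mu=6, sigma=2))  # ~0.637
```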
Maintainability
Maintainability is defined as the probability that a device will be restored to its
operational effectiveness within the given period when maintenance action is performed in
accordance with the prescribed procedure. Maintenance action is the prescribed operation to
correct an equipment failure.
Repairable and Non-repairable Items
It is important to distinguish between repairable and non-repairable items when predicting or
measuring reliability.
Decreasing Failure Rate (Non-repairable Items): A decreasing failure rate (DFR) can be
caused by an item, which becomes less likely to fail as the survival time increases. This is
demonstrated by electronic equipment during their early life or the burn-in period. This is
demonstrated by the first half of the traditional bath tub curve for electronic components
or equipment where failure rate is decreasing during the early life period.
Constant Failure Rate (Non-repairable Items): A constant failure rate (CFR) can be
caused by the application of loads at a constant average rate in excess of the design
specifications or strength. These are typically externally induced failures.
Increasing Failure Rate (Non-repairable Items): An increasing failure rate (IFR) can be caused by material fatigue or by strength deterioration due to cyclic loading. The failure mode does not occur for a finite time, and then exhibits an increasing probability of occurrence.
Reliability demonstration is concerned with demonstrating the reliability requirement at the desired confidence level. The most commonly used tool for this purpose is the Operating Characteristic (OC) Curve. The figure below provides a sample OC curve. This OC curve is generated for a fixed-configuration test and displays the relationship between the probability of acceptance and MTBF, based on test duration and the acceptable number of failures. The OC curve is a tool to determine the probability of acceptance of a test plan corresponding to a given reliability requirement. The OC curve is used to quantify the consumer risk and producer risk associated with a given MTBF value for the associated test plan.
Reliability Risks: There are two types of decision risks which are of significant importance
during the demonstration of reliability requirements. These risks are called Consumer Risk
and Producer Risk.
i. Consumer risk: The probability that a level of system reliability at or below the
requirement will be found to be acceptable due to statistical chance. This is depicted
on the operational characteristic curve. We should endeavor to quantify and manage
consumer risk because reliability below the requirement results in reduced mission
reliability and increased support costs.
ii. Producer risk: The probability that a level of system reliability that meets or exceeds
the reliability goal will be deemed unacceptable due to statistical chance. This risk is
also depicted in the figure above. If the system is incorrectly deemed unsuitable,
major cost and schedule impacts to the acquisition program may result.
An appropriate balance between the consumer risk and the producer risk is important to
determine test duration/number of trials. If the consumer risk and producer risk are not
balanced appropriately, the test duration/number of trials may be too short/small or too
long/large. If the test duration/number of trials is too short/small, the reliability goal (target)
for the test will be higher (test reliability requirement is inversely proportional to the test
duration/number of trials). For short/small test duration/number of trials, one or both risks
may be too high. If the test duration/number of trials is too long/large, it may be very costly
to perform the test. The cost factor may lead to an unacceptable program burden.
The probability of acceptance, P(A), can be represented by the cumulative binomial distribution:
P(A) = Σ_{f=0}^{c} [n! / (f! (n − f)!)] · (1 − R)^f · R^(n−f)
where n! / (f! (n − f)!) is the binomial coefficient, the number of ways f failures can occur in n trials.
This gives the probability that the number of failures observed during the test, f, is
less than or equal to the acceptance number, c, which is the number of allowable failures in n
trials. Each trial has a probability of succeeding of R, where R is the reliability of each unit
under test. The reliability OC curve is developed by evaluating the above equation for various
values of R.
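A minimal Python sketch of this evaluation (the plan parameters n = 20 and c = 2 below are illustrative assumptions, not taken from the text):

```python
from scipy.stats import binom

def prob_acceptance(r, n, c):
    """P(A): probability of c or fewer failures in n trials, each failing with probability 1 - r."""
    return binom.cdf(c, n, 1.0 - r)

# Sweep the unit reliability R to trace the OC curve for a hypothetical 20-trial, c = 2 plan.
for r in (0.80, 0.85, 0.90, 0.95, 0.99):
    print(r, round(prob_acceptance(r, n=20, c=2), 3))
```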
The Poisson distribution can be used for large values of n:
P(X = x) = (λT)^x · e^(−λT) / x!
where λT is the expected number of failures during the test.
The OC curve represents the probability of acceptance for a given mean life. An OC curve may be constructed showing the probability of acceptance as a function of average life, θ. In this case, the sampling plan may be defined with:
a number of hours of test, T, and
an acceptance number, c
A major assumption is that a failed item will be replaced by a good item.
Consider a sampling plan with a test time, T, and an acceptance number, c. For each average life, θ:
Compute the failure rate per hour, λ = 1/θ
Compute the expected number of failures during the test, m = λT
Compute Pa = P(c or fewer failures) = 1 − P(c + 1 or more failures when the mean number of failures is m). This can be obtained using the Poisson equation or the cumulative Poisson table from a statistical data book.
Example: In one of the plans, 10 items were to be tested for 5000 hours with replacement and
with an acceptance number of 1. Plot an OC curve showing probability of acceptance as a
function of average life.
Solution: Given:
Duration of the test, T = 5000
c=1
Step 1: Create a column for mean life, θ: 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000 hours.
(You can also create the first column from assumed values of λT, e.g. 0.05, 0.10, 0.15, etc., for 8 to 10 rows.)
Step 2: Calculate λ = 1/θ:
Mean Life (θ)   Failure Rate, λ
1000            0.001
2000            0.0005
3000            0.0003
4000            0.00025
5000            0.0002
6000            0.00017
7000            0.00014
8000            0.00012
Step 3: Calculate the expected number of failures, m = λT:
Mean Life (θ)   Failure Rate, λ   Expected no. of failures, m
1000            0.001             5
2000            0.0005            2.5
3000            0.0003            1.5
4000            0.00025           1.25
5000            0.0002            1
6000            0.00017           0.85
7000            0.00014           0.7
8000            0.00012           0.6
Step 4: Calculate Pa using the Poisson distribution for c = 1:
Pa = P(X ≤ 1) = P(X = 0) + P(X = 1)
For example, when θ = 1000, m = λT = 5:
Pa = (5^0 · e^(−5))/0! + (5^1 · e^(−5))/1! = e^(−5) · (1 + 5) = 0.041 = 4.1%
Repeating this for each row gives:
Mean Life (θ)   Failure Rate, λ   Expected no. of failures, m   Probability of acceptance, Pa
1000            0.001             5                             0.041
2000            0.0005            2.5                           0.287
3000            0.0003            1.5                           0.558
4000            0.00025           1.25                          0.644
5000            0.0002            1                             0.736
6000            0.00017           0.85                          0.790
7000            0.00014           0.7                           0.845
8000            0.00012           0.6                           0.878
Step 5: Plot the graph with Pa on the Y axis and θ on the X axis.
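The whole table can be generated in a few lines of Python, assuming scipy is available. (Note that the table above uses rounded failure rates, so recomputing with exact λ = 1/θ gives slightly different Pa values for some rows.)

```python
from scipy.stats import poisson

T, c = 5000, 1  # test duration (with replacement) and acceptance number

for theta in range(1000, 9000, 1000):  # mean life values
    lam = 1.0 / theta                  # failure rate λ = 1/θ
    m = lam * T                        # expected number of failures during the test
    pa = poisson.cdf(c, m)             # P(c or fewer failures)
    print(theta, round(m, 2), round(pa, 3))
```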
Series systems: The simplest reliability model is a series model, where all the components must be working for the system to be successful.
To calculate the system reliability for a series process, you multiply the estimated reliability of Subsystem A at time t by the estimated reliability of Subsystem B at time t, and so on. The basic equation for calculating the system reliability of a simple series system is:
RS = RA × RB × … × RZ
For constant failure rates, the failure rate of the system is calculated by adding the subsystem failure rates together, i.e.
λS = λA + λB + … + λZ
Example: So, for a simple system with three subsystems, or sub-functions, each
having an estimated reliability of 0.90 (90%) at time (t), the system reliability is
calculated as 0.90 X 0.90 X 0.90 = 0.729, or about 73%.
To calculate the reliability of an active parallel system, where both machines are running, use the following simple equation:
RS = 1 − (1 − RA) × (1 − RB)
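Both rules reduce to one line of arithmetic each; a minimal Python sketch assuming independent subsystems, reproducing the three-subsystem series example above and the two-unit parallel case from Problem 4 below:

```python
import math

def series_reliability(*rs):
    """Series system: all subsystems must work; R_S is the product of the R_i."""
    return math.prod(rs)

def parallel_reliability(*rs):
    """Active parallel system: it fails only if every redundant unit fails."""
    return 1.0 - math.prod(1.0 - r for r in rs)

print(series_reliability(0.90, 0.90, 0.90))  # 0.729, the three-subsystem example
print(parallel_reliability(0.99, 0.99))      # 0.9999, two redundant units (Problem 4)
```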
Problem 1: A certain type of electronic component has a uniform failure rate of 0.00001 per
hour. What is the reliability for a specified period of service of 10000 hours?
Solution:
Given:
λ = 0.00001 per hour
t = 10000 hours
R(t) = e^(−λt) = e^(−0.00001 × 10000) = e^(−0.1) = 0.90483 = 90.483%
Problem 2: Given an MTTF of 5000 hours and a uniform failure rate, what is the reliability associated with a specified service period of 200 hours?
Solution:
Given:
θ = 5000 hours
t = 200 hours
λ = 1/θ = 1/5000 = 0.0002 per hour
R(t) = e^(−λt) = e^(−0.0002 × 200) = 0.96079 = 96.079%
Problem 3: The following reliability requirements have been set on the sub-systems of a
communication system:
Sub-System       Reliability (for a 4-hour period)
Receiver         0.970
Control system   0.989
Power supply     0.995
Antenna          0.996
What is the expected reliability of the overall system?
Solution: R(system) = R1 × R2 × R3 × R4 = 0.970 × 0.989 × 0.995 × 0.996 = 0.950 (95%)
The chance that the overall system will perform its function without failure for
a 4 hour period is 95%.
Problem 4: A unit has a reliability of 0.99 for a specified mission time. If 2 identical units are used in parallel redundancy, what overall reliability will be obtained?
Solution:
Rs(t) = 1 − {1 − R1(t)}^n = 1 − {1 − 0.99}² = 1 − 0.0001 = 0.9999 or 99.99%
Problem 5: An industrial machine compresses natural gas into an interstate gas pipeline. The
compressor is on line 24 hours a day. (If the machine is down, a gas field has to be shutdown
until the natural gas can be compressed, so down time is very expensive.) The vendor knows
that the compressor has a constant failure rate of 0.000001 failures/hr. What is the operational
reliability after 2500 hours of continuous service?
Solution:
The compressor has a constant failure rate and therefore the reliability follows the exponential distribution: R(t) = e^(−λt)
Given:
Failure rate λ = 0.000001 failures/hr
Operational time t = 2500 hours
Reliability R(2500) = e^(−0.000001 × 2500) = 0.9975 or 99.75%
Problem 6: Suppose that a component we wish to model has a constant failure rate with a mean time between failures of 25 hours. Find:
(a) The reliability function.
(b) The reliability of the item at 30 hours.
Solution:
Since the failure rate is constant, we will use the exponential distribution. Also, the MTBF = 25 hours. We know that, for an exponential distribution, MTBF = 1/λ.
Therefore λ = 1/25 = 0.04 per hour
(a) The reliability function is given by: R(t) = e^(−λt) = e^(−0.04t)
(b) The reliability of the item at 30 hours = e^(−0.04 × 30) = 0.3012
Problem 7: A certain electronic component has an exponential failure time with a mean of 50 hours.
(a) What is the failure rate of this component?
(b) What is the reliability of this component at 100 hours?
(c) What is the minimum number of these components that should be placed in parallel if we desire a reliability of 0.90 at 100 hours? (The idea of placing extra components in parallel is to provide a backup if the first component fails.)
Solution:
(a) λ = 1/50 = 0.02 per hour
(b) R(100) = e^(−0.02 × 100) = e^(−2) = 0.1353 (which is not very good)
(c) The parallel system will only fail if all components fail. The probability of each failing is 1 − 0.1353 = 0.8647.
If there are n parallel components, we need
1 − 0.8647^n ≥ 0.9
0.8647^n ≤ 0.1
By trial and error, n = 16, so we need 16 components in parallel.
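Rather than trial and error, n can be computed directly by taking logarithms of 0.8647^n ≤ 0.1; a small Python sketch:

```python
import math

r_unit = math.exp(-0.02 * 100)  # single-component reliability at 100 h (~0.1353)
p_fail = 1.0 - r_unit           # ~0.8647

# Smallest n with 1 - p_fail**n >= 0.90, i.e. p_fail**n <= 0.10:
n = math.ceil(math.log(0.10) / math.log(p_fail))
print(n)  # 16
```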
RELIABILITY TOOLS AND TECHNIQUES
Some of the tools that are useful during the design stage can be thought of as tools for fault avoidance. They fall into two general methods: bottom-up and top-down.
I. Top-down method
Undesirable single event or system success at the highest level of interest (the top event)
should be defined.
Contributory causes of that event at all levels are then identified and analysed.
Start at highest level of interest to successively lower levels
Event-oriented method
Useful during the early conceptual phase of system design
Used for evaluating multiple failures, including sequentially related failures and common-cause events
Some examples of top-down methods include: Fault tree analysis (FTA) & Reliability
block diagram (RBD)
a. Fault tree analysis
Fault tree analysis is a systematic way of identifying all possible faults that could lead to a system fail-danger failure. The FTA provides a concise description of the various
combinations of possible occurrences within the system that can result in predetermined
critical output events. The FTA helps identify and evaluate critical components, fault paths,
and possible errors. It is both a reliability and safety engineering task, and it is a critical data
item that is submitted to the customer for their approval and their use in their higher-level
FTA and safety analysis. The key elements of a FTA include:
Gates represent the outcome
Events represent input to the gates
Cut sets are groups of events that would cause a system to fail
FTA can be done qualitatively by drawing the tree and identifying all the basic events. However, to quantify the probability of the top event, probabilities or reliability figures must be input for the basic events. Using the gate logic, these probabilities are worked up to give the probability that the top event will occur. Often the data from an FMEA are used in conjunction with an FTA.
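For independent basic events, this roll-up of probabilities through the gates is simple arithmetic; a minimal Python sketch with made-up basic-event probabilities (purely illustrative, not from the text):

```python
import math

def and_gate(*ps):
    """AND gate: the output event occurs only if all inputs occur."""
    return math.prod(ps)

def or_gate(*ps):
    """OR gate: the output event occurs if any input occurs (independence assumed)."""
    return 1.0 - math.prod(1.0 - p for p in ps)

# Hypothetical basic-event probabilities for illustration only.
pump_fails, backup_fails, valve_fails = 0.01, 0.05, 0.02
p_top = or_gate(and_gate(pump_fails, backup_fails), valve_fails)
print(p_top)  # probability of the undesired top event
```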
The following table shows the flowchart symbols that are used in fault tree analysis in
order to aid with the correct reading of the fault tree.
A rectangle signifies a fault or undesired event caused by one or more preceding causes acting through logic gates.
A circle signifies a primary failure or basic fault that requires no further development.
A diamond denotes a secondary failure or undesired event that is not developed further.
An AND gate denotes that a failure will occur if all inputs fail (parallel redundancy).
An OR gate denotes that a failure will occur if any input fails (series reliability).
[Figure: FTA example, including a transfer event]
II. Bottom-up method
Identify fault modes at the component level.
For each fault mode the corresponding effect on performance is deduced for the
next higher system level.
The resulting fault effect becomes the fault mode at the next higher system level,
and so on.
Successive iterations result in the eventual identification of the fault effects at all
functional levels up to the system level.
Some examples of bottom-up methods include: Event tree analysis (ETA); FMEA and
Hazard and operability study (HAZOP).
a. Event tree analysis
Considers a number of possible consequences of an initiating event or a system
failure.
May be combined with a fault tree.
Used when it is essential to investigate all possible paths of consequent events and their sequence.
Analysis can become very involved and complicated when analysing larger systems.
Limitations include:
The output data may be large even for relatively simple systems.
May become complicated and unmanageable unless there is a fairly direct (or "single-chain") relationship between cause and effect.
May not easily deal with time sequences, restoration processes, environmental conditions, maintenance aspects, etc.
Prioritizing mode criticality is complicated by the competing factors involved.
Accelerated life testing applies elevated stresses, which improves the chances of failure occurring in a shorter period of time. This also means that a smaller sample population of devices can be tested with an increased probability of finding failure. Stress testing amplifies unreliability so failure can be detected sooner. Accelerated life tests are also used extensively to help make predictions. Predictions can be limited when testing small sample sizes, and can be erroneously based on the assumption that life-test results are representative of the entire population. Therefore, it can be difficult to design an efficient experiment that yields enough failures so that the measures of uncertainty in the predictions are not too large. Stresses can also be unrealistic. Fortunately, it is generally rare for an increased stress to cause anomalous failures, especially if common sense guidelines are observed.
Anomalous testing failures can occur when testing pushes the limits of the material out of
the region of the intended design capability. The natural question to ask is: What should the
guidelines be for designing proper accelerated tests and evaluating failures? The answer is:
Judgment is required by management and engineering staff to make the correct decisions in
this regard. To aid such decisions, the following guidelines are provided:
Always refer to the literature to see what has been done in the area of accelerated
testing.
Avoid accelerated stresses that cause nonlinearities, unless such stresses are
plausible in product-use conditions. Anomalous failures occur when accelerated stress
causes nonlinearities in the product. For example, material changing phases from
solid to liquid, as in a chemical nonlinear phase transition (e.g., solder melting,
inter-metallic changes, etc.); an electric spark in a material is an electrical
nonlinearity; material breakage compared to material flexing is a mechanical
nonlinearity.
Tests can be designed in two ways: by avoiding high stresses or by allowing them,
which may or may not cause nonlinear stresses. In the latter test design, a concurrent
engineering design team reviews all failures and decides if a failure is anomalous or
not. Then a decision is made whether or not to fix the problem. Conservative
decisions may result in fixing some anomalous failures. This is not a concern when
time and money permit fixing all problems. The problem occurs when normal failures
are labeled incorrectly as anomalous and no corrective action is taken.
Accelerated life testing is normally done early in the design process as a method of testing for fitness for purpose. It can be done at the component level or the sub-assembly level,
but is rarely done at a system level as there are usually too many parts and factors that can
cause failures and these can be difficult to control and monitor.
Step-Stress Testing is an alternative test; it usually involves a small sample of devices exposed to a series of successively higher steps of stress. At the end of each stress level, measurements are made to assess the effect on the device. The measurements could simply assess whether a catastrophic failure has occurred, or could measure the resulting parameter shift due to the step stress. Constant time periods are commonly used for each step-stress period, as this provides for simpler data analysis. There are a number of reasons for performing a step-stress test, including:
Aging information can be obtained in a relatively short period of time. Common step-stress tests take about 1 to 2 weeks, depending on the objective.
Step-stress tests establish a baseline for future tests. For example, if a process
changes, quick comparisons can be made between the old process and the new
process. Accuracy can be enhanced when parametric change can be used as a measure
for comparison. Otherwise, catastrophic information is used.
Failure mechanisms and design weaknesses can be identified along with material
limitations. Failure-mode information can provide opportunities for reliability growth.
Fixes can then be put back on test and compared to previous test results to assess fix
effectiveness.
Data analysis can provide accurate information on the stress distribution in which the
median-failure stress and stress standard deviation can be obtained.
It has been shown that a well-managed reliability growth programme, as discussed earlier, can avoid the need for demonstration testing, as it concentrates on how to improve the product. It has also been argued that the benefit to the product in terms of improved reliability from PRST methods is sometimes questionable.
No matter what the method, reliability growth planning is essential to avoid wasting time and money when accelerated testing is attempted without an organized program plan.
The table below summarizes how different tests fit into the product life cycle.

Accelerated test or method                      Stage of product life cycle
Reliability Growth or Reliability Enhancement   Design and development
HALT (Highly Accelerated Life Test)             Design and development
Step-Stress Test                                Design and development of units or components
Failure-Free Test or demonstration test         Post design
ESS (Environmental Stress Screening)            Production
HASS (Highly Accelerated Stress Screen)         Production