
Reliability Modeling
The RIAC Guide to Reliability Prediction, Assessment and Estimation

(Cover equations: interval-data likelihood, temperature/stress acceleration model, and cumulative Poisson confidence level)

$$L\left[(T_{a_1},T_{b_1}),\ldots,(T_{a_L},T_{b_L})\right] = \prod_{i=1}^{L}\left[F(T_{b_i}) - F(T_{a_i})\right]$$

$$\lambda = b\, e^{-\frac{E_a}{KT}} S^{n}$$

$$1 - CL = \sum_{k=0}^{r} \frac{(\lambda t)^k}{k!}\, e^{-\lambda t} = e^{-\lambda t}\left[1 + \lambda t + \cdots + \frac{(\lambda t)^{r-1}}{(r-1)!} + \frac{(\lambda t)^{r}}{r!}\right]$$
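To make these cover equations concrete, the following is a brief illustrative sketch (not from the book; every numeric value below is an assumed example) that evaluates the acceleration model and the cumulative Poisson confidence relationship in Python:

import math

def acceleration_model_rate(b, Ea, T, S, n, K=8.617e-5):
    # lambda = b * exp(-Ea/(K*T)) * S^n
    # b (fitted constant), Ea (activation energy, eV) and n (stress
    # exponent) are assumed example values, not values from the book.
    # K is Boltzmann's constant in eV/K; T is absolute temperature in K.
    return b * math.exp(-Ea / (K * T)) * S**n

def poisson_confidence(r, lam, t):
    # 1 - CL = sum_{k=0}^{r} (lam*t)^k / k! * exp(-lam*t)
    # Returns CL: the confidence that the true failure rate is below lam,
    # given r failures observed in cumulative time t.
    x = lam * t
    return 1.0 - math.exp(-x) * sum(x**k / math.factorial(k) for k in range(r + 1))

# Example: with 2 failures in 1,000,000 hours, a rate of 5 failures per
# million hours is an upper bound at roughly 87.5% confidence.
print(round(poisson_confidence(2, 5e-6, 1e6), 3))            # 0.875
print(acceleration_model_rate(b=1e3, Ea=0.7, T=358.0, S=2.0, n=2.5))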

RIAC is a DoD Information Analysis Center sponsored by the Defense Technical Information Center. RIAC is operated by a
team of Wyle Laboratories, Quanterion Solutions, the University of Maryland, the Penn State University Applied Research
Laboratory and the State University of New York Institute of Technology.

Ordering No.: RPAE

Reliability Modeling: The RIAC Guide to Reliability Prediction, Assessment and Estimation
Prepared by:
Reliability Information Analysis Center
6000 Flanagan Rd.
Suite 3
Utica, NY 13502-1348
Under Contract to:
Defense Technical Information Center
DTIC-AI
8725 John J. Kingman Rd.
Suite 0944
Fort Belvoir, VA 22060

RIAC is a DoD Information Analysis Center sponsored by the Defense


Technical Information Center. RIAC is operated by a team of Wyle
Laboratories, Quanterion Solutions Inc., the University of Maryland, the
Penn State University Applied Research Laboratory and the State University
of New York Institute of Technology.

The information and data contained herein have been compiled from
government and nongovernment technical reports and from material
supplied by various manufacturers and are intended to be used for reference
purposes. Neither the United States Government nor the Wyle Laboratories
contract team warrant the accuracy of this information and data. The user is
further cautioned that the data contained herein may not be used in lieu of
other contractually cited references and specifications.
Publication of this information is not an expression of the opinion of The
United States Government or of the Wyle Laboratories contract team as to
the quality or durability of any product mentioned herein and any use for
advertising or promotional purposes of this information in conjunction with
the name of The United States Government or the Wyle Laboratories
contract team without written permission is expressly prohibited.

ISBN-10: 1-933904-17-8 (Hardcopy)
ISBN-13: 978-1-933904-17-7 (Hardcopy)
ISBN-10: 1-933904-18-6 (PDF Download)
ISBN-13: 978-1-933904-18-4 (PDF Download)

REPORT DOCUMENTATION PAGE
Form Approved OMB No. 0704-0188

Public reporting burden for this collection is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Department of Defense, Washington Headquarters Services, Directorate for Information Operations and Reports (0704-0188), 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to any penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number.
PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS.

1. REPORT DATE: 31 May 2010
2. REPORT TYPE: Technical
3. DATES COVERED (From - To): N/A
4. TITLE AND SUBTITLE: Reliability Modeling: The RIAC Guide to Reliability Prediction, Assessment and Estimation
5a. CONTRACT NUMBER: HC1047-05-D-4005
5b. GRANT NUMBER: N/A
5c. PROGRAM ELEMENT NUMBER: N/A
5d. PROJECT NUMBER: N/A
5e. TASK NUMBER: N/A
5f. WORK UNIT NUMBER: N/A
6. AUTHORS: William Denson
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Reliability Information Analysis Center, 100 Sherman Rd., Suite C101, Utica, NY 13502-1348
8. PERFORMING ORGANIZATION REPORT NUMBER: RPAE
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): Defense Technical Information Center, DTIC-AI, 8725 John J. Kingman Rd., STE 0944, Ft. Belvoir, VA 22060; Air Force Research Lab/RISE, 525 Brooks Rd., Rome, NY 13440
10. SPONSOR/MONITOR'S ACRONYM(S): DTIC-AI and AFRL/RISE
11. SPONSOR/MONITOR'S REPORT NUMBER(S): N/A
12. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release, distribution unlimited.
13. SUPPLEMENTARY NOTES: Hardcopies available from Reliability Information Analysis Center, 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 (Price: $85 US/$95 Non-US). PDF download available from http://theRIAC.org (Price: $70).
14. ABSTRACT: The intent of this book is to provide guidance on modeling techniques that can be used to quantify the reliability of a product or system. In this context, reliability modeling is the process of constructing a mathematical model that is used to estimate the reliability characteristics of a product. There are many ways in which this can be accomplished, depending on the product or system and the type of information that is available to, or practical to obtain by, the analyst. This book will review possible approaches, summarize their advantages and disadvantages, and provide guidance on selecting a methodology based on the specific goals and constraints of the analyst. While this book will not discuss the use of specific published methodologies, in cases where examples are provided, tools and methodologies with which the author has personal experience in their development are used, such as life modeling, NPRD, MIL-HDBK-217 and 217Plus.
15. SUBJECT TERMS: Reliability Modeling, Reliability Prediction, Reliability Assessment, Reliability Estimation, NPRD, MIL-HDBK-217, 217Plus
16. SECURITY CLASSIFICATION OF: a. REPORT: UNCLASSIFIED; b. ABSTRACT: UNCLASSIFIED; c. THIS PAGE: UNCLASSIFIED
17. LIMITATION OF ABSTRACT: UNLIMITED
18. NUMBER OF PAGES: 410
19a. NAME OF RESPONSIBLE PERSON: David Nicholls
19b. TELEPHONE NUMBER (include area code): 315.351.4202

Standard Form 298 (Rev. 8/98)
Prescribed by ANSI Std. Z39.18

The Reliability Information Analysis Center (RIAC), formerly the Reliability Analysis Center (RAC),
is a Department of Defense Information Analysis Center sponsored by the Defense Technical
Information Center, managed by the Air Force Research Laboratory (formerly Rome Laboratory), and
operated by a team of Wyle Laboratories, Quanterion Solutions, the University of Maryland, the Penn
State University Applied Research Laboratory and the State University of New York Institute of
Technology. RIAC is chartered to collect, analyze and disseminate reliability, maintainability,
quality, supportability and interoperability (RMQSI) information pertaining to systems and products,
as well as the components used in them. The RIAC addresses both military and commercial
perspectives.
The data contained in the RIAC databases is collected on a continuous basis from a broad range of
sources, including testing laboratories, device and equipment manufacturers, government laboratories
and equipment users (government and industry). Automatic distribution lists, voluntary data
submittals and field failure reporting systems supplement an intensive data solicitation program.
Users of RIAC are encouraged to submit their RMQSI data to enhance these data collection efforts.
RIAC publishes documents for its users in a variety of formats and subject areas. While most are
intended to meet the needs of RMQSI practitioners, many are also targeted to managers and designers.
RIAC also offers RMQSI consulting, training and responses to technical and bibliographic inquiries.
REQUESTS FOR TECHNICAL ASSISTANCE AND INFORMATION ON AVAILABLE RIAC SERVICES AND PUBLICATIONS MAY BE DIRECTED TO:

Reliability Information Analysis Center
100 Sherman Rd.
Suite C101
Utica, NY 13502-1348

General Information: (877) 363-RIAC / (877) 363-7422
Technical Inquiries: (315) 351-4200
Fax: (315) 351-4209
E-Mail: inquiry@theRIAC.org
Internet: http://theRIAC.org

ALL OTHER RIAC REQUESTS SHOULD BE DIRECTED TO:

Air Force Research Laboratory
AFRL Systems and Information Interoperability Branch
Attn: R. Hyle
525 Brooks Road
Rome, NY 13441-4505

Telephone: (315) 330-4857
DSN: 587-4857
Fax: (315) 330-7647
E-Mail: Richard.Hyle@rl.af.mil

Copyright 2010 by Quanterion Solutions Incorporated. This handbook was developed by Quanterion
Solutions Incorporated, in support of the prime contractor (Wyle Laboratories) in the operation of the Department
of Defense Reliability Information Analysis Center (RIAC) under Contract HC1047-05-D-4005. The Government
has a fully paid up perpetual license for free use of and access to this publication and its contents among all the
DOD IACs in both hardcopy and electronic versions, without limitation on the number of users or servers. Subject
to the rights of the Government, this document (hardcopy and electronic versions) and the content contained within
it are protected by U.S. Copyright Law and may not be copied, automated, re-sold, or redistributed to multiple
users without the express written permission. The copyrighted work may not be made available on a server for use
by more than one person simultaneously without the express written permission. If automation of the technical
content for other than personal use, or for multiple simultaneous user access to a copyrighted work is desired,
please contact 877.363.RIAC (toll free) or 315.351.4202 for licensing information.

Table of Contents

1. INTRODUCTION ..... 1
    1.1. Scope ..... 2
    1.2. Book Organization ..... 5
    1.3. Reliability Program Elements ..... 7
    1.4. The History of Reliability Prediction ..... 11
    1.5. Acronyms ..... 17
    1.6. References ..... 18
2. GENERAL ASSESSMENT APPROACH ..... 19
    2.1. Define System ..... 20
    2.2. Identify the Purpose of the Model ..... 22
    2.3. Determine the Appropriate Level at Which to Perform the Modeling ..... 25
        2.3.1. Level vs. Data Needed ..... 26
        2.3.2. Using an FMEA as the Basis for a Reliability Model ..... 28
        2.3.3. Model Form vs. Level ..... 34
    2.4. Assess Data Available ..... 36
    2.5. Determine and Execute Appropriate Approach ..... 38
        2.5.1. Empirical ..... 44
            2.5.1.1. Test ..... 44
            2.5.1.2. Field Data ..... 77
        2.5.2. Physics ..... 106
            2.5.2.1. Stress/Strength Modeling ..... 106
            2.5.2.2. First Principles ..... 111
    2.6. Combine Data ..... 114
        2.6.1. Bayesian Inference ..... 121
    2.7. Develop System Model ..... 123
        2.7.1. Monte Carlo Analysis ..... 127
    2.8. References ..... 133
3. FUNDAMENTAL CONCEPTS ..... 135
    3.1. Reliability Theory Concepts ..... 135
    3.2. Probability Concepts ..... 142
        3.2.1. Covariance ..... 142
        3.2.2. Correlation Coefficient ..... 142
        3.2.3. Permutations and Combinations ..... 143
        3.2.4. Mutual Exclusivity ..... 144
        3.2.5. Independent Events ..... 144
        3.2.6. Nonindependent (Dependent) Events ..... 145
        3.2.7. Nonindependent (Dependent) Events: Bayes Theorem ..... 146
        3.2.8. System Models ..... 146
        3.2.9. K-out-of-N Configurations ..... 151
    3.3. Distributions ..... 153
        3.3.1. Exponential ..... 159
        3.3.2. Weibull ..... 160
        3.3.3. Lognormal ..... 166
    3.4. References ..... 169
4. DOE-BASED APPROACHES TO RELIABILITY MODELING ..... 171
    4.1. Determine the Feature to be Assessed ..... 172
    4.2. Determine Factors ..... 172
    4.3. Determine the Factor Levels ..... 172
    4.4. Design the Tests ..... 174
    4.5. Perform Tests and Measurements ..... 180
    4.6. Analyze the Data ..... 181
    4.7. Develop the Life Model ..... 183
    4.8. References ..... 183
5. LIFE DATA MODELING ..... 185
    5.1. Selecting a Distribution ..... 185
    5.2. Parameter Estimation Overview ..... 186
        5.2.1. Closed Form Parameter Approximations ..... 189
        5.2.2. Least Squares Regression ..... 190
        5.2.3. Parameter Estimation Using MLE ..... 192
            5.2.3.1. Brief Historical Remarks ..... 193
            5.2.3.2. Likelihood Function ..... 193
            5.2.3.3. Maximum Likelihood Estimator (MLE) ..... 195
        5.2.4. Confidence Bounds and Uncertainty ..... 198
            5.2.4.1. Confidence Bounds with MLE ..... 198
            5.2.4.2. Confidence Bounds Approximations ..... 199
    5.3. Acceleration Models ..... 206
        5.3.1. Fundamental Acceleration Models ..... 207
            5.3.1.1. Examples ..... 208
        5.3.2. Combined Models ..... 210
        5.3.3. Cumulative Damage Model ..... 214
    5.4. MLE Equations ..... 216
        5.4.1. Likelihood Functions ..... 217
    5.5. References ..... 221
6. INTERPRETATION OF RELIABILITY ESTIMATES ..... 223
    6.1. Bathtub Curve ..... 223
    6.2. Common Cause vs. Special Cause ..... 225
    6.3. Confidence Bounds ..... 238
        6.3.1. Traditional Techniques for Confidence Bounds ..... 238
        6.3.2. Uncertainty in Reliability Prediction Estimates ..... 240
    6.4. Failure Rate vs. pdf ..... 243
    6.5. Practical Aspects of Reliability Assessments ..... 245
    6.6. Weibayes ..... 245
    6.7. Weibull Closure Property ..... 246
    6.8. Estimating Event-Related Reliability ..... 247
    6.9. Combining Different Types of Assessments at Different Levels ..... 248
    6.10. Estimating the Number of Failures ..... 250
    6.11. Calculation of Equivalent Failure Rates ..... 251
    6.12. Failure Rate Units ..... 252
    6.13. Factors to be Considered When Developing Models ..... 253
        6.13.1. Causes of Electronic System Failure ..... 253
        6.13.2. Selection of Factors ..... 255
        6.13.3. Reliability Growth of Components ..... 257
        6.13.4. Relative vs. Absolute Humidity ..... 259
    6.14. Addressing Data with No Failures ..... 259
    6.15. Reliability of Components Used Outside of Their Rating ..... 261
    6.16. References ..... 262
7. EXAMPLES ..... 263
    7.1. MIL-HDBK-217 Model Development Methodology ..... 264
        7.1.1. Identify Possible Variables ..... 266
        7.1.2. Develop Theoretical Model ..... 266
        7.1.3. Collect and QC Data ..... 267
        7.1.4. Correlation Coefficient Analysis ..... 268
        7.1.5. Stepwise Multiple Regression Analysis ..... 270
        7.1.6. Goodness-of-Fit Analysis ..... 271
        7.1.7. Extreme Case Analysis ..... 272
        7.1.8. Model Validation ..... 272
    7.2. 217Plus Reliability Prediction Models ..... 273
        7.2.1. Background ..... 273
        7.2.2. System Reliability Prediction Model ..... 274
            7.2.2.1. 217Plus Background ..... 274
            7.2.2.2. Methodology Overview ..... 277
            7.2.2.3. System Reliability Model ..... 278
            7.2.2.4. Initial Failure Rate Estimate ..... 279
            7.2.2.5. Process Grading Factors ..... 280
            7.2.2.6. Basis Data for the Model ..... 281
            7.2.2.7. Uncertainty in Traditional Approach Estimates ..... 281
            7.2.2.8. System Failure Causes ..... 282
            7.2.2.9. Environmental Factor ..... 287
            7.2.2.10. Reliability Growth ..... 291
            7.2.2.11. Infant Mortality ..... 292
            7.2.2.12. Combining Predicted Failure Rate with Empirical Data ..... 292
        7.2.3. Development of Component Reliability Models ..... 292
            7.2.3.1. Model Form ..... 292
            7.2.3.2. Acceleration Factors ..... 294
            7.2.3.3. Time Basis of Models ..... 294
            7.2.3.4. Failure Mode to Failure Cause Mapping ..... 295
            7.2.3.5. Derivation of Base Failure Rates ..... 296
            7.2.3.6. Combining the Predicted Failure Rate with Empirical Data ..... 296
            7.2.3.7. Estimating Confidence Levels ..... 298
            7.2.3.8. Using the 217Plus Model in a Top-Down Analysis ..... 298
            7.2.3.9. Capacitor Model Example ..... 299
            7.2.3.10. Default Values ..... 301
        7.2.4. Photonic Model Development Example ..... 303
            7.2.4.1. Introduction ..... 303
            7.2.4.2. Model Development Methodology and Results ..... 306
            7.2.4.3. Uncertainty Analysis ..... 322
            7.2.4.4. Comments on Part Quality Levels ..... 325
            7.2.4.5. Explanation of Failure Rate Units ..... 325
        7.2.5. System Level Model ..... 326
            7.2.5.1. Model Presentation ..... 326
            7.2.5.2. 217Plus Process Grading Criteria ..... 328
            7.2.5.3. Design Process Grade Factor Questions ..... 330
            7.2.5.4. Manufacturing Process Grade Factor Questions ..... 336
            7.2.5.5. Part Quality Process Grade Factor Questions ..... 340
            7.2.5.6. System Management Process Grade Factor Questions ..... 342
            7.2.5.7. Can Not Duplicate (CND) Process Grade Factor Questions ..... 346
            7.2.5.8. Induced Process Grade Factor Questions ..... 347
            7.2.5.9. Wearout Process Grade Factor Questions ..... 348
            7.2.5.10. Growth Process Grade Factor Questions ..... 349
    7.3. Life Modeling Example ..... 350
        7.3.1. Introduction ..... 350
        7.3.2. Approach ..... 350
        7.3.3. Reliability Test Plan ..... 350
        7.3.4. Results ..... 352
            7.3.4.1. Times to Failure Summary ..... 352
            7.3.4.2. Life Models ..... 354
    7.4. NPRD Description ..... 357
        7.4.1. Data Collection ..... 358
        7.4.2. Data Interpretation ..... 361
        7.4.3. Document Overview ..... 366
            7.4.3.1. "Part Summaries" Overview ..... 366
            7.4.3.2. "Part Details" Overview ..... 373
            7.4.3.3. Section 4 "Data Sources" Overview ..... 374
            7.4.3.4. Section 5 "Part Number/MIL Number" Index ..... 374
            7.4.3.5. Section 6 "National Stock Number Index with Federal Stock Class" ..... 375
            7.4.3.6. Section 7 "National Stock Number Index without Federal Stock Class Prefix" ..... 375
    7.5. References ..... 375
8. THE USE OF FMEA IN RELIABILITY MODELING ..... 377
    8.1. Introduction ..... 377
    8.2. Definitions ..... 381
    8.3. FMEA Logistics ..... 383
        8.3.1. When Initiated ..... 383
        8.3.2. FMEA Team ..... 383
        8.3.3. FMEA Facilitation ..... 384
        8.3.4. Implementation ..... 385
    8.4. How to Perform an FMEA ..... 385
    8.5. Identify System Hierarchy ..... 387
    8.6. Function Analysis ..... 388
    8.7. IPO-UND Analysis ..... 388
    8.8. Identify the Severity ..... 390
    8.9. Identify the Possible Effect(s) that Result from Occurrence of Each Failure Mode ..... 392
    8.10. Identify Potential Causes of Each Failure Mode ..... 392
    8.11. Identify Factors for Each Failure Cause ..... 398
        8.11.1. Accelerating Stress(es) or Potential Tests ..... 398
        8.11.2. Occurrence ..... 398
            8.11.2.1. Occurrence Rankings ..... 398
        8.11.3. Preventions ..... 401
        8.11.4. Detections ..... 401
        8.11.5. Detectability ..... 401
    8.12. Calculate the RPN ..... 404
    8.13. Determine Appropriate Corrective Action ..... 405
    8.14. Update the RPN ..... 408
    8.15. Using Quality Function Deployment to Feed the FMEA ..... 408
    8.16. References ..... 410
9. CONCLUDING REMARKS ..... 411

List of Figures

FIGURE 1.1-1: PHASES OF A RELIABILITY PROGRAM ..... 2
FIGURE 1.1-2: RELATIVE COST OF FAILURES VS. PHASE ..... 3
FIGURE 1.1-3: RELIABILITY PREDICTION, ASSESSMENT AND ESTIMATION ..... 4
FIGURE 1.1-4: PERCENT OF COMPANIES USING RELIABILITY ENGINEERING TOOLS ..... 5
FIGURE 1.3-1: EXAMPLE RELIABILITY PROGRAM APPROACH ..... 7
FIGURE 2.0-1: GENERAL MODELING APPROACH ..... 20
FIGURE 2.1-1: FAULT TREE REPRESENTATION OF SYSTEM MODEL ..... 21
FIGURE 2.1-2: FAULT TREE REPRESENTATION TO THE FAILURE CAUSE LEVEL ..... 21
FIGURE 2.2-1: BREAKDOWN OF POTENTIAL RELIABILITY MODELING PURPOSES ..... 23
FIGURE 2.3-1: TYPICAL DATA REQUIREMENTS VS. LEVEL OF HIERARCHY ..... 27
FIGURE 2.3-2: THE BASIC FMEA APPROACH ..... 28
FIGURE 2.3-3: HIERARCHICAL RELATIONSHIP BETWEEN CAUSE, MODE AND EFFECT ..... 29
FIGURE 2.3-4: APPROACH TO IDENTIFYING CAUSES ..... 29
FIGURE 2.3-5: FAULT TREE OF PRODUCT OR SYSTEM ..... 32
FIGURE 2.3-6: FAULT TREE OF PRODUCT OR SYSTEM WITH CAUSE AS THE LOWEST LEVEL ..... 32
FIGURE 2.3-7: FAULT TREE OF PRODUCT OR SYSTEM WITH CAUSE ABOVE THE LOWEST LEVEL ..... 33
FIGURE 2.3-8: FAULT TREE OF PRODUCT OR SYSTEM WITH CAUSE TWO LEVELS ABOVE THE LOWEST LEVEL ..... 33
FIGURE 2.5-1: BREAKDOWN OF RELIABILITY ASSESSMENT OPTIONS ..... 38
FIGURE 2.5-2: QUALIFICATION CONCEPTS AND TERMINOLOGY ..... 46
FIGURE 2.5-3: EVT, DVT AND PVT RELATIONSHIPS ..... 48
FIGURE 2.5-4: ACCELERATION LEVELS ..... 51
FIGURE 2.5-5: UNCERTAINTY IN EXTRAPOLATION ..... 52
FIGURE 2.5-6: ACCELERATION LEVELS ..... 53
FIGURE 2.5-7: ACCELERATION ALTERNATIVES ..... 53
FIGURE 2.5-8: RELATIVE LIFETIME VS. STRESS ..... 54
FIGURE 2.5-9: RELIABILITY REQUIREMENT VS. SMALL POPULATION RELIABILITY INFERENCE ..... 60
FIGURE 2.5-10: LIFE MODELING METHODOLOGY ..... 62
FIGURE 2.5-11: IDENTIFICATION OF TEST STRESSES BASED ON THE FMEA ..... 64
FIGURE 2.5-12: USING THE DESTRUCT LIMIT TO DEFINE THE LIFE TEST MAX STRESS ..... 66
FIGURE 2.5-13: POSSIBLE STRESS PROFILES ..... 67
FIGURE 2.5-14: MEASUREMENT POINTS FOR AN INFANT MORTALITY FAILURE CAUSE ..... 69
FIGURE 2.5-15: MEASUREMENT POINTS FOR A WEAROUT FAILURE CAUSE ..... 69
FIGURE 2.5-16: ACCELERATION WHEN THE DISTRIBUTIONS FOR AT LEAST TWO STRESSES ARE AVAILABLE ..... 71
FIGURE 2.5-17: ACCELERATION WHEN THE DISTRIBUTIONS FOR LOW STRESSES ARE NOT AVAILABLE ..... 71
FIGURE 2.5-18: LIFE MODEL SEQUENCE ..... 72
FIGURE 2.5-19: DEGRADATION MODELING APPROACH ..... 75
FIGURE 2.5-20: DEGRADATION DATA EXAMPLE ..... 76
FIGURE 2.5-21: DEGRADATION DATA CONVERSION TO TIMES TO FAILURE ..... 77
FIGURE 2.5-22: RELIABILITY ESTIMATES FROM FIELD DATA ..... 78
FIGURE 2.5-23: FMEA AS A TOOL FOR ASSESSING SIMILARITY ..... 81
FIGURE 2.5-24: MIL-HDBK-217 PART COUNT EXAMPLE ..... 85
FIGURE 2.5-25: MIL-HDBK-217 PART STRESS EXAMPLE ..... 86
FIGURE 2.5-26: TELCORDIA SR-332 (BELLCORE) ..... 87
FIGURE 2.5-27: RAC PRISM REPLACED BY RIAC 217PLUS ..... 88
FIGURE 2.5-28: CNET/RDF 2000 ..... 89
FIGURE 2.5-29: CNET/RDF 2000 MODEL EXAMPLE ..... 90
FIGURE 2.5-30: FIDES ..... 91
FIGURE 2.5-31: USES OF PROGRAM DATA ELEMENTS ..... 93
FIGURE 2.5-32: PROGRAM DATABASE STRUCTURE ..... 93
FIGURE 2.5-33: DATABASE INFORMATION FLOW ..... 95
FIGURE 2.5-34: HIERARCHY OF MAINTENANCE ACTIONS ..... 97
FIGURE 2.5-35: CALCULATION OF PART LIFE UNIT ..... 100
FIGURE 2.5-36: FAILURE TIMES BASED ON OPERATING TIME ..... 101
FIGURE 2.5-37: FAILURE TIMES BASED ON CALENDAR TIME ..... 102
FIGURE 2.5-38: FAILURE RATE SIMULATION WITH WEIBULL BETA = 20 ..... 103
FIGURE 2.5-39: FAILURE RATE SIMULATION WITH WEIBULL BETA = 5.0 ..... 103
FIGURE 2.5-40: FAILURE RATE SIMULATION WITH WEIBULL BETA = 2.0 ..... 104
FIGURE 2.5-41: FAILURE RATE SIMULATION WITH WEIBULL BETA = 1.0 ..... 104
FIGURE 2.5-42: FAILURE RATE SIMULATION WITH WEIBULL BETA = 0.5 ..... 105
FIGURE 2.5-44: STRESS/STRENGTH INTERFERENCE ..... 108
FIGURE 2.5-45: STRESS/STRENGTH INTERFERENCE VS. TIME ..... 109
FIGURE 2.6-1: 217PLUS APPROACH TO FAILURE RATE ESTIMATION ..... 114
FIGURE 2.6-3: BAYESIAN INFERENCE OUTLINE ..... 122
FIGURE 2.7-1: COMBINING SEVEN FAILURE CAUSE DISTRIBUTIONS ..... 125
FIGURE 2.7-2: POSSIBLE FAULT TREE REPRESENTATION OF A SERIES RELIABILITY BLOCK DIAGRAM ..... 126
FIGURE 2.7-3: PDF OF NORMAL DISTRIBUTION WITH MEAN OF 10 AND STANDARD DEVIATION OF 3 ..... 128
FIGURE 2.7-4: CUMULATIVE NORMAL DISTRIBUTION WITH MEAN OF 10 AND STANDARD DEVIATION OF 3 ..... 128
FIGURE 2.7-5: VALUE SELECTION FROM A DISTRIBUTION ..... 129
FIGURE 2.7-6: VALUE SELECTION FROM A WEIBULL DISTRIBUTION ..... 130
FIGURE 2.7-7: RELIABILITY BLOCK DIAGRAM OF REDUNDANT EXAMPLE ..... 131
FIGURE 2.7-8: SYSTEM MONTE CARLO EXAMPLE ..... 131
FIGURE 2.7-9: MONTE CARLO SIMULATION OF EXAMPLE SYSTEM ..... 132
FIGURE 3.1-1: DISCRETE PROBABILITY DISTRIBUTION ..... 135
FIGURE 3.1-2: CONTINUOUS PROBABILITY DISTRIBUTION ..... 136
FIGURE 3.2-1: EXAMPLES OF CORRELATION COEFFICIENTS ..... 142
FIGURE 3.2-2: VENN DIAGRAM OF MUTUALLY EXCLUSIVE EVENTS ..... 144
FIGURE 3.2-3: INDEPENDENT EVENTS ..... 145
FIGURE 3.2-4: FAULT TREE OR GATE ..... 147
FIGURE 3.2-5: RELIABILITY BLOCK DIAGRAM FOR AN OR GATE ..... 147
FIGURE 3.2-6: FAULT TREE AND GATE ..... 148
FIGURE 3.2-7: RELIABILITY BLOCK DIAGRAM FOR AN AND GATE ..... 149
FIGURE 3.2-8: FAULT TREE OF AN AND/OR COMBINATION ..... 150
FIGURE 3.2-9: RBD OF AND/OR COMBINATION ..... 150
FIGURE 3.3-1: SHAPES OF FAILURE DENSITY AND RELIABILITY FUNCTIONS OF COMMONLY USED DISCRETE DISTRIBUTIONS (FROM MIL-HDBK-338B) ..... 157
FIGURE 3.3-2: SHAPES OF FAILURE DENSITY, RELIABILITY AND HAZARD RATE FUNCTIONS FOR COMMONLY USED CONTINUOUS DISTRIBUTIONS (FROM MIL-HDBK-338B) ..... 158
FIGURE 3.3-3: EXAMPLE PDF PLOTS FOR THE WEIBULL DISTRIBUTION ..... 164
FIGURE 3.3-4: EXAMPLE HAZARD RATE PLOTS FOR THE WEIBULL DISTRIBUTION ..... 164
FIGURE 3.3-5: EXAMPLE PROBABILITY PLOTS FOR WEIBULL DISTRIBUTION ..... 165
FIGURE 3.3-6: EXAMPLE PDF PLOTS FOR THE LOGNORMAL DISTRIBUTION ..... 167
FIGURE 3.3-7: EXAMPLE HAZARD RATE PLOTS FOR THE LOGNORMAL DISTRIBUTION ..... 168
FIGURE 3.3-8: EXAMPLE PROBABILITY PLOTS FOR THE LOGNORMAL DISTRIBUTION ..... 168
FIGURE 4.0-1: THE DOE CONCEPT ..... 171
FIGURE 4.3-1: POSSIBLE RESPONSE FACTOR LEVEL RELATIONSHIP ..... 173
FIGURE 4.4-1: DOE TERMINOLOGY ..... 174
FIGURE 4.4-2: ONE FACTOR AT A TIME EXPERIMENTS ..... 176
FIGURE 4.4-3: STANDARD DOE NOMENCLATURE ..... 177
FIGURE 4.4-4: POTENTIAL INTERACTIONS ..... 178
FIGURE 4.6-1: ANALYSIS OF MEANS ..... 182
FIGURE 4.6-2: LINEARIZATION OF THE ARRHENIUS RELATIONSHIP ..... 182
FIGURE 4.6-3: OPTIMAL FACTOR SETTINGS ..... 183
FIGURE 5.4-1: LIKELIHOOD CONTOUR EXAMPLE ..... 220
FIGURE 6.1-1: BATHTUB CURVE ..... 223
FIGURE 6.2-1: EXAMPLE OF NON-MONOMODAL DISTRIBUTION ..... 228
FIGURE 6.2-2: MULTIMODAL DISTRIBUTION EXAMPLE 1 ..... 229
FIGURE 6.2-3: MULTIMODAL DISTRIBUTION EXAMPLE 2 ..... 230
FIGURE 6.2-4: MULTIMODAL DISTRIBUTION EXAMPLE 3 ..... 231
FIGURE 6.2-5: MULTIMODAL DISTRIBUTION EXAMPLE 4 ..... 232
FIGURE 6.2-6: MULTIMODAL DISTRIBUTION EXAMPLE 5 ..... 233
FIGURE 6.2-7: MULTIMODAL DISTRIBUTION EXAMPLE OF POOLED DATA SET ..... 234
FIGURE 6.2-8: AGE AT DEATH DATA ..... 235
FIGURE 6.2-9: PDF OF MULTIMODE DISTRIBUTION OF AGES ..... 236
FIGURE 6.2-10: FAILURE RATE OF AGE DATA ..... 236
FIGURE 6.2-11: PROBABILITY PLOT OF AGE DATA ..... 237
FIGURE 6.2-12: SINGLE MODE WEIBULL FIT TO THE AGE DATA ..... 238
FIGURE 6.3-1: SOURCES OF ERROR IN EMPIRICAL MODELS ..... 241
FIGURE 6.3-2: CONFIDENCE LEVEL THROUGH PREDICTION, ASSESSMENT AND ESTIMATION ..... 243
FIGURE 6.6-1: WEIBAYES EXAMPLE ..... 246
FIGURE 6.13-1: NOMINAL FAILURE CAUSE DISTRIBUTION OF ELECTRONIC SYSTEMS ..... 254
FIGURE 6.13-2: IPO MODEL ..... 256
FIGURE 6.13-3: RELATIONSHIP BETWEEN ABSOLUTE AND RELATIVE HUMIDITY ..... 259
FIGURE 6.14-1: ESTIMATED UPPER BOUND FAILURE RATES VS. OPERATING TIME AT 60 AND 90% CONFIDENCE ..... 260
FIGURE 7.1-1: MIL-HDBK-217 MODEL DEVELOPMENT METHODOLOGY ..... 265
FIGURE 7.2-1: FAILURE CAUSE DISTRIBUTION OF ELECTRONIC SYSTEMS ..... 275
FIGURE 7.2-2: OPTICAL AMPLIFIER FAILURE CAUSE DISTRIBUTION ..... 277
FIGURE 7.2-3: πG VS. TIME AND GROWTH RATES ..... 291
FIGURE 7.2-4: MODEL DEVELOPMENT METHODOLOGY FLOWCHART ..... 306
FIGURE 7.2-5: DISTRIBUTION OF LOG10 PREDICTED/OBSERVED FAILURE RATE RATIO FOR ALL DATA ..... 323
FIGURE 7.2-6: DISTRIBUTION OF LOG10 PREDICTED/OBSERVED RATIO FOR FIELD DATA ONLY ..... 324
FIGURE 7.2-7: DISTRIBUTIONS OF THE PREDICTED/OBSERVED FAILURE RATE RATIO FOR ALL DATA AND FOR FIELD DATA ONLY ..... 324
FIGURE 7.3-1: TIMES TO FAILURE DISTRIBUTIONS ..... 354
FIGURE 7.3-2: PROBABILITY OF FAILURE VS. TEMPERATURE AND RELATIVE HUMIDITY AT 50,000 HOURS ..... 357
FIGURE 7.4-1: APPARENT FAILURE RATE FOR REPLACEMENT UPON FAILURE ..... 362
FIGURE 7.4-3: EXAMPLE OF PART DETAIL ENTRIES ..... 374
FIGURE 8.1-1: TWO BASIC TYPES OF FMEA ..... 378
FIGURE 8.4-1: FMEA PROCESS FLOW ..... 386
FIGURE 8.7-1: FAILURE CAUSE-MODE-EFFECT RELATIONSHIP ..... 390
FIGURE 8.10-1: FAILURE CAUSE, MODE AND EFFECT HIERARCHY ..... 393
FIGURE 8.10-2: FAILURE CAUSES ..... 395
FIGURE 8.11-1: OCCURRENCE DEFINITIONS ..... 399
FIGURE 8.11-2: OCCURRENCE GUIDELINES ..... 400
FIGURE 8.11-3: DETECTABILITY DEFINITIONS ..... 402
FIGURE 8.11-4: LIFE CYCLE VS. DETECTABILITY DIMENSION ..... 403
FIGURE 8.13-1: POTENTIAL CORRECTIVE ACTIONS ..... 407
FIGURE 8.15-1: QFD TO FMEA LINKS ..... 408
FIGURE 8.15-2: QFD FMEA ..... 410

List of Tables

TABLE 1.3-1: RANGES OF POTENTIAL CUSTOMER REACTIONS ..... 8
TABLE 2.2-1: RELIABILITY ASSESSMENT PURPOSES ..... 24
TABLE 2.2-2: PROGRAM PHASE VS. RELIABILITY ASSESSMENT PURPOSE ..... 25
TABLE 2.3-1: EXAMPLES OF INITIAL CONDITIONS, STRESSES AND MECHANISMS ..... 30
TABLE 2.3-2: RELATIONSHIP BETWEEN CAUSE, MODE AND EFFECT ..... 31
TABLE 2.5-1: SUMMARY OF RELIABILITY ASSESSMENT OPTIONS ..... 39
TABLE 2.5-1: SUMMARY OF ASSESSMENT OPTIONS (CONTINUED) ..... 40
TABLE 2.5-2: RELEVANCY OF APPROACH TO PREDICTION, ASSESSMENT AND ESTIMATION ..... 41
TABLE 2.5-3: IDENTIFICATION OF APPROPRIATE APPROACHES BASED ON THE PURPOSE ..... 43
TABLE 2.5-4: RANKING THE ATTRIBUTES OF EMPIRICAL DATA ..... 44
TABLE 2.5-5: EVT, DVT AND PVT PURPOSE AND APPROACH ..... 47
TABLE 2.5-6: RELIABILITY DEMONSTRATION EXAMPLE ..... 50
TABLE 2.5-7: EXAMPLE OF A QUALIFICATION PLAN FOR AN ASSEMBLY ..... 57
TABLE 2.5-8: QUALIFICATION EXAMPLE FOR A LASER DIODE ..... 58
TABLE 2.5-9: STRESS PROFILE OPTION ADVANTAGES AND DISADVANTAGES ..... 68
TABLE 2.5-10: SIMILARITY ANALYSIS ..... 80
TABLE 2.5-11: DIGITAL CIRCUIT BOARD FAILURE RATES (IN FAILURES PER MILLION PART HOURS) ..... 83
TABLE 2.5-12: TEST CONDITIONS ..... 111
TABLE 2.5-13: DATA TO ESTIMATE DIFFUSION RATE ..... 112
TABLE 2.5-14: PREDICTED LIFETIMES VS. OBSERVED ..... 113
TABLE 3.1-1: PROBABILITY DISTRIBUTION NOTATION & MATHEMATICAL REPRESENTATIONS ..... 141
TABLE 3.2-1: COMBINATIONS EXAMPLE ..... 143
TABLE 3.2-2: COMBINATIONS OF AN OR CONFIGURATION ..... 147
TABLE 3.2-3: COMBINATIONS OF AN AND CONFIGURATION ..... 149
TABLE 3.2-4: EXAMPLE OF K-OUT-OF-N PROBABILITY CALCULATIONS ..... 151
TABLE 3.2-5: EXAMPLE OF 2-OUT-OF-3 REQUIRED FOR SUCCESS ..... 152
TABLE 3.3-1: PROBABILITY DISTRIBUTIONS APPLICABLE TO RELIABILITY ENGINEERING ..... 154
TABLE 3.3-2: EXPONENTIAL DISTRIBUTION PARAMETERS ..... 160
TABLE 3.3-3: CONFUSING TERMINOLOGY OF THE WEIBULL DISTRIBUTION ..... 162
TABLE 3.3-4: WEIBULL DISTRIBUTION PARAMETERS ..... 163
TABLE 4.3-1: POSSIBLE CONCLUSIONS FOR A NONLINEAR RESPONSE FACTOR RELATIONSHIP ..... 173
TABLE 4.4-1: FULL FACTORIAL EXAMPLE ..... 175
TABLE 4.4-2: FULL AND HALF FACTORIAL EXAMPLE FOR CORROSION ..... 179
TABLE 5.2-1: TERMINOLOGY USED IN PARAMETER ESTIMATION ..... 187
TABLE 5.2-2: TECHNIQUES FOR PARAMETER ESTIMATION ..... 188
TABLE 5.2-3: PARAMETERS TYPICALLY ESTIMATED FROM STATISTICAL DISTRIBUTIONS ..... 189
TABLE 5.2-4: CONFIDENCE BOUNDS FOR THE POISSON DISTRIBUTION ..... 200
TABLE 5.2-5: CONFIDENCE BOUNDS FOR THE BINOMIAL DISTRIBUTION ..... 201
TABLE 5.2-6: CONFIDENCE BOUNDS FOR THE EXPONENTIAL DISTRIBUTION ..... 202
TABLE 5.2-8: CONFIDENCE BOUNDS FOR THE NORMAL DISTRIBUTION ..... 203
TABLE 5.3-10: CONFIDENCE BOUNDS FOR THE WEIBULL DISTRIBUTION ..... 205
TABLE 6.1-1: CATEGORIES OF FAILURE EFFECTS ..... 227
TABLE 6.2-2: BIMODAL POPULATION EXAMPLE 1 ..... 229
TABLE 6.2-3: BIMODAL POPULATION EXAMPLE 2 ..... 230
TABLE 6.1-4: BIMODAL POPULATION EXAMPLE 3 ..... 231
TABLE 6.1-5: BIMODAL POPULATION EXAMPLE 4 ..... 232
TABLE 6.1-6: BIMODAL POPULATION EXAMPLE 5 ..... 233
TABLE 6.1-7: FOUR MODE WEIBULL DISTRIBUTION PARAMETERS ..... 235
TABLE 6.3-1: FAILURE RATE UNCERTAINTY LEVEL MULTIPLIERS ..... 242
TABLE 6.9-1: EXAMPLE OF COMBINING DIFFERENT TYPES OF MODELS ..... 248
TABLE 6.13-1: FACTORS TO BE CONSIDERED IN A RELIABILITY MODEL ..... 256
TABLE 6.13-2: FAILURE RATE DATA SUMMARY ..... 258
TABLE 7.1-1: DATA COLLECTED FOR MODEL DEVELOPMENT ..... 269
TABLE 7.1-2: DATA TRANSFORMS ..... 270
TABLE 7.1-3: REGRESSION DATA INCLUDING CATEGORICAL VARIABLES ..... 271
TABLE 7.2-1: UNCERTAINTY LEVEL MULTIPLIER ..... 282
TABLE 7.2-2: PERCENTAGE OF FAILURES ATTRIBUTABLE TO EACH FAILURE CAUSE ..... 283
TABLE 7.2-3: WEIBULL PARAMETERS FOR FAILURE CAUSE PERCENTAGES ..... 283
TABLE 7.2-4: MULTIPLIERS AS A FUNCTION OF PROCESS GRADE ..... 284
TABLE 7.2-5: EXAMPLE OF FAILURE MODE TO FAILURE CAUSE CATEGORY MAPPING ..... 295
TABLE 7.2-6: CAPACITOR PARAMETERS ..... 301
TABLE 7.2-7: DEFAULT ENVIRONMENTAL STRESS VALUES ..... 302
TABLE 7.2-8: DEFAULT OPERATING PROFILE VALUES ..... 303
TABLE 7.2-9: FAILURE CAUSE SUMMARY FOR CONNECTORS ..... 308
TABLE 7.2-10: FAILURE MODE TO FAILURE CAUSE CATEGORY FOR CONNECTORS (SC AND FC) ..... 309
TABLE 7.2-11: FAILURE CAUSE PERCENTAGES FOR CONNECTORS ..... 311
TABLE 7.2-12: DATA COLLECTED FOR CONNECTORS ..... 312
TABLE 7.2-13: CATEGORIES OF ACCELERATION MODEL PARAMETERS ..... 315
TABLE 7.2-14: ACCELERATION MODEL PARAMETERS ..... 315
TABLE 7.2-15: DEFAULT MODEL PARAMETERS ..... 316
TABLE 7.2-16: SUMMARY OF PI FACTOR CALCULATIONS ..... 317
TABLE 7.2-17: APPLICABILITY OF TEST DATA ..... 318
TABLE 7.2-18: BASE FAILURE RATES (FAILURES PER MILLION CALENDAR HOURS) ..... 319
TABLE 7.2-19: PART QUALITY PROCESS GRADE FACTOR QUESTIONS FOR PHOTONIC DEVICE MODELS ..... 320
TABLE 7.2-20: SUMMARY OF UNCERTAINTY METRICS ..... 323
TABLE 7.2-21: PARAMETERS FOR THE PROCESS GRADE FACTORS ..... 327
TABLE 7.2-22: INDEX OF PROCESS GRADE TYPE QUESTIONS ..... 328
TABLE 7.2-23: DESIGN PROCESS GRADE FACTOR QUESTIONS ..... 330
TABLE 7.2-24: MANUFACTURING PROCESS GRADE FACTOR QUESTIONS ..... 336
TABLE 7.2-25: PART QUALITY PROCESS GRADE FACTOR QUESTIONS ..... 340
TABLE 7.2-26: SYSTEM MANAGEMENT PROCESS GRADE FACTOR QUESTIONS ..... 342
TABLE 7.2-27: CAN NOT DUPLICATE (CND) PROCESS GRADE FACTOR QUESTIONS ..... 346
TABLE 7.2-28: INDUCED PROCESS GRADE FACTOR QUESTIONS ..... 347
TABLE 7.2-29: WEAROUT PROCESS GRADE FACTOR QUESTIONS ..... 348
TABLE 7.2-30: GROWTH PROCESS GRADE FACTOR QUESTIONS ..... 349
TABLE 7.3-1: PARAMETER LEVELS ..... 350
TABLE 7.3-2: TEST PLAN SUMMARY ..... 351
TABLE 7.3-3: LIFE TEST RESULTS ..... 352
TABLE 7.3-4: TIMES TO FAILURE DISTRIBUTION PARAMETERS ..... 353
TABLE 7.3-5: ESTIMATED PARAMETER 80% 2-SIDED CONFIDENCE BOUNDS ..... 356
TABLE 7.4-1: DATA SUMMARIZATION PROCESS ..... 359
TABLE 7.4-2: TIME AT WHICH ASYMPTOTIC VALUE IS REACHED ..... 363
TABLE 7.4-3: η/MTTF RATIO AS A FUNCTION OF β ..... 363
TABLE 7.4-4: PERCENT FAILURE FOR WEIBULL DISTRIBUTION ..... 364
TABLE 7.4-5: FIELD DESCRIPTIONS ..... 367
TABLE 7.4-6: APPLICATION ENVIRONMENTS DEFINED IN NPRD ..... 368
TABLE 8.7-1: FAILURE MODE RELATIONSHIP TO TAGUCHI LOSS FUNCTION ..... 389
TABLE 8.8-1: DIMENSIONS OF FUNCTIONAL SEVERITY ..... 391
TABLE 8.8-2: DIMENSIONS OF SEVERITY ..... 392
TABLE 8.11-1: CATEGORIES OF FAILURE EFFECTS ..... 401
TABLE 8.11-2: RECOMMENDED DETECTABILITY RATING CRITERIA ..... 404


1. Introduction

Few engineering techniques have caused as much controversy in the last several decades
as the topic of reliability prediction. One of the primary reasons for this is the stochastic
nature of reliability. Whereas many engineering disciplines are governed by
deterministic processes, reliability is governed by a complex interaction of stochastic
processes. As a result, the metrics of interest in other engineering disciplines are
generally much more quantifiable by their very nature. While there is always a stochastic
element in any engineering model, the topic of reliability quantification must address its
extreme stochastic nature.
Many highly respected reliability engineering texts treat the topic of reliability modeling
thoroughly and in great detail. Included in these texts are detailed ways to model system
reliability using techniques like Failure Modes and Effects Analysis (FMEA), Fault Tree
Analysis (FTA), Markov models, fault-tolerant design techniques, etc. These texts, however, often gloss over a fundamental requirement for effectively utilizing such techniques: the ability to quantify the reliability of the constituent components and subsystems comprising the system.
The intent of this book is to provide guidance on reliability modeling techniques that can
be used to quantify the reliability of a product or system. In this context, reliability
modeling is the process of constructing a mathematical model that is used to estimate the
reliability characteristics of an item. There are many ways in which this can be
accomplished, depending on the item and the type of information that is available to, or
practical to obtain by, the analyst. This book will review possible approaches, summarize
their advantages and disadvantages, and provide guidance on selecting a methodology
based on specific goals and constraints. While this book will not discuss the use of
specific published methodologies, in cases where examples are provided, tools and
methodologies with which the author has personal experience in their development are
used, such as life modeling, NPRD, MIL-HDBK-217 and 217Plus.
The Reliability Information Analysis Center (RIAC) has prepared many documents in the
past relating to many different reliability engineering techniques, such as FMEA, FTA,
Worst Case Analysis (WCA), etc. However, one noteworthy omission from this list is
reliability modeling. This, coupled with (1) the RIAC's history of providing reliability
modeling data and solutions, and (2) the need to objectively address some of the
confusion and misconceptions related to this topic, formed the inspiration for this book.


In years past, DoD contracts would require that specific reliability prediction methodologies, usually MIL-HDBK-217, be used. This resulted in system developers having very little flexibility in applying different reliability prediction practices. Since the DoD has not, until very recently, supported updates to MIL-HDBK-217, companies were encouraged to use "best practices" in quantifying product reliability. The difficult question to be addressed is: what are the "best practices" that should be used? This book attempts to provide guidance on selecting an appropriate methodology based on the specific conditions and constraints of the company and its products or systems.
It is hoped that the author's experience, gained by attempting many different reliability assessment approaches, including both physics-based and empirical approaches, can be used to the advantage of the reader in a practical way.

1.1. Scope
The intent of a reliability program is to identify and mitigate failure modes/mechanisms,
verify their removal through reliability testing, implement corrective actions for
discovered failures, and maintain reliability levels after reliability has been designed in.
These correspond to the designing-in reliability, reliability growth and ensuring on-going
reliability goals, respectively, as illustrated in Figure 1.1-1.

Figure 1.1-1: Phases of a Reliability Program



The cost to an organization increases exponentially as a function of when failure causes are discovered, as illustrated in Figure 1.1-2. It is most efficient to discover failure
are discovered, as illustrated in Figure 1.1-2. It is most efficient to discover failure
modes and mechanisms as early as possible, when they can be effectively mitigated. If
failure modes and mechanisms are discovered late in development or, worse, in the field,
organizations can be faced with staggering costs associated with corrective actions.

Figure 1.1-2: Relative Cost of Failures vs. Phase


The use of reliability engineering techniques early in the development cycle of a system
is critical to achieving high reliability. An important part of these efforts is the modeling
of reliability before the product or system is fielded.
The term "reliability prediction" has had a relatively narrow connotation, primarily associated with "handbook" approaches. This document attempts to take a broader view of this topic by investigating the various approaches for quantifying reliability, and their effectiveness when used to achieve specific objectives. For this reason, the book is entitled "Reliability Modeling: The RIAC Guide to Reliability Prediction, Assessment and Estimation". The definitions of these terms are:

Prediction - something that is predicted, forecasted
Assessment - to determine the importance, size, or value of
Estimation - a tentative evaluation or rough calculation, as of worth, quantity, or size

Predictions are performed very early, before there is any empirical data on the item under
analysis. Reliability assessments are made to determine the effects of certain factors on
reliability and to identify failure causes. Reliability estimates are made based on
empirical data. This book covers all three areas, as illustrated in Figure 1.1-3.

Figure 1.1-3: Reliability Prediction, Assessment and Estimation
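As a simple illustration of the distinction (the numbers here are invented for illustration, not taken from the book): an estimate is computed directly from empirical data, e.g., observing r = 4 failures in T = 52,000 cumulative operating hours yields a point estimate of λ = r/T = 4/52,000 hours ≈ 77 failures per million hours. A prediction of the same quantity would instead be produced from a handbook model before any such test or field data exist.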


Figure 1.1-4 summarizes the results of a benchmarking study of best commercial
reliability practices (Reference 9). In this study, reliability predictions were identified by
more than 90% of the participants as being an appropriate reliability task during the
product/system development life cycle. However, only approximately 70% of the survey respondents felt that reliability predictions were effective, supporting the proposition that, while generally perceived as beneficial, there are problems associated with their use. This
information highlights the importance that organizations often place on assessing and
predicting reliability.


Figure 1.1-4: Percent of Companies Using Reliability Engineering Tools

1.2. Book Organization


Chapter 1 of this book presents background information on reliability modeling. The
next section of this chapter includes a description of a typical reliability program, the
intent of which is to present the elements that should be considered when developing a
program, and to highlight how reliability modeling fits into such a program. Also
included is a section on the history of reliability prediction, to provide a historical
perspective of its evolution.
Chapter 2 covers the primary topic of this book, and includes information on the various
ways in which a product can be modeled and guidance on selecting an approach. It
presents a generic approach, and describes the elements of this approach.
Chapter 3 presents fundamental concepts of reliability theory, probability and statistics.
In many books, these topics are presented first. However, in this book, this material follows the main modeling discussion of Chapter 2 because it is not the primary topic. Rather, it is presented to provide the
fundamental foundation for the concepts used in reliability modeling. It is also the
foundation for Design of Experiments (DOE) and Life Modeling techniques, which are
further detailed in Chapters 4 and 5.


Approaches like using a multi-cell-based designed experiment to generate data from which a life model is developed are presented in Chapter 2. Here, a generic approach to
this topic is presented. Since the topic of life modeling is central to reliability modeling,
important elements of it are presented in more detail in Chapters 4 and 5. One of the
critical aspects of life modeling is reliability testing.
Design of Experiments is a technique to maximize the usefulness of the data resulting from reliability tests, and is the topic of Chapter 4.
Chapter 5 presents information relative to development of the mathematical models that
form the basis of the reliability model, and includes information pertaining to parameter
estimation.
Chapter 6 presents a variety of topics pertaining to the interpretation of reliability models.
This is provided to allow the reader to gain a better appreciation for what can, and cannot,
be concluded from a model.
Chapter 7 is a compilation of examples of reliability models. Presented here are the
following examples:
1. A typical MIL-HDBK-217 model development process
2. Information on the development of the RIAC's 217Plus methodology
3. A life modeling example
4. A description of RIAC's Nonelectronic Parts Reliability Data (NPRD), provided as an example of the use of field data in reliability modeling

These examples are provided to give the reader a better appreciation for the tools,
techniques and limitations of various approaches to reliability modeling.
A discussion of FMEA is presented in Chapter 8. Although FMEA is secondary to the
primary intent of this book, it can form the basis for many elements of a reliability
program, including reliability modeling. Therefore, Chapter 8 is intended to present
FMEA concepts in this context, as well as provide practical information on performing
FMEAs that this author has found to be useful.


1.3. Reliability Program Elements


To provide perspective on how reliability modeling fits into a reliability program, this
section presents a generic reliability program and describes its various elements.
There are many possible approaches to designing in reliability. The specific approach
used will depend on the needs of the specific organization. Figure 1.3-1 presents one
possible approach, and includes the elements that should be included in all approaches.
The premise of this approach is to identify the critical parts and materials that warrant
detailed attention. Since it is impractical to perform some reliability modeling
approaches on all system parts, it is imperative to identify the critical parts that present
the highest risk. Since one of the most effective ways to verify the robustness of parts or
materials is from experience, an effective reliability program must leverage knowledge
gained in the development and deployment of previous systems. It will be shown that
reliability assessments impact many of the elements of this approach.

Figure 1.3-1: Example Reliability Program Approach



Elements of the reliability program are summarized as follows:


1. Design requirements: The first step in any product development process is the
identification of requirements. These requirements include items pertaining to
Performance, Reliability (failure rate, life), Maintainability, Diagnostics, and Use
Environment and Operational stresses (i.e., mission profiles). Typically, the medium for
communicating these requirements is the product specification. While the specification
usually contains details regarding the required performance of the product or system, it is
often lacking when it comes to quantifying the required reliability attributes. The following
questions should be answered to determine these reliability requirements:

What is the required failure rate of the item in its useful life?
What is the service life required?
What criteria will be used to determine when the requirements are not met?
Whose responsibility will it be to take corrective action if these requirements are
not met?
What are the operating and environmental profiles expected in field deployed
conditions?

A valuable tool to assist in understanding the requirements is Quality Function
Deployment (QFD).
The reliability that is considered acceptable will, of course, be specific to the industry,
criticality of failure, etc. A specific value may or may not be specified, depending on the
industry and the maturity of the product. The range of potential customer reactions to
various scenarios is summarized in Table 1.3-1.
Table 1.3-1: Ranges of Potential Customer Reactions

Outcome  Field reliability                                                Likely customer reaction
Best     No failures                                                      Pleased
         Failures occur at an acceptable rate                             Tolerant
         Recurring failures, but on a relatively small percent of items   Annoyed
         Recurring failures on a high percent of items                    Angry
Worst    An unexpected failure mechanism is discovered that will
         affect the entire population, or critical safety-related
         failures                                                         Legal action, loss of business


If the requirement is not specified, an estimate of the requirement must be made so that
there is a goal that can be used in the development process.
2. Initial Design: After the product requirements are understood, the design team
generally derives an initial, or preliminary, design for the product or system. Inputs to
this initial design should be in the form of design rules and a standard parts list. Design
rules are the culmination of lessons learned from previous development activities, from
both empirical field or test data and from analysis. These design rules should be a living
document that is continuously updated based on current information. Effective use of
design rules also saves effort, since reliability attributes that have a reliability history or
that have been previously studied do not need to be addressed in detail, freeing resources
to be applied to the study of critical parts.
3. Similarity analysis: Once an initial design is available, a similarity analysis can be
performed to identify attributes which are similar to those for which a reliability history is
available, and those for which it is not. An FMEA can be a valuable technique for this
analysis, and will be discussed later. In this analysis, each reliability attribute identified
in the FMEA is reviewed to determine if a reliability history exists or not.
4. Identify attributes that are similar: Similar attributes are those that have a reliability
history.
5. Assess robustness of attribute: If the part or attribute does have a history, previous test
data or field experience data can be used to assess the robustness of the part or attribute.
6. Identify attributes that are not similar: Attributes that are not similar do not have a
reliability history.
7. Perform design analysis: Although any attribute that is potentially different in the new
design relative to the previous design must be analyzed, particular attention is given to
the attributes that are not similar. Design techniques that are used for this purpose are
FMEA, tolerance or worst case analysis, thermal analysis, stress analysis, and reliability
predictions.
8. Implement corrective action: From the results of the design analysis, corrective action
should be taken to improve the robustness of the design.
9. Identify critical parts/materials: Based on the results of the analysis, critical parts or
materials are identified.

10. Model critical parts/materials: Once critical parts are identified, action must be taken
to ensure that the parts or materials are robust enough to meet the reliability and
durability requirements. More details of the approach used for this purpose will be
presented later in the book.
11. Identify effective tests for non-similar attributes: Based on the identification of
critical parts and the design analysis that was performed, specific tests that will assess the
reliability and durability of the attribute can be determined. Part of the FMEA should
include identification of stresses that will accelerate the attribute under analysis and
therefore, this analysis is important for identifying the appropriate stress tests.
12. Develop a test plan and execute tests: Based on the design analysis performed and
the identification of tests for non-similar attributes, a test plan can be determined. In the
context of this approach, the goal of these tests is to assess the robustness of the product
by subjecting the product to test stresses that are intended to accelerate the critical parts
and non-similar attributes to failure. In addition to these tests, other test requirements
should be incorporated into this test plan. These additional test requirements include any
tests required by the customer, such as qualification or reliability demonstration tests.
13. Document the test results: Once the tests have been performed and the data analyzed,
the results should be fully documented, since they subsequently will be used for a variety
of purposes.
14. Monitor field reliability: Once the product is deployed, field reliability experience
data should be carefully gathered, since it will be used for a variety of purposes.
Elements of the data to be gathered include:
1. Product or system deployment history by serial number, including when
deployed, when fielded
2. Failure information, including failure date, root failure cause, results of failure
analysis
3. Product or system re-deployment information
15. Update reliability database: A database is required to manage the reliability data, and
should include both test data and field data. This data can be used to generate a
company-specific reliability prediction methodology.


16. Update Design Rules: Data acquired from tests and field surveillance should be used
to update the design rules. Field data is probably the most valuable type of data for this
purpose since it represents the actual product or system in the intended use environment.
The process of maintaining design rules and ensuring that they are used in new designs is
the cornerstone of the means by which reliability is improved in a reliability growth
process.
Critical parts are those that may pose a significant risk to the project. This risk can
be related to reliability, lifetime, availability or maintainability. Some of the factors that
make a part critical are:

New, unproven technology
New, unproven manufacturing processes
Performance limitations: stringent environmental conditions or non-robust design
practices
Reliability limitations: components/materials with life limitations
Vendors with a past history of delivery, cost performance or reliability problems
Old technology with availability problems

These critical parts or items warrant additional attention in assessing their reliability, as
they generally will represent the greatest reliability risk.

1.4. The History of Reliability Prediction


The term "reliability prediction" has historically been used to denote the process of
applying mathematical models and data for the purposes of estimating field reliability of
a product or system before empirical data is available on that product or system. This
section will review some of the developments in the area of reliability prediction from the
1950s to the present. While there are several techniques available to reliability
practitioners to perform reliability predictions, the discussion inevitably centers around
MIL-HDBK-217 due to its historical prominence as a reliability prediction tool.
During World War II, electronic tubes were by far the most unreliable component used in
DoD electronic systems. This observation led to various studies and ad hoc groups
whose purpose was to identify ways that their reliability, and the reliability of the systems
in which they operated, could be improved. One group in the early 1950s concluded
that:
1. There needs to be better reliability data collected from the field
2. Better components need to be developed

3. Quantitative reliability requirements need to be established


4. Reliability needs to be verified by test before full scale production
5. A permanent committee needs to be established to guide the reliability discipline
Item 5, above, was implemented in the form of the Advisory Group on Reliability of
Electronic Equipment (AGREE), whose charter was to identify actions that could be
taken to provide more reliable electronic equipment. This time period was the advent of
the reliability engineering discipline. It soon became clear that the emerging discipline
was using several different methods to achieve its goal of higher reliability. One was the
identification of root causes of field failure and determination of mitigating actions.
Another was the specification of quantitative reliability requirements. The specification
of requirements in turn led to the desire to have a means of estimating reliability before
equipment is built and tested, so that the probability of achieving the reliability goal
could be estimated. This, of course, was the beginning of reliability prediction. The
1950s also saw much pioneering work in the reliability discipline, including:

A variety of efforts to improve device reliability through data collection and design
The establishment of reliability programs
Symposiums devoted to quality and reliability engineering
The development of statistical techniques, such as the Weibull distribution
Military handbooks that provided guidance on the reliable application of
electronic components

In addition to these accomplishments, the 1950s also included pioneering work in the area
of quantitative reliability prediction. In 1956, RCA released TR-1100, "Reliability Stress
Analysis for Electronic Equipment," which presented mathematical models for the
estimation of component failure rates. This report turned out to be the predecessor of
MIL-HDBK-217.
Several additional early works in the area of reliability prediction were produced in the
early 1960s, including D.R. Erles' report (Reference 2) and the Erles and Edins paper
(Reference 3). In 1962, the first version of MIL-HDBK-217 was published by the Navy.
Once issued, MIL-HDBK-217 quickly became the standard by which reliability
predictions were performed, and other sources of failure rates gradually disappeared.
Part of the reason for the demise of other sources was the fact that MIL-HDBK-217 was
often a contractually cited document and defense contractors did not have the option of
using other sources of data.

These early sources of failure rates also often included design guidance on the reliable
application of electronic components. However, subsequent versions of the documents,
primarily MIL-HDBK-217, would delete the application information because it was
treated in more detail elsewhere.
By now, the reliability discipline was working under the tenet that reliability was a
quantitative discipline that needed quantitative data sources to support its many
statistically based techniques, such as allocations and redundancy modeling. However,
another branch of the reliability discipline focused on the physical processes by which
components were failing. The first symposium devoted to this topic was the Physics of
Failure In Electronics Symposium sponsored by the Rome Air Development Center
(RADC) and IIT Research Institute (IITRI) in 19621. This symposium later became
known as the International Reliability Physics Symposium (IRPS). In this period of time,
the two branches of reliability engineering seemed to be diverging, with the systems
engineers devoted to the tasks of specifying, allocating, predicting and demonstrating
reliability, while the physics-of-failure (PoF) engineers and scientists were devoting their
efforts to identifying and modeling the physical causes of failure. Both branches were
integral parts of the reliability discipline, and both were hosted at RADC (later to become
Rome Laboratory). The physics-based information was necessary to develop part
qualification, screening and application requirements, and the systems tasks of
specifying, allocating, predicting and demonstrating reliability were necessary to ensure
that reliability requirements were met. The component research efforts of the 1950s and
1960s culminated with the implementation of the "ER" and "TX" families of
specifications. This complicated the issue of predicting their reliability because there
were now many different combinations of quality levels and environments that needed to
be addressed in MIL-HDBK-217.
In the early 1970s, the responsibility for preparing MIL-HDBK-217 was transferred to
RADC, who published revision B in 1974. However, other than the transition to RADC,
the 1970s maintained the status quo in the area of reliability prediction. MIL-HDBK-217
was updated to reflect the technology of the time, but there were few other efforts
that changed the manner in which predictions were performed. One exception, however,
was a shift in the complexity of the models being developed for MIL-HDBK-217.
There were several efforts to develop new and innovative models for reliability
prediction. The results of these efforts were extremely complex models that may have
been technically sound, but were criticized by the user community as being too
complex, too costly and unrealistic, given the low level of detailed design information
available at the point in time when the models were needed. RCA, under contract to
RADC, had developed PoF-based models which were rejected as unusable, since the
detailed design and construction data for microcircuits were simply unavailable to typical
model users. These models were never incorporated into MIL-HDBK-217.

¹ IITRI was the original contractor of the Reliability Analysis Center (RAC). In 2005, the RAC contract was awarded as RIAC to the current team of
Wyle Labs (prime), Quanterion Solutions Incorporated, the University of Maryland Center for Risk and Reliability, the Pennsylvania State Applied
Research Laboratory (ARL), and the State University of New York Institute of Technology (SUNYIT).
While MIL-HDBK-217 was updated again several times in the 1980s, there were
agencies that were developing reliability prediction models unique to their industries. As
an example, the automotive industry, under the auspices of the Society of Automotive
Engineers (SAE) Reliability Standards Committee, developed a series of models specific
to automotive electronics. The SAE committee felt that there was no existing prediction
methodologies that were applicable to the specific quality levels and environments of
automotive applications. The Bellcore reliability prediction standard is another example
of a specific industry developing methodologies for their unique conditions and
equipment. It originally was developed by modifying MIL-HDBK-217 to better reflect
the conditions of interest of the telecommunications industry. It has since taken on its
own identity with models derived from telecommunications equipment and is now used
widely within that industry.
The 1980s also saw explosive growth in integrated circuit technology. Very dense
circuits were being fabricated using feature sizes as small as 0.5 microns. This presented
unique challenges to reliability modelers. The VHSIC (Very High Speed Integrated
Circuit) program was the government's attempt to leverage the technological
advancements of the commercial industry and, at the same time, produce circuits capable
of meeting the unique requirements of military applications. From the VHSIC program
came the Qualified Manufacturers List (QML) - a qualification methodology that
qualified an integrated circuit manufacturing line, unlike the traditional qualification of
specific parts. The government realized that it needed a QML-like process if it were to
leverage from the advancements in commercial technologies and, at the same time, have
a timely and effective qualification scheme for military parts. A reliability prediction
model was also developed for VHSIC devices in 1989 (Reference 8) in support of a
MIL-HDBK-217 update. An interesting observation was made during that study that deviated
from the premise on which most of the MIL-HDBK-217 models were based. The
traditional approach to developing models was to collect as much field failure rate data as
possible, statistically analyze it, and quantify model factors based on the results of the
statistical analysis. For integrated circuits, one of the factors that was quantified was
inevitably device complexity. This complexity was measured by the number of gates or
transistors and was the primary factor on which the models were based. The correlation
between failure rate and complexity was strong and could be quantified because the

failure rate of circuits was much higher than it is today and the defect rate was
directly proportional to the complexity. As technology has advanced, the gate or
transistor count became so high that it could no longer effectively be used as the measure
of complexity in a reliability model. Furthermore, transistor or gate count data was often
difficult or impossible to obtain. Therefore, the model developed for VHSIC
microcircuits needed another measure of complexity on which to base the model. The
best measures, and the ones most highly correlated to reliability, are defect density and
silicon area. It can be shown that the failure rate (for small cumulative percent failure) is
directly proportional to the product of the area and defect density. However, another
factor that is highly correlated to defect density and area is the yield of the die, or the
percent of die that are functional upon manufacture. Ideally, a reliability model would
use either yield or defect density/area as the primary factor(s) on which to base the
model. The problem in using these factors in a model is that they are considered highly
proprietary parameters from a market competition viewpoint and, therefore, are rarely
released by the manufacturers. Therefore, the single most important driver of reliability
cannot be obtained by the user of the device, which is unfortunate because the accuracy
of the model suffers. The conflict between the usability of a model and its accuracy has
always been a difficult tradeoff to address for model developers.
Much of the literature in the 1990s on the topic of reliability prediction has centered
around the debate as to whether the reliability discipline should focus on PoF-based or
empirically-based models (such as MIL-HDBK-217) for the quantification of reliability.
In the author's opinion, many of the primary criticisms of MIL-HDBK-217 stem from the
fact that it was often used for purposes for which it was not intended. For example, it
was often used as a means by which the reliability of a product was demonstrated. Since
its use was contractually required, contractors would try to demonstrate compliance to the
specified reliability requirements by adjusting factors in the model to make it appear
that the reliability would meet requirements. Sometimes these adjustments had a
technical basis, and sometimes they did not. Les Gubbins, one of the government's first
project managers for the handbook, once made the analogy that engaging in the use of
these adjustment factors is like pushing the needle on your car's speedometer up and
convincing yourself you're going faster. This, of course, is not good engineering
practice, but rather was done for nontechnical reasons.
Another key development in the area of reliability predictions was related to the
implications of acquisition reform. In 1994, Military Specifications and Standards
Reform (MSSR) was initiated which decreed the adoption of performance-based
specifications as a means of acquiring and modifying weapon systems. It also overhauled

the military standardization process which, in turn, led to a list of standardization
documents that required priority action because they were identified as barriers to
commercial processes, as well as major cost drivers in defense acquisitions. The list
included only one handbook: MIL-HDBK-217. Over the years, critics of MIL-HDBK-217
have complained about its utility as an effective method for assessing reliability.
While the claim is made that it is inaccurate and costly, to date there is no viable
replacement in the public domain. As the DoD Lead Standardization Activity for
reliability and maintainability (R&M), Rome Laboratory (RL) was responsible for
implementing the R&M segment of MSSR. Within this context, RL initiated a project to
develop a new reliability assessment technique to supplement MIL-HDBK-217, and to
overcome some of its perceived problems.
Utilizing standardization reform funding, RL awarded a contract to the Reliability
Analysis Center and Performance Technology, Inc. The objective of the work was to
develop new and innovative reliability assessment methods that are flexible enough to
suit the needs of system reliability analysts regardless of their preferred (or required)
initial prediction methods. The intent was to use the final model to supplement or
possibly replace MIL-HDBK-217. The premise of traditional methods, such as MIL-HDBK-217,
is that the failure rate is primarily determined by the components comprising the
system. This was a good premise in the 1960s and 1970s when components exhibited
higher failure rates and systems were less complex than they are today. Increased system
complexity and component quality have resulted in a shift of system failure causes away
from components to more system level factors including manufacturing, design, system
requirements, interface, and software problems. Historically, these factors have not been
explicitly addressed in prediction methods. The intent of this study was to develop a
structure for an electronic system reliability assessment methodology. The term "system"
was used because the methodology accounted for all predominant causes of system
failure. The new model adopted a broader definition of reliability. An integral part of the
methodology was the assessment of processes used in the design and manufacture of the
system, including factors contributing to the following failure causes: parts, design,
manufacturing, system management, induced, wearout, no defect found and software.
The results of this study became the basis for the current RIAC 217Plus methodology.
The 2000s saw progress on the development of new standards, some of which will be
summarized in this book. The DoD has also initiated efforts to resurrect MIL-HDBK-217
by updating it with models reflecting state-of-the-art technologies.


1.5. Acronyms
Acronyms and abbreviations that are used in this book are defined as follows:
AL      Accelerated Life
ALM     Accelerated Life Model
ALT     Accelerated Life Testing
CA      Constant acceleration
CDF     Cumulative Distribution Function
CRR     Center for Risk and Reliability
D       Detectability
DoD     Department of Defense
DPA     Destructive Physical Analysis
DVT     Design Verification Test
ED      Electrical distributions
ELFR    Early life failure rate
EPRD    Electronic Parts Reliability Data
ESD     Electrostatic discharge
EV      External visual
EVT     Engineering Verification Test
FMEA    Failure Mode and Effect Analysis
FMECA   Failure Mode and Effect Criticality Analysis
FRU     Field Replaceable Unit
GFL     Gross/fine leak
HALT    Highly Accelerated Life Test (simultaneous temperature cycling and vibration)
HASS    Highly Accelerated Stress Screening
HAST    Highly Accelerated Stress Testing
HTB     High temperature bake
HTOL    High temperature operating life
HTRB    High temperature reverse bias
IOL     Intermittent operational life
IPL     Inverse Power Law
IWV     Internal water vapor
KPSI    Pounds per square inch, in thousands
LI      Lead integrity
MCMC    Markov Chain Monte Carlo
MLE     Maximum Likelihood Estimator
MS      Mechanical shock
MTTF    Mean Time to Failure
NPRD    Non-Electronic Parts Reliability Data
O       Occurrence
PD      Physical dimensions
PDF     Probability Density Function
PVT     Process Verification Test
RBD     Reliability Block Diagram
RPN     Risk Priority Number
RSH     Resistance to solder heat
S       Severity
SD      Solderability
TBD     To Be Defined
TC      Temperature cycling
TR      Thermal resistance
TST     Pre- and post-electrical test
TTF     Time to Failure
VVF     Vibration, variable frequency


1.6. References
1. Coppola, A., "Reliability Engineering of Electronic Equipment: A Historical
Perspective," IEEE Transactions on Reliability, Vol. R-33, No. 1, April 1984.
2. Erles, D.R., "Reliability Application and Analysis Guide," The Martin Company, July
1961.
3. Erles, D.R. and M.F. Edins, "Failure Rates," AVCO Corp., April 1962.
4. Knight, C.R., "Four Decades of Reliability Progress," 1991 Proceedings, Annual
Reliability and Maintainability Symposium.
5. "Reliability Prediction Methodologies for Electronic Equipment," AIR 5286, SAE
G-11 Committee, Electronic Reliability Prediction Committee, 31 January 1998.
6. "Reliable Application of Plastic Encapsulated Microcircuits," Reliability Analysis
Center Publication PEM2.
7. Morris, S.F. and J.F. Reilly (Rome Laboratory), "MIL-HDBK-217: A Favorite
Target."
8. Denson, W. and P. Brusius, "VHSIC and VHSIC-Like Reliability Modeling,"
RADC-TR-89-177.
9. Reliability Analysis Center, "Benchmarking Commercial Reliability Practices."


2. General Assessment Approach

Prior to developing a reliability model for a product or system, the analyst should
consider the following questions:

What is the goal of the model, and what decisions will be made based on it?
What data is currently available on the product?
Is field data available? If so, is it from the product or system operating in the
same manner and environment as the one under analysis?
Is test data available? If so, what types of tests (i.e., accelerated life tests, non-accelerated life tests, qualification tests, etc.)?
Is data, either field or test, available on a predecessor (i.e., earlier version) of the
product?
Have models been developed for specific failure modes, mechanisms and/or
causes of the product?
o Life models?
o Stress-strength models?
o Models from first principles?
Have critical failure causes of the product been identified?
How much support can be expected from suppliers regarding identification and
quantification of the failure causes of their product?

A suggested approach to modeling the reliability of a product is shown in Figure 2.0-1.


Figure 2.0-1: General Modeling Approach (flowchart: define the system; identify the
purpose of the model; determine the appropriate level at which to perform the assessment
(system, assembly, part or failure cause); assess the data available and the feasibility of
performing reliability tests; determine the appropriate approach and execute it; combine
the data; develop the system model)


Each of the elements of this approach is discussed below.

2.1. Define System


The first step in assessing the reliability of a product or system is to clearly define the
scope of the assessment. A model is then generated that describes the breakdown of the
product or system. This breakdown can be in accordance with a physical hardware
hierarchy of the system, or a functional breakdown. Either way, the goal is to define the
items for which a reliability estimate is required.
If handbook reliability prediction methodologies such as 217Plus or MIL-HDBK-217 are
used, the definition of the items to address in the prediction is generally accomplished
with a hardware-based hierarchical breakdown, since those prediction methodologies are
based on the physical components comprising the system. In other approaches, such as
life modeling from accelerated test data, the product or system breakdown can be based
on functionality or hardware, with the exception that the breakdown continues down to
the root failure cause or mechanism level. Tools for developing this system model include
FMEA and FTA.
A fault tree representation of a system breakdown in which the level at which reliability
estimates are made is the component, represented by circles (basic events), is illustrated
in Figure 2.1-1. This figure represents a reliability prediction performed using
MIL-HDBK-217 or 217Plus.
Figure 2.1-1: Fault Tree Representation of System Model (the system broken down into
assemblies, subassemblies and components, with the components as the basic events)


A fault tree representation of a system breakdown in which the level at which reliability
estimates are made is the failure mechanism of each component, represented by circles
(basic events), is illustrated in Figure 2.1-2. This would be the representation of a
reliability prediction performed using a physics approach, in which the intent is to
estimate the reliability of specific root-cause failure mechanisms.
Figure 2.1-2: Fault Tree Representation to the Failure Cause Level (the same system
breakdown as Figure 2.1-1, with each component further broken down into its failure
mechanisms, FM1 through FM3, as the basic events)



Approaches such as this, in which the reliability of each failure mechanism is estimated,
are practical if:
1. The product or system under analysis has a manageable number of failure
mechanisms that can be estimated
2. The approach can be practically applied for all failure mechanisms over the entire
supply chain. In other words, each organization responsible for their component
or assembly has the ability to estimate the reliability of all failure mechanisms
within their component or assembly.
This same representation is relevant to performing FMEAs. In this case, the lowest level
events in the fault tree are the constituent failure modes of the component. If a failure
mechanism modeling approach is to be used, it needs to be applied to all failure
mechanisms in order for the assessment to quantify the reliability of the entire system.
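
To make this concrete, the following sketch (in Python) illustrates how failure rates
estimated at the failure mechanism level can be rolled up a fault tree of the type shown in
Figure 2.1-2. It assumes constant failure rates and independent basic events combined
through OR gates, so that rates simply add; all component names and numerical values
are hypothetical.

import math

# Failure rate of each root-cause failure mechanism, in failures per
# million operating hours (illustrative values only).
mechanism_rates = {
    "comp_1a1": {"FM1": 0.02, "FM2": 0.05},
    "comp_1a2": {"FM1": 0.01, "FM2": 0.03},
    "comp_1b1": {"FM1": 0.10, "FM3": 0.04},
}

# OR gate: the failure rate of an item is the sum of the failure rates
# of its constituent mechanisms (rare, independent events).
component_rates = {c: sum(fms.values()) for c, fms in mechanism_rates.items()}
system_rate = sum(component_rates.values())

# Exponential reliability at a mission time of 10,000 hours.
t_hours = 10_000
reliability = math.exp(-system_rate * 1e-6 * t_hours)

print(f"System failure rate: {system_rate:.3f} failures/10^6 hr")
print(f"Reliability at {t_hours} hr: {reliability:.4f}")

Under these assumptions, the system failure rate is simply the sum of the mechanism
failure rates; AND gates (redundancy) would require combining event probabilities
rather than summing rates.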

2.2. Identify the Purpose of the Model


Perhaps the single most important factor contributing to a successful reliability
assessment is an unambiguous definition of the specific purpose to be accomplished in
the assessment. Only by knowing the purpose of an assessment can an appropriate
methodology be selected. If the purpose is not made clear, there is little chance that the
assessment will be successful. In the author's opinion, this unclear definition of purpose
is the root cause of many of the controversies found in the reliability discipline over the
last twenty years regarding the selection and use of the appropriate approach.
All of the approaches described in this book have merit. All have their strengths and
weaknesses. A successful assessment will leverage the strengths of specific
methodologies toward the specific goals of the assessment. Toward this end, the intent of
this section (and the following sections) is to provide guidance on the applicable
approaches for specific assessment purposes.
A breakdown of the possible purposes for developing a reliability model is shown in
Figure 2.2-1.



Figure 2.2-1: Breakdown of Potential Reliability Modeling Purposes (a tree breaking the
purpose of the model into the categories of risk assessment, reliability demonstration,
design aid and maintainability, each subdivided into the specific purposes listed in
Table 2.2-1, along with determining fault tolerance/redundancy and testability
requirements)


Each of these purposes is described in Table 2.2-1.


Table 2.2-1: Reliability Assessment Purposes

Risk Assessment

Anticipated failure: Risk assessments are performed to quantify the reliability of
critical- or safety-related failure modes before the product is fielded. This is often done
to meet industry or customer requirements.

Observed failure: Risk assessments are performed on fielded products that experience
failures. Factors that usually need to be quantified are (a) determination of the root
cause, (b) lifetime, (c) percent failure at a given time, (d) the percent of the population at
risk (i.e., whether the root cause is special cause or common cause), (e) whether the
defect is lot- or batch-related, (f) whether the defective portion can be contained, and
(g) what the reliability will be as a function of the level of corrective action (for
example: (1) if nothing is done, (2) if a complete recall is done, and (3) an approach in
between).

Input to FMEA/FTA for identification of failure cause priority: Techniques such as
FMEA and FTA are used to assess and prioritize failure causes. Part of this
prioritization includes the identification of the probability of occurrence, either
qualitatively or quantitatively.

Compare competing designs: For this purpose, reliability modeling is performed to
quantify the relative reliabilities of several competing designs. This analysis is then used
as one criterion from which the final design is chosen. In this case, reliability is only one
of the factors to be accounted for in this comparison, and needs to be traded off against
all of the other factors.

Design Aid

Model reliability growth: A natural part of the development process is to grow the
reliability to a point at which it meets its reliability requirement. For this purpose, the
reliability metric of choice is quantified as a function of time. This provides Program
Management with the information to assess the reliability status of the project and to
estimate the date at which the requirements will be met.

Determine feasibility of meeting reliability requirement: In many cases, reliability
requirements are levied upon suppliers and contractors. For this purpose, the reliability
assessment is performed to determine if there is a reasonable probability of achieving the
reliability requirements. If it is highly likely that requirements cannot be met, then
management must make decisions regarding the future of the program.

Determine impact of factors on reliability (derating): For this purpose, the effects of
specific factors are assessed. For example, the effects of temperature may be assessed to
determine how much cooling is required.

Determine screening requirement: This purpose relates to quantifying reliability as a
function of possible screening options, so that it can be determined which screening
options will result in the reliability requirements being met.

Reliability Demonstration

Determine if minimum robustness is achieved: This purpose is to provide quantitative
data that proves, within acceptable confidence limits, that predefined robustness levels
are achieved. These robustness levels usually correspond to a qualification requirement,
and may not be highly correlated to field reliability.

Determine if reliability requirement is achieved: This purpose is to provide quantitative
data that proves, within acceptable confidence limits, that the reliability requirements are
met.

Maintainability

Warranty cost predictions: For this purpose, the assessment is performed so that the
costs associated with warranty repairs or replacements can be estimated.

Preventive Maintenance (PM) schedules: The assessment is performed so that effective
preventive maintenance schedules can be derived.

Spares allocation: For repairable systems, since the replacement of failed items requires
the availability of spare items, the question of how many spares to keep on hand
inevitably arises. The reliability characteristics of the item are one piece of information
required; others are repair rates, a reliability block diagram, etc.

Allocate maintenance personnel: For repairable systems, organizations need to
determine the personnel required to keep up with maintenance demands. One input to
this is the frequency of various types of failures.

Note 1: only for the specific failure causes modeled
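
As an illustration of demonstrating reliability "within acceptable confidence limits," the
sketch below computes a classical one-sided upper confidence bound on a constant
failure rate from a time-terminated test, using the standard chi-squared relationship; the
failure count and test time are hypothetical values.

from scipy.stats import chi2

def failure_rate_upper_bound(failures: int, total_hours: float,
                             confidence: float = 0.90) -> float:
    """Upper confidence bound on a constant failure rate (failures/hour)."""
    dof = 2 * failures + 2      # degrees of freedom, time-terminated test
    return chi2.ppf(confidence, dof) / (2.0 * total_hours)

# Example: 2 failures observed in 50,000 unit-hours of testing.
lam_u = failure_rate_upper_bound(failures=2, total_hours=50_000)
print(f"90% upper bound: {lam_u * 1e6:.1f} failures/10^6 hr")
print(f"Lower bound on MTBF: {1.0 / lam_u:.0f} hr")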


Specific reliability modeling purposes are generally suited to specific program phases, as
summarized in Table 2.2-2.
Table 2.2-2: Program Phase vs. Reliability Assessment Purpose

(A matrix indicating, for each of the purposes listed in Table 2.2-1, the program phases
in which it typically applies: Concept, Development, Early Production, Production and
Deployment.)
2.3. Determine the Appropriate Level at Which to Perform the Modeling
The first thing to determine is the hierarchical level at which the assessment will be
performed. A generic hierarchy is shown below:

System
Subsystem
Assembly
Component
Failure Modes (Root)
Failure Causes/Mechanisms (Root)
2.3.1. Level vs. Data Needed

Traditional handbook approaches for reliability predictions will generally be applied at
the component level. In this case, a failure rate is estimated for each component, based
on the factors accounted for in the specific model used. In some cases, this predicted
failure rate will be apportioned among the component's failure modes in an FMEA (if
the MIL-STD-1629 method is used, in which the criticality is determined by the modal
failure rate, i.e., the component failure rate multiplied by the failure mode percentage of
occurrence). This approach can be used based on readily accessible data, such as that
found in the handbooks. This approach also allows for the estimation of a failure rate
associated with each failure severity. This is accomplished by adding the failure rates
for the failure modes that result in a specific severity classification of failure.
If the level to be analyzed is failure causes, then additional detailed data and information
are required. Therefore, the practicality of obtaining the required data must be a
consideration when choosing an appropriate approach. The degree of difficulty of
obtaining the required data generally increases as you go lower in the hierarchy. This
concept is illustrated in Figure 2.3-1.


Figure 2.3-1: Typical Data Requirements vs. Level of Hierarchy (at the system,
subsystem and assembly levels: parts lists, environmental conditions and part stresses; at
the component and failure mode levels: failure mode distributions; at the failure
cause/mechanism level: yield, defect density, and internal part stresses and distributions)


As shown in Figure 2.3-1, the data required for the assessment of specific failure causes
can include factors like yield, defect density, and internal part stresses and distributions.
Because these factors are often difficult for outside organizations to obtain, the best
approach is generally to have the manufacturer assess the reliability of the causes in the
event that the selected approach requires this sort of data.
The appropriate approaches for a reliability assessment will, therefore, generally depend
on the location of a company's product in the hierarchy of the product or system.


2.3.2. Using an FMEA as the Basis for a Reliability Model

An FMEA can be an effective tool in identifying specific root failure causes that need to
be quantified in a reliability model. A generic FMEA approach is shown in Figure 2.3-2.
Figure 2.3-2: The Basic FMEA Approach (flow: system hierarchy; functions; how
functions can fail (failure modes); failure effects; identification of failure causes;
occurrence and detectability ratings; Risk Priority Number (RPN); improve design)
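
As a brief illustration of the RPN element of this approach, the sketch below ranks a few
hypothetical failure causes by the conventional Risk Priority Number, computed as the
product of the severity (S), occurrence (O) and detectability (D) ratings; the causes and
ratings shown are illustrative only.

# Minimal sketch of the Risk Priority Number (RPN) calculation used to
# rank failure causes in an FMEA. RPN is conventionally the product of
# severity (S), occurrence (O) and detectability (D) ratings, each on a
# 1-10 scale; the line items below are illustrative assumptions.

fmea_rows = [
    # (failure cause,            S,  O,  D)
    ("solder joint fatigue",     7,  4,  6),
    ("electrostatic discharge",  8,  2,  3),
    ("connector corrosion",      5,  5,  7),
]

# Rank causes from highest to lowest risk priority.
ranked = sorted(fmea_rows, key=lambda row: row[1] * row[2] * row[3],
                reverse=True)

for cause, s, o, d in ranked:
    print(f"{cause:28s} RPN = {s * o * d}")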


The hierarchical relationship between cause, mode and effect is shown in Figure 2.3-3.
For example, a failure mode can have any number of potential effects, and also can have
any number of potential causes.



Figure 2.3-3: Hierarchical Relationship Between Cause, Mode and Effect (a single
failure mode linked upward to several potential failure effects and downward to a tree of
potential causes and sub-causes, e.g., Cause #2 breaking down into Causes #2a and #2b,
and Cause #2a into Causes #2a1 and #2a2)


If the reliability assessment is to be performed at the failure cause level, then all possible
causes need to be identified. One of the FMEA objectives is to identify all conceivable
failure causes. One way to accomplish this is to identify all combinations of initial
conditions, stresses and mechanisms, as illustrated in Figure 2.3-4 and Table 2.3-1.

Figure 2.3-4: Approach to Identifying Causes (failure causes enumerated as
combinations of initial conditions (defect free, or intrinsic/extrinsic defects), stresses
(operational or environmental) and mechanisms (electrical, mechanical or chemical))



Table 2.3-1: Examples of Initial Conditions, Stresses and Mechanisms

Initial Conditions:
  Defect free
  Defects (intrinsic): voids, material property variation, geometry variation,
  contamination, ionic contamination, crystal defects, stress concentrations
  Defects (extrinsic): organic contamination, nonconductive particles, conductive
  particles, contamination, ionic contamination

Stresses:
  Operational: thermal, electrical, chemical, optical
  Environmental: chemical exposure, salt fog, mechanical shock, UV exposure, drop,
  vibration, temperature (high and low), temperature cycling, humidity, pressure (low
  and high), radiation (EMI, cosmic), sand and dust

Mechanisms:
  Electrical: electromigration, dielectric breakdown, dendritic growth, tin whiskers,
  electro-thermo-migration, second breakdown
  Mechanical: metal fatigue, stress corrosion cracking, melting, creep, warping,
  brinelling, fracture, fretting fatigue, pitting corrosion, spalling, crazing, abrasive wear,
  adhesive wear, surface fatigue, erosive wear, cavitation pitting, elastic deformation,
  material migration, cracking, plastic deformation, brittle fracture, expansion,
  contraction, elastic modulus change, outgassing
  Chemical: corrosion, chemical attack, fretting corrosion, oxidation, crystallization


One of the keys to a successful FMEA is to understand the relationship between cause,
mode and effect. In general, there is a natural tiering effect that occurs in an FMEA as a
function of the product or system level, as illustrated in Table 2.3-2. For example, at the
most basic level, the part manufacturing process, the cause of failure may be a process
step that is out of control. The ultimate effect of that cause becomes the failure mode at
the part level, the failure effect of the part becomes the failure mode at the next level of
assembly, and so forth. It is very important that the cause, mode and effect are not
confounded in the analysis.
Table 2.3-2: Relationship Between Cause, Mode and Effect

Part Manufacturing Process:  Cause → Mode → Effect
Part:                        Cause → Mode → Effect  (the effect at the manufacturing
                             process level becomes the failure mode at the part level)
Assembly:                    Cause → Mode → Effect  (the effect at the part level becomes
                             the failure mode at the assembly level)
System:                      Cause → Mode → Effect  (the effect at the assembly level
                             becomes the failure mode at the system level)

More detail regarding an FMEA approach is provided in Chapter 8.


Figures 2.3-5 through 2.3-8 illustrate, with fault trees, how the relationship between
cause, mode and effect scales up or down the product or system hierarchy, depending on
the hierarchical level at which the analysis is to take place.
In this example, the "failure cause" is considered to be at the lowest level at which a
modeling effort will occur. If the cause corresponds to a fundamental mechanism of
failure (i.e., the mechanism represents the fundamental physical failure of the item), then
the term "cause" is considered synonymous with the term "mechanism."



Figure 2.3-5: Fault Tree of Product or System (a generic fault tree of OR and AND gates
terminating in basic events)

Figure 2.3-6: Fault Tree of Product or System with Cause as the Lowest Level (the same
tree with one branch labeled as an effect-mode-cause chain, the cause being a basic event
at the lowest level)



Figure 2.3-7: Fault Tree of Product or System with Cause Above the Lowest Level (the
effect-mode-cause chain shifted one level higher, so that the cause is an intermediate
event with basic events below it)

Figure 2.3-8: Fault Tree of Product or System with Cause Two Levels Above the Lowest
Level (the effect-mode-cause chain shifted higher still, with two levels of events below
the cause)
Therefore, if the FTA view of the product or system is to be consistent with the reliability
assessment, then the lowest level in the tree must be the level at which reliability
estimates are made.
The section above describes the hierarchical level at which a reliability model will be
developed, whether it be a failure cause, a failure mode, a component or an assembly.
Once this physical level is determined, there are several possible model forms with which to

construct a model to describe its reliability. This form will, of course, depend on the
specific approach and data used to develop the reliability model. Some of these forms are
described below. More detail on each of these is provided in subsequent sections.
2.3.3. Model Form vs. Level

The form of the model to be developed will depend on the level and the approach. For
example, if empirical data is used directly without a model developed from it, assuming
constant failure rate, the best estimate of the failure rate is simply:

λ = (number of failures) / (total operating time)
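
For example, the short sketch below computes this point estimate, and the corresponding
MTBF, from hypothetical field data:

# Minimal sketch of a point estimate of a constant failure rate from
# field data, plus the corresponding MTBF; the counts are illustrative.

failures = 14
operating_hours = 2_300_000     # cumulative operating time of the fleet

lam = failures / operating_hours          # failures per hour
print(f"Failure rate: {lam * 1e6:.2f} failures/10^6 hr")
print(f"MTBF: {1.0 / lam:,.0f} hr")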

If a life model is developed from life tests performed at various stress levels, the result
will be a time-to-failure (TTF) distribution (described by the Weibull, lognormal or other
statistical distributions) that is a function of stress levels. If a Weibull distribution is
used, the general model will be:

R(t) = e^(-(t/η)^β)

where β is the shape parameter and η is the characteristic life, which is expressed as a
function of the applied stresses.
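
As an illustrative sketch of such a stress-dependent Weibull model, the code below
assumes an Arrhenius temperature dependence for the characteristic life η; the activation
energy, constant and shape parameter are hypothetical values chosen only to show the
mechanics of the calculation.

# Minimal sketch of a stress-dependent Weibull reliability model, with
# the characteristic life following an Arrhenius temperature dependence.
# All parameter values are illustrative assumptions.

import math

BOLTZMANN_EV = 8.617e-5   # Boltzmann constant, eV/K

def characteristic_life(temp_c: float, a: float = 3.0e-4,
                        ea_ev: float = 0.7) -> float:
    """Arrhenius model: eta = A * exp(Ea / (k * T)), T in kelvins."""
    t_kelvin = temp_c + 273.15
    return a * math.exp(ea_ev / (BOLTZMANN_EV * t_kelvin))

def weibull_reliability(t_hours: float, temp_c: float,
                        beta: float = 2.0) -> float:
    """R(t) = exp(-(t/eta)^beta), eta set by the operating temperature."""
    eta = characteristic_life(temp_c)
    return math.exp(-((t_hours / eta) ** beta))

# Reliability at 20,000 hours for a 55 C use condition versus a 125 C
# accelerated test condition.
for temp in (55.0, 125.0):
    print(f"{temp:5.1f} C: R(20,000 hr) = {weibull_reliability(2e4, temp):.4f}")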

If models are to be derived from the analysis of field data, there are several possible
model forms. Traditional methods of reliability prediction model development have
included the statistical analysis of empirical failure rate data. When using multiple linear
regression techniques with highly variable data (which is often the case with empirical
field failure rate data), a requirement of the model form is that it be multiplicative (i.e. the
predicted failure rate is the product of a base failure rate and several factors that account
for the stresses and component variables that influence reliability). An example of a
multiplicative model is as follows:

λp = λb·πe·πq·πs

where:

λp = predicted failure rate
λb = base failure rate
πe = environmental factor
πq = quality factor
πs = stress factor
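
A minimal sketch of evaluating a model of this multiplicative form is shown below; the
base failure rate and π factor values are hypothetical and are not taken from any
handbook.

# Minimal sketch of a multiplicative handbook-style failure rate model:
# a base failure rate scaled by pi factors. The values below are
# illustrative assumptions, not values from any handbook.

def predicted_failure_rate(lambda_b: float, pi_e: float, pi_q: float,
                           pi_s: float) -> float:
    """lambda_p = lambda_b * pi_e * pi_q * pi_s (failures/10^6 hr)."""
    return lambda_b * pi_e * pi_q * pi_s

# A part with a base rate of 0.05 failures/10^6 hr, in a harsher-than-
# baseline environment, commercial quality and moderate stress.
lam_p = predicted_failure_rate(lambda_b=0.05, pi_e=2.0, pi_q=3.0, pi_s=1.5)
print(f"Predicted failure rate: {lam_p:.3f} failures/10^6 hr")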

However, a primary disadvantage of the multiplicative model form is that the predicted
failure rate value can become unrealistically large or small under extreme value
conditions (i.e., when all factors are at their lowest or highest values). This is an inherent
limitation of multiplicative models, primarily due to the fact that individual failure
mechanisms, or classes of failure mechanisms, are not explicitly accounted for.
Another possible approach to model reliability is to segment the failure rate for each
group of failure causes that are accelerated by stresses incurred during specific portions
of a mission. Each of these failure rate terms is then accelerated by the appropriate
stress or component characteristic. This is the model form used in the RIAC 217Plus
methodology, and is as follows:

λp = λo·πo + λe·πe + λc·πc + λi + λsj·πsj

where:

λp = predicted failure rate
λo = failure rate from operational stresses
πo = product of failure rate multipliers for operational stresses
λe = failure rate from environmental stresses
πe = product of failure rate multipliers for environmental stresses
λc = failure rate from power or temperature cycling stresses
πc = product of failure rate multipliers for cycling stresses
λi = failure rate from induced stresses, including electrical overstress and ESD
λsj = failure rate from solder joints
πsj = product of failure rate multipliers for solder joint stresses
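
A minimal sketch of evaluating this additive model form is shown below; all of the
failure rates and π values are hypothetical and are not actual 217Plus parameters.

# Minimal sketch of the additive 217Plus-style model form: each failure
# rate term is scaled by its own product of pi factors and the terms
# are summed. All numeric values are illustrative assumptions.

def additive_failure_rate(lam: dict, pi: dict) -> float:
    """lambda_p = lo*po + le*pe + lc*pc + li + lsj*psj."""
    return (lam["o"] * pi["o"] + lam["e"] * pi["e"] +
            lam["c"] * pi["c"] + lam["i"] + lam["sj"] * pi["sj"])

# Failure rate contributions in failures/10^6 calendar hours.
lam = {"o": 0.020, "e": 0.010, "c": 0.015, "i": 0.005, "sj": 0.008}
# Each pi is itself the product of the multipliers for that stress group.
pi = {"o": 1.2, "e": 0.8, "c": 1.5, "sj": 1.1}

print(f"Predicted failure rate: {additive_failure_rate(lam, pi):.4f} "
      "failures/10^6 hr")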
The concept of this approach is that the occurrence of each group of failure causes is
mutually exclusive, and their failure rates can be modeled separately and summed. By
modeling the failure rate in this manner, factors that account for the application- and
component-specific variables that affect reliability (π factors) can be applied to the
appropriate additive failure rate term. Additional advantages of this approach are that it:
o Addresses operating-, non-operating- and cycling-related failure rates in an
additive model. These individual failure rates are weighted in accordance with
the operational profile (duty cycle and cycling rate). The π factors modify only
the applicable failure rate term, thereby eliminating many of the extreme value
problems that plague multiplicative models.
o Is based on observed failure mode distributions, so that observed component
root failure causes are empirically modeled.
o Can be tailored with test data (if available) by applying it in a Bayesian fashion to
the appropriate failure rate term, as sketched below. As examples, temperature cycling data can be
combined with the failure rate from power or temperature cycling stresses (λc), or
high temperature operating life data can be combined with the failure rate from
operational stresses term (λo).
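
The sketch below illustrates one standard way such a Bayesian combination can be
performed, using a gamma-Poisson conjugate update of a single failure rate term. The
prior weighting and test results are hypothetical, and the sketch is intended only to
illustrate the concept, not to reproduce the exact 217Plus procedure.

# Minimal sketch of combining test data with a predicted failure rate
# term in a Bayesian fashion, using the standard gamma-Poisson conjugate
# update. The prior "equivalent data" weighting and the test results are
# illustrative assumptions; this is not the exact 217Plus procedure.

lam_c_pred = 0.015e-6      # predicted cycling failure rate, failures/hr
equiv_hours = 2.0e6        # assumed weight (equivalent hours) of the prior

# Gamma prior chosen so its mean equals the prediction:
a0 = lam_c_pred * equiv_hours          # prior "failures"
b0 = equiv_hours                       # prior "hours"

# Temperature cycling test result: 1 failure in 1.5e6 equivalent hours.
test_failures, test_hours = 1, 1.5e6

# Conjugate update: posterior is gamma(a0 + r, b0 + T).
lam_c_post = (a0 + test_failures) / (b0 + test_hours)

print(f"Prior mean:     {lam_c_pred * 1e6:.4f} failures/10^6 hr")
print(f"Posterior mean: {lam_c_post * 1e6:.4f} failures/10^6 hr")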

2.4. Assess Data Available


A predominant factor that will dictate the options that an analyst has in modeling the
reliability of a product is the availability of data. The analyst should consider the
following questions when assessing the availability of test data:

Is field data available on the specific product or system?
Is data on a similar product or system available? If so, is it field data or test data?
If data is available, is it:
o Relevant?
o Of sufficient quantity?
o Of sufficient quality?
If physics-based models are to be employed, is the required detailed data and
information available, such as:
o Defect rates
o Material properties (e.g., functional characteristics)
o Defect (flaw) distributions
o Material variation quantification (e.g., purity, yields, dimensions)
o Etc.

Perhaps the most important element of a reliability program is the reliability testing of the
product. Reliability test data is, in turn, a critical element for assessing reliability. In this
context, a reliability test consists of two primary elements: measurement and exposure.
The measurement is the means of assessing the performance of the product or system
relative to its requirements. It usually consists of quantifying parameters that are
specifiable attributes. It may include both continuous variables (e.g., gain, power output,
etc.) and attribute data (i.e., a binomial representation of whether a product possesses an
attribute or not). Exposure is the application of a stress or stresses. These stresses may
consist of operational stresses or environmental stresses. Operational stresses are defined
as those stresses to which the product will be exposed by the act of operating the product.

For example, a transistor is designed to have a voltage applied, and pass a given amount
of current. As such, these are operational stresses. It will also be exposed to externally
applied environmental stresses such as temperature, temperature cycling, vibration, etc.
Reliability tests can be performed either by sequentially performing repeated cycles of a
measurement, exposure, measurement, etc., or by continuously measuring performance
parameters in-situ during exposure. It is usually desirable to perform in-situ
measurement so that times to failure can be accurately determined. In practical cases,
however, it is not always feasible due to the complexities of setting up such measurement
capabilities. If repeated cycles of a measurement, exposure, and measurement are used,
the measurement intervals should be frequent enough so that sufficient resolution in the
times-to-failure data is available.
Practical considerations for assessing the feasibility of testing products are:

• Are samples available? If so, are they available in sufficient quantity?
• Are measurement systems available for continuous, in-situ, measurements during
  exposure? If not, repeated cycles of a measurement and exposure may be required.
• Are laboratory facilities available to perform the exposure?
• Are the measurement and exposure facilities available to support a multi-cell test
  at various stress levels (i.e., application of various combinations of stresses)?

Additional considerations for testing products and systems are provided in Chapter 5.

2.5. Determine and Execute Appropriate Approach


This section discusses the various options that an analyst has to predict, assess, and
estimate the reliability of a product. Figure 2.5-1 illustrates the breakdown of various
approaches.

Figure 2.5-1: Breakdown of Reliability Assessment Options


Table 2.5-1 describes each approach, its strengths and its weaknesses. This information is
presented in the context of the intent of this book, which is to present options for
quantifying the reliability of a product as it is used by customers in actual use conditions.
Effective techniques also include using a combination of the approaches in this section.
The manner in which these approaches can be combined will be addressed in Section 2.6.


Table 2.5-1: Summary of Reliability Assessment Options

1. Highly Accelerated Life Test (HALT)
   Description: Exposure to severe levels of thermal cycling and vibration
   Strengths:
   - Can quickly identify failure causes that are accelerated by thermal cycling and
     vibration
   - Accounts for the interaction of the two stresses
   - Can be used as a screening basis
   Weaknesses:
   - Only accelerates specific failure causes accelerated by the test
   - Large extrapolations to use conditions are required
   - Can excite non-relevant failure modes (i.e., those that are not representative
     under field environmental conditions)
   - Cannot quantify special cause failure modes

2. Qualification
   Description: Exposure to industry standard trade and commerce tests
   Strengths:
   - Can demonstrate a degree of robustness to the specific qualification tests
   - Reflects the actual reliability
   - Test data can be collected and applied before the system is fielded
   Weaknesses:
   - Correlation to field use conditions is difficult
   - Can excite non-relevant failure modes (i.e., those that are not representative
     under field environmental conditions)
   - Cannot quantify special cause failure modes

3. DOE multicell
   Description: Life tests under a variety of stress levels
   Strengths:
   - Can accurately model lifetime due to common cause mechanisms
   - Can quantify acceleration factors
   - Can estimate reliability at use conditions
   - Reflects the actual reliability
   - Test data can be collected and applied before the system is fielded
   Weaknesses:
   - Can be expensive to execute
   - Difficult to quantify special cause failure modes due to large sample sizes
     sometimes required

4. Reliability demo (accelerated)
   Description: Demonstration of reliability via life tests at accelerated conditions
   Strengths:
   - Test data can be collected and applied before the system is fielded
   - Can demonstrate required reliability in a statistically significant way
   - Reflects the actual reliability
   Weaknesses:
   - Correlation to field use conditions is difficult

5. Reliability demo (non-accelerated)
   Description: Demonstration of reliability via life tests at non-accelerated conditions
   Strengths:
   - Test data can be collected and applied before the system is fielded
   - Can demonstrate required reliability in a statistically significant way
   - Reflects the actual reliability
   Weaknesses:
   - Large sample sizes usually required

6. Field data, same product
   Description: Use of field experience data on the product or system under analysis
   Strengths:
   - The most representative data
   - Can quantify failure causes that exhibit low percent failures
   Weaknesses:
   - Usually, the data is not available in time for use in product or system development
   - Collecting field data is prone to errors


Table 2.5-1: Summary of Reliability Assessment Options (continued)

7. Models
   Description: Models developed from field experience data on similar products
   Strengths:
   - Can be a good indicator of field reliability performance
   - Based on easily obtainable data
   - Easy to use
   - Can quantify failure causes that exhibit low percent failures
   - Can be reasonably sensitive to various stresses
   - Represents field use
   Weaknesses:
   - Difficult to keep updated
   - Actual failures are impacted by factors not considered by the model
   - Models become outdated by new technology
   - Misapplication of models by the analyst
   - No uncertainty estimates available

8. Raw data (EPRD, NPRD)
   Description: The direct use of field experience data on similar products
   Strengths:
   - Easy to use
   - Can quantify failure causes that exhibit low percent failures
   - Represents field use
   Weaknesses:
   - Difficult to collect good quality field data
   - Difficult to distinguish correlated variables (e.g., quality and environment)
   - Extrapolations to specific use conditions required
   - Not feasible to collect data representing all conceivable situations

9. Stress/Strength modeling
   Description: Calculation of failure probabilities based on the strength distribution
   and the stress distribution
   Strengths:
   - Good approach for fundamental material behavior
   - Can model fatigue behavior
   - Models specific failure mechanisms
   - Valuable for predicting end-of-life for known failure mechanisms
   Weaknesses:
   - May require information that's difficult to obtain
   - Difficult to use for estimating field reliability
   - Can be complex and costly to apply
   - Difficult to use for modeling defect-driven failure mechanisms
   - Not practical to use for the assessment of an entire system
   - Difficult to account for material defects
   - Can only be applied in rare cases
   - In practice, difficult to derive fundamental equations

10. First principles
    Description: Calculation of failure probabilities based on a fundamental
    understanding of the physics of the failure cause
    Strengths:
    - Scientifically robust
    - Good approach for fundamental material behavior
    - Can model fatigue behavior
    - Models specific failure mechanisms
    - Valuable for predicting end-of-life for known failure mechanisms
    Weaknesses:
    - Empirical data is usually required to validate the model, or to estimate model
      constants
    - Difficult to account for material defects
    - May require information that's difficult to obtain
    - Can be complex and costly to apply
    - Difficult to use for modeling defect-driven failure mechanisms
    - Not practical to use for the assessment of an entire system


Selecting a methodology
The various approaches summarized here are suited to various program phases,
corresponding to prediction, assessment and estimation. This is shown in Table 2.5-2
(note that the shaded area indicates where the approach can be applied). For example,
MIL-HDBK-217 should only be used for prediction, meaning that its usefulness is
limited for assessment and estimation. Conversely, 217Plus was designed to provide a
framework for all three reliability modeling phases.

Table 2.5-2: Relevancy of Approach to Prediction, Assessment and Estimation
[Matrix of approaches versus the Prediction, Assessment and Estimation columns, with
shading indicating where each approach can be applied. The approaches are grouped as:
Empirical - Test (Accelerated: HALT, Qualification, DOE multicell, Reliability demo;
Non-Accelerated: Reliability demo) and Field data (Same product; Similar product:
Models (217Plus, MIL-HDBK-217, Bellcore), Raw data (EPRD, NPRD)); and
Physics - Stress/Strength modeling, First principles.]

The appropriate approach(es) to modeling reliability will depend on several factors,
including:

• The severity of product failure. In this context, severity can mean that there are
  significant financial ramifications of failure, that there are safety-related risks, or
  that the system is not maintainable. All of the reasons that high reliability may be
  required in the first place are the same reasons that the reliability model must be
  acceptably accurate. Since reliability is a stochastic process, reliance on any one
  of the methodologies discussed in this book is susceptible to uncertainties.
  Sometimes these uncertainties can be very large. This is true for any of the
  methods. If, however, several methodologies can be employed, and their results
  are consistent with each other, then this adds much more credibility to the
  modeled reliability of the product. This is especially true if a physics approach is
  coupled with an empirical approach.
• The amount and level of detailed information available to the analyst. Often, this
  will dictate the available choices for the analysis.
• Complexity of the product. If the product or system is very complex, has many
  levels of indenture, and there is a complex supply chain involving many suppliers,
  then the available suitable choices for analysis at the top of the supply chain will
  be limited. For example, as discussed previously, it is very difficult for
  organizations higher in the supply chain to obtain the data required to utilize one
  of the physics approaches. If, however, the entire supply chain utilizes the PoF
  approach for the product or system, it can be a viable approach.

Table 2.5-3 provides general guidance on the identification of appropriate approaches
based on the purpose of the assessment.
If empirical data is to be used as a basis for one or more of the approaches, there are
various factors that will influence the uncertainty in assessments made with this
empirical data. These include the following data attributes:

• Relevancy: how close the product or system architecture and complexity on which
  the data was collected is to the item under analysis
• Quantity: the statistical uncertainty of reliability estimates based on the quantity
  of data. For example, if the TTF distribution is exponential, this uncertainty is
  usually modeled with the Chi-squared distribution (a minimal sketch follows this
  list).
• Quality: the accuracy inherent in the data itself. For test data, the accuracy is
  generally much better than with field data, since test data is usually much better
  controlled, with known sample sizes, failure times, etc. Field data, on the other
  hand, is usually fraught with many problems and sources of uncertainty. This will
  be discussed in Section 5.2.1.2.
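For the exponential case mentioned under "Quantity", the Chi-squared uncertainty can be
computed directly. The following is a minimal Python sketch (using SciPy; the failure
count, hours and confidence level are hypothetical):

    from scipy.stats import chi2

    # One-sided upper confidence bound on a constant failure rate,
    # given r failures in T cumulative device-hours (time-terminated test).
    r, T, CL = 3, 2.0e5, 0.80
    lam_point = r / T
    lam_upper = chi2.ppf(CL, 2 * r + 2) / (2 * T)
    print(lam_point, lam_upper)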


Table 2.5-3: Identification of Appropriate Approaches Based on the Purpose
[Matrix of purposes versus the approaches of Table 2.5-2. The purposes include: design
aid (anticipated failure, observed failure); input to FMEA/FTA for identification of
failure cause priority; comparison of competing designs; modeling reliability growth;
determining fault tolerance and redundancy; determining the feasibility of meeting
reliability requirements; determining testability requirements; determining the impact
of factors on reliability (e.g., derating); determining screening requirements;
determining whether minimum robustness is achieved; determining whether the
reliability requirement is achieved; risk assessment; warranty cost predictions; and
maintainability (PM schedules, spares allocation, allocation of maintenance personnel).
An "x" marks the approaches applicable to each purpose, for those failure causes
addressed by the approach.]

Relevancy is a function of the type of data that is available, and the product or system on
which that data is available. To further address the relevancy issue for assessments made
with empirical data, consider the information in Table 2.5-4, which summarizes the
various attributes of empirical data. This notion is valid, regardless of the level of
assembly, ranging from root failure causes to the system level.
Table 2.5-4: Ranking the Attributes of Empirical Data
[Matrix ranking data relevancy. Rows: the product or system on which data is available
(Same or Similar, each with the same or a different manufacturer/process). Columns: the
type of data (Field, in the same or a different environment; Test, at the same or a
different stress). Relevancy ranges from "Best" (field data on the same product, same
manufacturer/process, in the same environment) to "Worst" (test data on a similar
product, different manufacturer/process, at a different stress).]

There has been much information published in the literature comparing and contrasting
empirical and physics-based models. However, they are not mutually exclusive
methodologies. For example, empirical models generally utilize PoF principles in their
derivation, and PoF models utilize empirical data in their derivation and parameter
estimation.
The majority of component field failures are a result of special causes. These causes may
be an anomaly in the manufacturing process, an application anomaly, or a host of other
assignable causes. They are rarely the result of a common cause failure mechanism,
which can generally be modeled by life modeling techniques.
Guidelines and examples are provided in the following sections for each of the
approaches.
2.5.1. Empirical
2.5.1.1. Test

Testing product or system reliability is performed for many reasons, including:

• Quantifying reliability (infant mortality, wear-out)
• Demonstrating reliability
• Growing reliability
• Lot acceptance
• Developing screens
• Performing screens
• Determining the limits of the technology
• Determining stress bounds for subsequent tests
• Determining predominant accelerating stresses
• Identifying weak points in the design
• Identifying failure causes
• Demonstrating compliance to industry standard qualification tests

An important consideration for all of the tests described above is the definition of
failure, i.e., the "failure criteria" that will be used to determine if a product passes or
fails. Industry guidelines, specifications, or an understanding of end-use application
tolerances are often used to set pass/fail criteria.
A common form of empirical testing is the performance of qualification tests.
Qualification is usually defined as demonstrating that a product will meet performance
requirements in its intended application, as used by customers, over the expected lifetime
of the product. There are two primary elements to performance qualification: Validation
and Verification (Reference 1), as follows:
• Validation: Confirmation by examination and provision of objective evidence
  that the particular requirements for a specific intended use are fulfilled.
• Verification: Confirmation by examination and provision of objective evidence
  that specified requirements have been fulfilled.
Therefore, for a product or system to be considered fit for use for a specific application, it
must conform to the requirements of its specification over its intended life (verification)
and the specification must adequately capture the requirements of the end user
(validation). The various elements of qualification are illustrated in Figure 2.5-2.


[Figure: hierarchy showing Qualification split into Verification (Specification
Compliance and Reliability Testing: EVT, DVT, PVT) and Validation, supported by root
cause analysis and corrective action]

Figure 2.5-2: Qualification Concepts and Terminology


Verification ensures that the product or system meets the specified requirements both
initially (specification compliance) and over its intended lifetime (reliability testing).
Specification compliance ensures that, at the beginning of its lifetime, the item meets
specified performance requirements and that the distribution of performance parameters
over the population of items is within acceptable limits. Reliability testing ensures that
the product is robust and that it meets the specified performance requirements over its
intended lifetime. Reliability testing consists of several test phases, each of which has its
own purposes and approaches.
The testing sequence can be grouped into three categories: Engineering Verification Tests
(EVT), Design Verification Tests (DVT), and Production Verification Tests (PVT).
These are further explained below, along with their relationship to the establishment of a
life model, the prediction of product reliability and the various elements of each test
approach. This is provided specifically to highlight how reliability testing can be used in
a reliability program.


Engineering Verification Tests (EVT) are intended to identify and assess high risk
critical items so that corrective action can be taken, if necessary. The intent is to uncover
weaknesses or to identify product capability, not to pass a set of predefined tests, as is the
case with traditional qualification testing. The purpose and approach of these tests are
described in Table 2.5-5. These tests are also used to identify the maximum stress
capability of a product, which is a prerequisite for developing a complete test plan (in
DVT) to assess lifetime. Step stress tests are often used for this purpose, and the results
can support establishment of an upper bound on subsequent test stresses.
One of the primary purposes of DVT testing is to provide the data required to develop life
models. Often, there are multiple accelerating stresses, in which case life tests must be
conducted for various stress combinations. Design of Experiments (DOE) is used to
develop an effective and cost-efficient test plan. DOE concepts, as they pertain to
reliability testing, are covered in Chapter 4.
PVT tests demonstrate that the robustness of production units is equivalent to that of the
EVT/DVT samples. Whereas EVT and DVT demonstrate the intrinsic robustness, PVT
demonstrates the as-built robustness.
Table 2.5-5: EVT, DVT and PVT Purpose and Approach

EVT
  Purpose:
  - Determine limits of the technology
  - Determine stress bounds for subsequent tests
  - Determine predominant accelerating stresses
  - Identify weak points in the design
  - Identify failure causes
  - Quantify elements of the bathtub curve (infant mortality, wear-out) so that
    effective screens can be developed
  Approach:
  - Test to failure
  - Step stress (to determine limits)
  - Test a broad range of stressors to determine the stresses that accelerate
    predominant failure causes
  Relative sample sizes required: Low

DVT
  Purpose:
  - Provide data to assess product lifetime under various combinations of stresses
  Approach:
  - Development of a life model that estimates time to failure as a function of
    pertinent accelerants
  - Use DOE to design statistically valid life tests, and perform long term life
    tests using stresses that will be experienced in the intended application
  Relative sample sizes required: High

PVT
  Purpose:
  - Verify that the robustness of production units is as good as EVT/DVT samples
  Approach:
  - Test relatively small samples of parts using a broad range of stressors. These
    are traditional qualification tests
  Relative sample sizes required: Low

Some reliability practitioners choose to separate qualification tests from reliability tests.
In this case, reliability tests are those that have a purpose similar to the DVT tests. The
reason for separation is that the reliability tests are more of an engineering test that is
not dictated by industry standards. As such, the results may or may not be shared with
customers. Likewise, qualification tests are required and thus shared with customers to
demonstrate compliance.
Root cause analysis and corrective action
A critical part of any reliability program is the ability to learn from failures and improve
the product or system. Failure analysis is performed to ensure that the root cause is
identified and understood, and corrective actions are implemented and verified. This is
done throughout product development, including EVT or DVT and PVT.
With a product that is comprised of a number of subassemblies, there is a time offset
between EVT, DVT or PVT tests performed on components of the product or system and
those tests performed on the end item. This is illustrated in Figure 2.5-3.

[Figure: parallel timelines for a component and its next-higher assembly, each
progressing through EVT (early screening), DVT (prequalification), and PVT (full
qualification), then into mass production with Ongoing Reliability Test (ORT); the
component timeline leads the assembly timeline]

Figure 2.5-3: EVT, DVT and PVT Relationships



2.5.1.1.1. Non-Accelerated

Non-accelerated reliability tests are those in which samples are tested in a manner that
recreates the use conditions the product will experience in its intended use environment
as used by customers. These tests may be performed for several reasons:

1. To uncover any unexpected failure causes
2. To demonstrate that the product meets its reliability requirement

Generally, if the purpose is #1, a more effective way of achieving this is with accelerated
testing, discussed in the next section of this book. If the purpose is #2, then concepts of
reliability demonstration can be used, as discussed in the next section.
2.5.1.1.2. Reliability Demonstration

The fundamental concept of reliability demonstration is the following:

1 − CL = R

This is essentially a hypothesis test in which the hypothesis is that the true product
reliability is R or greater. For example, consider a case in which the reliability
requirement is 0.95 at 5000 hours, and the desired confidence level is 0.80 (80%). In this
case, the implied failure rate is λ = −ln(0.95)/5000 ≈ 0.0000103 failures per hour.

If the hypothesis is true and the test is run such that there is less than a 20% probability
of experiencing the observed number of failures (or fewer), then the analyst can be 80%
certain that the reliability requirements have been met.
Table 2.5-6 summarizes the probability as a function of the number of failures and
cumulative operating time. The values in the cells are the Poisson probability that there
will be F or fewer failures, under the hypothesis that the true failure rate is 0.0000103
failures per hour. In this example, if the test can be run until 200,000 hours are
accumulated, with no failures, then the test is passed and the hypothesis is verified. This
is the first opportunity to pass the test, as this is the shortest time at which the Poisson
probability falls below 0.20 (i.e., 0.13). In this example, 0.20 is the risk of concluding
that the failure rate is less than 0.0000103 when it is not.

The test is run until the combination of failures and accumulated time falls either above
or below the shaded area. If it falls above the shaded area, the test is failed, and the
failure rate is concluded to be greater than required. If it falls below the shaded area,
the hypothesis that the reliability requirement is met is confirmed. If the combination of
hours and failures remains in the shaded area, the hypothesis can be neither confirmed
nor denied, and further testing is required.

The probability values are generally calculated from the binomial or Poisson
distributions, depending on whether the probability is time-based (Poisson) or
attribute-based (binomial). Poisson is used in the case of constant failure rates.
Table 2.5-6: Reliability Demonstration Example
Cell values are the Poisson probability of F or fewer failures when the true failure rate
is 0.0000103 failures per hour. [The shaded band of the original table, marking the
combinations for which testing must continue, is not reproduced here.]

          Cumulative operating time (in thousands of hours)
 F    50  100  150  200  250  300  350  400  450  500  550  600  650  700  750  800
 0  0.60 0.36 0.21 0.13 0.08 0.05 0.03 0.02 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00
 1  0.91 0.73 0.54 0.39 0.27 0.19 0.13 0.08 0.06 0.04 0.02 0.02 0.01 0.01 0.00 0.00
 2  0.98 0.91 0.80 0.66 0.53 0.41 0.30 0.22 0.16 0.11 0.08 0.06 0.04 0.03 0.02 0.01
 3  1.00 0.98 0.93 0.85 0.74 0.63 0.52 0.41 0.32 0.25 0.19 0.14 0.10 0.07 0.05 0.04
 4  1.00 1.00 0.98 0.94 0.88 0.80 0.71 0.61 0.51 0.42 0.34 0.26 0.21 0.16 0.12 0.09
 5  1.00 1.00 0.99 0.98 0.95 0.91 0.85 0.77 0.68 0.59 0.50 0.42 0.35 0.28 0.22 0.17
 6  1.00 1.00 1.00 0.99 0.98 0.96 0.93 0.88 0.82 0.74 0.66 0.58 0.50 0.42 0.35 0.29
 7  1.00 1.00 1.00 1.00 1.00 0.99 0.97 0.94 0.90 0.85 0.79 0.72 0.65 0.57 0.50 0.42
 8  1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.98 0.95 0.92 0.88 0.83 0.77 0.71 0.64 0.56
 9  1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.98 0.96 0.94 0.90 0.86 0.81 0.75 0.69
10  1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.98 0.97 0.95 0.92 0.89 0.85 0.79

F = number of failures.

The Microsoft EXCEL functions for these are:


=1-BINOMDIST(x,y,z,TRUE), where BINOMDIST(number_s,trials,probability_s,cumulative)
=POISSON(x,y,TRUE), where POISSON(x,mean,cumulative)
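The same values can be regenerated programmatically. The Python sketch below (a
minimal example using SciPy) reproduces the zero-failure row of Table 2.5-6 and finds
the first passing time for the 0.0000103 failure rate and 0.20 risk of this example:

    from scipy.stats import poisson

    lam, risk = 0.0000103, 0.20  # hypothesized failure rate, consumer's risk
    for t in range(50_000, 850_000, 50_000):
        p0 = poisson.cdf(0, lam * t)  # P(0 failures | true rate is lam)
        if p0 < risk:
            print("first zero-failure pass opportunity:", t, "hours; P =", round(p0, 2))
            break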

2.5.1.1.3. Accelerated Testing

Accelerated testing is a major element of a reliability program. It is used for many
purposes, including:

• Identification of failure causes
• Qualification
• Life characterization
• Reliability demonstration

One of the critical aspects of accelerated testing is the degree to which acceleration takes
place. Consider the situation depicted in Figure 2.5-4. The reliability requirement, in
terms of lifetime in this example, will be specified at a specific stress condition. If tests
are performed at the accelerated conditions of Test 1, there will be some extrapolation to
lifetimes at use conditions (if the purpose is to quantify life). If tests are performed at the
accelerated conditions of Test 2, there will be additional extrapolation to lifetimes at use
conditions. Life modeling is the means of performing this extrapolation, and will be
covered in Section 2.5.1.1.2.3 and Chapter 5.

Figure 2.5-4: Acceleration Levels


The larger the extrapolation distance, the larger the uncertainty in the reliability estimate
at use conditions. This is illustrated in Figure 2.5-5.
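The size of this extrapolation can be made concrete with an acceleration factor
calculation. The sketch below is a minimal Python example that assumes an Arrhenius
temperature-acceleration model; the activation energy, temperatures and observed test
life are all hypothetical, and acceleration models are treated in detail in Chapter 5:

    import math

    # Arrhenius acceleration factor between test and use temperatures (sketch).
    Ea = 0.7                       # activation energy, eV (hypothetical)
    k = 8.617e-5                   # Boltzmann constant, eV/K
    T_use, T_test = 313.0, 398.0   # 40 C use, 125 C test, in kelvin

    AF = math.exp((Ea / k) * (1.0 / T_use - 1.0 / T_test))
    test_life = 1000.0             # median life observed in test, hours (hypothetical)
    print("AF =", round(AF), "; extrapolated use life =", round(test_life * AF), "hours")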


Figure 2.5-5: Uncertainty in Extrapolation


The relevancy of failure causes must be considered when using accelerated test data to
model product or system reliability in field deployed conditions. For example, if failures
occur in an accelerated test, the questions to be addressed are:
1. Can the failure cause occur under field conditions? Or has it been induced by the
test?
2. If the failure cause is relevant, can its reliability characteristics be scaled to field
use conditions with an acceleration model?
For example, consider several scenarios illustrated in Figure 2.5-6. Case 1 illustrates the
situation in which the failure cause observed in accelerated testing is relevant, and its
probability of occurrence can be extrapolated to use conditions with an acceleration
model. Case 2 illustrates the situation in which the failure cause observed in accelerated
testing is not relevant, and its probability of occurrence cannot be extrapolated to use
conditions with an acceleration model. Case 2 is representative of a situation in which
there is a threshold stress, above which the failure cause has been induced by the test.
The higher the acceleration, the higher the risk is that Case 2 will occur. For this reason,
for the purposes of quantifying reliability under field use conditions, highly accelerated
tests (like HALT) must be used with caution.

Figure 2.5-6: Acceleration Levels


Alternatives are also available that will cover more of the life-stress space, as shown in
Figure 2.5-7. This approach is desirable because there is minimal extrapolation to field
use conditions, and validity of the acceleration models over a broader stress range can be
ascertained.

Figure 2.5-7: Acceleration Alternatives



Another factor to consider in accelerated testing, when used to quantify reliability at use
conditions, is the relative probability of occurrence of various failure causes as a function
of stress level. Each failure cause will have unique acceleration characteristics as a
function of stress, depicted as the slope of the life-stress line. Each will also have a
unique probability of occurrence, depicted as the vertical position of the life-stress line.
Together, these factors indicate that each failure cause requires its own life-stress model.
This is illustrated in Figure 2.5-8. In this life-stress plot, the slope represents the
dependency of life as a function of stress, and the position of the line represents the
absolute life. As can be seen, the relative probabilities of the causes will depend on the
stress level.

Figure 2.5-8: Relative Lifetime vs. Stress


2.5.1.1.4. Highly Accelerated Life Test (HALT)

Highly Accelerated Life Test (HALT) is a popular technique in reliability testing. It is
useful for achieving very large acceleration factors. HALT is a test methodology that
simultaneously subjects an item to highly accelerated levels of thermal cycling and
vibration. It can be a useful tool in identifying mechanical design weaknesses. It is a
particularly valuable technique in the identification of the weakest area(s) of a new
design in the shortest possible time. Therefore, it is often used as a tool to grow the
product reliability through a test, analyze and fix sequence.


Such tests can:

• Provide a means of sampling inspection for incoming component lots
• Be used for burn-in screening tests. This is called Highly Accelerated Stress
  Screening (HASS). For HASS, care must be taken to ensure that high levels of
  the accelerating stress will not damage or remove excessive life from units that
  are to be put into service.
• Be used for pilot tests to get information needed for planning a more extensive
  accelerated life test (ALT) at lower levels of the accelerating variable
• Be used to assess the relevance of specific failure modes
• Be used to obtain shorter test times to allow design engineering to remain focused
  on the product or system (resulting in a highly intensive and uninterrupted
  engineering effort)
• Serve as a cooperative workshop that involves both suppliers and the customer(s)
• Support collaboration between design and test engineers to address design
  weaknesses

HALT requires a different mindset than "conventional" accelerated testing. One is not
trying to predict or demonstrate life, but rather to induce failures of the weakest links in
the design, strengthen those links, and thereby greatly extend the life of the design. Root
cause failure analyses are conducted and repairs and redesign are carried out, as feasible
and cost-effective. Output results from HALT may include a Pareto chart showing the
weak links in the design, and design guidance that can be used to create a more robust
design.
Testing a new design and comparing it against a proven previous generation design using
the same accelerated test provides an efficient benchmarking test. Based on HALT
results, a determination of "optimum" design characteristics can be made using statistical
design of experiments (DOE).
A generic HALT process starts with a temperature survey:

1. Start at room temperature
2. Step down temperature to -100°C in 20°C increments, with each dwell time long
   enough to stabilize the product's internal temperature (the thermal rate of change
   between each temperature transition step should be ~100°C/minute)
3. Step up temperature from -100°C to +40°C at 100°C/min
4. Step up temperature from +40°C in 20°C increments to 100°C or the maximum
   temperature for the materials involved, with each dwell time long enough to
   stabilize the product's internal temperature (the thermal rate of change between
   each temperature transition step should be ~100°C/minute)
Next, a vibration survey is performed:

1. Begin vibration testing at room temperature
2. Start six-axis random vibration at 5 Grms from 2 Hz to 12 kHz
3. Step up the vibration level in 5 Grms increments, to a maximum of about 50 Grms
4. Dwell for 10 minutes at each level

The vibration stress is provided by mechanically impacting the table with hammers.
As such, the frequency spectrum is not truly random, but rather is pseudorandom. The
purpose of the vibration survey is to detect weakness in the design as a function of the
stresses created by the increased vibration levels.
A combined environment HALT may also be performed:

1. Superimpose simultaneous temperature cycling from -100°C to +100°C at
   ~100°C/min of circulating air temperature. Dwell at each temperature only long
   enough to semi-stabilize the internal temperature of the part
2. During temperature dwells, subject the test unit to vibration at 5 Grms
3. During subsequent thermal cycles, step the vibration level up in 5 Grms increments

In this example, the vibration is applied during temperature dwells, but if failure causes
are possible that are accelerated by vibration stresses during temperature transitions, the
stress profile can be modified to apply vibration continuously throughout the temperature
cycle.
This is a typical stress profile, and will be varied (and should be tailored) based on the
limits of the product or system being tested. The purpose of the step-stress temperature
test is to detect sensitivity of design functionality to temperature and temperature change
rates.
The combined environment test should highlight weaknesses that result from the
interaction effects of simultaneous exposure to temperature and vibration.
Quantifying reliability is generally not the objective of HALT; the objective is to improve
the inherent reliability/robustness of the product or system's design. However, in some
cases it can be used as an indicator of field reliability performance. The fundamental
question to address is this: Does the HALT test excite failure causes that the item may
experience in the field? The answer to this question will depend entirely on the
characteristics of the item under test, and the stresses to which it will be exposed in field
use. For example, if the product or system's critical failure causes are accelerated by
thermal cycling and random vibration, and the item will experience these stresses in the
field, then HALT test results may be indicative of field reliability. Likewise, if the
product or system's critical failure causes are not accelerated by thermal cycling and
vibration, and/or the product will operate in a benign environment, then the HALT
results will provide very little information regarding field reliability.
2.5.1.1.5. Qualification Testing

Qualification testing is a term used to describe a series of tests that a product or system
must be exposed to, and pass, for it to be considered qualified by the industry or
standards body governing the qualification requirements. Several examples of
qualification requirements are provided in Tables 2.5-7 and 2.5-8, for an assembly, and
for a laser diode component, respectively.
Table 2.5-7: Example of a Qualification Plan for an Assembly

Group 1 Test Set:
- Impact (packaged w/ mates removed for test): Cat A: 30" drop (nominal, based on
  weight, see Table 4-7), 10 orientations as specified (SS/Failures: 3/0)
- Impact (not packaged): 4" drop (nominal, based on weight, see Table 4-9),
  5 orientations as specified (SS/Failures: 3/0)
- Temperature Cycling: -40°C to 85°C, 100 cycles (SS/Failures: 3/0)
- Vibration: 10-55 Hz, 1.52 mm (max = 10 G), 1 min/cycle, 120 cycles, 3 axes
  (SS/Failures: 3/0)
- Electro-Magnetic Interference: Compliance with MIL-STD (SS/Failures: 1/0)
- Electro-Static Discharge: Compliance with MIL-STD-883

Group 2/Group 3 Test Sets:
- Damp Heat: 75°C/90% RH; 500 hrs qual, 1000 hrs for information only
  (SS/Failures: 3/0)

Group 4 Test Set:
- Endurance: T(operating) max, P(nominal); Full Qualification = 2000 hrs,
  Information = 5000 hrs (SS/Failures: 3/0)


Table 2.5-8: Qualification Example for a Laser Diode

Test Description                        GR-468, Hermetic Laser Module (active)

High Temperature Aging at               70°C; Q = 2000 hrs, I = 5000 hrs
Ambient Condition
Low Temperature Aging at                Min. storage temp.; Q = 2000 hrs
Ambient Condition
Damp Heat Aging                         85°C/85% RH; Q = 1000 hrs
Thermal Cycling                         -40 to 70°C; Q = 100 cycles, I = 500 cycles
Thermal Shock                           ΔT = 100°C; 20 cycles
Vibration                               20 G, 20-2000 Hz, 4 min/cycle, 4 cycles/axis
Shock                                   500 G, 0.5 ms, 5 times/axis
Electrostatic Discharge                 MIL-STD-883, Method 3015

There are many qualification standards in existence, governed by standards bodies within
specific industries. Some noteworthy standards organizations are the IEC (International
Electrotechnical Commission), the U.S. Military (via MIL-specs), ISO, and Telcordia
(for telecommunication components and equipment).
There are several factors which will impact the usefulness of qualification data as an
indicator of field reliability. These are:

• The degree to which the stress is accelerated, and the acceleration factor between
  the test and field environments
• The degree to which the stress accelerates critical failure causes that the product
  or system will experience in the field
• The sample sizes used, which impacts the statistical significance of the data

The first two bullets are treated in detail elsewhere in this book. The last bullet is
discussed next.


A common way in which sample size requirements are identified in standards is with a
Lot Tolerance Percent Defective (LTPD) methodology. This concept is identical to the
reliability demonstration idea presented previously. In this case, two parameters are
specified:
1. The percent of allowable defects
2. The confidence level
From before:

1 CL = R
In this case, the value of R is the reliability of the entire sample size. So, if the test
plan is established to allow no failures (this will require the minimum sample size), the
equation becomes:

1 CL = R n
where n is the sample size. For example, if the allowable percentage of defects is 20%,
and the desired confidence level is 0.90 (i.e., 90%), then n = 11 is the minimum sample
size required, as shown:

1 − 0.9 ≅ (0.8)^11

So, if the test is performed on 11 samples with no failures, then there is a 90% confidence
that the true reliability is greater than 0.8 (i.e., the probability of failure is less than 0.2).
Other plans are also available that allow a certain number of failures. These require
larger sample sizes, and are determined with binomial statistics.
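Both kinds of plans can be computed in a few lines. The Python sketch below (using
SciPy, with the 20% LTPD and 90% confidence of this example) finds the minimum
sample size for a zero-failure plan and for a plan allowing one failure:

    import math
    from scipy.stats import binom

    CL, R = 0.90, 0.80   # confidence level; per-unit reliability (20% LTPD)

    # Zero-failure plan: smallest n with R**n <= 1 - CL
    n0 = math.ceil(math.log(1 - CL) / math.log(R))
    print("zero-failure plan: n =", n0)   # 11

    # One-failure plan: smallest n with P(X <= 1 | n, 1-R) <= 1 - CL
    f, n = 1, n0
    while binom.cdf(f, n, 1 - R) > 1 - CL:
        n += 1
    print("one-failure plan: n =", n)     # larger than 11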
Since the reliability level demonstrated with an LTPD plan is generally lower than the
required reliability, qualification data is usually not sufficient, in and of itself, to
demonstrate reliability requirements. It can, however, be valuable data when used in
combination with other data sources.

As an example, consider a case in which a reliability requirement is that a product or
system must have less than 3% cumulative failures after 1000 hours of operation. This is
shown as the star in Figure 2.5-9. Now, let's say that the item is represented by a
multimode Weibull distribution (notice the three distinct portions of the curve
representing the bathtub curve), characterized by the probability line called Case 1 in
Figure 2.5-9. If 11 parts were tested, and zero failures occurred after 300 hours of

operation, the only statistical statement that can be made is that there is a 90% confidence
that the true unreliability is less than 0.2 at 300 hours, shown as the solid star and arrow.
Here, the data is not sufficient to determine if the actual distribution is Case 1, or that
the reliability requirement is met. However, testing 11 samples may be sufficient to
determine if we have a wearout mode occurring at a time less than 300 hours, as
illustrated in Case 2.
[Figure: Weibull probability plot of unreliability F(t) versus time, showing the Case 1
and Case 2 probability lines, the 3%/1000-hour requirement point, and the point
inferred from the 11-sample test]

Figure 2.5-9: Reliability Requirement vs. Small Population Reliability Inference


If the goal of the test is to demonstrate that the infant mortality percent fail value from
the first of the distribution modes is, for example, less than 1%, it can be seen that
testing 11 samples will not come close to demonstrating this requirement.

This example illustrates the fact that the demonstration of reliability due to wearout
related failure causes can be done with relatively small populations, whereas low percent
fail values typical of infant mortality cannot.

In any case, however, the goal of a reliability program is to ensure that the actual
probability line is to the right of the reliability requirement point.
2.5.1.1.6. DOE-Based Multicell

The methodology of a DOE (Design of Experiments)-based multicell test involves
subjecting a sample of products to a combination of factors, or accelerants. These factors
can be stresses or categorical variables. The intent of these tests is to generate the data
that is required to develop a life model that is capable of predicting reliability under a
variety of use conditions. Life modeling is usually performed for specific failure causes.
A goal of a reliability program is to identify those causes that warrant the work required
to develop a life model. Characteristics of these critical failure causes often include:

• Failures experienced in EVT tests
• New, unproven technology
• New, unproven manufacturing processes
• Items exposed to stringent/severe environmental conditions
• Items exposed to stringent/severe operating stresses
• Items designed or manufactured with non-robust practices
• Items with known life limitations
• Items from suppliers with a history of delivery, cost, performance or reliability
  problems
• Old technology with availability problems (obsolescence and/or diminishing
  manufacturing sources)

After the identification of critical failure causes of a product or system that require life
modeling, action must be taken to ensure that those items are sufficiently robust to meet
product/system reliability and durability requirements. Life modeling is used for this
purpose, and involves the characterization and quantification of specific failure causes,
making it a critical element of a reliability program.
A generic life modeling methodology is shown in Figure 2.5-10.


[Figure: life modeling flow. Supporting tools (DOE, FMEA, FTA, extreme event
statistics) and characterization of the environment, stresses and duty cycle feed a
sequence of steps: identify factors, characterize operating stresses, perform reliability
tests, develop the life model, and predict reliability under use conditions as an input to
the model of system reliability.]

Figure 2.5-10: Life Modeling Methodology
Each of the elements in Figure 2.5-10 is further examined below. Additionally, the
topics of Design of Experiments (DOE) and life modeling are treated in more detail in
Chapters 4 and 5, due to their relatively complex nature and their importance to life
modeling. A detailed example of a developed life model is also provided in Chapter 7.
Identify Factors
Factors are the independent variables that can influence the product reliability, and the
response variable is the dependent variable. DOE is a common technique used to study
the relationships amongst many types of factors. In the context of this book, the response
variables specifically refer to the reliability metric of interest.
Critical failure causes and the factors that potentially affect their probability of
occurrence need to be identified. This can be done through testing, through analysis, or
both. EVT testing that is performed as part of the overall product/system reliability
program can be used for the identification of these factors, as previously described.

FMEA is also a popular analytical technique for this and will be used in the upcoming
example.
Factors fall into one of several categories:

• Stresses
  o Environmental
  o Operational
• Product/System Attributes
  o Design factors
  o Manufacturing processes

Each of these factors can be a continuous or a categorical variable:

• A continuous variable is one that can assume any value within a given range
• Categorical variables are those that assume a discrete number of possibilities

Some factors can be modeled as either. For example, environmental stress can be
modeled with continuous variables representing the specific environmental stresses (i.e.,
temperature, vibration, humidity, etc.), or it can be modeled as a categorical variable.
The latter is the approach that has historically been used in MIL-HDBK-217, which uses
environmental categories like Ground Benign, Airborne Inhabited, etc. The 217Plus
methodology treats them as continuous variables, but default values are provided for the
categorical values of environment.
There are several ways in which these factors can be identified. One method that has
proven to be an efficient means of accomplishing this is to utilize the FMEA. This
involves modifying the FMEA to include several additional columns that correspond to
the above listed factors. At the analyst's discretion, from one to four additional columns
can be included. This will depend on the type of product or system under analysis and
the level of rigor desired. In this approach, the FMEA team (or at least someone
knowledgeable about the item's design and process attributes) identifies the specific
stresses or attributes that will affect the probability of occurrence of the specific failure
cause that was identified in the FMEA. Since each failure cause will generally have an
associated risk priority number (RPN), the cumulative RPN can be calculated for all
failure causes affected by the specific stress or product/system attribute.
For example, consider the case in which an FMEA was accomplished in this manner, and
the results in Figure 2.5-11 were obtained. Here, only the environmental stresses are

shown, but the same methodology would apply to whichever additional factors are
included in the FMEA.
A more detailed discussion of the FMEA methodology is provided in Chapter 8.

Figure 2.5-11: Identification of Test Stresses Based on the FMEA


In this case, the sum of the RPN values for all failure causes accelerated by mechanical
shock is about 500. This cumulative RPN value is a relative number only, but can
provide valuable insight into the most important stresses to be addressed in the reliability
test plan.
In this example, the test stresses shown pertain to all of the failure causes addressed in the
FMEA. In performing life tests on specific failure causes, the information identified in
the FMEA should be used to identify the test stresses to be considered in the DOE plan.


Reliability Tests
If critical item failure mechanisms are time dependent, then time-based life tests are
required. Life tests are conducted by subjecting test samples to a defined stress level and
measuring the times when failure occurs. The process is repeated for various
combinations of factor levels. Considerations for the reliability tests are described below.
Test Plan
If there are multiple accelerating stresses, then life tests must be conducted at various
combinations of stress magnitudes. A plan should be developed using an effective tool
such as Design of Experiments. The plan should consider all aspects of testing so that the
test program generates data in a cost effective way. It is easy to lapse into the mentality
of testing one factor at a time, in which tests are conducted to assess specific factors,
but this approach is generally not time- or cost-effective.
Factors to consider in establishing an appropriate DOE include: (1) the sample size per
test cell, (2) stress levels, (3) the number of stress levels for each stress, (4) stress
interactions, (5) stress durations, (6) failure criteria, and (7) measurement methodology
(i.e., in-situ or periodic). The principles of DOE are treated in more detail in Chapter 4.
Maximum Test Stress
A prerequisite for developing a complete test plan to assess the lifetime of a product or
system attribute is knowledge of the maximum stress magnitude that can be tolerated by
the item prior to catastrophic failure. This knowledge supports establishment of an upper
bound on subsequent test stresses that may be a part of step-stress testing. These tests are
generally performed as part of the EVT tests.
In many cases, it is desirable to establish the upper bound of the test stress for each
specific stressor. An efficient way to determine this stress level, often called the
destruct limit, is to perform a step stress test. Here, a sample of units is exposed to a
stress level well below the suspected destruct limit. Then, the stress is increased until the
product is overstressed. This step-stress test can include a linearly ramped stress, or a
stepped-stress in which the samples are exposed to a constant stress for a given dwell
time, after which the stress is increased, dwelled, and so on until failure. An example of
the identification of these maximum stresses was mentioned previously in the HALT
discussion.
The destruct limit can be used as the upper limit of all subsequent life tests. Usually, the
actual life tests will be performed at a maximum stress that is a certain percentage level

below the destruct limit. This percentage is dictated primarily by the sensitivity of the
TTFs to the stress. For example, consider the two cases illustrated in Figure 2.5-12.
Case 1 is a situation in which the lifetime, and subsequent reliability, is moderately
sensitive to the stress level. Case 2 is a situation in which the lifetime has an extreme
sensitivity to the stress level.

Figure 2.5-12: Using the Destruct Limit to Define the Life Test Max Stress
For example, if a power law acceleration model is used, the life-stress relationship is:

Life = A / S^n

where A is a life constant and S is the stress. A typical value of n for Case 1 would be
1 to 3, whereas a typical value of n for Case 2 would be greater than 20.
In Case 1, the maximum stress for the life tests may be 10-20% below the destruct limit.
For Case 2, however, the maximum stress should be only a few percentage points below
the destruct limit. Otherwise, the risk is taken that the product or system will not fail
within a reasonable time period, which is required for reliability model development.
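The effect of the exponent n on this choice is easy to quantify. A minimal Python sketch
(the destruct limit and backoff percentages are hypothetical) compares the life multiple
incurred by backing the test stress off the destruct limit for the two cases:

    # Life multiple from testing below the destruct limit, for the
    # power-law model Life = A / S**n (the constant A cancels in the ratio).
    S_destruct = 100.0                  # destruct limit, arbitrary stress units
    for n in (2, 20):                   # Case 1 vs. Case 2 sensitivities
        for backoff in (0.02, 0.15):    # 2% and 15% below the destruct limit
            S_test = S_destruct * (1.0 - backoff)
            life_multiple = (S_destruct / S_test) ** n
            print(f"n={n:2d}, {backoff:.0%} backoff: life x{life_multiple:.1f}")

For n = 2, a 15% backoff lengthens the test by only about 1.4 times; for n = 20, the same
backoff lengthens it by roughly 26 times, which is why only a few percentage points of
margin can be afforded in Case 2.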

Stress Profile
The two main types of stress profiles are steady-state and time varying. Steady state tests
are those in which a sample set is exposed to constant stress levels, and the response
(performance parameter(s)) is measured. Several examples are shown in Figure 2.5-13.

Figure 2.5-13: Possible Stress Profiles


Any of the profiles in Figure 2.5-13 can be used to develop life models. If the
time-varying stress profiles are used, a cumulative damage model is usually appropriate.
In this case, the stress function is integrated to obtain the cumulative damage. This will
be explained in more detail later.
Some of the advantages and disadvantages of the two generic approaches are listed in
Table 2.5-9.


Table 2.5-9: Stress Profile Option Advantages and Disadvantages

Steady State Stress
  Advantages:
  - Results can be easily interpreted
  - Facilitates the de-convolution of time and stress effects more easily
  - Does not require knowledge of destruct limits
  Disadvantages:
  - Longer test times required

Stepped (or Linear Ramped) Stress
  Advantages:
  - Short test times possible
  - A good approach when the time to failure characteristics as a function of stress
    are unknown
  Disadvantages:
  - Requires knowledge of destruct limits
  - Can be difficult to model parameters
  - Software required for modeling

Optimum Measurement Intervals

When testing is performed on products or systems whose performance cannot be
monitored in-situ, the test needs to be run such that performance measurements are made
at periodic intervals. These intervals need to be frequent enough to bracket the TTFs
tightly enough that the life model parameters can be estimated accurately.

The objective of the measurement intervals is to obtain as much resolution as possible in
the regions of time that exhibit high failure rates. The measurement intervals should be
an order of magnitude shorter than the failure times.
There are several approaches to determining the appropriate measurement intervals:
1. Use constant intervals. While this approach may not be optimal, it can be
appropriate in cases where the failure characteristics are completely unknown
2. If the rate of occurrence of failure (ROCOF) is expected to decrease over time,
the measurement intervals can start out very frequent, and decrease in frequency
as the failure rate decreases. This is shown in Figure 2.5-14.


[Figure: decreasing failure rate curve with measurement points concentrated early]

Figure 2.5-14: Measurement Points for an Infant Mortality Failure Cause

If the ROCOF is expected to increase over time, the measurement intervals can start out
very infrequent, and increase in frequency as the failure rate increases. This is shown in
Figure 2.5-15.

[Figure: increasing failure rate curve with measurement points concentrated late]

Figure 2.5-15: Measurement Points for a Wearout Failure Cause


This case is generally much more difficult to implement because the failure
characteristics need to be known before the tests. Therefore, one of the first two
approaches is usually desirable.
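One simple way to implement the decreasing-frequency schedule of Figure 2.5-14 is to
space the measurement points geometrically. A minimal Python sketch follows (the start
time, end time and growth ratio are hypothetical and would be tuned to the expected
failure times):

    # Geometrically spaced measurement points: dense early, sparse late,
    # suited to a decreasing rate of occurrence of failure.
    t, t_end, ratio = 1.0, 1000.0, 1.5
    points = []
    while t < t_end:
        points.append(round(t, 1))
        t *= ratio
    points.append(t_end)
    print(points)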

Sample Size Requirements

The determination of adequate sample sizes will depend on several factors, the most
important being whether the failure cause is special cause or common cause. If it is
special cause, the sample size needed will depend entirely on the percent of the
population affected by the failure cause. For example, if the failure cause manifests itself
in 0.1% of the population, then at least 1000 items would be required in order to expect a
single failure. Since multiple failures are required for true quantification, an order of
magnitude more items, or about 10,000, would be required. The specific number can be
calculated by using the principles of reliability demonstration, as explained elsewhere in
this book.
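For example, a sample size giving a stated probability of seeing at least one special cause
failure follows from the binomial relationship used earlier. A minimal Python sketch for
the 0.1% example:

    import math

    p = 0.001         # fraction of the population affected by the special cause
    P_detect = 0.90   # desired probability of observing at least one failure

    # P(at least one failure) = 1 - (1 - p)**n >= P_detect
    n = math.ceil(math.log(1.0 - P_detect) / math.log(1.0 - p))
    print(n)   # about 2300 samples for 90% probability of a single failure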
If the failure cause is a common cause mechanism, meaning that the entire population is
at risk, then many fewer items would be required. In this case, test data on enough
samples is required such that differences in reliability as a function of the factors (i.e.,
stresses, indicator variables) can be determined in a statistically significant manner. This
will be a function of how much inherent variability there is in the population, and how
sensitive the reliability is as a function of the factors under analysis. Essentially, if these
variabilities are known, then statistical techniques, like the Fisher F-test, could be used.
However, in practice, these variabilities are rarely known a priori. Therefore, sample
sizes as large as possible are preferred. In practice, the sample sizes are usually dictated
by programmatic constraints, in which case it is the reliability practitioner's
responsibility to lobby program managers for the required samples.
Test Time
The question as to how long tests should be run before stopping them inevitably needs to
be addressed. This is especially true in cases where the stress levels are low and the
resulting lifetimes are long. While it is usually difficult to determine an appropriate test
duration before the test is run, a general rule of thumb is that tests should be run for
durations sufficient to cause at least 50% of the items to fail. This facilitates
quantification of the median life. Keep in mind that tests are used to characterize the
statistical distribution at a specific stress level, and therefore enough failures need to be
experienced to quantify the distribution.
Consider the illustration in Figure 2.5-16. In this case, tests were performed at two stress
levels, and the resulting TTF distributions were obtainable for each level. The
acceleration in this case can be quantified, along with confidence bounds around the
acceleration model parameters.

Figure 2.5-16: Acceleration When the Distributions for at Least Two Stresses are Available
Now, consider the case in which the lower stress samples are not tested until enough
failures have occurred. This is shown in Figure 2.5-17. In this case, the distribution
cannot be quantified. All that is possible is the estimation of the lower bound of life, via
techniques like Weibayes analysis (shown as the star).

Figure 2.5-17: Acceleration When the Distributions for Low Stresses are Not Available

This 50% objective can sometimes be offset if enough data is available in at least two
other, more stressful conditions, to compensate for the lack of data in the low stress
condition.
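As a minimal sketch of the Weibayes idea referenced above (the shape parameter and test times are illustrative assumptions; a shape parameter must be assumed because no failures occurred), a lower bound on the Weibull characteristic life can be estimated from unfailed test data by assuming one failure:

def weibayes_eta(test_times, beta, assumed_failures=1):
    # Weibayes lower-bound characteristic life:
    # eta = (sum(t_i^beta) / r)^(1/beta), with r = 1 assumed failure
    # when the test ended with zero failures.
    return (sum(t ** beta for t in test_times) / assumed_failures) ** (1.0 / beta)

# Illustrative: five units each survive 1000 hours unfailed, assumed beta = 2.0
print(weibayes_eta([1000.0] * 5, beta=2.0))  # ~2236 hours, a lower bound on eta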
Develop Life Model
After the life data is generated from implementing the DOE plan, a reliability model can
be constructed. Factors that must be quantified include:
• Time-to-failure (TTF) distribution
• Acceleration factors for the primary stress variables
• Characterization of the impact of specific design attributes on reliability

A generic sequence of events for model development is shown in Figure 2.5-18.


Figure 2.5-18: Life Model Sequence (collect data (TTFs, acceleration variables, stresses, indicator variables); select the TTF distribution; select the acceleration model(s); estimate the model parameters; analyze goodness of fit and parameter significance)


The TTF distribution can typically be modeled using the Weibull, exponential or
lognormal distributions. For sample "subpopulations" that exhibit different reliability
behavior than the main population, TTF distributions may manifest themselves as
bimodal. It is important that bimodal distributions be characterized. If one of the two
"modes" in the distribution appears to be the result of early failures from workmanship,
materials or process defects, then this information should be used to develop an
appropriate reliability screen. This topic is discussed in detail later in this book.
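As a hedged illustration of characterizing a TTF distribution (the failure times are invented, and scipy is just one readily available fitting tool), a two-parameter Weibull can be fit by maximum likelihood and its shape parameter inspected:

from scipy import stats

# Invented complete (no-suspension) failure times, in hours
ttf = [312.0, 488.0, 540.0, 611.0, 745.0, 802.0, 915.0, 1040.0]

# Fit a two-parameter Weibull (location fixed at zero)
shape, loc, scale = stats.weibull_min.fit(ttf, floc=0)
print(f"beta (shape) = {shape:.2f}, eta (characteristic life) = {scale:.0f} h")

A shape parameter greater than one suggests wearout, less than one suggests infant mortality, and a poor fit or two visible clusters on a probability plot can signal the bimodal behavior described above.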
Characterize Operating Stresses
In order to estimate the field reliability of the product, in addition to the life model
(which will predict the life characteristics as a function of the chosen factors),
information regarding the stresses to which the product or system will be exposed in the
field is also necessary.
There are a variety of sources that can be used to estimate the stresses to which an item
will be exposed. First, customers will usually specify nominal and worst case
environmental requirements in the product or system specification. However, the data in
specifications are often very generic and lack sufficient detail for reliability analysis.
Another source of information is from direct measurement, either by directly measuring
stresses in the item use environment, or by equipping the item with sensors and data
logging features.
Field maintenance personnel can also often provide qualitative information pertaining to
stresses, especially when those stresses have resulted in failures.
There is a wealth of information available in both commercial and military handbooks
and standards. Many industries also have their own source material from the products or
systems used in their industry.
A summary of sources includes:
• Customer specifications
• Customer usage information
• Measurement of conditions:
  o Stresses
  o Duty cycle
  o Extreme event statistics

  o Using a sample of fielded products fitted with sensors and data-recording electronics
• Discussions with field maintenance personnel
• Handbooks and standards:
  o MIL-STD-210, Climatic Information to Determine Design and Test Requirements for Military Systems and Equipment

Predict Reliability Under Use Conditions


Once life models have been developed for all pertinent failure causes, the specific
combinations of design attributes and stresses that result in reliability requirements being
met can be identified. These attributes/stresses define the item "safe operating region,"
which should then be added to the system/product design rules so that reliability
requirements for future designs can be met without having to repeat the reliability
modeling process for that item.
Model of System Reliability
Once life models have been developed for all pertinent failure causes, they need to be
combined such that a reliability estimate of the entire product can be made. Section 2.7
describes this process and the appropriate tools in more detail.
Degradation Modeling
In many cases, the reliability response variable will not be a TTF, but rather it will be the
behavior of a critical parameter as a function of time. In these cases, there are several
choices:
1. Develop a model that predicts the parameter as a function of all factors that need
to be quantified.
2. Derive a simple model (linear, logarithmic, exponential or power law) that
describes the parameter as a function of time, and then use this model to estimate
a time to failure (i.e., the time at which the parameter is predicted to degrade to some
predefined failure threshold).
In many cases, Option 2 is a good choice. Option 1 is a good choice in the following
cases:
1. When the failure mechanism can reach an asymptotic value of degradation. This
condition is difficult to model using the conventional life modeling techniques.
2. If the goal of the analysis is to feed other analytical techniques, like worst case
analysis (WCA).

A general approach for degradation modeling is shown in Figure 2.5-19.

Figure 2.5-19: Degradation Modeling Approach (data of a performance parameter vs. time is fit, by regression or nonlinear model parameter estimation, to a model of performance vs. time; that model supports prediction of the parameter delta and of the percent failing, or, through life modeling, a life model and a prediction of the life distribution)


This approach starts with data pertaining to the value of a critical parameter as a function
of an independent parameter. This independent parameter is usually time, but can be
other parameters, such as cycles. An example of such data is shown in Figure 2.5-20, in
which five samples were put on test and the critical parameter was measured in situ.


Figure 2.5-20: Degradation Data Example


Next, models of performance vs. time are developed. This can be accomplished by using
standard model forms (linear, exponential, logarithmic or polynomial) or a more
sophisticated nonlinear model form. The standard model forms can be quantified by
applying a linear transform to the data and applying regression techniques; these fits are
easily performed in MS Excel with the trendline functions. Nonlinear model forms can be
quantified using numerical methods. The Solver utility in MS Excel is, again, an
example of this solution type.
Once these degradation models are available, predictions can be made regarding the
degradation value or the percent of the population failing in accordance to a predefined
failure criterion (i.e. percent degradation). Or, another option is to convert the
degradation data to failure times, as shown in Figure 2.5-21. The estimated TTFs are
then used to generate a life model using the techniques covered elsewhere in this book.


Figure 2.5-21: Degradation Data Conversion to Times-to-Failure
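A minimal sketch of this conversion (the degradation trajectory, the linear model form, and the 20% failure threshold are illustrative assumptions) fits a simple drift model to one sample and solves for the time at which the fitted line crosses the failure criterion:

import numpy as np

def degradation_to_ttf(times, degradation, threshold):
    # Fit degradation = a + b*t by least squares, then solve for the
    # time at which the fitted line reaches the failure threshold.
    b, a = np.polyfit(times, degradation, 1)  # slope, intercept
    return (threshold - a) / b

times = np.array([0.0, 100.0, 200.0, 300.0, 400.0])      # hours
drift = np.array([0.0, 4.8, 10.1, 15.2, 19.8])           # percent degradation
print(degradation_to_ttf(times, drift, threshold=20.0))  # ~400 hours

Repeating this for each sample yields one estimated TTF per unit, which can then be fit with the life modeling techniques covered elsewhere in this book.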


Note that the resulting TTF distribution can sometimes be counterintuitive. For example,
when dealing with what is believed to be a wearout-related phenomenon, the conversion
of degradation into TTFs can reveal a TTF distribution that is not usually considered to
be a wearout characteristic (for example, a Weibull distribution with a shape parameter
that is less than one).
2.5.1.1.7. Reliability Demonstration
An accelerated reliability demonstration is conceptually the same as a non-accelerated
test, but the level of acceleration needs to be quantified. For this, life modeling
approaches are used.
2.5.1.2. Field Data

Reliability data obtained from the field experience of products or systems is an invaluable
source of data. When using empirical field reliability data from a similar item as the
basis of the reliability estimate, there are two fundamental approaches, as illustrated
below.


The first approach is to utilize the field data directly, and the second is to utilize the data,
via an interim model developed from the data. This is shown in Figure 2.5-22.

Figure 2.5-22: Reliability Estimates from Field Data (the empirical field data feeds the reliability estimate either directly or through an interim model developed from the data)


This data has been the primary source of data used to develop most of the empirical
prediction methodologies such as MIL-HDBK-217, 217Plus, etc. Due to the authors'
experience with these prediction methodologies, they will be used as examples in Chapter
7 to illustrate the concepts discussed in this section.
2.5.1.2.1. Same Product
Field data on the exact item under analysis is the best information on which to estimate
the reliability of the product or system. Unfortunately, it is usually available too late to
do any good. Reliability predictions and estimates are required long before product or
system deployment. This type of data is a lagging indicator of reliability, whereas the
other techniques discussed in this book are leading indicators. In other words, we need
leading indicators to estimate the reliability that will ultimately be observed with the field
data. This data, however, which should always be collected on products, is valuable in
the reliability assessment of future products.
2.5.1.2.2. Similar Product
When using data on a similar product or system to assess a new product or system, the
degree of similarity needs to be accounted for to estimate the new item reliability based
on the empirical data available on the similar product. There are several ways in which
similarity can be assessed. The first approach is to utilize a reliability prediction
technique. This technique can be any of those covered in this document. The
technique's ability to assess similarity is dependent on the ability of the specific
methodology to:

1. Address the factors that drive the reliability for the two products under analysis
2. Be reasonably sensitive to these drivers.
Regarding #2, for example, if a system is being developed that represents an evolutionary
change to the system for which a reliability estimate is available, estimating the reliability
of the new system based on the data from the old system requires that the prediction
methodology be sensitive to the design differences between the old and new systems. If
these differences consist of the addition of new components, an increase in the operating
temperature, and the addition of software, then the methodology used to assess the
delta in reliability between the new and old system must be capable of assessing these
elements, and the reliability prediction approach must be reasonably sensitive to these
factors. The methodology of 217Plus was designed to accommodate this type of
situation, and is further detailed in Section 2.6.
Additionally, it is not necessary that a single methodology be used to assess this delta.
Different techniques can be used to assess each of the elements of the design, and the
cumulative effect can be pooled together to form a complete system model. The
techniques used to assess each of the design elements will generally fall into the
categories described in this document.
Another more qualitative technique is to simply list the general attributes of the design, as
shown in Table 2.5-10. The relative expected reliability of each of these elements for the
new and old designs is then listed. This is a qualitative method, but it can be useful in
some cases.


Table 2.5-10: Similarity Analysis

For each element below, the reliability ratio between the old and new designs is listed.

Design elements:
• Size
• Weight
• General design
• Number of components of type A
• Number of components of type B
• Number of components of type C
• Number of optical components
• Thermal dissipation
• Number of connections

Process elements:
• Manufacturing site
• Equipment
• Screening
• Component attachment
• Screening tests
• QC tests

This approach needs to be developed for each product or system, since the reliability
attributes will be unique to that particular item type.
Another approach that can be used to assess similarity is to utilize the FMEA, if
available. This is illustrated in Figure 2.5-23. Here, the FMEA is performed on both the
new and the predecessor system. The failure causes identified represent a cumulative
listing of all failure causes, whether they are applicable to either or both items. Then, the
Occurrence rating is determined for each failure cause for both items. If a specific failure
cause is not applicable to one of the items, then it gets a rating of zero. The sum of the
Occurrence ratings are then calculated for each of the products or systems. The ratio of
this sum is an indicator of the relative reliability levels of the two items, and is a good
measure of the degree to which the items are similar.


Figure 2.5-23: FMEA as a Tool for Assessing Similarity (the FMEA worksheet (component, function, failure mode, failure effects, severity, causes, occurrence, detectability, recommended actions) lists a cumulative set of failure modes and causes for both the new and old systems; two added columns identify whether each failure cause is applicable to the old system, the new system, or both; the Occurrence values are then summed separately over the causes applicable to each system)
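As a minimal sketch of this ratio (the failure causes and Occurrence ratings below are invented), the similarity indicator is simply the ratio of the summed Occurrence ratings, with a zero entered for any cause that does not apply to an item:

# Occurrence ratings per failure cause; 0 means the cause does not apply
occurrence = {
    "solder joint fatigue": {"old": 4, "new": 3},
    "connector corrosion":  {"old": 2, "new": 2},
    "software fault":       {"old": 0, "new": 5},  # applicable only to the new design
}

old_sum = sum(v["old"] for v in occurrence.values())
new_sum = sum(v["new"] for v in occurrence.values())
print(f"relative reliability indicator (new/old) = {new_sum / old_sum:.2f}")  # 1.67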


2.5.1.2.3. Raw Empirical Field Data - Similar Product or System

Raw field reliability data has been a very popular source of data on which to base
reliability estimates. This similar data can be based on a specific company's own field
experience on previous products or systems, or it can be a pooled set of data based on a
variety of companies and organizations. As an example of the latter, one of the RIAC's
most popular documents has been the Nonelectronic Parts Reliability Data (NPRD)
publication. NPRD is a compilation of observed field reliability data on a wide variety of
components. A summary of NPRD is provided in Section 7.4, to provide the reader with
a guide to the interpretation of this type of data.
For the most part, methodologies such as EPRD (Electronic Parts Reliability Data),
NPRD, MIL-HDBK-217, and 217Plus rely on field data from similar products or systems
in order to make reliability estimates. The manner in which they do this differs, but they
all share the same fundamental type of data as their basis.


2.5.1.2.4. Models
The use of models derived from empirical data to estimate the reliability of a product or
system is just one option for estimating reliability. Empirical models can be developed
and used by the analyst, or he/she can use empirical models developed by others. Models
developed by others include the industry standards or methodologies that many reliability
analysts are familiar with.
This section of the book deals with such models that are derived from the analysis of
empirical field data. Modeling is the means by which mathematical equations are
developed for the purpose of estimating the reliability of a specific item used and applied
in a specific manner. There are many ways in which models can be derived, and there is
no single correct way to develop these models. There are many such models in
existence. These models are generally easy to use, in that they are of a closed form and
simply require the analyst to identify the appropriate values of the input variables. The
developers of each of these models had their own perspective in terms of the user
community to be served, the variables that were to be modeled, the data that was
available, etc. It is not the intent of this book to review the specifics of these models, or
to compare them in detail. It is the intent, however, to discuss the rationale and options
for development of the models, and to provide some examples.
The analyst must first decide what variables are to be modeled. Factors that should be
considered as indicators of reliability include:

• Environmental stresses
• Operational stresses
• Reliability growth
• Time dependency:
  o Infant mortality
  o Wearout
• Engineering practices
• Technology:
  o Feature sizes
  o Materials
• Defect rates
• Yields


Which ones are actually included depends on whether the data is available to support the
quantification of a factor, whether a valid theoretical basis exists for its inclusion, and
whether the factor can be empirically shown to be an indicator of reliability.
There are always many more potential factors influencing reliability than can realistically
be included in a model. The analyst must choose which ones are considered to be the
predominant reliability drivers, and include them in the model.
development is to theorize a model form. This is generally accomplished by attempting
to establish a model consistent with the fundamental physics of reliability. Examples of
the development of several empirically-based models are provided in Chapter 7.
To compare various empirical methodologies, Table 2.5-11 contains the predicted failure
rate of various empirical methodologies for a digital circuit board. The failure rates in
this table were calculated for each combination of environment, temperature and stress.
As can be seen from the data, there can be significant differences between the predicted
failure rate values, depending on the method used. Differences are expected because
each methodology is based on unique assumptions and data. The RIAC data in the last
row of the table is based on observed component failure rates in a ground benign
application.
Table 2.5-11: Digital Circuit Board Failure Rates (in Failures per Million Part Hours)

Environment:                        Ground Benign                      Ground Fixed
Temperature:                    10 Deg. C      70 Deg. C          10 Deg. C      70 Deg. C
Stress:                         10%    50%     10%     50%        10%    50%     10%     50%
ALCATEL                         6.59  10.18   13.30   19.89      22.08  29.79   32.51   47.27
Bellcore Issue 4                5.72   7.09   31.64   35.43       8.56  10.63   47.46   53.14
Bellcore Issue 5                8.47   9.25  134.45  137.85      16.94  18.49  268.90  275.70
British Telecom HDR4            6.72   6.72    6.72    6.72       9.84   9.84    9.84    9.84
British Telecom HDR5            2.59   2.59    2.59    2.59       2.59   2.59    2.59    2.59
MIL-HDBK-217 E Notice 1        10.92  20.20   94.37  111.36      36.38  56.04  128.98  165.91
MIL-HDBK-217 F Notice 1         9.32  18.38   20.15   35.40      28.31  48.78   45.44   79.46
MIL-HDBK-217 F Notice 2         6.41   9.83   18.31   26.76      24.74  40.15   73.63  119.21
217Plus Version 2.0             0.28   4.89     -       -         0.51   6.04     -       -
RIAC data                       3.3     -       -       -          -      -       -       -

For electronic systems, generic handbook models such as MIL-HDBK-217 or Telcordia
SR-332 can be separated into two basic approaches, Parts Count and Parts Stress. When
the models for these handbooks were developed, researchers performed statistical
analyses on collected test and field data to determine major influencing factors for the
class of components being considered. For example, for almost all electronic components,
the predicted failure rate is found to be a function of operating temperature and applied
electrical stress. In general, the lower the operating temperature and applied electrical
stress, the lower the predicted failure rate will be. Therefore, the parts stress method
includes model factors for these specific stresses. However, if specific stress values
cannot be determined, it is still possible to perform a prediction using the more general
parts count methodology. For the parts count method, model stress levels have been set
to typical default levels to allow a failure rate estimate simply by knowing the generic
type of component (such as chip resistor) and its intended use environment (such as
ground mobile). It should be noted that these reliability prediction handbook approaches
are, by necessity, generic in nature. Actual test or field data from other similar items is
always more desirable, given sufficient similarity, as was discussed previously.
MIL-HDBK-217
MIL-HDBK-217, Reliability Prediction of Electronic Equipment, has historically been
the most widely used of all of the empirically-based reliability prediction methodologies.
The basic premise of the handbook is the use of historical piece part test and field failure
rate data as the basis for predicting future system reliability. The handbook includes
failure rate models for most electronic part types. The latest released version of MIL-HDBK-217 is F, Notice 2, dated 28 February 1995.² The handbook was almost a
casualty of the DoD Acquisition Reform initiative, but it survived primarily because of its
widespread use, the dependency on it throughout the military-industrial complex, and the
lack of a suitable replacement.
Figure 2.5-24 presents a brief example of the MIL-HDBK-217 parts count method, where
the product or system failure rate is the sum of the failure rates of the generic electrical
and electromechanical components of which it is comprised.³ Each piece-part failure rate
is derived by assigning typical defaults to the generic component category stress
models. The only factors considered in these parts count component models are (1) the
generic base failure rate for that part type (represented by λg) that is based on an assumed
application environment and default temperature, (2) a generic quality factor (πQ) that is
used to modify this part type base failure rate, and (3) the quantity of that part type used
in the equipment. In the example shown here, the λg for a bipolar microcircuit comprised
of between 1 and 100 gates in a 16-pin dual-in-line package used in a ground, fixed (i.e.,
GF) environment and operating at an assumed junction temperature of 60 degrees C is
0.012 failures per million hours. The quality factor for a parts count prediction is also
determined from a table (not shown in this example).

² As of the publication date of this book, a Draft version of MIL-HDBK-217G is in development and is expected to be released some time in 2010.
³ The current version of MIL-HDBK-217 does not predict the reliability of mechanical components or non-hardware reliability elements, such as software, human reliability, and processes. Field failures of mechanical components and non-hardware items should not be scored against MIL-HDBK-217 or any other electronics-based empirical methodologies.
The parts count prediction approach is intended for use early in the design phase of the
equipment life cycle, prior to the start of detailed design, when there is little known about
the specific characteristics of the parts being used, or how they will be applied (such as
individual operating and environmental stresses).

Figure 2.5-24: MIL-HDBK-217 Part Count Example
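As a minimal numeric sketch of this parts count roll-up (the part list and quantities are invented; the 0.012 failures per million hours generic rate is the one quoted above), the equipment failure rate is the quantity-weighted sum of λg·πQ over the generic part types:

# Parts count roll-up: lambda_equip = sum of N * lambda_g * pi_Q over part types
# (rates in failures per 10^6 hours; values are illustrative except the 0.012
# bipolar microcircuit rate quoted in the text)
parts = [
    # (description,               quantity, lambda_g, pi_Q)
    ("bipolar microcircuit, GF",        12,    0.012,  1.0),
    ("chip resistor",                   40,    0.0005, 1.0),
    ("ceramic capacitor",               25,    0.0010, 1.0),
]

lambda_equip = sum(n * lg * pq for _, n, lg, pq in parts)
print(f"{lambda_equip:.3f} failures per 10^6 hours")  # 0.189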


The parts stress approach of MIL-HDBK-217 is applied later in the design development
phase of the equipment life cycle, when more details of the design are becoming
available and specific part applications are being identified. The use of this approach
requires detailed knowledge of the applied stresses and physical characteristics of the
device, including ambient and/or operating junction temperatures, electrical stress levels
(such as voltage, power, or current) vs. rated parameters, device complexity (such as gate
counts or transistor counts for semiconductors), etc. An example of a MIL-HDBK-217
parts stress model is shown in Figure 2.5-25 for gate/logic arrays and microprocessors.


Figure 2.5-25: MIL-HDBK-217 Part Stress Example


The form of the model separately addresses failure rate contributions from the
microcircuit die and the microcircuit package. The quantitative values for C1 (shown
here) and C2 (not shown) are device- and package technology-dependent. The values for
the temperature and environmental factors, πT and πE respectively, are also technology-dependent
and are seen to independently impact the die and package failure rate
contributions. The quality factor, πQ, represents the amount of pre-condition screening or
testing that the part might get, and the learning factor (πL) reflects the level of maturity
associated with the manufacture of the device. Component maturity has been shown to
be a predominant reliability driver of components, since the maturity is inversely
proportional to the defect density, which in turn is proportional to the failure rate.
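For reference, the Notice 2 microcircuit model that Figure 2.5-25 illustrates takes the general form (consult the handbook itself for the factor tables):

λp = (C1·πT + C2·πE)·πQ·πL  failures per 10^6 hours

where the first term in parentheses is the die contribution and the second is the package contribution.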
Telcordia SR-332 (Bellcore)
The Telcordia SR-332, shown in Figure 2.5-26, was formerly known under the name of
Bellcore. The models are similar in concept and purpose to MIL-HDBK-217, including
the fact that they are empirically based.


Figure 2.5-26: Telcordia SR-332 (Bellcore)


The primary discriminator between the two approaches is that the Telcordia models have
been tailored specifically for the telecommunications industry, meaning that the breadth
of available environmental factors in the Telcordia models is much narrower than in
MIL-HDBK-217. There are three basic reliability prediction methods in the Telcordia
methodology:

• Method I represents a parts count reliability prediction approach, and is applicable
to new technology where no field data exists. Since it does reflect new
technology, the model includes a "first year multiplier" factor to account for
infant mortality failure rates.
• Method II incorporates the characteristics of Method I, but expands its scope to
include the impact of lab test data.
• Method III is based on field tracking of failure rates. It uses the Bayesian
methodology to combine data from various sources, in a manner similar to the
RAC PRISM/RIAC 217Plus methodology.


PRISM/217Plus
The original RAC PRISM⁴ system reliability assessment tool was developed and
released by the RAC in January 2000 as a potential replacement for MIL-HDBK-217.
With the subsequent transition to RIAC in June 2005, the RIAC 217Plus methodology
replaced the RAC PRISM tool and added additional component models. As a result, the
217Plus methodology currently addresses all the major component types found in
MIL-HDBK-217. Figure 2.5-27 symbolizes the replacement of the RAC PRISM tool by
217Plus.

Figure 2.5-27: RAC PRISM Replaced by RIAC 217Plus


With no DoD sponsorship and funding, the models contained within MIL-HDBK-217F,
Notice 2, and the data upon which they were based, were becoming increasingly
outdated, thereby subject to increasing criticism. The part failure rate models
incorporated within PRISM, and ultimately within 217Plus, are based on a much larger
and more recent dataset, reflecting the improvements made in semiconductor device and
packaging technologies, resulting in more accurate part failure rate predictions. The
PRISM/217Plus software tool incorporates many of the ideas contained in the New
System Reliability Assessment Study performed by the RIAC for what was then Rome
Laboratory (Reference 7). These ideas include the ability to update an analytical
reliability prediction using in-house test data or field experience through Bayesian
techniques, and the ability to factor in system-level reliability impacts resulting from the
robustness (or lack, thereof) of the system development process. This methodology is
discussed in more detail in Chapter 7.

⁴ PRISM is a registered trademark of Alion Science and Technology.


CNET/RDF 2000
The CNET/RDF 2000 Reliability Prediction Standard, shown in Figure 2.5-28, covers
most of the same component categories as MIL-HDBK-217.

Figure 2.5-28: CNET/RDF 2000


An example of an integrated circuit model is shown in Figure 2.5-29.
This model has many similarities to the PRISM/217Plus models, in that it addresses the
year of manufacture, dormancy failure rates, thermal cycling characteristics and electrical
overstress failure rates. The form of the integrated circuits model is shown here, and
bears a resemblance to the format of the MIL-HDBK-217F, Notice 2 microcircuit failure
rate model, in that it partitions the predicted device failure rate into die- and package-related contributions. As can be seen from Figure 2.5-29, there is quite a bit of
information that the analyst must have access to in order to use the model, but this is
typical for virtually all parts stress reliability prediction models.


Figure 2.5-29: CNET/RDF 2000 Model Example


FIDES
The FIDES methodology, illustrated in Figure 2.5-30, was created by a French
consortium of reliability experts from various companies. It has similarities to the
CNET, RAC PRISM and RIAC 217Plus methods.


Figure 2.5-30: FIDES


Other Methodologies
Other methodologies include:
• UTE C 80-810, Reliability Data Handbook: RDF 2000, a universal model for reliability prediction of electronic components, PCBs and equipment
• IEC 62380 TR Ed.1 (2003), Reliability Data Handbook, a universal model for reliability prediction of electronics components, PCBs and equipment
• IEC 1709, Electronic components - Reliability - Reference conditions for failure rates and constraints influence models for conversion
The VITA51 committee is also an organization that is addressing reliability prediction.
Its approach has been to adapt and modify existing methodologies, like MIL-HDBK-217,
by tailoring various factors of the MIL-HDBK-217F, Notice 2 models so that there is a
closer correlation between predicted values and field experience.
Reliability prediction models such as those summarized above can easily become
outdated as technology advances. As previously mentioned, maintaining the currency
and accuracy of a reliability prediction model can be a prohibitively costly and labor-intensive
effort. Failure to invest in this activity, however, will doom a reliability
prediction methodology to eventual irrelevancy and obsolescence.
2.5.1.2.5. Collecting Field Data

Since field data is critical to the reliability assessment process, it is explored in this
section. The nuances of collecting and interpreting it are discussed. Some of the issues
encountered in collecting field data are discussed in the NPRD discussion included in
Chapter 7. The intent of this section is to present guidelines on how to approach field
data collection.
Good data collection is the key to an effective process for utilizing data obtained from a
reliability tracking system. This information includes:
• Failure statistics (i.e., TTF, MTBF)
• Application information (i.e., stress, environment, etc.)
• Failure modes
• Failure causes

The intent of this section is to outline a reliability data collection and analysis system that
can provide the data required. Although the reliability tracking system outlined herein has
similarities to a FRACAS program, there are distinct differences. While a FRACAS
program is intended to identify the causes of failures so that corrective action can take
place, the program outlined herein is more comprehensive: it assists its user in more than
the implementation of corrective actions, as it also provides the data required to quantify
reliability in accordance with the methodologies outlined in this book. This concept is
illustrated in Figure 2.5-31.


Figure 2.5-31: Uses of Program Data Elements (reliability and TTF/MTBF analysis, vendor selection, failure verification, warranty claims, root cause identification, RCM implementation, and implementation of design improvements)


A data system consists of several basic elements: a database, software analysis tools, and
an interface to the data system users. The database is the core of the system that captures
the raw maintenance data that is necessary to perform the required data analysis. A
typical structure of a database is provided in Figure 2.5-32.

Figure 2.5-32: Program Database Structure (records: System Information, Parts Breakdown, Maintenance Data, Root Failure Cause/Analysis Data)


The blocks in the above figure correspond to records in a relational database structure.
The data elements associated with each record are defined below. The System
Information record consists of population statistics and needs to be updated whenever the
product or system status changes. Such a change occurs when new or modified items are
fielded.
The parts breakdown data element consists of a hierarchical description of the system.
This description is necessary to avoid confusion as to which FRUs (Field Replaceable
Units) belong to which assemblies and the number of FRUs in the assembly, as well as in
the entire system.
The maintenance data element consists of a record of the maintenance action taken to
maintain or repair the system. It also consists of a description of the anomaly, the failure
mode, and the failure mechanism of the failed unit as determined by the maintenance
technician. One record corresponds to a single maintenance action, and there can be any
number of them for each FRU in the system (i.e., a FRU in the system can be replaced
any number of times over the life cycle of the system).
The root failure cause/analysis data element consists of information on the results of the
detailed failure analysis that may be performed on the failed unit. It is a separate record
because not all maintenance actions will result in the failure analysis of a removed unit.
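A minimal sketch of this relational structure (all record and field names here are illustrative, not a prescribed schema) might look like:

from dataclasses import dataclass
from typing import Optional

@dataclass
class SystemInformation:      # population statistics; updated whenever fielding changes
    system_number: str
    date_fielded: str

@dataclass
class MaintenanceAction:      # one record per maintenance action on a FRU
    job_number: str
    system_number: str
    fru_id: str
    anomaly_description: str
    failure_mode: str

@dataclass
class RootCauseAnalysis:      # present only when failure analysis was performed
    job_number: str
    root_cause: Optional[str] = None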
There are two primary interfaces required of the system. The first is the maintenance
technician interface. This interface is the means by which maintenance data is entered
into the database. Ideally, this interface would consist of computers located within the
maintenance facility for direct data entry. The second interface is the one utilized by
individuals that need the results of the data analysis. The flow of the interface to the
system from the perspective of the system user is given in Figure 2.5-33.


Figure 2.5-33: Database Information Flow (the system user enters the parts breakdown and maintains the system usage status in the central database; when system maintenance is required, the maintenance technician identifies the part requiring maintenance and enters the part data into the database, performs the required maintenance, and then enters the maintenance data; the user runs the appropriate analysis and obtains the necessary reliability metric(s))
Important elements of the data system that should be considered for inclusion are
summarized below:
• System information:
  o Number of systems fielded
  o Dates of fielding for each system
  o Location of operation (optional)
  o System numbers (unique identifier for each system)

Critical elements of a data collection system are discussed below.


Parts Breakdown
A description of every level of assembly must be available, down to the lowest level of
repair. For the purposes of this example, this assembly will be called a FRU (Field
Replaceable Unit). This product or system description is critical to the unique
identification of parts so that the data that is reported at various levels is not confounded.
It is also critical if maintenance actions are not consistently performed at the same level.
At the lowest level of indenture, the following FRU information is required:

• Part number
• Serial number
• Part identification code (unique descriptor of the part in the hierarchical breakdown of the system; sometimes referred to as a Reference Designator)
• Number of parts in the product or system
• Applicable life unit (i.e., hours, miles, cycles, operations, etc.)
• Identification as to whether there is an individual elapsed time meter (or miles, cycles, operations) on the specific part, or whether system life units must be used
• Manufacturer name

Maintenance Information
A critical element to an effective reliability data collection and analysis system is the
accurate quantification of the failure cause. Not all perceived failures are real failures
and, therefore, it is important to identify whether part removals are indeed true failures.
Figure 2.5-34 illustrates the hierarchy of maintenance actions.


Figure 2.5-34: Hierarchy of Maintenance Actions (maintenance actions are either scheduled (perform routine maintenance) or unscheduled (remove/replace); a correct diagnosis corresponds to a real failure, on which failure analysis is performed to identify the root cause, and a necessary repair; an incorrect diagnosis corresponds to a false alarm, a cannot-duplicate, an unnecessary repair, or a faulty unit put back into the field, with failure analysis not performed)


The following is a list of required data elements in the capture of maintenance
information:
• Job number (unique identifier)
• Calendar date and time that the system is taken out of operation
• Calendar date of maintenance action
• System serial or configuration control number
• Number of total life units (i.e., hours, miles, cycles, operations) on the FRU at the start of the maintenance action (if the life unit meter is on the FRU)
• Number of total life units (i.e., hours, miles, cycles, operations) on the product or system at the start of the maintenance action (if the life unit meter is not on the part)
• Number of total life units (FRU or product/system, depending on which of the above two items is applicable) on the part at the start of the maintenance action. This is a calculated field generated by the database software.
• Initial description of the anomaly
• Initiating event (only one is chosen):
  o Failure of system to perform (unscheduled maintenance)
  o Condition monitoring-based event
  o Scheduled maintenance
• When discovered
• Action taken (only one is chosen):
  o Remove/replace
  o Maintain
  o Remove, re-test OK, and replace
• FRU on which action is taken (description and serial/configuration control number)
• Maintenance technician (name)
• Man-hours required for maintenance action
• Calendar date and time that the system is put back into service
• Cause of failure identified by the maintenance technician
• Failure mode description
• Failure mechanism description. There could be a standardized listing of the possible failure mechanisms from which the technician could scan and identify the appropriate mechanism.

Failure Analysis Information
The failure analysis record is used when there is a detailed failure analysis performed on
a removed FRU. The data contained in this record generically consists of the following:
• Summary of the analysis performed
• Results of the analysis
• Failure cause (should be the root failure cause, not a failure symptom cause)

Analysis
From the data collected and captured in the database, several fundamental reliability
parameters, including those listed below, can be calculated:
• Operating hours (or life units) of each FRU
• Cumulative operating hours of the population
• Cumulative system calendar hours of the population
• Cumulative FRU calendar hours of the population
• Individual calendar times for each product or system
• For scheduled removals:
  o Number of scheduled removals
  o Total number of man-hours associated with scheduled removals
  o Individual operating times for scheduled removals
  o Individual calendar times for scheduled removals
  o Number of man-hours for each scheduled removal
• For unscheduled removals:
  o Number of unscheduled removals
  o Total number of man-hours associated with unscheduled removals
  o Individual operating times for unscheduled removals
  o Individual calendar times for unscheduled removals
  o Number of man-hours for each unscheduled removal
• Number of total removals
• Total number of man-hours
• Individual number of man-hours
• Individual operating times of all removals
• Individual calendar times of all removals
• Number of removals for each failure cause
• Individual operating times of removals for each failure cause
• Individual calendar times of removals for each failure cause
• Total time that each individual product or system is unavailable

For many of these parameters, it is necessary to calculate the number of life units to
which each part has been exposed. This is done by calculating the number of life units on
the part since the last time that the part was replaced. This calculation procedure is
illustrated in Figure 2.5-35.


Figure 2.5-35: Calculation of Part Life Unit (if there is a life unit meter on the part, record the part's meter reading (i.e., hours/miles/cycles); otherwise, use the system life unit meter: if the part has previously been removed (i.e., there is a maintenance record for that part in the database), subtract the system life units at the last maintenance record from the current system life units; if not, record the system life units directly)
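A minimal sketch of the logic in Figure 2.5-35 (the function and argument names are illustrative):

def part_life_units(part_meter, system_meter_now, system_meter_at_last_replacement):
    # Life units accumulated by a part since it was last replaced.
    # part_meter: reading of a meter on the part itself, or None if absent.
    # system_meter_at_last_replacement: system meter reading recorded at the
    # part's last maintenance record, or None if the part was never replaced.
    if part_meter is not None:
        return part_meter                        # meter on the part: use it directly
    if system_meter_at_last_replacement is None:
        return system_meter_now                  # never replaced: use the system meter
    return system_meter_now - system_meter_at_last_replacement

print(part_life_units(None, 12000.0, 9500.0))    # 2500 life units since replacement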


Outputs
A list of typical output parameters is given below:
• Mean Operating Hours Between Scheduled Removals
• Mean Calendar Hours Between Scheduled Removals
• Mean Operating Hours Between Unscheduled Removals
• Mean Calendar Hours Between Unscheduled Removals
• Mean Man-Hours per Maintenance Action (MMH/MA)
• Distribution of maintenance man-hours per maintenance action
• Weibull parameters of individual operating times for unscheduled maintenance actions
• Weibull parameters for failures of a specific cause
• Pareto ranking of part failure rates (or of any of the above listed parameters)
• Failure cause distribution
• Pareto ranking of failure causes
• Mean system availability for each system
• Distribution of system availability

Drenick's Theorem
An important aspect of interpreting field reliability data is distinguishing between
calendar time and operating time. Consider a situation in which five items are fielded at
the same time, as illustrated in Figure 2.5-36. They will each have a failure time (or other
appropriate life unit) that is described by the TTF distribution as a function of operating
time.

Figure 2.5-36: Failure Times Based on Operating Time


Now, consider the same five items that were placed in the field at different calendar
times, as illustrated in Figure 2.5-37. They will have the same failure times relative to
their operating time, but the apparent failure times relative to calendar time will be quite
different.

Figure 2.5-37: Failure Times Based on Calendar Time
Furthermore, if the product or system is repairable (in which case the failed items are
replaced upon failure with a new item), an interesting effect occurs in which the apparent
failure rate will reach an asymptotic value that appears to represent a constant failure rate.
This occurs as the time zero values become randomized as items fail and are replaced
with new items.
To illustrate the relationship between the beta value (Weibull shape) and the
instantaneous failure rate as a function of calendar time when parts are replaced upon
failure, a simulation was performed. In this simulation example, the failure rate of 1100
items as a function of calendar time was calculated.
Figures 2.5-38 through 2.5-42 illustrate the results. These figures correspond to Weibull-distributed
TTFs with shape parameters of 20, 5, 2, 1 and 0.5, respectively. The time axis
is calendar time, normalized to a time unit of one characteristic life.


Figure 2.5-38: Failure Rate Simulation with Weibull Beta = 20

Figure 2.5-39: Failure Rate Simulation with Weibull Beta = 5.0



Figure 2.5-40: Failure Rate Simulation with Weibull Beta = 2.0

Figure 2.5-41: Failure Rate Simulation with Weibull Beta = 1.0



Figure 2.5-42: Failure Rate Simulation with Weibull Beta = 0.5


Consider the case where the Weibull beta = 20 (Figure 2.5-38). When the populations
start operating at the same time at t = 0, the failures occur at a rate described by the
Weibull distribution with a beta value of 20. The peak of the failure rate occurs at
approximately the characteristic life value of time. As units fail and are replaced, the
time zeros start to become randomized. As enough time passes, the time zeros will
eventually become completely randomized. At this point, the asymptotic value of failure
rate is reached, which is the reciprocal of the characteristic life (in this case, 100). Figure
2.5-39, depicting the simulation results for a beta value of 5.0, indicates a similar effect.
The asymptotic failure rate, however, is reached sooner. This happens because the
variance in failure time is greater for a beta of 5.0 relative to a beta of 20, which, in turn,
means that the population time zeros become randomized sooner. The plot illustrating
a beta of 2.0 (Figure 2.5-40) is similar, with a corresponding asymptotic value reached
sooner still. The plot corresponding to a beta of 1.0 (Figure 2.5-41) indicates that the random
failure rate occurs at t = 0, which intuitively makes sense, since the exponential case has,
by definition, a randomly occurring failure rate.
However, when the beta is less than 1.0 (Figure 2.5-42), the asymptotic failure rate value
is zero. This occurs because, when enough time has passed, the failed items have been
replaced with items that have a higher probability of living longer. The lower the beta
value, the shorter the time period required to achieve a zero failure rate.
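A minimal sketch of this kind of simulation (the population size, characteristic life, shape parameter and binning are illustrative choices, not the parameters behind Figures 2.5-38 through 2.5-42) replaces each item upon failure and tallies failures per calendar-time bin:

import numpy as np

rng = np.random.default_rng(0)
n_items, eta, beta, horizon = 1000, 100.0, 5.0, 500.0

failure_times = []
for _ in range(n_items):
    t = 0.0
    while True:
        t += eta * rng.weibull(beta)   # draw a Weibull TTF (numpy uses scale = 1)
        if t > horizon:
            break
        failure_times.append(t)        # item fails and is replaced at calendar time t

# Failures per unit calendar time, normalized per item: approaches 1/eta
counts, edges = np.histogram(failure_times, bins=50, range=(0.0, horizon))
rate = counts / (n_items * (edges[1] - edges[0]))
print(rate[-5:])                       # late-time rate hovers near 1/eta = 0.01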

Because this is an important factor in interpreting field reliability data, a methodology
was derived for the NPRD data to estimate the characteristic life based on field data with
varying time zero values. This methodology is discussed in Chapter 7, Section 4.
2.5.2. Physics
The generic physics-based approaches covered here are stress/strength interference
models and models derived from first principles. Each is described below.
2.5.2.1. Stress/Strength Modeling

Stress/strength interference theory is a technique used to quantify the probability that the
strength of an item is less than the stress to which it is subjected. For example, if the
distribution of the strength of an item can be quantified, and the distribution of the stress
it is under can be quantified, the area of intersection of the two distributions represents
the probability that the strength is less than the stress.
This technique is general in nature and applies equally to any situation in which the two
distributions can be quantified, as long as the X-axis represents the same variable for both
distributions. The variable can be electrical, such as voltage or current, or it can be
mechanical strength, for example, in units of KPSI.
The goal of any design for robustness effort is to minimize the variance of both
distributions, and maximize the separation of the distribution means. In this manner, the
probability of distribution intersection, or failure, is minimized.
An example of this approach is illustrated in Figure 2.5-43.


Figure 2.5-43: Stress/Strength Methodology (material properties (modulus, CTE) and design dimensions, together with extrinsic stresses, feed an FEA-based estimate of the stress distribution; strength data and fatigue data define the strength distribution; combining the two yields the probability of failure vs. time)
In this example, a mechanical item has certain physical properties, for example its
modulus and its coefficient of thermal expansion (CTE). These material properties are
used, in addition to the design variables (i.e., dimensions, extrinsic stresses), to estimate the
stresses to which the item is exposed. This stress can be modeled in several ways. One is
the use of handbooks that contain closed-form equations that estimate the stress to which
a material is exposed as a function of dimensions, force, deflections, etc. This is usually
only viable for simple structures. For more complex mechanical structures, finite
element models and analysis (FEA) may be required to simulate stresses.
For the strength portion of the model, two factors need to be considered:
• The inherent strength distribution of the material
• The strength properties as a function of time

An example of strength as a function of time is the fatigue properties of the material. The
fatigue properties pertain to the strength degradation over time.
At time = 0, the probability of failure is the intersection of the stress and the strength
distributions, as illustrated in Figure 2.5-44.

Figure 2.5-44: Stress/Strength Interference


The calculation for Normally-distributed stress and strength distributions is:

Z = \frac{\mu_x - \mu_y}{\sqrt{\sigma_x^2 + \sigma_y^2}}

where:
Z = the Standard Normal variate (i.e., the number of standard deviations on the standardized Normal distribution). The probability corresponding to Z can be obtained from:
1. Tables of the Standard Normal distribution
2. The MS Excel formula NORMSDIST(Z)
μx = the mean of the strength
μy = the mean of the stress
σx = the standard deviation of the strength
σy = the standard deviation of the stress

In many real situations, distributions other than the Normal are used, requiring alternate
methods of calculating the interference probability. Readily available software tools can
be used for this purpose (Reference 3).
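As a minimal sketch of the Normal-Normal case above (the means and standard deviations are invented), the interference probability is the standard Normal tail below -Z:

import math
from scipy.stats import norm

mu_strength, sd_strength = 50.0, 4.0   # e.g., KPSI; invented values
mu_stress,   sd_stress   = 38.0, 3.0   # e.g., KPSI; invented values

z = (mu_strength - mu_stress) / math.sqrt(sd_strength**2 + sd_stress**2)
p_failure = norm.cdf(-z)               # P(strength < stress)
print(f"Z = {z:.2f}, probability of failure = {p_failure:.2e}")  # Z = 2.40, ~8.2e-03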
As stated previously, in addition to the probability of failure at t=0, it is also critically
important to understand how this interference between stress and strength behaves as a
function of time. Items will sometimes age (due to mechanisms such as fatigue), which
essentially means that the strength distribution changes such that its mean is lowered.
Assuming that the stress to which the item is exposed remains constant, the result is that
there is more interference, and the failure probability increases with time. To properly
account for this aging phenomenon, the characteristics of this strength distribution and
the interference must be quantified as a function of time. This concept is illustrated in
Figure 2.5-45.

Figure 2.5-45: Stress/Strength Interference vs. Time


An example of a model that has been successfully used for brittle materials is the
following:

P = 1 - \exp\left[-\frac{V}{V_0}\left(\frac{\sigma}{S_0}\right)^m\left(\frac{t}{t_0}\right)^{m/n}\right]
where:
P = probability of failure
m = Weibull slope of the initial strength
S0 = characteristic strength
n = fatigue constant
V and V0 = volume parameters to account for the effects of size (i.e., they account for the effect that the more volume or surface area there is, the more likely it is to contain a strength-limiting flaw)
σ = stress
Now, if a screen is applied to the material to eliminate defects having strength values
below the applied screen stress threshold (Sth), the probability of failure becomes:

P = 1 - \exp\left\{-\frac{V}{V_0}\left[\left(\frac{\sigma}{S_0}\right)^m\left(\frac{t}{t_0}\right)^{m/n} - \left(\frac{S_{th}}{S_0}\right)^m\right]\right\}
This is only one example of a stress/strength model. Many others can be found in the
literature.
Models such as these can be invaluable in understanding the sensitivity of reliability as a
function of the factors accounted for in the model. However, as is the case with any
physics-based model, it is important to validate the model against empirical evidence.
This is critical because there is ample opportunity to introduce large errors into the
analysis, based on extreme sensitivity to assumptions, sample variability, etc.
Additionally, while the approach may be grounded in physics, the model parameters
usually need empirical data for their quantification.



2.5.2.2. First Principles
The premise of first principles modeling is that the fundamental physics that govern a failure
mechanism can be characterized, and that the reliability of the mechanism can be
accurately predicted from these equations. This is best illustrated with an example from
References 4 and 5. In this example, the reliability of a Fused Biconic Splitter was
modeled. This is a passive optical component used to split optical signals in fiber optic
telecommunication systems. The observed failure mode was a degradation of the
coupling ratio over time.
The original test plan included Accelerated Aging Tests on Fused Splitters for 3
conditions, as shown in Table 2.5-12.
Table 2.5-12: Test Conditions

Test Conditions    Temperature (C)    Relative Humidity (RH)    Absolute Humidity (AH)
85C / 85% RH             X                      X
85C / 16% RH             X                                                X
45C / 85% RH                                    X                         X

The X values in the cells indicate which test conditions have a constant value for the
stress indicated in each column. The values were chosen to assess whether relative
humidity or absolute humidity was the predominant mechanism of the failure mode. In
this case, two of the three conditions have equivalent relative humidity and two of three
have equivalent absolute humidity.
The results of the accelerated tests did not agree with a previously hypothesized failure
mechanism that proposed epoxy creep as the coupling ratio drift mechanism. Therefore,
in an effort to obtain a model that was consistent with empirical evidence, the
fundamental physics were investigated. This process is described below:
From optical component physics, it can be shown that the coupling between two fibers is:

c = \frac{3\pi\lambda}{32 n_2 a}\cdot\frac{1}{(1 + 1/V)^2}

where:

V \equiv a k \left(n_2^2 - n_3^2\right)^{1/2}

Additionally, the diffusion of water vapor into silica can be represented as:

C(r,t) = C_0\left[1 - \sum_{n=1}^{\infty} B_n\, J_0(j_n r/b)\, \exp\left\{-j_n^2\left[D_{H_2O}(T)\, t/b^2\right]\right\}\right]

where:

B_n \equiv 2/\left[j_n J_1(j_n)\right]
The hypothesis of the physical mechanism is that water diffuses into the outer surface of
the fused region very slowly and slightly decreases the index of refraction of this outer
surface. This increases the coupling coefficient, thereby increasing the coupling ratio.
As time goes by, more and more water diffuses in, and the coupling ratio increases until
the device goes out of spec. The amount of water in the silica is governed by the number of water molecules hitting the surface of the silica per unit time (directly proportional to the absolute humidity) and by the diffusion rate at that temperature. Therefore, if the time to failure at a specific condition is known, the time to failure at a new condition is the known TTF multiplied by the ratio of the absolute humidity levels times the ratio of the diffusion rates.
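As a minimal sketch of this scaling relationship (the function and variable names are illustrative; the inputs are taken from Table 2.5-13):

def scaled_ttf(ttf_known, ah_known, ah_new, d_known, d_new):
    # Known lifetime scaled by the absolute humidity ratio and the
    # diffusion rate ratio, per the hypothesis described above
    return ttf_known * (ah_known / ah_new) * (d_known / d_new)

# 85C/85% RH chamber: MTBF = 0.579 years, AH = 297.1 g/m3, D = 6.63e-18 cm2/sec
# 45C/85% RH chamber: AH = 55.4 g/m3, D = 2.60e-19 cm2/sec
print(scaled_ttf(0.579, 297.1, 55.4, 6.63e-18, 2.60e-19))  # roughly 79 years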
The data obtained in the tests were used to estimate the diffusion rate and the temperature
dependence of this diffusion rate, as shown in Table 2.5-13.
Table 2.5-13: Data to Estimate Diffusion Rate

SERVICE CONDITION                    TEMP (C)   RH (%)   ABS HUM            DIFFUSION CONSTANT   RATIO   MTBF (Years)
                                                         (grams H2O/m3)    (cm2/sec)
High Temp/High Humidity Chamber      85         85       297.1             6.63x10-18           1       0.579
Med Temp/High Humidity Chamber       45         85       55.4              2.60x10-19           137     79
High Temp/Medium Humidity Chamber    85         16       56.0              6.63x10-18           5.3     3.1
Underground                          25         85       19.6              3.73x10-20           2691    1559
Footway Box                          15         93       11.9              1.73x10-20           9552    5535

The predictions from the model were then obtained. The predicted and observed
lifetimes are shown in Table 2.5-14.

Table 2.5-14: Predicted Lifetimes vs. Observed

ENVIRONMENTAL CONDITION              TEMP (C)   RH (%)   MTBF in Hours   MTBF in Hours   % Difference
                                                         (Predicted)     (Measured)
High Temp/High Humidity Chamber      85         85       5072            5072            0
High Temp/Medium Humidity Chamber    85         16       26,909          27,500          2.2

As can be seen above, the model is extremely accurate in predicting the failure mechanism behavior.
Models developed from first principles, like the one shown in this example, can be very accurate and, thus, beneficial to a reliability program. However, several pieces of information were required in order to make this approach a viable alternative:

- Detailed component information, including:
  o Index of refraction of the core and cladding of the fiber used in the component
  o Fiber dimensions, and model constants in the above equations
- The ability to generate a closed-form equation that describes:
  o Water diffusion rates into silica
  o Optical coupling ratio as a function of component design parameters

While it would be desirable to model the reliability of every conceivable failure mechanism in this manner, the practical constraints facing most reliability practitioners make this difficult to apply to complex systems. The primary reason is that information like that summarized above is not practical to obtain in many cases. Additionally, with complex systems, there can be thousands of possible failure causes which would need to be modeled in order to obtain a system reliability estimate.
The primary difference between this approach and the DOE-based life modeling
approach previously described is the manner in which the model form is determined. In
the DOE approach, the model forms are assumed and are based on standard forms like
the power law or the Arrhenius law. In the physics approach, the model forms are
determined from first principles of physics. In both cases, however, certain model
parameters are generally estimated from empirical data.

2.6. Combine Data


Once the data for each item has been analyzed and reliability estimates have been made using any of the methods described previously, the information needs to be combined to form the best estimate of product or system reliability. The methodology of 217Plus was developed for this specific purpose and can be used as a framework from which to perform this combination.
Figure 2.6-1 summarizes the 217Plus methodology for estimating the failure rate of a product or system. In this example, only constant failure rates are addressed. If specific items are described by non-constant failure rates, the mathematics become more difficult, but the basic approach remains the same.

Figure 2.6-1: 217Plus Approach to Failure Rate Estimation


The specific approach that can be used depends on several factors, including:

- Whether information exists on a predecessor product or system
- The amount of empirical reliability data available on that product or system
- Whether the analyst chooses to evaluate and assess the processes used in the development of the product or system

The types of data that may be available can be any of the types summarized previously in
this section of the book.
If the product or system under analysis is an evolution of a predecessor item, the field
experience of the predecessor product can be leveraged and modified to account for the
differences between the new product and the predecessor product. A predecessor is
defined as a product or system that is based on similar technology and uses
design/manufacturing processes similar to the new item under development for which a
reliability prediction is desired. In this case, the new product or system is an evolution of its predecessor. In this analysis, a prediction is performed on both the predecessor item and the new item under development. These two predictions form the basis of a ratio that is used to modify the observed failure rate of the predecessor, and account for the degree of similarity between the new and predecessor products or systems. The result of the predecessor analysis is expressed as λ1, as presented in Figure 2.6-1.
If enough empirical data (field, test, or both) is available on the new product or system
under development, it can be combined with the reliability prediction on the new item to
form the best failure rate estimate possible. A Bayesian approach is used for this
combination, which merges the reliability prediction with the available data. As the
quantity of empirical data increases, the failure rate using the Bayesian combination will
be increasingly dominated by the empirical data. The result of the Bayesian combination
is defined as λ2, as presented in Figure 2.6-1.
The minimum amount of analysis required to obtain a predicted failure rate for a product
or system is the summation of the component estimated failure rates. The component
failure rates are determined from the component models, along with other data that may
be available to the analyst. The result of this component-based prediction is λIA,new. This value can be further modified by incorporating the optional data, resulting in λpredicted,new,
as shown in Figure 2.6-1. All methods of analysis require that a prediction be performed
on the new product or system under development in accordance with the component
prediction methodology. Predictions based solely on the component analysis should
be used only when there is no field or test reliability history for the new item and no
suitable predecessor item with a field reliability history. In this case, the reliability
model is purely predictive in nature. After a product or system has been fielded, and
there has been a significant amount of operating time, the best data on which to base a
failure rate estimate is field observed data, or a combination of prediction and observed

failure data. In this case, the reliability model yields an estimate of reliability, because
the reliability is estimated from empirical data.
Each element of the 217Plus methodology is further described in the following sections.

λIA,predecessor

λIA,predecessor is the initial reliability assessment of the predecessor product or system. It is the sum of the predicted component failure rates, and uses any of the methods described in this book.

λobserved,predecessor

λobserved,predecessor is the observed failure rate of the predecessor product or system. It is the point estimate of the failure rate, which is equal to the number of observed failures divided by the cumulative number of operating hours5.

5 Note that operating hours can be replaced by any other life unit, such as calendar hours, miles, cycles, etc. The 217Plus methodology predicts failure rates in terms of calendar hours. The important point is that all life units used in the assessment must be consistent.

Optional data
Optional data is used to enhance the predicted failure rate by adding more detailed data
pertaining to environmental stresses, operating profile factors, and process grades (the
concept of process grades is explained in detail in Chapter 7). The 217Plus models contain default values for the environmental stresses and operational profile, but in the event that actual values of these parameters are known, either through analysis or measurements, they should be used. The application of the process grades is also optional, in that the user has the option of evaluating specific processes used in the design, development, manufacturing and sustainment of a product or system. If process grades are not used in a 217Plus analysis, default values are provided for each process (failure cause); the user can evaluate any or all of the processes.

λpredicted,predecessor

λpredicted,predecessor is the predicted failure rate of the predecessor product or system after combining the initial assessment with any optional data, if appropriate.

λIA,new

λIA,new is the initial reliability assessment of the new product or system. This is the sum of the predicted component failure rates, and uses the 217Plus component failure rate models or other methods (such as data from NPRD or other data sources). A reliability


prediction performed in accordance with this method is the minimum level of analysis
that will result in a predicted reliability value. Applying any optional data can further
enhance this value.

λpredicted,new

λpredicted,new is the predicted failure rate of the new system after combining the initial reliability assessment with any optional data, if used. If optional data is not used, then λpredicted,new is equal to λIA,new.

λ1

λ1 is the failure rate estimate of the new system after the predicted failure rate of the new system is combined with the information on the predecessor product (predicted and observed data). The equation that translates the failure rate from the old product or system to the new one is:

$$\lambda_1 = \lambda_{predicted,new}\cdot\frac{\lambda_{observed,predecessor}}{\lambda_{predicted,predecessor}}$$

The values for λpredicted,new and λpredicted,predecessor are obtained using the component reliability prediction procedures. The ratio λobserved,predecessor/λpredicted,predecessor inherently accounts for the differences in the predicted and observed failure rates of the predecessor system, i.e., it inherently accounts for the differences in the products or systems analyzed in the component reliability prediction methodology.
This methodology can be used when the new product or system is an evolutionary extension of predecessor designs. If similar processes are used to design and manufacture a new item, and the same reliability prediction processes and data are used, then there is every reason to believe that the observed/predicted ratio of the new system will be similar to that observed on the predecessor system. This methodology implicitly assumes that there is enough operating time and failures on which to base a value of λobserved,predecessor. For this purpose, the observance of failures is critical to derive a point estimate of the failure rate (i.e., failures divided by hours). A single-sided confidence level estimate of the failure rate should not be used.

ai

ai is the number of failures for the ith set of data on the new product or system.

bi

bi is the cumulative number of operating hours for the ith set of data on the new product or
system.

AFi

AFi is the acceleration factor (AF) between the conditions of the test or field data on the
new product or system and the conditions under which the predicted failure rate is
desired. If the data is from a field application in the same environment for which the
prediction is being performed, then the AF value will be 1.0. If the data is from
accelerated test data or from field data in a different environment, then the AF value
needs to be determined. If the applied stresses are higher than the anticipated field use
environment of the new system, AF will have a value greater than 1.0. The AF can be
determined by performing a reliability prediction at both the test and use conditions. The
AF can only be determined in this manner, however, if the reliability prediction model is
capable of discerning the effects of the accelerating stress(es) of the test. As an example,
consider a life test in which the product was exposed to a temperature higher than what it
would be exposed to in field-deployed conditions. In this case, the AF can be calculated
as follows:
$$AF = \frac{\lambda_{T1}}{\lambda_{T2}}$$

where:

λT1 = the predicted failure rate at the test conditions, obtained by performing a reliability prediction of the system at temperature 1
λT2 = the predicted failure rate at the use conditions, obtained by performing a prediction at temperature 2

bi'

bi' is the effective cumulative number of hours of the test or field data used. If the tests were performed at accelerated conditions, the equivalent number of hours needs to be converted to the conditions of interest, as follows:

$$b_i' = b_i \cdot AF_i$$

a0

a0 is the effective number of failures associated with the predicted failure rate. If this value is unknown, then use a default value of 0.5. In the event that predicted and observed data is available on enough predecessor products or systems, this value can be tailored. See the next section for the appropriate tailoring methodology.

λ2

λ2 is the best estimate of the new system failure rate after using all available data and information. As much empirical data as possible should be used in the reliability assessment. This is done by mathematically combining λ1 with empirical data. Bayesian techniques are used for this purpose. The technique accounts for the quantity of data by weighting large amounts of data more heavily than small quantities. λ1 forms the prior distribution, comprised of a0 and a0/λ1. If empirical data (i.e., test or field data) is available on the system under analysis, it is combined with λ1 using the following equation:

$$\lambda_2 = \frac{a_0 + \sum_{i=1}^{n} a_i}{\dfrac{a_0}{\lambda_1} + \sum_{i=1}^{n} b_i'}$$

where λ2 is the best estimate of the failure rate, and a0 is the equivalent number of failures of the prior distribution corresponding to the reliability prediction. For these calculations, 0.5 should be used unless a tailored value can be derived. An example of this tailoring is provided in the next section.
a0/λ1 is the equivalent number of hours associated with λ1.
a1 through an are the number of failures experienced in each source of empirical data. There may be n different sources of data available (for example, each of the n sources corresponds to individual tests or field data from the total population of products or systems).
b1' through bn' are the equivalent number of cumulative operating hours experienced for each individual data source. These values must be converted to equivalent hours by accounting for any accelerating effects between the use conditions.
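A minimal sketch of this combination, assuming the default a0 = 0.5; all failure rates, failure counts, hours and acceleration factors below are hypothetical:

def lambda_1(pred_new, obs_pred, pred_pred):
    # Prediction of the new item scaled by the predecessor's
    # observed/predicted ratio
    return pred_new * obs_pred / pred_pred

def lambda_2(lam1, a0, failures, eff_hours):
    # Bayesian combination: the prior contributes a0 failures in a0/lam1
    # equivalent hours; each data source i contributes ai failures in
    # bi' = bi * AFi equivalent hours
    return (a0 + sum(failures)) / (a0 / lam1 + sum(eff_hours))

lam1 = lambda_1(pred_new=8e-6, obs_pred=6e-6, pred_pred=1e-5)  # failures/hour

failures = [1, 2]                        # ai for two data sources
eff_hours = [5000 * 10.0, 150000 * 1.0]  # bi' = bi * AFi (test at AF = 10)

print(lambda_2(lam1, a0=0.5, failures=failures, eff_hours=eff_hours))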


Tailoring the Bayesian Constant, a0


This section discusses tailoring of the a0 value used in the Bayesian equations. The value of a0 is proportional to the degree of weighting given to the predicted value (λ1). The value of the constant, a0, is chosen such that the uncertainty in the failure rate estimate, as calculated with the chi-square distribution, equates to the observed uncertainty. The default value of 0.5 to be used in the equation is based on the observed/predicted ratio derived from a wide variety of systems, applications, industries, etc. As such, there are many noise factors contributing to the variability in this ratio. However, if the user of the 217Plus model has enough data on which to derive a tailored value of a0, it should be derived and used. While the default value of 0.5 represents the large degree of uncertainty inherent when a diverse data set is used, a specific 217Plus user will generally be analyzing products with a much narrower focus, in terms of product type, environment, operating profile, etc. As such, with enough data, the value of a0 can be increased. As an example of calculating a value for a specific application, consider an example for a product used in a telecommunications system.

To estimate the value of a0 that should be used, a distribution of the following metric is calculated for all products for which both predicted and observed data is available:

$$\frac{\lambda_{observed,predecessor}}{\lambda_{predicted,predecessor}}$$
The lognormal distribution will generally fit this metric well, but others (for example,
Weibull) can also be used. The cumulative value of this distribution is then plotted.
Next, failure rate multipliers (as calculated by a chi-square distribution) are calculated
and plotted. This chi-square distribution should be calculated and plotted for various
numbers of failures, to ensure that the distribution of observed/predicted failure rate
ratios falls between the chi-square values. In most cases, one, two and three failures
should be sufficient. Next, the plots are compared to determine which chi-square
distribution most closely matches the observed uncertainty values. The number of
failures associated with that distribution then becomes the value of a0. Figure 2.6-2
illustrates an example for which this analysis was performed.


Figure 2.6-2: Comparison of Observed Uncertainty with the Uncertainty Calculated With
the Chi-square Distribution
As can be seen from Figure 2.6-2, the observed uncertainty does not precisely match the
Chi-square calculated uncertainty for any of the one, two or three failures used in this
analysis. This is likely due to the fact that the population of products on which this
analysis is based is not homogeneous, as assumed by the chi-square calculation.
However, the confidence levels of interest are generally in the range 60 to 90 percent. In
this range, the chi-square calculated uncertainty with 2 failures most closely
approximates the observed uncertainty. Therefore, in this example, an a0 value of 2 was
used. This value is also consistent with the Telcordia GR-332 reliability prediction
methodology (Reference 6).
The uncertainties represented by the distribution of observed/predicted failure rates are
typical of what can be expected when historical data on predecessor products or systems
are collected and analyzed to improve the reliability prediction process. Using this
example, one can be 80% certain that the actual failure rate for a product or system will
be less than 2.2 times the predicted value.
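A sketch of how such failure rate multipliers can be generated, assuming the standard chi-square relationship between a point estimate based on r failures and its upper confidence bound (scipy is assumed to be available):

from scipy.stats import chi2

def failure_rate_multiplier(confidence, r):
    # Upper confidence bound on the failure rate divided by the point
    # estimate r/T; the operating time T cancels out of the ratio
    return chi2.ppf(confidence, 2 * r + 2) / (2 * r)

for r in (1, 2, 3):
    print(r, round(failure_rate_multiplier(0.80, r), 2))
# r = 2 gives roughly 2.1 at the 80% level, consistent with the ~2.2x
# figure quoted in the example above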
2.6.1. Bayesian Inference

Figure 2.6-3 depicts the outline of the Bayesian inference approach. The available information about the model parameter vector, Θ, in the form of the prior distribution, f0(θ), is transformed to a new state of knowledge, represented by the posterior distribution, f(θ). The likelihood function represents data in the Bayesian framework, and determines how much the data may influence the prior knowledge.

[Figure: the prior f0(θ) and the likelihood L(Failure Data | θ), derived from a model for the failure data, are combined by Bayesian inference to produce the posterior f(θ) = L(θ | Failure Data)]

Figure 2.6-3. Bayesian Inference Outline


The mathematical description of the Bayesian transformation is defined by the equation
below. The normalization factor appearing in the denominator is inevitable when dealing
with conditional probability calculations.

$$f(\theta) = f(\theta \mid DATA) = \frac{f_0(\theta)\,L(DATA \mid \theta)}{\int f_0(\theta)\,L(DATA \mid \theta)\,d\theta}$$

where:

Θ = the vector of model parameters, (θ1, θ2, ..., θn)
f(θ) = the posterior joint distribution of the parameters
f0(θ) = the prior joint distribution of the parameters
L(DATA|θ) = the likelihood of the data given the model parameters

In practice, the features of this distribution include the updated marginal and conditional distribution of each parameter given the provided information. The marginal distribution of a single parameter is defined by the next equation. The marginal distribution is estimated by integrating the posterior joint distribution, f(θ), over the range of the other parameters, as shown. The other important outcome of the posterior joint distribution is the conditional distribution of each parameter, when the other elements of vector Θ are given. The conditional distribution is constructed by substituting the known parameters into the joint distribution, f(θ). Here again, the function needs to be scaled by a normalization factor, as demonstrated in the equation below, in order to be consistent with the basic characteristics of distribution functions.

$$f_j(\theta_j) = \int f(\theta_1, \theta_2, \ldots, \theta_j, \ldots, \theta_n)\,d\theta_{-j}$$

$$g_j\!\left(\theta_j \mid \theta_{-j}^*\right) = \frac{f\!\left(\theta_j,\,\theta_{-j}^*\right)}{\int f\!\left(\theta_j,\,\theta_{-j}^*\right)d\theta_j}$$

where:

θ-j = (θ1, ..., θj-1, θj+1, ..., θn)
θi* = the given value for θi

The integrals necessary for Bayesian computation usually require analytic or numerical
approximations. While the computations for non-constant failure rate distributions can
get quite involved, they are relatively straightforward for the exponential distribution.
The method explained in the previous section details this situation.
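As an illustration of such a numerical approximation, a minimal grid-based sketch of the posterior for a constant failure rate; the gamma prior and the data values below are hypothetical:

import numpy as np
from scipy.stats import gamma

lam = np.linspace(1e-7, 2e-5, 2000)        # grid over the failure rate
dlam = lam[1] - lam[0]

prior = gamma.pdf(lam, a=0.5, scale=2e-5)  # prior f0(lambda), hypothetical
failures, hours = 3, 400000                # observed data
likelihood = lam**failures * np.exp(-lam * hours)

posterior = prior * likelihood
posterior /= (posterior * dlam).sum()      # the normalization integral

print((lam * posterior * dlam).sum())      # posterior mean of lambda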

2.7. Develop System Model


There are several options that the analyst has for merging the reliability models of all of
the failure causes, components, etc. In decreasing order of rigor, they are:
1. Perform a Monte Carlo analysis, where the TTF distribution of each element is
preserved, and the operating time and number of failures is modeled
2. If all of the failure causes are independent, the options are:
a. Calculate the reliability of each cause at a specific time of interest, and
then calculate the reliability as:
$$R(t) = \prod_{i=1}^{n} R_i(t)$$

where there are n items, and Ri is the reliability of each item

b. Convert the reliability estimate of each element to a constant failure rate, and calculate the reliability as:

$$\lambda = \sum_{i=1}^{n} \lambda_i$$

if the following conditions are satisfied:

1. The analysis is performed only to the component level, without modeling the specific failure causes
2. A constant failure rate distribution is used
3. All components are required for the product or system to meet its requirements (i.e., failure probability values are independent)

Then, the product reliability is simply the product of the reliabilities of the individual components, or likewise the failure rate is the sum of the failure rates of the individual constituent components. This has been the traditional approach when using the handbook types of methodologies.
If all of the above-listed conditions are not present, then more sophisticated techniques are required. For example, consider the situation in which Conditions 1 and 2 are not satisfied, but Condition 3 is. In this example, let's say that there are seven failure causes for which the life modeling has resulted in an estimate of the TTF distribution under field use conditions. These distributions can have any arbitrary shape, dependent entirely on the characteristics of the specific failure causes. This situation is depicted in Figure 2.7-1, where the reliability block diagram is shown as a series configuration. Each failure cause is represented by Events 1 through 7, each of which has its own probability density function.


[Figure: seven probability density functions, f(t) versus time (t), one for each of Events 1 through 7; the time to first failure defines the system Time to Failure (TTF)]

Figure 2.7-1: Combining Seven Failure Cause Distributions


For repairable systems, in which repairs are made as failures occur, the system reliability
would be simulated over a given time period, such as the mission duration or the
warranty period. In this case, failure times are simulated from time = 0 to the specified
time period. In this simulation, multiple systems are simulated, for which the failure
times of the constituent components are also simulated. As failures occur, new
replacement components are installed which have a new component time zero (the
system operating time will not be zero, but will be the cumulative operating time). This
continues until the duration is exceeded for each of the simulated systems. The resulting
failure times for the system can then be analyzed, and the distribution parameters defined.
The resultant distribution will generally not be a mono-modal distribution; rather, it will
be a distribution of an arbitrary shape that is usually represented by a multi-modal
distribution.
It is noteworthy that the above model is valid for any situation in which all items are
critical, i.e., the failure of any one item results in product or system failure. For example,
the Fault Tree for this situation may look like Figure 2.7-2. In this case, all gates are OR
gates, which means that all failures of items represented by Events 1 through 7 constitute
critical failures. This is shown to illustrate the fact that the analysis does not necessarily

need to be performed at the same level of hierarchy. The most important thing is that all
of the critical failure causes are accounted for.

[Figure: fault tree in which the TOP event is an OR gate fed by Events 1 through 7]

Figure 2.7-2: Possible Fault Tree Representation of a Series Reliability Block Diagram
For non-repairable systems, in which the first failure causes system failure and all items represented by each failure cause are required for the product or system to function, this becomes a competing-risk situation in which the first failure cause to occur defines the item's TTF distribution. The Type 1 extreme value distribution, also known as the Gumbel distribution, is sometimes used to model this situation when components have the same reliability distribution. This competing-risk situation, modeled with times to first failure (TTFF), will not yield the same results as taking either the product of the reliability values or the sum of the failure rates (in the case of constant failure rates) because, in the latter cases, there is a probability that multiple failures will occur in the time period analyzed, which is not the case for the competing-risk situation.
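A minimal sketch of this competing-risk calculation, assuming seven hypothetical Weibull TTF distributions (one per failure cause):

import math, random

causes = [(1.2, 3000), (0.9, 5000), (2.0, 4000), (1.5, 2500),
          (3.0, 6000), (1.0, 8000), (2.5, 3500)]  # (beta, eta) per cause

def system_ttf():
    # Sample each cause's Weibull TTF by inverting R(t); the first
    # failure (the minimum) defines the system TTF
    return min(eta * (-math.log(1.0 - random.random())) ** (1.0 / beta)
               for beta, eta in causes)

samples = [system_ttf() for _ in range(100000)]
print(sum(samples) / len(samples))  # mean system time to first failure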
For all but the simplest of situations, closed-form solutions cannot be obtained. These cases require numerical simulation, such as Monte Carlo analysis, which is described in the next section.

2.7.1. Monte Carlo Analysis

Monte Carlo analysis is a powerful analytical technique that allows for the estimation of
parameters or factors in cases where closed-form statistical derivations are not possible.
This occurs in many reliability engineering analyses, making it an invaluable tool.
Monte Carlo analysis can be used for several purposes:
1. To determine the time to first failure, as in the previous example
2. To determine the probability of failure from a stress/strength interference model
For #2, there are handbooks available which provide estimates of interference probability
based on the individual stress and strength distributions. Or, a statistical simulation can
be performed to estimate the degree of interference via numerical techniques. This is
generally a more efficient and effective way of performing the simulation, given software
tools that are readily available.
The basic principle behind Monte Carlo analysis, as applied to stress/strength interference analysis, is shown here:
1. First, the stress and strength distributions are determined
2. A randomly selected value from each distribution is obtained
3. The randomly selected values are compared, and if the selection from the strength
distribution is less than the selection from the stress distribution, a failure is
considered to have occurred. If it is not, then success is considered to have
occurred.
4. This process is repeated many times, and the number of trials and the number of
failures are counted. The number of trials needs to be large enough to result in a
good estimate of the failure probability. The failure probability is equal to the
total number of failures divided by the total number of trials.
$$F = \frac{N_{strength < stress}}{N}$$

where:

F = the failure probability
N = the total number of trials
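A minimal end-to-end sketch of steps 1 through 4, assuming normally distributed stress and strength with illustrative parameters:

import random

N = 200000
failures = 0
for _ in range(N):
    strength = random.gauss(10.0, 3.0)  # step 2: sample strength
    stress = random.gauss(5.0, 2.0)     # step 2: sample stress
    if strength < stress:               # step 3: interference = failure
        failures += 1

print(failures / N)                     # step 4: failure probability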

More detail regarding the process is described below.



The first step is to randomly select a value from each of the stress and strength
distributions. As an example, consider a normally distributed strength with a mean of 10
and standard deviation of 3, the pdf of which is shown in Figure 2.7-3.


Figure 2.7-3: pdf of Normal Distribution with Mean of 10 and Standard Deviation of 3
Next, the cumulative function of this distribution is calculated, as shown in Figure 2.7-4.


Figure 2.7-4: Cumulative Normal Distribution with Mean of 10 and Standard Deviation of 3

Next, the randomly selected value from the distribution is obtained by:

- Selecting a random number between 0 and 1. This number is displayed on the y-axis.
- Then, the value on the x-axis corresponding to this y-value is determined, as shown in Figure 2.7-5.

Figure 2.7-5: Value Selection From a Distribution


Distributions typically used in stress/strength analysis include the Normal distribution and the Weibull distribution. The Normal cumulative distribution does not have a closed-form solution, and requires the solution of an integral for its computation. However, software programs have simplified this calculation. For example, the MS EXCEL function for this calculation is:

NORMINV(rand(), mean, standard deviation)

where:

rand() returns a random number between 0 and 1
The mean and standard deviation are the values from the sampled distribution
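An equivalent draw can be made in other environments; for example, a sketch in Python using scipy's inverse normal CDF (the percent point function):

import random
from scipy.stats import norm

# Inverse CDF of a random number in [0, 1) gives a normal sample
sample = norm.ppf(random.random(), loc=10, scale=3)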


The Weibull distribution is simpler to use than the Normal distribution, since an integral of the pdf is not required to derive the CDF. The closed-form pdf of the two-parameter Weibull distribution, with shape parameter β and characteristic life η, is:

$$f(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1} e^{-\left(t/\eta\right)^{\beta}}$$

The reliability function (1 - cumulative function (CDF)) is:

$$R(t) = e^{-\left(t/\eta\right)^{\beta}}$$

The Weibull distribution is one of the most widely used distributions in reliability
engineering due to its versatility. It also has the advantage of having a closed-form
solution for its cumulative function.
To select a random value from this distribution, a random number between 0 and 1 is selected, this value is substituted for R(t), and the corresponding TTF is determined from the equation. In this example, time (t) is shown as the independent variable, but the specific parameter could be any parameter whose distribution is used in a Monte Carlo analysis. The inverse cumulative function is shown in Figure 2.7-6, along with the selection of the random value.

Figure 2.7-6: Value Selection From a Weibull Distribution
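A sketch of this draw (the β and η values are illustrative; Python's standard library also offers random.weibullvariate for comparison):

import math, random

beta, eta = 1.5, 1000.0                     # shape and characteristic life
R = 1.0 - random.random()                   # random value in (0, 1]
t = eta * (-math.log(R)) ** (1.0 / beta)    # invert R(t) = exp[-(t/eta)^beta]
print(t, random.weibullvariate(eta, beta))  # library draw for comparison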



Now, let's consider another application of Monte Carlo simulation. In this example, a
simple relationship between items for a repairable system is shown in Figure 2.7-7. Here,
the items can be failure causes, components or assemblies, in accordance with the level to
which the analysis is performed.

Figure 2.7-7: Reliability Block Diagram of Redundant Example


For this example:

- A and (B or C) need to be operational for the system to function
- TTFi and TTRi are the times to failure (TTF) and times to repair (TTR), taken from the governing distributions of each

The behavior of each item, along with the resultant system behavior, is shown in Figure 2.7-8.

Figure 2.7-8: System Monte Carlo Example


For example, item A operates until it fails at TTFA1. At that point in time, it takes TTRA1
to repair it. Items B and C fail and get repaired at rates determined by the simulated

times for each, and governed by the specific distribution of each. The resultant system
availability (Asystem) is shown on the bottom.
A simulation was performed on this hypothetical system using a software tool, the results of which are shown in Figure 2.7-9. In this case, the following metrics were calculated from the Monte Carlo analysis:

Ao: Availability (% of total time the system is available)
MTBDE: Mean Time Between Downing Events
MDT: Mean Down Time
MTBM: Mean Time Between Maintenance
MRT: Mean Repair Time
% green time: Percent of time that all units are operational
% yellow time: Percent of time that at least one unit is not operational, but the system still operates
% red time: Percent of time that at least one critical item is not operational
Number of failures: The number of simulated failures per run

Figure 2.7-9: Monte Carlo Simulation of Example System


Simulations of product reliability, as described above, are generally the best way to
combine life estimates of constituent parts in a system. If a system is comprised of
redundant elements, closed-form equations are available that calculate the effective
failure rate of the redundant elements. However, care must be taken when using these
equations. For example, the manner in which they are generally derived is to calculate
the failure characteristics as time approaches infinity. Only in this manner are closed-form solutions possible. The results are effective failure rate estimates that often
underestimate the benefits of redundancy. This is especially true when mission times are
relatively short. As a result, calculating reliability based on the failure probability
examples described above is generally a more sound approach. Additionally, the
availability of software tools has made it much easier to perform these calculations.
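A compact sketch of such a simulation for the A and (B or C) example above, assuming exponential TTF and TTR distributions with hypothetical parameters:

import random

HORIZON = 20000.0  # total simulated time

def up_intervals(mttf, mttr):
    # Alternate exponential TTF and TTR draws to build the up intervals
    t, ivals = 0.0, []
    while t < HORIZON:
        up_end = t + random.expovariate(1.0 / mttf)
        ivals.append((t, up_end))
        t = up_end + random.expovariate(1.0 / mttr)
    return ivals

def is_up(ivals, t):
    return any(a <= t < b for a, b in ivals)

ivals = {"A": up_intervals(1000.0, 20.0),
         "B": up_intervals(500.0, 10.0),
         "C": up_intervals(500.0, 10.0)}

# Sample random instants; the system is up when A and (B or C) are up
n, up = 50000, 0
for _ in range(n):
    t = random.uniform(0.0, HORIZON)
    if is_up(ivals["A"], t) and (is_up(ivals["B"], t) or is_up(ivals["C"], t)):
        up += 1
print(up / n)  # estimated availability, Ao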

2.8. References

1. Production Part Approval Process (PPAP), Third Edition, Daimler-Chrysler, Ford, General Motors, 1999
2. Modarres, M., "Accelerated Testing", ENRI 641, Univ. of Maryland, May 2005
3. Weibull++, Reliasoft Corp.
4. Colm V. Cryan, James R. Curley, Frederick J. Gillham, David R. Maack, Bruce Porter, and David W. Stowe, "Long Term Splitting Ratio Drifts in Singlemode Fused Fiber Optic Splitters", NFOEC '95
5. David R. Maack, David W. Stowe and Frederick J. Gillham, "Confirmation of a Water Diffusion Model For Splitter Coupling Ratio Drift Using Long Term Reliability Data", NFOEC '96
6. Telcordia GR-332, "Reliability Prediction Methodology"
7. Denson, W.K. and S. Keene, "A New System Reliability Assessment Methodology - Final Report", available from the Reliability Information Analysis Center, 1998


3. Fundamental Concepts

The intent of this book is not to cover the basics of probability or reliability theory. The
understanding of some of these fundamental concepts, however, is critical to the
interpretation of reliability estimates. The definition of reliability is a probability, the
value of which is estimated by the techniques covered in this book. Therefore, the basics
of reliability terminology, and the basis for various theoretical concepts are covered in
this section.

3.1. Reliability Theory Concepts


There are two basic types of variables: Discrete and Continuous. A discrete variable is
one that is limited to integer values (i.e., 0, 1, 2, 3, ...). The probability distribution
describing this type of variable is called a discrete distribution. For example, the
distribution of the number of defects remaining in software programs after 6 months of
development would be a discrete distribution, since a partial defect cannot exist. Figure
3.1-1 illustrates a discrete probability distribution.

[Figure: discrete probability values p(x1) through p(x9) plotted against the number of remaining defects (x)]

Figure 3.1-1: Discrete Probability Distribution


The probability that a random variable x takes on a specific value xi is expressed as:

P{x = xi } = p(xi )

A continuous variable is one that is measured on a continuous scale, and its probability
distribution is defined as a continuous distribution. For example, the distribution of the
TTF would be a continuous distribution, since an infinite number of positive time values
can be represented in the distribution. Figure 3.1-2 illustrates a continuous distribution.

Figure 3.1-2: Continuous Probability Distribution


The probability that a random variable x lies within the interval from a to b is expressed as:

$$P\{a \le x \le b\} = \int_a^b f(x)\,dx$$

A probability distribution is characterized by a probability density function (pdf), f(t).


The pdf is essentially a histogram of the random variable, often the TTF. For a discrete
random variable, the pdf at a given value of the random variable is the probability that the
realization of the random variable will take on that value. For a continuous random
variable, the area under the pdf for a given interval is the probability that a realization of
the random variable will fall within that interval (Figure 3.1-2). The probability density
functions are non-negative for all values and the sum of the probabilities over all values
for discrete random variables, or the total area under the pdf for continuous random
variables, always equals 1.0.
The cumulative distribution function F(t) is defined as the probability in a random trial
that the random variable is not greater than t:

$$F(t) = \int_{-\infty}^{t} f(t)\,dt$$

If the random variable is discrete, the integral is replaced by a summation.


The Cumulative Distribution Function (CDF) is the probability that the value of a
corresponding random variable will not be exceeded. Cumulative distribution functions
are non-negative and non-decreasing. Given a random variable that cannot be negative,
the value of the CDF at the origin is zero. The upper limit of a CDF is always 1.0, as
illustrated in Figure 3.1-3. The CDF is the integral of the pdf, and is illustrated in Figure
3.1-3 for discrete and continuous distributions, respectively.

Figure 3.1-3: The Cumulative Distribution Function (CDF)


The reliability function, R(t), is the probability of a device surviving (not failing) prior to
time t, and is given by:


$$R(t) = 1 - F(t) = \int_t^{\infty} f(t)\,dt$$

Note that for the reliability, the integral of the pdf is from t to infinity for the
probability of success, as opposed to minus infinity to t as in the case of the failure
probability. The sum of the probability of success and the probability of failure needs to
be 1.0, consistent with the definition of a pdf.
By differentiating the above equation:

$$\frac{dR(t)}{dt} = -f(t)$$
The probability of failure in a given time interval between t1 and t2 can be expressed by
the reliability function:

$$\int_{t_1}^{\infty} f(t)\,dt - \int_{t_2}^{\infty} f(t)\,dt = R(t_1) - R(t_2)$$

The rate at which failures occur in the interval t1 to t2, the failure rate λ(t), is defined as the ratio of the probability that a failure occurs within the interval, given that it has not occurred prior to t1 (the start of the interval), divided by the total interval length. Thus:

$$\lambda(t) = \frac{R(t_1) - R(t_2)}{(t_2 - t_1)\,R(t_1)} = \frac{R(t) - R(t + \Delta t)}{\Delta t\,R(t)}$$

where t = t1 and t2 = t + Δt. The hazard rate, h(t), or instantaneous failure rate, is defined as the limit of the failure rate as the interval length approaches zero, or:

$$h(t) = \lim_{\Delta t \to 0}\frac{R(t) - R(t + \Delta t)}{\Delta t\,R(t)} = -\frac{1}{R(t)}\frac{dR(t)}{dt}$$
Since it was already shown that:

$$\frac{dR(t)}{dt} = -f(t)$$

Then,

$$h(t) = \frac{f(t)}{R(t)}$$

In an attempt at providing an interpretation of the hazard rate function, consider the following:

- The hazard rate, h(t), is the rate at which failures occur, provided that the item has not failed before the time at which h(t) is evaluated
- f(t) is the normalized percentage of the population failing in a given time interval (Δt), such that the population size times the value of f(t) is equal to the number of failures in the interval of time
- The denominator, R(t), is the probability of survival at t, which is equivalent to the percentage of the population surviving at time t

Multiplying R(t) by the population size, N, yields the total number of units surviving until t. This is the binomial probability, or expected value, of the number of survivors at t. Since this population will have accrued an operating time of R(t)·N·Δt, the denominator is equivalent to the cumulative operating time on the population in the time interval.
Therefore,
$$h(t) = \frac{f(t)}{R(t)} = \frac{f(t)\,N}{R(t)\,N} = \frac{\#\text{ failures in }\Delta t}{\#\text{ units surviving}\times\Delta t} = \frac{\text{failures}}{\text{item-hours}} = \text{failure rate}$$

Integrating both sides of the hazard rate relationship

$$h(t) = -\frac{1}{R(t)}\frac{dR(t)}{dt}$$

results in:

$$R(t) = e^{-\int_0^t h(t)\,dt}$$

This is the general expression for the reliability function. If h(t) can be considered a
constant failure rate (), which is often the case, the equation becomes:

$$R(t) = e^{-\lambda t}$$
The mean time to failure (MTTF) is the expected value of the time to failure, and is:

$$MTTF = \int_0^{\infty} R(t)\,dt$$

If the reliability function can be easily integrated, this is a convenient way to calculate the
mean time to failure (MTTF). If not, then numerical techniques can be used.
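A minimal sketch of the numerical route, assuming a Weibull reliability function with illustrative parameters:

import numpy as np

beta, eta = 2.0, 1000.0
t = np.linspace(0.0, 10000.0, 100001)
R = np.exp(-(t / eta) ** beta)
print((R[:-1] * np.diff(t)).sum())  # ~886.2 = eta * Gamma(1 + 1/beta)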
If all parts in a population are operated until failure, the mean life is:

$$\bar{t} = \frac{1}{n}\sum_{i=1}^{n} t_i$$

where:

ti = the time to failure of the ith item in the population
n = the total number of items in the population

The mean time between failures (MTBF) is:

$$MTBF = \frac{T(t)}{r}$$

where:

T(t) = total operating time
r = number of failures
Failure rate and MTBF are applicable only to the situation in which the failure rate is constant, i.e., the exponential TTF distribution. Per the definitions above, it can be seen that the failure rate and MTBF are reciprocals of each other:

$$\lambda = \frac{1}{MTBF}$$


The failure rate is the number of failures divided by the cumulative operating time of the
entire population (failure/part hours), whereas the MTBF is the cumulative operating time
of the entire population divided by the number of failures (part hours per failure).
Table 3.1-1 provides an overview of the basic notation and mathematical representations
that are common among the various types of probability distributions.
Table 3.1-1: Probability Distribution Notation & Mathematical Representations

Notation    Definition                                 Mathematical Representation
X           Random Variable
x           Realization of a Random Variable
Pr(X ∈ S)   Probability That the Random Variable       Pr(X ∈ S) = Σ f(x) over x ∈ S (discrete distribution);
            X is in the Set S                          Pr(X ∈ S) = ∫_S f(x)dx (continuous distribution)
f(x)        Probability Density Function (PDF)
F(x)        Cumulative Distribution Function (CDF)     F(x) = Σ f(w), w = 0 to x (discrete distribution);
                                                       F(x) = ∫_0^x f(w)dw (continuous distribution)
h(x)        Hazard Rate                                h(x) = f(x)/[1 - F(x)] = f(x)/R(x) = [1/R(x)]·dF(x)/dx
R(x)        Reliability                                R(x) = 1 - F(x) = ∫_x^∞ f(t)dt = e^(-∫_0^x h(t)dt)
E[u(X)]     Expected Value                             E[u(X)] = Σ u(w)f(w), w = 0 to ∞ (discrete distribution);
                                                       E[u(X)] = ∫_0^∞ u(w)f(w)dw (continuous distribution)
μ           Mean                                       μ = E(X)
σ           Standard Deviation                         σ = (E[(X - μ)²])^(1/2)

Note: These definitions are based on the assumption that all realizations of a random variable must be non-negative.


3.2. Probability Concepts


This section discusses some of the basic probability concepts that are important in
reliability modeling.
3.2.1. Covariance

Covariance is a measure of the extent to which one variable is related to another, and is
expressed as:
$$Cov(X,Y) = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{n - 1}$$

3.2.2. Correlation Coefficient

The correlation coefficient is defined as the standardized covariance:

$$r = \frac{Cov(X,Y)}{\sigma_X\,\sigma_Y}$$

Examples of various correlation coefficients are shown in Figure 3.2-1.

Figure 3.2-1: Examples of Correlation Coefficients



3.2.3. Permutations and Combinations

A permutation is defined as the number of ways of ordering n items taken x at a time, and is mathematically expressed as:

$$_nP_x = \frac{n!}{(n-x)!}$$

A combination is defined as the number of distinct combinations of n items taken x at a time, when ordering is not relevant, and is mathematically expressed as:

$$_nC_x = \frac{n!}{x!\,(n-x)!}$$

As an example of permutations and combinations, define n = 4 and x = 2. The number of combinations is:

$$_nC_x = \frac{n!}{x!\,(n-x)!} = \frac{4!}{2!\,(4-2)!} = 6$$

Consider these combinations, as illustrated in Table 3.2-1. Here, there are 4 items (n = 4), each of which can have two possible values (blank or x).

Table 3.2-1: Combinations Example

Combination    Item 1    Item 2    Item 3    Item 4
1              x         x
2              x                   x
3              x                             x
4                        x         x
5                        x                   x
6                                  x         x

The corresponding number of permutations is:

$$_nP_x = \frac{n!}{(n-x)!} = \frac{4!}{(4-2)!} = 12$$

Each set of 2 can be reversed, thus the number of permutations is double the number of
combinations for n=4.
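These counts can be verified with Python's standard library:

import math

print(math.comb(4, 2))  # 6 combinations
print(math.perm(4, 2))  # 12 permutations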
3.2.4. Mutual Exclusivity

Items are mutually exclusive when the occurrence of one event precludes the other. In
other words, if one event occurs, the other cannot. This is the only case in which
probabilities can be added. Mutual exclusivity is defined as:
P(a or b ) = P(a ) + P(b )

where:

P(a or b) = probability of either event a or event b occurring
P(a) = probability of event a occurring
P(b) = probability of event b occurring

Mutually exclusive sets are those with no common members, shown in the Venn diagram
in Figure 3.2-2.

Figure 3.2-2: Venn Diagram of Mutually Exclusive Events


For mutually exclusive events, the intersection A∩B is the empty (or null) set.
3.2.5. Independent Events

An independent event is one in which the probability of one event has no effect on the
other, and is expressed as follows:
P(a and b ) = P(a )P(b )

where:

P(a and b) = probability of both event a and event b occurring
P(a) = probability of event a occurring
P(b) = probability of event b occurring

The probability of either event a or b occurring is:


$$P(a\ or\ b) = P(a) + P(b) - P(a)P(b)$$

This is illustrated in Figure 3.2-3.

Figure 3.2-3: Independent Events


3.2.6. Non-independent (Dependent) Events

Non-independent (or dependent) events indicate that the probability of one event is
dependent on the other, as shown:
$$P(a\ and\ b) = P(a)\,P(b \mid a)$$

or

$$P(a\ and\ b) = P(b)\,P(a \mid b)$$

where:

P(a and b) = probability of both event a and event b occurring
P(a) = probability of event a occurring
P(b) = probability of event b occurring
P(b|a) = probability of event b occurring, given that event a has occurred
P(a|b) = probability of event a occurring, given that event b has occurred
probability of event a occurring, given that event b has occurred

3.2.7. Non-independent (Dependent) Events: Bayes Theorem

For non-independent (dependent) events, one event may have several different outcomes,
each affecting the other event differently. This situation is mathematically described as:
$$P(a_1 \mid b) = \frac{P(b \mid a_1)\,P(a_1)}{\sum_i P(b \mid a_i)\,P(a_i)}$$

where:

P(b|a1) = probability of event b occurring, given that event a1 has occurred
Σi P(b|ai)·P(ai) = the total probability of event b occurring

The events ai are mutually exclusive; therefore, their probabilities can be added.
3.2.8. System Models

For independent failure causes, the reliability of a system is the product of the reliability
values for the constituent failure causes, as shown:
$$R = R_1\,R_2\,R_3 \cdots R_n$$

If the failure rate is constant, the probability of survival for a specific cause is:

$$R = e^{-\lambda t}$$

The system reliability is:

$$e^{-\lambda_{total}\,t} = e^{-\lambda_1 t}\,e^{-\lambda_2 t}\,e^{-\lambda_3 t} \cdots e^{-\lambda_n t}$$

Taking the natural log of both sides yields:

$$\lambda_{total} = \lambda_1 + \lambda_2 + \lambda_3 + \cdots + \lambda_n$$
The above equations are relevant to a series configuration of items, each with a constant
failure rate. The fault tree representation of this configuration is shown in Figure 3.2-4.
Here, the system reliability is represented by a logical OR gate, since the failure of A or
B or C will cause system failure.



Figure 3.2-4: Fault Tree OR Gate

The corresponding reliability block diagram representation for this scenario is shown in
Figure 3.2-5.

Figure 3.2-5: Reliability Block Diagram for an OR Gate

All possible outcomes for this example are shown in Table 3.2-2.
Table 3.2-2: Combinations of an OR Configuration

A       B       C       Output of OR Gate
Fail    Fail    Fail    Fail
Fail    Fail    Pass    Fail
Fail    Pass    Fail    Fail
Fail    Pass    Pass    Fail
Pass    Fail    Fail    Fail
Pass    Fail    Pass    Fail
Pass    Pass    Fail    Fail
Pass    Pass    Pass    Pass

Note that each of these eight possible outcomes in the table is mutually exclusive, in that there is only one possible way in which each of the eight can occur.
As an example, if events A, B and C have the following reliability values:

RA = 0.95
RB = 0.92
RC = 0.99

The reliability of the series configuration (i.e., the probability of exactly zero failures) of the three items is:

$$R = R_A\,R_B\,R_C = 0.95 \times 0.92 \times 0.99 = 0.87$$

Now, suppose that several items must fail in order for the system to fail. This scenario is
represented by an AND gate in a fault tree representation, as is shown in Figure 3.2-6.


Figure 3.2-6: Fault Tree AND Gate

The corresponding Reliability Block Diagram (RBD) representation is shown in Figure


3.2-7. Note the parallel nature of this configuration.


[Figure: reliability block diagram with blocks A, B and C in parallel between a starting block and an ending block]

Figure 3.2-7: Reliability Block Diagram for an AND Gate

All possible outcomes for this example are shown in Table 3.2-3.
Table 3.2-3: Combinations of an AND Configuration

A       B       C       Output of AND Gate
Fail    Fail    Fail    Fail
Fail    Fail    Pass    Pass
Fail    Pass    Fail    Pass
Fail    Pass    Pass    Pass
Pass    Fail    Fail    Pass
Pass    Fail    Pass    Pass
Pass    Pass    Fail    Pass
Pass    Pass    Pass    Pass

The reliability of this parallel configuration of three items is:

RA = 0.95
RB = 0.92
RC = 0.99

$$R = 1 - (1 - R_A)(1 - R_B)(1 - R_C)$$

$$R = 1 - (1 - 0.95)(1 - 0.92)(1 - 0.99) = 0.99996$$

As an example of a slightly more complex situation, consider the fault tree representation
of a system in Figure 3.2-8.

[Figure: fault tree in which the TOP event is an OR gate fed by an AND gate over Events 1 and 2, Event 3, an AND gate over Events 4 and 5, Event 6, and Event 7]

Figure 3.2-8: Fault Tree of an AND/OR Combination

The RBD is shown in Figure 3.2-9.

[Figure: reliability block diagram with Events 1 and 2 in parallel, followed in series by Event 3, the parallel pair of Events 4 and 5, Event 6, and Event 7]

Figure 3.2-9: RBD of AND/OR Combination

Combining the series and parallel events yields the following reliability expression for this configuration:

$$R = \left[1 - (1 - R_1)(1 - R_2)\right]R_3\left[1 - (1 - R_4)(1 - R_5)\right]R_6\,R_7$$
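A minimal sketch evaluating this expression (the seven reliability values are illustrative):

def parallel(*rs):
    # 1 minus the product of the unreliabilities
    q = 1.0
    for r in rs:
        q *= 1.0 - r
    return 1.0 - q

R1, R2, R3, R4, R5, R6, R7 = 0.95, 0.92, 0.99, 0.90, 0.90, 0.98, 0.97
print(parallel(R1, R2) * R3 * parallel(R4, R5) * R6 * R7)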
3.2.9. K-out-of-N Configurations

A system consisting of n components or subsystems, of which only k need to be


functioning for system success, is called a k-out-of-n configuration. For such a system,
the integer value of k is always less than the integer value of n.
Define the following:

R = reliability of one unit for a specified time period
Q = unreliability of one unit for a specified time period
R + Q = 1

As an example, let us assume that there are three units operating in parallel, two of which
are required for the system to perform adequately. If R=0.9 and Q=0.1, then the
probabilities associated with each possible combination of outcomes is summarized in
Table 3.2-4.
Table 3.2-4: Example of k-out-of-n Probability Calculations

Outcome (A, B, C)     Probability   Prob of pass   Prob of pass   Prob of pass   Calculation    Total System
                                    or fail of A   or fail of B   or fail of C                  Probability
Fail, Fail, Fail      QAQBQC        0.1            0.1            0.1            0.1*0.1*0.1    0.001
Fail, Fail, Pass      QAQBRC        0.1            0.1            0.9            0.1*0.1*0.9    0.009
Fail, Pass, Fail      QARBQC        0.1            0.9            0.1            0.1*0.9*0.1    0.009
Fail, Pass, Pass      QARBRC        0.1            0.9            0.9            0.1*0.9*0.9    0.081
Pass, Fail, Fail      RAQBQC        0.9            0.1            0.1            0.9*0.1*0.1    0.009
Pass, Fail, Pass      RAQBRC        0.9            0.1            0.9            0.9*0.1*0.9    0.081
Pass, Pass, Fail      RARBQC        0.9            0.9            0.1            0.9*0.9*0.1    0.081
Pass, Pass, Pass      RARBRC        0.9            0.9            0.9            0.9*0.9*0.9    0.729

In this example, the probability of each combination of possible outcomes (in this case,
eight) is calculated. Note that the sum of the probabilities for all possible outcomes is
1.0, since each of the eight possibilities is mutually exclusive and their probabilities can,
therefore, be added. This approach of calculating the probability of every possible
outcome is always valid, regardless of whether the reliability values of each of the
elements are the same or not. For example, if two of the three units are required for the
system to perform adequately, the system will pass if there are either no failures or if
there is one failure, as shown below. This is summarized in Table 3.2-5.
Table 3.2-5: Example of 2-out-of-3 Required for Success

Outcome (A, B, C)     Probability   Total Probability   System Pass or Fail
Fail, Fail, Fail      QAQBQC        0.001               Fail
Fail, Fail, Pass      QAQBRC        0.009               Fail
Fail, Pass, Fail      QARBQC        0.009               Fail
Fail, Pass, Pass      QARBRC        0.081               Pass
Pass, Fail, Fail      RAQBQC        0.009               Fail
Pass, Fail, Pass      RAQBRC        0.081               Pass
Pass, Pass, Fail      RARBQC        0.081               Pass
Pass, Pass, Pass      RARBRC        0.729               Pass

It can be seen that the system will pass with outcomes 4, 6, 7 and 8. Outcomes 4, 6 and 7
correspond to exactly one failure (i.e., there are three ways in which one failure can
occur), and outcome 8 corresponds to exactly zero failures (there is only one way in
which this can occur).
If the probability of failure of all of the units is the same and they are independent, then the binomial or Poisson distributions can be used:

- If the metric used in the reliability analysis is the probability of failure, use the binomial distribution
- If the metric is a failure rate, use the Poisson distribution

Since this example pertains to items with defined probabilities, the binomial distribution
applies. As defined previously:



F(x; r) = Σ_{x=0}^{r} C(n, x) p^x q^(n-x) = Σ_{x=0}^{r} [n! / ((n - x)! x!)] p^x q^(n-x)

where:

n = total number of items (3)
x = good items (2 or 3)
r = failed items (0 or 1)

The probability of exactly no failures (i.e., the first term in the above summation) is:

F(3,0) = [n! / ((n - x)! x!)] p^x q^(n-x) = [3! / ((3 - 3)! 3!)] (0.9)^3 (0.1)^0 = 1 * 0.729 = 0.729

The probability of exactly one failure (i.e., the second term in the above summation) is:

F(2,1) = [n! / ((n - x)! x!)] p^x q^(n-x) = [3! / ((3 - 2)! 2!)] (0.9)^2 (0.1)^1 = 3 * 0.81 * 0.1 = 0.243

Therefore, the cumulative binomial expression for 0 or 1 failures (r = 0 or 1) is:

F(x; r) = Σ_{x=0}^{r} [n! / ((n - x)! x!)] p^x q^(n-x) = 0.729 + 0.243 = 0.972

Because the first term in the binomial probability expression is the number of combinations in which a specific number of failures (or survivals) can occur, it effectively adds the probabilities associated with the mutually exclusive events.
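The cumulative binomial calculation above can be verified with a few lines of Python (the function name is ours):

```python
from math import comb

def k_out_of_n_reliability(k, n, r):
    # Probability that at least k of n identical, independent units
    # (each with reliability r) are working.
    return sum(comb(n, x) * r**x * (1 - r)**(n - x) for x in range(k, n + 1))

print(k_out_of_n_reliability(2, 3, 0.9))  # 0.972
```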

3.3. Distributions
Reliability distributions are at the heart of a reliability model. They represent the
fundamental relationship between the reliability metric of interest (probability of failure,
failure rate, etc.) and the independent variable (TTF, cycles to failure, etc.). This
independent variable is called the life unit. Table 3.3-1 summarizes probability
distributions often used in reliability modeling, along with a description of their primary
uses.

Table 3.3-1: Probability Distributions Applicable to Reliability Engineering

Binomial (Discrete): Used to find the probability of x events occurring in a total of n trials, e.g., the number of failures in a sequence of a specified number of equal-length time intervals

Poisson (Discrete): Used to model the probability of a specified number of events occurring in a specified time interval

Exponential (Continuous): Used to describe the distribution of the time to failure when the failure rate is constant

Gamma (Continuous): Used to determine the distribution of the time by which a specified number of failures will occur when the failure rate is constant

Normal (Continuous): Used to describe the statistical mean of a sample taken from any population with a finite mean and variance. Often used to model parameter distributions. Rarely used for time to failure distributions.

Standard Normal (Continuous): The Standard Normal distribution (Z) is derived from the Normal for ease of analysis and interpretation (mean = 0; standard deviation = 1)

Lognormal (Continuous): Used to model many wearout failure causes

Weibull (Continuous): Used to describe the distribution of failures representing constant (i.e., exponential), increasing, or decreasing failure rates, depending on the value of the slope parameter (β). Increasingly popular due to its versatility. Applicable only when no repair is performed following failure.

Student t (Continuous): Used to test for statistical significance of the difference between the means of two samples

F Distribution (Continuous): Used to test for statistical significance of differences between the variances of two samples

Chi-Square (Continuous): A special case of the Gamma distribution, used to estimate confidence intervals around reliability test data, and to test whether measured data reflects a constant failure rate

The following section discusses several of the distributions used in reliability assessment. While the intent of this book is not to cover the statistical aspects of distributions, some fundamental concepts are critical to understanding the basis for certain techniques pertaining to reliability assessment, namely confidence level calculations and demonstrating reliability levels. In particular, the binomial and Poisson distributions are critical for these purposes.
The binomial distribution is used when there are only two outcomes, such as success or failure, and the probability remains the same for all trials. The probability density function (pdf) of the binomial distribution is:

f(x) = C(n, x) p^x q^(n-x)

where C(n, x) is the binomial coefficient:

C(n, x) = n! / ((n - x)! x!)

and q = 1 - p.

The function f(x) is the probability of obtaining exactly x good items and (n - x) bad items in a sample of n items, where p is the probability of obtaining a good item (success) and q (or 1 - p) is the probability of obtaining a bad item (failure).

The CDF, i.e., the probability of obtaining r or fewer successes in n trials, is given by:

F(x; r) = Σ_{x=0}^{r} C(n, x) p^x q^(n-x) = Σ_{x=0}^{r} [n! / ((n - x)! x!)] p^x q^(n-x)

The Poisson distribution is an extension of the binomial distribution for the case where n approaches infinity. In fact, it is used to approximate the binomial distribution when n ≥ 20 and p ≤ 0.05.
If events are Poisson-distributed, they occur at a constant average rate and the number of
events occurring in any given time interval is independent of the number of events
occurring in any other time interval. Since the TTF distribution for this situation is the
exponential (i.e., constant failure rate), the Poisson distribution will predict the number of
failures for specific values of time and failure rates. The number of failures in a given
time would be given by:


f(x) = (a^x e^(-a)) / x!

where x is the actual number of failures and a is the expected number of failures. Since the expected number of failures (i.e., the expected value) for the exponential distribution is λt, the Poisson expression becomes:

f(x) = ((λt)^x e^(-λt)) / x!

where:

λ = failure rate
t = length of time being considered
x = number of failures

The reliability function, R(t), or the probability of zero failures in time t, is given by:

R(t) = ((λt)^0 e^(-λt)) / 0! = e^(-λt)

This is the reliability function for the exponential distribution.


There are many cases where the probability of experiencing a given number of failures (r) or fewer is required. Examples are reliability demonstration, test planning, etc. For these cases, the CDF is used:

R(x) = Σ_{x=0}^{r} ((λt)^x e^(-λt)) / x!

A summary of the distributions most commonly used in reliability engineering is presented in Figures 3.3-1 and 3.3-2, for discrete and continuous distributions, respectively.


Figure 3.3-1: Shapes of Failure Density and Reliability Functions of Commonly Used Discrete Distributions (from MIL-HDBK-338B)

Figure 3.3-2: Shapes of Failure Density, Reliability and Hazard Rate Functions for Commonly Used Continuous Distributions (from MIL-HDBK-338B)

Continuous distributions are used when analyzing time-to-failure data, since time to failure is a continuous variable. The most common distributions used in reliability modeling to describe time-to-failure characteristics are the exponential, Weibull and lognormal distributions. These are described in more detail in the following sections.
3.3.1. Exponential

The exponential distribution is most commonly applied in reliability to describe the times
to failure for repairable items. For non-repairable items, the Weibull distribution is
popular due to its flexibility. In general, the exponential distribution has numerous
applications in statistics, especially in reliability and queuing theory.
The exponential distribution describes products whose failure rates are the same
(constant) at each point in time (i.e., the flat portion of the reliability bathtub curve,
where failures occur randomly, by chance). This is also called a Poisson process. This
means that if an item has survived for "t" hours, the chance of it failing during the next
hour is the same as if it had just been placed in service. It is sometimes referred to as the
distribution with no memory. It is an appropriate distribution for complex systems that
are comprised of different electronic and electromechanical component types, the
individual failure rates of which may not follow an exponential distribution.
Since the exponential distribution is relatively easy to fit to data, it can be misapplied to
data sets that would be better described using a more complex distribution.
Table 3.3-2 lists the parameters for the exponential distribution: the probability density
function (pdf), the cumulative distribution function (CDF), the mean, the variance, and
the standard deviation. Another useful parameter of continuous distributions is the 100·pth percentile of a population, i.e., the age by which a given portion of the population has failed.
The 50% point is the median life. The mean of the exponential distribution is equal to the
63rd percentile. Thus, if an item with a 1000 hour MTBF had to operate continuously for
1000 hours, there would only be a 0.37 probability of success.
As an example, consider a software system with a failure rate () of 0.0025 failures per
processor hour. Its corresponding mean time between failure (MTBF) is calculated as:

MTBF = 1/λ = 1/0.0025 = 400 processor hours


Table 3.3-2: Exponential Distribution Parameters

Parameter                           Based on failure rate (λ)        Based on MTBF (θ)
Probability Density Function        f(t) = λ e^(-λt), t > 0          f(t) = (1/θ) e^(-t/θ), t > 0
Cumulative Distribution Function    F(t) = 1 - e^(-λt), t > 0        F(t) = 1 - e^(-t/θ), t > 0
Mean                                1/λ                              θ
Variance                            σ² = 1/λ²                        σ² = θ²
Standard Deviation                  σ = 1/λ                          σ = θ
100·pth Percentile                  y_P = -(1/λ) ln(1 - P)           y_P = -θ ln(1 - P)
Reliability Function                R(t) = e^(-λt)                   R(t) = e^(-t/θ)

The reliability function (i.e., the probability, or population fraction, that survives beyond age t) at 100 and 1000 processor hours is:

R(100) = e^(-(0.0025)(100)) = 0.7788 = 77.88%

R(1000) = e^(-(0.0025)(1000)) = 0.0821 = 8.21%

which can be seen to be R(t) = 1 - F(t).
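A minimal Python sketch of these exponential-distribution calculations:

```python
from math import exp

lam = 0.0025              # failure rate, failures per processor hour
mtbf = 1.0 / lam          # mean time between failure: 400 processor hours

def reliability(t):
    # Exponential reliability function R(t) = e^(-lambda * t)
    return exp(-lam * t)

print(mtbf)               # 400.0
print(reliability(100))   # ~0.7788
print(reliability(1000))  # ~0.0821
```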


3.3.2. Weibull

The Weibull distribution is important in reliability modeling since it represents a general


distribution which can model a wide range of life characteristics. It can accommodate
increasing, decreasing and constant failure rates. Weibull analysis assumes that there has
been no repair of failed items and is often used to model single failure causes. The basic
features of the Weibull are:

- The shape parameter, β, describes the shape of the pdf
- The scale (or characteristic life) parameter, η, is the value at which the 63rd percentile of the distribution occurs
- The location parameter, γ (or gamma), is only used in the three-parameter version of the Weibull distribution, and is the value that represents the failure-free period for the item. If an item does not have a period where the probability of failure is zero, then γ = 0 and the Weibull distribution becomes a two-parameter distribution. This third parameter is used when there are threshold effects.
- β, η and γ can easily be estimated using Weibull probability paper or available Weibull software programs
- A multi-mode version of the Weibull distribution can be used to determine the points on the bathtub curve where the failure rate is changing from decreasing, to constant, to increasing

There are two general versions of the Weibull distribution: the two-parameter Weibull and the three-parameter Weibull. The two-parameter Weibull uses a shape parameter that reflects the tendency of the failure rate (increasing, decreasing, or constant) and a scale parameter that reflects the characteristic life of the items being measured (the time by which 63.2% of the population will have failed). The three-parameter Weibull adds a location parameter used to represent the minimum life of the population (e.g., a failure mode that does not immediately cause system failure at time zero, such as a software algorithm whose degrading calculation accuracy does not cause system failure until four calls to the algorithm have been made). Note that in most cases, the location parameter is set to zero (failures assumed to start at time zero) and the Weibull distribution reverts to the two-parameter case. The three-parameter Weibull distribution is also commonly used to characterize strength distributions (i.e., when using a stress/strength model), where the γ-value represents a screen value, or proof test, in which case this value of stress is applied to the item as a screen. It is also used to model failure causes that are not initiated until a time equal to the gamma value has passed.
As with the gamma distribution, the definition of Weibull parameters is inconsistent
throughout the literature. Table 3.3-3 illustrates how some sources define these
parameters.


Table 3.3-3: Confusing Terminology of the Weibull Distribution

Reference                                                                  Weibull Form
Montgomery, D.C., Introduction to Statistical Quality Control,
  2nd Edition, John Wiley & Sons, 1991                                     3-P
Musa, J.D.; Iannino, A.; and Okumoto, K.; Software Reliability:
  Measurement, Prediction, Application, McGraw-Hill, May 1987              2-P
Nelson, W., Applied Life Data Analysis, John Wiley & Sons, 1982            2-P
MIL-HDBK-338, Section 5.3.6                                                3-P
This book                                                                  2-P

For much life data, the Weibull distribution is more suitable than the exponential, normal and extreme value distributions, so it should be the distribution of first resort. The characteristics of various shape parameters are summarized below:

- For shape parameter β < 1.0, the Weibull pdf takes the form of the gamma distribution (see Section 3.7.1.4) with a decreasing failure rate (i.e., infant mortality)
- For shape parameter β = 1.0, the failure rate is constant, so the Weibull pdf takes the form of the simple exponential distribution with failure rate parameter λ (the flat part of the reliability bathtub)
- For shape parameter β = 2.0, the Weibull pdf takes the form of the Rayleigh distribution, with a failure rate that is linearly increasing with time (i.e., wearout). This is often used to model software reliability.
- For 3 < β < 4, the Weibull pdf approximately takes the form of the Normal distribution
- For β > 10, the Weibull distribution is close to the shape of the smallest extreme value distribution

The basic parameters of the two-parameter Weibull distribution are presented in Table 3.3-4. To have the mathematical expressions reflect a three-parameter Weibull, replace all values of x with (x - x0), where x0 represents the γ value as described above.

Table 3.3-4: Weibull Distribution Parameters

Parameter                           Mathematical Expression
Probability Density Function        f(x) = (β/η)(x/η)^(β-1) e^(-(x/η)^β), x > 0
Cumulative Distribution Function    F(x) = 1 - e^(-(x/η)^β)
Shape Parameter                     β
Scale Parameter                     η
Failure Rate                        λ(x) = (β/η)(x/η)^(β-1), x > 0
Mean                                μ = η Γ(1 + 1/β)
Variance                            σ² = η² [Γ(1 + 2/β) - Γ²(1 + 1/β)]
Standard Deviation                  σ = η [Γ(1 + 2/β) - Γ²(1 + 1/β)]^0.5
100·Pth Percentile                  y_P = η [-ln(1 - P)]^(1/β)
Reliability                         R(x) = e^(-(x/η)^β)

where Γ is the gamma function.

Figure 3.3-3 provides a graphical example of the Weibull distribution pdf with a characteristic life of 1000 hours for a variety of shape parameters (β). Figures 3.3-4 and 3.3-5 illustrate the hazard rate and probability plot, respectively, for the same values of the shape parameter.


Figure 3.3-3: Example pdf Plots for the Weibull Distribution

Figure 3.3-4: Example Hazard Rate Plots for the Weibull Distribution

Figure 3.3-5: Example Probability Plots for Weibull Distribution

As an example, consider that very early in the system integration phase of a large software development effort, there have been numerous failures due to software that have caused the system to crash (the predominant system failure cause). Plotting the failure times of this specific failure mode (other failure modes are ignored for now) on Weibull probability paper resulted in a shape parameter value of 0.77 and a scale parameter value of approximately 32 hours. Based on these parameters, the calculated failure rate and reliability of the software at 10 system hours are expected to be:

λ(10) = (0.77/32)(10/32)^(0.77-1) = 0.0314 failures per hour

R(10) = e^(-(10/32)^0.77) = 0.6647
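A minimal Python sketch reproducing these Weibull calculations:

```python
from math import exp

beta, eta = 0.77, 32.0    # shape and scale from the Weibull probability plot

def hazard(t):
    # Weibull failure (hazard) rate: (beta/eta) * (t/eta)**(beta - 1)
    return (beta / eta) * (t / eta) ** (beta - 1.0)

def reliability(t):
    # Weibull reliability: R(t) = exp(-(t/eta)**beta)
    return exp(-((t / eta) ** beta))

print(hazard(10.0))       # ~0.0314 failures per hour
print(reliability(10.0))  # ~0.6647
```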


3.3.3. Lognormal

The lognormal distribution is the distribution of a random variable whose natural logarithm is distributed normally; in other words, it is the normal distribution with ln(t) as the independent variable. The probability density function is:

f(t) = (1 / (σt√(2π))) exp[-(1/2)((ln(t) - μ)/σ)²]

The mean is:

e^(μ + σ²/2)

and the standard deviation is:

[(e^(σ²) - 1) e^(2μ + σ²)]^(1/2)
where μ and σ are the mean and standard deviation (SD), respectively, of ln(t).
The lognormal distribution is used in the reliability analysis of semiconductors and the
fatigue life of certain types of mechanical components. This distribution is also
commonly used in maintainability analysis.
The CDF for the lognormal distribution is:

F(t) = ∫_0^t (1 / (στ√(2π))) exp[-(1/2)((ln(τ) - μ)/σ)²] dτ

This can be related to the Standard Normal variate Z by:

F(t) = P[Z ≤ (ln(t) - μ)/σ]

The reliability function is 1 - F(t), or:

R(t) = P[Z > (ln(t) - μ)/σ]

The hazard function, h(t), is given as follows:

h(t) = f(t)/R(t) = φ((ln(t) - μ)/σ) / (σ t R(t))

where φ is the standard normal probability density function, and μ and σ are the mean and standard deviation of the natural logarithm of the random variable, t.
Figures 3.3-6 through 3.3-8 illustrate the lognormal distribution for a mean value of 1000
and standard deviations of 0.1, 1 and 3. Shown are the pdf, the hazard rate, and the
cumulative unreliability function, F(t), respectively.
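A small Python sketch of the lognormal reliability function, using the standard normal CDF expressed through the error function; the μ and σ values in the usage line are hypothetical:

```python
from math import erf, log, sqrt

def lognormal_reliability(t, mu, sigma):
    # R(t) = P[Z > (ln(t) - mu)/sigma], where mu and sigma are the mean
    # and standard deviation of ln(t)
    z = (log(t) - mu) / sigma
    std_normal_cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))
    return 1.0 - std_normal_cdf

# Hypothetical parameters for illustration: median life of 1000 hours
print(lognormal_reliability(500.0, mu=log(1000.0), sigma=1.0))
```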

Figure 3.3-6: Example pdf Plots for the Lognormal Distribution


Figure 3.3-7: Example Hazard Rate Plots for the Lognormal Distribution

Figure 3.3-8: Example Probability Plots for the Lognormal Distribution



3.4. References
1. Lyu, M.R. (Editor), Handbook of Software Reliability Engineering, McGraw-Hill, April 1996, ISBN 0070394008
2. Musa, J.D., Software Reliability Engineering: More Reliable Software, Faster Development and Testing, McGraw-Hill, July 1998, ISBN 0079132715
3. Nelson, W., Applied Life Data Analysis, John Wiley & Sons, 1982, ISBN 0471094587
4. Musa, J.D.; Iannino, A.; and Okumoto, K.; Software Reliability: Measurement, Prediction, Application, McGraw-Hill, May 1987, ISBN 007044093X
5. Montgomery, D.C., Introduction to Statistical Quality Control, 2nd Edition, John Wiley & Sons, 1991, ISBN 047151988X
6. Shooman, M., Probabilistic Reliability: An Engineering Approach, McGraw-Hill, 1968
7. Abernethy, R.B., The New Weibull Handbook, Gulf Publishing Co., 1994


4. DOE-Based Approaches to Reliability Modeling

The use of Design of Experiments (DOE) principles is critical to reliability modeling, particularly as it pertains to designing reliability tests from which life models will be derived. As such, it is treated as a separate topic in this book.

The basic tenet of DOE is that one or more of a product's or system's responses is observed as a function of the pertinent factors that may affect that response, as illustrated in Figure 4.0-1.

Figure 4.0-1: The DOE Concept


At the heart of this technique is the product/system or process under analysis. This is the feature whose behavior we want to quantify. The independent variables are called the factors. These represent the inputs to the product/system or process and are the things that can potentially change how the product behaves. The output of the DOE activity is the response, a measure of how well the product/system or process behaves.
The levels for each factor are varied, tests are performed, and the resulting response is
measured. The resultant data is analyzed to quantify the item or process response as a
function of the factor levels. The generic steps in applying DOE to generate life models
are:
1. Determine the product/system or process feature to be assessed

2. Determine the factors
3. Determine the factor levels
4. Design the tests
5. Perform tests and measurements
6. Analyze the data
7. Develop the life model

Each of these steps is described below.

4.1. Determine the Feature to be Assessed


The product/system or process feature to be assessed can be any characteristic of the
entity that is important to the end user or the producer. It can be related to the
performance of the entity, or it can be related to its reliability or durability. In the context
of this book, the primary features of interest are reliability and durability. The basic
premise of the DOE approach dictates that the feature to be assessed must be
quantifiable.

4.2. Determine Factors


A factor is any variable that can potentially influence the feature being analyzed. It can
be a design attribute, manufacturing attribute, process attribute, environmental stress,
operational stress, or any other influencing factor. The output of this determination is a
list of factors that will be varied in the DOE tests to be performed. A variety of tools can
be used to assist in determining the factors that are to be included in the experiments.
Some of these tools are:

- Quality Function Deployment (QFD)
- Brainstorming sessions
- Ishikawa diagram
- Design FMEA
- Process FMEA

The FMEA is treated in more detail in Chapter 8.

4.3. Determine the Factor Levels


After the factors are identified, the next step is to determine the levels of each factor that
will be used in the subsequent tests. The simplest and most common approach is the use
of two levels, one at the high end of the operating space (defined below) and the other at
the low end. However, there are risks associated with using only two levels. The main
drawback is that it cannot detect non-linearity in the relationship between the factor and
the response. For example, consider the relationship in Figure 4.3-1.

Figure 4.3-1: Possible Response-Factor Level Relationship


In this example, the levels "a" and "d" represent the operating space of the product. The conclusions will be very different depending on the levels chosen within this operating space. For example, if levels "a" and "b" are chosen, the conclusion will be that there is a strong positive relationship; if levels "b" and "d" are chosen, the conclusion will be that the factor has no effect on the response; and if "a" and "d" are chosen, which is a typical approach, the conclusion will be that there is a moderate relationship. These results are summarized in Table 4.3-1.
Table 4.3-1: Possible Conclusions for a Non-Linear Response-Factor Relationship

Levels   Conclusion
a-b      High positive relationship
c-d      High negative relationship
b-d      No relationship
a-d      Moderate positive relationship

The number of levels for each factor should be chosen, in part, based on knowledge of
the physics of the manner in which the factor affects the response. Otherwise, there can
be large uncertainty in using the resulting model to interpolate or extrapolate the response
behavior as a function of the factor. For example, if the response under analysis is

corrosion, and the relationship between the factor, temperature, and the corrosion rate is
expected to be governed by the Arrhenius relationship over the entire operating space,
then a two-level temperature test may be appropriate. If, however, it is hypothesized that
there is a temperature threshold within the operating space, then more than two levels
may be required.

4.4. Design the Tests


The next step is to design the experiment itself. The tests must be designed to determine
the specific factor level combinations to be tested, and the order in which they will be
tested. There are many things that will influence the design of the experiment, including
sample availability, the cost of running the tests, the time allotted for the tests, and test
equipment availability.
As an example of a simple experimental design, consider Figure 4.4-1.

Figure 4.4-1: DOE Terminology


In this example, there are three factors to be assessed (A, B and C), represented by the three right-hand columns. Each factor has two levels, a "+" indicating the high level and a "-" indicating the low level. This experiment has four runs, each one representing a treatment. A treatment refers to the combination of levels used in the tests.

Repetition and replication are techniques used to increase the number of runs. The
advantage of increasing the number of runs is that obtaining multiple responses with
exactly the same factor levels is valuable in quantifying the amount of variability and
error in the measurements obtained. Repetition is the practice of repeating the same run
sequentially. Replication is the practice of repeating a set of runs sequentially. Both
practices will result in multiple responses for a given set of factor levels, but the
advantage of replication over repetition is that it is better able to quantify measurement
error when there is a gradually changing parameter in the test or measurement system.
The full-factorial approach will be used as an example for illustrating the concepts of data
analysis, followed by a discussion of other approaches.
A full-factorial design, an example of which is shown in Table 4.4-1, is the most
comprehensive experimental design. It includes runs which represent all possible
combinations of factor levels. The primary drawback to the full-factorial approach is that
it requires many runs. In some cases, this may be practical, but in many cases, the cost
and time required to carry out the experiments are prohibitive.
Table 4.4-1: Full-Factorial Example

Run   A   B   C   R (response)
1     -   -   -   R1
2     +   -   -   R2
3     -   +   -   R3
4     +   +   -   R4
5     -   -   +   R5
6     +   -   +   R6
7     -   +   +   R7
8     +   +   +   R8

The number of required runs is calculated as y^x, where y is the number of levels per factor (2, for this example) and x is the number of factors (3). In Table 4.4-1, then, the number of runs is 2^3 = 8.
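A full-factorial treatment list like Table 4.4-1 can be generated mechanically; the following Python sketch enumerates the 2^3 = 8 runs:

```python
from itertools import product

factors = ["A", "B", "C"]

# Enumerate every combination of the two levels (-1 = low, +1 = high)
runs = list(product([-1, 1], repeat=len(factors)))  # 2**3 = 8 treatments

for i, levels in enumerate(runs, start=1):
    treatment = ", ".join(f"{f}={l:+d}" for f, l in zip(factors, levels))
    print(f"Run {i}: {treatment}")
```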

There are many alternatives to the full factorial approach. One-Factor-at-a-Time


experiments, illustrated in Figure 4.4-2, refer to experiments in which each run varies the
level of one factor. In this manner, the effects of each factor can be assessed by
comparing the response between the two successive runs in which the factor was varied.
This is generally a brute force way to perform experiments, and is usually very
inefficient.

Figure 4.4-2: One-Factor-at-a-Time Experiments


Fractional Factorial Orthogonal Array Experiments can be used when it is impractical to
perform a full factorial experiment. Characteristics of orthogonal experiments are as
follows:

- They use a fraction of the number of full-factorial combinations
- The treatments are chosen to provide enough information to analyze the effects of a factor using analysis of means
- Orthogonal means that the combinations of factors are balanced such that all factors are weighted equally
- Orthogonal also means that the effects of the factors can be assessed independently of one another

A full-factorial array can be scaled such that the resultant array has the characteristics of
orthogonality. These are referred to as fractional factorial arrays, since only a fraction of
the full-factorial runs are required, yet are still orthogonal. The naming convention for
these arrays is determined from:


La(y^x)

where:

a = the number of experimental runs
y = the number of levels
x = the number of factors

In the previous examples, "y" and "x" were the number of levels and the number of factors, respectively. In the standard DOE nomenclature, however, "La" refers to the number of runs. For example, a seven-factor, two-level experiment for which there will be eight runs is shown in Figure 4.4-3.

Figure 4.4-3: Standard DOE Nomenclature


Another critical element that must be considered when defining reliability tests is the potential interactions between factors. Everything discussed thus far in this section has assumed that the effects of each of the factors are independent of each other. In practice, there are often interactions between factors that must be accounted for. Graphical representations of potential interactions are shown in Figure 4.4-4. Referring to the figure, if the responses for the two levels of the B factor plotted against the two levels of the A factor are parallel, then this is an indication that there is no interaction between the two factors. This is shown on the top left. In other words, the relative magnitudes of the B response are independent of the level of A. If, however, the plots of the same factors result in the plot on the top right, then this is an indication that there is a strong interaction between factors A and B. In this example, the levels of A change the entire relationship between the B levels and the response. The plot on the bottom indicates that there is a mild interaction between the two factors.


Figure 4.4-4: Potential Interactions


If the potential interactions are not accounted for in the reliability test plan, the risk is that
the effects of the factors cannot be deconvolved (separated) from the interactions between
the factors. There are many DOE test plans and tools that assist in identifying the
capability of various plans to identify main effects and interactions.
A detailed treatment of DOE principles is beyond the scope of this book, as this has been done extensively in the literature, but it is important to understand the impact of some of these principles as they pertain to reliability testing.
Resolution is a term that describes the degree to which the main effects of factors are aliased, or confounded, with the interactions amongst factors. In general, the resolution number of a design is one more than the smallest-order interaction with which some main effects are aliased. For example, if some main effects are confounded with some two-factor interactions, the resolution number of the DOE is 3. Since full-factorial designs test the response of every possible combination of factors, there is no confounding and, therefore, they have infinite resolution. As stated previously, since the implementation of a full-factorial test is often not practical, weaker tests are often necessary. The key is to select the aliasing structure of the test such that the actual critical interactions can be deconvolved from the main effects.

To illustrate this, consider an example of a corrosion failure mechanism that is accelerated by temperature, humidity and the level of ionic contamination. A full-factorial, two-level-per-factor plan would be as shown in Table 4.4-2. The -1 and 1 designations represent the low and high levels of the factors, respectively. For this full-factorial, two-level plan, eight runs are sufficient to test all possible combinations.
Table 4.4-2: Full and Half Factorial Example for Corrosion

                               Main effects                         Interactions
                  Temperature   Humidity   Ionic contam.       T*H     T*I     H*I
                      (T)         (H)          (I)
Full-Factorial         1          -1            1              -1       1      -1
                      -1           1           -1              -1       1      -1
                       1          -1           -1              -1      -1       1
                      -1          -1            1               1      -1      -1
                      -1           1            1              -1      -1       1
                       1           1            1               1       1       1
                      -1          -1           -1               1       1       1
                       1           1           -1               1      -1      -1
Half-Factorial         1           1            1               1       1       1
(Resolution = 3)       1          -1           -1              -1      -1       1
                      -1          -1            1               1      -1      -1
                      -1           1           -1              -1       1      -1

Another possible plan would be a half factorial, also shown in Table 4.4-2. Notice that,
for the half-factorial design, the temperature-humidity (T*H) interaction (i.e., the product
of the two) is the same as for ionic contamination (I). Also, the T*I interaction is the
same as H, and the H*I interaction is the same as T. Therefore, this Resolution 3 plan is
incapable of deconvolving the main effects of T, H or I with the interactions of the other
two.
From physics, we know that both humidity and ionic contamination are required for
corrosion. Therefore, the fact that H*I is the same as T (i.e., they are confounded) is
unacceptable, since we would not be able to determine if the lifetime is governed by
temperature, or the combination of humidity and ionic contamination. Therefore, we
need a better DOE test plan. The full-factorial plan would be the best, if it could be
executed, since none of this confounding exists. For the full-factorial plan, notice that
none of the interaction terms are the same as the main effects.
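The confounding described above can be checked numerically. The following Python sketch tests the half-factorial rows of Table 4.4-2 and confirms that the T*H product column reproduces the I column exactly:

```python
# Half-factorial rows (T, H, I) from Table 4.4-2
half_factorial = [( 1,  1,  1),
                  ( 1, -1, -1),
                  (-1, -1,  1),
                  (-1,  1, -1)]

# In every run the T*H product equals I, so the T*H interaction
# is aliased (confounded) with the main effect of I.
print(all(t * h == i for t, h, i in half_factorial))  # True
```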


If we were to actually model this failure cause based on the tests defined in these plans, the general form of the reliability model may be based on the two-parameter Weibull distribution, which is:

R = e^(-(t/η)^β)

where:

R = the reliability, or probability of survival, at time t
η = the characteristic life (i.e., time to 63% failure)
β = the Weibull shape parameter

The characteristic life is then developed as a function of the applicable variables. The model in this case is:

η = e^(β0) e^(β1/T) H^(β2) I^(β3) (HI)^(β4)

where:

β0 through β4 = parameter coefficients estimated in the life modeling process
T = the temperature in degrees K (degrees C + 273)
H = the relative humidity
I = the ionic contamination
HI = the product of humidity and ionic contamination

All model parameters, β0 through β4, could be adequately quantified with the full-factorial design, but not with the half-factorial.
There are many other potential test plans that would be adequate, providing that the
required model variables can be quantified and are not confounded with one another
(Reference 1).
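As an illustration only, the following Python sketch evaluates a life model of this form; the coefficients and stress levels are entirely hypothetical (real values of β0 through β4 would come from fitting the test data):

```python
from math import exp

def characteristic_life(T, H, I, b):
    # eta = e^(b0) * e^(b1/T) * H^b2 * I^b3 * (H*I)^b4
    b0, b1, b2, b3, b4 = b
    return exp(b0) * exp(b1 / T) * (H ** b2) * (I ** b3) * ((H * I) ** b4)

def reliability(t, eta, beta):
    # Two-parameter Weibull reliability
    return exp(-((t / eta) ** beta))

# Entirely hypothetical coefficients and stress levels, for illustration
b = (2.0, 3500.0, -1.5, -0.8, -0.2)
eta = characteristic_life(T=358.0, H=0.85, I=0.10, b=b)
print(eta, reliability(1000.0, eta, beta=1.2))
```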

4.5. Perform Tests and Measurements


The next step in the process is to perform the tests. The test for each run is performed,
and the response is measured. All variables that are not factors being addressed in the
experiment must be kept as constant as possible. Make sure that all results are fully
documented. This also must include any anomalies or potential sources of error that may
have occurred. The order of the runs must be kept intact, per the experimental plan. If
repetition is used, the same run or treatment is repeated sequentially. If replication is
used, then the set of runs to be repeated have been identified in the experimental design.
For in-situ measurements, careful time stamping of the data is required. Life models to
be developed from the collected data often represent parameter degradation data and not
actual TTF data. As a result, a model of degradation rate as a function of time may be
used as the response to predict failure times. All test samples should be carefully stored,
as root-cause failure analysis may be required at some future time.

4.6. Analyze the Data


The data that is generated from the tests is then analyzed to identify the impact that each factor has on the response, and the interactions between factors. The simplest way to analyze the data and the effects of each factor is to perform an analysis of means. This can be done only if the experimental design is orthogonal. In this case, the average value of the response is calculated for each level of each factor. From the previous example, if the effects of A are to be determined, then the average of the responses when A is "+" and the average when A is "-" are calculated. Likewise, the mean of each level of each factor is calculated in the same manner.

The means can be pictorially represented, as shown in Figure 4.6-1. This is a convenient
way to illustrate the sensitivity of the response to each factor. Data analysis techniques
more sophisticated than the analysis of means shown here are also often used, and there
are many good software tools available to aid in this analysis. However, if a balanced,
orthogonal design is used, analysis of means can be very straightforward and effective.


Figure 4.6-1: Analysis of Means


In the event that it is known that the response does not behave linearly with the factor
level, the response can sometimes be linearized by making the appropriate data
transformation. For example, if the response under analysis is corrosion that is governed
by the Arrhenius relationship over the entire operating space, then the response, life in
this case, would be exponential with temperature. However, if the transformation shown
in Figure 4.6-2 is applied, the response will be linear. This is especially useful when a
goal of the analysis is to determine the activation energy.

Figure 4.6-2: Linearization of the Arrhenius Relationship



After the data has been analyzed, the optimal combination of factor levels can be
determined. The goal of this approach is to determine the factor levels that will result in
minimal variability of the product response and maximum probability of the product
meeting its requirements. This is the payoff in this approach, since it results in a more
robust design.
In this example, if the desirable response is high, then a high value of A and B with a low
value of C provides the best response, as shown in Figure 4.6-3.

Figure 4.6-3: Optimal Factor Settings

4.7. Develop the Life Model


The reliability data is then analyzed and the life model is developed. This process is discussed in detail in Chapter 5.

4.8. References
1. Fowlkes, W.Y. and Creveling, C.M., Engineering Methods for Robust Product Design: Using Taguchi Methods in Technology and Product Development, Addison-Wesley, 1995


5. Life Data Modeling

This section addresses the topic of life modeling, after the life data has been generated. Life data modeling is treated as a separate topic in this book since its principles pertain to many of the types of data previously discussed. The purpose of modeling the reliability of critical components or failure causes was previously described, and includes a variety of objectives. Life modeling is simply a means of constructing a mathematical model that predicts, assesses, or estimates the reliability of a product or system. A methodology was previously presented for developing life models from tests performed at multiple combinations of stress (DOE Multicell). From this data, a reliability model can be constructed.

If all samples are tested to failure, or have been tested in exactly the same manner, then traditional statistical analysis techniques (like regression, F-tests, t-tests, ANOVA, etc.) will generally suffice for reliability modeling purposes. However, most real-world cases include censored data, unbalanced datasets, uncertain failure times, etc. It is in these cases that life modeling techniques are most effectively used.

Life modeling requires simultaneous characterization of:

1. TTF distributions
2. Acceleration factors (which provide a relative value of the reliability parameter as a function of the stress level)

Each of these two major elements is discussed in the following sections, which present more detailed information regarding development of the models after the life data has been obtained.

5.1. Selecting a Distribution


While there is no single distribution type that must be used in a given situation, there are some rules of thumb that are helpful when selecting an appropriate distribution.
If the failure mechanism of interest is a manifestation of a positive feedback situation,
then the lognormal distribution is often applicable. These positive feedback situations are
recursive cases in which a flaw starts, the presence of the flaw results in an increased
stress level, the flaw propagates resulting in further increased stress, and so on, until
catastrophic failure occurs.


If the failure process is governed by a distribution of defects present in the product or


system at time zero, then the Weibull distribution is usually appropriate.
In cases where the failure mechanism is random in nature, the exponential is applicable.

5.2. Parameter Estimation Overview


Life modeling, using statistical concepts, involves drawing inferences from observations
of random variables, such as observed failure times. Typical inferences consist of point
and interval estimates of distribution parameters and decisions in statistical hypothesis
testing.
Parameter estimation provides a means for the effective use of data to aid in life modeling and in the estimation of constants appearing in those models. The constants that appear in distribution functions (e.g., p in the binomial distribution; λ in the Poisson distribution; μ and σ in the Normal distribution; λ or θ in the exponential distribution; and β and η in the Weibull distribution) are called parameters. The true value of the parameters of a given distribution may not be known or measurable, so it becomes more practical to obtain approximate or estimated values of these parameters from a sample of data. In the larger context, parameter estimation is typically applied to one of the following scenarios.

Point estimation is frequently used in reliability analysis to quantify parameters like the failure rate λ in the exponential distribution.
Formally, a statistic, Y, is a function of random variables that does not depend on any unknown parameter:

Y = u(X1, ..., Xn)

Let θ denote the parameter to be estimated. Consider functions w(Y) of the statistic, which might serve as point estimates of the parameter. Since w(Y) is a random variable, it has a probability distribution. Statisticians have defined certain properties for assessing the quality of estimators. These properties are defined in terms of this probability distribution.

A loss function, L[θ, w(Y)], assigns a number to the deviation between a parameter and an estimator. A typical loss function is the square of the difference, and is the value used in least squares regression:

L[θ, w(Y)] = [θ - w(Y)]²

The risk function is the expected value of the loss function:

R(θ, w) = E{L[θ, w(Y)]}

An unbiased estimator that minimizes the risk function for the above loss function is referred to as a minimum variance unbiased estimator. An estimator that minimizes this risk function uniformly in θ is called a minimum mean squared estimator. Table 5.2-1 summarizes the terms most commonly used in parameter estimation.
Table 5.2-1: Terminology Used in Parameter Estimation

Confidence Level: The theoretical percentage (or probability) of an interval estimate containing the parameter, in which the endpoints of the interval are constructed from sample data

Consistent Estimator: The estimate converges to the true value of the parameter as the sample size increases to infinity

Estimator: A function of a statistic used to estimate a parameter in a probability model

Interval Estimator: Estimates of the endpoints of an interval around a parameter

Likelihood: The probability weight for given values of parameters at observed data points

Loss Function: A function that provides a measure of the distance between a parameter value and its estimator

Maximum Likelihood Estimate: An estimate that maximizes the probability that given parameter values will occur at observed data points

Minimum Mean Squared Estimate: An estimator that uniformly minimizes the expected value of the square of the difference between a parameter and an estimator

Minimum Variance Unbiased Estimator: Of all unbiased estimators, none has a smaller variance. Sometimes called a "best" estimator

Risk Function: The mathematical expectation of the loss function

Sample Size: The number of random variables from which a statistic is calculated

Unbiased Estimator: An estimator with a mathematical expectation equal to the parameter being estimated

Table 5.2-2 includes a brief discussion of common parameter estimation techniques.


Table 5.2-2: Techniques for Parameter Estimation

Maximum Likelihood Estimation (MLE)
Discussion: In all practical cases, MLEs converge stochastically to the population value. If an MLE exists uniquely and a sufficient statistic for the parameter exists, the MLE is a function of the sufficient statistic. Sometimes the MLE is impossible to find in closed form, and numerical methods must be used (typical of time-domain software reliability models). MLEs are the best estimators for large sample sizes.
Process:
- Express the joint probability density function of the random variables of interest as a function of the unknown parameters (i.e., the likelihood function)
- Where appropriate, take the natural logarithm of the likelihood function
- Differentiate the likelihood (or log likelihood) function with respect to each parameter
- Set all derivatives equal to zero and solve for the parameters as functions of realizations of the random variables
- Check second-order conditions

Least Squares
Discussion: Least squares estimators may be better when small or medium sample sizes are involved, since they may have smaller bias, or approach normality faster. Least squares estimation minimizes the variance around the estimated parameter. The technique is familiar to those comfortable with linear regression modeling.
Process:
- Express the sum of the squared distance between actual and predicted values as a function of parameter estimates
- Determine the parameter estimators that minimize the sum of this squared distance (typically using differential calculus)

Method of Moments
Discussion: This technique works by equating statistical sample moments calculated from a data set to actual population moments. Population moments are determined by the parameters to be estimated. As many moments are equated as there are parameters to be estimated. In most cases of practical interest, these can be found in closed form, but their theoretical justification is not as rigorous as for other parameter estimation methods.
Process:
- Determine the distribution whose parameters are to be estimated (suppose there are n parameters to be estimated)
- Find the first n moments of the distribution, either around zero, or around the mean for moments higher than the first
- Equate these moments to sample moments
- Solve for the parameters as a function of the realizations of the random variables in the sample

Bayesian
Discussion: Provides an efficient method for incorporating various subjective and objective data sources into parameter estimation. It is a much less practical method than MLE, as the analysis is much more complex and the computation is much more complicated. The validity of the approach is dependent on the validity of the model and prior distributions.
Process:
- Assign a non-informative or subjective distribution to the parameters of the model (the priors). The priors express the uncertainties in the parameter values.
- Combine actual data with the priors to obtain new parameter distributions (the posteriors). The posteriors provide estimates and Bayesian confidence limits for the parameters, producing more precise estimates.

5.2.1. Closed Form Parameter Approximations

Simple equations that approximate parameters have been developed and are summarized
in Table 5.2-3, which provides an overview of the parameter estimates for commonly
used distributions.
Table 5.2-3: Parameters Typically Estimated from Statistical Distributions

Poisson
  True parameter: occurrence rate, λ
  Estimate: sample occurrence rate, λ = n/t
    n = number of observed failures
    t = period (time, length, volume) over which failures are observed

Binomial
  True parameter: proportion, p
  Estimate: sample proportion, p = x/n
    x = number of successful trials
    n = number of statistically independent sample units

Exponential
  True parameter: mean, θ
  Estimate: sample mean, x̄ = (Σ xi)/n
    xi = individual times to failure for each of the observations of sample size n
    n = number of statistically independent sample observations

Normal
  True parameter: mean, μ
  Estimate: sample mean, x̄ = (Σ xi)/n
    xi = individual measurements for each of the observations of sample size n
    n = number of statistically independent sample observations

Normal
  True parameter: variance, σ²
  Estimate: sample variance, s² = Σ(xi - x̄)²/(n - 1)
    s² = sample variance (the standard deviation, s, equals (s²)^0.5)
    xi = individual measurements for each of the observations of sample size n
    n = number of statistically independent sample observations

Weibull
  True parameter: shape parameter, β
  Estimate: β = 1.283/s, where s = [Σ(xi - x̄)²/(n - 1)]^0.5 and x̄ = (Σ xi)/n
    s = sample standard deviation
    xi = individual times to failure for each observation of sample size n
    n = number of statistically independent sample observations

Weibull
  True parameter: scale parameter, η
  Estimate: η = exp(x̄ + (0.5772)(0.7797)s)
    s = sample standard deviation
    xi = individual measurements for each observation of sample size n
    n = number of statistically independent sample observations
The parameter estimates shown in Table 5.2-3 are rather simplistic and easy to use, and
often provide adequate estimates. There are more rigorous techniques available that do a
better, more accurate job of estimating parameters, but their complexity requires the use
of software tools.
The most popular techniques used in reliability modeling are least squares regression and
maximum likelihood. These are described in the next sections.
5.2.2. Least Squares Regression

Least squares regression is often used to estimate model parameters in cases when a
function can be linearized. The following steps are required for this approach:
1. Select the distribution type
2. Linearize the distribution
3. Determine the plotting positions of each data point

4. Determine the parameters using a least squares technique


For example, if a two-parameter Weibull distribution is used (Step 1), the linear transform is performed as follows (Step 2):

R = e^(-(t/η)^β)

Taking the natural log (base e) of both sides, twice, yields:

ln(-ln(R)) = β ln(t) - β ln(η)

This is now a linear model with ln(t) as the independent variable, β as the slope, and -β ln(η) as the intercept.
Step 3 calculates the plotting position (i.e., the estimated percent of the population failed at the failure time of each data point). A common way to accomplish this is by using Bernard's formula:

F = (i - 0.3) / (N + 0.4)

where:

i = the cumulative number of failures
N = the total sample size

For example, if there are ten items, the value of F after the second failure is:

F = (i - 0.3) / (N + 0.4) = (2 - 0.3) / (10 + 0.4) = 0.163
The value of F is calculated for each failure. These pairs of x-y points are the values to
which a linear model will be fit.
The values of the slope and intercept are then:

β = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²

ln(η) = x̄ - ȳ/β   (since the fitted intercept, ȳ - βx̄, equals -β ln(η))

In this case, y = ln(-ln(R)) and x = ln(t).
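Putting Steps 1 through 4 together, a minimal Python sketch of median-rank regression for complete (uncensored) data might look as follows; the failure times in the usage line are hypothetical:

```python
from math import exp, log

def weibull_lsq(failure_times):
    # Two-parameter Weibull fit by least squares regression,
    # assuming a complete (uncensored) sample.
    t = sorted(failure_times)
    n = len(t)
    x = [log(ti) for ti in t]                        # x = ln(t)
    # Step 3: Bernard's plotting positions F = (i - 0.3)/(N + 0.4)
    y = [log(-log(1.0 - (i - 0.3) / (n + 0.4)))      # y = ln(-ln(R))
         for i in range(1, n + 1)]
    xbar, ybar = sum(x) / n, sum(y) / n
    # Step 4: least squares slope (beta) and intercept (-beta * ln(eta))
    beta = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))
    eta = exp(xbar - ybar / beta)
    return beta, eta

# Hypothetical failure times, in hours
print(weibull_lsq([105.0, 240.0, 380.0, 610.0, 1000.0]))
```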
5.2.3. Parameter Estimation Using MLE

This section addresses the use of Maximum Likelihood Estimation (MLE) techniques for estimating TTF distribution parameters, such as the parameter λ of the exponential pdf, or μ and σ of the Normal and lognormal pdfs. The objective is to find a point estimate, as well as a confidence interval, for the parameters of these distributions based on the data available from test or field observation. Quantification of confidence intervals is very important in the estimation process because there is almost always a limited amount of data (e.g., on TTFs), and, thus, we cannot state our point estimation with certainty. Therefore, the confidence interval is a statement about the range within which the actual (true) value of the parameter resides. This interval is greatly influenced by the amount of data available. Of course, other factors such as the diversity and accuracy of the data sources and the adequacy of the selected model can also influence the state of our uncertainty regarding the estimated parameters. When discussing goodness-of-fit tests, we are trying to address the uncertainty due to the choice of the probability model form by using the concept of levels of significance. However, uncertainty due to the diversity and accuracy of the data sources is a more difficult issue to deal with.
Times-to-failure data are seldom complete. A complete sample is one in which all items
observed have failed during a given observation period, and all the failure times are
known. When n items are placed on test or observed in the field, whether with
replacement or not, it is sometimes necessary (due to the long life of certain components)
to terminate the test and perform the reliability analysis based on the observed data up to
the time of termination.
There are two basic types of possible life observation termination. The first type is time-terminated (which results in Type I right-censored data), and the second is failure-terminated (resulting in Type II right-censored data). In the time-terminated life observation, n units are monitored and the observation is terminated after a predetermined time has elapsed. The number of items that failed during the observation time, and the corresponding TTF of each component, are recorded. In the failure-terminated life observation, n units are monitored and the observation is terminated
when a predetermined number of component failures have occurred. The time to failure
of each failed item, including the time that the last failure occurred, are recorded.
The MLE method is one of the most widely used methods for estimating reliability model
parameters. In the first part of this section, a brief historical review of the MLE method
is presented. The likelihood function concept for different types of failure data, as well
as the mathematical approach to solve likelihood equations, is presented next. The last
part of this section reviews the basic equations of the MLE approach for specific case
studies, including exponential, Weibull and lognormal distribution likelihood functions.
5.2.3.1. Brief Historical Remarks

The use of regression techniques has many shortcomings when it comes to reliability
modeling. In particular, it is weak when it comes to analyzing interval or censored data.
The Maximum Likelihood Estimation method was originally introduced by Fisher
(Reference 5).
Fisher used the conditional probability of occurrence for each failure event as a measure
for his mathematical curve fitting. He argued that, using a subjective assumption about
the TTF model, one can characterize the probability of each failure event, conditioned on
the model parameters. He then derived the posterior probability of failure events in a
Bayesian framework using a uniform distribution as a prior for the model parameters. He
later calculated the best estimate for model parameters by maximizing the posterior.
Note that, in a Bayesian framework, a uniform distribution cancels out from the equation
since it is a constant. The normalizing factor in the denominator is also a constant, which
has no impact when one is interested in the extremes of the function. Therefore, this
method was eventually called the maximum likelihood estimator, because it is basically
the likelihood function that is maximized in this process.
5.2.3.2. Likelihood Function

Fisher (Reference 5) based his maximum likelihood measure on an implied Bayesian
uniform prior for the parameters, and named the method as leading to the "most
probable set of values" for the parameters (Reference 9). He suggested that the ratio of
the likelihood function and its maximum may be used to find confidence intervals for the
model parameters, and derived it for the case of Normal sampling curves.
Let f(t)dt be the chance of a failure observation falling within the range dt.
Fisher introduced the method of maximum likelihood by claiming that the factor dt is
independent of the theoretical curve, and that the probability is proportional to f(t).


Therefore, the likelihood of N independent TTF observations will be proportional to
the product of the probability distribution function at the TTFs, as shown below:

L_F(t₁, t₂, ..., t_N | θ) ∝ ∏_{i=1}^{N} f(tᵢ | θ)

L_R(T₁, T₂, ..., T_M | θ) ∝ ∏_{i=1}^{M} R(Tᵢ | θ)

L_L(T₁, T₂, ..., T_K | θ) ∝ ∏_{i=1}^{K} F(Tᵢ | θ)

L_I[(T_{a1}, T_{b1}), ..., (T_{aL}, T_{bL}) | θ] ∝ ∏_{i=1}^{L} [F(T_{bi} | θ) − F(T_{ai} | θ)]

where:
θ = the vector of model parameters, (θ₁, θ₂, ..., θₙ)
N = the number of complete failure observations
M = the number of right-censored observations
K, L = the number of left-censored and interval data observations, respectively
Tᵢ = censored observations
tᵢ = complete failure observations
f(t) = probability density function
R(t) = reliability function
F(t) = cumulative distribution function
T_{ai} = the lower bound of the time interval
T_{bi} = the upper bound of the time interval
L_F = the likelihood function for complete failure data
L_R = the likelihood function for right-censored data
L_L = the likelihood function for left-censored data
L_I = the likelihood function for interval data

Using the notion of the conditional probability density function, f(t|θ), helps to integrate
many different types of failure data into the likelihood function. For example, the
likelihood of the right-censored observations will be the reliability function, because this
is the probability that the component remains reliable up to the censored time. Therefore,
the likelihood of M independent right-censored observations will be the product of the
reliability functions as illustrated in the second equation above. For left-censored times
(that is, the time before which a failure has occurred) the likelihood is also the definition
of probability of failure at that time. In the case of many left-censored times, the total
likelihood will be the multiplication of the likelihood values of individual components
using the independence assumption, as shown in the third equation above.
For interval times, the likelihood is the probability of having one failure in that interval
(which is basically the integral of the probability density function between the upper and
lower bounds of the interval). This is simply the difference between the cumulative
distribution function evaluated at the upper and lower bounds, respectively, as
shown in the final equation above. Assuming the independence of
failure or censored time events, these likelihood functions can be multiplied with the
likelihood of the complete failure data in order to build the likelihood function for the
entire population.
5.2.3.3. Maximum Likelihood Estimator (MLE)

The likelihood function is used differently in the Bayesian and MLE frameworks. In the
Bayesian method, the prior knowledge that is available for the model parameters is
updated using this function as the conditional likelihood of the data. In the MLE approach,
the most probable set of values of the parameter vector, θ, is estimated by maximizing this
likelihood as a standalone function.
The practical way to find the modes of the likelihood function is differentiation. A
multivariable function has its maximum value at a point at which the first-order partial
derivative of the function with respect to each variable becomes zero, as shown below:

∂Λ/∂θ₁ = 0
∂Λ/∂θ₂ = 0
...
∂Λ/∂θₙ = 0

where:
Λ = ln(L), the log-likelihood function
Θ̂ = (θ̂₁, θ̂₂, ..., θ̂ₙ), the best estimate parameter vector

Note that the likelihood function, as explained before, is built as a product of terms.
This makes the differentiation process very cumbersome. The likelihood is always positive, so
one may take the natural logarithm of this function to convert these multiplication
operators to summation. This will significantly reduce the mathematical derivation
complexity, while still providing the same best estimates for the mode of the likelihood
function. Constructing the likelihood function, L, as explained before, one may set up the
equations that need to be solved for the modes of this function.
In the following sections, three examples of the likelihood function, representing the
exponential, Weibull and lognormal distributions, are presented for further clarification.
5.2.3.3.1. Exponential Distribution

If failures are expected to randomly occur at a constant rate in time, the TTF distribution
follows an exponential distribution. The exponential distribution assumes a constant
hazard rate for the item. This constant hazard rate is the only parameter of the
exponential distribution. The likelihood of complete failure and right-censored data, as
explained in previous sections, can be represented based on the probability density and
the cumulative distribution functions of the exponential distribution. The equation below
shows the log-likelihood function in the case of F complete (i.e., failed) and S right-censored
(i.e., survived, or suspended) observations:

Λ = Σ_{i=1}^{F} Nᵢ·ln(λ·e^(−λtᵢ)) − Σ_{j=1}^{S} Nⱼ·λ·Tⱼ

The only variable in this equation is λ. In the MLE method, the best estimate of λ is
evaluated by maximizing the likelihood (or log-likelihood) function. The next equation
shows the criterion for the best estimate of λ. The uncertainty of the calculation
can be illustrated as confidence bounds on λ, which are calculated using the
corresponding local Fisher information matrix. This step will be explained in detail later.

∂Λ/∂λ = Σ_{i=1}^{F} Nᵢ·(1/λ − tᵢ) − Σ_{j=1}^{S} Nⱼ·Tⱼ = 0
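Solving this equation for λ gives the familiar closed form: the total number of failures divided by the total accumulated operating time. A minimal sketch, with hypothetical failure and suspension times:

    import numpy as np

    # Hypothetical data: failure times and right-censored (suspension) times, hours
    failure_times = np.array([120., 340., 560., 910., 1450.])
    suspension_times = np.array([1500., 1500., 1500.])

    # Closed-form MLE: lambda_hat = (number of failures) / (total time on test)
    total_time = failure_times.sum() + suspension_times.sum()
    lambda_hat = len(failure_times) / total_time

    print(f"lambda = {lambda_hat:.2e} failures/hour, MTTF = {1.0 / lambda_hat:.0f} hours")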

5.2.3.3.2. Weibull Distribution

The Weibull distribution can be used for non-repairable hardware units exhibiting
increasing, decreasing, or constant hazard rate functions. Similar to the lognormal
distribution, it is a two-parameter distribution and its estimation, even in the case of
complete (uncensored) data, is not a trivial problem. It can be easily shown that, in the
situation where all r units out of n observed units fail, the log-likelihood estimates of
the Weibull distribution are represented by the equation below:
Λ = Σ_{i=1}^{F} Nᵢ·ln[(β/η)·(tᵢ/η)^(β−1)·e^(−(tᵢ/η)^β)] − Σ_{j=1}^{S} Nⱼ·(Tⱼ/η)^β
The best estimates of the parameters β and η are made using the first derivatives of the
log-likelihood function, as shown in the next two equations. The best estimates will be
the unique solution of the set of two equations in two unknowns:

∂Λ/∂β = (1/β)·Σ_{i=1}^{F} Nᵢ + Σ_{i=1}^{F} Nᵢ·ln(tᵢ/η) − Σ_{i=1}^{F} Nᵢ·(tᵢ/η)^β·ln(tᵢ/η) − Σ_{j=1}^{S} Nⱼ·(Tⱼ/η)^β·ln(Tⱼ/η) = 0

∂Λ/∂η = −(β/η)·Σ_{i=1}^{F} Nᵢ + (β/η)·[Σ_{i=1}^{F} Nᵢ·(tᵢ/η)^β + Σ_{j=1}^{S} Nⱼ·(Tⱼ/η)^β] = 0

Note that, despite the complexity of the mathematical representations of the likelihood
and log-likelihood functions and their derivatives, the basic concept is fairly simple. In
advanced numerical approaches using computers, the entire mathematical derivation is
done through numerical routines using predefined toolboxes and library functions.
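As an illustration of such a numerical approach, the following sketch maximizes the Weibull log-likelihood for combined complete and right-censored data using a general-purpose optimizer (SciPy is assumed to be available; the data are hypothetical):

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical data: failures and right-censored suspensions (hours)
    t_fail = np.array([105., 210., 280., 350., 490., 610.])
    t_susp = np.array([700., 700.])

    def neg_log_likelihood(params):
        """Negative Weibull log-likelihood for complete + right-censored data."""
        beta, eta = params
        if beta <= 0 or eta <= 0:
            return np.inf
        # ln f(t) for failures: ln(beta/eta) + (beta-1)*ln(t/eta) - (t/eta)^beta
        ll = np.sum(np.log(beta / eta) + (beta - 1) * np.log(t_fail / eta)
                    - (t_fail / eta) ** beta)
        # ln R(T) for suspensions: -(T/eta)^beta
        ll -= np.sum((t_susp / eta) ** beta)
        return -ll

    result = minimize(neg_log_likelihood, x0=[1.0, np.mean(t_fail)],
                      method="Nelder-Mead")
    beta_hat, eta_hat = result.x
    print(f"beta = {beta_hat:.3f}, eta = {eta_hat:.1f} hours")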
5.2.3.3.3. Lognormal Distribution

In the case of estimating the parameters of the lognormal distribution, the only difference
is in the construction of the likelihood function for which the pdf and CDF of the
distribution are used for complete and suspended failure data, respectively. The equation
below shows the log-likelihood function for a combination of complete failure and
suspended (right-censored) data.
Λ = Σ_{i=1}^{F} Nᵢ·ln[(1/(σ·tᵢ))·φ((ln(tᵢ) − μ)/σ)] + Σ_{j=1}^{S} Nⱼ·ln[1 − Φ((ln(Tⱼ) − μ)/σ)]

Having the log-likelihood function of failure data, the MLE approach can be executed
using the first derivative approach, as explained in previous sections. The first derivative
of the log-likelihood function with respect to the mean and standard deviation are
illustrated in the following two equations:

∂Λ/∂μ = (1/σ²)·Σ_{i=1}^{F} Nᵢ·(ln(tᵢ) − μ) + (1/σ)·Σ_{j=1}^{S} Nⱼ·[φ(zⱼ)/(1 − Φ(zⱼ))] = 0

∂Λ/∂σ = Σ_{i=1}^{F} Nᵢ·[(ln(tᵢ) − μ)²/σ³ − 1/σ] + (1/σ)·Σ_{j=1}^{S} Nⱼ·[zⱼ·φ(zⱼ)/(1 − Φ(zⱼ))] = 0

where:
zⱼ = (ln(Tⱼ) − μ)/σ
φ(x) = (1/√(2π))·e^(−x²/2)
Φ(x) = ∫_{−∞}^{x} (1/√(2π))·e^(−u²/2) du

The capital Φ in the above equations is the cumulative standard Normal distribution
function, which is defined as the integral of the lowercase φ (i.e., the standard Normal
pdf). The derivative of the CDF is always the pdf, since the derivative operator cancels
out the integration.
5.2.4. Confidence Bounds and Uncertainty

Since point estimates are constructed from data that exhibits random variation, these
estimates will not be exactly equal to the unknown population parameters. Confidence
bounds provide a convention for making statements about the random variation in the
estimates of parameters.
5.2.4.1. Confidence Bounds with MLE

The variance and covariance of the parameters, calculated using MLE equations, can be
found using the local Fisher information matrix. Fisher assumed a Normal distribution
for the parameters when deriving these equations. Using the following local information
matrix, one can relate the likelihood function to the variance and covariance of the model
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
198

Chapter 5: Life Data Modeling

parameters. The next equation represents these uncertainties for a general case for which
there are n parameters in the likelihood function.

[ Var(θ̂₁)       Cov(θ̂₁,θ̂₂)   ...   Cov(θ̂₁,θ̂ₙ) ]
[ Cov(θ̂₂,θ̂₁)   Var(θ̂₂)       ...   Cov(θ̂₂,θ̂ₙ) ]   =  [F]⁻¹
[ ...                                             ]
[ Cov(θ̂ₙ,θ̂₁)   ...    Cov(θ̂ₙ,θ̂ₙ₋₁)   Var(θ̂ₙ)  ]
where:
Var = variance of the parameter of interest
Cov = covariance of the two parameters
Λ = the log-likelihood function
F = the local Fisher information matrix, as defined below:

F = [ −∂²Λ/∂θ₁²       −∂²Λ/∂θ₁∂θ₂    ...   −∂²Λ/∂θ₁∂θₙ ]
    [ −∂²Λ/∂θ₂∂θ₁    −∂²Λ/∂θ₂²       ...   −∂²Λ/∂θ₂∂θₙ ]
    [ ...                                               ]
    [ −∂²Λ/∂θₙ∂θ₁    ...    −∂²Λ/∂θₙ∂θₙ₋₁   −∂²Λ/∂θₙ²  ]

Having the variance and the best estimate of each parameter, one may estimate the
uncertainty bounds for any given confidence level. Note that the important underlying
assumptions here are independence, as well as a Normal distribution, for all parameters.
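For the single-parameter exponential case, the Fisher information reduces to a scalar and the bounds can be computed directly. A minimal sketch, continuing the hypothetical exponential data used earlier:

    import numpy as np

    # For the exponential distribution, the observed information at the MLE is
    # -d2(Lambda)/d(lambda)2 = F_count / lambda^2, so Var(lambda_hat) = lambda^2 / F_count.
    failure_times = np.array([120., 340., 560., 910., 1450.])
    suspension_times = np.array([1500., 1500., 1500.])

    lam = len(failure_times) / (failure_times.sum() + suspension_times.sum())
    var_lam = lam ** 2 / len(failure_times)

    # Normal-approximation two-sided 90% bounds on lambda
    z = 1.645
    lower, upper = lam - z * np.sqrt(var_lam), lam + z * np.sqrt(var_lam)
    print(f"lambda = {lam:.2e}  90% bounds: ({lower:.2e}, {upper:.2e})")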
5.2.4.2. Confidence Bounds Approximations

Tables 5.2-4 through 5.2-11 present a summary of equations for calculating the
confidence bounds around the parameters for various distributions.
Table 5.2-4: Confidence Bounds for the Poisson Distribution

In this table and those that follow, γ is the confidence level, χ²[p; ν] is the p-th percentile of the chi-square distribution with ν degrees of freedom, F[p; ν₁; ν₂] is the p-th percentile of the F distribution, and z_p is the p-th percentile of the standard Normal distribution.

Parameter: True occurrence rate, λ

Given: The estimate for the true occurrence rate, λ, is the sample occurrence rate:

λ̂ = n/t

where:
n = number of observed failures
t = period (time, length, volume) over which failures are observed

Poisson limits (approximate only; exact confidence levels cannot be conveniently obtained for discrete distributions):

One-sided:  λ_L = 0.5·χ²[(1 − γ); 2n]/t          λ_U = 0.5·χ²[γ; (2n + 2)]/t
Two-sided:  λ_L = 0.5·χ²[(1 − γ)/2; 2n]/t        λ_U = 0.5·χ²[(1 + γ)/2; (2n + 2)]/t

Normal approximation (when n is large, say > 10):

One-sided:  λ_L ≈ λ̂ − z_γ·(λ̂/t)^0.5             λ_U ≈ λ̂ + z_γ·(λ̂/t)^0.5
Two-sided:  λ_L ≈ λ̂ − z_[(1+γ)/2]·(λ̂/t)^0.5     λ_U ≈ λ̂ + z_[(1+γ)/2]·(λ̂/t)^0.5

Parameter: Future occurrence rate, y

Given: Given the observed rate of occurrence above, the prediction for the future rate of occurrence is:

ŷ = λ̂·s = (n/t)·s

where:
n, t = as defined above
s = period (time, length, volume) over which the future observation is predicted

Poisson limits (approximate only); closest integer solutions for y_L and y_U from the following equations:

One-sided:  y_U/(n + 1) = (s/t)·F[γ; (2n + 2); 2y_U]            n/(y_L + 1) = (t/s)·F[γ; (2y_L + 2); 2n]
Two-sided:  y_U/(n + 1) = (s/t)·F[(1 + γ)/2; (2n + 2); 2y_U]    n/(y_L + 1) = (t/s)·F[(1 + γ)/2; (2y_L + 2); 2n]

Normal approximation (when n and y are large, e.g., each > 10):

One-sided:  y_L ≈ ŷ − z_γ·[λ̂·s·(t + s)/t]^0.5          y_U ≈ ŷ + z_γ·[λ̂·s·(t + s)/t]^0.5
Two-sided:  y_L ≈ ŷ − z_[(1+γ)/2]·[λ̂·s·(t + s)/t]^0.5  y_U ≈ ŷ + z_[(1+γ)/2]·[λ̂·s·(t + s)/t]^0.5


Table 5.2-5: Confidence Bounds for the Binomial Distribution

Parameter: True proportion, p

Given: The estimate of the true population proportion, p, is the sample proportion:

p̂ = x/n

where:
x = number of successful trials
n = number of statistically independent sample units

Binomial limits (approximate only; exact confidence levels cannot be conveniently obtained for discrete distributions):

One-sided:
p_L = 1 / {1 + [(n − x + 1)/x]·F[γ; (2n − 2x + 2); 2x]}
p_U = 1 / {1 + (n − x)/[(x + 1)·F[γ; (2x + 2); (2n − 2x)]]}

Two-sided:
p_L = 1 / {1 + [(n − x + 1)/x]·F[(1 + γ)/2; (2n − 2x + 2); 2x]}
p_U = 1 / {1 + (n − x)/[(x + 1)·F[(1 + γ)/2; (2x + 2); (2n − 2x)]]}

Normal approximation (when x and n − x are large, e.g., each is > 10):

One-sided:  p_L ≈ p̂ − z_γ·[p̂(1 − p̂)/n]^0.5            p_U ≈ p̂ + z_γ·[p̂(1 − p̂)/n]^0.5
Two-sided:  p_L ≈ p̂ − z_[(1+γ)/2]·[p̂(1 − p̂)/n]^0.5    p_U ≈ p̂ + z_[(1+γ)/2]·[p̂(1 − p̂)/n]^0.5

Poisson approximation (when n is large and x is small, e.g., when x < n/10):

One-sided:  p_L ≈ 0.5·χ²[(1 − γ); 2x]/n          p_U ≈ 0.5·χ²[γ; (2x + 2)]/n
Two-sided:  p_L ≈ 0.5·χ²[(1 − γ)/2; 2x]/n        p_U ≈ 0.5·χ²[(1 + γ)/2; (2x + 2)]/n

Parameter: Prediction of the future number of category units, y

Given: Given the observed probability above, the prediction for the number of y future category units is:

ŷ = m·p̂ = m·(x/n)

where:
x, n = as defined above
m = future sample size

Normal approximation (when x, n − x, y and m − y are all large, say > 10):

One-sided:  y_L ≈ ŷ − z_γ·[m·p̂(1 − p̂)(m + n)/n]^0.5            y_U ≈ ŷ + z_γ·[m·p̂(1 − p̂)(m + n)/n]^0.5
Two-sided:  y_L ≈ ŷ − z_[(1+γ)/2]·[m·p̂(1 − p̂)(m + n)/n]^0.5    y_U ≈ ŷ + z_[(1+γ)/2]·[m·p̂(1 − p̂)(m + n)/n]^0.5

Poisson approximation (when n is large and x is small, e.g., when x < n/10); closest integer solutions for y_L and y_U from the following equations:

One-sided:  y_U/(x + 1) = (m/n)·F[γ; (2x + 2); 2y_U]            x/(y_L + 1) = (n/m)·F[γ; (2y_L + 2); 2x]
Two-sided:  y_U/(x + 1) = (m/n)·F[(1 + γ)/2; (2x + 2); 2y_U]    x/(y_L + 1) = (n/m)·F[(1 + γ)/2; (2y_L + 2); 2x]


Table 5.2-6: Confidence Bounds for the Exponential Distribution

Parameter: True value of the mean, θ

Given: The estimate of the true population mean, θ, is the sample mean:

θ̂ = x̄ = (1/n)·Σ_{i=1}^{n} xᵢ

where:
xᵢ = individual times to failure for each of the observations of sample size n
n = number of statistically independent sample observations

Exponential limits (exact) for failure-truncated tests:

One-sided:  θ_L = 2n·x̄/χ²[γ; 2n]                θ_U = 2n·x̄/χ²[(1 − γ); 2n]
Two-sided:  θ_L = 2n·x̄/χ²[(1 + γ)/2; 2n]        θ_U = 2n·x̄/χ²[(1 − γ)/2; 2n]

Exponential limits (exact) for time-truncated tests:

One-sided:  θ_L = 2n·x̄/χ²[γ; 2(n + 1)]            θ_U = 2n·x̄/χ²[(1 − γ); 2(n + 1)]
Two-sided:  θ_L = 2n·x̄/χ²[(1 + γ)/2; 2(n + 1)]    θ_U = 2n·x̄/χ²[(1 − γ)/2; 2(n + 1)]

Normal approximation for failure-truncated tests (when n is large, say > 15):

One-sided:  θ_L ≈ x̄·exp(−z_γ/n^0.5)              θ_U ≈ x̄·exp(+z_γ/n^0.5)
Two-sided:  θ_L ≈ x̄·exp(−z_[(1+γ)/2]/n^0.5)      θ_U ≈ x̄·exp(+z_[(1+γ)/2]/n^0.5)

Parameter: True value of the failure rate, λ

Given: The estimate of the true population failure rate, λ, is the sample failure rate:

λ̂ = 1/θ̂ = n/Σ_{i=1}^{n} xᵢ

where:
θ̂ = sample mean
xᵢ = individual times to failure for each of the observations of sample size n
n = number of statistically independent sample observations

Exponential limits (exact) for failure-truncated tests:

One-sided:  λ_L = χ²[(1 − γ); 2n]/(2n·x̄) = 1/θ_U        λ_U = χ²[γ; 2n]/(2n·x̄) = 1/θ_L
Two-sided:  λ_L = χ²[(1 − γ)/2; 2n]/(2n·x̄) = 1/θ_U      λ_U = χ²[(1 + γ)/2; 2n]/(2n·x̄) = 1/θ_L
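As a worked illustration of the exact chi-square limits in Table 5.2-6, the following sketch computes two-sided bounds on the mean of an exponential population for a hypothetical failure-truncated test (SciPy's chi-square quantile function is used for χ²[p; ν]):

    from scipy.stats import chi2

    # Hypothetical failure-truncated test: n = 8 failures, 4000 total unit-hours
    n = 8
    total_time = 4000.0            # equals n * x_bar
    theta_hat = total_time / n     # sample mean (MTTF estimate)
    gamma = 0.90                   # confidence level

    # Two-sided exact chi-square limits on the mean (Table 5.2-6)
    theta_L = 2 * total_time / chi2.ppf((1 + gamma) / 2, 2 * n)
    theta_U = 2 * total_time / chi2.ppf((1 - gamma) / 2, 2 * n)

    print(f"theta = {theta_hat:.0f} h, {gamma:.0%} bounds: ({theta_L:.0f}, {theta_U:.0f}) h")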


Table 5.2-7: Confidence Bounds for the Exponential Distribution (continued)

Parameter: True value of the 100·pth percentile, y_p

Given: The usual estimate of the 100·pth percentile, y_p, is calculated as:

ŷ_p = −x̄·ln(1 − p)

where:
p = probability at the 100·pth percentile

One-sided:
y_{p,L} = −θ_L·ln(1 − p) = −2n·x̄·ln(1 − p)/χ²[γ; 2n]
y_{p,U} = −θ_U·ln(1 − p) = −2n·x̄·ln(1 − p)/χ²[(1 − γ); 2n]

Two-sided:
y_{p,L} = −θ_L·ln(1 − p) = −2n·x̄·ln(1 − p)/χ²[(1 + γ)/2; 2n]
y_{p,U} = −θ_U·ln(1 − p) = −2n·x̄·ln(1 − p)/χ²[(1 − γ)/2; 2n]

Parameter: True value of reliability at end of period, R(t)

Given: The usual estimate of the reliability, R(t), at any age, t, is:

R*(t) = e^(−t/x̄)

where:
R = reliability as a function of time, distance, etc.
t = period at which reliability is assessed (time, distance, etc.)

One-sided:
R_L(t) = e^(−t/θ_L) = exp{−t·χ²[γ; 2n]/(2n·x̄)}
R_U(t) = e^(−t/θ_U) = exp{−t·χ²[(1 − γ); 2n]/(2n·x̄)}

Two-sided:
R_L(t) = e^(−t/θ_L) = exp{−t·χ²[(1 + γ)/2; 2n]/(2n·x̄)}
R_U(t) = e^(−t/θ_U) = exp{−t·χ²[(1 − γ)/2; 2n]/(2n·x̄)}

Table 5.2-8: Confidence Bounds for the Normal Distribution

Parameter: True value of the mean, μ

Given: The estimate of the true population mean, μ, is the sample mean:

x̄ = (1/n)·Σ_{i=1}^{n} xᵢ

where:
xᵢ = individual times to failure for each of the observations of sample size n
n = number of statistically independent sample observations

Normal limits (exact; also serve as approximate intervals for the mean of a distribution that is not Normal):

One-sided:  μ_L = x̄ − t[γ; n − 1]·s/n^0.5              μ_U = x̄ + t[γ; n − 1]·s/n^0.5
Two-sided:  μ_L = x̄ − t[(1 + γ)/2; n − 1]·s/n^0.5      μ_U = x̄ + t[(1 + γ)/2; n − 1]·s/n^0.5


Table 5.2-9: Confidence Bounds for the Normal Distribution (continued)

Parameter: True value of the variance, σ²

Given: The estimate of the true population variance, σ², is the sample variance:

s² = Σ_{i=1}^{n} (xᵢ − x̄)²/(n − 1)

where:
s² = sample variance (the standard deviation, s, equals (s²)^0.5)
xᵢ = individual measurements for each of the observations of sample size n
n = number of statistically independent sample observations

Normal limits (exact):

One-sided:
σ_L = s·[(n − 1)/χ²[γ; n − 1]]^0.5          σ_U = s·[(n − 1)/χ²[(1 − γ); n − 1]]^0.5

Two-sided:
σ_L = s·[(n − 1)/χ²[(1 + γ)/2; n − 1]]^0.5  σ_U = s·[(n − 1)/χ²[(1 − γ)/2; n − 1]]^0.5

Parameter: True value of reliability at end of period, R(t)

Given: The estimate of the reliability at any age t, R(t), is:

R*(t) = 1 − Φ(ẑ)

where:
R = reliability as a function of time, distance, etc.
t = period at which reliability is assessed (time, distance, etc.)
Φ(ẑ) = estimate of the fraction of the population failing by age t
ẑ = (t − x̄)/s

One-sided:
R_L(t) = 1 − F_U(t) = 1 − Φ(z_U), where z_U ≈ ẑ + z_γ·[1/n + ẑ²/(2(n − 1))]^0.5
R_U(t) = 1 − F_L(t) = 1 − Φ(z_L), where z_L ≈ ẑ − z_γ·[1/n + ẑ²/(2(n − 1))]^0.5

Two-sided:
R_L(t) = 1 − F_U(t) = 1 − Φ(z_U), where z_U ≈ ẑ + z_[(1+γ)/2]·[1/n + ẑ²/(2(n − 1))]^0.5
R_U(t) = 1 − F_L(t) = 1 − Φ(z_L), where z_L ≈ ẑ − z_[(1+γ)/2]·[1/n + ẑ²/(2(n − 1))]^0.5


Table 5.2-10: Confidence Bounds for the Weibull Distribution

Parameter: True value of the Weibull shape parameter, β

Given: The estimate of the Weibull shape parameter, β, is given as:

β̂ = 1.283/s

where:
s = [Σ_{i=1}^{n} (xᵢ − x̄)²/(n − 1)]^0.5, the sample standard deviation
x̄ = (1/n)·Σ_{i=1}^{n} xᵢ
xᵢ = natural logarithm of each individual time to failure, for each observation of sample size n
n = number of statistically independent sample observations

Weibull limits (approximate; limits are crude unless n is quite large, say n > 100):

One-sided:
β_L ≈ 1/[0.7797s·exp(1.049·z_γ/n^0.5)]          β_U ≈ 1/[0.7797s·exp(−1.049·z_γ/n^0.5)]

Two-sided:
β_L ≈ 1/[0.7797s·exp(1.049·z_[(1+γ)/2]/n^0.5)]  β_U ≈ 1/[0.7797s·exp(−1.049·z_[(1+γ)/2]/n^0.5)]

Parameter: True value of the Weibull scale parameter, η

Given: The estimate of the Weibull scale parameter, η, is:

η̂ = exp(x̄ + (0.5772)(0.7797)s) = exp(x̄ + 0.45s)

where:
s = sample standard deviation
xᵢ = natural logarithm of each individual measurement, for each observation of sample size n
n = number of statistically independent sample observations

Weibull limits (approximate; limits are crude unless n is quite large, say n > 100):

One-sided:
η_L ≈ exp[(x̄ + 0.45s) − z_γ·(1.081)(0.7797)s/n^0.5]
η_U ≈ exp[(x̄ + 0.45s) + z_γ·(1.081)(0.7797)s/n^0.5]

Two-sided:
η_L ≈ exp[(x̄ + 0.45s) − z_[(1+γ)/2]·(1.081)(0.7797)s/n^0.5]
η_U ≈ exp[(x̄ + 0.45s) + z_[(1+γ)/2]·(1.081)(0.7797)s/n^0.5]


Table 5.2-11: Confidence Bounds for the Weibull Distribution (continued)

Parameter: True value of reliability at end of period, R(t)

Given: The estimate of the reliability at any age t, R(t), is:

R*(t) = e^(−(t/η)^β)

where:
R = reliability as a function of time, distance, etc.
t = period at which reliability is assessed (time, distance, etc.)
η = Weibull scale parameter
β = Weibull shape parameter

Limits are crude unless n is quite large (say, n > 100). Let u = [ln(t) − (x̄ + 0.45s)]/(0.7797s).

One-sided approximate Weibull limits:

R_L(t) = exp{−exp[u + z_γ·((1.168 − 0.1913·u + 1.1·u²)/n)^0.5]}
R_U(t) = exp{−exp[u − z_γ·((1.168 − 0.1913·u + 1.1·u²)/n)^0.5]}

Two-sided approximate Weibull limits:

R_L(t) = exp{−exp[u + z_[(1+γ)/2]·((1.168 − 0.1913·u + 1.1·u²)/n)^0.5]}
R_U(t) = exp{−exp[u − z_[(1+γ)/2]·((1.168 − 0.1913·u + 1.1·u²)/n)^0.5]}

5.3. Acceleration Models


Acceleration models are needed to determine how the TTF distribution behaves as a
function of the accelerant. The accelerant can be a stress (such as temperature, voltage,
pressure, etc.), or it can be an indicator variable (such as a product feature or design
attribute). These are also sometimes called categorical variables. One of the most
common ways to quantify acceleration factors is to perform tests at various stress levels
(and, in the case of indicator variables, for various product features or design attributes).
Accelerated testing is often used for this purpose, in which case tests are performed at
stress levels higher than the item will experience in use, to speed up failure processes.
Acceleration models consist of two generic types:
Physical Acceleration Models: For well-understood failure mechanisms, one
may have a model based on physical/chemical theory that describes the failure-causing
process over the range of the data and provides extrapolation to use conditions.
Empirical Acceleration Models: Empirical acceleration models are used when
there is little understanding of the chemical or physical processes leading to
failure, and a model can be empirically determined to describe the observed data.
5.3.1. Fundamental Acceleration Models

In practice, the acceleration models used are a combination of physical and empirical, in
that theory may be used to determine the appropriate form of the acceleration model, but
the specific model constants are almost always determined empirically.
There are four basic forms of accelerated life models. Combinations of these are also
possible:
The linear model is:

y = ax + b

The exponential model is:

y = b·e^(ax)

The power law model is:

y = b·x^a
The Logarithmic model is:

y = a ln(x) + b

In all of these equations, y is the dependent variable, usually either lifetime (as
measured by characteristic life or mean life, depending on the TTF distribution used), or
failure rate. Since the failure rate is the reciprocal of the mean life (in the case of the
exponential distribution), the constant a will generally be positive in one case and
negative in the other.
The most commonly used reliability models are the power law and exponential models.
Several points regarding acceleration models are:

There is no correct acceleration factor to use for a specific application


Several different acceleration factor model forms are often equally applicable
The best acceleration factor model is often the one that best fits the empirical
data
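Because the power law (and, analogously, the exponential model) becomes linear after a log transform, its constants can be estimated with ordinary least squares. A minimal sketch, with hypothetical stress/life pairs:

    import numpy as np

    # Hypothetical characteristic-life observations at three stress levels
    stress = np.array([10., 20., 40.])       # stress units
    life = np.array([9800., 2600., 610.])    # hours

    # Power law y = b*x^a is linear in logs: ln(y) = ln(b) + a*ln(x)
    a, ln_b = np.polyfit(np.log(stress), np.log(life), 1)
    b = np.exp(ln_b)
    print(f"fitted power law: life = {b:.3g} * stress^({a:.2f})")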

5.3.1.1. Examples

Several commonly used acceleration models are summarized in this section.


Arrhenius
The Arrhenius relationship is a widely used model describing the effect that temperature
has on the rate of a simple chemical reaction:

L = A·e^(Ea/KT)

where:
L = the lifetime
A = a life constant
Ea = the activation energy in eV
K = Boltzmann's constant = 8.617 × 10⁻⁵ eV/K
T = the absolute temperature in degrees Kelvin

It can be seen that this is the exponential model, with the reciprocal of temperature used
as the stress. The Arrhenius model is the most widely used for evaluating the effect of
temperature on reliability. It is applicable to situations in which the failure mechanism is
a function of the steady state temperature, such as corrosion, diffusion, etc. Notable
observations about the Arrhenius acceleration model are that:

It was originally derived in the 19th century to model chemical reaction rates

Over the last few decades, it has been applied to electronics reliability modeling,
since it often empirically fits the data reasonably well

In the formative years of the electronics industry, many failure mechanisms were
related to corrosion and contamination, which are inherently chemical reaction
rates for which the Arrhenius factor applies reasonably well
It has since been applied to many other failure mechanisms, with an assumed
applicability
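To illustrate how the model is typically applied, the following sketch computes the Arrhenius acceleration factor between a use and a stress temperature; the activation energy and temperatures are assumed values:

    import math

    def arrhenius_af(ea_ev: float, t_use_c: float, t_stress_c: float) -> float:
        """Arrhenius acceleration factor between use and stress temperatures."""
        k = 8.617e-5                     # Boltzmann's constant, eV/K
        t_use = t_use_c + 273.15         # convert Celsius to Kelvin
        t_stress = t_stress_c + 273.15
        return math.exp((ea_ev / k) * (1.0 / t_use - 1.0 / t_stress))

    # Assumed example: Ea = 0.7 eV, 40C use vs. 125C accelerated test
    print(f"AF = {arrhenius_af(0.7, 40.0, 125.0):.1f}")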

Eyring
The Eyring model is:

L = (1/T)·e^(A/T)
Coffin-Manson
A form of fatigue life strain model is the Coffin-Manson "life vs. plastic strain"
relationship, which is often used for solder joint reliability modeling:

AF = (ΔT_S/ΔT_U)^β

where:
AF = acceleration factor
ΔT_U = product temperature cycling range in service use, K
ΔT_S = product temperature cycling range in stress conditions, K
β = constant for a specific failure mechanism

The number of cycles to failure is expressed as:

N_f = A·(1/Δe_p)^β

where:
N_f = number of cycles to failure
A = a material constant
Δe_p = plastic strain range
β = a material constant
Since ΔT ∝ Δe_p, a simplified acceleration factor for temperature cycling fatigue testing
is:

AF = N_use/N_test = (ΔT_test/ΔT_use)^β

The Coffin-Manson model is also sometimes used to model the acceleration due to
vibration stresses. Random vibration input and response curves are typically plotted on
log-log paper, with the power spectral density (PSD) expressed in squared acceleration
units per hertz (G²/Hz), plotted along the vertical axis, and the frequency (Hz) plotted
along the horizontal axis. The PSD is defined as:

P = lim_{Δf→0} (G²/Δf)

In the above equation, G is the root mean square (RMS) of the acceleration, expressed
in gravity units, and Δf is the bandwidth of the frequency range, expressed in hertz.
Since G is the agent of failure that causes fatigue, the following inverse power model
applies:

Life = L(G) = 1/(K·G^b)

The acceleration factor for vibration based on Grms for similar product responses is
represented by:

AF = N_use/N_test = (G_test/G_use)^b
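Both the thermal cycling and the vibration acceleration factors above are simple stress ratios raised to a material exponent, as the following sketch illustrates (the exponents and stress levels are assumed values):

    def coffin_manson_af(delta_t_test: float, delta_t_use: float, beta: float) -> float:
        """Thermal-cycling acceleration factor: AF = (dT_test / dT_use)^beta."""
        return (delta_t_test / delta_t_use) ** beta

    def vibration_af(g_test: float, g_use: float, b: float) -> float:
        """Random-vibration acceleration factor: AF = (G_test / G_use)^b."""
        return (g_test / g_use) ** b

    # Assumed examples: beta = 2.5 for a solder fatigue mechanism, b = 4 for vibration
    print(f"thermal cycling AF = {coffin_manson_af(100.0, 40.0, 2.5):.1f}")
    print(f"vibration AF       = {vibration_af(12.0, 6.0, 4.0):.1f}")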

5.3.2. Combined Models

Acceleration models with more than one accelerating variable might be suggested when it
is known that two or more potential accelerating variables contribute to degradation and
failure. Several examples follow.


Temperature and Non-Thermal Stress


When temperature and a second non-thermal stress (e.g., voltage) are the accelerated
stresses of a test, then the Arrhenius and the Inverse Power Law models can be combined
to yield the Temperature-Non-Thermal (T-NT) model (Reference 10):

L(U ,V ) =

C
n

U e

B
V

where:
U=
non-thermal stress (i.e., voltage, vibration, etc.)
V=
temperature (in K)
B, C, n = parameters to be determined
The T-NT relationship can be linearized and plotted on a Life vs. Stress plot by taking the
natural logarithm of both sides:

ln[L(U, V)] = ln(C) − n·ln(U) + B/V

Here, the log of the life is a linear relationship, where the intercept is ln(C), the
slope with respect to ln(U) is −n, and the slope with respect to 1/V is B.
The acceleration factor for the T-NT relationship is given by:
AF = L_Use/L_Accelerated = [(C/U_u^n)·e^(B/V_u)] / [(C/U_A^n)·e^(B/V_A)] = (U_A/U_u)^n · e^(B·(1/V_u − 1/V_A))

where:
L_Use = the life at the use stress level
L_Accelerated = the life at the accelerated stress level
V_u = the use temperature level
V_A = the accelerated temperature level
U_u = the use non-thermal stress level
U_A = the accelerated non-thermal stress level

Temperature-Humidity Models
A variation of the Eyring relationship is the Temperature-Humidity (TH) relationship.
This combination model is expressed as:

L(V, U) = A·e^(φ/V + b/U)

where φ and b are parameters to be determined (the parameter b is also known as
the activation energy for humidity), A is a constant, U is the relative humidity
(decimal or percentage), and V is the temperature (in absolute units, K). Note that the
relative humidity can be expressed in either a decimal format or as a percentage, as long
as it is consistent throughout the analysis. The relationship is linearized by taking the
natural logarithm of both sides of the equation:

ln[L(V, U)] = ln(A) + φ/V + b/U

The acceleration factor for the TH relationship is:

AF = L_Use/L_Accelerated = [A·e^(φ/V_u + b/U_u)] / [A·e^(φ/V_A + b/U_A)] = e^(φ·(1/V_u − 1/V_A) + b·(1/U_u − 1/U_A))

where:
L_Use = the life at the use stress level
L_Accelerated = the life at the accelerated stress level
V_u = the use temperature level
V_A = the accelerated temperature level
U_u = the use humidity level
U_A = the accelerated humidity level


Peck Model
The Peck model (Reference 6) is:

L ∝ (RH)^(−n)·e^(Ea/KT)

where:
RH = relative humidity
T = temperature
n = constant
Ea = activation energy
K = Boltzmann's constant = 8.617 × 10⁻⁵ eV/K

Note that this is a multiplicative model consisting of a power law for humidity and the
Arrhenius model for temperature.
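A minimal sketch of the acceleration factor implied by this model, combining the humidity power law and the Arrhenius term (the parameter values below are assumed for illustration):

    import math

    def peck_af(n: float, ea_ev: float, rh_use: float, rh_stress: float,
                t_use_c: float, t_stress_c: float) -> float:
        """Peck acceleration factor: humidity power law times Arrhenius term."""
        k = 8.617e-5                                  # Boltzmann's constant, eV/K
        humidity_term = (rh_stress / rh_use) ** n     # from L ~ RH^-n
        arrhenius_term = math.exp((ea_ev / k) *
                                  (1 / (t_use_c + 273.15) - 1 / (t_stress_c + 273.15)))
        return humidity_term * arrhenius_term

    # Assumed example: n = 2.66, Ea = 0.7 eV, 30C/60%RH use vs. 85C/85%RH test
    print(f"AF = {peck_af(2.66, 0.7, 60.0, 85.0, 30.0, 85.0):.0f}")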
The British Telecom Model
The British Telecom model, also used in the Telcordia standards (Reference 7), is:

L ∝ e^((Ea/KT) + n·(RH)²)

This model includes the effects of both temperature and relative humidity.
Harris Model
Wearout data published by the Harris Corporation shows a good fit to Peck's model
(Reference 8) in representing aluminum corrosion. This model is:

AF = e^((Ea/k)·(1/T_U − 1/T_S)) · (RH_S/RH_U)^a · (V_S/V_U)^b

where:
AF = acceleration factor
Ea = activation energy
k = Boltzmann's constant = 8.617 × 10⁻⁵ eV/K
T_U = product temperature in service use, K
T_S = product temperature in stress conditions, K
RH_U = relative humidity in service use
RH_S = relative humidity in stress conditions
V_U = voltage in service use
V_S = voltage in stress conditions
a = 2.66, based on Peck
b = 1.4 (from Harris data)

Fatigue and S-N curves


With metals and alloys, the fatigue process starts with dislocations, or crystallographic
irregularities, that ultimately result in crack formation. It is a probabilistic phenomenon
with a significant variation in lifetime. The S-N curves quantify the relationship between
stress and the number of stress cycles to failure. It is essentially a life-stress relationship
for metals. The curves are generally obtained by testing samples of the metal or alloy,
and have been published in various handbooks.

In estimating fatigue life for materials, the model is used as the analytical representation
of the so-called S-N curves, where S is the stress amplitude and N is the life (in cycles to
failure), such that N = k·S^(−b), where b and k are material parameters either estimated
from test data or published in handbooks.
Miner's Rule
Miner's rule states that the amount of damage sustained by a metal is proportional to the
number of cycles it experiences, as follows:

Σ_{i=1}^{k} (nᵢ/Nᵢ) = C

There are k stress levels, nᵢ is the number of cycles accumulated at stress level i, Nᵢ is the
number of cycles to failure at a constant stress reversal at that level, and C is usually
assumed to be 1.0. The rule essentially estimates the percentage of life used by each
stress reversal at each specific magnitude.
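A minimal sketch of Miner's rule applied to an assumed duty profile:

    # Miner's rule: cumulative damage as the sum of cycle ratios n_i / N_i.
    # Assumed duty profile: (cycles experienced, cycles-to-failure at that stress)
    duty_profile = [
        (2.0e5, 1.0e6),   # low-amplitude cycles
        (5.0e4, 4.0e5),   # medium-amplitude cycles
        (1.0e4, 5.0e4),   # high-amplitude cycles
    ]

    damage = sum(n / N for n, N in duty_profile)
    print(f"accumulated damage C = {damage:.2f}")  # failure predicted when C reaches ~1.0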
5.3.3. Cumulative Damage Model

Many situations arise in which there is cumulative damage inflicted on an item when
subjected to a stress. For those situations where a Weibull distribution is appropriate, the
reliability function is expressed as:

R(t) = e^(−(t/η)^β)

where:
R(t) = reliability, the probability of survival at time t
β = Weibull shape parameter (in time space)
η = characteristic life, as a function of the stressor

If it is assumed that the acceleration can be described by a power law, then:

η = (a/S)^n

where:
S = stressor
a = life constant
n = fatigue exponent in time space

Combining the two equations yields:

R(t) = e^(−[t·(S/a)^n]^β)

The modeling process estimates β, a and n.


The premise of the cumulative damage model is that the amount of life used per cycle is
proportional to the stressor raised to the n power:

t_e = t₁·(S₁/S₀)^n

where:
t_e = equivalent time at the normalization stressor S₀ for time t₁ spent at stressor S₁
S₀ = normalization stressor

This cumulative damage model is particularly useful when the stresses are time varying,
since an equivalent amount of damage can be estimated per unit time, regardless of the
behavior of stress as a function of time. This model is also consistent with fatigue, which
is essentially a cumulative damage scenario.
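A minimal sketch of this equivalent-time calculation for an assumed time-varying stress profile:

    # Convert a time-varying stress profile into equivalent time at a reference
    # stress S0, using t_e = t1 * (S1/S0)^n.  Profile segments are assumed values.
    n = 2.5                     # fatigue exponent (assumed)
    s0 = 10.0                   # normalization stressor
    profile = [(100.0, 10.0),   # (hours at stress, stress level)
               (50.0, 15.0),
               (10.0, 25.0)]

    t_equiv = sum(t * (s / s0) ** n for t, s in profile)
    print(f"equivalent time at S0 = {t_equiv:.0f} hours")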

5.4. MLE Equations


The previous sections summarized information relative to the selection of a specific
distribution and acceleration factors. Once these have been determined for a particular
situation, and tests have been performed at various levels of acceleration, the next step is
to quantify model parameters. Previously, in the discussion on distributions and
parameter estimation with MLE, only the distribution parameters were considered, not
the acceleration model parameters. A life model needs parameter estimates for both the
distribution parameters and the acceleration model parameters. In this section, the
likelihood equations for various combinations of distributions and acceleration models
are presented.
The form of a life model is the distribution equation, with the mean or characteristic life
(depending on the distribution) replaced with the acceleration model. For example, if a
Weibull distribution is used, the reliability function is:

R(t) = e^(−(t/η)^β)

where:
R(t) = reliability, the probability of survival at time t
β = Weibull shape parameter (in time space)
η = characteristic life, as a function of the stressor

And, if the acceleration model is the power law:

η = (a/S)^n

where:
S = stressor
a = life constant
n = fatigue exponent in time space

Then, combining the two equations yields:

R(t) = e^(−[t·(S/a)^n]^β)
The modeling process estimates β, a and n. Once these parameters are estimated, the
life distribution for any stress level can be obtained.
5.4.1. Likelihood Functions

The likelihood functions for the six combinations of distribution (exponential, Weibull,
lognormal) and acceleration model (Arrhenius, Inverse Power Law) are provided below.
Exponential-Arrhenius Reaction Rate Model:
Λ = Σ_{i=1}^{N} Nᵢ·[−ln(C) − B/Vᵢ − (tᵢ/C)·e^(−B/Vᵢ)] − Σ_{i=1}^{M} Nᵢ·(T_{Ri}/C)·e^(−B/Vᵢ)
  + Σ_{i=1}^{K} Nᵢ·ln[1 − e^(−(T_{Li}/C)·e^(−B/Vᵢ))] + Σ_{i=1}^{L} Nᵢ·ln[e^(−(T_{ai}/C)·e^(−B/Vᵢ)) − e^(−(T_{bi}/C)·e^(−B/Vᵢ))]

Exponential-Inverse Power Law (IPL):

Λ = Σ_{i=1}^{N} Nᵢ·[ln(K) + n·ln(Sᵢ) − K·Sᵢ^n·tᵢ] − Σ_{i=1}^{M} Nᵢ·K·Sᵢ^n·T_{Ri}
  + Σ_{i=1}^{K} Nᵢ·ln[1 − e^(−K·Sᵢ^n·T_{Li})] + Σ_{i=1}^{L} Nᵢ·ln[e^(−K·Sᵢ^n·T_{ai}) − e^(−K·Sᵢ^n·T_{bi})]

Weibull-Arrhenius:

Λ = Σ_{i=1}^{N} Nᵢ·ln[(β/(C·e^(B/Vᵢ)))·(tᵢ/(C·e^(B/Vᵢ)))^(β−1)·e^(−(tᵢ/(C·e^(B/Vᵢ)))^β)] − Σ_{i=1}^{M} Nᵢ·(T_{Ri}/(C·e^(B/Vᵢ)))^β
  + Σ_{i=1}^{K} Nᵢ·ln[1 − e^(−(T_{Li}/(C·e^(B/Vᵢ)))^β)] + Σ_{i=1}^{L} Nᵢ·ln[e^(−(T_{ai}/(C·e^(B/Vᵢ)))^β) − e^(−(T_{bi}/(C·e^(B/Vᵢ)))^β)]

Weibull-IPL:

Λ = Σ_{i=1}^{N} Nᵢ·ln[β·K·Sᵢ^n·(K·Sᵢ^n·tᵢ)^(β−1)·e^(−(K·Sᵢ^n·tᵢ)^β)] − Σ_{i=1}^{M} Nᵢ·(K·Sᵢ^n·T_{Ri})^β
  + Σ_{i=1}^{K} Nᵢ·ln[1 − e^(−(K·Sᵢ^n·T_{Li})^β)] + Σ_{i=1}^{L} Nᵢ·ln[e^(−(K·Sᵢ^n·T_{ai})^β) − e^(−(K·Sᵢ^n·T_{bi})^β)]

Lognormal-Arrhenius:

Λ = Σ_{i=1}^{N} Nᵢ·ln[(1/(σ·tᵢ))·φ((ln(tᵢ) − ln(C) − B/Vᵢ)/σ)] + Σ_{i=1}^{M} Nᵢ·ln[1 − Φ((ln(T_{Ri}) − ln(C) − B/Vᵢ)/σ)]
  + Σ_{i=1}^{K} Nᵢ·ln[Φ((ln(T_{Li}) − ln(C) − B/Vᵢ)/σ)] + Σ_{i=1}^{L} Nᵢ·ln[Φ((ln(T_{bi}) − ln(C) − B/Vᵢ)/σ) − Φ((ln(T_{ai}) − ln(C) − B/Vᵢ)/σ)]



Lognormal-IPL:

Λ = Σ_{i=1}^{N} Nᵢ·ln[(1/(σ·tᵢ))·φ((ln(tᵢ) + ln(K) + n·ln(Sᵢ))/σ)] + Σ_{i=1}^{M} Nᵢ·ln[1 − Φ((ln(T_{Ri}) + ln(K) + n·ln(Sᵢ))/σ)]
  + Σ_{i=1}^{K} Nᵢ·ln[Φ((ln(T_{Li}) + ln(K) + n·ln(Sᵢ))/σ)] + Σ_{i=1}^{L} Nᵢ·ln[Φ((ln(T_{bi}) + ln(K) + n·ln(Sᵢ))/σ) − Φ((ln(T_{ai}) + ln(K) + n·ln(Sᵢ))/σ)]

Solutions for the parameters of these stress-life log-likelihood functions can be obtained
by setting their first-order partial derivatives equal to zero and applying iterative
methods. The second-order partial derivatives for each of the six combinations are listed
below. The advantage of using the second-order partials is their potential for dual use:
in the Fisher information matrix, and in the iterative methods required to attain the
parameter solutions.
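In practice these systems are rarely solved by hand; a general-purpose optimizer applied to the negative log-likelihood is the usual route. The following sketch fits the Weibull-IPL combination to hypothetical complete and right-censored observations taken at three stress levels (SciPy is assumed to be available):

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical accelerated test data: (time, stress, 1 if failure else 0)
    data = np.array([
        (250., 30., 1), (410., 30., 1), (600., 30., 0),   # one unit censored at 600 h
        (110., 45., 1), (190., 45., 1), (260., 45., 1),
        (40., 60., 1), (70., 60., 1), (95., 60., 1),
    ])
    t, s, failed = data[:, 0], data[:, 1], data[:, 2].astype(bool)

    def neg_ll(params):
        """Negative Weibull-IPL log-likelihood; eta = 1/(K*S^n)."""
        beta, logK, n = params
        if beta <= 0:
            return np.inf
        z = np.exp(logK) * s ** n * t            # z = t/eta = K*S^n*t
        # failures contribute ln f(t) = ln(beta*z^beta/t) - z^beta
        ll = np.sum(np.log(beta * z[failed] / t[failed]) +
                    (beta - 1) * np.log(z[failed]) - z[failed] ** beta)
        ll -= np.sum(z[~failed] ** beta)         # suspensions: ln R = -z^beta
        return -ll

    res = minimize(neg_ll, x0=[1.5, np.log(1e-5), 2.0], method="Nelder-Mead")
    beta_hat, K_hat, n_hat = res.x[0], np.exp(res.x[1]), res.x[2]
    print(f"beta = {beta_hat:.2f}, K = {K_hat:.3g}, n = {n_hat:.2f}")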
Exponential-Arrhenius:
∂²Λ/∂B², ∂²Λ/∂C², ∂²Λ/∂B∂C

Exponential-IPL:
∂²Λ/∂K², ∂²Λ/∂n², ∂²Λ/∂K∂n

Weibull-Arrhenius:
∂²Λ/∂β², ∂²Λ/∂B², ∂²Λ/∂C², ∂²Λ/∂β∂B, ∂²Λ/∂β∂C, ∂²Λ/∂B∂C

Weibull-IPL:
∂²Λ/∂β², ∂²Λ/∂K², ∂²Λ/∂n², ∂²Λ/∂β∂K, ∂²Λ/∂β∂n, ∂²Λ/∂K∂n

Lognormal-Arrhenius:
∂²Λ/∂B², ∂²Λ/∂C², ∂²Λ/∂σ², ∂²Λ/∂B∂C, ∂²Λ/∂B∂σ, ∂²Λ/∂C∂σ

Lognormal-IPL:
∂²Λ/∂K², ∂²Λ/∂n², ∂²Λ/∂σ², ∂²Λ/∂K∂n, ∂²Λ/∂K∂σ, ∂²Λ/∂n∂σ

The likelihood function will yield a value for all possible combinations of parameter
values. A useful tool in data analysis is a plot of the likelihood value. As an example,
Figure 5.4-1 illustrates a contour plot of the likelihood value for an exponential-IPL
model.

Figure 5.4-1: Likelihood Contour Example


In this example, the plot lines represent values of equal likelihood as a function of the
two parameters of interest (i.e., the Weibull slope and the exponent in the power law
acceleration model). The center position represents the combination of beta and n at
which the maximum value of likelihood occurs. The height of the likelihood value
increases as the center of the contour lines is approached. The spread in the contour lines
of equal likelihood is proportional to the uncertainty in the parameter estimates and, in
fact, provides one way to estimate confidence bounds on the model parameters. Also, the
dispersion of the likelihood values along the n axis can be thought of as the spread of the
TTFs in the stress dimension, and the dispersion of the likelihood values along the beta
axis can be thought of as the spread of the TTFs in the time dimension.

5.5. References
1. Lyu, M.R. (Editor), "Handbook of Software Reliability Engineering", McGraw-Hill, April 1996, ISBN 0070394008
2. Musa, J.D., Iannino, A. and Okumoto, K., "Software Reliability: Measurement, Prediction, Application", McGraw-Hill, May 1987, ISBN 007044093X
3. Musa, J.D., "Software Reliability Engineering: More Reliable Software, Faster Development and Testing", McGraw-Hill, July 1998, ISBN 0079132715
4. Nelson, W., "Applied Life Data Analysis", John Wiley & Sons, 1982, ISBN 0471094587
5. Fisher, R.A., "On an Absolute Criterion for Fitting Frequency Curves", Messenger of Mathematics, Vol. 41, 1912, pp. 155-160. [Reprinted in Statistical Science, Vol. 12 (1997), pp. 39-41.]
6. Peck, S., IRPS tutorial, 1990
7. Telcordia GR-1221
8. Peck and Hallberg, Quality and Reliability Engineering International, 1991
9. Hald, A., "On the Maximum Likelihood in Relation to Inverse Probability and Least Squares", Statistical Science, Vol. 14, No. 2, 1999, pp. 214-222.
10. Accelerated Life Testing Analysis (ALTA), ReliaSoft Corp.


6. Interpretation of Reliability Estimates

This chapter presents topics related to the interpretation of various aspects of reliability
models. It is hoped that this material will give the reader a better intuitive understanding
of reliability predictions, assessments and estimations.

6.1. Bathtub Curve


The bathtub curve is a general reliability model of failure rate as a function of time
that, for hardware, has three distinct periods. It is often misunderstood and
misinterpreted. It should be thought of as a concept rather than an actual failure rate
function. A generic bathtub curve is shown in Figure 6.1-1.

Figure 6.1-1: Bathtub Curve


The three regions are:
Infant Mortality. In this first portion of the bathtub curve, the failure rate is relatively
high because a portion of the population may contain parts with defects. These parts
generally fail earlier than those in the main population. The shape of the failure rate
curve is decreasing, with its rate of decrease dependent on the maturity of the design and
manufacturing processes, as well as the applied stresses.
Useful Life. The second portion of the bathtub curve is known as the useful life and is
characterized by a relatively constant failure rate caused by randomly occurring failures.
It should be noted that the failure rate is only related to the height of the curve, not to the
length of the curve, which is a representation of product or system life. If items are
exhibiting randomly occurring failures, then they fail according to the exponential
distribution, in accordance with a Poisson process. Since the exponential distribution
exhibits a constant hazard rate, we can simply add the failure rates for all items making
up an item to estimate the overall failure rate of that item during its useful life.
Wearout. The last part of the curve is the wearout portion. This is where items start to
deteriorate to such a degree that they are approaching, or have reached, the end of their
useful life. This is often relevant to mechanical parts, but can also apply to any failure
cause that exhibits wearout behavior.

It is important to understand the difference between the MTBF of an item and the useful
life of that same item. Items that experience wearout failure modes/mechanisms will
have some period of useful life before they fail as a result of wearout. This useful life is
not the same as the item MTBF. During useful life, an item may also experience
randomly occurring "freak" failures caused by weak components or faulty workmanship,
especially if the item is subjected to high stress conditions. The occurrence of these
random failures during an item's useful life results in higher failure rates, or lower
MTBF, for that item.
Mechanical items are usually most prone to wearout and, therefore, we are usually most
concerned with the useful life, or MTTF, associated with these items. Electronic items
usually become obsolete long before any significant wearout takes place (although it
should be noted that, with the progressively decreasing feature sizes of current
state-of-the-art microelectronic devices, the issues associated with wearout and useful
life are becoming of greater concern). Therefore, the infant mortality and constant
failure rate portions of the bathtub curve are of the most interest for these items.
The bathtub curve conceptually offers a good view of the three primary types of failure
categories. It is essentially a composite failure rate curve comprised of three generic
types of failure causes. In practice, however, the well-defined curve of Figure 6.1-1 is
rare. The actual curve for a product or system will depend on many factors. A specific
failure cause will generally exhibit characteristics of only one segment of the bathtub
curve, but when the characteristics of all of the other failure causes for that product or
system are considered, and a composite model is generated, the curve will have a shape
that deviates from the classic bathtub curve, even though it will often contain elements of
each of the three portions. Usually, the composite curve will be dominated by the
characteristics of those failure causes that dominate the overall reliability of the item.
It is also important to note that defects do not always manifest themselves as infant
mortality failures. They can appear to be infant mortality, random or wearout, depending
on the specific characteristics of the failure mechanism and factors such as defect
severity distributions.

6.2. Common Cause vs. Special Cause


The fact that a failure rate can be predicted for a given part under a specific set of
conditions does not imply that a failure rate is an inherent quality of the part. The
probability of failure is a complex interaction between the inherent defect density, defect
severity, and stresses incurred in operation. Failure rates predicted using empirical
models are, therefore, typical failure rates only and represent typical defect rates, design
characteristics and use conditions. The accuracy of these prediction models is dependent
on:

The model developer's ability to identify the variables (component- or use-related) that most heavily influence reliability
The level of detailed data to which the model user has access
The quantity and quality of the data on which the models are based

The accuracy of a reliability model is a strong function of the manner in which defects
are accounted for. Therefore, there is a trade-off between the usability of the model and
the level of detailed data that it requires. This highlights the fact that the purpose of a
reliability prediction must be clearly understood before a methodology is chosen.
Practical considerations for choosing an approach will inevitably include the types and
level of detail of information available to the analyst. Given the practical time and cost
constraints that most reliability practitioners face, it is usually important that the chosen
reliability prediction methodology be based on data and information accessible to them.
Model developers have long known that many of the factors which had a major influence
on the reliability of the end product were not included in traditional methods like MIL-HDBK-217 but, under the constraints of handbook users, these factors could not be
included in the models. For example, it was known that manufacturing processes had a
major impact on end item reliability, but those are the factors which corporations hold
most proprietary. As an example of this, a physics-of-failure-like model was developed
several years ago for small-scale CMOS technology. This model required many input
variables, such as metallization cross-sectional area, silicon area, oxide field strength,
oxide defect density, metallization defect density, etc. While the model has the potential
to be much more accurate than the other MIL-HDBK-217 models, it is essentially
unusable by anyone other than the component manufacturers who have access to such
information. The model is useful, however, for these manufacturers to improve the
reliability of their component designs.
The two primary purposes for performing a quantitative reliability assessment of systems
are (1) to assess the capability of the parts and design to operate reliably in a given
application (robustness), and (2) to estimate the number of field failures or the probability
of mission success. The first does not require statistically-based data or models, but
rather sound part and materials selection/qualification and robust design techniques. It is
for this purpose that physics approaches have merit. The second, however, requires
empirical data and models derived from that data. This is due to the fact that field
component failures are predominantly caused by component and manufacturing defects
which can only be quantified through the statistical analysis of empirical data. This can
be seen by observing the TTF characteristics of components and systems, which are
almost always decreasing, indicating the predominance of defect-driven failure
mechanisms. The handbook models described in this book provide the data to quantify
average failure rates which are a function of those defects.
It has been shown that system reliability failure causes are not driven by deterministic
processes, but rather by stochastic processes that must be treated as such in a successful
model. There is a similarity between reliability prediction and chaotic processes. This
likeness stems from the fact that the reliability of a complex system is entirely dependent
upon initial conditions (e.g., manufacturing variation) and use variables (i.e., field
application). Both the initial conditions and the use application variables are often
unknowable to any degree of certainty. For example, the likelihood of a specific system
containing a defect is often unknown, depending on the defect type, because the
propensity for defects is a function of many variables and deterministically modeling
them all is virtually impossible. However, the reliability can be predicted within bounds
by using empirically based stochastic models.
A critical factor that must be considered when choosing a reliability assessment method
is whether the failure mechanism under analysis is a special cause or a common cause
mechanism. In other words, a special cause mechanism means that there is an assignable
cause to the failure and that only a subpopulation of the item is susceptible to this failure
mechanism. Common cause mechanisms are those affecting the entire population.
Table 6.1-1 summarizes the characteristics of various categories of failure causes, and
identifies whether they are typically common cause or special cause. The categories of
failure types encompass the ways a failure cause can manifest itself. These are also
categories that can be used in a FMEA.
Table 6.1-1: Categories of Failure Effects

Failure cause types: Always (Common Cause); Sometimes (Special Cause)
Categories of failure type: Screen Fallout/Out-of-the-Box Failure; Infant Mortality; Design Not Capable; Process Not Capable; Random Failure; Wearout

If it is erroneously assumed that special cause mechanisms will affect the entire
population, gross errors in the reliability estimates of the population will result. This
error results from the assumption of a mono-modal TTF distribution when, in fact, the
actual distribution is multimodal.
If the distribution is truly mono-modal, only the parameters applicable to a single mode
distribution need to be estimated. However, if there are really several sub-populations
within the entire population, the parameters of each of the distributions need to be
estimated, along with the percentage of the entire population represented by each
distribution.
This is especially critical when dealing with defects. In this case, it is critical to
understand the percentage of the population that is at risk of failure. To illustrate this,
consider the probability plot in Figure 6.2-1. As can be seen in this plot, there is an
apparent knee in the plot at about 400 hours, an indication of several subpopulations.
If a mono-modal distribution is assumed (i.e., the straight line), errors in the cumulative
percent fail at a given time will occur. Likewise, if a multimodal distribution is assumed,
a much more accurate representation of the situation results (the line through the data
points).
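One way to see the danger is to simulate a mixed population and fit a single (mono-modal) Weibull to it. The following sketch, with two assumed subpopulations, shows how a single-mode fit can misstate the early-life failure probability:

    import numpy as np
    from scipy.stats import weibull_min

    rng = np.random.default_rng(1)

    # Synthetic bimodal population: a small defective subpopulation failing early,
    # mixed with a main population that wears out much later.
    defective = weibull_min.rvs(1.3, scale=300.0, size=60, random_state=rng)
    main = weibull_min.rvs(4.3, scale=5000.0, size=240, random_state=rng)
    times = np.concatenate([defective, main])

    # Fit a single Weibull (location fixed at 0) to the pooled data
    beta1, _, eta1 = weibull_min.fit(times, floc=0)
    print(f"single-mode fit: beta = {beta1:.2f}, eta = {eta1:.0f}")

    # The single fit misstates early-life risk: compare F(400 hours)
    empirical = np.mean(times <= 400.0)
    fitted = weibull_min.cdf(400.0, beta1, scale=eta1)
    print(f"F(400 h): empirical = {empirical:.3f}, single-mode fit = {fitted:.3f}")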



Figure 6.2-1: Example of a Non-Mono-Modal Distribution (Weibull probability plot of unreliability, F(t), vs. time)


The quantification of subpopulations usually requires data on many more samples
relative to the mono-modal situations.
If accelerated tests are used to model life, the risk in assuming mono-modality must be
considered. For this reason, techniques like stress/strength and first principles are often
difficult to use to quantify multimodality.
Examples of multimodal distributions
The plots presented in Figures 6.2-2 through 6.2-6 illustrate the characteristics of several
different types of multimodal distributions. Before each plot, the information on the
two distributions comprising the multimodal distribution is presented in a table
(Tables 6.2-2 through 6.2-6, respectively). Included in this description are the beta value,
the eta value (characteristic life), and the portion of the population represented by each
distribution.

Table 6.2-2: Bimodal Population Example 1

Population    Beta    Eta      Portion
1             0.60    61.1     0.42
2             0.59    918.7    0.58

Figure 6.2-2: Multimodal Distribution Example 1 (Weibull probability plot)


Table 6.2-3: Bimodal Population Example 2

Population    Beta    Eta      Portion
1             0.86    341.4    0.63
2             1.4     863.25   0.37

Figure 6.2-3: Multimodal Distribution Example 2 (Weibull probability plot)


Table 6.2-4: Bimodal Population Example 3

  Population   Beta   Eta     Portion
  1            1.81   98.44   0.19
  2            1.23   679.4   0.81

[Figure: ReliaSoft Weibull++ probability plot of unreliability F(t) vs. time for this mixture; fitted parameters β1 = 1.8144, η1 = 98.4428, portion 0.1881; β2 = 1.2385, η2 = 679.4469, portion 0.8119.]

Figure 6.2-4: Multimodal Distribution Example 3


Table 6.2-5: Bimodal Population Example 4

  Population   Beta   Eta     Portion
  1            1.18   206.2   0.19
  2            4.69   497.6   0.81

[Figure: ReliaSoft Weibull++ probability plot of unreliability F(t) vs. time for this mixture; fitted parameters β1 = 1.1808, η1 = 206.1968, portion 0.1940; β2 = 4.6943, η2 = 497.6359, portion 0.8060.]

Figure 6.2-5: Multimodal Distribution Example 4


Table 6.2-6: Bimodal Population Example 5

  Population   Beta   Eta     Portion
  1            5.71   44.7    0.10
  2            4.29   483.7   0.90

[Figure: ReliaSoft Weibull++ probability plot of unreliability F(t) vs. time for this mixture; fitted parameters β1 = 5.7163, η1 = 44.7891, portion 0.0999; β2 = 4.2932, η2 = 483.7042, portion 0.9001.]

Figure 6.2-6: Multimodal Distribution Example 5


A distribution was then obtained by pooling all of the individual distributions described
previously, as shown in Figure 6.2-7. Pooling the distributions from many failure causes
randomizes the apparent failure characteristics of the resultant population. This is one of
the reasons that a constant failure rate distribution (i.e., the exponential) is usually a
reasonably good representation of a complex system's failure rate characteristics.
[Figure: ReliaSoft Weibull++ mixed-Weibull probability plot of the pooled data set (F = 498 failures / S = 0 suspensions), unreliability F(t) vs. time; fitted subpopulations β1 = 0.7208, η1 = 432.4614, portion 0.6694; β2 = 4.4232, η2 = 491.0120, portion 0.3306.]

Figure 6.2-7: Multimodal Distribution Example of Pooled Data Set
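The flattening effect can be checked numerically. The following minimal Python sketch (not part of the original analysis) evaluates the hazard rate f(t)/R(t) of the two-subpopulation mixture reported in Figure 6.2-7; the parameter values are taken from that fit, and the hazard is seen to vary much less with time than either subpopulation's hazard would alone.

```python
import numpy as np

# Two-subpopulation Weibull mixture: (beta, eta, portion) triples,
# taken from the pooled fit reported in Figure 6.2-7.
modes = [(0.7208, 432.4614, 0.6694), (4.4232, 491.0120, 0.3306)]

def mixture_R(t):
    """Mixture reliability: portion-weighted sum of subpopulation reliabilities."""
    return sum(p * np.exp(-(t / eta) ** beta) for beta, eta, p in modes)

def mixture_f(t):
    """Mixture pdf: portion-weighted sum of subpopulation Weibull pdfs."""
    return sum(p * (beta / eta) * (t / eta) ** (beta - 1.0) * np.exp(-(t / eta) ** beta)
               for beta, eta, p in modes)

for t in (50.0, 200.0, 400.0, 600.0):
    print(f"t = {t:5.0f} h   hazard f(t)/R(t) = {mixture_f(t) / mixture_R(t):.5f} /h")
```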


To illustrate the reliability theory concepts discussed above, consider an example in
which the lifetimes of people are analyzed. The data on which this analysis is based
comes from http://www.mortality.org (Reference 1) and covers individuals who died in
2006.
The raw data is contained in Figure 6.2-8, which presents the number of deaths occurring
at each age. This is the discrete version of the pdf.



[Figure: histogram of the number of deaths occurring at each age (x-axis: age, 0 to 120 years; y-axis: number of deaths, 0 to 4,000).]
Figure 6.2-8: Age at Death Data


From this graphic, it can be seen that there are several distinct distributions present. The
first is the infant mortality period, represented by Mode 1. The second, Mode 2,
represents deaths in the late teens and early twenties. The third and fourth modes
represent deaths from old age.
Next, a multimode Weibull distribution was fit to the data using ReliaSoft's Weibull++
software tool, which allows fitting failure data to multimode distributions. The results
are summarized in Table 6.2-7.
Table 6.2-7: Four Mode Weibull Distribution Parameters

  Parameter   Mode 1   Mode 2   Mode 3   Mode 4
  Beta        0.184    4.25     4.74     9.61
  Eta         0.1030   24.81    67.84    87.67
  Portion     0.0090   0.012    0.194    0.784
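As an illustration of how such a mixture is evaluated, the following short Python sketch computes the composite cumulative probability of death F(t) as the portion-weighted sum of the four subpopulation Weibull CDFs, using the parameters of Table 6.2-7 (the sketch is illustrative, not part of the original analysis):

```python
import math

# Four-mode Weibull mixture from Table 6.2-7: (beta, eta, portion) per mode.
modes = [(0.184, 0.1030, 0.0090), (4.25, 24.81, 0.012),
         (4.74, 67.84, 0.194), (9.61, 87.67, 0.784)]

def F(t):
    """Composite CDF: portion-weighted sum of the subpopulation Weibull CDFs."""
    return sum(p * (1.0 - math.exp(-(t / eta) ** beta)) for beta, eta, p in modes)

for age in (1, 20, 65, 90, 110):
    print(f"P(death by age {age:3d}) = {F(age):.4f}")
```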

The composite pdf is shown in Figure 6.2-9.


Reliability Information Analysis Center
235

Chapter 6: Interpretation of Reliability Estimates


[Figure: probability density function f(t) of the four-mode mixture, plotted for ages 0 to 110.]

Figure 6.2-9: pdf of the Multimode Distribution of Ages


The failure rate is shown in Figure 6.2-10.
[Figure: failure rate f(t)/R(t) of the four-mode mixture vs. age, plotted for ages 0 to 110.]
Figure 6.2-10: Failure Rate of Age Data


The Weibull probability plot is shown in Figure 6.2-11. Note that, in this graph, the plot
uses Weibull scales, i.e., the log of time on the x-axis and the double log of unreliability
on the y-axis. If this plot were close to a straight line, it would indicate that the
distribution could be described adequately with a mono-modal Weibull distribution.
Clearly, this is not the case.
[Figure: Weibull probability plot of the age data; unreliability F(t) vs. age on Weibull scales.]

Figure 6.2-11: Probability Plot of Age Data


Figure 6.2-12 illustrates a single-mode Weibull fit (straight line) to the data. As can be
seen, if the single-mode fit is used to estimate the probability of death at a specific age,
significant errors would result. For example, it would imply that about 20% of the
population would live to 110 years. It would also imply that there is less than a 0.001%
probability of death in the first year.
This example illustrates the fact that, if there is a sub-population of samples with
reliability behavior different from that of the main population, the TTF distributions may
manifest themselves as bimodal or multimodal. It is important that these multimodal
distributions be characterized. If one of the modes in the distribution appears as
early failures resulting from defects, this information is required to develop an
appropriate reliability screen.



[Figure: single-mode Weibull fit (straight line) to the age data on the Weibull probability plot.]

Figure 6.2-12: Single Mode Weibull Fit to the Age Data

6.3. Confidence Bounds


The topic of confidence bounds has always been important in reliability engineering,
because estimating the uncertainty associated with a reliability estimate is essential when
making decisions based on that estimate; the risk associated with being wrong must be
assessed. It is a topic that has received a tremendous amount of attention from reliability
practitioners and academicians alike.
6.3.1. Traditional Techniques for Confidence Bounds

The traditional manner in which confidence levels are calculated around failure rates is
the use of the chi-square distribution, as follows:

$$\lambda_{CL} = \frac{\chi^2_{(1-CL,\; 2r+2)}}{2t}$$

where the numerator is a value taken from a chi-square table (with 2r + 2 degrees of
freedom, r being the number of observed failures), and t is the number of device hours.
A question sometimes arises as to how the confidence bounds calculated in this manner
compare to those calculated with the use of the Poisson distribution.
From the binomial and Poisson distributions, Farachi (Reference 2) has shown that:

$$1 - CL = \sum_{k=0}^{r} \frac{n!}{k!(n-k)!}(1-q)^{n-k}q^k$$

Using the Poisson approximation of the binomial:

$$\frac{n!}{k!(n-k)!}(1-q)^{n-k}q^k \approx \frac{(nq)^k e^{-nq}}{k!}$$

Combining the above two equations yields:


$$1 - CL = \sum_{k=0}^{r} \frac{(nq)^k}{k!}e^{-nq} = e^{-nq}\left[1 + nq + \cdots + \frac{(nq)^{r-1}}{(r-1)!} + \frac{(nq)^r}{r!}\right]$$

Since:

$$nq = \lambda t$$

Then:

$$1 - CL = \sum_{k=0}^{r} \frac{(\lambda t)^k}{k!}e^{-\lambda t} = e^{-\lambda t}\left[1 + \lambda t + \cdots + \frac{(\lambda t)^{r-1}}{(r-1)!} + \frac{(\lambda t)^r}{r!}\right]$$

The chi-square value is the exact solution to the above equation. Note that the chi-square
values correspond to λt, not λ alone. Therefore, for a given confidence level and number
of failures, the chi-square tables provide the value for λt, and the chi-square values are
entirely consistent with the binomial and Poisson distributions.
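This consistency is easy to verify numerically. The sketch below uses illustrative values for r, t and CL (and assumes scipy is available); it computes the chi-square upper bound and confirms that the Poisson probability of observing r or fewer failures at that bound equals 1 - CL:

```python
from scipy.stats import chi2, poisson

r, t, CL = 5, 1.0e6, 0.90       # failures, device-hours, confidence level (illustrative)

# Upper bound on the failure rate: chi-square quantile with 2r+2 degrees
# of freedom (upper-tail area 1-CL), divided by twice the device-hours.
lam_u = chi2.ppf(CL, 2 * r + 2) / (2.0 * t)
print(f"lambda_U = {lam_u:.3e} failures/hour")

# Consistency check: P[Poisson(lambda_U * t) <= r] should equal 1 - CL.
print(f"Poisson cdf at r = {poisson.cdf(r, lam_u * t):.4f} (expected {1 - CL:.2f})")
```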
It is important to note that the confidence bounds based on the chi-square distribution
summarized above pertain to the uncertainty from statistical considerations alone. They
do not account for variations in failure rate due to other noise factors, such as:

• Uncertainty in the number of hours or failures
• Whether the failure causes are truly relevant
• Time dependencies of the failure rate

Additional information on confidence bounds is included in the section on life modeling.


6.3.2. Uncertainty in Reliability Prediction Estimates

One of the limitations of reliability predictions that are based on handbook models is that
they can only provide point estimates of failure rates. These failure rates depend on
whatever data was available to build the model, and on the model development approach.
There are no statistical confidence limits or intervals that can be associated with
handbook model data. Traditional methods are not applicable because there are many
more factors contributing to the uncertainty than the statistical-only considerations of
traditional techniques.
For example, consider the following summary of the model development and use
approach, along with the potential sources of error, as shown in Figure 6.3-1. The
sources of error are highlighted in the gray boxes. From this, it can be seen that there are
many sources of noise. The model output results reflect the cumulative effects of the
uncertainties in all of the noise sources shown.
Although a theoretical basis for the calculation of the confidence bounds around
reliability predictions is extremely difficult to derive, it is possible to empirically observe
the degree of uncertainty. Reliability predictions performed using empirical models
developed from field data result in a failure rate estimate with relatively wide confidence
bounds. Table 6.3-1 presents the multipliers of the failure rate point estimate as a
function of confidence level. This data was obtained by analyzing data on systems for
which both predicted and observed data was available. For example, using traditional
approaches, one could be 90% certain that the true failure rate was less than 7.57 times
the predicted value.


[Figure: flow of the empirical model development and use process, with the sources of error highlighted in gray boxes. Raw input data comprises item information (manufacturer, manufacturing date, quality, defect rate) and data (operating hours, times to failure, number of failures, failure relevancy, degradation vs. catastrophic failure). Modeled factors comprise environmental stresses (temperature, humidity, delta T, radiation, contaminants) and the operational profile (duty cycle, cycling rate, operating stress, electrical stress, mechanical stress, extreme events); unmodeled noise factors act alongside them. Model development is subject to censored data, biased estimators and assumptions made in modeling. The user supplies item information, environmental stresses and an operational profile to the model, which produces the model output.]

Figure 6.3-1: Sources of Error in Empirical Models



Table 6.3-1: Failure Rate Uncertainty Level Multipliers

  Percentile   Multiplier
  0.1          0.13
  0.2          0.26
  0.3          0.44
  0.4          0.67
  0.5          1.00
  0.6          1.49
  0.7          2.29
  0.8          3.78
  0.9          7.57

An interesting effect occurs when combining the distributions that describe the
uncertainties of the individual components comprising a system: the relative
uncertainties are wider at the piece-part level than at the system level. If one were to take
the distributions of failure rate from the regression analysis used to derive the component
model (i.e., the standard error estimate) and statistically combine them with a Monte
Carlo summation, the resultant distribution describing the system prediction uncertainty
would have a variance much smaller than that of the individual components comprising
the system. The reason for this is the Central Limit Theorem, which governs the variance
of summed distributions. For example, the variance around a component failure rate
estimate is higher than the variance suggested by the above table. However, the variance
in the above table is observed to be much larger than that theoretically derived by
summing the component failure rate distributions. This implies that there are
system-level effects contributing to the uncertainty that are not accounted for in the
component-based estimate.
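A small Monte Carlo sketch makes the variance-narrowing effect visible. The lognormal uncertainty distribution, the number of components and the spread used here are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_parts, n_trials = 100, 100_000

# Illustrative component-level uncertainty: lognormal spread around a
# point estimate of 1.0 (arbitrary failure rate units).
samples = rng.lognormal(mean=0.0, sigma=1.0, size=(n_trials, n_parts))

part = samples[:, 0]             # one component's failure rate distribution
system = samples.sum(axis=1)     # Monte Carlo summation to the system level

# The relative spread (coefficient of variation) shrinks roughly as 1/sqrt(n).
print(f"part CV   = {part.std() / part.mean():.2f}")
print(f"system CV = {system.std() / system.mean():.2f}")
```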
Bayesian techniques, such as those used in the 217Plus system reliability assessment
methodology, allow the refinement of analytical predictions over time to reflect the
experienced reliability of an item as it progresses through in-house testing, initial field
deployment and subsequent use by the customer. In-house testing can be comprised of
accelerated tests at the component or equipment level, reliability growth tests, and
reliability screens or accelerated screening techniques.
We will not discuss Bayesian methods in detail here. The primary benefit of using
Bayesian techniques can be implied from Figure 6.3-2, however. As more and more test
and experience data is factored into the initial analytical reliability prediction, the
statistical confidence levels represented by the outside (red) lines on the graph continue
to converge on the "true" MTBF of the subject item. Using Bayesian techniques, as
time approaches infinity, the predicted inherent MTBF and the true MTBF of the device,
product or system population become one and the same. This, of course, assumes that
MTBF is the appropriate metric, but the same situation conceptually applies to other
metrics such as failure rate and reliability (R).

[Figure: MTBF vs. time through the prediction (paper analysis), assessment (in-house testing) and estimation (field data) phases; the upper and lower confidence level curves converge on the true MTBF as data accumulates.]
Figure 6.3-2: Confidence Level Through Prediction, Assessment and Estimation

6.4. Failure Rate vs pdf


The biggest distinction to be made when assessing reliability is whether the time period
of interest for the item under analysis is in the "meat" of the TTF distribution or in the
extreme left tail of the distribution. For example, consider a system that has a five-year
design life. If an item has a mean life of three years, precautions such as preventive
maintenance would clearly be required. The reliability of these types of items is usually
easier to predict, because they can be tested to failure in relatively short times and small
sample sizes will usually suffice.
On the other hand, consider a component that has a failure rate of 2 FITs, typical for
many modern electronic components. Over the five-year design life, assuming continuous
operation, the reliability would be:

$$R = e^{-\lambda t} = 0.999912$$

or, a probability of failure of 0.000088.


Therefore, if there were 10,000 of these components operating in a system, the expected
number of failures in the five year period would be less than one.
Predicting the reliability behavior of a failure cause based on the extreme left tail of the
TTF distribution of the main population is dangerous, since the accuracy of the
distribution breaks down in its extreme tails. As an example, consider a state-of-the-art
integrated circuit. One failure mechanism is electromigration of the metal lines.
Manufacturers will typically perform life tests of the metal line structures to assess their
lifetime. These tests are done in a manner similar to the practices detailed in this book.
They are accelerated tests performed under a variety of temperature and current density
conditions. Failure times are collected and models are developed to predict lifetimes
under deployment conditions. A goal of a good manufacturer is to design the metal lines
such that the probability of failure is acceptably low when the part is used under specified
conditions. While, as stated, the models developed can be used to estimate the reliability
under deployment conditions, rarely will the prediction be reasonably close to the
observed failure data. The reasons for this are:

• The distribution is usually not mono-modal
• Manufacturing variability is difficult to account for in the model
• Extreme events, such as defects in the metal lines, will only manifest themselves after very large sample sizes are tested or fielded

A multimode distribution can be used to model this situation, the first mode being
applicable to the defects, and the second being applicable to the main population.

Two FITs is defined as 2.0 failures per billion hours. This corresponds to 0.002 failures per million hours.


However, in many cases, it is only the first mode that will impact the field reliability
within the useful life of the component.
Some researchers have attempted to use extreme value statistics for such cases, but they
also have limited usefulness because the data on low failure rate items, like electronic
components, is generally not consistent with these distributions. As a result, low failure
rate items are usually modeled with a constant failure rate (exponential distribution), or a
Weibull distribution. The Weibull is usually used in this case to model the effects of
infant mortality.

6.5. Practical Aspects of Reliability Assessments


There are very often serious constraints put on practicing reliability engineers. Due to
limitations of time, cost, test resources, availability of data, limitations of modeling
capabilities, and lack of understanding of failure physics, analysts often need to do the
best they can with what they have to work with. This is usually compounded in small-
and medium-sized companies, which often lack the resources needed to execute many of
the analysis techniques described in this book.
Companies engaged in highly competitive industries face extreme time pressures, which
is in stark contrast to the tenets of good reliability engineering practices. The goal of the
reliability engineer should be to select an optimal approach that achieves the desired
purpose of the analysis, while conforming to the practical constraints to which he or she
is subjected.

6.6. Weibayes
There are many cases in reliability modeling in which there are few or no failures. For
these, a Weibayes technique can be used, provided a reasonable shape parameter can be
estimated. This approach essentially fixes a plotting position using:
1. One failure assumed at the end of the test duration
2. A line drawn through the median rank point with an assumed beta
The result of this analysis is a lower single-sided bound of the life distribution. As an
example, consider the following case:
1. 50 samples are tested for 1000 hours, with no failures
2. Data from other testing indicates a beta of 3 is appropriate

3. The median rank at 1000 hours is 1.39%. A line is drawn through this point with
a beta slope of 3.
This is shown in Figure 6.6-1.

Figure 6.6-1: Weibayes Example
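The arithmetic behind this example can be sketched in a few lines of Python. The median rank is computed with Benard's approximation, which reproduces the 1.39% plotting position quoted above, and the characteristic life is then solved from the Weibull CDF (this sketch is illustrative, not the Weibull++ procedure):

```python
import math

n, t, beta = 50, 1000.0, 3.0   # samples on test, failure-free hours, assumed beta

# Median rank for one assumed failure among n samples (Benard's approximation);
# this reproduces the 1.39% plotting position quoted above.
mr = (1 - 0.3) / (n + 0.4)
print(f"median rank = {100 * mr:.2f}%")

# Solve F(t) = mr for eta, with F(t) = 1 - exp(-(t/eta)^beta).
eta = t / (-math.log(1.0 - mr)) ** (1.0 / beta)
print(f"lower-bound characteristic life: eta = {eta:.0f} hours")
```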

6.7. Weibull Closure Property


In cases where it is desired to estimate the time-to-first-failure (TTFF) of a product or
system comprised of multiple items, the Weibull closure property can be used. Here, the
characteristic life of the Weibull distribution of time to first failure is:

$$\eta_s = \left[\sum_{i=1}^{n}\left(\frac{1}{\eta_i}\right)^{\beta}\right]^{-1/\beta}$$


where η_i and β represent the Weibull distribution parameters of the individual items. This is
applicable when β is the same for all items, but η_i can be different for each item.
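A minimal sketch of this calculation, under the common-β assumption (the item η values here are hypothetical):

```python
def ttff_eta(etas, beta):
    """Characteristic life of the time-to-first-failure distribution of a
    system of items sharing a common Weibull shape parameter beta."""
    return sum((1.0 / eta) ** beta for eta in etas) ** (-1.0 / beta)

# Hypothetical example: three items with a common beta of 2.0.
print(f"system eta = {ttff_eta([500.0, 800.0, 1200.0], beta=2.0):.0f} hours")
```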

6.8. Estimating Event-Related Reliability


Many cases arise in estimating reliability where a failure cause of a device under analysis
is event-related. For example, if a hand-held device is susceptible to failure when it is
dropped, the failure rate (or hazard rate, if a time-varying failure rate distribution is used)
is a function of the:

• rate at which drops occur
• distribution of the drop height
• relationship between drop height and G-shock level
• probability of failure as a function of G-level

The failure rate is expressed as:

$$\lambda(t) = d(t) \cdot P\!\left[h_d \cdot \frac{G}{h} > G_{th}\right]$$

where:

λ(t) = the failure rate of the device due to shock-related failure causes
d(t) = the rate at which the drops occur
h_d = the drop height distribution
G/h = the relationship between the G-level and the drop height
G_th = the failure threshold distribution

Since h_d and G_th are random variables described by distributions, λ(t) can generally be
estimated with a Monte Carlo analysis, as described earlier in this book.
In this case, the conditional probability of failure if the device is dropped is:

$$P\!\left[h_d \cdot \frac{G}{h} > G_{th}\right]$$

This is essentially a stress/strength interference model.
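Such a Monte Carlo estimate can be sketched as follows. All of the distributions, the drop rate and the G-per-meter conversion used here are hypothetical placeholders, chosen only to illustrate the interference calculation:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000                                               # Monte Carlo trials

# Hypothetical distributions and constants, for illustration only:
drop_height = rng.lognormal(mean=0.0, sigma=0.5, size=n)  # drop height, meters
g_level = 800.0 * drop_height                             # stress: G per meter of drop
g_threshold = rng.normal(loc=1500.0, scale=300.0, size=n) # strength: failure threshold, G

# Stress/strength interference: failure when the G-level exceeds the threshold.
p_fail = np.mean(g_level > g_threshold)
drops_per_year = 10.0
print(f"P(fail | drop) = {p_fail:.4f}")
print(f"expected failures per device-year ~ {drops_per_year * p_fail:.3f}")
```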


6.9. Combining Different Types of Assessments at Different Levels
Practicing reliability engineers are usually faced with the challenge of making reliability
estimates of a product or system based on imperfect, noisy data and information. The
engineer must utilize the data that is available, and additional data that is feasible to
obtain, and combine this information to estimate the product or system reliability.
The 217Plus methodology summarized previously in this book presents one possible
approach: using the initial estimated reliability based on predictions made from
empirical models, and combining it with empirical data on the same product or system.
This combination is done using Bayesian principles. This is a general approach that can
be extended to include the combination of estimates from different methods made at
different levels. For example, consider the case summarized in Table 6.9-1. It may be
possible to characterize specific failure causes with one of the physics-based techniques
summarized herein, but it may be unlikely that all failure causes can be modeled in
this manner.
Table 6.9-1: Example of Combining Different Types of Models

  Item                       Available Reliability Estimate
  Assembly                   Life Test Data
    Component A              Life Test Data
      Failure Cause 1        Physics Model
      Failure Cause 2        Field Data on Similar Item
      Failure Cause 3        Physics Model
    Component B              Field Data
      Failure Cause 1        Life Test Data
      Failure Cause 2        Field Data
      Failure Cause 3        Physics Model

In this example, the objective is to estimate the reliability of the assembly, which is
comprised of two components. Component A has physics-based models available for
two of its three primary failure causes.
A preliminary estimate of the failure rate of Component A is:

$$\lambda_{A\text{-}preliminary} = \lambda_1 + \lambda_2 + \lambda_3$$

where λ1, λ2 and λ3 are the failure rates obtained from the model or data available on each
failure cause. Of course, these values should represent the failure rate under the use

conditions for which the assessment is to be made. In this example, λ is used, which
indicates a constant failure rate. However, if the failure rates are time-dependent, the
corresponding time-dependent failure rates or hazard rates can be used. Also, the
methodology to be illustrated in this example is similar to the data combination
methodology described in the 217Plus section, the main difference being that this
example deals with the situation in which there are different types of data at different
hierarchical levels of the product or system, whereas the 217Plus methodology deals with
different types of data within the same configuration item.
Now, since Component A has life test data available from tests performed on the
component, λ_A-preliminary is the failure rate estimate before accounting for the life test data
on the entire component. This life data will account for any failure causes not included in
the three failure causes considered, and it will also provide additional data on the three
failure causes considered. A better estimate of reliability can be obtained by combining
λ_A-preliminary with the life test data, using Bayesian techniques. This technique accounts
for the quantity of data by weighting large amounts of data more heavily than small
amounts. λ_A-preliminary forms the prior distribution, comprised of a0 (the equivalent number
of failures) and a0/λ_A-preliminary (the equivalent number of hours). The empirical data (i.e.,
test data in this case) is combined with λ_A-preliminary using the following equation:

$$\lambda_A = \frac{a_0 + \sum_{i=1}^{n} a_i}{\dfrac{a_0}{\lambda_{A\text{-}preliminary}} + \sum_{i=1}^{n} b_i}$$

λ_A is the best estimate of the Component A failure rate, while a0 is the equivalent
number of failures of the prior distribution corresponding to λ_A-preliminary. For these
calculations, a value of 0.5 should be used for a0 unless a tailored value can be derived; an
example of this tailoring is provided in Section 2.6 of this book. The equivalent number of
hours associated with λ_A-preliminary is represented by a0/λ_A-preliminary. The number of
failures experienced in each source of empirical data is a1 through an. There may be n
different sources of data available (for example, each of the n sources corresponds to
individual tests or field data from the population of products). The equivalent number of
cumulative operating hours experienced for each individual data source is b1 through bn.
These values must be converted to equivalent hours by accounting for any accelerating
effects between the test and use conditions.
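A minimal sketch of this combination rule (the prior failure rate and the test data below are hypothetical):

```python
def combine(lam_prelim, data, a0=0.5):
    """Bayesian combination of a preliminary failure rate with empirical data.
    data is a list of (failures, equivalent cumulative hours) tuples."""
    return (a0 + sum(a for a, b in data)) / (a0 / lam_prelim + sum(b for a, b in data))

# Hypothetical numbers: prior of 2 failures per 10^6 hours, combined with one
# failure-free life test of 500,000 equivalent hours.
lam_A = combine(2.0e-6, [(0, 5.0e5)])
print(f"lambda_A = {lam_A * 1e6:.2f} failures per million hours")
```

Note how failure-free test hours pull the combined estimate below the prior, with the amount of movement governed by how much test data is available relative to the equivalent hours of the prior.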
The same methodology is applied to Component B, and λ_B is obtained.

The same methodology is, in turn, applied at the parent-level assembly, in which case the
preliminary estimate is:

$$\lambda_{Assembly\text{-}preliminary} = \lambda_A + \lambda_B$$

and the parent assembly failure rate becomes:

$$\lambda_{Assembly} = \frac{a_0 + \sum_{i=1}^{n} a_i}{\dfrac{a_0}{\lambda_{Assembly\text{-}preliminary}} + \sum_{i=1}^{n} b_i}$$

where a0 is the equivalent number of failures of the prior distribution corresponding to
λ_Assembly-preliminary, and the values for ai and bi correspond to the assembly life test data.

6.10. Estimating the Number of Failures


There are many cases in which the desired outcome of a reliability analysis is the
expected number of failures. This is appropriate, for example, when calculating spares
requirements or warranty returns. The techniques described in this book are useful for
estimating either failure rates or probability of failure.
If the outcome of the analysis is a failure rate, then the expected number of failures is:

$$N_f = \lambda t$$

where:

N_f = the number of expected failures
λ = the failure rate
t = the cumulative operating time

This can be seen by reviewing the units in this relationship:

$$N_f = \lambda t = \frac{\text{failures}}{\text{operating hours} \times \#\,\text{parts}} \times \left(\text{operating hours} \times \#\,\text{parts}\right) = \text{failures}$$


This equation is usually used for repairable systems.


If the output of the analysis is a life model that describes the distribution of TTFs for a
specific set of conditions, the number of failures is:

$$N_f = N\left[F(t_2) - F(t_1)\right]$$

where:

N_f = the number of expected failures
N = the total number of parts in the population
F(t1) = the cumulative probability function at time t1
F(t2) = the cumulative probability function at time t2

and t1 and t2 are the times between which the failure probability is to be evaluated. In this
case, since F is a (unitless) probability value, the total population is scaled by the
probability of failure in the time interval of interest. This is identical to the expected
value of the binomial distribution of the number of failures.
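Both forms of the calculation are sketched below; the failure rate, Weibull parameters, population size and time interval are illustrative values only:

```python
import math

# Repairable-system form: N_f = lambda * t (cumulative operating hours).
lam, t = 2.0e-6, 1.5e6
print(f"repairable system: N_f = {lam * t:.1f} failures")

# Life-model form: N_f = N * [F(t2) - F(t1)], here with a Weibull life model.
N, beta, eta = 10_000, 2.0, 50_000.0
F = lambda time: 1.0 - math.exp(-(time / eta) ** beta)
t1, t2 = 10_000.0, 20_000.0
print(f"life model: N_f = {N * (F(t2) - F(t1)):.0f} failures")
```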

6.11. Calculation of Equivalent Failure Rates


In many cases, it is advantageous to calculate an equivalent failure rate from the results
of a reliability model that yields a non-constant failure rate as its output. For example, if a
reliability model estimates that a certain percent fail will occur at a given time (based on
the non-constant failure rate model), the equivalent constant failure rate can be calculated
as follows:
The reliability function for a constant failure rate is:

$$R = e^{-\lambda t}$$

The equivalent failure rate can be obtained by solving the above equation for the failure
rate:

$$\lambda = -\frac{\ln(R)}{t}$$


The resulting failure rate value is equal to a failure rate that will result in the same
cumulative percent fail as predicted by the non-constant model at the specific time that
the reliability is calculated. If a different time is chosen, a different value will be
obtained.
This technique can be used when the reliability of some parts of a system is calculated
with non-constant failure rate models and others are calculated with a constant failure
rate. It can also be used when modeling one-shot devices, which will simply have a
probability of failure instead of a failure rate.
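A minimal sketch of this conversion, assuming a Weibull life model with illustrative parameters:

```python
import math

# Non-constant (Weibull) life model, illustrative parameters:
beta, eta, t = 2.0, 10_000.0, 5_000.0
R = math.exp(-(t / eta) ** beta)     # reliability at the time of interest

# Equivalent constant failure rate giving the same cumulative percent fail at t.
lam_eq = -math.log(R) / t
print(f"R({t:.0f} h) = {R:.4f}   lambda_eq = {lam_eq:.2e} failures/hour")
```

Re-running the sketch with a different t yields a different lambda_eq, consistent with the time dependence noted above.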

6.12. Failure Rate Units


The output of a reliability model can include a host of potential metrics, including:

• Mean life
• Median life
• MTBF
• Failure rate
• Time to X% fail
• B10 life
• Distribution parameters:
  o Weibull characteristic life and shape parameter
  o Lognormal mean and standard deviation

If a constant failure rate distribution is used, there are various units of failure rate
possible. Some of these are:

• Failures per hour
• Failures per million hours
• Failures per billion hours
• Percent failure per thousand hours

"Failures per hour" is the fundamental unit. All of these failure rate units can be
translated to each other with a constant multiplication factor. For example, "failures per
million hours" times 1000 equals "failures per billion hours", and "percent failure per
thousand hours" is equivalent to "failures per hundred thousand hours".
In the above cases, the life unit shown is in hours (i.e., time), but it does not necessarily
need to be. Other possible life units are cycles, miles, missions, operations, etc.

Additionally, if the life unit in the above-listed metrics is time (hours), it can refer to the
number of operating hours, calendar hours, flight hours, etc. Reliability prediction
methods like MIL-HDBK-217 use operating hours as the life unit, whereas 217Plus uses
calendar hours. Calculation of the operating failure rate using MIL-HDBK-217 makes the
implicit assumption that the failure rate during non-operating periods is zero, unless the
non-operating failure rate is otherwise accounted for.
However, in all cases, the life unit refers to the cumulative value of the population. For
example, if the failure rate unit of "failures per million hours" is used, the million hours
refers to the cumulative time of the entire population, i.e., the sum of each component's
number of hours.

6.13. Factors to be Considered When Developing Models


This section discusses a few of the factors that should be considered in the development
of a reliability model. It is by no means an exhaustive list, but it is included here to give
the reader ideas on the types of factors that should be considered.
6.13.1. Causes of Electronic System Failure

An assumption often made when using traditional reliability prediction methodologies is
that the failure rate of a product or system is primarily determined by the components
comprising the system. However, a significant number of failures also stem from
non-component causes, such as defects in design and manufacturing. Historically, these
factors have not been explicitly addressed in prediction methods. The data in Figure
6.13-1 contains the nominal percentage of failures attributable to each of eight identified
predominant failure causes, based on failure mode data collected by the RIAC on
electronic systems.


Figure 6.13-1: Nominal Failure Cause Distribution of Electronic Systems


The definitions of the failure causes are:

Parts (22%): Failures resulting from a part (i.e., microcircuit, transistor, resistor, connector, etc.) failing to perform its intended function. Examples include part failures due to poor quality; manufacturer or lot variability; or any process deficiency that causes a part to fail before its expected wearout limit is reached.

Design (9%): Failures resulting from an inadequate design. Examples include tolerance stack-up, unanticipated logic conditions (e.g., sneak paths), a non-robust design for given environmental stresses, etc.

Manufacturing (15%): Failures resulting from anomalies in the manufacturing process that are not related to the inherent reliability of a part, i.e., faulty solder joints, inadequate wire routing resulting in chafing, bent connector pins, etc.

System Management (4%): Failures traceable to faulty interpretation of system requirements, imposition of bad requirements (missing, inadequate, ambiguous or contradictory), or failure to provide the resources (funding and/or personnel) required to design and build a reliable product or system.

Wearout (9%): Failures resulting from wearout-related failure mechanisms due to basic device physics. Examples of electronic components exhibiting wearout-related failure mechanisms are electrolytic capacitors, solder joints, microwave tubes (such as TWTs), and switch and relay contacts.

No Defect (20%): Perceived failures that cannot be reproduced upon further testing. These may or may not be actual failures; however, they are removals and, therefore, are typically counted toward the logistic failure rate (or MTBF). Examples include the inability of the maintenance environment to recreate the operational environmental stresses under which the original failure occurred, or looser tolerances on the test equipment than on the platform or system from which the defective unit was taken.

Induced (12%): Failures resulting from an externally applied stress. Examples are electrical overstress and maintenance-induced failures (i.e., dropping, bending pins, etc.).

Software (9%): Failures of a system to perform its intended function due to the manifestation of a software fault.

While there are reliability assessment methods for the specific causes listed above (i.e.,
components, software, etc.), there are few methodologies that attempt to take a holistic
view of system reliability and integrate them into a single methodology. One example of
a methodology that attempts to do this is 217Plus, which is described in Chapter 7.
6.13.2. Selection of Factors

The process of reliability assessment can be viewed as an IPO model, which has input
parameters (I), the process or models used to assess the reliability as a function of those
input parameters (P), and an output (O). This is illustrated in Figure 6.13-2.


[Figure: IPO block diagram. Inputs (initial conditions and stresses) feed the process, which produces outputs (reliability metrics).]

Figure 6.13-2: IPO Model
Examples of the IPO variables, as applied to reliability modeling, are shown in Table
6.13-1.
Table 6.13-1: Factors to be Considered in a Reliability Model

Input
  Initial Conditions
    Defect-free (intrinsic): material property variation, geometry variation
    Defects (extrinsic): voids, contamination, ionic contamination, organic contamination, crystal defects, stress concentrations, nonconductive particles, conductive particles
  Stresses
    Operational: thermal, electrical, chemical, optical
    Environmental: temperature (high and low), temperature cycling, humidity, vibration, mechanical shock, drop, chemical exposure, salt fog, UV exposure, atmospheric pressure (low and high), radiation (EMI, cosmic), sand and dust

Table 6.13-1: Factors to be Considered in a Reliability Model (continued)

Process: the reliability assessment process, using the various techniques described in this book
Output: mean life, median life, MTBF, failure rate, time to X% fail, B10 life, distribution parameters (Weibull characteristic life and shape parameter; lognormal mean and standard deviation)

Additional information will be discussed in Chapter 8.


Given the stochastic nature of reliability prediction for many failure causes, it is
impossible to develop a model that is adequately sensitive to all conceivable factors, at
least in all but the simplest of cases. That's why model developers need to select what are
believed to be the most relevant factors, and then model accordingly.
This highlights the fact that reliability assessment falls into two distinct categories: the
modeling of intrinsic and of extrinsic failure causes. Intrinsic failure causes are generally
those whose root cause is a known failure mechanism that affects the entire population of
product. These can often be predicted within acceptable bounds by understanding the
stresses, the material properties, etc.
Extrinsic failure causes are those resulting from unpredictable causes, often a complex
sequence of events that ultimately results in the failure. Unfortunately, many real-world
situations fall into this category; unfortunate because these are the causes whose
likelihood is most difficult to predict. Generally, components that have very low failure
rates are governed by these mechanisms. It is often the unexpected, unpredictable things
that happen somewhere upstream in the process, or in the supplier's process. This is the
premise behind the 217Plus system assessment methodology. While it is difficult to
predict the likelihood of these extreme events, or even identify the failure cause a priori,
it is possible to assess controllable factors that have a relationship to the likelihood of
experiencing the failure cause.
6.13.3. Reliability Growth of Components

Another issue facing reliability model developers is the manner in which reliability
growth is accounted for. A good model reflects state-of-the-art technology. However,
empirical models are usually developed from the analysis of field data, which takes time
to collect. The faster the growth, the more difficult it is to derive an accurate (i.e.,
current) model.
As an example of this reliability growth effect, Table 6.13-2 contains, for each generic
electronic component type, the growth rate that has been observed from data collected by
the RIAC. These reliability growth factors are included in the 217Plus component
models. The growth rate model used for each component for this purpose is:

$$\lambda_{t_1} = \lambda_{t_2} \, e^{-\alpha (t_1 - t_2)}$$

where:

λ_t1 = the estimated failure rate as a function of the year of manufacture
λ_t2 = the failure rate observed from the collected data
α = the growth rate
t1 = the year of part manufacture for which a failure rate is estimated
t2 = the year of manufacture of the parts on which the data was collected
Table 6.13-2: Failure Rate Data Summary

  Component Type                             Growth Rate (α)
  Capacitor, Ceramic                         0.0082
  Capacitor, Electrolytic                    0.229
  Capacitor, Tantalum                        0.229
  Connectors                                 0.23
  Diode, General Purpose                     0.223
  Diode, Schottky                            0.297
  Diode, Zener                               0.150
  IC, Digital, Nonhermetic                   0.473
  IC, Hermetic (All Types)                   0.33
  IC, Linear, Nonhermetic                    0.293
  IC, Memory/Microprocessor, Nonhermetic     0.479
  Inductors                                  0.0
  LED                                        0.34
  Optoelectronic Devices                     0.087
  Relays                                     0.0
  Resistors, All Types                       0.00089
  Switches                                   0.0
  Thyristors                                 0.20
  Transformers                               0.0
  Transistor, Bipolar                        0.281
  Transistor, FET, N-Channel                 0.397
  Transistor, Microwave                      0.269
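Under the growth model as reconstructed above, adjusting a data-derived failure rate to a different year of manufacture is a one-line calculation. The sketch below uses the digital nonhermetic IC growth rate from Table 6.13-2 together with hypothetical years and a hypothetical data-derived failure rate:

```python
import math

alpha = 0.473              # growth rate for "IC, Digital, Nonhermetic" (Table 6.13-2)
t1, t2 = 2010, 2000        # year being predicted for vs. year of the underlying data
lam_data = 10.0            # failure rate observed in the data (arbitrary units)

# Failure rate adjusted for reliability growth between the two manufacture years.
lam_est = lam_data * math.exp(-alpha * (t1 - t2))
print(f"growth-adjusted failure rate = {lam_est:.3f} (same units)")
```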


6.13.4. Relative vs. Absolute Humidity

There are many failure mechanisms that are accelerated by the combination of
temperature and humidity. When modeling failure causes that are a function of humidity,
a question arises as to whether the model should be a function of relative humidity or
absolute humidity. The appropriate metric to use will depend on whether the failure
cause is a function of the absolute amount of water at the surface of the item under
analysis. If this is the case, absolute humidity is probably the appropriate measure. The
relationship between absolute and relative humidity is illustrated in Figure 6.13-3.

Figure 6.13-3: Relationship Between Absolute and Relative Humidity

6.14. Addressing Data with No Failures


In many cases, reliability estimates are made with data containing few or no failures, and
the analyst must be careful when using such data. The true failure rate of a component is
only available after prolonged operation, but reliability estimates are usually required
before this data becomes available; in other words, the analyst needs a leading indicator
of reliability, not a lagging indicator. Before the component has accumulated enough
operating time to estimate the true failure rate, there may be some data available, often a
certain number of operating hours with no observed failures. A common way of utilizing
this data is to estimate a single-sided confidence level of the failure rate, based on the
observed number of operating hours.
As an example, consider a situation in which a component's true failure rate is 0.1
failures per million operating hours. Figure 6.14-1 illustrates the 60% and 90% upper
bound estimates as a function of the number of operating hours. For example, if there are
1 million observed operating hours, then the upper bound of the failure rate, at a 60%
confidence level, is 0.916 failures per million hours; in other words, there is 60%
confidence that the true failure rate is less than 0.916 F/10^6 hours. Only after there have
been a total of 6 to 8 million operating hours is the 60% upper bound a reasonable
estimate.

Figure 6.14-1: Estimated Upper Bound Failure Rates vs. Operating Time at 60% and 90% Confidence
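For the zero-failure case, the chi-square expression reduces to λ_U = -ln(1 - CL)/t (the chi-square quantile with 2 degrees of freedom is -2 ln(1 - CL)). The following sketch reproduces the 0.916 failures-per-million-hours figure quoted above and shows how the bound tightens with accumulated hours:

```python
import math

# Zero-failure upper bound: lambda_U = -ln(1 - CL) / t.
for hours in (1.0e6, 4.0e6, 8.0e6):
    lam60 = -math.log(1 - 0.60) / hours
    lam90 = -math.log(1 - 0.90) / hours
    print(f"t = {hours:.0e} h: 60% bound = {lam60 * 1e6:.3f}, "
          f"90% bound = {lam90 * 1e6:.3f} F/10^6 h")
```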
Using single-sided failure rate bounds for reliability estimates can be dangerous, because
they can be very pessimistic. Exactly how pessimistic is determined by the number of
operating hours relative to the true failure rate. Moreover, if the upper bound is used on
multiple components in an assembly, then the pessimism in the assembly failure rate
estimate is compounded.

The Bayesian techniques described previously are a way to address the issue of few or no
failures. This is, in fact, the premise of the 217Plus methodology. This approach, while
it requires a prior estimate, can alleviate the pessimistic nature of reliability estimates
made only from an observed number of hours with no failures.
Another related approach is to pool like data together for the purpose of estimating a
failure rate. For example, if a component has no failures, but there is also data available
on other components within the family of components, the data can be combined. An
example of this approach is described in the section on NPRD (Section 7.4). In that case,
the pooling occurs as a function of part type, quality and environment. The algorithm
used in that case was similar to a Bayesian approach, but was tailored to the specific
constraints of the data.

6.15. Reliability of Components Used Outside of Their Rating


A significant issue with the application of commercial microcircuits in severe
environments is the temperature rating of the part. The temperature range over which a
manufacturer will guarantee performance is limited to that of a commercial part, i.e.,
typically 0 to 70 degrees C. Military and aerospace applications often require guaranteed
performance over wider temperature ranges, i.e., -55 to 125 degrees C. While this is not a
reliability prediction issue per se, it does confound the definition of failure criteria. For
example, although a part may not perform beyond its rated temperature, it usually does
not catastrophically fail and, therefore, is not considered a reliability failure. However,
many practitioners do consider this a reliability issue and, as such, turn to reliability
models for the quantification of microcircuit reliability in their specific extended-range
applications. There are no reliability models currently available that can quantify the
reliability of parts used beyond their ratings; all existing models make the implicit
assumption that parts are used within their ratings. A separate, but critical, requirement
for the reliable application of components is the qualification of parts and manufacturers
to ensure that specific parts will function reliably in the intended application.
The application of a component beyond its rated value of stress can result in one or more
undesired effects. First, there can be reliability ramifications, which can manifest
themselves in a variety of ways: either as a sudden, catastrophic failure or as a latent
failure. The detectability of the first is much better, since it can be observed with product
or system testing. Latent failures are much more difficult to detect, and require more
testing and modeling using the techniques described in this book. The second type of
undesired effect is related to component performance. Performance characteristics can
either be permanently degraded, or they may be subject to a reversible process in which
the performance recovers after the overstress condition is removed. In any event, these
possible undesired effects should be studied and understood before applying components
beyond their rated stress values.

6.16. References
1. http://www.mortality.org
2. Farachi, V., "Electronic Component Failure Rate Prediction Analysis", RIAC Journal, November 2006.


7. Examples

This chapter presents several examples of reliability models that are intended to provide a
cross section of several different methodologies. The focus of the examples is to present
methodologies that the author has personally developed and, thus, can provide insight
into the logic and rationale for their development. Several examples were previously
presented in Chapter 2, but not in detail; this section presents more detail regarding
model factors, development methods, etc.
The following examples are provided:
1. MIL-HDBK-217 Model Development Methodology The generic modeling
methodology for many of the models contained in MIL-HDBK-217 is presented
in this section. Not all of the models in the handbook have been developed using
this methodology, but the majority have been. This is presented so that the reader
can gain an understanding of the approach and methodology used, and to provide
insight into the decisions faced by the model developer.
2. 217Plus Reliability Models 217Plus is the methodology developed by the
RIAC to fill the void left after MIL-HDBK-217 was no longer scheduled to be
updated. The approach taken in the development of this methodology was quite
different from that used for MIL-HDBK-217. It was intended to be a
holistic approach in which all primary causes of electronic system failure were
accounted for. Therefore, factors addressing non-component reliability were
considered. It was also intended to be holistic in terms of its ability to leverage
experience from predecessor systems, and utilize information from empirical
testing. The general approach for this methodology was previously presented in
Chapter 2 in the Combining Data section. The additional information presented
in this section presents the details on the remaining portions of the methodology.
Additionally, the development of models for several different components is
presented. First is the development of the original twelve electronic part types.
For these models, sufficient field reliability data was available. The second
component models presented are for photonic component types. For these, very
little field data was available, and, therefore, the original 217Plus approach
needed to be tailored.
3. Life Model Example The intent of the life modeling example that will be
presented is to illustrate an application of the life modeling methodologies
previously discussed. This is a hypothetical example, but provides information
pertaining to the various elements of life modeling.

4. NPRD This section, covering the RIAC Nonelectronic Parts Reliability Data
(NPRD) publication, is presented to illustrate the nuances of field reliability data,
the manner in which data is merged, and the manner in which it is used in
reliability modeling. Some of this information was previously presented in
Chapter 2 in the section on the use of field data, but more detail will be presented
here. This will hopefully provide the user with an appreciation for both the uses
and limitations of this type of data.
The examples presented in this section were selected to provide a cross-section of various
methodologies, including prediction, assessment and estimation, and to complement the
information previously provided in Chapter 2.

7.1. MIL-HDBK-217 Model Development Methodology


MIL-HDBK-217 is probably the most widely used of the empirically-based reliability
prediction methodologies. The basic premise of the handbook is the use of historical
piece-part test and field failure rate data as the basis for predicting future product or
system reliability. The handbook includes failure rate models for most electronic part
types, and many electromechanical part types. The latest version of MIL-HDBK-217 is
F, Notice 2, dated 28 February 1995. The handbook was almost a casualty of Perry's
DoD Acquisition Reform initiative, but it survived, primarily due to the wide use of, and
dependency on, the methodology throughout the military-industrial complex and the lack
of a suitable replacement.
The models that are currently contained in MIL-HDBK-217 have been developed by
various organizations, which use various techniques for their development. However,
Reference 1 will be used to illustrate a typical model development methodology. The
study documented in this report developed the models for discrete semiconductor
devices. Excerpts from this report are summarized within this section. The model
development methodology is shown in Figure 7.1-1. Each of the elements in this
methodology is further examined below.

(As noted previously, as of the publication date of this book, a draft of MIL-HDBK-217G is currently in the works, with an anticipated release in 2010.)


Figure 7.1-1: MIL-HDBK-217 Model Development Methodology



7.1.1. Identify Possible Variables

The first step in the modeling methodology is to identify possible model factors. In this
example, the possible factors were:

• Device Style
• Power Rating
• Package Type
• Semiconductor Material
• Structure (NPN, PNP)
• Electrical Stress
• Circuit Application
• Quality Level
• Duty Cycle
• Operating Frequency
• Junction Temperature
• Application Environment
• Complexity
• Power Cycling

7.1.2. Develop Theoretical Model

A series of theoretical failure rate prediction models is hypothesized to provide the
resultant models with a sound theoretical/engineering backing. Basically, theoretical
model development involves evaluation of the effects of the parameters identified in the
previous phase. In addition, the optimal model form (i.e., additive, multiplicative, or a
combination) is determined and the time dependency of the discrete semiconductor
failure rates is studied.
The development of the theoretical device failure rate prediction models is an integral
part of the overall model development process. Information collected through literature
searches and discrete semiconductor user and vendor surveys is reviewed and evaluated
to aid in the development of theoretical models for each discrete semiconductor device
type group. The theoretical models serve the following functions:
1. Assure that the prediction models conform to physical and chemical principles
2. Select variables when it is not possible to determine them using purely statistical techniques
In general terms, the theoretical models were of the following form.
$$\lambda = \lambda_b \, \pi_T \, \pi_E \, \pi_Q \prod_{i=1}^{n} \pi_i$$

where:

λ = the theoretical failure rate prediction
λ_b = the base failure rate, dependent on device style
π_T = the temperature factor (based on the Arrhenius relationship)
π_E = the environment factor
π_Q = the quality factor, based upon device screening level and hermeticity
∏π_i = the product of pi factors based upon variables, from the potential list of input variables, found to have a significant effect on the discrete semiconductor failure rate
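A sketch of how such a multiplicative model is evaluated is shown below. The base failure rate, activation energy, environment and quality factors are illustrative assumptions, not handbook values; only the Arrhenius form of the temperature factor follows the model above:

```python
import math

def lambda_p(lam_b, t_junction_k, ea=0.4, pi_e=4.0, pi_q=1.0, t_ref_k=298.0):
    """Multiplicative failure rate model with an Arrhenius temperature factor.
    All numeric defaults are illustrative assumptions, not handbook values."""
    k = 8.617e-5                         # Boltzmann constant, eV/K
    pi_t = math.exp(-(ea / k) * (1.0 / t_junction_k - 1.0 / t_ref_k))
    return lam_b * pi_t * pi_e * pi_q

# Illustrative base failure rate of 0.01 F/10^6 h at a 348 K junction temperature:
print(f"lambda_p = {lambda_p(0.01, 348.0):.3f} failures per million hours")
```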
7.1.3. Collect and QC Data

The collection of empirical reliability data is integral to the approach used in model
development. Four specific data collection tasks were defined.
The first task was a system/equipment identification process. A survey of numerous
military equipments was conducted to identify system/equipments meeting predetermined
criteria established to ensure plentiful and accurate data.
The second task was an extensive survey of discrete semiconductor manufacturers and
users.
The third task was in-person visits to organizations where data could not be accessed by
other means.
The final data collection task was the compilation of data referenced in the literature and
documented technical studies. Also, as part of this task, additional contact was made
between the authors and/or study sponsors to determine whether more data was available.
The results of the four specific data collection tasks are described in the following
sections.
Five minimum criteria were established to define an acceptable data source. Each
potential equipment selection was evaluated with these criteria before proceeding with
data summarization. These five criteria were:

1. Data available to the part level
2. Primary failures could be separated from total maintenance actions
3. Sufficient detail, including stress levels, could be identified for the components
4. Part hours could be precisely determined
5. Sufficient equipment hours existed to expect discrete semiconductor failures

In addition to these criteria, the following factors were considered:


1. Number of different discrete semiconductor part types
2. Existence of low-population and state-of-the-art parts
3. Application environment
4. Age of data

Data summarization consisted of the extraction and compilation of the desired data
elements from the source reports and/or supporting documentation, and coding the data
for computer entry. Data summarization consisted of the following five tasks for sources
of field data:
1. Identification of discrete semiconductor part types within the chosen equipment
2. Determination of part characterization information
3. Identification of relevant part failures
4. Determination of applicable electrical and environmental stress levels
5. Determination of equipment operating histories

The data collected for this effort is summarized in Table 7.1-1.
Included are, for each part type, the number of observed failures and operating hours. In
addition to this data, other information was captured, such as quality level, environment,
etc.
7.1.4. Correlation Coefficient Analysis

Using the multiple linear regression technique makes the implicit assumption that the
variables under analysis are independent, and not correlated. In practice, however,
factors are often highly correlated, thus making it difficult to deconvolve the effects that
each factor has.


Table 7.1-1: Data Collected for Model Development

Part Class                        Failures   Part Hours (Millions)
Switching Diode                        86          916.91
Rectifier Diode                       471         7745.48
Voltage Regulator Diode               228         1154.84
Voltage Reference Diode               282         2951.22
Current Regulator Diode                 2           13.54
Transient Suppressor Diode              7            6.58
PNP Transistor, <5W                  2330        24706.61
NPN Transistor, <5W                   246         1845.35
PNP Transistor, >5W                    52           75.10
NPN Transistor, >5W                    89          112.24
Dual Transistor                         1            7.05
Darlington Transistor                  57           76.58
JFET                                  878         5177.81
MOSFET                                209          431.77
Unijunction Device                     19           68.23
Thyristor                             245         1013.18
Schottky Microwave Diode               18          129.39
Tunnel Diode                           72          234.45
Varactor                               30          173.2
PIN Diode                            1857        13413.37
Microwave Power Transistor           2612         1138.70
LED                                    22         4827.08
Infrared Emitting Diode (IRED)          0           39.1
Alphanumeric Display (Segment)        144       636689.67
Alphanumeric Display (Display)          4          646.09
Photodetector                           7           47.0
Opto-isolator                         170          595.96

An example of this is the correlation between quality and environment. This correlation
exists because higher quality parts are often used in the more severe environments. As
such, the analyst's options are to:
1. Keep the factors as derived, with the caveat that they may be in error
2. Treat the factors as a combined, pooled factor representing the correlated
variables
3. Use alternate approaches to quantifying the effects of either or all correlated
variables
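
A minimal sketch of such a pre-regression check (Python; the factor values are hypothetical) computes the pairwise correlation matrix of the candidate variables, so that strongly correlated pairs can be flagged before fitting:

import numpy as np

# Candidate regressors for a handful of data records (hypothetical values):
# inverse temperature, log of electrical stress, and a 0/1 quality code.
X = np.array([
    [-1/348.0, np.log(0.5), 1.0],
    [-1/373.0, np.log(0.7), 1.0],
    [-1/323.0, np.log(0.3), 0.0],
    [-1/398.0, np.log(0.9), 0.0],
])

# Pearson correlation matrix between the variables (columns of X).
R = np.corrcoef(X, rowvar=False)
print(R)  # off-diagonal entries near +/-1 flag correlated factors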

7.1.5. Stepwise Multiple Regression Analysis

This step in the analysis consists of the following:


1. Each factor is linearized in accordance with the desired acceleration model
2. The regression is performed and coefficients are estimated
For example, consider the following model in which the factors to be included are the
base failure rate, a temperature factor and a stress factor:

λ = λ_b · π_T · π_S

or:

λ = λ_b · e^(-Ea/KT) · S^n

Taking the log of both sides yields:

ln λ = ln λ_b + ln e^(-Ea/KT) + ln S^n

ln λ = ln λ_b - Ea/KT + n ln S

or:

λ = e^(ln λ_b - Ea/KT + n ln S)
The transforms are shown in Table 7.1-2.


Table 7.1-2: Data Transforms

Variable                 Transform
Observed failure rate    ln λ
Temperature              -1/T
Stress                   ln S

When the regression is performed, the intercept is ln λ_b, and the temperature and stress coefficients are Ea/K and n, respectively. In MS Excel, the LINEST function is used to determine the model coefficients. These are the values used in the original equation:

λ = λ_b · e^(-Ea/KT) · S^n
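
The same coefficients can be recovered outside of a spreadsheet. The sketch below is a minimal Python stand-in for LINEST, using made-up data; it applies the Table 7.1-2 transforms and solves by least squares:

import numpy as np

# Hypothetical observations: failure rate, temperature (K), stress ratio
lam = np.array([0.12, 0.45, 1.30, 3.10])
T = np.array([298.0, 323.0, 348.0, 373.0])
S = np.array([0.30, 0.50, 0.70, 0.90])

# Table 7.1-2 transforms: y = ln(lambda); regressors are -1/T and ln(S)
y = np.log(lam)
X = np.column_stack([np.ones_like(T), -1.0 / T, np.log(S)])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
ln_lam_b, Ea_over_K, n = coef
print(np.exp(ln_lam_b), Ea_over_K, n)  # lambda_b, Ea/K and n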

If categorical variables are to be modeled, they can be modeled with regression analysis
by assigning a 1 or a 0 to the variable, and performing the regression as described
above. As an example, consider the case in which the product or system to be modeled
has temperature, stress, environment and quality as the four variables affecting the
reliability. This is shown in Table 7.1-3.
Table 7.1-3: Regression Data Including Categorical Variables

Independent             Temperature   Stress    Environment      Quality
variable (e.g., ln λ)                           GB   AI   GM     Comm.  Ind.  Mil.
ln(λ1)                  1/T1          ln S1     0    1    0      1      0     0
ln(λ2)                  1/T2          ln S2     1    0    0      0      0     1
ln(λ3)                  1/T3          ln S3     0    0    1      1      0     0
ln(λ4)                  1/T4          ln S4     0    1    0      0      1     0
ln(λ5)                  1/T5          ln S5     1    0    0      1      0     0

The equation above, expanded with the inclusion of the categorical variables, becomes:

λ = e^(ln λ_b - Ea/KT + n ln S + A1·GB + A2·AI + A3·GM + A4·Comm. + A5·Ind. + A6·Mil.)

where the Ai are the coefficients of the categorical variables determined from the regression analysis.
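
A sketch of the same fit with the categorical columns appended (Python; the data is hypothetical, laid out as in Table 7.1-3):

import numpy as np

# Continuous part of the design matrix for five records (hypothetical)
X_cont = np.column_stack([
    np.ones(5),                                   # intercept -> ln(lambda_b)
    -1.0 / np.array([298.0, 323.0, 348.0, 373.0, 313.0]),
    np.log(np.array([0.3, 0.5, 0.7, 0.9, 0.4])),
])

# 0/1 indicators: environment (GB, AI, GM) and quality (Comm., Ind., Mil.)
env = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]])
qual = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]])

X = np.hstack([X_cont, env, qual])
y = np.log(np.array([0.2, 0.8, 1.5, 0.6, 0.3]))   # ln of observed rates

# Note: a full set of indicators per group is collinear with the intercept;
# lstsq still returns a (minimum-norm) solution, but in practice one
# reference category per group is usually dropped.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef = [ln(lambda_b), Ea/K, n, A1..A3 (environment), A4..A6 (quality)]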
7.1.6. Goodness-of-Fit Analysis

There are several ways to analyze how well a model fits the data. The standard error provides an indication of the significance of the specific factor under analysis. The standard error is the standard deviation of the coefficient estimate. Therefore, if the standard error is small relative to the coefficient estimate, this is an indication that the factor is statistically significant. Likewise, the opposite is also true.


Residual plots are also useful in assessing how good the model is as a predictor of
reliability. The smaller the residuals, the better the model is.
Another useful plot, similar to a residual plot, is obtained when plotting the log10 of the
observed-to-predicted ratio. If this metric is relatively tightly clustered and centered
around zero, this is an indication of a good model.
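
A minimal sketch of this metric (Python, with made-up observed and predicted values):

import numpy as np

observed = np.array([0.9, 1.4, 2.2, 0.4])    # observed failure rates
predicted = np.array([1.0, 1.2, 2.5, 0.5])   # model predictions, same items

ratio = np.log10(observed / predicted)
print(ratio.mean(), ratio.std())  # near-zero mean and small spread are good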
7.1.7. Extreme Case Analysis

One of the potential problems in using a multiplicative model form is that extreme value
problems can arise. For example, when all input factors are simultaneously at their high
or low values, the resultant predicted failure rate can be unrealistically high or low. This
situation can be addressed with the use of different model forms, such as in the case of
the RIAC 217Plus models, in which a combination additive and multiplicative model
form is used.
7.1.8. Model Validation

The last step in the process is to validate the model. This is accomplished by ensuring
that the resulting models fit the observed data to a reasonable degree. Additionally, the
models can be checked against observed data not used in the model development.
Valuable data for this purpose is data at levels above the component level. In many
cases, high quality data can be obtained on systems or assemblies, but not at the part
level. This occurs due to the level at which maintenance is performed and data is
captured. Therefore, while the data cannot be used for model development, it can be used
for model validation.
Another thing that must be accounted for in the model validation effort is the scaling of
base failure rates to account for data in which there were no observed failures. The
methodology presented in this section is based on the premise that there exists a point
estimate of the dependent variable, in this case the failure rate. In cases where there are
no failures, a point estimate is not possible, i.e., only a lower single-sided confidence
bound is possible. This confidence bound value cannot be used to represent the data, since the resultant model would be pessimistic (i.e., the failure rate would be artificially increased). Using only the data points for which there are failures is also not appropriate, because it too would artificially bias the model pessimistically. Potential solutions to this situation include:

• Scaling the base failure rates to reflect the zero-failure data. One possible alternative to accomplish this is to scale the base failure rates with the boundary condition that the predicted number of failures in the entire dataset equals the observed number.
• Use of maximum likelihood (MLE) parameter estimation techniques. These MLE techniques are especially suited to censored data such as zero failures.
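
A sketch of the first option (Python, hypothetical records): a single scale factor is chosen so that the failures predicted across the whole dataset, zero-failure records included, equal the failures observed:

import numpy as np

# Per-record data: observed failures, part hours (millions) and the product
# of the model Pi-factors evaluated at that record's conditions (made up).
failures = np.array([12, 0, 3, 0, 7])
hours = np.array([5.0, 8.0, 2.0, 11.0, 4.0])
pi = np.array([1.3, 0.6, 2.1, 0.9, 1.5])

lam_b_unscaled = 1.0  # base failure rate fitted from the failure records only

# Boundary condition: total predicted failures equal total observed failures.
scale = failures.sum() / (lam_b_unscaled * pi * hours).sum()
print(scale * lam_b_unscaled)  # base failure rate reflecting zero-failure data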

7.2. 217Plus Reliability Prediction Models


7.2.1. Background

In 1994, Military Specifications and Standards Reform (MSSR) decreed the adoption of
performance-based specifications as a means of acquiring and modifying weapons
systems. This led to the cancellation of many military specifications and standards. This,
coupled with the fact that the Air Force had re-directed the mission of Rome Laboratory
(now called the Air Force Research Laboratory (the preparing activity for MIL-HDBK-217)) away from reliability, resulted in MIL-HDBK-217 becoming obsolete, with no
government plans to update it. The RIAC believed that there was a need for a reliability
assessment technique that could be used to estimate the reliability of systems in the field.
A viable assessment methodology needed:
1. Updated component reliability prediction models, since MIL-HDBK-217 was not
to be updated
2. A methodology for quantifying the effect that non-component variables have on
system reliability
3. To be useable by reliability engineers with data that is typically available during
the system development process
The RIAC is chartered with the collection, analysis and dissemination of reliability data
and information. To this end, it publishes quantitative reliability data such as failure rate
and failure mode/mechanism compendiums, as well as failure rate models. It is not
required to provide these services, but does so because there is a need for this data in the
reliability engineering community. It will continue to engage in such activities as long as
there appears to be this need by reliability practitioners. For this reason, the 217Plus
models and methodology were developed.
There are two primary elements to 217Plus, component reliability prediction models and
system-level models. A system failure rate estimate is first made by using the component
models to estimate the failure rate of each component. These failure rates are then
summed to estimate the system failure rate. This is the traditional methodology used in
many reliability predictions, and represents the reliability prediction, i.e., a reliability
estimate that is made before empirical data or detailed assessments are available. This
prediction is then modified in accordance with system level factors, which account for
non-component, or system level, effects. This is an example of a reliability
assessment, in which the process and design factors are assessed. Finally, the
prediction and assessment are combined with empirical data to form the reliability
estimate of the product, which is the best estimate of reliability based on all analysis
and data available to the analyst.
The goal of component reliability models is to estimate the rate of occurrence of
failure, or ROCOF, and the accelerants of a component's primary failure mechanisms
within an acceptable degree of accuracy. Toward this end, the models should be
adequately sensitive to operating scenarios and stresses, so that they allow the user the
ability to perform tradeoff analysis amongst these variables. For example, the basic
premise of the 217Plus models is that they have predicted failure rates for operating
periods, non-operating periods and cycling. As a result, the user can perform tradeoff
analysis amongst duty cycle, cycling rate, and other variables. As an example, a question
that frequently arises is whether a system will have a higher failure rate if it is
continuously powered on, or whether it is powered off during periods of non-use. The
models in 217Plus are structured to facilitate the tradeoff analysis required to answer this
question.
A flow diagram of the entire approach was presented in Chapter 2, which guides the user
in the application of the component models and the system level models. The basis for
the 217Plus methodology is the component reliability models, which estimate a system's
reliability by summing the predicted failure rates of the constituent components in the
system. This estimate of the system reliability is further modified by the application of
System-Level factors, called Process Grade Factors (PGF). Development of the
component models is presented in Sections 7.2.3 through 7.2.5.
The primary intent of this section is to detail the development of the 217Plus
methodology. It is provided to familiarize the reader with the issues faced by model
developers in order to allow a better understanding of 217Plus and similar models. It
provides details related to certain aspects of model development.
7.2.2. System Reliability Prediction Model
7.2.2.1. 217Plus Background

The premise of traditional methods of reliability prediction, such as MIL-HDBK-217, is that the failure rate of a product or system is primarily determined by the components comprising it. Historically, a significant number of failures also stem from non-component causes such as design deficiencies, manufacturing defects, inadequate
requirements, induced failures, etc., that have not been explicitly addressed in prediction
methods.
The data in Figure 7.2-1, presented previously, contains the nominal percentage of
failures attributable to each of eight identified predominant failure causes based on data
collected by the RIAC. The data in this figure represents nominal percentages. The
actual percentages can vary significantly around these nominal values.
Figure 7.2-1: Failure Cause Distribution of Electronic Systems


The definitions of failure causes, as presented in an earlier chapter, are:

Parts (22%): Failures resulting from a part (i.e., microcircuit, transistor, resistor, connector, etc.) failing to perform its intended function. Examples include part failures due to poor quality; manufacturer or lot variability; or any process deficiency that causes a part to fail before its expected wearout limit is reached.

Design (9%): Failures resulting from an inadequate design. Examples include tolerance stack-up, unanticipated logic conditions (e.g., sneak paths), a non-robust design for given environmental stresses, etc.

Manufacturing (15%): Failures resulting from anomalies in the manufacturing process that are not related to the inherent reliability of a part, i.e., faulty solder joints, inadequate wire routing resulting in chafing, bent connector pins, etc.

System Management (4%): Failures traceable to faulty interpretation of system requirements, imposition of bad requirements (missing, inadequate, ambiguous or contradictory), or failure to provide the resources (funding and/or personnel) required to design and build a reliable product or system.

Wearout (9%): Failures resulting from wearout-related failure mechanisms due to basic device physics. Examples of electronic components exhibiting wearout-related failure mechanisms are electrolytic capacitors, solder joints, microwave tubes (such as TWTs), and switch and relay contacts.

No defect (20%): Perceived failures that cannot be reproduced upon further testing. These may or may not be actual failures; however, they are removals and, therefore, are typically counted toward the logistic failure rate (or MTBF). Examples include the inability of the maintenance environment to recreate the operational environmental stresses under which the original failure occurred, or looser tolerances on the test equipment than on the platform or system from which the defective unit was taken.

Induced (12%): Failures resulting from an externally applied stress. Examples are electrical overstress and maintenance-induced failures (i.e., dropping, bending pins, etc.).

Software (9%): Failures of a system to perform its intended function due to the manifestation of a software fault.

Another example that this author has experience with is shown in Figure 7.2-2, which
represents the distribution observed for Erbium Doped Fiber Amplifiers (EDFAs) used in
long haul telecommunications systems. The distribution is different than the above chart,
which is a pooled result from various system types and manufacturers. This example is

provided to illustrate the notion that the system type and manufacturing practices will
dictate the specific distribution obtained.

Figure 7.2-2: Optical Amplifier Failure Cause Distribution (63% component: pumps and other optical components; 21% manufacturing defect; 8% no fault found; 7% component, electrical; 1% component, mechanical)


7.2.2.2. Methodology Overview

The 217Plus methodology is structured to allow the user the ability to estimate the
reliability of a product or system in the initial design stages when little is known about it.
For example, a reliability prediction early in the development phase of a system can be
made based on a generic parts list, using default values for operational profiles and
stresses. As additional information becomes available, the model allows the incremental
addition of empirical test and field data to supplement the initial prediction.
The purpose of 217Plus is to provide an engineering tool to assess the reliability of
electronic systems. It is not intended to be the "standard" prediction methodology, and it
can be misused if applied carelessly, just as any empirical or physics-based model can.
Also, it is a tool to allow the user the ability to estimate the failure rate of parts,
assemblies and systems. It does not consider the effect of redundancy or perform
FMEAs. The intent of 217Plus is to provide the data necessary as an input to these
analyses. The methodology allows for the modification of a base reliability estimate with
Process Grading Factors for the failure causes listed in Section 7.2.2.1.

These process grades correspond to the degree to which actions have been taken to
mitigate the occurrence of product or system failure due to these failure categories. Once
the base estimate is modified with the process grades, the reliability estimate is further
modified by empirical data taken throughout item development and testing. This
modification is accomplished using Bayesian techniques that apply the appropriate
weights for the different data elements.
Advantages of the 217Plus methodology are that it uses all available information to form
the best estimate of field reliability, it is tailorable, it has quantifiable confidence bounds,
and it has sensitivity to the predominant product or system reliability drivers. The
methodology represents a holistic approach to predicting, assessing and estimating
product or system reliability by accounting for all primary factors that influence the
inability of an item to perform its intended function. It factors in all available reliability
data as it becomes available on the program. It thus integrates test and analysis data,
which provides a better prediction foundation and a means for estimating variances from
different reliability measures.
7.2.2.3. System Reliability Model

The fundamental 217Plus failure rate model for a system is as follows:

λ_P = λ_IA · (Π_P + Π_D + Π_M + Π_S + Π_I + Π_N + Π_W) + λ_SW

The sum of the Pi-factors in the parentheses represents the cumulative multiplier that accounts for all of the processes used in system development and sustainment. The sum of these values is normalized to unity for processes that are considered to be the mean of industry practices. The individual model factors are:
λ_P = predicted failure rate of the product or system (in failures per million calendar hours)
λ_IA = initial assessment of the failure rate based on component failure rate estimates
Π_P = parts process multiplier
Π_D = design process multiplier
Π_M = manufacturing process multiplier
Π_S = system management process multiplier
Π_I = induced process multiplier
Π_N = no-defect process multiplier
Π_W = wearout process multiplier
λ_SW = software failure rate prediction

Additional factors included in the model account for the effects of infant mortality, environment, and reliability growth. Since each of these factors does not influence every term in the above equation, they are applied selectively to the applicable terms. For example, environmental stresses will generally accelerate part defects and manufacturing defects to failure. These additional factors are normalized to unity under average conditions, so that the value inside the parentheses is one under nominal conditions and for nominal processes.

λ_P = λ_IA · (Π_P·Π_IM·Π_E + Π_D·Π_G + Π_M·Π_IM·Π_E·Π_G + Π_S·Π_G + Π_I + Π_N + Π_W) + λ_SW
where:
Π_IM = infant mortality factor
Π_E = environmental factor
Π_G = reliability growth factor

The initial assessment of the failure rate, λ_IA, is the seed failure rate value, which is
obtained by using the 217Plus component reliability prediction models, along with other
available data. This failure rate is then modified by the Pi-factors that account for
specific processes used in the design and manufacture of the product or system, along
with the environment, reliability growth and infant mortality characteristics of the item.
The above failure rate expression represents the total failure rate of the system, which
includes "induced" and "no defect found" failure causes. If the inherent failure rate is
desired, then the "induced" and "no defect found" Pi-factors should be set to zero, since
they represent operational and non-inherent failure causes.
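
As a minimal sketch, the expanded model can be coded directly (Python; every factor value below is a hypothetical placeholder, to be produced by the process grading, environment, growth and infant mortality calculations described in the following sections):

def lambda_217plus(lam_ia, pi, lam_sw=0.0, inherent_only=False):
    """217Plus system failure rate, failures per million calendar hours."""
    if inherent_only:  # drop the operational, non-inherent failure causes
        pi = dict(pi, I=0.0, N=0.0)
    total = (pi["P"] * pi["IM"] * pi["E"]
             + pi["D"] * pi["G"]
             + pi["M"] * pi["IM"] * pi["E"] * pi["G"]
             + pi["S"] * pi["G"]
             + pi["I"] + pi["N"] + pi["W"])
    return lam_ia * total + lam_sw

# Nominal case: IM, E and G equal 1.0 and the cause multipliers sum to 1.0,
# so the prediction reduces to lam_ia + lam_sw (all values hypothetical).
pi = {"P": 0.30, "D": 0.09, "M": 0.15, "S": 0.04,
      "I": 0.12, "N": 0.20, "W": 0.10, "IM": 1.0, "E": 1.0, "G": 1.0}
print(lambda_217plus(lam_ia=25.0, pi=pi, lam_sw=2.0))  # -> 27.0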
7.2.2.4. Initial Failure Rate Estimate

An initial estimate of a system failure rate is based on a combination of the component
failure rate models, the empirical field failure rate data contained in the RIAC databases,
or user-defined failure rates from other sources that are entered directly by the user. This
initial failure rate is then used as a seed value that represents a typical failure rate for the
product or system. It is then adjusted in accordance with the PGFs, infant mortality
characteristics, reliability growth characteristics, and environmental stresses. In addition,
software is modeled as a separate failure rate.

All variables in the model default to average values, not worst-case values. As a result,
the user has the option of applying any or all factors, depending on the level of
knowledge of the product or system and the amount of time or resources available for the
assessment. If a traditional reliability prediction is desired, the user can perform it using
the component models and the RIAC database failure rates contained in 217Plus⁹. As
additional data and information becomes available, the analysis can be expanded to
include these system-level factors.
7.2.2.5. Process Grading Factors

An objective of the 217Plus system model is to explicitly account for the factors
contributing to the variability in traditional reliability prediction approaches. This is
accomplished by grading the process for each of the failure cause categories. The
resulting grade for each cause corresponds to the level to which an organization has taken
the action necessary to mitigate the occurrence of failures of that cause. This grading is
accomplished by assessing the processes in a self-audit fashion. Any or all failure causes
can be assessed and graded. If the user chooses not to address a specific failure cause,
the model simply reverts to the default "average" value. If the user chooses to apply the
PGF methodology for any failure cause, there are a minimum number of questions that
should be assessed and graded. Beyond this minimum, the user can selectively assess
and grade additional criteria. If answers to the grading questions are not known, the
model simply ignores those criteria. Process grading is used to quantify the following
factors:

• Π_P (parts process multiplier)
• Π_D (design process multiplier)
• Π_M (manufacturing process multiplier)
• Π_S (system management process multiplier)
• Π_I (induced process multiplier)
• Π_N (no-defect process multiplier)
• Π_W (wearout process multiplier)

The sum of the Π factors within the parentheses in the failure rate model is equal to unity for the average grade. Each factor will increase if "less than average" processes are in use, and decrease if "better than average" processes are in use.

⁹ The RIAC 217Plus software contains databases that hold the RIAC's NPRD and EPRD failure rate data, converted to failures per million calendar hours. The RIAC Handbook of 217Plus Reliability Prediction Models does not contain this supplementary data. The RIAC NPRD and EPRD databooks are available for separate purchase from the RIAC, and are in units of failures per million operating hours.


Features of this PGF methodology are that it:

• Explicitly recognizes and accounts for special (assignable) cause problems
• Models reliability from the user (or total system-level) perspective
• Promotes cross-organizational commitment to Reliability, Availability and Maintainability (RAM)
• Quantitatively grades developers' efforts to affect improved reliability
• Maintains continuing organizational focus on RAM throughout the development cycle

Reference 2 presents the results of the study in which the process grades were
determined.
7.2.2.6. Basis Data for the Model
7.2.2.7. Uncertainty in Traditional Approach Estimates

A goal of 217Plus is to model predominant system reliability drivers. The premise of traditional methods such as MIL-HDBK-217 is that the failure rate is primarily
determined by the technology and application stress of the components comprising the
product or system. This was a good premise many years ago, when components
exhibited higher failure rates and systems were not as complex as they are today.
Increased item complexity and component quality have resulted in a shift of system
failure causes away from components to more system-level factors, including system
requirements, interface problems and software problems. A significant number of
failures also stem from non-component causes such as defects in design and
manufacturing. Historically, these factors have not been explicitly addressed in
prediction methods. The approach used to develop the 217Plus model was to (1) quantify
the uncertainty in predictions using "component-based" traditional approaches and (2)
explicitly model the factors contributing to that uncertainty.
Data was collected by the RIAC on systems for which both predicted and observed
MTBF data was available. This was done for the purpose of quantifying the uncertainty
in traditional component-based predictions. Table 7.2-1 presents the multipliers of a
failure rate point estimate as a function of confidence level that was derived from analysis
of this data. For example, using traditional approaches, one could be 90% certain that the
true failure rate was less than 7.575 times the predicted value.


Table 7.2-1: Uncertainty Level Multiplier

Percentile   Multiplier
0.10         0.132
0.20         0.265
0.30         0.437
0.40         0.670
0.50         1.000
0.60         1.492
0.70         2.290
0.80         3.780
0.90         7.575

7.2.2.8. System Failure Causes

The premise of the 217Plus model developed in the RIAC study was that the failure rate
attributable to the predominant system-level failure causes could be quantified. In
addition to the intrinsic variability associated with the failure rate prediction, there is
additional variability associated with the variance in the distribution of failure causes.
This requires that there be baseline data that quantifies the failure rate of each cause. The
data in Table 7.2-2 was used for this purpose. This table contains, for each source of
data, the percentage of failures attributable to each of the eight identified predominant
failure causes. It should be noted here that the reported percentages of failure due to
some failure causes might be underestimated. For example, system management and
software may be under-reported because failures are usually not attributed to those
categories, even when they are the root cause of failure. This also means that the
percentages from the other causes may be overestimated. Although the authors recognize
that this is likely, the values in the model reflect the reported values. However, if a user
of the model has failure cause distribution information from which the model factors can
be tailored, this data should be used instead of the nominal values.

100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC


282

Chapter 7: Examples

Table 7.2-2: Percentage of Failures Attributable to Each Failure Cause

Survey       Part     Mfg.              System
Respondent   Defect   Defect   Design   Mgt.    Wearout   No Defect   Induced   Software
1            5        38       0        0       0         42          8         8
2            34       28       0        0       39        0           0         0
3            13       5        5        0       3         30          43        0
4            9        31       38       0       6         0           16        0
5            46       10       19       0       12        0           14        0
6            46       25       2        0       12        0           14        0
7            19       39       10       0       10        0           22        0
8            28       28       28       0       0         0           17        0
9            42       42       16       0       0         0           0         0
10           64       0        0        0       17        0           20        0
11           24       28       0        0       6         34          8         0
12           15       13       4        12      6         17          32        1
13           32       1        5        11      27        16          7         0
14           13       10       10       1       13        0           34        20
15           19       3        5        0       5         40          7         20
16           61       5        5        1       15        10          3         0
17           38       15       17       0       12        0           18        0
18           30       19       10       1       11        11          15        3

An analysis was then performed on the Table 7.2-2 data to quantify the distributions of
percentages for each failure cause. This was accomplished by performing a Weibull
analysis of each column. The resulting distributions are summarized in Table 7.2-3.
Table 7.2-3: Weibull Parameters for Failure Cause Percentages

Failure Cause        Characteristic   Weibull Shape
                     Percentage       Parameter (beta)
Parts                33.9             1.62
Manufacturing        23.2             0.96
Design               13.9             1.29
System Management    7.1              0.64
Wearout              14.7             1.68
Induced              19.8             1.58
No Defect            31.9             1.92
Software             15.0             0.70


Table 7.2-4 summarizes the failure rate multiplier values for each of the eight failure causes as a function of the grade for each. The generic formula for the multiplier is given as:

Π_i = η_i · (-ln R_i)^(1/β_i)

where R_i is the process grade and η_i is the scaled characteristic percentage for failure cause i. In this calculation, the characteristic percentages listed in Table 7.2-3 are scaled by a factor of 1.11 to ensure that the sum of the multipliers is equal to one when each grade is equal to 0.50. In this case, a grade of 0.50 represents an "average" process, and since the model is normalized to an average process, the total multiplier of the initial assessment failure rate is equal to one under these conditions.

Table 7.2-4: Multipliers as a Function of Process Grade

Grade   Parts   Manufacturing   Design   System Mgt.   Wearout   Induced   No Defect
0.01    0.725   0.948           0.378    0.643         0.304     0.433     0.588
0.02    0.655   0.800           0.333    0.498         0.276     0.391     0.540
0.03    0.612   0.714           0.306    0.420         0.258     0.365     0.511
0.04    0.581   0.653           0.286    0.367         0.245     0.346     0.488
0.05    0.556   0.606           0.271    0.328         0.235     0.330     0.470
0.06    0.535   0.567           0.258    0.298         0.227     0.317     0.455
0.07    0.516   0.535           0.247    0.273         0.219     0.306     0.442
0.08    0.500   0.507           0.237    0.251         0.212     0.296     0.430
0.09    0.486   0.482           0.229    0.233         0.207     0.288     0.420
0.10    0.472   0.461           0.221    0.218         0.201     0.279     0.410
0.11    0.460   0.441           0.214    0.204         0.196     0.272     0.401
0.12    0.449   0.423           0.207    0.191         0.191     0.265     0.393
0.13    0.438   0.406           0.201    0.180         0.187     0.259     0.385
0.14    0.428   0.391           0.195    0.170         0.183     0.253     0.378
0.15    0.419   0.376           0.190    0.161         0.179     0.247     0.371
0.16    0.410   0.363           0.185    0.152         0.176     0.242     0.364
0.17    0.402   0.351           0.180    0.145         0.172     0.237     0.358
0.18    0.394   0.339           0.176    0.137         0.169     0.232     0.352
0.19    0.386   0.328           0.171    0.131         0.166     0.227     0.346
0.20    0.379   0.317           0.167    0.124         0.162     0.223     0.340
0.21    0.372   0.307           0.163    0.119         0.160     0.219     0.335
0.22    0.365   0.298           0.160    0.113         0.157     0.214     0.330
0.23    0.358   0.288           0.156    0.108         0.154     0.210     0.325
0.24    0.352   0.280           0.152    0.103         0.151     0.206     0.320
0.25    0.345   0.271           0.149    0.098         0.149     0.203     0.315
0.26    0.339   0.263           0.146    0.094         0.146     0.199     0.310
0.27    0.333   0.256           0.143    0.090         0.144     0.196     0.306
0.28    0.328   0.248           0.140    0.086         0.141     0.192     0.301
0.29    0.322   0.241           0.137    0.083         0.139     0.189     0.297
0.30    0.317   0.234           0.134    0.079         0.137     0.185     0.293
0.31    0.311   0.228           0.131    0.076         0.134     0.182     0.288
0.32    0.306   0.221           0.128    0.072         0.132     0.179     0.284
0.33    0.301   0.215           0.125    0.069         0.130     0.176     0.280
0.34    0.296   0.209           0.123    0.067         0.128     0.173     0.276
0.35    0.291   0.203           0.120    0.064         0.126     0.170     0.272
0.36    0.286   0.198           0.118    0.061         0.124     0.167     0.269
0.37    0.281   0.192           0.115    0.059         0.122     0.164     0.265
0.38    0.277   0.187           0.113    0.056         0.120     0.161     0.261
0.39    0.272   0.181           0.110    0.054         0.118     0.159     0.257
0.40    0.267   0.176           0.108    0.052         0.116     0.156     0.254
0.41    0.263   0.171           0.106    0.049         0.114     0.153     0.250
0.42    0.259   0.167           0.104    0.047         0.112     0.151     0.247
0.43    0.254   0.162           0.101    0.045         0.111     0.148     0.243
0.44    0.250   0.157           0.099    0.043         0.109     0.146     0.240
0.45    0.246   0.153           0.097    0.042         0.107     0.143     0.236
0.46    0.241   0.148           0.095    0.040         0.105     0.140     0.233
0.47    0.237   0.144           0.093    0.038         0.104     0.138     0.229
0.48    0.233   0.140           0.091    0.036         0.102     0.136     0.226
0.49    0.229   0.136           0.089    0.035         0.100     0.133     0.223
0.50    0.225   0.132           0.087    0.033         0.098     0.131     0.219
0.51    0.221   0.128           0.085    0.032         0.097     0.128     0.216
0.52    0.217   0.124           0.083    0.030         0.095     0.126     0.213
0.53    0.213   0.120           0.081    0.029         0.093     0.124     0.210
0.54    0.209   0.117           0.080    0.028         0.092     0.121     0.206
0.55    0.205   0.113           0.078    0.026         0.090     0.119     0.203
0.56    0.202   0.109           0.076    0.025         0.088     0.117     0.200
0.57    0.198   0.106           0.074    0.024         0.087     0.114     0.197
0.58    0.194   0.103           0.072    0.023         0.085     0.112     0.194
0.59    0.190   0.099           0.071    0.022         0.084     0.110     0.190
0.60    0.186   0.096           0.069    0.021         0.082     0.108     0.187
0.61    0.183   0.093           0.067    0.020         0.080     0.106     0.184
0.62    0.179   0.090           0.065    0.019         0.079     0.103     0.181
0.63    0.175   0.086           0.064    0.018         0.077     0.101     0.178
0.64    0.172   0.083           0.062    0.017         0.076     0.099     0.174
0.65    0.168   0.080           0.060    0.016         0.074     0.097     0.171
0.66    0.164   0.077           0.059    0.015         0.073     0.095     0.168
0.67    0.160   0.074           0.057    0.014         0.071     0.092     0.165
0.68    0.157   0.072           0.055    0.013         0.069     0.090     0.162
0.69    0.153   0.069           0.054    0.013         0.068     0.088     0.158
0.70    0.149   0.066           0.052    0.012         0.066     0.086     0.155
0.71    0.146   0.063           0.050    0.011         0.065     0.084     0.152
0.72    0.142   0.061           0.049    0.010         0.063     0.081     0.149
0.73    0.138   0.058           0.047    0.010         0.062     0.079     0.145
0.74    0.135   0.055           0.046    0.009         0.060     0.077     0.142
0.75    0.131   0.053           0.044    0.008         0.058     0.075     0.139
0.76    0.127   0.050           0.042    0.008         0.057     0.073     0.135
0.77    0.123   0.048           0.041    0.007         0.055     0.071     0.132
0.78    0.119   0.045           0.039    0.007         0.053     0.068     0.129
0.79    0.116   0.043           0.038    0.006         0.052     0.066     0.125
0.80    0.112   0.040           0.036    0.006         0.050     0.064     0.122
0.81    0.108   0.038           0.035    0.005         0.048     0.062     0.118
0.82    0.104   0.036           0.033    0.005         0.047     0.059     0.114
0.83    0.100   0.034           0.031    0.004         0.045     0.057     0.111
0.84    0.096   0.031           0.030    0.004         0.043     0.055     0.107
0.85    0.092   0.029           0.028    0.003         0.042     0.052     0.103
0.86    0.088   0.027           0.027    0.003         0.040     0.050     0.099
0.87    0.084   0.025           0.025    0.003         0.038     0.047     0.095
0.88    0.079   0.023           0.023    0.002         0.036     0.045     0.091
0.89    0.075   0.021           0.022    0.002         0.034     0.042     0.087
0.90    0.070   0.019           0.020    0.002         0.032     0.040     0.082
0.91    0.066   0.017           0.019    0.001         0.030     0.037     0.078
0.92    0.061   0.015           0.017    0.001         0.028     0.034     0.073
0.93    0.056   0.013           0.015    0.001         0.026     0.031     0.068
0.94    0.051   0.011           0.013    0.001         0.023     0.028     0.062
0.95    0.045   0.009           0.012    0.001         0.021     0.025     0.057
0.96    0.039   0.007           0.010    0.000         0.018     0.022     0.050
0.97    0.033   0.005           0.008    0.000         0.015     0.018     0.043
0.98    0.025   0.003           0.006    0.000         0.012     0.014     0.035
0.99    0.016   0.002           0.003    0.000         0.008     0.009     0.024
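
A sketch of the multiplier calculation under the inverse-Weibull reading of the reconstructed formula above (Python). The eta value is not the book's scaling; it was back-fitted here so that the Parts column of Table 7.2-4 is approximately reproduced:

import math

def grade_multiplier(eta, beta, grade):
    """Failure-cause multiplier for a process grade in (0, 1); higher
    grades (better processes) give smaller multipliers."""
    return eta * (-math.log(grade)) ** (1.0 / beta)

# Parts cause: beta = 1.62 from Table 7.2-3, eta back-fitted to the table.
print(grade_multiplier(0.2825, 1.62, 0.01))  # ~0.725
print(grade_multiplier(0.2825, 1.62, 0.50))  # ~0.225
print(grade_multiplier(0.2825, 1.62, 0.99))  # ~0.016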
7.2.2.9. Environmental Factor

MIL-HDBK-344 (Reference 6) defines the stress screening strength (SS) to be the probability that a specific screen will precipitate a latent defect to failure and detect it by test, given that a latent defect susceptible to the screen is present. It is the product of the precipitation efficiency (PE) and detection efficiency (DE). It is equivalent to the percentage of defects that are removed from the prescreened population:

SS = D_removed / D_in

where:

D_removed = D_in - D_remaining
The failure rate is, therefore:
λ = D_field(t) / t

where:
t = the period, in hours, over which the MTBF is to be measured
D_field = the number of field failures due to latent defects occurring during the interval t
Since SS is the percentage of defects removed from the population, it follows that:

D_field = D_remaining · SS_field


The SS_field is the effective screening strength of the stresses that the product or system will encounter in the field, and SS_ESS is the screening strength that the system is exposed to during environmental stress screening (ESS). It also follows that D_field is equal to the cumulative (integral of the) field failure rate:

D field = (t )
D field = postscreened (t )

D field = SS * prescreened (t )

postsceened = SS * prescreened
This indicates that, in addition to estimating the effect that ESS has on system reliability,
the screening strength calculated from field stresses (SS_field) can be effectively used as a
failure rate multiplier that accounts for the environmental stresses:
SS_field(t) = (1 - e^(-kt)) / t

where:
SS_field(t) = equivalent screening strength of the field environment
k = field precipitation rate

The total screening strength, SS_total, after accounting for both the temperature cycling and vibration-related portions, is:

SS_total = P_TC · SS(TC) + P_RV · SS(RV)

where:
P_TC = the percentage of failures resulting from temperature cycling stresses
P_RV = the percentage of failures resulting from random vibration stresses
SS(TC) = the screening strength applicable to temperature cycling
SS(RV) = the screening strength applicable to random vibration


Algorithms for calculating screening strength are given in a subsequent section. If the actual values of P_TC and P_RV are unknown, the default values that should be used are:

P_TC = 0.80
P_RV = 0.20
Since the component failure rates described above are relative to a ground benign
environment, the failure rate multiplier is the ratio of the SS value in the use environment
to the SS value in a ground benign environment:

π_E = [P_TC · SS(TC_use) + P_RV · SS(RV_use)] / [P_TC · SS(TC_Gb) + P_RV · SS(RV_Gb)]

where:
P_TC = percentage of failures resulting from temperature cycling stresses
P_RV = percentage of failures resulting from random vibration stresses
SS = screening strength applicable to the application environmental values
As previously indicated, the SS value is the screening strength and has been derived from
MIL-HDBK-344. It is an estimate of the probability of both precipitating a defect to
failure and detecting it once it is precipitated by the test.

SS_TC = 1 - e^(-k_TC · t)

SS_RV = 1 - e^(-k_RV · t)

k_TC = 0.0017 · (ΔT + 0.6)^0.6 · [ln(RATE + 2.718)]^3

where:
ΔT = T_max - T_min (in degrees C)
RATE = degrees C/minute
t = number of cycles

k_RV = 0.0046 · G^1.71

The parameter G is the magnitude of vibration stress, in units of Grms. Whenever possible, the actual values of delta T (ΔT) and vibration (Grms) should be used for the use application environment when calculating SS values. If the actual values are not known, then the default values of ΔT (summarized in the component model descriptions later) can be used. A discussion of the values of k follows.
For RV screens it is necessary to include an axis sensitivity factor. The RV applied in the
axis perpendicular to the plane of the board will have the greatest effect. When selecting
and modeling RV stress, the precipitation efficiency is, thus, given by:

[1 - exp(-kt)] · (Axis Sensitivity Factor)


where the axis sensitivity factor is the defect density in the sensitive axis divided by the
total defect density. Transmissibility and resonance effects must be considered, and the
frequency spectrum may need to be suitably notched to avert overstress or wearout
effects. Similarly, thermal mass and conductivities must be considered when determining
temperature cycle (TC) transition rates and required dwell times. The stress levels for all
of these equations pertain to the product or system being screened and not the test
chamber conditions.
It should also be noted that the expressions and tables for precipitation efficiency are only
approximate and, as in the estimation of initial defects, should be refined based upon
actual user data according to the techniques of Procedure D of MIL-HDBK-344.
Under the average temperature cycling and random vibration conditions that represent
the data used in development of the models, the denominator is 0.205. This value is a
normalization constant such that the environment factor is equal to 1.0 when a product or
system is subjected to the average stress levels.
The values assumed for the rate and duration are 2 degrees C per minute and 10 hours,
respectively. Therefore, the environment factor is:
π_E = (0.8 · [1 - e^(-0.065 · (ΔT + 0.6)^0.6)] + 0.2 · [1 - e^(-0.046 · G^1.71)]) / 0.205
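
A sketch of the full π_E calculation assembled from the pieces above (Python). The cube on the log term and the assumed 10 temperature cycles are reconstructions carried over from the MIL-HDBK-344 screening strength form on which this section is based, so treat them as assumptions:

import math

def ss_tc(delta_t, rate, cycles):
    """Temperature cycling screening strength (MIL-HDBK-344 form; the
    cubed log term is assumed, as it is not legible in the source)."""
    k_tc = 0.0017 * (delta_t + 0.6) ** 0.6 * math.log(rate + math.e) ** 3
    return 1.0 - math.exp(-k_tc * cycles)

def ss_rv(g_rms, hours):
    """Random vibration screening strength."""
    return 1.0 - math.exp(-0.0046 * g_rms ** 1.71 * hours)

def pi_e(delta_t, g_rms, p_tc=0.80, p_rv=0.20, rate=2.0, cycles=10, hours=10.0):
    """Environment factor: use-environment screening strength divided by
    the ground benign normalization constant 0.205."""
    return (p_tc * ss_tc(delta_t, rate, cycles)
            + p_rv * ss_rv(g_rms, hours)) / 0.205

print(pi_e(delta_t=20.0, g_rms=2.0))  # example use conditions (hypothetical)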

7.2.2.10. Reliability Growth

The 217Plus model includes a factor for assessing the reliability growth characteristics of a product or system¹⁰. The premise of this factor is that the processes that contribute to system reliability growth in the field may or may not exist. The degree to which growth exists is estimated by a grading factor that assesses the processes contributing to growth. The growth factor calculation is given by the formula:

π_G = 1.12 · (t + 2)^(-α) / 2^(-α)
The denominator in the above expression is necessary to ensure that the value of the factor is 1.12 at the time of field deployment, regardless of the growth rate (α). Figure 7.2-3 illustrates the growth Pi-factor multiplier for various values of growth rates as a function of time.
Figure 7.2-3: π_G vs. Time and Growth Rates


The value of α is estimated by determining the degree to which the potential for growth exists. This estimation is accomplished in a manner similar to the process grading methodology, by assessing and grading the processes that can contribute to reliability growth.

¹⁰ The system reliability growth factor is different from, and in addition to, the reliability growth factors used in the 217Plus component models to reflect component technology improvements from their respective baseline years.
7.2.2.11. Infant Mortality

Infant mortality is accounted for in the model with a time-variant factor that is a function of the level to which ESS has been applied. The infant mortality correction factor, π_IM, is calculated as:

π_IM = (1 - SS_ESS) · t^(-0.62) / 1.77
where:
t = time in years
SS_ESS = the screening strength of the screen(s) applied, if any

The value of SS can be determined by using the stress screening strength equations as
presented in Section 7.2.2.9.
The above expression represents the instantaneous failure rate. If the average failure rate
for a given time period is desired, this expression must be integrated and divided by the
time period.
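
A short sketch of that averaging (Python), using the reconstructed π_IM expression above; since the integral of t^(-0.62) is t^(0.38)/0.38, the average over (0, T] has a closed form:

def pi_im_instant(t, ss_ess=0.0):
    """Instantaneous infant mortality factor, t in years."""
    return (1.0 - ss_ess) * t ** -0.62 / 1.77

def pi_im_average(T, ss_ess=0.0):
    """Average factor over (0, T]: integrate t**-0.62, then divide by T."""
    return (1.0 - ss_ess) * (T ** 0.38 / 0.38) / (1.77 * T)

print(pi_im_instant(1.0), pi_im_average(1.0))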
7.2.2.12. Combining Predicted Failure Rate with Empirical Data

The user of this model is encouraged to collect as much empirical data as possible and
use it in the 217Plus reliability assessment. This was summarized in Section 2.6, and is
done by mathematically combining the initial assessment made (based on the initial
assessment and the process grades) with empirical data. This step combines the best
"pre-build" failure rate estimate obtained from the initial assessment (plus the influence
of the PGFs) with the metrics obtained from the empirical data. Bayesian techniques are
used for this purpose. This technique accounts for the quantity of data by weighting large
amounts of data more heavily than small amounts. The failure rate estimate obtained
above forms the "prior" distribution, comprised of a0 and b0.
7.2.3. Development of Component Reliability Models
7.2.3.1. Model Form

Traditional methods of reliability prediction model development, as discussed earlier in the section on MIL-HDBK-217, have included the statistical analysis of empirical failure rate data. Statistical methods have included ANOVA, multiple linear regression, sensitivity analysis, etc. When using multiple linear regression techniques with highly variable data (which is often the case with empirical failure rate data), a requirement of the model form is that it be multiplicative (i.e., the predicted failure rate is the product of a base failure rate and several factors that account for the stresses and component variables that influence reliability). An example of a multiplicative model is as follows:

λ_p = λ_b · π_e · π_q · π_s

where:
λ_p = predicted failure rate
λ_b = base failure rate
π_e = environmental factor
π_q = quality factor
π_s = stress factor

However, a primary disadvantage of the multiplicative model form is that the predicted
failure rate value can become unrealistically large or small under extreme value
conditions (i.e., when all factors are at their lowest or highest values). This is an inherent
limitation of multiplicative models, primarily due to the fact that individual failure
mechanisms, or classes of failure mechanisms, are not explicitly accounted for. A better
approach is an additive model which predicts a separate failure rate for each generic class
of failure mechanisms. Each of these failure rate terms is then accelerated by the appropriate stress or component characteristic. This model form is as follows:

λ_p = λ_o·Π_o + λ_e·Π_e + λ_c·Π_c + λ_i + λ_sj·Π_sj

where:
λ_p = predicted failure rate
λ_o = failure rate from operational stresses
Π_o = product of failure rate multipliers for operational stresses
λ_e = failure rate from environmental stresses
Π_e = product of failure rate multipliers for environmental stresses
λ_c = failure rate from power or temperature cycling stresses
Π_c = product of failure rate multipliers for cycling stresses
λ_i = failure rate from induced stresses, including electrical overstress and ESD
λ_sj = failure rate from solder joints
Π_sj = product of failure rate multipliers for solder joint stresses

By modeling the failure rate in this manner, factors that account for the application and component-specific variables that affect reliability (π factors) can be applied to the appropriate additive failure rate term. Additional advantages to this approach are that the models:

• Address operating, non-operating and cycling-related failure rates in an additive model, weighted in accordance with the operational profile (duty cycle and cycling rate). The Pi-factors modify only the applicable failure rate term, thereby eliminating many of the extreme value problems that plague multiplicative models
• Are based on observed failure mode distributions, so that observed component failure causes are empirically modeled
• Are based on quantitative stresses (and not on qualitative environmental categories), but default to average stress conditions as a function of environment
• Are industry-independent and predict the average failure rates of best commercial practices
• Can be tailored with test data, if available, by applying the test data to the appropriate additive term via the Bayesian method

7.2.3.2. Acceleration Factors

Acceleration factors (also called Pi-factors) are used in the 217Plus models to estimate
the effect on failure rate of various stress and component variables. Since the traditional
technique of multiple linear regression was not used in the derivation of the failure rate
models, the Pi-factors were derived by utilizing either industry accepted values, values
determined separately from data available to the RIAC, or values from previous modeling
efforts. For example, the models typically include both an operating and non-operating
temperature factor based on the Arrhenius relationship, which require an activation
energy for operating and non-operating conditions. To estimate these values for the
models, previous modeling studies (along with existing prediction methodologies) were
used. Similarly, some factors were based on test data. For example, the exponent used in
the delta T Pi-factor for the 217Plus integrated circuit model is based on fallout rate data
from temperature cycling tests that were performed at various levels of delta T.
7.2.3.3. Time Basis of Models

Traditional reliability prediction models have been based on the operating time of the part, and the units were typically failures per million (or billion) operating hours (F/10⁶H). The RIAC 217Plus models (and the empirical data contained in the RIAC databases included with the RIAC 217Plus software) predict the failure rate in units of failures per million calendar hours (F/10⁶CH). This is necessary (and appropriate) because it is the common basis for all failure rate contribution terms used in the model (operating, non-operating, cycling, and induced). If an equivalent operating failure rate is desired (in units of failures per million operating hours), the failure rate (in F/10⁶CH) can be divided by the duty cycle to yield a failure rate in F/10⁶ operating hours.
7.2.3.4. Failure Mode to Failure Cause Mapping

There are two primary types of data on which the RIAC 217Plus component models are
based: failure rate and failure mode. The model development process required that the
failure rate data be apportioned into four failure cause categories. Since the failure mode
data contained in the RIAC databases was typically not defined by these categories, it
was necessary to transform the RIAC failure mode data into a failure cause distribution.
This was accomplished by assessing the stresses that accelerate the specific class of
failure categories, and estimating the percentage of failures that could be attributed to
those stresses. The primary stresses that potentially accelerate operational failure modes
are operating temperature, vibration, current and voltage. The stresses that accelerate
environmental failure causes are non-operating (i.e., dormant) ambient temperature,
corrosive stresses (contaminants/heat/humidity), ageing stresses (time), and humidity. As
an example, Table 7.2-5 summarizes this process for a resistor. Each of the six failure
modes included in the analysis is listed across the top of the table (i.e., EOS, contamination, etc.), along with its associated observed relative percentage of
occurrence. This data was collected by the RIAC and was based primarily on the root
cause failure analysis results of parts that had failed in the field.
Table 7.2-5: Example of Failure Mode-to-Failure Cause Category Mapping

[The table is only partially recoverable from the source. Its columns are the six observed resistor failure modes and their relative percentages of occurrence: EOS (41.20%), contamination (23.50%), cracked chip (17.60%), leakage (7.10%), and two further modes at 5.90% and 4.70%. Its rows are the failure cause categories (operational stresses, environmental stresses, power cycling, and induced/EOS) together with their accelerating stresses: operating temperature, vibration, current and voltage for operational causes, and ambient (dormant) temperature, corrosion, ageing and humidity for environmental causes. Each cell apportions a fraction of a failure mode to a failure cause; the category totals recoverable from the source are 0.31, 0.22, 0.22, 0.42 and 0.42.]
7.2.3.5. Derivation of Base Failure Rates

Once the Pi-factors were defined for each component type that was modeled, and once
the failure rate was apportioned amongst the failure causes, the base failure rate could be
determined. This was accomplished by (1) gathering all failure rate data, (2) estimating
the model input variables (temperatures, stresses, etc.) for each source of data, (3)
calculating the associated Pi-factor for each failure rate, and (4) deriving a base failure
rate for each of the failure cause categories. For example, the failure rate associated with
operational stresses is equated to the product of the base failure rate and the operational
Pi-factors:
%FC · λ_obs = λ_b · Π_π

where:
%FC = percentage of the failure rate attributable to operational failure causes
λ_obs = observed failure rate
λ_b = base failure rate to be derived
Π_π = product of model Pi-factors

Solving for λ_b, and adding a factor to account for data points which have had no observed failures, yields:

λ_b = (%FC · λ_obs / Π_π) · PF

The PF parameter is the percentage of total observed calendar hours associated with components that have had observed failures. This factor is necessary to pro-rate the base failure rate, which was calculated from those data records containing failures. Once this value of λ_b was calculated for each data record, the geometric mean was used as the best estimate of the base failure rate.
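
A sketch of steps (3) and (4) for the operational failure rate term (Python, hypothetical records):

import numpy as np

# Per-record inputs (hypothetical): observed failure rate, fraction of it
# attributed to operational causes, and the product of the operational
# Pi-factors evaluated at the record's stresses.
lam_obs = np.array([2.0, 0.8, 4.5])
pct_fc = np.array([0.45, 0.45, 0.45])
pi_o = np.array([1.2, 0.7, 2.4])

pf = 0.85  # fraction of total calendar hours on records that had failures

lam_b_records = pct_fc * lam_obs / pi_o * pf
lam_b = np.exp(np.log(lam_b_records).mean())  # geometric mean
print(lam_b)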
7.2.3.6. Combining the Predicted Failure Rate with Empirical Data

The user of the 217Plus model is encouraged to collect as much empirical data as
possible and use it in the assessment. This is done by mathematically combining the
prediction made (based on the initial assessment and the process grades) with empirical
data, resulting in a reliability estimate. This step will combine the best pre-build failure
rate estimate obtained from the initial assessment (with process grading) with the metrics
obtained from the empirical data. Bayesian techniques are used for this purpose. This
technique accounts for the quantity of data by weighting large amounts of data more

heavily than small quantities. The failure rate estimate obtained above forms the prior
distribution, comprised of a0 and b0.
If empirical data (i.e., test or field data) is available on the system under analysis, it can
be combined with the best pre-build failure rate estimate using the following equation:

λ = (a0 + a1 + ... + an) / (b0 + b1 + ... + bn)

where:
λ = the best estimate of the predicted failure rate
a0 = the equivalent number of failures of the prior distribution corresponding to the reliability prediction (after process grading has been accounted for). The default value is a0 = 0.5
b0 = the equivalent number of hours associated with the reliability prediction (after process grading). After a0 is calculated, the value of b0 can be calculated by b0 = a0/λ, where λ is the predicted failure rate
a1 through an = the number of failures experienced in the empirical data. There may be n different types of data available
b1 through bn = the equivalent number of cumulative operating hours (in millions) experienced in the empirical data. These values must be converted to equivalent hours by accounting for the accelerating effects between the test and use conditions
If test data is available that was taken at accelerated conditions, it needs to be converted
to the conditions of interest. A traditional reliability prediction can be performed at both
the test and use conditions, and the equivalent number of hours (bi) can be accelerated by
the failure rate ratio between the test and use temperatures, as follows:


H_Eq = (λ_T1 / λ_T2) · H_T

where:
H_Eq = the equivalent number of test hours
λ_T1 = the predicted failure rate at the test conditions, obtained by performing a reliability prediction of the product or system at the test conditions
λ_T2 = the predicted failure rate at the use conditions, obtained by performing a reliability prediction of the product or system at the use conditions
H_T = the actual number of test hours
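
A sketch combining the two steps just described (Python; all counts and rates are hypothetical). Accelerated test hours are first converted to equivalent use hours, then pooled with the prior defined by a0 = 0.5 and b0 = a0 divided by the predicted failure rate:

def equivalent_hours(h_test, lam_test, lam_use):
    """Convert accelerated test hours to equivalent use-condition hours."""
    return (lam_test / lam_use) * h_test

def combined_estimate(lam_pred, data):
    """Bayesian pooling: lambda = (a0 + sum ai) / (b0 + sum bi), where
    `data` is a list of (failures, millions-of-hours) pairs."""
    a0 = 0.5
    b0 = a0 / lam_pred
    return (a0 + sum(f for f, _ in data)) / (b0 + sum(h for _, h in data))

h_eq = equivalent_hours(h_test=2000.0, lam_test=12.0, lam_use=3.0)  # 8000 h
# One failure observed in those hours (converted to millions), plus prior:
print(combined_estimate(lam_pred=3.0, data=[(1, h_eq / 1e6)]))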

The benefits of including empirical data in the failure rate estimate are that it:

• Integrates all reliability data that is available at the point in time when the estimate is performed (analogous to the statistical process called meta-analysis)
• Provides flexibility for the user to customize the reliability model with actual historical experience data

7.2.3.7. Estimating Confidence Levels

The 217Plus methodology also estimates confidence levels around the failure rate.
Before empirical data is available on a system, the levels are assessed based on a
distribution that was derived by analyzing data on a variety of systems for which both
reliability predictions and field data were available. After test or field data becomes
available and failures are accrued, traditional Chi-square techniques can be used to
estimate the uncertainty in the reliability prediction.
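One common form of those chi-square bounds is sketched below; this is a standard formulation shown for illustration and is not necessarily the exact one implemented in the 217Plus software.

```python
# A sketch of traditional chi-square confidence bounds on a constant
# failure rate (r failures in T hours).
from scipy.stats import chi2

def failure_rate_bounds(failures, hours, confidence=0.90):
    alpha = 1.0 - confidence
    lower = (chi2.ppf(alpha / 2, 2 * failures) / (2 * hours)
             if failures > 0 else 0.0)
    upper = chi2.ppf(1 - alpha / 2, 2 * failures + 2) / (2 * hours)
    return lower, upper

# Example: 3 failures in 4 million hours, 90% two-sided bounds.
print(failure_rate_bounds(3, 4e6))
```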
7.2.3.8. Using the 217Plus Model in a Top-Down Analysis

If empirical data exists on a predecessor system, the equation that translates the failure
rate from the old system to the new system is as follows:

$$\lambda_{predicted} = \lambda_{predecessor} \times \frac{\lambda_{predicted,\,new}}{\lambda_{predicted,\,predecessor}}$$

The λ(predicted, new)/λ(predicted, predecessor) failure rate ratio accounts for the differences in application environment, complexity, stresses, date, etc. The predicted failure rates for the predecessor and the new system are determined using the complete detailed 217Plus methodology previously described. The observed predecessor failure rate is used as the baseline against which the new system failure rate is estimated.
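A sketch of this translation, with illustrative inputs:

```python
# A sketch of the predecessor-based translation described above.

def translate_failure_rate(lambda_obs_pred, lambda_new, lambda_old):
    """Scale the observed predecessor rate by the ratio of the detailed
    217Plus predictions for the new and predecessor systems."""
    return lambda_obs_pred * (lambda_new / lambda_old)

# Example: predecessor observed at 8.0 f/10^6 hr; the detailed predictions
# say the new design should be 25% better (3.0 vs. 4.0).
print(translate_failure_rate(8.0, 3.0, 4.0))  # 6.0 f/10^6 hr
```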
7.2.3.9. Capacitor Model Example

This section presents an example of the 217Plus component model for capacitors. The
failure rate equation for capacitors is:

$$\lambda_P = \pi_G \pi_C \left(\lambda_{OB}\,\pi_{DCO}\,\pi_{TO}\,\pi_S + \lambda_{EB}\,\pi_{DCN}\,\pi_{TE} + \lambda_{TCB}\,\pi_{CR}\,\pi_{DT}\right) + \lambda_{SJB}\,\pi_{SJDT} + \lambda_{EOS}$$

where:

λ_P = predicted failure rate, failures per million calendar hours
π_G = reliability growth failure rate multiplier:

$$\pi_G = e^{-\beta(Y - 1993)}$$

β = growth constant. A function of capacitor type (see Table 7.2-6)
π_C = capacitance failure rate multiplier:

$$\pi_C = \left(\frac{C}{C1}\right)^{CE}$$

C = capacitance, in microfarads
C1 = constant. A function of capacitor type (see Table 7.2-6)
CE = constant. A function of capacitor type (see Table 7.2-6)
λ_OB = base failure rate, operating
π_DCO = failure rate multiplier for duty cycle, operating:

$$\pi_{DCO} = \frac{DC}{DC1_{op}}$$

π_TO = failure rate multiplier for temperature, operating:

$$\pi_{TO} = e^{-\frac{Ea_{op}}{0.00008617}\left(\frac{1}{T_{AO} + 273} - \frac{1}{298}\right)}$$

Ea_op = activation energy, operating. A function of capacitor type (see Table 7.2-6)
π_S = failure rate multiplier for electrical stress:

$$\pi_S = \left(\frac{S_A}{S1}\right)^{n}$$

S_A = stress ratio, the applied voltage stress divided by the rated voltage
S1 = constant. A function of capacitor type (see Table 7.2-6)
n = constant. A function of capacitor type (see Table 7.2-6)
λ_EB = base failure rate, environmental (see Table 7.2-6)
π_DCN = failure rate multiplier, duty cycle nonoperating:

$$\pi_{DCN} = \frac{1 - DC}{DC1_{nonop}}$$

π_TE = failure rate multiplier, temperature-environment:

$$\pi_{TE} = e^{-\frac{Ea_{nonop}}{0.00008617}\left(\frac{1}{T_{AE} + 273} - \frac{1}{298}\right)}$$

Ea_nonop = activation energy, nonoperating. A function of capacitor type (see Table 7.2-6)
λ_TCB = base failure rate, temperature cycling (see Table 7.2-6)
π_CR = failure rate multiplier, cycling rate:

$$\pi_{CR} = \frac{CR}{CR1}$$

π_DT = failure rate multiplier, delta temperature:

$$\pi_{DT} = \frac{T_{AO} - T_{AE}}{DT1}$$
λ_SJB = base failure rate, solder joint (see Table 7.2-6)
π_SJDT = failure rate multiplier, solder joint delta temperature:

$$\pi_{SJDT} = \left(\frac{T_{AO} - T_{AE}}{44}\right)^{2.26}$$

λ_EOS = failure rate, electrical overstress (see Table 7.2-6)
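To make the structure of the equation concrete, here is a minimal Python sketch of its evaluation. In a real analysis the constants would be looked up by part type in Table 7.2-6; nothing here hard-codes those entries.

```python
import math

BOLTZMANN_EV = 0.00008617  # eV/K, the constant used in the pi_T factors

def capacitor_failure_rate(year, beta, C, C1, CE, lam_ob, DC, DC1op, Eaop,
                           TAO, SA, S1, n, lam_eb, DC1nonop, Eanonop, TAE,
                           lam_tcb, CR, CR1, DT1, lam_sjb, lam_eos):
    """Evaluate the 217Plus capacitor model equation above; returns
    failures per million calendar hours. Part-type constants (C1, CE,
    S1, n, etc.) come from Table 7.2-6."""
    pi_g = math.exp(-beta * (year - 1993))              # reliability growth
    pi_c = (C / C1) ** CE                               # capacitance
    pi_dco = DC / DC1op                                 # operating duty cycle
    pi_to = math.exp(-(Eaop / BOLTZMANN_EV)
                     * (1 / (TAO + 273) - 1 / 298))     # operating temperature
    pi_s = (SA / S1) ** n                               # electrical stress
    pi_dcn = (1 - DC) / DC1nonop                        # nonoperating duty cycle
    pi_te = math.exp(-(Eanonop / BOLTZMANN_EV)
                     * (1 / (TAE + 273) - 1 / 298))     # nonoperating temperature
    pi_cr = CR / CR1                                    # cycling rate
    pi_dt = (TAO - TAE) / DT1                           # delta temperature
    pi_sjdt = ((TAO - TAE) / 44) ** 2.26                # solder joint delta T
    return (pi_g * pi_c * (lam_ob * pi_dco * pi_to * pi_s
                           + lam_eb * pi_dcn * pi_te
                           + lam_tcb * pi_cr * pi_dt)
            + lam_sjb * pi_sjdt + lam_eos)
```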

Table 7.2-6: Capacitor Parameters
(base failure rates in failures per million calendar hours; blank cells could not be recovered from the source)

Part Type | λOB | λEB | λTCB | λEOS (IND) | λSJB | TRdefault | DC1op | Eaop | DC1nonop | Eanonop | CR1 | DT1 | n | C1 | S1 | CE
Aluminum | 0.000465 | 0.00022 | 0.000214 | 0.000768 | 0.00095 | 0.229 | 0.17 | 0.5 | 0.83 | 0.4 | 1140.35 | 21 | | 7.6 | 0.6 | 0.23
Ceramic | 0.001292 | 0.000645 | 0.000096 | 0.00014 | 0.00095 | 0.0082 | 0.17 | 0.3 | 0.83 | 0.3 | 1140.35 | 21 | | 0.1 | 0.6 | 0.09
General | 0.000634 | 0.000351 | 0.000083 | 0.000259 | 0.00095 | 0.033 | 0.17 | 0.3 | 0.83 | 0.3 | 1140.35 | 21 | | 0.1 | 0.6 | 0.09
Mica/Glass | 0.000826 | 0.000997 | 0.000888 | 0.000764 | 0.00095 | 0.0082 | 0.17 | 0.4 | 0.83 | 0.4 | 1140.35 | 21 | 10 | 0.1 | 0.6 | 0.09
Paper | 0.000663 | 0.000075 | 0.000882 | 0.000042 | 0.00095 | 0.0082 | 0.17 | 0.2 | 0.83 | 0.2 | 1140.35 | 21 | | 0.1 | 0.6 | 0.09
Plastic | 0.000994 | 0.001462 | 0.001657 | 0.002531 | 0.00095 | 0.0082 | 0.17 | 0.2 | 0.83 | 0.2 | 1140.35 | 21 | | 0.1 | 0.6 | 0.09
Tantalum | 0.000175 | 0.000049 | 0.000032 | 0.000816 | 0.00095 | 0.229 | 0.17 | 0.2 | 0.83 | 0.2 | 1140.35 | 21 | 17 | 7.6 | 0.6 | 0.23
Variable, Air | 0.002683 | 0.005193 | 0.002066 | 0.000566 | 0.00095 | 0.0082 | 0.17 | 0.3 | 0.83 | 0.3 | 1140.35 | 21 | | 0.35 | 0.5 | 0.09
Variable, Ceramic | 0.002683 | 0.005193 | 0.002066 | 0.000566 | 0.00095 | 0.0082 | 0.17 | 0.3 | 0.83 | 0.1 | 1140.35 | 21 | | 0.35 | 0.5 | 0.09
Variable, FEP | 0.002683 | 0.005193 | 0.002066 | 0.000566 | 0.00095 | 0.0082 | 0.17 | 0.3 | 0.83 | 0.2 | 1140.35 | 21 | | 0.35 | 0.5 | 0.09
Variable, General | 0.002683 | 0.005193 | 0.002066 | 0.000566 | 0.00095 | 0.0082 | 0.17 | 0.3 | 0.83 | 0.2 | 1140.35 | 21 | | 0.35 | 0.5 | 0.09
Variable, Glass | 0.002683 | 0.005193 | 0.002066 | 0.000566 | 0.00095 | 0.0082 | 0.17 | 0.3 | 0.83 | 0.2 | 1140.35 | 21 | | 0.35 | 0.5 | 0.09
Variable, Mica | 0.002683 | 0.005193 | 0.002066 | 0.000566 | 0.00095 | 0.0082 | 0.17 | 0.3 | 0.83 | 0.2 | 1140.35 | 21 | 10 | 0.35 | 0.5 | 0.09
Variable, Plastic | 0.002683 | 0.005193 | 0.002066 | 0.000566 | 0.00095 | 0.0082 | 0.17 | 0.3 | 0.83 | 0.2 | 1140.35 | 21 | | 0.35 | 0.5 | 0.09

7.2.3.10. Default Values

The default values for the environmental and operating profile factors are summarized in Tables 7.2-7 and 7.2-8.

Table 7.2-7: Default Environmental Stress Values

Environment | TAO (deg C) | TAE (deg C) | Humidity (%) | Vibration (GRMS)
Airborne | 55 | 14 | 40 | 9
Airborne, Fixed Wing | 55 | 14 | 40 | 9
Airborne, Fixed Wing, Inhabited | 55 | 14 | 40 | 9
Airborne, Fixed Wing, Uninhabited | 71 | 14 | 50 | 9
Airborne, Missile | 55 | 14 | 40 | 10
Airborne, Missile, Flight | 55 | 14 | 40 | 1.3
Airborne, Missile, Launch | 55 | 14 | 40 | 16
Airborne, Rotary Wing | 55 | 14 | 40 | 3.3
Airborne, Rotary Wing, Inhabited | 55 | 14 | 40 | 3.3
Airborne, Rotary Wing, Uninhabited | 71 | 14 | 50 | 3.3
Airborne, Space | 55 | 14 | 40 | 0
Ground | 35 | 17 | 40 | 0
Ground, Man Pack | 55 | 14 | 40 | 1
Ground, Mobile | 55 | 14 | 40 | 10
Ground, Mobile, Heavy Wheeled | 55 | 14 | 40 | 10
Ground, Mobile, Heavy Wheeled, Chassis Mounted | 55 | 14 | 40 | 10
Ground, Mobile, Heavy Wheeled, Engine Compartment | 55 | 14 | 40 | 10
Ground, Mobile, Heavy Wheeled, Engine Mounted | 55 | 14 | 40 | 10
Ground, Mobile, Heavy Wheeled, Instrument Panel Closed | 55 | 14 | 40 | 10
Ground, Mobile, Heavy Wheeled, Instrument Panel Open | 55 | 14 | 40 | 10
Ground, Mobile, Heavy Wheeled, Trunk | 55 | 14 | 40 | 10
Ground, Mobile, Light Wheeled | 55 | 14 | 40 | 4
Ground, Mobile, Light Wheeled, Chassis Mounted | 34 | 14 | 40 | 4
Ground, Mobile, Light Wheeled, Engine Compartment | 40 | 14 | 40 | 4
Ground, Mobile, Light Wheeled, Engine Mounted | 58 | 14 | 40 | 4
Ground, Mobile, Light Wheeled, Instrument Panel Closed | 31 | 14 | 40 | 4
Ground, Mobile, Light Wheeled, Instrument Panel Open | 24 | 14 | 40 | 4
Ground, Mobile, Light Wheeled, Trunk | 17 | 14 | 40 | 4
Ground, Mobile, Tracked | 55 | 14 | 40 | 2
Ground, Stationary | 35 | 19 | 40 | 0
Ground, Stationary, Indoors | 30 | 23 | 40 | 0
Ground, Stationary, Outdoors | 40 | 14 | 50 | 0
Naval | 55 | 14 | 80 | 0.7
Naval, Shipboard | 55 | 14 | 80 | 0.7
Naval, Shipboard, Sheltered | 40 | 20 | 70 | 0.7
Naval, Shipboard, Unsheltered | 60 | 14 | 90 | 0.7
Naval, Submarine | 55 | 23 | 50 | 1

Table 7.2-8: Default Operating Profile Values

Equipment Type | DC (%) | CR (C/yr)
Automotive | 5 | 1000
Commercial Aircraft | 25 | 2982
Computer | 80 | 1491
Consumer | 30 | 368
Emergency Power | 10 | 50
Industrial | 80 | 184
Military Aircraft | 25 | 1008
Military Ground | 45 | 263
Naval | 80 | 50
Telecommunications | 80 | 368

7.2.4. Photonic Model Development Example

7.2.4.1. Introduction

7.2.4.1.1. Component Reliability Models Form

This section summarizes the manner in which the photonic device models were derived (Reference 3). It is included to demonstrate the development of models when little field data is available.

The photonic component model form is:

$$\lambda_P = \pi_Q\left(\lambda_{OB}\,\pi_{DCO}\,\pi_{TO}\,\pi_V + \lambda_{EB}\,\pi_{DCN}\,\pi_{TE}\,\pi_{RH} + \lambda_{TCB}\,\pi_{CR}\,\pi_{DT} + \lambda_{ind}\right)$$

where:

λ_P = predicted failure rate
π_Q = multiplier for photonic device quality
λ_OB = base failure rate from operational stresses
π_DCO = failure rate multiplier for duty cycle:

$$\pi_{DCO} = \frac{DC}{DC1_{op}}$$


π_TO = factor for operating temperature:

$$\pi_{TO} = e^{-\frac{Ea_{op}}{0.00008617}\left(\frac{1}{T_{AO} + T_R + 273} - \frac{1}{298}\right)}$$

π_V = vibration factor:

$$\pi_V = \left(\frac{V_a + 1}{V_c}\right)^{n_{vib}}$$

λ_EB = base failure rate from environmental stresses
π_DCN = failure rate multiplier for nonoperating duty cycle:

$$\pi_{DCN} = \frac{1 - DC}{1 - DC1_{op}}$$

π_TE = nonoperating temperature factor:

$$\pi_{TE} = e^{-\frac{Ea_{nonop}}{0.00008617}\left(\frac{1}{T_{AE} + 273} - \frac{1}{298}\right)}$$

π_RH = humidity factor:

$$\pi_{RH} = \left(\frac{RH_a + 1}{RH_c}\right)^{n_{RH}}$$

λ_TCB = base failure rate from power or temperature cycling stresses
π_CR = cycling rate factor:

$$\pi_{CR} = \frac{CR}{CR1}$$

π_DT = delta temperature factor:

$$\pi_{DT} = \left(\frac{T_{AO} + T_R - T_{AE}}{14}\right)^{n_{PC}}$$

λ_ind = failure rate from induced stresses


The model parameters are defined as follows:

λ_P = predicted failure rate, failures per million calendar hours
π_Q = failure rate multiplier for quality
λ_OB = base failure rate, operating
π_DCO = failure rate multiplier for duty cycle, operating
DC = duty cycle (fraction of calendar time in operation)
DC1op = 0.25
π_TO = failure rate multiplier, temperature operating
Ea_op = activation energy, operating
TAO = ambient operating temperature
TR = temperature rise above TAO
π_V = failure rate multiplier, vibration level
VA = maximum vibration level applied (Grms)
VC = 1.0
n_vib = vibration exponent
λ_EB = base failure rate, environment
π_DCN = failure rate multiplier, duty cycle nonoperating
π_TE = failure rate multiplier, temperature environment
Ea_nonop = activation energy, nonoperating
TAE = ambient environmental temperature
π_RH = failure rate multiplier, relative humidity
RHa = relative humidity (%)
RHc = 50%
n_RH = relative humidity exponent
λ_TCB = base failure rate, temperature cycling
π_CR = failure rate multiplier, cycling rate
CR = cycling rate (cycles per year)
CR1 = 1000
π_DT = failure rate multiplier, delta temperature
n_PC = temperature cycling exponent

7.2.4.1.2. Model Development Methodology

The modeling methodology that was used in the photonics device modeling study is
summarized in Figure 7.2-4. This methodology is similar to the 217Plus model
development methodology, but was tailored for the specific needs of photonic
components. Each element of this methodology is explained in the following sections.

Figure 7.2-4: Model Development Methodology Flowchart. The flowchart steps are: collect reliability data and populate spreadsheet; collect failure mode data; map observed failure modes into the failure cause categories; identify the base percentage of failure rate attributable to each cause; estimate the stresses to which the parts were exposed; estimate acceleration model constants; calculate a normalization stress for each accelerating stress; estimate acceleration factors (Pi factors) for each part from each data source; and calculate the base failure rates for each cause such that observed = predicted failure rates.


7.2.4.2. Model Development Methodology and Results

This section details the model development methodology and also presents the results of each task in this methodology. Each task in Figure 7.2-4 is described in the following sections.

7.2.4.2.1. Collect Failure Mode Data

There are two primary types of data upon which the component models are based: failure rate and failure mode. The model development process required that the failure rate data be apportioned into the following four defined failure cause categories:

Failures from operational stresses


Failures from environmental stresses
Failures from power or temperature cycling stresses
Failures from induced stresses

Since failure mode data is typically not classified according to these categories, it is
necessary to transform the failure mode distribution data into the failure cause
distribution. This failure mode distribution data was obtained from several sources:

Data collected during the photonic device study


Data obtained from the literature
Analysis similar to a Failure Mode and Effects Analysis (FMEA), in which failure
causes are hypothesized.

An example of this is summarized in Table 7.2-9, in which the failure causes for a
connector are hypothesized (2nd column), and then an occurrence rating is given for each
cause. This rating is in the 3rd column, and is scored as a 1, 3 or 9. This weighting
scheme is often used in FMEA analysis. The result is a fractional value for each failure
cause that is proportional to the weighting. The sum of all of these values for each
component type equals 1.0.
The methodology used in the photonics device models to derive the fraction of
occurrence differs from the methodology presented previously for the 217Plus
components, in that failure mode distributions were not available during the photonics
model development effort. For the 217Plus models, the components were more mature
and therefore, there was considerable history of both failure mode and failure rate data to
draw upon.

Table 7.2-9: Failure Cause Summary for Connectors

Component Type: Connector (SC and FC)

Failure Cause | Occurrence | Fraction of Occurrence
Spring failure | 3 | 0.073
Wear of the connector resulting in misalignment | 3 | 0.073
Wear of the end face | 1 | 0.024
Contamination of facet (sand, dust, grease) | 9 | 0.220
Contamination on outside that wicks in | 1 | 0.024
Eccentric wear on the ferrule causes misalignment | 1 | 0.024
Crimping too tight causes pinching | 3 | 0.073
Crimping too loose causes it to fall apart | 1 | 0.024
O-ring failure | 1 | 0.024
Contraction of the outer jacket causes fiber pistoning | 3 | 0.073
Fracture of the end face | 1 | 0.024
Misalignment of cable end due to sleeve wear | 1 | 0.024
Misalignment of cable end due to buckling from tolerance stack up | 3 | 0.073
Misalignment of cable end due to separation from tolerance stack up | 3 | 0.073
Insufficient cure of epoxy | 3 | 0.073
Corrosion, pitting of facets | 3 | 0.073
Embrittlement of organic materials due to UV exposure | 1 | 0.024

7.2.4.2.2. Map Observed Failure Modes into the Failure Cause Categories

To transform the failure mode distribution data into the failure cause distribution, the
following process was used:

Identify failure modes and their relative percentages (summarized above)


Identify the accelerating factors applicable to each failure cause
Identify the accelerating stresses applicable to each failure cause category (for
example, accelerating stresses from device operation applicable to many photonic
components will be optical power, temperature, etc.)
Map the accelerating stress to the appropriate failure modes (identify them as
being a primary, secondary or no accelerant driver)

The last item is accomplished by assessing whether each stress is a primary accelerant of
the failure mode, a secondary accelerant, or is not an accelerant. A 3:1 weighting
between primary and secondary accelerant was then used in estimating the percentage of
failures that could be attributed to those stresses.
The primary stresses that potentially accelerate operational failure modes are operating
temperature, vibration, current/voltage and optical power. The stresses that accelerate
environmental failure causes are nonoperating ambient temperature, corrosive stresses
(contaminants/heat/humidity), and aging stresses (time). As an example, Table 7.2-10
summarizes this process for our connector example.

Table 7.2-10: Failure Mode to Failure Cause Category for Connectors (SC and FC)
(The table maps each connector failure mode of Table 7.2-9 against the accelerating stresses/causes: operating temperature, vibration, current/voltage and optical power under operational stresses; ambient temperature, corrosion, ageing and humidity under environmental stresses; power cycling; and induced/handling. Each combination is marked "p" for a primary accelerant, "s" for a secondary accelerant, or left blank. The resulting failure cause category totals are: operational 0.11, environmental 0.30, power cycling 0.23, and induced/handling 0.36, summing to 1.00.)

Each of the failure modes is listed across the top of the table, and each of the accelerating
stresses/causes is listed down the left side. Each combination is identified with a blank
(no acceleration from the factor), a "p" (primary) or an "s" (secondary). The associated
relative percentage of failures attributable to the accelerating stress/cause is listed down
the right columns.
The % column (second from the right) is calculated as follows:

$$\% = \sum_{FM_1}^{FM_n} FM\%_i \cdot \frac{w_i}{\sum_{AC_1}^{AC_n} w_i}$$

where:

FM%_i = the percentage associated with the ith failure mode
w_i = the weight of the specific combination of failure mode and accelerating stress or cause (0 for none, 1 for secondary, and 3 for primary), and the denominator sums the weights across all accelerating stresses for that failure mode

For example, the % value for ambient temperature (as part of the environmental failure cause category) is:

$$\frac{1}{11}(7.32\%) + \frac{1}{4}(7.32\%) + \frac{1}{4}(7.32\%) + \frac{1}{1}(2.44\%) = 0.07$$

Therefore, an estimate of the percentage of failure causes accelerated by ambient temperature is 7%.
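The same computation can be written compactly; the sketch below assumes a small illustrative weight matrix rather than the full connector matrix of Table 7.2-10.

```python
# A sketch of the failure-mode-to-failure-cause transformation; the weight
# matrix here is a small illustrative subset, not the full Table 7.2-10.

def cause_percentages(mode_fractions, weights):
    """mode_fractions: {failure mode: fraction of occurrence}
    weights: {failure mode: {accelerating stress: 0, 1 or 3}}
    Returns the fraction of failures attributable to each stress."""
    result = {}
    for mode, fm_pct in mode_fractions.items():
        total_w = sum(weights[mode].values())
        for stress, w in weights[mode].items():
            if w:
                result[stress] = result.get(stress, 0.0) + fm_pct * w / total_w
    return result

modes = {"spring failure": 0.073, "contamination of facet": 0.220}
w = {"spring failure": {"vibration": 3, "ambient temperature": 1},
     "contamination of facet": {"humidity": 3, "ambient temperature": 1}}
print(cause_percentages(modes, w))
```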
7.2.4.2.3. Identify the Base Percentage of Failure Rate Attributable to Each Cause

The base percentages of failure rate are calculated by summing the accelerating stress/cause percentages associated with each failure cause. For our connector example, the four percentages associated with the operating accelerating stresses/causes sum to 11%, or 0.11. These percentages are an estimate of the percentage of failures that can be expected for each cause under nominal stress conditions. In this case, nominal stresses are the average stresses to which the models are normalized. Table 7.2-11 summarizes the failure cause percentages (in fractional form).


Table 7.2-11: Failure Cause Percentages for Connectors

Failure Rate Term | Percentage (Fraction)
Operational | 0.11
Environmental | 0.30
Power Cycling | 0.23
Induced | 0.36

7.2.4.2.4. Collect Reliability Data and Populate Spreadsheet

As previously summarized, the approach that was taken in photonics device model
development methodology relied on the collection of quantitative failure mode and
failure rate data. Literature searches were performed toward the goal of collecting the
quantitative data required for model development. Sources searched for applicable data
included:

Optical Society of America (OSA)


SPIE
RIAC databases
Total Electronic Migration System (TEMS) (a database of government-related
research from IACs and other sources)
Government-Industry Data Exchange Program (GIDEP)
Manufacturers data
Data mined from the Web

The results of this data collection effort, for connectors, are summarized in Table 7.2-12.

Table 7.2-12: Data Collected for Connectors
(Columns: Part Type, Data Type, TAO, TAE, TR, VA, DC, CR, RHa, Delta T, Hours, Failures, Observed Lambda. The connector records comprise one field data source plus damp heat, high temperature storage, low temperature storage, thermal cycling and vibration test sources.)

The first column is the part type; the second is the data type. Data types used in the
photonics device study included:

- Field data
- Test data:
  o Thermal cycling
  o Vibration
  o Damp heat
  o High temperature storage
  o Low temperature storage
  o Operating life test

The 3rd through 10th columns are the estimates of the actual stresses to which the part was exposed in the field or during the test. These stresses are defined as follows:
TAO = ambient operating temperature
TAE = ambient environmental temperature
TR = temperature rise above TAO
VA = maximum vibration level applied (Grms)
DC = duty cycle (fraction of calendar time in operation)
CR = cycling rate (cycles per year)
RHa = relative humidity (%)

7.2.4.2.5. Estimate Stresses to Which the Parts were Exposed

For each source of data that was collected, an estimate of the stresses and operating
profiles to which the component was exposed was required so that the failure rates could
be normalized to the actual stresses. These stresses were summarized in the previous
section.
For test data, these values were generally readily available. For data collected from
fielded systems, the actual stress values were not available. Therefore, they had to be
estimated. The default values of the environmental and operating profile factors were
summarized in Tables 7.2-7 and 7.2-8. Only field data from telecommunication
applications used in a ground, stationary, indoors environment was available to the
photonics device modeling study, so only the values pertaining to those conditions were
estimated in this manner.
7.2.4.2.6. Estimate Acceleration Model Constants for Each Part

Acceleration factors (or Pi-factors) were used in the component models to estimate the
effects of various stress and component variables on the failure rate. The two
predominant forms of acceleration factors are the Arrhenius and the power law models.
The Arrhenius model is generally used for modeling temperature effects and is:

$$AF_T = e^{\frac{E_a}{KT}}$$

where AF_T is the temperature acceleration factor, Ea is the activation energy, K is Boltzmann's constant, and T is the temperature (in degrees K).

The power law model is:

$$AF = S^n$$

where S is the stress and n is a constant.

The specific forms of these acceleration factors that were used in the models are summarized below.

π_TO = factor for operating temperature:

$$\pi_{TO} = e^{-\frac{Ea_{op}}{0.00008617}\left(\frac{1}{T_{AO} + T_R + 273} - \frac{1}{298}\right)}$$

π_V = vibration factor:

$$\pi_V = \left(\frac{V_a + 1}{V_c}\right)^{n_{vib}}$$

π_TE = nonoperating temperature factor:

$$\pi_{TE} = e^{-\frac{Ea_{nonop}}{0.00008617}\left(\frac{1}{T_{AE} + 273} - \frac{1}{298}\right)}$$

π_RH = humidity factor:

$$\pi_{RH} = \left(\frac{RH_a + 1}{RH_c}\right)^{n_{RH}}$$

π_DT = delta temperature factor:

$$\pi_{DT} = \left(\frac{T_{AO} + T_R - T_{AE}}{14}\right)^{n_{PC}}$$

The temperature factors based on the Arrhenius relationship were normalized to 25 degrees C. The acceleration factors for vibration and relative humidity that are based on the power law were normalized to a specific value (i.e., the denominator), and include a value of 1.0 in the numerator to ensure that the factor does not go to zero at a stress level of zero.
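A sketch of the two normalized factor forms follows; the reference values used are the photonic-model defaults, and the printed checks show how the factors behave at their normalization points.

```python
import math

# A sketch of the two normalized Pi-factor forms described above.

def pi_arrhenius(ea, temp_c):
    """Arrhenius factor normalized to 25 degrees C (equals 1.0 there)."""
    k = 0.00008617  # Boltzmann's constant, eV/K
    return math.exp(-(ea / k) * (1 / (temp_c + 273) - 1 / 298))

def pi_power_law(stress, stress_ref, n):
    """Power-law factor; the +1 keeps it nonzero at zero applied stress."""
    return ((stress + 1) / stress_ref) ** n

print(pi_arrhenius(0.1, 25.0))       # 1.0 at the normalization temperature
print(pi_power_law(0.0, 1.0, 5))     # vibration factor at zero stress -> 1.0
print(pi_power_law(50.0, 50.0, 10))  # humidity factor near the 50% default
```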
Each model has a single factor that needs to be estimated, i.e., Ea for the Arrhenius model and n for the power law. These were estimated in one of the following ways:

1. Values generated from information that was available in the literature
2. Engineering judgment based on the known behavior of similar failure mechanisms

For #2, the accelerations were categorized from "no acceleration" to "very high acceleration" for each specific accelerating stress. Table 7.2-13 summarizes the values of the applicable parameters as a function of the relationship.

Table 7.2-13: Categories of Acceleration Model Parameters

Dependency | n (PC) | Ea (op) | Ea (nonop) | n (RH) | n (Vibration)
Very High | 10 | 1 | 1 | 10 | 10
High | 5 | 0.7 | 0.7 | 5 | 5
Medium | 2 | 0.5 | 0.5 | 2 | 2
Low | 1 | 0.1 | 0.1 | 1 | 1
None | 0 | 0 | 0 | 0 | 0

Table 7.2-14 summarizes the specific parameter values used in the connector models.

Table 7.2-14: Acceleration Model Parameters

Component Type | n (PC) | Ea (op) | Ea (nonop) | n (RH) | n (Vibration)
Connector | 2 | 0.1 | 0.1 | 10 | 5

7.2.4.2.7. Calculate a Normalization Stress for Each Accelerating Stress

The Pi factors needed to be normalized to a fixed set of conditions. This approach makes
it convenient to derive default Pi-factors. By normalizing the factors in this manner, the
Pi-factor is equal to 1.0 when the stress is equal to the default stress. Therefore, if an
analyst chooses to ignore the effects of a particular stress, the failure rate will be
representative of the default stress levels.
The default values for the applicable photonics device model Pi-factors are summarized
in Table 7.2-15.

Table 7.2-15: Default Model Parameters

Model Category | Default Tr
Connector | 0
Passive Micro-Optic Component | 10
Passive Fiber-Based Component | 0
Isolator | 5
VOA | 20
Fiber | 0
Splice | 0
Cable | 0
Laser Diode Module | 15
Photodiode | 5
Transmitter | 15
Receiver | 15
Transceiver | 15

The remaining defaults apply across all model categories: Default DC = 0.25, Default CR = 1000, Default RH = 50, and Default DT = 20.

7.2.4.2.8. Estimate the Acceleration Factors (Pi-factors) for Each Part from Each Data Source

The acceleration factors used in the models are Pi-factors, which are the acceleration factors normalized to a given stress level. These factors were calculated for each part from each data source. To derive these factors, two pieces of information were required:

1. The estimate of the stress for each data point (in this case, a data point is a single observation of reliability (failures and hours) at a known set of stress conditions). The manner in which these were quantified was previously explained.
2. The default stress level of the data for each stress parameter in the model

The Pi-factor was then the acceleration model normalized to the default stress level. An example of this calculation is shown in Table 7.2-16. Every data point available from field or test data had its associated Pi-factor values calculated. Note that some of the Pi-factors were zero. This occurs because test data was not applicable to all failure causes. This concept will be further explained in the next section.


Table 7.2-16: Summary of Pi-factor Calculations
(Columns: Part Type, Data Type, πDCO, πTO, πV, πDCN, πTE, πRH, πCR, πDT. One row is given for each cable and connector data point from the field, thermal cycling, vibration, damp heat, and high and low temperature storage sources; Pi-factors of zero appear where a test type does not apply to a failure cause.)
7.2.4.2.9. Calculate the Base Failure Rates for Each Cause Such That the Observed Failure Rates = the Predicted Failure Rates

As in the case of the 217Plus models, which were based solely on field data, the base failure rates for the photonic device models were obtained, as follows, for each failure cause category:

$$\lambda_{Bi} = \frac{\sum_{1}^{m}\left(F_{obs} \cdot \%_i\right)_{field}}{\sum_{1}^{m}\left(H_{obs} \cdot \Pi\right)_{field}}$$

where:

λ_Bi = the base failure rate for the ith failure rate term
F_obs = the number of observed field failures
H_obs = the number of observed field hours
Π = the product of the Pi-factors applicable to the field environment
i = the number of failure causes
m = the number of field data sources
k = the number of correction factors
%_i = the percentage of failure rate attributable to the specific failure causes
The product of the Pi-factors converts the actual hours to an equivalent "effective" number of hours normalized to the default stress values.

However, in the case of the photonic models developed for the study, it was necessary to utilize a significant amount of test data, since there was not enough field data available. This is due to the fact that there are few field data sources for photonic components. Therefore, the modeling methodology needed to be tailored to accommodate the specific data available on the parts addressed in the photonics device study. This was accomplished by using a Bayesian technique in which the field data becomes the prior distribution, and the summation of the failures and hours from all data sources forms the basis of the posterior distribution. The failure rate parameter of the exponential distribution was, therefore:

$$\lambda_{Bi} = \frac{\sum_{1}^{m}\left(F_{obs} \cdot \%_i\right)_{field} + \sum_{1}^{j}\left(F_{obs}\right)_{test}}{\sum_{1}^{m}\left(H_{obs} \cdot \Pi\right)_{field} + \sum_{1}^{j}\left(H_{obs} \cdot \Pi\right)_{test}}$$

where there were j test data sources.
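A sketch of this estimation step with illustrative numbers; the equivalent hours are assumed to have been pre-multiplied by the applicable Pi-factors.

```python
# A sketch of the Bayesian base failure rate estimate described above.

def base_failure_rate(cause_pct, field_records, test_records):
    """field_records: (failures, equivalent hours) tuples from field sources;
    test_records: the same, from tests applicable to this cause only."""
    f = (sum(fails * cause_pct for fails, _ in field_records)
         + sum(fails for fails, _ in test_records))
    h = (sum(hrs for _, hrs in field_records)
         + sum(hrs for _, hrs in test_records))
    return f / h

# Example: cycling cause (23% of field failures), one field source plus
# one thermal cycling test source (illustrative numbers).
print(base_failure_rate(0.23, [(12, 33.3)], [(1, 1.752)]))
```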


Each specific type of test data that was collected for the study was applicable to only one
of the four specific failure causes, as summarized in Table 7.2-17. Field data, however,
encompassed all four failure causes.

Table 7.2-17: Applicability of Test Data

Data Type | Operating | Environmental | Cycling | Induced
Field | X | X | X | X
Operating Life Test | X | | |
High Temperature Storage | | X | |
Low Temperature Storage | | X | |
Damp Heat | | X | |
Vibration | | | | X
Thermal Cycling | | | X |

One of the advantages to the model structure was this ability to modify the base failure
rates of specific failure causes with test data applicable to only that failure cause.
The connector base failure rates resulting from this analysis are listed in Table 7.2-18.

Table 7.2-18: Base Failure Rates (Failures per Million Calendar Hours)

Component | Operating | Environmental | Cycling | Induced
Connector | 0.0002 | 0.3053 | 2.7952 | 0.0110

7.2.4.2.10. Adjust the Base Failure Rates

The last step in the process was to adjust the base failure rates to ensure that the predicted
number of failures was equal to the observed number. The manner in which this was
accomplished was to scale the base failure rates to ensure that the cumulative predicted
number of failures of the entire population of observed data points was equal to the
observed number of failures. This was accomplished by using the MS Excel goal seek
function, which finds the value of a correction factor that satisfies this boundary
condition. This approach is conceptually similar to a maximum likelihood method.
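Because the predicted failure counts scale linearly with the base failure rates, the goal-seek result can also be written in closed form, as in this sketch (illustrative numbers only):

```python
# A sketch of the final scaling step; since predicted failures are linear
# in the base failure rates, the goal-seek solution is a single ratio.

def scale_factor(observed_failures, predicted_failures):
    """predicted_failures: predicted failure count (rate x hours) for each
    observed data point, computed from the unscaled base failure rates."""
    return observed_failures / sum(predicted_failures)

k = scale_factor(33, [10.0, 14.5, 2.5])  # illustrative counts
print(k)  # each base failure rate is multiplied by this factor
```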
7.2.4.2.11. Treatment of Quality and Environmental Stresses

There were several options for modeling the effects of environmental stresses. Early in the study, it was decided that the effects of quality and environment would be treated such that the photonic component models would be "stand-alone". This approach differed from the form of the 217Plus methodology, in that quality and environment were treated as system-level effects. This concept was based on the premise that quality and environmental effects were manifested more at the assembly or system level than they were at the component level. The photonic component models include the effects of their pertinent environmental stresses in the component models, instead of applying the environment factor in the assembly or system model, as was the case with 217Plus. The primary environmental stresses included in the photonic component models are temperature, humidity and vibration.

The quality factor (π_Q) is calculated in a manner similar to the 217Plus methodology, but tailored to the unique concerns of photonic components. This factor is calculated as follows:

$$\pi_Q = \alpha_i\left(-\ln(R_i)\right)^{1/\beta_i}$$

where α_i and β_i are Weibull parameters representing the distribution of the percentage of failures attributable to components (parts). The quality factor is scaled within this distribution based on how good the parts control program is. The parameter R_i is the rating of the parts control program and is calculated from:

$$R_i = \frac{\sum_{j=1}^{n_i} G_{ij} W_{ij}}{\sum_{j=1}^{n_i} W_{ij}}$$

where:

R_i = rating of the process for the ith failure cause, from 0.0 to 1.0
G_ij = the grade for the jth item of the ith failure cause. This grade is the rating between 0.0 and 1.0 (worst to best).
W_ij = the weight of the jth item of the ith failure cause
n_i = the number of grading criteria associated with the ith failure cause

The 217Plus grading criteria, as applied to the photonics device models, are provided in
Table 7.2-19. These were tailored specifically for photonic components.
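A sketch of the π_Q calculation follows; the α and β values reuse the parts-quality constants of Table 7.2-21 purely for illustration, since the photonic models may use different Weibull parameters.

```python
import math

# A sketch of the quality factor calculation described above.

def process_rating(grades_and_weights):
    """R_i = sum(Gij * Wij) / sum(Wij) over the answered criteria."""
    num = sum(g * w for g, w in grades_and_weights)
    den = sum(w for _, w in grades_and_weights)
    return num / den

def pi_q(r, alpha=0.30, beta=1.62):
    """pi_Q = alpha * (-ln(R_i))^(1/beta); alpha/beta are illustrative."""
    return alpha * (-math.log(r)) ** (1.0 / beta)

# Example: one "yes" answer worth 10 and one "no" worth 5.
r = process_rating([(1.0, 10), (0.0, 5)])
print(r, pi_q(r))
```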

Table 7.2-19: Part Quality Process Grade Factor Questions for Photonic Device Models
(All questions fall under the category "Parts Contribution to Reliability".)

Question | Rating | Input Range
Is there a documented part selection and part management process? | yes = 5, no = 0 | Y,N
Are part evaluation and qualification processes established to add parts to the PPL? | yes = 3, no = 0 | Y,N
Does a cross functional development team (CFDT) review and approve new candidate parts for addition to the PPL? | yes = 3, no = 0 | Y,N
Is this a commercial off-the-shelf (COTS) purchased assembly with a good history of operational reliability? | yes = 6, no = 0 | Y,N
Will new parts be added to the PPL to design this FRU? | yes = 4, no = 0 | Y,N
Are procedures in place to detect part problems in both manufacturing and the field? | yes = 10, no = 0 | Y,N
Are quality and reliability data tracked on parts and fed back to suppliers so they know their performance on this product? | yes = 10, no = 0 | Y,N
Is there a design compliance checklist to ensure that all parts are properly applied, operating at sufficient margin with respect to environmental and operational stresses, and take into account lessons learned? | yes = 10, no = 0 | Y,N
Are teaming relationships established with all critical component suppliers? | yes = 7, no = 0 | Y,N
Will all suppliers provide timely failure reporting and corrective action support (FRACAS) for both critical and custom parts? (Timely reporting implies a 2 week turnaround with faster response on priority demand.) | yes = 7, no = 0 | Y,N
Have suppliers identified the likely failure modes on critical and custom parts, and does the design take these failure modes into account? | yes = 10, no = 0 | Y,N
Are operational failure rate and failure mode data provided by the suppliers of critical and custom parts being used? | yes = 7, no = 0 | Y,N
Is there a device specification for all critical and custom parts? | yes = 5, no = 0 | Y,N
Has the supplier reviewed the part application for all critical and custom parts? | yes = 7, no = 0 | Y,N
Will critical suppliers provide timely notice of impending part changes to allow the developer to assess the impact? | yes = 7, no = 0 | Y,N
Is a change history log maintained to provide traceability of engineering change actions and their associated rationale for critical and custom parts? | yes = 7, no = 0 | Y,N
Will part identification (revision numbers) be shown on the part to identify the particular part configuration, including the level of the part's firmware? | yes = 7, no = 0 | Y,N
Is there a first article inspection and acceptance test planned? | yes = 7, no = 0 | Y,N
Have key suppliers identified their part failure mechanisms? | yes = 10, no = 0 | Y,N
Have the sources and the extent of part variation been identified? | yes = 7, no = 0 | Y,N
Have mitigations been identified to handle the effects of part variations? | yes = 8, no = 0 | Y,N
Will a design of experiments part evaluation, considering variations, as well as manufacturing variations, be conducted? | yes = 7, no = 0 | Y,N
Will the developer's quality organization audit the supplier's processes and facility capabilities? | yes = 6, no = 0 | Y,N
Is an optical path adhesive (OPA) used in the component? | A. No OPA = 10; B. yes, MFD < 2 um = 0; C. yes, MFD = 2-5 um = 4; D. yes, MFD = 5-10 um = 6; E. yes, MFD > 10 um = 8 | A,B,C,D,E
Are there thin films (AR coatings, filter elements) in the light path? | A. No thin film = 0; B. yes, and surface is prepared by sputtering = 2; C. yes, and surface is not prepared by sputtering = 3 | A,B,C
Does the component contain fused fiber? | yes = 5, no = 0 | Y,N
Does the component contain fiber? | yes = 5, no = 0 | Y,N
Was the package thermally designed to safely dissipate heat by understanding and modeling the thermal characteristics? | yes = 3, no = 0 | Y,N
Has the manufacturer characterized the power handling capability of the component? | yes = 5, no = 0 | Y,N
Have acceleration factors for power and temperature been quantified, and are they used to determine the derating requirements? | yes = 5, no = 0 | Y,N
Does the component contain absorbers at wavelengths to which the component will be exposed (i.e., garnet, shutter, etc.)? | yes = 4, no = 0 | Y,N
How is dissipated power intended to be dumped? | A. with a heat sink = 4; B. dissipation not actively managed = 0 | A,B
Does the component rely on alignment of free space components attached with organics? | yes = 3, no = 0 | Y,N
Cleanliness precautions | A. stringent cleaning procedures = 3; B. some cleaning procedures = 2; C. no cleaning procedures = 0 | A,B,C
For components that have a fiber/epoxy interface, is the fiber tip inspected to ensure it is free of defects and contamination? | yes = 3, no = 0 | Y,N
7.2.4.3. Uncertainty Analysis

An analysis was performed to quantify the degree of uncertainty in the predicted failure
rates. This was accomplished by calculating the predicted failure rate and comparing it to
the observed failure rate. The metric that was used for this analysis was the log10 of the
value: predicted failure rate/observed failure rate. The value of this metric should cluster
around zero if the prediction models are approximating the observed data. Calculation of
the standard deviation of this metric also provides a quantification of the uncertainty
levels present in the predictions made with these models. Table 7.2-20 summarizes the
mean and standard deviation of this metric for all of the data and for only the field data.
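A sketch of the metric computation (illustrative failure rates only):

```python
import math
import statistics

# A sketch of the uncertainty metric: log10(predicted/observed), computed
# over data points that experienced at least one failure.

def uncertainty_metric(predicted, observed):
    ratios = [math.log10(p / o) for p, o in zip(predicted, observed)]
    return statistics.mean(ratios), statistics.stdev(ratios)

print(uncertainty_metric([1.2, 0.5, 3.0], [1.0, 1.1, 2.5]))
```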


Table 7.2-20: Summary of Uncertainty Metrics

Metric | All Data | Field Data
Mean | -0.68 | 0.20
Standard deviation | 1.07 | 0.44

Figures 7.2-5 and 7.2-6 illustrate the distribution of this metric for all data and for just
field data. For this analysis, only data for which failures occurred were included, since
data with no observed failures only have a single-sided bound on the failure rate and,
therefore, cannot be compared to the predicted value. The result of not including zero
failure data is that the metric is biased. As can be seen in these figures, the distribution of
all failures is significantly wider than the distribution of just the field failure rates. This
is due to the fact that the non-field data, i.e. test data, is typically at extreme conditions.
Therefore, the uncertainty in these extreme cases is typically larger than for nominal
conditions.

Figure 7.2-5: Distribution of Log10 Predicted/Observed Failure Rate Ratio for All Data (histogram of frequency versus LOG10(PREDICTED/OBSERVED))

Figure 7.2-6: Distribution of Log10 Predicted/Observed Ratio for Field Data Only (histogram of frequency versus LOG10(PREDICTED/OBSERVED))

The distributions of the predicted/observed failure rate ratio are illustrated in Figure 7.2-7. With this metric, the value should be centered about one, since the log of this ratio has not been taken.

Figure 7.2-7: Distributions of the Predicted/Observed Failure Rate Ratio for All Data and for Field Data Only (lognormal probability plot generated with ReliaSoft Weibull++ 7; fitted parameters: Data1 mu = 1.5585, sigma = 2.4925, rho = 0.9880; Data2 mu = 0.4556, sigma = 0.9547, rho = 0.8813)

7.2.4.4. Comments on Part Quality Levels

Part quality level has traditionally been used as one of the primary variables affecting the
predicted failure rate of a component. The quality level categories were usually those
defined by the applicable military specification.
One of the problems that developers had when developing MIL-HDBK-217 models was
de-convolving the effects of quality and environment. For example, multiple linear
regression analysis of field failure rate data was usually used to quantify model variables
as a function of independent variables such as quality and environment. A basic
assumption of such techniques is that the independent variables are statistically
independent of each other. However, in reality they are not, since the higher quality
components are generally used in the severe environments and the commercial quality
components are used in the more benign environments. This correlation makes it
difficult to discern the effects of each of the variables individually. Additionally, there
are several attributes pooled into the quality factor, including qualification, process
certification, screening and quality systems.
The approach used in the 217Plus model to quantify the effects of part quality is to treat it
as one of the failure causes for which a process grade is determined. In this manner,
issues related to qualification, process certification, screening and quality systems were
individually addressed.
7.2.4.5. Explanation of Failure Rate Units

The 217Plus models predict the failure rate in units of failures per million calendar hours.
This is necessary because the 217Plus methodology accounts for all failure rate
contribution terms (i.e., operating, nonoperating, cycling and induced), and the
appropriate manner in which they can be combined is to use a common time basis for the
failure rate, which is calendar hours.
If an equivalent operating failure rate is desired in units of failures per million operating hours, the 217Plus reliability prediction should be performed with the actual duty cycle to which the unit will be subjected; the resulting failure rate (in f/10^6 calendar hours) is then divided by the duty cycle to yield a failure rate in terms of f/10^6 operating hours. The resulting operating failure rate will be artificially increased to account for the nonoperating and cycling failures that would not otherwise be accounted for. The incorrect way to predict a 217Plus failure rate in units of failures per million operating hours is to set the duty cycle equal to 1.0. The resulting failure rate in this case would be valid only if the actual duty cycle is 100%. If the actual duty cycle is not 100%, then the failures during non-operating periods will not be accounted for.
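A sketch of the recommended conversion:

```python
# A sketch of the conversion described above: run the prediction at the
# real duty cycle, then divide the calendar-hour rate by the duty cycle.

def to_operating_hours(lambda_calendar, duty_cycle):
    """Convert f/10^6 calendar hours to f/10^6 operating hours."""
    return lambda_calendar / duty_cycle

# Example: 2.0 f/10^6 calendar hours at a 25% duty cycle corresponds to
# 8.0 f/10^6 operating hours (nonoperating/cycling failures folded in).
print(to_operating_hours(2.0, 0.25))
```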
7.2.5. System-Level Model

7.2.5.1. Model Presentation

As a reminder, the total 217Plus system model is:

$$\lambda_P = \lambda_{IA}\left(\pi_P \pi_{IM} \pi_E + \pi_D \pi_G + \pi_M \pi_{IM} \pi_E \pi_G + \pi_S \pi_G + \pi_I + \pi_N + \pi_W\right) + \lambda_{SW}$$
where:

λ_P = predicted failure rate of the system
λ_IA = initial assessment of the failure rate. This failure rate is based on new component failure rate models derived by the RIAC presented in Section 2.2, whose derivations are discussed in the next section

Each of the following model factors represents a failure cause:

π_P = parts process factor
π_D = design process factor
π_M = manufacturing process factor
π_S = system management process factor
π_I = induced process factor
π_N = no-defect process factor
π_W = wearout process factor

Each of these factors is calculated as follows:

$$\pi_i = \alpha_i\left(-\ln(R_i)\right)^{1/\beta_i}$$

where α_i and β_i are constants for each failure cause category, as given in Table 7.2-21. The parameter R_i is calculated as:

$$R_i = \frac{\sum_{j=1}^{n_i} G_{ij} W_{ij}}{\sum_{j=1}^{n_i} W_{ij}}$$

where:

R_i = rating of the process for the ith failure cause, from 0.0 to 1.0
G_ij = the grade for the jth item of the ith failure cause. This grade is the rating between 0.0 and 1.0 (worst to best).
W_ij = the weight of the jth item of the ith failure cause
n_i = the number of grading criteria associated with the ith failure cause

Table 7.2-21: Parameters for the Process Grade Factors

Model Factor Symbol (π_i) | Name | α | β | Default value for factor if R_i is unknown
π_D | Design process factor | 0.12 | 1.29 | 0.094
π_M | Manufacturing process factor | 0.21 | 0.96 | 0.142
π_P | Parts quality process factor | 0.30 | 1.62 | 0.243
π_S | Systems management process factor | 0.06 | 0.64 | 0.036
π_N | CND process factor | 0.29 | 1.92 | 0.237
π_I | Induced process factor | 0.18 | 1.58 | 0.141
π_W | Wearout process factor | 0.13 | 1.68 | 0.106

π_IM = infant mortality factor:

$$\pi_{IM} = \frac{t^{-0.62}}{1.77}\left(1 - SS_{ESS}\right)$$

where:

t = time in years. This is the instantaneous time at which the failure rate is to be evaluated. If the average failure rate for a given time period is desired, this expression must be integrated and divided by the time period.
SS_ESS = the screening strength of the screen(s) applied, if any

π_E = environmental factor:

$$\pi_E = \frac{0.855\left[0.8\left(1 - e^{-0.065(\Delta T + 0.6)^{1.71}}\right) + 0.2\left(1 - e^{-0.046\,G^{0.6}}\right)\right]}{0.205}$$

where:

ΔT = the change in temperature between operating and non-operating periods (T_AO - T_AE)
G = the magnitude of random vibration while the system is operating, in GRMS

π_G = reliability growth factor, given by the formula:

$$\pi_G = \left(\frac{t + 2}{2}\right)^{-1.12\,\alpha}$$

where:

α = the growth constant, which is equal to R_i for reliability growth processes
R_i = the rating of the growth process, using the criteria in Table 7.2-30, and is given as:

$$R_i = \frac{\sum_{j=1}^{n_i} G_{ij} W_{ij}}{\sum_{j=1}^{n_i} W_{ij}}$$
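As a consistency check, the default values in Table 7.2-21 fall close to the factor values at R_i near 0.5. The sketch below evaluates the process grade factors using three of the table's constants; the grading input is illustrative.

```python
import math

# A sketch of the process grade factors using alpha/beta constants from
# Table 7.2-21 (three of the seven factors shown).

FACTORS = {  # symbol: (alpha, beta, default value if R_i is unknown)
    "design":        (0.12, 1.29, 0.094),
    "manufacturing": (0.21, 0.96, 0.142),
    "parts":         (0.30, 1.62, 0.243),
}

def process_grade_factor(name, r=None):
    """pi_i = alpha_i * (-ln(R_i))^(1/beta_i), or the default when ungraded."""
    alpha, beta, default = FACTORS[name]
    if r is None:
        return default
    return alpha * (-math.log(r)) ** (1.0 / beta)

print(process_grade_factor("design"))       # default (R_i unknown)
print(process_grade_factor("design", 0.7))  # graded process, R_i = 0.7
```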

7.2.5.2. 217Plus Process Grading Criteria

This section contains a listing of all of the criteria that comprise the definition and
scoring for the individual 217Plus Process Grades. An index of the tables included
within this section is listed in Table 7.2-22.

Table 7.2-22: Index of Process Grade Type Questions

Table Number | Process Grade Type
7.2-23 | Design
7.2-24 | Manufacturing
7.2-25 | Part Quality
7.2-26 | System Management
7.2-27 | CND
7.2-28 | Induced
7.2-29 | Wearout
7.2-30 | Growth

The rating for each process grade type, R_i, is given as:

$$R_i = \frac{\sum_{j=1}^{n_i} G_{ij} W_{ij}}{\sum_{j=1}^{n_i} W_{ij}}$$


where:

R_i = rating of the process for the ith failure cause, from 0.0 to 1.0
G_ij = the grade for the jth item of the ith failure cause. This grade is the rating between 0.0 and 1.0 (worst to best).
W_ij = the weight of the jth item of the ith failure cause
n_i = the number of grading criteria associated with the ith failure cause

These tables are organized as follows. Column 1 contains the criteria associated with the specific Process Grade Type. Column 2 is the grading criteria (G_ij). Most of the questions are designated with a Y/N in this column. In these cases, a "Yes" (Y) answer equals "1" and a "No" answer equals "0". The question will receive the full weighted score for a "Yes" answer and a zero for a "No" answer. In some cases, the grading criteria are not binary, but rather can be one of three or four possible values. The grading criteria for these are noted in this column. Column 3 identifies the scoring weight (W_ij) associated with the specific question.

In the event that a model user does not wish to answer all of the questions, he/she can choose a subset of the most important questions by using only those with weight values of seven or higher. Questions that are not scored should not be counted in the number of grading criteria (n_i) associated with the ith failure cause.

7.2.5.3. Design Process Grade Factor Questions

Table 7.2-23: Design Process Grade Factor Questions


Question

Gij

Wij

What is the % of lead design engineering people with cross training experience in manufacturing or field operations (thresholds at
10, 20%)?

<10 = 0
10-20 = .5
>20 = 1

What is the % of team members having relevant product experience (thresholds at 25, 50%)?

<25 = 0
25-50 = .5
>50 = 1

What is the % of team members having relevant process experience, i.e., they have previously developed a product under the
current development process (thresholds at 20, 40%)?

<20 = 0
20-40 = .5
>40 = 1

What is the % of development team that have 4-year technical degrees (thresholds at 20, 40%)?

<20 = 0
20-40 = .5
>40 = 1

What is the % of engineering team having advanced technical degrees (thresholds at 10, 20%)?

<10 = 0
10-20 = .5
>20 = 1

What is the % of engineering team members involved in professional activities in the past year; hold patents; authored/presented
papers; are registered professional engineers, or professional society offices at the National level (thresholds at 10, 20%)?

<10 = 0
10-20 = .5
>20 = 1

What is the % of engineering team members who have taken engineering courses in the past year (thresholds at 10, 20%)?

<10 = 0
10-20 = .5
>20 = 1

Are resource people identified for program technology support across key technology and specialty areas such as optoelectronics,
servo control, Application Specific Integrated Circuits (ASIC) design, etc., to provide program guidance and support as needed?

Yes = 1
No = 0

Are resource people identified, for program tools support, to provide guidance and assistance with Computer Aided Design (CAD),
simulation, etc.?

Yes = 1
No = 0

How many (0,1,2,3) of the program objectives of cost, schedule and reliability did the manager successfully meet for the last
program that he/she was responsible?

3=1
2 = .5
1 = .25
0=0

10

Is this development program organized as "Cross Functional Development Teams" (CFDT) involving: design, manufacturing, test,
procurement, etc.?

Yes = 1
No = 0

Does this Field Replaceable Unit (FRU) depend more on mature technology than state of the art technology?

Yes = 1
No = 0

Is design of experiments (DOE) used to ensure robustness of the FRU in the product under all operational and environmental
variations?

Yes = 1
No = 0

Are critical components identified along with plans to mitigate their risks?

Yes = 1
No = 0

Have designs been reviewed and plans made for part obsolescence during the product's life cycle?

Yes = 1
No = 0

Are considerations made to accommodate part form factor evolution? This applies particularly to those parts deemed likely to
change during the production life of the fielded system.

Yes = 1
No = 0

Are predominantly standard tools required for maintenance (limited-to-no use of special tools)?

Yes = 1
No = 0

Will the design application be modeled by variational analysis to ensure design centering?

Yes = 1
No = 0

Will timing analysis be performed on digital circuits?

Yes = 1
No = 0

100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC


330

Chapter 7: Examples

Table 7.2-23: Design Process Grade Factor Questions (continued)


Question

Gij

Wij

Will a network modal analysis be performed on analog circuits?

Yes = 1
No = 0

Will electrical stress analysis be performed on electronic circuits?

Yes = 1
No = 0

Will mechanical stress analysis be performed on relevant components, materials and structures?

Yes = 1
No = 0

Will a prototype be developed in time to have user feedback impact the design?

Yes = 1
No = 0

10

Will customer feedback on the prototype be sought?

Yes = 1
No = 0

10

Will design personnel participate in a Failure Modes and Effects Analysis (FMEA), Failure Modes Effects and Criticality Analysis
(FMECA), or Fault Tree Analysis (FTA) that is performed concurrently with the design effort?

Yes = 1
No = 0

Will the design engineer also design the diagnostic code for this FRU?

Yes = 1
No = 0

Will a worst-case analysis be performed?

Yes = 1
No = 0

Will the product support tasks be ergonomically evaluated (human factors) from an Operations & Maintenance standpoint?

Yes = 1
No = 0

Will the product be analyzed using a human factor task analysis to ensure the Operations and Maintenance Tasks are tailored to
human capabilities?

Yes = 1
No = 0

Will the chassis that this FRU is mounted in be thermally measured and analyzed and operating temperatures assured to be at a safe
margin below device limits?

Yes = 1
No = 0

Is electrical/mechanical power by electronic logic or physical action (switches)?

Yes = 1
No = 0

Do control procedures ensure that the system and its software are put in a safe state during power shut down?

Yes = 1
No = 0

<90 = 1
90-125 = .5
>125 = 0

Will environmental analyses and profiling (thermal, dynamic) be performed on the product to ensure it is used within its design
strength capabilities?

Yes = 1
No = 0

Will the product be analyzed/tested for electromagnetic compatibility (EMC) and radiated/conducted susceptibility and emissions?

Yes = 1
No = 0

Will the product be EMC-certified, per the European CE (Conformity European) regulatory compliance criteria for equipment used
in Europe, or under a similarly rigorous standard such as DO-160 (commercial aircraft)?

Yes = 1
No = 0

Are the size of equipment orifices (cover openings) less than 1/10 of the wavelength of the signal frequencies that the equipment
will generate within its enclosure or be exposed to in its environment?

Yes = 1
No = 0

Do traces on a Printed Wiring Board (PWB) run over a ground plane or an impedance control layer (e.g., power planes) and never
over reference plane or power plane voids?

Yes = 1
No = 0

Do traces on alternate PWB layers run orthogonal to one another, when a reference plane or power plane is not interposed between
them?

Yes = 1
No = 0

Are adjacent traces separated by at least twice their width, except for minor adjacencies that run less than a half inch?

Yes = 1
No = 0

Is the power source filtered over the range of 1KHz to 100 MHz for military or 150KHz to 30 MHz for commercial power, and
utilize surge suppression devices where appropriate?

Yes = 1
No = 0

Are all interconnect cables emerging from a shielded cabinet grounded to the chassis for operating frequencies greater than 1 MHz
or capacitively decoupled to the chassis for frequencies less than 1 MHz?

Yes = 1
No = 0

Are traces set back at least 2 widths from the edge of the reference or ground plane?

Yes = 1
No = 0

Is there a shared product development vision that includes Design for Manufacturability (DFM) goals?

Yes = 1
No = 0

What is the maximum silicon junction temperature on this FRU in degrees C (thresholds at 90 and 125 degrees C)? If not
applicable select 0.

th

Reliability Information Analysis Center


331

Chapter 7: Examples

Table 7.2-23: Design Process Grade Factor Questions (continued)


Question

Gij

Wij

Are part types standardized via a Preferred Parts List (PPL)?

Yes = 1
No = 0

Is there continuing focus to keep the PPL up to date and to minimize the number of parts on the PPL, by increasing part
standardization, encouraging designers to use the PPL and requiring analysis to justify adding a new part to the PPL?

Yes = 1
No = 0

Is this product to be built on an existing manufacturing platform that makes use of existing process capabilities?

Yes = 1
No = 0

Do plans for follow-on products and product retirement exist?

Yes = 1
No = 0

Are new, critical parts qualified by test and analysis prior to their inclusion in the system?

Yes = 1
No = 0


Are PWB traces at least 5 mils in width?

Yes = 1
No = 0

Is the development process documented?

Yes = 1
No = 0

Is the process documentation on-line with the recognition that the on-line version is the only standard? (All printed copies are for
reference only).

Yes = 1
No = 0

Does each process activity have clear entry and exit criteria?

Yes = 1
No = 0

Is the system configuration documented on-line, with changes since the last baseline highlighted to keep the entire team current
with the design?

Yes = 1
No = 0

Are there functional block diagrams of the system, subsystems, etc., down to the FRU level?

Yes = 1
No = 0

Are examples of good development products (e.g., specs, plans, documentation) provided to the engineering team, typifying the
desired work products for each stage of development?

Yes = 1
No = 0

Are examples of past problems provided to the engineering team that typify those found at each stage of development?

Yes = 1
No = 0

Is there a closed-loop problem database to track development problems to closure?

Yes = 1
No = 0

Does development activity planning include the identification of critical path tasks?

Yes = 1
No = 0

Are critical path tasks planned to minimize cycle time impacts and improve schedule robustness?

Yes = 1
No = 0

Are individual developers encouraged to make contact with their customer counterpart?

Yes = 1
No = 0

Will Cross-Functional Development Team (CFDT) phase reviews/sign-offs follow each product development phase: requirements,
preliminary design, final design, and test?

Yes = 1
No = 0

Are formal reviews documented and defect data analyzed and tracked, along with any action items, to completion?

Yes = 1
No = 0

Do design reviewers share responsibility for the performance of the design once they have reviewed it?

Yes = 1
No = 0

Are developers rated on the success of the overall product in the field?

Yes = 1
No = 0

Is there a technical review board in place to minimize design changes and maintain cost, schedule and reliability goals?

Yes = 1
No = 0

How does the part count on this project compare with predecessor products or competitive products (thresholds at 75 and 100%)?

<75% = 1
75%-100% = .5
>100% = 0

Are there DFM guidelines provided that the program must adhere to (e.g., a good DFM design is fabricated on a uni-axis
assembly orientation, preferably built from the bottom)?

Yes = 1
No = 0

What is the % of interconnections compared to the predecessor version of this FRU (thresholds at 70 and 100%)?

<60 = 1
60-100 = .5
>100 = 0


Are engineering change (EC) costs budgeted, measured and tracked against their associated design driver?

Yes = 1
No = 0

Is reliability and/or quality a significant goal, or the number one goal, placed on the entire development organization? This
typically occurs in safety-critical applications such as air traffic control, nuclear or critical medical applications.

Yes = 1
No = 0

Wij = 10


Are individual developers empowered by having input and control over resources to accomplish their job, such as having a travel
budget (if travel is required)?

Yes = 1
No = 0

Are engineering team members dedicated full time to the project?

Yes = 1
No = 0

Are process owners identified across the development team for each configuration item (CI) and its components?

Yes = 1
No = 0

Is there tracking of open problems, action items and cross system dependencies?

Yes = 1
No = 0

Is there a change review board/process?

Yes = 1
No = 0

Is development creativity fostered through planned creativity exercises to spawn breakthrough thinking with respect to design
simplicity, cost, schedule and reliability?

Yes = 1
No = 0

Are failures traced to their root cause and managed to resolution?

Yes = 1
No = 0

Is this development process ISO rated?

Yes = 1
No = 0

How many (0,1,2,3) of the cost, schedule and reliability goals did the last product developed by this organization meet?

3=1
2 = .5
1 = .25
0=0

Wij = 10

Do you know the reliability performance of your current products in the field versus their predicted reliability?

Yes = 1
No = 0

If so, are previous reliability estimates within 15% of the predicted reliability? When not applicable select "No".

Yes = 1
No = 0

Is there a 15% staffing buffer on the program, i.e., will the program be staffed to 115% of the needed baseline to allow for
contingencies?

Yes = 1
No = 0

Are in-process metrics maintained to track actual vs. planned defect rates, schedule and resource targets?

Yes = 1
No = 0

Can continuous measurable improvement (CMI) be demonstrated for the development processes?

Yes = 1
No = 0

Are development processes maintained on-line with all printed paper copies designated "for reference use only"?

Yes = 1
No = 0

Are there procedures to ensure that documentation stays current with the design?

Yes = 1
No = 0

Is there a requirements document for this program?

Yes = 1
No = 0

Is there a Functional Specifications document?

Yes = 1
No = 0

Are there document owners or points-of-contact identified for these documents so the development team knows who it can go to
for a specific need?

Yes = 1
No = 0

Do the team members contribute to the creation and/or review and approval of these documents?

Yes = 1
No = 0

Are documentation standards promoted with examples to demonstrate what is considered adequate documentation?

Yes = 1
No = 0

What is the % of FRU reuse across the system (thresholds at 10 and 20%)?

<10 = 0
10-20 = .5
>20 = 1


Is product documentation field-tested prior to product delivery or general availability?

Yes = 1
No = 0

Is there a procedure for field feedback on product operations and maintenance documentation?

Yes = 1
No = 0

Does product documentation maximize pictures and minimize words (fallibility of natural language)?

Yes = 1
No = 0

Is product documentation kept at reading grade level 10 or less?

Yes = 1
No = 0

Is there an operational concept document developed prior to high level design that is maintained throughout development?

Yes = 1
No = 0

Is there a set of hardware and process design guidelines that provide general and component-specific design guidance practices?

Yes = 1
No = 0

Is there an assumptions/dependencies database that is maintained and reviewed prior to each development stage exit?

Yes = 1
No = 0

Is a distributed architecture used?

Yes = 1
No = 0

Does the design exclude electro-optical devices?

Yes = 1
No = 0

Does the design exclude electro-mechanical devices?

Yes = 1
No = 0

Is chemical processing excluded from this design?

Yes = 1
No = 0

Is hot fusing (toner) excluded from this design?

Yes = 1
No = 0

Are all voltages used in this design less than 110 VAC?

Yes = 1
No = 0

Are all operating frequencies less than 50 MHz?

Yes = 1
No = 0

What is the number of developers on this project (thresholds of 20 and 100)?

<20 = 1
20-100 = .5
>100 = 0

What is the development schedule in months? (thresholds at 18, 36 and 48 months)

<18 = 1
18-36 = .75
36-48 = .5
>48 = 0

Is there a 24-hour/day availability requirement?

Yes = 1
No = 0

Does the operational concept call for a remote operations and maintenance (O&M) operator to be able to diagnose system problems
as part of the system concept?

Yes = 1
No = 0

Is this PWB of standard size or dimension?

Yes = 1
No = 0

Does this FRU have a 25% reduction in parts count over its predecessor or competitor?

Yes = 1
No = 0

Are stuck faults required to be isolated down to a single failing FRU 90% of the time?

Yes = 1
No = 0

Does this FRU report its status via a Management Information Base (MIB) capability for fault determination and isolation?

Yes = 1
No = 0

Is there over-voltage / under-voltage detection and reporting?

Yes = 1
No = 0


Will FRUs be "hot-pluggable"?

Yes = 1
No = 0

Is there an independent test team?

Yes = 1
No = 0

Is the customer directly involved in defining the product's operational profile and in reviewing the test plans?

Yes = 1
No = 0

Does test planning take into account the "lessons learned" database?

Yes = 1
No = 0

Does a problem tracking database exist and is it being used on this program?

Yes = 1
No = 0

Will accelerated testing be performed during development that combines temperature and vibration?

Yes = 1
No = 0

Will alpha tests be conducted, whereby the final product is robustly tested against probable extensions of its operational
environment?

Yes = 1
No = 0

Will beta tests be conducted, whereby customers can use and test pre-release versions of the product, feeding back their results to
the developer?

Yes = 1
No = 0

Are test procedures, set-up conditions, results, etc., documented so that measurements can be verified, failures reproduced, test
conditions recreated, and corrective actions confirmed?

Yes = 1
No = 0

Will a gold standard (tested product) be preserved for comparative regression analysis?

Yes = 1
No = 0

Will product changes be regression tested?

Yes = 1
No = 0

Will the FRU be reliability or endurance tested (at any assembly level)?

Yes = 1
No = 0

Can parts (ASIC, EPROM) be reprogrammed in the circuit?

Yes = 1
No = 0

Can active elements be backwardly driven for more complete coverage?

Yes = 1
No = 0

What % of nodes (interconnection of traces) can be backward driven (thresholds at 80 and 95%)?

<80 = 0
80-95 = .5
>95 = 1

Has test fixture complexity been analyzed for fixtures with over 50 pins per square inch?

Yes = 1
No = 0

What is the test point contact size in mils (thresholds at 40, 32, and 25 mils)?

>40 = 1
32-40 = .75
25-32 = .25
<25 = 0

Is a one-sided or two-sided test fixture used, if this FRU is a PWB?

1 = 1
2 = 0

Has mechanical loading of the test fixture on the device under test been analyzed?

Yes = 1
No = 0

Are the buses or signal lines actively driven (vs. passively driven)?

Yes = 1
No = 0

Are the test item configurations representative of the design (development tests) or production (validation tests) products, as appropriate?

Yes = 1
No = 0

Will the product be environmentally stress tested (0 to 3) for 1. Design, 2. Qualification, 3. Product Acceptance?

3 = 1
2 = .8
1 = .5
0 = 0



Are Engineering Design Analysis (EDA) tools available that will be used to support the design task?

Yes = 1
No = 0

Are EDA tools stable?

Yes = 1
No = 0

Does the team have a core competency with tool experience?

Yes = 1
No = 0

Are there dedicated tool support personnel available to the development team?

Yes = 1
No = 0

Do the development team members have domain expertise with the operating platform, i.e., HP, Unix, networking, and OS?

Yes = 1
No = 0

Are the tools self-documenting?

Yes = 1
No = 0

What % of nodes can be tested on this FRU (thresholds at 95, 80 and 50%)?

>95 = 1
80-95 = .5
50-80 = .25
<50 = 0
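
Although no combination formula is shown in these tables, the Gij grades and Wij weights are evidently meant to roll up into a composite process grade. A minimal sketch, assuming the composite is simply the Wij-weighted average of the Gij grades (the function and example values below are illustrative, not the guide's definitive algorithm):

# Minimal sketch: roll question grades (Gij) and weights (Wij) up into a
# composite process grade, assuming grade = sum(Wij * Gij) / sum(Wij).
def process_grade(answers):
    """answers: list of (g, w) pairs, one per applicable question,
    where g is the Gij grade (0 to 1) and w is the Wij weight."""
    total_weight = sum(w for _, w in answers)
    if total_weight == 0:
        raise ValueError("no applicable questions")
    return sum(g * w for g, w in answers) / total_weight

# Example: a Yes (grade 1, weight 6), a No (grade 0, weight 5), and a
# threshold-scale answer graded .75 (weight 4):
print(process_grade([(1, 6), (0, 5), (0.75, 4)]))  # 0.6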

7.2.5.4. Manufacturing Process Grade Factor Questions

Table 7.2-24: Manufacturing Process Grade Factor Questions


Question
How many product orientations (axes and directions) are required to assemble this Field Replaceable Unit (FRU) (thresholds at
1,2,3 axes out of 6 possible)?

How many cuts and traces are allowed if this is a Printed Wiring Board (PWB) (preliminary thresholds at 10, 20)?
Is a Computer Aided Design/Computer Aided Manufacturing (CAD/CAM) process used to support manufacturing?
Does the CAD/CAM system allow manufacturing personnel to have access to design information and documentation?
Are there no adjustments associated with this FRU?
Are parts/assemblies designed to simplify and facilitate automatic feeding and insertion?
Do hand-inserted parts have visual guides to aid in building the assembly?
Is there a focused effort to minimize the number of cables and connectors on this FRU?
Does the system support the FRUs being "hot-pluggable"?
Has the number of different manufacturing processes been minimized in building this FRU?
Is a flexible manufacturing process used, such that this new product will be fabricated on an existing, proven line?
Is a cellular manufacturing process used, where the autonomous manufacturing station has all materials and parts brought to it and it
produces a finished product?
If there are symmetrical, polarized components used on this FRU, is the mounting process made "fool-proof", so that they cannot be
inserted backwards?


Gij: for the product-orientations question, >3 = 0; 3 = .5; 2 = .75; 1 = 1. For the cuts-and-traces question, >20 = 0; 10-20 = .5; <10 = 1. For all other questions above, Yes = 1; No = 0.

Wij, in column order: 3, 5, 3, 4, 5, 3, 3, 4, 3, 3, 3, 3

Has the total number of threaded fasteners associated with this assembly been minimized?

What is the number of different fastener types associated with this FRU (threshold at 0, 1, 2, >2)?

Is it easy to visually distinguish between fasteners (e.g., no minor differences in length) prior to installation?
Is there only one type of fastener drive (Torx, Phillips, etc.) needed in the assembly, installation and maintenance of this FRU?
Are mounting guides or registration pins provided for aligning and securing electro-mechanical or electro-optical parts?
Are development personnel, including manufacturing, all co-located?
Does this project have a built-in 15% staffing buffer, i.e., staffing is at least 115% of base requirements?
Is the project organized around self-directed work teams?
Are workers rated on both total output and quality?
Are there process improvement teams with continuous measurable improvement (CMI) goals?
Are employees rated on field performance of the product?
Is there an advanced manufacturing engineering (AME) support department to help bridge between engineering and production?
Has a Cross-Functional Development Team (CFDT) been implemented, such that the manufacturing manager is able to explain the
design concept?
Are manufacturing people encouraged to ask questions of development people (identified points of contact) when questions arise?
Are enterprise points-of-contact (POCs) identified (development, manufacturing, test, field, marketing) to help answer questions and
address issues across the organization?
Can any of the line or quality personnel "stop the line" if that person believes a serious problem exists?

Has the majority of the manufacturing leadership had direct field or customer contact in the past year?
Do manufacturing people have measurable goals to improve production metrics, including quality and cycle time?
If answer to 3.2.13 is yes, do direct manufacturing people participate in developing the goals?
Do manufacturing personnel have goals for continuous quality improvement?
Are there quality circles that meet regularly?
Are teams rewarded or recognized for improving quality?
Are key metrics for quality and cost monitored and tracked?
Is the cost of defect prevention measures tracked (proactive quality)?


Gij: for the number-of-fastener-types question, >2 = 0; 2 = .25; 1 = .5; 0 = 1. For all other questions above, Yes = 1; No = 0.

Wij, in column order: 3, 3, 3, 6, 6, 3, 4, 5, 4, 3, 1, 5, 5, 3, 5, 4, 3, 3, 3, 5, 3

Is the cost of problem corrections tracked (corrective quality)?
Do the process operators collect and interpret their own statistical process control (SPC) operational data?
Is machine-level configuration control practiced?
Is the cost of engineering changes (ECs) tracked and allocated back to the responsible development entity that caused the EC?
Are root cause failure analyses performed on Pareto-significant manufacturing line problems?
Are root cause failure analyses performed on Pareto-significant field problems?
Is there a continuing focus on eliminating test escapes so as to find problems when they are created rather than when the customer
receives the system?
Is a lessons-learned database maintained based upon problem post mortem analysis?
Are lessons learned fed back to development personnel at the corresponding development phase where particular, significant fault
types have been found to occur?
Will this FRU have an (expected) yield of over 90%?
Are examples of field manufacturing defects displayed for production personnel?
Do manufacturing people have current awareness of the field performance of their products, in terms of problem types and problem
rates?
Are the manufacturing processes based upon sensitivity analyses, process Failure Modes and Effects Analysis (FMEA) or Design of
Experiments (DOE)?
Has a declared manufacturing vision that incorporates reliability and quality been established, documented and communicated to
personnel?
Is leadership rotated among manufacturing personnel participating in a quality circle?
Does management promote quality circles with continuous measurable improvement (CMI) targets?
Do employees' personal development/assessment plans emphasize product and process quality?
Are team-building exercises promoted as lead-ins to the production phases?
Do manufacturing personnel get 40 hours of training a year?
Do you visit suppliers, review their processes, and make suggestions for process improvement?
Do you invite suppliers, or customers to review your company's processes and allow them to suggest ways the company can do
things better?
Does manufacturing participate in design reviews?
Is management aware and involved in day to day manufacturing operations on a regular basis?
Is management located in proximity to line people and accessible to them?
Do part suppliers manage their stock at your production facility?
Will this product be built on an existing manufacturing line, vs. a new manufacturing process that will have to be developed to
support the manufacture of this product?


Gij: for all questions above, Yes = 1; No = 0.

Wij, in question order: 3, 5, 5, 4, 6, 7, 6, 6, 4, 3, 3, 4, 6, 4, 2, 3, 5, 3, 2, 3, 3, 5, 4, 4, 3, 6

Has the manufacturing process been mistake-proofed?
Is there an EC budget for this FRU, and will results be measured against this budget?
Do manufacturing personnel know the projected average cost of an EC once product is in the field?
Has a demand-based, pull system been established for manufacturing processing stations?
Are Printed Wiring Boards (PWBs) conformal coated?
Are there tighter tolerances than 0.020" on unaided hand assembly operations associated with manufacturing this FRU, or
integrating it into the next higher level assembly?
Are there tighter tolerance requirements than 0.005" for fixtured assembly operations with measurement capability?
Are there tighter tolerances than 0.0005" for automated assembly operations?
Is the manufacturing process documented?
Has manufacturing provided a product design checklist of their concerns to the development team at the start of development?
Is the checklist identified above reviewed for compliance at each development milestone review?
Is 90% test coverage achieved on the components in this FRU?
Is a shipping test performed on samples of the packaged product?
Are FRUs burned in for at least 24 hours?
Is a "gold standard" of the qualified item maintained for regression test purposes?
Is Design of Experiments (DOE) used in setting up and controlling testing?
Is there an Operational Reliability Test conducted to simulate the customer application?

How many elements (0 to 4) of environmental stress screening (ESS) are run: 1. temperature bake, 2. temperature cycle, 3.
temperature shock, 4. vibration?

Are production test stress screens conducted?


Does this PWB have fewer than 6 layers?
Is this PWB small enough so that it cannot bow or "oil can" in handling and usage?
Is it a one-sided or two-sided board?
If this is a PWB, are the majority of components attached via methods other than surface mount technology (SMT)?
Does this FRU have at least 25% fewer solder joints than its predecessor or competitor?


Gij: for the environmental stress screening (ESS) elements question, 0 = 0; 1 = .25; 2 = .5; 3 = .75; 4 = 1. For the one-sided or two-sided board question, 1 = 1; 2 = 0. For all other questions above, Yes = 1; No = 0.

Wij, in column order: 6, 3, 3, 4, 3, 3, 3, 3, 5, 5, 5, 5, 4, 5, 4, 7, 5, 5, 3, 3, 3, 3, 3

What is the solder joint spacing (in mils) (thresholds at 30, 50, 100)?

<30 = 0
30-50 = .5
50-100 = .75
>100 = 1

Is ball grid array (BGA) technology excluded from this design?

Yes = 1
No = 0

Has your organization previously implemented BGA technology into a design?

Yes = 1
No = 0

Have card insertion guides been used in the design?

Yes = 1
No = 0

Have board stiffeners been used in the design?

Yes = 1
No = 0

Wij, in column order: 3, 6, 3, 3

7.2.5.5. Part Quality Process Grade Factor Questions

Table 7.2-25: Part Quality Process Grade Factor Questions


Question

Gij

Wij

Is there a documented part selection and part management process?

Yes = 1
No = 0

Is there a Preferred Parts List (PPL)?

Yes = 1
No = 0

Are part evaluation and qualification processes established to add parts to the PPL?

Yes = 1
No = 0

Does a cross-functional development team (CFDT) review and approve new candidate parts for addition to the PPL?

Yes = 1
No = 0

Is this a commercial off-the-shelf (COTS) purchased assembly with a good history of operational reliability? If the assembly is not
COTS, select "Yes".

Yes = 1
No = 0

Will new parts be excluded from being added to the PPL to design this FRU?

Yes = 1
No = 0

Are procedures in place to detect part problems in both manufacturing and the field?

Yes = 1
No = 0

Are quality and reliability data tracked on parts and fed back to suppliers so they know their performance on this product?

Yes = 1
No = 0

Is there a design compliance checklist to ensure that all parts are properly applied, operating at sufficient margin with respect to
environmental and operational stresses, and take into account lessons learned?

Yes = 1
No = 0

Are there processes in place that specifically address precautions and handling of parts/components susceptible to electrostatic
discharge (ESD)?

Yes = 1
No = 0

Do part specifications reflect environmental and regulatory compliance requirements for the specific intended application?

Yes = 1
No = 0

Has mechanical interfacing of critical parts been facilitated by providing mating parts/assemblies to the part supplier?

Yes = 1
No = 0

Is there an end of life plan to recycle or dispose of this part?

Yes = 1
No = 0

Are teaming relationships established with all critical component suppliers?

Yes = 1
No = 0

Are critical parts ISO 9000 certified?

Yes = 1
No = 0



Are critical parts QS 9000 (automobile manufacturer certification) certified?

Yes = 1
No = 0

In the case of commercial off-the-shelf (COTS) equipment, is the purchased assembly certified and marked to sell in Europe (CE
marked)? If the assembly is not COTS, select Yes.

Yes = 1
No = 0

Is the FRU under configuration management control by the time it enters system test?

Yes = 1
No = 0

Are critical parts burned in for at least 24 hours?

Yes = 1
No = 0

Will the supplier manage the developer's inventory, in the case of high volume production?

Yes = 1
No = 0

Will all suppliers provide timely Failure Reporting and Corrective Action System (FRACAS) support for both critical and custom
parts (timely reporting implies a 2 week turnaround, with faster response on priority demand)?

Yes = 1
No = 0

Have vendor dependencies been identified for critical and custom components?

Yes = 1
No = 0

Have suppliers identified the likely failure modes on critical and custom parts, and does the design take these failure modes into
account?

Yes = 1
No = 0

Are operational failure rate and failure mode data provided by the suppliers of critical and custom parts being used?

Yes = 1
No = 0

Is there a part control drawing for critical and custom parts?

Yes = 1
No = 0

Is there a device specification for all critical and custom parts?

Yes = 1
No = 0

Has the supplier reviewed the part application for all critical and custom parts?

Yes = 1
No = 0

Has the developer met with suppliers to discuss the application of all critical and custom parts?

Yes = 1
No = 0

Has a supplier's technical point of contact (POC) been identified for addressing reliability concerns?

Yes = 1
No = 0

Will critical suppliers provide timely notice of impending part changes to allow the developer to assess the impact?

Yes = 1
No = 0

Is a change history log maintained to provide traceability of engineering change actions and their associated rationale for critical and
custom parts?

Yes = 1
No = 0

Will part identification (revision numbers) be shown on the part to identify the particular part configuration, including the level of
the part's firmware?

Yes = 1
No = 0

Will suppliers routinely update firmware on parts returned for repair?

Yes = 1
No = 0

If suppliers update firmware will the part identification reflect this change?

Yes = 1
No = 0

Will the supplier's part support time horizon meet program development, manufacture, and field support component requirements?

Yes = 1
No = 0

Will the vendor provide timely notice of production/support cessation and provide an end-of-life buy opportunity?

Yes = 1
No = 0

Will future releases of this part be compatible with respect to form, fit and function?

Yes = 1
No = 0

Is there a first article inspection and acceptance test planned?

Yes = 1
No = 0

Do critical and custom parts on this FRU all have at least a 12-month warranty?

Yes = 1
No = 0

Have likely part developments, evolution, and extensions of critical/custom parts been identified by the supplier?

Yes = 1
No = 0

Are there 32 Kbytes or more of firmware embedded in this FRU?

Yes = 1
No = 0



Have development personnel met with the supplier's technical personnel?

Yes = 1
No = 0

Has a functional block diagram been developed for COTS or purchased complex part assemblies?

Yes = 1
No = 0

Has a failure history been collected for critical parts, complex assemblies, or COTS items?

Yes = 1
No = 0

Have key suppliers identified their part failure mechanisms?

Yes = 1
No = 0

Have suppliers, in the case of complex part assemblies, supported the developer in performing a Failure Modes and Effects Analysis
(FMEA) on those assemblies?

Yes = 1
No = 0

Have the sources and the extent of part variation been identified?

Yes = 1
No = 0

Have mitigations been identified to handle the effects of part variations?

Yes = 1
No = 0

Do you know the supplier's dependencies and needs?

Yes = 1
No = 0

Will a design of experiments part evaluation be conducted, considering part variations as well as manufacturing variations?

Yes = 1
No = 0

Have mechanical interfacing components been provided to the key vendors to assure proper mechanical mating?

Yes = 1
No = 0

Will the developer's quality organization audit suppliers' processes and facility capabilities?

Yes = 1
No = 0

Will the developer receive notice of pending part changes?

Yes = 1
No = 0

Will the developer have approval rights of part changes?

Yes = 1
No = 0

Are procedures and processes in place for the identification and handling of critical reliability components (derating, screening,
failure response, etc.)?

Yes = 1
No = 0

7.2.5.6. System Management Process Grade Factor Questions

Table 7.2-26: System Management Process Grade Factor Questions


Question

Gij

Wij

Does the customer participate with the developer in developing/validating a requirements statement?

Yes = 1
No = 0

Is Quality Function Deployment (QFD) used to help develop requirements and requirements traceability?

Yes = 1
No = 0

If QFD is not used, is there another systematic way used, such as a Pugh chart, to identify and document customer needs and
preferences?

Yes = 1
No = 0

Is there a system specification?

Yes = 1
No = 0

Does an "operations concept" document exist?

Yes = 1
No = 0

Has a comprehensive literature study been done of relevant design and reliability technology advancements?

Yes = 1
No = 0

Have previous or similar products been reviewed for their advantages and pitfalls?

Yes = 1
No = 0

Has a "lessons learned" database been studied to ensure the product will not repeat past problems?

Yes = 1
No = 0



Have aggressive requirements (particularly reliability, availability, and/or safety) been explicitly specified?

Yes = 1
No = 0

Wij = 10

Have regulatory agency compliance requirements been included?

Yes = 1
No = 0

Does the requirements definition also account for what the product is supposed to "not do" (for example, air bags should not deploy
except on impact)?

Yes = 1
No = 0

Is there a plan as to how to retire or recycle this new system at the end of its life?

Yes = 1
No = 0

Does a requirements database exist to capture opportunistic requirements for future consideration?

Yes = 1
No = 0

Have future expansion requirements been identified (such as loading growth) and can the system handle the projected growth in
demand?

Yes = 1
No = 0

Are requirements deemed achievable within program budget and schedule restraints, with a 90% confidence level?

Yes = 1
No = 0

Are product requirements allocated to a useful level of indenture (considering complexity, level of design flexibility, and safety
concerns)?

Yes = 1
No = 0

Has a project level Failure Modes and Effects Analysis (FMEA) been done in conjunction with designers and system engineers at
the planning stage?

Yes = 1
No = 0

Will the FMEA be refined down to the Field Replaceable Unit (FRU) level during design?

Yes = 1
No = 0

Does this product have to meet CE (European) standards?

Yes = 1
No = 0

Have likely product extensions been identified in the planning stage?

Yes = 1
No = 0

Are creativity and team building exercises being conducted during the planning stage?

Yes = 1
No = 0

Are future product releases planned in order to systematically integrate new requirements and features?

Yes = 1
No = 0

Are trade studies shared with the customer to broaden the base of inputs and support for design decisions?

Yes = 1
No = 0

Does a vision statement that speaks to reliability exist for the product?

Yes = 1
No = 0

Does a functional block diagram exist for this system?

Yes = 1
No = 0

Do sketches, drawings, or models exist for the delivered product?

Yes = 1
No = 0

Is the development team provided guidelines for acceptable deliverables at kick-off meetings for each development stage?

Yes = 1
No = 0

Are prototypes planned for early design?

Yes = 1
No = 0

Is this design an incremental improvement over an existing design?

Yes = 1
No = 0

Will state diagrams be developed before detail design to depict control flows?

Yes = 1
No = 0

Will data flow diagrams be developed before detail design begins?

Yes = 1
No = 0

Are entity-relationship diagrams developed prior to detail design?

Yes = 1
No = 0

Will a list identifying the capabilities and advantages that this product provides the customer be developed and maintained?

Yes = 1
No = 0

Is there a system transition plan to replace the current system with the new system, in a smooth, non-disruptive manner? When not
applicable select Yes.

Yes = 1
No = 0



Are requirements allocated to a useful level of indenture (considering complexity, level of design flexibility, and design autonomy)?

Yes = 1
No = 0

Are requirements verification activities planned for the appropriate stages of product development?

Yes = 1
No = 0

Are entrance and exit criteria established for each development stage?

Yes = 1
No = 0

Is requirements traceability verified and maintained throughout development?

Yes = 1
No = 0

Is requirements compliance verified prior to the exit of each phase and prior to shipment?

Yes = 1
No = 0

Are test cases developed concurrently with the design and reviewed by the designers?

Yes = 1
No = 0

Is there a log of key product decisions and accompanying rationale for traceability?

Yes = 1
No = 0

Does the specified reliability represent an improvement of 10% or greater over its predecessor or competitive products?

Yes = 1
No = 0

Is there definition and agreement as to what constitutes successful product reliability performance by the customer?

Yes = 1
No = 0

Can this product be built using existing manufacturing processes (line)?

Yes = 1
No = 0

Are development and reliability requirements developed by a cross-functional development team (CFDT)?

Yes = 1
No = 0

Are system issues routinely documented as action items?

Yes = 1
No = 0

Do design reviews have technical representation from all interfacing areas?

Yes = 1
No = 0

Is prototype interconnection hardware routinely provided to interfacing subsystems and suppliers to guide their packaging?

Yes = 1
No = 0

Is there a requirement to detect and isolate faults to a single FRU 90% of the time?

Yes = 1
No = 0

Is there a system failure modes and effects analysis (FMEA) done during planning stage, and is it updated throughout the program?

Yes = 1
No = 0

Customer and process Q1: Have I identified who are my internal customers and my external customers?

Yes = 1
No = 0

Customer and process Q2: Have I identified what deliverables my customers need (plans, prototypes, documentation, etc.)?

Yes = 1
No = 0

Customer and process Q3: Do I know when my customers require my deliverables?

Yes = 1
No = 0

Customer and process Q4: Is there a customer centered quality initiative that will be incorporated to differentiate your deliverables?

Yes = 1
No = 0

Customer and process Q5: Is there an identified tool or process improvement that the reliability section or the development
organization will gain from this effort?

Yes = 1
No = 0

Have the customers been notified and concur on items Q1-Q3 above?

Yes = 1
No = 0

Is there a database that documents cross-functional dependencies that is managed to closure?

Yes = 1
No = 0

Is a database on cross-functional dependencies maintained?

Yes = 1
No = 0

Do the developers, reviewers, testers, QA, manufacturing, and customer program office, all share in the accountability for getting a
successful program to the field?

Yes = 1
No = 0

Are developers and the entire product team rated or rewarded based upon the field performance of the product?

Yes = 1
No = 0



Are there designated points of contact in each product development area?

Yes = 1
No = 0

Are there designated facilitators to manage cross-system issues?

Yes = 1
No = 0

Are there checklists covering reliability concerns for each program phase?

Yes = 1
No = 0

Is the technical staff encouraged to talk directly to its customer counterparts?

Yes = 1
No = 0

Are periodic informal activities, such as brown bag lunches, promoted to encourage team member technical exchange in an informal
atmosphere?

Yes = 1
No = 0

Can a technical employee call for a technical review board of peers when it is felt appropriate to address a broad-impact technical
concern?

Yes = 1
No = 0

Does this equipment not require an interface with other vendors' equipment or government furnished equipment (GFE)?

Yes = 1
No = 0

Is the % of product reuse from previous products 25% or more of the lines of code for software?

Yes = 1
No = 0

Is the % of product reuse from previous products 50% or more of the FRU count or cost for hardware?

Yes = 1
No = 0

Do program planning sessions have cross-functional representation?

Yes = 1
No = 0

Do technical reviews have cross-functional representation?

Yes = 1
No = 0

Are there program development plans that show timing of activities and deliverables (this should be done during the requirements
phase and maintained throughout the program)?

Yes = 1
No = 0

Is there a non-management person designated to work full-time as the program technical lead, who works as a cross-team facilitator?

Yes = 1
No = 0

Is there a team-building effort and project brain storming at each program phase?

Yes = 1
No = 0

Are documentation products maintained on-line and accessible to all program personnel?

Yes = 1
No = 0

Is there a program database of "action items" that is maintained and managed to closure?

Yes = 1
No = 0

Is there a formal documented change process?

Yes = 1
No = 0

Are self-audits periodically performed on the change process?

Yes = 1
No = 0

Are business cases always run to evaluate the benefits and impacts of making a change (e.g., Reinertsen's model)?

Yes = 1
No = 0

Are total cost estimates made for ECs, including scrap, rework, tooling, and the potential slippage of schedule?

Yes = 1
No = 0

Are there two or fewer ECs planned during the first year of shipping?

Yes = 1
No = 0

Are ECs blocked into sections and scheduled ahead on periodic intervals to promote timely integration of changes?

Yes = 1
No = 0

Are ECs at or below the plan to date?

Yes = 1
No = 0

Do change review meetings have cross-functional representation?

Yes = 1
No = 0

Are there any ECs that are modifying previous ECs on the FRU?

Yes = 1
No = 0

Is there an EC meeting log maintained that includes the change rationale, the analysis provided, and meeting participants?

Yes = 1
No = 0



Are EC management metrics collected with a focus on continual, measurable process improvement?

Yes = 1
No = 0

Are the program development, integration, and test activities charted, showing tasks, their timing, operational dependencies and
identification of critical path activities?

Yes = 1
No = 0

Are critical path elements identified (e.g., long-lead items)?

Yes = 1
No = 0

Is there a focus to get items off the critical path?

Yes = 1
No = 0

Are there risk assessment and contingency plans to minimize critical path risk?

Yes = 1
No = 0

Has the program met its targeted dates so far?

Yes = 1
No = 0

Is this product architecture based upon a distributed architecture?

Yes = 1
No = 0

Are there no future product inventions required on this program?

Yes = 1
No = 0

Are the R&M design goals sufficiently defined and allocated to ensure that customer needs are met?

Yes = 1
No = 0

Has development committed to support the required tasks for meeting the customer's R&M needs?

Yes = 1
No = 0

Does the design approach emphasize R&M as a major goal?

Yes = 1
No = 0

Has an agreed-to process been defined to assess progress towards meeting R&M goals and requirements?

Yes = 1
No = 0

Have adequate means been agreed upon to ensure that the R&M objectives of the product will have been achieved?

Yes = 1
No = 0

Have processes been defined and implemented to ensure that the designed-in (inherent) reliability does not degrade during
manufacturing and operational use?

Yes = 1
No = 0

7.2.5.7. Can Not Duplicate (CND) Process Grade Factor Questions

Table 7.2-27: Can Not Duplicate (CND) Process Grade Factor Questions
Question
Is the system required to isolate to a single Field Replaceable Unit (FRU) on 90% of failures?
Is there a specified time limit to isolate a fault, effect a repair and restore the system?
Is there a requirement for 90% or greater test coverage within the FRU being analyzed?
Does the system promote remote serviceability with failure status communicated via Ethernet, serial port, parallel port, serial bus,
etc., to a central maintenance station?
Is there any remote failure protection for this FRU residing on a separate FRU (e.g., an arc suppression circuit that is located on a
different FRU than the relay FRU)?
Is this FRU designed to be hot-pluggable?
Does the FRU designer also design the fault isolation software that supports fault diagnosis?
Are multiple occurrences of "Can Not Duplicate" (CND) incidents analyzed for root cause of the problem?

Gij: for all questions above, Yes = 1; No = 0.

Wij, in question order: 6, 5, 6, 4, 5, 4, 6, 6

Are test, warranty, early-life, and high fallout FRUs subjected to double fault verification (this procedure re-inserts the faulted FRU
to ensure the problems track the replaced FRU)?
Do your current products experience 40% or less Can Not Duplicate (CND) failures (note that CNDs are synonymous with No
Defects Found (NDF) and No Trouble Found (NTF))?
Is a failure mode and effect analysis (FMEA) performed down to the FRU level or the Circuit Card Assembly (CCA) level,
whichever is lower?
Do design personnel participate directly in performing the FMEA?
Are maintenance analysis procedures (MAPs) developed to map failure symptoms to the failing FRU?
Are the MAPs verified by inserting faults in a maintainability test?
Are the MAPs updated with actual test and field data?
Has your company established the cost impact of a field failure?
Does the system contain error logging and reporting capability?
Does the system promote ongoing analysis of soft error conditions that might predict when a likely failure will occur?
Will the contractor developing this equipment also be responsible for maintaining it?
Does the repair facility have the ability to recreate the conditions under which a true false alarm occurred (sequence of events,
operator error, sneak circuit, etc.) and are these techniques used to try to recreate the failure?
Does the repair facility have the ability to recreate the conditions under which a real failure occurred (high/low temperature, thermal
cycling/shock, vibration/ mechanical shock, etc.) and are these techniques used to try to recreate the failure?
Will the maintainer be motivated to provide timely and complete documentation of the diagnosis and repair action?
Do the system maintenance personnel receive feedback on their repair reports and the actions taken to mitigate the failure
reoccurrence?
Are the performance specification limits of the test equipment used to troubleshoot/repair the system, FRU, etc., equal to or more
stringent than the performance specification limits of the system, FRU, etc., in its actual application?
Are CND failures included in the Failure Reporting and Corrective Action System (FRACAS) and closed out through
corrective action verification?

Gij: for all questions above, Yes = 1; No = 0.

Wij, in question order: 10, 8, 5, 5, 5, 4, 5, 3, 5, 4, 5, 5, 5, 5, 5, 5, 5

7.2.5.8. Induced Process Grade Factor Questions

Table 7.2-28: Induced Process Grade Factor Questions


Question

Gij

Wij

Are parts/materials selected, as appropriate, to meet design performance requirements and minimize the risk of induced failure
through electrostatic discharge?

Yes = 1
No = 0

If parts/materials are susceptible, are procedures used to protect them during handling, test, assembly, packaging, storage,
transportation and use (i.e., wrist straps, non-conductive work areas, ionized air, warning labels, maintenance manuals, etc.)?

Yes = 1
No = 0

Are electronic circuits designed and analyzed to minimize secondary failures attributable to electrical overstress resulting from
another primary failure?

Yes = 1
No = 0

Are electronic circuits designed and analyzed to minimize secondary failures attributable to electrical transients generated within the
system/FRU, or received from outside the system/FRU (via cable/wiring harnesses)?

Yes = 1
No = 0

Are maintenance manuals/procedures written such that risk of Electrostatic Discharge/Electrical Overstress (ESD/EOS) during
troubleshooting and repair activity is identified (warning labels, etc.)?

Yes = 1
No = 0



Has the operating environment that the part/FRU/system is to be used in been evaluated to determine the potential for mishandling of
the equipment that could result in induced mechanical failure (weather; personnel capabilities; training needs)?

Yes = 1
No = 0

Are parts/materials selected, as appropriate, to meet design performance requirements and minimize the risk of induced (mechanical)
secondary failure resulting from the primary failure of another part/assembly?

Yes = 1
No = 0

If parts/materials are susceptible to induced mechanical damage, are procedures in-place to protect them during handling, test,
assembly, packaging, storage, transportation and use?

Yes = 1
No = 0

Is the part, FRU, and/or system designed such that it can be handled and transported in a manner that minimizes the risk of induced
mechanical failure (proper location/use of handles; orientation labels such as "This Side Up"; etc.)?

Yes = 1
No = 0

Are shipping tests run to ensure adequacy of packaging and shipping procedures to protect the product during transportation?

Yes = 1
No = 0

Are maintenance manuals/procedures written such that the risk of induced mechanical damage during troubleshooting and repair
activity is identified (warning labels, etc.)?

Yes = 1
No = 0

Do maintenance manuals include detailed instructions for removing and replacing parts/components/assemblies from sockets and/or
soldered PCBs and multilayer boards, etc.?

Yes = 1
No = 0

Do maintenance manuals include detailed instructions for disconnecting and reconnecting wires, harnesses, cables, hoses, etc.?

Yes = 1
No = 0

Is the FRU/system ergonomically designed such that it can be used by the customer in normal operation without unnecessary risk of
induced mechanical damage?

Yes = 1
No = 0

Is the FRU designed to withstand normal handling and expected mishaps (e.g., a drop off a 36-inch high table top) without induced
mechanical damage?

Yes = 1
No = 0

Are wires color coded, and connectors keyed or of differing configuration such that FRUs cannot be misplugged?

Yes = 1
No = 0

7.2.5.9. Wearout Process Grade Factor Questions

Table 7.2-29: Wearout Process Grade Factor Questions


Question

Gij

Wij

Have all parts and materials used in the design been selected to extend the wearout life of the part/Field Replaceable Unit
(FRU)/system to meet or exceed its required useful life?

Yes = 1
No = 0

Has the expected reliability of parts subjected to significant mechanical loading been modeled to ensure the capability to endure the
mission, e.g., using Miner's life expectation rule for components subjected to cyclical loads?

Yes = 1
No = 0

Have wearout failure modes and mechanisms at the part, FRU and system level been identified and mitigated during the Failure
Modes and Effects Analysis (FMEA) process?

Yes = 1
No = 0

Do the relevant failure modes/mechanisms include fatigue (solder joints for electronic components/assemblies; welds for bonded
materials; fractures in mechanical parts/assemblies/materials; etc.)?

Yes = 1
No = 0

Do the relevant failure modes/mechanisms include leaks (electrolyte loss in electrolytic capacitors; worn seals in hydraulic systems;
etc.)?

Yes = 1
No = 0

Do the relevant failure modes/mechanisms include chafing (wires in electrical harnesses; wear in hydraulic lines and hoses; etc.)?

Yes = 1
No = 0

Do the relevant failure modes/mechanisms include cold flow of insulation (wires wrapped around sharp edges or subjected to
pressure points; etc.)?

Yes = 1
No = 0

Do the relevant failure modes/mechanisms include wearout resulting from cyclic operations (activation in electronic switch/relay
contacts; mating/unmating of electronic or mechanical connectors; etc.)?

Yes = 1
No = 0

Do the relevant failure modes/mechanisms include wearout resulting from breakdown of insulation in wires, or dielectric materials in
semiconductors?

Yes = 1
No = 0

Do the relevant failure modes/mechanisms include wearout resulting from moving parts (bearings, gears, belts, springs, seals, etc.)?

Yes = 1
No = 0

Has the system/FRU/part design been modified based on the wearout modes/mechanisms identified in the FMEA to reduce or
minimize their occurrence to the maximum extent feasible?

Yes = 1
No = 0



Are process FMEAs performed to determine the failure modes/mechanisms of critical processes during manufacturing?

Yes = 1
No = 0

Are data collected and analyses performed to determine the process capability of manufacturing processes?

Yes = 1
No = 0

Is statistical process control (SPC) applied to manufacturing processes to control the process mean and variability?

Yes = 1
No = 0

Is the measured mean of each manufacturing process parameter equal to, or better than, the parameter value used to calculate the
wearout failure rates of the system/FRU parts/components?

Yes = 1
No = 0

If required, has this product been hardened to withstand adverse environmental stresses such as corrosion, radiation, humidity, etc.?

Yes = 1
No = 0

Are procedures defined/implemented to ensure that assembly/test steps during manufacturing do not contribute to early wearout of
susceptible items (i.e., minimize connector matings/unmatings; stress relief/tie downs to minimize chafing during test; etc.)?

Yes = 1
No = 0

Do maintenance manuals/procedures instruct repair personnel to check that wire harnesses are properly secured, seals are properly
reinstalled, connectors are properly mated, etc., following troubleshooting/repair?

Yes = 1
No = 0

Is preventive maintenance planned to replace wearout-susceptible parts/materials at or before their L10 life (where no more than
10% of the units should experience wearout)?

Yes = 1
No = 0

Are wearout-susceptible parts/materials inspected during each corrective maintenance action to find and replace items exhibiting
premature wearout?

Yes = 1
No = 0

Are wearout-susceptible parts/materials inspected during each preventive maintenance action to find and replace items exhibiting
premature wearout?

Yes = 1
No = 0

Are wearout failures (both valid and premature) included in the Failure Reporting and Corrective Action System (FRACAS) and
closed out through corrective action, which could include life-extension opportunities?

Yes = 1
No = 0

Is field data tracked and analyzed to detect FRUs displaying increasing failure rate tendencies, i.e., wearout?

Yes = 1
No = 0

7.2.5.10. Growth Process Grade Factor Questions

Table 7.2-30: Growth Process Grade Factor Questions


Question
Is there an effective Failure Reporting and Corrective Action System (FRACAS) in place for the fielded system?
What is the percentage of field failures for which the root cause is determined?
Is analysis performed to determine if the failure is recurring?
Are design, manufacturing, or system management related potential corrective actions identified?
Are the original designers or manufacturing personnel consulted regarding the potential corrective action?
Is there a field support infrastructure in place that can affect the necessary changes?
Are systems adequately tested to ensure that the changes were made properly without inducing other defects or damage?


Gij: for the percentage-of-field-failures question, G = percentage/100; for all other questions above, Yes = 1; No = 0.

Wij, in question order: 8, 8, 6, 6, 4, 10, 5


7.3. Life Modeling Example


7.3.1. Introduction

This section presents the results of an analysis in which the intent was to quantify the
reliability of a seal used in an assembly. The approach taken in the analysis was to
perform life tests under a variety of conditions, and to develop life models from this data
so that lifetimes could be predicted as a function of the appropriate stress and product
variables. In this manner, estimates of reliability under a wide range of use conditions
could be made.
This is an example of an assessment methodology, the results of which would be more
accurate than a prediction method applied to the seal. If the analyst is able to develop a
model like the one presented here for a specific component or failure cause, the resulting
model should be weighed more heavily than a prediction on the specific component.
7.3.2. Approach

All samples were tested under a variety of temperature and relative humidity conditions.
In addition, two product/process factors were varied across the samples in the life tests:
Process Force and Hardness. These stresses and product/process variables were expected
to be the ones that most heavily influenced the product reliability.
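
Life models of this kind typically express a characteristic life as a function of the stress and product variables, so that an acceleration factor between any two conditions can be computed. As a purely illustrative sketch (not the fitted model developed in this example), one common form combines an Arrhenius term in temperature with power-law terms in the remaining variables; every parameter value below is a placeholder:

import math

K_BOLTZ = 8.617e-5  # Boltzmann constant, eV/K

def char_life(temp_c, rh_pct, force_n, hardness,
              A=1e-3, Ea=0.7, n=2.0, m=0.5, p=0.3):
    # Illustrative life-stress form: eta = A * exp(Ea/kT) * RH^-n * F^-m * H^-p
    T = temp_c + 273.15
    return (A * math.exp(Ea / (K_BOLTZ * T))
            * rh_pct ** -n * force_n ** -m * hardness ** -p)

# Acceleration of the harshest test cell relative to an assumed use condition
af = char_life(40, 50, 2, 25) / char_life(130, 100, 20, 25)
print(af)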
7.3.3. Reliability Test Plan

The Reliability Test Plan required that the lifetime be measured at various magnitudes of
these variables, such that life model parameters (including acceleration factors) could be
quantified. Table 7.3-1 summarizes, for each variable, the number of levels, and the level
values.

Table 7.3-1: Parameter Levels


Variable        Number of Levels    Levels
Temperature     2                   85, 130 C
Humidity        2                   85, 100%
Process force   2                   2, 20 N
Hardness        3                   25, 50, 100 V

Table 7.3-2 summarizes the tests performed.


Table 7.3-2: Test Plan Summary


Sample Size   Temperature   Humidity   Hardness   Process Force
7             85            85         25         2
7             85            85         50         2
7             85            85         100        2
7             85            85         25         20
7             85            85         50         20
7             85            85         100        20
7             130           85         25         2
7             130           85         50         2
7             130           85         100        2
7             130           85         25         20
7             130           85         50         20
7             130           85         100        20
7             130           100        25         2
7             130           100        50         2
7             130           100        100        2
7             130           100        25         20
7             130           100        50         20
7             130           100        100        20
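
Equivalently, the plan's crossed structure can be made explicit by generating it programmatically: three temperature/humidity combinations, each crossed with every Process Force and Hardness level, at seven samples per cell (a small sketch using the levels from Table 7.3-2):

from itertools import product

t_rh = [(85, 85), (130, 85), (130, 100)]  # tested T/RH combinations
forces = [2, 20]                          # process force levels, N
hardness = [25, 50, 100]                  # hardness levels

# One tuple per test cell: (sample size, T, RH, hardness, process force)
plan = [(7, t, rh, h, f) for (t, rh), f, h in product(t_rh, forces, hardness)]
for row in plan:
    print(row)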

The tests were performed by first inspecting each sample, then exposing them to the
specific combination of variables as previously summarized, and, finally, re-inspecting
them at various intervals. The exposure times and inspection intervals were structured
such that short lifetimes could be observed in the event that acceleration factors were
higher than anticipated. Therefore, more frequent inspections were performed early in
the test, followed by less frequent inspections for the surviving samples. Failed samples
were removed from the test.
Data was then summarized in a format suitable for life modeling. The required data
elements included stress and product/process variables, plus life variables, as follows:

Variables:
o Temperature
o Humidity
o Process force
o Hardness

Life variables:
o Last known good time
o First known bad time
7.3.4. Results
7.3.4.1. Times to Failure Summary

The test results for the seal samples are presented in Table 7.3-3. Included in this table are the temperature (in degrees C), the relative humidity, the hardness, the process force, whether the sample failed (F) or survived (S), and the time at which it failed or survived.

Table 7.3-3: Life Test Results

T      RH     Hardness   Process Force   F or S   Time to F/S
85     85     25         2               S        1159
85     85     25         2               S        1159
85     85     25         2               S        1159
85     85     25         2               F        1159
85     85     25         2               S        1159
85     85     25         2               S        1159
85     85     25         2               S        1159
85     85     25         20              S        1159
85     85     25         20              S        1159
85     85     25         20              S        1159
85     85     25         20              S        1159
85     85     25         20              S        1159
85     85     25         20              S        1159
85     85     25         20              S        1159
85     85     50         2               S        1159
85     85     50         2               S        1159
85     85     50         2               S        1159
85     85     50         2               S        1159
85     85     50         2               S        1159
85     85     50         2               S        1159
85     85     50         2               S        1159
85     85     50         20              S        1159
85     85     50         20              S        1159
85     85     50         20              S        1159
85     85     50         20              F        778
85     85     50         20              S        1159
85     85     50         20              S        1159
85     85     50         20              S        1159
85     85     100        2               S        1159
85     85     100        2               S        1159
85     85     100        2               S        1159
85     85     100        2               S        1159
85     85     100        2               S        1159
85     85     100        2               S        1159
85     85     100        2               S        1159
85     85     100        20              S        1159
85     85     100        20              S        1159
85     85     100        20              S        1159
85     85     100        20              S        1159
85     85     100        20              S        1159
85     85     100        20              S        1159
85     85     100        20              S        1159
130    85     25         2               F        278
130    85     25         2               F        158
130    85     25         2               F        130
130    85     25         2               F        237.5
130    85     25         2               F        158
130    85     25         2               F        196.5
130    85     25         2               F        130
130    85     25         20              F        158
130    85     25         20              F        196.5
130    85     25         20              F        237.5
130    85     25         20              F        428
130    85     25         20              F        237.5
130    85     25         20              F        130
130    85     25         20              F        158
130    85     50         2               F        237.5
130    85     50         2               F        196.5
130    85     50         2               F        278
130    85     50         2               F        196.5
130    85     50         2               F        278
130    85     50         2               F        158
130    85     50         2               F        158
130    85     50         20              F        196.5
130    85     50         20              F        158
130    85     50         20              F        196.5
130    85     50         20              F        428
130    85     50         20              F        158
130    85     50         20              F        237.5
130    85     50         20              F        158
130    85     100        2               F        278
130    85     100        2               F        158
130    85     100        2               F        278
130    85     100        2               F        220
130    85     100        2               F        371
130    85     100        2               F        278
130    85     100        2               F        325
130    85     100        2               F        428
130    85     100        20              F        58
130    85     100        20              F        325
130    85     100        20              F        428
130    85     100        20              F        325
130    85     100        20              F        428
130    85     100        20              F        278
130    85     100        20              F        196.5
130    100    25         2               F        58
130    100    25         2               F        59
130    100    25         2               F        34
130    100    25         2               F        58
130    100    25         2               F        34
130    100    25         2               F        34
130    100    25         2               F        58
130    100    25         20              F        58
130    100    25         20              S        70
130    100    25         20              F        34
130    100    25         20              F        34
130    100    25         20              F        58
130    100    25         20              F        1.5
130    100    25         20              F        58
130    100    50         2               F        58
130    100    50         2               S        70
130    100    50         2               F        58
130    100    50         2               S        70
130    100    50         2               F        58
130    100    50         2               F        58
130    100    50         2               S        70
130    100    50         20              F        58
130    100    50         20              S        70
130    100    50         20              F        34
130    100    50         20              F        34
130    100    50         20              F        58
130    100    50         20              F        34
130    100    50         20              F        58
130    100    100        2               S        70
130    100    100        2               F        58
130    100    100        2               F        58
130    100    100        2               F        58
130    100    100        2               F        58
130    100    100        2               F        58
130    100    100        2               F        34
130    100    100        20              S        70
130    100    100        20              S        70
130    100    100        20              F        58
130    100    100        20              F        58
130    100    100        20              F        58
130    100    100        20              F        58
130    100    100        20              S        70

The 2-parameter Weibull distribution parameters for the TTF distributions for the
samples are shown in Table 7.3-4.

Table 7.3-4: Times to Failure Distribution Parameters

Test Condition    Characteristic Life    Shape Parameter
85C/85%RH         2109                   5.1
130C/85%RH        268                    2.71
130C/100%RH       62.1                   3.2

The TTF distributions for each of the three test conditions are illustrated in Figure 7.3-1.

[Figure: Weibull probability plot (ReliaSoft Weibull++ 7, MLE fits) of unreliability F(t) vs. time for the three test conditions. Fitted 2-parameter Weibull distributions: 130C/100%RH (F=33/S=9): β = 3.2183, η = 62.1195; 130C/85%RH (F=43/S=0): β = 2.7221, η = 268.2479; 85C/85%RH (F=2/S=40): β = 5.0505, η = 2109.0635]

Figure 7.3-1: Times To Failure Distributions


7.3.4.2. Life Models

Life models were generated from the data summarized above. These life models estimate
the TTF distribution as a function of the variables used in the experiments.
A general form of the Weibull reliability function used is:

R = e^{-(t/η)^β}

where:

R = the reliability, or probability of survival, at time t
η = the Weibull characteristic life (i.e., the time to 63% failure)
β = the Weibull shape parameter

The characteristic life is then developed as a function of the applicable variables. The model form is:

η = e^{β0} · e^{β1/T} · RH^{β2} · H^{β3} · F^{β4}

where:

β0 through β4 = parameter coefficients estimated in the life modeling process
T = temperature in degrees K (C + 273)
RH = relative humidity
H = the hardness
F = the process force

Maximum likelihood analysis was performed to determine the values of β, β0, β1, β2, β3 and β4 that maximize the value of the likelihood function. These parameter estimates then become the coefficients in the life model. The likelihood function is:

L = ∏_i f(t_i, β, β0, ..., β4) · ∏_j [1 - F(t_j, β, β0, ..., β4)] · ∏_k [F(t_k, β, β0, ..., β4) - F(t_{k-1}, β, β0, ..., β4)]

where:

f = the Weibull pdf (probability density function)
F = the cumulative Weibull function (probability of failure)
t_i = failure times
t_j = survival times
t_k and t_{k-1} = times that bracket the failure interval

The first of the three product terms represents failures at known times, the second represents survivals, and the third represents failures that occurred within inspection intervals, for which the precise failure times are not known.
Once the model parameters are estimated in this fashion, the reliability at any time, and
for any combination of variables, can be estimated.
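To make the estimation procedure concrete, the following is a minimal sketch of how such a censored/interval maximum likelihood fit could be set up in Python with scipy. It is illustrative only: the record layout, the starting values, and the use of scipy.optimize are assumptions of this sketch, not the tooling used in the original analysis (the fits in Figure 7.3-1 were produced with ReliaSoft Weibull++).

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, records):
    # params: Weibull shape (beta) plus life-model coefficients b0..b4
    beta, b0, b1, b2, b3, b4 = params
    nll = 0.0
    for rec in records:
        T = rec["T"] + 273.0   # degrees K
        eta = np.exp(b0) * np.exp(b1 / T) * rec["RH"]**b2 * rec["H"]**b3 * rec["F"]**b4
        cdf = lambda t: 1.0 - np.exp(-(t / eta) ** beta)
        if rec["status"] == "S":
            # survival (right-censored): contributes 1 - F(t)
            nll -= np.log(1.0 - cdf(rec["time"]))
        else:
            # failure between the last-good and first-bad inspections:
            # contributes F(t_k) - F(t_k-1)
            t_lo, t_hi = rec["interval"]
            nll -= np.log(cdf(t_hi) - cdf(t_lo))
    return nll

# records would hold one entry per sample from Table 7.3-3, e.g. (illustrative):
# {"T": 130, "RH": 100, "H": 25, "F": 2,  "status": "F", "interval": (34.0, 58.0)}
# {"T": 85,  "RH": 85,  "H": 50, "F": 20, "status": "S", "time": 1159.0}
# fit = minimize(neg_log_likelihood, x0=[3.0, 24.0, 8000.0, -9.0, 0.2, 0.04],
#                args=(records,), method="Nelder-Mead")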

The estimated parameters are summarized in Table 7.3-5. In this table, the best estimate is provided along with the 80% 2-sided confidence bounds around the estimate. A small variation between the lower and upper confidence bounds is indicative of a significant variable; note, for example, that the bounds on the process force coefficient (β4) straddle zero, making process force the least clearly significant of the variables studied.

Table 7.3-5: Estimated Parameter 80% 2-Sided Confidence Bounds

Parameter              Lower 80% CL    Best Estimate    Upper 80% CL
β (shape parameter)    2.737           3.073            3.450
β0                     19.68           23.98            28.28
β1 (temperature)       6957.2          8015.7           9074.3
β2 (RH)                -9.45           -8.83            -8.21
β3 (hardness)          0.131           0.215            0.299
β4 (process force)     -0.0031         0.0388           0.0807

The resulting equation for the characteristic life is then:

η = e^{23.98} · e^{8015.7/T} · RH^{-8.83} · H^{0.2150} · F^{0.0388}
Once the model parameters are estimated, then a variety of output formats are possible.
For example, Figure 7.3-2 illustrates the probability of failure as a function of
temperature and relative humidity at a time of 50,000 hours.
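For illustration, the fitted model can also be exercised directly. The sketch below (Python; the use condition values are assumptions chosen for illustration, not conditions from the original analysis) evaluates the characteristic life and the probability of failure by 50,000 hours.

import numpy as np

def characteristic_life(T_celsius, RH, H, F):
    # Best-estimate life model from Table 7.3-5
    T = T_celsius + 273.0   # degrees K
    return np.exp(23.98) * np.exp(8015.7 / T) * RH**-8.83 * H**0.2150 * F**0.0388

def unreliability(t, T_celsius, RH, H, F, beta=3.073):
    eta = characteristic_life(T_celsius, RH, H, F)
    return 1.0 - np.exp(-(t / eta) ** beta)

# Assumed use condition: 40C, 60% RH, hardness 50, process force 2 N
print(characteristic_life(40, 60, 50, 2))      # characteristic life, hours
print(unreliability(50000.0, 40, 60, 50, 2))   # probability of failure by 50,000 hours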

[Figure: 3-D surface plot, "Unreliability vs. Stress Surface"]

Figure 7.3-2: Probability of Failure vs. Temperature and Relative Humidity at 50,000 Hours

7.4. NPRD Description


Information from the RIAC document Nonelectronic Parts Reliability Data (NPRD)
(Reference 4) is presented here to provide the reader with:
1. An understanding of the issues involved in the collection and interpretation of
field reliability data
2. A summary of alternatives that can be used to combine data from various sources
The purpose of NPRD is to present failure rate data on a wide variety of
electromechanical and mechanical parts and assemblies (including many types of
electronic assemblies). While there are reliability prediction methodologies for standard
electronic components such as MIL-HDBK-217 and the RIAC 217Plus methodology,
there are few sources of failure rate data for other component types. All part types and
assemblies for which RIAC has data are included in NPRD, with the exception of
standard electronic component types. Although the data contained in NPRD were
collected from a wide variety of sources, RIAC has screened the data such that only high
quality data is added to the database and presented in this document. In addition, only
field failure rate data is included. The intent of this section is to provide the user with
information to adequately interpret and use data to supplement standard reliability
prediction methodologies.
It is not feasible for documents like MIL-HDBK-217 or other prediction methodologies
to contain failure rate models on every conceivable type of component and assembly.
Traditionally, reliability prediction models have been primarily applicable only for
generic electronic components. Therefore, NPRD serves a variety of needs:

o To provide failure rates on assemblies in cases where piece-part level analyses are not feasible or required
o To complement other prediction methodologies by providing data on part types not addressed by its models

7.4.1. Data Collection

The failure rate data contained in the newest version of NPRD (NPRD-2010) will
represent a cumulative compilation of data collected from the early 1970's through
December 2008. RIAC is continuously soliciting new field data in an effort to keep the
databases current. The goals of these data collection efforts are as follows:
1. To obtain data on relatively new part types and assemblies.
2. To collect as much data on as many different data sources, application
environments, and quality levels as possible.
3. To identify as many characteristic details as possible, including both part and
application parameters.
The following generic sources of data were used for this publication:

1. Published reports and papers
2. Data collected from government-sponsored studies
3. Data collected from military maintenance data collection systems
4. Data collected from commercial warranty repair systems
5. Data from commercial/industrial maintenance databases
6. Data submitted directly from military or commercial organizations that maintain failure databases
An example of the process by which RIAC identifies candidate systems and extracts
reliability data on military systems is summarized in Table 7.4-1.

Table 7.4-1: Data Summarization Process

(1) Identify System Based On:
    o Environments/Quality
    o Age
    o Component Types
    o Availability of Quality Data

(2) Build Parts List:
    o Obtain Illustrated Parts Breakdown (IPB)
    o Ensure Correct Version of System Consistent with Maintenance Data
    o Identify Characteristics of Components (Part Numbers, Federal Stock Number, Vendor Catalogs, etc.)
    o Enter Part Characteristics into Database

(3) Obtain Failure Data:
    o Reliability Improvement Warranty, D056, Warranty Records, etc.
    o Match Failures to IPB
    o Ensure Part Replacements were Component Failures
    o Add Failure Data to Database

(4) Obtain Operating Data:
    o Verify Equipment Inventory
    o Equipment Hours/Miles, Part Hours/Miles
    o Application Environment

(5) Transform Data to Common RIAC Database Template

Perhaps the most important aspect of this data collection process is identifying viable
sources of high quality data. Large automated maintenance databases, such as the Air
Force REMIS system or the Navy's 3M and Avionics 3M systems, typically will not
provide accurate data on piece parts. They can, however, provide acceptable data on
assemblies or LRUs, if used judiciously. Additionally, there are specific instances in
which they can be used to obtain piece-part data. Piece-part data from these maintenance
systems is used in the RIAC's data collection efforts only when it can be verified that
they accurately report data at this level. Reliability Improvement Warranty (RIW) data
are another high-quality data source that has been used.

Completeness of data, consistency of data, equipment population tracking, failure verification, availability of parts breakdown structures, and characterization of operational histories are all used to determine the adequacy of the data. In many cases, data submitted to the RIAC is discarded since an acceptable level of credibility does not exist.
Inherent limitations in data collection efforts can result in errors and inaccuracies in
summary data. Care must be taken to ensure that the following factors are considered
when using a data source. Some of the sources of error are:
1. There are many more factors affecting reliability than can be identified
2. There is a degree of uncertainty in any failure rate data collection effort. This
uncertainty is due to the following factors:
a. Uncertainty as to whether the failure was inherent (common cause) or
event-related (special cause)
b. Difficulty in separating primary and secondary failures
c. Much of the collected data is generic and not manufacturer specific,
indicating that variations in the manufacturing process are not accounted
for
d. It is very difficult to distinguish between the effects of highly correlated
variables. For example, the fact that higher quality components are
typically used in more severe environments makes it impossible to
distinguish the effect that each has, independently, on reliability.
e. Operating hours can be reported inaccurately
f. Maintenance logs can be incomplete
Actual component stresses are rarely known. Even if nominal stresses are known, the actual stresses that significantly impact reliability can vary significantly about the nominal value. The impacts of complex environmental stresses on reliability during field operation of a product or system are also extremely difficult, if not impossible, to discern.
When collecting field failure data, a very important variable is the criteria used to define, detect and classify failures. Much of the failure data presented in NPRD-2010 was identified by maintenance technicians performing a repair action, indicating that the criterion for failure is that a part in a particular application has failed in a manner that makes the failure apparent to the technician. In some data sources, the criterion for failure was that the component replacement must have remedied the failure symptom.

7.4.2. Data Interpretation

Data contained in NPRD-2010 reflects industry average failure rates, especially the
summary failure rates which were derived by combining several failure rates on similar
parts/assemblies from various sources. In certain instances, reliability differences can be
distinguished between manufacturers or between detailed part characteristics. Although
the summary section of NPRD cannot be used to identify these differences (since it
presents summaries only by generic type, quality, environment, and data source), the
listings in the detailed section of NPRD contain all of the specific information that was
known for each part and, therefore, can sometimes be used to identify such differences.
Data in the summary section of NPRD represent an "estimate" of the expected failure
rate. The "true" value will lie within some confidence interval about that estimate. The
traditional method of identifying confidence limits for components with exponentially
distributed lifetimes has been the use of the Chi-Square distribution. This distribution
relies on the observance of failures from a homogeneous population and, therefore, has
limited applicability to merged data points from a variety of sources.
To give users of NPRD a better understanding of the confidence they can place in the
presented failure rates, an analysis of RIAC data in the past concluded that, for a given
generic part type, the natural logarithm of the observed failure rate is normally distributed
with a standard deviation of 1.5. This means that 68 percent of the actual experienced
failure rates will be between 0.22 and 4.5 times the mean value. Similarly, 90% of actual
failure rates will be between 0.08 and 11.9 times the presented mean value. As a general
rule-of-thumb, this type of precision is typical of probabilistic reliability prediction
models and point-estimate failure rates such as those contained within NPRD. It should
be noted that this precision is applicable to predicted failure rates at the component level,
and that confidence will increase as the statistical distributions of components are
combined when analyzing modules or systems.
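The multipliers quoted above follow directly from the stated lognormal assumption; the short sketch below (Python, purely illustrative) reproduces them from the standard deviation of 1.5 on the log of the failure rate.

import math

sigma = 1.5   # standard deviation of ln(failure rate), per the RIAC analysis

# 68% of observed failure rates fall within one sigma of the mean on the log scale
print(math.exp(-sigma), math.exp(sigma))                   # -> ~0.22x to ~4.5x the mean

# 90% fall within 1.645 sigma on the log scale
print(math.exp(-1.645 * sigma), math.exp(1.645 * sigma))   # -> ~0.08x to ~11.9x the mean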
In virtually all of the field failure data collected for NPRD, TTF was not available. Few
current DoD or commercial data tracking systems report elapsed time indicator (ETI)
meter readings that would allow TTF compilations. Those that do lose accuracy
following removal and replacement of failed items. To accurately monitor these times,
each replaceable item would require its own individual time recording device. Data
collection efforts typically track only the total number of item failures, part populations,
and the number of system operating hours. This means that the assumed underlying TTF
distribution for all failure rates presented in NPRD is the exponential distribution.
Unfortunately, many part types for which data are presented typically do not follow the
exponential failure law, but rather exhibit wearout characteristics, or an increasing failure
rate in time. While the actual TTF distribution may be Weibull or lognormal, it may
appear to be exponentially distributed if a long enough time has elapsed. This
assumption is accurate only under the condition that components are replaced upon
failure, which is true for the vast majority of data contained in NPRD. To illustrate this,
refer to Figure 7.4-1, which depicts the apparent failure rate for a population of
components that are replaced upon failure, each of which follow the Weibull TTF
distribution. This illustrates Drenick's theorem, which was discussed earlier in this book.

MTTF = Mean-Time-to-Failure, α = Weibull Characteristic Life

Figure 7.4-1: Apparent Failure Rate for Replacement Upon Failure


At t = 0, the population of parts has not experienced operation. As operating time
increases, parts in the original population are replaced and the failure rate increases. The
failure rate then decreases as the majority of parts have been replaced with new parts.
The population of replaced parts undergoes the same process, with the exception that the deviation of the second distribution is greater, due to the fact that the "time zeros" of the replaced parts are themselves spread over time. This process continues until the "time zeros" of the parts have become sufficiently randomized to result in an apparently exponentially distributed population. The approximate time at which this asymptotic value is reached, as a function of β, is given in Table 7.4-2. The asymptotic value of the failure rate is 1/α, regardless of β.


Table 7.4-2: Time at Which Asymptotic Value is Reached

β    Approximate Time to Reach Asymptote (multiples of α)
2    1.0
4    2.4
6    4.2
8    7.0

Additionally, since MTTF is often used instead of the characteristic life, their relationship should be understood. The ratio α/MTTF is a function of β and is given in Table 7.4-3.

Table 7.4-3: α/MTTF Ratio as a Function of β

β      α/MTTF
1.0    1.00
2.0    1.15
2.5    1.12
3.0    1.10
4.0    1.06
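For reference, the exact Weibull relationship is MTTF = α·Γ(1 + 1/β), so the ratio can be computed directly; the few lines below (Python, illustrative only) yield values close to, though not identical with, the rounded entries in Table 7.4-3.

import math

# MTTF = alpha * Gamma(1 + 1/beta), so alpha/MTTF = 1 / Gamma(1 + 1/beta)
for beta in (1.0, 2.0, 2.5, 3.0, 4.0):
    print(beta, round(1.0 / math.gamma(1.0 + 1.0 / beta), 3))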

Based on the previous discussion, it is apparent that the time period over which data is collected is very important. For example, if the data is collected from time zero to a time which is a fraction of α, the failure rate will be increasing over that period and the average failure rate will be much less than the asymptotic value. If, however, the data is collected during a time period after which the failure rate has reached its asymptote, the apparent failure rate will be constant and will have the value 1/α. The detailed data section in NPRD presents part populations, which provide the user the ability to further analyze the time logged on an individual part or assembly and to estimate the characteristic life. For example, the detailed section presents the population and the total number of operating hours for each data record. Dividing the part operating hours by the population yields the average number of operating hours for the system/equipment in which the part/assembly was operating. An entry for a commercial quality mercury battery in a ground, fixed (GF) environment indicates that a population of 328 batteries had experienced a total of 0.8528 million part hours of operation. This indicates that each battery had experienced an average of 0.0026 million hours of operation in the time period over which the data was collected. If the shape parameter, β, of the Weibull distribution is known for a particular part/assembly, the user can use this data to extrapolate the average failure rate presented in NPRD to a Weibull characteristic life (α). If the percentage of the population that has failed is relatively low, the methodology is of limited value; if a significant percentage of the population has failed, the methodology will yield results in which the user can have a higher degree of confidence. The methodology presented is useful only in cases where TTF characteristics are needed. In many instances, knowledge of the part characteristic life is of limited value if logistics demand is the concern. This data can, however, be used to estimate characteristic life in support of preventive maintenance efforts. The assumptions in the use of this methodology are:

1. Data were collected from "time zero" of the part/assembly field usage
2. The Weibull distribution is valid and its shape parameter is known
Table 7.4-4 contains cumulative percent failure as a function of the Weibull shape parameter, β, and the time/characteristic life ratio (t/α). The percent failure from the NPRD detailed data section can be converted to a (t/α) ratio using the data in Table 7.4-4. Once this ratio is determined, a characteristic life can be determined by dividing the average operating hours per part (part hours/population) by the (t/α) ratio. It should be noted here that the percentage failures in the table can be greater than 100, since parts are replaced upon failure and there can be an unlimited number of replacements for any given part.

Table 7.4-4: Percent Failure for Weibull Distribution


As an example, consider the NPRD detailed data for "Electrical Motors, Sensor"; Military Quality Grade; Airborne, Uninhabited (AU) environment; and a Population Size of 960 units. Assume for this data entry that there were 359 failures in 0.7890 million part-operating hours. The data may be converted to a characteristic life in the following manner:

1. Determine the Percent Failure:

   % Failure = 359 / 960 = 37.4%

2. Determine a typical Weibull shape parameter (β). For motors, a typical β value is 3.0 (Reference 5).

3. Convert the Percent Failure to a (t/α) ratio using Table 7.4-4 (for % fail = 37.4 and β = 3):

   t/α ≈ 0.65 (interpolating between 31 and 42)

4. Calculate the average operating hours per part:

   Part Hours / Population Count = 0.7890 / 960 = 0.00082 million hours

5. Calculate α:

   α = (Part Hours / Population Count) / (t/α) = 0.00082 / 0.65 = 0.00126 million hours

Based on this data, the approximate Weibull characteristic life is 1260 hours. The user of this methodology is cautioned that this is a very approximate method for determining the characteristic life of an item when TTF data is not available. It should also be noted that for small values of time (i.e., t < 0.1α), random failures can predominate, effectively masking wearout characteristics and rendering the methodology inaccurate. Additionally, for small operating times relative to α, the results are dependent on the extreme tail of the distribution, thus significantly decreasing the confidence in the derived α value.
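The five steps above are easy to mechanize; the sketch below (Python, illustrative only) does so. Because Table 7.4-4 is not reproduced here, the (t/α) ratio is taken as an input that the user looks up from the table for the observed percent failure and assumed β.

def characteristic_life_hours(failures, population, part_hours_millions, t_over_alpha):
    # t_over_alpha: looked up in Table 7.4-4 for the percent failure and assumed beta
    percent_failure = 100.0 * failures / population           # Step 1
    avg_hours = part_hours_millions / population              # Step 4 (million hours/part)
    alpha_millions = avg_hours / t_over_alpha                 # Step 5
    return percent_failure, alpha_millions * 1.0e6            # alpha in hours

# Worked example above: 359 failures among 960 motors over 0.7890 million part hours
print(characteristic_life_hours(359, 960, 0.7890, 0.65))
# -> (37.4, ~1264); the text rounds to 1260 hours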
For part types exhibiting wearout characteristics, the failure rate presented represents an
average failure rate over the time period in which the data was collected. It should also
be noted that for complex nonelectronic devices or assemblies, the exponential
distribution is a reasonable assumption. The user of this data should also be aware of
how data on cyclic devices such as circuit breakers is presented in NPRD. Ideally, these
devices should have failure rates presented in terms of failures per operating cycles.
Unfortunately, from the field data collected, the number of actuations is rarely known
and, therefore, the listed failure rates are presented in terms of failures per operating hour
for the equipment in which the part is used.
7.4.3. Document Overview

The RIAC NPRD databook is organized into the following sections:

Section 1: Introduction
Section 2: Part Summaries
Section 3: Part Details
Section 4: Data Sources
Section 5: Part Number/Mil Number Index
Section 6: National Stock Number Index with Federal Stock Class Prefix
Section 7: National Stock Number Index without Federal Stock Class Prefix
Section 8: Part Description Index

Sections 2 through 8 are described in detail in the following sections.


7.4.3.1. "Part Summaries" Overview

The summary section of NPRD contains combined failure rate data, presented in order of
Part Description, Quality Level, Application Environment, and Data Source. The Part
Description itself is presented in a hierarchical classification. The known technical
characteristics, in addition to the classification, are contained in Section 3 of the book,
Part Details. All data records were combined by totaling the failures and operating
hours from each unique data source. In some cases, only failure rates were reported to
RIAC. These data points do not include specific operating hours and failures, and have
dashes in the Total Failed and Operating Hours/Miles fields. Table 7.4-5 describes each
field presented in the summary section.

Table 7.4-5: Field Descriptions

Part Description: Description of the part, including the major family of parts and the specific part-type breakdown within the part family. The RIAC does not distinguish parts from assemblies within NPRD. Information is presented on parts/assemblies at the indenture level at which it was available. The description of each item for which data exists is made as clear as possible so that the user can choose a failure rate on the most similar part or assembly. The parts/assemblies for which data is presented can be comprised of several part types, or they can be a constituent part of a larger assembly. In general, however, data on the part type listed first in the data table is representative of the part type listed and not of the higher level of assembly. For example, a listing for "Stator, Motor" represents failure experience on the stator portion of the motor and not the entire motor assembly. Added descriptors to the right, separated by commas, provide further details on the part type listed first. Additional detailed part/assembly characteristics can be found, if available, in the "Part Details" section of NPRD.

Quality Level: The Quality Level of the part, as indicated by:
    o Commercial - Commercial quality parts
    o Military - Parts procured in accordance with MIL specifications
    o Unknown - Data resulting from a device of unknown quality level

App. Env.: The Application Environment describes the conditions of field operation. See Table 7.4-6 for a detailed list of the application environments and their descriptions. These environments are consistent with MIL-HDBK-217. In some cases, environments more generic than those used in MIL-HDBK-217 are used; for example, "A" indicates the part was used in an Airborne environment, but the precise location and aircraft type were not known. Additionally, some environments are more specific than the current version of MIL-HDBK-217, since the current version has merged many of the environment categories and the NPRD data was originally categorized into the more specific environment. Environments preceded by the term "NO" are indicative of components used in a non-operating product or system in the specified environment.

Data Source: Source of data comprising the NPRD data entry. The source number may be used as a reference to Section 4 of NPRD to review the specific data source description.

Failure Rate, Fails/(E6): The failure rate presented for each unique part type, environment, quality, and source combination. It is the total number of failures divided by the total number of life units. No letter suffix indicates that the failure rate is in failures per million operating hours; an "M" suffix indicates the unit is failures per million miles. For roll-up data entries (i.e., those without sources listed), the failure rate is derived using the data merge algorithm described in this section. A failure rate preceded by a "<" is representative of entries with no failures; the failure rate listed was calculated by using a single failure divided by the given number of operating hours. The resulting number is a worst-case failure rate, and the real failure rate is less than this value. All failure rates are presented in NPRD in a fixed format of four places after the decimal point. The user is cautioned that the presented data has inherently high variability and that four decimal places do not imply any level of precision or accuracy.

Total Failed: The total number of failures observed in the merged data records.

Op. Hours/Miles (E6): The total number of operating life units (in millions) observed in the merged data records. Absence of a suffix indicates operating hours is the life unit; "M" indicates that miles is the life unit.

Detail Page: The page number containing the detail data source description which comprises the summary record.


Table 7.4-6: Application Environments Defined in NPRD

Env    Description

A      Airborne - The most generalized aircraft operation and testing conditions.

AI     Airborne Inhabited - General conditions in inhabited areas without environmental extremes.

AIA    Airborne Inhabited Attack - Typical conditions in cargo compartments occupied by aircrew without environment extremes of pressure, temperature, shock and vibration, and installed on high performance aircraft such as used for ground support.

AIB    Airborne Inhabited Bomber - Typical conditions in bomber compartments occupied by aircrew without environment extremes of pressure, temperature, shock and vibration, and installed on long mission bomber aircraft.

AIC    Airborne Inhabited Cargo - Typical conditions in cargo compartments occupied by aircrew without environment extremes of pressure, temperature, shock and vibration, and installed on long mission transport aircraft.

AIF    Airborne Inhabited Fighter - Typical conditions in cargo compartments occupied by aircrew without environment extremes of pressure, temperature, shock and vibration, and installed on high performance aircraft such as fighters and interceptors.

AIT    Airborne Inhabited Transport - Typical conditions in cargo compartments occupied by aircrew without environment extremes of pressure, temperature, shock and vibration, and installed on high performance aircraft such as trainer aircraft.

ARW    Airborne Rotary Wing - Equipment installed on helicopters; includes laser designators and fire control systems.

AU     Airborne Uninhabited - General conditions of such areas as cargo storage areas, wing and tail installations where extreme pressure, temperature, and vibration cycling exist.

AUA    Airborne Uninhabited Attack - Bomb bay, equipment bay, tail, or where extreme pressure, vibration, and temperature cycling may be aggravated by contamination from oil, hydraulic fluid and engine exhaust. Installed on high performance aircraft such as used for ground support.

AUB    Airborne Uninhabited Bomber - Bomb bay, equipment bay, tail, or where extreme pressure, vibration, and temperature cycling may be aggravated by contamination from oil, hydraulic fluid and engine exhaust. Installed on long mission bomber aircraft.

AUF    Airborne Uninhabited Fighter - Bomb bay, equipment bay, tail, or where extreme pressure, vibration, and temperature cycling may be aggravated by contamination from oil, hydraulic fluid and engine exhaust. Installed on high performance aircraft such as fighters and interceptors.

AUT    Airborne Uninhabited Transport - Bomb bay, equipment bay, tail, or where extreme pressure, vibration, and temperature cycling may be aggravated by contamination from oil, hydraulic fluid and engine exhaust. Installed on high performance aircraft such as used for trainer aircraft.

DOR    Dormant - Component or equipment is connected to a system in the normal operational configuration and experiences non-operational and/or periodic operational stresses and environmental stresses. The system may be in a dormant state for prolonged periods before being used in a mission.

G      Ground - The most generalized ground operation and test conditions.

GB & GBC   Ground Benign - Non-mobile, laboratory environment readily accessible to maintenance; includes laboratory instruments and test equipment, medical electronic equipment, business and scientific computer complexes. GBC refers to a commercial application of a commercial part.

GF     Ground Fixed - Conditions less than ideal, such as installation in permanent racks with adequate cooling air and possible installation in unheated buildings; includes permanent installation of air traffic control, radar and communications facilities.

GM     Ground Mobile - Equipment installed on wheeled or tracked vehicles; includes tactical missile ground support equipment, mobile communication equipment, tactical fire direction systems.

ML     Missile Launch - Severe conditions related to missile launch (air and ground), space vehicle boost into orbit, vehicle re-entry and landing by parachute. Conditions may also apply to rocket propulsion powered flight.

MP     Manpack - Portable electronic equipment being manually transported while in operation; includes portable field communications equipment and laser designators and rangefinders.

N      Naval - The most generalized normal fleet operation aboard a surface vessel.

NH     Naval Hydrofoil - Equipment installed in a hydrofoil vessel.

NS     Naval Sheltered - Sheltered or below-deck conditions, protected from weather; includes surface ship communication, computer, and sonar equipment.

NSB    Naval Submarine - Equipment installed in submarines; includes navigation and launch control systems.

NU     Naval Unsheltered - Nonprotected surface shipborne equipment exposed to weather conditions; includes most mounted equipment and missile/projectile fire control equipment.

N/R    Not Reported - Data source did not report the application environment.

SF     Spaceflight - Earth orbital; approaches benign ground conditions. Vehicle neither under powered flight nor in atmospheric re-entry; includes satellites and shuttles.

Data records are also merged and presented at each level of part description (categorized from most generic to most specific). The data entries with no source listed represent these merged records. Merging data becomes a particular problem due to the wide dispersion in failure rates, and because many data points consist of only survival data in which no failures occurred, thus making it impossible to derive a failure rate. Several approaches were considered in defining an optimum data merge routine. These options are summarized as follows:
1. Summing all failures and dividing by the sum of all hours. The advantages of
this methodology are its simplicity and the fact that all observed operating
hours are accounted for. The primary disadvantage is that it does not weigh
outlier data points less than those clustering about a mean value. This can
cause a single failure rate to dominate the resulting value.
2. Using statistical methods to identify and exclude outliers prior to summing
hours and failures. This methodology would be very advantageous in the
event there are enough failure rate data points to properly apply the statistical
methods. The data being combined in NPRD often consists of a very limited
number of data points, thus negating the validity of this method.
3. Deriving the arithmetic mean of all observed failure rates which are from data
records with failures, and modifying the resulting value in accordance with the
percentage of operating hours associated with the zero failure records.
Advantages of this method are that modifying the mean in accordance with
the percentage of operating hours from survival data will ensure that all
observed part hours are accounted for, regardless of whether they have
experienced failures. Disadvantages are that the arithmetic mean does not
apply less weight to those data points substantially beyond the mean and,
therefore, a single data point could dominate the calculated failure rate.
4. Using a mean failure rate by taking the lower 60% confidence level (Chi-square) for zero failure data records and combining them with failure rates from failure records. The disadvantages of this methodology are that the 60% lower confidence limit can be a pessimistic approximation of the failure rate, especially in the case where there are few observed part hours of operation; and an arithmetic mean failure rate of these values (combined with the failure rates from failure records) could yield a failure rate which is dominated by a single failure rate, which itself may be based on a zero failure data point. The use of a geometric mean would alleviate some of this effect; the problem with the pessimistic nature of using the confidence level, however, would remain.
5. Deriving the geometric mean of all the failure rates associated with records having failures and multiplying the derived failure rate by the proportion [observed hours with failures / total observed hours]. For example, if 70 percent of the total part hours correspond to records with failures, the geometric mean of failure rates from the data records with failures would be multiplied by 0.7. This option is appealing, since the geometric mean will inherently apply less weight to failure rates that are significantly greater than the others for the same part type. The merged failure rate should be representative of the population of parts, since it takes into consideration all observed operating hours, regardless of whether or not there were observed failures.
Option 5 was selected for NPRD, since it is the only one that (1) accounts for all operating hours and (2) applies less weighting to the outliers. The resulting algorithm used to merge data within NPRD is:

λ_merged = ( ∏_{i=1}^{n′} λ_i )^{1/n′} × ( Σ_{i=1}^{n′} h′_i ) / ( Σ_{i=1}^{n} h_i )

where:

λ_i = the failure rate of the i-th NPRD Section 2 record with failures*
n = the total number of NPRD Section 2 data records
n′ = the total number of NPRD Section 2 data records with failures*
h_i = the hours associated with the i-th NPRD Section 2 data record
h′_i = the hours associated with the i-th NPRD Section 2 data record with failures*

* Note: Or having a second source failure rate.



In NPRD Section 2, part descriptions with "(Summary)" following the part name
comprise a merge of all data related to the generic part listed. An example of the NPRD
summary section is given in Figure 7.4-2.

Figure 7.4-2: Example of Part Summary Entries


To illustrate how the data was rolled up, consider the entries for linear mechanical actuators. The failure rate of 41.7293 listed for "Actuator, Mechanical, Linear" is a roll-up of three individual data entries for which there are sources listed (two for commercial quality, AUC environment, and one for unknown quality in an Airborne environment). The listing of 5.5413 for "Actuator, Mechanical" is a roll-up of four individual data entries (two for Mil/AIF, one for Unk/AUT, and one for Unk/GM). Using the algorithm described previously, the roll-up was calculated as follows:

λ_summary = [ (5.110)(33.6241) ]^{1/2} × (0.1957 + 0.0595) / (0.1957 + 0.0595 + 0.0830 + 0.2655) = 5.5413
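A compact implementation of this merge rule is sketched below (Python, for illustration only; this is not RIAC code, and the record layout is an assumption of the sketch). It reproduces the roll-up above from the four rate/hour pairs, of which only the first two records experienced failures.

import math

def merge_failure_rates(records):
    # records: (failure_rate, hours, had_failures) tuples; implements the
    # NPRD geometric-mean merge described above
    with_failures = [r for r in records if r[2]]
    geo_mean = math.prod(r[0] for r in with_failures) ** (1.0 / len(with_failures))
    hours_with_failures = sum(r[1] for r in with_failures)
    total_hours = sum(r[1] for r in records)
    return geo_mean * hours_with_failures / total_hours

# "Actuator, Mechanical" roll-up: rates in failures per million hours, hours in
# millions; the last two records are the zero-failure (survival-only) entries
records = [
    (5.110,   0.1957, True),
    (33.6241, 0.0595, True),
    (None,    0.0830, False),
    (None,    0.2655, False),
]
print(merge_failure_rates(records))   # -> ~5.541, matching the 5.5413 roll-up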

Now consider the entry for "Actuator, Mechanical (Summary)". This listing is a roll-up
of all "Actuator, Mechanical" data (in this case Actuator, Mechanical and Actuator,
Mechanical, Linear) using the algorithm described previously. In other words, the failure
rate of 25.8092 is a summary of failure data from seven individual data sources. For
these "(Summary)" data entries, sources are not listed since they represent a merge of one
or more data sources which are presented below the summary level. Roll-up values are
presented for each specific quality level and application environment for all components
having multiple part type entries at the same indenture level. If there is no summary
record indicated for a particular part type, the listed part description represents the lowest
level of indenture available. For example, the listing for "Actuator, Mechanical,"
although being identical to the generic level for which the summary data is presented,
was the most detailed description available for the particular data entry. More detailed
part level information may be available in NPRD Section 3. Each failure rate record
listed in the NPRD summary section is a merge of all detailed data from Section 3 for a
specific part type, quality, environment and unique data source. Each of these failure rate
records refers to a Section 3 page which contains all detailed records, including part
details, when they were known. Roll-ups are performed at every combination of part
description (down to 4 levels), quality level, and application environment. The data
points being merged in the NPRD summary section include only those records for which
a data source is listed. These individual data points were already combined by summing
part hours and failures (associated with the detailed records) for each unique data source.
Roll-ups performed on only zero-failure data records are accomplished simply by
summing the total operating hours, calculating a failure rate by assuming one failure, and
denoting the resulting worst case failure rate with a "<" (less than) sign.
The roll-ups were performed in this manner to give the NPRD user maximum flexibility
in choosing data on the most specific part type possible. For example, if the user needs
data on a part type which is not specified in detail or for conditions for which data does
not exist in this document, the user can choose data on a more generic part type or
summary condition for which there is data.
7.4.3.2. "Part Details" Overview

The detailed part data in NPRD Section 3 can be used to:


1. Determine if there is data on a specific part number, manufacturer or device
with similar physical characteristics to the one of interest.
2. View the detailed data that was used to generate the summarized data section,
so that a qualitative assessment of the data can be made.
The user is cautioned that individual data points from the detailed section may be of limited value relative to the merged summary data in NPRD Section 2, which combines records from several sources and typically results in many more part hours. Under no circumstance should the NPRD detailed data or summary data be used to blindly "cherry-pick" the most favorable or optimistic failure rate for a particular part or assembly type.

NPRD Section 3 contains a listing of all field experience records contained in the RIAC
part databases. The detailed data section presents individual data records that are
representative of the specific part types used in a particular application from a single data
source. For example, if 20 relays of the same type were used in a specific military
system, for which there were 300 systems in service, each with 1300 hours of operation
over the time in which the data was collected, the part population is 20x300 = 6000, and
the total part operating hours are 6000x1300 = 7,800,000 hours. If the same part is used
in another system, or if the system is used in different operating environments, or if the
information came from a different source, then separate NPRD data records were
generated. If known, the population size is given for each data record as the last element
in the Part Characteristics field. An example of NPRD Section 3 is shown in Figure
7.4-3.

Figure 7.4-3: Example of Part Detail Entries
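The part-population arithmetic in the relay example above can be written out in a few lines (Python, purely illustrative):

# Part population and part-hour bookkeeping from the relay example
parts_per_system = 20
systems_in_service = 300
hours_per_system = 1300

population = parts_per_system * systems_in_service   # 6,000 parts
part_hours = population * hours_per_system           # 7,800,000 part operating hours
print(population, part_hours)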


7.4.3.3. Section 4 "Data Sources" Overview

This section of NPRD describes each of the data sources from which data were extracted
for the databook. The Title, author(s), publication dates, report numbers, and a brief
abstract are presented. In a number of cases, information regarding the source of the data
had to be kept proprietary. In these cases, "Source Proprietary" is indicated.
7.4.3.4. Section 5 "Part Number/MIL Number" Index

This NPRD section provides an index, ordered by generic part type, of those Section 3
data entries that contain a generic commercial part number or a MIL-Spec number. The
Section 3 page which contains the specific entry for the part or MIL number of interest is
given. Note that not all data entries contain a part or MIL number, since these numbers
either were not applicable or were not known for all entries.

7.4.3.5. Section 6 "National Stock Number Index with Federal Stock Class Prefix"

This NPRD section provides an index of those Section 3 data entries that contain a
National Stock Number (NSN), including the four digit Federal Stock Class (FSC) prefix.
This index contains all parts for which the NSN is known.
7.4.3.6. Section 7 "National Stock Number Index without Federal Stock Class Prefix"

This NPRD section provides an index similar to the Section 6 index, with the exception
that the four-digit FSC is omitted.

7.5. References

1. RADC-TR-88-97, "Reliability Prediction Models for Discrete Semiconductor Devices", Final Technical Report, 1988
2. Denson, W.K. and S. Keene, "A New System Reliability Assessment Methodology", Final Report, 1998
3. "Photonic Component and Subsystem Reliability Process", Final Report, Subcontract 0044-SC-20100-0203, prepared for the Penn State University Electro-Optics Center, September 25, 2008
4. "Nonelectronic Parts Reliability Data" (NPRD), Reliability Information Analysis Center
5. RADC-TR-77-408, "Electric Motor Reliability Model"
6. MIL-HDBK-344A, "Environmental Stress Screening of Electronic Equipment", August 1993


8. The Use of FMEA in Reliability Modeling

Although analytical techniques like FMEA are not the primary focus of this book, they
are important in the development of a reliability model. For example, when identifying
the root failure causes that are to be included in a comprehensive product or system
reliability model, a need exists for the identification of the highest risk failure causes that
should be addressed in the model. The FMEA is a popular technique to use for this
purpose. The intent of this chapter is not to present a detailed procedural guide to FMEA,
as this has been done extensively in the literature. Rather, it is to present practical FMEA
guidelines based on the experience of the author, specifically toward the goal of
developing a reliability model.

8.1. Introduction
In order to build reliability into a product or system, it is necessary to anticipate failure
causes, and ensure that they are eliminated or, at least, that their probability of occurring
is made acceptably low. This anticipation can be accomplished empirically through
test, or analytically through analysis and modeling. Failure Mode and Effects Analysis
(FMEA) is a structured way of identifying root cause failure modes, and is the backbone
of an effective reliability program, particularly as it relates to reliability growth during the
design and development phase.
A successful product or system depends on the requirements being fully understood, the design being robust, and the manufacturing process also being robust. A Design FMEA (DFMEA) assesses the first two of these, and a Process FMEA (PFMEA) assesses the third. This is illustrated in Figure 8.1-1.


[Figure: Block diagram pairing the requirements for a successful product (understanding of requirements, robust design, robust manufacturing) with what can go wrong (wrong or bad requirements, bad design, bad manufacturing). Wrong or bad requirements lead to building the wrong product; bad design or bad manufacturing leads to building the product wrong. The DFMEA addresses requirements and design; the PFMEA addresses manufacturing.]

Figure 8.1-1: Two Basic Types of FMEA

Generally, the best manner in which to perform the FMEA is to separate the design and
process attributes and perform separate process and design FMEAs. However, in some
cases, these can essentially be combined into a single design FMEA by incorporating the
manufacturing process-related failure modes into the DFMEA failure cause/mechanism
column. The circumstances when this is appropriate are generally those when the item
under analysis is not complex from both a design and manufacturing perspective. This
book primarily addresses the DFMEA, since a reliability model is generally driven more
by the design than the process. However, process variables are often used as factors in
the reliability model.
An FMEA is the cornerstone of a reliability program, having many uses. The primary
purpose of the FMEA is to acquire an understanding of the reliability characteristics of a
product or system, such that corrective action can be taken to make the item more reliable
(reliability growth). The results of a FMEA are also used to support other reliability
engineering tasks, such as test plan development, the evaluation of engineering changes,
assessing detectability, the basis of troubleshooting manuals, and the development of
reliability models.


The logical, bottom-up analysis technique of the FMEA facilitates the understanding of
the reliability characteristics of a product or system. This understanding is a core
requirement for the attainment of the reliability objectives, and, as such, it will help
reduce the total program cost. While reliability engineering tasks are sometimes
considered to be costly to a program, the reality is that they will save significant amounts
of money, if properly implemented. Costs incurred when reliability problems are
identified in the field will be orders of magnitude higher than the upfront cost of the
reliability engineering tasks that solve them during design and development. Since the
success of a reliability program depends largely on the effectiveness of FMEA,
implementation of the FMEA is a critical element of the cost avoidance of field failures11.
Benefits of performing an FMEA include:

o The assurance that all conceivable root failure causes and their effects have been considered in the early stages of the product or system design and development process, and that corrective actions are taken to mitigate the risk associated with critical failure modes.

o If elements such as accelerating stresses are included in the FMEA analysis, it can be used to develop reliability growth, demonstration and screening test plans, as well as environmental qualification test plans. In this case, the importance of each potential accelerating stress can be quantified and prioritized in accordance with the severity, criticality or failure rate of the individual failure modes accelerated by the specific stress. For example, if temperature is determined to accelerate the majority of critical failure modes, then it should be used as a stress in reliability and qualification testing.

o If and when reliability problems occur after a product or system is delivered to the customer, the FMEA can be used as a basis for determining the root cause of failure. Based on failure symptoms, the possible causes can be identified based on the FMEA analysis that was performed.

o It can be used as a basis for the reliability model, in which the reliability of each high risk failure cause is quantified.

11. It should be noted that an FMEA is only technically effective if it has an impact on the design of the product or system. An FMEA that does an excellent job of identifying root failure causes, but is performed after-the-fact so as to have no impact on the actual design, is a waste of reliability program resources. An FMEA is only cost effective if it impacts the design of the product or system before the design is finalized and "bending metal" has started. A poorly timed FMEA that results in extensive and costly redesign efforts to eliminate or mitigate root failure causes is also counterproductive.


Another benefit of the FMEA is that it can be used as a basis for evaluating the risk
associated with engineering changes. If a design change is proposed, the FMEA can be
consulted to determine if the change will result in new failure modes or an increase in the
probability of failure of identified modes. Based on this information, the change can be
accepted, or additional reliability characterization can be performed to further assess the
reliability impact of the proposed change.
Detectability can also be assessed by the FMEA. This is particularly useful in instances
where failures that are undetectable are of special importance to the project. An example
of this is when alarms are used as a means to detect failures. Some failure modes may
not result in an alarm, and, therefore, the criticality associated with the failure mode can
be high. In this case, the FMEA can be used to assess these failure modes.
Troubleshooting manuals are essentially an FMEA that is presented in reverse order. The
FMEA is generally presented in the order of functional elements, or components. If the
FMEA is sorted by the effect of failure (or symptom), it essentially becomes a
troubleshooting aid, since the analyst can review the specific failure modes that will
result in the observed symptom. Additionally, if the probability of failure is included in
the FMEA analysis, the possible failure modes or causes can be ranked in accordance
with their probability. This can aid in the troubleshooting process.
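The sketch below (Python, purely illustrative; the field names and values are hypothetical) shows the kind of re-sorting just described: filtering FMEA rows by an observed effect (symptom) and ranking the candidate causes by their probability of occurrence.

# Hypothetical FMEA rows; field names are illustrative only
fmea = [
    {"item": "Motor", "failure_mode": "Open winding", "effect": "No output",        "probability": 0.010},
    {"item": "Relay", "failure_mode": "Contact weld", "effect": "No output",        "probability": 0.004},
    {"item": "Seal",  "failure_mode": "Leak",         "effect": "Loss of pressure", "probability": 0.002},
]

symptom = "No output"
candidates = sorted(
    (row for row in fmea if row["effect"] == symptom),
    key=lambda row: row["probability"],
    reverse=True,   # most probable cause first
)
for row in candidates:
    print(row["item"], "-", row["failure_mode"], "-", row["probability"])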
Typical problems with the implementation of an FMEA include:

o Confounding of failure modes, effects and causes
o The tiering effect between the failure cause, mode and effect as a function of the level of assembly, which makes it difficult to keep the cause-mode-effect relationships straight
o In the determination of occurrence, severity and detectability, there are several dimensions of each that need to be accounted for; therefore, definitions should be tailored for each product or system in accordance with the various dimensions
o Lack of follow-up action tracking
o The tendency to normalize effects and detectability to "in-process" and not "in-field", which is the purpose of the FMEA

The methodology outlined in this document is intended to provide guidance to overcome these limitations.

100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC


380

Chapter 8: The Use of FMEA in Reliability Modeling

8.2. Definitions
FMEA refers to a generic analysis methodology and, while there are industry standards that define the specifics of the analysis, there are many different ways in which the analysis can be accomplished. The following list of terms and definitions summarizes the data elements that an FMEA will typically include as columns in the FMEA worksheet template, presented in the order in which they usually appear.

Item: The name of the "Item" being analyzed.

Item Function: The function of the "Item" under analysis. If the "Item" has more than one function, with different potential modes of failure, list all functions separately.
Potential Failure Mode: The manner in which an item can fail, relative to the
particular "Item" and "Item Function". These should
include all failure modes that could occur, but may not
necessarily occur.
Failure Effect on Item: The local effect that the failure mode will have on the
item under analysis. This effect should reflect how an
item can fail to meet its functional requirements.
Failure Effect: Defined as the effect(s) of the "Failure Mode" on the end-item
(module) function, as perceived by the customer. This should be
described in terms of what the customer/end-user might notice or
experience.
Severity (S): The rank associated with the most serious effect for a given failure
mode. A numerical value of one (1) to ten (10), proportional to
this severity, is assigned. High numbers are applicable to effects
for which the consequences are severe. For example, if the effect
of a particular failure mode is that a critical module will fail
catastrophically (i.e., no output), then the assigned severity value
will be close to ten. Guidelines for assigning this value are
provided in a subsequent table in this book.
Potential Cause/Mechanism: Defined as an indication of a design weakness, the
consequence of which is the failure mode. You
should list every potential cause and/or failure
mechanism for each failure mode. Causes can be
any underlying reason that the failure mode
occurs, and can be manufacturing process
anomalies, human error, defect type, product attributes that can contribute to a failure mode, etc. Failure mechanisms are generally the
physical processes which result in the failure
mode. Examples are corrosion, crack
propagation, electromigration, spalling, etc.
Accelerating Stress(es): The stresses that will accelerate the cause/mechanism
of failure. These can be operational or environmental
stresses. This information is useful in developing
reliability growth and characterization test plans.
Occurrence (O): The likelihood that the specific failure cause/mechanism will
occur during the design life. A numerical value of one (1) to
ten (10), proportional to this likelihood, is assigned. High
numbers are applicable to causes/mechanisms that are likely to
occur. For example, if the specific cause/mechanism has been
observed to exhibit a relatively high failure rate, then the
assigned value will be close to ten.
Current Design Control Preventions: Indicate what has been done to prevent
the cause/mechanism of failure, or the
failure mode, from occurring, or reduce
its rate of occurrence.
Current Design Control Detections: Indicate what has been done to detect the
cause/mechanism of failure. This can be
done via test or analysis.
Detectability (D): The rank associated with the best detection control listed in
the design control columns, for the specific failure
cause/mechanism under analysis. A numerical value of one
(1) to ten (10), inversely proportional to the level of
detectability, is assigned. High numbers are applicable to
causes/mechanisms that are virtually undetectable. For
example, if it is known that a specific failure
cause/mechanism will be difficult to detect if it occurs, then
the assigned value will be close to ten.
RPN: The Risk Priority Number, which is defined as the product of O, S and D.
Recommendations: The recommendations of the FMEA team members or
stakeholders regarding corrective actions that should be
taken to address the specific failure cause.
Responsibility: The assigned individual or team that will lead the implementation of the corrective action recommendation(s).
Target Date: The date by which the corrective action recommendations are to
be implemented. Note that this date can also reflect the date by
which the corrective action recommendations, once implemented,
have been verified as being effective.
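Taken together, these columns amount to a simple record structure. The sketch below is one possible representation (the field names mirror the column definitions above; it is illustrative only, not a prescribed schema):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class FMEARow:
        """One FMEA worksheet row; fields mirror the columns defined above."""
        item: str
        item_function: str
        failure_mode: str
        effect_on_item: str
        failure_effect: str
        severity: int                 # S, 1 (benign) to 10 (severe)
        cause_mechanism: str
        accelerating_stresses: List[str] = field(default_factory=list)
        occurrence: int = 1           # O, 1 (unlikely) to 10 (frequent)
        preventions: str = ""
        detections: str = ""
        detectability: int = 1        # D, 1 (easily detected) to 10 (undetectable)
        recommendations: str = ""
        responsibility: str = ""
        target_date: str = ""

        @property
        def rpn(self) -> int:
            # Risk Priority Number, the product of O, S and D (Section 8.12)
            return self.occurrence * self.severity * self.detectability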

8.3. FMEA Logistics

This section provides specific guidance on FMEA implementation.
8.3.1. When initiated

The FMEA should be initiated when there is a preliminary design. In a multi-stage development process, this generally occurs in one of the first few stages. Most reliability
experts will recommend initiating the FMEA as early as possible in the development of a product. However, for products that are very early in the research phase, and whose design is changing rapidly based on research findings, there is a risk that the FMEA analysis will be premature. When the design fundamentals start to become finalized, the FMEA results are used, along with other learnings from the development process, to improve the design.
8.3.2. FMEA Team

To be effective, the FMEA team, led by an FMEA facilitator, must be cross-functional, and must have participation by all relevant departments, functions and/or disciplines (i.e., all relevant stakeholders). Specifically, the following functions should be represented:

• Design, especially the Design Lead
• Reliability
• Quality
• Project Management
• Application Engineers
• Manufacturing (participation by manufacturing is especially critical when performing PFMEAs, since it is the manufacturing processes that are being analyzed)

Additional functions may also be required, depending on the specific organization and
nature of the product. These additional functions can include component engineering,
procurement, measurements, and marketing. There are also instances where an FMEA might include the direct involvement of the customer, particularly for critical or highly
complex products or systems.
Not all of the above-listed functions are required for every part of the FMEA. For instance, the initial parts of the FMEA can be performed efficiently by only the Reliability engineering and the Applications engineering (or the Project Manager) functions. After this, engagement by the entire team is critical, especially to gain buy-in on corrective actions, which can be the responsibility of any of the disciplines.
The ideal team size is 5 to 8 people. Any larger, and the efficiency of the analysis is compromised. It is also more efficient to break up the analysis into distinct functional elements of the design (e.g., mechanical, electrical, optical, software/firmware), although it is also imperative to account for failure causes that are due to interactions of these functional elements. The FMEA facilitator needs to ensure that these interactions are accounted for, since individuals cognizant of their own functional element will often overlook them.
8.3.3. FMEA Facilitation

As with any cross-functional team, it is important to have a lead who facilitates the FMEA. The specific responsibilities of this facilitator are:

• Document the results of the analysis (it is also beneficial to have a separate scribe who documents the results, allowing the facilitator to concentrate on the additional items listed below)
• Keep the group focused on the task at hand
• Ensure that all components or processes are accounted for
• Prompt the group for participation, as required
• Spark the discussion by suggesting failure modes
• Ensure that the analysis keeps moving
• Ensure that the inputs of all participants are heard and captured. This includes making sure that certain people are not allowed to dominate the analysis, and that the ideas of quiet people are brought out.
• Manage conflicts. Professionals take a great deal of pride in their work and, since the FMEA goal is to find fault with the product or system, FMEA sessions can sometimes get contentious. The facilitator must manage this by keeping the session constructive and not allowing emotions to dictate the course of the analysis.

The facilitator is often from the reliability group, but does not have to be. It is more
important that the facilitator be skilled in the responsibilities listed above.

8.3.4. Implementation

Some suggestions for implementing an effective FMEA are listed here:

• Determine the appropriate FMEA methodology that will be used. Factors to consider are:
  a. Standards used in the specific industry for which the product is intended
  b. Customer needs and expectations
  c. Previous experience pertaining to the effectiveness of specific FMEA methodologies
• Use a facilitator who is experienced with FMEA techniques. This individual does not necessarily need to be a member of the technical team.
• Have a small team, or an individual, prepare the background information before the larger FMEA team meets. The information should include system documentation such as schematics, drawings, Bills of Materials (BOMs), theory of operation, and potential failure mode lists.
• Re-use as much information from previous FMEAs as possible, as this will save time.
• Segment the FMEA team sessions into logical groupings, if appropriate. As an example, electronic design and mechanical design can sometimes be separated. However, if they are separated, areas of interaction must be adequately covered. Thermal properties are a typical example of an area of potential interaction.

8.4. How to Perform an FMEA


The concept of FMEA is very straightforward. First, each component or element of the product or system is studied to see how it could fail. These are called failure modes: the observable effects of failure mechanisms. For example, a resistor can have open or short failure modes. These failure modes are directly observable and external to the part. Possible causes of each failure mode are then determined. Examples may include metal migration, corrosion, etc.
Next, the analysis determines what happens if the failure mode was to occur. These are
the effects, and are determined at the various levels of the assembly architecture
(progressing from low to high), such as the surrounding components, the sub-assembly,
the assembly and the entire product or system. The specific levels used in the analysis
are very item-specific, in that the more complex that the product or system is, the more
levels that may be required for analysis. If the product or system is relatively simple,
only one level may be required.


The next step is the determination of the possible corrective actions. This will be
discussed later in this chapter.
There are many ways in which an FMEA can be performed. This section outlines one
approach that has been successful based on the experience of the author. The process
flow is illustrated in Figure 8.4-1.

Figure 8.4-1: FMEA Process Flow

In this approach, the following steps are followed:

• Make a hierarchical listing of the product (Section 8.5)
• List the functional requirements of each item in the hierarchy (Section 8.6)
• Use IPOUND analysis to generate a set of potential failure modes of the system. These are the effects. (Section 8.7)
• For each effect, identify the severity (Section 8.8)
• Use IPOUND analysis to identify potential failure modes of parts
• Identify the possible effect(s) that could result from occurrence of each failure mode (Section 8.9)
• Identify potential causes of each failure mode (Section 8.10)
• For each cause, identify (Section 8.11):
  o Accelerating stress(es) or applicable tests
  o Occurrence
  o Preventions
  o Detections
  o Detectability
• Calculate the RPN (Section 8.12)
• Determine the appropriate corrective actions to be taken (Section 8.13)
• Update the RPN (Section 8.14)

Each of these steps is summarized below, along with guidance and tips on performing the
step.

8.5. Identify System Hierarchy


A hierarchical description of the product or system is first generated. The highest level is the system level; the levels below it are equipment, major assemblies, then subassemblies, etc. This breakdown continues until the lowest level that will be analyzed is reached.
The complexity of the system will dictate the number of hierarchical levels. Items
comprised of a single component will have only a single level. More complex products or systems can have from four up to eight levels.
It may be necessary to treat the system as the customer's system, since the effects of failure will be manifested at that level. Whether this is necessary also depends on whether the FMEA is being driven by customer requests.
The lowest level to be analyzed also needs to be determined. A general guideline to
determine the appropriate lowest level is that it should be one level lower than that for
which design control exists. For example, if an electrical circuit is being designed, and a
constituent component is a commercial off-the-shelf (COTS) capacitor, the FMEA should
go down to the capacitor level. In this case, it is not necessary to determine the specific
failure causes of the capacitor, but it is necessary to understand the failure modes of the
capacitor so that design actions can be taken with respect to the circuit design to mitigate
these potential modes. Ideally, the capacitor manufacturer will have performed an FMEA
on their product, in which case specific failure causes have already been identified and,
hopefully, mitigated.

8.6. Function Analysis


The next step in the FMEA is to list, for each item, its functional requirements. This
function analysis is performed on each item at each level in the hierarchy. It can include
both functions and the attributes of those functions. A function is the purpose of the item,
whereas an attribute is a characteristic of the function. Attributes are generally what is
detailed in a product specification.

8.7. IPOUND Analysis


An IPOUND analysis is a means to identify all possible ways in which a function or
attribute can fail. The failure modes of the system will become the failure effects, and the
failure modes of the parts will become the failure modes that will be further analyzed by
identifying their causes.
The IPOUND categories are:

I: Intermittent
P: Partial
O: Over
U: Unintended
N: Negative
D: Degraded

These are defined as follows:

I (Intermittent): The function is performed sometimes. This is common for electrical connections, where continuity is intermittent.
P (Partial): Too little of the function or attribute is initially achieved. This does not refer to the situation in which a function is initially fine, but degrades over time; that situation is described by the Degraded category.
O (Over): Too much of the function or attribute is achieved. This is only applicable for attributes where more is better.
U (Unintended): This refers to failure modes that are not directly attributable to the function under analysis, but rather a different attribute is affected.
N (Negative): None; complete loss of the function.
D (Degraded): Degraded function, when a function or attribute is initially fine, but degrades over time.

The loss function used in the Taguchi methodology can be used to determine some of the applicable failure modes. Here, functions or attributes are categorized as "larger the better", "nominal the best" and "smaller the better". For "larger the better" functions/attributes, a failure can occur when there is too little of the function/attribute, but not when there is too much of it. This is illustrated in Table 8.7-1, which relates only to the "over" function and "partial" function IPOUND categories. The other IPOUND categories are used when appropriate, and will generally be independent of the Taguchi categories.
Table 8.7-1: Failure Mode Relationship to Taguchi Loss Function

Function/Attribute Type    Too Much (Over Function)    Too Little (Partial Function)
Larger the better                                      X
Nominal the best           X                           X
Smaller the better         X

The IPOUND categories are intended to represent a complete and mutually exclusive set
of the ways in which a function or attribute can fail. When identifying failure modes in
this manner, it is helpful to set up a matrix of functions/attributes and the IPOUND
categories, and proceed to fill it in. When filling in this matrix, it is not necessary to
identify failure modes for all categories of IPOUND. Likewise, for any single category,
multiple failure modes are possible. The IPOUND methodology is simply a way to get
the team to think about all possible ways in which a function/attribute can fail.
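A lightweight sketch of such a matrix follows; the functions and failure mode entries here are invented for illustration:

    # Rows are functions/attributes; columns are the six IPOUND categories.
    categories = ["I", "P", "O", "U", "N", "D"]
    functions = ["Regulate output voltage", "Dissipate heat"]

    # Cells may stay empty, and a single cell may hold several failure modes.
    matrix = {
        ("Regulate output voltage", "I"): ["Regulation drops out under vibration"],
        ("Regulate output voltage", "N"): ["No output voltage"],
        ("Dissipate heat", "P"): ["Insufficient heat transfer at time zero"],
        ("Dissipate heat", "D"): ["Heat transfer degrades as the fins foul"],
    }

    for fn in functions:
        for cat in categories:
            for mode in matrix.get((fn, cat), []):
                print(f"{fn} [{cat}]: {mode}")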
The flow diagram in Figure 8.7-1 depicts a simple system consisting of a two-level
hierarchy. In practice, there can be any number of levels in the system hierarchy. The
failure cause-mode-effect relationship shifts in the FMEA as a function of the system
level, as illustrated in Figure 8.7-1. For example, at the most basic level, the part
manufacturing process, the cause of failure may be a process step that is out of control.
The ultimate effect of that cause becomes the failure mode at the part level. The failure
effect of the part becomes the failure mode at the next level of assembly, and so forth. It
is very important that the failure cause, mode and effect are not confounded in the analysis.

Figure 8.7-1: Failure Cause-Mode-Effect Relationship (for each level of the hierarchy, from system through assembly and part down to the part manufacturing process, the figure shows a cause-mode-effect chain in which the effect at one level becomes the failure mode at the next level up)

The failure modes of the system functions/attributes are the effects of failure modes at the
subordinate hierarchical level. This tiering continues as the system is broken down to the lowest level at which the analysis will take place.
For simple, single-level products (for example, a component made from a monolithic
material) there is only a single level and, therefore, this is not an issue. Also, for
relatively simple products with two levels, a local effects column can be added to
capture the effects of the failure mode on the subassembly function. In this case, the
effects are relative to the functional requirements of the subassembly.
When identifying failure modes, the assumption is made that the failure could occur but
may not necessarily occur.

8.8. Identify the Severity


Based on the IPOUND analysis at the system level, in which system failure modes were
identified, these failure modes will be the effects used in the FMEA. For each of these
effects, a severity rating is required. Table 8.8-1 summarizes the factors that should be
accounted for in establishing the severity value of each effect, and provides a summary of
the range of magnitudes for each dimension, from least to most severe.


Table 8.8-1: Dimensions of Functional Severity

Dimension of Severity               Magnitude                     Example
Degree to Which Function is Lost    No Degradation
                                    Slight Degradation
                                    Severe Degradation
                                    Intermittent
Importance of Function/Attribute    Not Critical
                                    Critical
When Occurs                         In Research & Development     Design Not Capable,
                                      (R&D)                         Process Not Capable
                                    In Process                    Screen Fallout
                                    Customer Inspections
                                    In Deployment                 Infant Mortality, Random,
                                                                    Wearout

In the identification of severity, effects of failure modes that are potentially safety-related
are usually considered to be the most severe. For these, a severity value of 9 or 10 is
used, regardless of the above listed factors.
The when occurs dimension of severity pertains to the life cycle phase in which the
failure mode and its effect occurs. If the failure mode occurs in the R&D phase, it is
either because the design or process is not capable, or because intrinsic or extrinsic
failure causes occur. In either case, this is the best phase to identify these, since they can
be corrected in the most cost-effective manner possible.
If the failure mode occurs in process, the effect is essentially a yield reduction.
Failures occurring during inspections or quality checks by the customer are similar, but they occur at the customer's site and are, therefore, more severe than when the defects are caught in-house.
Failures occurring in deployment represent the most severe type of failure effect (with the
possible exception of safety-related failures). These failures can be represented by the
three types of failures in the bathtub curve: infant mortality, random and wearout.

Usually, the severity of an effect is treated as one factor in the FMEA. However, separating the severity into three factors, and subsequent columns, can be beneficial. For example, if the "when occurs" dimension of severity is separated, the failure modes that can be caught in process are identified, and this, in turn, can be used to establish in-process checks and screening protocols.
If the dimensions are separated, any convenient numeric scale can be used, including 1-to-10, 1-to-3, or others. If 1-to-10 is used, the RPN of a failure cause will range from 1 to 1,000.
If these dimensions are not separated (which will usually be the case), each of the three
should be represented in the criteria used to define the severity levels. One way in which
this can be accomplished is to use the guidelines in Table 8.8-2, in which each dimension
is assumed to have a value between 1 and 3, directly proportional to its severity. The
total severity is then the sum of each of the three values.
Table 8.8-2: Dimensions of Severity

Dimension of Severity               Magnitude              Value
Degree to Which Function is Lost    No Degradation         1
                                    Slight Degradation
                                    Severe Degradation
                                    Intermittent           3
Importance of Function/Attribute    Not Critical           1
                                    Critical               3
When Occurs                         In R&D                 1
                                    In Process
                                    Customer Inspections
                                    In Deployment          3
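As a worked example of this additive scheme (the combination is illustrative): a non-critical function with no degradation, caught in R&D, would receive the minimum total severity of 1 + 1 + 1 = 3, while an intermittent loss of a critical function occurring in deployment would receive the maximum of 3 + 3 + 3 = 9.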

8.9. Identify the Possible Effect(s) that Result from Occurrence of Each Failure Mode
At this point in the analysis, the part failure modes have been identified, and the effects
(and their severity) have also been identified. This task is to identify the effects that will
result if the failure mode occurs. There can be any number of effects that can result from
the occurrence of the mode.

8.10. Identify Potential Causes of Each Failure Mode


Up to this point in the analysis, the FMEA has been a relatively straightforward
systematic approach to identify failure modes and their effects. For this reason, these
previous tasks can be accomplished by a small group of people, and the entire team is not
required. It is only required that someone knowledgeable in the system and part
functional and attribute requirements be involved.
The task of identifying failure causes is a much more unstructured, brainstorming-like
activity. For this, it is important to get the entire team involved. The intent of this task is
to identify all possible causes that could result in the failure mode. Causes are often more
complex than the identification of a single failure mechanism and, therefore, describing
them in a few sentences in the FMEA table can be problematic. A failure cause will
often be the result of sub-causes, and can be broken down further and further until the
physical failure phenomenon is identified. For this reason, an alternative is to perform a
fault tree analysis (FTA) on each failure mode. This allows for the breaking down of
failure causes into any level of detail.
There should be one severity rating for each failure effect, since the severity is a direct
1:1 relation to the effect. The maximum of each of these severities associated with each
failure mode is the severity used in the RPN calculation, since the RPN is applicable to
the cause. Here, the failure mode can result in several effects, but can also be initiated by
several causes. Therefore, a single cause can result in several effects, the worst of which
should be used in the RPN calculation. The relationship between failure cause, mode and
effect is illustrated in Figure 8.10-1.

Figure 8.10-1: Failure Cause, Mode and Effect Hierarchy


Here are some examples of design-related failure causes:

• Failures resulting from operational stress: failures resulting from the inability of a product or system to tolerate the applied stresses to which the component, item or material within the item is exposed.
• Failures resulting from environmental stress: failures resulting from the inability of a product or system to tolerate the applied environmental stresses to which the item is exposed.
• Tolerance stack-up: the initial tolerance at time zero, and the failure of a product or system to tolerate the cumulative effect of those tolerances.
• Wear and component or material ageing: the inability of a product or system to tolerate the changes of its constituent components or materials due to wear and ageing.
• The combination of component ageing and tolerance stack-up: the cumulative effects of wear, ageing, and tolerance stack-up. As components and materials within a product or system age, the susceptibility of the item to the cumulative effects of component/material tolerance will increase.

Failures can also be a result of short-term exposure to extreme stresses. While the
product or system is not designed to tolerate these stresses under steady-state conditions,
it should be able to tolerate short-term extreme stress exposure. There is a limit to the
stress level(s) that the product or system should be able to tolerate. However, design
actions can be taken to minimize the probability of failure due to these stresses.
The information presented here is generic in nature and applies equally to mechanical and
electronics failures. The specific failure mechanisms will vary, but the concepts are the
same.
Failure causes are often the result of a combination of conditions and events. Therefore,
when identifying causes, the analyst needs to consider these combinations. The factors
whose combinations can cause failure generically include:

• Design not capable
• Process not capable
• Screen fallout/out-of-the-box failure
• Infant mortality
• Random failure
• Wearout
• Design
• Manufacturing
• Environmental exposure
• Stress exposure

When hypothesizing failure causes, it is useful to think about them in terms of their initial
conditions, stresses, and failure mechanisms, as illustrated in Figure 8.10-2.

Figure 8.10-2: Failure Causes

A list of typical initial conditions, stresses, and failure mechanisms is provided below.

Initial conditions:
o Defect free (the item is made as designed)
o Defects:
Intrinsic:
Voids
Material property variation
Geometry variation
Contamination
Ionic contamination
Crystal defects
Stress concentrations
Extrinsic:
Organic contamination
Nonconductive particles
Conductive particles

Contamination
Ionic contamination

o Stresses:
Operation - steady state
Operation - cycling
Chemical exposure
Salt fog
Mechanical shock
UV exposure
Drop
Vibration
Temperature-high
Temperature-low
Temperature cycling
Damp heat
Pressure - low
Pressure - high
Radiation - EMI
Radiation - cosmic
Sand and dust
o Failure mechanisms (physical process):
Electromigration
Dielectric breakdown
Corrosion
Dendritic growth
Tin whiskers
Metal fatigue
Stress corrosion cracking
Melting
Creep
Warping
Brinelling
Fracture
Fretting fatigue
Galvanic corrosion
Pitting corrosion
Chemical attack
Fretting corrosion
Spalling

Crazing
Abrasive wear
Adhesive wear
Surface fatigue
Erosive wear
Cavitation pitting
Elastic deformation
Material migration
Oxidation
Cracking
Plastic deformation
Brittle fracture
Expansion
Contraction
Emod change
Outgassing
Index of refraction changes
Photodarkening
Condensation
Crystallization
Each failure cause can be characterized with a specific combination of initial condition, stress and degradation process. For example, a cause could be represented as: Defect 1 - temperature - corrosion.
After identifying all of the FMEA elements in accordance with the guidelines presented
herein, it is very useful to check the completeness of the analysis by hypothesizing what
would happen if the product or system:

• Is exposed to various environmental stresses
• Is exposed to high operating stresses (i.e., voltage, current, optical power, flow rates, etc.)
• Has manufacturing defects. Manufacturing defects are typically analyzed in a PFMEA. However, they can be included in the design FMEA, with the defect type being listed in the "Cause" column. In fact, if a PFMEA is not planned for a product or system, then manufacturing process-related failure causes should be included in the "Cause" column.
In these cases, you are filling in the FMEA "backwards" by essentially hypothesizing the cause a priori and then determining what the resulting failure mode would be. For example, the cause identified in this manner will result in a failure mode, which in turn
will have an effect at the next higher level in the system.

8.11. Identify Factors for Each Failure Cause


8.11.1. Accelerating Stress(es) or Potential Tests

This information can be used to define reliability test plans. A list of potential accelerating stresses or tests may include:
1. Operation - steady state
2. Operation - cycling
3. Chemical exposure
4. Salt fog
5. Mechanical shock
6. UV exposure
7. Drop
8. Vibration
9. Temperature - high
10. Temperature - low
11. Temperature cycling
12. Damp heat
13. Pressure - low
14. Pressure - high
15. Radiation - EMI
16. Radiation - cosmic
17. Sand and dust
8.11.2. Occurrence

8.11.2.1. Occurrence Rankings

The occurrence rating should be a function of two factors: (1) an estimate of the
likelihood of occurrence based on the analyst's experience, and (2) the degree to which
the failure cause/mechanism has been observed. For example, Figure 8.11-1 represents
the occurrence, as defined by Reference 1.

Figure 8.11-1: Occurrence Definitions

Ideally, a reliability model would be available from which to determine the occurrence,
but this is usually impractical due to the fact that the FMEA is generally performed
before the reliability modeling activities commence.
The occurrence should be based on engineering judgment and on empirical data. The
resulting occurrence value is based on both, as illustrated in Figure 8.11-2. For example,
if empirical information exists on a specific cause, it should be used as part of the
assessment of the Occurrence level. In this case, heavier weighting should be given to
field data over manufacturing and test data. If no empirical data exists, engineering
judgment should be used, and should be based on the collective experience of the FMEA
team.
The occurrence should be based on the likelihood that the cause will occur and that the
resulting mode will occur. Some FMEA methodologies, like the cancelled MIL-STD-1629, include a separate factor that accounts for the probability that the effect will occur
if the mode is to occur (the same concept can be used for the cause-mode relationship).

Figure 8.11-2: Occurrence Guidelines (the figure plots the failure rate estimate based on experience, from low (1) to high (10), against how often the failure cause/mechanism has been observed in the past, from "not at all" to "frequently"; heavier weighting should be given to field data over manufacturing and test data)

The frequencies of occurrence should be rated relative to the required reliability for a
specific failure cause. For example, if a reliability allocation is performed to allocate the
product or system failure rate (or unreliability) to its constituent components, then the
occurrence value should be relative to this allocated value.
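A minimal sketch of this idea is shown below; the logarithmic mapping from the ratio of estimated to allocated failure rate onto the 1-to-10 scale is an assumption made for illustration, not a prescribed rule:

    import math

    def occurrence_rank(estimated_fr: float, allocated_fr: float) -> int:
        """Map the ratio of estimated to allocated failure rate onto a 1-10
        occurrence value (illustrative log scale: a ratio of 1 maps to 5)."""
        ratio = estimated_fr / allocated_fr
        rank = round(5 + 2 * math.log10(ratio))
        return max(1, min(10, rank))  # clamp to the 1-10 scale

    # Example: a cause estimated at twice its allocated failure rate
    print(occurrence_rank(2.0e-6, 1.0e-6))  # -> 6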
Common Cause vs. Special Cause

Categories of failure effects are shown in Table 8.11-1. These illustrate the differences between common cause and special cause failure effects.


Table 8.11-1: Categories of Failure Effects

                            Design Not   Process Not   Screen Fallout/    Infant      Random    Wearout
                            Capable      Capable       Out-of-the-Box     Mortality   Failure
                                                       Failure
Always (Common Cause)       X            X
Sometimes (Special Cause)

8.11.3. Preventions

Preventions are the actions taken to prevent the cause/mechanism of failure or the failure
mode from occurring, or to reduce their rate of occurrence. These will generally be
design-related actions. Examples include "Ensured proper derating for all components" or "Use of a conformal coating".
8.11.4. Detections

Detections are actions taken to detect the cause/mechanism of failure. This can be via
either test or analysis.
8.11.5. Detectability

Detectability is a value between 1 and 10 that is inversely proportional to the degree to which the failure cause can be detected, i.e., the less likely the detection of the failure
cause, the higher the detectability value. The traditional detectability definitions are
listed in Figure 8.11-3.


Figure 8.11-3: Detectability Definitions

There are four aspects of detection that should be captured in the FMEA:
Current design control detections:
Indicate what has been done to detect the cause/mechanism of failure or the failure mode,
either by analytical or physical methods, before the item is released into production.
These are generally the application of tests or analytical techniques whose goals are to
ascertain the probability of occurrence of the failure cause/mechanism. Therefore, this aspect of detection relates to detecting the probability of occurrence.
Probability of detection if the failure cause/mechanism occurs
This is the probability that the failure cause/mechanism will be detected if it occurs.
Some failure modes are inherently undetectable when they occur. An example of this is
cracks that are initiated within a structure.
Screening
This aspect of detectability addresses the question: What will be done in the
manufacturing process to detect and eliminate the items prone to the failure
cause/mechanism? Reliability screening is a common technique for accomplishing this.
If screening is to be employed, the screening effectiveness must be determined. This
screening effectiveness is directly related to the Probability of detection if the failure
cause/mechanism occurs.
Degree of Warning
The fourth aspect of detectability relates to how detectable a failure cause/mechanism is
before it results in the worst case effect identified.

The life cycle phases to which each of these four dimensions is applicable are illustrated
in Figure 8.11-4.

Figure 8.11-4: Life Cycle vs Detectability Dimension



The combinations of each of the four dimensions (H = High, L = Low, x = Doesn't Matter) and the recommended detectability are summarized in Table 8.11-2.
Table 8.11-2: Recommended Detectability Rating Criteria

Current Design   Probability of Detection     Screening   Degree of   Detectability
Control          if the Failure                           Warning
Detections       Cause/Mechanism Occurs
x                L                            x           L           10
L                H                            L           H           8
x                L                            x           H           7
H                H                            L           L           5
L                H                            H           L           5
H                H                            L           H           2
L                H                            H           H           2
H                H                            H           H           1
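Where it is convenient to apply the table programmatically, it can be encoded as an ordered rule list. The sketch below simply reproduces the rows of Table 8.11-2, treating "x" as a wildcard:

    # Rows of Table 8.11-2 in order: (design control detections, probability of
    # detection if the cause occurs, screening, degree of warning) -> detectability
    RULES = [
        (("x", "L", "x", "L"), 10),
        (("L", "H", "L", "H"), 8),
        (("x", "L", "x", "H"), 7),
        (("H", "H", "L", "L"), 5),
        (("L", "H", "H", "L"), 5),
        (("H", "H", "L", "H"), 2),
        (("L", "H", "H", "H"), 2),
        (("H", "H", "H", "H"), 1),
    ]

    def detectability(dc: str, pd: str, scr: str, warn: str) -> int:
        """Return the recommended detectability; 'x' in a rule matches anything."""
        for pattern, rating in RULES:
            if all(p in ("x", v) for p, v in zip(pattern, (dc, pd, scr, warn))):
                return rating
        raise ValueError("no matching rule")

    print(detectability("H", "H", "L", "L"))  # -> 5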

8.12. Calculate the RPN


A measure of criticality is the Risk Priority Number, or RPN. The probability of failure is usually a value between 1 and 10. The RPN is the product of the probability of occurrence, the severity and the detectability:

RPN = O*S*D

where:

O = Probability of occurrence
S = Severity
D = Detectability

Another definition of criticality is provided in MIL-STD-1629. In this case, the criticality is the product of the failure rate, the failure effect probability and the failure mode ratio, and is expressed as:

C = λβα

where:

C = Criticality
λ = Failure rate
β = Failure effect probability
α = Failure mode ratio

The failure rate is the rate of occurrence of failure, expressed in failures per million
cumulative operating hours, or in FITs (failures per billion operating hours). The failure
effect probability is the conditional probability that, if the failure mode occurs, the
severity level identified in the FMEA will be the result. The failure mode ratio is the
fraction of the failure rate that can be attributed to the specific failure mode under
analysis. In other words the sum of these probabilities for all failure modes of an item
will be 1.0.
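As a numeric illustration (the values are invented): for an item failure rate λ of 10 failures per million hours, a failure mode ratio α of 0.4, and a failure effect probability β of 0.5, the criticality of that mode and effect would be C = 10 × 0.4 × 0.5 = 2 failures per million hours.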
The same logic applies to the RPN methodology, in that the occurrence rating (O) is the
product of the probability of the failure cause occurring times the probability that the
failure cause will result in the identified effect.
Since severity is not included in this calculation, failure modes are usually sorted by
criticality for each severity level. This is done since a true measure of criticality must
include the severity of the failure mode.
The RPN methodology is generally the most commonly used in many industries. However,
in some cases, the criticality metric is more applicable. Such cases occur when the
system under analysis is complex, or when quantitative failure rate estimates are
available. These failure rate estimates are generally derived from reliability modeling, as
summarized in this book.

8.13. Determine Appropriate Corrective Action


The FMEA failure causes are then sorted by RPN, from highest to lowest. After the RPN
of each failure cause is identified, the team will identify the actions that should be taken
to mitigate the most important failure causes. These are the causes with the highest RPN
values.
An issue that needs to be addressed is the identification of a critical RPN value above
which corrective action should take place. The RPN is a qualitative measure of risk and,
therefore, there is not a single value. Usually, the number of failure causes that can be
addressed with corrective actions will be determined by the availability of resources and
the criticality or severity of failure. Some organizations state to their suppliers that RPNs
of 40, or 50, or greater shall be addressed. However, this value is somewhat arbitrary.
Also, in many cases, it is required that all failure causes with high severity be addressed,
regardless of their occurrence or detectability.

The other factor that determines the failure causes that are to be addressed is the Pareto
ranking of the RPNs. In other words, in some cases, there are a well-defined number of
causes that comprise the total risk to the system. This situation becomes evident in the
Pareto analysis of RPNs.
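As a sketch of this prioritization step (the causes, their ratings, and the threshold of 50 below are all illustrative; as noted above, such thresholds are somewhat arbitrary), the failure causes can be sorted by RPN and flagged against a corrective-action threshold:

    # (cause, O, S, D) -- illustrative entries only
    causes = [
        ("Connector fretting corrosion", 6, 5, 7),
        ("Capacitor dielectric breakdown", 4, 8, 3),
        ("Solder joint fatigue", 3, 8, 2),
    ]

    THRESHOLD = 50  # example corrective-action threshold; organization-specific

    # Sort from highest to lowest RPN to form the Pareto ranking
    ranked = sorted(causes, key=lambda c: c[1] * c[2] * c[3], reverse=True)
    for cause, o, s, d in ranked:
        rpn = o * s * d
        flag = "address" if rpn >= THRESHOLD else "monitor"
        print(f"RPN={rpn:4d}  {flag:8s}  {cause}")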
Corrective actions will generally fall into three categories:

First, the design can be modified such that the effect of failure is minimized,
thus effectively lowering the severity level of the failure effect. Options for
this include the addition of redundant elements or fault tolerance, the selection
of better materials, and/or the use of more robust components.

The second general option is to reduce the likelihood of the failure mode
occurring in the first place. This often can be achieved by the use of more
robust components. This robustness can be achieved with components of
higher quality levels or the ability to handle high stress levels. Another option
for reducing this likelihood is the control of environmental stresses and
reducing the stress to which the component is exposed.

The third general corrective action is to improve detectability. Many products will have failure modes for which the only viable corrective action is to make the failure mode detectable. A common example of this is digital circuitry. If redundancy is not an option, and higher quality components are not available, the failure mode can be made detectable through the use of alarms or built-in-test (BIT) features.

Examples of these corrective actions are shown in Figure 8.13-1.


Figure 8.13-1: Potential Corrective Actions (the figure shows three branches of corrective action: modify the design (fault tolerance, better materials), reduce the likelihood of failure (more robust components, reduce stress, control the environment), and improve detectability)

Potential corrective action possibilities include:

• Severity Reduction:
  o Add redundancy
  o Add a fail-safe feature
  o Use personal protection equipment (for safety-critical items)
• Occurrence Reduction:
  o Design out the cause
  o Reduce the rate of occurrence
• Detection Improvement:
  o Implement alarm features
  o Implement screening tests
  o Design more relevant tests to detect the failure cause
  o Develop better characterization methods


8.14. Update the RPN


The objective of the FMEA is to improve the reliability of the product or system under
analysis. Updating the FMEA with information that is learned during the analysis allows
the FMEA to be used as a means by which the reliability status of the item can be tracked
and improved. As failure causes are identified and eliminated, the reliability will
improve and the resulting RPN will decrease. This RPN value can be an effective means
for tracking the reliability growth of a product or system during both the design and development phases of the life cycle.

8.15. Using Quality Function Deployment to Feed the FMEA


Quality Function Deployment (QFD) analysis can provide valuable information in
support of an FMEA. The manner in which this can be done is illustrated in Figure 8.15-1. Note that it is assumed that the reader has knowledge of the QFD process, so those
details are not included in this book.

Figure 8.15-1: QFD-to-FMEA Links



The QFD elements are defined as:


1. Characteristics (or "Whats"): These are the high-level characteristics of the product that need to be achieved for the product's customer to be satisfied. The lack of these characteristics is synonymous with failure modes of the product, which in turn are failure effects of failure modes at the next lower level of the product hierarchy.
2. Importance (or "Ranking of Needs"): The QFD will generally include a rating of the importance of the characteristic (#1). The severity of the failure effects will then be proportional to this importance. The dimensions of this importance should include the dimensions of severity as described previously.
3. Measures (or "Hows"): The measures in the QFD pertain to the characteristics of the items comprising the design. These measures can be comprised of a hierarchical listing of the items and their critical characteristics or functions. The manner in which these characteristics can fail can be identified with the IPOUND analysis previously discussed, and they become the failure modes in the FMEA.
4. Relationships: The relationships matrix in the QFD identifies if the measure is related to the characteristic and, if so, whether it is a strong, medium or weak relationship. These relationships essentially identify the effects (i.e., the negative of the characteristic) that will occur if the failure modes (i.e., the failure modes of the measures) occur.
If the FMEA elements are obtained from the QFD in this manner, the first five columns
in the FMEA can be populated directly, as shown in Figure 8.15-2. These columns are
identified in bold in the above descriptions.

Figure 8.15-2: QFD-FMEA (these FMEA columns are populated directly from the QFD/IPOUND analysis)

Failure modes can be interpreted in several ways:

1. Inability to perform an intended function
2. Inability to meet customer expectations
Number 2 is a broader definition of failure, in that it encompasses whether a product or
system has features that customers want. Number 1 relates to whether a set of features
that are assumed to meet customer wants are capable of being sustained over the design
life of the product. If the QFD results are used as summarized above, then failure will
generally be defined as #2, since the QFD should encompass all customer expectations.

8.16. References
1. SAE J1739 (R) Potential Failure Mode and Effects Analysis in Design (Design
FMEA), Potential Failure Mode and Effects Analysis in Manufacturing and
Assembly Processes (Process FMEA)


9. Concluding Remarks

Reliability modeling has been used successfully as a reliability engineering tool for many
years. It is only one element of a well-structured reliability program and, to be effective,
it must be integrated into a complete reliability program. This book has reviewed options
an analyst has for developing a reliability model of a product or system, and has provided
guidance on applying the appropriate methodology based on the specific needs and
constraints of the analyst.
The premise of the holistic approach described in this book is that the reliability model is
a living model that needs to be continuously updated throughout the program
development and deployment phases. This approach to modeling consists of predictions,
assessments and estimation. Each of these is performed at specific points in the
development cycle and has different purposes and approaches. Reliability predictions are
performed very early, before there is any empirical data on the item under analysis.
Reliability assessments are made to determine the effects of certain factors on reliability,
and to identify and study specific failure causes. Reliability estimates are made based on
empirical data, and encompass all three elements.
A critical theme of this book has been that the purpose of a reliability model must be
clearly defined, and then an appropriate methodology should be chosen. Each model
need must be fully defined in terms of the customers being served (their roles,
educational background, requirements, etc.), the constraints placed upon that customer
(including legal and contractual, as well as technical and engineering), and the purpose of
the model (what decisions are being supported and in what manner).
A summary of recommendations for an analyst developing a product or system reliability
model are:
1. Clearly define the purpose and objectives of the model
2. Apply the appropriate methodologies in the appropriate program phase
3. Identify critical failure causes early in the program, as they will require the most
attention in terms of modeling
4. Fully leverage all available expertise in areas of design analysis, testing,
measurement, etc.
5. Use all available data and information, and be diligent about seeking needed data
6. Strategically perform tests to characterize critical failure causes
7. Engage suppliers and customers to maximize the consistency of models
throughout all system hierarchical levels

8. Use multiple modeling techniques, and work toward the goal of having them
reasonably agree with each other. In this manner, confidence in the results will be
greater.
9. Continuously update the model based on data that is obtained throughout all
program phases of the product or system life cycle
10. Identify and use available reliability software tools. These tools have become very cost effective and are readily available, making techniques that were impractical several decades ago easy to implement.
It is hoped that this book has provided the reader with a knowledge of approaches, tools,
and interpretations that will allow a better understanding of the usefulness and limitations
of various reliability modeling techniques. Given its stochastic nature, reliability
modeling is part science and part art, and there are many ways to approach it. But, if the
analyst keeps the goals in mind and uses common sense, there is a high probability that
the model will be successful in achieving its objectives.
