Documente Academic
Documente Profesional
Documente Cultură
Ea
= b e KT S n
r
1 CL =
k =0
(t )
k!
=e
r 1
t )
(
(t )r
+
1 + t + +
(r 1)! ( r)!
RIAC is a DoD Information Analysis Center sponsored by the Defense Technical Information Center. RIAC is operated by a
team of Wyle Laboratories, Quanterion Solutions, the University of Maryland, the Penn State University Applied Research
Laboratory and the State University of New York Institute of Technology.
The information and data contained herein have been compiled from
government and nongovernment technical reports and from material
supplied by various manufacturers and are intended to be used for reference
purposes. Neither the United States Government nor the Wyle Laboratories
contract team warrant the accuracy of this information and data. The user is
further cautioned that the data contained herein may not be used in lieu of
other contractually cited references and specifications.
Publication of this information is not an expression of the opinion of The
United States Government or of the Wyle Laboratories contract team as to
the quality or durability of any product mentioned herein and any use for
advertising or promotional purposes of this information in conjunction with
the name of The United States Government or the Wyle Laboratories
contract team without written permission is expressly prohibited.
ISBN-10: 1-933904-17-8
ISBN-13: 978-1-933904-17-7
ISBN-10: 1-933904-18-6
ISBN-13: 978-1-933904-18-4
(Hardcopy)
(Hardcopy)
(PDF Download)
(PDF Download)
Form Approved
OMB No. 0704-0188
Public reporting burden for this collection is estimated to average 1 hour per response including the time for reviewing instructions, searching existing data sources,
gathering and maintaining the data needed and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other
aspect of this collection of information, including suggestions for reducing this burden to Department of Defense, Washington Headquarters Services, Directorate for
Information Operations and Reports(0704-0188), 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302. Respondents should be aware that
notwithstanding any other provision of law, no person shall be subject to any penalty for failing to comply with a collection of information if it does not display a current
or valid OMB control number.
PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS.
1. REPORT DATE
31 May 2010
2. REPORT TYPE
Technical
6. AUTHORS
William Denson
RPAE
10. SPONSORING/MONITORS
ACRONYM(S)
Reliability Modeling
NPRD
Reliability Prediction
MIL-HDBK-217
Reliability Assessment
217Plus
17. LIMITATION
OF ABSTRACT
UNCLASSIFIED
Reliability Estimation
18. NUMBER
OF
PAGES
David Nicholls
a. REPORT
b. ABSTRACT
c. THIS PAGE
UNCLASSIFIED
UNCLASSIFIED
UNCLASSIFIED
UNLIMITED
410
315.351.4202
Standard Form 298 (Rev. 8/98)
Prescribed by ANSI Std. Z39.18
The Reliability Information Analysis Center (RIAC), formerly the Reliability Analysis Center (RAC),
is a Department of Defense Information Analysis Center sponsored by the Defense Technical
Information Center, managed by the Air Force Research Laboratory (formerly Rome Laboratory), and
operated by a team of Wyle Laboratories, Quanterion Solutions, the University of Maryland, the Penn
State University Applied Research Laboratory and the State University of New York Institute of
Technology. RIAC is chartered to collect, analyze and disseminate reliability, maintainability,
quality, supportability and interoperability (RMQSI) information pertaining to systems and products,
as well as the components used in them. The RIAC addresses both military and commercial
perspectives.
The data contained in the RIAC databases is collected on a continuous basis from a broad range of
sources, including testing laboratories, device and equipment manufacturers, government laboratories
and equipment users (government and industry). Automatic distribution lists, voluntary data
submittals and field failure reporting systems supplement an intensive data solicitation program.
Users of RIAC are encouraged to submit their RMQSI data to enhance these data collection efforts.
RIAC publishes documents for its users in a variety of formats and subject areas. While most are
intended to meet the needs of RMQSI practitioners, many are also targeted to managers and designers.
RIAC also offers RMQSI consulting, training and responses to technical and bibliographic inquiries.
REQUESTS FOR TECHNICAL ASSISTANCE
AND INFORMATION ON AVAILABLE RIAC
SERVICES AND PUBLICATIONS MAY BE
DIRECTED TO:
Reliability Information Analysis Center
100 Sherman Rd.
Suite C101
Utica, NY 13502-1348
General Information:(877) 363-RIAC
(877) 363-7422
Technical Inquiries: (315) 351-4200
Fax:
(315) 351-4209
E-Mail:
inquiry@theRIAC.org
Internet:
http://theRIAC.org
(315) 330-4857
587-4857
(315) 330-7647
Richard.Hyle@rl.af.mil
Copyright 2010 by Quanterion Solutions Incorporated. This handbook was developed by Quanterion
Solutions Incorporated, in support of the prime contractor (Wyle Laboratories) in the operation of the Department
of Defense Reliability Information Analysis Center (RIAC) under Contract HC1047-05-D-4005. The Government
has a fully paid up perpetual license for free use of and access to this publication and its contents among all the
DOD IACs in both hardcopy and electronic versions, without limitation on the number of users or servers. Subject
to the rights of the Government, this document (hardcopy and electronic versions) and the content contained within
it are protected by U.S. Copyright Law and may not be copied, automated, re-sold, or redistributed to multiple
users without the express written permission. The copyrighted work may not be made available on a server for use
by more than one person simultaneously without the express written permission. If automation of the technical
content for other than personal use, or for multiple simultaneous user access to a copyrighted work is desired,
please contact 877.363.RIAC (toll free) or 315.351.4202 for licensing information.
Table of Contents
1.
INTRODUCTION
1.1.
1.2.
1.3.
1.4.
1.5.
1.6.
Scope
BookOrganization
ReliabilityProgramElements
TheHistoryofReliabilityPrediction
Acronyms
References
2.
Page
1
2
5
7
11
17
18
GENERALASSESSMENTAPPROACH
DefineSystem
2.1.
2.2.
IdentifythePurposeoftheModel
2.3.
DeterminetheAppropriateLevelatWhichtoPerformtheModeling
2.3.1. Levelvs.DataNeeded
2.3.2. UsinganFMEAasthebasisforareliabilitymodel
2.3.3. ModelFormvs.Level
2.4.
AssessDataAvailable
2.5.
DetermineandExecuteAppropriateApproach
2.5.1. Empirical
2.5.1.1. Test
2.5.1.2. Field Data
2.5.2.
20
22
25
26
28
34
36
38
44
44
77
106
Physics
106
111
2.6.
CombineData
2.6.1. BayesianInference
2.7.
DevelopSystemModel
2.7.1. MonteCarloAnalysis
2.8.
References
3.
19
114
121
123
127
133
FUNDAMENTALCONCEPTS
135
ReliabilityTheoryConcepts
3.1.
3.2.
Probabilityconcepts
3.2.1. Covariance
3.2.2. CorrelationCoefficient
3.2.3. PermutationsandCombinations
3.2.4. MutualExclusivity
135
142
142
142
143
144
i
Table of Contents
Page
3.2.5. IndependentEvents
3.2.6. Nonindependent(Dependent)Events
3.2.7. Nonindependent(Dependent)Events:BayesTheorem
3.2.8. SystemModels
3.2.9. KoutofNConfigurations
3.3.
Distributions
3.3.1. Exponential
3.3.2. Weibull
3.3.3. Lognormal
3.4.
References
4.
DOEBASEDAPPROACHESTORELIABILITYMODELING
4.1.
4.2.
4.3.
4.4.
4.5.
4.6.
4.7.
4.8.
DeterminetheFeaturetobeAssessed
DetermineFactors
DeterminetheFactorLevels
DesigntheTests
PerformTestsandMeasurements
AnalyzetheData
DeveloptheLifeModel
References
5.
LIFEDATAMODELING
171
172
172
172
174
180
181
183
183
185
SelectingaDistribution
5.1.
5.2.
ParameterEstimationOverview
5.2.1. ClosedFormParameterApproximations
5.2.2. LeastSquaresRegression
5.2.3. ParameterEstimationUsingMLE
185
186
189
190
192
144
145
146
146
151
153
159
160
166
169
193
193
195
198
ConfidenceBoundsandUncertainty
198
199
206
207
5.3.1.1. Examples
208
ii
Table of Contents
Page
5.3.2. CombinedModels
5.3.3. CumulativeDamageModel
5.4.
MLEEquations
5.4.1. LikelihoodFunctions
5.5.
References
6.
210
214
216
217
221
INTERPRETATIONOFRELIABILITYESTIMATES
BathtubCurve
6.1.
6.2.
CommonCausevs.SpecialCause
6.3.
ConfidenceBounds
6.3.1. TraditionalTechniquesforConfidenceBounds
6.3.2. UncertaintyinReliabilityPredictionEstimates
6.4.
FailureRatevspdf
6.5.
PracticalAspectsofReliabilityAssessments
6.6.
Weibayes
6.7.
WeibullClosureProperty
6.8.
EstimatingEventRelatedReliability
6.9.
CombiningDifferentTypesofAssessmentsatDifferentLevels
6.10. EstimatingtheNumberofFailures
6.11. CalculationofEquivalentFailureRates
6.12. FailureRateUnits
6.13. FactorstobeConsideredWhenDevelopingModels
6.13.1.
CausesofElectronicSystemFailure
6.13.2.
SelectionofFactors
6.13.3.
ReliabilityGrowthofComponents
6.13.4.
Relativevs.AbsoluteHumidity
6.14. AddressingDatawithNoFailures
6.15. ReliabilityofComponentsUsedOutsideofTheirRating
6.16. References
7.
EXAMPLES
223
223
225
238
238
240
243
245
245
246
247
248
250
251
252
253
253
255
257
259
259
261
262
263
MILHDBK217ModelDevelopmentMethodology
7.1.
7.1.1. IdentifyPossibleVariables
7.1.2. DevelopTheoreticalModel
7.1.3. CollectandQCData
7.1.4. CorrelationCoefficientAnalysis
iii
264
266
266
267
268
Table of Contents
Page
7.1.5. StepwiseMultipleRegressionAnalysis
7.1.6. GoodnessofFitAnalysis
7.1.7. ExtremeCaseAnalysis
7.1.8. ModelValidation
7.2.
217PlusReliabilityPredictionModels
7.2.1. Background
7.2.2. SystemReliabilityPredictionModel
270
271
272
272
273
273
274
DevelopmentofComponentReliabilityModels
7.2.5.
292
292
294
294
295
296
296
298
298
299
301
303
PhotonicModelDevelopmentExample
7.2.4.1.
7.2.4.2.
7.2.4.3.
7.2.4.4.
7.2.4.5.
274
277
278
279
280
281
281
282
287
291
292
292
Introduction
Model development methodology and results
Uncertainty Analysis
Comments on Part Quality Levels
Explanation of Failure Rate Units
303
306
322
325
325
326
SystemLevelModel
326
iv
Table of Contents
7.2.5.2. 217Plus Process Grading Criteria
7.2.5.3. Design Process Grade Factor Questions
7.2.5.4. Manufacturing Process Grade Factor Questions
7.2.5.5. Part Quality Process Grade Factor Questions
7.2.5.6. System Management Process Grade Factor Questions
7.2.5.7. Can Not Duplicate (CND) Process Grade Factor Questions
7.2.5.8. Induced Process Grade Factor Questions
7.2.5.9. Wearout Process Grade Factor Questions
7.2.5.10. Growth Process Grade Factor Questions
7.3.
LifeModelingExample
7.3.1. Introduction
7.3.2. Approach
7.3.3. ReliabilityTestPlan
7.3.4. Results
350
350
350
350
352
352
354
7.4.
NPRDDescription
7.4.1. DataCollection
7.4.2. DataInterpretation
7.4.3. DocumentOverview
7.4.3.1.
7.4.3.2.
7.4.3.3.
7.4.3.4.
7.4.3.5.
7.4.3.6.
Prefix"
7.5.
8.
Page
328
330
336
340
342
346
347
348
349
357
358
361
366
References
THEUSEOFFMEAINRELIABILITYMODELING
Introduction
8.1.
8.2.
Definitions
8.3.
FMEALogistics
8.3.1. Wheninitiated
8.3.2. FMEATeam
8.3.3. FMEAFacilitation
377
377
381
383
383
383
384
Table of Contents
Page
8.3.4. Implementation
8.4.
HowtoPerformanFMEA
8.5.
IdentifySystemHierarchy
8.6.
FunctionAnalysis
8.7.
IPOUNDAnalysis
8.8.
IdentifytheSeverity
8.9.
IdentifythePossibleEffect(s)thatResultfromOccurrenceofEachFailureMode
8.10. IdentifyPotentialCausesofEachFailureMode
8.11. IdentifyFactorsforEachFailureCause
8.11.1.
AcceleratingStress(es)orPotentialTests
8.11.2.
Occurrence
398
8.11.3.
Preventions
8.11.4.
Detections
8.11.5.
Detectability
8.12. CalculatetheRPN
8.13. DetermineAppropriateCorrectiveAction
8.14. UpdatetheRPN
8.15. UsingQualityFunctionDeploymenttoFeedtheFMEA
8.16. References
9.
385
385
387
388
388
390
392
392
398
398
398
CONCLUDINGREMARKS
401
401
401
404
405
408
408
410
411
vi
List of Figures
Page
FIGURE1.11:PHASESOFARELIABILITYPROGRAM.....................................................................................2
FIGURE1.12:RELATIVECOSTOFFAILURESVS.PHASE................................................................................3
FIGURE1.13:RELIABILITYPREDICTION,ASSESSMENTANDESTIMATION....................................................4
FIGURE1.14:PERCENTOFCOMPANIESUSINGRELIABILITYENGINEERINGTOOLS.....................................5
FIGURE1.31:EXAMPLERELIABILITYPROGRAMAPPROACH........................................................................7
FIGURE2.01:GENERALMODELINGAPPROACH.........................................................................................20
FIGURE2.11:FAULTTREEREPRESENTATIONOFSYSTEMMODEL.............................................................21
FIGURE2.12:FAULTTREEREPRESENTATIONTOTHEFAILURECAUSELEVEL............................................21
FIGURE2.21:BREAKDOWNOFPOTENTIALRELIABILITYMODELINGPURPOSES.......................................23
FIGURE2.31:TYPICALDATAREQUIREMENTSVS.LEVELOFHIERARCHY...................................................27
FIGURE2.32:THEBASICFMEAAPPROACH.................................................................................................28
FIGURE2.33:HIERARCHICALRELATIONSHIPBETWEENCAUSE,MODEANDEFFECT.................................29
FIGURE2.34:APPROACHTOIDENTIFYINGCAUSES....................................................................................29
FIGURE2.35:FAULTTREEOFPRODUCTORSYSTEM.................................................................................32
FIGURE2.36:FAULTTREEOFPRODUCTORSYSTEMWITHCAUSEASTHELOWESTLEVEL.......................32
FIGURE2.37:FAULTTREEOFPRODUCTORSYSTEMWITHCAUSEABOVETHELOWESTLEVEL................33
FIGURE2.38:FAULTTREEOFPRODUCTORSYSTEMWITHCAUSETWOLEVELSABOVETHELOWEST
LEVEL...................................................................................................................................................33
FIGURE2.51:BREAKDOWNOFRELIABILITYASSESSMENTOPTIONS..........................................................38
FIGURE2.52:QUALIFICATIONCONCEPTSANDTERMINOLOGY..................................................................46
FIGURE2.53:EVT,DVTANDPVTRELATIONSHIPS.......................................................................................48
FIGURE2.54:ACCELERATIONLEVELS.........................................................................................................51
FIGURE2.55:UNCERTAINTYINEXTRAPOLATION......................................................................................52
FIGURE2.56:ACCELERATIONLEVELS.........................................................................................................53
FIGURE2.57:ACCELERATIONALTERNATIVES.............................................................................................53
FIGURE2.58:RELATIVELIFETIMEVS.STRESS.............................................................................................54
FIGURE2.59:RELIABILITYREQUIREMENTVS.SMALLPOPULATIONRELIABILITYINFERENCE...................60
FIGURE2.510:LIFEMODELINGMETHODOLOGY.......................................................................................62
FIGURE2.511:IDENTIFICATIONOFTESTSTRESSESBASEDONTHEFMEA.................................................64
FIGURE2.512:USINGTHEDESTRUCTLIMITTODEFINETHELIFETESTMAXSTRESS................................66
FIGURE2.513:POSSIBLESTRESSPROFILES................................................................................................67
FIGURE2.514:MEASUREMENTPOINTSFORANINFANTMORTALITYFAILURECAUSE..............................69
FIGURE2.515:MEASUREMENTPOINTSFORAWEAROUTFAILURECAUSE...............................................69
FIGURE2.516:ACCELERATIONWHENTHEDISTRIBUTIONSFORATLEASTTWOSTRESSESAREAVAILABLE
............................................................................................................................................................71
FIGURE2.517:ACCELERATIONWHENTHEDISTRIBUTIONSFORLOWSTRESSESARENOTAVAILABLE.....71
FIGURE2.518:LIFEMODELSEQUENCE.......................................................................................................72
FIGURE2.519DEGRADATIONMODELINGAPPROACH................................................................................75
FIGURE2.520:DEGRADATIONDATAEXAMPLE..........................................................................................76
FIGURE2.521:DEGRADATIONDATACONVERSIONTOTIMESTOFAILURE................................................77
FIGURE2.522:RELIABILITYESTIMATESFROMFIELDDATA........................................................................78
vii
List of Figures
Page
FIGURE2.523:FMEAASATOLLFORASSESSINGSIMILARITY.....................................................................81
FIGURE2.524:MILHDBK217PARTCOUNTEXAMPLE...............................................................................85
FIGURE2.525:MILHDBK217PARTSTRESSEXAMPLE...............................................................................86
FIGURE2.526:TELCORDIASR332(BELLCORE)...........................................................................................87
FIGURE2.527:RACPRISMREPLACEDBYRIAC217PLUS.............................................................................88
FIGURE2.528:CNET/RDF2000...................................................................................................................89
FIGURE2.529:CNET/RDF2000MODELEXAMPLE......................................................................................90
FIGURE2.530:FIDES....................................................................................................................................91
FIGURE2.531:USESOFPROGRAMDATAELEMENTS.................................................................................93
FIGURE2.532:PROGRAMDATABASESTRUCTURE.....................................................................................93
FIGURE2.533:DATABASEINFORMATIONFLOW........................................................................................95
FIGURE2.534:HIERARCHYOFMAINTENANCEACTIONS............................................................................97
FIGURE2.535:CALCULATIONOFPARTLIFEUNIT.....................................................................................100
FIGURE2.536:FAILURETIMESBASEDONOPERATINGTIME....................................................................101
FIGURE2.537:FAILURETIMESBASEDONCALENDARTIME.....................................................................102
FIGURE2.538:FAILURERATESIMULATIONWITHWEIBULLBETA=20....................................................103
FIGURE2.539:FAILURERATESIMULATIONWITHWEIBULLBETA=5.0...................................................103
FIGURE2.540:FAILURERATESIMULATIONWITHWEIBULLBETA=2.0...................................................104
FIGURE2.541:FAILURERATESIMULATIONWITHWEIBULLBETA=1.0...................................................104
FIGURE2.542:FAILURERATESIMULATIONWITHWEIBULLBETA=0.5...................................................105
FIGURE2.544:STRESS/STRENGTHINTERFERENCE...................................................................................108
FIGURE2.545:STRESS/STRENGTHINTERFERENCEVS.TIME....................................................................109
FIGURE2.61:217PLUSAPPROACHTOFAILURERATEESTIMATION.........................................................114
FIGURE2.63.BAYESIANINFERENCEOUTLINE..........................................................................................122
FIGURE2.71:COMBININGSEVENFAILURECAUSEDISTRIBUTIONS..........................................................125
FIGURE2.72:POSSIBLEFAULTTREEREPRESENTATIONOFASERIESRELIABILITYBLOCKDIAGRAM........126
FIGURE2.73:PDFOFNORMALDISTRIBUTIONWITHMEANOF10ANDSTANDARDDEVIATIONOF3....128
FIGURE2.74:CUMULATIVENORMALDISTRIBUTIONWITHMEANOF10ANDSTANDARDDEVIATIONOF3
..........................................................................................................................................................128
FIGURE2.75:VALUESELECTIONFROMADISTRIBUTION.........................................................................129
FIGURE2.76:VALUESELECTIONFROMAWEIBULLDISTRIBUTION..........................................................130
FIGURE2.77:RELIABILITYBLOCKDIAGRAMOFREDUNDANTEXAMPLE..................................................131
FIGURE2.78:SYSTEMMONTECARLOEXAMPLE.......................................................................................131
FIGURE2.79:MONTECARLOSIMULATIONOFEXAMPLESYSTEM...........................................................132
FIGURE3.11:DISCRETEPROBABILITYDISTRIBUTION...............................................................................135
FIGURE3.12:CONTINUOUSPROBABILITYDISTRIBUTION........................................................................136
FIGURE3.21:EXAMPLESOFCORRELATIONCOEFFICIENTS.......................................................................142
FIGURE3.22:VENNDIAGRAMOFMUTUALLYEXCLUSIVEEVENTS...........................................................144
FIGURE3.23:INDEPENDENTEVENTS........................................................................................................145
FIGURE3.24:FAULTTREEORGATE..........................................................................................................147
FIGURE3.25:RELIABILITYBLOCKDIAGRAMFORANORGATE.................................................................147
FIGURE3.26:FAULTTREEANDGATE........................................................................................................148
viii
List of Figures
Page
FIGURE3.27:RELIABILITYBLOCKDIAGRAMFORANANDGATE..............................................................149
FIGURE3.28:FAULTTREEOFANAND/ORCOMBINATION.......................................................................150
FIGURE3.29:RBDOFAND/ORCOMBINATION.........................................................................................150
FIGURE3.31:SHAPESOFFAILUREDENSITYANDRELIABILITYFUNCTIONSOFCOMMONLYUSEDDISCRETE
DISTRIBUTIONS(FROMMILHDBK338B).........................................................................................157
FIGURE3.32:SHAPESOFFAILUREDENSITY,RELIABILITYANDHAZARDRATEFUNCTIONSFORCOMMONLY
USEDCONTINUOUSDISTRIBUTIONS(FROMMILHDBK338B)........................................................158
FIGURE3.33:EXAMPLEPDFPLOTSFORTHEWEIBULLDISTRIBUTION....................................................164
FIGURE3.34:EXAMPLEHAZARDRATEPLOTSFORTHEWEIBULLDISTRIBUTION....................................164
FIGURE3.35:EXAMPLEPROBABILITYPLOTSFORWEIBULLDISTRIBUTION.............................................165
FIGURE3.36:EXAMPLEPDFPLOTSFORTHELOGNORMALDISTRIBUTION..............................................167
FIGURE3.37:EXAMPLEHAZARDRATEPLOTSFORTHELOGNORMALDISTRIBUTION..............................168
FIGURE3.38:EXAMPLEPROBABILITYPLOTSFORTHELOGNORMALDISTRIBUTION...............................168
FIGURE4.01:THEDOECONCEPT..............................................................................................................171
FIGURE4.31:POSSIBLERESPONSEFACTORLEVELRELATIONSHIP...........................................................173
FIGURE4.41:DOETERMINOLOGY............................................................................................................174
FIGURE4.42:ONEFACTORATATIMEEXPERIMENTS.............................................................................176
FIGURE4.43:STANDARDDOENOMENCLATURE......................................................................................177
FIGURE4.44:POTENTIALINTERACTIONS..................................................................................................178
FIGURE4.61:ANALYSISOFMEANS...........................................................................................................182
FIGURE4.62:LINEARIZATIONOFTHEARRHENIUSRELATIONSHIP...........................................................182
FIGURE4.63:OPTIMALFACTORSETTINGS................................................................................................183
FIGURE5.41:LIKELIHOODCONTOUREXAMPLE........................................................................................220
FIGURE6.11:BATHTUBCURVE.................................................................................................................223
FIGURE6.21:EXAMPLEOFNONMONOMODALDISTRIBUTION..............................................................228
FIGURE6.22:MULTIMODALDISTRIBUTIONEXAMPLE1...........................................................................229
FIGURE6.23:MULTIMODALDISTRIBUTIONEXAMPLE2...........................................................................230
FIGURE6.24:MULTIMODALDISTRIBUTIONEXAMPLE3...........................................................................231
FIGURE6.25:MULTIMODALDISTRIBUTIONEXAMPLE4...........................................................................232
FIGURE6.26:MULTIMODALDISTRIBUTIONEXAMPLE5...........................................................................233
FIGURE6.27:MULTIMODALDISTRIBUTIONEXAMPLEOFPOOLEDDATASET.........................................234
FIGURE6.28:AGEATDEATHDATA...........................................................................................................235
FIGURE6.29:PDFOFMULTIMODEDISTRIBUTIONOFAGES....................................................................236
FIGURE6.210:FAILURERATEOFAGEDATA.............................................................................................236
FIGURE6.211:PROBABILITYPLOTOFAGEDATA......................................................................................237
FIGURE6.212:SINGLEMODEWEIBULLFITTOTHEAGEDATA.................................................................238
FIGURE6.31:SOURCESOFERRORINEMPIRICALMODELS.......................................................................241
FIGURE6.32:CONFIDENCELEVELTHROUGHPREDICTION,ASSESSMENTANDESTIMATION..................243
FIGURE6.61:WEIBAYESEXAMPLE............................................................................................................246
FIGURE6.131:NOMINALFAILURECAUSEDISTRIBUTIONOFELECTRONICSYSTEMS...............................254
ix
List of Figures
Page
FIGURE6.132:IPOMODEL........................................................................................................................256
FIGURE6.133:RELATIONSHIPBETWEENABSOLUTEANDRELATIVEHUMIDITY.......................................259
FIGURE6.141:ESTIMATEDUPPERBOUNDFAILURERATESVSOPERATINGTIMEAT60AND90%
CONFIDENCE.....................................................................................................................................260
FIGURE7.11:MILHDBK217MODELDEVELOPMENTMETHODOLOGY...................................................265
FIGURE7.21:FAILURECAUSEDISTRIBUTIONOFELECTRONICSYSTEMS..................................................275
FIGURE7.22:OPTICALAMPLIFIERFAILURECAUSEDISTRIBUTION...........................................................277
FIGURE7.23:GVS.TIMEANDGROWTHRATES.....................................................................................291
FIGURE7.24:MODELDEVELOPMENTMETHODOLOGYFLOWCHART......................................................306
FIGURE7.25:DISTRIBUTIONOFLOG10PREDICTED/OBSERVEDFAILURERATERATIOFORALLDATA....323
FIGURE7.26:DISTRIBUTIONOFLOG10PREDICTED/OBSERVEDRATIOFORFIELDDATAONLY...............324
FIGURE7.27:DISTRIBUTIONSOFTHEPREDICTED/OBSERVEDFAILURERATERATIOFORALLDATAAND
FORFIELDDATAONLY......................................................................................................................324
FIGURE7.31:TIMESTOFAILUREDISTRIBUTIONS.....................................................................................354
FIGURE7.32:PROBABILITYOFFAILUREVS.TEMPERATUREANDRELATIVEHUMIDITYAT50,000HOURS
..........................................................................................................................................................357
FIGURE7.41:APPARENTFAILURERATEFORREPLACEMENTUPONFAILURE...........................................362
FIGURE7.43:EXAMPLEOFPARTDETAILENTRIES...................................................................................374
FIGURE8.11:TWOBASICTYPESOFFMEA................................................................................................378
FIGURE8.41:FMEAPROCESSFLOW.........................................................................................................386
FIGURE8.71:FAILURECAUSEMODEEFFECTRELATIONSHIP...................................................................390
FIGURE8.101:FAILURECAUSE,MODEANDEFFECTHIERARCHY.............................................................393
FIGURE8.102:FAILURECAUSES................................................................................................................395
FIGURE8.111:OCCURRENCEDEFINITIONS...............................................................................................399
FIGURE8.112:OCCURRENCEGUIDELINES................................................................................................400
FIGURE8.113:DETECTABILITYDEFINITIONS.............................................................................................402
FIGURE8.114:LIFECYCLEVSDETECTABILITYDIMENSION.......................................................................403
FIGURE8.131:POTENTIALCORRECTIVEACTIONS....................................................................................407
FIGURE8.151:QFDTOFMEALINKS.........................................................................................................408
FIGURE8.152:QFDFMEA.........................................................................................................................410
List of Tables
Page
TABLE1.31:RANGESOFPOTENTIALCUSTOMERREACTIONS......................................................................8
TABLE2.21:RELIABILITYASSESSMENTPURPOSES.....................................................................................24
TABLE2.22:PROGRAMPHASEVS.RELIABILITYASSESSMENTPURPOSE...................................................25
TABLE2.31:EXAMPLESOFINITIALCONDITIONS,STRESSESANDMECHANISMS......................................30
TABLE2.32:RELATIONSHIPBETWEENCAUSE,MODEANDEFFECT...........................................................31
TABLE2.51:SUMMARYOFRELIABILITYASSESSMENTOPTIONS...............................................................39
TABLE2.51:SUMMARYOFASSESSMENTOPTIONS(CONTINUED)............................................................40
TABLE2.52:RELEVANCYOFAPPROACHTOPREDICTION,ASSESSMENTANDESTIMATION.......................41
TABLE2.53:IDENTIFICATIONOFAPPROPRIATEAPPROACHESBASEDONTHEPURPOSE.........................43
TABLE2.54:RANKINGTHEATTRIBUTESOFEMPIRICALDATA...................................................................44
TABLE2.55:EVT,DVTANDPVTPURPOSEANDAPPROACH.......................................................................47
TABLE2.56:RELIABILITYDEMONSTRATIONEXAMPLE...............................................................................50
TABLE2.57:EXAMPLEOFAQUALIFICATIONPLANFORANASSEMBLY.....................................................57
TABLE2.58:QUALIFICATIONEXAMPLEFORALASERDIODE.....................................................................58
TABLE2.59:STRESSPROFILEOPTIONADVANTAGESANDDISADVANTAGES.............................................68
TABLE2.510:SIMILARITYANALYSIS............................................................................................................80
TABLE2.511:DIGITALCIRCUITBOARDFAILURERATES(INFAILURESPERMILLIONPARTHOURS)...........83
TABLE2.512:TESTCONDITIONS...............................................................................................................111
TABLE2.513:DATATOESTIMATEDIFFUSIONRATE.................................................................................112
TABLE2.514:PREDICTEDLIFETIMESVS.OBSERVED.................................................................................113
TABLE3.11:PROBABILITYDISTRIBUTIONNOTATION&MATHEMATICALREPRESENTATIONS...............141
TABLE3.21:COMBINATIONSEXAMPLE....................................................................................................143
TABLE3.22:COMBINATIONSOFANORCONFIGURATION.......................................................................147
TABLE3.23:COMBINATIONSOFANANDCONFIGURATION.....................................................................149
TABLE3.24:EXAMPLEOFKOUTOFNPROBABILITYCALCULATIONS...................................................151
TABLE3.25:EXAMPLEOF2OUTOF3REQUIREDFORSUCCESS..........................................................152
TABLE3.31:PROBABILITYDISTRIBUTIONSAPPLICABLETORELIABILITYENGINEERING..........................154
TABLE3.32:EXPONENTIALDISTRIBUTIONPARAMETERS........................................................................160
TABLE3.33:CONFUSINGTERMINOLOGYOFTHEWEIBULLDISTRIBUTION.............................................162
TABLE3.34:WEIBULLDISTRIBUTIONPARAMETERS................................................................................163
TABLE4.31:POSSIBLECONCLUSIONSFORANONLINEARRESPONSEFACTORRELATIONSHIP...............173
TABLE4.41:FULLFACTORIALEXAMPLE....................................................................................................175
TABLE4.42:FULLANDHALFFACTORIALEXAMPLEFORCORROSION......................................................179
TABLE5.21:TERMINOLOGYUSEDINPARAMETERESTIMATION.............................................................187
TABLE5.22:TECHNIQUESFORPARAMETERESTIMATION.......................................................................188
TABLE5.23:PARAMETERSTYPICALLYESTIMATEDFROMSTATISTICALDISTRIBUTIONS.........................189
TABLE5.24:CONFIDENCEBOUNDSFORTHEPOISSONDISTRIBUTION...................................................200
TABLE5.25:CONFIDENCEBOUNDSFORTHEBINOMIALDISTRIBUTION.................................................201
TABLE5.26:CONFIDENCEBOUNDSFORTHEEXPONENTIALDISTRIBUTION...........................................202
TABLE5.28:CONFIDENCEBOUNDSFORTHENORMALDISTRIBUTION...................................................203
TABLE5.310:CONFIDENCEBOUNDSFORTHEWEIBULLDISTRIBUTION.................................................205
xi
List of Tables
Page
TABLE6.11:CATEGORIESOFFAILUREEFFECTS........................................................................................227
TABLE6.22:BIMODALPOPULATIONEXAMPLE1......................................................................................229
TABLE6.23:BIMODALPOPULATIONEXAMPLE2......................................................................................230
TABLE6.14:BIMODALPOPULATIONEXAMPLE3......................................................................................231
TABLE6.15:BIMODALPOPULATIONEXAMPLE4......................................................................................232
TABLE6.16:BIMODALPOPULATIONEXAMPLE5......................................................................................233
TABLE6.17:FOURMODEWEIBULLDISTRIBUTIONPARAMETERS............................................................235
TABLE6.31:FAILURERATEUNCERTAINTYLEVELMULTIPLIERS................................................................242
TABLE6.91:EXAMPLEOFCOMBINGDIFFERENTTYPESOFMODELS........................................................248
TABLE6.131:FACTORSTOBECONSIDEREDINARELIABILITYMODEL.....................................................256
TABLE6.132:FAILURERATEDATASUMMARY.........................................................................................258
TABLE7.11:DATACOLLECTEDFORMODELDEVELOPMENT....................................................................269
TABLE7.12:DATATRANSFORMS..............................................................................................................270
TABLE7.13:REGRESSIONDATAINCLUDINGCATEGORICALVARIABLES...................................................271
TABLE7.21:UNCERTAINTYLEVELMULTIPLIER.........................................................................................282
TABLE7.22:PERCENTAGEOFFAILURESATTRIBUTABLETOEACHFAILURECAUSE..................................283
TABLE7.23:WEIBULLPARAMETERSFORFAILURECAUSEPERCENTAGES................................................283
TABLE7.24:MULTIPLIERSASAFUNCTIONOFPROCESSGRADE.............................................................284
TABLE7.25:EXAMPLEOFFAILUREMODETOFAILURECAUSECATEGORYMAPPING.............................295
TABLE7.26:CAPACITORPARAMETERS.....................................................................................................301
TABLE7.27:DEFAULTENVIRONMENTALSTRESSVALUES........................................................................302
TABLE7.28:DEFAULTOPERATINGPROFILEVALUES.................................................................................303
TABLE7.29:FAILURECAUSESUMMARYFORCONNECTORS....................................................................308
TABLE7.210:FAILUREMODETOFAILURECAUSECATEGORYFORCONNECTORS(SCANDFC)..............309
TABLE7.211:FAILURECAUSEPERCENTAGESFORCONNECTORS.............................................................311
TABLE7.212:DATACOLLECTEDFORCONNECTORS..................................................................................312
TABLE7.213:CATEGORIESOFACCELERATIONMODELPARAMETERS......................................................315
TABLE7.214:ACCELERATIONMODELPARAMETERS................................................................................315
TABLE7.215:DEFAULTMODELPARAMETERS..........................................................................................316
TABLE7.216:SUMMARYOFPIFACTORCALCULATIONS..........................................................................317
TABLE7.217:APPLICABILITYOFTESTDATA..............................................................................................318
TABLE7.218:BASEFAILURERATES(FAILURESPERMILLIONCALENDARHOURS)....................................319
TABLE7.219:PARTQUALITYPROCESSGRADEFACTORQUESTIONSFORPHOTONICDEVICEMODELS..320
TABLE7.220:SUMMARYOFUNCERTAINTYMETRICS...............................................................................323
TABLE7.221:PARAMETERSFORTHEPROCESSGRADEFACTORS.............................................................327
TABLE7.222.INDEXOFPROCESSGRADETYPEQUESTIONS....................................................................328
TABLE7.223:DESIGNPROCESSGRADEFACTORQUESTIONS..................................................................330
TABLE7.224:MANUFACTURINGPROCESSGRADEFACTORQUESTIONS.................................................336
TABLE7.225:PARTQUALITYPROCESSGRADEFACTORQUESTIONS.......................................................340
TABLE7.226:SYSTEMMANAGEMENTPROCESSGRADEFACTORQUESTIONS........................................342
TABLE7.227:CANNOTDUPLICATE(CND)PROCESSGRADEFACTORQUESTIONS..................................346
TABLE7.228:INDUCEDPROCESSGRADEFACTORQUESTIONS...............................................................347
xii
List of Tables
Page
TABLE7.229:WEAROUTPROCESSGRADEFACTORQUESTIONS.............................................................348
TABLE7.230:GROWTHPROCESSGRADEFACTORQUESTIONS...............................................................349
TABLE7.31:PARAMETERLEVELS..............................................................................................................350
TABLE7.32:TESTPLANSUMMARY...........................................................................................................351
TABLE7.33:LIFETESTRESULTS.................................................................................................................352
TABLE7.34:TIMESTOFAILUREDISTRIBUTIONPARAMETERS..................................................................353
TABLE7.35:ESTIMATEDPARAMETER80%2SIDEDCONFIDENCEBOUNDS............................................356
TABLE7.41:DATASUMMARIZATIONPROCESS........................................................................................359
TABLE7.42:TIMEATWHICHASYMPTOTICVALUEISREACHED...............................................................363
TABLE7.43/MTTFRATIOASAFUNCTIONOF.....................................................................................363
TABLE7.44:PERCENTFAILUREFORWEIBULLDISTRIBUTION...................................................................364
TABLE7.45:FIELDDESCRIPTIONS.............................................................................................................367
TABLE7.46:APPLICATIONENVIRONMENTSDEFINEDINNPRD...............................................................368
TABLE8.71:FAILUREMODERELATIONSHIPTOTAGUCHILOSSFUNCTION.............................................389
TABLE8.81:DIMENSIONSOFFUNCTIONALSEVERITY..............................................................................391
TABLE8.82:DIMENSIONSOFSEVERITY....................................................................................................392
TABLE8.111:CATEGORIESOFFAILUREEFFECTS......................................................................................401
TABLE8.112:RECOMMENDEDDETECTABILITYRATINGCRITERIA............................................................404
xiii
xiv
Chapter 1: Introduction
1.
Introduction
Few engineering techniques have caused as much controversy in the last several decades
as the topic of reliability prediction. One of the primary reasons for this is the stochastic
nature of reliability. Whereas many engineering disciplines are governed by
deterministic processes, reliability is governed by a complex interaction of stochastic
processes. As a result, the metrics of interest in other engineering disciplines are
generally much more quantifiable by their very nature. While there is always a stochastic
element in any engineering model, the topic of reliability quantification must address its
extreme stochastic nature.
Many highly respected reliability engineering texts treat the topic of reliability modeling
thoroughly and in great detail. Included in these texts are detailed ways to model system
reliability using techniques like Failure Modes and Effects Analysis (FMEA), Fault Tree
Analysis (FTA), Markov models, fault tolerant design techniques, etc. The techniques
that are addressed in detail in these texts often gloss over a fundamental requirement in
order to effectively utilize these techniques, i.e., the ability to quantify the reliability of
the constituent components and subsystems comprising the system.
The intent of this book is to provide guidance on reliability modeling techniques that can
be used to quantify the reliability of a product or system. In this context, reliability
modeling is the process of constructing a mathematical model that is used to estimate the
reliability characteristics of an item. There are many ways in which this can be
accomplished, depending on the item and the type of information that is available to, or
practical to obtain by, the analyst. This book will review possible approaches, summarize
their advantages and disadvantages, and provide guidance on selecting a methodology
based on specific goals and constraints. While this book will not discuss the use of
specific published methodologies, in cases where examples are provided, tools and
methodologies with which the author has personal experience in their development are
used, such as life modeling, NPRD, MIL-HDBK-217 and 217Plus.
The Reliability Information Analysis Center (RIAC) has prepared many documents in the
past relating to many different reliability engineering techniques, such as FMEA, FTA,
Worst Case Analysis (WCA), etc. However, one noteworthy omission from this list is
reliability modeling. This, coupled with (1) the RIACs history of providing reliability
modeling data and solutions, and (2) the need to objectively address some of the
confusion and misconceptions related to this topic, formed the inspiration for this book.
Chapter 1: Introduction
In years past, DoD contracts would require specific reliability prediction methodologies,
usually MIL-HDBK-217, be used. This resulted in system developers having very little
flexibility in applying different reliability prediction practices. Since the DoD has not,
until very recently, supported updates to MIL-HDBK-217, companies were encouraged
to use best practices in quantifying product reliability. The difficult question to be
addressed is what are the best practices that should be used? This book attempts to
provide guidance on selecting an appropriate methodology based on the specific
conditions and constraints of the company and its products or systems.
It is hoped that the authors experience gained by attempting many different reliability
assessment approaches, including physics and empirical approaches, can be used to the
advantage of the reader in a practical way.
1.1. Scope
The intent of a reliability program is to identify and mitigate failure modes/mechanisms,
verify their removal through reliability testing, implement corrective actions for
discovered failures, and maintain reliability levels after reliability has been designed in.
These correspond to the designing-in reliability, reliability growth and ensuring on-going
reliability goals, respectively, as illustrated in Figure 1.1-1.
Chapter 1: Introduction
Chapter 1: Introduction
Predictions are performed very early, before there is any empirical data on the item under
analysis. Reliability assessments are made to determine the affects of certain factors on
reliability and to identify failure causes. Reliability estimates are made based on
empirical data. This book covers all three areas, as illustrated in Figure 1.1-3.
Chapter 1: Introduction
Chapter 1: Introduction
These examples are provided to give the reader a better appreciation for the tools,
techniques and limitations of various approaches to reliability modeling.
A discussion of FMEA is presented in Chapter 8. Although FMEA is secondary to the
primary intent of this book, it can form the basis for many elements of a reliability
program, including reliability modeling. Therefore, Chapter 8 is intended to present
FMEA concepts in this context, as well as provide practical information on performing
FMEAs that this author has found to be useful.
Chapter 1: Introduction
Chapter 1: Introduction
What is the required failure rate of the item in its useful life?
What is the service life required?
What criteria will be used to determine when the requirements are not met?
Whose responsibility will it be to take corrective action if these requirements are
not met?
What are the operating and environmental profiles expected in field deployed
conditions?
Worst
Field reliability
No failures
Failures occur at an acceptable rate
Recurring failures, but on a relatively small percent of
items
Recurring failures on a high percent of items
An unexpected failure mechanism is discovered that
will affect the entire population, or critical safety
related failures
Chapter 1: Introduction
If the requirement is not specified, an estimate of the requirement must be made so that
there is a goal that can be used in the development process.
2. Initial Design: After the product requirements are understood, the design team
generally derives an initial, or preliminary, design for the product or system. Inputs to
this initial design should be in the form of design rules and a Standard parts list. Design
rules are the culmination of lessons learned from previous development activities, from
both empirical field or test data, and from analysis. These design rules should be a living
document which is continuously updated based on current information. Effective use of
design rules also saves much effort since reliability attributes which have a reliability
history or which have been previously studied do not need to be addressed in detail, thus
saving resources to be applied to the study of critical parts.
3. Similarity analysis: Once an initial design is available, a similarity analysis can be
performed to identify attributes which are similar to those for which a reliability history is
available, and those for which it is not. A FMEA can be a valuable technique for this
analysis, and will be discussed later. In this analysis, each reliability attribute identified
in the FMEA is reviewed to determine if a reliability history exists or not.
4. Identify attributes that are similar: Similar attributes are those that have a reliability
history
5. Assess robustness of attribute: If the part or attribute does have a history, previous test
data or field experience data can be used to assess the robustness of the part or attribute.
6. Identify attributes that are not similar: Attributes that are not similar do not have a
reliability history.
7. Perform design analysis: Although any attribute that is potentially different in the new
design relative to the previous design must be analyzed, particular attention is given to
the attributes that are not similar. Design techniques that are used for this purpose are
FMEA, tolerance or worst case analysis, thermal analysis, stress analysis, and reliability
predictions.
8. Implement corrective action: From the results of the design analysis, corrective action
should be taken to improve the robustness of the design.
9. Identify critical parts/materials: Based on the results of the analysis, critical parts or
materials are identified.
Reliability Information Analysis Center
9
Chapter 1: Introduction
10. Model critical parts/materials: Once critical parts are identified, action must be taken
to ensure that the parts or materials are robust enough to meet the reliability and
durability requirements. More details of the approach used for this purpose will be
presented later in the book.
11. Identify effective tests for non-similar attributes: Based on the identification of
critical parts and the design analysis that was performed, specific tests that will assess the
reliability and durability of the attribute can be determined. Part of the FMEA should
include identification of stresses that will accelerate the attribute under analysis and
therefore, this analysis is important for identifying the appropriate stress tests.
12. Develop a test plan and execute tests: Based on the design analysis performed and
the identification of tests for non-similar attributes, a test plan can be determined. In the
context of this approach, the goal of these tests is to assess the robustness of the product
by subjecting the product to test stresses that are intended to accelerate the critical parts
and non-similar attributes to failure. In addition to these tests, other test requirements
should be incorporated into this test plan. These additional test requirements include any
tests required by the customer, such as qualification or reliability demonstration tests.
13. Document the test results: Once the tests have been performed and the data analyzed,
the results should be fully documented, since they subsequently will be used for a variety
of purposes.
14. Monitor field reliability: Once the product is deployed, field reliability experience
data should be carefully gathered, since it will be used for a variety of purposes.
Elements of the data to be gathered include:
1. Product or system deployment history by serial number, including when
deployed, when fielded
2. Failure information, including failure date, root failure cause, results of failure
analysis
3. Product or system re-deployment information
15. Update reliability database: A database is required to manage the reliability data, and
should include both test data and field data. This data can be used to generate a
company-specific reliability prediction methodology.
Chapter 1: Introduction
16. Update Design Rules: Data acquired from tests and field surveillance should be used
to update the design rules. Field data is probably the most valuable type of data for this
purpose since it represents the actual product or system in the intended use environment.
The process of maintaining design rules and ensuring that they are used in new designs is
the cornerstone of the means by which reliability is improved in a reliability growth
process.
Critical parts are those which may result in a significant risk to the project. This risk can
be related to reliability, lifetime, availability, or maintainability. Some of the factors that
constitute critical parts are:
These critical parts or items warrant additional attention in assessing their reliability, as
they generally will represent the greatest reliability risk.
Chapter 1: Introduction
In addition to these accomplishments, the 50s also included pioneering work in the area
of quantitative reliability prediction. In 1956, RCA released TR-1100, Reliability Stress
Analysis for Electronic Equipment, which presented mathematical models for the
estimation of component failure rates. This report turned out to be the predecessor of
MIL-HDBK-217.
Several additional early works in the area of reliability prediction were produced in the
early 1960s, including D.R. Erles report (Reference 2) and the Erles and Edins paper
(Reference 3). In 1962, the first version of MIL-HDBK-217 was published by the Navy.
Once issued, MIL HDBK-217 quickly became the standard by which reliability
predictions were performed, and other sources of failure rates gradually disappeared.
Part of the reason for the demise of other sources was the fact that MIL-HDBK-217 was
often a contractually cited document and defense contractors did not have the option of
using other sources of data.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
12
Chapter 1: Introduction
These early sources of failure rates also often included design guidance on the reliable
application on electronic components. However, subsequent versions of the documents,
primarily MIL-HDBK-217, would delete the application information because it was
treated in more detail elsewhere.
By now, the reliability discipline was working under the tenet that reliability was a
quantitative discipline that needed quantitative data sources to support its many
statistically based techniques, such as allocations and redundancy modeling. However,
another branch of the reliability discipline focused on the physical processes by which
components were failing. The first symposium devoted to this topic was the Physics of
Failure In Electronics Symposium sponsored by the Rome Air Development Center
(RADC) and IIT Research Institute (IITRI) in 19621. This symposium later became
known as the International Reliability Physics Symposium (IRPS). In this period of time,
the two branches of reliability engineering seemed to be diverging, with the systems
engineers devoted to the tasks of specifying, allocating, predicting and demonstrating
reliability, while the physics-of-failure (PoF) engineers and scientists were devoting their
efforts to identifying and modeling the physical causes of failure. Both branches were
integral parts of the reliability discipline, and both were hosted at RADC (later to become
Rome Laboratory). The physics-based information was necessary to develop part
qualification, screening and application requirements, and the systems tasks of
specifying, allocating, predicting and demonstrating reliability were necessary to insure
that reliability requirements were met. The component research efforts of the 1950s and
1960s culminated with the implementation of the ER and TX families of
specifications. This complicated the issue of predicting their reliability because there
were now many different combinations of quality levels and environments that needed to
be addressed in MIL-HDBK-217.
In the early 1970s, the responsibility for preparing MIL-HDBK-217 was transferred to
RADC, who published revision B in 1974. However, other than the transition to RADC,
the 1970s maintained the status quo in the area of reliability prediction. MIL-HDBK217 was updated to reflect the technology at that time, but there were few other efforts
that changed the manner in which predictions were performed. One exception, however,
was that there was a shift in the complexity of the models being developed for MILHDBK-217. There were several efforts to develop new and innovative models for
reliability prediction. The results of these efforts were extremely complex models that
may have been technically sound, but were criticized by the user community as being too
1
IITRI was the original contractor of the Reliability Analysis Center (RAC). In 2005, the RAC contract was awarded as RIAC to the current team of
Wyle Labs (prime), Quanterion Solutions Incorporated, the University of Maryland Center for Risk and Reliability, the Pennsylvania State Applied
Research Laboratory (ARL), and the State University of New York Institute of Technology (SUNYIT)
Chapter 1: Introduction
complex, too costly, and unrealistic given the low level of detailed design information
available at the point in time when the models were needed. RCA, under contract to
RADC, had developed PoF-based models which were rejected as unusable, since the
detailed design and construction data for microcircuits were simply unavailable to typical
model users. These models were never incorporated into MIL-HDBK-217.
While MIL-HDBK-217 was updated again several times in the 1980s, there were
agencies that were developing reliability prediction models unique to their industries. As
an example, the automotive industry, under the auspices of the Society of Automotive
Engineers (SAE) Reliability Standards Committee, developed a series of models specific
to automotive electronics. The SAE committee felt that there was no existing prediction
methodologies that were applicable to the specific quality levels and environments of
automotive applications. The Bellcore reliability prediction standard is another example
of a specific industry developing methodologies for their unique conditions and
equipment. It originally was developed by modifying MIL-HDBK-217 to better reflect
the conditions of interest of the telecommunications industry. It has since taken on its
own identity with models derived from telecommunications equipment and is now used
widely within that industry.
The 1980s also saw explosive growth in integrated circuit technology. Very dense
circuits were being fabricated using feature sizes as small as 0.5 microns. This presented
unique challenges to reliability modelers. The VHSIC (Very High Speed Integrated
Circuit) program was the governments attempt to leverage from the technological
advancements of the commercial industry and, at the same time, produce circuits capable
of meeting the unique requirements of military applications. From the VHSIC program
came the Qualified Manufacturers List (QML) - a qualification methodology that
qualified an integrated circuit manufacturing line, unlike the traditional qualification of
specific parts. The government realized that it needed a QML-like process if it were to
leverage from the advancements in commercial technologies and, at the same time, have
a timely and effective qualification scheme for military parts. A reliability prediction
model was also developed for VHSIC devices in 1989 (Reference 9) in support of a MILHDBK-217 update. An interesting observation was made during that study that deviated
from the premise on which most of the MIL-HDBK-217 models were based. The
traditional approach to developing models was to collect as much field failure rate data as
possible, statistically analyze it, and quantify model factors based on the results of the
statistical analysis. For integrated circuits, one of the factors that was quantified was
inevitably device complexity. This complexity was measured by the number of gates or
transistors and was the primary factor on which the models were based. The correlation
between failure rate and complexity was strong and could be quantified because the
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
14
Chapter 1: Introduction
failure rate of circuits was much higher than they are today and the defect rate was
directly proportional to the complexity. As technology has advanced, the gate or
transistor count became so high that it could no longer effectively be used as the measure
of complexity in a reliability model. Furthermore, transistor or gate count data was often
difficult or impossible to obtain. Therefore, the model developed for VHSIC
microcircuits needed another measure of complexity on which to base the model. The
best measures, and the ones most highly correlated to reliability are defect density and
silicon area. It can be shown that the failure rate (for small cumulative percent failure) is
directly proportional to the product of the area and defect density. However, another
factor that is highly correlated to defect density and area is the yield of the die, or the
percent of die that are functional upon manufacture. Ideally, a reliability model would
use either yield or defect density/area as the primary factor(s) on which to base the
model. The problem in using these factors in a model is that they are considered highly
proprietary parameters from a market competition viewpoint and, therefore, are rarely
released by the manufacturers. Therefore, the single most important driver of reliability
cannot be obtained by the user of the device, which is unfortunate because the accuracy
of the model suffers. The conflict between the usability of a model and its accuracy has
always been a difficult tradeoff to address for model developers.
Much of the literature in the 1990s on the topic of reliability prediction has centered
around the debate as to whether the reliability discipline should focus on PoF-based or
empirically-based models (such as MIL-HDBK-217) for the quantification of reliability.
In the authors opinion, many of the primary criticisms of MIL-HDBK-217 stem from the
fact that it was often used for purposes for which it was not intended. For example, it
was often used as a means by which the reliability of a product was demonstrated. Since
its use was contractually required, contractors would try to demonstrate compliance to the
specified reliability requirements by adjusting factors in the model to make it appear
that the reliability would meet requirements. Sometimes these adjustments had a
technical basis, and sometimes they did not. Les Gubbins, one of the governments first
project managers for the handbook, once made the analogy that engaging in the use of
these adjustment factors is like pushing the needle on your cars speedometer up, and
convincing yourself youre going faster. This, of course, is not good engineering
practice, but rather was done for nontechnical reasons.
Another key development in the area of reliability predictions was related to the
implications of acquisition reform. In 1994, Military Specifications and Standards
Reform (MSSR) was initiated which decreed the adoption of performance-based
specifications as a means of acquiring and modifying weapon systems. It also overhauled
Reliability Information Analysis Center
15
Chapter 1: Introduction
Chapter 1: Introduction
1.5. Acronyms
Acronyms and abbreviations that are used in his book are defined as follows:
AL
ALM
ALT
CA
CDF
CRR
D
DoD
DPA
DVT
ED
ELFR
EPRD
ESD
EV
EVT
FMEA
FMECA
FRU
GFL
HALT
HASS
HAST
HTB
HTOL
HTRB
IOL
IPL
IWV
KPSI
LI
MCMC
MLE
MS
MTTF
NPRD
O
PD
PDF
PVT
RBD
RPN
RSH
S
SD
TBD
TC
TR
TST
TTF
VVF
Accelerated Life
Accelerated Life Model
Accelerated Life Testing
Constant acceleration
Cumulative Distribution Function
Center for Risk and Reliability
Detectability
Department of Defense
Destructive Physical Analysis
Design Verification Test
Electrical distributions
Early life failure rate
Electronic Parts Reliability Data
Electrostatic discharge
External visual
Engineering Verification Test
Failure Mode and Effect Analysis
Failure Mode and Effect Criticality Analysis
Field Replaceable Unit
Gross/fine leak
Highly Accelerated Life Test (simultaneous temperature cycling and vibration)
Highly Accelerated Stress Screening
Highly Accelerated Stress Testing
High temperature bake
High temperature operating life
High temp. reverse bias
Intermittent operational life
Inverse Power Law
Internal water vapor
Pounds per square inch, in thousands
Lead integrity
Markov Chain Monte Carlo
Maximum Likelihood Estimator
Mechanical shock
Mean Time to Failure
Non-Electronic Parts Reliability Data
Occurrence
Physical dimensions
Probability Density Function
Process Verification Test
Reliability Block Diagram
Risk Priority Number
Resistance to solder heat
Severity
Solderability
To Be Defined
Temperature cycling
Thermal resistance
Pre and post electrical test
Time to Failure
Vibration - variable freq.
Chapter 1: Introduction
1.6. References
1. Coppola, A., Reliability Engineering of Electronic Equipment, A Historical
Perspective, IEEE Transactions on Reliability. Vol. R-33. No. 1, April 1984.
2. Erles, D.R., Reliability Application and Analysis Guide, The Martin Company, July
1961.
3. Erles D.R. and M.F. Edins, Failure Rates, AVCO Corp. April, 1962.
4. Knight, C.R., Four Decades of Reliability Progress, 1991 Proceedings Annual
Reliability and Maintainability Symposium.
5. Reliability Prediction Methodologies For Electronic Equipment, AIR 5286, SAE G11 Committee, Electronic Reliability Prediction Committee, 31 Jan. 1998
6. Reliable Application of Plastic Encapsulated Microcircuits, Reliability Analysis
Center Publication PEM2.
7. Morris, S.F. and J.F. Reilly (Rome Laboratory), MIL-HDBK-217 - A Favorite
Target.
8. Denson, W. And P. Brusius, VHSIC and VHSIC-Like Reliability Modeling, RADCTR-89-177.
9. Reliability Analysis Center, Benchmarking Commercial Reliability Practices
2.
Prior to developing a reliability model for a product or system, the analyst should
consider the following questions:
What is the goal of the model, and what decisions will be made based on it?
What data is currently available on the product?
Is field data available? If so, is it from the product or system operating in the
same manner and environment as the one under analysis?
Is test data available? If so, what types of tests (i.e., accelerated life tests, nonaccelerated life tests, qualification tests, etc.)
Is data, either field or test, available on a predecessor (i.e., earlier version) of the
product?
Have models been developed for specific failure modes, mechanisms and/or
causes of the product?
o Life models?
o Stress-strength models?
o Models from first principals?
Have critical failure causes of the product been identified?
How much support can be expected from suppliers regarding identification and
quantification of the failure causes of their product?
Define system
Assess feasibility of
performing reliability
tests
Combine data
the root failure mode cause or mechanism level. Tools for this system model include
FMEA and FTA.
Fault tree representation of a system breakdown in which the level at which reliability
estimates are made are the components, represented by circles (basic events) is illustrated
in Figure 2.1-1. This Figure represents a reliability prediction performed using MILHDBK-217 or 217Plus.
System
Assembly 1
Assembly 2
Subassembly 1a
Comp. 1a1
Comp. 1a2
Subassembly 1b
Comp. 1a3
Comp. 1b1
Subassembly 2b
Comp. 1b2
Comp. 2b1
Comp. 2b2
Subassembly 2c
Comp. 2b3
Comp. 2c1
Comp. 2c2
Assembly 1
Assembly 2
Subassembly 1a
Comp. 1a1
FM1
FM2
Subassembly 1b
Comp. 1a2
FM2
FM1
Comp. 1a3
FM1
FM2
Comp. 1b1
FM3
FM1
Subassembly 2b
Comp. 1b2
FM1
FM2
Comp. 2b1
FM1
FM2
Subassembly 2c
Comp. 2b2
FM3
FM1
FM2
Comp. 2b3
FM1
Comp. 2c1
FM1
Comp. 2c2
FM2
FM1
FM2
Approaches such as this, in which the reliability of each failure mechanism is estimated,
are practical if:
1. The product or system under analysis has a manageable number of failure
mechanisms that can be estimated
2. The approach can be practically applied for all failure mechanisms over the entire
supply chain. In other words, each organization responsible for their component
or assembly has the ability to estimate the reliability of all failure mechanisms
within their component or assembly.
This same representation is relevant to performing FMEAs. In this case, the lowest level
events in the fault tree are the constituent failure modes of the component. If a failure
mechanism modeling approach is to be used, it needs to be applied to all failure
mechanisms in order for the assessment to quantify the reliability of the entire system.
Risk assessment
Reliability demo
Observed
failure
Anticipated
failure
Determine if
minimum
robustness is
achieved
Input to
FMEA/FTA for ID
of failure cause
priority
Compare
competing
designs
Determine if
reliability rqmt
is achieved
PM
schedules
Warranty cost
predictions
Model
Reliability
growth
Maintainability
Design aid
Determine
feasibility of
meeting rel. rqmt
Determine fault
tolerance, redundancy
Determine
impact of factors
on reliability
Allocate
maintenance
personnel
Spares
allocation
Determine
screening rqmt
Determine testability
requirements
Input to
FMEA/FTA for ID
of failure cause
priority
Compare
competing designs
Design Aid
Model reliability
growth
Determine
feasibility of
meeting reliability
rqmt.
Determine impact
of factors on
reliability
(derating)
Determine
screening rqmt.
Reliability
Demo
Determine if
minimum
robustness is
achieved
Determine if
reliability rqmt. is
achieved
Warranty cost
predictions
Preventive
Maintenance (PM)
schedules
Maintainability
Spares allocation
Allocate
maintenance
personnel
Description
Risk assessments are performed to quantify the reliability of critical- or safety-related failure
modes before the product is fielded. This is often done to meet industry or customer
requirements.
Risk assessments are performed on fielded products that experience failures. Factors that
usually need to be quantified are (a) determination of the root cause, (b) lifetime, (c) percent
failure at a given time, (d) the percent of the population at risk (i.e. whether the root cause is
special cause or common cause), (e) whether the defect is lot- or batch-related, (f) whether
the defective portion can be contained, and (g) what the reliability will be as a function of
the level of corrective actions (for example, 1 if nothing is done; 2 - if a complete recall is
done, and 3 - an approach in between).
Techniques such as FMEA and FTA are used to assess and prioritize failure causes. Part of
this prioritization includes the identification of the probability of occurrence, either
qualitatively or quantitatively.
For this purpose, reliability modeling is performed to quantify the relative reliabilities of
several competing designs. This analysis is then used as one criterion from which the final
design is chosen. In this case, reliability is only one of the factors to be accounted for in this
comparison, and needs to be traded off against all of the other factors.
A natural part of the development process is to grow the reliability to a point that it meets its
reliability requirement. For this purpose, the reliability metric of choice is quantified as a
function of time. This provides Program Management with the information to assess the
reliability status of the project and to estimate the date at which the requirements will be
met.
In many cases, reliability requirements are levied upon suppliers and contractors. For this
purpose, the reliability assessment is performed to determine if there is a reasonable
probability of achieving the reliability requirements. If it is highly likely that requirements
cannot be met, then management must make decisions regarding the future of the program.
For this purpose, the effects of specific factors are assessed. For example, the effects of
temperature may be assessed to determine how much cooling is required.
This purpose relates to quantifying reliability as a function of possible screening options, so
that it can be determined which screening options will result in the reliability requirements
being met.
This purpose is to provide quantitative data that proves, within acceptable confidence limits,
that predefined robustness levels are achieved. These robustness levels usually correspond to
a qualification requirement, and may not be highly correlated to field reliability.
This purpose is to provide quantitative data that proves, within acceptable confidence limits,
that the reliability requirements are met.
For this purpose, the assessment is performed so that the costs associated with warranty
repairs or replacements can be estimated.
The assessment is performed so that effective preventive maintenance schedules can be
derived.
For repairable systems, since the replacement of failed items requires the availability of
spare items, the question of how many spares to keep on hand inevitably arises. The
reliability characteristics of the item is one piece of information required. Others are repair
rates, a reliability block diagram, etc.
For repairable systems, organizations need to determine the personnel required to keep up
with maintenance demands. One input to this is the frequency of various types of failures
Specific reliability modeling purposes are generally suited to specific program phases, as
summarized in Table 2.2-2.
Table 2.2-2: Program Phase vs. Reliability Assessment Purpose
Stage
Purpose
Concept
Risk
Assessment
Design Aid
Reliability
Demo
Maintainability
Anticipated failure
Observed failure
Input to FMEA/FTA
for ID of failure cause
priority
Compare competing
designs
Model reliability
growth
Determine feasibility
of meeting reliability
rqmt.
Determine impact of
factors on reliability
(derating)
Determine screening
rqmt.
Determine if minimum
robustness is achieved
Determine if reliability
req. is achieved
Warranty cost
predictions
PM schedules
Spares allocation
Allocate maintenance
personnel
Development
Early
Production
Production
Deployment
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
System
Subsystem
Assembly
Component
Failure Modes (Root)
Failure Causes/Mechanisms (Root)
2.3.1. Level vs. Data Needed
System level
System
Subsystem
Parts lists
Environmental conditions
Part stresses
Assembly
Component
Failure modes
Failure causes/mechanisms
A FMEA can be an effective tool in identifying specific root failure causes that need to
be quantified in a reliability model. A generic FMEA approach is shown in Figure 2.3-2.
System Hierarchy
How functions can fail
Functions
Functions
Failure effects
Failure modes
Occurrence
Detectability
Risk Priority
Number (RPN)
Improve design
Failure
Effect
#2
Failure
Effect
#3
Failure
Mode
Cause
#2
Cause
#1
Cause
#1b
Cause
#1a
Cause
#1c
Cause
#2a
Cause
#2a1
Cause
#4
Cause
#3
Cause
#2b
Cause
#3a
Cause
#3b1
Cause
#2a2
Cause
#3b
Cause
#4b
Cause
#4a
Cause
#3b2
Initial conditions
Defect Free
Mechanism
Stresses
Defects
Operational
Intrinsic
Extrinsic
Environmental
Mechanical
Electrical
Chemical
Defects
Extrinsic
Operational
Stresses
Environmental
Electrical
Mechanism
Mechanical
Chemical
Voids
Material property variation
Geometry variation
Contamination
Ionic contamination
Crystal defects
Stress concentrations
Organic contamination
Nonconductive particles
Conductive particles
Contamination
Ionic contamination
Thermal
Electrical
Chemical
Optical
Chemical exposure
Salt fog
Mechanical shock
UV exposure
Drop
Vibration
Temperature-high &low
Temperature cycling
Humidity
Pressure low &high
Radiation EMI, cosmic
Sand and dust
Electromigration
Dielectric breakdown
Dendritic growth
Tin whiskers
Electro-thermo-migration
Second breakdown
Metal fatigue
Stress corrosion cracking
Melting
Creep
Warping
Brinelling
Fracture
Fretting fatigue
Pitting corrosion
Spalling
Crazing
Abrasive wear
Adhesive wear
Surface fatigue
Erosive wear
Cavitation pitting
Stress corrosion cracking
Elastic deformation
Material migration
Cracking
Plastic deformation
Elastic deformation
Brittle fracture
Expansion
Contraction
Emod change
Outgas
Corrosion
Chemical attack
Fretting corrosion
Oxidation
Crystallization
One of the keys to a successful FMEA is to understand the relationship between cause,
mode and effect. In general, there is a natural tiering effect that occurs in an FMEA as a
function of the product or system level, as illustrated in Table 2.3-2. For example, at the
most basic level, the part manufacturing process, the cause of failure may be a process
step that is out of control. The ultimate effect of that cause becomes the failure mode at
the part level, the failure effect of the part becomes the failure mode at the next level of
assembly, and so forth. It is very important that the cause, mode and effect are not
confounded in the analysis.
Table 2.3-2: Relationship Between Cause, Mode and Effect.
System
Assembly
Part
Part
Manufacturing
Process
Effect
Mode
Effect
Cause
Mode
Effect
Cause
Mode
Effect
Cause
Mode
Cause
OR
AND
AND
AND
OR
VT
AND
AND
OR
AND
OR
OR
OR
OR
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
TOP
OR
AND
AND
AND
OR
VT
Effect
AND
Mode
OR
OR
OR
OR
OR
Event
Event
Event
Event
Event
Event
Event
Cause
Event
Event
Event
Event
Event
Event
Event
Event
Figure 2.3-6: Fault Tree of Product or System with Cause as the Lowest Level
OR
Effect
AND
OR
AND
VT
Mode
AND
Cause
OR
OR
OR
OR
OR
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Figure 2.3-7: Fault Tree of Product or System with Cause Above the Lowest Level
Effect
OR
Mode
AND
OR
AND
VT
Cause
AND
AND
OR
OR
OR
OR
OR
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Event
Figure 2.3-8: Fault Tree of Product or System with Cause Two Levels Above the
Lowest Level
Therefore, if the FTA view of the product or system is to be consistent with the reliability
assessment, then the lowest level in the tree must be the level at which reliability
estimates are made.
The section above describes the hierarchical level at which a reliability model will be
developed, whether it be a failure cause, failure mode, a component or an assembly.
Once this physical level is determined, there are several model forms possible to
Reliability Information Analysis Center
33
construct a model to describe its reliability. This form will, of course, depend on the
specific approach and data used to develop the reliability model. Some of these forms are
described below. More detail on each of these is provided in subsequent sections.
2.3.3. Model Form vs. Level
The form of the model to be developed will depend on the level and the approach. For
example, if empirical data is used directly without a model developed from it, assuming
constant failure rate, the best estimate of the failure rate is simply:
Failures
operating time
If a life model is developed from life tests performed at various stress levels, the result
will be a time-to-failure (TTF) distribution (described by the Weibull, lognormal or other
statistical distributions) that is a function of stress levels. If a Weibull distribution is
used, the general model will be:
R(t ) = e
If models are to be derived from the analysis of field data, there are several possible
model forms. Traditional methods of reliability prediction model development have
included the statistical analysis of empirical failure rate data. When using multiple linear
regression techniques with highly variable data (which is often the case with empirical
field failure rate data), a requirement of the model form is that it be multiplicative (i.e. the
predicted failure rate is the product of a base failure rate and several factors that account
for the stresses and component variables that influence reliability). An example of a
multiplicative model is as follows:
p = b e q s
where:
p =
b =
e =
q =
s =
However, a primary disadvantage of the multiplicative model form is that the predicted
failure rate value can become unrealistically large or small under extreme value
conditions (i.e., when all factors are at their lowest or highest values). This is an inherent
limitation of multiplicative models, primarily due to the fact that individual failure
mechanisms, or classes of failure mechanisms, are not explicitly accounted for.
Another possible approach to model reliability is to segment the failure rate for each
group of failure causes that are accelerated by stresses incurred during specific portions
of a mission. Each of these failure rate terms are then accelerated by the appropriate
stress or component characteristic. This is the model form used in the RIAC 217Plus
methodology. This model form is as follows;
p = o o + e e + c c + i + sj sj
where:
p = predicted failure rate
o = failure rate from operational stresses
o = Product of failure rate multipliers for Operational Stresses
e = failure rate from environmental stresses
e= Product of failure rate multipliers for Environmental Stresses
c = failure rate from power or temperature cycling stresses
c = Product of failure rate multipliers for Cycling stresses
I = failure rate from induced stresses, including electrical overstress and ESD
sj = failure rate from solder joints
sj = Product of failure rate multipliers for solder joint stresses
The concept of this approach is that the occurrence of each group of failure causes is
mutually exclusive, and their failure rates can be modeled separately and summed. By
modeling the failure rate in this manner, factors that account for the application and
component-specific variables that affect reliability ( factors) can be applied to the
appropriate additive failure rate term. Additional advantages to this approach are that
they:
o Address Operating-, Non-Operating- and Cycling-related Failure Rates in an
additive model. These individual failure rates are weighted in accordance with
the operational profile (duty cycle and cycling rate). The Pi factors modify only
Reliability Information Analysis Center
35
the applicable failure rate term, thereby eliminating many of the extreme value
problems that plague multiplicative models.
o Are based on observed failure mode distributions, so that observed component
root failure causes are empirically modeled
o Can be tailored with test data (if available) by applying it in a Bayesian fashion to
the appropriate failure rate term. As examples, temperature cycling data can be
combined with the failure rate from power or temperature cycling stresses (c), or
high temperature operating life can be combined with the failure rate from
operational stresses term (o).
Perhaps the most important element of a reliability program is the reliability testing of the
product. Reliability test data is, in turn, a critical element for assessing reliability. In this
context, a reliability test consists of two primary elements: measurement and exposure.
The measurement is the means of assessing the performance of the product or system
relative to its requirements. It usually consists of quantifying parameters that are
specifiable attributes. It may include both continuous variables (i.e. gain, power output,
etc.) or attribute data (i.e. a binomial representation of whether a product possesses an
attribute or not). Exposure is the application of a stress or stresses. These stresses may
consist of operational stresses or environmental stresses. Operational stresses are defined
as those stresses to which the product will be exposed by the act of operating the product.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
36
For example, a transistor is designed to have a voltage applied, and pass a given amount
of current. As such, these are operational stresses. It will also be exposed to externally
applied environmental stresses such as temperature, temperature cycling, vibration, etc.
Reliability tests can be performed either by sequentially performing repeated cycles of a
measurement, exposure, measurement, etc., or by continuously measuring performance
parameters in-situ during exposure. It is usually desirable to perform in-situ
measurement so that times to failure can be accurately determined. In practical cases,
however, it is not always feasible due to the complexities of setting up such measurement
capabilities. If repeated cycles of a measurement, exposure, and measurement are used,
the measurement intervals should be frequent enough so that sufficient resolution in the
times-to-failure data is available.
Practical considerations for assessing the feasibility of testing products are:
Additional considerations for testing products and systems are provided in Chapter 5.
Description
1
Highly
Accelerated
Life Test
(HALT)
Exposure to severe
levels of thermal
cycling and vibration
Strengths
Can quickly identify failure causes that are
accelerated by thermal cycling and vibration
2
Qual
Exposure to industry
standard trade and
commerce tests
Weaknesses
3
DOE multicell
4
Reliability
demo
5
Reliability
demo
6
Field data
same product
Demonstration of
reliability via life tests
at accelerated
conditions
Demonstration of
reliability via life tests
at non-accelerated
conditions
Description
Strengths
Can be reasonably sensitive to various
stresses
Represents field use
7
Models
Models developed
from field experience
data on similar
products
8
Raw data
(EPRD,
NPRD)
Easy to use
Can quantify failure causes that exhibit low
percent failures
Weaknesses
Difficult to keep updated
Actual failures are impacted by factors not
considered by the model
Models become outdated by new technology
Misapplication of models by the analyst
No uncertainty estimates available
Difficult to collect good quality field data
Difficult to distinguish correlated
variables(i.e. quality and environment)
Extrapolations to specific use conditions
required
Not feasible to collect data representing all
conceivable situations
Difficult to account for material defects
9
Stress/Strength
modeling
Calculation of failure
probabilities based on
the strength
distribution and the
stress distribution
Scientifically robust
10
First principals
Calculation of failure
probabilities based on a
fundamental
understanding of the
physics of the failure
cause
Selecting a methodology
The various approaches summarized here are suited to various program phases,
corresponding to prediction, assessment and estimation. This is shown in Table 2.5-2
(note that the shaded area indicates where the approach can be applied). For example,
MIL-HDBK-217 should only be used for prediction, meaning that its usefulness is
limited for assessment and estimation. Conversely, 217Plus was designed to provide a
framework for all three reliability modeling phases.
Accelerated
Test
NonAccelerated
Same
product
Empirical
Field
data
Physics
Similar
product
Estimation
Approach
Assessment
Prediction
HALT
Qualification
DOE multicell
Reliability demo
Reliability demo
Models
217Plus
MIL-HDBK-217
Bellcore
Raw data (EPRD,
NPRD)
Stress/Strength modeling
First principals
The severity of product failure. In this context, severity can mean that there are
significant financial ramifications of failure, that there are safety-related risks, or
that the system is not maintainable. For all of the reasons that high reliability may
Reliability Information Analysis Center
41
be required in the first place, are the same reasons that the reliability model must
be acceptably accurate. Since reliability is a stochastic process, reliance on any
one of the methodologies discussed in this book is susceptible to uncertainties.
Sometimes these uncertainties can be very large. This is true for any of the
methods. If, however, several methodologies can be employed, and their results
are consistent with each other, then this adds much more credibility to the
modeled reliability of the product. This is especially true if a physics approach is
coupled with an empirical approach.
The amount and level of detailed information available to the analyst. Often, this
will dictate the available choices for the analysis.
Complexity of the product. If the product or system is very complex, has many
levels of indenture, and there is a complex supply chain involving many suppliers,
then the available suitable choices for analysis at the top of the supply chain will
be limited. For example, as discussed previously, it is very difficult to obtain the
data required to utilize one of the physics approaches by organizations higher in
the supply chain. If, however, the entire supply chain utilizes the PoF approach
for the product or system, it can be a viable approach.
Physics
Reliability
demo
Maintainability
x
x
Raw data
x
x
x
x
x
x
x
x
Models
Same product
NonAccelerated
Reliability demo
Reliability demo
DOE multicell
x
x
x
x
x
x
x
x
First principals
Design aid
Anticipated failure
Observed failure
Input to FMEA/FTA for ID of failure
cause priority
Compare competing designs
Model reliability growth
Determine fault
tolerance,
Determine
redundancy
feasibility of
Determine
meeting rel req.
testability
requirements
Determine impact of factors on
reliability (e.g. derating)
Determine screening req.
Determine if
minimum
robustness is
achieved
Determine if rel.
req. is achieved
Warranty cost
predictions
PM schedules
Spares allocation
Allocate
Maintenance
personnel
Stress/Strength modeling
Risk
assessment
Qualification
HALT
Purpose
Similar product
Field data
Accelerated
Test
x
x
x
x
Relevancy is a function of the type of data that is available, and the product or system on
which that data is available. To further address the relevancy issue for assessments made
with empirical data, consider the information in Table 2.5-4, which summarizes the
various attributes of empirical data. This notion is valid, regardless of the level of
assembly, ranging from root failure causes to the system level.
Table 2.5-4: Ranking the Attributes of Empirical Data
Type of data
Field
Product or
System on
which data
is
available
Same
Similar
Same
mfg/process
Different
mfg/process
Same
mfg/process
Different
mfg/process
Same
environment
Best
Different
environment
Same
stress
Test
Different
stress
Worst
There has been much information published in the literature comparing and contrasting
empirical and physics-based models. However, they are not mutually exclusive
methodologies. For example, empirical models generally utilize PoF principals in their
derivation, and PoF models utilize empirical data in their derivation and parameter
estimation.
The majority of component field failures are a result of special causes. These causes may
be an anomaly in the manufacturing process, an application anomaly, or a host of other
assignable causes. They are rarely the result of a common cause failure mechanism,
which can generally be modeled by life modeling techniques.
Guidelines and examples are provided in the following sections for each of the
approaches.
2.5.1. Empirical
2.5.1.1. Test
Testing product ors system reliability is performed for many reasons, including:
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
44
An important consideration for all of the tests described above is the definition of
failure, i.e., the "failure criteria" that will be used to determine if a product passes or
fails. Industry guidelines, specifications, or an understanding of end-use application
tolerances are often used to set pass/fail criteria.
A common form of empirical testing is the performance of qualification tests.
Qualification is usually defined as demonstrating that a product will meet performance
requirements in its intended application, as used by customers, over the expected lifetime
of the product. There are two primary elements to performance qualification: Validation
and Verification (Reference 1), as follows:
Validation Confirmation by examination and provision of objective evidence
that the particular requirements for specific intended use are fulfilled.
Verification Confirmation by examination and provision of objective evidence
that specified requirements have been fulfilled.
Therefore, for a product or system to be considered fit for use for a specific application, it
must conform to the requirements of its specification over its intended life (verification)
and the specification must adequately capture the requirements of the end user
(validation). The various elements of qualification are illustrated in Figure 2.5-2.
Qualification
Verification
Validation
Specification
Compliance
Reliability
Testing
EVT
DV
PVT
Root
cause
analysis
and
corrective
action
Engineering Verification Tests (EVT) are intended to identify and assess high risk
critical items so that corrective action can be taken, if necessary. The intent is to uncover
weaknesses or to identify product capability, not to pass a set of predefined tests, as is the
case with traditional qualification testing. The purpose and approach of these tests are
described in Table 2.5-5. These tests are also used to identify the maximum stress
capability of a product, which is a prerequisite for developing a complete test plan (in
DVT) to assess lifetime. Step stress tests are often used for this purpose, and the results
can support establishment of an upper bound on subsequent test stresses.
One of the primary purposes of DVT testing is to provide the data required to develop life
models. Often, there are multiple accelerating stresses, in which case life tests must be
conducted for various stress combinations. Design of Experiments (DOE) is used to
develop an effective and cost-efficient test plan. DOE concepts, as they pertain to
reliability testing, are covered in Chapter 4.
PVT tests demonstrate that the robustness of production units is equivalent to that of the
EVT/DVT samples. Whereas EVT and DVT demonstrate the intrinsic robustness, PVT
demonstrates the as-built robustness.
Table 2.5-5: EVT, DVT and PVT Purpose and Approach
Test
Program
Element
EVT
DVT
PVT
Purpose
Determine limits of the technology
Determine stress bounds for
subsequent tests
Determine predominant accelerating
stresses
Identify weak points in the design
Identify failure causes
Quantify elements of the bathtub
curve (infant mortality, wear-out) so
that effective screens can be
developed
Provide data to assess product
lifetime under various combinations
of stresses
Verify the robustness of production
units are as good as EVT/DVT
samples
Approach
Relative
Sample
Sizes
Required
Test to failure
Step stress (to determine limits)
Test a broad range of stressors to
determine the stresses that accelerate
predominant failure causes
Low
High
Low
Some reliability practitioners choose to separate qualification tests from reliability tests.
In this case, reliability tests are those that have a purpose similar to the DVT tests. The
reason for separation is that the reliability tests are more of an engineering test that are
not dictated by industry standards. As such, the results may or may not be shared with
customers. Likewise, qualification tests are required and thus shared with customers to
demonstrate compliance.
Root cause analysis and corrective action
A critical part of any reliability program is the ability to learn from failures and improve
the product or system. Failure analysis is performed to ensure that the root cause is
identified and understood, and corrective actions are implemented and verified. This is
done throughout product development, including EVT or DVT and PVT.
With a product that is comprised of a number of subassemblies, there is a time offset
between EVT, DVT or PVT tests performed on components of the product or system and
those tests performed on the end item. This is illustrated in Figure 2.5-3.
Time
Component
(EVT)
(DVT)
Early Screening
Prequalification
Ongoing Reliability
Test (ORT)
(PVT)
Full qualification
(EVT)
(DVT)
Early Screening
Prequalification
Mass Production
(PVT)
Full qualification
Ongoing Reliability
Test (ORT)
Mass Production
Assembly
Figure 2.5-3: EVT, DVT and PVT Relationships
Non-Accelerated
Nonaccelerated reliability tests are those in which samples are tested in a manner that
recreates the use conditions the product will experience in its intended use environment
as used by customers. These tests may be performed for several reasons:
1. To uncover any unexpected failure causes
2. To demonstrate that the product meets its reliability requirement
Generally, if the purpose is #1, a more effective way of acheiving this is with accelerated
testing, discussed in the next section of this book. If the purpose is #2, then concepts of
reliability demostration can be used, as discussed in the next section.
2.5.1.1.2.
Reliability Demonstration
1 CL = R
This is essentially a hypothesis test in which the hypothesis is that the true product
reliability is R or greater. For example, consider a case in which the reliability
requirement is 0.95 at 5000 hours, and the desired confidence level is 0.80 (80%). In this
case, the implied failure rate is 0.0000103 failures per hour.
If the hypothesis is true and the test is run such that there less than a 20% probability of
experiencing the observed number of failures (or fewer), then the analyst can be 80%
certain that the reliability requirements have been met.
Table 2.5-6 summarizes the probability as a function of the number of failures and
cumulative operating time. The values in the cells are the Poisson probability that there
will be F or fewer failures, under the hypothesis that the true failure rate is 0.0000103
(failures per hour). In this example, if the test can be run until 200,000 hours are
accumulated, with no failures, then the test is passed and the hypothesis is verified. This
is the first opportunity to pass the test, as this is the shortest time at which the Poisson
probability falls below 0.20 (i.e., 0.13). In this example, 0.20 is the risk of concluding
that the failure rate is less than 0.0000103 when it is not.
The test is run until the number of failures and time combinations falls either above or
below the shaded red area. If it falls above the red area, then the null hypothesis is
confirmed (that the failure rate is greater than the required). If it falls below the red area,
the hypothesis is confirmed. If the combination of hours and failures remains in the red
area, the hypothesis cannot be confirmed or denied, and further testing is required.
Reliability Information Analysis Center
49
The probability values are generally calculated from the binomial or Poisson
distributions, depending on whether the probability is time-based (Poisson) or attributebased (binomial). Poisson is used in the case of constant failure rates.
Table 2.5-6: Reliability Demonstration Example
250
300
350
400
450
500
550
600
800
200
750
150
700
100
650
50
Number of Failures
10
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
0.99
0.98
0.97
0.95
0.92
0.89
0.85
0.79
1.00
1.00
1.00
1.00
1.00
1.00
1.00
0.99
0.98
0.96
0.94
0.90
1.00
1.00
1.00
1.00
1.00
1.00
0.99
0.98
0.95
0.92
0.88
0.83
0.86
0.81
0.75
0.69
0.77
0.71
0.64
1.00
1.00
1.00
1.00
1.00
0.99
0.97
0.94
0.90
0.85
0.79
0.72
0.56
0.65
0.57
0.50
0.42
1.00
1.00
1.00
0.99
0.98
0.96
0.93
0.88
0.82
0.74
0.66
1.00
1.00
0.99
0.98
0.95
0.91
0.85
0.77
0.68
0.59
0.50
0.58
0.50
0.42
0.35
0.29
0.42
0.35
0.28
0.22
1.00
1.00
0.98
0.94
0.88
0.80
0.71
0.61
0.51
0.42
0.17
0.34
0.26
0.21
0.16
0.12
0.09
1.00
0.98
0.93
0.85
0.74
0.63
0.52
0.41
0.32
0.98
0.91
0.80
0.66
0.53
0.41
0.30
0.22
0.16
0.25
0.19
0.14
0.10
0.07
0.05
0.04
0.11
0.08
0.06
0.04
0.03
0.02
0.91
0.73
0.54
0.39
0.27
0.19
0.13
0.08
0.01
0.06
0.04
0.02
0.02
0.01
0.01
0.00
0.00
0.60
0.36
0.21
0.13
0.08
0.05
0.03
0.02
0.01
0.01
0.00
0.00
0.00
0.00
0.00
0.00
2.5.1.1.3.
Accelerated Testing
One of the critical aspects of accelerated testing is the degree to which acceleration takes
place. Consider the situation depicted in Figure 2.5-4. The reliability requirement, in
terms of lifetime in this example, will be specified at a specific stress condition. If tests
are performed at the accelerated conditions of Test 1, there will be some extrapolation to
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
50
lifetimes at use conditions (if the purpose is to quantify life). If tests are performed at the
accelerated conditions of Test 2, there will be additional extrapolation to lifetimes at use
conditions. Life modeling is the means of performing this extrapolation, and will be
covered in Section 2.5.1.1.2.3 and Chapter 5.
Another factor to consider in accelerated testing, when used to quantify reliability at use
conditions, is the relative probability of occurrence of various failure causes as a function
of stress level. Each failure cause will have unique acceleration characteristics as a
function of stress, depicted as the slope of the life-stress line. They will also have unique
probabilities of occurrence, as depicted as the vertical position of the life-stress line.
These factors together indicate that the relative probabilities of the causes require a model
for each. This is illustrated in Figure 2.5-8. In this life-stress plot, the slope represents
the dependency of life as a function of stress, and the position of the line represents the
absolute life. As can be seen, the relative probabilities of the causes will depend on the
stress level.
HALT requires a different mindset than "conventional" accelerated testing. One is not
trying to predict or demonstrate life, but rather to induce failures of the weakest links in
the design, strengthen those links, and thereby greatly extend the life of the design. Root
cause failure analyses are conducted and repairs and redesign are carried out, as feasible
and cost-effective. Output results from HALT may include a Pareto chart showing the
weak links in the design, and design guidance that can be used to create a more robust
design.
Testing a new design and comparing it against a proven previous generation design using
the same accelerated test provides an efficient benchmarking test. Based on HALT
results, a determination of "optimum" design characteristics can be made using statistical
design of experiments (DOE).
A generic HALT process starts with a temperature survey:
1. Start at room temperature
2. Step down temperature to -100C in 20 increments, with each dwell time long
enough to stabilize the product's internal temperature (the thermal rate of change
between each temperature transition step should be ~100C/minute)
3. Step up temperature from -100C to +40C at 100C/min
4. Step up temperature from +40C in 20C increments to 100C or the maximum
temperature for the materials involved, with each dwell time long enough to
Reliability Information Analysis Center
55
stabilize the product's internal temperature (the thermal rate of change between
each temperature transition step should be ~100C/minute)
Next, a vibration survey is performed:
1.
2.
3.
4.
The vibration stress is provided by mechanically impacting the table with hammers.
As such, the frequency spectrum is not truly random, but rather is pseudorandom. The
purpose of the vibration survey is to detect weakness in the design as a function of the
stresses created by the increased vibration levels.
A combined environment HALT may also be performed:
1. Superimpose simultaneous temperature cycling from -100C to +100C at
~100C/min of circulating air temperature. Dwell at each temperature only long
enough to semi-stabilize the internal temperature of the part
2. During temperature dwells, subject the test unit to vibration at 5 Grms
3. During subsequent thermal cycles, step the vibration level up in 5 Grms increments
In this example, the vibration is applied during temperature dwells, but if failure causes
are possible that are accelerated by vibration stresses during temperature transitions, the
stress profile can be modified to apply vibration continuously throughout the temperature
cycle.
This is a typical stress profile, and will be varied (and should be tailored) based on the
limits of the product or system being tested. The purpose of the step-stress temperature
test is to detect sensitivity of design functionality to temperature and temperature change
rates.
The purpose of the combined environment test should highlight weaknesses that result
from the interaction effects of simultaneous exposure to temperature and vibration.
Quantifying reliability is generally not the objective of HALT. The ability to improve the
inherent reliability/robustness of the product or systems design is. However, in some
cases it can be used as an indicator of field reliability performance. The fundamental
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
56
question to address is this: Does the HALT test excite failure causes that the item may
experience in the field? The answer to this question will depend entirely on the
characteristics of the item under test, and the stresses to which it will be exposed in field
use. For example, if the product or system critical failure causes are accelerated by
thermal cycling and random vibration, and the item will experience these stresses in the
field, then HALT test results may be indicative of field reliability. Likewise, if the
product or system critical failure causes are not accelerated by thermal cycling and
vibration, and/or the product will operating in a benign environment, then the HALT
results will provide very little information regarding field reliability.
2.5.1.1.5.
Qualification Testing
Qualification testing is a term used to describe a series of tests that a product or system
must be exposed to, and pass, for it to be considered qualified by the industry or
standards body governing the qualification requirements. Several examples of
qualification requirements are provided in Tables 2.5-7 and 2.5-8, for an assembly, and
for a laser diode component, respectively.
Table 2.5-7: Example of a Qualification Plan for an Assembly
Group/Test
SS/Failures
3/0
3/0
Temperature Cycling
3/0
Vibration
3/0
Electro-Magnetic Interference
1/0
Electro-Static Discharge
3/0
3/0
85C/85% RH
Q=1000 hrs
Thermal Cycling
-40 to 70C
Q=100 cycles
I =500 cycles
Thermal Shock
T=100C
20 cycles
Vibration
Shock
500G,
0.5 ms,
5 times/axis
Electrostatic Discharge
There are many qualification standards in existence, governed by standards bodies within
specific industries. Some noteworthy standards organizations are IEC (International
Electrochemical Commission), the U.S. Military (via MIL-specs), ISO, and Telcordia
(for telecommunication components and equipment.
There are several factors which will impact the usefulness of qualification data as an
indicator or field reliability. These are:
The degree to which the stress is accelerated, and the acceleration factor between
the test and field environments
The degree to which the stress accelerates critical failure causes that the product
or system will experience in the field
The sample sizes used, which impacts the statistical significance of the data
The first two bullets are treated in detail elsewhere in this book. The last bullet is
discussed next.
A common way in which sample size requirements are identified in standards is with a
Lot Tolerance Percent Defective (LTPD) methodology. This concept is identical to the
reliability demonstration idea presented previously. In this case, two parameters are
specified:
1. The percent of allowable defects
2. The confidence level
From before:
1 CL = R
In this case, the value of R is the reliability of the entire sample size. So, if the test
plan is established to allow no failures (this will require the minimum sample size), the
equation becomes:
1 CL = R n
where n is the sample size. For example, if the allowable percentage of defects is 20%,
and the desired confidence level is 0.90 (i.e. 90%), then n = 11 is the minimum sample
size required, as shown:
11
1 0.9 = 0.8
So, if the test is performed on 11 samples with no failures, then there is a 90% confidence
that the true reliability is greater than 0.8 (i.e., the probability of failure is less than 0.2).
Other plans are also available that allow a certain number of failures. These require
larger sample sizes, and are determined with binomial statistics.
Since the LTPD is generally less than the required reliability, qualification data is usually
not sufficient, in and of itself, to demonstrate reliability requirements. It can, however,
be valuable data when used in combination with other data sources.
As an example, consider a case in which a reliability requirement is that a product or
system must have less than 3% cumulative failures after 1000 hours of operation. This is
shown as the star in Figure 2.5-9. Now, lets say that the item is represented by a
multimode Weibull distribution (notice the three distinct portions of the curve
representing the bathtub curve), characterized by the probability line called Case 1in
Figure 2.5-9. If 11 parts were tested, and zero failures occurred after 300 hours of
Reliability Information Analysis Center
59
operation, the only statistical statement that can be made is that there is a 90% confidence
that the true unreliability is less than 0.2 at 300 hours, shown as the solid star and arrow.
Here, the data is not sufficient to determine if the actual distribution is Case 1, or that
the reliability requirement is met. However, testing 11 samples may be sufficient to
determine if we have a wearout mode occurring at a time less than 300 hours, as
illustrated in Case 2.
Probability - Weibull
99.000
90.000
Unreliability, F(t)
50.000
Case 2
Case 1
10.000
5.000
1.000
0.500
0.100
0.100
1.000
10.000
100.000
1000.000
10000.000
Time, (t)
In any case, however, the goal of a reliability program is to ensure that the actual
probability line is to the right of the reliability requirement point.
2.5.1.1.6.
DOE-Based Multicell
After the identification of critical failure causes of a product or system that require life
modeling, action must be taken to ensure that those items are sufficiently robust to meet
product/system reliability and durability requirements. Life modeling is used for this
purpose, and involves the characterization and quantification of specific failure causes,
making it a critical element of a reliability program.
A generic life modeling methodology is shown in Figure 2.5-1.
Tools
Measurement:
DOE
FMEA
Life
Modeling
FTA
Environment
Stresses
Duty Cycle
Extreme Event
Statistics
FTA
Characterize
operating
stresses
Identify
Factors
Reliability
Tests
Develop
Life
Model
Predict
Reliability under
Use Conditions
Model of
System
Reliability
Actions
Figure 2.5-10: Life Modeling Methodology
Each of the elements in Figure 2.5-10 are further examined below. Additionally, the
topics of Design of Experiments (DOE) and life modeling are treated in more detail in
Chapters 4 and 5, due to their relatively complex nature and their importance to life
modeling. A detailed example of a life model developed is also provided in Chapter 7.
Identify Factors
Factors are the independent variables that can influence the product reliability, and the
response variable is the dependent variable. DOE is a common technique used to study
the relationships amongst many types of factors. In the context of this book, the response
variables specifically refer to the reliability metric of interest.
Critical failure causes and the factors that potentially affect their probability of
occurrence need to be identified. This can be done through testing, through analysis, or
both. EVT testing that is performed as part of the overall product/system reliability
program can be used for the identification of these factors, as previously described.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
62
FMEA is also a popular analytical technique for this and will be used in the upcoming
example.
Factors fall into one of several categories:
Stresses
o Environmental
o Operational
Product/System Attributes
o Design factors
o Manufacturing processes
A continuous variable is one that can assume any value within a given range
Categorical variables are those that assume a discrete number of possibilities
Some factors can be modeled as either. For example, environmental stress can be
modeled with continuous variables of the specific environmental stresses (i.e.,
temperature, vibration, humidity, etc.), or it can be modeled as a categorical variable.
The latter case is the approach that has historically been used in MIL-HDBK-217, which
uses environmental categories like Ground, Benign, Airborne, Inhabited, etc. The
217Plus methodology treats them as continuous variables, but default values are provided
for the categorical values of environment.
There are several ways in which these factors can be identified. One method that has
proven to be an efficient means of accomplishing this is to utilize the FMEA. This
involves modifying the FMEA to include several additional columns that correspond to
the above listed factors. At the analysts discretion, from one to four additional columns
can be included. This will depend on the type of product or system under analysis and
the level of rigor desired. In this approach, the FMEA team (or at least someone
knowledgeable with the item design and process attributes) identifies the specific stresses
or attributes that will affect the probability of occurrence of the specific failure cause that
was identified in the FMEA. Since each failure cause will generally have an associated
risk priority number (RPN), the cumulative RPN can be calculated for all failure causes
affected by the specific stress or product/system attribute.
For example, consider the case in which an FMEA was accomplished in this manner, and
the results in Figure 2.5-11 were obtained. Here, only the environmental stresses are
Reliability Information Analysis Center
63
shown, but the same methodology would apply to whichever additional factors are
included in the FMEA.
A more detailed discussions of the FMEA methodology is provided in Chapter 8.
Reliability Tests
If critical item failure mechanisms are time dependent, then time-based life tests are
required. Life tests are conducted by subjecting test samples to a defined stress level and
measuring the times when failure occurs. The process is repeated for various
combinations of factor levels. Considerations for the reliability tests are described below.
Test Plan
If there are multiple accelerating stresses, then life tests must be conducted at various
combinations of stress magnitudes. A plan should be developed using an effective tool
such as Design of Experiments. The plan should consider all aspects of testing so that the
test program generates data in a cost effective way. It is easy to lapse into the mentality
of testing one factor at a time, in which tests are conducted to assess specific factors,
but this approach is generally not time- or cost-effective.
Factors to consider in establishing an appropriate DOE include (1) the sample size per
test cell, (2) stress levels, (3) the number of stress levels for each stress, (4) stress
interactions, (5) stress durations, (6) failure criteria, and (7) measurement methodology
(i.e., in-situ or periodic). The principals of DOE are treated in more detail in Chapter 4.
Maximum Test Stress
A prerequisite for developing a complete test plan to assess the lifetime of a product or
system attribute is knowledge of the maximum stress magnitude that can be tolerated by
the item prior to catastrophic failure. This knowledge supports establishment of an upper
bound on subsequent test stresses that may be a part of step-stress testing. These tests are
generally performed as part of the EVT tests.
In many cases, it is desirable to establish the upper bound of the test stress for each
specific stressor. An efficient way to determine this stress level, often called the
destruct limit, is to perform a step stress test. Here, a sample of units is exposed to a
stress level well below the suspected destruct limit. Then, the stress is increased until the
product is overstressed. This step-stress test can include a linearly ramped stress, or a
stepped-stress in which the samples are exposed to a constant stress for a given dwell
time, after which the stress is increased, dwelled, and so on until failure. An example of
the identification of these maximum stresses was mentioned previously in the HALT
discussion.
The destruct limit can be used as the upper limit of all subsequent life tests. Usually, the
actual life tests will be performed at a maximum stress that is a certain percentage level
Reliability Information Analysis Center
65
below the destruct limit. This percentage is dictated primarily by the sensitivity of the
TTFs to the stress. For example, consider the two cases illustrated in Figure 2.5-12.
Case 1 is a situation in which the lifetime, and subsequent reliability, is moderately
sensitive to the stress level. Case 2 is a situation in which the lifetime has an extreme
sensitivity to the stress level.
Figure 2.5-12: Using the Destruct Limit to Define the Life Test Max Stress
For example, if a power law acceleration model is used, the life stress relationship is:
Life =
A
Sn
Stress Profile
The two main types of stress profiles are steady-state and time varying. Steady state tests
are those in which a sample set is exposed to constant stress levels, and the response
(performance parameter(s)) is measured. Several examples are shown in Figure 2.5-13.
Advantage
Results can be easily interpreted
Disadvantage
Longer test times required
Requires knowledge of
destruct limits
Failure
Rate
Measurement Points
If the ROCOF is expected to increase over time, the measurement intervals can start out
very infrequent, and increase in frequency as the failure rate decreases. This is shown in
Figure 2.5-15.
Failure
Rate
Measurement Points
Figure 2.5-16: Acceleration When the Distributions for at Least Two Stresses are
Available
Now, consider the case in which the lower stress samples are not tested until enough
failures have occurred. This is shown in Figure 2.5-17. In this case, the distribution
cannot be quantified. All that is possible is the estimation of the lower bound of life, via
techniques like Weibayes analysis (shown as the star).
Figure 2.5-17: Acceleration When the Distributions for Low Stresses are Not
Available
Reliability Information Analysis Center
71
This 50% objective can sometimes be offset if enough data is available in at least two
other, more stressful conditions, to compensate for the lack of data in the low stress
condition.
Develop Life Model
After the life data is generated from implementing the DOE plan, a reliability model can
be constructed. Factors that must be quantified include:
The TTF distribution can typically be modeled using the Weibull, exponential or
lognormal distributions. For sample "subpopulations" that exhibit different reliability
behavior than the main population, TTF distributions may manifest themselves as
bimodal. It is important that bimodal distributions be characterized. If one of the two
"modes" in the distribution appears to be the result of early failures from workmanship,
materials or process defects, then this information should be used to develop an
appropriate reliability screen. This topic is discussed in detail later in this book.
Characterize Operating Stresses
In order to estimate the field reliability of the product, in addition to the life model
(which will predict the life characteristics as a function of the chosen factors),
information regarding the stresses to which the product or system will be exposed in the
field is also necessary.
There are a variety of sources that can be used to estimate the stresses to which an item
will be exposed. First, customers will usually specify nominal and worst case
environmental requirements in the product or system specification. However, the data in
specifications are often very generic and lack sufficient detail for reliability analysis.
Another source of information is from direct measurement, either by directly measuring
stresses in the item use environment, or by equipping the item with sensors and data
logging features.
Field maintenance personnel can also often provide qualitative information pertaining to
stresses, especially when those stresses have resulted in failures.
There is a wealth of information available in both commercial and military handbooks
and standards. Many industries also have their own source material from the products or
systems used in their industry.
A summary of sources include:
Customer specifications
Customer usage information
Measurement of conditions:
Stresses
Duty cycle
Extreme event statistics
Reliability Information Analysis Center
73
Regression
Non linear
model
parameter
estimation
Life modeling
Data of a performance
parameter vs. time
Model of performance
vs. time
Prediction of
delta
Prediction of
percent fail
Life model
Prediction of life
distribution
Reliability Demonstration
Reliability data obtained from the field experience of products or systems is an invaluable
source of data. When using empirical field reliability data from a similar item as the
basis of the reliability estimate, there are two fundamental approaches, as illustrated
below.
The first approach is to utilize the field data directly, and the second is to utilize the data,
via an interim model developed from the data. This is shown in Figure 2.5-22.
Empirical
Field Data
Reliability
Estimate
Model
Same Product
Field data on the exact item under analysis is the best information on which to estimate
the reliability of the product or system. Unfortunately, it is usually available too late to
do any good. Reliability predictions and estimates are required long before product or
system deployment. This type of data is a lagging indicator of reliability, whereas the
other techniques discussed in this book are leading indicators. In other words, we need
leading indicators to estimate the reliability that will ultimately be observed with the field
data. This data, however, which should always be collected on products, is valuable in
the reliability assessment of future products.
2.5.1.2.2.
Similar product
When using data on a similar product or system to assess a new product or system, the
degree of similarity needs to be accounted for to estimate the new item reliability based
on the empirical data available on the similar product. There are several ways in which
similarity can be assessed. The first approach is to utilize a reliability prediction
technique. This technique can be any of those covered in this document. The
techniques ability to assess similarity is dependent on the ability of the specific
methodology to:
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
78
1. Address the factors that drive the reliability for the two products under analysis
2. Be reasonably sensitive to these drivers.
Regarding #2, for example, if a system is being developed that represents an evolutionary
change to the system for which a reliability estimate is available, estimating the reliability
of the new system based on the data from the old system requires that the prediction
methodology be sensitive to the design differences between the old and new systems. If
these differences consist of the addition of new components, an increase in the operating
temperature, and the addition of software, then the methodology used to assess the
delta in reliability between the new and old system must be capable of assessing these
elements, and the reliability prediction approach must be reasonably sensitive to these
factors. The methodology of 217Plus was designed to accommodate this type of
situation, and is further detailed in Section 2.6.
Additionally, it is not necessary that a single methodology be used to assess this delta.
Different techniques can be used to assess each of the elements of the design, and the
cumulative effect can be pooled together to form a complete system model. The
techniques used to assess each of the design elements will generally fall into the
categories described in this document.
Another more qualitative technique is to simply list the general attributes of the design, as
shown in Table 2.5-10. The relative expected reliability of each of these elements for the
new and old designs are then listed. This is a qualitative method, but can be useful in
some cases.
Design Elements
Process elements
Size
Weight
General design
Number of components of type A
Number of components of type B
Number of components of type C
Number of optical components
Thermal dissipation
Number of connections
Manufacturing site
Equipment
Screening
Component attachment
Screening tests
QC tests
This approach needs to be developed for each product or system, since the reliability
attributes will be unique to that particular item type.
Another approach that can be used to assess similarity is to utilize the FMEA, if
available. This is illustrated in Figure 2.5-23. Here, the FMEA is performed on both the
new and the predecessor system. The failure causes identified represent a cumulative
listing of all failure causes, whether they are applicable to either or both items. Then, the
Occurrence rating is determined for each failure cause for both items. If a specific failure
cause is not applicable to one of the items, then it gets a rating of zero. The sum of the
Occurrence ratings are then calculated for each of the products or systems. The ratio of
this sum is an indicator of the relative reliability levels of the two items, and is a good
measure of the degree to which the items are similar.
Recommended Actions
Detectability
Occurrence
Causes
Severity
Failure effects
Failure mode
function
Component
Applicable to:
This section represents the FMEA results for both the new and old
systems, and lists a cumulative set of failure modes, causes, etc.
Old System
New System
Raw field reliability data has been a very popular source of data on which to base
reliability estimates. This similar data can be based on a specific companys own field
experience on previous products or systems, or it can be a pooled set of data based on a
variety of companies and organizations. As an example of the latter, one of the RIACs
most popular documents has been the Nonelectronic Parts Reliability Data, (NPRD)
publication. NPRD is a compilation of observed field reliability data on a wide variety of
components. A summary of NPRD is provided in Section 7.4, to provide the reader with
a guide to the interpretation of this type of data.
For the most part, methodologies such as EPRD (Electronic Parts Reliability Data),
NPRD, MIL-HDBK-217, and 217Plus rely on field data from similar products or systems
in order to make reliability estimates. The manner in which they do this differs, but they
all share the same fundamental type of data as their basis.
2.5.1.2.4.
Models
The use of models derived from empirical data to estimate the reliability of a product or
system is just one option for estimating reliability. Empirical models can be developed
and used by the analyst, or he/she can use empirical models developed by others. Models
developed by others include the industry standards or methodologies that many reliability
analysts are familiar with.
This section of the book deals with such models that are derived from the analysis of
empirical field data. Modeling is the means by which mathematical equations are
developed for the purpose of estimating the reliability of a specific item used and applied
in a specific manner. There are many ways in which models can be derived, and there is
no single correct way to develop these models. There are many such models in
existence. These models are generally easy to use, in that they are of a closed form and
simply require the analyst to identify the appropriate values of the input variables. The
developers of each of these models had their own perspective in terms of the user
community to be served, the variables that were to be modeled, the data that was
available, etc. It is not the intent of this book to review the specifics of these models, or
to compare them in detail. It is the intent, however, to discuss the rationale and options
for development of the models, and to provide some examples.
The analyst must first decide what variables are to be modeled. Factors that should be
considered as indicators of reliability include:
Environmental stresses
Operational stresses
Reliability growth
Time dependency
o Infant mortality
o Wearout
Engineering practices
Technology
o Feature sizes
o Materials
Defect rates
Yields
Which ones are actually included depend on whether the data is available to support the
quantification of a factor, if a valid theoretical basis exists for its inclusion, and whether
the factor can be empirically shown to be an indicator of reliability.
There are always many more potential factors influencing reliability that can realistically
be included in a model. The analyst must choose which ones are considered to be the
predominant reliability drivers, and include them in the model. The next step of model
development is to theorize a model form. This is generally accomplished by attempting
to establish a model consistent with the fundamental physics of reliability. Examples of
the development of several empirically-based models are provided in Chapter 7.
To compare various empirical methodologies, Table 2.5-11 contains the predicted failure
rate of various empirical methodologies for a digital circuit board. The failure rates in
this table were calculated for each combination of environment, temperature and stress.
As can be seen from the data, there can be significant differences between the predicted
failure rate values, depending on the method used. Differences are expected because
each methodology is based on unique assumptions and data. The RIAC data in the last
row of the table is based on observed component failure rates in a ground benign
application.
Table 2.5-11: Digital Circuit Board Failure Rates (in Failures per Million Part Hours)
Environment
Temperature
Stress
ALCATEL
Bellcore Issue 4
Bellcore Issue 5
British Telecom HDR4
British Telecom HDR5
MIL-HDBK-217 E Notice 1
MIL-HDBK-217 F Notice 1
MIL-HDBK-217 F Notice 2
217Plus Version 2.0
RIAC data
Ground Benign
10 Deg. C
70 Deg. C
10% 50%
10%
50%
6.59 10.18 13.30
19.89
5.72
7.09
31.64
35.43
8.47
9.25 134.45 137.85
6.72
6.72
6.72
6.72
2.59
2.59
2.59
2.59
10.92 20.20 94.37 111.36
9.32 18.38 20.15
35.40
6.41
9.83
18.31
26.76
0.28
4.89
3.3
Ground Fixed
10 Deg. C
70 Deg. C
10% 50%
10%
50%
22.08 29.79
32.51
47.27
8.56 10.63
47.46
53.14
16.94 18.49 268.90 275.70
9.84
9.84
9.84
9.84
2.59
2.59
2.59
2.59
36.38 56.04 128.98 165.91
28.31 48.78
45.44
79.46
24.74 40.15
73.63
119.21
0.51
6.04
class of components being considered. For example, for most all electronic components,
the predicted failure rate is found to be a function of operating temperature and applied
electrical stress. In general, the lower the operating temperature and applied electrical
stress, the lower the predicted failure rate will be. Therefore, the parts stress method
includes model factors for these specific stresses. However, if specific stress values
cannot be determined, it is still possible to perform a prediction using the more general
parts count methodology. For the parts count method, model stress levels have been set
to typical default levels to allow a failure rate estimate simply by knowing the generic
type of component (such as chip resistor) and its intended use environment (such as
ground mobile). It should be noted that these reliability prediction handbook approaches
are, by necessity, generic in nature. Actual test or field data from other similar items is
always more desirable, given sufficient similarity, as was discussed previously.
MIL-HDBK-217
MIL-HDBK-217, Reliability Prediction of Electronic Equipment, has historically been
the most widely used of all of the empirically-based reliability prediction methodologies.
The basic premise of the handbook is the use of historical piece part test and field failure
rate data as the basis for predicting future system reliability. The handbook includes
failure rate models for most electronic part types. The latest released version of MILHDBK-217 is F, Notice 2, dated 28 February 19952. The handbook was almost a
casualty of the DoD Acquisition Reform initiative, but it survived primarily because of its
widespread use, the dependency on it throughout the military-industrial complex, and the
lack of a suitable replacement.
Figure 2.5-24 presents a brief example of the MIL-HDBK-217 parts count method, where
the product or system failure rate is the sum of the failure rates of the generic electrical
and electromechanical components of which it is comprised3. Each piece-part failure rate
is derived by assigning typical defaults to the generic component category stress
models. The only factors considered in these parts count component models are (1) the
generic base failure rate for that part type (represented by g) that is based on an assumed
application environment and default temperature, (2) a generic quality factor (q) that is
used to modify this part type base failure rate, and (3) the quantity of that part type used
in the equipment. In the example shown here, the g for a bipolar microcircuit comprised
of between 1 and 100 gates in a 16-pin dual-in-line package used in a ground, fixed (i.e.,
GF) environment and operating at an assumed junction temperature of 60 degrees C is
2
As of the publication date of this book, a Draft version of MIL-HDBK-217G is in development and is expected to be released some
time in 2010.
3
The current version of MIL-HDBK-217 does not predict the reliability of mechanical components or non-hardware reliability
elements, such as software, human reliability, and processes. Field failures of mechanical components and non-hardware items should
not be scored against MIL-HDBK-217 or any other electronics-based empirical methodologies.
0.012 failures per million hours. The quality factor for a parts count prediction is also
determined from a table (not shown in this example).
The parts count prediction approach is intended for use early in the design phase of the
equipment life cycle, prior to the start of detailed design, when there is little known about
the specific characteristics of the parts being used, or how they will be applied (such as
individual operating and environmental stresses).
It uses the Bayesian methodology to combine data from various sources, in a manner
similar to the RAC PRISM/RIAC 217Plus methodology.
PRISM/217Plus
The original RAC PRISM4 system reliability assessment tool was developed and
released by the RAC in January 2000 as a potential replacement for MIL-HDBK-217.
With the subsequent transition to RIAC in June 2005, the RIAC 217Plus methodology
replaced the RAC PRISM tool and added additional component models. As a result, the
217Plus methodology currently addresses all the major component types found in MILHDBK-217. Figure 2.5-27 symbolizes the replacement of the RAC PRISM tool by
217Plus.
CNET/RDF 2000
The CNET/RDF 2000 Reliability Prediction Standard, shown in Figure 2.5-28, covers
most of the same component categories as MIL-HDBK-217.
intensive effort. Failure to invest in this activity, however, will doom a reliability
prediction methodology to eventual irrelevancy and obsolescence.
2.5.1.2.5.
Since field data is critical to the reliability assessment process, it is explored in this
section. The nuances of collecting and interpreting it are discussed. Some of the issues
encountered in collecting field data are discussed in the NPRD discussion included in
Chapter 7. The intent of this section is to present guidelines on how to approach field
data collection.
Good data collection is the key to an effective process for utilizing data obtained from a
reliability tracking system. This information includes:
The intent of this section is to outline a reliability data collection and analysis system that
can provide the data required. Although a reliability tracking system outlined herein has
similarities to a FRACAS program, there are distinct differences. While a FRACAS
program is intended identify the causes of failures so that corrective action can take
place, the program outlined herein is intended to be more comprehensive in that it assists
its user in more than the implementation of corrective actions, as it also provides the data
required to quantify reliability, in accordance with the methodologies outlined in this
book. This concept is illustrated in Figure 2.5-31.
Reliability
TTF
Analysis
Vendor
Selection
Failure
Verification
Warranty
Claims
MTBF
Analysis
Root Cause
Identification
RCM
Implementation
Implement
Design
Improvements
System Information
Parts Breakdown
Maintenance Data
Root Failure
Cause/Analysis Data
Information record consists of population statistics and needs to be updated whenever the
product or system status changes. Such a change occurs when new or modified items are
fielded.
The parts breakdown data element consists of a hierarchical description of the system.
This description is necessary to avoid confusion as to which FRUs (Field Replaceable
Units) belong to which assemblies and the number of FRUs in the assembly, as well as in
the entire system.
The maintenance data element consists of a record of the maintenance action taken to
maintain or repair the system. It also consists of a description of the anomaly, the failure
mode, and the failure mechanism of the failed unit as determined by the maintenance
technician. One record corresponds to a single maintenance action, and there can be any
number of them for each FRU in the system (i.e., a FRU in the system can be replaced
any number of times over the life cycle of the system).
The root failure cause/analysis data element consists of information on the results of the
detailed failure analysis that may be performed on the failed unit. It is a separate record
because not all maintenance actions will result in the failure analysis of a removed unit.
There are two primary interfaces required of the system. The first is the maintenance
technician interface. This interface is the means by which maintenance data is entered
into the database. Ideally, this interface would consist of computers located within the
maintenance facility for direct data entry. The second interface is the one utilized by
individuals that need the results of the data analysis. The flow of the interface to the
system from the perspective of the system user is given in Figure 2.5-33.
System maintenance is
required and maintenance
commences
Maintenance technician
identifies the part
requiring maintenance
and enters the part data
into the database
System user
enters part
breakdown and
maintains system
usage status
Central
Database
User runs
appropriate analysis
and obtains
necessary reliability
metric(s)
Technician performs
required maintenance
Maintenance technician
enters maintenance data
into the database
Figure 2.5-33: Database Information Flow
Important elements of the data system that should be considered for inclusion are
summarized below:
System information
Number of systems fielded
Dates of fielding for each system j
Location of operation (optional)
System Numbers (unique identifier for each system)
Part number
Serial number
Part identification code (unique descriptor of part in hierarchical breakdown of
system; sometimes referred to as a Reference Designator)
Number of parts in the product or system
Applicable Life Unit (i.e. hours, miles, cycles, operations, etc.)
Identification as to if there is an individual elapsed time meter (or miles, cycles,
operations) on the specific part or whether system life units must be used
Manufacturer name
Maintenance Information
A critical element to an effective reliability data collection and analysis system is the
accurate quantification of the failure cause. Not all perceived failures are real failures
and, therefore, it is important to identify whether part removals are indeed true failures.
Figure 2.5-34 illustrates the hierarchy of maintenance actions.
Maintenance Action
Unscheduled
Scheduled
Perform
Routine
Maintenance
Remove/Replace
Correct
Diagnosis
Necessary
Repair
Failure Analysis
Performed to
Identify Root
Cause
Real Failure
Incorrect
Diagnosis
Faulty Unit
Gets Put Back
into Field
False Alarm
Correct
Diagnosis
Cannot
Duplicate
Incorrect
Diagnosis
Unnecessary
Repair
Failure
Analysis
Not
Performed
Analysis
From the data collected and captured in the database, several fundamental reliability
parameters, including those listed below, can be calculated.
For many of these parameters, it is necessary to calculate the number of life units to
which each part has been exposed. This is done by calculating the number of life units on
the part since the last time that the part was replaced. This calculation procedure is
illustrated in Figure 2.5-35.
No
Yes
No
Record the
system life unit
Drenicks Theorem
An important aspect of interpreting field reliability data is distinguishing between
calendar time and operating time. Consider a situation in which five items are fielded at
the same time, as illustrated in Figure 2.5-36. They will each have a failure time (or other
appropriate life unit) that is described by the TTF distribution as a function of operating
time.
1
2
3
4
5
Operating
Time
Failure Times
Figure 2.5-36: Failure Times Based on Operating Time
Now, consider the same five items that were placed in the field at different calendar
times, as illustrated in Figure 2.5-37. They will have the same failure times relative to
their operating time, but the apparent failure times relative to calendar time will be quite
different.
1
2
3
4
5
Calendar
Time
Failure Times
Figure 2.5-37: Failure Times Based on Calendar Time
Furthermore, if the product or system is repairable (in which case the failed items are
replaced upon failure with a new item), an interesting effect occurs in which the apparent
failure rate will reach an asymptotic value that appears to represent a constant failure rate.
This occurs as the time zero values become randomized as items fail and are replaced
with new items.
To illustrate the relationship between the beta value (Weibull shape) and the
instantaneous failure rate as a function of calendar time when parts are replaced upon
failure, a simulation was performed. In this simulation example, the failure rate of 1100
items as a function of calendar time was calculated.
Figures 2.5-38 through 2.5-42 illustrate the results. These figures correspond to Weibulldistributed TTFs with shape parameters of 20, 5, 2, 1 and 0.5, respectively. The time axis
is calendar time, normalized to a time unit of one characteristic life.
The generic approaches covered here in using a physics approach are stress strength
interference models and models from first principals. Each is described below.
2.5.2.1. Stress/Strength Modeling
Stress/strength interference theory is a technique used to quantify the probability that the
strength of an item is less than the stress to which it is subjected. For example, if the
distribution of the strength of an item can be quantified, and the distribution of the stress
it is under can be quantified, the area of intersection of the two stresses represents the
probability that the strength is less than the stress.
This technique is general in nature and applies equally to any situation that the two
distributions can be quantified, as long as the X-axis represents the same variable for both
distributions. The variable can be electrical, such as voltage or current, or it can be
mechanical strength, for example, in units of KPSI.
The goal of any design for robustness effort is to minimize the variance of both
distributions, and maximize the separation of the distribution means. In this manner, the
probability of distribution intersection, or failure, is minimized.
An example of this approach is illustrated in Figure 2.5-43.
Material Properties
Design Dimension
Dimensions
FEA
Extrinsic
Stresses
Modulus
Strength
Data
Stress
CTE
Fatigue
Data
Strength
Probability of
Failure vs Time
Figure 2.5-43: Stress Strength Methodology
In this example, a mechanical item has certain physical properties, for example its
modulus and its coefficient of thermal expansion (CTE). These material properties are
used in addition to the design variable (i.e. dimensions, extrinsic stresses) to estimate the
stresses to which the item is exposed. This stress can be modeled in several ways. One is
the use of handbooks that contain closed-form equations that estimate the stress to which
a material is exposed as a function of dimensions, force, deflections, etc. This is usually
only viable for simple structures. For more complex mechanical structures, finite
element models and analysis (FEA) may be required to simulate stresses.
For the strength portion of the model, two factors need to be considered:
An example of strength as a function of time is the fatigue properties of the material. The
fatigue properties pertain to the strength degradation over time.
At time = 0, the probability of failure is the intersection of the stress and the strength
distributions, as illustrated in Figure 2.5-44.
x =
y =
x =
y =
ux u y
x2 + y2
Standard Normal variant (i.e., the number of standard deviations from the
normal standardized distribution). The value for Z can be obtained
from:
1. Tables of the Standard Normal distribution
2. MS EXCEL formula = Normdist(Z)
the mean of the strength
the mean of the stress
the standard deviation of the strength
the standard deviation of the stress
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
108
In many real situations, distributions other than the Normal are used, requiring alternate
methods of calculating the interference probability. Readily available software tools can
be used for this purpose (Reference 3).
As stated previously, in addition to the probability of failure at t=0, it is also critically
important to understand how this interference between stress and strength behaves as a
function of time. Items will sometimes age (due to mechanisms such as fatigue), which
essentially means that the strength distribution changes such that its mean is lowered.
Assuming that the stress to which the item is exposed remains constant, the result is that
there is more interference, and the failure probability increases with time. To properly
account for this aging phenomenon, the characteristics of this strength distribution and
the interference must be quantified as a function of time. This concept is illustrated in
Figure 2.5-45.
An example of a model that has been successfully used for brittle materials is the
following:
V
P = 1 exp
V0
S 0
t0
where:
P=
probability of failure
m = Weibull slope of the initial strength
S0 = characteristic strength
n=
fatigue constant
V and V0 are volume parameters to account for the effects of size (i.e., they
account for the effect that the more volume or surface area that there is, the more
likely it is to have a strength limiting flaw)
=
stress
Now, if a screen is applied to the material to eliminate defects having strength values
below the applied screen stress threshold (Sth), the probability of failure becomes:
V
P = 1 exp
V0
1 m
t0 n
m
th
t t n
S0
t 0
This is only one example of a stress strength model. Many others can be found in the
literature.
Models such as these can be invaluable in understating the sensitivity of reliability as a
function of the factors accounted for in the model. However, as is the case with any
physics-based model, it is important to validate the model based on empirical evidence.
This is critical because there is ample opportunity to introduce large errors in the
analysis, based on extreme sensitivity to assumptions, sample variability, etc.
Additionally, while the approach may be grounded in physics, the model parameters
usually need empirical data for their quantification.
The premise of First Principals is that the fundamental physics that govern a failure
mechanism can be characterized, and that the reliability of the mechanism can be
accurately predicted from these equations. This is best illustrated with an example from
References 4 and 5. In this example, the reliability of a Fused Biconic Splitter was
modeled. This is a passive optical component used to split optical signals in fiber optic
telecommunication systems. The observed failure mode was a degradation of the
coupling ratio over time.
The original test plan included Accelerated Aging Tests on Fused Splitters for 3
conditions, as shown in Table 2.5-12.
Table 2.5-12: Test Conditions
Test Conditions
Temperature
(C)
85C / 85% RH
85C / 16% RH
45C / 85% RH
X
X
Relative
Humidity
(RH)
X
X
Absolute
Humidity
(AH)
X
X
The X values in the cells indicate which test conditions have a constant value for the
stress indicated in each column. The values were chosen to assess whether relative
humidity or absolute humidity was the predominant mechanism of the failure mode. In
this case, two of the three conditions have equivalent relative humidity and two of three
have equivalent absolute humidity.
The results of the accelerated tests did not agree with a previously hypothesized failure
mechanism that proposed epoxy creep as the coupling ratio drift mechanism. Therefore,
in an effort to obtain a model that was consistent with empirical evidence, the
fundamental physics were investigated. This process is described below:
From optical component physics, it can be shown that the coupling between two fibers is:
c=
3
1
2
32n2 a (1 + 1 / V )2
where:
V ak (n22 n32 )1 / 2
Reliability Information Analysis Center
111
Additionally, the diffusion of water vapor into silica can be represented as:
C (r , t ) = C 01 BnJ 0( jnr / b) exp{ jn 2 [ DH 2O(T )t / b 2 ]}
n =1
where:
Bn 2 /[ jnJ 1( jn)]
The hypothesis of the physical mechanism is that water diffuses into the outer surface of
the fused region very slowly and slightly decreases the index of refraction of this outer
surface. This increases the coupling coefficient, thereby increasing the coupling ratio.
As time goes by, more and more water diffuses in, and the coupling ratio increases until
the device goes out of spec. The amount of water in the silica is simply the number of
water molecules hitting the surface of the silica per unit time (directly proportional to the
absolute humidity) and the diffusion rate at that temperature. Therefore, if the time to
failure at a specific condition is known, the time to failure at a new condition is the
known TTF multiplied by a ratio of the absolute humidity level times the ratio of the
diffusion rates.
The data obtained in the tests were used to estimate the diffusion rate and the temperature
dependence of this diffusion rate, as shown in Table 2.5-13.
Table 2.5-13: Data to Estimate Diffusion Rate
85
ABS HUM
grams H2O
per m3
297.1
DIFFUSION
CONSTANT
cm2/sec
6.63x10-18
45
85
55.4
High Temp/Medium
Humidity Chamber
85
16
Underground
25
Footway Box
15
TEMP
C.
RH
%
High Temp/High
Humidity Chamber
85
Med Temp/High
Humidity Chamber
SERVICE
CONDITION
RATIO
MTBF
(Years)
0.579
2.60x10-19
137
79
56.0
6.63x10-18
85
19.6
3.73x10-20
2691
1559
93
11.9
1.73x10-20
9552
5535
The predictions from the model were then obtained. The predicted and observed
lifetimes are shown in Table 2.5-14.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
112
High Temp/High
Humidity Chamber
High Temp/Medium
Humidity Chamber
TEMP
C.
RH
%
MTBF in
Hours
(Predicted)
MTBF in
Hours
(Measured)
%
Difference
85
85
5072
5072
85
16
26,909
27,500
As can be seen above, the model is extremely accurate in predicting the failure
mechanism behavior.
Models developed from first principals like the one shown in this example can be very
accurate and, thus, beneficial to a reliability program. However, several pieces of
information were required in order to make this approach a viable alternative:
Whether the analyst chooses to evaluate and assess the processes used in the
development of the product or system
The types of data that may be available can be any of the types summarized previously in
this section of the book.
If the product or system under analysis is an evolution of a predecessor item, the field
experience of the predecessor product can be leveraged and modified to account for the
differences between the new product and the predecessor product. A predecessor is
defined as a product or system that is based on similar technology and uses
design/manufacturing processes similar to the new item under development for which a
reliability prediction is desired. In this case, the new product or system is an evolution of
its predecessor. In this analysis, a prediction is performed on both the predecessor item
and the new item under development. These two predictions form the basis of a ratio that
is used to modify the observed failure rate of the predecessor, and account for the degree
of similarity between the new and predecessor products pr systems. The result of the
predecessor analysis is expressed as 1, as presented in Figure 2.6-1.
If enough empirical data (field, test, or both) is available on the new product or system
under development, it can be combined with the reliability prediction on the new item to
form the best failure rate estimate possible. A Bayesian approach is used for this
combination, which merges the reliability prediction with the available data. As the
quantity of empirical data increases, the failure rate using the Bayesian combination will
be increasingly dominated by the empirical data. The result of the Bayesian combination
is defined as 2, as presented in Figure 2.6-1.
The minimum amount of analysis required to obtain a predicted failure rate for a product
or system is the summation of the component estimated failure rates. The component
failure rates are determined from the component models, along with other data that may
be available to the analyst. The result of this component-based prediction is IA,new. This
value can be further modified by incorporating the optional data, resulting in predicted,new,
as shown in Figure 2.6-1. All methods of analysis require that a prediction be performed
on the new product or system under development in accordance with the component
prediction methodology. Predictions based solely on the component analysis should
be used only when there is no field or test reliability history for the new item and no
suitable predecessor item with a field reliability history. In this case, the reliability
model is purely predictive in nature. After a product or system has been fielded, and
there has been a significant amount of operating time, the best data on which to base a
failure rate estimate is field observed data, or a combination of prediction and observed
Reliability Information Analysis Center
115
failure data. In this case, the reliability model yields an estimate of reliability, because
the reliability is estimated from empirical data.
Each element of the 217Plus methodology is further described in the following sections.
IA,predecessor
observed, predecessor
observed, predecessor is the observed failure rate of the predecessor product or system. It is
the point estimate of the failure rate, which is equal to the number of observed failures
divided by the cumulative number of operating hours5.
Optional data
Optional data is used to enhance the predicted failure rate by adding more detailed data
pertaining to environmental stresses, operating profile factors, and process grades (the
concept of process grades is explained in detail in Chapter 7). The 217Plus models
contains default values for the environmental stresses and operational profile, but in the
event that actual values of these parameters are known, either through analysis or
measurements, they should be used. The application of the process grades is also
optional, in that the user has the option of evaluating specific processes used in the
design, development, manufacturing and sustainment of a product or system. If process
grades are not used in a 217Plus analysis, default values are provided for each process
(failure cause), so that the user can evaluate any or all of the processes.
predicted, predecessor
predicted, predecessor is the predicted failure rate of the predecessor product or system after
combining the initial assessment with any optional data, if appropriate.
IA,new
IA,new is the initial reliability assessment of the new product or system. This is the sum
of the predicted component failure rates, and uses the 217Plus component failure rate
models or other methods (such as data from NPRD or other data sources). A reliability
5
Note that operating hours can be replaced by any other life unit, such as calendar hours, miles, cycles, etc. The 217Plus
methodology predicts failure rates in terms of calendar hours. The important point is that all life units used in the assessment must be
consistent.
prediction performed in accordance with this method is the minimum level of analysis
that will result in a predicted reliability value. Applying any optional data can further
enhance this value.
predicted, new
predicted, new is the predicted failure rate of the new system after combining the initial
reliability assessment with any optional data, if used. If optional data is not used, then
1 is the failure rate estimate of the new system after the predicted failure rate of the new
system is combined with the information on the predecessor product (predicted and
observed data). The equation that translates the failure rate from the old product or
system to the new one is:
1 = predicted , new
observed , predecessor
predicted , predecessor
The values for predicted,new and predicted,predecessor are obtained using the component
reliability prediction procedures. The ratio of observed,predecessor /predicted,predecessor inherently
accounts for the differences in the predicted and observed failure rates of the predecessor
system, i.e., it inherently accounts for the differences in the products or systems analyzed
in the component reliability prediction methodology.
This methodology can be used when the new product or system is an evolutionary
extension of predecessor designs. If similar processes are used to design and
manufacture a new item, and the same reliability prediction processes and data are used,
then there is every reason to believe that the observed/predicted ratio of the new system
will be similar to that observed on the predecessor system. This methodology implicitly
assumes that there is enough operating time and failures on which to base a value of
observed,predecessor. For this purpose, the observance of failures is critical to derive a point
estimate of the failure rate (i.e., failures divided by hours). A single-sided confidence
level estimate of the failure rate should not be used.
ai
ai is the number of failures for the ith set of data on the new product or system.
Reliability Information Analysis Center
117
bi
bi is the cumulative number of operating hours for the ith set of data on the new product or
system.
AFi
AFi is the acceleration factor (AF) between the conditions of the test or field data on the
new product or system and the conditions under which the predicted failure rate is
desired. If the data is from a field application in the same environment for which the
prediction is being performed, then the AF value will be 1.0. If the data is from
accelerated test data or from field data in a different environment, then the AF value
needs to be determined. If the applied stresses are higher than the anticipated field use
environment of the new system, AF will have a value greater than 1.0. The AF can be
determined by performing a reliability prediction at both the test and use conditions. The
AF can only be determined in this manner, however, if the reliability prediction model is
capable of discerning the effects of the accelerating stress(es) of the test. As an example,
consider a life test in which the product was exposed to a temperature higher than what it
would be exposed to in field-deployed conditions. In this case, the AF can be calculated
as follows:
AF =
T 1
T 2
where:
b i
bi is the effective cumulative number of hours of the test or field data used. If the tests
were performed at accelerated conditions, the equivalent number of hours needs to be
converted to the conditions of interest, as follows:
bi ' = bi AFi
ao
ao is the effective number of failures associated with the predicted failure rate. If this
value is unknown, then use a default value of 0.5. In the event that predicted and
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
118
observed data is available on enough predecessor products or systems, this value can be
tailored. See the next section for the appropriate tailoring methodology.
2 is the best estimate of the new system failure rate after using all available data and
information. As much empirical data as possible should be used in the reliability
assessment. This is done by mathematically combining 1 with empirical data. Bayesian
techniques are used for this purpose. The technique accounts for the quantity of data by
weighting large amounts of data more heavily than small quantities. 1 forms the prior
distribution, comprised of a0 and ao/1. If empirical data (i.e., test or field data) is
available on the system under analysis, it is combined with 1 using the following
equation:
n
2 =
a 0 + ai
a0
i =1
n
+ bi '
i =1
where 2 is the best estimate of the failure rate, and ao is the equivalent number of
failures of the prior distribution corresponding to the reliability prediction. For these
calculations, 0.5 should be used unless a tailored value can be derived. An example of
this tailoring is provided in the next section.
ao/1 is the equivalent number of hours associated with 1.
a1 through an are the number of failures experienced in each source of empirical data.
There may be n different sources of data available (for example, each of the n sources
corresponds to individual tests or field data from the total population of products or
systems).
b1 through bn are the equivalent number of cumulative operating hours experienced for
each individual data source. These values must be converted to equivalent hours by
accounting for any accelerating effects between the use conditions.
To estimate the value of ao that should be used, a distribution of the following metric is
calculated for all products for which both predicted and observed data is available:
observed, predecessor
predicted, predecessor
The lognormal distribution will generally fit this metric well, but others (for example,
Weibull) can also be used. The cumulative value of this distribution is then plotted.
Next, failure rate multipliers (as calculated by a chi square distribution) are calculated
and plotted. This chi-square distribution should be calculated and plotted for various
numbers of failures, to ensure that the distribution of observed/predicted failure rate
ratios falls between the chi-square values. In most cases, one, two and three failures
should be sufficient. Next, the plots are compared to determine which chi-square
distribution most closely matches the observed uncertainty values. The number of
failures associated with that distribution then becomes the value of a0. Figure 2.6-2
illustrates an example for which this analysis was performed.
Figure 2.6-2: Comparison of Observed Uncertainty with the Uncertainty Calculated With
the Chi-square Distribution
As can be seen from Figure 2.6-2, the observed uncertainty does not precisely match the
Chi-square calculated uncertainty for any of the one, two or three failures used in this
analysis. This is likely due to the fact that the population of products on which this
analysis is based is not homogeneous, as assumed by the chi-square calculation.
However, the confidence levels of interest are generally in the range 60 to 90 percent. In
this range, the chi-square calculated uncertainty with 2 failures most closely
approximates the observed uncertainty. Therefore, in this example, an a0 value of 2 was
used. This value is also consistent with the Telcordia GR-332 reliability prediction
methodology (Reference 6).
The uncertainties represented by the distribution of observed/predicted failure rates are
typical of what can be expected when historical data on predecessor products or systems
are collected and analyzed to improve the reliability prediction process. Using this
example, one can be 80% certain that the actual failure rate for a product or system will
be less than 2.2 times the predicted value.
2.6.1. Bayesian Inference
Figure 2.6-3 depicts the outline of the Bayesian inference approach. The available
information about the model parameter vector, , in the form of prior distribution, f0(),
Reliability Information Analysis Center
121
Model
for
Failure
Data
Failure
Data
Prior
f0 ()
Likelihood
L( Failure Data | )
Posterior
Bayesian
Inference
f () = L( | Failure Data)
f ( ) = f ( DATA) =
f 0 ( ) L(DATA )
f ( ) L(DATA ) d
0
where,
=
f() =
f0() =
In practice, the features of this distribution include the updated marginal and conditional
distribution of each parameter given the provided information. The marginal distribution
of a single parameter is defined by the next equation. The marginal distribution is
estimated by integrating the posterior joint distribution, f(), over the range of other
parameters, as shown. The other important outcome of the posterior joint distribution is
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
122
the conditional distribution of each parameter, when other elements of vector are given.
The conditional distribution is constructed by substituting the known parameters in the
joint distribution, f(). Here again, the function needs to be scaled by a normalization
factor, as demonstrated in the equation below, in order to be consistent with the basic
characteristics of the distribution functions.
f j ( j ) =
f (1 , 2 ,..., j ,..., n )d j
g j j | j =
f ( j , j )
f ( , )d
j
where:
-j =
)
i =
The integrals necessary for Bayesian computation usually require analytic or numerical
approximations. While the computations for non-constant failure rate distributions can
get quite involved, they are relatively straightforward for the exponential distribution.
The method explained in the previous section details this situation.
R(t ) = Ri (t )
i =1
= i
i =1
Event 1
Event 2
0.001
0.002
8.000E-4
0.001
6.000E-4
Event 5
Event 7
9.000E-4
9.000E-4
2.000E-4
Event 6
Event 4
Event 3
0.003
8.000E-4
7.200E-4
1.600E-4
7.200E-4
1.200E-4
5.400E-4
0.002
6.000E-4
5.400E-4
f(t)
f(t)
f(t)
f(t)
f(t)
f(t)
f(t)
0.002
4.000E-4
3.600E-4
8.000E-4
4.000E-4
4.000E-4
2.000E-4
8.000E-5
3.600E-4
4.000E-5
1.800E-4
0.001
2.000E-4
1.800E-4
0.000
0.000
1000.000
2000.000
3000.000
Time, (t)
4000.000
5000.000
0.000
0.000
1000.000
2000.000
3000.000
Time, (t)
4000.000
5000.000
0.000
0.000
1000.000
2000.000
3000.000
Time, (t)
4000.000
5000.000
0.000
0.000
6.000E-4
0.000
0.000
600.000
1200.000
1800.000
2400.000
3000.000
0.000
0.000
1000.000
2000.000
3000.000
4000.000
Time, (t)
Time, (t)
5000.000
0.000
0.000
600.000
1200.000
1800.000
2400.000
3000.000
1000.000
2000.000
3000.000
4000.000
5000.000
Time, (t)
Time, (t)
0.002
f(t)
0.001
8.000E-4
4.000E-4
0.000
0.000
1000.000
2000.000
3000.000
4000.000
5000.000
Time, (t)
need to be performed at the same level of hierarchy. The most important thing is that all
of the critical failure causes are accounted for.
TOP
OR
Event 3
OR
OR
Event 1
OR
Event 2
Event 4
Event 5
Event 6
Event 7
Monte Carlo analysis is a powerful analytical technique that allows for the estimation of
parameters or factors in cases where closed-form statistical derivations are not possible.
This occurs in many reliability engineering analyses, making it an invaluable tool.
Monte Carlo analysis can be used for several purposes:
1. To determine the time to first failure, as in the previous example
2. To determine the probability of failure from a stress/strength interference model
For #2, there are handbooks available which provide estimates of interference probability
based on the individual stress and strength distributions. Or, a statistical simulation can
be performed to estimate the degree of interference via numerical techniques. This is
generally a more efficient and effective way of performing the simulation, given software
tools that are readily available.
The basic principal behind Monte Carlo analysis, as applied to stress/strength interference
analysis is shown here:
1. First, the stress and strength distributions are determined
2. A randomly selected value from each distribution is obtained
3. The randomly selected values are compared, and if the selection from the strength
distribution is less than the selection from the stress distribution, a failure is
considered to have occurred. If it is not, then success is considered to have
occurred.
4. This process is repeated many times, and the number of trials and the number of
failures are counted. The number of trials needs to be large enough to result in a
good estimate of the failure probability. The failure probability is equal to the
total number of failures divided by the total number of trials.
F =
where:
F=
N=
The first step is to randomly select a value from each of the stress and strength
distributions. As an example, consider a normally distributed strength with a mean of 10
and standard deviation of 3, the pdf of which is shown in Figure 2.7-3.
0.14
0.12
0.1
0.08
0.06
0.04
0.02
0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Figure 2.7-3: pdf of Normal Distribution with Mean of 10 and Standard Deviation of
3.
Next, the cumulative function of this distribution is calculated, as shown in Figure 2.7-4.
1
0.8
0.6
0.4
0.2
0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Next, the randomly selected value from the distribution is obtained by:
Selecting a random number between 0 and 1. This number is displayed on the yaxis.
Then, the value on the x-axis corresponding to this y-value is determined a shown
in Figure 2.7-5.
The Weibull distribution is simpler to use than the Normal distribution since an integral
of the pdf is not required to derive the CDF. The closed-form pdf of the Weibull
distribution is:
t
f (t ) =
R(t ) = e
The Weibull distribution is one of the most widely used distributions in reliability
engineering due to its versatility. It also has the advantage of having a closed-form
solution for its cumulative function.
To select a random value from this distribution, a random number between 0 and 1 is
selected, this value is substituted for R(t) and the corresponding TTF is determined from
the equation. In this example, time(t) is shown as the independent variable, but the
specific parameter could be any parameter whose distribution is used in a Monte Carlo
analysis. The inverse cumulative function is shown in Figure 2.7-6, along with the
selection of the random value.
Now, lets consider another application of Monte Carlo simulation. In this example, a
simple relationship between items for a repairable system is shown in Figure 2.7-7. Here,
the items can be failure causes, components or assemblies, in accordance with the level to
which the analysis is performed.
times for each, and governed by the specific distribution of each. The resultant system
availability (Asystem) is shown on the bottom.
A simulation was performed on this hypothetical system using a software tool, the results
of which are shown in Figure 2.7-9. In this case, the following metrics were calculated
from the Monte Carlo analysis:
Ao:
MTBDE:
MDT:
MTBM:
MRT:
% green time:
% yellow time:
Simulations of product reliability, as described above, are generally the best way to
combine life estimates of constituent parts in a system. If a system is comprised of
redundant elements, closed-form equations are available that calculate the effective
failure rate of the redundant elements. However, care must be taken when using these
equations. For example, the manner in which they are generally derived is to calculate
the failure characteristics as time approaches infinity. Only in this manner are closedform solutions possible. The results are effective failure rate estimates that often
underestimate the benefits of redundancy. This is especially true when mission times are
relatively short. As a result, calculating reliability based on the failure probability
examples described above is generally a more sound approach. Additionally, the
availability of software tools has made it much easier to perform these calculations.
2.8. References
1. Production Part Approval Process (PPAP), Third Edition, Daimler-Chrysler,
Ford , General Motors, 1999)
2. Modarres, M., Accelerated Testing, ENRI 641, Univ. of Maryland, May 2005
3. Weibull++, Reliasoft Corp.
4. Colm V. Cryan, James R. Curley, Frederick J. Gillham, David R. Maack, Bruce
Porter, and David W. Stowe, Long Term Splitting Ratio Drifts in Singlemode
Fused Fiber Optic Splitters, NFOEC 95
5. David R. Maack, David W. Stowe and Frederick J. Gillham, Confirmation of a
Water Diffusion Model For Splitter Coupling Ratio Drift Using Long Term
Reliability Data, NFOEC 96
6. Telcordia GR-332, Reliability Prediction Methodology
7. Denson, W.K. and S. Keene, A New System Reliability Assessment
Methodology Final Report, Available from the Reliability Information
Analysis Center, 1998
3.
Fundamental Concepts
The intent of this book is not to cover the basics of probability or reliability theory. The
understanding of some of these fundamental concepts, however, is critical to the
interpretation of reliability estimates. The definition of reliability is a probability, the
value of which is estimated by the techniques covered in this book. Therefore, the basics
of reliability terminology, and the basis for various theoretical concepts are covered in
this section.
p(x5)
Probability - p(xi)
p(x4)
p(x6)
p(x3)
p(x7)
p(x8)
p(x2)
p(x1)
x1
p(x9)
x2
x3
x4
x5
x6
x7
x8
x9
P{x = xi } = p(xi )
Reliability Information Analysis Center
135
A continuous variable is one that is measured on a continuous scale, and its probability
distribution is defined as a continuous distribution. For example, the distribution of the
TTF would be a continuous distribution, since an infinite number of positive time values
can be represented in the distribution. Figure 3.1-2 illustrates a continuous distribution.
P{a x b} = f ( x)dx
a
F (t ) =
f (t )dt
R(t ) = 1 F (t ) = f (t )dt
t
Note that for the reliability, the integral of the pdf is from t to infinity for the
probability of success, as opposed to minus infinity to t as in the case of the failure
probability. The sum of the probability of success and the probability of failure needs to
be 1.0, consistent with the definition of a pdf.
By differentiating the above equation:
dR(t )
= f (t )
dt
The probability of failure in a given time interval between t1 and t2 can be expressed by
the reliability function:
t1
The rate at which failures occur in the interval t1 to t2, the failure rate (t), is defined as
the ratio of the probability that a failure occurs within the interval, given that it has not
occurred prior to t1 (the start of the interval), divided by total the interval length. Thus:
(t ) =
R (t1 ) R (t 2 ) R (t ) R (t + t )
=
(t 2 t1 )R (t1 )
(t )R (t )
where t = t1 and t2 = t + t. The hazard rate, h(t), or instantaneous failure rate, is defined
as the limit of the failure rate as the interval length approaches zero, or:
R(t ) R(t + t )
1 dR(t )
h(t ) = lim(t 0)
=
(t )R(t ) R(t ) dt
Since it was already shown that:
dR(t )
= f (t )
dt
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
138
Then,
h (t ) =
f (t )
R (t )
The hazard rate, h(t), is the rate at which failures occur, providing the item has not
failed before the time h(t) is evaluated
f(t) is the normalized percentage of the population failing in a given time interval
(t), such that the population size times value of f(t) is equal to the number of
failures in the interval of time.
The denominator, R(t), is the probability of survival at t, which is equivalent to
the percentage of the population surviving at time t.
Multiplying R(t) by the population size yields the total number of units surviving until
t. This is the binomial probability, or expected value of the number of survivors at t.
Since this population will have accrued an operating time of RN*t, the denominator is
equivalent to the cumulative operating time on the population in the time interval.
Therefore,
h(t ) =
f (t ) f (t ) N
# failures in t
Failures
=
=
=
= Failure rate
R(t ) R(t ) N # units surviving t item hours
h(t ) =
1 dR(t )
R(t ) dt
Resulting in:
R(t ) = e
h (t )dt
0
This is the general expression for the reliability function. If h(t) can be considered a
constant failure rate (), which is often the case, the equation becomes:
Reliability Information Analysis Center
139
R(t ) = e t
The mean time to failure (MTTF) is the expected value of the time to failure, and is:
If the reliability function can be easily integrated, this is a convenient way to calculate the
mean time to failure (MTTF). If not, then numerical techniques can be used.
If all parts in a population are operated until failure, the mean life is:
n
t
i =1
where:
ti =
n=
MTBF =
T (t )
r
where:
T(t) = total operating time
r=
number of failures
Failure rate and MTBF are applicable only to the situation in which the failure rate is
constant, i.e., the exponential TTF distribution. Per the definitions above, it can be seen
that the failure rate and MTBF are reciprocals of each other:
1
MTBF
The failure rate is the number of failures divided by the cumulative operating time of the
entire population (failure/part hours), whereas the MTBF is the cumulative operating time
of the entire population divided by the number of failures (part hours per failure).
Table 3.1-1 provides an overview of the basic notation and mathematical representations
that are common among the various types of probability distributions.
Table 3.1-1: Probability Distribution Notation & Mathematical Representations
Notation
Pr( X S )
f (x)
Definition
Random Variable
Realization of a Random Variable
Probability That the Random Variable
X is in the Set S
Probability Density Function (PDF)
F (x)
h(x)
Hazard Rate
R (x )
Mathematical Representation
f ( x),
Discrete Distribution
xS
Pr( X S ) =
f ( x ) dx , Continuous Distribution
S
x
f ( w ),
Discrete Distribution
w =0
F (x) = x
f ( w ) dw , Cumulative Distribution
0
h( x) =
f ( x)
1 F(x)
Reliability
f ( x)
R( x)
1 dF( x )
R ( x ) dx
x
R ( x ) = 1 F ( x ) = f ( t ) dt = e
h ( t ) dt
0
E[u( X )]
Expected Value
Mean
Standard Deviation
u(w) f(w),
Discrete Distribution
w =0
E[ u ( X )] =
u ( w ) f ( w ) dw , Continuous Distribution
0
= E (X )
= E[( X ) 2 ]
Note: These definitions are based on the assumption that all realizations of a random variable
must be non-negative.
Covariance is a measure of the extent to which one variable is related to another, and is
expressed as:
Cov( X , Y ) =
(x x )(y y )
i
i =1
n 1
Cov( X , Y )
X Y
n!
(n x )!
n!
x!(n x )!
As an example of permutations and combinations, define n=4 and x=2. The number of
combinations is:
Pr =
n
n!
4!
=
=6
x!(n x )! 2!(4 2)!
Consider these combinations, as illustrated in Table 3.2-1. Here, there are 4 items (n=4),
each of which can have two possible values (blank or x).
Table 3.2-1: Combinations Example
n
1
x
x
x
x
n!
4!
=
= 12
(n x )! (4 2)!
Each set of 2 can be reversed, thus the number of permutations is double the number of
combinations for n=4.
3.2.4. Mutual Exclusivity
Items are mutually exclusive when the occurrence of one event precludes the other. In
other words, if one event occurs, the other cannot. This is the only case in which
probabilities can be added. Mutual exclusivity is defined as:
P(a or b ) = P(a ) + P(b )
where:
P(a or b) =
P(a) =
P(b) =
Mutually exclusive sets are those with no common members, shown in the Venn diagram
in Figure 3.2-2.
An independent event is one in which the probability of one event has no effect on the
other, and is expressed as follows:
P(a and b ) = P(a )P(b )
where:
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
144
P(a and b) =
P(a) =
P(b) =
Non-independent (or dependent) events indicate that the probability of one event is
dependent on the other, as shown:
P(a and b ) = P(a )P(b a )
or
P(a and b ) = P(b )P(a b )
where:
P(a and b) =
P(a) =
P(b) =
P(b|a) =
P(a|b) =
For non-independent (dependent) events, one event may have several different outcomes,
each affecting the other event differently. This situation is mathematically described as:
P (a1 b ) =
P(b a1 ) P(a1 )
P(b a ) P(a )
i
where:
P(b|a1) =
P(b|ai)*P(ai) =
The event set a is mutually exclusive. Therefore their probabilities can be added.
3.2.8. System Models
For independent failure causes, the reliability of a system is the product of the reliability
values for the constituent failure causes, as shown:
R = R1 R2 R3 .........Rn
If the failure rate is constant, the probability of survival for a specific cause is:
R = e t
total = 1 + 2 + 3 + ..........n
The above equations are relevant to a series configuration of items, each with a constant
failure rate. The fault tree representation of this configuration is shown in Figure 3.2-4.
Here, the system reliability is represented by a logical OR gate, since the failure of A or
B or C will cause system failure.
OR
The corresponding reliability block diagram representation for this scenario is shown in
Figure 3.2-5.
A
All possible outcomes for this example are shown in Table 3.2-2.
Table 3.2-2: Combinations of an OR Configuration
A
Output of OR Gate
Fail
Fail
Fail
Fail
Fail
Fail
Pass
Fail
Fail
Pass
Fail
Fail
Fail
Pass
Pass
Fail
Pass
Fail
Fail
Fail
Pass
Fail
Pass
Fail
Pass
Pass
Fail
Fail
Pass
Pass
Pass
Pass
Note that each of these eight possible outcomes in the table are mutually exclusive, in
that there is only one possible way in which each of the eight can occur.
As an example, if events A, B and C have the following reliability values:
RA = 0.95
RB = 0.92
RC = 0.99
The reliability of the series configuration (i.e., the probability of exactly zero failures) of
the three items is:
R = RA RB RC
R = .95 .92 .99 = .87
Now, suppose that several items must fail in order for the system to fail. This scenario is
represented by an AND gate in a fault tree representation, as is shown in Figure 3.2-6.
AND
C
Starting
Block
Ending
Block
A
All possible outcomes for this example are shown in Table 3.2-3.
Table 3.2-3: Combinations of an AND Configuration
A
Fail
Fail
Fail
Fail
Fail
Fail
Pass
Pass
Fail
Pass
Fail
Pass
Fail
Pass
Pass
Pass
Pass
Fail
Fail
Pass
Pass
Fail
Pass
Pass
Pass
Pass
Fail
Pass
Pass
Pass
Pass
Pass
As an example of a slightly more complex situation, consider the fault tree representation
of a system in Figure 3.2-8.
TOP
AND
Event 3
OR
AND
Event 1
OR
Event 2
Event 4
Event 5
Event 6
Event 7
Event 1
Extra
Starting
Block
Event 4
Event 3
Event 2
Event 6
Event 5
Event 7
Combining the series and parallel events yields the following reliability expression for
this configuration
R = (1 (1 R1 )(1 R2 ) )R3 (1 (1 R4 )(1 R5 ) )R6 R7
3.2.9. K-out-of-N Configurations
As an example, let us assume that there are three units operating in parallel, two of which
are required for the system to perform adequately. If R=0.9 and Q=0.1, then the
probabilities associated with each possible combination of outcomes is summarized in
Table 3.2-4.
Table 3.2-4: Example of k-out-of-n Probability Calculations
Probability
Prob
of pass
or fail
of A
Prob
of pass
or fail
of B
Prob
of pass
or fail
of C
Fail
QAQBQC
0.1
0.1
0.1
0.1*0.1*0.1
0.001
Fail
Pass
QAQBRC
0.1
0.1
0.9
0.1*0.1*0.9
0.009
Fail
Pass
Fail
QARBQC
0.1
0.9
0.1
0.1*0.9*0.1
0.009
Fail
Pass
Pass
QARBRC
0.1
0.9
0.9
0.1*0.9*0.9
0.081
Pass
Fail
Fail
RAQBQC
0.9
0.1
0.1
0.9*0.1*0.1
0.009
Pass
Fail
Pass
RAQBRC
0.9
0.1
0.9
0.9*0.1*0.9
0.081
Pass
Pass
Fail
RARBQC
0.9
0.9
0.1
0.9*0.9*0.1
0.081
Pass
Pass
Pass
RARBRC
0.9
0.9
0.9
0.9*0.9*0.9
0.729
Outcome
Fail
Fail
Fail
Total System
Probability
In this example, the probability of each combination of possible outcomes (in this case,
eight) is calculated. Note that the sum of the probabilities for all possible outcomes is
1.0, since each of the eight possibilities is mutually exclusive and their probabilities can,
therefore, be added. This approach of calculating the probability of every possible
outcome is always valid, regardless of whether the reliability values of each of the
elements are the same or not. For example, if two of the three units are required for the
system to perform adequately, the system will pass if there are either no failures or if
there is one failure, as shown below. This is summarized in Table 3.2-5.
Table 3.2-5: Example of 2-out-of-3 Required for Success
Outcome
Probability
Total
Probability
System Pass or
Fail
Fail
Fail
Fail
QAQBQC
0.001
Fail
Fail
Fail
Pass
QAQBRC
0.009
Fail
Fail
Pass
Fail
QARBQC
0.009
Fail
Fail
Pass
Pass
QARBRC
0.081
Pass
Pass
Fail
Fail
RAQBQC
0.009
Fail
Pass
Fail
Pass
RAQBRC
0.081
Pass
Pass
Pass
Fail
RARBQC
0.081
Pass
Pass
Pass
Pass
RARBRC
0.729
Pass
It can be seen that the system will pass with outcomes 4, 6, 7 and 8. Outcomes 4, 6 and 7
correspond to exactly one failure (i.e., there are three ways in which one failure can
occur), and outcome 8 corresponds to exactly zero failures (there is only one way in
which this can occur).
If the probability of failure of all of the units is the same and they are independent, then
the binomial or Poisson distributions can be used:
If the metric used in the reliability analysis is the probability of failure, use the
binomial distribution
If the metric is a failure rate, use the Poisson distribution
Since this example pertains to items with defined probabilities, the binomial distribution
applies. As defined previously:
where:
n=
x=
r=
The probability of exactly no failures (i.e., the first term in the above summation) is:
F (3,0 ) =
n!
3!
x n x
3 33
p q
=
.9 q = 1 * 0.729 = 0.729
(3 3)!3!
(n x )! x!
The probability of exactly one failure (i.e. the second term in the above summation) is:
F (2,1) =
n!
3!
x n x
2 3 2
p q
=
.9 .1 = 3 * 0.81 * 0.1 = 0.243
(3 2 )!2!
(n x )! x!
F ( x; r ) =
n!
(n x)! x!p q
x n x
x =0
Because the first term in the binomial probability expression is the number of
combinations of a specific number of failures (or survivals) occurring, the number of
combinations (as calculated by the first term) essentially adds the probabilities associated
with the mutually exclusive events.
3.3. Distributions
Reliability distributions are at the heart of a reliability model. They represent the
fundamental relationship between the reliability metric of interest (probability of failure,
failure rate, etc.) and the independent variable (TTF, cycles to failure, etc.). This
independent variable is called the life unit. Table 3.3-1 summarizes probability
distributions often used in reliability modeling, along with a description of their primary
uses.
Reliability Information Analysis Center
153
Type
Primary Uses
Discrete
Poisson
Discrete
Exponential
Continuous
Gamma
Continuous
Normal
Continuous
Standard Normal
Continuous
The Standard Normal distribution (Z) is derived from the Normal for
ease of analysis and interpretation (mean = 0; standard deviation =
1).
Lognormal
Continuous
Weibull
Continuous
Student t
Continuous
F Distribution
Continuous
Chi-Square
Continuous
The following section discusses several of the distributions used in reliability assessment.
While the intent of this book is not to cover the statistical aspects of distributions, some
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
154
fundamental concepts are critical to the understanding of the basis for certain techniques
pertaining to reliability assessment, namely confidence level calculations and
demonstrating reliability levels. In particular, the binomial and Poisson distribution are
critical for these purposes.
The binomial distribution is used when there are only two outcomes, such as success or
failure, and the probability remains the same for all trials. The probability density
function (pdf) of the binomial distribution is:
n
f (x) = p x q (n x )
x
where:
n!
n
=
x (n x!)x!
and q = 1 p.
The function f(x) is the probability of obtaining exactly x good items and (n-x) bad
items in a sample of n items, where p is the probability of obtaining a good item
(success) and q (or 1-p) is the probability of obtaining a bad item (failure).
The CDF, i.e., the probability of obtaining r or fewer successes in n trials, is given
by:
r
r
n!
n
F (x; r ) = p x q n x =
p x q n x
x =0 x
x =0 (n x!)x!
f (x ) =
a x e a
x!
where x is the actual number of failures and a is the expected number of failures.
Since the expected number of failures (i.e., the expected value) for the exponential
distribution is t, the Poisson expression becomes:
(
t )x e t
f (x ) =
x!
where:
=
t=
x=
failure rate
length of time being considered
number of failures
The reliability function, R(t), or the probability of zero failures in time t is given by:
(
t )0 e t
R(t ) =
0!
= e t
(t )x e t
x =0
x!
R(x) =
Continuous distributions are used when analyzing time to failure data, since times to
failure is a continuous variable. The most common distributions used in reliability
modeling to describe times to failure characteristics are the exponential, Weibull and
lognormal distributions. These are described in more detail n the following sections.
3.3.1. Exponential
The exponential distribution is most commonly applied in reliability to describe the times
to failure for repairable items. For non-repairable items, the Weibull distribution is
popular due to its flexibility. In general, the exponential distribution has numerous
applications in statistics, especially in reliability and queuing theory.
The exponential distribution describes products whose failure rates are the same
(constant) at each point in time (i.e., the flat portion of the reliability bathtub curve,
where failures occur randomly, by chance). This is also called a Poisson process. This
means that if an item has survived for "t" hours, the chance of it failing during the next
hour is the same as if it had just been placed in service. It is sometimes referred to as the
distribution with no memory. It is an appropriate distribution for complex systems that
are comprised of different electronic and electromechanical component types, the
individual failure rates of which may not follow an exponential distribution.
Since the exponential distribution is relatively easy to fit to data, it can be misapplied to
data sets that would be better described using a more complex distribution.
Table 3.3-2 lists the parameters for the exponential distribution: the probability density
function (pdf), the cumulative distribution function (CDF), the mean, the variance, and
the standard deviation. Another useful parameter of continuous distributions is the 100pth percentile of a population, i.e., the age by which a portion of the population has failed.
The 50% point is the median life. The mean of the exponential distribution is equal to the
63rd percentile. Thus, if an item with a 1000 hour MTBF had to operate continuously for
1000 hours, there would only be a 0.37 probability of success.
As an example, consider a software system with a failure rate () of 0.0025 failures per
processor hour. Its corresponding mean time between failure (MTBF) is calculated as:
MTBF = =
1
= 400 processor hours
0 .0025
Mathematical Expression
(based on failure rate)
f (t) = e t ,
F( t ) = 1 e t , t > 0
Variance
f (t ) =
1
e
, t>0
F (t ) = 1 e
t
,
t>0
2 =
Standard Deviation
t>0
Mathematical Expression
(based on MTBF)
yP =
2 = 2
1
ln(1 P )
R (t ) = e t
y P = ln(1 P)
R (t) = e
The reliability function (i.e., the probability, or population fraction that survives beyond
age t) at 100 and 1000 processor hours is:
R ( t ) = e ( 0.0025 )(100 ) = 0.7788 = 77.88%
R ( t ) = e ( 0.0025 )(1000 ) = 0.0821 = 8.21%
The scale (or characteristic life) parameter, , is the value at which 63rd percentile
of the distribution occurs
The location parameter, (or gamma), is only used in the three parameter version
of the Weibull distribution, and is the value that represents the failure free period
for the item. If an item does not have a period where the probably of failure is
zero, then = 0 and the Weibull distribution becomes a two parameter distribution.
This third parameter is used when there are threshold effects.
Determination of , , and can easily be estimated using Weibull probability
paper or by using available Weibull software programs
A multi-mode version of the Weibull distribution can be used to determine the
points on the bathtub curve where the failure rate is changing from decreasing, to
constant, to increasing
There are two general versions of the Weibull distribution, the first being the twoparameter Weibull and the second being the three-parameter Weibull. The twoparameter Weibull uses a shape parameter that reflects the tendency of the failure rate
(increasing, decreasing, or constant) and a scale parameter that reflects the characteristic
life of items being measured ( 63.2% of the population will have failed). The threeparameter Weibull adds a location parameter used to represent the minimum life of the
population (e.g., a failure mode that does not immediately cause system failure at time
zero, such as a software algorithm whose degrading calculation accuracy does not cause
system failure until four calls to the algorithm have been made). Note that in most cases,
the location parameter is set to zero (failures assumed to start at time zero) and the
Weibull distribution reverts to the two-dimensional case. The three parameter Weibull
distribution is also commonly used to characterize strength distributions (i.e., when using
a stress/strength model), where the -value represents a screen value, or proof test, in
which case this value of stress is applied to the item as a screen. It is also used to model
failure causes that are not initiated until a time equal to the gamma value has passed.
As with the gamma distribution, the definition of Weibull parameters is inconsistent
throughout the literature. Table 3.3-3 illustrates how some sources define these
parameters.
Weibull
Form
Random
Variable
Shape
Parameter
Scale
Parameter
Location
Parameter
3-P
2-P
2-P
3-P
2-P
For much life data, the Weibull distribution is more suitable than the exponential, normal
and extreme value distributions, so it should be the distribution of first resort. The
characteristics of various shape parameters are summarized below:
For shape parameter < 1.0, the Weibull pdf takes the form of the gamma
distribution (see Section 3.7.1.4) with a decreasing failure rate (i.e., infant
mortality)
For shape parameter = 1.0, the failure rate is constant so that the Weibull pdf
takes the form of the simple exponential distribution with failure rate parameter
(the flat part of the reliability bathtub)
For shape parameter = 2.0, the Weibull pdf takes the form of the lognormal or
Rayleigh distribution, with a failure rate that is linearly increasing with time (i.e.,
wearout). This is often used to model software reliability.
For 3 < shape parameter < 4, the Weibull pdf approximately takes the form of the
Normal distribution
For shape parameter > 10, the Weibull distribution is close to the shape of the
smallest extreme value distribution
The basic parameters of the 2-parameter Weibull distribution are presented in Table 3.34. To have the mathematical expressions reflect a 3-parameter Weibull, replace all
values of x with (x-x0), where x0 represents the value as described above.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
162
Mathematical Expression
x
1
x
f ( x ) =
e
,
F(x) = 1 e
Shape parameter
Scale parameter
Failure Rate
( x) =
x>0
Mean
1
= 1 +
Variance
2
1
2 = 2 1 + 1 +
Standard deviation
2
1
= 1 + 1 +
y P = [ ln(1 P )]1
Reliability
R (x) = e
0.5
Figure 3.3-3 provides a graphical example of the Weibull distribution pdf with a
characteristic life of 1000 hours for a variety of shape parameters (). Figures 3.3-4 and
3.3-5 illustrate the hazard rate and probability plot, respectively, for the same values of
the shape parameter.
Figure 3.3-4: Example Hazard Rate Plots for the Weibull Distribution
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
164
For an example, consider that very early in the system integration phase of a large
software development effort, there have been numerous failures due to software that have
caused the system to crash (the predominant system failure cause). Plotting the failure
times of this specific failure mode (other failure modes are ignored for now) on Weibull
probability paper resulted in a shape parameter value of 0.77 and a scale parameter value
of approximately 32 hours. Based on these parameters, the calculated reliability and
failure rate of the software at 10 system hours is expected to be:
(10 ) =
0 . 77
32
10
32
R (10 ) = e
0 .77 1
0.77
= 0.6647
3.3.3. Lognormal
f (t ) =
t 2
1 ln (t )
e
And the standard deviation is:
(e
2 + 2 2
1
2 + 2 2
where and are the mean and standard deviation (SD), respectively, of ln (t).
The lognormal distribution is used in the reliability analysis of semiconductors and the
fatigue life of certain types of mechanical components. This distribution is also
commonly used in maintainability analysis.
The CDF for the lognormal distribution is:
1 ln (t ) 2
F (t ) =
exp
dt
2
0 t 2
t
ln(t )
F (t ) = P Z
ln(t )
R(t ) = P Z >
ln(t )
f (t )
=
h(t ) =
R(t )
tR(t )
where is the standard normal probability function, and and are the mean and
standard deviation of the natural logarithm of the random variable, t.
Figures 3.3-6 through 3.3-8 illustrate the lognormal distribution for a mean value of 1000
and standard deviations of 0.1, 1 and 3. Shown are the pdf, the hazard rate, and the
cumulative unreliability function, F(t), respectively.
Figure 3.3-7: Example Hazard Rate Plots for the Lognormal Distribution
3.4. References
1. Lyu, M.R. (Editor), Handbook of Software Reliability Engineering, McGrawHill, April 1996, ISBN 0070394008
2. Musa, J.D., Software Reliability Engineering: More Reliable Software, Faster
Development and Testing, McGraw-Hill, July 1998, ISBN 0079132715
3. Nelson, W., Applied Life Data Analysis, John Wiley & Sons, 1982, ISBN
0471094587
4. Musa, J.D.; Iannino, A.; and Okumoto, K.; Software Reliability: Measurement,
Prediction, Application, McGraw-Hill, May 1987, ISBN 007044093X
5. Montgomery, D.C., Introduction to Statistical Quality Control 2nd Edition,
John Wiley & Sons, 1991, ISBN 047151988X
6. Shooman, M., "Probabilistic Reliability, An Engineering Approach," McGrawHill, 1968.
7. Abernethy, Dr. R.B., "The New Weibull Handbook," Gulf Publishing Co., 1994.
4.
2.
3.
4.
5.
6.
7.
drawback is that it cannot detect non-linearity in the relationship between the factor and
the response. For example, consider the relationship in Figure 4.3-1.
Conclusion
a-b
c-d
b-d
No relationship
a-d
The number of levels for each factor should be chosen, in part, based on knowledge of
the physics of the manner in which the factor affects the response. Otherwise, there can
be large uncertainty in using the resulting model to interpolate or extrapolate the response
behavior as a function of the factor. For example, if the response under analysis is
Reliability Information Analysis Center
173
corrosion, and the relationship between the factor, temperature, and the corrosion rate is
expected to be governed by the Arrhenius relationship over the entire operating space,
then a two-level temperature test may be appropriate. If, however, it is hypothesized that
there is a temperature threshold within the operating space, then more than two levels
may be required.
Repetition and replication are techniques used to increase the number of runs. The
advantage of increasing the number of runs is that obtaining multiple responses with
exactly the same factor levels is valuable in quantifying the amount of variability and
error in the measurements obtained. Repetition is the practice of repeating the same run
sequentially. Replication is the practice of repeating a set of runs sequentially. Both
practices will result in multiple responses for a given set of factor levels, but the
advantage of replication over repetition is that it is better able to quantify measurement
error in the event when there is a gradually changing parameter in the test or
measurement system.
The full-factorial approach will be used as an example for illustrating the concepts of data
analysis, followed by a discussion of other approaches.
A full-factorial design, an example of which is shown in Table 4.4-1, is the most
comprehensive experimental design. It includes runs which represent all possible
combinations of factor levels. The primary drawback to the full-factorial approach is that
it requires many runs. In some cases, this may be practical, but in many cases, the cost
and time required to carry out the experiments are prohibitive.
Table 4.4-1: Full-Factorial Example
Run
R (response)
R1
R2
R3
R4
R5
R6
R7
R8
The number of required runs is calculated as yx, where y is number of levels per factor
(2, for this example), and x is number of factors (3). In Table 4.4-1, then, the number
of runs is 23=8.
Reliability Information Analysis Center
175
A full-factorial array can be scaled such that the resultant array has the characteristics of
orthogonality. These are referred to as fractional factorial arrays, since only a fraction of
the full-factorial runs are required, yet are still orthogonal. The naming convention for
these arrays is determined from:
La ( y x )
where:
a=
y=
x=
In the previous examples, y and x were the number of factors and the number of
runs, respectively. In the standard DOE nomenclature, however, La refers to the
number of runs. For example, a seven-factor, two-level experiment for which there will
be eight runs is shown in Figure 4.4-3.
FullFactorial
HalfFactorial
(Resolution
= 3)
Temperature
(T)
1
-1
1
-1
-1
1
-1
1
1
1
-1
-1
Main effects
Humidity
Ionic
(H)
contamination (I)
-1
1
1
-1
-1
-1
-1
1
1
1
1
1
-1
-1
1
-1
1
1
-1
-1
-1
1
1
-1
Interactions
T*H
-1
-1
-1
1
-1
1
1
1
1
-1
1
-1
T*I
1
1
-1
-1
-1
1
1
-1
1
-1
-1
1
H*I
-1
-1
1
-1
1
1
1
-1
1
1
-1
-1
Another possible plan would be a half factorial, also shown in Table 4.4-2. Notice that,
for the half-factorial design, the temperature-humidity (T*H) interaction (i.e., the product
of the two) is the same as for ionic contamination (I). Also, the T*I interaction is the
same as H, and the H*I interaction is the same as T. Therefore, this Resolution 3 plan is
incapable of deconvolving the main effects of T, H or I with the interactions of the other
two.
From physics, we know that both humidity and ionic contamination are required for
corrosion. Therefore, the fact that H*I is the same as T (i.e., they are confounded) is
unacceptable, since we would not be able to determine if the lifetime is governed by
temperature, or the combination of humidity and ionic contamination. Therefore, we
need a better DOE test plan. The full-factorial plan would be the best, if it could be
executed, since none of this confounding exists. For the full-factorial plan, notice that
none of the interaction terms are the same as the main effects.
If we were to actually model this failure cause based on the tests defined in these plans,
the general form of the reliability model may be based on the two parameter Weibull
distribution, which is:
R=e
where:
R=
=
=
The characteristic life is then developed as a function of the applicable variables. The
model in this case is:
0
= e e T H I HI
2
where:
0 through 4 = parameter coefficients estimated in the life modeling process
T=
the temperature in degrees K (degrees C+273)
H=
the relative humidity
I=
the ionic contamination
HI =
the product of humidity and ionic contamination
All model parameters, 0 through 4, could be adequately quantified with the fullfactorial design, but not with the half-factorial.
There are many other potential test plans that would be adequate, providing that the
required model variables can be quantified and are not confounded with one another
(Reference 1).
experiment must be kept as constant as possible. Make sure that all results are fully
documented. This also must include any anomalies or potential sources of error that may
have occurred. The order of the runs must be kept intact, per the experimental plan. If
repetition is used, the same run or treatment is repeated sequentially. If replication is
used, then the set of runs to be repeated have been identified in the experimental design.
For in-situ measurements, careful time stamping of the data is required. Life models to
be developed from the collected data often represent parameter degradation data and not
actual TTF data. As a result, a model of degradation rate as a function of time may be
used as the response to predict failure times. All test samples should be carefully stored,
as root-cause failure analysis may be required at some future time.
The means can be pictorially represented, as shown in Figure 4.6-1. This is a convenient
way to illustrate the sensitivity of the response to each factor. Data analysis techniques
more sophisticated than the analysis of means shown here are also often used, and there
are many good software tools available to aid in this analysis. However, if a balanced,
orthogonal design is used, analysis of means can be very straightforward and effective.
After the data has been analyzed, the optimal combination of factor levels can be
determined. The goal of this approach is to determine the factor levels that will result in
minimal variability of the product response and maximum probability of the product
meeting its requirements. This is the payoff in this approach, since it results in a more
robust design.
In this example, if the desirable response is high, then a high value of A and B with a low
value of C provides the best response, as shown in Figure 4.6-3.
4.8. References
1. William Y. Fowlkes and Clyde M. Creveling, Engineering Methods For Robust
Product Design: Using Taguchi Methods In Technology And Product
Development,
5.
This section addresses the topic of life modeling, after the life data has been generated.
Life data modeling is treated as separate topic in this book since its principals pertain to
many of the types of data previously discussed. The purpose of modeling the reliability
of critical components or failure causes was previously described, and includes a variety
of objectives. Life modeling is simply a means of constructing a mathematical model
that predicts, assesses, or estimates the reliability of a product or system. A methodology
was previously presented for developing life models from tests performed at multiple
combinations of stress (DOE Multicell). From this data, a reliability model can be
constructed.
If all samples are tested to failure, or have been tested in exactly the same manner, then
traditional statistical analysis techniques (like regression, F-tests, T-tests, AVOVA, etc.)
will generally suffice for reliability modeling purposes. However, most real world cases
include censored data, unbalanced datasets, uncertain failure times, etc. It is these cases
where life modeling techniques are most effectively used.
Life modeling requires simultaneous characterization of:
1. TTF distributions
2. Acceleration factors (which provide a relative value of the reliability parameter as
a function of the stress level)
Each of these two major elements is discussed in the following sections and presents
more detailed information regarding development of the models after the life data has
been obtained.
Y = u( X 1 , K , X n )
Let denote the parameter to be estimated. Consider functions w(Y) of the statistic,
which might serve as point estimates of the parameter. Since w(Y) is a random variable,
it has a probability distribution. Statisticians have defined certain properties for assessing
the quality of estimators. These properties are defined in terms of this probability
distribution.
A loss function, L[,w(Y)], assigns a number to the deviation between a parameter and an
estimator. A typical loss function is the square of the difference, and is the value used in
least squares regression:
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
186
L[ , w(Y )] = [ w(Y )] 2
Definition
The theoretical percentage (or probability) of an interval estimate containing
the parameter, and in which the endpoints of the interval are constructed from
sample data
The estimate converges to the true value of the parameter as the sample size
increases to infinity
A function of a statistic used to estimate a parameter in a probability model
Estimates of the endpoints of an interval around a parameter
The probability weight for given values of parameters at observed data points
A function that provides a measure of the distance between a parameter value
and its estimator
An estimate that maximizes the probability that given parameter values will
occur at observed data points
An estimator that uniformly minimizes the expected value of the square of the
difference between a parameter and an estimator
Of all unbiased estimators, none has a smaller variance. Sometimes called a
best estimator
The mathematical expectation of the loss function
The number of random variables from which a statistic is calculated
An estimator with a mathematical expectation equal to the parameter being
estimated
Discussion
Process
Maximum
Likelihood
Estimation
(MLE)
Least Squares
Method of
Moments
Bayesian
Simple equations that approximate parameters have been developed and are summarized
in Table 5.2-3, which provides an overview of the parameter estimates for commonly
used distributions.
Table 5.2-3: Parameters Typically Estimated from Statistical Distributions
Distribution
True Parameter
Poisson
Occurrence Rate,
Binomial
Proportion, p
Estimated Parameter
Sample Occurrence Rate: = n / t
n = number of observed failures
t = period (time, length, volume) over which failures are
observed
Sample Proportion: p = x / n
x = number of successful trials
n = number of statistically independent sample units
n
xi
Exponential
Mean,
= x = i =1
n
Sample Mean:
xi = individual times to failure for each of the observations of
sample size n
n = number of statistically independent sample observations
n
x=
Mean,
x i
i =1
n
Sample Mean:
xi = individual times to failure for each of the observations of
sample size n
n = number of statistically independent sample observations
Normal
s2 =
Variance, s2
(x i
x )2
i =1
n 1
Sample Variance:
2
s = sample variance (standard deviation, s, equals (s2)0.5)
xi = individual measurements for each of the observations of
sample size n
n = number of statistically independent sample observations
True Parameter
Estimated Parameter
The estimate of the Weibull shape parameter is:
1.283
=
s
where,
n
2
(x i x )
s = i =1
n 1
Shape Parameter,
0.5
Weibull
x=
xi
i =1
n
s = sample standard deviation
xi = individual times to failure for each observation of sample size
n
n = number of statistically independent sample observations
The estimate of the Weibull scale parameter is:
= exp ( x + ( 0.5772 )( 0.7797 ) s )
Scale Parameter,
The parameter estimates shown in Table 5.2-3 are rather simplistic and easy to use, and
often provide adequate estimates. There are more rigorous techniques available that do a
better, more accurate job of estimating parameters, but their complexity requires the use
of software tools.
The most popular techniques used in reliability modeling are least squares regression and
maximum likelihood. These are described in the next sections.
5.2.2. Least Squares Regression
Least squares regression is often used to estimate model parameters in cases when a
function can be linearized. The following steps are required for this approach:
1. Select the distribution type
2. Linearize the distribution
3. Determine the plotting positions of each data point
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
190
R=e
F=
i 0.3
N + 0.4
where:
i=
N=
For example, if there are ten items, the value of F after the second failure is:
F=
i 0.3
2 0.3
=
= 0.163
N + 0.4 10 + 0.4
The value of F is calculated for each failure. These pairs of x-y points are the values to
which a linear model will be fit.
The values of the slope and intercept are then:
Reliability Information Analysis Center
191
(x x )(y y )
(x x )
2
ln( ) = y x
In this case, y = ln(), and x = time (t)
5.2.3. Parameter Estimation Using MLE
This section addresses the use of Maximum Likelihood Estimation (MLE) techniques for
estimating TTF distribution parameters, such as the parameter of the exponential pdf,
or and of the Normal and lognormal pdf. The objective is to find a point
estimate, as well as a confidence interval, for the parameters of these distributions based
on the data available from test or field observation. Quantification of confidence
intervals is very important in the estimation process because there is almost always a
limited amount of data (e.g., on TTFs), and, thus, we cannot state our point estimation
with certainty. Therefore, the confidence interval is a statement about the range within
which the actual (true) value of the parameter resides. This interval is greatly
influenced by the amount of data available. Of course, other factors such as diversity and
accuracy of the data sources and adequacy of the selected model can also influence the
state of our uncertainty regarding the estimated parameters. When discussing goodnessof-fit tests, we are trying to address the uncertainty due to the choice of the probability
model form by using the concept of levels of significance. However, uncertainty due to
diversity and accuracy of the data sources is a more difficult issue to deal with.
Times-to-failure data are seldom complete. A complete sample is one in which all items
observed have failed during a given observation period, and all the failure times are
known. When n items are placed on test or observed in the field, whether with
replacement or not, it is sometimes necessary (due to the long life of certain components)
to terminate the test and perform the reliability analysis based on the observed data up to
the time of termination.
There are two basic types of possible life observation termination. The first type is time
terminated (which results in Type I right censored data), and the second is failure
terminated (resulting in Type II right-censored data). In the time-terminated life
observation, n units are monitored and the observation is terminated after a
predetermined time has elapsed. The number of items that failed during the observation
time, and the corresponding TTF of each component, are recorded. In the failureterminated life observations, n units are monitored and the observation is terminated
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
192
when a predetermined number of component failures have occurred. The time to failure
of each failed item, including the time that the last failure occurred, are recorded.
The MLE method is one of the most widely used methods for estimating reliability model
parameters. I n the first part of this section, a brief historical review of the MLE method
is presented. The likelihood function concept for different types of failure data, as well
as the mathematical approach to solve likelihood equations, is presented next. The last
part of this section reviews the basic equations of the MLE approach for specific case
studies, including exponential, Weibull and lognormal distribution likelihood functions.
5.2.3.1. Brief Historical Remarks
The use of regression techniques has many shortcomings when it comes to reliability
modeling. In particular, it is weak when it comes to analyzing interval or censored data.
The Maximum Likelihood Estimation method was originally introduced by Fisher
(Reference 5).
Fisher used the conditional probability of occurrence for each failure event as a measure
for his mathematical curve fitting. He argued that, using a subjective assumption about
the TTF model, one can characterize the probability of each failure event, conditioned to
the model parameter. He then derived the posterior probability of failure events in a
Bayesian framework using a uniform distribution as a prior for the model parameters. He
later calculated the best estimate for model parameters by maximizing the posterior.
Note that, in a Bayesian framework, a uniform distribution cancels out from the equation
since it is a constant. The normalizing factor in the denominator is also a constant, which
has no impact when one is interested in the extremes of the function. Therefore, this
method was eventually called the maximum likelihood estimator, because it is basically
the likelihood function that is maximized in this process.
5.2.3.2. Likelihood Function
where:
=
N=
M=
K, L =
Ti =
ti =
f (t) =
R (t) =
Tai =
Tbi =
LF =
LR =
LL =
LI =
Using the notion of the conditional probability density function, f(t|), helps to integrate
many different types of failure data into the likelihood function. For example, the
likelihood of the right-censored observations will be the reliability function, because this
is the probability that the component remains reliable up to the censored time. Therefore,
the likelihood of M independent right-censored observations will be the product of the
reliability functions as illustrated in the second equation above. For left-censored times
(that is, the time before which a failure has occurred) the likelihood is also the definition
of probability of failure at that time. In the case of many left-censored times, the total
likelihood will be the multiplication of the likelihood values of individual components
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
194
using the independency assumption, as shown in the third equation on the previous page.
For interval times, the likelihood is the probability of having one failure in that interval
(which is basically the integral of the probability density function between the upper and
lower abounds of interval). This is simply the difference between the cumulative
distribution function when it is evaluated at the upper and lower bounds, respectively, as
shown in the final equation from the previous page. Assuming the independency of
failure or censored time events, these likelihood functions can be multiplied with the
likelihood of the complete failure data in order to build the likelihood function for the
entire population.
5.2.3.3. Maximum Likelihood Estimator (MLE)
The likelihood function is used differently in the Bayesian and MLE frameworks. In the
Bayesian method, the prior knowledge that is available for the model parameters is
updated using this function as the conditional likelihood of data. In the MLE approach,
the most probable set of values of the parameter vector, , s estimated by maximizing this
likelihood as a standalone function.
The practical way to find the modes of the likelihood function is derivation. A
multivariable function has its maximum value at a point in which the first-order partial
derivative of the function with respect to each variable becomes zero, as shown below:
= 0
1
=0
= ln( L ) 2
= (1 , 2,..., n )
...
= 0
n
where:
=
=
Note that the likelihood function, as explained before, is based on a multiplication format.
This makes the derivation process very complex. The likelihood is always positive, so
Reliability Information Analysis Center
195
one may take the natural logarithm of this function to convert these multiplication
operators to summation. This will significantly reduce the mathematical derivation
complexity, while still providing the same best estimates for the mode of the likelihood
function. Constructing the likelihood function, L, as explained before, one may set up the
equations that need to be solved for the modes of this function.
In the following sections, three examples of the likelihood function, representing the
exponential, Weibull and lognormal distributions, are presented for further clarification.
5.2.3.3.1.
Exponential Distribution
If failures are expected to randomly occur at a constant rate in time, the TTF distribution
follows an exponential distribution. The exponential distribution assumes a constant
hazard rate for the item. This constant hazard rate is the only parameter of the
exponential distribution. The likelihood of complete failure and right-censored data, as
explained in previous sections, can be represented based on the probability density and
the cumulative distribution functions of the exponential distribution. The equation below
shows the log-likelihood function in case of F complete (i.e., failed) and S rightcensored (i.e., survived or suspended) observations.
F
L = N i ln e
i =1
t i
) N T
S
j =1
The only variable in this equation is . In the MLE method, the best estimate of is
evaluated by maximizing the likelihood (or log-likelihood) function. The next equation
shows the criteria to estimate the best estimate of . The uncertainty of the calculation
can be illustrated as confidence bounds over , which is calculated using the
corresponding local Fisher information matrix. This step will be explained in detail later.
F
L
S
1
= N i ti N jT j = 0
i =1
j =1
5.2.3.3.2.
Weibull Distribution
The Weibull distribution can be used for non-repairable hardware units exhibiting
increasing, decreasing, or constant hazard rate functions. Similar to the lognormal
distribution, it is a two-parameter distribution and its estimation, even in the case of
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
196
complete (uncensored) data, is not a trivial problem. It can be easily shown that, in the
situation where all r units out of n observed units fail, the log-likelihood estimates of
the Weibull distribution are represented by the equation below:
1 t i
ti
L = N i ln e
i =1
S
Tj
N j
j =1
The best estimate of the parameters and are made using the first derivative of the
log-likelihood function, as shown in the next equations below. The best estimates will be
the unique answer for the set of two equations and two unknowns, as shown:
F
T T
L 1 F
ti F ti ti S
= Ni + Ni ln Ni ln N j j ln j = 0
i =1
i =1 j =1
i =1
L
=
Ni +
i =1
F
S
Tj
ti
N
N
+
i
j
i =1
j =1
=0
Note that, despite the complexity of the mathematical representations of the likelihood
and log-likelihood functions and their derivatives, the basic concept is fairly simple. In
advanced numerical approaches using computers, the entire mathematical derivation is
done through numerical simulations using predefined tool boxes and library functions.
5.2.3.3.3.
Lognormal Distribution
In the case of estimating the parameters of the lognormal distribution, the only difference
is in the construction of the likelihood function for which the pdf and CDF of the
distribution are used for complete and suspended failure data, respectively. The equation
below shows the log-likelihood function for a combination of complete failure and
suspended (right-censored) data.
F
1 ln(ti ) S
ln(T j )
L = N i ln
+ N j ln1
j =1
i =1
ti
Having the log-likelihood function of failure data, the MLE approach can be executed
using the first derivative approach, as explained in previous sections. The first derivative
Reliability Information Analysis Center
197
of the log-likelihood function with respect to the mean and standard deviation is
illustrated in the following two equations:
ln(T j )
L
1 F
1 S
=0
= 2 Ni (ln(ti ) ) + N j
i =1
ln(T j )
j =1
F
(ln(t i ) )
L
= Ni
i =1
3
ln(T j ) ln(T j )
1 1 S
=0
Nj
j =1
ln(T j )
where:
(x) =
(x ) =
2
1
2
1
(x)2
e 2
1
(t )2
e 2 dt
The capital in the above equation is basically the cumulative Normal distribution,
which is defined as the integral of the small (i.e., normal pdf). The derivative of the
CDF always becomes the pdf, since the derivative operator cancels out the integration.
5.2.4. Confidence Bounds and Uncertainty
Since point estimates are constructed from data that exhibits random variation, these
estimates will not be exactly equal to the unknown population parameters. Confidence
bounds provide a convention for making statements about the random variation in the
estimates of parameters.
5.2.4.1. Confidence Bounds with MLE
The variance and covariance of the parameters, calculated using MLE equations, can be
found using the local Fisher information matrix. Fisher assumed a Normal distribution
for the parameters when deriving these equations. Using the following local information
matrix, one can relate the likelihood function to the variance and covariance of the model
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
198
parameters. The next equation represents these uncertainties for a general case for which
there are n parameters in the likelihood function.
Var (1 )
...
Cov (1 , 2 )
Cov (1 , n )
...
Var (2 )
Cov (2 , n )
Cov ( 2 , 1 )
1
.
.
.
.
= [F ]
.
.
.
.
...
...
Cov (n , n 1 )
Var (n )
(17)
where:
Var =
Cov =
=
F=
2
2
2 1
2
1
F=
.
...
1 2
2
2
2
.
.
...
...
...
.
.
2
n n 1
1 n
2
2 n
2
2
n
Having the variance and the best estimate of each parameter, one may estimate the
uncertainty bounds for any given confidence bounds. Note that the important underlying
assumption here is independency, as well as the Normal distribution for all parameters.
5.2.4.2. Confidence Bounds Approximations
Tables 5.2-4 through 5.2-11 present a summary of equations for calculating the
confidence bounds around the parameters for various distributions.
Reliability Information Analysis Center
199
True Occurrence
Rate,
L = 0.5 2 [1 ; 2 n ] / t
L = 0.5 2 [(1 ) 2 ; 2 n ] / t
Normal Approximation
When n is large (say, >10)
z ( / t ) 0.5
L z ( 1+ )
( / t ) 0.5
U + z ( / t ) 0.5
U + z ( 1+ )
( / t ) 0.5
U = 0 .5 2 [ ; ( 2 n + 2 ) ] / t
U = 0.5 2 [(1 + ) 2 ; ( 2 n + 2 ) ] / t
Given: Given the observed rate of occurrence above, the prediction for the future rate of occurrence is:
y = s = (n / t ) s
where,
n, t = as defined above
s=
period (time, length, volume) over which future observation is predicted
Poisson Limits (approximate only)
Closest integer solutions for yL and yU from the following equations
( n + 1)
F [ ; ( 2 n + 2 ); 2 y U ]
s
t
s
t
= F [ ; ( 2 y L + 2 ); 2 n ]
( y L + 1) n
yU
Future Occurrence
Rate, y
( n + 1)
F [(1 + ) 2 ; ( 2 n + 2 ); 2 y U ]
s
t
s
t
= F [(1 + ) 2 ; ( 2 y L + 2 ); 2 n ]
( y L + 1) n
yU
Normal Approximation
When n and y are large (e.g., each is > 10)
(
)
( s ( t + s ) t )
y L y z s ( t + s ) t
0.5
y U y + z
0.5
( s ( t + s ) t )
( s ( t + s ) t )
0.5
y L y z ( 1+ )
y U y + z ( 1+ )
0.5
where,
x=
number of successful trials
n=
number of statistically independent sample units
Binomial Limits (approximate only):
Exact confidence levels cannot be conveniently obtained for discrete distributions
1
1 + ( n x + 1)(1 x ) F [ ; ( 2 n 2 x + 2 ); 2 x ]
1
pU =
1 + ( n x )(1 (( x + 1) F [ ; ( 2 x + 2 ); 2 n 2 x ]
pL =
True Proportion,
p
pL =
1
1 + ( n x + 1)(1 x ) F [(1 + ) 2 ; ( 2 n 2 x + 2 ); 2 x ]
pU =
1
1 + ( n x )(1 (( x + 1) F [(1 + ) 2 ; ( 2 x + 2 ); 2 n 2 x ]
Normal Approximation
When x and n-x are large (e.g., each is > 10)
p L p z ( p (1 p ) / n ) 0.5
p L p z ( 1+ ) 2 ( p (1 p ) / n ) 0.5
p U p + z ( p (1 p ) / n ) 0.5
p U p + z ( 1+ ) 2 ( p (1 p ) / n ) 0.5
Poisson Approximation
When n is large and x is small (e.g., when x < n/10)
p L 0.5 2 [(1 ); 2 x ] n
p U 0 .5
[ ; 2 x + 2 ]
p L 0.5 2 [(1 ) 2 ; 2 x ] n
p U 0.5 2 [(1 + ) 2 ; 2 x + 2 ] n
Given: Given the observed probability above, the prediction for the number of y future category units
is:
y = mp = m ( x / n )
where,
x, n = as defined above
m=
future sample size
Normal Approximation
When x, n-x, y and m-y are all large (say, > 10)
[
]
0.5
y U y + z [m p (1 p )( m + n ) n ]
0.5
y L y z m p (1 p )( m + n ) n
Prediction of
Future
Probability of
Success, y
[
]
(1 p )( m + n ) n ]0.5
2 [m p
y L y z (1+ ) 2 m p (1 p )( m + n ) n 0 .5
y U y + z ( 1+ )
Poisson Approximation
When n is large and x is small (e.g., when x < n/10)
Closest integer solutions for yL and yU from the following equations
( x + 1)
F [ ; 2 x + 2; 2 y U ]
n
m
n
= F [ ; ( 2 y L + 2 ); 2 x ]
( y L + 1) x
yU
m
( x + 1)
F [(1 + ) 2 ; ( 2 x + 2 ); 2 y U ]
n
m
n
= F [(1 + ) 2 ; ( 2 y L + 2 ); 2 x ]
( y L + 1) x
yU
m
= x =
xi
i =1
where,
xi =
n=
2 nx
L =
[ ; 2 n ]
2 nx
U =
2 [(1 ); 2 n ]
2 nx
[(1 + )
2 ;2n ]
2 nx
2 [(1 ) 2 ; 2 n ]
2nx
2 [;2(n + 1)]
2nx
U = 2
[(1 );2(n + 1)]
L =
L =
U =
2n x
[(1 + ) 2 ;2(n + 1) ]
2n x
2
[(1 ) 2;2(n + 1) ]
2
exp z
U x * exp z
L
n
exp z ( 1+ )
U x * exp z ( 1+ )
Given: The estimate of the true population failure rate, , is the sample failure rate:
1
1
= = n
xi
i =1
where,
hat=
sample mean
xi =
individual times to failure for each of the observations of sample size n
n=
number of statistically independent sample observations
Exponential Limits (exact) for Failure Truncated Tests
True value of the
failure rate,
L =
U =
2 [(1 ); 2 n ]
1
=
2 nx
U
2 [ ; 2 n ]
1
=
2 nx
L
L =
U =
2 [(1 ) 2; 2 n ]
1
=
U
2 nx
2 [(1 + ) 2; 2 n ]
1
=
L
2 nx
p=
True value of
the 100-pth
percentile, yp
y p ,L = L * ln(1 p ) =
y p , U = U * ln(1 p ) =
2 nx * ln(1 p )
y p ,L = L * ln(1 p ) =
2 [ ; 2 n ]
2 nx * ln(1 p )
y p , U = U * ln(1 p ) =
2 [(1 ); 2 n ]
2 n x * ln(1 p )
2 [(1 + ) 2 ; 2 n ]
2 n x * ln(1 p )
2 [(1 ) 2 ; 2 n ]
Given: The usual estimate of the reliability, R(t), at any age, t, is:
R * (t) = e (t
x)
where,
R=
t=
( {
= exp ( t * {
R L ( t ) = e ( t L ) = exp t * 2 [ ; 2 n ] 2 nx
True value of
reliability at
end of period,
R(t)
R U (t ) = e
( t / U )
[(1 );2n ]
})
2 nx
})
( {
( {
R L ( t ) = e ( t L ) = exp t * 2 [(1 + ) 2; 2 n ] 2 nx
x=
x i
i =1
where,
xi =
n=
L = x t [ ; n 1] * s
n
U = x + t [ ; n 1] * s
n
})
})
R U ( t ) = e ( t / U ) = exp t * 2 [(1 ) 2; 2 n ] 2 nx
L = x t [(1 ) 2; n 1] * s
U = x + t [(1 ) 2; n 1] *
n
s2 =
where,
s2=
xi =
n=
(x i
x )2
i =1
n 1
n 1
L = s*
2
[ ; n 1]
0.5
n 1
U = s*
2
[(1 ); n 1]
0.5
n 1
L = s *
2 [(1 + ) 2 ; n 1]
0.5
n 1
U = s*
2
[(1 ) 2 ; n 1]
0.5
where,
R=
reliability as a function of time, distance, etc.
t=
period at which reliability is assessed (time, distance, etc.)
(z) = estimate of the fraction of a population failing by age t
True value of
reliability at end
of period, R(t)
R L ( t ) = 1 FU ( t ) = 1 ( z U )
R L ( t ) = 1 FU ( t ) = 1 ( z U )
where ,
where ,
(x x)
z=
s
2
z
1 + z (n / 2)
zU z +
n 1
n
z=
0 .5
zU z +
2
z (1+ ) 2
1 + z (n / 2)
n 1
n
0 .5
R U ( t ) = 1 FL ( t ) = 1 ( z L )
R U ( t ) = 1 FL ( t ) = 1 ( z L )
where ,
where ,
(x x)
z=
s
2
z
1 + z (n / 2)
zL z
n 1
n
(x x)
s
z=
0. 5
(x x)
s
2
z (1+ ) 2
1 + z (n / 2)
zL z
n 1
n
0 .5
n 1
0.5
x=
xi
i =1
where,
s=
sample standard deviation
xi =
individual times to failure for each observation of sample size n
n=
number of statistically independent sample observations
Weibull Limits (approximate)
Limits are crude unless n is quite large (say, n > 100)
True value of
the Weibull
shape
parameter,
1
1.049 z
0.7797s * exp
0.7797s
1.049 z
exp
1
1.049z (1+ ) 2
0.7797s * exp
0.7797s
1.049z (1+ ) 2
exp
(1.081)( 0.7797 ) s
(1.081)( 0.7797 ) s
the Weibull
L exp ( x + 0.45 s ) z ( 1+ ) 2
L exp ( x + 0.45s ) z
n
n
scale
parameter,
(1.081)( 0.7797 ) s
(1.081)( 0.7797 ) s
U exp ( x + 0.45s ) + z
U exp ( x + 0.45 s ) + z ( 1+ )
R * (t ) = e
where,
True value
of reliability
at end of
period, R(t)
R=
reliability as a function of time, distance, etc.
t=
period at which reliability is assessed (time, distance, etc.)
=
Weibull scale parameter
=
Weibull shape parameter
Limits are crude unless n is quite large (say, n > 100)
One-sided approximate Weibull limits:
0.5
t ( x + 0.45s)
t ( x + 0.45s)
1
.
168
(
1
.
1
)
(
0
.
1913
)
t ( x + 0.45s)
0.7797s
0.7797s
RL (t ) = exp exp
+ z
n
0.7797s
0.5
t ( x + 0.45s)
t ( x + 0.45s)
1
.
168
(
1
.
1
)
(
0
.
1913
)
t ( x + 0.45s)
0.7797s
0.7797s
RU (t ) = exp exp
z
n
0.7797s
t ( x + 0.45s )
t ( x + 0.45s )
1
.
168
(
1
.
1
)
(
0
.
1913
)
t ( x + 0.45s)
0.7797 s
0.7797 s
RL (t ) = exp exp
+ z(1 ) 2
0
.
7797
s
n
0.5
t ( x + 0.45s)
t ( x + 0.45s)
1
.
168
(
1
.
1
)
(
0
.
1913
)
+
t ( x + 0.45s)
0.7797s
0.7797s
RU (t ) = exp exp
z(1 ) 2
n
0.7797s
Accelerated testing is often used for this purpose, in which case tests are performed at
stress levels higher than the item will experience in use, to speed up failure processes.
Acceleration models consist of two generic types:
Physical Acceleration Models: For well-understood failure mechanisms, one
may have a model based on physical/chemical theory that describes the failurecausing process over the range of the data and provides extrapolation to use
conditions.
Empirical Acceleration Models: Empirical acceleration models are used when
there is little understanding of the chemical or physical processes leading to
failure, and a model can be empirically determined to describe the observed data.
5.3.1. Fundamental Acceleration Models
In practice, the acceleration models used are a combination of physical and empirical, in
that theory may be used to determine the appropriate form of the acceleration model, but
the specific model constants are almost always determined empirically.
There are four basic forms of accelerated life models. Combinations of these are also
possible:
The linear model is:
y = ax + b
y = be ax
The power law model is:
y = bx a
The Logarithmic model is:
y = a ln(x) + b
In all of these equations, y is the dependent variable, usually either lifetime (as
measured by characteristic life or mean life, depending on the TTF distribution used), or
failure rate. Since the failure rate is the reciprocal of the mean life (in the case of the
Reliability Information Analysis Center
207
exponential distribution), the constant a will generally be positive in one case and
negative in the other.
The most commonly used reliability models are the power law and exponential models.
Several points regarding acceleration models are:
5.3.1.1. Examples
L Ae
Ea
KT
where:
L=
A=
Ea =
T=
the lifetime
a life constant
the activation energy in eV
the absolute temperature in degrees Kelvin
It can be seen that this is the exponential model, with the reciprocal of temperature used
as the stress. The Arrhenius model is the most widely used for evaluating the effect of
temperature on reliability. It is applicable to situations in which the failure mechanism is
a function of the steady state temperature, such as corrosion, diffusion, etc. Notable
observations about the Arrhenius acceleration model are that:
In the formative years of the electronics industry, many failure mechanisms were
related to corrosion and contamination, which are inherently chemical reaction
rates for which the Arrhenius factor applies reasonably well
It has since been applied to many other failure mechanisms, with an assumed
applicability
Eyring
The Eyring model is:
1 A
L e T
T
Coffin-Manson
A form of fatigue life strain models is the Coffin-Manson life vs. plastic strain, which
is often used for solder joint reliability modeling:
T
AF = S
TU
where:
AF =
TU =
TS =
=
acceleration factor
product temperature in service use, K
product temperature in stress conditions, K
constant for a specific failure mechanism
where:
Nf =
A=
ep =
=
Since T ep, a simplified acceleration factor for temperature cycling fatigue testing
is:
T
N
AF = use = test
N test Tuse
The Coffin-Manson model is also sometimes used to model the acceleration due to
vibration stresses. Random vibration input and response curves are typically plotted on
log-log paper, with the power spectral density (PSD) expressed in squared acceleration
units per hertz (G2/Hz), plotted along the vertical axis, and the frequency (Hz) plotted
along the horizontal axis.
G2
f 0 f
P = lim
In the above equation, G is the root mean square (RMS) of the acceleration, expressed
in gravity units, and f is the bandwidth of the frequency range expressed in hertz.
Since G is the agent of failure that causes fatigue, the following inverse power model
applies:
L(G )
1
1
Life =
G
KG
The acceleration factor for vibration based on Grms for similar product responses is
represented by:
G
N
AF = use = test
N test G use
Acceleration models with more than one accelerating variable might be suggested when it
is known that two or more potential accelerating variables contribute to degradation and
failure. Several examples follow.
L(U ,V ) =
C
n
U e
B
V
where:
U=
non-thermal stress (i.e., voltage, vibration, etc.)
V=
temperature (in K)
B, C, n = parameters to be determined
The T-NT relationship can be linearized and plotted on a Life vs. Stress plot by taking the
natural logarithm of both sides:
B
V
Here, the log of the life is equal to a linear relationship, where the intercept is ln(C), the
slope of ln(U) is n and the slope of 1/V is B.
The acceleration factor for the T-NT relationship is given by:
B
AF =
LUse
LAccelerated
C Vu
e
U un
B
C VA
e
U An
where:
LUse =
LAccelerated =
Vu =
VA =
Uu =
U B V V
= A e u A
Uu
UA =
Temperature-Humidity Models
A variation of the Eyring relationship is the Temperature-Humidity (TH) relationship.
This combination model is expressed as:
L(V , U ) = Ae
b
+
V U
ln[L(V , U )] = ln( A) +
b
U
AF =
LUse
LAccelerated
Ae
Ae
b
+
V U
u
u
+
VA U A
=e
1 1 1
1
+ b
Vu V A U u U A
where:
LUse =
LAccelerated =
Vu =
VA =
Uu =
UA =
Peck Model
The Peck model (Reference 6) is:
n
L ( RH ) e
Ea
KT
where:
RH =
T=
n=
Ea =
K=
Relative Humidity
temperature
constant
activation energy
Boltzmans constant = 8.617 x 10-5 eV/K
Note that this is a multiplicative model consisting of a power law for humidity and the
Arrhenius model for temperature.
The British Telecom Model
The British Telecom model, also used in the Telcordia standards (Reference 7) is:
Le
Ea
2
KT + n ( RH )
This model includes the effects of both temperature and relative humidity.
Harris Model
Wearout data published by the Harris Corporation shows a good fit to Pecks model
(Reference 8) in representing aluminum corrosion. This model is:
AF = e
Ea
1
1 RH S
T
T
RH U
S
U
VS
VU
where:
AF =
Ea =
k=
TU =
acceleration factor
activation energy
Boltzmans constant = 8.617 x 10-5 eV/K
product temperature in service use, K
Reliability Information Analysis Center
213
TS =
RHU =
RHS =
VU =
VS =
a=
b=
In estimating fatigue life for materials, the model is used as the analytical representation
of the so-called S-N curves, where S is stress amplitude and N is life (in cycles to
failure), such that N = kS-b, where b and k are material parameters either estimated
from test data or published in handbooks.
Miners Rule
Miners rule states that the amount of damage sustained by a metal is proportional to the
number of cycles it experiences, as follows:
k
ni
N
i =1
=C
There are k stress levels (one for each contribution n cycles), N is the total number
of cycles at a constant stress reversal, and C is usually assumed to be 1.0. It essentially
estimates the percentage of life used by each stress reversal at each specific magnitude.
5.3.3. Cumulative Damage Model
Many situations arise in which there is cumulative damage inflicted on an item when
subjected to a stress. For those situations where a Weibull distribution is appropriate, the
reliability function is expressed as:
R (t ) = e
where:
R(t) = reliability the probability of survival at time t
=
Weibull shape parameter (in time space)
=
Characteristic life as a function of the stressor
If it is assumed that the acceleration can be described by a power law, then:
a
=
S
where:
S=
a=
n=
stressor
life constant
fatigue exponent in time space
R (t ) = e
n
a
S
te = t1 1
S0
where:
te =
S0 =
This cumulative damage model is particularly useful when the stresses are time varying,
since an equivalent amount of damage can be estimated per unit time, regardless of the
behavior of stress as a function of time. This model is also consistent with fatigue, which
is essentially a cumulative damage scenario.
R (t ) = e
where:
R(t) = reliability the probability of survival at time t
=
Weibull shape parameter (in time space)
=
Characteristic life as a function of stressor:
And, if the acceleration model is the power law:
a
=
S
where:
S=
a=
n=
stressor
life constant
fatigue exponent in time space
R(t ) = e
n
a
S
The modeling process estimates , a and n. Once these parameters are estimated, the
life distribution for any stress level can be obtained.
5.4.1. Likelihood Functions
The likelihood functions for the six combinations of distribution (exponential, Weibull,
lognormal) and acceleration model (Arrhenius, Inverse Power Law) are provided below.
Exponential-Arrhenius Reaction Rate Model:
B
M
B
t
T
L = N i ( ln(C ) i e Vi ) N i Ri e Vi
Vi
C
C
i =1
i =1
N
B
B
TbieVBi
T e Vi
T e Vi
ai C
Li C
C
K
L
e
+ N i Ln 1 e
+ N i Ln e
i =1
i =1
i =1
+ N i Ln(1 e
i =1
KS inTLi
i =1
i =1
Weibull-Arrhenius:
N
L = N i ln B
i =1
Ce Vi
ti
B
Vi
Ce
LiB
K
Vi
+ N i ln1 e Ce
i =1
t
i B
Vi
Ce
M
TRi
N
i B
i =1 Ce Vi
T
T
aiB
biB
L
Ce Vi
Ce Vi
N
ln
e
e
+
i
i =1
Weibull-IPL:
N
L = N i ln KS in KS in t i
i =1
e(
1 KS nt
i i
) N (KS nT )
i i Ri
M
i =1
n
n
n
+ N i ln1 e (KSi TLi ) + N i ln e (KSi Tai ) e (KSi Tbi )
i =1
i =1
Lognormal-Arrhenius:
B
B
ln(TRi ) ln(C )
ln(t i ) ln(C ) M
Vi
Vi
1
L = N i ln
+ N i ln1
i =1
i =1
B
B
B
ln(TLi ) ln(C ) L
ln(Tbi ) ln(C )
ln(Tai ) ln(C )
K
Vi
Vi
Vi
+ N i ln
+ N i ln
i =1
i =1
Lognormal-IPL:
N
1 ln(t i ) + ln( K ) + n ln(S i )
L = N i ln
i =1
t i
M
i =1
i =1
L
ln(Tbi ) + ln( K ) + n ln(S i )
ln(Tai ) + ln( K ) + n ln(S i )
+ N i ln
i =1
Exponential IPL:
2L 2L 2L
,
,
K 2 n 2 K n
Weibull Arrhenius:
2L 2L 2L 2L 2L 2L
,
,
,
,
,
2 B 2 C 2 B C BC
Weibull IPL:
2L 2L 2L 2L 2L 2L
,
,
,
,
,
2 K 2 n 2 K n Kn
Reliability Information Analysis Center
219
Lognormal Arrhenius:
2L
2L 2L 2L 2L 2L
,
,
,
,
,
B 2 C 2 2 B C B C
Lognormal IPL:
2L 2L 2L 2L 2L 2L
,
,
,
,
,
K 2 n 2 2 K n K n
The likelihood function will yield a value for all possible combinations of parameter
values. A useful tool in data analysis is a plot of the likelihood value. As an example,
Figure 5.4-1 illustrates a contour plot of the likelihood value for an exponential-IPL
model.
In this example, the plot lines represent values of equal likelihood as a function of the
two parameters of interest (i.e., the Weibull slope and the exponent in the power law
acceleration model). The center position represents the combination of beta and n at
which the maximum value of likelihood occurs. The height of the likelihood value
increases as the center of the contour lines is approached. The spread in the contour lines
of equal likelihood are proportional to the uncertainty in the parameter estimates, and in
fact are one way to estimate confidence bounds on the model parameters. Also, the
dispersion of the likelihood values on the n axis can be thought of as the spread of the
TTFs in the stress dimension, and the dispersion of the likelihood values on the beta
axis can be thought of as the spread of the TTFs in the time dimension.
5.5. References
1. Lyu, M.R. (Editor), Handbook of Software Reliability Engineering, McGraw-Hill,
April 1996, ISBN 0070394008
2. Musa, J.D.; Iannino, A.; and Okumoto, K.; Software Reliability: Measurement,
Prediction, Application, McGraw-Hill, May 1987, ISBN 007044093X
3. Musa, J.D., Software Reliability Engineering: More Reliable Software, Faster
Development and Testing, McGraw-Hill, July 1998, ISBN 0079132715
4. Nelson, W., Applied Life Data Analysis, John Wiley & Sons, 1982,
ISBN0471094587
5. Fisher, R. A., 1912, On an Absolute Criterion for Fitting Frequency Curves,
Messenger of Mathematics, Vol. 41, pp. 155-160. [Reprinted in Statistical
Science, Vol. 12, (1997) pp. 39-41.]
6. Peck, S., IRPS tutorial, 1990
7. Telcordia GR1221
8. Peck and Hallberg, Quality and Reliability Engineering International, 1991
9. Hald, A., 1999, On the Maximum Likelihood in Relation to Inverse Probability
and Least Squares. Statistical Science, Vol. 14, No. 2, pp. 214-222.
10. Accelerated Life Testing Analysis (ALTA), Reliasoft Corp.
6.
This chapter presents topics related to the interpretation of various aspects of reliability
models. It is hoped that this information will provide the reader with information that
allows for a better intuitive understanding of reliability predictions, assessments and
estimations.
generally fail earlier than those in the main population. The shape of the failure rate
curve is decreasing, with its rate of decrease dependent on the maturity of the design and
manufacturing processes, as well as the applied stresses.
Useful Life. The second portion of the bathtub curve is known as the useful life and is
characterized by a relatively constant failure rate caused by randomly occurring failures.
It should be noted that the failure rate is only related to the height of the curve, not to the
length of the curve, which is a representation of product or system life. If items are
exhibiting randomly occurring failures, then they fail according to the exponential
distribution, in accordance with a Poisson process. Since the exponential distribution
exhibits a constant hazard rate, we can simply add the failure rates for all items making
up an item to estimate the overall failure rate of that item during its useful life.
Wearout. The last part of the curve is the wearout portion. This is where items start to
deteriorate to such a degree that they are approaching, or have reached, the end of their
useful life. This is often relevant to mechanical parts, but can also apply to any failure
cause that exhibits wearout behavior.
It is important to understand the difference between the MTBF of an item and the useful
life of that same item. Items that experience wearout failure modes/mechanisms will
have some period of useful life before they fail as a result of wearout. This useful life is
not the same as the item MTBF. During useful life, an item may also experience
randomly occurring freak failures caused by weak components or faulty workmanship,
especially if the item is subjected to high stress conditions. The occurrence of these
random failures during an items useful life results in higher failure rates, or lower
MTBF, for that item.
Mechanical items are usually most prone to wearout and, therefore, we are usually most
concerned with the useful life, or MTTF, associated with these items. Electronic items
usually become obsolete long before any significant wearout takes place6. Therefore, the
infant mortality and constant failure rate portions of the bathtub curve are of the most
interest for these items.
The bathtub curve conceptually offers a good view of the three primary types of failure
categories. It is essentially a composite failure rate curve comprised of three generic
types of failure causes. In practice, however, the well defined curve of Figure 6.1-1 is
rare. The actual curve for a product or system will depend on many factors. A specific
6
It should be noted, however, that with the progressively decreasing feature sizes of current state-of-the-art microelectronic devices,
the issues associated with wearout and useful life are becoming of greater concern.
failure cause will generally exhibit characteristics of only one segment of the bathtub
curve, but when the characteristics of all of the other failure causes for that product or
system are considered, and a composite model is generated, the curve will have a shape
that deviates from the classic bathtub curve, even though it will often contain elements of
each of the three portions. Usually, the composite curve will be dominated by the
characteristics of those failure causes that dominate the overall reliability of the item.
It is also important to note that defects do not always manifest themselves as infant
mortality failures. They can appear to be infant mortality, random or wearout, depending
on the specific characteristics of the failure mechanism and factors, such as defect
severity distributions.
The model developers ability to identify the variables (component- or userelated) that most heavily influence reliability
The level of detailed data to which the model user has access
The quantity and quality of the data on which the models are based
The accuracy of a reliability model is a strong function of the manner in which defects
are accounted for. Therefore, there is a trade-off between the usability of the model and
the level of detailed data that it requires. This highlights the fact that the purpose of a
reliability prediction must be clearly understood before a methodology is chosen.
Practical considerations for choosing an approach will inevitably include the types and
level of detail of information available to the analyst. Given the practical time and cost
constraints that most reliability practitioners face, it is usually important that the chosen
reliability prediction methodology be based on data and information accessible to them.
Model developers have long known that many of the factors which had a major influence
on the reliability of the end product were not included in traditional methods like MILHDBK-217, but under the constraints of handbook users, these factors could not be
Reliability Information Analysis Center
225
included in the models. For example, it was known that manufacturing processes had a
major impact on end item reliability, but those are the factors which corporations hold
most proprietary. As an example of this, a physics-of-failure-like model was developed
several years ago for small-scale CMOS technology. This model required many input
variables, such as metallization cross-sectional area, silicon area, oxide field strength,
oxide defect density, metallization defect density etc. While the model has the potential
to be much more accurate than the other MIL-HDBK-217 models, it is essentially
unusable by anyone other than the component manufacturers who have access to such
information. The model is useful, however, for these manufacturers to improve the
reliability of their component designs.
The two primary purposes for performing a quantitative reliability assessment of systems
are (1) to assess the capability of the parts and design to operate reliably in a given
application (robustness), and (2) to estimate the number of field failures or the probability
of mission success. The first does not require statistically-based data or models, but
rather sound part and materials selection/qualification and robust design techniques. It is
for this purpose that physics approaches have merit. The second, however, requires
empirical data and models derived from that data. This is due to the fact that field
component failures are predominantly caused by component and manufacturing defects
which can only be quantified through the statistical analysis of empirical data. This can
be seen by observing the TTF characteristics of components and systems, which are
almost always decreasing, indicating the predominance of defect-driven failure
mechanisms. The handbook models described in this book provide the data to quantify
average failure rates which are a function of those defects.
It has been shown that system reliability failure causes are not driven by deterministic
processes, but rather by stochastic processes that must be treated as such in a successful
model. There is a similarity between reliability prediction and chaotic processes. This
likeness stems from the fact that the reliability of a complex system is entirely dependent
upon initial conditions (e.g., manufacturing variation) and use variables (i.e., field
application). Both the initial conditions and the use application variables are often
unknowable to any degree of certainty. For example, the likelihood of a specific system
containing a defect is often unknown, depending on the defect type, because the
propensity for defects is a function of many variables and deterministically modeling
them all is virtually impossible. However, the reliability can be predicted within bounds
by using empirically based stochastic models.
A critical factor that must be considered when choosing a reliability assessment method
is whether the failure mechanism under analysis is a special cause or a common cause
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
226
mechanism. In other words, a special cause mechanism means that there is an assignable
cause to the failure and that only a subpopulation of the item is susceptible to this failure
mechanism. Common cause mechanisms are those affecting the entire population.
Table 6.1-1 summarizes the characteristics of various categories of failure causes, and
identifies whether they are typically common cause or special cause. The categories of
failure types encompass the ways a failure cause can manifest itself. These are also
categories that can be used in a FMEA.
Table 6.1-1: Categories of Failure Effects
Failure cause type
Always (Common
Cause)
Sometimes (Special
Cause)
Design
Not
Capable
Process
Not
Capable
Random
Failure
Wearout
x
x
If it is erroneously assumed that special cause mechanisms will affect the entire
population, gross errors in the reliability estimates of the population will result. This
error results from the assumption of a mono-modal TTF distribution when, in fact, the
actual distribution is multimodal.
If the distribution is truly mono-modal, only the parameters applicable to a single mode
distribution need to be estimated. However, if there are really several sub-populations
within the entire population, the parameters of each of the distributions needs to be
estimated, along with the percentage of the entire population represented by each
distribution.
This is especially critical when dealing with defects. In this case, it is critical to
understand the percentage of the population that is at risk of failure. To illustrate this,
consider the probability plot in Figure 6.2-1. As can be seen in this plot, there is an
apparent knee in the plot at about 400 hours, an indication of several subpopulations.
If a mono-modal distribution is assumed (i.e., the straight line), errors in the cumulative
percent fail at a given time will occur. Likewise, if a multimodal distribution is assumed,
a much more accurate representation of the situation results (the line through the data
points).
Probability - Weibull
99.000
Probability-Weibull
Data 1
Weibull-Mixed
MLE SRM MED FM
F=98/S=139
Data Points
Susp Points
Probability Line
90.000
Unreliability, F(t)
50.000
10.000
5.000
1.000
0.500
0.100
10.000
100.000
1000.000
10000.000
100000.000
Bill Denson
Corning
1/15/2008
5:24:32 PM
1000000.000
Time, (t)
[1]=1.3341, [1]=307.1460, [1]=0.0646; [2]=0.7505, [2]=2.1367+4, [2]=0.4240; [3]=4.2735, [3]=1.1624+5, [3]=0.5114
1
0.60
61.1
0.42
2
0.59
918.7
0.58
Probability - Weibull
9 9. 9 00
9 0. 0 00
Unreliability, F(t)
5 0. 0 00
1 0. 0 00
5. 0 00
1. 0 00
0. 5 00
0. 1 00
0 . 1 00
1. 00 0
10 . 00 0
1 00 . 00 0
1 000 . 0 00
T ime, (t)
F olio 1\1-. 5: [ 1 ] =0 .6 0 1 0 , [ 1 ] =6 1 .0 8 6 0 , [1 ]= 0 .4 2 1 9 ; [ 2 ]= 0 .5 9 3 6 , [2 ]= 9 1 8 .6 9 3 5 , [ 2 ] = 0 .5 7 8 1
10 00 0. 00 0
1
0.86
341.4
0.63
2
1.4
863.25
0.37
Probability - Weibull
99 . 9 0 0
90 . 0 0 0
Unreliability, F(t)
50 . 0 0 0
10 . 0 0 0
5.000
1.000
0.500
0.100
0. 1 00
1 . 00 0
1 0. 00 0
10 0 . 0 00
1 0 00 . 00 0
T ime, (t)
F o lio 1 \1 -1 : [1 ]= 0 .8 6 3 3 , [1 ]= 3 4 1 .4 6 5 6 , [1 ]= 0 .6 3 0 3 ; [2 ]= 1 .4 0 6 2 , [2 ]= 8 6 3 .2 7 6 7 , [2 ]=0 .3 6 9 7
1 0 00 0. 0 00
1
1.81
98.44
0.19
2
1.23
679.4
0.81
Probability - Weibull
9 9. 9 00
9 0. 0 00
Unreliability, F(t)
5 0. 0 00
1 0. 0 00
5. 0 00
1. 0 00
0. 5 00
0. 1 00
0.100
1. 00 0
1 0. 00 0
1 0 0. 00 0
1 00 0. 0 00
T ime, (t)
F o lio 1\5-1 : [1 ]= 1 .8 1 4 4 , [1 ]= 9 8 .4 4 2 8 , [1 ]= 0 .1 8 8 1 ; [ 2 ]= 1 .2 3 8 5 , [2 ]=6 7 9 .4 4 6 9 , [ 2 ] =0 .8 1 1 9
10 0 00. 0 00
1
1.18
206.2
0.19
2
4.69
497.6
0.81
Probability - Weibull
99 . 9 0 0
90 . 0 0 0
Unreliability, F(t)
50 . 0 0 0
10 . 0 0 0
5.000
1.000
0.500
0.100
0. 1 00
1 . 00 0
1 0. 00 0
10 0 . 0 00
1 0 00 . 00 0
T ime, (t)
F o lio 1 \. 5-5: [1 ]=1 .1 8 0 8 , [1 ]=2 0 6 .1 9 6 8 , [1 ]= 0 .1 9 4 0 ; [2 ] =4 .6 9 4 3 , [2 ]= 4 9 7 .6 3 5 9 , [2 ]=0 .8 0 6 0
1 0 00 0. 0 00
1
5.71
44.7
0.10
2
4.29
483.7
0.90
Probability - Weibull
99.900
90.000
Unreliability, F(t)
50.000
10.000
5 . 0 00
1 . 0 00
0 . 5 00
0 . 1 00
0. 10 0
1. 00 0
10 . 0 0 0
10 0 . 0 00
1 00 0 . 0 0 0
1 0 00 0. 0 00
T ime, (t)
F o lio 1\5 -5 : [ 1 ]=5 .7 1 6 3 , [1 ]=4 4 .7 8 9 1 , [1 ]= 0 .0 9 9 9 ; [ 2 ]= 4 .2 9 3 2 , [2 ]= 4 8 3 .7 0 4 2 , [2 ] =0 .9 0 0 1
Probability - Weibull
99.900
90.000
Unreliability, F(t)
50.000
10.000
5 . 0 00
1 . 0 00
0 . 5 00
0 . 1 00
0. 01 0
0 . 1 00
1.000
1 0 . 0 00
1 00 . 00 0
10 0 0. 00 0
1 0 00 0. 0 00
Bill De nson
Co rning
3 /14 /2 0 10
9 :43 :2 8 PM
10 0 000 . 0 00
T ime, (t)
[1 ]=0 .7 2 0 8 , [1 ]=4 3 2 .4 6 1 4 , [1 ]= 0 .6 6 9 4 ; [ 2 ]= 4 .4 2 3 2 , [2 ]= 4 9 1 .0 1 2 0 , [2 ]=0 .3 3 0 6
3500
Number of Deaths
3000
2500
2000
1500
1000
500
0
0
20
40
60
80
100
120
Age
Mode 1
Mode 2
Mode 3
Mode 4
Beta
0.184
4.25
4.74
9.61
Eta
0.1030
24.81
67.84
87.67
Portion
0.0090
0.012
0.194
0.784
0.032
f(t)
0.024
0.016
0.008
0.000
0.100
22.100
44.100
66.100
88.100
110.100
Time, (t)
0.032
0.024
0.016
0.008
0.000
0.100
22.100
44.100
66.100
88.100
110.100
Time, (t)
The Weibull probability plot is shown in Figure 6.2-11. Note that, in this graph, the plot
is shown using Weibull scales, i.e. the log of time on the x-axis and double log of
unreliability on the y-axis. If this plot was close to a straight line, it would indicate that
the distribution could be described adequately with a mono-modal Weibull distribution.
Clearly, this is not the case.
Probability - Weibull
99.990
90.000
Unreliability, F(t)
50.000
10.000
5.000
1.000
0.500
0.100
0.100
1.000
10.000
110.000
Time, (t)
90.000
50.000
10.000
Unreliability, F(t)
5.000
1.000
0.500
0.100
0.050
0.010
0.005
0.001
1.000
10.000
110.000
Time, (t)
The traditional manner in which confidence levels are calculated around failure rates is
the use of the chi-square distribution, as follows:
2 (1 CL ,2 r + 2 )
2t
where the numerator is a value taken from a chi-square table, and t is the number of
device hours. A question sometimes arises as to how the confidence bounds calculated in
this manner compare to those calculated with the use of the Poisson distribution.
From the binomial and Poisson distributions, Farachi (Reference 2) has shown that:
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
238
n!
(1 q )nk q k
k =0 k!(n k )!
r
1 CL =
n!
(1 q )n k q k (nq ) e nq
k !(n k )!
k!
k
(
(
(nq)r
nq) nq
nq)
nq
1 CL =
e = e 1 + nq + +
+
(r 1)! (r )!
k!
k =0
Since:
nq = t
Then:
(
(
t )k t
t )r 1 (t )r
t
1 CL =
e = e 1 + t + +
+
(
)
(r )!
r
1
!
k =0 k!
The chi-square value is the exact solution to the above equation. The chi-square values
are for t, not alone. Therefore, for a given confidence level and number of
failures, the chi-square tables provide the value for t. Therefore, the chi-square values
are entirely consistent with the Binomial and Poisson distributions.
It is important to note that the confidence bounds based on the chi-square distribution
summarized above pertain to the uncertainty from statistical considerations alone. They
do not account for variations in failure rate due to other noise factors, such as:
One of the limitations of reliability predictions that are based on handbook models is that
they can only provide point estimates of failure rates. These failure rates are based on
whatever data was available to make up the model, and the model development approach.
There are no statistical confidence limits or intervals that can be associated with
handbook model data. Traditional methods are not applicable because there are many
more factors contributing to the uncertainty than the statistical-only considerations of
traditional techniques.
For example, consider the following summary of the model development and use
approach, along with the potential sources of error, as shown in Figure 6.3-1. The
sources of error are highlighted in the gray boxes. From this, it can be seen that there are
many sources of noise. The model output results reflect the cumulative effects of the
uncertainties in all of the noise sources shown.
Although a theoretical basis for the calculation of the confidence bounds around
reliability predictions is extremely difficult to derive, it is possible to empirically observe
the degree of uncertainty. Reliability predictions performed using empirical models
developed from field data result in a failure rate estimate with relatively wide confidence
bounds. Table 6.3-1 presents the multipliers of the failure rate point estimate as a
function of confidence level. This data was obtained by analyzing data on systems for
which both predicted and observed data was available. For example, using traditional
approaches, one could be 90% certain that the true failure rate was less than 7.57 times
the predicted value.
Unmodeled Noise
Factors
Modeled Factors:
Environmental:
Temperature
Humidity
Delta T
Radiation
Contaminants
Operational Profile:
Duty Cycle
Cycling Rate
Operating Stress
Electrical Stress
Mechanical
Extreme Events
User
Item Information:
Manufacturing Date
Quality
Defect Rate
Environmental Stresses:
Temperature
Humidity
Delta T
Radiation
Contaminants
Operational Profile:
Duty Cycle
Cycling Rate
Operating Stress
Electrical Stress
Mechanical
Extreme Events
Model
Development
Model
Censored Data;
Biased Estimators;
Assumptions Made
in Modeling
Model
Output
Multiplier
0.13
0.26
0.44
0.67
1.00
1.49
2.29
3.78
7.57
An interesting effect occurs when combining the distributions that describe the
uncertainties of the individual components comprising a system. The uncertainties are
wider at the piece-part level than at the system level. If one were to take the distributions
of failure rate from the regression analysis used to derive the component model (i.e.,
standard error estimate), and statistically combine them with a Monte Carlo summation,
the resultant distribution describing the system prediction uncertainty will have a
variance much smaller than that of the individual components comprising the system.
The reason for this is the effect of the Central Limit Theorem which quantifies the
variance of summed distributions. For example, the variance around the component
failure rate estimate is higher than the variance suggested by the above table. However,
the variance in the above table is observed to be much larger than that theoretically
derived by summing the component failure rate distributions. This implies that there are
system-level effects that contribute to the uncertainty that are not accounted for in the
component-based estimate.
Bayesian techniques, such as those used in the 217Plus system reliability assessment
methodology, allow the refinement of analytical predictions over time to reflect the
experienced reliability of an item as it progresses through in-house testing, initial field
deployment and subsequent use by the customer. In-house testing can be comprised of
accelerated tests at the component or equipment level, reliability growth tests, and
reliability screens or accelerated screening techniques.
We will not discuss Bayesian methods in detail here. The primary benefit of using
Bayesian techniques can be implied from Figure 6.3-2, however. As more and more test
and experience data is factored into the initial analytical reliability prediction, the
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
242
statistical confidence levels represented by the outside (red) lines on the graph continue
to converge on the True MTBF of the subject item. Using Bayesian techniques, as
time approaches infinity the predicted inherent MTBF and the true MTBF of the device,
product or system population become one and the same. This, of course, assumes that
MTBF is the appropriate metric, but the same situation conceptually applies to other
metrics such as failure rate and reliability (R).
Prediction
Assessment
Estimation
Paper
Analysis
MTBF
In-House
Testing
Field
Data
Upper Confidence
True MTBF
Lower
Confidence Level
TIME
R=e
( t )
= 0 .999912
A multimode distribution can be used to model this situation, the first mode being
applicable to the defects, and the second being applicable to the main population.
Two FITs is defined as 2.0 failures per billion hours. This corresponds to 0.002 failures per million hours.
However, in many cases, it is only the first mode that will impact the field reliability
within the useful life of the component.
Some researchers have attempted to use extreme value statistics for such cases, but they
also have limited usefulness because the data on low failure rate items, like electronic
components, is generally not consistent with these distributions. As a result, low failure
rate items are usually modeled with a constant failure rate (exponential distribution), or a
Weibull distribution. The Weibull is usually used in this case to model the effects of
infant mortality.
6.6. Weibayes
There are many cases in reliability modeling in which there are few or no failures. For
these, a Weibayes technique can be used. This approach is practical when there are few
or no failures and a reasonable shape parameter can be estimated. This approach
essentially fixes a plotting position using:
1. One failure assumed at the end of the test duration
2. A line drawn through the median rank point with an assumed beta
The result of this analysis is a lower single-sided bound of the life distribution. As an
example, consider the following case:
1. 50 samples are tested for 1000 hours, with no failures
2. Data from other testing indicates a beta of 3 is appropriate
Reliability Information Analysis Center
245
3. The median rank at 1000 hours is 1.39%. A line is drawn through this point with
a beta slope of 3.
This is shown in Figure 6.6-1.
n 1
s =
i =1 i
where i and represent the Weibull distribution parameters for individual items. This is
applicable when is the same, but i can be different for each item.
(t ) = d (t )hd
G
Gth
h
where:
(t) =
d(t) =
hd =
G/h =
Gth =
Since hd and Gth are random variables described by distributions, (t) can generally be
estimated with a Monte Carlo analysis, as described earlier in this book.
In this case, the conditional probability of failure if the device is dropped is:
hd
G
Gth
h
In this example, the objective is to estimate the reliability of the assembly, which is
comprised of two components. Component A has physics-based models available for
two of the three primary failure causes.
An estimate of the failure rate of component A is:
A preliminary = 1 + 2 + 3
where 1, 2 and 3 are the failure rates obtained from the model or data available on each
failure cause. Of course, these values should represent the failure rate under the use
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
248
conditions for which the assessment is to be made. In this example, is used, which
indicates a constant failure rate. However, if the failure rates are time-dependent, the
corresponding time-dependent failure rates or hazard rates can be used. Also, the
methodology to be illustrated in this example is similar to the data combination
methodology described in the 217Plus section, the main difference being that this
example deals with the situation in which there are different types of data at different
hierarchical levels of the product or system, whereas the 217Plus methodology deals with
different types of data within the same configuration item.
Now, since Component A has life test data available from tests performed on the
component, A-preliminary is the failure rate estimate before accounting for the life test data
on the entire component. This life data will account for any failure causes not included in
the three failure causes considered, and it will also provide additional data on the three
failure causes considered. A better estimate of reliability can be obtained by combining
A-preliminary with the life test data, using Bayesian techniques. This technique accounts
for the quantity of data by weighting large amounts of data more heavily than small
amounts. A-preliminary forms the prior distribution, comprised of a0 and ao/A-preliminary .
The empirical data (i.e., test data in this case) is combined with A-preliminary using the
following equation:
a0 +
A =
i =1
a0
Apreliminary
b '
i
i =1
A is the best estimate of the Component A failure rate, while ao is the equivalent
number of failures of the prior distribution corresponding to A-preliminary. For these
calculations, 0.5 should be used unless a tailored value can be derived. An example of
this tailoring is provided in the Section 2.6 of this book. The equivalent number of hours
associated with A-preliminary is represented by ao/A-preliminary. The number of failures
experienced in each source of empirical data is a1 through an. There may be n different
sources of data available (for example, each of the n sources corresponds to individual
tests or field data from the population of products). The equivalent number of cumulative
operating hours experienced for each individual data source is b1 through bn. These
values must be converted to equivalent hours by accounting for any accelerating effects
between the use conditions.
The same methodology is applied to Component B, and B is obtained.
Reliability Information Analysis Center
249
The same methodology is, in turn, applied at the parent level assembly, in which case, the
preliminary estimate is:
Assembly= preliminary = A + B
and the parent assembly failure rate becomes:
a0 +
A =
i =1
a0
Assembly- preliminary
b '
i
i =1
N f = t
where:
Nf =
=
t=
N = t =
operating time
Failures
# parts = Failures
operating time
part
# parts
part
N f = N [F (t 2 ) F (t1 )]
where:
Nf i= the number of expected failures
N = the total number of parts in the population
F(t1) = the cumulative probability function at time t1
F(t2) = the cumulative probability function at time t2
t1 and t2 are the times between which the failure probability is to be evaluated
In this case, since F is a (unitless) probability value, the total population is scaled by
the probability of failure in the time interval of interest. This is identical to the expected
value of the binomial distribution of the number of failures.
R = e t
The equivalent failure rate can be obtained by solving the above equation for the failure
rate:
ln(R )
t
The resulting failure rate value is equal to a failure rate that will result in the same
cumulative percent fail as predicted by the non-constant model at the specific time that
the reliability is calculated. If a different time is chosen, a different value will be
obtained.
This technique can be used when the reliability of some parts of a system is calculated
with non-constant failure rate models and others are calculated with a constant failure
rate. It can also be used when modeling one-shot devices, which will simply have a
probability of failure instead of a failure rate.
Mean life
Median life
MTBF
Failure rate
Time to X% fail
B10 life
Distribution parameters:
o Weibull characteristic life and shape parameter
o Lognormal mean and standard deviation
If a constant failure rate distribution is used, there are various units of failure rate
possible. Some of these are:
Failures per hour is the fundamental unit. All of these failure rate units can be
translated to each other with a constant multiplication factor. For example, Failures per
million hours times 1000 equals Failures per billion hours and Percent failure per
thousand hours is equivalent to Failures per ten thousand hours.
In the above cases, the life unit shown is in hours (i.e., time), but it does not necessarily
need to be. Other possible life units are cycles, miles, missions, operations, etc.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
252
Additionally, if the life unit in the above listed metrics is time (hours), it can refer to the
number of operating hours, calendar hours, flight hours, etc. Reliability prediction
methods like MIL-HDBK-217 use operating hours as the life unit, whereas 217Plus uses
calendar hours as the life unit. Calculation of the operating failure rate using MILHDBK-217 makes the implicit assumption that the failure rate during non-operating
periods is zero, unless the non-operating failure rate is otherwise accounted for.
However, in all cases the life unit refers to the cumulative value of the population. For
example, if the failure rate unit of Failures per million hours is used, the million hours
refers to the cumulative time of the entire population, i.e. the sum of each components
number of hours.
Parts (22%): Failures resulting from a part (i.e., microcircuit, transistor, resistor,
connector, etc.) failing to perform its intended function. Examples
include part failures due to poor quality; manufacturer or lot
variability; or any process deficiency that causes a part to fail before
its expected wearout limit is reached.
Software (9%): Failures of a system to perform its intended function due to the
manifestation of a software fault
While there are reliability assessment methods for specific causes listed above, (i.e.,
components, software, etc.) there are few methodologies that attempt to take a holistic
view of system reliability and integrate them into a single methodology. One example of
a methodology that attempts to do this is 217Plus, which is described in Chapter 7.
6.13.2. Selection of Factors
The process of reliability assessment can be viewed as an IPO model, which has input
parameters (I), the process or models used to assess the reliability as a function of those
input parameters (P), and an output (O). This is illustrated in Figure 6.13-2.
Input
Process
Output
Initial
Conditions
Reliability
Metrics
Stresses
Figure 6.13-2: IPO Model
Examples of the IPO variables, as applied to reliability modeling, are shown in Table
6.13-1.
Table 6.13-1: Factors to be Considered in a Reliability Model
Defect-Free
Intrinsic
Initial Conditions
Defects
Extrinsic
Input
Operational
Stresses
Environmental
Voids
Material Property Variation
Geometry Variation
Contamination
Ionic Contamination
Crystal Defects
Stress Concentrations
Organic Contamination
Nonconductive Particles
Conductive Particles
Contamination
Ionic Contamination
Thermal
Electrical
Chemical
Optical
Chemical Exposure
Salt Fog
Mechanical Shock
UV Exposure
Drop
Vibration
Temperature - High and Low
Temperature Cycling
Humidity
Atmospheric Pressure Low and High
Radiation EMI, Cosmic
Sand and Dust
This is the reliability assessment process using the various techniques described in this book.
Mean Life
Median Life
MTBF
Failure Rate
Time to X% Fail
B10 Life
Distribution Parameters
Weibull Characteristic Life and Shape Parameter
Lognormal Mean and Standard Deviation
Another issue facing reliability model developers is the manner in which reliability
growth is accounted for. A good model reflects state-of-the-art technology. However,
empirical models are usually developed from the analysis of field data, which takes time
Reliability Information Analysis Center
257
to collect. The faster the growth, the more difficult it is to derive an accurate (i.e.,
current) model.
As an example of this reliability growth effect, Table 6.13-2 contains, for each generic
component electronic type, the growth rate that has been observed from data collected by
the RIAC. These reliability growth factors are included in the 217Plus component
models. The growth rate model used for each component for this purpose is:
e (t t
1
where:
=
=
t1 =
t2 =
Growth
Rate ()
0.0082
0.229
0.229
0.23
0.223
0.297
0.150
0.473
0.33
0.293
0.479
0.0
0.34
0.087
0.0
0.00089
0.0
0.20
0.0
0.281
0.397
0.269
There are many failure mechanisms that are accelerated by the combination of
temperature and humidity. When modeling failure causes that are a function of humidity,
a question arises as to whether the model should be a function of relative humidity or
absolute humidity. The appropriate metric to use will depend on whether the failure
cause is a function of the absolute amount of water at the surface of the item under
analysis. If this is the case, absolute humidity is probably the appropriate measure. The
relationship between absolute and relative humidity is illustrated in Figure 6.13-3.
Figure 6.14-1: Estimated Upper Bound failure Rates vs Operating Time at 60 and
90% Confidence
Using a single-sided failure rate bound for reliability estimates can be dangerous, because
they can be very pessimistic. Exactly how pessimistic is determined by the number of
operating hours relative to the true failure rate. Moreover, if the upper bound is used on
multiple components in an assembly, then the pessimism in the assembly failure rate
estimate is compounded.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
260
The Bayesian techniques described previously are a way to address the issue of few or no
failures. This is, in fact, the premise of the 217Plus methodology. This approach, while
it requires a prior estimate, can alleviate the pessimistic nature of reliability estimates
made only from an observed number of hours with no failures.
Another related approach is to pool like data together for the purpose of estimating a
failure rate. For example, if a component has no failures, but there is also data available
on other components within the family of components, the data can be combined. An
example of this approach is described in the section on NPRD (Section 7.4). In that case,
the pooling occurs as a function of part type, quality and environment. The algorithm
used in that case was similar to a Bayesian approach, but was tailored to the specific
constraints of the data.
the performance recovers after the overstress condition is taken away. In any event, these
possible undesired effects should be studied and understood before applying components
beyond their rated stress values.
6.16. References
1. http://www.mortality.org
2. Farachi, V., Electronic Component Failure Rate Prediction Analysis, RIAC
Journal, Nov., 2006.
Chapter 7: Examples
7.
Examples
This chapter presents several examples of reliability models that are intended to provide a
cross section of several different methodologies. The focus of the examples is to present
methodologies that the author has personally developed, and ,thus, can provide insight
into the logic and rationale for their development. Several examples were previously
presented in Chapter 2, but not in detail. This section presents more detail regarding
model factors, development methods, etc.
The following examples are provided:
1. MIL-HDBK-217 Model Development Methodology The generic modeling
methodology for many of the models contained in MIL-HDBK-217 is presented
in this section. Not all of the models in the handbook have been developed using
this methodology, but the majority have been. This is presented so that the reader
can gain an understanding of the approach and methodology used, and to provide
insight into the decisions faced by the model developer.
2. 217Plus Reliability Models 217Plus is the methodology developed by the
RIAC to fill the void left after MIL-HDBK-217 was no longer scheduled to be
updated. The approach taken in the development of this methodology was quite
different than the methodology for MIL-HDBK-217. It was intended to be a
holistic approach in which all primary causes of electronic system failure were
accounted for. Therefore, factors addressing non-component reliability were
considered. It was also intended to be holistic in terms of its ability to leverage
experience from predecessor systems, and utilize information from empirical
testing. The general approach for this methodology was previously presented in
Chapter 2 in the Combining Data section. The additional information presented
in this section presents the details on the remaining portions of the methodology.
Additionally, the development of models for several different components is
presented. First is the development of the original twelve electronic part types.
For these models, sufficient field reliability data was available. The second
component models presented are for photonic component types. For these, very
little field data was available, and, therefore, the original 217Plus approach
needed to be tailored.
3. Life Model Example The intent of the life modeling example that will be
presented is to illustrate an application of the life modeling methodologies
previously discussed. This is a hypothetical example, but provides information
pertaining to the various elements of life modeling.
Reliability Information Analysis Center
263
Chapter 7: Examples
4. NPRD This section, covering the RIAC Nonelectronic Parts Reliability Data
(NPRD) publication, is presented to illustrate the nuances of field reliability data,
the manner in which data is merged, and the manner in which it is used in
reliability modeling. Some of this information was previously presented in
Chapter 2 in the section on the use of field data, but more detail will be presented
here. This will hopefully provide the user with an appreciation for both the uses
and limitations of this type of data.
The examples presented in this section were selected to provide a cross-section of various
methodologies, including prediction, assessment and estimation. It is presented to
complement the information previously provided in Chapter 2.
As noted previously, as of the publication date of this book, a Draft of MIL-HDBK-217G is currently in the works, with an
anticipated release in 2010.
Chapter 7: Examples
Chapter 7: Examples
The first step in the modeling methodology is to identify possible model factors. In this
example, the possible factors were:
Device Style
Power Rating
Package Type
Semiconductor Material
Structure (NPN, PNP)
Electrical Stress
Circuit Application
Quality Level
Duty Cycle
Operating Frequency
Junction Temperature
Application Environment
Complexity
Power Cycling
Chapter 7: Examples
n
= b T E Q i
i =1
where:
=
theoretical failure rate prediction
b = base failure rate, dependent on device style
T = temperature factor (based on the Arrhenius relationship)
E = environment factor
Q = quality factor based upon device screening level and hermeticity
Product of i = the product of Pi factors based upon variables from the potential
list of input variables found to have a significant effect on the
discrete semiconductor failure rate.
7.1.3. Collect and QC Data
The collection of empirical reliability data is integral to the approach used in model
development. Four specific data collection tasks were defined.
The first task was a system/equipment identification process. A survey of numerous
military equipments was conducted to identify system/equipments meeting predetermined
criteria established to ensure plentiful and accurate data.
The second task was an extensive survey of discrete semiconductor manufacturers and
users.
The third task was in-person visits to organizations where data could not be accessed by
other means.
The final data collection task was the compilation of data referenced in the literature and
documented technical studies. Also, as part of this task, additional contact was made
between the authors and/or study sponsors to determine whether more data was available.
The results of the four specific data collection tasks are described in the following
sections.
Five minimum criteria were established to define an acceptable data source. Each
potential equipment selection was evaluated with these criteria before proceeding with
data summarization. These five criteria were:
Reliability Information Analysis Center
267
Chapter 7: Examples
1.
2.
3.
4.
5.
Data summarization consisted of the extraction and compilation of the desired data
elements from the source reports and/or supporting documentation, and coding the data
for computer entry. Data summarization consisted of the following five tasks for sources
of field data:
1.
2.
3.
4.
5.
The data collected for this effort is summarized on the next page, in Table 7.1-1.
Included are, for each part type, the number of observed failures and operating hours. In
addition to this data, other information was captured, such as quality level, environment,
etc.
7.1.4. Correlation Coefficient Analysis
Using the multiple linear regression technique makes the implicit assumption that the
variables under analysis are independent, and not correlated. In practice, however,
factors are often highly correlated, thus making it difficult to deconvolve the effects that
each factor has.
Chapter 7: Examples
Failures
Part Hours
(Millions)
86
471
228
282
2
7
2330
246
52
89
1
57
878
209
19
245
18
72
30
1857
2612
22
0
144
4
7
170
916.91
7745.48
1154.84
2951.22
13.54
6.58
24706.61
1845.35
75.10
112.24
7.05
76.58
5177.81
431.77
68.23
1013.18
129.39
234.45
173.2
13413.37
1138.70
4827.08
39.1
636689.67
646.09
47.0
595.96
An example of this is the correlation between quality and environment. This correlation
exists because higher quality parts are often used in the more severe environments. As
such, the analysts options are to:
1. Keep the factors as derived, with the caveat that they may be in error
2. Treat the factors as a combined, pooled factor representing the correlated
variables
3. Use alternate approaches to quantifying the effects of either or all correlated
variables
Reliability Information Analysis Center
269
Chapter 7: Examples
= b T s
or:
= be
Ea
KT
Sn
ln = ln b + ln e KT + ln S n
ln = ln b +
Ea
+ n ln S
KT
or:
=a
ln b +
Ea
+ n ln S
KT
Transform
ln
-1/T
ln S
When the regression is performed, the intercept is ln b, and the temperature factor and
stress coefficients are Ea/K and n, respectively. In MS Excel, the LINEST function
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
270
Chapter 7: Examples
is used to determine the model coefficients. These are the values used in the original
equation:
= be
Ea
KT
Sn
If categorical variables are to be modeled, they can be modeled with regression analysis
by assigning a 1 or a 0 to the variable, and performing the regression as described
above. As an example, consider the case in which the product or system to be modeled
has temperature, stress, environment and quality as the four variables affecting the
reliability. This is shown in Table 7.1-3.
Table 7.1-3: Regression Data Including Categorical Variables
Variable
Environment
Independent
variable (e.g., )
Temperature
Stress
ln(1)
ln(2)
ln(3)
ln(4)
ln(5)
1/T1
1/T2
1/T3
1/T4
1/T5
lnS1
lnS2
lnS3
lnS4
lnS5
GB
0
1
0
0
1
AI
1
0
0
1
0
GM
0
0
1
0
0
Quality
Commercial
1
0
1
0
1
Industrial
0
0
0
1
0
Military
0
1
0
0
0
The equation above, expanded with the inclusion of the categorical variables, becomes:
=e
ln b +
Ea
+ n ln S + a1GB + A2 AI + A3GM + A4Comm.+ A5 Ind .+ A6 Mil
KT
where Ai are the coefficients of the categorical variables determined from the regression
analysis.
7.1.6. Goodness-of-Fit Analysis
There are several ways to analyze how good the model fits the data. The standard error
provides an indication of the significance of the specific factor under analysis. The
standard error is the standard deviation of the coefficient estimate. Therefore, if the
standard error is small relative to the coefficient estimate, this is an indication that the
factor is statistically significant. Likewise the opposite is also true.
Chapter 7: Examples
Residual plots are also useful in assessing how good the model is as a predictor of
reliability. The smaller the residuals, the better the model is.
Another useful plot, similar to a residual plot, is obtained when plotting the log10 of the
observed-to-predicted ratio. If this metric is relatively tightly clustered and centered
around zero, this is an indication of a good model.
7.1.7. Extreme Case Analysis
One of the potential problems in using a multiplicative model form is that extreme value
problems can arise. For example, when all input factors are simultaneously at their high
or low values, the resultant predicted failure rate can be unrealistically high or low. This
situation can be addressed with the use of different model forms, such as in the case of
the RIAC 217Plus models, in which a combination additive and multiplicative model
form is used.
7.1.8. Model Validation
The last step in the process is to validate the model. This is accomplished by ensuring
that the resulting models fit the observed data to a reasonable degree. Additionally, the
models can be checked against observed data not used in the model development.
Valuable data for this purpose is data at levels above the component level. In many
cases, high quality data can be obtained on systems or assemblies, but not at the part
level. This occurs due to the level at which maintenance is performed and data is
captured. Therefore, while the data cannot be used for model development, it can be used
for model validation.
Another thing that must be accounted for in the model validation effort is the scaling of
base failure rates to account for data in which there were no observed failures. The
methodology presented in this section is based on the premise that there exists a point
estimate of the dependent variable, in this case the failure rate. In cases where there are
no failures, a point estimate is not possible, i.e., only a lower single-sided confidence
bound is possible. The use of this confidence bound value cannot be used to represent
the data since the resultant model will be pessimistic (i.e., the failure rate will be
artificially increased). Only using the data points for which there are failures is also not
appropriate because it also will artificially bias the model pessimistically. Potential
solutions to this situation include:
Scaling the base failure rates to reflect the zero failure data. One possible
alternative to accomplish this is to scale the base failure rates with the boundary
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
272
Chapter 7: Examples
condition that the predicted number of failures in the entire dataset equals the
observed number.
Use of maximum likelihood (MLE) parameter estimation techniques. These MLE
techniques are especially suited to censored data such as zero failures.
In 1994, Military Specifications and Standards Reform (MSSR) decreed the adoption of
performance-based specifications as a means of acquiring and modifying weapons
systems. This led to the cancellation of many military specifications and standards. This,
coupled with the fact that the Air Force had re-directed the mission of Rome Laboratory
(now called the Air Force Research Laboratory (the preparing activity for MIL-HDBK217)) away from reliability, resulted in MIL-HDBK-217 becoming obsolete, with no
government plans to update it. The RIAC believed that there was a need for a reliability
assessment technique that could be used to estimate the reliability of systems in the field.
A viable assessment methodology needed:
1. Updated component reliability prediction models, since MIL-HDBK-217 was not
to be updated
2. A methodology for quantifying the effect that non-component variables have on
system reliability
3. To be useable by reliability engineers with data that is typically available during
the system development process
The RIAC is chartered with the collection, analysis and dissemination of reliability data
and information. To this end, it publishes quantitative reliability data such as failure rate
and failure mode/mechanism compendiums, as well as failure rate models. It is not
required to provide these services, but does so because there is a need for this data in the
reliability engineering community. It will continue to engage in such activities as long as
there appears to be this need by reliability practitioners. For this reason, the 217Plus
models and methodology were developed.
There are two primary elements to 217Plus, component reliability prediction models and
system-level models. A system failure rate estimate is first made by using the component
models to estimate the failure rate of each component. These failure rates are then
summed to estimate the system failure rate. This is the traditional methodology used in
many reliability predictions, and represents the reliability prediction, i.e., a reliability
estimate that is made before empirical data or detailed assessments are available. This
Reliability Information Analysis Center
273
Chapter 7: Examples
prediction is then modified in accordance with system level factors, which account for
non-component, or system level, effects. This is an example of a reliability
assessment, in which the process and design factors are assessed. Finally, the
prediction and assessment are combined with empirical data to form the reliability
estimate of the product, which is the best estimate of reliability based on all analysis
and data available to the analyst.
The goal of component reliability models is to estimate the rate of occurrence of
failure, or ROCOF, and accelerants of a components primary failure mechanisms
within an acceptable degree of accuracy. Toward this end, the models should be
adequately sensitive to operating scenarios and stresses, so that they allow the user the
ability to perform tradeoff analysis amongst these variables. For example, the basic
premise of the 217Plus models is that they have predicted failure rates for operating
periods, non-operating periods and cycling. As a result, the user can perform tradeoff
analysis amongst duty cycle, cycling rate, and other variables. As an example, a question
that frequently arises is whether a system will have a higher failure rate if it is
continuously powered on, or whether it is powered off during periods of non-use. The
models in 217Plus are structured to facilitate the tradeoff analysis required to answer this
question.
A flow diagram of the entire approach was presented in Chapter 2, which guides the user
in the application of the component models and the system level models. The basis for
the 217Plus methodology is the component reliability models, which estimate a systems
reliability by summing the predicted failure rates of the constituent components in the
system. This estimate of the system reliability is further modified by the application of
System-Level factors, called Process Grade Factors (PGF). Development of the
component models is presented in Sections 7.2.3 through 7.2.5.
The primary intent of this section is to detail the development of the 217Plus
methodology. It is provided to familiarize the reader with the issues faced by model
developers in order to allow a better understanding of 217Plus and similar models. It
provides details related to certain aspects of model development.
7.2.2. System Reliability Prediction Model
7.2.2.1. 217Plus Background
Chapter 7: Examples
requirements, induced failures, etc., that have not been explicitly addressed in prediction
methods.
The data in Figure 7.2-1, presented previously, contains the nominal percentage of
failures attributable to each of eight identified predominant failure causes based on data
collected by the RIAC. The data in this figure represents nominal percentages. The
actual percentages can vary significantly around these nominal values.
Softw are
9%
Parts
22%
No Defect
20%
Manufacturing
15%
Induced
12%
Wearout
System
9%
Management
4%
Design
9%
Parts (22%): Failures resulting from a part (i.e., microcircuit, transistor, resistor,
connector, etc.) failing to perform its intended function. Examples
include part failures due to poor quality; manufacturer or lot
variability; or any process deficiency that causes a part to fail before
its expected wearout limit is reached.
Chapter 7: Examples
Software (9%): Failures of a system to perform its intended function due to the
manifestation of a software fault
Another example that this author has experience with is shown in Figure 7.2-2, which
represents the distribution observed for Erbium Doped Fiber Amplifiers (EDFAs) used in
long haul telecommunications systems. The distribution is different than the above chart,
which is a pooled result from various system types and manufacturers. This example is
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
276
Chapter 7: Examples
provided to illustrate the notion that the system type and manufacturing practices will
dictate the specific distribution obtained.
8% No Fault
Found
21% Manufacturing
Defect
1% - Component Mechanical
7% - Component Electrical
63% - Component
- Pumps and
Other Optical
Components
The 217Plus methodology is structured to allow the user the ability to estimate the
reliability of a product or system in the initial design stages when little is known about it.
For example, a reliability prediction early in the development phase of a system can be
made based on a generic parts list, using default values for operational profiles and
stresses. As additional information becomes available, the model allows the incremental
addition of empirical test and field data to supplement the initial prediction.
The purpose of 217Plus is to provide an engineering tool to assess the reliability of
electronic systems. It is not intended to be the "standard" prediction methodology, and it
can be misused if applied carelessly, just as any empirical or physics-based model can.
Also, it is a tool to allow the user the ability to estimate the failure rate of parts,
assemblies and systems. It does not consider the effect of redundancy or perform
FMEAs. The intent of 217Plus is to provide the data necessary as an input to these
analyses. The methodology allows for the modification of a base reliability estimate with
Process Grading Factors for the failure causes listed in Section 7.2.2.1.
Reliability Information Analysis Center
277
Chapter 7: Examples
These process grades correspond to the degree to which actions have been taken to
mitigate the occurrence of product or system failure due to these failure categories. Once
the base estimate is modified with the process grades, the reliability estimate is further
modified by empirical data taken throughout item development and testing. This
modification is accomplished using Bayesian techniques that apply the appropriate
weights for the different data elements.
Advantages of the 217Plus methodology are that it uses all available information to form
the best estimate of field reliability, it is tailorable, it has quantifiable confidence bounds,
and it has sensitivity to the predominant product or system reliability drivers. The
methodology represents a holistic approach to predicting, assessing and estimating
product or system reliability by accounting for all primary factors that influence the
inability of an item to perform its intended function. It factors in all available reliability
data as it becomes available on the program. It, thus, integrates test and analysis data,
which provides a better prediction foundation and a means for estimating variances from
different reliability measures.
7.2.2.3. System Reliability Model
P = IA ( P + D + M + S + I + N + W ) + SW
The sum of the Pi-factors in the parenthesis represents the cumulative multiplier that
accounts for all of the processes used in system development and sustainment. The sum
of these values is normalized to unity for processes that are considered to be the mean of
industry practices. The individual model factors are:
P
IA
P
D
M
S
I
N
=
=
=
=
=
=
Chapter 7: Examples
W
SW
=
=
Additional factors included in the model account for the effects of infant mortality,
environment, and reliability growth. Since each of these factors does not influence all of
the factors in the above equation, they are applied selectively to the applicable factors.
For example, environmental stresses will generally accelerate part defects and
manufacturing defects to failure. These additional factors are normalized to unity under
average conditions, so that the value inside the parenthesis is one under nominal
conditions and for nominal processes.
P = IA ( P IM E + D G + M IM E G + S G + I + N + W ) + SW
where,
IM
E
G
=
=
=
The initial assessment of the failure rate, IA, is the seed failure rate value, which is
obtained by using the 217Plus component reliability prediction models, along with other
available data. This failure rate is then modified by the Pi-factors that account for
specific processes used in the design and manufacture of the product or system, along
with the environment, reliability growth and infant mortality characteristics of the item.
The above failure rate expression represents the total failure rate of the system, which
includes "induced" and "no defect found" failure causes. If the inherent failure rate is
desired, then the "induced" and "no defect found" Pi-factors should be set to zero, since
they represent operational and non-inherent failure causes.
7.2.2.4. Initial Failure Rate Estimate
Chapter 7: Examples
All variables in the model default to average values, not worst-case values. As a result,
the user has the option of applying any or all factors, depending on the level of
knowledge of the product or system and the amount of time or resources available for the
assessment. If a traditional reliability prediction is desired, the user can perform it using
the component models and the RIAC database failure rates contained in 217Plus9. As
additional data and information becomes available, the analysis can be expanded to
include these system-level factors.
7.2.2.5. Process Grading Factors
An objective of the 217Plus system model is to explicitly account for the factors
contributing to the variability in traditional reliability prediction approaches. This is
accomplished by grading the process for each of the failure cause categories. The
resulting grade for each cause corresponds to the level to which an organization has taken
the action necessary to mitigate the occurrence of failures of that cause. This grading is
accomplished by assessing the processes in a self-audit fashion. Any or all failure causes
can be assessed and graded. If the user chooses not to address a specific failure cause,
the model simply reverts to the default "average" value. If the user chooses to apply the
PGF methodology for any failure cause, there are a minimum number of questions that
should be assessed and graded. Beyond this minimum, the user can selectively assess
and grade additional criteria. If answers to the grading questions are not known, the
model simply ignores those criteria. Process grading is used to quantify the following
factors:
The sum of the factors within the parentheses in the failure rate model is equal to
unity for the average grade. Each factor will increase if "less than average" processes are
in used and decrease if better than average processes are in used.
The RIAC 217Plus software contains databases that hold the RIACs NPRD and EPRD failure rate data, converted to failures per
million calendar hours. The RIAC Handbook of 217Plus Reliability Prediction Models does not contain this supplementary data.
The RIAC NPRD and ERPD databooks are available for separate purchase from the RIAC, and are in units of failures per million
operating hours.
Chapter 7: Examples
Reference 2 presents the results of the study in which the process grades were
determined.
7.2.2.6. Basis Data for the Model
7.2.2.7. Uncertainty in Traditional Approach Estimates
Chapter 7: Examples
Multiplier
0.132
0.265
0.437
0.670
1.000
1.492
2.290
3.780
7.575
The premise of the 217Plus model developed in the RIAC study was that the failure rate
attributable to the predominant system-level failure causes could be quantified. In
addition to the intrinsic variability associated with the failure rate prediction, there is
additional variability associated with the variance in the distribution of failure causes.
This requires that there be baseline data that quantifies the failure rate of each cause. The
data in Table 7.2-2 was used for this purpose. This table contains, for each source of
data, the percentage of failures attributable to each of the eight identified predominant
failure causes. It should be noted here that the reported percentages of failure due to
some failure causes might be underestimated. For example, system management and
software may be under-reported because failures are usually not attributed to those
categories, even when they are the root cause of failure. This also means that the
percentages from the other causes may be overestimated. Although the authors recognize
that this is likely, the values in the model reflect the reported values. However, if a user
of the model has failure cause distribution information from which the model factors can
be tailored, this data should be used instead of the nominal values.
Chapter 7: Examples
Part
Defect
5
34
13
9
46
46
19
28
42
64
24
15
32
13
19
61
38
30
Mfg.
Defect
38
28
5
31
10
25
39
28
42
0
28
13
1
10
3
5
15
19
Design
0
0
5
38
19
2
10
28
16
0
0
4
5
10
5
5
17
10
System Wearout
No Induced Software
Mgt.
Defect
0
0
42
8
8
0
39
0
0
0
0
3
30
43
0
0
6
0
16
0
0
12
0
14
0
0
12
0
14
0
0
10
0
22
0
0
0
0
17
0
0
0
0
0
0
0
17
0
20
0
0
6
34
8
0
12
6
17
32
1
11
27
16
7
0
1
13
0
34
20
0
5
40
7
20
1
15
10
3
0
0
12
0
18
0
1
11
11
15
3
An analysis was then performed on the Table 7.2-2 data to quantify the distributions of
percentages for each failure cause. This was accomplished by performing a Weibull
analysis of each column. The resulting distributions are summarized in Table 7.2-3.
Table 7.2-3: Weibull Parameters for Failure Cause Percentages
Failure Cause
Parts
Manufacturing
Design
System Management
Wearout
Induced
No Defect
Software
Characteristic
Percentage
33.9
23.2
13.9
7.1
14.7
19.8
31.9
15.0
Weibull Shape
Parameter (beta)
1.62
0.96
1.29
0.64
1.68
1.58
1.92
0.70
Chapter 7: Examples
Table 7.2-4 summarizes the failure rate multiplier values for each of the eight failure
causes as a function of the grade for each of the eight. The generic formula for the
multiplier is given as:
i = (ln Ri )1/
In this calculation, the characteristic percentages listed in Table 7.2-3 are scaled by a
factor of 1.11 to ensure that the sum of the multipliers is equal to one when each grade is
equal to 0.50. In this case, a grade of 0.50 represents an "average" process, and since the
model is normalized to an average process, the total multiplier of the initial assessment
failure rate is equal to one under these conditions.
Parts
Manufacturing
Design
System
Management
Wearout
Induced
No Defect
0.725
0.655
0.612
0.581
0.556
0.535
0.516
0.500
0.486
0.472
0.460
0.449
0.438
0.428
0.419
0.410
0.402
0.394
0.386
0.379
0.372
0.948
0.800
0.714
0.653
0.606
0.567
0.535
0.507
0.482
0.461
0.441
0.423
0.406
0.391
0.376
0.363
0.351
0.339
0.328
0.317
0.307
0.378
0.333
0.306
0.286
0.271
0.258
0.247
0.237
0.229
0.221
0.214
0.207
0.201
0.195
0.190
0.185
0.180
0.176
0.171
0.167
0.163
0.643
0.498
0.420
0.367
0.328
0.298
0.273
0.251
0.233
0.218
0.204
0.191
0.180
0.170
0.161
0.152
0.145
0.137
0.131
0.124
0.119
0.304
0.276
0.258
0.245
0.235
0.227
0.219
0.212
0.207
0.201
0.196
0.191
0.187
0.183
0.179
0.176
0.172
0.169
0.166
0.162
0.160
0.433
0.391
0.365
0.346
0.330
0.317
0.306
0.296
0.288
0.279
0.272
0.265
0.259
0.253
0.247
0.242
0.237
0.232
0.227
0.223
0.219
0.588
0.540
0.511
0.488
0.470
0.455
0.442
0.430
0.420
0.410
0.401
0.393
0.385
0.378
0.371
0.364
0.358
0.352
0.346
0.340
0.335
Cumulative Percentage
(Grade)
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.10
0.11
0.12
0.13
0.14
0.15
0.16
0.17
0.18
0.19
0.20
0.21
Parts
Manufacturing
Design
System
Management
Wearout
Induced
No Defect
Chapter 7: Examples
0.365
0.358
0.352
0.345
0.339
0.333
0.328
0.322
0.317
0.311
0.306
0.301
0.296
0.291
0.286
0.281
0.277
0.272
0.267
0.263
0.259
0.254
0.250
0.246
0.241
0.237
0.233
0.229
0.225
0.221
0.217
0.213
0.209
0.205
0.202
0.198
0.298
0.288
0.280
0.271
0.263
0.256
0.248
0.241
0.234
0.228
0.221
0.215
0.209
0.203
0.198
0.192
0.187
0.181
0.176
0.171
0.167
0.162
0.157
0.153
0.148
0.144
0.140
0.136
0.132
0.128
0.124
0.120
0.117
0.113
0.109
0.106
0.160
0.156
0.152
0.149
0.146
0.143
0.140
0.137
0.134
0.131
0.128
0.125
0.123
0.120
0.118
0.115
0.113
0.110
0.108
0.106
0.104
0.101
0.099
0.097
0.095
0.093
0.091
0.089
0.087
0.085
0.083
0.081
0.080
0.078
0.076
0.074
0.113
0.108
0.103
0.098
0.094
0.090
0.086
0.083
0.079
0.076
0.072
0.069
0.067
0.064
0.061
0.059
0.056
0.054
0.052
0.049
0.047
0.045
0.043
0.042
0.040
0.038
0.036
0.035
0.033
0.032
0.030
0.029
0.028
0.026
0.025
0.024
0.157
0.154
0.151
0.149
0.146
0.144
0.141
0.139
0.137
0.134
0.132
0.130
0.128
0.126
0.124
0.122
0.120
0.118
0.116
0.114
0.112
0.111
0.109
0.107
0.105
0.104
0.102
0.100
0.098
0.097
0.095
0.093
0.092
0.090
0.088
0.087
0.214
0.210
0.206
0.203
0.199
0.196
0.192
0.189
0.185
0.182
0.179
0.176
0.173
0.170
0.167
0.164
0.161
0.159
0.156
0.153
0.151
0.148
0.146
0.143
0.140
0.138
0.136
0.133
0.131
0.128
0.126
0.124
0.121
0.119
0.117
0.114
0.330
0.325
0.320
0.315
0.310
0.306
0.301
0.297
0.293
0.288
0.284
0.280
0.276
0.272
0.269
0.265
0.261
0.257
0.254
0.250
0.247
0.243
0.240
0.236
0.233
0.229
0.226
0.223
0.219
0.216
0.213
0.210
0.206
0.203
0.200
0.197
Cumulative Percentage
(Grade)
0.22
0.23
0.24
0.25
0.26
0.27
0.28
0.29
0.30
0.31
0.32
0.33
0.34
0.35
0.36
0.37
0.38
0.39
0.40
0.41
0.42
0.43
0.44
0.45
0.46
0.47
0.48
0.49
0.50
0.51
0.52
0.53
0.54
0.55
0.56
0.57
Parts
Manufacturing
Design
System
Management
Wearout
Induced
No Defect
Chapter 7: Examples
0.194
0.190
0.186
0.183
0.179
0.175
0.172
0.168
0.164
0.160
0.157
0.153
0.149
0.146
0.142
0.138
0.135
0.131
0.127
0.123
0.119
0.116
0.112
0.108
0.104
0.100
0.096
0.092
0.088
0.084
0.079
0.075
0.070
0.066
0.061
0.056
0.103
0.099
0.096
0.093
0.090
0.086
0.083
0.080
0.077
0.074
0.072
0.069
0.066
0.063
0.061
0.058
0.055
0.053
0.050
0.048
0.045
0.043
0.040
0.038
0.036
0.034
0.031
0.029
0.027
0.025
0.023
0.021
0.019
0.017
0.015
0.013
0.072
0.071
0.069
0.067
0.065
0.064
0.062
0.060
0.059
0.057
0.055
0.054
0.052
0.050
0.049
0.047
0.046
0.044
0.042
0.041
0.039
0.038
0.036
0.035
0.033
0.031
0.030
0.028
0.027
0.025
0.023
0.022
0.020
0.019
0.017
0.015
0.023
0.022
0.021
0.020
0.019
0.018
0.017
0.016
0.015
0.014
0.013
0.013
0.012
0.011
0.010
0.010
0.009
0.008
0.008
0.007
0.007
0.006
0.006
0.005
0.005
0.004
0.004
0.003
0.003
0.003
0.002
0.002
0.002
0.001
0.001
0.001
0.085
0.084
0.082
0.080
0.079
0.077
0.076
0.074
0.073
0.071
0.069
0.068
0.066
0.065
0.063
0.062
0.060
0.058
0.057
0.055
0.053
0.052
0.050
0.048
0.047
0.045
0.043
0.042
0.040
0.038
0.036
0.034
0.032
0.030
0.028
0.026
0.112
0.110
0.108
0.106
0.103
0.101
0.099
0.097
0.095
0.092
0.090
0.088
0.086
0.084
0.081
0.079
0.077
0.075
0.073
0.071
0.068
0.066
0.064
0.062
0.059
0.057
0.055
0.052
0.050
0.047
0.045
0.042
0.040
0.037
0.034
0.031
0.194
0.190
0.187
0.184
0.181
0.178
0.174
0.171
0.168
0.165
0.162
0.158
0.155
0.152
0.149
0.145
0.142
0.139
0.135
0.132
0.129
0.125
0.122
0.118
0.114
0.111
0.107
0.103
0.099
0.095
0.091
0.087
0.082
0.078
0.073
0.068
Cumulative Percentage
(Grade)
0.58
0.59
0.60
0.61
0.62
0.63
0.64
0.65
0.66
0.67
0.68
0.69
0.70
0.71
0.72
0.73
0.74
0.75
0.76
0.77
0.78
0.79
0.80
0.81
0.82
0.83
0.84
0.85
0.86
0.87
0.88
0.89
0.90
0.91
0.92
0.93
Parts
Manufacturing
Design
System
Management
Wearout
Induced
No Defect
Chapter 7: Examples
0.051
0.045
0.039
0.033
0.025
0.016
0.011
0.009
0.007
0.005
0.003
0.002
0.013
0.012
0.010
0.008
0.006
0.003
0.001
0.001
0.000
0.000
0.000
0.000
0.023
0.021
0.018
0.015
0.012
0.008
0.028
0.025
0.022
0.018
0.014
0.009
0.062
0.057
0.050
0.043
0.035
0.024
Cumulative Percentage
(Grade)
0.94
0.95
0.96
0.97
0.98
0.99
7.2.2.9. Environmental Factor
SS =
Dremoved
Din
where:
Dremoved = D in Dremaining
The failure rate is, therefore:
=
D field (t )
t
where:
t =
the period, in hours, over which the MTBF is to be measured
Dfield = the number of field failures due to latent defects occurring during the
interval t.
Since SS is the percentage of defects removed from the population, it follows that:
Reliability Information Analysis Center
287
Chapter 7: Examples
D field = (t )
D field = postscreened (t )
D field = SS * prescreened (t )
postsceened = SS * prescreened
This indicates that, in addition to estimating the effect that ESS has on system reliability,
the screening strength calculated from field stresses (SSfield) can be effectively used as a
failure rate multiplier that accounts for the environmental stresses:
SS field (t ) =
1 e kt
t
where,
SSfield(t) =
k =
The total screening strength, SStotal , after accounting for both the temperature cycling and
vibration-related portions, is:
Chapter 7: Examples
PTC = 0.80
PRV = 0.20
Since the component failure rates described above are relative to a ground benign
environment, the failure rate multiplier is the ratio of the SS value in the use environment
to the SS value in a ground benign environment:
E =
where:
PTC = percentage of failures resulting from temperature cycling stresses
PRV = percentage of failures resulting from random vibration stresses
SS = screening strength applicable to the application environmental values
As previously indicated, the SS value is the screening strength and has been derived from
MIL-HDBK-344. It is an estimate of the probability of both precipitating a defect to
failure and detecting it once it is precipitated by the test.
SS TC = 1 e ( kTC t )
SSRV =1 e(k RV t )
k TC = 0.0017 ( T + .6) .6 [ln (RATE + 2.718) ]
where:
T = Tmax Tmin
(in degrees C)
RATE =
degrees C/minute
t =
# of cycles
k RV = 0.0046 G 1.71
Reliability Information Analysis Center
289
Chapter 7: Examples
0.855 0.81 e
+ 0.21 e
E =
0.205
Chapter 7: Examples
7.2.2.10.
Reliability Growth
The 217Plus model includes a factor for assessing the reliability growth characteristics of
a product or system10. The premise of this factor is that the processes that contribute to
system reliability growth in the field may or may not exist. The degree to which growth
exists is estimated by a grading factor that assesses the processes contributing to growth.
The growth factor calculation is given by the formula:
G =
1.12(t + 2)
2
The denominator in the above expression is necessary to ensure that the value of the
factor is 1.12 at the time of field deployment, regardless of the growth rate (). Figure
7.2-3 illustrates the growth Pi-factor multiplier for various values of growth rates as a
function of time.
1.2
Pi (Growth)
0.8
0
0.2
0.5
0.6
0.7
1
0.4
0.2
1.9
1.8
1.7
1.6
1.5
1.4
1.3
1.2
1.1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
Time (years)
Chapter 7: Examples
methodology by assessing and grading the processes that can contribute to reliability
growth.
7.2.2.11.
Infant Mortality
Infant mortality is accounted for in the model with a time-variant factor that is a function
of the level to which ESS has been applied. The infant mortality correction factor, IM,
is calculated as:
t - 0.62
IM =
(1 - SSESS )
1.77
where:
t
=
SSESS =
time in years
the screening strength of the screen(s) applied, if any.
The value of SS can be determined by using the stress screening strength equations as
presented in Section 7.2.2.9.
The above expression represents the instantaneous failure rate. If the average failure rate
for a given time period is desired, this expression must be integrated and divided by the
time period.
7.2.2.12.
The user of this model is encouraged to collect as much empirical data as possible and
use it in the 217Plus reliability assessment. This was summarized in Section 2.6, and is
done by mathematically combining the initial assessment made (based on the initial
assessment and the process grades) with empirical data. This step combines the best
"pre-build" failure rate estimate obtained from the initial assessment (plus the influence
of the PGFs) with the metrics obtained from the empirical data. Bayesian techniques are
used for this purpose. This technique accounts for the quantity of data by weighting large
amounts of data more heavily than small amounts. The failure rate estimate obtained
above forms the "prior" distribution, comprised of a0 and b0.
7.2.3. Development of Component Reliability Models
7.2.3.1. Model Form
Chapter 7: Examples
variable data (which is often the case with empirical failure rate data), a requirement of
the model form is that it be multiplicative (i.e., the predicted failure rate is the product of
a base failure rate and several factors that account for the stresses and component
variables that influence reliability). An example of a multiplicative model is as follows:
p = b e q s
where:
p =
b =
e =
q =
s =
However, a primary disadvantage of the multiplicative model form is that the predicted
failure rate value can become unrealistically large or small under extreme value
conditions (i.e., when all factors are at their lowest or highest values). This is an inherent
limitation of multiplicative models, primarily due to the fact that individual failure
mechanisms, or classes of failure mechanisms, are not explicitly accounted for. A better
approach is an additive model which predicts a separate failure rate for each generic class
of failure mechanisms. Each of these failure rate terms are then accelerated by the
appropriate stress or component characteristic. This model form is as follows;
p = o o + e e + c c + i + sj sj
where:
p =
o =
o =
e =
e=
c =
c =
i =
sj =
sj =
Chapter 7: Examples
By modeling the failure rate in this manner, factors that account for the application and
component specific variables that affect reliability ( factors) can be applied to the
appropriate additive failure rate term. Additional advantages to this approach are that
they:
Acceleration factors (also called Pi-factors) are used in the 217Plus models to estimate
the effect on failure rate of various stress and component variables. Since the traditional
technique of multiple linear regression was not used in the derivation of the failure rate
models, the Pi-factors were derived by utilizing either industry accepted values, values
determined separately from data available to the RIAC, or values from previous modeling
efforts. For example, the models typically include both an operating and non-operating
temperature factor based on the Arrhenius relationship, which require an activation
energy for operating and non-operating conditions. To estimate these values for the
models, previous modeling studies (along with existing prediction methodologies) were
used. Similarly, some factors were based on test data. For example, the exponent used in
the delta T Pi-factor for the 217Plus integrated circuit model is based on fallout rate data
from temperature cycling tests that were performed at various levels of delta T.
7.2.3.3. Time Basis of Models
Traditional reliability prediction models have been based on the operating time of the
part, and the units were typically failures per million (or billion) operating hours
(F/106H). The RIAC 217Plus models (and the empirical data contained in the RIAC
databases included with the RIAC 217Plus software) predict the failure rate in units of
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
294
Chapter 7: Examples
failures per million calendar hours (F/106CH). This is necessary (and appropriate)
because it is the common basis for all failure rate contribution terms used in the model
(operating, non-operating, cycling, and induced). If an equivalent operating failure rate is
desired (in units of failures per million operating hours), the failure rate (in F/106CH) can
be divided by the duty cycle to yields a failure rate in F/106operating hours.
7.2.3.4. Failure Mode to Failure Cause Mapping
There are two primary types of data on which the RIAC 217Plus component models are
based: failure rate and failure mode. The model development process required that the
failure rate data be apportioned into four failure cause categories. Since the failure mode
data contained in the RIAC databases was typically not defined by these categories, it
was necessary to transform the RIAC failure mode data into a failure cause distribution.
This was accomplished by assessing the stresses that accelerate the specific class of
failure categories, and estimating the percentage of failures that could be attributed to
those stresses. The primary stresses that potentially accelerate operational failure modes
are operating temperature, vibration, current and voltage. The stresses that accelerate
environmental failure causes are non-operating (i.e., dormant) ambient temperature,
corrosive stresses (contaminants/heat/humidity), ageing stresses (time), and humidity. As
an example, Table 7.2-5 summarizes this process for a resistor. Each of the six failure
modes included in the analysis are listed across the top of the table, i.e. EOS,
contamination, etc., along with their associated observed relative percentage of
occurrence. This data was collected by the RIAC and was based primarily on the root
cause failure analysis results of parts that had failed in the field.
Table 7.2-5: Example of Failure Mode-to-Failure Cause Category Mapping
Failure
Category
Operational
Stresses
Accelerating
Stresses/
Causes
Failure Mode
Contamination Cracked Chip Leakage
out
41.20%
23.50%
17.60% 7.10% 5.90%
EOS
Operating
Temperature
Vibration
Voltage
Ambient
(Dormant) Temp.
Corrosion
p
p
Total
%
0.00
0.05
4.70%
s
Current
Environmental
Stresses
TNI
0.04
0.00
0.00
0.08
0.09
Ageing
0.05
Humidity
0.09
Power Cycling
Power Cycling
Induced/EOS
Induced/EOS
0.31
0.22
0.22
0.42
0.42
Chapter 7: Examples
7.2.3.5. Derivation of Base Failure Rates
Once the Pi-factors were defined for each component type that was modeled, and once
the failure rate was apportioned amongst the failure causes, the base failure rate could be
determined. This was accomplished by (1) gathering all failure rate data, (2) estimating
the model input variables (temperatures, stresses, etc.) for each source of data, (3)
calculating the associated Pi-factor for each failure rate, and (4) deriving a base failure
rate for each of the failure cause categories. For example, the failure rate associated with
operational stresses is equated to the product of the base failure rate and the operational
Pi-factors:
PFC * obs = b o
where:
PFC =
obs =
b =
o =
Solving for b, and adding a factor to account for data points which have had no observed
failures, yields:
b =
PFC * obs
* PF
The PF parameter is the percentage of total observed calendar hours associated with
components that have had observed failures. This factor is necessary to pro-rate the base
failure rate which was calculated from those data records containing failures. Once this
value of b was calculated for each data record, the geometric mean was used as the best
estimate of the base failure rate.
7.2.3.6. Combining the Predicted Failure Rate with Empirical Data
The user of the 217Plus model is encouraged to collect as much empirical data as
possible and use it in the assessment. This is done by mathematically combining the
prediction made (based on the initial assessment and the process grades) with empirical
data, resulting in a reliability estimate. This step will combine the best pre-build failure
rate estimate obtained from the initial assessment (with process grading) with the metrics
obtained from the empirical data. Bayesian techniques are used for this purpose. This
technique accounts for the quantity of data by weighting large amounts of data more
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
296
Chapter 7: Examples
heavily than small quantities. The failure rate estimate obtained above forms the prior
distribution, comprised of a0 and b0.
If empirical data (i.e., test or field data) is available on the system under analysis, it can
be combined with the best pre-build failure rate estimate using the following equation:
ao + a1 + ....an
bo + b1 + ....bn
where:
=
ao =
a0 = 0.5
bo =
a0
Chapter 7: Examples
H Eq =
T 1
* HT
T 2
where:
H Eq =
T1 =
T2 =
HT =
The benefits of including empirical data in the failure rate estimate are that it:
Integrates all reliability data that is available at the point in time when the
estimate is performed (analogous to the statistical process called meta-analysis)
Provides flexibility for the user to customize the reliability model with actual
historical experience data
The 217Plus methodology also estimates confidence levels around the failure rate.
Before empirical data is available on a system, the levels are assessed based on a
distribution that was derived by analyzing data on a variety of systems for which both
reliability predictions and field data were available. After test or field data becomes
available and failures are accrued, traditional Chi-square techniques can be used to
estimate the uncertainty in the reliability prediction.
7.2.3.8. Using the 217Plus Model in a Top-Down Analysis
If empirical data exists on a predecessor system, the equation that translates the failure
rate from the old system to the new system is as follows:
predicted = predecessor *
predicted , new
predicted , predecessor
The (predicted, new)/(predicted, predecessor) failure rate ratio accounts for the
differences in application environment, complexity, stresses, date, etc. The predicted
failure rates for the predecessor and the new system are determined using the complete
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
298
Chapter 7: Examples
This section presents an example of the 217Plus component model for capacitors. The
failure rate equation for capacitors is:
G = e ( (Y 1993 ))
=
C =
C
C =
C1
CE
C=
capacitance, in microfarads
C1 = constant. A function of capacitor type (see Table 7.2-6)
CE = constant. A function of capacitor type (see Table 7.2-6)
OB = base failure rate, operating
DCO = failure rate multiplier for duty cycle, operating:
DCO =
DC
DC1op
TO = e
Eaop
1
1
Chapter 7: Examples
Eaop = activation energy, operating. A function of capacitor type (see Table 7.26)
S = failure rate multiplier for electrical stress:
S
S = A
S1
SA = stress ratio, the applied voltage stress divided by the rated voltage
S1 = constant. A function of capacitor type (see Table 7.2-6)
n=
constant. A function of capacitor type (see Table 7.2-6)
EB = base failure rate, environmental (see Table 7.2-6)
DCN = failure rate multiplier, duty cycle nonoperating:
DCN =
1 DC
DC 1nonop
TE = e
Ea nonop
1
1
AE
CR =
CR
CR1
DT
T T
= AO AE
DT1
Chapter 7: Examples
SJDT
T TAE
= AO
44
2.26
DC1op
TRdefault
DC1nonop
Eanonop
CR1 DT1 n C1 S1 CE
Eaop
Aluminum
0.000465
0.00022
0.000214
0.000768
.00095
0.229
0.17
0.5
0.83
0.4
1140.35
21
7.6
0.6 0.23
Ceramic
0.001292
0.000645
0.000096
0.00014
.00095
0.0082
0.17
0.3
0.83
0.3
1140.35
21
0.1
0.6 0.09
General
0.000634
0.000351
0.000083
0.000259
.00095
0.033
0.17
0.3
0.83
0.3
1140.35
21
0.1
0.6 0.09
Mica/Glass
0.000826
0.000997
0.000888
0.000764
.00095
0.0082
0.17
0.4
0.83
0.4
1140.35
21
10
0.1
0.6 0.09
Paper
0.000663
0.000075
0.000882
0.000042
.00095
0.0082
0.17
0.2
0.83
0.2
1140.35
21
0.1
0.6 0.09
Plastic
0.000994
0.001462
0.001657
0.002531
.00095
0.0082
0.17
0.2
0.83
0.2
1140.35
21
0.1
0.6 0.09
Tantalum
0.000175
0.000049
0.000032
0.000816
.00095
0.229
0.17
0.2
0.83
0.2
1140.35
21
17
7.6
0.6 0.23
Tantalum
0.000175
0.000049
0.000032
0.000816
.00095
0.229
0.17
0.2
0.83
0.2
1140.35
21
17
7.6
0.6 0.23
Variable, Air
0.002683
0.005193
0.002066
0.000566
.00095
0.0082
0.17
0.3
0.83
0.3
1140.35
21
Variable, Ceramic
0.002683
0.005193
0.002066
0.000566
.00095
0.0082
0.17
0.3
0.83
0.1
1140.35
21
Variable, FEP
0.002683
0.005193
0.002066
0.000566
.00095
0.0082
0.17
0.3
0.83
0.2
1140.35
21
Variable, General
0.002683
0.005193
0.002066
0.000566
.00095
0.0082
0.17
0.3
0.83
0.2
1140.35
21
Variable, Glass
0.002683
0.005193
0.002066
0.000566
.00095
0.0082
0.17
0.3
0.83
0.2
1140.35
21
Variable, Mica
0.002683
0.005193
0.002066
0.000566
.00095
0.0082
0.17
0.3
0.83
0.2
1140.35
21
Variable, Plastic
0.002683
0.005193
0.002066
0.000566
.00095
0.0082
0.17
0.3
0.83
0.2
1140.35
21
Part Type
7.2.3.10.
OB
EB
TCB
IND
SJB
Default Values
The default values for the environmental and operating profile factors are summarized in
Tables 7.2-7and 7.2-8.
Chapter 7: Examples
Vibration (GRMS)
9
9
9
9
10
1.3
16
3.3
3.3
3.3
0
0
1
10
10
10
10
10
10
10
10
4
4
4
4
4
4
4
2
0
0
0
0.7
0.7
0.7
0.7
1
Chapter 7: Examples
Operating profile
DC
CR (C/yr)
5
1000
25
2982
80
1491
30
368
10
50
80
184
25
1008
45
263
80
50
80
368
This section summarizes the manner in which photonic device models were derived
(Reference 3). It is included to demonstrate the development of models when little field
data is available.
The photonic component model form is:
DCO =
DC
DC1op
Chapter 7: Examples
TO = e
V =
Eaop
1
1
vibration factor:
V +1
V = a
Vc
nvib
DCN =
1 DC
1 DC1op
TE = e
Eanonop
1
1
AE
RH = humidity factor:
RH
RH a + 1
=
RH c
n RH
CR =
CR
CR1
Chapter 7: Examples
DT
T + T TAE
= AO R
14
n PC
Chapter 7: Examples
DT
nPC
7.2.4.1.2.
=
=
The modeling methodology that was used in the photonics device modeling study is
summarized in Figure 7.2-4. This methodology is similar to the 217Plus model
development methodology, but was tailored for the specific needs of photonic
components. Each element of this methodology is explained in the following sections.
Map observed
failure modes into
the failure cause
categories
Calculate a normalization
stress accelerating stress
Estimate acceleration
factors (Pi factors) for each
part from each data source
This section details the model development methodology and also presents the results of
each task in this methodology. Each task in Figure 7.2-4 is described in the following
sections.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
306
Chapter 7: Examples
7.2.4.2.1.
There are two primary types of data upon which the component models are based, failure
rate and failure mode. The model development process required that the failure rate data
be apportioned into the following four defined failure cause categories:
Since failure mode data is typically not classified according to these categories, it is
necessary to transform the failure mode distribution data into the failure cause
distribution. This failure mode distribution data was obtained from several sources:
An example of this is summarized in Table 7.2-9, in which the failure causes for a
connector are hypothesized (2nd column), and then an occurrence rating is given for each
cause. This rating is in the 3rd column, and is scored as a 1, 3 or 9. This weighting
scheme is often used in FMEA analysis. The result is a fractional value for each failure
cause that is proportional to the weighting. The sum of all of these values for each
component type equals 1.0.
The methodology used in the photonics device models to derive the fraction of
occurrence differs from the methodology presented previously for the 217Plus
components, in that failure mode distributions were not available during the photonics
model development effort. For the 217Plus models, the components were more mature
and therefore, there was considerable history of both failure mode and failure rate data to
draw upon.
Chapter 7: Examples
Connector
(SC and FC)
7.2.4.2.2.
Failure Cause
Spring failure
Wear of the connector resulting in misalignment
Wear of the end face
Contamination of facet (sand, dust, grease)
Contamination on outside that wicks in
Eccentric wear on the ferrule causes misalignment
Crimping too tight causes pinching
Crimping too loose causes it to fall apart
O-ring failure
Contraction of the outer jacket causes fiber pistoning
Fracture of the end face
Misalignment of cable end due to sleeve wear
Misalignment of cable end due to buckling from
tolerance stack up
Misalignment of cable end due to separation from
tolerance stack up
Insufficient cure of epoxy
Corrosion, pitting or facets
Embrittlment of organic materials due to UV exposure
Occurrence
3
3
1
9
1
1
3
1
1
3
1
1
Fraction of
Occurrence
0.073
0.073
0.024
0.220
0.024
0.024
0.073
0.024
0.024
0.073
0.024
0.024
0.073
3
3
3
1
0.073
0.073
0.073
0.024
To transform the failure mode distribution data into the failure cause distribution, the
following process was used:
The last item is accomplished by assessing whether each stress is a primary accelerant of
the failure mode, a secondary accelerant, or is not an accelerant. A 3:1 weighting
between primary and secondary accelerant was then used in estimating the percentage of
failures that could be attributed to those stresses.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
308
Chapter 7: Examples
The primary stresses that potentially accelerate operational failure modes are operating
temperature, vibration, current/voltage and optical power. The stresses that accelerate
environmental failure causes are nonoperating ambient temperature, corrosive stresses
(contaminants/heat/humidity), and aging stresses (time). As an example, Table 7.2-10
summarizes this process for our connector example.
Environmental
Accelerating Stress
or Cause
Operating temperature
Vibration
s p p
Current/voltage
Optical power
Ambient temperature
Corrosion
Ageing
Humidity
p
p
p s
Power Cycling
Power Cycling
s s
Induced/handling
Induced/handling
p
s
s s
s
p p
p p p
p p
s
p
s
s s
s p p s
p
TOTAL
Total
100 %
Failure Cause
Category
Operational Stresses
7.32%
7.32%
2.44%
21.95%
2.44%
2.44%
7.32%
2.44%
2.44%
7.32%
2.44%
2.44%
7.32%
7.32%
7.32%
7.32%
2.44%
Spring failure
Wear of the connector resulting in misalignment
Wear of the end face
Contamination of facet (sand, dust, grease)
Contamination on outside that wicks in
Eccentric wear on the ferrule causes misalignment
Crimping too tight causes pinching
Crimping too loose causes it to fall apart
O-ring failure
Contraction of the outer jacket causes fiber pistoning
Fracture of the end face
Misalignment of cable end due to sleeve wear
Misalignment of cable end due to buckling from tolerance stack up
Misalignment of cable end due to separation from tolerance stack up
Insufficient cure of epoxy
Corrosion, pitting or facets
Embrittlment of organic materials due to UV exposure
Table 7.2-10: Failure Mode to Failure Cause Category for Connectors (SC and FC)
0.00 0.11
0.10
0.00
0.01
s p 0.07 0.30
p
0.04
0.09
p
0.10
0.23 0.23
0.36 0.36
1.00 1.00
Chapter 7: Examples
Each of the failure modes is listed across the top of the table, and each of the accelerating
stresses/causes is listed down the left side. Each combination is identified with a blank
(no acceleration from the factor), a "p" (primary) or an "s" (secondary). The associated
relative percentage of failures attributable to the accelerating stress/cause is listed down
the right columns.
The % column (second from the right) is calculated as follows:
wi
% = FM % n
FM1
wi
AC1
n
where:
FM% = the percentage associated with the ith failure mode
wi = the weight of the specific combination of failure mode and accelerating
stress or cause (0 for none, 1 for secondary, and 3 for primary)
For example, the % value for ambient temperature (as part of the environmental failure
cause category) is:
1
1
1
1
7.32% + 7.32% + 7.32% + 2.44% = 0.07
1
11
4
4
Therefore, an estimate of the percentage of failure causes accelerated by ambient
temperature is 7%.
7.2.4.2.3.
The base percentages of failure rate are calculated by summing the accelerating
stress/cause percentages associated with each failure cause. For our connector example,
the four percentages associated with the operating accelerating stresses/causes is 11%,
or 0.11. These percentages are an estimate of the percent of failures that can be expected
for each cause under nominal stress conditions. In this case, nominal stresses are the
average stresses to which the models are normalized. Table 7.2-11 summarizes the
failure cause percentages (in fractional form).
Chapter 7: Examples
Percentage
(Fraction)
0.11
0.30
0.23
0.36
As previously summarized, the approach that was taken in photonics device model
development methodology relied on the collection of quantitative failure mode and
failure rate data. Literature searches were performed toward the goal of collecting the
quantitative data required for model development. Sources searched for applicable data
included:
The results of this data collection effort, for connectors, are summarized in Table 7.2-12.
Chapter 7: Examples
Failures
Lambda
Observed
0 0.8 368
0 0
0
0 0
0
0 0
0
0 0
0
0 0
0
0 0
0
0 0
0
0 0
0
0 1 1752
0 1 1752
0 1 1752
0 1 1752
20 1
0
Hours
12
0
0
0
0
0
0
0
0
125
110
125
125
0
RHa
5
0
0
0
0
0
0
0
0
0
0
0
0
0
CR
23
85
60
85
85
85
85
85
-40
-40
-40
-40
-40
25
DC
30
85
60
85
85
85
85
85
-40
85
70
85
85
25
VA
Delta T
Connector
Field
Damp heat
Damp heat
Damp heat
Damp heat
Damp heat
High temperature storage
High temperature storage
Low temperature storage
Thermal Cycling
Thermal Cycling
Thermal Cycling
Thermal Cycling
Vibration
TR
Data Type
TAE
Part
Type
TAO
40
85
95
85
85
85
2
33333333.33
20000
20160
1056
22000
22000
20160
1056
1056
5000
12600
50
5500
33
1
0
12
0
0
0
8
0
0
0
16
0
0
0
30
The first column is the part type; the second is the data type. Data types used in the
photonics device study included:
Field data
Test data
Thermal cycling
o Vibration
o Damp heat
o High temperature storage
o Low temperature storage
o Operating life test
The 3rd through tenth columns are the estimates of the actual stresses to which the part
was exposed in the field or during the test. These stresses are defined as follows:
TAO
TAE
=
=
Chapter 7: Examples
TR
VA
DC
CR
RHa
7.2.4.2.5.
=
=
=
=
=
For each source of data that was collected, an estimate of the stresses and operating
profiles to which the component was exposed was required so that the failure rates could
be normalized to the actual stresses. These stresses were summarized in the previous
section.
For test data, these values were generally readily available. For data collected from
fielded systems, the actual stress values were not available. Therefore, they had to be
estimated. The default values of the environmental and operating profile factors were
summarized in Tables 7.2-7 and 7.2-8. Only field data from telecommunication
applications used in a ground, stationary, indoors environment was available to the
photonics device modeling study, so only the values pertaining to those conditions were
estimated in this manner.
7.2.4.2.6.
Acceleration factors (or Pi-factors) were used in the component models to estimate the
effects of various stress and component variables on the failure rate. The two
predominant forms of acceleration factors are the Arrhenius and the power law models.
The Arrhenius model is generally used for modeling temperature effects and is:
AFT = e
Ea
KT
AF = S n
where S is the stress and n is a constant.
Reliability Information Analysis Center
313
Chapter 7: Examples
The specific forms of these acceleration factors that were used in the models are
summarized below.
TO = factor for operating temperature:
TO = e
V =
Eaop
1
1
vibration factor:
V +1
V = a
Vc
nvib
TE = e
Eanonop
1
1
RH = humidity factor:
RH
RH a + 1
=
RH c
n RH
DT
T + T TAE
= AO R
14
n PC
Chapter 7: Examples
n (PC)
10
5
2
1
0
Ea (op) Ea (nonop)
1
1
0.7
0.7
0.5
0.5
0.1
0.1
0
0
n (RH)
10
5
2
1
0
n (Vibration)
10
5
2
1
0
Table 7.2-14 summarizes the specific parameter values used in the connector models.
n (PC)
2
Ea (op) Ea (nonop)
0.1
0.1
n (RH)
10
n (Vibration)
5
The Pi factors needed to be normalized to a fixed set of conditions. This approach makes
it convenient to derive default Pi-factors. By normalizing the factors in this manner, the
Pi-factor is equal to 1.0 when the stress is equal to the default stress. Therefore, if an
analyst chooses to ignore the effects of a particular stress, the failure rate will be
representative of the default stress levels.
The default values for the applicable photonics device model Pi-factors are summarized
in Table 7.2-15.
Chapter 7: Examples
7.2.4.2.8.
Default
Vibration
Default RH
Default DT
0
10
0
5
20
0
0
0
15
5
15
15
15
Default CR
Connector
Passive Micro-Optic Component
Passive Fiber-Based Component
Isolator
VOA
Fiber
Splice
Cable
Laser Diode Module
Photodiode
Transmitter
Receiver
Transceiver
Default DC
Model Category
Default Tr
0.25
1000
50
20
Estimate the Acceleration Factors (Pi-factors) for Each Part from Each Data Source
The acceleration factors used in the models are Pi-factors, which are the acceleration
factors normalized to a given stress level. These factors were calculated for each part
from each data source. To derive these factors, two pieces of information were required:
1. The estimate of the stress for each data point (in this case, a data point is a single
observation of reliability (failures and hours) at a known set of stress conditions).
The manner in which these were quantified was previously explained.
2. The default stress level of the data for each stress parameter in the model
The Pi-factor was then the acceleration model normalized to the default stress level. An
example of this calculation is shown in Table 7.2-16. Every data point available from
field or test data had its associated Pi-factor values calculated. Note that some of the Pifactors were zero. This occurs because test data was not applicable to all failure causes.
This concept will be further explained in the next section.
Chapter 7: Examples
Pi TE
Pi RH
Pi CR
Pi DT
1.000
32.000
1.000
1.000
1.000
1.000
1.000
1.000
1.000 4084101
1.000
496874
1.000
785027
1.000 4084101
1.135
1.000
1.921
1.000
1.506
1.000
1.921
1.000
1.921
1.000
1.921
1.000
1.921
1.000
1.921
1.000
0.337
1.000
Pi DCN
Pi TO
3.200
4.000
4.000
4.000
4.000
4.000
4.000
4.000
3.200
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
Pi V
Pi DCO
Pi factors
0.267
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.267
1.333
1.333
1.333
1.333
1.333
1.333
1.333
1.333
1.000
1.000
1.000
1.000
1.000
1.000
1.000
1.000
0.974
1.921
1.506
1.921
1.921
1.921
1.921
1.921
0.337
0.137
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.137
227
681
227
227
227
0.000
0.000
0.000
0.368
2.037
4.037
4.037
0.000
0.000
0.000
0.000
0.368
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
1.063
1.180
1.149
1.162
0.000
0.000
0.000
0.000
0.360
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
Cable
Cable
Cable
Cable
Cable
Cable
Cable
Cable
Connector
Connector
Connector
Connector
Connector
Connector
Connector
Connector
Connector
Field
Thermal Cycling
Thermal Cycling
Thermal Cycling
Vibration
Vibration
Vibration
Vibration
Field
Damp heat
Damp heat
Damp heat
Damp heat
Damp heat
High temperature storage
High temperature storage
Low temperature storage
7.2.4.2.9.
Calculate the Base Failure Rates for Each Cause Such That the Observed Failure
Rates = the Predicted Failure Rates
In the case of the 217Plus models, which were based solely on field data, the base failure
rates for the photonic device models were obtained, as follows, for each failure cause
category:
m
Bi =
(Fobs %i ) field
1
m
H obs field
where:
Bi =
Fobs =
Hobs =
=
i=
m=
k=
%i =
the base failure rate for the ith failure rate term
the number of observed field failures
the number of observed field hours
the product of the applicable Pi-factors to the applicable field environment
the number of failure causes
the number of field data sources
the number of correction factors
the percentage of failure rate attributable to the specific failure causes
Reliability Information Analysis Center
317
Chapter 7: Examples
The product of the Pi-factors converts the actual hours to an equivalent effective
number of hours normalized to the default stress values.
However, in the case of the photonic models developed for the study, it was necessary to
utilize a significant amount of test data since there was not enough field data available.
This is due to the fact that there are few field data sources for photonic components.
Therefore, the modeling methodology needed to be tailored to accommodate the specific
data available on the parts addressed in the photonics device study. This was
accomplished by using a Bayesian technique in which the field data becomes the prior
distribution, and the summation of the failure and hours from all data sources forms the
basis of the posterior distribution. The failure rate parameter of the exponential
distribution was, therefore:
j
Bi =
Operating
X
X
Induced
X
X
X
X
X
X
One of the advantages to the model structure was this ability to modify the base failure
rates of specific failure causes with test data applicable to only that failure cause.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
318
Chapter 7: Examples
The connector base failure rates resulting from this analysis are listed in Table 7.2-18.
Table 7.2-18: Base Failure Rates (Failures per Million Calendar Hours)
Base Failure Rate
(failures per million calendar hours)
Component
Connector
7.2.4.2.10.
Operating
0.0002
Environmental
0.3053
Cycling
2.7952
Induced
0.0110
The last step in the process was to adjust the base failure rates to ensure that the predicted
number of failures was equal to the observed number. The manner in which this was
accomplished was to scale the base failure rates to ensure that the cumulative predicted
number of failures of the entire population of observed data points was equal to the
observed number of failures. This was accomplished by using the MS Excel goal seek
function, which finds the value of a correction factor that satisfies this boundary
condition. This approach is conceptually similar to a maximum likelihood method.
7.2.4.2.11.
There were several options for modeling the effects of environmental stresses. Early in
the study, it was decided that the effects of quality and environment would be treated
such that the photonic component models would be stand-alone. This approach
differed from the form of the 217Plus methodology, in that quality and environment were
treated as system level effects. This concept was based on the premise that quality and
environmental effects were manifested more at the assembly or system level than they
were at the component level. The photonic component models include the effects of their
pertinent environmental stresses in the component models, instead of applying the
environment factor in the assembly or system model, as was the case with 217Plus. The
primary environmental stresses included in the photonic component models are
temperature, humidity and vibration.
The quality factor ( Q) is calculated in a manner similar to the 217Plus methodology, but
tailored to the unique concerns of photonic components. This factor is calculated as
follows:
1
q = i ( ln (R i ))
Where i and i are Weibull parameters representing the distribution of the percentage of
failures attributable to components (parts). The quality factor is scaled within this
Reliability Information Analysis Center
319
Chapter 7: Examples
distribution based on how good the parts control program is. The parameter Ri is the
rating of the parts control program and is calculated from:
ni
Ri =
j =1
GijWij
ni
W
j =1
ij
where,
rating of the process for the ith failure cause, from 0.0 to 1.0
the grade for the jth item of the ith failure cause. This grade is the rating
between 0.0 and 1.0 (worst to best).
Wij = the weight of the jth item of the ith failure cause
n i = the number of grading criteria associated with the ith failure cause
Ri =
Gij =
The 217Plus grading criteria, as applied to the photonics device models, are provided in
Table 7.2-19. These were tailored specifically for photonic components.
Table 7.2-19: Part Quality Process Grade Factor Questions for Photonic Device Models
Highest
Actual
Possible
Score
Score
Rating
Input
Range
User
Input
yes = 5
no = 0
Y,N
0.0
yes = 3
no = 0
Y,N
0.0
yes = 3
no = 0
Y,N
0.0
yes = 6
no = 0
Y,N
0.0
yes = 4
no = 0
Y,N
0.0
yes = 10
no = 0
Y,N
10
0.0
yes = 10
no = 0
Y,N
10
0.0
yes = 10
no = 0
Y,N
10
0.0
Chapter 7: Examples
Highest
Actual
Possible
Score
Score
Rating
Input
Range
User
Input
yes = 7
no = 0
Y,N
0.0
yes = 7
no = 0
Y,N
0.0
yes = 10
no = 0
Y,N
10
0.0
yes = 7
no = 0
Y,N
0.0
yes = 5
no = 0
Y,N
0.0
yes = 7
no = 0
Y,N
0.0
yes = 7
no = 0
Y,N
0.0
yes = 7
no = 0
Y,N
0.0
yes = 7
no = 0
Y,N
0.0
yes = 7
no = 0
Y,N
0.0
yes = 10
no = 0
Y,N
10
0.0
yes = 7
no = 0
Y,N
0.0
yes = 8
no = 0
Y,N
0.0
yes = 7
no = 0
Y,N
0.0
yes = 6
no = 0
Y,N
0.0
A,B,C,D,E
10
0.0
A. No OPA = 10
B. yes, MFD <2um = 0
C. yes, MFD = 2 - 5 um = 4
D. yes, MFD = 5 - 10 um = 6
E. yes, MFD > 10 um = 8
Chapter 7: Examples
Rating
A. No Thin film = 0
B. yes, and surface is prepared by
sputtering = 2
C. yes, and surface is not
prepared by sputtering = 3
Highest
Actual
Possible
Score
Score
Input
Range
User
Input
A,B,C
0.0
yes = 5
no = 0
Y,N
0.0
yes = 5
no = 0
Y,N
0.0
yes = 3
no = 0
Y,N
0.0
yes = 5
no = 0
Y,N
0.0
yes = 5
no = 0
Y,N
0.0
yes = 4
no = 0
Y,N
0.0
A,B
0.0
Y,N
0.0
A,B,C
0.0
Y,N
0.0
An analysis was performed to quantify the degree of uncertainty in the predicted failure
rates. This was accomplished by calculating the predicted failure rate and comparing it to
the observed failure rate. The metric that was used for this analysis was the log10 of the
value: predicted failure rate/observed failure rate. The value of this metric should cluster
around zero if the prediction models are approximating the observed data. Calculation of
the standard deviation of this metric also provides a quantification of the uncertainty
levels present in the predictions made with these models. Table 7.2-20 summarizes the
mean and standard deviation of this metric for all of the data and for only the field data.
Chapter 7: Examples
Figures 7.2-5 and 7.2-6 illustrate the distribution of this metric for all data and for just
field data. For this analysis, only data for which failures occurred were included, since
data with no observed failures only have a single-sided bound on the failure rate and,
therefore, cannot be compared to the predicted value. The result of not including zero
failure data is that the metric is biased. As can be seen in these figures, the distribution of
all failures is significantly wider than the distribution of just the field failure rates. This
is due to the fact that the non-field data, i.e. test data, is typically at extreme conditions.
Therefore, the uncertainty in these extreme cases is typically larger than for nominal
conditions.
Histogram
14
12
Frequency
10
8
6
4
2
0
-3
-2
-1
LOG 10 (PREDICTED/OBSERVED)
Figure 7.2-5: Distribution of Log10 Predicted/Observed Failure Rate Ratio for All
Data
Chapter 7: Examples
Histogram
7
6
Frequency
5
4
3
2
1
0
-0.25
0.25
0.75
1.25
1.75
More
LOG 10 (PREDICTED/OBSERVED)
Figure 7.2-6: Distribution of Log10 Predicted/Observed Ratio for Field Data Only
The distributions of the predicted/observed failure rate ratio are illustrated in Figure 7.27. With this metric, the value should be centered about one, since the log of this ratio has
not been taken.
cumulative probability
99 . 0 00
50 . 0 00
10 . 0 00
5. 00 0
1. 00 0
0. 00 1
0 . 0 10
0. 10 0
1. 000
10. 00 0
1 00 . 0 00
predicted/observed ratio
F olio1\Da ta 1: = 1 .5 5 8 5 , =2 .4 9 2 5 , = 0 .9 8 8 0
F olio1\Da ta 2: = 0 .4 5 5 6 , =0 .9 5 4 7 , =0 .8 8 1 3
Figure 7.2-7: Distributions of the Predicted/Observed Failure Rate Ratio for All Data
and For Field Data Only
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
324
Chapter 7: Examples
7.2.4.4. Comments on Part Quality Levels
Part quality level has traditionally been used as one of the primary variables affecting the
predicted failure rate of a component. The quality level categories were usually those
defined by the applicable military specification.
One of the problems that developers had when developing MIL-HDBK-217 models was
de-convolving the effects of quality and environment. For example, multiple linear
regression analysis of field failure rate data was usually used to quantify model variables
as a function of independent variables such as quality and environment. A basic
assumption of such techniques is that the independent variables are statistically
independent of each other. However, in reality they are not, since the higher quality
components are generally used in the severe environments and the commercial quality
components are used in the more benign environments. This correlation makes it
difficult to discern the effects of each of the variables individually. Additionally, there
are several attributes pooled into the quality factor, including qualification, process
certification, screening and quality systems.
The approach used in the 217Plus model to quantify the effects of part quality is to treat it
as one of the failure causes for which a process grade is determined. In this manner,
issues related to qualification, process certification, screening and quality systems were
individually addressed.
7.2.4.5. Explanation of Failure Rate Units
The 217Plus models predict the failure rate in units of failures per million calendar hours.
This is necessary because the 217Plus methodology accounts for all failure rate
contribution terms (i.e., operating, nonoperating, cycling and induced), and the
appropriate manner in which they can be combined is to use a common time basis for the
failure rate, which is calendar hours.
If an equivalent operating failure rate is desired in units of failures per million operating
hours, the 217Plus reliability prediction should be performed with the actual duty cycle to
which the unit will be subjected, then divide the resulting failure rate (in f/106 calendar
hours) by the duty cycle to yield a failure rate in terms of f/106 operating hours. The
resulting operating failure rate will be artificially increased to account for the
nonoperating and cycling failures that would not otherwise be accounted for. The
incorrect way to predict a 217Plus failure rate in units of failures per million operating
hour is to set the duty cycle equal to 1.0. The resulting failure rate in this case would be
valid only if the actual duty cycle is 100%. If the actual duty cycle is not 100%, then the
failures during non-operating periods will not be accounted for.
Reliability Information Analysis Center
325
Chapter 7: Examples
7.2.5.
System-Level Model
P = IA ( P IM E + D G + M IM E G + S G + I + N + W ) + SW
where:
p =
IA =
initial assessment of the failure rate. This failure rate is based on new
component failure rate models derived by the RIAC presented in Section
2.2, whose derivations are discussed in the next section
P
D
M
S
I
N
W
=
=
=
=
=
=
=
i = i ( ln(Ri ))
where i and i are constants for each failure cause category, as given in Table 7.2-21.
The parameter Ri is calculated as:
ni
Ri =
j =1
GijWij
ni
W
j =1
ij
where:
Ri =
rating of the process for the ith failure cause, from 0.0 to 1.0.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
326
Chapter 7: Examples
the grade for the jth item of the ith failure cause. This grade is the rating
between 0.0 and 1.0 (worst to best).
Wij = the weight of the jth item of the ith failure cause
n i = the number of grading criteria associated with the ith failure cause
Gij =
Name
Design process factor
Manufacturing process factor
Parts Quality process factor
Systems Management process
factor
CND process factor
Induced process factor
Wearout process factor
0.12
0.21
0.30
0.06
1.29
0.96
1.62
0.64
0.29
0.18
0.13
1.92
1.58
1.68
0.237
0.141
0.106
IM =
t - 0.62
(1 - SSESS )
1.77
where:
t=
time in years. This is the instantaneous time at which the failure rate is
to be evaluated. If the average failure rate for a given time period is
desired, this expression must be integrated and divided by the time
period.
SSESS = the screening strength of the screen(s) applied, if any
E =
environmental factor
((
) (
.6
1.71
.855 .8 1 e(.065(T +.6 ) ) + .2 1 e(.046G )
E =
.205
))
where:
Chapter 7: Examples
1 .12 (t + 2 )
2
where:
=
Ri =
Ri =
j =1
GijWij
ni
W
j =1
ij
This section contains a listing of all of the criteria that comprise the definition and
scoring for the individual 217Plus Process Grades. An index of the tables included
within this section is listed in Table 7.2-22.
The rating for each process grade type, Ri,is given as:
ni
Ri =
j =1
GijWij
ni
W
j =1
ij
Chapter 7: Examples
where:
rating of the process for the ith failure cause, from 0.0 to 1.0.
the grade for the jth item of the ith failure cause. This grade is the rating
between 0.0 and 1.0 (worst to best).
Wij = the weight of the jth item of the ith failure cause
n i = number of grading criteria associated with the ith failure cause
Ri =
Gij =
These tables are organized as follows. Column 1 contains the criteria associated with the
specific Process Grade Type. Column 2 is the grading criteria (Gij). Most of the
questions are designated with a Y/N in this column. In these cases, a Yes (Y) answer
equals "1" and a "No" answer equals 0. The question will receive the full weighted
score for a "Yes" answer and a zero for a "No" answer. In some cases, the grading
criteria is not binary, but rather can be one of three or four possible values. The grading
criteria for these are noted in this column. Column 3 identifies the scoring weight (Wij)
associated with the specific question.
In the event that a model user does not wish to answer all of the questions, he/she can
choose a subset of the most important questions by using only those with weight values
of seven or higher. Questions that are not scored should not be counted in the number of
grading criteria (ni) associated with the ith failure score.
Chapter 7: Examples
Gij
Wij
What is the % of lead design engineering people with cross training experience in manufacturing or field operations (thresholds at
10, 20%)?
<10 = 0
10-20 = .5
>20 = 1
What is the % of team members having relevant product experience (thresholds at 25, 50%)?
<25 = 0
25-50 = .5
>50 = 1
What is the % of team members having relevant process experience, i.e., they have previously developed a product under the
current development process (thresholds at 20, 40%)?
<20 = 0
20-40 = .5
>40 = 1
What is the % of development team that have 4-year technical degrees (thresholds at 20, 40%)?
<20 = 0
20-40 = .5
>40 = 1
What is the % of engineering team having advanced technical degrees (thresholds at 10, 20%)?
<10 = 0
10-20 = .5
>20 = 1
What is the % of engineering team members involved in professional activities in the past year; hold patents; authored/presented
papers; are registered professional engineers, or professional society offices at the National level (thresholds at 10, 20%)?
<10 = 0
10-20 = .5
>20 = 1
What is the % of engineering team members who have taken engineering courses in the past year (thresholds at 10, 20%)?
<10 = 0
10-20 = .5
>20 = 1
Are resource people identified for program technology support across key technology and specialty areas such as optoelectronics,
servo control, Application Specific Integrated Circuits (ASIC) design, etc., to provide program guidance and support as needed?
Yes = 1
No = 0
Are resource people identified, for program tools support, to provide guidance and assistance with Computer Aided Design (CAD),
simulation, etc.?
Yes = 1
No = 0
How many (0,1,2,3) of the program objectives of cost, schedule and reliability did the manager successfully meet for the last
program that he/she was responsible?
3=1
2 = .5
1 = .25
0=0
10
Is this development program organized as "Cross Functional Development Teams" (CFDT) involving: design, manufacturing, test,
procurement, etc.?
Yes = 1
No = 0
Does this Field Replaceable Unit (FRU) depend more on mature technology than state of the art technology?
Yes = 1
No = 0
Is design of experiments (DOE) used to ensure robustness of the FRU in the product under all operational and environmental
variations?
Yes = 1
No = 0
Are critical components identified along with plans to mitigate their risks?
Yes = 1
No = 0
Have designs been reviewed and plans made for part obsolescence during the product's life cycle?
Yes = 1
No = 0
Are considerations made to accommodate part form factor evolution? This applies particularly to those parts deemed likely to
change during the production life of the fielded system.
Yes = 1
No = 0
Are predominantly standard tools required for maintenance (limited-to-no use of special tools)?
Yes = 1
No = 0
Will the design application be modeled by variational analysis to ensure design centering?
Yes = 1
No = 0
Yes = 1
No = 0
Chapter 7: Examples
Gij
Wij
Yes = 1
No = 0
Yes = 1
No = 0
Will mechanical stress analysis be performed on relevant components, materials and structures?
Yes = 1
No = 0
Will a prototype be developed in time to have user feedback impact the design?
Yes = 1
No = 0
10
Yes = 1
No = 0
10
Will design personnel participate in a Failure Modes and Effects Analysis (FMEA), Failure Modes Effects and Criticality Analysis
(FMECA), or Fault Tree Analysis (FTA) that is performed concurrently with the design effort?
Yes = 1
No = 0
Will the design engineer also design the diagnostic code for this FRU?
Yes = 1
No = 0
Yes = 1
No = 0
Will the product support tasks be ergonomically evaluated (human factors) from an Operations & Maintenance standpoint?
Yes = 1
No = 0
Will the product be analyzed using a human factor task analysis to ensure the Operations and Maintenance Tasks are tailored to
human capabilities?
Yes = 1
No = 0
Will the chassis that this FRU is mounted in be thermally measured and analyzed and operating temperatures assured to be at a safe
margin below device limits?
Yes = 1
No = 0
Yes = 1
No = 0
Do control procedures ensure that the system and its software are put in a safe state during power shut down?
Yes = 1
No = 0
<90 = 1
90-125 = .5
>125 = 0
Will environmental analyses and profiling (thermal, dynamic) be performed on the product to ensure it is used within its design
strength capabilities?
Yes = 1
No = 0
Will the product be analyzed/tested for electromagnetic compatibility (EMC) and radiated/conducted susceptibility and emissions?
Yes = 1
No = 0
Will the product be EMC-certified, per the European CE (Conformity European) regulatory compliance criteria for equipment used
in Europe, or under a similarly rigorous standard such as DO-160 (commercial aircraft)?
Yes = 1
No = 0
Are the size of equipment orifices (cover openings) less than 1/10 of the wavelength of the signal frequencies that the equipment
will generate within its enclosure or be exposed to in its environment?
Yes = 1
No = 0
Do traces on a Printed Wiring Board (PWB) run over a ground plane or an impedance control layer (e.g., power planes) and never
over reference plane or power plane voids?
Yes = 1
No = 0
Do traces on alternate PWB layers run orthogonal to one another, when a reference plane or power plane is not interposed between
them?
Yes = 1
No = 0
Are adjacent traces separated by at least twice their width, except for minor adjacencies that run less than a half inch?
Yes = 1
No = 0
Is the power source filtered over the range of 1KHz to 100 MHz for military or 150KHz to 30 MHz for commercial power, and
utilize surge suppression devices where appropriate?
Yes = 1
No = 0
Are all interconnect cables emerging from a shielded cabinet grounded to the chassis for operating frequencies greater than 1 MHz
or capacitively decoupled to the chassis for frequencies less than 1 MHz?
Yes = 1
No = 0
Are traces set back at least 2 widths from the edge of the reference or ground plane?
Yes = 1
No = 0
Is there a shared product development vision that includes Design for Manufacturability (DFM) goals?
Yes = 1
No = 0
What is the maximum silicon junction temperature on this FRU in degrees C (thresholds at 90 and 125 degrees C)? If not
applicable select 0.
th
Chapter 7: Examples
Gij
Wij
Yes = 1
No = 0
Is there continuing focus to keep the PPL up to date and to minimize the number of parts on the PPL, by increasing part
standardization, encouraging designers to use the PPL and requiring analysis to justify adding a new part to the PPL?
Yes = 1
No = 0
Is this product to be built on an existing manufacturing platform that makes use of existing process capabilities?
Yes = 1
No = 0
Yes = 1
No = 0
Are new, critical parts qualified for by test and analysis prior to their inclusion in the system?
Yes = 1
No = 0
<75% = 1
75%-100% = .5
>100% = 0
Yes = 1
No = 0
<60 = 1
60-100 = .5
>100 = 0
Yes = 1
No = 0
Yes = 1
No = 0
Is the process documentation on-line with the recognition that the on-line version is the only standard? (All printed copies are for
reference only).
Yes = 1
No = 0
Does each process activity have clear entry and exit criteria?
Yes = 1
No = 0
Is the system configuration documented on-line, with changes since the last baseline highlighted to keep the entire team current
with the design?
Yes = 1
No = 0
Are there functional block diagrams of the system, subsystems, etc., down to the FRU level?
Yes = 1
No = 0
Are examples of good development products (e.g., specs, plans, documentation) provided to the engineering team, typifying the
desired work products for each stage of development?
Yes = 1
No = 0
Are examples of past problems provided to the engineering team that typify those found at each stage of development?
Yes = 1
No = 0
Yes = 1
No = 0
Does development activities planning include the identification of critical path tasks?
Yes = 1
No = 0
Are critical path tasks planned to minimize cycle time impacts and improve schedule robustness?
Yes = 1
No = 0
Are individual developers encouraged to make contact with their customer counterpart?
Yes = 1
No = 0
Will Cross-Functional Development Team (CFDT) phase reviews/sign-offs follow each product development phase: requirements,
preliminary design, final design, and test?
Yes = 1
No = 0
Are formal reviews documented and defect data analyzed and tracked, along with any action items, to completion?
Yes = 1
No = 0
Do design reviewers share responsibility for the performance of the design once they have reviewed it?
Yes = 1
No = 0
Are developers rated on the success of the overall product in the field?
Yes = 1
No = 0
Is there a technical review board in place to minimize design changes and maintain cost, schedule and reliability goals?
Yes = 1
No = 0
How does the part count on this project compare with predecessor products or competitive products? (Thresholds at 75 and 100%).
Are there DFM guidelines provided that the program must adhere to? (e.g., a good DFM design is fabricated on a uni-axis
assembly orientation, preferably built from the bottom)
What % of inter-connections are there compared to the predecessor version of this FRU (thresholds at 70 and 100%)?
Chapter 7: Examples
Gij
Wij
Are engineering change (EC) costs budgeted, measured and tracked against their associated design driver?
Yes = 1
No = 0
Is reliability and/or quality a significant goal or the number one goal placed on the entire development organization? This occurs
on safety critical applications such as air traffic control, nuclear or critical medical applications.
Yes = 1
No = 0
10
<10 = 0
10-20 = .5
>20 = 1
Are individual developers empowered by having input and control over resources to accomplish their job, such as having a travel
budget (if travel is required)?
Yes = 1
No = 0
Yes = 1
No = 0
Are process owners identified across the development team for each configuration item (CI) and its components?
Yes = 1
No = 0
Is there tracking of open problems, action items and cross system dependencies?
Yes = 1
No = 0
Yes = 1
No = 0
Is development creativity fostered through planned creativity exercises to spawn breakthrough thinking with respect to design
simplicity, cost, schedule and reliability?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
How many (0,1,2,3) of the cost, schedule and reliability goals did the last product developed by this organization meet?
3=1
2 = .5
1 = .25
0=0
10
Do you know the reliability performance of your current products in the field versus their predicted reliability?
Yes = 1
No = 0
If so, are previous reliability estimates greater than 15% of the predicted reliability? When not applicable select "No".
Yes = 1
No = 0
Is there a 15% staffing buffer on the program, i.e., will the program be staffed to 115% of the needed baseline to allow for
contingencies?
Yes = 1
No = 0
Are in-process metrics maintained to track actual vs. planned defect rates, schedule and resource targets?
Yes = 1
No = 0
Can continuous measurable improvement (CMI) be demonstrated for the development processes?
Yes = 1
No = 0
Are development processes maintained on-line with all printed paper copies designated "for reference use only"?
Yes = 1
No = 0
Are there procedures to ensure that documentation stays current with the design?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Are there document owners or points-of-contact identified for these documents so the development team knows who it can go to
for a specific need?
Yes = 1
No = 0
Do the team members contribute to the creation and/or review and approval of these documents?
Yes = 1
No = 0
Are documentation standards promoted with examples to demonstrate what is considered adequate documentation?
Yes = 1
No = 0
What is the % of FRU reuse across the system? (thresholds at 10 and 20%).
Chapter 7: Examples
Gij
Wij
Yes = 1
No = 0
Is there a procedure for field feedback on product operations and maintenance documentation?
Yes = 1
No = 0
Does product documentation maximize pictures and minimize words (fallibility of natural language)?
Yes = 1
No = 0
Yes = 1
No = 0
Is there an operational concept document developed prior to high level design that is maintained throughout development?
Yes = 1
No = 0
Is there a set of hardware and process design guidelines that provide general and component-specific design guidance practices?
Yes = 1
No = 0
Is there an assumptions/dependencies database that is maintained and reviewed prior to each development stage exit?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Are all voltages used in this design less than 110 VAC?
Yes = 1
No = 0
Yes = 1
No = 0
<20 = 1
20-100 = .5
>100 = 0
<18 = 1
18-36 = .75
36-48 = .5
>48 = 0
Yes = 1
No = 0
Does the operational concept call for a remote operations and maintenance (O&M) operator to be able to diagnose system problems
as part of the system concept?
Yes = 1
No = 0
Yes = 1
No = 0
Does this FRU have a 25% reduction in parts count over its predecessor or competitor?
Yes = 1
No = 0
Are stuck faults required to be isolated down to a single failing FRU 90% of the time?
Yes = 1
No = 0
Does this FRU report its status via a "Management Information Data Bit" (MIB) capability for fault determination and isolation?
Yes = 1
No = 0
Yes = 1
No = 0
Chapter 7: Examples
Gij
Wij
Yes = 1
No = 0
Yes = 1
No = 0
Is the customer directly involved in defining the product's operational profile and in reviewing the test plans?
Yes = 1
No = 0
Does test planning take into account the "lessons learned" database?
Yes = 1
No = 0
Does a problem tracking database exist and is it being used on this program?
Yes = 1
No = 0
Will accelerated testing be performed during development that combines temperature and vibration?
Yes = 1
No = 0
Will alpha tests be conducted, whereby the final product is robustly tested against probable extensions of its operational
environment?
Yes = 1
No = 0
Will beta tests be conducted, whereby customers can use and test pre-release versions of the product, feeding back their results to
the developer?
Yes = 1
No = 0
Are test procedures, set-up conditions, results, etc., documented so that measurements can be verified, failures reproduced, test
conditions recreated, and corrective actions confirmed?
Yes = 1
No = 0
Will a gold standard (tested product) be preserved for comparative regression analysis?
Yes = 1
No = 0
Yes = 1
No = 0
Will the FRU be reliability or endurance tested (at any assembly level)?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
<80 = 0
80-95 = .5
>95 = 1
Yes = 1
No = 0
>40 = 1
32-40 = .75
25-32 = .25
<25 = 0
1=1
2=0
Has mechanical loading of the test fixture on the device under test been analyzed?
Yes = 1
No = 0
Are the buses or signal lines actively driven (vs. passively driven)?
Yes = 1
No = 0
Are the test item configurations representative of both the design (development tests) or production (validation tests) products?
Yes = 1
No = 0
3=1
2 = .8
1 = .5
0=0
What % of nodes (interconnection of traces) can be backward driven (thresholds at 80 and 95%)?
Has test fixture complexity been analyzed for fixtures with over 50 pins per square inch?
What is the test point contact size in mils (thresholds at 40, 32, and 25 mils)?
Will the product be environmentally stress tested (0 to 3) for 1. Design, 2. Qualification, 3. Product Acceptance?
Chapter 7: Examples
Gij
Wij
>95 = 1
80-95 = .5
50-80 = .25
<50 = 0
Are Engineering Design Analysis (EDA) tools available that will be used to support the design task?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Are there dedicated tool support personnel available to the development team?
Yes = 1
No = 0
Do the development team members have domain expertise with the operating platform, i.e., HP, Unix, networking, and OS?
Yes = 1
No = 0
Yes = 1
No = 0
What % of nodes can be tested on this FRU (thresholds at 95, 80 and 50%)?
How many cuts and traces are allowed if this is a Printed Wiring Board (PWB) (preliminary thresholds at 10, 20)?
Is a Computer Aided Design/Computer Aided Manufacturing (CAD/CAM) process used to support manufacturing?
Does the CAD/CAM system allow manufacturing personnel to have access to design information and documentation?
Are there no adjustments associated with this FRU?
Are parts/assemblies designed to simplify and facilitate automatic feeding and insertion?
Do hand-inserted parts have visual guides to aid in building the assembly?
Is there a focused effort to minimize the number of cables and connectors on this FRU?
Does the system support the FRUs being "hot-pluggable"?
Has the number of different manufacturing processes been minimized in building this FRU?
Is a flexible manufacturing process used, such that this new product will be fabricated on an existing, proven line?
Is a cellular manufacturing process used, where the autonomous manufacturing station has all materials and parts brought to it and it
produces a finished product?
If there are symmetrical, polarized components used on this FRU, is the mounting process made "fool-proof", so that they cannot be
inserted backwards?
Gij
Wij
>3 = 0
3 = .5
2 = .75
1 = 1
>20 = 0
10-20 = .5
<10 = 1
Yes
No
Yes
No
Yes
No
=
=
=
=
=
=
1
0
1
0
1
0
Yes
No
Yes
No
Yes
No
=
=
=
=
=
=
1
0
1
0
1
0
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
=
=
=
=
=
=
=
=
=
=
1
0
1
0
1
0
1
0
1
0
3
5
3
4
5
3
3
4
3
3
3
3
Chapter 7: Examples
What is the number of different fastener types associated with this FRU (threshold at 0, 1, 2, >2)?
Is it easy to visually distinguish between fasteners (e.g., no minor differences in length) prior to installation?
Is there only one type fastener drive (torx, Phillips, etc.) needed in the assembly, installation and maintenance of this FRU?
Are mounting guides or registration pins provided for aligning and securing electro-mechanical or electro-optical parts?
Are development personnel, including manufacturing, all co-located?
Does this project have a built-in 15% staffing buffer, i.e., staffing is at least 115% of base requirements?
Is the project organized around self-directed work teams?
Are workers rated on both total output and quality?
Are there process improvement teams with continuous measurable improvement (CMI) goals?
Are employees rated on field performance of the product?
Is there an advanced manufacturing engineering (AME) support department to help bridge between engineering and production?
Has Cross Functional Development Team (CFDT) been implemented such that the manufacturing manager is able to explain the
design concept?
Are manufacturing people encouraged to ask questions of development people (identified points of contact) when questions arise?
Are enterprise points-of-contact (POCs) identified (development, manufacturing, test, field, marketing) to help answer questions and
address issues across the organization?
Can any of the line or quality personnel "stop the line" if that person believes a serious problem exists?
Has the majority of the manufacturing leadership had direct field or customer contact in the past year?
Do manufacturing people have measurable goals to improve production metrics, including quality and cycle time?
If answer to 3.2.13 is yes, do direct manufacturing people participate in developing the goals?
Do manufacturing personnel have goals for continuous quality improvement?
Are there quality circles that meet regularly?
Are teams rewarded or recognized for improving quality?
Are key metrics for quality and cost monitored and tracked?
Is the cost of defect prevention measures tracked (proactive quality)?
Gij
Wij
Yes = 1
No = 0
>2 = 0
2 = .25
1 = .5
0 = 1
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes
No
Yes
No
Yes
No
=
=
=
=
=
=
1
0
1
0
1
0
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
=
=
=
=
=
=
=
=
=
=
1
0
1
0
1
0
1
0
1
0
Yes = 1
No = 0
Yes
No
Yes
No
Yes
No
=
=
=
=
=
=
1
0
1
0
1
0
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
=
=
=
=
=
=
=
=
=
=
1
0
1
0
1
0
1
0
1
0
3
3
3
6
6
3
4
5
4
3
1
5
5
3
5
4
3
3
3
5
3
Chapter 7: Examples
Gij
Wij
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
Yes
No
Yes
No
Yes
No
=
=
=
=
=
=
1
0
1
0
1
0
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
Yes
No
Yes
No
Yes
No
=
=
=
=
=
=
1
0
1
0
1
0
Yes = 1
No = 0
3
5
5
4
6
7
6
6
4
3
3
4
6
4
2
3
5
3
2
3
3
5
4
4
3
6
Chapter 7: Examples
How many elements (0 to 4) of environmental stress screening (ESS) are run: 1. temperature bake, 2. temperature cycle, 3.
temperature shock, 4. vibration?
Gij
Wij
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
Yes
No
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
Yes
No
Yes
No
Yes
No
=
=
=
=
=
=
1
0
1
0
1
0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
0 = 0
1 = .25
2 = .5
3 = .75
4 = 1
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
1 = 1
2 = 0
Yes = 1
No = 0
Yes = 1
No = 0
6
3
3
4
3
3
3
3
5
5
5
5
4
5
4
7
5
5
3
3
3
3
3
Chapter 7: Examples
Gij
Wij
<30 = 0
30-50 = .5
50-100 = .75
>100 = 1
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
3
6
3
3
Gij
Wij
Yes = 1
No = 0
Yes = 1
No = 0
Are part evaluation and qualification processes established to add parts to the PPL?
Yes = 1
No = 0
Does a cross-functional development team (CFDT) review and approve new candidate parts for addition to the PPL?
Yes = 1
No = 0
Is this a commercial off-the-shelf (COTS) purchased assembly with a good history of operational reliability? If the assembly is not
COTS, select "Yes".
Yes = 1
No = 0
Will new parts be excluded from being added to the PPL to design this FRU?
Yes = 1
No = 0
Are procedures in place to detect part problems in both manufacturing and the field?
Yes = 1
No = 0
Are quality and reliability data tracked on parts and fed back to suppliers so they know their performance on this product?
Yes = 1
No = 0
Is there a design compliance checklist to ensure that all parts are properly applied, operating at sufficient margin with respect to
environmental and operational stresses, and take into account lessons learned?
Yes = 1
No = 0
Are there processes in place that specifically address precautions and handling of parts/components susceptible to electrostatic
discharge (ESD)?
Yes = 1
No = 0
Do part specifications reflect environmental and regulatory compliance requirements for the specific intended application?
Yes = 1
No = 0
Has mechanical interfacing of critical parts been facilitated by providing mating parts/assemblies to the part supplier?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Chapter 7: Examples
Gij
Wij
Yes = 1
No = 0
In the case of commercial off-the-shelf (COTS) equipment, is the purchased assembly certified and marked to sell in Europe (CE
marked)? If the assembly is not COTS, select Yes.
Yes = 1
No = 0
Is the FRU under configuration management control by the time it enters system test?
Yes = 1
No = 0
Yes = 1
No = 0
Will the supplier manage the developers inventory, in the case of high volume production?
Yes = 1
No = 0
Will all suppliers provide timely failure reporting and corrective action support (FRACAS) for both critical and custom parts (timely
reporting implies a 2 week turnaround with faster response on priority demand)?
Yes = 1
No = 0
Have vendor dependencies been identified for critical and custom components?
Yes = 1
No = 0
Have suppliers identified the likely failure modes on critical and custom parts, and does the design take these failure modes into
account?
Yes = 1
No = 0
Are operational failure rate and failure mode data provided by the suppliers of critical and custom parts being used?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Has the supplier reviewed the part application for all critical and custom parts?
Yes = 1
No = 0
Has the developer met with suppliers to discuss the application of all critical and custom parts?
Yes = 1
No = 0
Has a suppliers technical point of contact (POC) been identified for addressing reliability concerns?
Yes = 1
No = 0
Will critical suppliers provide timely notice of impending part changes to allow the developer to assess the impact?
Yes = 1
No = 0
Is a change history log maintained to provide traceability of engineering change actions and their associated rationale for critical and
custom parts?
Yes = 1
No = 0
Will part identification (revision numbers) be shown on the part to identify the particular part configuration, including the level of
the parts firmware?
Yes = 1
No = 0
Yes = 1
No = 0
If suppliers update firmware will the part identification reflect this change?
Yes = 1
No = 0
Will suppliers part support timing horizon meet program development, manufacture, and field support component requirements?
Yes = 1
No = 0
Will vendor provide timely notice of production/support cessation and provide an end of life buy opportunity?
Yes = 1
No = 0
Will future releases of this part be compatible with respect to form, fit and function?
Yes = 1
No = 0
Yes = 1
No = 0
Do critical and custom parts on this FRU all have at least a 12-month warranty?
Yes = 1
No = 0
Have likely part developments, evolution, and extensions of critical/custom parts been identified by the supplier?
Yes = 1
No = 0
Yes = 1
No = 0
Chapter 7: Examples
Gij
Wij
Yes = 1
No = 0
Has a functional block diagram been developed for COTS or purchased complex part assemblies?
Yes = 1
No = 0
Has a failure history been collected for critical parts, complex assemblies, or COTS items?
Yes = 1
No = 0
Yes = 1
No = 0
Have suppliers, in the case of complex part assemblies, supported the developer in performing a Failure Modes and Effects Analysis
(FMEA) on those assemblies?
Yes = 1
No = 0
Have the sources and the extent of part variation been identified?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Will a design of experiments part evaluation, considering variations, as well as manufacturing variations, be conducted?
Yes = 1
No = 0
Have mechanical interfacing components been provided to the key vendors to assure proper mechanical mating?
Yes = 1
No = 0
Will the developer's quality organization audit suppliers' processes and facility capabilities?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Are procedures and processes in place for the identification and handling of critical reliability components (derating, screening,
failure response, etc.)?
Yes = 1
No = 0
Gij
Wij
Does the customer participate with the developer in developing/validating a requirements statement?
Yes = 1
No = 0
Is Quality Function Deployment (QFD) used to help develop requirements and requirements traceability?
Yes = 1
No = 0
If QFD is not used, is there another systematic way used, such as a Pugh chart, to identify and document customer needs and
preferences?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Has a comprehensive literature study been done of relevant design and reliability technology advancements?
Yes = 1
No = 0
Have previous or similar products been reviewed for their advantages and pitfalls?
Yes = 1
No = 0
Has a "lessons learned" database been studied to ensure the product will not repeat past problems?
Yes = 1
No = 0
Chapter 7: Examples
Gij
Wij
Have aggressive requirements (particularly reliability, availability, and/or safety) been explicitly specified?
Yes = 1
No = 0
10
Yes = 1
No = 0
Does the requirements definition also account for what the product is supposed to "not do" (for example, air bags should not deploy
except on impact)?
Yes = 1
No = 0
Is there a plan as to how to retire or recycle this new system at the end of its life?
Yes = 1
No = 0
Does a requirements database exist to capture opportunistic requirements for future consideration?
Yes = 1
No = 0
Have future expansion requirements been identified (such as loading growth) and can the system handle the projected growth in
demand?
Yes = 1
No = 0
Are requirements deemed achievable within program budget and schedule restraints, with a 90% confidence level?
Yes = 1
No = 0
Are product requirements allocated to a useful level of indenture (considering complexity, level of design flexibility, and safety
concerns)?
Yes = 1
No = 0
Has a project level Failure Modes and Effects Analysis (FMEA) been done in conjunction with designers and system engineers at
the planning stage?
Yes = 1
No = 0
Will the FMEA be refined down to the Field Replaceable Unit (FRU) level during design?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Are creativity and team building exercises being conducted during the planning stage?
Yes = 1
No = 0
Are future product releases planned in order to systematically integrate new requirements and features?
Yes = 1
No = 0
Are trade studies shared with the customer to broaden the base of inputs and support for design decisions?
Yes = 1
No = 0
Does a vision statement that speaks to reliability exist for the product?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Is the development team provided guidelines for acceptable deliverables at kick-off meetings for each development stage?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Will state diagrams be developed before detail design to depict control flows?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Will a list identifying the capabilities and advantages that this product provides the customer be developed and maintained?
Yes = 1
No = 0
Is there a system transition plan to replace the current system with the new system, in a smooth, non-disruptive manner? When not
applicable select Yes.
Yes = 1
No = 0
Chapter 7: Examples
Gij
Wij
Are requirements allocated to a useful level of indenture (considering complexity, level of design flexibility, and design autonomy)?
Yes = 1
No = 0
Are requirements verification activities planned for the appropriate stages of product development?
Yes = 1
No = 0
Are entrance and exit criteria established for each development stage?
Yes = 1
No = 0
Yes = 1
No = 0
Is requirements compliance verified prior to the exit of each phase and prior to shipment?
Yes = 1
No = 0
Are test cases developed concurrently with the design and reviewed by the designers?
Yes = 1
No = 0
Is there a log of key product decisions and accompanying rationale for traceability?
Yes = 1
No = 0
Does the specified reliability represent an improvement of 10% or greater over its predecessor or competitive products?
Yes = 1
No = 0
Is there definition and agreement as to what constitutes successful product reliability performance by the customer?
Yes = 1
No = 0
Yes = 1
No = 0
Are development and reliability requirements developed by a cross-functional development team (CFDT)?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Is prototype interconnection hardware routinely provided to interfacing subsystems and suppliers to guide their packaging?
Yes = 1
No = 0
Is there a requirement to detect and isolate faults to a single FRU 90% of the time?
Yes = 1
No = 0
Is there a system failure modes and effects analysis (FMEA) done during planning stage, and is it updated throughout the program?
Yes = 1
No = 0
Customer and process Q1: Have I identified who are my internal customers and my external customers?
Yes = 1
No = 0
Customer and process Q2: Have I identified what deliverables my customers need (plans, prototypes, documentation,)?
Yes = 1
No = 0
Yes = 1
No = 0
Customer and process Q4: Is there a customer centered quality initiative that will be incorporated to differentiate your deliverables?
Yes = 1
No = 0
Customer and process Q5: Is there an identified tool or process improvement that the reliability section or the development
organization will gain from this effort?
Yes = 1
No = 0
Have the customers been notified and concur on items Q1-Q3 above?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Do the developers, reviewers, testers, QA, manufacturing, and customer program office, all share in the accountability for getting a
successful program to the field?
Yes = 1
No = 0
Are developers and the entire product team rated or rewarded based upon the field performance of the product?
Yes = 1
No = 0
Chapter 7: Examples
Gij
Wij
Yes = 1
No = 0
Yes = 1
No = 0
Are there checklists covering reliability concerns for each program phase?
Yes = 1
No = 0
Yes = 1
No = 0
Are periodic informal activities, such as brown bag lunches, promoted to encourage team member technical exchange in an informal
atmosphere?
Yes = 1
No = 0
Can a technical employee call for a technical review board of peers when it is felt appropriate to address a broad-impact technical
concern?
Yes = 1
No = 0
Does this equipment not require an interface with other vendors' equipment or government furnished equipment (GFE)?
Yes = 1
No = 0
Is the % of product reuse from previous products 25% or more of the lines of code for software?
Yes = 1
No = 0
Is the % of product reuse from previous products 50% or more of the FRU count or cost for hardware?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Are there program development plans that show timing of activities and deliverables (this should be done during the requirements
phase and maintained throughout the program)?
Yes = 1
No = 0
Is there a non-management person designated to work full-time as the program technical lead, who works as a cross-team facilitator?
Yes = 1
No = 0
Is there a team-building effort and project brain storming at each program phase?
Yes = 1
No = 0
Are documentation products maintained on-line and accessible to all program personnel?
Yes = 1
No = 0
Is there a program database of "action items" that is maintained and managed to closure?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Are business cases always run to evaluate the benefits and impacts of making a change (e.g., Reinertsen's model)?
Yes = 1
No = 0
Are total cost estimates made for ECs, including scrap, rework, tooling, and the potential slippage of schedule?
Yes = 1
No = 0
Are there two or less EC's planned during the first year of shipping?
Yes = 1
No = 0
Are ECs blocked into sections and scheduled ahead on periodic intervals to promote timely integration of changes?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Are there any ECs that are modifying previous ECs on the FRU?
Yes = 1
No = 0
Is there an EC meeting log maintained that includes the change rationale, the analysis provided, and meeting participants?
Yes = 1
No = 0
Chapter 7: Examples
Gij
Wij
Are EC management metrics collected with a focus on continual, measurable process improvement?
Yes = 1
No = 0
Are the program development, integration, and test activities charted, showing tasks, their timing, operational dependencies and
identification of critical path activities?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Are there risk assessment and contingency plans to minimize critical path risk?
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Are the R&M design goals sufficiently defined and allocated to ensure that customer needs are met?
Yes = 1
No = 0
Has development committed to support the required tasks for meeting the customer's R&M needs?
Yes = 1
No = 0
Yes = 1
No = 0
Has an agreed-to process been defined to assess progress towards meeting R&M goals and requirements?
Yes = 1
No = 0
Have adequate means been agreed upon to ensure that the R&M objectives of the product will have been achieved?
Yes = 1
No = 0
Have processes been defined and implemented to ensure that the designed-in (inherent) reliability does not degrade during
manufacturing and operational use?
Yes = 1
No = 0
Table 7.2-27: Can Not Duplicate (CND) Process Grade Factor Questions
Question
Is the system required to isolate to a single Field Replaceable Unit (FRU) on 90% % of failures?
Is there a specified time limit to isolate a fault, effect a repair and restore the system?
Is there a requirement for 90% or greater test coverage within the FRU being analyzed?
Does the system promote remote serviceability with failure status communicated via Ethernet, serial port, parallel port, serial bus,
etc., to a central maintenance station?
Is there any remote failure protection for this FRU residing on a separate FRU (e.g., an arc suppression circuit that is located on a
different FRU than the relay FRU)?
Is this FRU designed to be hot-pluggable?
Does the FRU designer also design the fault isolation software that supports fault diagnosis?
Are multiple occurrences of "Can Not Duplicate" (CND) incidents analyzed for root cause of the problem?
Gij
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Wij
6
5
6
4
5
4
6
6
Chapter 7: Examples
Table 7.2-27: Can Not Duplicate (CND) Process Grade Factor Questions
Question
Are test, warranty, early-life, and high fallout FRUs subjected to double fault verification (this procedure re-inserts the faulted FRU
to ensure the problems track the replaced FRU)?
Do your current products experience 40% or less Can Not Duplicate (CND) failures (note that CNDs are synonymous with No
Defects Found (NDF) and No Trouble Found (NTF))?
Is a failure mode and effect analysis (FMEA) performed down to the FRU level or the Circuit Card Assembly (CCA) level,
whichever is lower?
Do design personnel participate directly in performing the FMEA?
Are maintenance analysis procedures (MAPs) developed to map failure symptoms to the failing FRU?
Are the MAPs verified by inserting faults in a maintainability test?
Are the MAPs updated with actual test and field data?
Has your company established the cost impact of a field failure?
Does the system contain error logging and reporting capability?
Does the system promote ongoing analysis of soft error conditions that might predict when a likely failure will occur?
Will the contractor developing this equipment also be responsible for maintaining it?
Does the repair facility have the ability to recreate the conditions under which a true false alarm occurred (sequence of events,
operator error, sneak circuit, etc.) and are these techniques used to try to recreate the failure?
Does the repair facility have the ability to recreate the conditions under which a real failure occurred (high/low temperature, thermal
cycling/shock, vibration/ mechanical shock, etc.) and are these techniques used to try to recreate the failure?
Will the maintainer be motivated to provide timely and complete documentation of the diagnosis and repair action?
Do the system maintenance personnel receive feedback on their repair reports and the actions taken to mitigate the failure
reoccurrence?
Are the performance specification limits of the test equipment used to troubleshoot/repair the system, FRU, etc., equal to or more
stringent than the performance specification limits of the system, FRU, etc., in its actual application?
Are CND failures included in the Failure Reporting and Corrective Action System (FRACAS) system and closed out through
corrective action verification?
Gij
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Wij
10
8
5
5
5
4
5
3
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
5
4
5
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
5
5
5
5
5
5
Gij
Wij
Are parts/materials selected, as appropriate to meet design performance requirements that minimize the risk of induced failure
through electrostatic discharge?
Yes = 1
No = 0
If parts/materials are susceptible, are procedures used to protect them during handling, test, assembly, packaging, storage,
transportation and use (i.e., wrist straps, non-conductive work areas, ionized air, warning labels, maintenance manuals, etc.)?
Yes = 1
No = 0
Are electronic circuits designed and analyzed to minimize secondary failures attributable to electrical overstress resulting from
another primary failure?
Yes = 1
No = 0
Are electronic circuits designed and analyzed to minimize secondary failures attributable to electrical transients generated within the
system/FRU, or received from outside the system/FRU (via cable/wiring harnesses)?
Yes = 1
No = 0
Are maintenance manuals/procedures written such that risk of Electrostatic Discharge/Electrical Overstress (ESD/EOS) during
troubleshooting and repair activity is identified (warning labels, etc.)?
Yes = 1
No = 0
Chapter 7: Examples
Gij
Wij
Has the operating environment that the part/FRU/system is to be used in been evaluated to determine the potential for mishandling of
the equipment that could result in induced mechanical failure (weather; personnel capabilities; training needs)?
Yes = 1
No = 0
Are parts/materials selected, as appropriate to meet design performance requirements that minimize the risk of induced (mechanical)
secondary failure resulting from the primary failure of another part/assembly?
Yes = 1
No = 0
If parts/materials are susceptible to induced mechanical damage, are procedures in-place to protect them during handling, test,
assembly, packaging, storage, transportation and use?
Yes = 1
No = 0
Is the part, FRU, and/or system designed such that it can be handled and transported in a manner that minimizes the risk of induced
mechanical failure (proper location/use of handles; orientation labels This Side Up; etc.)?
Yes = 1
No = 0
Are shipping tests run to ensure adequacy of packaging and shipping procedures to protect the product during transportation?
Yes = 1
No = 0
Are maintenance manuals/procedures written such that the risk of induced mechanical damage during troubleshooting and repair
activity is identified (warning labels, etc.)?
Yes = 1
No = 0
Do maintenance manuals include detailed instructions for removing and replacing parts/components/assemblies from sockets and/or
soldered PCB and multiplayer boards, etc.?
Yes = 1
No = 0
Do maintenance manuals include detailed instructions for disconnecting and reconnecting wires, harnesses, cables, hoses, etc.?
Yes = 1
No = 0
Is the FRU/system ergonomically designed such that it can be used by the customer in normal operation without unnecessary risk of
induced mechanical damage?
Yes = 1
No = 0
Is the FRU designed to withstand normal handling and expected mishaps (e.g., a drop off a 36-inch high table top) without induced
mechanical damage?
Yes = 1
No = 0
Are wires color coded, and connectors keyed or of differing configuration such that FRUs cannot be misplugged?
Yes = 1
No = 0
Gij
Wij
Have all parts and materials been selected for use in the design that extend the wearout life of the part/Field Replaceable Unit
(FRU)/system to meet/exceed its required useful life?
Yes = 1
No = 0
Has the expected reliability of parts subjected to significant mechanical loading been modeled to ensure the capability to endure the
mission, e.g., using Miners life expectation rule for components subjected to cyclical loads?
Yes = 1
No = 0
Have wearout failure modes and mechanisms at the part, FRU and system level been identified and mitigated during the Failure
Modes and Effects Analysis (FMEA) process?
Yes = 1
No = 0
Do the relevant failure modes/mechanisms include fatigue (solder joints for electronic components/assemblies; welds for bonded
materials; fractures in mechanical parts/assemblies/materials; etc.)?
Yes = 1
No = 0
Do the relevant failure modes/mechanisms include leaks (electrolyte loss in electrolytic capacitors; worn seals in hydraulic systems;
etc.)?
Yes = 1
No = 0
Do the relevant failure modes/mechanisms include chafing (wires in electrical harnesses; wear in hydraulic lines and hoses; etc.)?
Yes = 1
No = 0
Do the relevant failure modes/mechanisms include cold flow of insulation (wires wrapped around sharp edges or subjected to
pressure points; etc.)?
Yes = 1
No = 0
Do the relevant failure modes/mechanisms include wearout resulting from cyclic operations (activation in electronic switch/relay
contacts; mating/unmating of electronic or mechanical connectors; etc.)?
Yes = 1
No = 0
Do the relevant failure modes/mechanisms include wearout resulting from breakdown of insulation in wires, or dielectric materials in
semiconductors?
Yes = 1
No = 0
Do the relevant failure modes/mechanisms include wearout resulting from moving parts (bearings, gears, belts, springs, seals, etc.)?
Yes = 1
No = 0
Has the system/FRU/part design been modified based on the wearout modes/mechanisms identified in the FMEA to reduce or
minimize their occurrence to the maximum extent feasible?
Yes = 1
No = 0
Chapter 7: Examples
Gij
Wij
Are process FMEAs performed to determine the failure modes/mechanisms of critical processes during manufacturing?
Yes = 1
No = 0
Is data collected and analyses performed to determine the process capability of manufacturing processes?
Yes = 1
No = 0
Is statistical process control (SPC) applied to manufacturing processes to control the process mean and variability?
Yes = 1
No = 0
Is the measured mean of each manufacturing process parameter equal to, or better than, the parameter value used to calculate the
wearout failure rates of the system/FRU parts/components?
Yes = 1
No = 0
If required, has this product been hardened to withstand adverse environmental stresses such as corrosion, radiation, humidity, etc.?
Yes = 1
No = 0
Are procedures defined/implemented to ensure that assembly/test steps during manufacturing do not contribute to early wearout of
susceptible items (i.e., minimize connector matings/unmatings; stress relief/tie downs to minimize chafing during test; etc.)?
Yes = 1
No = 0
Do maintenance manuals/procedures instruct repair personnel to check that wire harnesses are properly secured, seals are properly
reinstalled, connectors are properly mated, etc., following troubleshooting/repair?
Yes = 1
No = 0
Is preventative maintenance planned to replace wear out-susceptible parts/materials at or before their L10 life (where no more than
10% of the units should experience wearout)?
Yes = 1
No = 0
Are wearout-susceptible parts/materials inspected during each corrective maintenance action to find and replace items exhibiting
premature wearout?
Yes = 1
No = 0
Are wearout-susceptible parts/materials inspected during each preventive maintenance action to find and replace items exhibiting
premature wear out?
Yes = 1
No = 0
Are wearout failures (both valid and premature) included in the Failure Reporting and Corrective Action System (FRACAS) and
closed out through corrective action, which could include life-extension opportunities?
Yes = 1
No = 0
Is field data tracked and analyzed to detect FRUs displaying increasing failure rate tendencies, i.e., wearout?
Yes = 1
No = 0
7.2.5.10.
Gij
Yes = 1
No = 0
Yes = 1
No = 0
G = percentage/100
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Yes = 1
No = 0
Wij
8
8
6
6
4
10
5
Chapter 7: Examples
This section presents the results of an analysis in which the intent was to quantify the
reliability of a seal used in an assembly. The approach taken in the analysis was to
perform life tests under a variety of conditions, and to develop life models from this data
so that lifetimes could be predicted as a function of the appropriate stress and product
variables. In this manner, estimates of reliability under a wide range of use conditions
could be made.
This is an example of an assessment methodology, the results of which would be more
accurate than a prediction method applied to the seal. If the analyst is able to develop a
model like the one presented here for a specific component or failure cause, the resulting
model should be weighed more heavily than a prediction on the specific component.
7.3.2. Approach
All samples were tested under a variety of temperature and relative humidity conditions.
In addition, samples included two factors which were varied in the life tests: Process
Force and Hardness. These stresses and product/process variables were expected to be
the ones that most heavily influenced the product reliability.
7.3.3. Reliability Test Plan
The Reliability Test Plan required that the lifetime be measured at various magnitudes of
these variables, such that life model parameters (including acceleration factors) could be
quantified. Table 7.3-1 summarizes, for each variable, the number of levels, and the level
values.
Number of Levels
2
2
2
3
Levels
85, 130 C
85, 100%
2, 20 N
25, 50, 100 V
Chapter 7: Examples
Temperature
Humidity
Hardness
Process Force
85
85
85
85
85
85
130
130
130
130
130
130
130
130
130
130
130
130
85
85
85
85
85
85
85
85
85
85
85
85
100
100
100
100
100
100
25
50
100
25
50
100
25
50
100
25
50
100
25
50
100
25
50
100
2
2
2
20
20
20
2
2
2
20
20
20
2
2
2
20
20
20
The tests were performed by first inspecting each sample, then exposing them to the
specific combination of variables as previously summarized, and, finally, re-inspecting
them at various intervals. The exposure times and inspection intervals were structured
such that short lifetimes could be observed in the event that acceleration factors were
higher than anticipated. Therefore, more frequent inspections were performed early in
the test, followed by less frequent inspections for the surviving samples. Failed samples
were removed from the test.
Data was then summarized in a format suitable for life modeling. The required data
elements included stress and product/process variables, plus life variables, as follows:
Variables:
o Temperature
o Humidity
o Process force
o Hardness
Reliability Information Analysis Center
351
Chapter 7: Examples
Life variables
o Last known good time
o First known bad time
7.3.4. Results
7.3.4.1. Times to Failure Summary
The test results for the seal samples are presented in Table 7.3-3. Included in this table is
the sample number, the temperature (in degrees C), the relative humidity, the Hardness,
the process force, whether the sample failed (F) or survived (S), and the time at which it
failed or survived.
RH
Thickness
Speed
F or S
Time to F/S
RH
Thickness
Speed
F or S
Time to F/S
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
25
25
25
25
25
25
25
25
25
25
25
25
25
25
50
50
50
50
50
50
50
50
50
50
50
50
50
50
100
100
100
100
100
100
100
100
2
2
2
2
2
2
2
20
20
20
20
20
20
20
2
2
2
2
2
2
2
20
20
20
20
20
20
20
2
2
2
2
2
2
2
20
S
S
S
F
S
S
S
S
S
S
S
S
S
S
S
S
S
S
S
S
S
S
S
S
F
S
S
S
S
S
S
S
S
S
S
S
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
778
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
1159
85
85
85
85
85
85
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
85
100
100
100
100
100
100
25
25
25
25
25
25
25
25
25
25
25
25
25
25
50
50
50
50
50
50
50
50
50
50
50
50
50
50
100
100
20
20
20
20
20
20
2
2
2
2
2
2
2
20
20
20
20
20
20
20
2
2
2
2
2
2
2
20
20
20
20
20
20
20
2
2
S
S
S
S
S
S
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
1159
1159
1159
1159
1159
1159
278
158
130
237.5
158
196.5
130
158
196.5
237.5
428
237.5
130
158
237.5
196.5
278
196.5
278
158
158
196.5
158
196.5
428
158
237.5
158
278
158
Chapter 7: Examples
T
RH
Thickness
Speed
F or S
Time to F/S
RH
Thickness
Speed
F or S
Time to F/S
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
85
85
85
85
85
85
85
85
85
85
85
85
85
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
25
25
25
25
25
25
25
25
25
25
25
25
25
25
50
50
50
50
2
2
2
2
2
2
20
20
20
20
20
20
20
2
2
2
2
2
2
2
20
20
20
20
20
20
20
2
2
2
2
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
S
F
F
F
F
F
F
S
F
S
278
220
371
278
325
428
58
325
428
325
428
278
196.5
58
59
34
58
34
34
58
58
70
34
34
58
1.5
58
58
70
58
70
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
130
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
50
50
50
50
50
50
50
50
50
50
100
100
100
100
100
100
100
100
100
100
100
100
100
100
2
2
2
20
20
20
20
20
20
20
2
2
2
2
2
2
2
20
20
20
20
20
20
20
F
F
S
F
S
F
F
F
F
F
S
F
F
F
F
F
F
S
S
F
F
F
F
S
58
58
70
58
70
34
34
58
34
58
70
58
58
58
58
58
34
70
70
58
58
58
58
70
The 2-parameter Weibull distribution parameters for the TTF distributions for the
samples are shown in Table 7.3-4.
Characteristic
Life
2109
268
62.1
Shape
Parameter
5.1
2.71
3.2
The TTF distributions for each of the three test conditions are illustrated in Figure 7.3-1.
Chapter 7: Examples
Re lia So f t W e ibull+ + 7 - w w w . Re lia So f t. co m
Probability - Weibull
9 9. 0 00
P ro ba bility-W e ibull
F o lio1 \SL-130 , 1 00
W e ibull-2P
MLE SRM MED F M
F = 3 3/S= 9
Da ta Po ints
Susp Po ints
Pro ba bilit y Line
9 0. 0 00
F o lio1 \SL-130 , 8 5
W e ibull-2P
MLE SRM MED F M
F = 4 3/S= 0
Da ta Po ints
Pro ba bilit y Line
Unreliability, F(t)
5 0. 0 00
F o lio1 \SL-85, 85
W e ibull-2P
MLE SRM MED F M
F = 2 /S= 40
Da ta Po ints
Pro ba bilit y Line
1 0. 0 00
5. 00 0
1. 00 0
1. 0 0 0
10 . 00 0
1 00 . 0 00
10 0 0. 0 00
Bill De nson
Co rning
1 1 /24 /2 00 8
5 :0 5:2 2 PM
1 0 00 0. 00 0
T ime, (t)
F olio 1\SL-13 0, 10 0: =3 .2 1 8 3 , =6 2 .1 1 9 5
F olio 1\SL-13 0, 85 : = 2 .7 2 2 1 , = 2 6 8 .2 4 7 9
F olio 1\SL-85 , 85: =5 .0 5 0 5 , =2 1 0 9 .0 6 3 5
Life models were generated from the data summarized above. These life models estimate
the TTF distribution as a function of the variables used in the experiments.
A general form of the Weibull reliability function used is:
R=e
where:
R=
=
Chapter 7: Examples
The characteristic life is then developed as a function of the applicable variables. The
model form is:
0
= e e T RH H F
2
Where:
0 through 4 =
T=
RH =
H=
F=
))
f =
F=
ti =
tj
tk and tk-1 =
The first of the three product terms represent failures at known times, the second
represents survivals, and the third represent failures that occur within intervals but the
precise failure times are not known
Once the model parameters are estimated in this fashion, the reliability at any time, and
for any combination of variables, can be estimated.
Reliability Information Analysis Center
355
Chapter 7: Examples
The estimated parameters are summarized in Table 7.3-5. In this table, the best estimate
is provided along with the 80% 2-sided confidence levels around the estimate. A small
variation between the lower and upper confidence bound are indicative of significant
variables.
Lower 80% CL
Best Estimate
Upper 80% CL
2.737
3.073
3.450
19.68
23.98
28.28
6957.2
8015.7
9074.3
-9.45
-8.83
-8.21
0.131
0.215
0.299
-0.0031
0.0388
0.0807
=e
23.98
8015.7
T
RH
8.83
0.2150
0.0388
Once the model parameters are estimated, then a variety of output formats are possible.
For example, Figure 7.3-2 illustrates the probability of failure as a function of
temperature and relative humidity at a time of 50,000 hours.
Chapter 7: Examples
Unreliability vs Stress Surface
Figure 7.3-2: Probability of Failure vs. Temperature and Relative Humidity at 50,000
Hours
Chapter 7: Examples
there are few sources of failure rate data for other component types. All part types and
assemblies for which RIAC has data are included in NPRD with the exception, of
standard electronic component types. Although the data contained in NPRD were
collected from a wide variety of sources, RIAC has screened the data such that only high
quality data is added to the database and presented in this document. In addition, only
field failure rate data is included. The intent of this section is to provide the user with
information to adequately interpret and use data to supplement standard reliability
prediction methodologies.
It is not feasible for documents like MIL-HDBK-217 or other prediction methodologies
to contain failure rate models on every conceivable type of component and assembly.
Traditionally, reliability prediction models have been primarily applicable only for
generic electronic components. Therefore, NPRD serves a variety of needs:
The failure rate data contained in the newest version of NPRD (NPRD-2010) will
represent a cumulative compilation of data collected from the early 1970's through
December 2008. RIAC is continuously soliciting new field data in an effort to keep the
databases current. The goals of these data collection efforts are as follows:
1. To obtain data on relatively new part types and assemblies.
2. To collect as much data on as many different data sources, application
environments, and quality levels as possible.
3. To identify as many characteristic details as possible, including both part and
application parameters.
The following generic sources of data were used for this publication:
1.
2.
3.
4.
5.
Chapter 7: Examples
Environments/Quality
Age
Component Types
Availability of Quality Data
(2)
(3)
(4)
(5)
Transform Data to
Common RIAC Database
Template
Perhaps the most important aspect of this data collection process is identifying viable
sources of high quality data. Large automated maintenance databases, such as the Air
Force REMIS system or the Navy's 3M and Avionics 3M systems, typically will not
provide accurate data on piece parts. They can, however, provide acceptable data on
assemblies or LRUs, if used judiciously. Additionally, there are specific instances in
which they can be used to obtain piece-part data. Piece-part data from these maintenance
systems is used in the RIAC's data collection efforts only when it can be verified that
they accurately report data at this level. Reliability Improvement Warranty (RIW) data
are another high quality data source which has been used.
Reliability Information Analysis Center
359
Chapter 7: Examples
Chapter 7: Examples
Data contained in NPRD-2010 reflects industry average failure rates, especially the
summary failure rates which were derived by combining several failure rates on similar
parts/assemblies from various sources. In certain instances, reliability differences can be
distinguished between manufacturers or between detailed part characteristics. Although
the summary section of NPRD cannot be used to identify these differences (since it
presents summaries only by generic type, quality, environment, and data source), the
listings in the detailed section of NPRD contain all of the specific information that was
known for each part and, therefore, can sometimes be used to identify such differences.
Data in the summary section of NPRD represent an "estimate" of the expected failure
rate. The "true" value will lie within some confidence interval about that estimate. The
traditional method of identifying confidence limits for components with exponentially
distributed lifetimes has been the use of the Chi-Square distribution. This distribution
relies on the observance of failures from a homogeneous population and, therefore, has
limited applicability to merged data points from a variety of sources.
To give users of NPRD a better understanding of the confidence they can place in the
presented failure rates, an analysis of RIAC data in the past concluded that, for a given
generic part type, the natural logarithm of the observed failure rate is normally distributed
with a standard deviation of 1.5. This means that 68 percent of the actual experienced
failure rates will be between 0.22 and 4.5 times the mean value. Similarly, 90% of actual
failure rates will be between 0.08 and 11.9 times the presented mean value. As a general
rule-of-thumb, this type of precision is typical of probabilistic reliability prediction
models and point-estimate failure rates such as those contained within NPRD. It should
be noted that this precision is applicable to predicted failure rates at the component level,
and that confidence will increase as the statistical distributions of components are
combined when analyzing modules or systems.
In virtually all of the field failure data collected for NPRD, TTF was not available. Few
current DoD or commercial data tracking systems report elapsed time indicator (ETI)
meter readings that would allow TTF compilations. Those that do lose accuracy
following removal and replacement of failed items. To accurately monitor these times,
each replaceable item would require its own individual time recording device. Data
collection efforts typically track only the total number of item failures, part populations,
and the number of system operating hours. This means that the assumed underlying TTF
distribution for all failure rates presented in NPRD is the exponential distribution.
Unfortunately, many part types for which data are presented typically do not follow the
exponential failure law, but rather exhibit wearout characteristics, or an increasing failure
Reliability Information Analysis Center
361
Chapter 7: Examples
rate in time. While the actual TTF distribution may be Weibull or lognormal, it may
appear to be exponentially distributed if a long enough time has elapsed. This
assumption is accurate only under the condition that components are replaced upon
failure, which is true for the vast majority of data contained in NPRD. To illustrate this,
refer to Figure 7.4-1, which depicts the apparent failure rate for a population of
components that are replaced upon failure, each of which follow the Weibull TTF
distribution. This illustrates Drenicks theorem that was discussed earlier in this book.
Chapter 7: Examples
2
4
6
8
Asymptote
1.0
2.4
4.2
7.0
Additionally, since MTTF is often used instead of characteristic life, their relationship
should be understood. The ratio of alpha/MTTF is a function of beta and is given in
Table 7.4-3.
1.0
2.0
2.5
3.0
4.0
Asymptote
1.00
1.15
1.12
1.10
1.06
Based on the previous discussion, it is apparent that the time period over which data is
collected is very important. For example, if the data is collected from time zero to a
time which is a fraction of alpha, the failure rate will be increasing over that period and
the average failure rate will be much less than the asymptotic value. If however the data
is collected during a time period after which the failure rate has reached its asymptote, the
apparent failure rate will be constant and will have the value 1/alpha. The detailed data
section in NPRD presents part populations which provide the user the ability to further
analyze the time logged to an individual part or assembly, and to estimate the
characteristic life. For example, the detailed section presents the population and the total
number of operating hours for each data record. Dividing the part operating hours by the
population yields the average number of operating hours for the system/equipment in
which the part/assembly was operating. An entry for a commercial quality mercury
battery in a ground, fixed (GF) environment indicates that a population of 328 batteries
had experienced a total of 0.8528 million part hours of operation. This indicates that
each battery had experienced an average of 0.0026 million hours of operation in the time
period over which the data was collected. If a shape parameter, beta, of the Weibull
distribution is known for a particular part/assembly, the user can use this data to
extrapolate the average failure rate presented in NPRD to a Weibull characteristic life
Reliability Information Analysis Center
363
Chapter 7: Examples
(alpha). If the percent failure rate is relatively low, the methodology is of limited value.
If a significant percent of the population has failed, the methodology will yield results for
which the user should have a higher degree of confidence. The methodology presented is
useful only in cases where TTF characteristics are needed. In many instances, knowledge
of the part characteristic life is of limited value if the logistics demand is the concern.
This data can, however, be used to estimate characteristic life in support of preventive
maintenance efforts. The assumptions in the use of this methodology are:
1. Data were collected from "time zero" of the part/assembly field usage
2. The Weibull distribution is valid and is known
Table 7.4-4 contains cumulative percent failure as a function of the Weibull beta shape
parameter and the time/characteristic life ratio (t/). The percent failure from the NPRD
detailed data section can be converted to a (t/alpha) ratio using the data in Table 7.4-4.
Once this ratio is determined, a characteristic life can be determined by dividing the
average operating hours per part (part hours/population) by the (t/alpha) ratio. It should
be noted here that the percentage failures in the table can be greater than 100, since parts
are replaced upon failure and there can be an unlimited number of replacements for any
given part.
Chapter 7: Examples
As an example, consider the NPRD detailed data for Electrical Motors, Sensor;
Military Quality Grade; Airborne, Uninhabited (AU) environment; and a Population Size
of 960 units. Assume for this data entry that there were 359 failures in 0.7890 million
part-operating hours. The data may be converted to a characteristic life in the following
manner:
1. Determine the Percent Failure:
359
% Failure = 960 = 37.4%
2. Determine a typical Weibull shape parameter (). For motors, a typical beta
value is 3.0 (Reference 5).
3. Convert the Percent Failure to a t/alpha ratio using Table 7.4-4 (for % fail = 37.4
and = 3)
t
0.65
5. Calculate :
Part Hours
Population
Count
0.00082
=
=
= 0.00126 million hours
0.65
t
Based on this data, an approximate Weibull characteristic life is 1260 hours. The user of
this methodology is cautioned that this is a very approximate method for determining the
characteristic life of an item when TTF data is not available. It should also be noted that
for small values of time (i.e.; t < 0.1 alpha), random failures can predominate, effectively
masking wearout characteristics and rendering the methodology inaccurate.
Reliability Information Analysis Center
365
Chapter 7: Examples
Additionally, for small operating times relative to , the results are dependent on the
extreme tail of the distribution, thus significantly decreasing the confidence in the derived
alpha value.
For part types exhibiting wearout characteristics, the failure rate presented represents an
average failure rate over the time period in which the data was collected. It should also
be noted that for complex nonelectronic devices or assemblies, the exponential
distribution is a reasonable assumption. The user of this data should also be aware of
how data on cyclic devices such as circuit breakers is presented in NPRD. Ideally, these
devices should have failure rates presented in terms of failures per operating cycles.
Unfortunately, from the field data collected, the number of actuations is rarely known
and, therefore, the listed failure rates are presented in terms of failures per operating hour
for the equipment in which the part is used.
7.4.3. Document Overview
Introduction
Part Summaries
Part Details
Data Sources
Part Number/Mil Number Index
National Stock Number Index with Federal Stock Class Prefix
National Stock Number Index without Federal Stock Class Prefix
Part Description Index
The summary section of NPRD contains combined failure rate data, presented in order of
Part Description, Quality Level, Application Environment, and Data Source. The Part
Description itself is presented in a hierarchical classification. The known technical
characteristics, in addition to the classification, are contained in Section 3 of the book,
Part Details. All data records were combined by totaling the failures and operating
hours from each unique data source. In some cases, only failure rates were reported to
RIAC. These data points do not include specific operating hours and failures, and have
dashes in the Total Failed and Operating Hours/Miles fields. Table 7.4-5 describes each
field presented in the summary section.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
366
Chapter 7: Examples
Field Description
Description of the part, including the major family of parts and specific part-type breakdown within the part
family.
The RIAC does not distinguish parts from assemblies within NPRD. Information is presented on
parts/assemblies at the indenture level at which it was available. The description of each item for which data
exists is made as clear as possible so that the user can choose a failure rate on the most similar part or assembly.
The parts/assemblies for which data is presented can be comprised of several part types, or they can be a
constituent part of a larger assembly. In general, however, data on the part type listed first in the data table is
representative of the part type listed and not of the higher level of assembly. For example, a listing for Stator,
Motor represents failure experience on the stator portion of the motor and not the entire motor assembly.
Added descriptors to the right, separated by commas, provide further details on the part type listed first.
Additional detailed part/assembly characteristics can be found, if available, in the Part Details section of
NPRD.
Quality
Level
App. Env.
The Application Environment describes the conditions of field operation. See Table 7.4-6 for a detailed list of
the application environments and their descriptions. These environments are consistent with MIL-HDBK-217.
In some cases, environments more generic than those used in MIL-HDBK-217 are used. For example: "A"
indicates the part was used in an Airborne environment, but the precise location and aircraft type was not
known. Additionally, some environments are more specific than the current version of MIL-HDBK-217, since
the current version has merged many of the environment categories and the NPRD data was originally
categorized into the more specific environment. Environments preceded by the term "NO" are indicative of
components used in a non-operating product or system in the specified environment.
Data Source
Source of data comprising the NPRD data entry. The source number may be used as a reference to Section 4 of
NPRD to review the specific data source description.
Failure Rate
Fails / (E6)
The failure rate presented for each unique part type, environment, quality, and source combination. It is the
total number of failures divided by the total number of life units. No letter suffix indicates that the failure rate
is in failures per million operating hours. An "M" suffix indicates the unit is failures per million miles. For
roll-up data entries (i.e., those without sources listed), the failure rate is derived using the data merge algorithm
described in this section. A failure rate preceded by a "<" is representative of entries with no failures. The
failure rate listed was calculated by using a single failure divided by the given number of operating hours. The
resulting number is a worst case failure rate and the real failure rate is less than this value. All failure rates
are presented in NPRD in a fixed format of four decimal places after the decimal point. The user is cautioned
that the presented data has inherently high variability and that four decimal places does not imply any level of
precision or accuracy.
Total Failed
Op. Hours/
Miles (E6)
The total number of operating life unit (in millions) observed in merged data records. Absence of a suffix
indicates operating hours is the life unit and "M" indicates that miles is the life unit.
Detail Page
The page number containing the detail data source description which comprises the summary record.
Chapter 7: Examples
Description
AI
AIA
Airborne Inhabited Attack - Typical conditions in cargo compartments occupied by aircrew without
environment extremes of pressure, temperature, shock and vibration and installed on high performance aircraft
such as used for ground support.
AIB
Airborne Inhabited Bomber -Typical conditions in bomber compartments occupied by aircrew without
environment extremes of pressure, temperature, shock and vibration and installed on long mission bomber
aircraft.
AIC
Airborne Inhabited Cargo - Typical conditions in cargo compartments occupied by aircrew without
environment extremes of pressure, temperature, shock and vibration and installed on long mission transport
aircraft .
AIF
Airborne Inhabited Fighter - Typical conditions in cargo compartments occupied by aircrew without
environment extremes of pressure, temperature, shock and vibration and installed on high performance aircraft
such as fighters and interceptors.
AIT
Airborne Inhabited Transport - Typical conditions in cargo compartments occupied by aircrew without
environment extremes of pressure, temperature, shock and vibration and installed on high performance aircraft
such as trainer aircraft.
ARW Airborne Rotary Wing - Equipment installed on helicopters; includes laser designators and fire control systems.
AU
Airborne Uninhabited - General conditions of such areas as cargo storage areas, wing and tail installations
where extreme pressure, temperature, and vibration cycling exist.
AUA
Airborne Uninhabited Attack - Bomb bay, equipment bay, tail, or where extreme pressure, vibration, and
temperature cycling may be aggravated by contamination from oil, hydraulic fluid and engine exhaust.
Installed on high performance aircraft such as used for ground support.
AUB
Airborne Uninhabited Bomber - Bomb bay, equipment bay, tail, or where extreme pressure, vibration, and
temperature cycling may be aggravated by contamination from oil, hydraulic fluid and engine exhaust.
Installed on long mission bomber aircraft.
AUF
Airborne Uninhabited Fighter - Bomb bay, equipment bay, tail, or where extreme pressure, vibration, and
temperature cycling may be aggravated by contamination from oil, hydraulic fluid and engine exhaust.
Installed on high performance aircraft such as fighters and interceptors.
AUT
Airborne Uninhabited Transport - Bomb bay, equipment bay, tail, or where extreme pressure, vibration, and
temperature cycling may be aggravated by contamination from oil, hydraulic fluid and engine exhaust.
Installed on high performance aircraft such as used for trainer aircraft.
Chapter 7: Examples
Description
DOR
Dormant - Component or equipment is connected to a system in the normal operational configuration and
experiences non-operational and/or periodic operational stresses and environmental stresses. The system may
be in a dormant state for prolonged periods before being used in a mission.
GB
&
GBC
Ground Benign - Non-mobile, laboratory environment readily accessible to maintenance; includes laboratory
instruments and test equipment, medical electronic equipment, business and scientific computer complexes.
GBC refers to a commercial application of a commercial part.
GF
Ground Fixed - Conditions less than ideal such as installation in permanent racks with adequate cooling air and
possible installation in unheated buildings; includes permanent installation of air traffic control, radar and
communications facilities.
GM
Ground Mobile - Equipment installed on wheeled or tracked vehicles; includes tactical missile ground support
equipment, mobile communication equipment, tactical fire direction systems.
ML
Missile Launch - Severe conditions related to missile launch (air and ground), and space vehicle boost into
orbit, vehicle re-entry and landing by parachute. Conditions may also apply to rocket propulsion powered
flight.
MP
Manpack - Portable electronic equipment being manually transported while in operation; includes portable field
communications equipment and laser designations and rangefinders.
Naval - The most generalized normal fleet operation aboard a surface vessel.
NH
NS
Naval Sheltered - Sheltered or below deck conditions, protected from weather; include surface ships
communication, computer, and sonar equipment.
NSB
Naval Submarine - Equipment installed in submarines; includes navigation and launch control systems.
NU
Naval Unsheltered - Nonprotected surface shipborne equipment exposed to weather conditions; includes most
mounted equipment and missile/projectile fire control equipment.
N/R
SF
Spaceflight - Earth orbital. Approaches benign ground conditions. Vehicle neither under powered flight nor in
atmosphere re-entry; includes satellites and shuttles.
Data records are also merged and presented at each level of part description (categorized
from most generic to most specific). The data entries with no source listed represent
these merged records. Merging data becomes a particular problem due to the wide
dispersion in failure rates, and because many data points consist of only survival data in
which no failures occurred, thus making it impossible to derive a failure rate. Several
Reliability Information Analysis Center
369
Chapter 7: Examples
approaches were considered in defining an optimum data merge routine. These options
are summarized as follows:
1. Summing all failures and dividing by the sum of all hours. The advantages of
this methodology are its simplicity and the fact that all observed operating
hours are accounted for. The primary disadvantage is that it does not weigh
outlier data points less than those clustering about a mean value. This can
cause a single failure rate to dominate the resulting value.
2. Using statistical methods to identify and exclude outliers prior to summing
hours and failures. This methodology would be very advantageous in the
event there are enough failure rate data points to properly apply the statistical
methods. The data being combined in NPRD often consists of a very limited
number of data points, thus negating the validity of this method.
3. Deriving the arithmetic mean of all observed failure rates which are from data
records with failures, and modifying the resulting value in accordance with the
percentage of operating hours associated with the zero failure records.
Advantages of this method are that modifying the mean in accordance with
the percentage of operating hours from survival data will ensure that all
observed part hours are accounted for, regardless of whether they have
experienced failures. Disadvantages are that the arithmetic mean does not
apply less weight to those data points substantially beyond the mean and,
therefore, a single data point could dominate the calculated failure rate.
4. Using a mean failure rate by taking the lower 60% confidence level (Chisquare) for zero failure data records and combining them with failure rates
from failure records. The disadvantages of this methodology are that the 60%
lower confidence limit can be a pessimistic approximation of the failure rate,
especially in the case where there are few observed part hours of operation;
and an arithmetic mean failure rate of these values (combined with the failure
rates from failure records) could yield a failure rate which is dominated by a
single failure rate, which itself may be based on a zero failure data point. The
use of a geometric mean would alleviate some of this effect. The problem
with the pessimistic nature of using the confidence level ,however, will
remain.
5. Deriving the geometric mean of all the failure rates associated with records
having failures and multiplying the derived failure rates by the proportion:
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
370
Chapter 7: Examples
n n
merged = i
i =1
n
h
i =1
n
h
i =1
where,
n
i = The product of failure rates from NPRD Section 2 records with failures*
i =1
i =1
n
i =1
n
n'
h
h'
=
=
=
=
Chapter 7: Examples
In NPRD Section 2, part descriptions with "(Summary)" following the part name
comprise a merge of all data related to the generic part listed. An example of the NPRD
summary section is given in Figure 7.4-2.
0.1957 + 0.0595
= 5.5413
0
.
1957
+
0
.
0595
+
0
.
0830
+
0
.
2655
summary = [(5.110)(33.6241)] 2
Now consider the entry for "Actuator, Mechanical (Summary)". This listing is a roll-up
of all "Actuator, Mechanical" data (in this case Actuator, Mechanical and Actuator,
Mechanical, Linear) using the algorithm described previously. In other words, the failure
rate of 25.8092 is a summary of failure data from seven individual data sources. For
these "(Summary)" data entries, sources are not listed since they represent a merge of one
or more data sources which are presented below the summary level. Roll-up values are
presented for each specific quality level and application environment for all components
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
372
Chapter 7: Examples
having multiple part type entries at the same indenture level. If there is no summary
record indicated for a particular part type, the listed part description represents the lowest
level of indenture available. For example, the listing for "Actuator, Mechanical,"
although being identical to the generic level for which the summary data is presented,
was the most detailed description available for the particular data entry. More detailed
part level information may be available in NPRD Section 3. Each failure rate record
listed in the NPRD summary section is a merge of all detailed data from Section 3 for a
specific part type, quality, environment and unique data source. Each of these failure rate
records refers to a Section 3 page which contains all detailed records, including part
details, when they were known. Roll-ups are performed at every combination of part
description (down to 4 levels), quality level, and application environment. The data
points being merged in the NPRD summary section include only those records for which
a data source is listed. These individual data points were already combined by summing
part hours and failures (associated with the detailed records) for each unique data source.
Roll-ups performed on only zero-failure data records are accomplished simply by
summing the total operating hours, calculating a failure rate by assuming one failure, and
denoting the resulting worst case failure rate with a "<" (less than) sign.
The roll-ups were performed in this manner to give the NPRD user maximum flexibility
in choosing data on the most specific part type possible. For example, if the user needs
data on a part type which is not specified in detail or for conditions for which data does
not exist in this document, the user can choose data on a more generic part type or
summary condition for which there is data.
7.4.3.2. "Part Details" Overview
Chapter 7: Examples
NPRD Section 3 contains a listing of all field experience records contained in the RIAC
part databases. The detailed data section presents individual data records that are
representative of the specific part types used in a particular application from a single data
source. For example, if 20 relays of the same type were used in a specific military
system, for which there were 300 systems in service, each with 1300 hours of operation
over the time in which the data was collected, the part population is 20x300 = 6000, and
the total part operating hours are 6000x1300 = 7,800,000 hours. If the same part is used
in another system, or if the system is used in different operating environments, or if the
information came from a different source, then separate NPRD data records were
generated. If known, the population size is given for each data record as the last element
in the Part Characteristics field. An example of NPRD Section 3 is shown in Figure
7.4-3.
This section of NPRD describes each of the data sources from which data were extracted
for the databook. The Title, author(s), publication dates, report numbers, and a brief
abstract are presented. In a number of cases, information regarding the source of the data
had to be kept proprietary. In these cases, "Source Proprietary" is indicated.
7.4.3.4. Section 5 "Part Number/MIL Number" Index
This NPRD section provides an index, ordered by generic part type, of those Section 3
data entries that contain a generic commercial part number or a MIL-Spec number. The
Section 3 page which contains the specific entry for the part or MIL number of interest is
given. Note that not all data entries contain a part or MIL number, since these numbers
either were not applicable or were not known for all entries.
Chapter 7: Examples
7.4.3.5. Section 6 National Stock Number Index with Federal Stock Class
This NPRD section provides an index of those Section 3 data entries that contain a
National Stock Number (NSN), including the four digit Federal Stock Class (FSC) prefix.
This index contains all parts for which the NSN is known.
7.4.3.6. Section 7 "National Stock Number Index without Federal Stock Class Prefix"
This NPRD section provides an index similar to the Section 6 index, with the exception
that the four-digit FSC is omitted.
7.5. References
1. RADC-TR-88-97, RELIABILITY PREDICTION MODELS FOR DISCRETE
SEMICONDUCTOR DEVICES, Final Technical Report, 1988
2. Denson, W.K. and S. Keene, A New System Reliability Assessment
Methodology, Final Report, 1998
3. Photonic Component and Subsystem Reliability Process Final Report,
Subcontract 0044-SC-20100-0203, Prepared for Penn State University ElectroOptics Center, September 25, 2008
4. Nonelectronic Parts Reliability Data (NPRD), Reliability Information Analysis
Center
5. RADC-TR-77-408, Electric Motor Reliability Model)
6. MIL-HDBK-344A, Environmental Stress Screening of Electronic Equipment,
August 1993
Chapter 7: Examples
8.
Although analytical techniques like FMEA are not the primary focus of this book, they
are important in the development of a reliability model. For example, when identifying
the root failure causes that are to be included in a comprehensive product or system
reliability model, a need exists for the identification of the highest risk failure causes that
should be addressed in the model. The FMEA is a popular technique to use for this
purpose. The intent of this chapter is not to present a detailed procedural guide to FMEA,
as this has been done extensively in the literature. Rather, it is to present practical FMEA
guidelines based on the experience of the author, specifically toward the goal of
developing a reliability model.
8.1. Introduction
In order to build reliability into a product or system, it is necessary to anticipate failure
causes, and ensure that they are eliminated or, at least, that their probability of occurring
is made acceptably low. This anticipation can be accomplished empirically through
test, or analytically through analysis and modeling. Failure Mode and Effects Analysis
(FMEA) is a structured way of identifying root cause failure modes, and is the backbone
of an effective reliability program, particularly as it relates to reliability growth during the
design and development phase.
A successful product or system depends on the requirements being fully understood, that
the design is robust, and that the manufacturing process is also robust. A Design FMEA
(DFMEA) assesses the first two of these, and a Process FMEA (PFMEA) assesses the
third. This is illustrated in Figure 8.1-1.
Requirements for
Successful Product
What Can
Go Wrong?
Understanding of
Requirements
Wrong or
Bad
Requirements
Robust
Design
Robust
Manufacturing
Bad
Design
Bad
Manufacturing
Build the
Wrong Product
DFMEA
Build the
Product
Wrong
PFMEA
Generally, the best manner in which to perform the FMEA is to separate the design and
process attributes and perform separate process and design FMEAs. However, in some
cases, these can essentially be combined into a single design FMEA by incorporating the
manufacturing process-related failure modes into the DFMEA failure cause/mechanism
column. The circumstances when this is appropriate are generally those when the item
under analysis is not complex from both a design and manufacturing perspective. This
book primarily addresses the DFMEA, since a reliability model is generally driven more
by the design than the process. However, process variables are often used as factors in
the reliability model.
A FMEA is the cornerstone of a reliability program, having many uses. The primary
purpose of the FMEA is to acquire an understanding of the reliability characteristics of a
product or system, such that corrective action can be taken to make the item more reliable
(reliability growth). The results of a FMEA are also used to support other reliability
engineering tasks, such as test plan development, the evaluation of engineering changes,
assessing detectability, the basis of troubleshooting manuals, and the development of
reliability models.
The logical, bottom-up analysis technique of the FMEA facilitates the understanding of
the reliability characteristics of a product or system. This understanding is a core
requirement for the attainment of the reliability objectives, and, as such, it will help
reduce the total program cost. While reliability engineering tasks are sometimes
considered to be costly to a program, the reality is that they will save significant amounts
of money, if properly implemented. Costs incurred when reliability problems are
identified in the field will be orders of magnitude higher than the upfront cost of the
reliability engineering tasks that solve them during design and development. Since the
success of a reliability program depends largely on the effectiveness of FMEA,
implementation of the FMEA is a critical element of the cost avoidance of field failures11.
Benefits of performing an FMEA include:
The assurance that all conceivable root failure causes and their effects have been
considered in the early stages of the product or system design and development
process, and that corrective actions are taken to mitigate the risk associated with
critical failure modes.
If elements such as accelerating stresses are included in the FMEA analysis, it can
be used to develop reliability growth, demonstration and screening test plans, as
well as environmental qualification test plans. In this case, the importance of
each potential accelerating stress can be quantified and prioritized in accordance
with the severity, criticality or failure rate of the individual failure modes
accelerated by the specific stress. For example, if temperature is determined to
accelerate the majority of critical failure modes, then it should be used as a stress
in reliability and qualification testing.
If and when reliability problems occur after a product or system is delivered to the
customer, the FMEA can be used as a basis for determining the root cause of
failure. Based on failure symptoms, the possible causes can be identified based
on the FMEA analysis that was performed.
It can be used as a basis for the reliability model, in which the reliability of each
high risk failure cause is quantified.
11
It should be noted that an FMEA is only technically effective if it has an impact on the design of the product or system. An FMEA
that does an excellent job of identifying root failure causes, but is performed after-the-fact so as to have no impact on the actual
design, is a waste of reliability program resources. An FMEA is only cost effective if it impacts the design of the product or system
before the design is finalized and bending metal has started. A poorly timed FMEA that results in extensive and costly redesign
efforts to eliminate or mitigate root failure causes is also counterproductive.
Another benefit of the FMEA is that it can be used as a basis for evaluating the risk
associated with engineering changes. If a design change is proposed, the FMEA can be
consulted to determine if the change will result in new failure modes or an increase in the
probability of failure of identified modes. Based on this information, the change can be
accepted, or additional reliability characterization can be performed to further assess the
reliability impact of the proposed change.
Detectability can also be assessed by the FMEA. This is particularly useful in instances
where failures that are undetectable are of special importance to the project. An example
of this is when alarms are used as a means to detect failures. Some failure modes may
not result in an alarm, and, therefore, the criticality associated with the failure mode can
be high. In this case, the FMEA can be used to assess these failure modes.
Troubleshooting manuals are essentially an FMEA that is presented in reverse order. The
FMEA is generally presented in the order of functional elements, or components. If the
FMEA is sorted by the effect of failure (or symptom), it essentially becomes a
troubleshooting aid, since the analyst can review the specific failure modes that will
result in the observed symptom. Additionally, if the probability of failure is included in
the FMEA analysis, the possible failure modes or causes can be ranked in accordance
with their probability. This can aid in the troubleshooting process.
Typical problems with the implementation of an FMEA include:
8.2. Definitions
FMEA refers to a generic analysis methodology and, while there are industry standards
that define the specifics of the analysis, there are many different ways in which the
analysis can be accomplished. The following list of terms and definitions summarizes
typical data elements that the FMEA will typically include as columns in the FMEA
worksheet template, presented in the order of which they usually appear.
Additional functions may also be required, depending on the specific organization and
nature of the product. These additional functions can include component engineering,
procurement, measurements, and marketing. There are also instances where an FMEA
Reliability Information Analysis Center
383
might include the direct involvement of the customer, particularly for critical or highly
complex products or systems.
All of the above listed functions are not required for every part of the FMEA. For
instance, the initial parts of the FMEA can efficiently be performed by only the
Reliability engineering and the Applications engineering (or the Project Manager)
functions. After this, engagement by the entire team is critical, especially to gain buyin on corrective actions, which can be the responsibility of any of the disciplines.
The ideal team size is 5 to 8 people. Any larger, and the efficiency of the analysis is
compromised. It is also more efficient to break up the analysis into distinct functional
elements of the design, i.e. mechanical, electrical, optical, software/firmware, etc.,
although it is also imperative to account for failure causes that are due to interactions of
these functional elements. The FMEA facilitator needs to ensure that these interactions
are accounted for, since the individuals cognizant of their functional element will often
overlook these interactions.
8.3.3. FMEA Facilitation
As with any cross functional team, it is important to have a lead which facilitates the
FMEA. Specific responsibilities of this facilitator are:
Document the results of analysis (it is also beneficial to have a separate scribe
that documents the results, and allows the facilitator to concentrate on the
additional items listed below)
Keep the group focused on the task at hand
Ensure that all components or processes are accounted for
Prompt the group for participation, as required
Spark the discussion by suggesting failure modes
Ensure that the analysis is kept moving
Ensure that the inputs of all participants are heard and captured. This includes
making sure that certain people are not allowed to dominate the analysis, and that
the ideas of quiet people are brought out.
Manage conflicts Professional people take a great deal of pride in their work.
Since the FMEA goal is to find fault with the product or system, FMEA sessions
can sometimes get contentious. The facilitator must manage this by keeping the
session constructive and not allow emotions to dictate the course of the analysis.
The facilitator is often from the reliability group, but does not have to be. It is more
important that the facilitator be skilled in the responsibilities listed above.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
384
8.3.4. Implementation
The next step is the determination of the possible corrective actions. This will be
discussed later in this chapter.
There are many ways in which an FMEA can be performed. This section outlines one
approach that has been successful based on the experience of the author. The process
flow is illustrated in Figure 8.4-1.
Each of these steps is summarized below, along with guidance and tips on performing the
step.
these potential modes. Ideally, the capacitor manufacturer will have performed a FMEA
on their product, in which case specific failure causes have already been identified and,
hopefully, mitigated.
Intermittent
Partial
Over
Unintended
Negative
Degraded
N (Negative):
D (Degraded):
The loss function used in the Taguchi methodology can be used to determine some of the
applicable failure modes. Here, functions or attributes are categorized as larger the
better, nominal the best and smaller the better. For larger the better
function/attributes, a failure can occur when there is too little of the function/attribute, but
cannot fail when there is too much of it. This is illustrated in Table 8.7-1. It relates only
to the over function and the partial function IPOUND categories. The other
IPOUND categories are used when appropriate, and will generally be independent of the
Taguchi categories.
Table 8.7-1: Failure Mode Relationship to Taguchi Loss Function
Function/Attribute
Type
Larger the better
Nominal the best
Smaller the better
Too Much
(Over Function)
X
X
Too Little
(Partial Function)
X
X
The IPOUND categories are intended to represent a complete and mutually exclusive set
of the ways in which a function or attribute can fail. When identifying failure modes in
this manner, it is helpful to set up a matrix of functions/attributes and the IPOUND
categories, and proceed to fill it in. When filling in this matrix, it is not necessary to
identify failure modes for all categories of IPOUND. Likewise, for any single category,
multiple failure modes are possible. The IPOUND methodology is simply a way to get
the team to think about all possible ways in which a function/attribute can fail.
The flow diagram in Figure 8.7-1 depicts a simple system consisting of a two-level
hierarchy. In practice, there can be any number of levels in the system hierarchy. The
failure cause-mode-effect relationship shifts in the FMEA as a function of the system
level, as illustrated in Figure 8.7-1. For example, at the most basic level, the part
manufacturing process, the cause of failure may be a process step that is out of control.
The ultimate effect of that cause becomes the failure mode at the part level. The failure
effect of the part becomes the failure mode at the next level of assembly, and so forth. It
is very important that the failure cause, mode and effect are not confounded in the a
System
Assembly
Part
Part Manufacturing
Process
Effect
Mode
Effect
Cause
Mode
Effect
Cause
Mode
Cause
Effect
Mode
Cause
The failure modes of the system functions/attributes are the effects of failure modes at the
subordinate hierarchical level. This tiering continues as the system is broken down to the
lowest level that the analysis will take place.
For simple, single-level products (for example, a component made from a monolithic
material) there is only a single level and, therefore, this is not an issue. Also, for
relatively simple products with two levels, a local effects column can be added to
capture the effects of the failure mode on the subassembly function. In this case, the
effects are relative to the functional requirements of the subassembly.
When identifying failure modes, the assumption is made that the failure could occur but
may not necessarily occur.
Magnitude
No Degradation
Example
Slight Degradation
Severe Degradation
Intermittent
Importance of
Function/Attribute
When Occurs
Not Critical
Critical
Design Not Capable
In Research &
Development (R&D)
In Process
Screen Fallout
Customer Inspections
Infant Mortality
In Deployment
Random
Wearout
In the identification of severity, effects of failure modes that are potentially safety-related
are usually considered to be the most severe. For these, a severity value of 9 or 10 is
used, regardless of the above listed factors.
The when occurs dimension of severity pertains to the life cycle phase in which the
failure mode and its effect occurs. If the failure mode occurs in the R&D phase, it is
either because the design or process is not capable, or because intrinsic or extrinsic
failure causes occur. In either case, this is the best phase to identify these, since they can
be corrected in the most cost-effective manner possible.
If the failure mode occurs in process, the effect is essentially a yield reduction.
Failures occurring during inspections or quality checks by the customer are similar, but
they occur at the customers site and are, therefore, more severe then when the defects are
caught in-house.
Failures occurring in deployment represent the most severe type of failure effect (with the
possible exception of safety-related failures). These failures can be represented by the
three types of failures in the bathtub curve: infant mortality, random and wearout.
Reliability Information Analysis Center
391
Usually, the severity of an effect is treated as one factor in the FMEA. However,
separating the severity into three factors and subsequent columns can be beneficial. For
example, if the when occurs dimension of severity is separated, the failure modes that
can be caught in process are identified, and this, in turn, can be used to establish inprocess checks and screening protocols.
If they are separated, any convenient numeric scale can be used, including 1-to-10, 1-to3, or others. If 1-to-10 is used, the RPN of a failure cause will range from 1 to 1,000.
If these dimensions are not separated (which will usually be the case), each of the three
should be represented in the criteria used to define the severity levels. One way in which
this can be accomplished is to use the guidelines in Table 8.8-2, in which each dimension
is assumed to have a value between 1 and 3, directly proportional to its severity. The
total severity is then the sum of each of the three values.
Table 8.8-2: Dimensions of Severity
Dimension of Severity
Degree to Which Function is
Lost
Importance of
Function/Attribute
When Occurs
Magnitude
No Degradation
Slight Degradation
Severe Degradation
Intermittent
Not Critical
Critical
In R&D
In Process
Customer Inspections
In Deployment
1
3
1
3
1
3
required. It is only required that someone knowledgeable in the system and part
functional and attribute requirements be involved.
The task of identifying failure causes is a much more unstructured, brainstorming-like
activity. For this, it is important to get the entire team involved. The intent of this task is
to identify all possible causes that could result in the failure mode. Causes are often more
complex than the identification of a single failure mechanism and, therefore, describing
them in a few sentences in the FMEA table can be problematic. A failure cause will
often be the result of sub-causes, and can be broken down further and further until the
physical failure phenomena is identified. For this reason, an alternative is to perform a
fault tree analysis (FTA) on each failure mode. This allows for the breaking down of
failure causes into any level of detail.
There should be one severity rating for each failure effect, since the severity is a direct
1:1 relation to the effect. The maximum of each of these severities associated with each
failure mode is the severity used in the RPN calculation, since the RPN is applicable to
the cause. Here, the failure mode can result in several effects, but can also be initiated by
several causes. Therefore, a single cause can result in several effects, the worst of which
should be used in the PRN calculation. The relationship between failure cause, mode and
effect is illustrated in Figure 8.10-1.
Failures resulting from operational stress refers to failures resulting from the
inability of a product or system to tolerate the applied stresses to which the
component, item or material within the item is exposed.
Tolerance stack up refers to the initial tolerance at time zero, and the failure
of a product or system to tolerate the cumulative effect of those tolerances.
Failures can also be a result of short-term exposure to extreme stresses. While the
product or system is not designed to tolerate these stresses under steady-state conditions,
it should be able to tolerate short-term extreme stress exposure. There is a limit to the
stress level(s) that the product or system should be able to tolerate. However, design
actions can be taken to minimize the probability of failure due to these stresses.
The information presented here is generic in nature and applies equally to mechanical and
electronics failures. The specific failure mechanisms will vary, but the concepts are the
same.
Failure causes are often the result of a combination of conditions and events. Therefore,
when identifying causes, the analyst needs to consider these combinations. The factors
whose combinations can cause failure generically include:
When hypothesizing failure causes, it is useful to think about them in terms of their initial
conditions, stresses, and failure mechanisms, as illustrated in Figure 8.10-2.
A list of typical initial conditions, stresses, and failure mechanisms is provided below.
Initial conditions:
o Defect free (the item is made as designed)
o Defects:
Intrinsic:
Voids
Material property variation
Geometry variation
Contamination
Ionic contamination
Crystal defects
Stress concentrations
Extrinsic:
Organic contamination
Nonconductive particles
Conductive particles
Reliability Information Analysis Center
395
Contamination
Ionic contamination
o Stresses:
Operation - steady state
Operation - cycling
Chemical exposure
Salt fog
Mechanical shock
UV exposure
Drop
Vibration
Temperature-high
Temperature-low
Temperature cycling
Damp heat
Pressure - low
Pressure - high
Radiation - EMI
Radiation - cosmic
Sand and dust
o Failure mechanisms (physical process):
Electromigration
Dielectric breakdown
Corrosion
Dendritic growth
Tin whiskers
Metal fatigue
Stress corrosion cracking
melting
Creep
Warping
Brinelling
Fracture
Fretting fatigue
Galvanic corrosion
Pitting corrosion
Chemical attack
Fretting corrosion
Spalling
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
396
Crazing
Abrasive wear
Adhesive wear
Surface fatigue
Erosive wear
Cavitation pitting
Stress corrosion cracking
Elastic deformation
Material migration
Oxidation
Cracking
Plastic deformation
Elastic deformation
Brittle fracture
Expansion
Contraction
Emod change
Outgassing
Index of refraction changes
Photodarkening
Condensation
Crystallization
Each failure cause can be characterized with a specific combination of initial condition,
stress and degradation process. For example, a cause could be represented as:
Defect1-temperaturecorrosion
After identifying all of the FMEA elements in accordance with the guidelines presented
herein, it is very useful to check the completeness of the analysis by hypothesizing what
would happen if the product or system:
Accelerating stresses or tests can include the following. This information can be used to
define reliability test plans. A list of potential tests may include:
1. Operation - steady state
2. Operation - cycling
3. Chemical exposure
4. Salt fog
5. Mechanical shock
6. UV exposure
7. Drop
8. Vibration
9. Temperature - high
10. Temperature - low
11. Temperature cycling
12. Damp heat
13. Pressure - low
14. Pressure - high
15. Radiation - EMI
16. Radiation - cosmic
17. Sand and dust
8.11.2. Occurrence
8.11.2.1.
Occurrence Rankings
The occurrence rating should be a function of two factors: (1) an estimate of the
likelihood of occurrence based on the analysts experience, and (2) the degree to which
the failure cause/mechanism has been observed. For example, Figure 8.11-1 represents
the occurrence, as defined by Reference 1.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
398
Ideally, a reliability model would be available from which to determine the occurrence,
but this is usually impractical due to the fact that the FMEA is generally performed
before the reliability modeling activities commence.
The occurrence should be based on engineering judgment and on empirical data. The
resulting occurrence value is based on both, as illustrated in Figure 8.11-2. For example,
if empirical information exists on a specific cause, it should be used as part of the
assessment of the Occurrence level. In this case, heavier weighting should be given to
field data over manufacturing and test data. If no empirical data exists, engineering
Reliability Information Analysis Center
399
judgment should be used, and should be based on the collective experience of the FMEA
team.
The occurrence should be based on the likelihood that the cause will occur and that the
resulting mode will occur. Some FMEA methodologies, like the cancelled MIL-STD1629, include a separate factor that accounts for the probability that the effect will occur
if the mode is to occur (the same concept can be used for the cause-mode relationship).
High
10
Failure rate
estimate
based on
experience
Low
1
Not at all
Frequently
The frequencies of occurrence should be rated relative to the required reliability for a
specific failure cause. For example, if a reliability allocation is performed to allocate the
product or system failure rate (or unreliability) to its constituent components, then the
occurrence value should be relative to this allocated value.
Common cause vs special cause
Categories of failure effects are shown in Table 8.11-14. These illustrate the differences
between common cause and special cause failure effects.
Process
Not
Capable
Screen
Fallout/Outof-the-Box
Failure
Always
(Common Cause)
Sometimes
(Special Cause)
Infant
Mortality
Random
Failure
Wearout
X
X
8.11.3. Preventions
Preventions are the actions taken to prevent the cause/mechanism of failure or the failure
mode from occurring, or to reduce their rate of occurrence. These will generally be
design-related actions. Examples include: Ensured proper derating for all components
or Use of a conformal coating.
8.11.4. Detections
Detections are actions taken to detect the cause/mechanism of failure. This can be via
either test or analysis.
8.11.5. Detectability
There are four aspects of detection that should be captured in the FMEA:
Current design control detections:
Indicate what has been done to detect the cause/mechanism of failure or the failure mode,
either by analytical or physical methods, before the item is released into production.
These are generally the application of tests or analytical techniques whose goals are to
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
402
The life cycle phases to which each of these four dimensions is applicable are illustrated
in Figure 8.11-4.
Probability of Detection if
the Failure
Cause/Mechanism Occurs
Screening
Degree of
Warning
Detectability
x
L
x
H
L
H
L
H
L
H
L
H
H
H
H
H
x
L
x
L
H
L
H
H
L
H
H
L
L
H
H
H
10
8
7
5
5
2
2
1
Probability of occurrence
Severity
Detectability
C=
=
Criticality
Failure rate
where:
=
=
The failure rate is the rate of occurrence of failure, expressed in failures per million
cumulative operating hours, or in FITs (failures per billion operating hours). The failure
effect probability is the conditional probability that, if the failure mode occurs, the
severity level identified in the FMEA will be the result. The failure mode ratio is the
fraction of the failure rate that can be attributed to the specific failure mode under
analysis. In other words the sum of these probabilities for all failure modes of an item
will be 1.0.
The same logic applies to the RPN methodology, in that the occurrence rating (O) is the
product of the probability of the failure cause occurring times the probability that the
failure cause will result in the identified effect.
Since severity is not included in this calculation, failure modes are usually sorted by
criticality for each severity level. This is done since a true measure of criticality must
include the severity of the failure mode.
The RPN methodology is generally the most common used in many industries. However,
in some cases, the criticality metric is more applicable. Such cases occur when the
system under analysis is complex, or when quantitative failure rate estimates are
available. These failure rate estimates are generally derived from reliability modeling, as
summarized in this book.
The other factor that determines the failure causes that are to be addressed is the Pareto
ranking of the RPNs. In other words, in some cases, there are a well-defined number of
causes that comprise the total risk to the system. This situation becomes evident in the
Pareto analysis of RPNs.
Corrective actions will generally fall into three categories:
First, the design can be modified such that the effect of failure is minimized,
thus effectively lowering the severity level of the failure effect. Options for
this include the addition of redundant elements or fault tolerance, the selection
of better materials, and/or the use of more robust components.
The second general option is to reduce the likelihood of the failure mode
occurring in the first place. This often can be achieved by the use of more
robust components. This robustness can be achieved with components of
higher quality levels or the ability to handle high stress levels. Another option
for reducing this likelihood is the control of environmental stresses and
reducing the stress to which the component is exposed.
Corrective Action
Reduce
Likelihood of
Failure
Modify
Design
Fault
Tolerance
Better
Materials
Improve
Detectability
Reduce
Stress
More Robust
Components
Control
Environment
Severity Reduction:
o Add redundancy
o Add a fail-safe feature
o Use personal protection equipment (for safety critical items)
Occurrence Reduction:
o Design out the cause
o Reduce the rate of occurrence
Detection Improvement:
o Implement alarm features
o Implement screening tests
o Design more relevant tests to detect the failure cause
o Develop better characterization methods
8.16. References
1. SAE J1739 (R) Potential Failure Mode and Effects Analysis in Design (Design
FMEA), Potential Failure Mode and Effects Analysis in Manufacturing and
Assembly Processes (Process FMEA)
9.
Concluding Remarks
Reliability modeling has been used successfully as a reliability engineering tool for many
years. It is only one element of a well-structured reliability program and, to be effective,
it must be integrated into a complete reliability program. This book has reviewed options
an analyst has for developing a reliability model of a product or system, and has provided
guidance on applying the appropriate methodology based on the specific needs and
constraints of the analyst.
The premise of the holistic approach described in this book is that the reliability model is
a living model that needs to be continuously updated throughout the program
development and deployment phases. This approach to modeling consists of predictions,
assessments and estimation. Each of these is performed at specific points in the
development cycle and has different purposes and approaches. Reliability predictions are
performed very early, before there is any empirical data on the item under analysis.
Reliability assessments are made to determine the effects of certain factors on reliability,
and to identify and study specific failure causes. Reliability estimates are made based on
empirical data, and encompass all three elements.
A critical theme of this book has been that the purpose of a reliability model must be
clearly defined, and then an appropriate methodology should be chosen. Each model
need must be fully defined in terms of the customers being served (their roles,
educational background, requirements, etc.), the constraints placed upon that customer
(including legal and contractual, as well as technical and engineering), and the purpose of
the model (what decisions are being supported and in what manner).
A summary of recommendations for an analyst developing a product or system reliability
model are:
1. Clearly define the purpose and objectives of the model
2. Apply the appropriate methodologies in the appropriate program phase
3. Identify critical failure causes early in the program, as they will require the most
attention in terms of modeling
4. Fully leverage all available expertise in areas of design analysis, testing,
measurement, etc.
5. Use all available data and information, and be diligent about seeking needed data
6. Strategically perform tests to characterize critical failure causes
7. Engage suppliers and customers to maximize the consistency of models
throughout all system hierarchical levels
Reliability Information Analysis Center
411
8. Use multiple modeling techniques, and work toward the goal of having them
reasonably agree with each other. In this manner, confidence in the results will be
greater.
9. Continuously update the model based on data that is obtained throughout all
program phases of the product or system life cycle
10. Identify and use available reliability software tools. These tools have become
very cost effective and are readily available, making the application of techniques
which were impractical several decades ago easily implemented.
It is hoped that this book has provided the reader with a knowledge of approaches, tools,
and interpretations that will allow a better understanding of the usefulness and limitations
of various reliability modeling techniques. Given its stochastic nature, reliability
modeling is part science and part art, and there are many ways to approach it. But, if the
analyst keeps the goals in mind and uses common sense, there is a high probability that
the model will be successful in achieving its objectives.