
Tools for Failure Data Analysis


Dr. Albert H.C. Tsang

Phone: (852) 27666591

Fax: (852) 23625267

email: albert.tsang@polyu.edu.hk

Importance of Maintenance
What percentage of your organization's total operating budget is maintenance related?
- 15 to 40% of total manufacturing costs are maintenance related
- 40% of man-hours expended in providing railway services are consumed by maintenance work (MTR Corporation)


Maintenance as a Business Process


- Maintenance is the last remaining business process that has not been optimized
- Handled well, it pays direct dividends; handled wrongly, it incurs costs
- Effective maintenance can bring substantial savings, both direct & indirect, maximizing efficiency & productivity, and improving the bottom line

Types of Losses
- Stops
  - Planned: breaks, maintenance / cleaning, training
  - Unplanned: breakdowns / repairs, production changeover, tool change / adjustments
- Performance losses: running at reduced speeds, minor stoppages, idling, start-up / shut-down
- Quality losses: errors / defects, scrap, rework
What remains after these losses is the effective utilization of the equipment.


Overall Equipment Effectiveness (OEE)


The OEE is decreased by all non-value-adding events:
- Planned: breaks, maintenance, cleaning; changeover, adjustment, tool change
- Unplanned: breakdowns, repairs; running at reduced speeds; minor stoppages; idling (e.g. due to missing material / personnel); start-up and shut-down losses
- Quality: scrap, defects, rework

Time waterfall (example):
- Total Time: 480 min
- Availability Losses: -80 min, leaving Available Time: 400 min
- Performance Losses: -70 min, leaving Productive Time: 330 min
- Quality Losses: -20 min, leaving Effective Time: 310 min

OEE Calculation
Availability Rate (AR) = Available Time / Total Time = (Total Time - Availability Losses) / Total Time = (480 min - 80 min) / 480 min = 0.833

Performance Rate (PR) = Productive Time / Available Time = (Available Time - Performance Losses) / Available Time = (400 min - 70 min) / 400 min = 0.825

Quality Rate (QR) = Effective Time / Productive Time = (Productive Time - Quality Losses) / Productive Time = (330 min - 20 min) / 330 min = 0.939

OEE (in %) = AR x PR x QR = 0.833 x 0.825 x 0.939 = 0.646 = 64.6%
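For reference, the same arithmetic can be scripted. The following is a minimal Python sketch using the example figures above; the function and variable names are ours, not part of any standard:

```python
# Minimal sketch of the OEE calculation shown above (illustrative example values).
def oee(total_time, availability_losses, performance_losses, quality_losses):
    available_time = total_time - availability_losses           # 480 - 80 = 400 min
    productive_time = available_time - performance_losses       # 400 - 70 = 330 min
    effective_time = productive_time - quality_losses           # 330 - 20 = 310 min

    availability_rate = available_time / total_time             # 0.833
    performance_rate = productive_time / available_time         # 0.825
    quality_rate = effective_time / productive_time             # 0.939
    return availability_rate * performance_rate * quality_rate  # 0.646

print(f"OEE = {oee(480, 80, 70, 20):.1%}")   # OEE = 64.6%
```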


Prioritization of the Losses


Use a Pareto diagram to visually prioritize the individual types of losses.

[Chart: Losses Pareto Analysis - bars for break-downs, changeovers, adjustments, and minor stops, ranked by their contribution to lost OEE / effective utilization]
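A loss Pareto of this kind is straightforward to produce from downtime records. The sketch below (Python with matplotlib) uses invented downtime figures purely for illustration:

```python
# A minimal sketch of a Pareto chart of losses (illustrative downtime figures, in minutes).
import matplotlib.pyplot as plt

losses = {"Break-downs": 80, "Changeovers": 45, "Adjustments": 25, "Minor stops": 20}
items = sorted(losses.items(), key=lambda kv: kv[1], reverse=True)
labels, values = zip(*items)
cum_pct = [100 * sum(values[: i + 1]) / sum(values) for i in range(len(values))]

fig, ax = plt.subplots()
ax.bar(labels, values)                 # individual loss contributions, largest first
ax2 = ax.twinx()
ax2.plot(labels, cum_pct, marker="o")  # cumulative percentage line
ax2.set_ylim(0, 110)
ax.set_ylabel("Downtime (min)")
ax2.set_ylabel("Cumulative %")
plt.show()
```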

Determining Maintenance Priorities

How do you prioritize failure codes?


Optimizing Maintenance Decisions


Optimizing Equipment Maintenance and Replacement Decisions - four key decision areas:
- Component replacement
- Inspection procedures
- Capital equipment replacement
- Resource requirements
Quantitative decision models draw on the maintenance DATABASE (CMMS/EAM/ERP system).

Quality of Life Data


Component Replacement
Failure Replacement
The unscheduled actions taken, as a result of failure, to restore a system to a specified level of performance

Preventive Replacement
The scheduled actions taken, not as a result of failure, to retain a system at a specified level of performance by such functions as scheduled replacement of critical items and overhauls

Component Replacement

Is preventive replacement a panacea for optimization of maintenance?


Component Replacement
Use Preventive Replacement only if:
- the risk of failure increases with age or usage, i.e., a wear-out effect is occurring
- the total cost of a failure replacement is greater than the total cost of a preventive replacement


Optimizing Preventive Replacement Decisions

We want:


Fact-based arguments (data-driven decisions)

NOT

Intuition-based pronouncements (strength of personalities, number of mechanics' complaints)


Preventive Replacement Cost Conflicts


[Chart: cost per week versus preventive replacement age - the total cost per week is the sum of the failure replacement cost per week and the preventive replacement cost per week; the minimum of the total cost curve marks the optimal preventive replacement age]
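The trade-off sketched in this chart can be made concrete with a standard age-based replacement cost model. The Python sketch below is illustrative only: it assumes Weibull-distributed failure times and made-up cost figures, and simply searches for the replacement age that minimises expected cost per week.

```python
# A minimal sketch of an age-based replacement cost model behind this chart.
# Parameter values are illustrative; Cp/Cf are preventive/failure replacement costs.
import numpy as np

beta, eta = 2.5, 100.0   # Weibull shape and scale (weeks); wear-out since beta > 1
Cp, Cf = 200.0, 1200.0   # cost of a preventive vs. a failure replacement

def cost_per_week(tp, n=2000):
    t = np.linspace(0.0, tp, n)
    R = np.exp(-(t / eta) ** beta)                   # survival function R(t)
    dt = t[1] - t[0]
    expected_cycle = np.sum((R[:-1] + R[1:]) * dt / 2.0)   # expected cycle length (trapezoid rule)
    expected_cost = Cp * R[-1] + Cf * (1 - R[-1])          # cost per replacement cycle
    return expected_cost / expected_cycle

ages = np.arange(10, 201, 5)
costs = [cost_per_week(tp) for tp in ages]
best = ages[int(np.argmin(costs))]
print(f"optimal preventive replacement age is about {best} weeks")
```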


Optimizing Preventive Replacement Decisions


- Maintenance data are stored in the databases of CMMS/EAM/ERP systems
- A typical scenario: Data Rich, Information Poor
- The risk of failure can be determined from analysis of the item's failure data - Weibull Analysis


Software & Web Sites


Software
- ReliaSoft: www.reliasoft.com
- MORE tools: http://www.crcpress.com/product/isbn/9780849339660

Maintenance Web Sites
- http://www.pem-mag.com
- http://www.plant-maintenance.com
- http://reliabilityweb.com


References
Tsang, Albert H.C., et al. (2000) RCM: A Key to Maintenance Excellence, CityU Press
Jardine, A.K.S. and Tsang, Albert H.C. (2006) Maintenance, Replacement, and Reliability: Theory and Applications, Taylor & Francis / CRC Press


The Instructor
Albert H.C. Tsang is Principal Lecturer in the Department of Industrial & Systems Engineering at The Hong Kong Polytechnic University. He has a PhD from the University of Toronto. Dr. Tsang is a chartered engineer in the United Kingdom with working experience in the manufacturing industry covering functions such as industrial engineering, quality assurance, and project management. He is a founding member, fellow, and past Chairman of the Hong Kong Society for Quality (HKSQ). Dr. Tsang has provided consultancy and advisory services to enterprises and industry support organizations in the manufacturing, logistics, public utilities, health care, and government sectors on matters related to quality, reliability, maintenance, performance management, and assessment of performance excellence. Dr. Tsang is the author of WeibullSoft, a computer-aided self-learning package on Weibull analysis, and a co-author of two books: Reliability-Centred Maintenance: A Key to Maintenance Excellence, and Maintenance, Replacement, and Reliability: Theory and Applications.


Tools for Failure Data Analysis


Albert H.C. Tsang
The Hong Kong Polytechnic University albert.tsang@polyu.edu.hk

Abstract

Using knowledge acquired from fact-based analysis to inform decisions is the hallmark of performance excellence in maintenance management. This paper presents a set of data mining tools applicable to failure data analysis that will support maintenance and asset replacement optimization, and prioritization of failure codes. These tools (Weibull analysis, the Laplace trend test, the logarithmic scatter plot of MTTR versus count of failures, and jack-knife diagrams) will enable maintenance, plant, works or planning managers and engineers to:
(1) characterize the risk of failure of specific assets
(2) choose between preventive replacement and run-to-failure
(3) detect trends of reliability degradation or improvement
(4) classify failure codes into reliability, maintainability and availability problems
(5) determine the criticality of failure codes in the context of the business cycle
(6) visualize maintenance performance trends associated with specific failure codes

Keywords: Maintenance optimization, Weibull analysis, Laplace trend test, plot of MTTR versus failure counts, Jack-knife diagrams

Introduction

A survey conducted by Plant Engineering & Maintenance Magazine (Robertson & Jones, 2004) indicated that maintenance budgets ranged from 2 to 90% of the total plant operating budget, with the average being 20.8%. It can be reasoned that operation and maintenance (O&M) represent a major cost item in equipment-intensive industrial operations. These operations can achieve significant savings in O&M costs by making the right and opportune maintenance decisions.

Maintenance is often the last business process that has not been optimised. Instead of treating maintenance as a liability of business operations, achieving excellence in maintenance will pay huge dividends through reduced waste and maximised efficiency and productivity, thereby improving the bottom line. Maintenance excellence is concerned with balancing performance, risks and resource inputs to achieve an optimal solution. This is not an easy task because much of what happens in an industrial environment is characterised by uncertainties.

Traditionally, maintenance practitioners in industry are expected to cope with maintenance problems without seeking to operate in an optimal manner. For example, many preventive maintenance schemes are put into operation with only a slight, if any, quantitative basis. As a consequence, no one is very sure just what the best interval between preventive replacements is and, as a result, these schemes are cancelled because it is said they cost too much. Clearly, some form of balance is required between the preventive replacement interval and the returns from it (for example, maintenance costs are reduced because items are replaced before they fail in service, avoiding costly repairs).

Asset managers who wish to optimise the life-cycle value of the organisation's human and physical assets must consider four key decision areas: (1) component replacement, (2) inspection activities, including condition monitoring, (3) replacement of capital equipment, and (4) resource requirements (Jardine and Tsang, 2005).

Consider the conflicts involved in decision area (1), i.e., component replacement. If a decision is made to perform repairs only, and not to do any preventive maintenance such as overhauls, it may well reduce the budget required by the asset management department, but it may also cause considerable production or operation downtime. Analytical models can help the maintenance manager take account of these interactions in order to achieve optimal solutions, thereby reducing the tension that often occurs between maintenance and operations.
[Figure: cost per unit time versus the maintenance policy (say, frequency of overhauls) - the total cost of maintenance and time lost is the sum of the cost of the maintenance policy and the cost of time lost due to breakdowns; the minimum of the total cost curve identifies the optimal policy]

Figure 1: Optimal frequency of overhauls

Figure 1 illustrates the type of approach taken by using a mathematical model to determine the optimal frequency of overhauling a piece of plant, balancing the input of the maintenance policy (maintenance cost) against its output (reduction in downtime).

Modelling the risk of failure is a crucial step in optimising the replacement of components that are subject to failure in preventive maintenance or condition-based maintenance schemes. Weibull analysis, a powerful tool for modelling such risks, as well as the techniques and tools available to address complications that may arise in such analysis, are presented in the next section.

Outage incidents and repair times are typically classified according to failure modes, which are known as failure codes in many organizations. Pareto charts are commonly used to determine maintenance priorities by ranking equipment failure codes according to their relative downtime contribution. However, these charts fail to highlight the dominant variables that contribute to equipment downtime: frequency of failure, and mean time to repair (MTTR). The latter part of this paper presents a powerful charting tool that addresses this limitation of Pareto analyses. It allows us to classify failures into acute and chronic problems, and to identify problems that affect system reliability, availability, and maintainability. This type of knowledge is critical in adjusting the focus of maintenance efforts to take account of changes in business priorities. The charting technique can also be extended to visualize maintenance performance trends associated with specific failure codes.

Weibull Analysis

Maintenance decision analyses that consider risks of failure involve the use of the equipment's failure time distribution, which may not be known. However, observations of failure times may well be available from maintenance records. We might wish to find a Weibull distribution that fits these failure observations. The Weibull distribution is named after Waloddi Weibull (1887-1979), who found that, in general, distributions of data on an item's time-to-failure can be modelled by a function of the following form:

f(t) = \frac{\beta}{\eta}\left(\frac{t-\gamma}{\eta}\right)^{\beta-1}\exp\left[-\left(\frac{t-\gamma}{\eta}\right)^{\beta}\right],  t > \gamma

f(t) = 0,  t \le \gamma

The three parameters of a Weibull distribution are: β (shape parameter), γ (location parameter), and η (scale parameter). The parameters have non-negative values. When an item's failure time distribution can be modelled by a Weibull distribution, the value of β determines the item's risk profile. When β is less than, equal to, or greater than 1, the risk of failure will decrease, remain constant, or increase with age (usage), respectively. The motivation for replacing an item preventively is to reduce the risk of in-service failure so as to optimize total preventive and failure replacement costs. Obviously, there is no justification for preventive replacement when β is not greater than 1. In such cases, run-to-failure should be the optimal replacement policy.

The Weibull distribution that models a given set of failure data can be determined graphically using Weibull analysis. The procedure, which involves plotting the cumulative percentage failure F(t) against the observed failure time t on a special probability paper known as Weibull paper, is given in Jardine and Tsang (2005: 239-243). Figure 2 shows the Weibull plot of a set of lamp failure data. A straight line can be fitted through the plot, indicating that a Weibull distribution with γ = 0 can be used as the model of the data set. We can then proceed to estimate the other parameters of the distribution from the plot. From the estimation point in the top left-hand corner of the Weibull paper, we draw a line perpendicular to the fitted line. The intersection between the perpendicular line and the scale beneath the estimation point gives the estimated value of β, which is 1.2 in this case. Thus, the lamp passes an essential hurdle to justify the adoption of a preventive replacement policy. The value of t at which the fitted line cuts F(t) = 63.2% (the estimation line on Weibull paper) is an estimate of η.

While a Weibull distribution is completely defined by the values of its β, γ and η parameters, we may also wish to determine its mean value μ. It can also be determined from the Weibull plot, from the intersection between the perpendicular line and the Pμ scale beneath the estimation point of the Weibull paper. In the example given in Figure 2, Pμ = 60%. Thus, the distribution mean μ is 40 hours, the age at which the cumulative probability of failure is 60%.
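The same kind of parameter estimation can also be done numerically. The following is a minimal Python sketch using median-rank regression, assuming a two-parameter model (γ = 0); the failure times are invented for illustration and are not the lamp data of Figure 2:

```python
# A minimal numerical counterpart to the graphical procedure above: Weibull analysis by
# median-rank regression, assuming a two-parameter model (location gamma = 0). The
# failure times below are illustrative only.
import math
import numpy as np

def fit_weibull(failure_times):
    t = np.sort(np.asarray(failure_times, dtype=float))
    n = len(t)
    ranks = np.arange(1, n + 1)
    F = (ranks - 0.3) / (n + 0.4)      # Bernard's approximation to the median ranks
    x = np.log(t)
    y = np.log(-np.log(1.0 - F))       # linearised Weibull CDF: y = beta*x - beta*ln(eta)
    beta, intercept = np.polyfit(x, y, 1)
    eta = np.exp(-intercept / beta)
    return beta, eta

beta, eta = fit_weibull([31, 38, 44, 50, 57, 63, 70, 79, 90, 105])
mu = eta * math.gamma(1.0 + 1.0 / beta)   # mean life of the fitted distribution
print(f"beta = {beta:.2f}, eta = {eta:.1f} hours, mean = {mu:.1f} hours")
# beta > 1 would indicate wear-out, the essential condition for preventive replacement.
```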

[Figure 2: Weibull plot of lamp failure data (item: lamps; age in hours) - the fitted line gives β = 1.2, η = 43 hours, and Pμ = 60%, i.e., a mean life μ of 40 hours]

Weibull analysis on failure data may get complicated in practice. Commonly encountered complications include: the Weibull plot is non-linear; not every item observed has been run to failure, i.e., a data set with censored (incomplete) failure data; too many records need to be analysed; and analysis of data generated from multiple failure modes. Furthermore, additional information may need to be mined from the Weibull plot, such as determining the confidence interval on estimates of reliability at a specified age, and assessing the goodness of fit between the hypothesized Weibull model and the data set. A discussion of the techniques and tools to deal with these situations and demands would considerably increase the size of this paper. Interested readers are referred to Appendix 2 of Jardine and Tsang (2005) for such information.

Identifying Trends of Failure Data

A Weibull analysis involves fitting a probability distribution to a set of failure data. It is assumed that the process generating the failure times is statistically stable. In reality, this condition may not apply, as in the case of failure times observed from maintenance records of repairable systems. For example, design modifications and improvements made on the equipment in successive life cycles may have the effect of progressively reducing the frequency of failure. In another scenario, imperfect repair or increasing severity of usage in successive life cycles may produce a trend of increasing frequency of failure. Conducting a Weibull analysis on time-between-failure data in these cases is inappropriate because the failure distribution varies from one life cycle to another.

The Laplace trend test can be used to detect the existence of trends in a data set of successive event times. Let t_i denote the running time of a repairable item at its i-th failure, where i = 1, ..., n; let N(t_n) be the total number of failures observed up to time t_n, and let the observation terminate at time T while the item is in the operational state. In other words, the failure times are obtained from a time-terminated test. Figure 3 shows the notation used.
[Figure 3: Time-terminated failure times - running times t_1, t_2, ..., t_{n-1}, t_n, with the observation ending at T]

Using the Laplace trend test to determine whether the failure-generating process is stable, the test statistic for time-terminated data is:

u = \sqrt{12\,N(t_n)}\left[\frac{\sum_{i=1}^{n} t_i}{N(t_n)\,T} - 0.5\right]    (1)

If the process is stable, u will be approximately normally distributed with mean 0 and standard deviation 1. When u is significantly small (negative), we infer the existence of reliability growth. When u is significantly large (positive), we infer the existence of reliability deterioration. If we are satisfied that the failure-generating process is stable, Weibull analysis can be performed on the inter-failure times (t_i - t_{i-1}), where i = 1 to n. In the case where the observation terminates at a failure event, say t_n, we have a set of failure-terminated data. The test statistic for failure-terminated data is:
u = \sqrt{12\,N(t_{n-1})}\left[\frac{\sum_{i=1}^{n-1} t_i}{N(t_{n-1})\,t_n} - 0.5\right]    (2)

Worked examples illustrating the use of the Laplace trend test can be found in Jardine and Tsang (2005: 268-270).
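As a minimal illustration, equations (1) and (2) translate directly into code; the Python sketch below uses invented running failure times:

```python
# A minimal sketch of the Laplace trend test statistics in equations (1) and (2).
# The running failure times below are illustrative only.
import math

def laplace_u_time_terminated(times, T):
    """Equation (1): times are running failure times t_1..t_n; observation ends at T."""
    n = len(times)
    return math.sqrt(12 * n) * (sum(times) / (n * T) - 0.5)

def laplace_u_failure_terminated(times):
    """Equation (2): the observation ends at the last failure t_n."""
    t_n = times[-1]
    m = len(times) - 1                      # N(t_{n-1}): failures before the last one
    return math.sqrt(12 * m) * (sum(times[:-1]) / (m * t_n) - 0.5)

t = [55, 95, 130, 200, 230, 280, 310, 330, 350, 365]
print(round(laplace_u_time_terminated(t, T=400), 2))
# Roughly, |u| < 1.96 (5% level) suggests no significant trend, so Weibull analysis of the
# inter-failure times is reasonable; a large positive u suggests reliability deterioration,
# a large negative u suggests reliability growth.
```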


Visualization of Maintenance Priorities

Pareto charts are the classical tool used by maintenance analysts to prioritize failure codes. Depending on the focus of attention, these Pareto charts may help to visualize the ranking of failure codes according to cumulative maintenance downtime (or costs), failure frequency, or MTTR. The priorities generated from these Pareto charts are often dissimilar to each other. Given the multiple listings, which failure codes should be given top priority for improvement effort to maximize impact on business performance?

It is also noted that maintenance downtime, failure frequency, and MTTR are inter-related: maintenance downtime results from the combined effects of failure frequency and MTTR. Thus, a Pareto chart of maintenance downtime that prioritizes availability problems is unable to identify failure modes that cause brief yet frequent operational disturbances (reliability problems), or those with long downtime in each instance (maintainability problems). A tool that can simultaneously identify availability, reliability, and maintainability problems will be very useful to maintenance analysts. Knights (2004) presents a visualization tool that serves this purpose. It is a scatter plot of failure codes on log-log graph paper. The X and Y axes of the plot indicate the number of failures recorded and the MTTR, respectively. Curves of constant downtime appear as lines with a slope of -1 on the graph. Figure 4 is an example of such a plot, with each dot representing a failure code.

Figure 4: Log scatter plot of MTTR versus number of failures Source: Knights (2004)

Failure codes that result in lengthy repairs on each occasion are classified as acute problems, and those that recur frequently are considered chronic problems. Thus, we can divide the scatter plot into four quadrants, as shown in Figure 5. The threshold limits that define the boundaries of these quadrants can be set by company policy, or they can be relative values, as suggested by Knights (2004).

The determination of maintenance priorities is influenced by business imperatives. When facilities have to operate at capacity to meet surplus demand for output, or when each unit of output generates a high return, the opportunity cost of lost production will far exceed the direct cost of repair and maintenance. In such situations, enhancing equipment availability and reliability should have higher priority than improving maintainability, and the Jack-knife diagram shown in Figure 6 helps to identify these higher priority problems.
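A log scatter plot of this kind is easy to generate from failure-code records. The Python sketch below (matplotlib) uses invented failure-code data and assumed threshold limits purely for illustration; in practice the limits would come from company policy or the relative values suggested by Knights (2004).

```python
# A minimal sketch of the log-log scatter plot of MTTR versus number of failures,
# with acute/chronic threshold lines. Failure-code data and limits are illustrative.
import matplotlib.pyplot as plt

codes = {               # failure code: (number of failures, MTTR in hours)
    "FC01": (45, 0.8),
    "FC02": (6, 9.0),
    "FC03": (30, 3.5),
    "FC04": (12, 1.2),
}
limit_n, limit_mttr = 20, 4.0   # assumed chronic (frequency) and acute (MTTR) limits

fig, ax = plt.subplots()
for code, (n, mttr) in codes.items():
    ax.scatter(n, mttr)
    ax.annotate(code, (n, mttr))
ax.axvline(limit_n, linestyle="--")     # chronic threshold (failure frequency)
ax.axhline(limit_mttr, linestyle="--")  # acute threshold (MTTR)
ax.set_xscale("log")
ax.set_yscale("log")                    # constant-downtime curves appear as slope -1 lines
ax.set_xlabel("Number of failures")
ax.set_ylabel("MTTR (hours)")
plt.show()
```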

Figure 5: Log scatter plot with limit values Source: Knights (2004)

Figure 6: Jack-knife diagram for facilities operating at peak capacity Source: Knights (2004)


In another scenario, when facilities have lots of idle capacity, or when the return from output is low, controlling and reducing the direct cost of repair and maintenance will become the primary focus of attention. Under such circumstances, dealing with equipment availability and maintainability problems should be given higher priority than resolving reliability problems, and the Jack-knife diagram shown in Figure 7 helps to identify these higher priority problems.

Figure 7: Jack-knife diagram for facilities with lots of idle capacity Source: Knights (2004)

Maintenance performance trends associated with specific failure codes can also be visualized on the log scatter plot. Figure 8 shows the trends of four failure codes over a period of 3 years.

Figure 8: Trends in unplanned failures for an item of equipment Source: Knights (2004)



Concluding Remarks

This paper presents two powerful tools for failure data analysis that support maintenance and replacement optimization, and prioritization of maintenance problems. The data used in such analyses are typically maintained in the database of the computerised maintenance management system (CMMS), enterprise asset management system (EAM), or enterprise resource planning system (ERP). However, it should be noted that these data are often not well organized in structure and imperfect (dirty and incomplete) in content. Preprocessing such data for decision support becomes a challenge. Tsang et al. (2006) discuss issues of data management for condition-based maintenance (CBM) optimization. Some of the common data quality problems identified in that publication are also relevant for effective application of the tools presented in this paper.

References

Jardine, A.K.S. and Tsang, A.H.C. (2005) Maintenance, Replacement, and Reliability: Theory & Applications, CRC Press
Knights, P.F. (2004) Downtime Priorities, Jack-knife Diagrams, and the Business Cycle, Maintenance Journal, 17(2), Melbourne, Australia
Robertson, R. and Jones, A. (2004) Pay Day, Plant Engineering & Maintenance, 28(9), pages 18-25
Tsang, A.H.C., Yeung, W.K., Jardine, A.K.S. and Leung, Bartholomew P.K. (2006) Data Management for CBM Optimization, Journal of Quality in Maintenance Engineering, 12(1), 37-51

