
Maintenance of Mining Machinery

Importance of Maintenance

Mining used to be a people-intensive industry. Not any more. Productivity in mining has grown by a factor of twenty in the last 40 years. This growth was made possible by large-scale mechanisation. When mining is done by machinery, the important consideration is to keep the machines operating. This is partly the aim of the maintenance function. For a typical surface mining operation, maintenance-related costs make up about 50% of all operating costs. For an underground mining operation, the figure is about 30-40%. On average, mining machinery maintenance accounts for 30-50% of the operating costs of the nation's mining industry, or $10-15b/year. Another indicator of the importance of maintenance is given by the following table, which lists the improvement in profit caused by a 1% improvement in various areas:

Improvement (1%) Area       Effect on profit
Productivity                3%
Availability                3%
Reduce Operating Costs      0.5-3.5%
Product Price Increase      0.5-0.9%
Reduced Interest Rate       0.7-1.2%

Source: World Mining Equipment, Dec '98

After years of seeing maintenance as a cost of doing business, the mining industry is starting to recognise the importance of equipment reliability and maintenance. Many companies have implemented computer-based centralised maintenance management systems and are reviewing new equipment purchases with a stronger focus on issues such as lifecycle costing, reliability and maintainability. While maintenance costs are significant, the cost of lost production is even more significant. That is why availability has high leverage on profits: a 1% improvement in availability increases company profits by up to 3%. Another way of stating this is that modern mining requires large investments in capital infrastructure and equipment to produce a product sold at a low unit price. To stay competitive, such a business has to focus on full realisation of its capital investment, i.e. sustained production at high levels. Maintenance plays an essential role in achieving this goal.

In simple words, you have to keep the machine running to stay ahead of the pack. This can be expressed as an optimisation problem:

Problem = Maximise the annual production

Total annual production is given by another simple formula:

$$\text{Annual Production} = \text{Tonnes/h} \times \left[ TH - PM - BM \right]$$

where TH is the total hours in the year (365 x 24, assuming round-the-clock potential operation), PM is the time spent on Planned Maintenance and BM is the time spent on Breakdown Maintenance. The term between the brackets is an important parameter as it represents the total operating time. It is usually expressed as a ratio to the total time potentially available for production, and this ratio is called the Availability.

Availability:

$$A = \frac{TH - PM - BM}{TH}$$

It is a responsibility of the entire site to keep this figure at a high level. Availability goes down when the equipment is poorly operated. It also goes down when it is poorly maintained.

The Concept of Failure

Failure

Failure is the loss of ability of an item to perform its required function. An example is the downdrive gear shown in the figure on the right. A number of teeth on this gear are broken and the gear cannot fulfil its function of transmitting power. The cost associated with this failure has three main components:

- Lost production when the machine was down
- The cost of the replacement gear
- Maintenance labour costs

MTBF = Mean Time Between Failures
MTTR = Mean Time To Repair

For the example on the right, MTBF = 100 hours and MTTR = 20 hours. The availability is sometimes expressed in terms of MTBF and MTTR as

$$A = \frac{MTBF}{MTBF + MTTR}, \qquad MTBF = \frac{\text{total uptime}}{N}, \qquad MTTR = \frac{\text{total downtime}}{N}$$

where N is the number of failures in the data collection period.

Failure Probability Distributions

Failures usually occur randomly. The repair time is also hardly ever constant. If we treat failure as a random event, then we can use the well-established tools of probability and statistics to model the uptime, downtime and availability of our equipment. The Poisson distribution is commonly used in forecasting to represent the number of occurrences of a specific event in a given continuous interval, for example:

- Ships arriving at a dock on a given day
- Traffic accidents on the SE freeway in a month
- Mad cow disease outbreaks in the world in one year
- Typos per page in a long report typed by Hal Gurgenci
- Cable shovel failures in one day of operation

The following is the probability distribution of the Poisson random variable X representing the number of outcomes occurring in a given time interval t. Lambda ($\lambda$) is the average number of outcomes per unit time:

$$P(X = x) = \frac{(\lambda t)^x e^{-\lambda t}}{x!}, \qquad x = 0, 1, 2, \ldots$$
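As a rough illustration, the Poisson probability can be evaluated directly from its standard form, P(X = x) = (lambda t)^x exp(-lambda t) / x!. The sketch below is only an example: the function name `poisson_pmf` and the failure rate of 0.5 per day are assumptions made for illustration, not figures from the text.

```python
import math


def poisson_pmf(x: int, lam: float, t: float) -> float:
    """Probability of exactly x events in an interval t at average rate lam."""
    mean = lam * t  # expected number of events in the interval
    return mean ** x * math.exp(-mean) / math.factorial(x)


# Hypothetical cable shovel averaging 0.5 failures per day, observed for one day.
rate_per_day = 0.5
for failures in range(4):
    p = poisson_pmf(failures, rate_per_day, 1.0)
    print(f"P({failures} failures in one day) = {p:.4f}")
```

Summing the probabilities over all possible counts should give 1, which is a quick sanity check on any rate you plug in.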

Assume failure events follow a Poisson distribution. It is then easy to find the probability of having NO FAILURES in a given time interval t by substituting x=0 in the Poisson distribution function:

$$R(t) = P(X = 0) = e^{-\lambda t}$$

This is referred to as the survival probability or the reliability.

Component Reliability

Reliability is the probability that a product will operate throughout a specified period without failure

- when maintained in accordance with the manufacturer's instructions; and
- when not subjected to environmental or operational stresses beyond the limits stipulated by the manufacturer.

Failure Probability

If the reliability, R(t), is the probability of surviving through time t, then the probability of failing in that period is 1 - R(t), or

$$F(t) = 1 - R(t) = 1 - e^{-\lambda t}$$

This is called the cumulative failure distribution function, or failure c.d.f. for short. It is called cumulative because it expresses the aggregate probability of failure over a time period t. The time-derivative of the c.d.f. gives the probability density function, or p.d.f.:

$$f(t) = \frac{dF(t)}{dt} = \lambda e^{-\lambda t}$$

The coefficient lambda is called the hazard rate or the failure rate (e.g. if a piece of equipment is expected to fail twice a day on average, lambda is 2 d^-1 and the mean time between failures is 0.5 d). If we have failure data for a large population, then the hazard rate can be estimated as follows:

Hazard Rate (Failure Rate):

$$\lambda = \frac{\text{number of failures}}{\text{total operating time}}$$

This assumes an exponential (uniform failure rate) distribution. The Mean Time Between Failures is the inverse of the failure rate.

Mean Time Between Failures (MTBF):

$$MTBF = \frac{1}{\lambda} = \frac{\text{total operating time}}{\text{number of failures}}$$
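The estimate can be sketched in a few lines of Python, assuming the usual estimator of failures divided by total operating time and an exponential distribution. The function names and the fleet record (12 failures over 6000 operating hours) are hypothetical and serve only to show the arithmetic.

```python
import math


def estimate_failure_rate(n_failures: int, operating_hours: float) -> float:
    """Failure rate (per hour) as failures divided by total operating time."""
    return n_failures / operating_hours


def reliability(t: float, lam: float) -> float:
    """Exponential survival probability R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)


# Hypothetical fleet record: 12 failures over 6000 operating hours.
lam = estimate_failure_rate(12, 6000.0)
mtbf = 1.0 / lam  # MTBF is the inverse of the failure rate

print(f"failure rate = {lam:.4f} per hour, MTBF = {mtbf:.0f} h")
print(f"R(100 h) = {reliability(100.0, lam):.3f}")
print(f"F(100 h) = {1.0 - reliability(100.0, lam):.3f}")
```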

Hazard Rate

The probability of failure in a unit time interval (t, t+1) is roughly equal to f(t), as shown in the following figure:

This probability depends on two things:

- The probability of survival until time t. Obviously, if the piece fails before time t, it cannot fail in (t, t+1). The survival probability is of course the reliability, R(t).
- The probability of failure in one unit time interval, given survival until t. This is called the hazard rate, h.

The probability of both of these events happening is equal to their product. Therefore,

$$f(t) = h(t)\, R(t) \quad \Rightarrow \quad h(t) = \frac{f(t)}{R(t)}$$

The hazard rate can also be defined as the conditional probability of failure in a small time interval (t, t+dt). It is conditional on there being no failure until t. For an exponential failure distribution, the hazard rate is constant:

$$h(t) = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda$$

Weibull Reliability

The hazard rate is not always constant. For example, assume the failure rate increases according to the formula 0.1t, where t is measured in days. It starts from zero and, at the end of a 30-day month, the hazard rate (or failure rate) is 3 failures per day. How do we generate the reliability function for this component? In a sample of N, dN will fail in a time interval dt:

$$dN = -N\, m\, t\, dt$$

with m = 0.1 in this example. If it were dN = -mN dt, this would correspond to a uniform rate of failure. When dN = -Nmt dt, the rate of failure increases with time (assuming m > 0). We can then group the terms, integrate, and get the number of units that would survive over time t:

$$\int_{N_0}^{N} \frac{dN}{N} = -m \int_0^t t\, dt \quad \Rightarrow \quad N(t) = N_0\, e^{-m t^2 / 2}$$

where $N_0$ is the value of N at t=0. The surviving fraction is the survival probability or reliability, so the reliability function is

$$R(t) = \frac{N(t)}{N_0} = e^{-m t^2 / 2}$$
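Carrying the integration through for a hazard rate of m t gives R(t) = exp(-m t^2 / 2). The sketch below compares that closed form with a step-by-step integration of dN/N = -m t dt. The function names are made up for this example, and m = 0.1 per day squared is taken from the worked case in the text.

```python
import math


def reliability_linear_hazard(t_days: float, m: float = 0.1) -> float:
    """Closed form R(t) = exp(-m * t**2 / 2) for a hazard rate h(t) = m * t."""
    return math.exp(-m * t_days ** 2 / 2.0)


def reliability_numeric(t_days: float, m: float = 0.1, steps: int = 10_000) -> float:
    """Integrate dN/N = -m * t * dt numerically with the midpoint rule."""
    dt = t_days / steps
    log_fraction = 0.0  # running value of ln(N / N0)
    for i in range(steps):
        t_mid = (i + 0.5) * dt
        log_fraction -= m * t_mid * dt
    return math.exp(log_fraction)


for day in (1, 3, 5, 10):
    closed = reliability_linear_hazard(day)
    numeric = reliability_numeric(day)
    print(f"day {day:2d}: closed form {closed:.4f}, numeric {numeric:.4f}")
```

The two columns should agree, which confirms that the extra factor of t in the hazard rate is what produces the t squared in the exponent.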

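This reliability expression is different from the simple exponential distribution because of the exponent on t. Curves of this form are called Weibull distribution curves. The general Weibull distribution curve is

$$R(t) = e^{-(t/\eta)^{\beta}}$$

where R is the probability of surviving through time t, beta ($\beta$) is the shape factor and eta ($\eta$) is the scale factor.

Weibull Distribution Curves:

Cumulative Distribution Function (cdf)

$$F(t) = 1 - e^{-(t/\eta)^{\beta}}$$

Probability Density Function (pdf)

$$f(t) = \frac{\beta}{\eta} \left( \frac{t}{\eta} \right)^{\beta - 1} e^{-(t/\eta)^{\beta}}$$

Hazard Rate

$$h(t) = \frac{f(t)}{R(t)} = \frac{\beta}{\eta} \left( \frac{t}{\eta} \right)^{\beta - 1}$$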

The value of beta determines the shape of the Weibull curve. For example, beta=1 corresponds to a constant hazard rate or an exponential distribution. Beta>1 means that the hazard rate increases with increasing age. This is summarised in the left-hand figure below:

By using different values for the beta and eta factors, the Weibull distribution can be made to fit a wide range of failure data. The figure on the right-hand side above is a typical Weibull curve. For this component, the probability of survival drops below 90% after the first 1250 hours. In other words, the probability of survival through 1250 hours is 90%.
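The effect of the shape factor can be explored numerically. The sketch below assumes the standard two-parameter Weibull form, R(t) = exp(-(t/eta)^beta), with hazard rate (beta/eta)(t/eta)^(beta-1). All function names and parameter values are illustrative assumptions; in particular, the parameters behind the 1250-hour curve in the figure are not given in the text and are not reproduced here.

```python
import math


def weibull_reliability(t: float, beta: float, eta: float) -> float:
    """Survival probability R(t) = exp(-(t / eta) ** beta)."""
    return math.exp(-((t / eta) ** beta))


def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """Hazard rate h(t) = (beta / eta) * (t / eta) ** (beta - 1), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


def time_to_reliability(r: float, beta: float, eta: float) -> float:
    """Age at which the survival probability has fallen to r (0 < r < 1)."""
    return eta * (-math.log(r)) ** (1.0 / beta)


# Hazard rate at two ages for a decreasing, constant and increasing case.
eta_hours = 1000.0  # hypothetical scale factor
for beta in (0.5, 1.0, 2.0):
    early = weibull_hazard(100.0, beta, eta_hours)
    late = weibull_hazard(1000.0, beta, eta_hours)
    print(f"beta = {beta}: h(100 h) = {early:.5f}, h(1000 h) = {late:.5f}")

# A hazard rate of 0.1 * t per day corresponds to beta = 2, eta = sqrt(2 / 0.1).
eta_days = math.sqrt(2.0 / 0.1)
print(f"h(3 d) = {weibull_hazard(3.0, 2.0, eta_days):.3f} per day")

# Age by which 10% of a hypothetical population (beta = 2, eta = 4000 h) has failed.
print(f"t at R = 0.9: {time_to_reliability(0.9, 2.0, 4000.0):.0f} h")
```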

Bathtub Curve

The so-called Bathtub Curve can be found in almost every maintenance management book. It is popular partly because of its anthropomorphic qualities, with a superficial resemblance to a human life span. Babies are vulnerable to many hazards. Some of the sources of hazard are congenital and others are caused by the environment. This is the infant mortality region, similar to the learning and commissioning phase after the installation of complex machinery. As the human child grows into an adult, he or she learns to cope with their congenital weaknesses and at the same time builds defence mechanisms to deal with environmental or external hazards. The hazard rate is fairly uniform for adults over a relatively long period. This is the flat portion of the bathtub curve and corresponds to the operating life of a complex machine. Old age brings frailty and increasing hazards. This is the third part of the bathtub curve, where the hazard rate rapidly increases with increasing age. If you know where that period starts, you may want to consider retiring the machine at that point and replacing it with a new unit.

The bathtub curve concept is plausible and easy to understand. Most complex systems show bathtub-like behaviour. Unfortunately, when it comes down to individual parts and components, the bathtub curve has only limited application. The curves on the right are the failure patterns observed on aircraft electronic components in a study completed in 1978 by Nowlan and Heap. They show that only 4% of the components go through a bathtub curve. Another 2% have a sudden death region and another 5% have a slowly increasing hazard rate with time. These charts tell us that most things do not fail through an age-related mechanism. This is contrary to laboratory tests, in which the hazard rate usually increases with time. Most mechanical engineering components fail through a fatigue-related failure mechanism, which implies a time dependence. Even though the Nowlan and Heap study was conducted for electronic components in the aircraft industry, its conclusions are almost universally accepted for all industries. Our own studies of longwall face equipment failures indicate that a uniform-hazard assumption, i.e. an exponential failure distribution, is more useful than a Weibull-type curve for representing longwall stoppages that are caused by both mechanical and electrical component failures.

The main reason for the discrepancy between the lab tests and field experience is that most components in the field fail through a variety of failure modes whose interaction cannot be clearly understood. The lack of understanding of the actual failure modes causes us to lump them all together, and this tends to favour an exponential distribution.

Reliability of Multi-Component Systems

Series Systems

A series system is a chain of components. When one of these parts fails, the entire system fails. The system survives only if every component survives, so for independent components the system reliability is the product of the individual reliabilities:

$$R_{series} = R_1 R_2 \cdots R_n = \prod_{i=1}^{n} R_i$$

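Parallel Systems

The failure of a parallel system means the failure of every individual component. The system failure probability is then the product of the individual failure probabilities (1 - R):

$$F_{parallel} = \prod_{i=1}^{n} (1 - R_i), \qquad R_{parallel} = 1 - \prod_{i=1}^{n} (1 - R_i)$$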

Most mining machinery systems are series systems. In other words, the failure of one component fails the entire system. The redundancy in mining can be provided by having multiple systems, eg spare trucks or shovels.
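The difference between series and parallel arrangements is easy to see with numbers. This is a minimal sketch assuming that component failures are independent; the function names and the reliability values are hypothetical.

```python
from math import prod


def series_reliability(reliabilities):
    """A series system survives only if every component survives."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """A parallel system fails only if every component fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical machine with three components in series.
components = [0.95, 0.90, 0.98]
print(f"series system:   {series_reliability(components):.4f}")

# Hypothetical redundancy: two trucks, either of which can do the job.
trucks = [0.80, 0.80]
print(f"parallel system: {parallel_reliability(trucks):.4f}")
```

A series system is always less reliable than its weakest component, while redundancy lifts the system above its best component.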

MANAGING RELIABILITY

Optimum utilisation of its capital investment in equipment is essential for company profits. Equipment reliability plays a major role in this. Therefore, managing reliability is a core business for a mining company. This is a task for both production and maintenance engineers. In the rest of this module, we will focus on the maintenance function.

Maintenance Function

The maintenance function can be broadly separated into two parts: Preventive and Corrective. The aim of preventive maintenance is to increase the reliability of the system by preventing failures from occurring. This can be done in a number of ways:

- Servicing such as cleaning or lubrication
- Inspection to find and correct incipient failures
- Planned replacement of parts at fixed intervals

Corrective maintenance is the repair action that is taken when the system fails. The amount of corrective maintenance is governed by the system reliability. Very little time is spent on corrective maintenance for a system that has high reliability. The aim of the maintenance function is to help maximise asset utilisation. This is done by maximising equipment availability. In the mining industry, the opportunity cost of lost production due to machine downtime is higher than any other cost associated with the maintenance function. Therefore, increasing machine availability is the paramount aim. Which maintenance strategy better serves this aim? Preventive or Corrective? An optimal mix of preventive and corrective maintenance needs to be found for each application to maximise equipment availability.

Corrective Maintenance

For corrective maintenance we assume that the system is brought back to an "as new" state, so the reliability is not affected. The impact on system availability comes from the time it takes to repair (on average, this is referred to as MTTR or Mean Time To Repair). The two main KPIs in assessing Corrective Maintenance are

- the validation of the assumption that corrective maintenance brings the component back to its "as new" state
- the duration of the time it takes to do the repair

The repair duration has the following components:

- Fault identification - What caused the failure? What needs to be repaired?
- Set-up time - Find and bring the right person to the job
- Actual repair
- Logistic delays - Waiting for the spare part
- Restart time - Time spent bringing the system back to normal operation after the fault is repaired

The actions to reduce the overall Mean Time To Repair (MTTR) follow logically from the above breakdown:

- Identify the failed components quickly. This is achieved through experienced operators and on-line fault detection tools
- For frequent failures, have a repair crew with the right skills on standby
- Likewise, keep spare parts on hand for frequently failing components
- Design the equipment and the operating procedure to minimise restart time
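To see how much a shorter repair is worth, the repair breakdown can be fed into the common steady-state relation A = MTBF / (MTBF + MTTR), where MTBF is taken as the mean uptime between failures. The sketch below reuses the MTBF = 100 h and MTTR = 20 h figures from the gear example, but the split of the 20 hours into components is invented for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical split of a 20 h repair into the components of repair duration.
repair_hours = {
    "fault identification": 3.0,
    "set-up": 4.0,
    "actual repair": 6.0,
    "logistic delays": 5.0,
    "restart": 2.0,
}

mtbf = 100.0
mttr = sum(repair_hours.values())
print(f"MTTR = {mttr:.0f} h, availability = {availability(mtbf, mttr):.3f}")

# Effect of holding the spare part on site, so the logistic delay disappears.
repair_hours["logistic delays"] = 0.0
mttr_improved = sum(repair_hours.values())
print(f"MTTR = {mttr_improved:.0f} h, availability = {availability(mtbf, mttr_improved):.3f}")
```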

Preventive Maintenance

Preventive Maintenance is done to reduce or eliminate the risk of failure. It is always an interruption to production. It is important that the preventive maintenance is effective in avoiding failure, and that the cost of the failure avoided as a result of the preventive maintenance exceeds the cost of the preventive maintenance itself. In other words:

The cost of failure (downtime over the MTTR, the cost of the repair and the replacement)

should be greater than

The cash cost of the planned maintenance action (salaries, consumables, etc.) plus the opportunity cost (lost production).

Preventive Maintenance can be performed at different levels, and at each level a decision has to be made whether the action is necessary or not.

- Service (necessary only if the service action has a significant effect on system reliability)
- Inspection (necessary only if there is sufficient time between potential and actual system failure; this time is referred to as the P-F time in maintenance literature)
- Periodic Replacements (necessary only if the part is in, or about to enter, a period where the age-induced failure rate is steeply increasing, e.g. the end portion of the bathtub curve)

P-F Time

If we can identify the onset of failure by inspecting the part, and if there is enough time between this identification and the expected failure to schedule and implement a corrective maintenance task, then this is a successful inspection. The crucial parameter is the interval between the occurrence of a potential failure and its decay into an actual failure. This is called the P-F interval. For this strategy to be effective, the inspection interval should be no longer than the P-F interval. This is practical only if the P-F interval is long enough. Otherwise, we would spend our time inspecting the machine without producing anything. In other words, scheduled inspections will only help when

- The potential failure condition is clearly defined
- The P-F interval is consistent
- It is practical to inspect at intervals less than the P-F interval
- The P-F interval is long enough to implement corrective maintenance action

Periodic Replacements

Periodic part replacements are always a costly component of preventive maintenance. Therefore, they should be done only when they contribute to the system reliability at a level compensating for the cash and time cost of performing the replacement action. Periodic (or scheduled) replacements help when

- The component breakdown has costly consequences (e.g. a chain of failures, distance from the workshop, etc.)
- The dominant failure mode is age-related, with the hazard rate consistently increasing above an acceptable value at around the set replacement period

Hazard rate               Effect of scheduled replacement
Decreasing Hazard Rate    Scheduled replacement increases failure probability
Constant Hazard Rate      Scheduled replacement has no effect on failure probability
Increasing Hazard Rate    Scheduled replacement decreases failure probability

Reliability Data Analysis

A typical mine site today would have archives of past data on maintenance histories, equipment availability, and production delays. The quality of the data in terms of identifying the root causes of equipment failure and developing accurate reliability statistics is usually questionable. Nevertheless, it is a good place to start for anybody wanting to achieve a significant improvement in equipment utilisation.

It is usually necessary to process 100000+ lines from multiple sources of sometimes questionable accuracy.

The Pareto Principle says that in any list of items there are a "significant few" and the remainder are the "insignificant many". In the context of mining equipment maintenance, a large part of the failures are due to a small number of causes. A Pareto plot helps to identify the most significant causes. The biggest benefit comes from maintenance action that addresses only the significant issues.
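A Pareto ranking can be produced from a downtime log with a few lines of code. The sketch below is one possible way to do it: the function name `pareto_table`, the failure causes and the downtime hours are all invented for illustration and do not come from a real site.

```python
def pareto_table(downtime_log):
    """Rank causes by total downtime and attach the cumulative percentage."""
    totals = {}
    for cause, hours in downtime_log:
        totals[cause] = totals.get(cause, 0.0) + hours

    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)

    table = []
    cumulative = 0.0
    for cause, hours in ranked:
        cumulative += hours
        table.append((cause, hours, 100.0 * cumulative / grand_total))
    return table


# Hypothetical downtime records: (failure cause, downtime in hours).
log = [
    ("hydraulic hose", 120.0),
    ("electrical trip", 45.0),
    ("hydraulic hose", 80.0),
    ("conveyor belt", 30.0),
    ("operator error", 10.0),
    ("electrical trip", 15.0),
]

for cause, hours, cum_pct in pareto_table(log):
    print(f"{cause:16s} {hours:6.0f} h   cumulative {cum_pct:5.1f}%")
```

Reading down the cumulative column shows how few causes account for most of the downtime, which is where maintenance effort pays off first.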

Another way of representing past failure data is by way of scatter plots. A scatter plot is a logarithmic plot of MTTR against the number of failures N. Since the total downtime associated with each failure cause is N x MTTR, a constant downtime D gives log(MTTR) = log(D) - log(N), so curves of constant downtime appear as straight lines on logarithmic axes.

Reliability Analysis

Pareto Analysis and Scatter Plots are good tools to identify the reliability sinks in the equipment. The next step is to calculate the failure probability distribution curves for all critical components. The MTTR statistics may also be required if MTTR is not reasonably constant for each item. This step requires high quality data. One needs a large enough data set to have at least 4-5 failure events for each target failure mode. This data set should cover a sufficiently long time period to eliminate local and temporary effects. Uniform operating conditions need to apply over this period. At the same time, the data should be free of collector's bias. For example, if the attributes of the collected data have an influence on the assessment of the data-collecting staff, it should not be surprising if there is a bias favourable to the assessment KPIs.

In this module, we will limit ourselves to the estimation of the exponential failure distribution. The reliability in the exponential distribution is expressed as

$$R(t) = e^{-\lambda t}$$

where lambda is the uniform failure rate (or the inverse of the Mean Time Between Failures, MTBF). Let us see how we can develop an estimate for lambda using a real example:

Example 1

Suppose that we have the failure log for a component as 180, 216, 930, 990, 1300 and 1850 hours. Estimate the MTBF assuming an exponential probability distribution.

Solution

Since we have a record of failures, it is best to use the cumulative failure distribution function. Under the assumption of an exponential distribution of failures, the cumulative distribution function is

$$F(t) = 1 - e^{-\lambda t}$$

We have records of six failures. The following plots them on a time axis:

Assuming that after each failure the system is brought back to "as new" condition, the time periods between failures can be treated as examples of different systems surviving for different periods of time. Let us calculate the Time-Between-Failure data from the above history: TBF = 36, 714, 60, 310 and 550. Therefore, the MTBF is the average of the above (= 334 hours) and the reliability function describing the above failure history is

$$R(t) = e^{-t/334}$$

This is plotted in the following chart. The vertical lines represent the actual failure data.
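The arithmetic of Example 1 can be reproduced in a short script. The failure times are those given in the example; the variable names are arbitrary, and the 500-hour evaluation point is an arbitrary choice for illustration.

```python
import math

# Failure log from Example 1, in operating hours.
failure_times = [180, 216, 930, 990, 1300, 1850]

# Time between successive failures.
tbf = [later - earlier for earlier, later in zip(failure_times, failure_times[1:])]
print(tbf)  # [36, 714, 60, 310, 550]

mtbf = sum(tbf) / len(tbf)
print(mtbf)  # 334.0

lam = 1.0 / mtbf  # failure rate per hour under the exponential assumption
print(f"R(500 h) = {math.exp(-lam * 500.0):.3f}")
```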

Example 2

How can you decide if the exponential distribution was the correct assumption for the failure data analysed in Example 1?

Solution

Let us look at the TBF data again: TBF = 36, 714, 60, 310 and 550. Let us sort and tabulate the data:

TBF, h
36
60
310
550
714

We can roughly say that 20% of the failures occur before the first 36 hours, 40% occur before 60 hours, 60% occur before 310 hours, etc. This leads us to create the following table:

TBF, h    Rough estimate for the c.d.f.
36        20%
60        40%
310       60%
550       80%
714       100%

A better estimate is probably to say that 10% of the failures occur before 36 hours and another 10% occur after 36 but before 48 (the mid-point between 36 and 60). With similar reasoning for the other rows, a better estimate is formed as

TBF, h    Rough estimate for the c.d.f.    Mid-point estimate for the c.d.f.
36        20%                              10%
60        40%                              30%
310       60%                              50%
550       80%                              70%
714       100%                             90%
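The two columns of estimates follow a simple pattern: for the i-th of n sorted values, the rough estimate is i/n and the mid-point estimate is (i - 0.5)/n. The sketch below is one reading of the table that reproduces its percentages, and it also evaluates the fitted exponential c.d.f. with MTBF = 334 h for comparison. The variable names are arbitrary.

```python
import math

tbf = [36, 714, 60, 310, 550]  # hours, from Example 1
mtbf = 334.0                   # hours, from Example 1

ordered = sorted(tbf)
n = len(ordered)

# Rank-based estimates of the cumulative failure probability.
rough = [(i + 1) / n for i in range(n)]
midpoint = [(i + 0.5) / n for i in range(n)]

# Exponential c.d.f. F(t) = 1 - exp(-t / MTBF) at the same times.
fitted = [1.0 - math.exp(-t / mtbf) for t in ordered]

print("TBF, h   rough   mid-point   exponential")
for t, r, m, f in zip(ordered, rough, midpoint, fitted):
    print(f"{t:6d}   {r:5.0%}   {m:9.0%}   {f:11.1%}")
```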

Now we can superimpose the estimated cdf values from our actual data onto the exponential distribution curve:

$F(t) = 1 - e^{-t/334}$ is plotted against time (hours) in the chart on the left-hand side. The discrete data points correspond to the rough and mid-point estimates for the cumulative distribution function as given in the tables above. This does not look like a good fit.

People usually plot $\ln\left[1/(1-F(t))\right]$ against time because, for an exponential distribution, this gives a straight line. It is left as an exercise to explain why this is so. The above figure plots $\ln\left[1/(1-F(t))\right]$ against time. It is clearer here that the exponential fit is not a very bad assumption for this data set.
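The linearised comparison can also be tabulated without a chart. The sketch below assumes the plotted quantity is ln[1/(1 - F)], uses the mid-point c.d.f. estimates from Example 2, and compares them with the line t/MTBF for MTBF = 334 h. The least-squares slope through the origin is an extra check added here for illustration, not a step from the text.

```python
import math

tbf_sorted = [36, 60, 310, 550, 714]  # hours, sorted TBF data from Example 2
mtbf = 334.0                          # hours, from Example 1
n = len(tbf_sorted)

# Linearised mid-point estimates: y = ln(1 / (1 - F)).
y_data = [math.log(1.0 / (1.0 - (i + 0.5) / n)) for i in range(n)]

# The fitted exponential gives the straight line y = t / MTBF.
y_model = [t / mtbf for t in tbf_sorted]

print("TBF, h   data    model")
for t, yd, ym in zip(tbf_sorted, y_data, y_model):
    print(f"{t:6d}   {yd:5.3f}   {ym:5.3f}")

# Slope of the best straight line through the origin, and the MTBF it implies.
slope = sum(t * y for t, y in zip(tbf_sorted, y_data)) / sum(t * t for t in tbf_sorted)
print(f"fitted slope = {slope:.5f} per hour, implied MTBF = {1.0 / slope:.0f} h")
```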