LRCM at Cerrejon

Without an adequate data sample there can be no reliability analysis (RA). Without analysis there can be no systematic, verifiable improvement in reliability or in operational economy. A sample is a collection of life cycles. A beginning and an ending event define a failure mode life cycle. Well-known obstacles impede the reliability engineer’s analysis of maintenance data. The main problem lies in the difficulty of obtaining analyzable data samples. Although modern CMMSs should provide the needed information, they rarely deliver samples of adequate quality. Cerrejon, an integrated mining operation in Colombia, South America, solved the information management problem by applying a “Living Reliability Centered Maintenance (LRCM)” process to its fleets of Trucks, Graders, Dozers, Loaders, and Shovels. This paper describes a method wherein completed work orders capture the information that enables reliability analysis. The ‘right’ maintenance observations reference significant failure modes. Work orders link to records in a continuously growing RCM structured knowledge base. Grouped and filtered knowledge-to-work order links, being instances of failure modes, provide the samples required by reliability analysis software tools and methods. LRCM software and related procedures facilitate the growth of knowledge, manage the work order-to-RCM relationship, and generate samples for subsequent reliability analysis.

Introduction

Carbones del Cerrejón Ltd. is the world’s largest export open pit coal mining operation. It is located in La Guajira state, in the north-eastern part of Colombia. The Mine has estimated resources of 2,193 million tonnes of coal covering a land area of 690 square kilometers. Open pit mining is an operation that starts with the cleaning of the surface and the careful removal of the topsoil layer, which is stored for future rehabilitation of the land used. Then drilling and blasting are carried out, followed by the removal of overburden material, until the coal seams are exposed. Coal is loaded and transported in trucks from the mine to the stock piles and crushers, and then is taken to two silos that load the train. There are shops on site for carrying out maintenance of trucks, tractors, loaders, dozers, and graders. Considerable maintenance activities are performed away from the shops in the pits particularly on shovels and dozers.

This paper discusses a recent LRCM pilot project at Cerrejon. The initiative was undertaken with the objective of measuring and then increasing the return on Cerrejon’s considerable Condition Based Maintenance (CBM) investment. The following topics relate to obstacles encountered in achieving measurable CBM performance through RA.

Work order free text
Motivation, leadership, and training
Performance metrics
Mistaking suspensions for failures

Absent from the foregoing list are technical issues related to data collection, the CMMS, RCM software and Reliability Analysis (RA) tools that deliver Weibull Analysis and Simulation. Maintenance analytical and data centric software and systems are of high quality and have outstanding technical capabilities and performance. Yet their contribution toward improved maintenance, reliability, and cost performance eludes measurement by Cerrejon managers and engineers. This paper will explore the gaps and offer solutions.

CBM Decisions

The Inspection and Technology Group (ITG) of Cerrejon’s maintenance department operates a number of CBM programs. These include a state-of-the-art on-site oil analysis laboratory, equipment testing and monitoring instrumentation, and real-time equipment vital sign data logging. CBM is also known as Condition Monitoring (CM) or Predictive Maintenance (PdM). Its purpose is to gather and interpret periodic observational data in order to decide, at the current time, whether to:

1. Stop the equipment as soon as possible and perform a specific preventative action as indicated by the monitored data, or
2. Schedule an indicated preventative maintenance action within a specific and safe time period, or
3. Carry on with the normal operation of the equipment until the next CBM inspection and evaluation.

The CBM Model

It is imperative, given the scale^[1] of maintenance at Cerrejon, that the process for making a CBM decision (one of 1, 2, or 3 above) be optimal, automated, and verifiable. To meet the CBM decision making challenge, reliability and maintenance engineering staff use a CBM optimizing system called EXAKT. CBM model construction is illustrated in Figure 2.

: Figure 2 EXAKT Modeling Process.

The modeling process of Figure 2 consists of the following three phases:

The first phase develops the reliability model. An example of a reliability model is seen in the top left block of Figure 2. The equation is known as a Proportional Hazard Model (PHM) where:
1. h(t) is the Failure Rate function.
2. The first factor on the right hand side $\frac{0.781}{2709}\left(\frac{t}{2709}\right)^{0.781-1}$ is recognizable as the Weibull Failure Rate where (in this example) the shape parameter β is 0.781 and the scale parameter η is 2709.
3. The second factor $e^{\left(0.06944 \times MaxWSDrop \right)}$ extends the Weibull failure rate equation to include relevant CBM data. Each CBM variable is multiplied by a “covariate” parameter γ_i. In the example γ_i = 0.06944. The parameters β, η, and γ_i are estimated by a numerical algorithm applied to a sample of life cycles and concurrent CBM data. The product of each significant CBM variable and its γ_i are summed within the exponential term. In the example equation we can see that age t is significant since β is different from 1. If β were equal to one the factor $\left(\frac{t}{2709}\right)^{\beta-1}$ would become 1 and the failure rate would then be independent of age and dependent only on the significant CBM variables. This is a desirable CBM situation. If, in addition to a low value of β, the confidence in the model is high (see Phase 3 below) then our maintenance decisions based on CBM data will be optimal ones. CBM decisions when optimized will yield lower overall cost and higher availability than Time Based Maintenance (TBM) decisions. In the example, only one condition monitoring variable MaxWSDrop appears in the exponential term. It is significant since its γ_i is found by the software to be significantly different from zero.
In the second phase of CBM modeling we build a prediction model (middle block on the left side of Figure 2). A combination of levels of the significant variables constitutes a “state”. If there are two significant variables and each has 3 discrete levels then transitions could occur among 9 states. Past transitions from one state to another form a probability history in the form of a probability matrix that is used to predict the likely future states of the variables.^[2]
The reliability and predictive models (phases 1 and 2 above) are combined in the third modeling phase to generate a Remaining Useful Life Estimate (RULE). Confidence in the estimate can be expressed, for example, as a standard deviation. The RULE and its confidence (scatter) are illustrated in the “Conditional Probability Density” graph in the bottom left block of Figure 2.
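The first two phases can be illustrated with a minimal Python sketch. It evaluates the example PHM using the parameter values quoted above, and estimates a transition probability matrix by counting observed state-to-state moves. The function names are hypothetical, and the counting approach assumes equally spaced inspections on a single unit; it is not a description of how EXAKT itself estimates these quantities.

```python
import math
from collections import Counter
from itertools import product

# Parameter values quoted in the example equation of Figure 2.
BETA = 0.781                     # Weibull shape parameter
ETA = 2709.0                     # Weibull scale parameter (working-age units)
GAMMA = {"MaxWSDrop": 0.06944}   # one covariate parameter per significant variable


def phm_hazard(t, covariates, beta=BETA, eta=ETA, gamma=GAMMA):
    """Failure rate h(t): Weibull baseline times exp(sum of gamma_i * z_i)."""
    baseline = (beta / eta) * (t / eta) ** (beta - 1.0)
    composite = sum(gamma[name] * covariates[name] for name in gamma)
    return baseline * math.exp(composite)


def transition_matrix(state_history, states):
    """Estimate P[i][j] as the share of observed moves out of state i that went to j.

    Assumption: state_history is one unit's states at equally spaced inspections.
    """
    counts = Counter(zip(state_history, state_history[1:]))
    matrix = {}
    for i in states:
        row_total = sum(counts[(i, j)] for j in states)
        matrix[i] = {
            j: (counts[(i, j)] / row_total if row_total else 0.0) for j in states
        }
    return matrix


# Two significant variables with three discrete levels each give 9 states.
levels = ("low", "medium", "high")
states = list(product(levels, repeat=2))
```

Calling `phm_hazard(1000.0, {"MaxWSDrop": 10.0})` returns the failure rate at a working age of 1000 with a monitored value of 10; the covariate value here is purely illustrative.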

The reliability and predictive models may be combined with business factors to develop a “Cost model”, an example of which is shown on the left of Figure 2. The vertical axis represents the overall cost of maintaining the component. The horizontal axis measures “Risk”.^[3] The low point on the graph indicates the optimal risk. It is the level of risk that results in the lowest maintenance cost, the highest availability, or the highest profitability, depending on the optimizing objective that the Reliability Engineer sets into the model.

The optimal decision model is represented by the green, yellow, and red graph at the bottom of Figure 2. The vertical axis measures the weighted sum of the CBM significant variables. The horizontal axis measures the working age of the asset. If the weighted sum of significant variables falls in the red region of the graph, a Potential Failure (PF) is declared tentatively, subject to confirmation at the time of work order execution. Each point on the crossover boundary to the red area corresponds to the optimal risk, calculated as the low point on the Cost model graph of Figure 2.
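One way to see how a boundary between regions could depend on age is the following sketch. It assumes, as an illustration only, that the red region is entered when the penalty cost multiplied by the PHM failure rate reaches the optimal risk level. Under that assumption the boundary on the weighted sum of variables follows by solving for the exponent. The function names, the `yellow_margin` parameter, and the cost figures are hypothetical and are not taken from EXAKT.

```python
import math


def red_boundary(t, beta, eta, penalty_cost, optimal_risk):
    """Weighted-sum value at which penalty_cost * h(t) equals optimal_risk.

    Assumed rule: penalty_cost * (beta/eta) * (t/eta)**(beta-1) * exp(c) >= optimal_risk,
    solved for the composite covariate c.
    """
    return (math.log(optimal_risk * eta / (penalty_cost * beta))
            - (beta - 1.0) * math.log(t / eta))


def cbm_decision(t, composite, beta, eta, penalty_cost, optimal_risk,
                 yellow_margin=0.5):
    """Classify a point (age, weighted sum) as 'red', 'yellow', or 'green'.

    yellow_margin is a hypothetical band width below the red boundary.
    """
    boundary = red_boundary(t, beta, eta, penalty_cost, optimal_risk)
    if composite >= boundary:
        return "red"
    if composite >= boundary - yellow_margin:
        return "yellow"
    return "green"


# Example call with the Figure 2 parameters and hypothetical cost inputs.
decision = cbm_decision(
    t=2000.0,
    composite=0.06944 * 5.0,   # gamma * MaxWSDrop, illustrative reading of 5
    beta=0.781,
    eta=2709.0,
    penalty_cost=10000.0,      # hypothetical penalty cost of a failure
    optimal_risk=2.0,          # hypothetical optimal risk (cost per hour)
)
```

With a shape parameter of 1 the boundary in this sketch is flat with age, so the decision depends only on the monitored variables. With a large shape parameter the boundary drops steeply as age increases, and the decision becomes essentially age based.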

The problem

Mentioned previously were three criteria required for effective CBM at Cerrejon. The decisions flowing from the CBM program have to be 1) optimal, 2) automated, and 3) verifiable. The third criterion, “verifiability”, when satisfied, reveals CBM’s effectiveness and inevitably its weaknesses. Therein lies the problem. The EXAKT process exposes CBM performance issues that management must address. The following is a typical example of a CBM model applied to the predictive maintenance of the engine powering an EX3600 hydraulic shovel. EXAKT reveals poor predictive performance.

Table 1 EX3600 Hydraulic Shovel CBM program
	CBM Method: Oil Analysis
	Failure mode: General Engine Wear
	RULE: 2090 hours
	StdDev: 1445 hours

Table 1 points out low confidence in the remaining useful life estimate. The Standard Deviation of the remaining useful life estimate (RULE) is large. The RULE is also known as the conditional Mean Time to Failure (MTTF). Table 2 shows the results of attempting to build a usable PHM model for the equipment by considering Iron and Lead as candidate CBM condition indicators.

Table 2 Results of the PHM analysis

The Estimate for the Shape parameter, 5.256, is large^[4], indicating that the applied model’s decision will, in essence, be age based. “N” means not significant. (The column heading “Sign.” stands for “Significance”.) Table 2 confirms that the candidate CBM variables have little or no influence on failure probability and therefore cannot be used with confidence as condition indicators.

The statistics in the remaining columns (Standard Error, Wald, DF, p-Value, Exp. Of Estimate, and 95% Confidence Interval) add further to the conviction that the model if used will have poor predictive performance. Figure 3 confirms the inappropriateness of the candidate CBM model for prediction.

: Figure 3 CBM Decision based on cost and probability (left) and probability alone (right)

The green, yellow, and red graph of Figure 3 indicates that the decision to maintain is essentially time based. The values of the composite covariate fall along a straight horizontal line that is constant with age. This says that the proposed monitored variables (Lead and Iron) do not exert influence on the CBM decision.^[5] The line of composite values intercepts the boundary defining a potential failure at 15000 hours. Not surprisingly, this happens to coincide with the normal engine overhaul period (see section “Mistaking Suspension for Failure”). The Conditional Density Function graph on the right of Figure 3 indicates wide scatter (low confidence in the estimate of remaining life). The RULE and standard deviation reported by software in this example are RULE: 2090 hours and StdDev: 1445 hours.
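To put the reported scatter in proportion, the standard deviation can be compared with the estimate itself. The ratio and the one-standard-deviation band below are simple arithmetic on the two reported figures. The band is shown only for scale and is not a formal confidence interval, since the conditional density need not be symmetric.

$$\frac{\text{StdDev}}{\text{RULE}} = \frac{1445}{2090} \approx 0.69$$

$$\text{RULE} \pm \text{StdDev} = 2090 \pm 1445 \;\Rightarrow\; 645 \text{ to } 3535 \text{ hours}$$

A scatter of roughly 69% of the estimate leaves a planner with a window several thousand hours wide, which is of little practical use for scheduling.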

In the PHM model a large value for the shape parameter β means that:

The CBM variables used in the model do not contain much predictive content, and
Unmonitored (yet significant) CBM variables are influencing failure probability and inflating β.

EXAKT is reliability analysis software that differs from Weibull analysis in the way that it relates several dimensions, namely Reliability versus Age versus each monitoring variable (condition indicator) that is significant to the probability of failure. The Weibull model, by contrast, is a two-dimensional (age-reliability) model.
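The difference between the two models can be written out in general form, using the same β, η, and γ_i as in the Figure 2 example. The notation $Z_i(t)$ for the value of the i-th monitored variable at working age t is an assumption introduced here for illustration.

$$h_{\text{Weibull}}(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1}$$

$$h_{\text{PHM}}(t, Z(t)) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1} e^{\sum_i \gamma_i Z_i(t)}$$

When every γ_i is zero the exponential factor equals 1 and the PHM reduces to the Weibull model, which is why non-significant condition indicators leave a decision that depends on age alone.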

Two dimensional analysis (when applied in maintenance) usually does not have to worry excessively about the quality of the data sample because the results are so general as to be, often, inapplicable in the day-to-day maintenance context. This is because the Age dimension is a mixture. Its influence derives from all those unknown yet significant variables that are not used or not monitored in the CBM program. Since they are not explicitly part of the CBM process, they have no way to exert their influence on the model except through the age dimension. Consider an example where the accumulated damage to a bearing is highly related to the number of times that it has been subjected to excessive stress due to some overload condition. Suppose that these incidents have not been recorded or, if they have been recorded in the Distributed Control System (DCS), that they have been ignored as condition indicators. Then the reliability (PHM) equation derived from the CMMS records of failure incidents and from the (less influential) monitored variables (for example oil analysis) will exhibit a higher shape factor than it would if the historical values of the more significant variable were also included in the analysis.

If the analysis is based on a simple Weibull model, the significant factors influencing a specific failure mode are further diluted in the age dimension causing high scatter (quantified as a high standard deviation indicating a widely spread out probability density curve). That is to say, the confidence in decisions derived from the two-dimensional model will be low, age being the sole, but usually insufficient, decision criterion. There is little that can be done to improve the two-dimensional model’s poor predictive capability because age alone is usually too general a decision determining factor. Figure 4 illustrates the ability of added significant dimensions to improve the predictive confidence in a reliability model.

: Figure 4 Two dimensional vs. multiple dimensional analysis

In the two dimensional (back) plane of the graph of Figure 4, the reliability (in the form of probability density) is plotted against age. (The corresponding failure rate h(t) form of the model is also shown.) The result is a wide probability distribution as is commonly seen in maintenance. Next we introduce a third dimension, Iron, for example. Assume that a high value of Iron (Fe), say 100 ppm, dissolved in the lubrication oil of a bearing or engine, is closely related to the item’s (conditional) failure probability. If we were to re-plot the probability density curve, but only for those life cycles whose endings are preceded by CBM values for Iron greater than 100 ppm, we would get the narrow, low scatter, “confident” predictive distribution illustrated in the foreground of Figure 4. The less influential that Iron is to failure probability the wider will be our probability distribution, imparting lower confidence to our CBM decisions.

PHM, as opposed to simple Weibull, requires good failure mode, failure, and suspension^[6] discrimination in the work order reporting system because its results are intended to provide practical, unambiguous decision support on a day-to-day basis. The results of decisions made from the EXAKT model are measurable through the LRCM process. LRCM requires that the Potential Failure (PF) detected by the CBM model (be it an EXAKT model, a statistical quality control alert, or a fixed alarm level) be confirmed or refuted upon execution of the work resulting from the model’s recommendation. The EXAKT model assesses its own performance based on “as-found” (i.e. Event Type = PF, Functional Failure (FF), or Suspension (S)) work order information. The model is upgraded as more events and better reporting procedures (as specified by the LRCM process) generate accurate links between work orders and RCM knowledge. Work order reports include each failure mode instance’s Event Type attribute (PF, FF, or S).

The solution

There are two possible reasons for the unsatisfactory performance of a CBM decision model.

The condition monitoring variables that are available to the CBM program intrinsically bear little or no relationship to the actual failure modes that occur in the fleet. Or,
The data sample used to build the predictive model does not distinguish between Failure and Suspension.

Eliminating both these causes of poor CBM performance requires a simple but novel “people-centered” approach to the management of information in maintenance. This approach is described in the following subsections.

Work order free text

What exactly is the purpose of free text on the work order? A work order, particularly a “corrective” work order following a failure, contains an assemblage of facts from the technician’s observations that we classify in two groups:

What I found
1. Failures: Doesn’t cool, Doesn’t heat, Doesn’t steer easily, Doesn’t protect, Doesn’t …
2. Failure modes consisting of three grammatical phrases:
  1. A part that failed
  2. An action phrase implying some physical change of state in the part (dirty, out of adjustment, fell off, …).
  3. A “due to” (rust, corrosion, dirt, fatigue …) clause.

(2 and 3 above are optional and included when the consequences of failure justify the added detail. If not included, analysis is performed simply on the part that failed due to any reason.)

What I did
1. I renewed (adjusted, cleaned, replaced, repaired, ….)

The combination of “What I did” and “What I found” provides the Reliability Engineer with the information needed to generate a sample to be analyzed. There are three possible cases that motivate the renewal of a part (or a failure mode). They must be clearly distinguished in the text as FF, S, or PF. If:

The failure mode caused a failure → Then report a Functional Failure FF
The part did not fail but was renewed anyway → Then report a Suspension S
The part was about to fail → Then report a Potential failure PF
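The three reporting cases above can be expressed as a small decision function. This is a minimal sketch with hypothetical names (`EventType`, `classify_event`); the two boolean inputs stand for the technician's as-found judgment, which in practice depends on an agreed standard for failure and potential failure.

```python
from enum import Enum


class EventType(Enum):
    FF = "Functional Failure"
    PF = "Potential Failure"
    S = "Suspension"


def classify_event(caused_failure: bool, about_to_fail: bool) -> EventType:
    """Map the as-found observation to the Event Type reported on the work order."""
    if caused_failure:
        return EventType.FF   # the failure mode caused a failure
    if about_to_fail:
        return EventType.PF   # the part was about to fail
    return EventType.S        # the part did not fail but was renewed anyway
```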

Given a tradesman’s pride in a job well done, “What I did” is often well explained in the free text comments. Facts regarding “What I found”, on the other hand, are sparse, incidental, less detailed, and considered less relevant. We are told that the part was replaced but not whether or not it had failed. From the reliability analyst’s point of view, both are significant. “What I did” represents failure mode life beginnings. “What I found” generally describes failure mode life endings either by failure or by suspension. This perspective offers untapped possibilities to our efforts to achieve reliability from the analysis of data. Once we recognize free text’s dual purpose (of describing observation as well as action) a number of desirable changes in organizational behavior occur spontaneously:

The work order free text will become simpler, more structured, and easier for the Reliability Engineer to work with.
The RCM knowledge base will, in a “living” process at work order closure, document failure modes more accurately. In particular, the RCM Effects will contain the relevant details covering what can happen when the failure mode occurs.
Users will be encouraged to display the RCM knowledge record simultaneously with the work order, to ensure that the work order is truly an instance of the RCM failure mode.
At the same time they will make sure that the referenced RCM knowledge accurately describes the current situation. If not, they will provide suggestions, on the spot, in the free text commentary, for improvement in the light of the experience gained during the execution of the current work order.
Technicians, planners, and supervisors will tend to refer to the RCM knowledge base not only when closing work orders but also when planning them.
As the RCM knowledge base grows, particularly in detail and accuracy of the Effects, less free text will be required on the actual work order. Why repeat information that is already well described in the referenced RCM record?
The reliability engineer will more easily verify work orders and their RCM linkages.
He will make improvements in the knowledge base by adding new records and by updating the Effects as new information appears in the work order free text commentary.
He will generate more and better samples for analysis.
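The idea that each renewal ends one failure mode life and begins the next can be sketched as follows. The record layout, field names, and unit identifiers are hypothetical, and the sketch assumes that the first life of each unit starts at a working age of zero. It illustrates how verified work order links might be grouped into a sample and does not describe the LRCM software itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkedWorkOrder:
    unit: str            # equipment identifier
    failure_mode: str    # linked RCM failure mode record
    working_age: float   # cumulative working hours of the unit at renewal
    event_type: str      # "FF", "PF", or "S" as reported at work order closure


def build_sample(work_orders, failure_mode):
    """Return (unit, life duration, ending event type) for one failure mode."""
    by_unit = defaultdict(list)
    for wo in work_orders:
        if wo.failure_mode == failure_mode:
            by_unit[wo.unit].append(wo)

    sample = []
    for unit, orders in sorted(by_unit.items()):
        start = 0.0  # assumption: the first life begins at working age zero
        for wo in sorted(orders, key=lambda w: w.working_age):
            sample.append((unit, wo.working_age - start, wo.event_type))
            start = wo.working_age  # this renewal begins the next life
    return sample


# Hypothetical records for illustration only.
orders = [
    LinkedWorkOrder("SH01", "engine general wear", 21000.0, "S"),
    LinkedWorkOrder("SH01", "engine general wear", 9000.0, "FF"),
    LinkedWorkOrder("SH02", "engine general wear", 12000.0, "S"),
    LinkedWorkOrder("SH02", "hose leak", 3000.0, "PF"),
]
sample = build_sample(orders, "engine general wear")
```

Each tuple in `sample` is one life cycle with its ending event type, which is the form of input that a Weibull or PHM fit needs.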

Mistaking Suspension for Failure

Here is a common scenario. Assume engine overhauls occur at 12000-hour intervals. The teardown report indicates the discovery of several failure modes in a potential failure state. However, a clear standard or definition is not used to make the potential failure versus suspension determination. A subcomponent or part may still have 5000 hours of useful life left in it. Nevertheless, due to some degree of material wear, the failure mode is judged to have potentially failed rather than, more correctly, to have been suspended.

What are we telling the model and what is the model telling us? We are telling the model (by misreporting suspensions as potential failures) that failures tend to occur at a fixed time, which happens to be the time of our scheduled overhaul. And the model reports back to us, each time it is executed, the very same thing – that failure is age dependent. This defeats the objective of CBM.

Misreporting suspensions as failures (or potential failures) will weaken the model in two ways:

It will inflate the shape parameter thereby causing the decisions to be predominantly age based, regardless of intrinsically good (predictive) CBM condition indicators. And,
It will increase the scatter and consequently reduce confidence in prediction. This is illustrated in Figure 5.

: Figure 5 Confidence in CBM is reduced when suspensions are mistaken for failures
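The first of these effects can be reproduced on synthetic data. The sketch below draws lives from a Weibull distribution with a hypothetical true shape of 1.2 and scale of 20000 hours, cuts every life off at a 12000-hour overhaul, and fits a Weibull model twice: once with the overhauled lives correctly reported as suspensions, and once with every ending reported as a failure. The fitting routine is a plain maximum likelihood estimate solved by bisection, written for illustration; it is not the algorithm used by EXAKT.

```python
import math
import random


def fit_weibull(times, failed):
    """Maximum likelihood Weibull (shape, scale) with suspended lives.

    times:  life durations
    failed: True where the life ended in failure, False where it was suspended
    """
    scale = max(times)
    x = [t / scale for t in times]   # normalise so x**beta cannot overflow
    n_fail = sum(failed)
    mean_log_fail = sum(math.log(xi) for xi, f in zip(x, failed) if f) / n_fail

    def score(beta):
        # Profile likelihood equation for the shape; it increases with beta.
        powered = [xi ** beta for xi in x]
        weighted_log = sum(p * math.log(xi) for p, xi in zip(powered, x)) / sum(powered)
        return weighted_log - 1.0 / beta - mean_log_fail

    lo, hi = 0.01, 50.0
    for _ in range(100):             # bisection on the shape parameter
        mid = (lo + hi) / 2.0
        if score(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    beta = (lo + hi) / 2.0
    eta = scale * (sum(xi ** beta for xi in x) / n_fail) ** (1.0 / beta)
    return beta, eta


random.seed(42)
OVERHAUL = 12000.0                   # scheduled overhaul interval from the scenario
# Hypothetical true lives: scale 20000 hours, shape 1.2 (synthetic data).
true_lives = [random.weibullvariate(20000.0, 1.2) for _ in range(500)]
times = [min(t, OVERHAUL) for t in true_lives]
is_failure = [t < OVERHAUL for t in true_lives]

beta_ok, eta_ok = fit_weibull(times, is_failure)             # suspensions reported correctly
beta_mis, eta_mis = fit_weibull(times, [True] * len(times))  # suspensions reported as failures
```

In this sketch `beta_mis` comes out larger than `beta_ok`: the cluster of lives ending exactly at the overhaul age is read by the model as evidence of age-related failure.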

This point raises a subject that RCM stresses as one of prime importance. What shall be the “standard” used to declare failure? Different people have different standards and this leads inevitably to the obscuring of company objectives. To the operator the standard may be a “large leak”. To the technician it may be a “small leak”. To the reliability engineer it may be a “damaged hose – impending leak”. Who decides? Although here it sounds obvious, this is no trivial decision when viewed at the scale of maintenance and production. There are many failure modes to deal with. And they come in all shades of gray and with competing priorities.

Without a systematic “living” RCM knowledge base tightly integrated into the routine work order process, multiple opinions “rule”. The solution is not simple. It requires cross-level on-going dialog in order to get consensus on standards for failure and potential failure. The living RCM knowledge base (particularly the Effects) documents that dialog.

Motivation, leadership, and Training

An LRCM project implementation succeeds based on a realization that personnel respond best to the intangibles – recognition, empowerment, and interest by management in their activities. These characteristics of leadership are described in the following list:

Empowerment. LRCM solicits and encourages updating of the knowledge base by those who are directly impacted by the maintenance plan – technicians, supervisors, planners, and engineers. It is their knowledge that LRCM seeks, continuously, which will impact the maintenance plan.
Recognition. The LRCM process highlights those performance indicators over which personnel have direct control. It recognizes the knowledge contributed by the team and/or individual employees.
Visible interest. In the LRCM methodology, the manager asks, at least weekly, about those “low level” KPIs (see next section) that will ultimately drive bottom line maintenance performance.
Training. The LRCM process depends on a thorough understanding of RCM concepts, such as Function, Failure, Cause, Effects, Consequences, Potential failure, Suspension, and the Amount of detail justified by the consequences of failure. LRCM recognizes that Training extends beyond the classroom to the shop and the field. Meetings and informal discussions regarding issues encountered day to day are carried out to reinforce these subjects.

Performance metrics

Performance metrics should point us precisely to what we need to improve currently in our maintenance process. That is, they should trigger a control action. Subsequently they should confirm and measure the extent to which the control action had the desired effect.

Often, performance metrics do not indicate what management action must be taken to achieve an organizational objective. Cascading metrics (high to low, or “lagging” to “leading”) will solve this problem and will, therefore, drive the continuous improvement process.

High level (lagging) KPIs provide, at various levels of granularity, such measures as:
1. MTTF, MTTR, Availability
2. Costs, and
3. Yield.
Low level (leading) KPIs in maintenance should measure such indicators as:
1. RCM knowledge added,
2. The number of verified links between RCM knowledge and work orders,
3. The number of analyses performed and recommendations issued,
4. CBM performance such as the standard deviation in remaining useful life estimation, the influence of current CBM variables as reported by the PHM shape parameter, and CBM hit and miss ratios (detection confidence).

Managers must recognize the important differences between high and low level KPIs. Basically, that:

Employees have no direct influence on high level KPIs. Thus, throwing these performance results back at them will elicit little interest or engagement.
Low level KPIs, by definition, respond directly to team performance. They relate to day-to-day activities and actions as they should be specified in personnel job descriptions.

Here is the key point: It is management’s job to set low level objectives (i.e. KPI values) that:

Employees can influence by the way they perform their duties, and that
Support the high level organizational targets.

Managers exercise creativity in setting low level objectives. They must then build their skills in understanding the lags and complexities between the achievement of low level KPIs and the eventual resulting high level (organizational) metrics.

Conclusion

Maintenance managers improve performance in two ways, described roughly as:

Technology centric, and
People centric.

Technology

The technology centric approach relies on two main types of information systems:

Automated condition monitoring, testing, and diagnostics, and
Work management systems (the CMMS).

Automated condition monitoring and diagnostic systems provide infrastructure and contain logic to interpret equipment and process data correctly. Built-in logic and associated corrective procedures should allow maintainers and managers to make the right decisions, thereby controlling the maintenance process effectively. Work management systems aim, in their turn, to simplify maintenance by organizing maintenance tasks.

Managers shift their attention periodically between these two technology approaches. Technology projects of both types tend to be large. They extend over substantial time periods, often exceeding their budget and schedule. The organization usually must defer tangible expectations from these projects until after their completion, and sometimes indefinitely.

People

Human oriented improvement philosophies, by contrast, seek more immediate, measurable benefits. They encourage small pilot projects in which a team takes on the task of proving or disproving a hypothesis. A successful pilot reveals the risks in scaling the new methodology to wider use within the organization. People centric changes are more immediately effective, gain more internal momentum, and are less expensive. Moreover, they enhance the effectiveness of adjacent technology projects. Living RCM (LRCM) is primarily a people oriented initiative.

Most maintenance improvement projects conducted over the last four decades were of the large technology centric type. A new, more balanced, attitude towards maintenance improvement has taken hold in vanguard organizations. These companies include in their plans, projects that rely on the collaboration of people, placing less emphasis on technology.

References

The Elusive P-F Interval https://www.livingreliability.com/en/posts/the-elusive-p-f-interval/
Repairable system reliability: recent developments in CBM optimization, A.K.S. Jardine, D. Banjevic, N. Montgomery, A. Pak, Department of Mechanical and Industrial Engineering, University of Toronto, Canada
Jardine A.K.S., Banjevic D., Wiseman M., Buck S., Joseph T., Optimizing a mine haul truck wheel motors’ condition monitoring program, JQME, Vol. 7, pp. 286–301, 2001.
Lin D., Wiseman M., Banjevic D., Jardine A.K.S., An approach to signal processing and condition-based maintenance for gearboxes subject to tooth failure, Mechanical Systems and Signal Processing, Vol. 18, pp. 993–1007, 2004.

^[1]Over 1000 maintenance personnel, 240 haul trucks, 40 shovels, loaders, graders, and dozers, a railway, and a ship loading facility.
^[2]The process for this prediction is known as the Markov failure time model.
^[3]In EXAKT and most reliability analysis methods risk is calculated as the product of failure probability and the “penalty” cost of a failure over that of a preventative action.
^[4]Values of the Shape Parameter less than 3.5 would indicate that the CBM variables can be valuable as condition indicators. This would be subject to confirmation by the results of the model’s execution shown, for example, in Figure 3.
^[5]Thus it is possible to go through the “motions” of CBM for years not realizing that the indicators being measured are not reflective of the failure modes that actually occur.
^[6]A suspension is a renewal of a component, part, or failure mode for any reason other than failure.