Fault Diagnosis
Table of Contents
Troubleshooting
Theory of Operation
Intermittent Symptoms
Multiple Failures
Gathering Information
The Five Whys
Example
Clearly State the Problem
Form a Hypothesis
Test the Hypothesis
Observe the Results and Draw Conclusions
Repeat Until You Are Happy With Your Conclusions
Failure Analysis
Root Cause Analysis
RCA Based Corrective Action
Basic Elements of Root Cause
Applied Logic
Sutton’s Law
Occam’s Razor
Hickam's Dictum
Holmesian Deduction
Murphy’s Law
ii Chris McAndrew
T R O U B L E S H O O T I N G
Chapter 1
Troubleshooting
The following is an outline of diagnostic principles which can be applied to virtually any fault-finding situation.
The basic premise is to distil fault finding down to a common set of instructions which can be adapted to suit
the needs of the problem at hand.
Diagnostics is an investigative process which applies logic to test theories through observation and
experimentation.
Solving problems first requires a logical and systematic procedure which allows you to gather the available
information, discard that which is irrelevant, discover other useful facts and draw logical conclusions in order
to arrive at the cause of the problem.
The process of diagnosing faults within electrical systems is commonly referred to as troubleshooting.
Troubleshooting is the systematic search for the source of a problem in order to facilitate rectification.
The fault is normally described as symptoms of a failure and troubleshooting is the process of determining the
causes of these symptoms.
The process of elimination is a basic logical tool used to solve problems. Removing options that are
impossible, illogical, or ruled out by an explicit understanding of the scenario in question shrinks the
pool of remaining possibilities.
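As a minimal sketch of the process of elimination, the Python below starts with a pool of candidate causes and discards any that an observation rules out. The candidate names and observations are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical candidate causes for a dead telephone handset (illustrative only).
candidates = {
    "no power",
    "faulty handset cord",
    "faulty patch cord",
    "switch port disabled",
}

# Each observation lists the causes it is inconsistent with (illustrative data).
observations = [
    ("display is lit", {"no power"}),
    ("handset cord works on another phone", {"faulty handset cord"}),
]


def eliminate(candidates, observations):
    """Return the causes that survive every observation."""
    remaining = set(candidates)
    for _description, ruled_out in observations:
        remaining -= ruled_out  # discard causes this observation rules out
    return remaining


remaining = eliminate(candidates, observations)
print(sorted(remaining))
```

Each observation can only shrink the pool, so the order in which observations are applied does not change the final result.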
One of the core principles of troubleshooting is that reproducible problems can be reliably isolated and
resolved. Considerable effort and emphasis is often placed on reproducibility, that is, on finding a procedure
that reliably induces the symptom.
Once this is done, systematic strategies can be employed to isolate the cause or causes of a problem, and
the resolution generally involves repairing or replacing those components which are at fault.
Efficient methodical troubleshooting starts with a clear understanding of the expected behaviour of the system
and the symptoms being observed. From there the engineer forms hypotheses on potential causes, and devises
tests (or perhaps references a standardised checklist of tests) to eliminate these prospective causes.
Theory of Operation
A theory of operation is a description of how a device or system should work. It should be included in
documentation, especially maintenance documentation, or a user manual. It aids troubleshooting by giving
the engineer a mental model to draw on when diagnosing the problem.
This should not be confused with the undocumented version, which is generally what the customer expected
after the salesman had left!
Intermittent Symptoms
Some of the most difficult troubleshooting issues relate to symptoms that are only intermittent.
In electronics this is often the result of components that are thermally sensitive. The resistance of a circuit
varies with the temperature of the conductors in it and, by Ohm's Law, a change in resistance changes the
currents and voltages in that circuit. Compressed air or freezer spray can be used to cool specific spots on a
circuit board and a heat gun can be used to raise their temperature; thus troubleshooting of electronic
systems frequently entails applying thermal stress in order to reproduce a problem.
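To put rough numbers on the thermal effect, the sketch below uses the common linear approximation R(T) = R_ref * (1 + alpha * (T - T_ref)) together with Ohm's Law. The 100 ohm winding, the 5 V supply and the temperatures are assumptions made for illustration, and the coefficient for copper is an approximate textbook value.

```python
ALPHA_COPPER = 0.0039  # approximate temperature coefficient of copper, per degree C


def resistance_at(temp_c, r_ref_ohms, ref_temp_c=20.0, alpha=ALPHA_COPPER):
    """Linear approximation: R(T) = R_ref * (1 + alpha * (T - T_ref))."""
    return r_ref_ohms * (1.0 + alpha * (temp_c - ref_temp_c))


def current_through(voltage, resistance_ohms):
    """Ohm's Law: I = V / R."""
    return voltage / resistance_ohms


# Hypothetical copper winding measuring 100 ohms at 20 C, fed from a 5 V supply.
r_cold = resistance_at(20.0, r_ref_ohms=100.0)
r_hot = resistance_at(70.0, r_ref_ohms=100.0)
i_cold = current_through(5.0, r_cold)
i_hot = current_through(5.0, r_hot)

print(f"20 C: {r_cold:.1f} ohm, {i_cold * 1000:.1f} mA")
print(f"70 C: {r_hot:.1f} ohm, {i_hot * 1000:.1f} mA")
```

Under these assumptions a 50 degree rise changes the resistance by almost a fifth, which can be enough to push a marginal circuit in or out of tolerance.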
Equally, there is a distinction between frequency of occurrence and a "known procedure to consistently
reproduce" an issue. For example, knowing that an intermittent problem occurs within an hour of a
particular stimulus or event, but that sometimes it happens in five minutes and other times it takes almost an
hour, does not constitute a "known procedure", even if the stimulus does increase the frequency with which
the symptom appears.
Nevertheless, engineers must sometimes resort to statistical methods, and can only find procedures that
increase the symptom's occurrence to a point at which serial substitution or some other technique is feasible.
In such cases, even when the symptom seems to disappear for significantly longer periods, there is only low
confidence that the root cause has been found and that the problem is truly solved.
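One way to quantify that confidence is sketched below. It assumes that, while the fault is present, failures arrive independently at a constant average rate (a Poisson model), which is a simplifying assumption and not something the text prescribes. The rate of one failure every two hours is hypothetical.

```python
import math


def chance_of_quiet_period(failures_per_hour, hours_observed):
    """Probability of seeing no failure by luck alone if the fault is still present.

    Assumes failures arrive independently at a constant average rate (Poisson model).
    """
    return math.exp(-failures_per_hour * hours_observed)


def hours_needed(failures_per_hour, confidence):
    """Failure-free observation time needed to reach the given confidence level."""
    return -math.log(1.0 - confidence) / failures_per_hour


# Hypothetical fault that used to show up about once every two hours.
rate = 0.5
luck_after_two_hours = chance_of_quiet_period(rate, 2.0)
hours_for_95_percent = hours_needed(rate, 0.95)

print(f"Chance that 2 quiet hours are just luck: {luck_after_two_hours:.2f}")
print(f"Quiet hours needed for 95% confidence: {hours_for_95_percent:.1f}")
```

Under this model, two failure-free hours still leave better than a one-in-three chance that nothing was actually fixed, which is why a short quiet period proves very little.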
Multiple Failures
Isolating single-component failures which cause reproducible symptoms is relatively straightforward.
However, many problems only occur as a result of multiple failures or errors. This is particularly true of
fault-tolerant systems, or those with built-in redundancy. Features which add redundancy, fault detection and
failover to a system may also be subject to failure, and sufficient failures in any system will "take it down."
Even in simple systems the engineer must always consider the possibility that there is more than one fault.
Replacing each component in turn by serial substitution, and swapping each new component back out for
the old one when the symptom is found to persist, can fail to resolve such cases. More importantly, the
replacement of any component with a defective one can actually increase the number of problems rather than
eliminate them.
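The small simulation below illustrates why serial substitution with swap-back can miss a double fault. It is a sketch under simple assumptions: the system works only when every part is good, and the part names and fault patterns are hypothetical.

```python
def system_works(installed):
    """The hypothetical system works only if every installed part is good."""
    return all(state == "good" for state in installed.values())


def serial_substitution_with_swap_back(installed):
    """Swap in one known-good part at a time, restoring the old part if the symptom persists.

    Returns the name of the part whose replacement cleared the symptom, or None.
    """
    for name in list(installed):
        original = installed[name]
        installed[name] = "good"        # substitute a known-good spare
        if system_works(installed):
            return name                 # symptom cleared by this substitution
        installed[name] = original      # symptom persists, so swap the old part back
    return None                         # no single substitution cleared the symptom


one_fault = {"handset": "good", "cord": "bad", "switch port": "good"}
two_faults = {"handset": "bad", "cord": "bad", "switch port": "good"}

single_result = serial_substitution_with_swap_back(one_fault)
double_result = serial_substitution_with_swap_back(two_faults)
print(single_result)
print(double_result)
```

With one fault the procedure finds the bad cord. With two faults every individual substitution leaves one bad part in place, so each good spare is swapped back out and the search ends with nothing identified.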
Note that, while we talk about "replacing components", the resolution of many problems involves adjustments
or tuning rather than "replacement." For example, intermittent breaks in conductors, or "dirty or loose
contacts", might simply need to be cleaned and/or tightened.
Gathering Information
You must gather all the information available about the effects of the fault in order to discover its cause.
Ideally that information should come from multiple sources:
1/. Check power lights, status indicators and displays. Do not forget the basics: has it been unplugged, or has
the fuse blown?
3/. Ask users what they are experiencing, but treat this information with care, as most users are non-technical.
Users have also been known to lie, particularly if they believe that they are responsible for the fault.
5/. Check and verify system configurations. Are there any known software issues?
Any system can be described in terms of its components or subsystems. Each subsystem can be described in
terms of its expected behaviour. So the inputs to a system can be described as a cascade of inputs and results
among the components of the system.
For example: handset to curly cord, curly cord to telephone, telephone to CAT5 cable, CAT5 cable to wall port,
wall port to patch cord, patch cord to network switch. Now think of an unplugged handset curly cord and an
unplugged CAT5 to wall port cable. Both are unplugged wires; is the effect the same? Obviously not! But now
you are beginning to visualise the subsystems in their own right.
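The chain above can be written down as an ordered list, which makes the difference between the two unplugged wires easy to see. This is only an illustrative model: it assumes a single linear path from the handset to the network switch, and the function name is hypothetical.

```python
# The chain from the example, ordered from the user's end to the network switch.
CHAIN = [
    "handset",
    "curly cord",
    "telephone",
    "CAT5 cable",
    "wall port",
    "patch cord",
    "network switch",
]


def cut_off_from_switch(chain, unplugged):
    """Return the components that lose their path to the switch when `unplugged` is disconnected."""
    index = chain.index(unplugged)
    return chain[:index]  # everything on the user's side of the break


curly_effect = cut_off_from_switch(CHAIN, "curly cord")
cat5_effect = cut_off_from_switch(CHAIN, "CAT5 cable")
print(curly_effect)
print(cat5_effect)
```

In this model an unplugged curly cord isolates only the handset, while an unplugged CAT5 cable isolates the whole telephone.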
Often, troubleshooting is applied to something which has suddenly stopped working, since its previously
working state forms the expectations about its continued behaviour.
So the initial focus is often on recent changes to the system or to the environment in which it exists.
For example, a handset that "was working when it was plugged in over there". However, there is a well known
principle that correlation does not imply causality.
For example, the failure of a device shortly after it has been plugged into a different outlet doesn't necessarily
mean that the events were related. The failure could have been a matter of coincidence.
It's useful to consider, at this point, the common experiences we have with light bulbs. Light bulbs "blow"
more or less at random; eventually the repeated heating and cooling of the filament, and fluctuations in the
power supplied to it, cause the filament to crack or vaporise. The same principle applies to most other
electronic devices and similar principles apply to mechanical devices. Some failures are part of the normal
wear-and-tear of components in a system.
The Five Whys
The ‘five whys’ is a question-asking method used to explore the cause/effect relationships underlying a
particular problem. Ultimately, the goal of applying the five whys method is to determine a root cause of a
defect or problem.
Example
4. Why? - The alternator belt was well beyond its useful service life and has never been replaced. (4th)
5. Why? - I have not been maintaining my car according to the recommended service schedule. (5th)
The questioning for this example could be taken further to a sixth, seventh, or even greater level. This would
be legitimate, as the "five" in five whys is not gospel; rather, it is postulated that five iterations of asking why
are generally sufficient to get to a root cause. The real key is to encourage the engineer to avoid assumptions and
logic traps and instead to trace the chain of causality in direct increments from the effect through any layers of
abstraction to a root cause that still has some connection to the original problem.
Only once all the available information has been gathered will the engineer be ready to move on to the next
phase.
Clearly State the Problem
This is the process of reviewing all the available information and getting a clear understanding of the perceived
fault.
For example, let’s say that you have a user complaining that they cannot transfer a call to an outside line.
Upon investigation you find that they are pressing this [button image] when they should be pressing this [button image].
Form a Hypothesis
Having collected your information and clearly stated what the problem is, you now need to form your initial
hypothesis.
The best way to do this is in the form of a question which can be proven or disproved.
1/. Extension 123 was working before and it is not working now. At what point in time did it break and what
broke it?
Is it
When starting to form your hypothesis, bear in mind that a basic principle in troubleshooting is to start with
the simplest and most probable problems. This is illustrated by the old saying "When you see
hoof prints, look for horses, not zebras", or to use another maxim, use the KISS principle.
This principle results in the common complaint about help desks or manuals, which sometimes first ask: "Is it
plugged in and is the power turned on?", but this should not be taken as an affront, rather it should serve as a
reminder or conditioning to always check the simple things first.
Test the Hypothesis
Once you have stated the problem and formed a hypothesis you must devise a method to test that hypothesis.
Your testing must enable you to eliminate a single possible cause by changing only one setting at a time.
If you make more than one change at a time you will not know which change affected the symptom, so you will
not be able to eliminate possible causes with any certainty, and it is quite likely that you will never find the
root cause of the actual problem.
An engineer could check each component in a system one by one, substituting known good components for
each potentially suspect one. However, this process of "serial substitution" could be considered wasteful when
components are substituted without regard to a hypothesis concerning how their failure could result in the
symptoms being diagnosed (e.g. there is no power light on, so let’s change the hard drive).
Two common strategies used by engineers are to check for frequently encountered or easily tested conditions
first, and then to split the system in half and test at the midpoint.
An example of the first is checking that a handset's display is on and that its cables are firmly seated at both ends.
An example of the second is checking whether the voice packets which leave a handset also leave the network
switch further down the line.
This latter technique can be particularly efficient in systems with long chains of serialised dependencies or
interactions among their components. It is simply the application of a binary search across the range of
dependencies.
Troubleshooting can also take the form of a systematic checklist, procedure, flowchart or table that is made
before a problem occurs. Developing troubleshooting procedures in advance allows sufficient thought to be
given to the steps to take and to organising them into the most efficient process.
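One possible way to organise such a checklist is sketched below: checks are ordered so that likely causes which are quick to test come first. Ranking by probability per minute is a common heuristic rather than something the text prescribes, and the checks, probabilities and timings are all hypothetical.

```python
# Hypothetical checks: (description, estimated probability of being the cause, minutes to test)
checks = [
    ("replace network switch", 0.05, 60),
    ("is it plugged in and powered on?", 0.30, 1),
    ("reseat handset cables", 0.25, 2),
    ("verify system configuration", 0.20, 15),
]


def order_checklist(checks):
    """Order checks so that likely, cheap tests come first (probability per minute, descending)."""
    return sorted(checks, key=lambda check: check[1] / check[2], reverse=True)


ordered = order_checklist(checks)
for description, probability, minutes in ordered:
    print(f"{description}: p={probability}, {minutes} min")
```

With these figures the one-minute power check comes first and the hour-long switch replacement comes last, which matches the advice to check the simple things first.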
Always remember, when you are testing your hypothesis: if you cannot identify the root cause, how will you
repair it the next time?
Observe the Results and Draw Conclusions
After each test, note whether the change you made did or did not solve the problem, gather any new
information and then draw a conclusion as to whether the problem is solved or whether the change you made
had any effect on the problem at all.
Once you have drawn conclusions you can devise new tests to eliminate other possible causes.
Repeat Until You Are Happy With Your Conclusions
This entire methodology is based upon removing the possible causes of the fault, one at a time, until the root
cause has been identified and eliminated.
Therefore, until the root cause has been identified, go back and start again!
Failure Analysis
Once the root cause has been identified and the fault has been rectified then we move to the final phase.
Failure analysis is the process of collecting and analyzing data to determine the cause of a failure and how to
prevent it from recurring. It is an important discipline and it is a vital tool used in the development of new
products and for the improvement of existing products.
R O O T C A U S E A N A L Y S I S
Chapter 2
Root Cause Analysis
Root cause analysis is not a single, sharply defined methodology; there are many different tools, processes, and
philosophies of RCA in existence. However, most of these can be classed into five very broadly defined
"schools" that are named here by their basic fields of origin: safety-based, production-based, process-based,
failure-based, and systems-based.
• Safety-based RCA descends from the fields of accident analysis and occupational safety and health.
• Production-based RCA has its origins in the field of quality control for industrial manufacturing.
• Process-based RCA is basically a follow-on to production-based RCA, but with a scope that has been
expanded to include business processes.
• Failure-based RCA is rooted in the practice of failure analysis as employed in engineering and
maintenance.
• Systems-based RCA has emerged as an amalgamation of the preceding schools, along with ideas taken
from fields such as change management, risk management, and systems analysis.
Despite the seeming disparity in purpose and definition among the various schools of root cause analysis,
there are some general principles that could be considered as universal. Similarly, it is possible to define a
general process for performing RCA.
RCA Based Corrective Action
2. Gather data/evidence.
3. Ask why and identify the causal relationships associated with the defined problem.
5. Identify effective solutions that prevent recurrence, are within your control, meet your goals and
objectives and do not cause other problems.
Notice that RCA (in steps 3, 4 and 5) forms the most critical part of successful corrective action, because it
directs the corrective action at the root of the problem. That is to say, it is effective solutions we seek, not root
causes. Root causes are secondary to the goal of prevention, and are only revealed after we decide which
solutions to implement.
Basic Elements of Root Cause
• Materials
• Machine/Equipment
• Environment
    Orderly workplace
    Forces of nature
• Management
    Inattention to task
    Stress demands
• Methods
    No or poor procedures
    Poor communication
• Management system
A P P L I E D L O G I C
Chapter 3
Applied Logic
Sutton’s Law
Sutton's law states that in attempting to diagnose a problem, one should first do the experiment that can
confirm the most likely diagnosis. It is taught in medical schools to guide new doctors in ordering tests in a
way that leads to faster treatment, while minimizing unnecessary costs. It is also applicable to other disciplines,
such as debugging computer programs.
A more thorough analysis will consider the false positive rate of the test and the possibility that a less likely
diagnosis might have more serious consequences.
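The effect of the false positive rate can be illustrated with Bayes' theorem. The sketch below computes how believable a diagnosis is after a positive test result; the prior probabilities, sensitivity and false positive rate are hypothetical figures chosen for illustration.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """Probability that the diagnosis is correct given a positive test (Bayes' theorem)."""
    true_positive = sensitivity * prior
    false_positive = false_positive_rate * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


# Hypothetical test: detects the fault 90% of the time, false alarm 20% of the time.
likely = posterior_given_positive(prior=0.60, sensitivity=0.90, false_positive_rate=0.20)
unlikely = posterior_given_positive(prior=0.05, sensitivity=0.90, false_positive_rate=0.20)

print(f"Likely diagnosis after a positive test: {likely:.2f}")
print(f"Unlikely diagnosis after a positive test: {unlikely:.2f}")
```

With these figures a positive result makes the likely diagnosis very probable, but leaves the unlikely one at under one chance in five. The same test is far more informative when it is aimed at the most likely diagnosis, which is the point of Sutton's law.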
The law is named after the bank robber Willie Sutton, who supposedly answered a reporter inquiring why he
robbed banks by saying "because that's where the money is." He later denied saying it, however.
A similar idea is contained in the adage, "When you hear hoof beats, think horses, not zebras."
Occam’s Razor
Occam's razor (sometimes spelled Ockham's razor) is a principle attributed to the 14th-century English
logician and Franciscan friar William of Ockham. The principle states that the explanation of any
phenomenon should make as few assumptions as possible, eliminating those that make no difference in the
observable predictions of the explanatory hypothesis or theory. The principle is often expressed in Latin as the
lex parsimoniae ("law of parsimony" or "law of succinctness"): "entia non sunt multiplicanda praeter
necessitatem", roughly translated as "entities must not be multiplied beyond necessity".
This is often paraphrased as "All other things being equal, the simplest solution is the best." In other words,
when multiple competing theories are equal in other respects, the principle recommends selecting the theory
that introduces the fewest assumptions and postulates the fewest entities. It is in this sense that Occam's razor
is usually understood.
Originally a tenet of the reductionist philosophy of nominalism, it is more often taken today as a heuristic
maxim (rule of thumb) that advises economy, parsimony, or simplicity, especially in scientific theories.
Hickam's Dictum
Hickam's dictum is a counterargument to the use of Occam's razor in the medical profession. The principle is
commonly stated: "Patients can have as many diseases as they damn well please". The principle is attributed to
John Hickam, MD.
Holmesian Deduction
Sherlock Holmes is famous for his intellectual prowess, and is renowned for his skilful use of what he calls
"deductive reasoning", in practice abductive reasoning (inference to the best explanation), together with astute
observation, to solve difficult cases.
As Holmes says in The Sign of the Four, "How often have I said to you that when you have
eliminated the impossible, whatever remains, however improbable, must be the truth?"
Murphy’s Law
If you skip a step, Murphy's Law states that the step you skip is where the problem will lie.