The 5 whys (why-why analysis) is a method for finding the root cause of a
problem. The name of the method comes from the main question asked during the
analysis: Why? By asking that question many times it is possible to find the real,
often hidden causes of the problem. Dealing with the real causes prevents the problem
from happening again.
The method is an evolution of the fishbone diagram created by Kaoru Ishikawa. Why-why
analysis is used mainly in Six Sigma, but it can be applied in any organization. The
idea of the 5 whys comes from the Toyota Production System, where Taiichi Ohno taught
that to find the real causes you need to ask why five times.
Usually the best tool for the analysis is a large blackboard or flipchart. However, the
analysis can also be performed with a computer, a projector, and a mind-mapping
application, such as the free FreeMind.
1. Start with the problem you'd like to solve. Ask "Why does the ... take place?"
That's the first why. Write this question in the centre of the blackboard/flipchart.
2. Write down the answers to the question, each answer on its own line coming out
from the main question. It is convenient to draw each line in a slightly different
direction.
3. For each of the answers ask again "Why does the ... take place?" and write
down the next-level answers on new lines coming out from the first-level ones. This
way a kind of net is created.
4. Repeat the same at each deeper level until you reach the 5th level, then stop.
5. When you reach the 5th level you usually have most of the causes identified,
including the root ones. They are not always on the 5th level; some of them can
emerge at earlier levels. For each root cause, identify potential actions that reduce
the probability of its occurrence or mitigate its effects. (A small sketch of the
resulting net follows this list.)
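For teams that prefer the computer over the flipchart, the net built by the steps
above can be modelled as a simple tree. Below is a minimal Python sketch; the
example problem and all of its causes are hypothetical, invented purely to show
the shape of a five-level net.

    # A why-why net modelled as a nested dict: each key is a statement, each
    # value holds the answers found by asking "Why does ... take place?" one
    # level deeper. The problem and causes are hypothetical, for illustration.
    net = {
        "The machine stops unexpectedly": {                  # the problem
            "The fuse blew": {                               # why 1
                "The motor overloaded": {                    # why 2
                    "The bearing was not lubricated": {      # why 3
                        "The lubrication pump is clogged": { # why 4
                            "There is no pump maintenance schedule": {},  # why 5: a root cause
                        }
                    }
                }
            }
        }
    }

    def print_net(node, depth=0):
        """Print each statement indented by how many whys led to it."""
        for statement, causes in node.items():
            label = "problem" if depth == 0 else f"why {depth}"
            print("  " * depth + f"[{label}] {statement}")
            print_net(causes, depth + 1)

    print_net(net)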
• Involve the right people. They should be familiar with the problem and the process,
and ideally come from different departments. There shouldn't be superiors and
subordinates in the same group. Sometimes it's good to add a few people who don't
know the process or the problem; they can offer a fresh perspective.
• Avoid blaming for problems - the aim is not to find someone guilty, but to solve
the problem. Immediately stop any argument that leads to blaming someone.
• Get creative - use brainstorming to engage people and their creativity. Try some
icebreakers for starters.
Technically speaking, a failure is declared when the system does not meet its
desired objectives. When it comes to IT systems, including disk storage, this generally
means an outage or downtime. But I have experienced situations where the system
was running so slowly that it should be considered failed even though it was
technically still “up.” Therefore, I consider any system that cannot meet minimum
performance or availability requirements to be “failed.”
The first metric that we should understand is the time that a system is not failed, or is
available. Often referred to as “uptime” in the IT industry, the length of time that a
system is online between outages or failures can be thought of as the “time to
failure” for that system.
For example, if I bring my RAID array online on Monday at noon and the system
functions normally until a disk failure Friday at noon, it was “available” for exactly 96
hours. If this happens every week, with repairs lasting from Friday noon until Monday
noon, I could average these numbers to reach a “mean time to failure” or “MTTF” of
96 hours. I would probably also call my system vendor and demand that they replace
this horribly unreliable device!
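The arithmetic behind that average is trivial, but worth making explicit. Here is
a minimal Python sketch of it, using a hypothetical failure log that follows the
weekly pattern from the example above.

    from statistics import mean

    # Hypothetical failure log for the example above: the array runs from
    # Monday noon to Friday noon (96 h) before each failure.
    uptimes_hours = [96, 96, 96, 96]

    # MTTF is simply the average of the observed times to failure.
    mttf = mean(uptimes_hours)
    print(f"MTTF = {mttf:.0f} hours")   # -> MTTF = 96 hours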
Note too that “MTTF” often exceeds the expected lifetime or usefulness of a device
by a good margin. A typical hard disk drive might list an MTTF of 1,000,000 hours, or
over 100 years. But no one should expect a given hard disk drive to last this long. In
fact, disk replacement rate is much higher than disk failure rate!
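One way to square a 1,000,000-hour MTTF with real-world lifetimes is to read it as
a failure rate. Assuming a constant failure rate (an exponential model, which is my
assumption here, not a claim from any datasheet), the MTTF converts to an
annualized failure rate like this:

    from math import exp

    mttf = 1_000_000        # hours, as listed on a typical drive datasheet
    hours_per_year = 8766   # 365.25 days

    # Under a constant-failure-rate (exponential) model, the probability
    # that a given drive fails within one year is:
    afr = 1 - exp(-hours_per_year / mttf)
    print(f"Annualized failure rate ~ {afr:.2%}")   # about 0.87%

In other words, the figure describes how often drives in a large population fail
per year, not how long any one drive will last.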
In our example above, our flaky RAID array had an MTTF of 96 hours. This leaves
three days, or 72 hours, to get things operational again. Over time, we would come
to expect a “mean time to repair” or “MTTR” of 72 hours for any typical failure. Again,
we would be justified in complaining to the vendor at this point.
Repairs can be excruciating, but they often do not take anywhere near as long as
this. In fact, most computer systems and devices are wonderfully reliable, with MTTF
measured in months or years. But when things do go wrong, it can often take quite a
while to diagnose, replace, or repair the failure. Even so, MTTR in IT systems tends
to be measured in hours rather than days.
The most common failure-related metric is also the one most often used incorrectly.
“Mean time between failures” or “MTBF” refers to the amount of time that elapses
between one failure and the next. Mathematically, this is the sum of MTTF and MTTR:
the total time required for a device to fail and for that failure to be repaired.
For example, our faulty disk array with an MTTF of 96 hours and an MTTR of 72
hours would have an MTBF of one week, or 168 hours. But many disk drives only fail
once in their life, and most never fail. So manufacturers don’t bother to talk about
MTTR and instead use MTBF as a shorthand for average failure rate over time. In
other words, “MTBF” often reflects the number of drives that fail rather than the rate
at which they fail!
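Putting the three metrics from the example together in a few lines of Python makes
the relationship concrete. The availability formula (MTTF divided by MTBF) is the
standard steady-state definition, added here by me rather than taken from the text
above.

    mttf = 96            # hours of uptime before each failure
    mttr = 72            # hours to repair each failure
    mtbf = mttf + mttr   # time from one failure to the next

    availability = mttf / mtbf
    print(f"MTBF = {mtbf} hours")               # 168 hours, i.e. one week
    print(f"Availability = {availability:.1%}") # about 57.1%

An availability of 57% would indeed justify a very unhappy call to the vendor.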