
Why-why analysis is a method of searching for and finding the root causes of a problem. The name of the method comes from the main question asked during the analysis: Why? By asking that question many times it is possible to find the real, often hidden, causes of the problem. Dealing with the real causes prevents the problem from happening again.

The method is an evolution of the fishbone (Ishikawa) diagram created by Kaoru Ishikawa. Why-why analysis is used mainly in Six Sigma, but it can be used in any organization. The idea of the 5 whys comes from Sakichi Toyoda, who said that to find the real causes you need to ask why five times.

Procedure of why-why analysis


The analysis should be performed by a team of people interested in finding a solution to the problem. The team should be diverse, to ensure different points of view, and should consist of no more than 12 people.

Usually the best tool for the analysis is a large blackboard or flipchart. However, the analysis can also be performed using a computer, a projector and a mind-mapping application, such as the free FreeMind.

The procedure is as follows:

1. Start with the problem you'd like to solve. Ask "Why does the ... take place?"
   That is the first why. Write this question in the centre of the blackboard/flipchart.
2. Write down the answers to the question. Write each answer on its own line coming
   from the main question. It is convenient to draw each line in a slightly different
   direction.
3. For each of the answers ask again "Why does the ... take place?" and write down
   the next-level answers on new lines coming from the first-level ones. This way a
   kind of net is created.
4. Repeat the same on the next levels until you reach the 5th level, then stop.
5. By the 5th level you usually have most of the causes identified, including the
   root ones. They are not always on the 5th level; some of them can emerge on
   earlier levels. For each root cause identify potential actions that reduce the
   likelihood of its occurrence or the impact of its effects (a small sketch of this
   questioning loop follows the list).
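As a purely illustrative sketch (not part of the original method, which is normally done on a blackboard rather than in code), the Python fragment below shows the questioning loop as a recursive function that builds a nested map of causes; the function and variable names are hypothetical.

MAX_LEVELS = 5  # the classic "5 whys" depth

def ask_whys(statement, get_answers, level=1):
    """Build a nested dict of causes by repeatedly asking 'Why?'.

    get_answers(question) should return a list of answer strings,
    e.g. collected from the team during the workshop.
    """
    if level > MAX_LEVELS:
        return {}
    question = f"Why does '{statement}' take place?"
    answers = get_answers(question)
    # Each answer becomes a branch that is questioned again on the next level.
    return {answer: ask_whys(answer, get_answers, level + 1) for answer in answers}

# Example with canned answers instead of a live workshop:
canned = {
    "Why does 'lack of money' take place?": ["low income", "high spending"],
    "Why does 'low income' take place?": ["few clients"],
    "Why does 'high spending' take place?": ["unplanned purchases"],
}
print(ask_whys("lack of money", lambda q: canned.get(q, [])))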

Example of why-why analysis


In this example the problem is a lack of money. Three possible causes of it were found, and every cause was analysed further. The example has only two levels due to lack of space; the full analysis should be five levels deep.

Effective why-why analysis


Keep to the following rules:

• Involve the right people. They should be familiar with the problem and the process, and
  ideally come from different departments. There shouldn't be superiors and subordinates
  in one group. Sometimes it's good to add some people who don't know the process or
  problem; they can bring a fresh perspective.
• Avoid blaming for problems - the aim is not to find someone guilty, but to solve the
  problem. Immediately stop any argument that leads to blaming someone.
• Get creative - use brainstorming to unlock people's creativity. Try some icebreakers
  for starters.

Benefits of why-why analysis


• Helps to identify the root causes of the problem and distinguish them from less
important ones.
• Determines relations between causes.
• Doesn't require statistical analysis.
• A very simple and quick tool.

Constraints of why-why analysis


Simplicity is not always a good thing. With sophisticated problems it might prevent the team from finding a solution. Why-why analysis is quick; however, it shouldn't be the only method in a manager's portfolio. The typical problems are:

• Stopping at symptoms rather than the real causes
• Limited knowledge of the team, which can't find additional causes
• Lack of ability to ask the right "why?" questions
• Dependence on team competences - a different team can find different causes
• Stopping at the first root cause when there is a set of root causes
The Bathtub Curve and Product Failure Behavior
Part One - The Bathtub Curve, Infant Mortality and Burn-in
Reliability specialists often describe the lifetime of a population of
products using a graphical representation called the bathtub curve. The
bathtub curve consists of three periods: an infant mortality period with a
decreasing failure rate followed by a normal life period (also known as
"useful life") with a low, relatively constant failure rate and concluding
with a wear-out period that exhibits an increasing failure rate. This
article provides an overview of how infant mortality, normal life failures
and wear-out modes combine to create the overall product failure
distributions. It describes methods to reduce failures at each stage of
product life and shows how burn-in, when appropriate, can significantly
reduce operational failure rate by screening out infant mortality failures.
The material will be presented in two parts. Part One (presented in this
issue) introduces the bathtub curve and covers infant mortality and burn-
in. Part Two (presented in next month's HotWire) will address the
remaining two periods of the bathtub curve: normal life failures and end
of life wear-out.
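To make the idea of the three combining periods concrete, the sketch below (my illustration, with made-up Weibull parameters rather than data from any real product) adds three hazard rates: a decreasing one for infant mortality, a constant one for normal life and an increasing one for wear-out, which together produce a bathtub-shaped total failure rate.

def weibull_hazard(t, shape, scale):
    """Instantaneous failure (hazard) rate of a Weibull distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t):
    # Competing failure modes add their hazard rates together.
    infant = weibull_hazard(t, shape=0.5, scale=200.0)     # decreasing rate
    normal = weibull_hazard(t, shape=1.0, scale=5000.0)    # constant rate
    wear_out = weibull_hazard(t, shape=5.0, scale=8000.0)  # increasing rate
    return infant + normal + wear_out

for t in (10, 100, 1000, 5000, 9000):
    print(f"t = {t:>5} h   hazard = {bathtub_hazard(t):.6f} per hour")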

Figure 1: The Bathtub Curve


Definition of a Failure

I suppose it is wise to begin by considering what exactly qualifies as a “failure.”


Clearly, if the system is down, it has failed. But what about a system running in degraded mode, such as a RAID array that is rebuilding? And what about systems that are intentionally brought offline?

Technically speaking, a failure is declared when the system does not meet its desired objectives. When it comes to IT systems, including disk storage, this generally means an outage or downtime. But I have experienced situations where the system was running so slowly that it should be considered failed even though it was technically still "up." Therefore, I consider any system that cannot meet minimum performance or availability requirements to be "failed."

Similarly, a return to normal operations signals the end of downtime or system failure. Perhaps the system is still in a degraded mode, with some nodes or data protection systems not yet online, but if it is available for normal use I would consider it to be "non-failed."
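As a hedged sketch of that working definition (the threshold values and names below are hypothetical, not taken from the text), a monitoring check could flag a system as failed whenever it misses either its availability or its minimum performance requirement:

# Hypothetical thresholds for illustration; a real SLA would define its own.
MIN_THROUGHPUT_MBPS = 100.0  # minimum acceptable performance

def is_failed(online: bool, throughput_mbps: float) -> bool:
    """Failed = cannot meet minimum availability or performance requirements."""
    if not online:
        return True                                # outright outage
    return throughput_mbps < MIN_THROUGHPUT_MBPS   # "up" but too slow to count

print(is_failed(online=True, throughput_mbps=250.0))   # False: healthy
print(is_failed(online=True, throughput_mbps=20.0))    # True: degraded beyond use
print(is_failed(online=False, throughput_mbps=0.0))    # True: down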

MTBF is the sum of MTTR and MTTF

Mean Time to Failure (MTTF)

The first metric that we should understand is the time that a system is not failed, or is
available. Often referred to as “uptime” in the IT industry, the length of time that a
system is online between outages or failures can be thought of as the “time to
failure” for that system.

For example, if I bring my RAID array online on Monday at noon and the system
functions normally until a disk failure Friday at noon, it was “available” for exactly 96
hours. If this happens every week, with repairs lasting from Friday noon until Monday
noon, I could average these numbers to reach a “mean time to failure” or “MTTF” of
96 hours. I would probably also call my system vendor and demand that they replace
this horribly unreliable device!
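Restating that arithmetic as a hedged sketch (the helper name is mine, and the numbers simply repeat the weekly example above):

# Observed failure-free periods, in hours: online Monday noon, failed Friday noon.
uptimes_hours = [96, 96, 96, 96]

def mean_time_to_failure(uptimes):
    """MTTF = average length of the available (failure-free) periods."""
    return sum(uptimes) / len(uptimes)

print(mean_time_to_failure(uptimes_hours))  # 96.0 hours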

Most systems only occasionally fail, so it is important to think of reliability in statistical terms. Manufacturers often run controlled tests to see how reliable a device is expected to be, and sometimes report these results to buyers. This is a good indication of the reliability of a device, as long as these manufacturer tests are reasonably accurate. Unfortunately, many vendors refer to this metric as "mean time between failures" (MTBF), which is incorrect, as we shall soon see.

Note too that "MTTF" often exceeds the expected lifetime or usefulness of a device by a good margin. A typical hard disk drive might list an MTTF of 1,000,000 hours, or over 100 years. But no one should expect a given hard disk drive to last this long. In fact, real-world disk replacement rates are much higher than the failure rates these MTTF figures suggest!
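A hedged back-of-the-envelope conversion (my addition, assuming the constant failure rate that usually underlies such figures) shows what a 1,000,000-hour MTTF really implies:

HOURS_PER_YEAR = 8766  # 365.25 days

mttf_hours = 1_000_000

# Naive reading: the MTTF expressed in years.
print(mttf_hours / HOURS_PER_YEAR)  # ~114 years

# More useful reading: the expected fraction of a large population that fails
# each year, assuming a constant failure rate over the (much shorter) service life.
annualized_failure_rate = HOURS_PER_YEAR / mttf_hours
print(f"{annualized_failure_rate:.2%} of drives per year")  # ~0.88%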

Mean Time to Repair (MTTR)

Many vendors suppose that repairs are instantaneous or non-existent, but IT professionals know that this is not the case. In fact, I might still be a systems administrator if it wasn't for the fact that I had to spend hours in freezing cold datacenters trying to repair failed systems! The amount of time required to repair a system and bring it back online is the "time to repair", another critical metric.

In our example above, our flaky RAID array had an MTTF of 96 hours. This leaves
three days, or 72 hours, to get things operational again. Over time, we would come
to expect a “mean time to repair” or “MTTR” of 72 hours for any typical failure. Again,
we would be justified in complaining to the vendor at this point.

Repairs can be excruciating, but they often do not take anywhere near as long as
this. In fact, most computer systems and devices are wonderfully reliable, with MTTF
measured in months or years. But when things do go wrong, it can often take quite a
while to diagnose, replace, or repair the failure. Even so, MTTR in IT systems tends
to be measured in hours rather than days.

Mean Time Between Failures (MTBF)

The most common failure-related metric is also the one most often used incorrectly. "Mean time between failures" or "MTBF" refers to the amount of time that elapses between one failure and the next. Mathematically, this is the sum of MTTF and MTTR, the total time required for a device to fail and for that failure to be repaired.

For example, our faulty disk array with an MTTF of 96 hours and an MTTR of 72 hours would have an MTBF of one week, or 168 hours. But many disk drives only fail once in their life, and most never fail. So manufacturers don't bother to talk about MTTR and instead use MTBF as a shorthand for average failure rate over time. In other words, "MTBF" often reflects the number of drives that fail rather than the rate at which they fail!
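Tying the three metrics together for the example numbers above (the availability calculation at the end is my own addition, using the standard MTTF / (MTTF + MTTR) formula, not something stated in the text):

mttf_hours = 96   # Monday noon to Friday noon: time to failure
mttr_hours = 72   # Friday noon to Monday noon: time to repair

# MTBF is the sum of MTTF and MTTR: one full failure-and-repair cycle.
mtbf_hours = mttf_hours + mttr_hours
print(mtbf_hours)  # 168 hours, i.e. exactly one week

# Standard availability formula, added here for context.
availability = mttf_hours / mtbf_hours
print(f"{availability:.1%}")  # ~57.1%, ample justification for calling the vendor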
