Transcript
Slide 1
Welcome to the Data Center University™ course: Analyzing Reliability in the Data Center.
Slide 2
For best viewing results, we recommend that you maximize your browser window now. The screen controls
allow you to navigate through the eLearning experience. Using your browser controls may disrupt the
normal play of the course. Click the attachments link to download supplemental information for this course.
Click the Notes tab to read a transcript of the narration.
Slide 3
At the completion of this course, you will be able to:
Define key terms associated with analyzing reliability risks
Identify some common cause failures in the data center
Describe the benefits of conducting a Probabilistic Risk Assessment (PRA)
Recognize the reliability advantages of utilizing scalable, modular architecture in the data center
Slide 4: Introduction
The growing reliance on information systems that operate 24 hours per day, 7 days per week, has spawned
a rapidly growing and developing industry that supplies products and services on-demand. The need for
these types of information services now reaches into every business office in the world. Unfortunately,
events of all kinds, including hardware failure, human error, environmental changes, structural failure, and
external events, can lead to unanticipated system downtime.
Slide 5: Introduction
Modern data centers do not tolerate planned downtime and strive for no outages in a 10-year mission. Data
center operations staff face a dilemma: either risk downtime as a result of insufficient physical
infrastructure, or incur extensive costs by designing in more redundancy than is necessary. Targeted
reliability solutions allow businesses to meet individual requirements of the data center, while minimizing the
total cost of ownership.
© 2013 Schneider Electric. All rights reserved. All trademarks provided are the property of their respective owners.
In fact, very high reliability is difficult to attain and redundant hardware is only part of the answer. This
course will demonstrate some important performance success factors and give an overview of best practices for
analyzing and optimizing reliability.
Slide 6: Reliability
The concept of reliability has a basis in science and engineering. The concepts are defensible and
falsifiable. Understanding how to best define downtime risk or vulnerability for a particular business is
important to optimizing its reliability, while decreasing total cost of ownership (TCO) and increasing agility.
Reliability and probability of failure, also known as unreliability, are useful tools for data center owners and
designers. Calculation of reliability is more difficult but the methods and tools are well developed.
Slide 7: Reliability
In this course, we will focus on the reliability of two types of power systems: the traditional 3-phase central
UPS and the more recent scalable, modular rack based power system.
Just as fault tolerant IT equipment allows continued data center operation when an IT component fails, fault
tolerant Network Critical Physical Infrastructure (NCPI) equipment allows continued operation of power or
cooling when an NCPI component fails. Fault tolerance can be accomplished by applying appropriate
redundancy of NCPI units, or by internal redundancy of components within NCPI units – for example, by
having extra power modules in a UPS.
How are high levels of reliability achieved today? A company that is very concerned about reliability may
spend millions of dollars to blanket the entire IT infrastructure with elaborate highly redundant schemes that
are not flexible even though a portion of the equipment is not critical to business operations. It is important
to note that redundant hardware is only a starting point. Redundant components can be costly and, when
used inappropriately, are not a best practice.
Slide 8: Redundancy
Fundamental questions exist regarding redundant designs. While redundancy can in principle increase
reliability by allowing individual components or subassemblies to fail without causing the system to fail, there
are significant costs and potentially serious drawbacks. A redundant system has more components, and in
general, systems with more components will experience more failures. For example, twin-engine airplanes
experience roughly twice as many engine failures per hour of operation as comparable single-engine
airplanes.
There must be very reliable mechanisms in place to identify the failed component and isolate it from the
system, or the benefits of redundancy are lost, and the number of component failures is increased. Some
failure modes can affect multiple components simultaneously. Such common cause failures place a
significant limit on the benefits of redundancy. Design defects, manufacturing defects, defects introduced
during installation, maintenance, or repair, all can result in failure modes where multiple, supposedly
independent units fail, often causing the entire system to fail despite the redundant design. Catastrophic
failures of some components can damage connected or nearby equipment and cause system failure despite
redundant design.
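The trade-off described above can be sketched numerically. The example below is a minimal illustration, not part of the course material: it uses the beta-factor model, a common simplification in which an assumed fraction `beta` of each unit's failure probability is treated as common cause. The function names and all numeric values are hypothetical.

```python
def expected_component_failures(n_units: int, p_unit: float) -> float:
    """Expected number of unit failures in a period, for n units that
    each fail with probability p_unit."""
    return n_units * p_unit


def redundant_pair_failure(p_unit: float, beta: float) -> float:
    """Approximate failure probability of a redundant pair in which
    either unit alone can carry the load (beta-factor model).

    A fraction beta of each unit's failure probability is assumed to be
    common cause (it takes out both units at once); the remainder is
    treated as independent.
    """
    independent = ((1.0 - beta) * p_unit) ** 2  # both units fail independently
    common_cause = beta * p_unit                # one shared cause fails both
    return independent + common_cause


p = 0.01  # hypothetical per-unit failure probability over the period

# Twice the units means twice the expected component failures.
print(expected_component_failures(1, p), expected_component_failures(2, p))

# System failure probability of the pair for several assumed beta values.
for beta in (0.0, 0.05, 0.10):
    print(beta, redundant_pair_failure(p, beta))
```

Under these assumptions, even a small common cause fraction dominates the result, which is why common cause failures limit the benefit of adding more redundant units.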
To fully understand reliability, we must first educate ourselves on typical data center common cause failures.
Let's talk about those now.
Now that we’ve identified some common causes of failure in a data center, let’s discuss the importance of
the reliability terms “Mean Time Between Failure” and “Mean Time To Recover”, and how best to
define the word "failure".
Slide 11: Reliability and Mean Time Between Failure
Mean Time Between Failure is a reliability term, largely based on assumptions and a definition of failure.
MTBF, or Mean Time Between Failure, is a basic measure of a system’s reliability. It is typically represented
in units of hours. The higher the MTBF number is, the higher the reliability of the product.
These questions should be asked immediately upon reviewing any MTBF value. Without the answers to
these questions, the discussion holds little value. MTBF is often quoted without providing a definition of
failure. This practice is not only misleading but completely useless. A similar practice would be to advertise
the fuel efficiency of an automobile as “miles or kilometers per tank” without defining the capacity of the tank
in gallons or liters.
To address this ambiguity, one could argue there are two basic definitions of a failure:
The termination of the ability of the product as a whole to perform its required function
The termination of the ability of any individual component to perform its required function but not
the termination of the ability of the product as a whole to perform
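To see how much the definition of failure matters, consider the following minimal sketch. It computes an observed MTBF from the same event log under each of the two definitions above. The event log, the operating hours, and the names `FailureEvent` and `mtbf` are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureEvent:
    hour: float        # operating hour at which the event occurred
    system_down: bool  # True if the product as a whole stopped performing


def mtbf(total_hours: float, events: list, definition: int) -> float:
    """Observed MTBF = operating hours / number of counted failures.

    definition 1: count only failures of the product as a whole.
    definition 2: count only component failures that did not take the
                  product as a whole down.
    """
    if definition == 1:
        counted = [e for e in events if e.system_down]
    else:
        counted = [e for e in events if not e.system_down]
    if not counted:
        return float("inf")  # no failures observed under this definition
    return total_hours / len(counted)


# Hypothetical log: four events, one of which dropped the load.
log = [
    FailureEvent(8_000, False),
    FailureEvent(21_000, False),
    FailureEvent(47_000, True),
    FailureEvent(80_000, False),
]
hours = 100_000.0

print(mtbf(hours, log, definition=1))
print(mtbf(hours, log, definition=2))
```

The same equipment and the same history yield very different MTBF figures, so a quoted MTBF is only meaningful together with its definition of failure.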
Swappable modules, like those found in some power backup systems, help reduce MTTR. Equipment with an
understandable architecture and design enhances manageability and reduces human error. Hot-swappable
modules can be replaced without downtime.
A failed modular component can be quickly swapped out for replacement, so recovery is not delayed while
waiting for repair. Standardization makes systems easier to understand and operate, makes diagnosis of
problems faster and increases the potential for correction by the user.
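The effect of a shorter MTTR can be illustrated with the standard steady-state availability relationship, availability = MTBF / (MTBF + MTTR). The sketch below is illustrative only; the MTBF and MTTR values are hypothetical and are not taken from the course.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per year (8,760 hours) for a given availability."""
    return (1.0 - avail) * 8760 * 60


mtbf_hours = 100_000.0  # hypothetical MTBF, identical in both cases

# Hypothetical recovery times: waiting for a repair versus swapping a module.
for label, mttr in (("repair on site", 8.0), ("hot-swap module", 0.5)):
    a = availability(mtbf_hours, mttr)
    print(f"{label}: availability={a:.6f}, "
          f"downtime={downtime_minutes_per_year(a):.1f} min/yr")
```

With the MTBF held constant, reducing the recovery time is the only lever in this formula, and it reduces expected downtime proportionally.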
Slide 13: Modularity and Component Count
Component reliability can be increased through standardization. Modularity is a powerful concept when it
comes to reliability. Unless classic reliability analysis is updated to include it, modularity’s substantial
advantage to users risks being misunderstood. Modularizing a system can, in some cases, increase the
number of internal components – for example, large UPS capacity modularized into a bank of smaller power
modules will increase the number of electrical components and connectors. To be valid, reliability analysis
of modular systems must consider component design, function, and dependencies, and not just rely on
simple multiplication of parts.
Swappable modules can be removed for factory service, enabling continuous quality improvement
in which defects are diagnosed at the factory and engineered out as they are discovered. (This
process is called “reliability growth” in systems analysis.)
Modules are manufactured in much greater quantity than a larger non-modular system, increasing
even further the quality improvements already inherent in mass production.
The generally smaller size of modules (compared to non-modular design) tends to mean less
manual work during manufacture.
Modular design allows for the considerable reliability advantage of fault tolerance – redundant
modules operating in parallel, allowing for individual module failure without affecting overall system
performance.
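The difference between simple multiplication of parts and an analysis that accounts for redundancy can be sketched as follows. This is a minimal example that assumes identical, independent modules; the module reliability value and function names are hypothetical.

```python
from math import comb


def series_reliability(r_part: float, n_parts: int) -> float:
    """All n parts must work: the 'simple multiplication of parts' view."""
    return r_part ** n_parts


def k_of_n_reliability(r_module: float, k: int, n: int) -> float:
    """At least k of n identical, independent modules must work."""
    return sum(
        comb(n, i) * r_module**i * (1.0 - r_module) ** (n - i)
        for i in range(k, n + 1)
    )


r = 0.99  # hypothetical reliability of one power module over the mission

print(series_reliability(r, 5))     # 5 modules, all 5 required
print(k_of_n_reliability(r, 4, 5))  # 5 modules, any 4 sufficient (N+1)
```

Under these assumptions, the same five modules look less reliable than a single module when treated as a simple parts count, but more reliable than a single module when one of them is redundant. Common cause and catastrophic failure modes, which this sketch ignores, would reduce that advantage.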
The benefits are:
Verifies that component is operable
Reveals failures that can be repaired
Maintains operability
Maintains testing skills
There is also some potential harm that must be taken into account:
Removal from service can result in complete component unavailability
Wear and tear due to testing
Introduction of new defects (e.g. damage during inspection, fuel depletion)
Acceleration of dependent failures (e.g. vibration)
Damage or degradation of component via incorrect restoration to service (e.g. resetting, realigning,
bad documentation)
Human error can cause the wrong component to be removed from service
In most UPSs, utility AC power is rectified to DC. The DC bus connects the rectifier to the battery (which is
typically composed of multiple series and parallel strings) and to the inverter. The inverter synthesizes an
AC voltage free from the effects of spikes, sags, harmonics, and brief utility outages.
So how do we begin to assess the reliability of a particular choice? Let’s discuss our options.
Slide 17: Assessing Reliability
One way to determine reliability is to introduce a new product to customers and then observe the number of
failures. This approach has many drawbacks. First, the customer becomes the subject of an experiment.
Second, since even a poorly designed or manufactured unit might not fail very often, it may take months or,
more probably, years of observation before statistically significant differences could be demonstrated. Third,
achieving reliability in critical systems (e.g. airplanes, anti-lock brakes, and telephone switches) requires
observation of large fleets of essentially identical components over long periods.
Also, the present UPS marketplace has evolved to include a substantial number of uniquely designed data
centers. Since each of these data centers is unique, the UPSs within those data centers are exposed to
unique operating environments and management practices. UPS vendors have naturally responded with an
ever-growing array of custom and customizable solutions that can meet any conceivable design
specification for the next custom-built data center.
fail. Mathematical analysis quickly showed that, in a highly interconnected system such as a rocket or a data
center, the old adage that a chain is only as strong as its weakest link is no longer true. The chain comes to
resemble a net, one with many weak links and undiscovered threads linking one area to another. Failures in
one part of the net place new and different stresses on other parts, which are then more likely to fail. The
result is an environment where even minor upsets start a series of cascading failures that end with complete
failure of the system.
The results have been gratifying. Not only have there been no more incidents resembling TMI, but the fleet
of 103 power plants now produces 20% more electricity annually than it did prior to TMI. It is becoming
routine for plants to operate for 18 months or 24 months without a single forced outage, shutting down only
when it is necessary to refuel. PRA has also informed maintenance practices and demonstrated that many
best practices in fact unnecessarily increase the risks of component failures and accidents.
Slide 22: The Correct Course of Action: PRA
The value of PRA is due both to its quantitative results and to its ability to identify the relative contribution of
each component to failure. Without quantifiable, reproducible calculations of each component's role in the
system's success or failure, it is simply impossible to allocate resources rationally, much less optimally. The
traditional reliance on redundancy to characterize system reliability illustrates this point.
Many data center designs are characterized as "N+1" or "N+2" or even "2N" or "2N +1" designs. The
implication is that if N components are required for success, then one extra, two extra, twice as many, or even
twice as many plus one units are available. But clearly not all redundancy makes the same contribution to reliability.
Redundant standby generators, with a 1% failure to start per demand, will contribute far more to reliability
than redundant dry-type transformers, whose failure rate is so low that the money spent on redundant units
can almost invariably be spent to better effect elsewhere. Absent the ability to determine the quantitative
contribution of each component, redundant or not, designers and buyers simply cannot make informed
decisions regarding the best use of scarce financial resources. PRA is a powerful tool to answer these
questions. It identifies potential sources of failure, while determining component failure rates.
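A rough calculation shows why the same "+1" of redundancy is worth very different amounts for different components. The sketch below assumes independent units and a simple on-demand failure probability. The 1% generator figure comes from the text above; the transformer value is a hypothetical placeholder for "very low", not a measured rate.

```python
def failure_prob_redundant(p_single: float, n_units: int) -> float:
    """Probability that all n independent units fail on demand,
    when any one working unit is sufficient."""
    return p_single ** n_units


def gain_from_extra_unit(p_single: float) -> float:
    """Absolute reduction in failure probability from adding one
    redundant unit to a single unit."""
    return failure_prob_redundant(p_single, 1) - failure_prob_redundant(p_single, 2)


p_generator = 0.01    # from the text: 1% failure to start per demand
p_transformer = 1e-5  # hypothetical placeholder for a very reliable component

print(gain_from_extra_unit(p_generator))
print(gain_from_extra_unit(p_transformer))
```

Under these assumptions, the redundant generator removes far more failure probability than the redundant transformer for the same "N+1" label, which is the kind of comparison PRA makes quantitative.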
Slide 24: Reliability Assessment Case Study
The goals of the case study were to identify potential sources of failure, and to evaluate the potential for
further improvement in the scalable, modular power system’s reliability and availability. The scalable,
modular power system utilizes redundancy in many components to achieve high reliability, and "hot
swappable" power and battery modules to enable high availability. The scalable, modular power system
solution is sized to serve one or more rows of equipment racks. This strategy is an alternative to the use of
one large, central UPS to serve an entire data center. MTech performed a detailed analysis of a 40 kW
scalable, modular UPS and PDU with static bypass.
The targets of the analysis included:
The electrical and mechanical design
The verification of engineering
Validation testing
The manufacturing techniques
System performance
The fault tree model was used as a tool in the analysis. The fault tree highlights which components and
subassemblies impact systems reliability the most. A fault tree lists all possible failures and illustrates the
impact of one failure upon another in any given system.
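The way a fault tree combines component failures can be sketched with two gate types: an OR gate (the output fails if any input fails) and an AND gate (the output fails only if all inputs fail). The tree below is a toy example with hypothetical components and probabilities, assuming independent basic events. It is not the model used in the case study.

```python
from math import prod


def or_gate(probs):
    """Output event occurs if ANY input event occurs (independent inputs)."""
    return 1.0 - prod(1.0 - p for p in probs)


def and_gate(probs):
    """Output event occurs only if ALL input events occur (independent inputs)."""
    return prod(probs)


# Hypothetical basic-event probabilities over some mission time.
p_power_module = 0.02
p_bus = 0.001
p_breaker = 0.002

# Two redundant power modules: both must fail to lose the power stage.
both_modules_fail = and_gate([p_power_module, p_power_module])

# Top event: the load loses power if any of these branches fails.
top_event = or_gate([both_modules_fail, p_bus, p_breaker])

contributions = {
    "redundant power modules": both_modules_fail,
    "bus": p_bus,
    "breaker": p_breaker,
}
print(top_event)
print(max(contributions, key=contributions.get))  # largest single contributor
```

In this toy tree the non-redundant breaker contributes more to the top event than the pair of power modules, even though each module is individually less reliable. Ranking contributors in this way is how a fault tree highlights where improvement matters most.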
Some components have more than one failure mode. For example, in a UPS, components such as the bus,
batteries, controls and power modules were modeled with two failure modes: normal and catastrophic.
Normal failure, such as the failure of a power module in a scalable, modular UPS, does not result in failure
of the UPS because the components are redundant. A catastrophic failure results in failure of the UPS. An
example of catastrophic failure would be if plasma vented to the UPS interior caused shorts in multiple
power and control circuits and resulted in a load drop.
In this analysis, fault trees were constructed for two hypothetical data centers: one analyzing a single
500 kW UPS and the other analyzing 14 modular, scalable products to serve the same load.
Slide 29: Comparing Modularity to the Central UPS
Let's take a closer look at the data that the study yielded about the scalable, modular power system.
An informed judgment was made that approximately 1% of all component failures are catastrophic. Later
this was adjusted to reflect actual field data and the 1% estimate was found to be accurate.
The breakdown of scalable, modular system failures (Component contribution to failure: scalable, modular
power system only, No utility failures) is as follows:
Failures in the PDU transformer and catastrophic failure of the collector bus (The point of parallel
connection between power modules and the bypass switch) account for 72% of expected failures
The input and output molded case circuit breakers account for nearly 17%
Battery failure was negligible (8 series-parallel strings, 4 positive and 4 negative, 196 Volts DC)
The scalable, modular system loses power to all loads only when the main entrance bus fails or the
transfer switch fails to open
The probability of all 14 scalable, modular units failing simultaneously due to internal failures is
extremely low
PDU failure will cause partial load drop
Failure of only 1 circuit breaker, the one after the transfer switch, will cause all critical loads to fail
Effects of operator error were not significant
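The findings above about total load loss can be illustrated with a short calculation. The sketch assumes 14 independent units and a single shared upstream element (such as the main entrance bus or transfer switch); both probabilities are hypothetical and are not results from the study.

```python
def all_units_fail(p_unit: float, n_units: int) -> float:
    """Probability that n independent units all fail from internal causes."""
    return p_unit ** n_units


def total_load_loss(p_unit: float, n_units: int, p_shared: float) -> float:
    """Loss of all loads: the shared upstream element fails OR every unit fails."""
    p_all = all_units_fail(p_unit, n_units)
    return 1.0 - (1.0 - p_shared) * (1.0 - p_all)


p_unit = 0.01     # hypothetical internal failure probability of one unit
p_shared = 0.001  # hypothetical failure probability of the shared upstream element

print(all_units_fail(p_unit, 14))
print(total_load_loss(p_unit, 14, p_shared))
```

Under these assumptions, the chance of all 14 units failing together is vanishingly small, so the probability of losing every load is set almost entirely by the shared upstream element.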
Effects of operator error on UPS were not significant
While power module failures were observed more often in the scalable, modular power system, the
increase was more than offset by the benefits provided by redundancy.
Detailed consideration of common-cause failure mechanisms and potential catastrophic failure
modes that could potentially cause UPS failure did not result in significant reductions to the
calculated product reliability.
The scalable, modular design and the associated high volume of product allows the utilization of
dedicated manufacturing cells that produce products at lower costs and with fewer defects. For a
given power rating, five power modules are manufactured where a non-modular design would
manufacture one unit. This enables faster reliability growth of the product line.
The use of factory-built distribution wiring in the scalable, modular architecture presents a
significant advantage over field-wired distribution systems for centralized UPS products.
Distribution wiring creates multiple opportunities to introduce wiring defects that may eventually
cause loss of power to critical loads. MTech’s analysis of the field vs. factory wiring process found
that the probability of defects in field-produced systems was 1,500 times higher than in the equivalent
factory-built system.
Examining these findings, there are some general conclusions that can be drawn.
Slide 34: Conclusions
After thorough analysis utilizing PRA tools, the scalable, modular power system architecture proved more
reliable than the single module UPS with a single battery string.
The redundant sub-systems within the scalable, modular power system successfully reduced the probability
of UPS failure, but the effects of external systems common to both the centralized and scalable, modular
power system approach affected both systems similarly.
Utilizing parallel redundant battery strings in the central UPS would reduce, but not eliminate, the difference
in reliability.
The performance of the automatic transfer switch (ATS) is often the limiting factor in achieving higher
reliability.