Analyzing Reliability in the Data Center

Transcript

Slide 1
Welcome to the Data Center University™ course: Analyzing Reliability in the Data Center.

Slide 2
For best viewing results, we recommend that you maximize your browser window now. The screen controls
allow you to navigate through the eLearning experience. Using your browser controls may disrupt the
normal play of the course. Click the attachments link to download supplemental information for this course.
Click the Notes tab to read a transcript of the narration.

Slide 3
At the completion of this course, you will be able to:
Define key terms associated with analyzing reliability risks
Identify some common cause failures in the data center
Describe the benefits of conducting a Probabilistic Risk Assessment (PRA)
Recognize the reliability advantages of utilizing scalable, modular architecture in the data center

Slide 4: Introduction
The growing reliance on information systems that operate 24 hours per day, 7 days per week, has spawned
a rapidly growing and developing industry that supplies products and services on-demand. The need for
these types of information services now reaches into every business office in the world. Unfortunately,
events of all kinds, including hardware failure, human error, environmental changes, structural failure and
external events, can lead to unanticipated system downtime.

Slide 5: Introduction
Modern data centers do not tolerate planned downtime and strive for no outages in a 10-year mission. Data
center operations staffs face a dilemma: either risk downtime as a result of insufficient physical
infrastructure, or incur extensive costs by designing in more redundancy than is necessary. Targeted
reliability solutions allow businesses to meet individual requirements of the data center, while minimizing the
total cost of ownership.

© 2013 Schneider Electric. All rights reserved. All trademarks provided are the property of their respective owners.

In fact, very high reliability is difficult to attain and redundant hardware is only part of the answer. This
course will demonstrate some important performance success factors and give an overview of best practices for
analyzing and optimizing reliability.

Slide 6: Reliability
The concept of reliability has a basis in science and engineering. The concepts are defensible and
falsifiable. Understanding how to best define downtime risk or vulnerability for a particular business is
important to optimizing its reliability, while decreasing total cost of ownership (TCO) and increasing agility.

Reliability metrics statistically analyze the likelihood of a failure occurring.

There must be a time element involved when analyzing reliability.

Quantifying reliability gives insight into the various anticipated challenges that data center staffs will have to
face.

If reliability is viewed as a quantitative metric for data centers, the language of reliability must be consistent
throughout the organization.

Reliability and probability of failure, also known as unreliability, are useful tools for data center owners and
designers. Calculation of reliability is more difficult but the methods and tools are well developed.
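
The time element can be made concrete with a short sketch. It assumes a constant failure rate (the exponential model), which this course does not state; the MTBF value and the function names are hypothetical and chosen only for illustration.

```python
import math

HOURS_PER_YEAR = 8760


def reliability(mission_hours: float, mtbf_hours: float) -> float:
    """Probability of no failure during mission_hours.

    Assumption: constant failure rate (exponential model), so
    R(t) = exp(-t / MTBF).
    """
    return math.exp(-mission_hours / mtbf_hours)


def unreliability(mission_hours: float, mtbf_hours: float) -> float:
    """Probability of at least one failure during mission_hours."""
    return 1.0 - reliability(mission_hours, mtbf_hours)


if __name__ == "__main__":
    mtbf_hours = 500_000  # hypothetical value, not taken from the course
    for years in (1, 5, 10):
        t = years * HOURS_PER_YEAR
        print(f"{years:>2} years: reliability = {reliability(t, mtbf_hours):.4f}, "
              f"unreliability = {unreliability(t, mtbf_hours):.4f}")
```

The same MTBF yields a different reliability figure for each mission length, which is why a reliability number quoted without a time period cannot be interpreted.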

Slide 7: Reliability
In this course, we will focus on the reliability of two types of power systems: the traditional 3-phase central
UPS and the more recent scalable, modular rack based power system.

Just as fault tolerant IT equipment allows continued data center operation when an IT component fails, fault
tolerant Network Critical Physical Infrastructure (NCPI) equipment allows continued operation of power or
cooling when an NCPI component fails. Fault tolerance can be accomplished by applying appropriate
redundancy of NCPI units, or by internal redundancy of components within NCPI units – for example, by
having extra power modules in a UPS.

How are high levels of reliability achieved today? A company that is very concerned about reliability may
spend millions of dollars to blanket the entire IT infrastructure with elaborate highly redundant schemes that
are not flexible even though a portion of the equipment is not critical to business operations. It is important
to note that redundant hardware is only a starting point. Redundant components can be costly and, when
used inappropriately, are not a best practice.

Let’s talk a little more about the concept of redundancy.

Slide 8: Redundancy
Fundamental questions exist regarding redundant designs. While redundancy can in principle increase
reliability by allowing individual components or subassemblies to fail without causing the system to fail, there
are significant costs and potentially serious drawbacks. A redundant system has more components, and in
general, systems with more components will experience more failures. For example, twin-engine airplanes
experience roughly twice as many engine failures per hour of operation as comparable single-engine
airplanes.

There must be very reliable mechanisms in place to identify the failed component and isolate it from the
system; otherwise the benefits of redundancy are lost while the number of component failures still increases. Some
failure modes can affect multiple components simultaneously. Such common cause failures place a
significant limit on the benefits of redundancy. Design defects, manufacturing defects, and defects introduced
during installation, maintenance, or repair can all result in failure modes where multiple, supposedly
independent units fail, often causing the entire system to fail despite the redundant design. Catastrophic
failures of some components can damage connected or nearby equipment and cause system failure despite
redundant design.
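
The limit that common cause failures place on redundancy can be sketched numerically. The example below uses the beta-factor model, a common simplification in reliability engineering that is not described in this course; the unit failure probability and the beta values are hypothetical.

```python
def redundant_pair_failure_probability(p_unit: float, beta: float) -> float:
    """Probability that both units of a redundant pair are failed.

    p_unit: probability that a single unit is failed (hypothetical input).
    beta:   fraction of unit failures that are common cause
            (0.0 means the two units fail fully independently).
    """
    p_common = beta * p_unit                # one shared event fails both units
    p_independent = (1.0 - beta) * p_unit   # each unit failing on its own
    # The pair is lost if the common cause event occurs, or if it does not
    # occur and both units happen to fail independently.
    return p_common + (1.0 - p_common) * p_independent ** 2


if __name__ == "__main__":
    p_unit = 0.01  # hypothetical single-unit failure probability
    for beta in (0.0, 0.05, 0.10):
        p_pair = redundant_pair_failure_probability(p_unit, beta)
        print(f"beta = {beta:.2f}: pair failure probability = {p_pair:.6f}")
```

Under these assumptions, even a small common cause fraction dominates the result: the pair cannot be more reliable than the common cause event allows, however reliable the individual units are.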

To fully understand reliability, we must first educate ourselves on typical data center common cause failures.
Let's talk about those now.

Slide 9: Discussing Best Practices


When discussing best practices for determining reliability, the entire life cycle of the component or system
needs to be scrutinized. This includes:
The design, manufacture, operation, maintenance and repair of equipment
The gathering of data, and the review and publication of component benchmarking results
Consistent deployment of the language of reliability, both definitions and assumptions
A philosophy which addresses the constant pursuit of root causes, common cause failures and
relevant data

Let’s discuss some specific common cause failures.

Slide 10: Common Cause Failures


Some data center common cause failures include simultaneous failures of multiple components caused by:
Design errors
Installation defects
Extreme environments, specifically when taking into account temperature, humidity, and vibration
Human, or operator, error
Defects introduced in manufacturing or testing

Now that we’ve identified some common causes of failure in a data center, let’s discuss the importance of
the reliability terms “Mean Time Between Failure” and “Mean Time To Recover”, and how best to
define the word "failure".

Slide 11: Reliability and Mean Time Between Failure
Mean Time Between Failure (MTBF) is a basic measure of a system’s reliability, and any quoted value rests
on assumptions and on a definition of failure. It is typically expressed in hours. The higher the MTBF, the
higher the reliability of the product.

What is a Failure? What are the Assumptions?

These questions should be asked immediately upon reviewing any MTBF value. Without the answers to
these questions, the discussion holds little value. MTBF is often quoted without providing a definition of
failure. This practice is not only misleading but completely useless. A similar practice would be to advertise
the fuel efficiency of an automobile as “miles or kilometers per tank” without defining the capacity of the tank
in gallons or liters.

To address this ambiguity, one could argue there are two basic definitions of a failure:
The termination of the ability of the product as a whole to perform its required function
The termination of the ability of any individual component to perform its required function but not
the termination of the ability of the product as a whole to perform
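
The effect of the failure definition on a quoted MTBF can be illustrated with a small sketch. The event log, the operating hours, and the names below are all hypothetical; the calculation simply divides operating hours by the number of events that count as failures under the chosen definition.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureEvent:
    hour: float        # operating hour at which the event occurred
    system_down: bool  # True if the product as a whole stopped performing


def mtbf(operating_hours: float, events: List[FailureEvent],
         count_component_failures: bool) -> float:
    """Operating hours divided by the number of counted failures.

    count_component_failures=False counts only whole-product failures;
    True also counts component failures that the product survived.
    """
    if count_component_failures:
        counted = list(events)
    else:
        counted = [e for e in events if e.system_down]
    if not counted:
        return float("inf")  # no failures observed under this definition
    return operating_hours / len(counted)


# Hypothetical field log: three survivable component failures, one outage.
LOG = [
    FailureEvent(12_000, False),
    FailureEvent(41_500, False),
    FailureEvent(63_000, True),
    FailureEvent(88_250, False),
]

if __name__ == "__main__":
    hours = 100_000
    print("MTBF, whole-product failures only:", mtbf(hours, LOG, False))
    print("MTBF, component failures included:", mtbf(hours, LOG, True))
```

The same log produces MTBF figures that differ by a factor of four, which is why an MTBF value is meaningless until the definition of failure is stated.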

Slide 12: Reliability and Mean Time to Recover


A shorter Mean Time to Recover (MTTR) means faster recovery after failure. This translates to less
downtime. Short MTTR improves the reliability of fault tolerant architectures.

Swappable modules, like those found in some power backup systems, help reduce MTTR. Equipment with an
understandable architecture and design enhances manageability and reduces human error. Hot-swappable
modules can be replaced without downtime.

A failed modular component can be quickly swapped out for replacement, so recovery is not delayed while
waiting for repair. Standardization makes systems easier to understand and operate, makes diagnosis of
problems faster and increases the potential for correction by the user.
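
The link between MTTR and downtime can be sketched with the standard steady-state availability relation, availability = MTBF / (MTBF + MTTR). This formula is not given in the course, and the MTBF and MTTR values below are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours: float, mttr_hours: float,
                          hours_per_year: float = 8760.0) -> float:
    """Expected hours of downtime per year at steady state."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * hours_per_year


if __name__ == "__main__":
    mtbf_hours = 50_000  # hypothetical
    # Hypothetical recovery times: on-site repair versus a module swap.
    for label, mttr_hours in (("repair in place", 8.0), ("swap module", 0.5)):
        print(f"{label}: availability = {availability(mtbf_hours, mttr_hours):.6f}, "
              f"downtime = {annual_downtime_hours(mtbf_hours, mttr_hours):.2f} h/year")
```

With the failure rate held constant, cutting the recovery time reduces expected downtime in nearly direct proportion.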

Slide 13: Modularity and Component Count
Component reliability can be increased through standardization. Modularity is a powerful concept when it
comes to reliability. Unless classic reliability analysis is updated to include it, modularity’s substantial
advantage to users risks being misunderstood. Modularizing a system can, in some cases, increase the
number of internal components – for example, large UPS capacity modularized into a bank of smaller power
modules will increase the number of electrical components and connectors. To be valid, reliability analysis
of modular systems must consider component design, function, and dependencies, and not just rely on
simple multiplication of parts.

Slide 14: Modularity and Component Count


Further, reliability analysis based on component count alone is incomplete – even potentially misleading –
because it leaves out the new and overriding reliability advantages of modular structure, most importantly:

Swappable modules can be removed for factory service, enabling continuous quality improvement
in which defects are diagnosed at the factory and engineered out as they are discovered. (This
process is called “reliability growth” in systems analysis.)
Modules are manufactured in much greater quantity than a larger non-modular system, increasing
even further the quality improvements already inherent in mass production.
The generally smaller size of modules (compared to non-modular design) tends to mean less
manual work during manufacture.
Modular design allows for the considerable reliability advantage of fault tolerance – redundant
modules operating in parallel, allowing for individual module failure without affecting overall system
performance.
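
The fault tolerance advantage of redundant modules operating in parallel can be sketched as a "k out of n" calculation. The sketch assumes that modules fail independently, which ignores common cause failures, and the per-module reliability is a hypothetical value.

```python
from math import comb


def k_of_n_reliability(n: int, k: int, r_module: float) -> float:
    """Probability that at least k of n identical modules are working.

    Assumption: modules fail independently of one another.
    """
    return sum(
        comb(n, working) * r_module ** working * (1.0 - r_module) ** (n - working)
        for working in range(k, n + 1)
    )


if __name__ == "__main__":
    r_module = 0.90  # hypothetical reliability of one power module
    needed = 4       # hypothetical number of modules required to carry the load
    print("N   (4 of 4):", round(k_of_n_reliability(4, needed, r_module), 5))
    print("N+1 (4 of 5):", round(k_of_n_reliability(5, needed, r_module), 5))
    print("N+2 (4 of 6):", round(k_of_n_reliability(6, needed, r_module), 5))
```

Adding modules raises the total component count, yet the probability that enough modules remain working goes up. This is the reason a simple parts count gives a misleading picture of a modular system.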

Slide 15: Benefits and Drawbacks


Data center testing and maintenance practices often have a significant impact on systems reliability. Testing
and diagnosis can improve reliability, but may also degrade it.

The benefits are:
Verifies that component is operable
Reveals failures that can be repaired
Maintains operability
Maintains testing skills

There is also some potential harm that must be taken into account:
Removal from service can result in complete component unavailability
Wear and tear due to testing
Introduction of new defects (e.g. damage during inspection, fuel depletion)
Acceleration of dependent failures (e.g. vibration)
Damage or degradation of component via incorrect restoration to service (e.g. resetting, realigning,
bad documentation)
Human error can cause the wrong component to be removed from service

Slide 16: UPS: Historical Perspective


Historically, the product of choice for improving the reliability of electric power has been the uninterruptible
power supply, or UPS. The UPS conditions utility power so that essentially perfect voltage and current are
supplied to the protected equipment. The UPS also includes batteries (or other energy storage devices) that
keep power flowing to the critical load when the utility fails. UPSs have been manufactured for decades.

In most UPSs, utility AC power is rectified to DC. The DC bus connects the rectifier to the battery (which is
typically composed of multiple series and parallel strings) and to the inverter. The inverter synthesizes an
AC voltage free from the effects of spikes, sags, harmonics, and brief utility outages.

So how do we begin to assess the reliability of a particular choice? Let’s discuss our options.

Slide 17: Assessing Reliability
One way to determine reliability is to introduce a new product to customers and then observe the number of
failures. This approach has many drawbacks. First, the customer becomes the subject of an experiment.

Second, since even a poorly designed or manufactured unit might not fail very often, it may take months or,
more probably, years of observation before statistically significant differences could be demonstrated. Third,
achieving reliability in critical systems (e.g. airplanes, anti-lock brakes, and telephone switches) requires
observation of large fleets of essentially identical components over long periods.

Also, the present UPS marketplace has evolved to include a substantial number of uniquely designed data
centers. Since each of these data centers is unique, the UPSs within those data centers are exposed to
unique operating environments and management practices. UPS vendors have naturally responded with an
ever-growing array of custom and customizable solutions that can meet any conceivable design
specification for the next custom-built data center.
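
The observation time mentioned among the drawbacks above can be estimated with a short sketch. It assumes a constant failure rate and a test in which no failures are observed; in that case the one-sided lower confidence bound on MTBF is the accumulated unit-hours divided by -ln(1 - confidence). The target MTBF, confidence level, and fleet sizes are hypothetical.

```python
import math


def unit_hours_to_demonstrate(mtbf_target_hours: float, confidence: float) -> float:
    """Failure-free unit-hours needed to demonstrate a target MTBF.

    Assumptions: constant failure rate, zero failures observed, and a
    one-sided lower confidence bound on MTBF.
    """
    return -mtbf_target_hours * math.log(1.0 - confidence)


def calendar_years(unit_hours: float, fleet_size: int,
                   hours_per_year: float = 8760.0) -> float:
    """Calendar time for a fleet running continuously to accumulate unit_hours."""
    return unit_hours / (fleet_size * hours_per_year)


if __name__ == "__main__":
    target = 200_000   # hypothetical MTBF claim, in hours
    confidence = 0.90  # hypothetical confidence level
    needed = unit_hours_to_demonstrate(target, confidence)
    for fleet in (10, 100, 1000):
        print(f"fleet of {fleet:>4}: {calendar_years(needed, fleet):.2f} years "
              f"of failure-free operation")
```

Small fleets of one-off designs need years of clean operation before field data can support a reliability claim, which is the practical weakness of relying on observation alone.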

Slide 18: Assessing Reliability


It is more efficient and less costly to employ some means of learning about the reliability of a new product
before subjecting thousands of customers to potential mistakes that compromise reliability. Further, it would
be extremely useful to know which of several competing proposals offers the best reliability for the least cost.
The product's designers would very much like to understand which components and sub-systems are most
important to the product's overall reliability. The product support engineers, charged with tracking the
product's performance in actual use and quickly identifying and implementing changes necessary to correct
deficiencies or defects, would benefit from a road map identifying components most likely to fail. Deviations
from the predictions of the road map would identify new areas for more intensive investigation and possible
remedial action.

Slide 19: The Correct Course of Action: PRA


Probabilistic Risk Assessment, PRA, was first developed as a response to the exasperation of early rocket
engineers, who grew frustrated with the seemingly endless litany of reasons for their cherished vehicles to
fail. Mathematical analysis quickly showed that, in a highly interconnected system such as a rocket or a data
center, the old adage that a chain is only as strong as its weakest link is no longer true. The chain comes to
resemble a net, one with many weak links and undiscovered threads linking one area to another. Failures in
one part of the net place new and different stresses on other parts, which are then more likely to fail. The
result is an environment where even minor upsets start a series of cascading failures that end with complete
failure of the system.

Slide 20: The Correct Course of Action: PRA


PRA was applied on a large scale to the US nuclear power industry, first as a means to address public
concerns regarding safety. After the events of Three Mile Island (TMI) threatened the viability of the entire
multi-billion dollar industry, PRA techniques were embraced and extended to include not only design
choices but also operating and maintenance decisions and the effects of management practices.

The results have been gratifying. Not only have there been no more incidents resembling TMI, but the fleet
of 103 power plants now produces 20% more electricity annually than it did prior to TMI. It is becoming
routine for plants to operate for 18 months or 24 months without a single forced outage, shutting down only
when it is necessary to refuel. PRA has also informed maintenance practices and demonstrated that many
best practices in fact unnecessarily increase the risks of component failures and accidents.

Slide 21: The Correct Course of Action: PRA


PRA is a powerful tool when applied carefully. The process of building the logical model results in a
comprehensive review of the decisions, features, and assumptions that shaped the product. The
mathematical nature of the calculation limits the appeals to experience and other common logical fallacies
that tend to dominate qualitative evaluation of reliability. All too often, a claim of twenty years' experience is
roughly equivalent to one year's learning followed by nineteen years of doing the same thing over and over
again.

Slide 22: The Correct Course of Action: PRA
The value of PRA is due both to the quantitative results, and its ability to identify the relative contribution of
each component to failure. Without quantifiable, reproducible calculations of each component's role in the
system's success or failure, it is simply impossible to allocate resources rationally, much less optimally. The
traditional reliance on redundancy to characterize system reliability illustrates this point.

Many data center designs are characterized as "N+1" or "N+2" or even "2N" or "2N+1" designs. The
implication is that if N components are required for success, there are N plus one, N plus two, twice N, or
twice N plus one units available. But clearly not all redundancy makes the same contribution to reliability.
Redundant standby generators, with a 1% failure to start per demand, will contribute far more to reliability
than redundant dry-type transformers, whose failure rate is so low that the money spent on redundant units
can almost invariably be spent to better effect elsewhere. Absent the ability to determine the quantitative
contribution of each component, redundant or not, designers and buyers simply cannot make informed
decisions regarding the best use of scarce financial resources. PRA is a powerful tool to answer these
questions. It identifies potential sources of failure while determining component failure rates.
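
The generator and transformer comparison can be put into numbers with a simple sketch. The 1% generator figure comes from the text above; the transformer figure is a hypothetical placeholder, and both are treated as the probability that a unit is unavailable when needed. Units are assumed to fail independently, so common cause failures are ignored.

```python
def risk_reduction_from_spare(p_fail: float) -> float:
    """Drop in failure probability gained by adding one redundant unit.

    Assumption: the two units fail independently, so a single unit fails
    with probability p_fail and a redundant pair with p_fail squared.
    """
    return p_fail - p_fail ** 2


if __name__ == "__main__":
    components = {
        "standby generator": 0.01,      # 1% failure to start per demand (from the text)
        "dry-type transformer": 1e-5,   # hypothetical placeholder value
    }
    for name, p_fail in components.items():
        print(f"{name}: single = {p_fail:.2e}, "
              f"gain from one spare = {risk_reduction_from_spare(p_fail):.2e}")
```

Under these assumptions the spare generator removes far more risk than the spare transformer. Ranking components by this kind of quantitative contribution is what allows redundancy budgets to be compared.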

Slide 23: Reliability Assessment Case Study


MTechnology, Inc (MTech), an independent reliability consultant, uses Probabilistic Risk Assessment (PRA)
techniques and software adapted from the nuclear power industry to help calculate systems reliability.
MTech undertook a project to analyze new scalable, modular power systems and compare them to
traditional 3-phase UPS systems. This analysis is the basis for the case study presented in this course. The
mathematical models that resulted from the analysis were used to answer some key questions. The
scalable, modular system utilizes redundancy in nearly all components as a means of achieving high
reliability. MTech's analysis showed that there are both costs and benefits to redundancy, and that some
sub-systems benefit less from redundancy than others. Complex mathematical formulas, including Bayesian
updating techniques, were utilized to calculate the case study failure rates and common cause failures. The
specifics of these mathematical calculations are beyond the scope of this particular course.

Let’s take a closer look at the goals of this study.

Slide 24: Reliability Assessment Case Study
The goals of the case study were to identify potential sources of failure, and to evaluate the potential for
further improvement in the scalable, modular power system’s reliability and availability. The scalable,
modular power system utilizes redundancy in many components to achieve high reliability, and "hot
swappable" power and battery modules to enable high availability. The scalable, modular power system
solution is sized to serve one or more rows of equipment racks. This strategy is an alternative to the use of
one large, central UPS to serve an entire data center. MTech performed a detailed analysis of a 40 kW
scalable, modular UPS and PDU with static bypass.

Slide 25: Reliability Assessment Case Study


The study included analysis of the product in isolation, analysis in a typical data center environment, and a
comparative reliability analysis against a traditional central UPS in the same data center. The analysis
included a detailed review of the electrical and mechanical design, engineering verification and validation
testing, manufacturing techniques, and the performance of the units in actual service.

Slide 26: Target of Case Study Analysis


Let's discuss the parameters of the reliability analysis case study.

The subjects of analysis included:
14 scalable, modular, rack-based 40 kW power systems, each with PDU and static bypass
One 500 kW central UPS

The tools utilized to perform the reliability analysis included:
Probabilistic Risk Assessment (PRA)
Fault tree analysis
Event tree analysis
Bayesian updating

The targets of the analysis included:
The electrical and mechanical design
The verification of engineering
Validation testing
The manufacturing techniques
System performance

Slide 27: Reliability Assessment Case Study


All actions have both beneficial and negative effects on reliability. The introduction of a UPS into a data
center, for example, helps to support the uptime of the servers but also can represent a point of failure.

The fault tree model was used as a tool in the analysis. The fault tree highlights which components and
subassemblies impact systems reliability the most. A fault tree lists all possible failures and illustrates the
impact of one failure upon another in any given system.

Some components have more than one failure mode. For example, in a UPS, components such as bus,
batteries, controls and power modules were modeled with two failure modes: normal and catastrophic.

Normal failure, such as the failure of a power module in a scalable, modular UPS, does not result in failure
of the UPS because the components are redundant. A catastrophic failure results in failure of the UPS. An
example of catastrophic failure would be if plasma vented to the UPS interior caused shorts in multiple
power and control circuits and resulted in a load drop.

In this analysis, fault trees were constructed for two hypothetical data centers: one with a single
500 kW UPS and the other with 14 modular, scalable units serving the same load.
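
How a fault tree combines component failures into a top event can be sketched with two gate functions. This is a minimal illustration, not the model used in the study: the tree structure and every probability below are hypothetical, and basic events are assumed to be independent.

```python
from math import prod
from typing import Iterable


def or_gate(probabilities: Iterable[float]) -> float:
    """Output event occurs if any input event occurs (independent inputs)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def and_gate(probabilities: Iterable[float]) -> float:
    """Output event occurs only if all input events occur (independent inputs)."""
    return prod(probabilities)


# Hypothetical basic-event probabilities over some fixed mission time.
P_TRANSFER_SWITCH = 0.001
P_MAIN_BUS = 0.0005
P_UPS_INTERNAL = 0.01
P_BYPASS = 0.02

# Hypothetical top event: all critical loads lose power if the transfer
# switch fails, or the main bus fails, or the UPS and its bypass both fail.
P_UPS_AND_BYPASS = and_gate([P_UPS_INTERNAL, P_BYPASS])
P_TOP = or_gate([P_TRANSFER_SWITCH, P_MAIN_BUS, P_UPS_AND_BYPASS])

if __name__ == "__main__":
    print(f"UPS and bypass both fail: {P_UPS_AND_BYPASS:.6f}")
    print(f"Top event probability:    {P_TOP:.6f}")
```

In this hypothetical tree the single points of failure under the OR gate contribute more to the top event than the redundant UPS and bypass pair, even though the UPS has the highest individual failure probability.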

Slide 28: Comparing Modularity to the Central UPS


Here we see an initial summary of the component contribution to failure in a modular, scalable data center
(no utility failures).

Slide 29: Comparing Modularity to the Central UPS
Let's take a closer look at the data that the study yielded about the scalable, modular power system.

An informed judgment was made that approximately 1% of all component failures are catastrophic. Later
this was adjusted to reflect actual field data and the 1% estimate was found to be accurate.
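
Adjusting an initial estimate with field data is the kind of step that Bayesian updating formalizes. The sketch below uses a Beta prior with a binomial likelihood, which is one standard way to update a proportion; the course does not describe the method MTech actually used, and the prior strength and field counts are hypothetical.

```python
from typing import Tuple


def update_beta(prior_alpha: float, prior_beta: float,
                catastrophic: int, total: int) -> Tuple[float, float]:
    """Conjugate Beta-binomial update of the catastrophic-failure fraction.

    catastrophic: observed failures classified as catastrophic.
    total:        all observed component failures.
    """
    if not 0 <= catastrophic <= total:
        raise ValueError("catastrophic must be between 0 and total")
    return prior_alpha + catastrophic, prior_beta + (total - catastrophic)


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


if __name__ == "__main__":
    # Prior belief: about 1% of failures are catastrophic (from the text),
    # encoded as Beta(1, 99). The prior strength is a hypothetical choice.
    prior = (1.0, 99.0)
    # Hypothetical field data: 3 catastrophic failures out of 300 failures.
    posterior = update_beta(*prior, catastrophic=3, total=300)
    print("prior mean:    ", beta_mean(*prior))
    print("posterior mean:", beta_mean(*posterior))
```

When the field data agree with the prior, as in this hypothetical case, the estimate stays at about 1% while the uncertainty around it narrows.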

The breakdown of scalable, modular system failures (Component contribution to failure: scalable, modular
power system only, No utility failures) is as follows:
Failures in the PDU transformer and catastrophic failure of the collector bus (the point of parallel
connection between power modules and the bypass switch) account for 72% of expected failures
The input and output molded case circuit breakers account for nearly 17%
Battery failure was negligible (8 series-parallel strings, 4 positive and 4 negative, 196 Volts DC)
The scalable, modular system loses power to all loads only when the main entrance bus fails or the
transfer switch fails to open
The probability of all 14 scalable, modular units failing simultaneously due to internal failures is
extremely low
PDU failure will cause partial load drop
After the transfer switch, the failure of only 1 circuit breaker can cause all critical loads to fail
Effects of operator error were not significant

Slide 30: Comparing Modularity to the Central UPS


The study yielded this data about the central UPS.
Battery failure is a significant contributor to failure in the central UPS (single string of VRLA batteries,
400 Volts DC)
The central UPS can fail internally and bypass can fail, causing all loads to then fail
PDU failure will cause partial load drop
After the transfer switch, there are 5 circuit breakers, 2 on the input and 3 on the output
Output circuit breaker failures cause immediate loss of the critical load
Input circuit breaker failures cause loss of load after the UPS batteries are exhausted
Effects of operator error on UPS were not significant

Slide 31: Comparing Modularity to the Central UPS


Here we can see a summary of the individual component contribution to failure in a data center. The
modular, scalable architecture is compared to the central UPS architecture. The study showed that the
scalable, modular power system failure rate (when failure is defined as all critical loads in the data center
losing power) was approximately 40% lower than that of the central UPS system.

Slide 32: Reliability Assessment Case Study: The Findings


The findings of this case study were:
The calculated reliability of the scalable, modular power system is comparable to data published by
vendors of large, central UPSs.
The scalable, modular power system is significantly less likely to suffer a complete system failure.
Failures in equipment common to both approaches, such as the automatic transfer switch (ATS),
were the most significant cause of system failure.
The redundancy provided in the scalable, modular power system definitely improves the product's
reliability.
While power module failures were observed more often in the scalable, modular power system, the
increase was more than offset by the benefits provided by redundancy.
Detailed consideration of common-cause failure mechanisms and catastrophic failure
modes that could potentially cause UPS failure did not result in significant reductions to the
calculated product reliability.
The scalable, modular design and the associated high volume of product allow the utilization of
dedicated manufacturing cells that produce products at lower costs and with fewer defects. Five
power modules are manufactured for every single unit that a non-modular design of the same
power rating would require. This enables faster reliability growth of the product line.
The use of factory-built distribution wiring in the scalable, modular architecture presents a
significant advantage over field-wired distribution systems for centralized UPS products.
Distribution wiring presents multiple opportunities to introduce wiring defects that may eventually
cause loss of power to critical loads. MTech’s analysis of the field vs. factory wiring process found
that the probability of defects in field-produced systems was 1,500 times higher than in the equivalent
factory-built system.

Examining these findings, there are some general conclusions that can be drawn.

Slide 33: Conclusions


The general conclusions that can be made are:
Overall, the scalable, modular power system architecture showed a system failure rate 40% lower
than that of the central UPS system. In this case, the failure is defined as all critical loads in the
data center losing power.
Discounting battery failures, the scalable, modular power system failure rate is still approximately
18% less than that of a comparable central UPS architecture.
If failure is defined to include dropping of any single load (as opposed to loss of the whole data
center) due to a branch circuit failure but not UPS failure, the scalable, modular architecture is 6%
less likely to fail (compared to central UPS).

Slide 34: Conclusions
After thorough analysis utilizing PRA tools, the scalable, modular power system architecture proved more
reliable than the single module UPS with a single battery string.

The redundant sub-systems within the scalable, modular power system successfully reduced the probability
of UPS failure, but the effects of external systems common to both the centralized and scalable, modular
power system approach affected both systems similarly.

Utilizing parallel redundant battery strings in the central UPS would reduce, but not eliminate, the difference
in reliability.

The performance of the automatic transfer switch (ATS) is often the limiting factor in achieving higher
reliability.

Slide 35: Summary


To summarize, let’s review some of the information that we have covered throughout the course.
Understanding how to best define downtime risk or vulnerability for a particular business is
important to optimizing its reliability, while decreasing total cost of ownership (TCO) and increasing
agility
While redundancy can in principle increase reliability by allowing individual components or
subassemblies to fail without causing the system to fail, there are significant costs and potentially
serious drawbacks
Data center professionals need to understand which processes are most critical, and target
reliability accordingly
PRA is a powerful tool when applied carefully

Slide 36: Thank You!


Thank you for participating in this Data Center University™ course.
