Introduction to High Availability
Purpose
The intent of this System Technical Note (STN) is to describe the capabilities of the different Schneider Electric solutions that answer the requirements of most critical applications, and consequently increase the availability of a Collaborative Control System. It provides a common, readily understandable reference point for end users, system integrators, OEMs, sales people, business support and other parties.
Introduction
Before deciding to install a high availability automation system in your installation, you
need to consider the following key questions:
• What is the security level needed? This concerns the security of both persons and hardware. For instance, the complete start/stop sequence that manages the kiln in a cement plant includes a key condition: the most powerful equipment must be the last to start and stop.
• What is the criticality of the process? This point concerns all the processes that
involve a reaction (mechanical, chemical, etc.). Consider the example of the kiln
again. To avoid its destruction, the complete stop sequence needs a slow cooling
of the constituent material. Another typical example is the biological treatment in
a wastewater plant, which cannot be stopped every day.
• Which other particular circumstances does the system have to address? This last topic includes additional questions. For example: does the system really need redundancy if the electrical network becomes inoperative in the concerned layer of the installation? What is the criticality of the data in the event of a loss of communication?
Document Overview
Depending on the I/O handling philosophy (for example conventional Remote I/O stations, or I/O Islands distributed on Ethernet), different scenarios can be applied: dual communication medium I/O Bus, or self-healing ring (single or dual).
High Availability Theoretical Basics
Guide Scope
The realization of an automation project includes five main phases: Selection, Design, Configuration, Implementation and Operation. To help you develop a whole project based on these phases, Schneider Electric created the System Technical book concept: Guides (STG) and Notes (STN).
STG and STN are closely linked and complementary. To sum up, you will find the technology fundamentals in an STN, and their corresponding tested and validated applications in one or several STGs.
(Figure: STN and STG scope across the Automation Project Life Cycle)
A Fault Tolerant System usually refers to a system that can operate even though a hardware component becomes inoperative. The Redundancy principle is often used to implement a Fault Tolerant System, because an alternate component takes over the task transparently.
Lifetime is generally seen as a sequence of three phases: the "early life" (or "infant mortality"), the "useful life," and the "wear-out period."
Failure Rate (λ) is defined as the average (mean) rate at which a system becomes inoperative.
Failure Rate Example: For a human being, Failure Rate (λ) measures the probability of death occurring in the next hour. Stating λ(20 years) = 10^-6 per hour would mean that the probability for someone aged 20 to die in the next hour is 10^-6.
Bathtub Curve
The following figure shows the Bathtub Curve, which represents the Failure Rate (λ)
according to the Lifetime (t):
Consider the relation between Failure Rate and Lifetime for a device consisting of assembled electronic parts. This relationship is represented by the "bathtub curve" shown in the previous diagram. In "early life," the system exhibits a high Failure Rate, which gradually decreases until it approaches a constant value that is maintained during its "useful life." The system finally enters the "wear-out" stage of its life, where the Failure Rate increases exponentially.
Note: Useful Life normally starts at the beginning of system use and ends at the beginning of the wear-out phase. Assuming that "early life" corresponds to the "burn-in" period indicated by the manufacturer, we generally consider that Useful Life starts with the beginning of system use by the end user.
The following text, from the MIL-HDBK-338B standard, defines the RAM criteria and
their probability aspect:
"For the engineering specialties of reliability, availability and maintainability (RAM),
the theories are stated in the mathematics of probability and statistics. The underlying
reason for the use of these concepts is the inherent uncertainty in predicting a failure.
Even given a failure model based on physical or chemical reactions, the results will
not be the time a part will fail, but rather the time a given percentage of the parts will
fail or the probability that a given part will fail in a specified time."
Along with Reliability, Availability and Maintainability, Safety is the fourth metric of a
meta-domain that specialists have named RAMS (also sometimes referred to as
dependability).
Metrics
RAMS metrics relate to time allocation and depend on the operational state of a given
system.
MUT (Mean Up Time) qualifies the average duration during which the system is in its operational state.
MDT (Mean Down Time) qualifies the average duration during which the system is not in its operational state. It comprises the different portions of time required to detect the error, fix it, and restore the system to its operational state.
Thus, for repairable systems, MTBF (Mean Time Between Failures) is a metric commonly used to appraise Reliability, and corresponds to the average time interval (normally specified in hours) between two consecutive occurrences of inoperative states.
MTBF can be calculated (provisional reliability) based on data books such as UTE C80-810 (RDF2000), MIL HDBK-217F, FIDES, RDF 93, and BELLCORE. Other inputs include field feedback, laboratory testing, or demonstrated MTBF (operational reliability), or a combination of these. Remember that MTBF only applies to repairable systems.
MTTF is the mean time before the occurrence of the first failure.
MTTF (and MTBF by extension) is often confused with Useful Life, even though
these two concepts are not related in any way. For example, a battery may have
a Useful Life of 4 hours and an MTTF of 100,000 hours. These figures
indicate that for a population of 100,000 batteries, there will be approximately
one battery failure every hour (defective batteries being replaced).
Mean Down Time is usually very small compared to Mean Up Time. MTBF can therefore be approximated by Mean Up Time and assimilated to MTTF, resulting (for a constant Failure Rate) in the following relationship: MTBF = 1 / λ.
Example:
Thus there is a 93.16% probability that the communication module will not fail over a 5-year period.
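As a sketch of how such a figure is obtained: with a constant Failure Rate, R(t) = e^(-λt) with λ = 1/MTBF. The MTBF used below (618,000 hours) is an assumed value chosen to be consistent with the stated 93.16%; the module's actual MTBF is not given in this excerpt.

```python
import math

HOURS_PER_YEAR = 8760
MTBF_HOURS = 618_000  # assumed value, consistent with the stated 93.16%

def reliability(t_hours: float, mtbf_hours: float) -> float:
    """R(t) = e^(-lambda*t), with lambda = 1 / MTBF (constant Failure Rate)."""
    return math.exp(-t_hours / mtbf_hours)

r_5y = reliability(5 * HOURS_PER_YEAR, MTBF_HOURS)
print(f"R(5 years) = {r_5y:.2%}")  # approximately 93.16%
```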
Typically used as the Failure Rate measurement for non-repairable electronic components, FIT (Failures In Time) is the number of failures in one billion (10^9) hours:
FIT = 10^9 / MTBF or MTBF = 10^9 / FIT
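The conversion above can be sketched directly; the MTBF value used in the example is only an illustrative assumption:

```python
def mtbf_to_fit(mtbf_hours: float) -> float:
    """FIT = 10^9 / MTBF: failures per billion hours."""
    return 1e9 / mtbf_hours

def fit_to_mtbf(fit: float) -> float:
    """MTBF = 10^9 / FIT."""
    return 1e9 / fit

# Illustrative: a component with an assumed MTBF of 500,000 hours
print(mtbf_to_fit(500_000))  # 2000.0 FIT
```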
Safety
Definition
Safety refers to the protection of people, assets and the environment. For example, if an installation has a tank whose internal pressure exceeds a given threshold, there is a high chance of explosion, and eventual destruction of the installation (with injury or death of people and damage to the environment). In this example, the Safety System put in place opens a valve to the atmosphere to prevent the pressure threshold from being crossed.
Maintainability
Definition
Mathematics Basics
Availability
Definition
The term High Availability is often used when discussing Fault Tolerant Systems. For example, your telephone line is supposed to offer you a high level of availability: the service you are paying for has to be effectively accessible and dependable. Your line availability relates to the continuity of the service you are provided. As an example, assume you are living in a remote area with occasional violent storms. Because of your location and the damage these storms can cause, long delays are required to fix your line once it is out of order. In these conditions, if on average your line appears to be usable only 50% of the time, you have poor availability. By contrast, if on average each of your attempts is 100% satisfied, then your line has high availability.
This example demonstrates that Availability is the key metric to measuring a system’s
tolerance level, that it is typically expressed in percent (for example 99.999%), and
that it belongs to the domain of probability.
Mathematics Basics
The Instantaneous Availability of a device is the probability that this device will be in
the functional state for which it was designed, under given conditions and at a given
time (t), with the assumption that the required external conditions are met.
Availability is generally expressed as the ratio Uptime / (Uptime + Downtime), where Downtime includes all repair time (corrective and preventive maintenance time), administrative time and logistic time.
Intrinsic Availability does not include administrative time and logistic time, and usually does not include preventive maintenance time. It is primarily a function of the basic equipment/system design.
Ai = MTBF / (MTBF + MTTR)

Operational Availability

A0 = Uptime / Operating Cycle
This is the availability that the customer actually experiences. It is essentially the a
posteriori availability based on actual events that happened to the system.
Classification
For example, a system that has a five-nines availability rating means that the system is 99.999% available, with a system downtime of approximately 5.26 minutes per year.
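As a sketch, the downtime implied by an availability class can be computed directly (using a 365-day year):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year

def annual_downtime_minutes(availability: float) -> float:
    """Average downtime per year implied by a given availability level."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for nines, a in [("3 nines", 0.999), ("4 nines", 0.9999), ("5 nines", 0.99999)]:
    print(f"{nines}: {annual_downtime_minutes(a):.2f} minutes/year")
```

For five nines this yields approximately 5.26 minutes per year, matching the figure quoted above.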
Reliability
Definition
Mathematics Basics
In many situations a detected disruption fortunately does not mean the end of a
device’s life. This is usually the case for the automation and control systems being
discussed, which are repairable entities. As a result, the ability to predict the number
of shutdowns, due to a detected disruption over a specified period of time, is useful to
estimate the budget required for the replacement of inoperative parts.
In addition, knowing this figure can help you maintain an adequate inventory of spare
parts. Put simply, the question "Will a device work for a particular period" can only be
answered as a probability; hence the concept of Reliability.
Note: Reliability is always indicated for a given period of time, for example one year.
Referring to the system model considered with the "bathtub curve," one characteristic is its constant Failure Rate during the useful life. In that portion of its lifetime, the Reliability of the considered system follows an exponential law, given by the following formula: R(t) = e^(-λt), where λ stands for the Failure Rate.
Note: Considering the flat portion of the bathtub curve model, where the Failure Rate
is constant over time and remains the same for a unit regardless of this unit’s age, the
system is said to be "memoryless.”
Reliability is one of the factors influencing Availability, but must not be confused with Availability: 99.99% Reliability does not mean 99.99% Availability. Reliability measures the ability of a system to function without interruptions, while Availability measures the ability of this system to provide a specified application service level. Higher reliability reduces the frequency of inoperative states, thereby increasing overall Availability.
There is a difference between Hardware MTBF and System MTBF. The mean time
between hardware component failures occurring on an I/O Module, for example, is
referred to as the Hardware MTBF. Mean time between failures occurring on a
system considered as a whole, a PLC configuration for example, is referred to as the
System MTBF.
Based on basic probability computation rules, Reliability Block Diagrams (RBD) are simple, convenient tools to represent a system and its components in order to determine the Reliability of the system.
Series-Parallel Systems
The target system, for example a PLC rack, must first be interpreted in terms of series
and parallel arrangements of elementary parts.
Note: When considering Reliability, two components are described as in series if both
are necessary to perform a given function.
Note: Two components are in parallel, from a reliability standpoint, when the system works if at least one of the two components works. In this example, Power Supplies 1 and 2 are said to be in active redundancy. The redundancy would be described as passive if one of the parallel components is turned on only when the other is inoperative, for example in the case of auxiliary power generators.
Serial RBD
Reliability
Serial system Reliability is equal to the product of the individual elements’ reliabilities.
RS(t) = R1(t) x R2(t) x R3(t) x ... x Rn(t) = ∏(i=1..n) Ri(t)
Example 1:
Consider a system with 10 elements, each of them required for the proper operation
of the system, for example a 10-module rack. To determine RS(t), the Reliability of
that system over a given time interval t, if each of the considered elements shows an
individual Reliability Ri(t) of 0.99:
RS(t) = ∏(i=1..10) Ri(t) = (0.99) x (0.99) x (0.99) . . . (0.99) = (0.99)^10 = 0.9044
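Example 1 can be sketched in code by multiplying the individual reliabilities of the serial system:

```python
from math import prod  # Python 3.8+

r_elements = [0.99] * 10      # ten modules, each with Ri(t) = 0.99
r_system = prod(r_elements)   # serial system: product of reliabilities
print(f"RS(t) = {r_system:.4f}")  # 0.9044
```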
Example 2:
Element 1: λ1 = 120 x 10^-6 h^-1
Element 2: λ2 = 180 x 10^-6 h^-1
λS = ∑(i=1..n) λi = λ1 + λ2 = 120 x 10^-6 + 180 x 10^-6 = 300 x 10^-6 = 0.3 x 10^-3 h^-1
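Example 2 can be sketched the same way: serial Failure Rates add, and (for a constant Failure Rate) the system MTBF and Reliability follow. The 1,000-hour horizon used below is an illustrative assumption:

```python
from math import exp

lambda_1 = 120e-6  # failures per hour, Element 1
lambda_2 = 180e-6  # failures per hour, Element 2

lambda_s = lambda_1 + lambda_2   # serial system: 300e-6 per hour
mtbf_s = 1 / lambda_s            # about 3,333 hours
r_1000h = exp(-lambda_s * 1000)  # reliability over 1,000 hours
print(lambda_s, mtbf_s, r_1000h)
```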
Availability
AS = A1 x A2 x A3 x ... x An = ∏(i=1..n) Ai
Calculation Example
In this example, we calculate the availability of a PAC Station using shared distributed
I/O Islands. The following illustration shows the final configuration:
This calculation applies the equations given by basic probability analysis. To do this
calculation, a spreadsheet was developed. These are the figures applied in the
spreadsheet:
Total serial system Failure Rate: λS = ∑(i=1..n) λi
Unavailability = 1 − Availability
Step 3: Based on the serial structure, add up the results from Steps 1 and 2.
Note: A common variant of in-rack I/O Module stations is the I/O Island, distributed on an Ethernet communication network. Schneider Electric offers a versatile family named Advantys STB, which can be used to define such architectures.
Step 3: Calculation for the entire installation. Assume that the communication network used to link the I/O Islands to the CPU is not included in this calculation (examples of network Reliability metrics are explored in a subsequent chapter).
Note: The highlighted values were calculated in the two previous steps.
Considering the results for this serial system (Rack #1 + Islands #1 ... #4), Reliability over one year is approximately 82% (the probability that this system will encounter a failure during one year is approximately 18%).
Considering Availability, with a 2-hour Mean Time To Repair (typical of a very good logistics and maintenance organization), the system would achieve a 4-nines Availability, corresponding to an average of approximately 24 minutes of downtime per year.
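The availability figure above can be sketched from the stated inputs (one-year Reliability of 82%, 2-hour MTTR), assuming a constant Failure Rate:

```python
import math

HOURS_PER_YEAR = 8760
r_one_year = 0.82   # serial system reliability over one year (from the text)
mttr_hours = 2.0    # Mean Time To Repair

lam = -math.log(r_one_year) / HOURS_PER_YEAR  # implied constant failure rate
mtbf = 1 / lam                                # about 44,000 hours
availability = mtbf / (mtbf + mttr_hours)
downtime_min_per_year = (1 - availability) * HOURS_PER_YEAR * 60
print(f"A = {availability:.6f}, downtime = {downtime_min_per_year:.0f} min/year")
```

This reproduces the order of magnitude quoted in the text: roughly 24 minutes of downtime per year.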
Parallel RBD
Reliability
RRed(t) = 1 − [Q1(t) x Q2(t) x Q3(t) x ... x Qn(t)] = 1 − ∏(i=1..n) Qi(t) = 1 − ∏(i=1..n) (1 − Ri(t))
where ∏(i=1..n) Qi is the probability of failure of the system
Example:
Element 1: λ1 = 120 x 10^-6 h^-1
Element 2: λ2 = 180 x 10^-6 h^-1

R1(1,000 h) = e^(−λ1 t) = e^(−120 x 10^-6 x 10^3) = 0.8869
R2(1,000 h) = e^(−λ2 t) = e^(−180 x 10^-6 x 10^3) = 0.8353
Thus, with individual element Reliabilities of 88.69% and 83.53% respectively, the resulting Redundant System Reliability is 98.14%.
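This parallel-system result can be sketched as:

```python
from math import exp

t = 1000  # hours
r1 = exp(-120e-6 * t)  # Element 1: 0.8869
r2 = exp(-180e-6 * t)  # Element 2: 0.8353
r_red = 1 - (1 - r1) * (1 - r2)  # parallel (redundant) system reliability
print(f"R_red = {r_red:.4f}")  # 0.9814
```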
Availability
AS = 1 − [(1 − A1) x (1 − A2) x ... x (1 − An)] = 1 − ∏(i=1..n) (1 − Ai)
Calculation Example
The formulas are the same as used in the previous calculation example, except for
the calculation of the reliability for a parallel system, which is as follows:
RRed(t) = 1 − [Q1(t) x Q2(t) x ... x Qn(t)] = 1 − ∏(i=1..n) Qi(t) = 1 − ∏(i=1..n) (1 − Ri(t))
Step 2: Perform the calculation for the redundant structure, here the two CPUs.
Note: The previous results from the serial analysis, regarding the calculation of the standalone elements, are reused.
Because the analysis is identical to that for the serial case, the following screenshot shows the spreadsheet corresponding to the final results only:
Note: The only difference between this architecture and the previous one relates to the CPU Rack: this one is redundant (Premium Hot Standby), while the former one was standalone.
Looking at the results for this parallel system (Premium CPU Rack redundancy), Reliability over one year would be approximately 99.9%, compared to 97.4% with a standalone Premium CPU Rack (i.e. the probability for a Premium Rack System to encounter a failure during one year would be reduced from 2.6% to 0.1%). System MTBF itself would increase from 335,000 hours (approximately 38 years) to 503,000 hours (approximately 57 years).
For System Availability, a 2-hour Mean Time To Repair provides approximately a 9-nines resulting Availability (almost 100%).
Note: Other calculation examples are available in the Calculation Examples chapter.
Note: The previous examples cover the PAC station only. To extend the calculation to a whole system, the MTBF of the network components and the SCADA systems (PCs, servers) must be taken into account.
Conclusion
Serial System
This table indicates that even though a very high availability Part Y was used, the
overall availability of the system was reduced by the low availability of Part X. A
common saying indicates that "a chain is as strong as the weakest link", however, in
this instance a chain is actually “weaker than the weakest link.”
Parallel System
The above computations indicate that the combined availability of two components in parallel is always higher than the availability of either individual component.
The following table gives an example of combined availability in a parallel system:
This indicates that even though a very low availability Part X was used, the overall
availability of the system is much higher. Thus redundancy provides a very powerful
mechanism for making a highly reliable system from low reliability components.
High Availability with Collaborative Control System
This chapter provides answers to these questions, and reviews the system
architecture from top to bottom, that is, from operator stations and data servers
(Information Management) to Controllers and Devices (Control System Level), via
communication networks (Communication Infrastructure Level).
Redundancy Level
Key Features
The key features that Vijeo Citect SCADA software has to handle relate to:
• Data acquisition
• Recipes
• Reports
Note: This model is applicable for a single station (PC), including for small
applications. The synthesis between the stakes and the key features will help to
determine the most appropriate redundant solution.
Stakes
Considering the previously defined key features, stakes when designing a SCADA
system include:
• Is a changeover possible?
Risk analysis
Linked to the previous stakes, the risk analysis is essential to defining the redundancy
level. Consider the events the SCADA system will face, i.e. the risk, in terms of the
following:
• Inoperative hardware
This can imply loss of data, operator screens, connection with devices, and so on.
Level Definition
Finally, the redundancy level is defined as the compilation of the key features, the
stakes, and the risk analysis with the customer expectations related to the data
criticality level. The following table illustrates the flow from the process analysis to the
redundancy level:
This section examines various redundancy solutions. A Vijeo Citect SCADA system
can be redundant at the following levels:
• Data servers
Dedicated servers handle Alarms, Trends and Reports, respectively. In addition, this functional architecture includes at least one I/O Server. The I/O Server acts as a Client to the peripheral devices (PAC) and as a Server to the Alarms, Trends and Reports (ATR) entities.
As shown in the figure above, ATR and I/O Server(s) act either as a Client or as a
Server, depending on the designated relationship. The default mechanism linking
these Clients and Servers is based on a Publisher / Subscriber relation.
As shown in the following screenshot, because of its client server model, Vijeo Citect
can create a dedicated server, depending on the application requirements: for
example for ATR, or for I/O server:
Vijeo Citect is able to manage the redundancy at the operator station level with
several client workstations. These stations can be located in the control room or
distributed in the plant close to the critical part of the process. A web client interface
can also be used to monitor and control the plant using a standard web browser.
If an operator station becomes inoperative, the plant can still be monitored and controlled using an additional operator screen.
Vijeo Citect can define Primary and Standby servers within a project, with each
element of a pair being held by different hardware.
The first level of redundancy duplicates the I/O server or the ATR server, as shown in the illustration. In this case, a Standby server is maintained in parallel to the Primary server. In the event of a detected interruption on the hardware, the Standby server will assume control of the communication with the devices.
Note: Vijeo Citect reconnects through the primary data path when it is returned to service.
Redundant LAN
This scheme is an extension of I/O Device Redundancy, providing for more than one Standby I/O Device. Depending on the user configuration, a given order of priority applies when an I/O Server (potentially a redundant one) needs to switch to a Standby I/O Device. For example, in the figure above I/O Device 3 would be allotted the highest priority, then I/O Device 2, then finally I/O Device 4.
In those conditions, in case of a detected interruption occurring on Primary I/O Device 1, a switchover would take place, with I/O Server 2 handling communications with Standby I/O Device 3. If an interruption is then detected on I/O Device 3, a new switchover would take place, with I/O Server 1 handling communications with Standby I/O Device 2. Finally, if there is an interruption on I/O Device 2, another switchover would take place, with I/O Server 2 handling communications with Standby I/O Device 4.
Clustering
A cluster may contain several possibly redundant I/O Servers (maximum of one per machine), and standalone or redundant ATR servers; these latter servers can be implemented either on a common machine or on separate machines.
With this scenario, each site is represented with a separate cluster, grouping its
primary and standby servers. Clients on each site are interested only in the local
cluster, whereas clients at the central control room are able to view all clusters.
Based on cluster design, each site can then be addressed independently within its
own cluster. As a result, deployment of a control room scenario is fairly
straightforward, with the control room itself only needing display clients.
The cluster concept does not actually provide an additional level of redundancy.
Regarding data criticality, clustering organizes servers, and consequently provides
additional flexibility.
Each cluster contains only one pair of ATR servers. Those pairs of servers, redundant to each other, must be on different machines.
Each cluster can contain an unlimited number of I/O servers; placing those servers on different machines also increases the level of system availability.
Conclusion
(Figure: redundancy levels, from SCADA Clients and Data Servers through the Control Network to the Targeted Devices)
Plant Topology
The first step of the communication infrastructure level definition is the plant topology
analysis. From this survey, the goal is to gather information to develop a networking
system diagram, prior to defining the network topologies.
• Location of the stations and the nodes included in the areas to be connected
• Location of the existing networks and cabling paths, in the event of expansion or redesign
Before defining the network topologies, the following project requirements must be
considered:
• Cost constraints
From the project and the plant analyses, identify the most critical areas:
(Figure: criticality analysis, mapping the project process to devices such as valves, pumps and measurements)
Network Topology
Topologies
Following the criticality analysis, the networking diagram can be defined by selecting
the relevant Network topology.
The following table describes the four main topologies from which to choose:
• Bus: Traffic must flow serially, so the bandwidth is not used efficiently. Cost-effective solution. If a switch becomes inoperative, the communication is lost.
• Star: Constrained by cable ways and distances. Efficient use of the bandwidth, as the traffic is spread across the star; preferred topology when there is no need for redundancy. If the main switch becomes inoperative, the communication is lost.
• Ring: Possible to couple other rings to increase redundancy.
Note: These different topologies can be mixed to define the plant network diagram.
In automation architecture, Ring (and Dual ring) topologies are the most commonly
used to increase the availability of a system.
Mesh architecture is not used in process applications; therefore we will not discuss it
in detail. All these topologies can be implemented using Schneider Electric ConneXium switches.
Ring Topology
Consider an Ethernet loop designed with such an RM (Redundancy Manager) switch. In normal conditions, this RM switch opens the loop, which prevents Ethernet frames from circulating endlessly.
A mix of Dual Networking and Network Redundancy is possible. Note that in such a design, a SCADA I/O Server has to be equipped with two communication boards and, correspondingly, each device (PLC) has to be allotted two Ethernet ports.
This may be an effective design, considering the junction between trunk and satellites,
especially if backbone and satellite networks have been designed as ring networks to
provide for High Availability.
With the ConneXium product line, Schneider Electric offers switches that provide redundant coupling. Several variants allow connection to the network. Each of these variants features two "departure" switches on the backbone network. Each departure switch crosses a separate link to access the satellite network. These variants include:
The following illustration shows the architecture, which allows the combination of two rings managed by a single switch.
The following architecture allows extension of the main network to other segments:
Each of the following daisy-chainable devices provides one "in" port and one "out" port. The advantage of such a daisy-chainable device is that its installation inside an Ethernet network requires only two cables.
• Advantys STB dual port Ethernet communication adapter (STB NIP 2311)
• Advantys ETB IP67 dual port Ethernet
• Motor controller TeSys T
• Variable speed drive ATV 61/71 (VW3A3310D)
• PROFIBUS DP V1 Remote Master
• ETG 30xx Factorycast gateway
Note: Assuming no specific redundancy protocol is selected to handle the daisy
chain loop, expected loop reconfiguration time on failover is approximately one
second.
Daisy chaining topologies can be coupled to dual Ethernet rings using TCSESM
ConneXium switches.
RSTP stands for Rapid Spanning Tree Protocol (IEEE 802.1w standard)¹. Based on STP, RSTP introduces additional parameters that must be entered during the switch configuration. These parameters are used by the RSTP protocol during the path selection process; because of them, the reconfiguration time is much faster than with STP (typically less than one second).
¹ The new edition of the 802.1D standard, IEEE 802.1D-2004, incorporates the IEEE 802.1t-2001 and IEEE 802.1w standards.
The new release of TCS ESM ConneXium switches allows better RSTP performance, with a detection time of 15 ms and a propagation time of 15 ms per switch. For a 6-switch configuration, the recovery time is about 105 ms.
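A sketch of this recovery-time estimate, assuming the model implied by the text (one detection time plus one propagation time per switch in the ring):

```python
def rstp_recovery_ms(n_switches: int,
                     detection_ms: float = 15.0,
                     propagation_ms: float = 15.0) -> float:
    """Estimated ring recovery time: one detection plus one propagation per switch."""
    return detection_ms + n_switches * propagation_ms

print(rstp_recovery_ms(6))  # 105.0 ms, as in the text
```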
HIPER-Ring (Version 1)
Note: The Redundancy Manager switch is said to be active when it opens the
network.
² When configuring a ConneXium TCS ESM switch for HIPER-Ring V1, the user is asked to choose between a maximum Standard Recovery Time, which is 500 ms, and a maximum Accelerated Recovery Time, which is 300 ms.
MRP is an IEC 62439 industry standard protocol based on HIPER-Ring. Therefore, all switch manufacturers can implement MRP if they choose to, which allows a mix of different manufacturers in an MRP configuration. Schneider Electric switches support a selectable maximum recovery time of 200 ms or 500 ms, and a maximum ring configuration of 50 switches.
TCS ESM switches also support redundant coupling of MRP rings. MRP rings can easily be used instead of HIPER-Ring. MRP requires that all switches be configured via web pages, and allows for a recovery time of 200 ms or 500 ms. Additionally, the I/O network could be an MRP redundant network and the control network a HIPER-Ring, or vice versa.
Fast HIPER-Ring
A new family of ConneXium switches, named TCS ESM Extended, is coming. It will offer a third version of the HIPER-Ring strategy, named Fast HIPER-Ring.
Featuring a guaranteed recovery time of less than 10 milliseconds, the Fast HIPER-Ring structure allows both a cost-optimized implementation of a redundant network and maintenance and network extension during operation. This makes Fast HIPER-Ring especially suitable for complex applications such as combined transmission of video, audio and data information.
Selection
To end the communication level section, the following table presents the communication protocols and helps you select the most appropriate installation for your high availability solution:
• New installation: MRP. All switches are configured via web pages.
• Complex architecture: MRP, RSTP or Fast HIPER-Ring. We recommend MRP or RSTP for High Availability with a dual ring, and Fast HIPER-Ring for high performance.
Having detailed High Availability aspects at the Information Management level and at
the Communication Infrastructure level, we will now concentrate on High Availability
concerns at the Control Level. Specific discussion will focus on PAC redundancy.
Redundancy Principles
Modicon Quantum and Premium PAC provide Hot Standby capabilities, and have
several shared principles:
2. The units are synchronized. The standby unit is aligned with the Primary unit.
Also, on each scan, the Primary unit transfers to the Standby unit its "database,”
that is, the application variables (located or not located) and internal data. The
entire database is transferred, except the "Non-Transfer Table", which is a
sequence of Memory Words (%MW). The benefit of this transfer is that, in case of
a switchover, the new Primary unit will continue to handle the process, starting
with updated variables and data values. This is referred to as a "bumpless"
switchover.
4. For any Ethernet port acting as a server (Modbus/TCP or HTTP protocol) on the
Primary unit, its IP address is implicitly incremented by one on the Standby unit.
In case a switchover occurs, homothetic addresses will automatically be
exchanged. The benefit of this feature is that seen from a SCADA/HMI, the
"active" unit is still accessed at the same IP address. No specific adaptation is
required at the development stage of the SCADA / HMI application.
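The address rule can be illustrated with the standard Python ipaddress module. The addresses below are hypothetical, chosen only for the example; no PLC interaction is involved:

```python
# The Standby unit answers at the Primary server address plus one,
# as described above. Purely illustrative address arithmetic.
import ipaddress

def standby_address(primary_ip):
    # IPv4Address objects support integer arithmetic
    return ipaddress.ip_address(primary_ip) + 1

print(standby_address("192.168.1.10"))  # 192.168.1.11
```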
5. The common programming environment used with both solutions is Unity Pro. No particular restrictions apply when using the standardized (IEC 61131-3) instruction set. In addition, the portion of code specific to the Hot Standby system is optional, and is used primarily for monitoring purposes. This means that for any given application, the difference between its implementation on a standalone architecture and its implementation on a Hot Standby architecture is largely cosmetic.
Consequently, a user familiar with one type of Hot Standby system does not have to start from scratch when using the other: the initial investment is preserved and reusable, and only a few differences between the two technologies must be learned.
The following table presents the available configurations with either a Quantum or
Premium PLC:
For a redundant PAC architecture, both units require two interlinks, used to execute different types of diagnostics, to orient the election of the Primary unit, and to achieve synchronization between both machines. The first of these "Sync Links," the CPU Sync Link, is a dedicated optic fiber link anchored on the Ethernet port local to the CPU module. On a Quantum Hot Standby architecture, this port is dedicated exclusively to this use. The second, the Remote I/O Sync Link, is not an additional link: the Hot Standby system uses the existing Remote I/O medium, which hosts both machines and thus provides them with an opportunity to communicate.
One benefit of the CPU optic fiber port is its inherent capability to have the two units
installed up to 2 km apart, using 62.5/125 multimode optic fiber. The Remote I/O Sync
Link can also run through optic fiber, provided that Remote I/O Communication
Processor modules are coupled on the optic fiber.
Note: All I/O modules are accepted on remote I/O racks, except the 140 HLI 340 00 (Interrupt module). Among currently available Ethernet adapters, the 140 NWM 100 00 communication module is not compatible with a Hot Standby system, and the EtherNet/IP adapter (140 NOC 771 00) is not compatible with Quantum Hot Standby in Step 1.
Up to 6 communication modules, such as NOE Ethernet TCP/IP adapters, can be handled by a Quantum unit, whether it is a standalone unit or part of a Hot Standby architecture.
Up to 31 Remote I/O stations can be handled from a Quantum CPU rack, whether
standalone or Hot Standby. Note that the Remote I/O service payload on scan time is
approximately 3 to 4 ms per station.
Note: Acceptable communication modules are Modbus Plus adapters, Ethernet TCP/IP adapters, EtherNet/IP adapters and PROFIBUS DP V1 adapters.
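The quoted per-station payload gives a quick way to estimate the scan-time cost of a given Remote I/O drop count. This is a rough estimate using the midpoint of the 3 to 4 ms range quoted above, not a guaranteed figure:

```python
# Rough scan-time payload estimate for Quantum Remote I/O stations,
# using ~3.5 ms per station (midpoint of the 3-4 ms quoted above).
def rio_scan_payload_ms(stations, per_station_ms=3.5):
    return stations * per_station_ms

print(rio_scan_payload_ms(31))  # 108.5 ms for the 31-station maximum
```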
Depending on the selected I/O Bus technology, a specific layout may result in
enhanced availability.
58
High Availability with Collaborative Control System
A high-level feature is currently being added to Quantum Hot Standby application design: CCTF, or Configuration Change on The Fly. This new feature will allow you to modify the configuration of an existing, running PLC application program without having to stop the PLC. As an example, consider the addition of a new discrete or analog module on a remote Quantum I/O station. As with a CPU firmware version upgrade executed on a Quantum Hot Standby architecture, a CCTF change is executed sequentially, one unit at a time. This is an obvious benefit for applications that cannot afford any stop, which now become open to architecture modifications or extensions.
Note: Counting, motion, weighing and safety modules are not accepted. On the communication side, apart from the Modbus modules TSX SCY 11 601/21 601, only currently available Ethernet TCP/IP modules are accepted. The EtherNet/IP adapter (TSX ETC 100) is likewise not compatible with Premium Hot Standby in Step 1.
Note: TSX ETY 4103 or TSX ETY 5103 communication module.
The following picture illustrates the Premium Ethernet configuration, with the CPU
and the ETY Sync links:
61
High Availability with Collaborative Control System
Schneider Electric has supported the Transparent Ready strategy for several years. In addition to SCADA and HMI, Variable Speed Drives, Power Meters, a wide range of gateways, and Distributed I/O with Ethernet connectivity, such as Advantys STB, are also proposed. In addition, many manufacturers offer devices capable of communicating on Ethernet using the Modbus TCP protocol. These different contributions building on the Modbus protocol design legacy have helped make Ethernet a preferred general-purpose communication support for automation architectures.
The typical target of such a communication contract is an I/O block, hence the name "I/O Scanner". However, the I/O Scanner service may also be used to implement data exchanges with any type of equipment, including another PLC, provided that equipment can behave as a Modbus/TCP server and respond to multiple-word access requests.
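As a concrete illustration of such a multiple-word access, the request the I/O Scanner sends is an ordinary Modbus/TCP "Read Holding Registers" telegram. The sketch below builds one with the standard struct module, following the open Modbus/TCP framing; no network I/O is performed, and the register address and count are arbitrary examples:

```python
# Build a Modbus/TCP "Read Holding Registers" (function code 3) request:
# MBAP header (transaction id, protocol id 0, remaining length, unit id)
# followed by the PDU (function code, start address, register count).
import struct

def read_holding_registers_request(transaction_id, unit_id, start_addr, quantity):
    pdu = struct.pack(">BHH", 0x03, start_addr, quantity)
    mbap = struct.pack(">HHHB",
                       transaction_id,
                       0x0000,        # protocol id is always 0 for Modbus
                       1 + len(pdu),  # length field: unit id + PDU bytes
                       unit_id)
    return mbap + pdu

frame = read_holding_registers_request(1, 1, 0, 4)
print(frame.hex())  # 000100000006010300000004
```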
from a power cut-off, the former Primary may not be able to close the connections it had opened. These connections will be closed after expiration of a Keep Alive timeout.
In case of a switchover, proper communications will typically recover after one initial cycle of I/O scanning. However, the worst-case gap for the address swap, with I/O scanning, is 500 ms plus one initial cycle of I/O scanning. As a result, this mode of communication, and hence architectures with Distributed I/O on Ethernet, is not preferred for a control system that regards time criticality as an essential criterion.
As demonstrated in the previous chapter, Ethernet TCP/IP used with products like
Connexium offers real opportunities to design enhanced availability architectures,
handling communication between the Information Management Level and the Control
Level. Such architectures, based on a Self Healing Ring topology, are also applicable
when using Ethernet TCP/IP as a fieldbus.
Note that Connexium accepts Copper or Optic Fiber rings. In addition, dual
networking is also applicable at the fieldbus level.
Profibus Architecture
The Configuration Builder can also be configured to pass Unity Pro a set of DFBs,
allowing easy implementation of Acyclic operations.
Each Quantum PLC can accept up to 6 of these DP Master modules, each handling its own PROFIBUS network. The PTQ PDPM V1 Master Module is also compatible with a Quantum Hot Standby implementation. Only the Master Module in the Primary unit is active on the PROFIBUS network; the Master Module on the Standby unit stays in a dormant state until awakened by a switchover.
Automatic symbol generation provides Unity Pro with data structures corresponding
to data exchanges and diagnostic information. A set of DFBs is delivered that allows
an easy implementation of acyclic operations.
Notes:
- Planned First Customer Shipment: Q4 2009
- M340, Premium or Quantum
- Version 5.0
Systematic checks are executed cyclically by any running CPU, in order to detect a
potential hardware corruption, such as a change affecting the integrity of the Copro,
the sub-part of the CPU module that hosts the integrated Ethernet port. Another
example of a systematic check is the continuous check of the voltage levels provided
by the power supply module(s). In case of a negative result during these hardware
health diagnostics, the tested CPU will usually switch to a Stop State.
When the unit in question is part of a Hot Standby System, in addition to these
standard hardware tests separately executed on both machines, more specific tests
are conducted between the units. These additional tests involve both Sync Links. The
basic objective is to confirm that the Primary unit is effectively operational, executing
the application program, and controlling the I/O exchanges. In addition, the system
must verify that the current Standby unit is able to assume control after a switchover.
If an abnormal situation occurs on the current Primary unit, it gives up control and
switches either to Off-Line state (the CPU is not a part of the Hot Standby system
coupling) or to Stop State, depending on the event. The former Standby unit takes
control as the new Primary unit.
Controlled Switchover
As previously indicated, the Hot Standby system is controlled through the %SW60 system register. Each unit owns an individual bit in this system Command Register that decides whether or not that particular unit makes itself available to "hook" to the other unit. An operational, hooked redundant Hot Standby system requires both units to indicate this intent. Consequently, executing a switchover controlled by the application on a hooked system is straightforward: it requires briefly toggling the decision bit that controls the current Primary unit's "hooking" intent. The first toggle transition switches the current Primary unit to Off-Line State and makes the former Standby unit take control. The next toggle transition makes the former Primary unit return and hook as the new Standby unit.
Hence, the application program can decide on a Hot Standby switchover, having
registered a steady state negative diagnostic on the Ethernet adapter linking the
Primary unit to the "Process Network", and being at the same time informed that the
Standby unit is fully operational.
Note: HSBY_WR DFB executes a write access on HSBY Control Register (%SW60).
Switchover Latencies
The following table details the typical and maximum swap time delays encountered when reestablishing Ethernet services during a switchover event (Premium and Quantum configurations):

Service          | Typical delay                                                                  | Maximum delay
I/O Scanning     | 1 initial cycle of I/O scanning                                                | 500 ms + 1 initial cycle of I/O scanning
Server Messaging | 1 MAST task cycle + the time required by the client to reestablish its connection with the server (1) | 500 ms + the time required by the client to reestablish its connection with the server (1)
FTP/TFTP Server  | the time required by the client to reestablish its connection with the server (1) | 500 ms + the time required by the client to reestablish its connection with the server (1)
HTTP Server      | the time required by the client to reestablish its connection with the server (1) | 500 ms + the time required by the client to reestablish its connection with the server (1)

(1) The time the client requires to reconnect with the server depends on the client communication loss timeout settings.
Selection
To end the control level section, the following table presents the four main criteria that help you select the most appropriate configuration for your high availability solution:
Conclusion
The following tables provide a brief reminder of essential characteristics for Premium
and Quantum Hot Standby solutions, respectively:
This chapter has covered functional and architectural redundancy aspects at the Information Management level, the Communication Infrastructure level, and the Control level.
This section summarizes the main characteristics and properties of Availability for
Collaborative Control automation architectures.
Chapter 1 demonstrated that Availability is dependent not only on Reliability, but also
on Maintenance as it is provided to a given system. The first level of contribution,
Reliability, is primarily a function of the system design and components. Component
and device manufacturers thus have a direct but not exclusive influence on system
Availability. The second level of contribution, Maintenance and Logistics, is totally
dependent on end customer behavior.
Chapter 3 explored a central focus of this document, Redundancy, and its application
at the Information Management Level, Communication Infrastructure Level and
Control System Level.
This final chapter summarizes customer benefits provided by Schneider Electric High
Availability solutions, as well as additional information and references.
Benefits
Standard Offer
One key concept of High Availability is that redundancy is not a default design characteristic at any system level. Instead, redundancy can be added locally, in most cases using standard components.
Simplicity of Implementation
System Transparency
For Client Display Stations, Dual Path Supervisory Networks, Redundant I/O Servers, or Dual Access to Process Networks, each redundant contribution is handled separately by the system. For example, a concurrent Display Client's communication flow will be transparently re-routed to the I/O Server by the Supervisory Network in case of a cable disruption. This flow will also be transparently routed to the alternative I/O Server in case of a sudden malfunction of the first server. Finally, the I/O Server may transparently leave the communication channel it uses by default if that channel ceases to operate properly, or if the target PLC does not respond through this channel.
The "IP address automatic switch" for a SCADA application communicating through Ethernet is an important feature of Schneider Electric PLCs. Apart from simplifying the design of the SCADA application implementation, which may otherwise cause delays and increased cost, this feature also contributes to reducing the payload of a communication context exchange on a PLC switchover.
Ease of Use
As previously stated, increased effort has been made to make the implementation of
a redundant feature simple and straightforward.
The Vijeo Citect, ConneXium Web pages and Unity Pro software environments offer clear and accessible configuration windows, along with dedicated selective help, for executing the required parameterization.
In case of a specific need for detailed dependability (RAMS) studies, for any type of
architecture, contact the Schneider Electric Safety Competency Center. This center
has skilled and experienced individuals ready to help you with all your needs.
Appendix
Glossary
Note: the references in brackets refer to standards, which are specified at the end of this glossary.
1) Active Redundancy
Redundancy where the different means required to accomplish a given function are
present simultaneously [5]
2) Availability
3) Common Mode Failure
Failure that affects all redundant elements for a given function at the same time [2]
4) Complete failure
Failure which results in the complete inability of an item to perform all required
functions [IEV 191-04-20] [2]
5) Dependability
Collective term used to describe availability performance and its influencing factors:
reliability performance, maintainability performance and maintenance support
performance [IEV 191-02-03] [2]
Note: Dependability is used only for general descriptions in non-quantitative terms.
6) Dormant
A state in which an item is able to function but is not required to function. Not to be confused with downtime [4]
7) Downtime
8) Failure
Termination of the ability of an item to perform a required function [IEV 191-04-01] [2]
9) Failure Analysis
The act of determining the physical failure mechanism resulting in the functional failure of a component or piece of equipment [1]
10) Failure Mode and Effects Analysis (FMEA)
Procedure for analyzing each potential failure mode in a product, to determine the results or effects on the product. When the analysis is extended to classify each potential failure mode according to its severity and probability of occurrence, it is called a Failure Mode, Effects, and Criticality Analysis (FMECA). [6]
11) Failure Rate
Total number of failures within an item population, divided by the total number of life units expended by that population, during a particular measurement period under stated conditions [4]
12) Fault
Hidden Failure
Failure occurring that is not detectable by or evident to the operating crew [1]
Inherent Availability
A measure of Availability that includes only the effects of an item design and its application, and does not account for effects of the operational and support environment. Sometimes referred to as "intrinsic" availability [4]
17) Integrity
18) Maintainability
Probability that an item can be retained in, or restored to, a specified condition when
maintenance is performed by personnel having specified skill levels, using prescribed
procedures and resources, at each prescribed level of maintenance and repair. [4]
Time includes the actual repair time plus all delay time associated with a repair
person arriving with the appropriate replacement parts [4]
22) MTBF
A basic measure of reliability for repairable items. The mean number of life units
during which all parts of the item perform within their specified limits, during a
particular measurement interval under stated conditions. [4]
MTTF
A basic measure of reliability for non-repairable items. The total number of life units of an item population divided by the number of failures within that population, during a particular measurement interval under stated conditions. [4]
Note: Used with repairable items, MTTF stands for Mean Time To First Failure
26) Non-Detectable Failure
27) Redundancy
Existence of more than one means for accomplishing a given function. Each means
of accomplishing the function need not necessarily be identical. The two basic types
of redundancy are active and standby. [4]
28) Reliability
Ability of an item to perform a required function under given conditions for a given
time interval [IEV 191-02-06] [2]
Note 1: It is generally assumed that an item is in a state to perform this required
function at the beginning of the time interval
Note 2: the term “reliability” is also used as a measure of reliability performance (see
IEV 191-12-01)
29) Repairability
Probability that a failed item will be restored to operable condition within a specified
time of active repair [4]
30) Serviceability
Relative ease with which an item can be serviced (i.e. kept in operating condition). [4]
31) Standby Redundancy
Redundancy in which some or all of the redundant items are not operating continuously but are activated only upon failure of the primary item performing the function(s). [4]
32) System Downtime
Time interval between the reporting of a system (product) malfunction and the time
when the system has been repaired and/or checked by the maintenance person, and
no further maintenance activity is executed. [4]
34) Unavailability
State of an item of being unable to perform its required function [IEV 603-05-05] [2]
Note: Unavailability is expressed as the fraction of expected operating life that an
item is not available, for example given in minutes per year
35) Uptime
That element of Active Time during which an item is in condition to perform its
required functions. (Increases availability and dependability). [4]
[3] IEEE Std C37.1™-2007: Standard for SCADA and Automation System
[5] IEC-271-194
[7] Reliability, Quality, and Safety for Engineers - B.S. Dhillon - CRC Press
Standards
General purpose
FMEA/FMECA
IEC 60812 (1985) - Analysis techniques for system reliability - Procedures for failure mode and
effect analysis (FMEA)
MIL-STD 1629A (1980) Procedures for performing a failure mode, effects and criticality analysis
IEC 61078 (1991) Analysis techniques for dependability - Reliability block diagram method
Markov Analysis
RAMS
IEC 62278 (2002) - Railway applications - Specification and demonstration of reliability, availability,
maintainability and safety (RAMS)
Functional Safety
IEC 61511 (2003) Functional safety - Safety instrumented systems for the process industry sector.
Calculation Examples
The root piece of information required for all of these calculations, for all major components implemented in our architectures, is the MTBF figure. MTBFs are normally provided by the manufacturer, either on request or in dedicated documents. For Schneider Electric PLCs, MTBF information can be found on the intranet site Pl@net, under Product offer > Quality Info.
Note: The MTBF information is usually not accessible via the Schneider Electric
public website.
Standalone Architecture
First, calculate the individual modules’ MTBFs, using a spreadsheet that will do the
calculations (examples include Excel and Open Office).
The calculation guidelines are derived from the main conclusion given by Serial RBD analysis, that is: λS = λ1 + λ2 + ... + λn. The Equivalent Failure Rate for n serial elements is equal to the sum of the individual Failure Rates of these elements, with RS(t) = e^(−λS·t).
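The rule can be sketched directly in Python; the failure rates below are illustrative placeholders, in failures per hour:

```python
# Serial RBD: the system failure rate is the sum of the element
# failure rates, and reliability over t hours is exp(-lambda_s * t).
import math

def serial_failure_rate(lambdas):
    return sum(lambdas)

def reliability(lambda_s, t_hours):
    return math.exp(-lambda_s * t_hours)

lam = serial_failure_rate([2e-6, 5e-6, 1e-6])   # illustrative element rates
print(reliability(lam, 8760))                   # one-year survival, ~0.93
```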
The first operation identifies the individual MTBF figures of the part references populating the target system. Using these figures, a second sheet then considers the item, group and system levels.
- Group Failure Rate: individual item Failure Rate times the number of items in the considered group
- System Reliability over one year: exp(−System Failure Rate × 8760)
  (where 8760 = 365 × 24, the number of hours in one year)
- System Availability (with MTTR = 2 h): System MTBF / (System MTBF + 2)
Looking at the example results, Reliability over one year is approximately 83% (this means that the probability for this system to encounter a failure during one year is approximately 17%).
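Those spreadsheet steps can be reproduced in a few lines; the system failure rate below is a hypothetical value chosen so the one-year reliability lands near the 83% quoted above:

```python
# Reliability over one year and availability with MTTR = 2 h,
# following the formulas listed above. lambda_system is hypothetical.
import math

HOURS_PER_YEAR = 8760
lambda_system = 2.13e-5                  # failures per hour (illustrative)
mtbf = 1 / lambda_system

reliability_1y = math.exp(-lambda_system * HOURS_PER_YEAR)
availability = mtbf / (mtbf + 2)         # MTTR = 2 h
downtime_min_per_year = (1 - availability) * HOURS_PER_YEAR * 60

print(round(reliability_1y, 3))          # ~0.83
print(round(downtime_min_per_year, 1))   # ~22 minutes per year
```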
Redundant Architecture
If we consider RS as the Standalone System Reliability and RRed as the Redundant (parallel) System Reliability, the classical two-element parallel result gives RRed = 1 − (1 − RS)².
Note: Formal calculations should also take into account undetected errors on the redundant architecture, which would provide somewhat less optimistic figures.
A complete analysis should also take into account the additional wiring devices typically used in a massive I/O redundancy strategy, feeding homothetic input points with the same input signal, and bringing homothetic output points onto the same output signal.
Also, with this software, a parallel structure has been retained, with the same Failure Rate on the Standby rack as on the Primary rack.
For this Serial System (Rack #1 + Rack #2), Reliability over one year is approximately 82.8% (the probability for this system to encounter a failure during one year is approximately 17%).
Regarding Availability, with a 2-hour Mean Time To Repair (which corresponds to a very good logistics and maintenance organization), we obtain a resulting Availability of four nines, that is, an average of approximately 23 minutes of downtime per year.
Note: As expected, Reliability and Availability figures resulting from this Serial
implementation are determined by the weakest link of the chain.
System MTBF would increase from approximately 157,000 hours (approximately 18 years) to 235,000 hours (approximately 27 years). Regarding System Availability, with a 2-hour Mean Time To Repair, we would obtain a resulting Availability of nine nines, close to 100%.
Note: As expected, Reliability and Availability figures resulting from this Parallel
implementation are better than the best of the individual figures for different links of
the chain.
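The jump from roughly 157,000 h to 235,000 h matches the classical textbook result for two identical, independent elements in active parallel with no repair assumed: the parallel MTBF is 1.5/λ, that is, 1.5 times the single-unit MTBF. A quick check, using the figures from the example:

```python
# 1-out-of-2 active parallel, identical independent elements, no repair:
# MTBF_parallel = 2/lambda - 1/(2*lambda) = 1.5/lambda = 1.5 * MTBF_single
mtbf_single = 157_000                  # hours, from the example above
print(1.5 * mtbf_single)               # 235500.0, close to the 235,000 h quoted

def parallel_reliability(r_single):
    """R_red = 1 - (1 - R_s)**2 for two identical elements."""
    return 1 - (1 - r_single) ** 2

print(parallel_reliability(0.83))      # ~0.971
```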
A common misuse of this Reliability figure would be to argue from it a potential benefit in Reliability and Availability for the whole resulting system.
Note: This example has considered the Sub-System Failure Rate provided by the reliability software.
Regarding the Serial System built from the Redundant CPU Racks and the STB Islands, the worksheet above shows a resulting Reliability (over one year) of 84.06%, the Standalone System Reliability during the same period of time being 81.95%.
As a result, this data suggests that implementing CPU Rack Redundancy would have almost no benefit. Of course, this is an incorrect conclusion, and the example suggests a simple rule: always compare comparable items. If we implement redundancy on the CPU Control Rack in order to increase process control core Reliability, and to an extent Availability, we then need to examine and compare the figures only at this level, since the rest of the system has not received any additional redundancy.
The previous case studies provide several important points concerning Reliability metrics evaluation:
• Provided elementary MTBF figures are available, a "serial" system is quite easy to evaluate, thanks to the "magic bullet" formula: λ = 1 / MTBF
Schneider Electric Industries SAS
Head Office
89, bd Franklin Roosvelt
92506 Rueil-Malmaison Cedex
FRANCE
www.schneider-electric.com

Due to evolution of standards and equipment, characteristics indicated in texts and images in this document are binding only after confirmation by our departments.

Version 1 - 06 2009