
High Availability Computer Systems

Jim Gray                                     Daniel P. Siewiorek
Digital Equipment Corporation                Department of Electrical Engineering
455 Market St., 7th Floor                    Carnegie Mellon University
San Francisco, CA 94105                      Pittsburgh, PA 15213
Abstract: The key concepts and techniques used to build high availability computer
systems are (1) modularity, (2) fail-fast modules, (3) independent failure modes, (4)
redundancy, and (5) repair. These ideas apply to hardware, to design, and to software.
They also apply to tolerating operations faults and environmental faults. This article
explains these ideas and assesses high-availability system trends.
Overview
It is paradoxical that the larger a system is, the more critical its availability is, and the
more difficult it is to make it highly available. It is possible to build small ultra-available
modules, but building large systems involving thousands of modules and millions of lines
of code is still an art. These large systems are a core technology of modern society, yet
their availability is still poorly understood.

This article sketches the techniques used to build highly available computer systems. It
points out that three decades ago, hardware components were the major source of faults
and outages. Today, hardware faults are a minor source of system outages when compared
to operations, environment, and software faults. Techniques and designs that tolerate this
broader class of faults are in their infancy.
A Historical Perspective
Computers built in the late 1950's offered twelve-hour mean time to failure. A
maintenance staff of a dozen full-time customer engineers could repair the machine in
about eight hours. This failure-repair cycle provided 60% availability. The vacuum tube
and relay components of these computers were the major source of failures; they had
lifetimes of a few months. Therefore, the machines rarely operated for more than a day
without interruption [1].
Many fault detection and fault masking techniques used today were first used on these
early computers. Diagnostics tested the machine. Self-checking computational
techniques detected faults while the computation progressed. The program occasionally
saved (checkpointed) its state on stable media. After a failure, the program read the most
recent checkpoint and continued the computation from that point. This
checkpoint/restart technique allowed long-running computations to be performed by
machines that failed every few hours.
Device improvements have improved computer system availability. By 1980, typical
well-run computer systems offered 99% availability [2]. This sounds good, but 99%
availability is about 100 minutes of downtime per week. Such outages may be acceptable
for commercial back-office computer systems that process work in asynchronous batches for
later reporting. Mission-critical and online applications cannot tolerate 100 minutes of
downtime per week. They require high-availability systems -- ones that deliver 99.999%
availability. This allows at most five minutes of service interruption per year.
Process control, production control, and transaction processing applications are the
principal consumers of the new class of high-availability systems. Telephone networks,
airports, hospitals, factories, and stock exchanges cannot afford to stop because of a
computer outage. In these applications, outages translate directly to reduced productivity,
damaged equipment, and sometimes lost lives.
Degrees of availability can be characterized by orders of magnitude. Unmanaged
computer systems on the Internet typically fail every two weeks and average ten hours to
recover. These unmanaged computers give about 90% availability. Managed
conventional systems fail several times a year. Each failure takes about two hours to
repair. This translates to 99% availability [2]. Current fault-tolerant systems fail once every
few years and are repaired within a few hours [3]. This is 99.99% availability. High-
availability systems require fewer failures and faster repair. Their requirements are one
to three orders-of-magnitude more demanding than current fault-tolerant technologies
(see Table 1).

Table 1. Availability of typical system classes. Today's best systems are in the high-
         availability range. The best of the general-purpose systems are in the fault-
         tolerant range as of 1990.

    System Type               Unavailability    Availability    Availability
                                (min/year)                          Class
    unmanaged                      50,000           90%               1
    managed                         5,000           99%               2
    well-managed                      500           99.9%             3
    fault-tolerant                     50           99.99%            4
    high-availability                   5           99.999%           5
    very-high-availability             .5           99.9999%          6
    ultra-availability                .05           99.99999%         7
'& t"e nine& +egin to pile up in t"e a%aila+ility mea&ure, it i& +etter to t"ink of a%aila+ility
in term& of 0enial2of2&er%ice mea&ure0 in minute& per year. So for e1ample, ((.(((9
a%aila+ility i& a+out 5 minute& of &er%ice 0enial per year. E%en t"i& metric i& a little
cum+er&ome, &o t"e concept of availability class or &imply class i& 0efine0, +y analogy to
t"e "ar0ne&& of 0iamon0& or t"e cla&& of a cleanroom. '%aila+ility cla&& i& t"e num+er of
lea0ing nine& in t"e a%aila+ility figure for a &y&tem or mo0ule. More formally, if t"e
&y&tem a%aila+ility i& (, t"e &y&tem6& a%aila+ility cla&& i& e
log
10
()
. 3"e rig"tmo&t column
of 3a+le ) ta+ulate& t"e a%aila+ility cla&&e& of %ariou& &y&tem type&.
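As a quick check on Table 1, here is a small Python sketch (ours, not the authors') of the
class formula and the minutes-per-year arithmetic:

    import math

    MINUTES_PER_YEAR = 365.25 * 24 * 60   # about 525,960 minutes

    def availability_class(availability):
        # Number of leading nines: floor(log10(1 / (1 - availability))).
        # The small epsilon guards against floating-point round-off.
        return math.floor(math.log10(1.0 / (1.0 - availability)) + 1e-9)

    def downtime_minutes_per_year(availability):
        return (1.0 - availability) * MINUTES_PER_YEAR

    # 99.999% availability: class 5, about 5 minutes of service denial per year.
    print(availability_class(0.99999), round(downtime_minutes_per_year(0.99999), 1))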
3"e telep"one network i& a goo0 e1ample of a "ig"2a%aila+ility &y&tem 2 a cla&& 5 &y&tem.
/t& 0e&ign goal i& at mo&t two outage "our& in forty year&. $nfortunately, o%er t"e la&t
two year& t"ere "a%e +een &e%eral ma4or outage& of t"e $nite0 State& telep"one &y&tem ?
a nation2wi0e outage la&ting eig"t "our&, an0 a mi02we&t outage la&ting four 0ay&. 3"i&
&"ow& "ow 0ifficult it i& to +uil0 &y&tem& wit" "ig"2a%aila+ility.
Production computer software typically has more than one defect per thousand lines of
code. When millions of lines of code are needed, the system is likely to have thousands
of software defects. This seems to put a ceiling on the size of high-availability systems.
Either the system must be small or it must be limited to a failure rate of one fault per
decade. For example, the ten-million line Tandem system software is measured to have a
thirty-year failure rate [3].
5ig" a%aila+ility require& &y&tem& 0e&igne0 to tolerate fault& 22 to 0etect t"e fault, report
it, ma&k it, an0 t"en continue &er%ice w"ile t"e faulty component i& repaire0 offline.
=eyon0 t"e pro&aic "ar0ware an0 &oftware fault&, a "ig"2a%aila+ility &y&tem mu&t tolerate
t"e following &ample fault&.
Electrical power at a typical site in North America fails about twice a year. Each failure lasts
about an hour [4].
Software upgrades or repair typically require interrupting service while installing new
software. This happens at least once a year and typically takes an hour.
Database Reorganization is required to add new types of information to the database, to
reorganize the data so that it can be more efficiently processed, or to redistribute the data
among recently added storage devices. Such reorganizations may happen several times a
year and typically take several hours. As of 1991, no general-purpose system provides
complete online reorganization utilities.
Operations Faults: Operators sometimes make mistakes that lead to system outages.
Conservatively, a system experiences one such fault a decade. Such faults cause an outage
of a few hours.
Ju&t t"e four fault cla&&e& li&te0 a+o%e contri+ute more t"an )*** minute& of outage per
year. 3"i& e1plain& w"y manage0 &y&tem& 0o wor&e t"an t"i& an0 w"y well manage0
&y&tem& 0o &lig"tly +etter ;&ee 3a+le )<.
5ig" a%aila+ility &y&tem& mu&t ma&k mo&t of t"e&e fault&. Bne t"ou&an0 minute& i&
muc" more t"an t"e fi%e2minute per year +u0get allowe0 for "ig"2a%aila+ility &y&tem&.
Clearly it i& a matter of 0egree 22 not all fault& can +e tolerate0. /gnoring &c"e0ule0
interruption& to upgra0e &oftware to newer %er&ion&, current fault2tolerant &y&tem&
typically 0eli%er four year& of uninterrupte0 &er%ice an0 t"en require a two2"our repair
-
.
3"i& tran&late& to ((.(89 a%aila+ility 22 a+out one minute outage per week.
3"i& article &ur%ey& t"e fault2tolerance tec"nique& u&e0 +y t"e&e &y&tem&. /t fir&t
intro0uce& terminology. 3"en it &ur%ey& 0e&ign tec"nique& u&e0 +y fault2tolerant &y&tem&.
#inally, it &ketc"e& approac"e& to t"e goal of ultra2a%aila+le &y&tem&, &y&tem& wit" a )**2
year mean2time2to2failure rate an0 a one2minute mean2time2to2repair.
Terminology
Fault-tolerance discussions benefit from terminology and concepts developed by an IFIP
Working Group (IFIP WG 10.4) and by the IEEE Technical Committee on Fault-Tolerant
Computing. The result of those efforts is very readable [5]. The key definitions are
repeated here.
A system can be viewed as a single module. Most systems are composed of multiple
modules. These modules have internal structure, being in turn composed of sub-modules.
This presentation discusses the behavior of a single module, but the terminology applies
recursively to modules with internal modules.
Eac" mo0ule "a& an i0eal specified behavior an0 an o+&er%e0 actual behavior. ' failure
High Availability Paper for IEEE Computer Magazine Draft 3
occur& w"en t"e actual +e"a%ior 0e%iate& from t"e &pecifie0 +e"a%ior. 3"e failure
occurre0 +ecau&e of an error 22 a 0efect in t"e mo0ule. 3"e cau&e of t"e error i& a fault.
3"e time +etween t"e occurrence of t"e error an0 t"e re&ulting failure i& t"e error latency.
A"en t"e error cau&e& a failure, it +ecome& effective ;&ee #igure )<.
[Figure 1 diagram: a fault causes a latent error; after the error latency the error becomes
effective and causes a failure; the failure is detected, reported, and corrected or repaired;
the module alternates between the service accomplishment and service interruption states.]
Figure 1. Usually a module's observed
behavior matches its specified behavior.
Therefore, it is in the service
accomplishment state. Occasionally, a
fault causes an error that eventually
becomes effective causing the module to
fail (observed behavior does not equal
specified behavior). Then the module
enters the service interruption state. The
failure is detected, reported, corrected or
repaired, and then the module returns to
the service accomplishment state.
For example, a programmer's mistake is a fault. It creates a latent error in the software.
When the erroneous instructions are executed with certain data values, they cause a
failure and the error becomes effective. As a second example, a cosmic ray (fault) may
discharge a memory cell causing a memory error. When the memory is read, it produces
the wrong answer (memory failure) and the error becomes effective.
3"e actual mo0ule +e"a%ior alternate& +etween service-accomplishment w"ile t"e mo0ule
act& a& &pecifie0, an0 service interruption w"ile t"e mo0ule +e"a%ior 0e%iate& from t"e
&pecifie0 +e"a%ior. )odule reliability mea&ure& t"e time from an initial in&tant an0 t"e
ne1t failure e%ent. /n a population of i0entical mo0ule& t"at are run until failure, t"e
mean-time-to-failure i& t"e a%erage time to failure o%er all mo0ule&. Mo0ule relia+ility i&
&tati&tically quantifie0 a& mean-time-to-failure ()TT*). Ser%ice interruption i& &tati&tically
quantifie0 a& mean-time-to-repair ()TT+). )odule availability mea&ure& t"e ratio of
&er%ice2accompli&"ment to elap&e0 time. 3"e a%aila+ility of non2re0un0ant &y&tem& wit"
repair i& &tati&tically quantifie0 a& .
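A short Python illustration of this formula (our arithmetic, not the article's):

    def availability(mttf_hours, mttr_hours):
        # Steady-state availability of a non-redundant module with repair.
        return mttf_hours / (mttf_hours + mttr_hours)

    # A one-year MTTF (8,766 hours) and a four-hour MTTR give about 99.95%
    # availability -- availability class 3 (compare the simplex row of Table 2).
    print(round(availability(8766.0, 4.0), 5))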
Module reliability can be improved by reducing failures. Failures can be avoided by
valid construction and by error correction:

Validation can remove errors during the construction process, thus assuring that the
        constructed module conforms to the specified module. Since physical
        components fail during operation, validation alone cannot assure high reliability
        or high availability.

Error correction reduces failures by tolerating faults with redundancy.

    Latent error processing tries to detect and repair latent errors before they become
        effective. Preventive maintenance is an example of latent error processing.

    Effective error processing tries to correct the error after it becomes effective.
        Effective error processing may either recover from the error or mask the error.

        Error masking typically uses redundant information to deliver the correct
            service and to construct a correct new state. Error Correcting Codes (ECC)
            used for electronic, magnetic, and optical storage are examples of
            masking.

        Error recovery typically denies the request and sets the module to an error-
            free state so that it can service subsequent requests. Error recovery can take
            two forms:

            Backward error recovery returns to a previous correct state.
                Checkpoint/restart is an example of backward error recovery.

            Forward error recovery constructs a new correct state. Redundancy in
                time, for example resending a damaged message or rereading a disc
                block, is an example of forward error recovery.
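Backward error recovery is what the checkpoint/restart sketch earlier in this article does.
Forward error recovery by redundancy in time can be sketched as a retry loop. The fragment
below is illustrative only; the device interface and the retry limit are our own invention:

    def read_block(device, block_no, retries=3):
        # Forward error recovery: rereading the block constructs a correct
        # new state rather than rolling back to an old one.
        for attempt in range(retries):
            data, ok = device.read(block_no)   # hypothetical device interface
            if ok:                             # self-check (e.g., checksum) passed
                return data
        raise IOError("block %d unreadable after %d attempts" % (block_no, retries))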
3"e&e are t"e key 0efinition& from t"e /#/P Aorking Group
5
. Some a00itional
terminology i& u&eful. #ault& are typically categori7e0 a&.
Hardware faults -- failing devices,
Design faults -- faults in software (mostly) and hardware design,
Operations faults -- mistakes made by operations and maintenance personnel, and
Environmental faults - fire, flood, earthquake, power failure, sabotage.
Empirical Experience
3"ere i& con&i0era+le empirical e%i0ence a+out fault& an0 fault tolerance
8
. #ailure rate&
;or failure "a7ar0&< for &oftware an0 "ar0ware mo0ule& typically follow a bathtub curve.
3"e rate i& "ig" for new unit& ;infant mortality<, t"en it &ta+ili7e& at a low rate. '& t"e
mo0ule age& +eyon0 a certain t"re&"ol0 t"e failure rate increa&e& ;maturity<. P"y&ical
&tre&&, 0ecay, an0 corro&ion are t"e &ource of p"y&ical 0e%ice aging. Maintenance an0
re0e&ign are &ource& of &oftware aging.
#ailure rate& are u&ually quote0 at t"e +ottom of t"e +at"tu+ ;after infant mortality an0
+efore maturity<. #ailure& often o+ey a Aei+ull 0i&tri+ution, a negati%e "yper2
e1ponential 0i&tri+ution. Many 0e%ice an0 &oftware failure& are tran&ient 22 t"at i& t"e
operation may &uccee0 if t"e 0e%ice or &oftware &y&tem i& &imply re&et. #ailure rate&
typically increa&e wit" utili7ation. 3"ere i& e%i0ence t"at "ar0ware an0 &oftware failure&
ten0 to occur in clu&ter&.
Cepair time& for a "ar0ware mo0ule can %ary from "our& to 0ay& 0epen0ing on t"e
a%aila+ility of &pare mo0ule& an0 0iagno&tic capa+ilitie&. #or a gi%en organi7ation,
repair time& appear to follow a Poi&&on 0i&tri+ution. Goo0 repair &ucce&& rate& are
typically ((.(9, +ut (59 repair &ucce&& rate& are common. 3"i& i& &till e1cellent
compare0 to t"e 889 repair &ucce&& rate& reporte0 for automo+ile&.
Improved Devices are Half the Story
Device reliability has improved enormously since 1950. Vacuum tubes evolved to
transistors. Transistors, resistors, and capacitors were integrated on single chips. Today,
packages integrate millions of devices on a single chip. These device and packaging
revolutions have many reliability benefits for digital electronics:
More Reliable Devices: Integrated-circuit devices have long lifetimes. They can be disturbed
by radiation, but if operated at normal temperatures and voltages, and kept from corrosion,
they will operate for at least 20 years.
Reduced Power: Integrated circuits consume much less power per function. The reduced
power translates to reduced temperatures and slower device aging.
Reduced Connectors: Connections were a major source of faults due to mechanical wear and
corrosion. Integrated circuits have fewer connectors. On-chip connections are chemically
deposited, off chip connections are soldered, and wires are printed on circuit boards.
Today, only backplane connections suffer mechanical wear. They interconnect field
replaceable units (modules) and peripheral devices. These connectors remain a failure
source.
Similar improvements have occurred for magnetic storage devices. Originally, discs
were the size of refrigerators and needed weekly service. Just ten years ago, the typical
disc was the size of a washing machine, consumed about 2,000 watts of power, and
needed service about every six months. Today, discs are hand-held units, consume about
10 watts of power, and have no scheduled service. A modern disc becomes obsolete
sooner than it is likely to fail. The MTTF of a modern disc is about 12 years; its useful life
is probably five years.

Peripheral device cables and connectors have experienced similar complexity reductions.
A decade ago, disc cables were huge. Each disc required 20 or more control wires. Often
discs were dual-ported, which doubled this number. An array of 100 discs needed 4,000
wires and 8,000 connectors. As in the evolution of digital electronics, these cables and
their connectors were a major source of faults. Today, modern disc assemblies use fiber-
optic cables and connectors. A 100-disc array can be attached with 24 cables and 48
connectors: more than a 100-fold component reduction. In addition, the underlying
media uses lower power and has better resistance to electrical noise.
Fault-tolerant Design Concepts
Fault-tolerant system designs use the following basic concepts:
Modularity: Decompose the system into modules. The decomposition is typically hierarchical.
For example, a computer may have a storage module that in turn has several memory
modules. Each module is a unit of service, fault containment, and repair.
Service: The module provides a well specified interface to some function.
Fault containment: If the module is faulty, the design prevents it from contaminating
others.
Repair: When a module fails it is replaced by a new module.
Fail-Fast: Each module should either operate correctly or should stop immediately.
Independent Failure Modes: Modules and interconnections should be designed so that if one
module fails, the fault should not also affect other modules.
Redundancy and Repair: By having spare modules already installed or configured, when one
module fails the second can replace it almost instantly. The failed module can be repaired
offline while the system continues to deliver service.
These principles apply to hardware faults, design faults, and software faults (which are design
faults). Their application varies though, so hardware is treated first, and then design and
software faults are discussed.
Fault-Tolerant Hardware
3"e application of t"e mo0ularity, fail2fa&t, in0epen0ence, re0un0ancy, an0 repair
concept& to "ar0ware fault2tolerance i& ea&y to un0er&tan0. 5ar0ware mo0ule& are
p"y&ical unit& like a proce&&or, a communication& line, or a &torage 0e%ice. ' mo0ule i&
ma0e fail2fa&t +y one of two tec"nique&
8,,>
.
Self-checking: A module performs the operation and also performs some additional work to
    validate the state. Error detecting codes on storage and messages are examples of this.
Comparison: Two or more modules perform the operation and a comparator examines their
    results. If they disagree, the modules stop.
Self2c"ecking "a& +een t"e main&tay for many year&, +ut it require& a00itional circuitry
an0 0e&ign. Self2c"ecking will likely continue to 0ominate t"e &torage an0
communication& 0e&ign& +ecau&e t"e logic i& &imple an0 well un0er&too0.
3"e economie& of integrate0 circuit& encourage t"e u&e of compari&on for comple1
proce&&ing 0e%ice&. =ecau&e comparator& are relati%ely &imple, compari&on tra0e&
a00itional circuit& for re0uce0 0e&ign time. /n cu&tom fault2tolerant 0e&ign&, -*9 of
proce&&or circuit& an0 -*9 of t"e proce&&or 0e&ign time are 0e%ote0 to &elf2c"ecking.
Compari&on &c"eme& augment general2purpo&e circuit& wit" &imple comparitor 0e&ign&
an0 circuit&. 3"e re&ult i& a re0uction in o%erall 0e&ign co&t an0 circuit co&t.
3"e +a&ic compari&on approac" i& 0epicte0 in #igure ,.'. /t &"ow& "ow a relati%ely
&imple comparator place0 at t"e mo0ule interface can compare t"e output& of two
mo0ule&. /f t"e output& matc" e1actly, t"e comparator let& t"e output& pa&& t"roug". /f
t"e output& 0o not matc", t"e comparator 0etect& t"e fault an0 &top& t"e mo0ule&. 3"i& i&
a generic tec"nique for making fail2fa&t mo0ule& from con%entional mo0ule&.
/f more t"an two mo0ule& are u&e0, t"e mo0ule can tolerate at lea&t one fault +ecau&e t"e
comparator pa&&e& t"roug" t"e ma4ority output ;two out of t"ree in #igure ,.'<. 3"e
triple1 0e&ign i& calle0 tripple-module-redundancy (T)+). 3"e i0ea generali7e& to E2
ple1e0 mo0ule&.
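The comparator (voter) at the heart of these designs is simple. Here is a small Python
sketch (ours) of an N-plex majority voter that fails fast when no majority exists; with two
inputs it behaves as a pure comparator:

    from collections import Counter

    class ModuleFailure(Exception):
        pass                                   # raised to stop the module: fail fast

    def vote(outputs):
        # Pass through the majority output of N replicated modules.
        winner, count = Counter(outputs).most_common(1)[0]
        if count * 2 <= len(outputs):          # no strict majority: stop
            raise ModuleFailure("outputs disagree: %r" % (outputs,))
        return winner

    print(vote([42, 42, 41]))                  # TMR masks the single faulty module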
'& &"own in #igure ,.=, compari&on 0e&ign& can +e ma0e recur&i%e. /n t"i& ca&e, t"e
comparator& t"em&el%e& are E2ple1e0 &o t"at comparator2failure& are al&o 0etecte0. Self2
c"ecking an0 compari&on pro%i0e quick fault 0etection. Bnce a fault i& 0etecte0 it
&"oul0 +e reported, an0 t"en masked a& in #igure ).
[Figure 2 diagram. A: Basic fail-fast designs -- a pair (duplex) whose outputs feed a
comparator, and a triplex whose outputs feed a voter. B: Recursive fail-fast designs --
Triple Modular Redundancy (TMR) with replicated voters.]
#igure ,. 3"e +a&ic approac"e& to 0e&igning fail2fa&t an0 fault2tolerant mo0ule&.
Hardware fault masking with comparison schemes typically works as in Figure 3. The
duplexing scheme (pair-and-spare or dual-dual) combines two fail-fast modules to
produce a super-module that continues operating even if one of the submodules fails.
Since each submodule is fail-fast, the combination is just the OR of the two submodules.
The triplexing scheme masks failures by having the comparator pass through the majority
output. If only one module fails, the outputs of the two correct modules will form a
majority and so will allow the supermodule to function correctly.

The pair-and-spare scheme costs more hardware (four rather than three modules), but
allows a choice of two operating modes: either two independent fail-fast computations
running on the two pairs of modules, or a single high-availability computation running on
all four modules.
[Figure 3 diagram: Pair-and-Spare or Dual-Dual -- two fail-fast pairs, each with its own
comparator, ORed together.]
Figure 3. Using redundancy to mask failures. TMR needs no extra effort to mask a single
fault. Duplexed modules can tolerate faults by using a pair-and-spare or dual-dual
design. If any single module fails, the super-module continues operating.
To understand the benefits of these designs, imagine that each module has a one-year
MTTF, with independent failures. Suppose that the duplex system fails if the comparator
inputs do not agree, and the triplex module fails if two of the module inputs do not agree.
If there is no repair, the super-modules in Figure 2 will have an MTTF of less than a year
(see Table 2). This is an instance of the airplane rule: a two-engine airplane costs twice
as much and has twice as many engine problems as a one-engine airplane. Redundancy
by itself does not improve availability or reliability (redundancy does decrease the
variance in failure rates). In fact, adding redundancy made the reliability worse in these
two cases. Redundancy designs require repair to dramatically improve availability.
The Importance of Repair
If failed modules are repaired (replaced) within four hours of their failure, then the MTTF
of the example systems goes from one year to well beyond 1,000 years. Their
availability goes from 99.9% to 99.9999% (from availability class 3 to class 6). That is a
significant improvement. If the system employs thousands of modules, the construction
can be repeated recursively to N-plex the entire system and get a class 8 super-module
(1,000-year MTTF).
Online module repair requires the ability to repair and reinstall modules while the system
is operating. It also requires re-integrating the module into the system without
interrupting service. Doing this is not easy. For example, when a disc is repaired, it is
not trivial to make the contents of the disc identical to a neighboring disc. Reintegration
algorithms exist, but they are subtle. Each seems to use a different trick. There is no
overall design methodology for them yet. Similarly, when a processor is repaired, it is
not easy to set the processor state to that of the other processors in the module. Today,
online integration techniques are an area of patents and trade secrets. They are a key to
high-availability computing.
Table 2: MTTF estimates for various architectures using modules with a one-year MTTF and
a 4-hour MTTR. The '+' represents a small additional cost for the comparators (see
Siewiorek [6] for derivations).
    ARCHITECTURE         MTTF           CLASS   EQUATION            COST
    SIMPLEX              1 year           3     MTTF                 1
    DUPLEX               ~0.5 years       3     MTTF/2               2+
    TRIPLEX              0.8 year         3     MTTF(5/6)            3+
    PAIR AND SPARE       ~0.7 year        3     MTTF(3/4)            4+
    DUPLEX + REPAIR      >10^3 years      6     MTTF^2/(2 MTTR)      4+
    TRIPLEX + REPAIR     >10^6 years      6     MTTF^3/(3 MTTR^2)    3+
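The equations in Table 2 are easy to evaluate. The following Python sketch (our
arithmetic, using the table's one-year module MTTF and four-hour MTTR) reproduces the
orders of magnitude, including the dramatic effect of repair:

    HOURS_PER_YEAR = 8766.0
    MTTF, MTTR = 1.0 * HOURS_PER_YEAR, 4.0         # one-year modules, four-hour repair

    mttf_hours = {
        "simplex":          MTTF,
        "duplex":           MTTF / 2,                  # airplane rule: worse than simplex
        "triplex":          MTTF * 5 / 6,
        "pair-and-spare":   MTTF * 3 / 4,
        "duplex + repair":  MTTF**2 / (2 * MTTR),      # about 1,100 years
        "triplex + repair": MTTF**3 / (3 * MTTR**2),   # about 1,600,000 years
    }
    for name, hours in mttf_hours.items():
        print("%-18s %14.1f years" % (name, hours / HOURS_PER_YEAR))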
3"e &imple an0 powerful i0ea& of fail-fast modules an0 repair %ia retry or +y &pare mo0ule&
&eem to &ol%e t"e "ar0ware fault2tolerance pro+lem. 3"ey can ma&k almo&t all p"y&ical 0e%ice
failure&. 3"ey 0o not ma&k failure& cau&e0 +y "ar0ware 0e&ign fault&. /f all t"e mo0ule& are
faulty +y 0e&ign, t"en t"e comparator& will not 0etect t"e fault. Similarly, compari&on tec"nique&
0o not &eem to apply to &oftware, w"ic" i& all 0e&ign, unle&& 0e&ign 0i%er&ity i& employe0. 3"e
ne1t &ection 0i&cu&&e& tec"nique& t"at tolerate 0e&ign fault&.
Improved Device Maintenance: the FRU Concept
3"e 0eclining co&t an0 impro%e0 relia+ility of 0e%ice& allow a new approac" to computer
maintenance. 3o0ay computer& are compo&e0 of mo0ule& calle0 field-replaceable-units
;#C$&<. Eac" #C$ "a& built-in self-tests e1ploiting one of t"e c"ecking tec"nique&
mentione0 a+o%e. 3"e&e te&t& allow a mo0ule to 0iagno&e it&elf, an0 report failure&.
3"e&e failure& are reporte0 electronically to t"e &y&tem maintenance proce&&or, an0 are
reporte0 %i&ually a& a green2yellow2re0 lig"t on t"e mo0ule it&elf. green mean& no
trou+le, yellow mean& a fault "a& +een reporte0 an0 ma&ke0, an0 re0 in0icate& a faile0
unit. 3"i& &y&tem make& it ea&y to perform repair. 3"e repair per&on look& for a re0 lig"t
High Availability Paper for IEEE Computer Magazine Draft 9
an0 replace& t"e faile0 mo0ule wit" a &pare from in%entory.
#C$& are 0e&igne0 to "a%e a M33# in e1ce&& of ten year&. 3"ey are 0e&igne0 to co&t le&&
t"an a few t"ou&an0 0ollar& &o t"at t"ey may +e manufacture0 an0 &tocke0 in quantity. '
particular &y&tem will con&i&t of ten& or t"ou&an0& of #C$&.
3"e #C$ concept "a& +een carrie0 to it& logical conclu&ion +y fault2tolerant computer
%en0or&. 3"ey "a%e t"e cu&tomer perform cooperative maintenance a& follow&. A"en a
mo0ule fail& t"e &ingle2fault2tolerant &y&tem continue& operating &ince it can tolerate any
&ingle fault. 3"e &y&tem fir&t i0entifie& t"e fault wit"in a #C$. /t t"en call& t"e %en0or!&
&upport center %ia &witc"e0 telep"one line& an0 announce& t"at a new mo0ule ;#C$< i&
nee0e0. 3"e %en0or!& &upport center &en0& t"e new part to t"e &ite %ia e1pre&& mail
;o%ernig"t<. /n t"e morning, t"e cu&tomer recei%e& a package containing replacement part
an0 in&tallation in&truction&. 3"e cu&tomer replace& t"e part an0 return& t"e faulty
mo0ule to t"e %en0or +y parcel po&t.
Cooperati%e maintenance "a& attracti%e economie&. Con%entional 0e&ign& often require a
,9 per mont" maintenance contract. Paying ,9 of t"e &y&tem price eac" mont" for
maintenance 0ou+le& t"e &y&tem price in four year&. Maintenance i& e1pen&i%e +ecau&e
eac" cu&tomer %i&it co&t& t"e %en0or a+out a t"ou&an0 0ollar&. Cooperati%e &er%ice can
cut maintenance co&t& in "alf.
Tolerating Design Faults
Tolerating design faults is critical to high availability. After the fault-masking techniques
of the previous section are applied, the vast majority of the remaining computer faults are
design faults. (Operations and environmental faults are discussed later.)

One study indicates that failures due to design (software) faults outnumber hardware
faults by ten to one. Applying the concepts of modularity, fail-fast, independent failure
modes, and repair to software and design is the key to tolerating these faults.

Hardware and software modularity is well understood. A hardware module is a field-
replaceable unit (FRU). A software module is a process with private state (no shared
memory) and a message interface to other software modules [10].
3"e two approac"e& to fail2fa&t &oftware are &imilar to t"e "ar0ware approac"e&.
Self-checking: A program typically does simple sanity checks of its inputs, outputs, and data
    structures. This is called defensive programming. It parallels the double-entry book-
    keeping and check-digit techniques used by manual accounting systems for centuries. In
    defensive programming, if some item does not satisfy the integrity assertion, the program
    raises an exception (fails fast) or attempts repair. In addition, independent processes, called
    auditors or watch-dogs, observe the state. If they discover an inconsistency they raise an
    exception and either fail-fast the state (erase it) or repair it [9,11]. (A small sketch of this
    style follows the list below.)
Comparison: Several modules of different design run the same computation. A comparator
examines their results and declares a fault if the outputs are not identical. This scheme
depends on independent failure modes of the various modules.
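A minimal Python sketch of the defensive-programming (self-checking) style described in
the first item above; the data structure and its integrity assertion are our own illustration,
not from the article:

    class BoundedQueue:
        # A fail-fast software module: every operation checks its own invariants.
        def __init__(self, capacity):
            assert capacity > 0, "capacity must be positive"
            self.capacity, self.items = capacity, []

        def _audit(self):
            # Integrity assertion; raising here is the software analog of fail-fast.
            if not 0 <= len(self.items) <= self.capacity:
                raise AssertionError("queue state corrupted")

        def put(self, item):
            self._audit()
            if len(self.items) == self.capacity:
                raise OverflowError("queue full")   # deny the request, keep the state clean
            self.items.append(item)
            self._audit()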
3"e t"ir0 ma4or fault tolerance concept i& in0epen0ent failure mo0e&. %esign diversity i&
t"e +e&t way to get 0e&ign& wit" in0epen0ent failure mo0e&. Di%er&e 0e&ign& are
pro0uce0 an0 implemente0 +y at lea&t t"ree in0epen0ent group& &tarting wit" t"e &ame
&pecification. 3"i& &oftware approac" i& calle0 1-,ersion programming
),
+ecau&e t"e
High Availability Paper for IEEE Computer Magazine Draft 10
program i& written E2time&.
Unfortunately, even independent groups can make the same mistake. There may be a
mistake in the original specification, or all the groups may make the same
implementation mistake. Anyone who has given a test realizes that many students can
make the same mistake on a difficult exam question. Independent implementations of a
specification by independent groups are currently the best way to approach design
diversity.

N-Version programming is expensive. N independent implementations raise the system
implementation and maintenance cost by a factor of N or more. It may add unacceptable
time delays to the project implementation. Some argue that this time and money are better
spent on making one super-reliable design, rather than three marginal designs. There is
yet no comparative data to resolve this issue.
3"e concept of design repair &eem& to furt"er 0amage t"e ca&e for 0e&ign 0i%er&ity.
Cecall t"e airplane rule ;F3wo engine airplane& "a%e twice t"e engine pro+lem& of a one2
engine plane.G< Suppo&e eac" mo0ule of a triple1e0 0e&ign "a& a )**2year M33#.
2ithout repair, t"e triple will "a%e it& fir&t fault in -- year& an0 it& ne1t fault in 5* year&.
3"e net i& an >- year M33#. /f only one mo0ule were operate0, t"e M33# woul0 +e )**
year&. So t"e -2%er&ion program mo0ule "a& wor&e M33# t"an any &imple program ;+ut
t"e -2%er&ion program "a& lower failure %ariance<. Cepair i& nee0e0 if E2Der&ion
programming i& to impro%e &y&tem M33#.
Cepairing a 0e&ign flaw take& week& or mont"&. 3"i& i& e&pecially true for "ar0ware.
E%en &oftware repair i& &low w"en run t"roug" a careful program 0e%elopment proce&&.
Since t"e M33# of a triple1 i& proportional to , long repair time& may +e a pro+lem for
"ig"2a%aila+ility &y&tem&.
E%en after t"e mo0ule i& repaire0, it i& not clear "ow to re2integrate it into t"e working
&y&tem wit"out interrupting &er%ice. #or e1ample, &uppo&e t"e faile0 mo0ule i& a file
&er%er wit" it& 0i&c. /f t"e mo0ule fail& an0 i& out of &er%ice for a few week& w"ile t"e
+ug i& fi1e0, t"en w"en it return& it mu&t recon&truct it& current &tate. Since it "a& a
completely 0ifferent implementation from t"e ot"er file &er%er&, a &pecial purpo&e utility
i& nee0e0 to copy t"e &tate of a Fgoo0G &er%er to t"e F+eing repaire0G &e%er w"ile t"e
Fgoo0G &er%er i& 0eli%ering &er%ice. 'ny c"ange& to t"e file& in t"e goo0 &er%er al&o mu&t
+e reflecte0 to t"e &er%er +eing repaire0. 3"i& repair utility &"oul0 it&elf +e an E2%er&ion
program to a%oi0 a &ingle fault in a Fgoo0G &er%er or in t"e copy operation creating a
0ou+le fault. Software repair i& not tri%ial.
Process Pairs and Transactions: a Way to Mask Design Faults
Process-pairs and transactions offer a completely different approach to software repair.
They generalize the concept of checkpoint-restart to distributed systems. They depend on
the idea that most errors caused by design faults in production hardware and software are
transient. A transient failure will disappear if the operation is retried later in a slightly
different context. Such transient software failures have been given the whimsical name
Heisenbug because they disappear when reexamined. By contrast, Bohrbugs are good
solid bugs.
Common experience suggests that most software faults in production systems are
transient. When a system fails, it is generally restarted and returns to service. After all,
the system was working last week. It was working this morning. So why shouldn't it
work now? This commonsense observation is not very satisfying.
3o 0ate, t"e mo&t complete &tu0y of &oftware fault& wa& 0one +y E0 '0am&
)-
. /n t"at
&tu0y, "e looke0 at maintenance recor0& of Eort" 'merican /=M &y&tem& o%er a four year
perio0. 3"e &tu0y foun0 t"at mo&t &oftware fault& in pro0uction &y&tem& are only
reporte0 once. 5e 0e&cri+e0 &uc" error& a& benign bugs" =y contra&t, &ome &oftware
fault&, calle0 virulent bugs, were reporte0 many time&. Dirulent +ug& compri&e
&ignificantly le&& t"an )9 of all report&. =a&e0 on t"i& o+&er%ation, '0am& recommen0e0
t"at +enign +ug& not +e repaire0 imme0iately. 5arlan Mill&, u&ing '0am!& 0ata o+&er%e0
t"at mo&t +enign +ug& "a%e a M33# in e1ce&& of )*,*** year&. /t i& &afer to ignore &uc"
+ug& t"an to &top t"e &y&tem, in&tall a +ug fi1, an0 t"en re&tart t"e &y&tem. 3"e repair will
require a +rief outage, an0 a fault in t"e repair proce&& may cau&e a &econ0 outage.
3"e '0am& &tu0y an0 &e%eral ot"er& imply t"at t"e +e&t &"ort2term approac" to ma&king
&oftware fault& i& to re&tart t"e &y&tem
-,)-,)4
. Suppo&e t"e re&tart were in&tantaneou&.
3"en t"e fault woul0 not cau&e any outage. Ce&tating t"i&, &imple1 &y&tem una%aila+ility
i& appro1imately . /f M33C i& 7ero an0 M33#I* t"en t"ere i& no una%aila+ility.
Process pairs are a way to get almost instant restart. Recall that a process is the unit of
software modularity. It provides some service. It has a private address space and
communicates with the other processes and devices via messages traveling on sessions.
If the process fails, it is the unit of repair and replacement. A process pair gives almost
instant replacement and repair for a process. A process pair [15] consists of two processes
running the same program. During normal operation, the primary process performs all
the operations for its clients, and the backup process passively watches the message flows
(see Figure 4). The primary process occasionally sends checkpoint messages to the
backup, much in the style of checkpoint-restart designs typical of the 1950's. When the
primary detects an inconsistency in its state, it fails fast. The backup process is notified
when the primary fails, and it takes over the computation. The backup is now the
primary process. It answers all incoming requests and provides the service. If the
primary failed due to a Heisenbug, the backup will not fail and so there will be no
interruption of service. Process pairs also tolerate hardware faults. If the hardware
supporting the primary process fails, the backup running on other hardware will mask
that failure.

[Figure 4 diagram: a primary process and a backup process exchange state information
(checkpoint messages); to other processes on the session the two appear as one logical
process -- a process pair.]
Figure 4: A process pair appears to other
processes as a single logical process.
Internally, it is really two processes
executing the same program and with
approximately the same state. The pair
typically runs on different computers and so
has some failure-mode independence. In
addition, process pairs mask Heisenbugs.
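A toy Python sketch (ours) of the primary/backup idea, with both processes and the
message traffic simulated in one address space; a real process pair runs on two computers
and exchanges real checkpoint messages:

    def serve(state, request):
        # The service itself: here, just count requests per client.
        state[request] = state.get(request, 0) + 1
        return state[request]

    class BackupProcess:
        def __init__(self):
            self.state = {}                      # approximately the primary's state
        def checkpoint(self, state):             # receive a checkpoint message
            self.state = dict(state)
        def takeover(self, request):             # become the primary and serve
            return serve(self.state, request)

    class PrimaryProcess:
        def __init__(self, backup):
            self.state, self.backup = {}, backup
        def handle(self, request):
            result = serve(self.state, request)
            self.backup.checkpoint(self.state)   # occasional checkpoint message
            return result

    backup = BackupProcess()
    primary = PrimaryProcess(backup)
    primary.handle("client-A")
    # If the primary now fails fast, the backup answers from its checkpointed state.
    print(backup.takeover("client-A"))           # -> 2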

One criticism of process pairs is that writing the checkpoint and takeover logic makes the
arduous job of programming even more complex. It is analogous to writing the repair
programs mentioned for N-version programming. Bitter experience shows that the code
is difficult to write, difficult to test, and difficult to maintain.
Transactions are a way to automate the checkpoint/takeover logic. They allow "ordinary"
programs to act as process pairs. Transactions are an automatic checkpoint-restart
mechanism. The transaction mechanism allows an application designer to declare a
collection of actions (messages, database updates, and state changes) to have the
following properties:
Atomicity: either all the actions of the transaction will be done or they will all be undone. This
is often called the all-or-nothing property. The two possible outcomes are called commit
(all) and abort (nothing).
Consistency: The collection of actions is a correct transformation of state. It preserves the state
invariants (assertions that constrain the values a correct state may assume).
Isolation: Each transaction will be isolated from the concurrent execution of other concurrent
transactions. Even if other transactions concurrently read and write the inputs and outputs
of this transaction, it will appear that the transactions ran sequentially according to some
global clock.
Durability: If a transaction commits, the effects of its operations will survive any subsequent
system failures. In particular, any committed output messages will eventually be delivered,
and any committed database changes will be reflected in the database state.
3"e a+o%e four propertie&, terme0 'C/D, were fir&t 0e%elope0 +y t"e 0ata+a&e
community. 3ran&action& pro%i0e &imple a+&traction for Co+ol programmer& to 0eal wit"
error& an0 fault& in con%entional 0ata+a&e &y&tem& an0 application&. 3"e concept gaine0
many con%ert& w"en 0i&tri+ute0 0ata+a&e& +ecame common. Di&tri+ute0 &tate i& &o
comple1 t"at tra0itional c"eckpoint@re&tart &c"eme& require &uper2"uman talent&.
3"e tran&action mec"ani&m i& ea&ily un0er&too0. 3"e programmer 0eclare& a tran&action
+y i&&uing a =EG/EJ3C'ES'C3/BE;< %er+. 5e en0& t"e tran&action +y i&&uing a
CBMM/3J3C'ES'C3/BE;< or '=BC3J3C'ES'C3/BE;< %er+. =eyon0 t"at, t"e un0erlying
tran&action mec"ani&m a&&ure& t"at all action& wit"in t"e =egin2Commit an0 =egin2'+ort
+racket& "a%e t"e 'C/D propertie&.
3"e tran&action concept applie& to 0ata+a&e&, to me&&age& ;e$actly once me&&age
0eli%ery<, an0 to main memory ;per&i&tent programming language&<. 3ran&action& are
High Availability Paper for IEEE Computer Magazine Draft 13
com+ine0 wit" proce&& pair& a& follow&. ' proce&& pair may 0eclare it& &tate to +e
persistent, meaning t"at w"en t"e primary proce&& fail&, t"e tran&action mec"ani&m a+ort&
all tran&action& in%ol%e0 in t"e primary proce&& an0 recon&truct& t"e +ackup2proce&& &tate
a& of t"e &tart of t"e acti%e tran&action&. 3"e tran&action& are t"en reproce&&e0 +y t"e
+ackup proce&&.
/n t"i& mo0el, t"e un0erlying &y&tem implement& proce&& pair&, tran&actional &torage
;&torage wit" t"e 'C/D propertie&<, tran&actional me&&age& ;e1actly2once me&&age
0eli%ery<, an0 proce&& pair& ;t"e +a&ic takeo%er mec"ani&m<. 3"i& i& not tri%ial, +ut t"ere
are at lea&t two e1ample&. 3an0em!& EonStop Sy&tem&, an0 /=M!& Cro&& Ceco%ery
#eature ;KC#<
)8,)
. Ait" t"e&e un0erlying facilitie&, application programmer& can write
con%entional program& t"at e1ecute a& proce&& pair&. 3"e computation& nee0 only 0eclare
tran&action +oun0arie&. 'll t"e c"eckpoint2re&tart logic an0 tran&action mec"ani&m i&
0one automatically.
3o &ummari7e, proce&& pair& ma&k "ar0ware fault& an0 5ie&en+ug&. 3ran&action& make it
ea&y to write proce&& pair&.
The Real Problems: Operations, Maintenance, and Environment
3"e pre%iou& &ection& took t"e narrow computer %iew of fault2tolerance. Para0o1ically,
computer& are rarely t"e cau&e of a computer failure. /n one &tu0y, (>9 of t"e
un&c"e0ule0 &y&tem outage& came from Fout&i0eG &ource&
-
. 5ig"2a%aila+ility &y&tem&
mu&t tolerate en%ironmental fault& ;e.g., power failure&, fire, floo0, in&urrection, %iru&,
an0 &a+otage<, operation& fault&, an0 maintenance fault&.
3"e 0eclining price an0 increa&ing automation of computer &y&tem& offer& a
&traig"tforwar0 &olution to &ome of t"e&e pro+lem&. system pairs ;&ee #igure 5<. Sy&tem
pair& carry t"e 0i&c2pair an0 proce&&2pair 0e&ign one &tep furt"er. 3wo nearly i0entical
&y&tem& are place0 at lea&t )*** kilometer& apart. 3"ey are on 0ifferent communication
gri0&, on 0ifferent power gri0&, on 0ifferent eart"quake fault&, on 0ifferent weat"er
&y&tem&, "a%e 0ifferent maintenance per&onnel, an0 "a%e 0ifferent operator&. Client& are
in &e&&ion wit" +ot" &y&tem&, +ut eac" client preferentially &en0& "i& work to one &y&tem
or anot"er. Eac" &y&tem carrie& "alf t"e loa0 0uring normal operation. A"en one &y&tem
fail&, t"e ot"er &y&tem take& o%er for it. 3"e tran&action mec"ani&m &tep& in to clean up
t"e &tate at takeo%er.
/0eally, t"e &y&tem& woul0 not "a%e i0entical 0e&ign&. Suc" 0e&ign 0i%er&ity woul0 offer
a00itional protection again&t 0e&ign error&. 5owe%er, t"e economic& of 0e&igning,
in&talling, operating, an0 maintaining two completely 0ifferent &y&tem& may +e
pro"i+iti%e. E%en if t"e &y&tem& are i0entical, t"ey are likely to ma&k mo&t "ar0ware,
&oftware, en%ironmental, maintenance, an0 operation& fault&.
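A small Python sketch (ours) of the client-side view of a system pair: each client prefers
one site but can reach both, and its work is retried against the surviving site at takeover.
The send interface is invented for illustration; the site names follow Figure 5:

    class Site:
        def __init__(self, name, up=True):
            self.name, self.up = name, up

        def send(self, request):
            if not self.up:
                raise ConnectionError(self.name + " is down")
            return f"{self.name} handled {request}"

    def submit(request, preferred, other):
        # Send work to the preferred system; on failure, the pair takes over.
        try:
            return preferred.send(request)
        except ConnectionError:
            return other.send(request)       # takeover by the surviving system

    paris, tokyo = Site("Paris"), Site("Tokyo")
    paris.up = False                         # e.g., an environmental fault at one site
    print(submit("debit request", preferred=paris, other=tokyo))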

[Figure 5 diagram: two systems, one at Paris and one at Tokyo, connected by a network
(SNA, OSI, ...); each is the other's pair.]

Figure 5: A system pair design. A system is
replicated at two sites (Paris and Tokyo).
During normal operation, each carries half the
load. When one fails, the other serves all the
clients. System pairs mask most hardware,
software, operations, maintenance and
environmental faults. They also allow online
software and hardware changes.
Clearly, system pairs will mask many hardware faults. A hardware fault in one system
will not cause a fault in the other. System pairs will mask maintenance faults, since a
maintenance person can only touch and break computers at one site at a time. System
pairs ease maintenance. Either system can be repaired, moved, replaced, and changed
without interrupting service. It should be possible to install new software or to reorganize
the database at one system while the other provides service. After the upgrade, the new
system catches up with the old system and replaces it while the old system is upgraded.

Special-purpose system pairs have been operating for decades. IBM's AAS system, the
Visa system, and many banking systems operate in this way. They offer excellent
availability, and offer protection from environmental and operations disasters. Each of
these systems has an ad hoc design. No system provides a general-purpose version of
system pairs and the corresponding operations utilities today. This is an area of active
development. General-purpose support for system pairs will emerge during the 1990's.
Assessment of Future Trends
B%er t"e la&t four 0eca0e&, computer relia+ility an0 a%aila+ility "a%e impro%e0 +y four
or0er& of magnitu0e. 3ec"nique& to ma&k 0e%ice failure& are well un0er&too0. De%ice
relia+ility an0 0e&ign "a& impro%e0 &o t"at now maintenance i& rare. A"en nee0e0,
maintenance con&i&t& of replacing a mo0ule. Computer operation& are increa&ingly +eing
automate0 +y &oftware. Sy&tem pair& ma&k mo&t en%ironmental fault&, an0 al&o ma&k
&ome operation&, maintenance, an0 0e&ign fault&.
#igure 8 0epict& t"e e%olution of fault2tolerant arc"itecture& an0 t"e fault cla&&e& t"ey
tolerate. 3"e 0en&ity of t"e &"a0ing in0icate& t"e 0egree to w"ic" fault& in t"at cla&& are
tolerate0. During t"e )(8*6& fault, tolerant computer& were mainly employe0 in
telep"one &witc"ing an0 aero&pace application&. Due to t"e relati%e unrelia+ility of
"ar0ware, replication wa& u&e0 to tolerate "ar0ware failure&. 3"e 0eca0e of t"e )(*6&
&aw t"e emergence of commercial fault2tolerant &y&tem& u&ing proce&& pair& couple0 wit"
replication to tolerate "ar0ware a& well a& &ome 0e&ign fault&. 3"e replication of a
&y&tem at two or more &ite& ;&y&tem pair&< e1ten0e0 fault co%erage to inclu0e operation&
an0 en%ironmental fault&. 3"e c"allenge of t"e )((*6& i& to +uil0 upon our e1perience
an0 0e%i&e arc"itecture& capa+le of co%ering all t"e&e fault cla&&e&.
[Figure 6 diagram: a matrix of fault classes (hardware, design, operations, environment)
against architectures (replication, 1960's; process pairs, 1970's; system pairs, 1980's;
future techniques, 1990's), with shading showing the degree of coverage.]
Figure 6. Summary of the evolution of fault-tolerant architectures and their fault class
coverage.
3"e&e a0%ance& "a%e come at t"e co&t of increa&e0 &oftware comple1ity. Sy&tem pair&
are more comple1 t"an &imple1 &y&tem&. Software to automate operation& an0 to allow
fully online maintenance an0 c"ange i& &u+tle. /t &eem& t"at a minimal &y&tem to pro%i0e
t"e&e feature& will in%ol%e million& of line& of &oftware. Let it i& known t"at &oftware
&y&tem& of t"at &i7e "a%e t"ou&an0& of &oftware fault& ;+ug&<. Eo economic tec"nique to
eliminate t"e&e fault& i& known.
3ec"nique& to tolerate &oftware fault& are known, +ut t"ey take a &tati&tical approac". /n
fact, t"e &tati&tic& are not %ery promi&ing. 3"e +e&t &y&tem& &eem to offer a M33#
mea&ure0 in ten& of year&. 3"i& i& unaccepta+le for application& t"at are life2critical or
t"at control multi2+illion20ollar enterpri&e&. Let, t"ere i& no alternati%e to0ay.
=uil0ing ultra2a%aila+le &y&tem& &tan0& a& a ma4or c"allenge for t"e computer in0u&try in
t"e coming 0eca0e&.
References
1. Avizienis, A., Kopetz, H. and Laprie, J. C., Dependable Computing and Fault-Tolerant
   Systems, Springer Verlag, Wien, 1987.
2. Watanabe, E. (translator), Survey on Computer Security, Japan Information
   Development Corporation, Tokyo, March 1986.
3. Gray, J., Why Do Computers Stop and What Can We Do About It, 6th International
   Conference on Reliability and Distributed Databases, IEEE Press, 1987, and A
   Census of Tandem System Availability, 1985-1990, IEEE Trans. on Reliability, Vol. 39,
   No. 4, 1990, pp. 409-418.
4. Tullis, N., Powering Computer-Controlled Systems: AC or DC?, Telesis, Vol. 11, No. 1,
   Jan. 1984, pp. 8-14.
5. Laprie, J. C., Dependable Computing and Fault Tolerance: Concepts and
   Terminology, Proc. 15th FTCS, IEEE Press, pp. 2-11, 1985.
6. Siewiorek, D.P. and Swarz, R.S., Reliable Computer Systems: Design and Evaluation,
   Digital Press, Bedford, 1992.
7. Johnson, B.W., Design and Analysis of Fault Tolerant Digital Systems, Addison
   Wesley, Reading, 1989.
8. Pradhan, D.K., Fault Tolerant Computing: Theory and Techniques, Vol. I and II,
   Prentice Hall, Englewood Cliffs, 1986.
9. Anderson, T., ed., Resilient Computing Systems, Vol. I, John Wiley and Sons, New
   York, 1985.
10. Tanenbaum, A. S., Operating Systems: Design and Implementation, Prentice
    Hall, Englewood Cliffs, 1987.
11. Randell, B., Lee, P. A. and Treleaven, P. C., Reliability Issues in Computer System
    Design, ACM Computing Surveys, Vol. 10, No. 2, 1978, pp. 123-165.
12. Avizienis, A., Software Fault Tolerance, Proc. 1989 IFIP World Computer
    Conference, IFIP Press, 1989.
13. Adams, E., Optimizing Preventative Service of Software Products, IBM Journal of
    Research and Development, Vol. 28, No. 1, Jan. 1984.
14. Mourad, J., The Reliability of the IBM/XA Operating System, Proc. 15th FTCS, IEEE
    Press, 1985.
15. Bartlett, J., A NonStop Kernel, Proc. 8th SIGOPS Symposium on Operating Systems
    Principles, ACM, 1981.
16. IMS/XRF: Planning Guide, GC24-3151, IBM, White Plains, NY, 1987.
17. Lyon, J., Tandem's Remote Data Facility, Compcon 90, IEEE Press, 1990.
