
IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 4, NO. 1, JANUARY-MARCH 2007

An Adaptive Programming Model for Fault-Tolerant Distributed Computing


Sergio Gorender, Raimundo Jose de Araujo Macedo, Member, IEEE Computer Society, and Michel Raynal
Abstract: The capability of dynamically adapting to distinct runtime conditions is an important issue when designing distributed systems where negotiated quality of service (QoS) cannot always be delivered between processes. Providing fault tolerance in such dynamic environments is a challenging task. In this context, this paper proposes an adaptive programming model for fault-tolerant distributed computing, which provides upper-layer applications with process state information according to the current system synchrony (or QoS). The underlying system model is hybrid, composed of a synchronous part (where there are time bounds on processing speed and message delay) and an asynchronous part (where there is no time bound). This composition can vary over time; in particular, the system may become totally asynchronous (e.g., when the underlying system QoS degrades) or totally synchronous. Moreover, processes are not required to share the same view of the system synchrony at a given time. To illustrate what can be done in this programming model and how to use it, the consensus problem is taken as a benchmark problem. The paper also presents an implementation of the model that relies on a negotiated QoS for communication channels.

Index Terms: Adaptability, asynchronous/synchronous distributed system, consensus, distributed computing model, fault tolerance, quality of service.

1 INTRODUCTION
Distributed systems are composed of processes, located on one or more sites, that communicate with one another to offer services to upper-layer applications. A major difficulty a system designer has to cope with in these systems lies in capturing consistent global states from which safe decisions can be taken in order to guarantee safe progress of the upper-layer applications. To study what can be done (and how it has to be done) in these systems when they are prone to process failures, two distributed computing models have received significant attention, namely, the synchronous model and the asynchronous model.

The synchronous distributed computing model provides processes with bounds on processing time and message transfer delay. These bounds, explicitly known by the processes, can be used to safely detect process crashes and, consequently, allow the noncrashed processes to progress with safe views of the system state (such views can be obtained with some time lag). In contrast, the asynchronous model is characterized by the absence of time bounds (which is why this model is sometimes called the time-free model). In these systems, a system designer can only assume an upper bound, usually denoted f, on the number of processes that can crash and, consequently, design protocols relying on the assumption that at least n − f processes are alive (n being the total number of processes).
S. Gorender and R.J. de Araujo Macedo are with the Distributed Systems Laboratory (LaSiD), Computer Science Department, Federal University of Bahia, Campus de Ondina, 40170-110, Salvador, Bahia, Brazil. E-mail: {gorender, macedo}@ufba.br. M. Raynal is with IRISA, Universite de Rennes 1, Campus de Beaulieu, 35042 Rennes Cedex, France. E-mail: raynal@irisa.fr. Manuscript received 24 May 2005; revised 31 Dec. 2005; accepted 22 Aug. 2006; published online 2 Feb. 2007.

The protocol has no means to know whether a given process is alive or not. Moreover, if more than f processes crash, there is no guarantee on the protocol behavior (usually, the protocol loses its liveness property).

Synchronous systems are attractive because they allow system designers to solve many problems. The price that has to be paid is the a priori knowledge of time bounds: if they are violated, the upper-layer protocols may no longer be able to guarantee their safety property. As they do not rely on explicit time bounds, asynchronous systems do not have this drawback. Unfortunately, they have another one, namely, some basic problems are impossible to solve in asynchronous systems. The most famous is the consensus problem, which has no deterministic solution when even a single process can crash [12]. The consensus problem can be stated as follows: each process proposes a value and has to decide a value, unless it crashes (termination), such that there is a single decided value (uniform agreement) and that value is a proposed value (validity). This problem, whose statement is particularly simple, is fundamental in fault-tolerant distributed computing, as it abstracts several basic agreement problems.

While consensus is considered a theoretical problem, system designers are usually interested in the more practical Atomic Broadcast problem. That problem is both a communication problem and an agreement problem. Its communication part specifies that the processes can broadcast and deliver messages in such a way that the processes that do not crash deliver at least the messages they send. Its agreement part specifies that there is a single delivery order: the correct processes deliver the same sequence of messages, and a faulty process delivers a prefix of this sequence.

It has been shown that consensus and atomic broadcast are equivalent problems in asynchronous systems prone to process crashes [6]. Consequently, in such systems, the impossibility of solving consensus extends to atomic broadcast. This impossibility has motivated researchers to find distributed computing models, weaker than the synchronous model but stronger than the asynchronous model, in which consensus can be solved.

1.1 Content of the Paper

In practice, systems are neither fully synchronous nor fully asynchronous. Most of the time they behave synchronously, but they can have unstable periods during which they behave in an anarchic way. Moreover, there are now QoS architectures that allow processes to dynamically negotiate the quality of service of their communication channels, leading to settings with hybrid characteristics that can change over time. These observations motivate the design of a distributed programming model that does its best to provide the processes with information on the current state of the other processes. This paper is a step in that direction.

This programming model is time-free in the sense that processes are not provided with time bounds guaranteed by the lower system layer. Each process p_i is provided with three sets denoted down_i, live_i, and uncertain_i. These sets, which always form a partition of the whole set of processes, define the view p_i has of the state of the other processes. More precisely, if p_k ∈ down_i, then p_i knows that p_k has crashed; when p_k ∈ live_i, p_i can consider p_k as being alive; finally, when p_k ∈ uncertain_i, p_i has no information on the current state of p_k. These sets can evolve over time and can have different values at different processes. For example, it is possible (when some quality of service can no longer be ensured) that the view p_i has of p_k be degraded, in the sense that the model moves p_k from live_i to uncertain_i. So, the model is able to benefit from time windows to transform timeliness properties (currently satisfied by the lower layer) into time-free properties (expressed on the previous sets) that can be used by upper-layer protocols.

Interestingly, this programming model includes the synchronous model and the asynchronous model as particular cases. The synchronous model corresponds to the case where, for all processes p_i, the sets uncertain_i are always empty. The asynchronous model corresponds to the case where, for all processes p_i, the sets uncertain_i always include all processes.

It is important to notice that our approach is orthogonal to the failure detector (FD) approach [6]. Given a problem P that cannot be solved in the asynchronous model, the FD approach is a computability approach: it consists of enriching the asynchronous model with properties that are sufficient for the problem to become solvable. Our approach is engineering oriented. It aims at benefiting from the fact that systems are built on QoS architectures, thereby allowing a process not to always consider the other processes as being in an uncertain state. As the proposed model includes the asynchronous model, it is still necessary to enrich it with appropriate mechanisms to solve problems that are impossible to solve in pure time-free systems.

To illustrate this point and evaluate the proposed approach, the paper considers the consensus problem as a benchmark. A consensus protocol is presented that is based on the fact that each process p_i is provided with 1) the three sets down_i, live_i, and uncertain_i with the previously defined semantics, and 2) an appropriate failure detector module (namely, ⋄S). Interestingly, while ⋄S-based consensus protocols in asynchronous systems require f < n/2, the proposed ⋄S-based protocol allows bypassing this bound when |uncertain_i| < n during the protocol execution. Another interesting feature of the proposed programming model lies in the fact that it allows us to design generic protocols, in the sense that a protocol designed for that model includes both a synchronous protocol and an asynchronous protocol. This paper is an extended and enhanced version of a previous publication that appeared in [14].

1.2 Related Work

Our approach relates to previous work on distributed system models and adaptiveness. The timed asynchronous model [8] considers asynchronous processes equipped with physical clocks. The concept of hybrid architectures has been explored where a distributed wormhole (i.e., a small synchronous part of the system) equipped with synchronized clocks provides timely services to the rest of the system (the asynchronous part) [27]. The concept of a secure and crash-resilient wormhole has also been used to implement efficient solutions to consensus protocols subject to Byzantine faults in the asynchronous part of the system [7]. Several works have aimed at circumventing the impossibility of consensus in the asynchronous system model [12]: the minimal synchronism needed to solve consensus is addressed in [9]; partial synchrony making consensus solvable is investigated in [10]; finally, the failure detector approach and its application to the consensus problem have been introduced and investigated in [5], [6]. A general framework for consensus algorithms in the classical asynchronous system model enriched with failure detectors or random number generators is given in [16].

Many systems have addressed adaptiveness, including [24], [26], [19], [4]. AQuA [24] provides adaptive fault tolerance to CORBA applications by replicating objects and providing a high-level method that an application can use to specify its desired level of reliability. AQuA also provides an adaptive mechanism capable of coping with application dependability requirements at runtime; it does so by using a majority voting algorithm. Ensemble [26] offers a fault-tolerant mechanism that allows adaptation at runtime. This is used to dynamically adapt the components of a group communication service, switching between two total order algorithms (sequencer-based versus token-based) according to the desired system overhead and end-to-end latency for application messages. The concept of real-time dependable (RTD) channels has been introduced in [19] to handle QoS adaptation at channel creation time.


The authors show how an application can customize an RTD channel according to specific QoS requirements, such as bounded delivery time, reliable delivery, message ordering, and jitter (variance in message transmission time). Such QoS requirements are stated in terms of probabilities related to specific QoS guarantees and are enforced by a composite protocol based on the real-time version of the Cactus system (CactusRT).

The work presented in [4] explicitly addresses the problem of dynamically adapting applications when a given negotiated QoS can no longer be delivered. To accomplish that, the authors define a QoS coverage service that provides so-called time-elastic applications (such as mission-critical or soft real-time systems) with the ability to dependably decide how to adapt time bounds in order to maintain a constant coverage level. This service is implemented over the Timely Computing Base (TCB) wormhole. By using its duration measurement service, the TCB is able to monitor the current QoS level (in terms of timeliness), allowing applications to dynamically adapt when a given QoS cannot be delivered.

Like [4], our work tackles QoS adaptability in terms of timeliness at runtime. However, we do not adapt time bounds for time-elastic applications. Instead, our hybrid programming model adapts to the available QoS of the communication channels, providing applications with safe information on the current state of processes. The QoS of a communication channel can be timely, with guaranteed time bounds for message delivery, or untimely, otherwise.1 Moreover, we show how an adaptive consensus protocol can take advantage of the proposed hybrid model. As we shall see in Section 4, though we have our own implementation based on QoS architectures, our programming model could also be implemented on top of systems such as TCB and CactusRT.


1.3 Roadmap

The paper is made up of five sections. Section 2 introduces the model. Section 3 presents and proves the correctness of a consensus protocol suited to the model, and Section 4 describes an implementation of this model based on a distributed system architecture providing negotiated QoS guarantees. Finally, Section 5 concludes the paper.

2 THE HYBRID MODEL AND ADAPTIVE PROGRAMMING

This section describes the computation model offered to the upper-layer applications. These applications are generally distributed programs implementing middleware services. It is important to notice that the model offered to upper-layer applications is time-free in the sense that it offers them neither timing assumptions nor time functions.

2.1 Asynchronous Distributed System with Process Crash Failures

We consider a system consisting of a finite set Π of n ≥ 2 processes, namely, Π = {p_1, p_2, ..., p_n}. A process executes steps (a step is the reception of a message, the sending of a message, each with the corresponding local state change, or a simple local state change). A process can fail by crashing, i.e., by prematurely halting. After it has crashed, a process does not recover. It behaves correctly (i.e., according to its specification) until it (possibly) crashes. By definition, a process is correct in a run if it does not crash in that run; otherwise, it is faulty in the corresponding run. In the following, f denotes the maximum number of processes that can crash (1 ≤ f < n). Until it possibly crashes, the speed of a process is positive but arbitrary.

Processes communicate and synchronize by sending and receiving messages through channels. Every pair of processes (p_i, p_j) is connected by two directed channels, denoted p_i → p_j and p_j → p_i. Channels are assumed to be reliable: they do not create, alter, or lose messages. In particular, if p_i sends a message to p_j, then, if p_i is correct, p_j eventually receives that message unless p_j itself fails. There is no assumption on the relative speed of processes or on message transfer delays (let us observe that channels are not required to be FIFO). The primitive broadcast MSG(v) is a shortcut for "∀p_j ∈ Π do send MSG(v) to p_j end do" (MSG is the message tag, v its value).

The previous definitions describe what is usually called a time-free asynchronous distributed system prone to process crashes. We assume the existence of a global discrete clock. This clock is a fictional device that is not known by the processes; it is only used to state specifications or prove protocol properties. The range T of clock values is the set of natural numbers. Let F(t) be the set of processes that have crashed through time t. As a process that crashes does not recover, we have F(t) ⊆ F(t+1).
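As a side illustration, the broadcast shortcut is a plain send loop, with no atomicity guarantee if the sender crashes midway. Below is a minimal Python sketch, where send() is an assumed stand-in for the reliable point-to-point channel (all names are illustrative, not part of the paper):

```python
# Minimal sketch of the broadcast shortcut: broadcast MSG(v) is a plain
# send loop over all processes in Pi; it offers no atomicity if the
# sender crashes midway through the loop.

PROCESSES = ["p1", "p2", "p3"]  # the set of process identifiers (Pi)

def send(dest, tag, value):
    # Stand-in for the reliable directed channel: in a real system this
    # would enqueue `value` on the channel toward `dest`.
    print(f"send {tag}({value}) to {dest}")

def broadcast(tag, value):
    # broadcast MSG(v): for each p_j in Pi, send MSG(v) to p_j.
    for p in PROCESSES:
        send(p, tag, value)

broadcast("EST", 42)
```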
1. To ensure such distinct QoS guarantees, we assume that the underlying QoS infrastructure is capable of reserving resources and controlling the admission of new communication flows related to newly created channels.

2.2 How a Process Sees the Other Processes: Three Sets per Process

A crucial issue encountered in distributed systems is the way each process perceives the state of the other processes. To that end, the proposed programming model provides each process p_i with three sets denoted down_i, live_i, and uncertain_i. The only thing a process p_i can do with respect to these sets is to read the sets it is provided with; it cannot write them and has no access to the sets of the other processes. These sets, which can evolve dynamically, are made up of process identities. Intuitively, the fact that a given process p_j belongs to down_i, live_i, or uncertain_i provides p_i with some hint on the current status of p_j. More operationally, if p_j ∈ down_i, p_i can safely consider p_j as being crashed. If p_j ∉ down_i, the state of p_j is not known by p_i with certainty: more precisely, if p_j ∈ live_i, p_i is given a hint that it can currently consider p_j as not crashed; when p_j ∈ uncertain_i, p_i has no information on the current state (crashed or alive) of p_j. At the abstraction level defining the computation model, these sets are defined by abstract properties (the way they are implemented is irrelevant at this level; it is discussed in Section 4). The specification of the sets down_i, live_i, and uncertain_i, 1 ≤ i ≤ n, is the following (where down_i(t) is the value of down_i at time t, and similarly for live_i(t) and uncertain_i(t)):
R0 Initial global consistency. Initially, the sets live_i (respectively, down_i and uncertain_i) of all the processes p_i are identical. Namely, ∀i, j: live_i(t) = live_j(t), down_i(t) = down_j(t), and uncertain_i(t) = uncertain_j(t), for t = 0.


R1 Internal consistency. The sets of each p_i define a partition:
- ∀i, ∀t: down_i(t) ∪ live_i(t) ∪ uncertain_i(t) = Π.
- ∀i, ∀t: any two of the sets down_i(t), live_i(t), uncertain_i(t) have an empty intersection.

R2 Consistency of the down_i sets.
- A down_i set is never decreasing: ∀i, ∀t: down_i(t) ⊆ down_i(t+1).
- A down_i set is always safe with respect to crashes: ∀i, ∀t: down_i(t) ⊆ F(t).

R3 Local transitions for a process p_i. While an upper-layer protocol is running, the only changes a process p_i can observe are the moves of a process p_x from live_i to down_i or uncertain_i.

R4 Consistent global transitions. The sets down_i and uncertain_j of any pair of processes p_i and p_j evolve consistently. More precisely:
- ∀i, j, k, t0: (p_k ∈ live_i(t0) ∧ p_k ∈ down_i(t0+1)) ⇒ (∀t1 > t0: p_k ∉ uncertain_j(t1)).
- ∀i, j, k, t0: (p_k ∈ live_i(t0) ∧ p_k ∈ uncertain_i(t0+1)) ⇒ (∀t1 > t0: p_k ∉ down_j(t1)).

R5 Conditional crash detection. If a process p_j crashes and there is a time after which it never appears in the uncertain_i set of any other process p_i, then it eventually appears in the down_i set of each p_i. More precisely: ∀p_i, if p_j crashes at time t0 and there is a time t1 ≥ t0 such that ∀t2 ≥ t1 we have p_j ∉ uncertain_i(t2), then there is a time t3 ≥ t1 such that ∀t4 ≥ t3 we have p_j ∈ down_i(t4).

As we can see from this specification, at any time t and for any pair of processes p_i and p_j, it is possible to have live_i(t) ≠ live_j(t) (and similarly for the other sets). Operationally, this means that distinct processes can have different views of the current state of each other process. Let us also observe that down_i is the only safe information a process p_i has on the current state of the other processes. An example, involving nine processes {p_1, ..., p_9}, is described in Table 1. The columns (respectively, rows) describe p_i's (respectively, p_j's) view of the three sets at some time t; e.g., down_i(t) = {p_1, p_2, p_5}, while live_j(t) = {p_5, p_6}.

The rules R0-R5 define a distributed programming model that, interestingly, satisfies the following strong consistency property. That property provides the processes with a mutually consistent view of the possibility of detecting the crash of a given process. More specifically, if the crash of a process p_k is never known by p_i (because p_k continuously belongs to uncertain_i), then no process p_j will detect the crash of p_k (by having p_k ∈ down_j).

TABLE 1 Example of down, live, and uncertain Sets

Conversely, if the crash of p_i is known by p_j, the other processes will also know it. More formally, we have:

Property 1 (Mutual consistency). ∀i, j: ∀t1, t2: down_i(t1) ∩ uncertain_j(t2) = ∅.

Proof. The proof is by contradiction. Let us assume that there exist p_i, p_j, t1, t2 and a process p such that p ∈ down_i(t1) ∩ uncertain_j(t2). We consider two cases:

- Case i = j. t1 = t2 is impossible, as it would violate the partitioning rule R1. t1 < t2 and t1 > t2 are also impossible, as they would violate the local transition rule R3.
- Case i ≠ j. First, let us observe that in the initial configuration (t = 0), p ∈ live_i(0) and p ∈ live_j(0), since the only possible transitions for p are the moves from live to down or to uncertain (R3), and the processes see the same set formations (R0) at that stage, which always form a partition (R1). Due to R3 and the initial assumption, ∀t3 such that t3 > t2 ∧ t3 > t1: p ∈ down_i(t3) ∧ p ∈ uncertain_j(t3). Therefore, the formulas below follow, which contradict R4: ∃t0 < t3 such that (p ∈ live_i(t0) ∧ p ∈ down_i(t0+1)) ∧ p ∈ uncertain_j(t3), and ∃t0' < t3 such that (p ∈ live_j(t0') ∧ p ∈ uncertain_j(t0'+1)) ∧ p ∈ down_i(t3). □
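For concreteness, the partition rule R1 and the monotonicity/safety rule R2 can be checked mechanically on snapshots of a process's view. The following Python sketch (illustrative names; a test harness for an implementation of the model, not part of the paper's formalism) shows such a checker:

```python
# Sketch: runtime checks for rules R1 and R2 over successive snapshots
# of one process's view. Each snapshot is (down, live, uncertain),
# given as sets of process ids; PI is the whole process set.

PI = {"p1", "p2", "p3", "p4"}

def check_r1(down, live, uncertain):
    # R1: the three sets partition PI (union = PI, pairwise disjoint).
    assert down | live | uncertain == PI
    assert not (down & live) and not (down & uncertain) and not (live & uncertain)

def check_r2(down_before, down_after, crashed):
    # R2: down never decreases, and only contains actually crashed processes.
    assert down_before <= down_after      # down_i(t) subset of down_i(t+1)
    assert down_after <= crashed          # down_i(t) subset of F(t)

snap_t0 = (set(), {"p1", "p2", "p3"}, {"p4"})
snap_t1 = ({"p2"}, {"p1", "p3"}, {"p4"})   # p2 moved live -> down (R3)
check_r1(*snap_t0)
check_r1(*snap_t1)
check_r2(snap_t0[0], snap_t1[0], crashed={"p2"})
print("R1 and R2 hold on these snapshots")
```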

2.3 An Upgrading Rule

Let us consider two consecutive instances (runs) of an upper-layer protocol (e.g., the consensus protocol described in Section 3) built on top of the model defined by rules R0-R5. Moreover, let t1 be the time at which the first instance terminates and t2 the time at which the second instance starts (t1 < t2). During any instance of the protocol, the set live_i of a process can only be downgraded (as no process can go from down_i or uncertain_i to live_i). But it is important to notice that nothing prevents the model from upgrading between consecutive instances, by moving a process p_x ∈ uncertain_i(t1) into live_i(t2) or down_i(t2). Such upgrades of the live_i or down_i sets between two runs of an upper-layer protocol correspond to synchronization points during which the processes are allowed to renegotiate the quality of service of their channels (see Section 4).


R6 Upgrade. If no upper-layer protocol is running during a time interval [t1, t2], it is possible that, ∀i, processes be moved from uncertain_i to live_i or down_i during that interval.
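As an illustration of how an implementation might guard the R3/R6 boundary, here is a minimal Python sketch (illustrative names and structure; not the paper's implementation):

```python
# Sketch of rule R6: upgrades out of `uncertain` are applied only between
# protocol runs; during a run the view only changes through the
# downgrades allowed by R3.

class View:
    def __init__(self, down, live, uncertain):
        self.down, self.live, self.uncertain = down, live, uncertain
        self.run_in_progress = False

    def upgrade(self, proc, target):
        # R6: only legal while no upper-layer protocol instance is running.
        assert not self.run_in_progress, "R6 forbids upgrades during a run"
        assert proc in self.uncertain and target in ("live", "down")
        self.uncertain.discard(proc)
        (self.live if target == "live" else self.down).add(proc)

view = View(down=set(), live={"p1", "p2"}, uncertain={"p3"})
view.upgrade("p3", "live")   # e.g., a timely channel was renegotiated
print(view.live, view.uncertain)
```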

2.4 On the Wait Statement and Examples

Interestingly, the following proposition on the conditional termination of a wait statement can easily be derived from the previous computation model specification.
Proposition 1. Let us consider the following statement, issued by a process p_i at some time t0: wait until ((a message from p_j is received) ∨ (p_j ∈ down_i)), and let us assume that either the corresponding message has been sent by p_j or p_j has crashed.2 The model guarantees that this wait statement always terminates if there is a time t1 (t1 ≥ t0) such that ∀t2 ≥ t1 we have p_j ∈ down_i(t2) ∪ live_i(t2).

This proposition is very important from a practical point of view. Said another way, it states that if, from some time on, p_j never belongs to uncertain_i, the wait statement always terminates. Equivalently, when, after some time, p_j always remains in uncertain_i or alternates between live_i and uncertain_i, there is no guarantee on the termination of the wait statement (it may terminate or never terminate).

To illustrate the model, let us consider two particular extreme cases. The first is the case of synchronous distributed systems. As indicated in the Introduction, due to the upper bounds on processing times and message transfer delays, these systems correspond to the model where ∀i, ∀t, we have uncertain_i(t) = ∅. The second example is the case of fully asynchronous distributed systems, where there is no time bound. In that case, given a process p_j, a process p_i can never know whether p_j has crashed, is only slow, or has very slow communication channels. This type of system corresponds to the model where ∀i, ∀t: uncertain_i(t) = Π (or, equivalently, down_i(t) = live_i(t) = ∅). In these systems (as it appears in many asynchronous distributed algorithms), it is important to observe that, in order not to be blocked by a crashed process, a process p_i can only issue anonymous wait statements such as: wait until (messages have been received from n − f processes). In such a wait statement, p_i relies only on the fact that at most f processes can crash. It does not wait for a message from a particular process (as in Proposition 1). This is an anonymous wait in the sense that the identities of the processes from which messages are received are not specified in the wait statement; these identities can be known only when the wait terminates.
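The two waiting styles can be contrasted in code. In the Python sketch below (illustrative names; busy-polling stands in for a blocking receive), wait_from implements the wait statement of Proposition 1 and anonymous_wait the anonymous wait just described:

```python
import time

def wait_from(pj, received, view, poll=0.01):
    # Proposition 1 style: wait until (message from p_j received) or
    # (p_j in down_i). Termination is guaranteed only if p_j eventually
    # stays out of uncertain_i forever.
    while pj not in received and pj not in view["down"]:
        time.sleep(poll)

def anonymous_wait(received, n, f, poll=0.01):
    # Anonymous wait used in fully asynchronous systems: block until
    # messages from any n - f processes have arrived; sender identities
    # are known only once the wait terminates.
    while len(received) < n - f:
        time.sleep(poll)
    return set(received)

# Demo with conditions already satisfied (so the loops exit immediately):
wait_from("p2", received={"p2"}, view={"down": set()})
print(anonymous_wait({"p1", "p2", "p3"}, n=4, f=1))
```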

2.5 The Hybrid and Adaptive Programming Model in Perspective

Our programming model provides the upper-layer applications with sufficient process state information (the sets) that can be used to adapt to the available system synchrony or QoS (in terms of timely and untimely channels), providing more efficient solutions to fault-tolerant problems when possible (this will be illustrated in the next section).
2. This means that we consider that the upper-layer distributed programs are well formed in the sense that they are deadlock-free.

As a matter of fact, the implementation of fault-tolerant services, such as consensus [17], depends decisively on the existence of upper bounds for message transmission and process scheduling delays (i.e., timely channels and timely process executions). Such upper bounds can only be continuously guaranteed in synchronous systems. On the other hand, fault-tolerant services can also be guaranteed in some asynchronous or partially synchronous environments where synchronous behavior is observed during sufficiently long periods of time, even if such behavior is not continuously assured. Thus, such partially synchronous systems can alternate between synchronous behavior, with time bound guarantees, and asynchronous behavior, with no time bound guarantees. So, in some sense, these systems are hybrid in the time dimension. For example, the timed asynchronous system depends on a sufficiently long stability period to deliver the related fault-tolerant services, and the system behavior can alternate between stable and unstable periods (i.e., synchronous and asynchronous) [8]. The GST (Global Stabilization Time) assumption is necessary to guarantee that ⋄S-based consensus protocols eventually deliver the proper service [6] (i.e., to work correctly, the system must eventually exhibit a synchronous behavior).

There are other models that consider not only hybridism in the time dimension, but also a system composed of a synchronous and an asynchronous part at the same time. So, we can regard such systems as being hybrid in the space dimension. This is the case of the TCB model, which relies on a synchronous wormhole to implement fault-tolerant services [27], and such spatial hybridism persists without discontinuity in time.

We assume that our underlying system model is capable of providing distinct QoS communications, so that a given physical channel may transmit communication flows related to multiple virtual channels with distinct QoS (e.g., one timely channel and two untimely channels implemented in the same physical channel). Because timely and untimely channels may exist in parallel, our underlying system model can be hybrid in space. Moreover, untimely channels may exhibit a synchronous behavior in certain stable periods. Therefore, our underlying system model differs from all the above-mentioned models in the sense that it not only considers both the temporal and spatial hybridism dimensions (in this sense, similar to TCB), but also allows the nature of such hybridism to change over time, with periods where there is no spatial hybridism. That is, the underlying system behavior can become totally asynchronous (i.e., uncertain = Π), given that in some circumstances the QoS of timely channels can degrade (due to router failures, for instance).

As our programming model cannot ensure the conditions to solve consensus in circumstances where the system becomes totally asynchronous, we need a computability model to correctly specify the consensus algorithms.


As discussed earlier, the failure detector approach, denoted FD, is such a computability model, and we use it to solve consensus (see Section 3.1). However, nothing prevents us from using our programming model with another oracle or computability assumption (such as the eventual leader oracle Ω [5]), with adequate modifications to the consensus algorithm. That is why we consider the FD-based approach as orthogonal to our approach.

In summary, partially synchronous models such as the GST model and timed asynchronous systems do not consider spatial hybridism. TCB is hybrid in space and time, and its spatially hybrid nature persists continuously. Our model, in contrast, considers an underlying system that can also be hybrid in time and space; however, it can completely lose its synchronous part over time, requiring the corresponding adaptive mechanism. The price that has to be paid to allow the required adaptation is the implementation of QoS management mechanisms (provision and monitoring) capable of continuously assessing the QoS of the channels [1]. This is to some extent similar to what is realized by the fail-awareness service implemented in timed asynchronous systems, where the occurrence of too many performance failures raises an exception signal indicating that the underlying system can no longer satisfy the synchronous behavior [11]. The difference lies in the fact that in the timed asynchronous system, the signaling is based not on QoS management but on the violation of round-trip-time message bounds.

The approach presented in [16] focuses on the consensus problem in asynchronous message-passing systems. It proposes a general framework for failure detector-based consensus algorithms. Hence, its underlying model is the classical asynchronous model enriched with failure detectors; there is no notion of hybridism. The approach proposed here is different because it considers an underlying QoS infrastructure.


3 USING THE MODEL

As a relevant example of the way the previous programming model can benefit middleware designers, this section presents a consensus protocol built on top of this model. The consensus problem has been chosen to illustrate the way to use the model because of its generality, both practical and theoretical [17].

3.1 Enriching the Model to Solve Consensus

3.1.1 The Consensus Problem

In the consensus problem, every correct process p_i proposes a value v_i and all correct processes have to decide on the same value v, which has to be one of the proposed values. More precisely, the consensus problem is defined by two safety properties (Validity and Uniform Agreement) and a Termination property [6], [12]:
- Validity: If a process decides v, then v was proposed by some process.
- Uniform Agreement: No two processes decide differently.
- Termination: Every correct process eventually decides on some value.

3.1.2 The Class ⋄S of Failure Detectors

It is well known that the consensus problem cannot be solved in pure time-free asynchronous distributed systems [12]. So, these systems have to be equipped with additional power (as far as the detection of process crashes is concerned) in order for consensus to become solvable. Here, we consider that the system is augmented with a failure detector of the class denoted ⋄S [6] (which has been shown to be the weakest class of failure detectors able to solve consensus despite asynchrony [5]). A failure detector of the class ⋄S is defined as follows [6]: each process p_i is provided with a set suspected_i that contains processes suspected to have crashed. If p_j ∈ suspected_i, we say that p_i suspects p_j. A failure detector providing such sets belongs to the class ⋄S if it satisfies the following properties:

- Strong Completeness: Eventually, every process that crashes is permanently suspected by every correct process.
- Eventual Weak Accuracy: There is a time after which some correct process is never suspected by the correct processes.

As we can see, a failure detector of the class ⋄S can make an infinite number of mistakes (e.g., by erroneously suspecting a correct process, in a repeated way).
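In practice, ⋄S is commonly approximated with adaptive timeouts. The following Python sketch (illustrative names and structure; not the paper's implementation, which builds on the QoS Provider of Section 4) shows one such approximation:

```python
import time

class EventualFailureDetector:
    # Sketch of a timeout-based approximation of the class ⋄S: missed
    # heartbeats lead to suspicion (completeness); each false suspicion
    # doubles the timeout, so suspicions of correct processes stop in
    # stable periods (only approximating eventual weak accuracy).

    def __init__(self, processes, initial_timeout=1.0):
        self.timeout = {p: initial_timeout for p in processes}
        self.last_heard = {p: time.monotonic() for p in processes}
        self.suspected = set()

    def heartbeat(self, p):
        # A message from p arrived: p was alive when it sent it.
        self.last_heard[p] = time.monotonic()
        if p in self.suspected:
            self.suspected.discard(p)   # we made a mistake...
            self.timeout[p] *= 2        # ...so be more patient with p

    def update(self):
        # Suspect every process whose silence exceeds its timeout.
        now = time.monotonic()
        for p, last in self.last_heard.items():
            if now - last > self.timeout[p]:
                self.suspected.add(p)
        return self.suspected

fd = EventualFailureDetector({"p1", "p2"}, initial_timeout=0.01)
time.sleep(0.02)            # no heartbeats: both processes get suspected
print(fd.update())
fd.heartbeat("p1")          # late message: p1 unsuspected, timeout doubled
print(fd.suspected, fd.timeout["p1"])
```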

3.2 A Consensus Protocol

The consensus protocol described in Fig. 1 adapts the ⋄S-based protocol introduced in [21] to the computation model defined in Section 2. Its main difference lies in the waiting condition used at line 8. Moreover, while the algorithmic structure of the proposed protocol is close to the structure described in [21], due to the very different model it assumes, its proof (mainly Lemma 1) is totally different from the one used in [21].

A process p_i starts a consensus execution by invoking Consensus(v_i), where v_i is the value it proposes. This function is made up of two tasks, T1 (the main task) and T2. The processes proceed by consecutive asynchronous rounds. Each process p_i manages two local variables whose scope is the whole execution, namely, r_i (current round number) and est_i (current estimate of the decision value), and two local variables whose scope is the current round, namely, aux_i and rec_i. ⊥ denotes a default value which cannot be proposed by the processes. A round is made up of two phases (communication steps), and the first phase of each round r is managed by a coordinator p_c (where c = (r mod n) + 1).
- First phase (lines 4-6). In this phase, the current round coordinator p_c broadcasts its current estimate est_c. A process p_i that receives it keeps it in aux_i; otherwise, p_i sets aux_i to ⊥. The test p_c ∈ suspected_i ∪ down_i is used to prevent p_i from blocking forever. It is important to notice that, at the end of the first phase, the following property holds: ∀i, j: (aux_i ≠ ⊥ ∧ aux_j ≠ ⊥) ⇒ (aux_i = aux_j = v).
- Second phase (lines 7-13). During the second phase, the processes exchange their aux_i values.


Fig. 1. Consensus protocol.

Each process p_i collects in rec_i such values from the processes belonging to the sets live_i and uncertain_i, as defined in the condition stated at line 8. As a consequence of the property holding at the end of the first phase and of the condition of line 8, the following property holds at line 9: ∀i: rec_i = {v}, or rec_i = {v, ⊥}, or rec_i = {⊥}, where v = est_c. The proof of this property is the main part of the proof of the uniform agreement property (see Lemma 1). Then, according to the content of its set rec_i, p_i either decides (case rec_i = {v}), adopts v as its new estimate (case rec_i = {v, ⊥}), or keeps its previous estimate (case rec_i = {⊥}). When it does not decide (the last two cases), p_i proceeds to the next round.

As a process that decides stops participating in the sequence of rounds, and processes do not necessarily terminate in the same round, it is possible that processes proceeding to round r + 1 wait forever for messages from processes that decided during r. The aim of the second task is to prevent such deadlocks by disseminating the decided value in a reliable way [18]. It is easy to see that the processes decide in a single round (two communication steps) when the first coordinator is neither crashed nor suspected. Moreover, it is important to notice that the value f (upper bound on the number of faulty processes) does not appear in the protocol; consequently, the protocol does not impose an a priori constraint on f. Actually, if the sets uncertain_i remain always empty, the underlying failure detector becomes useless and the protocol solves consensus whatever the value of f.
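Since Fig. 1 itself is not reproduced in this copy, the following Python skeleton reconstructs task T1 as just described. The node interface (broadcast, has_msg, msg_value, poll, the sets live/down/uncertain/suspected, n, me) is hypothetical glue, and task T2 is reduced to a final broadcast; this is a sketch of the round structure, not the authors' algorithm verbatim:

```python
# Sketch of task T1 of the protocol of Fig. 1 for one process. BOTTOM
# stands for the default value (denoted by a bottom symbol in the paper)
# that no process can propose.

BOTTOM = object()

def consensus(node, vi):
    est, r = vi, 0
    while True:
        r += 1
        c = ((r - 1) % node.n) + 1                 # coordinator p_c of round r

        # Phase 1 (lines 4-6): p_c broadcasts its estimate; the others wait
        # for it unless p_c is suspected or known to be down.
        if node.me == c:
            node.broadcast("PHASE1", r, est)
        while not (node.has_msg("PHASE1", r, frm=c)
                   or c in node.suspected or c in node.down):
            node.poll()
        aux = (node.msg_value("PHASE1", r, frm=c)
               if node.has_msg("PHASE1", r, frm=c) else BOTTOM)

        # Phase 2 (lines 7-13): exchange aux values; line 8 waits for a
        # PHASE2 message from every process in live_i and from a majority
        # of the processes in uncertain_i.
        node.broadcast("PHASE2", r, aux)
        while not (node.has_msgs_from_all("PHASE2", r, node.live)
                   and node.has_msgs_from_majority("PHASE2", r, node.uncertain)):
            node.poll()
        rec = node.msg_values("PHASE2", r)         # the set rec_i

        non_bottom = {x for x in rec if x is not BOTTOM}
        if non_bottom and BOTTOM not in rec:       # rec = {v}: decide v
            v = non_bottom.pop()
            node.broadcast("DECIDE", r, v)         # reliable dissemination (T2)
            return v
        if non_bottom:                             # rec = {v, BOTTOM}: adopt v
            est = non_bottom.pop()
        # rec = {BOTTOM}: keep est and proceed to round r + 1
```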


3.3 Correctness Proof

The proofs of the uniform agreement and termination properties, given by Theorems 1 and 2, respectively, follow. The proof of validity is omitted here due to space limitations and can be found elsewhere [15].
Lemma 1. Let p_i and p_j be two processes that terminate line 8 of round r, and let v be the estimate value est_c of the coordinator p_c of r (if any). We have: 1) rec_i = {v}, or rec_i = {v, ⊥}, or rec_i = {⊥}; and 2) rec_i = {v} and rec_j = {⊥} are mutually exclusive.

Proof. Let us first consider a process p_i that, during round r, terminates the wait statement of line 8. As there is a single coordinator per round, the aux_j values sent at line 7 by the processes participating in round r can only be either the estimate est_c = v of the current round coordinator p_c, or ⊥. Item 1 of the lemma follows immediately from this observation. Let us now prove item 2, namely, if both p_i and p_j terminate the wait statement at line 8 of round r, it is not possible to have rec_i = {v} and rec_j = {⊥} during that round. Let Q_i (respectively, Q_j) be the set of processes from which p_i (respectively, p_j) has received PHASE2(r, -) messages at line 8. The set rec_i (respectively, rec_j) is defined from the aux values carried by these messages. We claim that Q_i ∩ Q_j ≠ ∅. As any process in the intersection sends the same message to p_i and p_j, the lemma follows.


Proof of the claim. To prove that Q_i ∩ Q_j ≠ ∅, the reasoning is on their sizes. Let t_i (respectively, t_j) be the time at which the waiting condition evaluated by p_i (respectively, p_j) is satisfied. Let us observe that t_i and t_j can be different. To simplify the notation, we do not indicate the time at which a set is considered; i.e., down_i, live_i, and uncertain_i denote the values of the corresponding sets at time t_i (and similarly for p_j's sets). Moreover, let maj_i be the majority subset of uncertain_i that satisfies the condition evaluated at line 8, i.e., Q_i = live_i ∪ maj_i (notice that maj_i = ∅ when uncertain_i = ∅). Similarly, let Q_j = live_j ∪ maj_j.

To prove Q_i ∩ Q_j ≠ ∅, we restrict our attention to the set of processes Π \ (down_i ∪ down_j). This is because the predicate evaluated at line 8 by p_i (respectively, p_j) involves only the processes in Π \ down_i (respectively, Π \ down_j). In the following, we denote with a prime symbol a set value used in the proof but known neither by p_i nor by p_j. So, down'_{i,j} = down_i ∪ down_j, live'_i = live_i \ down'_{i,j}, and live'_j = live_j \ down'_{i,j}. Let us observe that, due to R1, we have live'_i = live_i \ down_j and live'_j = live_j \ down_i. Combining the previous set definitions with Property 1, we get the following property (E): live'_i ∪ uncertain_i = Π \ down'_{i,j} = live'_j ∪ uncertain_j. With these sets, we can now define two nonempty sets Q'_i and Q'_j as follows: Q'_i = live'_i ∪ maj_i and Q'_j = live'_j ∪ maj_j. From their very definition, we have Q'_i = live'_i ∪ maj_i ⊆ live_i ∪ maj_i = Q_i. Similarly, Q'_j ⊆ Q_j. As Q'_i ∩ Q'_j ≠ ∅ implies Q_i ∩ Q_j ≠ ∅, the rest of the proof consists of showing that Q'_i ∩ Q'_j ≠ ∅. We consider three cases:

- Case 1: maj_i ≠ ∅ ∧ maj_j ≠ ∅. From the fact that Q'_i is built from the universe of processes Π \ down'_{i,j} and from property (E), we have: |Q'_i| = |live'_i| + |maj_i| > |live'_i| + |uncertain_i|/2 ≥ (|live'_i| + |uncertain_i|)/2 = (|Π| − |down'_{i,j}|)/2. We have the same for p_j, i.e., |Q'_j| > (|Π| − |down'_{i,j}|)/2. As both Q'_i and Q'_j are built from the same universe of processes (Π \ down'_{i,j}) and each of them contains more than half of its elements, we have Q'_i ∩ Q'_j ≠ ∅.

- Case 2: maj_i = ∅ ∧ maj_j = ∅. In that case, we have uncertain_i = uncertain_j = ∅. So, proving Q'_i ∩ Q'_j ≠ ∅ amounts to proving that live'_i ∩ live'_j ≠ ∅. Let us remind that t_i (respectively, t_j) is the time with respect to which down_i (respectively, down_j) is defined, and let us assume without loss of generality that t_i < t_j. As p_j has not crashed at t_j (i.e., when it terminates the wait statement of line 8), we have (from R2) p_j ∉ down_i and p_j ∉ down_j, i.e., p_j ∉ down'_{i,j}, from which we conclude p_j ∈ live'_i and p_j ∈ live'_j. Hence, live'_i ∩ live'_j ≠ ∅, which proves the case.

- Case 3: maj_i ≠ ∅ ∧ maj_j = ∅. In that case, we have uncertain_i ≠ ∅ and uncertain_j = ∅. As uncertain_i ∩ down_j = ∅ (Property 1) and uncertain_j = ∅, we have live'_j = live'_i ∪ uncertain_i (Table 2 depicts these sets). Hence, uncertain_i ⊆ live'_j. As maj_i ⊆ uncertain_i, we have maj_i ⊆ live'_j, from which we conclude Q'_i ∩ Q'_j ≠ ∅. (The case maj_i = ∅ ∧ maj_j ≠ ∅ is proved by exchanging i and j.)

End of the proof of the claim. □

TABLE 2. Case 3 in the Proof of the Claim.

Theorem 1. No two processes decide different values.

Proof. Let us first observe that a decided value is computed either at line 10 or in task T2. However, we examine only the values decided at line 10, as a value decided in task T2 has already been computed by some process at line 10. Let r be the smallest round during which a process p_i decides, and let v be the value it decides. As p_i decides during r, we have rec_i = {v} during that round (line 10). Due to item 1 of Lemma 1, it follows that 1) v is the current value of the estimate of the coordinator of r, and 2) as there is a single coordinator per round, if another process p_j decides during r, it decides the same value v. Let us now consider a process p_k that proceeds from round r to round r + 1. Due to item 2 of Lemma 1, it follows that rec_k = {v, ⊥}, from which we conclude that the estimates of all the processes that proceed to r + 1 are set to v at line 11. Consequently, no future coordinator will broadcast a value different from v, and agreement follows. □

Theorem 2. Let us assume that the system is equipped with ⋄S and that, ∀i, ∀t, a majority of the processes in uncertain_i(t) are not in F(t). Then, every correct process decides.

Proof. If a process decides, then all correct processes decide. This is an immediate consequence of the broadcast primitive used just before deciding at line 10. This broadcast (line 10 and task T2) is such that if a process executes it entirely, the broadcast message is received by all correct processes. So, let us assume (by contradiction) that no process decides. We claim that no correct process blocks forever at line 5 or line 8. As a consequence, the correct processes eventually start executing rounds coordinated by a correct process (say, p_c) that is no longer suspected to have crashed (this is due to the eventual weak accuracy of ⋄S). Moreover, this process cannot appear in down_i (R2). Let us consider such a round, occurring after all the faulty processes have crashed.


The correct process p_c then sends the same non-⊥ value v to each correct process p_i, which consequently executes aux_i := v (line 6). As no process blocks forever at line 8, and only the value v can be received, it follows that each correct process decides v at line 10.

Proof of the claim: no correct process blocks forever at line 5 or line 8. We consider each line separately. First, the fact that no process can block forever at line 5 follows directly from the fact that every crashed process eventually appears in suspected_i ∪ down_i. Second, the fact that no process can block forever at line 8 follows from the following observations:

1. Item 2 of line 8 cannot entail a permanent blocking: this follows from the theorem assumption.
2. Let us now consider item 1 of line 8: p_i could block if it waits for a message from p_k ∈ live_i and that process p_k has crashed before sending the corresponding message. If p_k is later moved to uncertain_i, it can no longer block p_i (see 1). If p_k is never moved to uncertain_i, then it is eventually moved to down_i (R5) and, consequently, p_i stops waiting for a message from p_k. Hence, in both cases, a crashed process p_k cannot prevent a correct process p_i from progressing.

End of the proof of the claim. □

3.4 Discussion

As presented in the Introduction and Section 2.4, this model includes two particular instances that have been deeply investigated. The first instance, namely, ∀i, t: uncertain_i = Π, corresponds to the time-free asynchronous system model. When instantiated in such a particular model, the protocol described in Fig. 1 can be simplified by suppressing the set down_i at line 5 and the set live_i at line 8 (as they are now always empty). The assumption required in Theorem 2 becomes f < n/2, and the protocol becomes the ⋄S-based consensus protocol described in [21]. It has been shown that f < n/2 is a necessary requirement in such a model [6]. In that sense, the proposed protocol is optimal.

The second extreme instance of the model is defined by ∀i, t: uncertain_i = ∅, and includes the classic synchronous distributed system model. It is important to notice that systems with less synchrony than classic synchronous systems can also be such that ∀i, ∀t, we have uncertain_i(t) = ∅.3 As previously, the protocol of Fig. 1 can be simplified for this particular model. More precisely, the set suspected_i (line 5) and item 2 of line 8 can be suppressed. We then obtain an early deciding protocol that works for any number of process failures, i.e., for f < n. (It is natural to suppress the sets suspected_i in synchronous systems, as ⋄S is only needed to cope with the net effect of asynchrony and failures.) Interestingly, the synchronous protocol we obtain is not the classical flood-set consensus protocol [20], [23], but a synchronous coordinator-based protocol.
3. It is sufficient that each process of the system is connected by a timely channel. For example, a system with four processes, two timely channels, and all the other channels untimely can be such that uncertain_i(t) = ∅.

A noteworthy feature of the protocol lies in its generic dimension: the same protocol can easily be instantiated in fully synchronous systems or fully asynchronous systems. Of course, these instantiations have different requirements on the value of f. A significant characteristic of the protocol is that it suits distributed systems that are neither fully synchronous nor fully asynchronous. The price that has to be paid then consists of equipping the system with a failure detector of the class ⋄S. The benefit it brings lies in the fact that the constraint on f can be weaker than f < n/2.4

Rule R3 of our model forbids process identifiers from being moved from uncertain to live or from uncertain to down during a given run. As a consequence, we prevent the application from upgrading the QoS of the related channels from untimely to timely during the execution of a consensus. One question that could be raised, though, is what would happen if, during an execution of consensus, processes could renegotiate the QoS of channels from untimely to timely, therefore allowing processes to move from uncertain to live. It turns out that in such situations the required uniform agreement cannot be achieved. Take the simple example of a group with three processes p_1, p_2, and p_3, all initially belonging to the set uncertain, as all channels are initially untimely. Now, consider that during a given consensus execution, processes p_2 and p_3 negotiate timely channels with p_1, moving the three process identifiers from their local uncertain sets to their local live sets. Suppose that before the newly negotiated QoS information reaches p_1, p_1 decides based on the quorum of two processes, p_1 and p_3 (at this stage, p_1's view is that all processes belong to uncertain), and, after some time, both p_1 and p_3 crash. Some time later, process p_2 detects the crash of p_1 and also of p_3 and decides based on its own initial value (for p_2, all processes are in live). Under these conditions, there is no intersection between the decision quorums used by p_1 and p_2, so the decided value of p_1 will be different from the decided value of p_2, which violates the uniform agreement property.

Regarding transitions from uncertain to down: if the state of a process p is unknown or uncertain in our model (which justifies p being in the uncertain set), it does not make sense to move p into down (which is considered safe information in our model, as defined by R2). So, in order to be moved from uncertain to down, p would first have to be moved from uncertain to live (and the situation is then exactly as above, which also justifies R3). Moreover, the same reasoning applies if we allowed process identifiers to be directly moved from uncertain to down (just take the above example and consider moving p_1 and p_3 from uncertain_2 into down_2). As a matter of fact, without R3 one could not deduce Property 1 and, consequently, prove uniform agreement (see Lemma 1).

Nevertheless, the above restriction does not prevent a process from taking advantage of a better QoS for its channels in future consensus executions (rule R6): it suffices that the new live and uncertain sets, obtained from QoS improvements, be set up in all cooperating processes before the next consensus execution.
4. The protocols designed for asynchronous systems that we are aware of require 1) ⋄S (or a failure detector with the same computability power, e.g., a leader oracle [5]), and 2) the upper bound f < n/2 on the number of process crashes. Our protocol has the same requirement for item 1, but a weaker one for item 2.


A simple way to realize that (without making use of extra mechanisms such as a membership protocol) is to send the new live set together with the coordinator's proposed value. Then, when the consensus is eventually decided, this new live set is equally installed in all processes (and, as its complement, the uncertain set).

4 IMPLEMENTATION OF THE ADAPTIVE DISTRIBUTED COMPUTING MODEL


Implementing our hybrid programming model requires some basic facilities, such as the provision and monitoring of QoS communications with both bounded and unbounded delivery times, and also a mechanism to adapt the system when timely channels can no longer be guaranteed due to failures. Our programming model builds on facilities typically encountered in QoS architectures [1], such as Omega [22], QoS-A [3], Quartz [25], and Differentiated Services [2]. In particular, we assume that the underlying system is capable of providing timely communication channels (akin to services such as QoS hard [3], deterministic [25], and Express Forward [2]). Similarly, we assume the existence of best-effort channels where messages are transmitted without guaranteed bounded time delays. We call these channels untimely. QoS monitoring and fail-awareness have been implemented by the QoS Provider (QoSP) and by the failure and state detector mechanisms, briefly presented below. It was a design decision to build our model on top of a QoS-based system. However, we could also have implemented our programming model based on facilities encountered in existing hybrid architectures: for instance, timely channels could be implemented using RTD channels by setting the probabilities Pd (deadline probability) and Pr (reliability probability) close to one, and untimely channels could be implemented with a basic channel without any guarantees [19]. The timing failure detection service of the TCB [4] could then complement the required functionality.

The QoS-based underlying distributed system we consider is a set of n processes p_1, ..., p_n, located in one or more sites, communicating through a set of n(n−1)/2 channels, where c_{i/j} denotes the communication channel between p_i and p_j. That is, the system is represented by a complete graph DS(Π, X), where Π denotes the nodes (processes) and X the edges (channels) of the graph. We assume that the processes in Π are equipped with enough computational power so that the time necessary to process the messages originated by the implemented model (i.e., the state detector, the failure detector, and the QoS Provider) is negligibly small compared with network delays. Therefore, such messages are assumed to be promptly computed.5 Moreover, the processes are assumed to fail only by crashing, and the network is not partitionable. We do not distinguish channels for upper-layer messages (e.g., generated by the consensus protocol) from channels for messages generated by the state and failure detectors: these messages are transmitted in the same channels. Regarding the QoSP, as will be seen in the next section, there is a unique module in each site of the system. When the system starts, the QoSP modules try to establish timely channels between themselves. However, if resources are not available for timely channels, untimely channels are generated.
5. This assumption can be relaxed by using real-time operating systems such as CactusRT [19], which can provide bounded process execution times.

4.1 The QoS Provider

In order to make our programming model portable to distinct QoS architectures, we define a number of functions encapsulated in a software device we call the QoS Provider (QoSP), used for creating and assessing QoS communication channels on behalf of application processes. Thanks to this modular encapsulation, porting our system to a given QoS infrastructure (such as DiffServ [2]) amounts to implementing the QoSP functions in the new target environment. The QoSP is made up of a module in each site of the system. The basic data structure maintained by each module is a table holding information about the existing channels. These modules exchange messages to carry out modifications of the QoS of channels (due to failures or application requests). This section describes the main QoSP functionalities needed for implementing our programming model. (The complete description of the QoS Provider is beyond the scope of this paper; more details can be found elsewhere [13], [15].) Processes interact with the QoS Provider through the following functions:
- CreateChannel(p_x, p_y): Π² → X;
- DefineQoS(p_x, p_y, qos): Π² × {timely, untimely} → {timely, untimely};
- QoS(p_x, p_y): Π² → {timely, untimely}; and
- Delay(p_x, p_y): Π² → ℕ.

The functions CreateChannel, DefineQoS, QoS, and Delay are used for creating a channel, changing its QoS (with admission test and resource reservation), obtaining its current QoS (monitoring), and obtaining the expected delay, in milliseconds, for message transfer over the channel c_{x/y}, respectively. Besides the above functions, each QoSP module continuously monitors all timely channels linked to the related site, to check whether failures or lack of resources have resulted in a modification of the channel QoS (from timely to untimely). A particular case that the QoSP also assesses is the existence of a timely channel on which no message flows within a period of time, which indicates that the timely channel is possibly linking two crashed processes (in this circumstance, the QoS of the channel is modified to untimely in order to release resources). Modifications of the QoS of channels are immediately reported to the state detectors related to the processes linked by such channels, through messages changeQoS(p_x, p_y, newQoS), indicating that the QoS of the channel c_{x/y} has been modified to newQoS (see Section 4.2.1).
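For illustration, the four functions could be rendered as the following Python skeleton (one QoSP module; names and types are illustrative, and the admission test is a placeholder for a query to the underlying QoS infrastructure):

```python
from enum import Enum

class Qos(Enum):
    TIMELY = "timely"
    UNTIMELY = "untimely"

class QoSProvider:
    # Sketch of the QoSP interface (one module per site). create_channel,
    # define_qos, qos, and delay mirror the four functions CreateChannel,
    # DefineQoS, QoS, and Delay of Section 4.1.

    def __init__(self):
        # Channel table: frozenset({px, py}) -> {"qos": Qos, "delay_ms": int}
        self.channels = {}

    def create_channel(self, px, py):
        # Channels start untimely; DefineQoS may upgrade them later.
        self.channels[frozenset((px, py))] = {"qos": Qos.UNTIMELY, "delay_ms": 0}

    def define_qos(self, px, py, qos):
        # Admission test and resource reservation would happen here; the
        # returned value is the QoS actually granted (may stay UNTIMELY).
        entry = self.channels[frozenset((px, py))]
        granted = qos if self._admission_ok(px, py, qos) else Qos.UNTIMELY
        entry["qos"] = granted
        return granted

    def qos(self, px, py):
        return self.channels[frozenset((px, py))]["qos"]

    def delay(self, px, py):
        # Expected message transfer delay, in milliseconds.
        return self.channels[frozenset((px, py))]["delay_ms"]

    def _admission_ok(self, px, py, qos):
        # Placeholder: a real implementation queries the QoS infrastructure
        # (e.g., DiffServ) for available resources along the path.
        return True
```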


When an application process p_x wants to create a timely channel with the process p_y, it does so by first invoking the CreateChannel(p_x, p_y) function and then the DefineQoS(p_x, p_y, timely) function. If these functions succeed (i.e., the channel is created and the related QoS is set to timely), and assuming that p_x and p_y do not crash, the newly created channel will remain timely during the system execution, unless either the DefineQoS(p_x, p_y, untimely) function is explicitly invoked or a fault in the communication path linking p_x and p_y makes it impossible to still guarantee the previously reserved resources for this channel. So, degrading QoS is not just a matter of increasing communication delays, but of losing reservation guarantees due to failures. As an example of such a fault, consider the crash of a QoS router in the path linking p_x and p_y, which redirects the communication flow through another router that has no resources reserved for that channel. In this situation, communication between p_x and p_y is preserved, but with a degraded QoS (i.e., untimely). The monitoring mechanism available in the underlying QoS infrastructure signals the loss of QoS to the QoS Provider, which changes the QoS of the channel c_{x/y} from timely to untimely. This monitoring mechanism is a sort of fail-awareness scheme; however, it differs from the kind implemented in the timed asynchronous model [11], which has no relation with resource reservation, as discussed above. In particular, notice that in our underlying system model, degrading QoS does not necessarily mean increasing communication delays. The converse, however, is always true: if a timeout expires for a timely channel, either the communicating process has crashed or the QoS has been downgraded. Therefore, our failure detector mechanism (described in Section 4.2.2) uses the expiration of timeouts, based on previously negotiated time delays for timely channels, as a safe indication of a crash. However, a crash is only confirmed if the QoS infrastructure assures that the related channel QoS has not been lost due to a failure, nor deliberately downgraded by the application.

When a process crashes, the QoS Provider can still give information about the QoS of the channels linked to that crashed process, because the communication resources are still reserved and the QoSP module maintains information about these channels. However, if the site hosting a process crashes, all the channels allocated to processes in this site are destroyed. If a QoS Provider module crashes, all related channels have their QoS changed to untimely. It should be observed at this point that, at runtime initialization, the modules of the QoS Provider try to create timely channels to connect themselves. If timely channels are successfully created for the QoSP modules, then timeouts are used to detect the crash of a QoSP module. Otherwise, untimely channels are created for the QoSP modules and, as a consequence, upper-layer applications will not be able to create any timely channel either. In other words, application-related timely channels can only be created if and after the timely channels for the related QoSP modules exist. Another important observation is that if the QoSP loses the QoS of its channels, the same will happen for all applications with channels in the related site; hence, the situation of having timely channels for applications and untimely channels for the related QoSP will never happen.
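With the QoSProvider skeleton sketched in Section 4.1 above (an assumption, not the actual QoSP code), the creation sequence just described would read:

```python
# Creating a timely channel between px and py with the sketched QoSP:
# first create, then try to upgrade; the application must check the
# granted QoS, since admission control may refuse the reservation.
qosp = QoSProvider()
qosp.create_channel("px", "py")
granted = qosp.define_qos("px", "py", Qos.TIMELY)
if granted is not Qos.TIMELY:
    print("no resources: channel stays untimely")
```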

If a given QoS Provider module cannot deliver information about a given channel (possibly because the related QoSP module has crashed), this channel is then assumed to be untimely (which may represent a change in its previous QoS condition). On the other hand, the crash of a QoSP module Mx can only be detected by another QoSP module My if there is a timely channel linking Mx to My. The creation and monitoring of such a timely channel is realized in the same way as for the application process channels, that is, by using the underlying QoS infrastructure facilities. After the expiration of an expected timeout, the underlying QoS infrastructure is queried to verify whether the QoS of the channel linking Mx and My remains timely and, in this case, My detects the crash of Mx. Otherwise, the channel linking Mx and My is made untimely, and all application channels handled by Mx and My are thereafter made untimely as well.

4.2 The Programming Model Implementation

In our programming model, distributed processes perceive each other's states by reading the contents of the sets down, live, and uncertain. These sets can evolve dynamically following system state changes while respecting the rules R0 to R5. Therefore, implementing our model implies providing the necessary mechanisms to maintain the sets according to their semantics. Two mechanisms have been developed to this end:
. a state detector that is responsible for maintaining the sets live and uncertain, in accordance with the information delivered by the QoSP, and
. a failure detector that utilizes the information provided by both the QoSP and the state detector, to detect crashes and update the down sets accordingly.

Associated with each process pi there is a module of the state detector, a module of the failure detector, a representation of the DS(Π, ξ) graph, and the three sets livei, uncertaini, and downi. The DS(Π, ξ) graph is constructed by using the QoSP functions CreateChannel() and DefineQoS(), for creating channels according to the QoS required and the resources available in the system. The modules of the state detector exchange the information about the created channels so that they keep identical DS(Π, ξ) graphs during the system initialization phase.6 During the initialization phase, the set downi is set to empty, and the contents of livei and uncertaini are initialized so that the identity of a process pj is placed into livei if and only if there is a timely channel linking pj to another process (i.e., ∃px ∈ Π such that DSi(pj, px) = timely). Otherwise, the identity of pj is placed in uncertaini. When the initialization phase ends, processes in Π observe identical contents for their respective live, down, and uncertain sets, and a given process is either in live or in uncertain (ensuring, therefore, the restrictions R0 and R1 of our model).
6. In the present prototype version, we use a centralized entity (process) and a distributed transaction to establish the initial graph (it is like reading the same information from a shared file). Of course, the system is only initialized if the transaction succeeds without failures (or suspicions thereof). Another option that we intend to explore is to use our consensus algorithm over the QoSP modules to define the initial configuration.
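The initialization rule just described might be rendered as the following sketch. Names and types are assumptions (the DS graph is modeled here as a nested map from process pairs to channel QoS, reusing ChannelQoS from the interface sketch above):

    import java.util.*;

    // Sketch of the initialization phase: down_i starts empty, and p_j is
    // placed in live_i iff the DS graph records at least one timely channel
    // for p_j; otherwise p_j is placed in uncertain_i.
    static void initializeSets(Set<String> processes,
                               Map<String, Map<String, ChannelQoS>> ds,
                               Set<String> live, Set<String> uncertain,
                               Set<String> down) {
        live.clear(); uncertain.clear(); down.clear();
        for (String pj : processes) {
            boolean hasTimely = ds.getOrDefault(pj, Collections.emptyMap())
                                  .containsValue(ChannelQoS.TIMELY);
            if (hasTimely) live.add(pj); else uncertain.add(pj);
        }
    }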


Fig. 2. Algorithm to update the sets livei and uncertaini .

During the application execution, the contents of the three sets are dynamically modified according to the existence of failures and/or QoS modifications. Next, the implemented mechanisms in charge of updating the contents of live and uncertain (the state detector) and the contents of down (the failure detector) are described.

4.2.1 An Implementation of the State Detector

There is a module of the state detector for each process in Π. The module of the state detector associated with pi executes two concurrent tasks to update the DS graph maintained by pi. The first task is activated by messages changeQoS(pi, px, newQoS) from the local module of the QoSP, indicating that the QoS of the channel linking pi to px has been modified. Upon receiving the changeQoS message, the state detector of pi first verifies whether px is in the downi set and, in this case, it terminates the task (this is necessary to guarantee R4). Otherwise, it passes on the information about the new QoS of the channel ci/x to the remote modules of the state detector, and the local DS graph is updated accordingly (i.e., DSi(pi, px) is set to newQoS). The second task is activated when the failure detector communicates the crash of a process px (see details in the next section). The goal of this task is to check whether there is a process in the live set, say py, that had a timely channel to the crashed process. If the channel cx/y was the only timely channel linking py, py can no longer be detectable and therefore must be moved from live to uncertain. This is realized by setting all channels linked to the crashed process as untimely in the DS graph. In both tasks, after updating the DS graph, the procedure UpdateState(), described in Fig. 2, is called for each px linked to a modified channel, to update the sets livei and uncertaini accordingly. Process px is moved from livei to uncertaini if no timely channel linking px is left (lines 1-4); px is moved from uncertaini to livei if a new timely channel linking px has been created (lines 6-8).7
7. Observe that as applications are not allowed to upgrade the QoS of channels during a run (R3), the else part of this algorithm will only be executed between runs (R6).
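Although Fig. 2 is not reproduced here, its effect as described above might look like the following sketch (our reconstruction with assumed names, using the same DS-graph representation as the previous sketch):

    // UpdateState(px), per the description of Fig. 2: px leaves live_i when
    // its last timely channel disappears (lines 1-4) and rejoins live_i when
    // a timely channel to it exists again (lines 6-8; by R3 the latter only
    // happens between runs, cf. footnote 7).
    static void updateState(String px,
                            Map<String, Map<String, ChannelQoS>> ds,
                            Set<String> live, Set<String> uncertain) {
        boolean hasTimely = ds.getOrDefault(px, Collections.emptyMap())
                              .containsValue(ChannelQoS.TIMELY);
        if (live.contains(px) && !hasTimely) {
            live.remove(px);
            uncertain.add(px);
        } else if (uncertain.contains(px) && hasTimely) {
            uncertain.remove(px);
            live.add(px);
        }
    }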

4.2.2 An Implementation of the Failure Detector

Besides maintaining the set downi, the failure detector also maintains the set suspectedi, which keeps the identities of processes suspected of having crashed. A process pi interacts with the failure detector by accessing these sets. The failure detector works in a pull model, where each module (working on behalf of a process px) periodically sends "are you alive?" messages to the modules related to the other processes in Π. The timeout value used for awaiting the "I am alive" message from a monitored process py is calculated using the QoSP function Delay(px, py). The timeout includes the so-called round-trip time (rtt)8 and a safety margin, to account for the time necessary to process these messages at px and py. For timely channels, the calculated timeout is accurate in the sense that network and system resources and the related scheduling mechanisms guarantee the rtt within a bounded limit. Therefore, the expiration of the timeout is an accurate indication that py crashed and, in that case, py is moved from livex to downx. To account for a possible modification of the QoS of the channel cx/y, before producing a notification the failure detector checks whether the channel has remained timely, using the QoSP function QoS(px, py).9 On the other hand, if the channel linking px and py is untimely, the expiration of the timeout is only a hint of a possible crash and, in that case, besides belonging to uncertainx, py is also included in the set suspectedx.

The algorithm of the failure detector for a process pi, described in Fig. 3, is composed of five parallel tasks. The parameter monitoringInterval indicates the time interval between two consecutive "are you alive?" messages sent by pi. The array timeouti[1..n] holds the calculated timeout for pi to receive the next "I am alive" message from each process in Π. The function CTi() returns the current local time. Task 1 periodically sends an "are you alive?" message to all processes (actually, to the related failure detector modules) after setting a timeout value to receive the corresponding "I am alive" message, which in turn is sent by Task 5. Task 2 assesses the expiration of timeouts: it sends notification messages when the timeouts expire for processes in livei, moving them into the downi set.
8. That is, the time to transfer the "are you alive?" message from px to py plus the time to transfer the "I am alive" message from py to px.
9. One should observe here that the QoS Provider holds the information and resources related to a given channel even after the crash of the processes linked by that channel.


Fig. 3. Algorithm for the failure detector module of pi.

Otherwise, if the timeout expires for processes in the uncertaini set, their identities are also included in the suspectedi set. Task 3, upon receipt of an "I am alive" message, cancels the corresponding timeout for the sending process and, if that process belongs to the suspectedi set, removes it from there. Task 4 handles crash notification messages and updates the sets downi and livei accordingly.
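As a hedged illustration of Task 2's timeout handling, the following fragment sketches how we read Fig. 3. The class, its fields, and notifyCrash() are assumptions (notifyCrash stands for the crash notification messages handled by Task 4), reusing the QoSProvider and ChannelQoS sketches from earlier:

    import java.util.*;

    // Fragment of an assumed failure-detector class for process pi.
    class FailureDetectorSketch {
        final String pi;
        final QoSProvider qosp;
        final Set<String> live = new HashSet<>(), uncertain = new HashSet<>(),
                          down = new HashSet<>(), suspected = new HashSet<>();

        FailureDetectorSketch(String pi, QoSProvider qosp) {
            this.pi = pi; this.qosp = qosp;
        }

        // Task 2, as we read Fig. 3: on an expired timeout for py, a process
        // in live_i is moved to down_i only if the channel is confirmed
        // timely; for a process in uncertain_i the expiration only raises a
        // suspicion.
        void onTimeoutExpired(String py) {
            if (live.contains(py)) {
                if (qosp.qos(pi, py) == ChannelQoS.TIMELY) {
                    live.remove(py);
                    down.add(py);
                    notifyCrash(py);   // crash notification, handled by Task 4
                }
                // otherwise the QoS degraded: reported via changeQoS instead
            } else if (uncertain.contains(py)) {
                suspected.add(py);     // only a hint of a possible crash
            }
        }

        void notifyCrash(String py) { /* send notification to remote modules */ }
    }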

CONCLUSION

This paper proposed and fully developed an adaptive programming model for fault-tolerant distributed computing. Our model provides upper-layer applications with process state information according to the current system synchrony (or QoS). The underlying system model is hybrid, composed of a synchronous part and an asynchronous part. However, such a composition can vary over time in such a way that the system may become totally synchronous or totally asynchronous. The programming model is given by three sets (processes perceive each other's states by accessing the contents of their local nonintersecting sets uncertain, live, and down) and by the rules R0-R6 that regulate modifications of these sets. Moreover, we showed how those rules and sets can be implemented in real systems. To illustrate the adaptiveness of our model, we developed a consensus algorithm that makes progress despite distinct views of the corresponding local sets and that tolerates more faults the more processes are in the live set. The presented consensus algorithm adapts to the currently available QoS (via the sets) and uses the majority assumption only when needed.

To the best of our knowledge, there is no other system that implements the functionality just described, taking advantage of the available QoS to provide process state information that can be explored to yield efficient solutions when possible. For example, in some circumstances, the live set can include all system processes even if the synchrony of the underlying system is not that of classic synchronous systems (e.g., a system with four processes, two timely channels, and all other channels untimely can be in such a condition). In these conditions, if there is no QoS degradation, our protocol tolerates f < n process faults. On the other hand, our system allows for a graceful degradation when the underlying QoS degrades, making it possible to circumvent the f < n/2 lower bound associated with the consensus problem in asynchronous systems equipped with eventual failure detectors, provided |uncertain| < n during the execution.

An implementation of the model on top of a QoS infrastructure has been presented. In order to specify the underlying functionality needed to implement it, a mechanism (called the QoS Provider) has been developed and implemented. Thanks to the modularity of this approach, porting the model implementation to a given environment only requires implementing the QoS Provider functions that have been defined. The proposed system has been implemented in JAVA and tested over a set of networked LINUX workstations equipped with QoS capabilities. More details of this implementation can be found elsewhere [15].

ACKNOWLEDGMENTS
The authors are grateful to the anonymous reviewers for their constructive comments.


REFERENCES

[1] C. Aurrecoechea, A.T. Campbell, and L. Hauw, "A Survey of QoS Architectures," ACM Multimedia Systems J., special issue on QoS architecture, vol. 6, no. 3, pp. 138-151, May 1998.
[2] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, "An Architecture for Differentiated Services," RFC 2475, Dec. 1998.
[3] A. Campbell, G. Coulson, and D. Hutchison, "A Quality of Service Architecture," ACM Computer Comm. Rev., vol. 24, no. 2, pp. 6-27, Apr. 1994.
[4] A. Casimiro and P. Veríssimo, "Using the Timely Computing Base for Dependable QoS Adaptation," Proc. 20th Symp. Reliable Distributed Systems (SRDS), pp. 208-217, Oct. 2001.
[5] T.D. Chandra, V. Hadzilacos, and S. Toueg, "The Weakest Failure Detector for Solving Consensus," J. ACM, vol. 43, no. 4, pp. 685-722, July 1996.
[6] T.D. Chandra and S. Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems," J. ACM, vol. 43, no. 2, pp. 225-267, Mar. 1996.
[7] M. Correia, N. Neves, L. Lung, and P. Veríssimo, "Low Complexity Byzantine-Resilient Consensus," Distributed Computing, vol. 17, no. 3, pp. 237-249, Mar. 2005.
[8] F. Cristian and C. Fetzer, "The Timed Asynchronous Distributed System Model," IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 6, pp. 642-657, June 1999.
[9] D. Dolev, C. Dwork, and L. Stockmeyer, "On the Minimal Synchronism Needed for Distributed Consensus," J. ACM, vol. 34, no. 1, pp. 77-97, Jan. 1987.
[10] C. Dwork, N. Lynch, and L. Stockmeyer, "Consensus in the Presence of Partial Synchrony," J. ACM, vol. 35, no. 2, pp. 288-323, Apr. 1988.
[11] C. Fetzer and F. Cristian, "Fail-Awareness in Timed Asynchronous Systems," Proc. 15th ACM Symp. Principles of Distributed Computing (PODC '96), pp. 314-321, May 1996.
[12] M.J. Fischer, N. Lynch, and M.S. Paterson, "Impossibility of Distributed Consensus with One Faulty Process," J. ACM, vol. 32, no. 2, pp. 374-382, Apr. 1985.
[13] S. Gorender and R. Macedo, "Fault-Tolerance in Networks with QoS," Technical Report RT001/03, Distributed Systems Laboratory (LaSiD), UFBA (in Portuguese), 2003.
[14] S. Gorender, R. Macedo, and M. Raynal, "A Hybrid and Adaptive Model for Fault-Tolerant Distributed Computing," Proc. IEEE Int'l Conf. Dependable Systems and Networks (DSN '05), pp. 412-421, June 2005.
[15] S. Gorender, R. Macedo, and M. Raynal, "A Hybrid Model for Fault-Tolerant Distributed Computing," Technical Report RT001/06, Distributed Systems Laboratory (LaSiD), UFBA, 2006.
[16] R. Guerraoui and M. Raynal, "The Information Structure of Indulgent Consensus," IEEE Trans. Computers, vol. 53, no. 4, pp. 453-466, Apr. 2004.
[17] R. Guerraoui and A. Schiper, "The Generic Consensus Service," IEEE Trans. Software Eng., vol. 27, no. 1, pp. 29-41, Jan. 2001.
[18] V. Hadzilacos and S. Toueg, "Fault-Tolerant Broadcasts and Related Problems," Distributed Systems, S. Mullender, ed., pp. 97-145, ACM Press, 1993.
[19] M. Hiltunen, R. Schlichting, X. Han, M. Cardozo, and R. Das, "Real-Time Dependable Channels: Customizing QoS Attributes for Distributed Systems," IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 6, pp. 600-612, June 1999.
[20] N. Lynch, Distributed Algorithms, p. 872, Morgan Kaufmann, 1996.
[21] A. Mostefaoui and M. Raynal, "Solving Consensus Using Chandra-Toueg's Unreliable Failure Detectors: A General Quorum-Based Approach," Proc. 13th Symp. Distributed Computing (DISC '99), pp. 49-63, Sept. 1999.
[22] K. Nahrstedt and J.M. Smith, "The QoS Broker," IEEE Multimedia, vol. 2, no. 1, pp. 53-67, 1995.
[23] M. Raynal, "Consensus in Synchronous Systems: A Concise Guided Tour," Proc. Ninth IEEE Pacific Rim Int'l Symp. Dependable Computing (PRDC '02), pp. 221-228, Dec. 2002.
[24] Y. Ren, M. Cukier, and W.H. Sanders, "An Adaptive Algorithm for Tolerating Value Faults and Crash Failures," IEEE Trans. Parallel and Distributed Systems, vol. 12, no. 2, pp. 173-192, Feb. 2001.
[25] F. Siqueira and V. Cahill, "Quartz: A QoS Architecture for Open Systems," Proc. 18th Brazilian Symp. Computer Networks, pp. 553-568, May 2000.
[26] R. van Renesse, K. Birman, M. Hayden, A. Vaysburd, and D. Karr, "Building Adaptive Systems Using Ensemble," Software Practice and Experience, vol. 28, no. 9, pp. 963-979, July 1998.
[27] P. Veríssimo and A. Casimiro, "The Timely Computing Base Model and Architecture," IEEE Trans. Computers, special issue on asynchronous real-time systems, vol. 51, no. 8, pp. 916-930, Aug. 2002.

Sergio Gorender received the BSc degree in computer science from the Federal University of Bahia (UFBA), Brazil, in 1991, and the MSc and PhD degrees, also in computer science, from the Federal University of Pernambuco (UFPE), Brazil, in 1995 and 2005, respectively. Since 1996, he has been a lecturer in the Department of Computer Science at UFBA and, since 2000, he has been a researcher at LaSiD, the Distributed Systems Laboratory at the same university. His research interests include the many aspects of dependable distributed systems.

Raimundo Jose de Araujo Macedo received the BSc, MSc, and PhD degrees in computer science from the Federal University of Bahia (UFBA), Brazil, the University of Campinas (Unicamp), Brazil, and the University of Newcastle upon Tyne, England, respectively. Since 2000, he has been a professor of computer science at UFBA, where he founded the Distributed Systems Laboratory (LaSiD) in 1995. Currently, he is the head of LaSiD and of the newly created Doctorate Program in Computer Science. Formerly, he coordinated the Graduate Program on Mechatronics at UFBA. His research interests include the many aspects of dependable distributed systems and real-time systems. He has served as a program committee member for a number of conferences, including the IEEE/IFIP International Dependable Systems and Networks Conference (DSN), the ACM/IFIP/USENIX International Middleware Conference, the Brazilian Symposium on Computer Networks and Distributed Systems (SBRC), and the Latin-American Dependable Computing Symposium (LADC). He was the program chair and general chair of SBRC in 2002 and 1999, respectively. He is a member of the IEEE Computer Society.

Michel Raynal has been a professor of computer science since 1981. At IRISA (CNRS-INRIA-University joint computing research laboratory located in Rennes), he founded a research group on distributed algorithms in 1983. His research interests include distributed algorithms, distributed computing systems, networks, and dependability. His main interest lies in the fundamental principles that underlie the design and the construction of distributed computing systems. He has been the principal investigator of a number of research grants in these areas and has been invited by many universities all over the world to give lectures on distributed algorithms and distributed computing. He belongs to the editorial boards of several international journals. He has published more than 100 papers in journals (Journal of the ACM, Acta Informatica, Distributed Computing, Communications of the ACM, etc.) and more than 200 papers in conferences (ACM STOC, ACM PODC, ACM SPAA, IEEE ICDCS, etc.). He has also written seven books devoted to parallelism, distributed algorithms, and systems (MIT Press and Wiley). He has served on program committees for more than 70 international conferences (including ACM PODC, DISC, ICDCS, IPDPS, DSN, LADC, SRDS, SIROCCO, etc.) and chaired the program committees of more than 15 international conferences (including DISC (twice), ICDCS, SIROCCO, and ISORC). Professor Raynal served as the chair of the steering committee leading the DISC symposium series from 2002-2004. He received the IEEE ICDCS Best Paper Award three times in a row: 1999, 2000, and 2001. Recently, he cochaired SIROCCO 2005 (devoted to communication complexity), IWDC 2005, and IEEE ICDCS 2006.


