
Journal of Grid Computing 1: 251-272, 2003. © 2004 Kluwer Academic Publishers. Printed in the Netherlands.


A Flexible Framework for Fault Tolerance in the Grid


Soonwook Hwang (1) and Carl Kesselman (2)
(1) National Institute of Informatics, Jimbocho Mitsui Building 14F (NAREGI), 1-105, Kanda-Jimbocho, Chiyoda-ku, Tokyo 101-0051, Japan. E-mail: hwang@grid.nii.ac.jp
(2) Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292, USA. E-mail: carl@isi.edu

Key words: failure detection, fault tolerance, Grid computing, workflow

Abstract

This paper presents a failure detection service (FDS) and a flexible failure handling framework (Grid-WFS) as a fault tolerance mechanism on the Grid. The FDS enables the detection of both task crashes and user-defined exceptions. A major challenge in providing such a generic failure detection service on the Grid is to detect those failures without requiring any modification to either the Grid protocol or the local policy of each Grid node. This paper describes how to overcome the challenge by using a notification mechanism which is based on the interpretation of notification messages being delivered from the underlying Grid resources. The Grid-WFS built on top of the FDS allows users to achieve failure recovery in a variety of ways depending on the requirements and constraints of their applications. Central to the framework is flexibility in handling failures. This paper describes how to achieve this flexibility through the use of workflow structure as a high-level recovery policy specification, which enables support for multiple failure recovery techniques, the separation of failure handling strategies from the application code, and user-defined exception handling. Finally, this paper presents an experimental evaluation of the Grid-WFS using a simulation, demonstrating the value of supporting multiple failure recovery techniques in Grid applications to achieve high performance in the presence of failures.

1. Introduction

The Grid environment refers to the Internet-connected computing environment in which computing and data resources are geographically dispersed in different administrative domains with different policies for security and resource use, and in which the computing resources are highly heterogeneous, ranging from single PCs and workstations, through clusters of workstations, to supercomputers. With Grid technologies it is possible to construct large-scale applications over the Grid environment [17, 8]. However, developing, deploying, and running applications on the Grid environment poses significant challenges due to the diverse failures encountered during execution. Indeed, after successfully building large-scale Grid applications running over nine sites in the United States, albeit not straightforwardly because of the complexity of the underlying Grid environments, the developers indicated how unreliable the Grid environment could be by stating in [8] that "we have been astonished by the range of error conditions that we have encountered." A recent survey [36] of real Grid users on fault treatment in the Grid has revealed how fragile it is to run applications on Grid environments susceptible to a wide range of failures. Failures or error conditions due to the inherently unreliable1 nature of the Grid environment include hardware failures (e.g., host crash, network partition, etc.), software errors (e.g., memory leak, numerical
1 The Grid environment is unreliable because it is geographically dispersed; it involves multiple autonomous administrative domains; and it is composed of a large number of components (e.g., instruments, displays, computational and informational resources, people). So, with all this complexity, it is not surprising to find that some components of the Grid have failed to operate.

exception, etc.) and other sources of failures (e.g., machine rebooted by the owner, network congestion, excessive CPU load, etc.). Besides the need to deal with this unreliable nature of the Grid environment, Grid applications, which are in general distributed, heterogeneous multi-task applications, should be able to handle failures sensitive to the context of their own component tasks, what we call task-specific failures. For example, a linear solver component task should reach convergence within 30 minutes; otherwise, it would be considered a performance failure because something unexpected has happened, such as excessive CPU load, or the priority of the linear solver being lowered by the owner of the host in favor of his own jobs. As another example, a simulation task requires a certain amount of disk space to save temporary results. If there is not enough disk space remaining, the simulation task will fail due to the lack of disk space. These and other types of task-specific failures, as well as failures in the Grid environment, should be detectable and handled in a variety of ways depending on the execution semantics of both the task and the overall Grid application:
- In the case of the linear solver, if it has not completed within 30 minutes, terminate the task, allocate a new resource, and restart;
- In the case of the simulation task, if not enough disk space remains, terminate the task in advance, and either restart it on a machine with a large amount of disk space, or retry it on the same machine but with a different algorithm that requires less disk space;
- In the case of a long-running task, checkpoint periodically and restart from the last good state;
- In the case of a task running on an unreliable execution environment, have multiple replicas of the task run on different machines, so that as long as not all replicated tasks fail, the task will succeed;
- In other cases, undo the effect of the failed task and retry, or ignore the failure and continue, etc.
Existing distributed systems, parallel systems, or even so-called Grid systems designed to address fault tolerance issues fail to address Grid-specific fault tolerance issues such as task-specific failure handling and multiple failure handling schemes. As far as we know, existing systems focus on integrating only one failure-type-independent fault tolerance technique as their failure handling policy. For example, in traditional distributed systems such as large-scale transactional distributed systems, even though dispersed geographically over wide area networks, the component tasks are not heterogeneous. Instead, the component tasks are not only homogeneous but also simple (i.e., they mainly read and write data items), and thus need not distinguish the type of failures returned. As a result, transaction-based recovery (i.e., logging and rollback), a failure-type-independent fault tolerance technique, appears to be sufficient as the only failure handling policy for a transactional distributed system. However, for Grid applications, which consist of arbitrary component tasks, each with its own failure semantics, a failure-type-independent fault tolerance technique does not work well because it can support neither task-specific failure definition, detection, and handling nor diverse failure handling strategies.
In this paper, we identify four requirements for fault tolerance in the Grid: (1) a Grid-aware generic failure detection mechanism, (2) support for diverse failure handling strategies, (3) separation of failure handling strategies from the application algorithm logic, and (4) user-defined exception handling for dealing with task-specific failures. We present a fault tolerance mechanism on the Grid which addresses these four Grid-unique fault tolerance requirements. Our approach consists of two major techniques, a generic failure detection service and a flexible failure handling framework, which address failure detection and failure recovery on the Grid, respectively. The generic failure detection service enables applications exploiting the service to detect two failure classes (i.e., the task crash failure and user-defined exceptions) which we identified in [27] as two critical failure types to deal with in Grid applications. A major challenge in providing such a generic failure detection service for the Grid is to detect the two classes of failures without requiring any modifications to either the Grid protocol or the local policy of each Grid node. We describe how to overcome the challenge by employing a notification mechanism that is based on the interpretation of notification messages being delivered from different entities (i.e., the task itself, the Grid generic server, the heartbeat monitor) residing on each Grid node. The flexible failure handling framework built on top of our generic failure detection service allows users to achieve failure recovery in a variety of ways based on the requirements or constraints of their applications following failure detection. The core of our framework is the use of workflow [26, 21, 32, 27]

as a recovery strategy specification to specify diverse failure recovery procedures at a high level rather than hardcoding them inside application code. A workflow enables the structuring of applications in a directed acyclic graph form, what is called a workflow structure, where each node represents a constituent task and edges represent inter-task dependencies of the application. We incorporate two-level failure recovery into the workflow structure:
- Task-level techniques refer to recovery techniques that are applied at the task level to mask the effect of task crash failures. These techniques realize the so-called masking fault tolerance techniques [19] such as retrying, checkpointing, and replication.
- Workflow-level techniques refer to recovery techniques that enable the specification of failure recovery procedures as part of the application structure. These techniques realize the so-called non-masking fault tolerance techniques [19, 14] such as the alternative task; basically, these techniques allow alternative tasks to be launched to deal with not only user-defined exceptions but also the failures that task-level techniques fail to mask (e.g., due to insufficient redundant resources) at the task level.
We show how users can specify diverse failure handling strategies (e.g., depending on the performance goal of their applications, the availability of Grid resources, task-specific execution semantics, etc.) with task-level techniques, workflow-level techniques, or even a combination of both. We also demonstrate the flexibility of our framework by describing how it allows users to rapidly prototype and investigate different fault tolerance schemes and, more importantly, to easily change them according to changes in the underlying Grid structure and state. We have prototyped a Grid workflow system (Grid-WFS) which implements all the features described above. In particular, since our failure detection technique is designed and implemented to be generic, we expect it to be adopted in other Grid tools or systems (which rely on the Grid protocol [29] for their task execution on Grid resources, e.g., Condor-G [18], CoG Kits [47], Nimrod-G [4], Ninf-G [38], etc.), as we have done with the Grid-WFS. Finally, we present an experimental evaluation of our framework using a simulation, demonstrating the value of supporting multiple failure recovery techniques in Grid systems to achieve high performance in the presence of failures.

2. Background and Motivation: The Grid

We aim to provide a fault tolerance mechanism in the context of the Grid. In this section, we describe key terms and concepts of the Grid which will be required in the remainder of this paper.

2.1. Terminology

The term "the Grid" is nicely defined in "What is the Grid? A Three Point Checklist" [15], where Ian Foster emphasized the critical role of standards, as is the case with the definition of the Internet. We borrow his definition and redefine it by focusing on only one standard protocol for having access to computing resources rather than on the multi-standard protocols of the Grid. That way, we believe it becomes possible to discuss semantic details of the essential issues of what it means to fail, how failures can be detected, whether remote execution systems offer useful information about failures, etc.; otherwise, the Grid is too broad to come up with clear semantic details on those issues.
We define some terms used in this paper as follows. The Grid or the Grid environment refers to a collection of computing resources on which computation can be launched only via the de facto Grid standard protocol, i.e., the Globus GRAM protocol. According to this definition, today's Grid deployment practices such as NASA's Information Power Grid [30] and the ASCI Grid [7] can be referred to as the Grid, but a Condor [35] pool or a Unicore [42] deployment cannot. A Grid node, Grid resource, or Grid entity refers to any computing resource that can speak the Globus GRAM protocol. By this definition, a Condor pool with the Globus Toolkit installed on its front end to speak the Globus GRAM protocol does qualify as a Grid node, but a Condor pool by itself does not. A Grid application or Grid tool refers to any application or tool which relies on the Globus GRAM protocol to execute its component tasks on Grid nodes. According to this definition, Condor-G [18], Ninf-G [38], and Nimrod-G [4] are Grid applications or tools, but Condor, Ninf [43], and Nimrod [5] are not.

2.2. Grid Architecture

In this section, we briefly describe the Globus GRAM [11], a reference implementation of the Grid core


Figure 1. De facto Grid core architecture for accessing Grid computing resources. Note that we are focusing on the access-to-computation aspect of the Grid architecture, so other important aspects of Grid architectures (e.g., access to data, resource discovery, etc.) are not shown in this figure.

architecture, as we aim to provide a fault tolerance mechanism on it. Central to the Grid core architecture (Figure 1) is the notion of protocols, services, and application programming interfaces (APIs), as emphasized in [29], to achieve interoperability between Grid nodes of different platforms and thus ultimately to establish dynamic resource sharing among a set of Grid entities, what is called a Virtual Organization. Using the Grid client API, a Grid client (or a Grid application) can request a computation on a specific Grid node. Upon receipt of the request, the Grid generic server residing on that Grid node initiates, controls, and monitors the computation on its local computing resources using the interfaces and protocol provided by the underlying local resource management system. The Grid generic server then notifies the Grid client of the progress of the computation. Note that in order for a computing resource (whether a single PC or a cluster of PCs) to become a Grid node, it is necessary to have the Grid generic server, which can understand the Grid protocol, located on it. We also note that, recalling our definition of Grid applications in the previous section, an application can qualify as a Grid application by employing the Grid client API. Currently, the GRAM component of the Globus Toolkit realizes an implementation of this Grid core architecture: the Globus GRAM protocol implements the Grid protocol, the Globus GRAM client API the

Grid client API, and the Globus jobmanager the Grid generic server. In the next section, we discuss limitations of this Globus GRAM architecture with respect to fault tolerance.

2.3. Limitations

The GRAM protocol is too general to support failure management in the Grid. Even though the GRAM protocol supports monitoring of status changes occurring in a computation submitted to a Grid node, the state report does not tell whether the computation has completed successfully or not; rather, it only tells when the computation is no longer running. In other words, when a Grid client submits a task through the GRAM protocol, there is no way for the client to distinguish between task crash and completion. Due to this shortcoming of the GRAM protocol, currently most Grid applications, tools, or systems (e.g., Condor-G, CoG Kit, Nimrod-G, Ninf-G) which rely on the GRAM protocol for remote task execution have either been ignoring fault tolerance issues or adopting ad hoc fault tolerance mechanisms of their own which can be neither reused nor shared among them. This observation of the limitation of the GRAM protocol motivated us to develop a generic failure detection service (Section 5). The lack of support for providing useful information about failures of computations in the GRAM protocol comes from the hourglass model of the GRAM protocol [16].

As such, (1) it supports only a simple and small set of functionality (i.e., those operations that are indispensable for performing primitive remote computation) such as initiation, cancellation, and monitoring of computation; and (2) to date it has successfully embraced a wide range of different local resource management systems as Grid resources, including PBS, Condor, LoadLeveler, NQE, Sun Grid Engine, Fork, etc. In addition, it can be expected that any newly developed local resource management system can easily become part of the Grid resources, as we have observed with how easily the latest local resource management system, Sun Grid Engine, became one of the major Grid computing resources; it required only a few hundred lines of shell or Perl script.

3. Requirements for Fault Tolerance in the Grid

We have identified four requirements for fault tolerance in the Grid. Basically, these requirements are driven by the unique properties of the Grid, namely that Grid environments are generic, heterogeneous, and dynamic, and Grid applications are heterogeneous.

3.1. Generic Failure Detection Mechanism

Failure detection is the first essential phase in developing any fault tolerance mechanism or fault-tolerant system for practical use. Hence, in order for us to explore a new fault tolerance mechanism for the Grid, it is essential to provide a failure detection mechanism within the Grid context. In [27], we identified the task crash failure and user-defined exceptions as two critical classes of failures that need to be handled in Grid applications. We therefore require that the fault detection mechanism be able to detect these two failure classes. Note that we aim at providing a failure detection mechanism within the Grid context. One of the key characteristics of the Grid is that it respects the local policy of its individual Grid nodes as much as possible, only requiring them to speak the Grid standard protocol to join it. As a result, the Grid has, to date, rapidly emerged as a feasible platform for deploying large-scale applications. These local policy and standard protocol aspects of the Grid impose important requirements on our approach to the failure detection mechanism. The failure detection mechanism should neither affect the local policy of each Grid node (e.g., by requiring any changes to the code of existing local resource management systems), nor should it require any changes to the Grid protocol. Modifying the Grid protocol (e.g., such that it can support error handling that enables Grid applications to detect both the task crash failure and user-defined exceptions) would cause two problems. One problem is a backward compatibility problem with the existing Grid deployment. The other, more severe, problem is that there would be some local resource management systems that no longer qualify as Grid nodes because they do not support the new error handling requirements. For example, modifying the Grid protocol to support the propagation of the exit codes returned by computations on Grid nodes to the Grid clients would prevent the Java Virtual Machine (JVM) from participating as a computing platform in the Grid. In the JVM, the exit code is not propagated properly to the outside of the JVM [46]. This implies that in order for the JVM to remain a useful Grid computing resource, a modification to the JVM code, which is certainly infeasible in practice, would be required to handle the exit codes appropriately as required by the newly modified Grid protocol.

3.2. Support for Diverse Failure Handling Strategies

We require that the Grid failure recovery mechanism support a wide range of failure handling strategies following the detection of failures. This requirement is driven by the heterogeneous nature of the Grid context, such as heterogeneous tasks (e.g., long-running tasks, mission-critical tasks, transactional tasks, etc.) and heterogeneous execution environments (e.g., highly reliable execution environments such as a Condor resource, unreliable execution environments such as a single workstation donated by an anonymous volunteer for its idle computing cycles, etc.). This heterogeneity introduces the need for a flexible failure handling mechanism that supports multiple fault tolerance techniques, allowing each task to select an appropriate fault tolerance technique among alternatives depending on, e.g., the task characteristics or an estimated reliability of the underlying execution environment. For example, suppose that a Grid computing resource on which a task is running has a long downtime; the task may prefer the strategy of retrying on another available Grid resource to either retrying on the same resource or restarting with checkpointing on the same resource.

3.3. Separation of Failure Handling Policies from Application Code

We require that the Grid failure recovery mechanism enable the separation of failure handling policies from application code. This requirement is driven by the dynamic nature of the Grid. The state and structure of the underlying Grid are constantly changing. For example, software resources with novel algorithms are added; new hardware resources are added and old ones are retired. In order for users to cope with the constantly changing nature of the Grid, a high-level specification for failure handling should be supported so that they can specify failure handling policies at a high level without having to hardwire them into the application algorithm. As a result, users should be able to quickly adapt failure handling policies to the newly changed Grid environment simply by modifying the high-level policy description. Manually coding task-specific failure detection and failure handling procedures within the application is not a viable solution because it makes the design and development of Grid applications much more complicated; furthermore, this approach requires application programmers to start from scratch by embedding fault tolerance procedures inside the application code in an ad hoc manner each time they develop a new application.

3.4. User-defined Exception Handling

We require that the Grid failure recovery mechanism support user-defined exception handling. That is, users should be allowed to specify user-defined exceptions to handle task-specific failures based on the task context. In addition, users should be able to specify appropriate exception handling procedures to deal with the associated user-defined exceptions occurring during task execution. For example, suppose that we have a computation that needs to be accomplished, for which there are two algorithms available. One algorithm is faster than the other, but requires a large amount of memory. The other requires less memory by using a local disk instead of memory, but is slower than the first. A task using the first algorithm may fail during execution because it runs out of virtual memory space. In this case, users should be able to specify a user-defined exception called "out of memory" for the task using the first algorithm. In addition, users should be allowed to specify a failure handling policy like: if a task using the first algorithm fails due to the "out of memory" exception, try an alternative task using the second algorithm rather than retrying the same task.

4. Overview of Our Approach

In this section we briefly describe our approach to a fault tolerance mechanism for the Grid designed to meet the requirements mentioned above. Figure 2 presents an overview of our approach, which comprises two phases (i.e., a failure detection phase and a recovery phase) represented by a generic failure detection service and a flexible failure handling framework, respectively. The generic failure detection service, designed to be used in Grid applications, relies on heartbeat and event notifications to enable the detection of two failure classes

Figure 2. Overview of our approach.

(i.e., the task crash failure and user-defined exceptions) during task execution on Grid nodes. That is, on receipt of the heartbeat and event notification messages being delivered from each Grid node, Grid applications can interpret these messages to determine the state (i.e., inactive, active, done, failed, exception) of their component tasks submitted to the Grid node. In addition, the service allows users to specify user-defined exceptions to handle task-specific failures. Section 5 gives details of our failure detection mechanism. The flexible failure handling framework uses a workflow structure into which we have integrated task-level and workflow-level failure recovery techniques. The task-level techniques are intended to mask task crash failures at the task level, i.e., without affecting the flow of the encompassing workflow structure. The workflow-level techniques, on the other hand, are intended to be applied at the workflow level (i.e., by allowing changes to the flow of workflow execution) to handle the user-defined exceptions as well as the task crash failures that cannot be masked with the task-level techniques. With these two-level techniques, users can specify failure handling strategies at different levels of abstraction: the task-level techniques enable users to mask generic task failures (i.e., failure-type-independent failures) without having to know about the task context, while the workflow-level techniques enable users to define an appropriate recovery procedure in the application structure to handle task-specific failures (i.e., failure-type-sensitive failures) based on their knowledge of the task execution context. Section 6 gives details of our framework. Note that in Figure 2, the arrow labeled "Task crash" indicates that the task crash failure detected by the generic failure detection mechanism can usually be handled with the task-level techniques such as retrying, checkpointing, and replication. Similarly, the arrow labeled "User-defined exception" denotes that user-defined exceptions can be dealt with by the workflow-level techniques such as the concept of an alternative task, or workflow-level redundancy (e.g., launching several tasks, each implemented with a different algorithm, in the hope that one of the redundant tasks will finish successfully). Another arrow, labeled "fail to mask", connects the task level to the workflow level; it indicates that when the task-level recovery techniques have failed to mask failures (e.g., due to not enough redundancy to mask them at the task level), the workflow-level recovery techniques can be applied to deal with the propagated failures. When users design and implement fault-tolerant applications, this failure handling framework gives them a great deal of flexibility by supporting a wide assortment of failure handling strategies, ranging from well-known task-level fault tolerance techniques to various user-specified application-level failure management schemes; to this end, the framework provides users with the XML Workflow Process Definition Language, called XML WPDL (Section 8). This use of a high-level workflow structure factors out the design of fault tolerance schemes from the low-level details of application algorithm design. Consequently, different fault tolerance schemes and designs can be rapidly prototyped and investigated and, more importantly, can be easily changed according to changes in the underlying Grid structure, as highlighted in Section 7.
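To make the two-level routing of Figure 2 concrete, the following Python sketch shows one possible way the dispatch could be organized; the function and callback names are our own illustration and are not part of the Grid-WFS API.

```python
from enum import Enum, auto

class Outcome(Enum):
    DONE = auto()       # Done + End Task notifications received
    CRASH = auto()      # Done received without End Task
    EXCEPTION = auto()  # Exception notification received from the task itself

def dispatch(outcome, mask_at_task_level, handle_at_workflow_level, exception=None):
    """Route a detected failure to task-level or workflow-level recovery."""
    if outcome is Outcome.DONE:
        return "continue workflow"
    if outcome is Outcome.CRASH:
        # Try to mask the crash at the task level (retrying/checkpointing/replication).
        if mask_at_task_level():
            return "continue workflow"
        # Masking failed (e.g. retries exhausted): propagate to the workflow level.
        return handle_at_workflow_level("task crash")
    # User-defined exceptions go straight to the workflow level (alternative task, etc.).
    return handle_at_workflow_level(exception)

# A crash that task-level retrying fails to mask falls through to the workflow level:
print(dispatch(Outcome.CRASH,
               mask_at_task_level=lambda: False,
               handle_at_workflow_level=lambda why: f"activate alternative task ({why})"))
```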

5. A Generic Failure Detection Service

Our generic failure detection architecture (Figure 3) comprises two fundamental components. A set of notification generators is located on each Grid node, generating notification messages corresponding to the status changes in a computation (i.e., task) submitted to the Grid node. Among the notification generators are (1) the Grid generic server, (2) the heartbeat monitor, and (3) the task itself, as illustrated in Figure 3. The Grid generic server generates so-called Grid generic event notification messages [11] such as "task active", "task failed", and "task done". The heartbeat monitor generates periodic aliveness notification messages (i.e., "he is alive" heartbeat messages) on behalf of the task it monitors. The task itself generates what are called task-specific event notification messages such as "task completed successfully", "instrument turned on", "disk full exception detected", etc. A notification listener receives the Grid generic event, heartbeat, and task-specific event notification messages generated and sent out by the three notification generators on a Grid node. Based on an examination of the notification messages being delivered, it determines whether the task has finished successfully or not; in addition, the notification listener can distinguish whether the task has failed due to a task crash failure or to a user-defined exception. Note that a Grid application can easily become a notification


Figure 3. Architecture of the failure detection mechanism. Notice the difference between this figure and Figure 1. We extended the Grid client API to support the notification listening port, and added the heartbeat monitor component. Also shown are direct communication channels (1) between a task and the heartbeat monitor, (2) between the heartbeat monitor and a notification listening port (i.e., the dotted arrow line), and (3) between a task and another notification listening port (i.e., the bent-up solid arrow line).

listener by handing the two notification ports (i.e., one for receiving heartbeats and the other for task-specific event notifications) down to the task when launching it on the Grid node; the task therefore knows where to send its task-specific notifications, and can let the heartbeat monitor know where the task's heartbeat messages should be sent. As discussed previously (in Section 3.1), the first and foremost consideration in the design of our failure detection mechanism is that it should neither affect the local policy of each Grid node nor require any changes to the Grid protocol, which has led to the architectural shape shown in Figure 3. We can think of two mechanisms for propagating task-specific event notifications from the task to the Grid client. One is through the Grid protocol; i.e., the Grid generic server performs intermediate buffering of the notification messages sent by the task and transfers them to the Grid client. The other is through a direct communication channel between the task and the Grid client. We chose the second option (see the bent-up arrow line in Figure 3). We did not consider the first option as our design choice because not only does it require some changes to the Grid protocol, but there would also be some local resource management systems that do not support interfaces through which

the Grid generic server can do the intermediate buffering; in that case, those resource management systems could no longer play a role as Grid resources. Another important thing to note in the design of our failure detection architecture is that the heartbeat monitor is a standalone entity (i.e., it has no interdependency with any other components of the Grid architecture). Therefore, deploying the heartbeat monitor on a Grid node does not affect any local policy decisions on the Grid node, except for consuming some extra computing power. In other words, all that is required to make it possible for a Grid node to provide the heartbeat service is to have the heartbeat monitor run on it as a daemon process. Note that using the heartbeat service is only an API call away. That is, in order for a task to get the heartbeat service, all the task needs to do is to register with the heartbeat monitor, simply by calling a task-specific event notification API function inside the task code. To determine the state of a task (e.g., whether the task has failed or not), the Grid client relies on a combination of notification messages from the Grid node to which the task is submitted for execution. Figure 4 illustrates the types of notification messages from a Grid node and the state transitions of a task maintained at the Grid client. In [28], we have described more details of our approach, including the


Figure 4. (a) Notification messages sent from notification generators on a Grid node to the notification listener; (b) state transition diagram of a task as visible at the notification listener. When the task has finished successfully, the Grid client receives both a Done notification from the Grid generic server and an End Task notification from the task. When the task has terminated with a user-defined exception, the client can recognize it by receiving an Exception notification from the task. When the task has crashed during execution for whatever reason, the client can detect it by receiving the Done notification without the End Task notification. Note that, for implementation simplicity of the failure handling framework described later, all failures other than exceptions are aggregated into the task crash failure.

format of notification messages, the state transitions at the notification listener, the API for both registration and task-specific event notifications, the notification listener API, implementation details, limitations of our implementation, etc.
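As an illustration of how a notification listener might apply the classification rules summarized in Figure 4, consider the following Python sketch; the message tuples and the function are hypothetical and merely stand in for the actual listener API described in [28].

```python
def classify(messages):
    """Classify a task's final state from the notifications of Figure 4.

    messages: iterable of (sender, kind, payload) tuples collected for one task.
    """
    kinds = {kind for _, kind, _ in messages}
    if "Exception" in kinds:
        # The task itself reported a user-defined exception (e.g. "disk-full").
        detail = next(p for _, k, p in messages if k == "Exception")
        return ("exception", detail)
    if "Done" in kinds and "End Task" in kinds:
        return ("done", None)      # server reports completion, task confirms success
    if "Done" in kinds:
        return ("crashed", None)   # completion reported without the task's End Task
    return ("active", None)        # still running (heartbeats keep arriving)

print(classify([("server", "Active", None), ("server", "Done", None)]))
# -> ('crashed', None): Done arrived without End Task, so the task is treated as crashed.
```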

6. A Flexible Failure Handling Framework

We have developed a workflow-based failure handling framework on top of our failure detection service. This section describes how our framework allows users to achieve failure recovery in a variety of ways following the detection of task crash failures and user-defined exceptions.

6.1. Task-level Failure Handling Techniques

In this section we describe failure handling techniques that can be applied at the task level so as to prevent task crash failures from being propagated to the workflow level.

6.1.1. Retrying

This might be the simplest failure recovery technique, used in the hope that whatever caused the failure will not be encountered in subsequent retries. Figure 5 shows an example of a workflow specification fragment using the XML WPDL. It describes that if the task crash failure is detected (i.e., by receiving Done without the End Task notification, as described in [28]), this

Figure 5. XML WPDL example of retrying. Users can specify the maximum number of retries and the interval between retries by using the max-tries and interval attributes in the task specification. Note that this code represents retrying on the same resource. Users can also specify retrying on different resources simply by defining multiple Grid resources, as seen in Figure 6.

particular task, named summation, would be retried on the specified Grid resource (i.e., the one whose hostname is bolas.isi.edu) up to 3 times with an interval of 10 seconds between tries.

6.1.2. Replication

The basic idea of this failure handling technique is to have replicas of a task run on different Grid resources, so that as long as not all replicated tasks crash (due to host crash, host partitioned away from the Grid client,

etc.), the task execution would succeed. Figure 6 depicts an example of an XML WPDL fragment which specifies this particular task to be replicated onto three different Grid resources. Note that users can easily choose to use this technique simply by specifying policy = "replica" in the Activity definition for the task. When the task, summation, is performed, the underlying system simultaneously submits the task execution request to the three specified Grid resources. Once it recognizes that one of the submitted tasks has finished successfully, by receiving both the Done and the End Task notification messages as described in [28], the system considers the task execution to have succeeded.

Figure 6. XML WPDL example of replication. Users can specify a particular task to be replicated on multiple Grid resources by defining policy = "replica" in the task definition and multiple resources within the corresponding Program definition.

6.1.3. Checkpointing

Checkpointing has been studied a great deal in distributed and parallel systems as an efficient fault tolerance technique, especially for long-running applications. As a result, many checkpoint libraries and program development libraries which support checkpointing are available today. With these checkpointing facilities, checkpoint-enabled applications can be developed simply by linking against them. Dome [6], Fail-safe PVM [33], and CoCheck [45] are examples of program development libraries that enable coordinated checkpointing on parallel computing platforms (e.g., networks of workstations), while Libckpt [39] and the Condor checkpoint library [2] are standalone portable checkpoint libraries for uniprocessor platforms. Our framework is designed to support checkpoint-enabled tasks. That is, when a task fails, it is allowed to be restarted from the most recently checkpointed state rather than from the beginning. Users do not have to specify anything about the checkpointing (for example, specifying something like policy = "checkpoint", as in the case of the replication technique) in the workflow structure. Instead, all they have to do is to call one of the task-specific event notification API functions (i.e., globus_FDS_task_checkpoint()) [27] within the checkpoint-enabled task code so as to notify the framework that this task is checkpoint-enabled. Upon receipt of the checkpoint notification from a task, the framework marks the task as checkpoint-enabled and saves the checkpoint flag delivered piggybacked on the notification message. Hence, when the task crash failure is detected and retrying is specified, the framework retries the task from the checkpointed state by sending back the checkpoint flag. Currently, we have successfully tested this checkpointing feature of our framework with the Libckpt standalone checkpoint library.

6.2. Workflow-level Failure Handling Techniques

In this section we describe failure handling techniques that can be applied at the workflow level. The manipulation of the workflow structure (e.g., modifying execution flows to deal with erroneous conditions) is the basis of these techniques.

6.2.1. Alternative Task

A key idea behind this failure handling technique is that when a task has failed, an alternative task is performed to continue the execution, as opposed to the retrying technique, where the same task is repeated over and over again and might never succeed. This technique might be desirable in cases where there are two different task implementations available for a certain computation, each with different execution characteristics. For example, the first implementation runs fast, but is unreliable, whereas the second runs slowly, but is reliable. In this case, users can specify the second implementation to act as an alternative task to the first one. Figure 7 illustrates an example of failure handling using the alternative task technique. In the figure, the Slow_Reliable_Task is specified as an alternative task to the Fast_Unreliable_Task, such that the Slow_Reliable_Task would be activated to recover from a task crash failure that might happen to the Fast_Unreliable_Task during its execution.

Figure 7. Task crash failure handling using an alternative task.

Note that this technique is also useful in cases where users wish to semantically undo the effect of a failed task. For example, for a task which transfers a huge amount of data, users may want to define an alternative task which is activated to clean up the partially transferred data if the original task has failed during execution.

6.2.2. Workflow-level Redundancy

As opposed to the task-level replication technique, where the same task is replicated, the basic idea of this technique is to have multiple different tasks run in parallel for a certain computation. Thus, as long as at least one task has finished successfully, the computation succeeds. This technique might be useful in cases where there are many task implementations with different execution behaviors available for the computation. For example, the first implementation is fast but unreliable, and the second slow but reliable. There might be other implementations available which have different execution behaviors than the first and the second. In this case, users may want to have all these different implementations executed simultaneously so as to, e.g., achieve fault tolerance and/or performance goals at the cost of extra CPU consumption. This is simply achieved by specifying (1) a split task with multiple outgoing control flows to each of the different implementations, whose transition conditions always evaluate to true, (2) a join task with multiple incoming control flows from each of the implementations, and (3) an OR relationship between the incoming control flows (see Figure 8). Figure 8 depicts an example of using the workflow-level redundancy technique to achieve a certain level of fault tolerance by having two different task implementations (i.e., Fast_Unreliable_Task and Slow_Reliable_Task) run in parallel. Note that in the figure, the OR relationship between the two incoming control flows (i.e., one from the Fast_Unreliable_Task and the other from the Slow_Reliable_Task) indicates that the Dummy_Join_Task would be activated if either of the two tasks succeeds.

Figure 8. Workflow-level redundancy.
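The OR-join semantics described above can be sketched in a few lines of Python; the two stand-in functions below play the roles of Fast_Unreliable_Task and Slow_Reliable_Task and are invented for illustration, not taken from the Grid-WFS implementation.

```python
import concurrent.futures as cf
import random
import time

def fast_unreliable_task():
    time.sleep(0.1)
    if random.random() < 0.7:
        raise RuntimeError("task crash")   # fails most of the time
    return "fast result"

def slow_reliable_task():
    time.sleep(0.5)
    return "slow result"                   # always succeeds

def redundant_or_join(*implementations):
    """Run all implementations in parallel; the join fires on the first success."""
    with cf.ThreadPoolExecutor() as pool:
        futures = [pool.submit(impl) for impl in implementations]
        for future in cf.as_completed(futures):
            try:
                return future.result()     # first successful branch activates the join
            except Exception:
                continue                   # a crashed branch is ignored (OR semantics)
    raise RuntimeError("all redundant branches failed")

print(redundant_or_join(fast_unreliable_task, slow_reliable_task))
```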

6.2.3. User-defined Exception Handling

This technique allows users to give special treatment to a specific failure of a particular task. This can be achieved using the notion of the alternative task technique; that is, by specifying a workflow process such that the alternative task is launched to deal with the specific failure if the particular task fails due to that failure. Figure 92 shows an example of user-defined exception handling using an alternative task; the Slow_Reliable_Task is specified to be activated to handle a task-specific failure (i.e., a user-defined exception called "disk-full") that might arise during the execution of the Fast_Unreliable_Task.

7. Flexibility of Our Framework

Our framework is designed to be flexible enough to:
- Combine task-level failure recovery techniques with each other. For example, in the code of Figure 6, users can specify that each replica be retried when it fails by just adding a maximum number of retries to the activity definition.
- Combine task-level and workflow-level failure recovery techniques. For example, in Figures 7 and 8, users can make the Fast_Unreliable_Task more tolerant of task crash failures by applying task-level failure recovery techniques such as retrying, checkpointing, and replication.
- Change failure handling strategies easily and incrementally by changing the workflow structure as the underlying Grid structure changes. For example, as seen in Figures 7-9, with the two tasks (i.e., Fast_Unreliable_Task and Slow_Reliable_Task), users can structure different failure handling strategies and can incrementally change them simply by changing the workflow structure. Thus, there is no need to recompile, relink, and test the application source code as the failure handling strategies change.
2 Note that due to page limitations, we omitted the WPDL workflow specification code for Figures 7-9. See [27] for details of the code.


Figure 9. Exception handling using an alternative task.

Figure 10. Grid-WFS architecture.

8. Implementation

We have prototyped a Grid Workflow System (Grid-WFS) that implements the framework described above. The overall system structure of Grid-WFS is illustrated in Figure 10; it consists of three major components:
- A Workflow Process Definition Language using XML (XML WPDL) that allows users to define a workflow process specification in a directed acyclic graph (DAG) form.
- A workflow engine that controls workflow execution by navigating the workflow specification, submitting tasks to the specified Grid nodes, and monitoring the status of the submitted tasks.
- Workflow runtime services that provide the directory services necessary for the workflow engine to

perform resource brokering during workflow execution, including software, data, and resource category services.
All the features described in this paper can be specified using the XML WPDL. Furthermore, additional features are supported by the XML WPDL, including conditional transitions (e.g., if-then-else) and loop structures (e.g., do-while). Thus, users can specify more sophisticated application structures and application-level failure handling strategies (e.g., value dependency [27]). Further details of the specification, the syntax (i.e., the XML Document Type Definition), and example usages of XML WPDL can be found in [27]. The workflow engine is the core part of the prototype. It is implemented as a standalone application on top of the Globus Toolkit [3] v2.0. When the engine is started, it reads a workflow process specification from

a file specified in its input argument and creates an instance of the specification in a parse tree form. The engine then begins to navigate through the parse tree. The engine determines the tasks that are ready to execute by examining whether their dependencies have been resolved. Once it has identified such tasks, the engine submits them to appropriate Grid resources via the Globus GRAM [11] protocol. The engine is designed to identify the appropriate Grid resources either as specified in the workflow specification or by consulting the directory services.3 The engine determines the final status (i.e., done, failed, or exception) of the submitted tasks using the generic failure detection mechanism described in [28], storing the final status in the parse tree. The engine then re-evaluates the parse tree, identifies the next tasks whose dependencies have been resolved, and submits them. The navigation continues in this way through the workflow process instance until it either completes successfully or is terminated unsuccessfully due to an unrecoverable erroneous situation. For fault tolerance of the workflow engine itself, we have implemented checkpointing of the workflow engine. That is, every time a task termination state is recognized, the engine saves the current XML parse tree onto persistent storage as an XML file. So, when restarted, the engine creates a parse tree from the saved XML file rather than from the original XML file and begins navigation from where it left off.

9. Evaluation

The Grid-WFS supports multiple failure recovery techniques, as opposed to most other distributed (or even so-called Grid) systems, in which only a single failure recovery technique is supported (e.g., transactions in OLTP [22], checkpointing in Dome [6], retrying in NetSolve [9], Condor-G [18], and Condor DAGMan [1], and replication in Mentat [24]). In this section, we present an experimental evaluation of the Grid-WFS, demonstrating the value of supporting multiple failure recovery techniques in such a highly heterogeneous Grid environment to achieve high performance in the presence of failures.

9.1. Experimental Methods

Simulation. We measured, using simulation, the expected completion time of a task in spite of failures during its execution. The following are the parameters used in our simulation (mostly borrowed from [40, 6, 13]):
3 We have not implemented the second option yet.

- Failure-free execution time (F). This is the execution time of a task in the absence of failures.
- Failure rate (λ). This is a random variable representing the arrival rate of failures, governed by a Poisson distribution, as is commonly assumed in the fault tolerance literature [40, 6, 13]. TTF (time to failure) is a random variable representing the time between adjacent arrivals of failures, governed by the well-known exponential distribution [31]. MTTF (mean time to failure) is the mean TTF interval, mathematically defined as 1/λ [40].
- Downtime (D). This is the average time following a failure of a task before it is up again, governed by the exponential distribution [40].
- Average checkpoint overhead (C). This is the average amount of time required to create a checkpoint. We assume it to be constant in our simulation.
- Uninterrupted task execution time between checkpoints (a). This is the time interval between two consecutive checkpoints in failure-free runs [13, 6]. So, if K checkpoints are created during F, then a = F/K.
- Recovery time (R). This is the time that it takes to restore the checkpointed state following the detection of a failure [40].
- Number of replicas (N). This is the number of replicated tasks, each running on a different machine.
We note that the above are not the only parameters that need to be considered; checkpoint latency (L) [40] should be taken into account for a more precise simulation, but for simplicity, by assuming that a task is halted while checkpointing, we do not consider this parameter in our simulation. Based on the above parameters, we measured the expected execution time of tasks with four different types of failure recovery techniques:
Retrying. The basic idea is that if a failure occurs, the task must restart from the beginning. We simulate the completion time of a task based on the assumptions and analysis of Duda [13] for a program without checkpointing.
Checkpointing. A task periodically stores its state (i.e., checkpoints). So, when the task crashes, it can restart from the most recently checkpointed state. The time needed to complete a task is simulated based on the assumptions and analysis of Duda [13] for a program with checkpointing.


Figure 11. Comparison between analytical and simulation results for retrying.

Figure 12. Comparison between analytical and simulation results for checkpointing.

Replication. We calculate the completion time by running the task N times and choosing the smallest completion time among those obtained from the N simulation runs. Note that each run is assumed to employ retrying as its recovery technique and continues until it has completed.
Replication w/checkpointing. The completion time is computed in the same way as for replication above, except that each run uses the checkpointing recovery technique to complete its computation.
To validate the correctness of our simulation results, we compared them with analytical models from the fault tolerance literature. Figures 11 and 12 plot the expected completion time as a function of MTTF from analytical results4 overlaid with our simulation results. As can be seen in both figures, the expected completion time from the simulation results is the same as the analytical expected completion time, which indicates that our simulation method is correct. Note that for these results we performed 10,000, 100,000, and 1,000,000 simulation runs, and found that 100,000 runs are sufficient for our simulation. In fact, the simulation results in both figures were obtained from 100,000 simulation runs.
4 We obtain the analytical results from Plank [40] and Dome [6], respectively.
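For readers who wish to reproduce this kind of experiment, the following Python sketch illustrates a Monte Carlo estimation of the expected completion time for retrying, checkpointing, and replication under the parameters defined above. The exact accounting (exponential downtime with mean D, a checkpoint of cost C after each segment of length a, replication as the minimum over N independent retrying runs) is our reconstruction of the stated assumptions, not the authors' simulator.

```python
import random

def sim_retry(F, mttf, D):
    """Completion time when a crashed task restarts from the beginning."""
    t = 0.0
    while True:
        ttf = random.expovariate(1.0 / mttf)   # time to the next failure
        if ttf >= F:
            return t + F                       # no failure before completion
        t += ttf                               # wasted work
        if D > 0:
            t += random.expovariate(1.0 / D)   # downtime before the restart

def sim_checkpoint(F, mttf, D, C, K, R):
    """Completion time with K equally spaced checkpoints (restart from the last one)."""
    a = F / K                                  # uninterrupted work between checkpoints
    t, done = 0.0, 0.0
    while done < F:
        segment = min(a, F - done) + C         # work plus checkpoint overhead
        ttf = random.expovariate(1.0 / mttf)
        if ttf >= segment:
            t += segment
            done += min(a, F - done)
        else:
            t += ttf + R                       # lost partial segment plus recovery time
            if D > 0:
                t += random.expovariate(1.0 / D)
    return t

def sim_replication(N, run_once):
    """N replicas run independently; the fastest successful one wins."""
    return min(run_once() for _ in range(N))

def expected(run_once, runs=20_000):
    return sum(run_once() for _ in range(runs)) / runs

# Example with the parameter setting of Section 9.4 (F = 30, K = 20, C = R = 0.5, D = 0, N = 3):
print(expected(lambda: sim_retry(30, 18, 0)))
print(expected(lambda: sim_checkpoint(30, 18, 0, 0.5, 20, 0.5)))
print(expected(lambda: sim_replication(3, lambda: sim_retry(30, 18, 0))))
```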

9.2. Modeling of Grid Resources

In order to reflect the heterogeneity of the Grid environment, we model Grid computing resources based on the system metrics reliability and availability, which we adopted from [34]:
- Reliability of a Grid computing resource is measured by the mean time to failure (MTTF), the average time that the Grid resource operates without failure. The mean time to repair (MTTR) is the average time it takes to repair the Grid computing resource after a failure; the MTTR measures the downtime of the computing resource. Note that in the case of a Grid resource consisting of a cluster of workstations (COW) (e.g., a Condor pool), we assume the COW is operating as long as not all workstations in the COW have failed.
- Availability (α) of a computing resource is statically quantified as α = MTTF / (MTTF + MTTR).
Based on this definition of reliability and availability, examples of a classification of Grid computing resources include:
- High reliability and high availability (i.e., long MTTF and short downtime): A large-scale Condor pool which consists of, e.g., tens of thousands of idle workstations. A supercomputer on which one can run a computation with a high priority, e.g., by paying the corresponding usage fee. A COW running the Dome [6] system, which supports automatic failure detection and uses a checkpointing recovery technique as its fault tolerance mechanism.
- High reliability and low availability (i.e., long MTTF and long downtime): A small-scale Condor pool (e.g., tens of workstations) where the owners of the workstations frequently claim their computing resources for their own computations. A supercomputer on which one can run a job only with a low priority, so that the job may be preempted during execution by tasks with higher priority.
- Low reliability and high availability (i.e., short MTTF and short downtime): A single workstation which is under the same administrative domain as the person who requests a computation on it, so that if something wrong is detected with the workstation, he can either immediately reboot it himself or have the owner of the workstation reboot it.
- Low reliability and low availability (i.e., short MTTF and long downtime): A workstation which an anonymous volunteer donates to contribute its idle cycles to the Grid community. Mobile computing resources (e.g., a portable laptop or PDA).

9.3. Simulation Results

We performed some preliminary tests with different scenarios [27]. In this section, we describe two scenarios and discuss our preliminary results, showing the importance of supporting multiple failure recovery techniques on the Grid.

9.3.1. A Simple DAG

We measure the expected completion time of a simple DAG which consists of four component tasks with different characteristics, each running on a different execution environment (see the descriptions in Table 1). Figure 13 depicts the interdependencies between the component tasks of the simple DAG. Table 1 summarizes the parameter values used in these simulation runs. Although they look simple, we actually chose the values very carefully to confirm our intuitions. For example, in the case of task X, with duration 100 and MTTF 20, implying that failures occur frequently during execution, we expected checkpointing to outperform retrying and replication. In the case of task Y, where the downtime is considerably high compared to the task duration, we expected replication to perform better than just retrying or checkpointing. Table 2 shows the completion time of each task with different failure recovery techniques (the lower the value, the better the performance). As we expected, for task X, checkpointing outperforms retrying and replication, and for task Y, replication performs better than retrying and checkpointing. We compute the total execution time of the simple DAG as the sum of

Table 1. Parameters used in the simple DAG experiment

Task   Duration (F)   MTTF (1/λ)   Downtime (D)   Availability (α)   Description
W      1000           3000         2000           0.6                A long-running task on a highly reliable, medium-availability execution environment.
X      100            20           0              1.0                A short-running task on a low-reliability, highly available execution environment.
Y      10             1000         1000           0.5                A short-running task on a highly reliable, medium-availability execution environment.
Z      100            200          1000           0.17               A short-running task on a medium-reliability, low-availability execution environment.
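As a quick sanity check, the availability column of Table 1 can be reproduced from the formula in Section 9.2, treating the downtime column as the MTTR; the short Python snippet below is only illustrative.

```python
# Reproducing the availability column of Table 1: availability = MTTF / (MTTF + MTTR),
# with the downtime column taken as the MTTR.
tasks = {"W": (3000, 2000), "X": (20, 0), "Y": (1000, 1000), "Z": (200, 1000)}
for name, (mttf, downtime) in tasks.items():
    print(name, round(mttf / (mttf + downtime), 2))
# prints W 0.6, X 1.0, Y 0.5, Z 0.17, matching Table 1
```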

Figure 13. Workflow process model for a simple DAG.

the completion time of Task W, the maximum of the completion times of Tasks X and Y, and the completion time of Task Z (see the short computation sketch following Table 2). Figure 14 shows the expected completion time of the simple DAG. As expected, the Grid-WFS performs better than the other systems, which support only a single failure recovery technique, by employing an appropriate failure recovery technique for each component task (e.g., replication w/checkpointing for Task W and Task X, retrying for Task Y, and replication for Task Z).

9.3.2. An Exception Handling DAG

We also test the impact of supporting (or not supporting) a user-defined exception handling mechanism on the performance of a Grid application. We measure the expected completion time of the exception handling DAG shown in Figure 9. The following are the assumptions that we have made for this experiment:
- The durations of the Fast_Unreliable_Task (FU), the Slow_Reliable_Task (SR), and the Dummy_Join_Task (DJ) are 30, 150, and 0, respectively.

Table 2. Expected completion time of each component task with different failure recovery techniques

Task     Retrying   Checkpointing   Replication   Replication w/checkpointing
Task W   1989       1685            1033          1028
Task X   2946       128             1042          120
Task Y   19         39              10            20
Task Z   775        670             137           144
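Using the per-task figures in Table 2 and the DAG structure of Figure 13 (total time = W + max(X, Y) + Z), the comparison behind Figure 14 can be recomputed as follows. The per-task selection shown here simply picks the smallest entry in each row, which differs slightly from the example choices quoted in the text (e.g., retrying versus replication for Task Y).

```python
# Expected completion times from Table 2 (task -> technique -> time).
table2 = {
    "W": {"retry": 1989, "ckpt": 1685, "repl": 1033, "repl+ckpt": 1028},
    "X": {"retry": 2946, "ckpt": 128,  "repl": 1042, "repl+ckpt": 120},
    "Y": {"retry": 19,   "ckpt": 39,   "repl": 10,   "repl+ckpt": 20},
    "Z": {"retry": 775,  "ckpt": 670,  "repl": 137,  "repl+ckpt": 144},
}

def dag_time(choice):
    """Total DAG time = W + max(X, Y) + Z for a per-task technique choice."""
    t = lambda task: table2[task][choice[task]]
    return t("W") + max(t("X"), t("Y")) + t("Z")

# A single-technique system must use the same technique for every task:
for tech in ("retry", "ckpt", "repl", "repl+ckpt"):
    print(tech, dag_time({task: tech for task in table2}))

# Grid-WFS may mix techniques, e.g. the cheapest one per task:
best = {task: min(times, key=times.get) for task, times in table2.items()}
print("per-task best", best, dag_time(best))
```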

Figure 14. The expected completion time of the simple DAG. The X axis represents the different failure recovery techniques employed, and the Y axis the total execution time.


Figure 15. Comparison of the expected completion time as a function of p with retrying, checkpointing, and exception handling.

9.3.2. An Exception Handling DAG
We also test the impact of supporting a user-defined exception handling mechanism on the performance of a Grid application. We measure the expected completion time of the exception handling DAG shown in Figure 9, under the following assumptions:
- The durations of the task Fast_Unreliable_Task (FU), the task Slow_Reliable_Task (SR), and the task Dummy_Join_Task (DJ) are 30, 150, and 0, respectively.
- The task FU performs five disk-full exception checks (i.e., one every 6 time units) during its execution. For simplicity, we model the task FU as a Bernoulli process with probability p of raising a disk-full exception at each check.
- No failures other than the disk-full exception occur for the task FU, and the task SR always completes successfully.

Figure 15 illustrates the potential value of supporting an exception handling mechanism. As p approaches 1, a computation that relies only on masking fault tolerance techniques such as retrying and checkpointing takes a very long time to complete. In the case of p = 1, without support for an exception handling mechanism, the computation would never complete successfully.

9.3.3. Complex DAGs
As indicated in [12], many Grid applications to date can be represented as the execution of complex workflows, which may consist of tens, hundreds, or even thousands of interdependent component tasks. From our preliminary results with the simple DAG and the exception handling DAG, we expect the completion time of complex DAGs to be reduced significantly when using Grid-WFS; indeed, complex DAGs are the ones that stand to gain the most from selecting an appropriate failure recovery technique for each component task.
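The exception handling comparison in Figure 15 can be approximated with a small Monte Carlo model. The sketch below is our own hedged reconstruction: since Figure 9 is not reproduced here, the behaviour of the exception handler (abandoning FU and running the slow reliable task SR instead, as the task names suggest) and the absence of any retry downtime are assumptions.

```python
import random

F_FU, F_SR, CHECKS, STEP = 30, 150, 5, 6  # durations and exception checks (Section 9.3.2)

def retry_only(p, rng):
    """Retry FU from scratch whenever a disk-full exception is raised.
    Never terminates for p = 1, matching the observation in the text."""
    elapsed = 0.0
    while True:
        for _ in range(CHECKS):
            elapsed += STEP
            if rng.random() < p:      # exception raised at this check
                break
        else:
            return elapsed            # all checks passed: FU completed

def with_exception_handler(p, rng):
    """Assumed handler: on the first disk-full exception, run SR instead."""
    elapsed = 0.0
    for _ in range(CHECKS):
        elapsed += STEP
        if rng.random() < p:
            return elapsed + F_SR
    return elapsed

def mean(fn, p, runs=100_000, seed=0):
    rng = random.Random(seed)
    return sum(fn(p, rng) for _ in range(runs)) / runs

if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(p, round(mean(retry_only, p), 1), round(mean(with_exception_handler, p), 1))
```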

9.4. Discussion of the Simulation Results
In the previous section, we claimed that Grid-WFS performs better in the presence of failures than systems that support only a single failure recovery technique, by adopting an appropriate recovery technique for each component task. The next question is how to choose the best recovery technique. Is there any guidance for the selection of recovery schemes? And what if accurate estimates of MTTF, downtime, and task duration are not available?

For the case in which accurate estimates of the system metrics MTTF and downtime are available, we performed some further experiments to obtain a selection guideline. We first tested the influence of the task crash failure rate on the selection of fault tolerance techniques for achieving high performance in the presence of failures. We simulated the expected completion time of a task under the different fault tolerance techniques for various failure rates. For this experiment, we fixed the parameter F to 30, K (the number of checkpoints) to 20, D to 0, both C and R to 0.5, and N to 3. Figure 16 plots the simulation results. The figure shows that when the task crash failure rate is high (i.e., for small values of MTTF), checkpointing and replication with checkpointing outperform the other two techniques. However, for lower failure rates (i.e., larger values of MTTF), the use of checkpointing or replication with checkpointing as the recovery policy appears inappropriate due to the checkpointing overhead.
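As an illustration of this first experiment, the sketch below gives a hedged Monte Carlo model of a task under checkpointing with the parameter values quoted above; our interpretation of the parameters (K evenly spaced checkpoints of cost C, a restart cost R paid after each failure, and rollback to the last checkpoint) is an assumption, since the underlying analytical model is defined earlier in the paper and not reproduced in this section.

```python
import random

def completion_with_checkpointing(F, mttf, D, K, C, R, rng):
    """One sample of a task's completion time under checkpointing, in our own
    simplified reading of the Section 9.4 parameters: the work F is split into
    K segments, a checkpoint of cost C is taken after each segment, and after a
    failure the task waits out the downtime D, pays a restart cost R, and redoes
    only the current segment."""
    segment = F / K
    elapsed = 0.0
    for _ in range(K):
        while True:
            t_fail = rng.expovariate(1.0 / mttf)   # exponential failures
            if t_fail >= segment + C:              # segment and checkpoint survive
                elapsed += segment + C
                break
            elapsed += t_fail + D + R              # lose the partial segment, restart
    return elapsed

if __name__ == "__main__":
    rng = random.Random(0)
    runs = 100_000
    # F = 30, K = 20, D = 0, C = R = 0.5 as in the text; MTTF = 15 is an example point.
    est = sum(completion_with_checkpointing(30, 15, 0, 20, 0.5, 0.5, rng)
              for _ in range(runs)) / runs
    print(round(est, 2))
```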


Figure 16. Comparison between fault tolerance techniques as MTTF increases. This graph shows the expected completion time as a function of MTTF.

If task execution environments are reasonably reliable (i.e., MTTF is greater than 18, or MTTF/F > 0.6), then replication performs better than any of the other techniques, at the cost of extra CPU consumption.

We then tested the impact of downtime on the completion time of a task. The expected completion time was computed using the same parameters as in the experiment above, except that various values of downtime were used: 0, F (30), 5F (150), and 10F (300). The results are plotted in Figure 17, which illustrates that for longer downtimes, replication and replication with checkpointing perform better than the other two techniques. However, when the failure rate is high, it is a more dominant factor than long downtime: as can be seen in Figure 18, which presents the downtime = 10F graph in more detail, when the failure rate is relatively high (i.e., MTTF is less than 12, or MTTF/F < 0.4), checkpointing performs better than replication.

In the case where no accurate estimates of the system metrics are available, our intuition is that the strongest fault tolerance technique (i.e., replication with checkpointing) performs well in many cases. The simulation results from the simple DAG (Figure 16) and other simulation results reported in [27] show that our intuition holds (see the relatively low values in the Replication w/ checkpointing column of Table 2). Figure 18 also illustrates that the replication with checkpointing technique outperforms the other techniques over a wide range of execution environments, from relatively reliable to extremely unreliable ones

(the line marked with '+' lies below the other lines whenever failures occur; for highly reliable execution environments, however, replication is seen to perform slightly better than replication with checkpointing because of the checkpointing overhead). Therefore, when accurate information on the system metrics is not available, as long as we know that the system fails from time to time and takes a while to recover from failure, it is not a bad idea to use the replication with checkpointing technique. Of course, this is not free: the technique entails checkpointing overhead and extra resource consumption.
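Purely as an illustration, the selection guideline sketched in this section can be condensed into a small decision procedure. The thresholds below (MTTF/F ratios of 0.4 and 0.6, and a long-downtime cutoff of 5F) are the values observed in these particular experiments, the downtime rules are our own paraphrase of the discussion, and the fallback branch is a placeholder, so none of them should be read as general constants.

```python
def choose_recovery(F, mttf, downtime, have_estimates=True):
    """Illustrative selection heuristic distilled from Section 9.4; the
    thresholds come from experiments with F = 30, K = 20, C = R = 0.5, N = 3
    and are not general constants."""
    if not have_estimates:
        # Default when MTTF and downtime are unknown but failures do happen.
        return "replication w/ checkpointing"
    ratio = mttf / F
    if ratio < 0.4:
        # Failures dominate: checkpoint-based techniques win; long downtime
        # additionally favours keeping a replica running.
        return "checkpointing" if downtime <= F else "replication w/ checkpointing"
    if ratio > 0.6 and downtime <= F:
        # Reasonably reliable, short downtime: replication, at extra CPU cost.
        return "replication"
    if downtime >= 5 * F:
        # Long downtime favours the replication-based techniques.
        return "replication w/ checkpointing"
    # Middle region: no clear winner in the reported experiments.
    return "retrying"
```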

10. Related Work

10.1. Failure Detection
The Globus HBM [44] provides a generic failure detection service designed to be incorporated into distributed systems, tools, or applications. It enables applications to detect host/network failures by recognizing missing heartbeats, and task crash failures by receiving a "task died without unregistering" notification message from the HBM local monitor. However, it is impossible for Grid applications to detect user-defined exceptions using the Globus HBM service. Furthermore, the Globus HBM was not implemented in a portable way, as we argued in [28].


Figure 17. Comparison between fault tolerance techniques as downtime increases. This graph shows the impact of downtime on the performance of different fault tolerance techniques. The legend Rt represents Retrying, Ck Checkpointing, Rp Replication, and RpCk Replication w/checkpointing.

Figure 18. Comparison of the expected completion time between different fault tolerance techniques as a function of MTTF when downtime is equal to ten times the task duration.

MDS-2 [10] can in theory support task crash failure detection via its GRRP [25] notification protocol and its GRIS/GIIS framework. However, in contrast to our approach and the Globus HBM, it is not straightforward to use MDS-2 to construct failure detection services; MDS-2 is in fact designed for building Grid information services rather than failure detection services. Furthermore, user-defined exceptions cannot be detected using MDS-2.

Legion [23, 37] uses a pinging and timeout mechanism to detect task failures: the system considers a task to have failed if it does not receive a response from the task within a certain amount of time. This pinging and timeout mechanism can detect neither pure task crash failures nor user-defined exceptions, nor can it distinguish a pure task crash failure from host/network failures.

Condor-G [18] adopts ad hoc failure detection mechanisms, since the underlying Grid protocols ignore fault tolerance issues. That is, it periodically polls the generic Grid server to detect certain types of failures, such as a crash of the generic Grid server and host/network failures. However, like Legion, it can detect neither task crash failures nor user-defined exceptions.

Reviewing this existing work underlines the importance of providing generic and portable failure detection services for the Grid. Without generic services, failure detection mechanisms will be developed in ad hoc ways for each application or tool, and as a result cannot be reused or shared among different Grid applications, as is the case with Condor-G. And if the implementation is not portable enough for practical use in highly heterogeneous Grid environments, the mechanism will not be adopted by the Grid community, as happened with the Globus HBM, even though it was designed to be generic.

10.2. Fault Tolerance
Much research has been done on fault tolerance mechanisms in distributed, parallel, and Grid systems. The main focus of that work is on providing a single fault tolerance mechanism targeting a system-specific domain. Table 3 summarizes the fault tolerance mechanisms incorporated in some traditional distributed systems (e.g., OLTP [22], Ficus [41]), parallel systems (e.g., PVM [20], DOME [6]), and Grid systems (Netsolve [9], Mentat [24], Condor-G [18], CoG Kits [47]), listing the types of failures they can detect, their failure detection mechanisms, and their failure recovery mechanisms. Our work differs from these systems in its focus on providing a form of fault tolerance mechanism targeting the generic, heterogeneous, and dynamic Grid environment.

Table 3. Fault tolerance mechanisms in traditional distributed, parallel, and Grid systems.

  Transaction system (e.g., OLTP)
    Failures detected: host crash, network failure, task crash
    Detection mechanism: system-specific polling and event notification
    Recovery mechanism: transactions (i.e., abort and retry)
    Comment: uniform tasks (i.e., mainly read/write operations)

  Distributed file system (e.g., Ficus)
    Failures detected: host crash, network failure
    Detection mechanism: voting
    Recovery mechanism: replication
    Comment: uniform tasks

  PVM
    Failures detected: host crash, network failure, task crash
    Detection mechanism: system-specific polling and event notification
    Recovery mechanism: diverse failure handling in the application context
    Comment: recovery strategies must be hard-coded in the application

  DOME
    Failures detected: host crash, network failure, task crash
    Detection mechanism: system-specific polling and event notification
    Recovery mechanism: checkpointing
    Comment: targets SPMD parallel applications

  Netsolve
    Failures detected: host crash, network failure, task crash
    Detection mechanism: generic heartbeat mechanism
    Recovery mechanism: retry on another available machine
    Comment: Grid RPC

  Mentat
    Failures detected: host crash, network failure
    Detection mechanism: polling
    Recovery mechanism: replication
    Comment: exploits tasks' stateless and idempotent nature

  Condor-G
    Failures detected: host crash, network crash
    Detection mechanism: polling
    Recovery mechanism: retry on the same machine
    Comment: uses Condor client interfaces on top of Globus

  CoG Kits
    Failures detected: N/A
    Detection mechanism: N/A
    Recovery mechanism: N/A
    Comment: failure detection (e.g., timeouts) and recovery strategies must be hard-coded

As can be seen in the table, none of these systems addresses the Grid fault tolerance requirements discussed in Section 3: no system supports user-defined exceptions, and none of them supports diverse failure recovery mechanisms; instead, each provides only a single, user-transparent failure recovery mechanism (e.g., transactions in OLTP, checkpointing in Dome, retrying in Netsolve and Condor-G, replication in Mentat).

11. Conclusions

We have argued that the generic, heterogeneous, and dynamic nature of the Grid requires a new form of fault tolerance mechanism, one that addresses Grid-unique requirements such as a generic failure detection mechanism, support for diverse failure handling strategies, separation of failure handling strategies from the application code, and user-defined exception handling. We have developed a Grid-aware generic failure detection service with which Grid applications can detect two critical failure classes, task crash failures and user-defined exceptions. We have also developed a Grid Workflow System (Grid-WFS) on top of our failure detection service as a flexible failure handling framework. Grid-WFS allows users not only to define failures based on the task context (i.e., user-defined exceptions), but also to specify diverse failure handling strategies using a high-level workflow structure that is separated from the application algorithm code. Through this separation, we believe, our framework provides a flexible Grid programming paradigm. In the traditional paradigm, failure handling policies are hard-coded in the application code, so that whenever the policies change, the corresponding application code must be modified, recompiled, linked, and tested. In the new paradigm, failure handling policies are modified directly by simply changing the corresponding workflow structure. We have also demonstrated through our preliminary experiments that, in heterogeneous computing environments like the Grid, supporting multiple fault tolerance techniques and user-defined exception handling appears to be essential for achieving both high performance and fault tolerance in the presence of failures.
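To make the contrast between the two paradigms concrete, the sketch below shows, as a Python data structure, what separating failure handling policy from application code might look like. The field names, policy values, and DAG edges are purely hypothetical illustrations and are not Grid-WFS's actual workflow specification language.

```python
# Hypothetical illustration only: the recovery policy lives in the workflow
# description, not in the task code, so changing a policy means editing this
# structure rather than modifying, recompiling, and relinking the application.
WORKFLOW = {
    "tasks": {
        "W": {"executable": "run_w", "on_crash": "replication+checkpointing"},
        "X": {"executable": "run_x", "on_crash": "replication+checkpointing"},
        "Y": {"executable": "run_y", "on_crash": "retry"},
        "Z": {"executable": "run_z", "on_crash": "replication"},
    },
    # Dependencies of the simple DAG as described in Section 9.3.1.
    "edges": [("W", "X"), ("W", "Y"), ("X", "Z"), ("Y", "Z")],
}
```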
References

1. Condor DAGMan, http://www.cs.wisc.edu/condor/dagman/.
2. Condor Manuals, http://www.cs.wisc.edu/condor/manual/.
3. The Globus Toolkit, http://www.globus.org.
4. D. Abramson, J. Giddy and L. Kotler, High Performance Parametric Modeling with Nimrod/G: Killer Application for the Global Grid?, in International Parallel and Distributed Processing Symposium (IPDPS), 2000, pp. 520-528.
5. D. Abramson, R. Sosic, J. Giddy and B. Hall, Nimrod: A Tool for Performing Parametised Simulations Using Distributed Workstations, in Proceedings of the Fourth IEEE Symposium on High Performance Distributed Computing, 1995.
6. A. Beguelin, E. Seligman and P. Stephan, Application Level Fault Tolerance in Heterogeneous Networks of Workstations, Journal of Parallel and Distributed Computing on Workstation Clusters and Networked-based Computing, Vol. 43, No. 2, pp. 147-155, 1997.
7. J.L. Beiriger, H.P. Biven, S.L. Humphreys, W.R. Johnson and R.E. Rhea, Constructing the ASCI Computational Grid, in Proceedings of the Ninth IEEE Symposium on High Performance Distributed Computing, 2000, pp. 193-199.
8. S. Brunett, K. Czajkowski, S. Fitzgerald, I. Foster, A. Johnson, C. Kesselman, J. Leigh and S. Tuecke, Application Experiences with the Globus Toolkit, in Proceedings of the Eighth IEEE Symposium on High Performance Distributed Computing, 1998.
9. H. Casanova, J. Dongarra, C. Johnson and M. Miller, Application-Specific Tools, in I. Foster and C. Kesselman (eds.), The GRID: Blueprint for a New Computing Infrastructure, Chapter 7, pp. 159-180, 1998.
10. K. Czajkowski, S. Fitzgerald, I. Foster and C. Kesselman, Grid Information Services for Distributed Resource Sharing, in Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing, 2001 (to appear).
11. K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith and S. Tuecke, A Resource Management Architecture for Metacomputing Systems, in Proceedings of the IPPS/SPDP'98 Workshop on Job Scheduling Strategies for Parallel Processing, 1998.
12. E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, K. Blackburn, A. Lazzarini, A. Arbree, R. Cavanaugh and S. Koranda, Mapping Abstract Complex Workflows onto Grid Environments, Journal of Grid Computing, Vol. 1, No. 1, pp. 25-39, 2003.
13. A. Duda, The Effects of Checkpointing on Program Execution Time, Information Processing Letters, Vol. 16, pp. 221-229, 1983.
14. M.C. Elder, Fault Tolerance in Critical Information Systems, Ph.D. thesis, University of Virginia, 2001.
15. I. Foster, What is the Grid? A Three Point Checklist, GRIDToday, 2002.
16. I. Foster and C. Kesselman, The Globus Toolkit, in I. Foster and C. Kesselman (eds.), The GRID: Blueprint for a New Computing Infrastructure, Chapter 11, Morgan Kaufmann Publishers, pp. 259-278, 1998.
17. I. Foster and C. Kesselman (eds.), The GRID: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1998.
18. J. Frey, T. Tannenbaum, I. Foster, M. Livny and S. Tuecke, Condor-G: A Computation Management Agent for Multi-Institutional Grids, Cluster Computing, Vol. 5, No. 3, 2002.
19. F.C. Gartner, Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments, ACM Computing Surveys, Vol. 31, No. 1, 1999.
20. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, B. Manchek and V. Sunderam, PVM: Parallel Virtual Machine: A User's Guide and Tutorial for Network Parallel Computing, MIT Press, 1994.

21. D. Georgakopoulos, M. Hornick and A. Sheth, An Overview of Workflow Management: From Process Modeling to Workflow Automation Infrastructure, Distributed and Parallel Databases, Vol. 3, No. 2, pp. 119-153, 1995.
22. J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers, 1994.
23. A. Grimshaw, W. Wulf and the Legion Team, The Legion Vision of a Worldwide Virtual Computer, Communications of the ACM, 1997.
24. A.S. Grimshaw, A. Ferrari and E.A. West, Mentat, in G.V. Wilson and P. Lu (eds.), Parallel Programming Using C++, Chapter 10, pp. 382-427, 1996.
25. S. Gullapalli, K. Czajkowski, C. Kesselman and S. Fitzgerald, The Grid Notification Framework, Grid Forum Working Draft GWD-GIS-019, 2001, http://www.gridforum.org.
26. D. Hollingsworth, Workflow Management Coalition: The Workflow Reference Model, WfMC-TC00-1003, 1994.
27. S. Hwang, Grid Workflow: A Flexible Framework for Fault Tolerance in the Computational Grid, Ph.D. thesis, University of Southern California, 2003.
28. S. Hwang and C. Kesselman, A Generic Failure Detection Service for the Grid, Technical Report ISI-TR-568, USC Information Sciences Institute, 2003.
29. I. Foster, C. Kesselman and S. Tuecke, The Anatomy of the Grid: Enabling Scalable Virtual Organizations, Intl. J. Supercomputer Applications, 2001.
30. W.E. Johnson, D. Gannon and B. Nitzberg, Grids as Production Computing Environments: The Engineering Aspects of NASA's Information Power Grid, in Proceedings of the Eighth IEEE Symposium on High Performance Distributed Computing, 1999, pp. 197-204.
31. L. Kleinrock, Queueing Systems, Volume 1: Theory, Wiley-Interscience Publication, 1975.
32. N. Krishnakumar and A. Sheth, Managing Heterogeneous Multisystem Tasks to Support Enterprise-wide Operations, Distributed and Parallel Databases, Vol. 3, No. 2, pp. 155-186, 1995.
33. J. Leon, A.L. Fisher and P. Steenkiste, Fail-safe PVM: A Portable Package for Distributed Programming with Transparent Recovery, Technical Report CMU-CS-93-124, Carnegie Mellon University, 1993.
34. F. Leymann and D. Roller, Production Workflow: Concepts and Techniques, Chapter 10, pp. 351-427, Prentice Hall, 1999.
35. M.J. Litzkow, M. Livny and M.W. Mutka, Condor - a Hunter of Idle Workstations, in Proceedings of the Eighth Intl. Conf. on Distributed Computing Systems, 1988, pp. 104-111.
36. R. Medeiros, W. Cirne, F. Brasileiro and J. Sauve, Faults in Grids: Why are They so Bad and What Can Be Done about It?, in The 4th Workshop on Grid Computing, 2003.
37. A. Nguyen-Tuong, Integrating Fault-Tolerance Techniques in Grid Applications, Ph.D. thesis, University of Virginia, 2000.
38. Ninf Home Page, http://ninf.apgrid.org/.
39. J.S. Plank, M. Beck, G. Kingsley and K. Li, Libckpt: Transparent Checkpointing under Unix, in Proceedings of the USENIX Winter Technical Conference, New Orleans, LA, 1995.
40. J.S. Plank and W.R. Elwasif, Experimental Assessment of Workstation Failures and Their Impact on Checkpointing Systems, in Proceedings of the 28th Fault-Tolerant Computing Symposium (FTCS-28), 1998.
41. G.J. Popek, R.G. Guy, T.W. Page, Jr. and J.S. Heidemann, Replication in Ficus Distributed File Systems, in IEEE Computer Society Technical Committee on Operating Systems and Application Environments Newsletter, Vol. 4, 1990, pp. 24-29.
42. M. Romberg, The UNICORE Architecture: Seamless Access to Distributed Resources, in Proceedings of the Eighth IEEE Symposium on High Performance Distributed Computing, 1999, pp. 287-293.
43. S. Sekiguchi, M. Sato, H. Nakada, S. Matsuoka and U. Nagashima, Ninf: Network Based Information Library for Globally High Performance Computing, in Proceedings of Parallel Object-Oriented Methods and Applications (POOMA), 1996.
44. P. Stelling, I. Foster, C. Kesselman, C. Lee and G. von Laszewski, A Fault Detection Service for Wide Area Distributed Computations, in Proceedings of the Seventh IEEE Symposium on High Performance Distributed Computing, 1998, pp. 268-278.
45. G. Stellner, CoCheck: Checkpointing and Process Migration for MPI, in 10th International Parallel Processing Symposium, 1996, pp. 526-531.
46. D. Thain and M. Livny, Error Scope on a Computational Grid: Theory and Practice, in Proceedings of the Eleventh IEEE Symposium on High Performance Distributed Computing, Edinburgh, Scotland, 2002.
47. G. von Laszewski, I. Foster, J. Gawor, W. Smith and S. Tuecke, CoG Kits: A Bridge between Commodity Distributed Computing and High-Performance Grids, in ACM 2000 Java Grande Conference, 2000.
