
Failure Recovery: A Software Engineering Methodology for Robust Agents

Jennifer Kiessel
Soar Technology, Inc. 3600 Green Court, Suite 600 Ann Arbor, MI 48105 USA 1-734-327-8000 x216

Jonathan Beard
Soar Technology, Inc. 3600 Green Court, Suite 600 Ann Arbor, MI 48105 USA 1-734-327-8000 x212

Paul Nielsen
Soar Technology, Inc. 3600 Green Court, Suite 600 Ann Arbor, MI 48105 USA 1-734-327-8000 x201

jkiessel@soartech.com
beard@soartech.com
nielsen@soartech.com

ABSTRACT

In this paper, we present a software engineering methodology that is specifically tailored to agents with complex, human-like behavior. Many existing agent systems are error-prone and built in an ad-hoc manner. Previous approaches [12] tried to account for all possible inputs to the system, but with little success [5]. Our approach focuses not on assuring correct behavior, but rather on recognizing incorrect behavior and initiating recovery actions. As a result, agents are more robust to faults and require less maintenance.


Categories and Subject Descriptors


D.2.5 [Software Engineering]: Testing and Debugging - error handling and recovery, diagnostics; C.4 [Computer Systems Organization]: Performance of Systems - design studies, fault tolerance, modeling techniques; I.2.0 [Artificial Intelligence]: General - cognitive simulation.

General Terms
Design, Reliability, Experimentation.

Keywords
Robustness, recovery, maintenance, symptoms, diagnosis, corrective action, common sense, agent methodologies, qualitative physics, artificial intelligence, Soar, TacAir-Soar.

1. INTRODUCTION
Many agents (goal-directed models of human behavior) encode a great deal of expert knowledge, but lack the general, common-sense knowledge that would avert some system failures. The approach advocated in this paper is that domain-specific common sense and classes of recovery actions can be used to generate recovery plans for large, multi-agent systems. As both agents and environments become more complex, accounting for all possible stimuli becomes impossible. In a sufficiently complex software environment, it is not feasible to test all paths through the system and find every fault. When discussing the problems with universal planners, Ginsberg states, "It is impractical for an agent to precompute its response to every situation in which it might find itself" [5]. Instead, agents need a recovery scheme that allows them to recognize incorrect behavior when it occurs, determine its cause, and attempt corrective actions. Bickhard [3] speaks of "the tendency for knowledge bases, no matter how adequate for their initial narrow domains of knowledge, to be fundamentally not just incomplete, but inappropriate and wrongly designed in attempts to broaden the knowledge domain or to combine it with some other domain... Knowledge bases do not scale up well." In this paper, we argue that attempting to prevent all faults will not guarantee correctness of agent behavior; agents need such a recovery scheme in addition to fault prevention. We present several design methodologies for applying this idea in large, multi-agent systems, and then discuss our implementation of the resulting recovery plan.

2. RECOVERY VS. PREVENTION


From a software engineering perspective, the ability to recover from failure is preferable to trying to prevent all problems [1] [7]. This is not a new statement; much research in fault tolerance and error recovery has been done in the software dependability community [13]. Many existing agent-oriented systems try to account for all possible inputs [12], but have limited success due to the complexity of their environments. This is akin to the 90-10 adage about debugging: 10% of the errors in a system can take 90% of the time to debug. Rather than trying to guarantee correctness before execution (an arguably impossible task in complex systems), recognizing incorrectness during execution and recovering from those errors will reduce time and effort in building robust agents.

2.1 Why Prevention Isn't Enough


We are not arguing against the prevention of faults: steps should be taken to prevent faults, but prevention is not a comprehensive solution. Recovery is needed in addition to prevention, because not all faults can be prevented in a complex environment. One should attempt to prevent faults when encoding expert knowledge, but should also provide recovery actions to account for unexpected situations. Errors can occur because of factors internal or external to an agent. One class of external errors is those resulting from the information used. Information-based faults can arise in the following ways, when knowledge is:

1. Inaccurate: the knowledge is untrue
2. Incomplete: part of the knowledge is lacking
3. Obsolete: the knowledge was true, but no longer is
4. Inconsistent: different pieces of the knowledge contradict each other
5. Ill-defined: the knowledge is too vague to use

An agent in a complex environment will likely face all of these issues: it may be told incorrect or incomplete information, it may not be getting information about the world in a timely manner, or it may be told information that is irreconcilable with its prior knowledge. Attempts at solving these problems from an engineering perspective have faced an uphill battle [4]. For example, the CYC project's work on compiling all common sense into a knowledge base [9] proves the enormity of how much knowledge humans acquire and how easy it is to have incomplete knowledge. Inconsistent knowledge can also be present, as is sometimes necessary in human thinking; Lenat gives the example that we both know that there are no vampires and that Dracula is a vampire [9]. We must account for these problems in the design of software agents and enable recovery when errors do occur.

2.2 Benefits of Recovery
Two major benefits of using a recovery methodology, rather than attempting to prevent all faults, involve the complexity of modeling the real world and the maintenance of the system.

2.2.1 Domain Complexity
Agent interaction with highly complex environments or simulations increases the likelihood of introducing errors into the system and would benefit from a recovery plan. Providing an agent with recovery capabilities also enhances portability if the agent is ever used in a different scenario. For example, suppose one develops an agent that has the knowledge of an airplane pilot. The agent may behave correctly in one environment, but when transferred to a different environment where the mountains and other terrain features have changed, the agent may encounter a new situation where it does not know what to do (perhaps the new environment contains a mountain high enough to reach into the pilot's flying space). Moreover, even if the agent never transfers domains, the real world is complex enough to necessitate a recovery scheme. If the pilot agent had general common-sense knowledge of how to deal with objects in front of its plane (e.g., turning to go around the mountain, or increasing altitude to fly over it), the agent would be much more capable of responding to errors and recovering. One may argue that dealing with mountains and other terrain obstacles should have been captured in the expert knowledge of the pilot, along with the intricacies of take-off, landing, and navigation. But even if that knowledge of mountains had been included, there will always be something else that wasn't explicitly specified, because common sense is such broad knowledge [4]. Especially as we move toward the goal of intelligent unmanned vehicles, we must accept that complete fault prevention is not attainable: pre-programming actions for all possible stimuli is impossible [5].

2.2.2 Maintenance
Secondly, there is a strong argument to be made in the case of software maintenance. One of the chief goals of software engineering is to reduce the time and money spent on maintenance, and debugging and releasing patches can be very costly. Agents attempting to fix their own mistakes in real time can be more cost-effective: agent recovery can be done in real time; bug fixing cannot. While implementing an agent recovery scheme takes additional time during development, the costs of bug fixing are often orders of magnitude worse. As discussed later, our implementation also provides helpful debugging tools, because the agent identifies its own symptoms and diagnoses. Even if the agent's corrective recovery actions were to fail, the symptoms or diagnosis could provide additional information about where the problem arose.

3. METHODOLOGIES
Our design needs 1) to realize that there is a problem, 2) to determine what caused the problem, and 3) to attempt a solution to the problem. We recast this design as the following three steps: identify symptoms, make a diagnosis, and execute corrective actions [11]. A similar scheme was used by Klein [8], but his exception-handling capabilities were designed as a centralized service, whereas our recovery capabilities are built within the individual agents. Our methodology for determining symptoms relies on the use of qualitative physics and reasoning to identify errors [2]. This methodology is applicable only to software agents whose environment can be represented as state parameters and whose goals' success can be measured in terms of those state parameters. All possible values for a parameter in a continuous space can be measured in relation to a target value and represented discretely as a qualitative test. These generic qualitative tests, when applied to an agent's environmental parameters and goals, can detect discrepancies. The application of these generic qualitative tests to specific environmental parameters and target values is the composition of domain-specific qualitative models. We use these qualitative models to detect symptoms of error and to inform the diagnostic process. For example, suppose that the agent's goal is to bomb a target. The entire environment can be divided into areas before the target (-), at the target (0), or past the target (+). If the agent is past the target and the bomb has not been deployed, there is a problem.

We assume a hierarchy of goals and operators. This allows multiple checks for correctness at different levels; this redundancy catches far more errors. We can also attempt local solutions to local goals first, before attempting drastic recovery measures. In developing recovery actions, a causal model was used to define the flight domain. Recovery actions are constrained to the effectors of the system; the agent can only use the controls it has the capability to manipulate. The causal model therefore defines the relationships between the effectors of the system, and backchaining through the model shows the possible causes of a problem. For example, acceleration and altitude both affect speed (planes move slower at higher altitudes). Therefore, if there is a problem with the agent's speed, the agent can diagnose that acceleration or altitude caused the problem and use these attributes to attempt to fix it. A final methodology is the use of ordered recovery actions: actions suggested by a diagnosis of the problem should be applied before trying generic recovery actions. Next, we will discuss the implementation of these methodologies.
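Before turning to that implementation, a minimal sketch may help make the qualitative tests and the causal-model backchaining concrete. The Python below is purely illustrative: the actual system is written as Soar production rules, and every name here (qualitative_sign, CAUSAL_MODEL, possible_causes) is hypothetical rather than taken from TacAir-Soar.

    # Illustrative sketch only; the real implementation consists of Soar rules.

    def qualitative_sign(value, target, tolerance=0.0):
        """Map a continuous parameter onto '-' (before the target value),
        '0' (at the target), or '+' (past the target)."""
        if value < target - tolerance:
            return "-"
        if value > target + tolerance:
            return "+"
        return "0"

    def bombing_symptom(distance_flown, target_distance, bomb_deployed):
        """Symptom test from the paper's example: being past the target ('+')
        with the bomb still undeployed signals a problem."""
        return qualitative_sign(distance_flown, target_distance) == "+" and not bomb_deployed

    # Toy causal model over effectors: each parameter maps to the parameters or
    # effectors that influence it (e.g. speed is affected by acceleration and altitude).
    CAUSAL_MODEL = {
        "speed": ["acceleration", "altitude"],
        "acceleration": ["throttle"],
        "altitude": ["pitch", "throttle"],
    }

    def possible_causes(parameter, model=CAUSAL_MODEL):
        """Backchain through the causal model to collect candidate causes."""
        causes, frontier = [], list(model.get(parameter, []))
        while frontier:
            candidate = frontier.pop(0)
            if candidate not in causes:
                causes.append(candidate)
                frontier.extend(model.get(candidate, []))
        return causes

    print(bombing_symptom(5.2, 5.0, bomb_deployed=False))  # True: symptom detected
    print(possible_causes("speed"))  # ['acceleration', 'altitude', 'throttle', 'pitch']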

4. IMPLEMENTATION DETAILS
To capture this complexity in behavior, we chose to implement our plan as an enhancement to TacAir-Soar, a large, multi-agent system that models the behaviors of military aviators [6]. TacAir-Soar is a symbol-processing, rule-based system built within the Soar architecture for cognition [10]. A salient feature of this representation is that there is a hierarchy of goals. Approximately 8,000 rules represent human-like flight behavior. TacAir-Soar can fly all types of missions using appropriate doctrine and tactics, pilot many different types of aircraft, coordinate behavior with other entities, and behave with low computational expense.

4.1 Symptoms
Before recovery can occur, the agent must have some understanding of its goals, so it knows if it is making progress towards them. This self-monitoring capability is implemented in our design as symptoms. There are many symptom tests, which assess that various operators and states are executing correctly. If a symptom test fails, then the resultant symptom is posted as an error. Observing multiple symptoms in different parts of the system, symptoms at different levels of the goal hierarchy, or symptoms over time may allow a better diagnosis of the problem. For example, suppose a high-level goal is to fly in a racetrack pattern (an oval path), and a lower-level goal is to execute a turn within that pattern. If a symptom warns the agent that it is not turning correctly, there may be several reasons for the problem. But if a symptom warns the agent that it is not turning correctly, and another symptom informs it that it is not even flying a racetrack anymore, then there is a much bigger problem.
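As an informal illustration of symptom tests at different levels of the goal hierarchy (the racetrack example above), the sketch below posts a symptom whenever a test fails. The data structure and names are hypothetical; they do not reproduce the Soar working-memory representation actually used.

    # Illustrative sketch only; symptoms in TacAir-Soar are Soar working-memory structures.
    from dataclasses import dataclass

    @dataclass
    class Symptom:
        goal: str         # the goal whose progress this test monitors
        level: int        # depth in the goal hierarchy (lower = higher-level goal)
        description: str

    def run_symptom_tests(state):
        """Each test checks progress toward one goal; a failed test posts a symptom."""
        posted = []
        if not state["turn_rate_ok"]:
            posted.append(Symptom("execute-turn", 2, "not turning correctly"))
        if not state["on_racetrack_path"]:
            posted.append(Symptom("fly-racetrack", 1, "no longer flying the racetrack pattern"))
        return posted

    # A failed turn alone is a local problem; a failed turn together with leaving
    # the racetrack pattern indicates a much bigger problem higher in the hierarchy.
    print(run_symptom_tests({"turn_rate_ok": False, "on_racetrack_path": False}))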

4.2 Diagnosis
We implement a diagnosis as an informed collection of error symptoms that initiates the recovery process. Symptoms simply recognize that there is a problem in the system. Using the collection of posted symptoms, diagnoses attempt to understand what caused the problem and suggest corrective actions. A diagnosis also provides a general domain in which the problem occurred; for example, difficulties with navigation would be considered a problem in the Flight domain. Other domains in our TacAir-Soar implementation include Weapons, Communications, and Radar [11].
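A rough sketch of how posted symptoms might be combined into a diagnosis carrying a domain and suggested actions is given below. The pattern table and names are invented for illustration and do not reproduce the rule-based matching used in TacAir-Soar.

    # Illustrative sketch only; actual diagnoses are formed by Soar rules.
    from dataclasses import dataclass, field

    @dataclass
    class Diagnosis:
        cause: str
        domain: str                                            # e.g. Flight, Weapons, Communications, Radar
        suggested_actions: list = field(default_factory=list)  # known solutions, if any

    def diagnose(symptoms):
        """Match the collection of posted symptom descriptions against known patterns."""
        s = set(symptoms)
        if {"wingman not responding", "wingman still on radar"} <= s:
            return Diagnosis("wingman did not receive the message",
                             "Communications", ["resend-message"])
        if "no longer flying the racetrack pattern" in s:
            return Diagnosis("navigation error", "Flight")
        return None    # no diagnosis: fall back to generic recovery later

    print(diagnose(["wingman not responding", "wingman still on radar"]))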

4.3 Recovery Actions
After a diagnosis has determined possible causes for an error, recovery actions attempt to fix it. Recovery actions are comprised of both known and generic solutions. The process of determining recovery actions is based on an analysis of the system's effectors. These effector-based actions can be immediately applied as generic solutions, but it takes further analysis to determine which actions to apply as known solutions for a particular diagnosis.

4.3.1 Known solutions
If a diagnosis suggests specific recovery actions, these are called known solutions. Known solutions are recovery actions that have been designed by the developer in response to a particular diagnosis. For example, suppose that the diagnosis is that the agent is flying too close to a missile site and thus is in danger of being shot down. A known solution is to turn away from the missile site. The agents have an awareness of the passing of time. Some recovery actions should happen almost instantaneously, such as adjusting a dial, whereas others, like turning the plane to a different heading, may require more time. Each known solution has a time attribute, which allows the agent to time out. Therefore, the agent will try a recovery action for a limited time only, and then, if the symptoms of the problem have not been alleviated, it will try a new solution. If there are multiple known solutions for a diagnosis, a partial ordering of priorities may be imposed on them. For example, if a lack of communication leads a lead pilot agent to believe that its wingman is gone, it is better to first try resending the message to the wingman (in case it was garbled the first time) before assuming that the wingman has been hit and abandoning it.
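The time-out and priority-ordering behavior of known solutions can be pictured as in the following sketch; the function and parameter names are hypothetical and only stand in for the corresponding Soar operators.

    # Illustrative sketch only; timing in the real system is handled within Soar.
    import time

    def try_known_solutions(solutions, symptoms_resolved):
        """'solutions' is a list of (action, time_limit_seconds) pairs already in
        priority order, e.g. resend the message before abandoning the wingman.
        Each action runs until its time limit expires or the symptoms go away."""
        for action, time_limit in solutions:
            action()                                   # issue the effector command
            deadline = time.monotonic() + time_limit
            while time.monotonic() < deadline:
                if symptoms_resolved():
                    return True                        # recovery succeeded
                time.sleep(0.1)
        return False                                   # hand off to generic solutions

If no known solution alleviates the symptoms within its time limit, the agent falls back to the generic solutions described next.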

4.3.2 Generic solutions


If the diagnosis has no suggested actions for how to deal with the problem (i.e., there are no known solutions), or all suggested actions have been tried and the problem has not been corrected, generic solutions will be tried. Generic solutions are general recovery actions that allow the agent to search for a solution. The generic solutions are organized by domains. This provides additional constraints on the recovery actions that may be used and prevents a blind search. The logic behind this is that if a pilot were having a problem with his radio and he had exhausted all of the ways he knew to fix it, he would try to do random things with the radio, like adjusting various dials. However, he would not try something in a different domain, like speeding up the plane.

Generic solutions are intended to allow the agent to explore its world for a solution to a problem. One of the shortcomings of agents is that they freeze when no knowledge applies, which is not comparable to human behavior. Humans can try new things. For example, suppose a person was walking and ran into an invisible barrier. That is very likely a situation that she had never encountered before, and so she may have no known solutions to that problem. However, she may try turning and moving along the barrier to see if there is a break in it, or try to climb over it or under it, or try running into it to see if she can ram it down. The generic recovery actions represent this method of trying new solutions to an unprecedented problem. The agent can select randomly among the generic solutions for a resolution to its problem. Generic solutions also have an awareness of the concept of time; an agent will time out and try a different recovery action after a solution has been tried for some period of time.

The generality of the recovery actions is key for good software engineering. We have tried to encode general behaviors for each domain in our implementation. Actions like turning, changing speed, and changing altitude are reasonable generic solutions in the Flight domain. In the Communications domain, we have generic solutions like resend message, ignore message, and change radio frequency. In the Radar domain, generic solutions include change radar mode, change radar elevation, and change radar azimuth. In the Weapons domain, the only generic solution is to return to the base; we don't want agents to try doing random things with their weapons. The lowest-level recovery actions for all domains are to return to the base or to emergency-land the plane in the current location.
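To illustrate how generic solutions might be organized by domain and selected at random, consider the following sketch. The action names echo the examples above, but the table and function are hypothetical and do not correspond to the actual rule set.

    # Illustrative sketch only; generic solutions in TacAir-Soar are Soar operators.
    import random

    GENERIC_SOLUTIONS = {
        "Flight":         ["turn", "change-speed", "change-altitude"],
        "Communications": ["resend-message", "ignore-message", "change-radio-frequency"],
        "Radar":          ["change-radar-mode", "change-radar-elevation", "change-radar-azimuth"],
        "Weapons":        ["return-to-base"],   # nothing random is tried with weapons
    }
    LAST_RESORT = ["return-to-base", "emergency-land"]   # lowest-level actions for every domain

    def next_generic_solution(domain, already_tried):
        """Pick an untried generic action from the diagnosed domain at random;
        when the domain is exhausted, fall back to the lowest-level actions."""
        remaining = [a for a in GENERIC_SOLUTIONS.get(domain, []) if a not in already_tried]
        if remaining:
            return random.choice(remaining)
        last = [a for a in LAST_RESORT if a not in already_tried]
        return last[0] if last else None

    print(next_generic_solution("Communications", already_tried=["resend-message"]))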

4.4 Example
As a simple example, suppose that a lead pilot agent is flying with a wingman agent. The lead agent sends a message to its wingman and is stuck waiting for a response. After a certain amount of time, symptoms identify that the wingman is not responding and that the wingman is still on the radar. From these symptoms, the lead agent forms the diagnosis that the wingman did not receive the message properly. This diagnosis first suggests the recovery action of resending the message. The agent will execute this action. If this action does not alleviate the symptoms, the agent will try generic solutions in the Communications domain, such as sending the message on a different radio frequency.

5. CONCLUSION
We have argued that prevention of all faults is impossible; the recognition of errors and the ability to recover from them must augment fault prevention. We have presented an approach to building and maintaining agent software based on error recovery. Our approach reduces the effort of coping with environmental complexity and lowers maintenance costs. We have described the methodologies of our design for goal-directed agent systems and given a brief description of its implementation. The general framework (the use of qualitative physics, redundant checks, backchaining through causal models, ordered recovery actions, and generic solutions) can be generalized to any environment.

6. ACKNOWLEDGMENTS
The authors wish to thank Jim Beisaw for his literature search and his work on the design of the current iteration. Many thanks also to Glenn Taylor for his invaluable guidance. Finally, we want to gratefully acknowledge Harold Hawkins and the Office of Naval Research for funding our research under contract number N00014-00-C-0312.

7. REFERENCES
[1] Atkins, E. M., Durfee, E. H., and Shin, K. G. Detecting and Reacting to Unplanned-for World States. Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97), 1997, pp. 571-576.

[2] Beard, J., Nielsen, P., and Kiessel, J. Self-Aware Synthetic Forces: Improved Robustness Through Qualitative Reasoning. (accepted for publication, I/ITSEC 2002)

[3] Bickhard, M. H. Physical Symbol Systems. http://www.lehigh.edu/~mhb0/aiforcogs7.html

[4] Copeland, B. J. CYC: A Case Study in Ontological Engineering, 1997. http://ejap.louisiana.edu/EJAP/1997.spring/copeland976.2.html

[5] Ginsberg, M. L. Universal Planning: An (Almost) Universally Bad Idea. AI Magazine, vol. 10, no. 4, 1989.

[6] Jones, R. M., Laird, J. E., Nielsen, P. E., Coulter, K. J., Kenny, P., and Koss, F. V. Automated Intelligent Pilots for Combat Flight Simulation. AI Magazine, Spring 1999.

[7] Kaminka, G. A. and Tambe, M. What is Wrong With Us? Improving Robustness Through Social Diagnosis. Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), 1998.

[8] Klein, M. and Dellarocas, C. Exception Handling in Agent Systems. Proceedings of the Third International Conference on Autonomous Agents, Seattle, WA, May 1-5, 1999.

[9] Lenat, D. B. From 2001 to 2001: Common Sense and the Mind of HAL. http://www.cyc.com/halslegacy.html

[10] Newell, A. Unified Theories of Cognition. Harvard University Press, Cambridge, Massachusetts, 1990.

[11] Nielsen, P., Beard, J., Kiessel, J., and Beisaw, J. Robust Behavior Modeling. Proceedings of the CGF Conference, 2002.

[12] Schoppers, M. J. Universal Plans for Reactive Robots in Unpredictable Domains. Proceedings of the Tenth International Joint Conference on Artificial Intelligence, 1987, pp. 1039-1046.

[13] Dependable Computing and Fault Tolerance. http://www.dependability.org/wg10.4/
