
Introduction Terminology Error Detection Error recovery Error signaling Error confinement Error detection and recovery Solar

Self Healing Operating System

Neethu . T V
Roll No: 32
S7 Computer Science and Engineering

Government Engineering College


Sreekrishnapuram Palakkad

November 25, 2010


OVERVIEW

1 Introduction
2 Terminology
3 Error Detection
4 Error recovery
5 Error signaling
6 Error confinement
7 Error detection and recovery
8 Solaris 10 OS
9 Future scope
10 Conclusion
11 Reference

Introduction
All applications are dependent on the OS
When the OS dies, all running applications are lost
Resilience to errors is an important requirement of modern operating systems
Self-healing enables systems to diagnose themselves and react to faults

Terminology
Fault - Defect or flaw in hardware or software
Error - Deviation from correct state
Failure - Inability to perform expected task

Error Detection in Existing OSs

Custom error detection code in OSs:
Linux - Deadlock detection, soft lockup detection, etc.
Windows - Deadlock detection, etc.
Hardware memory protection - MMU
Watchdog timers - Linux, Windows, etc.
Software memory protection - SafeDrive, XFI
Periodic consistency checks - EROS

Error recovery in Existing OSs

Linux - Recovery by terminating thread
Restart Failed Component
Windows Vista - Example: Video Card Driver
Minix3
Chorus
Linux+Nooks
IBM z/OS
Hardware Redundancy
Reboot Entire System

Error signaling

C++ exception handling is used for unified error signaling
Developer-defined exceptions
Processor exceptions
Benefits of mapping processor exceptions to language exceptions:
Local error recovery using C++ catch statements
Generic handlers for all types of exceptions
Generic handlers can simply print an error message and halt the system
Run-time performance overhead in the normal case is negligible
Provides developers with a flexible and powerful technique
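
The idea of unified error signaling can be illustrated with a minimal sketch. All names here (OsException, ProcessorException, DriverException, checkedDivide, runWithHandlers) are hypothetical and are not taken from any real kernel. A real kernel would raise the processor exception from its trap handler; this sketch only simulates that with an ordinary throw.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical common base class for every error signaled inside the OS.
class OsException : public std::runtime_error {
public:
    explicit OsException(const std::string& what) : std::runtime_error(what) {}
};

// Hypothetical language-level stand-in for a processor exception.
class ProcessorException : public OsException {
public:
    explicit ProcessorException(const std::string& what) : OsException(what) {}
};

// Hypothetical developer-defined exception.
class DriverException : public OsException {
public:
    explicit DriverException(const std::string& what) : OsException(what) {}
};

// Simulated trap: a real kernel would map the hardware divide fault
// to this exception inside its trap handler.
int checkedDivide(int a, int b) {
    if (b == 0) throw ProcessorException("divide by zero");
    return a / b;
}

// Runs a request with a specific handler for local recovery and a
// generic handler that catches every other OS exception.
std::string runWithHandlers(const std::function<int()>& request) {
    try {
        return "ok: " + std::to_string(request());
    } catch (const ProcessorException& e) {
        return std::string("local recovery: ") + e.what();
    } catch (const OsException& e) {
        return std::string("generic handler: ") + e.what();
    }
}
```

Because both kinds of error share one base class, a single generic handler covers them, while a more specific catch clause can attempt local recovery first.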

Error confinement

Isolate OS components
Used by microkernels: L4, Minix3
Nooks: Device driver isolation in Linux
Objects in Choices can be placed in separate memory protection domains
Implemented using wrappers that inherit from the target classes
Example protected objects: serial port driver, file system inodes, timer driver
Recovery can be targeted toward the affected component
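
The wrapper approach can be sketched as follows. This is only an illustration under stated assumptions: TimerDriver, FaultyTimerDriver, and ProtectedTimerDriver are hypothetical classes, and the wrapper models just the call boundary. It does not set up a separate memory protection domain, which a real system would do in hardware.

```cpp
#include <stdexcept>

// Hypothetical target class that the rest of the OS calls.
class TimerDriver {
public:
    virtual ~TimerDriver() = default;
    virtual int readTicks() { return 42; }
};

// Hypothetical driver that always fails, used to exercise the wrapper.
class FaultyTimerDriver : public TimerDriver {
public:
    int readTicks() override { throw std::runtime_error("timer register fault"); }
};

// The wrapper inherits from the target class, so callers can use it
// wherever a TimerDriver is expected. Errors raised by the wrapped
// object stop at this boundary instead of spreading to the caller.
class ProtectedTimerDriver : public TimerDriver {
public:
    explicit ProtectedTimerDriver(TimerDriver& target) : target_(target) {}

    int readTicks() override {
        try {
            return target_.readTicks();
        } catch (const std::exception&) {
            ++faults_;   // record the fault so recovery can target this component
            return -1;   // assumed error value for this sketch
        }
    }

    int faults() const { return faults_; }

private:
    TimerDriver& target_;
    int faults_ = 0;
};
```

Since the fault count is kept per wrapper, a recovery policy can tell which component misbehaved and act on that component alone.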

[Figure: Choices protected components]

Error detection and Recovery

Code Reloading
Component Micro-Reboots
Automatic Service Restarts
Watchdog-based Recovery
Process-level Recovery

Code reloading

Fault: Corruption of OS code by software bugs or hardware bit-flips (Single Event Upsets)
Proactive recovery: Periodically checksum OS code and reload corrupted pages from stable storage
Reactive recovery: If an undefined instruction exception is raised, reload the relevant OS page from stable storage
Simple fault-injection experiments show 89% recovery
Example: ARM-based microprocessors for mobile phones include a Run-Time Integrity Checker (RTIC)
Also used in EROS
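
Proactive recovery can be sketched as a scrubbing pass over code pages. This is a simplified model, not how any particular kernel does it: pages are plain byte vectors, "stable storage" is a second vector holding known-good copies, and the rotate-and-XOR checksum is an assumed stand-in for whatever integrity check a real system uses. The names Page, checksum, and scrubCode are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Page = std::vector<std::uint8_t>;

// Simple rotate-and-XOR checksum, standing in for a real integrity check.
std::uint32_t checksum(const Page& page) {
    std::uint32_t sum = 0;
    for (std::uint8_t byte : page) {
        sum = ((sum << 1) | (sum >> 31)) ^ byte;
    }
    return sum;
}

// Compares every in-memory code page with its known-good copy and
// reloads any page whose checksum differs.
// Returns the number of pages that were reloaded.
std::size_t scrubCode(std::vector<Page>& memory,
                      const std::vector<Page>& stableStorage) {
    std::size_t reloaded = 0;
    for (std::size_t i = 0; i < memory.size() && i < stableStorage.size(); ++i) {
        if (checksum(memory[i]) != checksum(stableStorage[i])) {
            memory[i] = stableStorage[i];   // reload the corrupted page
            ++reloaded;
        }
    }
    return reloaded;
}
```

Running scrubCode periodically corresponds to the proactive case. The reactive case would reload a single page from the same stable copy when an undefined instruction exception points at it.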

Component micro-reboots

Error: Unhandled exceptions in components
Recovery: Similar to component restarts in existing systems
Involves destroying and re-creating the C++ object
After a "micro-reboot", the internal state may be error free
The request is retried after the micro-reboot
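
A micro-reboot can be sketched as destroying a component object, constructing a fresh one, and retrying the request. Component and MicroRebooter are hypothetical names, and the single retry is an assumption of this sketch rather than a documented policy.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical component whose internal state can become corrupted.
class Component {
public:
    void corrupt() { corrupted_ = true; }

    int handle(int request) {
        if (corrupted_) throw std::runtime_error("corrupted internal state");
        return request * 2;
    }

private:
    bool corrupted_ = false;
};

// Owns the component and micro-reboots it when a request fails.
class MicroRebooter {
public:
    MicroRebooter() : component_(new Component()) {}

    // Test hook that damages the current component's state.
    void injectFault() { component_->corrupt(); }

    int reboots() const { return reboots_; }

    int call(int request) {
        try {
            return component_->handle(request);
        } catch (const std::exception&) {
            component_.reset(new Component());   // destroy and re-create the object
            ++reboots_;
            // Retry once; a second failure propagates to the caller.
            return component_->handle(request);
        }
    }

private:
    std::unique_ptr<Component> component_;
    int reboots_ = 0;
};
```

The fresh object starts from its constructor's state, which is why the retried request can succeed. Any state the old object had accumulated is lost in this simple form.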

Automatic service restarts

Error: Unhandled exception in a process
Recovery: Automatically restart the process
Used when component-level restarts fail or when the error occurs outside components (framework code)
Fault-injection experiments show 78.9% recovery for the process dispatcher (idle thread)
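
An automatic restart policy can be sketched as a supervision loop. The function superviseService, the RestartResult struct, and the restart limit are assumptions made for illustration. The service is modeled as a callable rather than a real OS process.

```cpp
#include <functional>
#include <stdexcept>

// Outcome of supervising one service.
struct RestartResult {
    bool succeeded;   // true if the service eventually ran to completion
    int restarts;     // number of restarts that were performed
};

// Runs the service and restarts it whenever it ends with an unhandled
// exception, giving up after maxRestarts restarts.
RestartResult superviseService(const std::function<void()>& serviceMain,
                               int maxRestarts) {
    int restarts = 0;
    while (true) {
        try {
            serviceMain();
            return {true, restarts};
        } catch (const std::exception&) {
            if (restarts == maxRestarts) {
                return {false, restarts};   // escalate to a stronger recovery step
            }
            ++restarts;
        }
    }
}
```

The restart limit keeps a permanently broken service from restarting forever, so the failure can be escalated to a stronger recovery technique.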

Watchdog-based recovery

Error: Lockups inside the OS
Recovery: Terminate the locked-up thread or dispatch an exception
Thread termination explored on Linux
A hardware watchdog works by running a countdown timer that the OS periodically resets ("tickles"). If the computer malfunctions, the tickles stop, and the watchdog eventually counts down to zero and automatically reboots the computer.
Exceptions allow possible local recovery without any information loss (in contrast with thread termination)
Lockup fault-injection experiments show about 70% recovery
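
The countdown behaviour of a watchdog can be modeled in a few lines. This Watchdog class is a hypothetical software model: onTimerTick stands in for the periodic hardware timer, and the caller decides what expiry means (reboot, thread termination, or dispatching an exception).

```cpp
// Software model of a watchdog countdown timer.
class Watchdog {
public:
    explicit Watchdog(int timeoutTicks)
        : timeout_(timeoutTicks), remaining_(timeoutTicks) {}

    // Called by healthy code to show it is still making progress.
    void tickle() { remaining_ = timeout_; }

    // Called once per timer tick. Returns true when the countdown has
    // reached zero, meaning no tickle arrived in time.
    bool onTimerTick() {
        if (remaining_ > 0) --remaining_;
        return remaining_ == 0;
    }

private:
    int timeout_;
    int remaining_;
};
```

As long as tickle is called more often than the timeout, onTimerTick never reports expiry. A locked-up thread stops tickling, and the countdown runs out.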

Process-level recovery

What to do when OS error recovery is not possible?
Last resort
Ensure minimal working subsystems - disk, recovery code
Save individual process state
Restore processes after full reboot
Explored on Linux
Re-use code for process checkpointing/migration support
Can recover from arbitrary OS corruption that does not affect user process state
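
Saving and restoring process state can be sketched as a round trip through a stored record. This is a toy model: ProcessState holds only three hypothetical fields, the record is a text string rather than a disk block, and the process name is assumed to contain no whitespace. Real checkpointing must also capture memory, registers, and open files.

```cpp
#include <sstream>
#include <string>

// Hypothetical, heavily reduced view of a process's state.
struct ProcessState {
    int pid;
    long programCounter;
    std::string name;   // assumed to contain no whitespace
};

// Serializes the state into a record that could be written to disk
// before the full reboot.
std::string saveState(const ProcessState& state) {
    std::ostringstream out;
    out << state.pid << ' ' << state.programCounter << ' ' << state.name;
    return out.str();
}

// Rebuilds the state from a saved record after the reboot.
// Returns false if the record cannot be parsed.
bool restoreState(const std::string& record, ProcessState& state) {
    std::istringstream in(record);
    return static_cast<bool>(in >> state.pid >> state.programCounter >> state.name);
}
```

The parse check matters because the record is written while the OS is already failing, so the restore step should reject a record it cannot read rather than resume a process from bad data.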

Solaris 10 OS

Introduces a new architecture for building and deploying systems and services capable of predictive self-healing
Solaris Fault Manager and Solaris Service Manager are the two main components of predictive self-healing
Fault Manager receives hardware and software errors and diagnoses them automatically
Service Manager provides services that permit automatic self-healing
Services include start, stop, and restart

Future scope

Working on OS restructuring to reduce error propagation and prevent state loss during component micro-reboots
Framework for developer-specified policies to govern micro-reboots and service restarts

Conclusion

Self-healing operating systems may be built by incorporating a variety of recovery techniques to address different fault models
It is also possible to detect and attempt recovery from system hangs that would otherwise remain undetected

Reference
1 ARM Integrator Family. http://www.arm.com/miscPDFs/8877.pdf [visited on November 10]
2 P. M. Chen, W. T. Ng, S. Chandra, C. Aycock, G. Rajamani, and D. Lowell. The Rio File Cache: Surviving Operating System Crashes. In Architectural Support for Programming Languages and Operating Systems, pages 74-83, 1996
3 E. Dijkstra. Self-stabilizing systems in spite of distributed control. Communications of the ACM, 1974
4 M. Baker and M. Sullivan. The Recovery Box: Using Fast Recovery to Provide High Availability in the UNIX Environment. In USENIX, pages 31-44, Summer 1992
5 Building a Self-Healing Operating System. http://choices.cs.uiuc.edu/selfhealing.pdf [visited on November 6]

THANK YOU
