
Oracle 11g: Resiliency feature

Health Monitor Checks


Health Monitor checks (also known as checkers, health checks, or checks) examine various layers and components of the database. Health checks detect file corruptions, physical and logical block corruptions, undo and redo corruptions, data dictionary corruptions, and more. The health checks generate reports of their findings and, in many cases, recommendations for resolving problems. Health checks can be run in two ways:

Reactive: The fault diagnosability infrastructure can run health checks automatically in response to a critical error.

Manual: As a DBA, you can manually run health checks using either the DBMS_HM PL/SQL package or the Enterprise Manager interface. You can run checkers on a regular basis if desired, or Oracle Support may ask you to run a checker while working with you on a service request.

Health Monitor checks store findings, recommendations, and other information in the Automatic Diagnostic Repository (ADR).

Health checks can run in two modes. DB-online mode means the check can be run while the database is open (that is, in OPEN mode or MOUNT mode). DB-offline mode means the check can be run when the instance is available but the database itself is closed (that is, in NOMOUNT mode).

Types of Health Checks


DB Structure Integrity Check: This check verifies the integrity of database files and reports failures if these files are inaccessible, corrupt, or inconsistent. If the database is in mount or open mode, this check examines the log files and data files listed in the control file. If the database is in NOMOUNT mode, only the control file is checked.

Data Block Integrity Check: This check detects disk image block corruptions such as checksum failures, head/tail mismatch, and logical inconsistencies within the block. Most corruptions can be repaired using Block Media Recovery. Corrupted block information is also captured in the V$DATABASE_BLOCK_CORRUPTION view. This check does not detect inter-block or inter-segment corruption.

Redo Integrity Check: This check scans the contents of the redo log for accessibility and corruption, as well as the archive logs, if available. The Redo Integrity Check reports failures such as archive log or redo corruption.

Undo Segment Integrity Check: This check finds logical undo corruptions. After locating an undo corruption, this check uses PMON and SMON to try to recover the corrupted transaction. If this recovery fails, then Health Monitor stores information about the corruption in V$CORRUPT_XID_LIST. Most undo corruptions can be resolved by forcing a commit.

Transaction Integrity Check: This check is identical to the Undo Segment Integrity Check except that it checks only one specific transaction.

Dictionary Integrity Check: This check examines the integrity of core dictionary objects, such as tab$ and col$. It performs the following operations:

  o Verifies the contents of dictionary entries for each dictionary object.
  o Performs a cross-row level check, which verifies that logical constraints on rows in the dictionary are enforced.
  o Performs an object relationship check, which verifies that parent-child relationships between dictionary objects are enforced.

Relevant RMAN Functionality


1. LIST FAILURE

The LIST FAILURE command displays any failures with a status OPEN and a priority of CRITICAL or HIGH in order of importance. If no such failures exist it will list LOW priority failures.

    RMAN> LIST FAILURE;

    List of Database Failures
    =========================

    Failure ID Priority Status    Time Detected Summary
    ---------- -------- --------- ------------- -------
    202        HIGH     OPEN      03-JAN-08     One or more non-system datafiles are corrupt

2. ADVISE FAILURE

The ADVISE FAILURE command, as the name implies, provides repair advice for failures listed by the LIST FAILURE command, as well as closing all open failures that are already repaired.

    RMAN> ADVISE FAILURE;

    List of Database Failures
    =========================

    Failure ID Priority Status    Time Detected Summary
    ---------- -------- --------- ------------- -------
    202        HIGH     OPEN      03-JAN-08     One or more non-system datafiles are corrupt

    analyzing automatic repair options; this may take some time
    allocated channel: ORA_DISK_1
    channel ORA_DISK_1: SID=124 device type=DISK
    analyzing automatic repair options complete

    Mandatory Manual Actions
    ========================
    no manual actions available

    Optional Manual Actions
    =======================
    no manual actions available

    Automated Repair Options
    ========================
    Option Repair Description
    ------ ------------------
    1      Restore and recover datafile 4
      Strategy: The repair includes complete media recovery with no data loss
      Repair script: /u01/app/oracle/diag/rdbms/db11g/DB11G/hm/reco_3657335472.hm

3. REPAIR FAILURE

The REPAIR FAILURE command applies the repair scripts produced by the ADVISE FAILURE command. Using the PREVIEW option lists the contents of the repair script without applying it.

    RMAN> REPAIR FAILURE PREVIEW;

    Strategy: The repair includes complete media recovery with no data loss
    Repair script: /u01/app/oracle/diag/rdbms/db11g/DB11G/hm/reco_2408143298.hm

    contents of repair script:
       # restore and recover datafile
       sql 'alter database datafile 4 offline';
       restore datafile 4;
       recover datafile 4;
       sql 'alter database datafile 4 online';

4. REPAIR FAILURE NOPROMPT

The REPAIR FAILURE command prompts the user to confirm the repair, but this can be prevented using the NOPROMPT keyword.

    RMAN> REPAIR FAILURE NOPROMPT;

    Strategy: The repair includes complete media recovery with no data loss
    Repair script: /u01/app/oracle/diag/rdbms/db11g/DB11G/hm/reco_2408143298.hm

    contents of repair script:
       # restore and recover datafile
       sql 'alter database datafile 4 offline';
       restore datafile 4;
       recover datafile 4;
       sql 'alter database datafile 4 online';
    executing repair script

    sql statement: alter database datafile 4 offline

    Starting restore at 03-JAN-08
    using channel ORA_DISK_1

    channel ORA_DISK_1: starting datafile backup set restore
    channel ORA_DISK_1: specifying datafile(s) to restore from backup set
    channel ORA_DISK_1: restoring datafile 00004 to /u01/app/oracle/oradata/DB11G/users01.dbf
    channel ORA_DISK_1: reading from backup piece /u01/app/oracle/flash_recovery_area/DB11G/backupset/2008_01_03/o1_mf_nnndf_BACKUP_DB11G.WORLD_0_3qsl2hy4_.bkp
    channel ORA_DISK_1: piece handle=/u01/app/oracle/flash_recovery_area/DB11G/backupset/2008_01_03/o1_mf_nnndf_BACKUP_DB11G.WORLD_0_3qsl2hy4_.bkp tag=BACKUP_DB11G.WORLD_010308113407
    channel ORA_DISK_1: restored backup piece 1
    channel ORA_DISK_1: restore complete, elapsed time: 00:00:07
    Finished restore at 03-JAN-08

    Starting recover at 03-JAN-08
    using channel ORA_DISK_1

    starting media recovery
    media recovery complete, elapsed time: 00:00:01

    Finished recover at 03-JAN-08

    sql statement: alter database datafile 4 online
    repair failure complete

5. CHANGE FAILURE

The CHANGE FAILURE command allows you to change the priority of a failure or close an open failure. You may wish to change the priority of a failure if it does not represent a problem to you. For example, a failure associated with a tablespace you no longer use may be listed as a high priority, when in fact it has no effect on the normal running of your system.

    RMAN> CHANGE FAILURE 202 PRIORITY LOW;
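The commands above form a natural sequence: list the failures, ask for advice, preview the repair, and only then apply it. As a rough sketch of that ordering, the following Python helper assembles the text of an RMAN command file. The function name build_repair_script and its preview_only flag are hypothetical, introduced only for illustration; the helper does not run RMAN, it only produces text that could be saved to a file and supplied to RMAN as a command file.

```python
def build_repair_script(preview_only=True):
    """Return RMAN command-file text for the list/advise/repair sequence.

    Hypothetical helper for illustration only. By default it stops at
    REPAIR FAILURE PREVIEW so nothing is changed; pass preview_only=False
    to append the non-interactive repair step as well.
    """
    commands = [
        "LIST FAILURE;",            # show open CRITICAL/HIGH failures
        "ADVISE FAILURE;",          # generate the repair script
        "REPAIR FAILURE PREVIEW;",  # display the script without applying it
    ]
    if not preview_only:
        # NOPROMPT suppresses the confirmation prompt, so use it deliberately.
        commands.append("REPAIR FAILURE NOPROMPT;")
    return "\n".join(commands) + "\n"


if __name__ == "__main__":
    print(build_repair_script(), end="")
```

Keeping the preview as the default reflects a cautious design choice: the repair itself takes a datafile offline, so it seems safer to make that step opt-in.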

Running Health Checks Using the DBMS_HM PL/SQL Package


The DBMS_HM procedure for running a health check is called RUN_CHECK. To call RUN_CHECK, supply the name of the check and a name for the run:

    BEGIN
      DBMS_HM.RUN_CHECK('Dictionary Integrity Check', 'my_run');
    END;
    /

To obtain a list of health check names, run the following query:

    SELECT name FROM v$hm_check WHERE internal_check='N';

    NAME
    ----------------------------------------------------------------
    DB Structure Integrity Check
    Data Block Integrity Check
    Redo Integrity Check
    Transaction Integrity Check
    Undo Segment Integrity Check
    Dictionary Integrity Check

Most health checks accept input parameters. You can view parameter names and descriptions with the V$HM_CHECK_PARAM view. Some parameters are mandatory while others are optional. If optional parameters are omitted, defaults are used. The following query displays parameter information for all health checks:

    SELECT c.name check_name, p.name parameter_name, p.type,
           p.default_value, p.description
    FROM   v$hm_check_param p, v$hm_check c
    WHERE  p.check_id = c.id and c.internal_check = 'N'
    ORDER BY c.name;

Input parameters are passed in the input_params argument as name/value pairs separated by semicolons (;). The following example illustrates how to pass the transaction ID as a parameter to the Transaction Integrity Check:

    BEGIN
      DBMS_HM.RUN_CHECK (
        check_name   => 'Transaction Integrity Check',
        run_name     => 'my_run',
        input_params => 'TXN_ID=7.33.2');
    END;
    /
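Because input_params is a single string of name/value pairs separated by semicolons, it can be convenient to assemble that string from a mapping before passing it to RUN_CHECK. The Python sketch below shows one way to do so. The helper name build_input_params is hypothetical and not part of DBMS_HM; it only formats the string, does not connect to the database, and does not validate names against V$HM_CHECK_PARAM. Rejecting embedded separators is an assumption of this sketch, not documented Oracle behavior.

```python
def build_input_params(params):
    """Format a mapping as 'NAME=value;NAME=value' for input_params.

    Hypothetical helper for illustration; not part of DBMS_HM.
    """
    parts = []
    for name, value in params.items():
        text = str(value)
        # Assumption of this sketch: a ';' or '=' inside a name, or a ';'
        # inside a value, would make the string ambiguous, so refuse it.
        if ";" in name or "=" in name or ";" in text:
            raise ValueError(f"unsupported character in parameter {name!r}")
        parts.append(f"{name}={text}")
    return ";".join(parts)


if __name__ == "__main__":
    # Mirrors the Transaction Integrity Check example above.
    print(build_input_params({"TXN_ID": "7.33.2"}))
```

The resulting string would then be supplied as the input_params argument in the PL/SQL call.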

Running Health Checks Using Enterprise Manager


Enterprise Manager provides an interface for running Health Monitor checkers. To run a Health Monitor Checker using Enterprise Manager:

1. On the Database Home page, in the Related Links section, click Advisor Central.
2. Click Checkers to view the Checkers subpage.
3. In the Checkers section, click the checker you want to run.
4. Enter values for input parameters or, for optional parameters, leave them blank to accept the defaults.
5. Click Run, confirm your parameters, and click Run again.

Viewing Checker Reports


After a checker has run, you can view a report of its execution. The report contains findings, recommendations, and other information. You can view reports using Enterprise Manager, the ADRCI utility, or the DBMS_HM PL/SQL package. The following table indicates the report formats available with each viewing method.

    Report Viewing Method       Report Formats Available
    ------------------------    ------------------------
    Enterprise Manager          HTML
    DBMS_HM PL/SQL package      HTML, XML, and text
    ADRCI utility               XML

ADR Home
In Oracle Database 11g, the Automatic Diagnostic Repository (ADR) files are located in directories under a common directory specified as the Diagnostic Destination (or ADR Base). This directory is set by an initialization parameter (diagnostic_dest). By default it is set to $ORACLE_BASE, but you can set it explicitly to a dedicated directory. (This is not recommended, however.) Under this directory, there is a subdirectory called diag under which you will find the subdirectories where the diagnostic files are stored.

The ADR houses logs and traces of all components (ASM, CRS, listener, and so on) in addition to those of the database itself. This makes it convenient for you to look for a specific log at a single location.

Directory Name                                            Description
------------------------------------------------------    -----------
<Directory mentioned in the DIAGNOSTIC_DEST parameter>
  diag
    rdbms
      <Name of the Database>
        <Name of the Instance>
          alert                                           The alert log in XML format is stored here.
          cdump                                           Core dumps are stored here, the equivalent of the core_dump_dest in earlier versions.
          hm                                              The Health Monitor runs checks on many components, and it stores some files here.
          incident                                        All incident dumps are stored here.
            <all incident directories exist here>         Each incident is stored in a different directory, which are all stored here.
          incpkg                                          When you package incidents (learn about packaging in this article), certain supporting files are stored here.
          metadata                                        Metadata about problems, incidents, packages and so on is kept here.
          trace                                           User traces and background traces are kept here, along with the text version of the alert log.

You can now see the different subdirectories under this ADR Home:

    $ ls
    alert  cdump  hm  incident  incpkg  ir  lck  metadata  stage  sweep  trace

To support this new structure, the *_dest parameters from previous releases (background_dump_dest and user_dump_dest) are ignored. (core_dump_dest is not ignored; in fact, Oracle recommends that you set it, because core dumps can be very large.) You shouldn't set the ignored parameters at all, and if you are upgrading from 10g to 11g, you should remove them from the initialization parameter file to avoid confusion later.

The ADR directory structure for other components is similar. For instance, for an ASM instance, the directory under "diag" is named asm, instead of rdbms. The rest of the directory structure remains the same. The name of the target in the case of ASM is +asm. For instance, here is how my ADR Home for ASM looks:

    $ pwd
    /home/oracle/diag/asm/+asm/+ASM
    $ ls
    alert  cdump  hm  incident  incpkg  ir  lck  metadata  stage  sweep  trace

For the listener, the directory under diag is called tnslsnr, under which another directory exists with the hostname, and then under that another directory with the listener name as the directory name. Under that you will see the other directories.

    <Directory mentioned in the DIAGNOSTIC_DEST parameter>
      diag
        tnslsnr
          <hostname of the server>
            <name of the listener>
              alert
              trace
              ...

For instance, for a host named oradba3, and a listener named "listener" (the default name), the directory will look like /home/oracle/diag/tnslsnr/oradba3/listener. Under this directory all the others (alert, trace, metadata, and so on) are created. Like the alert log, the listener log file is also stored as XML entries, under the subdirectory alert. The usual text listener log file is still produced, under the directory trace.

A new view V$DIAG_INFO shows all the details about the ADR Homes. In my RDBMS home, it appears like this:

    SQL> select * from v$diag_info;

     INST_ID NAME                   VALUE
    -------- ---------------------- ----------------------------------------------------------------
           1 Diag Enabled           TRUE
           1 ADR Base               /home/oracle
           1 ADR Home               /home/oracle/diag/rdbms/odel11/ODEL11
           1 Diag Trace             /home/oracle/diag/rdbms/odel11/ODEL11/trace
           1 Diag Alert             /home/oracle/diag/rdbms/odel11/ODEL11/alert
           1 Diag Incident          /home/oracle/diag/rdbms/odel11/ODEL11/incident
           1 Diag Cdump             /home/oracle/diag/rdbms/odel11/ODEL11/cdump
           1 Health Monitor         /home/oracle/diag/rdbms/odel11/ODEL11/hm
           1 Default Trace File     /home/oracle/diag/rdbms/odel11/ODEL11/trace/ODEL11_ora_3908.trc
           1 Active Problem Count   3
           1 Active Incident Count  37

    11 rows selected.
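The directory layout described in this section follows one pattern: the ADR Base, then diag, then a component directory (rdbms, asm, tnslsnr), then two component-specific names. As a small illustration, the Python sketch below composes an ADR Home path from those parts. The function name adr_home and its parameters are hypothetical and exist only for this example; on a live system the V$DIAG_INFO view remains the authoritative source for these paths.

```python
from pathlib import PurePosixPath


def adr_home(adr_base, component, *names):
    """Compose <adr_base>/diag/<component>/<names...> as a POSIX path string.

    Hypothetical helper for illustration. 'names' carries the
    component-specific levels, for example (database, instance) for rdbms
    or (hostname, listener name) for tnslsnr.
    """
    return str(PurePosixPath(adr_base, "diag", component, *names))


if __name__ == "__main__":
    # Values taken from the examples in this section.
    print(adr_home("/home/oracle", "rdbms", "odel11", "ODEL11"))
    print(adr_home("/home/oracle", "asm", "+asm", "+ASM"))
    print(adr_home("/home/oracle", "tnslsnr", "oradba3", "listener"))
```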

ADRCI
You access the files and perform other operations on the ADR with a command-line tool called adrci. Let's see how you can use the tool. From the UNIX (or Windows) command prompt, type "adrci":

    $ adrci

    ADRCI: Release 11.1.0.6.0 - Beta on Sun Sep 23 23:22:24 2007

    Copyright (c) 1982, 2007, Oracle. All rights reserved.

    ADR base = /home/oracle

As you learned earlier, there are several ADR Homes, one for each instance of the Oracle components. So, the first task is to show how many homes exist. The command is show homes.

    adrci> show homes
    ADR Homes:
    diag/rdbms/odel11/ODEL11
    diag/rdbms/dbeng1/DBENG1
    diag/clients/user_unknown/host_411310321_11
    diag/tnslsnr/oradba3/listener

As you can see, there are several homes. To operate on a specific home, use the set homepath command:

    adrci> set homepath diag/rdbms/odel11/ODEL11

Once set, you can issue many commands at the prompt. The first command you may try is help, which will show all the available commands. Here is a brief excerpt of the output:

    adrci> help

     HELP [topic]
       Available Topics:
            CREATE REPORT
            ECHO
            EXIT
            HELP
            HOST
            IPS
            ...

If you want to know more about a specific command, issue help <command>. For instance, if you want to get help on the usage of the show incident command, you will issue:

    adrci> help show incident

      Usage: SHOW INCIDENT [-p <predicate_string>]
                           [-mode BASIC|BRIEF|DETAIL]
                           [-last <num> | -all]
                           [-orderby (field1, field2, ...) [ASC|DSC]]

      Purpose: Show the incident information. By default, this command will
               only show the last 50 incidents which are not flood controlled.

      Options:
        [-p <predicate_string>]: The predicate string must be double-quoted.
        [-mode BASIC|BRIEF|DETAIL]: The different modes of showing incidents.
    [... and so on ...]

From the output you know the usage.

This technique of decoupling the collection and publication of stats can also be used with partitioned tables. Suppose you are loading a table partition by partition. You don't want to feed partial information to the optimizer; you would rather have the stats of all partitions become visible to the optimizer at the same time. But you also want to take advantage of the time right after the partition is loaded. So, you can collect the stats on a partition right after it is loaded but not publish them. After all partitions are analyzed, you can publish them all at once.
Now to know how many incidents have been recorded, you can issue:

    adrci> show incident -mode basic

    ADR Home = /home/oracle/diag/rdbms/odel11/ODEL11:
    ******************************************************************
    INCIDENT_ID   PROBLEM_KEY         CREATE_TIME
    -----------   -----------------   ---------------------------------
    14556         ORA 600 [KSSRMP1]   2007-10-17 04:01:57.725620 -04:00
    14555         ORA 600 [KSSRMP1]   2007-10-16 18:45:03.970884 -04:00
    14435         ORA 603             2007-10-16 06:06:46.705430 -04:00
    14427         ORA 603             2007-10-16 06:06:42.007937 -04:00
    14419         ORA 603             2007-10-16 06:06:30.069050 -04:00
    6001          ORA 4031            2007-08-28 14:50:01.355783 -04:00
    5169          ORA 4031            2007-09-04 19:09:36.310123 -04:00
    5121          ORA 4031            2007-09-03 14:40:14.575457 -04:00
    5017          ORA 4031            2007-09-04 19:09:30.969226 -04:00
    4993          ORA 4031            2007-09-04 19:09:33.179857 -04:00
    4945          ORA 4031            2007-09-04 19:09:30.955524 -04:00
    4913          ORA 4031            2007-09-04 19:09:31.641990 -04:00

This shows a list of all incidents. Now, you can get the details of a specific incident as shown below:

    adrci> show incident -mode detail -p "incident_id=14556"

    ADR Home = /home/oracle/diag/rdbms/odel11/ODEL11:
    *************************************************************************

    **********************************************************
    INCIDENT INFO RECORD 1
    **********************************************************
       INCIDENT_ID     14556
       STATUS          ready
       CREATE_TIME     2007-10-17 04:01:57.725620 -04:00
    .
    [... and so on ...]
    .
       INCIDENT_FILE   /home/oracle/diag/rdbms/odel11/ODEL11/trace/ODEL11_mmon_14831.trc
       OWNER_ID        1
       INCIDENT_FILE   /home/oracle/diag/rdbms/odel11/ODEL11/incident/incdir_14556/ODEL11_mmon_14831_i14556.trc
    1 rows fetched

The information shown in the adrci command line is analogous to what you will see in the Enterprise Manager screens. The latter may, however, be simpler and much more user friendly. adrci is very helpful when you don't have access to EM Support Workbench for some reason. You can also use adrci to do things like tailing the alert log file or searching some log (listener, css, crs, alert, etc.) for specific patterns. adrci is also helpful if you want to work on the ADR programmatically.
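As one illustration of working with ADR output programmatically, the Python sketch below parses the text produced by show incident -mode basic into records and counts incidents per problem key. It assumes the output has already been captured into a string (for example from a batch-mode adrci invocation) and that each data row has the three-column shape shown earlier: incident ID, problem key, creation time. The names parse_incidents and count_by_problem are hypothetical.

```python
import re

# One data row: numeric incident ID, free-text problem key, timestamp with
# UTC offset. Header, separator, and "ADR Home" lines do not match.
INCIDENT_LINE = re.compile(
    r"^\s*(?P<incident_id>\d+)\s+"
    r"(?P<problem_key>.+?)\s+"
    r"(?P<create_time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ [+-]\d{2}:\d{2})\s*$"
)


def parse_incidents(text):
    """Return a list of (incident_id, problem_key, create_time) tuples."""
    incidents = []
    for line in text.splitlines():
        match = INCIDENT_LINE.match(line)
        if match:
            incidents.append(
                (int(match["incident_id"]), match["problem_key"], match["create_time"])
            )
    return incidents


def count_by_problem(incidents):
    """Count how many incidents share each problem key."""
    counts = {}
    for _, problem_key, _ in incidents:
        counts[problem_key] = counts.get(problem_key, 0) + 1
    return counts


if __name__ == "__main__":
    sample = """\
INCIDENT_ID   PROBLEM_KEY         CREATE_TIME
-----------   -----------------   ---------------------------------
14556         ORA 600 [KSSRMP1]   2007-10-17 04:01:57.725620 -04:00
14435         ORA 603             2007-10-16 06:06:46.705430 -04:00
"""
    for row in parse_incidents(sample):
        print(row)
```

A regular expression anchored on the timestamp is used here because the problem key itself can contain spaces, so splitting on whitespace alone would not recover the columns.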

New Alert Log


In Oracle Database 11g, the alert log is written in XML format. For the sake of compatibility with older tools, the traditional alert log is also available in the ADR Home under the trace directory. For instance, in my example shown above, the directory is /home/oracle/diag/rdbms/odel11/ODEL11/trace, where you can find the alert_ODEL11.log. However, the other alert logs are in XML format, and are located in the alert subdirectory under ADR Home. Let's see the files:

    $ pwd
    /home/oracle/diag/rdbms/odel11/ODEL11/alert
    $ ls -ltr
    total 60136
    -rw-r----- 1 oracle oinstall 10485977 Sep 13 17:44 log_1.xml
    -rw-r----- 1 oracle oinstall 10486008 Oct 16 06:35 log_2.xml
    -rw-r----- 1 oracle oinstall 10485901 Oct 16 07:27 log_3.xml
    -rw-r----- 1 oracle oinstall 10485866 Oct 16 08:12 log_4.xml
    -rw-r----- 1 oracle oinstall 10486010 Oct 17 23:56 log_5.xml
    -rw-r----- 1 oracle oinstall  9028631 Oct 21 20:07 log.xml

Note that there are several files: log_1.xml, log_2.xml, and so on. When the log.xml reaches a certain size, the file is renamed to log_?.xml and a new file is started. This prevents the alert log from becoming too large and unmanageable.

The new alert log is accessed via the adrci utility: the ADR command line tool, which you learned about in the previous section. From the adrci tool, issue:

    adrci> show alert

    Choose the alert log from the following homes to view:

    1: diag/rdbms/odel11/ODEL11
    2: diag/clients/user_oracle/host_1967384410_11
    3: diag/clients/user_unknown/host_411310321_11
    4: diag/tnslsnr/oradba3/listener
    Q: to quit

    Please select option:

You can choose one from the menu or you can supply a specific home:

    adrci> set homepath diag/rdbms/odel11/ODEL11
    adrci> show alert

    ADR Home = /home/oracle/diag/rdbms/odel11/ODEL11:
    [... and the whole alert log shows up here ...]

One of the best things about the alert log being an XML file is that information is written in a structured way. Gone are the days when the alert log was a repository of unstructured data. The XML format makes the file viewable as a table in adrci. To see the fields of this "table", use the describe command:

    adrci> describe alert_ext
    Name                          Type            NULL?
    ----------------------------- --------------- -----------
    ORIGINATING_TIMESTAMP         timestamp
    NORMALIZED_TIMESTAMP          timestamp
    ORGANIZATION_ID               text(65)
    COMPONENT_ID                  text(65)
    HOST_ID                       text(65)
    HOST_ADDRESS                  text(17)
    MESSAGE_TYPE                  number
    MESSAGE_LEVEL                 number
    MESSAGE_ID                    text(65)
    MESSAGE_GROUP                 text(65)
    CLIENT_ID                     text(65)
    MODULE_ID                     text(65)
    PROCESS_ID                    text(33)
    THREAD_ID                     text(65)
    USER_ID                       text(65)
    INSTANCE_ID                   text(65)
    DETAILED_LOCATION             text(161)
    UPSTREAM_COMP_ID              text(101)
    DOWNSTREAM_COMP_ID            text(101)
    EXECUTION_CONTEXT_ID          text(101)
    EXECUTION_CONTEXT_SEQUENCE    number
    ERROR_INSTANCE_ID             number
    ERROR_INSTANCE_SEQUENCE       number
    MESSAGE_TEXT                  text(2049)
    MESSAGE_ARGUMENTS             text(129)
    SUPPLEMENTAL_ATTRIBUTES       text(129)
    SUPPLEMENTAL_DETAILS          text(129)
    PARTITION                     number
    RECORD_ID                     number
    FILENAME                      text(513)
    PROBLEM_KEY                   text(65)
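Returning to the rotated files listed earlier in this section: if you ever process the XML alert log segments directly instead of through adrci, they need to be read oldest first. The Python sketch below orders a set of file names that way. It assumes, as the directory listing above suggests, that a higher number in log_N.xml means a more recent segment and that log.xml is the segment currently being written. The function name order_alert_segments is hypothetical.

```python
import re

# Matches rotated segments such as log_1.xml or log_12.xml.
SEGMENT = re.compile(r"^log_(\d+)\.xml$")


def order_alert_segments(filenames):
    """Return alert log segment names oldest first, with log.xml last.

    Assumption of this sketch: segment numbers increase with age order
    (log_1.xml is the oldest). Unrelated file names are ignored.
    """
    numbered = []
    current = []
    for name in filenames:
        match = SEGMENT.match(name)
        if match:
            # Sort numerically so log_10.xml comes after log_2.xml.
            numbered.append((int(match.group(1)), name))
        elif name == "log.xml":
            current.append(name)
    numbered.sort()
    return [name for _, name in numbered] + current


if __name__ == "__main__":
    print(order_alert_segments(["log.xml", "log_2.xml", "log_10.xml", "log_1.xml"]))
```

Sorting on the parsed integer and not on the raw file name avoids the lexical trap where log_10.xml would sort before log_2.xml.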
