
Troubleshooting Oracle Clusterware common startup failures

Syed Jaffar Hussain / 1.28.2014 at 10:57pm


As an Oracle DBA in a non-cluster environment, your responsibilities are limited to managing, troubleshooting and diagnosing problems pertaining to the database technologies. In a cluster environment, in contrast, you have the additional responsibility of managing the Clusterware and troubleshooting its problems. The purpose of this article is to help you understand the basics of the Clusterware startup sequence and troubleshoot the most common Clusterware startup failures. Additionally, this article will focus on some of the useful tools and utilities that are handy for identifying the root cause of Clusterware-related problems.

From my perspective and personal experience, the following are some of the challenges most DBAs confront in their cluster environments:

Node eviction
Cluster becoming unhealthy
Unable to start the cluster or some of the Clusterware components

Clusterware startup sequence


It's worthwhile to understand how things get started and stopped when managing and troubleshooting a system. In this segment, we will look closely at how the Oracle Clusterware stack components are started, and in which sequence they come up on a node reboot or a manual cluster startup. This understanding will greatly help in addressing the most common cluster stack startup failures and gives you an idea of where to start the investigation in case any cluster component doesn't start.

The diagram below depicts the Oracle Cluster stack (components) startup sequence at various levels:


Source: Expert Oracle RAC 12c

The entire Oracle Cluster stack and the services registered on the cluster come up automatically when a node reboots or when the cluster stack is started manually. The startup process is segregated into five (5) levels; at each level, different processes are started in a sequence.

On node reboot/crash, the init process on the OS spawns init.ohasd (as mentioned in the /etc/inittab file), which in turn starts the Oracle High Availability Services Daemon (ohasd). The ohasd daemon is then responsible for starting off the other critical cluster daemon processes.

The oraagent and orarootagent layers then bring up the Cluster Synchronization Services Daemon (cssd), Cluster Ready Services Daemon (crsd), Event Manager Daemon (evmd) and the rest of the Cluster stack, in addition to ASM, RDBMS instances and other resources on the cluster, at various levels.
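As a quick sanity check of that sequence on a running node, you can simply look for the corresponding daemons at the OS level. A minimal sketch, assuming a Linux system with 11gR2/12c Grid Infrastructure (binary names can differ slightly between versions):

$ ps -ef | egrep 'init.ohasd|ohasd.bin|ocssd.bin|crsd.bin|evmd.bin' | grep -v grep

If init.ohasd and ohasd.bin are present but the later daemons are not, the startup has stalled at one of the lower levels, which narrows down where to look.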

Cluster-wide cluster commands


With Oracle 11gR2, you can now start, stop and verify the cluster status of all nodes from a single node. Prior to Oracle 11gR2, you had to log in to each individual node to start, stop and verify the cluster health status. Below are some of the cluster-wide cluster commands:

$ ./crsctl check cluster -all   [verify cluster status on all nodes]
$ ./crsctl stop cluster -all    [stop cluster on all nodes]
$ ./crsctl start cluster -all   [start cluster on all nodes]
$ ./crsctl check cluster -n <nodename>   [verify the cluster status on a particular remote node]

Troubleshooting common cluster startup problems


Once you understand the startup sequence and how things get started in an Oracle Cluster environment, it becomes easy to troubleshoot and solve the most common startup failure problems. In this segment, we will explain how to diagnose some of the common Cluster startup problems.

Imagine you run the crsctl check cluster/crs command and it gives the following errors:

$GRID_HOME/bin/crsctl check cluster

CRS-4639: Could not contact Oracle High Availability Services
CRS-4124: Oracle High Availability Services startup failed


CRS-4000: Command Check failed, or completed with errors

OR

CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager

We will take the component startup failures in the same sequence in which the components usually start. Let's talk first about ohasd startup failures, which result in the CRS-4639/4124/4000 errors. The following are the main things to check (a command sketch follows the list):

Verify that the /etc/inittab file contains the entry to start the ohasd process automatically.
Ensure Cluster auto-start is configured, using the crsctl config crs command. If for any reason it is not configured for auto-start, enable auto-start and start the cluster manually.
Refer to the alert.log and ohasd.log files under the $ORA_GRID_HOME/log/hostname location.
Ensure the node has no issues accessing the OLR and OCR/Voting Disks. This can be verified in the alert or ohasd log files.
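A minimal sketch of those checks as commands (paths assume a Linux install; some of these must be run as the root user):

$ grep ohasd /etc/inittab                  [confirm the init.ohasd entry exists]
$ $GRID_HOME/bin/crsctl config crs         [is Clusterware auto-start enabled?]
$ $GRID_HOME/bin/crsctl enable crs         [re-enable auto-start if it was disabled]
$ $GRID_HOME/bin/crsctl start crs          [start the stack manually on this node]
$ $GRID_HOME/bin/ocrcheck -local           [verify OLR accessibility and integrity]
$ $GRID_HOME/bin/ocrcheck                  [verify OCR accessibility and integrity]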

In the event of Cluster Synchronization Services Daemon (cssd) process startup failures, or when it is reported unhealthy, follow the guidelines below to resolve the problem (see the command sketch after the status output below):

Verify whether cssd.bin is active at the OS level: ps -ef | grep cssd.bin
Review the alert log and ocssd.log files
Ensure there are no issues with regard to Voting Disk accessibility
Also, check the network heartbeat between the nodes
Run crsctl stat res -t -init, and if the crsd process is unhealthy, start it up manually

$ ./crsctl start res ora.crsd -init

--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER          STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       rac2            Started
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rac2
ora.crsd
      1        ONLINE  OFFLINE      rac2
ora.cssd
      1        ONLINE  ONLINE       rac2
ora.cssdmonitor
      1        ONLINE  ONLINE       rac2
ora.ctssd
      1        ONLINE  ONLINE       rac2            OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       rac2
ora.evmd
If you encounter Cluster Ready Services Daemon (crsd) startup failures, or it is reported unhealthy, you can go through the following checklist (a command sketch follows the list):

Verify whether the crsd.bin process is up at the OS level: ps -ef | grep crsd.bin
Refer to the alert log and crsd logs to get useful information about the nature of the problem
Check the status of the resource using the above command, and start it manually if the status/target is OFFLINE
Ensure the OCR disks are accessible to the node and that there is no OCR corruption
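A minimal sketch of those checks (run from $GRID_HOME/bin; ocrcheck requires the root user for a full check):

$ ps -ef | grep crsd.bin                   [is the daemon running at the OS level?]
$ ./ocrcheck                               [verify OCR accessibility and logical integrity]
$ ./crsctl stat res ora.crsd -init         [check the target/state of the crsd resource]
$ ./crsctl start res ora.crsd -init        [start it manually if it is OFFLINE]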

If you couldn't resolve the problem by reviewing the above guidelines, you might need to contact Oracle Support and provide all the information required to investigate the problem further.

CRS logs and directory hierarchy


Each component of Grid Infrastructure (Clusterware) maintains an individual log file and writes important events to that log file under typical circumstances. The information written to the logs helps DBAs understand the current state of the component and assists in troubleshooting critical cluster problems. The DBA should comprehend the importance of these log files and be able to interpret them to solve problems. Each node in the cluster maintains an individual log directory, under the $GRID_HOME/log/<hostname> location, for every cluster component, as shown in the following screenshot:

Source: Expert Oracle RAC 12c
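As an illustration of that hierarchy, a listing on a node might look roughly like the sketch below (the exact set of sub-directories varies by version and installed components):

$ cd $GRID_HOME/log/$(hostname -s)
$ ls
(one sub-directory per component, e.g. ohasd/, cssd/, crsd/, evmd/, ctssd/, agent/, client/, plus the node's cluster alert log)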

The following are some of the most frequently consulted logs during cluster maintenance and when troubleshooting various Cluster-related problems, such as node eviction, cluster stack health, and OCR/Voting disk related problems:

alert_hostname.log
crsd.log
ocssd.log
ohasd.log

alert_hostname.log Each cluster node has its own alert log file, which records important and useful information about cluster startup, node eviction, any cluster component startup problems, and OCR/Voting disk related events. It is therefore recommended to refer to this log file frequently to know the cluster status, for example in the event of another node's eviction, or when you want to keep an eye on OCR/Voting disk developments.

ocssd.log Yet another very critical and important log file which needs your attention. Whenever the cluster encounters any serious snags with regard to the Cluster Synchronization Services Daemon (cssd) process, this is the file to refer to in order to understand the nature of the problem and to resolve it. This is one of the busiest log files; it is written continuously and maintained by the cluster automatically. Manual maintenance of the file is not recommended; once the file size reaches 50 MB, Oracle automatically archives it and creates a new log file. In this context, a total of 10 archived log files are maintained at any given point in time.

crsd.log The Cluster Ready Services Daemon (crsd) process writes all important events to this file, such as cluster resource startup, stop, failure, and crsd health status. If you have issues starting or stopping any cluster or non-cluster resources on the node, refer to this log file to diagnose the issue. This file is also maintained by Oracle, and removing it is not recommended. Once the size of the file reaches 10 MB, it is archived automatically and a new log file is created. Ten archived copies of the file are kept at any point in time.

ohasd.log The log le is accessed and managed by the new Oracle High
Availability Service Daemon (ohasd) process which was rst introduced in Oracle
11gR2. If you encounter any issues running root.sh or rootupgrade.sh scripts, refer
to this log le to understand troubleshooting the problem. If you face issues
starting up the process, and if Oracle Local Registry (OLR) has any corruption or
inaccessibility issues, also refer to this le. Like crsd and ocssd log les, this le
also is maintained automatically by Oracle and archives the log le upon reaching
10MB size. Total of 10 archived log les are maintained at any given point in time.

Debugging and Tracing CRS components


As we have learned, CRS maintains a good number of log files for the various Cluster components, which can be consulted at any time to diagnose critical cluster startup issues, node evictions and other events. If the default debugging or logging information doesn't provide sufficient feedback to resolve a particular issue, you can increase the debug/trace level of a particular component or cluster resource to generate additional debugging information to address the problem.

Fortunately, Oracle lets you adjust the default trace/debug levels of any cluster component or resource dynamically. To list the default trace/debug settings of a component or sub-component, log in as the root user and execute the following command from the GRID_HOME:

$ ./crsctl get log css/crs/evm/all

To adjust/change the default trace level, use the following examples:

$ ./crsctl set log crs crsmain=4
$ ./crsctl set log crs all=3   [all components/sub-components of CRS set to level 3]

You should seek Oracle Support's advice before you set or adjust any of the default settings. To disable the additional trace/debug level, set the level value back to 0. You can set levels from 1 to 5; the higher the value, the more information is generated, so you must closely watch log file growth and free space on the filesystem to avoid any space-related issues.
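For example, following the same syntax as the commands above, the extra tracing enabled on the crsmain sub-component could be switched back off like this (a sketch only; confirm the appropriate level with Oracle Support first):

$ ./crsctl set log crs crsmain=0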

You can also turn on tracing through an OS setting; use the following OS-specific command:

$ export SRVM_TRACE=true

Once the above is set in any UNIX OS session, trace/log files are generated under the $GRID_HOME/cv/log destination. Once you exit from the terminal, tracing ends.
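As an illustration, a hedged sketch of the workflow (SRVM_TRACE affects Java-based tools such as cluvfy and srvctl, and only for commands run from the same shell session):

$ export SRVM_TRACE=true
$ cluvfy comp nodecon -n all    [this run now writes detailed trace files under $GRID_HOME/cv/log]
$ unset SRVM_TRACE              [stop tracing for subsequent commands in this session]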

6 of 8 18-02-2017 18:32
Troubleshooting Oracle Clusterware common startup failures - Oracle ... https://www.toadworld.com/platforms/oracle/b/weblog/archive/2014/0...

Oracle Clusterware Troubleshooting tools & utilities


One of the prime responsibilities of an Oracle DBA is managing and troubleshooting the cluster system. In this context, the DBA must be aware of all the internal and external tools and utilities provided by Oracle to maintain and diagnose cluster issues. Understanding and weighing the pros and cons of each individual tool or utility is essential. You must know them well and choose the right tool or utility at the right moment; otherwise, you will not only waste time trying to resolve the issue, but may also end up with a prolonged service interruption.

Let me zip through and explain the benefits of some of the most important and most frequently used tools and utilities here.

Cluster Verification Utility (CVU) is used to collect pre- and post-cluster-installation configuration details at various levels and for various components. With 11gR2, it also provides the ability to verify the cluster health. Look at some of the useful commands below:

$ ./cluvfy comp healthcheck -collect cluster -bestpractice -html
$ ./cluvfy comp healthcheck -collect cluster|database
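A couple of other commonly used CVU checks, as a hedged sketch (node lists are placeholders):

$ ./cluvfy stage -post crsinst -n all -verbose   [post Clusterware-installation verification on all nodes]
$ ./cluvfy comp ocr -n all -verbose              [check OCR integrity across the nodes]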

Real Time RAC DB monitoring (oratop) is an external Oracle utility, currently available on the Linux platform, which provides an OS top-like display where you can monitor RAC databases or single-instance databases in real time. The window provides real-time statistics such as top DB events, top Oracle processes, blocking session information, etc. You must download oratop.zip from support.oracle.com and configure it.
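Once configured, invocation is typically along these lines, as a hedged sketch (check the README shipped inside the oratop.zip you download, since the exact options vary between versions):

$ unzip oratop.zip
$ ./oratop / as sysdba    [monitor the local instance; a remote connect string can be used for RAC databases]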

RAC configuration audit tool (RACcheck) is yet another Oracle-provided external tool, developed by the RAC support team, to perform audits of various cluster configuration settings. You must download the tool (raccheck.zip) from support.oracle.com and configure it on one of the nodes of the cluster. The tool performs cluster-wide configuration auditing of CRS, ASM, RDBMS and generic database parameter settings. This tool can also be used to assess the readiness of the system for an upgrade. However, you need to keep upgrading the tool to get the latest recommendations.
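A typical invocation, as a hedged sketch (consult the MOS note bundled with the download for the exact options; the tool has since been folded into ORAchk):

$ unzip raccheck.zip
$ ./raccheck -a    [run all audit checks and generate an HTML report]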

Cluster Diagnostic Collection Tool (diagcollection.sh) Since the cluster maintains so many log files, it can sometimes be time consuming and cumbersome to visit all of them to understand the nature of a problem or diagnose an issue. The diagcollection.sh tool reads the various cluster log files and gathers the information required to diagnose critical cluster problems. With this tool, you can gather the stats/information at various levels: Cluster, RDBMS, core analysis, database, etc. The tool packages all the files into archives and removes the individual files. The following files are collected as part of a diagcollection run:

ocrData_hostname_date.tar.gz   contains ocrdump, ocrcheck etc.
coreData_hostname_date.tar.gz  contains CRS core files
osData_hostname_date.tar.gz    OS logs
ocrData_hostname_date.tar.gz   OCR details

Example:

$ $GRID_HOME/bin/diagcollection.sh --collect --crs --crshome $GRID_HOME

$ ./diagcollection.sh -help
--collect
    [--crs] For collecting crs diag information
    [--adr] For collecting diag information for ADR; specify ADR location
    [--chmos] For collecting Cluster Health Monitor (OS) data


    [--all] Default. For collecting all diag information. <<<>>>
    [--core] Unix only. Package core files with CRS data
    [--afterdate] Unix only. Collects archives from the specified date.
    [--aftertime] Supported with -adr option. Collects archives after the specified
    [--beforetime] Supported with -adr option. Collects archives before the specified
    [--crshome] Argument that specifies the CRS Home location
    [--incidenttime] Collects Cluster Health Monitor (OS) data from the specified
    [--incidentduration] Collects Cluster Health Monitor (OS) data for the duration
NOTE:
1. You can also do the following
   ./diagcollection.pl --collect --crs --crshome <CRS Home>

--clean    cleans up the diagnosability information gathered by this script

Above all, there are many other important and useful tools, such as Cluster Health Monitor (CHM) to diagnose node eviction issues, DB hang analysis, OSWatcher, the Lite Onboard Monitor (LTOM), Procwatcher, etc., which are available for your use under different circumstances.

In a nutshell, this paper has explained the Cluster startup sequence and how things get started on node reboot, and has provided guidelines for solving the most commonly faced Cluster startup issues by analyzing the cluster logs and using the appropriate tools and utilities discussed.

clusterware, Startup, troubleshoot, Oracle, failure

