
Troubleshooting and Diagnosing Oracle

Database 12.2 and Oracle RAC


LinkedIn: https://www.linkedin.com/in/raosandesh/ (sandeshr)

Sandesh Rao, Senior Director, RAC Development

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, and timing of any features or
functionality described for Oracle’s products remains at the sole discretion of Oracle.

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Restricted 2
Common Questions
• How do I contact you?
– LinkedIn – Sandesh Rao
– Email – Sandesh.rao@oracle.com
• Where do I get your presentation?
– http://otnyathra.in/downloads/

• Which books on RAC do I read for basics or internals?
– Oracle Database 11g Oracle Real Application Clusters Handbook, 2nd Edition (Oracle Press)
– Pro Oracle Database 11g RAC on Linux (Expert's Voice in Oracle), 2nd Edition
– Oracle 10g RAC Grid, Services and Clustering, 1st Edition
– Pro Oracle Database 10g RAC on Linux: Installation, Administration, and Performance (Expert's Voice in Oracle), 1st Corrected ed., Corr. 3rd printing Edition
– Oracle Database 12c Release 2 Oracle Real Application Clusters Handbook: Concepts, Administration, Tuning & Troubleshooting (Oracle Press), 1st Edition
– Documentation – Autonomous Computing Guide, RAC Admin guide
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal/Restricted/Highly Restricted 3
Agenda
• Architectural Overview
• Troubleshooting Scenarios
• Proactive and Reactive tools
• Q&A

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Grid Infrastructure Overview
• Grid Infrastructure is the name for the combination of
– Oracle Cluster Ready Services (CRS)
– Oracle Automatic Storage Management (ASM)
• The Grid Home contains the software for both products
• CRS can also be Standalone for ASM and/or Oracle Restart
• CRS can run by itself or in combination with other vendor clusterware
• Grid Home and RDBMS home must be installed in different locations
– The installer locks the Grid Home path by setting root permissions.

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Grid Infrastructure Overview
• CRS requires shared Oracle Cluster Registry (OCR) and Voting files
– Must be in ASM or CFS
– OCR is backed up automatically every 4 hours to GIHOME/cdata
– Backups are kept for 4, 8 and 12 hours, 1 day, and 1 week
– Restored with ocrconfig
– Voting file is backed up into the OCR at each change
– Voting file is restored with crsctl
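A minimal sketch of inspecting and restoring these backups (the backup file name and diskgroup name are illustrative):

[root]# ocrconfig -showbackup              # list automatic and manual OCR backups
[root]# crsctl stop crs -f                 # stop the stack on all nodes before an OCR restore
[root]# ocrconfig -restore <backup_file>   # restore the OCR from a chosen backup
[GRID]> crsctl query css votedisk          # list the current voting files
[GRID]> crsctl replace votedisk +DATA      # re-create voting files in a diskgroup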

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Grid Infrastructure Overview
• For the network, CRS requires
– One or more high-speed, low-latency, redundant private networks for inter-node communication
– Think of the interconnect as a memory backplane for the cluster
– Should be a separate physical network or a managed converged network
– VLANs are supported
– Used for:
• Clusterware messaging
• RDBMS messaging and block transfer
• ASM messaging
• HANFS for block traffic

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Grid Infrastructure Overview
• Only one set of Clusterware daemons can run on each node
• The CRS stack is spawned from Oracle HA Services Daemon (ohasd)
• On Unix ohasd runs out of inittab with respawn
• A node can be evicted when deemed unhealthy
– May require a reboot, but at minimum a CRS stack restart (rebootless restart)
– IPMI integration, or diskmon in the case of Exadata
• CRS provides Cluster Time Synchronization services
– Always runs, but in observer mode if ntpd is configured

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Grid Infrastructure Processes
Agents change everything
• Multi-threaded Daemons
• Manage multiple resources and types
• Implement entry points for multiple resource types
– start, stop, check, clean, fail
• oraagent, orarootagent, application agent, script agent, cssdagent
• Single process started from init on Unix (ohasd)
• Diagram below shows all core resources

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Grid Infrastructure Processes
[Diagram: the clusterware startup sequence, from Level 0 (init) through Level 1 (OHASD agents), Levels 2a/2b (daemons spawned by the OHASD agents), Level 3 (CRSD agents) and Levels 4a/4b (CRSD-managed resources); the individual levels are detailed on the following slides.]

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Grid Infrastructure Processes
Init Scripts
• /etc/init.d/ohasd ( location O/S dependent )
– RC script with “start” and “stop” actions
– Initiates Oracle Clusterware autostart
– Control file coordinates with CRSCTL
• /etc/init.d/init.ohasd ( location O/S dependent )
– OHASD Framework Script runs from init/upstart
– Control file coordinates with CRSCTL
– Named pipe syncs with OHASD
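On a typical Linux system using inittab, the respawn entry looks something like the following (exact runlevels and redirection vary by platform and version; shown only as an illustration):

h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null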

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Grid Infrastructure Processes

• Level 1: OHASD Spawns:


– cssdagent - Agent responsible for spawning CSSD
– orarootagent - Agent responsible for managing all root owned ohasd resources
– oraagent - Agent responsible for managing all oracle owned ohasd resources
– cssdmonitor - Monitors CSSD and node health (along with the cssdagent)
• Level 2a: OHASD rootagent spawns:
– CRSD - Primary daemon responsible for managing cluster resources.
– CTSSD - Cluster Time Synchronization Services Daemon
– Diskmon ( Exadata )
– ACFS (ASM Cluster File System) Drivers

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Grid Infrastructure Processes
• Level 2b: OHASD oraagent spawns:
– MDNSD – Multicast DNS daemon
– GIPCD – Grid IPC Daemon
– GPNPD – Grid Plug and Play Daemon
– EVMD – Event Monitor Daemon
– ASM – ASM instance started here as may be required by CRSD
• Level 3: CRSD spawns:
– orarootagent - Agent responsible for managing all root owned crsd resources.
– oraagent - Agent responsible for managing all nonroot owned crsd resources.
• One is spawned for every user that has CRS resources to manage.

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Grid Infrastructure Processes
Startup Sequence
• Level 4: CRSD oraagent spawns:
– ASM Resource - ASM Instance(s) resource (proxy resource)
– Diskgroup - Used for managing/monitoring ASM diskgroups.
– DB Resource - Used for monitoring and managing the DB and instances
– SCAN Listener - Listener for single client access name, listening on SCAN VIP
– Listener - Node listener listening on the Node VIP
– Services - Used for monitoring and managing services
– ONS - Oracle Notification Service
– eONS - Enhanced Oracle Notification Service ( pre 11.2.0.2 )
– GSD - For 9i backward compatibility
– GNS (optional) - Grid Naming Service - Performs name resolution

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Oracle Flex Cluster
The standard going forward (every Oracle 12c Rel. 2 cluster is a Flex Cluster by default)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 15


Under the Hood: Any New Install Ends Up in a Flex Cluster

[GRID]> crsctl get cluster name


CRS-6724: Current cluster name is 'SolarCluster'

[GRID]> crsctl get cluster class


CRS-41008: Cluster class is 'Standalone Cluster'

[GRID]> crsctl get cluster type


CRS-6539: The cluster type is 'flex'.

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 16


Cluster Domain
[Diagram: a Cluster Domain with four Member Clusters connected over the private network, SAN and NAS to a Domain Services Cluster (DSC):
1. Database Member Cluster – uses local ASM
2. Application Member Cluster – GI only
3. Database Member Cluster – uses the IO & ASM Service of the DSC
4. Database Member Cluster – uses the ASM Service of the DSC
The Domain Services Cluster provides the Management Repository (GIMR) Service, Trace File Analyzer (TFA) Service, Rapid Home Provisioning (RHP) Service, Additional ASM Services and an optional IO Service, all backed by shared ASM.]

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 17


ASM Flex Diskgroups 1
Database-oriented storage management for more flexibility and availability
• Pre-12.2 diskgroup organization – shared resource management: files belonging to different databases (DB1: File 1, DB3: File 3, DB3: File 1, DB2: File 1, DB2: File 2, DB1: File 3, DB3: File 2, DB2: File 3, DB2: File 4, DB1: File 2, ...) are intermixed within the diskgroup
• 12.2 Flex Diskgroup organization – database-oriented resource management: each database (DB1, DB2, DB3) has its files grouped into its own File Group within the Flex Diskgroup

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Oracle Confidential – Internal/Restricted/Highly Restricted 18
ASM Flex Diskgroups 2
Database-oriented storage management for more flexibility and availability
• In the 12.2 Flex Diskgroup organization, Flex Diskgroups enable
– Quota Management – limit the space databases can allocate in a diskgroup and thereby improve the customers' ability to consolidate databases into fewer diskgroups
– Redundancy Change – utilize lower redundancy for less critical databases
– Shadow Copies ("split mirrors") – easily and dynamically create database clones for test/dev or production databases
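As an illustration of quota management, a sketch of the kind of SQL*Plus statements involved (connected as SYSASM) is shown below; the quota group, file group and quota size are made up, and the exact clauses should be verified against the Oracle ASM Administrator's Guide for 12.2:

SQL> ALTER DISKGROUP data ADD QUOTAGROUP qg_db1 SET quota = 100g;    -- hypothetical quota group
SQL> ALTER DISKGROUP data MODIFY FILEGROUP db1 SET quota_group = qg_db1;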

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Oracle Confidential – Internal/Restricted/Highly Restricted 19
Node Weighting in Oracle RAC 12c Release 2
Idea: everything else being equal, let the majority of work survive
• Node Weighting is a new feature that considers the workload hosted in the cluster during fencing
• The idea is to let the majority of work survive, if everything else is equal
– Example: in a 2-node cluster, the node hosting the majority of services (at fencing time) is meant to survive

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 20


CSS_CRITICAL – Fencing with Manual Override
• CSS_CRITICAL can be set on various levels / components to mark them as "critical", so that the cluster will try to preserve them in case of a failure
– srvctl modify database -help | grep critical
  -css_critical {YES | NO}  Define whether the database or service is CSS critical
– crsctl set server css_critical {YES|NO} + server restart
• CSS_CRITICAL will be honored if no other technical reason prohibits survival of the node which has at least one critical component at the time of failure
– Otherwise the node is evicted despite the setting and the workload fails over (the "conflict" case)
• A fallback scheme is applied if CSS_CRITICAL settings do not lead to an actionable outcome
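A minimal sketch of marking a database and a server as critical (the database name is illustrative; the crsctl setting is made on the node to be protected and requires a stack restart there):

$ srvctl modify database -db proddb -css_critical YES
[root]# crsctl set server css_critical yes
[root]# crsctl stop crs ; crsctl start crs      # restart the stack on that node for the change to take effect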

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 21


Proven Features – Even More Beneficial on the DSC
• The DSC is the ideal hosting environment for Rapid Home Provisioning (RHP), enabling software fleet management
• Autonomous Health Framework (powered by machine learning) works more efficiently for you on the DSC, as continuous analysis is taken off the production cluster
• Oracle ASM 12c Rel. 2 based storage consolidation is best performed on the DSC, as it enables numerous additional features and use cases

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 22


Node Eviction Basics

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Basic RAC Cluster with Oracle Clusterware
[Diagram: three nodes, each running CSSD, connected by the public LAN, the private LAN / interconnect, and the SAN network to the shared Voting Disk.]

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


What does CSSD do?
CSSD monitors and evicts nodes
• Monitors nodes using 2 communication channels:
– Private Interconnect – Network Heartbeat
– Voting Disk based communication – Disk Heartbeat
• Evicts nodes (forcibly removes them from the cluster) depending on heartbeat feedback (failures)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Network Heartbeat
Interconnect basics
• Each node in the cluster is “pinged” every second
• Nodes must respond in css_misscount time (defaults to 30 secs.)
– Reducing the css_misscount time is generally not supported

• Network heartbeat failures will lead to node evictions


– CSSD-log: [date / time] [CSSD][1111902528]clssnmPollingThread: node
mynodename (5) at 75% heartbeat fatal, removal in 6.770 seconds

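The current misscount value can be confirmed on a running cluster; the output will look something like the following (message text varies by version):

[GRID]> crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.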

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Disk Heartbeat
Voting Disk basics – Part 1
• Each node in the cluster “pings” (r/w) the Voting Disk(s) every second
• Nodes must receive a response in (long / short) diskTimeout time
– I/O errors indicate clear accessibility problems → the timeout is irrelevant

• Disk heartbeat failures will lead to node evictions


– CSSD-log: … [CSSD] [1115699552] >TRACE: clssnmReadDskHeartbeat:
node(2) is down. rcfg(1) wrtcnt(1) LATS(63436584) Disk lastSeqNo(1)

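The corresponding disk timeout can likewise be queried (illustrative output; message text varies by version):

[GRID]> crsctl get css disktimeout
CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.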

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Voting Disk Structure
Voting Disk basics – Part 2
• Voting Disks contain dynamic and static data:
– Dynamic data: disk heartbeat logging
– Static data: information about the nodes in the cluster

• With 11.2.0.1 Voting Disks got an “identity”:


– E.g. Voting Disk serial number: [GRID]> crsctl query css votedisk
1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]

• Voting Disks must therefore not be copied using “dd” or “cp” anymore

[Diagram: voting disk layout – static node information plus dynamic disk heartbeat logging.]

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


“Simple Majority Rule”
Voting Disk basics – Part 3
• Oracle supports redundant Voting Disks for disk failure protection
• "Simple Majority Rule" applies:
– Each node must "see" the simple majority of configured Voting Disks
at all times in order not to be evicted (to remain in the cluster)
– Majority = trunc(n/2 + 1), with n = number of voting disks configured and n >= 1
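For example, with n = 3 voting disks a node must see trunc(3/2 + 1) = 2 of them, and with n = 5 it must see trunc(5/2 + 1) = 3; a node that loses access to the majority is evicted.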

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Insertion 1: "Simple Majority Rule"…
… in extended Oracle clusters
• http://www.oracle.com/goto/rac
– "Using standard NFS to support a third voting file for extended cluster configurations" (PDF)
• The same principles apply
• The Voting Disks are just geographically dispersed

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Insertion 2: Voting Disk in Oracle ASM
The way of storing Voting Disks doesn’t change its use
[GRID]> crsctl query css votedisk
1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
2. 2 aafab95f9ef84f03bf6e26adc2a3b0e8 (/dev/sde5) [DATA]
3. 2 28dd4128f4a74f73bf8653dabd88c737 (/dev/sdd6) [DATA]
Located 3 voting disk(s).

• Oracle ASM auto creates 1/3/5 Voting Files


– Based on Ext/Normal/High redundancy
and on Failure Groups in the Disk Group
– By default there is one failure group per disk
– ASM will enforce the required number of disks
– New failure group type: Quorum Failgroup
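A sketch of creating a diskgroup with a quorum failgroup for an extended cluster (disk paths and names are illustrative; verify the exact clause against the ASM documentation for your version):

SQL> CREATE DISKGROUP data NORMAL REDUNDANCY
  2    FAILGROUP site1 DISK '/dev/sitea_disk1'
  3    FAILGROUP site2 DISK '/dev/siteb_disk1'
  4    QUORUM FAILGROUP quorum_fg DISK '/voting_nfs/vote_disk1'
  5    ATTRIBUTE 'compatible.asm' = '12.2.0.1.0';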

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Why are nodes evicted?
→ To prevent worse things from happening…
• Evicting (fencing) nodes is a preventive measure (a good thing)!
• Nodes are evicted to prevent the consequences of a split brain:
– Shared data must not be written by independently operating nodes
– The easiest way to prevent this is to forcibly remove a node from the cluster

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


How are nodes evicted?
EXAMPLE: Heartbeat failure
• The network heartbeat between nodes has failed
– It is determined which nodes can still talk to each other
– A "kill request" is sent to the node(s) to be evicted
• Using all (remaining) communication channels → Voting Disk(s)
• A node is requested to "kill itself"; executor: typically CSSD

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Re-bootless Node
Fencing (restart)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Re-bootless Node Fencing (restart)
Fence the cluster, do not reboot the node
• Until Oracle Clusterware 11.2.0.2, fencing meant “re-boot”
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less, because:
– Re-boots affect applications that might run on a node but are not protected
– Customer requirement: prevent a reboot, just stop the cluster – implemented...


Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
– Instead of fast re-booting the node, a graceful shutdown of the stack is attempted

• Then IO issuing processes are killed; it is made sure that no IO process remains
– For a RAC DB mainly the log writer and the database writer are of concern


Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Re-bootless Node Fencing (restart)
EXCEPTIONS
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less, unless…:
– IF the check for a successful kill of the IO processes fails → reboot
– IF CSSD gets killed during the operation → reboot
– IF cssdmonitor is not scheduled → reboot
– IF the stack cannot be shut down in "short_disk_timeout" seconds → reboot


Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Troubleshooting Scenarios
Cluster Startup Problem Triage (11.2+)
[Flowchart: Cluster Startup Diagnostic Flow]
• Startup sequence check – is the stack starting at all?
– ps -ef | grep init.ohasd and ps -ef | grep ohasd.bin
– If not running: check crsctl config crs (autostart) and review ohasd.log; if the cause is obvious, engage the sysadmin team, otherwise run TFA Collector and engage Oracle Support and the sysadmin team
• If ohasd is running, verify the rest of the stack:
– ps -ef | grep for cssdagent, ocssd.bin, orarootagent, ctssd.bin, crsd.bin, cssdmonitor, oraagent, ora.asm, gpnpd.bin, mdnsd.bin and evmd.bin
– crsctl check crs and crsctl check cluster
• If daemons are missing or unhealthy:
– Review ohasd.log, the agent logs and the process logs, check OLR permissions, and compare against a reference system
– If the cause is obvious, engage the sysadmin team; otherwise run TFA Collector and engage Oracle Support and the sysadmin team
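A compact sketch of the first-pass checks from the flow above (run as the Grid Infrastructure owner or root):

$ ps -ef | egrep 'init.ohasd|ohasd.bin|ocssd.bin|crsd.bin|evmd.bin' | grep -v grep
$ crsctl check crs
$ crsctl check cluster -all
$ crsctl config crs            # confirm whether autostart is enabled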

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Troubleshooting Scenarios
Cluster Startup Problem Triage

• Multicast Domain Name Service Daemon (mDNS(d))


– Used by Grid Plug and Play to locate profiles in the cluster, as well as by GNS to perform
name resolution. The mDNS process runs as a background process on Linux, UNIX, and Windows.
– Uses multicast for cache updates on service advertisement arrival/departure.
– Advertises/serves on all found node interfaces.
– Log is GI_HOME/log/<node>/mdnsd/mdnsd.log

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Troubleshooting Scenarios
Cluster Startup Problem Triage
<?xml version="1.0" encoding="UTF-8"?>
<gpnp:GPnP-Profile Version="1.0" xmlns="http://www.grid-pnp.org/2005/11/gpnp-profile" xmlns:gpnp="http://www.grid-
pnp.org/2005/11/gpnp-profile" xmlns:orcl="http://www.oracle.com/gpnp/2005/11/gpnp-profile"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.grid-pnp.org/2005/11/gpnp-profile
gpnp-profile.xsd" ProfileSequence="6" ClusterUId="b1eec1fcdd355f2bbf7910ce9cc4a228" ClusterName="staij-cluster"
PALocation="">
<gpnp:Network-Profile><gpnp:HostNetwork id="gen" HostName="*">
<gpnp:Network id="net1" IP="192.168.1.0" Adapter="eth0" Use="public"/>
<gpnp:Network id="net2" IP="192.168.2.0" Adapter="eth1" Use="cluster_interconnect"/>
</gpnp:HostNetwork></gpnp:Network-Profile>
<orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
<orcl:ASM-Profile id="asm" DiscoveryString="" SPFile="+SYSTEM/staij-cluster/asmparameterfile/registry.253.693925293"/>
<ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#"><ds:SignedInfo><ds:CanonicalizationMethod
Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/><ds:SignatureMethod Algorithm="http://www.w3.org/2001/10/xml-
exc-c14n#"> <InclusiveNamespaces xmlns="http://www.w3.org/2001/10/xml-exc-c14n#" PrefixList="gpnp orcl
xsi"/></ds:Transform></ds:Transforms><ds:DigestMethod
Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/><ds:DigestValue>x1H9LWjyNyMn6BsOykHhMvxnP8U=</ds:DigestValue
></ds:Reference></ds:SignedInfo><ds:SignatureValue>N+20jG4=</ds:SignatureValue></ds:Signature>
</gpnp:GPnP-Profile>
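The profile shown above can be dumped from a running node with the gpnptool utility in the Grid home; a minimal example, run as the GI owner:

$ gpnptool get          # prints the current GPnP profile XML
$ gpnptool lfind        # checks whether a local gpnpd daemon is running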

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Troubleshooting Scenarios
Cluster Startup Problem Triage
• cssd agent and monitor
– Same functionality in both agent and monitor
– Functionality of several pre-11.2 daemons consolidated in both
• OPROCD – system hang
• OMON – oracle clusterware monitor
• VMON – vendor clusterware monitor
– Run realtime with locked down memory, like CSSD
– Provides enhanced stability and diagnosability
– Logs are
• GI_HOME/log/<node>/agent/oracssdagent_root/oracssdagent_root.log
• GI_HOME/log/<node>/agent/oracssdmonitor_root/oracssdmonitor_root.log
• 12c – ORACLE_BASE/diag/node/agent/..

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Troubleshooting Scenarios
Node Evictions
[Flowchart: Node Eviction Diagnostic Flow]
• Start from the cluster alert log, ocssd.log and the system log; relevant MOS notes: 1050693.1, 1534949.1, 1531223.1, 1546004.1, 1328466.1
• Was the node actually fenced?
– If not, look for resource starvation (free memory, CPU load, node response); if found, engage the appropriate team, otherwise run TFA Collector and engage the sysadmin team
• Missing network heartbeat (NHB)?
– If the cause is obvious, engage the networking team; otherwise run TFA Collector
• Missing disk heartbeat (DHB)?
– See MOS notes 1549428.1 and 1466639.1; if the cause is obvious, engage the storage team
• Still not resolved? Run TFA Collector and engage Oracle Support

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Missing Network Heartbeat (1)
• ocssd.log from node 1
• ===> Sending network heartbeats to other nodes. Normally, this message is output once every 5 messages (seconds)
• 2016-08-13 17:00:20.023: [ CSSD][4096109472]clssnmSendingThread: sending status msg to all nodes
• 2016-08-13 17:00:20.023: [ CSSD][4096109472]clssnmSendingThread: sent 5 status msgs to all nodes
• ===> The network heartbeat is not received from node 2 (drrac2) for 15 consecutive seconds.
• ===> This means that 15 network heartbeats are missing and is the first warning (50% threshold).
• 2016-08-13 17:00:22.818: [ CSSD][4106599328]clssnmPollingThread: node drrac2 (2) at 50% heartbeat fatal, removal in 14.520
seconds
• 2016-08-13 17:00:22.818: [ CSSD][4106599328]clssnmPollingThread: node drrac2 (2) is impending reconfig, flag 132108,
misstime 15480
• ===> continuing to send the network heartbeats and log messages once every 5 messages
• 2016-08-13 17:00:25.023: [ CSSD][4096109472]clssnmSendingThread: sending status msg to all nodes
• 2016-08-13 17:00:25.023: [ CSSD][4096109472]clssnmSendingThread: sent 5 status msgs to all nodes
• ===> 75% threshold of missing network heartbeat is reached. This is second warning.
• 2016-08-13 17:00:29.833: [ CSSD][4106599328]clssnmPollingThread: node drrac2 (2) at 75% heartbeat fatal, removal in 7.500
seconds

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Missing Network Heartbeat (2)
• ===> continuing to send the network heartbeats and log messages once every 5 messages
• 2016-08-13 17:00:30.023: [ CSSD][4096109472]clssnmSendingThread: sending status msg to all nodes
• 2016-08-13 17:00:30.023: [ CSSD][4096109472]clssnmSendingThread: sent 5 status msgs to all nodes
• ===> continuing to send the network heartbeats, but the message is logged after 4 messages
• 2016-08-13 17:00:34.021: [ CSSD][4096109472]clssnmSendingThread: sending status msg to all nodes
• 2016-08-13 17:00:34.021: [ CSSD][4096109472]clssnmSendingThread: sent 4 status msgs to all nodes
• ===> Last warning shows that 90% threshold of the missing network heartbeat is reached.
• ===> The eviction will occur in 2.49 seconds.
• 2016-08-13 17:00:34.841: [ CSSD][4106599328]clssnmPollingThread: node drrac2 (2) at 90% heartbeat fatal, removal in
2.490 seconds, seedhbimpd 1
• ===> Eviction of node 2 (drrac2) started
• 2016-08-13 17:00:37.337: [ CSSD][4106599328]clssnmPollingThread: Removal started for node drrac2 (2), flags 0x2040c,
state 3, wt4c 0
• ===> This shows that the node 2 is actively updating the voting disks
• 2016-08-13 17:00:37.340: [ CSSD][4085619616]clssnmCheckSplit: Node 2, drrac2, is alive, DHB (1281744040, 1396854)
more than disk timeout of 27000 after the last NHB (1281744011, 1367154)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Missing Network Heartbeat (3)
• ===> Evicting node 2 (drrac2)
• 2016-08-13 17:00:37.340: [ CSSD][4085619616](:CSSNM00007:)clssnmrEvict: Evicting node 2, drrac2, from the cluster in
incarnation 169934272, node birth incarnation 169934271, death incarnation 169934272, stateflags 0x24000

• ===> Reconfigured the cluster without node 2


• 2016-08-13 17:01:07.705: [ CSSD][4043389856]clssgmCMReconfig: reconfiguration successful, incarnation 169934272 with 1
nodes, local node number 1, master node number 1

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Missing Network Heartbeat (4)
• ocssd.log from node 2:
• ===> Logging the message to indicate 5 network heartbeats are sent to other nodes
• 2016-08-13 17:00:26.009: [ CSSD][4062550944]clssnmSendingThread: sending status msg to all nodes
• 2016-08-13 17:00:26.009: [ CSSD][4062550944]clssnmSendingThread: sent 5 status msgs to all nodes
• ===> First warning of reaching 50% threshold of missing network heartbeats
• 2016-08-13 17:00:26.213: [ CSSD][4073040800]clssnmPollingThread: node drrac1 (1) at 50% heartbeat fatal, removal in 14.540
seconds
• 2016-08-13 17:00:26.213: [ CSSD][4073040800]clssnmPollingThread: node drrac1 (1) is impending reconfig, flag 394254,
misstime 15460
• ===> Logging the message to indicate 5 network heartbeats are sent to other nodes
• 2016-08-13 17:00:31.009: [ CSSD][4062550944]clssnmSendingThread: sending status msg to all nodes
• 2016-08-13 17:00:31.009: [ CSSD][4062550944]clssnmSendingThread: sent 5 status msgs to all nodes
• ===> Second warning of reaching 75% threshold of missing network heartbeats
• 2016-08-13 17:00:33.227: [ CSSD][4073040800]clssnmPollingThread: node drrac1 (1) at 75% heartbeat fatal, removal in 7.470
seconds

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Missing Network Heartbeat (5)
• ===> Logging the message to indicate 4 network heartbeats are sent
• 2016-08-13 17:00:35.009: [ CSSD][4062550944]clssnmSendingThread: sending status msg to all nodes
• 2016-08-13 17:00:35.009: [ CSSD][4062550944]clssnmSendingThread: sent 4 status msgs to all nodes
• ===> Third warning of reaching 90% threshold of missing network heartbeats
• 2016-08-13 17:00:38.236: [ CSSD][4073040800]clssnmPollingThread: node drrac1 (1) at 90% heartbeat fatal, removal in
2.460 seconds, seedhbimpd 1
• ===> Logging the message to indicate 5 network heartbeats are sent to other nodes
• 2016-08-13 17:00:40.008: [ CSSD][4062550944]clssnmSendingThread: sending status msg to all nodes
• 2016-08-13 17:00:40.009: [ CSSD][4062550944]clssnmSendingThread: sent 5 status msgs to all nodes
• ===> Eviction started for node 1 (drrac1)
• 2016-08-13 17:00:40.702: [ CSSD][4073040800]clssnmPollingThread: Removal started for node drrac1 (1), flags 0x6040e,
state 3, wt4c 0
• ===> Node 1 is actively updating the voting disk, so this is a split brain condition
• 2016-08-13 17:00:40.706: [ CSSD][4052061088]clssnmCheckSplit: Node 1, drrac1, is alive, DHB (1281744036, 1243744)
more than disk timeout of 27000 after the last NHB (1281744007, 1214144)
• 2016-08-13 17:00:40.706: [ CSSD][4052061088]clssnmCheckDskInfo: My cohort: 2
• 2016-08-13 17:00:40.707: [ CSSD][4052061088]clssnmCheckDskInfo: Surviving cohort: 1
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Missing Network Heartbeat (6)
• ===> Node 2 is aborting itself to resolve the split brain and ensure the cluster integrity
• 2016-08-13 17:00:40.707: [ CSSD][4052061088](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
Cohort of 1 nodes with leader 2, drrac2, is smaller than cohort of 1 nodes led by node 1, drrac1, based on map type 2
• 2016-08-13 17:00:40.707: [ CSSD][4052061088]###################################
• 2016-08-13 17:00:40.707: [ CSSD][4052061088]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
• 2016-08-13 17:00:40.707: [ CSSD][4052061088]###################################

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Missing Network Heartbeat (7)
• Observations
1. Both nodes reported missing heartbeats at the same time
2. Both nodes sent heartbeats to other nodes all the time
3. Node 2 aborted itself to resolve split brain

• Conclusion
1. This is likely a network problem; engage the network team
2. Check OSWatcher output (netstat and traceroute)
– Configure the private.net file; it is not configured by default
3. Check CHM
4. Check the system log
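For reference, a private.net entry on Linux is typically just a traceroute to each remote node's private hostname, for example (hostnames are illustrative; see MOS 301137.1 for the exact template for your platform):

traceroute -r -F node1-priv
traceroute -r -F node2-priv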

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Voting Disk Access Problem (1)
ocssd.log:
===> The first error indicating that it could not read voting disk -- first message to indicate a
problem accessing the voting disk
2016-08-13 18:31:19.787: [ SKGFD][4131736480]ERROR: -9(Error 27072, OS Error (Linux
Error: 5: Input/output error
Additional information: 4
Additional information: 721425
Additional information: -1)
)
2016-08-13 18:31:19.787: [ CSSD][4131736480](:CSSNM00060:)clssnmvReadBlocks: read
failed at offset 529 of /dev/sdb8
2016-08-13 18:31:19.802: [ CSSD][4131736480]clssnmvDiskAvailabilityChange: voting file
/dev/sdb8 now offline
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Voting Disk Access Problem (2)
====> The error message that shows a problem accessing the voting disk repeats once every 4 seconds
2016-08-13 18:31:23.782: [ CSSD][150477728]clssnmvDiskOpen: Opening /dev/sdb8
2016-08-13 18:31:23.782: [ SKGFD][150477728]Handle 0xf43fc6c8 from lib :UFS:: for disk :/dev/sdb8:
2016-08-13 18:31:23.782: [ CLSF][150477728]Opened hdl:0xf4365708 for dev:/dev/sdb8:
2016-08-13 18:31:23.787: [ SKGFD][150477728]ERROR: -9(Error 27072, OS Error (Linux Error: 5:
Input/output error
Additional information: 4
Additional information: 720913
Additional information: -1)
)
2016-08-13 18:31:23.787: [ CSSD][150477728](:CSSNM00060:)clssnmvReadBlocks: read failed at offset 17
of /dev/sdb8

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Voting Disk Access Problem (3)
====> The last error that shows a problem accessing the voting disk.
====> Note that the last message is 200 seconds after the first message
====> because the long disktimeout is 200 seconds
2016-08-13 18:34:37.423: [ CSSD][150477728]clssnmvDiskOpen: Opening /dev/sdb8
2016-08-13 18:34:37.423: [ CLSF][150477728]Opened hdl:0xf4336530 for dev:/dev/sdb8:
2016-08-13 18:34:37.429: [ SKGFD][150477728]ERROR: -9(Error 27072, OS Error (Linux Error: 5:
Input/output error
Additional information: 4
Additional information: 720913
Additional information: -1)
)
2016-08-13 18:34:37.429: [ CSSD][150477728](:CSSNM00060:)clssnmvReadBlocks: read failed at offset 17
of /dev/sdb8

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Voting Disk Access Problem (4)
====> This message shows that ocssd.bin tried accessing the voting disk for 200 seconds
2016-08-13 18:34:38.205: [ CSSD][4110736288](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for
200880 ms for voting file /dev/sdb8)
====> ocssd.bin aborts itself with an error message that the majority of voting disks are not available. In
this case, there was only one voting disk, but if three voting disks were available, as long as two
voting disks are accessible, ocssd.bin will not abort.
2016-08-13 18:34:38.206: [ CSSD][4110736288](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 1
configured voting disks available, need 1
2016-08-13 18:34:38.206: [ CSSD][4110736288]###################################
2016-08-13 18:34:38.206: [ CSSD][4110736288]clssscExit: CSSD aborting from thread
clssnmvDiskPingMonitorThread
2016-08-13 18:34:38.206: [ CSSD][4110736288]###################################
• Conclusion
The voting disk was not available, engage storage team

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Troubleshooting Scenarios
Node Eviction Triage

• Time synchronisation issue


• Cluster Time Synchronisation Services daemon
– Provides time management in a cluster for Oracle.
• Observer mode when Vendor time synchronisation s/w is found
– Logs time difference to the CRS alert log
• Active mode when no Vendor time sync s/w is found
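To see which mode CTSS is currently running in (illustrative output; message text varies by version):

[GRID]> crsctl check ctss
CRS-4700: The Cluster Time Synchronization Service is in Observer mode.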

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Troubleshooting Scenarios
Node Eviction Triage
• Cluster Ready Services Daemon
– The CRSD daemon is primarily responsible for maintaining the availability of application
resources, such as database instances. CRSD is responsible for starting and stopping these
resources, relocating them when required to another node in the event of failure, and
maintaining the resource profiles in the OCR (Oracle Cluster Registry). In addition, CRSD is
responsible for overseeing the caching of the OCR for faster access, and also backing up the
OCR.
– Log file is GI_HOME/log/<node>/crsd/crsd.log
• Rotation policy 10-50M
• Retention policy 10 logs
• Dynamic in 12.1 and can be changed

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Troubleshooting Scenarios
Node Eviction Triage
• CRSD oraagent
– CRSD’s oraagent manages
• all database, instance, service and diskgroup resources
• node listeners
• SCAN listeners, and ONS
– If the Grid Infrastructure owner is different from the RDBMS home owner then you would
have 2 oraagents each running as one of the installation owners. The database, and service
resources would be managed by the RDBMS home owner and other resources by the Grid
Infrastructure home owner.
– Log file is
• GI_HOME/log/<node>/agent/crsd/oraagent_<user>/oraagent_<user>.log

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Troubleshooting Scenarios
Node Eviction Triage

• CRSD orarootagent
– CRSD's rootagent manages
• GNS and its VIP
• Node VIP
• SCAN VIP
• network resources
– Log file is
• GI_HOME/log/<node>/agent/crsd/orarootagent_root/orarootagent_root.log

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Troubleshooting Scenarios
Node Eviction Triage
• Agent return codes
– The check entry point must return one of the following return codes:
• ONLINE
• UNPLANNED_OFFLINE
– Target=online; the resource may be recovered or failed over
• PLANNED_OFFLINE
• UNKNOWN
– Cannot determine the state; if previously ONLINE or PARTIAL, keep monitoring
• PARTIAL
– Some of a resource's services are available, e.g. the instance is up but not open
• FAILED
– Requires the clean action

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Troubleshooting Scenarios
Automatic Diagnostic Repository (ADR)

• Important logs and traces
• 11.2 – Databases only use ADR
– Grid Infrastructure files are in $GI_HOME/log/<node_name>/<component_name>
• $GI_HOME/log/myHost/cssd
• $GI_HOME/log/myHost/alertmyHost.log
• 12c – Grid Infrastructure and Database both use ADR
– Different locations for Grid Infrastructure and Databases
– Grid Infrastructure
• alert.log, cssd.log, crsd.log, etc.
– Databases
• alert.log, background process traces, foreground process traces
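ADR contents can be browsed with the adrci utility shipped with the Database and GI homes; a minimal example (the ADR home path is illustrative and depends on your installation):

$ adrci
adrci> show homes                          # list all ADR homes under the ADR base
adrci> set home diag/rdbms/mydb/MYDB1
adrci> show alert -tail 50                 # view the last 50 lines of the alert log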

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Oracle's Database and Clusterware Tools
• What if issues were detected before they had an impact?
• What if you were notified with a specific diagnosis and corrective actions?
• What if resource bottlenecks threatening SLAs were identified early?
• What if bottlenecks could be automatically relieved just in time?
• What if database hangs and node reboots could be eliminated?
[Tools shown: Hang Manager, Trace File Analyzer, Quality of Service Management, Cluster Health Advisor, EXAchk, Memory Guard, ORAchk, Cluster Health Monitor, Cluster Verification Utility]

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Restricted 60
Oracle 12c ORAchk & EXAchk
Maintains compliance with best practices and alerts on vulnerability to known issues

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 61


Why Oracle ORAchk & EXAchk
• Automatic proactive warning of problems before they impact you
• Health checks for the most impactful reoccurring problems
• Runs in your environment with no need to send anything to Oracle
• Get scheduled health reports sent to you in email
• Findings can be integrated into other tools of choice
• Common framework: EXAchk for Engineered Systems, ORAchk for non-Engineered Systems
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 62


Oracle Stack Coverage
• Oracle Engineered Systems: Oracle Database Appliance, Oracle Exadata Database Machine, Oracle SuperCluster / MiniCluster, Oracle Private Cloud Appliance, Oracle Big Data Appliance, Oracle Exalogic Elastic Cloud, Oracle Exalytics In-Memory Machine, Oracle Zero Data Loss Recovery Appliance, Oracle ASR
• Oracle Systems: Oracle Solaris, Solaris Cluster, OVN, cross stack checks
• Oracle Database: Standalone Database, Grid Infrastructure & RAC, Maximum Availability Architecture (MAA) Scorecard, Upgrade Readiness Validation, Golden Gate, Oracle Restart
• Oracle Enterprise Manager Cloud Control: Repository, Agent, OMS
• Oracle Middleware: Application Continuity, Oracle Identity and Access Management Suite (Oracle IAM)
• Oracle E-Business Suite: Oracle Payables, Oracle Workflow, Oracle Purchasing, Oracle Order Management, Oracle Process Manufacturing, Oracle Receivables, Oracle Fixed Assets, Oracle HCM, Oracle CRM, Oracle Project Billing
• Oracle Siebel: database best practices
• Oracle PeopleSoft: database best practices
• Oracle SAP: Exadata best practices

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 63


Profiles (EXAchk)
• Profiles provide a logical grouping of checks which are about similar topics
• Run only checks in a specific profile: ./exachk –profile <profile>
• Run everything except checks in a specific profile: ./exachk –excludeprofile <profile>

Profile – Description
– asm – ASM checks
– avdf – Audit Vault configuration checks
– clusterware – Oracle Clusterware checks
– control_VM – Checks only for Control VM (ec1-vm, ovmm, db, pc1, pc2); no cross-node checks
– corroborate – Exadata checks needing further review by the user to determine pass or fail
– dba – DBA checks
– ebs – Oracle E-Business Suite checks
– eci_healthchecks – Enterprise Cloud Infrastructure health checks
– ecs_healthchecks – Enterprise Cloud System health checks
– goldengate – Oracle GoldenGate checks
– hardware – Hardware-specific checks for Oracle Engineered Systems
– maa – Maximum Availability Architecture checks
– ovn – Oracle Virtual Networking
– platinum – Platinum certification checks
– preinstall – Pre-installation checks
– prepatch – Checks to execute before patching
– security – Security checks
– solaris_cluster – Solaris Cluster checks
– storage – Oracle Storage Server checks
– switch – InfiniBand switch checks
– sysadmin – Sysadmin checks
– user_defined_checks – Run user defined checks from user_defined_checks.xml

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 64


Profiles (ORAchk)
• Profiles provide a logical grouping of checks which are about similar topics
• Run only checks in a specific profile: ./orachk –profile <profile>
• Run everything except checks in a specific profile: ./orachk –excludeprofile <profile>

Profile – Description
– asm – ASM checks
– bi_middleware – Oracle Business Intelligence checks
– clusterware – Oracle Clusterware checks
– dba – DBA checks
– ebs – Oracle E-Business Suite checks
– emagent – Cloud Control agent checks
– emoms – Cloud Control management server checks
– em – Cloud Control checks
– goldengate – Oracle GoldenGate checks
– hardware – Hardware-specific checks for Oracle Engineered Systems
– oam – Oracle Access Manager checks
– oim – Oracle Identity Manager checks
– oud – Oracle Unified Directory server checks
– ovn – Oracle Virtual Networking
– peoplesoft – PeopleSoft best practices
– preinstall – Pre-installation checks
– prepatch – Checks to execute before patching
– security – Security checks
– siebel – Siebel checks
– solaris_cluster – Solaris Cluster checks
– storage – Oracle Storage Server checks
– switch – InfiniBand switch checks
– sysadmin – Sysadmin checks
– user_defined_checks – Run user defined checks from user_defined_checks.xml

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 65


Keep Track of Changes to the Attributes of Important Files
• Track changes to the attributes of important files with –fileattr
– Looks at all files & directories within the Grid Infrastructure and Database homes by default
– The list of monitored directories and their contents can be configured to your specific requirements
– Use ./orachk –fileattr start to take the first snapshot

$ ./orachk -fileattr start


CRS stack is running and CRS_HOME is not set. Do you want to set CRS_HOME to
/u01/app/11.2.0.4/grid?[y/n][y]
Checking ssh user equivalency settings on all nodes in cluster
Node mysrv22 is configured for ssh user equivalency for oradb user
Node mysrv23 is configured for ssh user equivalency for oradb user
List of directories(recursive) for checking file attributes:
/u01/app/oradb/product/11.2.0/dbhome_11203
/u01/app/oradb/product/11.2.0/dbhome_11204
orachk has taken snapshot of file attributes for above directories at:
/orahome/oradb/orachk/orachk_mysrv21_20170504_041214

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 66


Keep Track of Changes to the Attributes of Important Files
• Compare current attributes against the first snapshot using ./orachk –fileattr check
• Results of the snapshot comparison will also be shown in the HTML report output

$ ./orachk -fileattr check -includedir "/root/myapp/config" -excludediscovery
CRS stack is running and CRS_HOME is not set. Do you want to set CRS_HOME to
/u01/app/12.2.0/grid?[y/n][y]
Checking for prompts on myserver18 for oragrid user...
Checking ssh user equivalency settings on all nodes in cluster
Node myserver17 is configured for ssh user equivalency for root user
List of directories(recursive) for checking file attributes:
/root/myapp/config
Checking file attribute changes...
.
"/root/myapp/config/myappconfig.xml" is different:
Baseline : 0644 oracle root /root/myapp/config/myappconfig.xml
Current : 0644 root root /root/myapp/config/myappconfig.xml
…etc

Note:
• Use the same arguments with check that you used with start
• Standard health checks will proceed after the attribute checking
• File attribute changes will also show in the HTML report output

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 67


Improve performance of SQL queries
• Many new checks focus on known issues in the 12c Optimizer as well as SQL Plan Management (all contained in the dba profile: -profile dba)
• These checks target problems such as:
– Wrong results returned
– High memory & CPU usage
– Errors such as ORA-00600 or ORA-07445
– Issues with cursor usage
– Other general SQL plan management problems

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Oracle Database Security Assessment Tool (DBSAT) included
• DBSAT analyzes database configurations and security policies
• Uncovers security risks
• Improves the security posture of Oracle Databases
• All results are included within the report output under the check: "Validate database security configuration using database security assessment tool"

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Upgrade to Database 12.2 with confidence
• New checks to help when upgrading the database to 12.2
• Both pre- and post-upgrade verification to prevent problems related to:
– OS configuration
– Grid Infrastructure & Database patch prerequisites
– Database configuration
– Cluster configuration
• Pre upgrade: -u –o pre
• Post upgrade: -u –o post

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Oracle Health Checks Collection Manager
• New Collection Manager app built on an APEX 5 theme
• Tabs replaced with drop-down menus for easier navigation
• ORAchk & EXAchk continue to ship with the APEX 4 app too
• No more new functionality in the APEX 4 app; all new features will go into the APEX 5 app

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 71


Enterprise Manager Integration
• Check results integrated into the EM compliance framework via a plugin
• Related checks grouped into compliance standards
• View results in native EM compliance dashboards
• View targets checked, violations & average score
• View break down by target
• Drill down into a compliance standard to see individual check results

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 72


Provision
• Use the Enterprise Manager provisioning feature and select ORAchk/EXAchk
• This launches the provisioning wizard; choose the system type

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 73


View Results by Compliance Standard
• Filter by "Exachk%"
• Drill into the applicable standard and view individual checks & target status
• Click individual checks for recommendation details

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 74


JSON Output to Integrate with Kibana, Elastic Search etc.
• The JSON provides many tags to allow dashboard filtering based on facts such as:
– Engineered System type
– Engineered System version
– Hardware type
– Node name
– OS version
– Rack identifier
– Rack type
– Database version
– And more...
• Kibana can be used to view health check compliance across your data center
• Results can also be filtered based on any combination of exposed system attributes

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 75


JSON Output to Integrate with Kibana, Elastic Search etc

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 76


Oracle 12c Trace File Analyzer
Speeds issue diagnosis, triage and resolution

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal 77
Why TFA?
• Provides one interface for all diagnostic needs
• Collects data across the cluster and consolidates it in one place
• Collects all relevant diagnostic data at the time of the problem
• Reduces the time required to obtain diagnostic data, which saves your business money

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal 78
Supported Platforms and Versions
• All major Operating Systems are supported:
– Linux (OEL, RedHat, SUSE, Itanium & zLinux)
– Oracle Solaris (SPARC & x86-64)
– AIX
– HPUX (Itanium & PA-RISC)
– Windows
• All Oracle Database & Grid Infrastructure versions 10.2+ are supported
• You probably already have TFA installed, as it is included with:
– Oracle Grid Infrastructure 11.2.0.4+, 12.1.0.2+ and 12.2.0.1+
– Oracle Database 12.2.0.1+
• Updated quarterly via 1513912.1
• OS versions supported are the same as those supported by the Database
• Java Runtime Edition 1.8 required

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 79


Linux / Unix Installation
• Root / Daemon install:
1. Download from 1513912.1
2. Copy to one required machine and unzip
3. Run ./installTFA<platform>
Will: install on all nodes, auto discover relevant Oracle Software & Exadata Storage Servers, start monitoring for problems & perform auto collections
• Non-root / Non-Daemon install:
1. Download from 1513912.1
2. Copy to every required machine and unzip
3. Run ./installTFA<platform> -extractto <install_dir> -javahome <jre_home>
Will: only install on the current host, not do automatic collections, not collect from remote hosts, not collect files unreadable by the install user
• Recommended install location: /opt/oracle.tfa

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 80


Architecture
• A TFA daemon runs on each cluster node (or a single instance when no Grid Infrastructure is used)
• Command line communication is via the tfactl command
• The TFA daemons on all nodes coordinate:
– Script execution
– Collection of diagnostics (alerts & log files)
– Trimming of log contents
• Cluster-wide collection output is consolidated on one node (the initiator node, where the command originated)
• The daemon is only used when installed as root

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 81


Automatic Diagnostic Collections
Oracle Trace File Analyzer monitors Oracle Grid Infrastructure & Database(s); when a significant problem occurs it will:
1. Automatically detect the event
2. Collect & package the relevant diagnostics
3. Notify the relevant DBA and/or Sys Admin by email
4. The collection can then be uploaded to Oracle Support for further help

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 82


Command Interfaces
• Command line – specify all command options at the command line: tfactl <command>
• Shell – set and change context, then run commands from within the shell: tfactl, then e.g. tfactl > database MyDB and MyDB tfactl > oratop
• Menu – select menu navigation options, then choose the command you want to run: tfactl menu
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 83


Maintain
• Option 1
– Applying standard PSUs will automatically update TFA
– PSUs do not contain Support Tools Bundle updates
• Option 2 – to update with the latest TFA & Support Tools Bundle:
1. Download the latest version: 1513912.1
2. Repeat the same installation steps
• Upgrade to the latest version whenever possible to include bug fixes, new features & optimizations

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 84


View System & Cluster Summary
[Screenshot: tfactl summary output]
• Quick summary of the status of key components
• Choose an option to drill down further

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 85


Summary ASM Drill Down Example
[Screenshot: ASM overview – cluster-wide ASM summary and status]
• Problems found on myserver69
• Also a disk space warning on both servers

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 86


Summary ASM Drill Down Example
[Screenshot: ASM problems for myserver69]
• View node-wise status & drill into myserver69
• View the ASM status summary for myserver69
• View recent problems detected
• View component status

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 87


Investigate Logs & Look for Errors
• Analyze all important recent log entries: tfactl analyze –last 1d
• Search recent log entries: tfactl analyze -search "ora-00600" -last 8h

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 88


Perform Analysis Using the Included Tools
Tool – Description
– orachk or exachk – Provides health checks for the Oracle stack. Oracle Trace File Analyzer will install either Oracle EXAchk for Engineered Systems (see document 1070954.1 for more details) or Oracle ORAchk for all non-Engineered Systems (see document 1268927.2 for more details)
– oswatcher – Collects and archives OS metrics. Useful for instance or node evictions & performance issues. See document 301137.1 for more details
– procwatcher – Automates & captures database performance diagnostics and session level hang information. See document 459694.1 for more details
– oratop – Provides near real-time database monitoring. See document 1500864.1 for more details
– sqlt – Captures SQL trace data useful for tuning. See document 215187.1 for more details
– alertsummary – Provides a summary of events for one or more database or ASM alert files from all nodes
– ls – Lists all files TFA knows about for a given file name pattern across all nodes
– pstack – Generates process stacks for specified processes across all nodes
– triage – Summarizes oswatcher/exawatcher data
– grep – Searches alert or trace files with a given database and file name pattern for a search string
– summary – Provides a high level summary of the configuration
– vi – Opens alert or trace files matching a given database and file name pattern in the vi editor
– tail – Runs a tail on alert or trace files for a given database and file name pattern
– param – Shows all database and OS parameters that match a specified pattern
– dbglevel – Sets and unsets multiple CRS trace levels with one command
– history – Shows the shell history for the tfactl shell
– changes – Reports changes in the system setup over a given time period, including database parameters, OS parameters and patches applied
– calog – Reports major events from the Cluster Event log
– events – Reports warnings and errors seen in the logs
– managelogs – Shows disk space usage and purges ADR log and trace files
– ps – Finds processes
Not all tools are included in the Grid or Database install. Download from 1513912.1 to get the full collection of tools. Verify which tools you have installed: tfactl toolstatus

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 89


OS Watcher (Support Tools Bundle)

Collect & Archive OS Metrics


• Executes standard UNIX utilities (e.g. vmstat, iostat, ps,
etc) on regular intervals
• Built in Analyzer functionality to summarize, graph and
report upon collected metrics
• Output is Required for node reboot and performance
issues
• Simple to install, extremely lightweight
• Runs on ALL platforms (Except Windows)
• MOS Note: 301137.1 – OS Watcher Users Guide

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 90


Procwatcher (Support Tools Bundle)

Monitor & Examine Database Processes


• Single instance & RAC
• Generates session wait, lock and latch reports as well as call stacks
from any problem process(s)
• Ability to collect stack traces of specific processes using Oracle Tools
and OS Debuggers
• Typically reduces SR resolution for performance related issues
• Runs on ALL major UNIX Platforms
• MOS Note: 459694.1 – Procwatcher Install Guide

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 91


oratop (Support Tools Bundle)

Near Real-Time Database Monitoring


• Single instance & RAC
• Monitoring current database activities
• Database performance
• Identifying contentions and bottlenecks

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 92


Analyze
• Each tool can be run using tfactl in shell mode
• Start tfactl shell with tfactl

• Run a tool with the tool name tfactl > orachk

1. Where necessary set context with database <dbname> tfactl > database MyDB

2. Then run tool MyDB tfactl > oratop

3. Clear context with database MyDB tfactl > database

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 93


One Command SRDCs
• For certain types of problems Oracle Support will ask you to run a Service Request Data Collection (SRDC)
• Previously this would have involved:
– Reading many different support documents
– Collecting output from many different tasks
– Gathering lots of different diagnostics
– Packaging & uploading
• Now just run: tfactl diagcollect -srdc <srdc_type>

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 94


Faster & Easier SR Data Collection
tfactl diagcollect –srdc <srdc_type>

Type of Problem – SRDC Types – Collection Scope
• ORA errors – ora600, ora700, ora4030, ora4031, ora7445, ora27300, ora27301, ora27302 – Local only
• Other internal database errors – internalerror – Local only
• Database performance problems – dbperf – Cluster wide
• Database patching problems – dbpatchinstall (new), dbpatchconflict (new) – Local only
• Database install / upgrade problems – dbinstall (new), dbupgrade (new) – Local only
• Enterprise Manager tablespace usage metric problems – emtbsmetrics (new) – Local only (on EM Agent target)
• Enterprise Manager general metrics page or threshold problems – emdebugon (new), emdebugoff (new): Local only (on EM Agent target & OMS); emmetricalert (new): Local only (on EM Agent target & Repository DB); run all three SRDCs

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 95


One Command SRDCs – Examples of What's Collected
• ORA-4031 – tfactl diagcollect –srdc ora4031 collects:
1. IPS Package
2. Patch Listing
3. AWR report
4. Memory information
5. RDA
• Database Performance – tfactl diagcollect –srdc dbperf collects:
1. ADDM report
2. AWR for good and problem period
3. AWR Compare Period report
4. ASH report for good and problem period
5. OS Watcher
6. IPS Package (if errors during problem period)
7. ORAchk (performance related checks)
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 96


Manual Data Gathering vs One Command SRDC
Manual Data Gathering:
1. Generate ADDM reviewing Document 1680075.1
2. Identify “good” and “problem” periods and gather AWR reviewing Document 1903158.1
3. Generate AWR compare report (awrddrpt.sql) using “good” and “problem” periods
4. Generate ASH report for “good” and “problem” periods reviewing Document 1903145.1
5. Collect OSWatcher data reviewing Document 301137.1
6. Check alert.log if there are any errors during the “problem” period
7. Find any trace files generated during the “problem” period
8. Collate and upload all the above files/outputs to SR
TFA SRDC:
1. Run tfactl diagcollect –srdc dbperf
2. Upload resulting zip file to SR

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 97


One Command SRDC
Interactive Mode
tfactl diagcollect –srdc <srdc_type>

1. Enter defaults for event date/time and database name
2. Scans the system to identify the 10 most recent events (ORA-00600 example shown)
3. Once the relevant event is chosen, proceeds with diagnostic collection
4. All required files are identified
5. Trimmed where applicable
6. Packaged in a zip ready to provide to support
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 98


One Command SRDC
Silent Mode
tfactl diagcollect –srdc <srdc_type> -database <db> -for <time>

1. Parameters (date/time, DB name) are provided in the command
2. Does not prompt for any more information
3. All required files are identified
4. Trimmed where applicable
5. Packaged in a zip ready to provide to support
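A sketch of a fully silent collection (the exact value format accepted by -for can vary between TFA releases, so treat the timestamp below as an assumption):
   tfactl diagcollect -srdc ora4031 -database MyDB -for "2017-08-01"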

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 99


Default Collection
• Run a default diagnostic collection if there is not yet an SRDC for your problem: tfactl diagcollect
• Will trim & collect all important log files updated in the past 12 hours
• Collections are stored in the repository directory
• Change the diagcollect timeframe with –last <n>h|d
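For example, to limit the default collection to the last 8 hours:
   tfactl diagcollect -last 8h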

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 100
Automatic Database Log Purge
• TFA can automatically purge database logs
– OFF by default
– Except on a Domain Services Cluster (DSC), where it is ON by default

• Turn auto purging on or off: tfactl set manageLogsAutoPurge=<ON|OFF>

• Will remove logs older than 30 days


– configurable with: tfactl set manageLogsAutoPurgePolicyAge=<n><d|h>

• Purging runs every 60 minutes


– configurable with: tfactl set manageLogsAutoPurgeInterval=<minutes>
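For example, to enable purging, keep 60 days of logs and run the purge every 12 hours (the values shown are illustrative):
   tfactl set manageLogsAutoPurge=ON
   tfactl set manageLogsAutoPurgePolicyAge=60d
   tfactl set manageLogsAutoPurgeInterval=720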

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 101
Manual Database Log Purge
• TFA can manage ADR log and trace files
– Show disk space usage of individual diagnostic destinations
– Purge these file types based on diagnostic location and/or age:
• "ALERT", "INCIDENT", "TRACE", "CDUMP", "HM", "UTSCDMP", "LOG"
tfactl managelogs <options>

Option | Description
–show usage | Shows disk space usage per diagnostic directory for both GI and database logs
–show variation –older <n><m|h|d> | Shows the disk usage variation for the specified period per directory; use to determine per-directory disk space growth
–purge –older <n><m|h|d> | Removes all ADR files under the GI_BASE directory that are older than the time specified
–gi | Restricts the command to diagnostic files under the GI_BASE
–database [all | dbname] | Restricts the command to diagnostic files under the database directory; defaults to all, alternatively specify a database name
-dryrun | Use with –purge to estimate how many files would be affected and how much disk space would be freed by a potential purge command (may take a while for a large number of files)

Note: managelogs runs as the ADR home owner, so it will only be able to purge files this owner has permission to delete.

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 102
Manual Database Log Purge
tfactl managelogs –show usage
tfactl managelogs –show variation –older <n><m|h|d>
• Use –gi to show only Grid Infrastructure logs
• Use –database to show only database logs
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 103
Manual Database Log Purge
tfactl managelogs –purge –older <n><m|h|d> -dryrun
tfactl managelogs –purge –older <n><m|h|d>
• Use –dryrun for a “what if” preview before actually purging
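For example, to preview removing Grid Infrastructure diagnostic files older than 30 days:
   tfactl managelogs -purge -older 30d -gi -dryrun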
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 104
Disk Usage Snapshots
• TFA will track disk usage and record snapshots to:
– tfa/repository/suptools/<node>/managelogs/usage_snapshot/
• Snapshot happens every 60 minutes, configurable with:
tfactl set diskUsageMonInterval=<minutes>
• Disk usage monitoring is ON by default, configurable with:
tfactl set diskUsageMon=<ON|OFF>

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 105
Collect
• Trim & collect all important log files updated in the past 12 hours: tfactl diagcollect
• Collections are stored in the repository directory
• Change the diagcollect timeframe with –since <n>h|d
• Collect a problem-specific Service Request Data Collection (SRDC): tfactl diagcollect -srdc ora600
• For the list of SRDC collection types use: tfactl diagcollect -srdc help

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | 106
TFA dbglevel profiles
• Example
– tfactl dbglevel -set node_eviction
– would be used to enhance diagnostics when node evictions are being investigated, and performs the following operations internally:
• crsctl set log css "CSSD=4"
• crsctl set log css "CSSDNMC=4"
• crsctl set log css "CLSF=4"
• crsctl set log css "CSSDGMCC=4"
• crsctl set log css "CSSDGMPC=4"
• To revert to the original or default logging levels:
– $ tfactl dbglevel -unset node_eviction
– performs the following operations internally:
• crsctl set log css "CSSD=2"
• crsctl set log css "CSSDNMC=2"
• crsctl set log css "CLSF=0"
• crsctl set log css "CSSDGMCC=2"
• crsctl set log css "CSSDGMPC=2"
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Oracle Confidential – Internal/Restricted/Highly Restricted 107
Incident Based Collections with SRDC
Incident Type | Description
ora4030 | For ORA-04030 errors
ora4031 | For ORA-04031 errors
dbperf | For basic db performance problems
ora600 | For ORA-00600 errors
ora700 | For ORA-00700 errors
ora7445 | For ORA-07445 errors

• Use srdc <incident type>: tfactl srdc ora4030
• To specify sid use –sid <oracle sid>
• To specify database use –db <dbname>
• To specify incident date & time use –inc_date <YYYY-MM-DD> -inc_time <HH:MM:SS>
• To upload directly to the SR use –sr <SR#>

tfactl srdc ora4030 -sid orcl –db RDBMS121 \
  -inc_date 2016-06-15 -inc_time 02:48:23 \
  -sr 3-123456789

• For dbperf use these parameters to specify the good & bad performance periods to compare:
Parameter | Description
perf_base_sd | Start date for a good performance period
perf_base_st | Start time for a good performance period
perf_base_ed | End date for a good performance period
perf_base_et | End time for a good performance period
perf_comp_sd | Start date for a bad performance period
perf_comp_st | Start time for a bad performance period
perf_comp_ed | End date for a bad performance period
perf_comp_et | End time for a bad performance period

tfactl srdc dbperf –db RDBMS121 \
  –perf_base_sd 2016-06-15 –perf_base_st 01:30:00 \
  –perf_base_ed 2016-06-15 –perf_base_et 02:00:00 \
  –perf_comp_sd 2016-06-16 –perf_comp_st 09:30:00 \
  –perf_comp_ed 2016-06-16 –perf_comp_et 10:00:00

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Oracle Confidential – Internal 108
Generates Diagnostic Metrics View of Cluster and Databases

Oracle 12c Cluster Health Monitor

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal/Restricted/Highly Restricted 109
Cluster Health Monitor (CHM)
Generates Diagnostic Metrics View of Cluster and Databases
• Always on - Enabled by default
• Provides detailed OS resource metrics
• Assists node eviction analysis
• Locally logs all process data
• User can define pinned processes
• Listens to CSS and GIPC events
• Categorizes processes by type
• Supports plug-in collectors (e.g. traceroute, netstat, ping, etc.)
• New CSV output for ease of analysis
(Architecture diagram: osysmond collects OS data on every node and feeds the ologgerd master, which stores it in the 12c Grid Infrastructure Management Repository, GIMR.)
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal/Restricted/Highly Restricted 110
Cluster Health Monitor (CHM)
Oclumon CLI or Full Integration with EM Cloud Control

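A minimal sketch of pulling CHM data from the command line with oclumon (the 15-minute window shown is an illustrative assumption):
   oclumon dumpnodeview -allnodes -last "00:15:00"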
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal/Restricted/Highly Restricted 111
Discovers Potential Cluster & DB Problems - Notifies with Corrective Actions

Oracle 12c Cluster Health Advisor

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal/Restricted/Highly Restricted 112
CHA has detected a service degradation due to higher than expected I/O latencies.

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Restricted 114
Cluster Health Advisor
CHA/DB Health: I/O problem
CHA has detected a service degradation due to higher than expected I/O latencies.

Problem: The degradation is caused by a higher than expected utilization of shared storage devices for this database. There is no evidence of a significant increase in I/O demand on the local node.
Confidence: 95.17%
Action: Validate whether there is an increase in I/O demand on nodes other than the local node and find I/O-intensive SQL. Add more disks to the disk group or move the database to faster disks.

(Screenshot shows instances proddb_1 and proddb_2.)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Restricted 115
Cluster Health Advisor Daemon
Dependencies on the Grid Infrastructure Management Repository (GIMR)
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal/Restricted/Highly Restricted 116
Command Line Tool - chactl

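For example, chactl status reports whether the cluster and databases are currently being monitored (a minimal sketch; the monitor and query commands are shown on the following slides):
   chactl status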
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal/Restricted/Highly Restricted 117
Cluster Health Advisor
• Will only monitor the cluster initially
• Tell it to monitor the database:
chactl monitor database –db <db_name>

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal/Restricted/Highly Restricted 118
Cluster Health Advisor - diagnosis
• Query the cluster diagnosis for incidents and recommendations: chactl query diagnosis
• Query a specific database for diagnosis: chactl query diagnosis –db <db_name>
• Query the repository footprint: chactl query repository

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal/Restricted/Highly Restricted 119
Autonomously Preserves Database Availability and Performance

Oracle 12c Database Hang Manager

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal/Restricted/Highly Restricted 120
Debugging Live Systems: Hangs
• Parsing the system state dump can be very time consuming. To debug a hang more quickly you could query v$session.blocking_session:

select sess.sid sid, substr(proc.program,0,25) prog,
       substr(sw.event,0,15) event, sw.wait_time wt,
       sess.blocking_session bsid
from   v$process proc, v$session sess, v$session_wait sw
where  proc.addr = sess.paddr
and    sess.status = 'ACTIVE'
and    sw.sid = sess.sid
order  by prog;

SID   Program                   Event           WT  BSID
----- ------------------------- --------------- --- -----
2836  oracle@fstsun002 (S000)   enq: TM - conte 0   2979
2690  oracle@fstsun002 (S001)   enq: TM - conte 0   2979
2531  oracle@fstsun002 (S002)   enq: TM - conte 0   2979
2811  oracle@fstsun002 (S003)   enq: TM - conte 0   2979
2979  oracle@fstsun002 (TNS V1- enq: TM - conte 0   2853
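Reading the BSID chain in this output: sessions 2836, 2690, 2531 and 2811 are all blocked by session 2979, which is itself waiting on session 2853, so 2853 is the ultimate blocker to examine first.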

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Debugging Live Systems: Hangs
• sqlplus –prelim “/ as sysdba” is useful because it avoids creating a process state object (PSO), which requires various resources such as latches.
• Trying to acquire those resources may cause your debugger session to hang.
• Some dumps/commands may require a PSO; in that case execute those dumps/commands in an existing process that already has a PSO:

$ sqlplus -prelim "/ as sysdba"


SQL> oradebug setorapid 9
SQL> oradebug dump systemstate 3

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Oracle 12c Hang Manager
Autonomously Preserves Database Availability and Performance
• Always on - Enabled by default
• Reliably detects database hangs and deadlocks
• Autonomously resolves them
• Supports QoS Performance Classes, Ranks and Policies to maintain SLAs
• Logs all detections and resolutions
• New SQL interface to configure sensitivity (Normal/High) and trace file sizes
(Diagram: for each session the DIA0 process cycles through DETECT, ANALYZE, EVALUATE and VERIFY, consults the QoS policy, confirms the session is hung and then selects a victim.)
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Restricted 123
Oracle 12c Hang Manager
Full Resolution Dump Trace File and DB Alert Log Audit Reports

DIA0 trace file (full hang resolution dump):
Dump file …/diag/rdbms/hm6/hm62/incident/incdir_5753/hm62_dia0_12656_i5753.trc
Oracle Database 12c Enterprise Edition Release 12.2.0.0.0 - 64bit Beta
With the Partitioning, Real Application Clusters, OLAP, Advanced Analytics and Real Application Testing options
Build label: RDBMS_MAIN_LINUX.X64_151013
ORACLE_HOME: …/3775268204/oracle
System name: Linux
Node name: slc05kyr
Release: 2.6.39-400.211.1.el6uek.x86_64
Version: #1 SMP Fri Nov 15 13:39:16 PST 2013
Machine: x86_64
VM name: Xen Version: 3.4 (PVM)
Instance name: hm62
Redo thread mounted by this instance: 2
Oracle process number: 19
Unix process pid: 12656, image: oracle@slc05kyr (DIA0)
*** 2015-10-13T16:47:59.541509+17:00
*** SESSION ID:(96.41299) 2015-10-13T16:47:59.541519+17:00
*** CLIENT ID:() 2015-10-13T16:47:59.541529+17:00
*** SERVICE NAME:(SYS$BACKGROUND) 2015-10-13T16:47:59.541538+17:00
*** MODULE NAME:() 2015-10-13T16:47:59.541547+17:00
*** ACTION NAME:() 2015-10-13T16:47:59.541556+17:00
*** CLIENT DRIVER:() 2015-10-13T16:47:59.541565+17:00

DB alert log on the master DIA0 instance:
2015-10-13T16:47:59.435039+17:00
Errors in file /oracle/log/diag/rdbms/hm6/hm6/trace/hm6_dia0_12433.trc (incident=7353):
ORA-32701: Possible hangs up to hang ID=1 detected
Incident details in: …/diag/rdbms/hm6/hm6/incident/incdir_7353/hm6_dia0_12433_i7353.trc
2015-10-13T16:47:59.506775+17:00
DIA0 requesting termination of session sid:40 with serial # 43179 (ospid:13031) on instance 2
due to a GLOBAL, HIGH confidence hang with ID=1.
Hang Resolution Reason: Automatic hang resolution was performed to free a significant number of affected sessions.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=1.

In the alert log on the instance local to the session (instance 2 in this case), we see the following:
2015-10-13T16:47:59.538673+17:00
Errors in file …/diag/rdbms/hm6/hm62/trace/hm62_dia0_12656.trc (incident=5753):
ORA-32701: Possible hangs up to hang ID=1 detected
Incident details in: …/diag/rdbms/hm6/hm62/incident/incdir_5753/hm62_dia0_12656_i5753.trc
2015-10-13T16:48:04.222661+17:00
DIA0 terminating blocker (ospid: 13031 sid: 40 ser#: 43179) of hang with ID = 1
requested by master DIA0 process on instance 1
Hang Resolution Reason: Automatic hang resolution was performed to free a significant number of affected sessions.
by terminating session sid:40 with serial # 43179 (ospid:13031)
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Restricted 124
Deploys with Minimum Footprint and Maximum Manageability

Oracle Domain Services Cluster (DSC)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Restricted 125
Oracle 12c Domain Services Cluster (DSC)
Deploys with Minimum Footprint and Maximum Manageability
• Hosts framework as services
• Reduces local resource footprint
• Centralizes management
• Speeds deployment and patching
• Optional shared storage
• Supports multiple versions and platforms going forward
(Diagram: an Oracle Cluster Domain in which Application and Database Member Clusters are served by the Oracle Domain Services Cluster, which hosts the Management Repository Service, Trace File Analyzer Receiver, ORAchk Collection Service, Grid Names Service, Storage Services and Rapid Home Provisioning Service.)
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Restricted 126
Oracle Cluster Domain
(Diagram: Database and Application Member Clusters connect over the private network to the Oracle Domain Services Cluster. Member Clusters can use local ASM, the shared ASM Service, or the IO & ASM Services of the DSC. The DSC hosts the Management Repository (GIMR), Trace File Analyzer (TFA), Rapid Home Provisioning (RHP), ACFS Services, ASM Service and an optional IO Service on shared ASM backed by SAN or NAS storage.)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Restricted 127
Compare Database Status Before & After Upgrade
• Download dbupgdiag.sql from doc 556610.1
• Run both before and after the upgrade:
cd <location of the script>
$ sqlplus / as sysdba
sql> alter session set nls_language='American';
sql> @dbupgdiag.sql
sql> exit

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal/Restricted/Highly Restricted 129