Oracle, Inc.
www.oracle.com
Copyrights
Oracle Exadata Database Machine Monitoring
Copyright 2011, Oracle and/or its affiliates. All rights reserved.
This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are
protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use,
copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form,
or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is
prohibited.
The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors,
please report them to us in writing.
If this software or related documentation is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government,
the following notice is applicable:
U.S. GOVERNMENT RIGHTS Programs, software, databases, and related documentation and technical data delivered to U.S.
Government customers are "commercial computer software" or "commercial technical data" pursuant to the applicable Federal
Acquisition Regulation and agency-specific supplemental regulations. As such, the use, duplication, disclosure, modification, and
adaptation shall be subject to the restrictions and license terms set forth in the applicable Government contract, and, to the extent
applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software
License (December 2007). Oracle USA, Inc., 500 Oracle Parkway, Redwood City, CA 94065.
This software is developed for general use in a variety of information management applications. It is not developed or intended for use
in any inherently dangerous applications, including applications which may create a risk of personal injury. If you use this software in
dangerous applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to
ensure the safe use of this software. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this
software in dangerous applications.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.
This software and documentation may provide access to or information on content, products, and services from third parties. Oracle
Corporation and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party
content, products, and services. Oracle Corporation and its affiliates will not be responsible for any loss, costs, or damages incurred
due to your access to or use of third-party content, products, or services.
Overview
Oracle Exadata Database Machine consists of the following components:
Database servers
InfiniBand network
Avocent MergePoint Unity KVM (only on Oracle Exadata Database Machine X2-2)
Some of the components are separated into multiple monitoring categories. The following table summarizes how
each component is monitored, divided into multiple categories, if applicable. Details of monitoring each component
are provided in the sections below, including definitions of the abbreviations used in the table.
Component
Database servers
Category
Monitor Summary
Hardware
Operating system
Monitored by MS
Alert notification by EMGC
Monitored by MS
Alert notification by EMGC
Exadata software
Monitored by MS
Alert notification by EMGC
Hardware
Monitored by EM Agent
Alert notification by EMGC
Operating system
Monitored by EM Agent
Alert notification by EMGC
Monitored by EM Agent
Alert notification by EMGC
Monitored by EM Agent
Alert notification by EMGC
Oracle Database
Monitored by EM Agent
Alert notification by EMGC
Monitored by EM Agent
Alert notification by EMGC
InfiniBand fabric
Monitored directly
All
Monitored by EM Agent
Alert notification by EMGC
All
Monitored by EM Agent
Alert notification by EMGC
InfiniBand network
Component
Avocent MergePoint Unity KVM
Category
Monitor Summary
All
Monitored by EM Agent
Alert notification by EMGC
This document describes how each component of Oracle Exadata Database Machine is monitored, and explains
what should be monitored for each component. Refer to the EM Exadata Launchpad Deployment document for
instructions about configuring Oracle Enterprise Manager Grid Control (EMGC) to monitor components of Oracle
Exadata Database Machine.
Hardware
Operating system
Exadata Storage Servers are independent units that are each identified as separate targets in EMGC. However,
storage servers are grouped together under the system dashboard for an Oracle Exadata Database Machine so that they
are monitored together as a group.
Metrics
There are two different types of related metrics for Exadata Storage Server: storage server metrics and, when
monitored with Exadata Storage Server Plug-In, Enterprise Manager (EM) metrics. In most cases there is a one-to-one
mapping between the two. Exadata Storage Server Management Server (MS) collects, computes, and manages
storage server metrics. These storage server metrics are then gathered by Exadata Storage Server Plug-In from a
storage server and presented to the user in EMGC as EM metrics.
Alerts
All Exadata Storage Server alerts are delivered by the storage server to EMGC using Simple Network Management
Protocol (SNMP). The communication between the Exadata Storage Server and EMGC is done through the Exadata
Storage Server Plug-In.
There are two types of server alerts that come from Exadata Storage Server:
For Integrated Lights Out Manager (ILOM)-monitored hardware components, ILOM reports a failure
or threshold exceeded condition as an SNMP trap, which is received by MS. MS processes the trap, creates
an alert for the storage server, and delivers the alert via SNMP to EMGC through the Exadata Storage
Server Plug-In.
For MS-monitored hardware and software components, MS processes a failure or threshold exceeded
condition for these components, creates an alert, and delivers the alert via SNMP to EMGC through the
Exadata Storage Server Plug-In.
From an end-user perspective there is no difference between these two kinds of alerts. An alert message contains
corrective action to perform to resolve the alert. For example, the circled area in the following screen shot indicates
the action to take for this specific alert.
The fix for bug 8814019 should be installed to resolve incorrect text in the Action portion of Exadata alerts.
Alerts may also be delivered directly (that is, not through EMGC) by email, or by SNMP to other SNMP managers, when
the cell is configured accordingly with the CellCLI ALTER CELL command. See the Exadata Storage Server Software
User's Guide for details.
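As an illustration of direct delivery, the following minimal sketch assembles an ALTER CELL statement that could be
pasted into CellCLI. The attribute names (smtpServer, smtpFromAddr, smtpToAddr, snmpSubscriber, notificationPolicy,
notificationMethod) are recalled from the user's guide and should be verified there before use. The helper name
build_alter_cell and all host names and addresses are hypothetical placeholders, not values from this document.

```python
def build_alter_cell(smtp_server, from_addr, to_addr, snmp_host, snmp_port=162):
    """Return a CellCLI ALTER CELL statement enabling mail and SNMP alert delivery.

    Attribute names are assumptions to be checked against the
    Exadata Storage Server Software User's Guide.
    """
    attributes = [
        f"smtpServer='{smtp_server}'",
        f"smtpFromAddr='{from_addr}'",
        f"smtpToAddr='{to_addr}'",
        f"snmpSubscriber=((host={snmp_host},port={snmp_port}))",
        "notificationPolicy='critical,warning,clear'",
        "notificationMethod='mail,snmp'",
    ]
    return "ALTER CELL " + ", ".join(attributes)


if __name__ == "__main__":
    # Placeholder values only; substitute site-specific settings.
    print(build_alter_cell("mail.example.com", "cell01@example.com",
                           "dba-team@example.com", "snmpmgr.example.com"))
```

Generating the statement rather than typing it by hand keeps the settings identical across all storage servers.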
What to monitor: Hardware failure and sensor state
How to monitor: Monitored automatically by ILOM and MS
Comment: The hardware of an Exadata Storage Server is monitored collectively by Sun Integrated Lights Out Manager
(ILOM) and the Exadata Storage Server software component Management Server (MS). Together they provide full
hardware monitoring and alerting.
ILOM monitors availability and sensor state using preset thresholds for hardware components of Exadata Storage
Server, such as system motherboard, processors, memory, power supplies, fans, and network interface controllers.
MS monitors other hardware components directly, including the following: disk controller, hard disk drives, flash
accelerator cards, and InfiniBand host channel adapter (HCA).

What to monitor: Exadata Storage Server availability
How to monitor: Monitored automatically by Exadata Storage Server Management Plug-In
Undelivered alerts
in cell
ALERTHISTORY
Network errors
What to monitor: File system free space
How to monitor: Monitored automatically by MS
Comment: Do not set a metric threshold for file system free space in either CellCLI or Exadata Storage Server
Plug-In. MS monitors file system free space, generates critical alerts when free space becomes low, and takes
corrective action to free used space.
Metrics across
storage servers
CPU and memory
utilization
I/O latency and
throughput
Metrics within a
storage server
CPU and memory
utilization
I/O latency and
throughput
User defined
Hardware
Operating system
Oracle Database
Hardware
ILOM monitors availability and sensor state using preset thresholds for hardware components of database server,
such as system motherboard, processors, memory, power supplies, fans, and network interface controllers. The
availability and sensor state can be monitored using the Oracle Exadata Database Server Integrated Lights Out
Management (ILOM) plug-in. Please refer to System Monitoring Plug-in Installation Guide for Oracle Exadata
ILOM for instructions on installing and configuring the plug-in.
Metrics
There are no Exadata-specific thresholds to set for database server hardware monitoring. Failure conditions and
threshold settings for hardware sensor readings for the components monitored by ILOM are preset in ILOM and are
sufficient for the level of monitoring necessary for Exadata.
To view current sensor readings, log in to Enterprise Manager and navigate to All Metrics from the plug-in home
page.
To view current component status, including those that have a Faulted status, expand the Sensor Alert section under
All Metrics and review the metrics for each sensor. Components in Faulted status will generate an alert. Any active
alert will be visible on the home page of the target plug-in.
Alerts
To view the history of alerts that have been generated by ILOM, navigate to the target home page, expand the
Sensor Alert section under All Metrics, and review the metrics for each sensor. A history of each sensor state is
available for up to 31 days.
Alerts generated by database server ILOM and captured by the plug-in may be delivered via Enterprise Manager in
the same fashion as an alert generated by any Enterprise Manager target. Refer to Enterprise Manager Connector
Integrators Guide for details on configuring and forwarding alerts.
Operating System
The database server operating system, Oracle Enterprise Linux, appears in EMGC as a Host target.
Metrics
There are no Exadata-specific monitoring requirements. The metrics and default thresholds provided by EMGC are
sufficient for the level of monitoring necessary for Exadata, except for the cases noted below. Thresholds may be
changed or set in the Metric and Policy Settings page in EM to handle site-specific requirements. The list of metrics
and default thresholds in EMGC for a Host target is available in the Oracle Enterprise Manager Framework, Host,
and Services Metric Reference Manual.
Database disk I/O is done to Exadata Storage Servers through the InfiniBand network over iDB protocol. Therefore,
monitoring metric thresholds on database servers relating to disk I/O (e.g. CPU in I/O Wait (%), Disk Device Busy
(%), or Average Disk I/O Service Time (ms)) will provide no value for monitoring database performance. Disk I/O
must be monitored at Exadata Storage Servers through the Exadata Storage Server Plug-In.
Alerts
Database server operating system alerts are generated directly by EMGC based on default metric thresholds set for
the Host target.
How to monitor: ASM instance alert.log contains connect: ossnet: connection failed to server <ipaddr>, result=5
(login: sosstcpreadtry failed)
Comment: Check every 5 minutes.
Oracle Database
Oracle Databases are viewed in EMGC as Cluster Database and Database Instance targets.
Metrics
The metrics and default thresholds provided by EMGC are sufficient for the level of monitoring necessary for
Exadata, except for the cases noted below. Thresholds may be changed or set in the Metric and Policy Settings page
in EM to handle site-specific requirements. The list of metrics and default thresholds in EMGC for Cluster Database
and Database Instance targets is available in the Oracle Enterprise Manager Oracle Database and Database-Related
Metric Reference Manual.
Alerts
Database alerts are generated directly by EMGC based on default metric thresholds set for the Cluster Database and
Database Instance targets.
What to Monitor, How to Monitor
What to monitor: Connectivity issues to cells
How to monitor: Alert.log contains connect: ossnet: connection failed to server <ipaddr>, result=5 (login:
sosstcpreadtry failed)
Comment: Check every 5 minutes.
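The alert.log check can be sketched as a small scan for the message text quoted above. This is an illustrative
sketch, not an Oracle-supplied script: the function name failed_cells is hypothetical, and it assumes each message
appears on a single line of the log.

```python
import re

# Pattern built from the message text quoted in this document; <ipaddr> varies per cell.
OSSNET_FAILURE = re.compile(
    r"connect: ossnet: connection failed to server (\S+?), result=5"
)


def failed_cells(alert_log_lines):
    """Return the distinct server addresses named in ossnet connection failures."""
    addresses = []
    for line in alert_log_lines:
        match = OSSNET_FAILURE.search(line)
        if match and match.group(1) not in addresses:
            addresses.append(match.group(1))
    return addresses


if __name__ == "__main__":
    import sys
    # Usage: python check_alert_log.py /path/to/alert.log (path is site-specific)
    with open(sys.argv[1], errors="replace") as log:
        for address in failed_cells(log):
            print(f"connection failures reported for cell {address}")
```

A scheduler such as cron can run the scan every 5 minutes, ideally against only the portion of the log written
since the previous run so that old failures are not reported repeatedly.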
InfiniBand Switches
Monitoring the Sun Datacenter InfiniBand Switch 36 switches provided with Oracle Exadata Database Machine requires
checking for failed hardware components and sensors that have exceeded preset thresholds on the switch, and
checking for port errors that have occurred on switch ports. The table below shows what to monitor on InfiniBand
switches. These checks should be run approximately every 60 to 120 seconds.
What to monitor: Hardware failure and sensor state
How to monitor:
showunhealthy
checkpower
Comment: See Sun InfiniBand Switch software version 1.0.1 section below for details.

What to monitor: Switch port errors
How to monitor:
ibqueryerrors.pl -s RcvSwRelayErrors,RcvRemotePhysErrors,XmtDiscards,XmtConstraintErrors,RcvConstraintErrors,ExcBufOverrunErrors,VL15Dropped
Comment: SymbolErrors or RcvErrors or LinkIntegrityErrors should not increase without LinkDowned increasing.
A single invocation of this command will report on all switch ports on all switches.
Run this check from a database server or a switch.
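The port error rule above compares two successive readings of the counters. A minimal sketch of that comparison
follows. It assumes the counters for one port have already been collected into dictionaries keyed by the counter
names used in this document; parsing the command output is left out, and the function name suspicious_counters is
hypothetical.

```python
ERROR_COUNTERS = ("SymbolErrors", "RcvErrors", "LinkIntegrityErrors")


def suspicious_counters(previous, current):
    """Return the error counters that increased while LinkDowned did not.

    previous and current are dicts of counter name to value for the same port,
    taken at two consecutive check intervals.
    """
    link_downed_grew = current.get("LinkDowned", 0) > previous.get("LinkDowned", 0)
    if link_downed_grew:
        # Errors that accompany a link-down event are expected by the rule.
        return []
    return [name for name in ERROR_COUNTERS
            if current.get(name, 0) > previous.get(name, 0)]


if __name__ == "__main__":
    before = {"SymbolErrors": 10, "RcvErrors": 0, "LinkIntegrityErrors": 0, "LinkDowned": 2}
    after = {"SymbolErrors": 15, "RcvErrors": 0, "LinkIntegrityErrors": 0, "LinkDowned": 2}
    print(suspicious_counters(before, after))
```

Comparing deltas rather than absolute values avoids repeated alerts for errors that were already investigated.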
Sun InfiniBand Switch software version 1.0.1
The showunhealthy command should produce the output OK - No unhealthy sensors. If the output differs, run
the command env_test to get detailed status information for all switch sensors. Note that showunhealthy
will indicate OK - No unhealthy sensors even if a power supply is offline, so power supplies must be checked
separately. The checkpower command should indicate that both power supplies have status OK.
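The two switch health checks can be combined as in the following sketch. The expected showunhealthy text is taken
from this document. The checkpower output format (one line per power supply beginning with PSU and ending in OK) is
an assumption and should be confirmed on the switch. All function names are hypothetical.

```python
HEALTHY_SENSORS = "OK - No unhealthy sensors"


def sensors_healthy(showunhealthy_output):
    """True when showunhealthy printed exactly the healthy message."""
    return showunhealthy_output.strip() == HEALTHY_SENSORS


def power_supplies_ok(checkpower_output, expected_supplies=2):
    """True when every expected power supply line reports OK.

    Assumes one line per supply that starts with 'PSU' and ends with 'OK'.
    """
    psu_lines = [line.strip() for line in checkpower_output.splitlines()
                 if line.strip().startswith("PSU")]
    return (len(psu_lines) == expected_supplies
            and all(line.endswith("OK") for line in psu_lines))


def switch_hardware_ok(showunhealthy_output, checkpower_output):
    """Both checks must pass, because showunhealthy does not cover offline power supplies."""
    return sensors_healthy(showunhealthy_output) and power_supplies_ok(checkpower_output)
```

When switch_hardware_ok returns False because of the sensor check, env_test provides the per-sensor detail.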
Sun InfiniBand Switch software version 1.1.3
Sun IB switch software version 1.1.3 supports SNMP. To leverage SNMP support, an EM Sun InfiniBand
Switch Management Plug-In is planned so that switch status can be monitored within EMGC. If it is not possible to
monitor the switch using the plug-in, monitor the switch using the instructions for switch software version 1.0.1.
The plug-in will show two pieces of information on the home page: a ping response time graph and a response
metric that is determined by polling an aggregate of sensor information from the switch. If any sensor in the
aggregate asserts or shows an error, the overall response metric will change to down. At that point, run the
showunhealthy command as documented above in the version 1.0.1 section and contact Oracle Support for further
instructions. A screen shot of the plug-in is shown below.
InfiniBand Ports
InfiniBand port monitoring checks the health of InfiniBand network ports and interfaces on database servers and
Exadata Storage Servers.
Exadata Storage Servers
InfiniBand port monitoring on storage servers is performed by MS. No additional IB monitoring is required on
storage servers. If an IB port is not functioning correctly, MS creates an alert, and delivers the alert via SNMP to
EMGC through the Exadata Storage Server Plug-In. Alert messages contain corrective action to perform to resolve
the alert. Refer to the Monitoring Exadata Storage Servers section for additional details.
Database Servers
Database servers have no built-in IB monitoring. The table below shows what to monitor on each database server.
These checks should be run approximately every 60 to 120 seconds.
What to monitor: Port state, physical port state, and rate (ib0, ib1)
How to monitor:
/usr/sbin/ibstatus
Comment: Expected output for each port:
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)

What to monitor: Port errors (ib0, ib1)
How to monitor:
/sbin/ifconfig bond0
/sbin/ifconfig ib0
/sbin/ifconfig ib1
/bin/ping <remoteIBhost>
/usr/bin/rds-ping <remoteIBhost>
/usr/sbin/perfquery
Comment: SymbolErrors or RcvErrors or LinkIntegrityErrors should not increase without LinkDowned increasing.
All interfaces should be UP.
A database server should have connectivity via ping and rds-ping to all storage servers and database servers over
the IB network.
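The ibstatus check can be automated by comparing each port's fields with the expected values listed above. The
sketch below assumes the usual ibstatus layout, in which each port begins with a line starting "Infiniband device"
followed by "name: value" lines. The function names are hypothetical.

```python
# Expected values as stated in this document.
EXPECTED = {
    "state": "4: ACTIVE",
    "phys state": "5: LinkUp",
    "rate": "40 Gb/sec (4X QDR)",
}


def parse_ibstatus(text):
    """Map each port header line to a dict of its fields."""
    ports, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Infiniband device"):
            current = line.rstrip(":")
            ports[current] = {}
        elif current is not None and ":" in line:
            # Split on the first colon only, so values such as "4: ACTIVE" stay intact.
            key, _, value = line.partition(":")
            ports[current][key.strip()] = value.strip()
    return ports


def unhealthy_ports(text):
    """Return {port header: {field: actual value}} for fields that differ from EXPECTED."""
    problems = {}
    for port, fields in parse_ibstatus(text).items():
        bad = {name: fields.get(name)
               for name, wanted in EXPECTED.items() if fields.get(name) != wanted}
        if bad:
            problems[port] = bad
    return problems
```

An empty result from unhealthy_ports means every port reported the expected state, physical state, and rate.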
These checks can be automated using User-Defined Metrics (UDMs) in Grid Control. There are two UDM scripts
provided for database servers. These scripts can be downloaded from MOS note 1110675.1.
Enterprise Manager User-Defined Metric InfiniBand net CONNECTivity check (emudm_ibconnect.sh)
This UDM script monitors connectivity over the IB network to other database servers and storage servers. The
list of database servers is built from the ocrdump SYSTEM.crs.e2eport key. The list of cells is built from
cellip.ora. This script does not validate connectivity to additional devices on the IB network, such as media
servers. Note that ibhosts is not used because it requires root permissions. The approach used in this script is
preferred because its scope is limited to the servers relevant to Exadata, which may not be the whole IB network.
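The cell list portion of such a check can be sketched as follows. This is not the content of emudm_ibconnect.sh.
It assumes cellip.ora holds one cell="<address>" entry per line, possibly with several addresses separated by
semicolons, and it takes the reachability probe (for example a wrapper around ping or rds-ping) as a caller-supplied
function. The function names are hypothetical.

```python
import re

# Matches entries of the assumed form: cell="192.168.10.3"
CELL_LINE = re.compile(r'^\s*cell\s*=\s*"([^"]+)"')


def cell_addresses(cellip_text):
    """Return the cell addresses listed in the text of a cellip.ora file."""
    addresses = []
    for line in cellip_text.splitlines():
        match = CELL_LINE.match(line)
        if match:
            addresses.extend(part.strip() for part in match.group(1).split(";")
                             if part.strip())
    return addresses


def unreachable(hosts, probe):
    """Return the hosts for which probe(host) is false.

    probe is supplied by the caller, for example a function that runs ping.
    """
    return [host for host in hosts if not probe(host)]
```

Passing the probe in as a parameter keeps the list-building logic testable without touching the network.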
Create one UDM for InfiniBand network connectivity. Use the following values for the fields in the
Create User-Defined Metric screen:
Create one UDM for each network interface to be monitored. For example, if interfaces bond0, eth0,
and bond1 are configured, then create 3 UDMs. Use the following values for the fields in the Create
User-Defined Metric screen, replacing bond0 with the proper interface name:
Metric name: bond0 network interface status
Metric Type: String
Command Line: /u01/app/oracle/product/11.1.0/udm_scripts/emudm_netif_state.sh bond0
User Name: oracle
Password: welcome
Comparison Operator: MATCH
Warning: WARNING
Critical: CRITICAL
Schedule
Repeat every 5 minutes
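To show how a string-valued UDM like the one configured above can work, the sketch below derives a status from
ifconfig output and prints it in the em_result and em_message form that Enterprise Manager reads from OS-based
user-defined metric scripts. It is not the content of emudm_netif_state.sh. The rule that UP without RUNNING maps
to WARNING is an assumption made for illustration, and the parsing assumes the older net-tools layout in which the
flags share a line with "MTU:".

```python
import subprocess
import sys


def interface_flags(ifconfig_output):
    """Return the tokens on the flags line (the line containing 'MTU:')."""
    for line in ifconfig_output.splitlines():
        if "MTU:" in line:
            return set(line.split())
    return set()


def udm_lines(ifname, ifconfig_output):
    """Return the lines a UDM script would print for this interface."""
    flags = interface_flags(ifconfig_output)
    if "UP" not in flags:
        status, detail = "CRITICAL", f"{ifname} is not UP"
    elif "RUNNING" not in flags:
        # Assumed mapping for illustration only.
        status, detail = "WARNING", f"{ifname} is UP but not RUNNING"
    else:
        status, detail = "OK", f"{ifname} is UP and RUNNING"
    return [f"em_result={status}", f"em_message={detail}"]


if __name__ == "__main__":
    name = sys.argv[1]
    output = subprocess.run(["/sbin/ifconfig", name],
                            capture_output=True, text=True).stdout
    print("\n".join(udm_lines(name, output)))
```

Because the result is exactly OK, WARNING, or CRITICAL, it lines up with the MATCH comparison and the Warning and
Critical values entered in the Create User-Defined Metric screen.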
InfiniBand Fabric
The table below shows what to monitor for the InfiniBand fabric. Run these checks from only one database server.
What to monitor: Subnet manager (SM) master location
How to monitor: /usr/sbin/sminfo reports SM master is running on IB switch returned by /usr/sbin/ibswitches
END {
for (val in spine)
if (spine[val]==yes)
print val
}
What to monitor: Network topology
How to monitor: /opt/oracle.SupportTools/ibdiagtools/verify-topology

What to monitor: Link status
How to monitor: /usr/sbin/iblinkinfo.pl -Rl
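The subnet manager location check compares the GUID reported by sminfo with the switch GUIDs listed by ibswitches.
The following sketch assumes the usual output formats of those tools: sminfo prints "sm guid 0x..." and ibswitches
prints one "Switch : 0x..." line per switch. GUIDs are compared as integers because the two tools may pad the
hexadecimal value differently. The function names are hypothetical.

```python
import re


def sm_master_guid(sminfo_output):
    """Return the SM GUID from sminfo output as an integer, or None if absent."""
    match = re.search(r"sm guid (0x[0-9a-fA-F]+)", sminfo_output)
    return int(match.group(1), 16) if match else None


def switch_guids(ibswitches_output):
    """Return the set of switch GUIDs (as integers) listed by ibswitches."""
    found = re.findall(r"^Switch\s*:\s*(0x[0-9a-fA-F]+)", ibswitches_output,
                       re.MULTILINE)
    return {int(guid, 16) for guid in found}


def sm_master_on_switch(sminfo_output, ibswitches_output):
    """True when the SM master GUID belongs to one of the listed switches."""
    guid = sm_master_guid(sminfo_output)
    return guid is not None and guid in switch_guids(ibswitches_output)
```

A False result indicates that the SM master is running somewhere other than on a listed IB switch, for example on a
database server, or that the sminfo output could not be parsed.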
Metrics
Cisco Ethernet switch metrics are provided by the switch to the EM Cisco Switch Management Plug-In. Instructions
for installing the plug-in are contained in the System Monitoring Plug-in Installation Guide for Oracle Exadata
Cisco Switch documentation. The metrics and default thresholds provided by the plug-in are sufficient for the level
of monitoring necessary to ensure switch availability. Thresholds may be changed or set in the Metric and Policy
Settings page in EM to handle site-specific requirements.
Alerts
The Cisco Ethernet switch reports an alert condition as an SNMP trap, which is received by EMGC through the EM
Cisco Switch Management Plug-In.
Description: Environmental monitor notification that sends alerts for fan, shutdown, power supply, and temperature.
The PDU plug-in requires the PDU to be using version 1.0.2 or greater of the PDU firmware. Instructions for
determining the firmware are available in the firmware documentation. See Oracle Database Machine Monitoring
Best Practices (Doc ID 1110675.1) on My Oracle Support for information on the latest version of the firmware.
Once the plug-in is installed, it needs to be configured with the appropriate monitoring thresholds. Those thresholds
depend on several factors (the size of the Exadata configuration, the power and voltage type). Determine the values
for those factors and then update the PDU monitoring information in the plug-in metrics using the values and
procedures documented in the My Oracle Support note PDU Threshold Settings for Oracle Exadata Database
Machine (Doc ID 1299851.1).