Integrating TSAMP v3.2 with DB2 HADR v10.1
When you have completed this module, you will be able to perform these tasks:
Explain the general operation of the TSAMP product
Identify each of the products and components that make up the total solution and note the
integration points
Examine what the ‘db2haicu’ utility does from the perspective of TSAMP and the automation
policy
Learn how to control and service the combination of TSAMP and DB2 HADR
IBM Tivoli System Automation for Multiplatforms - Event Notification © 2015 IBM Corporation
IBM Tivoli System Automation for Multiplatforms
Introduction
DB2 provides a High Availability Disaster Recovery (HADR) feature that keeps a
primary and standby database synchronized, and allows an administrator to
switch control to a standby DB2 server
DB2 provides a set of scripts that allow TSAMP to control the DB2 resources.
Scripts that are used by TSAMP to start, stop, and monitor each of the DB2 resources –
this is the primary link between the two products
DB2 provides a utility called ‘db2haicu’ that is used to define the domain and
automation policy within TSAMP (that is, the initial setup):
The automation policy is the set of definitions of all resources, resource groups, and the
relationships between them all.
The resource definitions contain attributes that define which DB2 start, stop, and
monitor script (the automation scripts) to use for a particular resource.
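As a rough sketch of the contract these automation scripts follow (this is NOT IBM's shipped db2V10_monitor.ksh, which probes the instance with DB2 tooling such as db2gcf): TSAMP runs the MonitorCommand periodically and interprets its exit code as the resource's operational state, 1 meaning Online and 2 meaning Offline.

```shell
# Sketch of the monitor-script contract only -- not the shipped script.
# TSAMP runs MonitorCommand and reads the exit code as the OpState:
#   1 = Online, 2 = Offline.
monitor_db2_instance() {
  instance=$1
  # The real script probes the instance (e.g. via db2gcf); here a fake
  # flag stands in for that check so the sketch runs anywhere.
  if [ "${FAKE_DB2_UP:-yes}" = "yes" ]; then
    return 1    # Online
  else
    return 2    # Offline
  fi
}

rc=0; monitor_db2_instance db2inst1 || rc=$?
echo "monitor exit code: $rc"    # 1 while the fake instance is "up"
```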
Software Summary
[Diagram: a two-node cluster. The Primary Server and Standby Server each run a DB2 instance, TSAMP, and RSCT, joined in an RSCT cluster. The virtual IP (eth0:0) sits on the primary's public interface (eth0). HADR replication between the primary and standby databases, and the cluster heartbeat, run over a private network (eth1). On a database switch, the HADR roles swap and replication reverses direction.]
Tivoli System Automation for Multiplatforms (TSAMP), the "Automation" software, is made up of
three Resource Managers (shown in blue on the next two slides).
There are different Resource Managers, each responsible for managing or controlling resources that
belong to a particular set of resource classes. For example:
IBM.GblResRM starts, stops, and monitors resources of the classes IBM.Application and IBM.ServiceIP
The following diagram shows the mapping of three key Resource Managers to some Resource Classes
they manage, and then to some example Resources.
The next diagram introduces another Resource Manager (part of RSCT) called IBM.ConfigRM, which
handles configuration tasks across the nodes in the domain, including quorum and the TieBreaker; it
shows two Resource Classes IBM.ConfigRM manages and the Resources modelled by those classes.
TieBreaker
A TieBreaker situation occurs when a cluster is split into sub-clusters with
equal numbers of nodes.
RSCT must then determine which sub-cluster will have operational quorum in the tie
situation.
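The tie decision can be sketched as a simple majority-plus-TieBreaker test (an illustration only; RSCT's actual quorum algorithm is internal to the product):

```shell
# Illustrative sketch only -- RSCT's real quorum algorithm is internal.
# A sub-cluster has operational quorum if it holds a strict majority of
# the domain's nodes; with an even split (a tie) the TieBreaker decides.
has_quorum() {
  nodes_in_subcluster=$1
  nodes_in_domain=$2
  tiebreaker_won=$3   # 1 if this sub-cluster reserved the TieBreaker
  if [ $(( nodes_in_subcluster * 2 )) -gt "$nodes_in_domain" ]; then
    echo HAS_QUORUM
  elif [ $(( nodes_in_subcluster * 2 )) -eq "$nodes_in_domain" ] \
       && [ "$tiebreaker_won" -eq 1 ]; then
    echo HAS_QUORUM
  else
    echo PENDING_QUORUM
  fi
}

has_quorum 1 2 1   # 2-node domain split 1/1, TieBreaker won
has_quorum 1 2 0   # same split, TieBreaker lost
```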
Network TieBreaker
The goal is for each system to determine (via the RSCT infrastructure) whether it is
operational and should therefore take control (if it is not already the active node).
Use a pingable system independent of node1 and node2, for example node3 in our
example, although it would be just as easy and viable to use the gateway router for node1
and node2.
Without an active TieBreaker, automated failover/takeover will NEVER occur in the
event of a cluster split (node outage or network problem).
Create a /usr/sbin/cluster/netmon.cf file on each node. Add the IP addresses (one per line)
of 3–5 devices external to the domain that are pingable from each node. This is
important for a two-node cluster: it allows the cluster software (RSCT) to quickly identify the
source of a heartbeat problem between the nodes.
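For example (the addresses below are placeholders; substitute devices that really answer pings from every node, and note the demo writes to a temporary file rather than the real path):

```shell
# Demo of the netmon.cf format; the real file lives at
# /usr/sbin/cluster/netmon.cf on EACH node. The addresses below are
# placeholders -- use devices that are actually pingable from every node.
NETMON_CF=$(mktemp)              # substitute /usr/sbin/cluster/netmon.cf
cat > "$NETMON_CF" <<'EOF'
10.20.30.1
10.20.30.2
10.20.30.3
EOF
# Sanity check on a live node: every entry should answer a ping, e.g.
#   while read -r ip; do ping -c 1 -w 2 "$ip"; done < /usr/sbin/cluster/netmon.cf
echo "$(wc -l < "$NETMON_CF") entries written"
```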
[Diagram: node1 (eth0, 10.20.30.40) and node2 (eth0, 10.20.30.41) on the public network 10.20.30.0, with the network TieBreaker target being the gateway 10.20.30.1.]
[Diagram: the floating Virtual IP resource (db2ip_10_20_30_42-rs) and the fixed DB2 instance resources (db2_db2inst1_node1-rs and db2_db2inst1_node2-rs) each have a DependsOn relationship on the public-network Equivalency db2_public_network_0, which contains the eth0 NICs of both nodes. A private-network equivalency is optional.]
A DB2 HADR database called "HADRDB", whose primary and standby instances are
both named "db2inst1", maps to a TSAMP-managed resource called
"db2_db2inst1_db2inst1_HADRDB-rs".
Name = "db2_db2inst1_node1_0-rs"
StartCommand = "/usr/sbin/rsct/sapolicies/db2/db2V10_start.ksh db2inst1 0"
StopCommand = "/usr/sbin/rsct/sapolicies/db2/db2V10_stop.ksh db2inst1 0"
MonitorCommand = "/usr/sbin/rsct/sapolicies/db2/db2V10_monitor.ksh db2inst1 0"
MonitorCommandPeriod = 10
MonitorCommandTimeout = 120
StartCommandTimeout = 330
StopCommandTimeout = 140
UserName = "root"
RunCommandsSync = 1
ProtectionMode = 1
ActivePeerDomain = hadr_domain
NodeNameList = {"node1"}
OpState = 1
Name = "db2hadr_hadrdb-rs"
StartCommand = "/usr/sbin/rsct/sapolicies/db2/hadrV10_start.ksh db2inst1 db2inst1 HADRDB"
StopCommand = "/usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh db2inst1 db2inst1 HADRDB"
MonitorCommand = "/usr/sbin/rsct/sapolicies/db2/hadrV10_monitor.ksh db2inst1 db2inst1 HADRDB"
MonitorCommandPeriod = 21
MonitorCommandTimeout = 29
StartCommandTimeout = 330
StopCommandTimeout = 140
UserName = "root"
RunCommandsSync = 1
ProtectionMode = 1
ActivePeerDomain = hadr_domain
NodeNameList = {"node1","node2"}
OpState = 1
Name = "db2ip_10_20_30_42-rs"
IPAddress = "10.20.30.42"
NetMask = "255.255.255.0"
ProtectionMode = 1
ActivePeerDomain = "hadr_domain"
NodeNameList = {"node1","node2"}
OpState = 1
You can also list the states of the individual nodes and see output similar to the
following, from either server:
root@node1# lsrpnode
At this point you could list the TieBreaker resources and see the new network
TieBreaker:
root@node1# lsrsrc -Ab IBM.TieBreaker
The following command should show that your new network TieBreaker is currently
active:
root@node1# lsrsrc -c IBM.PeerNode OpQuorumTieBreaker
Next we add the en0 NIC from the other server, "node2", to the same equivalency
(db2_public_network_0):
See the following technote for more details on cluster heartbeat settings and
Communication Groups:
http://www.ibm.com/support/docview.wss?uid=swg21292274
Equivalency 1:
Name = db2_public_network_0
MemberClass = IBM.NetworkInterface
Resource:Node[Membership] = {en0:node1,en0:node2}
SelectString = ""
ActivePeerDomain = hadr_domain
Resource:Node[ValidSelectResources] = {en0:node1,en0:node2}
root@node2# lsrg
Resource Group names:
db2_db2inst1_node2_0-rg
Resource Group 1:
Name = db2_db2inst1_node2_0-rg
MemberLocation = Collocated
Priority = 0
AllowedNode = db2_db2inst1_node2_0-rg_group-equ
NominalState = Online
ActivePeerDomain = hadr_domain
OpState = Online
TopGroup = db2_db2inst1_node2_0-rg
TopGroupNominalState = Online
Note the AllowedNode attribute, which points to a PeerNode Equivalency that dictates
which server this resource is allowed to run on; see the next slide for output that
shows the details of this PeerNode Equivalency.
This restricts the DB2 instance resource on the previous slide to being brought
Online by TSAMP only on node2.
This is fairly obvious, given it is the resource that represents the standby database
partition.
This shows us that the DB2 instance is dependent on the operational state of the NICs
in the public network. If the NIC is Online, then TSAMP will be able to start the
associated DB2 instance.
Here we see that the DB2 instance for server “node2” is defined and within its own resource
group.
There is a PeerNode equivalency which dictates which server the above instance is allowed to
run on.
Finally, there is a Network Equivalency which contains the NICs for the public network … the
DB2 instance would have a dependency relationship on this equivalency.
Step 3. Log on to the primary server as the instance owner and issue:
db2inst1@node1> db2haicu
– The db2haicu tool will determine the current instance and apply all cluster
configuration steps based on it. It will also activate all databases for the instance
as it attempts to gather information.
– Next, db2haicu will determine whether a domain has already been created by searching
for an "Online" domain. Since we've already run db2haicu on the standby server,
an Online domain should already exist.
– You will then be asked to set the cluster manager.
Please note that once the dbm is configured with “Cluster manager” set to TSA, the DB2
engine expects to have a domain Online. You will have issues stopping and starting the
DB2 instance if no domain is Online.
Run 'db2haicu -disable' on each DB2 server if you want to break the connection between
DB2 and TSAMP. This is the only way to unset “Cluster manager” for DB2 v10.x
Then db2haicu adds resources that represent the DB2 instance (the primary DB2
instance) on the server where you’re currently running db2haicu:
Resource Group 1:
Name = db2_db2inst1_node1_0-rg
MemberLocation = Collocated
Priority = 0
AllowedNode = db2_db2inst1_node1_0-rg_group-equ
NominalState = Online
ActivePeerDomain = hadr_domain
OpState = Online
TopGroup = db2_db2inst1_node1_0-rg
TopGroupNominalState = Online
Note the AllowedNode attribute, which points to a PeerNode Equivalency that dictates which server
this resource is allowed to run on … similar to the other DB2 instance resource group, but with a
different server name.
Managed Relationship 2:
Class:Resource:Node[Source] = IBM.Application:db2_db2inst1_node1_0-rs
Class:Resource:Node[Target] = {IBM.Equivalency:db2_public_network_0}
Relationship = DependsOn
Conditional = NoCondition
Name = db2_db2inst1_node1_0-rs_DependsOn_db2_public_network_0-rel
ActivePeerDomain = hadr_domain
This shows us that the DB2 instances are both dependent on the operational state of the NICs in the
public network. If the NICs are Online, then TSAMP will be able to start the associated DB2
instances … it also means if either NIC goes offline for any reason, the local DB2 instance will be
stopped by TSAMP.
There’s now a resource and group for the DB2 instance on server “node1”.
There's now another PeerNode equivalency; it forces the "db2_db2inst1_node1_0-rg"
resource group to run on "node1" only.
Resource Group 1:
Name = db2_db2inst1_db2inst1_HADRDB-rg
MemberLocation = Collocated
Priority = 0
AllowedNode = db2_db2inst1_db2inst1_HADRDB-rg_group-equ
NominalState = Online
ActivePeerDomain = hadr_domain
OpState = Online
TopGroup = db2_db2inst1_db2inst1_HADRDB-rg
TopGroupNominalState = Online
Note the AllowedNode attribute. It points to a PeerNode Equivalency that contains the
servers “node1” and “node2” that dictates which servers the HADR database can
reside on. This is just like the setup for the two DB2 instance resource groups that
also use the AllowedNode attribute with other PeerNode Equivalencies, though in this
case the HADR resource is a floating resource with two servers as its choices.
Member Resource 1:
Class:Resource:Node[ManagedResource] = IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
Mandatory = True
MemberOf = db2_db2inst1_db2inst1_HADRDB-rg
SelectFromPolicy = ORDERED
ActivePeerDomain = hadr_domain
OpState = Online
Member Resource 2:
Class:Resource:Node[ManagedResource] = IBM.ServiceIP:db2ip_10_20_30_42-rs
Mandatory = True
MemberOf = db2_db2inst1_db2inst1_HADRDB-rg
SelectFromPolicy = ORDERED
ActivePeerDomain = hadr_domain
OpState = Online
This shows us that the HADRDB resource is dependent on the operational state of the NICs in the
public network.
If the NICs are Online, then TSAMP will be able to bring the associated HADR database resource
Online. It also means that if either NIC goes offline for any reason, the constituent of the HADR
database resource local to the offline NIC will be taken Offline by TSAMP (if it is currently
online); this could trigger a failover.
The resources and group for the HADR database and virtual IP address have been added, as
has a new PeerNode Equivalency containing servers “node1” and “node2”.
Changing the Nominal state of a resource group will instruct TSAMP to start/stop the
member resources using the scripts defined in the “StartCommand”, “StopCommand”
attributes for a resource of class “IBM.Application”, such is the case for a DB2 instance
resource.
To change the desired state of multiple resource groups with similar names (for example,
to start in parallel all DB2 resource groups where the instance name on each server
starts with "db2inst"), use the following syntax:
# chrg -o online -s "Name like 'db2_db2inst_%'"
Another example: to take the HADR resource group offline (which will also remove any
currently assigned IP alias, that is, the Virtual IP):
# chrg -o offline db2_db2inst1_db2inst1_HADRDB-rg
– Note: After offlining just the HADR resource group, the HADR pair
will remain in a peer connected state even though shown as Offline
on both servers when viewed using 'lssam' !
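The chrg selector uses SQL-style 'like' matching, where '_' matches exactly one character and '%' any run of characters. A rough shell analogue shows which group names the selector above would catch (the group names here are samples; a real domain would list them with 'lsrg'):

```shell
# Rough shell analogue of the selector "Name like 'db2_db2inst_%'".
# SQL '_' -> glob '?', SQL '%' -> glob '*', giving db2?db2inst?* below.
selected=""
for g in db2_db2inst1_node1_0-rg db2_db2inst1_node2_0-rg unrelated-rg; do
  case "$g" in
    db2?db2inst?*) selected="$selected $g"; echo "selected: $g" ;;
    *)                                      echo "skipped:  $g" ;;
  esac
done
```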
Check that both instances reach Online states. Do not proceed until both DB2 instances have
come online. Confirm using "lssam -top". Also run "db2_ps" as the DB2 instance owner on each
node.
The DB2 start scripts used to start the instances will also activate the databases, resulting in the
HADR pair establishing a peer connected state. Confirm that the HADR pair have reached peer
state by running the following on each DB2 node:
# db2pd -hadr -db hadrdb
– If HADR state is not active, then manually bring the HADR pair into peer state as follows:
a. On designated standby node:
# db2 start hadr on db hadrdb as standby
b. On designated primary node:
# db2 start hadr on db hadrdb as primary
Repeat for all HADR databases. Again check the state of the HADR pair before proceeding
You should see output similar (abbreviated) to the following on the standby server:
Database Partition 0 -- Database HADRDB -- Active -- Up 0 days 00:12:51
Role State SyncMode HeartBeatsMissed LogGapRunAvg (bytes)
Standby Peer Sync 0 0
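To script the peer-state check, the role and state columns can be pulled from that output. The sample text below stands in for live output, and the pattern-based extraction is an assumption based on the abbreviated listing above:

```shell
# Extract Role and State from (sample) 'db2pd -hadr' output. On a live
# node, pipe the real command instead:
#   db2pd -hadr -db hadrdb | awk '/^(Primary|Standby)/ {print $1, $2}'
sample='Role    State  SyncMode  HeartBeatsMissed  LogGapRunAvg (bytes)
Standby Peer   Sync    0         0'
role_state=$(printf '%s\n' "$sample" | awk '/^(Primary|Standby)/ {print $1, $2}')
echo "$role_state"                       # -> Standby Peer
[ "${role_state#* }" = "Peer" ] && echo "HADR pair is in peer state"
```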
The HADR resource group will be locked and unlocked several times. There will also be a
move request at some point.
‘lssam’ will show the online/offline states swapped for the HADR resource and ServiceIP,
assuming the takeover is successful :
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Nominal=Online
|- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
|- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
'- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
'- Online IBM.ServiceIP:db2ip_10_20_30_42-rs
|- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
'- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
Use the following command to check the HADR role has swapped between the nodes and
ensure the HADR pair have reached a peer state again :
db2inst1@node2> db2pd -hadr -db <hadr_db_name>
This isn't necessarily a bad thing, however there is a small element of risk because
there isn't a two-way handshake as there is when a regular takeover (not forced) is
performed.
Some clients prefer the “move” request as it results in a faster failover for them and
there is less TSAMP activity since there is no need to lock and unlock the HADR
groups multiple times.
To reset the Failed Offline state, use the following TSAMP command:
resetrsrc -s "Name = 'db2_db2inst1_db2inst1_HADRDB-rs' &
NodeNameList = {'node2'}" IBM.Application
This will cause the hadrV10_stop.ksh script to be executed on node2 and if successful
(return code 0), the Operational State will change to “Offline”.
Before rebooting a server for whatever reason, you should take the node offline first:
# stoprpnode <node_name>
Even if the HADR resource group’s Nominal state is set to Offline, starting both instances should
result in the HADR pair reaching a peer connected state since the start script for the instances
also activates the databases.
– However, while the HADR resource group is set to Offline, the Virtual IP address is truly
offline (not assigned to any NIC) so no client access to the database, AND no automated
failover actions are possible.
Failure Scenarios
• The various failover scenarios supported by this solution are detailed in
section 6 of a whitepaper called
“Automated Cluster Controlled HADR (High Availability Disaster Recovery)
Configuration Setup using the IBM DB2 High Availability Instance Configuration
Utility (db2haicu) ”
The local database manager's configuration will be updated so that "Cluster manager"
is unset. The 'db2haicu -disable' command also needs to be executed on the other server
so that its instance configuration is updated too.
With “Cluster manager” unset, you would be able to Offline the entire domain without
affecting the manual operation of the DB2 instances.
To re-enable, run ‘db2haicu’, as instance owner, on each server, and select “1” (Yes) when asked
if you want to enable high availability, and then choose “TSA”.
Serviceability – logs
The following syslog message indicates that the HADR resource is considered Online (return
code = 1) and has the Primary role:
<timestamp> node1 user:info hadrV10_monitor.ksh[6963]: Returning 1 : db2inst1 db2inst1 HADRDB
This is seen only on the node that is currently the primary; it repeats approximately every 21 seconds (the resource's MonitorCommandPeriod).
The following syslog message indicates that the HADR resource is considered Offline (return
code = 2) and is most likely in a Standby state (the normal/OK state):
<timestamp> node2 user:info hadrV10_monitor.ksh[69632]: Returning 2 : db2inst1 db2inst1 HADRDB
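To see at a glance which role each node's monitor is reporting, the return codes can be extracted from syslog. The sample lines below mirror the ones above; on a live node you would grep the configured syslog target instead:

```shell
# Classify nodes by the hadrV10_monitor.ksh return codes found in syslog:
# 1 = Online (the primary side), 2 = Offline (usually the standby side).
roles=$(printf '%s\n' \
  '<ts> node1 user:info hadrV10_monitor.ksh[6963]: Returning 1 : db2inst1 db2inst1 HADRDB' \
  '<ts> node2 user:info hadrV10_monitor.ksh[69632]: Returning 2 : db2inst1 db2inst1 HADRDB' |
  awk '/hadrV10_monitor/ {
         rc = $6              # the number after "Returning"
         print $2, "->", (rc == 1 ? "Online (primary side)" : "Offline (standby side)")
       }')
echo "$roles"
```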
If db2start was used to start the instance, the message below would be seen instead of the "1
partitions total" message shown above:
<timestamp> node1 user:info db2V10_start.ksh[856150]: db2V10_start.ksh is already up...
The following syslog messages are typical of the HADR resource group being brought
online:
<timestamp> node1 user:info hadrV10_monitor.ksh[6963]: Returning 1 : db2inst1 db2inst1 HADRDB
<timestamp> node1 user:debug root[524540]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock
<timestamp> node1 user:notice hadrV10_start.ksh[422078]: Entering : db2inst1 db2inst1 HADRDB
<timestamp> node1 user:debug hadrV10_start.ksh[422086]: su - db2inst1 -c db2gcf -t 3600 -u -i db2inst1 -i db2inst1 -h HADRDB -L
<timestamp> node1 user:debug root[524290]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock: 0
<timestamp> node1 user:notice hadrV10_start.ksh[422090]: Returning 0 : db2inst1 db2inst1 HADRDB
Note: the ‘hadrV10_start.ksh’ script doesn't actually bring the HADR database pair into a Peer
state. The pair is likely to be in a Peer state already, because the databases are activated
as part of starting the DB2 instances.
The following syslog messages are typical of the HADR resource being stopped on one
node so that a manual takeover can occur to the other node. It's also what you would see
when resetting a Failed Offline state for a HADR resource:
<timestamp> node1 user:notice /usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh[602322]: Entering : db2inst1 db2inst1 HADRDB
<timestamp> node1 user:debug /usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh[602330]: su - db2inst1 -c db2gcf -t 3600 -d -i
db2inst1 -i db2inst1 -h HADRDB -L
<timestamp> node1 user:notice /usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh[602334]: Returning 0 : db2inst1 db2inst1
HADRDB
Note: the ‘hadrV10_stop.ksh’ script doesn’t actually stop the HADR functionality within
DB2. It doesn’t affect Peer state.
The HADR resource group is locked whenever Peer state is lost. The DB2 software uses
a TSAMP API to request the lock. The “lockreqprocessed” script is used to check the
lock and unlock states.
When the HADR pair are back in Peer state, the HADR resource group is unlocked,
again requested by DB2.
The DB2 Instance resource groups also get locked if the db2stop command is used to
stop an instance, and unlocked when db2start is used to start it again.
Assuming the above hadrV10_stop.ksh script completes with a 0 return code, then a similar
sequence of messages to the following would be seen on the original standby server:
<timestamp> node2 user:debug root[487538]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock
<timestamp> node2 user:debug root[487564]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock: 1
<timestamp> node2 user:debug root[487566]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock
<timestamp> node2 user:debug root[487572]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock: 0
<timestamp> node2 user:notice hadrV10_start.ksh[548876]: Entering : db2inst1 db2inst1 HADRDB
<timestamp> node2 user:debug hadrV10_start.ksh[548884]: su - db2inst1 -c db2gcf -t 3600 -u -i db2inst1 -i db2inst1 -h
HADRDB –L
<timestamp> node2 user:notice hadrV10_start.ksh[548888]: Returning 0 : db2inst1 db2inst1 HADRDB
<timestamp> node2 user:debug hadrV10_monitor.ksh[696436]: Returning 1 : db2inst1 db2inst1 HADRDB
Note the return code of 0 from “hadrV10_start.ksh” meaning a successful takeover. Any other return
code would be considered unsuccessful and would need to be diagnosed from a DB2 perspective.
The following set of messages would indicate a cluster communication problem (domain split) :
Firstly, state of the domain changes to PENDING_QUORUM on each node:
CONFIGRM_PENDINGQUORUM_ER The operational quorum state of the active peer domain has changed to
PENDING_QUORUM.
The Automation Engine (RecoveryRM) on each node reports that the other node has left the
domain:
RECOVERYRM_INFO_4_ST A member has left. Node number = 1
Network TieBreaker is tested and an rc=0 indicates a successful poll of the network
TieBreaker:
samtb_net[1294584]: op=reserve ip=10.201.1.1 rc=0 log=1 count=2
If the TieBreaker poll is successful, the node regains QUORUM:
CONFIGRM_HASQUORUM_ST The operational quorum state of the active peer domain has changed to
HAS_QUORUM.
Release TieBreaker and remove TieBreaker block attempts when node has rejoined a domain
again:
<timestamp> <node_name> daemon:info samtb_net[790758]: op=release ip=10.20.30.1 rc=0 log=1 count=2
<timestamp> <node_name> daemon:info samtb_net[925932]: remove reserve block
/var/ct/samtb_net_blockreserve_10.20.30.1
Serviceability – resource states
UNKNOWN (0)
– Generally a problematic state; it really shouldn't be deliberately returned by the automation scripts
ONLINE (1)
– The resource is up and working
OFFLINE (2)
– The resource is cleanly offline and can be started here if needed
FAILED_OFFLINE (3)
– The resource has failed on this node; TSAMP will not bring it Online here until the state is reset (see resetrsrc)
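These numeric codes appear in ‘lsrsrc’ output and as the monitor scripts' return codes. A small lookup helper (a convenience sketch, not part of TSAMP) makes log reading easier:

```shell
# Map numeric OpState codes (as seen in 'lsrsrc' output and in the
# monitor scripts' return codes) to names. Convenience sketch only.
opstate_name() {
  case "$1" in
    0) echo UNKNOWN ;;
    1) echo ONLINE ;;
    2) echo OFFLINE ;;
    3) echo FAILED_OFFLINE ;;
    *) echo "UNRECOGNIZED($1)" ;;
  esac
}

opstate_name 1    # -> ONLINE
opstate_name 3    # -> FAILED_OFFLINE
```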
Serviceability – problem isolation
Check syslog and trace_summary to see if TSAMP is issuing start / stop orders/commands
– If yes, then problem is most likely in DB2 automation scripts or core DB2 components
– If no, problem is most likely in cluster/automation S/W, requiring TSAMP Level 2 involvement
Questions/Comments ?