Integrating TSAMP v3.2 with DB2 HADR v10.1
When you have completed this module, you will be able to perform these tasks:
Explain the general operation of the TSAMP product
Identify each of the products and components that make up the total solution and note the
integration points
Examine what the ‘db2haicu’ utility does from the perspective of TSAMP and the automation
policy
Learn how to control and service the combination of TSAMP and DB2 HADR
IBM Tivoli System Automation for Multiplatforms - Event Notification © 2015 IBM Corporation
IBM Tivoli System Automation for Multiplatforms
Introduction
DB2 provides a High Availability Disaster Recovery (HADR) feature that keeps a
primary and standby database synchronized, and allows an administrator to
switch control to a standby DB2 server
DB2 provides a set of scripts that allow TSAMP to control the DB2 resources.
Scripts that are used by TSAMP to start, stop, and monitor each of the DB2 resources –
this is the primary link between the two products
DB2 provides a utility called ‘db2haicu’ that is used to define the domain and
automation policy within TSAMP (that is, the initial setup):
The automation policy is the set of definitions of all resources, resource groups, and the
relationships between them all.
The resource definitions contain attributes that define which DB2 start, stop, and
monitor script (the automation scripts) to use for a particular resource.
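As a rough sketch of the contract these automation scripts follow (this is NOT IBM's shipped db2V10_monitor.ksh, which probes the instance with DB2 tooling such as db2gcf): TSAMP runs the MonitorCommand periodically and interprets its exit code as the resource's operational state, 1 meaning Online and 2 meaning Offline.

```shell
# Sketch of the monitor-script contract only -- not the shipped script.
# TSAMP runs MonitorCommand and reads the exit code as the OpState:
#   1 = Online, 2 = Offline.
monitor_db2_instance() {
  instance=$1
  # The real script probes the instance (e.g. via db2gcf); here a fake
  # flag stands in for that check so the sketch runs anywhere.
  if [ "${FAKE_DB2_UP:-yes}" = "yes" ]; then
    return 1    # Online
  else
    return 2    # Offline
  fi
}

rc=0; monitor_db2_instance db2inst1 || rc=$?
echo "monitor exit code: $rc"    # 1 while the fake instance is "up"
```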
Software Summary
[Diagram: a two-node cluster. The Primary Server and Standby Server each run a DB2 instance, TSAMP, and RSCT, joined in an RSCT cluster. The virtual IP (eth0:0) sits on the primary's public interface (eth0). HADR replication between the primary and standby databases, and the cluster heartbeat, run over a private network (eth1). On a database switch, the HADR roles swap and replication reverses direction.]
Tivoli System Automation for Multiplatforms (TSAMP), the "Automation" software, is made up of
three Resource Managers (shown in blue on the next two slides).
There are different Resource Managers, each responsible for managing or controlling resources that
belong to a particular set of resource classes. For example:
IBM.GblResRM starts, stops, and monitors resources of the classes IBM.Application and IBM.ServiceIP
The following diagram shows the mapping of three key Resource Managers to some Resource Classes
they manage, and then to some example Resources.
The next diagram introduces another Resource Manager (part of RSCT) called IBM.ConfigRM, which
handles configuration tasks across the nodes in the domain, including quorum and the TieBreaker; it
shows two Resource Classes IBM.ConfigRM manages and the Resources modelled by those classes.
TieBreaker
A TieBreaker situation occurs when a cluster is split into sub-clusters with
equal numbers of nodes.
RSCT must then determine which sub-cluster will have operational quorum in the tie
situation.
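The tie decision can be sketched as a simple majority-plus-TieBreaker test (an illustration only; RSCT's actual quorum algorithm is internal to the product):

```shell
# Illustrative sketch only -- RSCT's real quorum algorithm is internal.
# A sub-cluster has operational quorum if it holds a strict majority of
# the domain's nodes; with an even split (a tie) the TieBreaker decides.
has_quorum() {
  nodes_in_subcluster=$1
  nodes_in_domain=$2
  tiebreaker_won=$3   # 1 if this sub-cluster reserved the TieBreaker
  if [ $(( nodes_in_subcluster * 2 )) -gt "$nodes_in_domain" ]; then
    echo HAS_QUORUM
  elif [ $(( nodes_in_subcluster * 2 )) -eq "$nodes_in_domain" ] \
       && [ "$tiebreaker_won" -eq 1 ]; then
    echo HAS_QUORUM
  else
    echo PENDING_QUORUM
  fi
}

has_quorum 1 2 1   # 2-node domain split 1/1, TieBreaker won
has_quorum 1 2 0   # same split, TieBreaker lost
```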
Network TieBreaker
The goal is for each system to determine (via the RSCT infrastructure) whether it is
operational and should therefore take control (if it is not already the active node).
Use a pingable system independent of node1 and node2, for example node3 in our
example, although it would be just as easy and viable to use the gateway router for node1
and node2.
Without an active TieBreaker, automated failover/takeover will NEVER occur in the
event of a cluster split (node outage or network problem).
Create a /usr/sbin/cluster/netmon.cf file on each node. Add the IP addresses (one per line)
of 3–5 devices external to the domain that are pingable from each node. This is
important for a two-node cluster: it allows the cluster software (RSCT) to quickly identify the
source of a heartbeat problem between the nodes.
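For example (the addresses below are placeholders; substitute devices that really answer pings from every node, and note the demo writes to a temporary file rather than the real path):

```shell
# Demo of the netmon.cf format; the real file lives at
# /usr/sbin/cluster/netmon.cf on EACH node. The addresses below are
# placeholders -- use devices that are actually pingable from every node.
NETMON_CF=$(mktemp)              # substitute /usr/sbin/cluster/netmon.cf
cat > "$NETMON_CF" <<'EOF'
10.20.30.1
10.20.30.2
10.20.30.3
EOF
# Sanity check on a live node: every entry should answer a ping, e.g.
#   while read -r ip; do ping -c 1 -w 2 "$ip"; done < /usr/sbin/cluster/netmon.cf
echo "$(wc -l < "$NETMON_CF") entries written"
```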
[Diagram: node1 (eth0, 10.20.30.40) and node2 (eth0, 10.20.30.41) on the public network 10.20.30.0, with the network TieBreaker target being the gateway 10.20.30.1.]
[Diagram: the floating Virtual IP resource (db2ip_10_20_30_42-rs) and the fixed DB2 instance resources (db2_db2inst1_node1-rs and db2_db2inst1_node2-rs) each have a DependsOn relationship on the public-network Equivalency db2_public_network_0, which contains the eth0 NICs of both nodes. A private-network equivalency is optional.]
A DB2 HADR database called "HADRDB", whose primary and standby instances are
both named "db2inst1", maps to a TSAMP-managed resource called
"db2_db2inst1_db2inst1_HADRDB-rs".
Name = "db2_db2inst1_node1_0-rs"
StartCommand = "/usr/sbin/rsct/sapolicies/db2/db2V10_start.ksh db2inst1 0"
StopCommand = "/usr/sbin/rsct/sapolicies/db2/db2V10_stop.ksh db2inst1 0"
MonitorCommand = "/usr/sbin/rsct/sapolicies/db2/db2V10_monitor.ksh db2inst1 0"
MonitorCommandPeriod = 10
MonitorCommandTimeout = 120
StartCommandTimeout = 330
StopCommandTimeout = 140
UserName = "root"
RunCommandsSync = 1
ProtectionMode = 1
ActivePeerDomain = hadr_domain
NodeNameList = {"node1"}
OpState = 1
Name = "db2hadr_hadrdb-rs"
StartCommand = "/usr/sbin/rsct/sapolicies/db2/hadrV10_start.ksh db2inst1 db2inst1 HADRDB"
StopCommand = "/usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh db2inst1 db2inst1 HADRDB"
MonitorCommand = "/usr/sbin/rsct/sapolicies/db2/hadrV10_monitor.ksh db2inst1 db2inst1 HADRDB"
MonitorCommandPeriod = 21
MonitorCommandTimeout = 29
StartCommandTimeout = 330
StopCommandTimeout = 140
UserName = "root"
RunCommandsSync = 1
ProtectionMode = 1
ActivePeerDomain = hadr_domain
NodeNameList = {"node1","node2"}
OpState = 1
Name = "db2ip_10_20_30_42-rs"
IPAddress = "10.20.30.42"
NetMask = "255.255.255.0"
ProtectionMode = 1
ActivePeerDomain = "hadr_domain"
NodeNameList = {"node1","node2"}
OpState = 1
You can also list the states of the individual nodes and see output similar to the
following, from either server:
root@node1# lsrpnode
At this point you could list the TieBreaker resources and see the new network
TieBreaker:
root@node1# lsrsrc -Ab IBM.TieBreaker
The following command should show that your new network TieBreaker is currently
active:
root@node1# lsrsrc -c IBM.PeerNode OpQuorumTieBreaker
Next we add the en0 NIC from the other server, "node2", to the same equivalency
(db2_public_network_0):
See the following technote for more details on cluster heartbeat settings and
Communication Groups:
http://www.ibm.com/support/docview.wss?uid=swg21292274
Equivalency 1:
Name = db2_public_network_0
MemberClass = IBM.NetworkInterface
Resource:Node[Membership] = {en0:node1,en0:node2}
SelectString = ""
ActivePeerDomain = hadr_domain
Resource:Node[ValidSelectResources] = {en0:node1,en0:node2}
root@node2# lsrg
Resource Group names:
db2_db2inst1_node2_0-rg
Resource Group 1:
Name = db2_db2inst1_node2_0-rg
MemberLocation = Collocated
Priority = 0
AllowedNode = db2_db2inst1_node2_0-rg_group-equ
NominalState = Online
ActivePeerDomain = hadr_domain
OpState = Online
TopGroup = db2_db2inst1_node2_0-rg
TopGroupNominalState = Online
Note the AllowedNode attribute, which points to a PeerNode Equivalency that dictates
which server this resource is allowed to run on; see the next slide for output that
shows the details of this PeerNode Equivalency.
This restricts the DB2 instance resource on the previous slide to being brought
Online by TSAMP only on node2.
This is fairly obvious, given it is the resource that represents the standby database
partition.
This shows us that the DB2 instance is dependent on the operational state of the NICs
in the public network. If the NIC is Online, then TSAMP will be able to start the
associated DB2 instance.
Here we see that the DB2 instance for server “node2” is defined and within its own resource
group.
There is a PeerNode equivalency which dictates which server the above instance is allowed to
run on.
Finally, there is a Network Equivalency which contains the NICs for the public network … the
DB2 instance would have a dependency relationship on this equivalency.
Step 3. Log on to the primary server as the instance owner and issue:
db2inst1@node1> db2haicu
– The db2haicu tool will determine the current instance and apply all cluster
configuration steps based on it. It will also activate all databases for the instance
as it attempts to gather information.
– Next, db2haicu will determine whether a domain has already been created by searching
for an "Online" domain. Since we've already run db2haicu on the standby server,
an Online domain should already exist.
– You will then be asked to set the cluster manager.
Please note that once the dbm is configured with “Cluster manager” set to TSA, the DB2
engine expects to have a domain Online. You will have issues stopping and starting the
DB2 instance if no domain is Online.
Run 'db2haicu -disable' on each DB2 server if you want to break the connection between
DB2 and TSAMP. This is the only way to unset “Cluster manager” for DB2 v10.x
Then db2haicu adds resources that represent the DB2 instance (the primary DB2
instance) on the server where you’re currently running db2haicu:
Resource Group 1:
Name = db2_db2inst1_node1_0-rg
MemberLocation = Collocated
Priority = 0
AllowedNode = db2_db2inst1_node1_0-rg_group-equ
NominalState = Online
ActivePeerDomain = hadr_domain
OpState = Online
TopGroup = db2_db2inst1_node1_0-rg
TopGroupNominalState = Online
Note the AllowedNode attribute, which points to a PeerNode Equivalency that dictates which server
this resource is allowed to run on … similar to the other DB2 instance resource group, but with a
different server name.
Managed Relationship 2:
Class:Resource:Node[Source] = IBM.Application:db2_db2inst1_node1_0-rs
Class:Resource:Node[Target] = {IBM.Equivalency:db2_public_network_0}
Relationship = DependsOn
Conditional = NoCondition
Name = db2_db2inst1_node1_0-rs_DependsOn_db2_public_network_0-rel
ActivePeerDomain = hadr_domain
This shows us that the DB2 instances are both dependent on the operational state of the NICs in the
public network. If the NICs are Online, then TSAMP will be able to start the associated DB2
instances … it also means if either NIC goes offline for any reason, the local DB2 instance will be
stopped by TSAMP.
There’s now a resource and group for the DB2 instance on server “node1”.
There's now another PeerNode equivalency; it forces the "db2_db2inst1_node1_0-rg"
resource group to run on "node1" only.
Resource Group 1:
Name = db2_db2inst1_db2inst1_HADRDB-rg
MemberLocation = Collocated
Priority = 0
AllowedNode = db2_db2inst1_db2inst1_HADRDB-rg_group-equ
NominalState = Online
ActivePeerDomain = hadr_domain
OpState = Online
TopGroup = db2_db2inst1_db2inst1_HADRDB-rg
TopGroupNominalState = Online
Note the AllowedNode attribute. It points to a PeerNode Equivalency that contains the
servers “node1” and “node2” that dictates which servers the HADR database can
reside on. This is just like the setup for the two DB2 instance resource groups that
also use the AllowedNode attribute with other PeerNode Equivalencies, though in this
case the HADR resource is a floating resource with two servers as its choices.
Member Resource 1:
Class:Resource:Node[ManagedResource] = IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
Mandatory = True
MemberOf = db2_db2inst1_db2inst1_HADRDB-rg
SelectFromPolicy = ORDERED
ActivePeerDomain = hadr_domain
OpState = Online
Member Resource 2:
Class:Resource:Node[ManagedResource] = IBM.ServiceIP:db2ip_10_20_30_42-rs
Mandatory = True
MemberOf = db2_db2inst1_db2inst1_HADRDB-rg
SelectFromPolicy = ORDERED
ActivePeerDomain = hadr_domain
OpState = Online
This shows us that the HADRDB resource is dependent on the operational state of the NICs in the
public network.
If the NICs are Online, then TSAMP will be able to bring the associated HADR database resource
Online. It also means that if either NIC goes offline for any reason, the constituent of the HADR
database resource local to the offline NIC will be taken Offline by TSAMP (if it is currently
online); this could trigger a failover.
The resources and group for the HADR database and virtual IP address have been added, as
has a new PeerNode Equivalency containing servers “node1” and “node2”.
Changing the Nominal state of a resource group will instruct TSAMP to start/stop the
member resources using the scripts defined in the “StartCommand”, “StopCommand”
attributes for a resource of class “IBM.Application”, such is the case for a DB2 instance
resource.
To change the desired state of multiple resource groups with similar names (for example,
to start in parallel all DB2 resource groups where the instance name on each server
starts with "db2inst"), use the following syntax:
# chrg -o online -s "Name like 'db2_db2inst_%'"
Another example: to take the HADR resource group offline (which will also remove any
currently assigned IP alias, that is, the Virtual IP):
# chrg -o offline db2_db2inst1_db2inst1_HADRDB-rg
– Note: After offlining just the HADR resource group, the HADR pair
will remain in a peer connected state even though shown as Offline
on both servers when viewed using 'lssam' !
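The chrg selector uses SQL-style 'like' matching, where '_' matches exactly one character and '%' any run of characters. A rough shell analogue shows which group names the selector above would catch (the group names here are samples; a real domain would list them with 'lsrg'):

```shell
# Rough shell analogue of the selector "Name like 'db2_db2inst_%'".
# SQL '_' -> glob '?', SQL '%' -> glob '*', giving db2?db2inst?* below.
selected=""
for g in db2_db2inst1_node1_0-rg db2_db2inst1_node2_0-rg unrelated-rg; do
  case "$g" in
    db2?db2inst?*) selected="$selected $g"; echo "selected: $g" ;;
    *)                                      echo "skipped:  $g" ;;
  esac
done
```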
Check that both instances reach Online states. Do not proceed until both DB2 instances have
come online. Confirm using "lssam -top". Also run "db2_ps" as the DB2 instance owner on each
node.
The DB2 start scripts used to start the instances will also activate the databases, resulting in the
HADR pair establishing a peer connected state. Confirm that the HADR pair have reached peer
state by running the following on each DB2 node:
# db2pd -hadr -db hadrdb
– If HADR state is not active, then manually bring the HADR pair into peer state as follows:
a. On designated standby node:
# db2 start hadr on db hadrdb as standby
b. On designated primary node:
# db2 start hadr on db hadrdb as primary
Repeat for all HADR databases. Again check the state of the HADR pair before proceeding
You should see output similar (abbreviated) to the following on the standby server:
Database Partition 0 -- Database HADRDB -- Active -- Up 0 days 00:12:51
Role State SyncMode HeartBeatsMissed LogGapRunAvg (bytes)
Standby Peer Sync 0 0
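To script the peer-state check, the role and state columns can be pulled from that output. The sample text below stands in for live output, and the pattern-based extraction is an assumption based on the abbreviated listing above:

```shell
# Extract Role and State from (sample) 'db2pd -hadr' output. On a live
# node, pipe the real command instead:
#   db2pd -hadr -db hadrdb | awk '/^(Primary|Standby)/ {print $1, $2}'
sample='Role    State  SyncMode  HeartBeatsMissed  LogGapRunAvg (bytes)
Standby Peer   Sync    0         0'
role_state=$(printf '%s\n' "$sample" | awk '/^(Primary|Standby)/ {print $1, $2}')
echo "$role_state"                       # -> Standby Peer
[ "${role_state#* }" = "Peer" ] && echo "HADR pair is in peer state"
```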
The HADR resource group will be locked and unlocked several times. There will also be a
move request at some point.
‘lssam’ will show the online/offline states swapped for the HADR resource and ServiceIP,
assuming the takeover is successful :
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Nominal=Online
|- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
|- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
'- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
'- Online IBM.ServiceIP:db2ip_10_20_30_42-rs
|- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
'- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
Use the following command to check the HADR role has swapped between the nodes and
ensure the HADR pair have reached a peer state again :
db2inst1@node2> db2pd -hadr -db <hadr_db_name>
This isn't necessarily a bad thing, however there is a small element of risk because
there isn't a two-way handshake as there is when a regular takeover (not forced) is
performed.
Some clients prefer the “move” request as it results in a faster failover for them and
there is less TSAMP activity since there is no need to lock and unlock the HADR
groups multiple times.
To reset the Failed Offline state, use the following TSAMP command:
resetrsrc -s "Name = 'db2_db2inst1_db2inst1_HADRDB-rs' &
NodeNameList = {'node2'}" IBM.Application
This will cause the hadrV10_stop.ksh script to be executed on node2 and if successful
(return code 0), the Operational State will change to “Offline”.
Before rebooting a server for whatever reason, you should take the node offline first:
# stoprpnode <node_name>
Even if the HADR resource group’s Nominal state is set to Offline, starting both instances should
result in the HADR pair reaching a peer connected state since the start script for the instances
also activates the databases.
– However, while the HADR resource group is set to Offline, the Virtual IP address is truly
offline (not assigned to any NIC) so no client access to the database, AND no automated
failover actions are possible.
Failure Scenarios
• The various failover scenarios supported by this solution are detailed in
section 6 of a whitepaper called
“Automated Cluster Controlled HADR (High Availability Disaster Recovery)
Configuration Setup using the IBM DB2 High Availability Instance Configuration
Utility (db2haicu) ”
The local database manager's configuration will be updated so that "Cluster manager"
is unset. The 'db2haicu -disable' command also needs to be executed on the other server
so that its instance configuration is updated too.
With “Cluster manager” unset, you would be able to Offline the entire domain without
affecting the manual operation of the DB2 instances.
To re-enable, run ‘db2haicu’, as instance owner, on each server, and select “1” (Yes) when asked
if you want to enable high availability, and then choose “TSA”.
Serviceability – logs
The following syslog message indicates that the HADR resource is considered Online (return
code = 1) and has the Primary role:
<timestamp> node1 user:info hadrV10_monitor.ksh[6963]: Returning 1 : db2inst1 db2inst1 HADRDB
This is seen only on the node that is currently the primary; it repeats approximately every 21 seconds (the resource's MonitorCommandPeriod).
The following syslog message indicates that the HADR resource is considered Offline (return
code = 2) and is most likely in a Standby state (the normal/OK state):
<timestamp> node2 user:info hadrV10_monitor.ksh[69632]: Returning 2 : db2inst1 db2inst1 HADRDB
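To see at a glance which role each node's monitor is reporting, the return codes can be extracted from syslog. The sample lines below mirror the ones above; on a live node you would grep the configured syslog target instead:

```shell
# Classify nodes by the hadrV10_monitor.ksh return codes found in syslog:
# 1 = Online (the primary side), 2 = Offline (usually the standby side).
roles=$(printf '%s\n' \
  '<ts> node1 user:info hadrV10_monitor.ksh[6963]: Returning 1 : db2inst1 db2inst1 HADRDB' \
  '<ts> node2 user:info hadrV10_monitor.ksh[69632]: Returning 2 : db2inst1 db2inst1 HADRDB' |
  awk '/hadrV10_monitor/ {
         rc = $6              # the number after "Returning"
         print $2, "->", (rc == 1 ? "Online (primary side)" : "Offline (standby side)")
       }')
echo "$roles"
```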
If db2start was used to start the instance, the message below would be seen instead of the "1
partitions total" message shown above:
<timestamp> node1 user:info db2V10_start.ksh[856150]: db2V10_start.ksh is already up...
The following syslog messages are typical of the HADR resource group being brought
online:
<timestamp> node1 user:info hadrV10_monitor.ksh[6963]: Returning 1 : db2inst1 db2inst1 HADRDB
<timestamp> node1 user:debug root[524540]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock
<timestamp> node1 user:notice hadrV10_start.ksh[422078]: Entering : db2inst1 db2inst1 HADRDB
<timestamp> node1 user:debug hadrV10_start.ksh[422086]: su - db2inst1 -c db2gcf -t 3600 -u -i db2inst1 -i db2inst1 -h HADRDB -L
<timestamp> node1 user:debug root[524290]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock: 0
<timestamp> node1 user:notice hadrV10_start.ksh[422090]: Returning 0 : db2inst1 db2inst1 HADRDB
Note: the ‘hadrV10_start.ksh’ script doesn't actually bring the HADR database pair into a Peer
state. The pair is likely to be in a Peer state already, because the databases are activated
as part of starting the DB2 instances.
The following syslog messages are typical of the HADR resource being stopped on one
node so that a manual takeover can occur to the other node. It's also what you would see
when resetting a Failed Offline state for a HADR resource:
<timestamp> node1 user:notice /usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh[602322]: Entering : db2inst1 db2inst1 HADRDB
<timestamp> node1 user:debug /usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh[602330]: su - db2inst1 -c db2gcf -t 3600 -d -i
db2inst1 -i db2inst1 -h HADRDB -L
<timestamp> node1 user:notice /usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh[602334]: Returning 0 : db2inst1 db2inst1
HADRDB
Note: the ‘hadrV10_stop.ksh’ script doesn’t actually stop the HADR functionality within
DB2. It doesn’t affect Peer state.
The HADR resource group is locked whenever Peer state is lost. The DB2 software uses
a TSAMP API to request the lock. The “lockreqprocessed” script is used to check the
lock and unlock states.
When the HADR pair are back in Peer state, the HADR resource group is unlocked,
again requested by DB2.
The DB2 Instance resource groups also get locked if the db2stop command is used to
stop an instance, and unlocked when db2start is used to start it again.
Assuming the above hadrV10_stop.ksh script completes with a 0 return code, then a similar
sequence of messages to the following would be seen on the original standby server:
<timestamp> node2 user:debug root[487538]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock
<timestamp> node2 user:debug root[487564]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock: 1
<timestamp> node2 user:debug root[487566]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock
<timestamp> node2 user:debug root[487572]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock: 0
<timestamp> node2 user:notice hadrV10_start.ksh[548876]: Entering : db2inst1 db2inst1 HADRDB
<timestamp> node2 user:debug hadrV10_start.ksh[548884]: su - db2inst1 -c db2gcf -t 3600 -u -i db2inst1 -i db2inst1 -h
HADRDB –L
<timestamp> node2 user:notice hadrV10_start.ksh[548888]: Returning 0 : db2inst1 db2inst1 HADRDB
<timestamp> node2 user:debug hadrV10_monitor.ksh[696436]: Returning 1 : db2inst1 db2inst1 HADRDB
Note the return code of 0 from “hadrV10_start.ksh” meaning a successful takeover. Any other return
code would be considered unsuccessful and would need to be diagnosed from a DB2 perspective.
The following set of messages would indicate a cluster communication problem (domain split) :
Firstly, state of the domain changes to PENDING_QUORUM on each node:
CONFIGRM_PENDINGQUORUM_ER The operational quorum state of the active peer domain has changed to
PENDING_QUORUM.
The Automation Engine (RecoveryRM) on each node reports that the other node has left the
domain:
RECOVERYRM_INFO_4_ST A member has left. Node number = 1
Network TieBreaker is tested and an rc=0 indicates a successful poll of the network
TieBreaker:
samtb_net[1294584]: op=reserve ip=10.201.1.1 rc=0 log=1 count=2
If the TieBreaker poll is successful, the node regains QUORUM:
CONFIGRM_HASQUORUM_ST The operational quorum state of the active peer domain has changed to
HAS_QUORUM.
Release TieBreaker and remove TieBreaker block attempts when node has rejoined a domain
again:
<timestamp> <node_name> daemon:info samtb_net[790758]: op=release ip=10.20.30.1 rc=0 log=1 count=2
<timestamp> <node_name> daemon:info samtb_net[925932]: remove reserve block
/var/ct/samtb_net_blockreserve_10.20.30.1
Serviceability – resource states
UNKNOWN (0)
– Generally a problematic state; it really shouldn't be deliberately returned by the automation scripts
ONLINE (1)
– The resource is up and working
OFFLINE (2)
– The resource is cleanly offline and can be started here if needed
FAILED_OFFLINE (3)
– The resource has failed on this node; TSAMP will not bring it Online here until the state is reset (see resetrsrc)
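These numeric codes appear in ‘lsrsrc’ output and as the monitor scripts' return codes. A small lookup helper (a convenience sketch, not part of TSAMP) makes log reading easier:

```shell
# Map numeric OpState codes (as seen in 'lsrsrc' output and in the
# monitor scripts' return codes) to names. Convenience sketch only.
opstate_name() {
  case "$1" in
    0) echo UNKNOWN ;;
    1) echo ONLINE ;;
    2) echo OFFLINE ;;
    3) echo FAILED_OFFLINE ;;
    *) echo "UNRECOGNIZED($1)" ;;
  esac
}

opstate_name 1    # -> ONLINE
opstate_name 3    # -> FAILED_OFFLINE
```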
Serviceability – problem isolation
Check syslog and trace_summary to see if TSAMP is issuing start / stop orders/commands
– If yes, then problem is most likely in DB2 automation scripts or core DB2 components
– If no, problem is most likely in cluster/automation S/W, requiring TSAMP Level 2 involvement
Questions/Comments ?