
hastart -onenode is meant for clusters with one node in main.cf, as you don't need GAB and LLT if there are no other nodes. This means the "had" daemon will start without GAB and LLT, but as you have found out, you can run "hastart -onenode" even when there is more than one system defined in main.cf.

In this case, as you say, VCS can't see the other node and so doesn't know its state, so VCS should AutoDisable the group (which is normal even with GAB and LLT, if the second node is not started). You can AutoEnable the service group and then you will be able to online the SG; if you do this on both nodes, then yes, you will have a split brain.

hastart -onenode should only be used if you have one node in the cluster. If you have more than one node you may get unexpected results, as hastart -onenode is not designed to work with more than one node in the cluster. To see if VCS was started with "hastart -onenode", just run "ps -ef | grep had" and check whether had is running with "-onenode".
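To make that check concrete, here is a small illustrative shell helper; the function name classify_had is made up for this sketch, and only the /opt/VRTSvcs/bin/had path comes from the ps output discussed here:

```shell
# Illustrative helper (not a VCS tool): classify one line of `ps -ef`
# output to see how had was started. In real use you would feed it
# the output of: ps -ef | grep '[h]ad'
classify_had() {
  case "$1" in
    */opt/VRTSvcs/bin/had*-onenode*) echo "had running with -onenode" ;;
    */opt/VRTSvcs/bin/had*)          echo "had running normally" ;;
    *)                               echo "had not running" ;;
  esac
}

classify_had "root 5636246 1 1 Aug 07 - 50:09 /opt/VRTSvcs/bin/had -onenode"
```

Note the pattern matches the full /opt/VRTSvcs/bin/had path, so the hashadow process line does not trigger a false positive.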

The -onenode option is meant to be used for starting VCS only on a single system on which LLT and GAB are not installed/configured. If you use hastart -onenode on both nodes, each partition (node) will be unaware of the status of the other, and the remote system will show up in UNKNOWN state. Resources on the remote node will not be probed when you use this option.

I am still confused. Even if we don't have GAB/LLT/vxfen, if we start VCS it will start the resources and file systems; and if we then start VCS on the other node with -onenode, it wouldn't have any idea about the other node, so wouldn't it try to mount the file system too?

Yes, you are right, each node will try to mount the FS, so this is uncoordinated; but then we are using an incorrect option, -onenode (as this is a 2-node cluster). If you see the hastart manpage:
-onenode Use this option only to start VCS on a single system where LLT and GAB are not required. Do not use this option to start VCS on a node in a multisystem cluster.

Exactly, that's what I have seen in the doc, so we shouldn't start VCS on the second node. When I look at hastatus -sum, all the resources show as not probed, but when I check hares -state on a resource, it shows as online. Would it have been started with the force option?
[root@node01:/root]
# ps -ef | grep had
root 5111872 1 0 Aug 07 - 0:00 /opt/VRTSvcs/bin/hashadow
root 5636246 1 1 Aug 07 - 50:09 /opt/VRTSvcs/bin/had -onenode
Problem: Restart HAD after -onenode was used
ISSUE: Cannot start the cluster as it shows offline.

ISSUE AS REPORTED: Two node cluster is in an UNKNOWN state.

ERROR CODE/ MESSAGE:


node1# hastatus -summ
-- SYSTEM STATE
-- System State Frozen
A node1 RUNNING 0
A node2 UNKNOWN 0
node2# hastatus -summ
-- SYSTEM STATE
-- System State Frozen
A node1 UNKNOWN 0
A node2 RUNNING 0

PROBLEM DESCRIPTION:
Each node in the two node cluster is reporting the other node as being in an UNKNOWN state.

CAUSE:
Both nodes in the cluster were started with the onenode option.
#hastart -onenode

SOLUTION:
Stop HAD on both nodes in the cluster:
node1#hastop -all -force
node2#hastop -all -force

Verify port membership under GAB:


node1# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen a0660f membership 01
Start HAD on both nodes:
node1#hastart
node2#hastart

Verify both nodes are in a running state:


# hastatus -summ
-- SYSTEM STATE
-- System State Frozen
A node1 RUNNING 0
A node2 RUNNING 0
How do you start the VCS cluster if it does not start automatically after a server reboot? Have you ever faced such an issue? If not, see below how to fix this kind of issue on a Veritas cluster. I have been asking this question in Solaris interviews, but most candidates fail to impress me, saying things unrelated to VCS. If you know the basics of Veritas cluster, it is easy to troubleshoot in real time and easy to explain in interviews too.
VCS troubleshooting
Scenario:
Two nodes are clustered with Veritas cluster and you have rebooted one of the servers. The rebooted node has come up, but the VCS cluster (HAD daemon) was not started. You are trying to start the cluster using the hastart command, but it is not working. How do you troubleshoot?

Here we go.
1. Check the cluster status after the server reboot using the hastatus command.
# hastatus -sum |head
Cannot connect to VCS engine

2. Try to start the cluster using hastart. No luck? Still getting the same message as above? Proceed with step 3.

3. Check the LLT and GAB services. If they are in a disabled state, enable them.
[root@UA~]# svcs -a |egrep "llt|gab"
online Jun_27 svc:/system/llt:default
online Jun_27 svc:/system/gab:default
4. Check the LLT (heartbeat) status. Here the LLT links look good.
[root@UA ~]# lltstat -nvv |head
LLT node information:
Node State Link Status Address
0 UA2 OPEN
HB1 UP 00:91:28:99:74:89
HB2 UP 00:91:28:99:74:BF
* 1 UA OPEN
HB1 UP 00:71:28:9C:2E:0F
HB2 UP 00:71:28:9C:2F:9F
[root@UA ~]#

5. If LLT is down, try to configure the private links using the lltconfig -c command. If you still have issues with the LLT links, you need to check with the network team to fix the heartbeat links.

6. Check the GAB status using the gabconfig -a command.


[root@UA ~]# gabconfig -a
GAB Port Memberships
===============================================================
[root@UA ~]#

7. As per the above command output, membership is not seeded. We have to seed the membership manually using the gabconfig command.
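The decision in this step can be sketched as a shell function that looks for a "Port a" line in gabconfig -a output; the function name is made up for this sketch, and the sample text is the output shown in this article rather than a live capture:

```shell
# Sketch: GAB is seeded when `gabconfig -a` shows a "Port a" membership
# line. Real use would pass it live output: gab_seed_state "$(gabconfig -a)"
gab_seed_state() {
  if printf '%s\n' "$1" | grep -q '^Port a '; then
    echo "seeded"
  else
    echo "not seeded - seed manually with: gabconfig -cx"
  fi
}

unseeded='GAB Port Memberships
==============================================================='
gab_seed_state "$unseeded"
```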
[root@UA ~]# gabconfig -cx

8. Check the GAB status now.


[root@UA ~]# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen 6d0607 membership 01
[root@UA ~]#
The above output indicates that GAB (Port a) is online on both nodes (0, 1). To know which node is 0 and which is 1, refer to the /etc/llthosts file.
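As a sketch of that lookup: /etc/llthosts holds one "<id> <name>" pair per line, so a membership string like "01" can be resolved against it. The file contents below are hypothetical; real use would read the actual file:

```shell
# Hypothetical /etc/llthosts contents; real use: llthosts=$(cat /etc/llthosts)
llthosts='0 UA2
1 UA'

# Resolve each digit of the membership string to a node name.
membership="01"
for id in $(printf '%s\n' "$membership" | grep -o .); do
  printf '%s\n' "$llthosts" | awk -v id="$id" '$1 == id { print "node " id " = " $2 }'
done
```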

9. Try to start the cluster using the hastart command. It should work now.

10.Check the Membership status using gabconfig.


[root@UA ~]# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen 6d0607 membership 01
Port h gen 6d060b membership 01
The above output indicates that HAD (Port h) is online on both nodes (0, 1).
11. Check the cluster status using the hastatus command. The system should be back in business.
[root@UA ~]# hastatus -sum |head
-- SYSTEM STATE
-- System State Frozen
A UA2 RUNNING 0
A UA RUNNING 0

-- GROUP STATE
-- Group System Probed AutoDisabled State
B ClusterService UA Y N ONLINE
B ClusterService UA2 Y N OFFLINE
[root@UA ~]#
This is a very small thing, but many VCS beginners fail to fix this start-up issue. In interviews too, they are not able to say: "If HAD is not starting with the hastart command, I will check the LLT and GAB services, fix any issues with them, and then start the cluster using hastart." As an interviewer, that is the answer everybody expects.

Problem
All VCS HA commands become unresponsive and hang after a few days of operation. Restart the HAD daemon to fix this issue, or GAB panics the system when the HAD daemon stops sending heartbeats.

Error Message
No specific error message is reported, but you can see that the engine log is not being updated. Thread 2 of the hung HAD process will have output similar to the below.
(gdb) thread 2
[Switching to thread 2 (Thread 23965)]#0 0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x00ba4783 in getifaddrs () from /lib/libc.so.6
#2 0x00b4860c in __old_glob_in_dir () from /lib/libc.so.6
#3 0x00b93fd3 in _res_hconf_reorder_addrs () from /lib/libc.so.6
#4 0x00b9454a in do_init () from /lib/libc.so.6
#5 0x082587bc in VCSSyslog (
bufp=0x82f8060 "VCS WARNING V-16-1-51100 HAD Self Check: Excessive delay in the
HAD heartbeat to GAB (10 seconds)") at Platform.C:732
#6 0x08258a70 in gab_heartbeat_alarm_handler (sig_num=14) at Platform.C:1992
#7 <signal handler called>
#8 0xffffe410 in __kernel_vsyscall ()
#9 0x00b87583 in sprofil () from /lib/libc.so.6
#10 0x00b4a6d0 in glob64@@GLIBC_2.2 () from /lib/libc.so.6
#11 0x00b49712 in glob64@@GLIBC_2.2 () from /lib/libc.so.6
#12 0x00b4a16d in glob64@@GLIBC_2.2 () from /lib/libc.so.6
#13 0x00b4e8c6 in internal_fnwmatch () from /lib/libc.so.6
#14 0x00b4e81f in internal_fnwmatch () from /lib/libc.so.6
#15 0x0821f701 in Log::write_ffdc (sev=35, whop=0x82d0f24 "gabtcp_compute_visible_mem
bership",
filep=0x82d0f71 "GabTcpAux.C", line=3295, flags=49152, cat=50, id=0,
msgp=0x9019460 "membership is 0, local membership is 1") at Log.C:1650
#16 0x080e62c2 in gabtcp_compute_visible_membership () at GabTcpAux.C:3295
#17 0x080f4040 in GabTcp::lowest_master (this=0xf7115108) at GabTcp.C:420
#18 0x080692e0 in MAIN (argc=3, argv=0xffa23f54) at had.C:3260
#19 0x0806b757 in main (argc=64768, argv=0x0) at had.C:3776

The procedure to get a gcore of the hung HAD process is:


# gcore <pid_of_hung_had_process>

Cause
The SIGALRM handler, used by the HAD daemon to check its heartbeats, invokes the syslog() call. This call can sometimes cause HAD to go into an indefinite sleep.
This issue is tracked via e2747052.

Solution
When the HAD daemon gets into this mode, generally GAB will try to kill and restart HAD, which will fix the issue. In the case of a single-node VCS cluster, the HAD daemon is not restarted by GAB and instead can be manually restarted by killing the hung daemons and running the # hastart -onenode command.

This issue is fixed by removing the syslog() call and instead using a file update by the HAD daemon. The below hotfixes, which contain the fix, are available:
VCS 5.1SP1RP1HF5
VCS 5.1SP1RP2HF3
This issue is also fixed in the next rolling patch, 5.1SP1RP3.
Applies To: VCS 5.1SP1RP1 / VCS 5.1SP1RP2. Also applies to single-node VCS clusters.

Problem
Error when starting cluster with hastart

Solution
When changing the name of the system, the sysname file might not have been changed. The cluster software checks the names in the configuration files to see if they are consistent.

Check that the sysname file contains the name of the system given in the other cluster configuration files. The location of the file is /etc/VRTSvcs/conf.
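A minimal sketch of such a consistency check, assuming hypothetical main.cf contents; a real check would read the actual files under /etc/VRTSvcs/conf:

```shell
# Sketch: verify that the sysname value appears as a "system" entry
# in main.cf. Both inputs below are sample text, not real files.
check_sysname() {
  # $1 = contents of the sysname file, $2 = contents of main.cf
  if printf '%s\n' "$2" | grep -q "^system \"$1\""; then
    echo "sysname '$1' is consistent with main.cf"
  else
    echo "sysname '$1' not found in main.cf - fix it before running hastart"
  fi
}

maincf='include "types.cf"
cluster mycl (
)
system "node1" (
)
system "node2" (
)'

check_sysname node1 "$maincf"
```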
Symptoms:
GAB errors "Excessive delay between successive calls to GAB heartbeat" are seen in the engine log while running a single-node/standalone VCS cluster where GAB is disabled.

Description:
GAB heartbeat log messages are logged as information when there is a delay between heartbeats (had being stuck). When had is running with -onenode, GAB need not be enabled; for self-check purposes, had simulates the heartbeat with an internal component of had itself. These log messages are logged because of delays in the simulated heartbeats.

Resolution:
The log messages seen are for informational purposes only. When had is running with -onenode, no action is taken on excessive delay between heartbeats: the GAB simulator (GABSIM) does not take any action if the High Availability Daemon (HAD) is unable to maintain its heartbeat. To clarify, the excessive delay message is logged by HAD when the delay in heartbeat is more than 5 seconds. HAD logs the message irrespective of whether it has registered with GAB or GABSIM. The delay of 5 seconds is not configurable, and there is no tunable to turn off the logging. At this point all we can suggest is upgrading the system resources, if possible, to reduce the logs.

Problem: VCS will not start on reboot.

Error Message
TAG_C 2009/08/07 02:40:17 VCS:10542:IpmServer::open Cannot find protocol information for
TCP
TAG_C 2009/08/07 02:40:17 VCS:10604:Unsuccessful open of service

Solution
Check gabconfig -a:
GAB Port Memberships
===============================================================
Port a gen e64a05 membership 01

When hastart is run we see these messages in engine_A.log:


TAG_D 2009/08/07 02:40:12 VCS:11022:VCS engine (had) started
TAG_D 2009/08/07 02:40:12 VCS:11050:VCS engine version=3.5
TAG_D 2009/08/07 02:40:12 VCS:11051:VCS engine join version=3.7
TAG_D 2009/08/07 02:40:12 VCS:11052:VCS engine pstamp=3.5 07/25/03-12:05:00
TAG_D 2009/08/07 02:40:12 VCS:10114:opening GAB library
TAG_C 2009/08/07 02:40:17 VCS:10542:IpmServer::open Cannot find protocol information for
TCP
TAG_C 2009/08/07 02:40:17 VCS:10604:Unsuccessful open of service
Collect a truss of the hastart command:
- truss -o /tmp/truss.out -fail -t\!all -u a.out /opt/VRTSvcs/bin/hastart
$ grep -i proto truss.out
4452/1: -> __1cRVCSGetProtoByName6FpkcpnIprotoent_pci_3_(0x23bc79, 0xffbe697c, 0xffbe597b, 0x1000)
4452/1: <- __1cRVCSGetProtoByName6FpkcpnIprotoent_pci_3_() = 0
==> here the return value is 0, meaning NULL was returned and the lookup failed, whereas we expect a pointer to a protoent structure.

Looking into the source code:


if (STRCMP(hostp, "localhost")) {
    if ((protocol = VCSGetProtoByName("tcp", &pent, pbuf, MAXBUFFER)) == NULL) {
        VCS_LOGP(Ipm_logp, SEV_WARN, 0, 10503,
                 ("IpmHandle::open Cannot find protocol information for TCP"));
        ret = IPM_ERR_OPEN;
        XXDONE();
    }
}

struct protoent *VCSGetProtoByName(const VCSCHAR *name, struct protoent *result,
                                   VCSCHAR *buffer, int buflen) {
    return getprotobyname_r(name, result, buffer, buflen);
}

So the library call to check in the truss output is getprotobyname_r(), which resolves protocol names through the name service switch.


$ grep ^tcp /etc/protocols
tcp 6 TCP # transmission control protocol

$ grep proto /etc/nsswitch.conf


protocols: ldap [NOTFOUND=return] files ===> here protocol lookups check ldap first.

Change it to check "files" first on all nodes of the cluster.


$ grep proto /etc/nsswitch.conf
protocols: files ldap [NOTFOUND=return] files
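Before re-running truss, getent(1) offers a quicker sanity check: it resolves names through the same nsswitch.conf "protocols" entry that getprotobyname_r() consults, so a successful lookup suggests the change took effect (this is a general NSS technique, not a step from the original note):

```shell
# getent follows nsswitch.conf just like getprotobyname_r(), so if this
# prints the tcp entry, the protocol lookup path that had needs is working.
getent protocols tcp
```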

Again run truss on hastart.


- truss -o /tmp/truss1.out -fail -t\!all -u a.out /opt/VRTSvcs/bin/hastart
- grep proto truss1.out
4701/1: -> __1cRVCSGetProtoByName6FpkcpnIprotoent_pci_3_(0x23bc79, 0xffbe6994, 0xffbe5993, 0x1000)
4701/1: <- __1cRVCSGetProtoByName6FpkcpnIprotoent_pci_3_() = 0xffbe6994
==> now we see a pointer to a data structure returned instead of 0.

- Run hastart on all nodes of the cluster. VCS should now start fine.
- Run hastatus -sum