
HCL RS


Proposed Storage Normal Troubleshooting Run Book

Release: V0.1
Owner: RS Storage
Originator: Neeraj Kumar
First Publish Date: January 15, 2010
Document Version: 1.0
Approver: AIGRS Storage team (approval pending)

Contents
1  Purpose
2  Preliminary Tasks
3  Prerequisites
4  Revisions Summary
5  Methodology
6  Health Check Process
7  Resolution Procedure
8  Escalation Procedure
9  Flag Procedure
10 Reference Commands

1 Purpose
The purpose of this run book is to define the scope and procedure for normal troubleshooting steps to resolve issues on filers. It can be consulted to resolve routine issues occurring on Storage Systems installed at AIG. This is an additional process to be executed by the offshore HCL Storage operations team.
Important: Only RS Storage Team members are authorized to execute this document. This document can be referred to in order to understand the rules to be followed while performing operational tasks for managing Storage systems.

2 Preliminary Tasks
Ensure that the following prerequisites are met before beginning to address any issue on the Storage Systems.

Step 1. Verify that you have access to both the command-line and GUI interfaces of the Storage systems on which issues need to be addressed.

3 Prerequisites
1. The user should have access to the DFM console, storage alerts, and email.

4 Revisions Summary
1. This is a new run book, still at the proposal stage.
2. Once approved, it will become Version 1 of this document.
3. This space may be used to track changes in newer versions of this document.

5 Methodology
1. Common errors, issues, and alerts were collated into the list of items to be checked.
2. Historical data for the last 3 weeks was taken from DFM; analysis revealed many recurring issues, alerts, and messages that can be addressed by this process.
3. The document was reviewed by the Storage and Backup tech leads and storage team members for execution, and formalized for language and specifications.
4. Management input is pending on the desired frequency of the health check.

6 Health Check Process

Alert 1: Volume gets full

Scenario:
An Error event at 13 Jan 11:18 PST on Active/Active Controller
ampnsap6070n1.annuity.aigrs.net:
Global Status: NonCritical; /vol/vmware_prod2_ampnsap6070n1_f is full
(using or reserving 98% of space and 0% of inodes, using 95% of
reserve).

Steps:

1) Check whether deleting snapshots will reclaim any space (i.e., if the snap reserve is 0 and the reclaimable snapshot space is substantial).
# ssh ampnsap6070n1 snap list vmware_prod2_ampnsap6070n1_f
a) If so, delete any old snapshots (more than 3 days old on Production, 7 days old on Snapshot or DR filers) existing in the volume.
b) If there are any old manual snapshots, check with the business owner first and delete them if they agree. Then check how much free space the volume has; if it is still insufficient, go to step 2.
2) Check the available space in the containing aggregate, then add space to the volume until volume usage comes down to a reasonable level (normally below 95%).
# ssh ampnsap6070n1 vol container vmware_prod2_ampnsap6070n1_f
Volume 'vmware_prod2_ampnsap6070n1_f' is contained in aggregate 'aggr3_vmware'
# ssh ampnsap6070n1 df -Ah aggr3_vmware
Aggregate                  total    used   avail  capacity
aggr3_vmware              2434GB  2247GB   187GB       92%
aggr3_vmware/.snapshot      75GB    36GB    39GB       48%
# ssh ampnsap6070n1 vol size vmware_prod2_ampnsap6070n1_f +50g
3) If there is no space available in the aggregate, check whether any spare disks (more than two) are available.
a) If a spare disk of the same size is available for the aggregate, obtain approval from Jeff by mail or call and then add a disk. Once the disk is added, enter an informational PCM for the change.
# ssh ampnsap6070n1 vol status -s
# ssh ampnsap6070n1 aggr add aggr3_vmware -d 1@144
OR
# ssh ampnsap6070n1 aggr add aggr3_vmware -n 1
4) Sometimes neither space in the aggregate nor spare disks to add are available. In that case, check whether any other volume in that aggregate is under-utilized. If so, take approval from Jeff or Tom and then reclaim some space from the under-utilized volume.
5) Sometimes the volume may reach 100% by the time you log on to the system to check the volume status. In that situation, add some space to the volume first anyway, because if the volume stays at 100% for long it may affect SnapVault or SnapMirror replications running at the same time.
6) After adding space to the volume, ensure that the SnapVault or SnapMirror relationship is not affected. Also add the same space to the destination volume (in the case of a SnapMirror relationship), otherwise replication can fail.
Alert 2: SnapVault Replica out of date

Scenario:
An Error event at 13 Jan 00:14 PST on Directory
whpnnav6070n2:/nav_stage_whswngen1_nmq1_logs_f/whswngen1_nmq1_logs:
SnapVault Replica: Out of Date.
A backup relationship between
whpnnav6070n2:/nav_stage_whswngen1_nmq1_logs_f/whswngen1_nmq1_logs and
whpnss6070n2:/navisys_stage has not been backed up since 11 Jan 20:07.

Steps:

1) Run `snapvault status -l` on the source or destination filer to trace the cause of the replication failure.
# ssh whpnnav6070n2 snapvault status -l /vol/nav_stage_whswngen1_nmq1_logs_f/whswngen1_nmq1_logs
Snapvault primary is ON.

Source:                 whpnnav6070n2:/vol/nav_stage_whswngen1_nmq1_logs_f/whswngen1_nmq1_logs
Destination:            whpnss6070n2:/vol/navisys_stage/whswngen1_nmq1_logs
Status:                 Idle
Progress:
State:                  Source
Lag:                    47:50:35
Mirror Timestamp:       Fri Jan 15 20:07:00 PST 2010
Base Snapshot:          navstagesnap.0
Current Transfer Type:
Current Transfer Error: source contains no new data; suspending transfer to destination
Contents:
Last Transfer Type:
Last Transfer Size:     220 KB
Last Transfer Duration: 00:00:38
Last Transfer From:

2) From the output above we can find when the last SnapVault replication happened (refer to the Mirror Timestamp) and which snapshot was taken as the base.

3) If the base snapshot is an old one, check whether the current SnapVault snapshot has been created for this volume.
# ssh whpnnav6070n2 snap list nav_stage_whswngen1_nmq1_logs_f

4) If the current SnapVault snapshot for the volume has not been created, check the snapshot-creation script with `crontab -l` on the respective administrative host to find any discrepancy in the script. For Navisys, coordinate with the Navisys team to validate whether the snapshot-creation script ran on that particular day. If they say no, close the ticket with the comment "snapshot creation script didn't run today"; if they say yes, ask them to provide the output so the issue can be drilled down further.
5) Apart from these steps, for a better understanding check the log files /etc/messages and /etc/log/snapmirror for any errors related to the issue.
# ssh whpnss6070n2 rdfile /etc/log/snapmirror | grep whswngen1_nmq1_logs
dst Sun Jan 17 03:45:22 PST whpnnav6070n2:/vol/nav_stage_whswngen1_nmq1_logs_f/whswngen1_nmq1_logs whpnss6070n2:/vol/navisys_stage/whswngen1_nmq1_logs Request (Update)
dst Sun Jan 17 03:45:23 PST whpnnav6070n2:/vol/nav_stage_whswngen1_nmq1_logs_f/whswngen1_nmq1_logs whpnss6070n2:/vol/navisys_stage/whswngen1_nmq1_logs Defer (source contains no new data; suspending transfer to destination)
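The Lag field from `snapvault status -l` tells you at a glance whether a replica is stale. As a sketch (the lag value is from the transcript above; the 24-hour threshold is an illustrative assumption, to be set per backup schedule):

```shell
#!/bin/sh
# Sketch: flag a SnapVault relationship whose lag exceeds a threshold.
lag="47:50:35"          # hh:mm:ss, as reported in the Lag: field above
threshold_hours=24      # assumed threshold; choose per backup schedule

# Lag hours is the first colon-separated field
lag_hours=$(printf '%s' "$lag" | cut -d: -f1)

if [ "$lag_hours" -ge "$threshold_hours" ]; then
  echo "Replica out of date: lag is ${lag_hours} hours"
else
  echo "Replica within threshold"
fi
```

In a live check, the lag would be parsed from the `snapvault status -l` output with `grep '^Lag:'` before applying the same comparison.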

Alert 3: Qtree Full

Scenario:
dfm: Error event on netapp:/Apps/RC372-navstguwh (Qtree Full)
An Error event at 08 Jan 20:59 PST on Qtree RC372-navstguwh on Volume
Apps on Active/Active Controller netapp.sunamerica.com:
The qtree is 99.01% full (using 743 GB of 750 GB).

Steps:

1) If a qtree-full alert comes, check the current status of the qtree.
# ssh netapp quota report | grep /Apps/RC372-navstguwh
tree 1 Apps RC372-navstguwh 754542720 786432000 297340 - /vol/Apps/RC372-navstguwh

2) If it is an exported file system, check the /etc/exports file on the filer and find out which system (IP address) the file system is exported to. After getting the system information, identify the person or team that owns the file system and send them a mail to either archive data to make space in the qtree or raise a storage request to increase the quota.
# ssh netapp rdfile /etc/exports | grep RC372-navstguwh

3) If it is a CIFS share, find out the qtree owner's name with the help of the storage team (onshore as well as offshore) and send them a mail to archive data or raise a storage request to increase the quota.
# ssh netapp cifs shares | grep RC372-navstguwh
navstguwh /vol/Apps/RC372-navstguwh Navisys Stage (RC372)
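The usage percentage can be derived from the used/limit columns (in KB) of the quota report. A minimal sketch, with the report line hard-coded from the example above (the live figures will differ from the 99.01% in the alert, which was captured at a later point in time):

```shell
#!/bin/sh
# Sketch: compute qtree usage from a 7-Mode quota report line.
quota_line='tree 1 Apps RC372-navstguwh 754542720 786432000 297340 - /vol/Apps/RC372-navstguwh'

used_kb=$(printf '%s' "$quota_line" | awk '{ print $5 }')
limit_kb=$(printf '%s' "$quota_line" | awk '{ print $6 }')

# Integer percentage of the tree quota consumed
pct=$(( used_kb * 100 / limit_kb ))
echo "Qtree is ${pct}% full"
```

The same arithmetic applied over every `tree` line of a full `quota report` would give a quick filer-wide list of qtrees near their limits.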
Alert 4: LUN Snapshot Not Possible

Scenario:
A Warning event at 13 Jan 01:03 PST on Lun Path
lh_prod_data2_f/amphlh02/lun61 on Active/Active Controller
AMPNLH6070N2.annuity.aigrs.net:
LUN Snapshot Not Possible.

Steps:

1) When the alert comes, check whether the respective volume has run out of space.
2) Check the volume's fractional reserve. Our policy is to maintain the fractional reserve for any volume at 50%. If the volume's reserve is more than 50%, reduce it to 50%; the volume gains some space this way.
# ssh ampnlh6070n2 vol options lh_prod_data2_f
nosnap=off, nosnapdir=off, minra=off, no_atime_update=off, nvfail=off,
ignore_inconsistent=off, snapmirrored=off, create_ucode=on,
convert_ucode=off, maxdirsize=335462, schedsnapname=ordinal,
fs_size_fixed=off, guarantee=volume, svo_enable=off, svo_checksum=off,
svo_allow_rman=off, svo_reject_errors=off, no_i2p=off,
fractional_reserve=20, extent=off, try_first=volume_grow
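The reserve check in step 2 can be sketched as follows. The options string is abridged from the `vol options` output above, and the 50% figure is the policy named in the step:

```shell
#!/bin/sh
# Sketch: pull fractional_reserve out of vol options output and compare
# it to the 50% policy. In practice the string would be the captured
# output of: ssh ampnlh6070n2 vol options lh_prod_data2_f
vol_options='nosnap=off, guarantee=volume, fractional_reserve=20, extent=off, try_first=volume_grow'

# Split the comma-separated options and extract the reserve percentage
reserve=$(printf '%s\n' "$vol_options" \
  | tr ',' '\n' | sed -n 's/.*fractional_reserve=\([0-9]*\).*/\1/p')

policy=50
if [ "$reserve" -gt "$policy" ]; then
  echo "Reduce fractional reserve from ${reserve}% to ${policy}%"
else
  echo "Fractional reserve ${reserve}% is within policy"
fi
```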
3) Add space to the volume, taking its fractional reserve into consideration.

Alert 5: Volume/LUN Offline

Scenario:
A Warning event at 13 Jan 01:05 PST on Lun Path
nav_stage_whswnora1_f_clone/whswnora1/drive_d.lun on Active/Active
Controller whpnnav6070n2.sunamerica.com:
LUN Offline.

Steps:

1) If this alert comes, first check the current status of the volume/LUN. For volumes: sometimes when a SnapMirror replication is initialized, the destination volume gets restricted and a volume-offline alert comes; in this situation the alert can be ignored. If this is not the case, coordinate with other storage team members (onshore as well as offshore) to find out whether any activity or PCM related to that volume or LUN is executing at the same time, and decide accordingly whether to bring it online.

# For a volume-offline alert:
# ssh whpnnav6070n2 vol status nav_stage_whswnora1_f_clone
Volume                       State   Status         Options
nav_stage_whswnora1_f_clone  online  raid_dp, flex  nosnap=on, create_ucode=on,
                                                    convert_ucode=on, guarantee=none,
                                                    fractional_reserve=0
Clone, backed by volume 'nav_stage_whswnora1_f', snapshot 'hoclone'
Containing aggregate: 'data2'

# For a LUN-offline alert:
# ssh whpnnav6070n2 lun show | grep drive_d.lun
/vol/nav_prod_whpwndoc01_f/whpwndoc01/drive_d.lun  50.0g (53694627840)  (r/w, online, mapped)
/vol/nav_prod_whpwndoc02_f/whpwndoc02/drive_d.lun  50.0g (53694627840)  (r/w, online, mapped)
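A quick way to surface offline LUNs from `lun show` output is to grep for the state keyword. This sketch uses sample lines mirroring the output above, with the second line changed to offline purely for illustration:

```shell
#!/bin/sh
# Sketch: list any LUNs that lun show reports as offline.
# In practice: lun_show=$(ssh whpnnav6070n2 lun show)
lun_show='/vol/nav_prod_whpwndoc01_f/whpwndoc01/drive_d.lun  50.0g (53694627840)  (r/w, online, mapped)
/vol/nav_prod_whpwndoc02_f/whpwndoc02/drive_d.lun  50.0g (53694627840)  (r/w, offline, mapped)'

# "offline" never matches the "online" lines, so a plain grep suffices
offline_luns=$(printf '%s\n' "$lun_show" | grep 'offline' | awk '{ print $1 }')

if [ -n "$offline_luns" ]; then
  echo "Offline LUNs found:"
  echo "$offline_luns"
else
  echo "All LUNs online"
fi
```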

Alert 6: Filer taken over

Scenario:
A Critical event at 15 Jan 23:06 PST on Active/Active Controller
ampnnew9n1.annuity.aigrs.net:
Global Status: Critical; ampnnew9n2 has taken over this node.

Steps:

1) When this alert comes, run commands such as `uptime` and `cf status` on both nodes to check whether it is a false alert. If it is a true alert and one node is not responding, run the `partner` command with options from the other node to ensure that disks, aggregates, volumes, LUNs, etc. are working properly.
# ssh ampnnew9n1 uptime
11:27pm up 1 day, 22:22 144705443 NFS ops, 29655 CIFS ops, 0 HTTP ops, 0 FCP ops, 0 iSCSI ops
# ssh ampnnew9n1 cf status
Cluster enabled, ampnnew9n2 is up.
2) At the same time, check the /etc/messages file to find out the cause of the failure. It can be a hardware or a software issue.
3) Log a high-priority call with NetApp to drill down to the real cause of the takeover. If it is a hardware issue, ask NetApp for an ETA for hardware delivery and replacement, and convey it to the other stakeholders.
4) Get details of all files, shares, and LUNs residing on the affected filer and send mail to the Unix and Windows teams to check whether any application is affected.

Alert 7: Disk fail or multiple disk fail
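The false-alert triage in step 1 amounts to classifying the `cf status` text. A sketch of that logic, noting that the takeover string below is an assumed example (only the healthy "Cluster enabled" form appears in the transcript above), so match patterns against your filers' actual wording:

```shell
#!/bin/sh
# Sketch: classify cf status output during a takeover alert.
# Assumed sample; a live check would run: ssh <node> cf status
cf_status='ampnnew9n2 has taken over ampnnew9n1.'

case "$cf_status" in
  *"has taken over"*)  verdict="takeover confirmed" ;;
  *"Cluster enabled"*) verdict="both nodes up, likely a false alert" ;;
  *)                   verdict="unrecognized output, investigate manually" ;;
esac
echo "cf status check: $verdict"
```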


Scenario:
An Error event at 07 Jan 13:55 PST on Active/Active Controller
fwsapnav6070n2.annuity.aigrs.net:
Global Status: NonCritical; Disk on adapter 0f, shelf 1, bay 6, failed.

Steps:

1) If a disk-failure alert comes, validate the failed disk status with the command `aggr status -f` on the affected node. If it shows a single disk failure, check whether a case has already been opened by NetApp. If it has not been opened by NetApp itself, open a case with NetApp and ask for an ETA for delivery and replacement of the disk. Update the onshore team accordingly to facilitate the disk replacement.
# ssh fwsapnav6070n2 aggr status -f
Broken disks (empty)
2) If multiple disks fail, open a case with NetApp on an urgent-priority basis to get the root cause and ensure disk replacement as soon as possible. In the meantime, check whether any application is affected and that the status of both nodes is OK.
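The single-versus-multiple distinction in steps 1 and 2 can be made by counting failed-disk lines in `aggr status -f` output. The sample below is an assumed layout (the healthy case in the text shows only "Broken disks (empty)"), so adjust the pattern to your filers' actual output:

```shell
#!/bin/sh
# Sketch: count broken disks to pick the right escalation path.
# Assumed sample output; live data would come from: ssh <node> aggr status -f
aggr_status='Broken disks
RAID Disk Device  HA  SHELF BAY
failed    0f.22   0f  1     6'

broken=$(printf '%s\n' "$aggr_status" | grep -c '^failed')

if [ "$broken" -eq 0 ]; then
  echo "No broken disks"
elif [ "$broken" -eq 1 ]; then
  echo "Single disk failure: verify NetApp case and ETA"
else
  echo "Multiple disk failure: open urgent case with NetApp"
fi
```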

Alert 8: Host Down

Scenario:
A Critical event at 15 Jan 22:04 PST on Active/Active Controller
ampnnew9n1.annuity.aigrs.net:
The Active/Active Controller is down.

Steps:

1) If a host-down alert comes, first check with the `ssh <filername>` command whether the host is accessible. If it is accessible, it may be a false alert. If it is not accessible and the host really is down, check the /etc/messages file on the partner node and try to find any message that can help identify the cause. At the same time, log a case with NetApp at urgent priority and get the issue resolved. The outage can also be part of a pre-planned downtime, so examine the issue from all perspectives and decide accordingly.
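The accessible-versus-down decision in step 1 can be sketched as below. The reachability probe is stubbed so the branching logic can be shown without network access; in a live script it would be an ssh attempt with a timeout:

```shell
#!/bin/sh
# Sketch: classify a host-down alert by probing reachability.
check_host() {
  # Stub: pretend only 'ampnnew9n2' answers. A real probe would be
  # something like: ssh -o ConnectTimeout=10 "$1" version >/dev/null 2>&1
  [ "$1" = "ampnnew9n2" ]
}

classify_alert() {
  if check_host "$1"; then
    echo "host $1 reachable: possible false alert"
  else
    echo "host $1 down: check partner /etc/messages, log urgent NetApp case"
  fi
}

classify_alert ampnnew9n1
classify_alert ampnnew9n2
```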

Alert 9: Interface Down

Scenario:
An Error event at 15 Jan 16:31 PST on Interface e0e on Active/Active
Controller ampnaiga6070n2.annuity.aigrs.net:
Interface Status Down.

Steps:

1) Check with the `vif status` command which interface is down and what the cause is. It can be due to a hardware fault such as a loose network cable, a faulty port, etc.
# ssh ampnaiga6070n2 vif status | grep e0e
e0e: state up, since 15Jan2010 18:56:42 (2+09:17:31)
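The interface state can be parsed directly from the `vif status` line. A sketch, with the line hard-coded from the transcript above (note it shows the interface already back up, as often happens by the time the alert is worked):

```shell
#!/bin/sh
# Sketch: check an interface's state in vif status output.
# In practice: vif_line=$(ssh ampnaiga6070n2 vif status | grep e0e)
vif_line='e0e: state up, since 15Jan2010 18:56:42 (2+09:17:31)'

# Extract the word after "state " and before the comma
state=$(printf '%s' "$vif_line" | sed -n 's/^e0e: state \([a-z]*\),.*/\1/p')

if [ "$state" = "up" ]; then
  echo "e0e is up: alert may have been transient or already cleared"
else
  echo "e0e is down: check cabling and switch port"
fi
```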


7 Resolution Procedure

8 Escalation Procedure

9 Flag Procedure

10 Reference Commands

