Proposed Storage Normal Troubleshooting Run Book

Release: V0.1
Owner: RS Storage
Originator: Neeraj Kumar
First Publish Date: January 15, 2010
Document Version: 1.0
Approver: AIGRS Storage team (approval pending)
Contents

1  Purpose
2  Preliminary Tasks
3  Prerequisites
4  Revisions Summary
5  Methodology
6  Health Check Process
7  Resolution Procedure
8  Escalation Procedure
9  Flag Procedure
10 Reference Commands
Page 2 of 10
1 Purpose

The purpose of this run book is to define the scope and procedure for normal troubleshooting steps to resolve issues on filers. It can be referred to when resolving common issues on the storage systems installed at AIG. This is an additional process to be executed by the offshore HCL Storage operations team.

Important: Only RS Storage Team members are authorized to execute this document. It can be referred to in order to understand the rules to be followed while performing operational tasks for managing storage systems.
2 Preliminary Tasks

Ensure that the following prerequisites are met before beginning to address any issue on the storage systems.

Step 1. Verify that you have access to both the command-line and GUI interfaces of the storage systems on which the issues need to be addressed.
3 Prerequisites

1. The user should have access to the DFM console, storage alerts, and email.
4 Revisions Summary

1. This is a new run book, still at the proposal stage.
2. Once approved, it will become Version 1 of this document.
3. This space may be used to track changes in newer versions of this document.
5 Methodology

1. Common errors/issues and alerts were collated into the list of items to be checked.
2. Historical data for the last 3 weeks was taken from DFM; analysis revealed many recurring issues/alerts/messages that can be covered by this process.
3. The document was reviewed by the Storage and Backup tech leads and storage team members for execution, and formalized for language and specifications.
4. Management input is pending on the desired frequency of health checks.
6 Health Check Process

Alert 1:

Scenario: Volume Full

Steps:

disks to add. In that case, check whether any other volume in that aggregate is under-utilized; if so, take approval from Jeff or Tom and then reclaim some space from the under-utilized volume.
5) Sometimes the volume may have reached 100% by the time you log on to the system to check its status. In that situation, add some space to the volume first regardless, because if the volume stays at 100% for long it may affect SnapVault or SnapMirror replications running at the same time.
6) After adding space to the volume, ensure that the SnapVault or SnapMirror relationship is not affected. Also add the same space to the destination volume (in the case of a SnapMirror relationship); otherwise replication can fail.
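The decision in the steps above can be sketched as a small shell check. This is a hypothetical illustration, not part of the run book procedure: the volume names, sizes, and the 90% threshold are made-up assumptions, and in practice the input would come from ssh <filer> df.

```shell
#!/bin/sh
# Hypothetical sketch: flag volumes at or above a usage threshold so space can
# be added before they hit 100% and stall SnapVault/SnapMirror transfers.
THRESHOLD=90   # illustrative cut-off, not mandated by the run book

# In practice this output would come from: ssh <filer> df
# Here a captured sample is parsed instead; names and sizes are made up.
df_output='/vol/nav_stage/ 500GB 495GB 5GB 99% /vol/nav_stage/
/vol/apps/ 750GB 300GB 450GB 40% /vol/apps/'

echo "$df_output" | awk -v t="$THRESHOLD" '{
  pct = $5; sub(/%/, "", pct)
  if (pct + 0 >= t)
    print $1 " is " $5 " full - add space (and grow any SnapMirror destination too)"
}'
```

A check like this only highlights candidates; the approval and space-reclamation decisions above still apply.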
Alert 2:

Scenario: SnapVault Replication Failure

Steps:

1) Run the command snapvault status -l on the source or destination filer to trace out the cause of the replication failure.
# ssh whpnnav6070n2 snapvault status -l /vol/nav_stage_whswngen1_nmq1_logs_f/whswngen1_nmq1_logs
Snapvault primary is ON.
Source:                 whpnnav6070n2:/vol/nav_stage_whswngen1_nmq1_logs_f/whswngen1_nmq1_logs
Destination:            whpnss6070n2:/vol/navisys_stage/whswngen1_nmq1_logs
Status:                 Idle
Progress:
State:                  Source
Lag:                    47:50:35
Mirror Timestamp:       Fri Jan 15 20:07:00 PST 2010
Base Snapshot:          navstagesnap.0
Current Transfer Type:
Current Transfer Error: source contains no new data; suspending transfer to destination
Contents:
Last Transfer Type:
Last Transfer Size:     220 KB
Last Transfer Duration: 00:00:38
Last Transfer From:

2) From the output of the above command, we can find when the last SnapVault replication happened (refer to the Mirror Timestamp) and which snapshot was taken as the base.
3) If the base snapshot is an old one, check whether the current SnapVault snapshot has been created for this volume. If the current SnapVault snapshot for the particular volume has not been created, check the snapshot creation script with the command crontab -l on the respective administrative host to find any discrepancy in the script.
4) For Navisys, coordinate with the Navisys team to validate whether the snapshot creation script ran on that particular day. If they say no, close the ticket with the comment "snapshot creation script didn't run today"; if they say yes, ask them to provide the output so the issue can be drilled down.
5) Apart from these steps, for a better understanding check the log files /etc/messages and /etc/log/snapmirror for any errors related to the issue.
# ssh whpnss6070n2 rdfile /etc/log/snapmirror | grep whswngen1_nmq1_logs
dst Sun Jan 17 03:45:22 PST whpnnav6070n2:/vol/nav_stage_whswngen1_nmq1_logs_f/whswngen1_nmq1_logs whpnss6070n2:/vol/navisys_stage/whswngen1_nmq1_logs Request (Update)
dst Sun Jan 17 03:45:23 PST whpnnav6070n2:/vol/nav_stage_whswngen1_nmq1_logs_f/whswngen1_nmq1_logs whpnss6070n2:/vol/navisys_stage/whswngen1_nmq1_logs Defer (source contains no new data; suspending transfer to destination)
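The triage in steps 1 and 2 hinges on the Lag field of snapvault status -l. A minimal sketch of turning that field into a stale/fresh decision is shown below; the 24-hour threshold is an illustrative assumption, not a value from this run book.

```shell
#!/bin/sh
# Hypothetical sketch: decide from the "Lag:" line of `snapvault status -l`
# whether a relationship looks stale. The 24-hour threshold is illustrative.
lag_line='Lag:                    47:50:35'   # sample value from the output above

# Take the hours component of HH:MM:SS.
lag_hours=$(echo "$lag_line" | awk '{ split($2, t, ":"); print t[1] }')

if [ "$lag_hours" -ge 24 ]; then
  echo "stale: lag is ${lag_hours}h - check the snapshot creation script and base snapshot"
else
  echo "ok: lag is ${lag_hours}h"
fi
```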
Alert 3:

Scenario: Qtree Full

dfm: Error event on netapp:/Apps/RC372-navstguwh (Qtree Full)
An Error event at 08 Jan 20:59 PST on Qtree RC372-navstguwh on Volume Apps on Active/Active Controller netapp.sunamerica.com:
The qtree is 99.01% full (using 743 GB of 750 GB).

Steps:

1) If a qtree full alert comes, check the current status of the qtree (here /vol/Apps/RC372-navstguwh).
2) If it is an exported file system, check the /etc/exports file on the filer and find out which system (IP address) the file system is exported to. After getting the system information, find the person or team that owns the file system and send them a mail to either archive data to make space in the qtree or forward a storage request to increase the quota.
3) If it is a CIFS share, find the qtree owner's name with the help of the storage team (onshore as well as offshore) and send them a mail to archive data or make a storage request to increase the quota.
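Before mailing the owner, the usage figure can be pulled straight out of the DFM alert text. A minimal sketch, assuming the alert wording shown in the scenario above:

```shell
#!/bin/sh
# Hypothetical sketch: extract the usage percentage from a DFM qtree-full
# alert line to decide whether to chase archiving or a quota increase.
alert='The qtree is 99.01% full (using 743 GB of 750 GB).'   # sample from this alert

pct=$(echo "$alert" | sed 's/.*is \([0-9.]*\)% full.*/\1/')
echo "qtree at ${pct}% - contact the owner to archive data or raise the quota"
```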
Alert 4:

Scenario:

Steps:
Alert 5:

Scenario: Volume/LUN Offline

Steps:
1) If this alert comes, first check the current status of the volume or LUN. In the case of a volume: sometimes when a SnapMirror replication is initialized, the destination volume becomes restricted and a volume offline alert comes. In that situation the alert can be ignored; if that is not the case, coordinate with the other storage team members (onshore as well as offshore) to find out whether any activity or PCM is executing at the same time on that volume or LUN, and accordingly take the decision to bring it online.
# For Volume offline alert:
50.0g (53694627840)  (r/w,
50.0g (53694627840)  (r/w,
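The restricted-versus-offline distinction in step 1 can be sketched as follows. This is a hypothetical illustration: the captured vol status sample below is made up (the volume name is reused from the SnapVault example earlier), and in practice the output would come from ssh <filer> vol status.

```shell
#!/bin/sh
# Hypothetical sketch: classify a volume's state from captured `vol status`
# output before deciding whether an offline alert can be ignored.
vol_status='Volume         State       Status   Options
navisys_stage  restricted  raid_dp  nosnap=off'   # made-up sample

state=$(echo "$vol_status" | awk 'NR==2 { print $2 }')
case "$state" in
  restricted) echo "likely a snapmirror initialize in progress - alert can probably be ignored" ;;
  offline)    echo "coordinate with the storage team before bringing it online" ;;
  online)     echo "volume is online - possibly a fake alert" ;;
esac
```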
Alert 6:

Scenario: Filer Takeover

Steps:
1) When this alert comes, run commands such as uptime and cf status on both nodes to check whether it is a fake alert. If it is a true alert and one node is not responding, then from the other node run the partner command with options to ensure that the disks, aggregates, volumes, LUNs, etc. are working properly.
2) At the same time, check the /etc/messages file to find the cause of the failure. It can be a hardware or a software issue.
3) Log a call with NetApp at high priority to drill down to the real cause of the takeover. If it is a hardware issue, ask NetApp for an ETA for hardware delivery and replacement, and convey it to the other stakeholders.
4) Get details of all files, shares, and LUNs residing on the affected filer, and send a mail to the Unix and Windows teams to check whether any application is affected.
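The first check above can be sketched as a small classifier over cf status output. A hypothetical illustration: the sample strings mimic Data ONTAP 7-Mode cf status wording, and the partner node name is an assumption.

```shell
#!/bin/sh
# Hypothetical sketch: interpret captured `cf status` output to confirm
# whether a takeover alert is real before escalating.
classify_cf() {
  case "$1" in
    *"has taken over"*)  echo "takeover confirmed - check /etc/messages and log a high-priority NetApp case" ;;
    *"Cluster enabled"*) echo "both nodes up - possibly a fake alert" ;;
    *)                   echo "unrecognized cf status output - check manually" ;;
  esac
}

# Sample output in the style `cf status` prints after a takeover
# (node names reuse examples from this run book; the partner is assumed):
classify_cf 'whpnnav6070n2 has taken over whpnnav6070n1.'
```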
Alert 7:

Scenario: Disk Failure

Steps:
1) If a disk failure alert comes, validate the failed disk's status with the command aggr status -f on the affected node. If it shows a single disk failure, check whether a case has already been opened by NetApp. If NetApp has not opened one itself, open a case with NetApp and ask for an ETA for delivery and replacement of the disk. Update the onshore team accordingly to facilitate the disk replacement.

# ssh fwsapnav6070n2 aggr status -f
Broken disks (empty)

2) If multiple disks fail, open a case with NetApp on an urgent-priority basis to get the root cause and ensure disk replacement as soon as possible. In the meantime, check whether any application is affected and that the status of both nodes is OK.
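The single-versus-multiple distinction in steps 1 and 2 can be sketched by counting failed entries in captured aggr status -f output. A hypothetical illustration: the broken-disk listing and disk IDs below are made up, since the run book's own sample shows an empty list.

```shell
#!/bin/sh
# Hypothetical sketch: count failed disks in captured `aggr status -f`
# output. One failure is a routine replacement; more than one is urgent.
aggr_out='Broken disks
RAID Disk Device   HA  SHELF BAY
--------- ------   ------------
failed    0a.17    0a    1   1
failed    0a.23    0a    1   7'   # made-up sample

failed=$(echo "$aggr_out" | grep -c '^failed')
if [ "$failed" -gt 1 ]; then
  echo "$failed disks failed - open an urgent NetApp case"
elif [ "$failed" -eq 1 ]; then
  echo "single disk failed - verify a NetApp case exists, else open one"
else
  echo "no broken disks"
fi
```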
Alert 8:

Scenario: Host Down

A Critical event at 15 Jan 22:04 PST on Active/Active Controller ampnnew9n1.annuity.aigrs.net:
The Active/Active Controller is down.

Steps:
1) If an alert comes for host down, first check with the ssh filername command whether the host is accessible. If it is accessible, it may be a fake alert. If it is not accessible and the host really is down, check the /etc/messages file on the partner node for any message that can help find the cause. At the same time, log a case with NetApp at urgent priority and get the issue resolved. It can also be part of a pre-planned downtime, so examine the issue from all perspectives and take the decision accordingly.
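The reachability check in step 1 can be sketched with an injectable probe so the logic can be exercised without a filer. A hypothetical illustration: in practice the probe would be something like ssh ampnnew9n1 uptime; here the true/false builtins stand in for a reachable and an unreachable host.

```shell
#!/bin/sh
# Hypothetical sketch: distinguish a real host-down event from a fake alert.
# The probe command is a parameter so the decision logic is testable.
check_host() {
  host="$1"; probe="$2"
  if $probe >/dev/null 2>&1; then
    echo "$host reachable - possibly a fake alert, verify in the DFM console"
  else
    echo "$host down - check partner /etc/messages, rule out planned downtime, log an urgent NetApp case"
  fi
}

check_host ampnnew9n1 true    # stand-in for a successful ssh probe
check_host ampnnew9n1 false   # stand-in for an unreachable host
```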
Alert 9:

Scenario: Interface Down

Steps:
1) Check with the vif status command which interface is down and what the cause is. It can be due to a hardware fault such as a loose network cable, a fault in the port, etc.

# ssh ampnaiga6070n2 vif status | grep e0e
e0e: state up, since 15Jan2010 18:56:42 (2+09:17:31)
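The check in step 1 can be sketched by pulling the state field out of a captured vif status line. A minimal sketch, assuming the line format shown above; chasing cabling faults is only warranted when the state is not "up".

```shell
#!/bin/sh
# Hypothetical sketch: extract an interface's state from captured `vif status`
# output to confirm whether it is really down before chasing hardware faults.
vif_line='e0e: state up, since 15Jan2010 18:56:42 (2+09:17:31)'   # sample from this run book

state=$(echo "$vif_line" | sed 's/^[^ ]* state \([a-z]*\),.*/\1/')
if [ "$state" = "up" ]; then
  echo "e0e is up - possibly a fake alert"
else
  echo "e0e is down - check for a loose cable or a faulty switch port"
fi
```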
7 Resolution Procedure

8 Escalation Procedure

9 Flag Procedure

10 Reference Commands