5-Minute Initial Troubleshooting On Brocade Equ..

5/13/2014 5-minute initial troubleshooting on Brocade equ...
| Hitachi Data Systems

https://community.hds.com/docs/DOC-1000269
Very often the HDS support organisation (GSC) gets involved in cases where a massive amount of host logs, array dumps, and FC and IP traces has been
collected, easily adding up to many gigabytes of data. This is then accompanied by a very terse problem description such as "I have a problem with my
host, can you check?".
I'm sure the intention behind providing all that data is good, but what is missing is detail about the problem itself. We need a detailed explanation of
what the problem is, when it occurred, and whether it is still ongoing.

There are also things you can do yourself before opening a case with HDS. On many occasions you'll find that the feedback you get from us within 10 minutes
either fixes the problem or provides a simple workaround that reduces its impact. Further troubleshooting can then be done in
a somewhat less stressful time frame.

This example lists some things you can do on a Brocade platform (mainly because many of the problems I see are related to fabric issues and my
job is primarily focused on storage networking).

First of all, take a look at the overall health of the switch:

Command Explanation
switchstatusshow Provides an overview of the general components of the switch. These all need to
show up as HEALTHY and not (as shown here) as MARGINAL.
Sydney_ILAB_DCX-4S_LS128:FID128:admin> switchstatusshow
Switch Health Report Report time: 06/20/2013 06:19:17 AM
Switch Name: Sydney_ILAB_DCX-4S_LS128
IP address: 10.129.2.143
SwitchState: MARGINAL
Duration: 214:29

Power supplies monitor MARGINAL
Temperatures monitor HEALTHY
Fans monitor HEALTHY
WWN servers monitor HEALTHY
CP monitor HEALTHY
Blades monitor HEALTHY
Core Blades monitor HEALTHY
Flash monitor HEALTHY
Marginal ports monitor HEALTHY
Faulty ports monitor HEALTHY
Missing SFPs monitor HEALTHY
Error ports monitor HEALTHY

All ports are healthy
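
The check described above can be scripted once the output has been captured to text. The following is a minimal Python sketch, not an HDS or Brocade tool: it assumes the monitor lines look exactly like the sample above ("<component> monitor <STATE>"), and the function name unhealthy_monitors is made up for this example.

```python
import re

# A few lines copied from the switchstatusshow sample above.
SAMPLE = """\
Switch Health Report Report time: 06/20/2013 06:19:17 AM
Switch Name: Sydney_ILAB_DCX-4S_LS128
SwitchState: MARGINAL

Power supplies monitor MARGINAL
Temperatures monitor HEALTHY
Fans monitor HEALTHY
Marginal ports monitor HEALTHY
"""

# Assumed line layout: "<component name> monitor <STATE>".
MONITOR_RE = re.compile(r"^(?P<name>.+?)\s+monitor\s+(?P<state>[A-Z]+)\s*$")


def unhealthy_monitors(report):
    """Return {component: state} for every monitor line that is not HEALTHY."""
    problems = {}
    for line in report.splitlines():
        match = MONITOR_RE.match(line.strip())
        if match and match.group("state") != "HEALTHY":
            problems[match.group("name")] = match.group("state")
    return problems


print(unhealthy_monitors(SAMPLE))
```

Anything this returns is worth mentioning in the problem description when you open a case.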
switchshow Provides a general overview of the logical switch status (no physical components) plus
a list of ports and their status.

The switchState should always be Online.
The switchDomain should have an ID that is unique in the fabric.
If zoning is configured it should be in the "ON" state.

Connected and operational ports should all show "Online". If a port shows
"No_Sync" while it is not disabled, there is likely a cable or SFP/HBA problem.

If you have configured Fabric Watch port fencing, a fenced port shows up as
disabled with the reason attached, as port 75 does here.

Obviously, for any port to work it should be enabled.
Sydney_ILAB_DCX-4S_LS128:FID128:admin> switchshow
switchName: Sydney_ILAB_DCX-4S_LS128
switchType: 77.3
switchState: Online
switchMode: Native
switchRole: Principal
switchDomain: 143
switchId: fffc8f
switchWwn: 10:00:00:05:1e:52:af:00
zoning: ON (Brocade)
switchBeacon: OFF
FC Router: OFF
Fabric Name: FID 128
Allow XISL Use: OFF
LS Attributes: [FID: 128, Base Switch: No, Default Switch: Yes, Address Mode 0]

5-minute initial troubleshooting on Brocade equipment
created by elonden on Jun 18, 2013 11:22 PM, last modified by elonden on Sep 4, 2013 4:15 PM
Version 2
Index Slot Port Address Media Speed State Proto
============================================================
0 1 0 8f0000 id 4G Online FC E-Port 10:00:00:05:1e:36:02:bc "BR48000_1_IP146" (downstream)(Trunk master)
1 1 1 8f0100 id N8 Online FC F-Port 50:06:0e:80:06:cf:28:59
4 1 4 8f0400 id 4G No_Sync FC Disabled (Persistent)
5 1 5 8f0500 id N2 Online FC F-Port 50:06:0e:80:14:39:3c:15
8 1 8 8f0800 id N8 Online FC F-Port 50:06:0e:80:13:27:36:30
75 2 11 8f4b00 id N8 No_Sync FC Disabled (FOP Port State Change threshold exceeded)
76 2 12 8f4c00 id N4 No_Light FC Disabled (Persistent)
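
To pick out the ports that need attention from a long port list, the rules above can be sketched in Python. This is an illustration under assumptions: it expects the bladed layout shown above (Index, Slot, Port, Address, Media, Speed, State, Proto, then free text), the last sample line (index 77) is hypothetical and only added to show an enabled port in No_Sync, and the function name ports_needing_attention is invented here.

```python
# Port lines from the switchshow sample above, plus one hypothetical line
# (index 77) showing a port that is enabled but stuck in No_Sync.
SAMPLE_PORTS = """\
1 1 1 8f0100 id N8 Online FC F-Port 50:06:0e:80:06:cf:28:59
4 1 4 8f0400 id 4G No_Sync FC Disabled (Persistent)
75 2 11 8f4b00 id N8 No_Sync FC Disabled (FOP Port State Change threshold exceeded)
76 2 12 8f4c00 id N4 No_Light FC Disabled (Persistent)
77 2 13 8f4d00 id N8 No_Sync FC
"""


def ports_needing_attention(port_lines):
    """Return (port, reason) tuples for ports worth a closer look."""
    findings = []
    for line in port_lines.splitlines():
        fields = line.split(None, 8)  # at most 9 fields, the last is free text
        if len(fields) < 8 or not fields[0].isdigit():
            continue  # header, separator or blank line
        index, slot, port = fields[0], fields[1], fields[2]
        state = fields[6]
        detail = fields[8] if len(fields) > 8 else ""
        name = f"{slot}/{port} (index {index})"
        if detail.startswith("Disabled"):
            # Persistently disabled ports were switched off on purpose.
            if "Persistent" not in detail:
                findings.append((name, "not persistent: " + detail))
        elif state == "No_Sync":
            findings.append((name, "No_Sync while enabled: check cable, SFP, HBA"))
    return findings


for name, reason in ports_needing_attention(SAMPLE_PORTS):
    print(name, "->", reason)
```

Switches without blades have no Slot column, so the field positions would need adjusting there.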
sfpshow <slot>/<port> One of the most important pieces of a link, irrespective of mode and distance, is the
SFP. On newer hardware and software it provides a lot of information on the overall health
of the link.

With older FOS code there could be a discrepancy between what this output displayed
and what was actually plugged into the port. The reason is that the
SFPs are only polled every now and then for status and update information. A
persistently disabled port was not polled at all, so you could plug in another SFP
and sfpshow would still display the old information. With FOS 7.0.1 and up this has been
corrected, and you can now also see the latest polling time per SFP.

The question we often get is: "What should these values be?" The answer is "It
depends". As you can imagine, a shortwave 4G SFP draws less current than a
longwave 100 km SFP, so the SFP specifications should be consulted. As a
rule of thumb, the signal level arriving at the far end is the TX power minus the
link loss. The result should be within the RX power specifications of the
receiving SFP.

Also check the Current and Voltage of the SFP. A broken SFP often draws no
power at all, and you'll see these two values drop to zero.
Sydney_ILAB_DCX-4S_LS128:FID128:admin> sfpshow 1/1
Identifier: 3 SFP
Connector: 7 LC
Transceiver: 540c404000000000 2,4,8_Gbps M5,M6 sw Short_dist
Encoding: 1 8B10B
Baud Rate: 85 (units 100 megabaud)
Length 9u: 0 (units km)
Length 9u: 0 (units 100 meters)
Length 50u (OM2): 5 (units 10 meters)
Length 50u (OM3): 0 (units 10 meters)
Length 62.5u:2 (units 10 meters)
Length Cu: 0 (units 1 meter)
Vendor Name: BROCADE
Vendor OUI: 00:05:1e
Vendor PN: 57-1000012-01
Vendor Rev: A
Wavelength: 850 (units nm)
Options: 003a Loss_of_Sig,Tx_Fault,Tx_Disable
BR Max: 0
BR Min: 0
Serial No: UAF110480000NYP
Date Code: 101125
DD Type: 0x68
Enh Options: 0xfa
Status/Ctrl: 0x80
Alarm flags[0,1] = 0x5, 0x0
Warn Flags[0,1] = 0x5, 0x0
Alarm Warn
low high low high
Temperature: 25 Centigrade -10 90 -5 85
Current: 6.322 mAmps 1.000 17.000 2.000 14.000
Voltage: 3290.2 mVolts 2900.0 3700.0 3000.0 3600.0
RX Power: -3.2 dBm (476.2uW) 10.0 uW 1258.9 uW 15.8 uW 1000.0 uW
TX Power: -3.3 dBm (472.9 uW) 125.9 uW 631.0 uW 158.5 uW 562.3 uW

State transitions: 1
Last poll time: 06-20-2013 EST Thu 06:48:28
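
The threshold columns in this output (alarm low/high, warn low/high) can be checked mechanically. Below is a minimal Python sketch using the readings from the sfpshow 1/1 sample; the helper names are made up for this example, and the numbers are typed in by hand rather than parsed from the switch. It also shows the dBm to microwatt conversion, where 0 dBm equals 1 mW.

```python
import math


def dbm_to_uw(dbm):
    """Convert optical power from dBm to microwatts (0 dBm = 1 mW = 1000 uW)."""
    return 1000.0 * 10 ** (dbm / 10.0)


def uw_to_dbm(uw):
    """Convert optical power from microwatts to dBm."""
    return 10.0 * math.log10(uw / 1000.0)


def classify(value, alarm_low, alarm_high, warn_low, warn_high):
    """Compare one reading against the alarm and warning thresholds of the SFP."""
    if value < alarm_low or value > alarm_high:
        return "ALARM"
    if value < warn_low or value > warn_high:
        return "WARN"
    return "OK"


# value, alarm low, alarm high, warn low, warn high (from the sample output)
readings = {
    "Current (mA)": (6.322, 1.0, 17.0, 2.0, 14.0),
    "Voltage (mV)": (3290.2, 2900.0, 3700.0, 3000.0, 3600.0),
    "RX Power (uW)": (476.2, 10.0, 1258.9, 15.8, 1000.0),
    "TX Power (uW)": (472.9, 125.9, 631.0, 158.5, 562.3),
}
for name, (value, *limits) in readings.items():
    print(f"{name}: {value} -> {classify(value, *limits)}")

# A dead SFP that draws no current falls below the low alarm threshold.
print("Current 0.0 mA ->", classify(0.0, 1.0, 17.0, 2.0, 14.0))
```

The switch prints the dBm value rounded to one decimal, which is why converting -3.2 dBm back gives a slightly different microwatt figure than the 476.2 uW shown.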
porterrshow For link state counters this is the most useful command on the switch. There is,
however, a perception that it provides a silver bullet for solving port and
link issues, and that is not the case. It provides a snapshot of the content
of the LESB (Link Error Status Block) of each port at that particular point in time. It
does not tell us when these counters accumulated or over which time
frame. So in order to create a sensible picture of the status of the ports we need
a baseline. Create one by resetting all counters so they start from zero.
To do this, issue the "statsclear" command on the CLI.

Sydney_ILAB_DCX-4S_LS128:FID128:admin> porterrshow
        frames        enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy   c3timeout      pcs
         tx     rx     in    err  g_eof   shrt   long    eof    out     c3   fail   sync    sig                   tx     rx    err
  0: 100.1m  53.4m      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  1: 466.6k 154.5k      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  2: 476.9k 973.7k      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  3: 474.2k 155.0k      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0

There are seven columns you should pay attention to from a physical perspective.

enc_in - Encoding errors inside frames. These errors occur at the FC-1 layer when
encoding 8 bits to 10 and back or, with 10G and 16G FC, 64 bits to 66
and back. Because they hit bits that are part of a data frame, they are
counted in this column.

crc_err - An enc_in error might lead to a CRC error. This column, however, shows
frames that were marked as invalid because of a CRC error earlier in
the datapath. According to the FC specifications it is up to the implementer
whether to discard such a frame right away or to mark it as invalid and
send it to the destination anyway. There are pros and cons to both approaches. So
if you see crc_err increase, it means the port has received a frame
with an incorrect CRC, but the corruption occurred further upstream.

crc_g_eof - This column is the same as crc_err, except that the incoming frames are
NOT marked as invalid. When this counter increases, the enc_in counter most often
increases as well, but not necessarily. If the enc_in and/or enc_out column increases too,
there is a physical link issue which can be resolved by cleaning connectors,
replacing a cable or (in rare cases) replacing the SFP and/or HBA. If the enc_in
and enc_out columns do NOT increase, there is an issue between the SERDES
chip and the SFP which causes the CRC to mismatch the frame. This is a
firmware issue which can be resolved by upgrading to the latest FOS code.
There are a couple of defects listed to track these.

enc_out - The same encoding error as enc_in, except that it occurred
outside normal frame boundaries, i.e. no host I/O frame was impacted. This may
seem harmless, but be aware that many primitive signals and sequences
travel in between normal data frames, and these are paramount for Fibre Channel
operations. The primitives that regulate credit flow (R_RDY and VC_RDY)
and those that signal clock synchronization are especially important. If this column
increases on any port, you'll likely run into performance problems sooner or later, or
you will see problems with link stability and sync errors (see below).

Link_Fail - This means a port has received a NOS (Not Operational) primitive sequence from
the remote side and needs to change its operational state to LF1 (Link Failure
1), after which the recovery sequence commences (see the FC-FS
standard for details).

Loss_Sync - Loss of synchronization. The transmitter and receiver sides of the link
maintain clock synchronization based on primitive signals which start with a
certain bit pattern (K28.5). If the receiver is not able to sync its baud rate to the
point where it can distinguish these primitives, it loses sync and hence
cannot determine when a data frame starts.

Loss_Sig - Loss of signal. This column shows a drop of light, i.e. no light (or
insufficient RX power) is observed for over 100 ms, after which the port goes into a
non-active state. This counter often increases when the link-loss budget is
overdrawn. Suppose, for instance, that the TX side sends out light at -4 dBm and the
lower sensitivity threshold of the receiver is -12 dBm. If the cable attenuates the
signal to a value below that threshold, you will see the port bounce very often
and this counter increase. Other frequent culprits are unclean connectors, patch
panels and badly made fibre splices. These ports should be shut down
immediately and the cabling plant checked. Replacing cables and/or bypassing
patch panels is often a quick way to find out where the problem is.
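
The link-loss arithmetic in the Loss_Sig description can be written out as a small Python sketch. The -4 dBm transmit level and the -12 dBm receiver threshold come from the text; the individual loss figures are hypothetical values chosen only to show a healthy and an overdrawn budget.

```python
def rx_power_dbm(tx_dbm, losses_db):
    """Expected receive power: transmit power minus the sum of all link losses."""
    return tx_dbm - sum(losses_db)


def margin_db(tx_dbm, losses_db, rx_sensitivity_dbm):
    """Headroom above the receiver threshold; negative means the budget is overdrawn."""
    return rx_power_dbm(tx_dbm, losses_db) - rx_sensitivity_dbm


TX_DBM = -4.0       # transmit level from the example in the text
RX_MIN_DBM = -12.0  # lower sensitivity threshold from the example in the text

# Hypothetical per-element losses in dB (fibre run, connector, patch panel).
clean_path = [1.5, 0.5, 0.5]
dirty_path = [1.5, 3.0, 4.0]

for label, losses in (("clean", clean_path), ("dirty", dirty_path)):
    margin = margin_db(TX_DBM, losses, RX_MIN_DBM)
    verdict = "ok" if margin > 0 else "budget overdrawn, expect Loss_Sig"
    print(f"{label}: RX {rx_power_dbm(TX_DBM, losses)} dBm, margin {margin} dB, {verdict}")
```

With the clean path the receiver sees -6.5 dBm and keeps 5.5 dB of headroom; with the dirty path it sees -12.5 dBm, which is below the threshold.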

The other columns relate more to protocol issues and/or performance
problems, which can be the result of a physical problem but not its cause. In
short, look at the seven columns mentioned above and check that no port shows an
increasing value.
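
Checking whether any of the seven physical counters increased means comparing two porterrshow captures. The Python sketch below is one way to do that, under several assumptions: each port's 18 values sit on a single line (the capture in this article wraps them), the column order matches the sample shown here, the k/m/g suffixes mean thousand/million/billion, and the second snapshot is invented for illustration. The function names are not Brocade commands.

```python
# Column order as in the porterrshow sample in this article (assumed fixed).
COLUMNS = ("frames_tx", "frames_rx", "enc_in", "crc_err", "crc_g_eof", "too_shrt",
           "too_long", "bad_eof", "enc_out", "disc_c3", "link_fail", "loss_sync",
           "loss_sig", "frjt", "fbsy", "c3timeout_tx", "c3timeout_rx", "pcs_err")
PHYSICAL = ("enc_in", "crc_err", "crc_g_eof", "enc_out",
            "link_fail", "loss_sync", "loss_sig")
SUFFIX = {"k": 1e3, "m": 1e6, "g": 1e9}  # assumed meaning of abbreviated counters


def to_number(token):
    """Turn '466.6k' into 466600.0 and '12' into 12.0."""
    multiplier = SUFFIX.get(token[-1].lower())
    return float(token[:-1]) * multiplier if multiplier else float(token)


def parse_porterrshow(text):
    """Return {port_index: {column: value}} for every complete port line."""
    ports = {}
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        values = rest.split()
        if sep and head.strip().isdigit() and len(values) == len(COLUMNS):
            ports[int(head)] = dict(zip(COLUMNS, map(to_number, values)))
    return ports


def increased(before, after):
    """Return {port: {column: delta}} for physical counters that went up."""
    report = {}
    for port, now in after.items():
        then = before.get(port, {})
        grown = {c: now[c] - then.get(c, 0) for c in PHYSICAL if now[c] > then.get(c, 0)}
        if grown:
            report[port] = grown
    return report


baseline = parse_porterrshow(
    "0: 100.1m 53.4m 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"
    "1: 466.6k 154.5k 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n")
# Hypothetical later capture in which port 1 has started to log errors.
later = parse_porterrshow(
    "0: 100.9m 53.8m 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n"
    "1: 470.1k 156.0k 12 0 3 0 0 0 250 0 0 1 0 0 0 0 0 0\n")
print(increased(baseline, later))
```

Because the switch abbreviates large counters, small increases can hide inside a value such as 1.2k; clearing the counters first keeps the numbers small enough for an exact comparison.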

============================================
too_short/too_long - Indicates a protocol error where SOF or EOF is observed
too soon or too late. These two columns rarely increase.

bad_eof - Bad End-of-Frame. This column indicates that the sender
observed an abnormality in a frame or its transceiver while the frame header
and portions of the payload had already been sent to the destination. The only way for
the sender to notify the destination is to invalidate the frame: it truncates the
frame and adds an EOFni or EOFa to the end. This signals to the destination that the
frame is corrupt and should be discarded.

F_Rjt and F_Bsy are often seen in FICON environments where control frames could
not be processed in time or are rejected based on fabric configuration or fabric
status.

c3timeout (tx/rx) - These counters indicate that a port is not able to
forward a frame to its destination in time. They show either a problem
downstream of this port (tx) or a problem on this port, where it has received a frame
meant to be forwarded to another port inside the same switch (rx). Frames are
ALWAYS discarded at the RX side (since that's where the buffers hold the frame).
The tx column is an aggregate of all rx ports that need to send frames via this
port according to the routing tables created by FSPF.

pcs_err - Physical Coding Sublayer. These values represent encoding errors on
16G platforms and above. Since 16G speeds use 64b/66b
encoding/decoding, there is a separate control structure that takes care of this.

As a best practice it is wise to keep a record of these port errors and create a new
baseline every week. This allows you to quickly identify errors and solve them
before they become a problem with a prolonged resolution time. Make sure
you do this fabric-wide to maintain consistency across all switches in that fabric.

Make sure that all of these physical issues are solved first. No software can compensate for hardware problems, and the HDS support organization will give you
this task anyway before starting work on the issue.

As for which information to collect, please refer to https://tuf.hds.com where you will find pages for all GSC supported products and a method for collecting
it.

Regards,
Erwin
Tags: gsc , brocade , troubleshooting
3 Comments
Cris Danci Jun 19, 2013 12:18 AM
Erwin van Londen nice contribution.

I don't think a lot of users will keep baselines, so we should have something about using portstatsclear to reset the counters during the initial diagnostics.
Also, if we have ISLs, portstatsshow to get some credit counters.
mac4techs Jun 19, 2013 11:57 AM
This is a really nice share. The following commands will also help to troubleshoot whether the port is logging into the fabric:
1. nsshow/nscamshow
2. nodefind
3. fcping - to check the zoning
4. perfmonitorshow
elonden Jun 19, 2013 4:58 PM
The intention of this post is merely to get all physical issues out of the way first. In the majority of cases one or more physical issues are the
underlying cause of a more widespread problem. This ranges from login issues to buffer-credit problems and a whole range of protocol errors which could be
the result of one or more links behaving badly.

So the name of the game is: make sure all physical problems are fixed first, because there is nothing we can do from a support perspective if this isn't done.
We cannot re-program a cable to behave well.