or performance
If so, is the issue persistent (is the connection down now), or is it
intermittent?
How long has the issue been occurring?
Note: Exact time stamps are most valuable when trying to investigate
any SAN issue. If a timestamp cannot be provided, try to press for the
exact day of initial occurrence.
Was the connection ever functional? If so, when did the connection
begin to experience issues?
Have any hardware or components been replaced?
Log Collection
1 | Te j a B r o a c d e T r o u b l e s h o o t
There are two types of data collection associated with Brocade switches,
a supportshow and a supportsave.
Note: If the customer's issue falls outside the scope of a port issue or
basic functionality, have them collect a supportsave instead of
a supportshow.
Example:
Fabosv4.4switch:admin> supportsave -u anonymous -p password -h xxx.xxx.xxx.xxx -d /directory -l ftp
This command collects RASLOG, TRACE, supportShow, core file, and FFDC
data, and then transfers them to an FTP/SCP server or a USB device. This
operation can take several minutes.
Note: supportsave will transfer the existing trace dump file first, and then
automatically generate and transfer the latest one. There will be two trace
dump files transferred after running the following command:
OK to proceed? (yes, y, no, n): [no] y
Saving support information for switch:BR4100_IP127, module:RAS...
Saving support information for switch:BR4100_IP127, module:CTRACE_OLD...
Saving support information for switch:BR4100_IP127, module:CTRACE_NEW...
porterrshow
slotstatsshow -c 5 -p 60
sloterrshow -c 5 -v -p 60
This gives one interval of front-end port errors, five intervals of overall
traffic statistics, and five intervals of error statistics over a five-minute
time span. Collect the output into a file and upload it together with the
supportsave under the respective case ID.
Switchshow
Explanation
Provides a general overview of the logical switch status (no physical
components) plus a list of ports and their status.
The switchState should always be Online.
The switchDomain should have a unique ID in the fabric.
If zoning is configured it should be in the "ON" state.
Connected and operational ports should all show "Online". If a port shows
"No_Sync" while it is not disabled, there is likely a cable or SFP/HBA
problem.
If you have configured FabricWatch to enable port fencing you'll see
indications like the one here on port 75.
A port must be enabled before it can work at all.
Example
Sydney_ILAB_DCX-4S_LS128:FID128:admin> switchshow
switchName: Sydney_ILAB_DCX-4S_LS128
switchType: 77.3
switchState: Online
switchMode: Native
switchRole: Principal
switchDomain: 143
switchId: fffc8f
switchWwn: 10:00:00:05:1e:52:af:00
zoning: ON (Brocade)
switchBeacon: OFF
FC Router: OFF
Fabric Name: FID 128
Allow XISL Use: OFF
LS Attributes:
Mode 0]
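The header checks described above can be sketched in a few lines of Python. This is a minimal illustration, not a Brocade tool: the function names `parse_switchshow_header` and `check_switch_header` are hypothetical, and the parser assumes the plain "key: value" header layout shown in the example capture.

```python
def parse_switchshow_header(text):
    """Collect the 'key: value' header fields of a saved switchshow capture."""
    fields = {}
    for line in text.splitlines():
        if ">" in line:
            # Skip the CLI prompt line (assumed to be the only line with '>').
            continue
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def check_switch_header(fields):
    """Return a list of findings based on the rules in the explanation."""
    findings = []
    if fields.get("switchState") != "Online":
        findings.append("switchState is not Online")
    # Only meaningful when zoning is configured on the switch.
    if not fields.get("zoning", "").startswith("ON"):
        findings.append("zoning is not ON")
    return findings


SAMPLE = """\
Sydney_ILAB_DCX-4S_LS128:FID128:admin> switchshow
switchName: Sydney_ILAB_DCX-4S_LS128
switchState: Online
switchDomain: 143
switchWwn: 10:00:00:05:1e:52:af:00
zoning: ON (Brocade)
"""

if __name__ == "__main__":
    print(check_switch_header(parse_switchshow_header(SAMPLE)))
```

Checking that the switchDomain is unique needs the captures of every switch in the fabric, so it is left out of this sketch.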
sfpshow <slot>/<port>
Explanation
Also check the Current and Voltage of the SFP. A broken SFP often draws no
power at all, in which case you'll see both values drop to zero.
Example
Sydney_ILAB_DCX-4S_LS128:FID128:admin> sfpshow 1/1
Identifier: 3 SFP
Connector: 7 LC
Transceiver: 540c404000000000 2,4,8_Gbps M5,M6 sw Short_dist
Encoding: 1 8B10B
Baud Rate: 85 (units 100 megabaud)
Length 9u: 0 (units km)
Length 9u: 0 (units 100 meters)
Length 50u (OM2): 5 (units 10 meters)
Length 50u (OM3): 0 (units 10 meters)
Length 62.5u:2 (units 10 meters)
Length Cu: 0 (units 1 meter)
Vendor Name: BROCADE
Vendor OUI: 00:05:1e
Vendor PN: 57-1000012-01
Vendor Rev: A
Wavelength: 850 (units nm)
Options: 003a Loss_of_Sig,Tx_Fault,Tx_Disable
BR Max: 0
BR Min: 0
Serial No: UAF110480000NYP
Date Code: 101125
DD Type: 0x68
Enh Options: 0xfa
Status/Ctrl: 0x80
Alarm flags[0,1] = 0x5, 0x0
Warn Flags[0,1] = 0x5, 0x0
                                            Alarm                   Warn
                                    low         high        low         high
Temperature: 25     Centigrade      -10         90          -5          85
Current:     6.322  mAmps           1.000       17.000      2.000       14.000
Voltage:     3290.2 mVolts          2900.0      3700.0      3000.0      3600.0
RX Power:    -3.2   dBm (476.2uW)   10.0 uW     1258.9 uW   15.8 uW     1000.0 uW
TX Power:    -3.3   dBm (472.9 uW)  125.9 uW    631.0 uW    158.5 uW    562.3 uW
State transitions: 1
Last poll time: 06-20-2013 EST Thu 06:48:28
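As a rough illustration of the Current/Voltage check described in the explanation, the sketch below pulls both readings out of a saved sfpshow capture and flags an SFP whose readings have both dropped to zero. The function names and the `epsilon` tolerance are assumptions made for this example; the regular expression assumes the "Current: <value> mAmps" and "Voltage: <value> mVolts" layout shown above.

```python
import re


def sfp_power_readings(text):
    """Extract the Current (mA) and Voltage (mV) readings from sfpshow text."""
    readings = {}
    for name in ("Current", "Voltage"):
        match = re.search(rf"^{name}:\s*([-+]?\d+(?:\.\d+)?)", text, re.MULTILINE)
        if match:
            readings[name] = float(match.group(1))
    return readings


def sfp_draws_no_power(readings, epsilon=0.001):
    """True when both readings are present and effectively zero."""
    return len(readings) == 2 and all(abs(v) < epsilon for v in readings.values())


SAMPLE = """\
Temperature: 25 Centigrade
Current: 6.322 mAmps
Voltage: 3290.2 mVolts
"""

if __name__ == "__main__":
    readings = sfp_power_readings(SAMPLE)
    print(readings, sfp_draws_no_power(readings))
```

A capture in which either line is missing is reported as "not dead" rather than guessed at, since the absence of a reading is not the same as a zero reading.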
Porterrshow
For link state counters this is the most useful command on the switch.
There is a perception that it provides a "silver bullet" for solving port and
link issues, but that is not the case. It provides a snapshot of the content of
the LESB (Link Error Status Block) of a port at that particular point in time.
It does not tell us when these counters accumulated or over which time
frame. So in order to build a sensible picture of the state of the ports we
need a baseline. Create this baseline by resetting all counters so they start
from zero. To do this, issue the "statsclear" command on the CLI.
There are 7 columns you should pay attention to from a physical perspective.
enc_in - Encoding errors inside frames. These errors occur at the FC-1 layer
when encoding 8 bits to 10 and back or, with 10G and 16G FC, 64 bits to 66
and back. Because they hit bits that are part of a data frame, they are
counted in this column.
crc_err - An enc_in error might lead to a CRC error; however, this column
shows frames that were already marked as invalid because of a CRC error
earlier in the datapath. According to the FC specifications it is up to the
implementer whether to discard the frame right away or mark it as invalid
and send it on to the destination anyway. Both approaches have pros and
cons. So if you see crc_err in this column, the port has received a frame with
an incorrect CRC, but the corruption occurred further upstream.
crc_g_eof - This column is the same as crc_err except that the incoming
frames are NOT marked as invalid. When you see these, the enc_in counter
usually increases as well, but not necessarily. If the enc_in and/or enc_out
column increases as well, there is a physical link issue which could be
resolved by cleaning connectors, replacing a cable or (in rare cases)
replacing the SFP and/or HBA. If the enc_in and enc_out columns do NOT
increase, there is an issue between the SERDES chip and the SFP which
causes the CRC to mismatch the frame. This is a firmware issue which could
be resolved by upgrading to the latest FOS code. There are a couple of
defects listed to track these.
enc_out - The same kind of encoding error as enc_in, except that it occurred
outside normal frame boundaries, i.e. no host I/O frame was impacted.
This may seem harmless, but be aware that many primitive signals and
sequences travel in between normal data frames and are paramount for
Fibre Channel operations. Primitives which regulate credit flow (R_RDY and
VC_RDY) and which signal clock synchronization are especially important. If
this column increases on any port you'll likely run into performance problems
sooner or later, or you will see problems with link stability and sync errors
(see below).
Link_Fail - This means a port has received a NOS (Not Operational) primitive
from the remote side and must change its operational state to LF1 (Link
Fail 1), after which the recovery sequence commences (see the FC-FS
standard for details).
Loss_Sync - Loss of synchronization. The transmitter and receiver side of the
link maintain clock synchronization based on primitive signals which start
with a certain bit pattern (K28.5). If the receiver cannot sync its baud rate
to the point where it can distinguish these primitives, it loses sync and
hence cannot determine when a data frame starts.
Loss_Sig - Loss of Signal. This column shows a drop of light, i.e. no light (or
insufficient RX power) is observed for over 100 ms, after which the port goes
into a non-active state. This counter often increases when the link-loss
budget is overdrawn. Suppose, for instance, the TX side sends out light at
-4 dBm and the receiver's lower sensitivity threshold is -12 dBm. If the cable
quality degrades the signal to a value below that threshold, you will see the
port bounce very often and this counter increase. Other common culprits are
unclean connectors, patch panels and badly made fibre splices. These ports
should be shut down immediately and the cabling plant checked.
Replacing cables and/or bypassing patch panels is often a quick way to find
out where the problem is.
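The link-loss budget arithmetic from the Loss_Sig description can be worked through with a small Python sketch. The TX power (-4 dBm) and receiver threshold (-12 dBm) come from the text above; the individual loss figures for cable, connectors and patch panel are made-up illustrative values, and `link_margin_db` is a hypothetical helper name.

```python
def link_margin_db(tx_dbm, rx_threshold_dbm, losses_db):
    """Margin in dB left after subtracting all path losses from the TX power.

    A negative result means the received light is below the receiver's
    lower sensitivity threshold, so the budget is overdrawn.
    """
    received_dbm = tx_dbm - sum(losses_db)
    return received_dbm - rx_threshold_dbm


TX_DBM = -4.0             # from the example in the text
RX_THRESHOLD_DBM = -12.0  # from the example in the text

# Hypothetical losses in dB: cable run, two connectors, one patch panel.
healthy_path = [1.5, 0.75, 0.75, 3.0]
# The same path with an extra 3 dB from a dirty connector (assumed value).
degraded_path = healthy_path + [3.0]

if __name__ == "__main__":
    print(link_margin_db(TX_DBM, RX_THRESHOLD_DBM, healthy_path))
    print(link_margin_db(TX_DBM, RX_THRESHOLD_DBM, degraded_path))
```

With these assumed numbers the total budget is 8 dB: the healthy path uses 6 dB of it, and the degraded path uses 9 dB, which is the situation in which the port starts to bounce.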
The other columns relate more to protocol issues and/or performance
problems, which can be the result of a physical problem but not its cause.
In short, look at the 7 columns mentioned above and check that no port
shows an increasing value.
============================================
too_short/too_long - indicates a protocol error where SOF or EOF are
observed too soon or too late. These two columns rarely increase.
bad_eof - Bad End-of-Frame. This column indicates that the sender observed
an abnormality in a frame or in its transceiver while the frame header and
portions of the payload were already sent to the destination. The only way
for the transceiver to notify the destination is to invalidate the frame: it
truncates the frame and adds an EOFni or EOFa to the end. This signals to
the destination that the frame is corrupt and should be discarded.
F_Rjt and F_Bsy are often seen in FICON environments where control frames
could not be processed in time or are rejected based on fabric configuration
or fabric status.
c3timeout (tx/rx) - These counters indicate that a port is not able to
forward a frame to its destination in time. They show either a problem
downstream of this port (tx) or a problem on this port, where it has received
a frame meant to be forwarded to another port inside the same switch (rx).
Frames are ALWAYS discarded at the RX side (since that's where the buffers
hold the frame). The tx column is an aggregate of all rx ports that need to
send frames via this port according to the routing tables created by FSPF.
pcs_err - Physical Coding Sublayer. These values represent encoding errors
on 16G platforms and above. Since 16G speeds changed to 64b/66b
encoding/decoding, a separate control structure takes care of this.
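To make the difference between the two encoding schemes mentioned in the column descriptions concrete, the sketch below computes how much of the line rate carries data under 8b/10b and under 64b/66b. The helper names are hypothetical; the 8.5 Gbaud figure is taken from the "Baud Rate: 85 (units 100 megabaud)" line of the sfpshow example, on the assumption that this SFP is running 8b/10b as its "Encoding: 1 8B10B" line suggests.

```python
from fractions import Fraction


def encoding_efficiency(data_bits, line_bits):
    """Fraction of transmitted bits that carry data for an NbMb line code."""
    return Fraction(data_bits, line_bits)


def data_rate_gbps(line_rate_gbaud, data_bits, line_bits):
    """Usable data rate for a given line rate and line code."""
    return line_rate_gbaud * data_bits / line_bits


if __name__ == "__main__":
    print("8b/10b :", float(encoding_efficiency(8, 10)))
    print("64b/66b:", float(encoding_efficiency(64, 66)))
    # 8.5 Gbaud line rate from the sfpshow example, 8b/10b encoded.
    print("data rate:", data_rate_gbps(8.5, 8, 10), "Gbit/s")
```

The 20% overhead of 8b/10b shrinks to roughly 3% with 64b/66b, which is why the newer speeds use a different coding sublayer and a separate error counter.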
As a best practice it is wise to keep a record of these port errors and create
a new baseline every week. This allows you to quickly identify errors and
solve them before they become a problem with a lengthy resolution time.
Make sure you do this fabric-wide to maintain consistency across all switches
in that fabric.
Sydney_ILAB_DCX-4S_LS128:FID128:admin> porterrshow
     frames            enc    crc    crc    too    too    loss   frjt   fbsy   c3timeout     pcs
     tx       rx       in     err    g_eof  shrt   long   sig                  tx     rx     err
  0: 100.1m   53.4m    0      0      0      0      0      0      0      0      0      0      0
  1: 466.6k   154.5k   0      0      0      0      0      0      0      0      0      0      0
  2: 476.9k   973.7k   0      0      0      0      0      0      0      0      0      0      0
  3: 474.2k   155.0k   0      0      0      0      0      0      0      0      0      0      0
[Columns bad eof, enc out, disc c3, link fail and loss sync: headers only, values not recovered]
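The baseline approach recommended above can be sketched as a comparison of two counter snapshots. This is an illustration only: the snapshot dictionaries are assumed to have been filled by whatever parsing you use for your captures, the names `parse_counter` and `ports_with_new_errors` are hypothetical, and the reading of the `k`/`m`/`g` suffixes as thousands, millions and billions is an assumption based on the abbreviated values in the example output.

```python
# The seven physical-layer columns singled out in the text.
PHYSICAL_COUNTERS = (
    "enc_in", "crc_err", "crc_g_eof", "enc_out",
    "link_fail", "loss_sync", "loss_sig",
)

# Assumed meaning of the abbreviations used in porterrshow output.
SUFFIXES = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def parse_counter(token):
    """Convert a counter such as '0', '466.6k' or '100.1m' to an integer."""
    token = token.strip().lower()
    if token and token[-1] in SUFFIXES:
        return int(round(float(token[:-1]) * SUFFIXES[token[-1]]))
    return int(token)


def ports_with_new_errors(baseline, current):
    """Return {port: {counter: increase}} for physical counters that grew."""
    flagged = {}
    for port, counters in current.items():
        before = baseline.get(port, {})
        grown = {}
        for name in PHYSICAL_COUNTERS:
            delta = counters.get(name, 0) - before.get(name, 0)
            if delta > 0:
                grown[name] = delta
        if grown:
            flagged[port] = grown
    return flagged


if __name__ == "__main__":
    # Made-up snapshots for illustration; port 1 gains five enc_out errors.
    baseline = {0: {"enc_out": 0}, 1: {"enc_out": 0, "crc_err": 2}}
    current = {0: {"enc_out": 0}, 1: {"enc_out": 5, "crc_err": 2}}
    print(ports_with_new_errors(baseline, current))
```

Because abbreviated values lose precision, a comparison like this is most reliable shortly after the counters have been cleared, while the numbers are still printed in full.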
The following commands also help to determine whether the port is
logging into the fabric:
1. nsshow/nscamshow
2. nodefind
3.
4. perfmonitorshow
With FabricOS v7.0 the bottleneckmon was improved again. While the
core policy which detects credit starvation situations was pretty much
pre-defined before v7.0, you can now configure it in minute detail.
We are still testing that in more detail; for the moment I recommend using
the defaults.
So how to use it?
That's it. It will generate messages in your switch's error log if a congestion
or a latency bottleneck was found. Pretty straightforward. If you are not sure
you can check the status with: