
10/21/2019 How To Identify And Troubleshoot Slow Drain Device In Brocade SAN Environment - Dell Community



 How To Identify And Troubleshoot Slow Drain Device In Brocade SAN Environment

ECN-APJ
3 Argentium

05-10-2015 11:14 PM

https://www.dell.com/community/Connectrix/How-To-Identify-And-Troubleshoot-Slow-Drain-Device-In-Brocade/td-p/7093539

Introduction

In a SAN environment, performance degradation is a common issue. Device processing slows down, or many frame-drop
warnings appear, and business applications are eventually affected. Usually one or several devices cause the problem;
we call them slow drain devices. A slow drain device can be a host, a storage array, or a connected switch. For some
reason it receives more frames than it can process, so it cannot return buffer credits to the upstream device fast
enough, which causes network delay, congestion, or even frame loss. All of these lead to performance issues. The
bottleneck may lie at the physical layer (an SFP, a fiber cable, or the endpoint device itself) or in the SAN design,
for example when the actual data volume exceeds the maximum processing capability.

In this article we discuss how to identify and troubleshoot a slow drain device in a Brocade SAN environment.

Detailed Information

The cause of slow drain device

To understand the cause of the bottleneck, we first need to understand how switches implement flow control. Buffer
credits play the key role. Every switch port has a number of buffer credits, determined when the port and the connected
device negotiate the link. A port can send a frame only when it has a buffer credit available, and each frame sent
consumes one credit. Once the remote device has received the frame and freed the buffer, it sends back an
acknowledgment (an R_RDY primitive in Fibre Channel), and the available credit count goes up by one. Because buffer
credits are limited, a port that has run out of credits must wait, and network delay results. If a frame is held for
more than 500 ms, it is dropped and its credit is released.
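The credit mechanism described above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not how a Brocade switch is implemented: time advances in 1 ms steps, all frames are queued at time zero, and the receiver returns exactly one credit every `drain_interval_ms`. The function name and its parameters are hypothetical.

```python
from collections import deque

HOLD_TIME_MS = 500  # hold time quoted in the text above


def simulate(total_credits, frame_count, drain_interval_ms, horizon_ms=5000):
    """Toy model of credit-based flow control; returns (sent, dropped)."""
    available = total_credits
    outstanding = 0                      # frames sent whose credit is not yet returned
    waiting = deque([0] * frame_count)   # enqueue time (ms) of each queued frame
    sent = dropped = 0
    for now in range(horizon_ms + 1):
        # The receiver frees one buffer per interval and returns its credit.
        if now > 0 and now % drain_interval_ms == 0 and outstanding > 0:
            outstanding -= 1
            available += 1
        # Frames held longer than the hold time are discarded.
        while waiting and now - waiting[0] > HOLD_TIME_MS:
            waiting.popleft()
            dropped += 1
        # The sender transmits only while it has credits.
        while waiting and available > 0:
            waiting.popleft()
            available -= 1
            outstanding += 1
            sent += 1
    return sent, dropped


print(simulate(8, 100, drain_interval_ms=1))    # fast receiver: (100, 0)
print(simulate(8, 100, drain_interval_ms=100))  # slow drain:    (13, 87)
```

With a fast receiver every frame goes out; with a receiver that returns one credit per 100 ms, most frames exceed the hold time and are discarded, which is the slow drain behavior this article is about.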


Because flow control is credit-based, a bottleneck leads to congestion along the entire data path. If the path includes
a cascading (inter-switch) link, all traffic crossing that link is affected. A single bottleneck device can therefore
congest the entire network, so identifying it is an important part of troubleshooting a performance issue. On endpoint
devices such as hosts or storage, the system will report the bottleneck issue. Brocade switches log messages like the
following:

2015/01/15-18:55:34, [AN-1004], 335118, SLOT 6 | FID 128, WARNING, CHD_1B_TLI_SAN1, Congestion bottleneck on port 10/32. 91.33 pct. of 300 secs. affected.

2015/01/15-19:00:37, [AN-1004], 335119, SLOT 6 | FID 128, WARNING, CHD_1B_TLI_SAN1, Congestion bottleneck on port 10/32. 88.67 pct. of 300 secs. affected.

2015/01/15-19:05:40, [AN-1004], 335120, SLOT 6 | FID 128, WARNING, CHD_1B_TLI_SAN1, Congestion bottleneck on port 10/32. 83.33 pct. of 300 secs. affected.
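When many of these AN-1004 messages accumulate, it can help to extract the port and the affected percentage automatically. The following is a minimal sketch that assumes each message has been joined onto a single line and follows exactly the layout of the three samples above; the regular expression and the function name are illustrative, not part of any Brocade tool.

```python
import re

# Matches an AN-1004 message once it is on a single line.
AN_1004 = re.compile(
    r"(?P<timestamp>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}), \[AN-1004\],.*?"
    r"(?P<kind>\w+) bottleneck on port (?P<port>[\d/]+)\. "
    r"(?P<pct>[\d.]+) pct\. of (?P<window>\d+) secs\. affected"
)


def parse_bottleneck(line):
    """Return the fields of one AN-1004 message, or None if it does not match."""
    m = AN_1004.search(line)
    if m is None:
        return None
    pct = float(m["pct"])
    window = int(m["window"])
    return {
        "timestamp": m["timestamp"],
        "kind": m["kind"],
        "port": m["port"],
        "pct": pct,
        # Seconds of the reporting window during which the port was affected.
        "affected_secs": pct / 100 * window,
    }


SAMPLE = [
    "2015/01/15-18:55:34, [AN-1004], 335118, SLOT 6 | FID 128, WARNING, CHD_1B_TLI_SAN1, "
    "Congestion bottleneck on port 10/32. 91.33 pct. of 300 secs. affected.",
    "2015/01/15-19:00:37, [AN-1004], 335119, SLOT 6 | FID 128, WARNING, CHD_1B_TLI_SAN1, "
    "Congestion bottleneck on port 10/32. 88.67 pct. of 300 secs. affected.",
    "2015/01/15-19:05:40, [AN-1004], 335120, SLOT 6 | FID 128, WARNING, CHD_1B_TLI_SAN1, "
    "Congestion bottleneck on port 10/32. 83.33 pct. of 300 secs. affected.",
]

for entry in map(parse_bottleneck, SAMPLE):
    print(entry["timestamp"], entry["port"], entry["kind"], entry["pct"])
```

In the first sample, 91.33 percent of a 300-second window means the port was congested for roughly 274 seconds.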

Clear the counters in Brocade switch

To troubleshoot a performance issue, the first step is to clear the switch counters with the following commands:

#>statsclear

#>slotstatsclear

If you cleared the counters earlier, you can collect supportshow or supportsave logs for analysis right away. If you
have not, first collect a copy of the current supportshow or supportsave output, then clear the counters. The first
copy lets you quickly see which ports already have errors, so you can check those ports first. The sfpshow command
shows the TX and RX power levels of a particular port.

Identify SAN topology


In a single-switch network, all connected devices are hosts or storage. In a multi-switch network, there are also ISLs
and E-Ports. Identifying the network topology helps administrators understand the data transmission path.

For example, the following islshow output shows the connectivity between the local Brocade switch and the remote
switches. No. 1: local switch port 57 connects to port 55 on remote switch CHD_1C_TLI_SAN1. No. 2: local switch port
129 connects to port 135 on remote switch CHD_1D_NGN_SAN1.

islshow :

1: 57-> 55 10:00:00:05:1e:d2:c4:00 7 CHD_1C_TLI_SAN1 sp: 8.000G bw: 8.000G TRUNK QOS

2:129->135 10:00:00:05:33:83:e3:00 5 CHD_1D_NGN_SAN1 sp: 8.000G bw: 8.000G TRUNK QOS

<truncated>
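To turn islshow lines into a small topology table, a parser along the following lines could be used. It is a sketch that assumes the line layout of the two sample lines in this section; the field names are my own labels, with the two fields between the remote port and the switch name read as the remote switch WWN and domain ID.

```python
import re
from typing import NamedTuple, Optional


class IslLink(NamedTuple):
    index: int
    local_port: int
    remote_port: int
    remote_wwn: str
    remote_domain: int
    remote_name: str


# index: local-> remote  WWN  domain  name  (trailing sp:/bw: fields are ignored)
ISL_LINE = re.compile(
    r"^\s*(\d+):\s*(\d+)->\s*(\d+)\s+"
    r"((?:[0-9a-fA-F]{2}:){7}[0-9a-fA-F]{2})\s+(\d+)\s+(\S+)"
)


def parse_isl_line(line: str) -> Optional[IslLink]:
    """Parse one islshow line; return None for lines that do not match."""
    m = ISL_LINE.match(line)
    if m is None:
        return None
    idx, local, remote, wwn, domain, name = m.groups()
    return IslLink(int(idx), int(local), int(remote), wwn, int(domain), name)


SAMPLE = [
    "1: 57-> 55 10:00:00:05:1e:d2:c4:00 7 CHD_1C_TLI_SAN1 sp: 8.000G bw: 8.000G TRUNK QOS",
    "2:129->135 10:00:00:05:33:83:e3:00 5 CHD_1D_NGN_SAN1 sp: 8.000G bw: 8.000G TRUNK QOS",
]

for link in filter(None, map(parse_isl_line, SAMPLE)):
    print(f"local port {link.local_port} -> {link.remote_name} port {link.remote_port}")
```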

Analyze port errors

As the following diagram shows, there are two Brocade switches in the SAN network.

[Diagram: two Brocade switches connected by an ISL]

Based on the islshow output above, we check the status of port 57 with the command portstatsshow 57:

portstatsshow 57

tim_txcrd_z 1381820 Time TX Credit Zero (2.5Us ticks)

tim_txcrd_z_vc 0- 3: 0 0 228512 231010

tim_txcrd_z_vc 4- 7: 521007 401291 0 0

tim_txcrd_z_vc 8-11: 0 0 0 0

tim_txcrd_z_vc 12-15: 0 0 0 0

er_rx_c3_timeout 0 Class 3 receive frames discarded due to timeout

er_tx_c3_timeout 23 Class 3 transmit frames discarded due to timeout


The Time TX Credit Zero counter (tim_txcrd_z) shows how long the port spent with zero transmit buffer credits, counted
in 2.5 µs ticks. A non-zero value does not by itself mean there is a performance issue. However, if the value is very
high, there could be congestion somewhere in the network. As a rule of thumb, a value below 30% of the number of
transmitted frames is normal.
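As a worked example, the tick counts from the port 57 output can be converted into time, and the 30% rule of thumb can be written as a ratio. This is a sketch: the excerpt does not include the transmitted-frame counter, so the `frames_tx` value below is a made-up placeholder, and the helper names are hypothetical.

```python
TICK_US = 2.5  # each tim_txcrd_z tick is 2.5 microseconds


def zero_credit_seconds(ticks):
    """Convert tim_txcrd_z ticks into seconds spent at zero TX credit."""
    return ticks * TICK_US / 1_000_000


def credit_zero_ratio(ticks, frames_tx):
    """Ratio of zero-credit ticks to transmitted frames (the rule of thumb above)."""
    return ticks / frames_tx


total_ticks = 1381820                          # tim_txcrd_z on port 57
vc_ticks = [228512, 231010, 521007, 401291]    # non-zero tim_txcrd_z_vc values (VC 2-5)

print(zero_credit_seconds(total_ticks))        # about 3.45 seconds
print(sum(vc_ticks) == total_ticks)            # the per-VC values add up to the total

frames_tx = 10_000_000                         # hypothetical value, not from the excerpt
print(credit_zero_ratio(total_ticks, frames_tx) < 0.30)
```

The per-VC breakdown shows that all of the zero-credit time on this port falls on virtual channels 2 to 5.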

The c3_timeout counter is used to verify whether frames are being lost. Prior to FOS 6.3.1 the counter had no
direction; from FOS 6.3.1 it is replaced by the er_rx_c3_timeout and er_tx_c3_timeout counters. When a port sends or
receives a frame, the frame occupies a buffer credit. If the port receives no response within 500 ms, the transmission
fails, the frame is dropped, and the counter is incremented by one. A non-zero value therefore indicates a performance
issue. In this case, er_tx_c3_timeout is not zero.

Let’s take a look at the downstream port,

portstatsshow 55

tim_txcrd_z 1259255 Time TX Credit Zero (2.5Us ticks)

tim_txcrd_z_vc 0- 3: 0 0 239711 218720

tim_txcrd_z_vc 4- 7: 403321 397503 0 0

tim_txcrd_z_vc 8-11: 0 0 0 0

tim_txcrd_z_vc 12-15: 0 0 0 0

er_rx_c3_timeout 31 Class 3 receive frames discarded due to timeout

er_tx_c3_timeout 0 Class 3 transmit frames discarded due to timeout


The er_rx_c3_timeout counter is not zero, which means frames here were also held for more than 500 ms and dropped.
Note that the upstream er_tx_c3_timeout does not always equal the downstream er_rx_c3_timeout; the values depend on
when you cleared the counters and collected the logs.

We have checked the ISL between the two switches; now let's find the congested device. We saw er_tx_c3_timeout on the
upstream port and er_rx_c3_timeout on the downstream port, so there should be an affected F-Port on the upstream
switch and another on the downstream switch.

How do we find these abnormal ports? We check the porterrshow output of both switches and find that port 21 and port
27 have problems:

portstatsshow 21

er_rx_c3_timeout 22 Class 3 receive frames discarded due to timeout

er_tx_c3_timeout 0 Class 3 transmit frames discarded due to timeout

portstatsshow 27

er_rx_c3_timeout 0 Class 3 receive frames discarded due to timeout

er_tx_c3_timeout 31 Class 3 transmit frames discarded due to timeout
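The counters collected so far can be put side by side to narrow down the suspect. The sketch below encodes one common reading of these counters, which is consistent with this case but is an assumption rather than something the outputs state: an F-Port with a non-zero er_tx_c3_timeout is transmitting toward a device that is not returning credits, while an F-Port with a non-zero er_rx_c3_timeout is an ingress port whose frames were discarded further along the path. Port types are taken from the text (57 and 55 are the ISL ends, 21 and 27 are F-Ports); in a real fabric the keys should also include the switch name.

```python
# Counter values copied from the portstatsshow excerpts in this article.
PORTS = {
    57: {"type": "E", "rx": 0, "tx": 23},
    55: {"type": "E", "rx": 31, "tx": 0},
    21: {"type": "F", "rx": 22, "tx": 0},
    27: {"type": "F", "rx": 0, "tx": 31},
}


def suspect_f_ports(ports):
    """F-Ports that discarded frames on transmit: likely facing the slow device."""
    return sorted(p for p, c in ports.items() if c["type"] == "F" and c["tx"] > 0)


def affected_ingress_ports(ports):
    """F-Ports whose received frames timed out elsewhere on the path."""
    return sorted(p for p, c in ports.items() if c["type"] == "F" and c["rx"] > 0)


print("suspect F-Ports:", suspect_f_ports(PORTS))
print("affected ingress F-Ports:", affected_ingress_ports(PORTS))
```

Under this reading, port 27 is the first place to look, and port 21 is a victim of the congestion rather than its cause.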


Let’s take a look at the diagram again:

Are any other ports affected? Since there is only one ISL between the two switches, all ports on the data transmission
path are affected as well. Note that port 26 on the downstream switch is not affected, because its data is congested
on the upstream switch.


In a multi-switch SAN environment, we can follow the same steps to find the abnormal device from the portstatsshow
output. In a single-switch environment, we only need to check the F-Ports.

Troubleshoot bottleneck devices

Next we need to find the cause of the bottleneck. The usual steps are:

1. Use porterrshow or portstatsshow to check whether there are errors at the physical layer

2. Use sfpshow to check the power levels of the SFP modules

3. Use switchshow to check the port status

4. Use fabriclog –show to check whether any port has been reset

5. Check the connected device if the steps above find nothing

Back to this case: we find some errors at the physical layer, and the RX power level is below -7 dBm, so we need to
check the fiber cable between the switch and the device.

portstatsshow 27

er_enc_out 34181 Encoding error outside of frames

er_bad_os 23541 Invalid ordered set

sfpshow 27

RX Power: -23.0 dBm (0.5 uW) 10.0 uW 1258.9 uW 15.8 uW 1000.0 uW

TX Power: -3.2 dBm (477.3 uW) 125.9 uW 631.0 uW 158.5 uW 562.3 uW
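The dBm and microwatt figures in sfpshow are two views of the same measurement, related by P(mW) = 10^(dBm/10). The sketch below applies that standard conversion and the -7 dBm RX limit mentioned in this case; the helper names are hypothetical. Note that the formula gives about 5 µW for -23.0 dBm, while the excerpt prints 0.5 uW, so one of those two figures may have been garbled in transcription. Either value is far below -7 dBm.

```python
import math

RX_LIMIT_DBM = -7.0  # RX limit used in this case


def dbm_to_uw(dbm):
    """Convert optical power from dBm to microwatts."""
    return 10 ** (dbm / 10) * 1000


def uw_to_dbm(uw):
    """Convert optical power from microwatts to dBm."""
    return 10 * math.log10(uw / 1000)


def rx_too_low(rx_dbm, limit_dbm=RX_LIMIT_DBM):
    """True when the received power is below the limit."""
    return rx_dbm < limit_dbm


print(round(dbm_to_uw(-3.2), 1))   # TX power of port 27 in microwatts
print(round(dbm_to_uw(-23.0), 2))  # RX power of port 27 in microwatts
print(rx_too_low(-23.0))           # True: check the cable and SFP
```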


After we replaced the fiber cable, the problem was solved, which indicates that the bad cable was the cause. Sometimes
no problem can be found on the switches; in that case, check the connected device (e.g. the HBA card).

Summary

The key to troubleshooting a Brocade SAN performance issue is to follow the congested data path to the bottleneck
device. Understanding the difference between er_rx_c3_timeout and er_tx_c3_timeout is very important.

We suggest clearing the counters while the devices are working normally. When a performance issue occurs, only the
logs collected during that period are meaningful.


2 Replies

RRR
5 Osmium

11-04-2015 02:45 AM

Re: How To Identify And Troubleshoot Slow Drain Device In Brocade SAN Environment

I'm actually looking for a way to detect what my actual slow draining devices are. We're experiencing
high latencies from time to time and we cannot find the slow device. Storage never shows any high
utilization: cache is fine, disks show 15% util, cpus are in the 20%-30% regions, so nothing points to
storage problems.

But how can we detect where the actual slowest part in the SAN environment is?

ECN-APJ
3 Argentium

11-08-2015 10:45 PM

Re: How To Identify And Troubleshoot Slow Drain Device In Brocade SAN Environment

Hi RRR.

Are you using Cisco MDS switches? If yes, you may also refer to this article:
http://www.cisco.com/c/en/us/products/collateral/storage-networking/mds-9700-series-multilayer-
direc....