
GOLD Tests and ES+ Crashes

EDCS-694119 CA Training

2008 Cisco Systems, Inc. All rights reserved.

Cisco Confidential

Generic Online Diagnostics



What Is GOLD?
GOLD defines a common framework for diagnostic operations across Cisco platforms running Cisco IOS Software.
Goal: check the health of hardware components and verify proper operation of the system data plane and control plane at run time and boot time.
Provides a common CLI and scheduling for field diagnostics, including:
  Bootup Tests (includes online insertion)
  Health Monitoring Tests (background, non-disruptive)
  On-Demand Tests (disruptive and non-disruptive)
  User Scheduled Tests (disruptive and non-disruptive)



How Does GOLD Work?
Diagnostic packet switching tests verify that the system is operating correctly:
  Are the supervisor control plane and forwarding plane functioning properly?
  Is the standby supervisor ready to take over?
  Are line cards forwarding packets properly?
  Are all ports working?
  Is the backplane connection working?
Other types of diagnostic tests, including memory and error correlation tests, are also available.

[Figure: system block diagram. Labels: Active Supervisor, CPU, Forwarding Engine, Fabric, Standby Supervisor, Line Card (x2), Forwarding Engine]



What Type of Failure Does GOLD Detect?
Diagnostic capabilities are built into the hardware.
Depending on the hardware, GOLD can catch:
  Port failure
  Bent backplane connector
  Bad fabric connection
  Malfunctioning forwarding engines
  Stuck control plane
  Bad memory



Diagnostic Operation

Boot-Up Diagnostics
  Run during system bootup, line card OIR, or supervisor switchover
  Make sure faulty hardware is taken out of service
  Switch(config)# diagnostic bootup level complete

Runtime Diagnostics

Health-Monitoring
  Non-disruptive tests run in the background
  Serves as HA trigger
  Switch(config)# diagnostic monitor module 5 test 2
  Switch(config)# diagnostic monitor interval module 5 test 2 00:00:15

On-Demand
  All diagnostic tests can be run on demand for troubleshooting purposes. On-demand diagnostics can also be used as a pre-deployment tool.
  Switch# diagnostic start module 4 test 8
  Module 4: Running test(s) 8 may disrupt normal system operation
  Do you want to continue? [no]: y
  Switch# diagnostic stop module 4

Scheduled
  Schedule diagnostic tests for verification and troubleshooting purposes
  Switch(config)# diagnostic schedule module 4 test 1 port 3 on Jan 3 2005 23:32
  Switch(config)# diagnostic schedule module 4 test 2 daily 14:45



Catalyst GOLD Operation Example
Switch# show diagnostic content mod 5
Module 5: Supervisor Engine 720 (Active)
<snip>
                                                       Testing Interval
  ID   Test Name                          Attributes   (day hh:mm:ss.ms)
  ==== ================================== ============ =================
   1) TestScratchRegister -------------> ***N****A***  000 00:00:30.00
   2) TestSPRPInbandPing --------------> ***N****A***  000 00:00:15.00
   3) TestTransceiverIntegrity --------> **PD****I***  not configured
   4) TestActiveToStandbyLoopback -----> M*PDS***I***  not configured
   5) TestLoopback --------------------> M*PD****I***  not configured
   6) TestNewIndexLearn ---------------> M**N****I***  not configured
   7) TestDontConditionalLearn --------> M**N****I***  not configured
   8) TestBadBpduTrap -----------------> M**D****I***  not configured
   9) TestMatchCapture ----------------> M**D****I***  not configured
  10) TestProtocolMatchChannel --------> M**D****I***  not configured
  11) TestFibDevices ------------------> M**N****I***  not configured
  12) TestIPv4FibShortcut -------------> M**N****I***  not configured
  13) TestL3Capture2 ------------------> M**N****I***  not configured
  14) TestIPv6FibShortcut -------------> M**N****I***  not configured
  15) TestMPLSFibShortcut -------------> M**N****I***  not configured
  16) TestNATFibShortcut --------------> M**N****I***  not configured
  17) TestAclPermit -------------------> M**N****I***  not configured
  18) TestAclDeny ---------------------> M**N****A***  000 00:00:05.00
  19) TestQoSTcam ---------------------> M**D****I***  not configured
<snip>

Diagnostics test suite attributes:
  M/C/* - Minimal bootup level test / Complete bootup level test / NA
    B/* - Basic ondemand test / NA
  P/V/* - Per port test / Per device test / NA
  D/N/* - Disruptive test / Non-disruptive test / NA
    S/* - Only applicable to standby unit / NA
    X/* - Not a health monitoring test / NA
    F/* - Fixed monitoring interval test / NA
    E/* - Always enabled monitoring test / NA
    A/I - Monitoring is active / Monitoring is inactive
    R/* - Power-down line cards and need reset supervisor / NA
    K/* - Require resetting the line card after the test has completed / NA
    T/* - Shut down all ports and need reset supervisor / NA
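
The twelve-character attribute strings in this output can be decoded position by position against the attribute legend. The Python sketch below is one way to do that. `ATTRIBUTE_LEGEND` and `decode_attributes` are hypothetical names introduced here for illustration (they are not part of any Cisco tool), and the position order is an assumption read off the legend as printed.

```python
# Position-by-position meaning of the 12-character attribute string,
# transcribed from the "Diagnostics test suite attributes" legend.
# The ordering is an assumption based on the order of the legend lines.
ATTRIBUTE_LEGEND = [
    {"M": "Minimal bootup level test", "C": "Complete bootup level test"},
    {"B": "Basic ondemand test"},
    {"P": "Per port test", "V": "Per device test"},
    {"D": "Disruptive test", "N": "Non-disruptive test"},
    {"S": "Only applicable to standby unit"},
    {"X": "Not a health monitoring test"},
    {"F": "Fixed monitoring interval test"},
    {"E": "Always enabled monitoring test"},
    {"A": "Monitoring is active", "I": "Monitoring is inactive"},
    {"R": "Power-down line cards and need reset supervisor"},
    {"K": "Require resetting the line card after the test has completed"},
    {"T": "Shut down all ports and need reset supervisor"},
]


def decode_attributes(attrs):
    """Return the legend text for every flag set in an attribute string."""
    if len(attrs) != len(ATTRIBUTE_LEGEND):
        raise ValueError("expected 12 attribute characters, got %d" % len(attrs))
    decoded = []
    for position, (char, meanings) in enumerate(zip(attrs, ATTRIBUTE_LEGEND), start=1):
        if char == "*":  # '*' means the attribute does not apply (NA)
            continue
        if char not in meanings:
            raise ValueError("unexpected %r at position %d" % (char, position))
        decoded.append(meanings[char])
    return decoded


# Attribute string of TestActiveToStandbyLoopback from the output above.
for meaning in decode_attributes("M*PDS***I***"):
    print(meaning)
```

Reading the attributes this way makes it easier to see at a glance which tests are disruptive before starting them on a production module.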



Catalyst GOLD Operation Example
Switch# show diagnostic result mod 7
Current bootup diagnostic level: complete
Module 7: CEF720 24 port 1000mb SFP
Overall Diagnostic Result for Module 7 : MINOR ERROR
Diagnostic level at card bootup: complete
Test results: (. = Pass, F = Fail, U = Untested)
  1) TestTransceiverIntegrity:
     Port  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
     ----------------------------------------------------------------------------
           U  U  .  U  .  .  U  U  .  .  U  U  .  .  U  U  U  U  U  U  U  U  U  U

  2) TestLoopback:
     Port  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
     ----------------------------------------------------------------------------
           .  <snip>

  3) TestScratchRegister -------------> .
  4) TestSynchedFabChannel -----------> .
<snip>

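
A per-port result row from `show diagnostic result` can be summarized by counting its symbols. The following Python sketch tallies the TestTransceiverIntegrity row shown in the previous example; `tally_results` and `RESULT_MEANING` are hypothetical names used only for this illustration.

```python
from collections import Counter

# Symbols as defined in the "Test results" legend of the CLI output.
RESULT_MEANING = {".": "pass", "F": "fail", "U": "untested"}


def tally_results(result_row):
    """Count pass/fail/untested entries in one per-port result row."""
    counts = Counter()
    for symbol in result_row.split():
        counts[RESULT_MEANING[symbol]] += 1
    return dict(counts)


# TestTransceiverIntegrity row for module 7, copied from the example output.
row = "U U . U . . U U . . U U . . U U U U U U U U U U"
print(tally_results(row))
```

Untested ports in a transceiver test commonly just mean that no transceiver was present to test, so a high untested count is not by itself a fault indication.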


Catalyst GOLD Operation Example
r1# show diagnostic description module 5 test ?
  <1-33>  Test ID Number

  ID   Test Name                      [On-Demand Test Attributes]
  ---  ------------------------------------------
  1    TestScratchRegister            [***N****]
  2    TestSPRPInbandPing             [***N****]
  3    TestTransceiverIntegrity       [**PD****]
  4    TestActiveToStandbyLoopback    [M*PDS***]
  5    TestLoopback                   [M*PD****]
  6    TestNewIndexLearn              [M**N****]
<snip>
r1# show diagnostic description module 5 test 2
TestSPRPInbandPing :
  By default, this test is enabled as health-monitoring test.
  The SP-RP Inband test catches most of the runtime software driver
  and hardware issues on supervisors. This is done by using diagnostic
  packet tests exercising the layer 2 forwarding engine, the L3-4
  forwarding engine, and the replication engine along the path from
  the Switch Processor to the Route Processor.
  Packets are sent at an interval of 15 seconds and 10 consecutive
  failures of the SP-RP Inband test result in failover to the
  redundant supervisor (default).
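
As a rough worked example of what these defaults imply, the time between the first failed probe and the failover is about the probe interval multiplied by the number of consecutive failures. The Python sketch below does that arithmetic; it is a simplified model that ignores processing delay, and `seconds_until_failover` is a hypothetical helper name.

```python
HM_INTERVAL_SECONDS = 15        # probe interval quoted in the test description
FAILURES_BEFORE_FAILOVER = 10   # consecutive failures quoted in the test description


def seconds_until_failover(interval_s, consecutive_failures):
    """Approximate detection time: one probe interval per consecutive failure."""
    return interval_s * consecutive_failures


print(seconds_until_failover(HM_INTERVAL_SECONDS, FAILURES_BEFORE_FAILOVER))  # 150
```

With the default values, a stuck SP-RP path would therefore take roughly two and a half minutes to trigger a supervisor failover.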



Recommendations
Bootup diagnostics:
  Set level to complete

On-demand diagnostics:
  Use as a pre-deployment tool: run complete diagnostics before putting hardware into a production environment
  Use as a troubleshooting tool when suspecting hardware failure

Scheduled diagnostics:
  Schedule key diagnostic tests periodically
  Schedule all non-disruptive tests periodically

Health-monitoring diagnostics:
  Key tests run by default
  Enable additional non-disruptive tests for specific functionalities enabled in your network: IPv6, MPLS, NAT


Reference:
http://www.cisco.com/c/en/us/td/docs/routers/7600/ios/15S/configuration/guide/7600_15_0s_book/diagtest.html
Alternatively, search for "cisco 7600 configuring online diagnostics" and open the first link.

VCCP
The Issue
Cisco has been working with individual customers on an issue related to memory components manufactured by a single supplier between 2005 and 2010.
The affected memory component is the DRAM. On most platforms, it is therefore only necessary to replace the DIMM and not the entire linecard/SUP.
In some cases, you might be required to replace the entire linecard/SUP. The TAC engineer can confirm this.

The Field Notices for all the individual products and the related error messages can be accessed via
www.cisco.com/go/memory

Symptoms
This issue does not affect boards while they are in operation. Board failure might occur after one or more of these actions:
  Reload
  Software upgrade
  Power cycle
One of these symptoms might be observed in the syslog on 7600 platform based devices:
*May 16 02:59:54.575: %PM_SCP-SP-1-LCP_FW_ERR: System resetting module 1 to recover from error: Linecard received system exception
*May 16 02:59:54.575: %OIR-SP-3-PWRCYCLE: Card in module 1, is being power-cycled Off (Module Reset due to exception or user request)
Alternatively, the card might crash repeatedly with this error reported in the syslog:
%EARL-DFC<n>-2-PATCH_INVOCATION_LIMIT: 10 Recovery patch

When Should You Be Cautious?

The issue does not affect recent products that are less than 5 years old or older products that are more than 10 years old. It affects only a few products whose memory components were manufactured by a single vendor between 2005 and 2010.

Fix on Failure Replacement Guidelines


Request a Return Material Authorization (RMA) for the product through normal service support channels.

Reference:
www.cisco.com/go/memory

ES+ Product Family

ES+ Series 4-Port 10GE Line Cards

ES+ Series 40-Port GE Line Cards

ES+ Series 2-Port 10GE Line Cards

ES+ Series 20-Port GE Line Cards

ES+

What is it?
The 7600-ES+ is an aggregate 40G linecard series targeting the Metro Ethernet market on the 7600 platform.

What are the various flavors?
  7600-ES+20G3C
  7600-ES+20G3CXL
  7600-ES+40G3C
  7600-ES+40G3CXL
  7600-ES+4TG3C
  7600-ES+4TG3CXL
  7600-ES+2TG3C
  7600-ES+2TG3CXL

ES+ Overview Flavors & Terminology

Excalibur (ES+)

  1G                10G               Ginsu [10G-OTN]   Combo (1G & 10G)
  7600-ES+20G3C     7600-ES+2TG3C     7600-ES+ITU-2TG   7600-ES+20C3C
  7600-ES+20G3CXL   7600-ES+2TG3CXL   7600-ES+ITU-4TG   7600-ES+20C3CXL
  7600-ES+40G3C     7600-ES+4TG3C                       7600-ES+40C3C
  7600-ES+40G3CXL   7600-ES+4TG3CXL                     7600-ES+40C3CXL

ES+

Each ES+ board consists of one Baseboard, one Link Daughter card and one Earl Daughter card.

Baseboard has no flavors.

Link Card flavors
  a. 4 ports of 10 Gigabit Ethernet (XFP form factor)
  b. 40 ports of 1 Gigabit Ethernet (SFP form factor)
  c. 2 ports of 10 Gigabit Ethernet (XFP form factor)
  d. 20 ports of 1 Gigabit Ethernet (SFP form factor)
  e. 2 ports of 10 OTN Gigabit Ethernet (SFP form factor)
  f. 4 ports of 10 OTN Gigabit Ethernet (SFP form factor)
  g. 2 ports of 10 Gigabit Ethernet (XFP form factor) plus
     20 ports of 1 Gigabit Ethernet (SFP form factor)
  h. 1 port of 10 Gigabit Ethernet (SFP form factor) plus
     10 ports of 1 Gigabit Ethernet (XFP form factor)

Earl Card flavors
  a. 3C (Lite)
  b. 3CXL

ES+ Troubleshooting
Getting Started

Linecard console is needed for debugging many problems
  Connect to the ES+ console using the attach <slot> CLI command.
  Get a good snapshot from the RP console using show hw-module slot <slot> tech-support.
  To display the versions of the various devices on the ES+, use the command show platform hardware version.

Troubleshooting card state - fault isolation
  Check the RP and ES+ console logs; there is almost always a message that tells you why the card did not come up.
  Check the card LED status.
  Use show module to check the current status.
  Watch out for FPD. Some devices may not be updated to the latest version required by the IOS image.
  Incompatible FPD version.
  Not enough power.

ES+ Modules - Hardware and Software Requirement

Hardware requirement
  Supported by all the Cisco 7600 series routers:
  7604, 7606, 7609, 7613 router (not in slot 1-8) and 7606-S, 7609-S.
  7600-ES+xx will be supported by all SUP720 models except PFC3A
  7600-ES+xx will be supported with RSP720
  7600-ES+xx will not be supported by SUP2, SUP32
Software requirement
  Supported from version 12.2(33)SRD of the Native IOS image
  CatOS and Hybrid images are not supported.

  show module                         To check the status of the linecard
  show power                          To check the available power
  show hw-module [all|slot <>] fpd    To check the FPD version
  show log                            Logs will always give some info
  attach <slot>                       To check the ES+ console messages

Troubleshooting interface state - common problems

  Incorrect optics
  Optics not matched on both ends
  Unsupported optics

routerdfc12#sh platform hardware transceiver ?
  brief      Brief device information
  config     Device configuration
  counters   Device statistics
  errors     Device error information
  registers  Device register contents
  status     Device status

Transceiver Verification
Router#show module 8
Mod Ports Card Type                              Model              Serial No.
--- ----- -------------------------------------- ------------------ -----------
  8     4 7600 ES+                               7600-ES+4TG3CXL    XXXABCDXXX

Mod MAC addresses                       Hw     Fw           Sw           Status
--- ---------------------------------- ------ ------------ ------------ -------
  8 001f.9e13.76e0 to 001f.9e13.76ef    0.303 12.2(33r)SRD 12.2(2008102 Ok

Mod  Sub-Module                  Model              Serial       Hw      Status
---- --------------------------- ------------------ ----------- ------- -------
  8  7600 ES+ DFC XL             7600-ES+3CXL       XXXABCDXXX   0.200  Ok
  8  7600 ES+ 4x10GE XFP         7600-ES+4TG        XXXABCDXXX   0.250  Ok

Mod  Online Diag Status
---- -------------------
  8  Pass

Router#show interfaces status module 8
Port    Name   Status       Vlan     Duplex  Speed  Type
Te8/1          connected    routed   full    10G    10Gbase-LR
Te8/2          disabled     1        full    10G    DWDM-51.72
Te8/3          notconnect   1        full    10G    No Connector
Te8/4          disabled     1        full    10G    No Connector

Transceiver Verification
Router#show idprom interface te8/1
IDPROM for transceiver TenGigabitEthernet8/1:
  Description                               = XFP optics (type 6)
  Transceiver Type:                         = OC192 + 10GBASE-L (97)
  Product Identifier (PID)                  = XFP-10GLR-OC192SR
  Vendor Revision                           = 05
  Serial Number (SN)                        = ONT11021087
  Vendor Name                               = CISCO-OPNEXT
  Vendor OUI (IEEE company ID)              = 00.0B.40 (2880)
  CLEI code                                 = WMOTBEVAAB
  Cisco part number                         = 10-1989-02
  Device State                              = Enabled.
  Date code (yy/mm/dd)                      = 07/01/09
  Connector type                            = LC.
  Encoding                                  = 64B/66B
                                              SONET Scrambled
                                              NRZ
  Minimum bit rate                          = 9900 Mbits/s
  Maximum bit rate                          = 10500 Mbits/s
  Power dissipation class                   = Pwr Level 2 (2.5 W max pwr dissipation).
  cdr function                              = supported.
  Tx Reference clock                        = not Required.
  Max link length for SMF fiber             = 10 km
  Max link length for EBW 50/125um fiber    = not supported
  Max link length for 50/125um fiber        = not supported
  Max link length for 62.5/125um fiber      = not supported
  Max link length for copper                = not supported
  Tx device technology                      = 1550 nm DFB
  Wavelength control technology             = no wavelength control
  Transceiver cooling technology            = un-cooled transmitter device
  Detector type                             = PIN detector
  Transmitter tuning                        = transmitter not tunable
  Supported CDR rates                       = 9.95 Gb/s
                                              10.3 Gb/s
                                              10.5 Gb/s
  Nominal laser wavelength                  = 1310 nm
  Wavelength tolerance w.r.t. nominal       = (+/-) 20 nm ...

Router#show interfaces te8/1 transceiver
Transceiver monitoring is disabled for all interfaces.
ITU Channel not available (Wavelength not available),
Transceiver is internally calibrated.
If device is externally calibrated, only calibrated values are printed.
++ : high alarm, + : high warning, - : low warning, -- : low alarm.
NA or N/A: not applicable, Tx: transmit, Rx: receive.
mA: milliamperes, dBm: decibels (milliwatts).

                                           Optical   Optical
          Temperature  Voltage  Current    Tx Power  Rx Power
Port      (Celsius)    (Volts)  (mA)       (dBm)     (dBm)
-------   -----------  -------  --------   --------  --------
Te8/1     35.0         0.00     51.5 --    -3.1      -3.5
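
The optical Tx and Rx power readings are reported in dBm, that is, decibels relative to one milliwatt. The Python sketch below shows the standard conversion to milliwatts, P(mW) = 10^(dBm/10). The function name `dbm_to_mw` and the sample inputs are illustrative assumptions, not values taken from a specific transceiver datasheet.

```python
def dbm_to_mw(dbm):
    """Convert optical power from dBm to milliwatts: P(mW) = 10 ** (dBm / 10)."""
    return 10 ** (dbm / 10.0)


# Illustrative inputs: 0 dBm is exactly 1 mW, and every -3 dB roughly halves the power.
for reading in (0.0, -3.0, -6.0):
    print("%5.1f dBm = %.3f mW" % (reading, dbm_to_mw(reading)))
```

Converting to milliwatts can make it easier to compare Tx power at one end of a link with Rx power at the other end when estimating link loss.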

ES+ LC IOS Crash

In the event of an ES+ linecard crash with errors such as:
*Mar 15 15:14:40.875 IST: %PM_SCP-SP-1-LCP_FW_ERR: System resetting module 7 to recover from error: x40g_crashinfo_init: Linecard received system exception
*Mar 15 15:14:40.875 IST: %OIR-SP-3-PWRCYCLE: Card in module 7, is being power-cycled Off (Module Reset due to exception or user request)
*Apr 6 11:16:55.122: %NP_DEV-DFC3-2-WATCHDOG: Watchdog detected on NP 2

Obtain the crashinfo file named in a message like the one below:
*Writing crashinfo to bootdisk:crashinfo_20110315151433-IST
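
When sifting through many log lines for crash signatures like the ones in this section, it can help to split each Cisco IOS message header of the form `%FACILITY-SUBFACILITY-SEVERITY-MNEMONIC:` into its fields. The Python sketch below is a minimal, illustrative parser; `SYSLOG_RE` and `parse_syslog` are hypothetical names, and the regular expression is an assumption that covers the message shapes quoted in this document rather than every possible IOS message.

```python
import re

# %FACILITY[-SUBFACILITY]-SEVERITY-MNEMONIC: with severity in the range 0-7.
SYSLOG_RE = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)"
    r"(?:-(?P<subfacility>[A-Z0-9_]+))?"
    r"-(?P<severity>[0-7])"
    r"-(?P<mnemonic>[A-Z0-9_]+):"
)


def parse_syslog(line):
    """Return the header fields of an IOS syslog line, or None if no header is found."""
    match = SYSLOG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["severity"] = int(fields["severity"])
    return fields


line = "*Apr 6 11:16:55.122: %NP_DEV-DFC3-2-WATCHDOG: Watchdog detected on NP 2"
print(parse_syslog(line))
```

In the messages quoted here, the subfacility (SP, DFC3, and so on) identifies where the message originated, which helps tie a crash to a specific module.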

Keepalive failure reset with/without crashinfo

The ES+ card reloads because it stops responding to keepalives, also known as "silent reload of ES+ cards".
Symptom:
*Sep 12 17:15:54.985: %OIR-SP-3-PWRCYCLE: Card in module 1, is being power-cycled off (Module not responding to Keep Alive polling)
*Sep 12 17:15:55.013: %C7600_PWR-SP-4-DISABLED: power to module in slot 1 set off (Module not responding to Keep Alive polling)
Upgrade to an IOS release with the fixes for CSCts25729 and CSCtr74953 (12.2(33)SRE5).
These two software enhancements ensure that, under conditions similar to those that previously led to a reset without crashinfo, a crashinfo or a mini-crashlog is created (and the linecard is gracefully restarted as before).

Module failed SCP download during bootup

Problem:
The message "Module failed SCP download" is observed in the RP logs.
Root Cause:
The message means that the boot process of the linecard did not complete. This can have a variety of causes and can be seen at different points during boot. If the ES+ card is continuously crashing during boot, the root cause could be FN63553.
Troubleshooting:
It is recommended to collect the following outputs from the SP:
  remote command switch show oir debug all
  remote command switch show oir state-machine scp_dnld all
  remote command switch show oir state-machine oir

Watchdog Reset
Problem:
The ES+ line card crashes during the execution of "show platform hardware config-pld" or "show platform hardware version". Both commands are included in the line card "show hw-module slot X tech-support".
Root Cause: In both cases, the crash happens when an attempt is made to read the PLD register on the ES+ line card. The read may time out, which triggers the watchdog to restart the line card.
Known DDTS: CSCtw77894, CSCti78408, CSCtz30983
Next Action: Please contact TAC with the crashinfo for confirmation.

Failed to Bootup in PUL Ph6


Problem: ES+ line card crashes during bootup, and remains down,
with an error like this:
%PM_SCP-SP-1-LCP_FW_ERR: System resetting module 8 to
recover from error: x40g_cardmgr_event_process: Failed in
PUL Ph6 NP: 3
Root Cause: All cards received for FA with this symptom have been
root-caused to a failure inside the Network Processor (NP) chip.
TroubleShooting:
Replace the failing line card.
Request from the customer to keep the failing line card with them
until a decision is made on whether FA is required.

Linecard received system exception


Problem: ES+ line card crashes due to system exception
Symptom:
%PM_SCP-SP-1-LCP_FW_ERR: System resetting module 1 to recover
from error: Linecard received system exception
%OIR-SP-3-PWRCYCLE: Card in module 1, is being power-cycled Off
(Module Reset due to exception or user request)
Root Cause:
Multiple root causes are possible. If the ES+ card is continuously
crashing during the boot, root cause could be FN63553

Crash due to EARL PATCH_INVOCATION_LIMIT

Problem: The ES+ line card crashes because the EARL patch had to be applied too many times.

Sample symptom:
%EARL-DFC1-2-PATCH_INVOCATION_LIMIT: 10 Recovery patch invocations in the last 30 secs have been attempted. Max limit reached
Root Cause: Multiple root causes are possible, and the issue is not limited to ES+ linecards. When the EARL detects certain types of errors, it activates a 'patch', which is effectively a restart of the ASICs connected to the EARL. If the limit on the number of consecutive patches is reached, a line card crash is triggered.
Next action:
Collect the crashinfo and the output of
  remote command {switch|module 1} show platform software earl reset {history|data}
OIR the card. A software reset of the line card does not help; the card has to be physically removed and re-inserted.

ECC Single Bit Errors

%ECC-DFC8-3-SBE_LIMIT: Single bit error detected and corrected
%ECC-DFC8-3-SYNDROME_SBE_LIMIT: 8-bit Syndrome for the detected Single-bit error: 0x0
%PM_SCP-SP-1-LCP_FW_ERR: System resetting module 8 to recover from error: Linecard received system exception
%NP_DEV-DFC3-6-ECC_SINGLE: Recovered from a single-bit ECC error detected on NP 0, Mem 10, SubMem 0x1,SingleErr 1, DoubleErr 0 Count 63 Total 2
%ECC-DFC1-3-SBE_HARD: Single bit *hard* error detected at 0x082C7FC0
Next Action:
On a single occurrence, no action needs to be taken. Please DO NOT RMA the card.
Two or more occurrences of SINGLE_Bit_ECC_Error require an upgrade to one of the releases mentioned below.
Recommended releases to take care of the problem are 12.2(33)SRD6 or later, 12.2(33)SRE3 or later, or 15.0(1)S or later.
If the error is seen multiple times even after the upgrade, it is treated as a hard failure, and the card should be RMA'd and flagged for FA.
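
The decision flow described above for single-bit ECC errors can be sketched as a small function. This Python example is only an illustration of the guidance in this section; `sbe_next_action` and its parameters are hypothetical names, and `occurrences` is assumed to mean the number of errors seen on the currently running release.

```python
def sbe_next_action(occurrences, on_recommended_release):
    """Map single-bit ECC error history to the next action described in this section.

    occurrences: errors observed on the currently running release (assumption).
    on_recommended_release: True if running 12.2(33)SRD6, 12.2(33)SRE3,
        15.0(1)S, or a later release in one of those trains.
    """
    if occurrences <= 1:
        return "No action needed; do not RMA the card"
    if not on_recommended_release:
        return "Upgrade to a recommended release"
    return "Hard failure: RMA the card and flag it for FA"


print(sbe_next_action(1, False))
print(sbe_next_action(3, False))
print(sbe_next_action(3, True))
```

The sketch deliberately treats anything more than one occurrence after the upgrade as "multiple times"; where exactly to draw that line is a judgment call best confirmed with TAC.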

ECC Double Bit Errors

%NP_DEV-DFC5-3-ECC_DOUBLE: Double-bit ECC error detected on NP 1, Mem AB, SubMem 0x1,SingleErr 1, DoubleErr 1 Count 1 Total 1
AB refers to values 16 - 19
Root Cause
Double-bit ECC errors on the ES+ family of cards on NP Mem 16 to NP Mem 19 may be caused by sub-optimal HW programming. These were addressed through several software fixes.
One or more occurrences of DOUBLE_Bit_ECC_Error require an upgrade to one of the releases mentioned below.
Recommended releases to take care of the problem are 12.2(33)SRD6 or later, 12.2(33)SRE3 or later, or 15.0(1)S or later.
If the double-bit error is for "Mem 17", there is a newer fix committed via CSCtn95122: SRE5, 15.0(1)S4 or later.
If the error is seen multiple times even after the upgrade, it is treated as a hard failure, and the card should be RMA'd and flagged for FA.

Parity Errors
Soft parity errors: These errors occur when an energy level within the chip (for example, a one or a zero) changes, most often due to radiation. When referenced by the CPU, such errors cause the system to crash. In case of a soft parity error, there is no need to swap the board or any of the components.
Hard parity errors: These errors occur when there is a chip or board failure that corrupts data. In this case, you need to re-seat or replace the affected component, which usually involves a memory chip swap or a board swap.

Next Action: At the first occurrence, it is not possible to distinguish between a soft and a hard parity error. From experience, most parity errors are soft parity errors and can usually be dismissed. It is suggested that the system be kept under monitoring. If the error recurs within a very short time interval, it could be a hard parity error, in which case the module should be replaced.

TestMacNotification Diag Test Error

Problem Symptom:
The following messages may be seen on the console:
*Mar 10 10:25:53.562: %FABRIC_INTF_ASIC-DFC9-4-FABRICCRCERRS: Fabric ASIC 0: 322 Fabric CRC error events in 100ms period
*Mar 10 10:26:31.071: %CONST_DIAG-SP-3-HM_TEST_FAIL: Module 9 TestMacNotification consecutive failure count:5

Troubleshooting and Recommendation:
Check whether there are any Fabric CRC errors ("FABRICCRCERRS") in the logs.
If Fabric CRC errors are followed by a TestMacNotification test failure, the issue could be caused by CSCto55567; upgrade to release 12.2(33)SRE5 or higher.
If the TestMacNotification test fails without Fabric CRC errors, contact TAC; this requires complete data path debugging.
Related DDTS: CSCto55567
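
The check described in this section, whether a Fabric CRC error precedes the TestMacNotification failure, can be sketched in Python as follows. This is an illustrative helper under simple assumptions: `classify_testmac_failure` is a hypothetical name, each log message is assumed to sit on a single line, and the lines are assumed to be in chronological order.

```python
def classify_testmac_failure(log_lines):
    """Suggest a next step based on whether FABRICCRCERRS precedes the test failure."""
    seen_fabric_crc = False
    for line in log_lines:
        if "FABRICCRCERRS" in line:
            seen_fabric_crc = True
        elif "HM_TEST_FAIL" in line and "TestMacNotification" in line:
            if seen_fabric_crc:
                return "Possible CSCto55567: upgrade to 12.2(33)SRE5 or higher"
            return "Contact TAC: complete data path debugging required"
    return "No TestMacNotification failure found"


logs = [
    "*Mar 10 10:25:53.562: %FABRIC_INTF_ASIC-DFC9-4-FABRICCRCERRS: "
    "Fabric ASIC 0: 322 Fabric CRC error events in 100ms period",
    "*Mar 10 10:26:31.071: %CONST_DIAG-SP-3-HM_TEST_FAIL: "
    "Module 9 TestMacNotification consecutive failure count:5",
]
print(classify_testmac_failure(logs))
```

The sketch does not check that both messages refer to the same module; in practice the DFC number and the module number in the two messages should match.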

EOBC Jam or Freeze Error: Router Crash After Inserting ES+

Sup720:
%EOBC-SP-0-EOBC_JAM_FATAL: Primary supervisor in slot 5 is jamming the EOBC channel.
It has been disabled. Supervisor will return to ROMMON

RSP720:
%TSEC-SP-3-RESTART: Interface EOBC0/0 Restarted Due to TX Freeze Error
%TSEC-SP-2-EXCEPTION: Fatal Error, Interface EOBC0/0 not transmitting

Root Cause: CSCtu50337

This issue may only occur if the RSP/SUP is running HW revision 5.0/5.1/5.2 and the ... should be NON-S type.

If you encounter this issue, please contact TAC, referring to this DDTS.

Register Read Errors


Problem Symptom:
Following messages may be seen on the console or
syslog:
DFC7: ERROR! number: 0x80003902, NPprmReg_Read_NP_3c:
register 6 is not supported for NP-3c2.
DFC7: ERROR! number: 0x80003902, Failed to read register Id: 6.
DFC7: ERROR! number: 0x80003902, Failed to read register Id: 6.
DFC1: NPcfgStatistics_ReadStatMsg_NP_3c return error
DFC1: ERROR! number: 0x800000FF, Unexpected error, Counter
type is not long neither double.
Next action:
This message can safely be ignored.
Related DDTS : CSCsy88170

DBUS Header Error

Problem Description
The following error message is seen on the console:
%EARL_L2_ASIC-DFC1-4-DBUS_HDR_ERR: EARL L2 ASIC #0: Dbus Hdr. Error occurred. Ctrl1 0x930D0EBD
Root Cause
Whenever the EARL receives a bad packet, or a garbage signal that is treated as a packet, it triggers a DBUS_HDR_ERR interrupt.
Troubleshooting and Recommendation
  If it happens once, or randomly at very low frequency, and is not reproducible, it can be ignored.
  If it happens randomly at high frequency and is not reproducible, it could be a hardware failure. RMA is recommended.
  If it can be reproduced consistently, collect packet dumps (see the steps below).

DBUS Header Error Contd..

To display the number of DBUS_HDR_ERR errors (on the LC):
show platform hardware superman fwdstats | i DBus Header
  DBus Header Checksum errors = 0x0000000000000000 (0)

Collecting the offending packets:
ELAM can be used to collect the offending packet. This makes it possible to verify whether the packet is always the same or different.
If the offending packet is always the same, a bad end device is a possibility.
If the offending packet keeps changing, a hardware/software issue is a possibility.
PLEASE get TAC's help to capture the offending packet using the ELAM tool.

Related DDTS: CSCtg31984

Fabric Errors
1. Fabric Sync Failure
%C6KPWR-SP-4-DISABLED: power to module in slot 4 set off
(Fabric channel errors)
2. Fabric CRC Errors
%FABRIC_INTF_ASIC-DFC2-4-FABRICCRCERRS: Fabric ASIC 0: 5 Fabric CRC error events in 100ms period
3. Repeated Fabric Sync
%FABRIC_INTF_ASIC-DFC10-5-FABRICSYNC_REQ
4. Fabric Channel Counter Errors
Error counters incrementing on a specified fabric
channel.
show fabric errors

Fabric Errors Contd..


Root Cause
This is usually a transient issue on the fabric channels,
potentially due to improper insertion of any LCs on the
backplane.
Troubleshooting and Recommendations
1. Collect below outputs from RP
console:
show fabric channel-counters
show fabric errors
show fabric drops
2. Collect below outputs from LC console:
show logg
show platform hardware ssa {brief|counter|error|fabricmon}
{history|registers}
Possible actions that can be tried out:
1. Re-seat of the module.
2. Try switchover of the supervisor module

TCAM Write Inconsistency Errors

Problem Description
MLSCEF-DFC4-2-FIB_TCAM_WRITE_INCONSISTENCY: FIB TCAM Mismatch for value: Index: 154
  Expected: Entry: 0xB1-0x0005000B-0x40000000
  Hardware: Entry: 0x00-0x00000000-0x00000000
Root Cause
The above message indicates that there is an issue with a TCAM write.
Troubleshooting and Recommendations
Try OIR/reseating of the linecard and check whether the problem is solved.
If the linecard is crashing consistently with this error, collect "show tech" output and contact TAC for further assistance.

NPU Cluster Error


Problem Description
%NP_DEV-DFC7-3-PERR: Non-recoverable Parity error detected
detected on NP 0, cause 39 count 1 uqParityMask 0x2000000,
uqSRAMLine 0x90, bRecov 1, bRewr 1 Total 1
Root Cause
This problem generally happens due to corruption of the internal NPU SRAM memory.
Troubleshooting and Recommendations
RMA the card and raise it for EFA.

ES+ Combo Card Power Denied Issue

Problem Description
A card is powered down with "power denied" when ES+ combo cards are in the same chassis.
ES+ combo variants have wrong power values programmed in the IDPROM, which makes them allocate more power than specified in the power calculator.
This might lead to a case of insufficient power, causing a module to power down with "FRU-power denied":
  9  76-ES+XC-20G3CXL  405.72  9.66  -  -  on  off (FRU-power denied)

Contd..

The right power values for ES+XC are listed below:

                      Pwr-Requested
  Card Type           Watts     A @42V
  76-ES+XC-20G3C      309.12     7.36
  76-ES+XC-20G3CXL    337.26     8.03
  76-ES+XC-40G3C      399        9.5
  76-ES+XC-40G3CXL    427.14    10.17

Values higher than the above will proportionally increase "total available power", which could cause other modules to power down due to insufficient power.
Troubleshooting and Recommendations
1. Confirm whether the power values for ES+XC differ from the table above, using "show power".
2. Check whether any of the modules in the same chassis fail to power up with the error "power denied", using "show power" or "show module".
On the 7600/ES+ platform, the recommended releases to take care of the problem are 12.2(33)SRE4 or later, or 15.0(1)S4 or later.
Known DDTS: CSCtn41667
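
The two columns of the power table are related by I = P / V at 42 V, which makes it straightforward to check a `show power` reading against the expected values. The Python sketch below illustrates that check; `EXPECTED_WATTS`, `amps_at_42v`, and `is_overallocated` are hypothetical names, and the tolerance is an arbitrary assumption.

```python
BUS_VOLTAGE = 42.0  # volts, as in the "A @42V" column

# Expected power requests for the ES+XC combo cards, from the table in this section.
EXPECTED_WATTS = {
    "76-ES+XC-20G3C": 309.12,
    "76-ES+XC-20G3CXL": 337.26,
    "76-ES+XC-40G3C": 399.0,
    "76-ES+XC-40G3CXL": 427.14,
}


def amps_at_42v(watts):
    """Current drawn at 42 V for a given power request (I = P / V)."""
    return round(watts / BUS_VOLTAGE, 2)


def is_overallocated(card_type, reported_watts, tolerance=0.01):
    """True if the reported request exceeds the expected value for the card type."""
    return reported_watts > EXPECTED_WATTS[card_type] + tolerance


# The "show power" line quoted in this section reports 405.72 W for a 76-ES+XC-20G3CXL.
print(amps_at_42v(405.72))
print(is_overallocated("76-ES+XC-20G3CXL", 405.72))
```

A card that reports more than its expected value matches the symptom described here and points toward the software fix tracked under the DDTS listed in this section.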

Questions ?
