
Avaya Operational Services

BACKBONE
ALARMS - GUIDE
Version: 2.0

Prashanth Burugula

Copyright © 2014. All rights reserved.


The information in this document is subject to change without notice and should
not be construed as a commitment by Avaya Corporation. Every effort has been
made to ensure the accuracy of this document; however, Avaya Corporation
assumes no responsibility for any error that may appear.
All trademarks mentioned herein are the property of their respective owners

Media Gateways Alarms

MED-GTWY
CMG_EventID_14
CMG_EventID_16
CMG-19
CMG-34
CMG-21/cmgIccMissing
CMG-22/cmgIccAutoReset
CMG-23
CMG-35/VoIPOccFault
CMG-36/VoIPStats AppFault Trap
CMG-25/26
CMG-47/48/49
CMG-50/51/52

Platform Alarms

SME
Tripwire
GW_ENV_EventID_10
DAL2/DAL1/DAJ
Duplication Link Alarms
Filesync Alarms
Arbiter Alarms/A_EventID_X
BKP_EventID_10
WD_EventID_22
WD_EventID_26
PE Healthcheck/PE_EventID_1
Malformed_INADS
LOGIN_EventID_X
USB1_EventID_X
UPD_EventID_X
UPG_EventID_X
UPS_EventID_X
STD_EventID_X
ENV_EventID_X
SVC_MON_EventID_X

TN Circuit Packs Alarms

PKT-INT
PKT-BUS
G3_Cabinet-Down/G3_CircuitPack-Down
SYS-LINK
TONE-BD
ETR-PT
CLAN-BD
IPMEDPRO
MEDPROPT
VAL-PT
VAL-BD
SNI-BD
SNI-PEER
SN-CONF
SNC-LINK/SNC-BD/SNC-REF
EXP-INTF
EXP-PN
FIBER-LINK
DS1C-BD
TDM-BUS
POW-SUP
M/T-BD / M/T-ANL / M/T-DIG / M/T-PKT
PS-RGEN/RING-GEN
NR-CONN

Survivable Processor Alarms

LIC-ERR
ESS_LOCATION_C000
ESS_EventID_1
ESS_EventID_2
ESS_EventID_3
ESS_EventID_4
ESS_EventID_5
ESS_EventID_6

Adjuncts Associated Alarms

PRI-CDR/SEC-CDR
ASAI-PT/BD
ADJ-IP/AESV-SES/ASAI-IP

Trunk/Trunk Board Associated Alarms

MG-IAMM
UDS1-BD/MG-DS1
MG-ANA
ANL-BD
BRI-BD/MG-BRI/TBRI-BD
BRI-PT/TBRI-PT
CO-TRK
ISDN-SGR / ISDN-TRK
H323-SGR
1009: Total Processing exceeds 70% on \\LSP-LAPG45CC03PV
HARD DISK,S,77

Switch Alarms

PowerSupply_Fault
pethPsePortOnOffNotification
Interface_Fault_MIB2
ExceededMaximumUptime
HighErrorRate
Switch Down/Interface Down/Host Down

Alarm Description:

alarmMinor: ExternalName=J201C004-cm1-virtual-mdc: Type=MIN: MaintName=MEDGTWY: On Board=n: AlarmIPAddress=XX.XX.XX.XX: AlarmPort=XXX: AlarmCategory=


Understanding: The MED-GTWY maintenance object monitors the H.248 link to the Media Gateway. It logs errors
when there are H.248 link problems or when hyperactive H.248 link bounce occurs.
Solution:

almdisplay v / almdisplay res |more


If the alarm is active:
Check from autosat whether the reported MED GTWY is registered with the main server, then run
display media-gateway <GTWY number>

If the gateway is not registered (the Registered? field shows n), try to ping the IP address of the
gateway from the main server. If it is not pingable, contact the customer and check for any network
outage or any scheduled power outage at the site.
If it is pingable:

Run traceroute <ip address of MED GTWY> from the main server. If it reports errors or shows any * values,
there is a problem along the path.
Log in to the MED GTWY with ssh init@<ip address of MED GTWY>
Run show event-log and check for issues around the time stamp of the alarm.

In my case, the H.248 link was down, which caused this alarm.


Contact the customer and check for any network issue, scheduled power outage, or other
activity at the site.
If there was such an activity, monitor the alarm and, by logging into the MED GTWY and running show mgc,
check whether the gateway registered back to the main server after the activity.

Check for any remaining faults before proceeding with case closure.

Just FYI: in my case there was a network fluctuation that took the H.248 link down. The customer was able
to find and fix it, and once the network was back the H.248 link came up. The screenshot below from the event
log shows the link up, which cleared the alarm, and the case was closed.

Network flap scenario, where the H.248 link goes down: the logs would appear as below:

If there is a reset on MG, the logs would appear as below:

Probable cause: The alarm can be reported due to:

LAN issue / power outage at the site, OR
the ESS/LSP server reloading after getting translations from the main server, OR
the main server being down due to bad health or a scheduled activity at the customer's site.
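For reference, a minimal shell sketch of the reachability checks above, run from the main server; the gateway address 10.1.1.10 and gateway number 1 are placeholders only, not values from this alarm:

ping -c 4 10.1.1.10          # placeholder gateway IP: is the MED GTWY reachable?
traceroute 10.1.1.10         # look for errors or * hops along the path
ssh init@10.1.1.10           # log in to the gateway processor
# On the gateway CLI (not the Linux shell):
#   show mgc        - which controller the gateway is registered to
#   show event-log  - H.248 link down/up entries around the alarm time
# On the main server SAT:
#   display media-gateway 1  - the Registered? field should show y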
Alarm Description:

Alarm->MaintObject_CMG_EventID_14
cmgSyncSignalFault / cmgSyncSignalClear
Understanding: If the Avaya G700 Media Gateway contains an MM710 T1/E1 Media Module, it is usually
advisable to set the MM710 up as the primary synchronization source for the G700. In so doing, clock sync signals
from the Central Office (CO) are used by the MM710 to synchronize all operations of the G700. If no MM710 is
present, it is not necessary to set synchronization. If neither primary nor secondary sources are identified, then the
local clock becomes Active. By setting the clock source to primary, normal failover will occur. Setting the source
to secondary overrides normal failover, generates a trap, and asserts a fault.
Probable cause: The alarm can be reported due to:
LAN issue / power outage at the site, OR
the ESS/LSP server reloading after getting translations from the main server, OR
the ESS/LSP being down due to bad health.
Solution:

almdisplay v / almdisplay res|more (displays resolved alarms page by page; use the spacebar to go to the next
page)

Telnet to the media gateway and collect the following output:

show faults (check for any sync faults or DS1 board faults)

show sync timing (check for any synchronization errors)
show events (check the logs for loss-of-signal or signal-fault-clear entries)
test board <DS1 board location>
status trunk X (check whether the trunk is in service or out of service)
list measurement ds1 log <DS1 board location> (check for any slip errors, if the trunk is in service)

Alarm Description:
Alarm->MaintObject_CMG_EventID_16
Understanding: This trap indicates that one of the media modules (possibly the VoIP module) has undergone a
change. The change could be a new media module being inserted or reseated, a busyout/release, a
configuration file being uploaded, or firmware being downloaded.
Solution:
almdisplay v / almdisplay res |more
show faults (log in to the media gateway and check for any active faults)

show event-log (check the logs for the exact event that happened on the media module)
list configuration board <board location> (check that the board is detected)
test board <board location> (check that all tests pass for the board)

Probable Cause: The common reason for the alarm is administration work on one of
the media modules:
a firmware update activity on the media module, OR
bad health of the media module, because of which it was reset/reseated.

Alarm Description:
Alarm->MaintObject_CMG_EventID_19/ 34
Description: The alarm indicates that an attempt to download a software module (CMG 19) or to upload a
configuration file (CMG 34) has failed.
Procedure:
almdisplay v / almdisplay res |more
show faults (log in to the media gateway and check for any active faults)

show event-log (check the logs for the exact event that happened on the media module)
list configuration board <board location> (check that the board is detected)
test board <board location> (check that all tests pass for the board)
Try downloading the software for the module (for the CMG 19 alarm) / uploading the configuration file (for the CMG
34 alarm). If this fails, follow the step below with the required customer permission:
reset the board (i.e. busyout and release the board), followed by reseating the board and then, if
required, replacing the board.
Probable Cause: The most common reason is the failure of an update activity.
Alarm Description:
Alarm->MaintObject_CMG_EventID_21/22
Alarm->cmgIccMissing /cmgIccAutoReset
Understanding: The alarm indicates that an ICC expected in Slot 1 is either missing or present (CMG 21), and/or that
the Media Gateway automatically reset the ICC (CMG 22).
Procedure:
almdisplay v / almdisplay res |more
restartcause (to check what initialised the server)
list survivable-processor (to check the time of the saved translation file, in case the ICC is an LSP)

show event-log (check the media gateway logs to find the exact event that happened)
* If the CM version is 5.2.x, check for PCN 1690Pu. *

Probable Cause: The alarm may be reported due to:

the CM release being 5.x (which has a software limitation), OR
translations being pushed onto the LSP by the main server, in which case CM was reloaded, OR
bad health of the S8300 server.
Alarm Description:
Alarm->MaintObject_CMG_EventID_23
Description: Telephone services on a Media Gateway are controlled by a Media Gateway Controller (MGC).
All media gateways integrate seamlessly with Avaya Media Servers. A media gateway can be configured with
up to 4 controllers, so that if the primary controller goes down, the second controller in the list takes
control and keeps telephony services active. The alarm indicates that the Media Gateway cannot contact the first
controller defined in its controller list.
Procedure:
A. If the first controller is a C-LAN board
almdisplay v / almdisplay res |more
cd /var/log/ecs
grep -R MG <file name> (to identify the media gateway, if the alarm is in resolved state)
show mgc list (to identify the IP address of the first controller, i.e. the C-LAN board)

list ip-interface clan (to find the C-LAN board location)

status clan-port <port location> (note: the 17th port is the Ethernet port here)
display errors
If the Ethernet link is down, inform the customer and ask them to check the physical LAN connectivity to the
C-LAN board. If no connectivity issues are found, then, with the required permission, reset the C-LAN
board (i.e. busyout and release the board), followed by reseating and then replacing the board, if required.
Probable Cause: The alarm may have been reported due to:
LAN issue, OR
bad health of the C-LAN board.
B. If the first controller is an ICC
almdisplay v / almdisplay res |more
restartcause (to check how and when the ICC was rebooted)
statapp (check whether all services are up)

If the ICC is not accessible or is down, inform the customer and ask them to reseat the ICC board. If the ICC
still doesn't come up, try replacing the board.
Probable Cause: The alarm may be reported due to:
LAN issue / power outage, OR
bad health of the S8300 main server, or the server was rebooted.

Alarm Description:
Alarm->MaintObject_CMG_EventID_35 /36
Alarm->cmgVoipOccFault / VoIPStats AppFault Trap
Description: One or more of the VoIP engines in the media gateway is over its occupancy threshold
(Channels In Use / Total Channels, CMG 35), or back below its occupancy threshold (i.e. the occupancy has
returned below the threshold value after exceeding it, CMG 36).
Procedure:
show faults

show voip-parameters

show event-log (confirm whether the occupancy is back to normal after exceeding the threshold
value)
Typically no other action is required here.

Probable Cause: The most common cause is that VoIP occupancy exceeded its threshold value.
Alarm Description:
Alarm->MaintObject_CMG_EventID_25/26/47-52
Description: Telephone services on a Media Gateway are controlled by a Media Gateway Controller (MGC).
All media gateways integrate seamlessly with Avaya Media Servers. For the MGC to control media gateways, the latter
need to be registered with the media servers. If an S87xx is the primary controller, the MG has to register with
a C-LAN board. For an S85xx, the MG registers either with a C-LAN board or with the Processor Ethernet port, if enabled. For an
S8300, it registers with the Processor Ethernet port. These alarms simply mean that the Media Gateway is not registered to
its controller.

Procedure:
almdisplay v / almdisplay res |more
cd /var/log/ecs
grep -R MG <file name> (to identify the media gateway, if the alarm is in resolved state)
ping <ip-address of Media-gateway> (check whether the MG is reachable from the main server)

display media-gateway X (if the MG is pingable but not registered, check for a recovery rule)

display system-parameters mg-recovery-rule X (check the configuration)

reset media-gateway level 2 (if required, with the customer's permission)

traceroute <ip-address of Media-gateway> (if the MG is not pingable, check from which hop the trace
fails).
Inform the customer and ask them to check the network integrity at the site.
show mgc list (log in to the MG; if no issues have been found with the MG and/or the network integrity, check
the C-LAN board that is defined in the controller list)
ping <ip-address for Clan>
list ip-interface clan (to find the C-LAN board location)
status clan-port <port location> (note: the 17th port is the Ethernet port here)
display errors
If the Ethernet link is down, inform the customer and ask them to check the physical LAN connectivity to the
C-LAN board. If no connectivity issues are found, then, with the required permission, reset the C-LAN
board (i.e. busyout and release the board), followed by reseating and then replacing the board, if required.
Probable Cause:
LAN issue or power outage at the customer site, OR
bad health of the C-LAN board / media gateway.

Platform Alarms
Alarm Description:

Alarm->MaintObject_SME_EventID_1
Description: The Server Maintenance Engine (SME) is a Linux process that provides error analysis,
periodic testing, and demand testing for the server. This SME alarm means that alarms are not being reported by the
other server in a duplex configuration, due to a failure of either the GMM or the administered reporting mechanism.
Procedure:

testinads / testcustalm
If testinads or testcustalm replies affirmatively, then the cause for which the alarm was
reported no longer exists.
statapp (check whether all required processes are up)

grep GMM /var/log/ecs/wdlog

start -s GMM
If a GMM failure is found, inform the customer and, with permission, restart
GMM; note that restarting GMM may cause a server interchange.
If no GMM failure is found:
If testinads is failing on either server, then we need to kill the almindsagt process (with the
customer's permission).
Example:

a) init@pacehqs8720b> ps -ef |grep alm (to find the almindsagt process ID, 5303 in this example)
b) init@pacehqs8720b> kill -9 5303
c) init@pacehqs8720b> ps -ef |grep alm
root 26897 3790 0 07:27 ? 00:00:00 /opt/ws/almindsagt
root 27044 26878 0 07:28 pts/0 00:00:00 grep alm
Note: the almindsagt process restarts automatically after it gets killed.

testinads
stop -s SME and stop -s MVSubAgent, followed by
start -s SME and start -s MVSubAgent
(If testinads is still failing, we need to restart the SME and MVSubAgent
processes, which send the traps from the server, but only with the customer's permission.)
testinads / testcustalm
If testinads still fails, a warm reboot is needed (stop -a followed by start -a), followed by a cold reboot
(reboot) if required, but with the customer's permission.
logger -t svc_mon[2343] atd could not be restarted
Try raising a test alarm and check whether it gets reported to Remedy.
Once the alarm is resolved, almclear -a

Probable Causes: The most common cause is that one of the duplex servers could not call out an alarm, and the
other server raises this alarm to inform the administrator of that. This could be due to:
GMM failure, OR
failure of a sub-process essential to the administered reporting mechanism, such as the SME or
MVSubAgent process, OR
a scheduled activity at the customer site that affects the reporting mechanism.
Alarm Description:
Alarm->MaintObject_TRIPWIRE_EventID_7
Description: Tripwire is an intrusion detection system (IDS) which constantly and automatically keeps
your critical system files under control and reports if they have been destroyed or modified by a cracker (or by
mistake). It allows the system administrator to know immediately what was compromised and fix it. The first time
Tripwire is run, it stores checksums, exact sizes, and other data for all the selected files in a database. Successive
runs check whether every file still matches the information in the database and report all changes.
Procedure:

grep -R tripwire /var/log/messages

The output would be in the form of:

cd /var/lib/tripwire/report (go to this directory, then)

ls -ltr (to find the *file-name.twr* report that was generated when the files were modified)
twprint --print-report --report-level N --twrfile /var/lib/tripwire/report/*file-name.twr*
(where "N" is a report level from 0 to 4)
Run the above command to identify the individual files that were modified.
The output of the above command would be in the form shown below.

Probable Cause: When any of the critical system files and reports is changed or modified, we get this alarm.
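For reference, a concrete form of the twprint invocation above; the report file name is a placeholder, in practice the newest .twr file listed by ls -ltr:

cd /var/lib/tripwire/report
ls -ltr                                   # the newest report file is listed last
twprint --print-report --report-level 2 \
    --twrfile /var/lib/tripwire/report/server1-20140203-070101.twr | more   # placeholder file name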

Alarm Description:

Alarm->MaintObject_GW_ENV_EventID_10
Description: This environment alarm is raised in case of power supply faults on the gateway.
Procedure: Log in to the Media Gateway and run the following commands.

show faults
show platform

show voltage

show event-log
show system

Ask the customer to check the following:

a) Verify that the power cord is firmly inserted and that power is being supplied to the power
unit reporting the event.
b) Reinsert the power supply and monitor the Event Log.
If the customer replies that everything is fine but the alarm is still present, send a technician to
confirm this and then replace the power supply unit.

Alarm Description:

Alarm->MaintObject_DAL2/DAL1/DAJ1
Understanding: This MO supports each S8700 media server's Duplication Memory board, a NIC
(network interface card) serving as the physical and data-link interface for an Ethernet-based
duplication link between the servers. This link provides a call-status data path carrying:
TCP-based communication between each server's Process Manager
UDP-based communication between each server's Arbiter, to:
enable arbitration between the active and standby servers
provide status signaling for memory refreshes
Procedure:
almdisplay v / almdisplay res |more
server (check the server mode and the status of standby shadowing & the duplication
link)

testdupboard

(Note: If a cable has become unplugged from either of the DAJ1 boards, both boards will still test OK. The dup
link will show down/not refreshed, but both DAJ1 boards will test OK.)
restartcause (if the alarm is in resolved state and the output of steps 2 and 3 is fine)
testdupboard -t localloop (only on the standby server, and only if the standby server is busied out)
reboot (if the test continues to fail, with the customer's permission)
If the test still continues to fail, replace the DAL/DAJ card.
Probable Cause: The alarm may be due to:
bad health of either duplication board, OR
the duplication link being refreshed because of a periodic/scheduled maintenance activity, OR
the CM server being reloaded because of a save-translation activity, OR
the server being rebooted/reloaded.

Alarm Description:

Alarm->MaintObject_DUP_EventID_X
Understanding: The Duplication Link is a 10/100BaseT Ethernet link which is used by the Duplication
Manager (ndm) process to communicate with the other server's ndm process. The Duplication Manager
process (via coordination of the Arbiter process) runs on each S8700 Multi-Connect server to control data
shadowing between them. Meanwhile, at the physical and data-link layers, an Ethernet duplication link
provides a TCP communication path between each server's Duplication Manager to enable their control of data
shadowing. The dupmgr is responsible for monitoring the status of this link. It raises a major alarm in the event
that the Duplication Link is non-functional, by logging an entry into syslog that the Global Maintenance Monitor
(GMM) uses to report alarms.
Procedure:
For the Duplication Link on an S87xx server:

almdisplay v / almdisplay res |more

server (check the server mode and the status of standby shadowing & the duplication link)

testdupboard

filesync -Q dup (check the status of filesync over the duplication link)

pingall -d (check whether the dup IP is pingable from each server)
cat /proc/mdd (check for CRC errors)

cd /var/log/ecs (run ls -ltr to find the log file with the latest date tag, e.g. 2014-0203-070101.log)
cat <file name> (check the logs for when the duplication link went down or was refreshed. Was there any scheduled /
periodic maintenance running at that time, or some other activity that may affect the functioning of the
duplication link?)

restartcause (to check whether CM on either server was reloaded or there was a server interchange)

Probable Cause: The alarm is reported if the dup link is non-functional, possibly due to:

bad health of either server or duplication board, OR
the duplication link being refreshed because of a periodic/scheduled maintenance activity, OR
the CM server being reloaded because of a save-translation activity, OR
the server being rebooted/reloaded, OR
a scheduled activity at the customer site.
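For reference, a minimal shell sketch of the duplication-link checks above, run on either server; the log file name is a placeholder and the grep filter is only illustrative:

server                       # server mode, standby shadowing, duplication link status
testdupboard                 # test the DAL/DAJ duplication board
filesync -Q dup              # filesync status over the duplication link
pingall -d                   # is the duplication IP reachable from this server?
cat /proc/mdd                # look for CRC errors
cd /var/log/ecs && ls -ltr   # the newest CM log file is listed last
grep -i dup 2014-0203-070101.log | more   # placeholder file name: look for dup-link events
restartcause                 # was CM reloaded, or was there a server interchange?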

Alarm Description:

Alarm->MaintObject_FSY_EventID_X
Description: When multiple servers (i.e. processors) are present in a network, the active server shares
configuration information (translations) with all the other servers (standby server and LSP/ESS servers) so that in
the event of failure, a surviving processor can take over and have the latest information. Sharing occurs in a process
known as file synchronization (filesync) and can happen once per day or whenever the translation file is changed.
The system must be operated in a manner, and the network connectivity designed, to accommodate this activity.
Procedure:
For an ESS/LSP server (Note: for a Dup-FSY alarm, refer to the Duplication Link alarms procedure)
almdisplay v / almdisplay res |more
filesync -Q all (check the status of file synchronisation)

statapp (check that the filesync process is running on both the main and the ESS/LSP server)

filesync -w -a lsp <IP-address of lsp> trans (manually save translations to the LSP)

filesync -w -a ess <IP-address of ess> trans (in the case of an ESS server)
(Check whether the manual push is successful, or else check the error/reason code.)
list survivable-processor (check the connectivity of the ESS/LSP with the main server and whether
translations were saved on the ESS/LSP; CM reloads on the ESS/LSP after getting the translation file, and in
that event the alarm is reported on the main server)
restartcause

In case the alarm is active and the above steps do not identify the cause, then:
date
(run this command on both the main server and the LSP/ESS, because a time mismatch
could be the cause)
ip_fw -q -s 21874/tcp service (check whether the TCP ports defined for filesync are open in both directions, on
each server)

cat /etc/sysconfig/network-scripts/ifcfg-ethX
(check whether the Ethernet ports are locked to 100 Mbps full duplex on each server; ethX is the
Ethernet port defined for the customer LAN)

/sbin/ifconfig ethX
(check whether the Ethernet port is seeing errors; ethX is the Ethernet port defined
for the customer LAN)

Probable Causes: The alarm can be reported due to:

CM (on the ESS/LSP) being reloaded, by design, after getting the translations from the main server, OR
a network integrity issue between the ESS/LSP and the main server, OR
a scheduled activity at the customer site or some recent change made on either server, OR
a date/time mismatch between the main server and the ESS/LSP, OR
TCP/IP ports being blocked in either direction on either server, OR
an issue with the Ethernet port where the customer LAN is defined.
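For reference, a minimal example of the manual translation push described above; the LSP and ESS addresses are placeholders:

filesync -Q all                          # current filesync status for all targets
filesync -w -a lsp 192.168.1.50 trans    # placeholder LSP address: push translations and wait
# filesync -w -a ess 192.168.1.60 trans  # equivalent push for an ESS target
date                                     # run on both servers and compare, to rule out a time mismatch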

Alarm Description:
Alarm->MaintObject_A_EventID_X
Description:
The alarm indicates a malfunction of the Arbiter process, which is used on duplex servers to determine the
health of the server. The Arbiter process runs on each S87xx server to:
decide which server is healthier (more able to be active)
coordinate data shadowing between them (under the Duplication Manager's control).
Meanwhile, at the physical and data-link layers, an Ethernet-based duplication link provides an inter-arbiter
UDP communication path to:
enable this arbitration between the active and standby servers
provide the necessary status signaling for memory refreshes
Procedure:
Follow the DUP alarm procedure first, and then...
server

(If the output indicates a corrupt/failed state, inform the customer and, with the required permission, restart the Arbiter process
by executing the following commands):

stop SF -s arbiter
start -s arbiter
server c
cd /var/log/ecs
grep -R Arbiter <filename>

Verify that the host name and the corresponding IP address are identical in the hosts file and the configuration file:
a) more /etc/hosts
b) more /etc/opt/ecs/servers.conf
c) ifconfig -a (verify that the IP address matches the hosts and configuration files and that all
Ethernet ports have an IP address assigned)
d) /sbin/arp -a (verify that the MAC address is complete)

Verify whether the alarm is still active on the port using the following command:
netstat -a | grep 1332

Probable Cause: The alarm is reported whenever the Arbiter process detects:

bad health of either duplex server, OR
an issue with data shadowing between the duplex servers.
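For reference, a minimal shell sketch of the consistency checks above; the values returned depend entirely on the server's own configuration:

more /etc/hosts                  # host-name to IP-address mapping
more /etc/opt/ecs/servers.conf   # CM server configuration file; entries should match /etc/hosts
/sbin/ifconfig -a                # every Ethernet port should have its expected IP address
/sbin/arp -a                     # MAC entries should be complete (no incomplete entries)
netstat -a | grep 1332           # is the arbiter port still showing activity?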
Alarm Description:

Alarm->MaintObject_BKP_EventID_10
Understanding: Backups are designed to preserve off-server copies of translations, configuration files,
security files, logs, and other important information. The backup command is used for both backup
and restore of data sets. The above alarm is reported when a scheduled backup has failed.
Procedure:
A. Backup is failing to an FTP server
almdisplay v / almdisplay res |more
sudo backup t |more (gives the history of successful and failed backups)
ping <IP address of ftp server>
traceroute <IP address of ftp server>
If the FTP server is not pingable, check from which hop it is failing and ask the customer to check the network integrity.
If the FTP server is pingable, take a manual backup:
a) From sroot, cd /etc/cron.d, run ls, and open the relevant file using cat <file name> (to find the location
to which the backup should be written).

Then copy the backup command string (as shown above) into a notepad, and add --verbose -d to the string after -b, as
shown below.

b) Or cat web* or cat back* (to get the login password for the FTP server, if required)
c) sudo backup -b --verbose -d ftp://'login':'paswd'@<IP of ftp>/ -c full
d) The backup can also be taken on the server itself using the following command:
sudo backup -b --verbose -d /var/home/ftp/pub/ -n 3 -c -x -c -- xln os security
Once the backup is successful, check backup t |more to capture the backup logs and then proceed to case
closure.

Probable cause: The alarm can be reported due to:

LAN issue / power outage at the site, OR
the ESS/LSP server reloading after getting translations from the main server, OR
the main server being down due to bad health or a scheduled activity at the customer's site.
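For reference, a minimal sequence for the FTP backup check above, using the same backup options the guide lists; the FTP server address, login, and password are placeholders only:

ping -c 4 192.168.1.200         # placeholder FTP server address
traceroute 192.168.1.200        # if not pingable, find the failing hop
sudo backup t | more            # history of successful and failed backups
sudo backup -b --verbose -d ftp://'backupuser':'secret'@192.168.1.200/ -c full   # manual full backup
sudo backup t | more            # confirm the manual backup completed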
B. Backup is failing to a PC Card / Flash Card
almdisplay v / almdisplay res |more
sudo backup t |more
Take a manual backup:
a) cd /etc/cron.d

b) cat web* or cat back*

c) backup -b --verbose -w -d usb-flash:// -n 2 -c xln os security
search_scsi v t CF
(check whether the device is present and at which location)
df -h /mnt/flash
(to check where the PC Card was originally mounted)
Before mounting the PC Card or formatting it, obtain the customer's permission (for the steps below).
mount /dev/cciss/<device location> /mnt/flash
(to mount the PC Card)
/sbin/mkfs.ext2 /<device location>
(format the device if it is present but is not detected when the backup is run)
Try the manual backup command again, as in step c above.
sudo backup t PC card (to verify the contents of the PC Card)
Probable Cause: The backup to the PC Card may fail due to:
the PC Card being unmounted, OR
the PC Card not being detected by CM when the scheduled backup script was executed.
Alarm Description:

Alarm->MaintObject__WD_EventID_22
Understanding: The watchdog keeps an eye on all processes in the system, maintaining heartbeats with
both Communication Manager and platform processes. The watchdog is responsible for stopping and
starting processes when necessary. This process watches over the entire system. Event ID 22 indicates that one of
the watchdog-monitored processes was terminated.
Procedure:

almdisplay v / almdisplay res |more

restartcause (check whether CM was reloaded)

grep -R terminated /var/log/messages (identify the application that was terminated and the corresponding
time stamp)
Check statapp to confirm that the watchdog application, and all other applications, are UP.

grep -R Application <name> /var/log/messages (to confirm that the watchdog restarted the application)

start -s <Application Name> (if necessary, restart the application using the above command)
If the application still doesn't come up, inform the customer and, with permission, go for a warm
reboot and then a cold reboot, if required.

Probable Cause: The alarm gets reported due to:

a malfunction of the server causing termination of a watchdog-monitored process, OR
CM reloading after getting the translation file from the main server, in the case of an ESS/LSP, OR
CM being rebooted, possibly due to a scheduled activity at the customer site.
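For reference, a minimal shell sketch of the log checks above, using SME purely as an illustrative application name:

restartcause                            # was CM reloaded, and why?
grep -R terminated /var/log/messages    # which application terminated, and when?
statapp                                 # all applications, including the watchdog, should be UP
grep -R SME /var/log/messages           # illustrative name: confirm the watchdog restarted it
start -s SME                            # restart the application manually if required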

Alarm Description:

Alarm->MaintObject__WD_EventID_26
Description: Watchdog handshake error. If USB alarms are also present, this strongly points to a global SAMP or
networking problem. This error implies a malfunctioning or missing SAMP, or configuration/firmware mismatch issues
with the SAMP, or it may point to a USB modem malfunction.
Procedure:
almdisplay v / almdisplay res |more
sampdiag v (gives the status of the SAMP)
grep -R SampEth /etc/opt/ecs/ecs.conf (to check detection of the SAMP card)
sampcmd date (to check synchronization of the SAMP with the host)
restartcause
testmodem

testmodem -t reset_usb (soft reset of the USB modem, if any of the tests fails)

stop -s ModemMtty followed by start -s ModemMtty

(If the soft reset doesn't work, restart the ModemMtty process, but obtain the customer's permission before doing that.)
If testmodem still fails, ask the customer to reseat the modem and then the telephone cable
(inserted into the modem).
If testmodem still fails, get the modem replaced.

Probable Cause: The alarm gets reported due to:

a malfunctioning SAMP, OR
a malfunctioning modem, OR
the server being rebooted.
Alarm Description:

Event Name: PE Health Check device is not responding to ARP request / Event Name: MaintObject_PE_EventID_1
Understanding: The Processor Ethernet (procr) feature was added to duplicated main servers in CM
5.2. This allows configurations without Port Networks. In addition, a weight relative to the IPSIs is assigned to the
PE interface. The reason is that if only one adjunct is connected to the system using procr, but everything
else is still IPSI-connected, you wouldn't want the servers to interchange simply because the procr interface went
down. This priority can be seen in the output of the server command, and may be set to HIGH, LOW, or IGNORE
using the server web pages. If set to HIGH, the PE is favored over the IPSIs; if LOW, the IPSIs are favored over
the PE. If set to IGNORE, the SOH of the PE is not used in interchange decisions; if the PE fails on the active server, it
has no effect on the server SOH and does not cause a server interchange.
Procedure:

almdisplay v / almdisplay res |more

cd /var/log/ecs
grep -R arping 2010* (i.e. search for arping in the /var/log/ecs/ log files)

Switch to the sroot login and then run arping -I ethX -f -c 1 -w 1 <IP Address> (check whether the arping
passes; the value of X is identified from the log entries in the previous step).

Note: If this issue occurs 3 times in a row, it can lead to an interchange if only one server sees the failure. Many of
the examples seen have been chronic issues that occur many times over a week or two. In such cases additional analysis
should be done to determine whether there is an underlying issue.
Probable Cause: This alarm is reported when arping fails for any Ethernet port on the server.
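For reference, a minimal example of the arping check above; the interface eth0 and target 10.1.1.1 are placeholders that in practice come from the grep output:

cd /var/log/ecs
grep -R arping 2014*                   # year prefix of the log files: find the failing interface and address
# switch to the sroot login (root privileges), then:
arping -I eth0 -f -c 1 -w 1 10.1.1.1   # placeholder interface and IP: one ARP probe, 1 second timeout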
Alarm Description:

Event Name: Malformed_INADS_alarm-2000000000


31/01:07,EOF,ACT|AUDIT,VM,101,MIN
Description: It indicates a messaging alarm for IA 770.
Procedure:

almdisplay v
statapp (check whether the messaging / INADS AlarmAgent service is down)

almdisplay res |more

Probable Cause: CM reports this alarm when it detects a malfunction of IA 770.

Alarm Description:

Alarm-> MaintObject_USB1_EventID_X
Understanding: Modems are used to call out alarms to an external alarm monitoring system and also
to access Avaya servers remotely by dialing through the modem (e.g. from toolsa we can access the
customer's network only through the modem). The modems in the system are tested every 15 minutes to verify that
dial tone can be achieved. If dial tone is not achieved, the watchdog reports it as an alarm.
Procedure:
almdisplay v /almdisplay res |more
testmodem

restartcause
testmodem
testmodem -t reset_usb (soft reset of the USB modem, if any of the tests fails)

stop -s ModemMtty followed by start -s ModemMtty (if the soft reset doesn't work, restart the ModemMtty
process, but obtain the customer's permission before doing that)
If testmodem still fails, ask the customer to reseat the modem and then the telephone cable
(inserted into the modem).

Note: If the Handshake Test is failing, reseat the modem; if the Off-Hook Test is failing,
get the telephone cable (inserted into the modem) reseated.

If testmodem still fails, get the modem replaced.

Probable Cause: The alarm is reported mostly due to:

a malfunctioning modem, OR
the ModemMtty service being hung or stopped on the server, OR
the telephone line connected to the modem not functioning properly.

Alarm Description:

Alarm->MaintObject_UPD_EventID_X
Description: The kernel update is activated but the activation is not committed.
Procedure:

almdisplay v
swversion a (gives the date when the update was executed)

swversion r (gives information about the previous CM load)

update_show
update_commit <filename/Update ID> (make the update permanent)
almclear -a

Probable Cause: The kernel update was not committed and hence this alarm is reported.
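For reference, a minimal example of committing an update; the update ID below is purely illustrative, the real one comes from update_show:

update_show                      # lists updates and whether they are activated or committed
update_commit 00.0.345.0-12345   # illustrative update ID only: make the activated update permanent
almclear -a                      # clear the alarm once the update is committed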

Alarm Description:

Alarm->MaintObject_UPG_EventID_1
Description: The UPG MO raises an alarm if the upgrade was not made permanent within a certain amount of time
after the upgrade.
Procedure:

almdisplay v / almdisplay res |more

swversion a (verify the date and time of the upgrade)

cat /var/log/ecs/commandhistory (check the command used for the upgrade activity)

commit (if the upgrade was not made permanent)
almclear -a (clear the alarm)

Probable Cause: The alarm is mostly due to an upgrade activity scheduled at the customer side where the
upgrade was not made permanent within the specified time after the upgrade.
Alarm Description:

Alarm->MaintObject_UPS_EventID_X
Description: The UPS process monitors the status of the UPS for each S8700 server. An alarm will be raised
when there is a loss of commercial power or there is some other power problem such as a spike, sag, brownout or
blackout.
Procedure:

almdisplay v /almdisplay res |more

pingall -u

(verify that the UPS is pingable, or else ask the customer to check the network integrity)

snmpwalk <ip-address of UPS switch> -c public 33 | more (to verify whether the system is currently on
backup power)
If the alarm is active, inform the customer and ask them to verify that AC power is being supplied to the UPS, and
coordinate with the vendor, if required.

Probable Cause: The alarm may be reported due to:

a network issue between the server and the UPS, OR
a malfunctioning UPS, OR
AC power not being supplied to the UPS properly.
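For reference, a minimal snmpwalk example for the check above; the UPS address is a placeholder, and the OID 1.3.6.1.2.1.33 (the standard RFC 1628 UPS-MIB subtree) is assumed to be what the guide's shorthand "33" refers to:

ping -c 4 192.168.1.30                                       # placeholder UPS address
snmpwalk -v1 -c public 192.168.1.30 1.3.6.1.2.1.33 | more    # walk the UPS-MIB
# check the output-source / battery objects to see whether the system is running on backup power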
Alarm Description:
Alarm->MaintObject_STD_EventID_X
Description: These are standard SNMP traps (i.e. SNMPv2 protocol) sent by an entity, indicating to the Media
Server either that the entity has undergone a reboot or that it has recognized a failure in one of its
communication links.

Procedure:
If the Event ID is 1 or 2:
almdisplay v / almdisplay res |more
ping <ip-address> (the IP address can be identified from the alarm itself)
Log in to the entity with the above IP address and check whether it has undergone a cold or warm
reboot.
Note: Event ID 1 corresponds to a cold reboot and Event ID 2 to a warm reboot of the entity.
If the Event ID is 3:
almdisplay v / almdisplay res |more
ping <ip-address> (the IP address can be identified from the alarm itself)
Identify the entity; the alarm indicates that the communication link between the media server and that
particular entity is either down or has come up after a failure.
Note: STD_EventID_X alarms are generally in resolved state and can be closed by stating either a
cold/warm reboot of the IP entity or a communication link flap between the media server and the IP entity, as per
the X value.

Alarm Description:
Alarm->MaintObject_ENV_EventID_X
Description: The ENV MO monitors environmental variables (including temperature, voltages, and fans)
within the server. The alarm indicates that one of these variables has deviated from its nominal value.
Procedure:

almdisplay v / almdisplay res |more

environment (the command can be executed only at the root shell prompt; normal output is shown below)
If the alarm is active and you find any of the parameters below deviating from its normal value,
inform the customer and ask them to check the room temperature and power supply. Monitor the alarm for a
couple of hours, and if the alarm is still active, get the motherboard replaced.

Normal output of environment:

root@S8700_A> environment
*** Hardware Health ***
Feature Value Status
CPU IO: 1.50 Normal
CPU CORE: 1.74 Normal
3.3V: 3.33 Normal
5V: 5.03 Normal
+12V: 11.94 Normal
-12V: -11.93 Normal
Fan: 9000 Normal
Fan: 3277 Normal
Fan: 9507 Normal
Fan: 9507 Normal
Temp: 37.00 Normal
Probable Cause: The most common reasons are:
the room temperature or power supply not being proper, possibly due to a power-supply fluctuation or a physical
connection issue, OR
the motherboard having gone faulty.
Alarm Description:
Alarm->MaintObject_ SVC_MON_EventID_X

Description: MO-SVC_MON is a media server process, started by the Watchdog, to monitor Linux services and
daemons. It also starts up threads to communicate with a hardware-sanity device. This alarm indicates that one of the
Linux daemons is down.
Procedure:
almdisplay v /almdisplay res |more
cd /var/log
grep svc_mon messages (to check which daemon was affected)
service <daemon name> status (to check whether the daemon is running)
service <daemon name> start (if the service is not running)
service <daemon name> status (to confirm the service is running)

If the daemon still doesn't come up, inform the customer and get permission for a warm reboot
followed by a cold reboot (only if required), during a lean period.

Probable Cause: One of the monitored daemons was stopped or restarted, possibly due to a server
reboot, a CM reload, or degraded server health.
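For reference, a minimal example of the checks above, using the atd daemon (mentioned earlier in this guide) purely as an illustration:

cd /var/log
grep svc_mon messages     # svc_mon entries name the affected daemon
service atd status        # illustrative daemon: is it running?
service atd start         # start it if it is not running
service atd status        # confirm it is now running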

TN Circuit Packs Alarms


Alarm Description:

Alarm->MaintObject _PKT-INT_Location_
Alarm->MaintObject_PKT-BUS
G3_Cabinet-Down / G3_CircuitPack-Down

Description: An IPSI board contains several different functions, one of them being the PKT-INT. This is the
resource on the IPSI board that manages the LAPD links travelling through the packet bus. These links
include RSCLs, EALs, INLs, etc. The packet bus consists of a single bus, and one such bus appears in each port
network. The packet bus in each port network is physically independent from those in other port networks, so each
port network has a separate PKT-BUS MO.
In addition to affecting telephone service, a Packet Interface / Packet Bus failure affects the service
provided by circuit packs, e.g. ISDN signaling, and the service provided by the C-LAN, VAL, or
IPMEDPRO boards, etc.
Procedure:

almdisplay v / almdisplay res |more


pingall -i (check whether all IPSIs are pingable)

list ipserver-interface (check whether ipsi is up or down)

status packet-interface X (X is cabinet number)

test packet-interface X (X is cabinet number)

status cabinet X

status port-network Y (Y is port-network to which cabinet X belongs to)

status sys-link <IPSI board location>

test sys-link <IPSI board location>


test board <IPSI board location>
cd /var/log/ecs
grep sanity <filename> (check for sanity failures)
grep WARM <filename> (check for Warm reboot of Port-Network)
grep COLD <filename> (check for Cold reset of Port-Network)
display errors

If the alarm is active and any of the IPSIs is down, inform the customer and, with the required permission, go
for a reset of the IPSI, followed by reseating and then, if required, replacing it.
Probable Cause: The alarm may get reported due to:
LAN issue / power outage at the site, OR
a reboot of the port network, possibly due to too many sanity failures, OR
bad health of the IPSI.
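For reference, a minimal shell sketch of the server-side log checks above; the log file name is a placeholder, in practice the newest file listed by ls -ltr:

pingall -i                          # are all IPSIs reachable?
cd /var/log/ecs && ls -ltr          # the newest CM log file is listed last
grep sanity 2014-0203-070101.log    # placeholder file name: look for sanity failures
grep WARM 2014-0203-070101.log      # warm restarts of a port network
grep COLD 2014-0203-070101.log      # cold resets of a port network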
Alarm Description:

Alarm->MaintObject_SYS-LINK
Understanding: System links are packet links that originate at the Packet Interface board and traverse various
hardware components to specific endpoints. The hardware components involved on the forward and reverse
routes can be different, depending upon the configuration and switch administration. Various types of links are
defined by their endpoints: EAL, PRI, RSCL, RSL, MBL etc.
The state of a system link is dependent on the state of the various hardware components that it travels over. Hence,
when analyzing any system link problem, look for other active alarms against the corresponding hardware
components. If any are present, follow the maintenance procedures for the alarmed components to clear those alarms first.

Note: All the above links originate from the PKT-INT, i.e. from an IPSI board, and terminate on the
corresponding circuit packs.
If none of the alarms for the above-listed hardware components are present, except the sys-link alarm itself,
then execute the steps below to clear the alarm.

Procedure:
almdisplay v /almdisplay res |more
list sys-link (to identify the sys-link)

status sys-link <sys-link location> (check whether current path is up or down)

test sys-link <sys-link location> long clear (to clear the stale alarm and/or to identify any failing test)

Probable Cause: The alarm may get reported due to:

LAN issue / power outage at the customer site, OR
bad health of any of the hardware components along the sys-link.

Alarm Description:
Alarm->MaintObject_TONE-BD
Description: For IPSI-equipped EPNs, the TONE-BD MO consists of a module located on the IPSI circuit
pack and provides tone generation, tone detection, call classification, clock generation, and synchronization.
For non-IPSI EPNs, the TN2182B Tone-Clock circuit pack provides the functions.
Note: Check for any other IPSI related alarms, if present follow corresponding procedure to resolve the alarms. If
there are no other alarms then follow the below procedure.
Procedure:
almdisplay v /almdisplay res |more
test tone-clock <board location>

display errors
If the alarm is still present, inform the customer and, with the required permission, proceed with the
following steps:
busyout and release the board, followed by reseating and then, if required, replacing it.
Probable Cause: Malfunctioning of the Tone-Clock board.
Alarm Description:
Alarm->MaintObject_ETH-PT
Description: The TN799DP Control LAN (C-LAN) circuit pack provides TCP/IP connections to adjunct
applications such as CMS, INTUITY, and DCS networking. The C-LAN circuit pack has one 100BASE-T
Ethernet connection and up to 16 DS0 physical interfaces for PPP connections. The C-LAN also acts as a
gatekeeper for IP endpoint registration.
Procedure:
almdisplay v /almdisplay res |more
display port <Port location> (identify the data-module/link number, say X)
status link X / status data-module X (verify the current status of the link/data-module, i.e. whether it is in service
or not)

get ethernet-options <C-LAN board location> (check the Ethernet port settings; Avaya
recommends having the Ethernet port at 100 Mbps full duplex with auto-negotiation off)

test port (check whether all tests pass)
display errors (to identify the cause of the alarm)
test port <Port Location> long r 3 (to clear any warnings regarding the link integrity test)
ping <ip-address of Clan-board> (check whether the server is able to ping the C-LAN board)
If the alarm is active, inform the customer and ask them to check and confirm the network integrity to the C-LAN
Ethernet port; if the customer replies that everything is fine, then follow the procedure below with the required
permission:
busyout port <port-location> and then release it (for a C-LAN board the 17th port is always the Ethernet
port and the remaining 16 are PPP ports; the 32nd port is the RSCL link port)
busyout board <board location> followed by reset board and then release board (if the alarm is still
active, try resetting the C-LAN board)
Get the board reseated, either with the help of the customer or by sending a technician on-site.
If the alarm still doesn't clear, try inserting the C-LAN board into some other slot. If the alarm clears,
then replace the carrier; otherwise replace the circuit pack.
Note: While resetting the IP-interface board through the SAT prompt, first busyout the board and then
disable the Ethernet interface using change ip-interface <board location>. Re-enable it after
reseating the board, and only then release it.
Probable Cause: The alarm may get reported due to:
LAN issue, OR
bad health of the C-LAN board.

Alarm Description:
Alarm->MaintObject_CLAN-BD
Description: The TN799DP Control LAN (C-LAN) circuit pack provides TCP/IP connections to adjunct
applications such as CMS, INTUITY, and DCS networking. The C-LAN circuit pack has one 100BASE-T
Ethernet connection and up to 16 DS0 physical interfaces for PPP connections. The C-LAN also acts as a
gatekeeper for IP endpoint registration.
Procedure:
almdisplay v /almdisplay res |more
test board <C-LAN board location> (check whether all tests pass)
display port <Port location> (identify the data-module/link number, say X)
status link X / status data-module X (verify the current status of the link/data-module, i.e. whether it is in service
or not)

get ethernet-options <C-LAN board location> (check the Ethernet port settings; Avaya
recommends having the Ethernet port at 100 Mbps full duplex with auto-negotiation off)

display errors (to identify the cause of the alarm)

ping <ip-address of Clan-board> (check whether the server is able to ping the C-LAN board)
If the alarm is active, inform the customer and ask them to check and confirm the network integrity to the C-LAN
Ethernet port; if the customer replies that everything is fine, then follow the procedure below with the required
permission:
busyout board <board location> followed by reset board and then release board (i.e. if the alarm is
active, try resetting the C-LAN board)
Get the board reseated, either with the help of the customer or by sending a technician on-site.

If the alarm still doesn't clear, try inserting the C-LAN board into some other slot and check. If the
alarm clears, then replace the carrier; otherwise replace the circuit pack.
Note: While resetting the IP-interface board through the SAT prompt, first busyout the board and then
disable the Ethernet interface using change ip-interface <board location>. Re-enable it after
reseating the board, and only then release it.
Probable Cause: The alarm may get reported due to:
LAN issue, OR
bad health of the C-LAN board, OR
wrong configuration of the C-LAN board.
Alarm Description:
Alarm->MaintObject_IPMEDPRO
Description: In an IP telephony solution, digital signal processing (DSP) resources are used for handling
media streams. DSP resources inter-work audio between the media gateway's time division multiplex (TDM)
bus and the IP network, and also perform transcoding (i.e. converting one codec to another when needed). DSP
resources are dynamically allocated on a call-by-call basis and are provided by the IP Media Processor
(IPMEDPRO) circuit pack for solutions using an S8100, S8500, or S8700 Media Server with G600 or G650 Media
Gateways (or the traditional SCC1 and MCC1 gateways).
There are 2 types of IPMEDPRO circuit packs:
TN2302AP IP Media Processor
TN2602AP IP Media Processor
The TN2302/TN2602 includes a 10/100BaseT Ethernet interface to support IP audio for IP trunks and H.323
endpoints, and also for adjuncts such as a Voice Recording Logger. The IPMEDPRO circuit pack acts as a
service circuit that terminates generic RTP streams used to carry packetized audio over an IP network.
Procedure:
almdisplay v /almdisplay res |more
list configuration board <board location> (to identify the TN2302 / TN2602 circuit pack)

display ip-interface <board location> (verify that the Ethernet port is enabled and is set to 100 Mbps,
full duplex, with auto-negotiation disabled)

test board <board location> (check whether all tests pass)
display errors (check for any errors against the IPMEDPRO circuit pack)
ping <ip-address of Medpro board> (check whether the server is able to ping the Medpro board)

If the alarm is active, inform the customer and ask them to check and confirm the network integrity to the
Medpro's Ethernet port; if the customer replies that everything is fine, then follow the procedure below with the
required permission:

busyout board <board location> followed by reset board and then release board (i.e. if the alarm is
active, try resetting the Medpro board)
Get the board reseated, either with the help of the customer or by sending a technician on-site.
If the alarm still doesn't clear, try inserting the Medpro board into some other slot and check. If the
alarm clears, then replace the carrier; otherwise replace the circuit pack.
Note: While resetting the IP-interface board through the SAT prompt, first busyout the board and then
disable the Ethernet interface using change ip-interface <board location>. Re-enable it after
reseating the board, and only then release it.
Probable Cause: The alarm may get reported due to:
LAN issue, OR
wrong configuration of the Medpro board, OR
bad health of the Medpro board.
Alarm Description:
Alarm->MaintObject_MEDPROPT_Location_X_OnBoard_Y
Description: The Media Processor Port (MEDPROPT) MO monitors the health of the Media Processor
(MEDPRO) digital signal processors (DSPs). This maintenance object resides on the TN2302/TN2602 Media
Processor circuit packs which provide audio bearer channels for H.323 voice over IP calls. One TN2302AP
has 8 MEDPROPTs; each TN2302 MEDPROPT has the processing capacity to handle 8 G.711 coded
Channels, for a total of 64 channels per TN2302. The capacity provided by the TN2602 is controlled by the
Avaya Communication Manager license file and may be set at either 80 G.711 channels or 320 G.711
channels. If individual DSPs on the TN2302AP or TN2602 fail, the board remains in-service at lower capacity.
The MEDPROPT is a shared service circuit. It is shared between H.323 trunk channels and H.323 stations.
An idle channel is allocated to an H.323 trunk/station on a call-by-call basis.
Note: If any Medpro-board/TDM/Pkt-Int alarm is present along with Medpropt, follow corresponding
procedure to proceed further or else follow below procedure
Procedure:
almdisplay v / almdisplay res |more
status media-processor board <board location> (check whether the Ethernet and MPCL links
are up and all DSP channels are in the in-service/idle or busy state)

test port <port location> (check whether all tests against the port pass)
test board <Medpro board location> (check whether all tests for the board pass)
display errors (check for any errors against the MEDPROPT to identify the cause)
test board <Medpro board location> long r 5 (to execute the test board command five times)
busyout and release port <port location> (soft reset of the MEDPROPT)

Probable Cause: Bad health of the Medpro board


Alarm Description:
Alarm->MaintObject_VAL-PT
Description: The alarm indicates that CM has sensed a fault in either the playback or the recording of an
announcement through a particular port/board.
Note: If any Val-board/TDM/Pkt-Int alarm is present along with Val-Pt, follow corresponding
procedure to proceed further or else follow below procedure
Procedure:
almdisplay v /almdisplay res |more
test port <Val-PT location> (check that all tests pass for the port)
test board <Val-Bd location> (check that all tests pass for the board)
display errors (check for errors against the board to identify the cause)
busyout and release <VAL-PT> (i.e. soft reset of the VAL port)
Note: If the error type is 1 and the firmware of the VAL board is 20, the board needs to be upgraded to firmware 21,
because firmware 20 has a certain software limitation.
Probable Cause: The alarm may get reported due to:
bad health of the VAL port, OR
the usage of the VAL port exceeding its threshold.
Alarm Description:
Alarm->MaintObject_VAL-BD
Description: The Voice Announcements over the LAN (VAL) TN2501AP provides per-pack announcement
storage time of up to one hour, up to 31 playback ports, and allows for announcement file portability over a
LAN. The VAL circuit pack also allows for LAN backup and restore of announcement files and the use of user
provided (.WAV) files.

Procedure:
almdisplay v /almdisplay res |more
display ip-interface <board location> (verify that the Ethernet port is enabled and is set to 100 Mbps,
full duplex, with auto-negotiation disabled)

test board <board location> (check whether all tests pass)
display errors (check for any errors against the VAL board to identify the cause)
ping <ip-address of Val board> (check whether the server is able to ping the VAL board)

If the alarm is active, inform the customer and ask them to check and confirm the network integrity to the
VAL board's Ethernet port; if the customer replies that everything is fine, then follow the procedure below with the
required permission:

busyout board <board location> followed by reset board and then release board (i.e. if the alarm is
active, try resetting the VAL board)
Get the board reseated, either with the help of the customer or by sending a technician on-site.
If the alarm still doesn't clear, try inserting the VAL board into some other slot and check. If the alarm
clears, then replace the carrier; otherwise replace the circuit pack.
Note 1: While resetting the IP-interface board through the SAT prompt, first busyout the board and then
disable the Ethernet interface using change ip-interface <board location>. Re-enable it after
reseating the board, and only then release it.
Note 2: Before resetting the VAL board or getting it reseated, it is recommended to take a backup
of the announcements present on the VAL board, because sometimes announcement files may get erased.
This can be confirmed with the customer, since they may have a scheduled VAL backup in place.
In case we are required to take the announcements backup, follow the procedure below (see also the example session after this list):
VAL Backup Procedure:
list directory board <Val board location> (this command runs at the SAT prompt; it lists all the
announcement files present on the VAL board)
enable filexfer (this command runs at the SAT prompt; here you need to define a
login/password, set secure to no, and mention the VAL board location)
sudo ftpserv on (this command runs at the shell prompt; it turns on the
FTP service, so that we can FTP to the VAL board from the server)
ftp <ip-address of Val board>
bin
(use binary transfer mode)
hash
(show the file transfer status)
prompt
(get more than one file with one command)
mget *.*
(gets all the files from the VAL board to the server)
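A minimal example session for the backup above; the VAL board address is a placeholder, and the login used is whatever was defined during enable filexfer:

sudo ftpserv on        # shell prompt: turn on the FTP service
ftp 192.168.1.40       # placeholder VAL board address; log in with the filexfer login/password
bin                    # binary transfer mode
hash                   # show transfer progress
prompt                 # no per-file confirmation for mget
mget *.*               # copy all announcement files from the VAL board to the server
bye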

Probable Cause: The alarm may get reported due to:

LAN issue, OR
bad health of the VAL board, OR
wrong configuration of the board.
Alarm Description:
Alarm->MaintObject_SNI-BD _Location_X
Alarm->MaintObject_SNI-PEER_Location_X
Description: The SNI circuit pack reporting the error indicates that it has a problem with the control path,
circuit path, or packet path to the SNI peer in the slot indicated.
Procedure:
almdisplay v / almdisplay res |more
display errors (check for any errors against the board; in the case of an SNI-PEER alarm, the failed SNI
board can be identified from the error type given in the table below)
test board X (check whether any test is failing)
status switch-node-clock (to identify the active and standby SNCs)
set switch-node-clock <location of standby SNC board> (to make the standby SNC active, but only
with the customer's permission; if the alarm then clears, replace the current standby SNC board, or
else revert the action)
If the alarm does not clear, then with the customer's permission go ahead with a soft reset (i.e. busyout, reset, and
then release the board), followed by a reseat of the board (i.e. removing the board from the slot
and re-inserting it), and then, if the alarm still doesn't clear, replace the board.

Probable Cause: Bad health of the SNI board / SNC board, or of the fiber link associated with the SNI-BD.

Alarm Description:
Alarm->MaintObject_SN-CONF _Location_X
Description: A switch node carrier contains:
Up to 16 Switch Node Interface (SNI) TN573 circuit packs in slots 2 through 9 and slots 13
through 20
One or two Switch Node Clock (SNC) TN572 circuit packs in slots 10 and 12
An Expansion Interface (EI) TN570 circuit pack, a DS1 Converter (DS1C) TN574 circuit
pack, or no circuit pack in slot 1
An optional DS1 converter circuit pack in slot 21
Procedure:
almdisplay v / almdisplay res |more
test board X (check for any test, if getting failed)
list fiber-link (to identify fiber link and other end-point to which it is connected)
test fiber-link Y (check for any test, if getting failed)
display errors (check for any errors against the board)
clear firmware-counters location (SNC firmware generates error reports independently of
demand tests. Therefore, test board X does not affect the error status by firmware Hence this
command needs to be executed to clear any firmware generated errors unconditionally.)
Inform the customer and verify that the fiber link physically connected to the SNI board and the other
end-point of the fiber link are properly administered.
If the alarm does not clear, then inform the Customer and with the required permission go ahead with a
soft reset (i.e. busyout, then reset, and then release the board), followed by a reseat of the board (i.e.
removing the board from the slot and re-inserting it); if the alarm still doesn't go off, replace the board.
Probable Cause: SN-CONF errors and alarms are generated for the following types of failures:
Failure of an SNI or SNC board OR
Absence of physical connectivity of a fiber link between two end-points (i.e. between 2 SNIs, 2 EIs,
an SNI & an EI, or a DS1C & an SNI/EI) even though the link is administered on CM OR
Two end-points are physically connected but not administered in the CM software.

Alarm Description:
Alarm->MaintObject_SNC-LINK _Location_X
Alarm->MaintObject_SNC-BD _Location_X
Alarm->MaintObject_SNC-REF _Location_X
Description: The Switch Node Clock (SNC) TN572 circuit pack is part of the Center Stage Switch (CSS)
configuration. It resides in a switch node carrier that alone or with other switch nodes make up a CSS. In a
high-reliability system (duplicated server and control network, unduplicated PNC), each SNC is duplicated
such that there are two SNCs in each switch node carrier. In a critical-reliability system (duplicated server,
control network, and PNC), each switch node is fully duplicated, and there is one SNC in each switch node
carrier. SNCs are placed in slots 10 and 12 of the switch node carrier. These are the alarms associated with
SNC circuit pack:
-The SNC-LINK MO reports errors in communications between the active Switch Node Clock and Switch
Node Interfaces over the serial channel (Aux Data 1) and the TPN link (Aux Data 2).
-The SNC-BD MO covers general SNC board errors and errors with the serial communication channel
between the active and standby SNCs.

-The SNC-REF MO reports errors in SNI reference signals detected by the active Switch Node Clock.
Note: If any alarm related to SNI-BD, SNI-PEER, fiber-link or DS1C-BD is present, then follow the
corresponding repair procedures first.
Procedure:
almdisplay v / almdisplay res |more
test board X (check whether all test are getting passed)
display errors (check for any errors against the board)
clear firmware-counters location (SNC firmware generates error reports independently of
demand tests; therefore test board X does not affect the error status reported by the firmware.
Hence this command needs to be executed to clear any firmware-generated errors unconditionally.)
If the alarm does not clear, then inform the Customer and with the required permission go ahead with a
soft reset (i.e. busyout, then reset, and then release the board), followed by a reseat of the board (i.e.
removing the board from the slot and re-inserting it); if the alarm still doesn't go off, replace the board.
Probable Cause: The alarm may get reported due to:
Bad health of any of the hardware components mentioned in the above note OR
Bad health of the SNC board OR
Configuration issue for a new installation or some change activity at the customer site.
Alarm Description:
Alarm->MaintObject_EXP-INTF_Location_X
Description: The TN570 or the TN776 Expansion Interface (EI) circuit pack provides a TDM-bus and packet-bus
to fiber interface for the communication of signaling information, circuit-switched connections, and
packet-switched connections between endpoints residing in separate PNs. EI circuit packs are connected via
optical fiber links.
Note: If any alarm, related to IPSI which is acting as archangel or fiber-link or TDM bus or Tone-Clk, is
present then follow corresponding repair procedures first to resolve the alarm
Procedure:
almdisplay v /almdisplay res |more
status cabinet X (to check status of connectivity of EPN)

status synchronization (to confirm that there is no issue with synchronization)

display errors (check for errors to identify cause)


test board <board location> (check whether test are getting passed for board)
test board <board location> long r 3 (to clear the minor errors)
If the alarm does not clear, then inform the Customer and with the required permission go ahead with a soft
reset (i.e. busyout, then reset, and then release the board), followed by a reseat of the board (i.e. removing the
board from the slot and re-inserting it); if the alarm still doesn't go off, replace the board.
Probable Cause: The alarm may get reported due to:
Bad health of any of the hardware components mentioned in the above note OR
Bad health of the EI board OR
Configuration issue for a new installation or some change activity at the customer site.
Alarm Description:

Alarm->MaintObject_EXP-PN_Location_PN X
Description: The EXP-PN MO is responsible for overall maintenance of an Expansion Port Network (EPN)
and monitors cross-cabinet administration for compatible companding across the circuit-switched connection. The
focus of EPN maintenance is on the EI or IPSI circuit pack that is acting as the Expansion Archangel link in an EPN.
Note: If alarms involving the EI board, the IPSI board acting as Expansion Archangel, or any of the
hardware involved with the CSS (such as SNI-BD, SNC-BD, DS1C-BD or fiber link) are present, then
these alarms need to be repaired first.
Procedure:
almdisplay v /almdisplay res |more
status port-network X (check status of EPN )

status sys-link <EI slot location> (To identify which IPSI is controlling EPN and check whether
any other alarm is present for the identified IPSI and/or for EI circuit pack. If yes, follow corresponding
procedure to resolve the alarm)

display errors (to identify the cause of the alarm)


grep sanity <filename> (check for sanity failures of the IPSI acting as archangel for the EPN)
grep WARM <filename> (check for Warm reboot of Port-Network)
grep COLD <filename> (check for Cold reset of Port-Network)

If the alarm is still active and the corresponding IPSI and EI circuit packs are fine, inform the Customer
about the alarm and, with the required permission, follow the procedure below:

reset port-network X level 1 (ie perform Warm Restart of EPN)


reset port-network X level 2 (ie perform Cold Restart of EPN)
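As a sketch of the log checks above, run from the server shell (the log file name 2014-0101-000000.log is hypothetical; pick the current file under /var/log/ecs):
cd /var/log/ecs
ls -lrt                               (identify the most recent log file)
grep sanity 2014-0101-000000.log      (IPSI sanity failures for the EPN's archangel)
grep WARM 2014-0101-000000.log        (warm restarts of the Port-Network)
grep COLD 2014-0101-000000.log        (cold restarts of the Port-Network)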

Probable Cause: The alarm may get reported due to:
Bad health of any of the hardware components stated in the above Note OR
Network issue/power outage at the customer site OR
The Port Network has undergone a reboot OR
Configuration issue due to a new installation or some change activity at the customer site.
Alarm Description:
Alarm->MaintObject_FIBER-LK_Location_X
Description: A fiber link consists of the endpoint boards that are connected via the optical fiber, the lightwave
transceivers or metallic connections on the endpoint boards, and, if administered, the DS1 Converter (DS1
converter) complex that exists between the two fiber endpoints. The fiber endpoints are EI circuit packs
and/or Switch Node Interface (SNI) circuit packs. Fiber link errors and alarms are generated only on fibers that have at least one
SNI endpoint. Fiber errors for fibers that have EIs as both endpoints are detected by the EI circuit pack, thus
generating off-board EXP-INTF errors and alarms. Fiber errors and alarms on EI-SNI fiber links generate
FIBER-LK and/or off-board EXP-INTF errors and alarms
Note: If an active alarm is also present for either end-point of the fiber link, then follow the
corresponding repair procedures first.
Procedure:
almdisplay v /almdisplay res |more
status synchronization (to confirm that there is no issue with synchronization of the system)

display fiber-link X (to identify the end-points and check whether any alarm is present for any of the
end-point. If yes, follow corresponding repair procedure to resolve the alarm)

test fiber-link X (check whether all test are getting passed for fiber-link)
display errors (to identify cause for the alarm)
busyout and release the fiber link (i.e. a soft reset of the fiber link, with the Customer's permission)
If the alarm is still active and the corresponding end-points are fine, ask the Customer to check the physical
connectivity, i.e. that the fiber link is properly terminated onto the end-points, and also to check the fiber
cable to ensure no cuts are present on it.
Probable Cause: The alarm may get reported due to:
Bad health of end-points that are connected through fiber-link (ie either SNI/EI/DS1C board as per the
solution deployed at customer site)
Physical Connectivity issue (ie either fiber-link is not properly terminated onto the end-points or else
fiber-link is broken in between.)
Configuration Issue for a new installation or due to some change activity at the customer site.
Alarm Description:
Alarm->MaintObject_DS1C-BD_Location_X
Description: The DS1 converter complex is part of the port-network connectivity (PNC) consisting of two
TN574 DS1 Converter or two TN1654 DS1 Converter circuit packs connected by one to four DS1 facilities. It
is used to extend the range of the 32-Mbps fiber links that connect PNs to the Center Stage Switch, allowing
PNs to be located at remote sites. The DS1 converter complex can extend a fiber link between two EIs or
between a PN EI and an SNI. Fiber links between two SNIs or between a PN and the Center Stage Switch
(CSS) cannot be extended.
Note: If SYNC, TDM-CLK, SNC-BD,SNI-BD, Fiber-Lk or DS1-FAC alarms are present then follow
corresponding repair procedures first. If only DS1C-Bd alarm is present, follow below procedure.
Procedure:
A. If alarm is off board
almdisplay v /almdisplay res |more
display errors (to identify the cause of alarm ie either TDM-Clk/SYNC/SNC-BD or fiber-link or ds1
facility alarms)
If errors are associated with the ds1-facility then follow steps 3 & 9, but with the Customer's permission
busyout & release ds1-facility <ds1c-board location> (If errors associated to DS1 facility are present,
then have a soft reset of the ds1-facility)
If errors are associated with Synchronisation or fiber-link then follow step 4 & 9
status synchronization (to check the synchronization status and fiber-link could be the source for
synchronization issue in an EPN)
list fiber-link (to identify corresponding fiber-link no and also the extreme end-points of that fiber-link.)
test fiber-link X (check for any test, if getting failed follow the corresponding repair procedure)
test board <DS1C board location> (check for any test, if getting failed)
If the alarm does not clear, then inform the Customer and with the required permission go ahead with a soft reset
(i.e. busyout, then reset, and then release the board), followed by a reseat of the board (i.e. removing the
board from the slot and re-inserting it); if the alarm still doesn't go off, replace the board.
Probable Cause: The alarm may get reported due to:
A synchronization issue or malfunctioning of the TDM-Clk or fiber link for an EPN OR
An issue encountered with the ds1-facility provided by the DS1C board OR
Bad health of the DS1C board OR
Configuration issue for a new installation or due to some change activity at the customer site.

B. If alarm is on board
almdisplay v /almdisplay res |more
test board <DS1C board location> (check for any test, if failing)
display errors (to identify cause)
If the alarm does not clear, then inform the Customer and with the required permission go ahead with a soft reset
(i.e. busyout, then reset, and then release the board), followed by a reseat of the board (i.e. removing the
board from the slot and re-inserting it); if the alarm still doesn't go off, replace the board.
Probable Cause: The alarm may get reported due to:
Bad health of the DS1C board OR
Configuration issue for a new installation or due to some change activity at the customer site.
Alarm Description:
Alarm->MaintObject_TDM-BUS_Location_PN X
Description: Each port network has a pair of TDM buses, designated TDM bus A and TDM bus B, each
with 256 time slots. This division allows for duplication of control channels and dedicated tone time slots. The
first five time slots on each bus are reserved for the control channel, which is active on only one bus at a time
in each port network. The next 17 time slots are reserved for system tones such as dial tone, busy tone and
so on. As with the control channel, these time slots are active on only one bus, A or B, at a time. The rest of
the time slots on each bus are for general system use such as carrying call-associated voice data. The 17
dedicated tone time slots that are inactive can also be used for call processing when every other available
time slot is in use.When the system initializes, the control channel is on TDM bus A and the dedicated tones
on TDM bus B in each port network. If a failure occurs on one of the two buses, the system will switch any
control, tone and traffic channels to the other bus. Service will still be provided, though at a reduced capacity.
TDM-bus faults are usually caused by one of the following:
A defective circuit pack connected to the backplane
Bent pins on the backplane
Defective bus cables or terminators
Procedure:
almdisplay v /almdisplay res |more
status port-network X

test tdm port-network X (to check all the test are getting passed for the tdm bus in a port network)

disp errors (to identify the cause of the issue)

If the alarm is active, then follow the procedure below to isolate and detect the TDM-Bus fault. Always
inform the Customer about the issue and the plan of action stated below before proceeding further, and always
have an SFM or TM involved in the execution:
Step 1: Check for any active alarms for Tone-Clock /Detectors Board, Expansion Interface
i.e. EI board alarms and Packet Interface i.e. IPSI board alarms or any other TN
Circuit-Pack. Follow corresponding procedure to resolve the respective alarms and then check for TDM-Bus alarm,
if it is cleared, close the case.
Step 2: If no active alarm is present for any of the Tone board or EI or IPSI board or any other circuit-pack, then
a) If duplicated circuit-pack is present, then switch standby circuit-pack to active and check for the
alarm. If alarm is resolved, remove the then standby circuit-pack and check for backplane pins. If
they are bent then switch-off the power of this Carrier and straighten or replace the pins and reinsert
the circuit-pack and restore the power.
b) Try removing all the circuit packs in the Port-Network one by one, depending upon the criticality
of the function of the circuit pack. This means the IPSI/EI board should be removed last and the
Tone-Clock board second to last (because removing these circuit packs will result in disconnection of
the corresponding Port-Network).
c) When any of the circuit packs is removed, determine whether the backplane pins in the slot
appear to be bent. If yes, then switch off the power to this carrier, straighten or replace the
pins, re-insert the circuit pack and restore the power. If the backplane pins are not bent,
then simply re-insert the circuit pack.
d) If all the circuit-packs are checked as mentioned above, and alarm is still active then try replacing
TDM cable assemblies and TDM Bus terminators and then if required replace carrier itself.
Probable Cause: The alarm may get reported due to:
Control of system tones being switched from one bus to the other OR
Bad health of the circuit pack providing Tone-Clock functions OR
Physical connectivity issue, i.e. TDM cable assemblies, TDM bus terminators, or the backplane pins
which connect to the circuit pack inside the slot.
Alarm Description:
Alarm->MaintObject_POW-SUP_Location_X
Description: This MO verifies the physical presence of each power supply in a G650 and that its output
voltage is within tolerance.
Procedure:
almdisplay v / almdisplay res | more
test board <Power Supply Board Location> (check for any test that is failing)
status environment <Cabinet No.> (Check the environment of the cabinet)

test environment <Cabinet No.> (Check whether all test are getting passed)

display errors (and select board) (check for errors to find the cause of alarm)
Inform Customer and with required permission go ahead with soft reset followed by reseat of the
board and then, if required, replace it.

Probable Cause: Bad health of Power Supply board OR Power supply being delivered to the board.
Alarm Description:
Alarm->MaintObject_M/T-BD / MT-ANL/ M/T-DIG/ M/T-PKT
Description: The Maintenance/Test circuit pack (TN771D) supports packet-bus fault detection and bus
reconfiguration for the port network where it is installed. The circuit pack also provides Analog
Trunk testing and data loop-back testing of DCP Mode 2 endpoints and Digital (ISDN) Trunk
facilities via the TDM bus.
Port 1 of the Maintenance/Test board is the Analog Test port which provides the Analog Trunk testing
function for Automatic Transmission Measurement System (ATMS). M/T-ANL maintenance ensures that the
analog trunks testing function is operating correctly.
Ports 2 and 3 are the digital ports which provide the digital (ISDN) trunk-testing functions.
M/T-DIG maintenance ensures that the digital trunk testing function is operating correctly.
Port 4 is the packet port, which provides the packet-bus maintenance functions: packet-bus fault detection and
packet-bus reconfiguration.
Procedure:
almdisplay v /almdisplay res |more
test board <board location> (check for any test, if failing)
display errors
test board <board location> long
If the alarm does not clear, then inform the Customer and with the required permission go ahead with a soft reset
(i.e. busyout, then reset, and then release the board), followed by a reseat of the board (i.e. removing the
board from the slot and re-inserting it); if the alarm still doesn't go off, replace the board.
Probable Cause: Bad health of Maintenance port/ board.

Alarm Description:

Alarm->MaintObject_PS-RGEN_Location_X
Alarm->MaintObject_RING-GEN_Location_X
Understanding: The PS-RGEN maintenance object monitors the ringing voltage of each 655A power supply. The
TN2312BP IPSI uses the ring detection circuit on the 655A to monitor ring voltage for the G650. Failure of the ring
generator results in loss of ringing on analog phones. Ringing on digital and hybrid phones is not affected.
Procedure:
almdisplay v / almdisplay res | more
test board <Power Supply Board Location> (check for any test that is failing)
status environment <Cabinet No.> (check the environment of the cabinet; all entries should show OK)

test environment <Cabinet No.> (Check whether all test are getting passed)

display errors (and select board) (check for errors to find the cause of alarm)
Inform Customer and with required permission go ahead with soft reset followed by reseat of the
board and then, if required, replace it.

Probable Cause: Bad health of Power Supply board OR Power supply being delivered to the board

Alarm Description:

Alarm->MaintObject_NR-CONN
Understanding: The Network-Region Connect (NR-CONN) MO monitors VoIP connectivity between network
regions by running a connectivity test (Test #1417) between IP endpoints in separate network regions. A Minor
alarm is raised for multiple failures: once a single failure is detected, Test #1417 is re-executed between
different IP endpoints in the same pair of network regions.
Procedure:

almdisplay v /almdisplay res |more


display failed-ip-network-region (check which ip-network-regions are alarmed; when none are alarmed,
the output lists no failed network regions)

ping ip-address A board B (where A is the ip-address of an endpoint in one NR and B is an IP board in the
other NR; both need to be from the alarmed ip-network-regions)
status ip-network-region X
test failed-ip-network-region X (to clear the alarms and/or to check whether all tests are passing
for the ip-network-region)
display errors (to identify the cause of the alarms)
display ip-network-map (to identify and confirm an entry, as required, against the failed ip-network-region,
because values here may get modified after any change/update activity)
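A short worked example of the checks above, assuming network regions 1 and 2 are the failed pair, 10.1.1.10 is an IP endpoint (e.g. a Medpro) in region 2 and 01A02 is an IP board in region 1 (all values hypothetical):
display failed-ip-network-region            (confirm regions 1 and 2 are listed)
ping ip-address 10.1.1.10 board 01A02       (test connectivity from a region-1 board to a region-2 endpoint)
status ip-network-region 1
test failed-ip-network-region 1             (re-run the tests and clear the alarm if they pass)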

Probable Cause: The alarm may get reported due to:
Network issue between the two network regions OR
Configuration issue on the ip-network-region / ip-network-map form

Survivable Processor Alarms


Alarm Description:
Alarm->MaintObject_LIC-ERR_Location_<NONE>_OnBoard_

Description: The license is either missing or has become corrupted, or the alarm is on one of the LSP/ESS
servers that is controlling a Media-Gateway/Port-Network.
Procedure:
If the license is either missing or corrupted:
almdisplay v / almdisplay res |more
statuslicense v (check whether license is corrupted/missing/normal)

Download a license copy from https://rfa.avaya.com onto your laptop (for CM 5.2)
For CM versions later than CM 5.2, download the license from PLDS.
Stage it onto the server through the SIG.
Command line on the SIG:
scp <filename> init@<IP Address of server>:/var/home/ftp/pub
loadlicense <filename> (on the shell prompt of the server; you may have to wait up to 30 minutes until the
license comes to Normal status)
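A minimal sketch of staging and loading the license, assuming the downloaded file is named cm_license.xml and the server address is 10.1.1.5 (both hypothetical; hyphenated option forms assumed):
scp cm_license.xml init@10.1.1.5:/var/home/ftp/pub      (copy the file to the server)
ssh init@10.1.1.5                                       (log in to the server shell)
loadlicense /var/home/ftp/pub/cm_license.xml            (install the license)
statuslicense -v                                        (the license mode should return to Normal, which can take up to 30 minutes)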

Installing the License through Citrix:

Open Drop Box option from Citrix Web page.

Now, open folder <your avaya handle> and then drag and drop the downloaded license file into this
location

Now, open the Desktop folder and then drag and drop the License file to the Notes folder under FTP odva.w.ag.60\TSH (this may be different for some) as shown below.

Now, login to the Web LM with the default user name and password (admin/admin01) or check with the
customer if it is changed. And click on Install License option and then select Browse

Now, select the License file which was earlier saved on the FTP server.

Now, click on Install. This will install the required License on the CM.
Now, click on Communication Manager option and see the new license details.

Note: If you receive any conflict error with the old existing Certificate, try uninstalling the old License from
Uninstall License option and install a new one.
Probable Cause: The alarm could have been reported due to:
After an update/upgrade either license file was not installed or corrupt license file was installed OR
License file got corrupted due to bad health of the server
Solution:
If the LSP/ESS has become active:
almdisplay v / almdisplay res |more
list survivable-processors (check whether lsp/ess is registered with main server)
list media-gateways (to check if any Media-Gateway is not registered to main server and is registered to
lsp.)

ping <ip-address of lsp/ess from main server>

traceroute <ip-address of lsp/ess from main server>


check with customer for further update on network issues/power outage, if any at the site

Probable Cause: The alarm could have been reported due to:
Lan Issue /Power outage at the site OR
Main server is down and hence PNs and/or MGs got registered to the ESS or an LSP.
Alarm Description:
Alarm->MaintObject_ESS_Location_CL 000_OnBoard_N
Description: One or more IPSI is not pingable from the ESS server or ESS server is not able to detect the
serial-number of ipsi
Procedure:
almdisplay v / almdisplay res |more
pingall i (check whether all ipsis are pingable)

cd /var/log/ecs
grep -R sanity <filename>
serialnumber (check whether serialnumber of all ipsis are being detected by the server)

netstat -v |grep "5010" (check whether tcp link is established between server and ipsi)
ipsiversion a (check the firmware version of ipsi and its compatibility with the CM load on the server)
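A condensed sketch of the checks above, run from the ESS shell (hyphenated option forms assumed; output fields vary by release):
pingall -i                        (every administered IPSI should respond)
netstat -v | grep 5010            (an ESTABLISHED connection on port 5010 is expected per IPSI under this server's control)
ipsiversion -a                    (compare IPSI firmware versions against the CM load)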

Probable Cause: The common cause is that the ESS is either not able to ping one of the IPSIs or not able to
detect the serial number of an IPSI, possibly because of:
Lan issue OR
ipsi firmware mis-match with CM release OR
bad health of an ipsi
Alarm Description:
Alarm->MaintObject_ESS_EventID_X (where X=1,2,3or4)
Description:

Procedure:
almdisplay v / almdisplay res |more
status ess port-networks (check whether any port-network being controlled by an ess)

cat /etc/ecs.conf , (get the ip-address of main server)


pingall i , (check whether all ipsis are pingable from main server)

traceroute <ip-address of IPSI>,(traceroute only those ipsis from main server which are not
pingable)
get forced-takeover ipserver-interface port-network X
(If the IPSI is pingable from the main server but is being controlled by the ESS, then with the Customer's
permission force control of the IPSI back to the main server)

cd /var/log/ecs
grep -R sanity <filename>
(If IPSI is pingable and being controlled by Main Server, then check for sanity failures if any which
could be the cause for the alarm)
If IPSI is not pingable, check with customer for network issue , if any.

Note: For EventID_3/EventID_4, either the IPSI controlling the EI link of the EPN is registered to the ESS server or
there could be some issue with the fiber link. In case no issue has been found with the IPSI, then check for
fiber-link issues and continue with the steps below. The steps below are only to be followed for EventID_3 or 4.
cat /var/log/messages |more (to check for any fiber-link issues trace)
list fiber-link (get the details of fiber-links)
test fiber-link X (check for any test, if getting failed)
status sys-link <Slot Location for the EI Board> (check which IPSI is controlling the alarmed EPN)
list ipserver-interface (check for any errors on ipsi ie CPEG and ip-address of ipsi. )
display errors
restartcause
Probable Cause:
Lan issue OR
Bad health of main server OR
Issue with physical Connections of fiberlink (only for Event-Id 3&4)
Alarm Description:
Alarm->MaintObject_ESS_EventID_5
Alarm->MaintObject_ESS_EventID_6
Description: Enterprise Survivable Server cluster not registered ie ESS is not registered to the main server
(EventID_5) or it is registered back to main server (EventID_6)
Procedure:
almdisplay v / almdisplay res |more
status ess cluster

list survivable-processor
cd /var/log
grep -R register messages (to check which cluster is/was not registered)

ping <ip-address> (ping LSP from main server)


traceroute <ip-address> (if the LSP is not pingable, then traceroute to check at which hop the ping is
failing)
Inform the Customer to check for network issues, if any, at the site
restartcause
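A sketch of the registration and reachability checks, run from the main server shell (10.2.2.20 is a hypothetical ESS/LSP address):
cd /var/log
grep -i register messages | tail -20     (recent registration/unregistration events for the survivable servers)
ping -c 4 10.2.2.20                      (is the ESS/LSP reachable from the main server?)
traceroute 10.2.2.20                     (if not, identify the hop where the path breaks)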

Probable Cause: The alarm could have been reported because of:
Lan Issue / Power outage at site OR
ESS server is down due to its bad health.

Adjuncts Associated Alarms


Alarm->MaintObject_PRI-CDR
Description: The CDR feature records detailed call information about every incoming and outgoing call on
specified trunk groups and sends this information to a CDR output device. The two physical links can be
administered for connecting external CDR output devices to the system. They are identified as the primary
CDR (PRI-CDR) link and the secondary CDR (SEC-CDR) link.
Procedure:
almdisplay v / almdisplay res |more
status cdr-link

display ip-services (to find the node-name used for CDR and the associated Clan board)
display node-names ip <Node-names found in above step> (to get the ip-addresses of the CDR and the
Clan board)
test board <Clan board location> (check for any test failing and/or any active alarm for the clan
board. If yes, proceed further with investigation on the Clan board, as discussed in the respective
section )
ping <ip-address of CDR> board <Clan Bd location (to which that CDR is connected)> (Check
any issues with the lan connectivity)
display errors (to identify the cause)
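A short worked sequence for the checks above, assuming the CDR host address is 10.3.3.30 and the CLAN carrying the link is in slot 01A02 (both hypothetical):
status cdr-link                             (is the primary/secondary link up or down?)
display ip-services                         (note the local node (CLAN) and remote node used for CDR)
display node-names ip                       (resolve both node-names to IP addresses)
ping ip-address 10.3.3.30 board 01A02       (test reachability of the CDR host from that CLAN)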

Probable Cause: The alarm may get reported due to:
Network issue between the CDR device and the Clan board it is connected to OR
Bad health of the Clan board / CDR device OR
Scheduled activity at the Customer site
Alarm Description:
Alarm->MaintObject_ASAI-PT/BD_Location_X
Description: ASAI-PT corresponds to the fault detection of a port which is connected to an adjunct that
is not of Avaya make.
Procedure:
almdisplay v / almdisplay res |more
display port <Port location> (check the CTI-Link number)
test board X (X is the board location)
test cti-link Y (Y is cti-link number)
display errors (to identify the cause)
busyout/release cti-link Y/board X (need to reset the link and/or the board to which that Adjunct
is connected with. Do inform the customer about the same)
If alarm is still active, need to check with customer for the functioning of the Adjunct.

Probable Cause: The alarm may get reported due to:


Malfunctioning of Adjunct OR
Bad health of Clan board/MapD board to which that Adjunct is Connected to OR
Network Issue /Physical connectivity issue of the link

Alarm Description:
Alarm->MaintObject_ADJ-IP_Location_X
Alarm->MaintObject_AESV-SES_Location_X
Alarm->MaintObject_ASAI-IP_Location_X
Description: ASAI-IP corresponds to the fault detection of a cti-link which is connected to an adjunct that
is not of Avaya make, whereas ADJ-IP corresponds to an adjunct that is of Avaya make.
Procedure:
almdisplay v / almdisplay res |more
test cti-link X

status aesvcs cti-link (to check the service state of cti-link)

status aesvcs link (to check remote IP and local nodename for Clan BD, also to identify aesvcs
server number)

ping ip-address <Remote IP> board <Clan board location> (to check the lan connectivity issue)
test aesvcs-server <aesvcs server number>

display errors (to identify the cause)


busyout/release cti-link X (ie soft reset of CTI-Link)
If alarm is still active, need to check with customer for the functioning of the Adjunct.

Probable Cause: The alarm may get reported due to:


Malfunctioning of Adjunct/AES server OR
Bad health of Clan board, to which that Adjunct is Connected to OR
Network Issue /Physical connectivity issue of the link

Trunk/Trunk Board Associated Alarms


Alarm Description:
Alarm->MaintObject_MG-IAMM
Description: The MG-IAMM Maintenance Object resides on the Integrated Analog Media Module (IAMM)
located on the main board of the G250, G350 and TGM 550 Media Gateways.
On a G250, slot #3 is virtual and designated as the integrated media module slot.
On a G350, slot #7 is virtual and designated as the integrated media module slot.
For the G250-BRI and G350, the Integrated Analog Media Module consists of three (3) fixed-configuration
analog ports:
Port 1 is fixed as an analog trunk port.
Ports 2 and 3 are fixed as analog line ports.
Ports 1 and 2 are used by the media gateway processor for emergency transfer relay
(ETR) purposes.
Procedure:
almdisplay v / almdisplay res |more
test board <board location> (Check for any test, if failing)
display errors (to identify the cause)
If the alarm is active, then inform the Customer and with the required permission go ahead with a soft reset (i.e.
busyout, then reset, and then release the board).
If the alarm doesn't go off, then replace the Media-Gateway.
Probable Cause: Bad health of integrated media-Module.

Alarm Description:

Alarm->MaintObject_UDS1-BD / MG-DS1
Understanding: The alarm refers to the DS1 Interface circuit packs, UDS1 Interface circuit
packs, and DS1 Interface Media Modules.
Solution:

Check for the alarm on the active server. almdisplay v (to display active alarms)
If the alarm is not displayed in active alarms, check for resolved alarms.
almdisplay res|more (Display resolved alarms in page wise, use spacebar to go to next page)

If the alarm is active.

Login to autosat
Then do test board <board number>
Check if any test fails.
If failing any test:

Check status trunk X (the trunk number can be found by displaying any of the ports on the board).
Check the service state: it should show In-Service/idle rather than Out-of-Service.
list measurement ds1 log <Board location> (check whether any slip errors are present; if any value shows
other than 0, the board has errors)
If any errors are present, do display errors to identify the cause.
Then do test board <Board location> r 6 (only if slip errors are present, and then monitor whether the slip
errors get cleared).
For sync related issues, verify the following on the MG:


Telnet/SSH to MG and run the following command show sync timing
The output could be as follows:
SOURCE      MM    STATUS    FAILURE
---------   ----  --------  ----------------
Primary     v2              <Failure Reason>
Secondary   v3    Active    None
Local       v0    Standby   None
Reason for a failed source can be: Loss of Signal, Locked out etc.
There needs to be an explicit sync source on a MG i.e a DS1 board, else it shall default to the internal clock of the
MG/PN. Please take permission from customer before making this change, this is not service impacting.

Define which interfaces are primary and secondary (depends on the number of DS1 boards in the MG).
Set the primary interface: set sync interface primary <mmid>, e.g. set sync int pri v2.
Set the secondary interface: set sync interface secondary <mmid>, e.g. set sync int sec v3.
Specify the source to use: set sync source <primary | secondary>, e.g. set sync source primary.
Display the current selections and their status: show sync timing.

Following is an example of the output when the sync source is not defined:
MG34(develop)# show sync timing
SYNCHRONIZATION CONTROL: --- Local ---
SOURCE      MM or VoIP        STATUS     FAILURE
---------   ---------------   --------   ----------------
Primary     Not Configured               >> DS1 board should be set up as primary source
Secondary   Not Configured
Local       v0                Active     None

Active Source: v0
Done!
Sync Source Switching: Enabled

Set the DS1 board to be the primary synchronization source. Example:


MG34(develop)#set sync interface primary v3
MG34(develop)#set sync source primary
MG34(develop)# show sync timing
SYNCHRONIZATION CONTROL: --- Primary ---
SOURCE      MM or VoIP        STATUS     FAILURE
---------   ---------------   --------   ----------------
Primary     v3                Active
Secondary   Not Configured
Local       v0                Standby    None

Active Source: v3
Sync Source Switching: Enabled

If slip errors are present and are consistent, notify the Customer and ask them to follow up with the
service provider to clear the slip errors. Also monitor the board until the slip errors clear off.
If the alarm does not clear, then inform the Customer and with the required permission go ahead with a soft
reset (i.e. busyout, then reset, and then release the board) followed by a reseat of the board (i.e.
removing the board from the slot and re-inserting it); if the alarm still doesn't go off, replace
the board.

Probable Cause: The alarm may get reported due to:


Slip errors received by DS1 board OR
Bad health of DS1 board OR
Physical Connectivity Issue at customer site OR
DS1 board configuration issue or any activity at customer site
Alarm Description:
Alarm->MaintObject_MG-ANA - (Analog Media Module)
Description: Alarm is related to Analog board which can be used for analog stations as well as to terminate
Analog trunks. Types of Analog Media Modules:
MM711 (8 universal analog ports)
MM714 (4 analog line ports + 4 analog trunk ports)
MM716 (24 universal analog ports)
TIM514 (4 line + 4 trunk analog media module for Juniper Media Gateways)
VMM-ANA (1 analog trunk port + 2 analog line virtual ports)
VMM-42ANA (4 analog trunk ports + 2 analog line virtual ports)
Note: Too many MM711 Analog Media Modules might trip the G700 electronic breaker and cause the
power supply for the entire G700 to shut down.
Procedure:
almdisplay v /almdisplay res |more
test board <Board location> (check for any test, if getting failed)
display errors (to identify the cause of the issue)
If the alarm does not clear, then inform the Customer and with the required permission go ahead with a soft
reset (i.e. busyout, then reset, and then release the board) followed by a reseat of the board (i.e.
removing the board from the slot and re-inserting it); if the alarm still doesn't go off,
replace the board.
Probable Cause: Bad health of analog media module OR any activity at customer site OR physical
connectivity issue.
Alarm Description:
Alarm->MaintObject_ANL-BD - (Analog TN Circuit-Pack).
Description: Alarm is related to Analog board which can be used for analog stations as well as to terminate
analog trunks. Types of Analog Boards:
1. ANL-16-L (16-Port Analog Line)
The circuit packs listed below provide 16 analog line ports for single-line voice terminals.
The TN746, TN468, and TN749 support only single-line, on-premises/in-building, analog
voice terminals, and not off-premises stations, since these circuit packs are not equipped
with lightning protection.
The TN746B, TN2144, TN2149, and TN468B support both on-premises and off-premises
(that is out-of-building) analog voice terminals.
Note: TN746 & TN746B supports the neon-message waiting feature.
2. ANL-LINE (8-Port Analog Line)
The circuit packs-TN411, TN443, TN467, TN712, TN742, TN769 provide 8 analog line ports for single-line,
on or off-premises analog endpoints such as analog voice terminals, queue warning level lamps, recorded
announcements, dictation machines, PAGEPAC paging equipment, external alerting devices, modems, fax
machines, and AUDIX voice ports.
Procedure:
almdisplay v /almdisplay res |more
test board <Board location> (check for any test, if getting failed)
display errors (to identify the cause of the issue)
If the alarm does not clear, then inform the Customer and with the required permission go ahead with a soft
reset (i.e. busyout, then reset, and then release the board) followed by a reseat of the board (i.e.
removing the board from the slot and re-inserting it); if the alarm still doesn't go off,
replace the board.
Probable Cause: Bad health of analog board OR any activity at customer site OR physical connectivity issue.
Alarm Description:

Alarm->MaintObject_BRI-BD /MG-BRI/TBRI-BD
Description:
BRI-BD BRI Circuit pack
The TN556, TN2198, and TN2208 ISDN-BRI Line circuit packs are packet port circuit packs that provide access to
ISDN-BRI endpoints. The ISDN-BRI Line circuit packs support 12 ports, each of which provides access to ISDN
stations. Voice and circuit-switched data from the ISDN stations are carried on the Time Division Multiplex (TDM)
Bus. Signaling is carried over the packet bus. The TN2208 LGATE MFB provides the system with the interface to

Adjunct-Switch Application Interface (ASAI) and Avaya adjuncts (for example, CONVERSANTR Voice System).
Though TN2208 contains 12 ports for line circuit interface, only 8 are usable by the switch.
MG-BRI -BRI Media Module
MM720 (8-port 4-wire BRI Trunk/Line Media Module),
MM722 (2-port 4-wire BRI Trunk Media Module),
TIM521 (4-port BRI Trunk Media Module in a Juniper Media Gateway) &
VMM_2BRI (2 port trunk-side integrated BRI Media Module)
The above Media Modules provide access to ISDN-BRI end-points where each port supports 2 B-channels of 64 kbps
each and 1 D-channel of 16 kbps, carried on 144 kbps.
TBRI-BD Trunk-Side BRI Circuit-Pack/ Media Module
The TBRI-PT maintenance object is a port on both the TN2185 Trunk-Side BRI circuit pack and
the MM720 BRI Media Module. The TN2185 Circuit Pack and MM720 Media Module contain eight 4-wire ports that
interface to the network at the ISDN S/T reference point over two 64-kbps channels (B1 and B2) and over a 16-kbps
signaling (D) channel. The B1 and B2 channels can be simultaneously circuit-switched, or individually
packet-switched. Only one channel per trunk can be packet-switched due to PPE (Packet Processing Element) limitations.
The D channel is either circuit- or packet-switched.
Note: If any PKT-Bus alarm is present along with BRI-BD, follow the corresponding procedure to clear the
PKT-Bus alarm first and then, if required, follow the procedure below.
Procedure:
almdisplay v /almdisplay res |more
test board <Board location> (Check for any test, if failing)

status trunk X (trunk number can be found by displaying any of the port on the board. Also when
you do test board X, you can identify trunk number against test of each port ).

display errors (to identify the cause)


test board <Board location> r 6 (only if slip errors are present, and then monitor whether the slip
errors get cleared)
If slip errors are present and are consistent, notify the Customer and ask them to follow up with the
service provider to clear the slip errors. Also monitor the board until the slip errors clear off.
If the alarm does not clear, then inform the Customer and with the required permission go ahead with a soft
reset (i.e. busyout, then reset, and then release the board) followed by a reseat of the board (i.e. removing the
board from the slot and re-inserting it)
test board <Board location> long clear (to clear the alarm, and then monitor the board for at least
the next 48 hours)
If alarm re-appears, replace the board.

Probable Cause: The alarm may get reported due to:


Slip errors received from Service Provider OR
Bad health of BRI board OR
Physical Connectivity Issue at customer site OR
BRI board configuration issue or any activity at customer site
Alarm Description:

Alarm->MaintObject_BRI-PORT_Location_X
Alarm->MaintObject_TBRI-PT_Location_X
Description: Some of the results of maintenance testing of ISDN-BRI ports may be affected by the health of the
ISDN-BRI Line circuit pack (BRI-BD), BRI endpoint (BRI-SET), ASAI adjunct (ASAI-ADJ/LGATE-ADJ/LGATE-AJ) or
Avaya adjunct (ATT-ADJ/ATTE-AJ). These interactions should be kept in mind when investigating the cause of
ISDN-BRI port problems.
Note: There is quite a possibility that Pkt-Bus / BRI-BD / Adjunct alarms are the cause of the
BRI-PT alarms. So if such an alarm also exists, follow the corresponding action to resolve it as
discussed in the respective section.
Procedure:

almdisplay v /almdisplay res |more


display port X
status trunk Y (the value of Y is obtained in the previous step)

test port X long (to clear the alarms)


display errors (to identify the cause)
busyout-release port (ie soft reset of port)
If the alarm does not clear, then inform the Customer and with the required permission go ahead with a soft reset
(i.e. busyout, then reset, and then release the board) followed by a reseat of the board (i.e. removing the
board from the slot and re-inserting it)
test board <Board location> long clear (to clear the alarm, and then monitor the board for at least the next
48 hours)
If alarm re-appears, replace the board.
Probable Cause: The alarm may get reported due to:
Slip errors received from Service Provider OR
Bad health of BRI port OR
Physical Connectivity Issue at customer site OR
BRI board configuration issue or any activity at customer site
Alarm Description:

Alarm->MaintObject_CO-TRK
Understanding: Analog CO trunks are 2-wire analog lines to the CO that support both incoming and outgoing calls.
CO trunk circuit packs have eight ports, each of which provides an interface between the 2-wire CO line and the 4-wire TDM bus.
Note: If any of Tone-BD/ Tone-Clk/ TDM-Bus alarms is present , then that could be the cause of the
CO-TRK alarm and hence need to check the corresponding action to resolve those alarms as
discussed in respective section
Procedure:
almdisplay v / almdisplay res |more
test board <board location> (check for any test, if failing for the board/port)
status trunk X (the trunk number can be identified against the test of each port, which can be seen in the
output of step 2)

display errors (to identify the cause)


test port <port location> long

test analog-testcall full (check whether all test related to ATM-Transmission are getting passed)
busy-out & release trunk /port (i.e. soft reset of either trunk or port)
If the alarm is active, then inform the Customer and with the required permission go ahead with a soft reset (i.e.
busyout, then reset, and then release the board).
If the alarm doesn't go off, then replace the circuit pack.

Note: If test 3 is failing, check with the customer whether the analog connection is in use
or has been de-activated. If the customer replies that it needs to be functional, ask the customer to detach the
analog line from the analog port and run test board <board location>. If the test then passes, follow up with the
Service Provider for resolution; if test 3 still fails, replace the circuit pack.
Probable Cause: The alarm may get reported due to:
Physical connectivity issue or some activity at the customer site OR
Issue at the Service Provider side OR
Bad health of Analog port/board where CO trunk is terminated
Alarm Description:

Alarm->MaintObject_ISDN-SGR / ISDN-TRK
Understanding: An ISDN-PRI Signaling Group is a collection of B-channels for which a given ISDN-PRI
Signaling Channel Port (D-channel) carries signaling information. B-channels carry voice or data and can be
assigned to DS1 ISDN trunks (ISDN-TRK).
Procedure:
1. almdisplay v / almdisplay res |more
2. status signaling-group X (check whether signaling group is functional or not)

3. test signaling-group X (check if any test is getting failed).


4. list measurements ds1 summary <board location> (to check for any errors on the DS1)
5. test board <Board location> (Check for any test, if failing)
6. status trunk X (trunk number can be found by displaying any of the port on the board ).
7. If the trunk is down then notify the customer; the issue may be at the service-provider end. Need
to verify the same.


8. list measurement ds1 log <Board location> (Check whether any slip-errors are present.)
9. display errors (to identify the cause)
10. test board <Board location> r 6 (only if slip errors are present, and then monitor whether the slip
errors get cleared)
11. If slip errors are present and are consistent, notify the Customer and ask them to follow up with the
service provider to clear the slip errors. Also monitor the board until the slip errors clear off.
12. If the alarm does not clear, then inform the Customer and with the required permission go ahead with a soft
reset (i.e. busyout, then reset, and then release the board) followed by a reseat of the board (i.e.
removing the board from the slot and re-inserting it).
Probable Cause: The alarm may get reported due to:

Slip errors received by DS1 board OR


Bad health of DS1 board OR
Physical Connectivity Issue at customer site OR
DS1 board configuration issue or any activity at customer site

Alarm Description:

Alarm->MaintObject_H323-SGR
Understanding: The H.323 signaling group (H323-SGR) is a Signaling channel that physically resides on a CLAN port
(socket) and the IP network. The MEDPRO circuit-pack provides audio connectivity, working in
concert with a C-LAN (TN799DP) circuit pack that provides control signaling to support an H.323 connection. Unlike
ISDN D-channels, the H.323 channel may actually come up and down on a call by call basis. The H.323 channel is
actually a TCP/IP signaling channel.
Procedure:
almdisplay v / almdisplay res |more
status signaling-group X (check whether signaling group is functional or not)

test signaling-group X (check if any test is getting failed)


display errors (to identify the cause)
display signaling-group X (to identify the near-end and far-end node-names)
display node-name ip <Node-name> (to get the ip-addresses of both the node-names found in
Step-4)
list ip-interface clan (to get the Clan board location, which is the near-end node name)

ping <ip-address of Far-End node-name> board <Clan board location from Step 6> (to confirm the
network integrity between the two nodes)
status media-processor board <Medpro board location placed in the same cabinet as the Clan board
found in Step 6> (to confirm adequate DSP resources)
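A worked example of the checks above, assuming signaling group 5, far-end IP 10.4.4.40 and a CLAN in slot 01A03 (all values hypothetical):
display signaling-group 5                   (note the near-end and far-end node-names)
display node-names ip                       (resolve both node-names to IP addresses)
list ip-interface clan                      (find the slot of the CLAN carrying the near-end node-name, e.g. 01A03)
ping ip-address 10.4.4.40 board 01A03       (verify IP connectivity from that CLAN to the far end)
status signaling-group 5                    (the group should return to in-service once connectivity is restored)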

Note: If any of Clan-BD/Medpro-BD alarms is present , then that could be the cause of the SGR alarm and hence
need to check the corresponding action to resolve those alarms as discussed in respective section
Probable Cause: The alarm may get reported due to:
Network Issue between the two end-points of the signaling group OR
Bad health of Clan board (near end where Signaling group is terminated) OR
Bad health of Far-End Node-name
Alarm Description:

1009: Total Processing exceeds 70 % on \\LSP-MLAPG45CC03PV


Description: The alarm is caused when the CPU utilization is more than 70 % on the server.
Procedure:
Connect to the system and check status health on the SAT.
Check that the CP (call processing) occupancy is not more than 30 % and Idl (idle) is around 70 % or more.
If it is more, monitor the system for some time; it should come down below 30 %. If the issue persists,
notify the customer to take action to bring down the processing load on the server.
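Since the server runs Linux, the SAT status health check can be supplemented from the shell; a minimal sketch using standard Linux commands (thresholds as per the note below):
uptime                      (load averages; values well above the number of CPUs suggest sustained overload)
top -b -n 1 | head -15      (one snapshot of CPU usage and the busiest processes; %id is the idle percentage)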

Note: This is only a Notification alarm which informs that the system load usage is more than 70 %.

Alarm Description:
HARD DISK,S, 77
Description: The majority of S8800 servers included the OEM MR10i RAID controller card with 256 MB Battery
Backed Write Cache for the application. The RAID battery used in these RAID controller cards is a Lithium Ion
battery. Over time, in normal operation, the RAID battery will have a reduced charge. The charging system in the
MR10i controller card extends the life of the battery but is not able to do so indefinitely.
Once the RAID battery falls below a minimum level of voltage output, it will no longer power the cache. When that
happens, the RAID controller switches from the default Write-Back to a Write-Through mode of operation. For
applications like Communication Manager, there will be no general performance degradation, although backup
time may be increased by slower read-write activity.
If the RAID battery is completely exhausted and the server is exposed to a commercial power source failure and no
Universal Power Supply (UPS) is being used, there is a chance that disk file corruption will occur if a write operation
is taking place at the time of the power interruption.
Procedure:

Login as root
To Get Default Cache policy: /opt/MegaRAID/MegaCli/MegaCli -LdPdInfo -aALL|egrep -i Default
To Get Current Cache policy: /opt/MegaRAID/MegaCli/MegaCli -LdPdInfo -aALL|egrep -i Current

If WriteThrough

raid_status (check if it is Optimal)

raid_status -c (check whether slots has any errors)

raid_status -p -v (check any error counts for Media and Other, Also check if any predictive errors)

/opt/MegaRAID/MegaCli/MegaCli -adpbbucmd -a0 |more (check the Battery replacement status)

Note: If the Current Cache Policy is showing WriteThrough, the checks above show any errors, and the Battery
Replacement required option shows Yes, then go ahead and notify the customer to change the RAID battery for that
particular device slot.
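A condensed sketch of the checks above, run as root (same MegaCli path as listed; field names can vary slightly between MegaCli versions):
/opt/MegaRAID/MegaCli/MegaCli -LdPdInfo -aALL | egrep -i "Default Cache Policy|Current Cache Policy"
        (the Current policy should match the Default WriteBack; WriteThrough points to a weak or failed RAID battery)
raid_status                                              (overall RAID state should be Optimal)
/opt/MegaRAID/MegaCli/MegaCli -adpbbucmd -a0 | egrep -i "Battery Replacement|Voltage"
        ("Battery Replacement required : Yes" confirms the RAID battery needs to be replaced)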
Replacement Information:
The RAID battery is a consumable product on the S8800 server, generally not covered under the maintenance
agreement, and is available for purchase.
Alternatively, an order for Avaya to install the replacement battery may also be placed when calling to purchase
the RAID battery. The S8800 RAID battery kit is generally available and may be ordered as a miscellaneous part. It
is customer-installable; alternatively, an order for Avaya to install the replacement battery may be placed. (Tech
Dispatch)

Switch Alarms
Alarm Description:
PowerSupply_Fault_ExtremeAlpine
Description: Indicates that the state of the power supply for this system is not normal or that it has been
shut down.
Procedure:
show power (check the power status)
show inline-power (check whether inline-power is enabled)
show log (check logs for any administration work or power outage which may be the cause of the
alarm).
Probable Cause: Some issue with power-supply of the switch.

Alarm Description:
pethPsePortOnOffNotification
Description: This kind of alarm generated whenever there is a change observed on PoE port of LAN switch
and/or PD (Powered Device like IP Phone).
Note: In the above heading, peth means power-ethernet (Power over Ethernet), PSE means Power Sourcing Equipment,
and port means a port on the PoE switch.
Procedure:
show power (check the power status)
show port (check whether port is functional or not)
show log (check logs for any administration work or power outage or any other related logs which
could be the cause of the alarm).
Note: Typically no solution is required and this is normal behaviour, since most of the time voice
devices such as IP phones are connected to these switches, and any unplugging or resetting of phones can
generate this alarm.
Probable Cause: Whenever a change is observed on PoE port of a Lan switch, this alarm is generated.
Alarm Description:
Interface_Fault_MIB2/ ExceededMaximumUptime
Description: Indicates that an interface marked as Backup, Serial or Dial-on-Demand has been active for too
long.
Note: Alarm could be related to media server/switch/media-gateway etc
Procedure:
log into the device and check the uptime
if system is functioning fine, then ignore the alarm
if alarm is generated frequently, this indicates that threshold value for "Maximum uptime" is set
too low, and you need to increase it in SMARTS. So transfer this ticket to NA-MS-IPT-AI team to change
the configuration.
Probable Cause: The alarm is reported when a host has not been rebooted for at least X days. The value
of X is defined in SMARTS.
Alarm Description:
HighErrorRate
Description: Indicates that the percentage of error packets for either input or output exceeds ErrorThreshold.
Procedure:
show ports rxerror (check the received error count)
show ports txerror (check the transmitted error count)
If errors are present for the specific port, then notify the customer's data team through an email and
keep the case under monitoring. If the error count increases, then call the customer and refer it to the
Customer's network team.
Probable Cause: When the number of error packets either received or transmitted exceeds the
threshold value, this alarm is generated.


Alarm Description:
Switch Down/ Interface Down/ Host Down
Description: One of the devices being monitored by SIG is down.
Procedure:
ping <ip-address of the device>
login to the device & check its functioning; also check logs on the device and on the SIG
sudo vi /opt/InCharge6/SAM/smarts/local/logs/trap_mgr.log
also check in the device logs whether the device was rebooted
if it is functioning fine, close the case; otherwise inform the customer and work accordingly
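A minimal sketch of the reachability check, assuming the monitored device's address is 10.5.5.50 (hypothetical) and the SIG trap log path given above:
ping -c 4 10.5.5.50                                                   (is the device reachable from the SIG?)
traceroute 10.5.5.50                                                  (if not, find where the path breaks)
sudo tail -100 /opt/InCharge6/SAM/smarts/local/logs/trap_mgr.log      (recent traps received for the device)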
Probable Cause: May be a network issue, the SIG being unable to detect the switch, or the switch having been rebooted.

S-ar putea să vă placă și