ACI Troubleshooting

BRKACI-2102
Mioljub Jovanovic, Technical Leader
Agenda
• Introduction
• Understanding Faults and Health status
• Tools
• Troubleshooting scenarios
• Conclusion / Q&A

The way we’re used to doing it
# show int eth 1/1 | grep input
30 seconds input rate 97064 bits/sec, 66 packets/sec
input rate 97064 bps, 66 pps; output rate 95008 bps, 57 pps
20297397 input packets 6494649266 bytes
0 input error 0 short frame 0 overrun 0 underrun 0 ignored
0 input with dribble 72 input discard

Good old CLI!!!


Example: Checking input rate on specific interface
John Chambers
@CiscoLive #clus, San Diego 2015

The way we do it in APIC

Visualize interface input/output

The way we can do it with ACI
> moquery -c eqptIngrPkts5min -f 'eqpt.IngrPkts5min.unicastRate>"1000"' | egrep -e "^dn|^unicastRate"
dn : topology/pod-1/node-101/sys/phys-[eth1/34]/CDeqptIngrPkts5min
unicastRate : 1742.12
example: finding interface with unicast rate > 1000

> moquery -c eqptIngrPkts5min -f 'eqpt.IngrPkts5min.unicastRate>"1000"' -o xml


…<eqptIngrPkts5min childAction="" cnt="18" dn="topology/pod-1/node-101/sys/phys-
[eth1/34]/CDeqptIngrPkts5min" … status="" unicastAvg="10833" unicastBase="0"
unicastCum="2390904" unicastLast="18809" unicastMax="31630" unicastMin="2075"
unicastPer="194995" unicastRate="1089.254093" unicastSpct="0" unicastThr=""
unicastTr="0" unicastTrBase="503518"/>
</imdata>

Query any managed object (MO) for data we need!
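The XML returned by "moquery ... -o xml" can also be post-processed offline. The following is a minimal Python sketch, assuming the output was saved as text and contains eqptIngrPkts5min elements under <imdata> as in the excerpt above. The function name interfaces_above and the abbreviated SAMPLE are illustrative only and not part of any APIC tool.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample modelled on the excerpt above (most attributes omitted).
SAMPLE = (
    "<imdata>"
    '<eqptIngrPkts5min '
    'dn="topology/pod-1/node-101/sys/phys-[eth1/34]/CDeqptIngrPkts5min" '
    'unicastRate="1089.254093"/>'
    "</imdata>"
)


def interfaces_above(xml_text, threshold):
    """Return (dn, unicastRate) pairs whose unicastRate exceeds threshold."""
    root = ET.fromstring(xml_text)
    hits = []
    for mo in root.iter("eqptIngrPkts5min"):
        rate = float(mo.get("unicastRate", "0"))
        if rate > threshold:
            hits.append((mo.get("dn"), rate))
    return hits


if __name__ == "__main__":
    for dn, rate in interfaces_above(SAMPLE, 1000):
        print(dn, rate)
```

Filtering on the APIC side (the -f option shown above) is usually preferable; a client-side filter like this one is mainly useful when working from saved output.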


• Q: That’s cool, but how do I know which object/class to query …?
→ Check the next slide for the answer.
• Q: It looks cryptic to me ... how do I find the meaning of each field?
APIC Management Information Model Reference
From the WebUI

Direct URL: https://apic/doc/html/
APIC UI

Connect to APIC

apic 1

APIC Cluster
Web Browser
Visore

apic 2

CLI (ssh)

apic 3
spine 1 spine 2

Connect to switch

ACI Fabric

leaf 1 leaf 2 leaf 3 leaf 4 leaf 5

We could connect directly to switches as well:
- ssh or console
- visore
- REST
CLI Available at the Switch
AAA via TACACS+, RADIUS and LDAP is supported when logging into the switch CLI console.
Configuration mode is not supported at the switch console.
There are two scenarios where administrators would log into the switch console:

• From the APIC UI, the admin can log in remotely to the switch console.

• Log in directly via the serial console port on the switch front panel, or SSH to the management IP (out-of-band or in-band) using username "admin".
Application Policy Infrastructure Controller
admin@apic1:~> acidiag fnvread
ID    Name     Serial Number   IP Address        Role    State      LastUpdMsgId
---------------------------------------------------------------------------------
101   leaf1    SAL18CLUX85     10.0.40.66/32     leaf    active     0
102   leaf2    SAL18CBRU00     10.0.64.69/32     leaf    active     0
103   leaf3    SAL18CLHR05     10.0.40.95/32     leaf    active     0
104   leaf4    SAL18CAMS14     10.0.40.65/32     leaf    active     0
105   leaf5    SAL18CCHD53     10.0.112.69/32    leaf    active     0
201   spine1   SAL18CMUC75     10.0.64.65/32     spine   active     0
202   spine2   SAL18CFRA11     10.0.64.64/32     spine   active     0
203   spine3   SAL18CSAN15     10.0.40.69/32     spine   inactive   0x4000000ef664f
204   spine4   SAL18CSFO14     10.0.112.67/32    spine   inactive   0x4000000ef6650

Total 9 nodes

For the majority of use cases, the admin should utilize the APIC.

admin@apic1> ssh leaf1


Fabric Health Overview

Troubleshooting: Where do we start?
Fabric-wide monitoring
Statistics Faults Diagnostics
Thresholds

Faults,
Health Scores
Troubleshooting, Drill Downs

Drill-Downs

Stats
Atomic
Counters
ELAM SPAN
On-Demand
Diagnostics
Switch
Nxos Cli …
After logging in to the APIC, you’ll see the initial ‘Dashboard’ screen.

The APIC dashboard provides you with an ‘at-a-glance’ view of the system health and fault counts.

‘System Health’ shows you a view of the overall health of the ACI system (all nodes, tenants, etc).

fabricHealthTotal

Graph is plotted as per fabricOverallHealthHist5min

API Inspector
Enables us to see the REST API calls (GET, DELETE, POST) sent from the WebUI to the APIC.

82

admin@apic1> moquery -d "/topology/HDfabricOverallHealth5min-0"


Total Objects shown: 1

# fabric.OverallHealthHist5min
index : 0
childAction :
cnt : 31
dn : /topology/HDfabricOverallHealth5min-0
healthAvg : 82
healthMax : 82
healthMin : 82
healthSpct : 0
healthThr :
healthTr : 0
lastCollOffset : 310
modTs : never
repIntvEnd : 2015-04-10T19:24:03.530+01:00
repIntvStart : 2015-04-10T19:18:53.442+01:00
rn : HDfabricOverallHealth5min-0
status :

Prefer JSON or XML instead of text in moquery? No problem: just specify "-o json" or "-o xml" with moquery.
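When only the default text output is at hand (for example, pasted into a case note), it can still be turned into structured data. The sketch below is a hypothetical helper, not an APIC utility; it assumes the "# class.Name" header and "key : value" layout shown in the output above, and SAMPLE is an abbreviated copy of that output.

```python
def parse_moquery_text(text):
    """Parse moquery's default text output into a list of dicts, one per object."""
    objects = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# "):
            # A "# class.Name" header starts a new object.
            current = {"class": line[2:]}
            objects.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so timestamps stay intact.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return objects


SAMPLE = """Total Objects shown: 1

# fabric.OverallHealthHist5min
index : 0
dn : /topology/HDfabricOverallHealth5min-0
healthAvg : 82
healthThr :
repIntvEnd : 2015-04-10T19:24:03.530+01:00
"""

if __name__ == "__main__":
    for obj in parse_moquery_text(SAMPLE):
        print(obj["class"], obj["healthAvg"])
```

For anything scripted, "-o json" remains the more robust choice, since it avoids guessing at the text layout.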
How is topology built?

• APIC WebUI and API inspector
• Identify which objects are used to plot topology
• Re-using fabricLink objects to identify the links
• We could create our own tool for topology, monitoring or troubleshooting

admin@apic1:~> moquery -c fabricLink

# fabric.Link
n1 : 203
s1 : 1
p1 : 1
n2 : 101
s2 : 1
p2 : 51
dn : topology/pod-1/lnkcnt-101/lnk-203-1-1-to-101-1-51
lcOwn : local
linkState : ok
modTs : 2015-03-13T14:26:39.526+01:00
monPolDn : uni/fabric/monfab-default
rn : lnk-203-1-1-to-101-1-51
status :
wiringIssues :

admin@bdsol-aci2-apic1:~> moquery -c fabricLink | egrep -e ^dn | head -5
dn : topology/pod-1/lnkcnt-1/lnk-102-1-2-to-1-2-2
dn : topology/pod-1/lnkcnt-2/lnk-102-1-4-to-2-2-2
dn : topology/pod-1/lnkcnt-3/lnk-102-1-6-to-3-2-2
dn : topology/pod-1/lnkcnt-201/lnk-102-1-49-to-201-1-34
dn : topology/pod-1/lnkcnt-202/lnk-102-1-50-to-202-1-34
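As a starting point for "our own tool", the link endpoints can be recovered from the dn alone. This is a minimal sketch, assuming the rn always follows the lnk-<n1>-<s1>-<p1>-to-<n2>-<s2>-<p2> pattern seen in the fabricLink output above; the function name parse_link_dn is hypothetical.

```python
import re

# Matches the trailing "lnk-n1-s1-p1-to-n2-s2-p2" part of a fabricLink dn.
LINK_RN = re.compile(r"lnk-(\d+)-(\d+)-(\d+)-to-(\d+)-(\d+)-(\d+)$")


def parse_link_dn(dn):
    """Split a fabricLink dn into node/slot/port numbers for both ends."""
    match = LINK_RN.search(dn)
    if match is None:
        raise ValueError("not a fabricLink dn: %r" % dn)
    n1, s1, p1, n2, s2, p2 = (int(x) for x in match.groups())
    return {"n1": n1, "s1": s1, "p1": p1, "n2": n2, "s2": s2, "p2": p2}


if __name__ == "__main__":
    link = parse_link_dn("topology/pod-1/lnkcnt-101/lnk-203-1-1-to-101-1-51")
    print("node %(n1)d eth%(s1)d/%(p1)d <-> node %(n2)d eth%(s2)d/%(p2)d" % link)
```

The parsed values agree with the n1/s1/p1/n2/s2/p2 attributes of the same object, which makes the dn a convenient key when only "egrep ^dn" output is available.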
Visore – Web based MO query and browser tool
https://<IP>/visore.html

fabricNode
adSt              on
childAction
delayedHeartbeat  no
dn                topology/pod-1/node-101
fabricSt          active
id                101
lcOwn             local
modTs             2015-04-08T14:38:44.546+02:00
model             N9K-C9396PX
monPolDn          uni/fabric/monfab-default
name              bdsol-9396px-02
role              leaf
serial            SAL18CLUS15
status
uid               0
vendor            Cisco Systems, Inc
version

<?xml version="1.0" encoding="UTF-8"?><imdata totalCount="1"><fabricNode adSt="on" childAction="" delayedHeartbeat="no" dn="topology/pod-1/node-101" fabricSt="active" id="101" lcOwn="local" modTs="2015-04-08T14:38:44.546+02:00" model="N9K-C9396PX" monPolDn="uni/fabric/monfab-default" name="bdsol-9396px-02" role="leaf" serial="SAL18CLUS15" status="" uid="0" vendor="Cisco Systems, Inc" version=""/></imdata>

icurl 'http://apic/api/node/class/fabricNode.xml?query-target-filter=and(eq(fabricNode.id,"101"))'
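The class-query URL passed to icurl follows a regular pattern, so it can be generated rather than typed. A minimal sketch, mirroring only the URL shape shown above (path /api/node/class/<class>.<format> plus a query-target-filter with and(eq(...))); the helper name class_query_url and its parameters are assumptions for illustration.

```python
def class_query_url(host, cls, attr, value, fmt="xml", scheme="http"):
    """Build a class query URL with a single equality filter."""
    # Filter syntax copied from the icurl example: and(eq(class.attr,"value"))
    flt = 'and(eq({0}.{1},"{2}"))'.format(cls, attr, value)
    return "{0}://{1}/api/node/class/{2}.{3}?query-target-filter={4}".format(
        scheme, host, cls, fmt, flt
    )


if __name__ == "__main__":
    # Quote the URL when pasting it into a shell, as in the icurl example.
    print("icurl '%s'" % class_query_url("apic", "fabricNode", "id", "101"))
```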
The lower half of the screen shows node and tenant health.

Move these sliders down to show only nodes / tenants with lower health.

On the right, you’ll see the fault counts by domain (e.g. access, tenant, security), by type (config, environmental, etc.), and the APIC cluster health.


How to get object DN from GUI
1

Health Score
A number between 0 and 100. Perfect Health Score = 100.

Tools and utilities

Network Monitoring and Troubleshooting Tools

Physical Network Abstracted Network


• properties (EP / TEP / contract)
• ping
• health scores / faults / events / audit
• traceroute • iping, itraceroute
• show (interface / table / etc) • atomic counters
• statistics
• syslog
• diagnostics (on-demand)
• SPAN • SPAN
• ELAM

UI Tools
Health Faults Audits Events

Statistics Call-home Syslogs SNMP

UI Operations Tools introduced in APIC 1.1 and 1.2
• Visibility & Troubleshooting (also known as Troubleshooting Wizard - TsW)
• Capacity Dashboard
• ACI Optimizer
• EP Tracker
• Visualization

MIT access from ishell
admin@apic1:mit> cd /mit
admin@apic1:mit> ls -1l
total 3
drw-rw---- 1 admin admin 512 Apr 24 22:48 comp
drw-rw---- 1 admin admin 512 Apr 24 22:48 dbgs
drw-rw---- 1 admin admin 512 Apr 24 22:48 expcont
drw-rw---- 1 admin admin 512 Apr 24 22:48 fwrepo
drw-rw---- 1 admin admin 512 Apr 24 22:48 topology
drw-rw---- 1 admin admin 512 Apr 24 22:48 uni
moquery – CLI based MO query tool
admin@apic1:~> moquery -c fabricNode -f 'fabric.Node.id=="1"'
Total Objects shown: 1

# fabric.Node
id : 1
adSt : on
delayedHeartbeat : no
dn : topology/pod-1/node-1
fabricSt : unknown
lcOwn : local
modTs : 2015-04-08T14:27:16.290+02:00
model : APIC
monPolDn : uni/fabric/monfab-default
name : apic1
rn : node-1
role : controller
serial : SAL18CLUS15
status :
uid : 0
vendor : Cisco Systems, Inc
version :
moquery – some examples … or simply use the WebUI
• Find all EPGs with access encapsulation VLAN 3399
moquery -c fvRsPathAtt -o json -f 'fv.RsPathAtt.encap=="vlan-3399"'
• Obtain AAEP based on interface policy group
moquery -c "infraAccPortGrp" | egrep "^dn" | awk '{ print "moquery -d " $3 " -x query-target=children | egrep tDn" }'
• Query the actual policy group
moquery -d "uni/infra/funcprof/accportgrp-N3k_PG_ddastoli" -x query-target=children

mobrowser – CLI based MO browser tool

DME running on switch

Switch components: NXOS processes and the Objectstore (shared memory).

• Get logical MO from PM and push concrete MO to configure switch
• Delegate local faults, events, records, health score
• Opflex server for external opflex elem
• Atomic counters, core handling
• Collect stats from NXOS and push to APIC
APIC Logs
• /var/log/dme/log
• /var/log/dme/oldlog

admin@apic1:~> cd /var/log/dme/log
admin@apic1:log> ls -altr *
admin@apic1:log> ls -al svc_ifc_policymgr.*

Switch Logs
• /var/log/dme/log
• /var/log/dme/oldlog
• /var/sysmgr/tmp_logs/

admin@apic1:~> cd /var/log/dme/log
admin@apic1:log> ls -altr *
admin@apic1:log> ls -al svc_ifc_policyelem.*
acidiag – your friend at tough times
admin@apic1:~> acidiag --help
...
avread read appliance vector
fnvread read fabric node vector
fnvreadex read fabric node vector (extended mode)
rvread read replica vector
rvreadle read replica leader summary
crashsuspecttracker read crash suspect tracker state
validateimage validate image
version show ISO version
preservelogs stash away logs in preparation for hard reboot
platform show platform
verifyapic run apic installation verify command
bond0test run bond0 test
touch touch special files
run run specific commands and capture output
installer installer
start start a service
stop stop a service
restart restart a service
reboot reboot
icurl – CLI utility for data transfer
We can import and analyze active faults, fault history, events history, accounting log and login history:

mkdir /tmp/tac-655555555
cd /tmp/tac-655555555
icurl 'http://localhost:7777/api/class/faultInfo.xml' > faultInfo.xml
icurl 'http://localhost:7777/api/class/faultRecord.xml' > faultRecord.xml
icurl 'http://localhost:7777/api/class/eventRecord.xml' > eventRecord.xml
icurl 'http://localhost:7777/api/class/aaaModLR.xml' > aaaModLR.xml
icurl 'http://localhost:7777/api/class/aaaSessionLR.xml' > aaaSessionLR.xml
cd /tmp
tar zcvf tac-655555555.tgz tac-655555555
cp tac-655555555.tgz /data/techsupport

Now you may download the file from the following URL:

https://apic/files/1/techsupport/tac-655555555.tgz
iShell filesystem - scriptcontainer
Linux
/ - APIC root filesystem admin shell
/var/run/bashroot / - ishell root folder
…bashroot/var/log/dme/log /var/log/dme/log
/debug
/aci
/mit

…/mgmt/log/scriptcontainer.log

Troubleshooting scenarios

Topology
• 2 x spine
• 2 x leaf N9K-9396px (48 x 1/10G SFP+)
• 2 x leaf N9K-93128tx (96 x 1/10G Base-T)
• 1 x leaf N9K-C9372px (48 x 1/10G SFP+)
• 3 x APIC, 10Gbps

ACI Fabric: spine 1, spine 2; leaf 1 to leaf 5; apic 1, apic 2, apic 3
Troubleshooting Scenario

Troubleshooting Web UI performance
Open the web browser’s Developer Tools → Network tab (Ctrl + Shift + I or F12, or Cmd + Opt + I).

The Network tab shows the latency for each HTTP request to the APIC server.
Verify if APIC is able to process REST API calls
REST API call without webtoken, without login / APIC-cookie:

http://apic/api/aaaListDomains.xml

Double-click on the specific request to check timing details.

10 ms looks good.


How does it look from APIC’s side?

Note: JSON is used by the APIC WebUI, while we used XML.

zegrep -A5 "aaaListDomains.json" /var/log/dme/log/nginx*
zegrep -A5 "aaaListDomains.xml" /var/log/dme/log/nginx.bin.log.*

We could use any other criteria for grep: IP, time stamp, etc.

nginx.bin.log.14.gz:
29701||15-05-10 23:11:05.701+02:00||nginx||DBG4||||Request received /api/aaaListDomains.xml||../common/src/rest/./Rest.cc||62
29701||15-05-10 23:11:05.701+02:00||nginx||DBG4||||httpmethod=1; from 10.48.16.90; url=/api/aaaListDomains.xml; url options=||../common/src/rest/./Request.cc||103
29720||15-05-10 23:11:05.705+02:00||nginx||DBG4||co=doer:255:127:0xff00000003249f06:1||outCode: 200||../common/src/rest/./Worker.cc||357
29720||15-05-10 23:11:05.705+02:00||nginx||DBG4||co=doer:255:127:0xff00000003249f06:1||notifyEvent data ready 0x0||../common/src/rest/./Worker.cc||370
29701||15-05-10 23:11:05.706+02:00||nginx||DBG4||||Reply data (request 831 size 211) <?xml version="1.0" encoding="UTF-8"?><imdata totalCount="4"><aaaLoginDomain name="LOCAL"/><aaaLoginDomain name="RADIUS"/><aaaLoginDomain name="TACACS"/><aaaLoginDomain name="DefaultAuth" guiBanner=""/></imdata> Cookie: NONE||../common/src/rest/./Rest.cc||120
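The "||"-separated log lines above lend themselves to simple parsing, for example to measure the time between "Request received" and "Reply data". The sketch below is an illustration only: the field names (pid, timestamp, process, level, context, message, source_file, source_line) are assumptions inferred from the sample lines, not a documented format, and REPLY is an abbreviated copy of the last line.

```python
from datetime import datetime

# Assumed meaning of the first five "||"-separated columns.
FIELDS = ("pid", "timestamp", "process", "level", "context")

REQUEST = ("29701||15-05-10 23:11:05.701+02:00||nginx||DBG4||||"
           "Request received /api/aaaListDomains.xml||"
           "../common/src/rest/./Rest.cc||62")
REPLY = ("29701||15-05-10 23:11:05.706+02:00||nginx||DBG4||||"
         "Reply data (request 831 size 211) ... Cookie: NONE||"
         "../common/src/rest/./Rest.cc||120")


def parse_dme_log_line(line):
    """Split one log line into named fields and parse its timestamp."""
    parts = line.rstrip("\n").split("||")
    if len(parts) < 8:
        raise ValueError("expected at least 8 '||'-separated fields")
    record = dict(zip(FIELDS, parts[:5]))
    # Everything between the context and the trailing file/line is the message.
    record["message"] = "||".join(parts[5:-2])
    record["source_file"] = parts[-2]
    record["source_line"] = int(parts[-1])
    record["time"] = datetime.strptime(
        record["timestamp"], "%y-%m-%d %H:%M:%S.%f%z"
    )
    return record


def elapsed(first, second):
    """Return the time difference between two parsed records."""
    return second["time"] - first["time"]


if __name__ == "__main__":
    req, rep = parse_dme_log_line(REQUEST), parse_dme_log_line(REPLY)
    print("server-side processing took", elapsed(req, rep))
```

For the request above the difference is 5 ms, in line with the roughly 10 ms observed in the browser.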

Debug data of DMEs is also exposed via REST
APIC DME Debug URL

http://apic1/api/nginx/debug/tacacs.xml

Same debug data is accessible from ishell also
admin@apic1:~> cat /debug/bdsol-aci3-apic1/nginx/tacacs/mo
RequestsDispatched : 1511
ResponsesReceived : 1498

Check all other nifty stats by executing “find /debug/* …”


Example:

admin@apic1:~> find /debug/* -print -type f -exec cat {} \;


You can also check logs matching certain criteria.
The examples below look for TACACS logs or for a specific time.

zegrep TAC_ /var/log/dme/log/nginx*
zegrep TAC_ /var/syslog/tmp_logs/nginx*
zegrep "15-05-09 03:48" /var/log/dme/log/*
Troubleshooting Scenario

Finding changes and faults during a certain timeframe

System health change
We noticed a slight decrease in System health.

Is the cause known?

Do we need to perform Root Cause Analysis? … we’re not sure … should we call SWAT?
Were there any known changes, maintenance, etc.?
We’ve suddenly experienced connectivity loss … nothing has been changed …

Déjà vu? Let’s think for a second:

What is the most common cause of all network incidents?

Change!

We noticed a slight decrease in System health
aaaModLR: AAA audit log record, which is automatically generated whenever a user modifies an object.

We want to check if there were any config changes.

moquery -c aaaModLR -f 'aaa.ModLR.created=="2015-05-10"'

Match only on May 10th 2015

moquery -c aaaModLR -f 'aaa.ModLR.created>"2015-05-07" and aaa.ModLR.created<"2015-05-10"'

Match audit records (aaaModLR) between 2015-05-07 AND 2015-05-10
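The time-window filter is easy to mistype, so it can help to generate it. A minimal sketch, assuming the filter syntax shown above (aaa.ModLR.created compared against quoted ISO-style timestamps); the function names are hypothetical and the script only prints the command, it does not run moquery.

```python
from datetime import datetime


def audit_window_filter(start, end, prop="aaa.ModLR.created"):
    """Build a moquery -f expression matching records between start and end."""
    # Both bounds must be ISO-like strings, e.g. "2015-05-07" or "2015-05-07T17:00".
    if datetime.fromisoformat(start) >= datetime.fromisoformat(end):
        raise ValueError("start must be earlier than end")
    return '{0}>"{1}" and {0}<"{2}"'.format(prop, start, end)


def moquery_command(start, end):
    """Return the full moquery command line for the audit-log window."""
    return "moquery -c aaaModLR -f '{0}'".format(audit_window_filter(start, end))


if __name__ == "__main__":
    print(moquery_command("2015-05-07T17:00", "2015-05-11"))
```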
Example looking for audit records by date / time
admin@bdsol-aci2-apic1:~> moquery -c aaaModLR -f 'aaa.ModLR.created>"2015-05-07T17:00" and aaa.ModLR.created<"2015-05-11"'
# aaa.ModLR
id : 8589938110
affected : uni/fabric/outofsvc/rsoosPath-[topology/pod-1/paths-101/pathep-[eth1/12]]
cause : transition
changeSet :
childAction :
code : E4208269
created : 2015-05-08T15:22:04.317+01:00
descr : Interface topology/pod-1/paths-101/pathep-[eth1/12] enabled
dn : subj-[uni/fabric/outofsvc/rsoosPath-[topology/pod-1/paths-101/pathep-[eth1/12]]]/mod-8589938110
ind : deletion
modTs : never
rn : mod-8589938110
severity : info
status :
trig : config
txId : 10720396
user : admin

We don’t make changes on non-business days or the day before, so let’s see who performed any config between Thursday evening and Monday morning.

The admin configured interface eth1/12 on node 101.

We found there were some admin changes on eth1/12.

double click

faultRecord in GUI
We could also check:
eventRecord
healthRecord
Using moquery to dump/sort active faults (faultInst)
admin@apic1:~> moquery -c faultInst | egrep -e "^descr" | sort | uniq -c

quickly sorts all active faults


2 descr : Configuration failed for EPG default due to Not Associated With Management Zone
3 descr : Datetime Policy Configuration for F5clock failed due to : access-epg-not-specified
1 descr : Failed to form relation to MO AbsGraph-VEStandAloneFuncProfile of class vnsAbsGraph
1 descr : Failed to form relation to MO fwP-default of class nwsFwPol in context uni/infra
1 descr : Ntp configuration on leaf leaf1 is Not Synchronized
1 descr : Ntp configuration on leaf leaf2 is Not Synchronized
1 descr : Ntp configuration on spine spine1 is Not Synchronized
1 descr : Power supply shutdown. (serial number DCB18CLUS15)

Now we could query all faults by criteria, such as description (fault.Inst.descr):

moquery -c faultInst -f 'fault.Inst.descr=="Failed to form relation to MO AbsGraph-VEStandAloneFuncProfile …"'
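The "egrep | sort | uniq -c" pipeline above can be reproduced off-box on saved moquery output. A minimal sketch, assuming each fault description appears on a line starting with "descr" as in the output shown; the function name and the shortened SAMPLE are illustrative.

```python
from collections import Counter

# Shortened sample in the same "descr : text" layout as the moquery output.
SAMPLE = """descr : Ntp configuration on leaf leaf1 is Not Synchronized
descr : Power supply shutdown. (serial number DCB18CLUS15)
descr : Ntp configuration on leaf leaf1 is Not Synchronized
dn : some/other/attribute
"""


def count_fault_descriptions(moquery_text):
    """Count identical fault descriptions, like `sort | uniq -c`."""
    counts = Counter()
    for line in moquery_text.splitlines():
        if line.startswith("descr"):
            # Split on the first colon only; descriptions may contain colons.
            _, _, descr = line.partition(":")
            counts[descr.strip()] += 1
    return counts


if __name__ == "__main__":
    for descr, n in count_fault_descriptions(SAMPLE).most_common():
        print(n, descr)
```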

Troubleshooting Scenario

NX-OS Style CLI in APIC 1.2
show endpoints
show interface bridge-domain
show health tenant
show health leaf
show faults
show faults last-days 1 history
show events last-hours 8 leaf 102
show audits last-minutes 59 leaf 101
show stats granularity 15min leaf 101 interface ethernet 1/2

CLI help and a link to the CLI Reference for your convenience:
apic1# show cli manpage ?
WORD Command Name
apic1# show cli manpage show

Cisco APIC NX-OS Style CLI Command Reference

Example show stats CLI output in APIC 1.2(1)
apic1# show stats granularity 15min leaf 101 interface ethernet 1/2
Start Time Counter Value Unit
-------------------- ---------------------------------------- -------------------- ------------------------
2016-01-17 10:59:52 Ingress buffer drop packets 0 packets
2016-01-17 10:59:52 Ingress error drop packets 0 packets
2016-01-17 10:59:52 Ingress forwarding drop packets 0 packets
2016-01-17 10:59:52 Ingress link utilization 0 %
2016-01-17 10:59:52 Ingress load balancer drop packets 0 packets
2016-01-17 10:59:52 Total ingress bytes 35,117,721 bytes
2016-01-17 10:59:52 Total ingress bytes rate 37,331 bytes-per-second
2016-01-17 10:59:52 Total ingress packets 101,816 packets
2016-01-17 10:59:52 Total ingress packets rate 113 packets-per-second
2016-01-17 10:59:40 Egress afd wred packets 0 packets
2016-01-17 10:59:40 Egress buffer drop packets 0 packets
2016-01-17 10:59:40 Egress error drop packets 0 packets
2016-01-17 10:59:40 Egress link utilization 0 %
2016-01-17 10:59:40 Total egress bytes 22,850,916 bytes
2016-01-17 10:59:40 Total egress bytes rate 25,236 bytes-per-second
2016-01-17 10:59:40 Total egress packets 104,837 packets
2016-01-17 10:59:40 Total egress packets rate 117 packets-per-second

Troubleshooting Scenario

Troubleshooting:
APIC Faults / Visore / debug.log / LTM log

https://<APIC>/visore.html

APIC Faults

/var/log/*
/data/devicescript/F5.BIGIP.1.1.0/logs/debug.log
Scenario: Graph failed-to-apply
After clicking “Finish” to deploy the graph in a contract, under “Deployed Graph Instances” you may see the graph in the state “failed-to-apply”.

APIC Faults

If you need more details, double-click the fault and copy the affected object.

Example L4-L7 fault details using Visore Tool
https://apic/visore.html

Paste the affected object in the “Class or DN” field.

Visore provides full details of the issue.

APIC debug.log
Locate the APIC that contains the shard configuring the BIG-IP, then go to
the following location:

admin@apic1:~> cd /data/devicescript/F5.BIGIP.1.0.0/logs

You will see debug.log and periodic.log

admin@apic1:logs> ls -al
-rw-r--r-- 2 nobody nobody 52688 Sep 30 11:31 debug.log
-rw-r--r-- 2 nobody nobody 35492 Sep 30 11:30 periodic.log

You can “tail -f debug.log” to monitor the process


APIC debug.log (faults) Example: mcpd
2014-07-25 18:04:00,675 DEBUG 139789634365184 [172.23.76.198, 8534]: Faults: []
2014-07-25 18:05:47,466 DEBUG 139789634365184 [172.23.76.198, 8543]: result: serviceAudit {'stats':
{'max': 20.035178899765015, 'num': 2, 'last': 20.035178899765015, 'avg': 16.63836646080017, 'min':
13.241554021835327}, 'result': {'faults': [([], 82, "Line 100 apic/service.py::modify: Could not
configure service state: Server raised fault: 'Exception caught in
Networking::urn:iControl:Networking/RouteDomainV2::get_identifier()\nException:
Common::OperationFailed\n\tprimary_error_code : 17237812 (0x01070734)\n\tsecondary_error_code :
0\n\terror_string : 01070734:3: Configuration error: Invalid mcpd context, folder not found
(/apic_5794)'")], 'state': 3, 'health': [([], 0)]}}
2014-07-25 18:05:47,467 DEBUG 139789634365184 [172.23.76.198, 8543]: Faults: [([], 82, "Line 100
apic/service.py::modify: Could not configure service state: Server raised fault: 'Exception caught in
Networking::urn:iControl:Networking/RouteDomainV2::get_identifier()\nException:
Common::OperationFailed\n\tprimary_error_code : 17237812 (0x01070734)\n\tsecondary_error_code :
0\n\terror_string : 01070734:3: Configuration error: Invalid mcpd context, folder not found
(/apic_5794)'")]

APIC debug.log (faults)
Example: Tagging mismatch
2014-10-07 13:09:51,166 DEBUG 140447157077760 [198.18.128.130, 76]: Faults: []
2014-10-07 13:09:51,187 DEBUG 140447157077760 [None, None]: Waiting for task
2014-10-07 13:09:53,847 DEBUG 140447148685056 [198.18.128.130, 76]: route_domain: Allocated route
domain 907
2014-10-07 13:09:53,957 DEBUG 140447148685056 [198.18.128.130, 76]: route_domain: Setting route domain
907 on device BIGIP1
2014-10-07 13:09:54,140 INFO 140447148685056 [198.18.128.130, 76]: Line 664
apic/service.py::_modify_vlan: Target: : Creating VLAN '4663_16387' ID 202
2014-10-07 13:09:56,532 INFO 140447148685056 [198.18.128.130, 76]: Line 679
apic/service.py::_modify_vlan: Target: : Modifying VLAN '4663_16387' interface '1.1'
2014-10-07 13:09:57,304 DEBUG 140447148685056 [198.18.128.130, 76]: result: serviceModify {'stats':
{'max': 39.48741388320923, 'num': 4, 'last': 6.139014005661011, 'avg': 21.184859931468964, 'min':
6.139014005661011}, 'result': {'faults': [([(0, '', 4663), (7, '', '2752512_16387')], 81, "Line 383
apic/handlers.py::set_interface: device: : VLAN ifc update fail: Server raised fault: 'Exception
caught in Networking::urn:iControl:Networking/VLAN::add_member()\nException:
Common::OperationFailed\n\tprimary_error_code : 17236569 (0x01070259)\n\tsecondary_error_code :
0\n\terror_string : 01070259:3: Requested member (1.1) is untagged on another VLAN'")],
'state': 2, 'health': []}}

BIG-IP LTM log
SSH as root into BIG-IP and go to:
[root@bigip:Active:In Sync] log # cd /var/log
[root@bigip:Active:In Sync] log # ls ltm*
ltm ltm.11.gz ltm.2.gz ltm.4.gz ltm.6.gz ltm.8.gz
ltm.10.gz ltm.1.gz ltm.3.gz ltm.5.gz ltm.7.gz ltm.9.gz

Example output
Jul 19 11:57:53 apic-bigip2 notice mcpd[7439]: 01070638:5: Pool /apic_5668/apic_5668_webPool member /apic_5668/192.168.10.101%1295:80 monitor status
down. [ /apic_5668/apic_5668_webMonitor: down ] [ was up for 20hrs:55mins:46sec ]
Jul 19 11:57:54 apic-bigip2 notice mcpd[7439]: 01070638:5: Pool /apic_5668/apic_5668_webPool member /apic_5668/192.168.10.102%1295:80 monitor status
down. [ /apic_5668/apic_5668_webMonitor: down ] [ was up for 20hrs:55mins:47sec ]
Jul 19 11:57:54 apic-bigip2 notice mcpd[7439]: 01071682:5: SNMP_TRAP: Virtual /apic_5668/apic_5668_4096_Virtual-Server has become unavailable
Jul 19 11:57:54 apic-bigip2 err tmm[9357]: 01010028:3: No members available for pool /apic_5668/apic_5668_webPool
Jul 19 11:57:54 apic-bigip2 err tmm1[9357]: 01010028:3: No members available for pool /apic_5668/apic_5668_webPool
Jul 19 11:57:54 apic-bigip2 err tmm2[9357]: 01010028:3: No members available for pool /apic_5668/apic_5668_webPool
Jul 19 11:57:54 apic-bigip2 err tmm3[9357]: 01010028:3: No members available for pool /apic_5668/apic_5668_webPool
Jul 19 12:03:02 apic-bigip2 err iprepd[6725]: 015c0004:3: failed connect to 208.87.136.155 on 443
Jul 19 12:03:03 apic-bigip2 err iprepd[6725]: 015c0004:3: Certificate verification error: 18
Jul 19 12:03:03 apic-bigip2 err iprepd[6725]: 015c0004:3: nSendReceiveSsl failed SSL handshake
Jul 19 12:04:11 apic-bigip2 info pfmand[6925]: 01660009:6: Link: 2.1 is DOWN
Jul 19 12:04:11 apic-bigip2 info pfmand[6925]: 01660009:6: Link: 2.2 is DOWN

Access Encap to Fabric Encap

EP A to EP B - simplified

1 Regular L2 packet
2 iVXLAN packet
3 Regular L2 packet

(Diagram: spine 1, spine 2; leaf 1 to leaf 5; EP A, EP B)
How to identify VLAN mapping

Scenario:
VM A is unable to reach other endpoints connected to the Fabric
- ping doesn’t work
- ARP doesn’t work

Linux VM A: connected to the ACI fabric, VLAN 3399
MAC: 00:00:33:33:33:33
What happens when a packet from EP A reaches the leaf
1 The packet first comes to the Merchant ASIC (BCM) on leaf 1 (eth 1/34)
2 It is forwarded to the destination if it’s known on BCM
3 If the destination is not learned in the BCM forwarding table, it is sent to the Cisco ASIC

(Diagram: EP A, MAC 00:00:33:33:33:33 → Merchant ASIC, 48/96 x 10G to servers/blades, switches → Cisco ASIC, 8/12 x 40G to spines)
Linux view

VM MAC: 00:00:33:33:33:33

VM thinks its interface is in VLAN 3399
Checking the L2 forwarding table on Broadcom (bcm-shell-hw)
switch# bcm-shell-hw "l2 show"
mac=52:54:00:b0:c4:81 vlan=57 GPORT=0x22 modid=0 port=34/xe33 Hit
mac=58:f3:9c:24:2e:87 vlan=15 GPORT=0x2 modid=0 port=2/xe1 Hit
mac=00:00:33:33:33:33 vlan=57 GPORT=0x22 modid=0 port=34/xe33 Hit
mac=52:54:00:c3:b8:2c vlan=58 GPORT=0x22 modid=0 port=34/xe33 Hit
mac=00:22:bd:e2:e2:e2 vlan=49 GPORT=0x7f modid=2 port=127 Static

Broadcom says it’s VLAN 57
MAC learning from ACI switch (from the iShell command interface)
switch# show mac address-table interface ethernet 1/34
Legend:
* - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
VLAN MAC Address Type age Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------------------
* 53 0000.3333.3333 dynamic - F F eth1/34
* 53 5254.00b0.c481 dynamic - F F eth1/34
* 54 5254.00c3.b82c dynamic - F F eth1/34

Use "show interface eth 1/34 switchport" to check if VLANs 53/54 are enabled on the eth1/34 interface.

iShell CLI says it’s VLAN 53
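The Broadcom shell prints MAC addresses colon-separated (00:00:33:33:33:33) while "show mac address-table" prints them dotted (0000.3333.3333), which makes copy-paste searches between the two outputs error-prone. A small, hypothetical conversion helper could look like this:

```python
def normalize_mac(mac):
    """Return the 12 lowercase hex digits of a MAC in any common notation."""
    digits = "".join(c for c in mac.lower() if c in "0123456789abcdef")
    if len(digits) != 12:
        raise ValueError("not a MAC address: %r" % mac)
    return digits


def to_colon(mac):
    """Format as aa:bb:cc:dd:ee:ff (as printed by the Broadcom shell)."""
    d = normalize_mac(mac)
    return ":".join(d[i:i + 2] for i in range(0, 12, 2))


def to_dotted(mac):
    """Format as aabb.ccdd.eeff (as printed by show mac address-table)."""
    d = normalize_mac(mac)
    return ".".join(d[i:i + 4] for i in range(0, 12, 4))


if __name__ == "__main__":
    print(to_dotted("00:00:33:33:33:33"))
    print(to_colon("5254.00b0.c481"))
```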


So which VLAN is it? (note: we’re in the vsh_lc CLI)
module-1# show system internal eltmc info vlan access_encap_vlan 3399
vlan_id: 53 ::: hw_vlan_id: 57
vlan_type: FD_VLAN ::: bd_vlan: 52
access_encap_type: 802.1q ::: access_encap: 3399
fabric_encap_type: VXLAN ::: fabric_encap: 9891
sclass: 16387 ::: scope: 8
bd_vnid: 9891 ::: untagged: 0
acess_encap_hex: 0xd47 ::: fabric_enc_hex: 0x26a3

It’s iVXLAN 9891.
Encap VLANs and VXLANs are normalized in the ACI switch; everything in the fabric is iVXLAN.
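The eltmc output ties together all three VLAN views seen so far: access encap 3399 (what the VM uses), vlan_id 53 (what the switch CLI shows) and hw_vlan_id 57 (what Broadcom shows). A minimal sketch for parsing it, assuming the "key: value ::: key: value" layout shown above; the function name is hypothetical and the field names are taken verbatim from the output.

```python
SAMPLE = """vlan_id: 53 ::: hw_vlan_id: 57
vlan_type: FD_VLAN ::: bd_vlan: 52
access_encap_type: 802.1q ::: access_encap: 3399
fabric_encap_type: VXLAN ::: fabric_encap: 9891
acess_encap_hex: 0xd47 ::: fabric_enc_hex: 0x26a3
"""


def parse_eltmc_vlan(text):
    """Parse 'key: value ::: key: value' lines into a flat dict of strings."""
    info = {}
    for line in text.splitlines():
        for field in line.split(":::"):
            key, sep, value = field.partition(":")
            if sep:
                info[key.strip()] = value.strip()
    return info


if __name__ == "__main__":
    info = parse_eltmc_vlan(SAMPLE)
    print("access encap %(access_encap)s -> switch VLAN %(vlan_id)s "
          "-> hw VLAN %(hw_vlan_id)s -> fabric encap %(fabric_encap)s" % info)
    # The hex fields are the same numbers in base 16.
    print(int(info["acess_encap_hex"], 16), int(info["fabric_enc_hex"], 16))
```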
Is this actually possible with ACI?

Troubleshooting Scenario

End Point Search
We can search for an End Point by IPv4, IPv6 or MAC address.
* Search by wildcard will be available in the APIC 1.2(2) release
Troubleshooting Scenario

iPing CLI
Hint: to check the list of VRF names, use: show vrf
usage:
iping [-V vrf] [-c count] [-S source ip] host

options:
-V : vrf to use for ping (management/overlay-1/Tenant VRF)
-c : # of requests to send.
-i : interval between ICMP echo packets.
-t : Timeout for responses.
-p : Data pattern in payload.
-s : Size
-S : Source – Interface name/ IP address.

iping internals
leaf1# iping -V tenant:vrf01 -S 64.101.1.1 64.101.1.22

Note: iping is initiated from leaf1. Since EP_A is learned on leaf1, the packet will be sent out directly to the EP, not going via the spines.

Recommended: set the source IP address to the desired GW (BD IP).

1 leaf1: iping to Endpoint_A (EP_A)
2 EP_A (.22): responds to leaf1

Endpoint_A IP: 64.101.1.22
iping internals
leaf4# iping -V tenant:vrf01 -S 64.101.1.1 64.101.1.22

Note: we initiated iping from leaf4. Since EP_A is learned on leaf1, the packet will be sent via the fabric (via the spines).

1 leaf4: iping to Endpoint_A (EP_A) (ICMP echo request to leaf1 TEP)
2 leaf1: ping to Endpoint_A (EP_A)
3 EP_A (.22): responds to leaf4 (via leaf1 and fabric)

The ICMP echo reply packet to the remote leaf4 node is relayed by the local leaf1 node.

Endpoint_A IP: 64.101.1.22
Troubleshooting Scenario

Check ingress traffic rate from CLI – multiple ports

leaf1# watch -n 5 -d bcm-shell-hw "show c All RPKT.xe0-16"


Every 5.0s: bcm-shell-hw show c All RPKT.xe0-16
Tue Feb 9 06:06:52 2016

unit is 0
RPKT.xe0 : 368,075,657 +253 84/s
RPKT.xe1 : 351,308,235 +264 87/s
RPKT.xe2 : 332,607,921 +212 70/s
RPKT.xe3 : 0 +0
RPKT.xe4 : 60,649 +0
RPKT.xe5 : 60,696 +0
RPKT.xe6 : 0 +0
RPKT.xe7 : 0 +0
RPKT.xe8 : 193,423 +0
RPKT.xe9 : 1,493,189 +1
RPKT.xe10 : 10,965,614 +5 2/s
RPKT.xe11 : 0 +0
RPKT.xe12 : 0 +0
RPKT.xe13 : 0 +0
RPKT.xe14 : 6,577,648 +0
RPKT.xe15 : 0 +0

Convenient way to check traffic rate on multiple ports at the same time.

*try also
watch -d bcm-shell-hw "show counters All TPKT"
Troubleshooting Scenario

Capacity Dashboard

Capacity Dashboard panel displays your usage by range and percentage.

In the example above we configured a large number of contracts as a demo for this feature.

Troubleshooting Scenario

Visibility and Troubleshooting

We define a session name and select the End Points we’d like to troubleshoot visually:
0 define session name
1 select end point 1
2 select end point 2
3 start

Example connectivity diagram generated for the
selected two end points.

We can further select info for particular datapath

Troubleshooting Scenario

ELAM

What is ELAM?

ELAM stands for Embedded Logic Analyzer Module.

It is logic present in the ASICs that provides the capability to capture and view one or more packets matching user-specified criteria from the stream of packets processed by the ASIC.

ELAM Support in Cisco ASIC
[Figure: ingress pipeline (front panel → fabric: from BCM, to fabric) and egress pipeline (fabric → front panel: from fabric, to BCM). Each pipeline contains a Parser Block, a Lookup Block and Packet RW / Sideband outputs, with an input ELAM and an output ELAM, each with its select lines, around the Lookup Block.]
ELAM Support in North Star
• The North Star (Cisco ASIC) data path is divided into ingress and egress pipelines
• 2 ELAMs are present in each pipeline (Input ELAM and Output ELAM)
• These ELAMs are present at the beginning and end of the lookup block
• ELAMs can be configured using the available select lines
• Packets can be captured on the input ELAM based on an output condition by configuring the ELAM in “reverse” mode
Limitations
• Packets can be captured based on either input select lines or output select lines, but not both
• ELAM configuration should happen in a single-user mode

ELAM Support

Input Select Lines Supported
3 → Outerl2-outerl3-outerl4
4 → Innerl2-innerl3-innerl4
5 → Outerl2-innerl2
6 → Outerl3-innerl3
7 → Outerl4-innerl4

Output Select Lines Supported
0 → Pktrw
5 → Sideband

Note: Only output select lines 0 and 5 are supported for capturing packets based on output, at both output and input.

ELAM Configuration
The flow during ELAM configuration:
1. Init: initialize the ELAM (select the ASIC instance, pipeline and select lines)
2. Config: configure the trigger based on different fields in the packet
3. Arm: arm the trigger by setting the fields to match in hardware
   (the trigger fires when a matching packet is seen)
4. Read: once the trigger has fired, read the report
5. Reset: once the process is complete, reset the trigger to restart the process

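The five steps map onto a short command sequence. Below is a minimal Python sketch, assuming a hypothetical helper `build_elam_commands` that only assembles the CLI strings and never connects to a switch. The command syntax and the default select lines are copied from the leaf example in this session; the helper name and its parameters are illustrative assumptions.

```python
def build_elam_commands(direction, header, src_mac, dst_mac,
                        asic_family="ns", asic=0, in_select=3, out_select=0):
    """Return ELAM CLI lines in reset -> init -> config -> arm -> read order.

    Hypothetical helper: it only builds strings, nothing is sent to a switch.
    """
    if direction not in ("ingress", "egress"):
        raise ValueError("direction must be 'ingress' or 'egress'")
    if header not in ("outer", "inner"):
        raise ValueError("header must be 'outer' or 'inner'")
    return [
        "vsh_lc",
        f"debug platform internal {asic_family} elam asic {asic}",
        "trigger reset",                        # Reset: clear any previous trigger
        f"trigger init {direction} in-select {in_select} "
        f"out-select {out_select}",             # Init: pipeline and select lines
        f"set {header} l2 src_mac {src_mac}",   # Config: fields to match
        f"set {header} l2 dst_mac {dst_mac}",
        "start",                                # Arm
        "status",                               # Check whether it is armed
        "report",                               # Read
    ]


commands = build_elam_commands("ingress", "outer",
                               "00:25:b5:aa:00:0a", "ff:ff:ff:ff:ff:ff")
print("\n".join(commands))
```

Keeping the sequence in one place makes it harder to forget the initial `trigger reset` when moving from one capture point to the next.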
ELAM configuration
Show the trigger

The configured trigger can be verified using the show command
root@module-1(NS-elam-insel3)# show

ELAM Report Analysis
• ELAM report is very detailed and dumps many fields.
• In Pktrw the important fields are
  • adj_index
  • ol_encap_idx
  • sclass
  • src_tep_idx
  • sup_redirect
• In Sideband the important fields are
  • l2flood
  • fwddrop
  • bnce

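Since the report dumps many fields, it can help to filter it down to the ones listed above. Below is a minimal Python sketch, assuming the report lines have the `GBL_C++: [INFO] name: value` shape seen in the report output in this session. The function names are hypothetical and the sample text is a subset of that output.

```python
import re

# Matches lines such as "GBL_C++: [INFO] adj_index: 000C"
LINE_RE = re.compile(r"\[INFO\]\s+(\w+):\s+(\S+)")

PKTRW_FIELDS = ("adj_index", "ol_encap_idx", "sclass",
                "src_tep_idx", "sup_redirect")
SIDEBAND_FIELDS = ("l2flood", "fwddrop", "bnce")


def parse_report(text):
    """Collect name/value pairs from [INFO] lines; the first occurrence wins."""
    fields = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            fields.setdefault(match.group(1), match.group(2))
    return fields


def important(fields, names):
    """Keep only the requested fields that are present in the report."""
    return {name: fields[name] for name in names if name in fields}


sample = """\
GBL_C++: [INFO] ce_sa: 0025B5AA000A
GBL_C++: [MSG] - pktrw is complete
GBL_C++: [INFO] adj_index: 000C
GBL_C++: [INFO] ol_encap_idx: 2FF6
GBL_C++: [INFO] sclass: C005
GBL_C++: [INFO] sup_redirect: 0
"""
fields = parse_report(sample)
print(important(fields, PKTRW_FIELDS))
```

Fields missing from a given report (here `src_tep_idx`) are simply omitted instead of raising an error.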
ELAM Example

What happens when packet from EP A reaches leaf
1 Packet first comes to Merchant ASIC (BCM)
2 Forwarded to destination if it’s known on BCM
3 If destination is not learned in BCM forwarding table, then send to Cisco ASIC
[Diagram: leaf 1 containing a Cisco ASIC and a Merchant ASIC. Labels: 8/12 x 40G, to spines; 48/96 x 10G, to servers/blades, switches. EP A (MAC: 00:25:b5:aa:00:0a) is connected on eth 1/10.]

ELAM Example
1 leaf1: input ingress → outer header
2 spine: input ingress → inner header
3 leaf4: input egress → inner header
[Diagram: spine 1 and spine 2 above leaf 1 to leaf 5, with EP A and EP B attached to the fabric; capture points 1, 2 and 3 are marked on the path.]

ELAM Example
1 leaf1: input ingress → outer header
Note: Packet is not yet encapsulated in iVXLAN. Outer header is still the original frame from EP.

vsh_lc
debug platform internal ns elam asic 0
trigger reset
trigger init ingress in-select 3 out-select 0
set outer l2 src_mac 00:25:b5:aa:00:0a
set outer l2 dst_mac ff:ff:ff:ff:ff:ff
start
status
report

[Diagram: capture point 1 at leaf 1. EP A MAC: 00:25:b5:aa:00:0a, EP B MAC: 00:25:b5:bb:00:0b.]

ELAM configuration
We’re looking to confirm if broadcast packet sourced from MAC 00:25:b5:aa:00:0a is reaching Cisco ASIC

leaf1# vsh_lc
module-1# debug platform internal ns elam asic 0
module-1(NS-elam)# trigger reset
module-1(NS-elam)# trigger init ingress in-select 3 out-select 0
module-1(NS-elam-insel3)# set outer l2 src_mac 00:25:b5:aa:00:0a
module-1(NS-elam-insel3)# set outer l2 dst_mac ff:ff:ff:ff:ff:ff
module-1(NS-elam-insel3)# start
module-1(NS-elam-insel3)# status
Status: Armed
module-1(NS-elam-insel3)# ?
report Show trigger report

module-1(NS-elam-insel3)# report
ELAM not triggered. No report available

NOTE:
1) Without the "reset" command, trigger buffers are never reset other than reboot.
2) Users can move in and out of the ELAM mode, and there will be no impact on the configured triggers.

ELAM Report Analysis
(trigger went off)
module-1(NS-elam-insel3)# report | egrep ce_|ar_|drop|hg2_src
GBL_C++: [INFO] hg2_srcpid: 0A
GBL_C++: [INFO] ce_da: FFFFFFFFFFFF
GBL_C++: [INFO] ce_sa: 0025B5AA000A
GBL_C++: [INFO] ce_etype: 0806
GBL_C++: [INFO] ar_sha: 0025B5AA000A
GBL_C++: [INFO] ar_spa: 0A108030
GBL_C++: [INFO] ar_tha: 000000000000
GBL_C++: [INFO] ar_tpa: 0A108001
GBL_C++: [INFO] ar_spare: 0000000000000000000000000000
GBL_C++: [MSG] - pktrw is complete
GBL_C++: [INFO] drop: 0
GBL_C++: [INFO] hg2_srcpid: 0A
GBL_C++: [INFO] hg2_vid_lo: 63
GBL_C++: [INFO] vlan0: 063
GBL_C++: [INFO] adj_index: 000C
GBL_C++: [INFO] ol_encap_idx: 2FF6
GBL_C++: [INFO] ol_ttl: 08
GBL_C++: [INFO] ol_segid: 2A8001
GBL_C++: [INFO] sclass: C005
GBL_C++: [INFO] sup_redirect: 0
GBL_C++: [INFO] mcast: 0

Field notes:
hg2_srcpid: source port on front panel
ce_sa: Source MAC address
ce_etype: Ethertype 0x806 = ARP (Address Resolution)
ar_spa: Source IP address = 10.16.128.48
ar_tpa: Destination IP address = 10.16.128.1
People that read hex on the fly appreciate this output!

VXLAN destination TEP address derived from encap: 10.0.200.127
module-1(NS-elam-insel3)# show platform internal ns forwarding encap 0x2FF6
TABLE INSTANCE : 0
Legend
MD: Mode (LUX & RWX) LB: Loopback
LE: Loopback ECMP LB-PT: Loopback Port
ML: MET Last TD: TTL Dec Disable
DV: Dst Valid DT-PT: Dest Port
DT-NP: Dest Port Not-PC ET: Encap Type
OP: Override PIF Pinning HR: Higig DstMod RW
HG-MD: Higig DstMode KV: Keep VNTAG
------------------------------------------------------------------------------------------
 M PORT L L LB MET M T D DT DT E TST O H HG K M E
POS D FTAG B E PT PTR L D V PT NP T IDX P R MD V D T Dst MAC DIP
------------------------------------------------------------------------------------------
12278 0 c00 0 1 0 0 0 0 0 0 0 3 4 0 0 0 0 0 3 00:00:00:00:00:00 10.0.200.127

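For those who do not read hex on the fly, the report values can be decoded with a few lines of Python. This is a minimal sketch using only the standard library; the helper names `hex_to_ipv4` and `hex_to_mac` are hypothetical, and the input values are taken from the report above.

```python
import ipaddress


def hex_to_ipv4(value):
    """Convert an 8-digit hex string from the report to dotted decimal."""
    return str(ipaddress.IPv4Address(int(value, 16)))


def hex_to_mac(value):
    """Convert a 12-digit hex string from the report to colon notation."""
    value = value.lower()
    return ":".join(value[i:i + 2] for i in range(0, 12, 2))


print(hex_to_ipv4("0A108030"))     # ar_spa -> 10.16.128.48
print(hex_to_ipv4("0A108001"))     # ar_tpa -> 10.16.128.1
print(hex_to_mac("0025B5AA000A"))  # ce_sa  -> 00:25:b5:aa:00:0a
print(int("2FF6", 16))             # ol_encap_idx -> 12278
print(int("0A", 16))               # hg2_srcpid -> 10
```

The decoded `ol_encap_idx` (12278) matches the POS column of the encap table entry, and `hg2_srcpid` decodes to 10, which is consistent with EP A being connected on eth 1/10.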
We have destination TEP address, what next?
Find which switch has the specific TEP

On APIC or switch:
acidiag fnvread | egrep 10.0.200.127
moquery -c tunnelIf -f 'tunnel.If.dest=="10.0.200.127"'
show isis dtep vrf overlay-1

Switch output (APIC is not running the IS-IS protocol):
# show isis dtep vrf overlay-1
IS-IS Dynamic Tunnel End Point (DTEP) database:
DTEP-Address    Role    Encapsulation   Type
10.0.120.95     SPINE   N/A             PHYSICAL
10.0.200.64     SPINE   N/A             PHYSICAL,PROXY-ACAST-MAC
10.0.200.65     SPINE   N/A             PHYSICAL,PROXY-ACAST-V4
10.0.8.65       SPINE   N/A             PHYSICAL,PROXY-ACAST-V6
10.0.8.64       LEAF    N/A             PHYSICAL
10.0.200.127    LEAF    N/A             PHYSICAL
10.0.200.126    SPINE   N/A             PHYSICAL

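When the DTEP database is long, the lookup can be scripted. Below is a minimal Python sketch, assuming the four-column text layout of the `show isis dtep vrf overlay-1` output shown in this session. `parse_dtep` is a hypothetical helper and the sample is a subset of that output pasted in as a string.

```python
def parse_dtep(text):
    """Map each DTEP address to its role, encapsulation and type list."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have four columns and start with a dotted IPv4 address;
        # the banner and the column header do not.
        if (len(parts) == 4 and parts[0].count(".") == 3
                and parts[0].replace(".", "").isdigit()):
            addr, role, encap, types = parts
            entries[addr] = {
                "role": role,
                "encap": encap,
                "types": types.split(","),
            }
    return entries


sample = """\
IS-IS Dynamic Tunnel End Point (DTEP) database:
DTEP-Address Role Encapsulation Type
10.0.120.95 SPINE N/A PHYSICAL
10.0.200.64 SPINE N/A PHYSICAL,PROXY-ACAST-MAC
10.0.8.64 LEAF N/A PHYSICAL
10.0.200.127 LEAF N/A PHYSICAL
"""
table = parse_dtep(sample)
print(table["10.0.200.127"]["role"])   # LEAF
```

Here the TEP derived from the encap index resolves to a leaf, so the next capture point after the spine is that leaf.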
ELAM Example
2 spine: input ingress → inner header
Cisco ASIC in spine
Packet is now encapsulated in iVXLAN, so we’re looking for inner header

vsh_lc
debug platform internal alp elam asic 0 | 1
trigger init ingress in-select 3 out-select 0
set inner l2 src_mac 00:25:b5:aa:00:0a
set inner l2 dst_mac 00:25:b5:bb:00:0b
start
status
report

Hint: don’t forget trigger reset

[Diagram: capture point 2 at the spine. EP A MAC: 00:25:b5:aa:00:0a, EP B MAC: 00:25:b5:bb:00:0b.]

ELAM Example
3 leaf4: input egress → inner header
Cisco ASIC in leaf
Egress because we’re egressing the fabric

vsh_lc
debug platform internal ns elam asic 0
trigger init egress in-select 3 out-select 0
set inner l2 src_mac 00:25:b5:aa:00:0a
set inner l2 dst_mac 00:25:b5:bb:00:0b
start
status
report

*** report will be available when trigger went off

[Diagram: capture point 3 at leaf 4. host A MAC: 00:25:b5:aa:00:0a, host B MAC: 00:25:b5:bb:00:0b.]


References

APIC resources

Quick Start / Videos


APIC Help pages
API Documentation
Python SDK

Online resources

ACI Documentation - cisco.com/go/aci


Cisco.com – APIC Troubleshooting
Cisco Support Forums
Cisco DevNet
GitHub/datacenter
GitHub – a resource for ACI scripts and tools
• ACI Toolkit:
http://datacenter.github.io/acitoolkit/
https://github.com/datacenter/acitoolkit

• ACI Diagram
https://github.com/cgascoig/aci-diagram

• ACI Endpoint Tracker
http://datacenter.github.io/acitoolkit/docsbuild/html/endpointtracker.html

Troubleshooting
Cisco ACI
Available at GitHub

The Policy Driven Data Center with ACI: Architecture, Concepts, and Methodology
ISBN: 9781587144905

Designing Data Centers with Cisco's ACI LiveLessons: Networking Talks
ISBN: 978-1-58714-436-3

Call to Action
• Visit the World of Solutions for
• Cisco Campus – ACI
• Walk in Labs – ACI
• Technical Solution Clinics

• Meet the Engineer


• Lunch and Learn Topics
• DevNet zone related sessions

Thank you
