
Troubleshooting Cisco
Application Centric
Infrastructure

Problem resolution insights
from Cisco engineers

Andres Vega, Bryan Deaver, Jerry Ye,
Loy Evans, Mike Timm, Kannan Ponnuswamy,
Paul Lesiak & Paul Raytick

cisco.com
Preface

Authors and Contributors

This book is the product of an intense, week-long collaboration between Cisco's Engineering, Technical Support, Advanced Services and Sales employees, all working in the same room at Cisco Headquarters Building 20 in San Jose, CA.

Authors 

Andres Vega - Cisco Technical Services
Bryan Deaver - Cisco Technical Services
Jerry Ye - Cisco Advanced Services
Kannan Ponnuswamy - Cisco Advanced Services
Loy Evans - Systems Engineering
Mike Timm - Cisco Technical Services
Paul Lesiak - Cisco Advanced Services
Paul Raytick - Cisco Technical Services

Lead Contributors

Giuseppe Andreello
Pavan Bassetty  
Sachin Jain
Sri Goli

Contributors

Piyush Agarwal
Pooja Aniker
Ryan Bos
Mike Brown
Robert Burns
Mai Cutler
Tomas de Leon
Luis Flores
Michael Frase
Mioljub Jovanovic
Ozden Karakok
Jose Martinez
Rafael Mueller
Chandra Nagarajan
Mike Petrinovic
Daniel Pita
Mike Ripley
Zach Seils
Ramses Smeyers
Steve Winters

Dedications

“For Olga and Victoria, the love and happiness of my life, hoping for a world that continues
to strive in providing more effective solutions to all problems intrinsic to human nature.”

- Andres Vega

“For my parents, uncles, aunts, cousins, and wonderful nephews and nieces in the US,
Australia, Hong Kong and China.”

- Jerry Ye

“To my wife, Vanitha for her unwavering love, support and encouragement, my kids
Kripa, Krish and Kriti for the sweet moments, my sister who provided me the education
needed for this book, my brother for the great times, and my parents for their uncondi-
tional love.”

- Kannan Ponnuswamy

“I would like to thank my amazing family, Molly, Ethan and Abby, without whom none of the things I do would matter.”

- Loy Evans

“Big thanks to my wife Morena, my daughters Elena and Mayra. Thank you to my in-laws
Guadalupe and Armando who helped watch my beautiful growing girls while I spent the
time away from home working on this project.”

- Mike Timm

“To all my colleagues who forget to lock their computers.”

- Paul Lesiak

“For Susan, Matthew, Hanna, Brian, and all my extended family, thanks for your support
throughout the years. Thanks as well to Cisco for the opportunity, it continues to be a
fun ride.”

- Paul Raytick

“An appreciative thank you to my wife Melanie and our children Sierra and Jackson for
their support. And also to those that I have had the opportunity to work with over the
years on this journey with Cisco.”

- Bryan Deaver

Acknowledgments

While this book was produced and written in a single week, the knowledge and experience leading to it are the result of the hard work and dedication of many individuals inside and outside Cisco.

Special thanks to Cisco’s INSBU Executive, Technical Marketing and Engineering teams
who supported the realization of this book. We would like to thank you for your contin-
uous innovation and the value you provide to the industry.

We want to thank Cisco's Advanced Services and Technical Services leadership teams for the trust they placed in this initiative and the support provided since the inception of the idea.

We also want to express gratitude to the following individuals for their influence and support both prior to and during the Book Sprint:

Shrey Ajmera
Subrata Banerjee
Dave Broenen
John Bunney
Luca Cafiero
Ravi Chamarthy
Mike Cohen
Kevin Corbin
Ronak Desai
Krishna Doddapaneni
Mike Dvorkin
Tom Edsall
Ken Fee
Vikki Fee
Siva Gaggara
Shilpa Grandhi
Ram Gunuganti
Ruben Hakopian
Robert Hurst
Donna Hutchinson
Fabio Ingrao
Saurabh Jain
Praveen Jain
Prem Jain
Soni Jiandani
Sarat Kamisetty
Yousuf Khan
Praveen Kumar
Tighe Kuykendall
Adrienne Liu
Anand Louis
Gianluca Mardente
Wayne McAllister
Rohit Mediratta
Munish Mehta
Sameer Merchant
Joe Onisick
Venkatesh Pallipadi
Ayas Pani
Amit Patel
Maurizio Portolani
Pirabhu Raman
Alice Saiki
Christy Sanders
Enrico Schiattarella
Priyanka Shah
Pankaj Shukla
Michael Smith
Edward Swenson
Srinivas Tatikonda
Santhosh Thodupunoori
Sergey Timo
Muni Tripathi
Bobby Vandalore
Sunil Verma
Alok Wadhwa
Jay Weinstein

We would also like to thank the Office of the President and COO for booking accommo-
dation for the week as well as the Office of the CTO and Chief Architect for their hospi-
tality while working in their office space.

We are also truly grateful to our Book Sprint (www.booksprints.net) facilitators Laia Ros
and Adam Hyde for carrying us throughout this collaborative knowledge production
process, and to our illustrator Henrik van Leeuwen, who took abstract ideas and turned them into clear visuals. Our first concern was how to bring together so many people from different sides of the business to complete a project that traditionally takes months. The Book Sprint team showed that this is possible, and presents a new model for how we collaborate, extract knowledge and experience, and present it in a single source.

Index

Preface

Authors and Contributors
Dedications
Acknowledgments

Introduction to Application Centric Infrastructure Troubleshooting

Introduction
ACI Policy Model
Troubleshooting Tools
Troubleshooting Methodology

Sample Reference Topology

Physical Fabric Topology
Logical Application Topology

Troubleshooting

Naming Conventions
Initial Hardware Bringup
Fabric Initialization
APIC High Availability and Clustering
Firmware and Image Management
Faults / Health Scores
REST Interface
Management Tenant
Common Network Services
Unicast Data Plane Forwarding and Reachability
Policies and Contracts
Bridged Connectivity to External Networks
Routed Connectivity to External Networks
Virtual Machine Manager Insertion
Layer 4 Through 7 Services Insertion
ACI Fabric Node Process Crash Troubleshooting
APIC Process Crash Troubleshooting

Appendix

Glossary

Introduction to
Application Centric
Infrastructure
Troubleshooting

Introduction

In the same way that humans build relationships to communicate and share their knowl-
edge, computer networks are built to allow for nodes to exchange data at ever increasing
speeds and rates. The drivers for these rapidly growing networks are the applications,
the building blocks that consume and provide the data which are close to the heart of
the business lifecycle. The organizations tasked with nurturing and maintaining these
expanding networks, nodes and vast amounts of data, are critical to those that consume
the resources they provide.

IT organizations have managed the conduits of this data as network devices with each
device being managed individually. In the efforts to support an application, a team or
multiple teams of infrastructure specialists build and configure static infrastructure in-
cluding the following:

• Physical infrastructure (switches, ports, cables, etc.)
• Logical topology (VLANs, L2 interfaces and protocols, L3 interfaces and
protocols, etc.)
• Access control configuration (permit/deny ACLs) for application integration
and common services
• Quality of Service configuration
• Services integration (Firewall, Load Balancing, etc.)
• Connecting application workload engines (VMs, physical servers, logical
application instances)

Cisco seeks to innovate the way this infrastructure is governed by introducing new par-
adigms. Going from a network of individually managed devices to an automated poli-
cy-based model that allows an organization to define the policy, and the infrastructure
to automate the implementation of the policy in the hardware elements, will change the
way the world communicates.

To this end, Cisco has introduced Application Centric Infrastructure, or ACI, as a holistic systems-based approach to infrastructure management.

The design intent of ACI is to provide the following:

• Application-driven policy modeling
• Centralized policy management and visibility of infrastructure and application
health
• Automated infrastructure configuration management
• Integrated physical and virtual infrastructure management
• Open interface to enable flexible software and ecosystem partner integration
• Seamless communications from any endpoint to any endpoint

There are multiple possible implementation options for an ACI fabric:

• Leveraging a network-centric approach to policy deployment - in this case a full
understanding of application interdependencies is not critical, and instead the
current model of a network-oriented design is maintained. This can take one of
two forms:
o L2 Fabric – Uses the ACI policy controller to automate provisioning of
network infrastructure based on L2 connectivity between connected
network devices and hosts.
o L3 Fabric – Uses the ACI policy model to automate provisioning of network
infrastructure based on L3 connectivity between network devices and hosts.
• Application-centric fabric – takes full advantage of all of the ACI objects to build
out a flexible and completely automated infrastructure including L2 and L3
reachability, physical machine and VM connectivity integration, service node
integration and full object manipulation and management.

Implementations of ACI that take full advantage of the intended design from an appli-
cation-centric perspective allow for end-to-end network automation spanning physical
and virtual network and network services integration.

All of the manual configuration and integration work detailed above is thus automated based on policy, making the infrastructure team's efforts more efficient.

Instead of manually configuring VLANs, ports and access lists for every device connected
to the network, the policy is created and the infrastructure itself resolves and provisions
the relevant configuration to be provisioned on demand, where needed, when needed.
Conversely, when devices, applications or workloads detach from the fabric, the relevant
configuration can be de-provisioned, allowing for optimal network hygiene.

Cisco ACI follows a model-driven approach to configuration management. This model-based configuration is disseminated to the managed nodes using the concept of Promise Theory.

Promise Theory is a management model in which a central intelligence system declares a desired configuration “end-state”, and the underlying objects act as autonomous intelligent agents that can understand the declarative end-state and either implement the required change, or send back information on why it could not be implemented.

In ACI, the intelligent agents are purpose-built elements of the infrastructure that take
an active part in its management by the keeping of “promises”. Within promise theory,
a promise is an agent’s declaration of intent to follow an intended instruction defining
operational behavior. This allows management teams to create an abstract “end-state”
model and the system to automate the configuration in compliance. With declarative
end-state modeling, it is easier to build and manage large scale networks with less effort.

Many new ideas, concepts and terms come with this coupling of ACI and Promise Theory.
This book is not intended to be a complete tutorial on ACI or Promise Theory, nor is it
intended to be a complete operations manual for ACI, or a complete dictionary of terms
and concepts. Where possible, however, a base level of definitions will be provided, ac-
companied by explanations. The goal of this text is to provide some common concepts,
terms, models and fundamental features of the fabric, then use that base knowledge to
dive into troubleshooting methodology and exercises.

To read more information on Cisco’s Application Centric Infrastructure, the reader may
refer to the Cisco website at https://www.cisco.com/go/aci.

Expected Audience

The intended audience for this book is those with a general need to understand how to
operate and/or troubleshoot an ACI fabric. While operation engineers may experience
the largest benefit from this content, the materials included herein may be of use to a
much wider audience, especially given modern industry trends towards continuous in-
tegration and development, along with the ever growing need for agile DevOps oriented
methodologies.

There are many elements in this book that explore topics outside the typical job responsibilities of network administrators. For example, the programmatic manipulation of policy models can be viewed as a development-oriented task; however, it has specific relevance to networking configuration and function, taking a very different approach than traditional CLI-based interface configuration.

Organization of this Book

Section 1: Introduction to ACI Troubleshooting

The introduction covers some basic concepts, terms and models while introducing the
tools that will be used in troubleshooting. Also covered are the troubleshooting, verifica-
tion and resolution methodologies used in later sections that cover the actual problems
being documented.

Section 2: Sample Reference Topology

This section sets the baseline sample topology used throughout all of the troubleshoot-
ing exercises that are documented later in the book. Logical diagrams are provided for
the abstract policy elements (the endpoint group objects, the application profile objects,
etc) as well as the physical topology diagrams and any supporting documentation that is
needed to understand the focal point of the exercises. In each problem description in
Section 3, references will be made to the reference topology as necessary. Where fur-
ther examination is required, the specific aspects of the topology being examined may be
re-illustrated in the text of the troubleshooting scenario.

Section 3: Troubleshooting Application Centric Infrastructure

The Troubleshooting ACI section goes through specific problem descriptions as they relate to the fabric. For each problem, there will be a problem description, a listing of the process, some verification steps, and possible resolutions.

Chapter format: The chapters that follow in the Troubleshooting section document the
various problems: verification of causes and possible resolutions are arranged in the fol-
lowing format.

Overview: Provides an introduction to the problem in focus by highlighting the following
information:

• Theory and concepts to be covered
• Information on what should be happening
• Verification steps for a working state

Problem Description: The problem description will be a high-level observation of the starting point for the troubleshooting actions to be covered. Example: a fabric node is showing “inactive” from the APIC when using the APIC CLI command “acidiag fnvread”.

Symptoms: Depending on the problem, various symptoms and their impacts may be ob-
served. In this example, some of the symptoms and indications of issues around an inac-
tive fabric node could be:

• loss of connectivity to the fabric
• low health score
• system faults
• inability to make changes through the APIC

Verification and cause: The logical set of steps to identify what is being observed will be
indicated along with the appropriate tools and output. Additionally, some information
about what is being observed and the likely causes will be included.

Book Writing Methodology

The Book Sprint (www.booksprints.net) methodology was used for writing this book. The Book Sprint methodology is an innovative style of cooperative and collaborative authorship. Book Sprints are strongly facilitated and leverage team-oriented inspiration and motivation to rapidly deliver large amounts of well-authored and reviewed content, and incorporate it into a complete narrative in a short amount of time. By leveraging the input of many experts, the complete book was written in a period of only five days; however, it involved hundreds of authoring man-hours backed by thousands of hours of engineering experience, allowing for extremely high quality in a very short production time.

ACI Policy Model

While the comprehensive policy model that ACI utilizes is broad, the goal of this section
is to introduce the reader to a basic level of understanding about the model, what it
contains and how to work with it. The complete object model contains a vast amount
of information that represents a complete hierarchy of data center interactions, so it is
recommended that the reader take the time to review the many white papers available on
cisco.com, or for the most extensive information available, review the APIC Management
Information Model Reference packaged with the APIC itself.

Abstraction Model

ACI provides the ability to create a stateless definition of application requirements. Ap-
plication architects think in terms of application components and interactions between
such components; not necessarily thinking about networks, firewalls and other services.
By abstracting away the infrastructure, application architects can build stateless policies
and define not only the application, but also Layer 4 through 7 services and interactions
within applications. Abstraction also means that the policy defining application require-
ments is no longer tied to traditional network constructs, and thus removes dependen-
cies on the infrastructure and increases the portability of applications.

The application policy model defines application requirements, and based on the spec-
ified requirements, each device will instantiate a set of required changes. IP addresses
become fully portable within the fabric, while security and forwarding are decoupled
from any physical or virtual network attributes. Devices autonomously and consistently
update the state of the network based on the configured policy requirements set within
the application profile definitions.

Everything is an Object

The abstracted model utilized in ACI is object-oriented, and everything in the model is
represented as an object, each with properties relevant to that object. As is typical for an
object-oriented system, these objects can be grouped, classed, read, and manipulated,
and objects can be created referencing other objects. These objects can reference relevant application components as well as relationships between these components. The rest of this section will describe the elements of the model, the objects inside, and their relationships at a high level.

Relevant Objects and Relationships

Within the ACI application model, the primary object that encompasses all of the objects
and their relationships to each other is called an Application Profile, or AP. Some readers
are certain to think, “a 3-tier app is a unicorn,” but in this case, the idea of a literal 3-tier
application works well for illustrative purposes. Below is a diagram of an AP shown as a
logical structure for a 3-tier application that will serve well for describing the relevant
objects and relationships.

From left to right, in this 3-tier application there is a group of clients that can be cate-
gorized and grouped together. Next there is a group of web servers, followed by a group
of application servers, and finally a group of database servers. There exist relationships
between each of these independent groups. For example, from the clients to the applica-
tion servers, there are relationships that can be described in the policy which can include
things such as QoS, ACLs, Firewall and Server Load Balancing service insertion. Each of
these things is defined by managed objects, and the relationships between them are used
to build out the logical model, then resolve them into the hardware automatically.

Endpoints are objects that represent individual workload engines (i.e. virtual or phys-
ical machines, etc.). The following diagram emphasizes which elements in the policy
model are endpoints, which include web, application and database virtual machines.

These endpoints are logically grouped together into another object called an Endpoint
Group, or EPG. The following diagram highlights the EPG boundaries in the diagram, and
there are four EPGs - Clients, Web servers, Application servers, and Database servers.

There are also Service Nodes that are referenceable objects, such as Firewalls, and Server
Load Balancers (or Application Delivery Controllers/ADC), with a firewall and load balanc-
er combination chained between the client and web EPGs, a load balancer between the
web and application EPGs, and finally a firewall securing traffic between the application
and database EPGs.

A group of Service Node objects can be logically chained into a sequence of services rep-
resented by another object called a Service Graph. A Service Graph object provides com-
pound service chains along the data path. The diagram below shows where the Service
Graph objects are inserted into a policy definition, emphasizing the grouped service nodes
in the previous diagram.

With objects defined to express the essential elements of the application, it is possible to
build relationships between the EPG objects, using another object called a Contract. A Con-
tract defines what provides a service, what consumes a service and what policy objects are
related to that consumption relationship. In the case of the relationship between the clients and the web servers, the policy defines the communication path and all related elements
of that. As shown in the details of the example below, the Web EPG provides a service
that the Clients EPG consumes, and that consumption would be subject to a Filter (ACL)
and a Service Graph that includes Firewall inspection services and Server Load Balancing.

A concept to note is that ACI fabrics are built on the premise of a whitelist security ap-
proach, which allows the ACI fabric to function as a semi-stateful firewall fabric. This
means communication is implicitly denied, and that one must build a policy to allow
communication between objects or they will be unable to communicate. In the example
above, with the contract in place as highlighted, the Clients EPG can communicate with
the Web EPG, but the Clients cannot communicate with the App EPG or DB EPGs. This is
not explicit in the contract, but native to the fabric’s function.
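To make this concrete, the sketch below shows roughly how such a contract is expressed in the object model and posted through the REST API. The class names (vzBrCP, vzSubj, vzRsSubjFiltAtt) come from the ACI model, while the contract, subject, filter, and tenant names are illustrative placeholders; treat this as a hedged sketch rather than a definitive configuration.

```python
# Hedged sketch: a contract allowing Clients-to-Web traffic, expressed as
# ACI policy XML. Contract/subject/filter/tenant names are placeholders.
contract_xml = """
<vzBrCP name="ClientsToWeb-TnCon" scope="tenant">
  <vzSubj name="WebTraffic-UniSbj">
    <vzRsSubjFiltAtt tnVzFilterName="http-flt"/>
  </vzSubj>
</vzBrCP>
"""

# The payload would be POSTed under the owning tenant, for example:
#   POST https://<apic>/api/mo/uni/tn-Prod.xml
# with contract_xml as the request body (authentication is covered in the
# Troubleshooting Tools section).
```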

Hierarchical ACI Object Model and the Infrastructure

The Cisco ACI object model is organized around a hierarchical tree, called the distributed Management Information Tree (dMIT). The dMIT is the single source of truth in the object model; the APIC uses it to discover, manage, and maintain the whole hierarchical tree of objects in the ACI fabric, including their configuration, operational status, and accompanying statistics and associated faults.

As mentioned before, within the dMIT, the Application Profile is the modeled represen-
tation of an application, network characteristics and connections, services, and all the
relationships between all of these lower-level objects. These objects are instantiated as
Managed Objects (MO) and are stored in the dMIT in a hierarchical tree, as shown below:

All of the configurable elements shown in this diagram are represented as classes, and
the classes define the items that get instantiated as MOs, which are used to fully describe
the entity including its configuration, state, runtime data, description, referenced ob-
jects and lifecycle position.

Each node in the dMIT represents a managed object or group of objects. These objects
are organized in a hierarchical structure, similar to a structured file system with logical
object containers like folders. Every object has a parent, with the exception of the top
object, called “root”, which is the top of the tree. Relationships exist between objects in
the tree.

Objects include a class, which describes the type of object such as a port, module or net-
work path, VLAN, Bridge Domain, or endpoint group (EPG). Packages identify the func-
tional areas to which the objects belong. Classes are organized hierarchically so that, for
example, an access port is a subclass of the class Port, or a leaf node is a subclass of the
class Fabric Node.

Managed Objects can be referenced through relative names (Rn), which consist of a prefix matched with the name property of the object. As an example, the prefix for a Tenant is “tn”; if the name property were “Cisco”, the resulting Rn for the MO would be “tn-Cisco”.

Managed Objects can also be referenced via Distinguished Names (Dn), which is the combination of the scope of the MO and the Rn of the MO, as mentioned above. As an example, if there is a tenant named “Cisco” that is a policy object in the top level of the Policy Universe (polUni), that combines to give a Dn of “uni/tn-Cisco”. In general, the Dn is comparable to a fully qualified domain name.
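To make the Rn/Dn construction concrete, here is a minimal illustrative sketch in Python; the helper function names are our own invention, not part of any Cisco SDK.

```python
# Illustrative sketch of Rn/Dn construction (helper names are hypothetical).
# The Rn is a class-specific prefix plus the object's name property; the Dn
# prepends the scope of the parent, up to the root of the tree.

def tenant_rn(name: str) -> str:
    # Relative name of a Tenant: prefix "tn" plus the name property
    return "tn-" + name

def tenant_dn(name: str) -> str:
    # Distinguished name: parent scope (the Policy Universe, "uni") plus the Rn
    return "uni/" + tenant_rn(name)

print(tenant_rn("Cisco"))  # tn-Cisco
print(tenant_dn("Cisco"))  # uni/tn-Cisco
```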

Because of the hierarchical nature of the tree, and the attribute system used to identify
object classes, the tree can be queried in several ways for MO information. Queries can
be performed on an object itself through its DN, on a class of objects such as switch chas-
sis, or on a tree-level, discovering all members of an object.

The structure of the dMIT provides easy classification of all aspects of the relevant con-
figuration, as the application objects are organized into related classes, as well as hard-
ware objects and fabric objects into related classes that allow for easy reference, reading
and manipulation from individual object properties or multiple objects at a time by ref-
erence to a class. This allows configuration and management of multiple similar compo-
nents as efficiently as possible with a minimum of iterative static configuration.

Infrastructure as Objects

ACI uses a combination of Cisco Nexus 9000 Series Switch hardware and Application
Policy Infrastructure Controllers (APICs) for policy-based fabric configuration and man-
agement. These infrastructure components can be integrated with Cisco and third-party
service products to automatically provision end-to-end network solutions.

As shown in the following diagram, the logical policy model is built through manipula-
tion of the dMIT, either through direct GUI, programmatic API, or through traditional
CLI methods. Once the policy is built, the intention of the policy gets resolved into an
abstract model, then is conferred to the infrastructure elements. The infrastructure ele-
ments contain specific ASIC hardware that make them intelligent, purpose-built agents
of change that can understand the abstraction that the policy controller presents to it,
and automate the relevant concrete configuration based on the abstract model. This
configuration gets executed on connection of the Endpoint.

The purpose-built hardware providing the intelligent resolution of policy configuration is built on a spine-leaf architecture providing consistent network forwarding and deterministic latency. The hardware is also able to normalize the encapsulation coming in from the different types of endpoints regardless of connectivity.

If an endpoint connects to a fabric with an overlay encapsulation (such as VXLAN or NVGRE), uses physical port connectivity, or uses VLAN 802.1Q tagging, the fabric can accept that traffic, de-encapsulate it, then re-encapsulate it to VXLAN for fabric forwarding, then de-encapsulate and re-encapsulate to whatever the destination expects to see. This encapsulation normalization happens at hardware speeds in the fabric and creates no additional latency or software gateway penalty to perform the operation.

In this manner, if one VM is running on VMware ESX utilizing VXLAN, another VM is running on Hyper-V using NVGRE encapsulation, and a physical server is running a bare metal database workload on top of Linux, it is possible to configure policy to allow each of these to communicate directly with each other without having to bounce to any separate gateway function. That gateway function is performed in hardware by the nature of the normalization process.

This automated provisioning of end-to-end application-relevant policy provides consistent implementation of relevant connectivity, quality measures, and security requirements. This model is extensible and can be extended into compute and storage for complete application policy-based provisioning.

The automation of the configuration takes the logical model and translates it into other models, such as the resolved model and the concrete model. The automation process resolves configuration information into the object and class-based configuration elements that then get applied based on the object and class. As an example, if the system is applying a configuration to a port or a group of ports, the system would likely utilize a class-based identifier to apply configuration broadly without manual iteration. A class is used to identify objects like cards, ports, paths, etc.; port Ethernet 1/1 is a member of the class port, and a type of port configuration, such as an access or trunk port, is a subclass of port. A leaf node or a spine node is a subclass of fabric node, and so forth.

The types of objects and relationships of the different networking elements within the
policy model can be seen in the diagram below. Each of these elements can be managed
via the object model being manipulated through the APIC, and each element could be
directly manipulated via REST API.

Build object, use object to build policy, reuse policy

The inherent model of ACI is built on the premise of object structure, reference and re-
use. In order to build an AP, one must first create the building blocks with all the relevant
information for those objects. Once those are created, it is possible to build other objects
referencing the originally created objects as well as reuse other objects. As an example, it
is possible to build EPG objects, use those to build an AP object, and reuse the AP object
to deploy to different tenant implementations, such as a Development Environment AP,
a Test Environment AP, and a Production Environment AP.

REST API just exposes the object model

REST stands for Representational State Transfer, and is a reference model for direct object manipulation via HTTP protocol based operations.

The uniform ACI object model places clean boundaries between the different components that can be read or manipulated in the system. When an object exists in the tree, whether it is an object that was derived from discovery (such as a port or module) or from configuration (such as an EPG or policy graph), the object is exposed via the REST API through a Uniform Resource Identifier (URI).

The general structure of the REST API calls, along with two specific examples of what can be done with this structured URI, is shown below.
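As a sketch of that structure (the APIC host name and tenant name are placeholders):

```python
# General structure of an APIC REST API call:
#   http(s)://<apic-host>/api/{mo|class}/{dn|className}.{xml|json}[?options]

apic = "https://apic"  # placeholder APIC address

# Example 1: read a single managed object by its Dn (the tenant "Cisco")
mo_url = apic + "/api/mo/uni/tn-Cisco.json"

# Example 2: read every object of a class (all tenants, class fvTenant)
class_url = apic + "/api/class/fvTenant.json"
```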

Logical Model, Resolved Model, Concrete Model

Within the ACI object model, there are essentially three stages of implementation of the
model: the Logical Model, the Resolved Model, and the Concrete Model.

The Logical Model is the logical representation of the objects and their relationships. The
AP that was discussed previously is an expression of the logical model. This is the decla-
ration of the “end-state” expression that is desired when the elements of the application
are connected and the fabric is provisioned by the APIC, stated in high-level terms.

The Resolved Model is the abstract model expression that the APIC resolves from the
logical model. This is essentially the elemental configuration components that would be
delivered to the physical infrastructure when the policy must be executed (such as when
an endpoint connects to a leaf).

The Concrete Model is the actual in-state configuration delivered to each individual fab-
ric member based on the resolved model and the Endpoints attached to the fabric.

In general, the logical model should be the high-level expression of what exists in the
resolved model, which should be present on the concrete devices as the concrete model
expression. If there is any gap in these, there will be inconsistent configurations.

Formed and Unformed Relationships

In creating objects and forming their relationships within the ACI fabric, a relationship is expressed when an object is a provider of a service, and another object is a consumer of that provided service. If one side of the service is not connected, the relationship is considered unformed: a consumer that exists with no provider, or a provider that exists with no consumer, is an unformed relationship. If both a consumer and provider exist and are connected for a specific service, that relationship is fully formed.

Declarative End State and Promise Theory

For many years, infrastructure management has been built on a static and inflexible configuration paradigm. In terms of theory, traditional configuration via traditional methods (CLI configuration of each box individually), where configuration must be done on every device for every possibility of every thing prior to that thing connecting, is termed an Imperative Model of configuration. In this model, because configuration is built for every eventual possibility, the trend is to overbuild the infrastructure configuration by a fairly significant amount. When this is done, fragility and complexity increase with every eventuality included.

Similar to what is illustrated above, if configuration must be made on a single port for
an ESXi host, it must be configured to trunk all information for all of the possible VLANs
that might get used by a vSwitch or DVS on the host, whether or not a VM actually exists
on that host. On top of that, additional ACLs may need to be configured for all possible
entries on that port, VLAN or switch to allow/restrict traffic to/from the VMs that might
end up migrating to that host/segment/switch. That is a fairly heavyweight set of tasks
for just some portions of the infrastructure, and that continues to build as peripheral
aspects of this same problem are evaluated. As these configurations are built, hardware
resource tables are filled up even if they are not needed for actual forwarding. Also re-
flected are configurations on the service nodes for eventualities that can build and grow,
many times being added but rarely ever removed. This eventually can grow into a fairly
fragile state that might be considered a form of firewall house of cards. As these building
blocks are built up over time and a broader perspective is taken, it becomes difficult to
understand which ones can be removed without the whole stack tumbling down. This is
one of the possible things that can happen when things are built on an imperative model.

On the other hand, a declarative model allows a system to describe the “end-state” ex-
pectations of the system, and allows the system to utilize its knowledge of the inte-
grated hardware and automation tools to execute the required work to deliver the end
state. Imagine an infrastructure system where statements of desire can be made, such
as “these things should connect to those things and let them talk in this way”, and the
infrastructure converges on that desired end state. When that configuration is no longer
needed, the system knows this and removes that configuration.

Promise Theory is built on the principles that allow for systems to be designed based
on the declarative model. It’s built on voluntary execution by autonomous agents which
provide and consume services from one another based on promises.

As the IT industry continues to build and scale, information and systems are rapidly reaching breaking points where scaled-out infrastructure cannot stretch its hardware resources without violating the economic equilibrium, nor scale in its management without integrated agent-based automation. This is why a system such as ACI, built on promise theory, is purpose-built for addressing the scale problems that are delivery challenges with traditional models.

Troubleshooting Tools

This section is intended to provide an overview of the tools that could be used during
troubleshooting efforts on an ACI fabric. This is not intended to be a complete reference
list of all possible tools, but rather a high level list of the most common tools used.

APIC Access Methods

There are multiple ways to connect to and manage the ACI fabric and object model. An administrator can use the built-in Graphical User Interface (GUI), programmatic methods using an Application Programming Interface (API), or the standard Command Line Interface (CLI). While there are multiple ways to access the APIC, the APIC is still the single source of truth. All of these access methods - the GUI, CLI and REST API - are just interfaces resolving to the API, which is the abstraction of the object model managed by the Data Management Engine (DME).

GUI

One of the primary ways to configure, verify and monitor the ACI fabric is through the APIC GUI. The APIC GUI is a browser-based HTML5 application that provides a representation of the object model and is the most likely default interface that people will start with. The GUI is accessible through a browser at the URL https://<APIC IP>

The GUI does not expose the full underlying policy model. One of the available tools for browsing the MIT is called “visore” and is available on the APIC and fabric nodes. Visore supports querying by class and object, as well as easily navigating the hierarchy of the tree. Visore is accessible through a browser at the URL https://<APIC IP>/visore.html

API

The APIC supports REST API connections via HTTP/HTTPS for processing of XML/JSON documents for rapid configuration. The API can also be used to verify the configured policy on the system. This is covered in detail in the REST API chapter.

A common tool used to query the system is “Postman”, an app that runs in the Google Chrome™ web browser.
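The same queries can also be scripted. Below is a minimal sketch using the Python requests library against the documented aaaLogin and class-query endpoints; the APIC address and credentials are placeholders, and certificate verification is disabled only because lab APICs commonly use self-signed certificates.

```python
import requests

APIC = "https://apic"  # placeholder address and credentials
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}

session = requests.Session()

# Authenticate; the APIC returns a token that is carried as a session cookie
resp = session.post(APIC + "/api/aaaLogin.json", json=AUTH, verify=False)
resp.raise_for_status()

# Query the system-wide fault records (class faultInfo)
faults = session.get(APIC + "/api/class/faultInfo.json", verify=False).json()
print(faults["totalCount"])
```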

CLI

The CLI can be used to configure the APIC. It can be used extensively in troubleshooting the system, as it allows real-time visibility into the configuration, faults, and statistics of the system. Typically this is accessed via SSH with the appropriate administrative level credentials. The APIC CLI can also be accessed through the CIMC KVM (Cisco Integrated Management Controller Keyboard/Video/Mouse interface).

CLI access is also available for troubleshooting the fabric nodes either through SSH or
the console.

The APIC and fabric nodes are based on a Linux kernel but there are some ACI specific
commands and modes of access that will be used in this book.

CLI MODES:

APIC:

The APIC has fundamentally only one CLI access mode. The commands used in this book
are assuming admin level access to the APIC.

Fabric Node:

The switch running ACI software has several different modes that can be used to access
different levels of information on the system:
 
• CLI - The CLI will be used to run NX-OS and Bash shell commands to check the
concrete models on the switch. For example “show vlan”, “show endpoint”, etc. In
some documentation this may have been referred to as Bash, iBash, or iShell.
• vsh_lc - This is the line card shell and it will be used to check line card processes
and forwarding tables specific to the Application Leaf Engine (ALE) ASIC.
• Broadcom Shell - This shell is used to view information on the Broadcom ASIC.
The shell will not be covered as it falls outside the scope of this book as its
assumed troubleshooting at a Broadcom Shell level should be performed with
assistance of Cisco Technical Assistance Center (TAC).
• Virtual Shell (VSH) - Provides deprecated NX-OS CLI shell access to the switch.
This mode can provide output on a switch in ACI mode that could be inaccurate.
This mode is not recommended and not supported, and commands that provide
useful output should be available from the normal CLI access mode.

Navigating the CLI:

There are some common commands as well as some unique differences compared to NX-OS on a fabric node. On the APIC, the command structure has common commands as well as some unique differences compared to Linux Bash. This section presents a highlight of a few of these commands but is not meant to replace existing external documentation on Linux, Bash, ACI, and NX-OS.

Common Bash commands:


When using the CLI, some basic understanding of Linux and Bash is necessary. These
commands include:

• man – prints the online manual pages. For example, “man cd” will display what
the command “cd” does
• ls – list directory contents
• cd – change directory
• cat – print the contents of a file
• less – simple navigation tool for displaying the contents of a file
• grep – print lines matching a pattern from a file
• ps – show currently running processes; typically used with the options “ps -ef”
• netstat – display network connection status; “netstat -a” will display active
connections and the ports on which the system is listening
• ip route show – displays the kernel route table. This is useful on the APIC but
not on the fabric nodes
• pwd – print the current working directory

Common CLI commands:

Beyond the normal NX-OS commands on a fabric node, there are several more that are
specific commands to ACI. Some CLI commands referenced in this guide are listed below:
 
• acidiag – specifically, “acidiag avread” and “acidiag fnvread” are two common
commands to check the status of the controllers and the fabric nodes
• techsupport – CLI command to collect the techsupport files from the device
• attach – from the APIC, opens an SSH session to the named node. For example,
“attach rtp_leaf1”
• iping/itraceroute – fabric node commands used in place of ping/traceroute,
providing similar functionality against a fabric device address and VRF.
Note that the Bash ping and traceroute commands do work but are effective only
for switch OOB access.

Help:

When navigating around the APIC CLI, there are some differences when compared to
NX-OS.
 
• <ESC><ESC> - similar to NX-OS “?” to get a list of command options and
keywords
• <TAB> - autocomplete of the command. For example “show int<TAB>” will
complete to “show interface”
• man <command> - displays the manual and usage output for the command.

Programmatic Configuration (Python)

A popular modern programming language is Python, which provides simple object-oriented semantics in interpreted, easy-to-write code. The APIC can be configured with Python through the available APIC Software Development Kit (SDK) or via the REST API.
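As a brief sketch of what SDK-based access can look like, the example below assumes the ACI Python “Cobra” SDK that ships alongside the APIC; the module and method names follow the SDK's documented layout, but the APIC address and credentials are placeholders, so verify against the SDK documentation for your release.

```python
# Hedged sketch using the ACI "Cobra" SDK (assumed module/method names;
# address and credentials are placeholders).
import cobra.mit.access
import cobra.mit.session

ls = cobra.mit.session.LoginSession("https://apic", "admin", "password")
md = cobra.mit.access.MoDirectory(ls)
md.login()

# Look up a single MO by Dn, or all MOs of a class
tenant = md.lookupByDn("uni/tn-Cisco")
for t in md.lookupByClass("fvTenant"):
    print(t.dn, t.name)

md.logout()
```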

Fabric Node Access Methods

CLI

In general, most work within ACI will be done through the APIC using the access methods
listed above. There are, however, times in which one must directly access the individual
fabric nodes (switches). Fabric nodes can be accessed via SSH using the fabric adminis-
trative level credentials. The CLI is not used for configuration but is used extensively for
troubleshooting purposes. The fabric nodes have a Linux shell along with a CLI interpreter
to run “show” level commands. The CLI can be accessed through the console port as well.

Faults

The APICs automatically detect issues on the system and record these as faults. Faults are displayed in the GUI until the underlying issue is cleared. After faults are cleared, they are retained until they are acknowledged or until the retaining timer has expired. A fault is composed of system parameters, which are used to indicate the reason for the failure and where the fault is located. In some cases, fault messages link to help that explains possible actions.

Exporting information from the Fabric

Techsupport

The Techsupport files in ACI capture application logs, system and services logs, version
information, faults, event and audit logs, debug counters and other command output, then
bundle all of that into one file on the system. This is presented in a single compressed file
(tarball) that can be exported to an external location for off-system processing. Techsupport is similar to functionality available on other Cisco products that allows for a simple collection of copious amounts of relevant data from the system. This collection can be initiated through the GUI or through the CLI using the command “techsupport”.

Core Files

A process crash on the ACI fabric will generate a core file, which can be used to determine why the process crashed. This information can be exported from the APIC for decoding by Cisco support and engineering teams.

External Data Collection – Syslog, SNMP, Call-Home

A variety of external collectors can be configured to gather system data. The call-home feature can be configured to relay information via email through an SMTP server, either to a network engineer or to Cisco Smart Call Home to generate a case with the TAC.

Health Scores

The APIC manages and automates the underlying forwarding components and Layer 4 to
Layer 7 service devices. Using visibility into both the virtual and physical infrastructure, as
well as the knowledge of the application end-to-end based on the application profile, the
APIC can calculate an application health score. This health score represents the network
health of the application across virtual and physical resources, including Layer 4 to Layer
7 devices. The score includes failures, packet drops, and other indicators of system health.

The health score provides enhanced visibility on both application and tenant levels. The
health score can drive further value by being used to trigger automated events at specific
thresholds. This ability allows the network to respond automatically to application health
by making changes before users are impacted.
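As a hedged sketch, a health score can also be pulled programmatically; the rsp-subtree-include=health query option is part of the documented REST API, while the tenant name here is a placeholder and the session is assumed to be authenticated as in the earlier REST example.

```python
import requests

def tenant_health(session: requests.Session, apic: str, tenant: str) -> list:
    # rsp-subtree-include=health asks the APIC to return health records
    # alongside the queried object (documented REST query option)
    url = f"{apic}/api/mo/uni/tn-{tenant}.json?rsp-subtree-include=health"
    return session.get(url, verify=False).json()["imdata"]

# Usage with an authenticated session (see the REST API login sketch):
#   print(tenant_health(session, "https://apic", "Prod"))
```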

Atomic Counters

Atomic counters can be configured to monitor endpoint/EPG to endpoint/EPG traffic within a tenant for identifying and isolating traffic loss. Once configured, the packet counters on a configured policy are updated every 30 seconds. Atomic counters are valid when endpoints reside on different leaf nodes.

Troubleshooting Methodology

Overall Methodology

Troubleshooting is the systematic process used to identify the cause of a problem. The
problem to be addressed is determined by the difference between how some entity
(function, process, feature, etc.) should be working versus how it is working. Once the
cause is identified, the appropriate actions can be taken to either correct the issue or
mitigate the effects: the latter is sometimes referred to as a workaround.

Initial efforts in the process focus on understanding the issue more completely. Effective troubleshooting should be based on an evidence-driven method, rather than a symptomatic-level exploration. This can be done by asking the question:

“What evidence do we have...?”

The intent of this question is to move towards an observed factual evidence-driven meth-
od where the evidence is generally taken from the system where the problem is observed.

Troubleshooting is an iterative process attempting to isolate an issue to the point that some
action can be taken to have a positive effect. Often this is a multi-step process which moves
toward isolating the issue. For example, in deploying an application on a server attached
to an ACI fabric, a possible problem observed could be that the application does not seem
to respond from a client on the network. The isolation steps may look something like this:

Troubleshooting is usually not a simple linear path, and in this example it is possible that
a troubleshooter may have observed the system fault earlier in the process and started
at that stage.

In this example, information related to the problem came from several data points in the
system. These data points can be part of a linear causal process or can be used to better
understand the scope and various points and conditions that better define the issue.
How these data points are collected is defined by three characteristics:

• WHAT: What information is being collected
• WHERE: Where on the system the information is being collected
• HOW: The method used in collecting the information

For example, the state of a fabric Ethernet interface can be gathered through the CLI on the leaf in a couple of different ways, or from the APIC through either the GUI or a REST API call. When troubleshooting, it is important to understand where else relevant information is likely to come from to build a better picture of the issue.
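As an illustration of the same data point gathered two different ways, the sketch below reads interface state through the REST API by class and by Dn; l1PhysIf is the physical-interface class in the ACI model, while the pod, node, and port identifiers are placeholders, and an authenticated session is assumed as in the earlier REST example.

```python
import requests

def interface_state(session: requests.Session, apic: str):
    # All physical interfaces fabric-wide, queried by class
    by_class = session.get(apic + "/api/class/l1PhysIf.json",
                           verify=False).json()
    # One specific interface on one leaf, queried by Dn
    # (pod-1/node-101/eth1/1 are placeholder identifiers)
    dn = "topology/pod-1/node-101/sys/phys-[eth1/1]"
    by_dn = session.get(apic + "/api/mo/" + dn + ".json", verify=False).json()
    return by_class, by_dn
```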

Sample Reference
Topology

Physical Fabric Topology

For a consistent frame of reference, a sample reference topology has been deployed,
provisioned and used throughout the book. This ensures a consistent reference for the
different scenarios and troubleshooting exercises.

This section explores the different aspects of the reference topology, from the logi-
cal application view to the physical fabric and any supporting details that will be used
throughout the troubleshooting exercises. Each individual section will call out the spe-
cific components that have been focused on so the reader does not have to refer back to
this section in every exercise.

The topology includes a Cisco ACI fabric composed of three clustered Cisco APIC controllers, two Nexus 9500 spine switches, and three Nexus 9300 leaf switches. The APICs and Nexus 9000 switches are running the release that was current on www.cisco.com at the time of this book's initial version: APIC version 1.0(1k) and Nexus ACI-mode version 11.0(1d).

The fabric is connected to both external Layer 2 and Layer 3 networks. For the external
Layer 2 network, the connection used a pair of interfaces aggregated into a port channel
connecting to a pair of Cisco Nexus 7000 switches that are configured as a Virtual Port
Channel. The connection to the external Layer 3 network is individual links on each leaf
to each Nexus 7000.

Each leaf also contains a set of connections to host devices. Host devices include a di-
rectly connected Cisco UCS C-Series rack server and a UCS B-Series Blade Chassis con-
nected via a pair of fabric interconnects. All servers are running virtualization hyper-
visors. The blade servers are virtualized using VMware ESX, with some of the hosts as part of a VMware Distributed Virtual Switch, and others acting as virtual leaves that are part of the Cisco ACI Application Virtual Switch. The guest virtual machines on the hosts are a combination of traditional operating systems and virtual network appliances such as virtual firewalls and load balancers.

Logical Application Topology

To maintain a consistent reference throughout the book, an Application Profile (AP) was
built for a common 3-tier application that was used for every troubleshooting sample.
This AP represents the logical model of the configuration that was deployed to the fabric
infrastructure. The AP includes all the information required for application connectivity
and policy (QoS, security, SLAs, Layer 4-7 services, logging, etc.).

This particular AP is a logical model built on a common application found in data centers,
which includes a front-end web tier, a middleware application tier, and a back-end data-
base tier. As the diagram illustrates, the flow of traffic would be left to right, with client
connections coming in to the web tier, which communicates with the app tier, which
then communicates to the database tier, and returns right to left in reverse fashion.

Troubleshooting

Naming Conventions

Overview

Logical thinking and clear communication are the champions of the troubleshooting process. This chapter presents some recommended practices for naming managed objects in the ACI policy model to provide clean, logical organization and clarity in an object's reference. As the management information tree is navigated during policy creation or inspection (during troubleshooting), consistency and meaningful context become extremely helpful.

Effective troubleshooting in the ACI environment does require some knowledge of the ACI policy model. Many of the objects within this policy model have unbounded name fields that are left open to the administrator. Having a predefined and consistent methodology for deriving these object names can provide clarity, which greatly aids in configuration and, more importantly, troubleshooting. By using names that describe the purpose of an MO and its scope in the tree, the various MOs, their place in the tree, and their use and relationships can be identified at a glance. This process of descriptive naming is very helpful in delivering context at a glance.

A good example of this type of structured naming is configuring a VLAN pool to be used for an AVS deployment and naming it “VLANsForAVS”. The name identifies the object and what it is used for. Exploring this within the context of a real-life scenario is helpful. Take, for example, entering a troubleshooting engagement after the environment has been configured. Ideally, it would be possible to see an object when viewing the policy model through Visore or the CLI and know by the name “VLANsForAVS” that it is a VLAN pool for an AVS deployment. The benefits of this naming methodology are clear.

Suggested Naming Templates

Below are some suggested templates for naming the various MOs in the fabric with an
explanation of how each was constructed.

When naming Attachable Entity Profiles, it is good to define in the name the resource
type, and concatenate a suffix of -AEP behind it ([Resource-Type]-AEP). Examples:

• DVS-AEP describes an AEP for use with a VMware DVS
• UCS-AEP describes an AEP for use when connecting a Cisco Unified Computing System (UCS)
• L2Outxxx-AEP describes an AEP that gets used when connecting to an external device via an L2Out connection
• L3Outxxx-AEP describes an AEP that gets used when connecting to an external device via an L3Out connection
• vSwitch-AEP describes an AEP used when connecting to a standard vSwitch.

Contracts are used to describe communications that are allowed between EndPoint
Groups (EPGs). When creating contracts, it’s a good idea to reference which objects are
talking and the scope of relevance in the fabric. This could be defined in the format
of [SourceEPG]to[DestinationEPG]-[Scope]Con which includes the from and to EPGs as
well as the scope of application for the contract. Examples:

• WebToApp-GblCon describes a globally scoped contract object and is used to describe communications between the Web EPG and the App EPG.
• AppToDB-TnCon describes a contract scoped to a specific tenant that describes communications between the App EPG and the DB EPG within that specific tenant.

Contracts that deal with some explicit communications protocol function (such as allow-
ing ICMP or denying some specific protocol), can be given a name based on the explicit
reference and scope. This could be defined in the format of [ExplicitFunction]-[Scope]
Con which indicates the protocol or service to be allowed as well as the scope for
the placement of the contract. Some examples of explicit contracts might include the
following:

• ICMPAllow-CtxCon is a contract scoped to a specific context that allows ICMP
• HTTPDeny-ApCon is a contract scoped to an application that denies the HTTP protocol

Contracts are compound objects that reference other lower-level objects. A Contract
references a subject, which references a filter that has multiple filter entries. In order
to maintain good naming consistency, similar naming structure should be followed. A
Subject should keep to a naming convention like [RuleGroup]-[Direction]Sbj, as seen in
these examples:

• AppTraffic-BiSbj names a subject that defines bidirectional flows of a specific application's traffic
• WebTraffic-UniSbj names a subject that defines web traffic in a single direction

A Filter should have a naming convention structure of [ResourceName]-flt, such as:

• SQL-flt names a filter that contains entries to allow communications for an SQL server
• Exchange-flt names a filter that contains entries that allow communications for an Exchange server

and Filter Entries should follow a structure like [ResourceName]-[Service] such as:

• Exchange-HTTP  might be the name of a filter entry that allows HTTP service
connections to an exchange server (such as for OWA connections)
• VC-Mgmt might name a filter entry allowing management connections to a
VMware vCenter server
• SQL-CIMC  is a name for a filter entry that allows connections to the CIMC
interface on a SQL server running on a standalone Cisco UCS Rack mount server.

If all of these are put together the results might look like this:

AppToDB-TnCon references DBTraffic-BiSbj with filter SQL-flt, which has an entry such as SQL-data. This combined naming chain can almost be read as plain text: "this contract allows database traffic in both directions, filtered to only allow SQL data connections".
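
To make the chain concrete, below is a hypothetical XML sketch of how these objects could be posted through the REST API (vzBrCP, vzSubj, vzRsSubjFiltAtt, vzFilter, and vzEntry are the contract-related object classes; the TCP port 1433 used for the SQL-data entry is an assumption for illustration only):

<vzFilter name="SQL-flt">
  <vzEntry name="SQL-data" etherT="ip" prot="tcp" dFromPort="1433" dToPort="1433"/>
</vzFilter>
<vzBrCP name="AppToDB-TnCon" scope="tenant">
  <vzSubj name="DBTraffic-BiSbj" revFltPorts="yes">
    <vzRsSubjFiltAtt tnVzFilterName="SQL-flt"/>
  </vzSubj>
</vzBrCP>

Read top to bottom, the names alone describe what the policy permits.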

Interface policy naming should follow some characteristics, such as:

• Link Level structure - [Speed][Negotiation], e.g. 1GAuto, 10GAuto - directly describes the speed and negotiation mode.
• CDP Interface configuration policy - explicit naming: EnableCDP / DisableCDP
• LLDP Interface configuration policy - explicit naming: EnableLLDP / DisableLLDP

When grouping interface policies, it’s good to structure naming based on the interface
type and its use, like [InterfaceType]For[Resource-Type], such as:

• PCForDVS  names a policy that describes a portchannel used for the uplinks
from a DVS
• VPCForUCS  names a virtual portchannel for connecting to a set of UCS Fabric
Interconnects
• UplinkForvSwitch  names a single port link connecting to a standard vSwitch.
• PCForL3Out  names a portchannel connecting to an external L3 network

Interface profile naming should be relative to the profile's use, such as IntsFor[Resource-Type]. Examples:

• IntsForL3Outxxx names an interface profile for connecting to an external L3 network
• IntsForUCS names an interface profile for connecting to a UCS system
• IntsForDVS names an interface profile for connecting to a DVS running on a VMware host

Switch Selectors should be named using a structure like LeafsFor[Resource-Type]. Some examples:

• LeafsForUCS  switch selector policy to group leafs that are used for connecting UCS
• LeafsForDVS  switch selector policy that might be used to group leafs used for
DVS connections
• LeafsForL3Out  policy that might be used to group leafs for external L3
connections

And when creating VLAN pools, structure of VLANsFor[Resources-Type] could produce:

• VLANsForDVS names a VLAN pool for use with DVS-based endpoint connections
• VLANsForvSwitches names a VLAN pool for use with vSwitch-based endpoint connections
• VLANsForAVS names a VLAN pool for use with AVS-based endpoint connections
• VLANsForL2Outxxx names a VLAN pool used with L2 external connections.
• VLANsForL3Outxxx names a VLAN pool used with L3 external connections.
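
As a brief illustrative sketch, a pool following this convention could be created through the REST API with the fvnsVlanInstP class and a child encapsulation block (the allocation mode and VLAN range shown are arbitrary examples):

<fvnsVlanInstP name="VLANsForDVS" allocMode="dynamic">
  <fvnsEncapBlk from="vlan-1100" to="vlan-1199"/>
</fvnsVlanInstP>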

Initial Hardware Bringup

Overview

This section covers common issues seen when bringing up the initial hardware. The ACI fabric can be ordered in several different configurations. There is an option to purchase optical leafs (leafs with Small Form-factor Pluggable (SFP) interfaces), and when that is the case an optical Virtual Interface Card (VIC1225) must be used in the APIC. When a copper leaf is used, the copper VIC1225T must be used.

Initial cabling of the ACI fabric is very important and the following requirements must
be adhered to:

• Leafs can only be connected to spines. There should be no cabling between the leafs.
• Spines can only be connected to leafs. Spines cannot be inter-connected.
• An APIC must be attached to a leaf. APICs should be dual-homed (connected to two different leafs) for redundancy.
• All endpoints and L2, L3, and L4-L7 devices must connect to leafs. Nothing should be connected to spines other than leafs, as previously mentioned.

Problem Description

There are a few common issues that can be observed when initially bringing up a fabric.

Symptom 1

On the connection between the APIC and leaf, the APIC side is down (no lights) but the
leaf side has lights on.

Verification/Resolution

• The leaf showed the APIC as a LLDP neighbor (show lldp neighbors)
• The APIC did not show the leaf in the output of "acidiag fnvread"
• A physical examination of the setup shows:

• In the picture above a GLC-T transceiver was plugged into the APIC which has a
VIC1225 installed. This is an optical SFP+ Virtual Interface Card (VIC). The other end
of the connection is the 93128TX (copper) leaf.  
• The desired behavior was to convert optical to copper (media conversion).
• There are no transceivers qualified to do this sort of conversion. Optical-based VICs
need to be plugged into optical-based leafs, and copper based VICs need to be
plugged into copper-based leafs. 

Once the proper transceiver was used, and a leaf with copper ports was connected the
other end, the link came up properly and both the APIC and the leaf were able to share
LLDP as expected. Fabric discovery was able to continue as expected.

Symptom 2

The APIC does not see the leaf switch in the output of "acidiag fnvread" but the leaf does
see the APIC in the output of "show lldp neighbors".

Verification/Resolution

The VIC1225 and VIC1225T need to run a minimum firmware version to ensure that these VICs do not consume LLDP frames and prevent them from reaching the APIC. The minimum VIC firmware version that should be used is 2.2(1dS1). Once this version of VIC firmware is in place, the APIC can see the LLDP frames from the leaf and fabric discovery will complete.

Problem Description

As mentioned in the overview section of this chapter, cabling configurations are strictly
enforced. If a leaf is connected to another leaf, or a spine is connected to another spine, a
wiring mismatch will occur.

Symptom

Using the CLI interface on the leaf, execute the show interface command. The output of
the command will show the interface as “out-of-service”.

rtp_leaf1# show interface ethernet 1/16

Ethernet1/16 is up (out-of-service)

admin state is up, Dedicated Interface

  Hardware: 100/1000/10000/auto Ethernet, address: 88f0.31db.e800 (bia 88f0.31db.e800)

  [snip]

Verification

The “show lldp neighbors” output will identify that this leaf port is connected to another leaf port.

rtp_leaf1# show lldp neighbors

Capability codes:

 (R) Router, (B) Bridge, (T) Telephone, (C) DOCSIS Cable Device

 (W) WLAN Access Point, (P) Repeater, (S) Station, (O) Other

Device ID            Local Intf      Hold-time  Capability  Port ID 

RTP_Apic1             Eth1/1          120                   

RTP_Apic2             Eth1/2          120

rtp_leaf3.cisco.com   Eth1/16         120       BR        Eth1/16      


rtp_spine1.cisco.com  Eth1/49         120        BR          Eth3/1         

rtp_spine2.cisco.com  Eth1/50          120        BR          Eth4/1  

       

The following fault will be raised in the GUI under Fabric -> Inventory -> Pod_1 -> <leaf
node>

This same fault can also be viewed in the CLI.

admin@RTP_Apic1:if-[eth1--16]> faults

Severity  Code   Cause                 Ack  Last Transition      Dn                    

--------  -----  ----------------------  ---  -------------------  -----------------------

major     F0454  wiring-check-failed   no   2014-10-17 12:50:16  topology/pod-1/node-101/sys/lldp/inst/if-[eth1/16]/fault-F0454

Total : 1

The fault can also be viewed in the APIC CLI. The full path is shown below.

admin@RTP_Apic1:if-[eth1--16]> pwd

/home/admin/mit/topology/pod-1/node-101/sys/lldp/inst/if-[eth1--16] 

Resolution

The resolution for this problem is to correct the cabling misconfiguration. Note: The
same problem will be seen for spine cabling misconfiguration where a spine is cabled to
another spine.

Fabric Initialization

Overview

This chapter covers the discovery process for an ACI fabric, beginning with an overview
of the actions that happen and the verification steps used to confirm that a functioning
fabric exists. The displays have been captured from our reference topology working fab-
ric and can be used as an aid in troubleshooting issues where fabric nodes fail to join the
fabric.

In this discovery process, a fabric node is considered active when the APIC and node can
exchange heartbeats through the Intra-Fabric Messaging (IFM) process. The IFM process
is also used by the APIC to push policy to the fabric leaf nodes.

Fabric discovery happens in three stages. The leaf node directly connected to the APIC is
discovered in the first stage. The second stage of discovery brings in the spines connect-
ed to that initial seed leaf. Then the third stage processes the discovery of the other leaf
nodes and APICs in the cluster.

The diagram below illustrates the discovery process for switches that are directly con-
nected to the APIC. Coverage of specific verification for other parts of the process will be
presented later in the chapter.

The steps are:

• Link Layer Discovery Protocol (LLDP) Neighbor Discovery
• Tunnel End Point (TEP) IP address assignment to the node
• Node software upgraded if necessary
• Policy Element IFM Setup

Node status may fluctuate between several states during the fabric registration process.
The states are shown in the Fabric Node Vector table. The APIC CLI command to display this table is acidiag fnvread; sample output is shown further down in this section. Below is a description of each state.

States and descriptions:

• Unknown – Node discovered but no Node ID policy configured
• Undiscovered – Node ID configured but not yet discovered
• Discovering – Node discovered but IP not yet assigned
• Unsupported – Node is not a supported model
• Disabled – Node has been decommissioned
• Inactive – No IP connectivity
• Active – Node is active

During fabric registration and initialization a port might transition to an “out-of-service” state. Once a port has transitioned to an out-of-service status, only DHCP and CDP/LLDP protocols are allowed to be transmitted. Below is a description of each out-of-service issue that may be encountered:

• fabric-domain-mismatch – Adjacent node belongs to a different fabric
• ctrlr-uuid-mismatch – APIC UUID mismatch (duplicate APIC ID)
• wiring-mismatch – Invalid connection (Leaf to Leaf, Spine to non-leaf, Leaf fabric port to non-spine, etc.)
• adjacency-not-detected – No LLDP adjacency on fabric port

Ports can go out-of-service due to wiring issues. Wiring issues are reported through the lldpIf object; information on this object can be browsed at the following location in the MIT: /mit/sys/lldp/inst/if-[eth1/1]/summary.
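
As a quick sketch grounded in the object just shown, the wiringIssues field can be checked directly from the switch shell (the port number is a placeholder):

(none)# cat /mit/sys/lldp/inst/if-\[eth1--1\]/summary | grep wiringIssues

A non-empty value identifies the specific wiring problem detected on that port.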

Fabric Verification

This section illustrates some displays from the reference topology configured and in full
working order.

The first step is to verify that LLDP neighborship information has been exchanged. To verify LLDP information exchange, the command show lldp neighbors can be used. This command can be run on the APIC and executed on the nodes, or it can be run directly on the fabric nodes. The APIC runs Linux with a bash-based shell which, unlike typical IOS or NX-OS shells, is not sensitive to the question mark. In order to see all the command options, the APIC requires the entry of a special control sequence sent by pressing the escape key twice. This double escape sequence is the equivalent of the NX-OS/IOS contextual help function triggered when the question mark “?” is typed in the CLI. For example, the output below shows the result of typing show lldp neighbors <esc> <esc>:

admin@RTP_Apic1:~> show lldp neighbors 

 node        Fabric node              

 rtp_leaf1   Specify Fabric Node Name 

 rtp_leaf2   Specify Fabric Node Name 

 rtp_leaf3   Specify Fabric Node Name 

 rtp_spine1  Specify Fabric Node Name 

 rtp_spine2  Specify Fabric Node Name 

Based on the option provided in the contextual help output above, extending the command to show lldp neighbors node produces the following output:

admin@RTP_Apic1:~> show lldp neighbors node

 101  Specify Fabric Node id 

 102  Specify Fabric Node id 

 103  Specify Fabric Node id 

 201  Specify Fabric Node id 

 202  Specify Fabric Node id  

Executing the command show lldp neighbors rtp_leaf1 in the APIC CLI displays all the
LLDP Neighbors adjacent to “rtp_leaf1”. The output shows that this leaf is connected to
two different APICs and two spines.

admin@RTP_Apic1:~> show lldp neighbors rtp_leaf1

# Executing command: 'cat /aci/fabric/inventory/pod-1/rtp_leaf1/protocols/lldp/

neighbors/summary'

neighbors:

device-id       local-interface  hold-time  capability     port-id          

--------------  ---------------  ---------  -------------  -----------------

RTP_Apic1       eth1/1           120                       90:e2:ba:4b:fc:78

RTP_Apic2       eth1/2           120                       90:e2:ba:5a:9f:30

rtp_spine1      eth1/49          120        bridge,router  Eth3/1           

rtp_spine2      eth1/50          120        bridge,router  Eth4/1            

 

This command may also be run directly on the leaf as shown below:

rtp_leaf1# show lldp neighbors 

Capability codes:

  (R) Router, (B) Bridge, (T) Telephone, (C) DOCSIS Cable Device

  (W) WLAN Access Point, (P) Repeater, (S) Station, (O) Other

Device ID            Local Intf      Hold-time  Capability  Port ID  

RTP_Apic1             Eth1/1          120                    90:e2:ba:4b:fc:78 

RTP_Apic2             Eth1/2          120                    90:e2:ba:5a:9f:30 

rtp_spine1            Eth1/49         120        BR          Eth3/1          

rtp_spine2            Eth1/50         120        BR          Eth4/1          

When the command acidiag fnvread is run in the APIC CLI, it can be used to verify the
Fabric Node Vector (FNV) that is exchanged using LLDP. This is the quickest way to de-
termine if each node is active, and a TEP address has been assigned.

admin@RTP_Apic1:~> acidiag fnvread

ID Name Serial Number IP Address Role State LastUpdMsgId

--------------------------------------------------------------------------

101 rtp_leaf1 SAL1819SAN6  172.16.136.95/32  leaf  active  0

102 rtp_leaf2 SAL172682S0 172.16.136.91/32 leaf active 0

103 rtp_leaf3 SAL1802KLJF 172.16.136.92/32 leaf active 0

201 rtp_spine1 FGE173400H2 172.16.136.93/32 spine active 0

202 rtp_spine2 FGE173400H7 172.16.136.94/32 spine active 0

Total 5 nodes

When the command acidiag avread is run in the APIC CLI, it can be used to verify the
Appliance Vector (AV) that is exchanged using LLDP. This is the best way to determine
the APICs are all part of one clustered fabric. This command also helps to verify that the
TEP address is assigned, the appliance is commissioned, registered, and active, and the
health is equal to 255 which signifies the appliance is "fully fit".

admin@RTP_Apic1:~> acidiag avread

Local appliance ID=1 ADDRESS=172.16.0.1 TEP ADDRESS=172.16.0.0/16 CHASSIS_ID=a5945f3c-53c8-11e4-bde2-ebe6f6cfeb58

Cluster of 3 lm(t):1(2014-10-14T20:04:46.691+00:00) appliances (out of targeted 3 lm(t):3(2014-10-14T20:05:22.567+00:00)) with FABRIC_DOMAIN name=RTP_Fabric set to version=1.0(1k) lm(t):3(2014-10-14T20:05:23.486+00:00)

    appliance id=1 last mutated at 2014-10-14T17:36:51.734+00:00 address=172.16.0.1 tep address=172.16.0.0/16 oob address=10.122.254.211/24 version=1.0(1k) lm(t):1(2014-10-14T20:12:28.291+00:00) chassisId=a5945f3c-53c8-11e4-bde2-ebe6f6cfeb58 lm(t):1(2014-10-14T20:12:28.291+00:00) commissioned=1 registered=1 active=yes(zeroTime) health=(applnc:255 lm(t):1(2014-10-14T20:13:30.052+00:00) svc's)

    appliance id=2 last mutated at 2014-10-14T19:55:24.356+00:00 address=172.16.0.2 tep address=172.16.0.0/16 oob address=10.122.254.212/24 version=1.0(1k) lm(t):2(2014-10-14T20:12:28.571+00:00) chassisId=f56e0130-53db-11e4-ba9f-83158a2b73fa lm(t):2(2014-10-14T20:12:28.571+00:00) commissioned=1 registered=1 active=yes(2014-10-14T19:55:24.357+00:00) health=(applnc:255 lm(t):2(2014-10-14T20:13:30.084+00:00) svc's)

    appliance id=3 last mutated at 2014-10-14T20:04:46.922+00:00 address=172.16.0.3 tep address=172.16.0.0/16 oob address=10.122.254.213/24 version=1.0(1k) lm(t):3(2014-10-14T20:12:28.493+00:00) chassisId=2e7f7a70-53dd-11e4-a8f2-5d5876c67adc lm(t):3(2014-10-14T20:12:28.493+00:00) commissioned=1 registered=1 active=yes(2014-10-14T20:12:28.179+00:00) health=(applnc:255 lm(t):3(2014-10-14T20:13:29.757+00:00) svc's)

clusterTime=<diff=0 common=2014-10-14T20:24:47.810+00:00 local=2014-10-14T20:24:47.810+00:00 pF=<displForm=0 offsSt=0 offsVlu=0 lm(t):3(2014-10-14T20:05:23.096+00:00)>>

 

This same information can also be verified using the ACI GUI. The capture below shows
the APIC cluster health screen.

The capture below displays the overall fabric topology. When fully discovered, each node
should be visible under the Pod1 folder.

Problem Description

During fabric discovery, issues may be encountered when a leaf or spine does not join
the ACI fabric due to issues that were mentioned in the overview section of this chapter. 

Symptom 1

The leaf or spine does not show up in fabric membership GUI.

Verification

1. Check the power status of switches and ensure they are powered on. Use the
locator LED to identify if each switch is in a healthy state.
2. Check the cabling between switches. Example: Leaf should only be connected to
Spine and APIC. Spine should only be connected to Leaves.
3. Use console cables to access the device and verify whether the device is at the loader> prompt or the (none) prompt.
3.1 When using the console connection, if the device displays the loader> prompt, the switch is in a state where it did not load the ACI switch software image. Please refer to the ‘ACI Fabric Node and Process Crash Troubleshooting’ chapter of this document, which explains how to recover from the loader prompt.

3.2 When using the console connection, if the device displays the (none) login:
prompt, enter “admin” then hit the Enter key to access the CLI. The following
message should appear on the screen:

User Access Verification

(none) login: admin

****************************************************************************

     Fabric discovery in progress, show commands are not fully functional

     Logout and Login after discovery to continue to use show commands.

****************************************************************************

(none)# 

Use the command show lldp neighbor to verify if the Leaf is connected to the spine or
APIC. If this is a spine, it should be connected to the leaves.

(none)# show lldp neighbor

Capability codes:

  (R) Router, (B) Bridge, (T) Telephone, (C) DOCSIS Cable Device

  (W) WLAN Access Point, (P) Repeater, (S) Station, (O) Other

Device ID            Local Intf      Hold-time  Capability  Port ID 

RTP_Apic1            Eth1/1          120                    90:e2:ba:4b:fc:78

...

switch               Eth1/49         120        BR          Eth3/1         

switch               Eth1/50         120        BR          Eth4/1         

Total entries displayed: 14

If presented with the (none)# prompt, use the command show interface brief to verify what state the interfaces are in.

(none)# show interface brief

--------------------------------------------------------------------------------

Port   VRF          Status IP Address                              Speed     MTU

--------------------------------------------------------------------------------

mgmt0  --           up                                             1000      9000    

--------------------------------------------------------------------------------

Ethernet      VLAN   Type Mode    Status Reason                   Speed     Port

Interface                                                                    Ch #

--------------------------------------------------------------------------------

Eth1/1        0    eth  trunk  up    out-of-service        10G(D)   --

...

Eth1/47       0       eth  trunk   down   sfp-missing              10G(D)    --

Eth1/48       0       eth  trunk   up     out-of-service           10G(D)    --

Eth1/49       --      eth  routed  up     none                     40G(D)    --

Eth1/49.1     2       eth  routed  up     none                     40G(D)    --

Eth1/50       --      eth  routed  up     none                     40G(D)    --

Eth1/50.2     2       eth  routed  up     none                     40G(D)    --

...

Alternatively, this information can also be found with the command cat /mit/sys/lldp/
inst/if-\[eth1--<PORT NUMBER>\]/summary in the (none)# prompt to verify if there
is any wiring issue: 

(none)# cat /mit/sys/lldp/inst/if-\[eth1--60\]/summary

# LLDP Interface

id           : eth1/60

adminRxSt    : enabled

adminSt      : enabled

adminTxSt    : enabled

childAction  :

descr        :

dn           : sys/lldp/inst/if-[eth1/60]

lcOwn        : local

mac          : 7C:69:F6:0F:EA:EF

modTs        : 2014-10-13T20:44:37.182+00:00

monPolDn     : uni/fabric/monfab-default

name         :

operRxSt     : enabled

operTxSt     : enabled

portDesc     : topology/pod-1/paths-0/pathep-[eth1/60]

portVlan     : unspecified

rn           : if-[eth1/60]

status       :

sysDesc      :

wiringIssues :

Symptom 2

In the Fabric Membership view, no TEP IP addresses are assigned to the leaf or spine, and the node has a status of "unsupported" and a role of "unknown" listed under fabric membership.

Verification

1. If the switch has "unsupported" for its state, the device model (part number) is not supported by the current APIC version. The command "acidiag fnvread" in the APIC CLI will help to verify all nodes in the fabric. The device model or part number must match the catalog's supported hardware. The command cat /mit/uni/fabric/compcat-default/swhw-*/summary | grep model on the switch can be used to verify the catalog's supported hardware:

admin@RTP_Apic1:~> acidiag fnvread

ID      Name    Serial Number      IP Address    Role        State   LastUpdMsgId

------------------------------------------------------------------------------------

0        SAL12341234        0.0.0.0  unknown     unsupported   0

(none)# cat /mit/uni/fabric/compcat-default/swhw-*/summary | grep model

model        : N9K-C9336PQ

model        : N9K-C9508

model        : N9K-C9396PX

model        : N9K-C93128TX

(none)#

If the switch state shows “unknown” with the acidiag fnvread command in the APIC CLI,
there are a few possible causes for this switch state:

admin@RTP_Apic1:~> acidiag fnvread

ID       Name    Serial Number      IP Address    Role        State   LastUpdMsgId

-----------------------------------------------------------------------------------

0                SAL1819SAN6        0.0.0.0 unknown     unknown   0

 
• The Node ID policy has not been posted to the APIC, or the switch has not been provisioned via the APIC GUI with the device’s specific serial number.
• If the REST API was used to post the Node ID policy to the APIC, the serial number
that was posted to the APIC doesn’t match the actual serial number of the device.
The following switch CLI command can verify the serial number of the device:
 
(none)# cat /mit/sys/summary | grep serial

serial       : SAL1819SAN6

 

Symptom 3

The leaf or spine is not discovered in the “Pod” folder in the GUI.

Verification

Use the cat /mit/sys/summary CLI command to verify the state of the leaf or spine:

leaf101# cat /mit/sys/summary

# System

address      : 0.0.0.0

childAction  :

currentTime  : 2014-10-14T18:14:26.861+00:00

dn           : sys

fabricId     : 1

fabricMAC    : 00:22:BD:F8:19:FF

id           : 0

inbMgmtAddr  : 0.0.0.0

lcOwn        : local

modTs        : 2014-10-13T20:43:50.056+00:00

mode         : unspecified

monPolDn     : uni/fabric/monfab-default

name         :

oobMgmtAddr  : 0.0.0.0

podId        : 1

rn           : sys

role         : leaf

serial       : SAL1819SAN6

state       : out-of-service

status       :

systemUpTime : 00:21:31:39.000

If the state from the cat /mit/sys/summary CLI command shows in-service, then the TEP IP address listed under the “address” field of the CLI output should be pingable. If the switch’s TEP address is not reachable from the APIC, a possible cause could be a switch certificate issue. Verify that the switch is able to communicate with the APIC via TCP port 12183:

leaf101# netstat -a |grep 12183

tcp        0      0 leaf101:12183      *:*                     LISTEN     

tcp        0      0 leaf101:12183      apic2:43371             ESTABLISHED

tcp        0      0 leaf101:12183      apic1:49862             ESTABLISHED

tcp        0      0 leaf101:12183      apic3:42332             ESTABLISHED

If the switch is listening on TCP port 12183 but there are no established sessions, assuming
that IP connectivity between the switch and APIC has been confirmed with ping test, verify
SSL communication with the command cat /tmp/logs/svc_ifc_policyelem.log | grep SSL.

leaf101# cat /tmp/logs/svc_ifc_policyelem.log | grep SSL

3952||14-08-02 21:06:53.875-08:00||ifm||DBG4||co=ifm||incoming connection established from 10.0.0.1:52038||../dme/common/src/ifm/./ServerEventHandler.cc||42

3952||14-08-02 21:06:53.931-08:00||ifm||DBG4||co=ifm||openssl error during SSL_accept()||../dme/common/src/ifm/./IFMSSL.cc||185

3952||14-08-02 21:06:53.931-08:00||ifm||DBG4||co=ifm||openssl: error:14094415:SSL routines:SSL3_READ_BYTES:sslv3 alert certificate expired||../dme/common/src/ifm/./IFMSSL.cc||198

3952||14-08-02 21:06:53.931-08:00||ifm||DBG3||co=ifm||incoming connection to peer terminated (protocol error)||../dme/common/src/ifm/./Peer.cc||227

If this scenario is encountered, contact the Cisco Technical Assistance Center.

1. If the state from the cat /mit/sys/summary CLI output shows out-of-service, re-verify by going back through Symptom 1’s Verification steps.
2. If the state from the cat /mit/sys/summary CLI output shows invalid-ver, verify the “Firmware Default Policy” via the APIC GUI.

Application Policy Infrastructure Controller (APIC) High Availability and Clustering

Overview

This chapter covers the APIC clustering operations. Clustering provides high availability, data protection, and scaling through distributed data storage and processing across the APIC controllers. While every unit of data in the object model is handled by a single controller, each unit is replicated 3 times across the cluster to other controller nodes; when the cluster size is larger than 3, a given unit is not replicated to every controller. Clustering makes the system highly resilient to process crashes and corrupted databases by eliminating single points of failure.

The APIC process that handles clustering is the Appliance Director process. The Appliance Director process runs in every controller and is specifically in charge of synchronizing information across all nodes in the cluster. While Appliance Director is in charge of performing periodic heartbeats to track the availability of other controllers, the actual replication of data is done by each respective service independently. For example, Policy Manager on one controller is in charge of replicating its data to the Policy Manager instances on other controllers. Appliance Director only tells each service which controllers hold its replicas.

Each controller node in the cluster is uniquely identified by an ID. This ID is configured
by the administrator at the time of initial configuration.

Cluster Formation

The following list of necessary conditions has to be met for successful cluster formation:

1. Candidate APICs must have been configured with matching admin user credentials to be part of the cluster.
2. When adding controller nodes to the cluster, the administratively configured cluster size must not be exceeded.
3. When a new node is added, its specified cluster size must match the configured cluster size on all other nodes in the cluster.
4. Every node must have connectivity to all other nodes in the cluster.
5. There must be a data exchange between reachable controller pairs.

In our sample reference topology, 3 controllers are being used, namely APIC1, APIC2, and
APIC3. The process flow for forming the cluster is as follows:

APIC1 enters a state where the status shows an operational cluster size of 1 and the con-
troller’s health of “fully-fit” in the GUI. During the setup script, the administrative cluster
size was configured as 3. Once the fabric discovery has converged to the point where
APIC1 has formed relationships with the fabric node switches, providing connectivity to
APIC2, APIC1 and APIC2 will establish a data exchange and APIC2 will start sending heart-
beats. Once APIC1 receives heartbeats from APIC2, APIC1 increments the operational
cluster size to 2 and allows APIC2 to join the cluster. The discovery process continues
and detects a third controller, and the cluster operational size is incremented again by a
value of 1. This process of fabric and controller discovery continues until the operational
cluster size reaches the configured administrative cluster size. In our reference topolo-
gy, the administrative cluster size is 3 and when APIC3 joins the cluster, the operational
cluster size is 3, and the cluster formation is complete.

Majority and Minority - Handling Clustering Split Brains

Due to the fundamental functionality of data replication, any ACI fabric has a minimum
supported APIC cluster size of 3. With clustering operations, APICs leverage the concept
of majority and minority. Majority and minority are used to resolve potential split brain
scenarios. In a case where split brain has occurred, 2 APICs, such as APIC1 and APIC2,
can communicate with each other but not with APIC3, and APIC3 is not able to com-
municate with either APIC1 or APIC2. Since there were an odd number of controllers to
start with, APIC1 and APIC2 are considered to be the majority and APIC3 is the minority.
If there were an even number of APIC controllers to start with, it would be more difficult to resolve which side is the majority and which the minority.

When an APIC controller loses network connectivity to the other controllers, it transitions into a minority state, while the controllers that can still reach each other represent a majority. In a minority state, the APIC enters a read-only mode where no configuration changes are allowed. No incoming updates from any of the fabric switch nodes are handled by the minority controller(s) and, if any VMM integration exists, incoming updates from hypervisors are ignored.

While an APIC remains in the minority state, read requests will be allowed but will return data with an indication that the data could be stale.

In the scenario of a loss of network connectivity resulting in the partitioning of the fabric and an APIC in the minority state, if an endpoint attaches to a leaf managed by an APIC in the minority state, the leaf will download and instantiate a potentially stale policy from the minority controller. Once all controllers regain connectivity to each other and the split brain condition has been resolved, if a more recent or updated copy of the policy exists on the majority controllers, the leaf will download and update the policy accordingly.

Problem Description

When adding or replacing an APIC within an existing cluster, an issue can potentially be encountered where the APIC is not able to join the existing APIC cluster.

Symptom

During fabric bring up or expansion, APIC1 is the only controller online, and APIC3 is
being inserted before APIC2, therefore APIC3 will not join the fabric.

Verification

Under System -> Controllers -> Faults, verify the existence of the following fault:

The fault message indicates that APIC3 cannot join the fabric before APIC2 has already
joined the fabric. The problem can be resolved when APIC2 is brought up before APIC3.

Problem Description

Policy changes are not allowed on APIC1 even though APIC1 is healthy and fully fit.

Symptom

APIC2 and APIC3 are not functional (shutdown or disconnected) while APIC1 is fully
functional.

Verification

Under System -> Controllers -> Cluster, APIC2 and APIC3 have an operational status of "Unavailable".

When trying to create a new policy, the following status message is seen:

These symptoms indicate that APIC1 is in the minority state: it believes that APIC2 and APIC3 are still online, but it has lost connectivity to both of these APICs via the infrastructure VLAN.

One of the missing APICs, APIC2 or APIC3, needs to be powered up to resolve this error. For example, when APIC1 and APIC3 become part of the cluster again, APIC1 and APIC3 will be in the majority state while APIC2 (still offline) will be in the minority state.

Problem Description

Cluster-related faults are designed to provide diagnostic information sufficient to correct detected faulty conditions. There are 2 major groups of faults: faults related to messages which are discarded by Appliance Director on the receiving APIC, and faults related to cluster geometry changes.

Messages discarded by the Appliance Director process running on the receiving APIC are examined from the following two perspectives:

1. Is this message from a cluster peer?
2. If not, is it from an APIC which might be considered as a candidate for cluster expansion?

Consequently, there will be an attempt to raise one of two faults (F1370 and F1410) if the received message fails either check and is discarded by the recipient.

There has to be a continuous stream of similar messages arriving over a period of time
for a fault to be raised. Once the fault is raised, it contains information about the APIC ex-
periencing the failure, including the serial number, cluster ID, and time when the stream
of similar messages started to arrive.

Symptom for Fault 1370

A fault code of 1370 has been raised.

Verification

A fault code of 1370 with a reason of "operational-cluster-size-distance-cannot-be-bridged" will be raised if the APIC trying to join has an OperationalClusterSize that deviates from the cluster’s OperationalClusterSize by more than 1. For instance, a cluster with an OperationalClusterSize equal to 3 will not accept an APIC as an addition or replacement which claims an OperationalClusterSize equal to 5.

Resolution

Change the operational cluster size on the new APIC to match, or be only 1 greater than, what is configured on the current fabric.

Verification

A fault code of 1370 with a reason of “source-has-mismatched-target-chassis-id” will be raised when the transmitting APIC sends a message in which the ChassisID recorded for the receiving APIC does not match the current ChassisID of the recipient. For example, APIC1 and APIC2 have discovered each other, and then APIC1 is clean restarted, which results in a change of its ChassisID. It might discover APIC2 again, but APIC2 has the old notion of APIC1. In essence, there are two clusters running at that moment: the first contains APIC2 and the unavailable “old” APIC1; the other contains the “new” APIC1 and a discovered but unavailable APIC2. Therefore APIC1 will not accept messages from APIC2.

Resolution

The corrective action is to decommission APIC1 from APIC2 and commission it back. The two clusters will then be able to merge.

Verification

A fault code of 1370 with a reason of "source-id-is-outside-operational-cluster-size" is raised when the transmitting APIC has a cluster ID which doesn’t fit into a cluster with the current OperationalClusterSize.

Resolution

Change the cluster ID to be within the range of the defined cluster size. The chosen cluster ID should be 1 greater than the current operational cluster size. It may be required to grow the cluster.

Verification

A fault code of 1370 with a reason of "source-is-not-commissioned" is raised when the transmitting APIC has a cluster ID which is currently decommissioned in the cluster.

Resolution

Commission the APIC.



Verification

A fault code of 1370 with a reason of "fabric-domain-mismatch" is raised when the transmitting APIC has a FabricID which is different from the FabricID of the formed cluster.

Resolution

Run erase-config setup and set the correct FabricID on the APIC.

Verification

A fault code of 1370 with a reason of "source-cluster-id-illegal" is raised when the transmitting APIC has a cluster ID which is illegal. For instance, it might be outside the boundaries of acceptable cluster IDs.

Resolution

Change the cluster ID to be within the range of the defined cluster size. The chosen cluster ID should be N+1, where N is the current OperationalClusterSize.

Verification

A fault code of 1370 with a reason of "source-chassis-id-mismatch" is raised when the transmitting APIC has a ChassisID which is different from the ChassisID registered for the same cluster ID in the cluster. Note that a clean restart of an APIC changes its ChassisID. If the previous ChassisID has been learnt by the cluster, messages from the same APIC with the same cluster ID will be discarded.

Resolution

Decommission and commission the APIC in question.

Verification

A fault code of 1370 with a reason of "expansion-contender-message-is-not-heartbeat" is raised when a transmitting APIC that is being considered as a candidate for cluster expansion is expected to transmit continuous heartbeats only. This fault is raised if anything other than heartbeats is transmitted.

Resolution

Run erase-config setup and reinitialize the APIC. If the problem persists, contact the Cisco Technical Assistance Center.

Verification

A fault code of 1370 with a reason of "expansion-contender-id-is-not-next-to-oper-cluster-size" is raised if the transmitting APIC cannot be considered as a candidate for cluster expansion since its cluster ID is not N+1, where N is the current OperationalClusterSize.

Resolution

Change the cluster ID to be within the range of the defined cluster size. The chosen
cluster ID should be N+1 where N is current OperationalClusterSize. It may be required
to grow the cluster.

Verification

A fault code of 1370 with a reason of "expansion-contender-fabric-domain-mismatch" is raised if the transmitting APIC cannot be considered as a candidate for cluster expansion since its FabricID is different from the FabricID of the formed cluster.

Resolution

Run erase-config setup and set the correct FabricID on the APIC.

Verification

A fault code of 1370 with a reason of "expansion-contender-chassis-id-mismatch" is raised when the transmitting APIC cannot be considered as a candidate for cluster expansion since its ChassisID is different from the ChassisID already learned by the cluster being expanded. A probable scenario: there are two APICs claiming the same cluster ID N+1, where N is the current OperationalClusterSize. Once the cluster learns about either of the two, messages from the other one are rejected.

Resolution

Run erase-config setup and set the correct FabricID on the APIC.

Symptom for Fault 1410

There are several 1410 faults related to cluster geometry changes. There are various conditions which may cause a cluster to fail to increment its OperationalClusterSize towards the AdministrativeClusterSize.

Verification

A fault code of 1410 with a reason of "no-expansion-contender" is raised when a cluster cannot increase its OperationalClusterSize from N to N+1 since no APIC is detected with cluster ID=N+1.

Resolution

Change the cluster ID to be within the range of the defined cluster size. The chosen cluster ID should be 1 greater than the current cluster size.

Verification

A fault code of 1410 with a reason of "most-right-appliance-remains-commissioned" is raised when the cluster cannot decrease its OperationalClusterSize from N to N-1 since the APIC with cluster ID=N remains commissioned.

Resolution 

Proceed with caution. Verify that the intent is indeed to reduce the cluster size, and if so
decommission the APIC(s) with IDs outside the desired cluster size range.

Verification

A fault code of 1410 with a reason of "unavailable-appliance-carrying-replica-related-to-relocation" is raised when a replica to be relocated has a copy located on an APIC which is currently unavailable.

Resolution

Corrective action is to recover the lost APIC node.

Verification

A fault code of 1410 with a reason of "service-down-on-appliance-carrying-replica-related-to-relocation" is raised when a replica to be relocated has a copy under the supervision of a failed service. This fault is generated if many replicas of the same service on the same APIC are not in the UP state.

Resolution

Contact the Cisco Technical Assistance Center.

Verification

A fault code of 1410 with a reason of "unhealthy-replica-related-to-relocation" is raised if a replica to be relocated has a copy on another appliance which is not healthy.

Resolution

Corrective action is to eliminate the root cause of replica failure. Contact the Cisco Tech-
nical Assistance Center.

Verification

A fault code of 1410 with a reason of "cluster-is-stuck-at-size-2" is raised if the OperationalClusterSize remains at 2 for too long. It reminds the operator that the AdministrativeClusterSize cannot be set to 2, and it is preferable to avoid keeping OperationalClusterSize=2.

Resolution

Recover any failed APICs to bring the cluster size up to 3 or more, or add an additional controller, as running a cluster of two is an unsupported configuration.

Firmware and Image Management

Overview

This chapter covers firmware and image management for the ACI fabric hardware com-
ponents. It will cover the overview of objects and policies that make up firmware and im-
age management in the context of software upgrades, followed by the verification steps
used to confirm a successful upgrade process.

APIC Controller and Switch Software

There are three types of software images in the fabric that can be upgraded:

1. The APIC controller software image.
2. The switch software image — the software running on the leafs and spines of the ACI fabric.
3. The Catalog image — the catalog contains information about the capabilities of different models of hardware supported in the fabric, compatibility across different versions of software, and hardware and diagnostic utilities. The Catalog image is implicitly upgraded with the controller image. Occasionally, it may be required to upgrade the Catalog image only, to include newly qualified hardware components into the fabric or add new diagnostic utilities.

It is recommended that the fabric is upgraded in the sequence of APIC controller soft-
ware image first, followed by the switch software images for all the spine and leaf switch-
es in the fabric.

Firmware Management

There are five components within the context of firmware management:

1. Firmware Repository is used to store images that have been downloaded to the APIC. Images are transferred into the firmware repository from external source locations over the HTTP or SCP protocols. The source locations are configurable via the firmware source policy. Once an image has been copied from its source location, it is replicated across all controllers within the cluster. The switch nodes retrieve images from the controller as required at the beginning of the upgrade process.
2. Firmware Policy is the policy which specifies the desired firmware image version.
3. Firmware Group is the configured group of nodes that share the same firmware policy.
4. Maintenance Policy is the policy which specifies a schedule for upgrade.
5. Maintenance Group is the group of nodes that share the same maintenance policy.

By default, all controllers are part of a predefined firmware group and a predefined main-
tenance group. Membership within the firmware and maintenance groups is not mod-
ifiable. However, both the controller firmware policy and the controller maintenance
policy are modifiable to select a desired version to upgrade to.

Before the administrator can upgrade switches, a firmware group must be created for all the switches, and one or more maintenance groups should be created to contain all the switches within the ACI fabric.
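
As an illustrative sketch only, a firmware group covering a set of leaf nodes could be posted through the REST API along the following lines (the group name, policy name, and node ID range are assumptions; firmwareFwGrp, firmwareRsFwgrpp, and fabricNodeBlk are the relevant object classes):

<firmwareFwGrp name="LeafFwGrp" type="range">
  <firmwareRsFwgrpp tnFirmwareFwPName="LeafFwP"/>
  <fabricNodeBlk name="Leafs101-103" from_="101" to_="103"/>
</firmwareFwGrp>

A maintenance group referencing a maintenance policy is constructed in the same fashion.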

Compatibility Check 

The ACI fabric can have up to 3 different versions of compatible switch software images
to be simultaneously active in a fabric. There are three different levels of “compatibility”
checks:

1. Image level compatibility - Controllers use the Catalog image to check for
compatibility across software images that can interoperate in the fabric. The
controller will ensure image compatibility is satisfied before allowing for upgrade
and downgrade.

2. Card level compatibility - Within a spine modular chassis, the supervisor soft-
ware must be compatible with line card, fabric card and system controller
software. Similarly all the connected FEXes within the leaf switch must be
compatible with software running in the leaf. If a card or a FEX connected to
the system contains incompatible software with the supervisor module of a
spine or a leaf, the supervisor module of the spine or the leaf will ensure
compatibility by pushing down a compatible version of card or FEX software.  
3. Feature level compatibility - Given a set of disparate image versions running in the fabric, these images may have image level compatibility and can be simultaneously available within the fabric. However, they may not support the same set of software features. As a result, feature and hardware level compatibility is encoded in the object model such that the controller can identify feature incompatibility at the point of configuration by the administrator. The administrator will be prompted, or the configuration will result in a failure, when enabling such features in a mixed hardware and software version environment.

Firmware Upgrade Verification

Once a controller image is upgraded, the controller will disconnect itself from the cluster and reboot with the newer version while the other APIC controllers in the cluster are still operational. Once the controller is rebooted, it joins the cluster again. The cluster then converges, and the next controller will start the upgrade process. If the cluster does not immediately converge and is not fully fit, the upgrade will wait until the cluster converges and is “Fully Fit”. During this period, a “Waiting for Cluster Convergence” message is displayed.

For the switches, the administrator can also verify that the switches in the fabric have
been upgraded from the APIC GUI navigation pane, by clicking Fabric Node Firmware.
In the Work pane, view all the switches listed. In the Current Firmware column view the
upgrade image details listed against each switch.

Verifying the Firmware Version and the Upgrade Status by Use of the REST API

An administrator can query the upgrade status of controllers and switches with the following URL:

https://<ip address>/api/node/class/maintUpgJob.xml

An administrator can query the current running firmware version on controllers:

https://<ip address>/api/node/class/firmwareCtrlrRunning.xml

An administrator can also query the currently operating firmware version on switches:

https://<ip address>/api/node/class/firmwareRunning.xml
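
As a minimal scripted sketch of these queries (the credentials and APIC address are placeholders; the aaaLogin call obtains the session cookie that the subsequent class queries require):

# authenticate and store the session cookie
curl -sk -c cookie.txt https://<apic-ip>/api/aaaLogin.xml -d '<aaaUser name="admin" pwd="password"/>'

# upgrade status of all controllers and switches
curl -sk -b cookie.txt https://<apic-ip>/api/node/class/maintUpgJob.xml

# running firmware version on controllers, then on switches
curl -sk -b cookie.txt https://<apic-ip>/api/node/class/firmwareCtrlrRunning.xml
curl -sk -b cookie.txt https://<apic-ip>/api/node/class/firmwareRunning.xml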

Problem Description

After configuring an APIC download task policy, the download keeps failing and will not
download the firmware from the home directory of the user.

Symptom

The following screen is observed.

Verification

Since the APIC is using a standard Linux distribution, the SCP command needs to follow the standard Linux SCP format. For example, if the IP address is 171.70.42.180 and the absolute path is /full_path_from_root/release/image_name, the following illustrations show the successful download of the APIC software via SCP.
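
For reference, a minimal sketch of that format, using the address and path from the example above (the username and image filename are placeholders), is the host-colon-path form a standard Linux scp command expects:

scp user@171.70.42.180:/full_path_from_root/release/image_name .

The URL supplied to the APIC download task would typically take the same <host>:/<absolute path> form.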

Problem Description

The APIC cluster fails to upgrade.

Symptom

Policy upgrade status showing “Waiting for Cluster Convergence”

Verification

When upgrading controllers, the Controller Upgrade Firmware Policy will not proceed unless the APIC cluster has a status of “Fully Fit”. The upgrade status may show “Waiting for Cluster Convergence” and will not proceed with the upgrade.

This “Waiting for Cluster Convergence” status can be caused by a policy or process that has crashed. If the cluster is not in a “Fully Fit” state, check the list of running processes for each APIC; for example, evidence of such a problem would be the presence of core dump files on a controller.

The administrator can recover the APIC from the “Waiting for Cluster Convergence”
state by restarting the affected APIC to allow all processes to start up normally. If the
problem persists, Cisco Technical Assistance Center should be contacted immediately to
troubleshoot further.
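
As a hedged sketch of that process check (the svc_ifc_* process-name pattern is inferred from the service log names shown elsewhere in this document, such as svc_ifc_policyelem):

admin@RTP_Apic1:~> ps aux | grep svc_ifc

A service missing from the list, or repeatedly restarting, points at the process preventing convergence.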

Problem Description

Policy upgrade is paused.

Symptom

Upgrade is paused for either an APIC or a switch.

Verification

The administrator can check whether fault code F1432 - “Maintenance scheduler is paused for group policyName. One or more members of the group failed to upgrade or the user manually paused the scheduler” has been generated. The administrator should look for other faults indicating why the upgrade failed. Once all the faults are resolved, the administrator can delete the failed/paused policy and re-initiate a new policy upgrade.

Faults / Health Scores

Overview

This chapter is intended to provide a basic understanding of faults and health scores in
the ACI object model. This chapter will cover what these items are and how the informa-
tion in these elements can be used in troubleshooting.

For every object in the fabric that can have errors or problems against it, that object will
have the potential of faults being raised. For every object, each fault that is raised has
a related weight and severity. Faults transition between stages throughout a life cycle
where the fault is raised, soaking and cleared. The APIC maintains a real time list of ad-
ministrative and operational components and their related faults, which in turn is used
to derive the health score of an object. The health score itself is calculated based on the
active faults on the object and the health scores of its child objects. This yields a health
score of 100 if no faults are present on the object and the child MOs are all healthy. The
health score will trend towards 0 as health decreases. The health score of the system is calculated using the health scores of the switches and the number of endpoints learnt on the leafs. Similarly, the health score of a tenant is calculated based on the health score of the resources used by the tenant on the leafs and the number of endpoints learned on those leafs.
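
As a brief sketch, these derived scores can also be read through the REST API, either per object with the health subtree option or fabric-wide by class (the pod and node IDs are placeholders):

https://<apic-ip>/api/node/mo/topology/pod-1/node-101/sys.xml?rsp-subtree-include=health
https://<apic-ip>/api/node/class/fabricHealthTotal.xml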

To describe the stages of the fault lifecycle in more detail, a soaking fault is the beginning state for a fault when it is first detected. During this state, depending on the type of fault, it may expire if it is a non-persistent fault, or it will continue to persist in the system. When a fault enters the soaking-clearing state, that fault condition has been resolved at the end of a soaking interval.

If a fault has not been cleared by the time the soaking interval has been reached, it will
enter the raised  state, and potentially have its severity increased. The new severity is
defined by the policy for the particular fault class, and will remain in the raised state until
the fault condition is cleared.

Once a fault condition has been cleared, it will enter the raised-clearing state. At this
point a clearing interval begins, after which if the fault has not returned it will enter the
retaining state, which leaves the fault visible so that it can be inspected after an issue has
been resolved.

At any point during which a fault is created, changes state or is cleared, a fault event log
is generated to keep a record of the state change.
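
Fault instances are themselves objects in the MIT (class faultInst), with an lc attribute that records the lifecycle state described above. As a minimal sketch, assuming the REST conventions covered later in the REST Interface chapter, all faults currently in the raised state could be retrieved with a single class query:

GET https://<apic>/api/class/faultInst.json?query-target-filter=eq(faultInst.lc,"raised")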

Problem Description

Health Score is low

Symptom 1:

A fault has been raised on some object within the MIT

Resolution 1:

The process for diagnosing low health scores is similar for physical, logical and configu-
ration issues, however it can be approached from different sections of the GUI. For this
example, a low overall system health score due to a physical issue will be diagnosed.

1. Navigate to the System Health Dashboard, and identify a switch that has a
diminished health score.

Look primarily for health scores less than 99. Double clicking on that leaf will allow nav-
igation into the faults raised on that particular device. In this case, double click on rtp_
leaf1.

Once in the Fabric Inventory section of the GUI, the dashboard for the leaf itself will
be displayed, and from there navigate into the health tab, by either double clicking the
health score or clicking on the “Health” tab.

Now the nodes in the health tree can be expanded to find those with low health scores. To the left of each node in the tree is an indicator showing the impact of the particular subtree on the parent's health score. This can be one of Low, Medium, Max or None. If the indicator states None, that particular object has no impact on the health of the parent object. Information describing the different severities of faults present on the managed object, along with their counts, is also displayed.

Navigating into a sub-object by clicking the plus sign will show the sub-objects that make up the total health score of the parent.

Navigating down through the tree, it may be noticed that no faults are raised directly on an object, which means that some child object contains the faults. Continue to navigate down through the tree until an object that carries faults is reached.

Once such an object with no children has been reached, the cause of the fault has been found. Right clicking anywhere on the object brings up an action menu, making it possible to show the faulted object via “Show Object”.

Click on the “Show Object” menu item to bring up the object that has the fault, along with a number of details regarding that object's current state. This includes a tab named “Faults” which will show what faults are raised. Double clicking a fault will provide the fault properties, and with this information it is possible to limit the area for troubleshooting to just the object that has the fault.

In the above example, it can be seen that an interface has a fault due to being used by an
EPG however missing an SFP.

Problem Description

Receiving faults indicating “LLDP neighbor is bridge and its port vlan 1 mismatches with the
local port vlan unspecified” at the Fabric Level

Symptom 1:

The front panel ports of the ACI leaf switches do not have a native VLAN configured by default.

If a Layer 2 switch is connected to a leaf port, certain models, including the Nexus 5000 and Nexus 7000, will by default advertise LLDP with a native VLAN of 1. The N7K side advertises VLAN 1 in the LLDP TLV, and the fabric side triggers the fault.

There is no native VLAN configured on the front panel ports of a fabric leaf by default. Normally this is not an issue when servers are connected to these leaf ports. When a Layer 2 switch is connected to a leaf port, it is important that a native VLAN is configured on that leaf port. If it is not configured, the leaf may not forward STP BPDUs. Hence a native VLAN mismatch is treated as a critical fault.

Resolution 1:

This fault can be cleared by configuring a statically attached interface to the path interface
on which the fault is raised. The EPG static path attach should have the encap VLAN set to
the same as the native VLAN, and have the mode set as an untagged interface. This can be
configured via XML, using the following POST request URI and payload.

https://10.122.254.211/api/mo/uni/tn-Prod/ap-Native/epg-Native.xml

<fvAEPg name="native">

<fvRsPathAtt tDn="topology/pod-1/paths-103/pathep-[eth1/5]"/>

<fvRsPathAtt tDn="topology/pod-1/paths-101/pathep-[eth1/5]"/>

<fvRsDomAtt tDn="uni/phys-phys"/>

</fvAEPg>

Resolution 2:

To clear this fault from the system, configure the downstream switch to not advertise a
vlan. This can be configured using the “no vlan dot1q tag native” command in global config
mode, after which bouncing the interfaces connected to the fabric using “shutdown” and
“no shutdown” should clear the issue.

REST Interface

Overview

This chapter will explain the basic concepts necessary to begin effectively utilizing ACI
programmatic features for troubleshooting. This begins with an understanding of the ACI
Object Model, which describes how the system interprets configuration and represents
state to internal and external entities. The REST API provides the means necessary to ma-
nipulate the object store, which contains the configured state of APIC using the object
model as the metadata definition. The APIC SDK leverages the REST API to read and write
the configuration of APIC, using the object model to describe the current and desired state.

ACI provides a new approach to data center connectivity, innovative and different from the
standard approach taken today, but astonishingly simple in its elegance and capacity to
describe complete application topologies and holistically manage varying components of
the data center. With the fabric behaving as a single logical switch, problems like managing
scale, enabling application mobility, collecting uniform telemetry points and configuration
automation are all solved in a straightforward approach. With the controller acting as a
single point of management, but not a single point of failure, clustering offers the advan-
tages of managing large data centers but none of the associated fragmented management
challenges.

The controller is responsible for all aspects of configuration. This includes configuration
for a number of key areas:

• Policy: defines how applications communicate, security zoning rules, quality of service attributes, service insertion and routing/switching
• Operation: protocols on the fabric for management and monitoring, integration with L4-7 services and virtual networking
• Hardware: maintaining fabric switch inventory and both physical and virtual interfaces
• Software: firmware revisions on switches and controllers

With these pieces natively reflected in the object model, it is possible to change these
through the REST API, further simplifying the process by utilizing the SDK.

ACI Object Model

Data modeling is a methodology used to define and analyze data requirements needed
to support a process in relation to information systems. The ACI Object Model contains
a modeled representation of applications, network constructs, services, virtualization,
management and the relationships between all of the building blocks. Essentially, the
object model is an abstracted version of the configuration and operational state that is
applied individually to independent network entities. As an example, a switch may have
interfaces and those interfaces can have characteristics, such as the mode of opera-
tion (L2/L3), speed, connector type, etc. Some of these characteristics are configurable,
while others are read-only, however all of them are still properties of an interface.

The object model takes this analytical breakdown of what defines a thing in the data cen-
ter, and carefully determines how it can exist and how to represent that. Furthermore,
since all of these things do not merely exist, but rather interact with one another, there
can be relationships within the model, which includes containment hierarchy and refer-
ences. An interface belongs to a switch, therefore is contained by the switch, however a
virtual port channel can reference it. A virtual port channel does not necessarily belong
to a single switch.

The objects in the model can also utilize a concept called inheritance, where an interface
can be a more generic concept and specific definitions can inherit characteristics from
a base class. For example, a physical interface can be a data port or a management port,
however both of these still have the same basic properties, so they can inherit from a
single interface base class. Rather than redefine the same properties many times, in-
heritance can be used to define them in one base class, and then specialize them for a
specific child class.

Figure 1: Example of the Object Model

All of these configurable entities and their structure are represented as classes. The
classes define the entities that are instantiated as Managed Objects (MO) and stored
within the Management Information Tree (MIT). The general concept is similar to the
tree based hierarchy of a file system or the SNMP MIB tree. All classes have a single
parent, and may contain multiple children. This is with exception to the root of the tree,
which is a special class called topRoot. Within the model there are different packages
that act as logical groupings of classes, so that similar entities are placed into the same
package for easier navigation of the model. Each class has a name, which is made from
the package and a class name, for example “top” is the package and “Root” is the class:
“topRoot”; “fv” is the package (fabric virtualization) and “Tenant” is the class: “fvTenant”.
A more generic form of this would be:

Package:classname == packageClassName

Managed objects make up the management information tree, and everything that can be
configured in ACI is an object. MOs have relative names (Rn), which are built according
to well-defined rules in the model. For the most part, the Rn is a prefix prepended to
some naming properties, so for example the prefix for an fvTenant is “tn-“ and the nam-
ing property for a fvTenant would be the name, “Cisco”. Combining these gives an Rn of
tn-Cisco for a particular MO. Relative names are unique within their namespace, mean-
ing that within the local scope of an MO, there can only ever be one using that name. Pairing this rule with the tree-based hierarchy of the MIT, the relative names of objects can be concatenated to derive their Distinguished Name (Dn), providing a unique address
in the MIT for a specific object. For example, an fvTenant is contained by polUni (Policy
Universe), and polUni is contained by topRoot. Concatenating the Rns for each of these
from top down yields a Dn of “uni/tn-Cisco”. Note that topRoot is always implied and
does not appear in the Dn.

Figure 2: MIT Dn Resolution

Queries

With all of this information neatly organized, it’s possible to perform a number of tree
based operations, including searching, traversal, insertion and deletion. One of the most
common operations is a search to query information from the MIT.

The following types of queries are supported:

Class-level query: Search the MIT for objects of a specific class

Object-level query: Search the MIT for a specific Dn

Each of these query types supports a plethora of filtering and subtree options, but the
primary difference is how each type is utilized.

A class-based query is useful for searching for a specific type of information, without
knowing the details, or not all of the details. Since a class-based query can return 0 or
many results, it can be a helpful way to query the fabric for information where the full
details are not known. A class-based query combined with filtering can be a powerful
tool to extract data from the MIT. As a simple example, a class-based query can be used
to find all fabric nodes that are functioning as leafs, and extract their serial numbers, for
a quick way to get a fabric inventory.
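
As a sketch of that inventory example, assuming the URI and filter conventions described in the next sections, such a class query might look like the following, where <apic> is the controller address:

GET https://<apic>/api/class/fabricNode.json?query-target-filter=eq(fabricNode.role,"leaf")

Each fabricNode object returned carries a serial attribute, which can be collected to build the inventory.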

An object based (Dn based) query returns zero or 1 matches, and the full Dn for an object
must be provided for a match to be found. Combined with an initial class query, a Dn
query can be helpful for finding more details on an object referenced from another, or as
a method to update a local copy of information.

Both query types support tree-level queries with scopes and filtering. This means that
the MIT can be queried for all objects of a specific class or Dn, and then retrieve the
children or complete subtree for the returned objects. Furthermore, the data sets can
be filtered to only return specific information that is interesting to the purpose at hand.

The next section on the REST API covers more details about how to build and execute
these queries.

APIC REST API

This section provides a brief overview of the REST API; a more exhaustive description can be found in the Cisco APIC REST API User Guide document on Cisco.com.

The APIC REST API is a programmatic interface to the Application Policy Infrastructure
Controller (APIC) that uses a Representational State Transfer (REST) architecture. The API
accepts and returns HTTP or HTTPS messages that contain JavaScript Object Notation
(JSON) or Extensible Markup Language (XML) documents. Any programming language can
be used to generate the messages and the JSON or XML documents that contain the API
methods or managed object (MO) descriptions.

The REST API is the interface into the MIT and allows for manipulation of the object
model state. The same REST interface is utilized by the APIC CLI, GUI and SDK, so that
whenever information is displayed it is read via the REST API and when configuration
changes are made, they are written via the REST API. In addition to configuration chang-
es, the REST API also provides an interface by which other information can be retrieved,
including statistics, faults, audit events and even provide a means of subscribing to push
based event notification, so that when a change occurs in the MIT, an event can be sent
via a Web Socket.

Standard REST methods are supported on the API, which includes POSTs, GETs and DE-
LETE operations through the HTTP protocol. The following table shows the actions of each
of these and the behavior in case of multiple invocations.

Method   Action          Behavior
POST     Create/Update   Idempotent
GET      Read            Nullipotent
DELETE   Delete          Idempotent

Figure 3: REST HTTP(S) based CRUD methods

The POST and DELETE methods are idempotent meaning that they have no additional
effect if called more than once with the same input parameters. The GET method is nul-
lipotent, meaning that it can be called 0 or more times without making any changes (or
that it is a read-only operation).
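
As an illustrative sketch (the tenant name NewTenant is a placeholder), deleting an object is done by issuing the DELETE method against its Dn-based URI; repeating the call has no further effect, consistent with the idempotent behavior described above:

DELETE https://<apic>/api/mo/uni/tn-NewTenant.xml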

Payload Encapsulation

Payloads to and from the REST interface can be encapsulated via either XML or JSON
encodings. In the case of XML, the encoding operation is simple: the element tag is the
name of the package and class, and any properties of that object are specified as attri-
butes on that element. Containment is defined by creating child elements. The following
example shows a simple XML body defining a tenant, application profile, EPG and static
port attachment.

XML Managed Object Definition:

<polUni>

<fvTenant name=”NewTenant”>

<fvAp name=”NewApplication”>

<fvAEPg name=”WebTier”>

<fvRsPathAtt encap=”vlan-1” mode=”regular”

tDn=”topology/pod-1/paths-101/pathep-[eth1/1]”/>

</fvAEPg>

</fvAp>

</fvTenant>

</polUni>

For JSON, encoding requires the definition of certain entities to reflect the tree-based hierarchy; the pattern is repeated at all levels of the tree, so it is fairly simple once initially understood.

1. All objects are described as JSON dictionaries, where the key is the name of the
package and class, and the value is another nested dictionary with two keys: attri-
bute and children.

2. The attribute key contains a further nested dictionary describing key/value pairs
defining attributes on the object

3. The children key contains a list that defines all of the child objects. The children in
this list will be dictionaries containing any nested objects, which are defined as
described in (1)

4. The following example shows the XML defined above, in JSON format.

JSON Managed Object Definition:

{
  "polUni": {
    "attributes": {},
    "children": [
      {
        "fvTenant": {
          "attributes": {
            "name": "NewTenant"
          },
          "children": [
            {
              "fvAp": {
                "attributes": {
                  "name": "NewApplication"
                },
                "children": [
                  {
                    "fvAEPg": {
                      "attributes": {
                        "name": "WebTier"
                      },
                      "children": [
                        {
                          "fvRsPathAtt": {
                            "attributes": {
                              "mode": "regular",
                              "encap": "vlan-1",
                              "tDn": "topology/pod-1/paths-101/pathep-[eth1/1]"
                            }
                          }
                        }
                      ]
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    ]
  }
}

Both the XML and JSON have been pretty printed to simplify visual understanding. Practically, it would make sense to compact both of them before exchanging them with the REST interface; however, this makes no functional difference. In the case of the object examples shown here, the compacted XML results in 213 bytes of data, and the compacted JSON results in 340 bytes of data.

Read Operations

Once the object payloads are properly encoded as XML or JSON, they can be used in Create, Read, Update or Delete (CRUD) operations on the REST API.

Since the REST API is HTTP based, defining the URI to access a certain resource type
is important. The first two sections of the request URI simply define the protocol and
access details of the APIC. Next in the request URI is the literal string “/api” indicating
that the API will be invoked. Generally read operations will be for an object or class, as
discussed earlier, so the next part of the URI defines if it will be for a “mo” or “class”. The
next component defines either the fully qualified Dn being queried for object based que-
ries, or the package and class name for class-based queries. The final mandatory part of
the request URI is the encoding format, either .XML or .JSON. This is the only method
by which the payload format is defined (Content-Type and other headers are ignored by
APIC).

The next optional part of the request URI is the query options, which can specify various
types of filtering, which are explained extensively in the REST API User Guide.

In the example shown above, first an object-level query is shown, where an EPG named Download is queried. The second example shows how all objects of class l1PhysIf can be queried, with the results filtered to only show those where the speed attribute is equal to 10G. For a complete reference to the different objects, their properties and possible values, please refer to the Cisco APIC API Model Documentation.
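
As a sketch of those two requests (the Prod tenant and Commerce application profile names here are hypothetical placeholders), the request URIs might look like:

GET https://<apic>/api/mo/uni/tn-Prod/ap-Commerce/epg-Download.xml

GET https://<apic>/api/class/l1PhysIf.xml?query-target-filter=eq(l1PhysIf.speed,"10G")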

Write Operations

Create and update operations to the REST API are actually both implemented using the POST
method, so that if an object does not already exist it will be created, and if it does already
exist, it will be updated to reflect any changes between its existing state and desired state.

Both create and update operations can contain complex object hierarchies, so that a com-
plete tree can be defined within a single command, so long as all objects are within the
same context root and they are under the 1MB limit for data payloads to the REST API. This
limit is in place to guarantee performance and protect the system under high load.

The context root helps define a method by which the APIC distributes information to multiple controllers and ensures consistency. For the most part it should be transparent to the user, though very large configurations may need to be broken up into smaller pieces if they result in a distributed transaction.

<fvTenant name=”NewTenant”>

<fvAp name=”NewApplication”>

<fvAEPg name=”WebTier”>

<fvRsPathAtt encap=”vlan-1” mode=”regular”

tDn=”topology/pod-1/paths-17/pathep-[eth1/1]”/>

</fvAEPg>

</fvAEPg>

</fvTenant>


Create/Update operations follow the same syntax as read operations, except that they
will always be targeted at an object level because changes cannot be made to every ob-
ject of a specific class. The create/update operation should target a specific managed
object, so the literal string “/mo” indicates that the Dn of the managed object will be
provided, followed next by the actual Dn. Filter strings can be applied to POST opera-
tions, to retrieve the results of a POST in the response, for example, pass the rsp-sub-
tree=modified query string to indicate that the response should include any objects that
have been modified by the POST.

The payload of the POST operation will contain the XML or JSON encoded data repre-
senting the managed object defining the API command body.

Authentication

Authentication to the REST API for username/password-based authentication uses a special subset of request URIs, including aaaLogin, aaaLogout and aaaRefresh as the Dn target of a POST operation. Their payloads contain a simple XML or JSON payload containing the MO representation of an aaaUser object with attributes name and pwd defining the username and password, for example: <aaaUser name='admin' pwd='insieme'/>.
The response to the POSTs will contain an authentication token as both a Set-Cookie
header as well as an attribute to the aaaLogin object in the response named token, for
which the XPath is /imdata/aaaLogin/@token if encoded as XML. Subsequent opera-
tions on the REST API can use this token value as a Cookie named “APIC-cookie” to have
future requests authenticated.
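
As a minimal sketch using the Python requests library (the APIC address and credentials are placeholders), the login exchange and subsequent token reuse look roughly like this:

import requests

apic = 'https://10.0.0.2'
# POST an aaaUser MO to aaaLogin to authenticate
payload = {'aaaUser': {'attributes': {'name': 'admin', 'pwd': 'password'}}}
resp = requests.post(apic + '/api/aaaLogin.json', json=payload, verify=False)
# the token is returned as an attribute of the aaaLogin object in the response
token = resp.json()['imdata'][0]['aaaLogin']['attributes']['token']
# subsequent requests present the token as a cookie named APIC-cookie
cookies = {'APIC-cookie': token}
tenants = requests.get(apic + '/api/class/fvTenant.json', cookies=cookies, verify=False)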

Filters

The REST API supports a wide range of flexible filters, useful for narrowing the scope of a search to allow for information to be more quickly located. The filters themselves are appended as query URI options, started with a question mark (?) and concatenated with an ampersand (&). Multiple conditions can be joined together to form complex filters.

The Cisco APIC RESTful API User Guide covers in great detail the specifics of how to use
filters, their syntax, and provides examples. Some of the tools covered below, can be used
to learn to build a query string, as well as uncover those being used by the native APIC
interface, and build on top of those to create advanced filters.

 

Browser

The MIT contains multitudes of valuable data points. Being able to browse that data can
expose new ways to use the data, aid in troubleshooting, and inspect the current state
of the object store. One of the available tools for browsing the MIT is called “visore” and
is available on the APIC. Visore supports querying by class and object, as well as easily
navigating the hierarchy of the tree.

In order to access Visore, open https://<apic>/visore.html in a web browser, and then authenticate with credentials for the APIC. Once logged in, an initial set of data will be visible, and filtered search fields are available at the top of the screen. Within the “Class or DN” text input field, enter the name of a class, e.g. “fabricNode” or “topology/pod-1/node-1”; press the “Run Query” button and press OK when prompted to continue without a filter. Depending on the input string, the results will be either a list of nodes on the fabric or information for the first APIC.

In the list of attributes for the objects, the Dn will have a set of icons next to it.

The green arrows can be used for navigating up and down the tree, where pressing the
left arrow will navigate to the parent of the object and the right arrow will navigate to
a list of all children of the current object. The black staggered bars will display any sta-
tistics that are available for the object. If none are available, the resulting page will not
contain any data. The red octagon with exclamation point will show any faults that are
present on the current object and finally the blue circle with the letter H will show the
health score for the object, if one is available.

These tools provide access to all types of information in the MIT, and Visore can additionally be used to structure query strings. For example, enter “fabricNode” as the class, “id” for the property and “1” in the field labeled Val1, leave the Op value at “==”, and execute the query to filter the class results down to just those with an id equal to 1. Note that Visore does not contain the complete list of filters supported by the REST API, however it can be a useful starting point.

Visore provides the URI of the last query and the response body, and the data can be
seen not only in a tabular format, but also as the natively encoded payload. This allows
for quick access to determine the request URI for a class or Dn based query, and also see
what the XML body of the response looks like.

API Inspector

All operations that are made through the GUI will invoke REST calls to fetch and commit
the information being accessed. The API Inspector further simplifies the process of ex-
amining what is taking place on the REST interface as the GUI is navigated by displaying
in real time the URIs and payloads. When new configuration is committed, API inspector
will display the resulting POST requests, and when information is displayed on the GUI,
the GET request will be displayed.

To get started with API inspector, access it from the account menu, visible in the top
right of the APIC GUI. Click on “welcome, <username>” and then select the “Show API
Inspector” option, as shown in the figure below.

Once the API Inspector is brought up, timestamps will be seen along with the REST
method, URIs, and payloads. Occasional updates may also be seen in the list as the GUI
refreshes subscriptions to data being shown on the screen.

From the example output shown above, it can be seen that the last logged item has a
POST with the JSON payload containing a tenant named Cisco, and some attributes de-
fined on that object.

POST

url: http://172.23.3.215/api/node/mo/uni/tn-Cisco.json

{
  "fvTenant": {
    "attributes": {
      "name": "Cisco",
      "status": "created"
    },
    "children": []
  }
}

ACI Software Development Kit (SDK)

The ACI Python SDK is named Cobra, and is a Python implementation of the API that provides native bindings for all the REST functions. Cobra also has a complete copy of the object model so that data integrity can be ensured, and provides methods for performing lookups, queries, and object creation, modification and deletion, which match the REST methods leveraged by the GUI, as well as those that can be found using API Inspector. As a result, policy created in the GUI can be used as a programming template for rapid development.

The installation process for Cobra is straightforward, using standard Python distribu-
tion utilities. It is currently distributed as an egg and can be installed using easy_install.
Please reference the APIC Python API Documentation for full details on installing Cobra
on a variety of operating systems.

Establishing a Session

The first step in any code that will use Cobra is to establish a login session. Cobra current-
ly supports username and password based authentication, as well as certificate-based
authentication. For this example, we’ll use username and password based authentication:

import cobra.mit.access
import cobra.mit.session

apicUri = 'https://10.0.0.2'
apicUser = 'username'
apicPassword = 'password'

ls = cobra.mit.session.LoginSession(apicUri, apicUser, apicPassword)
md = cobra.mit.access.MoDirectory(ls)
md.login()

This will provide an MoDirectory object named md that is logged in and authenticated to an APIC. If for some reason the script is unable to authenticate, it will get a cobra.mit.request.CommitError exception from Cobra. Once a session is established, the script can move forward.

Working with Objects

Utilizing the Cobra SDK to manipulate the MIT generally follows this workflow:

1. identify the object to be manipulated
2. build a request to change attributes, add or remove children
3. commit the changes made to that object

For example, to create a new Tenant, where the tenant will be placed in the MIT must
first be identified. In this case it will be a child of the Policy Universe object:

import cobra.model.pol

polUniMo = cobra.model.pol.Uni('')

With the policy universe Mo object defined, it is possible to create a tenant object as a
child of polUniMo:

import cobra.model.fv

tenantMo = cobra.model.fv.Tenant(polUniMo, 'cisco')

Since all of these operations have only resulted in Python objects being created, the configuration must be committed in order to apply it. This can be done using an object called a ConfigRequest. A ConfigRequest acts as a container for Managed Object based classes that fall into a single context, which can all be committed in a single atomic POST.

import cobra.mit.request

config = cobra.mit.request.ConfigRequest()
config.addMo(tenantMo)
md.commit(config)

The ConfigRequest is created, then the tenantMo is added to the request, and finally this is committed through the MoDirectory.

For the above example, in the first step a local copy of the polUni object is built. Since it does not have any naming properties (reflected above by the empty single quotes), there is no need to look it up in the MIT to figure out what the full Dn for the object is, since it is always known as the “uni”. If something deeper in the MIT needs to be posted, where the object has naming properties, a lookup needs to be performed for
that object. As an example, to post a configuration to an existing tenant, it is possible to query for that tenant, and create objects beneath it.

tenantMo = md.lookupByClass('fvTenant', propFilter='eq(fvTenant.name, "cisco")')
tenantMo = tenantMo[0] if tenantMo else None

The resulting tenantMo object will be of class cobra.model.fv.Tenant, and contain properties such as .dn, .status, .name, etc., all describing the object itself. lookupByClass() returns an array, since it can return more than one object. In this case, the propFilter is specifying an fvTenant with a particular name. For a tenant, the name attribute is a special type of attribute called a naming attribute. The naming attribute is used to build the relative name, which must be unique within its local namespace. As a result, it can be guaranteed that a lookupByClass on fvTenant with a filter on the name will always return either an array of length 1 or None, meaning nothing was found. The specific naming attributes and others can be looked up in the APIC Model Reference document.

Another method to avoid a lookup entirely is to build a Dn object and make an object a child of that Dn. This will only work in cases where the parent object already exists.

import cobra.mit.naming

topDn = cobra.mit.naming.Dn.fromString('uni/tn-cisco')
fvAp = cobra.model.fv.Ap(topDn, name='AppProfile')

These fundamentals of interacting with Cobra will provide the building blocks necessary
to create more complex workflows that will aid in the process of automating network
configuration, troubleshooting and management.

APIC REST to Python Adapter

The process of building a request can be time consuming. For example, the object changes that are desired must be represented as Python code that builds the object data payload. Given that the Cobra SDK is directly modeled off of the ACI Object Model, it should be possible to generate code directly from what resides in the object model. As expected, this is possible using a tool developed by Cisco Advanced Services named Arya, short for APIC REST to Python Adapter.

{"fvTenant": {"attributes": {"dn": "uni/tn-Cisco", "name": "Cisco",
"rn": "tn-Cisco", "status": "created"}, "children": [{"fvBD":
{"attributes": {"dn": "uni/tn-Cisco/BD-CiscoBd", "mac": "00:22:BD:F8:19:FF",
"name": "CiscoBd", "rn": "BD-CiscoBd", "status": "created"}, "children":
[{"fvRsCtx": {"attributes": {"tnFvCtxName": "CiscoNetwork", "status":
"created,modified"}, "children": []}}, {"fvSubnet": {"attributes": {"dn":
"uni/tn-Cisco/BD-CiscoBd/subnet-[10.0.0.1/8]", "ip": "10.0.0.1/8", "rn":
"subnet-[10.0.0.1/8]", "status": "created"}, "children": []}}]}}, {"fvCtx":
{"attributes": {"dn": "uni/tn-Cisco/ctx-CiscoNetwork", "name": "CiscoNetwork",
"rn": "ctx-CiscoNetwork", "status": "created"}, "children": []}}]}}

fvTenant = cobra.model.fv.Tenant(topMo, name='Cisco')
fvCtx = cobra.model.fv.Ctx(fvTenant, name='CiscoNetwork')
fvBD = cobra.model.fv.BD(fvTenant, mac='00:22:BD:F8:19:FF', name='CiscoBd')
fvRsCtx = cobra.model.fv.RsCtx(fvBD, tnFvCtxName=fvCtx.name)
fvSubnet = cobra.model.fv.Subnet(fvBD, ip='10.0.0.1/8')

In the diagram above, it is clearly shown how input that might come from API Inspector, Visore or even the output of a REST query can be quickly converted into Cobra SDK code, which can then be tokenized and re-used in more advanced ways. Installing Arya is relatively simple and it has minimal external dependencies. Arya requires Python 2.7.5 and git to be installed. The following quick installation steps will install Arya and place it in the system Python.

git clone https://github.com/datacenter/ACI.git
cd ACI/arya
sudo python setup.py install

After installation of Arya has completed, it is possible to take XML or JSON representing
ACI modeled objects and convert them to Python code quickly. For example:

arya.py -f /home/palesiak/simpletenant.xml

Will yield the following Python code:

#!/usr/bin/env python

'''
Autogenerated code using /private/tmp/ACI/arya/lib/python2.7/
site-packages/arya-1.0.0-py2.7.egg/EGG-INFO/scripts/arya.py
Original Object Document Input:
<fvTenant name='bob'/>
'''

raise RuntimeError('Please review the auto generated code before ' +
                   'executing the output. Some placeholders will ' +
                   'need to be changed')

# list of packages that should be imported for this code to work
import cobra.mit.access
import cobra.mit.session
import cobra.mit.request
import cobra.model.fv
import cobra.model.pol
from cobra.internal.codec.xmlcodec import toXMLStr

# log into an APIC and create a directory object
ls = cobra.mit.session.LoginSession('https://1.1.1.1', 'admin', 'password')
md = cobra.mit.access.MoDirectory(ls)
md.login()

# the top level object on which operations will be made
topMo = cobra.model.pol.Uni('')

# build the request using cobra syntax
fvTenant = cobra.model.fv.Tenant(topMo, name='bob')

# commit the generated code to APIC
print toXMLStr(topMo)
c = cobra.mit.request.ConfigRequest()
c.addMo(topMo)
md.commit(c)

The placeholder raising a RuntimeError must first be removed before this code can be executed; it is purposely put in place to ensure that any other tokenized values that must be updated are corrected. For example, the APIC IP defaulting to 1.1.1.1 should be updated to reflect the actual APIC IP address. The same applies to the credentials and other possible placeholders.

Note that if the input is XML or JSON that does not have a fully qualified hierarchy, it may be difficult or impossible for Arya to determine it through heuristics. In this case, a placeholder of “REPLACEME” will be populated in the text. This placeholder will need to be replaced with the correct distinguished names (Dn's). These Dn's can be found by querying for the object in Visore, or by inspecting the request URI for the object shown in API Inspector.

Conclusion

With an understanding of how ACI network and application information is represented, how to interact with that data, and a grasp on using the SDK, it is trivial to create powerful programs that can simplify professional tasks and introduce higher levels of automation. Mastering the MIT and Cobra SDK and leveraging Arya to streamline operational workflows is just the beginning of leveraging ACI in ways that will increase its value to the business and business stakeholders.

Problem Description

Errors from the REST API do not generally generate faults on the system. Errors are returned directly to the source of the request. The APIC that the request was sent to keeps logs that can be examined to see what was queried and, if errors occur, what may have caused them. In 1.0(1e) the /var/log/dme/log/nginx.bin.log file on the APIC tracks requests coming to the APIC and shows specific types of errors. In later versions the nginx error.log and access.log are available at /var/log/dme/nginx/.

Symptom 1

Unable to connect to APIC over HTTP using the REST API.



Verification 1

By default, HTTP is disabled on the APICs and HTTPS is enabled.

Resolution 1

• HTTPS can be used, or the communication policy can be changed to enable HTTP. However, please be aware that the APIC ships in the most secure mode possible.

Symptom 2

REST API returns an error similar to, “Invalid DN [Dn] wrong rn prefix [Rn] at position
[position]” or “Request failed, unresolved class for [string]”

Verification 2

The REST API uses the Uniform Resource Identifier (URI) to determine what to either configure (for a POST) or return back to the requester (for a GET).

For a GET of a class, if the APIC is unable to resolve that URI back to a valid class the error
that the APIC returns is the “unable to resolve the class” error. Please refer back to the
APIC Management Information Model Reference documentation to verify the name of
the class.

For a GET or POST for a managed object, if the APIC is unable to resolve the distinguished name, the APIC will return the invalid DN error and specify which Rn is the problem. It is possible to use this information to determine which part of the distinguished name has resulted in the failure. Please refer back to the APIC Management Information Model Reference documentation to verify the structure of the distinguished name if needed. It is also possible to use Visore to traverse the object store on the APIC and see which distinguished names exist.

Resolution 2

To use the REST API, a fully qualified distinguished name must be used for GET requests or POST queries to URIs starting with “/api/mo/”, and the class name must be used when making GET requests to URIs starting with “/api/class”.

Symptom 3

The REST API returns the error “Token was invalid (Error: Token timeout)”

Verification 3

The REST API requires that a login is refreshed periodically. When logging in using the
aaaLogin request, the response includes a refreshTimeoutSeconds attribute that defines
how long the login cookie will remain valid. The cookie must be refreshed using a GET to
api/aaaRefresh.xml or api/aaaRefresh.json prior to that timeout period. By default the
timeout period is 300 seconds. If the token is not refreshed, it will expire and the REST
API will return the token invalid error.

Resolution 3

Refresh the token by using the aaaRefresh API before the token expires or get a new token
by simply logging in again.
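
As a sketch, the refresh is a simple GET that presents the existing token, which resets the timeout window:

GET https://<apic>/api/aaaRefresh.json
Cookie: APIC-cookie=<token returned by aaaLogin>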

Symptom 4

The REST API returns the error, “Failed to update multiple items in a single operation
- request requires distributed transaction. Please modify request to process each item
individually”

Verification 4

When a POST transaction is sent to the REST API, ensure that the POST does not contain managed objects from different parts of the management information tree that may be managed by different APICs. On the surface this can be a rather difficult thing to ensure if a large transaction is created by generating a huge configuration and committing it. However, if the POST is limited to objects within the same package, it is generally possible to avoid this issue. For example, the infraInfra object should not be included when doing a POST to api/uni.xml or api/uni.json for an fvTenant object.

Resolution 4

Break up the REST API POST such that the request does not cover classes outside of
packages at equal or higher levels of the management information tree. Please see the
APIC Management Information Model Reference documentation for more information
about the class hierarchy.

Symptom 5

The REST API reports either “incomplete node at line [line number]” or “Invalid request.
Cannot contain child [child] under parent [parent]” for POST requests.

Verification 5

The REST API requires that the objects that are sent in a POST are well formed. XML objects simply require that the proper containment rules be followed, the proper attributes be included, and that the XML is well formed. Managed objects that are contained by other managed objects in the management information tree need to be contained in the same way in the XML POST. For example, the following POST would fail because fvAp cannot be contained by fvAEPg:

<?xml version="1.0"?>

<polUni>

    <fvTenant name="NewTenant">

        <fvAEPg name="WebTier">

<!-- This is wrong --> 

<fvAp name="NewApplication" />


<fvRsPathAtt encap="vlan-1" mode="regular" tDn="topology/pod-1/paths-101/pathep-

[eth1/1]"/>

</fvAEPg>

</fvTenant>

</polUni>

For JSON POSTs, it becomes a little more difficult because the JSON is built from the XML: XML attributes become an attributes field, and children become a children array in the JSON. The children and attributes have to be explicitly specified. For example:

"polUni": {

"attributes": {},

"children": [{

"fvTenant": {

"attributes": {

"name": "NewTenant"

}}

However, if the attributes and children specifications are not included - the most common situation is that they are omitted when there are no attributes, as it is easy to forget them in such a case - the REST API will return the error about an incomplete node. This is an example of a poorly formed JSON query:

"polUni": {

"fvTenant": {

"attributes": {

"name": "NewTenant"

}
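
For reference, a corrected form of the query above, with the attributes and children keys explicitly included, would be:

"polUni": {
    "attributes": {},
    "children": [{
        "fvTenant": {
            "attributes": {
                "name": "NewTenant"
            },
            "children": []
        }
    }]
}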

Resolution 5

Ensure that the REST API query is properly formed.

Symptom 6

The web server returns a 200 response, but with a body that states 400 Bad Request.

Verification 6

This will happen if the header or request sent to the web server is malformed in such a way that the request cannot be parsed by the web server. For example:

>>> import httplib

>>> conn = httplib.HTTPConnection("10.122.254.211")

>>> conn.request("get", "api/aaaListDomains.json")

>>> r1 = conn.getresponse()

>>> data1 = r1.read()

>>> print r1.status, r1.reason

200

>>> print data1

<html>

<head><title>400 Bad Request</title></head>

<body bgcolor="white">

<center><h1>400 Bad Request</h1></center>

<hr><center>nginx/1.4.0</center>

</body>

</html>

>>>

In this case the method used is “get” - all lower case. The web server the APIC uses requires that methods be all upper case, “GET”.
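
A corrected version of the same request, as a sketch, simply upper-cases the method:

>>> conn.request("GET", "api/aaaListDomains.json")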

Resolution 6

Ensure that the header and request are not malformed in any way and conform to common web standards and practices.

Management Tenant

Overview

The management tenant is a pre-defined tenant in the ACI policy model that addresses policies related to inband and out-of-band management connectivity of the ACI fabric.

This chapter presents an overview of how the management tenant should function, the
verification steps used to confirm a working out-of-band management configuration
for the example reference topology, and potential issues that relate to the management
tenant. The displays taken on a working fabric can then be used as a reference resource to
aid in troubleshooting issues with the management tenant functions.

The example reference topology that is used only has out-of-band management config-
ured, so in order to show any inband management functions, all inband management in-
formation shown in this book will be captured from another fabric.

Fabric Management Routing

The ACI fabric provides both in-band and out-of-band management access options. The
following paragraphs will describe the internal system behavior, including routing and
failover for the APIC and fabric nodes (switches).

Out-Of-Band Management

Out-of-band (OOB) management provides management communications through configuration of dedicated physical interfaces on the APICs and fabric nodes (switches). The initial APIC setup script prompts to configure the OOB management IP address by a series of configuration prompts:

Out-of-band management configuration ...
Enter the IP address for out-of-band management: 10.122.254.141/24
Enter the IP address of the default gateway [None]: 10.122.254.1
Enter the interface speed/duplex mode [auto]:



Once the fabric is initialized and discovered, the OOB addresses can be configured for the fabric nodes (switches) through any of the object model interfaces (GUI, API, CLI).

On the APIC, the OOB configuration creates an interface called oobmgmt. Keep in mind throughout this book that when viewing configuration information on an APIC, the APIC is built on a Linux host operating system, and some of the abbreviated information might be more systems related than traditional Cisco NX-OS/IOS command or output structure related. As an example, to view the oobmgmt interface configuration, connect to the APIC CLI and enter the command ip add show dev oobmgmt. The dev keyword is a Linux moniker for “device” and the order of the command is different from a traditional Cisco “show” command. Below is the output produced from the ip add show dev oobmgmt command:

admin@RTP_Apic2:~> ip add show dev oobmgmt
8: oobmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 24:e9:b3:15:dd:60 brd ff:ff:ff:ff:ff:ff
    inet 10.122.254.212/24 brd 10.122.254.255 scope global oobmgmt
    inet6 fe80::26e9:b3ff:fe15:dd60/64 scope link
       valid_lft forever preferred_lft forever
admin@RTP_Apic2:~>

On the fabric nodes (switches), the OOB configuration is applied to the management inter-
face eth0 (aka mgmt0). To view the eth0 interface configuration, connect to the node CLI,
enter the following command and observe the produced output:

rtp_leaf1# ip add show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 88:f0:31:db:e7:f0 brd ff:ff:ff:ff:ff:ff
    inet 10.122.254.241/24 brd 10.122.254.255 scope global eth0
    inet6 fe80::8af0:31ff:fedb:e7f0/64 scope link
       valid_lft forever preferred_lft forever
rtp_leaf1#

Inband Management

Inband management provides management communications through configuration of one or more front-panel (data plane) ports on the fabric leaf nodes (switches). Inband management requires a dedicated pool of IP addresses that do not directly extend outside the fabric. Inband management can be configured in two modes: Layer 2 and Layer 3.

Layer 2 Inband Management

With Layer 2 inband management, the inband management addresses assigned to the
APICs and fabric nodes (switches) are only accessible from networks directly connected
to the leaf nodes.

In this model, the fabric inband addresses are not accessible from networks not directly
connected to the fabric.

Layer 2 Configuration Notes

• A minimum of 2 VLANs are required:
  o 1 for the Inband management EPG
  o 1 for the application EPG mapped to the leaf port providing connectivity outside the fabric
• Configuring a second Bridge Domain (BD) for the application EPG is optional and it is also valid to map the application EPG to the default BD named ‘inb’
• The subnet gateway(s) configured for the BDs are used as next-hop addresses and should be unique host addresses

Layer 3 Inband Management

With Layer 3 inband management, the inband management addresses assigned to the
APICs and fabric nodes are accessible by remote networks by virtue of configuring a L3
Routed Outside network object.

Layer 3 Inband Configuration Notes

• A minimum of 2 VLANs are required:
  o 1 for the Inband EPG
  o 1 for the application EPG mapped to the leaf port providing access outside the fabric
• Configuring a second BD for the application EPG is optional - it is also valid to map the application EPG to the default 'inb' BD
• The subnet gateway(s) configured for the BDs are used as next-hop addresses and should be unique (i.e. unused) host addresses

Regardless of whether using L2 or L3, the encapsulation VLAN used for the Inband EPG
is used to create a sub-interface on the APIC using the name format bond0.<vlan>, where
<vlan> is the VLAN configured as the encapsulation for the Inband EPG.  As an example,
the following is the output from the APIC CLI show command:
 
admin@fab2_apic1:~> ip add show bond0.10
116: bond0.10@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1496 qdisc noqueue state UP
    link/ether 64:12:25:a7:df:3f brd ff:ff:ff:ff:ff:ff
    inet 5.5.5.141/24 brd 5.5.5.255 scope global bond0.10
    inet6 fe80::6612:25ff:fea7:df3f/64 scope link
       valid_lft forever preferred_lft forever

On the fabric nodes, inband interfaces are created as part of the mgmt:inb VRF:

fab2_leaf4# show ip int vrf mgmt:inb
IP Interface Status for VRF "mgmt:inb"
vlan27, Interface status: protocol-up/link-up/admin-up, iod: 128,
    IP address: 5.5.5.1, IP subnet: 5.5.5.0/24 <<<<<<<<<<<<<<< BD gateway address
    IP address: 5.5.5.137, IP subnet: 5.5.5.137/32 secondary
    IP broadcast address: 255.255.255.255
    IP primary address route-preference: 1, tag: 0

In the output above, the gateway address(es) for the BD are also configured on the same interface. This is true for all leaf nodes (switches) that are configured for inband.

APIC Management Routing

The APIC internal networking configuration utilizes the Linux iproute2 utilities, which
provides a combination of routing policy database and multiple routing tables used to
implement routing on the controllers. When both inband and out-of-band management
are configured, the APIC uses the following forwarding logic:

1. Packets that come in an interface, go out that same interface
2. Packets sourced from the APIC, destined to a directly connected network, go out the directly connected interface
3. Packets sourced from the APIC, destined to a remote network, prefer inband, followed by out-of-band

An APIC controller always prefers the in-band management interface to the out-of-band
management interface as long as in-band is available. This behavior cannot be changed
with configuration. APIC controllers should have two ways to reach a single management
network with inband being the primary path and out-of-band being the backup path.
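
As a quick sketch for verifying this forwarding logic from the APIC CLI (the destination address is a placeholder), the Linux ip route get command reports the interface and source address the kernel would select for a given destination:

admin@fab2_apic1:~> ip route get 10.122.254.1

The output names the egress device (for example oobmgmt or bond0.<vlan>) and the source address chosen for that destination.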

To view the configured routing tables on the APIC, execute the command cat /etc/iproute2/rt_tables:

admin@fab2_apic1:~> cat /etc/iproute2/rt_tables
# reserved values
255 local
254 main
253 default
0 unspec
# local
#1 inr.ruhep
1 overlay
2 oobmgmt
admin@fab2_apic1:~>

The local and main routing tables are Linux defaults. The local routing table is populated with information from all of the interfaces configured with IP addresses on the APIC. The overlay, oobmgmt, and ibmgmt routing tables are APIC-specific and are populated with the relevant routes for each network. The entries from the three APIC-specific routing tables are used to populate the main routing table. The contents of each routing table can be viewed by using the command ip route show table <table>. For example:

admin@fab2_apic1:~> ip route show table oobmgmt
default via 10.122.254.1 dev oobmgmt src 10.122.254.141
10.122.254.1 dev oobmgmt scope link src 10.122.254.141
169.254.254.0/24 dev lxcbr0 scope link
admin@fab2_apic1:~>

The decision of which routing table is used for the lookup is based on an ordered list of rules in the routing policy database. Use the ip rule show command to view the routing policy database.

The main routing table, used for packets originating from the APIC, shows 2 default routes:

admin@fab2_apic1:~> ip route show
default via 10.122.254.1 dev oobmgmt metric 16
10.0.0.0/16 via 10.0.0.30 dev bond0.4093 src 10.0.0.1
10.0.0.30 dev bond0.4093 scope link src 10.0.0.1
10.122.254.0/24 dev oobmgmt proto kernel scope link src 10.122.254.141
10.122.254.1 dev oobmgmt scope link src 10.122.254.141
169.254.1.0/24 dev teplo-1 proto kernel scope link src 169.254.1.1
169.254.254.0/24 dev lxcbr0 proto kernel scope link src 169.254.254.254
admin@fab2_apic1:~>

The metric 16 on the default route out the oobmgmt interface is what makes the default
route via inband (bond0.10) preferable.

Fabric Node (Switch) Routing

Routing on the fabric nodes (switches) is split between Linux and NX-OS. Unlike the
APIC configuration, the routing table segregation on the fabric nodes is implemented
using multiple VRF instances. The configured VRFs on a fabric node can be viewed by
using the show vrf command:

fab2_leaf1# show vrf
VRF-Name                           VRF-ID State   Reason
black-hole                              3 Up      --
management                              2 Up      --
mgmt:inb                               11 Up      --
overlay-1                               9 Up      --
Although the management VRF exists in the above output, the associated routing table is empty. This is because the management VRF, which maps to the out-of-band network configuration, is handled by Linux. This means that on the fabric nodes, the Linux configuration does not use multiple routing tables and the content of the main routing table is only populated by the out-of-band network configuration.

To view the contents of each VRF routing table in NX-OS, use the show ip route vrf <vrf>
command. For example:

fab2_leaf1# show ip route vrf mgmt:inb
IP Route Table for VRF "mgmt:inb"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

3.3.3.0/24, ubest/mbest: 1/0, attached, direct, pervasive
   *via 10.0.224.65%overlay-1, [1/0], 02:31:46, static
3.3.3.1/32, ubest/mbest: 1/0, attached
   *via 3.3.3.1, Vlan47, [1/0], 02:31:46, local
5.5.5.0/24, ubest/mbest: 1/0, attached, direct, pervasive
   *via 10.0.224.65%overlay-1, [1/0], 07:48:24, static
5.5.5.1/32, ubest/mbest: 1/0, attached
   *via 5.5.5.1, Vlan40, [1/0], 07:48:24, local
5.5.5.134/32, ubest/mbest: 1/0, attached



Management Failover

In theory the out-of-band network functions as a backup when inband management connectivity is unavailable on the APIC. However, the APIC does not run any routing protocol and so will not be able to intelligently fall back to the OOB interface in case of upstream connectivity issues over inband. The APIC fails over from the inband management network in the following scenarios:

• The bond0 interface on the APIC goes down


• The encapsulation configuration on the Inband EPG is removed
• Note that the above failover behavior is specific to the APIC; the same failover
behavior is unavailable on the fabric switches because the switch inband and
OOB interfaces belong to two different VRFs.
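
Since a bond0 failure is one of these triggers, the state of the bond can be inspected with the standard Linux bonding status file. This is a generic Linux check, not an ACI-specific command; the exact output depends on the platform and software release:

admin@fab2_apic1:~> cat /proc/net/bonding/bond0

[snip]

MII Status: up

[snip]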

Management EPG Configuration

Some of the fabric services, such as NTP, DNS, etc., provide the option to configure a
Management EPG attribute. This specifies whether inband or out-of-band is used for
communication by these services. This setting only affects the behavior of the fabric
nodes, not the APICs. With the exception of the VM Provider configuration, the APIC fol-
lows the forwarding logic described in the APIC Management Routing section earlier in
this chapter. The VM Provider configuration has an optional Management EPG setting,
but it is only able to select an EPG tied to the In-Band Management EPG.

Fabric Verification

In the following section, displays are collected from the reference topology to show a
working fabric configuration. This verification is only for out-of-band management.

OOB Verification

The first step is to verify the configuration of the oobmgmt interface on the APIC using
the command ip addr show dev oobmgmt on all three APICs. The interfaces need to be in
the up state and the expected IP addresses need to be assigned with the proper masks.
There are several options for connecting to the various fabric nodes; once logged into at least
one of the APICs, use the output of show fabric membership to see which node names and
VTEP IP addresses can be used to connect via SSH in order to verify operations.
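
For example, once the VTEP addresses are known from show fabric membership, a fabric node can be reached over SSH directly from an APIC. The address here corresponds to rtp_leaf1 in the membership output shown further below:

admin@RTP_Apic1:~> ssh admin@172.16.136.95

Password:

rtp_leaf1#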

Verification of the oobmgmt interface address assignment and interface status on RTP_Apic1:

admin@RTP_Apic1:~> ip addr show dev oobmgmt

8: oobmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP

link/ether 24:e9:b3:15:a0:ee brd ff:ff:ff:ff:ff:ff

inet 10.122.254.211/24 brd 10.122.254.255 scope global oobmgmt

inet6 fe80::26e9:b3ff:fe15:a0ee/64 scope link

valid_lft forever preferred_lft forever

Verification of the oobmgmt interface address assignment and interface status on RTP_Apic2:

admin@RTP_Apic2:~> ip addr show dev oobmgmt

8: oobmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP

link/ether 24:e9:b3:15:dd:60 brd ff:ff:ff:ff:ff:ff

inet 10.122.254.212/24 brd 10.122.254.255 scope global oobmgmt

inet6 fe80::26e9:b3ff:fe15:dd60/64 scope link

valid_lft forever preferred_lft forever

admin@RTP_Apic2:~>

Verification of the oobmgmt interface address assignment and interface status on RTP_Apic3:

admin@RTP_Apic3:~> ip addr show dev oobmgmt

8: oobmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP

link/ether 18:e7:28:2e:17:de brd ff:ff:ff:ff:ff:ff

inet 10.122.254.213/24 brd 10.122.254.255 scope global oobmgmt

inet6 fe80::1ae7:28ff:fe2e:17de/64 scope link

valid_lft forever preferred_lft forever

admin@RTP_Apic3:~>

Verification of the fabric node membership and their respective TEP address assignment
as seen from RTP_Apic2 (the output would be the same on all controllers in a normal state):

admin@RTP_Apic2:~> show fabric membership 

# Executing command: cat /aci/fabric/inventory/fabric-membership/clients/summary

clients:

serial-number node-id node-name  model         role  ip               decomissioned supported-model

------------- ------- ---------- ------------  ----- ---------------- ------------- ---------------

SAL1819SAN6   101     rtp_leaf1   N9K-C9396PX  leaf  172.16.136.95/32  no            yes 

SAL172682S0   102     rtp_leaf2   N9K-C93128TX leaf  172.16.136.91/32  no            yes 

SAL1802KLJF   103     rtp_leaf3   N9K-C9396PX  leaf  172.16.136.92/32  no            yes 

FGE173400H2   201     rtp_spine1  N9K-C9508    spine 172.16.136.93/32  no            yes 

FGE173400H7   202     rtp_spine2  N9K-C9508    spine 172.16.136.94/32  no            yes 

admin@RTP_Apic2:~> 

To verify the OOB management configuration on the fabric switches, use the attach command
on the APIC to connect to a fabric switch, then execute the ip addr show dev eth0
command for each switch, and again ensure that the interface state is UP and that the
IP address and netmask are correct:
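
As a sketch of this workflow, using the node name from the fabric membership output above (the exact prompts vary by software release):

admin@RTP_Apic1:~> attach rtp_leaf1

Password:

rtp_leaf1# ip addr show dev eth0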

Verification of the OOB management interface (eth0) on fabric node rtp_leaf1:

rtp_leaf1# ip addr show dev eth0

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000

link/ether 88:f0:31:db:e7:f0 brd ff:ff:ff:ff:ff:ff

inet 10.122.254.241/24 brd 10.122.254.255 scope global eth0

inet6 fe80::8af0:31ff:fedb:e7f0/64 scope link

valid_lft forever preferred_lft forever

rtp_leaf1#

Verification of the OOB management interface (eth0) on fabric node rtp_leaf2:

rtp_leaf2# ip addr show dev eth0

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000

link/ether 00:22:bd:f8:34:c0 brd ff:ff:ff:ff:ff:ff

inet 10.122.254.242/24 brd 10.122.254.255 scope global eth0

inet6 fe80::222:bdff:fef8:34c0/64 scope link

valid_lft forever preferred_lft forever

rtp_leaf2#

Verification of the OOB management interface (eth0) on fabric node rtp_leaf3:

rtp_leaf3# ip addr show dev eth0

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000

link/ether 7c:69:f6:10:6d:18 brd ff:ff:ff:ff:ff:ff

inet 10.122.254.243/24 brd 10.122.254.255 scope global eth0

inet6 fe80::7e69:f6ff:fe10:6d18/64 scope link

valid_lft forever preferred_lft forever

rtp_leaf3#

When looking at the spines, the command used is show interface mgmt0 to ensure the
proper IP address and netmask are assigned. Verification of the OOB management
interface on rtp_spine1:

rtp_spine1# show int mgmt0

mgmt0 is up

admin state is up,

Hardware: GigabitEthernet, address: 0022.bdfb.f256 (bia 0022.bdfb.f256)

Internet Address is 10.122.254.244/24

MTU 9000 bytes, BW 1000000 Kbit, DLY 10 usec

reliability 255/255, txload 1/255, rxload 1/255

Encapsulation ARPA, medium is broadcast

Port mode is routed

full-duplex, 1000 Mb/s

Beacon is turned off

Auto-Negotiation is turned on

Input flow-control is off, output flow-control is off

Auto-mdix is turned off

EtherType is 0x0000

1 minute input rate 0 bits/sec, 0 packets/sec

1 minute output rate 0 bits/sec, 0 packets/sec

Rx

256791 input packets 521 unicast packets 5228 multicast packets

251042 broadcast packets 26081550 bytes

Tx

679 output packets 456 unicast packets 217 multicast packets

6 broadcast packets 71294 bytes

rtp_spine1#

To verify that the spine has the proper default gateway configuration, use the com-
mand ip route show as seen here for rtp_spine1:

rtp_spine1# ip route show

default via 10.122.254.1 dev eth6

10.122.254.0/24 dev eth6 proto kernel scope link src 10.122.254.244

127.1.0.0/16 dev psdev0 proto kernel scope link src 127.1.1.27

rtp_spine1#

The same validation for rtp_spine2 looks similar to rtp_spine1, as shown:

rtp_spine2# show int mgmt0

mgmt0 is up

admin state is up,

Hardware: GigabitEthernet, address: 0022.bdfb.fa00 (bia 0022.bdfb.fa00)

Internet Address is 10.122.254.245/24

MTU 9000 bytes, BW 1000000 Kbit, DLY 10 usec

reliability 255/255, txload 1/255, rxload 1/255

Encapsulation ARPA, medium is broadcast

Port mode is routed

full-duplex, 1000 Mb/s

Beacon is turned off

Auto-Negotiation is turned on

Input flow-control is off, output flow-control is off

Auto-mdix is turned off

EtherType is 0x0000

1 minute input rate 0 bits/sec, 0 packets/sec

1 minute output rate 0 bits/sec, 0 packets/sec

Rx

256216 input packets 345 unicast packets 5218 multicast packets

250653 broadcast packets 26007756 bytes

Tx

542 output packets 312 unicast packets 225 multicast packets

5 broadcast packets 59946 bytes

rtp_spine2#

And to see the routing table:

rtp_spine2# ip route show

default via 10.122.254.1 dev eth6

10.122.254.0/24 dev eth6 proto kernel scope link src 10.122.254.245

127.1.0.0/16 dev psdev0 proto kernel scope link src 127.1.1.27

rtp_spine2#

To verify APIC routing, use the command cat /etc/iproute2/rt_tables:

admin@RTP_Apic1:~> cat /etc/iproute2/rt_tables

# reserved values

255 local

254 main

253 default

0 unspec

# local

#1 inr.ruhep

2 oobmgmt

1 overlay

admin@RTP_Apic1:~>

The output of the rt_tables file tells us that there are two additional routing tables on our sample
reference topology: one for out-of-band management and one for the overlay.

The next output to verify is the out-of-band management routing table entries, using the
command ip route show table oobmgmt. There should be a default route pointed at the
default gateway IP address and out-of-band management interface (dev oobmgmt), with a
source IP address that matches the IP address of the out-of-band management interface:

admin@RTP_Apic1:~> ip route show table oobmgmt

default via 10.122.254.1 dev oobmgmt src 10.122.254.211

10.122.254.0/24 dev oobmgmt scope link

169.254.254.0/24 dev lxcbr0 scope link

admin@RTP_Apic1:~>

The next output to verify is the output of ip rule show, which shows how the APIC
chooses which routing table is used for the lookup:

admin@RTP_Apic1:~> ip rule show

0: from all lookup local

32762: from 10.122.254.211 lookup oobmgmt

32763: from 172.16.0.1 lookup overlay

32764: from 172.16.0.1 lookup overlay

32765: from 10.122.254.211 lookup oobmgmt

32766: from all lookup main

32767: from all lookup default

admin@RTP_Apic1:~>

Finally, the ip route show command will show how the global routing table is configured
on an APIC for out-of-band management. The oobmgmt default route carries a metric of 16,
which has no impact in this situation; however, if an inband management configuration were
applied, the inband default route would have no metric and would be preferred over the
out-of-band management route.

admin@RTP_Apic3:~> ip route show

default via 10.122.254.1 dev oobmgmt metric 16

10.122.254.0/24 dev oobmgmt proto kernel scope link src 10.122.254.213

169.254.1.0/24 dev teplo-1 proto kernel scope link src 169.254.1.1

169.254.254.0/24 dev lxcbr0 proto kernel scope link src 169.254.254.254

172.16.0.0/16 via 172.16.0.30 dev bond0.3500 src 172.16.0.3

172.16.0.30 dev bond0.3500 scope link src 172.16.0.3

admin@RTP_Apic3:~>

To ensure that the VRFs are configured on the fabric nodes, verify with the output of show vrf:

rtp_leaf1# show vrf

VRF-Name                           VRF-ID State    Reason                        

 black-hole                              3 Up       --                            

 management                              2 Up       --

 overlay-1                               4 Up       --

rtp_leaf1# 

Problem Description

Cannot reach a fabric node via SSH.



Symptom

All three APICs are accessible via the out-of-band management network via SSH, HTTPS, ping,
etc. The fabric nodes are only accessible via ping, but should be accessible via SSH as well.

Verification

• The switch opens up ports using the Linux iptables tool. However, the current
state of the tables cannot be viewed without root access. It is still possible
to verify which ports are open by running an nmap scan against
a fabric node:

Computer:tmp user1$ nmap -A -T5 -PN 10.122.254.241

Starting Nmap 6.46 ( http://nmap.org ) at 2014-10-15 09:29 PDT

Nmap scan report for rtp-leaf1.cisco.com (10.122.254.241)

Host is up (0.082s latency).

Not shown: 998 filtered ports

PORT STATE SERVICE VERSION

179/tcp closed bgp

443/tcp open http nginx 1.4.0

|_http-methods: No Allow or Public header in OPTIONS response (status code 400)

|_http-title: 400 The plain HTTP request was sent to HTTPS port

| ssl-cert: Subject: commonName=APIC/organizationName=Default Company Ltd/stateOrPro-

vinceName=CA/countryName=US

| Not valid before: 2013-11-13T18:43:13+00:00

|_Not valid after: Can't parse; string is "20580922184313Z"

|_ssl-date: 2020-02-18T08:57:12+00:00; +5y125d16h27m34s from local time.

| tls-nextprotoneg:

|_ http/1.1

Service detection performed. Please report any incorrect results at http://nmap.org/

submit/ .

Nmap done: 1 IP address (1 host up) scanned in 16.61 seconds

• The output shows that only the bgp and https ports are unfiltered, and ssh is not
open on this fabric node. This indicates that the policy has not been fully pushed
down to the fabric nodes.
• Reviewing the policy on the APIC reveals that the subnet is missing from the
External Network Instance Profile:

Resolution

Once a subnet is added in the GUI, the ports are added to iptables on the fabric nodes,
and the node is then accessible via SSH:

$ ssh admin@10.122.254.241

Password:

Cisco Nexus Operating System (NX-OS) Software

TAC support: http://www.cisco.com/tac

Copyright (c) 2002-2014, Cisco Systems, Inc. All rights reserved.

The copyrights to certain works contained in this software are owned by other

third parties and used and distributed under license. Certain components of this

software are licensed under the GNU General Public License (GPL) version 2.0 or

the GNU Lesser General Public License (LGPL) Version 2.1. A copy of each such

license is available at

http://www.opensource.org/licenses/gpl-2.0.php and

http://www.opensource.org/licenses/lgpl-2.1.php

rtp_leaf1#

Problem Description

APIC or fabric node not reachable via out-of-band management interface.

Symptom

When committing a node management policy change, or when clearing the configuration
of a fabric node, decommissioning that fabric node, and re-accepting that node back into
the fabric through fabric membership, the out-of-band IP connectivity to an APIC and/
or fabric switch gets lost. In this case, rtp_leaf2 and RTP_Apic2 lost IP connectivity via
the out-of-band management interfaces.

Verification

Upon verifying the fabric, scenarios such as overlapping IP addresses between the APIC
and switch lead to loss of connectivity:

admin@RTP_Apic2:~> ip add show dev oobmgmt

23: oobmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP

link/ether 24:e9:b3:15:dd:60 brd ff:ff:ff:ff:ff:ff

inet 10.122.254.212/24 brd 10.122.254.255 scope global oobmgmt

inet6 fe80::26e9:b3ff:fe15:dd60/64 scope link

valid_lft forever preferred_lft forever

admin@RTP_Apic2:~>

and the leaf:

rtp_leaf2# ip addr show dev eth0

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000

link/ether 00:22:bd:f8:34:c0 brd ff:ff:ff:ff:ff:ff

inet 10.122.254.212/24 brd 10.122.254.255 scope global eth0

inet6 fe80::222:bdff:fef8:34c0/64 scope link

valid_lft forever preferred_lft forever

rtp_leaf2#

• When checking the fabric node policies, the default Node Management Address
policy shows that there are no connectivity groups applied to this default policy.

Resolution

When a fabric node joins the fabric, it randomly gets an IP address assigned from the
pool, and there are a few activities that can cause the IP address on a node to change.
Generally speaking, any activity that causes the APIC or fabric node to come up from
scratch and be removed from the network, or be assigned to a new policy, can cause it to
be readdressed.

In situations where a fabric member is simply renumbered, there may only be a need to
investigate what new IP address was assigned. In some other rare circumstances, where the IP
address overlaps with another device, Cisco TAC should be contacted to investigate further.

Common Network Services

Overview

This chapter covers the common network services like DNS, NTP, DHCP, etc. A common
network service is any service that can be shared between the fabric nodes or tenants.

These services are handled differently in the way they are configured within the fabric.  

• DNS: DNS profiles are configured globally as fabric policies and can then be
applied as needed via a DNS label at the private network (context) level.
• NTP: This is configured as a pod level policy.
• DHCP: DHCP relay is configured at a tenant level.

Fabric Verification

For most common network shared services configurations, if the management tenant
or EPG is not configured or not working, the shared services policies will not be pushed
down to the fabric nodes. Management tenant and EPG configuration should be verified
along with shared services configuration.

DNS

• APIC
o Verification is done first by looking at the management information tree, then by
seeing how that policy is applied to the actual APIC configuration.
o Verify that a DNS profile has been created by traversing to the following
directory and using the cat command on the  summary  file:  /aci/fabric/
fabric-policies/global-policies/dns-profiles/default

admin@RTP_Apic1:~> cd /aci/fabric/fabric-policies/global-policies/dns-profiles/default

admin@RTP_Apic1:default> cat summary

# dns-profile

name : default

description :

ownerkey :
ownertag :

management-epg : tenants/mgmt/node-management-epgs/default/out-of-band/default

dns-providers:

address preferred

-------------- ---------

171.70.168.183 yes

173.36.131.10 no

dns-domains:

name default description

--------- ------- -----------

cisco.com yes

o It should be ensured that


1. The management-epg is pointing at a management EPG distinguished name
2. There is at least one dns-provider configured
3. There is a dns-domain configured
o Next, the DNS label verification can be done by changing to the following
directory and looking at the summary file: /aci/tenants/mgmt/networking/
private-networks/oob/dns-profile-labels/default

admin@RTP_Apic1:default> cd /aci/tenants/mgmt/networking/private-networks/oob/

dns-profile-labels/default

admin@RTP_Apic1:default> cat summary

# dns-lbl

name : default

description :

ownerkey :

ownertag :

tag : yellow-green

o When the policies are applied, they push the DNS configuration down to Linux
on the APIC. That configuration can be verified by looking at the /etc/resolv.
conf file.

admin@RTP_Apic1:default> cat /etc/resolv.conf

# Generated by IFC

search cisco.com

nameserver 171.70.168.183

nameserver 173.36.131.10

o The last verification step for the APIC would be to actually resolve a host using
the host command and then ping that host.

admin@RTP_Apic1:default> host www.cisco.com

www.cisco.com is an alias for www.cisco.com.akadns.net.

www.cisco.com.akadns.net is an alias for origin-www.cisco.com.

origin-www.cisco.com has address 72.163.4.161

origin-www.cisco.com has IPv6 address 2001:420:1101:1::a

admin@RTP_Apic1:default> ping www.cisco.com

PING origin-www.cisco.com (72.163.4.161) 56(84) bytes of data.

64 bytes from www1.cisco.com (72.163.4.161): icmp_seq=1 ttl=238 time=29.3 ms

64 bytes from www1.cisco.com (72.163.4.161): icmp_seq=2 ttl=238 time=29.0 ms

^C

--- origin-www.cisco.com ping statistics ---

2 packets transmitted, 2 received, 0% packet loss, time 1743ms

rtt min/avg/max/mdev = 29.005/29.166/29.328/0.235 ms

• Fabric Nodes

The policy that is applied needs to be looked at by inspecting the raw management
information tree. Once that is verified, the next step is to look at the DNS configuration
that is applied to the fabric node as a result of that policy.

o Verify that a DNS policy is applied by changing to the following directory and
listing out the contents: /mit/uni/fabric/dnsp-default

rtp_leaf1# cd /mit/uni/fabric/dnsp-default

rtp_leaf1# ls -1

dom-cisco.com

mo

prov-[171.70.168.183]

prov-[173.36.131.10]

rsProfileToEpg

rsProfileToEpg.link

rsProfileToEpp

rsProfileToEpp.link

rtdnsProfile-[uni--ctx-[uni--tn-mgmt--ctx-oob]--dnslbl-default]

summary

o The following should be seen:


1. The DNS providers listed as prov-[ipaddress]
2. The DNS domains listed as dom-[domainname]
3. The summary file in the rtdnsProfile-... directory has a tDn that points to a valid
dnslabel 
4. The rsProfileToEpg.link should exist and resolve to a valid place in the
management information tree
5. The rsProfileToEpp.link should exist and resolve to a valid place in the
management information tree

o Verifying the dnslabel on the fabric node can be done by looking at the summary
file in the rtdnsProfile-... directory, taking the tDn reference, prefacing it with
/mit, and using cat on the summary file in the resulting directory.

rtp_leaf1# cat rtdnsProfile-[uni--ctx-[uni--tn-mgmt--ctx-oob]--dnslbl-default]/summa-

ry

# DNS Profile Label

tDn : uni/ctx-[uni/tn-mgmt/ctx-oob]/dnslbl-default

childAction :

dn : uni/fabric/dnsp-default/rtdnsProfile-[uni/ctx-[uni/tn-mgmt/ctx-oob]/dnslbl-de-

fault]

lcOwn : local

modTs : 2014-10-15T14:16:14.850-04:00

rn : rtdnsProfile-[uni/ctx-[uni/tn-mgmt/ctx-oob]/dnslbl-default]

status :

tCl : dnsLblDef

rtp_leaf1# cat /mit/uni/ctx-\[uni--tn-mgmt--ctx-oob\]/dnslbl-default/summary

# DNS Profile Label

name : default

childAction :

descr :

dn : uni/ctx-[uni/tn-mgmt/ctx-oob]/dnslbl-default

lcOwn : policy

modTs : 2014-10-15T14:16:14.850-04:00

monPolDn :

ownerKey :

ownerTag :

rn : dnslbl-default

status :

tag : yellow-green

o The policy that is pushed to the fabric node results in the DNS configuration
being applied to Linux. The DNS configuration can be verified by first looking at
/etc/dcos_resolv.conf to verify DNS is enabled and /etc/resolv.conf to verify
how DNS is configured.

rtp_leaf1# cat /etc/dcos_resolv.conf


# DNS enabled

rtp_leaf1# cat /etc/resolv.conf

search cisco.com

nameserver 171.70.168.183

nameserver 173.36.131.10

o On the fabric nodes, the host command is not available, so ping is the best way to
try to resolve a host.

rtp_leaf1# ping www.cisco.com

PING origin-www.cisco.com (72.163.4.161): 56 data bytes

64 bytes from 72.163.4.161: icmp_seq=0 ttl=238 time=29.153 ms

64 bytes from 72.163.4.161: icmp_seq=1 ttl=238 time=29.585 ms

^C--- origin-www.cisco.com ping statistics ---

2 packets transmitted, 2 packets received, 0% packet loss

round-trip min/avg/max/stddev = 29.153/29.369/29.585/0.216 ms

NTP

Note: NTP can be configured with either an IP address or a hostname, but when
configured with a hostname, DNS must be configured in order to resolve the hostname.
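
A quick check before troubleshooting NTP itself is to confirm that the configured hostname resolves on the APIC. The server name below matches the policy used in this section; the resolved address is illustrative:

admin@RTP_Apic1:~> host ntp.esl.cisco.com

ntp.esl.cisco.com has address 171.68.38.66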

• APIC
o NTP policies are applied globally by first applying a global pod-selector
policy which points to a policy-group. This can be verified by changing to
/aci/fabric/fabric-policies/pod-policies/pod-selector-default-all and viewing
the summary file. In this case the policy-group is set to RTPFabric1:

admin@RTP_Apic1:~> cd /aci/fabric/fabric-policies/pod-policies/pod-selector-de-

fault-all

admin@RTP_Apic1:pod-selector-default-all> cat summary

# pod-selector
name                 : default

type                 : all

description          : 

ownerkey             : 

ownertag             : 

fabric-policy-group  : fabric/fabric-policies/pod-policies/policy-groups/RTPFabric1

o Make note of the policy-group name, RTPFabric1.


o The Pod policy-group can be verified by changing to the directory /aci/
fabric/fabric-policies/pod-policies/policy-groups/ and viewing the summary file:

admin@RTP_Apic1:pod-policies> cd /aci/fabric/fabric-policies/pod-policies/policy-groups/

admin@RTP_Apic1:policy-groups> cat summary

policy-groups:

name        date-time-policy  isis-policy  coop-group-policy  bgp-route-reflector-  communication-policy  snmp-policy

                                                             policy                                                

----------  ----------------  -----------  -----------------  --------------------  --------------------  -----------

RTPFabric1  ntp.esl.cisco.com default      default            default               default               default  

o Ensure that the date-time-policy is pointed at the proper date-time-policy name


o Verify that an NTP policy has been created by traversing to the following directory
and using the cat command on the summary file for the specific date-time policy
configured: /aci/fabric/fabric-policies/pod-policies/policies/date-and-time/

admin@RTP_Apic1:> cd /aci/fabric/fabric-policies/pod-policies/policies/date-and-time/

admin@RTP_Apic1:> cat date-and-time-policy-ntp.esl.cisco.com/summary

# date-and-time-policy

name : default

description :

administrative-state : enabled

authentication-state : disabled

ownerkey :

ownertag :

ntp-servers:

host-name-ip-address preferred minimum-polling- maximum-polling- management-epg

interval interval

-------------------- --------- ---------------- ---------------- ---------------------

ntp.esl.cisco.com yes 4 6 tenants/mgmt/

node-management-epgs/

default/out-of-band/

default

o Ensure the administrative state is enabled

o Ensure the NTP server is shown

o Ensure the management-epg is shown and resolves to a valid management EPG.

o When the NTP policy is applied on the APIC, it is pushed down to Linux as an NTP
configuration. This can be verified using the ntpstat command.
 
admin@RTP_Apic1:date-and-time> ntpstat

synchronised to NTP server (171.68.38.66) at stratum 2

time correct to within 952 ms

polling server every 64 s

o The APIC should be synchronized to the NTP server.


o Netstat can also be checked on the APIC to ensure that the APIC is listening
on port 123 on the expected interfaces:

admin@RTP_Apic1:date-and-time> netstat -anu | grep :123

udp 0 0 172.16.0.1:123 0.0.0.0:*

udp 0 0 10.122.254.211:123 0.0.0.0:*

udp 0 0 169.254.1.1:123 0.0.0.0:*

udp 0 0 169.254.254.254:123 0.0.0.0:*

udp 0 0 127.0.0.1:123 0.0.0.0:*

udp 0 0 0.0.0.0:123 0.0.0.0:*

udp 0 0 ::1:123 :::*

udp 0 0 fe80::92e2:baff:fe4b:fc7:123 :::*

udp 0 0 fe80::38a5:a2ff:fe9a:4eb:123 :::*

udp 0 0 fe80::f88d:a5ff:fe4c:419:123 :::*

udp 0 0 fe80::ce7:b9ff:fe50:4481:123 :::*

udp 0 0 fe80::3c79:62ff:fef0:214:123 :::*

udp 0 0 fe80::26e9:b3ff:fe15:a0e:123 :::*

udp 0 0 fe80::e89f:1dff:fedf:1f6:123 :::*

udp 0 0 fe80::f491:1ff:fe9f:f1de:123 :::*

udp 0 0 fe80::dc2d:dfff:fe88:20d:123 :::*

udp 0 0 fe80::e4cb:caff:feec:5bd:123 :::*

udp 0 0 fe80::a83d:1ff:fe54:597:123 :::*

udp 0 0 fe80::8c71:63ff:feb2:f4a:123 :::*

udp 0 0 :::123 :::*



• Fabric nodes
o Verify that an NTP policy has been created by traversing to the following directory,
using the cat command on the summary file, and listing out the directory:
/mit/uni/fabric/time-default

rtp_leaf1# cd /mit/uni/fabric/time-default

rtp_leaf1# cat summary

# Date and Time Policy

name : default

adminSt : enabled

authSt : disabled

childAction :

descr :

dn : uni/fabric/time-default

lcOwn : resolveOnBehalf

modTs : 2014-10-15T13:11:19.747-04:00

monPolDn : uni/fabric/monfab-default

ownerKey :

ownerTag :

rn : time-default

status :

uid : 0

rtp_leaf1#

rtp_leaf1# ls -1

issues

mo

ntpprov-10.81.254.202

rtfabricTimePol-[uni--fabric--funcprof--podpgrp-RTPFabric1]

summary

o Ensure the adminSt is enabled


o Ensure the ntpprov-* directory is for the proper NTP provider.
o When the NTP policy is pushed to the fabric node, it resolves to an NTP configuration
in Linux that gets applied. It can be verified using both the show ntp peers and show
ntp peer-status commands.
 

rtp_leaf1# show ntp peers

--------------------------------------------------

 Peer IP Address               Serv/Peer

--------------------------------------------------

 10.81.254.202                 Server (configured)

rtp_leaf1# show ntp peer-status

Total peers : 1

* - selected for sync, + - peer mode(active),

- - peer mode(passive), = - polled in client mode

    remote                local                st   poll   reach delay    vrf

-------------------------------------------------------------------------------

*10.81.254.202           0.0.0.0               1    64     377   0.00041 management

o Ensure that the Peer IP Address is correct


o Ensure that the peer is a server
o Ensure that the vrf is shown as management

DHCP Relay

There are two main components in the DHCP Relay configuration. The first is the policy,
which is configured under a tenant. The policy contains the DHCP server address as well
as the EPG through which the DHCP server is reached.

The second component is the DHCP Relay label under the tenant BD, which links to the
DHCP Relay Policy.

• APIC
o The DHCP Relay policy can be verified through shell access by cd to
/mit/uni/tn-<tenant name>/relayp-<DHCP Relay Profile Name>

admin@RTP_APIC1:relayp-DHCP_Relay_Profile> ls

mo

provdhcp-[uni--tn-Prod--out-L3out--instP-ExtL3EPG]

rsprov-[uni--tn-Prod--out-L3out--instP-ExtL3EPG]

rsprov-[uni--tn-Prod--out-L3out--instP-ExtL3EPG].link

rtlblDefToRelayP-[uni--bd-[uni--tn-Prod--BD-MiddleWare]-isSvc-no--dhcplbldef-DH-

CP_Relay_Profile]

summary

admin@RTP_APIC1:relayp-DHCP_Relay_Profile> cat summary

# DHCP Relay Policy

name         : DHCP_Relay_Profile

childAction  : 

descr        : 

dn           : uni/tn-Prod/relayp-DHCP_Relay_Profile

lcOwn        : local

modTs        : 2014-10-16T15:43:03.139-07:00

mode         : visible

monPolDn     : uni/tn-common/monepg-default

owner        : infra

ownerKey     : 

ownerTag     : 

rn           : relayp-DHCP_Relay_Profile

status       : 

uid          : 15374

o In this last example, the DHCP relay policy name is DHCP_Relay_Profile.
The provider is the EPG where the DHCP server is located. In this example
the server is located through a Layer 3 external routed domain named L3out.
o The dhcpRsProv contains the IP address of the server. From the
DHCP relay policy directory, cd to the rsprov-* directory, which in this
example is rsprov-[uni--tn-Prod--out-L3out--instP-ExtL3EPG]

admin@RTP_APIC1:relayp-DHCP_Relay_Profile> cd rsprov-\[uni--tn-Prod--out-L3out--in-

stP-ExtL3EPG\]

admin@RTP_APIC1:rsprov-[uni--tn-Prod--out-L3out--instP-ExtL3EPG]> ls

mo  summary

admin@RTP_APIC1:rsprov-[uni--tn-Prod--out-L3out--instP-ExtL3EPG]> cat summary

# DHCP Provider

tDn          : uni/tn-Prod/out-L3out/instP-ExtL3EPG

addr         : 10.30.250.1

childAction  : 

dn           : uni/tn-Prod/relayp-DHCP_Relay_Profile/rsprov-[uni/tn-Prod/out-L3out/

instP-ExtL3EPG]

forceResolve : no

lcOwn        : local

modTs        : 2014-10-16T15:43:03.139-07:00

monPolDn     : uni/tn-common/monepg-default

rType        : mo

rn           : rsprov-[uni/tn-Prod/out-L3out/instP-ExtL3EPG]

state        : formed

stateQual    : none

status       : 

tCl          : l3extInstP

tType        : mo

uid          : 15374

 

• Fabric nodes
o From the fabric nodes, confirm that the relay is configured properly
with the CLI command "show dhcp internal info relay address"; "show
ip dhcp relay" presents similar information.

rtp_leaf1# show dhcp internal info relay address

DHCP Relay Address Information:

DHCP relay intf Vlan9 has 3 relay addresses:

DHCP relay addr: 10.0.0.1, vrf: overlay-1, visible, gateway IP: 10.0.0.30 

DHCP relay addr: 10.0.0.2, vrf: overlay-1, invisible, gateway IP:

DHCP relay addr: 10.0.0.3, vrf: overlay-1, invisible, gateway IP:

DHCP relay intf Vlan17 has 1 relay addresses:

DHCP relay addr: 10.30.250.1, vrf: Prod:Prod, visible, gateway IP: 10.0.0.101

10.30.250.2 

DHCP relay intf loopback0 has 3 relay addresses:

DHCP relay addr: 10.0.0.1, vrf: overlay-1, invisible, gateway IP: 

DHCP relay addr: 10.0.0.2, vrf: overlay-1, invisible, gateway IP: 

DHCP relay addr: 10.0.0.3, vrf: overlay-1, invisible, gateway IP: 

  o The DHCP relay statistics on the leaf can be viewed with "show ip dhcp relay
statistics":

Leaf-1# show ip dhcp relay statistics 

----------------------------------------------------------------------

Message Type             Rx              Tx           Drops

----------------------------------------------------------------------

Discover                  5               5               0

Offer                     1               1               0

Request(*)                4               4               0

Ack                       7               7               0

Release(*)                0               0               0

Decline                   0               0               0

Nack                      0               0               0

Inform                    3               3               0

----------------------------------------------------------------------

Total                     28              28              0

----------------------------------------------------------------------

Problem Description

After configuring specific shared services (DNS, NTP, SNMP, etc.), there are issues with
connectivity to those services.

Symptom 1

The APICs can resolve hostnames via DNS, but the fabric nodes are not able to.

Verification

• A fabric node is unable to resolve a hostname.

rtp_leaf1# ping www.cisco.com

ping: unknown host

rtp_leaf1# 

• An APIC is able to resolve a hostname.

admin@RTP_Apic1:~> ping www.cisco.com

PING origin-www.cisco.com (72.163.4.161) 56(84) bytes of data.

64 bytes from www1.cisco.com (72.163.4.161): icmp_seq=1 ttl=238 time=29.4 ms

64 bytes from www1.cisco.com (72.163.4.161): icmp_seq=2 ttl=238 time=29.1 ms

^C

--- origin-www.cisco.com ping statistics ---

2 packets transmitted, 2 received, 0% packet loss, time 1351ms

rtt min/avg/max/mdev = 29.173/29.334/29.495/0.161 ms

• Since the problem seems isolated to the fabric nodes, let's start there. Verify the
policy is correct on the fabric node.

rtp_leaf1# cd /mit/uni/fabric/dnsp-default

rtp_leaf1# ls -al

total 1

drw-rw---- 1 admin admin 512 Oct 15 17:46 .

drw-rw---- 1 admin admin 512 Oct 15 17:46 ..

-rw-rw---- 1 admin admin   0 Oct 15 17:46 mo

-r--r----- 1 admin admin   0 Oct 15 17:46 summary



The fabric node has no policy; the mo and summary files are empty. Further inspection
should take place on the policy in the APIC configuration. All policy for the fabric nodes
comes from the APIC, so that is where the problem is most likely to be found.

• From the APIC the policy is applied:

admin@RTP_Apic1:default> cat summary

# dns-profile

name           : default

description    :

ownerkey       :

ownertag       :

management-epg : tenants/mgmt/node-management-epgs/default/out-of-band/default

dns-providers:

address         preferred

--------------  ---------

171.70.168.183  yes      

173.36.131.10   no      

dns-domains:

name       default  description

---------  -------  -----------

cisco.com  yes                

 

• However, the DNS label is missing.

admin@RTP_Apic1:default> cd /aci/tenants/mgmt/node-management-epgs/default/out-

of-band/default

admin@RTP_Apic1:default> cat summary

# out-of-band-management-epg

name                 : default

configuration-issues :

configuration-state  : applied

qos-priority         : unspecified

description          :

provided-out-of-band-contracts:

qos-priority  oob-contract  state

------------  ------------  ------

unspecified   oob_contract  formed

tags:

name

----

admin@RTP_Apic1:default> 

• From the GUI, the missing label on the out-of-band management configuration can be seen.

Resolution

Once the DNS label "default" is added to the private network, the fabric node is able to
resolve hostnames.

Symptom 2

NTP is not functional on any of the fabric nodes, but the APICs have NTP synchronized.

Verification

• APIC
o There are faults on the date-time policy for all of the fabric nodes stating
that the configuration failed: Datetime Policy Configuration Failed with
issues: access-epg-not-specified

o The APIC does not have a management-epg assigned.

admin@RTP_Apic1:~> cd /aci/fabric/fabric-policies/pod-policies/policies/date-and-

time/date-and-time-policy-ntp.esl.cisco.com

admin@RTP_Apic1:date-and-time-policy-ntp.esl.cisco.com> cat summary

# date-and-time-policy

name                 : ntp.esl.cisco.com

description          :

administrative-state : enabled

authentication-state : disabled

ownerkey             :

ownertag             :

ntp-servers:

host-name-ip-address  preferred  minimum-polling-  maximum-polling-  management-epg

                                interval          interval                        

--------------------  ---------  ----------------  ----------------  --------------

ntp.esl.cisco.com     yes        4                 6                              

o This can be seen in the GUI as well.

o This is likely why NTP is not synchronized on the fabric nodes: the fabric
nodes are not being told which VRF to use to reach the NTP server.
o The APICs are listening on port 123, and because this fabric only has
out-of-band management configured, the APICs are able to reach the NTP
server over this interface.
 
admin@RTP_Apic1:date-and-time-policy-ntp.esl.cisco.com> ntpstat

synchronized to NTP server (171.68.38.65) at stratum 2

time correct to within 976 ms

polling server every 64 s

admin@RTP_Apic1:date-and-time-policy-ntp.esl.cisco.com> netstat -anu | grep :123

udp 0 0 172.16.0.1:123 0.0.0.0:*

udp 0 0 10.122.254.211:123 0.0.0.0:*

udp 0 0 169.254.1.1:123 0.0.0.0:*

udp 0 0 169.254.254.254:123 0.0.0.0:*

udp 0 0 127.0.0.1:123 0.0.0.0:*

udp 0 0 0.0.0.0:123 0.0.0.0:*

udp 0 0 ::1:123 :::*

udp 0 0 fe80::92e2:baff:fe4b:fc7:123 :::*

udp 0 0 fe80::38a5:a2ff:fe9a:4eb:123 :::*


udp 0 0 fe80::f88d:a5ff:fe4c:419:123 :::*

udp 0 0 fe80::ce7:b9ff:fe50:4481:123 :::*

udp 0 0 fe80::3c79:62ff:fef0:214:123 :::*

udp 0 0 fe80::26e9:b3ff:fe15:a0e:123 :::*

udp 0 0 fe80::e89f:1dff:fedf:1f6:123 :::*

udp 0 0 fe80::f491:1ff:fe9f:f1de:123 :::*

udp 0 0 fe80::dc2d:dfff:fe88:20d:123 :::*

udp 0 0 fe80::e4cb:caff:feec:5bd:123 :::*

udp 0 0 fe80::a83d:1ff:fe54:597:123 :::*

udp 0 0 fe80::8c71:63ff:feb2:f4a:123 :::*

udp 0 0 :::123 :::*

• Fabric Nodes
o The leafs do not have any NTP policy.

rtp_leaf1# cd /mit/uni/fabric/time-default

rtp_leaf1# cat summary

cat: summary: No such file or directory

rtp_leaf1# cat mo

cat: mo: No such file or directory

o Because the leafs do not have any policy, they also do not have any NTP
configuration or peers.

rtp_leaf1# show ntp peer-status

Total peers : 1

* - selected for sync, + - peer mode(active),

- - peer mode(passive), = - polled in client mode

   remote                local                st   poll   reach delay   vrf

-------------------------------------------------------------------------------

=0.0.0.0                 0.0.0.0               0    1      0     0.00000

rtp_leaf1# show ntp peers

--------------------------------------------------

Peer IP Address Serv/Peer

--------------------------------------------------

0.0.0.0 Server (configured)



Resolution

By adding a management EPG of default (Out-of-band) to the date-time policy, NTP
is able to synchronize on the fabric nodes.

Symptom 3

The APICs do not synchronize with NTP, but the fabric nodes do.

Verification

• APICs
o The ntp daemon is not running.

admin@RTP_Apic1:pod-selector-default-all> ntpstat

Unable to talk to NTP daemon. Is it running?

admin@RTP_Apic1:pod-selector-default-all>

o The APICs have the date-time policy configured properly.

admin@RTP_Apic1:date-and-time-policy-ntp.esl.cisco.com> cat summary

# date-and-time-policy

name                 : ntp.esl.cisco.com

description          :

administrative-state : enabled

authentication-state : disabled

ownerkey             :

ownertag             : 

 ntp-servers:

host-name-ip-address  preferred  minimum-polling-  maximum-polling-  management-epg      

                                 interval          interval                              

--------------------  ---------  ----------------  ----------------  ---------------------

ntp.esl.cisco.com     yes        4                 6                 tenants/mgmt/        

                                                                    node-management-epgs/

                                                                    default/out-of-band/

                                                                     default  

           



o The APICs do have the proper fabric-policy-group as well

admin@RTP_Apic1:~> cd /aci/fabric/fabric-policies/pod-policies/pod-selector-default-all

admin@RTP_Apic1:pod-selector-default-all> cat summary

# pod-selector

name                 : default

type                 : all

description          :

ownerkey             :

ownertag             :

fabric-policy-group  : fabric/fabric-policies/pod-policies/policy-groups/RTPFabric1

o The APICs do not have the proper date-time-policy specified in the
policy-group.
admin@RTP_Apic1:pod-policies> cd /aci/fabric/fabric-policies/pod-policies/policy-groups/

admin@RTP_Apic1:policy-groups> cat summary

policy-groups:

name        date-time-policy  isis-policy  coop-group-policy  bgp-route-reflector-  communication-policy  snmp-policy

                                                             policy                                                

----------  ----------------  -----------  -----------------  --------------------  --------------------  -----------

RTPFabric1  default           default      default            default               default               default    

o This should be ntp.esl.cisco.com, but it is incorrectly set to default. There should
be a fault for this.
o The fault is on the Pod policy and states: Failed to form relation to MO
time-default of class datetimePol in context

• Fabric Nodes
o The fabric nodes are synchronized with the NTP server.

rtp_leaf1# show ntp peers

--------------------------------------------------

 Peer IP Address               Serv/Peer

--------------------------------------------------

 171.68.38.65                  Server (configured)

rtp_leaf1# show ntp peer-status

Total peers : 1

* - selected for sync, + - peer mode(active),

- - peer mode(passive), = - polled in client mode

   remote                local                st   poll   reach delay   vrf

-------------------------------------------------------------------------------

=171.68.38.65            0.0.0.0               1    64     377   0.07144 management

Symptom 4

The APICs and the fabric nodes do not synchronize with NTP.

Verification

• APICs
o The ntp daemon is not running on the APICs.

admin@RTP_Apic1:pod-selector-default-all> ntpstat

Unable to talk to NTP daemon. Is it running?

admin@RTP_Apic1:pod-selector-default-all> 

o The pod selector policy is missing the fabric-policy-group.

admin@RTP_Apic1:~> cd /aci/fabric/fabric-policies/pod-policies/pod-selector-de-

fault-all

admin@RTP_Apic1:pod-selector-default-all> cat summary

# pod-selector

name                 : default

type                 : all

description          :

ownerkey             :

ownertag             :

fabric-policy-group  :

o Without a fabric-policy-group applied to the pod-selector, the date-time
policy will not be applied to the pod-policy-group and the NTP daemon will
not start up. This is a problem that needs to be corrected. However,
verification needs to continue through the other parts of the configuration
to ensure that nothing else is broken.
o The policy-group config does look proper and points to the date-time-policy.

admin@RTP_Apic1:date-and-time> cd /aci/fabric/fabric-policies/pod-policies/policy-groups/

admin@RTP_Apic1:policy-groups> cat summary

policy-groups:

name        date-time-policy   isis-policy  coop-group-policy  bgp-route-reflector-  communication-policy  snmp-policy

                                                              policy                                                

----------  -----------------  -----------  -----------------  --------------------  --------------------  -----------

RTPFabric1  ntp.esl.cisco.com  default      default            default               default               default    



o The date-and-time policy is configured correctly.

admin@RTP_Apic1:date-and-time> cat date-and-time-policy-ntp.esl.cisco.com/summary

# date-and-time-policy

name                 : ntp.esl.cisco.com

description          :

administrative-state : enabled
authentication-state : disabled

ownerkey             :

ownertag             :

ntp-servers:

host-name-ip-address  preferred  minimum-polling-  maximum-polling-  management-epg      

                                 interval          interval                              

--------------------  ---------  ----------------  ----------------  ---------------------

ntp.esl.cisco.com     yes        4                 6                 tenants/mgmt/        

                                                                    node-management-epgs/

                                                                    default/out-of-band/

• Fabric Nodes
o The fabric nodes do not have any NTP configuration.

rtp_leaf1# show ntp peers

dn "sys/time" could not be found

Error executing command, check logs for details

rtp_leaf1# show ntp peer-status

dn "sys/time" could not be found

Error executing command, check logs for details

o There are no date-time policies on the fabric nodes.

rtp_leaf1# cd /mit/uni/fabric/time-default

bash: cd: /mit/uni/fabric/time-default: No such file or directory

Resolution

Once the Fabric Policy Group is set, the NTP daemon is started and NTP is synchronized
on the APICs.  In this case, no fault is shown anywhere.

Symptom 5

The DHCP client is not getting an IP address from the DHCP server.

Verification

Several issues could be the cause of this. The following steps can be run to verify the
cause of the issue, listed in a logical order moving from the policy through the leaf to
the DHCP server:

• The DHCP relay policy is properly applied as indicated in the overview section
• The endpoint is part of the EPG that is in the BD that contains the correct DHCP
relay policy.   This can be verified with the leaf CLI command "show endpoint
interface <interface ID> detail"

rtp_leaf1# show endpoint interface ethernet 1/13 detail

+---------------+---------------+-----------------+--------------+-------------+---------------------------

     VLAN/       Encap           MAC Address       MAC Info/       Interface     Endpoint Group

     Domain      VLAN            IP Address        IP Info                       Info

+---------------+---------------+-----------------+--------------+-------------+----------------------------

20                     vlan-1301    0024.81b5.d22b L                     eth1/13 Prod:commerceworkspace:MiddleWare



• If the endpoint is not present, confirm the fabric interface status with the leaf
CLI command "show interface ethernet 1/13"
o If the interface status is not Up, check the physical connection
o If the interface status is Up but is "out-of-service", this is typically an
indication that there is a misconfiguration. Confirm that the EPG points
to the proper domain and the domain is configured with the proper fabric
vlan pool and AEP.
o Check for faults and refer to the section Faults and Health Scores.
• The DHCP relay policy on the leaf where the client is attached is properly
configured as shown with the leaf CLI command "show dhcp internal info relay
address" shown in the overview section.
• The DHCP server can be reached from the leaf. One way to verify this is using
the leaf CLI command "iping" originated from the leaf using the tenant context
(see the example following this list).
• Check the DHCP relay statistics on the leaf with the leaf CLI command "show ip
dhcp relay statistics":
o If the Discover is not incrementing, check the fabric interface status where
the client is connected
o If the Discover stats are incrementing but the Offer is not, confirm that the
server can reach the BD SVI address
o If the Discover stats are incrementing but the Offer is not, confirm that a
proper contract is in place to not drop the DHCP Offer
• Confirm from the DHCP server side that the DHCP Discover is received
• From the DHCP server, confirm the GIADDR (relay agent) address is the expected
address and the proper DHCP scope for that subnet has been defined
• From the DHCP server, confirm that the DHCP Offer is sent and the destination
IP address of the relay where it is sent
• Confirm from the DHCP server that the DHCP relay agent address can be
reached via ping
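
As an example of the iping check referenced above, the ping is sourced from the tenant VRF. The VRF and server address match the relay configuration shown in the overview section, and the output below is a sketch:

rtp_leaf1# iping -V Prod:Prod 10.30.250.1

PING 10.30.250.1 (10.30.250.1): 56 data bytes

64 bytes from 10.30.250.1: icmp_seq=0 ttl=254 time=0.527 ms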

Resolution

The above verification steps should isolate whether the issue is with the policy, the
physical connectivity, the network, or the DHCP server.

Symptom 6

The DHCP client is getting an address, but not one from the expected subnet.

Verification

Several issues could be the cause of this. One possibility is that if there are multiple
subnets on a BD, the relay agent address (GIADDR) used will be the primary BD SVI
address. This is typically the first subnet configured on the BD, and it determines the
scope from which the DHCP server allocates the address.
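
To check which address is primary on the BD SVI, the SVI can be inspected from the leaf. The following is a sketch; the VLAN interface number and the addresses are assumptions based on the relay output shown in the overview section:

rtp_leaf1# show ip interface vlan17 vrf Prod:Prod

IP Interface Status for VRF "Prod:Prod"

vlan17, Interface status: protocol-up/link-up/admin-up

  IP address: 10.0.0.101, IP subnet: 10.0.0.0/24

  IP address: 10.30.250.2, IP subnet: 10.30.250.0/24 secondary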

Other steps to verify are:

• The DHCP relay policy is properly applied as indicated in the overview section
• The endpoint is part of the EPG that is in the BD that contains the correct DHCP
relay policy.   This can be verified with the leaf CLI command "show endpoint
interface <interface ID> detail"
 
rtp_leaf1# show endpoint interface ethernet 1/13 detail

+---------------+---------------+-----------------+--------------+-------------+----------------------------

     VLAN/       Encap           MAC Address       MAC Info/       Interface     Endpoint Group

     Domain      VLAN            IP Address        IP Info                       Info

+---------------+---------------+-----------------+--------------+-------------+----------------------------

20                     vlan-1301    0024.81b5.d22b L                     eth1/13 Prod:commerceworkspace:MiddleWare

• If the endpoint is not present, confirm the fabric interface status with the leaf
CLI command "show interface ethernet 1/13"
o If the interface status is not Up, check the physical connection
o If the interface status is Up but is "out-of-service", this is typically an
indication that there is a misconfiguration. Confirm that the EPG points
to the proper domain and the domain is configured with the proper fabric
vlan pool and AEP.
o Check for faults and refer to the section Faults and Health Scores.
• From the DHCP server, confirm the GIADDR (relay agent) address is the expected
address and the proper DHCP scope for that subnet has been defined.

Resolution

The above verification steps should isolate whether the issue is with the policy, the
physical connectivity, the network, or the DHCP server.

Unicast Data Plane Forwarding and Reachability

Overview

This chapter will cover unicast forwarding and reachability problems. These can include,
but are not limited to, end points not showing up in the forwarding tables, end points not
able to communicate with each other (problems not related to zoning-rule policy, i.e. contracts),
VLANs not being programmed, as well as incorrect configurations that can cause these
problems and the subsequent faults that are raised.

To troubleshoot packet forwarding issues, multiple shells will need to be used on a given
leaf or spine.

• CLI: The CLI will be used to run the well-known VSH commands and check the
concrete models on the switch. For example, "show vlan", "show endpoint", etc.
• vsh_lc: This is the line card shell and it will be used to check line card processes
and forwarding tables specific to the Application Leaf Engine (ALE) ASIC.
• Broadcom Shell: This shell is used to view information on the Broadcom ASIC.
This shell will not be covered, as it falls outside the scope of this book; it is
assumed that troubleshooting at the Broadcom Shell level is performed with the
assistance of the Cisco Technical Assistance Center (TAC).

In the course of troubleshooting, several different VLAN types may be encountered. A
thorough deep dive on the different types of VLANs is beyond the scope of this chapter
and book, but summarized in the picture below are some of the more important concepts
that are fundamental to understanding the use of VLANs in the ACI fabric. The
output from the below command will be discussed in more detail later in this chapter;
the command was run from the "vsh_lc" shell discussed above.

The “show vlan extended” command shows how the VLANs described above map to the Bridge
Domains and End Point Groups that have been configured on the fabric.

rtp_leaf1# show vlan extended

 VLAN Name                             Status    Ports                           

 ---- -------------------------------- --------- ------------------------------- 

13   infra:default                    active    Eth1/1, Eth1/2, Eth1/5, Eth1/35 <--- Infra VLAN

17   Prod:Web                         active    Eth1/27, Eth1/28, Po2, Po3 <--- BD_VLAN

21   Prod:MiddleWare                  active    Eth1/27, Eth1/28, Po2, Po3 <--- BD_VLAN

22   Prod:commerceworkspace:Middleware active    Eth1/27, Eth1/28, Po2, Po3 <--- EPG FD_VLAN

23   Prod:commerceworkspace:Web       active    Eth1/27, Eth1/28, Po2, Po3 <--- EPG FD_VLAN 

 VLAN Type  Vlan-mode  Encap                                                         

 ---- ----- ---------- -------------------------------                               

13   enet  CE         vxlan-16777209, vlan-3500 <--- Encap VLAN for infra

17   enet  CE         vxlan-16089026                                                

21   enet  CE         vxlan-16351138                                                

22   enet  CE         vlan-634 <--- Encap VLAN

23   enet  CE         vlan-600 <--- Encap VLAN

An understanding of the forwarding tables in the ALE ASIC is useful when troubleshooting
packet forwarding problems. The picture below summarizes two tables, and the subsequent
lookup that is performed on ingress and egress. There is a Local Station Table (LST) and a
Global Station Table (GST). These two tables are unified for Layer 2 and Layer 3, and these
entries can be displayed separately.

Verification - Endpoints

To verify end point reachability, the following commands can be used from the CLI.
Notice the legend that details how each end point was learned. The below sample output
displays end points learned locally (directly attached to the leaf) as well as vPC-attached.

rtp_leaf1# show endpoint

Legend:

O - peer-attached    H - vtep             a - locally-aged     S - static          

V - vpc-attached     p - peer-aged        L - local            M - span            

s - static-arp       B - bounce          

+---------------+---------------+-----------------+--------------+-------------+

     VLAN/       Encap           MAC Address       MAC Info/       Interface

     Domain      VLAN            IP Address        IP Info

+---------------+---------------+-----------------+--------------+-------------+

Prod:Prod                               10.0.0.101 L                          

27/Prod:Prod            vlan-700    0026.f064.0000 LpV                       po1

27/Prod:Prod            vlan-700    001b.54c2.2644 LpV                       po1

27/Prod:Prod            vlan-700    0026.980a.df44 LV                        po1

27/Prod:Prod            vlan-700    0000.0c9f.f2bc LV                        po1

35/Prod:Prod            vlan-670    0050.56bb.0164 LV                        po2

29/Prod:Prod            vlan-600    0050.56bb.cccf LV                        po2



34/Prod:Prod            vlan-636    0000.0c9f.f2bc LV                        po2

34/Prod:Prod            vlan-636    0026.980a.df44 LV                        po2

34/Prod:Prod            vlan-636    0050.56bb.ba9a LV                        po2

Test:Test                               10.1.0.101 L                          

overlay-1                            172.16.136.95 L                          

overlay-1                            172.16.136.96 L                          

13/overlay-1      vxlan-16777209    90e2.ba5a.9f30 L                      eth1/2

13/overlay-1      vxlan-16777209    90e2.ba4b.fc78 L                      eth1/1  
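
To target a single host instead of scanning the whole table, the show endpoint command
also accepts ip and mac keywords. As a minimal sketch, using one of the locally learned
addresses from the output above:

rtp_leaf1# show endpoint ip 10.0.0.101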

The show mac address-table command can also be used to confirm MAC address
learning and aging, as shown below.

rtp_leaf1# show mac address-table 

Legend: 

* - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

age - seconds since last seen,+ - primary entry using vPC Peer-Link,

(T) - True, (F) - False

   VLAN     MAC Address      Type      age     Secure NTFY Ports/SWID.SSID.LID

---------+-----------------+--------+---------+------+----+------------------

* 27       0026.f064.0000    dynamic      -       F    F    po1

* 27       001b.54c2.2644    dynamic      -       F    F    po1

* 27       0000.0c9f.f2bc    dynamic      -       F    F    po1

* 27       0026.980a.df44    dynamic      -       F    F    po1

* 16       0050.56bb.0164    dynamic      -       F    F    po2

* 16       0050.56bb.2577    dynamic      -       F    F    po2

* 16       0050.56bb.cccf    dynamic      -       F    F    po2

* 33       0050.56bb.cccf    dynamic      -       F    F    po2

* 17       0026.980a.df44    dynamic      -       F    F    po2

* 17       0050.56bb.ba9a    dynamic      -       F    F    po2

* 40       0050.56bb.f532    dynamic      -       F    F    po2

* 13       90e2.ba5a.9f30    dynamic      -       F    F    eth1/2

* 13       90e2.ba4b.fc78    dynamic      -       F    F    eth1/1
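
The table can also be narrowed to a single entry with the standard NX-OS address
keyword; as a minimal sketch using one of the MACs above:

rtp_leaf1# show mac address-table address 0026.f064.0000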

To verify end point reachability from the vsh_lc shell, the following command can be
used. Notice all the options that are available. To reduce the amount of output displayed,
specific end points can be specified, as the output can be quite extensive in a very large
fabric. Alternatively, the output can be filtered by using the Linux "grep" utility, as
demonstrated after the full output below.

rtp_leaf1# vsh_lc

module-1# show system internal epmc endpoint ?

all        Show information about all endpoints

interface  Display interface information

ip         IP address of the endpoint

key        Key of the endpoint

mac        MAC address of the endpoint

vlan       VLAN info

vrf        VRF of the endpoint

module-1# show system internal epmc endpoint all

VRF : overlay-1 ::: Context id : 4 ::: Vnid : 16777199

MAC : 90e2.ba4b.fc78 ::: Num IPs : 0

Vlan id : 13 ::: Vlan vnid : 16777209 ::: BD vnid : 16777209

VRF vnid : 16777199 ::: phy if : 0x1a000000 ::: tunnel if : 0

Interface : Ethernet1/1

VTEP tunnel if : N/A ::: Flags : 0x80004804

Ref count : 4 ::: sclass : 0

Timestamp : 01/02/1970 21:21:58.113891

EP Flags : local,MAC,class-set,timer,

Aging:Timer-type : Host-tracker timeout ::: Timeout-left : 399 ::: Hit-bit : Yes

::: Timer-reset count: 29 

PD handles: 

Bcm l2 hit-bit : Yes

[L2]: Asic : NS ::: BCM : Yes

::::

[snip]

------------------------------------------------

           EPMC Endpoint Summary                

----------------------------------------------

Total number of local endpoints                   : 7

Total number of remote endpoints                  : 0

Total number of peer endpoints                    : 0

Total number of cached endpoints                  : 0

Total number of config endpoints                  : 5

Total number of MACs                              : 5

Total number of IPs                               : 2
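
As a sketch of the filtering mentioned above, either the mac keyword from the help
listing or a grep pipe can be used; the MAC below is one of the endpoints shown earlier:

module-1# show system internal epmc endpoint mac 0050.56bb.cccf

module-1# show system internal epmc endpoint all | grep -B 4 "0050.56bb.cccf"

The first form asks the endpoint manager client for a single entry, while the second
simply filters the full dump.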

 

Verification - VLANs

The following commands can be used in the CLI to verify the VLANs programmed on
the leaf. The show vlan extended command is very useful as it provides the BD_VLAN,
FD_VLAN (EPG), and the encap used for the EPG FD_VLAN.

rtp_leaf1# show vlan 

 <CR>       Carriage return                       

 all-ports  Show all ports on VLAN                

 brief      All VLAN status in brief              

 extended   VLAN extended info like encaps        

 id         VLAN status by VLAN id                

 internal   Show internal information of vlan-mgr 

 reserved   Internal reserved VLANs               

 summary    VLAN summary information              

rtp_leaf1# show vlan extended 

VLAN Name                             Status    Ports                           

 ---- -------------------------------- --------- ------------------------------- 

 13   infra:default                    active    Eth1/1, Eth1/2, Eth1/5, Eth1/35 

 16   Prod:Web-FWctxProd               active    Eth1/27, Eth1/28, Po2, Po3 

 17   Prod:Web-FWctxProd               active    Eth1/27, Eth1/28, Po2, Po3 

 18   Prod:Web-LBctxProd               active    Eth1/27, Eth1/28, Po2, Po3 

 21   Prod:MiddleWare                  active    Eth1/27, Eth1/28, Po2, Po3  

  22   Prod:commerceworkspace:Middleware active    Eth1/27, Eth1/28, Po2, Po3                                                                            

 26   Prod:FWOutside                   active    Eth1/27, Eth1/28, Eth1/42,      

                                                 Eth1/44, Po1, Po2, Po3          

 27   --                               active    Eth1/42, Eth1/44, Po1 

 32   Prod:FWInside                    active    Eth1/27, Eth1/28, Po2, Po3 

 33   Prod:commerceworkspace:FWInside  active    Eth1/27, Eth1/28, Po2, Po3 

 37   Prod:Web                         active    Eth1/27, Eth1/28, Po2, Po3 

 40   Prod:commerceworkspace:Web       active    Eth1/27, Eth1/28, Po2, Po3 

 VLAN Type  Vlan-mode  Encap                                                         

 ---- ----- ---------- -------------------------------                               

 13   enet  CE         vxlan-16777209, vlan-3500                                     

 16   enet  CE         vlan-637                                                      


 17   enet  CE         vlan-671                                                      

 18   enet  CE         vlan-603                                                      

 21   enet  CE         vxlan-16351138                                                

 22   enet  CE         vlan-634                                                      

 26   enet  CE         vxlan-15662984                                                

 27   enet  CE         vlan-700                                                      

 32   enet  CE         vxlan-15597456                                                

 33   enet  CE         vlan-602                                                      

 37   enet  CE         vxlan-16089026                                                

 40   enet  CE         vlan-600

The following commands can be used in vsh_lc shell to verify the VLANs are programmed
on the ALE ASIC. The brief version of the VLAN command is also useful as it provides all
the VLAN information in a single table.

rtp_leaf1# vsh_lc

module-1# show system internal eltmc info vlan ?

  <0-4095>           Vlan value

  access_encap_vlan  Vlan on the wire

  access_encap_vnid  Vnid on the wire

  all                Show information for all instances of object

  brief              Show brief information for all objects

  fab_encap_vnid     Vnid in the fabric

  hw_vlan            Vlan value in HW

  summary            Show summary for object

module-1# show system internal eltmc info vlan brief

VLAN-Info

VlanId  HW_VlanId Type            Access_enc Access_enc Fabric_enc Fabric_enc BDVlan

                                  Type                  Type

==================================================================================

     13       15    BD_CTRL_VLAN    802.1q      3500     VXLAN  16777209       0

     16       28         FD_VLAN    802.1q       637     VXLAN      8429      26

     17       25         FD_VLAN    802.1q       671     VXLAN      8463      32

     18       29         FD_VLAN    802.1q       603     VXLAN      8395      32

     21       23         BD_VLAN   Unknown         0     VXLAN  16351138      21

     22       24         FD_VLAN    802.1q       634     VXLAN      8426      21

     26       26         BD_VLAN   Unknown         0     VXLAN  15662984      26


     27       27         FD_VLAN    802.1q       700     VXLAN      8192      26

     32       30         BD_VLAN   Unknown         0     VXLAN  15597456      32

     33       31         FD_VLAN    802.1q       602     VXLAN      8394      32

     37       21         BD_VLAN   Unknown         0     VXLAN  16089026      37

     40       32         FD_VLAN    802.1q       600     VXLAN      8392      37

The following command can be used in the vsh_lc shell to verify the VLANs programmed
on the ALE ASIC. The output shows the EPG FD_VLAN, which has an encap value of 634.
The FD_VLAN has a parent BD_VLAN of 21.

module-1# show system internal eltmc info vlan 22

             vlan_id:             22   :::      hw_vlan_id:             24

           vlan_type:        FD_VLAN   :::         bd_vlan:             21

   access_encap_type:         802.1q   :::    access_encap:            634

   fabric_encap_type:          VXLAN   :::    fabric_encap:           8426

              sclass:          16389   :::           scope:              4

             bd_vnid:           8426   :::        untagged:              0

     acess_encap_hex:          0x27a   :::  fabric_enc_hex:         0x20ea

     pd_vlan_ft_mask:           0x4f

        bcm_class_id:             16   :::  bcm_qos_pap_id:           1024

          qq_met_ptr:              2   :::       seg_label:              0

      ns_qos_map_idx:              0   :::  ns_qos_map_pri:              1

     ns_qos_map_dscp:              0   :::   ns_qos_map_tc:              0

        vlan_ft_mask:           0x30

      NorthStar Info:

           qq_tbl_id:           1808   :::         qq_ocam:              0

     seg_stat_tbl_id:              0   :::        seg_ocam:              0

The same command can be used to display BD_VLAN 21, which from the above output is the
parent VLAN of FD_VLAN 22. Displaying this VLAN provides important information
regarding the forwarding behavior for this Bridge Domain. It can be seen that this BD
has been left at the default L3 forwarding behavior. Remember that the BD_VLAN and
FD_VLAN are only locally significant to a single leaf.

module-1# show system internal eltmc info vlan 21 

             vlan_id:             21   :::      hw_vlan_id:             23

           vlan_type:        BD_VLAN   :::         bd_vlan:             21

   access_encap_type:        Unknown   :::    access_encap:              0

   fabric_encap_type:          VXLAN   :::    fabric_encap:       16351138

              sclass:          32773   :::           scope:              4

             bd_vnid:       16351138   :::        untagged:              0

     acess_encap_hex:              0   :::  fabric_enc_hex:       0xf97fa2

         vrf_fd_list: 22,

     pd_vlan_ft_mask:              0

        bcm_class_id:              0   :::  bcm_qos_pap_id:              0

          qq_met_ptr:              0   :::       seg_label:              0

      ns_qos_map_idx:              0   :::  ns_qos_map_pri:              0

     ns_qos_map_dscp:              0   :::   ns_qos_map_tc:              0

        vlan_ft_mask:           0x7f

            fwd_mode:  bridge, route

            arp_mode:        unicast

        unk_mc_flood:              1

         unk_uc_mode:          proxy 

      NorthStar Info:

          qq_tbl_id:           2040   :::         qq_ocam:              0

    seg_stat_tbl_id:            149   :::        seg_ocam:              0

        flood_encap:             29   :::  igmp_mld_encap:             33

Using the same command sequence and displaying FD_VLAN 33 shows that the parent
BD_VLAN is 32. Displaying the BD_VLAN shows that this BD's forwarding behavior has
been changed to L2 forwarding and flooding.

module-1# show system internal eltmc info vlan 33

             vlan_id:             33   :::      hw_vlan_id:             31

           vlan_type:        FD_VLAN   :::         bd_vlan:             32

   access_encap_type:         802.1q   :::    access_encap:            602

   fabric_encap_type:          VXLAN   :::    fabric_encap:           8394

              sclass:          16394   :::           scope:              4

             bd_vnid:           8394   :::        untagged:              0

     acess_encap_hex:          0x25a   :::  fabric_enc_hex:         0x20ca

     pd_vlan_ft_mask:           0x4f
        bcm_class_id:             16   :::  bcm_qos_pap_id:           1024

          qq_met_ptr:              7   :::       seg_label:              0

      ns_qos_map_idx:              0   :::  ns_qos_map_pri:              1

     ns_qos_map_dscp:              0   :::   ns_qos_map_tc:              0

        vlan_ft_mask:           0x30

      NorthStar Info:

           qq_tbl_id:           1417   :::         qq_ocam:              0

     seg_stat_tbl_id:              0   :::        seg_ocam:              0

This is the display of the BD_VLAN, showing that forwarding has been changed to layer
2 (bridged):

module-1# show system internal eltmc info vlan 32

          vlan_id:             32   :::      hw_vlan_id:             30

           vlan_type:        BD_VLAN   :::         bd_vlan:             32

   access_encap_type:        Unknown   :::    access_encap:              0

   fabric_encap_type:          VXLAN   :::    fabric_encap:       15597456

              sclass:          16393   :::           scope:              4

             bd_vnid:       15597456   :::        untagged:              0

     acess_encap_hex:              0   :::  fabric_enc_hex:       0xedff90

         vrf_fd_list: 18,17,33,

     pd_vlan_ft_mask:              0

        bcm_class_id:              0   :::  bcm_qos_pap_id:              0

          qq_met_ptr:              0   :::       seg_label:              0

      ns_qos_map_idx:              0   :::  ns_qos_map_pri:              0

     ns_qos_map_dscp:              0   :::   ns_qos_map_tc:              0

        vlan_ft_mask:           0x7f

            fwd_mode:         bridge

            arp_mode:          flood

         unk_mc_flood:              1

        unk_uc_mode:          flood

      NorthStar Info:

          qq_tbl_id:           1672   :::         qq_ocam:              0

    seg_stat_tbl_id:            452   :::        seg_ocam:              0

        flood_encap:             36   :::  igmp_mld_encap:             37

 

Verification - Forwarding Tables

From the CLI, the "show endpoint" command can be used to determine how the endpoints
are being learned. In the below example, all the endpoints are locally learned (L).

rtp_leaf1# show endpoint

Legend:

O - peer-attached    H - vtep             a - locally-aged     S - static          

V - vpc-attached     p - peer-aged        L - local            M - span            

s - static-arp       B - bounce          

+---------------+---------------+-----------------+--------------+-------------+

     VLAN/       Encap           MAC Address       MAC Info/       Interface

     Domain      VLAN            IP Address        IP Info

+---------------+---------------+-----------------+--------------+-------------+

Prod:Prod                               10.0.0.101 L                          

27/Prod:Prod            vlan-700    0026.f064.0000 LpV                       po1

27/Prod:Prod            vlan-700    001b.54c2.2644 LpV                       po1

27/Prod:Prod            vlan-700    0026.980a.df44 LV                        po1

27/Prod:Prod            vlan-700    0000.0c9f.f2bc LV                        po1

35/Prod:Prod            vlan-670    0050.56bb.0164 LV                        po2

29/Prod:Prod            vlan-600    0050.56bb.cccf LV                        po2

34/Prod:Prod            vlan-636    0000.0c9f.f2bc LV                        po2

34/Prod:Prod            vlan-636    0026.980a.df44 LV                        po2

34/Prod:Prod            vlan-636    0050.56bb.ba9a LV                        po2

Test:Test                               10.1.0.101 L                          

overlay-1                            172.16.136.95 L                          

overlay-1                            172.16.136.96 L                          

13/overlay-1      vxlan-16777209    90e2.ba5a.9f30 L                      eth1/2

13/overlay-1      vxlan-16777209    90e2.ba4b.fc78 L                      eth1/1

From the vsh_lc shell the following commands can be used to examine the forwarding
tables. Remember from the overview there are two forwarding tables of interest on the
ALE ASIC, the GST and LST. These are unified tables for L2 and L3. The ingress and
egress pipelines can be displayed separately as shown below.

For the output below, when the ingress direction is specified, it refers to packets
originating from the fabric; when the egress direction is specified, it refers to packets
originating from the front panel ports.

module-1# show platform internal ns forwarding lst-l2 ingress

================================================================================

                          TABLE INSTANCE : 0

================================================================================

Legend:

POS: Entry Position             O: Overlay Instance

V: Valid Bit                    MD/PT: Mod/Port

PT: Pointer Type(A=Adj, E=ECMP, D=DstEncap N=Invalid)

PTR: ECMP/Adj/DstEncap/MET pointer

ML: MET Last

ST: Static                      PTH: Num Paths

BN: Bounce                      CP: Copy To CPU

PA: Policy Applied              PI: Policy Incomplete

DL: Dst Local                   SP: Spine Proxy

--------------------------------------------------------------------------------

                                   MO       SRC  P      M S     B C P P D S 

POS   O VNID   Address           V DE MD/PT CLSS T PTR  L T PTH N P A I L P

--------------------------------------------------------------------------------

  253 0 eeff88 00:26:98:0a:df:44 1  0 00/15 4007 A    0 0 0   1 0 0 0 0 0 0

  479 0 edff90 00:00:0c:9f:f2:bc 1  0 00/13 400f A    0 0 0   1 0 0 0 0 0 0

  693 0 eeff88 00:00:0c:9f:f2:bc 1  0 00/15 4007 A    0 0 0   1 0 0 0 0 0 0

  848 0 edff90 00:50:56:bb:cc:cf 1  0 00/13 400a A    0 0 0   1 0 0 0 0 0 0

  919 0 edff90 00:26:98:0a:df:44 1  0 00/13 400f A    0 0 0   1 0 0 0 0 0 0

 1271 0 f97fa2 00:22:bd:f8:19:ff 1  0 00/00    1 A    0 0 1   1 0 0 0 1 0 0

 1306 0 eeff88 00:1b:54:c2:26:44 1  0 00/15 4007 A    0 0 0   1 0 0 0 0 0 0

 1327 0 edff90 00:50:56:bb:ba:9a 1  0 00/13 400f A    0 0 0   1 0 0 0 0 0 0

 2421 0 f57fc2 00:50:56:bb:f5:32 1  0 00/13 8004 A    0 0 0   1 0 0 0 0 0 0

 3942 0 eeff88 00:26:f0:64:00:00 1  0 00/15 4007 A    0 0 0   1 0 0 0 0 0 0

 4083 0 eeff88 00:50:56:bb:01:64 1  0 00/13 4010 A    0 0 0   1 0 0 0 0 0 0

module-1# show platform internal ns forwarding lst-l2 egress 

================================================================================

                          TABLE INSTANCE : 1

================================================================================

Legend:

POS: Entry Position             O: Overlay Instance

V: Valid Bit                    MD/PT: Mod/Port

PT: Pointer Type(A=Adj, E=ECMP, D=DstEncap N=Invalid)

PTR: ECMP/Adj/DstEncap/MET pointer


ML: MET Last

ST: Static                      PTH: Num Paths

BN: Bounce                      CP: Copy To CPU

PA: Policy Applied              PI: Policy Incomplete

DL: Dst Local                   SP: Spine Proxy

--------------------------------------------------------------------------------

                                   MO       SRC  P      M S     B C P P D S 

POS   O VNID   Address           V DE MD/PT CLSS T PTR  L T PTH N P A I L P

--------------------------------------------------------------------------------

  253 0 eeff88 00:26:98:0a:df:44 1  0 00/00 4007 A   10 0 0   1 0 0 0 0 1 0

  479 0 edff90 00:00:0c:9f:f2:bc 1  0 00/00 400f A   13 0 0   1 0 0 0 0 1 0

  693 0 eeff88 00:00:0c:9f:f2:bc 1  0 00/00 4007 A   10 0 0   1 0 0 0 0 1 0

  848 0 edff90 00:50:56:bb:cc:cf 1  0 00/00 400a A   14 0 0   1 0 0 0 0 1 0

  919 0 edff90 00:26:98:0a:df:44 1  0 00/00 400f A   13 0 0   1 0 0 0 0 1 0

 1306 0 eeff88 00:1b:54:c2:26:44 1  0 00/00 4007 A   10 0 0   1 0 0 0 0 1 0

 1327 0 edff90 00:50:56:bb:ba:9a 1  0 00/00 400f A   13 0 0   1 0 0 0 0 1 0

 2421 0 f57fc2 00:50:56:bb:f5:32 1  0 00/00 8004 A   12 0 0   1 0 0 0 0 1 0

 3942 0 eeff88 00:26:f0:64:00:00 1  0 00/00 4007 A   10 0 0   1 0 0 0 0 1 0

 4083 0 eeff88 00:50:56:bb:01:64 1  0 00/00 4010 A   11 0 0   1 0 0 0 0 1 0

module-1# show platform internal ns forwarding lst-l3 ingress 

===========================================================================

                         TABLE INSTANCE : 0

===========================================================================

Legend:

POS: Entry Position             O: Overlay Instance

V: Valid Bit                    MD/PT: Mod/Port

PT: Pointer Type(A=Adj, E=ECMP, D=DstEncap N=Invalid)

PTR: ECMP/Adj/DstEncap/MET pointer

ML: MET Last

ST: Static                      PTH: Num Paths

BN: Bounce                      CP: Copy To CPU

PA: Policy Applied              PI: Policy Incomplete

DL: Dst Local                   SP: Spine Proxy

--------------------------------------------------------------------------------

                                  MO       SRC  P      M S     B C P P D S 

POS   O VNID   Address           V DE MD/PT CLSS T PTR  L T PTH N P A I L P

--------------------------------------------------------------------------------

3142 0 268000 10.1.2.1          1  0 00/00    1 A    0 0 1   1 0 0 0 1 0 0


module-1# show platform internal ns forwarding lst-l3 egress

<no output> 

module-1# show platform internal ns forwarding gst-l2 ingress  

===========================================================================

                         TABLE INSTANCE : 0

===========================================================================

Legend:

POS: Entry Position             O: Overlay Instance

V: Valid Bit                    MD/PT: Mod/Port

PT: Pointer Type(A=Adj, E=ECMP, D=DstEncap N=Invalid)

PTR: ECMP/Adj/DstEncap/MET pointer

ML: MET Last

ST: Static                      PTH: Num Paths

BN: Bounce                      CP: Copy To CPU

PA: Policy Applied              PI: Policy Incomplete

DL: Dst Local                   SP: Spine Proxy

--------------------------------------------------------------------------------

                                  MO       SRC  P      M S     B C P P D S 

POS   O VNID   Address           V DE MD/PT CLSS T PTR  L T PTH N P A I L P

--------------------------------------------------------------------------------

4095 0 eeff88 00:50:56:bb:01:64 1  0 00/00 4010 A   11 0 0   1 0 0 0 0 1 0

4261 0 edff90 00:26:98:0a:df:44 1  0 00/00 400f A   13 0 0   1 0 0 0 0 1 0

4354 0 edff90 00:50:56:bb:ba:9a 1  0 00/00 400f A   13 0 0   1 0 0 0 0 1 0

4476 0 eeff88 00:26:98:0a:df:44 1  0 00/00 4007 A   10 0 0   1 0 0 0 0 1 0

4672 0 f57fc2 00:50:56:bb:f5:32 1  0 00/00 8004 A   12 0 0   1 0 0 0 0 1 0

7190 0 eeff88 00:00:0c:9f:f2:bc 1  0 00/00 4007 A   10 0 0   1 0 0 0 0 1 0

7319 0 eeff88 00:26:f0:64:00:00 1  0 00/00 4007 A   10 0 0   1 0 0 0 0 1 0

7631 0 edff90 00:00:0c:9f:f2:bc 1  0 00/00 400f A   13 0 0   1 0 0 0 0 1 0

7742 0 edff90 00:50:56:bb:25:77 1  0 00/00 400f A   13 0 0   1 0 0 0 0 1 0

7910 0 edff90 00:1b:54:c2:26:44 1  0 00/00 400f A   13 0 0   1 0 0 0 0 1 0

7999 0 eeff88 00:1b:54:c2:26:44 1  0 00/00 4007 A   10 0 0   1 0 0 0 0 1 0

8167 0 eeff88 00:50:56:bb:25:77 1  0 00/00 4010 A   11 0 0   1 0 0 0 0 1 0

module-1# show platform internal ns forwarding gst-l2 egress

<no output>

module-1# show platform internal ns forwarding gst-l3 ingress 

==========================================================================

                       TABLE INSTANCE : 0

==========================================================================

Legend:

POS: Entry Position             O: Overlay Instance

V: Valid Bit                    MD/PT: Mod/Port

PT: Pointer Type(A=Adj, E=ECMP, D=DstEncap N=Invalid)

PTR: ECMP/Adj/DstEncap/MET pointer

ML: MET Last

ST: Static                      PTH: Num Paths

BN: Bounce                      CP: Copy To CPU

PA: Policy Applied              PI: Policy Incomplete

DL: Dst Local                   SP: Spine Proxy

--------------------------------------------------------------------------------

                                  MO       SRC  P      M S     B C P P D S 

POS   O VNID   Address           V DE MD/PT CLSS T PTR  L T PTH N P A I L P

--------------------------------------------------------------------------------

 562 0 268000 10.0.0.9          1  0 00/00    1 A    c 0 0   1 0 0 0 0 1 0

 563 0 268000 10.0.0.1          1  0 00/00    1 A    e 0 0   1 0 0 0 0 1 0

2312 0 2c0000 10.0.1.11         1  0 00/00    1 A    2 0 0   1 0 0 0 0 1 0

2313 0 2c0000 10.0.1.3          1  0 00/00    1 A    2 0 0   1 0 0 0 0 1 0

4580 0 2c0000 10.0.1.9          1  0 00/00    1 A    d 0 0   1 0 0 0 0 1 0

4581 0 2c0000 10.0.1.1          1  0 00/00    1 A    f 0 0   1 0 0 0 0 1 0

6878 0 268000 10.0.0.11         1  0 00/00    1 A    2 0 0   1 0 0 0 0 1 0

6879 0 268000 10.0.0.3          1  0 00/00    1 A    2 0 0   1 0 0 0 0 1 0

module-1# show platform internal ns forwarding gst-l3 egress

<no output>

Problem Description

Using Atomic Counters as an aid in troubleshooting.

In the scenario where Atomic Counters are configured, but do not seem to be working,
here are some points to keep in mind.

• NTP must be configured and working correctly within the fabric for Atomic
Counters to work (a quick check is sketched after this list).

• Endpoints must be learned by the leaf switch. Attach to the leaf command line,
issue 'show endpoint', and confirm that there is an L next to the endpoints on
which the Atomic Counter Policy will be configured.
• The endpoints must be sending traffic in one or both directions before Atomic
Counters can display the packet counts.
• The endpoints must reside on different leafs. Counted packets must traverse
the ACI Spine switches. Locally switched packets are not counted by Atomic
Counters. The packet must traverse the ALE ASICs.
• Atomic Counters are not supported when the endpoints are in different VRFs
(also known as different Contexts). This implies that Atomic Counters are not
supported between endpoints that reside in different tenants.
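
As a quick check of the NTP requirement above (assuming NTP has been set up through a
fabric Date and Time policy), the NX-OS style command below can be run on each switch;
a synchronized peer should be listed:

rtp_leaf1# show ntp peer-status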

In the diagram below, On-demand Atomic Counters are available for troubleshooting
between EPG-A and EPG-B, and between any of the EPs.

In the diagram below, On-demand Atomic Counters would only update the transmit
counter. Drops or excess packets could not be counted.

Packet counts are updated in 30-second intervals, so wait at least 30 seconds before
expecting to see any counters incrementing.

Complete Atomic Counter restrictions are documented in the Cisco APIC
Troubleshooting Guide.

On-demand Atomic Counters are available in Fabric -> Fabric Policies ->
Troubleshooting Policies -> Traffic Map, as shown below.

Click on the Leaf to Leaf traffic as shown below.



Symptom - Layer 2 Forwarding Issues

Verification

• Use the steps in the verification section, and verify 


o End point reachability
o VLAN Programming
o BD_VLAN forwarding behavior
o GST and LST L2 Forwarding Tables.
• For additional troubleshooting tips see the "Bridged Connectivity to External
Networks" chapter, which demonstrates the use of other commands such as
"show vpc".

Symptom - Layer 3 Forwarding Issues

Verification

• Use the steps in the verification section, and verify


o End point reachability
o VLAN Programming
o BD_VLAN forwarding behavior
o GST and LST L3 Forwarding Tables.
• For additional troubleshooting tips see the "Routed Connectivity to External
Networks" chapter, which demonstrates the use of other commands such as
"show ip route vrf".

Policies and Contracts

Overview

Within the ACI abstraction model, Contracts are objects built to represent the
communications allowed or denied between objects, such as EPGs.  In order to resolve and
configure the infrastructure, the contract objects get resolved into Zoning Rules on the
fabric nodes.  Zoning Rules are the ACI equivalent of Access Control Lists in traditional
infrastructure terms.  This chapter provides an overview and some troubleshooting
topics related to the Zoning-Rules Policy Control in the ACI fabric.  The ACI zoning-rule
policy architecture consists of four main components: Policy Manager, ACLQOS, Filters
and Filter Entries, and Scopes and classIDs, as described below.

Policy Manager

• Policy Manager is a supervisor component that processes objectStore


notifications when Data Management Engine (DME)/Policy Element (PE) pushes
zoning configuration to a leaf.
• Policy Manager uses the PPF (Policy Propagation Facility) library to push
configuration to the linecards. 
• Policy Manager follows a "verify-commit" model where a lack of hardware
resources will cause a failure in the "verify" stage of the process.
• Sets operational state as “Enabled” or “Disabled”

ACLQOS

• ACLQOS is a linecard component that receives the PPF configuration from the
supervisor.
• This component is responsible for programming the hardware resources
(Ternary Content Addressable Memory - TCAM) on the linecards on the leafs.

Filters and Filter Entries

• Filters act as containers for the filter entries


• Filter entries specify the Layer 4 (L4) information

Scopes and classIDs

• Each Context (VRF) uses a specific scope identified by "scopeID"


• "actrlRules" and "mgmtRules" are children that exist under a given scope
• EPGs are identified by the classID or PcTag (a lookup sketch follows this list).
• Rules are specified in terms of scope, source class ID, destination class ID, and
the filter
• The actrlRules exist only on leafs, while actrl.MgmtRules exist on both leaf and
spine switches.
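
As a hedged sketch of looking up the classID directly, the moquery utility on the APIC
can dump the pcTag property of each EPG. The class and property names below come from
the object model, and the exact prompt and output format vary by release:

admin@apic1:~> moquery -c fvAEPg | egrep "dn|pcTag"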

At a very high level, the interaction of the components on the APIC and leaf for policy can
be summarized as follows:

• Policy Manager on APIC communicates with Policy Element Manager on the leaf
• Policy Element Manager on the leaf programs the Object Store on the leaf
• Policy Manager on the leaf communicates with ACLQOS client on the leaf
• ACLQOS client programs the hardware

Verification of Zoning Policies

Zoning rules on the leaf can either be displayed directly on the leaf using CLI, or through
the GUI on the APIC. The zoning policies that were examined as part of this chapter were
configured as part of the below reference topology. 

The following CLI command can be used to display the zoning rules configured on the
switch. This provides several key pieces of information when troubleshooting zoning rule
policies. The Rule ID, SrcEPG/DstEPG, FilterID, and Scope can be used in future displays.

rtp_leaf1# show zoning-rule

Rule ID SrcEPG DstEPG FilterID operSt Scope Action

======= ====== ====== ======== ====== ===== ======

4096 0 0 implicit enabled 16777200 deny,log

4106 0 0 implicit enabled 2523136 deny,log

4107 0 16386 implicit enabled 2523136 deny,log

4147 0 32773 implicit enabled 2523136 permit

4148 0 16388 implicit enabled 2523136 permit

4149 0 32774 implicit enabled 2523136 permit

4150 0 16393 implicit enabled 2523136 permit

4151 0 32770 implicit enabled 2523136 permit

4152 16400 16391 17 enabled 2523136 permit

4153 16391 16400 17 enabled 2523136 permit

4154 16400 16391 18 enabled 2523136 permit

4155 16391 16400 18 enabled 2523136 permit

4097 16398 16394 default enabled 2523136 permit

4112 16394 16398 default enabled 2523136 permit

4120 16398 16399 default enabled 2523136 permit

4121 16399 16398 default enabled 2523136 permit

4126 16389 16387 default enabled 2523136 permit

4127 16387 16389 default enabled 2523136 permit

4128 16389 16401 default enabled 2523136 permit

4129 16401 16389 default enabled 2523136 permit

4130 16387 16401 default enabled 2523136 permit

4131 16401 16387 default enabled 2523136 permit

4117 0 0 implicit enabled 2457600 deny,log

4118 0 0 implicit enabled 2883584 deny,log

4119 0 32770 implicit enabled 2883584 deny,log



As is evidenced by the above output, even in a small test fabric, a number of rules are
installed. In order to identify which Rule IDs apply to which configured contexts and EPGs,
it is necessary to first identify the scope for the configured context. This can be done
by using Visore on the APIC to search on the class "fvCtx". Once all the contexts are
displayed, search on the specific context that is configured, and identify the scope for
that context.

The scope information is circled below. This is important as it will be used in future dis-
plays to verify the contract/policy has been pushed to the leaf.
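
As an alternative to browsing Visore, a minimal sketch using moquery from the APIC
shell can list each context along with its scope; the prompt and output format vary by
release:

admin@apic1:~> moquery -c fvCtx | egrep "dn|scope"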

Notice that the scope identified in the above capture (2523136) matches the scope that
appears in the “show zoning-rule” output displayed and highlighted below.

rtp_leaf1# show zoning-rule

Rule ID SrcEPG DstEPG FilterID operSt Scope Action

======= ====== ====== ======== ====== ===== ======

4096 0 0 implicit enabled 16777200 deny,log

4106 0 0 implicit enabled 2523136 deny,log

4107 0 16386 implicit enabled 2523136 deny,log

4147 0 32773 implicit enabled 2523136 permit

4148 0 16388 implicit enabled 2523136 permit

4149 0 32774 implicit enabled 2523136 permit

4150 0 16393 implicit enabled 2523136 permit

4151 0 32770 implicit enabled 2523136 permit

4152 16400 16391 17 enabled 2523136 permit

4153 16391 16400 17 enabled 2523136 permit

4154 16400 16391 18 enabled 2523136 permit

4155 16391 16400 18 enabled 2523136 permit

4097 16398 16394 default enabled 2523136 permit

4112 16394 16398 default enabled 2523136 permit

4120 16398 16399 default enabled 2523136 permit

4121 16399 16398 default enabled 2523136 permit

4122 32772 16387 default enabled 2523136 permit

4123 16387 32772 default enabled 2523136 permit

4124 32772 16389 default enabled 2523136 permit

4125 16389 32772 default enabled 2523136 permit

4126 16389 16387 default enabled 2523136 permit

4127 16387 16389 default enabled 2523136 permit

4117 0 0 implicit enabled 2457600 deny,log

4118 0 0 implicit enabled 2883584 deny,log

4119 0 32770 implicit enabled 2883584 deny,log
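
To narrow the display to a single context, the zoning-rule output can also be piped
through the Linux "grep" with the scope value identified above, for example:

rtp_leaf1# show zoning-rule | grep 2523136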

Once the Scope ID information has been identified, as well as the rule and filter IDs, the
following command can be used to verify which Rule IDs and filters are being used for
the scope previously identified. The command is run twice below to confirm the counters
are incrementing; it can be seen that rule 4149, with source any (s-any) and destination
32774 (d-32774), is being hit.

rtp_leaf1# show system internal policy-mgr stats | grep 2523136

Rule (4097) DN (sys/actrl/scope-2523136/rule-2523136-s-16398-d-16394-f-default) Ingress: 206, Egress: 0

Rule (4106) DN (sys/actrl/scope-2523136/rule-2523136-s-any-d-any-f-implicit) Ingress: 35, Egress: 0

[snip]

Rule (4148) DN (sys/actrl/scope-2523136/rule-2523136-s-any-d-16388-f-implicit) Ingress: 9, Egress: 0

Rule (4149) DN (sys/actrl/scope-2523136/rule-2523136-s-any-d-32774-f-implicit) Ingress: 8925, Egress: 0

Rule (4150) DN (sys/actrl/scope-2523136/rule-2523136-s-any-d-16393-f-implicit) Ingress: 30, Egress: 4

[snip]

rtp_leaf1# show system internal policy-mgr stats | grep 2523136

Rule (4097) DN (sys/actrl/scope-2523136/rule-2523136-s-16398-d-16394-f-default) Ingress: 206, Egress: 0

Rule (4106) DN (sys/actrl/scope-2523136/rule-2523136-s-any-d-any-f-implicit) Ingress: 35, Egress: 0

[snip]

Rule (4148) DN (sys/actrl/scope-2523136/rule-2523136-s-any-d-16388-f-implicit) Ingress: 9, Egress: 0

Rule (4149) DN (sys/actrl/scope-2523136/rule-2523136-s-any-d-32774-f-implicit) Ingress: 8935, Egress: 0

Rule (4150) DN (sys/actrl/scope-2523136/rule-2523136-s-any-d-16393-f-implicit) Ingress: 30, Egress: 4

[snip]

Is this the rule ID that is expected to be incrementing? If so, another very useful command
is "show system internal aclqos zoning-rules". Use of this command should be done under
the direction of Cisco TAC, but it provides confirmation that the hardware on the leaf
has been programmed correctly.

The Source EPG and Destination EPG combination of interest is (0 and 32774). The next
step is to identify all the hardware entries for these source and destination classes that
match the rule ID in question (4149). The rules are numbered sequentially, and the rule ID
of interest is highlighted below. It can be observed that hardware indexes (hw_index) of
150 and 151 are present, which indicates that there is a hardware entry for this rule.

module-1# show system internal aclqos zoning-rules

===========================================

Rule ID: 1 Scope 4 Src EPG: 0 Dst EPG: 16386 Filter 65534 

  Curr TCAM resource:

 =============================

   unit_id: 0

   === Region priority: 2307 (rule prio: 9 entry: 3)===

       sw_index = 23 | hw_index = 132

   === Region priority: 2307 (rule prio: 9 entry: 3)===

       sw_index = 24 | hw_index = 133

[snip]

Dumping the hardware entry and examining the accuracy of the content is beyond the
scope of this book, but at this point there is sufficient information to contact the Cisco
Technical Assistance Center (TAC) if there is a zoning rule but no corresponding hard-
ware entry.

===========================================

Rule ID: 4149 Scope 4 Src EPG: 0 Dst EPG: 32774 Filter 65534 

  Curr TCAM resource:

 =============================

   unit_id: 0

   === Region priority: 2311 (rule prio: 9 entry: 7)===

       sw_index = 38 | hw_index = 150

   === Region priority: 2311 (rule prio: 9 entry: 7)===

       sw_index = 39 | hw_index = 151

The GUI can also be used to verify contracts and zoning rules. All the rules on the leaf
can be examined by going to Fabric -> Inventory -> Rules and then double-clicking on a
particular rule of interest, as shown below.

The existing policy state can be verified using the GUI. Remember from the overview
section that it is the responsibility of Policy Manager to set the operational state of the
rule. In the below display it can be verified whether the operational state is disabled or
enabled, whether the action is permit or deny, and the direction of the rule.

The statistics for each rule can also be examined to make a determination that the rule
is being used. This was demonstrated earlier using the CLI. Click on Stats, and then the
check mark as shown below, to view stats in the GUI.

Select the packet counters of interest and the sampling interval to be monitored.

If the health score for that specific rule is not 100, its health status can be further drilled
down upon. This provides insight as to what problems may be occurring. Running out of
hardware resources is just one factor that can cause the health score to decrease. The
use of Health Score as an aid in troubleshooting is covered in more detail in the Health
Score/Faults specific chapter.

Faults that have been generated as a direct result of this rule being applied to the leaf can
also be analyzed. This is one of the most important items to check when troubleshooting
zoning rules, or any other ACI policy. Faults are likewise covered in more detail in the
Health Score/Faults specific chapter.

Problem Description

Some of the Policy Zoning Rules problems that can be encountered during the ACI
deployment process include, but are not limited to, the items below. The commands
already shown in the verification section above can be used to help identify them.

Symptom

End Point Groups can communicate when there is no contract configured.

Verification

• Check the GUI to see if any faults were generated if the rule/contract was
recently removed
• Verify in the GUI that the rule does not exist after identifying the scope for the
context
• Verify with the "show zoning-rule" CLI command that the rule does not exist
• Verify in Visore that the rule does not exist on the APIC 

For example, using the following highlighted Rule ID:

 rtp_leaf1# show zoning-rule 

Rule ID         SrcEPG          DstEPG          FilterID        operSt          Scope           Action       

=======         ======          ======          ========        ======          =====           ======        

4096            0               0               implicit        enabled         16777200        deny,log      

[snip]       

4130            16388           32775           25              enabled         2523136         permit       

4131            32775           16388           21              enabled         2523136         permit



Visore can be used to verify whether the rule exists by searching on the class "actrlRule"
and filtering on the rule ID, as shown below.
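
Alternatively, the same check can be scripted from the APIC shell. The sketch below
assumes rule ID 4130 from the output above and uses moquery's property filter option:

admin@apic1:~> moquery -c actrlRule -f 'actrl.Rule.id=="4130"'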

Visore can also be used to search on a specific filter (actrlFlt). It can be seen that the rule
exists in this case, and is applied to node 101 or leaf1.

It can be verified that the filter entries are correct by drilling down on the arrows as
shown below.

Statistics can also be checked directly from Visore as shown below.



• Verify using the commands in the verification section that the leaf does not have
a hardware entry for that rule.

Resolution

If a rule (contract) is found to be configured, remove it to block communication between
the EPGs. If there is no rule configured, or the rule is unable to be verified, contact the
Cisco Technical Assistance Center for help in diagnosing the problem.

Symptom

End Point Groups cannot communicate when there is a contract configured.

Verification

• Check the GUI for faults associated to the rule in question


• Verify in the GUI that the operational state is enabled
• Verify with the "show zoning-rule" CLI command that the rule exists after
identifying the scope for the context
• Verify in the GUI or CLI that the rule entry counters are incrementing
• Verify the health score for that rule
• Verify in Visore that the rule exists (See above)
• Verify there is a corresponding hardware entry for that rule id in the CLI

Resolution

If a rule (contract) is found to be configured, contains the proper filter content (ports),
and is found to have a corresponding hardware entry then check for forwarding prob-
lems. If further assistance is required contact the Cisco Technical Assistance Center for
help in diagnosing the problem. 

Symptom

Hardware resource exhaustion when rules are being pushed and programmed on the
leaf. 

Verification

• Check the GUI for faults associated to the rule in question. This will
be Fault F1203 - Rule failed due to hardware programming error.

The following fault will be observed: Fault F105504 - TCA: policy CAM entries usage current
value(eqptcapacityPolEntry5min:normalizedLast) value 93 raised above threshold 90

This fault will be generated on the leaf.
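
The counter named in the fault can also be read directly from the object model. As a
minimal sketch using the class and property from the fault string above:

admin@apic1:~> moquery -c eqptcapacityPolEntry5min | egrep "dn|normalizedLast"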

• Verify with the "show zoning-rule" CLI command that the rule exists after
identifying the scope for the context

• Verify the health score for that rule

Resolution

Reduce the number of zoning rules (contracts) required. Explore the option of using
"vzAny", as well as any other contract optimization techniques. Contact the Cisco TAC if
further assistance is required.

Symptom

Configured rules are not being deployed on the leaf.

Verification

• Check the GUI for faults associated to the rule in question


• Verify in the GUI that the operational state is enabled
• Verify in Visore that the rule exists (See above)
• Verify with the "show zoning-rule" CLI command that the rule exists after
identifying the scope for the context
• Verify using the commands in the verification section that the leaf has a
hardware entry for the rule

Resolution

If a rule (contract) is present in the GUI and Visore, does not exist on the leaf, and there
are no corresponding faults for the policy that is being deployed, contact the Cisco Tech-
nical Assistance Center for help in diagnosing the problem. Otherwise correct the con-
figuration that is causing the fault and redeploy the policy.

Bridged Connectivity to External Networks

Overview

This chapter covers potential issues that could occur with Bridged Connectivity to Exter-
nal Networks, starting with an overview of how bridged connectivity to external networks
should function and the verification steps used to confirm a working layer 2 bridged net-
work for the example reference topology fabric.  The displays taken on a working fabric
can then be used as an aid in troubleshooting issues with external layer 2 connectivity. 

There are different ways to extend the layer 2 domain beyond the ACI fabric:

• Extend the EPG out of the ACI fabric - A user can extend an EPG out of the ACI
fabric by statically assigning a port (along with VLAN ID) to an EPG. The leaf will
learn the endpoint information and assign the traffic (by matching the port and
VLAN ID) to the proper EPG, and then enforce the policy. The endpoint
learning, data forwarding, and policy enforcement remain the same whether
the endpoint is directly attached to the leaf port or if it is behind a layer 2
network (provided the proper VLAN is enabled in the layer 2 network).
• Extend the bridge domain out of the ACI fabric - This option is designed to
extend the entire bridge domain (not an individual EPG under bridge domain)
to the outside network.

Problem Description

There are many ways to connect ACI leafs to external devices. The interface properties
could be access, trunk, port-channel, virtual port-channel, routed, routed sub-interfaces,
or SVI. When establishing Layer 2 connectivity with external devices, it is important to
match properties such as tagged VLANs, LACP modes, and so on. This is equally applicable
to networks hosted by ACI as well as external networks.

When a configuration mismatch occurs, depending on the parameters, the interfaces are
either down, or not forwarding traffic as expected.

Symptom

Interfaces connecting to the external devices are in the down state.



Verification

In this example, rtp_leaf1 and rtp_leaf3 need to form a vPC pair and connect using
a dedicated port-channel per UCS Fabric Interconnect. A virtual port-channel
has been configured, and the APIC has assigned port-channel4, as seen below on
rtp_leaf1 and rtp_leaf3.

Please note that although in this output both the vPC ID (2) and Port (Po4) match on both
leafs, only the vPC ID needs to match for this configuration to work.

rtp_leaf1# show vpc

Legend:

                (*) - local vPC is down, forwarding via vPC peer-link

vPC domain id                     : 10

Peer status                       : peer adjacency formed ok

vPC keep-alive status             : Disabled

Configuration consistency status  : success

Per-vlan consistency status       : success

Type-2 inconsistency reason       : Consistency Check Not Performed

vPC role                          : primary

Number of vPCs configured         : 2

Peer Gateway                      : Disabled

Dual-active excluded VLANs        : -

Graceful Consistency Check        : Enabled

Auto-recovery status              : Enabled (timeout = 240 seconds)

Operational Layer3 Peer           : Disabled

vPC Peer-link status

---------------------------------------------------------------------

id   Port   Status Active vlans

--   ----   ------ --------------------------------------------------

1           up     -

vPC status

----------------------------------------------------------------------

id   Port   Status Consistency Reason                     Active vlans

--   ----   ------ ----------- ------                     ------------

2    Po4    down*  success     success                    -

343  Po1    up     success     success                    700,751

 

rtp_leaf3# show vpc

Legend:

                (*) - local vPC is down, forwarding via vPC peer-link

vPC domain id                     : 10

Peer status                       : peer adjacency formed ok

vPC keep-alive status             : Disabled

Configuration consistency status  : success

Per-vlan consistency status       : success

Type-2 inconsistency reason       : Consistency Check Not Performed

vPC role                          : secondary

Number of vPCs configured         : 2

Peer Gateway                      : Disabled

Dual-active excluded VLANs        : -

Graceful Consistency Check        : Enabled

Auto-recovery status              : Enabled (timeout = 240 seconds)

Operational Layer3 Peer           : Disabled

 vPC Peer-link status

---------------------------------------------------------------------

id   Port   Status Active vlans

--   ----   ------ --------------------------------------------------

1           up     -

vPC status

----------------------------------------------------------------------

id   Port   Status Consistency Reason                     Active vlans

--   ----   ------ ----------- ------                     ------------

2    Po4    down*  success     success                    - 

343  Po1    up     success     success                    700,751

 
Since the interfaces are in the down (D) state, a check of interface status would reveal if
there is a potential layer 1 issue as seen below.

rtp_leaf1# show interface ethernet 1/27

Ethernet1/27 is down (sfp-speed-mismatch)

admin state is up, Dedicated Interface

On the GUI, navigating to Fabric -> Inventory -> Pod1 -> leafname -> Interfaces -> vPC
Interfaces -> <vPC Domain ID> -> <vPC ID> -> Faults would reveal:

Resolution

Configure the interface policies to match the peer device. In this scenario, the speed
mismatch was addressed by changing the interface policy from 1G to 10G.

Symptom

Certain interfaces are in the ‘suspended’ state when configuring a port-channel or virtual
port-channel.

Verification

Check the status of the vPC and port-channel using the CLI or GUI.

rtp_leaf1# show vpc

Legend:

                (*) - local vPC is down, forwarding via vPC peer-link

vPC domain id                     : 10

Peer status                       : peer adjacency formed ok

vPC keep-alive status             : Disabled

Configuration consistency status  : success

Per-vlan consistency status       : success

Type-2 inconsistency reason       : Consistency Check Not Performed

vPC role                          : primary

Number of vPCs configured         : 2

Peer Gateway                      : Disabled

Dual-active excluded VLANs        : -

Graceful Consistency Check        : Enabled

Auto-recovery status              : Enabled (timeout = 240 seconds)

Operational Layer3 Peer           : Disabled

vPC Peer-link status

---------------------------------------------------------------------

id   Port   Status Active vlans

--   ----   ------ --------------------------------------------------

1           up     -

vPC status

----------------------------------------------------------------------

id   Port   Status Consistency Reason                     Active vlans

--   ----   ------ ----------- ------                     ------------

2    Po4    up     success     success                    600-601,634,
                                                          639,667-668

343  Po1    up     success     success                    700,751

 

rtp_leaf1# show port-channel summary

Flags:  D - Down        P - Up in port-channel (members)

        I - Individual  H - Hot-standby (LACP only)

        s - Suspended   r - Module-removed

        S - Switched    R - Routed

        U - Up (port-channel)

        M - Not in use. Min-links not met

--------------------------------------------------------------------------------

Group Port-       Type     Protocol  Member Ports

      Channel

--------------------------------------------------------------------------------

1     Po1(SU)     Eth      LACP      Eth1/42(P)   Eth1/44(P)

4     Po4(SU)     Eth      LACP      Eth1/27(P)   Eth1/28(s)

On the GUI, navigating to Fabric -> Inventory -> Pod1 -> leafname -> Interfaces -> vPC
Interfaces -> <vPC Domain ID> -> <vPC ID> -> Faults would reveal:

Although the vPC is up, links to only one neighbor are members of this port-channel.
Since the interfaces are in the suspended (s) state, a check of the LACP interface status
would reveal if there are any problems with LACP communication with the peer. In this
example the peer LACP system identifiers are different, indicating two different peer
devices.

rtp_leaf1# show lacp interface ethernet 1/27 | grep -A 2 Neighbor

Neighbor: 0x113

  MAC Address= 00-0d-ec-b1-a0-3c

  System Identifier=0x8000,00-0d-ec-b1-a0-3c

rtp_leaf1# show lacp interface ethernet 1/28 | grep -A 2 Neighbor

Neighbor: 0x113

  MAC Address= 00-0d-ec-b1-a9-fc

  System Identifier=0x8000,00-0d-ec-b1-a9-fc

Resolution

In this example, rtp_leaf1 and rtp_leaf3 need to form a vPC pair, and they each need to
connect using a dedicated port-channel for each UCS Fabric Interconnect. When using
the vPC wizard or directly configuring virtual port-channels, unique interface policy
groups and interface selectors are needed to create dedicated port-channels for each
peer device, such as UCS Fabric Interconnect A and Fabric Interconnect B.

Once the needed configuration is done, two independent port-channels are created.

Fabric -> Access Policies -> Interface Policies -> Policy Groups
Fabric -> Access Policies -> Interface Policies -> Profiles

rtp_leaf1# show port-channel summary

Flags:  D - Down        P - Up in port-channel (members)

        I - Individual  H - Hot-standby (LACP only)

        s - Suspended   r - Module-removed

        S - Switched    R - Routed

        U - Up (port-channel)

        M - Not in use. Min-links not met

-------------------------------------------------------------------------

Group Port-       Type     Protocol  Member Ports

      Channel

-------------------------------------------------------------------------

1     Po1(SU)     Eth      LACP      Eth1/42(P)   Eth1/44(P)

5     Po5(SU)     Eth      LACP      Eth1/27(P)

6     Po6(SD)     Eth      LACP      Eth1/28(P)

Problem Description

There are various use cases, such as migration, where L2 extension outside the ACI fabric
is needed, either through direct EPG extension or through special L2 Out connectivity.
During migration scenarios, there is also a need for the default gateway to be external
to the ACI fabric.

Most of the time, the problem presents itself as a reachability problem between fabric-
hosted endpoints and external networks. In this example, the Web tier within the Tenant
Test is used, and the address of WebServer 10.2.1.11 needs to be reachable from the
Nexus 7K, which has been configured as follows:

N7K-1-65-vdc_4# show hsrp brie

P indicates configured to preempt.

Interface Grp Prio P State Active addr Standby addr Group addr

Vlan700 700 110 P Active local 10.1.0.2 10.1.0.1 (conf)

Vlan750 750 110 P Active local 10.2.0.2 10.2.0.1 (conf)

Vlan751 751 110 P Active local 10.2.1.2 10.2.1.1 (conf)

The WebServer with IP of 10.2.1.11 and the BD's IP of 10.2.1.254 are unreachable from
the Nexus 7Ks.

N7K-1-65-vdc_4# ping 10.2.1.254

PING 10.2.1.254 (10.2.1.254): 56 data bytes


Request 0 timed out

Request 1 timed out

Request 2 timed out

Request 3 timed out

Request 4 timed out

--- 10.2.1.254 ping statistics ---

5 packets transmitted, 0 packets received, 100.00% packet loss

N7K-1-65-vdc_4# ping 10.2.1.11

PING 10.2.1.11 (10.2.1.11): 56 data bytes

Request 0 timed out

Request 1 timed out

Request 2 timed out

Request 3 timed out

Request 4 timed out

--- 10.2.1.11 ping statistics ---

5 packets transmitted, 0 packets received, 100.00% packet loss

• Extending the EPG directly out of the ACI fabric

Symptom

The Leaf is not getting programmed with the correct VLANs for BridgeDomain and EPGs.

Verification/Resolution

The reachability problem could be due to many issues; however, the most common is
that the right leafs are not programmed with the correct VLANs used for Bridge Domain
and EPG identification.

BD relationship with the Context:

The Context-BD relationship is key for programming the leaf. Without it, the VLANs
do not get programmed on the leaf, as shown below.

 

rtp_leaf1# show vlan brief

 VLAN Name                             Status    Ports

 ---- -------------------------------- --------- -------------------------------

 13   infra:default                    active    Eth1/1, Eth1/2, Eth1/5, Eth1/35

 27   Test:Database                    active    Eth1/27, Eth1/28, Po2, Po3

 28   Test:CommerceWorkspaceTest:Datab active    Eth1/27, Eth1/28, Po2, Po3

      ase
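If the relationship is missing, it can be set through the GUI or pushed via the REST API. A minimal sketch, assuming the tenant/BD/Context names from this example and admin credentials on an APIC at https://apic; fvRsCtx is the BD-to-Context relation in the object model:

import requests

APIC = "https://apic"                      # assumption: APIC address
s = requests.Session()
s.verify = False                           # lab sketch only
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Associate bridge domain "Web" in tenant "Test" with Context (VRF) "Test".
payload = {"fvBD": {
    "attributes": {"name": "Web"},
    "children": [{"fvRsCtx": {"attributes": {"tnFvCtxName": "Test"}}}]}}
s.post(f"{APIC}/api/mo/uni/tn-Test/BD-Web.json", json=payload)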

Once the BD is assigned the right Context, the BD and EPG VLANs get programmed
appropriately.

rtp_leaf1# show vlan brief

 VLAN Name                             Status    Ports

 ---- -------------------------------- --------- -------------------------------

 13   infra:default                    active    Eth1/1, Eth1/2, Eth1/5, Eth1/35

 27   Test:Database                    active    Eth1/27, Eth1/28, Po2, Po3

 28   Test:CommerceWorkspaceTest:Datab active    Eth1/27, Eth1/28, Po2, Po3

      ase

 37   Test:Web                         active    Eth1/27, Eth1/28, Po2, Po3

 38   Test:CommerceWorkspaceTest:Web   active    Eth1/27, Eth1/28, Po2, Po3

While the VLANs are programmed properly, the ports on which the VLANs are carried
seem to be incomplete. The Po2 and Po3 links to the UCS hosting the VMs are shown here;
however, the links to the N7Ks are not, which leads us to the next possible issue.

Ports not being programmed with EPG encap VLANs:

In this example, VLAN 751 is used to connect to the Nexus 7Ks, and the EPG has been
dynamically assigned VLAN 639 within the scope of the VMM domain. The following output
confirms that while VLAN 639 has been programmed, VLAN 751 is not present on the leaf.

 
rtp_leaf1# show vlan extended

 VLAN Name                             Status    Ports

 ---- -------------------------------- --------- -------------------------------

 13   infra:default                    active    Eth1/1, Eth1/2, Eth1/5, Eth1/35

 27   Test:Database                    active    Eth1/27, Eth1/28, Po2, Po3

 28   Test:CommerceWorkspaceTest:Datab active    Eth1/27, Eth1/28, Po2, Po3

      ase

 37   Test:Web                         active    Eth1/27, Eth1/28, Po2, Po3

 38   Test:CommerceWorkspaceTest:Web   active    Eth1/27, Eth1/28, Po2, Po3

 VLAN Type  Vlan-mode  Encap

 ---- ----- ---------- -------------------------------

 13   enet  CE         vxlan-16777209, vlan-3500

 27   enet  CE         vxlan-14680064

 28   enet  CE         vlan-600

 37   enet  CE         vxlan-15794150

 38   enet  CE         vlan-639

For this issue to be resolved, the EPG needs to be bound to a port/leaf, and the L2Out
domain also needs to be attached. The L2Out needs to be associated with a VLAN pool
consisting of VLAN 751.
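For reference, the static binding described above can also be pushed through the REST API. A minimal sketch under the same assumptions as before (APIC at https://apic, admin credentials); the vPC policy-group name and node IDs in the tDn are hypothetical, while fvRsPathAtt is the standard static-path binding object:

import requests

APIC = "https://apic"                      # assumption: APIC address
s = requests.Session()
s.verify = False                           # lab sketch only
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Statically bind the Web EPG to the vPC path toward the N7Ks with encap vlan-751.
payload = {"fvAEPg": {
    "attributes": {"name": "Web"},
    "children": [{"fvRsPathAtt": {"attributes": {
        # vPC path format: protpaths-<node1>-<node2>/pathep-[<policy-group>]
        "tDn": "topology/pod-1/protpaths-101-103/pathep-[N7K-vPC]",  # hypothetical
        "encap": "vlan-751"}}}]}}
s.post(f"{APIC}/api/mo/uni/tn-Test/ap-CommerceWorkspaceTest/epg-Web.json",
       json=payload)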

Once the configuration is applied, the leafs are programmed to carry all the relevant L2
constructs: the BD, the encap VLAN for the VMM domain, and the encap VLAN for the L2Out.
Also, the right interfaces are mapped to the encap VLANs: VLAN-751 on Po1 to the N7Ks,
and VLAN-639 on Po2 and Po3 to both Fabric Interconnects of a UCS system.

rtp_leaf1# show port-channel summary

Flags:  D - Down        P - Up in port-channel (members)

        I - Individual  H - Hot-standby (LACP only)

        s - Suspended   r - Module-removed

        S - Switched    R - Routed

        U - Up (port-channel)

        M - Not in use. Min-links not met

--------------------------------------------------------------------------------

Group Port-       Type     Protocol  Member Ports

      Channel

--------------------------------------------------------------------------------

1     Po1(SU)     Eth      LACP      Eth1/42(P)   Eth1/44(P)


2     Po2(SU)     Eth      LACP      Eth1/27(P)

3     Po3(SU)     Eth      LACP      Eth1/28(P)

rtp_leaf1# show vlan extended

 VLAN Name                             Status    Ports

 ---- -------------------------------- --------- -------------------------------

 13   infra:default                    active    Eth1/1, Eth1/2, Eth1/5, Eth1/35

 14   Test:CommerceWorkspaceTest:Web   active    Eth1/42, Eth1/44, Po1

 27   Test:Database                    active    Eth1/27, Eth1/28, Po2, Po3

 28   Test:CommerceWorkspaceTest:Datab active    Eth1/27, Eth1/28, Po2, Po3

      ase

 37   Test:Web                         active    Eth1/27, Eth1/28, Eth1/42,

                                                 Eth1/44, Po1, Po2, Po3

 38   Test:CommerceWorkspaceTest:Web   active    Eth1/27, Eth1/28, Po2, Po3

 VLAN Type  Vlan-mode  Encap

 ---- ----- ---------- -------------------------------

 13   enet  CE         vxlan-16777209, vlan-3500

 14   enet  CE         vlan-751

 27   enet  CE         vxlan-14680064

 28   enet  CE         vlan-600

 37   enet  CE         vxlan-15794150

 38   enet  CE         vlan-639

rtp_leaf1# show vpc

Legend:

                (*) - local vPC is down, forwarding via vPC peer-link

vPC domain id                     : 10

Peer status                       : peer adjacency formed ok

vPC keep-alive status             : Disabled

Configuration consistency status  : success

Per-vlan consistency status       : success

Type-2 inconsistency reason       : Consistency Check Not Performed

vPC role                          : primary

Number of vPCs configured         : 3

Peer Gateway                      : Disabled

Dual-active excluded VLANs        : -


Graceful Consistency Check        : Enabled

Auto-recovery status              : Enabled (timeout = 240 seconds)

Operational Layer3 Peer           : Disabled

vPC Peer-link status

---------------------------------------------------------------------

id   Port   Status Active vlans

--   ----   ------ --------------------------------------------------

1           up     -

vPC status

----------------------------------------------------------------------

id   Port   Status Consistency Reason                     Active vlans

--   ----   ------ ----------- ------                     ------------

1    Po2    up     success     success                    600-601,634,639,666-669

343  Po1    up     success     success                    700,751

684  Po3    up     success     success                    600-601,634,639,667-668

rtp_leaf1#

 
Testing now reveals that while the N7K can ping the BD address of 10.2.1.254, it cannot
ping the WebServer VM (10.2.1.11).
 
N7K-2-50-N7K2# ping 10.2.1.254

PING 10.2.1.254 (10.2.1.254): 56 data bytes

Request 0 timed out

64 bytes from 10.2.1.254: icmp_seq=1 ttl=56 time=1.656 ms

64 bytes from 10.2.1.254: icmp_seq=2 ttl=56 time=0.568 ms

64 bytes from 10.2.1.254: icmp_seq=3 ttl=56 time=0.826 ms

64 bytes from 10.2.1.254: icmp_seq=4 ttl=56 time=0.428 ms

--- 10.2.1.254 ping statistics ---

5 packets transmitted, 4 packets received, 20.00% packet loss

round-trip min/avg/max = 0.428/0.869/1.656 ms


N7K-2-50-N7K2# ping 10.2.1.11

PING 10.2.1.11 (10.2.1.11): 56 data bytes

Request 0 timed out

Request 1 timed out

Request 2 timed out

Request 3 timed out

Request 4 timed out

--- 10.2.1.11 ping statistics ---

5 packets transmitted, 0 packets received, 100.00% packet loss

N7K-2-50-N7K2#

The scenario described here is intra-EPG connectivity, where contracts are not applied,
so this is not related to any filter; this brings us to the next use case.

Symptom

ACI Fabric is not learning the endpoint IPs on the leafs.

Verification/Resolution

Once the leaf is programmed, the endpoints are learned as traffic is received. Endpoint
learning is key when the BD is in Hardware Proxy mode, so that the fabric can
efficiently route packets.

Since the N7Ks can ping the BD pervasive gateway address and not the Webserver IP of
10.2.1.11, the next step is to check the endpoint table.

As seen below, while the N7K addresses (10.2.1.2, 10.2.1.3) are seen, the webserver IP
(10.2.1.11) is missing from the endpoint table.

rtp_leaf1# show endpoint  vrf Test:Test detail

Legend:

 O - peer-attached    H - vtep             a - locally-aged     S - static

 V - vpc-attached     p - peer-aged        L - local            M - span

 s - static-arp       B - bounce

+---------------+---------------+-----------------+--------------+-------------+----------------------+

      VLAN/       Encap           MAC Address       MAC Info/       Interface     Endpoint Group

      Domain      VLAN            IP Address        IP Info                       Info

+---------------+---------------+-----------------+--------------+-------------+----------------------+

Test:Test                               10.1.0.101 L

38/Test:Test            vlan-639    0050.56bb.d508 LV                        po2 Test:CommerceWorkspaceTest:Web

14/Test:Test            vlan-751    0026.f064.0000 LpV                       po1 Test:CommerceWorkspaceTest:Web

14/Test:Test            vlan-751    0000.0c9f.f2ef LpV                       po1 Test:CommerceWorkspaceTest:Web

14                      vlan-751    0026.980a.df44 LpV                       po1 Test:CommerceWorkspaceTest:Web

Test:Test               vlan-751          10.2.1.2 LV

14                      vlan-751    001b.54c2.2644 LV                        po1 Test:CommerceWorkspaceTest:Web

Test:Test               vlan-751          10.2.1.3 LV

+------------------------------------------------------------------------------+

                             Endpoint Summary

+------------------------------------------------------------------------------+

Total number of Local Endpoints     : 6

Total number of Remote Endpoints    : 0

Total number of Peer Endpoints      : 0

Total number of vPC Endpoints       : 5

Total number of non-vPC Endpoints   : 1

Total number of MACs                : 5

Total number of VTEPs               : 0

Total number of Local IPs           : 3

Total number of Remote IPs          : 0

Total number All EPs                : 6
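The same check can be run fabric-wide from the APIC REST API by querying the fvCEp (client endpoint) class. A minimal sketch, assuming an APIC at https://apic and admin credentials:

import requests

APIC = "https://apic"                      # assumption: APIC address
s = requests.Session()
s.verify = False                           # lab sketch only
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Look up the client endpoint (fvCEp) for the WebServer IP across the fabric.
url = (f"{APIC}/api/class/fvCEp.json"
       '?query-target-filter=eq(fvCEp.ip,"10.2.1.11")')
for ep in s.get(url).json()["imdata"]:
    a = ep["fvCEp"]["attributes"]
    print(a["dn"], a["mac"], a["ip"], a["encap"])

An empty result confirms, as the leaf CLI above does, that the endpoint has not yet been learned.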

Just as MAC addresses are learned in traditional switching, the endpoints are learned by the
leaf when the first packet is received. A ping from the VM triggers this learning, and the
following output confirms this:

rtp_leaf1# show endpoint  vrf Test:Test detail

Legend:

 O - peer-attached    H - vtep             a - locally-aged     S - static

 V - vpc-attached     p - peer-aged        L - local            M - span

 s - static-arp       B - bounce

+---------------+---------------+-----------------+--------------+-------------+-------------------------+
      VLAN/       Encap           MAC Address       MAC Info/       Interface     Endpoint Group

      Domain      VLAN            IP Address        IP Info                       Info

+---------------+---------------+-----------------+--------------+-------------+-------------------------+

Test:Test                               10.1.0.101 L

38                      vlan-639    0050.56bb.d508 LpV                       po2 Test:CommerceWorkspaceTest:Web

Test:Test               vlan-639         10.2.1.11 LV

14/Test:Test            vlan-751    0026.f064.0000 LpV                       po1 Test:CommerceWorkspaceTest:Web

14/Test:Test            vlan-751    0000.0c9f.f2ef LpV                       po1 Test:CommerceWorkspaceTest:Web

14                      vlan-751    0026.980a.df44 LpV                       po1 Test:CommerceWorkspaceTest:Web

Test:Test               vlan-751          10.2.1.2 LV

14                      vlan-751    001b.54c2.2644 LV                        po1 Test:CommerceWorkspaceTest:Web

Test:Test               vlan-751          10.2.1.3 LV

+------------------------------------------------------------------------------+

                              Endpoint Summary

 
+------------------------------------------------------------------------------+

Total number of Local Endpoints     : 6

Total number of Remote Endpoints    : 0

Total number of Peer Endpoints      : 0

Total number of vPC Endpoints       : 5

Total number of non-vPC Endpoints   : 1

Total number of MACs                : 5

Total number of VTEPs               : 0

Total number of Local IPs           : 4

Total number of Remote IPs          : 0

Total number All EPs                : 6

Once the endpoint is learned, the ping is successful from the N7K to the Webserver IP
of 10.2.1.11.

N7K-1-65-vdc_4# ping 10.2.1.11

PING 10.2.1.11 (10.2.1.11): 56 data bytes

64 bytes from 10.2.1.11: icmp_seq=0 ttl=127 time=1.379 ms

64 bytes from 10.2.1.11: icmp_seq=1 ttl=127 time=1.08 ms

64 bytes from 10.2.1.11: icmp_seq=2 ttl=127 time=0.498 ms

64 bytes from 10.2.1.11: icmp_seq=3 ttl=127 time=0.479 ms

64 bytes from 10.2.1.11: icmp_seq=4 ttl=127 time=0.577 ms

--- 10.2.1.11 ping statistics ---


5 packets transmitted, 5 packets received, 0.00% packet loss

round-trip min/avg/max = 0.479/0.802/1.379 ms

N7K-1-65-vdc_4#

This issue is also seen with the N7K HSRP address, since the N7K does not normally
source packets from the HSRP address. In the above tables, the HSRP IP (10.2.1.1) is
missing from the endpoint table.

Forcing N7K to source the packet from the HSRP address populates the endpoint table.

N7K-1-65-vdc_4# ping 10.2.1.254 source 10.2.1.1

PING 10.2.1.254 (10.2.1.254) from 10.2.1.1: 56 data bytes

Request 0 timed out

64 bytes from 10.2.1.254: icmp_seq=1 ttl=57 time=1.472 ms

64 bytes from 10.2.1.254: icmp_seq=2 ttl=57 time=1.062 ms

64 bytes from 10.2.1.254: icmp_seq=3 ttl=57 time=1.097 ms

64 bytes from 10.2.1.254: icmp_seq=4 ttl=57 time=1.232 ms

--- 10.2.1.254 ping statistics ---

5 packets transmitted, 4 packets received, 20.00% packet loss

round-trip min/avg/max = 1.062/1.215/1.472 ms

N7K-1-65-vdc_4# 

This is one of the reasons why, when the default gateway is outside the fabric, the BD
mode should be enabled for flooding and NOT Hardware Proxy.
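Setting the BD mode is a policy change on the bridge domain; a minimal sketch of the equivalent REST call under the same assumptions (APIC at https://apic, admin credentials), using the fvBD attributes unkMacUcastAct and arpFlood:

import requests

APIC = "https://apic"                      # assumption: APIC address
s = requests.Session()
s.verify = False                           # lab sketch only
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Switch the Web BD from Hardware Proxy to flooding, and flood ARP,
# so that a default gateway outside the fabric keeps working.
payload = {"fvBD": {"attributes": {
    "name": "Web",
    "unkMacUcastAct": "flood",             # Hardware Proxy would be "proxy"
    "arpFlood": "yes"}}}
s.post(f"{APIC}/api/mo/uni/tn-Test/BD-Web.json", json=payload)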
 

rtp_leaf1# show endpoint  vrf Test:Test detail

Legend:

 O - peer-attached    H - vtep             a - locally-aged     S - static

 V - vpc-attached     p - peer-aged        L - local            M - span

 s - static-arp       B - bounce

+---------------+---------------+-----------------+--------------+-------------+----------------------+

      VLAN/       Encap           MAC Address       MAC Info/       Interface     Endpoint Group

      Domain      VLAN            IP Address        IP Info                       Info

+---------------+---------------+-----------------+--------------+-------------+----------------------+

Test:Test                               10.1.0.101 L

38                      vlan-639    0050.56bb.d508 LV                        po2 Test:CommerceWorkspaceTest:Web

Test:Test               vlan-639         10.2.1.11 LV

14/Test:Test            vlan-751    0026.f064.0000 LpV                       po1 Test:CommerceWorkspaceTest:Web

14/Test:Test            vlan-751    0000.0c9f.f2ef LpV                       po1 Test:CommerceWorkspaceTest:Web

14                      vlan-751    0026.980a.df44 LpV                       po1 Test:CommerceWorkspaceTest:Web

Test:Test               vlan-751          10.2.1.2 LV

14                      vlan-751    001b.54c2.2644 LpV                       po1 Test:CommerceWorkspaceTest:Web

Test:Test               vlan-751          10.2.1.3 LV

Test:Test               vlan-751          10.2.1.1 LV

+------------------------------------------------------------------------------+

                             Endpoint Summary

+------------------------------------------------------------------------------+

Total number of Local Endpoints     : 6

Total number of Remote Endpoints    : 0

Total number of Peer Endpoints      : 0

Total number of vPC Endpoints       : 5

Total number of non-vPC Endpoints   : 1

Total number of MACs                : 5

Total number of VTEPs               : 0

Total number of Local IPs           : 5

Total number of Remote IPs          : 0

Total number All EPs                : 6

  

N7K-1-65-vdc_4# ping 10.2.1.11 source 10.2.1.1

PING 10.2.1.11 (10.2.1.11) from 10.2.1.1: 56 data bytes

64 bytes from 10.2.1.11: icmp_seq=0 ttl=127 time=1.276 ms


64 bytes from 10.2.1.11: icmp_seq=1 ttl=127 time=0.751 ms

64 bytes from 10.2.1.11: icmp_seq=2 ttl=127 time=0.752 ms

64 bytes from 10.2.1.11: icmp_seq=3 ttl=127 time=0.807 ms

64 bytes from 10.2.1.11: icmp_seq=4 ttl=127 time=0.741 ms

--- 10.2.1.11 ping statistics ---

5 packets transmitted, 5 packets received, 0.00% packet loss

round-trip min/avg/max = 0.741/0.865/1.276 ms

N7K-1-65-vdc_4#

  • Extending the bridge domain using an external bridged network

In this scenario, Layer 2 extension is achieved using a unique external EPG so as to
address spanning-tree interoperability when integrating/extending external Layer 2
networks with the ACI fabric.

In this setup, the Web EPG is not directly associated with the L2Out interfaces towards
the N7K. Instead, the Web EPG is associated with BD Web, which is then extended using
the external bridged connectivity named L2Out, with the external networks identified
that are allowed to communicate with the Web tier.

Symptom

The Leaf is not getting programmed with the correct VLANs for BridgeDomain and EPGs.

Verification

As with the previous scenario with direct EPG extension outside the fabric, programming
of the leafs needs to happen before the endpoints are learned by the leafs. A mismatch in
configuration is the most common scenario seen when defining the extended bridged network.

rtp_leaf1# show vlan extended

 VLAN Name                             Status    Ports

 ---- -------------------------------- --------- -------------------------------

 13   infra:default                    active    Eth1/1, Eth1/2, Eth1/5, Eth1/35

 27   Test:Database                    active    Eth1/27, Eth1/28, Po2, Po3

 28   Test:CommerceWorkspaceTest:Datab active    Eth1/27, Eth1/28, Po2, Po3


      ase

 37   Test:Web                         active    Eth1/27, Eth1/28, Eth1/42,

                                                 Eth1/44, Po1, Po2, Po3

 38   Test:CommerceWorkspaceTest:Web   active    Eth1/27, Eth1/28, Po2, Po3

 68   --                               active    Eth1/42, Eth1/44, Po1

 VLAN Type  Vlan-mode  Encap

 ---- ----- ---------- -------------------------------

 13   enet  CE         vxlan-16777209, vlan-3500

 27   enet  CE         vxlan-14680064

 28   enet  CE         vlan-600

 37   enet  CE         vxlan-15794150

 38   enet  CE         vlan-639

 68   enet  CE         vlan-750

Since the N7Ks are expecting VLAN-751 but the L2Out is configured with VLAN-750, the
Layer 2 domains are not extended correctly.

Resolution

Changing it to VLAN-751 makes the N7K able to ping the BD address of 10.2.1.254, but not
the WebServer 10.2.1.11. This is due to the fact that external networks are identified as an
EPG (L2Out), and contracts are needed to make communication happen between any two EPGs.

 rtp_leaf1# show vlan extended

 VLAN Name                             Status    Ports

 ---- -------------------------------- --------- -------------------------------

 13   infra:default                    active    Eth1/1, Eth1/2, Eth1/5, Eth1/35

 27   Test:Database                    active    Eth1/27, Eth1/28, Po2, Po3

 28   Test:CommerceWorkspaceTest:Datab active    Eth1/27, Eth1/28, Po2, Po3

      ase

 37   Test:Web                         active    Eth1/27, Eth1/28, Eth1/42,

                                                 Eth1/44, Po1, Po2, Po3

 38   Test:CommerceWorkspaceTest:Web   active    Eth1/27, Eth1/28, Po2, Po3

 69   --                               active    Eth1/42, Eth1/44, Po1

 VLAN Type  Vlan-mode  Encap

 ---- ----- ---------- -------------------------------

 13   enet  CE         vxlan-16777209, vlan-3500

 27   enet  CE         vxlan-14680064

 28   enet  CE         vlan-600

 37   enet  CE         vxlan-15794150

 38   enet  CE         vlan-639

 69   enet  CE         vlan-751

N7K-1-65-vdc_4# ping 10.2.1.254

PING 10.2.1.254 (10.2.1.254): 56 data bytes

64 bytes from 10.2.1.254: icmp_seq=0 ttl=56 time=1.068 ms

64 bytes from 10.2.1.254: icmp_seq=1 ttl=56 time=0.753 ms

64 bytes from 10.2.1.254: icmp_seq=2 ttl=56 time=0.708 ms

64 bytes from 10.2.1.254: icmp_seq=3 ttl=56 time=0.731 ms

64 bytes from 10.2.1.254: icmp_seq=4 ttl=56 time=0.699 ms

--- 10.2.1.254 ping statistics ---

5 packets transmitted, 5 packets received, 0.00% packet loss

round-trip min/avg/max = 0.699/0.791/1.068 ms

N7K-1-65-vdc_4# ping 10.2.1.11

PING 10.2.1.11 (10.2.1.11): 56 data bytes

Request 0 timed out

Request 1 timed out

Request 2 timed out

Request 3 timed out

Request 4 timed out

--- 10.2.1.11 ping statistics ---

5 packets transmitted, 0 packets received, 100.00% packet loss



Symptom

The Leaf is programmed with the correct VLANs and interfaces as expected, but the servers
are unreachable from the outside L2 network.

Verification

The presence of contracts between the Web EPG and the L2Out EPG needs to be checked to
confirm reachability of the WebServer from the N7K. The PcTag of the Web EPG is found
from Visore to be 49153.
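The pcTag can also be read programmatically instead of through Visore; a minimal sketch, assuming an APIC at https://apic and admin credentials, reading the pcTag attribute of the fvAEPg object:

import requests

APIC = "https://apic"                      # assumption: APIC address
s = requests.Session()
s.verify = False                           # lab sketch only
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Read the pcTag of the Web EPG.
dn = "uni/tn-Test/ap-CommerceWorkspaceTest/epg-Web"
mo = s.get(f"{APIC}/api/mo/{dn}.json").json()["imdata"][0]
print(mo["fvAEPg"]["attributes"]["pcTag"])  # 49153 in this example

The zoning rules on the leaf are then checked for this pcTag: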

rtp_leaf1# show zoning-rule | grep 49153

rtp_leaf1#

Resolution

After configuring contracts between the WebEPG and L2Out EPG, the command output
shows as below:

rtp_leaf1# show zoning-rule | grep 49153

5352            49153           16386           default         enabled         2883584         permit

5353            16386           49153           default         enabled         2883584         permit

Once the contracts are defined, the pings from the N7K are successful. The endpoints are
still learned as they send traffic, so the issues highlighted in the previous 'Extending
the EPG directly out of the ACI fabric' section (endpoint not in the endpoint table)
are applicable even in this scenario.

N7K-1-65-vdc_4# ping 10.2.1.11

PING 10.2.1.11 (10.2.1.11): 56 data bytes

64 bytes from 10.2.1.11: icmp_seq=0 ttl=127 time=1.676 ms

64 bytes from 10.2.1.11: icmp_seq=1 ttl=127 time=0.689 ms

64 bytes from 10.2.1.11: icmp_seq=2 ttl=127 time=0.626 ms

64 bytes from 10.2.1.11: icmp_seq=3 ttl=127 time=0.75 ms

64 bytes from 10.2.1.11: icmp_seq=4 ttl=127 time=0.797 ms

--- 10.2.1.11 ping statistics ---

5 packets transmitted, 5 packets received, 0.00% packet loss

round-trip min/avg/max = 0.626/0.907/1.676 ms



Routed Connectivity to External Networks

Overview

External network connectivity is an essential component of a useful fabric deployment.

To accommodate connections to external network entities, the ACI fabric provides
the ability to automate provisioning of external network connections through the policy
model. This chapter provides an overview of, and troubleshooting related to, external
network connection methods.

Routed external network connectivity is provided by the association of an external routed
domain with a special EPG in a tenant. This EPG expresses the routed network reachability
into an ACI fabric as an object that can be managed and manipulated like any other
object. Within the Layer 3 external network, the configurable routing protocols are BGP,
OSPF or static routes. Configuration of this object involves switch-specific configuration
and interface-specific configuration.

The Layer 3 External Instance Profile EPG exposes the external EPG to tenant EPGs
through a contract.

As of ACI software version 1.0(1k), there is one operational caveat that dictates that only
one outside network can be configured per leaf switch. However, the outside network
configuration can easily be reused for multiple nodes by associating multiple nodes with
the L3 External Node Profile.

External Route Distribution Inside Fabric

Multiprotocol Border Gateway Protocol (MP-BGP) is the routing protocol running in-
ternal to the fabric. A border leaf (an ACI leaf that provides host, fabric, and external
network connections) can peer with external networks and redistribute external routes
into the internal MP-BGP. The fabric leverages MP-BGP to distribute external routes to
other leaf switches. External routes are propagated to leaf switches where there are
endpoints attached for a given tenant.

Route distribution does not occur by default, as MP-BGP has to be enabled by configuration.
To configure route distribution, MP-BGP has to be turned on by assigning a BGP AS number
and configuring spine nodes as BGP route reflectors. As a result, the APIC will configure
all leaf nodes as MP-BGP route reflector clients. The APIC will also automate the
provisioning of the BGP components that provide this functionality: BGP session setup,
Route Distinguishers, import and export targets, the VPNv4 address family and route-maps
for redistribution. Sessions are established between the TEP IPs of leafs and the route
reflector functions running on spines. The MP-BGP process is contained within the overlay-1
VRF, part of the infra tenant. It is important to highlight that MP-BGP will not carry
endpoint tables (endpoint MAC and IP entries). While BGP leverages the TEP IPs for session
establishment, IS-IS is leveraged for reachability of the TEP IPs of the nodes.

Border leafs advertise tenant public subnets to external routers. Transit routing is
currently not supported: while the border leafs inject external routes into MP-BGP,
external routes learned by a border leaf are not advertised back outside of the fabric.
External routes distributed to non-border leafs are installed with the next hop set to the
overlay VRF TEP address of the border leaf where the route was learned from.

Fabric Verification

The BGP Process is started once the BGP object has a valid ASN. 

Output from Spine 1

rtp_spine1# cat /mit/sys/bgp/inst/summary 

# BGP Instance

activateTs   : 2014-10-15T13:11:25.669-04:00

adminSt      : enabled

asPathDbSz   : 0

asn          : 10

attribDbSz   : 736

childAction  : 

createTs     : 2014-10-15T13:11:24.415-04:00

ctrl         : 

dn           : sys/bgp/inst

lcOwn        : local

memAlert     : normal

modTs        : 2014-10-15T13:11:19.746-04:00

monPolDn     : uni/fabric/monfab-default

name         : default

numAsPath    : 0
numRtAttrib  : 8

operErr      : 

rn           : inst

snmpTrapSt   : disable

status       : 

syslogLvl    : err

ver          : v4

waitDoneTs   : 2014-10-15T13:11:36.640-04:00

rtp_spine1# show vrf                 

 VRF-Name                           VRF-ID State    Reason                         

 black-hole                              3 Up       --                             

 management                              2 Up       -- 

 overlay-1                               4 Up       -- 

rtp_spine1# show bgp sessions vrf overlay-1

Total peers 3, established peers 3

ASN 10

VRF overlay-1, local ASN 10

peers 3, established peers 3, local router-id 172.16.136.93

State: I-Idle, A-Active, O-Open, E-Established, C-Closing, S-Shutdown

Neighbor        ASN    Flaps LastUpDn|LastRead|LastWrit St Port(L/R)  Notif(S/R)

172.16.136.92    10 0     00:00:20|never   |never    E  179/36783  0/0

172.16.136.95    10 0     00:00:20|never   |never    E  179/49138  0/0

172.16.136.91    10 0     00:00:19|never   |never    E  179/56262  0/0

Output from Spine 2 

rtp_spine2# cat /mit/sys/bgp/inst/summary 

# BGP Instance

activateTs   : 2014-10-15T13:11:26.594-04:00

adminSt      : enabled

asPathDbSz   : 0

asn          : 10

attribDbSz   : 736

childAction  : 

createTs     : 2014-10-15T13:11:25.363-04:00

ctrl         : 

dn           : sys/bgp/inst
lcOwn        : local

memAlert     : normal

modTs        : 2014-10-15T13:11:19.746-04:00

monPolDn     : uni/fabric/monfab-default

name         : default

numAsPath    : 0

numRtAttrib  : 8

operErr      : 

rn           : inst

snmpTrapSt   : disable

status       : 

syslogLvl    : err

ver          : v4

waitDoneTs   : 2014-10-15T13:11:32.901-04:00

rtp_spine2# show bgp sessions vrf overlay-1

Total peers 3, established peers 3

ASN 10

VRF overlay-1, local ASN 10

peers 3, established peers 3, local router-id 172.16.136.94

State: I-Idle, A-Active, O-Open, E-Established, C-Closing, S-Shutdown

Neighbor        ASN    Flaps LastUpDn|LastRead|LastWrit St Port(L/R)  Notif(S/R)

172.16.136.91    10 0     00:05:15|never   |never    E  179/49429  0/0

172.16.136.95    10 0     00:05:14|never   |never    E  179/47068  0/0

172.16.136.92    10 0     00:05:14|never   |never    E  179/32889  0/0

Problem description

External routes are not reachable from the fabric.

Symptom

When checking routing table entries for a given VRF on a leaf,  no routes are shown or
directly connected routes are not distributed to other leafs.

Verification/Resolution

Whether the BGP process is running can be confirmed on the spine by running the
command show bgp sessions vrf all:

rtp_spine1# show bgp session vrf all  

Note: BGP process currently not running

Route reflector configuration includes modifying the default Fabric Pod policy to include
a Policy Group with a relationship to the default BGP Route Reflector policy. The BGP
Route Reflector policy needs to have a defined BGP AS number, with two spines selected
as the route reflectors.
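The same route reflector policy can be defined via the REST API; a minimal sketch, assuming an APIC at https://apic, admin credentials, and hypothetical spine node IDs 201 and 202 (use the actual spine node IDs of the fabric):

import requests

APIC = "https://apic"                      # assumption: APIC address
s = requests.Session()
s.verify = False                           # lab sketch only
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Define the fabric BGP AS number and select two spines as route reflectors.
payload = {"bgpInstPol": {
    "attributes": {"name": "default"},
    "children": [
        {"bgpAsP": {"attributes": {"asn": "10"}}},
        {"bgpRRP": {"attributes": {}, "children": [
            {"bgpRRNodePEp": {"attributes": {"id": "201"}}},   # hypothetical IDs
            {"bgpRRNodePEp": {"attributes": {"id": "202"}}}]}}]}}
s.post(f"{APIC}/api/mo/uni/fabric/bgpInstP-default.json", json=payload)

The default Pod policy group must still reference this BGP Route Reflector policy, as described above.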

Other troubleshooting commands:

show bgp sessions vrf  <name | all>

show bgp ipv4 unicast vrf <name | all>

show bgp vpnv4 unicast vrf  <name | all>

show ip bgp neighbors vrf <name | all>

show ip bgp neighbors <a.b.c.d> vrf <name | all>

show ip bgp nexthop-database vrf <name | all>

Problem description

Devices that should be reachable via OSPF in ACI fabric are unreachable.

For this example, the reference topology is used. Endpoint IPs within the ACI fabric are in
most cases expected to be routable and reachable from the external/outside network. In
the reference topology, leaf1 and leaf3 are acting as border routers peering with
external Nexus 7000 devices using OSPF. For this use case, the DB Endpoint IP of
10.1.3.31 is pinged from the Nexus 7Ks:

N7K-2-50-N7K2# ping 10.1.3.31

PING 10.1.3.31 (10.1.3.31): 56 data bytes

Request 0 timed out

Request 1 timed out

Request 2 timed out

Request 3 timed out

Request 4 timed out

--- 10.1.3.31 ping statistics ---

5 packets transmitted, 0 packets received, 100.00% packet loss

N7K-2-50-N7K2#

Symptom

OSPF routes are missing, neighbor relationships not established.

The following are some common problems that can be seen when getting Open Shortest
Path First (OSPF) neighbors to become fully adjacent between ACI and external devices.
In a successful formation of OSPF adjacency, OSPF neighbors will attain the FULL neigh-
bor state.

Verification - Mismatched OSPF Area Type

At the time of this writing, border leaf switches only support OSPF Not So Stubby Areas
(NSSA). This implies that the ACI border leaf switches will not be in area 0 and will not
provide Area Border Router (ABR) functionality. Although the APIC GUI and object model
for OSPF don’t provide area-type configurations, users need to set the area type on the
external routers to be an NSSA in order to bring up OSPF adjacency.

In this example, N7K2 has not been configured for NSSA, and the neighbors are missing
from the leaf:

rtp_leaf1# show ip ospf neighbors vrf all

 OSPF Process ID default VRF Prod:Prod

 Total number of neighbors: 1

 Neighbor ID     Pri State            Up Time  Address         Interface

 4.4.4.1           1 FULL/BDR         05:45:58 10.0.0.1        Eth1/41.14

 OSPF Process ID default VRF Test:Test

 Total number of neighbors: 1

 Neighbor ID     Pri State            Up Time  Address         Interface

 4.4.4.1           1 FULL/DR          00:18:30 10.0.1.1        Eth1/41.24

On the ACI leafs, checking the properties of the area will reveal not only the area type but
also other settings, such as reference bandwidth, which need to be verified so that the
overall OSPF design is in line with best practices.

 rtp_leaf1# show ip ospf vrf Prod:Prod

 Routing Process default with ID 10.0.0.101 VRF Prod:Prod

 Stateful High Availability enabled

 Supports only single TOS(TOS0) routes

 Supports opaque LSA


 Redistributing External Routes from

   static

 Administrative distance 110

 Reference Bandwidth is 40000 Mbps

 SPF throttling delay time of 200.000 msecs,

   SPF throttling hold time of 1000.000 msecs,

   SPF throttling maximum wait time of 5000.000 msecs

 LSA throttling start time of 0.000 msecs,

   LSA throttling hold interval of 5000.000 msecs,

   LSA throttling maximum wait time of 5000.000 msecs

 Minimum LSA arrival 1000.000 msec

 LSA group pacing timer 10 secs

 Maximum paths to destination 8

 Number of external LSAs 0, checksum sum 0x0

 Number of opaque AS LSAs 0, checksum sum 0x0

 Number of areas is 1, 0 normal, 0 stub, 1 nssa

 Number of active areas is 1, 0 normal, 0 stub, 1 nssa

   Area (0.0.0.100)

        Area has existed for 19:46:14

        Interfaces in this area: 3 Active interfaces: 3

        Passive interfaces: 1  Loopback interfaces: 1

        This area is a NSSA area

        Perform type-7/type-5 LSA translation

        Summarization is disabled

        No authentication available

        SPF calculation has run 40 times

         Last SPF ran for 0.000529s

        Area ranges are

        Number of LSAs: 10, checksum sum 0x0

Resolution - Mismatched OSPF Area Type

Once the following configuration is done on the N7K2,

router ospf 100

  area 0.0.0.100 nssa no-summary default-information-originate

  area 0.0.0.110 nssa no-summary default-information-originate



The neighbors are back up and operational:

rtp_leaf1# show ip ospf neighbors vrf all

 OSPF Process ID default VRF Prod:Prod

 Total number of neighbors: 2

 Neighbor ID     Pri State            Up Time  Address         Interface

 4.4.4.1           1 FULL/BDR         05:40:42 10.0.0.1        Eth1/41.14

 4.4.4.2           1 FULL/BDR         00:14:05 10.0.0.9        Eth1/43.15

 OSPF Process ID default VRF Test:Test

 Total number of neighbors: 2

 Neighbor ID     Pri State            Up Time  Address         Interface

 4.4.4.1           1 FULL/DR          00:13:14 10.0.1.1        Eth1/41.24

 4.4.4.2           1 FULL/DR          00:12:47 10.0.1.9        Eth1/43.25

 Verification - Mismatched MTU

At FCS, ACI by default supports an MTU of 9000 bytes. Since the default on the N7K and
other devices could very well deviate from this, an MTU mismatch is a common reason to
see neighbors stuck in the EXSTART/EXCHANGE state.

In this example, the N7Ks have not been configured for MTU 9000, and the neighbors are
stuck in the EXSTART/EXCHANGE states instead of FULL:

In the GUI:

In the CLI:

rtp_leaf1# show ip ospf nei vrf all

 OSPF Process ID default VRF Prod:Prod

 Total number of neighbors: 2

 Neighbor ID     Pri State            Up Time  Address         Interface

 4.4.4.1           1 EXSTART/BDR      00:00:10 10.0.0.1        Eth1/41.14

 4.4.4.2           1 EXSTART/BDR      00:07:50 10.0.0.9        Eth1/43.15

 OSPF Process ID default VRF Test:Test

 Total number of neighbors: 2

 Neighbor ID     Pri State            Up Time  Address         Interface

 4.4.4.1           1 EXSTART/BDR      00:00:09 10.0.1.1        Eth1/41.24

 4.4.4.2           1 EXSTART/BDR      00:07:50 10.0.1.9        Eth1/43.25

Resolution 1 - Mismatched MTU

There are two possible ways to resolve this issue. One is to set the ACI leaf nodes to use
a smaller MTU. This is an example of setting a leaf interface MTU to 1500 bytes:

Change this setting from ‘inherit’ to ‘1500’
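The same MTU change can be made via the REST API on the L3Out routed interface path; a minimal sketch, where the L3Out, node profile, interface profile and path names in the DN are hypothetical placeholders for the actual configuration:

import requests

APIC = "https://apic"                      # assumption: APIC address
s = requests.Session()
s.verify = False                           # lab sketch only
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Set the routed (sub-)interface MTU on the L3Out path from "inherit" to 1500.
dn = ("uni/tn-Prod/out-L3Out/lnodep-node101/lifp-ifp101/"     # hypothetical names
      "rspathL3OutAtt-[topology/pod-1/paths-101/pathep-[eth1/41]]")
s.post(f"{APIC}/api/mo/{dn}.json",
       json={"l3extRsPathL3OutAtt": {"attributes": {"mtu": "1500"}}})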



Resolution 2 - Mismatched MTU

Another possible way to resolve this is to set N7K Interface MTU to 9000 bytes as shown
below:

interface Ethernet8/1

  mtu 9000

  ip router ospf 100 area 0.0.0.100

  no shutdown

interface Ethernet8/1.801

  mtu 9000

  encapsulation dot1q 801

  ip address 10.0.0.1/30

  ip router ospf 100 area 0.0.0.100

  no shutdown

With MTU set, the OSPF neighbors should be up and operational.

rtp_leaf1# show ip ospf neighbors vrf all

 OSPF Process ID default VRF Prod:Prod

 Total number of neighbors: 2

 Neighbor ID     Pri State            Up Time  Address         Interface

 4.4.4.1           1 FULL/BDR         05:40:42 10.0.0.1        Eth1/41.14

 4.4.4.2           1 FULL/BDR         00:14:05 10.0.0.9        Eth1/43.15

 OSPF Process ID default VRF Test:Test

 Total number of neighbors: 2

 Neighbor ID     Pri State            Up Time  Address         Interface

 4.4.4.1           1 FULL/DR          00:13:14 10.0.1.1        Eth1/41.24

 4.4.4.2           1 FULL/DR          00:12:47 10.0.1.9        Eth1/43.25

Symptom - OSPF route learning problems, Neighbor adjacency formed 

In our reference topology, both N7Ks are advertising default routes to the ACI border leafs.
There are situations where either the leafs or the external devices (N7Ks) form the
neighbor relationship fine, but they don't learn routes from each other.

rtp_leaf1# show ip route 0.0.0.0 vrf all

IP Route Table for VRF "Prod:Prod"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

0.0.0.0/0, ubest/mbest: 2/0

    *via 10.0.0.1, eth1/41.14, [110/5], 01:40:59, ospf-default, inter

    *via 10.0.0.9, eth1/43.15, [110/5], 01:40:48, ospf-default, inter

IP Route Table for VRF "Test:Test"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

0.0.0.0/0, ubest/mbest: 2/0

    *via 10.0.1.1, eth1/41.24, [110/5], 01:41:02, ospf-default, inter

    *via 10.0.1.9, eth1/43.25, [110/5], 01:40:44, ospf-default, inter

rtp_leaf1#

Verification

External OSPF peers are not learning routes from ACI. For this example, ACI is advertising
the DB subnet (10.1.3.0) to the N7K. This subnet exists on Leaf2, while Leaf1 and Leaf3
are the border leafs. As seen below, the N7K is not receiving the route:

N7K-2-50-N7K2# show ip route 10.1.3.0

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

Route not found

N7K-2-50-N7K2#

ACI manages routing advertisements based on route availability, reachability and, more
importantly, based on policy. The following concepts are key to understanding route
exchange between ACI and external peers.

Resolution

There are three steps involved in resolving this problem.

The first thing that should be looked at is the Bridge Domain. The Bridge Domain subnet
needs to be marked as Public; this lets the ACI Leaf know to advertise the route to
external peers. Even with this setting, the routes from Leaf2 are not learned by Leaf1 and
Leaf3, because only one of the three main conditions for external route advertisement
has been met.
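Marking the subnet as Public corresponds to setting scope="public" on the fvSubnet object under the bridge domain; a minimal sketch of the equivalent REST call, assuming an APIC at https://apic, admin credentials, and 10.1.3.1/24 as the Database BD subnet:

import requests

APIC = "https://apic"                      # assumption: APIC address
s = requests.Session()
s.verify = False                           # lab sketch only
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Mark the Database BD subnet as "public" so it can be advertised externally.
payload = {"fvBD": {
    "attributes": {"name": "Database"},
    "children": [{"fvSubnet": {"attributes": {
        "ip": "10.1.3.1/24", "scope": "public"}}}]}}
s.post(f"{APIC}/api/mo/uni/tn-Prod/BD-Database.json", json=payload)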

rtp_leaf1# show ip route 10.1.3.0 vrf Prod:Prod

IP Route Table for VRF "Prod:Prod"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

0.0.0.0/0, ubest/mbest: 2/0

    *via 10.0.0.9, eth1/43.15, [110/5], 00:46:55, ospf-default, inter

    *via 10.0.0.1, eth1/41.14, [110/5], 00:46:37, ospf-default, inter

rtp_leaf1#

The Bridge domain needs to be associated with L3 Out as shown below:



Even with this setting, the routes are not learned by Leaf1 and Leaf3, as there are no
contracts in place specifying the communication.

rtp_leaf1# show ip route 10.1.3.0 vrf Prod:Prod

IP Route Table for VRF "Prod:Prod"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

0.0.0.0/0, ubest/mbest: 2/0

    *via 10.0.0.9, eth1/43.15, [110/5], 00:58:18, ospf-default, inter

    *via 10.0.0.1, eth1/41.14, [110/5], 00:58:00, ospf-default, inter

rtp_leaf1#

However, if the routes are local to Leaf1 and Leaf3, the routes are then advertised due to
the L3Out association. Just for troubleshooting, this can be forced by having an EPG
association, either by path or local binding, on Leaf1 or Leaf3.

Now the N7Ks see the routes from Leaf1 but not Leaf3, as the EPG is associated only with
Leaf1 and Leaf2.

rtp_leaf1# show ip route 10.1.3.0 vrf Prod:Prod

IP Route Table for VRF "Prod:Prod"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

10.1.3.0/24, ubest/mbest: 1/0, attached, direct, pervasive

    *via 172.16.104.65%overlay-1, [1/0], 00:00:15, static

rtp_leaf1#

N7K-2-50-N7K2# show ip route 10.1.3.0

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

10.1.3.0/24, ubest/mbest: 1/0

    *via 10.0.0.10, Eth8/1.800, [110/20], 00:01:46, ospf-100, nssa type-2

N7K-2-50-N7K2# ping 10.1.3.1

PING 10.1.3.1 (10.1.3.1): 56 data bytes

64 bytes from 10.1.3.1: icmp_seq=0 ttl=57 time=1.24 ms

64 bytes from 10.1.3.1: icmp_seq=1 ttl=57 time=0.8 ms

64 bytes from 10.1.3.1: icmp_seq=2 ttl=57 time=0.812 ms

64 bytes from 10.1.3.1: icmp_seq=3 ttl=57 time=0.809 ms

64 bytes from 10.1.3.1: icmp_seq=4 ttl=57 time=0.538 ms

--- 10.1.3.1 ping statistics ---

5 packets transmitted, 5 packets received, 0.00% packet loss

round-trip min/avg/max = 0.538/0.839/1.24 ms

N7K-2-50-N7K2#

Now, without a contract, why is the ping successful? This is because the pervasive GW
address is being pinged, and not an endpoint within that BD/EPG. Contracts are needed
for pinging the EP if the Context is in 'enforced' mode.

N7K-2-50-N7K2# ping 10.1.3.31

PING 10.1.3.31 (10.1.3.31): 56 data bytes

Request 0 timed out

Request 1 timed out

Request 2 timed out

Request 3 timed out

Request 4 timed out

--- 10.1.3.31 ping statistics ---

5 packets transmitted, 0 packets received, 100.00% packet loss

 

Now, removing the EPG binding on Leaf1, the route stops getting advertised to the
Nexus 7Ks.

The third part of the resolution is that, with the subnet marked Public and the Bridge
Domain associated with the L3Out, a contract needs to be defined between the Database
EPG and the L3Out.

The contract needs to be defined and associated on both the L3Out networks and the
Database EPG. Prior to associating the contract:

Associate the contract:

Routes being learned on the N7K:

N7K-2-50-N7K2# show ip route 10.1.3.0

IP Route Table for VRF "default"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>



10.1.3.0/24, ubest/mbest: 2/0

    *via 10.0.0.10, Eth8/1.800, [110/20], 00:08:06, ospf-100, nssa type-2

    *via 10.0.0.14, Eth8/3.800, [110/20], 00:08:06, ospf-100, nssa type-2

N7K-2-50-N7K2#

Now, with the L3Out defined with associated external networks, OSPF neighbor peering up,
routes being advertised, and the appropriate contract permitting the traffic, the ping is
successful.

N7K-2-50-N7K2# ping 10.1.3.31

PING 10.1.3.31 (10.1.3.31): 56 data bytes

64 bytes from 10.1.3.31: icmp_seq=0 ttl=126 time=1.961 ms

64 bytes from 10.1.3.31: icmp_seq=1 ttl=126 time=0.533 ms

64 bytes from 10.1.3.31: icmp_seq=2 ttl=126 time=0.577 ms

64 bytes from 10.1.3.31: icmp_seq=3 ttl=126 time=0.531 ms

64 bytes from 10.1.3.31: icmp_seq=4 ttl=126 time=0.576 ms

--- 10.1.3.31 ping statistics ---

5 packets transmitted, 5 packets received, 0.00% packet loss

round-trip min/avg/max = 0.531/0.835/1.961 ms

N7K-2-50-N7K2#

Problem Description - Inter-tenant Communications

This problem is a scenario where there is an endpoint in one tenant's context that cannot
connect to an endpoint in another tenant's context. For this scenario, the Database
servers in Tenant "Test" must communicate with the "Prod" Tenant's Database tier.

The Test-Database servers are in subnet 10.2.3.0/24, while the Prod-Database servers
are in 10.1.3.0/24.

Symptom

Communications between tenants do not work. 

Verification

In this case, routes are not being learned between tenant contexts. Since the Tenants have
their respective contexts/VRFs, by default the routes are not leaked between the contexts.
Here, a snippet of the status is shown, with Prod:Prod not learning 10.2.3.0 from
Tenant Test:Test:

rtp_leaf1# show ip route 10.1.3.0 vrf Prod:Prod

IP Route Table for VRF "Prod:Prod"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

10.1.3.0/24, ubest/mbest: 1/0, attached, direct, pervasive

*via 172.16.104.65%overlay-1, [1/0], 00:57:55, static

rtp_leaf1# show ip route 10.2.3.0 vrf Prod:Prod

IP Route Table for VRF "Prod:Prod"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

0.0.0.0/0, ubest/mbest: 2/0

*via 10.0.0.9, eth1/43.17, [110/5], 13:18:12, ospf-default, inter

*via 10.0.0.1, eth1/41.16, [110/5], 13:18:09, ospf-default, inter

rtp_leaf1#show ip route 10.2.3.0 vrf Test:Test

IP Route Table for VRF "Test:Test"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

10.2.3.0/24, ubest/mbest: 1/0, attached, direct, pervasive

*via 172.16.104.65%overlay-1, [1/0], 00:58:41, static

rtp_leaf1#

Resolution

The subnet address to be leaked between contexts (Tenants), in addition to being defined
under the Bridge Domain, needs to be marked as a shared subnet under the EPG. This is
the first step in the resolution of this issue.

With the subnet defined, the route is now visible under Prod:Prod.

rtp_leaf1# show ip route 10.2.3.0 vrf Prod:Prod

IP Route Table for VRF "Prod:Prod"

'*' denotes best ucast next-hop

'**' denotes best mcast next-hop

'[x/y]' denotes [preference/metric]

'%<string>' in via output denotes VRF <string>

10.2.3.0/24, ubest/mbest: 1/0, attached, direct, pervasive

*via 172.16.104.65%overlay-1, [1/0], 00:00:09, static

rtp_leaf1#

However, while routes are learned, the Prod and Test DB endpoints are still unable to
communicate. Contracts and policies to allow the communication need to be defined for
the communication to happen. To figure out the contracts, the PcTags for the EPGs need
to be known, using Visore.

DN                                                      PcTag

uni/tn-Test/ap-CommerceWorkspaceTest/epg-Database       5474

uni/tn-Prod/ap-commerceworkspace/epg-Database           32774

Verify, using the GUI (Fabric -> Inventory -> Pod -> Pools -> Rules) or the CLI:

rtp_leaf1# show zoning-rule | grep 32774

rtp_leaf1# show zoning-rule | grep 5474

rtp_leaf1#

To take the second step in the resolution, a special contract needs to be created for
inter-tenant communications with a scope of 'Global'; it should not have a scope of
'Context'. The contract should also be 'Exported' from one Tenant to the other Tenant, so
that the other EPG can consume the defined contract as a 'Consumed Contract Interface'.
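A minimal sketch of the objects involved, assuming an APIC at https://apic, admin credentials, and a hypothetical contract name db-shared; vzBrCP (scope "global"), vzCPIf (the exported contract interface) and fvRsConsIf (the consumed contract interface) are the standard object-model classes:

import requests

APIC = "https://apic"                      # assumption: APIC address
s = requests.Session()
s.verify = False                           # lab sketch only
s.post(f"{APIC}/api/aaaLogin.json",
       json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# 1. A global-scope contract in tenant Prod (provided by the Prod Database EPG
#    through a normal fvRsProv relation, not shown here).
s.post(f"{APIC}/api/mo/uni/tn-Prod.json", json={
    "vzBrCP": {"attributes": {"name": "db-shared", "scope": "global"}}})

# 2. Export it into tenant Test as a contract interface.
s.post(f"{APIC}/api/mo/uni/tn-Test.json", json={
    "vzCPIf": {"attributes": {"name": "db-shared"},
               "children": [{"vzRsIf": {"attributes": {
                   "tDn": "uni/tn-Prod/brc-db-shared"}}}]}})

# 3. The Test Database EPG consumes the exported contract interface.
s.post(f"{APIC}/api/mo/uni/tn-Test/ap-CommerceWorkspaceTest/epg-Database.json",
       json={"fvAEPg": {"attributes": {"name": "Database"},
                        "children": [{"fvRsConsIf": {"attributes": {
                            "tnVzCPIfName": "db-shared"}}}]}})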

Once the appropriate contract configuration is done, the contract rules show up on the
leafs so that the data plane can allow the inter-tenant communication.

rtp_leaf1# show zoning-rule | grep 5474

4146            32774           5474            default         enabled         2523136         permit

4147            5474            32774           default         enabled         2523136         permit

rtp_leaf1# show zoning-rule | grep 32774

4146            32774           5474            default         enabled         2523136         permit

4147            5474            32774           default         enabled         2523136         permit

rtp_leaf1#

Virtual Machine Manager and UCS

Overview

Virtual Machine Manager integration allows the fabric to extend network policy and
policy-group definitions into virtual switches residing on a hypervisor. This integration
automates critical network plumbing steps that typically stand as delays in the deployment
of virtual and compute resources, by automatically configuring the required fabric-side
and hypervisor virtual switch encapsulation.

The general hierarchy of the VMM configuration is shown in the diagram below:

Problem Description

When attached to a Cisco UCS Fabric Interconnect, two hosts are unable to reach one
another or unable to resolve one another’s MAC addresses.

Symptom 1:

ARP fails to resolve.

Packet captures running on the sending host show that the ARP request is in fact being
transmitted; however, it never reaches the destination host. Creating static ARP entries on
both devices for one another shows that traffic allowed by policy is able to be exchanged
via ICMP, etc.

The UCS Fabric Interconnect is configured with two uplinks that are configured as
disjoint Layer 2 domains.

Verification 1:

The problem in this situation is that the UCS FI has two uplinks with disjoint Layer 2
domains, where the VLAN that is being used for the ARP is using the non-ACI uplink as the
designated forwarder. This means that the fabric will not receive the ARP frames and as a
result cannot unicast them to the host. The easiest way to determine if this problem is
impacting the environment is to check which uplink is the designated receiver for the VLAN
attached to the EPG, using the "show platform software enm internal info vlandb id <vlan>"
command on the UCS FI in NX-OS mode, as shown below:

FI-A(nxos)# show platform software enm internal info vlandb id 248

vlan_id 248

-------------

Designated receiver: Po103

Membership: Po103

FI-A(nxos)#

If the designated receiver is not the port-channel facing the ACI fabric, the uplink
pinning settings in the LAN manager will need to be adjusted. Use the LAN Uplink Manager
in UCS to pin the VLANs dedicated to ACI use to the ACI-fabric-facing uplink.

For more information, please reference the Network Configuration section on Configuring
LAN Pin Groups in the Cisco UCS Manager GUI Configuration Guide.

Symptom 2:

• ARP requests are egressing the ESX host on the UCS blade, but not making it to the
destination
• A VM hosted on a C-series appliance directly attached to the fabric is able to reach
the BD anycast gateway
• A VM hosted on a B-series chassis blade attached to the fabric is unable to reach the
BD anycast gateway

Verification 2:

In this situation there are two VMs, one on a UCS blade chassis and another on a C200, and
they are unable to ping one another while in the same EPG. The C200 VM is able to ping
the BD anycast gateway; however, the UCS blade-hosted VM is not able to ping the gateway.

The vSphere 5.5 pktcap-uw tool can be used to determine if outbound ARP requests are
in fact leaving the VM and hitting the virtual switch.

~ # pktcap-uw --uplink vmnic3

The name of the uplink is vmnic3

No server port specifed, select 38100 as the port

Output the packet info to console.

Local CID 2 Listen on port 38100

Accept...Vsock connection from port 1027 cid 2 01:04:51.765509[1]

Captured at EtherswitchDispath point, TSO not enabled, Checksum not offloaded and

not verified, VLAN tag 602, length 60.

Segment[0] ---- 60 bytes:

0x0000: ffff ffff ffff 0050 56bb cccf 0806 0001

0x0010: 0800 0604 0001 0050 56bb cccf 0a01 000b

0x0020: 0000 0000 0000 0a01 0001 0000 0000 0000

0x0030: 0000 0000 0000 0000 0000 0000

By monitoring the packet counts on the Veth### interface on the UCS in NX-OS mode,
it is possible to confirm that the packets were being received.

tsi-aci-ucsb-A(nxos)# show int Veth730

Vethernet730 is up

Bound Interface is Ethernet1/1/3

Port description is server 1/3, VNIC eth3

Hardware is Virtual, address is 000d.ecb1.a000

Port mode is trunk

Speed is auto-speed

Duplex mode is auto

300 seconds input rate 0 bits/sec, 0 packets/sec

300 seconds output rate 0 bits/sec, 0 packets/sec

Rx

36 unicast packets 3694 multicast packets 3487 broadcast packets

7217 input packets 667170 bytes

0 input packet drops

Tx

433 unicast packets 12625 multicast packets 44749 broadcast packets

57807 output packets 4453489 bytes

0 flood packets 0 output packet drops

So the problem is between the UCS fabric links and the leaf interfaces. Checking the
counters on the leaf indicates that no broadcast packets were ingressing.

rtp_leaf1# show interface ethernet 1/27

Ethernet1/27 is up

admin state is up, Dedicated Interface

Belongs to po2

Hardware: 100/1000/10000/auto Ethernet, address: 7c69.f610.6d33 (bia 7c69.

f610.6d33)

MTU 9000 bytes, BW 10000000 Kbit, DLY 1 usec

reliability 255/255, txload 1/255, rxload 1/255

Encapsulation ARPA, medium is broadcast

Port mode is trunk

full-duplex, 10 Gb/s, media type is 10G

Beacon is turned off

Auto-Negotiation is turned on

Input flow-control is off, output flow-control is off

Auto-mdix is turned off

Rate mode is dedicated

Switchport monitor is off

EtherType is 0x8100

EEE (efficient-ethernet) : n/a

Last link flapped 09:09:44

Last clearing of "show interface" counters never

4 interface resets

30 seconds input rate 75 bits/sec, 0 packets/sec

30 seconds output rate 712 bits/sec, 0 packets/sec

Load-Interval #2: 5 minute (300 seconds)

input rate 808 bps, 1 pps; output rate 616 bps, 0 pps

RX

193 unicast packets 5567 multicast packets 17365 broadcast packets

23125 input packets 2185064 bytes

0 jumbo packets 0 storm suppression packets

0 runts 0 giants 0 CRC 0 no buffer

0 input error 0 short frame 0 overrun 0 underrun 0 ignored



0 watchdog 0 bad etype drop 0 bad proto drop 0 if down drop

0 input with dribble 0 input discard

0 Rx pause

TX

129 unicast packets 5625 multicast packets 17900 broadcast packets

23654 output packets 1952861 bytes

0 jumbo packets

0 output error 0 collision 0 deferred 0 late collision

0 lost carrier 0 no carrier 0 babble 0 output discard

0 Tx pause

This indicates that traffic is egressing the ESX host, but not making it through to the
leaf. One possible cause is that the frames are being tagged upon leaving the ESX host,
but are being stripped and placed on the native VLAN. The UCS configuration, specifically
the VLAN manager, can be checked to verify whether VLAN 602 is incorrectly set as the
native VLAN.

This means that frames egressing the UCS FI would be untagged heading towards the
fabric, and thus would not be categorized into the appropriate EPG. By unmarking the
VLAN as native, the frames are properly tagged and then categorized as being members
of the EPG, and ICMP can immediately begin to function.

Problem Description

Virtual Machine Manager function is unable to register vCenter with APIC

Symptom 1:

When attempting to register a vCenter with APIC, one or more of the following faults is
raised:

• F606262 [FSM:FAILED]: VMM Add-Controller FSM: comp/prov-VMware/ctrlr-[RTPACILab]-TestVcenter Failed to retrieve ServiceContent from the vCenter server 10.122.253.152 (FSM:ifc:vmmmgr:CompCtrlrAdd)
• F606351 [FSM:FAILED]: Task for updating comp:PolCont (TASK:ifc:vmmmgr:CompPolContUpdateCtrlrPol)
• F16438 [FSM:STAGE:FAILED]: Establish connection Stage: comp/prov-VMware/ctrlr-[RTPACILab]-TestVcenter Failed to retrieve ServiceContent from the vCenter server 10.122.253.152 (FSM-STAGE:ifc:vmmmgr:CompCtrlrAdd:Connect)

Verification 1:

These faults typically indicate that there is an issue reaching vCenter from the APIC.
Typical causes for this include:

• The VMM is configured to use the Out of Band management (OOBM) network to
access vCenter however is on a separate subnet and has no route to reach that
vCenter
• The IP address entered for the vCenter is incorrect

Log into the APIC and attempt a simple ping test to the remote vCenter:

admin@RTP_Apic1:~> ping 10.122.253.152

PING 10.122.253.152 (10.122.253.152) 56(84) bytes of data.

From 64.102.253.234 icmp_seq=1 Destination Host Unreachable

From 64.102.253.234 icmp_seq=2 Destination Host Unreachable

From 64.102.253.234 icmp_seq=3 Destination Host Unreachable

From 64.102.253.234 icmp_seq=4 Destination Host Unreachable

^C

In this case vCenter is not reachable from the APIC. By default the APIC will use the OOB
interface for reaching remotely managed devices, so this would indicate that there is
either a misconfiguration on the APIC or that the vCenter is unreachable by that address.

The first step is to verify if a proper default route is configured. This can be verified by
navigating to the Tenants section, entering the mgmt tenant, and then inspecting the
Node Management Addresses. If out of band management node management addresses
have been configured, verify that the proper default gateway has been entered in that
location.
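Because the APIC CLI exposes a Linux shell, the active routing table can also be checked directly. A minimal sketch, assuming standard net-tools are present (the interface name shown is an assumption and may differ by release); the gateway reflects the misconfigured value in this example:

admin@RTP_Apic1:~> netstat -rn

Kernel IP routing table
Destination     Gateway          Genmask    Flags  Iface
0.0.0.0         10.122.254.254   0.0.0.0    UG     oobmgmt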

The default gateway is configured as 10.122.254.254/24

admin@RTP_Apic1:~> ping 10.122.254.254

PING 10.122.254.254 (10.122.254.254) 56(84) bytes of data.

From 10.122.254.211 icmp_seq=1 Destination Host Unreachable

From 10.122.254.211 icmp_seq=2 Destination Host Unreachable

From 10.122.254.211 icmp_seq=3 Destination Host Unreachable

From 10.122.254.211 icmp_seq=4 Destination Host Unreachable

^C

The Unreachable state indicates that the gateway is improperly configured, and this mis-
configuration can be corrected by setting it to the appropriate 10.122.254.1.

After modifying the configured Out-of-Band gateway address:

admin@RTP_Apic1:~> ping 10.122.254.152

PING 10.122.254.152 (10.122.254.152) 56(84) bytes of data.

64 bytes from 10.122.254.152: icmp_seq=1 ttl=64 time=0.245 ms

64 bytes from 10.122.254.152: icmp_seq=2 ttl=64 time=0.258 ms

64 bytes from 10.122.254.152: icmp_seq=3 ttl=64 time=0.362 ms

64 bytes from 10.122.254.152: icmp_seq=4 ttl=64 time=0.344 ms

^C

The complete management configuration is as follows:

<fvTenant name="mgmt">
    <fvBD name="inb"/>
    <aaaDomainRef name="mgmt"/>
    <mgmtMgmtP name="default">
        <mgmtInB name="default"/>
        <mgmtOoB name="default">
            <mgmtRsOoBProv tnVzOOBBrCPName="oob_contract"/>
        </mgmtOoB>
    </mgmtMgmtP>
    <fvCtx name="inb"/>
    <fvCtx name="oob">
        <dnsLbl name="default"/>
    </fvCtx>
    <vzOOBBrCP name="oob_contract">
        <vzSubj name="oob_subject">
            <vzRsSubjFiltAtt tnVzFilterName="default"/>
            <vzRsSubjFiltAtt tnVzFilterName="ssh"/>
        </vzSubj>
    </vzOOBBrCP>
    <vzFilter name="ssh">
        <vzEntry name="ssh"/>
    </vzFilter>
    <fvnsAddrInst name="rtp_leaf3ooboobaddr">
        <fvnsUcastAddrBlk from="10.122.254.243" to="10.122.254.243"/>
    </fvnsAddrInst>
    <fvnsAddrInst name="RTP_Apic3ooboobaddr">
        <fvnsUcastAddrBlk from="10.122.254.213" to="10.122.254.213"/>
    </fvnsAddrInst>
    <fvnsAddrInst name="RTP_Apic1ooboobaddr">
        <fvnsUcastAddrBlk from="10.122.254.211" to="10.122.254.211"/>
    </fvnsAddrInst>
    <fvnsAddrInst name="RTP_Apic2ooboobaddr">
        <fvnsUcastAddrBlk from="10.122.254.212" to="10.122.254.212"/>
    </fvnsAddrInst>
    <fvnsAddrInst name="rtp_spine1ooboobaddr">
        <fvnsUcastAddrBlk from="10.122.254.244" to="10.122.254.244"/>
    </fvnsAddrInst>
    <fvnsAddrInst name="rtp_leaf1ooboobaddr">
        <fvnsUcastAddrBlk from="10.122.254.241" to="10.122.254.241"/>
    </fvnsAddrInst>
    <fvnsAddrInst name="rtp_leaf2ooboobaddr">
        <fvnsUcastAddrBlk from="10.122.254.242" to="10.122.254.242"/>
    </fvnsAddrInst>
    <fvnsAddrInst name="rtp_spine2ooboobaddr">
        <fvnsUcastAddrBlk from="10.122.254.245" to="10.122.254.245"/>
    </fvnsAddrInst>
    <mgmtExtMgmtEntity name="default">
        <mgmtInstP name="oob_emei">
            <mgmtRsOoBCons tnVzOOBBrCPName="oob_contract"/>
            <mgmtSubnet ip="0.0.0.0/0"/>
        </mgmtInstP>
    </mgmtExtMgmtEntity>
</fvTenant>
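If this configuration needs to be applied programmatically rather than through the GUI, the same document can be posted to the APIC REST API; a minimal sketch (the APIC address is a placeholder, and the request body is the <fvTenant name="mgmt"> document above):

POST https://<apic>/api/mo/uni.xml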

Now it is possible to verify that the vCenter VMM is reachable.

Symptom 2:

The following faults are raised in the VMM manager:

• F16438 [FSM:STAGE:FAILED]: Establish connection Stage: comp/prov-VMware/ctrlr-[RTPACILab]-172.31.222.24 Failed to find datacenter BldgE in vCenter (FSM-STAGE:ifc:vmmmgr:CompCtrlrAdd:Connect)
• F606262 [FSM:FAILED]: VMM Add-Controller FSM: comp/prov-VMware/ctrlr-[RTPACILab]-172.31.222.24 Failed to find datacenter BldgE in vCenter (FSM:ifc:vmmmgr:CompCtrlrAdd)

Verification 2:

Ensure that the datacenter name in vCenter matches the “Datacenter” property configured in the VMM Controller policy configuration.

In the above screenshot, the Datacenter name is purposely misconfigured as BldgE instead of BldgF.

Problem Description

Virtual Machine Manager (VMM) unassociation fails to delete Distributed Virtual Switch
(DVS) in vCenter 

Symptom 1:

After removing a Virtual Machine Manager (VMM) configuration or removing a Virtual Machine Manager (VMM) domain from an End Point Group (EPG), the associated virtual port groups or DVS are not removed from the vCenter configuration.

Verification 1:

Check to see that the port groups are not currently in use by a virtual machine network
adapter.

This can be verified from the vCenter GUI, by accessing the settings for a virtual machine and individually inspecting the network backing for the vNIC adapters.

Another mechanism by which this can be verified is by inspecting the DVS settings, and
viewing the Virtual Machines that are associated with the DVS.

The list of virtual machines that are currently using a distributed virtual port group can also be found using the APIC GUI, by navigating to the VM Networking section, drilling into the Provider, the Domain and the DVS, then expanding the port groups and looking at each individual port group.

To resolve this particular issue, the network backing on the Virtual Machine vNICs must be removed. This can be accomplished either by removing the Virtual Adapter entirely, or by changing the Virtual Adapter network backing to one that is not present on the DVS, such as a local standard virtual switch or some other DVP.

Problem Description

Virtual Machine Manager hosted VMs are unable to reach the fabric, get learned by the
fabric or reach their default gateway through a UCS Fabric Interconnect.

Symptom 1:

Checking the endpoint table on the fabric does not show any new endpoints being learned, although the Distributed Virtual Port groups are being created on the vSwitch and attached to the VMs.

The VMs are unable to ping their gateway or other VMs.



Verification 1:

For these symptoms the first step is to check whether the endpoint table on the leaf to which the UCS is attached is learning any endpoints in the EPG. The MAC address for the VM in question is 00:50:56:BB:D5:08, and it is unable to reach its default gateway.

Upon inspecting the “show endpoint detail” output on the leaf, the MAC for the VM is
missing from the output.

 
rtp_leaf1# show endpoint detail  

Legend:

O - peer-attached    H - vtep             a - locally-aged     S - static         

 V - vpc-attached     p - peer-aged        L - local            M - span           

 s - static-arp       B - bounce         

+---------------+---------------+-----------------+--------------+-------------+------------------------------+

      VLAN/       Encap           MAC Address       MAC Info/       Interface     Endpoint Group

      Domain      VLAN            IP Address        IP Info                       Info

+---------------+---------------+-----------------+--------------+-------------+------------------------------+
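A narrower check is to query the specific MAC directly. A minimal sketch, assuming the MAC-specific form of the command is available on this release; it likewise returns no entry at this point:

rtp_leaf1# show endpoint mac 0050.56bb.d508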

Additionally, viewing the output of "show vlan" and grepping for the Test EPG shows that the interface expected to carry the EPG is not among the interfaces on which the policy is programmed.

rtp_leaf1# show vlan | grep Test

 39   Test:CommerceWorkspaceTest:Web   active    Eth1/42, Eth1/44, Po1 

Inspecting the configuration on the Attachable Entity Profile for the interface group used on the UCS shows that no vSwitch policy is configured for LLDP, CDP or LACP. Without these policies, the defaults are inherited from the AEP itself, and as a result the vSwitch will be configured to run LLDP with whatever link aggregation protocol is used on the upstream links. The VDS will inherit these properties, and thus run incorrectly.

By right clicking on the Attachable Entity Profile and clicking the “Config vSwitch
Policies” it is possible to associate override policies for the vSwitch. When using a UCS
between the leaf and ESX hosts, these should be configured to disable LLDP, enable
CDP and use Mac Pinning as the LACP policy, as shown below:

With the override in place, inspecting the endpoint table on the switch itself shows that
the MAC address for the VM has been learned and the VLAN table shows that the interface
where the EPG can be learned is correctly placed in the CommerceWorkspaceTest:Web
EPG.

 

rtp_leaf1# show vlan | grep Test

 14   Test:CommerceWorkspaceTest:Web   active    Eth1/27, Eth1/28, Po2, Po3

rtp_leaf1# show endpoint detail

Legend:

 O - peer-attached    H - vtep             a - locally-aged     S - static         

 V - vpc-attached     p - peer-aged        L - local            M - span           

 s - static-arp       B - bounce         

+---------------+---------------+-----------------+--------------+-------------+------------------------------+

      VLAN/       Encap           MAC Address       MAC Info/       Interface     Endpoint Group

      Domain      VLAN            IP Address        IP Info                       Info

+---------------+---------------+-----------------+--------------+-------------+------------------------------+

14                      vlan-639    0050.56bb.d508 LV                        po2 Test:CommerceWorkspaceTest:Web

  

Further verification from the host itself shows that ping to the gateway is successful.

Layer 4 Through 7 Services Insertion

Overview

This chapter covers the common problems encountered during L4-L7 service insertion with the ACI fabric. An overview of what should happen and the verification steps used to confirm a working L4-L7 service insertion are covered first. The displays taken on a working fabric can then be used as an aid in troubleshooting issues when service graph and device cluster deployment fail.

The Cisco ACI and the APIC controller are designed with the ability to provide automated
service insertion while acting as a central point of policy control within the ACI fabric. ACI
policies manage both the network fabric and services appliances such as firewalls, load
balancers, etc. The policy controller has the ability to configure the network automati-
cally to allow traffic to flow through the service devices. In addition, the policy controller
can also automatically configure the service devices according to the application service
requirements. This approach allows organizations to automate infrastructure configura-
tion coordinated with service insertion and eliminate the challenge of managing all the
complex traffic-steering techniques that are used by traditional service insertion config-
uration methods.

When a service graph is defined through the APIC GUI, the concept of “functions” is used to specify how traffic should flow between the consumer EPG and the provider EPG. These functions can be exposed as firewall, load balancer, SSL offload, etc., and the APIC will translate these function definitions into selectable elements of a service graph through a technique called rendering. Rendering involves the allocation of fabric resources, such as bridge domains, service device IP addresses, etc., to ensure the consumer and provider EPGs will have all necessary resources and configuration to be functional.

Device Package

The APIC needs to communicate with the service devices to define and configure the user-specific functions according to the “communications method” the service device understands. This translation happens between the APIC and service devices by utilizing a plug-in or device package installed by the administrator. The device package also includes a description of the functions supported by the device package and the mode that the service device is utilizing. In ACI terminology, a service appliance can operate in two modes:

1. Go-To Mode - aka Routed mode. Examples include an L3 routed firewall or load balancer, or a one-arm load balancer.
2. Go-Through Mode - aka Transparent mode. An example would be a transparent L2 (bridged) firewall.

The illustration below shows some examples of device package functions.

Service Graph Definition

When the service graph definitions are being configured, the abstract graph needs to stitch together the consumer and provider contract. The connectors between the Function Nodes have two connector types:

1. L2 - Layer 2 connector. An example is an ACI fabric that has L2 adjacency between the EPG and the transparent firewall’s inside interface.
2. L3 - Layer 3 connector with unicast routing. An example is the ACI fabric acting as the default gateway for the outside interface of an ASA transparent firewall.

Node name - this will be used later, before the service graph is rendered.

Adjacency   Function Type   BD Selection
L2          Go-To           Disable routing on the BD if routing is disabled for the connection.
L3          Go-To           Routing must also be enabled within the BD.
L2          Go-Through      Disable routing on the BD if routing is disabled for the connection.
L3          Go-Through      Routing settings on the “shadow” BD are set as per the routing on the connection.

Once the abstract graph is instantiated, the function of the service devices can be con-
figured via GUI, REST or CLI. These functions include firewall or load balancer configura-
tions such as IP addresses of the interfaces, access-list, load balancer monitoring policy,
virtual IP, etc.

The illustration below shows the L4-L7 Function mode and empty Service Parameters.
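As a rough sketch of how such function parameters can be supplied through the REST API: L4-L7 parameters are carried by vnsFolderInst and vnsParamInst objects, typically attached under the provider EPG. The folder key, parameter key and values below are illustrative and depend on the device package in use:

<fvAEPg name="Web">
    <vnsFolderInst ctrctNameOrLbl="PermitWeb" graphNameOrLbl="Web"
        nodeNameOrLbl="Web-FW" key="ExtConfig" name="ExtConfig">
        <vnsParamInst key="security_level" name="security_level" value="50"/>
    </vnsFolderInst>
</fvAEPg>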

Concrete Device and Logical Device

The service graph also contains the abstract node information. The APIC will translate
the definition and functions from the abstract graph into the concrete devices that are
connected onto the ACI fabric. This may raise the question of why there is a logical de-
vice and a concrete device. The way this works is the concrete devices are the stand-
alone appliance nodes, but the devices are typically deployed as a cluster, or pair, which
is represented as a logical clustered device.

The following parameters are mandatory to create the Concrete Device:

1. Device identity such as IP address and login credential of the concrete device.
2. Logical interface to actual interface mapping, including guest VM virtual
network adapter name.

The following parameters are mandatory to create the Logical Device Cluster:

1. Select the device type - physical or virtual.
2. Device identity such as IP address and login credential of the logical device.
3. Logical interface name and function.

The illustration below shows the Logical Device Cluster configuration screen.

Device Cluster Selector Policies

The last step before the service graph can be rendered is to associate the service graph with the appropriate contract and logical device. For example, the Create Logical Device Context screen is where the association of contract, graph, node and cluster is built; here between the “PermitWeb” contract, the “Web” graph, the “Web-FW” node and the “Prod/Web-FW” device cluster.

The illustration below shows the Logical Device Context configuration.



Rendering the Service Graph

In order to render the service graph, the appropriate contract and subject need to be associated with the correct L4-L7 Service Graph.

If the service graph is able to deploy, the service graph instance and virtual device will
be seen as deployed in “Deployed Service Graphs” and “Deployed Device Clusters”. The
illustration shows the working and rendered service graph.

The illustration below shows where to attach the service graph to the contract.

Problem Description

The service graph is not rendering and will not deploy after the service graph is attached
to a contract.

Symptom 1

When clicking the logical device cluster, the Device State is in "init" state. 

Verification

The “init” state indicates a communication issue: the APIC controller cannot communicate with the service device. Faults should be seen under the Logical Device context. The following fault, from an ASA logical device context, shows a communication failure between the APIC and the service device:

F0324 Major script error : Connection error : HTTPSConnection-

Pool(host='10.122.254.39', port=443): Max retries exceeded with url:

/admin/exec/show%20version%20%7C%20grep%20Cisco%20Adaptive%20Security%20Appli-

ance%20Software%20Version (Caused by <class 'socket.error'>: [Errno 101] Network

is unreachable)

This fault can be resolved by verifying the connectivity between the APIC and the service device with the following (example commands are sketched after this list):

1. Ping the service device from the APIC CLI to verify reachability
2. Verify login credentials to the service device with the username and password supplied in the device configuration
3. Verify that the device's virtual IP and port are open
4. Verify the username and password are correct in the APIC configuration
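A minimal sketch of these checks from the APIC CLI (the device address and HTTPS port come from the fault above; availability of curl in the shell is an assumption):

admin@RTP_Apic1:~> ping -c 3 10.122.254.39
admin@RTP_Apic1:~> curl -k -m 5 https://10.122.254.39/

The ping verifies basic IP reachability, while the curl attempt verifies that the HTTPS port the APIC uses to manage the device accepts connections.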

Symptom 2

After correcting connectivity issues between the APIC and the service device, it may be seen that a fault F0765 CDev configuration is invalid due to cdev-missing-virtual-info has occurred.

Verification

After verification of the network connectivity between the APIC and the service appliance (in this case the service appliance is a VM), it is necessary to ensure the service VM name matches the name shown in the vCenter console, and that the vCenter name matches the Data Center name.

Symptom 3

Seeing a fault defined as F0772 LIf configuration is invalid due to LIf-invalid-CIf in the
Logical Device context.

Verification

First, it is necessary to define the items indicated, the LIf and the CIf. The LIf is the logical interface and the CIf is the concrete interface. With this particular fault, the logical interface is the element that is not rendering properly. This is where the Function Node maps the logical interface to the actual, or concrete, interface to form a relationship. F0772 means one of the following:

1. The Logical interface is not created
2. The Logical interface is not mapped to the correct concrete interface.

Symptom 4

After fixing the previous fault, F0772, there may be an additional fault, F0765 Cdev con-
figuration is invalid due to cdev-missing-cif.

Verification

This fault indicates that the CIf, concrete interface, is missing from the concrete device.
This can be checked under the concrete device configuration under L4-L7 Services->De-
vice Clusters->Logical Device->Device->Policy to verify the necessary concrete inter-
faces have been configured.

Symptom 5

When deploying the service graph, it is possible to see a fault defined as F0758 Service
graph could not be rendered due to following: id-allocation-failure.

Verification

When deploying service device VMs in a hypervisor, these devices are treated like normal virtual machines in that they will be placed into their own EPG that is mapped to the BD where the VM resides. When the service graph is rendered by the APIC, it will allocate VLANs from the VMM pool assigned during logical device cluster creation. If the dynamic VLAN pool that is associated with the VMM does not have enough VLANs allocated, rendering will fail and raise fault F0758.

This error can be corrected by allocating additional VLANs into the dynamic VLAN pool that is used by the VMM, as sketched below.
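A sketch of extending a dynamic VLAN pool through the REST API (the pool name and VLAN range are illustrative, not taken from this example):

POST https://<apic>/api/mo/uni/infra.xml

<fvnsVlanInstP name="VMM-Pool" allocMode="dynamic">
    <fvnsEncapBlk from="vlan-1100" to="vlan-1199"/>
</fvnsVlanInstP>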

Symptom 6

All faults seem to be cleared but the service graph will still not render, and no faults are
raised. In addition, verification of the contract shows it has been associated with the ap-
propriate service graph. The filter is also defined and associated to the correct contract.

Verification

Go to the consumer EPG or External Bridge Network and the provider EPG, and make sure the correct EPG or External Bridge Network is configured as the consumer and as the provider. If the same EPG is configured as both consumer and provider, the L4-L7 graph will not be rendered.

Symptom 7

The service graph is trying to render, but it fails and raises the fault F0758 Service graph
could not be rendered due to following: missing-mandatory-param.

Verification

This fault is associated with the Function Node configuration. It would be caused by one
or more missing mandatory parameters, or one or more missing mandatory device con-
figuration parameters:

• Check the Function Node configuration and verify whether any parameter marked Mandatory “true” is missing.

• Check under the actual service device configuration and identify if any mandatory parameter is missing. One example might be seen when configuring the ASA firewall, where the “order” parameter of an access control entry is a required field even though it is not marked as required.

Symptom 8

In the example the Cisco ASAv is being used, and traffic is not passing through the ser-
vice device. After inspecting the Deployed Device Cluster, there is a fault, F0324 Major
script error : Configuration error :.

Verification

This fault is related to the Function Node configuration and indicates that a configured parameter passed during rendering was not accepted by the service device. Examples include configuring ASAv transparent mode in the policy while the firewall is configured in routed mode, or configuring an ASAv security level of 200 when the only acceptable values are from 0 to 100.

ACI Fabric Node and Process Crash Troubleshooting
Overview

The ACI switch node has numerous processes which control various functional aspects
on the system. If the system has a software failure in a particular process, a core file will
be generated and the process will be reloaded.

If the process is a Data Management Engine (DME) process, the DME process will restart
automatically. If the process is a non-DME process, it will not restart automatically and
the switch will reboot to recover.

This section presents an overview of the various processes, how to detect that a process
has cored, and what actions should be taken when this occurs.

DME Processes:

The essential processes running on a fabric node can be found through the CLI. Unlike on the APIC, the process view in the GUI at FABRIC -> INVENTORY -> Pod 1 -> (node) will show all processes running on the leaf.

CLI:

Through "ps -ef | grep svc_ifc":

rtp_leaf1# ps -ef |grep svc_ifc

root      3990  3087  1 Oct13 ?        00:43:36 /isan/bin/svc_ifc_policyelem --x

root      4039  3087  1 Oct13 ?        00:42:00 /isan/bin/svc_ifc_eventmgr --x

root      4261  3087  1 Oct13 ?        00:40:05 /isan/bin/svc_ifc_opflexelem --x -v

dptcp:8000

root      4271  3087  1 Oct13 ?        00:44:21 /isan/bin/svc_ifc_observerelem --x

root      4277  3087  1 Oct13 ?        00:40:42 /isan/bin/svc_ifc_dbgrelem --x

root      4279  3087  1 Oct13 ?        00:41:02 /isan/bin/svc_ifc_confelem --x

rtp_leaf1# 

Each of the processes running on the switch writes activity to a log file on the system. These log files are bundled as part of the techsupport file but can also be found via CLI access in the /tmp/logs/ directory. For example, the Policy Element process log output is written into /tmp/logs/svc_ifc_policyelem.log.

The following is a brief description of the DME processes running on the system.  This
can help in understanding which log files to reference when troubleshooting a particular
process or understand the impact to the system if a process crashed: 

Process       Function
policyelem    Policy Element: processes logical MOs from the APIC and pushes the concrete model to the switch
eventmgr      Event Manager: processes local faults, events, health score
opflexelem    Opflex Element: Opflex server on the switch
observerelem  Observer Element: processes local stats sent to the APIC
dbgrelem      Debugger Element: core handler
nginx         Web server handling traffic between the switch and the APIC

Identify When a Process Crashes: 

When a process crashes and a core file is generated, a fault as well as an event is gener-
ated.  The fault for the particular process is shown as a "process-crash" as shown in this
syslog output from the APIC:

Oct 16 03:54:35 apic3 %LOG_LOCAL7-3-SYSTEM_MSG [E4208395][process-crash][major]

[subj-[dbgs/cores/node-102-card-1-svc-policyelem-ts-2014-10-16T03:54:55.000+00:00]/

rec-12884905092]Process policyelem cored

When the process on the switch crashes, the core file is compressed and copied to the
APIC.  The syslog message notification comes from the APIC.

The fault that is generated when the process crashes is cleared when the process is

restarted.  The fault can be viewed via the GUI in the fabric history tab at FABRIC ->
INVENTORY -> Pod 1.  In this example, node102 Policy Element crashed:

Collecting the Core Files:

The APIC GUI provides a central location to collect the core files for the fabric nodes.

An export policy can be created from ADMIN -> IMPORT/EXPORT in Export Policies ->
Core. However, there is a default core policy where files can be downloaded directly. As
shown in this example:

The core files can be accessed via SSH/SCP through the APIC at /data/techsupport on the APIC where the core file is located. Note that the core file will be available at /data/techsupport on one APIC in the cluster; the exact APIC on which the core file resides can be found from the Export Location path shown in the GUI. For example, if the Export Location begins with “files/3/”, the file is located on node 3 (APIC3).
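For example, a core file can then be copied off with scp from a management station (the APIC address and file name are placeholders):

client$ scp admin@<apic-oob-address>:/data/techsupport/<core-file-name> .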

Problem Description

Process on fabric node has crashed and either restarts automatically or leads to the
switch restarting.

Symptom 1

Process on switch fabric crashes. Either the process restarts automatically or the switch
reloads to recover.

Verification

As indicated in the overview section, if a DME process crashes, it should restart automat-
ically without the switch restarting. If a non-DME process crashes, the process will not
automatically restart and the switch will reboot to recover.

Depending on which process crashes, the impact of the process core will vary.

When a non-DME process crashes, this will typically lead to a HAP reset, as seen on the console:

[ 1130.593388] nvram_klm wrote rr=16 rr_str=ntp hap reset to nvram

[ 1130.599990] obfl_klm writing reset reason 16, ntp hap reset

[ 1130.612558] Collected 8 ext4 filesystems 

Check the appropriate process log:

The process which crashed should have some level of log output prior to the crash. The logs on the switch are written into the /tmp/logs directory. The process name will be part of the file name. For example, for the Policy Element process, the file is svc_ifc_policyelem.log:

rtp_leaf2# ls -l |grep policyelem

-rw-r--r-- 2 root root 13767569 Oct 16 00:37 svc_ifc_policyelem.log

-rw-r--r-- 1 root root  1413246 Oct 14 22:10 svc_ifc_policyelem.log.1.gz

-rw-r--r-- 1 root root  1276434 Oct 14 22:15 svc_ifc_policyelem.log.2.gz

-rw-r--r-- 1 root root  1588816 Oct 14 23:12 svc_ifc_policyelem.log.3.gz

-rw-r--r-- 1 root root  2124876 Oct 15 14:34 svc_ifc_policyelem.log.4.gz

-rw-r--r-- 1 root root  1354160 Oct 15 22:30 svc_ifc_policyelem.log.5.gz

-rw-r--r-- 2 root root 13767569 Oct 16 00:37 svc_ifc_policyelem.log.6

-rw-rw-rw- 1 root root        2 Oct 14 22:06 svc_ifc_policyelem.log.PRESERVED

-rw-rw-rw- 1 root root      209 Oct 14 22:06 svc_ifc_policyelem.log.stderr

rtp_leaf2# 

There will be several files for each process located at /tmp/logs. As a log file increases in size, it will be compressed and older log files will be rotated off. Check the core file creation time (as shown in the GUI and in the core file name) to understand where to look in the file. Also, when the process first attempts to come up, there will be an entry in the log file that indicates “Process is restarting after a crash” that can be used to search backwards for what might have happened prior to the crash.
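A quick way to search the current and rotated logs for this marker, assuming zgrep is available in the switch shell (plain grep works on the uncompressed files):

rtp_leaf2# zgrep "Process is restarting after a crash" /tmp/logs/svc_ifc_policyelem.log*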

Check what activity occurred at the time of the process crash:

A process which has been running has had some change which then caused it to crash. In many cases the change may have been some configuration activity on the system. The activity that occurred on the system can be found in the audit log history of the system.

For example, if the ntp process crashes, going back to around the time of the crash, in this example there was a change where an ntp provider was deleted:
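Such audit records can also be pulled over the REST API. A minimal sketch querying the audit log class (the aaaModLR class holds audit records; the filter property and match string are illustrative):

GET https://<apic>/api/class/aaaModLR.xml?query-target-filter=wcard(aaaModLR.descr,"ntp")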

Collect Techsupport and Core File and Contact the TAC: 

A process crashing should not normally occur. In order to better understand why, beyond the above steps, it will be necessary to decode the core file. At this point, the file will need to be collected and provided to the TAC for further processing.

Collect the core file (as indicated above) and open up a case with the TAC.

Symptom 2

Fabric switch continuously reloads or is stuck at the BIOS loader prompt.

Verification

As indicated in the overview section, if a DME process crashes, it should restart automat-
ically without the switch restarting. If a non-DME process crashes, the process will not
automatically restart and the switch will reboot to recover. However, in either case, if the process continuously crashes, the switch may get into a continuous reload loop or end up at the BIOS loader prompt.

[ 1130.593388] nvram_klm wrote rr=16 rr_str=policyelem hap reset to nvram

[ 1130.599990] obfl_klm writing reset reason 16, policyelem hap reset

[ 1130.612558] Collected 8 ext4 filesystems 

 

Break the HAP reset loop:

The first step is to attempt to get the switch back into a state where further information can be collected.

If the switch is continuously rebooting, break into the BIOS loader prompt through the console by typing CTRL+C during the first part of the boot cycle.

Once the switch is at the loader prompt, enter in the following commands:

• cmdline no_hap_reset
• boot <file>

The cmdline command will prevent the switch from reloading when a HAP reset is called. The second command will boot the system. Note that the boot command is needed instead of a reload at the loader prompt, as a reload will remove the cmdline option entered.

Though the system should now remain up to allow better access to collect data, whatev-
er process is crashing will impact the functionality of the switch.

Check the appropriate process log:

The process which crashed should have some level of log output prior to the crash. The logs on the switch are written into the /tmp/logs directory. The process name will be part of the file name. For example, for the Policy Element process, the file is svc_ifc_policyelem.log:

rtp_leaf2# ls -l |grep policyelem

-rw-r--r-- 2 root root 13767569 Oct 16 00:37 svc_ifc_policyelem.log

-rw-r--r-- 1 root root  1413246 Oct 14 22:10 svc_ifc_policyelem.log.1.gz

-rw-r--r-- 1 root root  1276434 Oct 14 22:15 svc_ifc_policyelem.log.2.gz

-rw-r--r-- 1 root root  1588816 Oct 14 23:12 svc_ifc_policyelem.log.3.gz

-rw-r--r-- 1 root root  2124876 Oct 15 14:34 svc_ifc_policyelem.log.4.gz

-rw-r--r-- 1 root root  1354160 Oct 15 22:30 svc_ifc_policyelem.log.5.gz

-rw-r--r-- 2 root root 13767569 Oct 16 00:37 svc_ifc_policyelem.log.6

-rw-rw-rw- 1 root root        2 Oct 14 22:06 svc_ifc_policyelem.log.PRESERVED

-rw-rw-rw- 1 root root      209 Oct 14 22:06 svc_ifc_policyelem.log.stderr

rtp_leaf2# 

There will be several files for each process located at /tmp/logs. As a log file increases in size, it will be compressed and older log files will be rotated off. Check the core file creation time (as shown in the GUI and in the core file name) to understand where to look in the file. Also, when the process first attempts to come up, there will be an entry in the log file that indicates “Process is restarting after a crash” that can be used to search backwards for what might have happened prior to the crash.

Check what activity occurred at the time of the process crash:

A process which has been running has had some change which then caused it to crash. In many cases the change may have been some configuration activity on the system. The activity that occurred on the system can be found in the audit log history of the system.

For example, if the ntp process crashes, going back to around the time of the crash, in this example there was a change where an ntp provider was deleted:

Collect Core File and Contact the Cisco TAC: 

A process crashing should not normally occur. In order to understand better why, be-
yond the above steps, it will be necessary to decode the core file. At this point, the file
will need to be collected and provided to the Cisco TAC for further processing.

Collect the core file (as indicated above) and open up a support case with the Cisco TAC.

APIC Process Crash Troubleshooting

Overview

The APIC has a series of Data Management Engine (DME) processes which control vari-
ous functional aspects on the system. When the system has a software failure in a par-
ticular process, a core file will be generated and the process will be reloaded.

This chapter covers potential issues involving system processes crashes or software fail-
ures, beginning with an overview of the various system processes, how to detect that
a process has cored, and what actions should be taken when this occurs. The displays
taken on a working healthy system can then be used to identify processes that may have
terminated abruptly.

DME Processes:

The essential processes running on an APIC can be found either through the GUI or the CLI. Using the GUI, the processes and their process IDs are found in System -> Controllers -> Processes, as shown here:

Using the CLI, the processes and the process ID are found in the summary file at /aci/
system/controllers/1/processes (for APIC1):

admin@RTP_Apic1:processes> cat summary

processes:

process-id  process-name       max-memory-allocated  state              

----------  -----------------  --------------------  -------------------

0           KERNEL             0                     interruptible-sleep

331         dhcpd              108920832             interruptible-sleep

336         vmmmgr             334442496             interruptible-sleep

554         neo                398274560             interruptible-sleep

1034        ae                 153690112             interruptible-sleep

1214        eventmgr           514793472             interruptible-sleep

2541        bootmgr            292020224             interruptible-sleep

4390        snoopy             28499968              interruptible-sleep

5832        scripthandler      254308352             interruptible-sleep

19204       dbgr               648941568             interruptible-sleep

21863       nginx              4312199168            interruptible-sleep

32192       appliancedirector  136732672             interruptible-sleep

32197       sshd               1228800               interruptible-sleep

32202       perfwatch          19345408              interruptible-sleep

32203       observer           724484096             interruptible-sleep

32205       lldpad             1200128               interruptible-sleep

32209       topomgr            280576000             interruptible-sleep

32210       xinetd             99258368              interruptible-sleep

32213       policymgr          673251328             interruptible-sleep

32215       reader             258940928             interruptible-sleep

32216       logwatch           266596352             interruptible-sleep

32218       idmgr              246824960             interruptible-sleep

32416       keyhole            15233024              interruptible-sleep

admin@apic1:processes> 

Each of the processes running on the APIC writes to a log file on the system. These
log files can be bundled as part of the APIC techsupport file but can also be observed
through SSH shell access in /var/log/dme/log. For example, the Policy Manager pro-
cess log output is written into /var/log/dme/log/svc_ifc_policymgr.bin.log.

The following is a brief description of the processes running on the system. This can help
in understanding which log files to reference when troubleshooting a particular process
or understand the impact to the system if a process crashed:

Process            Function
KERNEL             Linux kernel
dhcpd              DHCP process running for APIC to assign infra addresses
vmmmgr             Handles processes between APIC and hypervisors
neo                Shell CLI interpreter
ae                 Handles the state and inventory of the local APIC appliance
eventmgr           Handles all events and faults on the system
bootmgr            Controls boot and firmware updates on fabric nodes
snoopy             Shell CLI help, tab command completion
scripthandler      Handles the L4-L7 device scripts and communication
dbgr               Generates core files when a process crashes
nginx              Web service handling GUI and REST API access
appliancedirector  Handles formation and control of the APIC cluster
sshd               Enables SSH access into the APIC
perfwatch          Monitors Linux cgroup resource usage
observer           Monitors the fabric system and data handling of state, stats, health
lldpad             LLDP agent
topomgr            Maintains fabric topology and inventory



How to Identify When a Process Crashes: 

When a process crashes and a core file is generated, the ACI system raises a fault noti-
fication and generates an entry in the event logs.  The fault for the particular process is
shown as a "process-crash" as shown in this syslog output from the APIC:

Oct 15 17:13:35 apic1 %LOG_LOCAL7-3-SYSTEM_MSG [E4208395][process-crash][ma-

jor][subj-[dbgs/cores/ctrlr-1-svc-reader-ts-2014-10-15T17:13:28.000+00:00]/rec-

4294972278]Process reader cored

The fault that is generated when the process crashes is cleared when the process is re-
started.  The fault can be viewed via the GUI in the fabric "history -> events" tab at FAB-
RIC -> INVENTORY -> Pod 1:

Collecting the Core Files:

The APIC GUI provides a central location to collect the core files for APICS and nodes in
the fabric.

An export policy can be created from ADMIN -> IMPORT/EXPORT in Export Policies ->
Core. However, there is a default core policy where files can be downloaded directly. As
shown in this example:

The core files can be accessed via SSH/SCP on the APIC at /data/techsupport.

Note that the core file will be available at /data/techsupport on the APIC that had the process crash. The APIC on which the core file resides can be found from the Export Location path shown in the GUI. For example, if the Export Location begins with “files/2/”, the file is located on node 2 (APIC2).

Problem Description

APIC process crashes and either restarts automatically or is not running.

Symptom 1

APIC process is not running

Verification

A process that crashes generally should restart.  However, if the same process crashes
several times in a short amount of time, the process may not recover.

Verify the process status through:

APIC CLI: Verify the contents of the summary file on the APIC located in /aci/system/
controllers/<APIC node ID>/processes. For example /aci/system/controllers/1/pro-
cesses/summary for APIC1. An example output was shown in the above overview section.

GUI: Navigate to SYSTEM -> CONTROLLERS -> Controllers, select the APIC, and check that the running processes have a PID associated. All but KERNEL should. An example output was shown in the above overview section.

Check the appropriate process log:

The process which is not running should have some level of log output prior to the crash. The logs on the APIC where the process is not running are found in /var/log/dme/log via SSH access. The process name will be part of the file name. For example, for vmmmgr the file is svc_ifc_vmmmgr.bin.log.

admin@RTP_Apic1:log> ls -l |grep vmmmgr

-rw-r--r-- 2 ifc  root 18529370 Oct 15 14:38 svc_ifc_vmmmgr.bin.log

-rw-r--r-- 1 ifc  root  1318921 Oct 14 19:25 svc_ifc_vmmmgr.bin.log.1.gz

-rw-r--r-- 1 ifc  root   967890 Oct 14 19:42 svc_ifc_vmmmgr.bin.log.2.gz

-rw-r--r-- 1 ifc  root  1555562 Oct 14 22:11 svc_ifc_vmmmgr.bin.log.3.gz

-rw-r--r-- 1 ifc  root  1673143 Oct 15 12:19 svc_ifc_vmmmgr.bin.log.4.gz

-rw-r--r-- 1 ifc  root  1119380 Oct 15 12:30 svc_ifc_vmmmgr.bin.log.5.gz

-rw-r--r-- 2 ifc  root 18529370 Oct 15 14:38 svc_ifc_vmmmgr.bin.log.6

-rw-r--r-- 1 ifc  root        2 Oct 14 13:36 svc_ifc_vmmmgr.bin.log.PRESERVED

-rw-r--r-- 1 ifc  root     7924 Oct 14 22:44 svc_ifc_vmmmgr.bin.log.stderr

admin@RTP_Apic1:log> 

There will be several files for each process located at /var/log/dme/log. As the log file
increases in size, it will be compressed and older log files will be rotated off. Check the
core file creation time (as shown in the GUI and the core file name) to understand where
to look in the file. Also, when the process first attempts to come up, there exists an entry
in the log file that indicates “Process is restarting after a crash” that can be used to search
backwards as to what might have happened prior to the crash.

Check what activity occurred at the time of the process crash

Typically, a process which has been running successfully would have to experience some
change which caused it to crash. In many cases the changes may have been some con-
figuration activity on the system. What activity occurred on the system can be found in
the audit log history of the system.

For example, if the policymgr process crashes several times that led to the process not
being up, going into the logs and inspecting entries around the time of the first crash is a
good way to investigate what might have caused the issue. As shown in the example be-
low, there was a change where a new service graph was added, thus giving the indication
that the service graph configuration may have caused the failure:

Restarting a process:

When a process fails to restart automatically on an APIC, the recommended method is to restart the APIC to allow all the processes to come up organically.

The processes can also be started through the APIC shell command “acidiag restart mgmt”. This will restart the essential APIC processes, but note that it causes all of them to restart, not just the process which is not running.

Now, if the process has crashed several times already, the process may crash again when it comes up. This could be due to some persistent configuration condition that is leading to the crash. Knowing what changed, as indicated above, may help in determining what corrective actions to take to resolve the root issue.

Collect Techsupport and Core File and Contact the Cisco TAC: 

Process crashes should not occur under normal operational conditions. In order to better understand why the process crashed, beyond the above steps, it will be necessary to decode the core files. At this point, the files will need to be collected and provided to the Cisco Technical Assistance Center for further processing.

Collect the core files, as indicated above in the overview section, and open up a support
case with the Cisco Technical Assistance Center.

Symptom 2

APIC process has crashed and restarted automatically

Verification

A process that crashes generally should restart.  When the process crashes, a core file
will be generated as indicated in the overview section.

Check the appropriate process log:

The process which crashed should have some level of log output prior to the crash. The logs on the APIC where the process crashed are found in /var/log/dme/log when logged in via SSH access. The process name will be part of the file name. For example, for vmmmgr the file is svc_ifc_vmmmgr.bin.log.

admin@RTP_Apic1:log> ls -l |grep vmmmgr

-rw-r--r-- 2 ifc  root 18529370 Oct 15 14:38 svc_ifc_vmmmgr.bin.log

-rw-r--r-- 1 ifc  root  1318921 Oct 14 19:25 svc_ifc_vmmmgr.bin.log.1.gz

-rw-r--r-- 1 ifc  root   967890 Oct 14 19:42 svc_ifc_vmmmgr.bin.log.2.gz

-rw-r--r-- 1 ifc  root  1555562 Oct 14 22:11 svc_ifc_vmmmgr.bin.log.3.gz

-rw-r--r-- 1 ifc  root  1673143 Oct 15 12:19 svc_ifc_vmmmgr.bin.log.4.gz

-rw-r--r-- 1 ifc  root  1119380 Oct 15 12:30 svc_ifc_vmmmgr.bin.log.5.gz

-rw-r--r-- 2 ifc  root 18529370 Oct 15 14:38 svc_ifc_vmmmgr.bin.log.6

-rw-r--r-- 1 ifc  root        2 Oct 14 13:36 svc_ifc_vmmmgr.bin.log.PRESERVED

-rw-r--r-- 1 ifc  root     7924 Oct 14 22:44 svc_ifc_vmmmgr.bin.log.stderr

admin@RTP_Apic1:log> 

There will be several files for each process located at /var/log/dme/log. As a log file increases in size, it will be compressed and older log files will be rotated off. Check the core file creation time (as shown in the GUI and in the core file name) to understand where to look in the file. Also, when the process first attempts to come up, there will be an entry in the log file that indicates “Process is restarting after a crash” that can be used to search backwards for what might have happened prior to the crash.

Check what activity occurred at the time of the process crash:

Typically, a process which has been running successfully would have to experience some
change which caused it to crash. In many cases the changes may have been some con-
figuration activity on the system. What activity occurred on the system can be found in
the audit log history of the system.

In this example, the policymgr process crashed several times, leading to the process not being up. On further investigation, during the time of the first crash event, a new service graph was added.

Collect Techsupport and Core File and Contact the Cisco TAC: 

Process crashes should not occur under normal operational conditions. In order to better understand why the process crashed, beyond the above steps, it will be necessary to decode the core files. At this point, the files will need to be collected and provided to the Cisco Technical Assistance Center for further processing.

Collect the core files, as indicated above in the overview section, and open up a support
case with the Cisco Technical Assistance Center.

Appendix

Glossary

Overview

This section is designed to provide a high-level description of terms and concepts that are brought up in this book. While ACI does not change how packets are transmitted on a wire, there are some new terms and concepts employed, and understanding those new terms and concepts will help those working on ACI communicate with one another about the constructs ACI uses to transmit those bits. Associated new acronyms are also provided.

This is not meant to be an exhaustive list nor a completely detailed dictionary of all of the
terms and concepts, only the key ones that may not be a part of the common vernacular
or which would be relevant to the troubleshooting exercises that were covered in the
troubleshooting scenarios discussed.

A
AAA: Initialism for Authentication, Authorization, and Accounting

ACI External Connectivity: Any connectivity to and from the fabric that uses an external
routed or switched intermediary system, where endpoints fall outside of the managed
scope of the fabric

ACID transactions: ACID is an initialism for Atomicity, Consistency, Isolation, Durability – properties of transactions that ensure consistency in database transactions. Transactions to APIC devices in an ACI cluster are considered ACID, to ensure that database consistency is maintained. This means that if one part of a transaction fails, the entire transaction fails.

AEP: Attach Entity Profile – this is a configuration profile of the interface that gets ap-
plied when an entity attaches to the fabric. An AEP represents a group of external entities
with similar infrastructure policy requirements. 

ALE: Application Leaf Engine, an ASIC on a leaf switch

APIC: Application Policy Infrastructure Controller – a centralized policy management controller cluster. The APIC configures the intended state of the policy to the fabric.

API: Application Programming Interface used for programmable extensibility

Application Profile: Term used to reference an application profile managed object that models the logical components of an application and how those components communicate. The AP is the key object used to represent an application and is also the anchor point for the automated infrastructure management in an ACI fabric.

ASE: Application Spine Engine, an ASIC on a Spine switch

B
BGP: Border Gateway Protocol. On the ACI fabric, BGP is used to distribute reachability information within the fabric.

Bridge Domain: A unique layer 2 forwarding domain that contains one or more subnets.

C
Clos fabric: A multi-tier nonblocking leaf-spine architecture network.

Cluster: Set of devices that work together as a single system to provide an identical or
similar set of functions

Contracts: A logical container for the subjects which relate to the filters that govern
the rules for communication between endpoint groups.  ACI works on a white list policy
model.  Without a contract, the default forwarding policy is to not allow any communi-
cation between EPGs but communication within an EPG is allowed.

Context: A layer 3 forwarding domain, equivalent to a VRF.  Every bridge domain needs
to be associated with a context.

 D
DLB: Dynamic Load Balancing – a network traffic load balancing mechanism in the ACI
fabric based on flowlet switching.

DME: Data Management Engine, a service that runs on the APIC that manages data for
the data model.

dMIT: distributed Management Information Tree, a representation of the ACI object model with the root of the tree at the top and the leaves of the tree at the bottom. The tree contains all aspects of the object model that represent an ACI fabric.

Dn: Distinguished name – a fully qualified name that represents a specific object with-
in the ACI management information tree as well as the specific location information in
the tree. It is made up of a concatenation of all of the relative names from itself back to
the root of the tree. As an example, if policy object of type Application Profile is created
named commerceworkspace within a Tenant named Prod, the dn would be expressed
as uni/tn-Prod/ap-commerceworkspace. 

E
EP: Endpoint - Any logical or physical device connected directly or indirectly to a port
on a leaf switch that is not a fabric facing port. Endpoints have specific properties like
an address, location, or potentially some other attribute, which is used to identify the
endpoint. Examples include virtual-machines, servers, storage devices, etc.

EPG: A collection of endpoints that can be grouped based on common requirements for
a common policy. Endpoint groups can be dynamic or static.

F
Fault: When a failure occurs or an alarm is raised, the system creates a fault managed
object for the fault. A fault contains the conditions, information about the operational
state of the affected object, and potential resolutions for the problem.

Fabric: Topology of network nodes. 

Filters: Filters define the rules outlining the layer 2 to layer 4 fields that will be matched
by a contract.  

Flowlet switching:  An optimized multipath load balancing methodology based on re-
search from MIT in 2004. Flowlet Switching is a way to use TCP’s own bursty nature to
more efficiently forward TCP flows by dynamically splitting flows into flowlets and split-
ting traffic across multiple parallel paths without requiring packet reordering. 

G
GUI: Graphical User Interface

H
HTML: HyperText Markup Language, a markup language that focuses on the formatting
of web pages.

Hypervisor: Software that abstracts the hardware on a host machine and allows the host
machine to run multiple virtual machines.

Hypervisor integration: Extension of ACI Fabric connectivity to a virtualization manager to provide the APIC controller with a mechanism for virtual machine visibility and policy enforcement.

I
IFM: Intra-Fabric Messages, used for communication between different devices on the ACI fabric.

Inband Management (INB): Connectivity using an inband management configuration. This uses a front panel (data plane) port of a leaf switch for external management connectivity for the fabric and APICs.

IS-IS: Link local routing protocol leveraged by the fabric for infrastructure topology.
Loopback and VTEP addresses are internally advertised over IS-IS. IS-IS announces the
creation of tunnels from leaf nodes to all other nodes in fabric.   

J
JSON: JavaScript Object Notation, a data encapsulation format that uses human readable
text to encapsulate data objects in attribute and value pairs.

L
Layer 2 Out (l2out): Layer 2 connectivity to an external network that exists outside of the
ACI fabric.

Layer 3 Out (l3out): Layer 3 connectivity to an external network that exists outside of the
ACI fabric.

L4-L7 Service Insertion: The insertion of a service, such as a firewall or a load balancer,
into the flow of traffic. Service nodes operate between Layers 4 and 7 of the OSI model,
whereas networking elements (i.e. the fabric) operate at Layers 1-3.

Labels: Used for classifying which objects can and cannot communicate with each other.

Leaf: Network node in the fabric providing host and border connectivity. Leafs connect
only to hosts and spines; leafs never connect to each other.

M
MO: Managed Object – every configurable component of the ACI policy model managed
in the MIT is called an MO.

Model: A model is a concept that represents entities and the relationships that exist
between them.

Multi-tier Application: Client–server architecture in which presentation, application
logic, and database management functions are physically separated and require net-
working functions to communicate with the other tiers for application functionality.

O
Object Model: A collection of objects and classes used to examine and manipulate the
configuration and running state of the system that exposes that object model. In ACI,
the object model is represented as a tree known as the distributed Management Infor-
mation Tree (dMIT).

Out-of-Band Management (OOBM): External connectivity using a specific out-of-band
management interface on every switch and APIC.

P
Port Channel: Port link aggregation technology that binds multiple physical interfaces
into a single logical interface and provides more aggregate bandwidth and link failure
redundancy.

R
RBAC: Role Based Access Control, which is a method of managing secure access to infra-
structure by assigning roles to users, then using those roles in the process of granting or
denying access to devices, objects and privilege levels.

REST: REpresentational State Transfer, a stateless architectural style, usually run over
HTTP, that allows a client to access a service. The location that the client accesses
usually defines the data the client is trying to access from the service. Data is usually
accessed and returned in either XML or JSON format.
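
As a minimal sketch (the hostname and credentials are placeholders; this assumes the
widely used third-party requests library and the APIC's aaaLogin authentication end-
point), a RESTful exchange with the APIC might look like this:

    import requests

    APIC = "https://apic.example.com"  # placeholder hostname

    session = requests.Session()

    # Authenticate first; the APIC returns a session token as a cookie
    # that the Session object carries on subsequent requests.
    login = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}
    session.post(APIC + "/api/aaaLogin.json", json=login, verify=False)

    # Read an object by its Dn; the .json suffix selects JSON output
    # (.xml would return XML instead).
    resp = session.get(APIC + "/api/mo/uni/tn-Prod.json")
    print(resp.json())

Note that verify=False disables TLS certificate checking and is appropriate only for lab
use.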

RESTful: An API that uses REST, or Representational State Transfer.

Rn: Relative name, a name of a specific object within the ACI management information
tree that is not fully qualified. An Rn is significant to the individual object, but without
context it is not very useful for navigation. An Rn would need to be concatenated with
all the relative names from itself back up to the root to make a distinguished name,
which then becomes useful for navigation. As an example, if an Application Profile
object is created named "commerceworkspace", the Rn would be "ap-commerceworkspace"
because Application Profile relative names are all prefaced with the letters "ap-". See
also the Dn definition.
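
A minimal sketch of that concatenation (the build_dn helper is hypothetical, written
here only for illustration; "tn-" and "ap-" are the actual Rn prefixes for Tenant and
Application Profile objects):

    # Hypothetical helper: build a Dn by joining Rns from the root down.
    def build_dn(*rns):
        return "/".join(("uni",) + rns)

    print(build_dn("tn-Prod", "ap-commerceworkspace"))
    # -> uni/tn-Prod/ap-commerceworkspace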

S
Service graph: Cisco ACI treats services as an integral part of an application. Any services
that are required are treated as a service graph that is instantiated on the ACI fabric from
the APIC. Service graphs identify the set of network or service functions that are needed
by the application, and represent each function as a node. A service graph is represented
as two or more tiers of an application with the appropriate service function inserted
between.

Spine: Network node in the fabric carrying aggregate host traffic from leafs, connected
only to leafs in the fabric and no other device types.

Spine Leaf topology: A Clos-based fabric topology in which spine nodes connect to leaf
nodes, and leaf nodes connect to hosts and external networks.

Subnets: Contained by a bridge domain, a subnet defines the IP address range that can
be used within the bridge domain.

Subjects: Contained by contracts; subjects create the relationship between filters and
contracts.

Supervisor: Switch module/line card that provides the processing engine. 

T
Tenants: The logical container used to group all application policies. This allows
isolation from a policy perspective. For service providers this would be a customer. In an
enterprise or organization this would allow the organization to define policy separation
in a way that suits their needs. There are three pre-defined tenants on every ACI fabric:

• common: policies in this tenant are shared by all tenants. Usually these are used
for shared services or L4-L7 services.
• infra: policies in this tenant are used to influence the operation of the fabric
overlay.
• mgmt: policies in this tenant are used to define access to the inband and out-of-
band management and virtual machine controllers.

V
Virtualization: The application of technology to abstract hardware resources into virtual
representations, allowing software configurability.

vPC: virtual Port Channel, in which a port channel is created for link aggregation, but is
spread across multiple physical switches.

VRF: Virtual Routing and Forwarding - an L3 namespace isolation methodology that
allows for multiple L3 contexts to be deployed on a single device or infrastructure.

VXLAN: Virtual Extensible LAN, a Layer 2 overlay scheme transported across a Layer 3
network. A 24-bit VXLAN segment ID (SID) or VXLAN network identifier (VNID) is
included in the encapsulation to provide up to 16 million (2^24) VXLAN segments for
traffic isolation or segmentation. Each segment represents a unique Layer 2 broadcast
domain. An ACI VXLAN header is used to identify the policy attributes of the application
endpoint within the fabric, and every packet carries these policy attributes.

X
XML: eXtensible Markup Language, a markup language that focuses on encoding data for
documents rather than the formatting of the data for those documents.
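
For comparison with the JSON example earlier in this glossary, the same representative
Tenant object rendered in XML looks roughly like this (trimmed, not exhaustive):

    <fvTenant dn="uni/tn-Prod" name="Prod"/>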
