
Mirantis OpenStack 7.0
NFVI Deployment Guide
White Paper
January, 2016

TABLE OF CONTENTS

1 EXECUTIVE SUMMARY
1.1 Intended Audience/Prerequisites
1.2 Network Functions Virtualization
1.3 Lab configuration
1.3.1 Hardware
1.3.2 MOS Configuration
1.3.3 Nodes
1.3.4 Networking topology
1.4 Testing approach
2 NFVI FEATURES
2.1 Guarantee resource allocation to NFV workloads
2.1.1 Guaranteed memory allocation
2.1.2 Guaranteed CPU allocation
2.2 Huge pages
2.2.1 General recommendations on using huge pages on OpenStack
2.2.2 Huge pages and physical topology
2.2.3 Enabling huge pages on MOS 7.0
2.2.4 Compute hosts configuration
2.2.4.1 Nova configuration
2.2.4.2 Using huge pages on MOS 7.0
Some useful commands
2.3 NUMA/CPU pinning
2.3.1 Compute hosts configuration
2.3.2 Nova configuration
2.3.3 Using CPU pinning
2.3.3.1 Troubleshooting
2.4 SR-IOV
2.4.1 Enabling SR-IOV
2.4.1.1 Configure SR-IOV on the Compute nodes
2.4.1.2 Configure SR-IOV on the Controller nodes
2.4.1.3 Using SR-IOV
2.4.1.4 Troubleshooting

© 2005-2016 All Rights Reserved

www.mirantis.com

2.5 Anti-affinity groups


2.5.1 Using Anti-affinity groups
2.5.2 High Availability Instances with Neutron
2.5.2.1 One IP address, multiple VMs
2.5.2.2 Using the Allowed-Address-Pairs extension
3 APPENDIX
3.1 Installing qemu 2.4
3.2 Preparing the Ubuntu cloud image
4 RESOURCES

1 EXECUTIVE SUMMARY

1.1 Intended Audience/Prerequisites


This guide provides a list of procedures required to enable Network Functions Virtualization (NFV)-facing
features in Mirantis OpenStack 7.0. With these features enabled, Mirantis OpenStack 7.0 acts as an NFV
infrastructure (NFVI), enabling Virtual Network Functions (VNFs), as well as NFV Orchestrators (NFVOs) with
VNFs onboard, to successfully run on top of it.
This guide is targeted at NFV vendors who would like to use Mirantis OpenStack 7.0 as the NFVI for their
VNFs.

1.2 Network Functions Virtualization


The objective of Network Functions Virtualization (NFV) is to address the need of communications
service providers (CSPs) for specialized network functions such as firewalls, load balancers, content
filters, and so on, without requiring the deployment of proprietary hardware appliances in their
networks. NFV enables CSPs to do this by decoupling software functions from the underlying hardware.
It performs this decoupling by leveraging standard IT virtualization technology along with open APIs
and an ecosystem of suppliers, building an end-to-end, flexible and scalable solution.
This document covers a number of features that are crucial for service providers who wish to put NFV
solutions in place:


To guarantee its availability and performance, a VNF must have access to appropriate
resources within the NFVI (in this case, the OpenStack cloud). These resources include RAM and CPU,
which should be allocated specifically for the VNF, or for the NFV solution as a whole.
SR-IOV should be enabled in the cloud to gain greater performance for VNFs.
VNFs can be extremely I/O intensive, so Mirantis OpenStack should also have a storage plane with an
appropriate disk I/O level to meet a VNF's specific requirements. These I/O parameters must be set
appropriately.
Networking, as you might imagine, also has stringent requirements. High Availability (HA) requires
that IP management be very flexible, and that VMs of the same type are scheduled to different
physical nodes in case of hardware failure. So-called provider networks should also give VMs direct
access to external resources.

1.3 Lab configuration


This section provides information on the configuration used to enable and test the NFVI features included in
Mirantis OpenStack (MOS) 7.0.

1.3.1 Hardware
Configuration and testing were performed on the following hardware:

Host 1: Supermicro SuperServer SYS-6017R-NTF, 2x CPU Intel Xeon E5-2620v2, 4x RAM 16GB, 1x 480GB Intel SSD DC S3510, 1x Supermicro AOC-STGN-i2S
Host 2: Supermicro SuperServer 1027R-WRF, 2x CPU Intel Xeon E5-2620v2, 4x RAM 16GB, 1x 480GB Intel SSD DC S3510, 1x Supermicro AOC-STGN-i2S
Host 3: Supermicro SuperServer 6018R-TDW, 2x CPU Intel Xeon E5-2620v3, 4x RAM 16GB, 1x 480GB Intel SSD DC S3510, 1x Supermicro AOC-STGN-i2S
Hosts 4-5: Supermicro SuperServer 6018R-TDW, 2x CPU Intel Xeon E5-2620v3, 4x RAM 16GB, 1x 80GB Intel SSD DC S3510, 1x Supermicro AOC-STGN-i2S
Hosts 6-7: Supermicro SuperServer 5038MR-H8TRF, 1x CPU Intel Xeon E5-2620v3, 1x RAM 16GB, 1x 480GB Intel SSD DC S3500, 1x NIC AOC-CGP-i2


1.3.2 MOS Configuration

Compute type: KVM
Networking: Neutron with VLAN segmentation
Storage Backend: Ceph for Cinder, Glance, and Nova ephemeral storage
Sahara, Murano and Ceilometer enabled
All OpenStack settings in the Settings tab of the Fuel Web UI at their defaults.

1.3.3 Nodes

Fuel Master node (VM inside Host 1): 1 CPU, 4GB RAM, 2 NICs
3 Controllers/Mongo (VMs inside Host 1): 4 CPUs, 16GB RAM, 4 NICs
4 Computes: Hosts 2-5
2 Ceph-OSD: Hosts 6-7

1.3.4 Networking topology

Each host has 2x1Gb ports, used for:

Public network
Admin (PXE)

Each host except for the Fuel Master node also has 2x10Gb ports, used as follows:

One port for the Storage, Management and Private networks
One port left unused to allow manual configuration of SR-IOV on it

1.4 Testing approach


For all NFVI features covered in this document, Mirantis performed baseline functional and performance
testing.


2 NFVI FEATURES
An NFV Infrastructure provides an environment in which VNFs can run, but that's really the bare minimum of
what it does. An NFVI should also provide features that enable VNFs to run well. Some of the more
important features needed include:
Guaranteed resources for workloads: While cloud computing enables shared resources, the inevitable
contention can be disastrous in an NFV environment.
Huge Pages: Traditional memory structures can result in performance issues as memory lookups take
longer than they should.
NUMA/CPU pinning: Another disadvantage of cloud is that extra effort needs to be taken to make
sure that memory and CPU are local to each other.
SR-IOV: While virtualization makes cloud and NFV possible, sometimes there's just too much of a
good thing, and performance and reliability benefit from pushing functions down off the hypervisor
and back onto the hardware.
Anti-affinity groups: As a cloud management system schedules workloads, it's often necessary to
spread them among different physical nodes to prevent resource contention.
Let's take a look at each of these features.

2.1 Guarantee resource allocation to NFV workloads


Overcommitment of RAM and vCPU allocation -- where virtual allocations exceed physical capabilities with
the assumption that not all instances need resources simultaneously -- can lead to latency spikes in the
situations when several instances compete for the limited physical resources of a hypervisor. Because most
NFV workloads are latency-sensitive, it's crucial not to overcommit RAM or vCPU allocation to a VM that
carries NFV workloads. You can guarantee this by properly setting Nova's configuration options.

2.1.1 Guaranteed memory allocation


There are two parameters that affect Nova memory allocation:

ram_allocation_ratio
reserved_host_memory_mb

By default, they are already set to optimal values in MOS 7.0, but it's helpful to understand them.


The ram_allocation_ratio parameter determines the virtual RAM to physical RAM allocation ratio. It affects the
nova scheduler and should be set in /etc/nova/nova.conf on the controller hosts. In MOS 7.0 this parameter
is set to 1, so memory is not overcommitted. That means you won't be able to
start a VM with more memory than is available on the compute host:
ram_allocation_ratio=1.0

The reserved_host_memory_mb parameter determines how much memory on the compute node should be
reserved, effectively disabling usage of this memory by virtual machines. It should be set in
/etc/nova/nova.conf on the compute hosts. By default, in MOS 7.0 each compute host has 512MB reserved
for the operating system and applications other than virtual machines:
reserved_host_memory_mb=512

You may want to increase this value if your compute hosts have additional workloads. For example, if a
compute node has an additional Ceph OSD role, we recommend reserving at least 1GB of RAM per OSD daemon
instance. In general, we recommend using dedicated compute nodes.
Note: You can have a per-aggregate ram_allocation_ratio with AggregateRamFilter. The process is the same as
setting per-aggregate CPU allocation, described below.
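
As a rough illustration of how these two parameters interact, the following Python sketch estimates how much RAM can be handed out to instances on one compute host. It is a simplified model under stated assumptions, not Nova's actual code: the function name is hypothetical, and the 64GB host size is chosen only to match the lab hardware (a real host reports slightly less usable memory than its nominal size).

```python
def schedulable_ram_mb(total_ram_mb, ram_allocation_ratio=1.0,
                       reserved_host_memory_mb=512):
    """Estimate the RAM (in MB) that instances can be given on one host.

    Simplified model (an assumption, not Nova source code): physical RAM is
    scaled by the allocation ratio, and the memory reserved for the host OS
    is treated as already used.
    """
    limit_mb = total_ram_mb * ram_allocation_ratio
    return int(limit_mb - reserved_host_memory_mb)


# Hypothetical 64GB compute host with the MOS 7.0 defaults.
print(schedulable_ram_mb(64 * 1024))  # 65024

# The same host also running two Ceph OSD daemons (1GB reserved per OSD).
print(schedulable_ram_mb(64 * 1024,
                         reserved_host_memory_mb=512 + 2 * 1024))  # 62976
```

With a ratio of 1.0 the result can never exceed physical memory, which is the property NFV workloads rely on.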

2.1.2 Guaranteed CPU allocation

As with guaranteed memory allocation, it's often desirable to guarantee that the VM has access to
specified CPU resources to avoid latency issues. The cpu_allocation_ratio parameter determines the virtual
CPU to physical CPU allocation ratio, i.e. how many virtual CPUs will be available on the compute node for
each physical one. By default, cpu_allocation_ratio is set to 8 in MOS 7.0. To ensure as little
contention as possible for the CPU resources on the host machine, set cpu_allocation_ratio to 1.0.
In /etc/nova/nova.conf on all controller hosts, change the line
cpu_allocation_ratio=8.0

to:
cpu_allocation_ratio=1.0


Restart the nova-scheduler service on the controllers:


# restart nova-scheduler

You can also use a per-aggregate cpu_allocation_ratio with AggregateCoreFilter, so instances that require
guaranteed CPU allocation will run on dedicated compute nodes. To make this change, follow these steps:
1) On all controllers, add the AggregateInstanceExtraSpecsFilter and AggregateCoreFilter to the
scheduler_default_filters parameter in /etc/nova/nova.conf.
scheduler_default_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,CoreFilter,DiskFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,AggregateInstanceExtraSpecsFilter,AggregateCoreFilter

2) From the command line, create an aggregate with guaranteed CPU allocation:
nova aggregate-create guaranteed_cpu nova

and set its parameters:


nova aggregate-set-metadata guaranteed_cpu cpu_allocation_ratio=1.0
nova aggregate-set-metadata guaranteed_cpu g_cpu=true

3) Add one or more compute hosts to this aggregate:


nova aggregate-add-host guaranteed_cpu node-8.domain.tld

4) Create a new flavor for the VM that requires guaranteed CPU allocation. Here is an example in
which we create a flavor that requires 12 vCPUs:
nova flavor-create m1.guaranteed_cpu auto 2048 20 12

and allow it to start only on the aggregate with guaranteed CPU allocation:


nova flavor-key m1.guaranteed_cpu set aggregate_instance_extra_specs:g_cpu=true


5) For good measure, you should update all other flavours so they will start only on the hosts not
belonging to the guaranteed_cpu aggregate:
openstack flavor list -f csv | cut -f1 -d, | tail -n +2| \
xargs -I% -n 1 nova flavor-key % \
set aggregate_instance_extra_specs:g_cpu=false

6) Now if the host that we added to the guaranteed_cpu aggregate has 24 vCPUs and we try to start
three instances of the m1.guaranteed_cpu flavor, only two of them will start, and the third one
will fail because there are not enough vCPUs for it to start on this aggregate.
This way you can guarantee that the VM that requires the guaranteed CPU resources will always have them
while all the other VMs will use standard CPU overcommitment.
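
The arithmetic behind step 6 can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it counts vCPUs alone, ignoring the RAM and disk checks that the scheduler also applies.

```python
def max_instances(host_cpus, flavor_vcpus, cpu_allocation_ratio=1.0):
    """How many instances of one flavor fit on a host, by vCPU count only."""
    schedulable_vcpus = int(host_cpus * cpu_allocation_ratio)
    return schedulable_vcpus // flavor_vcpus


# 24-CPU host in the guaranteed_cpu aggregate, 12-vCPU flavor, ratio 1.0:
print(max_instances(24, 12))       # 2, so a third instance cannot be placed

# The same host with the MOS 7.0 default ratio of 8:
print(max_instances(24, 12, 8.0))  # 16
```

The difference between the two results is the overcommitment that the dedicated aggregate removes.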

2.2 Huge pages


Memory addressing on contemporary computers is done in terms of blocks of contiguous virtual memory
addresses known as pages. Historically, memory pages on x86 systems have had a fixed size of 4 kilobytes,
but today this parameter is configurable to some degree: the x86_32 architecture, for example, supports 4KB
and 4MB pages, while the x86_64 architecture supports 4KB, 2MB, and, more recently, 1GB pages.

Pages larger than the default size are referred to as "huge pages" or "large pages" (the terms are frequently
capitalized). We'll call them "huge pages" in this document.
Processes work with virtual memory addresses. Each time a process accesses memory, the desired virtual
memory address is translated to a physical one by looking at a special memory area called the page table,
where virtual-to-physical mappings are stored. A hardware cache on the CPU is used to speed up lookups.
This cache is called the translation lookaside buffer (TLB).
The TLB typically can store only a small fraction of virtual-to-physical page mappings. By increasing memory
page size we reduce the total number of pages that need to be addressed, thus increasing the TLB hit rate. This
can lead to significant performance gains when a process does many memory operations. Also, the page
table may require a significant amount of memory in cases where it needs to store many references to small
memory pages. In extreme cases, memory savings from using huge pages may amount to several gigabytes.
(For example, see
http://kevinclosson.net/2009/07/28/quantifying-hugepages-memory-savings-with-oracle-database-11g.)
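
To make the page table savings concrete, here is a small Python sketch that counts the last-level page table entries needed to map a region with each page size. It is a back-of-the-envelope model: it assumes 8-byte entries (the x86_64 entry size), ignores the upper levels of the page table, and the 64GB mapping is a hypothetical example.

```python
PTE_BYTES = 8  # size of one page table entry on x86_64

PAGE_SIZES = {"4KB": 4 * 1024, "2MB": 2 * 1024**2, "1GB": 1024**3}


def leaf_entries(mapped_bytes, page_bytes):
    """Number of last-level page table entries needed to map a region."""
    return -(-mapped_bytes // page_bytes)  # ceiling division


mapped = 64 * 1024**3  # hypothetical process that maps 64GB
for name, size in PAGE_SIZES.items():
    entries = leaf_entries(mapped, size)
    print(name, entries, "entries,", entries * PTE_BYTES, "bytes")
```

For this mapping the entries occupy 128MB with 4KB pages, 256KB with 2MB pages, and 512 bytes with 1GB pages, and that cost is paid per process that maps the region.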


On the other hand, when the page size is large but a process doesn't use all the page memory, unused
memory is effectively lost as it cannot be used by other processes. So there is usually a tradeoff between
performance and more efficient memory utilization.
In the case of virtualization, a second level of page translation (between the hypervisor and host OS) causes
additional overhead. Using huge pages on the host OS lets us greatly reduce this overhead.
It's preferable to give a virtual machine with NFV workloads exclusive access to a predetermined amount of
memory. No other process can use that memory anyway, so there is no tradeoff in using huge pages.
Huge pages are thus the natural option for NFV workloads.
For more information on page tables and the translation process, see
https://en.wikipedia.org/wiki/Page_table

2.2.1 General recommendations on using huge pages on OpenStack

There are two ways to use huge pages on Linux in general:

Explicit - an application is enabled to use huge pages by changing its source code
Implicit - via automatic aggregation of default-sized pages to huge pages by the transparent huge pages
(THP) mechanism in the kernel

THP is turned on by default in MOS 7.0, but explicit huge pages potentially provide more performance
gains if an application supports them.
Although we tend to think of the hypervisor as KVM, KVM is really just the kernel module; the actual
hypervisor is QEMU. That means that QEMU performance is crucial for NFV. Fortunately, it supports explicit
usage of huge pages via hugetlbfs, so we don't really need THP here. Moreover, THP can lead to
side effects with unpredictable results -- sometimes lowering performance instead of raising it.
Also be aware that when the kernel needs to swap out a THP, the aggregate huge page is first split into standard
4KB pages. Explicit huge pages are never swapped to disk, which is perfectly fine for typical NFV workloads.


In general, huge pages can be reserved at boot or at runtime (though 1GB huge pages can only be allocated
at boot). Memory generally gets fragmented on a running system, and the kernel may not be able to reserve
as many contiguous memory blocks at runtime as it can at boot.
For general NFV workloads we recommend using dedicated compute nodes with the major part of their
memory reserved as explicit huge pages at boot time. NFV workload instances should be configured to use
huge pages. We also recommend disabling THP on these compute nodes. As for the preferred huge page size,
the choice depends on the needs of specific workloads. Generally, 1GB pages can be slightly faster, but 2MB
pages provide more granularity.
For more information on explicit huge pages, see:

Summary in the Debian Wiki: https://wiki.debian.org/Hugepages
Good general introductory article: http://linuxgazette.net/155/krishnakumar.html
Series of in-depth articles starting with http://lwn.net/Articles/374424/

For more information on THP, see:

General introduction: https://lwn.net/Articles/423584/
Articles on THP performance impact:
https://blogs.oracle.com/linuxkernel/entry/performance_impact_of_transparent_huge
https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
https://en.wikipedia.org/wiki/Second_Level_Address_Translation
http://developerblog.redhat.com/2014/03/10/examining-huge-pages-or-transparent-huge-pages-performance/

2.2.2 Huge pages and physical topology


All contemporary multiprocessor x86_64 systems have a non-uniform memory access (NUMA) architecture.
NUMA-related settings will be described in the following sections of this guide, but there are some subtle
characteristics of NUMA that affect huge page allocation on multi-CPU hosts that you should be aware of
when configuring OpenStack.


As a rule, some amount of memory is reserved in the lower range of the memory address space. This memory is
used for memory-mapped I/O, and it is usually reserved on the first NUMA cell (corresponding to the first
CPU) before huge pages are allocated. When allocating huge pages, however, the kernel tries to spread them
evenly across all NUMA cells. If there's not enough contiguous memory in one of the NUMA cells, the kernel
will try to compensate by allocating more memory on the remaining cells. When the amount of memory used
by huge pages is close to the total amount of free memory, you end up with uneven huge page distributions
across NUMA cells. This is more likely to happen when using 1GB pages.
Here is an example from a host with 64 gigabytes of memory and two CPUs:
# grep "Memory.*reserved" /var/log/dmesg
[    0.000000] Memory: 65843012K/67001792K available (7396K kernel code, 1146K rwdata, 3416K rodata, 1336K init, 1448K bss, 1158780K reserved)

We can see that the kernel reserves more than 1GB of memory.

Now, if we try to reserve 60 1GB pages, the result will be:
# grep . /sys/devices/system/node/node*/hugepages/hugepages*kB/nr_hugepages
/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages:29
/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages:31

This might lead to negative consequences. For example, if we use a VM flavor that requires 30GB of memory
in one NUMA cell (or 60GB in two) there would be a problem. One might think that the number of huge
pages on this host is enough to run two instances with 30GB of memory each or one two-cell instance with
60GB, but in reality, only one 30GB instance will start: the other one will be one 1GB page short. If we
try to start a 60GB, two-cell instance with this distribution of huge pages between NUMA cells, it will fail to
start altogether because Nova will try to find a physical host with two NUMA cells having 30GB of memory
each and fail to do so because one of the cells has insufficient memory.
You may want to use an option such as 'Socket Interleave Below 4GB' or similar if your BIOS supports it to
avoid this situation. This option maps lower address space evenly between the NUMA cells, in effect splitting
reserved memory between NUMA nodes.
In conclusion, you should always test to verify the real allocation of huge pages and plan accordingly, based
on the results.
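
The 29/31 split described in this section can be checked with a short Python sketch before planning flavors. The helper names are hypothetical, and the model is deliberately simple: it assumes every guest NUMA cell must be backed entirely by one host cell and that each host cell backs at most one cell of a given multi-cell guest.

```python
def single_cell_capacity(pages_per_cell, pages_per_instance):
    """Count the single-cell instances that fit, cell by cell."""
    return sum(pages // pages_per_instance for pages in pages_per_cell)


def fits_multi_cell(pages_per_cell, pages_per_guest_cell, guest_cells):
    """Check whether enough host cells can each back one guest cell."""
    able = [p for p in pages_per_cell if p >= pages_per_guest_cell]
    return len(able) >= guest_cells


host = [29, 31]  # 1GB pages on node0 and node1, as in the output above

print(single_cell_capacity(host, 30))      # 1: node0 is one page short
print(fits_multi_cell(host, 30, 2))        # False: the 60GB guest cannot start
print(single_cell_capacity([30, 30], 30))  # 2: an even split would fit both
```

Feeding the function the real per-node counts from /sys after a reboot gives a quick answer on whether the planned flavors will actually fit.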


2.2.3 Enabling huge pages on MOS 7.0


To enable huge pages, you need to configure every compute node where you plan to run instances that will
use them. You also need to configure nova aggregates and flavors before launching huge pages-backed
instances.

2.2.4 Compute hosts configuration


Below we provide an example of how to configure huge pages on one of the compute nodes. All the
commands in this section should be run on the compute nodes that will handle huge pages workloads.
We will only describe the steps required for boot time configuration. For information on runtime huge pages
allocation, please refer to the kernel documentation
(https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt).
1. Check that your compute node supports huge pages:
# grep -m1 "pse\|pdpe1gb" /proc/cpuinfo
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi
mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16
xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm
abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2
smep bmi2 erms invpcid

The pse and pdpe1gb flags in the output indicate that the hardware supports standard (2 or 4 megabytes,
depending on hardware architecture) and 1GB huge pages, respectively.
2. Upgrade QEMU to 2.4 to use huge pages (see the Appendix A1 Installing qemu 2.4).
3. Add huge pages allocation parameters to the list of kernel arguments in /etc/default/grub. Note that
we are also disabling Transparent Huge Pages in the examples below because we're using explicit
huge pages to prevent swapping.


Add the following to the end of /etc/default/grub:
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX hugepagesz=<size of hugepages> hugepages=<number of hugepages> transparent_hugepage=never"

Note that <size of hugepages> is either 2M or 1G.

You can also use both sizes simultaneously:
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX hugepagesz=2M hugepages=<number of 2M hugepages> hugepagesz=1G hugepages=<number of 1G hugepages> transparent_hugepage=never"

In the following example we preallocate 30000 2MB pages:

GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX hugepagesz=2M hugepages=30000 transparent_hugepage=never"

Caution: Be careful when deciding on the number of huge pages to reserve. You should leave enough
memory for host OS processes (including memory for Ceph processes if your compute node shares the
Ceph OSD role) or risk unpredictable results.
Note: You can't allocate different amounts of memory to each NUMA cell via kernel parameters. If you need
to do so, you have to use the command line or startup scripts. Here is an example in which we allocate 10 1GB
pages on the first NUMA cell and 30 on the second one:
echo 10 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 30 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

4. Change the value of KVM_HUGEPAGES in /etc/default/qemu-kvm from 0 to 1 to make QEMU aware
of huge pages:
KVM_HUGEPAGES=1

5. Update the bootloader and reboot for these parameters to take effect:
# update-grub
# reboot

6. After rebooting, don't forget to verify that the pages are reserved according to the settings specified:


# grep Huge /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:   30000
HugePages_Free:    30000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

# grep . /sys/devices/system/node/node*/hugepages/hugepages*kB/nr_hugepages
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:15000
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:15000
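
The sizing decision in step 3 can be sketched as a small Python helper that turns a memory budget into a page count and the matching kernel arguments. The helper name and the 5536MB host reservation are hypothetical, chosen only so that the result reproduces the 30000-page example on a 64GB host. How much to leave for the host OS remains a judgment call.

```python
def hugepage_plan(total_ram_mb, host_reserved_mb, page_size_mb=2):
    """Return (page count, kernel arguments) for a boot-time reservation."""
    if host_reserved_mb >= total_ram_mb:
        raise ValueError("nothing left to reserve as huge pages")
    pages = (total_ram_mb - host_reserved_mb) // page_size_mb
    size = "1G" if page_size_mb == 1024 else "%dM" % page_size_mb
    args = "hugepagesz=%s hugepages=%d transparent_hugepage=never" % (size, pages)
    return pages, args


# Hypothetical 64GB host that keeps 5536MB for the OS and other services.
pages, args = hugepage_plan(64 * 1024, host_reserved_mb=5536)
print(pages)  # 30000
print(args)   # hugepagesz=2M hugepages=30000 transparent_hugepage=never
```

The returned string is the part that goes inside GRUB_CMDLINE_LINUX; the per-node verification shown in step 6 is still needed because the kernel decides the actual distribution.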

2.2.4.1 Nova configuration


To use huge pages, you need to launch instances whose flavor has the extra specification
hw:mem_page_size.
By default, there is nothing to prevent normal instances with flavors that don't have the extra spec from
starting on compute nodes with reserved huge pages. To avoid this situation, you'll need to create nova
aggregates for compute nodes with and without huge pages, create a new flavor for huge pages-enabled
instances, update all the other flavors with an aggregate extra spec, and reconfigure the nova scheduler
service to check extra specs when scheduling instances. Follow the steps below:
1. From the command line, create aggregates for compute nodes with and without huge pages:
# nova aggregate-create hpgs-aggr
# nova aggregate-set-metadata hpgs-aggr hpgs=true
# nova aggregate-create normal-aggr
# nova aggregate-set-metadata normal-aggr hpgs=false

2. Add one or more hosts to them:

# nova aggregate-add-host hpgs-aggr node-9.domain.tld
# nova aggregate-add-host normal-aggr node-10.domain.tld

3. Create a new flavor for instances with huge pages:


# nova flavor-create m1.small.hpgs auto 2000 20 2


# nova flavor-key m1.small.hpgs set hw:mem_page_size=2048
# nova flavor-key m1.small.hpgs set aggregate_instance_extra_specs:hpgs=true

4. Update all other flavours so they will start only on hosts without huge pages support:
# openstack flavor list -f csv|grep -v hpgs|cut -f1 -d,| tail -n +2| \
xargs -I% -n 1 nova flavor-key % \
set aggregate_instance_extra_specs:hpgs=false

5. On every controller, add the value AggregateInstanceExtraSpecsFilter to the scheduler_default_filters
parameter in /etc/nova/nova.conf:
scheduler_default_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,DiskFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,AggregateInstanceExtraSpecsFilter

6. Restart nova scheduler service on all controllers:


# restart nova-scheduler

2.2.4.2 Using huge pages on MOS 7.0


Now that OpenStack is configured for huge pages, you're ready to use it as follows:
1. Create an instance with the huge pages flavor:
nova boot --image TestVM --nic net-id=`openstack network show net04 -f value | head -n1` --flavor m1.small.hpgs hpgs-test

2. Verify that instance has been successfully created:


# nova list --name hpgs-test
+--------------------------------------+-----------+--------+------------+-------------+----------------------+
| ID                                   | Name      | Status | Task State | Power State | Networks             |
+--------------------------------------+-----------+--------+------------+-------------+----------------------+
| 593d461e-3ef2-46cc-a88d-5f147eb2a14e | hpgs-test | ACTIVE |            | Running     | net04=192.168.111.15 |
+--------------------------------------+-----------+--------+------------+-------------+----------------------+

If the status is ERROR, check the log files for lines containing this instance ID. The easiest way to do
that is to run the following command on the Fuel Master node:
# grep -Ri <Instance ID> /var/log/docker-logs/remote/node-*
If you encounter the error:
libvirtError: internal error: process exited while connecting to monitor: os_mem_prealloc: failed to preallocate pages
it means there is not enough free memory available inside one NUMA cell to satisfy instance
requirements. Check that the VM's NUMA topology fits inside the host's.

This error:
libvirtError: unsupported configuration: Per-node memory binding is not supported with this QEMU
means that you are using QEMU 2.0 packages. You need to upgrade QEMU to 2.4; see Appendix A1
for instructions on how to upgrade QEMU packages.

3. Verify that the instance uses huge pages (all commands below should be run from a controller):
Locate the part of the instance configuration that is relevant to huge pages:
# hypervisor=`nova show hpgs-test | grep OS-EXT-SRV-ATTR:host | cut -d\| -f3`
# instance=`nova show hpgs-test | grep OS-EXT-SRV-ATTR:instance_name | cut -d\| -f3`
# ssh $hypervisor virsh dumpxml $instance | awk '/memoryBacking/ {p=1}; p; /\/numatune/ {p=0}'
<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB' nodeset='0'/>
  </hugepages>
</memoryBacking>
<vcpu placement='static'>2</vcpu>
<cputune>
  <shares>2048</shares>
  <vcpupin vcpu='0' cpuset='0-5,12-17'/>
  <vcpupin vcpu='1' cpuset='0-5,12-17'/>
  <emulatorpin cpuset='0-5,12-17'/>
</cputune>
<numatune>
  <memory mode='strict' nodeset='0'/>
  <memnode cellid='0' mode='strict' nodeset='0'/>
</numatune>

The memoryBacking section should show that this instance's memory is backed by huge pages. You
may also see that the cputune section reveals so-called pinning of this instance's vCPUs. This
means the instance will only run on physical CPU cores that have direct access to this instance's
memory, and comes as a bonus from hypervisor awareness of the host physical topology. We will
discuss instance CPU pinning in the next section.
You may also look at the QEMU process arguments and make sure they contain relevant options, such
as:
# ssh $hypervisor pgrep -af $instance | grep -Po "memory[^\s]+"
memory-backend-file,prealloc=yes,mem-path=/run/hugepages/kvm/libvirt/qemu,size=2000M,id=ram-node0,host-nodes=0,policy=bind

or directly examine the kernel huge pages stats:


# ssh $hypervisor "grep huge /proc/\`pgrep -of $instance\`/numa_maps"
2aaaaac00000 bind:0 file=/run/hugepages/kvm/libvirt/qemu/qemu_back_mem._objects_ram-node0.VveFxP\040(deleted) huge anon=1000 dirty=1000 N0=1000

We can see that the instance uses 1000 huge pages (since this flavor's memory is 2000MB and we are
using 2048KB huge pages).
Note: It's possible to use more than one NUMA host cell for a single instance with the flavor key
hw:numa_nodes, but you should be aware that multi-cell instances may show worse performance than
single-cell instances when the processes inside them aren't aware of their NUMA topology. See
more on this subject in the section about NUMA CPU pinning.
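
The check in step 3 can also be scripted. The Python sketch below parses a numa_maps line like the one shown above and compares the per-node page count with what the flavor should need. The function names are hypothetical, and the sample line is copied from the example output rather than read from a live host.

```python
import re


def pages_per_node(numa_maps_line):
    """Extract {node: page count} from the N<node>=<pages> fields."""
    pairs = re.findall(r"\bN(\d+)=(\d+)", numa_maps_line)
    return {int(node): int(count) for node, count in pairs}


def expected_pages(flavor_ram_mb, page_size_kb):
    """Huge pages needed to back a flavor's memory."""
    return flavor_ram_mb * 1024 // page_size_kb


line = ("2aaaaac00000 bind:0 file=/run/hugepages/kvm/libvirt/qemu/"
        "qemu_back_mem._objects_ram-node0.VveFxP\\040(deleted) "
        "huge anon=1000 dirty=1000 N0=1000")

print(pages_per_node(line))        # {0: 1000}
print(expected_pages(2000, 2048))  # 1000
```

If the sum over all nodes differs from the expected value, or pages appear on a node that the flavor should not be using, the instance configuration deserves a closer look.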

Some useful commands


Here are some commands for obtaining huge pages-related diagnostics.

To obtain information about the hardware Translation Lookaside Buffer (run apt-get install cpuid
beforehand):
# cpuid -1 | awk '/^\w/ { p=0 } /TLB information/ { p=1; } p;'

cache and TLB information (2):


0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xff: cache data is in CPUID 4
0xb5: instruction TLB: 4K, 8-way, 64 entries
0xf0: 64 byte prefetching
0xc1: L2 TLB: 4K/2M pages, 8-way, 1024 entries

To show how much memory is used for Page Tables:


# grep PageTables /proc/meminfo
PageTables:      1244880 kB

To show current huge pages statistics:


# grep Huge /proc/meminfo
AnonHugePages:    606208 kB
HugePages_Total:   15000
HugePages_Free:    15000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

To show huge pages distribution between NUMA nodes:


# grep . /sys/devices/system/node/node*/hugepages/hugepages*kB/nr_hugepages
/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages:29
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:15845
/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages:31
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:15899

2.3 NUMA/CPU pinning

As we hinted in the section about huge pages, NUMA means that memory is broken up into pools, with each
physical CPU having its own pool of "local" memory. The best performance comes from making sure a process
and its memory are in the same NUMA cell. Unfortunately, the nature of virtualization means that processes
typically use whatever CPU is available, whether it's local to the memory or not. To solve this problem, you
can use CPU pinning.
CPU pinning enables you to pin, or establish a mapping between, a virtual CPU and a physical core so that a
virtual CPU will always run on the same physical one. By exposing the NUMA topology to the VM and pinning
vCPUs to specific cores, it's possible to improve VM performance by ensuring that access to memory will always
be local in terms of NUMA topology.

2.3.1 Compute hosts configuration


To enable CPU Pinning, perform the following steps on every compute host where you want CPU pinning to
be enabled.
1. Upgrade QEMU to 2.4 to use NUMA CPU pinning (see section 3.1, Installing qemu 2.4, in the Appendix).
2. Get the NUMA topology for the node:
# lscpu | grep NUMA
NUMA node(s):          2
NUMA node0 CPU(s):     0-5,12-17
NUMA node1 CPU(s):     6-11,18-23

3. Tell the system which cores should be used only by virtual machines and not by the host operating
system by adding the following to the end of /etc/default/grub:
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX isolcpus=1-5,7-23"

4. Then add the same list to vcpu_pin_set in /etc/nova/nova.conf:
vcpu_pin_set=1-5,7-23

In this example, cores 0 and 6 are dedicated to the host system. Virtual machines will use
cores 1-5 and 12-17 on NUMA cell 1 (node0 in the lscpu output), and cores 7-11 and 18-23 on NUMA cell 2 (node1).
5. Update the boot record and reboot the compute node:
# update-grub
# reboot
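
Before rebooting, it can be useful to double-check which cores a given isolcpus/vcpu_pin_set value leaves to the host. The sketch below is an illustration only: the function names expand_cpu_list and host_cores are hypothetical, and the total number of logical CPUs (24 in our lab) is passed in by hand.

```shell
# Hypothetical helper: expand a CPU list such as "1-5,7-23" into one CPU id per line.
expand_cpu_list() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r first last; do
        # A single id such as "7" has no upper bound, so reuse the lower one.
        seq "$first" "${last:-$first}"
    done
}

# Hypothetical helper: print the cores of a host with $1 logical CPUs that are
# NOT in the CPU list $2, that is, the cores left to the host operating system.
host_cores() {
    expand_cpu_list "$2" | awk -v total="$1" '
        { pinned[$1] = 1 }
        END {
            out = ""
            for (c = 0; c < total; c++)
                if (!(c in pinned)) out = out (out == "" ? "" : " ") c
            print out
        }'
}
```

For the configuration in this section, host_cores 24 1-5,7-23 should list cores 0 and 6.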

2.3.2 Nova configuration


Now that you've enabled CPU Pinning on the system, you need to configure nova to use it. Perform these
steps on the compute nodes where you want to use NUMA pinning:
1. On the command line, create aggregates for instances with and without CPU pinning:
# nova aggregate-create performance
# nova aggregate-set-metadata performance pinned=true
# nova aggregate-create normal
# nova aggregate-set-metadata normal pinned=false

2. Add one or more hosts to the new aggregates:


# nova aggregate-add-host performance node-9.domain.tld
# nova aggregate-add-host normal node-10.domain.tld

3. Create a new flavor for VMs that require CPU pinning:


# nova flavor-create m1.small.performance auto 2048 20 2
# nova flavor-key m1.small.performance set hw:cpu_policy=dedicated
# nova flavor-key m1.small.performance set aggregate_instance_extra_specs:pinned=true

4. To be thorough, you should update all other flavors so that they start only on hosts without CPU
pinning:
# openstack flavor list -f csv|grep -v performance |cut -f1 -d,| \
tail -n +2| xargs -I% -n 1 nova flavor-key % set aggregate_instance_extra_specs:pinned=false

5. On every controller, add the values AggregateInstanceExtraSpecsFilter and NUMATopologyFilter to the
scheduler_default_filters parameter in /etc/nova/nova.conf:
scheduler_default_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,DiskFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,NUMATopologyFilter,AggregateInstanceExtraSpecsFilter

6. Restart nova scheduler service on all controllers:


restart nova-scheduler
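
When the same change has to be made on several controllers, the new filter names can be appended without duplicating any that are already present. The following is a minimal sketch of that idea; the function name add_filters is hypothetical, and it works on the text of a scheduler_default_filters line rather than on nova.conf itself.

```shell
# Hypothetical helper: append filter names to a "key=a,b,c" line,
# skipping names that are already in the list.
# Usage: add_filters "<current line>" Filter1 Filter2 ...
add_filters() {
    line=$1
    shift
    for f in "$@"; do
        # Wrap the value in commas so only whole names match.
        case ",${line#*=}," in
            *",$f,"*) ;;
            *) line="$line,$f" ;;
        esac
    done
    echo "$line"
}
```

For example, add_filters "$(grep '^scheduler_default_filters' /etc/nova/nova.conf)" NUMATopologyFilter AggregateInstanceExtraSpecsFilter prints the line that should end up in the file.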

2.3.3 Using CPU pinning
Once you've done this configuration, using CPU pinning is straightforward. Follow these steps:
1. Start a new VM with a flavor that requires pinning ...
# nova boot --image TestVM --nic net-id=`openstack network show net04 -f value | head -n1` --flavor m1.small.performance test1

... and check its vcpu configuration:


# hypervisor=`nova show test1 | grep OS-EXT-SRV-ATTR:host | cut -d\| -f3`
# instance=`nova show test1 | grep OS-EXT-SRV-ATTR:instance_name | cut -d\| -f3`
# ssh $hypervisor virsh dumpxml $instance |awk '/vcpu placement/ {p=1}; p; /\/numatune/ {p=0}'
<vcpu placement='static'>2</vcpu>
<cputune>
<shares>2048</shares>
<vcpupin vcpu='0' cpuset='16'/>
<vcpupin vcpu='1' cpuset='4'/>
<emulatorpin cpuset='4,16'/>
</cputune>
<numatune>
<memory mode='strict' nodeset='0'/>
<memnode cellid='0' mode='strict' nodeset='0'/>
</numatune>

You should see that each vCPU is pinned to a dedicated CPU core, which is not used by the host
operating system, and that these cores are inside the same host NUMA cell (in our example, cores
4 and 16 in NUMA cell 1).
2. Repeat the test for the instance with two NUMA cells:
# nova flavor-create m1.small.performance-2 auto 2048 20 2
# nova flavor-key m1.small.performance-2 set hw:cpu_policy=dedicated
# nova flavor-key m1.small.performance-2 set aggregate_instance_extra_specs:pinned=true
# nova flavor-key m1.small.performance-2 set hw:numa_nodes=2
# nova boot --image TestVM --nic net-id=`openstack network show net04 -f value | head -n1` --flavor m1.small.performance-2 test2
# hypervisor=`nova show test2 | grep OS-EXT-SRV-ATTR:host | cut -d\| -f3`
# instance=`nova show test2 | grep OS-EXT-SRV-ATTR:instance_name | cut -d\| -f3`
# ssh $hypervisor virsh dumpxml $instance |awk '/vcpu placement/ {p=1}; p; /\/numatune/ {p=0}'
<vcpu placement='static'>2</vcpu>
<cputune>
<shares>2048</shares>
<vcpupin vcpu='0' cpuset='2'/>
<vcpupin vcpu='1' cpuset='10'/>
<emulatorpin cpuset='2,10'/>
</cputune>
<numatune>
<memory mode='strict' nodeset='0-1'/>
<memnode cellid='0' mode='strict' nodeset='0'/>
<memnode cellid='1' mode='strict' nodeset='1'/>
</numatune>

You should see that each vCPU is pinned to a dedicated CPU core, which is not used by the host
operating system, and that these cores are in different host NUMA cells. In our example, these are core 2 in
NUMA cell 1 and core 10 in NUMA cell 2. As you may remember, in our configuration cores 1-5 and 12-17
from cell 1 and cores 7-11 and 18-23 from cell 2 are available to virtual machines.
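
To compare the pinning of several instances quickly, the vcpupin elements can be reduced to plain "vCPU host-core" pairs. This is a sketch only: the function name vcpu_pins is hypothetical, and it assumes the single-core cpuset format shown in the XML above.

```shell
# Hypothetical helper: read `virsh dumpxml` output on stdin and print one
# "vcpu host_cpu" pair per <vcpupin> element. Other elements, including
# <emulatorpin>, are ignored.
vcpu_pins() {
    sed -n "s/.*<vcpupin vcpu='\([0-9]*\)' cpuset='\([0-9]*\)'.*/\1 \2/p"
}
```

With the variables from the steps above, it could be used as: ssh $hypervisor virsh dumpxml $instance | vcpu_pins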

2.3.3.1 Troubleshooting
You might run into the following errors:

internal error: No PCI buses available
In this case, you've specified the wrong hw_machine_type in /etc/nova/nova.conf.

libvirtError: unsupported configuration: Per-node memory binding is not supported with this version of QEMU
You may have an older version of QEMU, or a stale libvirt capabilities cache (see section 3.1 for how to
upgrade QEMU and clear the cache).

2.4 SR-IOV

SR-IOV is a PCI Special Interest Group (PCI-SIG) specification for virtualizing network interfaces. It represents
each physical resource as a configurable entity (called a PF, for Physical Function) and creates multiple
virtual interfaces (VFs, or Virtual Functions) with limited configurability on top of it. Doing so requires support
from the system BIOS and, conventionally, also from the host OS or hypervisor. Among other
benefits, SR-IOV makes it possible to run a very large number of network-traffic-handling VMs per compute node
without increasing the number of physical NICs/ports, and provides a means for pushing the processing of this
traffic down into the hardware layer, off-loading the task from the hypervisor and significantly improving both
throughput and the determinism of network performance. That's why it's an NFV must-have.
Note: On Intel NICs, PFs cannot support promiscuous mode when SR-IOV is enabled, so they cannot do L2 bridging.
Because of this, you shouldn't enable SR-IOV on interfaces that have standard Fuel networks assigned to them.
(You may use SR-IOV on an interface that is used only by the Fuel Private network if you use nova host aggregates
and different flavors for normal and SR-IOV-enabled instances, as shown in the section Using SR-IOV.)
Note: SR-IOV has a couple of limitations in the Kilo release of OpenStack. Most notably, migration of instances with
SR-IOV-attached ports is not supported. Also, iptables-based filtering is not usable with SR-IOV NICs, because
SR-IOV bypasses the normal network stack, so security groups cannot be used with SR-IOV-enabled ports (though
you can still use security groups for normal ports).
In the following examples, we will assume eth1 is an SR-IOV interface.
Note: These instructions apply only to OpenStack Neutron with Open vSwitch.

2.4.1 Enabling SR-IOV


To enable SR-IOV, you need to configure it on both the compute and the controller nodes. Let's start with
the compute nodes.
2.4.1.1 Configure SR-IOV on the Compute nodes
To enable SR-IOV, perform the following steps only on Compute nodes that will be used for running
instances with SR-IOV virtual NICs:
1. Ensure that your compute nodes are capable of PCI passthrough and SR-IOV. Your hardware must
provide VT-d and SR-IOV capabilities and these extensions may need to be enabled in the BIOS. VT-d
options are usually configured in the Chipset Configuration/North Bridge/IIO configuration section of
the BIOS, while SR-IOV support is configured in PCIe/PCI/PnP Configuration.
If your system supports VT-d you should see these messages related to DMAR in dmesg output:
# grep -i dmar /var/log/dmesg
[    0.000000] ACPI: DMAR 0000000079d31860 000140 (v01 ALASKA A M I 00000001 INTL 20091013)
[    0.061993] dmar: Host address width 46
[    0.061996] dmar: DRHD base: 0x000000fbffc000 flags: 0x0
[    0.062004] dmar: IOMMU 0: reg_base_addr fbffc000 ver 1:0 cap d2078c106f0466 ecap f020de
[    0.062007] dmar: DRHD base: 0x000000c7ffc000 flags: 0x1
[    0.062012] dmar: IOMMU 1: reg_base_addr c7ffc000 ver 1:0 cap d2078c106f0466 ecap f020de
[    0.062014] dmar: RMRR base: 0x0000007bc94000 end: 0x0000007bca2fff

This is just an example, of course; your output may differ.


If your system supports SR-IOV, you should see SR-IOV capability section for each NIC PF, and the
total number of VFs should be non-zero:
# lspci -vvv | grep -i "initial vf"
Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 01
Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 00
Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 01

2. Check that VT-d is enabled in the kernel using this command:


# grep -i "iommu.*enabled" /var/log/dmesg

If you don't see a response similar to:
[    0.000000] Intel-IOMMU: enabled
then it's not yet enabled. Enable it by adding the following line to the end of /etc/default/grub:
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX intel_iommu=on"

3. Update grub and reboot to get the changes to take effect:


# update-grub
# reboot

and repeat the check. For new environments, you may want to add these kernel parameters before
deploying so that they are applied to all nodes of the environment. (You can do that from the Fuel
interface in the Kernel Parameters section of the Settings tab.)
NOTE: If you have an AMD motherboard, you need to check for AMD-Vi in the output of the dmesg
command and pass the options iommu=pt iommu=1 to the kernel (but we haven't yet tested that).
4. Enable the number of virtual functions required on the SR-IOV interface. Do not set the number of
VFs to more than required, since this might degrade performance. Depending on kernel and NIC
driver version you might get more queues on each PF with fewer VFs (usually, fewer than 32).
First, enable the interface:
# ip link set eth1 up

Next, from the command-line, get the maximum number of functions that could potentially be
enabled for your NIC:

# cat /sys/class/net/eth1/device/sriov_totalvfs

Then enable the desired number of virtual functions for your NIC:
# echo 30 > /sys/class/net/eth1/device/sriov_numvfs

NOTE: To change the number to some other value afterwards you need to execute the following command
first:
# echo 0 > /sys/class/net/eth1/device/sriov_numvfs

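NOTE: These settings aren't saved across reboots. To save them, add them to /etc/rc.local. The commands below
append to the end of the file; if your rc.local ends with exit 0, move the new lines above it.
# echo "ip link set eth1 up" >> /etc/rc.local
# echo "echo 30 > /sys/class/net/eth1/device/sriov_numvfs" >> /etc/rc.local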

NOTE: By default, Mirantis OpenStack 7.0 supports a maximum of 30 VFs. If you need more than 30, you
will need to install a newer version of libnl3, like so:
# wget https://launchpad.net/ubuntu/+archive/primary/+files/libnl-3-200_3.2.24-2_amd64.deb
# wget https://launchpad.net/ubuntu/+archive/primary/+files/libnl-genl-3-200_3.2.24-2_amd64.deb
# wget https://launchpad.net/ubuntu/+archive/primary/+files/libnl-route-3-200_3.2.24-2_amd64.deb
# dpkg -i libnl-3-200_3.2.24-2_amd64.deb
# dpkg -i libnl-genl-3-200_3.2.24-2_amd64.deb
# dpkg -i libnl-route-3-200_3.2.24-2_amd64.deb

Then restart libvirtd:
# service libvirtd restart

5. Check to make sure that SR-IOV is enabled:


# ip link show eth1 |grep vf
vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto
vf 1 MAC c2:cd:57:9b:6c:7d, spoof checking on, link-state auto

If you don't see link-state auto in the output, then your installation will require an SR-IOV agent. You
can enable it like so:

# apt-get install neutron-plugin-sriov-agent


# nohup neutron-sriov-nic-agent --debug --log-file /tmp/sriov_agent --config-file \
/etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf_sriov.ini &

6. Edit /etc/nova/nova.conf to add the NIC to the pci_passthrough_whitelist parameter:
pci_passthrough_whitelist={"devname": "eth1", "physical_network":"physnet2"}

7. Edit /etc/neutron/plugins/ml2/ml2_conf_sriov.ini to add the mapping:
[sriov_nic]
physical_device_mappings = physnet2:eth1

8. Restart the compute service:


# restart nova-compute

9. Get the vendor and product ID of the virtual functions; you'll need it to configure SR-IOV on the controller nodes.
# lspci -nn | grep -e "Ethernet.*Virtual"
06:10.1 Ethernet controller [0200]: Intel Corporation 82599 Ethernet Controller Virtual Function [8086:10ed] (rev 01)
06:10.3 Ethernet controller [0200]: Intel Corporation 82599 Ethernet Controller Virtual Function [8086:10ed] (rev 01)

(This is just an example of the output. The actual value may differ on your hardware.)
Write down the vendor and product ID (the value in the last pair of square brackets, 8086:10ed in this case).
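
The VF setup from step 4 (check the maximum, reset to 0, then write the new count) can be wrapped in one function so that a too-large value is rejected before anything is written. This is only a sketch: the function name set_numvfs and the SYSFS_NET variable are hypothetical, and SYSFS_NET exists solely so the function can be tried against a scratch directory instead of the real /sys/class/net.

```shell
# Hypothetical helper: set the number of VFs on an interface, refusing values
# above the maximum reported in sriov_totalvfs.
# Usage: set_numvfs <interface> <count>
set_numvfs() {
    dev_dir="${SYSFS_NET:-/sys/class/net}/$1/device"
    want=$2
    total=$(cat "$dev_dir/sriov_totalvfs") || return 1
    if [ "$want" -gt "$total" ]; then
        echo "requested $want VFs, but $1 supports at most $total" >&2
        return 1
    fi
    # The count has to be reset to 0 before it can be changed to another value.
    echo 0 > "$dev_dir/sriov_numvfs"
    echo "$want" > "$dev_dir/sriov_numvfs"
}
```

On a compute node this would be run as root, for example set_numvfs eth1 30, after the interface has been brought up.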

2.4.1.2 Configure SR-IOV on the Controller nodes
Next you'll need to configure SR-IOV on the controller nodes, as follows:
1. Edit /etc/neutron/plugins/ml2/ml2_conf.ini to add sriovnicswitch to the list of mechanism drivers:
mechanism_drivers = openvswitch,l2population,sriovnicswitch

2. Add a new section at the end of the file:
[ml2_sriov]
supported_pci_vendor_devs = 8086:10ed

Use the vendor and product ID from step 9 of section 2.4.1.1 as the value for supported_pci_vendor_devs.
3. Add PciPassthroughFilter and AggregateInstanceExtraSpecsFilter to the end of the list of scheduler filters in
/etc/nova/nova.conf, keeping the filters that are already configured (shown as ... below):
scheduler_default_filters=DifferentHostFilter,RetryFilter,...,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,...,PciPassthroughFilter,AggregateInstanceExtraSpecsFilter

4. Restart services so the changes will take effect:


# restart neutron-server
# restart nova-scheduler

2.4.1.3 Using SR-IOV


Now you're ready to actually use SR-IOV. As in the previous sections, this is mostly a matter of making sure
workloads run where they're supposed to. You can ensure that as follows:
1. Create separate host aggregates for SR-IOV-enabled computes and non-SR-IOV-enabled computes.
# nova aggregate-create sriov
# nova aggregate-set-metadata sriov sriov=true
# nova aggregate-create normal
# nova aggregate-set-metadata normal sriov=false

2. Add one or more hosts to them:


# nova aggregate-add-host sriov node-9.domain.tld
# nova aggregate-add-host normal node-10.domain.tld

3. Create a new flavor for VMs that require SR-IOV support:


# nova flavor-create m1.small.sriov auto 2048 20 2
# nova flavor-key m1.small.sriov set aggregate_instance_extra_specs:sriov=true

4. Update all other flavors so they will start only on hosts without SR-IOV support:
# openstack flavor list -f csv|grep -v sriov|cut -f1 -d,| tail -n +2 | \
xargs -I% -n 1 nova flavor-key % set aggregate_instance_extra_specs:sriov=false

5. To use the SR-IOV port, you need to create an instance with ports that use the vnic-type direct. For
now, you have to do this via the command line. Create a port for the instance:
# neutron port-create net04 --binding:vnic-type direct --device_owner nova-compute --name sriov-port1

6. Because the default Cirros image does not include the Intel NIC drivers, we'll use an Ubuntu
cloud image to test SR-IOV. See section 3.2 for instructions on how to prepare it.
7. Spawn the instance.
# port_id=`neutron port-list | awk '/sriov-port1/ {print $2}'`
# nova boot --flavor m1.small.sriov --image trusty --key_name key1 --nic port-id=$port_id sriov-vm1

8. Get the instance's IP address:


# nova list | awk '/sriov-vm1/ {print $12}'
net04=192.168.111.5

9. Connect to the instance to make sure everything is up and running by first finding the controllers
with a namespace that has access to the instance ...
# neutron dhcp-agent-list-hosting-net -f csv -c host net04 --quote none | tail -n+2
node-7.domain.tld
node-9.domain.tld

... and then connecting to the instance from one of those controllers.
# ip netns exec `ip netns show|grep qdhcp-$(neutron net-list | awk '/net04 / {print$2}')` ssh -i key1.pem ubuntu@192.168.111.5

A note about I/O aware NUMA scheduling

Be aware that even if your compute host has multiple CPUs, only one of the CPUs has direct access to a
particular I/O card. When a process running on one CPU accesses an I/O card that is local to the other
CPU, you'll see a performance penalty. See more on this topic at
https://specs.openstack.org/openstack/nova-specs/specs/kilo/implemented/input-output-based-numa-scheduling.html
When you use CPU pinning (instances with the hw:cpu_policy=dedicated extra specification) and SR-IOV at the
same time, an SR-IOV-enabled instance will start only on a CPU that has direct access to the SR-IOV card. When
there are not enough resources in this CPU's NUMA cell, the instance will fail to start.

2.4.1.4 Troubleshooting

If you see the following error in /var/log/nova/nova-compute.log on the compute host:
libvirtError: internal error: missing IFLA_VF_INFO in netlink response
you should install a newer version of libnl3, as shown in step 4 of section 2.4.1.1.

If you see:
libvirtError: unsupported configuration: host doesn't support passthrough of host PCI devices
it means that VT-d is not supported or not enabled.

If you see:
NovaException: Unexpected vif_type=binding_failed
you should enable the SR-IOV agent or, if you've already done so, make sure that it's running:
# neutron agent-list | grep sriov-nic-agent
| dfa4edcf-63c1-4af7-a291-ec139a16f346 | NIC Switch agent | node-16.domain.tld | :-) | True | neutron-sriov-nic-agent |

Otherwise, examine the log file /tmp/sriov_agent for clues to what else might be wrong.

2.5 Anti-affinity groups


Affinity and anti-affinity groups of instances enable you to make sure that some instances will or will not share
the same hypervisor. In the case of anti-affinity, this enables you to implement workload segregation, or even to
spread a single workload across a number of hypervisors, thereby minimizing the effect of hypervisor failures on
important workloads. In other words, it embodies the principle: don't put all your eggs in one basket. (It also
enables you to make sure that workloads don't compete for resources unnecessarily.)
The default nova configuration in MOS 7.0 supports (anti-)affinity groups.

2.5.1 Using Anti-affinity groups


Even though anti-affinity is supported by default in MOS 7.0, you will still need to follow these steps to use
it:
1. Create a new anti-affinity server group:
# nova server-group-create cluster1 anti-affinity

2. Launch one or more instances in this server group (<instance-name> is a placeholder for a name of your choice):
# nova boot --num-instances <N> --image TestVM --nic net-id=`openstack network show net04 -f value | head -n1` --hint group=`nova server-group-list | awk '/cluster1/ {print $2}'` --flavor m1.micro <instance-name>

Every new instance will start on a different hypervisor; no two instances in this group will share
the same hypervisor. If N exceeds the number of available hypervisors, the excess instances will fail to start
with the error "No valid host was found".

2.5.2 High Availability Instances with Neutron


High availability means more than just making sure your instances are healthy; they must also be reachable.
To ensure that, you must eliminate the IP address as a single point of failure.

2.5.2.1 One IP address, multiple VMs
The security model in Neutron allows VMs to use only IP addresses allocated to them via IPAM, and by
default every IP address may be associated with a single VM only. Most of the time, high availability
approaches such as VRRP, HSRP, and so on require a single IP address (called a virtual IP) to be shared by
multiple VMs. In order to assign the same IP address to multiple VMs the Allowed-Address-Pairs Neutron
extension, which has been implemented in OpenStack releases since Havana, must be used.
Note: Some flavors of these protocols actually move the MAC address rather than replacing the MAC/IP
association, which is not compatible with SR-IOV without taking additional steps; in this paper we are assuming
you're dealing with the compatible varieties.

2.5.2.2 Using the Allowed-Address-Pairs extension
To use the Allowed-Address-Pairs extension, perform the following steps:
1. We will use the keepalived daemon to test high availability in this example. The default Cirros image
shipped with MOS 7.0 does not include keepalived, so we'll use an Ubuntu cloud image. See section
3.2 for instructions on how to prepare it.
2. Create a new security group for your instances:
# neutron security-group-create ha-sec-group

3. Add the necessary rules to the group. In this case, you want ICMP, HTTP and SSH access to instances:
# neutron security-group-rule-create --protocol icmp ha-sec-group
# neutron security-group-rule-create --protocol tcp --port-range-min 80 --port-range-max 80 ha-sec-group
# neutron security-group-rule-create --protocol tcp --port-range-min 22 --port-range-max 22 ha-sec-group

4. Also allow VRRP traffic (IP protocol 112) between the instances in this group. VRRP is used by keepalived
to monitor instance availability:
# neutron security-group-rule-create --protocol 112 ha-sec-group --remote-group-id ha-sec-group

5. Launch two new instances:

# nova boot --num-instances 2 --image trusty --key_name key1 --flavor m1.small --nic net-id=`neutron net-list | awk '/net04 / {print$2}'` --security_groups ha-sec-group ha-node

6. Get neutron port ids and IP addresses of the instances. You'll need them in the steps below:
# node1_port=`nova interface-list ha-node-1 | cut -d'|' -f 3 | grep '\w\+\(-\w\+\)\+'`
# node2_port=`nova interface-list ha-node-2 | cut -d'|' -f 3 | grep '\w\+\(-\w\+\)\+'`
# node1_ip=`openstack server show ha-node-1 -f value -c addresses | cut -d= -f2`
# node2_ip=`openstack server show ha-node-2 -f value -c addresses | cut -d= -f2`

7. Create a new neutron port with an arbitrary free IP address on the private network. This IP address
will be shared between the instances.
# ha_port=`neutron port-create --fixed-ip ip_address=192.168.111.200 --security-group ha-sec-group -c id -f value net04 | tail -1`

8. Allow the instances' ports to send and receive traffic using this IP address by assigning the
allowed_address_pairs attribute to them:
# neutron port-update $node1_port --allowed_address_pairs list=true type=dict ip_address=192.168.111.200
# neutron port-update $node2_port --allowed_address_pairs list=true type=dict ip_address=192.168.111.200

9. Verify the assignment:


# neutron port-show $node1_port -c allowed_address_pairs -f value
# neutron port-show $node2_port -c allowed_address_pairs -f value

The output should contain lines similar to:
{"ip_address": "192.168.111.200", "mac_address": "fa:16:3e:ce:32:f3"}
{"ip_address": "192.168.111.200", "mac_address": "fa:16:3e:18:b6:4f"}

10. Associate a floating IP with the port:


# floating_ip=`neutron floatingip-create --port-id=$ha_port net04_ext -c floating_ip_address -f value | tail -1`

11. Configure the first node. Start by logging into it:


# ip netns exec `ip netns show|grep qdhcp-$(neutron net-list | awk '/net04 / {print$2}')` ssh -i key1.pem ubuntu@$node1_ip

Install the required packages:


$ sudo apt-get install -y keepalived apache2

Create the configuration file for keepalived:


$ sudo bash -c 'echo "vrrp_instance vrrp_group_1 {
state MASTER
interface eth0
virtual_router_id 1
priority 100
authentication {
auth_type PASS
auth_pass password
}
virtual_ipaddress {
192.168.111.200/24 brd 192.168.111.255 dev eth0
}
}" > /etc/keepalived/keepalived.conf'

Restart keepalived:
$ sudo service keepalived restart

And check that it has started:
$ grep Keepalived /var/log/syslog
Nov 16 10:39:16 ha-node-1 Keepalived_vrrp[2717]: VRRP_Instance(vrrp_group_1) Transition to MASTER STATE
Nov 16 10:39:17 ha-node-1 Keepalived_vrrp[2717]: VRRP_Instance(vrrp_group_1) Entering MASTER STATE

Change the default apache index file. This will help you to determine which node answers your requests:
$ sudo bash -c 'echo "ha-node1" > /var/www/html/index.html'

12. Configure the second node. The steps are largely the same as for the first node. Start by logging in:
# ip netns exec `ip netns show|grep qdhcp-$(neutron net-list | awk '/net04 / {print$2}')` ssh -i key1.pem ubuntu@$node2_ip

Install the required packages:


$ sudo apt-get install -y keepalived apache2

Create the configuration file for keepalived:
$ sudo bash -c 'echo "vrrp_instance vrrp_group_1 {
state BACKUP
interface eth0
virtual_router_id 1
priority 50
authentication {
auth_type PASS
auth_pass password
}
virtual_ipaddress {
192.168.111.200/24 brd 192.168.111.255 dev eth0
}
}" > /etc/keepalived/keepalived.conf'

Restart keepalived:
$ sudo service keepalived restart

And check that it has started:
$ grep Keepalived /var/log/syslog
Nov 16 11:03:05 ha-node-2 Keepalived_vrrp[2758]: VRRP_Instance(vrrp_group_1) Entering BACKUP STATE

Change the default apache index file. This will help you to determine which node answers your
requests:
$ sudo bash -c 'echo "ha-node2" > /var/www/html/index.html'

13. Now you can test your HA configuration. Start by switching off the first node's port:
# neutron port-update $node1_port --admin_state_up=False

Next, try to access the site. The answer should come from the second node:
# ip netns exec `ip netns show|grep qdhcp-$(neutron net-list | awk '/net04 / {print$2}')` curl 192.168.111.200
ha-node2

Now switch the first port on and the second port off:
# neutron port-update $node1_port --admin_state_up=True
# neutron port-update $node2_port --admin_state_up=False

Try to access the site. The first node should answer:


# ip netns exec `ip netns show|grep qdhcp-$(neutron net-list | awk '/net04 / {print$2}')` curl 192.168.111.200
ha-node1
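
The two keepalived configuration files in steps 11 and 12 differ only in the state and priority values, so they can be produced from a single template. The following is a sketch under that assumption; the function name render_keepalived_conf is hypothetical, and the interface, virtual IP and password are the example values used in this section.

```shell
# Hypothetical helper: print the keepalived VRRP configuration used in this
# section. Arguments: state (MASTER or BACKUP) and priority.
render_keepalived_conf() {
    cat <<EOF
vrrp_instance vrrp_group_1 {
    state $1
    interface eth0
    virtual_router_id 1
    priority $2
    authentication {
        auth_type PASS
        auth_pass password
    }
    virtual_ipaddress {
        192.168.111.200/24 brd 192.168.111.255 dev eth0
    }
}
EOF
}
```

On the first node this could be used as render_keepalived_conf MASTER 100 | sudo tee /etc/keepalived/keepalived.conf, and on the second node with BACKUP 50.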

3 APPENDIX
3.1 Installing qemu 2.4
MOS 7.0 comes with QEMU 2.0, but huge pages and NUMA functionality require version 2.1 or above. To
solve this problem, we've created prebuilt packages of QEMU 2.4 for MOS 7.0. Follow these instructions to
install them.
DISCLAIMER
Please note that at the time of this writing, the QEMU packages provided below are typically supported by
Mirantis Professional services, and are not officially supported by Mirantis support as part of Mirantis
OpenStack. They have been provided here for your convenience; please contact your account representative
before performing this particular NFV tuning.

1. Add the new QEMU packages to the repository:

Log in to the Fuel master node and download the package archive to it:
# cd ~
# wget https://e96c0e4ba07a0e7625a5-af6e1daef8517c76ae97144caadbd641.ssl.cf5.rackcdn.com/q/e/qemu-2.4-mos7.tar

Make sure the checksum is correct:

# sha512sum qemu-2.4-mos7.tar
0edf6178f03f1cc515eb4d3418092edf06dfe28550a957e2ae7679934b5f615024a155facbf41157ebc2f418629b02e5c6fe0fcc4f0256594122fe5adff99ef8  qemu-2.4-mos7.tar

Add the files from the archive to the repository:
# cd /var/www/nailgun/2015.1.0-7.0/ubuntu/auxiliary/
# tar -xvf ~/qemu-2.4-mos7.tar -C dists/auxiliary/main/binary-amd64/
# dpkg-scanpackages dists/auxiliary/main/binary-amd64/ > dists/auxiliary/main/binary-amd64/Packages

2. Prepare the compute nodes. On every compute host where you want huge pages or CPU pinning to be
enabled, do the following:
Upgrade the packages:
# apt-get update
# apt-get install qemu-kvm qemu-utils qemu-system-x86

Due to a bug in libvirt 1.2.9 (see http://libvirt.org/git/?p=libvirt.git;a=commit;h=f5059a92), you then
need to remove the libvirt capabilities cache:
# rm -rf /var/cache/libvirt/qemu/capabilities/*

Finally, restart libvirtd:
# service libvirtd restart
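
The checksum comparison in step 1 can also be done by the shell rather than by eye. This is a minimal sketch; the function name verify_sha512 is hypothetical, and the expected digest has to be supplied by you (for the archive above, the value printed in step 1).

```shell
# Hypothetical helper: succeed only if the SHA-512 digest of file $1
# equals the expected hex digest $2.
verify_sha512() {
    actual=$(sha512sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}
```

For example, verify_sha512 qemu-2.4-mos7.tar <expected-digest> && echo OK prints OK only when the archive matches.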

3.2 Preparing the Ubuntu cloud image


Some examples in this document require packages that are not available on the default Cirros image.
Instead, you can use an Ubuntu image, as follows:
1. Download the Ubuntu Trusty cloud image and upload it to Glance:
# glance image-create --name trusty --disk-format raw --container-format bare --is-public True --location https://cloud-images.ubuntu.com/trusty/current/trusty-server-cloudimg-amd64-disk1.img

2. You will only be able to log into this instance by using an SSH key, so create a keypair:
# nova keypair-add key1 > key1.pem
# chmod 600 key1.pem

3. Add a rule to allow SSH access to instances. Because you're adding it to the default security group, it
will apply to all VMs created:
# nova secgroup-add-rule default tcp 22 22 0.0.0.0/0

4. To test the Ubuntu Cloud image, run an instance with this keypair injected:
# nova boot --image trusty --key_name key1 --flavor m1.small --nic net-id=<net-uuid> ubuntu-trusty

5. You should now be able to access the instance via SSH. To do that, assign a floating IP to the instance
(via Horizon or the CLI) and try to ssh to it:
# ssh -i key1.pem ubuntu@<floating-ip>

4 RESOURCES

Download Mirantis OpenStack 7.0 .iso


Mirantis OpenStack Documentation (7.0)
Mirantis Reference Architectures (for HA, Neutron-network, Ceph)
OpenStack Community Documentation
OpenStack Architecture Design Guide
OpenStack Scaling
