NFVI Deployment Guide
White Paper
January 2016
TABLE OF CONTENTS
1 EXECUTIVE SUMMARY
1.1 Intended Audience/Prerequisites
1.2 Network Functions Virtualization
1.3 Lab configuration
1.3.1 Hardware
1.3.2 MOS Configuration
1.3.3 Nodes
1.3.4 Networking topology
1.4 Testing approach
2 NFVI FEATURES
2.1 Guarantee resource allocation to NFV workloads
2.1.1 Guaranteed memory allocation
2.1.2 Guaranteed CPU allocation
2.2 Huge pages
2.2.1 General recommendations on using huge pages on OpenStack
2.2.2 Huge pages and physical topology
2.2.3 Enabling huge pages on MOS 7.0
2.2.4 Compute hosts configuration
2.2.4.1 Nova configuration
2.2.4.2 Using huge pages on MOS 7.0
Some useful commands
2.3 NUMA/CPU pinning
2.3.1 Compute hosts configuration
2.3.2 Nova configuration
2.3.3 Using CPU pinning
2.3.3.1 Troubleshooting
2.4 SR-IOV
2.4.1 Enabling SR-IOV
2.4.1.1 Configure SR-IOV on the Compute nodes
2.4.1.2 Configure SR-IOV on the Controller nodes
2.4.1.3 Using SR-IOV
2.4.1.4 Troubleshooting
www.mirantis.com
1 EXECUTIVE SUMMARY
To guarantee a VNF's availability and performance, it must have access to appropriate resources
within the NFVI (in this case, the OpenStack cloud). These resources include RAM and CPU,
which should be allocated specifically to the VNF, or to the NFV solution as a whole.
SR-IOV should be enabled in the cloud to gain greater performance for VNFs.
VNFs can be extremely I/O intensive, so Mirantis OpenStack should also have a storage plane with an
appropriate disk I/O level to meet a VNF's specific requirements. These I/O parameters must be set
appropriately.
Networking, as you might imagine, also has stringent requirements. High Availability (HA) requires
that IP management be very flexible, and that VMs of the same type are scheduled to different
physical nodes in case of hardware failure. So-called provider networks should also give VMs direct
access to external resources.
1.3.1 Hardware
Configuration and testing were performed on the following hardware:
1.3.2 MOS Configuration
1.3.3 Nodes
Fuel Master node (VM inside Host1): 1 CPU, 4 GB RAM, 2 NICs
3 Controllers/Mongo (VMs inside Host1): 4 CPUs, 16 GB RAM, 4 NICs
4 Computes: Hosts 2-5
2 Ceph-OSD: Hosts 6-7
2 NFVI FEATURES
An NFV Infrastructure provides an environment in which VNFs can run, but that's really the bare minimum of
what it does. An NFVI should also provide features that enable VNFs to run well. Some of the more
important features include:
Guaranteed resources for workloads: While cloud computing enables shared resources, the inevitable
contention can be disastrous in an NFV environment.
Huge Pages: Traditional memory structures can result in performance issues as memory lookups take
longer than they should.
NUMA/CPU pinning: Another disadvantage of cloud is that extra effort needs to be taken to make
sure that memory and CPU are local to each other.
SR-IOV: While virtualization makes cloud and NFV possible, sometimes there's just too much of a
good thing, and performance and reliability benefit from pushing functions down off the hypervisor
and back onto the hardware.
Anti-affinity groups: As a cloud management system schedules workloads, it's often necessary to
spread them among different physical nodes to prevent resource contention.
Let's take a look at each of these features.
Two Nova parameters control guaranteed memory allocation: ram_allocation_ratio and
reserved_host_memory_mb.
By default, they are already set to optimal values in MOS 7.0, but it's helpful to understand them.
The ram_allocation_ratio parameter determines the virtual-RAM-to-physical-RAM allocation ratio. It
affects the Nova scheduler and should be set in /etc/nova/nova.conf on the controller hosts. In MOS 7.0 this parameter
is set to 1, so the hypervisor will not allow you to overcommit memory. That means you won't be able to
start a VM with more memory than is available on the compute host:

ram_allocation_ratio=1.0
The reserved_host_memory_mb parameter determines how much memory on the compute node should be
reserved, effectively disabling usage of this memory by virtual machines. It should be set in
/etc/nova/nova.conf on the compute hosts. By default, in MOS 7.0 each compute host has 512Mb reserved
for the operating system and applications other than virtual machines:
reserved_host_memory_mb=512
You may want to increase this value if your compute hosts have additional workloads. For example, if a
compute node has an additional Ceph/OSD role, reserving 1GB of RAM per OSD daemon instance as a
minimum is recommended. In general, we recommend using dedicated compute nodes.
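To see how these two parameters interact, here is a small back-of-the-envelope sketch. The host size is an illustrative assumption, not a value from the lab configuration:

```shell
# Memory available for VM allocation on a compute node, given the MOS 7.0
# defaults ram_allocation_ratio=1.0 and reserved_host_memory_mb=512.
total_mb=65536      # assumed physical RAM of the host, in MB
reserved_mb=512     # reserved_host_memory_mb
ratio=1             # ram_allocation_ratio (1.0; kept integer for shell arithmetic)
vm_mb=$(( (total_mb - reserved_mb) * ratio ))
echo "Memory available for VMs: ${vm_mb} MB"
```

With the default ratio of 1.0, the scheduler will refuse to place instances whose combined memory exceeds this value.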
Note: You can have a per-aggregate ram_allocation_ratio with AggregateRamFilter. The process is the same as
setting per-aggregate CPU allocation, described below.
2.1.2 Guaranteed CPU allocation
Similarly to guaranteed memory allocation, it's often desirable to guarantee that the VM has access to
specified CPU resources to avoid latency issues. The cpu_allocation_ratio parameter determines the virtual
CPU to physical CPU allocation ratio, i.e. how many virtual CPUs will be available on the compute node for
each physical one. By default, cpu_allocation_ratio is set to 8 in MOS 7.0. To ensure as little
contention as possible for the CPU resources on the host machine, set the cpu_allocation_ratio to 1.0. In
/etc/nova/nova.conf on all controller hosts, change the line cpu_allocation_ratio=8.0 to:

cpu_allocation_ratio=1.0
2) From the command line, create an aggregate with guaranteed CPU allocation:
nova aggregate-create guaranteed_cpu nova
4) Create a new flavor for the VM that requires guaranteed CPU allocation. Here is an example in
which we create a flavor that requires 12 vCPUs:

nova flavor-create m1.guaranteed_cpu auto 2048 20 12
5) For good measure, you should update all other flavors so they will start only on hosts not
belonging to the guaranteed_cpu aggregate:
openstack flavor list -f csv | cut -f1 -d, | tail -n +2| \
xargs -I% -n 1 nova flavor-key % \
set aggregate_instance_extra_specs:g_cpu=false
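Steps 1 and 3 are not shown above; the complete flow, assuming the g_cpu metadata key used in step 5 and an illustrative host name, would look roughly like this:

```
# nova aggregate-create guaranteed_cpu nova
# nova aggregate-set-metadata guaranteed_cpu g_cpu=true
# nova aggregate-add-host guaranteed_cpu node-4.domain.tld
# nova flavor-key m1.guaranteed_cpu set aggregate_instance_extra_specs:g_cpu=true
```

With AggregateInstanceExtraSpecsFilter enabled, instances of this flavor will then only be scheduled to hosts in the guaranteed_cpu aggregate.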
Pages larger than the default size are referred to as "huge pages" or "large pages" (the terms are frequently
capitalized). We'll call them "huge pages" in this document.
Processes work with virtual memory addresses. Each time a process accesses memory, the kernel translates the
desired virtual memory address to a physical one by looking at a special memory area called the page table,
where virtual-to-physical mappings are stored. The hardware cache on the CPU is used to speed up lookups.
This cache is called the translation lookaside buffer (TLB).
The TLB typically can store only a small fraction of the virtual-to-physical page mappings. By increasing the memory
page size, we reduce the total number of pages that need to be addressed, thus increasing the TLB hit rate. This
can lead to significant performance gains when a process performs many memory operations. Also, the page
table may require a significant amount of memory in cases where it needs to store many references to small
memory pages. In extreme cases, memory savings from using huge pages may amount to several gigabytes.
(For example, see
http://kevinclosson.net/2009/07/28/quantifying-hugepages-memory-savings-with-oracle-database-11g.)
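The arithmetic behind this claim is simple. The sketch below (figures are illustrative, not from the paper) counts the page-table entries needed to map 64 GB of RAM with 4 kB pages versus 2 MB huge pages:

```shell
# Page-table entries needed to map 64 GB of RAM with small vs. huge pages.
mem_kb=$(( 64 * 1024 * 1024 ))    # 64 GB expressed in kB
pages_4k=$(( mem_kb / 4 ))        # number of 4 kB pages
pages_2m=$(( mem_kb / 2048 ))     # number of 2 MB huge pages
echo "4kB pages needed: ${pages_4k}"
echo "2MB pages needed: ${pages_2m}"
```

At roughly 8 bytes per entry, the 4 kB case costs on the order of 128 MB of page-table memory for a full mapping; multiplied across many processes, this is where multi-gigabyte savings can come from.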
On the other hand, when the page size is large but a process doesn't use all the page memory, unused
memory is effectively lost, as it cannot be used by other processes. So there is usually a tradeoff between
performance and more efficient memory utilization.
In the case of virtualization, a second level of page translation (between the hypervisor and host OS) causes
additional overhead. Using huge pages on the host OS lets us greatly reduce this overhead.
It's preferable to give a virtual machine with NFV workloads exclusive access to a predetermined amount of
memory. No other process can use that memory anyway, so there is no tradeoff in using huge pages.
Huge pages are thus the natural option for NFV workloads.
For more information on page tables and the translation process, see
https://en.wikipedia.org/wiki/Page_table.

There are two ways for an application to use huge pages:
Explicit - an application is enabled to use huge pages by changing its source code
Implicit - via automatic aggregation of default-sized pages to huge pages by the transparent huge pages
(THP) mechanism in the kernel
THP is turned on by default in MOS 7.0, but explicit huge pages potentially provide more performance
gains if an application supports them.
Although we tend to think of the hypervisor as KVM, KVM is really just the kernel module; the actual
hypervisor is QEMU. That means that QEMU performance is crucial for NFV. Fortunately, it supports explicit
usage of huge pages via hugetlbfs, so we don't really need THP here. Moreover, THP can lead to
side effects with unpredictable results, sometimes lowering performance instead of raising it.
Also be aware that when the kernel needs to swap out a THP, the aggregate huge page is first split into standard
4k pages. Explicit huge pages are never swapped to disk; this is perfectly fine for typical NFV workloads.
In general, huge pages can be reserved at boot or at runtime (though 1GB huge pages can only be allocated
at boot). Memory generally gets fragmented on a running system, and the kernel may not be able to reserve
as many contiguous memory blocks at runtime as it can at boot.
For general NFV workloads we recommend using dedicated compute nodes with the major part of their
memory reserved as explicit huge pages at boot time. NFV workload instances should be configured to use
huge pages. We also recommend disabling THP on these compute nodes. As for preferred huge page sizes:
the choice depends on the needs of specific workloads. Generally, 1Gb can be slightly faster, but 2Mb huge
pages provide more granularity.
For more information on explicit huge pages, see:
General introduction:
https://lwn.net/Articles/423584/
Articles on THP performance impact:
https://blogs.oracle.com/linuxkernel/entry/performance_impact_of_transparent_huge,
https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
https://en.wikipedia.org/wiki/Second_Level_Address_Translation
http://developerblog.redhat.com/2014/03/10/examining-huge-pages-or-transparent-huge-pages-performance/
As a rule, some amount of memory is reserved in the lower range of the memory address space. This memory is
used for memory-mapped I/O, and it is usually reserved on the first NUMA cell (corresponding to the first
CPU) before huge pages are allocated. When allocating huge pages, the kernel tries to spread them
evenly across all NUMA cells. If there's not enough contiguous memory in one of the NUMA cells, the kernel
will try to compensate by allocating more memory on the remaining cells. When the amount of memory used
by huge pages is close to the total amount of free memory, you end up with uneven huge page distributions
across NUMA cells. This is more likely to happen when using 1Gb pages.
Here is an example from a host with 64 gigabytes of memory and two CPUs:
# grep "Memory.*reserved" /var/log/dmesg
[    0.000000] Memory: 65843012K/67001792K available (7396K kernel code, 1146K rwdata, 3416K rodata, 1336K init, 1448K bss, 1158780K reserved)
This can have negative consequences. For example, suppose we use a VM flavor that requires 30Gb of memory
in one NUMA cell (or 60Gb in two). One might think that the number of huge pages on this host is enough to
run two instances with 30Gb of memory each, or one two-cell instance with 60Gb. In reality, only one 30Gb
instance will start: the second will be one 1Gb page short. A 60Gb two-cell instance will fail to start
altogether, because Nova will look for a physical host with two NUMA cells of 30Gb each, and one of the
cells has insufficient memory.
You may want to use an option such as 'Socket Interleave Below 4GB' (or similar), if your BIOS supports it, to
avoid this situation. This option maps the lower address space evenly between the NUMA cells, in effect
splitting reserved memory between the NUMA nodes.
In conclusion, you should always test to verify the real allocation of huge pages and plan accordingly, based
on the results.
The pse and pdpe1gb flags in the output indicate that the hardware supports standard (2 or 4 megabytes,
depending on hardware architecture) or 1Gb huge pages.
2. Upgrade QEMU to 2.4 to use huge pages (see Appendix A1, Installing qemu 2.4).
3. Add huge pages allocation parameters to the list of kernel arguments in /etc/default/grub. Note that
we are also disabling Transparent Huge Pages in the examples below because we're using explicit
huge pages to prevent swapping.
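An example of the kind of kernel arguments meant here; the page count is an assumption that must be sized to your host (it matches the 30000 2MB pages used in the verification output later in this section):

```shell
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX hugepagesz=2M hugepages=30000 transparent_hugepage=never"
```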
Caution: be careful when deciding on the number of huge pages to reserve. You should leave enough
memory for host OS processes (including memory for Ceph processes if your compute shares the
Ceph OSD role) or risk unpredictable results.
Note: You can't allocate different amounts of memory to each NUMA cell via kernel parameters. If you need
to do so, you have to use the command line or startup scripts. Here is an example in which we allocate 10 1Gb-sized
pages on the first NUMA cell and 30 on the second one:
echo 10 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 30 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
5. Update the bootloader and reboot for these parameters to take effect:
# update-grub
# reboot
6. After rebooting, don't forget to verify that the pages are reserved according to the settings specified:
# grep Huge /proc/meminfo
AnonHugePages:         0 kB
HugePages_Total:   30000
HugePages_Free:    30000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

# grep . /sys/devices/system/node/node*/hugepages/hugepages*kB/nr_hugepages
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:15000
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:15000
4. Update all other flavors so they will start only on hosts without huge pages support:
# openstack flavor list -f csv|grep -v hpgs|cut -f1 -d,| tail -n +2| \
xargs -I% -n 1 nova flavor-key % \
set aggregate_instance_extra_specs:hpgs=false
+--------------------------------------+-----------+--------+------------+-------------+----------------------+
| 593d461e-3ef2-46cc-a88d-5f147eb2a14e | hpgs-test | ACTIVE | -          | Running     | net04=192.168.111.15 |
+--------------------------------------+-----------+--------+------------+-------------+----------------------+
If the status is ERROR, check the log files for lines containing this instance ID. The easiest way to do
that is to run the following command on the Fuel Master node:
# grep -Ri <Instance ID> /var/log/docker-logs/remote/node-*
If you encounter the error:
libvirtError: internal error: process exited while connecting to monitor: os_mem_prealloc: failed to
preallocate pages
it means there is not enough free memory available inside one NUMA cell to satisfy the instance's
requirements. Check that the VM's NUMA topology fits inside the host's.
This error:
libvirtError: unsupported configuration: Per-node memory binding is not supported with this QEMU
means that you are using QEMU 2.0 packages. You need to upgrade QEMU to 2.4; see Appendix A1
for instructions on how to upgrade the QEMU packages.
<memoryBacking>
<hugepages>
<page size='2048' unit='KiB' nodeset='0'/>
</hugepages>
</memoryBacking>
<vcpu placement='static'>2</vcpu>
<cputune>
<shares>2048</shares>
<vcpupin vcpu='0' cpuset='0-5,12-17'/>
<vcpupin vcpu='1' cpuset='0-5,12-17'/>
<emulatorpin cpuset='0-5,12-17'/>
</cputune>
<numatune>
<memory mode='strict' nodeset='0'/>
<memnode cellid='0' mode='strict' nodeset='0'/>
</numatune>
The memoryBacking section should show that this instance's memory is backed by huge pages. You
may also see that the cputune section reveals so-called "pinning" of this instance's vCPUs. This
means the instance will only run on physical CPU cores that have direct access to this instance's
memory; this comes as a bonus from the hypervisor's awareness of the host's physical topology. We will
discuss instance CPU pinning in the next section.
You may also look at the QEMU process arguments and make sure they contain relevant options, such
as:
# ssh $hypervisor pgrep -af $instance | grep -Po "memory[^\s]+"
memory-backend-file,prealloc=yes,mem-path=/run/hugepages/kvm/libvirt/qemu,size=2000M,id=ram-node0,host-nodes
=0,policy=bind
We can see that the instance uses 1000 huge pages (since this flavor's memory is 2Gb and we are
using 2048Kb huge pages).
Note: It's possible to use more than one NUMA host cell for a single instance with the flavor key
hw:numa_nodes, but you should be aware that multi-cell instances may show worse performance than
single-cell instances in the case when processes inside them aren't aware of their NUMA topology. See
more on this subject in the section about NUMA/CPU pinning.
To obtain information about the hardware Translation Lookaside Buffer (run apt-get install cpuid
beforehand):
#cpuid -1| awk '/^
# grep . /sys/devices/system/node/node*/hugepages/hugepages*kB/nr_hugepages
/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages:29
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:15845
/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages:31
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:15899
2.3 NUMA/CPU pinning
As we hinted in the section about huge pages, NUMA means that memory is broken up into pools, with each
physical CPU having its own pool of "local" memory. Best performance comes from making sure a process and its
memory are running in the same NUMA cell. Unfortunately, the nature of virtualization means that processes
typically use whatever CPU is available, whether it's local to the memory or not. To solve this problem, you
can use CPU pinning.
CPU pinning enables you to establish a mapping between a virtual CPU and a physical core so that the
virtual CPU will always run on the same physical one. By exposing the NUMA topology to the VM and pinning
vCPUs to specific cores, it's possible to improve VM performance by ensuring that memory access will always
be local in terms of NUMA topology.
In our example, the host has 2 NUMA cells: cell 1 contains cores 0-5 and 12-17, and cell 2 contains
cores 6-11 and 18-23.
3. Tell the system which cores should be used only by virtual machines, and not by the host operating
system, by adding the following to the end of /etc/default/grub:

GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX isolcpus=1-5,7-23"
In this example we ensured that cores 0 and 6 will be dedicated to the host system. Virtual machines will use
cores 1-5 and 12-17 on NUMA cell 1, and cores 7-11 and 18-23 on NUMA cell 2.
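isolcpus only keeps the host scheduler off those cores; Nova must also be told which cores it may hand to instances. In Kilo-era Nova this is done with the vcpu_pin_set option, so a sketch matching the core layout above (an assumption based on the isolcpus line, since the original listing is not shown here) would be:

```
# /etc/nova/nova.conf on the compute host
vcpu_pin_set=1-5,7-23
```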
5. Update the boot record and reboot the compute node:
update-grub
reboot
4. To be thorough, you should update all other flavors so they will start only on hosts without CPU
pinning:
# openstack flavor list -f csv|grep -v performance |cut -f1 -d,| \
tail -n +2| xargs -I% -n 1 nova flavor-key % set aggregate_instance_extra_specs:pinned=false
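The earlier steps that create the pinned m1.small.performance flavor are not shown above; presumably they mirror the two-cell example later in this section, along the lines of:

```
# nova flavor-create m1.small.performance auto 2048 20 2
# nova flavor-key m1.small.performance set hw:cpu_policy=dedicated
# nova flavor-key m1.small.performance set aggregate_instance_extra_specs:pinned=true
```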
2.3.3 Using CPU pinning
Once you've done this configuration, using CPU Pinning is straightforward. Follow these steps:
1. Start a new VM with a flavor that requires pinning ...
# nova boot --image TestVM --nic net-id=`openstack network show net04 -f value | head -n1` --flavor m1.small.performance test1
<numatune>
<memory mode='strict' nodeset='0'/>
<memnode cellid='0' mode='strict' nodeset='0'/>
</numatune>
You should see that each vCPU is pinned to a dedicated CPU core which is not used by the host
operating system, and that these cores are inside the same host NUMA cell (in our example, it's cores
4 and 16 in NUMA cell 1).
2. Repeat the test for the instance with two NUMA cells:
# nova flavor-create m1.small.performance-2 auto 2048 20 2
# nova flavor-key m1.small.performance-2 set hw:cpu_policy=dedicated
# nova flavor-key m1.small.performance-2 set aggregate_instance_extra_specs:pinned=true
# nova flavor-key m1.small.performance-2 set hw:numa_nodes=2
# nova boot --image TestVM --nic net-id=`openstack network show net04 -f value | head -n1` --flavor m1.small.performance-2 test2
# hypervisor=`nova show test2 | grep OS-EXT-SRV-ATTR:host | cut -d\| -f3`
# instance=`nova show test2 | grep OS-EXT-SRV-ATTR:instance_name | cut -d\| -f3`
# ssh $hypervisor virsh dumpxml $instance |awk '/vcpu placement/ {p=1}; p; /\/numatune/ {p=0}'
<vcpu placement='static'>2</vcpu>
<cputune>
<shares>2048</shares>
<vcpupin vcpu='0' cpuset='2'/>
<vcpupin vcpu='1' cpuset='10'/>
<emulatorpin cpuset='2,10'/>
</cputune>
<numatune>
<memory mode='strict' nodeset='0-1'/>
<memnode cellid='0' mode='strict' nodeset='0'/>
<memnode cellid='1' mode='strict' nodeset='1'/>
</numatune>
You should see that each vCPU is pinned to a dedicated CPU core which is not used by the host
operating system, and that these cores are in different host NUMA cells. In our example, it's core 2 in
NUMA cell 1 and core 10 in NUMA cell 2. As you may remember, in our configuration cores 1-5 and 12-17
from cell 1 and cores 7-11 and 18-23 from cell 2 are available to virtual machines.
2.3.3.1 Troubleshooting
You might run into the following errors:
2.4
SR-IOV
SR-IOV is a PCI Special Interest Group (PCI-SIG) specification for virtualizing network interfaces. It
represents each physical resource as a configurable entity (called a PF, for Physical Function) and creates
multiple virtual interfaces (VFs, or Virtual Functions) with limited configurability on top of it, with
support from the system BIOS and, conventionally, from the host OS or hypervisor. Among other
benefits, SR-IOV makes it possible to run a very large number of network-traffic-handling VMs per compute node
without increasing the number of physical NICs/ports, and provides a means for pushing this processing
down into the hardware layer, off-loading the task from the hypervisor and significantly improving both
throughput and deterministic network performance. That's why it's an NFV must-have.
Note: On Intel NICs, the PF cannot support promiscuous mode when SR-IOV is enabled, so it cannot do L2 bridging.
Because of this, you shouldn't enable SR-IOV on interfaces that have standard Fuel networks assigned to them.
(You may use SR-IOV on an interface that is used only by the Fuel Private network if you use Nova host aggregates
and different flavors for normal and SR-IOV enabled instances, as shown in the section Using SR-IOV.)
Note: SR-IOV has a couple of limitations in the Kilo release of OpenStack. Most notably, instance migration with
SR-IOV attached ports is not supported. Also, iptables-based filtering is not usable with SR-IOV NICs because
SR-IOV bypasses the normal network stack, so security groups cannot be used with SR-IOV enabled ports (though
you can still use security groups for normal ports).
In the following examples, we will assume eth1 is an SR-IOV capable interface.
Note: These instructions apply only to OpenStack Neutron with Open vSwitch.
0.000000] ACPI: DMAR 0000000079d31860 000140 (v01 ALASKA A M I 00000001 INTL 20091013)
0.061993] dmar: Host address width 46
0.061996] dmar: DRHD base: 0x000000fbffc000 flags: 0x0
0.062004] dmar: IOMMU 0: reg_base_addr fbffc000 ver 1:0 cap d2078c106f0466 ecap f020de
0.062007] dmar: DRHD base: 0x000000c7ffc000 flags: 0x1
0.062012] dmar: IOMMU 1: reg_base_addr c7ffc000 ver 1:0 cap d2078c106f0466 ecap f020de
0.062014] dmar: RMRR base: 0x0000007bc94000 end: 0x0000007bca2fff
and repeat the check. For new environments, you may want to add these kernel parameters before
deploying, so that they will be applied to all nodes of the environment. (You can do that from the Fuel
interface in the Kernel Parameters section of the Settings tab.)
NOTE: If you have an AMD motherboard, you need to check for AMD-Vi in the output of the dmesg
command and pass the options iommu=pt iommu=1 to the kernel (but we haven't yet tested that).
4. Enable the number of virtual functions required on the SR-IOV interface. Do not set the number of
VFs higher than required, since this might degrade performance. Depending on the kernel and NIC
driver version, you might get more queues on each PF with fewer VFs (usually, fewer than 32).
First, enable the interface:
# ip link set eth1 up
Next, from the command-line, get the maximum number of functions that could potentially be
enabled for your NIC:
# cat /sys/class/net/eth1/device/sriov_totalvfs
Then enable the desired number of virtual functions for your NIC:
# echo 30 > /sys/class/net/eth1/device/sriov_numvfs
NOTE: To change the number to some other value afterwards, you need to execute the following command
first:
# echo 0 > /sys/class/net/eth1/device/sriov_numvfs
NOTE: These settings arent saved across reboots. To save them, add them to /etc/rc.local:
echo "ip link set eth1 up" >> /etc/rc.local
echo "echo 30 > /sys/class/net/eth1/device/sriov_numvfs" >> /etc/rc.local
NOTE: By default, Mirantis OpenStack 7.0 supports a maximum of 30 VFs. If you need more than 30, you
will need to install a newer version of libnl3, like so:
# wget https://launchpad.net/ubuntu/+archive/primary/+files/libnl-3-200_3.2.24-2_amd64.deb
# wget https://launchpad.net/ubuntu/+archive/primary/+files/libnl-genl-3-200_3.2.24-2_amd64.deb
# wget https://launchpad.net/ubuntu/+archive/primary/+files/libnl-route-3-200_3.2.24-2_amd64.deb
# dpkg -i libnl-3-200_3.2.24-2_amd64.deb
# dpkg -i libnl-genl-3-200_3.2.24-2_amd64.deb
# dpkg -i libnl-route-3-200_3.2.24-2_amd64.deb
If you don't see link-state auto in the output, your installation will require an SR-IOV agent. You
can enable it like so:
6. Edit /etc/nova/nova.conf to add the NIC to the pci_passthrough_whitelist parameter:

pci_passthrough_whitelist={"devname": "eth1", "physical_network":"physnet2"}
7. Edit /etc/neutron/plugins/ml2/ml2_conf_sriov.ini to add the mapping:

[sriov_nic]
physical_device_mappings = physnet2:eth1
9. Get the vendor's product ID; you'll need it to configure SR-IOV on the controller nodes.
# lspci -nn | grep -e "Ethernet.*Virtual"
06:10.1 Ethernet controller [0200]: Intel Corporation 82599 Ethernet Controller Virtual Function [8086:10ed]
(rev 01)
06:10.3 Ethernet controller [0200]: Intel Corporation 82599 Ethernet Controller Virtual Function [8086:10ed]
(rev 01)
(This is just an example of the output; the actual value may differ on your hardware.)
Write down the vendor's product ID (the value in square brackets, such as 8086:10ed in this case).
2.4.1.2 Configure SR-IOV on the Controller nodes
Next you'll need to configure SR-IOV on the controller nodes, as follows:
1. Edit /etc/neutron/plugins/ml2/ml2_conf.ini to add sriovnicswitch to the list of mechanism drivers:

mechanism_drivers = openvswitch,l2population,sriovnicswitch
Use the vendor's product ID from step 9 of the previous section as the value for supported_pci_vendor_devs.
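Assuming the 82599 Virtual Function ID from the example output earlier, the resulting snippet in /etc/neutron/plugins/ml2/ml2_conf_sriov.ini would look roughly like this (the section name follows the Kilo sriovnicswitch driver; substitute your own vendor:product pair):

```ini
[ml2_sriov]
supported_pci_vendor_devs = 8086:10ed
```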
3. Add PciPassthroughFilter and AggregateInstanceExtraSpecsFilter to the list of scheduler filters in
/etc/nova/nova.conf:

scheduler_default_filters=DifferentHostFilter,RetryFilter,AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,PciPassthroughFilter,AggregateInstanceExtraSpecsFilter
4. Update all other flavors so they will start only on hosts without SR-IOV support:
# openstack flavor list -f csv|grep -v sriov|cut -f1 -d,| tail -n +2 | \
xargs -I% -n 1 nova flavor-key % set aggregate_instance_extra_specs:sriov=false
5. To use the SR-IOV port you need to create an instance with ports that use the vnic-type direct. For
now, you have to do this via the command line. Create a port for the instance:
# neutron port-create net04 --binding:vnic-type direct --device_owner nova-compute --name sriov-port1
6. Because the default Cirros image does not include the Intel NIC drivers, we'll use an Ubuntu
cloud image to test SR-IOV. See Appendix A2 on how to prepare it.
7. Spawn the instance.
# port_id=`neutron port-list | awk '/sriov-port1/ {print $2}'`
# nova boot --flavor m1.small.sriov --image trusty --key_name key1 --nic port-id=$port_id sriov-vm1
9. Connect to the instance to make sure everything is up and running by first finding the controllers
with a namespace that has access to the instance ...
# neutron dhcp-agent-list-hosting-net -f csv -c host net04 --quote none | tail -n+2
node-7.domain.tld
node-9.domain.tld
... and then connecting to the instance from one of those controllers.
# ip netns exec `ip netns show | grep qdhcp-$(neutron net-list | awk '/net04 / {print$2}')` ssh -i key1.pem ubuntu@192.168.111.5
Be aware that even if your compute host has multiple CPUs, only one of the CPUs has direct access to a
particular I/O card. When a process running on one CPU accesses an I/O card that is local to the other
CPU, you'll see a performance penalty. See more on this topic at
https://specs.openstack.org/openstack/nova-specs/specs/kilo/implemented/input-output-based-numa-scheduling.html
When you use CPU pinning (instances with the hw:cpu_policy=dedicated extra specification) and SR-IOV at the
same time, an SR-IOV enabled instance will start only on a CPU that has direct access to the SR-IOV card. If
there are not enough resources in this CPU's NUMA cell, the instance will fail to start.
2.4.1.4 Troubleshooting
If you see:
libvirtError: unsupported configuration: host doesn't support passthrough of host PCI devices
it means that VT-d is not supported or not enabled.
If you see:
NovaException: Unexpected vif_type=binding_failed
you should enable the SR-IOV agent, or if you've already done so, make sure that it's running:
# neutron agent-list | grep sriov-nic-agent
| dfa4edcf-63c1-4af7-a291-ec139a16f346 | NIC Switch agent | node-16.domain.tld | :-) | True |
neutron-sriov-nic-agent |
Every new instance will start on a new hypervisor; no two instances in this group will share the same
hypervisor. If N exceeds the number of available hypervisors, the excess instances will fail to start
with a "No valid host was found" error.
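With the Nova CLI shipped in MOS 7.0, the anti-affinity flow looks roughly like this (group and instance names are illustrative):

```
# nova server-group-create ha-group anti-affinity
# group_id=`nova server-group-list | awk '/ha-group/ {print $2}'`
# nova boot --image TestVM --flavor m1.small --hint group=$group_id ha-vm-1
# nova boot --image TestVM --flavor m1.small --hint group=$group_id ha-vm-2
```

This relies on ServerGroupAntiAffinityFilter being present in the scheduler filter list.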
2.5.2.1 One IP address, multiple VMs
The security model in Neutron allows VMs to use only IP addresses allocated to them via IPAM, and by
default every IP address may be associated with a single VM only. Most of the time, high availability
approaches such as VRRP, HSRP, and so on require a single IP address (called a virtual IP) to be shared by
multiple VMs. To assign the same IP address to multiple VMs, you must use the Allowed-Address-Pairs
Neutron extension, which has been implemented in OpenStack releases since Havana.
Note: Some flavors of these protocols actually move the MAC address rather than replacing the MAC/IP
association, which is not compatible with SR-IOV without taking additional steps; in this paper we are assuming
you're dealing with the compatible varieties.
2.5.2.2 Using the Allowed-Address-Pairs extension
To use the Allowed-Address-Pairs extension, perform the following steps:
1. We will use the keepalived daemon to test high availability in this example. The default Cirros image
shipped with MOS 7.0 does not include keepalived, so we'll use an Ubuntu cloud image. See Appendix
A2 for instructions on how to prepare it.
2. Create a new security group for your instances:
# neutron security-group-create ha-sec-group
3. Add the necessary rules to the group. In this case, you want ICMP, HTTP and SSH access to instances:
# neutron security-group-rule-create --protocol icmp ha-sec-group
# neutron security-group-rule-create --protocol tcp --port-range-min 80 --port-range-max 80 ha-sec-group
# neutron security-group-rule-create --protocol tcp --port-range-min 22 --port-range-max 22 ha-sec-group
4. Also allow VRRP traffic between the instances in this group. VRRP is used by keepalived to monitor
instance availability:
# neutron security-group-rule-create --protocol 112 ha-sec-group --remote-group-id ha-sec-group
5. Boot two instances with the new security group:
# nova boot --num-instances 2 --image trusty --key_name key1 --flavor m1.small --nic net-id=`neutron net-list | awk '/net04 / {print$2}'` --security_groups ha-sec-group ha-node
6. Get the Neutron port IDs and IP addresses of the instances. You'll need them in the steps below:
# node1_port=`nova interface-list ha-node-1 | cut -d'|' -f 3 | grep '\w\+\(-\w\+\)\+'`
# node2_port=`nova interface-list ha-node-2 | cut -d'|' -f 3 | grep '\w\+\(-\w\+\)\+'`
# node1_ip=`openstack server show ha-node-1 -f value -c addresses | cut -d= -f2`
# node2_ip=`openstack server show ha-node-2 -f value -c addresses | cut -d= -f2`
7. Create a new Neutron port with an arbitrary free IP address on the private network. This IP address
will be shared between the instances.
# ha_port=`neutron port-create --fixed-ip ip_address=192.168.111.200 --security-group ha-sec-group -c id -f value net04 | tail -1`
8. Allow the instance's ports to send and receive traffic using this IP address by assigning the
allowed_address_pairs attribute to them:
# neutron port-update $node1_port --allowed_address_pairs list=true type=dict ip_address=192.168.111.200
# neutron port-update $node2_port --allowed_address_pairs list=true type=dict ip_address=192.168.111.200
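To confirm the attribute was applied, you can inspect either port; the exact output format depends on the client version:

```
# neutron port-show $node1_port -c allowed_address_pairs
```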
Restart keepalived:
sudo service keepalived restart
Change the default Apache index file. This will help you determine which node answers your requests:
sudo bash -c 'echo "ha-node1" > /var/www/html/index.html'
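The keepalived configuration applied on the first node is not reproduced on this page; a minimal sketch is shown below. The interface name, router ID, and priorities are assumptions for this example's network; the virtual IP matches the shared address created earlier:

```
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101        # use a lower value, e.g. 100, on ha-node-2
    advert_int 1
    virtual_ipaddress {
        192.168.111.200
    }
}
```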
12. Configure the second node. The steps are largely the same as for the first node. Start by logging in:
# ip netns exec `ip netns show|grep qdhcp-$(neutron net-list | awk '/net04 / {print$2}')` ssh -i key1.pem ubuntu@$node2_ip
Restart keepalived:
sudo service keepalived restart
Change the default Apache index file. This will help you determine which node answers your
requests:
sudo bash -c 'echo "ha-node2" > /var/www/html/index.html'
13. Now you can test your HA configuration. Start by switching off the first node's port:
# neutron port-update $node1_port --admin_state_up=False
Next, try to access the site. The answer should come from the second node:
# ip netns exec `ip netns show|grep qdhcp-$(neutron net-list | awk '/net04 / {print$2}')` curl 192.168.111.200
ha-node2
Now switch the first port on and the second port off:
# neutron port-update $node1_port --admin_state_up=True
# neutron port-update $node2_port --admin_state_up=False
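To complete the failover test, repeat the curl check; assuming keepalived on the first node has reclaimed the virtual IP, the reply should now come from ha-node1:

```
# ip netns exec `ip netns show|grep qdhcp-$(neutron net-list | awk '/net04 / {print$2}')` curl 192.168.111.200
ha-node1
```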
3 APPENDIX
3.1 Installing QEMU 2.4
MOS 7.0 comes with QEMU 2.0, but huge pages and NUMA functionality require version 2.1 or above. To
solve this problem, we've created prebuilt packages of QEMU 2.4 for MOS 7.0. Follow these instructions to
install them.
DISCLAIMER
Please note that at the time of this writing, the QEMU packages provided below are typically supported by
Mirantis Professional services, and are not officially supported by Mirantis support as part of Mirantis
OpenStack. They have been provided here for your convenience; please contact your account representative
before performing this particular NFV tuning.
# sha512sum qemu-2.4-mos7.tar
0edf6178f03f1cc515eb4d3418092edf06dfe28550a957e2ae7679934b5f615024a155facbf41157ebc2f418629b02e5cfe0fcc4f0256594122fe5adff99ef8 qemu-2.4-mos7.tar
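Rather than comparing digests by eye, checksum verification can be scripted with `sha512sum -c`, which checks a file against a recorded digest and exits non-zero on a mismatch. The file name and contents below are placeholders for illustration; substitute the real qemu-2.4-mos7.tar archive and the hash given above:

```shell
# Create a stand-in file and record its digest in the
# "HASH  FILENAME" format that sha512sum -c understands
echo "demo payload" > archive.tar
sha512sum archive.tar > archive.tar.sha512

# Later, verify the download; prints "archive.tar: OK" on success
sha512sum -c archive.tar.sha512
```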
2. Prepare compute nodes. On every compute host where you want huge pages or CPU pinning to be
enabled, do the following:
Upgrade packages:
# apt-get update
# apt-get install qemu-kvm qemu-utils qemu-system-x86
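After the upgrade, you can confirm which QEMU version is now installed; the exact version string reported depends on the packages you installed:

```
# qemu-system-x86_64 --version
```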
2. You will only be able to log in to this instance using an SSH key, so create a keypair:
# nova keypair-add key1 > key1.pem
# chmod 600 key1.pem
3. Add a rule to allow SSH access to instances. Because you're adding it to the default security group, it
will apply to all VMs created:
# nova secgroup-add-rule default tcp 22 22 0.0.0.0/0
4. To test the Ubuntu Cloud image, run an instance with this keypair injected:
# nova boot --image trusty --key_name key1 --flavor m1.small --nic net-id=<net-uuid> ubuntu-trusty
5. You should now be able to access the instance by SSH. To do that, assign a floating IP to the instance
(via Horizon or the CLI) and try to SSH to it:
# ssh -i key1.pem ubuntu@<floating-ip>
4 RESOURCES