Intel Corporation
21 July 2008
Agenda
- Xen internals
- High level architecture
- Paravirtualization
- HVM
- Others
- KVM
- VMware
- OpenVZ
Xen Overview
Xen Project bio
• Xen project was created in 2003 at the University of Cambridge Computer Laboratory in
what's known as the Xen Hypervisor project
– Led by Ian Pratt with team members Keir Fraser, Steven Hand, and Christian Limpach.
– This team along with Silicon Valley technology entrepreneurs Nick Gault and Simon Crosby founded XenSource
which was acquired by Citrix Systems in October 2007
• The Xen® hypervisor is an open source technology, developed collaboratively by the Xen
community and engineers (AMD, Cisco, Dell, HP, IBM, Intel, Mellanox, Network
Appliance, Novell, Red Hat, SGI, Sun, Unisys, Veritas, Voltaire, and of course, Citrix)
• Xen is licensed under the GNU General Public License
• Xen supports Linux 2.4, 2.6, Windows and NetBSD 2.0
Xen Components
A Xen virtual environment consists of several modules that provide the virtualization
environment:
• Xen Hypervisor - VMM
• Domain 0
• Domain Management and Control
• Domain U (user domain), which can be one of:
– Paravirtualized Guest: the kernel is aware of virtualization
– Hardware Virtual Machine Guest: the kernel runs natively
[Diagram: Domain 0 (running Domain Management and Control) and multiple Domain U guests (Paravirtual Guests and HVM Guests), all on top of the Hypervisor - VMM]
Xen Hypervisor - VMM
The hypervisor is Xen itself.
It sits between the hardware and the operating systems of the various domains.
The hypervisor is responsible for:
• Checking page tables
• Allocating resources for new domains
• Scheduling domains.
• Booting the machine enough that it can start dom0.
It presents the domains with a virtual machine that looks similar, but not identical, to the native
architecture.
Just as applications interact with an OS by issuing syscalls, domains interact with the
hypervisor by issuing hypercalls. The hypervisor responds by sending the domain an
event, which fulfills the same function as an IRQ on real hardware.
A hypercall is to a hypervisor what a syscall is to a kernel.
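An illustrative sketch of the parallel, modeled on Xen's legacy 32-bit ABI (hypercall number in EAX, arguments in EBX, ECX, ...); this is a sketch, not the exact macro from the Xen headers:

/* Sketch: a two-argument hypercall over the legacy int 0x82 path,
 * shaped just like a Linux int 0x80 syscall stub */
static inline long hypercall2(unsigned int nr, unsigned long a1,
                              unsigned long a2)
{
    long ret;
    asm volatile("int $0x82"            /* trap into the hypervisor */
                 : "=a"(ret)
                 : "a"(nr), "b"(a1), "c"(a2)
                 : "memory");
    return ret;
}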
Restricting operations with Privilege Rings
The hypervisor executes privileged instructions, so it must run at the most privileged level:
• x86 architecture provides 4 privilege levels / rings
• Most OSs were created before this implementation, so only 2 levels are used
• Xen provides 2 modes:
– In x86 the applications are run at ring 3, the kernel at ring 1 and Xen at ring 0
– In x86 with VT-x, the applications run at ring 3, the guest kernel at ring 0 in non-root mode and Xen at ring 0 in root mode (often called ring -1)
[Diagram: without VT-x, applications run at ring 3, guest kernels (dom0 and domU) at ring 1 and the Hypervisor at ring 0; with VT-x the Hypervisor is moved to ring -1]
Domain 0
Domain 0 is a required Xen virtual machine running a modified Linux kernel with special
rights to:
• Access physical I/O devices
– Two drivers are included in Domain 0 to serve requests from Domain U PV or HVM guests
• Interact with the other Virtual Machines (Domain U)
• Provide the command line interface for Xen daemons
Due to its importance, Domain 0 should run only the minimum necessary functionality and be properly secured
Some Domain 0 responsibilities can be delegated to Domain U (isolated driver domain)
Domain 0 drivers:
• For PV guests:
– Network backend driver: communicates directly with the local networking hardware to process all virtual machine requests
– Block backend driver: communicates with the local storage disk to read and write data from the drive based upon Domain U requests
• For HVM guests:
– Qemu-DM: supports HVM Guests for networking and disk access requests
Domain Management and Control - Daemons
The Domain Management and Control is composed of Linux daemons and tools:
• Xm
– Command line tool that passes user input to Xend through XML-RPC
• Xend
– Python application that is considered the system manager for the Xen environment
• Libxenctrl
– A C library that allows Xend to talk with the Xen hypervisor via Domain 0 (privcmd driver delivers the request to
the hypervisor)
• Xenstored
– Maintains a registry of information including memory and event channel links between Domain 0 and all other
Domains
• Qemu-dm
– Supports HVM Guests for networking and disk access requests
Domain U – Paravirtualized guests
The Domain U PV Guest is a modified Linux, Solaris, FreeBSD or other UNIX system that is
aware of virtualization (no direct access to hardware)
No rights to directly access hardware resources, unless specially granted
Access to hardware through front-end drivers using the split device driver model
Usually contains XenStore, console, network and block device drivers
There can be multiple Domain U in a Xen configuration
[Diagram: a Domain U PV guest contains the XenStore driver (similar to a registry), console driver, and network and block front-end drivers]
Domain U – HVM guests
The Domain U HVM Guest is a native OS with no notion of virtualization (it is unaware that it
shares CPU time with other running VMs)
Since an unmodified OS does not support the Xen split device driver model, Xen emulates
devices by borrowing code from QEMU
HVMs begin in real mode and get configuration information from an emulated BIOS
For an HVM guest to use Xen features it must use CPUID and then access the hypercall
page
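A hedged sketch of that sequence (the cpuid()/wrmsr() helpers and hypercall_page_pfn are assumptions; the leaf numbers are from Xen's public headers):

/* Sketch: detect Xen from an HVM guest and install the hypercall page.
 * Leaf 0x40000000 returns the "XenVMMXenVMM" signature; leaf
 * 0x40000002 returns the number of hypercall pages (EAX) and the MSR
 * used to install them (EBX). */
static void xen_hvm_init(void)
{
    uint32_t eax, ebx, ecx, edx;
    cpuid(0x40000000, &eax, &ebx, &ecx, &edx);
    if (ebx == 0x566e6558)  /* "XenV" -- first 4 bytes of the signature */
    {
        cpuid(0x40000002, &eax, &ebx, &ecx, &edx);
        wrmsr(ebx, (uint64_t)hypercall_page_pfn << 12); /* guest-physical */
    }
}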
[Diagram: a Domain U HVM guest includes a virtual BIOS that simulates the BIOS for the unmodified operating system to read during startup]
Pseudo-Physical Memory Model
[Diagram: the kernel sees a contiguous pseudo-physical address space; the hypervisor maps those pages to scattered machine pages]
The triple indirection model is not strictly required, but it is more convenient in terms of
performance and of the modifications needed in the guest kernel.
If the guest kernel needs to know anything about the machine pages, it has to
use the translation table provided by the shared info page (rare)
There are variables at various places in the code identified as MFN, PFN, GMFN and
GPFN
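A minimal sketch of the PFN-to-MFN lookup, assuming (as the example kernels do) that the guest keeps the table whose VIRTUAL address is handed over in start_info->mfn_list:

/* Sketch: pseudo-physical-to-machine translation in a PV guest */
unsigned long *phys_to_machine_mapping; /* = (unsigned long *)start_info->mfn_list */

static inline unsigned long pfn_to_mfn(unsigned long pfn)
{
    return phys_to_machine_mapping[pfn];
}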
Xen Networking
Xen creates, by default, seven pairs of "connected virtual ethernet interfaces" for use by dom0
For each new domU, it creates a new pair of "connected virtual ethernet interfaces", with one
end in domU and the other in dom0
Virtualized network interfaces in domains are given Ethernet MAC addresses (by default xend
will select a random address)
The default Xen configuration uses bridging (xenbr0) within domain 0 to allow all domains to
appear on the network as individual hosts
The Virtual Machine lifecycle
Xen provides 3 mechanisms to boot a VM:
- Booting from scratch (Turn on)
- Restoring the VM from a previously saved state (Wake)
- Clone a running VM (only in XenServer)
[State diagram: OFF, PAUSED, RUNNING and SUSPENDED states, with transitions Turn on / Turn off between OFF and RUNNING, Start (paused) / Stop between OFF and PAUSED, Pause / Resume between RUNNING and PAUSED, Sleep / Wake between RUNNING and SUSPENDED, and Migrate on a RUNNING VM]
A project: provide VMs for instantaneous/isolated execution
Goal: determine a mechanism for instantaneous execution of applications in sandboxed VMs
Approach:
• Analyze current capabilities in Xen
• Implement a prototype that addresses the specified goal: VM-Pool
Analyzing Xen spawning mechanisms
• Booting from scratch
[Diagram: a VM-Pool Manager exposes an external interface to listen for requests and runs on top of the VMM]
Results with the VM-Pool
• The VM is ready to run in less than half a second (~350 milliseconds)
Initialization time - from scratch 265±21 seconds
Initialization time - resume 52±1 seconds
Get operation 341 milliseconds
Release operation - from scratch 110±21 seconds
Release operation - resume 30±2 seconds
[Bar chart: initialization time in seconds for the two VM booting modes, from scratch vs. resume]
Virtual Machines Scheduling
The hypervisor is responsible for ensuring that every running guest receives some CPU time.
Most used scheduling mechanisms in Xen:
• Simple Earliest Deadline First – SEDF (being deprecated):
– Each domain runs for an n ms slice every m ms (n and m are configured per-domain)
• Credit Scheduler:
– Each domain has two properties: a cap and a weight
– Weight: determines the share of the physical CPU time that the domain gets, weights are relative to each other
– Cap: represents the maximum, it’s an absolute value
– Work-conserving by default: if no other VM needs the CPU, the running one is given more time to
execute
– Uses a fixed-size 30ms quantum, and ticks every 10 ms
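For example, on a single physical CPU a domain with weight 512 receives twice as much CPU time as a domain with weight 256; a cap of 50 limits a domain to half of one physical CPU even when the machine is otherwise idle, and a cap of 0 means no upper limit.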
Xen provides a simple abstract interface to schedulers:
struct scheduler {
char *name; /* full name for this scheduler */
char *opt_name; /* option name for this scheduler */
unsigned int sched_id; /* ID for this scheduler */
void (*init) (void);
int (*init_domain) (struct domain *);
void (*destroy_domain) (struct domain *);
int (*init_vcpu) (struct vcpu *);
void (*destroy_vcpu) (struct vcpu *);
void (*sleep) (struct vcpu *);
void (*wake) (struct vcpu *);
struct task_slice (*do_schedule) (s_time_t);
int (*pick_cpu) (struct vcpu *);
int (*adjust) (struct domain *, struct xen_domctl_scheduler_op *);
void (*dump_settings) (void);
void (*dump_cpu_state) (int);
};
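Each scheduler fills in this table with its own hooks. A hedged sketch of how the credit scheduler plugs in (fields abridged; the csched_* names follow xen/common/sched_credit.c, treat them as illustrative):

/* Sketch: a scheduler registers itself by exporting a filled-in
 * struct scheduler that Xen's schedule.c finds in its static list */
struct scheduler sched_credit_def = {
    .name        = "SMP Credit Scheduler",
    .opt_name    = "credit",
    .sched_id    = XEN_SCHEDULER_CREDIT,
    .init_domain = csched_dom_init,   /* per-domain bookkeeping */
    .init_vcpu   = csched_vcpu_init,
    .do_schedule = csched_schedule,   /* picks the next task_slice */
    .pick_cpu    = csched_cpu_pick,
    /* ... remaining hooks elided ... */
};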
Xen Para-Virtual functionality
Paravirtualized architecture
We’ll review the PV mechanisms that support this architecture:
- Kernel Initialization
- Hypercalls creation
- Event channels
- XenStore (some kind of registry)
- Memory transfers between VMs
- Split device drivers
[Diagram: Domain 0 and a Paravirtual Guest communicating through the Hypervisor]
Initial information for booting a PV OS
• First things the OS needs to know when it boots:
– Available RAM, connected peripherals, access to the machine clock.
• An OS running on a PV Xen environment does not have access to real
firmware
– The information required is provided by the SHARED INFO PAGES.
• The “domain builder” is in charge of mapping the shared info pages into the
guest’s address space prior to its boot.
– Example: launching dom0 in a i386 architecture:
• Refer to function construct_dom0 in xen/arch/x86/domain_build.c
The start info page
• The start info page is loaded in the guest’s address space at boot time. The
way this page is transferred is architecture-dependent; x86 uses the ESI
register.
• A portion of the fields in the start info page are always available for the guest
domain and are updated every time the virtual machine is resumed, because
some of them contain machine addresses (which are subject to change across a resume).
start_info structure overview
struct start_info {
/* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME. */
char magic[32]; /* "xen-<version>-<platform>". */
unsigned long nr_pages; /* Total pages allocated to this domain. */
unsigned long shared_info; /* MACHINE address of shared info struct. */
uint32_t flags; /* SIF_xxx flags. */
xen_pfn_t store_mfn; /* MACHINE page number of shared page. */
uint32_t store_evtchn; /* Event channel for store communication. */
union {
struct {
xen_pfn_t mfn; /* MACHINE page number of console page. */
uint32_t evtchn; /* Event channel for console page. */
} domU;
struct {
uint32_t info_off; /* Offset of console_info struct. */
uint32_t info_size; /* Size of console_info struct from start.*/
} dom0;
} console;
/* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME). */
unsigned long pt_base; /* VIRTUAL address of page directory. */
unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames. */
unsigned long mfn_list; /* VIRTUAL address of page-frame list. */
unsigned long mod_start; /* VIRTUAL address of pre-loaded module. */
unsigned long mod_len; /* Size (bytes) of pre-loaded module. */
int8_t cmd_line[MAX_GUEST_CMDLINE];
}; typedef struct start_info start_info_t;
start_info fields
char magic[32]; /*"xen-<version>-<platform>"*/
• The magic number is the first thing the guest domain must check from its start
info page.
– If the magic string does not start with “xen-” something is seriously wrong and the
best thing to do is abort.
– Also, the minor and major versions must be checked in order to determine whether the guest
kernel has been tested with this Xen version.
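A minimal sketch of that check, assuming start_info has already been mapped and a panic() helper exists:

/* Sketch: sanity-check the start info page before using anything else */
if (strncmp(start_info->magic, "xen-", 4) != 0)
    panic("not started by a Xen hypervisor");
/* a real kernel would also parse and compare the major/minor version */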
start_info fields (2)
unsigned long shared_info; /*MACHINE address of shared info struct.*/
• Contains the address of the machine page where the shared info structure is.
The guest kernel should map it to retrieve useful information for its initialization
process.
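A hedged sketch of that mapping, following the approach of the example kernels (the flag value 7 = present | writable | user and UVMF_INVLPG come from Xen's public headers):

/* Sketch: map the machine page named by start_info->shared_info at a
 * page-aligned kernel virtual address reserved for it */
extern shared_info_t shared_info;   /* page-aligned placeholder symbol */

HYPERVISOR_update_va_mapping((unsigned long)&shared_info,
                             __pte(start_info->shared_info | 7),
                             UVMF_INVLPG);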
start_info fields (3)
union {
struct {
xen_pfn_t mfn; /* MACHINE page number of console page.*/
uint32_t evtchn; /* Event channel for console page.*/
}domU;
struct {
uint32_t info_off; /*Offset of console_info struct. */
uint32_t info_size; /*Size of console_info struct from start.*/
}dom0;
}console;
• Domain 0 guests use the dom0 part, which contains the memory offset and
size of the structure used to define the Xen console.
• For unprivileged domains the domU part of the union is used. Its fields
identify the shared memory page and event channel used for the
console device.
The shared Info Page
• The shared info contains information that is dynamically updated as the system
runs.
• It is explicitly mapped by the guest.
• The content of this page is defined by the C structure shared_info which is
declared in xen/include/public/xen.h
[Diagram: layout of shared_info_t — vcpu_info[], evtchn_pending, evtchn_mask, wc_version, wc_sec, wc_nsec and arch (arch_shared_info_t: max_pfn, pfn_to_mfn_frame_list_list); each vcpu_info_t holds evtchn_upcall_pending, evtchn_upcall_mask, evtchn_pending_sel, arch (arch_vcpu_t: cr2, pad) and time (arch_time_info_t: version, pad0, tsc_timestamp, system_time, tsc_to_system_mul, tsc_shift, pad1)]
An exercise:
The simplest Xen kernel
The simplest Xen kernel
• Bootstrap
– Each Xen guest kernel must start with a section __xen_guest for the bootloader, with key-value pairs
• GUEST_OS: name of the running kernel
• XEN_VER: specifies the Xen version for which the guest was implemented
• VIRT_BASE: address in the guest’s address space where this allocation is mapped (0 for kernels)
• ELF_PADDR_OFFSET: value subtracted from addresses in ELF headers (0 for kernels)
• HYPERCALL_PAGE: specifies the page number where the hypercall trampolines will be set
• LOADER: special boot loaders (currently only generic is available)
– After mapping everything into memory at the right places, Xen passes control to the guest kernel
• A trampoline is defined at _start
– Clears the direction flag, sets up the stack and calls the kernel start passing the start info page
address in the ESI register as a parameter
– A guest kernel is expected to set up handlers to receive events at boot time, otherwise the kernel is not able to
respond to the outside world (this is ignored in the book’s example)
• Kernel.c
– The start_kernel routine takes the start info page as the parameter (passed through the ESI)
– The stack is reserved in this file, although it was referenced in bootstrap as well for creating the trampoline routine
– If the hypervisor was compiled with debugging, then the HYPERVISOR_console_io will send the string to the
console (otherwise the hypercall fails)
• Debug.h
– The hypercall takes three arguments: the command (write), the length of the string and the string pointer
– The hypercall # is 18 (xen/include/public/xen.h)
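Put together, the debug write reduces to a single hypercall; a sketch (CONSOLEIO_write is from xen/include/public/xen.h, and the call only produces output on a debug build of Xen):

/* Sketch: write a string to the emergency console */
static void console_write(const char *msg, int length)
{
    HYPERVISOR_console_io(CONSOLEIO_write, length, (char *)msg);
}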
Hypercalls
Executing Privileged Instructions from Apps
Because guest kernels don’t run at ring 0 they are not allowed to execute privileged
instructions, so a mechanism is needed to execute them in the right ring. Suppose exit(0):
push dword 0
mov eax, 1
push eax
int 80h
[Diagram: in the native case the int 80h traps to the kernel, which holds the interrupts table at ring 0; in the paravirtualized case the Hypervisor holds the interrupts table at ring 0 and the kernel runs at ring 1, so the request arrives via a forwarded System Call, a Hypercall, or a Direct System Call (Xen specific)]
Replacing Privileged instructions with Hypercalls
Unmodified guests use privileged instructions which require a transition to ring 0, causing a performance
penalty when resolved by the hypervisor
Paravirtual guests replace their privileged instructions with hypercalls
Xen uses 2 mechanisms for hypercalls:
1. An int 82h is used as the channel similar to system calls (deprecated after Xen 3.0)
2. Issued indirectly using the hypercall page provided when the guest is started
A PV Xen guest uses the HYPERVISOR_sched_op function with SCHEDOP_yield argument instead of
using the privileged instruction HLT, in order to relinquish CPU time to guests with running tasks
static inline int HYPERVISOR_sched_op(int cmd, void *arg)
{
return _hypercall2(int, sched_op, cmd, arg);
}
extras/mini-os/include/x86/x86_32/hypercall-x86_32.h, implemented at xen/common/schedule.c
Event Channels
Event Channels
Event channels are the basic primitive provided by Xen for event notifications, the equivalent of a
hardware interrupt, valid for paravirtualized OSs
Events are one bit of information signaled by transitioning from 0 to 1
• Physical IRQs: mapped from real IRQs used to communicate with hardware devices
• Virtual IRQs: similar to PIRQs, but related to virtual devices such as the timer, debug
console
• Interdomain events: bidirectional interrupts that notify domains about certain events
• Intradomain events: special case of interdomain events
[Diagram: an event channel connects the Domain Management and Control stack in Domain 0 with the event channel driver of a Domain U Paravirtual Guest, over the Hypervisor (VMM) and the hardware]
Event Channel Interface
Guests configure the Event Channel and send interrupts by issuing a specific hypercall:
HYPERVISOR_event_channel_op (...)
Guests are notified about pending events through callbacks installed during initialization,
these events can be masked dynamically
HYPERVISOR_set_callbacks(…)
[Diagram: the guest issues HYPERVISOR_event_channel_op hypercalls down to the Hypervisor (VMM), which notifies Domain 0 and Domain U through their registered callbacks]
HYPERVISOR_event_channel_op
HYPERVISOR_event_channel_op(int cmd, void *arg) /* defined at xen-3.1.0-src\linux-2.6-xen-sparse\include\asm-i386\mach-xen\asm\hypercall.h */
Issuing event channel hypercalls
Structures defined at xen-3.1.0-src\xen\include\public\event_channel.h
Hypervisor handlers defined at xen-3.1.0-src\xen\common\event_channel.c
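As a hedged sketch of one such call, signaling a previously bound interdomain channel uses the EVTCHNOP_send command with an evtchn_send structure (local_port is assumed to have been bound earlier):

/* Sketch: notify the remote end of an already-bound event channel.
 * struct evtchn_send and EVTCHNOP_send come from event_channel.h */
struct evtchn_send send_op = { .port = local_port };
HYPERVISOR_event_channel_op(EVTCHNOP_send, &send_op);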
HYPERVISOR_set_callbacks
Hypercall to configure the notification handlers
HYPERVISOR_set_callbacks(
unsigned long event_selector, unsigned long event_address,
unsigned long failsafe_selector, unsigned long failsafe_address)
/* defined at xen-3.1.0-src\linux-2.6-xen-sparse\include\asm-i386\mach-xen\asm\hypercall.h */
Setting the notifications handler
Handler and masks configuration
/* Locations in the bootstrapping code */
extern volatile shared_info_t shared_info;
void hypervisor_callback(void);
void failsafe_callback(void);
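With those symbols declared, the handlers are installed with a single call; a sketch, where FLAT_KERNEL_CS is the flat code segment selector from Xen's public headers:

/* Sketch: register the event and failsafe entry points with Xen */
HYPERVISOR_set_callbacks(FLAT_KERNEL_CS, (unsigned long)hypervisor_callback,
                         FLAT_KERNEL_CS, (unsigned long)failsafe_callback);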
Implementing the callback function
/* Dispatch events to the correct handlers */
void do_hypervisor_callback(struct pt_regs *regs)
{
unsigned int pending_selector, next_event_offset;
vcpu_info_t *vcpu = &shared_info.vcpu_info[0];
/* Make sure we don't lose the edge on new events... */
vcpu->evtchn_upcall_pending = 0;
/* Set the pending selector to 0 and get the old value atomically */
pending_selector = xchg(&vcpu->evtchn_pending_sel, 0);
while(pending_selector != 0)
{
/* Get the first bit of the selector and clear it */
next_event_offset = first_bit(pending_selector);
/* first_bit() maps a bit to its index within the word */
pending_selector &= ~(1 << next_event_offset);
unsigned int event;
/* Sketch of the remainder: scan the pending word, skipping masked
 * channels; first_bit(), clear_bit() and handle_event() are assumed
 * to be provided elsewhere by the kernel */
while((event = (shared_info.evtchn_pending[next_event_offset]
        & ~shared_info.evtchn_mask[next_event_offset])) != 0)
{
unsigned int event_offset = first_bit(event);
/* Combine the two offsets to get the event channel port */
handle_event((next_event_offset * 32) + event_offset, regs);
/* Clear the pending bit for this port */
clear_bit(event_offset, &shared_info.evtchn_pending[next_event_offset]);
}
}
}
XenStore
Xen Store
XenStore is a hierarchical namespace (similar to sysfs or Open Firmware) which is shared
between domains
The interdomain communication primitives exposed by Xen are very low-level (virtual IRQ
and shared memory)
XenStore is implemented on top of these primitives and provides some higher level
operations (read a key, write a key, enumerate a directory, notify when a key changes
value)
General Format
There are three main paths in XenStore:
• /vm - stores configuration information about domains
• /local/domain - stores information about the domain on the local node (domid, etc.)
• /tool - stores information for the various tools
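As a hedged example of the higher-level interface, reading a key with the C client library (libxenstore) shipped with Xen; error handling is elided and the path is illustrative:

/* Sketch: read a value from XenStore via libxenstore */
#include <stdlib.h>
#include <xs.h>

void read_domain_name(void)
{
    struct xs_handle *xs = xs_daemon_open();
    unsigned int len;
    char *name = xs_read(xs, XBT_NULL, "/local/domain/0/name", &len);
    /* ... use name ... */
    free(name);
    xs_daemon_close(xs);
}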
Ring buffers for split driver model
• The ring buffer is a fairly standard lockless data structure for producer-consumer
communications
• Xen uses free-running counters
• Each ring contains two kinds of data, a request and a response, updated by the two
halves of the driver
• Xen only allows responses to be written in a way that overwrites requests
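A minimal sketch of the free-running counter idiom (not Xen's actual macros, which live in xen/include/public/io/ring.h; request_t and response_t stand in for a driver's message types): the counters only ever increment, and indices are taken modulo the power-of-two ring size, so producer and consumer need no lock:

#define RING_SIZE 32                    /* must be a power of two */
#define MASK_IDX(i) ((i) & (RING_SIZE - 1))

struct ring {
    unsigned int req_prod, rsp_prod;    /* free-running counters */
    union { request_t req; response_t rsp; } slot[RING_SIZE];
};

/* The front end publishes a request... */
ring->slot[MASK_IDX(ring->req_prod)].req = *req;
wmb();                                  /* write barrier before publishing */
ring->req_prod++;
/* ...and the back end later overwrites the same slot with its response */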
Xen Split Device Driver Model (for PV guests)
Xen delegates hardware support typically to Domain 0, and device drivers typically consist of
four main components:
• The real driver
• The back end split driver
• A shared ring buffer (shared memory pages and events notification)
• The front end split driver
[Diagram: the front-end driver in Domain U talks through a shared ring buffer to the back-end driver and real driver in Domain 0, over the Hypervisor]
Xen HVM functionality
Xen HVM
Hardware Virtual Machines allow unmodified Operating Systems to run on Virtual Environments
This approach brings 2 kinds of problems:
- For the unmodified OS, the VM must appear as a real PC
- Hardware access
- To keep isolation, device emulation must be provided from Domain 0
- Provide direct assignment from a VM to a specific HW
Qemu-dm:
- Every HVM has a qemu-dm counterpart
- Handles networking and disk access from the HVM
- Based on the QEMU project
Virtual BIOS:
- Provides standard start-up
- Composed of 3 payloads:
- Vmxassist: real mode emulator for VMX
- VGA BIOS
- ROM BIOS
Xen QEMU-dm / Virtual firmware interaction
[Diagram: qemu-dm in Domain 0 serving a Domain U HVM guest]
Xen Virtual firmware works as the front end driver in the split driver model
HVM domain creation
Once the domain builder is specified as “hvm”:
1. Allocates and verifies memory for domain
2. Loads the hvmloader as a kernel (setup_guest at xc_hvm_build.c)
3. Initializes hypercalls table and verifies that Xen is active
4. Copies the BIOS image, built from Bochs (tools/firmware/rombios), to 0x000F0000
5. Discovers and sets up PCI devices
6. Loads a VGA BIOS
7. For Intel platforms, loads real-mode emulator for VMX (tools/firmware/vmxassist)
HVM support in Xen
Support for hardware virtualization is done through an abstract interface defined at xen/include/asm-x86/hvm
struct hvm_function_table {
char *name;
void (*disable)(void);
int (*vcpu_initialise)(struct vcpu *v);
void (*vcpu_destroy)(struct vcpu *v);
void (*store_cpu_guest_regs)(struct vcpu *v, struct cpu_user_regs *r, unsigned long *crs);
void (*load_cpu_guest_regs)(struct vcpu *v, struct cpu_user_regs *r);
void (*save_cpu_ctxt)(struct vcpu *v, struct hvm_hw_cpu *ctxt);
int (*load_cpu_ctxt)(struct vcpu *v, struct hvm_hw_cpu *ctxt);
int (*paging_enabled)(struct vcpu *v);
int (*long_mode_enabled)(struct vcpu *v);
int (*pae_enabled)(struct vcpu *v);
int (*interrupts_enabled)(struct vcpu *v);
int (*guest_x86_mode)(struct vcpu *v);
unsigned long (*get_guest_ctrl_reg)(struct vcpu *v, unsigned int num);
unsigned long (*get_segment_base)(struct vcpu *v, enum x86_segment seg);
void (*get_segment_register)(struct vcpu *v, enum x86_segment seg, struct segment_register *reg);
void (*update_host_cr3)(struct vcpu *v);
void (*update_guest_cr3)(struct vcpu *v);
void (*update_vtpr)(struct vcpu *v, unsigned long value);
void (*stts)(struct vcpu *v);
void (*set_tsc_offset)(struct vcpu *v, u64 offset);
void (*inject_exception)(unsigned int trapnr, int errcode, unsigned long cr2);
void (*init_ap_context)(struct vcpu_guest_context *ctxt, int vcpuid, int trampoline_vector);
void (*init_hypercall_page)(struct domain *d, void *hypercall_page);
int (*event_injection_faulted)(struct vcpu *v);
};
Intel VT support in Xen
The hvm_function_table is initialized at xen/arch/x86/hvm/vmx/vmx.c
The following routines save and restore the complete state of a CPU through the VMCS:
.store_cpu_guest_regs = vmx_store_cpu_guest_regs
.load_cpu_guest_regs = vmx_load_cpu_guest_regs
KVM overview
What is KVM?
• It’s a VMM built within the Linux kernel
– The name stands for Kernel-based Virtual Machine
– It is included in mainline Linux, as of 2.6.20
• It offers full-virtualization
– Para-virtualization support is in alpha state
• It works *only* in platforms with hardware-assisted virtualization
– Currently only Intel-VT and AMD-V
– Recently also s390, PowerPC and IA64
• This decision was taken to achieve a simple design
– No need to deal with the ring aliasing problem,
– Nor excessive faulting avoidance
– Nor guest memory management complexity
– Etc
Why KVM?
• Today’s hardware is becoming increasingly complex
– Multiple HW threads on a core
– Multiple cores on a socket
– Multiple sockets on a system
– NUMA memory models (on-chip memory controllers)
• Scheduling and memory management is becoming harder accordingly
• Great effort is required to program all this complexity in hypervisors
– But an operating system kernel already handles this complexity
– So why not reuse it?
• KVM makes use of all the fine-tuning work that has gone (and is going)
into the Linux kernel, applying it to a virtualized environment
• Minimal footprint
– Less than 10K lines of kernel code
– Implemented as a Linux module
How does it work?
• A normal Linux process has two modes of execution: kernel and user
– KVM adds a third mode: guest mode
• A virtual machine in KVM will be “seen” as a normal Linux process
– A portion of code will run in user mode: performs I/O on behalf of the guest
– A portion of code will run in guest mode: performs non-I/O guest code
[Diagram: the guest-mode portion of the process runs the VM with its own 4 privilege rings]
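A hedged sketch of that structure using the /dev/kvm ioctl interface (guest memory setup, register initialization and error handling omitted):

/* Sketch: the userspace half of a KVM virtual machine. The vcpu runs
 * in guest mode inside ioctl(KVM_RUN) and drops back to user mode
 * whenever the guest needs I/O emulated. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
    int kvm  = open("/dev/kvm", O_RDWR);
    int vm   = ioctl(kvm, KVM_CREATE_VM, 0);
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
    /* ... map guest memory, load guest code, set registers ... */
    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);    /* enter guest mode */
        /* back in user mode: inspect the mmap'ed kvm_run structure's
         * exit_reason (e.g. KVM_EXIT_IO) and emulate the requested I/O */
    }
}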
Key features
• Simpler design: Kernel+Userspace (vs. Hypervisor+Kernel+Userspace)
– Avoids many context switches
– Code reuse (today and tomorrow)
– Easy management of VMs (standard process tools)
• Supports Qcow2 and Vmdk disk image formats
– “Growable” formats (copy-on-write)
– A saved state of a VM with X MB of RAM takes less than X MB of file space
• KVM skips RAM sectors mapped by itself
• KVM uses the on-the-fly compression capability of the Qcow2 and VMDK formats
• E.g. the saved state of a Windows VM with 384 MB of RAM occupies ~40 MB
– Discard-on-write capability (reads are made from base image A, writes go to a new image B)
• B will contain the differences from A produced by the VM
• Later, B’s diffs can be merged into A
• Advanced guest memory management
– Increased VM density with KSM (under development)[3]
• KSM is a kernel module to save memory by searching and merging identical pages inside one
or more memory areas
– Balloon driver as in Xen
– Guest’s page swapping allowed
Future trends
• Para-virtualization support (Windows & Linux)
– virtio devices already included in Linux’s mainline as of 2.6.25
• Storage[4]
– Many similar guests cause a lot of duplicate storage
– Current solution: baseline + delta images
• Deltas degrade over time (needs planning)
• Disk-in-file adds overhead
– Future:
• Block-level deduplication
– Filesystem or block device looks for identical blocks ... and consolidates them
– Btrfs being analyzed right now (has snapshots & reverse mappings)
• Hostfs + file-based deduplication
– No more virtual block device. Guest filesystem is a host directory
– Host can carry out file dedup in the background
– Requires changes in guest
• Para-virtualized file systems (9P from IBM Research)[2]
– Easy way to maintain consistency between two guests sharing a block device R/W
– Provide a direct file system proxy mechanism built on top of the native host<->guest I/O
transport, avoiding unnecessary network stack overhead
Future trends (2)
• Containers & Isolation (reduce the impact of one guest on others)
– Memory containers
• Account each page to its container
• Allows preferentially swapping some guests
– I/O accounting (since I/O affects other guests)
• Each I/O in flight is correctly accounted to initiating task
• Important for I/O scheduling
• Device passthrough methods
– Several competing options
• 1:1 mapping with Intel VT-d
• Virtualization-capable devices with PCI SIG Single Root IOV
• PVDMA
• Userspace IRQ delivery
– Still to see which will become mainline
• VMs-AS-FILES
– Cross-hypervisor virtualization containers to allow for transportability of VMs
– OVF: Open Virtual Appliance Format[5]
• Cross platform guest support (QuickTransit technology[6])
– E.g. Solaris for SPARC running on an Intel platform
VMware overview
VMware
In 1998, VMware created a solution to virtualize the x86 platform, creating the market for x86 virtualization
The solution was a combination of binary translation and direct execution on the processor
VMware ESX architecture
Datacenter-class virtualization platform used by many enterprise customers for server consolidation
Runs directly on a physical server, with direct access to the server’s physical hardware
VMware default deployment
[Diagram: a typical deployment; callouts note the primary method of interaction with the virtual infrastructure (console and GUI) and the component that authorizes VirtualCenter Servers and ESX Server hosts appropriately for the licensing agreement]
VMware for free
VMware provides freeware Server and Workstation virtualization solutions
• VMware Server:
– Is a free desktop application that lets you run virtual machines on your Windows or Linux PC
– Lets you use host machine devices, such as CD and DVD drives, from the virtual machine
– A datasheet and FAQ page are available
– Different Virtual Appliances are provided for free
• VMware Player:
– Similar to VMware Server but limited to run pre-built virtual appliances
OpenVZ overview
Operating System virtualization
OpenVZ
• OpenVZ is an open source server virtualization solution that creates multiple isolated
Virtual Private Servers (VPSs) or Virtual Environments (VEs) on a single physical server
• A VPS performs and executes exactly like a stand-alone server for its users and applications;
it can be rebooted independently
• All VPSs have their own set of processes and can run different Linux distributions, but all
VPSs operate under the same kernel
• OpenVZ is the basis of Parallels/Virtuozzo Containers
• Distinctive features:
– Operating System Virtualization
– Network Virtualization
– Resource Management
– Templates
• Installation: http://wiki.openvz.org/Quick_installation
• User documentation: http://download.openvz.org/doc/OpenVZ-Users-Guide.pdf
OpenVZ Kernel
The OpenVZ kernel is a modified Linux kernel which adds the following functionality:
• Virtualization and isolation: enables many virtual environments within a single kernel
• Resource management: subsystem limits (and in some cases guarantees) resources
such as CPU, RAM, and disk space on a per-VE basis
• Live Migration/Checkpointing: a process of “freezing” a VE, saving its complete state
to a disk file, with the ability to “unfreeze” that state later
OpenVZ Kernel Virtualization and Isolation
Each Virtual Environment has its own set of virtualized/isolated resources, such as:
• Files
– System libraries, applications, virtualized /proc and /sys, virtualized locks, etc.
• Users and groups
– Each VE has its own root user, as well as other users and groups.
• Process tree
– A VE sees only its own set of processes, starting from init. PIDs are virtualized, so that the init PID is 1 as it
should be.
• Network
– Virtual network device, which allows the VE to have its own IP addresses, as well as a set of netfilter (iptables)
and routing rules.
• Devices
– Devices are virtualized. In addition, any VE can be granted exclusive access to real devices like network
interfaces, serial ports, disk partitions, etc.
• IPC objects
– Shared memory, semaphores, and messages.
OVZ Resource Management
Resource management subsystem consists of three components:
• Two-level disk quota:
– 1st level: Server administrator can set up per-VE disk quotas in terms of disk space and number of inodes
– 2nd level: VE administrator (VE root) uses standard UNIX quota tools to set up per-user and per-group disk
quotas.
• Two-level “fair” CPU scheduler:
– 1st level: decides which VE to give the time slice to, taking into account the VE’s CPU priority and limit settings
– 2nd level: standard Linux scheduler decides which process in the VE to give the time slice to, using standard
process priorities.
• User Beancounters
– This is a set of per-VE counters, limits, and guarantees
– Set of about 20 parameters which are carefully chosen to cover all the aspects of VE operation, so no single VE
can abuse any resource which is limited for the whole computer and thus cause harm to other VEs
– The resources accounted and controlled are mainly memory and various in-kernel objects such as IPC shared
memory segments, network buffers etc.
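For example, the server administrator might cap a VE at 1 GB of disk space and 200,000 inodes at the first quota level, while the VE’s own root user subdivides that allowance among its users with the standard edquota tool at the second level (the figures are illustrative).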
OpenVZ Checkpointing and live migration
Allows the “live” migration of a VE to another physical server
A “frozen” VE and its complete state is saved to a disk file, then transferred to another
machine
This VE can then be “unfrozen” (restored) there (the whole process takes a few seconds, and
from the client’s point of view it looks not like a downtime, but rather a delay in
processing, since the established network connections are also migrated)
[Diagram: during live migration a Virtual Environment is checkpointed to a disk file on one OpenVZ host, transferred, and restored on the other host]
Xen Terminology – 1/2
Basics
guest operating system: An operating system that can run within the Xen environment.
hypervisor: Code running at a higher privilege level than the supervisor code of its guest operating systems.
virtual machine monitor ("vmm"): In this context, the hypervisor.
domain: A running virtual machine within which a guest OS executes.
domain0 ("dom0"): The first domain, automatically started at boot time. Dom0 has permission to control all hardware on the system, and is used to manage the hypervisor and the other
domains.
unprivileged domain ("domU"): A domain with no special hardware access.
Approaches to Virtualization
full virtualization: An approach to virtualization which requires no modifications to the hosted operating system, providing the illusion of a complete system of real hardware devices.
paravirtualization: An approach to virtualization which requires modifications to the operating system in order to run in a virtual machine. Xen uses paravirtualization but preserves binary
compatibility for user space applications.
Address Spaces
MFN (machine frame number): Real host machine address; the addresses the processor understands.
GPFN (guest pseudo-physical frame number): Guests run in an illusory contiguous physical address space, which is probably not contiguous in the machine address space.
GMFN (guest machine frame number): Equivalent to GPFN for an auto-translated guest, and equivalent to MFN for normal paravirtualised guests. It represents what the guest thinks are
MFNs.
PFN (physical frame number): A catch-all for any kind of frame number. "Physical" here can mean guest-physical, machine-physical or guest-machine-physical.
Page Tables
SPT (shadow page table): shadow version of a guest OS's page table. Useful for numerous things, for instance in tracking dirty pages during live migration.
PAE: Intel's Physical Addressing Extensions, which enable x86/32 machines to address up to 64 GB of physical memory.
PSE (page size extension): used as a flag to indicate that a given page is a huge/super page (2 or 4 MB instead of 4 KB).
x86 Architecture
HVM: Hardware Virtual Machine, which is the full-virtualization mode supported by Xen. This mode requires hardware support, e.g. Intel's Virtualization Technology (VT) and AMD's
Pacifica technology.
VT-x: full-virtualization support on Intel's x86 VT-enabled processors
VT-i: full-virtualization support on Intel's IA-64 VT-enabled processors
Xen Terminology – 2/2
Networking Infrastructure
backend: one half of a communication end point - interdomain communication is implemented using a frontend and backend device model interacting via event channels.
frontend: the device as presented to the guest; other half of the communication endpoint.
vif: virtual interface; the name of the network backend device connected by an event channel to a network front end on the guest.
vethN: local networking front end on dom0; renamed to ethN by xen network scripts in bridging mode (FIXME)
pethN: real physical device (after renaming)
Migration
Live migration: A technique for moving a running virtual machine to another physical host, without stopping it or the services running on it.
Scheduling
BVT: The Borrowed Virtual Time scheduler is used to give proportional fair shares of the CPU to domains.
SEDF: The Simple Earliest Deadline First scheduler provides weighted CPU sharing in an intuitive way and uses realtime algorithms to ensure time guarantees.
Intel privileged instructions
Some of the system instructions (called “privileged instructions”) are protected from use by application
programs. The privileged instructions control system functions (such as the loading of system
registers). They can be executed only when the CPL is 0 (most privileged). If one of these instructions
is executed when the CPL is not 0, a general-protection exception (#GP) is generated. The following
system instructions are privileged instructions (16):
• LGDT — Load GDT register.
• LLDT — Load LDT register.
• LTR — Load task register.
• LIDT — Load IDT register.
• MOV (control registers) — Load and store control registers.
• LMSW — Load machine status word.
• CLTS — Clear task-switched flag in register CR0.
• MOV (debug registers) — Load and store debug registers.
• INVD — Invalidate cache, without writeback.
• WBINVD — Invalidate cache, with writeback.
• INVLPG —Invalidate TLB entry.
• HLT— Halt processor.
• RDMSR — Read Model-Specific Registers.
• WRMSR —Write Model-Specific Registers.
• RDPMC — Read Performance-Monitoring Counter.
• RDTSC — Read Time-Stamp Counter.
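A quick way to observe this protection from user space (a sketch; on Linux the resulting #GP is delivered as a fatal signal, typically SIGSEGV):

/* Sketch: executing a privileged instruction at CPL 3. The CPU raises
 * a general-protection exception and the kernel kills the process. */
int main(void)
{
    __asm__ volatile("hlt");  /* privileged: legal only at CPL 0 */
    return 0;                 /* never reached */
}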
QEMU Description -
http://bellard.org/qemu/
http://bellard.org/qemu/qemu-tech.html
A fast processor emulator using a portable dynamic translator
2 operating modes:
• Full system emulation
• User mode emulation
Generic features:
• User space only or full system emulation
• Using dynamic translation to native code for reasonable speed
• Working on x86 and PowerPC hosts. Being tested on ARM, Sparc32, Alpha and S390
• Self-modifying code support
• Precise exceptions support
• The virtual CPU is a library (libqemu) which can be used in other projects
QEMU full system emulation features:
• QEMU can either use a full software MMU for maximum portability or use the host system call
mmap() to simulate the target MMU
QEMU x86 emulation
QEMU x86 target features:
• Support for 16 bit and 32 bit addressing with segmentation. LDT/GDT and IDT are emulated.
VM86 mode is also supported to run DOSEMU
• Support of host page sizes bigger than 4KB in user mode emulation
• QEMU can emulate itself on x86
References
• Intel® 64 and IA-32 Architectures - Software Developer’s Manual
• http://wiki.xensource.com/xenwiki/XenArchitecture?action=AttachFile&do=get&target=Xen+Architecture_Q1+2008.pdf
• http://wiki.xensource.com/xenwiki/XenArchitecture
• http://www.xen.org/files/xensummit_4/Liguori_XenSummit_Spring_2007.pdf
• http://wiki.xensource.com/xenwiki/XenTerminology
• http://www.xen.org/xen/faqs.html
• http://www.vmware.com/pdf/esx2_performance_implications.pdf
• http://www.vmware.com/files/pdf/VMware_paravirtualization.pdf
• http://download.openvz.org/doc/OpenVZ-Users-Guide.pdf
• http://download.openvz.org/doc/openvz-intro.pdf
• KVM project @ Sourceforge.net
• Paravirtualized file systems, KVM Forum 2008.
• Increasing Virtual Machine density with KSM, KVM Forum 2008.
• Beyond kvm.ko, KVM Forum 2008.
• Open-OVF: an OSS project around the Open Virtual Appliance format, KVM Forum 2008.
• Cross platform guest support, KVM Forum 2008.