amarao August 24, 2010 at 09:19

Modern virtualization features

After recent discussions about which hypervisor is better, I decided to write down what modern virtualization systems can do, without tying it to specific products. This is not a “who is better” comparison; it is an answer to the question “what can be done with virtualization?”: a general overview of the capabilities of production-grade virtualization.

Code execution

Since the hypervisor fully controls the virtual machines, it can control precisely how a machine's code is executed.

Virtualization systems offer several methods of executing code (full emulation is not on the list, since it is not used in production virtualization):

Binary rewriting. This approach is used by VMware and by Connectix Virtual PC (acquired by Microsoft) for hosted virtualization without hardware support. The hypervisor (virtualizer) scans the executable code, marks the instructions that require "virtualization" with breakpoints, and emulates (virtualizes) only those instructions.
Hardware virtualization. An old technology on Alpha and System/360, and a relatively new one on amd64/i386. It introduces something like a ring -1, in which the hypervisor runs and controls the machines through a set of virtualization instructions. The Intel and AMD implementations differ slightly: AMD lets the hypervisor program the processor's memory management hardware to reduce the computational cost of virtualization (nested pages), while Intel implemented the same thing as a separate technology, EPT. It is actively used in production to run "foreign" (unmodified) systems in VMware, Xen HVM, KVM and Hyper-V.
Paravirtualization. Here the kernel of the guest system is "virtualized" at compile time; userspace is left practically unchanged. The guest cooperates with the hypervisor and makes all privileged calls through it. The guest kernel itself runs not in ring 0 but in ring 1, so a misbehaving guest cannot interfere with the hypervisor. It is used to virtualize open-source systems in Xen (PV mode) and OpenVZ (OpenVZ is unique in a sense, because there is simply no "guest" kernel there; it is closer to a jail than to virtualization, although it does provide fairly strong isolation).
Container virtualization. Lets you isolate sets of processes into "containers", each of which can access only the processes in its own container. It sits on the fine line between virtualization, a BSD jail, and simply a well-written system for isolating processes within an OS from each other. All containers share one memory manager and one kernel.
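
The "scan, mark, emulate" idea behind binary rewriting can be illustrated with a toy model. This is only a sketch: the instruction names, the `SENSITIVE` set and the `vm_state` dictionary are invented for illustration and do not correspond to any real instruction set or hypervisor.

```python
# Toy model of binary rewriting. Instruction names are hypothetical.
SENSITIVE = {"cli", "sti", "out", "load_cr3"}

def scan(code):
    """Return the indexes of instructions that must be intercepted."""
    return {i for i, (op, _) in enumerate(code) if op in SENSITIVE}

def emulate(op, arg, vm_state):
    """The hypervisor applies the effect to the virtual machine's state only."""
    if op == "cli":
        vm_state["virtual_if"] = False
    elif op == "sti":
        vm_state["virtual_if"] = True
    elif op == "load_cr3":
        vm_state["virtual_cr3"] = arg
    elif op == "out":
        vm_state["io_log"].append(arg)

def execute_directly(op, arg, vm_state):
    """Stands in for running a harmless instruction natively."""
    if op == "mov":
        vm_state["acc"] = arg
    elif op == "add":
        vm_state["acc"] += arg

def run(code, vm_state):
    """Execute `code`, emulating only the marked instructions."""
    breakpoints = scan(code)
    emulated = 0
    for i, (op, arg) in enumerate(code):
        if i in breakpoints:
            emulate(op, arg, vm_state)
            emulated += 1
        else:
            execute_directly(op, arg, vm_state)
    return emulated
```

The point of the approach is visible in `run`: only the marked instructions pay the emulation cost, everything else runs at native speed.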

In practice, paravirtualized drivers (often called guest tools) are also used with HVM and binary rewriting, because it is in I/O operations that paravirtualization significantly outperforms all other methods.

All hypervisors, without exception, can suspend / pause a virtual machine. In this mode the machine stops running, possibly with its memory saved to disk, and continues working after a "resume".

A common feature is migration: the transfer of a virtual machine from one computer to another. It can be offline (shut down on one computer, started on the other) or online (usually called live migration), without shutting down. In practice it is implemented as a suspend on one machine and a resume on the other, with some optimization of the data transfer: first the memory is transferred while the machine keeps running, then the machine is suspended and the data changed since the migration started is transferred, after which the machine is resumed on the new host.
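
The transfer scheme described above (often called pre-copy) can be sketched as a small simulation. This is a simplified model under stated assumptions, not any hypervisor's actual algorithm: the function name, the `dirty_percent` parameter (the share of just-sent pages that the guest dirties again during a round) and the stopping thresholds are all hypothetical.

```python
def live_migrate(num_pages, dirty_percent, stop_threshold=8, max_rounds=10):
    """Simulate pre-copy migration.

    Returns (rounds, pages_sent, downtime_pages), where downtime_pages is
    the number of pages copied while the guest is suspended.
    """
    remaining = num_pages          # the first round copies all memory
    sent = 0
    rounds = 0
    # Keep copying while the guest runs, until little enough is left
    # or we give up on convergence.
    while remaining > stop_threshold and rounds < max_rounds:
        rounds += 1
        sent += remaining
        # Pages dirtied during this round must be sent again.
        remaining = remaining * dirty_percent // 100
    # Stop-and-copy: suspend the guest, send the rest, resume on the new host.
    downtime_pages = remaining
    sent += downtime_pages
    return rounds, sent, downtime_pages
```

With `live_migrate(1000, 10)` the rounds shrink quickly and only one page is copied during the pause. With `dirty_percent=100` the loop never converges and the whole memory ends up being copied while the guest is suspended, which is why real systems cap the number of rounds.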

Xen also promises (and seems to have almost brought to production) a technology for keeping one running machine replicated across two or more hosts (Remus), which lets the virtual machine keep working without interruptions or reboots if one of the servers fails.

Memory management

The classical virtualization model allocates a fixed amount of memory to the guest machine; it can be changed only after the machine is shut down.

Modern systems can change the amount of RAM available to a guest, manually or automatically, while it is running.

The following memory management methods are available:

Ballooning. A widely adopted mechanism (Xen and Hyper-V at least; VMware seems to have it too). The idea is simple: a special module in the guest requests memory from the guest OS and hands it over to the hypervisor. When needed, it takes memory back from the hypervisor and returns it to the guest OS. The main benefit is the ability to reclaim a virtual machine's idle memory.
Memory hot-plug. Adding memory on the fly to a live server without rebooting, similar to hardware hot-plug. It is supported in Hyper-V in the upcoming service pack for Windows Server (not yet released) and in Xen 4.1 (with Linux 2.6.32). An alternative is a pre-inflated balloon: the hypervisor starts the machine with an already non-empty balloon, which can be deflated as needed. Memory hot-unplug exists so far only in Linux (I cannot yet say whether it works). Most likely Microsoft will add unplug support to Windows in the near future.
Shared memory. Specific to OpenVZ: memory is taken from a common pool, and virtual machines are constrained only by an artificial limit, a number that can be changed on the fly. This is the most flexible mechanism, but it is specific to OpenVZ and has some unpleasant memory-related side effects.
Memory compression. The guest's memory is compressed (with compression algorithms), which in some cases yields some extra capacity. The price: read and write latency, and processor load on the hypervisor side.
Page deduplication. If memory pages are identical, they are not stored twice; one becomes a reference to the other. It works well when several virtual machines run the same software (in the same versions) at the same time: code sections match and are deduplicated. It is ineffective for data, and the picture is further spoiled by disk caches, which each machine keeps for itself (and which tend to take up all free memory). And of course checking for duplicates (computing a hash of each memory page) is not free.
NUMA: the ability to extend a virtual machine's memory beyond what the server itself has. The technology is raw and not quite mainstream (I have not dug deeply, so I will not say more).
Memory overcommitment / memory oversell: promising virtual machines more memory than actually exists (for example, promising 10 virtual machines 2 GB each with only 16 GB available). It is based on the idea that in normal operation no virtual machine uses 100% of its memory.
Shared swap, which allows the memory of virtual machines to be partially paged out to disk.
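
Of the methods above, page deduplication is the easiest to show in code. The following is a minimal sketch of hash-based deduplication with reference counting; the `DedupStore` class and its method names are hypothetical, and real hypervisors organize this quite differently (for example, by scanning memory in the background).

```python
import hashlib

class DedupStore:
    """Stores each distinct page once; guests only hold references."""

    def __init__(self):
        self.pages = {}      # digest -> page contents
        self.refcount = {}   # digest -> number of guest pages pointing here
        self.mapping = {}    # (vm, page_no) -> digest

    def write(self, vm, page_no, data):
        # Writing never modifies a shared page in place (copy-on-write):
        # the old reference is dropped and a new one is created.
        self.release(vm, page_no)
        digest = hashlib.sha256(data).digest()   # this hashing is the CPU cost
        stored = self.pages.get(digest)
        if stored is None:
            self.pages[digest] = bytes(data)
        elif stored != data:
            # A hash match is confirmed byte by byte; a mismatch would be
            # a collision, which this sketch does not handle.
            raise RuntimeError("hash collision")
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        self.mapping[(vm, page_no)] = digest

    def release(self, vm, page_no):
        digest = self.mapping.pop((vm, page_no), None)
        if digest is None:
            return
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:
            del self.refcount[digest]
            del self.pages[digest]

    def read(self, vm, page_no):
        return self.pages[self.mapping[(vm, page_no)]]
```

Two guests that write the same 4 KB page end up sharing a single stored copy; as soon as one of them writes something different, it gets its own page and the other guest's view is unchanged.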

Peripherals

Some hypervisors give virtual machines access to real hardware (with different devices going to different virtual machines).

They can also emulate hardware, including hardware the computer does not have. The most important devices, the network adapter and the disk, are covered separately. The rest include video adapters (even with 3D), USB, serial / parallel ports, timers and watchdogs.

One of the following technologies is used for this:

Emulation of real devices (slow).
Direct access to the device: passing it through to the guest machine (passthrough).
IOMMU (hardware translation of the addresses used for DMA, which makes it possible to divide the RAM used by devices between virtual machines). Intel calls it VT-d (not to be confused with VT-x, a.k.a. Vanderpool, which is the processor's ring -1 technology).
Paravirtual devices that implement a minimum of functionality (essentially two main classes, block devices and network devices, discussed below).
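
What an IOMMU does for DMA can be modeled in a few lines. This is a toy illustration, not a description of VT-d: the `ToyIommu` class, the 4 KB page size and the device name used in the example are assumptions made for the sketch.

```python
PAGE = 4096

class ToyIommu:
    """Per-device translation of DMA addresses to host physical addresses."""

    def __init__(self):
        # device id -> {device-visible page number: host page number}
        self.tables = {}

    def map(self, device, dev_addr, host_addr):
        self.tables.setdefault(device, {})[dev_addr // PAGE] = host_addr // PAGE

    def translate(self, device, dev_addr):
        host_page = self.tables.get(device, {}).get(dev_addr // PAGE)
        if host_page is None:
            # The device tried to reach memory that was never mapped for it,
            # e.g. memory that belongs to another virtual machine.
            raise PermissionError(f"DMA fault: {device} at {dev_addr:#x}")
        return host_page * PAGE + dev_addr % PAGE
```

A device passed through to one guest can only reach the pages mapped in its own table, so a DMA request outside them fails instead of landing in another guest's memory.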

Network devices

Network devices are typically implemented at either layer 3 or layer 2. The virtual network interface that is created has two ends: one in the virtual machine and one in the hypervisor / control domain / virtualization program. Traffic from the guest is passed to the host unchanged (with no retransmissions, speed negotiation and the like). And then quite significant difficulties begin.

At present, apart from systems that emulate the network interface at layer 3 (the IP address level; OpenVZ, for example), all systems provide the following set of features:

Bridging (layer 2) the interface with one or more physical interfaces.
Creating a local network between the host and one or more guests with no access to the real network. Notably, such a network exists purely virtually and is not tied to any "live" network.
Routing / NAT of the guest's traffic. A special case of the previous method, with routing (and NAT / PAT) enabled for the virtual interface.
Encapsulating traffic in GRE / VLAN and sending it to hardware switches / routers.

Some virtualization systems distinguish between bridging a virtual machine's network interface to a physical network interface and having a virtual switch.

In general, the virtual machine network is a particular headache during migration. All existing production systems with interface bridging allow transparent live migration only within a single network segment, and need special tricks (fake ARP) to notify upstream switches that the port has changed so that traffic is redirected.
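
One common form of such an ARP notification is a gratuitous ARP: a broadcast frame in which the machine announces its own IP address, so switches learn that its MAC address now lives behind a different port. Below is a sketch that only builds the frame bytes; the function name is hypothetical, and actually sending the frame (which needs a raw socket and privileges) is left out.

```python
import socket
import struct

def gratuitous_arp(mac, ip):
    """Build an Ethernet frame carrying a gratuitous ARP request for `ip`."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, our source MAC, EtherType ARP.
    ethernet = b"\xff" * 6 + mac_bytes + struct.pack("!H", 0x0806)
    # ARP header: hardware type Ethernet (1), protocol IPv4 (0x0800),
    # address lengths 6 and 4, operation 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += mac_bytes + ip_bytes        # sender hardware and protocol address
    arp += b"\x00" * 6 + ip_bytes      # target: unknown MAC, the same IP
    return ethernet + arp
```

The frame is 42 bytes long (14 bytes of Ethernet header plus 28 bytes of ARP); the sender and target IP fields are equal, which is what makes the ARP gratuitous.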

A rather interesting system has recently been developed: Open vSwitch, which lets you delegate the task of determining a packet's path to an OpenFlow controller and may significantly extend what virtual networks can do. However, OpenFlow and Open vSwitch are a bit off topic (I will try to cover them a bit later).

Disk (block) devices

This is the second critical area for virtual machines. A hard disk (more precisely, a block device for storing data) is the second, and perhaps even the first, most important component of virtualization. Disk subsystem performance is critical when evaluating a virtualization system: a large overhead on processor and memory is tolerated more easily than overhead on disk operations.

Modern virtualization systems offer several approaches. The first is to give the virtual machine a ready-made file system; the overhead then tends to zero (this is specific to OpenVZ). The second is to emulate a block device (without frills such as SMART and SCSI commands). A virtual machine's block device is bound either to a physical device (a disk, a partition, an LVM logical volume) or to a file (via a loopback device or by directly emulating block operations "inside" the file).
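
Emulating block operations "inside" a file comes down to turning a sector number into a file offset. A minimal sketch, assuming 512-byte sectors; the `FileBlockDevice` class is hypothetical and works with any seekable binary file object.

```python
import io

SECTOR = 512

class FileBlockDevice:
    """A block device whose sectors live at offset lba * SECTOR in a file."""

    def __init__(self, backing, sectors):
        self.backing = backing    # seekable binary file object
        self.sectors = sectors    # device size in sectors

    def _check(self, lba, count):
        if lba < 0 or count < 0 or lba + count > self.sectors:
            raise ValueError("access beyond end of device")

    def read(self, lba, count=1):
        self._check(lba, count)
        self.backing.seek(lba * SECTOR)
        data = self.backing.read(count * SECTOR)
        # Areas of the file that were never written read back as zeros.
        return data.ljust(count * SECTOR, b"\x00")

    def write(self, lba, data):
        if len(data) % SECTOR:
            raise ValueError("data must be a whole number of sectors")
        self._check(lba, len(data) // SECTOR)
        self.backing.seek(lba * SECTOR)
        self.backing.write(data)
```

In the example below an in-memory `io.BytesIO` stands in for the image file:

```python
disk = FileBlockDevice(io.BytesIO(), sectors=8)
disk.write(2, b"A" * SECTOR)
first = disk.read(2)     # the sector just written
empty = disk.read(5)     # a never-written sector, all zeros
```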

An additional capability is the use of network storage by the hypervisor. In that case migration is very simple: the machine is paused on one host and resumed on the other, without any disk data being transferred between hosts.

Moreover, provided the underlying block device supports it (LVM, a file), most systems can resize a virtual block device on the fly. This is very convenient on the one hand, but on the other hand guest OSs are not yet prepared for it at all. And of course all systems support adding and removing block devices themselves on the fly.

Deduplication is usually left to the underlying block device provider, although OpenVZ, for example, can use a "container template" in copy-on-write mode, and XCP can build chains of block devices that depend on each other via copy-on-write. This reduces performance on the one hand, but on the other hand can save a significant amount of space. And of course many systems can allocate disk space on demand (VMware and XCP, for example): the file backing a block device is created sparse (or has a special format that supports "skipping" empty regions).
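
A copy-on-write chain with on-demand allocation can be sketched as layers that store only the blocks written to them. This is an illustrative model, not the on-disk format of XCP or any other product; the `CowLayer` class and the 4096-byte default block size are assumptions.

```python
class CowLayer:
    """One link in a copy-on-write chain of virtual disks."""

    def __init__(self, parent=None):
        self.parent = parent
        self.blocks = {}     # block number -> data written in this layer only

    def read(self, block_no, block_size=4096):
        # Walk down the chain until some layer has the block. The longer
        # the chain, the slower the read: this is the performance cost.
        layer = self
        while layer is not None:
            if block_no in layer.blocks:
                return layer.blocks[block_no]
            layer = layer.parent
        # Never written anywhere: a hole in a sparse disk reads as zeros.
        return b"\x00" * block_size

    def write(self, block_no, data):
        # Writes always land in the top layer; parents stay untouched.
        self.blocks[block_no] = bytes(data)

    def allocated_blocks(self):
        return len(self.blocks)
```

Several guests can be created on top of one base image (`CowLayer(base)`); each of them consumes space only for the blocks it has actually changed.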

Disk access can be controlled by speed and by the priority of one device (or virtual machine) relative to another. VMware has announced a nice capability to control the number of I / O operations, keeping service latency low for all guests by slowing down the greediest of them.
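
Limiting the number of I / O operations is commonly done with a token bucket. The sketch below shows that general technique and is not a description of how VMware implements it; the `IopsLimiter` class and its parameters are hypothetical. Time is passed in explicitly so the behavior is easy to follow.

```python
class IopsLimiter:
    """Token bucket: `iops` requests per second, bursts up to `burst`."""

    def __init__(self, iops, burst):
        self.rate = iops              # tokens added per second
        self.capacity = burst         # maximum tokens that can accumulate
        self.tokens = float(burst)
        self.last = 0.0               # time of the previous call, in seconds

    def allow(self, now):
        """Return True if one I/O request may be issued at time `now`."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # the guest's request has to wait
```

Giving each guest its own limiter means a greedy guest exhausts only its own bucket, while the others keep getting served with low latency.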

Disk devices can be shared between several guests (with file systems designed for this, such as GFS), which makes it easy to build clusters with shared storage.

Since the hypervisor fully controls the guest's access to storage, it becomes possible to take snapshots of disks (and of virtual machines themselves) and to build a tree of snapshots (recording which was derived from which), with easy switching between them (these snapshots usually include the state of the virtual machine's memory as well).

Backups are implemented in a similar way. The simplest way is to copy the disk of the system being backed up: it is an ordinary volume, file or LVM logical volume, which is easy to copy, even on the fly. For Windows, there is usually a way to notify the shadow copy service that it should prepare for a backup.

Interaction between the hypervisor and the guest

Some systems provide a communication mechanism between the guest and the hypervisor (more precisely, the controlling OS) that lets information be exchanged regardless of whether the network is working.

There are experimental developments (not production-ready) on guest "self-migration".

Cross compatibility

Work is underway to standardize interoperability between hypervisors. For example, the XVA format has been proposed as a platform-independent format for exporting / importing virtual machines. The VHD format could claim to be universal, were it not for several incompatible formats sharing the same extension.

Most virtualization systems can "convert" competitors' virtual machines. (However, I have not seen a single live-migration system that could migrate a machine between different systems on the fly, nor even any drafts on the topic.)

Accounting

Most hypervisors provide some mechanism for assessing host load (showing current values and their history). Some can account precisely for consumed resources as absolute numbers of ticks, IOPS, megabytes, network packets and so on (as far as I know, only Xen has this, and only as undocumented features).

Pooling and management

Most latest-generation systems let you combine several virtualization hosts into a single structure (cloud, pool, etc.), either by providing the infrastructure for load management or by providing a ready-made load management service across all servers in the infrastructure. This is done, first, by automatically choosing where to start the next machine and, second, by automatically migrating guests to load the hosts evenly. Basic fault tolerance (high availability) is supported as well when shared network storage is used: if a host with a bunch of virtual machines dies, those virtual machines are started on other hosts in the infrastructure.
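
The placement and restart logic can be sketched as follows. This is a deliberately naive model: the `Pool` class is hypothetical, memory is the only resource considered, and restarting a machine on another host assumes its disks are on shared network storage, as described above.

```python
class Pool:
    """A pool of hosts with 'most free memory first' placement."""

    def __init__(self, hosts):
        self.capacity = dict(hosts)   # host name -> total memory, MB
        self.vms = {}                 # vm name -> (host name, memory in MB)

    def free(self, host):
        used = sum(mem for h, mem in self.vms.values() if h == host)
        return self.capacity[host] - used

    def start(self, vm, mem):
        """Start `vm` on the host with the most free memory that fits it."""
        fitting = [h for h in self.capacity if self.free(h) >= mem]
        if not fitting:
            raise RuntimeError(f"no host can fit {vm}")
        host = max(fitting, key=self.free)
        self.vms[vm] = (host, mem)
        return host

    def host_failed(self, host):
        """Drop a dead host and restart its machines elsewhere (simple HA)."""
        del self.capacity[host]
        orphans = [(vm, mem) for vm, (h, mem) in self.vms.items() if h == host]
        for vm, _ in orphans:
            del self.vms[vm]
        return {vm: self.start(vm, mem) for vm, mem in orphans}
```

Automatic load balancing by live migration would be a third method on top of this: pick a guest on the most loaded host and move it to the least loaded one.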

If I have missed essential features of any of the systems, tell me and I will add them.
