
How to speed up a container: tuning OpenVZ

OpenVZ is an open-source implementation of container virtualization for the Linux kernel: on a single system running the OpenVZ kernel you can run many virtual environments with various Linux distributions inside. Because the virtualization happens at the kernel level rather than at the hardware level, OpenVZ outperforms traditional hypervisor-based virtualization on a number of metrics: density, elasticity, RAM requirements, response time and so on; published performance comparisons of OpenVZ with hypervisor systems illustrate this. Beyond that, Linux and OpenVZ offer a great many fine-tuning options.
In this article we will look at non-trivial configuration options for OpenVZ containers that can improve the performance of the entire OpenVZ system.
General settings

To limit the physical memory available to a container, just specify the --ram option, for example:
# vzctl set 100 --ram 4G --save
Underestimated limits in a container very often lead to memory allocation failures somewhere in the kernel or in an application, so when using containers it is extremely useful to monitor the contents of /proc/user_beancounters. Nonzero values in the failcnt column mean that some limit is too small and you need to either reduce the working set in the container (for example, reduce the number of apache or postgresql server processes), or raise the memory limit with the --ram option. For convenient monitoring of /proc/user_beancounters you can use the vzubc utility, which can, for example, show only counters that are close to failing, or refresh the readings periodically (a top-like mode); see its documentation for details.
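If you just want a quick look at which counters have ever hit their limits, you can also filter /proc/user_beancounters directly; a minimal sketch, assuming the standard layout in which failcnt is the last column:
# awk 'NR > 2 && $NF > 0' /proc/user_beancounters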
In addition to the limit on physical memory, it is recommended to set a limit on the container's swap size. To adjust the size of the container's swap, use the --swap option of the vzctl command:
# vzctl set 100 --ram 4G --swap 8G --save
The sum of the --ram and --swap values is the maximum amount of memory the container can use. Once the container reaches its --ram limit, memory pages belonging to the container's processes begin to be pushed into the so-called "virtual swap" (VSwap). No real disk I/O takes place; instead, the container's performance is artificially lowered to create the effect of real swapping.
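To verify that the limits are visible from inside the container, you can check the memory totals reported there (CTID 100 as in the example above); with VSwap the figures should match the configured --ram and --swap values:
# vzctl exec 100 free -m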
When configuring containers, it is recommended that the sum of ram + swap over all containers not exceed ram + swap on the host node. You can check your settings with the vzoversell utility.
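If vzoversell is not at hand, the same check can be approximated by summing the configured limits yourself; a rough sketch, assuming the standard vzlist output fields physpages.l and swappages.l (limits are reported in 4 KB pages) and that all containers of interest are running:
# vzlist -H -o physpages.l,swappages.l | awk '{ram += $1; swap += $2} END {printf "%.1f GB ram + %.1f GB swap configured\n", ram*4/1024/1024, swap*4/1024/1024}'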
To control the maximum number of processors available for a container, you must use the --cpus option, for example:
# vzctl set 100 --cpus 4 --save
When a new container is created, its number of processors is not limited and it will use all available CPU resources of the server. Therefore, on systems with several containers it makes sense to limit the number of processors of each container according to the tasks assigned to it. It can also be useful to limit the CPU in percent (or in megahertz) with the --cpulimit option, and to manage weights, i.e. container priorities, with the --cpuunits option.
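For example (the CTID and the numbers are purely illustrative), you could cap a container at half of one CPU and lower its scheduling weight like this:
# vzctl set 100 --cpulimit 50 --save
# vzctl set 100 --cpuunits 500 --save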
Memory overcommit
The OpenVZ kernel allows the containers, taken together, to be granted more memory than the amount of physical memory available on the host. This situation is called memory overcommit, and in this case the kernel manages memory dynamically, balancing it between containers: the memory the containers are allowed to use will not all be required by them at once, so the kernel can dispose of it at its discretion. With memory overcommitted, the kernel will efficiently manage the various caches (page cache, dentry cache) and try to shrink them in containers in proportion to the configured memory limits. For example, if you want to isolate the services of a heavily loaded web site consisting of a front end, a back end and a database for the sake of better security, you can put them into separate containers and set each container's memory limit equal to the full amount of RAM on the host (128 GB in this example):
# vzctl set 100 --ram 128G --save
# vzctl set 101 --ram 128G --save
# vzctl set 102 --ram 128G --save
In this case, from the point of view of memory management, the situation will not differ from when your services were running directly on the host, and memory balancing will remain as efficient as possible. You do not need to think about which container should get more memory: the front-end container, for a larger page cache for the web site's static data, or the database container, for a larger page cache for the database itself. The kernel balances everything automatically.
Although overcommit lets the kernel balance memory between containers as efficiently as possible, it also has certain unpleasant properties. When the total amount of allocated anonymous memory, that is, the combined working set of all processes in all containers, approaches the total memory size of the host, an attempt by some process or by the kernel to allocate new memory will trigger a global "out of memory" condition and the OOM killer will kill one of the processes in the system. To check whether such events have occurred on the host or in a container, you can use the command:
# dmesg | grep oom
[3715841.990200] 645043 (postmaster) invoked oom-killer in ub 600 generation 0 gfp 0x2005a
[3715842.557102] oom-killer in ub 600 generation 0 ends: task died
It is important to note that the OOM killer does not necessarily kill a process in the same container whose process was trying to allocate memory. To control the OOM killer's behavior, use the command:
# vzctl set 100 --oomguarpages 2G --save
This sets a guarantee: the container's processes will not be killed by the OOM killer as long as the container's memory consumption stays within the specified limit. For containers running vital services you can therefore set this guarantee equal to the container's memory limit.
CPU overcommit
Just as with memory, CPU overcommit allows you to grant the containers, in total, more processors than the number of logical processors on the host. And just as with memory, CPU overcommit lets you achieve the best overall system performance. For example, with the same web server split into three containers for the front end, back end and database, you can leave the number of CPUs of each container unlimited and achieve maximum total system throughput. Again, the kernel itself will decide which containers' processes get processor time and when.
Unlike memory, the processor is an "elastic" resource: a lack of CPU does not lead to any exceptions or errors in the system, only to a slowdown of some running processes. This makes CPU overcommit a safer way to squeeze more performance out of the system than memory overcommit. The only negative effect of CPU overcommit is a possible violation of fairness in distributing processor time between containers, which can be bad, for example, for VPS hosting customers, who may receive less than the processor time they paid for. To keep the distribution of processor time fair, set the "weights" of the containers according to the paid processor time with the --cpuunits option of the vzctl command.
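For example (illustrative CTIDs and values), if one customer has paid for twice the CPU share of another, the weights can reflect that; under contention container 101 will then receive roughly twice as much processor time as container 100:
# vzctl set 100 --cpuunits 1000 --save
# vzctl set 101 --cpuunits 2000 --save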
Optimization of containers on NUMA hosts
When containers run on a host with NUMA (Non-Uniform Memory Access), it can happen that a container's processes run on one NUMA node while part (or all) of their memory was allocated on another NUMA node. In that case every memory access is slower than access to memory on the local NUMA node, and the slowdown depends on the distance between the nodes. The Linux kernel tries to avoid this situation, but to guarantee that a container runs on its local NUMA node you can set a CPU mask for each container, restricting the set of processors on which the container's processes are allowed to run.
You can view the NUMA nodes available on the host using the numactl command:
# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 16351 MB
node 0 free: 1444 MB
node 1 cpus: 4 5 6 7
node 1 size: 16384 MB
node 1 free: 10602 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
In this example, there are two NUMA nodes on the host, each of which has 4 processor cores and 16GB of memory.
To set the restriction on the set of processors for the container, use the vzctl command:
# vzctl set 100 --cpumask 0-3 --save
# vzctl set 101 --cpumask 4-7 --save
In this example we allowed container 100 to run only on processors 0 through 3, and container 101 only on processors 4 through 7. Keep in mind that if some process of, say, container 100 has already allocated memory on NUMA node 1, every access to that memory will still be slower than access to local memory. Therefore it is recommended to restart the containers after running these commands, as shown below.
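A restart re-creates the container's processes, so their memory is allocated afresh on the now-local NUMA node:
# vzctl restart 100
# vzctl restart 101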
It is worth noting that the new vzctl 4.8 release adds the --nodemask option, which lets you pin a container to a specific NUMA node without listing that node's processors, operating only with the NUMA node number.
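With the host layout shown in the numactl output above, pinning the same two containers to their NUMA nodes could then look like this (CTIDs are illustrative):
# vzctl set 100 --nodemask 0 --save
# vzctl set 101 --nodemask 1 --save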
Keep in mind that this approach limits the process scheduler's ability to balance the load across the system's processors, which, in the case of a large CPU overcommit, can itself lead to a slowdown.
Controlling fsync behavior in containers
As you know, to guarantee that data reaches the disk, an application must issue the fsync() system call for each modified file. This call writes the file's data from the write-back cache to the disk and initiates flushing of data from the disk's own cache to the permanent non-volatile medium. Even if the application writes to disk bypassing the write-back cache (so-called Direct I/O), this system call is still needed to guarantee that data is flushed from the disk cache itself.
Frequent fsync() calls can significantly slow down the disk subsystem: an average hard drive is capable of only 30-50 syncs per second.
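To get a feel for this cost on your own hardware, you can force a data sync after every block and look at the throughput dd reports; a rough sketch (the file name and sizes are arbitrary):
# dd if=/dev/zero of=/tmp/fsync-test bs=4k count=1000 oflag=dsync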
Moreover, it is often known in advance that for all or some containers such strict write guarantees are not needed, and losing some data in the event of a hardware failure is not critical. For such cases, the OpenVZ kernel provides the ability to ignore fsync()/fdatasync()/sync() requests for all or some containers. The kernel behavior is configured via the /proc/sys/fs/fsync-enable file; an example follows the value lists below. Possible values of this file if configured on the host node (global settings):
0 (FSYNC_NEVER): fsync()/fdatasync()/sync() requests from containers are ignored
1 (FSYNC_ALWAYS): fsync()/fdatasync()/sync() requests from containers work as usual; data of all inodes on all file systems of the host machine will be written
2 (FSYNC_FILTERED): fsync()/fdatasync() requests from containers work as usual; sync() requests from containers affect only that container's files (default value)
Possible values of this file if configured inside a specific container:
0: fsync()/fdatasync()/sync() requests from this container are ignored
2: use the global setting configured on the host node (default value)
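As a sketch of how these settings can be combined (CTID 100 is illustrative): keep the default filtered behavior globally on the host and disable fsync() only inside a container whose data you can afford to lose.
On the host:
# echo 2 > /proc/sys/fs/fsync-enable
Inside container 100:
# vzctl exec 100 'echo 0 > /proc/sys/fs/fsync-enable'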
Although these settings can significantly speed up the server's disk subsystem, they must be used carefully and selectively, because disabling fsync() can lead to data loss in the event of a hardware failure.
Controlling Direct I/O behavior in containers
By default, writes to all files opened without the O_DIRECT flag go through the write-back cache. This not only reduces the latency of writing data to disk for the application (the write() system call returns as soon as the data has been copied into the write-back cache, without waiting for the actual write to disk), but also allows the kernel I/O scheduler to distribute disk resources between processes more efficiently by grouping I/O requests from applications.
At the same time, some categories of applications, databases for example, manage the writing of their data efficiently themselves, issuing large sequential I/O requests. Such applications therefore often open files with the O_DIRECT flag, which tells the kernel to write data to such a file bypassing the write-back cache, directly from the user application's memory. In the case of a single database running on the host, this approach is more efficient than writing through the cache, since the I/O requests from the database are already optimally arranged and there is no need for an extra memory copy from the user application into the write-back cache.
If several containers on the same host work with databases, this assumption no longer holds, because the I/O scheduler in the Linux kernel cannot optimally distribute disk resources between applications that use Direct I/O. Therefore, in the OpenVZ kernel Direct I/O is disabled for containers by default, and all data is written through the write-back cache. This adds a small overhead in the form of an extra memory copy from the user application into the write-back cache, but allows the kernel I/O scheduler to allocate disk resources more efficiently.
If you know in advance that this situation will not arise on your host, you can avoid the extra overhead and allow the use of Direct I/O for all or some of the containers. The kernel behavior is configured via the /proc/sys/fs/odirect_enable file; an example follows the value lists below. Possible values of this file if configured on the host node (global settings):
0: the O_DIRECT flag is ignored for containers, all writes go through the write-back cache (default value)
1: the O_DIRECT flag in containers works as usual
Possible values of this file if configured inside a specific container:
0: the O_DIRECT flag is ignored for this container, all writes go through the write-back cache
1: the O_DIRECT flag for this container works as usual
2: use the global setting (default value)
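Analogously to the fsync example above (CTID 100 is illustrative), O_DIRECT can be honored either globally or only for a single container that runs its own database.
On the host, for all containers:
# echo 1 > /proc/sys/fs/odirect_enable
Or only for container 100:
# vzctl exec 100 'echo 1 > /proc/sys/fs/odirect_enable'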
Conclusion
In general, the Linux kernel, and OpenVZ in particular, provide a large number of options for fine-tuning performance to specific workloads. OpenVZ-based virtualization makes it possible to get the highest performance thanks to flexible resource management and its various settings. In this article we have covered only a small part of the container-specific settings. In particular, I did not go into detail about how the three parameters CPUUNITS/CPULIMIT/CPUS affect each other, but I am ready to explain this and much more in the comments.
For more information, read the vzctl man page and the many resources available on the Internet, for example openvz.livejournal.com.