The daily routine of cloud development, part one
I will continue the series of official blog posts on how to work with the cloud, but in parallel I want to talk about the problems we ran into while adapting Xen Cloud Platform to our cloud model. These posts will be somewhat more involved and assume that the reader knows, at least in general terms, how Xen works.
When the concept of “pay per use” was just taking shape and I was frantically searching for a way to measure consumption, it seemed to me that processor and memory were the two easiest resources to account for.
Indeed, we have xencontrol (the Xen hypervisor management library), which can tell you precisely, for each domain (a running virtual machine), how much memory it has and how many nanoseconds of CPU time it has consumed. This library requests the information directly from the hypervisor (via xenbus) and processes it only minimally.
This information looks something like this (output of the xencontrol binding for Python):
{ 'paused': 0, 'cpu_time': 1038829778010L, 'ssidref': 0, 'hvm': 0, 'shutdown_reason': 0, 'dying': 0, 'mem_kb': 262144L, 'domid': 3, 'max_vcpu_id': 7, 'crashed': 0, 'running': 0, 'maxmem_kb': 943684L, 'shutdown': 0, 'online_vcpus': 8, 'handle': [148, 37, 12, 110, 141, 24, 149, 226, 8, 104, 198, 5, 239, 16, 20, 25], 'blocked': 1 }
As we can see, there is a mem_kb field holding the memory allocated to the virtual machine and a cpu_time field containing some mind-blowing number (which in reality is only about 17 minutes). cpu_time is counted in nanoseconds (more precisely, the stored value is expressed in nanoseconds, while the actual accuracy is on the order of microseconds). Memory, as the name suggests, is in kilobytes (although the internal accounting unit of both the hypervisor and the Linux kernel is the page, 4 kilobytes by default).
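To make the units concrete, here is a minimal sketch that converts the two fields from the sample output into minutes and pages. The helper names `cpu_time_minutes` and `mem_pages` are made up for illustration and are not part of xencontrol; the 4-kilobyte page size is the default mentioned above.

```python
# Values copied from the sample output above; Python 3 integers need no `L` suffix.
sample = {"cpu_time": 1038829778010, "mem_kb": 262144, "maxmem_kb": 943684}

NS_PER_SECOND = 10**9
PAGE_KB = 4  # default page size mentioned in the text


def cpu_time_minutes(cpu_time_ns):
    """Convert the nanosecond CPU counter to minutes (hypothetical helper)."""
    return cpu_time_ns / NS_PER_SECOND / 60


def mem_pages(mem_kb, page_kb=PAGE_KB):
    """Convert kilobytes to whole pages (hypothetical helper)."""
    return mem_kb // page_kb


print(f"{cpu_time_minutes(sample['cpu_time']):.1f} minutes of CPU time")
print(mem_pages(sample["mem_kb"]), "pages of memory")
```

For the sample domain this gives a little over 17 minutes of CPU time, which matches the estimate in the text.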
It would seem you can just take the numbers and count. However, the devil is in the details ...
Sorry for the tabloid-style heading below, but that is exactly how the question was put during the debate on one of the problems:
Hyper-threading + Xen = theft of money from customers
The question: should hyper-threading be enabled on Xen hosts or not? Enabling it lets you offer the client more cores. With two Xeons of 4 cores each, this gives 16 cores, or 14 if we reserve a couple for our own needs. Is 14 cooler than 8? It is. So 14 it should be.
In addition, a multithreaded application will compute one and a half times faster on 14 "fake" cores than on 7 real ones. This is true; I verified it with tests on bare Linux on one of the test servers.
... stop. One and a half times? So with twice as many cores we get only one and a half times the performance?
Yes, that is exactly what happens. The situation became especially dramatic when I ran a number cruncher in one virtual machine at maximum load, on a task that obviously could not be finished for several hours, and in another virtual machine launched an application that solves a computational task of fixed size. Then I compared how much time the calculation took with hyper-threading off and on.
Load level | Processor time counted without HT | Processor time counted with HT |
neighbors idle, 1 core | 313.758 | 313.149 |
neighbors idle, 4 cores | 79.992 * 4 | 80.286 * 4 |
neighbors idle, 8 cores | 40.330 * 8 | 40.240 * 8 |
neighbors idle, 16 cores | - | 29.165 * 16 |
fully loaded neighbors, 1 core | 313.958 | 469.510 |
fully loaded neighbors, 4 cores | 79.812 * 4 | 119.33 * 4 |
fully loaded neighbors, 8 cores | 40.119 * 8 | 59.376 * 8 |
fully loaded neighbors, 16 cores | - | 29.634 * 16 |
It is easy to see that 29.6 * 16 is 473.6, practically the same as 59.376 * 8 (475). And that is one and a half times more than 40.119 * 8 (321).
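The arithmetic can be checked with a few lines of Python. This is only a sketch over the "fully loaded neighbors" rows of the table; the names `counted_time` and `slowdown` are invented for the example.

```python
# "Fully loaded neighbors" rows of the table: (time per core, number of cores).
HT_OFF_8 = (40.119, 8)
HT_ON_8 = (59.376, 8)
HT_ON_16 = (29.634, 16)


def counted_time(per_core, cores):
    """Total processor time counted for the task across all its cores."""
    return per_core * cores


def slowdown(ht_on, ht_off):
    """How many times more processor time is counted with HT than without."""
    return counted_time(*ht_on) / counted_time(*ht_off)


for label, row in (("HT off, 8 cores", HT_OFF_8),
                   ("HT on, 8 cores", HT_ON_8),
                   ("HT on, 16 cores", HT_ON_16)):
    print(f"{label}: {counted_time(*row):.1f}")

print(f"HT on / HT off, 8 cores: {slowdown(HT_ON_8, HT_OFF_8):.2f}")
```

The same fixed-size task is counted as roughly one and a half times more processor time as soon as the neighbors load the sibling threads, whether the client uses 8 or 16 cores.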
In other words, hyper-threading makes each core one and a half times slower while doubling the number of cores.
Hyper-threading helps if the whole processor is yours. But what if processor time is paid for, and the neighbors are not even “colleagues” but simply “strangers”?
After that we had a long discussion (about three days, on and off): should we enable HT on the cloud hosts for clients? Besides the obvious “honest or dishonest” and “no one will find out”, there were much more serious arguments:
- One server will deliver more processor resources (which is why Intel created this technology)
- We can account for this “performance loss” by lowering the price of machine time
- We can monitor host load and keep it from rising above 50%, while still offering the client more cores
After the discussion we arrived at the following conclusions. We cannot control or actually (rather than statistically) predict the consumption of machine time. Migration is a way out, but only a partial one, because a migration needs at least 30-40 seconds to react, while load spikes can be instantaneous (under a second). As a result, we do not know what kind of machine time (full-speed or not) we have given the client, and in any case the client will suffer an unjustified loss of performance for no visible reason, simply because a neighbor decided to compute something heavy.
Because constant virtual machine performance cannot be guaranteed with hyper-threading, the view prevailed that HT should be turned off in a cloud that charges for machine time (hence the limit of 8 cores per virtual machine).
Migration: copy and delete
The second amusing issue was accounting for resources during migration. By default, the handle (aka uuid) is assumed to prove the uniqueness of an object, so there cannot be two virtual machines with the same uuid. That is indeed the case. However, it applies to virtual machines, not to domains. During migration, the contents of the domain (its RAM) are copied, the domain is started on the new node, and only then is it deleted on the old one. All of this is accompanied by repeated re-copying of parts of the domain (since the virtual machine keeps running). Either way, we get TWO domains from ONE virtual machine. If we naively add the numbers up, the final counters end up completely wrong. (By the way, this problem was discovered quite late and was one of the reasons the launch was delayed.)
The solution to the problem was elegant and architecturally beautiful. Only one copy of a domain can run at any given time; all the others are paused. This is very important, because two running copies of the same domain could make a spectacular mess. So the solution looks like this: a paused domain is not counted. This has several minor negative effects:
- We could not offer our customers a working pause button (in this mode the domain exists but is not executed). Since such a domain still consumes memory, we cannot afford to leave it uncounted if the client pauses the virtual machine and goes on vacation. And we cannot distinguish a “pause during migration” from a plain “pause” (at least not without large and non-trivial dances with states and databases).
- When the client reboots the machine, there is a brief moment when the domain already exists but is still paused, and we do not count it (so a client who reboots the machine constantly could consume a small amount of memory for free). Arguably this is even more honest: whatever is being set up there is our problem, and until the machine actually starts running, the client has no reason to pay for it.
Otherwise, this solution has no side effects. Several alternatives were discussed: a lock in the accounting database, tracking the domain's lifetime, and so on. Next to the elegance of simply ignoring paused domains, all of them look bulky and ugly (if you don't praise yourself, no one will, alas).
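A minimal sketch of this rule, assuming domain records shaped like the xencontrol output shown earlier. The function names and the two-record snapshot are hypothetical and are not taken from our accounting code; the handle is the one from the sample output.

```python
import uuid
from collections import defaultdict


def vm_uuid(handle):
    """Turn the 16-byte handle list into the usual UUID string."""
    return str(uuid.UUID(bytes=bytes(handle)))


def memory_by_vm(domains, skip_paused=True):
    """Sum mem_kb per virtual machine, optionally ignoring paused domains."""
    totals = defaultdict(int)
    for dom in domains:
        if skip_paused and dom["paused"]:
            continue  # a paused copy (e.g. a migration target) is not counted
        totals[vm_uuid(dom["handle"])] += dom["mem_kb"]
    return dict(totals)


HANDLE = [148, 37, 12, 110, 141, 24, 149, 226, 8, 104, 198, 5, 239, 16, 20, 25]

# Hypothetical snapshot taken in the middle of a migration:
# one virtual machine, two domains with the same handle.
mid_migration = [
    {"handle": HANDLE, "paused": 0, "mem_kb": 262144},  # running copy
    {"handle": HANDLE, "paused": 1, "mem_kb": 262144},  # paused copy
]

print("naive sum:     ", memory_by_vm(mid_migration, skip_paused=False))
print("paused skipped:", memory_by_vm(mid_migration))
```

The naive sum charges the virtual machine for twice its memory while the migration is in flight; skipping paused domains restores the correct figure without any locks or lifetime tracking.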
Who will pay for dom0?
Another huge issue was the dom0 load problem. When a user sends a request over the network or performs disk operations, the request is passed from domU (the domain that is the running virtual machine) to the second half of the driver in dom0. The driver mulls over the request and passes it on to the real drivers of the actual hardware. And I must say that under intensive disk operations it has a lot to mull over. Seeing 50-80% CPU load there is nothing unusual. Most of that load comes from OVS, blktap, xenstore, and so on (the rest is xapi, squeezed, and stunnel, which are part of the cloud management system). Who should be billed for this machine time? And, most importantly, how do we separate one user from another? At the driver level this may be possible, but beyond that ...
The same OVS (Open vSwitch, a program that provides the virtual network) switches frames, and it does not care which domain they belong to.
The Xen people (developers of the hypervisor and its bindings) racked their brains over this issue. I started thinking about it too, but came to my senses in time. If disk operations are paid for, then their price in effect already includes the cost of processing the request. That cost covers not only disk IOPS, the load on RAID controllers, the SAN, amortization of 10G cards, and so on, but also the dom0 machine time (which is very cheap compared to everything above). Everything turned out to be logical and resolved itself.