How I Mirrored Virtual Machines on Free ESXi
In my home lab I use free virtualization from VMware: it is cheap and reliable. At first there was one server, then I started adding local datastores to it, then I assembled a second server... The standard problem with this setup was moving virtual machines around. Doing these operations by hand, I came across a method that lets you switch a running virtual machine to a copy of its flat files in a completely different place. The method is extremely simple: create a snapshot of the virtual machine, clone the flat files to the new location, and then, in the delta descriptor, repoint the link to the parent disk. The hypervisor does not keep disk descriptor files open, so when you remove the snapshot, the delta is merged into the new disk, and the old one can be safely deleted. This works great without any VDDK, which is used by practically every commercial backup product and is not available on free hypervisors.
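For concreteness, here is a minimal sketch of that sequence as it could be scripted right in the ESXi shell; `vim-cmd` and `vmkfstools` are the stock ESXi tools, while the VM id and all paths are hypothetical:

```python
import re
import subprocess

# Hypothetical values -- adjust for your environment.
VMID = "42"                                       # from `vim-cmd vmsvc/getallvms`
BASE = "/vmfs/volumes/old-ds/vm/vm.vmdk"          # current base disk descriptor
DELTA = "/vmfs/volumes/old-ds/vm/vm-000001.vmdk"  # snapshot delta descriptor
NEW_BASE = "/vmfs/volumes/new-ds/vm/vm.vmdk"      # where the copy should live

def run(cmd):
    subprocess.run(cmd, check=True)

# 1. Freeze the running disk with a snapshot; new writes go to the delta.
run(["vim-cmd", "vmsvc/snapshot.create", VMID, "migrate", "migration snapshot", "0", "0"])

# 2. Clone the (now read-only) base disk to the new datastore.
run(["vmkfstools", "-i", BASE, "-d", "thin", NEW_BASE])

# 3. Repoint the delta at the clone. ESXi does not keep descriptor
#    files open, so this text file can be edited while the VM runs.
#    (If the clone received a new CID, parentCID in the delta would
#    have to be updated to match as well.)
with open(DELTA) as f:
    desc = f.read()
desc = re.sub(r'parentFileNameHint="[^"]*"',
              'parentFileNameHint="%s"' % NEW_BASE, desc)
with open(DELTA, "w") as f:
    f.write(desc)

# 4. Remove the snapshot: the delta is consolidated into the clone,
#    and the old base disk can then be deleted.
run(["vim-cmd", "vmsvc/snapshot.removeall", VMID])
```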
I easily automated this procedure in Python, adding a few more tricks along the way, which I can cover in future articles if there is interest. A little later I found a good man among my former colleagues who agreed to write a GUI; he implemented it, of all things, in Unity, but for the resulting free solution, which we named Extrasphere, it was not bad at all. Isn't that a nice toy for an administrator?
Having sorted out virtual machine migration for my home lab, I started thinking about protection from failures. The minimum requirement was a backup of a running virtual machine; the unattainable maximum was a backup that never lags behind the original. Not that I had data where losing 15 seconds would be critical: to be honest, losing a couple of days would not be critical for me either, but I wanted to aim for that ideal as a foundation for the future.
I will not analyze and compare the solutions available at the time: it was long ago, I kept no notes, and all I remember is an irresistible urge to reinvent the wheel.
On a free hypervisor you can build the simplest backup agent out of a Linux virtual machine to which the flat disks of the protected virtual machine are attached. This approach works well for full backups, but it is completely unsuitable for incremental ones, because native CBT (Changed Block Tracking) is not available on free hypervisors.
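A minimal sketch of such an agent's core, assuming the protected machine's disk shows up inside the Linux backup VM as a second block device (the device name and target path are hypothetical):

```python
import gzip
import shutil

# Hypothetical: the protected VM's flat disk is attached to this
# Linux backup VM as a second SCSI disk.
SOURCE_DISK = "/dev/sdb"
TARGET = "/backups/vm-full-backup.img.gz"

# A full backup is just a streamed, compressed copy of the raw device.
# Without CBT there is no way to know which blocks changed, so every
# run has to read the entire disk -- hence "full backups only".
with open(SOURCE_DISK, "rb") as src, gzip.open(TARGET, "wb") as dst:
    shutil.copyfileobj(src, dst, length=4 * 1024 * 1024)  # 4 MiB chunks
```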
I thought it would be nice to implement CBT myself, but how? I had heard that Zerto and SRM are built around a vSCSI filter, but after downloading the open-source disclosure package for ESXi I found nothing of the kind there; the most it offers is the ability to write a character device. Then I decided to take a look at the hbr_filter device and, to my surprise, everything there turned out to be not too complicated. Three weeks of experiments, and I could attach my own filter driver to a virtual disk and track its changes.
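The filter API itself is not public, so instead of ESXi code, here is a toy illustration of the bookkeeping such a filter has to do: a bitmap of changed blocks that an incremental backup could later read (the block size and all names are mine, purely for illustration):

```python
class ChangeTracker:
    """Toy CBT: remember which blocks of a disk have been written."""

    BLOCK = 64 * 1024  # tracking granularity: one bit per 64 KiB

    def __init__(self, disk_size):
        self.disk_size = disk_size
        self.changed = set()   # indices of dirty blocks

    def on_write(self, offset, length):
        # Called (hypothetically, by the filter) for every guest write:
        # mark every block the write touches as dirty.
        first = offset // self.BLOCK
        last = (offset + length - 1) // self.BLOCK
        self.changed.update(range(first, last + 1))

    def dirty_extents(self):
        # What an incremental backup would read: (offset, length) pairs.
        return [(b * self.BLOCK, self.BLOCK) for b in sorted(self.changed)]

tracker = ChangeTracker(disk_size=40 << 30)     # a 40 GiB disk
tracker.on_write(offset=1_000_000, length=8192)
print(tracker.dirty_extents())                  # blocks covering that write
```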
But what if you do not just track the changes, but replicate them? The biggest danger here is ending up writing a ton of code for the transport channel: on one side, pull the changes, package them, and send them over the network; on the other, receive, unpack, and write them, with integrity guarantees and error handling at every step. It seems impossible to do without a pair of agents. One look at the Zerto architecture is enough to see that writing and stabilizing such a solution alone is unrealistic:

Fig. 1. Zerto Virtual Replication architecture, from their marketing materials.
Then I remembered that ESXi itself can write over the network, via iSCSI or NFS for example. All that is needed is to mount the target datastore locally. And if the replica is powered on, the filter driver can write to it directly! I started experimenting: at first I did not know what to do with the powered-on replica and simply booted it from an Ubuntu Live CD; after a couple of weeks I managed to produce a consistent working copy, and then I learned to transfer changes on the fly. Moreover, the source machine does not receive a write acknowledgment until the write has reached both recipients. So I got replication with zero lag.
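The key property is easiest to show on plain files: a write is acknowledged only after it is durable on both the original disk and the mounted replica. A toy model of that write path (the real one lives inside the filter driver; the paths and class are hypothetical):

```python
import os

class MirroredDisk:
    """Toy synchronous mirror: ack a write only after both copies have it."""

    def __init__(self, primary_path, replica_path):
        # The replica lives on an NFS/iSCSI datastore mounted locally,
        # so from here it is just another file.
        self.primary = open(primary_path, "r+b")
        self.replica = open(replica_path, "r+b")

    def write(self, offset, data):
        for f in (self.primary, self.replica):
            f.seek(offset)
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # durable on this copy
        # Only now does the guest see the write as completed, so the
        # mirror can never fall behind: zero lag by construction.
        return len(data)
```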
The technology turned out to be agentless; at least the code that creates the replica I quickly threw together in Python. For its dissimilarity to classical replication, and for its simplicity, I decided to call it mirroring.
I solved the problem of the powered-on mirror by writing a simple bootloader; to make it at least somewhat useful, it displays the last status of the mirror at boot and then halts. As a result, actual memory consumption tends to zero, and while some CPU is spent on data transfer, an installed agent would have eaten no less. The disk write activity graphs of the mirror and the original machine are identical.

Fig. 2. CPU consumption on the original machine under load.

Fig. 3. CPU consumption on the mirror over the same period.

Fig. 4. Memory consumption on the original machine under load.

Fig. 5. Memory consumption on the mirror over the same period.

Fig. 6. Disk performance of the original machine under load.

Fig. 7. Disk performance of the mirror over the same period.
To check the state of the mirror, I made test machines that run as linked clones from a snapshot of the mirror. If you wish, you can keep the snapshot and run the tests again later, and if you really like a test machine, you can turn it into an independent virtual machine using the built-in migration I described at the beginning of this story.
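One way to wire up such a disposable test machine with stock ESXi features is the independent-nonpersistent disk mode, which gives a redo log on top of the frozen parent disk; this is a sketch under that assumption, not necessarily the exact mechanism I use, and all paths are hypothetical:

```python
import os
import subprocess

# Hypothetical paths: after a snapshot, the mirror's base disk is
# frozen and can be shared read-only with a test machine.
PARENT_DISK = "/vmfs/volumes/mirror-ds/mirror/mirror.vmdk"
TEST_DIR = "/vmfs/volumes/mirror-ds/test"
VMX = TEST_DIR + "/test.vmx"

vmx = """\
config.version = "8"
virtualHW.version = "11"
displayName = "mirror-test"
memSize = "1024"
guestOS = "other-64"
scsi0.present = "TRUE"
scsi0.virtualDev = "lsilogic"
scsi0:0.present = "TRUE"
scsi0:0.fileName = "%s"
scsi0:0.mode = "independent-nonpersistent"
""" % PARENT_DISK

# independent-nonpersistent: the test VM's writes go to a redo log in
# its own directory and vanish at power-off, so the frozen parent disk
# is never modified -- exactly what a disposable test clone needs.
os.makedirs(TEST_DIR)
with open(VMX, "w") as f:
    f.write(vmx)
subprocess.run(["vim-cmd", "solo/registervm", VMX], check=True)
```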
A local target is good, but what if you need to mirror to another office or city? Even if the communication channel is wide enough, the round-trip time will seriously degrade the performance of the original machine; remember, the original machine does not receive a write acknowledgment until the write has completed on both recipients. The solution here is extremely simple: expand the write-uncertainty interval from zero to some reasonable value. For example, an allowable lag of 3-5 seconds provides both good data integrity and decent performance. This is exactly the solution I am working on right now. Next in line are operation without ssh and application-level consistency, which also call for a few tricks that I will gladly share.
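In toy form, the idea looks like this: writes to the replica are queued, and the source is stalled only when the oldest unshipped write is older than the allowed lag (the class, names, and the 5-second bound are illustrative, not my actual implementation):

```python
import queue
import threading
import time

class LaggyMirror:
    """Toy bounded-lag mirror: acknowledge writes immediately as long
    as the replica is at most MAX_LAG seconds behind, otherwise stall."""

    MAX_LAG = 5.0   # allowed replica lag in seconds (illustrative)

    def __init__(self, apply_to_replica):
        self.q = queue.Queue()          # (timestamp, offset, data)
        self.apply = apply_to_replica   # the slow WAN write to the replica
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, offset, data):
        # ...the local write happens here, exactly as before...
        self.q.put((time.monotonic(), offset, data))
        # Ack immediately unless the oldest unshipped write is older
        # than MAX_LAG: the "uncertainty interval" widened from zero.
        while self._lag() > self.MAX_LAG:
            time.sleep(0.01)

    def _lag(self):
        try:
            ts = self.q.queue[0][0]     # peek at the oldest queued write
        except IndexError:
            return 0.0                  # queue empty: replica is caught up
        return time.monotonic() - ts

    def _drain(self):
        while True:
            _, offset, data = self.q.get()
            self.apply(offset, data)    # ship the write to the remote site
```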