How I stopped worrying and loved Hyper-V replication

    Strange as it may seem, in the first days back at work after the New Year holidays, once everything that crashed over the break has been brought back to life, many people feel the urge to sort out the information in their heads and give it some system. A good catalyst for this is the realization that you seem to know a lot, yet cannot explain that baggage in simple words to a random grandmother on the street or to a six-year-old child. As popular wisdom has it, if you cannot explain it to a child, you do not really know it yourself. In any case, defragmenting information never hurt anyone.
    But this is not a course in applied psychology, so today I will simply lay out, systematically and in a limited number of pixels, as much useful information as possible about virtual machine replication in the Hyper-V hypervisor, using the current version, Windows Server 2012 R2, as the example.

    So, here is what I want to spend about an hour of your time on:
    • Understanding why replication is needed in the modern world.
    • Making a checklist of the obvious and not so obvious points that precede server configuration.
    • Configuring replication properly and quickly with the built-in tools. In sufficient detail, but without filler.
    • Some tips for optimizing the replication process.
    • Not a word about Veeam products or other vendors.

    Act one. Survey.

    The meaning of the term “replication of virtual machines” is no different from the generally accepted meaning of the word “replication” in IT: a copy of the VM from the primary host is created and kept up to date on another host.

    Let's agree right away: replication is not backup! Just as snapshots are not backup, RAID is not backup, and in general nothing is backup except backup, no matter how much one might wish otherwise...

    Still, just in case, I will explain why it is not backup: if the primary machine fails, you can always power on the replica without delay. But if the failure was caused not by a momentary error but by problems that accumulated at the OS or application level, all of them will be faithfully reproduced on the replica, and nothing good will come of it. There are countless cases where a replicated VM is powered on, works for a few minutes, and then dies after its parent with the same symptoms.
    Thus, replication is a great tool for extending your disaster recovery plan: it lets you return all your services to working order with minimal delay. But you cannot shift all responsibility onto it, because nothing is perfect and everything has its drawbacks.
    There are three ways to replicate a Hyper-V virtual machine:
    • Built-in hypervisor tools.
    • Third-party software. Interestingly, some vendors ignore this feature for no apparent reason.
    • SAN tools. Undoubtedly the most interesting and fastest method, and so on... but extremely expensive.

    As agreed at the very beginning, we will consider only the first item, namely replication of Hyper-V machines using the built-in tools of Windows Server 2012 R2. I stress that it is R2 because there is a real functional gap between the first and second releases, and running the non-R2 version of the hypervisor in production is close to bad form.

    So, what does Microsoft offer us out of the box:
    • Replication by asynchronously copying changed data from the parent machine to the replica. Asynchronous means that data is not transferred immediately after the original data changes, but at set intervals, which avoids overloading the source machine and the link. Currently, the minimum replication interval is 30 seconds.
    • To replicate Hyper-V machines, you do not need any special shared storage, and the storage hardware at the source and the receiver does not have to match.
    • Everything that can be virtualized can be replicated.
    • Replication runs over regular IP networks and, if necessary, the traffic can be encrypted.
    • In Hyper-V, replication is possible both between standalone hosts and between clusters. Even a mixed setup works without restrictions.
    • The hosts between which replication runs can be located anywhere, on any networks, and belong to different domains.

    Checklist before it's too late

    • The first and obvious point, though often overlooked: make sure the server that will receive the replicas has Hyper-V-compatible hardware. Engineers tend to assume that anything will run on anything, but believe me, that is not the case here. Not at all. Hardware virtualization support is mandatory.
    • The second obvious point: calculate how much space you need on the receiving storage, and check how fast it is. If you pull dinosaur-era storage off a far shelf, you risk seeing an overall transfer rate from the same era, despite all your network gigabits.
    • A corollary of the previous point: based on the replication interval of an average virtual machine, estimate how much each recovery point will weigh and how many of them you can afford. The current maximum is 24 recovery points.
    • If you plan to replicate machines that are part of a Hyper-V cluster, you must install and configure the Hyper-V Replica Broker role in the cluster. If there is a cluster on the receiving side, the same must be done there.
    • Check the firewall settings and routing along the entire path between the hosts. If you are not the one responsible for the network, find your network engineer and torture them with a hot iron until they build the shortest and fastest route between the hosts. Firewalls are simpler: we need port 80 for Kerberos authentication over HTTP and port 443 for certificate-based authentication over HTTPS. The ports can, of course, be changed during configuration.
    • If you plan to encrypt traffic between hosts, you will need certificates, and they must be deployed in advance on all the servers involved. Do not neglect to check the certificates' expiration dates, and if you use self-signed certificates, make sure the issuing certification authorities are trusted.
    • Inspect the virtual disks (VHDs) of all your virtual machines. You have every chance of finding disks that do not need to be replicated at all, which will save you time and money.
    • Make a list of applications for which data consistency is important. Check the health of VSS both on the hosts and inside the guest OS. If your applications do not use VSS (for example, versions of Oracle that are not even that old), they will need special attention.
    • Think about the time needed for the first replication pass. On the first run the entire machine is transferred over the network, and naturally you want this to happen outside working hours. If the receiving host is outside the local network, you risk not finishing the transfer overnight or over the weekend, and in the morning you will get a very busy link and a very busy host. How to avoid this is described below.
    • Keep in mind that replication in Hyper-V is possible not only between two hosts, but also through intermediate servers. A sort of multi-hop replication.
    • Review your current backup plan for compatibility with your replication plan. I doubt anyone will enjoy replication kicking in during a backup; your host may well not forgive you that load. It is also worth answering the question of what happens if you restore a machine from backup: do you need the replica as the machine was before the failure, or should it be brought in line with the restored machine as soon as possible?
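
    The storage estimate from the checklist can be roughed out in a few lines. This is only a back-of-the-envelope sketch: it assumes that every stored recovery point weighs about one hour of changed data, and the function names and the sample figures are made up for illustration. Only the 24-point ceiling comes from the text above.

```python
MAX_RECOVERY_POINTS = 24  # the maximum mentioned in the checklist


def replica_storage_gb(base_vhd_gb, changed_gb_per_hour, recovery_points):
    """Rough receiving-side footprint: full copy plus one hourly delta per point.

    Assumption (not a Hyper-V guarantee): each recovery point holds roughly
    one hour of changed blocks.
    """
    if not 0 <= recovery_points <= MAX_RECOVERY_POINTS:
        raise ValueError("recovery_points must be between 0 and 24")
    return base_vhd_gb + changed_gb_per_hour * recovery_points


def max_recovery_points(free_gb, base_vhd_gb, changed_gb_per_hour):
    """How many recovery points fit into the free space, capped at 24."""
    spare_gb = free_gb - base_vhd_gb
    if spare_gb < 0:
        return 0  # not even the base copy fits
    if changed_gb_per_hour <= 0:
        return MAX_RECOVERY_POINTS
    return min(MAX_RECOVERY_POINTS, int(spare_gb // changed_gb_per_hour))


if __name__ == "__main__":
    # Hypothetical VM: 200 GB of disks, about 2 GB of changes per hour.
    print(replica_storage_gb(200, 2, 24))
    print(max_recovery_points(230, 200, 2))
```

    Real change rates vary a lot over the day, so treat the result as a lower bound and measure before you buy disks.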

    I think it is only fair to mention a tool from Microsoft that lets you estimate, with a certain margin of error, the resources needed to replicate a single virtual machine. It is called Capacity Planner for Hyper-V Replica. Of course, you will not get exact figures for IOPS, network load and CPU load, but as an estimation tool it is very good and lets you analyze your infrastructure in advance.

    At startup, you will be asked to specify the primary server, the replica server, the machines to be analyzed, and the duration of the measurements. I recommend raising the default 30 minutes to an hour. And, of course, the best time to run it is at the height of the working day. The collected data is great for scaring management and asking for money for new toys... that is, hardware.

    Act two. Tuning

    And now the crucial moment has come! The certificates are in place, the network is configured, the Hyper-V role is running everywhere, the management tools have not been forgotten, and we can proceed.
    The first step is to allow our host to act as a replica server and take machines on board. This is done through the standard Hyper-V settings window.

    All the settings are self-explanatory, but I want to dwell a little on the bottom section, Authorization and storage. It is not critical, but I highly recommend allowing replication only from specific hosts or groups of hosts. It does not happen often, but sometimes replication is started by mistake or out of ignorance. It is fine if the target is a spare host, but it may just as well be production storage that gets filled up, with all the fun that follows. Allowing everything from everyone is for test and development labs. Well, or for the simply brave =)

    Call the broker

    Since we agreed at the very beginning that our infrastructure is a grown-up one (that is, the cluster is configured and working), we need to enable the Hyper-V Replica Broker role. If you do not have a cluster, you can safely skip this section.

    The activation procedure is simple and consists of five Next buttons and one Finish. There is nothing to explain here, so just open the cluster management console, select Configure Role and go through the wizard, remembering to give the broker a NetBIOS-compatible name and to specify an IP address.

    A small hint for those who read the documentation first and act second (although real engineers never do that): everything described in the previous section can be done directly from the broker. The only difference is that the settings are applied to the entire cluster at once, so you do not have to allow replication manually on each server. The settings look exactly the same.

    And a short explanation of the broker's role in replication: when you replicate machines that are not part of a High Availability cluster, the broker is not involved at all. But when it comes to clustered machines, it takes full control of all processes related to replication and clustering, preventing the cluster from making the wrong decision about machine availability. Hence the golden rule: from now on, you should perform every action only through the Failover Cluster Manager console, otherwise you risk being left without a cluster. Even if a meteorite falls on the production host, the worst thing you can do in that situation is power on the replica through Hyper-V Manager.

    Here goes the first one

    Now, finally, we are really ready to replicate our very first machine. Like everything in Windows, this is done with the right mouse button.

    A fairly standard setup wizard then opens. In the first steps it asks for the name of the server the machine will be replicated to and for the connection settings. If the hosts are in the same domain, everything will be filled in without our participation; if the servers do not know each other and the traffic also has to be encrypted, you will have to specify all the parameters manually. The only checkbox that deserves attention at this step is the one that compresses the transmitted data. Here we go back to the planning stage and decide what matters more: compressing the data and finishing the transfer sooner (which inevitably puts extra load on the hosts), or sparing the hosts because their performance is the priority and the volume and duration of the transfer are not important.

    The next step is to select the disks that will take part in replication. At the end of the article, when we get to general optimization, I will give a few tips, but for now remember one detail: a disk that is not selected for replication will be completely absent on the receiving side, i.e. it is excluded from the virtual machine configuration. If the machine cannot function without this disk, but it only holds something unimportant (such as temporary files), simply recreate the disk on the replica.

    Then we turn to the planning stage again and set the chosen replication interval. If, by some mistake, you are still using Server 2012, you will not even be asked: the interval is simply set to 5 minutes. Over time, Microsoft concluded that this behavior is not entirely right, and in Server 2012 R2 added a choice of 30 seconds, 5 minutes and 15 minutes. Not great, of course, but better than nothing.

    And be very careful when choosing the 30-second interval: you will need a really powerful host with a very fast network and very fast storage.

    The next crucial step is to specify how many recovery points we will store. Here we also specify how often VSS snapshots will be created. In principle, you can do without them, but then no one can guarantee data consistency, with all that this implies, especially for applications where it is critical.
    The example in the screenshot can be put into plain language like this: create a recovery point every hour, keep the points for 24 hours (this is the maximum value), and create a VSS snapshot every 4 hours. I agree that the wording is not the most transparent or easy to understand, but it is what it is.
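
    To make that wording a bit more tangible, here is a small sketch that lists the stored points for the settings above. It is a simplification of my own, not a description of Hyper-V internals: it assumes that exactly every fourth hourly point is the application-consistent one created with VSS, and the names are invented for the illustration.

```python
def recovery_point_plan(points_kept=24, vss_every_hours=4):
    """Return (age_in_hours, kind) for every stored hourly recovery point.

    Assumption: the VSS (application-consistent) snapshot lands on every
    vss_every_hours-th hourly point; all others are standard points.
    """
    plan = []
    for age_hours in range(1, points_kept + 1):
        if age_hours % vss_every_hours == 0:
            kind = "app-consistent"
        else:
            kind = "standard"
        plan.append((age_hours, kind))
    return plan


if __name__ == "__main__":
    plan = recovery_point_plan()
    consistent = [age for age, kind in plan if kind == "app-consistent"]
    print(len(plan), "points kept, app-consistent at hours:", consistent)
```

    The practical takeaway: with these settings, only a handful of the stored points are safe to roll back to for applications that care about consistency.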

    Next comes a very useful step for those who have very large machines or simply cannot push large amounts of data over the network. As we remember, on the first run the entire replicated machine must be transferred from host to host, and we are offered three ways to do it:
    • Directly over the network, with a choice of when the process starts. This is the default option and needs no further comment.
    • The most interesting option, in my opinion. If you select it, a copy of the machine is first created and saved in a separate folder on the sending host. The folder will be named according to a template. The same machine is created as a placeholder on the receiving side. The folder with the copy can then, and should, be copied to external media and carried closer to the second host. On the second host, the placeholder machine will have a new menu item, Import Initial Replica, i.e. the machine will be waiting for real data. We will be asked for the path to the data folder, the data will be copied to its permanent location, the internal reconciliation processes will start, and with that the first replica pass can be considered complete. Obviously, the longer the disk travels between hosts, the greater the difference between the machines, so do not drag this trip out.
    • And the third option: a copy of the virtual machine already exists on the receiving side. You simply point to this machine, and it is used as a baseline. How can that be? For example, it was restored from backup, or left over from a previous replication. It does not matter; the main thing is that this machine can be used as a baseline, and only the data that differs will be sent over the network.
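
    Choosing between the network and external media is mostly arithmetic. The sketch below is a rough estimate under stated assumptions: decimal units (1 GB = 8000 megabits), a link efficiency factor of 0.7 picked arbitrarily to account for protocol overhead and other traffic, and hypothetical function names and sample numbers.

```python
def transfer_hours(size_gb, link_mbps, efficiency=0.7):
    """Hours needed to push size_gb over a link of link_mbps megabits/s."""
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be > 0 and efficiency in (0, 1]")
    megabits = size_gb * 8 * 1000  # decimal gigabytes to megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


def pick_initial_method(size_gb, link_mbps, window_hours, efficiency=0.7):
    """Suggest 'network' if the first pass fits the window, else 'external media'."""
    needed = transfer_hours(size_gb, link_mbps, efficiency)
    return "network" if needed <= window_hours else "external media"


if __name__ == "__main__":
    # Hypothetical case: 2 TB of VMs, a 100 Mbit/s WAN link, a 12-hour night window.
    print(round(transfer_hours(2000, 100), 1), "hours")
    print(pick_initial_method(2000, 100, 12))
```

    If the answer is measured in days rather than hours, the trip with a disk in a backpack starts to look very attractive.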

    Then we are offered a look at all the settings we entered and confirm them with the Finish button. We are told that everything went well and offered to change the network settings of the replica, because by default it is not connected to any network (I agree, this is a very unexpected place for such an offer). It seems to me that network questions are better explained with the practical examples that come later, so for now let's move on to extended replication of Hyper-V machines.

    Expanding the breadth of our depths

    Like many other cool features, extended virtual machine replication appeared only in Windows Server 2012 R2. Extended replication lets you configure replication not only point-to-point, but also build entire chains: once replication from the primary server to the first replica (let's call it the primary replica) has passed, a further replication process starts (a tautology, but there is no better way to say it) to a third host.

    And if it is not entirely clear to many why replication is needed at all, the possibility of building a replication chain is likely to confuse even the most persistent. Still, I submit for your consideration this entirely non-fictional example. Suppose you have a fairly large company with several server rooms in the same building, and you have set up replication every 30 seconds so that if a sewer pipe bursts and floods the servers, you can quickly power on copies of your virtual machines with minimal data loss. This is an excellent scheme, but unfortunately it does not protect against a complete blackout of the building or against an excavator cutting the optical cables that lead to it. In that case you really want to have copies of the machines somewhere off-site, updated if not every 30 seconds, then at least once every 15 minutes.

    Here are the rules of extended replication:
    • Extended replication cannot run more often than primary replication, i.e. if primary replication runs every 5 minutes, extended replication cannot run every 30 seconds.
    • The frequency of VSS snapshots cannot be changed.
    • The list of disks participating in replication cannot be changed.
    • However, you can change the authentication method and the way the initial replica is transferred.
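
    The first rule is easy to get backwards, so here is a tiny checker for it. It is just an illustration of the rule as stated in this article; the interval values are the three mentioned earlier (30 seconds, 5 and 15 minutes), and the function name is hypothetical. It does not talk to Hyper-V in any way.

```python
ALLOWED_INTERVALS_SEC = (30, 300, 900)  # 30 seconds, 5 minutes, 15 minutes


def check_extended_replication(primary_interval_sec, extended_interval_sec):
    """Return a list of problems with a planned primary/extended interval pair."""
    problems = []
    for name, value in (("primary", primary_interval_sec),
                        ("extended", extended_interval_sec)):
        if value not in ALLOWED_INTERVALS_SEC:
            problems.append(f"{name} interval {value}s is not one of "
                            f"{ALLOWED_INTERVALS_SEC}")
    # Rule from the list above: the extended leg may not run more often
    # (i.e. with a shorter interval) than the primary leg.
    if extended_interval_sec < primary_interval_sec:
        problems.append("extended replication would run more often than primary")
    return problems


if __name__ == "__main__":
    print(check_extended_replication(30, 900))   # the server-room example: no problems
    print(check_extended_replication(300, 30))   # violates the first rule
```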

    The extended replication wizard is invoked in the traditional way, by right-clicking the replicated machine and selecting Extend Replication. The rest of the setup is exactly the same as for regular replication, so there is no point in covering it separately.

    So, we have successfully configured, started and checked everything, and I suggest we move on to behavior in the event of a failure, with a short stop at networking on the way.

    A bit about networks

    Whether it is excessive paranoia or not is anyone's guess, but it is customary to connect all replicas to an isolated network that does not overlap with production. And often the administrator has no choice at all, because the data center on the receiving side uses different subnets, and the replica must have completely different network settings.

    Hyper-V lets us specify the exact settings of each network adapter for the case when the replica is powered on in an emergency. That, by the way, is called failover, and we will talk about it right now.

    The scary word failover

    I'll start by explaining the term failover, since an adequate translation into the language of Pushkin and Tolstoy has yet to be invented. A failover is the process of correctly (read: in a controlled way) powering on, running, and powering off a replicated machine. An example of incorrect behavior: the machine is powered on with the Start button from the host or cluster management console. In this case we are guaranteed broken replication, followed by reconfiguration and the whole set of fun problems that come with having two identical machines in the same infrastructure.

    So, there are three types of failover:
    • Planned
    • Test
    • Emergency (or simply failover, with no prefix)

    Planned failover

    A planned failover implies that possible problems with the primary host are known in advance. For example, work is scheduled on the power lines, a hurricane is moving towards you, the host has to be shut down for maintenance, or in the morning workers have decided to dig dangerously close to the cable routes.

    With this option there is a short service downtime, equal to the time it takes to shut down the primary machine and boot the replica, but because the switch is planned, you can pick the time that is most convenient for everyone.

    An important point is that replication can be continued in the reverse direction, i.e. all changes made on the replica side will be transferred back to the primary machine while it is switched off. This lets you eliminate data loss completely.
    So, how does a planned failover work:
    1. Shut down the primary virtual machine. This can only be done manually, to avoid an accidental shutdown. Until the machine is completely off, the failover wizard will show a corresponding error.
    2. On the same primary host, right-click the powered-off machine and select Planned Failover.
    3. By default, the Reverse the replication direction after failover option, which enables reverse replication, is unchecked. If you do not want to lose the data accumulated while the machine runs in failover mode, check it. An important note: the primary host must be allowed to receive replicas, as described at the very beginning, otherwise the data simply will not be accepted.

    We start the failover and check that the machine we brought up is reachable over the network for users. The most common errors here are a wrong VLAN and a missing DNS record. The failover wizard checks neither of these, leaving them to the administrator.

    The funniest thing in this situation is how you switch back: you need to repeat the failover, but this time from the second host, i.e. shut down the replica there and run a planned failover for it. The solution is more than strange, but it is what it is.

    Test Failover

    This is exactly the case where the name matches the function. You want to test replicas, just like backups, in order to sleep a little more soundly. And the best way to test a replica is to power it on. At first glance it might seem that this is just another name for a planned failover, but it is not.

    When a test failover is run, a temporary machine is created on the replica side, and you can run various tests on it. For example, check a set of ports with telnet and, if they answer, be sure that the services on those ports have started successfully. One caveat: by default, the virtual machine in a test failover starts without a network connection. So the first step is to specify the general network settings for failover, then reopen the wizard, where a new menu item will appear.

    Or a more interesting option: see how a business-critical application behaves after a new patch is installed, remembering to put the machine on a specially prepared isolated network.

    Of course, the test failover has to be started on the replica side. The process is the same as for a planned failover, with the only difference that after all the necessary checks are done, it must be stopped. Otherwise the test machine will keep running until sooner or later it fills the entire disk.
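
    The telnet-style port check mentioned above is easy to script when there are many machines to test. A minimal sketch using only the Python standard library; the function names are my own, and the address in the usage example is a placeholder for the test machine on your isolated network.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, name not resolved
        return False


def check_services(host, ports, timeout=3.0):
    """Map every port in ports to True (answers) or False (does not answer)."""
    return {port: port_is_open(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # Placeholder address and ports: substitute the test VM and its services.
    for port, ok in check_services("192.0.2.10", [80, 443, 3389]).items():
        print(port, "open" if ok else "closed")
```

    An open port only proves that something is listening, so for the applications that really matter it is still worth logging in and looking.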

    Emergency failover

    There is only one golden rule: never run this failover unless it is really necessary, i.e. if there is no emergency, use only the test and planned options. If you just want to see how it works, write documentation for engineers and so on, do all the steps exclusively in a test environment.

    During an emergency failover, the only choice available to you is the recovery point to use. After that, the machine will be started no matter what. During a planned failover the wizard will not let you shoot yourself in the foot by running two identical machines (it waits until the primary machine is completely shut down), but here you get only a very clear yet unobtrusive warning.

    As the last barrier before the point of no return, you need to confirm completion of the failover with the Complete-VMFailover PowerShell cmdlet. All additional recovery points will then be deleted, and the failover process is logically complete.

    Best practice

    Before moving on to general tips, I want to touch on optimization for a specific infrastructure. The only source of information from which far-reaching conclusions can be drawn is, of course, comprehensive monitoring. You can argue about whether Operations Manager from the System Center suite is the best tool or not. But since we agreed at the beginning not to consider additional software, especially software that costs a lot of money, we will skip it.

    So, the first out-of-the-box tool, the one that greets us every time Windows Server boots, goes by the unassuming name of Best Practices Analyzer (it is located at the very bottom of the Server Manager console).

    By running BPA from time to time, you can get really valuable advice on host settings. The advice is based on accumulated events, on performance monitoring of the various subsystems of your specific host, and on the knowledge accumulated by Microsoft itself.

    For reasons unknown to me, the rules for Hyper-V Replica are not placed in a separate subgroup and, although they have their own unique numbers, they are listed under the general Hyper-V heading. The rules relating to replicas are numbered from 37 to 54 inclusive.

    Next in line is the Hyper-V Manager console itself. It is worth adding the Replication Health column to the standard window with the list of machines. As you might guess, this column shows the current state of replication.

    And through the Replication menu, you can bring up a very detailed report on the replication state of the machine.

    Now, let's move on to the general tips:
    • Do not be afraid to spend extra days on planning and testing.
    • If possible, move the virtual machine's page file to a separate VHDX and exclude it from replication. There is absolutely no reason to transfer it.
    • If you decide to upgrade your servers to 2012 R2, you must upgrade the replica server first and the primary server second, and never the other way around. Replication is not backward compatible.
    • If you have changed the disk size of the source machine (which became possible in Server 2012 R2), you must also change the disk of the replica. This does not happen automatically.
    • Use network throttling if you cannot dedicate a network to replication, because the replication process is capable of taking over the entire bandwidth of the link. In such cases QoS is everything. In my opinion, the easiest way is to configure the limits for the vmms.exe process or for the ports you specified.
