VMware: “To quiesce or not to quiesce?” Backing up virtual machines correctly



    A great many articles have been written about virtual machine snapshots, and the theory behind them has been described at more than sufficient length. In this article I will focus on the practical side of the issue, and exclusively on the VMware vSphere platform.

    So why do you need “quiesced”* snapshots, what are they all about, and what are the typical problems with them? I will look at snapshots primarily from the point of view of backup, but I will also try to touch on other aspects of their use.


    * If anyone can suggest a suitable Russian-language term, please do so in the comments; if a good option turns up, I will replace the Anglicism in the text.


    Using snapshots for backup


    In the VMware vSphere environment, snapshot creation is controlled by two options:
    • Snapshot that includes the memory state of the virtual machine
    • Snapshot preceded by so-called quiescing of the guest file system

    When a virtual machine is backed up through the VMware vStorage API for Data Protection, the first option is simply not used. The main reason is this: if a virtual machine has a large amount of RAM (and VMs with 8-16 GB have long been common), then with this option enabled both the creation time and the size of each incremental backup grow significantly, because every incremental backup additionally includes the contents of RAM. There are also a number of technical difficulties, but they are of little interest to us today, because we are considering the alternative scenario.

    That alternative is our second option, quiescing. It is of much greater interest, and its essence is to prepare the guest operating system (the file system first of all) for the backup to be taken.

    What is quiescing?


    Translating the official article gives something like the following:
    “This is the process of bringing the data on a virtual disk into a state suitable for backup. This process may include flushing dirty buffers from the operating system's memory to disk, or other higher-level operations specific to particular applications.”

    This description does not really make it any clearer what happens to the virtual machine. Let's figure it out ourselves.

    First, VMware Tools, through its VMware Snapshot Provider service, initiates the creation of a VSS snapshot inside the guest OS. Next, all VSS writers registered in the guest OS (you can list them with the "vssadmin list writers" command) receive the request and prepare their applications for backup (all transactions are flushed from memory to disk). When all the VSS writers have finished, they report this to VMware Tools (again via the VMware Snapshot Provider service), which in turn tells VMware that the snapshot can be taken.
    Thus, all backup applications for VMware vSphere use one of the following combinations when issuing the command to create a VMware snapshot (note that the actual creation of the snapshot is controlled entirely by VMware itself):

    Quiesced = ON, Memory = OFF
    Quiesced = OFF, Memory = OFF

    We will not consider the second combination in this article and will focus on the quiescing process.
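    The two combinations above can be sketched in code. This is only an illustration, assuming the pyVmomi bindings, where a virtual machine object exposes CreateSnapshot_Task(name, description, memory, quiesce). The SnapshotOptions class, the helper name and the description string are hypothetical and not part of any backup product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotOptions:
    """The two switches vSphere exposes when a snapshot is created."""
    quiesce: bool
    memory: bool


# The two combinations used by backup applications, as listed above.
QUIESCED = SnapshotOptions(quiesce=True, memory=False)
NOT_QUIESCED = SnapshotOptions(quiesce=False, memory=False)


def create_backup_snapshot(vm, name, options=QUIESCED):
    """Request a snapshot of `vm` and return whatever task object it hands back.

    `vm` is assumed to behave like a pyVmomi vim.VirtualMachine.
    """
    if options.memory:
        # Backup tools leave the memory option off (see the text above).
        raise ValueError("memory snapshots are not used for backup")
    return vm.CreateSnapshot_Task(
        name=name,
        description="backup snapshot (hypothetical description)",
        memory=options.memory,
        quiesce=options.quiesce,
    )
```

    Note that a task that completes successfully only tells you that VMware created the snapshot; as discussed further on, it does not prove that VSS was actually involved.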

    Why do you need quiescing?


    The most obvious example is the USN rollback problem when restoring a domain controller from backup. It occurs if the virtualized domain controller was backed up without VSS (that is, without the quiescing option or any other means of making sure transactions are written to disk).

    No extra steps or shamanic rituals are needed if you restore a backup made with the quiescing option. The InvocationID will be reset correctly, and you will see the following entry in the Event Log on the controller once it boots after recovery:
    Event ID 1109: Active Directory has been restored from backup media, or has been configured to host an application partition. The invocationID attribute for this domain controller has been changed.

    The same correct behavior can be observed with Acronis vmProtect 9. We specifically tested it by backing up and restoring virtual machines with a domain controller inside.

    USN rollback is obviously not the only possible problem with non-quiesced snapshots; other applications (for example Exchange and SQL, which explicitly support VSS) may also be prone to failures when recovered from such snapshots.

    How to verify that a snapshot is created correctly using VSS?


    There are several ways to check that a consistent (down to the application level) snapshot was created correctly:

    The easiest way: log in to the guest operating system and check the Event Viewer. After a snapshot is created with the options quiesced = ON, snapshot memory = OFF (see the screenshot at the beginning of the article), the application log should contain events from the corresponding VSS writers:





    Note: the VSS error with Event ID 12289 that can be seen in the screenshot is not really a problem. It relates to the 3.5'' floppy drive, and to get rid of it, just remove the floppy drive from the virtual machine configuration.



    A more involved way: use the Datastore Browser in the vSphere Client. After a quiesced snapshot is created, a vss_manifests*.zip file should appear in the virtual machine folder on the datastore.

    Inside the archive there is a backup.xml describing all the VSS writers found in the guest system, plus the metadata for each writer in writerX.xml.



    IMPORTANT: if vss_manifests.zip contains only backup.xml, this usually means that the snapshot was in fact made without VSS. This brings us smoothly to the most interesting part: the study of problems with snapshots. Below I will list the main causes of broken snapshots. It is worth noting that the main danger is not the snapshots that fail outright (those are easy to detect), but the ones that VMware reports as successful when in fact they are not.
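    The manifest check described above is easy to automate once the archive has been downloaded from the datastore. Below is a minimal sketch. It assumes that the per-writer files are named writer followed by a number (the text only says writerX.xml), and the function name is made up for illustration.

```python
import io
import re
import zipfile

# Assumption: per-writer metadata files look like writer0.xml, writer1.xml, ...
WRITER_FILE = re.compile(r"^writer\d+\.xml$", re.IGNORECASE)


def _basename(member):
    """Strip any folder prefix from a zip member name."""
    return member.rsplit("/", 1)[-1]


def manifest_shows_vss(zip_source):
    """Return True if the archive has backup.xml AND at least one writer file.

    `zip_source` is a path or a binary file-like object.
    An archive holding only backup.xml usually means VSS was not used.
    """
    with zipfile.ZipFile(zip_source) as archive:
        names = [_basename(n) for n in archive.namelist()]
    has_backup = any(n.lower() == "backup.xml" for n in names)
    has_writer = any(WRITER_FILE.match(n) for n in names)
    return has_backup and has_writer
```

    A result of False is a reason to look closer at the guest, not a verdict: the heuristic only mirrors the rule of thumb given above.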

    Environmental requirements


    Even if the usefulness of the quiescing option is by now more or less clear, problems often arise in practice, and as a rule they come from an incorrect initial configuration of the environment. The official description of part of the requirements is here, and I will try to spell them out more clearly so that you know where to look when you run into problems in practice:

    First, make sure that your combination of vSphere + guest OS is supported for application-consistent snapshots according to this table (taken from here).



    The data applies to vSphere 5.0 and higher. As you can see, the most popular operating systems at the moment (Windows 2008 and higher) are marked with asterisks. That is where the real catch is buried, and we are going to dig it up now.

    Secondly, for quiescing to really work, you need to make sure that the VSS components of VMware Tools are actually installed (and, naturally, VMware Tools should be the latest version).



    On older versions of vSphere (3.5 and earlier), quiescing also used the Legato Sync Driver, which guaranteed consistency at the file system level but not at the application level (for which the VSS components are needed). Today this driver is hardly used and has been replaced almost everywhere by VMware Snapshot Provider. You can check the installation inside the guest operating system (on the virtual machine) by the presence of the VMware Snapshot Provider service and the corresponding COM+ component.



    What can go wrong at this stage?

    If the VMware Snapshot Provider service is disabled or not installed at all, then VMware, when taking a snapshot with the options quiescing = ON, snapshot memory = OFF, will report success, but in fact the snapshot will be created without VSS inside the guest, that is, through the Legato Sync driver.



    Note that on Windows 2008 and higher the behavior is different: there are no such events in the log, and the Volume Shadow Copy service simply goes into the started state and then into the stopped state.

    Third, one of the typical quiescing configuration problems concerns the disk.EnableUUID = true parameter in the .vmx configuration of the virtual machine.

    This setting only matters for guests running Windows 2008 and higher (for Windows 2003 it is ignored). An additional twist is that the parameter is written automatically when a new virtual machine is created only starting with vSphere 4.1. In other words, if the virtual machine was migrated from an older version of vSphere, the setting may be missing altogether.



    If the parameter is missing or set to false, the behavior when creating a snapshot is similar to the previous case: the snapshot is created successfully, but VSS is not actually used, and as a result we can end up with an inconsistent backup. A second symptom of a disabled parameter is an empty backup.xml (with no description of the VSS writers) in vss_manifests.zip.
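    Checking the parameter across many machines by hand is tedious, so here is a rough sketch that reads the text of a .vmx file and reports whether disk.EnableUUID is switched on. It assumes the usual key = "value" layout of .vmx files and treats key names case-insensitively; both function names are invented for this example.

```python
def parse_vmx(text):
    """Parse .vmx text into a dict with lower-cased keys.

    Assumes one `key = "value"` pair per line; anything else is skipped.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def enable_uuid_is_on(vmx_text):
    """Return True only if disk.EnableUUID is present and set to true.

    A missing parameter counts as off, which matches the case of
    machines migrated from older vSphere versions described above.
    """
    value = parse_vmx(vmx_text).get("disk.enableuuid", "")
    return value.lower() == "true"
```

    For example, enable_uuid_is_on('disk.EnableUUID = "TRUE"') is True, while a file that does not mention the key at all yields False.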

    Fourth, check for dynamic disks inside the guest machine. If at least one dynamic disk is present in the guest system, whether it is the system disk or not, VSS will not be used. The snapshot will be created successfully, but vss_manifests.zip will be empty, and so will the event logs inside the guest OS. This rule applies to guests running Windows 2008 and higher.

    The same applies to IDE disks: there should be none in the virtual machine configuration (IDE CD-ROM devices, however, are fine and do not affect snapshots). Keep in mind as well that the number of free SCSI slots on a SCSI controller must be at least equal to the number of disks on it. For example, if there are already 8 SCSI disks on SCSI1, there will not be enough slots.
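    The two disk rules above can be expressed as a small check. This is a sketch under stated assumptions: a virtual SCSI controller is taken to offer 15 usable device slots (16 unit numbers minus the one reserved for the controller itself), only hard disks are counted, and the (bus, controller index) input format is made up for the example.

```python
from collections import Counter

# Assumption: 16 unit numbers per virtual SCSI controller, one of which
# is reserved for the controller itself.
SCSI_SLOTS_PER_CONTROLLER = 15


def disk_layout_problems(disks):
    """Return human-readable problems for a list of virtual hard disks.

    `disks` is a list of (bus, controller_index) tuples,
    for example ("scsi", 0) or ("ide", 1). CD-ROM devices are not listed.
    """
    problems = []
    ide_count = sum(1 for bus, _ in disks if bus.lower() == "ide")
    if ide_count:
        problems.append(f"{ide_count} IDE disk(s) present")

    per_controller = Counter(
        index for bus, index in disks if bus.lower() == "scsi"
    )
    for index, used in sorted(per_controller.items()):
        free = SCSI_SLOTS_PER_CONTROLLER - used
        # The rule from the text: one free slot is needed per disk.
        if free < used:
            problems.append(
                f"scsi{index}: {used} disks but only {free} free slots"
            )
    return problems
```

    With 7 disks on one controller the list comes back empty, while 8 disks on the same controller are reported as a problem, in line with the example in the text.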

    Fifth, broken VSS inside the guest machine. This is the main item, the one that causes tons of resentment and calls to VMware technical support. People who see a failed snapshot often blame VMware, although the real culprit is a completely different giant of thought: Microsoft. I got roughly this picture when I tried to create a quiesced snapshot of a machine after the installation of a new SQL database had failed (the virtual .iso drive was unmounted during the installation, which the installer really did not like). :-\



    This problem was solved by a plain reboot of the virtual machine, and although that helps very often, there are neglected cases where VSS inside the guest is almost completely broken. In such cases the easiest way to find out whether Microsoft really is to blame is to start Windows Backup and back up the System State. If Windows Backup (or NTBackup) works, the problem is on the VMware side; if it does not, the problem is on the Microsoft side.
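    Before reaching for Windows Backup, it can also help to look at the state of the VSS writers reported by the "vssadmin list writers" command mentioned earlier. The sketch below parses the saved output of that command and lists the writers that are not healthy. It assumes English-locale output, where each writer block contains a "State: [n] ..." line and a "Last error: ..." line; the function name is hypothetical.

```python
import re

WRITER_RE = re.compile(r"^Writer name:\s*'(?P<name>[^']*)'")
STATE_RE = re.compile(r"^State:\s*\[\d+\]\s*(?P<state>.+)$")
ERROR_RE = re.compile(r"^Last error:\s*(?P<error>.+)$")


def unhealthy_writers(output):
    """Return names of writers whose state is not Stable or that report an error."""
    writers = []
    current = None
    for raw_line in output.splitlines():
        line = raw_line.strip()
        match = WRITER_RE.match(line)
        if match:
            # A new writer block starts here.
            current = {"name": match["name"], "state": None, "error": None}
            writers.append(current)
            continue
        if current is None:
            continue  # banner lines before the first writer
        match = STATE_RE.match(line)
        if match:
            current["state"] = match["state"].strip()
            continue
        match = ERROR_RE.match(line)
        if match:
            current["error"] = match["error"].strip()
    return [
        w["name"]
        for w in writers
        if w["state"] != "Stable" or w["error"] != "No error"
    ]
```

    An empty result does not prove that quiescing will succeed, but a non-empty one points at the applications to investigate (or at the need for that reboot).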

    VMware has several official articles on this subject: for example, here and here. There is an interesting detail, though: to make life easier (or perhaps for some other reason), the second article explicitly recommends setting disk.EnableUUID to “false”, which means giving up VSS when creating quiesced snapshots (“quiesced, but not for real!”). In general this is not a solution but only a temporary workaround, because the consequences can show up during recovery, that is, exactly when application consistency is key (remember the same USN rollback).

    To summarize


    In my experience, the most common causes of problems with snapshots (that is, of their inconsistency) are points 2, 3 and 5, while IDE or dynamic disks are much less common.

    Of course, completely mystical cases cannot be ruled out. For example, a snapshot was not created (VMware reported a vague error) because the iSCSI LUN (datastore) holding the problematic virtual machine was physically connected through 2 network cards in teaming mode, one of which was running at 100 Mbit and the other at 1 Gbit.

    The topic of quiesced snapshots can be dug into almost forever. Take, for instance, the fact that Windows 2008, when a quiesced snapshot is created, produces not one but two deltas on the datastore and in effect writes to the snapshot that has already been created (this, by the way, is one of the root causes of the asterisks next to these OSes in the table above), or the ability to disable particular VSS writers through the vmbackup.conf configuration in the guest system. The world is wonderful and amazing, and there are enough rakes to step on for everyone. If there is interest, I will gladly write more on this topic. As usual, comments are welcome; clarifications, as well as reports of errors and typos, are best sent by PM. I will try to answer questions as soon as possible.

    Do not forget to subscribe to our Hub: we have a large number of articles planned on backup and data recovery, and perhaps they will help you solve certain problems (or, better still, avoid them). Thanks for your attention. :)
