Fault tolerance of systems based on HP StorageWorks P4xxx without a third data center

    Background

    About two years ago, management decided to invest in virtualizing our data center. The task was fairly simple: about 50 servers, mostly Windows, a couple of Linux machines, nothing non-standard. The data center is small but proud, and it matters: we are the European headquarters of a large organization and host services for 30 countries (Europe + CIS). We have two data centers with reliable, redundant links between them. For certain reasons we chose a combination of VMware ESXi (4, then 5) and HP LeftHand P4000 (first tranche) and P4500 (second tranche). The reasons are purely subjective: VMware and HP are strategic partners, etc.

    What happened?
    According to our rules, we test redundancy and failover twice a year. For the virtualized services we decided to split the process into two stages. The first stage simulated a failure of the hypervisor hosts only (we literally cut the power - it's crude, but that is how the test procedure is described in the documentation). As expected, VMware HA and FT worked as they should, and the committee ticked off the protocols and signed them. In the second stage, the storage devices (LeftHand) were powered off together with the hypervisors and ... no miracle happened. The HP Centralized Management Console reported an error: the data was unavailable, even though the backup devices were powered on and reachable ... but there was no quorum. We could not restore service, had to turn everything back on urgently, and no failover was achieved.

    We started investigating.
    We knew that automatic failover needed a third data center - at pre-sales meetings, HP representatives warned us about this many times. Admins were not invited to those meetings and no clarifying questions were asked, so for some reason management decided that "automatic failover is possible only with 3 data centers" implies "if done manually, two data centers are enough." But no: in response to our request, HP Support replied that without a third data center failover is impossible, whether manual or automatic. The principle is similar to that described here (in our case the systems are somewhat different, but in general it is the same case).

    [image]

    In short, everything depends on the Failover Manager (FOM). At the moment the main device fails, the FOM must be reachable over the network from the backup data center, in order to avoid both sides running in parallel - split brain. The FOM itself does not contain any data and is needed only in the event of a failure, as a witness. The FOM is a regular virtual machine with more than modest requirements (2 GHz, 1 GB RAM, 13 GB HDD), and all it needs to function is access to our iSCSI VLAN. We immediately worked out and presented to management the option of a Windows server in the cloud with a VPN into our iSCSI VLAN and the free VMware Server to run the FOM ... but the project was rejected with these comments:
    • a) automatic failover is not needed;
    • b) using cloud-hosted servers in the storage network violates the information security policy.
    Management set the task: solve the problem without using the cloud, and provide the ability to switch manually between the primary and backup storage devices.
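
    To make the "no quorum" result concrete, here is a minimal sketch of the majority rule. The layout is an assumption for illustration only (one manager per storage site plus the FOM, with the FOM hosted in the primary data center); the names are hypothetical and this is not the LeftHand implementation.

```python
def has_quorum(reachable: int, total: int) -> bool:
    """Quorum means a strict majority of all managers is reachable."""
    return reachable > total // 2


# Assumed layout: manager name -> site it runs in (hypothetical names).
MANAGERS = {
    "primary-node": "primary",
    "fom": "primary",        # the witness lives next to the primary storage
    "backup-node": "backup",
}


def quorum_after_site_failure(managers: dict, failed_site: str) -> bool:
    """Count the managers that survive the loss of one site."""
    alive = [name for name, site in managers.items() if site != failed_site]
    return has_quorum(len(alive), len(managers))


if __name__ == "__main__":
    # Primary site lost: only 1 of 3 managers remains, so no quorum.
    print(quorum_after_site_failure(MANAGERS, "primary"))
    # Backup site lost: 2 of 3 managers remain, so quorum holds.
    print(quorum_after_site_failure(MANAGERS, "backup"))
```

    Under these assumptions the backup site alone never holds a majority, which matches what we saw in the test: the backup devices were up, but the data stayed unavailable.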

    And here is how we solved the problem:
    • On one of the ESXi hosts in the backup data center, activate local storage (to provide access in case of SAN failure)
    • Create a full copy of the main FOM (copy everything, most importantly the MAC address of the virtual network card connected to the iSCSI network) on the host in the backup data center
    • Leave the FOM in the backup data center in StandBy mode
    And that's all. In case of a failure in the main system, the FOM in the backup data center is taken out of StandBy mode and completely replaces the inaccessible main FOM. When returning to normal operation, you just need to remember to put it back into StandBy mode.
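
    The manual switch can be sketched as a tiny state model. Everything here is hypothetical (the `Fom` class, the function names, and the one-vote-per-manager counting are assumptions for illustration, not HP or VMware APIs). The guard reflects the one thing to be careful about: the two FOMs are clones with the same MAC address, so they should never be active at the same time.

```python
from dataclasses import dataclass


@dataclass
class Fom:
    site: str
    active: bool  # False models the clone sitting in StandBy


def manual_failover(primary: Fom, standby: Fom, primary_site_down: bool) -> None:
    """Wake the standby clone only when the main FOM cannot be running."""
    if not primary_site_down and primary.active:
        raise RuntimeError("main FOM is still running; refusing to start its clone")
    primary.active = False  # unreachable together with its site
    standby.active = True


def return_to_normal(primary: Fom, standby: Fom) -> None:
    """Put the clone back into StandBy before the main FOM takes over again."""
    standby.active = False
    primary.active = True


def votes_at_backup(standby: Fom) -> int:
    """Assumed counting: the backup storage node is one vote, an active FOM one more."""
    return 1 + (1 if standby.active else 0)


if __name__ == "__main__":
    main_fom = Fom(site="primary", active=True)
    clone = Fom(site="backup", active=False)

    manual_failover(main_fom, clone, primary_site_down=True)
    print(votes_at_backup(clone))  # backup site now holds 2 of the assumed 3 votes

    return_to_normal(main_fom, clone)
    print(votes_at_backup(clone))
```

    The point of the sketch is the ordering: the clone goes active only after the main site is known to be down, and goes back to StandBy before the main FOM is used again.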


    P.S. Sorry that the text in the pictures is in English; they were copied from the report.
