
Fault tolerance of systems based on HP StorageWorks P4xxx without a third data center
Background
About two years ago, management decided to invest in a virtualization project for our data center. The task was quite simple: about 50 servers, mostly Windows, a couple of Linux machines, nothing non-standard. The data center, although small but very ...
What happened?
According to our rules, we test redundancy and failover twice a year. For the virtualized services, we decided to split the process into two stages. The first stage simulated a failure of the hypervisor hosts only (we really did cut the power; it's crude, but this is how the testing process is described in the documentation). As expected, VMware HA and FT worked as they should, and the committee ticked off the protocols and signed off. In the second stage, the storage devices (LeftHand) were powered off together with the hypervisors and ... no miracle happened. The HP Centralized Management Console showed an error: the data was unavailable, even though the backup devices were powered on and reachable ... but there was no quorum. Service could not be restored; we had to urgently turn everything back on, and no failover was achieved.
We started investigating.
We knew that automatic failover requires a third data center: at pre-sales meetings, HP representatives warned us about this many times. Admins were not invited to the meetings, no clarifying questions were asked, and for some reason management decided that "automatic failover is possible only with three data centers" implies "if done manually, two data centers are enough." But no: in response to our request, HP Support replied that without a third data center failover is impossible, whether manual or automatic. The principle is similar to that described here (in our case the systems are somewhat different, but in general it is the same case).
- a) automatic failover is not needed;
- b) using cloud-hosted servers in the storage network is contrary to the information security policy.
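To make the quorum problem concrete, here is a minimal sketch. It assumes a simple-majority quorum and a hypothetical layout of two storage-node managers per site plus one FOM in the primary data center; the names `MANAGERS`, `has_quorum` and `surviving` are illustrative and not part of any HP tool, and our real layout may differ in the details.

```python
def has_quorum(total_managers: int, reachable_managers: int) -> bool:
    """Simple majority: more than half of all configured managers must be reachable."""
    return reachable_managers > total_managers // 2


# Hypothetical layout (an assumption, not taken from the report):
# two storage-node managers per site, with the FOM in the primary site.
MANAGERS = {
    "primary": ["node-a1", "node-a2", "fom"],
    "backup": ["node-b1", "node-b2"],
}


def surviving(managers, failed_site, standby_fom_started=False):
    """Managers still reachable after one site goes down completely."""
    alive = [
        name
        for site, names in managers.items()
        if site != failed_site
        for name in names
    ]
    # A standby copy of the FOM, started by hand in the surviving site,
    # takes the place of the lost FOM and counts as the same manager.
    if standby_fom_started and "fom" not in alive:
        alive.append("fom")
    return alive


total = sum(len(names) for names in MANAGERS.values())

without_fom = surviving(MANAGERS, failed_site="primary")
with_fom = surviving(MANAGERS, failed_site="primary", standby_fom_started=True)

print(len(without_fom), "of", total, "quorum:", has_quorum(total, len(without_fom)))
print(len(with_fom), "of", total, "quorum:", has_quorum(total, len(with_fom)))
```

Under these assumptions, losing the primary site leaves two of five managers, which is not a majority. One more manager in the surviving site is enough to tip the balance.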
And here is how we solved the problem:
- On one of the ESXi hosts in the backup data center, activate local storage (to provide access in case of SAN failure)
- Create a full copy of the main Failover Manager (FOM) on that host in the backup data center (copy everything, most importantly the MAC address of the virtual network card connected to the iSCSI network)
- Leave the FOM in the backup data center in standby mode
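The steps above imply a few checks before the standby FOM is started by hand. The sketch below is only an illustration: the `FomVm` record, the function names and the MAC addresses are hypothetical placeholders, and the rule that the main FOM must be powered off first is our assumption (two running machines with the same MAC address on one network would conflict), not something taken from HP documentation.

```python
from dataclasses import dataclass


@dataclass
class FomVm:
    name: str
    site: str
    iscsi_mac: str      # MAC of the virtual NIC attached to the iSCSI network
    powered_on: bool


def normalize_mac(mac: str) -> str:
    """Compare MACs regardless of separator style or letter case."""
    return mac.replace("-", ":").lower()


def standby_start_problems(primary: FomVm, standby: FomVm) -> list:
    """Return a list of reasons not to start the standby FOM (empty means OK)."""
    problems = []
    if normalize_mac(primary.iscsi_mac) != normalize_mac(standby.iscsi_mac):
        problems.append("standby MAC differs from primary MAC")
    if primary.powered_on:
        # Assumption: never run both copies at once, since they share a MAC.
        problems.append("primary FOM is still powered on")
    if standby.powered_on:
        problems.append("standby FOM is already running")
    return problems


# Placeholder values for illustration only.
primary = FomVm("fom-main", "primary", "00:50:56:00:00:01", powered_on=False)
standby = FomVm("fom-standby", "backup", "00-50-56-00-00-01", powered_on=False)

print(standby_start_problems(primary, standby) or "OK to start the standby FOM")
```

In practice this is a checklist for the admin on duty rather than something to automate, since we deliberately did not want automatic failover.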

P.S. Sorry that the text in the pictures is in English; it was copied from the report.