The Black Friday of IT, or a Tale of Data Loss


    There is a wonderful Russian proverb: even the most careful old woman can blunder. It could well be the motto of our industry: even a well-designed, multi-level system of protection against data loss can fall victim to an unexpected bug or human error. Alas, such stories are not uncommon, and today we want to tell you about two cases from our practice when everything went wrong. Shit happens, as old Forrest Gump used to say.

    Case One: Bugs Are Omnipresent


    One of our customers ran a redundancy scheme designed to protect against hardware failure. High availability at the infrastructure and application levels was provided by a cluster built on Veritas Cluster Server, and data protection relied on synchronous replication between external disk arrays. Before any update of the system or its individual components, testing was always carried out on a test bench in accordance with the software vendor's recommendations.

    The setup was solid: it was distributed across several sites so that even the loss of an entire data center could be survived. It would seem that everything was fine: the redundancy was excellent, the automation was configured, nothing terrible could happen. The system had worked for many years without problems, and whenever failures did occur, everything was handled normally.

    But then the moment came when one of the servers went down. During the switchover to the backup site it turned out that one of the largest file systems was unavailable. It took us a long time to sort things out, and in the end we found that we had hit a software bug.

    It turned out that shortly before these events a system software update had been rolled out: the vendor had added new features to improve performance. One of those features was buggy, and as a result the data blocks of one of the database data files were corrupted. The file was only 3 GB, but the trouble was that it had to be restored from backup, and the last full backup of the system was ... almost a week old.

    To begin with, the entire database had to be restored from the week-old backup, and then all the archive logs for that week applied on top. Pulling several terabytes off tape takes a long time: how fast the backup system works depends on its current load and on free tape drives and media, which at the critical moment may well be busy with another restore or with writing yet another backup.
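
    For readers less familiar with the mechanics: such a recovery boils down to restoring the last full backup and then rolling the database forward with the archived redo logs. Below is a minimal sketch of how a run like this could be driven with RMAN from a small Python wrapper; the OS-authenticated SYSDBA connection, the tape (sbt) channel and the rman binary on the PATH are assumptions for illustration, not the customer's actual configuration.

import subprocess
import tempfile

# Hypothetical RMAN script: restore the last full backup from tape,
# then roll the database forward by applying the archived redo logs.
RMAN_SCRIPT = """
run {
  # media-manager (tape) channel; parameters are placeholders
  allocate channel t1 device type sbt;
  # pull the week-old full backup off tape
  restore database;
  # apply the week's worth of archive logs on top of it
  recover database;
  release channel t1;
}
"""

def restore_database() -> int:
    """Write the RMAN script to a temp file and run it against the target DB."""
    with tempfile.NamedTemporaryFile("w", suffix=".rman", delete=False) as f:
        f.write(RMAN_SCRIPT)
        cmdfile = f.name
    # 'rman target /' connects as SYSDBA via OS authentication
    result = subprocess.run(["rman", "target", "/", f"cmdfile={cmdfile}"])
    return result.returncode

if __name__ == "__main__":
    raise SystemExit(restore_database())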

    In our case everything turned out even worse; misfortunes never come alone. Restoring the full backup was quick enough, because both tapes and drives had been allocated. But the backups of the archive logs all turned out to be on the same tape, so the recovery could not be run in several parallel streams. We sat and waited while the system found the required position on the tape, read the data, rewound, searched for the next position, and so on, round and round. And since the backup system was automated, a separate task was generated for the recovery of each file; it fell into the general queue and waited there for the necessary resources to become free.

    In the end, restoring those unfortunate 3 gigabytes took us 13 hours.

    After this incident the customer revised their redundancy scheme and started thinking about how to speed up the backup system. We decided to move away from tape and looked at software-defined storage and various distributed file systems. The customer already had virtual tape libraries at that point; their number was increased, and deduplication appliances and local data storage were introduced to speed up access.

    Case Two: To Err Is Human


    The second story is more prosaic. The approach to building distributed redundancy systems had been standardized at a time when nobody expected to see low-skilled personnel at the server management console.

    The monitoring system raised an alert about high file system utilization, and the engineer on duty decided to clean up the file system on the database server. He found what he thought was the database audit log and deleted its files. The catch is that the database itself was also called "AUDIT". As a result, the customer's on-duty engineer mixed up the directories and briskly deleted the database itself.

    But the database was running at that moment, so free disk space did not increase after the deletion. The engineer went looking for other ways to free up space in the file system, found them, and calmed down, without telling anyone what he had done.

    About 10 hours passed before users started reporting that the database was slow and that some operations did not go through at all. Our specialists started digging and discovered that the database files were simply not there.

    Switching to the backup site was pointless, because synchronous replication worked at the array level: every change made on one side was instantly reflected on the other. In other words, the data was gone both on the main site and on the backup one.

    What saved the engineer from suicide or a lynching was the experience of our employees. The thing is, the database files lived on the cross-platform VxFS file system. Because nobody had stopped the database, the inodes of the deleted files had not gone back to the free list and had not yet been reused by anyone. Had the application been stopped and the files "released", the file system would have finished the dirty deed, marked the blocks occupied by the data as free, and any application asking for more space could then have safely overwritten them.
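
    The key point is that on a Unix file system an inode is freed only when the last reference to it disappears, and an open file descriptor counts as such a reference. The tiny Python demonstration below shows the same behaviour on Linux (the /proc trick and the temporary file path are our own illustration, not part of the original incident):

import os

# A deleted file's data stays reachable while a process still holds an
# open descriptor, because the inode is not returned to the free list
# until the last reference to it is dropped.

path = "/tmp/precious.dat"
with open(path, "w") as f:
    f.write("database blocks we really do not want to lose\n")

fd = os.open(path, os.O_RDONLY)   # keep a reference to the inode
os.unlink(path)                   # "rm" the file: the directory entry is gone

# The name is gone, but the blocks are still ours to read:
print(os.read(fd, 4096).decode(), end="")

# On Linux the still-open inode is even reachable by path:
link = f"/proc/self/fd/{fd}"
print("recoverable via:", link, "->", os.readlink(link))

os.close(fd)   # only after the last close may the blocks be reused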

    To save the situation, we broke the replication on the fly, so that the release of the blocks would not be propagated to the file system on the remote node. Then, using the file system debugger, we went through the latest changes, worked out which inodes corresponded to which files, re-linked them, made sure the database files were back in place, and checked the consistency of the database.

    We then proposed that the customer bring up a database instance in standby mode on the recovered files and use it to resynchronize the data on the backup site without losing the online data. Once all this was done, we were able to switch over to the backup site without any losses.
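
    The original steps are not spelled out here, but the idea of bringing up a standby on the recovered files and letting it catch up can be sketched roughly as follows (the sqlplus call assumes OS authentication on the standby host; treat this purely as an illustration of the approach):

import subprocess

# Illustrative only: mount the recovered files as a physical standby and
# start managed recovery so redo from the primary is applied until the
# backup site is back in sync.
SQLPLUS_SCRIPT = """
startup nomount
alter database mount standby database;
alter database recover managed standby database disconnect from session;
exit
"""

def start_standby_apply() -> int:
    # 'sqlplus / as sysdba' assumes OS authentication on the standby host
    return subprocess.run(
        ["sqlplus", "-S", "/", "as", "sysdba"],
        input=SQLPLUS_SCRIPT,
        text=True,
    ).returncode

if __name__ == "__main__":
    raise SystemExit(start_standby_apply())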

    By all estimates, the original recovery plan based on the backup system could have taken up to a week. As it was, the customer got away with degraded service rather than complete unavailability. A single human error could have resulted in financial and reputational losses, or even the loss of the business.

    Afterwards our engineers came up with a solution. Since 90% of the customer's systems run Oracle databases, we proposed keeping the old redundancy scheme but supplementing it with protection at the application level: on top of the array-based synchronous replication we added software replication with Oracle Data Guard. Its only drawback is that after a failover a number of manual steps are required that are hard to automate. To avoid them, we implemented instance switchover between sites by means of the cluster software, with substitution of the database configs.
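
    How exactly the config substitution was wired into the cluster software is not described here, so the following is only a sketch of the idea: a failover hook that copies the site-specific Oracle parameter file into place before starting the instance on the surviving site. All paths, file names and the "AUDIT" SID are hypothetical.

import shutil
import subprocess
from pathlib import Path

# Hypothetical layout: one parameter file per site, kept next to the
# location the instance actually reads at startup.
ORACLE_HOME  = Path("/u01/app/oracle/product/19.0.0/dbhome_1")  # assumption
PFILE_DIR    = Path("/etc/failover/pfiles")                     # assumption
ACTIVE_PFILE = ORACLE_HOME / "dbs" / "initAUDIT.ora"            # assumption

def activate_site(site: str) -> None:
    """Substitute the site-specific config and start the instance.

    Intended to be called from the cluster framework's online/failover hook.
    """
    site_pfile = PFILE_DIR / f"initAUDIT.{site}.ora"   # e.g. 'main' or 'dr'
    shutil.copyfile(site_pfile, ACTIVE_PFILE)          # the config substitution

    # Start the instance with the freshly substituted parameter file.
    subprocess.run(
        ["sqlplus", "-S", "/", "as", "sysdba"],
        input=f"startup pfile='{ACTIVE_PFILE}'\nexit\n",
        text=True,
        check=True,
    )

if __name__ == "__main__":
    activate_site("dr")   # illustrative: bring the instance up on the DR site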



    The result is an additional layer of data protection. On top of that, the new redundancy scheme relaxed the requirements on the arrays, so the customer saved money by moving from hi-end arrays to mid-range storage systems.

    And that is our happy ending.

    Enterprise Systems Support Team, Jet Infosystems
