oowl February 5, 2009 at 08:43

A server story

This story happened to me last week, from January 26 to January 31, 2009. Having lived through this wonderfully short stretch of my life, I came to appreciate simple things, started to believe in chance, and grew more and more disappointed in people. The tags of the week were RAID, Infobox and backup. Although it all started much earlier...

Part one

In January 2008 I rented a small server from the St. Petersburg company Infobox. Mediocre in its specs and relatively cheap, it covered my needs at the time perfectly well. The rental included the initial installation of the operating system, which of course was FreeBSD, and the disk partitioning I asked for. The kind technical support staff also combined a pair of 120 GB drives into a software RAID 1 (mirror). I asked a friend who works as a system administrator in many places at once to look after the server. He installed a web server and configured all the services, including a full backup of the data to archives twice a day. On my home computer I set up a script that regularly fetched these backups and stored them in a folder. I cleaned this folder out from time to time.

You will agree that, on the whole, everything was set up pretty well: RAID 1, plus archives on the server, plus archives on my home computer, which is switched on around the clock.
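For illustration, the periodic cleanup of the home backup folder could be automated along these lines. This is only a sketch, not the script that was actually used: the function name `prune_archives`, the `.tar.gz` suffix and the default of 14 kept archives are all assumptions.

```python
from pathlib import Path


def prune_archives(folder, keep=14, suffix=".tar.gz"):
    """Delete all but the `keep` newest archives in `folder`.

    Returns the sorted names of the deleted files. Files that do not end
    with `suffix` are never touched. All names and defaults here are
    hypothetical.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")

    # Newest archives first, judged by modification time.
    archives = sorted(
        (p for p in Path(folder).iterdir()
         if p.is_file() and p.name.endswith(suffix)),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )

    deleted = []
    for old in archives[keep:]:
        old.unlink()
        deleted.append(old.name)
    return sorted(deleted)
```

Keeping a fixed number of recent archives, rather than wiping the folder by hand, means that an interrupted download job still leaves the last good copies in place.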

I immediately moved the CakePHP site from shared hosting to the brand-new server. Later other sites appeared that are well known to the Habr audience, such as MyNotifier, CodeIgniter and my home page, along with many other projects that have little to do with this story.

Part two

And I lived happily ever after, until in January of this year I decided to upgrade the aging Ubuntu on my home machine from 8.04 to 8.10, and while I was at it to start my desktop life from scratch: format the drives and install the OS clean. This noble deed took place on January 23. There seemed little point in saving the accumulated backups: "I'll reinstall the system, set the script up again and collect the archives," I thought. But life is fast and unpredictable, and over the next couple of days I could not find much time to set up my new 8.10.

When I got home on the evening of the 26th, I found my Jabber and ICQ contact lists clogged with messages. All of them said that something on my sites was not working. This was easy to verify: just open any of the projects and wait half a minute for the page to load with a database error. Deciding that the problem was simple, I restarted the misbehaving MySQL, but that did not have the desired effect. Worse, the server responded over SSH at the speed of a turtle, or a little slower. The situation was aggravated by the fact that my administrator friend was at that moment riding peacefully on the St. Petersburg to Moscow train and was in no position to deal with my worldly problems.

I asked Infobox technical support to restart the server. So began my correspondence with them, which by now runs to 53 messages.

The server was rebooted, but nothing changed. I then suggested that some hardware had died, perhaps a fan or a hard drive. It turned out to be a hard drive, which the support staff kindly replaced a little over an hour later, starting a background copy from the old drive. By then it was night, and after several fruitless attempts to reach the server I went to bed, deciding that the background copy that was keeping the server so busy would be finished by morning. But in the morning nothing had changed.

Meanwhile, my administrator arrived in Moscow and some time later sent me a log of failed attempts to write to the new hard drive. It looked like this:

...
Jan 27 10:44:44 oowl kernel: ad6: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA = 74274048
Jan 27 10:46:14 oowl kernel: ad6: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA = 74344960
Jan 27 10:47:05 oowl kernel: ad6: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA = 50792319
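Lines like these are easy to count automatically, which makes a flood of write timeouts on a single device stand out. Below is a minimal Python sketch that assumes the log format shown above; the function name `count_disk_timeouts` and the regular expression are illustrative and not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... kernel: ad6: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA = 74274048"
# The pattern is an assumption based only on the sample lines in this post.
TIMEOUT_RE = re.compile(
    r"kernel: (?P<dev>\w+): TIMEOUT - (?P<op>\w+) retrying"
)


def count_disk_timeouts(lines):
    """Count timeout messages per (device, operation) pair."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("op"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Jan 27 10:44:44 oowl kernel: ad6: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA = 74274048",
        "Jan 27 10:46:14 oowl kernel: ad6: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA = 74344960",
    ]
    for (dev, op), n in count_disk_timeouts(sample).items():
        print(dev, op, n)
```

Run against the system log on a schedule, a check of this kind could raise an alarm long before users start writing in.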

The next half day went on the Infobox engineers convincing themselves that writes were impossible and that the new hard drive was faulty. The drive was replaced again, and this time a real background copy started. By then I had received my fiftieth message from users asking what had happened.

When the server began to respond at an acceptable speed, I thought the unpleasant story was over. As it turned out, the adventures were only beginning, because I found myself in the past! The latest forum posts were dated May 24, 2008. The jobs in MyNotifier confirmed my teleportation. To make sure I had not gone crazy, I had to look at the calendar. Outside it was winter, while on the server it was spring, and last year's spring at that.

After negotiations with support, I received the following from them:

The server is now running on the hard drive that was in the RAID before the second hard drive crashed. The second hard drive failed at the physical level and data recovery is impossible. The drives should have been synchronizing since May; apparently, because of some error on the drive that has now failed, this did not happen.

Part three

So I was left with nothing: on the first hard drive it was blooming May, the second had "failed at the physical level, data recovery is impossible", and the script for collecting backups had not yet been set up on my local computer (remember my move to 8.10?). I had thus lost almost a year's worth of accumulated information, including the complete source code of some projects of which no other copies existed.

After digging through all of my May correspondence with my administrator, I concluded that nothing had been installed, wiped or even rebooted on the server in May. The hard drive failure did not show up in any log.

Something had to be done, and as soon as possible. I phoned around the reputable data recovery companies and arranged a visit to the data center, where the dead drive would be handed over to me against a signed receipt. You can only get into the DC from 10 in the morning. At 9:30 I was already pacing at the door. Grabbing the still-warm corpse of the hard drive, I rushed it to intensive care for its kind.

Part four

At 10:15 I was already describing the situation to the technician. "We'll see," he grunted, and disappeared into a dark room behind the counter, leaving me to fill out a questionnaire about partition sizes, where the information was located, and what needed to be restored first. Not five minutes had passed when the technician ran out with the words: "Are you kidding me?! Are you bored or something?! Why did you bring me a perfectly healthy hard drive?!"

There was an awkward pause. The technician looked at me reproachfully, and I looked back at him, doubting his professional abilities, having already buried the drive in my mind. "That can't be, check again," I said, not believing my ears. The technician connected the drive to a Windows machine behind the customer counter and, using the UFS Explorer utility, showed me the contents of the drive: my documents, databases, pictures and everything else I asked for, all in the hope of getting rid of his abnormal client.

I came home with the hard drive and realized, to my horror, that I had nowhere to connect it: I do not own a desktop PC. After calling all my friends, I found that everyone either owned only laptops or had nowhere to plug in a drive with a SATA connector. This would of course have been no problem for my administrator, but he was in Moscow.

Meanwhile, the angry correspondence with Infobox technical support continued. As their excuse, they chose the phrase:

We couldn’t work with this disk, perhaps it’s the server’s configuration.

And they also wrote:

... you can bring the hard drive back to us, we will try to copy the information, or connect it to your server.

I had no other options, and the next morning I took the hard drive back to the data center. Meanwhile, the number of messages asking me to explain the situation passed 80, and the new hard drive installed in the rented server slowly began to fail.

Part five

2009-01-29 11:03:32 <...> Well, during the day we’ll copy the data.

2009-01-29 19:18:28 No copying has been done yet. There were problems with the "old hard drive": only the 500 MB root file system would mount. I have now managed to mount the /var, /usr and /home partitions, but errors still appear. <...>

And about the server with the new drive, which had started to hang constantly:

The server hung and there was no signal on the console; it has been rebooted and now responds to ping <...>

Late in the evening of the next day my administrator got in touch and explained where the necessary data was located. I immediately passed this information on to tech support.

2009-01-30 17:36:59 Thank you for the information, we will keep you informed.

2009-01-30 21:53:53 Me : What is the state of the process now?
2009-01-30 21:55:57 Engineer : Trying to copy data.

Part six

My patience snapped: as you can see, two days had gone by under the phrase "we are trying to copy". On Saturday morning, the next day, I was allowed to pick up the hard drive again. Stepping on the gas, I drove to my administrator, who had just returned to St. Petersburg.

Imagine my surprise when he told me that he had simply copied all the data. The only read errors came from a single InnoDB database that had gone down uncleanly during the failure. The rest of the files were extracted without any problems. A reasonable question arises: what had support been doing for two days while sending me reports about the recovery process? But let us leave that on the conscience of the engineers who, by the way, in their "attempts" to read data from the hard drive, were writing to it!

Conclusion

My story has a happy ending. All the data came back and the projects are running. I gave up the rented server and put my own machine in colocation. The money for the remainder of the lease was returned to me in three stages: first they flatly refused, then they credited it with an 800-ruble error in their own favor, and then, after one more letter, they corrected it.

I do not know whether my story has helped anyone see the need for reliable backups, choose a data center, or simply think about the value of information, but it taught me a lot, frayed my nerves considerably and gave me invaluable life experience. After a week of correspondence with support, 119 messages from site users and endless running around, I still gained more than I lost.

Thank you for your attention.
