Virtual server with Ubuntu 11.04, Software RAID and its recovery

Hi Habr. I would like to describe how I solved a problem with Software RAID on Ubuntu Server 11.04 that I ran into after an unclean server reboot.

A couple of days ago I was working as usual, writing PHP code; the office server was under little load. We have generally moved to writing both server and client code on our own machines and versioning it with git; only the MySQL database is sometimes used directly from that server, and when needed, git push does the rest. Many of the projects in development have vhosts configured on the server, updated from git and accessible from the Internet.

After reloading a page served from that server, I sensed something was wrong: part of the page loaded, and then nothing. Things got worse when a colleague came over and said his smb access to the server had stopped working, and then my ssh connection dropped as well. It became clear that more than just apache had hung.

"Not a problem," I thought, "we have virtualization: reboot the VM and the trick is done." Or so I thought. There is a physical server running Ubuntu Server 11.04, and inside it another Ubuntu Server 11.04 runs under qemu; the guest hosts all the necessary services. Why such a setup? The decision was made by a more experienced colleague who has unfortunately since quit, and I am not particularly strong in system administration. I will leave out the small saga of recovering the passwords, which of course I did not know :)

I connected to the physical server and ran virsh, and there:

Ok, the server is running, id 1. But the console would not attach (in my slight panic I forgot about vnc; not that it would have helped much at that point, although it was configured for the guest OS).
reboot 1
error: this function is not supported by the hypervisor: virDomainReboot

Not ok, but what can you do:
destroy 1
start 1

I waited. The console still would not attach, and ssh did not work either. In short, the server would not start, and repeated destroy / start attempts led nowhere. Getting desperate, I decided to look at the guest OS configuration. And there:
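The guest configuration can be inspected with `virsh dumpxml <domain>`; the detail that mattered was the VNC console. An illustrative fragment of a libvirt domain definition with VNC enabled (sample values, not this server's actual config):

```
<!-- illustrative fragment of a libvirt domain XML, as shown by `virsh dumpxml 1` -->
<graphics type='vnc' port='5900' autoport='yes' listen='0.0.0.0'/>
```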

I was delighted and went to look at this mess over vnc. Everything from here on happens inside the guest OS. And there:
The disk drive for /some/mounted/folder is not ready yet or not present
Continue to wait; or Press S to skip mounting or M for manual recovery

After the first press of S, I realized things looked very bad. After impulsively pressing S 100500 times (in practice, holding S down for a second), the OS finished booting, but mysql, apache and many other daemons did not start, because directories like /var/lib/mysql were not mounted. Having fought my way through the login, I tried to figure out where everything had gone (we do have backups, but I really did not want to spend the rest of the week restoring them). The strange entries of the form /dev/md/1_0 in /etc/fstab and /dev/ alarmed me. Google suggested that these are parts of a Software RAID array. Ubuntu inside Ubuntu, and Software RAID inside that Ubuntu... right. There turned out to be five such parts.
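The standard place to see which software RAID arrays the kernel has assembled is /proc/mdstat. A minimal sketch of reading it; the transcript below is a made-up sample of that file's format (the array names and partitions are illustrative, not the ones from this server, where you would read /proc/mdstat itself):

```shell
# Sample of what /proc/mdstat typically looks like for two raid1 arrays
# (illustrative data only)
cat > /tmp/mdstat.sample <<'EOF'
Personalities : [raid1]
md1 : active raid1 sda5[0] sdb5[1]
      10485696 blocks [2/2] [UU]
md5 : active raid1 sda6[0] sdb6[1]
      20971456 blocks [2/2] [UU]
unused devices: <none>
EOF

# Print each array together with its member partitions
awk '/^md/ {print $1, "->", $5, $6}' /tmp/mdstat.sample
# prints:
# md1 -> sda5[0] sdb5[1]
# md5 -> sda6[0] sdb6[1]
```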

Google suggested that fsck and mdadm would help me:
fsck -nvf /dev/md/1_0
fsck -nvf /dev/md/5_0

The -n flag asks fsck to change nothing, -v dumps plenty of interesting information to the console, and -f checks everything even if the filesystem is not marked as damaged. Of the five arrays, three had filesystem errors/damage.
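The same dry-run flags can be tried safely on a throwaway filesystem image; a minimal sketch, assuming e2fsprogs is installed (fsck delegates ext3 checks to e2fsck, which happily works on a plain file without root):

```shell
# Build a small scratch ext3 image and dry-run a check on it
dd if=/dev/zero of=/tmp/demo.img bs=1M count=8 2>/dev/null
mkfs.ext3 -q -F /tmp/demo.img   # -F: allow a regular file as the target
e2fsck -nvf /tmp/demo.img       # -n: change nothing, -v: verbose, -f: force check
echo "exit code: $?"            # 0 = clean; 4 would mean errors left uncorrected
```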

Then I took a chance and asked fsck to fix everything:
fsck -vf /dev/md/1_0
fsck -vf /dev/md/5_0

Meanwhile, for every device mdadm reported:
mdadm --detail /dev/md/1_0
Raid Level : raid1

/etc/fstab said:
/dev/md/1_0 /var/www ext3 defaults,noatime 1 2

The repairs were a great success. It remained to figure out how to bring everything back together; a reboot did not help, and the arrays would not assemble on their own. It turned out that the names and the mapping of devices /dev/md[xxx] to /dev/md/[yyy] (symbolic links to /dev/md[xxx] are created in /dev/md/) changed on every reboot. Because of that, the devices listed in /etc/mdadm.conf were not found by the system and were not mounted automatically.
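One plausible fix (an assumption on my part, not something verified on that server) is to identify arrays in /etc/mdadm.conf by UUID alone, since the UUID survives the device renames:

```
# /etc/mdadm.conf: pin the array by its UUID instead of by device paths
# (the UUID is the one from this story; the /dev/md1 name is illustrative)
ARRAY /dev/md1 UUID=4e9f1a60:4492:11e2:a25f:0800200c9a66
```

`mdadm --detail --scan` can generate such lines automatically, and running `update-initramfs -u` afterwards lets the early-boot environment pick them up.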

At this point I stopped asking "how did this ever work before?" and resolutely started looking for a way to associate what was written in that file with what I saw in /dev/md/.

And I found one:
mdadm --detail /dev/md/123_0
UUID : 4e9f1a60:4492:11e2:a25f:0800200c9a66
less /etc/mdadm.conf
ARRAY /dev/md/1_0 level=raid1 metadata=0.90 num-devices=2 devices=/dev/sda5,/dev/sdb5 UUID=4e9f1a60:4492:11e2:a25f:0800200c9a66

The connection was found (the UUID), and the rest was simple: point the old mount points from /etc/fstab at the new devices from the /dev/md[xxx] list, which was done:
mount -a
# mounts everything described in /etc/fstab
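The UUID matching above can be scripted; a small sketch working on sample text copied from the outputs above (on a live system the first line would come from `mdadm --detail` and the second from /etc/mdadm.conf):

```shell
# Extract the UUID from an `mdadm --detail` style line
detail_uuid=$(echo "UUID : 4e9f1a60:4492:11e2:a25f:0800200c9a66" | awk '{print $3}')

# Extract the UUID from an mdadm.conf ARRAY line
conf_line='ARRAY /dev/md/1_0 level=raid1 metadata=0.90 num-devices=2 devices=/dev/sda5,/dev/sdb5 UUID=4e9f1a60:4492:11e2:a25f:0800200c9a66'
conf_uuid=$(echo "$conf_line" | grep -o 'UUID=[^ ]*' | cut -d= -f2)

# Same UUID means the freshly named device is the array from the config
[ "$detail_uuid" = "$conf_uuid" ] && echo "match: $detail_uuid"
# prints: match: 4e9f1a60:4492:11e2:a25f:0800200c9a66
```

Once matched, `mdadm --assemble --uuid=<uuid> /dev/mdX` should (again, an assumption for this particular setup) reassemble the array regardless of its current name, after which `mount -a` picks it up.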

After restarting mysql, apache and the rest, and seeing that the contents of /var/www were back and everything was up and running again, I calmed down and went for coffee. As it turned out, the server fell four days short of its uptime milestone. Still, the problem cannot be called 100% solved: the behavior on reboot remains unexplained, but at least there is now a checklist of manipulations that bring it back to life. A question for the community: has anyone run into this?

On this question I will end my story. Tips, questions and comments are welcome.

PS: After the work described above, I wanted to get rid of this odd disk configuration while keeping the virtualization. It is also a good occasion to build up my server administration skills and to ask management for a memory upgrade for the server, whose hardware gives no cause for complaint (a Dell PowerEdge tower).
