Extreme data recovery from a degraded RAID 5
Based on real events.
Repeating any of these actions, or making similarly rash decisions, can lead to complete data loss. This is not a how-to; the material is here only to reconstruct the picture of how data is laid out on the disk media.
So let's get started. Input data:
- 7 disks, 2 primary partitions on each;
- 1st partition: a 7-way mirror (RAID1);
- 2nd partition: RAID5, with LVM running on top of it.
Overnight, two disks fail because of a power surge and some other hardware problems. Attempts to reassemble the array were unsuccessful: the system kept running on autopilot on the dead array for two hours, and on top of that the disks kept coming back to life and dying again, so the kernel lost track of which disk sat in which slot at any given moment; what got written to them, and how, one can only guess.
In short, we have a completely dead array, and mdadm is powerless here.
What's done is done; the data has to be recovered somehow, because backups, as usual, don't exist. The action plan:
- copy the surviving data to new disk(s);
- restore the original disk order;
- weed out the killed disk (see point 1);
- determine the RAID metadata format; as it turned out, soft-raid starting from some version reserves a small area for itself at the beginning of the disks, whereas earlier versions kept this data at the end of the device;
- determine the chunk/stripe size; the default here was also changed at some point from 64 KB to 512 KB;
- assemble the disks back into an array;
- recover LVM and copy out the logical volumes.
Point 1.
Everything is trivial: we buy new disks of larger capacity, set up LVM on them, allocate a separate LV for each source disk, and copy each disk's 2nd partition onto it with dd. We'll be playing with this data for quite a while; a sketch of the copying follows.
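A minimal sketch of that step, under my assumptions: the destination VG is called recover and the LVs are named after the source disks, to match the /dev/recover/* paths used further down; sizes and device names will obviously differ in your setup.
for i in a b c d e f g; do
    lvcreate -L 420G -n sd${i} recover                                  # one LV per source disk (size is a placeholder)
    dd if=/dev/sd${i}2 of=/dev/recover/sd${i} bs=1M conv=noerror,sync   # raw copy of the 2nd partition, don't stop on read errors
done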
Now for the rest of the items.
The theory is as follows: to determine the order of the disks and the chunk size, we need to find some log file in which dates and times get written. No sooner said than done — in my case the log files lived on a separate LV. We'll also need a working directory of sufficient size (I set aside 200 GB for mine).
Getting down to the study: we take the first 64 KB from each disk with the loop below, which gives us 7 files of 64 KB each. From these we can immediately tell which md metadata format is used; for metadata 1.0, 1.1, 1.2 it looks like the hexdump that follows the loop. That particular example is metadata 1.2. A distinctive feature is that this block has roughly the same content on every disk (apart from the disk number), while the rest of the space is filled with NULLs. The RAID information sits at offset 0x00001000 (4 KB from the start of the device). The mdadm -D output further below is from the array this superblock belongs to — note the matching name and UUID.
for i in a b c d e f g; do dd if=/dev/jbod/sd${i} bs=64 count=1024 of=/mnt/recover/${i}; done
mega@megabook ~ $ dd if=/dev/gentoo/a bs=1024 count=64 | hexdump -C
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001000 fc 4e 2b a9 01 00 00 00 00 00 00 00 00 00 00 00 |.N+.............|
00001010 ee 6f de dc c3 94 9c 58 47 d0 cc 91 9c f7 c5 35 |.o.....XG......5|
00001020 6d 65 67 61 62 6f 6f 6b 3a 30 00 00 00 00 00 00 |megabook:0......|
00001030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00001040 96 e2 6c 4e 00 00 00 00 05 00 00 00 02 00 00 00 |..lN............|
00001050 00 5c 00 00 00 00 00 00 00 04 00 00 07 00 00 00 |.\..............|
00001060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001080 00 04 00 00 00 00 00 00 00 5c 00 00 00 00 00 00 |.........\......|
00001090 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000010a0 00 00 00 00 00 00 00 00 64 23 f5 c9 5f 2a 64 68 |........d#.._*dh|
000010b0 e8 92 f2 1a 8c ca ad 98 00 00 00 00 00 00 00 00 |................|
000010c0 9a e2 6c 4e 00 00 00 00 12 00 00 00 00 00 00 00 |..lN............|
000010d0 ff ff ff ff ff ff ff ff f6 51 38 f5 80 01 00 00 |.........Q8.....|
000010e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001100 00 00 01 00 02 00 03 00 04 00 05 00 fe ff 06 00 |................|
00001110 fe ff fe ff fe ff fe ff fe ff fe ff fe ff fe ff |................|
*
00001400 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00010000
64+0 records in
64+0 records out
65536 bytes (66 kB) copied, 0.000822058 s, 79.7 MB/s
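A quick way to spot the superblock in such dumps (a sketch over the files produced by the first loop above): the md superblock magic is 0xa92b4efc, which shows up on disk as the little-endian bytes fc 4e 2b a9 at offset 0x1000.
hexdump -C -s 0x1000 -n 16 /mnt/recover/a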
mega@megabook ~ $ mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Sun Sep 11 20:32:22 2011
Raid Level : raid5
Array Size : 70656 (69.01 MiB 72.35 MB)
Used Dev Size : 11776 (11.50 MiB 12.06 MB)
Raid Devices : 7
Total Devices : 7
Persistence : Superblock is persistent
Update Time : Sun Sep 11 20:32:26 2011
State : clean
Active Devices : 7
Working Devices : 7
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : megabook:0 (local to host megabook)
UUID : ee6fdedc:c3949c58:47d0cc91:9cf7c535
Events : 18
Number Major Minor RaidDevice State
0 253 21 0 active sync /dev/dm-21
1 253 22 1 active sync /dev/dm-22
2 253 23 2 active sync /dev/dm-23
3 253 24 3 active sync /dev/dm-24
4 253 25 4 active sync /dev/dm-25
5 253 26 5 active sync /dev/dm-26
7 253 27 6 active sync /dev/dm-27
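A side note: when a member's superblock is still intact, the same fields can be decoded directly from the dd copy with mdadm itself — a small aside, assuming the copies from point 1 live under /dev/recover/:
mdadm --examine /dev/recover/sda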
For earlier metadata versions the RAID superblock was kept at the end of the device, so data started right at the beginning. For metadata 0.9 it looks something like the dump below.
A curious detail surfaced during further study of those first 64 KB: as it turns out, LVM stores its LV layout information in clear text, and it looks like the listing that follows the dump. Excellent! So what do we need out of it?
~ # dd if=/dev/jbod/sdb bs=1024 count=1 | hexdump -C
1+0 records in
1+0 records out
1024 bytes (1.0 kB) copied, 0.0200084 s, 51.2 kB/s
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000200 4c 41 42 45 4c 4f 4e 45 01 00 00 00 00 00 00 00 |LABELONE........|
00000210 1b 72 36 1f 20 00 00 00 4c 56 4d 32 20 30 30 31 |.r6. ...LVM2 001|
00000220 66 6d 59 33 4a 35 6b 72 46 73 6d 52 51 41 47 66 |fmY3J5krFsmRQAGf|
00000230 4c 30 72 53 6b 69 59 6e 31 43 6c 72 66 61 66 70 |L0rSkiYn1Clrfafp|
00000240 00 00 fa ff ed 02 00 00 00 00 06 00 00 00 00 00 |................|
00000250 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000260 00 00 00 00 00 00 00 00 00 10 00 00 00 00 00 00 |................|
00000270 00 f0 05 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000280 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000400
vg0 {
id = "xxxx"
seqno = 64
status = ["RESIZEABLE", "READ", "WRITE"]
extent_size = 8192 # 4 Megabytes
max_lv = 0
max_pv = 0
physical_volumes {
pv0 {
id = "xxxxx"
device = "/dev/md2" # Hint only
status = ["ALLOCATABLE"]
dev_size = 5848847616 # 2,72358 Terabytes
pe_start = 384
pe_count = 713970 # 2,72358 Terabytes
}
}
logical_volumes {
--//--
log {
id = "l8OVMc-BUAj-YrIT-w8mh-YkvH-riS3-p1h6OY"
status = ["READ", "WRITE", "VISIBLE"]
segment_count = 2
segment1 {
start_extent = 0
extent_count = 12800 # 50 Gigabytes
type = "striped"
stripe_count = 1 # linear
stripes = [
"pv0", 410817
]
}
segment2 {
start_extent = 12800
extent_count = 5120 # 20 Gigabytes
type = "striped"
stripe_count = 1 # linear
stripes = [
"pv0", 205456
]
}
}
--//--
}
The main numbers to focus on are these:
- extent_size = 8192 # 4 Megabytes
- pe_start = 384
- stripes = [pv0, 410817]
- extent_count = 5120
What are these numbers?
- extent_size is something like a cluster size: the minimum unit into which the entire VG space is carved up. Why 4 megabytes, and what units ("parrots") is it measured in? I asked myself the same question when I first saw it, but it turns out to be simple: the unit is a 512-byte sector, and 512 bytes * 8192 = 4 MB.
- pe_start — where the LVM header ends and the data begins, again in 512-byte sectors.
- extent_count — the number of extents allocated to the logical volume.
- stripes = [pv0, xxxx] — which PV the volume lives on and at which extent it starts.
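To make the arithmetic concrete — a quick sketch of how these numbers combine into the offset used in the dd one-liner further down (the division by 6 is explained in the next sentence):
echo $[8192*410817 + 384]       # = 3365413248: start of the "log" LV inside the PV, i.e. extent_size * stripes offset + pe_start, in 512-byte sectors
echo $[(8192*410817 + 384)/6]   # = 560902208: the same point on a single RAID member; 6 of the 7 disks carry data, the 7th chunk is parity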
Don't forget that we have RAID5 across 7 drives, so all the numbers get divided by 6 (the 7th disk's worth of space goes to parity), and now we try to find at least some kind of log. Searching by hand and eye would be utopian, so we write a one-liner (shown below, with the meaning of skip and bs spelled out right under it). After a short rustling of the disks we get output like the listing that follows. We take an address, say 02d08770, divide it by 512 (and then by 2048 to convert to megabytes), add a couple of megabytes on top, and pull out that spot from every disk with the first of the two loops below, getting 7 files of 512 KB each. We open them in a text editor and look at the dates to see which of the files came out as checksums (parity) rather than readable text. From this we establish the order of the disks and also see what the chunk size is: if the checksums begin after 64 KB, the chunk is 64 KB; if not, it is most likely 512 KB or larger. The second loop repeats the same thing with an offset of 1 block:
dd if=/dev/recover/sda bs=512 skip=$[(8192*410817+384)/6] | hexdump -C | grep 'Aug 28' | head
skip=(extent_size*stripes+pe_start)/6
bs=512 -- those very "parrots", i.e. 512-byte sectors
02d08100 41 75 67 20 32 38 20 30 30 3a 30 36 3a 30 36 20 |Aug 28 00:06:06 |
02d081c0 3d 0a 41 75 67 20 32 38 20 30 30 3a 30 36 3a 30 |=.Aug 28 00:06:0|
02d08410 41 75 67 20 32 38 20 30 30 3a 30 36 3a 30 37 20 |Aug 28 00:06:07 |
02d08570 70 0a 41 75 67 20 32 38 20 30 30 3a 30 36 3a 30 |p.Aug 28 00:06:0|
02d085d0 65 78 74 3d 0a 41 75 67 20 32 38 20 30 30 3a 30 |ext=.Aug 28 00:0|
02d086b0 64 3d 2a 29 29 22 0a 41 75 67 20 32 38 20 30 30 |d=*))".Aug 28 00|
02d08710 65 73 74 61 6d 70 0a 41 75 67 20 32 38 20 30 30 |estamp.Aug 28 00|
02d08770 73 3d 30 20 74 65 78 74 3d 0a 41 75 67 20 32 38 |s=0 text=.Aug 28|
02d089c0 6f 72 64 3d 2a 29 29 22 0a 41 75 67 20 32 38 20 |ord=*))".Aug 28 |
02d08a20 6d 65 73 74 61 6d 70 0a 41 75 67 20 32 38 20 30 |mestamp.Aug 28 0|
mega@megabook ~ $ echo $[0x02d08770/512]
92227
mega@megabook ~ $ echo $[0x02d08770/512/2048]
45
for i in a b c d e f g; do dd if=/dev/recover/sd${i} of=/mnt/recover/${i} bs=512 count=1024 skip=$[(8192*410817+384)/6+(48*2048)] ; done
for i in a b c d e f g; do dd if=/dev/recover/sd${i} of=/mnt/recover/${i}.1 bs=512 count=1024 skip=$[(8192*410817+384)/6+(48*2048)+1] ; done
Now we build a table on paper: which drive comes first, where the parity block sits. It's worth mentioning that RAID5 comes in four layouts: left-asymmetric, left-symmetric, right-asymmetric and right-symmetric. The details are described here: www.accs.com/p_and_p/RAID/LinuxRAID.html, and a small illustration follows.
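For illustration only (my own sketch, not taken from that article): how left-symmetric — the Linux default, and what mdadm -D reported earlier — and left-asymmetric rotate parity on a 4-disk RAID5. D are data chunks, P is parity; a 7-disk array follows the same pattern with longer rows.
# left-symmetric                       left-asymmetric
#  disk1 disk2 disk3 disk4              disk1 disk2 disk3 disk4
#   D0    D1    D2    P                  D0    D1    D2    P
#   D4    D5    P     D3                 D3    D4    P     D5
#   D8    P     D6    D7                 D6    P     D7    D8
#   P     D9    D10   D11                P     D9    D10   D11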
As for weeding out the battered disk, that's a creative exercise; the material above should be enough for it.
Now, knowing the chunk size, the metadata format and the order of the partitions, we can try playing with mdadm and recreating the array. The strategic trick is that mdadm can create degraded arrays if you write the word missing in place of a real disk. Armed with this knowledge, we try to create the array. Be sure to specify the metadata type! And under no circumstances repeat the following commands without a thorough study of the material above! To check that the array was assembled correctly I used a 2 GB partition holding a virtual machine. If cfdisk says that the partition table is not correct, we stop the array and recreate it with the disks shifted around — the first one goes to the end; i.e. copy the partition again, look at the contents, and so on until we hit the right sequence. You can of course instead sit down with a calculator and, from the numbers we worked out while inspecting the log volume and from the addresses, compute the exact order.
lvcreate -L2G -nnagios recover
mdadm -C /dev/md0 -l 5 -n 7 --metadata 0.9 -c 64 /dev/recover/[a-f] missing
dd if=/dev/md0 bs=512 skip=$[8192*XXXX+384] count=$[8192*512] | dd of=/dev/recover/nagios
cfdisk /dev/recover/nagios
mdadm -S /dev/md0
mdadm -C /dev/md0 -l 5 -n 7 --metadata 0.9 -c 64 /dev/recover/[b-f] missing /dev/recover/a
Note that in place of the seventh disk, sdg, I wrote missing, and that the order of the disks should not be a–g but whatever you arrived at in the previous step. Let me remind you that missing makes the array refuse to recalculate parity blocks, because it considers itself degraded and runs in emergency mode.
Once you have found which of the partitions comes first, pick a small LV, copy it out the same way I copied nagios, and try to mount it. If you have excluded the wrong drive (marked it missing), dmesg will most likely complain about the filesystem journal (since one of the drives is contributing garbage data). In that case repeat the steps, recreating the array and copying the volumes with the missing position moved along: for example, put back the sdg disk, which I had in place of missing, and write missing instead of sdf:
mdadm -C /dev/md0 -l 5 -n 7 --metadata 0.9 -c 64 /dev/recover/[b-e] missing /dev/recover/sdg /dev/recover/a
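For the mount check mentioned above, a minimal sketch — the LV name and mount point are made up, and mounting read-only just in case:
mount -o ro /dev/recover/somelv /mnt/test
dmesg | tail    # journal complaints here usually mean the wrong disk was marked missing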
And so on, until the recovered data is as consistent as possible. With that I think the story proper is over; I'll just add a few words on recovering LVM (so as not to suffer with those addresses again).
Everything is simple here. If you manage to get the array consistent, LVM will be able to activate itself; if not, grab the backups, which usually live in /etc/lvm/backup/VG-NAME. Since my root partition sat on that same LVM, I had moved this directory to /boot/lvm and pointed a symlink at it. Then it's a single command (first line below). If that doesn't help — say it starts complaining about checksums — you can hack the thing a little (the remaining commands): wipe the old PV label, create a new PV, take its UUID, edit the backup file and put the new UUID in it, create a VG with the same name as before, and restore the configuration into it. After all these manipulations the VG should come back, consistent as of the time the backup file was written.
vgcfgrestore -f /path/to/backup-file vg-name
dd if=/dev/zero of=/dev/md0 bs=512 count=10
pvcreate /dev/md0
pvdisplay /dev/md0 | grep 'PV UUID'
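# at this point, edit the LVM backup file and replace the old PV UUID with the one printed above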
vgcreate vg0 /dev/md0
vgcfgrestore -f /path/to/backup-file vg0
And with that, I think the material has been laid out sufficiently. Let me stress once more that these are all extreme measures, and at the very least you should have a full copy of the disks — i.e. do all the work not on the source disks but on an image of them.