
Proper preparation and work with ZFS under FreeBSD
Some time ago I needed to build a reasonably capacious array for storing day-to-day incremental backups. I didn't really want to spend much money, but I needed the space. The solution turned out to be simple and convenient enough. A lot of text follows.

The server was built on a SuperChassis 825TQ-560LPB chassis, fitted with 8 Hitachi 1TB drives and an LSI Logic SAS3081E-R SAS HBA controller. From the start we planned the layout without a hardware RAID, relying solely on the capabilities of ZFS.
The first version of the server ran FreeBSD 7.2-STABLE for about three months, during which the main pitfalls were caught: the limits on kernel memory (vm.kmem_size) and on the ZFS ARC cache (vfs.zfs.arc_max).
With too little memory allotted to the kernel the system would panic, and with an undersized ARC the throughput dropped badly. The system booted from a 1GB flash drive, which made it easy to switch to a new build of the OS simply by swapping in a freshly prepared flash drive.
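For reference, a minimal check (just a sketch, assuming a stock system): the currently effective values of these limits can be read back with sysctl, while the limits themselves are set at boot through /boot/loader.conf; my final settings for a 2 GB machine are given at the end of the article.
backupstorage# sysctl vm.kmem_size vfs.zfs.arc_max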
The layout of the array was built around addressing disks by device name.
Below there will be excerpts from dmesg and the console commands that were run.
mpt0: port 0x2000-0x20ff mem 0xdc210000-0xdc213fff,0xdc200000-0xdc20ffff irq 16 at device 0.0 on pci3
mpt0: [ITHREAD]
mpt0: MPI Version=1.5.20.0
mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
mpt0: 0 Active Volumes (2 Max)
mpt0: 0 Hidden Drive Members (14 Max)
da0 at mpt0 bus 0 scbus0 target 0 lun 0
da0: Fixed Direct Access SCSI-5 device
da0: 300.000MB/s transfers
da0: Command Queueing enabled
da0: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
da1 at mpt0 bus 0 scbus0 target 1 lun 0
da1: Fixed Direct Access SCSI-5 device
da1: 300.000MB/s transfers
da1: Command Queueing enabled
da1: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
da2 at mpt0 bus 0 scbus0 target 2 lun 0
da2: Fixed Direct Access SCSI-5 device
da2: 300.000MB/s transfers
da2: Command Queueing enabled
da2: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
da3 at mpt0 bus 0 scbus0 target 3 lun 0
da3: Fixed Direct Access SCSI-5 device
da3: 300.000MB/s transfers
da3: Command Queueing enabled
da3: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
da4 at mpt0 bus 0 scbus0 target 4 lun 0
da4: Fixed Direct Access SCSI-5 device
da4: 300.000MB/s transfers
da4: Command Queueing enabled
da4: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
da5 at mpt0 bus 0 scbus0 target 5 lun 0
da5: Fixed Direct Access SCSI-5 device
da5: 300.000MB/s transfers
da5: Command Queueing enabled
da5: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
da6 at mpt0 bus 0 scbus0 target 7 lun 0
da6: Fixed Direct Access SCSI-5 device
da6: 300.000MB/s transfers
da6: Command Queueing enabled
da6: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
da7 at mpt0 bus 0 scbus0 target 8 lun 0
da7: Fixed Direct Access SCSI-5 device
da7: 300.000MB/s transfers
da7: Command Queueing enabled
da7: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
ugen3.2: at usbus3
umass0: on usbus3
umass0: SCSI over Bulk-Only; quirks = 0x0100
umass0:2:0:-1: Attached to scbus2
da8 at umass-sim0 bus 0 scbus2 target 0 lun 0
da8: Removable Direct Access SCSI-2 device
da8: 40.000MB/s transfers
da8: 963MB (1972224 512 byte sectors: 64H 32S/T 963C)
Trying to mount root from ufs:/dev/ufs/FBSDUSB
In principle everything is more or less clear: there is a controller, there are LUNs, there are disks, and there are device names assigned to what is attached to the controller.
backupstorage# zpool create storage raidz da0 da1 da2 da3 da4 da5 da6 da7
backupstorage# zpool status -v
pool: storage
state: ONLINE
scrub: none requested
config:
NAME        STATE     READ WRITE CKSUM
storage     ONLINE       0     0     0
  raidz1    ONLINE       0     0     0
    da0     ONLINE       0     0     0
    da1     ONLINE       0     0     0
    da2     ONLINE       0     0     0
    da3     ONLINE       0     0     0
    da4     ONLINE       0     0     0
    da5     ONLINE       0     0     0
    da6     ONLINE       0     0     0
    da7     ONLINE       0     0     0
errors: No known data errors
Everything works until one of the drives dies. The seventh disk started to fail, and the controller kicked it out of the array on a timeout. I was not at all sure the disk itself was the culprit, because the disks were brand new and had not been under any real load.
Hitachi DFT found no errors and reported that everything was fine, except that the drive occasionally slowed down. MHDD found many sectors with an access time of about 500 ms; after the fiftieth such sector I simply exchanged the drive under warranty.
So the drive was really not that bad, and it was dealt with, but the problems that surfaced while it was failing made me think.
Problem one: an array whose disk numbering is not tied to the LUNs
ZFS on Solaris was designed around disks being addressed by controller number and LUN on that controller. With such a scheme, all you need to do is replace the dead disk and resynchronize the data in the array. In FreeBSD, as in Linux, device naming is sequential and does not depend on the physical port on the controller, and this turns out to be the biggest pitfall.
As an example, let's pull disk 5 out of the system, emulating a hardware failure of the disk.
backupstorage# camcontrol rescan all
Re-scan of bus 0 was successful
Re-scan of bus 1 was successful
Re-scan of bus 2 was successful
mpt0: mpt_cam_event: 0x16
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x16
mpt0: mpt_cam_event: 0x16
mpt0: mpt_cam_event: 0x16
(da4:mpt0:0:4:0): lost device
(da4:mpt0:0:4:0): Synchronize cache failed, status == 0x4a, scsi status == 0x0
(da4:mpt0:0:4:0): removing device entry
backupstorage# zpool status -v
pool: storage
state: DEGRADED
scrub: none requested
config:
NAME        STATE     READ WRITE CKSUM
storage     DEGRADED     0     0     0
  raidz1    DEGRADED     0     0     0
    da0     ONLINE       0     0     0
    da1     ONLINE       0     0     0
    da2     ONLINE       0     0     0
    da3     ONLINE       0     0     0
    da4     REMOVED      0     0     0
    da5     ONLINE       0     0     0
    da6     ONLINE       0     0     0
    da7     ONLINE       0     0     0
So far so good. The controller noticed that it had lost the drive and marked it as missing.
You could now insert a new disk and resynchronize the array. BUT if you reboot the server, a very interesting picture awaits you (I will cut unnecessary information out of the dmesg quote so as not to overload the screen with text).
da0 at mpt0 bus 0 scbus0 target 0 lun 0
da0: Fixed Direct Access SCSI-5 device
da0: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
da1 at mpt0 bus 0 scbus0 target 1 lun 0
da1: Fixed Direct Access SCSI-5 device
da1: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
da2 at mpt0 bus 0 scbus0 target 2 lun 0
da2: Fixed Direct Access SCSI-5 device
da2: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
da3 at mpt0 bus 0 scbus0 target 3 lun 0
da3: Fixed Direct Access SCSI-5 device
da3: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
da4 at mpt0 bus 0 scbus0 target 5 lun 0
da4: Fixed Direct Access SCSI-5 device
da4: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
da5 at mpt0 bus 0 scbus0 target 7 lun 0
da5: Fixed Direct Access SCSI-5 device
da5: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
da6 at mpt0 bus 0 scbus0 target 8 lun 0
da6: Fixed Direct Access SCSI-5 device
da6: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
SMP: AP CPU #1 Launched!
GEOM: da0: the primary GPT table is corrupt or invalid.
GEOM: da0: using the secondary instead -- recovery strongly advised.
GEOM: da1: the primary GPT table is corrupt or invalid.
GEOM: da1: using the secondary instead -- recovery strongly advised.
GEOM: da2: the primary GPT table is corrupt or invalid.
GEOM: da2: using the secondary instead -- recovery strongly advised.
GEOM: da3: the primary GPT table is corrupt or invalid.
GEOM: da3: using the secondary instead -- recovery strongly advised.
GEOM: da4: the primary GPT table is corrupt or invalid.
GEOM: da4: using the secondary instead -- recovery strongly advised.
GEOM: da5: the primary GPT table is corrupt or invalid.
GEOM: da5: using the secondary instead -- recovery strongly advised.
GEOM: da6: the primary GPT table is corrupt or invalid.
GEOM: da6: using the secondary instead -- recovery strongly advised.
ugen3.2: at usbus3
umass0: on usbus3
umass0: SCSI over Bulk-Only; quirks = 0x0100
umass0:2:0:-1: Attached to scbus2
da7 at umass-sim0 bus 0 scbus2 target 0 lun 0
da7: Removable Direct Access SCSI-2 device
da7: 40.000MB/s transfers
da7: 963MB (1972224 512 byte sectors: 64H 32S/T 963C)
Trying to mount root from ufs:/dev/ufs/FBSDUSB
GEOM: da4: the primary GPT table is corrupt or invalid.
GEOM: da4: using the secondary instead -- recovery strongly advised.
GEOM: da5: the primary GPT table is corrupt or invalid.
GEOM: da5: using the secondary instead -- recovery strongly advised.
GEOM: da0: the primary GPT table is corrupt or invalid.
GEOM: da0: using the secondary instead -- recovery strongly advised.
GEOM: da1: the primary GPT table is corrupt or invalid.
GEOM: da1: using the secondary instead -- recovery strongly advised.
GEOM: da2: the primary GPT table is corrupt or invalid.
GEOM: da2: using the secondary instead -- recovery strongly advised.
GEOM: da3: the primary GPT table is corrupt or invalid.
GEOM: da3: using the secondary instead -- recovery strongly advised.
GEOM: da6: the primary GPT table is corrupt or invalid.
GEOM: da6: using the secondary instead -- recovery strongly advised.
GEOM: da4: the primary GPT table is corrupt or invalid.
GEOM: da4: using the secondary instead -- recovery strongly advised.
What do we see?
Compared with the first dmesg dump there are now 8 devices instead of 9, and the flash drive has become da7 instead of da8. GEOM starts complaining that it cannot read the GPT tables properly and that everything is bad.
The zpool has simply fallen apart before our eyes: the disk numbering shifted by one, the array lost its reference points and the disk labels got confused.
backupstorage# zpool status -v
pool: storage
state: UNAVAIL
status: One or more devices could not be used because the label is missing
or invalid. There are insufficient replicas for the pool to continue
functioning.
action: Destroy and re-create the pool from a backup source.
see: http://www.sun.com/msg/ZFS-8000-5E
scrub: none requested
config:
NAME        STATE     READ WRITE CKSUM
storage     UNAVAIL      0     0     0  insufficient replicas
  raidz1    UNAVAIL      0     0     0  insufficient replicas
    da0     ONLINE       0     0     0
    da1     ONLINE       0     0     0
    da2     ONLINE       0     0     0
    da3     ONLINE       0     0     0
    da4     FAULTED      0     0     0  corrupted data
    da5     FAULTED      0     0     0  corrupted data
    da6     FAULTED      0     0     0  corrupted data
    da6     ONLINE       0     0     0
Note that there are two (!!!) da6 entries. This is because one physical disk is now detected as da6, while the label on the disk that used to be da6 still calls it da6.
Now let's try plugging the pulled disk back in.
backupstorage# camcontrol rescan all
Re-scan of bus 0 was successful
Re-scan of bus 1 was successful
Re-scan of bus 2 was successful
da8 at mpt0 bus 0 scbus0 target 4 lun 0
da8: Fixed Direct Access SCSI-5 device
da8: 300.000MB/s transfers
da8: Command Queueing enabled
da8: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
GEOM: da8: the primary GPT table is corrupt or invalid.
GEOM: da8: using the secondary instead -- recovery strongly advised.
The disk was detected, but it came up as da8. We could of course try rebuilding the array in this state, but that would lead to nothing good, so we simply rebooted.
backupstorage# zpool status -v
pool: storage
state: ONLINE
scrub: none requested
config:
NAME        STATE     READ WRITE CKSUM
storage     ONLINE       0     0     0
  raidz1    ONLINE       0     0     0
    da0     ONLINE       0     0     0
    da1     ONLINE       0     0     0
    da2     ONLINE       0     0     0
    da3     ONLINE       0     0     0
    da4     ONLINE       0     0     0
    da5     ONLINE       0     0     0
    da6     ONLINE       0     0     0
    da7     ONLINE       0     0     0
errors: No known data errors
After the reboot ZFS calmly finds all the disks, grumbles a bit in the log and carries on working.
GEOM: da0: the primary GPT table is corrupt or invalid.
GEOM: da0: using the secondary instead -- recovery strongly advised.
GEOM: da1: the primary GPT table is corrupt or invalid.
GEOM: da1: using the secondary instead -- recovery strongly advised.
GEOM: da2: the primary GPT table is corrupt or invalid.
GEOM: da2: using the secondary instead -- recovery strongly advised.
GEOM: da3: the primary GPT table is corrupt or invalid.
GEOM: da3: using the secondary instead -- recovery strongly advised.
GEOM: da4: the primary GPT table is corrupt or invalid.
GEOM: da4: using the secondary instead -- recovery strongly advised.
GEOM: da5: the primary GPT table is corrupt or invalid.
GEOM: da5: using the secondary instead -- recovery strongly advised.
GEOM: da6: the primary GPT table is corrupt or invalid.
GEOM: da6: using the secondary instead -- recovery strongly advised.
GEOM: da7: the primary GPT table is corrupt or invalid.
GEOM: da7: using the secondary instead -- recovery strongly advised.
GEOM: da0: the primary GPT table is corrupt or invalid.
GEOM: da0: using the secondary instead -- recovery strongly advised.
GEOM: da2: the primary GPT table is corrupt or invalid.
GEOM: da2: using the secondary instead -- recovery strongly advised.
GEOM: da7: the primary GPT table is corrupt or invalid.
GEOM: da7: using the secondary instead -- recovery strongly advised.
The resilver runs almost painlessly, since no data was written to the array while it was broken.
GEOM: da0: the primary GPT table is corrupt or invalid.
GEOM: da0: using the secondary instead -- recovery strongly advised.
GEOM: da3: the primary GPT table is corrupt or invalid.
GEOM: da3: using the secondary instead -- recovery strongly advised.
====================
backupstorage# zpool status -v
pool: storage
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Wed Nov 25 11:06:15 2009
config:
NAME        STATE     READ WRITE CKSUM
storage     ONLINE       0     0     0
  raidz1    ONLINE       0     0     0
    da0     ONLINE       0     0     0
    da1     ONLINE       0     0     0
    da2     ONLINE       0     0     0  512 resilvered
    da3     ONLINE       0     0     0  512 resilvered
    da4     ONLINE       0     0     0
    da5     ONLINE       0     0     0
    da6     ONLINE       0     0     0  512 resilvered
    da7     ONLINE       0     0     0  512 resilvered
errors: No known data errors
In short, we became convinced that a scheme relying on the controller's automatic disk numbering is not very reliable. Although, as long as the number of disks does not change, they can be shuffled into any order and the array will keep working properly.
Problem two: glabel labels on the disks
To try to solve the disk-naming problem, we decided to mark the disks with glabel labels and build the array out of those labels. When a label is created it is written to the end of the disk; when the disk is initialized, the system reads the label and creates a virtual device under /dev/label/<labelname>.
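The labelling commands themselves are not shown above; as a rough sketch (assuming the data disks are da0 through da7, as in the first dmesg dump), the labels could be written and then checked like this:
backupstorage# glabel label disk0 da0
backupstorage# glabel label disk1 da1
backupstorage# glabel label disk2 da2
backupstorage# glabel label disk3 da3
backupstorage# glabel label disk4 da4
backupstorage# glabel label disk5 da5
backupstorage# glabel label disk6 da6
backupstorage# glabel label disk7 da7
backupstorage# glabel status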
backupstorage# zpool create storage raidz label/disk0 label/disk1 label/disk2 label/disk3 label/disk4 label/disk5 label/disk6 label/disk7
backupstorage# zpool status -v
pool: storage
state: ONLINE
scrub: none requested
config:
NAME             STATE     READ WRITE CKSUM
storage          ONLINE       0     0     0
  raidz1         ONLINE       0     0     0
    label/disk0  ONLINE       0     0     0
    label/disk1  ONLINE       0     0     0
    label/disk2  ONLINE       0     0     0
    label/disk3  ONLINE       0     0     0
    label/disk4  ONLINE       0     0     0
    label/disk5  ONLINE       0     0     0
    label/disk6  ONLINE       0     0     0
    label/disk7  ONLINE       0     0     0
errors: No known data errors
backupstorage# ls /dev/label/disk*
/dev/label/disk0 /dev/label/disk2 /dev/label/disk4 /dev/label/disk6
/dev/label/disk1 /dev/label/disk3 /dev/label/disk5 /dev/label/disk7
The array lives fine, now we pull out disk1 and put it aside.
We reboot the server and look at the log.
GEOM: da0: corrupt or invalid GPT detected.
GEOM: da0: GPT rejected -- may not be recoverable.
GEOM: da1: corrupt or invalid GPT detected.
GEOM: da1: GPT rejected -- may not be recoverable.
GEOM: da2: corrupt or invalid GPT detected.
GEOM: da2: GPT rejected -- may not be recoverable.
GEOM: da3: corrupt or invalid GPT detected.
GEOM: da3: GPT rejected -- may not be recoverable.
GEOM: da4: corrupt or invalid GPT detected.
GEOM: da4: GPT rejected -- may not be recoverable.
GEOM: da5: corrupt or invalid GPT detected.
GEOM: da5: GPT rejected -- may not be recoverable.
GEOM: da6: corrupt or invalid GPT detected.
GEOM: da6: GPT rejected -- may not be recoverable.
GEOM: label/disk0: corrupt or invalid GPT detected.
GEOM: label/disk0: GPT rejected -- may not be recoverable.
GEOM: label/disk2: corrupt or invalid GPT detected.
GEOM: label/disk2: GPT rejected -- may not be recoverable.
GEOM: label/disk3: corrupt or invalid GPT detected.
GEOM: label/disk3: GPT rejected -- may not be recoverable.
GEOM: label/disk4: corrupt or invalid GPT detected.
GEOM: label/disk4: GPT rejected -- may not be recoverable.
GEOM: label/disk5: corrupt or invalid GPT detected.
GEOM: label/disk5: GPT rejected -- may not be recoverable.
GEOM: label/disk6: corrupt or invalid GPT detected.
GEOM: label/disk6: GPT rejected -- may not be recoverable.
GEOM: label/disk7: corrupt or invalid GPT detected.
GEOM: label/disk7: GPT rejected -- may not be recoverable.
da7 at umass-sim0 bus 0 scbus2 target 0 lun 0
da7: Removable Direct Access SCSI-2 device
da7: 40.000MB/s transfers
da7: 963MB (1972224 512 byte sectors: 64H 32S/T 963C)
Trying to mount root from ufs:/dev/ufs/FBSDUSB
GEOM: da0: corrupt or invalid GPT detected.
GEOM: da0: GPT rejected -- may not be recoverable.
GEOM: da1: corrupt or invalid GPT detected.
GEOM: da1: GPT rejected -- may not be recoverable.
GEOM: label/disk2: corrupt or invalid GPT detected.
GEOM: label/disk2: GPT rejected -- may not be recoverable.
GEOM: da2: corrupt or invalid GPT detected.
GEOM: da2: GPT rejected -- may not be recoverable.
GEOM: label/disk3: corrupt or invalid GPT detected.
GEOM: label/disk3: GPT rejected -- may not be recoverable.
GEOM: da3: corrupt or invalid GPT detected.
GEOM: da3: GPT rejected -- may not be recoverable.
GEOM: label/disk4: corrupt or invalid GPT detected.
GEOM: label/disk4: GPT rejected -- may not be recoverable.
GEOM: da4: corrupt or invalid GPT detected.
GEOM: da4: GPT rejected -- may not be recoverable.
GEOM: label/disk5: corrupt or invalid GPT detected.
GEOM: label/disk5: GPT rejected -- may not be recoverable.
GEOM: da5: corrupt or invalid GPT detected.
GEOM: da5: GPT rejected -- may not be recoverable.
But despite all the complaints, the array turns out to be alive, with its structure intact and a DEGRADED status, which is much better than in the first case.
backupstorage# zpool status
pool: storage
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: none requested
config:
NAME             STATE     READ WRITE CKSUM
storage          DEGRADED     0     0     0
  raidz1         DEGRADED     0     0     0
    label/disk0  ONLINE       0     0     0
    label/disk1  REMOVED      0    94     0
    label/disk2  ONLINE       0     0     0
    label/disk3  ONLINE       0     0     0
    label/disk4  ONLINE       0     0     0
    label/disk5  ONLINE       0     0     0
    label/disk6  ONLINE       0     0     0
    label/disk7  ONLINE       0     0     0
errors: No known data errors
====================
Now we zero out the disk that we pulled out and put it back into the system.
The glabel label sits at the very end of the disk, so the usual zeroing of the beginning of the disk will not help here: you have to zero either the end or the whole disk, depending on how much time you have.
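As a rough sketch of wiping only the tail (assuming the disk to be wiped shows up as da8 and has 1953525168 sectors of 512 bytes, as in the dmesg dumps): look up the size with diskinfo, then overwrite the last 2048 sectors, which is where the glabel metadata lives.
backupstorage# diskinfo -v da8 | grep sectors
backupstorage# dd if=/dev/zero of=/dev/da8 bs=512 seek=1953523120 count=2048
Here seek is 1953525168 minus 2048, so only the final megabyte gets zeroed; wiping the whole disk with dd if=/dev/zero of=/dev/da8 bs=1m works just as well, only much slower.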
mpt0: mpt_cam_event: 0x16
mpt0: mpt_cam_event: 0x12
mpt0: mpt_cam_event: 0x16
da8 at mpt0 bus 0 scbus0 target 1 lun 0
da8: Fixed Direct Access SCSI-5 device
da8: 300.000MB/s transfers
da8: Command Queueing enabled
da8: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
We give the new disk the label “disk8” and try to replace the “dead” disk with a new one.
backupstorage# ls /dev/label/
disk0 disk2 disk3 disk4 disk5 disk6 disk7 disk8
backupstorage# zpool replace storage label/disk1 label/disk8
cannot replace label/disk1 with label/disk8: label/disk8 is busy
backupstorage# zpool replace -f storage label/disk1 label/disk8
cannot replace label/disk1 with label/disk8: label/disk8 is busy
The system refuses to replace the drive for us, claiming the device is busy. Apparently the "dead" disk can only be replaced by referring directly to the device name; why this is so I never fully understood. So we drop the idea of swapping one label for another and instead simply give the new disk the label "disk1".
backupstorage# glabel label disk1 da8
Now we need to tell ZFS that the drive is back in service.
backupstorage# zpool online storage label/disk1
backupstorage# zpool status
pool: storage
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: resilver completed after 0h0m with 0 errors on Wed Nov 25 18:29:17 2009
config:
NAME             STATE     READ WRITE CKSUM
storage          ONLINE       0     0     0
  raidz1         ONLINE       0     0     0
    label/disk0  ONLINE       0     0     0  6.50K resilvered
    label/disk1  ONLINE       0    94     1  10.5K resilvered
    label/disk2  ONLINE       0     0     0  6K resilvered
    label/disk3  ONLINE       0     0     0  3.50K resilvered
    label/disk4  ONLINE       0     0     0  6.50K resilvered
    label/disk5  ONLINE       0     0     0  6.50K resilvered
    label/disk6  ONLINE       0     0     0  5.50K resilvered
    label/disk7  ONLINE       0     0     0  3K resilvered
errors: No known data errors
Everything falls into place, and after the resynchronization the pool status can be reset back to normal.
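The reset itself is a one-liner that the status output above already hints at:
backupstorage# zpool clear storage
backupstorage# zpool status -x
After the clear, zpool status -x should report that all pools are healthy.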
And this is where the real problem begins. The label written by glabel lives at the end of the disk, and ZFS knows nothing about where it is stored; once the disk fills up, ZFS ends up clobbering the label. The disk in the array then falls back to its physical device name, and we are back at problem number one.
Solving the problem
The solution turned out to be rather banal and simple. FreeBSD has long been able to create GPT partitions on large disks. In FreeBSD 7.2 naming GPT partitions did not yet work, and a partition was accessed by its direct device name, e.g. /dev/da0p1 for the first GPT partition on da0.
In FreeBSD 8.0 the GPT labelling code was reworked, and you can now give GPT partitions names and refer to them as virtual devices via /dev/gpt/<partition label>.
All we really need to do is re-create the partitions on the disks with such labels and assemble the array from them. How to do it is described in the ZFS Boot HOWTO, which a quick Google search turns up.
backupstorage# gpart create -s GPT da0
backupstorage# gpart add -b 34 -s 1953525101 -i 1 -t freebsd-zfs -l disk0 da0
backupstorage# gpart show
=> 34 1953525101 da0 GPT (932G)
34 1953525101 1 freebsd-zfs (932G)
backupstorage# gpart show -l
=> 34 1953525101 da0 GPT (932G)
34 1953525101 1 disk0 (932G)
backupstorage# ls /dev/gpt
disk0
After creating the GPT partitioning scheme, gpart show tells you where the usable data area of the disk begins and how long it is. We then create a partition covering it and give it the label "disk0".
We repeat this operation for every disk in the system and assemble the array from the resulting partitions, as sketched below.
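The remaining commands are not shown here; roughly (in /bin/sh syntax, assuming the same offsets as above and data disks da0 through da7), the whole procedure boils down to:
backupstorage# for i in 0 1 2 3 4 5 6 7; do gpart create -s GPT da$i; gpart add -b 34 -s 1953525101 -i 1 -t freebsd-zfs -l disk$i da$i; done
backupstorage# zpool create storage raidz gpt/disk0 gpt/disk1 gpt/disk2 gpt/disk3 gpt/disk4 gpt/disk5 gpt/disk6 gpt/disk7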
backupstorage# zpool status -v
pool: storage
state: ONLINE
scrub: none requested
config:
NAME           STATE     READ WRITE CKSUM
storage        ONLINE       0     0     0
  raidz1       ONLINE       0     0     0
    gpt/disk0  ONLINE       0     0     0
    gpt/disk1  ONLINE       0     0     0
    gpt/disk2  ONLINE       0     0     0
    gpt/disk3  ONLINE       0     0     0
    gpt/disk4  ONLINE       0     0     0
    gpt/disk5  ONLINE       0     0     0
    gpt/disk6  ONLINE       0     0     0
    gpt/disk7  ONLINE       0     0     0
errors: No known data errors
The server calmly survives any reboot and any reshuffling of the disks, as well as replacing a disk under the same label. Over the network the array delivers 70-80 megabytes per second; the local write speed, depending on how full the buffers are, reaches 200 megabytes per second.
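For a rough idea of where the local figure comes from (a crude test rather than a proper benchmark; the pool is assumed to be mounted at /storage), write a large file with dd and divide its size by the reported time; writing noticeably more than the machine's RAM keeps the ARC from flattering the result.
backupstorage# dd if=/dev/zero of=/storage/ddtest bs=1m count=8192
backupstorage# rm /storage/ddtest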
PS: while using GPT labels I ran into a strange glitch where the system did not see a freshly written label on a disk, but it later went away by itself.
PPS: for enthusiasts trying to run FreeBSD 8.0 from a USB flash drive (if this has not been fixed yet), the following dirty hack will come in handy.
Index: sys/kern/vfs_mount.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/vfs_mount.c,v
retrieving revision 1.308
diff -u -r1.308 vfs_mount.c
--- sys/kern/vfs_mount.c 5 Jun 2009 14:55:22 -0000 1.308
+++ sys/kern/vfs_mount.c 29 Sep 2009 17:08:25 -0000
@@ -1645,6 +1645,9 @@
 	options = NULL;
+	/* NASTY HACK: wait for USB sticks to appear */
+	pause("usbhack", hz * 10);
+
 	root_mount_prepare();
 	mount_zone = uma_zcreate("Mountpoints", sizeof(struct mount),
With the new USB stack, USB devices are not always detected in time while the kernel is booting, and mounting the root file system then fails.
This hack makes the kernel wait 10 seconds for the USB drive to become ready.
PPPS: if you have questions, ask.
loader.conf settings for a machine with 2 gigabytes of memory:
vm.kmem_size="1536M"
vm.kmem_size_max="1536M"
vfs.zfs.arc_max="384M"
© Aborche 2009
