Automate RAID Integrity Check on a Dell Server
Hello, Habr reader!
A few months ago, we had problems with one virtual machine running on a Dell PowerEdge R720 server with ESXi 5.5. Rebooting this VM took quite a while and caused a severe drop in performance on the host itself.
The Lifecycle log on the server was filled with messages of the form:
PDR47
A block on Disk 0 in Backplane 1 of Integrated RAID Controller 1 was
punctured by the controller.
PDR64
An unrecoverable disk media error occurred on Disk 0 in Backplane 1 of
Integrated RAID Controller 1.
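To get a feel for how widespread such events are, the log can be scanned for these two message IDs. The following is a minimal sketch, assuming the Lifecycle log has been exported to a plain-text file in which each message ID starts its own line, as in the excerpt above; the function name and the file path are hypothetical.

```shell
#!/bin/sh
# Count puncture (PDR47) and unrecoverable media error (PDR64) entries
# in a Lifecycle log exported to plain text.
# Assumption: every message ID starts its own line, as in the excerpt.
count_pdr_events() {
    # $1: path to the exported log (hypothetical file name)
    for id in PDR47 PDR64; do
        # grep -c prints the number of matching lines (0 if there are none)
        n=$(grep -c "^${id}" "$1")
        echo "$id $n"
    done
}

# Usage (hypothetical path):
#   count_pdr_events /tmp/lifecycle_log.txt
```

A non-zero count for either ID is a reason to look at the array before the next reboot rather than after it.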
Googling led to a disappointing conclusion: the RAID array was damaged and could not be restored. Specifically, the data belonging to one block (stripe) was damaged on several disks at once (a double fault).

Fortunately, Dell RAID controllers can keep working despite the inconsistent state of the array. This feature, called puncture (https://www.dell.com/support/Article/us/en/04/438291/EN#Unique-Hyphenated-Issue-Here-2), lets you save at least the part of the data that has not been damaged. It does not, of course, remove the need to replace the disks afterwards and rebuild the RAID array from scratch.
To prevent such situations, Dell recommends running an array integrity check at least once a month. Alas, we learned about this too late.
You can run this check either through the Dell OpenManage Server Administrator web interface (http://www.dell.com/support/contents/us/en/19/article/Product-Support/Self-support-Knowledgebase/enterprise-resource-center/Enterprise-Tools/OMSA/) or through the omconfig / omreport utilities included in OMSA. If the developers at Dell had not "forgotten" to include these utilities in OpenManage for ESXi, automation would not be a problem. As it is, checking the integrity of the array manually on each server is clearly not the IT way, not to mention that the OMSA interface is very slow and working with it is a dubious pleasure.
The guys from Dell "did a great job" here as well: there is no simple way to automate the check (for example, by opening a prepared link with cURL), because the web interface is generated dynamically and has no permanent links.
What to do?
I had to tinker a bit and write the verification utility myself. Meet the Consistency Check Task Automation Tool for Dell servers with iDRAC (https://github.com/jazzl0ver/dell_raid_cc). The utility is written with the CasperJS framework, which lets you automate work with exactly this kind of dynamic site.
To use dell_raid_cc you need:
1. A server with OMSA installed (see link above)
2. Download and install phantomjs (http://phantomjs.org/download.html)
3. Download and install casperjs (http://docs.casperjs.org/en/latest/installation.html)
4. Get the utility from git:
git clone https://github.com/jazzl0ver/dell_raid_cc
5. Create a file with access parameters (for example, creds.txt):
export OMSAHOST=192.168.1.191
export OMSAPORT=1311
export USERNAME=root
export PASSWORD=password
export DELLHOST=192.168.1.30
6. Load it, then run the utility (or put its launch in crontab):
source creds.txt
casperjs --ignore-ssl-errors=true --cookies-file=/tmp/dell_raid_cc_cookie.jar dell_raid_cc.js
If everything is in order, the output will be something like this:
Found: Virtual Disk 0 [state: Ready; layout: RAID-10; size: 1,862.00GB]
CC for Virtual Disk 0 has been started
Found: Virtual Disk 1 [state: Ready; layout: RAID-1; size: 931.00GB]
CC for Virtual Disk 1 has been started
If you run it again, you can see the progress of the scan, for example:
Found: Virtual Disk 0 [state: Resynching; layout: RAID-6; size: 5,026.50GB]
CC for Virtual Disk 0 is still running, progress: 19% complete
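For unattended runs it can help to reduce this output to just the numbers. The sketch below assumes the output format is exactly as in the samples above; the function name, the install directory in the cron comment, and the monthly schedule are illustrative assumptions, not part of the utility.

```shell
#!/bin/sh
# Extract the progress of running consistency checks from dell_raid_cc output.
# Reads the utility's output on stdin and prints "<disk number> <percent>"
# for every virtual disk whose check is still running.
# Assumption: the line format matches the sample output shown above.
cc_progress() {
    sed -n 's/^CC for Virtual Disk \([0-9][0-9]*\) is still running, progress: \([0-9][0-9]*\)% complete.*$/\1 \2/p'
}

# Usage:
#   casperjs --ignore-ssl-errors=true \
#       --cookies-file=/tmp/dell_raid_cc_cookie.jar dell_raid_cc.js | cc_progress
#
# A possible crontab entry that starts the check at 04:00 on the first day
# of every month (/opt/dell_raid_cc is a hypothetical install directory):
#   0 4 1 * * cd /opt/dell_raid_cc && . ./creds.txt && casperjs --ignore-ssl-errors=true --cookies-file=/tmp/dell_raid_cc_cookie.jar dell_raid_cc.js
```

Lines that do not report a running check, such as the "Found:" and "has been started" lines, produce no output.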
Note that the utility does not support multi-controller systems (I simply don't have one to test on).
I hope the utility will be useful not only to me.
UPD: As colleagues suggested in the comments, a better approach is to schedule the integrity checks with the MegaCli utility. For instance:
./MegaCli -AdpCcSched -SetStartTime 20140822 04 -aALL
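Judging by the example, the two arguments after -SetStartTime are a date in YYYYMMDD form and an hour in HH form. A small helper can guard against typos before the command reaches the controller. This is only a sketch: the function name is hypothetical, the validation checks the shape of the arguments rather than calendar validity, and the printed command simply mirrors the one above.

```shell
#!/bin/sh
# Build the MegaCli command that sets the consistency check start time.
# Prints the command instead of running it, so it can be reviewed first.
# Assumption: arguments follow the YYYYMMDD and HH forms seen in the example.
cc_sched_command() {
    # $1: start date as YYYYMMDD, $2: start hour as HH (00-23)
    case $1 in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "bad date: $1" >&2; return 1 ;;
    esac
    case $2 in
        [01][0-9]|2[0-3]) ;;
        *) echo "bad hour: $2" >&2; return 1 ;;
    esac
    echo "./MegaCli -AdpCcSched -SetStartTime $1 $2 -aALL"
}

# Usage:
#   cc_sched_command 20140822 04
```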
Installation instructions for the server with CentOS / RedHat - here
Configuring CC schedule - here
It is also easy to install under ESXi: you can install the VIB directly, or make a bundle out of it and install it as an update through vCenter.
UPD #2: PERC 5 controllers do not support scheduling via MegaCli:
cd /opt/lsi/MegaCLI; ./MegaCli -AdpCcSched -Info -aALL
Adapter 0: Scheduled Chceck Consistency is not supported.
Exit Code: 0x01
For these controllers, dell_raid_cc is the only way to automate the check.
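The choice between the two approaches can itself be scripted. The sketch below decides from the text of the -AdpCcSched -Info output; it assumes the "is not supported" wording shown above, which may differ between MegaCli or firmware versions, and the function name is hypothetical.

```shell
#!/bin/sh
# Decide whether scheduled consistency checks are available.
# Reads the output of "./MegaCli -AdpCcSched -Info -aALL" on stdin.
# Returns 0 if no adapter reports the feature as unsupported, 1 otherwise.
# Assumption: unsupported adapters print "is not supported", as shown above.
cc_sched_supported() {
    if grep -q 'is not supported'; then
        return 1
    fi
    return 0
}

# Usage:
#   if ./MegaCli -AdpCcSched -Info -aALL | cc_sched_supported; then
#       echo "use the MegaCli schedule"
#   else
#       echo "fall back to dell_raid_cc"
#   fi
```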