Monitoring the health of HP ProLiant servers in Nagios/Icinga with the check_hpasm and check_ilo2_health.pl plugins

There are a lot of plugins for monitoring systems. You can find what you need in the Nagios Exchange (exchange.nagios.org) and MonitoringExchange directories. When searching for the right plugin, it is better to check both repositories: despite looking identical, their contents differ.

The quality and functionality of plugins, even of similar ones, vary greatly. Some are quick hacks thrown together for a narrow task under strictly defined conditions, whose author decided to tell the world about them instead of sending them to /dev/null. Others are well-made products that work with entire families of devices and provide extensive information about the target systems.

I would like to talk about the latter, especially since working with Nagios/Icinga showed that there is very little Russian-language information on plugins for monitoring systems.

This article is devoted to monitoring HP ProLiant servers, and the author sincerely hopes that it will help those who run HP equipment and would like to monitor its parameters more fully.

Managing HP ProLiant servers (iLO)

Most ProLiant servers are managed through iLO (Integrated Lights-Out). The basic version of iLO lets you control the server remotely through a web interface: switch the power on and off, switch the blue UID indicator on and off (it helps locate the server in the rack), view the iLO log and the Integrated Management Log (IML), and read the current state of internal components such as processor temperatures, fan speeds, and other system health information. There is also iLO Advanced, an extended version that requires a paid license (activation key). It is inexpensive, somewhere around $50, and redirects the console, keyboard, and mouse to a remote browser, which makes it possible to control BIOS boot and to log in to an already running system.

There are currently four versions of iLO. The original iLO, sometimes called iLO1, was installed on ProLiant servers of generations G1-G4, iLO2 on G5-G6, and iLO3 on G7. Starting with the G8 generation, now called Gen8, servers ship with iLO4. If you wonder what is inside iLO4, Habr has a good article on the subject: "Shedding light on the HP ProLiant iLO Management Engine".

Every iLO has its own independent network interface. It stays active even when the server is powered down, as long as the ProLiant is physically plugged in. The iLO is assigned its own address; the interface is usually connected to a separate management switch with a dedicated VLAN and its own subnet.

Blade versions of ProLiant servers also have iLO management interfaces, one per blade. The c3000/c7000 blade enclosure has a separate management interface of its own (Onboard Administrator). In addition to the overall state of the power supplies and the enclosure temperature sensors, the Onboard Administrator interface provides access to the iLO of each blade. On the blades themselves, at least iLO2 was installed even on G1 generations. Recent blade generations (Gen8) are equipped with iLO4.

Versions of Integrity iLO (Integrity iLO through iLO3) are also installed on Integrity servers and Superdome 2 blades. Such servers are uncommon, so we will not cover them.

Some systems that are outdated but still in service (DL760 G2: who would throw away such an 8-processor workhorse?) were not originally equipped with iLO. They can take a RILOE II card (full-size PCI) with its own physical RJ45 LAN interface. RILOE II (Remote Insight Lights-Out Edition II) is a rather curious device: it has a KVM interface (keyboard/VGA/mouse) for control at the rack, an RJ45 port for network connection and remote management, and also ... a connector for an external power adapter.

There is also a cut-down version of iLO, the LO100i. It was installed on the G6 and G7 generations of some entry-level models, for example DL160, DL180, DL320, and the low-cost ML series. In Gen8 servers, even in the entry-level line, the LO100i will no longer be used (at least that is what was said at an HP conference). The LO100i operates on one of the two ProLiant network interfaces, in either dedicated or shared mode. In dedicated mode, one of the server's network interfaces is given over entirely to the LO100i; in shared mode, the bandwidth is shared between data and management traffic. Management traffic takes little bandwidth and has practically no effect on data. The LO100i also has its own network address, independent of the server's primary address. Some entry-level models (e.g. ML110/ML150 G2) have no management interface at all, but one can be added by installing a special RMP (Remote Management Processor) card. The card does not go into a slot: it is mounted piggyback on motherboard connectors, and it cannot be used in other ProLiant models.

In practice, the LO100i does not always behave well in shared mode (in dedicated mode everything is fine). The LO100i works well if the two ProLiant network interfaces either work independently or are configured as Network Fault Tolerance (NFT) or Transmit Load Balancing (TLB) with a hot spare. When the links are aggregated with LACP (the most efficient way of using several network interfaces), the LO100i becomes unavailable, although data traffic on the interfaces flows without problems. Moreover, it is unreachable not just from an external network: it cannot be reached from a workstation in the same VLAN on the same switch or from the system console. Since the documentation states the opposite, a case was opened with HP at the end of November last year. At the moment the problem has slowly been escalated to L1 (developers), but HP has not given any specific recommendations (or even information that would explain the cause), although judging by the tracker the engineers are working on it. The problem is real, but it affects the monitoring task only indirectly, although it does prevent managing the server quickly.

Exchange.nagios.org has a fairly large number of plugins that work with iLO directly, sending XML requests over HTTP. But they only work well with iLO2 and newer versions. iLO1 says very little through XML and provides little useful information, yet servers with iLO1 are still alive and well, and you need to know what is going on inside them.

Plugin check_ilo2_health.pl

Of the plugins for checking iLO, check_ilo2_health.pl seemed the most effective. Download here. Exchange.nagios.org has an older version that does not understand iLO4.

Installing the script does not take much work. Copy it to the directory where the other Nagios/Icinga plugins live, then define the check command in the configuration and assign it to hosts.

$USER1$/check_ilo2_health.pl -H $HOSTADDRESS$ -d 1 -u $ARG1$ -p $ARG2$
where $ARG1$ and $ARG2$ are the iLO user name and password.
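As a rough sketch, the command and service definitions for Nagios or Icinga 1.x could look like the output of the helper below. The object names (check_ilo2_health, ilo-srv01), the generic-service template, and the credentials monitor/secret are illustrative assumptions, not values from the original setup.

```shell
#!/bin/sh
# Print Nagios/Icinga 1.x object definitions for the iLO health check.
# All object names and the credentials are hypothetical placeholders.
print_ilo_check_config() {
    # The quoted 'EOF' keeps the shell from expanding $USER1$ and friends.
    cat <<'EOF'
define command {
    command_name    check_ilo2_health
    command_line    $USER1$/check_ilo2_health.pl -H $HOSTADDRESS$ -d 1 -u $ARG1$ -p $ARG2$
}

define service {
    use                     generic-service
    host_name               ilo-srv01
    service_description     iLO health
    check_command           check_ilo2_health!monitor!secret
}
EOF
}

print_ilo_check_config
```

Redirect the output to a file in the directory your monitoring system reads object definitions from, then adjust the host name and credentials.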

For the script to work, you need to install the following Perl modules (via CPAN): Nagios::Plugin (be sure to read the UPDATE at the end!), IO::Socket::SSL, and XML::Simple.
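Before wiring the plugin into the monitoring system, it may help to confirm that the local perl can actually load these modules. The helper below is a minimal sketch; the function name check_perl_modules is made up for illustration.

```shell
#!/bin/sh
# Report which of the given Perl modules the local perl can load.
# Returns 0 only if every module loads.
check_perl_modules() {
    missing=0
    for module in "$@"; do
        if perl -M"$module" -e 1 >/dev/null 2>&1; then
            echo "ok       $module"
        else
            echo "MISSING  $module"
            missing=1
        fi
    done
    return "$missing"
}

# Module list taken from the text; swap Nagios::Plugin for
# Monitoring::Plugin if you follow the 2016 update at the end.
check_perl_modules Nagios::Plugin IO::Socket::SSL XML::Simple ||
    echo "Install the missing modules via CPAN before using the plugin."
```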

Options:
-e - ignore syntax error messages in the XML output. May be useful for older firmware.
-n - omit temperature readings
-d - temperature data in a PerfParse-compatible format
-v - print the full XML output (for debugging)
-3 - support for iLO3 and iLO4
-a - check fan redundancy (if supported by the hardware)
-b - check HDD bays (if supported by the hardware)
-o - check power supply redundancy (if supported by the hardware)

Besides the information the script returns, you also need a check of whether the iLO is physically reachable from outside. A plain ping may not be enough here.
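One possible way to go beyond ping is to test whether the iLO accepts TCP connections on its HTTPS port. The sketch below assumes bash (for its /dev/tcp redirection) and the timeout utility are available on the monitoring host; the function name check_ilo_port and the default port 443 are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Nagios-style reachability check: can we open a TCP connection to the iLO?
# Usage: check_ilo_port <host> [port] [timeout_seconds]
check_ilo_port() {
    local host="$1" port="${2:-443}" wait="${3:-5}"
    # A child bash opens the socket; timeout kills it if the connect hangs.
    if timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
        echo "OK - ${host}:${port} accepts TCP connections"
        return 0
    fi
    echo "CRITICAL - ${host}:${port} is not reachable"
    return 2
}

# Example with a documentation-range address (hypothetical):
# check_ilo_port 192.0.2.10 443 5
```

The return codes follow the usual plugin convention (0 for OK, 2 for CRITICAL). The stock check_tcp plugin called with -H and -p can serve the same purpose if it is installed.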

HP Systems Insight Manager

It would be a mistake to think that HP has nothing to offer the administrator. HP Systems Insight Manager (HP SIM) is a solution for managing and monitoring the ProLiant platform and, importantly, it is free to use with HP products (download here). Support is paid, as are add-ons for monitoring some third-party systems. For it to work fully, drivers must be installed on the managed servers.

A good question arises: can the information HP SIM works with be brought into Icinga/Nagios, so that everything is collected in one monitoring application? Surprisingly, someone managed to solve this problem.

Plugin check_hpasm

The check_hpasm plugin was created at ConSol Labs by Gerhard Lausser. It collects information from the following ProLiant systems:

• Linux with the HP System Health Application and Insight Management Agents (hpasm) installed;
• Windows 2003/2008/2008 R2/2012 with HP SIM drivers installed;
• HP blade enclosures (c7000/c3000) with Onboard Administrator.

In all three cases, SNMP must be up and configured.

What it was tested on: the Icinga 1.8.4 monitoring system (fully compatible with Nagios) installed on FreeBSD 9.0. If you have a different monitoring system or OS, adjust the dependencies and paths accordingly. Perl is installed on the system with all the necessary modules.

Installing check_hpasm

Download the latest version: go to the plugin page and find the check_hpasm download link at the bottom of the page. At the time of writing, version 4.6.3 was available.

The plugin has to be built and installed:
1. Upload the archive to the monitoring host via FTP or any other way.
2. Unpack it: tar xvf check_hpasm-4.6.3.tar.gz
3. cd check_hpasm-4.6.3
4. Run ./configure

Keys for configure:
--libexecdir=/usr/local/libexec/nagios
(the directory holding the check_* scripts. On my system it sits in the traditional Nagios location, although /usr/local/libexec/icinga also exists)
--with-nagios-user=icinga
(the user the monitoring system runs as; for Nagios it will be nagios)
--with-nagios-group=icinga
(the group of the user the monitoring system runs as; icinga for Icinga, nagios for Nagios)
--enable-perfdata=yes
(whether to output performance data for collecting statistics; the default is no)
--enable-extendedinfo=yes
(whether to output extended information; the default is no)

The final configure line looks like this:

./configure --with-nagios-user=icinga --with-nagios-group=icinga \
--with-degrees=yes --enable-perfdata=yes --enable-extendedinfo=yes \
--libexecdir=/usr/local/libexec/nagios

Look at the configure output and make sure everything is as intended.
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for gawk... no
checking for mawk... no
checking for nawk... nawk
checking whether make sets $(MAKE)... yes
checking how to create a pax tar archive... gnutar
checking build system type... i386-unknown-freebsd9.0
checking host system type... i386-unknown-freebsd9.0
checking whether make sets $(MAKE)... (cached) yes
checking for gawk... (cached) nawk
checking for sh... /bin/sh
checking for perl... /usr/bin/perl
configure: creating ./config.status
config.status: creating Makefile
config.status: WARNING:  'Makefile.in' seems to ignore the --datarootdir setting
config.status: creating plugins-scripts/Makefile
config.status: WARNING:  'plugins-scripts/Makefile.in' seems to ignore the --datarootdir setting
config.status: creating plugins-scripts/subst
                      --with-perl: /usr/bin/perl
                --with-nagios-user: icinga
               --with-nagios-group: icinga
               --with-noinst-level: unknown
                   --with-degrees: yes
                --enable-perfdata: yes
                --enable-extendedinfo: yes
                --enable-hwinfo: yes
                --enable-hpacucli: no

Then run make install.

A check_hpasm script, about 307K in size, should appear in the directory given in the libexecdir key. If it is missing or noticeably smaller, the script was not built. If all is well, check its permissions and correct them if necessary.
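The post-install check described here can be sketched as a small shell helper. The function name verify_plugin and the 100K size threshold are arbitrary illustrative choices, not values from the plugin documentation.

```shell
#!/bin/sh
# Check that a plugin file exists, is not suspiciously small, and is executable.
# Usage: verify_plugin <path> [min_size_in_KB]
verify_plugin() {
    plugin="$1"
    min_kb="${2:-100}"
    [ -f "$plugin" ] || { echo "not found: $plugin"; return 1; }
    size_kb=$(( $(wc -c < "$plugin") / 1024 ))
    [ "$size_kb" -ge "$min_kb" ] || { echo "too small (${size_kb}K): $plugin"; return 1; }
    [ -x "$plugin" ] || { echo "not executable: $plugin"; return 1; }
    echo "looks fine (${size_kb}K): $plugin"
}

# The path follows the --libexecdir value used in this article.
verify_plugin /usr/local/libexec/nagios/check_hpasm ||
    echo "rebuild the plugin or fix its permissions (for example with chmod 755)"
```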

You can verify that the plugin works by running it from the command line:

check_hpasm -H <hostname> --community <community> -v
The script then prints detailed information about the state of the remote system. For example, the result of a query for a c7000 blade enclosure looks like this in Nagios/Icinga:

OK - System: 'bladesystem c7000 enclosure g2', S/N: 'GBXXXXXXXX', hardware working fine, temp_1:1:1=20 temp_1:1:12=20
temp_1:1:13=20 temp_1:1:2=31 temp_1:1:4=20 temp_1:1:5=19 temp_1:1:6=19 temp_1:1:7=20
common enclosure Blade7000 condition is ok (Ser: GBXXXXXXXX, FW: 3.70)
fan 1:1:1 is present, location is 1, redundance is other, condition is ok
fan 1:1:10 is present, location is 10, redundance is other, condition is ok
fan 1:1:2 is present, location is 2, redundance is other, condition is ok
fan 1:1:3 is present, location is 3, redundance is other, condition is ok
fan 1:1:4 is present, location is 4, redundance is other, condition is ok
fan 1:1:5 is present, location is 5, redundance is other, condition is ok
fan 1:1:6 is present, location is 6, redundance is other, condition is ok
fan 1:1:7 is present, location is 7, redundance is other, condition is ok
fan 1:1:8 is present, location is 8, redundance is other, condition is ok
fan 1:1:9 is present, location is 9, redundance is other, condition is ok
Chassis temperature is 20C (42 max)
Blade Bay temperature is 20C (42 max)
Blade Bay temperature is 20C (42 max)
System temperature is 31C (75 max)
Blade Bay temperature is 20C (42 max)
Blade Bay temperature is 19C (42 max)
Blade Bay temperature is 19C (42 max)
Blade Bay temperature is 20C (42 max)
manager 1:1:1 is present, location is 1, redundance is redundant, condition is ok, role is active
manager 1:1:2 is present, location is 0, redundance is notRedundant, condition is ok, role is standby
power enclosure 1:1 'Blade7000' condition is ok
power supply 1:1:1 is present, condition is ok (Ser: 5AGUD0AHLZ93AC, FW: )
power supply 1:1:2 is present, condition is ok (Ser: 5AGUD0AHLZ93AE, FW: )
power supply 1:1:3 is present, condition is ok (Ser: 5AGUD0AHLZ93AL, FW: )
power supply 1:1:4 is present, condition is ok (Ser: 5AGUD0AHLZ93AK, FW: )
power supply 1:1:5 is present, condition is ok (Ser: 5AGUD0AHLZ92LT, FW: )
power supply 1:1:6 is present, condition is ok (Ser: 5AGUD0AHLZ93AD, FW: )
net connector 1:1:1 is present, model is HP 1Gb Ethernet Pass-Thru Module for c-Class BladeSystem (Ser: TWTXXXXXXX, FW: )
net connector 1:1:2 is present, model is HP 1Gb Ethernet Pass-Thru Module for c-Class BladeSystem (Ser: TWTXXXXXXX, FW: )
net connector 1:1:3 is present, model is BROCADE HP B-series 8/12c SAN Switch BladeSystem c-Class (Ser: CNXXXXXXXX, FW: )
net connector 1:1:4 is present, model is BROCADE HP B-series 8/12c SAN Switch BladeSystem c-Class (Ser: CNXXXXXXXX, FW: )
server blade 1:1:1 'BLADE1' is present, status is ok, powered is on
server blade 1:1:10 'BLADE10' is present, status is ok, powered is on
server blade 1:1:2 'BLADE2' is present, status is ok, powered is on
server blade 1:1:3 'BLADE3' is present, status is ok, powered is on
server blade 1:1:4 'BLADE4' is present, status is ok, powered is on
server blade 1:1:9 'BLADE9' is present, status is ok, powered is on


If a faulty component is detected, the plugin returns WARNING and indicates the reason. For example, the message about a problem battery in the controller accelerator looks like this:
WARNING - controller accelerator battery needs attention, System: 'proliant dl380 g4', S/N: 'GB8640P5NS', ROM: 'P51 04/26/2006'
checking cpus
cpu 0 is ok
cpu 1 is ok
checking power supplies
powersupply 1 is ok
powersupply 2 is ok
checking fans
overall fan status: system=ok, cpu=ok
fan 1 is present, speed is normal, pctmax is 50%, location is cpu, redundance is redundant, partner is 2
fan 2 is present, speed is normal, pctmax is 50%, location is cpu, redundance is redundant, partner is 3
fan 3 is present, speed is normal, pctmax is 50%, location is ioBoard, redundance is redundant, partner is 4
fan 4 is present, speed is normal, pctmax is 50%, location is ioBoard, redundance is redundant, partner is 5
fan 5 is present, speed is normal, pctmax is 50%, location is cpu, redundance is redundant, partner is 6
fan 6 is present, speed is normal, pctmax is 50%, location is cpu, redundance is redundant, partner is 7
fan 7 is present, speed is normal, pctmax is 50%, location is powerSupply, redundance is redundant, partner is 8
fan 8 is present, speed is normal, pctmax is 50%, location is powerSupply, redundance is redundant, partner is 1
checking temperatures
1 cpu temperature is 38C (62 max)
2 cpu temperature is 37C (87 max)
3 ioBoard temperature is 34C (60 max)
4 cpu temperature is 40C (87 max)
5 powerSupply temperature is 31C (53 max)
checking memory
dimm module 0:1 (module 1 @ cartridge 0) is ok
dimm module 0:2 (module 2 @ cartridge 0) is ok
dimm module 0:3 (module 3 @ cartridge 0) is not present
dimm module 0:4 (module 4 @ cartridge 0) is not present
dimm module 0:5 (module 5 @ cartridge 0) is not present
dimm module 0:6 (module 6 @ cartridge 0) is not present
checking disk subsystem
controller accelerator is failed
controller accelerator battery is failed
logical drive 2:1 is ok (mirroring)
physical drive 2:144 is ok
physical drive 2:145 is ok
scsi controller 3:1 in slot 1 is ok
ide controller 0 in slot -1 is ok and unused
checking ASR
ASR overall condition is ok
checking events


By the way, on a server with iLO1 this information cannot be collected by other plugins through the iLO. What does a dead battery in a disk controller accelerator lead to? Write speed drops by almost half.

You can get even more detailed output by adding the -vv switch to the command line, and if you need a very long listing for diagnostics, use -vvv.

In the Nagios check commands we write:

$USER1$/check_hpasm --hostname $HOSTADDRESS$ --community public -v
Replace public with your SNMP community. Assign the check command to hosts and services.

The plugin relies on the data returned by the drivers. Sometimes (usually because of firmware errors) they return incorrect values: physically absent components reported as present, wrong temperature readings (99 degrees, as was the case with the Gen8 iLO4), and so on. For such cases there is an extensive set of keys for excluding individual sensors or whole subsystems. The description of the keys is long and is given in full in the Blacklisting section of the plugin page.

To work under Windows 2003/2008/2012 on a ProLiant, you need to install:

1. The SNMP service (not installed by default; it is among the system components), configured correctly: specify the community, enter the address of the Nagios/Icinga server as the SNMP server, and set the address for sending SNMP traps. The WBEM providers and SIM drivers depend on the SNMP service. WBEM stands for Web-Based Enterprise Management.
2. The latest iLO drivers, HP Systems Insight Manager drivers, and WBEM providers for your version of the system.

On Linux, hpasm is set up in the same way: it also requires SNMP and the drivers to be configured. This is described in detail in the HOWTO.

You can install drivers in two ways:
1. Download them individually from the HP support pages for each model in your server farm. For example, for the HP ProLiant DL380 G4 Server they are in the sections Driver - System Management (iLO drivers) and Software - System Management (WBEM providers and SIM drivers).
2. Download the SPP (Service Pack for ProLiant, formerly PSP); the latest version is 2012.10.0. The download requires free registration with HP Passport and a few steps to "purchase" for $0.00 the right to download the SPP .iso. (I forgot to mention that it is the same story with HP SIM.)

Installing drivers from the SPP is much more convenient, if only because you can install in bulk on several servers at once. This is done by HP SUM (Smart Update Manager). It requires the credentials of a domain administrator or a local server administrator. In addition, before installing drivers it is strongly recommended to update all available firmware.

Known problems:
1. Installation on a dozen or two servers can take a very, very long time; why is not clear.
2. Running HP SUM on a machine with production software is not recommended, because it can load the processor up to 90%.
3. Some servers cannot be installed automatically, especially if the system is bare, that is, WBEM/SIM drivers have never been installed on it. They have to be installed by hand.
4. Sometimes HP SUM does not let you select WBEM/SIM drivers for the target servers; these also have to be installed manually, especially on servers with LO100i. Seen on the DL180 G6 and DL160 G6.
5. WBEM/SIM drivers do not install if Windows 2008 R2 runs on officially unsupported hardware (such as the legacy DL360 G4 or DL380 G4). The blog http://kf.livejournal.com offers the following solution:

What if you can't install HP Insight Management Agents or WBEM on Windows Server 2008 R2?
Although the HP DL3x0 G4 does not officially support Microsoft Windows Server 2008 R2, the OS is easy enough to install on it by standard means. After installing the OS, however, you will find no information on the HP System Management Homepage. This happens because, by default, neither WBEM nor HP Insight Management Agents can be installed.
If you try to install WBEM/HPIMA manually from the SmartStart disc, you will get a dependency failure with a description like this:
Installation for "HP Insight Management Agents for Windows Server 2003/2008 x64 Editions" requires one or more of the following that is not currently installed or in the install set:
- HP ProLiant Advanced System Management Controller Driver for Windows
- HP ProLiant iLO Advanced and Enhanced System Management Controller Driver for Windows
- HP ProLiant iLO 2 Management Controller Driver for Windows
- HP ProLiant iLO 3 Management Controller Driver for Windows
To install successfully, follow these steps:
1) Download HP ProLiant iLO Advanced and Enhanced System Management Controller Driver for Windows Server 2008 x64 Editions (cp010914.exe) to the server.
2) Extract the downloaded file using its built-in extract feature.
3) Set the compatibility mode for cpqsetup.exe to Windows Server 2008 (Service Pack 1).
4) Run cpqsetup.exe; the installation should work fine.
5) Install HP Insight Management Agents/WBEM as usual.
6) One of the unknown devices will also disappear from Device Manager and will now be listed as "HP ProLiant iLO2 Advanced System Management Controller".


Known bugs and drawbacks of check_hpasm

As of version 4.6.3 the following error was noticed: when run on FreeBSD, the plugin prints the message:

Use of uninitialized value in lc at /usr/local/libexec/nagios/check_hpasm line 3622.

No effect on the operation of the plugin or on the information it returns was noticed. The error is fixed in newer versions.
At first it was suspected that an lc utility was missing, but installing it changed nothing and the error remained (lc in this message is a Perl built-in function, not an external utility). The plugin author was notified.

One drawback is that the dates of IML entries are printed in Unix format (seconds since the epoch), which makes them hard to read:
Event: 76 Added:1357193160 Class: (System Revision) informational ROM flashed (New version: 12/02/2011)

You can convert them with an online calculator, for example here.
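The conversion can also be done locally. The sketch below wraps the date utility in a hypothetical helper, epoch_to_utc; it tries the BSD syntax first (as on the FreeBSD host used here) and falls back to the GNU syntax.

```shell
#!/bin/sh
# Convert a Unix timestamp to a readable UTC date.
# BSD date takes the timestamp via -r; GNU date takes it via -d @N.
epoch_to_utc() {
    date -u -r "$1" '+%Y-%m-%d %H:%M:%S' 2>/dev/null ||
        date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

# The "Added:" value from the IML line above.
epoch_to_utc 1357193160   # prints 2013-01-03 06:06:00
```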

Errors you notice can be reported to the plugin author (in plain English). G. Lausser is a busy person, but he always tries to respond to reports of problems or incorrect behavior of check_hpasm.

The benefits of the plugin

With this plugin the following were quickly detected:
• Faulty batteries on the disk controller accelerators in DL360 G4 servers, which caused a drop in write speed to the attached MSA20 arrays.
• A jammed fan in one of the ProLiants. The fan was replaced.
• A dead redundant power supply in one of the servers. The PSU was replaced.

Of course, if you patrol all the iLO interfaces every day and carefully read the IML logs, you can spot such errors too, but then the question arises: when do you do your actual work? This plugin, combined with Nagios/Icinga, simplifies administration and fully covers the HP ProLiant server infrastructure.

Some useful guides

1. HP Integrated Lights-Out User Guide
2. HP Integrated Lights-Out 2 User Guide
3. HP iLO 3 User Guide
4. HP iLO 4 User Guide
5. HP management software for Linux on ProLiant servers. HOWTO, 6th edition
6. HP Remote Insight Lights-Out Edition II User Guide
7. HP ProLiant Lights Out-100 User Guide

P.S. Would articles on other aspects of monitoring be of interest: configuration tools, practical experience, interesting plugins, add-ons, non-trivial hardware? A lot of material has accumulated, enough to write a book.

UPDATE 2015:

In 2015 I had to return to monitoring HP servers and found that the article had become a bit outdated. In particular, it turned out that an error appears when working with iLO2:

ILO2_HEALTH UNKNOWN - ERROR: Failed to establish SSL connection with: 443.

iLO3 and iLO4 work fine.

Investigation showed that the cause is a well-known SSL issue. The environment needs to be updated:

1. Upgrade the script to at least version 1.60 (it is available on Nagios Exchange, here).
2. Upgrade the iLO2 firmware to the latest version (1.94 or 1.96, available since July 2015).

The check command must be changed:

./check_ilo2_health.pl -H <host> -d 1 -u <user> -p <password> -l --sslopts 'SSL_verify_mode => SSL_VERIFY_NONE, SSL_version => "TLSv1"'

The --sslopts key has been added; it enables TLSv1 and disables SSL certificate verification.

Result:
ILO2_HEALTH OK - (Board-Version: ILO2) Temperatures: Temp_1 (Ok): 18, Temp_2 (Ok): 40, Temp_4 (Ok): 25, Temp_5 (Ok): 26, Temp_8 (Ok): 37, Temp_9 (Ok): 30, Temp_10 (Ok): 37, Temp_11 (Ok): 29, Temp_12 (Ok): 41, Temp_19 (Ok): 21, Temp_20 (Ok): 26, Temp_21 (Ok): 27, Temp_22 (Ok): 25, Temp_23 (Ok): 34, Temp_24 (Ok): 29, Temp_25 (Ok): 26, Temp_26 (Ok): 26, Temp_29 (Ok): 35, Temp_30 (Ok): 63


UPDATE 2016:

Regarding all Perl scripts that use Nagios::Plugin (check_ilo2_health.pl):
Purely because of copyright and trademark issues (Nagios is a registered trademark), CPAN no longer indexes Nagios::Plugin, so it can no longer be installed and used in the normal way.

Monitoring::Plugin, which performs identical functions, is used instead. In the script text you need to replace Nagios with Monitoring.

Install Monitoring::Plugin with:

perl -MCPAN -e 'install Monitoring::Plugin'
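The replacement in the script text can be sketched with sed. The helper name switch_to_monitoring_plugin is made up for illustration, and the pattern is deliberately narrowed to the module name Nagios::Plugin (an assumption on my part) so that other mentions of Nagios in the script stay untouched.

```shell
#!/bin/sh
# Rewrite references to Nagios::Plugin as Monitoring::Plugin in a Perl script,
# keeping the untouched original next to it with a .bak suffix.
switch_to_monitoring_plugin() {
    sed -i.bak 's/Nagios::Plugin/Monitoring::Plugin/g' "$1"
}

# Example (adjust the path to where your plugin lives):
# switch_to_monitoring_plugin /usr/local/libexec/nagios/check_ilo2_health.pl
```

The attached .bak suffix is accepted by both GNU and BSD sed, so the same call should work on Linux and FreeBSD.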

And everything works!

