All3 January 21, 2016 at 15:57

How I Fought Death Screens on Legacy Blade Servers

A post about how I struggled with new software on old hardware, and with the problems that appeared after I added extra hardware.

Anyone interested in server hardware and in fighting errors is welcome to read on.

To do things properly, we ordered two additional Cisco switches and a mezzanine card for each blade in our HP C3000 enclosure. I wanted the networks separated at the physical level, and I wanted better performance and reliability.
The configuration was as follows:

HP C3000 enclosure containing:

2 x HP BL460c G6
2 x HP BL490c G7

2 x HP GbE2c switches
2 x Cisco 3020 switches

Each blade has two mezzanine cards (HP NC382m Dual-Port 1GbE and HP NC364m Quad Port 1GbE) and integrated FlexFabric dual-port 10GbE.

Mezzanine cards look like this:

HP NC382m

HP NC364m

The servers run VMware ESXi 5.5.

Initially, everything worked stably without the Cisco switches and the four-port mezzanine cards. One HP switch carried the virtual machine network, the other carried the management and iSCSI networks. The second switch did not have enough performance, so we decided to move the iSCSI network to separate switches. That is why we bought the two Cisco switches and the mezzanine cards.

As you know, the 460 servers are quite outdated, but they still need to be supported. I obtained an up-to-date HP Service Pack distribution and updated the entire enclosure.

I took the 460 hosts out of the VMware cluster, installed the mezzanine cards, put the blades back into the enclosure and ... got a PSOD immediately at boot.

In this case, the error code is the string

PCPU0: 32840 / helper14-0

At first I thought it might be a motherboard problem, since one of the blades had already had its motherboard replaced precisely because of problems with the network adapters: they disappeared from time to time.
But when the problem reproduced on the second blade, I rejected this idea. It is worth noting that I tried booting the server with either mezzanine card alone, in different slots, and everything worked without problems, which means the problem is neither in a card nor in a slot.

I put the blade into debug mode, read the logs, and read the VMware forum. It says this is a hardware problem and links to the manufacturer's forum. I turned to the HP forum, where they write that modern VMware products often run into difficulties on old equipment. I installed VMware ESXi 4.1 and everything worked stably, but the problem is that our license is for ESXi 5.5, as is the accompanying software tied to it, such as vGate 2.7. I installed Windows Server 2012 R2 to make sure the problem really was in the software and ... BSOD.

NMI_HARDWARE_FAILURE

On the next boot Windows seemed to work stably, so I left it running for tests. The next day I found a BSOD.
At the same time, the IML (Integrated Management Log) in the Onboard Administrator console showed errors: Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 9, Function 0, Error status 0x00000000). That is, a fatal hardware error, and device 9 is exactly the second mezzanine card.
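
When there are many such IML entries, it can help to pull out the bus, device and function numbers automatically and see which device keeps failing. Below is a minimal sketch in Python. The pattern is inferred from the single message quoted above, so it may not match other IML entries, and the function name parse_iml_pcie_error is made up for this illustration. It also assumes the bus, device and function numbers are decimal.

```python
import re

# Pattern inferred from the one IML message quoted in this post;
# other IML entries may be formatted differently.
IML_PCIE = re.compile(
    r"Uncorrectable PCI Express Error \("
    r"(?P<kind>[^,]+), Bus (?P<bus>\d+), Device (?P<device>\d+), "
    r"Function (?P<function>\d+), Error status (?P<status>0x[0-9A-Fa-f]+)\)"
)


def parse_iml_pcie_error(line):
    """Return the PCI address fields of an IML PCIe error, or None if no match."""
    match = IML_PCIE.search(line)
    if match is None:
        return None
    return {
        "kind": match.group("kind"),
        "bus": int(match.group("bus")),            # assumed decimal
        "device": int(match.group("device")),      # assumed decimal
        "function": int(match.group("function")),  # assumed decimal
        "status": int(match.group("status"), 16),  # hex error status
    }


entry = parse_iml_pcie_error(
    "Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 9, "
    "Function 0, Error status 0x00000000)"
)
print(entry)
```

For the message above this yields device 9, which is how the error was tied to the second mezzanine card.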

I kept reading the HP forum, which says that the iLO firmware can have an effect. I found a newer iLO firmware and updated it on both blades, but it did not help. Furthermore, the forum says there is an incompatibility between the FlexFabric firmware and the drivers. I updated the FlexFabric firmware and still got the error.

I tried different distributions: the standard VMware ESXi 5.5 image and HP's own image of the same build. The result was the same.
I looked at the logs, and there the error pointed specifically at bnx2 (this is the FlexFabric network adapter). I installed the Broadcom drivers from the VMware website (note that overwriting the driver works only from the console of ESXi itself; when installed through vCenter, the driver is not overwritten). Reboot, and everything runs normally! The same thing happened with the Emulex FlexFabric on the 490 blades: I also updated the FlexFabric BIOS and replaced the driver. Everything worked stably and quickly
... but not for long.
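
The post does not record the exact commands used for the driver replacement from the ESXi console. As a rough sketch of what such a session usually looks like, the Python snippet below only builds and prints the esxcli command lines; it does not run them. The bundle path and the function name driver_update_commands are placeholders, and the use of an offline bundle (.zip) rather than a single .vib file is an assumption.

```python
import shlex


def driver_update_commands(bundle_path):
    """Build the esxcli calls for installing a driver offline bundle.

    bundle_path is a hypothetical full path to a bundle that was already
    copied to a datastore on the host.
    """
    if not bundle_path.startswith("/"):
        # Design choice of this sketch: insist on a full path to the bundle.
        raise ValueError("pass the full path to the bundle")
    return [
        # Show which driver is currently bound to each vmnic.
        ["esxcli", "network", "nic", "list"],
        # Install the offline bundle from the host's own file system.
        ["esxcli", "software", "vib", "install", "-d", bundle_path],
        # List installed VIBs to confirm the new driver version.
        ["esxcli", "software", "vib", "list"],
    ]


commands = driver_update_commands(
    "/vmfs/volumes/datastore1/driver-offline-bundle.zip"  # placeholder path
)
for argv in commands:
    print(shlex.join(argv))
```

As in the story above, the host needs a reboot before the new driver is actually in use.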

In this screenshot, the error code is the line

PCPU0: 32802 / UplinkWatchdogWorld
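
The two PSOD lines quoted in this post have the same shape, so they are easy to compare programmatically when you collect several crash screens. A small Python sketch follows; it assumes the usual reading of this line as physical CPU number, world ID and world name, and the function name parse_psod_world is invented for the example.

```python
import re

# "PCPU<n>: <world id> / <world name>", as quoted in this post.
PSOD_WORLD = re.compile(
    r"PCPU\s*(?P<pcpu>\d+):\s*(?P<world_id>\d+)\s*/\s*(?P<world_name>\S+)"
)


def parse_psod_world(line):
    """Return (pcpu, world_id, world_name) from a PSOD line, or None."""
    match = PSOD_WORLD.search(line)
    if match is None:
        return None
    return (
        int(match.group("pcpu")),
        int(match.group("world_id")),
        match.group("world_name"),
    )


first = parse_psod_world("PCPU0: 32840 / helper14-0")
second = parse_psod_world("PCPU0: 32802 / UplinkWatchdogWorld")
print(first)   # (0, 32840, 'helper14-0')
print(second)  # (0, 32802, 'UplinkWatchdogWorld')
```

The world name is the useful part here: the first crash happened in a helper world, the second in UplinkWatchdogWorld, which shows that the two screens are not the same failure.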

Then a second problem appeared, this time with the mezzanine card.
After some time, the four-port mezzanine card on one of the blades disappeared completely, even from the host BIOS. Rebooting and resetting the BIOS did not help, until I found a BIOS item for the mezzanine PCI adapters. It became possible to select a signal gain level on the PCI lines (just two options: 6 dB and 3.5 dB). And yes, "became", because this item appeared only after the four-port card was added. I switched the gain level, and immediately after the reboot the card appeared in the BIOS.

Two working weeks passed without a single purple screen.
After the firmware update, the network cards gained a Wake-on-LAN function that was not there before, so I configured power management in vCenter. Now hosts wake up as needed.
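
vCenter sends the wake-up signal itself, so nothing has to be scripted for this. Still, to check by hand that a card really reacts to Wake-on-LAN, a magic packet is easy to build: 6 bytes of 0xFF followed by the target MAC address repeated 16 times. The Python sketch below is only an illustration; the MAC address, the broadcast address and UDP port 9 are placeholders, not values from this setup.

```python
import socket


def build_magic_packet(mac):
    """Build a Wake-on-LAN magic packet for a MAC like '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError("MAC address must contain 12 hex digits")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # 6 x 0xFF, then the MAC repeated 16 times: 102 bytes in total.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Send the magic packet as a UDP broadcast (address and port are assumptions)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))


packet = build_magic_packet("00:11:22:33:44:55")  # placeholder MAC
print(len(packet))  # 102
```

Calling send_magic_packet with the MAC of a powered-off host from a machine on the same network segment should wake it if Wake-on-LAN is enabled on that card.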

In conclusion, I want to say that you need to pay attention to the functionality that appears when you add new hardware (such as additional BIOS items), and also that not every "fatal hardware error" is actually fatal. Stock drivers and outdated BIOS versions cause some of these errors.

I hope my torment with the blades will be useful to someone.
