Errors and problems with Big Three servers, part three: IBM
Hello, Habr! In previous articles we covered errors and problems with Dell and HP servers, and our story about refurbished server errors would not be complete without the products of the third Big Three vendor, IBM. Although this venerable corporation has already left the server business, its products are still widely used, so we are sharing our accumulated experience of "taming" IBM servers. This is not an exhaustive list of problems, but it may still be useful to someone.
RAM
IBM servers are sensitive to the configuration of memory modules. Often, after a do-it-yourself upgrade (adding or replacing memory), the server does not boot or sees less memory than is actually installed. Fortunately, you do not have to guess at the cause for long: two indicators, Config and Memory, light up on the diagnostic panel (if the server has one).
Therefore, before upgrading the memory, be sure to check the specification for the memory types and sizes your server supports. The number of processors in the server also matters a great deal, because it determines the order in which modules are placed in the slots. This is also covered in the specification.
In general, the situation with memory is exactly the same as described in the article about HP. In short:
- Respect the memory channels when populating the slots.
- Install ECC registered 1Rx4 (or 2Rx4) memory in dual-processor systems and UDIMMs in single-processor systems.
- Put the same amount of memory on each processor.
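The last two rules are easy to check on paper before opening the chassis. Below is a minimal sketch, assuming a simplified model in which each module is described only by the processor it belongs to, its size, and its type; the `Dimm` class and `check_memory_plan` function are hypothetical names, and the sketch does not know your server's slot order or rank limits, which still have to come from the specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Dimm:
    cpu: int       # processor the slot belongs to (1-based), hypothetical field
    size_gb: int   # module capacity in GB
    kind: str      # "RDIMM" (ECC registered) or "UDIMM"


def check_memory_plan(dimms: List[Dimm], cpu_count: int) -> List[str]:
    """Return a list of problems; an empty list means the plan passes these checks."""
    problems = []

    # Rule from the text: registered ECC in dual-processor systems,
    # UDIMM in single-processor systems.
    expected_kind = "RDIMM" if cpu_count >= 2 else "UDIMM"
    for d in dimms:
        if not 1 <= d.cpu <= cpu_count:
            problems.append(
                f"module assigned to CPU {d.cpu}, but only {cpu_count} installed"
            )
        if d.kind != expected_kind:
            problems.append(f"CPU {d.cpu}: {d.kind} module, expected {expected_kind}")

    # Rule from the text: the same amount of memory on each processor.
    totals = {cpu: 0 for cpu in range(1, cpu_count + 1)}
    for d in dimms:
        if d.cpu in totals:
            totals[d.cpu] += d.size_gb
    if len(set(totals.values())) > 1:
        problems.append(f"unequal memory per CPU: {totals}")

    return problems
```

For example, `check_memory_plan([Dimm(1, 8, "RDIMM"), Dimm(2, 8, "RDIMM")], cpu_count=2)` returns an empty list, while adding a second 8 GB module to CPU 1 only would be reported as unequal memory per CPU.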
And what if you installed the memory according to the instructions, yet the server still does not work and the Memory indicator stays stubbornly lit? In that case, work through the following possibilities:
- This type of memory is not supported by the server. Check the specifications carefully.
- The memory module is faulty. Replace it with an identical one and check whether the server starts.
- The slot on the motherboard is clogged with dust. This is a fairly common cause if the server has been running for several years, and even more so if you are not its first owner. Blow out the slots with compressed air.
- A bent pin in the socket. This is very rare, but it does happen: the memory refuses to work because of a bent contact in the processor socket. If the previous options did not reveal the cause, remove the processor and inspect the socket carefully. If you are one of the few "lucky ones", you can try to straighten the bent contact carefully, but this is entirely at your own risk.
Many system administrators find that MemTest86 reports errors even on known-good modules, or repeatedly in the same places. This is especially common on M4-generation servers. Neither the machines nor the memory are to blame: MemTest86 is not recommended for testing server memory. If the memory starts to fail, the server will report it through the diagnostic panel. It is better to test memory on IBM servers with the built-in self-diagnostic tools.
Drives
We have mentioned more than once that you do not have to install "native" drives in servers. Neither IBM nor the other vendors manufacture drives themselves: they buy them from well-known manufacturers, reflash them, and apply their own logos. You can therefore save a lot when upgrading or rebuilding disk arrays by choosing equivalents instead of "native" drives. The two- to threefold price difference makes this worthwhile, especially for refurbished servers. Model matching tables are easy to find online, for example:
IBM Model | Original |
---|---|
IBM 49Y2003 | Seagate ST9600204SS |
IBM 90Y8872 | Seagate ST9600205SS |
IBM 90Y8908 | Seagate ST9600105SS |
IBM 81Y9650 | Seagate ST900MM0006 |
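If you keep such a table for your own inventory, a small lookup helper saves retyping part numbers. This is only a sketch: the dictionary contains just the four pairs from the table above, and the names `IBM_TO_OEM` and `oem_equivalent` are made up for illustration.

```python
from typing import Optional

# Part-number pairs copied from the table above; extend with entries you have verified.
IBM_TO_OEM = {
    "49Y2003": "Seagate ST9600204SS",
    "90Y8872": "Seagate ST9600205SS",
    "90Y8908": "Seagate ST9600105SS",
    "81Y9650": "Seagate ST900MM0006",
}


def oem_equivalent(ibm_part: str) -> Optional[str]:
    """Return the original manufacturer's model for an IBM part number, or None if unknown."""
    # Accept inputs such as "IBM 49y2003" or " 49Y2003 ".
    key = ibm_part.upper().replace("IBM", "").strip()
    return IBM_TO_OEM.get(key)
```

A part number that is not in the dictionary returns `None`, which is safer than guessing: an unlisted drive should be checked against a published matching table before you buy it.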
Nevertheless, non-native drives can still turn out to be incompatible with the server. In that case the server does not boot normally or does not see the drive. This is usually solved by installing fresh RAID controller firmware. It is also worth updating the backplane/expander firmware; the IBM Bootable Media Creator (BoMC) application will help with this.
While the server is powering on and running POST, the following error may appear:
A discovery error has occurred, please powercycle the system and all the enclosures attached to this system.
This signals a problem with one of the drives. It is easy to identify: the indicators on its tray keep flashing even after all the other drives have passed the check and stopped blinking.
There are more exotic problems with the disk subsystem, too. For example, when RAID-1 is in use, the proprietary MegaRAID Storage Manager application may report errors like this:
ID = 63
SEQUENCE NUMBER = 48442
TIME = 24-01-2016 17:03:59
LOCALIZED MESSAGE = Controller ID: 0 Consistency Check found inconsistent parity on VD strip: (VD = 0, strip = 637679)
Most often this does not mean that a disk is dying. It indicates a parity error, that is, a data mismatch between the primary and secondary disks. Possible reasons:
- Such errors often appear right after a new array is configured or one of the disks is replaced.
- During a disk surface diagnostic session, the disk is initialized and I/O operations are performed. On RAID-1 this can cause a temporary mismatch on the volume, which is corrected automatically during the next consistency check. It does not happen in every diagnostic session, only when circumstances line up:
  - The RAID controller has no cache, or Write Through mode is enabled.
  - There is not enough RAM, so the system swaps heavily to disk.
  - The disks are simply under very heavy load.
To solve this problem, reduce the disk load caused by swapping: use a RAID controller with a cache and add RAM.
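To see whether such events are a one-off or keep recurring, it helps to pull the affected strips out of a saved log. The sketch below assumes the log is available as plain text and that the messages look exactly like the event quoted above; exports from your version of MegaRAID Storage Manager may be formatted differently, and the function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Matches the message format shown in the log excerpt above.
_PARITY_RE = re.compile(
    r"Controller ID:\s*(?P<controller>\d+)\s+"
    r"Consistency Check found inconsistent parity on VD strip:\s*"
    r"\(VD\s*=\s*(?P<vd>\d+),\s*strip\s*=\s*(?P<strip>\d+)\)"
)


def parse_parity_events(log_text: str) -> List[Dict[str, int]]:
    """Extract controller, virtual drive and strip numbers from parity-mismatch messages."""
    return [
        {key: int(value) for key, value in match.groupdict().items()}
        for match in _PARITY_RE.finditer(log_text)
    ]


def strips_per_vd(events: List[Dict[str, int]]) -> Dict[Tuple[int, int], int]:
    """Count distinct inconsistent strips per (controller, VD) pair."""
    seen: Dict[Tuple[int, int], set] = {}
    for event in events:
        key = (event["controller"], event["vd"])
        seen.setdefault(key, set()).add(event["strip"])
    return {key: len(strips) for key, strips in seen.items()}
```

Running `strips_per_vd(parse_parity_events(text))` on logs saved after successive consistency checks lets you compare the counts from one check to the next.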
Firmware and software update
A curious problem may await you when installing Windows Server 2012 or Windows Server 2012 R2 from scratch: the freshly installed OS does not see any drives. This does not happen only with IBM servers. The reason is that all drives in the server are connected through a RAID controller, and these OS versions do not include drivers for it, so they simply ignore the drives. What to do? The most reliable way is to use the IBM ServerGuide utility. During OS installation it adds all the drivers required for this server model and operating system version. Note that the OS image must be installed from an optical disc, not from a flash drive: ServerGuide will not work with an image located on the same USB drive it was launched from.
When buying servers, you sometimes need to update all the firmware first and only then install the OS. This can be done with the aforementioned IBM Bootable Media Creator:
- Boot from a bootable flash drive or disk.
- Launch BoMC as Administrator.
- Choose what you want to do: update firmware and/or run diagnostics.
- The program will ask where to get the updates: download them or take them from an archive you specify.
- Select the media for the boot image: a flash drive or a disc. Writing can take several hours; do not worry, the program has not hung.
- At the end of the recording, boot from this media, and then follow the instructions.
This procedure also helps in a number of problem situations. For example, if you did not wait for the Integrated Management Module update to finish and clicked "Cancel", then on subsequent boots the server may fail to load IMM and fall back to default settings. First try to recover using the "UEFI & IMM recovery jumper" on the motherboard, which loads the flashed IMM image. If that does not help, use the update procedure through BoMC.
There are even nastier situations when, by Murphy's law, the power fails while a newer BIOS version is being installed. After that, the server can no longer load the main firmware and uses the backup copy. If the regular BIOS recovery procedure does not help, then do a downgrade: install firmware older than the version that was in place before the power failure. This usually helps. After that, you can try installing the latest BIOS version again. As they say, one step back, two steps forward.
Other problems
Sometimes an attempt to control a server remotely fails with the error "Login failed with an access denied error.", regardless of the browser. If rebooting the server and the client does not help, reset IMM to factory settings.
In the article about HP server errors we mentioned problems with the cooling system: right after the server starts, the fans spin up to high speed and never slow down again. IBM servers suffer from this ailment too, and the server howls like a jet airliner on takeoff. We were unable to find the cause of such failures, but we can advise the following:
- Check the tightness of the power connectors.
- Disconnect all fans and remove the fan cage.
- Test each fan in a different server.
- Reassemble the cage with the fans swapped around, or replace them entirely.
We once ran into an interesting failure: when the server boots, IMM initializes normally, then UEFI initialization starts, and... that's it. The server goes no further and gives no explanation. Nothing helped: disconnecting it from the network, removing power completely, or unplugging various components. Loading the backup UEFI copy with the jumper on the motherboard did not help either. We found empirically that if you wait about 20 minutes, the server does eventually boot. It has been working that way ever since, taking 20 minutes to boot each time. We never found the cause of the failure.
Benefits of IBM Servers
IBM servers are deservedly very popular:
- These are simple and very reliable machines.
- Excellent expandability even on entry-level models, and a generous standard package.
- IBM servers are usually cheaper than competitors and are not inferior in performance. For example, the M3 and M4 generations are cheaper than their counterparts from HP (Gen7 and Gen8) and Dell (11G and 12G).
- The cheapest components, and they are easy to find in Russia.
- Convenient diagnostic panel on many models.
The main area where IBM servers lose to competitors is their very long cold start.