We recently had a very unpleasant event with one of our HP Blade Enclosures. Four of the eight ProLiant BL620c G7 servers suddenly lost access to the SAN storage for no apparent reason. By looking at the logs of the Onboard Administrators (OA) and Virtual Connect (VC) FlexFabric modules we found out that something really bad happened inside the enclosure, and it took us some time and the help of HP Support to permanently fix that.
The Onboard Administrator's web interface showed no alerts, but the Virtual Connect Manager showed us that the profiles of the four blade servers that had the storage issue had not been fully applied: they showed up with status Pending.
In the VC server profile you define which "virtual" adapters (Ethernet cards and FC HBAs) a server has, how they are internally wired to the VC modules, and how their uplinks connect to the external network and SAN switches. A profile is automatically applied to a server whenever it is powered on. When a VC module comes into operation and discovers (or "imports") the enclosure's infrastructure, it will also check the servers' profiles and re-apply them if necessary. Since none of the blade servers had been power-cycled or rebooted, there must have been an incident that triggered a re-import of the enclosure by the VC modules. This process normally does not interrupt any of the servers' network or storage connections, but this time something went badly wrong ...
The only way to fix the mess was to manually re-apply the pending profiles. However, this is only possible while the server is powered off. So we had to forcefully power the servers down (cleanly shutting down the ESXi hosts after putting them into maintenance mode was impossible, because they had lost access to their storage), which caused lots of VM downtime and restarts.
When Onboard Administrators go bad ...
So why would the VC modules suddenly re-import the enclosure? By looking at the VC and OA log files we quickly found out why: the active OA had been rebooting repeatedly (multiple times per day) for a few weeks already. And whenever a VC module loses and re-establishes its connection to the OA, it re-imports the enclosure.
Here are some relevant excerpts from the failing OA's log files:
...
Apr 16 22:39:21 OA: Management process failure.
Apr 16 22:39:25 OA: Onboard Administrator is rebooting
...
Apr 18 21:31:40 Redundancy: Error communicating with the other Onboard Administrator.
Apr 18 21:31:40 OA: Enclosure Status changed from OK to Degraded. (Diagnostics)
Apr 18 21:31:42 Redundancy: Onboard Administrator redundancy restored.
Apr 18 21:31:52 OA: Enclosure Status changed from Degraded to OK.
...
Apr 20 09:45:13 OA: Rebooted due to watchdog timer.
...
May 6 19:39:28 Redundancy: Error communicating via ethernet with the other Onboard Administrator (heartbeat).
May 6 19:39:28 OA: Enclosure Status changed from OK to Degraded. (Diagnostics)
May 6 19:39:45 Redundancy: Onboard Administrator redundancy restored.
...
May 17 13:35:23 Kernel: kernel BUG in page_remove_rmap at mm/rmap.c:560!
May 17 13:35:23 Kernel: Fixing recursive fault but reboot is needed!
May 17 13:35:23 Kernel: BUG: scheduling while atomic: httpd2/0x00000001/7224

... and from the VC log files:
Apr 18 21:27:23 VCNNNNNNNNNNN vcmd: [VCD:enc0:1003:Critical] VCM-OA communication down
Apr 18 21:27:23 VCNNNNNNNNNNN vcmd: [ENC:enc0:2011:Critical] Enclosure state NO_COMM : Enclosure is no-comm, Previous: Enclosure state OK, Cause: Enclosure enc0 unable to communicate with OA
Apr 18 21:27:23 VCNNNNNNNNNNN vcmd: [FAB:Fabric_1:8011:Warning] FC Fabric state UNKNOWN : All port sets UNKNOWN, Previous: FC Fabric state DEGRADED, Cause: State of port enc0:iobay1:X1 is unknown due to module condition; State of port enc0:iobay1:X2 is unknown due to module condition
Apr 18 21:27:23 VCNNNNNNNNNNN vcmd: [VCD:ENCLOS01_vc_domain:1022:Critical] Domain state FAILED : 1+ enclosures not OK or DEGRADED, Previous: Domain state OK, Cause: Enclosure enc0 unable to communicate with OA
Apr 18 21:27:23 VCNNNNNNNNNNN vcmd: [FAB:Fabric_2:8011:Warning] FC Fabric state UNKNOWN : All port sets UNKNOWN, Previous: FC Fabric state DEGRADED, Cause: State of port enc0:iobay2:X1 is unknown due to module condition; State of port enc0:iobay2:X2 is unknown due to module condition
Apr 18 21:31:45 VCNNNNNNNNNNN vcmd: [VCD:enc0:1002:Info] VCM-OA communication up
Apr 18 21:31:46 VCNNNNNNNNNNN vcmd: [ENC:CZNNNNNNNN:2001:Info] Enclosure import started : Name [ENCLOS01]
Apr 18 21:32:02 VCNNNNNNNNNNN vcmd: [SVR:enc0:dev1:5004:Info] Server power on
Apr 18 21:32:06 VCNNNNNNNNNNN vcmd: [SVR:enc0:dev2:5004:Info] Server power on

In most cases the re-import of the enclosure was successful and caused no issues, but at the time of this incident it somehow failed and cut off the storage connections of half of the servers.
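If you collect these logs centrally, the "multiple reboots per day" pattern is easy to spot programmatically. Here is a minimal sketch (my own helper, not part of any HP tooling) that tallies OA reboot events per day from syslog lines like the ones above; the function name and the two patterns are assumptions based on the messages seen in this incident.

```python
import re
from collections import Counter

# Patterns taken from the OA log excerpts above; extend as needed.
REBOOT_PATTERNS = [
    re.compile(r"Onboard Administrator is rebooting"),
    re.compile(r"Rebooted due to watchdog timer"),
]

def reboots_per_day(lines):
    """Tally OA reboot events per 'Mon DD' date from syslog lines."""
    counts = Counter()
    for line in lines:
        if any(p.search(line) for p in REBOOT_PATTERNS):
            # OA syslog lines start with e.g. "Apr 16 22:39:25";
            # use just month and day as the bucket key.
            month, day = line.split()[:2]
            counts[f"{month} {day}"] += 1
    return counts

sample = [
    "Apr 16 22:39:21 OA: Management process failure.",
    "Apr 16 22:39:25 OA: Onboard Administrator is rebooting",
    "Apr 20 09:45:13 OA: Rebooted due to watchdog timer.",
    "Apr 20 18:02:44 OA: Onboard Administrator is rebooting",
]

print(reboots_per_day(sample))
```

Any day with a count above zero deserves a closer look; a count above one is exactly the symptom we had.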
So the event was caused by a really bad coincidence of two different issues:
- A faulty/unstable OA that repeatedly rebooted
- A mess-up of the server profiles during a VC re-import of the enclosure
HP Support response
Time to get HP Support involved! I had a hard time convincing them that one of the OAs was faulty. The hardware monitoring functions of the enclosure did not detect any obvious problem, but the reboot and connectivity issues persisted even after pulling both OAs and both VC modules out of the enclosure and re-inserting them. So they finally agreed to replace the OA, et voilà: The issue went away.
However, the mess-up of the server profiles was not caused by defective hardware, but by a bug in the Virtual Connect firmware! HP asked me to update the VC firmware to the latest version 4.20, which fixes this bug. See the Release Notes (link section below), p. 6:
After an enclosure import or recovery, profiles that enter a pending state could encounter an FCoE SAN outage with FlexFabric 10Gb/24 Port Modules.

So the issue is specific to FlexFabric modules and affects only storage connections (and I can confirm that: the network connectivity was not interrupted).
The Virtual Connect Firmware 4.20 is part of the HP Maintenance Supplement Bundle for SPP 2014.02.0(B), and that also includes the latest OA firmware v4.21 with a fix for the OpenSSL Heartbleed bug, so I applied both.
And while I was looking for the latest and greatest firmware (which is still a pain on the HP.com web pages) I also discovered a new firmware for the Emulex OneConnect CNAs of the blade servers (version 4.9.416.0). Its release notes state that it "adds support for Virtual Connect firmware 4.20", so I thought it would be better to also apply that to all eight servers.
Please note: The HP Maintenance Supplement Bundle for SPP 2014.02.0(B) is an addition to the HP Service Pack for ProLiant (SPP) Version 2014.02.0(B) and requires it to be deployed first. We had already done that a while ago, following HP's latest VMware FW and Software Recipe. So I thought we were on the safe side regarding firmware ... Far from it!
Lessons learned

1. Do not only monitor the hardware health of your Blade enclosures (e.g. through HP SIM), but also watch the log files of the OAs and VC modules for suspicious or unexpected messages! (You can configure forwarding to a remote syslog server for both.)
2. Firmware bugs can defeat hardware redundancy! It helps to have redundant OAs and VCs in a Blade enclosure, but buggy firmware can lead to outages despite all redundancy features. So keep an eye on firmware updates, carefully study their release notes, and consider updating proactively (even if you have just updated everything to the latest recommended versions)!
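To make lesson 1 actionable, you can run a simple filter against the forwarded syslog stream. The sketch below is my own illustration: the pattern list is derived directly from the OA and VC messages quoted earlier in this post, and the function name is made up; adapt both to your environment and alerting tooling.

```python
import re

# Message fragments seen during this incident; any match is worth an alert.
SUSPICIOUS = [re.compile(p) for p in (
    r"Management process failure",
    r"Rebooted due to watchdog timer",
    r"Error communicating .* Onboard Administrator",
    r"VCM-OA communication down",
    r":Critical\]",          # any VC message logged at Critical severity
    r"kernel BUG",
)]

def suspicious_lines(lines):
    """Return the subset of forwarded syslog lines matching a suspicious pattern."""
    return [line for line in lines if any(p.search(line) for p in SUSPICIOUS)]

feed = [
    "Apr 18 21:31:52 OA: Enclosure Status changed from Degraded to OK.",
    "Apr 20 09:45:13 OA: Rebooted due to watchdog timer.",
    "Apr 18 21:27:23 VCNNNNNNNNNNN vcmd: [VCD:enc0:1003:Critical] VCM-OA communication down",
]
for line in suspicious_lines(feed):
    print("ALERT:", line)
```

Hooked into whatever consumes your remote syslog feed, a filter like this would have flagged the failing OA weeks before the actual outage.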
Links

Latest HP Service Pack for ProLiant (SPP)
HP Maintenance Supplement Bundle for SPP 2014.02.0(B)
Virtual Connect Firmware 4.20
This post first appeared on the VMware Front Experience Blog and was written by Andreas Peetz. Follow him on Twitter to keep up to date with what he posts.