Active Directory issues with ESXi 6.0 Update 2 and an automated fix

VMware recently published two new Knowledge Base (KB) articles, KB2145611 and KB2145400, that should alarm everyone using Active Directory (AD) authentication with their ESXi 6.0 hosts.
Let's have a closer look at these articles and make sure that you draw the right conclusions.

KB2145611

The first one looks very serious: The ESXi host becomes unresponsive and pretty much unmanageable, probably because of a resource issue that affects the hostd management daemon. All affected hosts have in common that they are joined to Active Directory for Authentication Services. Other than some related log entries from /var/log/vmkernel.log the KB article provides disturbingly little information about what exactly triggers the issue and how it might be prevented. Currently there is no resolution or workaround. Apparently VMware has not yet fully identified (or does not want to disclose) the root cause of the issue.

How should you deal with this if you know that you are using Active Directory Authentication with ESXi 6.0 hosts in critical production environments? Open a Support Request with VMware NOW - no matter whether you already have experienced the described issue or not!

If you have already run into this issue then VMware Support will probably be able and willing to share a workaround with you that they do not want to share publicly in the KB (for whatever reason). If you have not yet experienced the issue then VMware Support might be willing to share further information about what type of environments are prone to it, or they might even look at your environment for early (undocumented) indicators of the issue.

If you are lucky then the outcome of the support call will be that your environment is not prone to this issue, or you will be able to check for early indicators and already know suitable countermeasures.

KB2145400

This one is much better: We have
  • clear symptoms (that are not too critical): Various AD-related actions fail, like logging in with an AD account,
  • an identified root cause: The Likewise SASL buffer size is too small,
  • and even a resolution ...
but - sorry, VMware - this resolution is terrible and furthermore incomplete! Let me explain why ...

Apparently there are two actions needed to fully fix the issue: First, the buffer size in the configuration file /etc/likewise/openldap/ldap.conf needs to be raised - the article suggests doubling it from 4096 to 8192. Second, the system resource pool for the likewise daemons needs to be modified to provide more RAM.

The issue with ldap.conf ...

... is that the file is read-only and cannot be easily changed in a persistent way. The KB article describes a hard way to persistently change the file and provides instructions on how to update the bootbank file s.v00 to include a changed copy of the file. My hair stood on end when I was reading this! The instructions given
  • are extremely dangerous to implement (one typo or cut-and-paste glitch, and your host may become unbootable),
  • modify the system in a way that - under normal circumstances - would render the system completely unsupported (you tinker with a bootbank file that should normally only be touched by the system install and update procedures),
  • cannot be easily run remotely, and thus cannot be automated.

The modification of the likewise resource group ...

... is something that could be done in a persistent way (so that it survives system reboots), but the KB article provides a command line (using vsish) that does the change in a non-persistent way. That means it must be executed at every system boot, not only once. On the other hand the KB article does not provide a way to execute the vsish command at every startup, and that means it does not describe the complete fix for the issue!
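
For illustration, the one-line vsish call can be split into two steps: resolve the ID of the likewise resource group, then raise its memory limit. This is a minimal sketch built on the command used in the init script shown later in this post. The function names are made up for this example, and the setting is lost again at the next reboot.

```shell
# Sketch only: the function names are illustrative, not part of ESXi or the KB.

# Print the numeric ID of the likewise resource group.
# Assumes the ID is the first space-separated field of the vsish output.
likewise_group_id() {
   vsish -e set /sched/groupPathNameToID host vim vmvisor likewise | cut -d ' ' -f 1
}

# Raise the group's memory limit by the given number of MB (non-persistent).
increase_likewise_maxmem() {
   gid="$(likewise_group_id)"
   if [ -z "$gid" ]; then
      echo "likewise resource group not found" >&2
      return 1
   fi
   vsish -e set "/sched/groups/${gid}/increaseMemMinMaxInMB" "max=$1"
}

# Usage on an ESXi shell:
#   increase_likewise_maxmem 200
```

Checking for an empty group ID first avoids writing to a malformed vsish path if the lookup fails.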

A possible resolution

I have developed a complete fix for the issue described in KB2145400, and that is a VIB file (named kb2145400) that you just need to install on the affected hosts. In the following I will describe in detail how the package works. If you are not interested in this educational part of my post then you can skip this section now and jump directly to the How to install the VIB package part.

When installed, the VIB package does the following:

1. It replaces the ldap.conf file with a writable copy in a way that changes to it even persist across reboots. For this to work the VIB file was created with the overlay flag enabled. When creating the package with my ESXi Community Packaging Tools, the installation flags must be configured with the overlay flag enabled.

Furthermore the ldap.conf file in my package has the sticky bit set (with chmod +t ...). As a result changes to it will be saved in the /bootbank/state.tgz file and thus persist across reboots. You can find more details about this technique by reading Part 3 of my Daemon's VIB - Building a software package for VMware ESXi series (see the section How to make the config file editable).

By default the ldap.conf file in the package already includes the buffer size value 8192 that is recommended in the KB article, but - after installing the package - you could even change it to another value.

2. The second file that is installed with my VIB package is the init script /etc/init.d/kb2145400. The file looks like this:
#! /bin/sh
#
# Copyright (c) V-Front.de 2016
# Author: Andreas Peetz <[email protected]>
#
###
# chkconfig: on 10 90
# description: Implement Likewise ldap.conf fix (as of VMware KB2145400)
# see: https://vibsdepot.v-front.de/wiki/index.php/Kb2145400
###

export PATH=/bin:/sbin

case "$1" in
   start)
      # Increase max memory of likewise resource group (by 200 MB)
      logger "KB2145400 startup: Increasing maxmem of likewise resgrp by 200 MB"
      vsish -e set /sched/groups/$(vsish -e set /sched/groupPathNameToID host vim vmvisor likewise | cut -d ' ' -f 1)/increaseMemMinMaxInMB max=200
      # At install time only: Restart lwsmd to make it pick up the changed ldap.conf
      if [ "$2" = "install" ]; then
         logger "KB2145400 installed: Restarting lwsmd ..."
         /etc/init.d/lwsmd restart
      fi
      exit 0
   ;;

   stop)
      # do nothing 
      exit 0
   ;;

   *)
      echo "Usage: $(basename "$0") {start|stop}"
      exit 1
   ;;
esac
This script will be executed on two occasions: At every system startup it will be executed with the parameter "start" passed. It will then only do the resource group modification. However, the script will also immediately be executed when the package is installed, and then the string "install" will be passed as a second parameter. In this case the script will also restart the likewise daemon lwsmd to make it pick up the changed ldap.conf file (Kudos to William Lam for pointing out this undocumented feature).

On normal system startups the line # chkconfig: on 10 90 (line 7) will make sure that it is executed before the likewise service startup script /etc/init.d/lwsmd, because it provides a lower startup sequence order number (10) than the latter.
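
To check the startup order on a given host, the sequence numbers can be read from the chkconfig headers of both scripts. The following is a minimal sketch that assumes both init scripts carry a header in the same "# chkconfig: <state> <start> <stop>" format as the script above. The helper name start_order is made up for this example.

```shell
# Sketch only: start_order is an illustrative helper, not an ESXi command.

# Print the start sequence number from a "# chkconfig: <state> <start> <stop>" header.
start_order() {
   sed -n 's/^# *chkconfig: *[^ ]* *\([0-9][0-9]*\) .*/\1/p' "$1" | head -n 1
}

# Usage on an ESXi shell (the script with the lower number starts first):
#   start_order /etc/init.d/kb2145400
#   start_order /etc/init.d/lwsmd
```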

To summarize the explanations: My VIB file implements all required immediate actions on install and the recurring action (resource group modification) on every system boot.
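
Because the packaged ldap.conf stays writable, the buffer size can be changed again after installation. The sketch below wraps a plain sed substitution in a small guard. The helper name set_ldap_value is hypothetical, and the sketch assumes the old value occurs in the file only as the buffer size.

```shell
# Sketch only: set_ldap_value is an illustrative helper, not an ESXi command.

# Replace one value with another in the given config file.
# Refuses to act when the old value is not present, so a typo changes nothing.
set_ldap_value() {
   file="$1"; old="$2"; new="$3"
   if ! grep -q "$old" "$file"; then
      echo "value $old not found in $file" >&2
      return 1
   fi
   sed -i "s@${old}@${new}@" "$file"
}

# Usage on a host with the package installed:
#   set_ldap_value /etc/likewise/openldap/ldap.conf 8192 16384 && /etc/init.d/lwsmd restart
```

The restart of lwsmd is chained with && so that the daemon is only restarted when the file was actually changed.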

How to install the VIB package

I have published the kb2145400 package to the V-Front Online Depot. If you are already familiar with that then you should know how you can install it on a host:

Online method: If your host has a direct outbound Internet connection then you just need to open a remote shell on it and run the following esxcli commands:
esxcli network firewall ruleset set -e true -r httpClient
esxcli software vib install -n kb2145400 -d https://vibsdepot.v-front.de --no-sig-check
esxcli network firewall ruleset set -e false -r httpClient
Offline method: If your host does not have a direct Internet connection then download the VIB package from the kb2145400 Wiki page, upload it to a datastore that is accessible by the host and install it with this esxcli command:
esxcli software vib install -v /vmfs/volumes/<your-datastore>/kb2145400-1.0.0-1.x86_64.vib --no-sig-check
Replace <your-datastore> with the name of the datastore to which you uploaded the VIB file.
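
After either method, it can be worth confirming that the package really shows up in the host's VIB list. This is a minimal sketch that assumes the package name appears at the start of a line in the output of esxcli software vib list. The helper name vib_installed is made up for this example.

```shell
# Sketch only: vib_installed is an illustrative helper, not an esxcli feature.

# Succeed if a VIB with the given name is listed on this host.
# Assumes the list output starts each row with the VIB name followed by a space.
vib_installed() {
   esxcli software vib list | grep -q "^$1 "
}

# Usage on an ESXi shell:
#   vib_installed kb2145400 && echo "kb2145400 is installed"
```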

The VIB installation does not require a reboot! The esxcli commands can also be run remotely, e.g. through PowerCLI. This would also be a great way to automate the installation on multiple hosts. David Stamen has posted a nice example of how to install a VIB package through PowerCLI.
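
As an alternative to PowerCLI, the same rollout could be scripted over SSH from an admin workstation. The following is only a sketch under several assumptions: SSH is enabled on the hosts, root login works, and the VIB file has already been uploaded to a datastore that every host can see. The host names and the datastore name are placeholders.

```shell
# Sketch only: host names and the datastore name are placeholders.

# Print the esxcli command that installs the VIB from the given path.
install_cmd() {
   echo "esxcli software vib install -v $1 --no-sig-check"
}

# Run the install command on each host given as an argument.
install_on_hosts() {
   vib="$1"; shift
   for host in "$@"; do
      echo "Installing on $host ..."
      ssh "root@$host" "$(install_cmd "$vib")" || echo "FAILED on $host" >&2
   done
}

# Usage from an admin workstation:
#   install_on_hosts /vmfs/volumes/shared-ds/kb2145400-1.0.0-1.x86_64.vib esx01 esx02
```

A failure on one host is reported on stderr, and the loop continues with the next host.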

Please note that Update Manager cannot be used for this task, because it won't let you install unsigned packages.

Wrap-up

I hope that VMware will fix this issue quickly and provide a patch that will make the KB2145400 workarounds unnecessary. When this happens and you deploy the official fix then please uninstall my package again. This can be done by running
esxcli software vib remove -n kb2145400
Unfortunately this requires a reboot, because there is no other way to clear the effect of the package's overlay feature.

Final note: Installing my package is of course completely unsupported and not endorsed by VMware. I am aware that most people won't want to use it in critical production environments. The main reason I created it is to provide another educational example of ESXi software packaging, and for the fun of it :-)

Update (2016-06-15)

In the meantime VMware has reworked and republished KB2145400. They completely removed the instructions to make the required changes permanent (which I criticized) and only state that the changes done manually will not persist across a reboot.


This post first appeared on the VMware Front Experience Blog and was written by Andreas Peetz. Follow him on Twitter to keep up to date with what he posts.



11 comments:

  1. Can anyone still access the info for KB2145400? The link no longer works.

    1. VMware has pulled KB2145400. I hope that they will rework and republish it shortly. Unfortunately I do not have a saved copy of it. Anyone?

    2. KB2145400 is available again. See my update to this post on 2016-06-15.

  2. if i have this vib installed How would i go about increasing the ldap.conf to say 16384? what command would i have to run?

    1. Hi Gavian,

      just edit the file /etc/likewise/openldap/ldap.conf and change the value in there. Then restart the likewise daemons by running
      /etc/init.d/lwsmd restart

      Andreas

    2. thanks, but what command exactly would i use to edit that file? for example to change it from the default of your Vib to double that of your default

    3. Hi Gavian,

      if you are familiar with the vi editor then just edit the file with
      vi /etc/likewise/openldap/ldap.conf

      If not then use this command to edit the file and change the value 8192 to 16384:
      sed -i "s@8192@16384@" /etc/likewise/openldap/ldap.conf

      Andreas

    4. thank you Andreas, modified it with vi Editor, and restarted the service. it now works like a charm. thank you so much for your help

  3. Pretty funny, I thought my vCenter installation had fallen victim to this issue when I was no longer able to authenticate into it using my domain credentials. Turns out, the system time had just drifted by more than 5 minutes from my Domain Controller's Time.

    Just a reminder to always set an NTP server on your VMware Hosts. The vCenter appliance (and almost everything using VMware tools) will then inherit this time.

    1. Actually the best option is to configure both the ESXi host and the VCSA to use NTP and the same NTP servers.

  4. This is fixed in Patch ESXi-6.0.0-20160804001-standard, however kb 2145400 has still not been updated to reflect this.

