Filtering logs in ESXi 6.0 - A practical example


Recently I rented a new server with an LSI MegaRAID SAS 9260-4i controller and - of course - installed VMware ESXi 6.0 on it. I will soon migrate my "production workloads" (the V-Front Online Depot, the ESXi Patch Tracker site and my Zimbra E-Mail server) onto this box to improve their availability. A disk failure is one of the most likely disaster scenarios, so a hardware RAID controller with two mirrored disks will certainly help.

After I had the host installed and configured I did a routine check of the log files and found a lot of SCSI errors like these in /var/log/vmkernel.log:
cpu1:34460)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x1a (0x439d80668b00, 0) to dev "naa.600605b0058b25f01c995a4f0b03da18" on path "vmhba1:C2:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE
cpu1:32779)ScsiDeviceIO: 2646: Cmd(0x439d80716d00) 0x4d, CmdSN 0x1 from world 34425 to dev "naa.600605b0058b25f01c995a4f0b03da18" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
These errors were referring to the RAID 1 disk, and first I was worried about them - according to KB289902 the D:0x2 translates to "Device: Check condition" and the sense data codes 0x5 0x24 0x0 and 0x5 0x20 0x0 mean "Illegal request", "Invalid Field in CDB", resp. "Invalid command operation code". I double checked that the RAID controller is on the VMware HCL, I searched the Internet for these errors, found few peoples' posts describing this issue, but no solution. I even updated the firmware of the controller, all to no avail ... and then, finally, I stumbled over KB1031221 which - basically - describes that these errors are "normal" and "can be safely ignored".

Well, that calmed me down... But I'm a log purist: I wanted to get rid of these false error messages in vmkernel.log!

Then this appeared on my Twitter timeline:


Just when I needed it! William Lam explains on his blog that ESXi 6.0 introduces a new log filtering capability and how to use it.


How to filter syslogs in ESXi 6.0

1. Enable shell access to your host and log in to it. In the ESXi shell edit the file /etc/vmsyslog.conf and add the line enable_logfilters = true in the section [vmsyslog]:
[vmsyslog]
rotate = 8
size = 1024
enable_logfilters = true
2. Define the filter entries in the file /etc/vmware/logfilters. The comments at the beginning of this file explain the format: The line numLogs | ident | logRegexp will filter out all messages that come from the log source ident and match the regular expression logRegexp after it has appeared numLogs times. 

I love regular expressions ... but each time I need to use them I have a hard time figuring out the correct syntax. One reason is that there are many different types of regular expressions that all use slightly different syntax. In this case we need to use Python regular expressions, because the ESXi syslog facility is written in Python.

To filter the false SCSI error messages I finally came up with this neat line in the config file:
0 | vmkernel | to dev "naa\.600605b0058b25f01c995a4f0b03da18" (on path "vmhba1:C2:T0:L0" )?[fF]ailed:? H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x2[04] 0x0\.
If you are in the same boat and want to implement a similar filter then make sure that you replace the LUN's naa-id and canonical name (vmhba1:C2:T0:L0 in my case) with your specific values. For numLogs I put a 0, because I want these messages to appear really never. For ident you need to use vmkernel.

3. After you edited and saved the files run
   esxcli system syslog reload
to make the syslog daemon pick up the new configuration.

By looking at the well hidden log file /var/log/.vmsyslogd.err you can find out whether the syslog daemon was happy with the configuration or not. If the last line looks like this:

   vmsyslog.main            : ERROR   ] reloading (33043)

then everything is fine - although it says ERROR here :-)

After I made these changes the false SCSI error messages were never again logged by the ESXi host.


So, is log filtering cool?

Well, it helped in my case, but in general it is not a good idea to suppress logging, because it may limit your (and VMware's) ability to troubleshoot an issue. So use it sparingly! As an advice here are three random rules that you should follow when thinking about log filtering:

1. If you encounter errors in log files first try to find their root cause and fix it! Just suppressing a message makes log purists like me happy, but it doesn't fix anything.

2. Never suppress a message that you do not fully understand! Make sure that it can be safely ignored before filtering it.

3. In production environments use a centralized logging facility and try to do the filtering there (rather than at the source)!. Examples for such facilities are:


This post first appeared on the VMware Front Experience Blog and was written by Andreas Peetz. Follow him on Twitter to keep up to date with what he posts.



3 comments:

  1. Thank you very much for such a simple recipe.

    ReplyDelete
  2. Hey there,

    so im trying some stuff with ESXi on some Laptops and unfortunately my vmkernel.log is full of

    2019-05-28T12:32:00.079Z cpu6:2097734)WARNING: APIC: 898: Thermal interrupt on pcpu 6

    messages in these messages the pcpu.

    i enabled logfilters in the vmsyslog.conf file and added the following line into my /etc/vmware/logfilters file:
    0 | vmkernel | Thermal interrupt on pcpu *

    Unfortunately the Log-spamming won't stop :-( even after doing a full reboot of the ESXi.

    Is there something wrong with my line or may it be that there are some messages which can't be filtered out?

    thx!

    ReplyDelete
    Replies
    1. You regular expression is probably wrong. Try

      0 | vmkernel | Thermal interrupt on pcpu

      or

      0 | vmkernel | Thermal interrupt on pcpu .*

      There are online test tools for regular expressions, e.g. at https://regex101.com/.

      Delete

***** All comments will be moderated! *****
- Please post only comments or questions that are related to this post's contents!
- Advertising and link spamming will not be tolerated!