Filtering logs in ESXi 6.0 - A practical example


Recently I rented a new server with an LSI MegaRAID SAS 9260-4i controller and - of course - installed VMware ESXi 6.0 on it. I will soon migrate my "production workloads" (the V-Front Online Depot, the ESXi Patch Tracker site and my Zimbra E-Mail server) onto this box to improve their availability. A disk failure is one of the most likely disaster scenarios, so a hardware RAID controller with two mirrored disks will certainly help.

After I had the host installed and configured I did a routine check of the log files and found a lot of SCSI errors like these in /var/log/vmkernel.log:
cpu1:34460)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x1a (0x439d80668b00, 0) to dev "naa.600605b0058b25f01c995a4f0b03da18" on path "vmhba1:C2:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE
cpu1:32779)ScsiDeviceIO: 2646: Cmd(0x439d80716d00) 0x4d, CmdSN 0x1 from world 34425 to dev "naa.600605b0058b25f01c995a4f0b03da18" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
These errors were referring to the RAID 1 disk, and first I was worried about them - according to KB289902 the D:0x2 translates to "Device: Check condition" and the sense data codes 0x5 0x24 0x0 and 0x5 0x20 0x0 mean "Illegal request", "Invalid Field in CDB", resp. "Invalid command operation code". I double checked that the RAID controller is on the VMware HCL, I searched the Internet for these errors, found few peoples' posts describing this issue, but no solution. I even updated the firmware of the controller, all to no avail ... and then, finally, I stumbled over KB1031221 which - basically - describes that these errors are "normal" and "can be safely ignored".

Well, that calmed me down... But I'm a log purist: I wanted to get rid of these false error messages in vmkernel.log!

Then this appeared on my Twitter timeline:


Just when I needed it! William Lam explains on his blog that ESXi 6.0 introduces a new log filtering capability and how to use it.


How to filter syslogs in ESXi 6.0

1. Enable shell access to your host and log in to it. In the ESXi shell edit the file /etc/vmsyslog.conf and add the line enable_logfilters = true in the section [vmsyslog]:
[vmsyslog]
rotate = 8
size = 1024
enable_logfilters = true
2. Define the filter entries in the file /etc/vmware/logfilters. The comments at the beginning of this file explain the format: The line numLogs | ident | logRegexp will filter out all messages that come from the log source ident and match the regular expression logRegexp after it has appeared numLogs times. 

I love regular expressions ... but each time I need to use them I have a hard time figuring out the correct syntax. One reason is that there are many different types of regular expressions that all use slightly different syntax. In this case we need to use Python regular expressions, because the ESXi syslog facility is written in Python.

To filter the false SCSI error messages I finally came up with this neat line in the config file:
0 | vmkernel | to dev "naa\.600605b0058b25f01c995a4f0b03da18" (on path "vmhba1:C2:T0:L0" )?[fF]ailed:? H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x2[04] 0x0\.
If you are in the same boat and want to implement a similar filter then make sure that you replace the LUN's naa-id and canonical name (vmhba1:C2:T0:L0 in my case) with your specific values. For numLogs I put a 0, because I want these messages to appear really never. For ident you need to use vmkernel.

3. After you edited and saved the files run
   esxcli system syslog reload
to make the syslog daemon pick up the new configuration.

By looking at the well hidden log file /var/log/.vmsyslogd.err you can find out whether the syslog daemon was happy with the configuration or not. If the last line looks like this:

   vmsyslog.main            : ERROR   ] reloading (33043)

then everything is fine - although it says ERROR here :-)

After I made these changes the false SCSI error messages were never again logged by the ESXi host.


So, is log filtering cool?

Well, it helped in my case, but in general it is not a good idea to suppress logging, because it may limit your (and VMware's) ability to troubleshoot an issue. So use it sparingly! As an advice here are three random rules that you should follow when thinking about log filtering:

1. If you encounter errors in log files first try to find their root cause and fix it! Just suppressing a message makes log purists like me happy, but it doesn't fix anything.

2. Never suppress a message that you do not fully understand! Make sure that it can be safely ignored before filtering it.

3. In production environments use a centralized logging facility and try to do the filtering there (rather than at the source)!. Examples for such facilities are:


This post first appeared on the VMware Front Experience Blog and was written by Andreas Peetz. Follow him on Twitter to keep up to date with what he posts.



1 comment:

  1. Thank you very much for such a simple recipe.

    ReplyDelete

***** All comments will be moderated! *****
- Please post only comments or questions that are related to this post's contents!
- Advertising and link spamming will not be tolerated!