Monitoring vSphere Replication RPO with vCenter alarms

I'm currently exploring the possibilities of stand-alone vSphere Replication in my lab. A first and very important outcome was my post about how to automate VM recovery. This time we will look at how you can monitor vSphere Replication RPO violations using vCenter alarms.

When configuring replication for a VM, you specify a target RPO (Recovery Point Objective) that can range between 15 minutes and 4 hours. I chose the minimum for most of my VMs, which means that vCenter (or rather the vSphere Replication Appliance) should replicate changed data so frequently and quickly that the state of the replica lags behind the primary VM by at most 15 minutes. Whether this always succeeds depends on a number of factors:

The data change rate, overall disk activity, the available network bandwidth, etc. In my lab setup I noticed that the RPO was violated from time to time (at least once per day), but only by 1 minute. You can track this in the Events view of the vSphere Client:

vSphere Replication RPO events in the vSphere client
The 1-minute violations didn't really bother me, but I wanted to make sure that I don't miss any of the larger RPO violations, so that I can investigate what causes them.

A great (and often underestimated) way to monitor your vSphere environment for any kind of event is setting up a vCenter alarm. You can do this with both the new Web Client and the legacy C# Client. Like many of us, I have a hard time getting used to the Web Client, because I am so familiar with the legacy client and know how to get things done quickly with it. I thought this would be a good opportunity to familiarize myself with the new client, and tried to create the alarm with it. However, I found out that this is just not possible in this special case, because of the advanced options that I needed.

I will come back to this later and explain it in more detail while we go through the process of creating the alarm with the legacy client:

1. Open the Settings dialog of a new alarm from the context menu of the Alarms view:

Create a vCenter alarm - General section
Enter the Alarm name and Description. In the Alarm Type section choose to monitor Virtual Machines and select the option to Monitor for specific events.

2. Change to the Triggers tab of the dialog:

Create a vCenter alarm - Define trigger events
Here we face a challenge: The dropdown menu for the Event to look for does not contain any vSphere Replication events. We need to enter the event identifiers manually here, but how do we find them?

There are a number of ways to do this ... If you are familiar with PowerCLI then the quickest way is to look at the properties of the Event objects using the cmdlet Get-VIEvent and appropriate filters. But I want to point you to another resource that I have not yet found mentioned in other places:

The vCenter server program directory ("%ProgramFiles%\VMware\Infrastructure\VirtualCenter Server") contains a lot of text files with the suffix .vmsg that contain the localized values of all string resources of the vCenter server software. E.g. the file locale\en\event.vmsg contains all vCenter built-in event definitions in English (additional languages are stored in their own subdirectories: de, fr, ja, ko, zh_CN).
vCenter extension-specific resources can be found in the subdirectories under extensions: The file that we finally need to look at is extensions\com.vmware.vcHms\locale\en\event.vmsg. If you search this file for the relevant event message text (e.g. the string violated), you will find these lines:
com.vmware.vcHms.rpoViolatedEvent.category = "error"
com.vmware.vcHms.rpoViolatedEvent.description = "RPO violated"
com.vmware.vcHms.rpoViolatedEvent.formatOnComputeResource = ""
com.vmware.vcHms.rpoViolatedEvent.formatOnDatacenter = ""
com.vmware.vcHms.rpoViolatedEvent.formatOnHost = ""
com.vmware.vcHms.rpoViolatedEvent.formatOnVm = ""
com.vmware.vcHms.rpoViolatedEvent.fullFormat = "Virtual machine vSphere Replication RPO is violated by [data.currentRpoViolation] minute(s)"
com.vmware.vcHms.rpoRestoredEvent.category = "info"
com.vmware.vcHms.rpoRestoredEvent.description = "RPO restored"
com.vmware.vcHms.rpoRestoredEvent.formatOnComputeResource = ""
com.vmware.vcHms.rpoRestoredEvent.formatOnDatacenter = ""
com.vmware.vcHms.rpoRestoredEvent.formatOnHost = ""
com.vmware.vcHms.rpoRestoredEvent.formatOnVm = ""
com.vmware.vcHms.rpoRestoredEvent.fullFormat = "Virtual machine vSphere Replication RPO is no longer violated"
Here we learn that the RPO violation and restore events have the internal names com.vmware.vcHms.rpoViolatedEvent and com.vmware.vcHms.rpoRestoredEvent, and we can use these as custom event triggers. The violation event will set the status Alert (or Red), and the restoration event will revert the status to Normal (or Green) and clear the alarm again.
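If you prefer not to search the file by hand, the lookup can be scripted. Here is a minimal Python sketch of it. It assumes that the .vmsg file consists of plain lines of the form key = "value", as in the excerpt above (real files may contain other kinds of lines, which this sketch simply skips). The function names parse_vmsg and find_events are made up for this example, and the sample text is copied from the excerpt rather than read from disk.

```python
import re

# Sample lines copied from the excerpt above. A real run would read the text of
# extensions/com.vmware.vcHms/locale/en/event.vmsg from the vCenter directory.
SAMPLE = '''
com.vmware.vcHms.rpoViolatedEvent.category = "error"
com.vmware.vcHms.rpoViolatedEvent.description = "RPO violated"
com.vmware.vcHms.rpoViolatedEvent.fullFormat = "Virtual machine vSphere Replication RPO is violated by [data.currentRpoViolation] minute(s)"
com.vmware.vcHms.rpoRestoredEvent.category = "info"
com.vmware.vcHms.rpoRestoredEvent.description = "RPO restored"
com.vmware.vcHms.rpoRestoredEvent.fullFormat = "Virtual machine vSphere Replication RPO is no longer violated"
'''

# Assumed line shape: <event id>.<property> = "<value>"
LINE = re.compile(r'^\s*(\S+)\.(\w+)\s*=\s*"(.*)"\s*$')


def parse_vmsg(text):
    """Group the properties of each event under its event identifier."""
    events = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:  # lines that do not fit the assumed shape are skipped
            event_id, prop, value = match.groups()
            events.setdefault(event_id, {})[prop] = value
    return events


def find_events(events, needle):
    """Return the sorted ids of events whose description or message contains needle."""
    needle = needle.lower()
    return sorted(
        event_id
        for event_id, props in events.items()
        if needle in props.get("description", "").lower()
        or needle in props.get("fullFormat", "").lower()
    )


events = parse_vmsg(SAMPLE)
for event_id in find_events(events, "violated"):
    print(event_id, "=>", events[event_id]["description"])
```

Searching for "violated" returns both identifiers, because the message of the restore event also contains that word ("is no longer violated").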

Important note: When using the new Web Client for creating the alarm you will find the "RPO violated" and "RPO restored" events in the dropdown menu for the triggers, so you would not have to look up their identifiers like we did above. On the other hand, you can only choose entries from the dropdown list here and cannot type a custom identifier at all. This is a clear limitation of the Web Client GUI, and we will stumble over it again in the next step:

3. Set an advanced condition for the Alert trigger

Remember we only want an alert to be triggered if the RPO was violated by more than 1 minute! So we need to define an advanced trigger condition by clicking on the Advanced... link in the picture above. This will open the following dialog:

Create a vCenter alarm - Advanced trigger conditions
The fullFormat message in the event.vmsg file (see 2., above) references [data.currentRpoViolation] as a placeholder for the number of minutes. So we put currentRpoViolation as the Argument here and compare it with the Value 1. The Operator must be chosen from a dropdown list, and unfortunately this list does not include anything like "greater than" or "less than". The best that we can choose here is "not equal to", and this is fine in our case, because any positive whole number that is not equal to 1 is actually greater than 1 ...
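The reasoning behind this condition can be illustrated with a few lines of Python. This is only a model, not how vCenter evaluates the condition internally: it assumes that the event reports whole, positive minutes and that the comparison is numeric. The function names placeholder_arguments and alert_fires are made up for this example. The excluded parameter additionally assumes that several "not equal to" conditions would all have to hold at the same time, which I have not verified.

```python
import re

# fullFormat message of the violation event, copied from event.vmsg
FULL_FORMAT = ("Virtual machine vSphere Replication RPO is violated by "
               "[data.currentRpoViolation] minute(s)")


def placeholder_arguments(full_format):
    """Return the argument names referenced as [data.<name>] placeholders."""
    return re.findall(r"\[data\.(\w+)\]", full_format)


def alert_fires(minutes, excluded=(1,)):
    """Model the trigger: every 'not equal to' condition must hold (assumption)."""
    return all(minutes != value for value in excluded)


print(placeholder_arguments(FULL_FORMAT))

# With the single condition "not equal to 1", only 1-minute violations are ignored.
for minutes in (1, 2, 5):
    print(minutes, alert_fires(minutes))
```

Under the same assumptions, calling alert_fires(minutes, excluded=range(1, 6)) would model an alert that only fires for violations of more than 5 minutes.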

By the way: In the Web Client we have the same limitation for the Operator, and what's worse: Even the Argument can only be chosen from a dropdown list that includes just some generic arguments (like VM Name), but not currentRpoViolation. This means that we cannot define this specific advanced trigger condition in the Web Client!

4. Define the alarm action

The last step is to define the alarm action, and this needs to be done in the Actions tab of the Alarm Settings dialog:

Create a vCenter alarm - Define actions
We choose to send an e-mail here whenever the alarm status changes to Red (that means the RPO is violated by more than 1 minute) or back to Green (which means that the RPO is restored back to the normal range).
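To make the expected behavior of the finished alarm concrete, here is a small Python model of it. It is a simplified sketch of what I expect the alarm to do, not a description of vCenter internals: the function name run_alarm and the event list are invented for illustration, and the "not equal to 1" condition is modeled as a simple numeric check.

```python
VIOLATED = "com.vmware.vcHms.rpoViolatedEvent"
RESTORED = "com.vmware.vcHms.rpoRestoredEvent"


def run_alarm(events, excluded_minutes=(1,)):
    """Replay (event id, minutes) pairs and list the status changes that send a mail."""
    status = "Green"
    mails = []
    for event_id, minutes in events:
        if event_id == VIOLATED:
            # The advanced condition ignores violations by an excluded number of minutes.
            new_status = status if minutes in excluded_minutes else "Red"
        elif event_id == RESTORED:
            new_status = "Green"
        else:
            new_status = status
        if new_status != status:  # a mail is sent only when the status changes
            mails.append(f"{status} -> {new_status}")
            status = new_status
    return mails


# Hypothetical event sequence: a 1-minute violation, then two larger ones.
history = [
    (VIOLATED, 1),
    (RESTORED, None),
    (VIOLATED, 4),
    (VIOLATED, 6),
    (RESTORED, None),
]
print(run_alarm(history))
```

In this model the 1-minute violation is ignored, the 4-minute violation turns the alarm Red, and the final restore event turns it Green again, so two e-mails are sent.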

Conclusion

This alarm works fine for me, but I'm still puzzled about the lack of a "greater than" comparison operator and other GUI restrictions. Maybe we can overcome them by creating this alarm via PowerCLI rather than using a GUI? But this will be the topic of another post ...




7 comments:

  1. I think you've saved my butt sir. I've recently set up monitoring for my vSphere replications and have been going nuts trying to figure out how to get it to only alert when it's bad for more than X minutes. If this works, I owe you a beer.

    1. Been playing about with this myself and in the Advanced Trigger conditions, you can actually state current RPO violation is not equal to <x minutes...hope that helps, Steve.

    2. Hmmm, thanks, that sounds interesting. I need to try this.

  2. Putting "<x" (x is minutes, no quotes) did not work for me. Did it for anyone else? I am running vSphere 5.5 U2

  3. Great info. I went in and categorized the .vmsg by info, warning, error and added a bunch of EventTypeId to the vCenter alerts. I did not add anything into the advanced and was able to see a few alerts fire afterwards.

    Thanks!

    Ben

  4. Many thanks indeed.

    How can this not be a simple tick/config option from within the main replication or monitoring config area, rather than having to initially decode text files etc.?

    Surely that should be a given for such an important function in an enterprise solution?

  5. To set the overall trigger to greater than 5 minutes, I created separate conditions with Not equal to, each with their own value 1, 2, 3, 4 & 5.
