Automating VM recovery with stand-alone vSphere Replication

With vSphere 5.1, VMware introduced the stand-alone vSphere Replication feature, which enables you to continuously replicate virtual machines between different hosts using different storage. Formerly this feature was only available with SRM (Site Recovery Manager), which is why I call the SRM-less version "stand-alone" here, although this is not the official product name.

Stand-alone vSphere Replication is included with the vSphere Essentials Plus Kit and with the Standard, Enterprise and Enterprise Plus editions. It requires at least one vCenter server and one vSphere Replication Appliance (vSRA) per vCenter, and it uses the new Web Client as the GUI for configuring and managing replication and VM recovery. For more detailed information I recommend reading Ken Werneburg's excellent Introduction to VMware vSphere Replication white paper.

I have set up vSphere Replication in my lab and played with it to find out whether it works flawlessly and how easily it can be installed, configured and managed. To briefly summarize my findings (without going into much detail): it does work flawlessly, it is easy to use, and I would recommend it as a cost-effective DR solution to any SMB customer.

But ... there is one shortcoming that really bugged me: VM recovery is a purely manual process. There is no other way to perform a VM failover to the secondary site than clicking through a wizard in the Web Client, and at the end the secondary VM is started with its NIC disconnected - a safety measure that you (annoyingly!) cannot override.

Starting a VM failover in the Web Client

The quest for automation

Good engineers and sysadmins try to automate as much as possible ... My goal here is to create a home-grown solution that autonomously monitors the availability of the primary VM and automatically fails over to the replicated copy if the primary one fails (for whatever reason).

Currently there is no official support or documentation from VMware on how to do this, so I started some research and discovered that there are some useful CLI tools available in the ESXi shell that you can use to manage VM replications: hbrfilterctl and the vim-cmd namespace hbrsvc. Both have already been described in detail by William Lam. His article refers to the SRM version of vSphere Replication, but it also applies to the stand-alone version.

Unfortunately, these commands cannot be used to automate the recovery process. They let you pause, reconfigure and disable VM replications, but do not offer any option to perform a recovery. Another drawback is that they can only be used on the host that runs the primary VM, and a failure of exactly this host will be the most common reason to initiate a recovery ... So a different approach is needed.

Reverse engineering Web Services ...

You may already know that most VMware products make heavy use of Web Services (SOAP over HTTPS) to manage their components and let them communicate with each other, and vSphere Replication does so as well. Whenever you initiate a VM recovery, the vSRA talks to vCenter (and/or the ESXi hosts directly?) to have them perform the necessary steps.

It is possible to intercept and understand these Web Service calls and then build your own tool that uses them to initiate a VM recovery. I tried this for a short time, captured and analyzed network traffic using Wireshark, and found out that Wireshark has some support for decrypting SSL traffic ... I learnt some new things, but eventually I gave up. This is a challenging task that calls for an experienced Web Services developer, which I am not (any volunteers?).
Maybe another reason for giving up early was that I had a "Plan B" in my mind from the very beginning.

Plan B

I remembered a blog post by Duncan Epping answering the question "Can I protect my vCenter Server with vSphere Replication?". In it he outlines the solution to a chicken-and-egg problem: you need a functioning vCenter server to start a VM recovery, so if vCenter is gone you won't be able to recover any VM. Not in the supported and documented way, that is, so let's look at the unsupported and undocumented way ;-).

In the location of the replicated VM you will find that all the files are available that you need to assemble a copy of the original VM, they just have different names:

A look at a replicated VM's directory
In fact we only need to rename the configuration files (.vmx, .vmxf and .nvram) to their original names, register the VM and fire it up ... yes, it's really that easy!

We can be quite sure that the "official" process initiated by the vSRA mainly consists of exactly these steps (plus some internal housekeeping to keep any of the involved components up to date). We just mimic its behavior, and - this time - it's quite easy to automate the steps.

PowerCLI to our rescue!

I was able to quickly hack together a PowerCLI script to do the job. At first I was concerned that manipulating datastore files in PowerCLI might be complex, but there was no reason to worry, because it is in fact incredibly easy:

# Connect to surviving host (or vCenter if available)
Connect-VIServer -Server host2.unsupported.com -User root -Password secret

# Change to the directory containing the replicated files
cd vmstore:\ha-datacenter\VMStore2\POINTMAN

# Rename the config files into their regular names
Move-Item -Force (Get-ChildItem *.vmx.*).Name POINTMAN.vmx
Move-Item -Force (Get-ChildItem *.vmxf.*).Name POINTMAN.vmxf
Move-Item -Force (Get-ChildItem *.nvram.*).Name POINTMAN.nvram

# Register the replicated VM and power it on
New-VM -Name "POINTMAN" -VMFilePath "[VMStore2] POINTMAN/POINTMAN.vmx" | Start-VM

# Disconnect from host (or vCenter server)
Disconnect-VIServer -Confirm:$false

This is a very concrete example that uses constants. For a more general solution they should be replaced by variables: the name of the host that received the replica (here: host2.unsupported.com), the datastore (VMStore2) and directory (POINTMAN) that hold the replicated VM, and the original name of the replicated VM (POINTMAN).

Some more annotations:
  • Line 2: I'd recommend connecting to the host directly, but you can also connect to vCenter (if it is still alive).
  • Line 5: Whenever you connect to a host or vCenter server, PowerCLI automatically creates the PSDrive vmstore, which you can use to navigate through all available datastores and manipulate the files there. ha-datacenter is the built-in datacenter object that exists on each ESXi host. If you connect through vCenter, you need to replace it with the name of the datacenter object that you created there. The cd command in this line just changes to the directory of the replicated VM.
  • Lines 8-10: The Move-Item commands rename the replicated configuration files to their original names. I use -Force here to overwrite any existing copies (that might be left over from earlier tests).
  • Line 13: The New-VM command here does not really create a new VM, but registers the replicated VM in the host's (or vCenter server's) inventory. The resulting VM object is piped directly into a Start-VM command that powers on the VM.
    If connected to vCenter, you must specify the host on which to register the VM using the -VMHost parameter of the New-VM command. You can then also specify a VM folder for the VM using the -Location parameter. If connected to a host, the VM will appear in the "Discovered Virtual Machines" inventory folder.
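
The constants of the script can be lifted into parameters along the following lines. This is a minimal sketch, not a tested tool: the function names Get-ReplicaVmxPath and Start-ReplicaVM and all of their parameter names are hypothetical, the default datacenter name only fits a direct host connection, and the sketch assumes that each of the three file patterns matches exactly one replicated file (it stops otherwise).

```powershell
# Sketch only. Requires PowerCLI for everything except Get-ReplicaVmxPath.
# Function and parameter names are hypothetical, not part of PowerCLI.

# Build the datastore path of the renamed .vmx file, e.g. "[VMStore2] POINTMAN/POINTMAN.vmx"
function Get-ReplicaVmxPath {
    param([string]$Datastore, [string]$Directory, [string]$VMName)
    "[$Datastore] $Directory/$VMName.vmx"
}

function Start-ReplicaVM {
    param(
        [Parameter(Mandatory)] [string]$Server,          # surviving host (or vCenter)
        [Parameter(Mandatory)] [pscredential]$Credential,
        [Parameter(Mandatory)] [string]$Datastore,       # datastore holding the replica
        [Parameter(Mandatory)] [string]$VMName,          # original name of the VM
        [string]$Directory = $VMName,                    # replica directory on the datastore
        [string]$Datacenter = 'ha-datacenter'            # change when connecting to vCenter
    )
    Connect-VIServer -Server $Server -Credential $Credential | Out-Null
    try {
        Push-Location "vmstore:\$Datacenter\$Datastore\$Directory"
        try {
            foreach ($ext in 'vmx', 'vmxf', 'nvram') {
                # Refuse to guess if there is no match or more than one match
                $found = @(Get-ChildItem "*.$ext.*")
                if ($found.Count -ne 1) {
                    throw "Expected exactly one replicated *.$ext.* file, found $($found.Count)."
                }
                Move-Item -Force $found[0].Name "$VMName.$ext"
            }
        } finally {
            Pop-Location
        }
        # Register the replica and power it on
        $vmx = Get-ReplicaVmxPath -Datastore $Datastore -Directory $Directory -VMName $VMName
        New-VM -Name $VMName -VMFilePath $vmx | Start-VM
    } finally {
        Disconnect-VIServer -Server $Server -Confirm:$false
    }
}
```

A call with the values from the example would look like `Start-ReplicaVM -Server host2.unsupported.com -Credential (Get-Credential) -Datastore VMStore2 -VMName POINTMAN`. Passing a credential object instead of a plain-text password is a design choice of this sketch, not something the original script does.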

If connected to vCenter the New-VM command will fail if the original VM is also still in the inventory and in the same VM folder that you want to use to register the replica. It is perfectly okay though to have two VMs with the same name in the vCenter inventory as long as they are located in different VM folders!
So make sure that you end up using different folders for the original VM and its replica (or give the replica a different name, but this will complicate the failback process later).
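
When registering through vCenter, the host and the target folder can be passed explicitly. The following is a hedged sketch of that variant: Register-ReplicaInFolder and its parameter names are hypothetical, it assumes an existing Connect-VIServer session to vCenter, and it assumes that the folder name is unique among the VM folders of the inventory.

```powershell
# Sketch only. Requires PowerCLI and an existing connection to vCenter.
# Register-ReplicaInFolder and its parameter names are hypothetical.
function Register-ReplicaInFolder {
    param(
        [Parameter(Mandatory)] [string]$VMName,      # name to register the replica under
        [Parameter(Mandatory)] [string]$VmxPath,     # e.g. "[VMStore2] POINTMAN/POINTMAN.vmx"
        [Parameter(Mandatory)] [string]$VMHostName,  # host that can access the replica datastore
        [Parameter(Mandatory)] [string]$FolderName   # VM folder that does NOT hold the original VM
    )
    $targetHost = Get-VMHost -Name $VMHostName
    $folder     = Get-Folder -Name $FolderName -Type VM
    # Registers the existing .vmx file; no new VM is created
    New-VM -Name $VMName -VMFilePath $VmxPath -VMHost $targetHost -Location $folder
}

# Example call (the folder name 'Recovered' is made up for illustration):
# Register-ReplicaInFolder -VMName 'POINTMAN' -VmxPath '[VMStore2] POINTMAN/POINTMAN.vmx' `
#     -VMHostName 'host2.unsupported.com' -FolderName 'Recovered' | Start-VM
```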

You also want to make sure that the original VM is really dead, or at least disconnected from the network, before you power on the replica. Otherwise you risk conflicts from two machines with the same identity being active at the same time. This can be quite a challenge depending on what disaster happened, especially if you want to automate the whole process from monitoring the primary VM to activating the replica.
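
One simple way to approach the monitoring part is to declare the primary VM down only after several consecutive failed probes. The sketch below illustrates that idea and nothing more: Test-PrimaryDown and its parameters are hypothetical, the probe is supplied by the caller, and a failed ping does not prove that the VM is really dead (a network partition looks the same from the monitoring side).

```powershell
# Sketch only. Test-PrimaryDown and its parameter names are hypothetical.
# Returns $true only if the probe fails $Threshold times in a row.
function Test-PrimaryDown {
    param(
        [Parameter(Mandatory)] [scriptblock]$Probe,  # must return $true while the primary VM answers
        [int]$Threshold = 3,                         # consecutive failures required
        [int]$DelaySeconds = 10                      # pause between probes
    )
    for ($i = 1; $i -le $Threshold; $i++) {
        if (& $Probe) {
            return $false                            # one successful probe: treat the VM as alive
        }
        if ($i -lt $Threshold) {
            Start-Sleep -Seconds $DelaySeconds
        }
    }
    return $true
}

# Example probe using ICMP (the host name is made up for illustration):
# $down = Test-PrimaryDown -Probe {
#     Test-Connection -ComputerName 'pointman.unsupported.com' -Count 1 -Quiet
# }
# if ($down) { <# fence off the original VM first, then activate the replica #> }
```

Keeping the probe as a script block makes it easy to swap the ping for an application-level check, which is usually a better indicator than ICMP alone.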

Failing back and forth

I also tested if and how you are able to come back to a clean state after you did a VM recovery using these hacks. It is quite easy: Once the vCenter server and the vSRA are up and running you will notice that the replication job has an error status in the Web Client. You can only get rid of it by completely stopping the job. But this is what you need to do anyway if you want to fail back into the original state.

For failing back you just follow the manual process that is described in the vSphere documentation:
1. You set up a new replication job, this time from the replica to the original VM. If the files of the original VM still exist you can use them as seeds which will speed up the initial sync.
2. Once the initial sync is done do a VM recovery back to the original copy using the official documented procedure.
3. Now repeat step 1 in the reverse direction, and you have finally returned to the old pre-disaster state.

I have tested the whole procedure multiple times in the lab and it worked smoothly for me. I would even trust it in a small scale production environment. But remember: This is unsupported by VMware, use it at your own risk!




5 comments:

  1. Have you tried this in vSphere 5.5?
    When I do a 'Get-ChildItem -Force' in the datastore for the replicated files I only get the .vmdk and -flat.vmdk even though the remaining, masked, files are visible in the datastore browser in the vSphere web client.
    Same thing happens using 'ls -al' in a Putty session to one of the ESXi hosts.

  2. Ok... that was kind of embarrassing. If the original VM is removed from inventory the masked files are removed from the replica store, leaving only the .vmdk and -flat.vmdk.
    A refresh of the datastore browser in the web client reduced the files to the same as seen in PowerCLI.

    So a need-to-know on replication.... NEVER unregister a VM before it has been completely recovered :)

  3. Good write-up. Be careful though: with 5.5 MPIT (multi-point-in-time recovery) you will have a bunch of *.vmx.* files and also the hbrdisk* deltas (or redo logs) to deal with.
    I would say that it would be safe advice to disable MPIT for vCenter VR ...
    -Carlos

  4. For some reason when I try this, the newly registered VM at the recovery site is expecting the .VMDKs to be in the location of the source VM when I try to power it on. (Fortunately the target host does not have access to the source data store.)

  5. I just figured out the reason that you didn't run into this problem is likely because your source VMs each had all of their files (VMX, VMDK, etc.) contained in the same data store. In that case there is no hard path to the VMDKs in the VMX file. As a workaround you can edit the path, inserting the GUID of the new data store, which is kind of a pain.
