The SBHVL project - Part 2: Backup and Disaster Recovery

This is part 2 of the SBHVL (Small Budget Hosted Virtual Lab) project post series (see also the Introduction and Part 1). This time it's about backing things up and recovering from a disaster.

Installing and maintaining a test lab takes significant effort and time. This is true for the initial installation and configuration of the lab infrastructure (our physical host), but also for every single test scenario, which usually consists of multiple VMs, the software installed on them, configuration work, test data etc. All of this is definitely worth backing up, even though it's just a test lab...

When thinking about a backup strategy you usually have two different goals in mind. The most important one is to be able to fully recover from a disaster (which first requires an answer to the question: what is a disaster?). The second one is more challenging: being able to roll back to multiple "known good" points in time, e.g. after you screwed things up or accidentally deleted something.

So, what is a disaster in our case? Well, you can think of many events that render your test lab unusable, but for this post I will limit it to the most likely event that will cause permanent data loss if you do not guard against it: the failure of a physical hard disk. The most common way to guard against this is to use a RAID setup. Software RAID is not supported with ESXi, but hardware RAID is. If you are going for hardware RAID in your own box then you need to find a controller that is compatible with ESXi. After doing some research my recommendation is to get yourself a used Dell PERC 6i controller: it is compatible, has a battery-backed onboard cache (which is important for decent performance!) and is widely available at affordable prices (under 100 US dollars).

My hosting provider also offers to equip my server with a RAID controller, but only at a relatively high monthly charge. So I decided to go without RAID and implement the following setup for my virtual lab:

Figure: SBHVL Backup and DR

Basically, my goal here was to back up the contents of the first physical disk to the second physical disk (and vice versa) in a way that allows me to fully recover from a failure of either disk with acceptable effort.

Backing up ESXi

Normally you would not back up ESXi itself, but only its configuration data, using suitable command line tools like vicfg-cfgbackup (see the vSphere 5 docs). This is because you can quickly re-install ESXi and then re-apply the saved configuration to restore the pre-disaster state of the system. Installing ESXi is easy if you have physical access to the host, but not with a hosted lab. This is why I wrote a small script that runs in an ESXi shell, dumps the system partitions using dd and transfers the data directly to an FTP backup space that my hosting provider offers for free. Here it is:

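#!/bin/sh
# Placeholder values: replace all four with the ones for your environment
SYSDISK="/dev/disks/YOUR_DISK_ID"
FTPHOST="ftp.example.com"
FTPUSER="username"
FTPPWD="password"
BKPFILE="esxi5-backup-$(date +%Y%m%d).dd"
echo "Enabling FTP client on firewall ..."
esxcli network firewall ruleset set -r ftpClient -e true
echo "Backing up the system disk via ftp ..."
# Dump the first 10229760 sectors of 512 bytes and stream them to the FTP host
dd if="$SYSDISK" bs=512 count=10229760 | "$(dirname "$0")/ncftpput" -c -E -u "$FTPUSER" -p "$FTPPWD" "$FTPHOST" "$BKPFILE"
echo "Disabling FTP client on firewall ..."
esxcli network firewall ruleset set -r ftpClient -e false
echo "All done."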

SYSDISK is the device name of your first physical disk where ESXi is installed. You can find out the long identifier that comes after /dev/disks/ by using the command esxcli storage core device list.
FTPHOST, FTPUSER and FTPPWD are the hostname, the username and the password of the FTP server that you want to send the backup to.
BKPFILE is the name of the backup file that will be created on the FTP host. The example generates files that are named like this: esxi5-backup-20120907.dd.
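
The count passed to dd is easier to sanity-check once it is converted into a size. The following is a minimal sketch of that arithmetic; the helper name blocks_to_mib is hypothetical and not part of the original script. It shows that the count used above covers the first 4995 MiB of the disk, which you may want to compare against the partition layout of your own installation before relying on it.

```shell
#!/bin/sh
# Convert a dd block count into MiB (1 MiB = 1048576 bytes).
# Usage: blocks_to_mib COUNT [BLOCKSIZE]   (block size defaults to 512)
# blocks_to_mib is a hypothetical helper name used for illustration only.
blocks_to_mib() {
    count=$1
    bs=${2:-512}
    echo $(( count * bs / 1048576 ))
}

# The count and block size used in the backup script:
blocks_to_mib 10229760 512    # prints 4995
```

If your system partitions end at a different sector, the count in the dd call would need to be adjusted accordingly.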

The script temporarily opens a port on the ESXi firewall that is needed to ftp out to the backup host. ESXi 5.0 does not include an ftp client, so I searched for and found one that works great in an ESXi shell: ncftpput is part of the free NcFTP client package. Follow these instructions to use it in the ESXi shell:
  1. Download the Linux (Intel 64-bit) binaries package of the NcFTP client
  2. Extract the file bin/ncftpput from the archive
  3. Upload the file to a datastore of the host using the vSphere client
  4. Log on to an ESXi shell and change the permissions of the file to make it executable using the command chmod 500 /path/to/ncftpput
  5. Now save and modify the above script to include the right variable values for your environment, upload it to the same directory and also make it executable with chmod.
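
As a small illustration of step 4, the sketch below applies chmod 500 to a stand-in file and checks the result. The temporary directory and the stand-in file are assumptions made so that the example runs anywhere; on the host you would point it at the real ncftpput in your datastore directory.

```shell
#!/bin/sh
# Stand-in for the uploaded binary; on the ESXi host this would be the
# real ncftpput in your datastore directory (the path here is hypothetical).
tooldir=$(mktemp -d)
tool="$tooldir/ncftpput"
printf '#!/bin/sh\necho stand-in\n' > "$tool"

# Mode 500: the owner may read and execute, nobody else may do anything.
chmod 500 "$tool"
mode=$(ls -l "$tool" | cut -c1-10)
echo "$mode"                      # prints -r-x------

# Check that the tool is really executable before relying on it.
if [ -x "$tool" ]; then
    echo "ncftpput is executable"
else
    echo "ncftpput is NOT executable" >&2
fi
```

A check like the final if block could also be placed at the top of the backup script, so that it stops early when the FTP client is missing.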

Of course, backing up ESXi this way is only helpful if there is also a way to restore the system from the backup. Like probably any hosting provider, Hetzner lets you boot your server into a Linux-based rescue system (via PXE boot). You can then log in to this system remotely (using ssh) and restore the ESXi installation by connecting to the FTP backup server with a regular ftp client and issuing the ftp command
   get esxi5-backup-YYYYMMDD.dd "|dd of=/dev/sda"
in there.
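
Before trying this on a real disk, the dump and restore round trip can be rehearsed with ordinary files standing in for the disks. The following is only a sketch under that assumption: all file names and the function name rehearse_restore are hypothetical, and it does not involve FTP at all. It dumps the first part of a fake source disk, streams the dump into dd onto a blank target and verifies that the restored area matches the backup.

```shell
#!/bin/sh
# Rehearse the dump/restore round trip with plain files standing in for
# the real disks. All file names here are hypothetical.
rehearse_restore() {
    work=$(mktemp -d) || return 1
    sectors=1024                       # size of the "system area" to back up

    # Fake source disk: 2048 sectors of random data.
    dd if=/dev/urandom of="$work/source.img" bs=512 count=2048 2>/dev/null
    # Backup: dump only the first $sectors sectors, like the backup script.
    dd if="$work/source.img" bs=512 count=$sectors 2>/dev/null > "$work/backup.dd"
    # Fake replacement disk: same size, all zeros.
    dd if=/dev/zero of="$work/target.img" bs=512 count=2048 2>/dev/null
    # Restore: stream the backup into dd, like the ftp "get ... |dd" command.
    dd of="$work/target.img" bs=512 conv=notrunc 2>/dev/null < "$work/backup.dd"

    # Verify: the restored area must match the backup byte for byte.
    if dd if="$work/target.img" bs=512 count=$sectors 2>/dev/null | cmp -s - "$work/backup.dd"; then
        echo "restore verified"
    else
        echo "MISMATCH"
    fi
    rm -rf "$work"
}

rehearse_restore
```

The conv=notrunc option only matters for the file-based rehearsal, where it keeps the target file at its full size; it is not part of the restore command shown above.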

I tested this procedure inside a virtual ESXi 5.0 host and it worked fine. There was only one caveat: after restoring ESXi as described above to a new blank hard disk, I was not able to create a VMFS datastore on the remaining space of the disk using the vSphere client. First I had to manually delete the associated partition using the command
   partedUtil delete $SYSDISK 3
in an ESXi shell (replace $SYSDISK with the /dev/disks/... identifier that you also used in the backup script).

Backing up your VMs

To back up the VMs that are running in my virtual lab I decided to use the product Veeam Backup & Replication. To be honest: the reason why I first looked at this product is the fact that Veeam offers free NFR licenses of its Enterprise edition to vExperts. But there is also a feature-limited edition of the product that is free for everyone: Veeam Backup Free Edition (also called VeeamZIP) allows you to back up (archive or copy) your VMs without powering them off, in a compressed and deduplicated format. It even allows you to restore single files from a VM backup (which is a pretty unique feature among free products).

Figure: VeeamZIP™ - How it works

In addition the paid edition allows you to
  • schedule backup jobs
  • do incremental backups utilizing CBT (Changed Block Tracking)
  • deduplicate backup data globally (instead of only within the same VM backup)
You do not necessarily need these features for backing up a test lab, but once you have gotten used to them you won't want to do without them.

So, how is this set up in my virtual lab? The Backup server (a Windows machine with the Veeam software installed) is the only VM on the first physical disk. Besides its system disk it has another large virtual disk there (using the maximum possible size of 2TB) that holds the backup data. It backs up all other VMs (which are located on the second physical disk) onto this large disk, but not itself(!). Backing up itself would lead to a chicken-and-egg problem, of course, so I use a conventional Windows Backup task to back up the Backup server's system disk onto another virtual disk that resides on the second physical disk.

I have the term "Small Budget" in the project title, so you may quite rightly note that we need a Windows license to run the Backup server (although not a Windows Server license, because Veeam Backup also runs on a Windows workstation edition). So what other options do we have? There are Linux-based products available that have a similar feature set (e.g. PHD Virtual Backup, which I reviewed back in April), but none of them comes with a free edition like Veeam B&R (as far as I know; does anyone know of one?). VMware's soon-to-be-released new version of vSphere (5.1) will include the new backup solution Data Protection, which also uses Linux-based appliances, but it is only included with paid licenses of vSphere.

There is one option though for backing up VMs that does not require a backup server at all, but runs inside the ESXi shell: William Lam's excellent ghettoVCB script. It backs up VMs after snapshotting them, it is highly configurable, has matured since ESXi 3.5 and has good community support.

Disaster Recovery

So what are the procedures to recover from the failure of a physical disk? Obviously it depends on whether it is the first or the second disk that fails.

If the first disk fails then the system won't be bootable any more, because this disk holds the ESXi installation. We assume that the defective disk was replaced by a new hard disk of the same size. Follow these steps to recover:
  1. PXE-boot the machine into the Linux rescue system and restore the ESXi installation from the FTP backup as described above. Then connect to the physical host using the vSphere Client and perform the remaining steps.
  2. Register the Lab VMs that are on the second physical disk. You can then immediately power them on again.
  3. Re-create the datastore on the remainder of the first physical disk.
  4. Manually re-create the "Backup VM" (As part of the DR plan you need to take careful note of its configuration!). Be sure to include the "OS Backup" disk from the second physical disk.
  5. Boot the Backup VM with a Windows Recovery CD and restore the Windows Backup from its "OS Backup" disk to the "OS Disk".
  6. Since the backup data was stored on the failed disk, all your VM backups have been lost. So, as a last step, you should now check the configuration of your backup software, create new initial backups of all the Lab VMs, re-create backup schedules as needed etc.
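
If there are many Lab VMs, step 2 can also be done from the ESXi shell instead of the vSphere Client. The following is a minimal sketch, assuming that every VM on the datastore has a .vmx file below the path you pass in; the function name register_all_vms and the DRY_RUN switch are hypothetical. By default it only prints the vim-cmd calls it would make, and the demo at the end runs against a throwaway directory rather than a real datastore.

```shell
#!/bin/sh
# Find every .vmx file below a datastore path and register it with the host.
# With DRY_RUN=1 (the default here) the commands are only printed.
# register_all_vms and DRY_RUN are hypothetical names for this sketch.
register_all_vms() {
    datastore=$1
    find "$datastore" -name '*.vmx' | sort | while read -r vmx; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "vim-cmd solo/registervm $vmx"
        else
            vim-cmd solo/registervm "$vmx"
        fi
    done
}

# Demo against a throwaway directory tree that stands in for a datastore
# such as /vmfs/volumes/<your datastore>.
demo=$(mktemp -d)
mkdir -p "$demo/vm1" "$demo/vm2"
: > "$demo/vm1/vm1.vmx"
: > "$demo/vm2/vm2.vmx"
register_all_vms "$demo"
```

On the real host you would review the printed commands first and then run the function again with DRY_RUN=0 and the path of the datastore on the second physical disk.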

If the second physical disk fails (and has been replaced by a new blank one) then your physical host should still be able to boot, but its references to the Lab VMs have all become invalid. Follow these steps to return to normal conditions:
  1. Re-create the datastore on the second physical disk.
  2. Connect to the console of your Backup VM using the vSphere client and restore all Lab VMs to the datastore on the second disk.
  3. Re-create the "OS Backup" virtual disk for the Backup VM. Check and re-configure its Windows Backup schedule, create a new initial Windows backup of its "OS Disk".

This was the second part of my SBHVL project post series, covering Backup and DR. In the third and last part I will explain what tools I use to remotely manage the Lab infrastructure and how I access the Lab VMs (on the internal subnet) from the outside.
