A VMware Storage vMotion riddle - survey results and resolution

A week ago I invited you to think about a little riddle concerning something I called "self-referential Storage vMotion": What happens if you try to migrate a VM onto a datastore that this very VM provides itself? Here is the resolution:

First of all: The task will fail. Why? In his blog post "Under the covers with Storage vMotion", Kyle Gleed outlines in six steps how the Storage vMotion process works. (1) It starts by copying the configuration files to the destination datastore and (2) then creates a "shadow" VM with these files. (3) This VM waits in idle mode until the hard disk is copied to the destination datastore. (4) The copy process works iteratively using CBT (Changed Block Tracking) until the number of changed blocks left to transfer is small enough.

This all works fine in our "self-referential" case. But then comes a step that involves a so-called FSR (Fast Suspend and Resume) operation: It suspends the original VM, copies the remaining dirty blocks and then starts the shadow VM so that it can take the place of the original VM. This FSR operation normally takes an unnoticeably short time, but in our case it can never complete, because as soon as the original VM is suspended, the datastore that it provides and the shadow VM that sits on it become inaccessible. That means the remaining blocks (if any) cannot be copied and the shadow VM can never be started. After a certain timeout the task is declared a failure, and the original VM is reactivated.

This happens when the task progress is shown as 76% in vCenter. Interestingly, the timeout was very long (more than 30 minutes!) when I tried this with an iSCSI datastore, but with an NFS datastore it took just a few seconds until the task finally failed with an error message like this:

[Screenshot: The end of a self-referential Storage vMotion task]
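The failure sequence described above can be illustrated with a small toy model. This is only a sketch of the reasoning, not of how ESXi actually implements Storage vMotion: the names `VM`, `Datastore` and `storage_vmotion`, the starting number of dirty blocks and the rate at which they shrink are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    running: bool = True


@dataclass
class Datastore:
    name: str
    provider_vm: "VM | None" = None  # hypothetical: the VM that serves this datastore

    def accessible(self) -> bool:
        # A datastore served by a VM is only reachable while that VM is running.
        return self.provider_vm is None or self.provider_vm.running


def storage_vmotion(vm: VM, dest: Datastore,
                    dirty_blocks: int = 1000, threshold: int = 10) -> str:
    # Steps 1-3: copy config files and create the idle shadow VM on the destination.
    if not dest.accessible():
        return "failed: destination not accessible"
    # Step 4: iterative copy passes until few enough dirty blocks remain.
    while dirty_blocks > threshold:
        dirty_blocks //= 10  # assumed shrink rate, purely illustrative
    # Fast Suspend and Resume: the original VM is suspended ...
    vm.running = False
    # ... and now the destination must still be reachable for the final copy.
    if not dest.accessible():
        vm.running = True  # after the timeout the original VM is reactivated
        return "failed: destination became inaccessible during FSR"
    vm.running = True  # the shadow VM takes over in place of the original
    return "success"


appliance = VM("storage-appliance")
own_datastore = Datastore("nfs-from-appliance", provider_vm=appliance)
other_datastore = Datastore("some-other-datastore")

print(storage_vmotion(appliance, own_datastore))
print(storage_vmotion(appliance, other_datastore))
```

In this model the first call fails during the FSR step because the destination disappears the moment its provider is suspended, while the second call succeeds. In both cases the VM ends up running again.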

Quite easy after all, isn't it? Now let's look at the possible answers to the quiz and how often they were chosen:

1 The task starts and finishes successfully (why not!?), but if you ever happen to power down the VM later then it becomes invalid and can never be powered on again. 40.0%
2 You are unable to start the task. It is rejected with the error message "Cannot do self-referential storage migration." 11.1%
3 When the task process reaches 99% the host that performs the operation crashes with a Purple Screen Of Death (PSOD) and the error message "Spin count exceeded (svmLock) - possible deadlock detected." 6.7%
4 The Storage vMotion task fails with the error message "Failed to copy one or more disks." (see KB2030986) 24.4%
5 The Storage vMotion task hangs (possibly for a very long time) at about 76%, then fails, because ... - Please explain the reason for the failure: 17.8%

Some comments and attempts to explain the results:

(1) Most readers assumed that this would just work. I must admit that the "why not!?" and the caveat given in the "but" clause were quite misleading ;-)
(2) Some readers believed that vCenter would even prevent this task from starting, foreseeing that it can never complete. This software is smart, but not that smart.
(3) Surprisingly many people picked the PSOD variant. This was really meant only as a joke (although the "spin count exceeded" message sounds somewhat authentic ...)
(4) Okay, this was really mean, because this error message and KB2030986 really exist. However, the message will only appear if the disk copy operation fails, and this is not the case here: The task does not fail before the subsequent FSR operation (unless there is some additional issue that prevents the disk copy before that).

Everyone who chose the correct answer (5) also gave a more or less detailed, reasonably correct explanation. The best and most detailed explanations were given by
  • Bjoern Roth: "While Fast Suspend and Resume is invoked the remaining dirty blocks can not be copied successfully."
  • @jpiscaer: "Because there's a cut-over point when the VM is pauzed and the home location of the VM is switched over to the new location; by which point the shared datastore goes offline and the VM that was just moved is disconnected"
According to the answers to the second question ("Have you tried this yourself before answering the first question?"), very few readers actually tried this themselves, but none of them picked the right answer!? Most of them picked (4), which probably means that they ran into another storage-related issue when testing.

So, that was it. I hope you had fun!
