Why Windows 2003 VMs suddenly lose network connectivity

This blog post is really a drama. The setting is the IT department of a medium-sized or enterprise business; the characters are infrastructure administrators. Any resemblance to real persons and incidents is by no means coincidental, but fully intended.

Act 1

Todd, an admin who mainly looks after Windows servers, steps into the office of his co-worker Jim, the department's "VMware guy". (Side note: if you want to be successful in deploying cloud computing, you need to get rid of administration silos. Obviously this has not yet happened here, as in most places.)

Todd: "I have a Windows 2003 server that suddenly went offline. There must be something wrong with VMware networking!"

Jim: "Hmmm, let me check ..."

Jim checks the network connection of the VM. Its NIC is connected to a distributed virtual switch. The dvSwitch port that the VM uses looks fine and shows reasonable statistics. He checks other VMs that are on the same port group, and they do not have any problems.

Jim: "This is probably a Windows issue. Have you tried rebooting the machine?"

Todd: "Yes, several times ... It won't even get a DHCP address. Must be the networking of the host that runs it. Can you vMotion it to another host?"

Jim agrees and migrates the VM to another host in the cluster. But this doesn't fix the issue.

Todd: "Maybe it's a hiccup in the virtual hardware. Can you add a new NIC to the VM and remove the old one?"

Jim: "Phew! I don't know ..."

Jim finally agrees and does what Todd suggested. To no avail - the guest OS still has no network connectivity.

Jim: "Come on, let's do some serious troubleshooting and check the Windows event log of the machine! I'm sure this is a guest OS issue!"

Todd: "No way. This was working all the time, and we haven't recently changed anything!"

Act 2

Some minutes later: Jim enters Todd's office.

Jim: "Hi Todd, I took a closer look at this Windows 2003 VM that went offline. In its event log I found a strange IPSec error. Can we look at this together?"

Todd and Jim look at the machine's event log. After each reboot the following entry is logged there:
 Source: IPSec
 Event id: 4292
 Message: The IPSec driver has entered Block mode. IPSec will discard all inbound and outbound TCP/IP network traffic that is not permitted by boot-time IPSec Policy exemptions.

Todd (scratches his head): "This looks somewhat familiar, but I don't remember ... Let's google for that error!"

After a few minutes of searching they find several blog posts, forum threads and Microsoft Knowledge Base articles that all describe the same issue and provide a resolution: the IPSec policy store of Windows has become corrupt and needs to be reset (see e.g. KB912023).

They follow the instructions, reboot the VM, et voilà: its network connectivity is fully restored, and the issue is resolved.

Act 3

Next day. Jim and Todd meet at the coffee machine.

Todd: "Hi Jim, thanks again for helping with this network issue. Once we found the solution I also remembered that we had this happening before. I still have one question on my mind though: Why is VMware causing such an issue?"

Jim stares at Todd, thinking for a long while and finally replies.

Jim: "Why do you think that this is caused by VMware?"

Todd: "Because I have never seen it on a physical server!"

Jim thinks again for a long while ...

Jim: "How many physical Windows 2003 servers do we have?"

Now it's Todd scratching his head and thinking ...

Todd: "Er..., none. We have virtualized all of them a long time ago, because the hardware got too old."

Jim: "Are you really sure that you have never seen this issue on physical servers? Must have been years ago then, and you don't seem to have a good memory ..."

- The End -

Moral of the story

A growing number of SMB and enterprise-size companies have a Virtualization First policy: they try to deploy every new server as a virtual machine rather than buying new physical hardware, and they virtualize existing physical servers as much as they can. This leads to a constantly growing virtualization rate: 80%, 90%, even 100% is possible. In the end, every server that has an issue is a virtual server.

Even in such environments there are still too many people who start troubleshooting any issue at the virtualization layer rather than at the layer that is closest to the problem. We need not only a Virtualization First policy, but also a Don't Blame Virtualization First policy ...
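For readers who want to make "check the guest OS first" a habit, here is a minimal sketch of what that first check could look like for the case in this story. It scans event log text for the entry Todd and Jim found (Source IPSec, Event id 4292). The plain-text layout with "Source:", "Event id:" and "Message:" lines mirrors the excerpt quoted above and is an assumption about how the log was exported; the function names `parse_entries` and `has_ipsec_block_mode` are hypothetical and not part of any Windows or VMware tooling.

```python
import re

# The event from the story: Source "IPSec", Event id 4292
# ("The IPSec driver has entered Block mode").
IPSEC_SOURCE = "IPSec"
BLOCK_MODE_EVENT_ID = "4292"

# Assumed export layout: one field per line, as in the excerpt in Act 2.
FIELD_PATTERN = re.compile(r"\s*(Source|Event id|Message):\s*(.*)")


def parse_entries(text):
    """Split exported log text into dicts keyed by 'Source', 'Event id', 'Message'."""
    entries = []
    current = {}
    for line in text.splitlines():
        match = FIELD_PATTERN.match(line)
        if not match:
            continue  # ignore lines that are not one of the three fields
        key, value = match.groups()
        # A new "Source:" line starts the next entry.
        if key == "Source" and current:
            entries.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        entries.append(current)
    return entries


def has_ipsec_block_mode(text):
    """Return True if any entry is the IPSec Block mode event from the story."""
    return any(
        entry.get("Source") == IPSEC_SOURCE
        and entry.get("Event id") == BLOCK_MODE_EVENT_ID
        for entry in parse_entries(text)
    )
```

If `has_ipsec_block_mode` returns True for a guest's log, the problem very likely sits in the guest OS, and there is no point in starting with the dvSwitch or the host.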

Did you like this story? Did it sound familiar to you? If you have other good examples of how high virtualization rates make people blind to the real reasons behind issues (or if you have anything else to add), then please comment!




6 comments:

  1. Been there .. Done that .... So many times!!!!
    :)

  2. Good article!

    Sound very familiar to me....Somehow annoying to tell everybody to check the OS first everytime they come to me with a problem
    Hope that the mindchange will happen soon...

  3. Yes, it's always vmware that causes the problem. or at least it's guilty in one way or another :-)
    awaiting a mindchange too.

  4. Windows admins always blame the virtualization, storage or network before having a look into their own logs...

  5. Had one of these today - (logged Source: IPSec Event id: 4292) on windows 2003 VM.
    I found the solution listed right in the event viewer (disable IPSec service) and reboot worked fine restoring networking.

    thanks!

  6. Just had this problem today. Great article. Thanks. -Pete

