
vMotion fails at 14%

I was in the process of upgrading all the hosts in a cluster and was down to the last two when the remaining few VMs failed during vMotion. vMotion failed at 14% with “operation timed out”. I tried a few things, but no matter what, the VMs failed to move. So I had two hosts: host A had one VM left on it and host B had four. Both of them had started out with about twenty VMs each.

So what had happened? Had vMotion somehow “broken down”? Was there something wrong on the network, or …

I started by verifying that vMotion connectivity was still intact, and indeed it was, so I felt the problem couldn’t be directly related to vMotion communication. To confirm, I took one of the VMs I had just moved away from the host and did a vMotion back to the host. Success: it worked, and I could even vMotion the VM on to another host.
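For reference, that connectivity check can be done from the ESXi shell; a minimal sketch, assuming a hypothetical vMotion vmkernel address of 10.0.0.12 on the destination host:

    # List the vmkernel NICs and confirm which one carries vMotion traffic
    ~ # esxcfg-vmknic -l
    # Ping the destination host's vMotion address over the vmkernel stack
    ~ # vmkping 10.0.0.12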

So there was no doubt that the vMotion network was working. As my time was somewhat scarce, I opened an SR with GSS.

Unfortunately, GSS’s sense of time and urgency wasn’t the same as mine, so I returned to troubleshooting the issue a week or so later. This cluster was used solely for Citrix. I went to the Citrix team and got them to “drain” the one VM sitting alone on host A, and then started troubleshooting the problem extensively.

[Screenshot: vMotion fails at 14% – operation timed out]

After some troubleshooting the following was clear:

  • Logs were useless – all errors pointed to the vMotion network, which was working fine (see the log-check sketch after this list)
  • Only these five VMs were affected by the problem
  • KB2054100, KB2042654, and KB2068817 did not apply
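A minimal sketch of the kind of log checks involved, assuming default ESXi 5.x log locations:

    # Look for migration-related errors in the host logs (ESXi 5.x paths)
    ~ # grep -i vmotion /var/log/vmkernel.log | tail -20
    ~ # grep -i migrate /var/log/hostd.log | tail -20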

At this point I could just have shut down the one VM and moved it. But as I needed to move the last four VMs as well, I kept trying to find a way out of this situation; besides, two of the VMs were of course so mission critical that they couldn’t be shut down, so a solution had to be found.


The solution

I came across this totally by accident, but I was able to reproduce it, and it fixed my issue…

At some point I felt I had exhausted my reasonable troubleshooting options, so I started playing with the “services.sh restart” command and vMotion at the same time: start a vMotion, watch it get stuck at 14%, and then restart services.sh. What happened next came as a bit of a surprise. HA kicked in and tried to restart the VM, because the services on the ESXi host were being restarted and the host was temporarily disconnected from vCenter. This of course shouldn’t happen! As nothing was actually down, HA shouldn’t have tried to restart the VM. But it did. HA failed to do the restart, as the VM was still running on host A. Host A was now connected to vCenter again, and the VM that HA had tried to restart was found on another host in the cluster in an invalid state and powered off. On host A, no VMs were seen.
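For context, this is just the standard management-agent restart on the source host, run while the vMotion sits at 14% (the vMotion itself was started from vCenter):

    # Restart the ESXi management agents (hostd, vpxa, etc.)
    ~ # /sbin/services.sh restart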

UPDATED

Frank Denneman and Duncan Epping raised two questions on Twitter after I posted this article. Seeing Frank’s question, I can only think I wasn’t clear enough in the writing above, so hopefully this clarifies what happened and a possible cause of the error I had. Duncan was nice enough to reach out and clarify why HA attempted to restart the VM. Thanks to both!

[Embedded tweet from Frank Denneman]

[Embedded tweet from Duncan Epping]


At this point I got a little nervous; a quick ping of the VM’s IP verified that it was still running somewhere. On host A I ran the “vim-cmd vmsvc/getallvms” command, and it came up empty: no VMs registered on the host. After some head scratching, I first tried the “ps -aux” command before jumping to esxcli. I ran “esxcli vm process list”, and there the VM was indeed listed as running. The nice thing about “esxcli vm process list” is that it also lists the path to the VMX file of the running VM. All I did was copy/paste the complete path to the VM’s VMX file and run “vim-cmd solo/registervm [Path to VMX]”. The VM was now visible via “vim-cmd vmsvc/getallvms” and in the vSphere Client, and it was running. I then tried a vMotion, and to my surprise it worked now! Yeah!
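Put together, the recovery looked roughly like this; a sketch with a hypothetical datastore and VM name:

    # The inventory is empty, but the VM process is still alive on the host
    ~ # vim-cmd vmsvc/getallvms
    ~ # esxcli vm process list
    #   -> the output's "Config File" line gives the VMX path, e.g.
    #      /vmfs/volumes/datastore1/MyVM/MyVM.vmx (hypothetical)
    # Re-register the running VM with hostd using that path
    ~ # vim-cmd solo/registervm /vmfs/volumes/datastore1/MyVM/MyVM.vmx
    # The VM is back in the inventory and can now be vMotioned
    ~ # vim-cmd vmsvc/getallvms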

On to trying a vMotion of a VM on the other host, but no dice: vMotion still failed at 14%. Like any good result this one should be reproducible, so I tried the same steps on host B: start a vMotion, see it get stuck, restart services.sh, watch HA kick in, verify everything was still working, re-register the VM via the command line, and lastly vMotion the VM. Success once again! I then tried a vMotion of one of the three VMs that were left on host B, and it just worked now; quickly the rest of the VMs followed!

Now I knew it wasn’t a VM or vMotion fault. But what was it then? Well, a clear answer I can’t give you. I then went on to close the case with GSS, and they were nice enough to give me a call to hear how I had fixed the issue; they concluded that it was probably an issue with HA holding a lock on these VMs…
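If you want to see what HA (the FDM agent) was doing with a given VM, each host keeps its own FDM log; a minimal sketch, assuming the ESXi 5.x log location and a hypothetical VM name:

    # Search the HA agent log for restart attempts involving the VM
    ~ # grep -i MyVM /var/log/fdm.log | tail -20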


Well, it all turned out okay, and I got to play a lot with the CLI and logs, so not a bad day at all; I also got some troubleshooting out of the system 🙂


Closing remarks

At first I thought KB2068817 might be the solution to my issue, as the line “VMotionLastStatusCb: Failed with error 4: Timed out waiting for migration start request.” and a lot of the other text in the KB matched my issue. But I didn’t see these lines in the log files:

Get shared vigor fields message: CPUID register value () is invalid.
CPUID register value () is invalid.
CPUID register value () is invalid.
CPUID register value () is invalid.


That made me check the KB’s solution, but in the end I abandoned the idea that this KB was related at all.
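For completeness, a sketch of how to look for both signatures, assuming ESXi 5.x host log paths and a hypothetical VM directory:

    # The timeout signature from the KB, in the host logs
    ~ # grep VMotionLastStatusCb /var/log/hostd.log /var/log/vmkernel.log
    # The CPUID lines from the KB, in the VM's own log
    ~ # grep "CPUID register value" /vmfs/volumes/datastore1/MyVM/vmware.log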

The issue happened on ESXi 5.0 hosts.
