So just a quick note here. We are staging our NetApp + Nexus 10GbE upgrade/migration, and were going through the motions, moving VMs off in order to add the 10GbE NICs, when we started running into Host CPU incompatibility errors. What I wanted to do here was basically post everything we discovered, and tried (unsuccessfully), to get it to live vMotion.
Symptoms: live vMotion throws an error about Host CPU incompatibility at 'ecx'; cold migration, however, goes without a hitch. Cold migration alone was unacceptable, because we promised NDUs to the new NetApp.
The digging begins. First things first, I KNOW that Intel-VT is enabled in the BIOS, because we would not be able to host x64 VMs were it not. And we have several, and they have lived on every host. But you know what? I'll check anyway. I was right. Intel-VT is enabled.
At the suggestion of Rick Scherer, I tried to create a brand new EVC-enabled cluster. Apparently the X7350 3GHz Xeons don't support this, as I was unable to add the first host to this new cluster.
Back to the drawing board. OK, let's start looking into the NX-bit bypass, and dig through the vmx files.
Powered down a non-critical VM and changed the CPUID Mask to "Hide the NX/XD flag from guest." Powered it back up, and tried to vMotion again....failed.
Now I'm really getting frustrated, and digging into the vmx files with a fine-tooth comb. Then I stumbled across something......odd....
tools.syncTime = "FALSE"
uuid.location = "56 4d 65 df 6b c0 4c 44-7f ee b8 89 f7 af 4d 66"
sched.mem.max = "1024"
sched.swap.derivedName = "/vmfs/volumes/cd9c6686-f1131816/Web4-01/Web4-01-99e2ad9d.vswp"
hostCPUID.0 = "0000000a756e65476c65746e49656e69"
guestCPUID.0 = "0000000a756e65476c65746e49656e69"
userCPUID.0 = "0000000a756e65476c65746e49656e69"
hostCPUID.1 = "000006fb000408000004e3bdbfebfbff"
guestCPUID.1 = "000006fb00010800800022010febbbff"
userCPUID.1 = "000006fb000408000004e3bdbfebfbff"
hostCPUID.80000001 = "00000000000000000000000120000800"
guestCPUID.80000001 = "00000000000000000000000120000800"
userCPUID.80000001 = "00000000000000000000000120000800"
evcCompatibilityMode = "FALSE"
I actually have no idea what this truly means, or where it comes from, but on just about every other 1vCPU VM I inspect, the host, guest, and user values match across the board. Here, guestCPUID.1 differs from hostCPUID.1 and userCPUID.1. I did make a copy and tried changing it to match the other values, and the migration still failed.
The tech I worked with last night was supposedly filing this under Bug ID 380853 (which I can't access...partners?!) and submitting a set of exported logs to the engineering team.
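For anyone who wants to see exactly where the guest and host values part ways, here is a rough Python sketch that splits each 32-digit string into four 32-bit registers and XORs them. The eax, ebx, ecx, edx ordering is my assumption, not something from VMware documentation; it is at least consistent with the .0 entries decoding to "GenuineIntel" and with the ebx/ecxFeat/edxFeat values in /proc/vmware/cpuinfo later in this post. The function names are made up for illustration.

```python
# Values copied from the .vmx excerpt above.
HOST_CPUID_1 = "000006fb000408000004e3bdbfebfbff"
GUEST_CPUID_1 = "000006fb00010800800022010febbbff"

# Assumed register order inside each 32-hex-digit string.
REGISTERS = ("eax", "ebx", "ecx", "edx")


def split_registers(cpuid_hex):
    """Split a 32-hex-digit CPUID string into four 32-bit integers."""
    if len(cpuid_hex) != 32:
        raise ValueError("expected 32 hex digits")
    return {
        name: int(cpuid_hex[i * 8:(i + 1) * 8], 16)
        for i, name in enumerate(REGISTERS)
    }


def differing_bits(host_hex, guest_hex):
    """Return {register: host XOR guest} for the registers that differ."""
    host = split_registers(host_hex)
    guest = split_registers(guest_hex)
    return {r: host[r] ^ guest[r] for r in REGISTERS if host[r] != guest[r]}


if __name__ == "__main__":
    for reg, diff in differing_bits(HOST_CPUID_1, GUEST_CPUID_1).items():
        positions = [b for b in range(32) if (diff >> b) & 1]
        print(f"{reg}: xor=0x{diff:08x} differing bit positions={positions}")
```

This only tells you which bits differ, not what each bit means; mapping bit positions to feature names would need Intel's CPUID documentation.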
I don't have time to sit around and wait, so I'm proactively moving things off. If something needs to be quickly powered down to be moved, I'm clearing it with the responsible party, and plowing through it. Certainly not how I would like to go about it, but hey, bigger fish to fry.
A couple of things to note. There are (3) HP DL580 G5 servers in this cluster. They're identical. Same box, same (4) X7350's per host (OR ARE THEY?!), same 128GB RAM per host. Each of them had a different BIOS revision, so they are all being brought up to the latest (May 2009). Also, I decided to check out the /proc/vmware/cpuinfo on each host.
Dammit, Intel. This is annoying. Please stop doing this. See below:
ESX host 1. This is the oldest one, and set the stage for the other two with the X7350's. We were specific about getting identical hardware so we didn't have to deal with performance degradations and "workarounds" such as EVC.
cat /proc/vmware/cpuinfo
field | ESX1 | ESX2 | ESX3 |
pcpu | 00 | 00 | 00 |
family | 06 | 06 | 06 |
model | 15 | 15 | 15 |
type | 00 | 00 | 00 |
stepping | 11 | 11 | 11 |
tscKhz | 2933332 | 2933434 | 2933434 |
processorKhz | 2933332 | 2933434 | 2933434 |
busKhz | 266666 | 266675 | 266675 |
name | GenuineIntel | GenuineIntel | GenuineIntel |
ebx | 0x00040800 | 0x00040800 | 0x00040800 |
ecxFeat | 0x0004e3bd | 0x0004e3bd | 0x0004e3bd |
edxFeat | 0xbfebfbff | 0xbfebfbff | 0xbfebfbff |
initApic | 0x00000000 | 0x00000000 | 0x00000000 |
apicID | 0x00000000 | 0x00000000 | 0x00000000 |
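Eyeballing three columns of hex gets old fast, so here is a small Python sketch that flags the rows where the hosts disagree. It assumes the output has been pasted into a pipe-separated string like the table above (only a subset of rows is included here to keep it short), and the function name is just something I made up.

```python
# A subset of the rows above, pasted as "field | ESX1 | ESX2 | ESX3".
CPUINFO = """\
family | 06 | 06 | 06
model | 15 | 15 | 15
stepping | 11 | 11 | 11
tscKhz | 2933332 | 2933434 | 2933434
processorKhz | 2933332 | 2933434 | 2933434
busKhz | 266666 | 266675 | 266675
ebx | 0x00040800 | 0x00040800 | 0x00040800
ecxFeat | 0x0004e3bd | 0x0004e3bd | 0x0004e3bd
edxFeat | 0xbfebfbff | 0xbfebfbff | 0xbfebfbff
"""


def mismatched_fields(table):
    """Return {field: [value per host]} for rows whose values differ."""
    rows = {}
    for line in table.splitlines():
        # Drop empty cells so trailing pipes do not count as a value.
        cells = [c.strip() for c in line.split("|") if c.strip()]
        if len(cells) < 3:
            continue  # need a field name plus at least two hosts
        field, values = cells[0], cells[1:]
        if len(set(values)) > 1:
            rows[field] = values
    return rows


if __name__ == "__main__":
    for field, values in mismatched_fields(CPUINFO).items():
        print(f"{field}: {values}")
```

On the rows pasted here, it reports only the clock fields (tscKhz, processorKhz, busKhz); the ebx, ecxFeat, and edxFeat rows match on all three hosts.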
So I sent a big long email draft to my local Intel field engineer and we'll see what the deal is. For now, I'm putting a smile back on my face and heading to the datacenter to turn up some 10GbE!
-Nick