I encountered a pesky issue recently. Before I get into the details, first a quick overview of the setup: a Windows Server 2012 cluster consisting of two cluster nodes. The cluster nodes are brand new HP DL 360 G8p servers with 256 GB of memory and two six-core processors. Networking is based on 10Gb Emulex NICs for converged fabric connected to an HP ProCurve 5406zl. The storage for the cluster has two members, an EqualLogic 4100E and an EqualLogic 4100XV. The iSCSI traffic is on a dedicated network with separate 1Gb NICs in the cluster nodes.
When I connected to a cluster node the response in the RDP session sometimes had a little delay. Typing in PowerShell for example felt like watching a movie with the audio out of sync from time to time. The first time I thought the lack of sleep was taking its toll. But after experiencing a couple of delays I concluded that I had some troubleshooting to do.
After bypassing the Remote Desktop Gateway that I connected through, I singled out one cluster node having the issue. I looked at the event log, but came up empty-handed. My next thought made me look at the networking infrastructure. I checked that both servers had the correct and identical NIC firmware and drivers. I also verified that the switch had the latest firmware applied. I compared the complete converged fabric configuration on both servers. All parts checked out fine. I looked at Task Manager and the processor utilization was close to idle.
The next thing to rule out was the NIC hardware. Since only one of the two servers was subject to the issue I decided to swap the 10Gb NICs between the servers. After this swap the issue seemed to have disappeared. I did not experience the issue on the other server.
I am unable to let go of an issue without a proper technical explanation, and since the NIC hardware swap seemed to make the issue disappear I ran a diagnostic test on both servers. All green checkmarks. Suddenly the delay appeared again on the same server where I had experienced the issue before. That ruled out the NIC hardware.
I swapped the cabling on the individual NICs and even between the servers. The issue persisted on the same cluster node.
When I compared the running processes in Resource Monitor on the two cluster nodes, I noticed one process, NT Kernel & System, using 4% of the CPU.
Sustained CPU use by this process, which hosts kernel and driver threads, is a strong indication that the system is encountering a hardware issue. After a simple calculation the 4 percent even made sense. The cluster node has two six-core processors. With Hyper-Threading enabled the system has access to 24 logical processors. A hundred percent of CPU divided by 24 logical processors equals roughly 4 percent for a single logical processor running at its maximum. I changed the view in Task Manager to logical processors and found one logical processor running at its maximum.
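The calculation above can be written out as a small sketch. The function name `single_lp_share` is only an illustrative helper. The inputs are the figures from this cluster node: two sockets, six cores each, and two threads per core with Hyper-Threading.

```python
def single_lp_share(sockets, cores_per_socket, threads_per_core):
    """Return (logical processor count, percent of total CPU that one
    fully busy logical processor represents)."""
    logical_processors = sockets * cores_per_socket * threads_per_core
    share_percent = 100 / logical_processors
    return logical_processors, share_percent


# Figures from this cluster node: 2 sockets, 6 cores each, Hyper-Threading on.
lps, share = single_lp_share(sockets=2, cores_per_socket=6, threads_per_core=2)
print(lps)              # 24
print(round(share, 2))  # 4.17
```

So a process that sits at a steady 4% on a machine like this is likely to be pinning exactly one logical processor, which is what the per-processor view in Task Manager showed.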
With live migration I moved the 60 virtual machines to the second cluster node. All the logical processors went idle except the one that was running at its max. The NT Kernel & System process was still running at 4%. After a reboot all logical processors were idle. Using live migration I moved virtual machines back one at a time. At the 14th virtual machine another logical processor spiked to its max. I marked the virtual machine and used live migration to clear the cluster node and after a reboot all processors were idle again.
This time I moved the marked virtual machine first. Nothing happened. Live migrating more individual virtual machines resulted in another logical processor spiking to its max. Every logical processor that spiked during these tests was part of the same NUMA node. I was quite sure this had to be a hardware issue, even though the diagnostic tests (online and offline) reported no issues.
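As a rough sketch of that NUMA observation, the snippet below groups spiking logical processors by NUMA node. It assumes the 24 logical processors are numbered contiguously per node (0 to 11 on node 0, 12 to 23 on node 1). That numbering is an assumption, not something I verified on these servers. The processor indexes in `observed` are hypothetical and only there for illustration.

```python
def numa_node_of(lp_index, lps_per_node=12):
    # Assumption: logical processors are numbered contiguously per NUMA node.
    return lp_index // lps_per_node


def nodes_hit(spiking_lps, lps_per_node=12):
    """Return the set of NUMA nodes that contain a spiking logical processor."""
    return {numa_node_of(lp, lps_per_node) for lp in spiking_lps}


# Hypothetical indexes of logical processors seen spiking across several tests.
observed = [13, 17, 22]
print(nodes_hit(observed))  # {1}
```

If every spike lands on the same node, the memory attached to that node's processor is a sensible place to look next.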
Before I opened up the server again I checked the Event Viewer one more time. And what do you know: this time there were multiple warnings with the source WHEA-Logger.
Opening a single event confirmed what I already suspected: a memory hardware failure.
I removed one pair of memory DIMMs at a time and, after each change, live migrated multiple virtual machines to see whether a logical processor would still spike. This would have been a tedious job in Windows Server 2008 R2, but the support for multiple simultaneous live migrations in Windows Server 2012 probably saved me a couple of hours. After replacing the fifth pair the problem disappeared.
In hindsight I also have a technical explanation for why the problem disappeared the first time. I had used live migration to clear the server before swapping the NIC and did not fail the roles back to this node. With no virtual machines running on it, the faulty memory was not being addressed, so the issue seemed to have disappeared until I started moving virtual machines again.