Don’t blame the hypervisor first — part 1 — (ARP corruption)

Overview

I have been an IT professional long enough to know that folks love to have a “go-to” that lets them point a finger and place blame on a particular technological component.  Shoot, I have surely been guilty of this in the past as well.  However, having been an IT professional for many years now, I also know that this is a severely flawed practice.  The best approach is to isolate the problem and then start ruling things out one-by-one.

Unfortunately, once something has caused a problem, it becomes the “go-to” cause for everything.  Back in the day, this “go-to” was usually “the network”.  Whenever something broke, be it a web server, a database connection, or connectivity to a file server, people would blame network connectivity.  Over time the default “go-to” has gone through a few different iterations.  I have been in locations where the “go-to” has been “the network”, “the SAN”, or even more generically “the server”.  It is a sad truth that once something has experienced downtime and a true cause has been identified, it becomes the default target of negativity.  Every environment will suffer a black eye every now and then.  It is important to realize that the long-term keys to success rely on letting some things go and not dwelling on them for eternity.  Isolate the original problem, take steps to mitigate it, then move on.  You can and should learn from past experiences, but do not assume that every situation is the same.

Why am I starting this blog series?

Very recently I have seen a trend where the new “go-to” for every problem is “the hypervisor”.  All too often the hypervisor (I will talk specifically about vSphere, but these concepts apply universally to other hypervisors) takes the blame for far too many problems.  I personally feel that this trend started back in the early days of virtualization and has continued to hang on.  Virtualization definitely had its setbacks while still in its infancy; however, the vSphere platform (including VI 3) has been mainstream for almost a decade.  If the technology were severely flawed, it would not be present in every single datacenter today (in one form or another).

I want to bring to light specific examples where “the hypervisor” has been blamed, found guilty, and sentenced before proper troubleshooting was done.  Of course, I will also detail how the actual issue was found.

My main goals here are to stress to other IT professionals that we should not have a “blame the hypervisor first” mentality, and to show how to go about properly troubleshooting a new issue.

So without further ado, here is our first scenario!

Situation:

Late one Friday afternoon I received a call from one of our IT managers.  A production web application that spans approximately 20 virtual machines had been experiencing issues for the past few hours.  Customers had reported random disconnects, and the initial analysis done by a systems administrator had shown that the servers could not communicate reliably amongst themselves.  Ping requests between the VMs would succeed for a while, then drop packets, then start succeeding again.  This occurred randomly and continued over a period of a few hours.

Where things went wrong:

While the systems administrator was performing a deeper analysis, he realized that only the VMs with odd-numbered hostnames were experiencing the connectivity issues.

In this particular environment, there are two vSphere clusters — one that contains all odd-numbered hostnames, and another with even-numbered hostnames.  This split was set up to allow an entire cluster (each contained within a Cisco UCS chassis in two different domains) to fail and still allow the application to run successfully.

The systems administrator quickly realized that everything was contained to this particular ESXi cluster and escalated the issue up the management chain to get the virtualization team involved.  While speaking with him on the phone, I began the troubleshooting process.  He was correct that everything was in one ESXi cluster; however, the hosts themselves were reporting no issues with connectivity.  The systems administrator became very adamant that the problem lay within vSphere and insisted that I had a problem with the ESXi hosts themselves.

Troubleshooting steps:

Part 1:

  1. I first validated the issue at hand.  After making sure VM-based firewalls allowed ICMP traffic, I could see that there was definitely a problem with inter-VM communication: ping requests between the VMs in question were unsuccessful.  (A small sketch of this kind of check follows this list.)
  2. Next, I checked vCenter for any obvious errors.  No alerts were found, all network uplinks for the ESXi hosts in question appeared to be happy, and there were no storage errors present.
  3. I then logged into the UCS manager for the domain in question.  Again I checked around for errors.  There were no faults present in UCSM, and all of the blade servers were operating normally.  The uplinks from UCS to the core routers were fully operational and the Tx/Rx counters were all showing activity.
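
If you want to script that first validation step rather than ping by hand, here is a minimal sketch of the kind of check I ran.  The hostnames are hypothetical, and it assumes a Linux machine with the standard ping utility on the path.

```python
# Minimal ping sweep: confirm basic ICMP reachability for a list of VMs.
# Hostnames are placeholders for this example.
import subprocess

VMS = ["web01.example.local", "web03.example.local", "web05.example.local"]

for host in VMS:
    # Send 5 echo requests, waiting up to 2 seconds per reply (Linux ping syntax).
    result = subprocess.run(["ping", "-c", "5", "-W", "2", host],
                            capture_output=True, text=True)
    status = "OK" if result.returncode == 0 else "PACKET LOSS / UNREACHABLE"
    print(f"{host}: {status}")
```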

At this point, the systems administrator asked for an update on my findings.  I reported that I had found no critical issues and that everything was operating normally, but that my analysis was not yet complete.  The systems administrator then brought in the IT manager and pleaded with him to pressure me into rebooting the ESXi hosts in question.  After a quick discussion about why this was a bad idea, I was allowed to continue my analysis…

Part 2:

  1. Since I had determined that the infrastructure itself was happy and healthy, I decided to try to isolate the issue.  I moved three of the affected VMs to the same ESXi host via manual vMotion operations (we’ll call this “host A” for this example).  Magically, these three VMs no longer had any problems talking to each other.  Interesting.  Placing the VMs on the same host forces inter-VM network traffic to stay local to the ESXi host; there is no need for it to traverse the “real” network uplinks.  Instead, traffic stays on the local virtual switch.  (A sketch of scripting this kind of isolation test follows this list.)
  2. Next I picked one of the three VMs I had just placed on host A and used vMotion to put it on another host (host B).  The VM continued to respond to ping requests.  Hmm…
  3. I then moved the same VM to yet another host (host C).  The VM stopped communicating for the most part, but would still respond to about 1 out of every 10 ping requests.
  4. I then moved the same VM to one more host (host D).  The VM returned to normal operations.
  5. Thinking I may have a problem with host C (which seemed unlikely), I used vMotion to place the VM back onto host C.  Much to my surprise, the VM remained in a working state and replied to all ping requests.
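
For the curious, this kind of isolation test can also be scripted instead of clicking through vMotion operations by hand.  The sketch below is not what I ran that day; it is a rough outline using the pyVmomi library, with hypothetical vCenter, VM, and host names, and it omits the error handling a real script would need.

```python
# Hedged sketch: move a test VM across a list of ESXi hosts with pyVmomi so the
# ping check can be repeated on each host.  All names and credentials are placeholders.
import ssl
import time
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_by_name(content, vimtype, name):
    """Return the first inventory object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.DestroyView()

ctx = ssl._create_unverified_context()  # lab shortcut; validate certificates in production
si = SmartConnect(host="vcenter.example.local", user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)
content = si.RetrieveContent()

vm = find_by_name(content, vim.VirtualMachine, "web-vm-07")
for host_name in ["esxi-a.example.local", "esxi-b.example.local",
                  "esxi-c.example.local", "esxi-d.example.local"]:
    host = find_by_name(content, vim.HostSystem, host_name)
    task = vm.MigrateVM_Task(pool=None, host=host,
                             priority=vim.VirtualMachine.MovePriority.defaultPriority)
    while task.info.state not in (vim.TaskInfo.State.success, vim.TaskInfo.State.error):
        time.sleep(2)  # crude polling; good enough for a sketch
    print(f"{vm.name} is now on {host_name} -- re-run the ping validation here")

Disconnect(si)
```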

Things were starting to look like a network problem…  Instead of jumping up and down pointing my finger, or burning up the phone lines to our networking team, I wanted to confirm the actual problem first.

Part 3:

  1. I began a tcpdump packet capture on the VM I had been using as a guinea pig.  I then attempted vMotion operations on it to try to reproduce the issue.  Sure enough, I was successful in getting useful data, as the VM stopped replying to pings about 2 minutes after I had performed a vMotion to host D.
  2. Upon loading the packet capture in Wireshark, I was able to see what was going on…  ARP traffic for this particular VM was advertising two different MAC addresses for the same IP address.  (A small sketch of automating this check follows this list.)
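
The check that Wireshark made obvious can also be automated.  Below is a hedged sketch using the scapy library that walks a capture file (the file name is made up) and flags any IP address that ARP replies map to more than one MAC address.

```python
# Flag IP addresses that show up with more than one MAC in ARP replies.
# "vm_capture.pcap" is a placeholder file name.
from collections import defaultdict
from scapy.all import rdpcap, ARP

ip_to_macs = defaultdict(set)

for pkt in rdpcap("vm_capture.pcap"):
    if ARP in pkt and pkt[ARP].op == 2:          # op 2 = ARP reply ("is-at")
        ip_to_macs[pkt[ARP].psrc].add(pkt[ARP].hwsrc)

for ip, macs in sorted(ip_to_macs.items()):
    if len(macs) > 1:
        print(f"{ip} was advertised with multiple MAC addresses: {sorted(macs)}")
```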

After relaying this information to the systems administrator and the IT manager, we decided to bring our networking team into the mix.  The network team’s tools showed that the ARP tables on the router and distribution switches were okay.  They insisted that the ARP problem was within the VMware virtual switch.

This is where I had my “ah-ha!” moment…  I remembered reading about ARP and how virtual switches do not need to learn MAC addresses the way physical switches do; they already know every MAC address residing on them, since those virtual adapters are registered within the hypervisor (thanks to Chris Wahl and Steve Pantol for their book, Networking for VMware Administrators).  Since I could rule out the hypervisor as the culprit for the ARP issues, where could the problem lie?

I then decided to poke around in the OS configuration of the VM itself.  I found that there was a third-party clustering agent running inside the guest OS.  Some quick searching on Google indicated that this agent relies on Linux kernel bonding to achieve its load balancing between NICs.  This particular method also relies on manipulating ARP entries on the network to achieve redundant connectivity without the need for a port channel / LACP on the upstream switches.
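
If you want to confirm which bonding mode a guest like this is actually using, the Linux kernel exposes it under /proc/net/bonding.  The sketch below assumes a bond interface named bond0, which may be named differently in your environment.

```python
# Print the configured Linux bonding mode for a guest.
# The interface name "bond0" is an assumption.
from pathlib import Path

bond_status = Path("/proc/net/bonding/bond0")

if bond_status.exists():
    for line in bond_status.read_text().splitlines():
        # The "Bonding Mode" line reveals whether an ARP-manipulating mode such
        # as adaptive load balancing (balance-alb) is in use.
        if line.startswith("Bonding Mode"):
            print(line)
else:
    print("No bond0 interface found on this guest")
```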

Next we analyzed the log files for the clustering agent in each of the troubled VMs.  We found that the clustering layer had crashed, and the improper shutdown had left ARP tables in an inconsistent, corrupted state.  This fully explained the behavior we were seeing: it simply did not matter which ESXi host a VM resided on; we still had problems.
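
One quick way to spot that kind of unhealthy ARP state from inside a guest is to dump the Linux neighbor cache and look for entries stuck in a bad state.  This is only a sketch of the idea, not the exact check we performed:

```python
# Dump the IPv4 neighbor (ARP) cache and flag entries that never resolved.
# STALE entries are usually harmless, so only FAILED/INCOMPLETE are reported.
import subprocess

output = subprocess.run(["ip", "-4", "neigh", "show"],
                        capture_output=True, text=True).stdout

for line in output.splitlines():
    # Typical format: "10.0.0.12 dev eth0 lladdr aa:bb:cc:dd:ee:ff REACHABLE"
    if any(state in line for state in ("FAILED", "INCOMPLETE")):
        print("Suspect neighbor entry:", line)
```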

At this point we had our networking team clear all ARP tables, starting at the core router and moving down through the distribution switches.  We then restarted the clustering application on all of the affected VMs.  Magically, everything started responding to pings as normal and the day was saved.
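
I was not the one driving the network gear, so the sketch below is only an illustration of how that ARP cleanup could be scripted rather than the commands our team actually ran.  It assumes Cisco IOS devices reachable over SSH, uses the netmiko library, and the addresses and credentials are placeholders.

```python
# Illustrative only: flush the ARP cache on a list of Cisco IOS devices via SSH.
# Device addresses and credentials are placeholders.
from netmiko import ConnectHandler

DEVICES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # core router + distribution switches

for ip in DEVICES:
    conn = ConnectHandler(device_type="cisco_ios", host=ip,
                          username="netadmin", password="********", secret="********")
    conn.enable()                                # privileged EXEC is required
    print(ip, conn.send_command("clear arp-cache"))
    conn.disconnect()
```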

Moral of the story:

Do not simply blame the hypervisor (or any other piece of infrastructure) until you fully understand the problem you are dealing with.  Just as the systems administrator insisted we reboot the ESXi hosts to resolve the issue, I could easily have pointed the finger blindly at our networking team.  Instead, I calmly dug deeper and continued to rule out potential problems one-by-one, collecting as much information along the way as possible.

I hope that upon reading this rather long blog post you find value in this scenario.  For those of us who have been in the IT industry for a while, it is our job as “veterans” to help teach proper methodologies to those around us.

Stay tuned for more posts in this series…