Cisco UCS, vSphere 5.x, and Windows NLB problem

The Problem:

Since migrating to the Cisco UCS environment, a few of my client’s Windows system administrators have been complaining about issues with the NLB (Network Load Balancing) service on their systems.  These machines previously resided in a traditional vSphere 5.0 cluster running on HP BL465c G7 blades, and within the past day they were migrated to UCS blades running vSphere 5.1.

As is the case with most projects involving a virtualization platform migration, vSphere was the first component to receive the blame.  “Something in vSphere 5.1 must have broken our virtual machines.”

After some quick investigation, it became apparent that the issue was not with vSphere at all, but with a setting in UCS Manager…

Some Background Information:

The NLB service works by relying on entries in a MAC address lookup table (similar to ARP), as it needs to dynamically manipulate which MAC address a specific virtual IP address belongs to.

There is a setting in UCS Manager called “MAC Address Table Aging”.  It is a global policy, and it can be accessed by going to the Equipment tab on the left and selecting Policies on the right (as seen below).

Figure: The UCS Manager MAC Address Table Aging setting (a global policy)

By default, this setting is set to “Mode Default”, which can be a little confusing.  Venturing on over to Cisco’s documentation, I found this:

Mode Default—The system uses the default value. If the fabric interconnect is set to end-host mode, the default is 14,500 seconds. If it is set to switching mode, the default is 300 seconds.
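To put those two defaults in more familiar units, here is a minimal Python sketch that converts them to hours, minutes, and seconds.  The numbers come from the Cisco text quoted above; the dictionary keys and the function name are labels I made up for this illustration, not UCS Manager identifiers.

```python
# Default aging times quoted from Cisco's documentation above.
# The keys are hypothetical labels for this sketch, not UCS Manager names.
DEFAULT_AGING_SECONDS = {
    "end-host": 14500,
    "switching": 300,
}


def describe_aging(mode: str) -> str:
    """Return the default aging time for a mode, broken into h/min/s."""
    seconds = DEFAULT_AGING_SECONDS[mode]
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{mode}: {seconds} s = {hours} h {minutes} min {secs} s"


if __name__ == "__main__":
    for mode in DEFAULT_AGING_SECONDS:
        print(describe_aging(mode))
```

In other words, the end-host default works out to a little over four hours, while the switching-mode default is five minutes.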

This explains why the NLB service was having such a hard time.  The entries in the table that it relies upon to perform its virtual IP switching “magic” were being aged out at a predefined interval.

The Solution:

Simply changing the MAC Address Table Aging setting to “Never” fixed the NLB service issues.  The MAC address translation table will now remain static, so the NLB service can manipulate it as it sees fit.
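The difference between an aging timer and “Never” can be illustrated with a toy model in Python.  This is only a sketch of the general idea of table aging, not how the fabric interconnect is actually implemented; the class, the MAC address, and the port name are all hypothetical, and real switches typically refresh an entry whenever they see traffic from that source address.

```python
from typing import Dict, Optional, Tuple


class MacTable:
    """Toy MAC address table. Entries expire after aging_seconds (None = never)."""

    def __init__(self, aging_seconds: Optional[int]) -> None:
        self.aging_seconds = aging_seconds
        # Maps MAC address -> (port, time the entry was learned).
        self._entries: Dict[str, Tuple[str, float]] = {}

    def learn(self, mac: str, port: str, now: float) -> None:
        """Record (or refresh) the port on which a MAC address was seen."""
        self._entries[mac] = (port, now)

    def lookup(self, mac: str, now: float) -> Optional[str]:
        """Return the port for a MAC address, or None if unknown or aged out."""
        entry = self._entries.get(mac)
        if entry is None:
            return None
        port, learned_at = entry
        if self.aging_seconds is not None and now - learned_at >= self.aging_seconds:
            del self._entries[mac]  # entry has aged out
            return None
        return port


if __name__ == "__main__":
    cluster_mac = "02:bf:0a:00:00:01"  # hypothetical NLB cluster MAC

    aging = MacTable(aging_seconds=14500)  # end-host mode default
    aging.learn(cluster_mac, "veth-1", now=0)
    print(aging.lookup(cluster_mac, now=14499))  # still present
    print(aging.lookup(cluster_mac, now=14500))  # aged out

    never = MacTable(aging_seconds=None)  # the "Never" setting
    never.learn(cluster_mac, "veth-1", now=0)
    print(never.lookup(cluster_mac, now=10**9))  # still present
```

With a timer, the entry disappears once it has gone unrefreshed for the full aging period; with aging disabled, it stays until something explicitly replaces it.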

Some closing thoughts…  I once read somewhere (unfortunately I cannot remember where exactly) that organizations not only need to adopt a “virtualization first” policy, but also a “don’t blame virtualization first” policy.  That statement really holds true in this case.  The version of vSphere had absolutely nothing to do with the issues the NLB service was facing.  The change in the underlying hardware platform and the switching environment was the culprit.

Hopefully this post saves someone some time hunting down this bug in the future!