Why you need to use vendor-specific drivers with ESXi (and keep them up-to-date)

I wanted to take a moment to talk about the importance of using vendor-specific drivers with your ESXi hosts (and bare-metal servers, for that matter).

Recently, I have had to diagnose and repair problems caused by improper drivers wreaking havoc inside a large vSphere environment.  I will give one specific example that relates to Cisco UCS, but the principles behind my argument apply to most other vendors and all types of hardware components.

Background Information / Environment

In this particular vSphere environment, the ESXi hosts had been deployed using the “vanilla” VMware image.  This is the stock ESXi ISO file that you download directly from VMware via the my.vmware.com portal in the Product Downloads section.

The environment I am referring to is a Cisco UCS deployment. The ESXi hosts are running on B200-M3 blades and are taking advantage of the CNAs (converged network adapters) for 10Gb networking and FCoE storage.  There are two versions of ESXi residing in this environment, 5.0u3 and 5.5.

The ESXi hosts were deployed successfully and worked great — for a while…

The Problem

The infrastructure team scheduled a UCS firmware upgrade to go from version 2.1(1a) to 2.1(3b).  This upgrade included fixes for some bugs that they were encountering within our UCS environment.  As such, a maintenance window was scheduled and the firmware updates completed successfully.

A few days later, the virtualization team started to receive reports of guest VMs “hanging”.  These guests would actually stop responding to ping requests and sometimes drop to a read-only filesystem.  This was happening to both Linux and Windows guests within the environment.  After receiving this report, I started the usual routine of gathering as much information as possible.

It became apparent that this was a very large and widespread problem.  It was surfacing across many different vSphere clusters and causing guest kernel panics, but only under high load conditions.  We also began seeing I/O errors popping up on the ESXi hosts themselves.

The Solution

As it turns out, since this environment was using the “vanilla” VMware ESXi image, the enic and fnic drivers that power the CNAs were way out of date.  After quickly checking the versions of these drivers (see how to do this here), I checked the Cisco compatibility matrix (found here) and identified the specific enic and fnic driver versions this environment should have been running for both ESXi 5.0u3 and 5.5.  I then hopped over to the my.vmware.com portal and downloaded the appropriate versions.
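For reference, checking the installed enic and fnic driver versions from an ESXi shell (or via SSH) looks something like the sketch below.  These are standard esxcli and vmkload_mod commands; the vmnic name is just an example and will vary per host:

```shell
# List the installed enic and fnic driver VIBs with their versions
esxcli software vib list | grep -E 'enic|fnic'

# Alternatively, query the loaded kernel modules directly
vmkload_mod -s enic | grep -i version
vmkload_mod -s fnic | grep -i version

# You can also ask a NIC which driver/version it is bound to
# (vmnic0 is an example adapter name)
ethtool -i vmnic0
```

Compare the versions these commands report against your vendor's compatibility matrix for your exact firmware level before deciding what to download.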

After updating these drivers throughout the environment with a rolling reboot, things returned to normal and the I/O and networking errors disappeared.

Lessons for Everyone

I am surprised at how many people are not aware of the Drivers & Tools section within the my.vmware.com portal.  It contains vendor-specific drivers for all sorts of components (RAID controllers, NICs, HBAs, CNAs, and so on).  It is imperative that you run the correct drivers for the components installed in your systems, and it is just as important to keep those drivers up-to-date as you update the components themselves.  Always check with your vendor for an official compatibility matrix and be sure not to deviate from it.

I also want to point out that although vendor-provided ESXi ISOs are a nice starting point, those too can easily fall behind on driver versions or compatibility.  For example, Cisco provides a custom ESXi 5.5 ISO that contains their latest drivers.  While this sounds great, it would not have worked in the environment I mentioned above.  Why?  The answer lies within the compatibility matrix.  The enic and fnic driver versions on the Cisco-provided ESXi 5.5 ISO require the entire UCS environment to be at the 2.2 firmware level (remember that the example environment is at 2.1(3b)).

I hope that you can learn from these mistakes.  This specific example provides lessons for everyone and is not vendor specific.  In fact, it is not even virtualization specific: within the environment mentioned above, a handful of bare-metal UCS blade servers experienced the exact same symptoms.

Bottom line: do not ignore compatibility matrices.  They are your best friend for avoiding compatibility and performance issues!