ATS Miscompare Issue with EMC VMAX — Updated 08/24/2015

Note:  See 08/24/2015 update below!

It seems that the ATS Miscompare bug referenced in Cormac Hogan’s blog post (found here) is also impacting some EMC VMAX arrays.  This discovery was made in a large environment running VMware ESXi 5.5.0 build 2456374, backed by a VMAX 20k array, while investigating VM performance issues / pauses.

Under heavy write loads, VMs would completely pause all I/O and these messages would appear in the vmkernel logs on the ESXi hosts:

2015-04-11T14:12:56.482Z cpu3:38545)NMP: nmp_ThrottleLogForDevice:2322: Cmd 0xf1 (0x412e803d0d00, 32789) to dev "naa.60000970000192604666533032333835" on path "vmhba2:C0:T1:L20" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0. Act:NONE

2015-04-11T14:12:56.482Z cpu3:38545)ScsiDeviceIO: 2325: Cmd(0x412e803f95c0) 0xfe, CmdSN 0xcfe7 from world 32789 to dev "naa.60000970000192604666533032333835" failed H:0x0 D:0x2 P:0x5 Possible sense data: 0x0 0x0 0x0.

2015-04-11T14:12:56.482Z cpu3:38545)HBX: 1929: ATS Miscompare detected between test and set HB images at offset 3387392 on vol '53003a80-2832c375-6a9a-0025b5a102cc'.

2015-04-11T14:12:56.482Z cpu3:38545)HBX: 1930: Test version: [HB state abcdef02 offset 3387392 gen 61 stampUS 65973008482 uuid 55282a6f-c89f6768-199c-0025b5a301bd jrnl <FB 821137> drv 14.60]

2015-04-11T14:12:56.482Z cpu3:38545)HBX: 1931: Set version: [HB state abcdef02 offset 3387392 gen 61 stampUS 65976008500 uuid 55282a6f-c89f6768-199c-0025b5a301bd jrnl <FB 821137> drv 14.60]

2015-04-11T14:12:56.483Z cpu11:32825)HBX: 270: Reclaimed heartbeat for volume 53003a80-2832c375-6a9a-0025b5a102cc (dc4compute01-storage019): [Timeout] Offset 3387392

2015-04-11T14:12:56.483Z cpu11:32825)[HB state abcdef02 offset 3387392 gen 61 stampUS 65976030746 uuid 55282a6f-c89f6768-199c-0025b5a301bd jrnl <FB 821137> drv 14.60]

2015-04-11T14:12:56.976Z cpu3:39091)ScsiDeviceIO: 2325: Cmd(0x412e8339a8c0) 0xfe, CmdSN 0xcfeb from world 220873 to dev "naa.60000970000192604666533032333835" failed H:0x0 D:0x2 P:0x5 Possible sense data: 0x0 0x0 0x0.

Additionally, these messages would appear in the vCenter event logs:

4/12/2015 7:44:09 PM Lost access to volume 524c6025-5f875fc5-75c3-0025b5a101ec (dc4mgmt01-storage002) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

4/12/2015 7:46:09 PM Successfully restored access to volume 524c6025-5f875fc5-75c3-0025b5a101ec (dc4mgmt01-storage002) following connectivity issues.

This behavior is consistent with what Cormac describes in reference to the IBM Storwize arrays.  Ultimately, this statement found in IBM’s description of the issue seems to be exactly what we are witnessing in this environment:

“Due to the low timeout value for heartbeat I/O using ATS, this can lead to host disconnects and application outages if delays of 8 seconds or longer are experienced in completing individual heartbeat I/Os on backend storage systems or the SAN infrastructure.”

Unfortunately, this behavior is being exhibited across the *entire* environment and is not isolated to any specific set of hosts or VMs.  This explains the pauses witnessed in the I/O activity of the VMs.  In this case, the I/O pauses were easily spotted as gaps in ‘sar’ data on the Linux guests (collected every minute).
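
For anyone wanting to reproduce that check, a rough sketch along the lines below can flag gaps in one-minute sar samples.  Treat it as illustrative only: the log file name and the 120-second threshold are assumptions that depend on your sysstat configuration.

# Hypothetical sketch: flag gaps longer than ~2 minutes in one-minute sar samples.
# Assumes the default 12-hour timestamps ("02:05:01 PM"); pick the sa file for the day in question.
sar -u -f /var/log/sa/sa11 | awk '
  $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && ($2 == "AM" || $2 == "PM") {
    split($1, t, ":")
    h = t[1] % 12; if ($2 == "PM") h += 12
    now = h * 3600 + t[2] * 60 + t[3]
    if (prev && now - prev > 120)
      printf "gap of %d seconds before %s %s\n", now - prev, $1, $2
    prev = now
  }'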

As of now, the only recommended solution is to revert ATS heartbeating to its pre-ESXi 5.5u2 behavior by disabling the ATS VAAI heartbeat mechanism as described in VMware KB 2113956: Enabling or Disabling VAAI ATS heartbeat.  Personally, I do not like to disable features or deviate from defaults without proper justification.  This may end up being one of those times.
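
For reference, the current state of the setting can be checked on a host with the corresponding ‘list’ command from the same KB (an Int Value of 1 means the ATS heartbeat is in use; 0 means it has been reverted):

# Show the current value of the ATS heartbeat setting on an ESXi 5.5u2 host
esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5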

There are currently support cases open with both VMware and EMC regarding this issue.  Thus far, EMC has confirmed the existence of the ATS bug and is supposedly working with VMware on a resolution.

We have held off on disabling the ATS VAAI heartbeat thus far, but I will keep this page updated with the results of the VMware/EMC support cases.

05/06/2015 Update:

After continuing to work with EMC support, we received the following response that definitively explains the ATS heartbeat issue witnessed in this environment:

“The F1 command is very time sensitive. If there is any delays in getting it processed, including getting it to the array in the first place, a miscompare will occur. A delay of a few milliseconds can be enough to throw us off.

So with ESXi 5.5 U2, VMware offloaded the datastore heartbeat functionality to the F1 command. This is so that we do the validation, freeing up resources on the host. This makes the heartbeat now incredibly time sensitive. Am awaiting further updates from Engineering but the workaround at the moment is to disable the heartbeat functionality that was added to ESXi 5.5 U2 essentially reverting back to pre 5.5 U2 functionality.”

The fix:

On 04/30/2015 the decision was made to disable the ATS heartbeat functionality by executing the following command (taken from VMware KB 2113956: Enabling or Disabling VAAI ATS heartbeat) on all pre-production ESXi 5.5u2 hosts:

esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5

This was rolled out quickly in our environment via a simple ‘expect’ script.  It is a hot change and we did not encounter any issues during or after application.  Note that we already have SSH enabled to our ESXi hosts via Host Profiles…

#!/usr/bin/expect

# Usage: ./update_hosts.sh (hostname) (username) (password) (commands)

set timeout 60
spawn ssh [lindex $argv 1]@[lindex $argv 0]

# Accept the SSH host key on first connect (if prompted), then answer the password prompt
expect "yes/no" {
    send "yes\r"
    expect "*?assword" { send "[lindex $argv 2]\r" }
    } "*?assword" { send "[lindex $argv 2]\r" }

# Wait for the shell prompt, run the supplied command(s), then log out
expect "# " { send "[lindex $argv 3]\r\r" }
expect "# " { send "exit\r" }

interact

This script was executed in a for loop as follows:

for i in $(seq -f "%03.f" 1 12); do ./update_hosts.sh aacompute01-esxi$i.aa1.pqe root "password" "esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5"; done

Note that the hostnames in this environment are sequential (ex: aacompute01-esxi001.aa1.pqe through aacompute01-esxi012.aa1.pqe).  You will need to modify the loop to fit your environment.

Once the fix was put into place, the number of datastore disconnects in the pre-production environment went to zero, as witnessed via our log aggregation / search service, SumoLogic:

[SumoLogic screenshot: datastore disconnect events dropping to zero after the fix was applied]

Not only did the number of datastore disconnects drop to zero, but, perhaps most importantly, we have not received any complaints about VM performance or VM “pauses” since reverting the heartbeat behavior.  Now that this change has had a period to “bake” with no side effects in pre-production, we have rolled it out to our production ESXi farm today (05/06/2015).  As with pre-production, we did not encounter any issues.
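
Independent of SumoLogic, a quick spot check directly on a host is simply to count the miscompare messages in the active vmkernel log.  A hypothetical one-liner (log location per the standard ESXi 5.5 layout) would be:

# Count ATS Miscompare events in the active vmkernel log on an ESXi host
grep -c "ATS Miscompare detected" /var/log/vmkernel.log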

A new problem:

Unfortunately, this fix is not a perfect “silver bullet” for us.

Our vSphere environment is rather large and has been standardized on using Auto Deploy with stateless Host Profiles.  This has worked out great for us in the past, as upgrades / patching are a breeze.  This fix brought about some trouble in paradise…

When using Host Profiles, a systems administrator typically only needs to configure a single ESXi host, then either create a new profile from that reference host or update an existing profile via the “Copy settings from host” function.  After some pretty in-depth testing and manual hacking, it turns out that the “VMFS3.UseATSForHBOnVMFS5” setting CANNOT be set from a Host Profile.  The setting appears in /etc/vmware/esx.conf on the ESXi hosts after the fix has been manually applied, but it is not captured or honored at all via Host Profiles.
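
As a side note, confirming that the value landed in the host’s local configuration after a manual run is as simple as grepping for the option name (shown purely as an illustration; the exact key path inside esx.conf may vary):

# Look for the ATS heartbeat option in the host's local configuration file
grep UseATSForHBOnVMFS5 /etc/vmware/esx.conf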

This means that we now have to remember to re-run the fix every time an ESXi host reboots in our environment.  I will be working on a “watchdog” PowerCLI script or perhaps even a vCO workflow that will apply the fix if a host reboot is detected.  I will share that code upon completion.
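
Until that watchdog exists, a crude stopgap in the same vein as the expect helper above could be run from cron: check each host’s current value and re-apply the fix only where it has reverted.  This is only a sketch, not the promised PowerCLI/vCO solution; the hostnames and credentials are the pre-production examples from earlier, and the grep pattern assumes the usual esxcli output format.

#!/bin/sh
# Hypothetical stopgap until the PowerCLI/vCO watchdog is written: re-apply the
# fix on any host whose value has reverted to 1 (e.g. after an Auto Deploy reboot).
for i in $(seq -f "%03.f" 1 12); do
  host="aacompute01-esxi$i.aa1.pqe"
  # The esxcli 'list' output is expected to contain a line like "   Int Value: 1";
  # adjust the pattern if your output differs.
  if ./update_hosts.sh "$host" root "password" \
       "esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5" \
     | grep -Eq '^[[:space:]]*Int Value: 1'; then
    echo "$host has reverted to ATS heartbeating; re-applying fix"
    ./update_hosts.sh "$host" root "password" \
      "esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5"
  fi
done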

Fortunately, it looks as if this is a known issue to the Host Profiles product team and they are looking into it (kudos to the killer community support on Twitter!):

[Twitter screenshot: the Host Profiles product team acknowledging the issue]

Beware of bad support advice:

I am typically not a naysayer; however, I do feel the need to pass on this vital information so that others do not blindly follow recommendations from a vendor’s support system without proper vetting.

We had asked VMware Support how to permanently implement this setting.  The technician we were working with came back with the following recommendation:

“Your VMFS file systems were created ATS-only, it can’t fall back to SCSI reservation locking unless disable ATS in each logical unit:

1) Power off all VMs on the VMFS volume in question

2) vmkfstools --configATSOnly 0 /vmfs/devices/disks/device-ID:Partition

for example:
vmkfstools --configATSOnly 0 /vmfs/devices/disks/naa.60000970000192604666533032333836:1
Note: It is sufficient to run this command on one of the hosts sharing the VMFS5 datastore. Other hosts automatically recognize the change.

After disabling ATSOnly, run this command rescan for datastores:
esxcli storage filesystem rescan”

This support technician did not recognize the difference between disabling ATS heartbeats only vs. disabling ATS altogether.  Remember, trust but verify.  Blindly applying this recommendation across our environment would have been a big mistake!

Giving credit where it is due:

I like to give credit where credit is due.  As such, I am sending a big THANK YOU to Adam Larsen, my colleague, for heading up our support cases with EMC and VMware regarding this issue, as well as putting up with my many gripes along the way.  Additionally, I want to thank Ryan Wassa for discovering the frequency of this issue and correlating it with our update to ESXi 5.5u2.  EMC support was also very helpful and got us pointed in the right direction very quickly.

08/24/2015 Update:

The following is the official statement from EMC regarding this issue:

EMC KB https://support.emc.com/kb/200936 is the standpoint of EMC engineering at this time.

The reason for the above statement from Engineering is threefold:

  1. We cannot control the F1 payload that is sent from the host. We control the comparing process, but not the actual data to be compared with. As such, if the host sends us the wrong information, then we will generate a miscompare that must be reported to the host.
  2. We cannot control how the host reacts to a received miscompare. That is decided by the OS vendor.
  3. We cannot control the workload of the array at any given time. Workload is driven by the customer’s hosts, and can fluctuate at any given time. As such, when an F1 command comes in, it may be placed in the queue to be handled later if there are no resources to handle it at the moment. This will elongate the time it takes to process the I/O and provide a response. Cache slot resources and the size of the queue, both on the FA and the device, are all workload driven and normal behavior. An F1 command will be treated just like any other I/O in this regard and can be placed at the back of the queue. Reports from VMware state that the command is very time sensitive, and that even minor delays can cause the compare data to no longer be valid, generating miscompares.

This is a very unfortunate situation for all of us.  It appears that VMware has decided to rely on a feature in the backend storage that is not guaranteed to respond in a timely fashion.  With no way to tune the required response time, we may be forced to keep the storage heartbeats reverted to the older method indefinitely.

In other news, VMware has issued a statement indicating that they will release a fix that will allow setting the “VMFS3.UseATSForHBOnVMFS5” value via a Host Profile.  The timing or release date of the fix is yet to be determined.