In October 2017, I wrote a post about a possible issue with vSphere 6.5 and 10 Gbps NICs (mostly standard on new deployments). The end result was a PSOD (Purple Screen of Death), and no solution was available at the time.

VMware KB 2151749 describes this issue as possibly related to an upgrade to vSphere 6.5, but other customers have also reported it on new deployments.

Veeam, one of the first vendors to identify this issue (through its customers), reports that it is triggered randomly by network-intensive activities such as backup over NBD or vMotion. It seems, however, that the bug can be triggered not only by high network load, but also by creating VMkernel ports.

The PSOD is something like this:

2017-09-16T15:34:30.908Z cpu6:65645)@BlueScreen: #PF Exception 14 in world 65645:HELPER_UPLIN IP 0x41802c496258 addr 0x0
PTEs:0x292379a027;0x2efe54c027;0xbfffffffff001;
2017-09-16T15:34:30.908Z cpu6:65645)Code start: 0x41802c200000 VMK uptime: 4:02:26:10.151
2017-09-16T15:34:30.908Z cpu6:65645)0x4390c369bd00:[0x41802c496258]UplinkTreePackQueueFilters@vmkernel#nover+0x188 stack: 0xe15427000
2017-09-16T15:34:30.909Z cpu6:65645)0x4390c369bd90:[0x41802c49e142]UplinkLB_LoadBalanceCB@vmkernel#nover+0x1e42 stack: 0x1
2017-09-16T15:34:30.909Z cpu6:65645)0x4390c369bf20:[0x41802c4916f2]UplinkAsyncProcessCallsHelperCB@vmkernel#nover+0x116 stack: 0x43048761eac0
2017-09-16T15:34:30.910Z cpu6:65645)0x4390c369bf50:[0x41802c2c9e0d]helpFunc@vmkernel#nover+0x3c5 stack: 0x4300b9b2a050
2017-09-16T15:34:30.910Z cpu6:65645)0x4390c369bfe0:[0x41802c4c91b5]CpuSched_StartWorld@vmkernel#nover+0x99 stack: 0x0
2017-09-16T15:34:30.913Z cpu6:65645)base fs=0x0 gs=0x418041800000 Kgs=0x0
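If you need to check whether a crash you collected matches this signature, a minimal Python sketch could look like the following. It only assumes the frame layout visible in the backtrace above; the function names `backtrace_symbols` and `looks_like_netqueue_psod` are hypothetical helpers, not part of any VMware tool, and the match is a heuristic rather than an official diagnostic.

```python
import re

# Matches backtrace frames such as:
#   [0x41802c496258]UplinkTreePackQueueFilters@vmkernel#nover+0x188
FRAME_RE = re.compile(r"\[0x[0-9a-fA-F]+\](\w+)@(\S+?)\+0x[0-9a-fA-F]+")

# Two lines copied from the PSOD shown in this post.
SAMPLE = (
    "2017-09-16T15:34:30.908Z cpu6:65645)@BlueScreen: #PF Exception 14 in world "
    "65645:HELPER_UPLIN IP 0x41802c496258 addr 0x0\n"
    "2017-09-16T15:34:30.908Z cpu6:65645)0x4390c369bd00:[0x41802c496258]"
    "UplinkTreePackQueueFilters@vmkernel#nover+0x188 stack: 0xe15427000\n"
)


def backtrace_symbols(log_text):
    """Return the function names found in backtrace frames, top frame first."""
    return [match.group(1) for match in FRAME_RE.finditer(log_text)]


def looks_like_netqueue_psod(log_text):
    """Heuristic: a page fault whose top frame is UplinkTreePackQueueFilters."""
    symbols = backtrace_symbols(log_text)
    return (
        "#PF Exception 14" in log_text
        and bool(symbols)
        and symbols[0] == "UplinkTreePackQueueFilters"
    )


if __name__ == "__main__":
    print(backtrace_symbols(SAMPLE))
    print(looks_like_netqueue_psod(SAMPLE))
```

A match only says that the backtrace resembles the one above; the KB article remains the reference for confirming the issue.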

In my case, I never hit this issue, but it was certainly a critical obstacle to the wide adoption of vSphere 6.5.

VMware has finally found the root cause: the NetQueue commit phase stops abruptly when the hardware activation of an Rx queue fails. As a result, the internal data structures of the NetQueue layer can go out of sync, causing a host PSOD.

The VMware patch ESXi 6.5 P02 (ESXi-6.5.0-20171204001-standard), available from VMware Downloads, fixes this issue (and several others). For more information, see KB 2151112. With this patch, your ESXi build becomes 7388607, or 7273056 for the security-only image.
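To compare a host against these build numbers, here is a small Python sketch that parses a version banner in the form printed by `vmware -v` on the ESXi shell (for example `VMware ESXi 6.5.0 build-7388607`). The helper names and the status labels are hypothetical. The sketch assumes that later full 6.5 patch builds are cumulative; it does not try to decide whether the security-only image contains the NetQueue fix, it only flags that build so you can check the KB.

```python
import re

P02_BUILD = 7388607            # full P02 image, as stated in this post
SECURITY_ONLY_BUILD = 7273056  # security-only image, as stated in this post


def parse_banner(banner):
    """Extract (major.minor, build number) from a `vmware -v` style banner."""
    match = re.search(r"ESXi (\d+\.\d+)\.\d+ build-(\d+)", banner)
    if match is None:
        raise ValueError(f"unrecognized version banner: {banner!r}")
    return match.group(1), int(match.group(2))


def p02_status(banner):
    """Classify a host relative to the ESXi 6.5 P02 build numbers."""
    version, build = parse_banner(banner)
    if version != "6.5":
        return "not-6.5"
    if build >= P02_BUILD:
        # Assumption: later full 6.5 builds include the P02 fixes.
        return "at-or-above-p02"
    if build == SECURITY_ONLY_BUILD:
        return "security-only"
    return "below-p02"


if __name__ == "__main__":
    print(p02_status("VMware ESXi 6.5.0 build-7388607"))
    print(p02_status("VMware ESXi 6.5.0 build-5969303"))
```

The banner string could be collected by hand or through whatever inventory tooling you already use; the sketch deliberately does not connect to a host.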

Virtualization, Cloud and Storage Architect. Tech Field delegate. VMUG IT Co-Founder and board member. VMware VMTN Moderator and vExpert 2010-24. Dell TechCenter Rockstar 2014-15. Microsoft MVP 2014-16. Veeam Vanguard 2015-23. Nutanix NTC 2014-20. Several certifications including: VCDX-DCV, VCP-DCV/DT/Cloud, VCAP-DCA/DCD/CIA/CID/DTA/DTD, MCSA, MCSE, MCITP, CCA, NPP.