In VMware ESXi, an All Paths Down (APD) or Permanent Device Loss (PDL) condition occurs on an ESX/ESXi host when a storage device is removed from the host in an uncontrolled manner (or the device fails) and the VMkernel core storage stack does not know how long the loss of device access will last. A typical cause of APD is a Fibre Channel switch failure or, in the case of an iSCSI array, a network connectivity issue. There are also other scenarios, which we will discuss later.
VMware vSphere cannot handle a storage-related failure (not even with VMware HA), so it is really important to guarantee the redundancy of your storage and to use a multipathing solution in order to build a resilient storage infrastructure, following the best practices suggested by VMware and the storage vendor (see also this note).
The APD/PDL condition can be:
- transient: the device or switch might come back
- permanent: the device might never come back
In the past (until version 5.0), I/O was queued indefinitely, which resulted in I/Os to the device hanging (never acknowledged). This is now handled a little differently, but (until vSphere 5.5) some issues still remain.
This condition became particularly problematic when someone issued a storage rescan from a host or cluster (typically the first thing most people try in this case): the rescan operation caused hostd to block while waiting for a response from the devices, which never came. Because hostd was blocked waiting on these responses, other services could not use it, including the vpx agent (vpxa), which is responsible for communication between the host and vCenter. The end result was that the host became disconnected from vCenter, and the only solution was to reboot the host. With vSphere 5.0 and 5.1 the host remains connected, but serious issues can still appear (for example, VMs and datastores remain greyed out, and unmounting and remounting the datastore may not always work).
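The blocking behaviour described above can be illustrated with a toy model. This is only a sketch in Python and says nothing about how hostd is actually implemented: the names rescan_device and agent_is_responsive are hypothetical, and a thread waiting on an event stands in for a management agent waiting on a device that never answers.

```python
import threading
from typing import Optional


def rescan_device(device_responded: threading.Event,
                  timeout_s: Optional[float]) -> bool:
    """Wait for the device to answer; timeout_s=None means wait forever."""
    return device_responded.wait(timeout_s)


def agent_is_responsive(timeout_s: Optional[float],
                        probe_after_s: float = 0.5) -> bool:
    """Run one rescan against a device that never answers, then report
    whether the worker thread is free again after probe_after_s seconds."""
    dead_device = threading.Event()  # never set: the device is gone
    worker = threading.Thread(
        target=rescan_device,
        args=(dead_device, timeout_s),
        daemon=True,  # lets the process exit even if the worker is stuck
    )
    worker.start()
    worker.join(probe_after_s)
    return not worker.is_alive()


if __name__ == "__main__":
    print("unbounded wait, agent responsive:", agent_is_responsive(None))
    print("bounded wait, agent responsive:", agent_is_responsive(0.1))
```

With an unbounded wait the worker never returns, so the first call reports False; with a bounded wait the worker gives up and the second call reports True. In the model, that is the difference between an agent that hangs on a rescan and one that stays reachable.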
One scenario where this issue becomes critical is scale-out storage that provides synchronous replication across two storage members but does not provide automatic failover when a member fails (this is, for example, the case of an EqualLogic group with the SyncRep function). In this case the simplest solution for an operator is to reboot the hosts (after the manual storage failover of the volumes) so that the datastores are remounted automatically.
To make this simpler, it is possible to make the following two changes:
- Set disk.terminateVMOnPDLDefault="True" on each host in the /etc/vmware/settings file (note that in 5.5 this is available in the advanced settings)
- Set das.maskCleanShutdownEnabled = "True" in the HA Advanced Settings of the VMware cluster
With these changes it is possible to use the cluster storage rescan feature to bring the datastores and the VMs back up.
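The host-side change above has to be repeated on every host, so it helps if the edit is idempotent. The following Python sketch shows one way to do that on the text of a settings file. It assumes the file is made of plain key=value lines, which should be checked on your ESXi version, and set_setting is a hypothetical helper, not a VMware tool. The exact value format ("True", with or without quotes) should also follow the VMware KB for your release.

```python
def set_setting(settings_text: str, key: str, value: str) -> str:
    """Return settings_text with `key=value` present exactly once.

    An existing entry for the key is replaced (keys are compared
    case-insensitively, an assumption made to avoid duplicates);
    otherwise the entry is appended. All other lines are kept as-is.
    """
    new_line = f"{key}={value}"
    result = []
    found = False
    for line in settings_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if name.lower() == key.lower():
            if not found:
                result.append(new_line)
                found = True
            # any further duplicates of the key are dropped
        else:
            result.append(line)
    if not found:
        result.append(new_line)
    return "\n".join(result) + "\n"


if __name__ == "__main__":
    # Hypothetical existing content, for illustration only.
    before = "some.other.option=1\n"
    after = set_setting(before, "disk.terminateVMOnPDLDefault", "True")
    print(after, end="")
```

Running the helper a second time on its own output leaves the text unchanged, so it is safe to apply repeatedly across the hosts of a cluster.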
PDL AutoRemove in vSphere 5.5 automatically removes a device in the PDL state from the host, but this can be an issue in a stretched cluster (vMSC) environment (see this post: Disable “Disk.AutoremoveOnPDL” in a vMSC environment!). A PDL state on a device implies that the device is gone and cannot accept any more I/O, yet it still needlessly counts against the limit of 256 devices per host. PDL AutoRemove removes the device from the ESXi host's perspective.
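As a sketch of how the Disk.AutoremoveOnPDL setting could be toggled in a script, the Python helper below builds the esxcli command line. It assumes the usual esxcli advanced-settings syntax (option path /Disk/AutoremoveOnPDL, integer value 0 or 1), so verify it against the documentation for your ESXi build before use. The function name autoremove_on_pdl_command is hypothetical, and the helper only builds the command; it does not run it.

```python
from typing import List


def autoremove_on_pdl_command(enabled: bool) -> List[str]:
    """Build the esxcli argument list for Disk.AutoremoveOnPDL.

    Assumption: 1 enables PDL AutoRemove and 0 disables it.
    """
    value = "1" if enabled else "0"
    return [
        "esxcli", "system", "settings", "advanced", "set",
        "-o", "/Disk/AutoremoveOnPDL",
        "-i", value,
    ]


if __name__ == "__main__":
    # Print the command that would disable PDL AutoRemove on a host.
    print(" ".join(autoremove_on_pdl_command(False)))
```

Keeping the command as a list of arguments makes it easy to pass to subprocess.run on a system where esxcli is available, or to log it for review before applying it to each host.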
For more information see also:
- vSphere 5.0 Storage Features Part 8 – Handling the All Paths Down (APD) condition
- KB 2004684 – Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x