This article was written for the StarWind blog and focuses on the design and implementation of a stretched cluster.

A stretched cluster, sometimes called a metro-cluster (or in some cases also a campus cluster), is a deployment model in which two or more hosts belong to the same logical cluster but are located in distinct geographical locations, usually two different sites. To be part of the same cluster, the storage must be reachable and shared across both sites.

Stretched clusters are usually used to provide load balancing and high availability (HA) capabilities and to build active-active sites, in contrast with the active-passive sites typically used for disaster recovery.

vSphere Metro Storage Cluster (vMSC) is simply a configuration option for a vSphere cluster, where part of the virtualization hosts are located in one site and the rest in a second site. Both sites work in active-active mode and the usual vSphere features, such as vMotion or vSphere HA, can be used.

In the case of a planned migration, such as disaster avoidance or data center consolidation, the use of stretched storage enables application mobility with the zero downtime typical of live migration in virtualization clusters.

In the case of a disaster in one site, vSphere HA will ensure that the virtual machines are restarted on the other site.

In the rest of the article I will consider, for simplicity, but also to give practical and real-world examples, the case of a VMware stretched cluster; however, many concepts can also be extended or adapted to other stretched clusters.

Requirements and limitations

There are some important technical constraints related to the live migration of virtual machines.

In the case of an implementation with VMware vSphere, the following specific requirements must be met before considering the deployment of a stretched cluster:

  • Because vMotion will have a higher latency than normal (in the case of a geographical transfer), the vSphere Enterprise Plus edition is required (although this requirement is no longer stated explicitly in vSphere 6.x).
  • The network used for ESXi vSphere vMotion must have a minimum bandwidth of 250 Mbps.
  • The network used for ESXi vSphere vMotion must have a maximum latency no higher than 10 ms round-trip time (RTT).
    Note that vSphere vMotion supports up to 150 ms of latency (starting with vSphere 6.0), but this does not apply to the stretched clustering case (a quick pre-flight check of these thresholds is sketched right after this list).
  • The VM networks must match between the two sites as stretched L2 networks (same broadcast domain, or using L2 VPN solutions or network virtualization solutions such as NSX).
    Note that both the ESXi management and the vMotion interfaces can work over L3 networks, and using L3 is advisable in the stretched cluster case.
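
As a quick illustration of those network thresholds, here is a minimal Python sketch of a pre-flight check; it is a hypothetical helper (not part of any VMware tooling) and the measured values are placeholders to be replaced with real measurements (for example from iperf for bandwidth and ping for RTT).

```python
# Hypothetical pre-flight check for the vMotion network requirements listed above.
# The measured values passed in are placeholders for real measurements.

VMOTION_MIN_BANDWIDTH_MBPS = 250   # minimum bandwidth for the vMotion network
VMOTION_MAX_RTT_MS = 10            # maximum round-trip time for vMotion in a vMSC

def check_vmotion_link(measured_mbps, measured_rtt_ms):
    """Return a list of human-readable violations (an empty list means the requirements are met)."""
    issues = []
    if measured_mbps < VMOTION_MIN_BANDWIDTH_MBPS:
        issues.append(f"bandwidth {measured_mbps} Mbps is below {VMOTION_MIN_BANDWIDTH_MBPS} Mbps")
    if measured_rtt_ms > VMOTION_MAX_RTT_MS:
        issues.append(f"RTT {measured_rtt_ms} ms exceeds {VMOTION_MAX_RTT_MS} ms")
    return issues

# Example with made-up measurements for a metro dark-fiber link
for issue in check_vmotion_link(measured_mbps=1000, measured_rtt_ms=2):
    print("WARNING:", issue)
```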

For the storage part, the following is also required:

  • The storage must be certified for VMware vMSC.
  • Fibre Channel, iSCSI, NFS, or FCoE protocols must be used. vSAN is also supported.
  • The maximum latency for synchronous storage replication must not exceed 10 ms RTT.

The storage requirements can, of course, be slightly more complex, depending on the storage vendor, the storage architecture, and the specific storage product, but there are usually both specific VMware KB articles and storage vendor documents that can help clarify the real requirements and limitations. A vSphere Metro Storage Cluster in fact requires a single abstracted storage subsystem that spans both sites and allows ESXi to access datastores from both arrays at both sites, all in a transparent way and without any impact on ongoing storage activities, with potentially active-active read/write from both sites at the same time.

Uniform vs. non-uniform

The reality of storage arrays is slightly different and, in fact, VMware vMSC provides two different models for storage access:

  • Uniform host access configuration
  • Nonuniform host access configuration

In the uniform model, all ESXi hosts in both sites are connected to the storage and see the same paths and the same datastores in active-active mode. Some of the paths cross the inter-site link and are therefore theoretically less performant and affected by higher latency.

This model is widely used for its simplicity of design and implementation, but also because it handles the failure of a local storage array better.

It works very well for stretched storage that is truly active-active, but for arrays that use preferential access from one site a minimum of tuning is required. In particular, “site affinity” rules must be created for the VMs that have to stay on their preferred datastores, in order to reduce cross-site traffic.

In the non-uniform model, the ESXi hosts of each site are connected ONLY to the storage of their own site, and the paths are therefore exclusively local. Each host has read and write access to its own storage (which is then replicated to the storage of the other site).

This model has the advantage of implicit “site affinity” rules, dictated by the architecture and by the “LUN locality” of each site. Another advantage is that, in the case of a problem on the cross-site link, storage locality and access are preserved. It can also lend itself to asynchronous replication between the arrays. On the other hand, if the storage of one site fails, the VMs will have to be restarted on the other site, with a consequent temporary service interruption.

vSphere Metro Storage Cluster supports both models, and the choice of which model to use must be properly evaluated based on the type of storage, the type of replication, and the user requirements.

Note that a vSAN stretched cluster only uses a uniform-mode configuration, although it is still possible to manage the concepts of data locality and data affinity.

Synchronous vs. Asynchronous

Normally a stretched cluster uses synchronous replication between the storage systems of the two sites, given that there must be at least two storage systems anyway. This choice is dictated by the need to guarantee maximum data consistency with an RPO of zero.

With synchronous replication, the write sequence follows this model:

  1. The application or server sends a write request to the source.
  2. The write I/O is mirrored to the destination.
  3. The mirrored write I/O is committed to the destination.
  4. The write commit at the destination is acknowledged back to the source.
  5. The write I/O is committed to the source.
  6. Finally, the write acknowledgment is sent to the application or server.

The process is repeated for each write I/O requested by the application or server.

Note that, formally, if the write I/O cannot be committed at the source or at the destination, the write will not be committed at either location, to ensure consistency. This means that, in the case of a complete link failure across sites, with strictly synchronous replication both storage arrays are blocked!

But there are also some specific products that can run synchronous replication and simply pause the replication in case of a storage communication interruption. This provides better storage availability, but it can potentially imply data misalignment if both sites continue with write operations.
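
To make the synchronous write path more concrete, here is a minimal Python sketch that models the six steps above as a toy in-memory volume; the blocking behaviour on a cross-site link failure is the strictly synchronous one just described, and any real product will of course differ in the details.

```python
class ReplicationLinkDown(Exception):
    """Raised when the cross-site replication link is unavailable."""

class SyncReplicatedVolume:
    """Toy model of a strictly synchronous replicated volume (RPO = 0)."""

    def __init__(self):
        self.source = {}        # blocks committed on the source array
        self.destination = {}   # blocks committed on the destination array
        self.link_up = True     # state of the cross-site replication link

    def write(self, block, data):
        # 1. the application sends the write to the source
        if not self.link_up:
            # strictly synchronous: without the mirror, nothing is committed anywhere
            raise ReplicationLinkDown("write rejected: cannot mirror to destination")
        # 2-4. the write is mirrored, committed at the destination and acknowledged back
        self.destination[block] = data
        # 5. the write is committed at the source
        self.source[block] = data
        # 6. the acknowledgment is finally returned to the application
        return "ack"

vol = SyncReplicatedVolume()
print(vol.write(1, b"data"))    # both copies are identical after every acknowledged write
vol.link_up = False
try:
    vol.write(2, b"more data")
except ReplicationLinkDown as err:
    print("I/O blocked on both sites:", err)
```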

Asynchronous replication tries to accomplish a similar data protection goal, but with a non-zero RPO. Usually, a frequency is defined to determine how often data is replicated.

The write I/O sequence for asynchronous replication is as follows:

  1. The application or server sends a write request to the source volume.
  2. The write I/O is committed to the source volume.
  3. Finally, the write acknowledgment is sent to the application or server.

The process is repeated for each write I/O requested by the application or server. Then, on the replication schedule:

  1. Periodically, a batch of write I/Os that have already been committed to the source volume are transferred to the destination volume.
  2. The write I/Os are committed to the destination volume.
  3. A batch acknowledgment is sent to the source.
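
A companion sketch in the same toy style shows the asynchronous variant: writes are acknowledged after the local commit only, and a periodic batch flush replicates them later, which is exactly where the non-zero RPO comes from.

```python
class AsyncReplicatedVolume:
    """Toy model of schedule-based asynchronous replication (RPO > 0)."""

    def __init__(self):
        self.source = {}        # blocks committed on the source array
        self.destination = {}   # blocks committed on the destination array
        self.pending = {}       # committed locally but not yet replicated

    def write(self, block, data):
        # 1-2. the write is committed to the source volume only
        self.source[block] = data
        self.pending[block] = data
        # 3. the acknowledgment is sent immediately to the application
        return "ack"

    def replicate_batch(self):
        # periodic job: transfer the batch of already committed writes to the destination
        self.destination.update(self.pending)
        self.pending.clear()    # the destination has acknowledged the batch

    def exposure(self):
        """Blocks that would be lost if the source site failed right now (the RPO exposure)."""
        return list(self.pending)

vol = AsyncReplicatedVolume()
vol.write(1, b"data")
vol.write(2, b"more data")
print(vol.exposure())       # [1, 2] -> data at risk until the next scheduled replication
vol.replicate_batch()
print(vol.exposure())       # [] -> the two sites are aligned again
```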

There can be some variations, like semi-synchronous replication or asynchronous replication not based on schedules but simply on snapshots or network best effort, in order to reach a close-to-zero RPO.

The type of replication depends on the storage product and on the replication recommended for a stretched cluster, but also on the type of architecture: in a non-uniform model, asynchronous replication can be usable (if the non-zero RPO is acceptable); in a fully uniform, active-active cluster, synchronous replication should be used.

Of course, distance can impact the type of replication (asynchronous supports longer distances and removes the storage latency requirement), as can the type of availability required.

Stretched storage “only”

A single stretched cluster is one deployment option and requires, of course, stretched storage. But it is potentially possible to have only stretched storage and two different clusters, one in each site, and use cross-vCenter vMotion to move VMs live across sites if needed.

Site Recovery Manager supports this kind of deployment and can use the stretched storage to reduce recovery times: in the case of a disaster, recovery is much faster due to the nature of the stretched storage architecture, which enables synchronous data writes and reads on both sites.

When using stretched storage, Site Recovery Manager can orchestrate cross-vCenter vMotion operations at scale, using recovery plans. This is what enables application mobility without incurring any downtime.

Note that the SRM model for active-active data centers is fundamentally different from the model used in the VMware vSphere Metro Storage Cluster (VMSC) program. The SRM model uses two vCenter Server instances, one on each site, instead of stretching the vSphere cluster across sites.

Design and configuration aspects

Split-brain avoidance

Split brain is the condition where two arrays might serve I/O to the same volume without keeping the data in sync between the two sites. Any active-active synchronous replication solution designed to provide continuous availability across two different sites requires a component, referred to as a witness or voter, to mediate failovers while preventing split brain.

Depending on the storage solution, there are different approaches to this specific problem.

Resiliency and availability

Of course, the overall infrastructure must provide better resiliency and availability compared to a single site. By default, a stretched cluster provides at least +1 redundancy (at the site level), but more can be provided with a proper design.

This must start from the storage layer, where you must be able to tolerate the total failure of the storage in one site without service interruption (in uniform access) or with a minimal service interruption (in non-uniform access). But site resiliency is just one part: what about local resiliency for the storage? That means redundant arrays and local data redundancy for external storage. For hyper-converged solutions it means local data redundancy with a recommended minimum of 3 nodes per site (or 5 if erasure coding is used). Also keep maintenance windows and activities in mind (for this reason the number of nodes has been increased by one).

Note that vSAN 6.6 provides new features for a secondary level of failures to tolerate, specific to stretched cluster configurations.

VMware vSphere HA

VMware recommends enabling vSphere HA admission control in all clusters, and especially in a stretched cluster. Workload availability is the primary driver for most stretched cluster environments, so it can be crucial to provide sufficient capacity for a full site failure. To ensure that all workloads can be restarted by vSphere HA on just one site, configuring the admission control policy to 50 percent for both memory and CPU is recommended. VMware recommends using a percentage-based policy because it offers the most flexibility and reduces operational overhead.
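
As an illustration of this recommendation, the following pyVmomi sketch sets a percentage-based admission control policy of 50 percent CPU and memory on an existing cluster; the vCenter address, credentials and cluster name are placeholders, and the managed-object properties should be verified against your vSphere version.

```python
# Sketch: configure percentage-based vSphere HA admission control (50% CPU / 50% memory)
# with pyVmomi. The vCenter host, credentials and cluster name are placeholders.
from pyVim.connect import SmartConnectNoSSL, Disconnect
from pyVmomi import vim

si = SmartConnectNoSSL(host="vcenter.lab.local", user="administrator@vsphere.local", pwd="***")
content = si.RetrieveContent()

# Find the stretched cluster by name (simple walk of the inventory)
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Stretched-Cluster")
view.DestroyView()

policy = vim.cluster.FailoverResourcesAdmissionControlPolicy(
    cpuFailoverResourcesPercent=50,
    memoryFailoverResourcesPercent=50,
)
das = vim.cluster.DasConfigInfo(admissionControlEnabled=True, admissionControlPolicy=policy)
spec = vim.cluster.ConfigSpecEx(dasConfig=das)

task = cluster.ReconfigureComputeResource_Task(spec, modify=True)
# ... wait for the task to complete, then disconnect
Disconnect(si)
```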

With vSphere 6.0, some enhancements have been introduced to enable an automated failover of VMs residing on a datastore that has either an all paths down (APD) or a permanent device loss (PDL) condition. Those can be useful in non-uniform models during a failure scenario to ensure that the ESXi host takes appropriate action when access to a LUN is revoked. To enable vSphere HA to respond to both an APD and a PDL condition, vSphere HA must be configured in a specific way. VMware recommends enabling VM Component Protection (VMCP).

The typical configuration for PDL events is Power off and restart VMs. For APD events, VMware recommends selecting Power off and restart VMs (conservative). But of course, refer to specific storage vendor requirements.
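
One possible way to apply these VMCP defaults programmatically is sketched below with pyVmomi; it assumes the cluster object retrieved in the previous sketch and the vSphere 6.x VM Component Protection properties, so double-check the exact names and values for your version before reusing anything similar.

```python
# Sketch: set the cluster-wide VMCP defaults described above (PDL: power off and
# restart VMs; APD: conservative restart). Assumes "cluster" was looked up as in
# the previous admission control example.
from pyVmomi import vim

vmcp = vim.cluster.VmComponentProtectionSettings(
    vmStorageProtectionForPDL="restartAggressive",     # "Power off and restart VMs" for PDL
    vmStorageProtectionForAPD="restartConservative",   # conservative restart for APD
    enableAPDTimeoutForHosts=True,
)
das = vim.cluster.DasConfigInfo(
    vmComponentProtecting="enabled",                   # turn VMCP on for the cluster
    defaultVmSettings=vim.cluster.DasVmSettings(vmComponentProtectionSettings=vmcp),
)
spec = vim.cluster.ConfigSpecEx(dasConfig=das)
cluster.ReconfigureComputeResource_Task(spec, modify=True)
```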

Before vSphere 6.0 those cases were managed by specific ESXi advanced settings like Disk.terminateVMOnPDLDefault, VMkernel.Boot.terminateVMOnPDL and Disk.AutoremoveOnPDL (introduced in vSphere 5.5).

vSphere HA uses heartbeat mechanisms to validate the state of a host. There are two such mechanisms: network heartbeating and datastore heartbeating. Network heartbeating is the primary mechanism for vSphere HA to validate the availability of the hosts. Datastore heartbeating is the secondary mechanism used by vSphere HA; it determines the exact state of the host after network heartbeating has failed.

For network heartbeat, if a host is not receiving any heartbeats, it uses a fail-safe mechanism to detect whether it is merely isolated from its master node or completely isolated from the network. It does this by pinging the default gateway. In addition to this mechanism, one or more isolation addresses can be specified manually to enhance the reliability of isolation validation. VMware recommends specifying a minimum of two additional isolation addresses, each of them site-local. This enables vSphere HA to validate complete network isolation, even in the case of a connection failure between sites.

For storage heartbeat, the minimum number of heartbeat datastores is two and the maximum is five. For vSphere HA datastore heartbeating to function correctly in any type of failure scenario, VMware recommends increasing the number of heartbeat datastores from two to four in a stretched cluster environment. This provides full redundancy for both data center locations. Defining four specific datastores as preferred heartbeat datastores is also recommended, selecting two from one site and two from the other. This enables vSphere HA to heartbeat to a datastore even in the case of a connection failure between sites. Subsequently, it enables vSphere HA to determine the state of a host in any scenario. VMware recommends selecting two datastores in each location to ensure that datastores are available at each site in the case of a site partition. Adding an advanced setting called das.heartbeatDsPerHost can increase the number of heartbeat datastores.
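
The sketch below shows one way to push the related HA advanced options with pyVmomi: two site-local isolation addresses and four heartbeat datastores per host. The IP addresses are placeholders and the cluster object is assumed to have been retrieved as in the earlier sketches.

```python
# Sketch: HA advanced options for a stretched cluster - two site-local isolation
# addresses and four heartbeat datastores per host. IPs are placeholders; "cluster"
# is assumed to be the vim.ClusterComputeResource object retrieved earlier.
from pyVmomi import vim

ha_options = [
    vim.option.OptionValue(key="das.isolationaddress0", value="192.168.10.1"),   # site A address
    vim.option.OptionValue(key="das.isolationaddress1", value="192.168.20.1"),   # site B address
    vim.option.OptionValue(key="das.heartbeatDsPerHost", value="4"),             # two datastores per site
]
das = vim.cluster.DasConfigInfo(option=ha_options)
spec = vim.cluster.ConfigSpecEx(dasConfig=das)
cluster.ReconfigureComputeResource_Task(spec, modify=True)
```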

Data locality

Cross-site bandwidth can be really crucial and critical in a stretched cluster configuration. For this reason, you must “force” access to local data for all VMs in a uniform model (in the non-uniform model, data locality is implicit). Using vSphere DRS and proper path selection is a way to achieve this goal.

For vSAN, in a traditional cluster, a virtual machine’s read operations are distributed across all replica copies of the data in the cluster. In the case of a policy setting of NumberOfFailuresToTolerate=1, which results in two copies of the data, 50% of the reads will come from replica1 and 50% will come from replica2. In a vSAN Stretched Cluster, to ensure that 100% of reads occur in the site the VM resides on, the read locality mechanism was introduced. Read locality overrides the NumberOfFailuresToTolerate=1 policy’s behavior to distribute reads across the two data sites.

Other hyper-converged solutions have specific mechanisms to maximize data locality.

VMware vSphere DRS and storage DRS

To provide VM locality, you can build specific VM-to-host affinity rules. VMware recommends implementing “should” rules because these can be violated by vSphere HA in the case of a full site failure. Note that vSphere DRS communicates these rules to vSphere HA, and they are stored in a “compatibility list” governing allowed startup. If a single host fails, VM-to-host “should” rules are ignored by default.
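
The sketch below shows the general shape of such a “should” rule in pyVmomi: a host group and a VM group for one site, plus a preferential (non-mandatory) VM-to-host rule tying them together. The group names and the name filters used to populate them are hypothetical, and the cluster object is assumed as in the previous sketches.

```python
# Sketch: a DRS "VMs should run on hosts in group" rule for site A.
# Naming patterns are hypothetical; "cluster" is the vim.ClusterComputeResource
# object retrieved in the earlier sketches.
from pyVmomi import vim

site_a_hosts = [h for h in cluster.host if "sitea" in h.name]                     # hosts located in site A
site_a_vms = [v for v in cluster.resourcePool.vm if v.name.startswith("app-a")]   # VMs preferred on site A

group_specs = [
    vim.cluster.GroupSpec(operation="add",
                          info=vim.cluster.HostGroup(name="SiteA-Hosts", host=site_a_hosts)),
    vim.cluster.GroupSpec(operation="add",
                          info=vim.cluster.VmGroup(name="SiteA-VMs", vm=site_a_vms)),
]
rule_spec = vim.cluster.RuleSpec(
    operation="add",
    info=vim.cluster.VmHostRuleInfo(
        name="SiteA-VMs-should-run-on-SiteA-Hosts",
        enabled=True,
        mandatory=False,                      # "should" rule: vSphere HA may violate it on a site failure
        vmGroupName="SiteA-VMs",
        affineHostGroupName="SiteA-Hosts",
    ),
)
spec = vim.cluster.ConfigSpecEx(groupSpec=group_specs, rulesSpec=[rule_spec])
cluster.ReconfigureComputeResource_Task(spec, modify=True)
```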

For vSAN, VMware recommends that DRS is placed in partially automated mode if there is an outage. Customers will continue to be informed about DRS recommendations when the hosts on the recovered site are online, but can now wait until vSAN has fully resynced the virtual machine components. DRS can then be changed back to fully automated mode, which will allow virtual machine migrations to take place to conform to the VM/Host affinity rules.

For Storage DRS (if applicable), this should be configured in manual mode or partially automated. This enables human validation per recommendation and allows recommendations to be applied during off-peak hours. Note that the use of I/O Metric or VMware vSphere Storage I/O Control is not supported in a vMSC configuration, as is described in VMware KB article 2042596.

Multipath selection

For block-based storage, multi-path policies are critical for the stretched cluster.

In a uniform configuration, for example with a Dell SC array using Live Volumes, you must use the Fixed path selection policy, with a local preferred path, to ensure data locality.

In a non-uniform configuration, data locality is implicit, so you can maximize (local) path usage and distribution with Round Robin; both cases are sketched below at the esxcli level.
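
As a hedged, command-level example of both cases, the Python sketch below uses SSH (paramiko) to push the corresponding esxcli NMP settings to a host; the host name, credentials, device identifier and preferred path are hypothetical and must be replaced with the values documented by your storage vendor.

```python
# Sketch: apply the path selection policies discussed above via esxcli over SSH.
# Host, credentials, device ID and path name are placeholders for illustration only.
import paramiko

HOST, USER, PASSWORD = "esxi-01.sitea.lab.local", "root", "***"
DEVICE = "naa.60000000000000000000000000000001"          # hypothetical LUN identifier

# Uniform model: Fixed policy with a preferred (local) path
fixed_cmds = [
    f"esxcli storage nmp device set --device {DEVICE} --psp VMW_PSP_FIXED",
    f"esxcli storage nmp psp fixed deviceconfig set --device {DEVICE} --path vmhba1:C0:T0:L1",
]
# Non-uniform model: Round Robin across the (local-only) paths
rr_cmds = [f"esxcli storage nmp device set --device {DEVICE} --psp VMW_PSP_RR"]

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(HOST, username=USER, password=PASSWORD)
for cmd in fixed_cmds:            # or rr_cmds, depending on the access model
    stdin, stdout, stderr = ssh.exec_command(cmd)
    print(cmd, "->", stdout.read().decode() or stderr.read().decode())
ssh.close()
```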

Be sure to check specific storage vendor best practices or reference architecture.

Myths

Disaster recovery vs. disaster avoidance

Disaster avoidance, as the name implies, is the process of preventing, or significantly reducing the probability, that a disaster will occur (as with human errors); or, if such an event does occur (as with a natural disaster), of ensuring that the effects on the organization’s technology systems are minimized as much as possible.

The idea of disaster avoidance is to provide better “resilience” rather than just good recovery, but to do so you cannot rely only on infrastructure availability solutions, which are mostly limited geographically to a specific site; you also need to look at how to provide better application availability and redundancy in the wake of foreseeable disruption.

Multi-datacenter (or multi-region cloud) replication is one part; the second part is having active-active datacenters or applications spanning multiple sites to provide service availability.

Most of the new cloud-native applications are designed for this scenario. But there are also some examples of traditional applications with high availability concepts at the application level that can also work geographically, such as DNS services, Active Directory Domain Controllers, Exchange DAG, or SQL Always-On clusters. In all those cases one system can fail, but the service is not affected because another node will provide it. Although solutions like Exchange DAG or SQL Always-On rely internally on cluster services, applications designed for high availability usually use loosely coupled systems without shared components (except, of course, the network, which can be a routed or geographical network).

An interesting example at the infrastructure layer could be the stretched cluster.

Disaster recovery vs. Stretched cluster

Although a stretched cluster can also be used for disaster recovery, and not only for disaster avoidance, there are some possible limitations in using a stretched cluster as a disaster recovery solution:

  • A stretched cluster can’t protect you from site link failures and can be affected by the split-brain scenario.
  • A stretched cluster usually works with synchronous replication, which means limited distance, but also the difficulty of providing multiple restore points at different times.
  • Bandwidth requirements are really high in order to minimize storage latency, so you need not only reliable links but also large ones.
  • A stretched cluster can be costlier than a DR solution but, of course, it can also provide disaster avoidance in some cases.

In most cases where a stretched cluster is used, there could be a third site acting as a traditional DR site, providing in this way a multi-level protection approach.
