Datacenter resiliency with VMware vSphere
March 31, 2012
While data and individual applications are usually protected with great care (backup, server clustering, RAID, VMware HA, etc.), protection against major disruptions of a whole site or datacenter does not get the attention it deserves. Most companies find it too complex or too costly.
While the risk of a major incident bringing down a datacenter is low, the impact can be enormous. It can even put a company out of business.
This post gives some guidelines for protecting datacenters in which VMware vSphere is used for server virtualization. It is partly a summary of an excellent breakout session presented at both VMworld USA and Europe (BCO2479) and partly based on my own knowledge of DR solutions.
Using a stretched cluster will deliver downtime avoidance and disaster avoidance, but it is not an equivalent solution to Site Recovery Manager.
When using DR tooling (SRM, Zerto, etc.), the software costs for protecting virtual machines range from around $200 to around $1,000 per VM.
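To get a feel for what that per-VM range means for a whole environment, here is a minimal sketch of the arithmetic. The function name and the 150 VM example are my own assumptions for illustration; only the $200 and $1,000 bounds come from the figures above.

```python
def dr_license_cost_range(vm_count, low_per_vm=200, high_per_vm=1000):
    """Return (lowest, highest) estimated DR software cost in USD.

    The per-VM bounds are the rough figures quoted in the text;
    real pricing depends on vendor, edition and volume.
    """
    if vm_count < 0:
        raise ValueError("vm_count must be non-negative")
    return vm_count * low_per_vm, vm_count * high_per_vm


# Hypothetical environment with 150 protected VMs
low, high = dr_license_cost_range(150)
print(f"150 VMs: ${low:,} to ${high:,}")
```

This only covers software licensing for protection; replication bandwidth, storage and the recovery site itself come on top.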
Datacenter or site protection can be done for two reasons: disaster avoidance or disaster recovery.
Disaster avoidance or downtime avoidance (DA) is executed when you know in advance a disaster or disruption of your datacenter operation is about to happen. It can be a hurricane coming your way or planned maintenance on the power, storage, network or cooling of the datacenter which requires evacuation of virtual machines.
Disaster recovery (DR), on the other hand, is restarting operations in an alternative site because an unforeseen disaster happened in the primary site, making it unavailable or no longer the preferred site: a fire, an earthquake or a major failure of your storage, for example. Keep in mind that human error and software and hardware faults are the cause of the disaster in 75% of cases.
Think of VMware vMotion as a disaster avoidance technique and VMware HA as disaster recovery at the host level.
At a site/datacenter level there are several architectures possible for DR and DA:
1. A stretched VMware HA cluster over two sites (active/active).
2. A VMware HA cluster at each of the two sites.
3. A protected primary site and a secondary recovery site, protected using a DR solution (VMware SRM, Zerto, VirtualSharp).
4. A protected primary site and a recovery site operated by a service provider (Disaster Recovery as a Service).
As a guideline, a stretched cluster is needed in around 10% of situations; in the other 90%, a DR solution like VMware Site Recovery Manager will fit the requirements perfectly.
What fits the organization depends heavily on the recovery time objective (RTO), the recovery point objective (RPO) and budget.
A stretched cluster is mostly suited to disaster avoidance/downtime avoidance scenarios. vMotion is used to balance workloads, and HA will restart VMs on the remaining hosts in the alternate site if hosts in one of the sites fail. Keep in mind that this architecture comes with quite a few requirements, such as a distance of no more than 100 km between the sites and less than 10 ms latency on the link between them. vSphere 5 Enterprise Edition supports Metro vMotion, which accepts a latency of 10 ms. As the storage needs synchronous replication, you will need a Fibre Channel connection between the sites.
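The requirements above lend themselves to a quick sanity check before you go any further with a stretched design. The sketch below is only an illustration: the `SiteLink` class and the function name are hypothetical, and the thresholds are simply the 100 km and 10 ms figures mentioned in this post, not values read from any VMware tool.

```python
from dataclasses import dataclass

MAX_DISTANCE_KM = 100  # from the text: not more than 100 km between sites
MAX_LATENCY_MS = 10    # from the text: less than 10 ms on the inter-site link


@dataclass
class SiteLink:
    """Hypothetical description of the link between two sites."""
    distance_km: float
    latency_ms: float
    synchronous_replication: bool


def stretched_cluster_issues(link):
    """Return a list of reasons why this link does not fit a stretched cluster."""
    issues = []
    if link.distance_km > MAX_DISTANCE_KM:
        issues.append(f"distance {link.distance_km} km exceeds {MAX_DISTANCE_KM} km")
    if link.latency_ms >= MAX_LATENCY_MS:
        issues.append(f"latency {link.latency_ms} ms is not below {MAX_LATENCY_MS} ms")
    if not link.synchronous_replication:
        issues.append("storage is not replicated synchronously")
    return issues


# Example: two sites 40 km apart with 3 ms latency and synchronous replication
print(stretched_cluster_issues(SiteLink(40, 3, True)))
# Example: sites too far apart, too slow, and replicated asynchronously
for issue in stretched_cluster_issues(SiteLink(250, 15, False)):
    print("-", issue)
```

An empty list only means these three basic conditions are met; the storage vendor and the VMware HCL have the final word.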
Using DRS host affinity rules and datastore clusters, you can prevent a VM from running in datacenter A while its storage is located in datacenter B.
HA does not have granular control mechanisms for starting VMs in the right order. Basically, HA has three restart priorities: high, medium and low. A stretched cluster is a rather expensive and complex architecture for keeping your VMs up with zero downtime and no data loss in case of disaster avoidance.
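To illustrate how coarse three restart priorities are, the sketch below groups a set of VMs the way the high/medium/low setting would. It is a plain Python model, not a call into any VMware API, and the VM names are made up. Sorting by name inside a group is only there to make the output deterministic; it does not represent any ordering HA applies within a priority level.

```python
# Restart priority levels as named in the text, highest first
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def restart_batches(vms):
    """Group VM names by restart priority, highest priority first.

    vms maps a VM name to "high", "medium" or "low".
    Returns a list of (priority, [vm names]) tuples.
    """
    batches = {priority: [] for priority in PRIORITY_ORDER}
    for name, priority in vms.items():
        if priority not in PRIORITY_ORDER:
            raise ValueError(f"unknown restart priority: {priority}")
        batches[priority].append(name)
    # Name sorting is for readable output only, not an HA guarantee
    return [(p, sorted(batches[p])) for p in PRIORITY_ORDER if batches[p]]


# Hypothetical three-tier application plus a test VM
vms = {"db01": "high", "web01": "medium", "app01": "high", "test01": "low"}
for priority, names in restart_batches(vms):
    print(priority, names)
```

Note that `db01` and `app01` land in the same group: with only three levels there is no way to express "start the database before the application server", which is exactly the gap DR tooling with recovery plans fills.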
You could also decide on multiple, non-stretched HA clusters, one in each site for example. In this case vMotion between clusters (intercluster vMotion) is serialized, whereas within a single stretched cluster (intracluster vMotion) it runs in parallel. Serialized vMotions take longer to complete when many VMs need to be evacuated to the alternative site. Also keep in mind that VMware HA will not start VMs in the other cluster, as the domain of HA is a single cluster. So some (complex) scripting is needed when manual startup would take too much time or could lead to errors.
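A rough back-of-the-envelope model shows why serialization matters when evacuating a site. Everything in this sketch is an assumption for illustration: the function name, the 2 minutes per vMotion, and the concurrency of 8 are made-up numbers, not vSphere limits or measurements.

```python
import math


def evacuation_minutes(vm_count, minutes_per_vmotion, concurrent=1):
    """Estimate the time to evacuate vm_count VMs.

    concurrent=1 models serialized migrations; a higher value models
    migrations running in parallel, in waves of that size.
    """
    if concurrent < 1:
        raise ValueError("concurrent must be at least 1")
    waves = math.ceil(vm_count / concurrent)
    return waves * minutes_per_vmotion


# Hypothetical site with 200 VMs, each taking about 2 minutes to migrate
print("serialized:", evacuation_minutes(200, 2), "minutes")
print("8 in parallel:", evacuation_minutes(200, 2, concurrent=8), "minutes")
```

The model ignores bandwidth contention, so real parallel migrations will not scale this cleanly, but it shows the order of magnitude you should plan for when a hurricane is on its way.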
Keep in mind that a stretched cluster also does not allow you to perform DR tests or produce reports for audit purposes, as SRM and other DR tooling do.
Another architecture for protecting a datacenter is to use disaster recovery software. This is always disruptive, which means some downtime for virtual machines (when evacuating for DR or for disaster avoidance (planned migration)), but it is much simpler and probably less costly because of lower requirements for bandwidth, storage and networking. You can either operate your own second datacenter (running test/dev virtual machines, for instance, which are shut down at disaster recovery) or outsource DR to a service provider.
Depending on the replication used (synchronous or asynchronous), you must accept no or some loss of data.
On average, using DR tooling will give the best RTO and RPO of all available architectures. Also keep in mind that you do not need to purchase expensive replication software from your storage vendor. SRM, Zerto Virtual Replication and VirtualSharp ReliableDR can work at the host level and are storage agnostic. Zerto promises a 1-minute RPO for its near-synchronous replication. VirtualSharp ReliableDR will assure that the replica of the VM in your recovery site is made available according to the RTO and RPO during failover.
It is also worth knowing that stretched clusters cannot be combined with Site Recovery Manager, as a stretched cluster uses a single instance of vCenter Server while SRM needs two instances of vCenter Server.
VMware released a very good whitepaper on datacenter protection in March 2012 titled Stretched Clusters and VMware vCenter Site Recovery Manager. The description of the paper is:
“This paper is intended to clarify concepts involved with choosing solutions for vSphere site availability, and to help understand the use cases for availability solutions for the virtualized infrastructure. Specific guidance is given around the intended use of DR solutions like VMware vCenter Site Recovery Manager and contrasted with the intended use of geographically stretched clusters spanning multiple datacenters. While both solutions excel at their primary use case, their strengths lie in different areas which are explored within.”
VMware also published a whitepaper titled VMware vSphere Metro Storage Cluster Case Study. The abstract of the paper is:
VMware vSphere Metro Storage Cluster (VMware vMSC) is a new configuration within the VMware Hardware Compatibility List. This type of configuration is commonly referred to as a stretched storage cluster or metro storage cluster. It is implemented in environments where disaster/downtime avoidance is a key requirement. This case study was developed to provide additional insight and information regarding operation of a VMware vMSC infrastructure in conjunction with VMware vSphere. This paper will explain how vSphere handles specific failure scenarios and will discuss various design considerations and operational procedures.
Scott Lowe did a couple of great presentations on stretched clusters. View one here.
VMware recently added a new category to the HCL titled vSphere Metro Stretched Cluster.
Chad Sakac (EMC) and Lee Dilworth (VMware) did a great presentation at both VMworld USA and Europe titled BCO2479 – Understanding vSphere Stretched Clusters, Disaster Recovery, and Planned Workload Mobility. Details on the sessions and the video recording are here.
Duncan Epping of Yellow-Bricks wrote a great blog post on the subject titled vSphere 5.0 HA and metro / stretched cluster solutions.
HP published a great whitepaper titled Implementing VMware vSphere Metro Storage Cluster with HP LeftHand Multi-Site storage.
NetApp has a good whitepaper available titled A Continuous-Availability Solution for VMware vSphere and NetApp Using VMware High Availability and Fault Tolerance and NetApp MetroCluster.
VMware vSphere 5 Update 1 has some enhancements that are useful for stretched clusters. Read the blog post by Chad of VirtualGeek here.
Marco Broeken of Virtual Clouds has another overview of disaster avoidance and DR, with useful links.