For Business Continuity and Disaster Recovery (BC/DR) in a vSphere infrastructure, most customers choose between two options: use VMware Site Recovery Manager (SRM) or build a vSphere Metro Stretched Cluster.
While both are great options for BC/DR, both also have some disadvantages.
VMware announced at VMworld 2014 that it is working on integrating SRM with a vSphere Metro Stretched Cluster.
I am sure these new features are not in SRM 5.8, which was announced during VMworld. Jason Boche made three videos demoing the new release. In the demo of a tech preview some errors were encountered.
In May 2012 I predicted a couple of the enhancements now announced. Good to see they are now becoming a reality.
This is a summary of breakout session BCO1916. You can watch this interesting session yourself here.
So the two options for BC/DR in a vSphere datacenter are:
-option 1: two datacenters, both running production in an active-active configuration with stretched storage and networking. We call this a vSphere Metro Stretched Cluster
-option 2: two datacenters in active/passive. One running production, the other test/dev. If production site fails, VMware Site Recovery Manager is used to perform an orchestrated recovery of the virtual machines in the recovery site. Alternative tools are vSphere Replication or Zerto Virtual Replication.
Option 1 is great for disaster avoidance, balancing of resources and planned maintenance. When IT knows in advance that one of the datacenters might become unavailable because of a hurricane, a power outage, SAN maintenance and so on, virtual machines can be vMotion-ed to the alternate datacenter.
In case of an unplanned event like a fire or earthquake, VMware HA will take care of restarting the virtual machines. The advantage is that up-to-date virtual machine disk files are available in the recovery site, so both RPO and RTO are low.
However, VMware HA is not designed for large-scale recovery of a complete site. VMware HA does not offer runbooks for an automated recovery. It is not aware of application dependencies, nor is it site aware. HA does not offer granular control over VM start priority. Also, a failover cannot be tested: there is no way to shut down a VM and restart it in the other site without taking the production VM down.
Another restriction is that, because of the synchronous replication at the storage layer, the distance between the two datacenters is limited to about 100 km. A vSphere Metro Stretched Cluster is typically deployed in a metro area. A hurricane or earthquake is likely to hit a larger area, so both sites might be affected.
Option 2 does offer orchestration using runbooks aka recovery plans. IT can test a recovery without disturbing the production environment.
Currently, combining a vSphere Metro Stretched Cluster with VMware Site Recovery Manager is not possible.
Quite a few blogposts, VMworld sessions and whitepapers have been written about the advantages and disadvantages of both scenarios. Duncan Epping wrote a blogpost titled SRM vs Stretched Cluster solution about this in 2013. This is a great whitepaper about this topic published by VMware.
As said VMware is working on a tech preview of SRM which enables using SRM in a vSphere Metro Stretched Cluster.
Some of the requirements for such a setup have been announced at VMworld:
- vSphere 6.0 will enhance Long Distance vMotion by supporting a round-trip latency of 100 ms. This enables a much bigger distance between two datacenters
- vMotion will be possible between two different vCenter Servers. Two vCenters are a requirement for SRM.
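To put the 100 ms figure next to the roughly 100 km limit of synchronous replication, here is a rough back-of-the-envelope sketch. It assumes light travels through fiber at about 200 km per millisecond and ignores switching, routing and protocol overhead, so real-world distances will be lower. The function names and the constant are my own illustration, not anything published by VMware.

```python
# Assumption: signal speed in optical fiber is roughly 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def max_one_way_km(rtt_ms: float, km_per_ms: float = FIBER_KM_PER_MS) -> float:
    """Upper bound on site distance if the whole round trip were propagation delay."""
    return (rtt_ms / 2.0) * km_per_ms


def propagation_rtt_ms(distance_km: float, km_per_ms: float = FIBER_KM_PER_MS) -> float:
    """Round-trip propagation delay for a given one-way distance."""
    return 2.0 * distance_km / km_per_ms


if __name__ == "__main__":
    print("RTT for 100 km:", propagation_rtt_ms(100.0), "ms")
    print("Distance bound for 100 ms RTT:", max_one_way_km(100.0), "km")
```

Under these assumptions 100 km of fiber costs about 1 ms of round trip, while a 100 ms budget covers the propagation delay of thousands of kilometers. That is a theoretical ceiling only, not a supported distance.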
In the future SRM will allow organizations using a vSphere Metro Stretched Cluster to orchestrate planned failovers. SRM will use a recovery plan to initiate vMotion of virtual machines, monitor the vMotion progress and report success or failure. A vMotion will not always succeed, for example because of latency issues. In that case SRM allows a rerun of the planned failover runbook. If a vMotion fails, SRM will shut down the VM on the production site and restart it on the recovery/secondary site.
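The planned failover flow described above can be sketched as a small simulation. This is only my reading of the session, not SRM's actual implementation or API: the VM class, the planned_failover function, the try_vmotion callback and the fallback_to_restart flag are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VM:
    name: str
    site: str = "protected"
    powered_on: bool = True


def planned_failover(
    vms: List[VM],
    try_vmotion: Callable[[VM], bool],
    fallback_to_restart: bool = False,
    recovery_site: str = "recovery",
) -> Dict[str, str]:
    """Run the (hypothetical) plan once and return a per-VM result."""
    report: Dict[str, str] = {}
    for vm in vms:
        if vm.site == recovery_site:
            # Already moved by an earlier run of the plan.
            report[vm.name] = "skipped"
        elif try_vmotion(vm):
            # Live migration succeeded, the VM never went down.
            vm.site = recovery_site
            report[vm.name] = "vmotion"
        elif fallback_to_restart:
            # vMotion failed: shut down on the protected site,
            # then power on again in the recovery site.
            vm.powered_on = False
            vm.site = recovery_site
            vm.powered_on = True
            report[vm.name] = "cold-restart"
        else:
            report[vm.name] = "failed"
    return report


if __name__ == "__main__":
    vms = [VM("web01"), VM("db01")]
    flaky = lambda vm: vm.name == "web01"  # pretend db01 hits a latency issue
    print(planned_failover(vms, flaky))
    print(planned_failover(vms, flaky, fallback_to_restart=True))
```

The first run reports which vMotions failed. The rerun skips VMs that already moved and falls back to a shutdown and restart for the rest, which costs those VMs some downtime.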
For SRM to understand stretched storage, vendors will need to develop new Storage Replication Adapters (SRA).
Storage Profile Protection Groups (SPPG) will be a new component of SRM. The idea is that once a protection group (PG) is created, storage profiles are added to the PG. Any virtual machine or datastore part of the storage profile will automatically be protected.
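The idea of profile-based membership can be illustrated with a few lines of code. This is a toy model under my own assumptions, not SRM's data model: the ProtectionGroup class, its fields and the inventory dictionary are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ProtectionGroup:
    name: str
    storage_profiles: Set[str] = field(default_factory=set)

    def protected_vms(self, vm_profiles: Dict[str, str]) -> List[str]:
        """Return every VM whose storage profile belongs to this group."""
        return sorted(
            vm for vm, profile in vm_profiles.items()
            if profile in self.storage_profiles
        )


if __name__ == "__main__":
    pg = ProtectionGroup("pg-gold", {"gold"})
    # Hypothetical inventory: VM name -> assigned storage profile.
    inventory = {"web01": "gold", "db01": "gold", "test01": "bronze"}
    print(pg.protected_vms(inventory))

    # A new VM with the gold profile is picked up without editing the group.
    inventory["app01"] = "gold"
    print(pg.protected_vms(inventory))
```

The point of the design is that membership is derived from the profile rather than maintained by hand, so a newly deployed VM is protected as soon as it gets the right storage profile.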
In case of an unplanned failover SRM will obviously not use vMotion as the production site is down. SRM will take care of restarting VMs in the recovery site according to the recovery plan.
One of the new features of combining SRM with a stretched cluster is the ability to perform test failovers. This will not actually perform a vMotion. It will power on the virtual machines in the recovery site using an isolated network.
It is also possible to reprotect. Reprotect is to make the recovery site the primary site. The site which was originally the protected site now will become the recovery site. Failback is supported as well.
VMware did not reveal the release number of the SRM version that will support stretched clusters, nor did they reveal a release date. My guess is that this feature will be in SRM 6.0, to be released near the GA of vSphere 6.0.