Test a recovery plan
You can automatically create a non-disruptive, isolated testing environment on the recovery site by using replication and connecting virtual
machines to your isolated testing network. You can also save test results for viewing and export at any time.
Testing a recovery plan exercises nearly every aspect of a recovery plan, though several concessions are made to avoid disruption of
ongoing operations. While testing a recovery plan has no lasting effects on either site, running a recovery plan has significant effects on
both sites.
You should run test recoveries as often as needed. Testing a recovery plan does not affect replication or the ongoing operations of either
site (though it might temporarily suspend the selected local virtual machines at the recovery site if recoveries are configured to do so). You
can cancel a recovery plan test at any time.
In the case of planned migrations, a recovery stops replication after a final synchronization of the source and the target. Note that for
disaster recoveries, virtual machines are restored to the most recent available state, as determined by the recovery point objective (RPO).
After the final replication is completed, SRM makes changes at both sites that require significant time and effort to reverse. Because of
this, the privilege to test a recovery plan and the privilege to run a recovery plan must be separately assigned.
When a test failover to the recovery site is requested, SRM performs the following steps:
1 Determines the latest recovery point for each replicated volume.
2 Creates a writeable test snapshot for each recovery point, with a name in the form srannnnnn where nnnnnn is a monotonically
increasing number.
3 Maps the test snapshots to the appropriate ESXi hosts on the recovery site.
When testing stops, the test snapshots are unmapped and deleted.
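The test failover steps above can be sketched as a small simulation. This is only an illustrative model, not SRM or storage replication adapter code: the class `SnapshotSimulator`, its methods, and the volume and host names are hypothetical, and zero-padding the six-digit counter is an assumption about how the srannnnnn name is formed.

```python
from itertools import count


class SnapshotSimulator:
    """Hypothetical in-memory model of the test failover steps."""

    def __init__(self, hosts):
        self.hosts = list(hosts)   # ESXi hosts on the recovery site
        self._ids = count(1)       # monotonically increasing number
        self.snapshots = {}        # snapshot name -> details

    def start_test(self, volumes):
        """volumes maps a volume name to its recovery point timestamps."""
        for volume, points in volumes.items():
            latest = max(points)                      # step 1: latest recovery point
            name = f"sra{next(self._ids):06d}"        # step 2: writeable test snapshot
            self.snapshots[name] = {
                "volume": volume,
                "recovery_point": latest,
                "mapped_to": list(self.hosts),        # step 3: map to recovery hosts
            }
        return sorted(self.snapshots)

    def stop_test(self):
        """Unmap and delete every test snapshot."""
        for snap in self.snapshots.values():
            snap["mapped_to"].clear()
        self.snapshots.clear()
```

Because the counter lives on the simulator instance, names keep increasing across repeated tests in this model, so a later test never reuses the name of an earlier snapshot.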
Failover and failback
Failback is the process of setting the replication environment back to its original state at the protected site prior to failover. Failback with
SRM is an automated process that occurs after recovery. This makes the failback process of the protected virtual machines relatively
simple in the case of a planned migration. If the entire SRM environment remains intact after recovery, failback is done by running the
reprotect recovery steps with SRM, followed by running the recovery plan again, which moves the virtual machines configured within their
protection groups back to the original protected SRM site.
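The failback sequence for an intact environment (reprotect, then run the recovery plan again) can be sketched as a minimal state model. The class `ProtectionState`, the function `failback`, and the site names are hypothetical and exist only to show the order of operations; they do not correspond to any SRM interface.

```python
from dataclasses import dataclass


@dataclass
class ProtectionState:
    protected_site: str    # site that replication flows from
    recovery_site: str     # site that replication flows to
    vms_running_at: str    # site where the virtual machines currently run

    def run_recovery_plan(self):
        # Move the virtual machines to the current recovery site.
        self.vms_running_at = self.recovery_site

    def reprotect(self):
        # Reverse the replication direction; only valid after a recovery.
        if self.vms_running_at != self.recovery_site:
            raise RuntimeError("reprotect requires a completed recovery")
        self.protected_site, self.recovery_site = (
            self.recovery_site,
            self.protected_site,
        )


def failback(state):
    """Reprotect, then run the recovery plan again."""
    state.reprotect()
    state.run_recovery_plan()
```

In this model, after a failover from site-a to site-b, calling `failback` returns the virtual machines to site-a. The replication direction in the model is then site-b to site-a, so restoring the original direction would take one more `reprotect` call.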
In disaster scenarios, failback steps vary with respect to the degree of failure at the protected site. For example, the failover could have
been due to a storage system failure or the loss of the entire data center. The manual configuration of failback is important because the
protected site may have a different hardware or SAN configuration after a disaster. Using SRM, after failback is configured, it can be
managed and automated like any planned SRM failover. The recovery steps can differ based on the conditions of the last failover that
occurred. If failback follows an unplanned failover, a full data re-mirroring between the two sites may be required. This step usually takes
most of the time in a failback scenario.
All recovery plans in SRM include an initial attempt to synchronize data between the protection and recovery sites, even during a disaster
recovery scenario.
During the disaster recovery, an initial attempt will be made to shut down the protection group’s virtual machines and establish a final
synchronization between the sites. This is designed to ensure that virtual machines are static and quiescent before running the recovery
plan, in order to minimize data loss wherever possible. If the protected site is no longer available, the recovery plan will continue to execute
and will run to completion even if errors are encountered.
This behavior minimizes the possibility of data loss during a disaster recovery, balancing the requirement for virtual machine
consistency with the ability to achieve aggressive recovery-point objectives.
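The best-effort behavior described above (attempt shutdown and a final synchronization, then continue even if those steps fail) can be illustrated with a short sketch. All function names and step labels here are hypothetical, and the unreachable protected site is modeled simply as a `ConnectionError`; this is not how SRM is implemented.

```python
def shut_down_protected_vms(site_up):
    # Hypothetical step: fails when the protected site cannot be reached.
    if not site_up:
        raise ConnectionError("protected site unreachable")


def final_synchronization(site_up):
    # Hypothetical step: a last sync needs the protected site to respond.
    if not site_up:
        raise ConnectionError("protected site unreachable")


def recover_vms_at_recovery_site():
    # Hypothetical step: depends only on the recovery site.
    pass


def run_disaster_recovery(site_up):
    """Run every step in order, recording errors instead of stopping."""
    steps = [
        ("shut down protected VMs", lambda: shut_down_protected_vms(site_up)),
        ("final synchronization", lambda: final_synchronization(site_up)),
        ("recover VMs at recovery site", recover_vms_at_recovery_site),
    ]
    log = []
    for name, step in steps:
        try:
            step()
            log.append((name, "ok"))
        except ConnectionError as exc:
            log.append((name, f"error: {exc}"))  # note the error and continue
    return log
```

With `site_up=True` every step in this model succeeds, which corresponds to the case where the final synchronization limits data loss. With `site_up=False` the first two steps are logged as errors and the recovery step still runs to completion.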
Using SRM for disaster recovery