Research

Publication Date: 16 August 2011

ID Number: G00215785

Best Practices for Planning and Managing Disaster Recovery Testing
John P Morency

The time and resource costs of disaster recovery (DR) plan exercising, especially exercising
that is supported by manual or semimanual processes, have become the most
significant IT DR management (IT-DRM) pain point for many of Gartner's clients.
Specific steps can be taken and technologies can be deployed to reduce recovery plan
testing costs and complexities.
Key Findings
The annual costs of DR testing can reach or exceed $150,000 for many Gartner clients.
These costs could go even higher, as new business applications are rolled into
production.
Tools capable of discovering and mapping software and data dependencies between
Web-based applications are likely to become essential for managing efficient and
effective recovery testing/exercising.
The need for more thorough business application inquiry and transaction testing will
drive enterprises to assess organizational and test management consolidation and
integration to more efficiently scale recovery testing in the future.

Recommendations
Evaluate IT service dependency mapping technologies from vendors such as BMC
Software (Tideway), CA Technologies, HP, IBM, Neebula, ServiceNow and VMware to
assess the extent to which they can simplify the testing process and make it more
reliable.
Pilot software change management tools (from vendors such as BMC Software, CA
Technologies, HP, IBM Maximo, SAP and ServiceNow) and procedures that have the
potential to most effectively synchronize change implementation between primary
production and secondary recovery data centers.
Evaluate the possible savings that can be gained by consolidating the application testing
resources, processes and tools used by the DR and quality assurance (QA) testing
teams.

© 2011 Gartner, Inc. and/or its affiliates. All rights reserved. Gartner is a registered trademark of Gartner, Inc. or its
affiliates. This publication may not be reproduced or distributed in any form without Gartner's prior written permission. The
information contained in this publication has been obtained from sources believed to be reliable. Gartner disclaims all
warranties as to the accuracy, completeness or adequacy of such information and shall have no liability for errors,
omissions or inadequacies in such information. This publication consists of the opinions of Gartner's research organization
and should not be construed as statements of fact. The opinions expressed herein are subject to change without notice.
Although Gartner research may include a discussion of related legal issues, Gartner does not provide legal advice or
services and its research should not be construed or used as such. Gartner is a public company, and its shareholders may
include firms and funds that have financial interests in entities covered in Gartner research. Gartner's Board of Directors
may include senior managers of these firms or funds. Gartner research is produced independently by its research
organization without input or influence from these firms, funds or their managers. For further information on the
independence and integrity of Gartner research, see "Guiding Principles on Independence and Objectivity" on its website,
http://www.gartner.com/technology/about/ombudsman/omb_guide2.jsp

STRATEGIC PLANNING ASSUMPTION


By the end of 2014, 15% of enterprises will have significantly reduced or eliminated traditional DR
testing as a result of supporting more resilient IT operations.

ANALYSIS
DR testing is critical for supporting business resiliency. However, as the scope of mission-critical
business processes, applications and data increases, sustaining the quality and thoroughness of
the test process can be a challenge. Gartner client recovery and continuity-specific inquiries
indicate that many enterprises are now implementing new approaches for managing recovery
exercising, mostly because of the increasing cost and logistical complexity of traditional
approaches.
Gartner research shows the importance of effectively managing recovery exercising costs. In one
study of the exercising costs of federal government agencies (see "Cost-Cutting IT: Should You
Cut Back Your Disaster Recovery Exercise Spending?"), clients reported that IT-DRM annual
exercise budget allocations ranged from $20,000 to more than $150,000, depending on the size,
location, number of participants, scope of exercise and organizational structure of the
governmental unit. Results from nongovernment client inquiries have shown that it isn't unusual
for the annual cost of DR exercising to be between $75,000 and $150,000.
Gartner has identified some of the key reasons enterprises find DR testing increasingly difficult
and/or costly:
Increasingly complex dependencies: Web applications and services often have
logically meshed relationships with, and dependencies on, other applications and data,
some of which are often part of a lower recovery tier (see Table 1).
Inconsistencies: These occur between the current state of the data center
infrastructure, applications and data, and their state at the time of the last recovery test.
This may affect the extent to which production applications and data can be successfully
recovered, unless robust change and configuration management processes (and tools)
are in place. For example, a monthly volume of even a few hundred changes to a data
center's OS, middleware, applications or management agents can result in a difference
of thousands of changes between the current production configuration and the
production configuration at the time of the last recovery test.
Lack of resources: With the increasingly complex scope of testing, enterprises rarely
have adequate recovery testing resources to exercise all production application inquiries
and transactions on a regular basis. Some organizations test only their most
mission-critical applications. Others rotate testing among applications, while still others focus on
systems that have failed previous tests. A frequent result is that lower-priority
applications are tested far less frequently, and their recoverability is qualified as being
"on a best-effort basis."

Table 1. Recovery Tiers

Tier | Scheduled Service | Availability | RTO | RPO
1 | 24/7 | 99.9% (less than 45 minutes per month) | Two to eight hours | Four hours
2 | 24/6¾ | 99.5% (less than 3.5 hours per month) | Eight to 24 hours | Four hours
3 | 18/7 | 99% (less than 5.5 hours per month) | One to three days | One day
4 | 24/6½ | 98% (less than 13.5 hours per month) | More than three days | One day

RTO = recovery time objective; RPO = recovery point objective

Source: Gartner (August 2011)
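
The downtime allowances in Table 1 follow from the scheduled service window and the availability percentage. The sketch below is illustrative only. It assumes that a schedule such as 18/7 means 18 hours per day, seven days per week, and that a month has 30 days; neither assumption is stated in the table. The function name and the tier dictionary are hypothetical.

```python
DAYS_PER_MONTH = 30  # assumption: a 30-day month
WEEKS_PER_MONTH = DAYS_PER_MONTH / 7

def allowed_downtime_hours(hours_per_day, days_per_week, availability_pct):
    """Monthly downtime budget implied by a service schedule and an availability target."""
    scheduled_hours = hours_per_day * days_per_week * WEEKS_PER_MONTH
    return scheduled_hours * (1 - availability_pct / 100)

# (hours per day, days per week, availability %) per tier, as read from Table 1.
tiers = {
    1: (24, 7, 99.9),
    2: (24, 6.75, 99.5),
    3: (18, 7, 99.0),
    4: (24, 6.5, 98.0),
}

for tier, (hours, days, availability) in tiers.items():
    budget = allowed_downtime_hours(hours, days, availability)
    print(f"Tier {tier}: {budget:.2f} hours per month")
```

Under these assumptions, each computed budget falls just under the corresponding limit in the table (for example, about 43 minutes for Tier 1 against a stated limit of 45 minutes).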

In light of these challenges, Gartner is increasingly seeing clients rethink their test strategies and
implement a series of best practices.

Establishing a Minimum Acceptable Level of Recovery Testing


The 2011 Gartner Risk Management Survey shows that enterprises test recoverability, on
average, once or twice a year. However, anecdotal evidence based on more than 3,000
DR-related Gartner client inquiries in a three-year period suggests that fewer and fewer of these
live tests involve all production applications and data. Instead, tests are specific to an individual
recovery tier (typically, the recovery tier corresponding to the most mission-critical applications) or
include an affinity group of production applications that have related software and data
dependencies. This means that many organizations follow the 80/20 rule: 80% of the testing is
done on the applications that are the most mission-critical (which are often 20% or less of the
total number of production applications).
Despite this data, however, you shouldn't completely ignore test procedures for less critical
applications and data. Rather, IT must ensure the recovery of the business processes and
supporting applications, the loss of which would cause the greatest loss of revenue, productivity
or organizational reputation.
In terms of how often an organization should conduct testing, we offer the following baselines,
again subject to your organization's special circumstances:
Conduct live testing for Tier 1 and Tier 2 applications and data at least twice per year.
Initiate more frequent (monthly, quarterly) manual or (ideally) automated testing on
application affinity groups.
Perform failover and failback testing during the same or separate planned downtime
periods.
Ensure that the required data restoration and application activation cycle times meet or
beat the RTO and RPO targets.
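
The last baseline above, checking measured cycle times against RTO and RPO targets, can be sketched as a simple comparison. This is a minimal illustration and not a Gartner tool. The class names, the `evaluate` function, the application name and the measured values are all hypothetical; the target values are the upper bounds given for Tier 1 in Table 1.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class RecoveryTarget:
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of data loss

@dataclass(frozen=True)
class ExerciseResult:
    application: str
    recovery_time: timedelta     # measured data restoration plus application activation time
    data_loss_window: timedelta  # measured age of the most recent recoverable data

def evaluate(result, target):
    """Return the list of missed targets; an empty list means both were met."""
    misses = []
    if result.recovery_time > target.rto:
        misses.append("RTO")
    if result.data_loss_window > target.rpo:
        misses.append("RPO")
    return misses

# Tier 1 upper bounds from Table 1: RTO of eight hours, RPO of four hours.
tier1 = RecoveryTarget(rto=timedelta(hours=8), rpo=timedelta(hours=4))
result = ExerciseResult("order-entry", timedelta(hours=9, minutes=30), timedelta(hours=2))
print(evaluate(result, tier1))  # ['RTO']
```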
Regardless of how you determine recovery tier definitions, it is important to begin thinking about
how you can best test recoverability, especially for the most mission-critical application data. Test
more frequently the related applications and data that support a smaller set of key business
processes, and shift the testing focus to how IT can best meet or beat the associated recovery
targets.

Pain Point Remediation Alternatives


Automated Dependency Mapping
The challenge of ensuring that all required software and data dependencies are addressed in a
recovery configuration will become more complex, as new business applications that have been
purchased, created by in-house development teams, or acquired through merger and acquisition
(M&A) activity are turned over to production.
Increasingly mature IT service dependency mapping tools can help. These products, available
from vendors such as BMC Software, CA Technologies, HP and IBM, enable IT organizations to
discover, document and track relationships by mapping dependencies among the infrastructure
components, such as servers, networks, storage and applications, that form an IT service (see
"IT Service Dependency Mapping Tools: Market Dynamics Update"). These tools are used
primarily for applications, servers and databases; however, a few discover network devices (such
as switches and routers), mainframe-unique attributes and virtual infrastructures, thereby
presenting a complete service map. Although these tools are often bought in conjunction with
configuration management database (CMDB) projects, we have seen a significant increase in
their acquisition and use for data center-specific projects, such as IT-DRM modernization and
data center consolidation.
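
To illustrate the kind of question a dependency map answers for recovery planning, the sketch below walks a small dependency graph and flags dependencies that sit in a lower-priority recovery tier than the application being recovered. It is a hypothetical, simplified model and does not reflect how any of the vendor products named above work. The application names, the `depends_on` and `tier_of` dictionaries and the function name are invented for the example.

```python
from collections import deque

def lower_tier_dependencies(app, depends_on, tier_of):
    """Return the transitive dependencies of `app` that belong to a
    lower-priority recovery tier (a higher tier number) than `app` itself."""
    seen = {app}
    queue = deque([app])
    flagged = []
    while queue:
        current = queue.popleft()
        for dep in depends_on.get(current, []):
            if dep in seen:
                continue  # already visited; also guards against cycles
            seen.add(dep)
            queue.append(dep)
            if tier_of[dep] > tier_of[app]:
                flagged.append(dep)
    return sorted(flagged)

# Hypothetical inventory: application -> direct dependencies, and recovery tiers.
depends_on = {
    "order-entry": ["inventory-db", "pricing-service"],
    "pricing-service": ["reporting-db"],
}
tier_of = {"order-entry": 1, "inventory-db": 1, "pricing-service": 2, "reporting-db": 3}

print(lower_tier_dependencies("order-entry", depends_on, tier_of))
# ['pricing-service', 'reporting-db']
```

Each flagged item is a candidate for either promotion to the higher-priority tier or inclusion in the same affinity group for testing.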
Data dependency mapping products from 21st Century Software, AppAssure, Bocada, Continuity
Software, InMage and Sanovi provide automated data, metadata and
index consistency assurance between production files and databases and their replicas that are
maintained at one or more recovery sites. Background software agents determine and report on
the likelihood of achieving specified recovery targets, based on analyzing and correlating data
from applications, databases, clusters, OSs, virtual systems, networking and storage replication
mechanisms. These products perform their consistency checking on data located on
direct-attached storage (DAS), storage-area-network (SAN)-connected storage or network-attached
storage (NAS) at the primary production and secondary recovery data centers.
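
The basic idea of consistency checking between production data and its replica can be shown with content digests. This is a deliberately simplified sketch: the commercial products named above analyze far more than file contents, and nothing here represents their actual methods. The item names and byte strings are hypothetical stand-ins for files read at each site.

```python
import hashlib

def digest(content: bytes) -> str:
    """SHA-256 digest of an item's content."""
    return hashlib.sha256(content).hexdigest()

def inconsistent_items(production, replica):
    """Return names of production items whose replica is missing or differs.

    Both arguments map an item name to its content as bytes.
    """
    return sorted(
        name
        for name, content in production.items()
        if name not in replica or digest(replica[name]) != digest(content)
    )

# Hypothetical contents; in practice these would be read from each site.
production = {"orders.db": b"rows-1-100", "orders.idx": b"index-v2", "audit.log": b"entries"}
replica = {"orders.db": b"rows-1-100", "orders.idx": b"index-v1"}

print(inconsistent_items(production, replica))  # ['audit.log', 'orders.idx']
```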

Synchronizing Distributed Change


Ensuring 100% change consistency between the production data center configuration,
applications and data and their recovery data center counterparts is a challenging task. At a
minimum, the recovery infrastructure at the secondary site must be dedicated, although this may
not be the case for the recovery facility itself.
Typically, asynchronous data replication (either host- or storage controller-based) and server
virtualization are used to support a partial or full development and testing configuration that is
used by in-house application development, support and testing teams during normal production
hours. In this scenario, synchronizing changes between the primary production configuration and
the development and test configuration (which may also support recovery) is typically managed by
the development and testing teams, in conjunction with operations support. This may involve the
automated replication of updated production virtual server images to the secondary configuration,
in parallel or in tandem with production data replication.
Several product options support virtual server replication, including offerings from such vendors
as Acronis, Asigra, Atempo, BakBone Software, CA Technologies, CommVault, Double-Take
Software, EMC, FalconStor Software, HP, i365, IBM, InMage, Microsoft, NetApp, Novell, PHD
Virtual, Quest Software, Symantec, Syncsort and Veeam. However, for recovery configurations
that include a mix of physical and virtual servers, as well as a combination of shrink-wrapped and
in-house-developed applications, the use of IT process automation tools that orchestrate
infrastructure configuration, provisioning and change updating is likely to be required. (Further
information on the current state of IT process automation, change and configuration management
can be found in "Hype Cycle for IT Operations Management, 2011.")
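
Whatever tooling is used, the underlying check is a comparison of the production configuration with its recovery counterpart. The sketch below shows that comparison for two configuration snapshots represented as dictionaries. It is an assumption-laden illustration, not a description of any vendor's product: the component names, version strings and the `config_drift` function are hypothetical.

```python
def config_drift(production, recovery):
    """Compare two configuration snapshots (component name -> version or setting).

    Returns components missing at the recovery site, components present only
    at the recovery site, and components whose values differ.
    """
    missing = sorted(k for k in production if k not in recovery)
    extra = sorted(k for k in recovery if k not in production)
    mismatched = sorted(
        k for k in production if k in recovery and production[k] != recovery[k]
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}

# Hypothetical snapshots of the two sites.
production = {"os-patch": "2011-07", "app-server": "7.1", "billing-app": "3.4", "agent": "2.0"}
recovery = {"os-patch": "2011-05", "app-server": "7.1", "billing-app": "3.4", "legacy-tool": "1.0"}

print(config_drift(production, recovery))
# {'missing': ['agent'], 'extra': ['legacy-tool'], 'mismatched': ['os-patch']}
```

Running such a comparison after each production change window, rather than only before a recovery test, keeps the list of differences short enough to act on.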

Consolidating Testing Personnel, Tools and Skill Sets


One approach that has met with some client success is consolidating what were previously
separate QA and recovery testing teams into a single organization. Organizational consolidation,
together with the consolidation and standardization of testing platforms and scripts, is an
approach that can be used to support preproduction turnover regression testing, as well as
ongoing DR testing.
Organizations that implemented this approach did so to address a lack of recovery testing
breadth and depth. Given the increasing numbers of mission-critical applications requiring
recovery, as well as the related numbers of inquiries and transactions, it became clear that
manual or semimanual testing processes could only provide limited recovery assurance. This was
because the extent to which a full set of production inquiries and transactions could be
consistently exercised by the recovery exercising team was limited by testing time constraints.
In one specific instance, a recovery team was able to meet the required RTO and RPO targets for
the most mission-critical applications, but the recovery of the production environment, as
perceived by the business unit end users, was short-lived, because undiscovered (and, therefore,
unaddressed) software and data dependencies resulted in several inquiries and transactions
prematurely aborting or incurring unacceptably long response times. The net result was that the
recovery team won the battle by supporting the required RTOs and RPOs, but lost the war,
because the usability and effectiveness of the recovery operations configuration was limited.
A new approach was needed that could not only improve the breadth and depth of application
testing coverage, but also increase the efficiency and effectiveness of recovery exercising as a
whole. Following an assessment of the technical benefits and cost savings that could result from
a merger of the internal QA and the DR testing teams, a decision was made to consolidate them
into a single organization and to standardize the management and automation of test processes
by leveraging many of the tools, scripts and staff resources that were already in place.
The benefits that have been realized by some of the early adopters of this approach include
increasingly reliable and more-effective test exercises, combined with more-thorough testing of
representative production inquiries and transactions against the recovery configuration. The latter
improves the likelihood that recovery operations can be initiated within required RTO and RPO
targets, and ensures more stable recovery operations.
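
Automated exercising of representative inquiries and transactions against the recovery configuration can be sketched as a small test harness that records failures and slow responses. This is a minimal illustration under stated assumptions: each transaction is modeled as a Python callable, and the transaction names, the response-time threshold and the `exercise_transactions` function are hypothetical rather than taken from any client implementation.

```python
import time

def exercise_transactions(transactions, max_response_seconds):
    """Run each transaction once and classify it as 'ok', 'slow' or 'failed'.

    `transactions` maps a transaction name to a zero-argument callable that
    raises an exception if the transaction aborts.
    """
    report = {}
    for name, run in transactions.items():
        start = time.perf_counter()
        try:
            run()
            status = "ok"
        except Exception:
            status = "failed"
        elapsed = time.perf_counter() - start
        if status == "ok" and elapsed > max_response_seconds:
            status = "slow"
        report[name] = status
    return report

def balance_inquiry():
    return 100  # stands in for a successful inquiry

def funds_transfer():
    raise RuntimeError("dependency unavailable at recovery site")

def statement_report():
    time.sleep(0.05)  # stands in for an unacceptably long response

report = exercise_transactions(
    {"balance_inquiry": balance_inquiry,
     "funds_transfer": funds_transfer,
     "statement_report": statement_report},
    max_response_seconds=0.01,
)
print(report)
```

A report of this kind surfaces the aborted and slow transactions described in the example above before end users encounter them during an actual recovery.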

Summary
IT-DRM managers may recognize one or more of these approaches as potentially adding value to
their IT-DRM programs. Regardless of which side of the issue you see your organization leaning
toward, it is important to consider the key technologies your organization uses, because, for many
organizations, the use of more traditional recovery testing and technology that helps manage
more sustained availability may not be so much a case of "either/or" in the next five years, but
rather a case of "and."

RECOMMENDED READING
Some documents may not be available as part of your current Gartner subscription.
"Hype Cycle for Business Continuity Management and IT Disaster Recovery Management, 2011"
"From Development to Production: Integrating Change, Configuration and Release"
"Predicts 2011: Improved Recoverability May Be on the Horizon, but Significant Challenges Remain"
"Data Center Conference Poll Findings: Disaster Recovery Testing Mistakes"
"Cost-Cutting IT: Should You Cut Back Your Disaster Recovery Exercise Spending?"
"Toolkit: Best Practices for a Successful Tabletop Recovery Test"
"Hype Cycle for IT Operations Management, 2011"
"IT Service Dependency Mapping Tools: Market Dynamics Update"
