This tip gives 10 reasons why your disaster recovery plan might fail and offers advice on how to make sure your organization is prepared to recover from a disaster.
Any wily IT veteran develops a keen sense for the gap between IT fantasy and reality. Best practices are often talked about as lofty ideals, but in the real world they tend to be the best we can do given current constraints. In a well-run shop, the gap between the ideal and the practical isn't that great for most functions.
When it comes to disaster recovery (DR), however, the reality gap can be alarmingly huge. The disaster recovery vision is a scenario in which all disasters are withstood; using a well-crafted disaster recovery plan, operations are transferred to a remote facility to get the organization back online within recovery time objective (RTO) and recovery point objective (RPO) targets. But this is pure fantasy for most companies. The reality is that if a disaster should occur, nothing short of Herculean efforts by the IT staff would be required to have the slightest chance of getting back online in any reasonable period of time, much less the targeted RTO. So, it's time for a reality check. Here are some reasons why your disaster recovery plan may fail.
- Business and IT aren't linked. Disaster recovery is one component of a larger business recovery undertaking and, to be successful, it's necessary to understand all the requirements, drivers, related activities, interdependencies, contingencies and pitfalls associated with those other activities. But a recent survey sponsored by Veritas found that 76% of the companies studied left disaster recovery plans sitting solely in the hands of IT. While disaster recovery itself is an IT-specific function, it plays a supporting role to core business activities. As such, its focus must be tied to the overall business continuance effort, which includes ensuring that people and facilities are available and able to function from a business perspective.
- You don't have a disaster recovery plan. If there's an IT activity that cries out for teamwork, it's disaster recovery. The disaster recovery plan should be the playbook for all functional areas within IT prior to, during and after a disaster, and encompass applications, databases, networks, servers, clients and storage. Among its elements are the key contacts and owners for each activity, step-by-step recovery plans, validation tasks and activation processes. But most organizations fall short of this goal. The activities required in a disaster recovery situation are unfamiliar and will likely need to be carried out in adverse or chaotic circumstances. The lack of a comprehensive plan is a recipe for disaster -- or an even worse disaster.
- Your disaster recovery plan isn't current. Two words: change control. Disaster recovery plans become outdated almost immediately. Management of your disaster recovery plan must be integrated as a rigorously enforced part of the change control process. As new applications are brought online, their priority and impact with respect to disaster recovery should be considered. If you invest the time to develop a disaster recovery plan that classifies servers and applications, identifies interdependencies and documents recovery in detail, adding new elements may simply mean updating the appropriate set of forms and notifying the necessary groups.
- You don't test disaster recovery (or you don't test the right things). Let's face it -- disaster recovery testing is a major pain for most IT shops. It's not only a major operational disruption performed just once or twice a year, but all too often it's treated as a pro forma exercise.
Many disaster recovery test plans lack true end-to-end testing. Recovery and testing should be done on an application basis, not simply per server. Complex apps have interdependent elements that run on multiple servers. Recovering operating systems and data is just the first step; the apps should then be recovered and tested. While it may be impractical, the ultimate disaster recovery test would be to run production from the disaster recovery location for a period of time and then switch back at some later date.
Another problem is that disaster recovery testing isn't viewed as a quality improvement exercise, but as an exam. This can lead to counterproductive activities such as limiting recovery to "safe" components that aren't likely to be problematic. It should be assumed that some weaknesses or failures will occur. Finding process bugs is a good thing, so they can be corrected and avoided in the future.
- Your recovery goals are unrealistic. Often, organizations will establish RTO and RPO objectives, and even prioritize and classify servers and apps in accordance with the policies; but when disaster recovery capabilities are objectively examined, the goals are unattainable. For example, if you have recovery goals of less than a day, they can't be met if your disaster recovery facility is a cold site and you're relying on tape-based recovery. Realistic goals and metrics need to be established that reasonably estimate the time it takes to recover a server or to configure a storage or backup environment.
- You don't have clearly defined disaster recovery roles, responsibilities and ownership. Disaster recovery demands organization and execution. Each participant must understand their job, who they will interact with and, most importantly, the proper chain of command. A good portion of disaster recovery planning should be spent defining this structure and developing a level of comfort in its execution. Factors to consider include how a disaster is declared, the time to notify and stage people at disaster recovery sites, equipment logistics and execution of the recovery process.
- Your disaster recovery plan doesn't address the right risks. Disaster recovery is an insurance policy. You need to determine how much and what kinds of insurance you need, and what risks you're willing to take.
There are many potential causes of unplanned outages, ranging from internal physical events to external regional or environmental catastrophes. Internal events are more likely to cause problems than events outside the data center. Developing an understanding of disaster categories, weighing the risks and formulating a plan to address the targeted categories should be the goal. People often buy insurance based on what they can afford, rather than what they need, but disaster recovery decisions shouldn't be made on that basis.
- Your backups don't work. Although technically related to testing, it's worth underscoring the point that tape backup is the primary medium for disaster recovery at most companies. Wide-area data replication is still too costly for many businesses; therefore, their disaster recovery is only as good as their tape restoration capabilities and offsite tape management. All the planning in the world matters naught if the tapes are bad (or just don't exist).
Often, offsite tapes can't be created and shipped in a timely fashion due to a lack of resources. Virtual tape libraries and other disk-based approaches can enable backups to complete sooner and allow tape resources to be dedicated to offsite media production.
- Will anyone be there to recover data? An uncomfortable factor to consider is the risk of staff not being available to perform the recovery. Some might say that in such a scenario there are far greater issues than data recovery, but at some point this risk needs to be considered. Large companies with IT expertise in multiple data centers can develop disaster recovery capabilities that leverage resources in multiple locations. Third-party service companies may also be involved, provided comprehensive plans and guidelines are in place.
- Disaster recovery is just too expensive. During a recent conversation, an IT manager exclaimed to me, "We simply can't afford to test DR!" I've alluded to this issue in some of the previous points, but good disaster recovery is an onerous expense that most organizations are unwilling or unable to absorb.
But even with a small disaster recovery budget, prudent steps can be taken, such as ensuring good backups, establishing roles and responsibilities, and effective planning. New technologies may also be leveraged to make recovery more affordable. But don't create false expectations. Establish recovery objectives that are in line with capabilities and make them known and understood outside of IT. Disaster recovery may be the item IT least wants to talk about, but it's past time to face up to the issue and close the reality gap.
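One way to check whether a recovery objective is in line with capabilities is to add up honest time estimates for each recovery step and compare the total against the stated RTO. The Python sketch below is only an illustration of that arithmetic: the step names, the hour figures and the helper functions `estimated_recovery_hours` and `meets_rto` are all hypothetical, and it assumes the steps run one after another rather than in parallel.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStep:
    name: str
    hours: float  # estimated duration; illustrative figure, not a measurement


def estimated_recovery_hours(steps):
    """Total elapsed time, assuming the steps run strictly in sequence."""
    return sum(step.hours for step in steps)


def meets_rto(steps, rto_hours):
    """True if the estimated recovery time fits within the RTO."""
    return estimated_recovery_hours(steps) <= rto_hours


# Hypothetical cold-site, tape-based recovery for one application.
cold_site_tape = [
    RecoveryStep("declare disaster and notify staff", 2),
    RecoveryStep("ship tapes and stage people at the cold site", 12),
    RecoveryStep("configure servers, storage and backup environment", 16),
    RecoveryStep("restore operating systems and data from tape", 10),
    RecoveryStep("recover and validate the application", 6),
]

if __name__ == "__main__":
    total = estimated_recovery_hours(cold_site_tape)
    for rto in (24, 48):
        verdict = "attainable" if meets_rto(cold_site_tape, rto) else "unattainable"
        print(f"RTO of {rto}h vs. estimated {total}h: {verdict}")
```

With these made-up estimates, a recovery goal of less than a day fails the check, which mirrors the cold-site example above. Replacing the figures with times measured during an actual disaster recovery test would make the comparison meaningful for a specific environment.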
James Damoulakis is CTO of GlassHouse Technologies, an independent storage services firm with offices across the United States and in the UK.