Five Disaster Recovery Mistakes You Might Be Making

Pragmatic Works Sep 24, 2015

Kathy Vick and Rowland Gosling are Pragmatic Works' resident experts on disaster recovery. Through their vast experience, they've noticed many common mistakes that DBAs are making and they're sharing their insight to prevent your recovery from being a disaster of its own:

Disaster planning is a challenging topic for most companies and a crucial factor in data lifecycle maturity. We understand that no one wants to think about bad things happening, but for the same reason you pay for insurance, you need a resolute disaster recovery plan. Even the most well-prepared organizations and DBAs can be challenged with what to do to bring their systems back online during an outage.

In our experience, most DBAs understand the planning side of a disaster, such as their company’s RPO (Recovery Point Objective, or how much data they can lose in a disaster) and RTO (Recovery Time Objective, or how long they can be down), and they plan their disaster strategies around those objectives. But the real challenge comes when disaster strikes and your DBA is put to the test, trying to bring the systems back online without turning the recovery into a disaster of its own.
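To make those two objectives concrete, here is a minimal sketch in Python that compares a backup schedule and an estimated recovery time against an RPO and RTO. The function names and the sample numbers are our own illustration, not figures from any particular system.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # If the failure hits just before the next backup runs,
    # everything written since the previous backup is lost.
    return backup_interval


def meets_objectives(backup_interval, estimated_recovery, rpo, rto):
    # Compare what the current setup can deliver against the stated objectives.
    return {
        "rpo_met": worst_case_data_loss(backup_interval) <= rpo,
        "rto_met": estimated_recovery <= rto,
    }


# Hypothetical numbers: backups every 15 minutes, a 3-hour recovery,
# an RPO of 30 minutes, and an RTO of 2 hours.
result = meets_objectives(
    backup_interval=timedelta(minutes=15),
    estimated_recovery=timedelta(hours=3),
    rpo=timedelta(minutes=30),
    rto=timedelta(hours=2),
)
print(result)  # {'rpo_met': True, 'rto_met': False}
```

In this example the backup schedule satisfies the RPO, but the recovery itself would overrun the RTO, which is exactly the gap that only shows up when the recovery is actually rehearsed.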

With all of that in mind, here’s our list of five common mistakes you may be making with your disaster recovery strategy:

1. You Don’t Have a Plan, Know the Plan, or Rehearse the Plan

This one seems obvious, but we see it all the time. Organizations don’t have a disaster recovery plan in place, or if they do, they don’t know it or rehearse it. Hardware is fairly reliable these days, and a lot of redundancy is built into systems to help maintain high levels of uptime. But when disaster strikes, ANY availability becomes the most important thing, and the emphasis is on bringing the systems back online as soon as possible. Many DBAs overlook disaster planning when there’s a heavy focus on High Availability solutions, and not being prepared for the recovery portion is a very common mistake.

Knowing what to do is dependent on what kind of a disaster you are facing. For example, “I dropped a table in production” is a very different disaster than “Our data center has flooded and everything is wiped out.” While both are critical problems, they have very different solutions for recovery.

A good DBA should have a playbook that defines what to do, step-by-step, for a variety of disaster scenarios such as:

What if the server breaks and goes down – how does it fail over and how do you fail it back?
What if the data center goes down – can we fail over to another center? How do we fail back? What is the impact to the users?
What if a table is dropped – where can you get the data back without data loss?
How to troubleshoot performance issues
How to handle disk/SAN performance problems – what to do when a disk goes down
What to do if the network goes down

This playbook helps DBAs to not panic under pressure while providing the steps to take when something goes wrong. In addition to having a playbook, running “disaster drills” to test the recovery plays and making sure that all DBAs understand what to do when a disaster happens can help reduce the panic and stress in disaster scenarios. Having a plan, knowing the plan and rehearsing the plan will provide valuable experience when it comes time to execute disaster recovery plans.
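One way to keep a playbook honest is to store each play alongside the date it was last rehearsed. The sketch below is only an illustration: the `Play` structure, the sample steps, and the 180-day drill interval are all assumptions of ours, and your own playbook will need far more detailed steps.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Play:
    scenario: str
    steps: List[str]
    last_rehearsed: Optional[date] = None  # None means it has never been drilled


def overdue_for_drill(plays, today, max_age=timedelta(days=180)):
    # A play is overdue if it was never rehearsed or was rehearsed too long ago.
    return [
        p.scenario
        for p in plays
        if p.last_rehearsed is None or today - p.last_rehearsed > max_age
    ]


# Hypothetical plays; the steps are placeholders, not a real procedure.
playbook = [
    Play("Table dropped in production",
         ["Stop writes to the affected database", "Restore a copy", "Copy the table back"],
         last_rehearsed=date(2015, 8, 1)),
    Play("Data center offline",
         ["Fail over to the secondary site", "Verify user connectivity", "Plan the failback"]),
    Play("Server failure",
         ["Confirm failover completed", "Repair the node", "Fail back"],
         last_rehearsed=date(2014, 12, 1)),
]

print(overdue_for_drill(playbook, today=date(2015, 9, 24)))
# ['Data center offline', 'Server failure']
```

A list like this makes it obvious which scenarios exist only on paper and which ones the team has actually practiced.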

2. You Over-Allocate or Under-Plan Standby Equipment for Disaster Recovery

Most companies hate having resources that are just “standing by” waiting for failures to happen, but it often happens with older disaster recovery systems like mirroring or active-passive type clusters. With newer technologies such as Always On or hardware/virtual replication, companies can utilize standby hardware to take the load off of existing systems. The problem with this is that companies may forget that in the event of a failure, the system will have to run on the remaining hardware.

We’ve seen companies with great resources on both of their clustered boxes, but they’re running different instances of SQL Server on each box. Initially, this wasn’t an issue because one of the boxes could easily handle both workloads on the two instances. But usage grew, and when the system later failed and both workloads had to run on a single box, that box was over-allocated and could not support them.

Another example is a company with a solid disaster recovery plan and a full off-site data center for their failover host, but the network pipeline between the disaster recovery site and the primary data center was too slow to handle the volume of network transactions. The failover occurred, but most users could not connect to the remote data center to keep working because of the network latency. That company under-planned how much network bandwidth would be required, and failed their users.

Making sure that you can support your workload with the post-disaster resources is critical. Testing the workload in a failover situation validates whether the resources left after a failure can actually carry it.
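A rough first pass at this check can be done on paper before any failover test. The sketch below, with made-up node names and CPU-core figures, asks whether the surviving nodes have enough total capacity if any single node fails. It deliberately ignores how instances are placed on nodes, so treat it as a sanity check and not a substitute for a real failover test.

```python
def failover_shortfalls(node_capacity, instance_load):
    # For each possible single-node failure, report how far the total
    # workload exceeds the capacity of the nodes that are left.
    total_load = sum(instance_load.values())
    shortfalls = {}
    for failed in node_capacity:
        remaining = sum(cap for node, cap in node_capacity.items() if node != failed)
        if total_load > remaining:
            shortfalls[failed] = total_load - remaining
    return shortfalls


# Hypothetical two-node cluster, capacity and load measured in CPU cores.
nodes = {"node_a": 16, "node_b": 16}

# At go-live, either box can carry both instances.
print(failover_shortfalls(nodes, {"sales": 6, "reporting": 5}))   # {}

# After usage grows, losing either box leaves a 3-core shortfall.
print(failover_shortfalls(nodes, {"sales": 10, "reporting": 9}))
# {'node_a': 3, 'node_b': 3}
```

Re-running a check like this as workloads grow would catch the situation described above, where a cluster that was safe on day one quietly stops being safe.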

3. You Forget About Applications Outside of the Primary System

When building a disaster recovery plan, many DBAs forget about items like SSRS and SSIS that are not “cluster aware” and do not support failover processes. In one instance, we worked with a company that had a great disaster plan, with a failover site to support their primary databases, but forgot that their SSIS packages were all on the database server that failed and did not exist on the backup server. Once the system failed over to the backup, they couldn’t run any of their maintenance jobs (including backups) until the original system was back online. In some cases, this could be a catastrophe all on its own.

It’s important to make sure you have plans to support non-clustered resources for disaster recovery. Plan for what to do with SSIS, SSRS, or any external third-party service that runs to support your users. And we can’t say this enough: Test your plan to make sure it will work in case of a failure!
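A simple inventory comparison can surface this kind of gap before a failover does. The sketch below assumes you can produce a list of what is deployed on each server; the category names and item names are hypothetical, and how you gather the inventory depends on your environment.

```python
def missing_on_standby(primary, standby):
    # Report every item that exists on the primary but not on the standby,
    # grouped by category.
    gaps = {}
    for kind, items in primary.items():
        missing = sorted(set(items) - set(standby.get(kind, [])))
        if missing:
            gaps[kind] = missing
    return gaps


# Hypothetical inventories of non-clustered resources on each server.
primary = {
    "ssis_packages": ["nightly_backup", "load_sales"],
    "ssrs_reports": ["daily_summary"],
}
standby = {
    "ssis_packages": ["load_sales"],
    "ssrs_reports": ["daily_summary"],
}

print(missing_on_standby(primary, standby))
# {'ssis_packages': ['nightly_backup']}
```

Here the backup package itself is the missing piece, which mirrors the real case above where maintenance jobs could not run after failover.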

4. You Don’t Have a Communication Plan

One of the most common problems during a disaster is having all users contact the DBA to tell them their system is down, or even worse, having management call the DBA to ask when the system will be back up. Something we often see is DBAs held up in unproductive status meetings during downtime. During a disaster, DBAs need to focus on the recovery, not on telling people about the recovery.

Having a communication plan is essential for DBA productivity in the midst of a disaster.

Including a communication plan in your disaster recovery strategy can help reduce stress and increase efficiency. This will set clear expectations and communications regarding the recovery strategies ahead of time and also help minimize the noise that happens during the disaster. An effective communication plan will include parameters like:

Who needs to be notified
Frequency of updates
Kinds of updates to be provided
How to give feedback to the DBA on the recovery process

The communication plan should also include communications of expected recovery times for the different types of disasters that may be happening.
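The parameters above can be written down ahead of time so that status updates are generated from the plan instead of improvised. This is a minimal sketch: the `CommunicationPlan` fields, the recipient names, and the recovery times are all placeholders of ours.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class CommunicationPlan:
    notify: List[str]                        # who needs to be notified
    update_every: timedelta                  # frequency of updates
    expected_recovery: Dict[str, timedelta]  # expected recovery time per disaster type
    feedback_channel: str                    # how to give feedback to the DBA


def status_update(plan, disaster_type, started_at, now):
    # Build one status line from the plan so the DBA does not have to write it.
    eta = started_at + plan.expected_recovery[disaster_type]
    next_update = now + plan.update_every
    return (
        f"To: {', '.join(plan.notify)} | Incident: {disaster_type} | "
        f"Expected recovery by {eta:%H:%M} | Next update at {next_update:%H:%M} | "
        f"Feedback: {plan.feedback_channel}"
    )


# Hypothetical plan and incident.
plan = CommunicationPlan(
    notify=["app-owners", "helpdesk"],
    update_every=timedelta(minutes=30),
    expected_recovery={"dropped_table": timedelta(hours=1),
                       "data_center_outage": timedelta(hours=8)},
    feedback_channel="dr-bridge-line",
)
message = status_update(plan, "dropped_table",
                        started_at=datetime(2015, 9, 24, 9, 0),
                        now=datetime(2015, 9, 24, 9, 10))
print(message)
```

Someone other than the DBA can send these updates, which keeps the person doing the recovery out of status meetings.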

5. You Rely Completely on Backups as Your Primary Disaster Recovery Plan

Don’t get us wrong – having backups is critical to any good disaster recovery scenario. Even if you have things like SAN replication and Virtual Machine restores to handle most scenarios, nothing works better for miscellaneous disaster recovery than a good backup strategy.

BUT – is that your only strategy? Are you counting on backups to be your recovery in case of a problem? Do you have other strategies in place? A common problem for DBAs is companies that decide their backups are their only disaster recovery plan and don’t plan for any other options.

Here’s a disaster that was waiting to happen: One company we worked with did tape backups. They sent their tapes to offline storage in order to have them available for recovery if necessary. That strategy worked fine as long as they didn’t have to recover anything. Unfortunately for them, they didn’t realize their tape backup was not writing data properly to the tape, and all of their backups were no good. When they had a disaster, they couldn’t recover from it using their tape backups. This DBA failed on two fronts: relying on a single disaster recovery strategy, and never performing fire-drill test restores from the tapes.

Another concern is whether you actually know how long a database restore can take – have you run a drill to see how long it takes to restore? Many times it is quicker to restore data from another online copy rather than have to shut down the database and do a full restore plus incrementals.
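Before running a drill, you can at least estimate the restore time from backup sizes and throughput. The sketch below is back-of-the-envelope arithmetic with made-up numbers; real restores include steps this ignores, so the measured time from an actual drill is the figure to trust.

```python
def estimated_restore_hours(backup_sizes_gb, throughput_mb_per_s):
    # Total data to read back, converted from GB to MB, divided by the
    # sustained restore throughput, then converted from seconds to hours.
    total_mb = sum(backup_sizes_gb) * 1024
    return total_mb / throughput_mb_per_s / 3600


# Hypothetical case: one 500 GB full backup plus two 20 GB incrementals,
# restored at a sustained 150 MB/s.
hours = estimated_restore_hours([500, 20, 20], throughput_mb_per_s=150)
print(f"Estimated restore time: {hours:.2f} hours")  # Estimated restore time: 1.02 hours

# Compare against a hypothetical RTO of one hour.
rto_hours = 1.0
print("Within RTO" if hours <= rto_hours else "Exceeds RTO")  # Exceeds RTO
```

Even this rough estimate can show that a backup-only recovery cannot meet the RTO, which is a good prompt to look at restoring from another online copy.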

Understanding all of your disaster recovery options outside of just backups can be a huge help when disaster strikes. Depending on the type of disaster, these options can be described in a “playbook” for execution by a DBA when problems happen.

Evaluating Your Disaster Recovery Plan

Being prepared for a data emergency is one of the most critical steps in data lifecycle maturity, and we want to help make sure you're prepared for any disaster. By assessing your data platform as a whole, we can identify strengths and weaknesses in all areas of your strategy and execution to help you reach the pinnacle of data maturity.