Disaster recovery incidents

Disaster Recovery has become tougher due to ever-changing virtual environments.

Being able to recover from a disaster is consistently a top priority for IT managers. They’re constantly looking for ways to protect more applications, and to do it more economically with less downtime. But even with sustained investment, there’s still an alarming lack of confidence in how well these processes will perform when a real disaster event occurs.

One of the most ambitious projects an IT department will ever embark on is the creation of a Disaster Recovery (DR) plan. But IT professionals need to understand that creating the plan is only the first step in the process. No matter how carefully crafted it is, a DR plan has no value if it doesn’t work when needed or if only a subset of the protected data can be recovered and recreated. In addition to developing an adequate DR plan, a strictly enforced change control process must be implemented so that changes in the environment are reflected in the plan. Yet the reality of the modern data center is that change typically happens too fast for a change control process to keep up with it. Even if change control is followed most of the time, one small misstep or slip-up can result in recovery failure.

Four Disaster Recovery monitoring must-haves:

  • Environment awareness. Disaster Recovery tools must go beyond application awareness and understand the environment so that changes to the application’s specific environment are detected and reported.
  • Hardware and software independence. DR monitoring should work across a variety of applications and storage hardware to analyze for inconsistencies.
  • Monitoring only. DR tools don’t have to actually move data — there are numerous hardware and software vendor products that do that. DR monitoring should therefore complement those solutions, not compete with them.
  • Work from a knowledgebase. DR monitoring shouldn’t depend only on information collected from devices. Organisations should develop their own list of best practices and use it to check for DR gaps.
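
As a rough illustration of the knowledgebase idea, the sketch below keeps a small list of best-practice rules and checks a description of the environment against them. Everything in it is hypothetical: the rule names, the layout of the env dictionary and the find_gaps function are assumptions made for the example, not part of any particular DR monitoring product.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str
    # Returns True when the environment satisfies the best practice.
    check: Callable[[dict], bool]


# Hypothetical best practices; a real list would be written by the organisation.
KNOWLEDGE_BASE = [
    Rule(
        "every volume has a replica target",
        lambda env: all(vol.get("replica") for vol in env["volumes"]),
    ),
    Rule(
        "DR site has at least as many hosts as the primary site",
        lambda env: env["dr_hosts"] >= env["primary_hosts"],
    ),
]


def find_gaps(env: dict, rules=KNOWLEDGE_BASE) -> list:
    """Return the names of the best practices the environment violates."""
    return [rule.name for rule in rules if not rule.check(env)]


# Assumed environment description, for illustration only.
example_env = {
    "volumes": [
        {"name": "data01", "replica": "dr-array"},
        {"name": "logs01", "replica": None},  # no replica configured
    ],
    "primary_hosts": 4,
    "dr_hosts": 4,
}

print(find_gaps(example_env))  # ['every volume has a replica target']
```

Keeping the rules as data rather than burying them in scripts means the list can grow as the organisation learns from each DR test.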

The proof is in the testing

Disaster Recovery plan testing is critical to identifying changes in the environment so that the plan can be updated or modified to include any new situations and to accommodate any altered conditions. Despite the importance of DR plan testing, full-scale tests can only be done periodically because they’re time-consuming and often expensive to conduct. In reality, partial tests are more common, and even those happen quarterly at best; many businesses run a full-scale test only once a year.

Many businesses also have the added burden of multiple locations coupled with legal or compliance regulations. That means each location should conduct its own standalone DR test. This can potentially make the gaps between the various DR sites and the primary site even greater.

The problem is that in between DR tests, many configuration changes take place. As a result, IT planners are looking for ways to monitor and validate their disaster readiness in between full-scale tests. DR monitoring tools are able to audit processes such as clustering and replication to ensure these systems capture all the data they need and store the redundant data copies correctly.

Configuration is the root of the problem

When a Disaster Recovery process like replication is first implemented, it’s installed into a known, static application state. The volumes have all been created and configured, and they can be easily identified by the replication application so that it can protect them. But as the application evolves, new volumes may be added so that more host servers can be supported. Or perhaps a volume gets moved to a different storage system so that performance can be improved, such as moving log files to an all-flash array. These additions or changes are often not reported to the IT personnel in charge of the disaster recovery process and, consequently, are left out of the protection process.
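
The kind of drift described here can be sketched as a comparison between the volumes an application currently uses and the volumes the replication process was configured to protect. This is a minimal sketch under stated assumptions: both inventories are plain dictionaries mapping a volume name to the storage system that holds it, and the names detect_drift and DriftReport are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftReport:
    unprotected: list  # used by the application, missing from replication
    relocated: list    # protected, but recorded on a different storage system
    stale: list        # still in the replication config, no longer used


def detect_drift(app_volumes: dict, replicated_volumes: dict) -> DriftReport:
    """Compare current application volumes with the replication configuration.

    Both arguments map volume name -> storage system (an assumed layout).
    """
    unprotected = sorted(v for v in app_volumes if v not in replicated_volumes)
    relocated = sorted(
        v
        for v, system in app_volumes.items()
        if v in replicated_volumes and replicated_volumes[v] != system
    )
    stale = sorted(v for v in replicated_volumes if v not in app_volumes)
    return DriftReport(unprotected, relocated, stale)


# Illustrative data: a volume was added and the logs moved to a flash array.
current = {"data01": "array-a", "data02": "array-a", "logs01": "flash-b"}
protected = {"data01": "array-a", "logs01": "array-a"}

report = detect_drift(current, protected)
print(report.unprotected)  # ['data02']
print(report.relocated)    # ['logs01']
```

Run between full-scale tests, a check like this would flag the added volume and the relocated log volume before a disaster exposes them.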

The configuration changes will typically be discovered during the next DR test and can be corrected then. But if a disaster occurs before the next scheduled test, data loss is likely to occur, as well as a failure to return the application to proper operation. In other words, every time a configuration change is made to an application, a DR test should be planned to make sure all the changes have been mapped into the DR process. In the real world, however, most IT budgets can’t support the expense of such frequent DR tests, and the IT staff is stretched far too thin to execute tests so frequently.

The bottom line

DR planning is never a one-time event; it’s a constant process that has to keep up with evolving service-level agreements and changes in the environment. Given the realities of a rapidly changing business, it’s almost impossible for change control processes to keep up, and it’s equally difficult to conduct DR tests with enough frequency to be meaningful. As a result, most companies, especially large enterprises, should consider disaster recovery monitoring and outsourcing of the day-to-day processes.