Disaster Recovery: What the CTO/CIO needs to know

Information technology keeps today’s businesses running, but any system can break down. When one does, your company must be able to recover and get back online as quickly as possible. That is where a good Disaster Recovery (DR) plan comes into play. Here we will break down what goes into a worthy plan.

Documentation

The first requirement is that the plan be documented. This is a living document: it changes as your infrastructure and personnel change. Assign a person or a team to own it, and review it at least three to four times a year for completeness. Whenever a project plan is developed for a new application or an upgrade, include a task in that project plan to update the DR plan. The DR plan needs to be stored in several places. Some companies save it only on the file server and then find out that when the file server crashes, the DR plan is not available. (Whoops!) Always save your DR plan in multiple locations, and keep a hard copy as well.
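Keeping copies in several places only helps if those copies are current. The following is a minimal sketch, not a prescribed tool, of how a plan owner might check that every stored copy matches the master. The function names and the convention that the first path is the master copy are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_copies(paths: list[Path]) -> dict[str, list[Path]]:
    """Sort DR plan copies into 'current', 'stale', and 'missing'.

    Assumption of this sketch: the first path is the master copy and exists.
    """
    master, *others = paths
    report: dict[str, list[Path]] = {"current": [master], "stale": [], "missing": []}
    master_digest = sha256_of(master)
    for copy in others:
        if not copy.exists():
            report["missing"].append(copy)   # copy was never made or was lost
        elif sha256_of(copy) == master_digest:
            report["current"].append(copy)   # identical to the master
        else:
            report["stale"].append(copy)     # differs from the master
    return report
```

A check like this could run as part of each scheduled review. It cannot cover the hard copy, which still has to be reprinted by hand whenever the plan changes.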

What should be in the DR plan? Items that most companies leave out are contact numbers for key internal personnel, vendor support numbers, and contract information. If your contacts are stored in the email system and email is down, you may not have the phone numbers you need to call. Another commonly missed item is the order in which calls should be made. Server and network information is great, but service account information helps as well. This does not mean putting passwords in the DR document; instead, list where those passwords can be found. A cloud service is a good place to hold the password document, but keep copies in other locations too. Make sure that the document is password-protected and do not name the file “passwords.”
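As an illustration of recording contacts together with a call order, here is a small sketch. The `Contact` fields and the `call_sheet` helper are hypothetical, and the plan itself may simply hold this information as a table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    name: str
    role: str
    phone: str
    call_order: int  # 1 = call first


def call_sheet(contacts: list[Contact]) -> list[str]:
    """Return printable lines, ordered by who should be called first."""
    ordered = sorted(contacts, key=lambda c: c.call_order)
    return [f"{c.call_order}. {c.name} ({c.role}): {c.phone}" for c in ordered]
```

The output is plain text, so it can be printed and kept with the hard copy of the plan.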

Prioritization

Timing is a very important issue with a DR plan. Not every system is a top-priority system, so you need to rank each system or application. For example, authentication systems are more important than your print servers (unless your business is built around printing). Timing has to be realistic; you cannot expect to have all your systems up in 15 minutes.

Practice running through some possible scenarios. At least one scenario should involve a complete data center outage, and others should cover the loss of individual applications.

Another key factor is your people. Take a look at your resources: if some key people are not available during the disaster, will you be able to recover? If you are a small organization, make sure you have backup resources available. Make sure the recovery can be done remotely, because if you cannot access the physical data center, you may need remote access to get it up and running. You may also need a central location where staff can work if they cannot do the recovery from home. Are the documents and resources available there? Do you have vendors available to supply equipment to rebuild the infrastructure? If the internet is down, can you still mobilize your team? How will people travel to the alternate location? If your alternate location is in a different country, passports will have to be up to date.
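The ranking and the reality check on timing can be sketched in a few lines. This is an illustrative sketch under stated assumptions: systems are restored one at a time in priority order, and the `tier`, `restore_hours`, and `target_hours` fields are hypothetical names for the rank, the estimated time to restore, and the time the business expects the system back.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    tier: int             # 1 = restore first
    restore_hours: float  # estimated time to restore this system
    target_hours: float   # time after the outage by which it should be back


def recovery_order(systems: list[System]) -> list[System]:
    """Sort by tier, then by the tightest target within a tier."""
    return sorted(systems, key=lambda s: (s.tier, s.target_hours))


def missed_targets(systems: list[System]) -> list[str]:
    """Names of systems whose target is unrealistic when restoring one at a time."""
    elapsed = 0.0
    missed = []
    for s in recovery_order(systems):
        elapsed += s.restore_hours  # sequential restore: times add up
        if elapsed > s.target_hours:
            missed.append(s.name)
    return missed
```

If `missed_targets` returns anything, either the targets are too optimistic or more people and equipment are needed to restore systems in parallel.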

The Cloud

Today companies are using the cloud more and more. Do you know what your cloud provider’s capabilities are for DR testing? Some cloud providers offer SLAs for DR. Azure or Amazon may be a good option because you can deploy servers quickly. Build a relationship with cloud providers even if you are not going to use cloud solutions on an everyday basis. That way, if your hardware vendor cannot deliver all the systems you need in time, you can switch to a cloud provider.

Testing

The most important aspect of DR is testing it. A plan may look good on paper, but if you don’t test it, you will not know where the pitfalls are. You should test at least once a year. This seems like a lot, but systems change with service packs and software upgrades, and you need to make sure that these changes do not impact your recovery. There are two types of recovery test: complete infrastructure recovery and individual application recovery. An infrastructure recovery test can be very time consuming; it can take your whole IT staff several days to complete. An individual application recovery test takes fewer resources, and you can test different applications at different times of the year.
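One simple way to spread individual application tests across the year is a round-robin assignment. The sketch below assumes four test windows per year (for example, one per quarter); the function name and the default of four slots are illustrative choices, not a recommendation from any standard.

```python
def schedule_app_tests(apps: list[str], slots: int = 4) -> dict[int, list[str]]:
    """Assign each application to a test window, cycling through the windows."""
    schedule: dict[int, list[str]] = {slot: [] for slot in range(1, slots + 1)}
    for i, app in enumerate(apps):
        schedule[i % slots + 1].append(app)
    return schedule
```

Listing the applications in priority order puts the most critical ones in the earliest windows. The complete infrastructure test still needs its own dedicated block of time.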

Not every CIO/CTO wants to invest in DR, because of the cost involved. But the cost of not having and testing a DR plan could mean lost revenue and could ultimately cause your company to fail. Not all of these suggestions may make it into your company’s DR plan, but having and testing a plan at some level is necessary.