Over the past few weeks, a large financial institution encountered issues with a system upgrade that left customers in an unfortunate situation for an extended period. This incident was not unique. Many companies and institutions suffer unexpected, prolonged systems outages. Sometimes they’re due to scheduled maintenance or upgrades; other times they’re the result of equipment failure. Regardless, in the aftermath of a disaster the company’s level of preparedness gets put on display. Those that are the most prepared ride out the storm in the best shape.
From our experience at Anexinet, prolonged outages are most commonly caused by inadequate risk assessments, back-out plans, pre-activity testing, and disaster recovery procedures (including simulated recovery exercises), along with the lack of a suitable crisis communication plan. It boils down to formalized processes and good documentation. It’s not sexy (documentation rarely is), and for this reason it is often overlooked. Should you experience an outage, though, these items can make all the difference, particularly if assembled thoroughly and accurately. Now let’s walk through the areas mentioned above to determine where you and your company/department/group stand.
“By failing to prepare, you are preparing to fail.” – Benjamin Franklin
Risk Impact Assessment:
What aspects of an application or system could fail? What are the implications of a failed change? How likely are failures to occur, and how quickly can they be detected? Is a change back-out or rollback possible? What does this back-out plan look like (more on this below)? These are all difficult questions that must be addressed in advance. Whether or not you use a formalized process (you should), asking, pondering, and, most importantly, addressing these questions helps get the ball rolling toward preparedness. Many models and methodologies exist, and we won’t delve into specifics here. Suffice it to say that they share a common goal: ranking and prioritizing the risks and their impact on your organization.
What we see most often with clients is the lack of a formal process. Everything is done informally and ad hoc; frequently the knowledge exists only in the minds of a few people. Documenting with formalized templates ensures every business and technology issue is considered, and helps frame a discussion about the impact of change. Expanding the discussion to include all concerned business units allows for a more diverse and complete understanding of the implications and can help anticipate potential complications. Perspective tends to be quite narrow within IT departments, and reminders of the bigger picture help ensure a full impact analysis at all levels and functions.
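The questions above (what can fail, how likely it is, and how quickly it would be detected) can be turned into a simple ranking. The following is a minimal sketch, not a recommended methodology: it borrows the multiply-the-scores convention used in FMEA-style assessments, and the 1 to 5 scales, the `Risk` class, and the sample entries are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int     # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int         # assumed scale: 1 (minor) to 5 (severe)
    detectability: int  # assumed scale: 1 (noticed at once) to 5 (likely unnoticed)

    @property
    def priority(self) -> int:
        # Higher product means "address first". Multiplying the three scores
        # is an FMEA-style convention, not something this article prescribes.
        return self.likelihood * self.impact * self.detectability


def rank_risks(risks):
    """Return the risks ordered from highest to lowest priority score."""
    return sorted(risks, key=lambda r: r.priority, reverse=True)


# Hypothetical entries for a system upgrade.
risks = [
    Risk("Schema migration fails midway", likelihood=2, impact=5, detectability=2),
    Risk("Nightly ETL job breaks silently", likelihood=3, impact=4, detectability=5),
    Risk("Login page styling regression", likelihood=4, impact=1, detectability=1),
]

for risk in rank_risks(risks):
    print(f"{risk.priority:>3}  {risk.name}")
```

Even a rough table like this gives business units something concrete to challenge, which is often where the overlooked risks surface.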
Back-Out Plans:
Back-out plans are instruction sets that encompass all the actions required to undo (i.e., “back out of”) a change, upgrade, or maintenance activity. Depending on the situation, these instructions may range from a single declarative sentence to a long, complex document.
Are back-out plans standard procedure for every system change in your organization? Are they a mandatory component of your change management process? Regardless of the size of the change, are they required every time? Do you have thresholds for when back-out plans get peer reviewed? How about dry-run change testing? For larger tasks, a dry run (a line-by-line walk-through of the plan with stakeholders) can be a valuable exercise that often brings oversights or errors to light. For instance, it may reveal a single incorrect command that would otherwise have devastating consequences, or show that the backup/restore process requires prerequisite information unknown to the author. A dry run provides the opportunity to address such issues without the pressure of having to do so during a crisis.
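One way to make these questions routine is to encode them as a gate in the change management process. The sketch below is illustrative only: the `ChangeRequest` fields, the 1 to 5 risk scale, and the `REVIEW_THRESHOLD` value are hypothetical and would need to match your own process.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: changes at or above this level need extra scrutiny.
REVIEW_THRESHOLD = 3


@dataclass
class ChangeRequest:
    title: str
    risk_level: int                                   # assumed scale: 1 (trivial) to 5 (critical)
    backout_steps: list = field(default_factory=list)
    peer_reviewed: bool = False
    dry_run_done: bool = False


def readiness_gaps(change):
    """List what is still missing before the change should be approved."""
    gaps = []
    # Every change, however small, needs at least one back-out step.
    if not change.backout_steps:
        gaps.append("no back-out plan")
    # Riskier changes also need a peer review and a dry-run walk-through.
    if change.risk_level >= REVIEW_THRESHOLD:
        if not change.peer_reviewed:
            gaps.append("back-out plan not peer reviewed")
        if not change.dry_run_done:
            gaps.append("no dry-run walk-through")
    return gaps


# Hypothetical change record.
change = ChangeRequest(
    title="Upgrade billing database",
    risk_level=4,
    backout_steps=["Stop application", "Restore pre-change backup", "Restart application"],
    peer_reviewed=True,
)
print(readiness_gaps(change))
```

An empty list means the change has cleared the gate; anything else is a prompt for a conversation before the maintenance window, not during it.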
How much time do you have to initiate a back-out plan? Can you execute it after you go live? If so, how long do you have? How do you backport any new data to the pre-change system once rollback is initiated, if this is even possible? If it is not possible, your decision window before allowing end users back into the system is likely quite short. So what additional validation is in place to ensure confidence in the system? The answers to these questions can vary wildly, and success is rarely certain during the consideration process. But even with a low chance of success, the change has at least been thoroughly socialized within the organization and fully vetted by all business units and levels of management/stakeholders. Once signoff is attained from these groups, the perceived fault for any issues will not rest solely on the IT department; the organization will have proceeded with full knowledge of the risks.
Disaster Recovery Procedures:
A typical Disaster Recovery Plan includes procedures, instructions, and/or runbooks that detail how systems and applications are recovered. Sample instructions contained in these documents include how to fail over servers, databases, applications, etc. Some methods may be automated, but preparedness for manual failover must also be assured. What should be done if a change step freezes halfway through? What steps does one have to work through to ensure its completion? Or is manual failback/recovery necessary? Once all disaster recovery systems are up and functional, does anything need to be reconfigured? Maybe the servers need new settings, such as IP addressing, because they are in a different location? Is there a documented list of these settings, and a list of scripts or procedures to run through to ensure systems/applications are 100% functional on the other side?
Documenting these scenarios before a crisis occurs gives engineering resources an easy cheat sheet during a chaotic disaster situation. It’s certainly possible that technical resources may not even be available in a disaster, so this documentation will also enable someone less technical to fail over the systems. Regardless of who’s present, comprehensive documentation means nobody misses a step. It’s the reason pilots with thousands of flying hours still run through their checklist every time they fly. Documenting recovery steps allows for easy failover and relieves any one person of the responsibility of keeping the information “in their head.”
Many companies with solid DR plans and documentation fail to update them as they change the systems. When was the last time your DR documents were updated? Was it after that massive application upgrade from last quarter? Were they updated to account for the new module you implemented? What about when you upgraded Oracle 10g to 12c? Does the new version of Oracle Data Guard still use the same commands? Or have your DR documents been left untouched since they were written five years ago? Having DR procedures recorded is great, and it puts you miles ahead of most organizations, but failing to keep them up to date is basically the same as not having them in the first place.
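A staleness check along these lines can be automated if you track when each system last changed and when its DR document was last reviewed. This is a minimal sketch under that assumption; the `stale_documents` function, the system names, and the dates are invented for illustration.

```python
from datetime import date


def stale_documents(doc_reviewed_on, system_changed_on):
    """Return systems whose DR document predates the latest system change.

    Both arguments map a system name to a date. A system with no DR
    document at all is reported as stale too.
    """
    stale = []
    for system, changed in sorted(system_changed_on.items()):
        reviewed = doc_reviewed_on.get(system)
        if reviewed is None or reviewed < changed:
            stale.append(system)
    return stale


# Hypothetical inventory data.
doc_reviewed_on = {
    "billing": date(2023, 1, 15),
    "crm": date(2024, 6, 1),
}
system_changed_on = {
    "billing": date(2024, 3, 10),    # upgraded after the last document review
    "crm": date(2024, 2, 20),        # document reviewed after this change
    "reporting": date(2024, 5, 5),   # no DR document on record
}

print(stale_documents(doc_reviewed_on, system_changed_on))
```

Running a report like this as part of change closure keeps the documents tied to the systems they describe.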
When was the last time you tested the procedures? Did you perform a full test or a dry run? Regularly ensure your procedure documentation is accurate and that everything is working as it should. Perhaps someone modified a startup script on Server A in Site A, but the change never made it over to Server DR in Site DR. Your procedure is still correct in that the startup scripts need to run, and they will run at the DR site, but because the DR copy is missing updates, your database/application might not start up correctly. It’s a lot easier to catch these and other issues during a controlled test than during the frustrating unpredictability of an actual disaster.
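The startup-script scenario is a form of configuration drift, and it can be caught between tests by comparing file fingerprints across sites. The sketch below assumes you can collect the relevant files from both servers into `{path: bytes}` snapshots; the `find_drift` helper, the paths, and the file contents are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest of the file content."""
    return hashlib.sha256(content).hexdigest()


def find_drift(primary_files, dr_files):
    """Compare two {path: bytes} snapshots and describe every mismatch."""
    drift = {}
    for path in sorted(set(primary_files) | set(dr_files)):
        if path not in dr_files:
            drift[path] = "missing at DR site"
        elif path not in primary_files:
            drift[path] = "exists only at DR site"
        elif fingerprint(primary_files[path]) != fingerprint(dr_files[path]):
            drift[path] = "content differs"
    return drift


# Hypothetical snapshots: the startup script was changed on the primary only.
primary = {
    "/etc/init.d/app": b"start --pool-size 20\n",
    "/etc/app/db.conf": b"host=db\n",
}
dr = {
    "/etc/init.d/app": b"start --pool-size 10\n",
    "/etc/app/db.conf": b"host=db\n",
}

print(find_drift(primary, dr))
```

Files that are supposed to differ between sites, such as those holding IP addresses, would need to be excluded or compared against their own expected values.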
Testing:
Use cases, acceptance testing, regression testing: all these things (and more) make up a testing regimen. An entire blog post could be devoted solely to software testing. In summary, software systems are complex, so it follows that testing them is equally complex. It’s not fun. It can be expensive. And with larger systems, it can be a mountain of work. Shortcuts are inevitable. Keeping an organization’s testing sharp is an immense challenge, but one that must be undertaken.
In many cases, organizations overlook testing right from the start. Perhaps the initial software was small, maintained by an equally small development team, and everyone on the team knew the software inside and out. The team could immediately visualize the potential impacts of every change or addition. Nothing ever escaped their grasp. If this is the case—fantastic! But this is not sustainable long term. Code becomes more complex, teams grow, people get promoted or leave, and the knowledge eventually becomes specialized, siloed and fragmented.
This situation also applies to off-the-shelf software, which interfaces with other systems and is just as susceptible to bugs. For example, a change that reorganizes data more efficiently in one system can break extract, transform, and load (ETL) jobs in another. Always try to tease out the issues yourself, lest your customers find them for you.
Crisis Communication Plan:
A Crisis Communication Plan outlines the basics of who, what, where, when, and how information gets communicated in a crisis. Like most of the above, its goal is to have many items figured out beforehand so they don’t need to be made up/decided upon on the fly in the midst of a trying situation.
Do you need a formal crisis team? If so, who should be on it? Who should be tasked with communicating with the public? Who should communicate with internal entities? How often should communications be sent out? What is the process to ensure everyone receives consistent and updated information? Is there a call tree for updating members of the organization? Depending on the size and shape of the organization, multiple individuals may communicate with the public via different channels. Ensuring consistent information across the board eliminates confusion and frustration on the part of stakeholders, clients, etc.
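A call tree is simply a mapping from each person to the people they are responsible for notifying. As a rough sketch, assuming the tree is stored as a dictionary, a breadth-first walk produces the order in which calls go out. The `call_order` function and the roles shown are hypothetical.

```python
from collections import deque


def call_order(call_tree, root):
    """Walk a call tree breadth-first and return (caller, callee) pairs in order."""
    calls = []
    seen = {root}
    queue = deque([root])
    while queue:
        caller = queue.popleft()
        for callee in call_tree.get(caller, []):
            if callee in seen:  # skip repeats so nobody is called twice
                continue
            seen.add(callee)
            calls.append((caller, callee))
            queue.append(callee)
    return calls


# Hypothetical call tree: each key notifies the people in its list.
call_tree = {
    "Incident lead": ["IT manager", "Communications lead"],
    "IT manager": ["DBA on call", "Network on call"],
    "Communications lead": ["Customer support lead"],
}

for caller, callee in call_order(call_tree, "Incident lead"):
    print(f"{caller} -> {callee}")
```

Keeping the tree in a structured form also makes it easy to check that every member of the organization is reachable from the top.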
Is it possible to anticipate a crisis? Maybe not. But in the abstract, there are certainly some disruptive events that can be planned for: Organizational (mergers, acquisitions), IT (system upgrades/outages), Product (defects or recalls). These preparatory exercises help your organization formulate hypothetical responses and ensure the right stakeholders are identified no matter the nature of the disruption.
“You can plan a pretty picnic. But you can’t predict the weather, Ms. Jackson.” – Andre 3000
Being properly prepared can be difficult, costly and time-consuming. Chances are, however, it will be far less costly than a poorly handled crisis. Markets today are very competitive; your competition is looking for any toehold it can find. Why let them gain advantage when it can be avoided with proper planning and testing? This might sound like common sense. But to put it another way: everyone knows fruits and vegetables are good for you, but not everyone eats them on a regular basis. But they should. Sometimes folks just need to be reminded of the perils of inaction and unpreparedness. So ask yourself: just how prepared are you, really, for an IT crisis?