We’re rounding out our WAF blog series with the final pillar of the AWS Well-Architected Framework: Operational Excellence. Each pillar of the Well-Architected Framework includes a set of design principles that define that pillar’s focus. While the other pillars benefit both the business and technical sides, Operational Excellence focuses squarely on delivering business value.
Let’s check back in with our favorite retail chain, Echidna Electronics. Echidna has decided it’s time to update its customer recommendation software, Project Kookaburra. Talented analytics techs created some great machine-learning algorithms and statistical models they believe will increase sales 3–5x.
To test these new algorithms and models, they jump into the console, stand up a few EC2 instances, copy the code from their laptops onto the EC2s, and start testing. After a week or so, everything is performing better than expected. Their next step is to copy the code to production (set and forget), and the team is done. Fantastic workflow and process!
Tips for Incorporating the AWS Operational Excellence Pillar Framework
Perform operations as code
Hopefully, the issues with the above process are obvious: it relies on manual steps, from copying code by hand to deploying infrastructure through the console. To succeed in the cloud game, automation must be leveraged. Defining your infrastructure, configuration, and application as code is key to success. Minimizing human interaction in deployments only improves consistency. Take advantage of tools like CodeBuild, CodeDeploy, and CodePipeline to implement code testing, or use CloudFormation to keep your infrastructure deployments consistent.
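As a minimal illustration of operations as code, the sketch below builds a CloudFormation template programmatically instead of hand-creating instances in the console. The resource name, AMI ID, and instance type are illustrative assumptions, not Echidna’s actual configuration:

```python
import json

def build_template(instance_type="t3.micro", ami_id="ami-12345678"):
    """Build a minimal CloudFormation template as a Python dict.

    The single EC2 resource stands in for Echidna's test fleet. In
    practice this template would live in version control and be
    deployed through a pipeline, giving you exact, versioned
    documentation of the environment.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Kookaburra test environment (illustrative)",
        "Resources": {
            "KookaburraInstance": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "InstanceType": instance_type,
                    "ImageId": ami_id,
                },
            }
        },
    }

if __name__ == "__main__":
    # Render the template; this JSON is what a tool like the AWS CLI
    # or a CodePipeline stage would actually deploy.
    print(json.dumps(build_template(), indent=2))
```

Because the template is generated, every build produces an exact record of what was deployed, which is the documentation benefit described below.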
The Echidna techs ran through the runbook for deploying their instances, but it took them an hour longer than usual because they overlooked a few steps and had to backtrack. This is where the first principle helps: once your deployments run as code, the output provides exact documentation of your environments. For every build that runs, an updated record is available for review.
While environment documentation is an administrator’s favorite job (or maybe a close second), this kind of process simplifies it dramatically. It won’t draw Visios or write the Word documents, but it will provide a point of reference that’s accurate and (hopefully) versioned.
Make frequent, small, reversible changes
Echidna’s analytics techs learned a thing or two from other company departments. Last year, the website had an outage caused by a bad, improperly tested update, resulting in a large loss of revenue.
From that point on, all updates would be small and incremental, allowing for quick reversal upon failure. The ability to pinpoint issues in individual deployments restores some sanity to the on-call personnel.
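A small, reversible change is one you can back out the moment it misbehaves. The sketch below is a hypothetical illustration of that idea: activate a new version, check a health signal, and revert on failure. The `health_check` callback is a stand-in for real monitoring, such as an error-rate alarm:

```python
def deploy_with_rollback(current_version, new_version, health_check):
    """Activate new_version, but revert to current_version if the
    health check fails. Small changes make this reversal cheap."""
    active = new_version
    if not health_check(active):
        # Quick reversal upon failure: the previous version is still
        # at hand because the change was small and incremental.
        active = current_version
    return active

# Usage: a failing health check on v1.5 triggers rollback to v1.4.
active = deploy_with_rollback("v1.4", "v1.5", lambda v: v != "v1.5")
```

The same pattern is what managed tooling like CodeDeploy’s automatic rollback gives you out of the box; the point is that the reversal path is defined before the deployment, not improvised during an outage.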
Refine operations procedures frequently
Note that when it came to deploying their code, Echidna used the good ol’ ‘set and forget’ method. We’ve all been there, done that. Not many people want to break something that’s already working “just fine.”
Finding the time to review current workloads to identify possible enhancements or even cost-cutting approaches often proves advantageous. This not only gives the team a refresher on the workload configuration but also helps validate current procedures.
During the website outage mentioned above, the CTO stormed into the IT Director’s office:
CTO: “The site is down. We’re losing business fast. How quickly can we get the site back up?”
IT Director: “We take backups every 5 minutes.”
CTO: “Good. Restore the latest.”
IT Director: “OK. To further set expectations, know this has never been properly tested.”
This conversation is like nails on a chalkboard. It’s one thing to have a procedure in place for failure; it’s another not to test it frequently. This applies not only to Disaster Recovery scenarios but also to current workloads. Are there any single points of failure? Do you know how each subsystem affects the others? Have you analyzed the risks and mitigated them to the best of your abilities? Failures are inevitable, so have a procedure in place and test it frequently. You’ll all sleep better at night as a result.
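Testing a recovery procedure can be as simple as proving the round trip works before you need it. The sketch below is a toy stand-in, assuming JSON serialization in place of real snapshot tooling, but the drill is the same: restore a backup and verify you get the original state back:

```python
import json

def backup(state):
    # Stand-in for taking a snapshot every 5 minutes.
    return json.dumps(state)

def restore(snapshot):
    # Stand-in for restoring that snapshot into a staging environment.
    return json.loads(snapshot)

def test_restore_round_trip():
    state = {"orders": [1, 2, 3], "inventory": {"sku42": 7}}
    # The drill: a restore that has never run is not a procedure,
    # it's a hope. Prove the round trip before the outage.
    assert restore(backup(state)) == state

test_restore_round_trip()
```

Run this kind of drill on a schedule and the IT Director’s answer becomes “we tested the restore last Tuesday,” not “this has never been properly tested.”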
Learn from all operational failures
While this is somewhat self-explanatory, it can be a hard process to follow. You’re in a post-mortem, discussing the findings. Take this opportunity to put the proper policies, procedures, and contingencies in place to prevent the same mistake from happening twice. Then take a step back and make sure you’ve mitigated any remaining risks as effectively as possible.
This brings us to some additional areas of focus, which can also be found in the other pillars:
- Prepare: This highlights why things are done the way they are, from a business perspective as well as a technical one. The business provides clear goals and priorities that can guide designs and processes.
- Operate: Once you understand the design, determine how the workload will be monitored. Identify a proper baseline with the metrics that define a normal workload. Create alerts and dashboards for metrics and events that exceed the bounds of normal behavior. Create runbooks based on operational events and define escalations as needed.
- Evolve: This phase combines design insights with the results of monitoring to provide data for future enhancements. Analyze trends, review any lessons, and look for improvements in deployment and testing.
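To make the Operate phase concrete, here is a minimal sketch of baselining a metric and flagging readings that exceed the bounds of normal behavior. The three-sigma threshold and the latency samples are illustrative assumptions; in practice the baseline would come from real monitoring data and feed an alarm:

```python
import statistics

def alert_threshold(samples, sigmas=3):
    """Upper bound of 'normal': mean plus a few standard deviations."""
    return statistics.mean(samples) + sigmas * statistics.stdev(samples)

def exceeds_baseline(value, samples):
    """True when a new reading falls outside the normal bounds."""
    return value > alert_threshold(samples)

# Baseline built from a week of (illustrative) latency readings in ms.
latency_ms = [100, 105, 98, 102, 101, 99, 103]

exceeds_baseline(250, latency_ms)   # a spike worth alerting on
exceeds_baseline(104, latency_ms)   # within the bounds of normal behavior
```

A managed service such as CloudWatch plays this role in production; the value of sketching it out is deciding, up front, what “normal” means for your workload and what should wake someone up.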
As you review your workloads, take the time to understand why the system is set up as it is. Ask questions to ensure business requirements are covered. Look for areas of improvement, not only in the technical details but also in procedures and operations.
To learn more about the Operational Excellence Pillar of the AWS Well-Architected Framework, check out the official AWS white paper: Operational Excellence Pillar. And if you’d like to see how your application stacks up, please feel free to schedule your FREE Well-Architected Review with a Certified AWS Solutions Architect from Anexinet.