One pillar of the AWS Well-Architected Framework is reliability. Naturally, to be effective, the applications and services that power your business must be available. When a critical service is down, your coworkers can’t do their jobs, and the business suffers. Understanding the relationship between service availability and business impact is critical to designing and adhering to Service Level Agreements (SLAs). Does it matter if an application is down for one minute? Fifteen minutes? An hour? Generally speaking, the more available a service needs to be, the more expensive it is to run.
So, when designing for availability, your first step is to answer these two fundamental questions:
- How much does downtime of this application cost the business?
- How much would it cost to improve its reliability?
These questions are more complex than they appear. Let’s look at the first one: how much does it cost when an application is unavailable? Several factors feed into the answer, and weighing them in a vacuum can yield a figure that’s too abstract to act on. To make things concrete, let’s consider the potential impact of downtime on two hypothetical applications.
Let’s say that you run a retail store chain called Echidna Electronics. You use a sales-monitoring application called SalesViz, which ingests data from your Point of Sale (POS) systems and generates reports used by procurement, sales, and finance. Some departments consume the information directly through a dashboard that is part of the application; for others, the analyzed data is ingested by auxiliary systems (e.g., the ERP application). What happens if SalesViz goes offline? If the system is down for an hour, there is no significant impact to the business. Based on interviews with the departments that use the application, downtime will not impede their ability to meet work objectives for at least four hours. After that, lost productivity costs roughly $1,000 per hour. If the downtime exceeds a day, the impact increases to $5,000 an hour due to a cascading effect in dependent systems. Based on this business impact, you can safely sustain an hour of downtime during production hours once per month. With a twelve-hour production day, that works out to an uptime requirement of about 99.7%.
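The arithmetic behind that 99.7% figure is straightforward. A minimal sketch, assuming a 30-day month of twelve-hour production days as described above:

```python
# Uptime requirement for SalesViz: one hour of tolerable downtime
# per month, measured against a 12-hour production day.

HOURS_PER_DAY = 12          # production hours per day
DAYS_PER_MONTH = 30         # assumed 30-day month (illustrative)
ALLOWED_DOWNTIME_HOURS = 1  # tolerable downtime per month

production_hours = HOURS_PER_DAY * DAYS_PER_MONTH  # 360 hours/month
uptime = (production_hours - ALLOWED_DOWNTIME_HOURS) / production_hours

print(f"Required uptime: {uptime:.1%}")  # → Required uptime: 99.7%
```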
The store’s POS system is Wombat, the primary way all business is transacted at the retail locations. Each store has two or more registers connected to a central Wombat application hosted in the cloud. Wombat uploads sales metrics to SalesViz on a rolling basis and performs a closeout each night with SalesViz and the ERP system. What does it mean if the Wombat application is offline? The registers can cache sales for up to fifteen minutes, after which they stop accepting transactions. The retail stores average $30,000 in sales per hour, but during Christmas this can reach $100,000 per hour. In a worst-case scenario, a Wombat outage would cost the company $100,000 per hour. That’s pretty significant! Based on this cost, it makes sense to keep Wombat as available as possible. But how available is that? Here’s a handy chart based on the 12-hour business day:
- 99.9% – 4.4 hours of downtime annually
- 99.99% – 26.3 minutes of downtime annually
- 99.999% – 2.6 minutes of downtime annually
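These figures follow directly from the 12-hour business day (12 × 365 = 4,380 operating hours a year); a short sketch to reproduce them:

```python
# Annual downtime permitted by an availability target, measured
# against a 12-hour business day (12 * 365 = 4,380 hours/year).

BUSINESS_HOURS_PER_YEAR = 12 * 365

def annual_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted by the given availability."""
    return BUSINESS_HOURS_PER_YEAR * (1 - availability) * 60

for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {annual_downtime_minutes(target):.1f} minutes/year")
```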
It appears that a goal of 99.99% uptime would satisfy the business requirements, while a goal of five 9’s would probably be too costly. The reason is that each component comprising the application would need better than five 9’s availability. When looking at a system as a whole, you must factor in the uptime of all components, which means the uptime of the system will always be lower than that of its component parts. When components are arranged in series, each one contributes its downtime separately, so failures compound.
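This compounding effect is easy to quantify: for components in series, availabilities multiply. A minimal sketch (the component figures are illustrative, not taken from any AWS SLA):

```python
from math import prod

# Serial system: a request must pass through every component, so
# system availability is the product of the component availabilities.

def serial_availability(components: list[float]) -> float:
    """Availability of a chain of serially dependent components."""
    return prod(components)

# Illustrative example: load balancer, app server, and database,
# each individually offering 99.99% availability.
components = [0.9999, 0.9999, 0.9999]
system = serial_availability(components)

print(f"System availability: {system:.4%}")  # lower than any single component
```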
Now that we understand the requirements of each application, we can review its existing components and determine whether they can support the required SLA. In a cloud-based deployment this is simplified, since vendors like AWS publish an SLA for each of their services. They also provide tools and infrastructure, such as Elastic Load Balancing and Availability Zones, that help deliver the necessary level of uptime for an application. In the case of the Wombat POS, you would deploy the Wombat servers in a cluster across two or more Availability Zones and place them in an Auto Scaling group. This provides resiliency against an Availability Zone failure or a Wombat server crash. Additionally, the pre-baked AMIs and the data stored by the Wombat servers should be replicated to another region to provide cross-region redundancy. As noted previously, the registers in the stores can cache transactions locally, so to avoid significant business impact, failover of the Wombat application to another region must take less than fifteen minutes. You can also use Route 53 to automate traffic routing from the registers to the backend Wombat application.
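As a sketch of the Route 53 piece, DNS failover pairs a health-checked primary record with a secondary record pointing at the standby region. The domain names, IP addresses, and health-check ID below are hypothetical placeholders:

```json
{
  "Comment": "Failover records for the Wombat POS endpoint (illustrative)",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "wombat.example.com.",
        "Type": "A",
        "SetIdentifier": "wombat-primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "203.0.113.10" }],
        "HealthCheckId": "example-primary-health-check-id"
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "wombat.example.com.",
        "Type": "A",
        "SetIdentifier": "wombat-secondary",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "198.51.100.10" }]
      }
    }
  ]
}
```

A change batch like this can be applied with `aws route53 change-resource-record-sets`; keeping the TTL low helps the registers pick up the failover address well inside the fifteen-minute cache window.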
The SalesViz application can tolerate much higher levels of downtime. It may therefore be sufficient to run the system in a single Availability Zone and take occasional snapshots of the EBS volumes. Those snapshots can be mounted to a new instance in another Availability Zone (or region) as needed. This is significantly cheaper than the Wombat solution while still providing the necessary service level to the business.
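One way to operationalize this, sketched with the AWS CLI (all resource IDs and the Availability Zone are placeholders; cross-region recovery would additionally require copying the snapshot with `aws ec2 copy-snapshot`):

```shell
# Take a periodic snapshot of the SalesViz data volume:
aws ec2 create-snapshot \
    --volume-id vol-0123456789abcdef0 \
    --description "SalesViz nightly snapshot"

# To recover, create a volume from the latest snapshot in another AZ:
aws ec2 create-volume \
    --snapshot-id snap-0123456789abcdef0 \
    --availability-zone us-east-1b

# ...and attach it to a replacement instance:
aws ec2 attach-volume \
    --volume-id vol-0fedcba9876543210 \
    --instance-id i-0123456789abcdef0 \
    --device /dev/sdf
```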
To learn more about the Reliability Pillar of the AWS Well-Architected Framework, check out AWS’s official whitepaper. And if you’d like to see how your own application stacks up, please don’t hesitate to schedule your FREE Well-Architected Review with one of our Certified AWS Solutions Architects.
Director, Cloud Solutions and Microsoft MVP: Cloud (Azure/Azure Stack) & DC Mgmt