There has been quite a bit of press coverage regarding the outage at Amazon this week. Much of this coverage has focused on how the outage has brought down many popular websites such as Reddit, Quora and Foursquare. The point that seems to be getting missed here is that by failing to plan, these companies planned to fail. Technology professionals know that data centers fail all the time. Data center failures are a fact of life that must be planned for and dealt with. While it’s true that Amazon did not live up to expectations, they did not actually violate their service level agreement (SLA), as Gartner pointed out:
Amazon’s SLA for EC2 is 99.95% for multi-AZ deployments. That means that you should expect that you can have about 4.5 hours of total region downtime each year without Amazon violating their SLA. Note, by the way, that this outage does not actually violate their SLA. Their SLA defines unavailability as a lack of external connectivity to EC2 instances, coupled with the inability to provision working instances. In this case, EC2 was just fine by that definition. It was EBS and RDS which weren’t, and neither of those services have SLAs.
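To make the SLA arithmetic concrete, here is a minimal sketch that converts an availability percentage into the downtime it permits per year. The function name and the 365-day year are my own assumptions for illustration; they are not part of Amazon’s SLA or Gartner’s note. For 99.95% the result is a little under the "about 4.5 hours" quoted above.

```python
# Hypothetical helper: converts an SLA availability percentage into the
# hours of downtime it allows per year. Assumes a 365-day year.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def allowed_downtime_hours(sla_percent: float) -> float:
    """Return the yearly downtime (in hours) permitted by the given SLA."""
    return (1 - sla_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    # 99.95% is the EC2 figure quoted above.
    print(f"{allowed_downtime_hours(99.95):.2f} hours per year")
```

Keep in mind this only counts time the provider itself defines as "unavailable", which, as noted above, did not include the EBS and RDS problems.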
Amazon is an infrastructure as a service (IaaS) provider, which means they provide the hardware and low-level software used to support Cloud-based applications. The beauty of the IaaS model is that you can design and build an application any way you see fit based on your individual requirements. If your application requires high availability and you choose not to address that requirement in your design, then you have introduced risk into your environment. We call this design shortcoming “technical debt.” I have blogged extensively on the subject, and those posts can be referenced if additional background is needed.
The principal amount of this technical debt is the cost of implementing the required redundancy. The interest is the cost of the additional risk associated with not having appropriate levels of redundancy. There are several ways to assign dollars to risk but none of them are perfect. The most straightforward approach is to estimate the cost of a failure and then multiply by the probability it will occur. Let’s say Foursquare estimates that the cost of their website going down for 24 hours is one million dollars. Based on the optimal design and implementation of the application, the probability of such an outage is 0.5 percent a year. However, because Foursquare took some design shortcuts the probability increased to 4 percent a year. The interest on this technical debt can be calculated as follows.
Incremental Risk: 4% - 0.5% = 3.5%
Cost of Failure: $1,000,000
Interest: $35,000
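The same calculation can be sketched in a few lines of Python. This is only an illustration of the "cost of failure times incremental probability" approach described above. The function and parameter names are hypothetical, and the inputs are the illustrative Foursquare figures from this post, not real data.

```python
# Hypothetical helper illustrating the interest calculation described above.
def technical_debt_interest(cost_of_failure: float,
                            actual_probability: float,
                            optimal_probability: float) -> float:
    """Yearly 'interest' on technical debt, expressed as incremental risk in dollars."""
    incremental_risk = actual_probability - optimal_probability
    return cost_of_failure * incremental_risk


# Illustrative numbers from the example: a $1,000,000 outage,
# 4% yearly probability with shortcuts versus 0.5% with the optimal design.
interest = technical_debt_interest(1_000_000, 0.04, 0.005)

if __name__ == "__main__":
    print(f"Interest: ${interest:,.0f} per year")  # prints "Interest: $35,000 per year"
```

Comparing that yearly figure against the one-time cost of adding the redundancy (the principal) is what gives you the ROI.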
The fact is that Cloud redundancy is cheap and Foursquare would have achieved an astronomical ROI by implementing it. The culprit in this outage is not the Cloud. It is technical debt. Blaming the Cloud for these outages would be like blaming your hard drive manufacturer for lost data when it fails. Everyone knows hard drives fail and you should always have a backup. If your hard drive happened to be backed up by one of the many tools that leverage Amazon’s S3 service, that’s not technical debt. That’s just bad luck!

8 Responses to Failing to Plan is Planning to Fail
I really like the ROI calculation, short and to the point. Redundancy is a necessity these days, everyone eventually has their bad day.
Mike
I see quite a lot of bloggers pointing at Foursquare and other EC2 users saying they “Should have designed for a failing datacenter” while they actually did. Amazon explains their Availability-Zones as being separate physical locations:
“Each availability zone runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone.”
So while I think your explanation of technical debt is fine, you should have done some more research before pointing at Reddit / Foursquare engineers saying they made a wrong decision. The blame is with Amazon for failing to uphold their contract of making Availability Zones completely isolated from each other.
Thanks for the note Tomas. We run test servers on EC2 regularly so I’m quite familiar with how Amazon characterizes their AZs. Because we only use EC2 as a test environment, we don’t have to build redundancy into our design. However, if we did I sure wouldn’t rely on some overly optimistic verbiage on Amazon’s website that their marketing department probably wrote. The phrase “caveat emptor” should be applied to any purchasing scenario, especially those that have life or death implications to your business.
Even if Amazon did physically isolate their AZs, there are still all kinds of events that would impact a regional center. Despite Amazon’s optimistic perspective on the matter, earthquakes, pandemics, blackouts, floods, hurricanes, terrorist attacks, etc. would likely bring a regional data center down. I would encourage you to take a look at a concept called the Maximum Tolerable Period of Disruption (http://acwr.us/hVkqez), which Foursquare and many other social media sites clearly don’t understand. Foursquare in particular has been having outages because of their poor business continuity planning since they launched. Last year TechCrunch called them the next Twitter (http://acwr.us/eMf5DI).
The blessing and the curse of the Cloud is that it allows two 15 year old kids who come up with a great idea in their mom’s basement to put a scalable application into production literally overnight. While that will no doubt increase innovation significantly, keeping the barriers to entry that low will bring more people into the market with limited perspective and a lack of appropriate risk management skills.
While Amazon is not without fault, Foursquare is responsible for protecting their business. When you let a statement on a vendor’s website override common sense, you’re on your own in my book. Here’s some advice for folks who take that approach. If you ever buy a hard drive and the manufacturer tells you that it will never fail, back it up anyway.
I understand technical debt to be something suboptimal done that actually achieves an objective with a shortcut or hack as part of the core product. If you fail to improve or pay back the hack, you’ll be paying high interest in lack of maintainability and/or painful system upgrades.
What you are describing to me sounds like simple failure to plan for disaster.
Thoughts?
Thanks for the note Dan. I’m of the opinion that Foursquare (and others) did take shortcuts in the design of their application by postponing the implementation of the required level of availability. When requirements are postponed or shortcuts are taken to implement them, the organization pays interest. In this particular case, the interest comes in the form of incremental risk. Clearly that architecture shortcoming had financial implications that fit the debt model perfectly. There was a principal (cost of required redundancy) and interest (increased risk) associated with the gap. Using these components to calculate an ROI would have allowed Foursquare to prioritize addressing this gap over other enhancements that provided less return.
As you probably know, technical debt is a concept that was born and bred in the development community, so it has historically taken a developer-oriented perspective. I have proposed a new definitional framework for the concept in a paper submitted to Carnegie Mellon’s 2nd Annual Workshop on Technical Debt (http://acwr.us/fiPpmg). The paper was accepted and I am presenting it in Hawaii next month. Ward Cunningham, who first coined the term, is on the program committee, so it will be interesting to get his perspective and that of the rest of the committee.
An overview of the proposed definitional framework is available (http://acwr.us/hGbUQM) if you’re interested. I’d love to hear your feedback!
Can you share the contents or a copy of An Enterprise Perspective on Technical Debt (Klinger, Tarr, Wagstrom, Williams, IBM Watson Research, USA)?
I am interested in establishing an enterprise definition for Technical Debt outside of the development or code paradigm. How does it relate throughout the stack (Solution Pattern) for any given solution?
Hi Todd. Sorry for the delayed response. I was actually at the ICSE conference when you posted this so I missed it. My apologies. The paper is available for purchase through the ACM for a pretty reasonable price. I think it’s something like $10. If you want a copy of the presentation IBM did for the paper you can find it at the link below.
http://www.sei.cmu.edu/community/td2011/program/upload/7_Tarr_MTD_2011.pdf
I’m actually working with the SEI to put together a new community-driven portal on technical debt where research like this will be published for free. It will be located at http://technicaldebt.org. The site is NOT up yet and we’re just getting started on the planning process. Hopefully, information like this will be more readily available through this portal.
[...] · Ted Theodoropoulos, http://blog.acrowire.com/cloud-computing/failing-to-plan-is-planning-to-fail [...]