Lessons of Amazon Cloud Outage

I happen to be one of the luckier website owners using AWS. My site was hosted in AWS-East (in Virginia), where a major outage knocked out many sites, including Reddit, FourSquare and BigDoor, but somehow my availability zone (us-east-1d) recovered earlier than the other availability zones in AWS-East. Yesterday evening (April 23rd) at 7 pm PDT, AWS finally reported that Elastic Compute Cloud (EC2) services were working normally across all availability zones within their Virginia data center. Considering that it took almost 3 days to get to this point, and that there are still problems with restoring specific instances for various customers, the very heated discussion going on in technology blogs about the viability of cloud services is quite understandable.

Amazon has built 5 AWS data centers so far. These are in:

  • Ashburn, VA
  • Palo Alto, CA
  • Dublin, Ireland
  • Singapore
  • Tokyo, Japan

In each of these locations, Amazon built multiple independent rooms/halls, which it calls availability zones. For example, the Ashburn location has 4 distinct availability zones. Similarly, the Palo Alto and Dublin data centers have three availability zones each, while Singapore and Tokyo offer two. Amazon claimed that availability zones are independent of each other in terms of physical infrastructure. The following excerpt is from the AWS EC2 FAQ.

“Each availability zone runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone.”

Until Thursday, April 21st 2011, AWS had not suffered an outage impacting multiple availability zones at once. All of its earlier outages were limited to a single availability zone. For such outages, AWS recommends the use of load balancers within a data center, which allow traffic to be redirected to different availability zones.

The latest AWS outage has shown that Amazon’s assumptions about “fate-sharing” were not entirely true. Based on the extensive amount of information on techno-blogs, it is apparent that the original network failure in one of the availability zones triggered the replication of disk data across the other zones in Ashburn, as well as more traffic and more server instances in those zones. This resulted in a cascading outage that impacted all availability zones in Ashburn for almost 12 hours on Thursday. Eventually Amazon managed to recover three out of four availability zones, and since then the impact of the outage has been limited to zone B in Ashburn.

As in the case of any such event, many people came out with very strong words on the impact of this outage on the fate of cloud computing. Ultimately the analysis boils down to these two extremes:

  • “Sky is falling” camp: Cloud computing is inherently more complex than necessary. All those features that allow sharing, dynamic resource allocation and virtualization come at a great cost. This unnecessary complexity is the soft underbelly of cloud computing.
  • “Sky is bluer than ever” camp: Cloud computing provides scale, elasticity and (yes!) reliability much better than the standard model of building/hosting equipment. Design your application and system accordingly, e.g., use geo-redundancy, build your application for failures, and if you are paranoid enough, use not one but multiple clouds.

The biggest fallacy of the pessimist view is its inability to provide the necessary flexibility in resource allocation. If a systems architect wants to design a fully redundant service over two geographically diverse locations, she needs to provide 250% of the peak capacity, assuming that the engineering limit is 80% of the capacity of resources at any given site: each site must be able to carry the full peak load alone while staying within that 80% limit, which means 125% of peak per site. Considering that many web/mobile services have very unpredictable growth curves, building 250% of peak capacity from day one is a disastrous decision. That is why many data centers today are full of rows of racks of very lightly used equipment. Certainly it is possible to increase the number of locations for equipment to reduce the total peak capacity that needs to be deployed. The following graph shows a scenario where 100 different applications of the same load (1 Unit Load) are deployed over multiple sites. The graph clearly depicts that when the redundancy scheme is simpler, e.g., 1+1, the amount of resources necessary to support these applications is much higher. This graph assumes resources are dynamically shared among multiple applications per the principles of cloud computing. If such dynamic sharing isn’t available, total resources will be independent of the number of sites. In that case, if every application uses a 1+1 redundancy scheme, one would end up building 250% capacity for every application, irrespective of the number of available sites.
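The 250% figure can be checked with a few lines of arithmetic. The sketch below is only an illustration of that calculation, not a sizing tool; the function name and parameters are made up for this example, and it assumes exactly two sites, each of which must absorb the whole peak load on its own.

```python
def capacity_one_plus_one(peak_load: float, engineering_limit: float = 0.8) -> float:
    """Total capacity to build for 1+1 redundancy over two sites.

    Assumption: when one site fails, the other carries the full peak
    load and must still stay at or below `engineering_limit` utilization.
    """
    if not 0 < engineering_limit <= 1:
        raise ValueError("engineering_limit must be in (0, 1]")
    per_site = peak_load / engineering_limit  # capacity needed at each site
    return 2 * per_site                       # two identical sites


# Capacity to build, expressed as a percentage of peak load.
percent_of_peak = 100 * capacity_one_plus_one(1.0)
print(f"{percent_of_peak:.0f}% of peak capacity")
```

Without dynamic sharing, the same factor applies to each application separately, no matter how many sites exist.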

Interestingly, in the dynamic resource allocation model, it is possible to reduce the total amount of resources simply by increasing the universe of possible locations. This is because, with a larger pool of sites on which to deploy these applications, the impact of any single site failure is significantly reduced. Therefore, the sensible approach would be to keep the level of redundancy small, e.g., 1+1 or 2+1, but spread the applications over as many non-fate-sharing sites as possible. During last week’s outage, companies such as Netflix that have adopted this model came out unaffected. On the other hand, those who relied on intra-data center redundancy suffered a much bigger impact. Certainly Amazon didn’t make life any simpler for those who believed Amazon’s statements about the diversity of availability zones within a data center.
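To make the effect of the number of sites concrete, here is a minimal sketch of a simplified pooling model. It is not the model behind the graph in this post; it assumes the total load is spread evenly over all sites, that resources are fully shared, that only one site fails at a time, and that the surviving sites must stay within the 80% engineering limit. The function name and parameters are hypothetical.

```python
def pooled_capacity(total_load: float, sites: int, engineering_limit: float = 0.8) -> float:
    """Total capacity to build when load is shared dynamically across sites.

    Assumptions (illustrative only): load is spread evenly, a single
    site may fail, and the remaining sites absorb its share without
    exceeding `engineering_limit` utilization.
    """
    if sites < 2:
        raise ValueError("need at least two sites to survive a site failure")
    surviving = sites - 1
    # Each surviving site carries total_load / surviving after a failure.
    per_site = total_load / (surviving * engineering_limit)
    return sites * per_site


# 100 applications of 1 unit load each, as in the scenario above.
for n in (2, 3, 5, 10):
    print(n, "sites:", round(pooled_capacity(100, n), 1), "units")
```

Under these assumptions, two sites require 250 units of capacity for 100 units of load, while five sites require about 156 units, which is the trend described above.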

I expect Amazon, as well as the overall cloud industry, will come out of this ordeal even stronger than before. It is reasonable to expect Amazon to come up with a better explanation of why and how failures impacting multiple availability zones occur. However, instead of going back to owned/hosted equipment, more systems architects will learn the details of the AWS architecture, especially its use of Elastic Block Store (EBS) and how it can be adapted/combined with AWS Simple Storage Service (S3), as well as how multi-region redundancy can be built. I am sure those who can justify it will adapt their applications to use multiple cloud providers along with global load balancing solutions. Ultimately the real winning feature of the cloud is its elasticity. That benefit alone is enough to convince technology leaders to adopt cloud services more forcefully. Unless a company is willing to invest in infrastructure as a technology/cost/time-to-market differentiator for its business, cloud service is the only viable solution. In other words, if you are not ready to be the next Google, don’t bother building your own infrastructure.

You can find a useful article about the outage and its lessons at this blog.


One Response to Lessons of Amazon Cloud Outage

  1. Prakash Patel says:

    Good summarized write up on amazon outage describing high level what happened. Provides insight how to model your applications in best way for failure using cost optimized model. Hopefully cloud industry will have some better ideas on how to avoid this kind of situation
