May 4, 2011

What Happens when a Cloud Turns Dark


April 21st, 2011: when Amazon's East Zone went Dark

Amazon EC2 (Elastic Compute Cloud) went dark on the US East Coast on April 21st and stayed down for a few days. The impact was wide and widely noticed, so much so that The Economist, a publication that I read avidly but not for its technology coverage, noted the breakdown in an article a few days later.

The fact that one data center went offline is not a surprise per se. Without any hint of callousness, and acknowledging the real impact on the businesses involved, it just happens. What is more interesting is to consider the famed Warren Buffett quote (only when the tide goes out do you discover who has been swimming naked) and find out who has NOT been swimming naked in the rising Cloud.

Chaos Monkey and other Lessons

Netflix is one of the organizations that fared well in the outage, even though, given its large customer footprint and its move to the Amazon Cloud, one would reasonably expect it to suffer all sorts of problems. On its Tech Blog (linked below), the Netflix team talked about how they dealt with it.

Instead of just moving their datacenter onto EC2 as virtual machines, Netflix made architectural changes to take advantage of the inherent flexibility (and instability) of Cloud Computing. Nobody is perfect, though, and they did not account for the possibility of an entire zone/datacenter going out. As a result, switching over to the other active zones was a time-consuming and error-prone manual process.

The most interesting part for me is the Lessons Learned section. In addition to building up the ability to handle failover and recovery at the Zone level, Netflix will scale up its random disruptions, from Chaos Monkey to Chaos Gorilla, to introduce more failure as an ongoing part of its system-wide robustness design.

An effective, albeit brutal, solution that could not be fathomed before Cloud Computing.
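
To make the idea concrete, here is a toy sketch of my own, not Netflix's code. It assumes a Chaos Monkey that knocks out one instance at random and a Chaos Gorilla that knocks out a whole zone at once; the fleet, the zone names and every function name below are made up for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    zone: str
    alive: bool = True


def build_fleet(zones, per_zone):
    """Create a toy fleet with `per_zone` instances in each zone (hypothetical)."""
    return [Instance(f"{zone}-{i}", zone) for zone in zones for i in range(per_zone)]


def chaos_monkey(fleet, rng):
    """Kill one randomly chosen live instance; return it, or None if none are left."""
    live = [inst for inst in fleet if inst.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


def chaos_gorilla(fleet, rng):
    """Kill every instance in one randomly chosen zone; return the zone name."""
    zones = sorted({inst.zone for inst in fleet if inst.alive})
    if not zones:
        return None
    zone = rng.choice(zones)
    for inst in fleet:
        if inst.zone == zone:
            inst.alive = False
    return zone


def service_is_up(fleet, min_live):
    """The toy service counts as up while at least `min_live` instances are alive."""
    return sum(inst.alive for inst in fleet) >= min_live


if __name__ == "__main__":
    rng = random.Random(2011)  # fixed seed so a run can be repeated
    fleet = build_fleet(["zone-a", "zone-b", "zone-c"], per_zone=3)
    print("monkey killed:", chaos_monkey(fleet, rng).name)
    print("gorilla took out:", chaos_gorilla(fleet, rng))
    print("service up:", service_is_up(fleet, min_live=4))
```

With three zones of three instances each and a service that needs four live instances, the fleet survives one monkey strike followed by one gorilla strike. Put everything in a single zone and the gorilla takes the service down, which is exactly the kind of weakness this sort of deliberate disruption is meant to expose before a real outage does.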

Lessons Netflix Learned from the AWS Outage
