Some argue that without the Apollo 1 tragedy, which killed three astronauts on the launch pad, we might never have reached the moon. That's certainly not to minimize what happened to those men, but there is truth in the idea that there can be no success without lessons learned from failure.
There was yet another conspicuous outage at Amazon Web Services (AWS) this weekend. Friday night's weather in the mid-Atlantic region of the US took out large chunks of the power grid, and the failure of Amazon's backup generators ultimately caused a complete power outage in one of the AWS East region availability zones. Power was restored within about 15 minutes, but by that time the damage was done. Approximately 7% of the EC2 instances in the US East-1 region were down and had to be recovered, a process that stretched well into Saturday afternoon for some customers. The issue here isn't so much the power failure itself as the bugs and weaknesses it exposed in the highly complex and tightly interconnected AWS infrastructure. Amazon has engineered a system that should allow its customers to build redundancy across availability zones and recover fairly rapidly from such a failure (so long as the customer has an adequate backup strategy). In this case, however, multiple software and system failures seem to have seriously impaired or even destroyed that built-in resiliency. Of special note is this little nugget from Amazon:
“From 8:04pm PDT to 9:10pm PDT, customers were not able to launch new EC2 instances, create EBS volumes, or attach volumes in any Availability Zone in the US-East-1 Region.”
We use AWS. We know enough to have a tested strategy in place that allows us to migrate services from one availability zone to another in the event of failure, but that strategy would have failed us on Friday evening due to the underlying failure of the AWS infrastructure across the entire region. This appears to be one of the primary reasons why well-known services like Netflix, Instagram and Pinterest were taken down by the power failure at Amazon. We can assume that if we're bright enough to plan for failure in an availability zone, then so are they, but when an entire AWS region (encompassing 10 different datacenters) is impacted, those plans are reduced to rubble. We can also move services to another AWS region if need be, but that's much more complicated and takes time to implement. We're small potatoes. If it takes us an hour or more to migrate from Virginia to Oregon, how long would it take Netflix to completely evacuate an AWS region?
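The decision at the heart of a cross-zone failover plan like ours can be sketched in a few lines: given health status per availability zone, pick a healthy sibling zone to relaunch in, and escalate to a cross-region move when no zone in the region is usable. The sketch below is illustrative only; the zone names and health map are hypothetical inputs (real health data would come from your own monitoring), not actual AWS API output.

```python
def pick_failover_target(zone_health, current_zone):
    """Return a healthy zone in the same region to fail over to,
    or None to signal that a cross-region evacuation is needed."""
    candidates = [zone for zone, healthy in zone_health.items()
                  if healthy and zone != current_zone]
    # Deterministic choice: lowest-sorted healthy zone.
    return min(candidates) if candidates else None


# Normal single-zone failure: a sibling zone absorbs the load.
partial_outage = {"us-east-1a": False, "us-east-1b": True, "us-east-1c": True}
print(pick_failover_target(partial_outage, "us-east-1a"))  # → us-east-1b

# Friday-night scenario: control-plane impairment makes every zone in the
# region unusable for launches, so there is no in-region target and the
# only remaining option is the slow, complicated cross-region move.
regional_outage = {"us-east-1a": False, "us-east-1b": False, "us-east-1c": False}
print(pick_failover_target(regional_outage, "us-east-1a"))  # → None
```

The point of the second case is exactly what bit everyone on Friday: a per-zone plan degrades to "evacuate the region," which is a far more expensive operation.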
The nuts and bolts of the power failure itself appear relatively straightforward (generator failure), and Amazon has already implemented changes in its operating procedures that would stop this particular chain of events from unfolding again. If you're in this business, you learn to accept that uninterruptible power is hard. Really hard. Expecting and planning for failure is the rule when it comes to that stuff. The bigger issue is the set of AWS system holes and bugs that were exposed by this event. Obviously they'll be addressed going forward, but the episode really puts the immense complexity of what Amazon is doing under a bright light and a powerful microscope. I can't speak to Google's Compute Engine or Microsoft's Azure, but at least with respect to the cloud computing offerings from players like Rackspace and SoftLayer, AWS is in a completely different universe. Only AWS has really taken things to a level where each aspect of a computing system can be detached and re-attached in different places. This makes AWS more complicated, but also much more powerful. However, as Google's Urs Hölzle pointed out not too long ago, "At scale, everything breaks," and in this event AWS experienced impairment or failure on multiple fronts (EC2/EBS, ELB and RDS) due to multiple factors that included overload and system bugs. At scale, almost everything broke.
Everyone likes to pile on when this sort of thing happens. It makes the news. Every sysadmin and network engineer not on the dark side of the moon wants to weigh in and there’s lots of “coulda and shoulda” going around. There’s also a healthy dose of “I told you so” to be found. In the end, however, we like to look upon these events as learning experiences. Success is built on lessons learned in failure. Highly complex, massive public cloud systems are still a relatively new thing and nobody likes to admit it, but not every possible outcome can be predicted, simulated or planned for. There’s no substitute for real life experiences – both good and bad.
For us, the takeaway is that much of the Internet still comes down to managing expectations. When you want your online cheese shop to never ever go down, that requires extremely complex systems and those systems take years to develop and reach full potential. In cases like this, you almost have to embrace the failure as a necessary step to get where we collectively want to be. Lick your wounds, see what went wrong, make sure that doesn’t happen again, and move along. Learn. Adapt. Improve.
The Internet is a far more reliable place than it was 15 years ago, but the leap from four nines to five nines is enormous and no amount of marketing hype and no buzzwords can make that leap any smaller or easier to achieve.
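To make that leap concrete, each additional nine cuts the permitted downtime by a factor of ten. A quick back-of-the-envelope calculation:

```python
# Allowed downtime per year for a given number of "nines" of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes (ignoring leap years)

def downtime_minutes_per_year(nines):
    """Minutes of downtime per year permitted at e.g. 99.99% (nines=4)."""
    unavailability = 10 ** -nines  # 99.99% available → 0.01% unavailable
    return MINUTES_PER_YEAR * unavailability

print(round(downtime_minutes_per_year(4), 1))  # four nines: ~52.6 min/year
print(round(downtime_minutes_per_year(5), 2))  # five nines: ~5.26 min/year
```

A single 15-minute power hiccup, before any recovery time, already blows the five-nines budget for the year several times over, which is why that last nine is so enormously expensive.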