Netflix's response to October 2012 AWS EBS degradation

On Monday, October 22nd, 2012, Amazon's EBS service degraded in a single Availability Zone (AZ). Netflix first noticed problems affecting other websites just after 8:30 AM, though its own systems showed no impact at that point. By 10:40 AM, Amazon had confirmed the EBS degradation.

Around 11:00 AM, some Netflix customers began experiencing intermittent problems. Because Netflix's client software is resilient, most customers did not notice. By 11:15 AM, the impact had grown enough for Netflix to open an internal alert. The problem first looked like a network issue, which caused some confusion before it was narrowed down to a single Availability Zone.

Once the issue was confirmed to be isolated to one AZ, Netflix began a zone evacuation. Drawing on prior drills and its Asgard cloud management tool, the team evacuated the troubled zone in just 20 minutes, fully restoring service to all customers.

The root cause was an Amazon EBS service degradation in a single Availability Zone. Netflix was able to mitigate the impact because of its architectural patterns: software built to operate across three Availability Zones, resilience to single instance failures, and tools like Asgard for rapid zone evacuation. Netflix also highlighted its use of the Simian Army to continuously test system resilience.
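
The capacity arithmetic behind a zone evacuation can be illustrated with a minimal Python sketch. It is not Asgard's implementation: the function name `evacuate_zone`, the zone names, and the instance counts are hypothetical, and it assumes the load of the drained zone is spread evenly across the remaining zones.

```python
from math import ceil


def evacuate_zone(capacity, failed_zone):
    """Return per-zone instance targets with failed_zone drained to zero.

    Hypothetical helper: `capacity` maps zone name -> running instance count.
    The total instance count is split evenly across the healthy zones,
    rounding up, and no healthy zone is ever shrunk.
    """
    if failed_zone not in capacity:
        raise ValueError(f"unknown zone: {failed_zone}")

    healthy = [zone for zone in capacity if zone != failed_zone]
    if not healthy:
        raise ValueError("no healthy zones left to absorb traffic")

    total = sum(capacity.values())
    per_zone = ceil(total / len(healthy))

    # Grow each healthy zone to its share; keep it as-is if already larger.
    targets = {zone: max(capacity[zone], per_zone) for zone in healthy}
    targets[failed_zone] = 0
    return targets


# Illustrative numbers only: three zones with ten instances each.
before = {"zone-a": 10, "zone-b": 10, "zone-c": 10}
after = evacuate_zone(before, "zone-b")
print(after)
```

With three equally sized zones, each surviving zone has to grow by half to carry the full load, which is one reason such an evacuation is easier when it has been rehearsed in drills beforehand.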
