Amazon EC2 and Amazon RDS Service Disruption in US East Region
Amazon · Amazon Elastic Block Store (EBS), Amazon EC2, Amazon RDS
On April 21st, 2011, at 12:47 AM PDT, an incorrect network configuration change during a routine upgrade in a single US East Availability Zone caused primary Amazon Elastic Block Store (EBS) network traffic to be routed to a lower-capacity redundant network. The redundant network could not handle the traffic, isolating many EBS nodes from the rest of their cluster and leaving a large number of EBS volumes “stuck” and unable to service read/write operations. Amazon EC2 instances attached to these volumes were also affected.
When network connectivity was restored, the isolated EBS nodes aggressively searched the cluster for free space in which to re-mirror their data, exhausting available capacity and creating a “re-mirroring storm.” This left approximately 13% of volumes in the affected AZ stuck. The degraded EBS cluster also impacted the EBS control plane: “create volume” API requests against the degraded cluster backed up behind long timeouts, starving the control plane of request threads and producing high error rates for EBS APIs across the entire US East Region. A race condition bug in the EBS node code, hit far more often under the unusually high re-mirroring request rate, caused additional node crashes and further increased the number of “stuck” volumes.
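This control-plane failure mode, long-timeout requests monopolizing a shared pool of worker threads, is general enough to illustrate with a short sketch. The Python simulation below is not AWS's implementation; the zone names, pool size, and timeouts are hypothetical, and it only shows how calls waiting on one degraded Availability Zone can starve requests for healthy zones that share the same thread pool.

```python
import concurrent.futures
import time

# Hypothetical illustration, not AWS code: one thread pool serves EBS API
# requests for every Availability Zone in the region. Requests to the
# degraded zone hang until a long timeout, so the pool's workers are quickly
# consumed and requests for healthy zones start failing too.

POOL_SIZE = 8                      # assumed shared worker-thread count
DEGRADED_ZONE_TIMEOUT = 5.0        # long wait on the stuck EBS cluster
HEALTHY_ZONE_LATENCY = 0.05

def create_volume(zone: str) -> str:
    """Simulated 'create volume' handler; the degraded zone never responds in time."""
    if zone == "us-east-1a":       # stand-in for the affected AZ
        time.sleep(DEGRADED_ZONE_TIMEOUT)
        raise TimeoutError("EBS cluster did not respond")
    time.sleep(HEALTHY_ZONE_LATENCY)
    return f"volume created in {zone}"

pool = concurrent.futures.ThreadPoolExecutor(max_workers=POOL_SIZE)

# A burst of requests against the degraded zone ties up every worker thread...
stuck = [pool.submit(create_volume, "us-east-1a") for _ in range(POOL_SIZE)]

# ...so a request for a healthy zone now queues behind the stuck ones.
healthy = pool.submit(create_volume, "us-east-1b")
try:
    print(healthy.result(timeout=1.0))
except concurrent.futures.TimeoutError:
    print("healthy-zone request starved: no free threads in the shared pool")
```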
The disruption caused significant customer impact, including high error rates and latencies for EBS APIs and elevated error rates for launching new EBS-backed EC2 instances for about 11 hours. Approximately 45% of single-AZ Amazon Relational Database Service (RDS) instances in the affected Availability Zone were also impacted. To stabilize the situation, AWS disabled control APIs for EBS in the affected AZ, disabled new Create Volume requests, and later disabled communication between the degraded EBS cluster and the EBS control plane. A method was developed to prevent EBS servers from futilely contacting other servers, which stopped further degradation by 11:30 AM PDT on April 21st.
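The report does not detail how EBS servers were stopped from futilely contacting their peers, but the general pattern is a back-off or circuit-breaker gate on the re-mirroring search. The sketch below is an illustration of that pattern under stated assumptions, not the actual EBS mechanism; the class name, thresholds, and cool-off period are all hypothetical.

```python
import time

# Hypothetical sketch, not AWS code: a node stops contacting peers for
# replica space once a streak of refusals shows the cluster has no free
# capacity, and only retries after a cool-off period.

class ReplicaSearchBreaker:
    def __init__(self, failure_threshold: int = 5, cool_off_s: float = 300.0):
        self.failure_threshold = failure_threshold
        self.cool_off_s = cool_off_s
        self.consecutive_refusals = 0
        self.open_until = 0.0          # while "open", no peer contact is attempted

    def allow_search(self) -> bool:
        """Return True if the node may ask peers for re-mirroring space."""
        return time.monotonic() >= self.open_until

    def record_refusal(self) -> None:
        """Called when a peer reports it has no space for a new replica."""
        self.consecutive_refusals += 1
        if self.consecutive_refusals >= self.failure_threshold:
            # Stop hammering peers; try again only after the cool-off.
            self.open_until = time.monotonic() + self.cool_off_s
            self.consecutive_refusals = 0

    def record_success(self) -> None:
        self.consecutive_refusals = 0

# Example: after three straight refusals the node backs off instead of
# amplifying load on an already capacity-starved cluster.
breaker = ReplicaSearchBreaker(failure_threshold=3, cool_off_s=60.0)
for _ in range(3):
    if breaker.allow_search():
        breaker.record_refusal()
print(breaker.allow_search())          # False: searching is paused
```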
Recovery involved bringing significant new storage capacity online, which was challenging because servers had to be physically relocated and negotiation throttles carefully adjusted so the new capacity could be integrated without overwhelming the existing servers. By 2:00 AM PDT on April 22nd, new capacity was being added, and volumes were steadily restored over the next nine hours. Re-establishing EBS control plane API access to the affected AZ required building a separate instance of the control plane dedicated to that zone, along with finer-grained throttles, to safely process the backlog of state changes.
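The “negotiation throttles” mentioned above imply a rate limit on how quickly servers accept re-mirroring negotiations as capacity returns. As a hedged illustration only, not AWS's implementation, a simple token bucket captures the idea; the class name, rate, and burst size below are assumptions.

```python
import time

# Hypothetical sketch, not AWS code: newly added storage servers admit
# re-mirroring negotiations at a bounded rate so the backlog of stuck
# volumes drains without overwhelming new or existing nodes.

class NegotiationThrottle:
    def __init__(self, rate_per_s: float, burst: int):
        self.rate_per_s = rate_per_s      # steady-state negotiations per second
        self.capacity = burst             # short bursts allowed up to this size
        self.tokens = float(burst)
        self.last_refill = time.monotonic()

    def try_admit(self) -> bool:
        """Admit one negotiation if a token is available; otherwise refuse."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate_per_s)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                      # caller re-queues the volume and retries later

# Example: with a burst of 5, only about 5 of 20 immediate requests are
# admitted; operators could then raise rate_per_s in small steps while
# watching cluster load, matching the careful adjustment described above.
throttle = NegotiationThrottle(rate_per_s=2.0, burst=5)
admitted = sum(throttle.try_admit() for _ in range(20))
print(f"admitted {admitted} of 20 immediate negotiation requests")
```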
API access to EBS resources in the affected Availability Zone was fully restored by 6:15 PM PDT on April 23rd. The remaining affected volumes were recovered through S3 snapshots and manual forensics. By 12:30 PM PDT on April 24th, most recoverable volumes were restored. Ultimately, 0.07% of the volumes in the affected Availability Zone could not be restored for customers in a consistent state.