Postmortem Index

Explore incident reports from various companies

Amazon EC2, EBS, and RDS EU West Region Service Event

Amazon · EC2, EBS, RDS

On August 7th, 2011, at 10:41 AM PDT, a service disruption began in a single EU West Availability Zone when a 110kV, 10 megawatt transformer at a utility provider failed, causing a total loss of utility power to the zone. Backup generators failed to come online because a Programmable Logic Controller (PLC) did not complete the connection, believed to be due to a large ground fault. Without sufficient power for servers, almost all EC2 instances and 58% of EBS volumes in that Availability Zone lost power, and network connectivity was impaired.

The power loss also affected EC2 networking gear, leading to connectivity issues and API errors. EC2 management servers in other Availability Zones continued to route requests to the affected zone; the resulting failures overloaded the management services, causing long launch delays and high error rates across all EU West EC2 APIs. By 12:00 PM PDT, disabling launches in the affected zone and removing failed management servers restored API launch times for the other zones. Manual synchronization of the backup generators began restoring power at 11:54 AM PDT, and network connectivity to the zone was re-established by 1:49 PM PDT.
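The mitigation amounts to taking the degraded zone out of the request-routing pool so that healthy zones stop paying for its failures. A minimal sketch of that idea, with hypothetical names (the actual EC2 management-plane internals are not public):

```python
# Sketch of the mitigation: once a zone is marked degraded, the router
# refuses launch requests for it fast, instead of timing out against it
# and tying up shared management-server capacity.

from dataclasses import dataclass, field


@dataclass
class LaunchRouter:
    zones: dict[str, bool] = field(default_factory=dict)  # zone -> healthy?

    def mark_degraded(self, zone: str) -> None:
        self.zones[zone] = False

    def route_launch(self, preferred_zone: str) -> str:
        if not self.zones.get(preferred_zone, False):
            # Fail immediately rather than overloading shared services.
            raise RuntimeError(f"launches disabled in {preferred_zone}")
        return preferred_zone


router = LaunchRouter(zones={"eu-west-1a": True, "eu-west-1b": True})
router.mark_degraded("eu-west-1a")
print(router.route_launch("eu-west-1b"))  # healthy zones still serve launches
```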

EBS volume recovery was prolonged. Many volumes became “stuck” because there was not enough spare capacity in the zone to re-mirror data from nodes that had lost power. Additional capacity was brought in, and recovery continued. For volumes whose nodes had all lost power, data consistency could not be verified, so recovery snapshots were generated for customers, a process that extended until August 10th.
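The “stuck” state follows from the re-mirroring requirement: a volume that has lost a replica will not resume I/O until the data is mirrored onto a node with spare capacity, so a capacity-exhausted zone leaves it waiting indefinitely. A minimal sketch, with hypothetical names, of that dynamic:

```python
# Why volumes got "stuck": with no node holding enough free space, a
# volume cannot re-mirror, so it blocks until capacity is added.

def find_spare_node(nodes: dict[str, int], needed_gb: int) -> str | None:
    """Return a node with enough free space, or None if the zone is full."""
    for name, free_gb in nodes.items():
        if free_gb >= needed_gb:
            return name
    return None


def remirror(volume_gb: int, nodes: dict[str, int]) -> str:
    target = find_spare_node(nodes, volume_gb)
    if target is None:
        # The volume will not accept writes with only one replica,
        # so it stays stuck until spare capacity appears.
        return "stuck: awaiting spare capacity"
    nodes[target] -= volume_gb
    return f"re-mirroring to {target}"


zone = {"node-1": 0, "node-2": 50}   # zone nearly full after the outage
print(remirror(100, zone))           # -> stuck: awaiting spare capacity
zone["node-3"] = 500                 # additional capacity brought in
print(remirror(100, zone))           # -> re-mirroring to node-3
```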

Separately, an EBS software bug, unrelated to the power event, caused the incorrect deletion of customer snapshot data: a cleanup process ran against an incomplete list of snapshot references, so blocks that were still referenced were treated as unused. The issue was identified, and recovery snapshots for the affected snapshots were delivered by 4:19 PM PDT on August 8th.
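This is a classic garbage-collection failure mode: snapshots share underlying blocks, and any block not referenced by the input list is deleted, so an incomplete list silently condemns live data. A minimal sketch with a hypothetical data model (not the actual EBS internals):

```python
# Cleanup deletes every stored block not referenced by the snapshot list
# it is given; if that list is incomplete, live blocks are wrongly chosen.

def collect_garbage(stored_blocks: set[str],
                    snapshot_refs: dict[str, set[str]]) -> set[str]:
    """Return blocks to delete: those referenced by no snapshot in the list."""
    live = set().union(*snapshot_refs.values()) if snapshot_refs else set()
    return stored_blocks - live


blocks = {"b1", "b2", "b3"}
all_refs = {"snap-a": {"b1", "b2"}, "snap-b": {"b2", "b3"}}

# Correct run: nothing is unreferenced, nothing is deleted.
print(collect_garbage(blocks, all_refs))        # -> set()

# Buggy run: snap-b is missing from the reference list, so b3 (still
# needed by snap-b) is incorrectly flagged for deletion.
incomplete = {"snap-a": {"b1", "b2"}}
print(collect_garbage(blocks, incomplete))      # -> {'b3'}
```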

RDS instances were also significantly impacted. Single-AZ instances in the affected zone became unavailable, recovering as their EBS volumes were restored or via Point-in-Time Restore. Most Multi-AZ instances failed over rapidly without data loss, but a portion experienced prolonged failovers: a DNS connectivity issue prevented health checks from reaching the primary and triggered a software bug. Because the failover logic deliberately declines to promote a standby until the primary is confirmed down, precisely to prevent “split brain” scenarios, unavailability was extended for this small subset.
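The trade-off can be made concrete: when health checks cannot reach the primary, the system cannot distinguish “primary dead” from “we lost the health-check path,” and promoting the standby in the second case would leave two writable primaries. A minimal sketch of that decision, with hypothetical names:

```python
# Promote the standby only on a confirmed-down primary; an unreachable
# primary (as with the DNS connectivity issue) holds failover rather than
# risking split brain, at the cost of longer unavailability.

from enum import Enum


class PrimaryState(Enum):
    HEALTHY = "healthy"
    CONFIRMED_DOWN = "confirmed_down"
    UNREACHABLE = "unreachable"   # the health check itself failed


def decide_failover(state: PrimaryState) -> str:
    if state is PrimaryState.CONFIRMED_DOWN:
        return "promote standby"
    if state is PrimaryState.UNREACHABLE:
        # Ambiguous: promoting now could create two writable primaries.
        return "hold: require confirmation before promoting"
    return "no action"


print(decide_failover(PrimaryState.CONFIRMED_DOWN))  # -> promote standby
print(decide_failover(PrimaryState.UNREACHABLE))     # -> hold: require confirmation...
```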

To prevent recurrence, AWS plans to add redundancy and isolation for PLCs, improve EC2 load balancing, and further isolate EC2 control plane components. For EBS, the primary action is to drastically reduce recovery time by enabling direct volume recovery on servers without moving data to S3. Changes have also been made to the EBS snapshot deletion process, including new alarms and holding states, to prevent the software bug from recurring.
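One way to read the “holding states” change is as a two-phase delete: snapshot data marked for deletion is parked for a grace period and re-validated before anything is purged, with an alarm on anomalies. A minimal sketch under that assumption (the names, and the 24-hour window, are illustrative, not AWS's actual implementation):

```python
# Two-phase snapshot deletion: requests enter a holding state, and a purge
# only proceeds after the grace period if the data is still unreferenced;
# otherwise an alarm fires and the purge is aborted.

import time

HOLD_SECONDS = 24 * 3600          # assumed grace period, for illustration
pending: dict[str, float] = {}    # snapshot id -> time it entered holding


def request_delete(snapshot_id: str) -> None:
    pending[snapshot_id] = time.time()
    print(f"{snapshot_id}: held for review, not yet deleted")


def purge_expired(now: float, still_referenced: set[str]) -> None:
    for snap, held_at in list(pending.items()):
        if now - held_at < HOLD_SECONDS:
            continue                       # still inside the holding window
        if snap in still_referenced:
            print(f"ALARM: {snap} is still referenced; aborting purge")
            continue
        del pending[snap]
        print(f"{snap}: purged after holding period")


request_delete("snap-123")
purge_expired(time.time() + HOLD_SECONDS + 1, still_referenced=set())
```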

Keywords

ec2, ebs, rds, eu west, power outage, plc, generator, snapshot, api, multi-az