Postmortem Index

Amazon ELB Service Event in US-East Region on December 24, 2012

Amazon · Elastic Load Balancing

2012-12-24 – 2012-12-25 · automation · cloud · config-change

The incident began at 12:24 PM PST on December 24, 2012, when a maintenance process inadvertently deleted a portion of the Amazon Elastic Load Balancing (ELB) state data in the US-East region. The ELB control plane uses this data to manage load balancer configurations. The process was run by a developer with persistent access to the production environment.

Initially, the impact was limited to high latency and error rates for ELB API calls, with no immediate effect on running load balancers. However, as the control plane attempted to modify or scale load balancers, the missing state data led to improper configurations. This resulted in degraded performance and errors for customer applications using these modified ELBs. At its peak, 6.8% of running ELB load balancers were impacted, while others could not be scaled or modified.

The root cause was identified as the missing ELB state data. Recovery efforts involved disabling control plane workflows to prevent further impact and attempting to restore the data. An initial restoration attempt consumed several hours and failed to provide a usable snapshot, delaying recovery. At 2:45 AM PST on December 25th, an alternate recovery process successfully restored a snapshot of the ELB state data, followed by a data merge completed by 5:40 AM PST.

Service workflows and APIs were then slowly re-enabled. By 10:30 AM PST, almost all affected load balancers had been restored to full operation. The service was declared operating normally at 12:05 PM PST on December 25th.
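
To put the recovery milestones in perspective, the elapsed time from the initial deletion can be worked out directly from the timestamps quoted above. The following is a small sketch only; the event labels are paraphrased from this summary, and the helper names are illustrative.

```python
from datetime import datetime

# Timestamps as stated in the summary above (all PST).
EVENTS = [
    ("state data deleted", datetime(2012, 12, 24, 12, 24)),
    ("snapshot restored", datetime(2012, 12, 25, 2, 45)),
    ("data merge completed", datetime(2012, 12, 25, 5, 40)),
    ("almost all load balancers restored", datetime(2012, 12, 25, 10, 30)),
    ("service operating normally", datetime(2012, 12, 25, 12, 5)),
]


def elapsed_since_start(events):
    """Return (label, timedelta since the first event) for each event."""
    start = events[0][1]
    return [(label, ts - start) for label, ts in events]


def format_delta(delta):
    """Format a timedelta as hours and minutes, e.g. '14h 21m'."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


for label, delta in elapsed_since_start(EVENTS):
    print(f"{format_delta(delta):>8}  {label}")
```

By this arithmetic, a usable snapshot was restored 14 hours 21 minutes after the deletion, and the service was declared normal after 23 hours 41 minutes.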

Remediation included modifying access controls for production ELB state data to prevent inadvertent modification without specific Change Management approval. The data recovery process was also improved to be significantly faster for future events. Additionally, architectural changes are planned to enable the ELB control plane to automatically reconcile central service data with current load balancer states, reducing reliance on manual data restoration.
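
The planned reconciliation between central service data and running load balancers can be illustrated with a minimal sketch. The source does not describe how Amazon implements this, so everything here is an assumption made for illustration: the `LbConfig` record, the `reconcile` function, the action names, and the policy of rebuilding missing central records from the observed state while only flagging mismatches for review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LbConfig:
    """Hypothetical, simplified view of one load balancer's configuration."""
    listeners: tuple
    instance_count: int


def reconcile(central, observed):
    """Compare central state with what is observed on running load balancers.

    Both arguments map a load balancer ID to an LbConfig.
    Returns (repaired_central, actions). Assumed policy:
      - record missing centrally but running: rebuild it from observed state
      - record present in both but different: flag it, change nothing
      - record present centrally but not running: flag it, change nothing
    """
    repaired = dict(central)
    actions = []
    for lb_id, live in sorted(observed.items()):
        recorded = central.get(lb_id)
        if recorded is None:
            repaired[lb_id] = live
            actions.append(("restore_missing", lb_id))
        elif recorded != live:
            actions.append(("flag_mismatch", lb_id))
    for lb_id in sorted(central.keys() - observed.keys()):
        actions.append(("flag_orphan", lb_id))
    return repaired, actions


# Example: central data lost the record for "lb-b"; "lb-c" is not running.
central = {
    "lb-a": LbConfig(listeners=(80, 443), instance_count=4),
    "lb-c": LbConfig(listeners=(80,), instance_count=1),
}
observed = {
    "lb-a": LbConfig(listeners=(80, 443), instance_count=4),
    "lb-b": LbConfig(listeners=(443,), instance_count=2),
}
repaired, actions = reconcile(central, observed)
print(actions)
```

In this sketch the running load balancers are treated as the source of truth only for records that are entirely missing, which mirrors the failure mode in this incident. A real control plane would need a more careful policy for conflicting records.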

Keywords

elb, us-east, load balancing, aws, data deletion, control plane, maintenance error, access control