Postmortem Index

Explore incident reports from various companies

AWS US-EAST-1 Internal Network Congestion on December 7, 2021

Amazon

2021-12-07 – 2021-12-08 automation cloud config-change

On December 7th, 2021, at 7:30 AM PST, an automated capacity scaling activity for an AWS service in the main AWS network triggered an unexpected behavior from a large number of clients within the internal AWS network. This resulted in a significant surge of connection activity that overwhelmed the networking devices connecting the internal network to the main AWS network, causing increased latency and errors in communication between these networks. The result was persistent congestion and performance problems on these critical networking devices.

The congestion immediately impacted the availability of real-time monitoring data for internal operations teams, hindering their ability to diagnose and resolve the issue. Operators initially focused on internal DNS errors, moving DNS traffic away from congested paths, which improved some services by 9:28 AM PST. However, full resolution was delayed due to limited monitoring visibility, impact on internal deployment systems, and a cautious approach to avoid affecting functioning customer workloads.

The root cause was identified as a latent issue in networking client back-off behaviors. Although the clients were designed to back off and recover during congestion, a previously unobserved behavior prevented them from backing off adequately during this specific automated scaling event. This code path had been in production for many years, but this particular trigger exposed the flaw.
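The report does not include the client code, but the back-off behavior it describes is broadly similar to capped exponential back-off with jitter. The sketch below is purely illustrative; the function names, retry policy, and parameters are assumptions, not AWS internals. When this kind of logic fails to engage, clients keep retrying aggressively and the congestion feeds on itself.

```python
import random
import time


def call_with_backoff(request, max_attempts=8, base_delay=0.1, max_delay=30.0):
    """Retry a request with capped exponential back-off and full jitter.

    Illustrative only: names, parameters, and policy are assumptions,
    not the actual AWS networking client implementation.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Exponential back-off: 0.1s, 0.2s, 0.4s, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter spreads retries out so many clients do not
            # synchronize and re-congest the network all at once.
            time.sleep(random.uniform(0, ceiling))
```

The jittered sleep is the part that matters for congestion recovery: without it (or with a flaw that bypasses it), a large fleet of clients retries in lockstep and the affected devices never get a chance to drain.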

Customer impact was widespread, primarily affecting the control planes of many AWS services used for creating and managing resources. Services like EC2 APIs, RDS, EMR, Workspaces, ELB provisioning, Route 53 APIs, AWS Console login, STS, CloudWatch monitoring, VPC Endpoints for S3/DynamoDB, API Gateway, EventBridge, container services (Fargate, ECS, EKS), and Amazon Connect experienced elevated error rates and latencies. Existing customer workloads on the main AWS network were largely unaffected, but new resource provisioning and management operations were significantly impaired.

Remediation actions included immediately disabling the scaling activities that triggered the event. AWS is developing a fix for the latent client back-off issue and expects to deploy it within two weeks. Additionally, new network configurations have been deployed to protect potentially impacted networking devices from similar congestion events in the future. Congestion had significantly improved by 1:34 PM PST, and all network devices fully recovered by 2:22 PM PST, though some services took longer to recover; EventBridge event delivery latency was the last to return to normal, at 6:40 PM PST.
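The report does not describe the new protective network configurations, but throttling the rate of new connection attempts at a device boundary is one common class of protection against this kind of surge. A minimal token-bucket sketch follows; all names and limits are assumed for illustration and do not reflect the actual AWS configuration changes.

```python
import time


class TokenBucket:
    """Admit at most `rate` new connections per second, with bursts up to
    `burst`. Purely illustrative; the actual AWS network configuration
    changes are not described in the report."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, up to the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Dropping or deferring connection attempts above the configured rate keeps a surge of client retries from overwhelming the device, at the cost of delaying some legitimate traffic during the spike.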

Keywords

aws, us-east-1, network, congestion, internal dns, ec2, api gateway, sts