
Amazon Kinesis US-EAST-1 outage November 2020

Amazon · Amazon Kinesis

2020-11-25 – 2020-11-26 · automation · cloud · config-change

On November 25, 2020, Amazon Kinesis in the US-EAST-1 region experienced a service disruption. Alarms began firing at 5:15 AM PST, indicating errors on Kinesis record put and get operations. The front-end fleet, which handles authentication, throttling, and request routing, could no longer route requests to the back-end clusters. The service fully returned to normal by 10:23 PM PST the same day.

The trigger for the event was a capacity addition to the Kinesis front-end fleet. Because each front-end server creates operating-system threads for the other servers in the fleet, the new capacity pushed every server past the maximum thread count allowed by an operating-system configuration. Once over that limit, servers could no longer construct the shard-maps they depend on to route requests, leaving the entire front end unable to serve traffic.
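The failure arithmetic above can be sketched in a few lines. This is an illustrative model with hypothetical numbers, not AWS's actual configuration: each front-end server needs one thread per peer, so per-server thread count grows linearly with fleet size, and a capacity addition can push every server over the limit at once.

```python
# Illustrative sketch (hypothetical numbers, not AWS's actual limits):
# each front-end server opens one OS thread per peer server, so its
# thread count grows linearly with fleet size.

def peer_threads(fleet_size: int) -> int:
    """Threads a single server needs just to track its peers (assumption)."""
    return fleet_size - 1  # one thread per other server in the fleet

OS_THREAD_LIMIT = 10_000   # hypothetical per-process OS thread limit
BASE_THREADS = 2_000       # hypothetical threads for request handling

def exceeds_limit(fleet_size: int) -> bool:
    """Does a single server blow past the OS thread limit at this fleet size?"""
    return BASE_THREADS + peer_threads(fleet_size) > OS_THREAD_LIMIT

# Before the capacity addition the fleet fits under the limit...
print(exceeds_limit(7_500))  # → False
# ...but adding servers pushes every server over it simultaneously.
print(exceeds_limit(8_500))  # → True
```

The key property this models is that the limit is breached fleet-wide, not on one server, which is why the whole front end failed together.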

The Kinesis disruption had a cascading impact on several other AWS services. Amazon Cognito experienced elevated API failures and increased latencies, affecting user authentication and credential issuance. CloudWatch saw increased error rates and latencies for metric and log data APIs, resulting in gaps in data. Lambda function invocations also suffered increased error rates due to memory contention from buffering CloudWatch metrics. Additionally, CloudWatch Events, EventBridge, ECS, and EKS experienced issues, and AWS’s ability to update the Service Health Dashboard was temporarily impaired due to its dependency on Cognito.

Immediate remediation efforts included removing the newly added capacity and restarting the front-end fleet. For future prevention, Kinesis is moving to larger servers to reduce thread count, adding fine-grained thread consumption alarming, and testing increased OS thread limits. Long-term plans involve improving cold-start times, dedicating the front-end server cache to a separate fleet, and accelerating cellularization of the front-end fleet. Dependent services like Cognito and CloudWatch also implemented changes to reduce their reliance on Kinesis or improve resilience to its unavailability.
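One of the mitigations mentioned above is fine-grained thread consumption alarming. A minimal sketch of that idea, assuming a known per-process thread budget and a hypothetical 80% alarm threshold (names and numbers are illustrative, not AWS's implementation):

```python
# Minimal sketch of thread-consumption alarming: fire well before the
# hard OS limit so operators can react. Threshold and limit are
# hypothetical values for illustration.
import threading

def thread_usage_ratio(limit: int) -> float:
    """Fraction of the allowed thread budget currently in use."""
    return threading.active_count() / limit

def should_alarm(limit: int, threshold: float = 0.8) -> bool:
    """True once thread usage crosses the alarm threshold."""
    return thread_usage_ratio(limit) >= threshold

# A freshly started process sits far below an 80% threshold.
print(should_alarm(limit=10_000))  # → False
```

Alarming at a fraction of the limit, rather than on the limit itself, is what gives operators time to shed capacity before shard-map construction starts failing.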

Keywords

kinesis · us-east-1 · aws · outage · november 2020 · threads · front-end · scaling