Postmortem Index

Explore incident reports from various companies

Supermarket Intermittent Unresponsiveness

Chef.io · Supermarket

Approximately two hours after its launch as the official community site, Supermarket began experiencing increased latency and intermittent unresponsiveness. The issue affected the main Supermarket site, community.opscode.com, cookbooks.opscode.com, and api.berkshelf.com, and disrupted cookbook operations.

The unresponsiveness caused several user-facing problems: Berkshelf v3.x could not connect to its API to retrieve cookbook lists, browsing cookbooks was difficult, and cookbook uploads and downloads either failed or were extremely slow. The Supermarket API also became inaccessible.

Several factors contributed to the outage. Load planning was skipped in haste. Health check timeouts were set too low, so nodes were removed from the ELB pool prematurely, and each removal set off a domino effect across the remaining nodes. In addition, the application servers ran on m3.medium instances, which were undersized and lacked the CPU to handle the traffic.
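The interaction between a low health check timeout and undersized nodes can be illustrated with a toy model. The sketch below is not taken from the incident report: the request rates, capacities, latency formula, and the rule that one node is removed per check round are all hypothetical assumptions chosen only to show the shape of the cascade.

```python
def latency_s(per_node_rps, capacity_rps, base_latency_s):
    """Toy latency model (assumption): latency grows as a node nears saturation."""
    utilization = per_node_rps / capacity_rps
    if utilization >= 1.0:
        return float("inf")  # a saturated node never answers in time
    return base_latency_s / (1.0 - utilization)


def simulate_pool(total_rps, nodes, capacity_rps, base_latency_s, health_timeout_s):
    """Return the healthy-node count after each health check round.

    Assumption: when responses are slower than the health check timeout,
    the load balancer removes one node per round, and the same total
    traffic is spread over the nodes that remain.
    """
    healthy = nodes
    history = [healthy]
    while healthy > 0:
        per_node_rps = total_rps / healthy
        if latency_s(per_node_rps, capacity_rps, base_latency_s) <= health_timeout_s:
            break  # every remaining node passes its check; the pool is stable
        healthy -= 1  # a slow node is pulled, so the rest take on more load
        history.append(healthy)
    return history


# Hypothetical numbers: 90 requests/s over 3 nodes that each handle 40 requests/s.
tight_timeout = simulate_pool(90, 3, 40, 0.1, health_timeout_s=0.3)
relaxed_timeout = simulate_pool(90, 3, 40, 0.1, health_timeout_s=1.0)
bigger_nodes = simulate_pool(90, 3, 160, 0.1, health_timeout_s=0.3)

print(tight_timeout)    # pool drains node by node
print(relaxed_timeout)  # pool stays intact
print(bigger_nodes)     # pool stays intact even with the tight timeout
```

In this model, either a more forgiving timeout or more capacity per node keeps the pool stable, which matches the two contributing factors named above.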

Immediate stabilization involved adding new instances and upgrading the instance type to m3.xlarge, which brought the pool to five instances. Long-term corrective actions include changing Berkshelf API behavior, adjusting Unicorn worker counts, improving download performance with non-blocking metrics, enhancing alerting for ELB node failures, and improving status communication.
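The summary does not say how the metrics were made non-blocking. One common pattern, sketched here under that caveat, is to push each download event onto a bounded in-memory queue and let a background thread do the slow bookkeeping, so the request path never waits on it. The class name `AsyncDownloadCounter` and the cookbook names are hypothetical, and the in-memory dictionary stands in for whatever metrics store the real application uses.

```python
import queue
import threading


class AsyncDownloadCounter:
    """Hypothetical sketch: count downloads without blocking the request path."""

    def __init__(self, maxsize=1000):
        self._events = queue.Queue(maxsize=maxsize)
        self.counts = {}   # stand-in for a real metrics store
        self.dropped = 0   # events discarded because the queue was full
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def record(self, cookbook):
        """Called from the request handler; returns immediately."""
        try:
            self._events.put_nowait(cookbook)
        except queue.Full:
            # Prefer losing a metric over slowing down a download.
            self.dropped += 1

    def _drain(self):
        """Background loop that performs the (potentially slow) update."""
        while True:
            cookbook = self._events.get()
            self.counts[cookbook] = self.counts.get(cookbook, 0) + 1
            self._events.task_done()

    def flush(self):
        """Block until every queued event has been processed."""
        self._events.join()


counter = AsyncDownloadCounter()
for _ in range(3):
    counter.record("apache2")
counter.record("nginx")
counter.flush()
print(counter.counts)
```

The trade-off in this design is that metrics become best-effort: under heavy load, events may be dropped rather than delaying the download itself.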

Keywords

supermarket, latency, unresponsiveness, aws, elb, rails, unicorn, berkshelf