{"UUID":"785c5618-a34e-4dfc-bd04-559fc0635b3b","URL":"https://www.chef.io/blog/2014/07/10/supermarket-intermittent-unresponsiveness-postmortem/","ArchiveURL":"","Title":"Supermarket Intermittent Unresponsiveness","StartTime":"0001-01-01T00:00:00Z","EndTime":"0001-01-01T00:00:00Z","Categories":["automation","cascading-failure","cloud","config-change"],"Keywords":["supermarket","latency","unresponsiveness","aws","elb","rails","unicorn","berkshelf"],"Company":"Chef.io","Product":"Supermarket","SourcePublishedAt":"2014-07-10T18:09:23Z","SourceFetchedAt":"2026-05-04T17:49:34.018205Z","Summary":"Supermarket, the Chef cookbook community site, suffered increased latency and intermittent unresponsiveness about two hours after launch. The postmortem identified overly low health check timeouts as one of the main causes.","Description":"Approximately two hours after its launch as the official community site, Supermarket began experiencing increased latency and intermittent unresponsiveness. The issue affected the main Supermarket site, community.opscode.com, cookbooks.opscode.com, and api.berkshelf.com, disrupting cookbook operations.\n\nThe unresponsiveness caused several user-facing problems: Berkshelf v3.x could not connect to its API to fetch cookbook lists, browsing cookbooks was difficult, and cookbook uploads and downloads were broken or extremely slow. The Supermarket API also became inaccessible.\n\nSeveral factors contributed to the outage. Load planning was skipped in haste, and health check timeouts were set too low, causing a domino effect in which nodes were prematurely removed from the ELB pool. Additionally, the application servers, running on m3.medium instances, were undersized and lacked sufficient CPU to handle the traffic.\n\nImmediate stabilization steps involved adding new instances and upgrading the instance type to m3.xlarge, increasing the capacity to five instances.\n\nLong-term corrective actions include changing Berkshelf API behavior, adjusting Unicorn worker counts, improving download performance with non-blocking metrics, enhancing alerting for ELB node failures, and improving status communication."}