{"UUID":"d17fe136-eb00-44f8-8147-e3dc22358cba","URL":"https://labs.spotify.com/2013/06/04/incident-management-at-spotify/","ArchiveURL":"","Title":"Spotify Popcount service outage of April 2013","StartTime":"2013-04-27T20:00:00Z","EndTime":"0001-01-01T00:00:00Z","Categories":null,"Keywords":["spotify","popcount","outage","european users","client bug","exponential backoff","microservice","cascading failure"],"Company":"Spotify","Product":"Popcount","SourcePublishedAt":"0001-01-01T00:00:00Z","SourceFetchedAt":"2026-05-04T17:53:43.321183Z","Summary":"Lack of exponential backoff in a microservice caused a cascading failure, leading to notable service degradation.","Description":"Spotify experienced a major outage for European users in April 2013, impacting music playback and login functionality. This incident was preceded by a similar issue two months prior involving Popcount, a backend service storing playlist subscriber lists. Popcount was designed to fail fast, but a legacy desktop client component lacked this behavior.\n\nThe legacy client continuously retried fetching Popcount data without exponential backoff, overwhelming the service. This led to a state where recovery was difficult due to the volume of pending requests. Developers deployed a fix for Popcount to fast-fail and return empty lists, which temporarily resolved the issue. However, the root cause in the client was not prioritized for a permanent fix.\n\nOn April 27th, Popcount became unhealthy again. A new \"Discovery\" feature (Bartender service) had unknowingly introduced a dependency on Popcount, increasing its load. The previous fast-fail logic was insufficient. Additionally, excessive logging in Accesspoints, intended for debugging, caused them to become unresponsive due to I/O issues, exacerbating the problem as the faulty client retry behavior continued.\n\nThe combination of factors led to notable service degradation, with most Accesspoints becoming unreachable or extremely slow. To restore service, engineers firewalled off unresponsive Accesspoints, forcing clients to trigger their exponential backoff logic. This allowed the Accesspoints to recover, and service was restored within minutes.\n\nKey lessons included the importance of prioritizing root cause fixes, the dangers of excessive logging, and the need for thorough testing of extreme conditions. Post-incident remediations included fixing the client's faulty retry behavior, implementing static caching for Discovery service data, optimizing Accesspoint logging, and improving syslog flushing. Company-wide education on the incident was also a remediation."}