Multiple Slack service disruptions in October 2014

Slack experienced two significant service disruptions in October 2014, impacting user connectivity. The first occurred on October 14th, and the second on October 16th. Unavailability ranged from a brief lockout of all users to outages of up to two hours for a subset of users.

On Tuesday, October 14th, an automation malfunction during routine maintenance deployed corrupted code to web servers and job queue workers. All users were locked out for 14 minutes, after which 13% of users had poor or no availability for up to two hours. An unrelated internal network issue on Monday had already created a backlog of work.

The immediate attempt by disconnected users to reconnect simultaneously overwhelmed Slack’s database capacity, leading to cascading connection failures. Ultimately, 5% of users remained disconnected for up to two hours while database clusters recovered.

A separate incident occurred on Thursday, October 16th, from 11:27 am to 12:28 pm San Francisco time. This was triggered by a bug introduced during the update of real-time message servers, following the disabling of SSLv3 due to the POODLE vulnerability.

The bug caused message servers to crash, and the simultaneous reconnections from affected users again overwhelmed databases. This was exacerbated by a client-side change that forced a full history reload, increasing strain on database servers.

In response, Slack immediately began adding database capacity and optimizing its reconnection methods to reduce load during mass reconnections. The team also worked on restarting real-time message servers gracefully and on fixing the bug introduced in the update.
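
The summary does not say how reconnection was optimized. One common way to keep a mass reconnection from hitting the database all at once is exponential backoff with random jitter, so that clients spread their retries over time instead of retrying in lockstep. The following is a minimal sketch of that general technique, not Slack's actual implementation; the function names, the 1-second base, and the 60-second cap are all hypothetical.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before reconnect attempt `attempt` (0-based).

    Uses "full jitter": the delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so clients that disconnected at the
    same moment do not all retry at the same moment.
    Names and default values are illustrative assumptions.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def reconnect_schedule(max_attempts, base=1.0, cap=60.0, seed=None):
    """Return the list of jittered delays for `max_attempts` retries."""
    rng = random.Random(seed)  # seedable so a schedule can be reproduced
    return [reconnect_delay(n, base, cap, rng) for n in range(max_attempts)]


if __name__ == "__main__":
    # Print one possible schedule; values differ on every run without a seed.
    for n, delay in enumerate(reconnect_schedule(6)):
        print(f"attempt {n}: wait {delay:.2f}s")
```

The cap bounds the worst-case wait for any single client, while the jitter spreads the reconnect load across the whole backoff window rather than concentrating it at fixed intervals.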
