{"UUID":"61617271-54e4-4f99-9ade-86428b8fe4c3","URL":"https://slackhq.com/this-was-not-normal-really","ArchiveURL":"https://web.archive.org/web/20181208123409if_/https://slackhq.com/this-was-not-normal-really","Title":"Multiple Slack service disruptions in October 2014","StartTime":"0001-01-01T00:00:00Z","EndTime":"0001-01-01T00:00:00Z","Categories":["automation","cascading-failure","security"],"Keywords":["slack","outage","connectivity","database","real-time","automation","sslv3","poodle","ai","workflow","productivity","collaboration","developers","security","enterprise"],"Company":"Slack","Product":"Slack","SourcePublishedAt":"2014-10-16T00:47:00Z","SourceFetchedAt":"2026-05-04T18:15:04.637999Z","Summary":"A combination of factor results in a large number of Slack's users being disconnected to the server. The subsequent massive disconnection-reconnection process exceeded the database capacity and caused cascading connection failures, leading to 5% of Slack's users not being able to connect to the server for up to 2 hours.","Description":"Slack experienced two significant service disruptions in October 2014, impacting user connectivity. The first occurred on October 14th, and the second on October 16th. These incidents led to varying degrees of unavailability for a subset of users.\n\nOn Tuesday, October 14th, routine maintenance led to an automation malfunction, deploying corrupted code to web servers and job queue workers. This caused a 14-minute lockout for all users, followed by 13% of users experiencing poor or no availability for periods of up to two hours. A preceding internal network issue on Monday, though unrelated, contributed to existing work backlogs.\n\nThe immediate attempt by disconnected users to reconnect simultaneously overwhelmed Slack's database capacity, leading to cascading connection failures. Ultimately, 5% of users remained disconnected for up to two hours while database clusters recovered.\n\nA separate incident occurred on Thursday, October 16th, from 11:27 am to 12:28 pm San Francisco time. This was triggered by a bug introduced during the update of real-time message servers, following the disabling of SSLv3 due to the POODLE vulnerability.\n\nThe bug caused message servers to crash, and the simultaneous reconnections from affected users again overwhelmed databases. This was exacerbated by a client-side change that forced a full history reload, increasing strain on database servers.\n\nIn response, Slack immediately began adding additional database capacity and optimizing reconnection methods to reduce the load during mass reconnections. They also worked on gracefully restarting real-time message servers and addressing the introduced bug."}