{"UUID":"a50e7b06-2f2c-4d13-a340-9a1e7eed59d7","URL":"https://slack.engineering/slacks-incident-on-2-22-22/","ArchiveURL":"","Title":"Slack’s Incident on 2-22-22","StartTime":"2022-02-22T06:00:00-08:00","EndTime":"0001-01-01T00:00:00Z","Categories":["automation","cascading-failure","cloud","config-change"],"Keywords":["slack","vitess","memcached","consul","database","caching","client boot","gdm"],"Company":"Slack","Product":"","SourcePublishedAt":"0001-01-01T00:00:00Z","SourceFetchedAt":"2026-05-04T17:51:42.993906Z","Summary":"Cache nodes removal caused the high workload on the vitness cluster, which in turn cased the service outage.","Description":"On February 22, 2022, just after 6 a.m. Pacific Time, Slack experienced a major incident where many users were unable to connect to the service. The primary symptom was the failure of client boot operations, which prevented users from fetching essential data like channel listings and preferences, rendering Slack unusable.\n\nThe incident was triggered by a percentage-based rollout (PBR) of Consul agent upgrades. As Consul agents on Memcached nodes restarted, they temporarily deregistered and re-registered. Mcrib, Slack's cache control plane, efficiently replaced these nodes with empty spares, leading to a significant drop in the cache hit rate across the system.\n\nThis cache degradation exposed an underlying inefficiency in a \"scatter query\" used for Group Direct Message (GDM) conversations. This query, sharded by user, required querying every shard in the Vitess database when data was not in the cache. With a high cache miss rate, the database was overwhelmed by these superlinear read loads, causing queries to time out and preventing caches from refilling, leading to a cascading failure.\n\nInitial mitigation involved throttling client boot requests to reduce database load, which helped users with already booted clients but prevented new connections. The Consul agent restart operation was paused. Subsequently, the problematic scatter query was modified to read only missing data from Vitess and to utilize replicas, which allowed caches to refill and database load to decrease.\n\nThese remediations enabled engineers to slowly increase the client boot rate limit back to normal levels, gradually restoring full service. The incident highlighted complex interactions between the application, Vitess datastores, caching system, and service discovery, leading to process changes for Consul rollouts and modifications to the problematic query."}