{"UUID":"a46ebba4-2d7c-49e6-aa71-0ce989fbff97","URL":"https://status.discordapp.com/incidents/qk9cdgnqnhcn","ArchiveURL":"","Title":"Unavailable Guilds \u0026 Connection Issues","StartTime":"2017-10-13T14:01:00-07:00","EndTime":"2017-10-13T16:52:00-07:00","Categories":["cascading-failure","cloud","config-change"],"Keywords":["discord","api","redis","google cloud platform","gcp","failover","cassandra","outage"],"Company":"Discord","Product":"Discord API and core services","SourcePublishedAt":"0001-01-01T00:00:00Z","SourceFetchedAt":"2026-05-04T17:51:35.650364Z","Summary":"\"At approximately 14:01, a Redis instance acting as the primary for a highly-available cluster used by Discord's API services was migrated automatically by Google’s Cloud Platform. This migration caused the node to incorrectly drop offline, forcing the cluster to rebalance and trigger known issues with the way Discord API instances handle Redis failover. After resolving this partial outage, unnoticed issues on other services caused a cascading failure through Discord’s real time system. These issues caused enough critical impact that Discord’s engineering team was forced to fully restart the service, reconnecting millions of clients over a period of 20 minutes.\"","Description":"On October 13, 2017, at 14:01 PDT, a Google Cloud Platform-initiated migration of a primary Redis instance caused it to drop offline. This triggered a known bug in Discord's API instances, which failed to properly handle the Redis failover, leading to a partial outage. Engineers performed rolling restarts of API instances and escalated the issue to GCP support.\n\nFollowing the initial Redis issue, a misconfigured edge caching rule for an expensive API route was discovered, causing performance degradation and overloading a Cassandra cluster. This misconfiguration was corrected, the Cassandra cluster recovered, and further API instance restarts were performed. By 14:57 PDT, these issues were believed resolved, and the API returned to a nominal state.\n\nHowever, at 15:41 PDT, new anomalies emerged, with users reporting unavailable guilds and connection problems. A misbehaving node in the \"guilds\" cluster was identified, followed by issues in the \"sessions\" cluster. These cascading failures prompted Discord engineers to initiate a full service restart at 16:07 PDT, which involved rebooting various components and correcting an improperly configured API setting for client reconnection.\n\nThe root cause of the initial partial outage was a known bug in how Discord API instances handle Redis failover, which was triggered by the GCP migration. The exact cause of the subsequent cascading full system failure was not fully understood at the time of the postmortem, but it was theorized that the initial failure caused other service nodes to misbehave, run out of memory, and trigger further failures despite safeguards. A misconfigured edge caching rule also contributed to API degradation.\n\nThe incident resulted in unavailable guilds and connection issues for users, requiring millions of clients to reconnect over a 20-minute period after the full service restart. Remediation included multiple rolling restarts, correcting the caching rule, recovering Cassandra, and the full service restart. Discord committed to increasing the priority of fixing Redis failover issues, modifying caching behavior, and adding monitoring to detect cascading failures earlier."}