Discord Connectivity Issues (March 2017)
Discord · Discord
On March 20, 2017, Discord experienced significant connectivity issues affecting a large portion of its users. The incident began at 11:38 PDT when an internal “presence” service server disconnected. This triggered a “thundering herd” of reconnection attempts from “sessions” service nodes, leading to memory exhaustion and crashes in these nodes. Service was mostly recovered by 13:56 PDT, but a recurrence of the presence service disconnection at 14:52 PDT caused further issues, with full resolution achieved by 16:54 PDT.
The root cause of the initial disconnection was identified as CPU soft lockups on a single presence server. These lockups stalled the machine’s network stack for extended periods. A bug in how the presence cluster handled lost nodes made the problem worse: instead of cleanly isolating the faulty node, the cluster entered a split-brain state.
When the presence node failed, millions of Erlang processes in the sessions service tried to reconnect at the same time, overwhelming the remaining presence servers. The sessions service is designed to buffer messages for clients during brief disconnections. Because sessions depend on the presence service, connections stalled for approximately one-third of all connected users, and the in-memory buffers for those users filled quickly. The resulting memory exhaustion crashed the sessions servers.
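To make the buffering failure mode concrete, here is a minimal Python sketch of a per-session buffer with a cap. It is an illustration only: the summary does not say that Discord bounded these buffers, and the names SessionBuffer, max_messages, and overflowed are hypothetical. The idea is that a stalled session which exceeds its cap gives up its buffered messages and is flagged for a full resync, rather than growing until the host runs out of memory.

```python
from collections import deque


class SessionBuffer:
    """Hypothetical capped buffer for one stalled client session."""

    def __init__(self, max_messages: int) -> None:
        self._messages = deque()
        self._max = max_messages
        # Set once the cap is hit; the client would then need a full resync.
        self.overflowed = False

    def push(self, message: str) -> bool:
        """Buffer a message; return False if the session has overflowed."""
        if self.overflowed or len(self._messages) >= self._max:
            # Free the memory instead of letting the buffer grow unbounded.
            self.overflowed = True
            self._messages.clear()
            return False
        self._messages.append(message)
        return True

    def drain(self) -> list:
        """Return and clear whatever is buffered, oldest first."""
        pending = list(self._messages)
        self._messages.clear()
        return pending
```

With a cap of two, the first two calls to push succeed, the third returns False, and drain then returns an empty list because the overflowed session has already released its memory.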
For users, the stalled connections meant significant service degradation. At times they could not send messages, and direct messages were not delivered to recipients. Engineers globally disabled message sending several times to shed load and help recovery.
Discord engineers implemented several remediations. They rolled out upgrades to the presence server that fix the split-brain bug and include other planned improvements. They added a hard limit on the number of in-flight connections from the sessions cluster to the presence server, along with a fast-fail mechanism, to prevent future cascading failures. Discord is also working with Google Cloud Platform’s hypervisor team to investigate and resolve the underlying CPU soft lockup issue.
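The in-flight limit with fast-fail can be sketched in a few lines of Python. This is a generic illustration of the technique, not Discord's implementation (the sessions service runs on Erlang processes, and the summary gives no code). The names InFlightLimiter, PresenceUnavailable, and max_in_flight are assumptions made for the example.

```python
import threading


class PresenceUnavailable(Exception):
    """Raised when the in-flight limit is reached (hypothetical name)."""


class InFlightLimiter:
    """Caps concurrent connection attempts and fails fast at the cap."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, connect):
        """Call connect() if a slot is free; otherwise raise immediately."""
        # Non-blocking acquire: a caller that finds no free slot is rejected
        # at once instead of queueing behind the existing attempts.
        if not self._slots.acquire(blocking=False):
            raise PresenceUnavailable("too many in-flight connections")
        try:
            return connect()
        finally:
            # Always return the slot, even if connect() raised.
            self._slots.release()
```

A caller would wrap each connection attempt, for example limiter.run(open_presence_connection), where open_presence_connection is a placeholder for whatever actually dials the server. Rejecting excess attempts immediately keeps a reconnect storm from piling more work onto servers that are already overloaded.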