Fortnite service outages of February 3-4, 2018

On February 3rd and 4th, 2018, Fortnite experienced a series of six distinct service incidents, ranging from partial to total disruptions. These outages occurred as the game reached a new peak of 3.4 million concurrent players, exposing various scaling challenges and bottlenecks within Epic Games’ infrastructure.

The database behind MCP, the service players contact to retrieve game profiles, statistics, items, and matchmaking info, runs primarily on MongoDB shards and suffered severe latency from write queuing, with matchmaking hit hardest. Database operations spiked to over 40,000 ms, and MCP threads blocked while waiting on them. In addition, a bug introduced before launch had limited the number of service threads available to MCP; once it was corrected, latency in production paradoxically increased because of connection pool starvation. Manual primary failovers were required multiple times per hour to restore functionality.
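
To see why multi-second database latency starves a connection pool, a back-of-the-envelope calculation based on Little's law can help. The sketch below is illustrative only: the pool size of 100 and the 10 ms healthy baseline are hypothetical values, not figures from Epic's postmortem. Only the 40,000 ms latency comes from the account above.

```python
def max_throughput(pool_size: int, latency_seconds: float) -> float:
    """Upper bound on operations per second through a fixed-size pool.

    Each connection is busy for `latency_seconds` per operation, so the pool
    completes at most pool_size / latency_seconds operations per second
    (Little's law rearranged).
    """
    if pool_size <= 0 or latency_seconds <= 0:
        raise ValueError("pool_size and latency_seconds must be positive")
    return pool_size / latency_seconds


def connections_needed(ops_per_second: float, latency_seconds: float) -> float:
    """Average number of connections in use at a given load (Little's law)."""
    return ops_per_second * latency_seconds


# Hypothetical pool of 100 connections (not a figure from the postmortem).
POOL_SIZE = 100
healthy = max_throughput(POOL_SIZE, 0.010)  # assumed 10 ms baseline latency
degraded = max_throughput(POOL_SIZE, 40.0)  # 40,000 ms, as reported
```

Under these assumed numbers the same pool falls from roughly 10,000 operations per second to 2.5. Any thread that needs a connection beyond that rate has to wait, which is one way slow writes can turn into blocked service threads.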

The Account Service, responsible for user data and authentication, went down when Memcached instability saturated Nginx capacity. Nginx, acting as a proxy, ran out of worker threads while waiting on Memcached timeouts, so traffic never reached the main application and the service suffered a full outage. Separately, the XMPP Service, critical for social features, became unstable due to a memory leak. A failover to a standby cluster, combined with a rush of reconnecting players, overloaded an internal load balancer for the Friends Service, leading to IP exhaustion and blocking the flow of presence updates.
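
This summary does not describe how Fortnite clients reconnect after a failover. As an illustration of one common way to soften a reconnect surge like the one above, the sketch below computes retry delays using exponential backoff with full jitter. The function name and the `base` and `cap` values are hypothetical and are not taken from Epic's client or services.

```python
import random
from typing import Optional


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                    rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    "Full jitter": draw uniformly between 0 and a capped exponential ceiling,
    so clients that were disconnected at the same moment do not all retry
    in lockstep.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    source = rng if rng is not None else random
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    ceiling = min(cap, base * (2 ** min(attempt, 30)))
    return source.uniform(0.0, ceiling)


# Example: delays for the first five attempts of one client (assumed defaults).
delays = [reconnect_delay(n) for n in range(5)]
```

Spreading retries over a window that widens with each attempt trades a slightly slower individual reconnect for a flatter arrival rate at the load balancer.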

Beyond these service-specific issues, Epic Games also hit general cloud capacity limits, most notably the total instance limit in its AWS region, which restricted its ability to scale services. Several API rate limits were hit as well. Furthermore, subnets ran out of available IP addresses, which extended load balancer recovery times during incidents.
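
A rough sketch of the kind of headroom check that can surface such limits before they are hit is shown below. It relies on the fact that AWS reserves five addresses in every subnet (the first four and the last). The example CIDR block, the usage count, and the 80% alert threshold are assumptions for illustration; the summary does not give Epic's subnet sizes or alerting rules.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str) -> int:
    """Number of addresses in an AWS subnet that can be assigned to resources."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def is_near_limit(used: int, limit: int, threshold: float = 0.8) -> bool:
    """True when usage has reached `threshold` (a fraction) of a hard limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return used / limit >= threshold


# Hypothetical subnet and usage count, for illustration only.
capacity = usable_addresses("10.0.0.0/24")  # 256 - 5 = 251
alert = is_near_limit(used=210, limit=capacity)
```

The same `is_near_limit` comparison applies to any countable quota, such as a regional instance limit, provided the current usage and the limit can be read from the provider.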

Players experienced unusually long wait times and matchmaking instability, and were often blocked from signing in or out. Social features such as seeing friends online or forming parties were largely non-functional, creating a “dark room” situation. In response, Epic Games said it was working with external experts to resolve database performance issues, optimizing backend calls, improving matchmaking data storage, strengthening operational practices, and improving monitoring for cloud provider limits. Longer-term plans included rearchitecting the core messaging stack, expanding to multiple cloud providers and geographical locations to reduce blast radius, and scaling internal infrastructure.
