Reddit outage and degraded performance on August 11, 2016
Reddit · reddit
On August 11, 2016, Reddit experienced a significant outage, rendering the platform unreachable from 15:24 PDT to 16:52 PDT. This was followed by a period of degraded performance until 18:19 PDT. The incident impacted all official Reddit platforms and the API used by third-party applications, though no user data was lost.
The outage stemmed from an error during a migration of Reddit’s ZooKeeper cluster to new infrastructure within Amazon’s cloud. For the duration of the migration, engineers manually disabled the autoscaler, the system that manages server counts. However, a package management system unexpectedly reverted that manual change and re-enabled the autoscaler.
Once re-enabled, the autoscaler read partially migrated ZooKeeper data, which led it to misidentify many application and caching servers as unhealthy and terminate them within 16 seconds. When the servers were restored, their caches were empty, so requests fell through to the databases; the increased database load prolonged the period of degraded performance.
Reddit engineers quickly identified the issue and set the site to “down mode” while restoring servers. To prevent a recurrence, Reddit plans several improvements: making the autoscaler less aggressive by limiting how many servers it can shut down at once, requiring engineers to work in pairs on risky migration operations, and ensuring package management systems are properly disabled during critical migrations.
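The idea of limiting simultaneous server shutdowns can be sketched as a small guard in front of the termination step. This is an illustrative sketch only, not Reddit's autoscaler: the names `plan_terminations` and `TerminationPlan`, and the parameters `max_per_cycle` and `alarm_fraction`, are hypothetical, and the threshold values are arbitrary assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TerminationPlan:
    to_terminate: List[str]   # servers that may be shut down in this cycle
    deferred: List[str]       # unhealthy-looking servers left running for now
    needs_human_review: bool  # True when the health data itself looks suspect


def plan_terminations(fleet, unhealthy, max_per_cycle=5, alarm_fraction=0.25):
    """Decide which servers a single autoscaler cycle may terminate.

    Hypothetical policy (an assumption, not the real system's behavior):
    - if more than `alarm_fraction` of the fleet looks unhealthy at once,
      distrust the input data and terminate nothing;
    - otherwise terminate at most `max_per_cycle` servers and defer the rest.
    """
    unhealthy_set = set(unhealthy)
    # Preserve fleet order so the plan is deterministic.
    candidates = [server for server in fleet if server in unhealthy_set]

    suspicious = bool(fleet) and len(candidates) / len(fleet) > alarm_fraction
    if suspicious:
        return TerminationPlan([], candidates, True)

    return TerminationPlan(
        candidates[:max_per_cycle], candidates[max_per_cycle:], False
    )


if __name__ == "__main__":
    fleet = ["app-%02d" % i for i in range(20)]
    # A handful of unhealthy servers: handled normally, up to the cap.
    print(plan_terminations(fleet, fleet[:3]))
    # Most of the fleet flagged at once: nothing is terminated.
    print(plan_terminations(fleet, fleet[:15]))
```

Under a policy like this, a sudden report that most of the fleet is unhealthy, as could happen when health data is read from a partially migrated store, results in no terminations and a flag for human review, rather than a mass shutdown.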