Google logged-in services outage due to incorrect configuration

Google · Gmail, Google+, Calendar, Documents

2014-01-24 · cloud

On January 24, 2014, an outage affected most Google users attempting to access logged-in services such as Gmail, Google+, Calendar, and Documents. The triggering fault occurred at 10:55 a.m. PST, and users began seeing errors at 11:02 a.m. PST. Service was largely restored by 11:30 a.m. PST, which meant approximately 25 minutes of downtime for most users and up to 55 minutes for about 10% of the user base.

The incident was triggered by a software bug in an internal system responsible for generating configurations. At 10:55 a.m. PST, this system produced an incorrect configuration, which was then distributed to live services over the following 15 minutes. Services that received it ignored user requests for data and returned errors.

Google’s internal monitoring systems alerted the Site Reliability Team at 11:02 a.m. PST, when users began seeing errors. While engineers were still debugging, the same internal system automatically cleared the original error and generated a new, correct configuration at 11:14 a.m. PST. As this new configuration was distributed, errors subsided rapidly.
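The intervals between the reported timestamps can be checked with a few lines of Python. This is only a small arithmetic sketch: the event names in the dictionary are labels chosen here for illustration, and the times are the ones quoted in this summary.

```python
from datetime import datetime

# Timestamps (PST, 2014-01-24) as quoted in the summary; the keys are
# illustrative labels, not terms taken from Google's report.
events = {
    "bad_config_generated": "10:55",
    "errors_and_alert": "11:02",
    "good_config_generated": "11:14",
    "largely_restored": "11:30",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes elapsed between two same-day HH:MM times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


fault_to_alert = minutes_between(
    events["bad_config_generated"], events["errors_and_alert"]
)
alert_to_fix = minutes_between(
    events["errors_and_alert"], events["good_config_generated"]
)
errors_to_restore = minutes_between(
    events["errors_and_alert"], events["largely_restored"]
)

print(fault_to_alert, alert_to_fix, errors_to_restore)
```

By these endpoints, the alert came 7 minutes after the bad configuration was generated, the corrected configuration followed 12 minutes after the alert, and 28 minutes passed between the first errors and the 11:30 a.m. restoration, in line with the roughly 25 minutes of downtime cited for most users.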

To prevent recurrence, Google planned to correct the specific bug in the configuration generator, audit other critical configuration systems for similar issues, add input validation checks so that bad configurations cannot cause disruptions, and implement additional targeted monitoring for quicker detection and diagnosis of service failures.
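As an illustration of the input validation step, the sketch below shows one way a distribution pipeline might refuse a suspicious configuration and keep serving the last known-good one. It is not Google's implementation: the `Config` schema, the `REQUIRED_SERVICES` list, and the specific checks are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical generated configuration (schema assumed for illustration)."""
    version: int
    routes: dict  # service name -> list of backend addresses


# Illustrative list of services that must always have at least one backend.
REQUIRED_SERVICES = ("gmail", "calendar")


def validate(candidate: Config, last_good: Config) -> list[str]:
    """Return a list of problems; an empty list means the candidate passes."""
    problems = []
    if candidate.version <= last_good.version:
        problems.append("version did not increase")
    for service in REQUIRED_SERVICES:
        # A missing key and an empty backend list are both treated as failures.
        if not candidate.routes.get(service):
            problems.append(f"no backends for {service}")
    return problems


def select_config(candidate: Config, last_good: Config) -> tuple[Config, list[str]]:
    """Pick the config to distribute: the candidate if valid, else last_good."""
    problems = validate(candidate, last_good)
    if problems:
        return last_good, problems
    return candidate, []


good = Config(1, {"gmail": ["10.0.0.1"], "calendar": ["10.0.0.2"]})
bad = Config(2, {"gmail": [], "calendar": ["10.0.0.2"]})

chosen, problems = select_config(bad, good)
print(chosen.version, problems)
```

The design choice here is to fail closed: a candidate that fails any check is never distributed, and the reported problems can feed the kind of targeted monitoring the remediation list also mentions.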

Keywords

google, outage, configuration, software bug, gmail, google+, calendar, documents