Stackdriver Intelligent Monitoring application outage on October 23, 2013
Stackdriver
On October 23, 2013, at approximately 9:45 AM ET, the Stackdriver Intelligent Monitoring application experienced a significant outage. The incident was triggered by the simultaneous crash of the 36-node Cassandra cluster that hosts the time series data displayed in charts; the nodes had been leaking memory for several days and finally ran out, bringing the cluster down all at once.
A critical design flaw in Stackdriver’s message broker exacerbated the impact. When the queue feeding Cassandra became congested, the broker spilled all data to disk and stopped publishing messages to every downstream consumer, not just the congested one. This blocked healthy, independent pipelines such as alerting and archiving, broadening the service degradation. Customers saw a gap of several hours in application charts, from 9:45 AM to 1:30 PM ET on October 23. Alerting was delayed by approximately one hour, and some erroneous alerts fired once the system began working through the buffered backlog. No data was ultimately lost.
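To make the failure mode concrete, here is a minimal sketch of that coupling, assuming a broker that fans each message out to per-pipeline queues but applies flow control globally. The `CoupledBroker` class, pipeline names, and queue sizes are illustrative assumptions, not Stackdriver's actual implementation.

```python
import queue
import threading
import time


class CoupledBroker:
    """One publish path shared by every downstream pipeline."""

    def __init__(self, pipelines, maxsize=1000):
        self.queues = {name: queue.Queue(maxsize=maxsize) for name in pipelines}

    def publish(self, message, timeout=5):
        # Global flow control: a single congested queue blocks publication to
        # every pipeline (here the put times out instead of blocking forever).
        for q in self.queues.values():
            q.put(message, timeout=timeout)

    def consume(self, pipeline):
        return self.queues[pipeline].get()


broker = CoupledBroker(["cassandra", "alerting", "archiving"])


def cassandra_writer():
    # Stands in for the crashed cluster: it never drains its queue.
    time.sleep(3600)


def alerting():
    # Healthy pipeline, but it stops receiving points as soon as publishing
    # stalls on the congested cassandra queue.
    while True:
        point = broker.consume("alerting")
        # ... evaluate alert policies against `point` ...


threading.Thread(target=cassandra_writer, daemon=True).start()
threading.Thread(target=alerting, daemon=True).start()

for i in range(10_000):
    try:
        broker.publish({"metric": "cpu", "value": i})
    except queue.Full:
        print(f"publishing stalled at point {i}: the cassandra queue is full, "
              "so alerting and archiving stop receiving new data as well")
        break
```

Running the sketch, publishing halts as soon as the Cassandra pipeline's queue fills, which starves the alerting pipeline even though its own consumer is healthy.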
Initial attempts to restore the existing Cassandra cluster were unsuccessful. To limit the immediate impact, Stackdriver engineers broke the dependency between the Cassandra queue and the other subsystems at 11:00 AM, allowing alerting and other services to resume. By 11:30 AM they decided to abandon the failed cluster and initiate the disaster recovery plan: a new 36-node Cassandra cluster was deployed and bootstrapped by 1:30 PM. All historical data was restored from the archive by 9:00 AM on October 26.
Stackdriver plans architectural changes to prevent similar incidents. Chief among them is a greater degree of separation between the pipelines for real-time data display and for alerting: the queueing service will be redesigned to break the dependencies between subsystems, so that if one subsystem encounters problems, other critical services, particularly real-time alerting, continue to function unaffected.
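The following sketch shows one shape such decoupling could take, assuming each pipeline gets its own queue and congestion is absorbed per pipeline rather than globally; the `DecoupledBroker` name and the in-memory spill buffer are hypothetical stand-ins, not Stackdriver's actual design.

```python
import queue


class DecoupledBroker:
    """Each pipeline gets its own queue; congestion stays local to it."""

    def __init__(self, pipelines, maxsize=1000):
        self.queues = {name: queue.Queue(maxsize=maxsize) for name in pipelines}
        self.spilled = {name: [] for name in pipelines}  # stand-in for a disk spill

    def publish(self, message):
        for name, q in self.queues.items():
            try:
                q.put_nowait(message)
            except queue.Full:
                # Only this pipeline's copy is buffered for later replay;
                # publication to the other pipelines continues immediately.
                self.spilled[name].append(message)

    def consume(self, pipeline):
        return self.queues[pipeline].get()


broker = DecoupledBroker(["cassandra", "alerting", "archiving"])

# The cassandra pipeline has no consumer here (simulating the crashed cluster),
# yet alerting and archiving keep receiving every point.
for i in range(5_000):
    broker.publish({"metric": "cpu", "value": i})
    broker.consume("alerting")
    broker.consume("archiving")

print("points spilled for the cassandra pipeline:", len(broker.spilled["cassandra"]))
```

With this arrangement, a stalled Cassandra writer only grows its own backlog; alerting and archiving continue to receive every point in real time.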