PagerDuty notification dispatch system outage of April 2013
PagerDuty · notification dispatch system
On Saturday, April 13, 2013, PagerDuty experienced an outage that primarily affected its notification dispatch system. The incident began around 7:57 AM Pacific Time. It first caused delayed notifications and 500 errors on API endpoints, and eventually left PagerDuty unable to dispatch notifications at all for a period.
The root cause was identified as degradation at a common peering point in Northern California. PagerDuty had worked to keep its three datacenters physically separate and free of dependencies on one another, but two of the AWS regions hosting its infrastructure shared this peering point. It therefore acted as a single point of failure, effectively taking two of PagerDuty’s datacenters offline simultaneously.
PagerDuty completely lost its ability to dispatch notifications between 8:35 AM and 8:53 AM Pacific Time, a period of 18 minutes, because high network latency prevented it from establishing quorum. Although notification dispatch was impacted, the events API remained operational and continued to accept events throughout the incident.
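To illustrate why high latency alone can cost a system its quorum, here is a minimal sketch of a majority check. It is not PagerDuty's implementation: the member names, the round-trip times, the timeout, and the assumption of one voting member per datacenter are all hypothetical.

```python
from typing import Dict, List


def quorum_size(n_members: int) -> int:
    """Smallest strict majority of n_members voting members."""
    return n_members // 2 + 1


def responsive_members(rtt_ms: Dict[str, float], timeout_ms: float) -> List[str]:
    """Members whose round-trip time is within the timeout.

    A member that answers too slowly is treated the same as one that is down.
    """
    return [name for name, rtt in rtt_ms.items() if rtt <= timeout_ms]


def has_quorum(rtt_ms: Dict[str, float], timeout_ms: float) -> bool:
    """True if enough members respond in time to form a majority."""
    return len(responsive_members(rtt_ms, timeout_ms)) >= quorum_size(len(rtt_ms))


# Hypothetical measurements, assuming one voting member per datacenter.
HEALTHY = {"dc_a": 5.0, "dc_b": 20.0, "dc_c": 25.0}
DEGRADED = {"dc_a": 5.0, "dc_b": 4000.0, "dc_c": 6000.0}
```

With three members a majority is two, so has_quorum(HEALTHY, 1000.0) is true, while has_quorum(DEGRADED, 1000.0) is false because only one member answers within the timeout. No host has to be down for this to happen; slow links between datacenters are enough.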
PagerDuty's short-term improvements included adding more logging and aggregating logs so they are easier to search, a planned process watcher to automatically restart failed coordinator processes, and a dashboard to improve visibility into inter-host connectivity.
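The idea of a process watcher can be sketched as a loop that restarts a command whenever it exits with an error. This is only an illustration under stated assumptions: the function name, the restart limit, and the fixed backoff are hypothetical, and the source does not say how PagerDuty built its watcher. In practice a service manager such as systemd often fills this role.

```python
import subprocess
import time
from typing import Sequence


def watch(cmd: Sequence[str], max_restarts: int, backoff_s: float = 1.0) -> int:
    """Run cmd, restarting it whenever it exits with a non-zero status.

    Returns the number of restarts performed once cmd exits cleanly.
    Raises RuntimeError if cmd still fails after max_restarts restarts.
    """
    restarts = 0
    while True:
        # Start the process and block until it exits.
        exit_code = subprocess.Popen(list(cmd)).wait()
        if exit_code == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(f"{cmd!r} failed {restarts + 1} times; giving up")
        restarts += 1
        # Pause before restarting so a crash loop does not spin the CPU.
        time.sleep(backoff_s)
```

A call such as watch(["/path/to/coordinator"], max_restarts=5), where the path is a placeholder, would keep the process running through up to five failures. The restart limit is a deliberate choice: a watcher that retries forever can hide a persistent fault that needs human attention.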
For long-term remediation, PagerDuty committed to investing in staff training for Cassandra and ZooKeeper. It also planned to investigate moving off one of the affected AWS regions, and emphasized the need for thorough due diligence when selecting new hosting providers and datacenters to prevent similar single points of failure.