{"UUID":"55f106c4-1b4e-4d02-bae9-ccf07ff223c2","URL":"https://www.pagerduty.com/blog/outage-post-mortem-april-13-2013/","ArchiveURL":"https://web.archive.org/web/20211019062735if_/https://www.pagerduty.com/blog/outage-post-mortem-april-13-2013/","Title":"PagerDuty notification dispatch system outage of April 2013","StartTime":"2013-04-13T14:57:00Z","EndTime":"2013-04-13T16:23:00Z","Categories":["automation","cloud","config-change","security"],"Keywords":["pagerduty","aws","peering point","network","outage","notifications","datacenter","california"],"Company":"Pagerduty","Product":"notification dispatch system","SourcePublishedAt":"2013-04-24T18:59:34Z","SourceFetchedAt":"2026-05-04T17:48:11.178076Z","Summary":"In April 2013, [Pagerduty](https://web.archive.org/web/20220906003007/https://www.pagerduty.com/), a cloud service proving application uptime monitoring and real-time notifications, suffered an outage when two of its three independent cloud deployments in different data centers began experiencing connectivity issues and high network latency. It was found later that the two independent deployments shared a common peering point which was experiencing network instability.  While the third deployment was still operational, Pagerduty's applications failed to establish quorum due to to high network latency and hence failed in their ability to send notifications.","Description":"On Saturday, April 13, 2013, PagerDuty experienced an outage that primarily affected its notification dispatch system. The incident began around 7:57 AM Pacific Time, leading to delays in notifications and 500 errors on API endpoints, and eventually a complete inability to dispatch notifications for a period.\n\nThe root cause was identified as a degradation in a common peering point located in Northern California. 
This peering point was shared by two AWS regions where PagerDuty hosted its infrastructure, despite efforts to ensure physical separation and no dependency between its three datacenters. This single point of failure effectively took two of PagerDuty's datacenters offline simultaneously.\n\nThe outage resulted in PagerDuty completely losing its ability to dispatch notifications between 8:35 AM and 8:53 AM Pacific Time, a duration of 18 minutes, because it could not establish quorum due to high network latency. While notifications were impacted, the events API remained operational and continued to accept events throughout the incident.\n\nIn the short term, PagerDuty implemented several improvements: adding more logging and aggregating logs for better searchability, planning a process watcher to automatically restart failed coordinator processes, and building a dashboard to improve visibility into inter-host connectivity.\n\nFor long-term remediation, PagerDuty committed to investing in staff training for Cassandra and ZooKeeper. It also planned to investigate moving off one of the affected AWS regions, emphasizing the need for thorough due diligence when selecting new hosting providers and datacenters to prevent similar single points of failure."}