{"UUID":"3e8b0d65-84e5-4a4f-bd28-81a2693432d4","URL":"https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/","ArchiveURL":"https://web.archive.org/web/20260315195449/https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/","Title":"GoCardless API and Dashboard outage on 10 October 2017","StartTime":"2017-10-10T14:09:00Z","EndTime":"2017-10-10T15:59:00Z","Categories":["automation","cloud","config-change","hardware"],"Keywords":["api","dashboard","postgres","pacemaker","database","hardware failure","configuration","gocardless"],"Company":"GoCardless","Product":"API and Dashboard","SourcePublishedAt":"2019-07-08T13:45:35.007Z","SourceFetchedAt":"2026-05-04T17:47:03.922917Z","Summary":"A bad config combined with an uncommon set of failures led to an outage of a database cluster, taking the API and Dashboard offline.","Description":"On the afternoon of 10 October 2017, GoCardless experienced an outage of its API and Dashboard, lasting 1 hour and 50 minutes. During this period, all requests to these services failed and returned an error. The incident began at 15:09 BST, when monitoring detected the outage, and services were confirmed back up at 16:59 BST after manual intervention.\n\nThe incident was triggered by a hardware failure on the primary database node, specifically a disk array failure. This alone should have caused only a brief outage, as the database cluster automation, Pacemaker, was designed to promote a replica to primary. However, the automation failed to do so, extending what would typically be a 1-2 minute outage to almost two hours.\n\nThe root cause was identified as a combination of three factors. First, a `default-resource-stickiness` Pacemaker setting biased the cluster toward keeping resources on their current nodes rather than moving them. Second, a \"Backup VIP\" resource, intended to reduce load on the primary, had a `-INF` colocation rule with the Postgres primary and was running on the synchronous replica. Third, a Postgres subprocess on the synchronous replica crashed almost simultaneously with the primary's disk failure. Together, these three conditions prevented Pacemaker from successfully promoting a new primary.\n\nCustomer impact included the complete unavailability of the API and Dashboard for 1 hour and 50 minutes, leading to failed requests. Extensive verification found no evidence of data corruption.\n\nRemediation involved engineers manually promoting a synchronous replica to primary and reconfiguring backend applications to connect to its IP address. Post-incident, a new database cluster was provisioned and traffic was migrated to it. The Pacemaker configuration was adjusted to correctly handle failovers, and the team committed to improving fault injection testing and addressing knowledge atrophy around manual recovery procedures."}