15 seconds of API downtime during PostgreSQL migration
GoCardless · API
In late January, GoCardless experienced approximately 15 seconds of unexpected API downtime during a planned database migration. Shortly after the schema migration began, API requests started timing out, indicating a service disruption.
The incident was triggered by an ALTER TABLE statement attempting to add a foreign key constraint. This operation required an AccessExclusive lock on the referenced parent table. Unfortunately, a long-running read query was simultaneously holding an AccessShare lock on the same parent table.
The AccessExclusive lock, unable to be acquired immediately, entered a queue. Because AccessExclusive locks conflict with all other lock types, subsequent read/write queries from the API also queued up behind it, leading to widespread timeouts and the 15-second API outage.
To prevent similar issues, GoCardless identified several best practices. These include eliminating long-running queries, running analytical queries on replicas, setting lock_timeout in migration scripts, splitting schema changes into smaller transactions, and keeping PostgreSQL updated to benefit from locking code improvements.