Sentry hosted Postgres XID wraparound outage
Sentry · Sentry (hosted)
On Monday, July 20, 2015, hosted Sentry was down for most of the US working day because the primary Postgres database hit transaction-ID wraparound protection and stopped accepting writes.
Postgres uses a 32-bit transaction ID (XID) counter to determine row visibility under MVCC. XIDs are compared circularly, so only about 2 billion of them can separate the oldest unfrozen XID from the newest one. When fewer than one million IDs remain before that limit, Postgres refuses new write commands to prevent the catastrophic outcomes of wraparound (deleted rows reappearing, updated rows reverting, broken referential integrity). Routine VACUUM (or autovacuum) freezes old row versions, which advances the oldest unfrozen XID and keeps that margin from running out.
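The arithmetic behind that limit can be sketched in a few lines of Python. This is an illustration only, not Postgres source: the constants are rounded, the special reserved XIDs are ignored, and the function names are made up for this example.

```python
# Illustrative constants; the exact limits are Postgres internals and vary by version.
XID_SPACE = 2**32        # size of the 32-bit XID counter
WRAP_HORIZON = 2**31     # roughly 2 billion XIDs of usable visibility window
STOP_MARGIN = 1_000_000  # writes are refused inside this margin (figure from the text)


def xid_precedes(a: int, b: int) -> bool:
    """Circular comparison: True if XID a is logically older than XID b.

    Mirrors the idea of treating (a - b) as a signed 32-bit difference,
    which is why only half of the 32-bit space is usable at once.
    """
    diff = (a - b) % XID_SPACE
    return diff >= WRAP_HORIZON


def xids_until_write_refusal(oldest_unfrozen_age: int) -> int:
    """XIDs that can still be assigned before the protection trips."""
    return max(0, WRAP_HORIZON - STOP_MARGIN - oldest_unfrozen_age)


def hours_of_headroom(oldest_unfrozen_age: int, xids_per_second: float) -> float:
    """Time left at a steady XID consumption rate, in hours."""
    if xids_per_second <= 0:
        raise ValueError("xids_per_second must be positive")
    return xids_until_write_refusal(oldest_unfrozen_age) / xids_per_second / 3600
```

For example, with an oldest unfrozen XID age of 1.5 billion and a hypothetical consumption rate of 10,000 XIDs per second, `hours_of_headroom(1_500_000_000, 10_000)` comes out at roughly 18 hours, which is little time if a vacuum of a large table takes many hours to finish.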
Sentry is “very write-heavy” and stores large relational tables, so the team had been continually tuning autovacuum to stay ahead of XID consumption. On July 20 the protection tripped anyway. With writes blocked, the only safe path forward was to let the in-flight autovacuums complete; restarting into Postgres’s single-user mode would have meant interrupting them and potentially extending the outage. The team had already failed the primary over to newer hardware with more memory and CPUs dedicated to maintenance, which is where the autovacuums were running.
After several hours, one of the autovacuums apparently failed or did not behave as expected, and Postgres still refused writes. Autovacuum verbosity was not enabled, so the logs could not explain what had happened. While the database was read-only, the rest of the system was backing up (Redis buffers ballooning, queue depth growing), so Sentry flushed the entire event backlog rather than let it make the situation worse. The database was eventually returned to a writable state and Sentry came back online.
Committed follow-ups: keep autovacuum on and aggressively tuned for a write-heavy workload; provision enough headroom in maintenance memory and CPU that VACUUM can keep up under load; turn on autovacuum verbosity so future post-mortems aren’t blind to which vacuum did what; and treat XID consumption as a first-class observability metric, since “we’ll notice when it gets close” is not actually true at scale.
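Treating XID consumption as a metric could look something like the following Python sketch. The catalog query is standard Postgres (`age(datfrozenxid)` on `pg_database`), but the thresholds, the `XidStatus` type, and the `classify` function are hypothetical choices for this example, not Sentry's actual monitoring. Running the query is left to whatever database driver is in use.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Standard catalog query: how many XIDs old each database's oldest unfrozen XID is.
XID_AGE_QUERY = "SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database"

# Hypothetical alert thresholds; pick values that leave time for a full vacuum.
WARN_AGE = 1_000_000_000
CRIT_AGE = 1_500_000_000


@dataclass
class XidStatus:
    database: str
    xid_age: int
    level: str  # "ok", "warning", or "critical"


def classify(rows: Iterable[Tuple[str, int]],
             warn: int = WARN_AGE,
             crit: int = CRIT_AGE) -> List[XidStatus]:
    """Turn (datname, xid_age) rows into alert levels, oldest first."""
    statuses = []
    for database, xid_age in rows:
        if xid_age >= crit:
            level = "critical"
        elif xid_age >= warn:
            level = "warning"
        else:
            level = "ok"
        statuses.append(XidStatus(database, xid_age, level))
    return sorted(statuses, key=lambda s: s.xid_age, reverse=True)
```

Feeding the rows returned by `XID_AGE_QUERY` into `classify` on a schedule gives a trend line and an alert well before the write-refusal margin. On the logging side, Postgres exposes `log_autovacuum_min_duration`; assuming that is the kind of verbosity the follow-up refers to, setting it to `0` logs every autovacuum run.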