Kickstarter MySQL replication failure

Kickstarter · Amazon RDS MySQL replication

2013-03-07 cloud

On Thursday, March 7th, Kickstarter experienced a critical incident where all MySQL replicas stopped updating, leading to stale and inconsistent data across the site. The issue was signaled by Duplicate entry errors during replication, indicating a divergence between the master and its replicas.

Investigation revealed that specific data, particularly backer sequences, had been inconsistent for days, with some inconsistencies dating back over a month. Replication finally broke when an “unsafe query” caused MySQL's mixed-mode replication to switch to row-based replication for that change, which exposed the underlying data divergence.
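The difference between the two replication styles can be illustrated with a deliberately simplified model. The sketch below is not MySQL code and does not reproduce Kickstarter's schema: the `Table` class, the `sequence` unique key, and the row values are all hypothetical. It only shows why a replica that has already diverged can keep replaying statements without complaint, yet fail with a duplicate-key error as soon as it is handed a concrete row computed on the master.

```python
class DuplicateEntry(Exception):
    """Stands in for MySQL's 'Duplicate entry' error on a unique key."""


class Table:
    """Toy table mapping row id -> sequence, with a unique key on sequence."""

    def __init__(self, rows):
        self.rows = dict(rows)

    def next_sequence(self):
        # Value a statement would compute from this server's own data.
        return max(self.rows.values(), default=0) + 1

    def insert(self, row_id, sequence):
        if sequence in self.rows.values():
            raise DuplicateEntry(f"Duplicate entry '{sequence}' for key 'sequence'")
        self.rows[row_id] = sequence


master = Table({1: 1, 2: 2})
replica = Table({1: 1, 2: 3})  # assumption: the replica has already diverged

# Statement-based style: each server re-runs the statement against its own
# data, so the replica computes a different value and nothing fails.
master.insert(3, master.next_sequence())
replica.insert(3, replica.next_sequence())

# Row-based style: the master ships the exact row it wrote, and the replica
# must apply it as-is. The value collides with the replica's diverged data.
row_image = (4, master.next_sequence())
master.insert(*row_image)
try:
    replica.insert(*row_image)
    replication_error = None
except DuplicateEntry as exc:
    replication_error = str(exc)

print(replication_error)
```

In this model the statement-style insert leaves the two servers silently disagreeing, while the row-style insert is the first point at which the disagreement becomes an error.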

The core issue was traced to a MySQL bug where the ORDER BY clause in an UPDATE query was sometimes ignored during replication. This, combined with InnoDB’s transaction optimization causing transactions to finish out of order on the master, led to the master and replicas applying updates in different implicit orders, resulting in data inconsistency.
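To see why a dropped ORDER BY matters, consider an update whose result depends on the order in which rows are visited, such as numbering rows sequentially. The sketch below is a hypothetical illustration only: the actual query is not shown in this summary, and the row ids and the "storage order" are invented. It assumes the master honors the ordering while the replica visits rows in whatever order storage returns them.

```python
from typing import Dict, List


def assign_sequences(row_ids: List[int]) -> Dict[int, int]:
    """Give each row the next sequence number, in the order rows are visited."""
    return {row_id: seq for seq, row_id in enumerate(row_ids, start=1)}


# Hypothetical row ids, listed in the order storage happens to return them.
storage_order = [103, 101, 102]

# Master: the ORDER BY is honored, so rows are numbered in id order.
master = assign_sequences(sorted(storage_order))

# Replica: the ORDER BY is ignored, so rows are numbered in storage order.
replica = assign_sequences(storage_order)

# Rows whose sequence differs between master and replica.
diverged = {
    row_id: (master[row_id], replica[row_id])
    for row_id in master
    if master[row_id] != replica[row_id]
}

print(diverged)
```

The same logical update produces different sequence values on each server, and because no statement fails, the divergence can go unnoticed until something forces a direct comparison of rows.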

The incident sent Kickstarter scrambling to stabilize the site, minimize the effects of stale replicas, and communicate with users. Backer sequences feed creator reports, so the inconsistencies affected the accuracy of the information provided to project creators.

While the immediate crisis was resolved within hours by stabilizing the site, the underlying bug had caused data inconsistencies for an extended period. The primary outcome of the incident was the discovery and understanding of this critical MySQL replication bug, highlighting the importance of deeply understanding database internals beyond ORM abstractions.

Keywords

mysql, replication, amazon rds, database, inconsistency, bug, order by, kickstarter, data inconsistency, mixed mode replication, backer sequences