Strava upload outage

Strava · Strava (uploads)

2014-07-29 – 2014-07-30 · automation

At 15:10 PDT on Tuesday, July 29, 2014, the auto-incrementing primary key of a Strava database table that maps activities to their data streams (location, distance, speed, …) hit the maximum positive value of a 4-byte signed integer (2,147,483,647) and inserts started failing. Stream data itself was on S3; the metadata row pointing to it lived in this table. Without those rows, no upload could be processed.
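The limit described above can be checked with a few lines of arithmetic. This is a minimal sketch in Python, not Strava's code; the helper name `next_id_fits` is hypothetical and only illustrates why the insert after key 2,147,483,647 has nowhere to go in a signed 4-byte column.

```python
# Largest values a 4-byte integer column can hold.
INT32_SIGNED_MAX = 2**31 - 1    # 2,147,483,647
INT32_UNSIGNED_MAX = 2**32 - 1  # 4,294,967,295


def next_id_fits(current_max_id: int, column_max: int) -> bool:
    """Return True if the next auto-increment value still fits in the column.

    Hypothetical helper for illustration only.
    """
    return current_max_id + 1 <= column_max


if __name__ == "__main__":
    last_id = INT32_SIGNED_MAX
    print(next_id_fits(last_id, INT32_SIGNED_MAX))    # signed column is full
    print(next_id_fits(last_id, INT32_UNSIGNED_MAX))  # unsigned column has room
```

An unsigned 4-byte column holds the same number of distinct values but spends none of them on negatives, so the usable positive range roughly doubles.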

For about 35 minutes all uploads failed. Strava then put the site into maintenance mode to plan a response and brought it back online at 16:10 PDT with background job processing disabled. Interactive features still worked, and incoming uploads were accepted and queued for processing once the database was fixed. The team chose this over keeping the site fully down or attempting emergency code changes against a live primary.
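The degraded mode described above (accept uploads, defer processing) can be sketched as two independent switches over a queue. This is an illustrative Python sketch under stated assumptions, not Strava's implementation: the class `UploadPipeline` and its flags `accept_uploads` and `process_uploads` are hypothetical names, and an in-memory deque stands in for a real job queue.

```python
from collections import deque


class UploadPipeline:
    """Toy model: accepting uploads and processing them are toggled separately."""

    def __init__(self) -> None:
        self.accept_uploads = True    # hypothetical switch: take new uploads
        self.process_uploads = True   # hypothetical switch: run background jobs
        self.pending = deque()        # uploads waiting to be processed
        self.processed = []           # uploads that have been processed

    def submit(self, upload_id: str) -> bool:
        """Queue an upload; return False if uploads are currently rejected."""
        if not self.accept_uploads:
            return False
        self.pending.append(upload_id)
        return True

    def drain(self) -> int:
        """Process everything queued, unless processing is disabled."""
        if not self.process_uploads:
            return 0
        count = 0
        while self.pending:
            self.processed.append(self.pending.popleft())
            count += 1
        return count


if __name__ == "__main__":
    pipeline = UploadPipeline()
    pipeline.process_uploads = False  # database not yet fixed
    pipeline.submit("ride-1")
    pipeline.submit("run-2")
    print(pipeline.drain(), len(pipeline.pending))
    pipeline.process_uploads = True   # migration done, work off the backlog
    print(pipeline.drain(), len(pipeline.pending))
```

Keeping the two switches separate means a storage problem on the processing side does not force the site to reject new uploads.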

While the site was back up, they began the database migration to change the column from a signed to an unsigned integer, which raises the maximum key to 4,294,967,295. The migration could not be performed in place against the volume of data in the table, so they spun up a fresh slave, ran the migration there, and failed over once it completed. The migration took about 9 hours. Upload processing resumed for the queued backlog overnight and all uploads were processed by 07:00 PDT on July 30, roughly 16 hours after the outage began.

Schema migrations to widen integer keys were a routine operation Strava had run many times before, but this particular database was missing keyspace-utilization monitoring and so gave no advance warning. Committed follow-ups: audit every primary key in the system and preemptively widen every remaining signed-integer key to unsigned; add keyspace monitoring as part of standard production review; build finer-grained controls so uploads and upload processing can be disabled independently rather than relying on the current broad on/off switch; and stand up a status site plus in-app notifications, since user-facing communication during the outage had been inadequate.
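A keyspace-utilization check of the kind named in the follow-ups could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the `KeyStatus` record, the 70% alert threshold, the table name, and the daily growth figure are hypothetical, and in practice the current maximum id would come from querying each table.

```python
from dataclasses import dataclass

SIGNED_INT_MAX = 2**31 - 1  # limit of a 4-byte signed key


@dataclass
class KeyStatus:
    table: str       # table name (hypothetical examples below)
    max_id: int      # highest primary key currently in use
    column_max: int  # largest value the key column can hold

    @property
    def utilization(self) -> float:
        """Fraction of the keyspace already consumed."""
        return self.max_id / self.column_max


def keys_needing_attention(statuses, threshold=0.7):
    """Return the tables whose key utilization is at or above the threshold."""
    return [s for s in statuses if s.utilization >= threshold]


def days_until_exhaustion(status: KeyStatus, ids_per_day: float) -> float:
    """Estimate remaining days at a constant insert rate."""
    if ids_per_day <= 0:
        raise ValueError("ids_per_day must be positive")
    return (status.column_max - status.max_id) / ids_per_day


if __name__ == "__main__":
    statuses = [
        KeyStatus("activity_streams", 2_000_000_000, SIGNED_INT_MAX),
        KeyStatus("small_table", 1_000, SIGNED_INT_MAX),
    ]
    for s in keys_needing_attention(statuses):
        remaining = days_until_exhaustion(s, ids_per_day=1_000_000)
        print(f"{s.table}: {s.utilization:.1%} used, ~{remaining:.0f} days left")
```

Run on a schedule, a check like this turns key exhaustion into an alert with lead time, which leaves room for the roughly 9-hour migration to happen before inserts fail.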

Keywords

strava, primary key, signed integer, database migration, keyspace monitoring, background jobs, streams