{"UUID":"da6d262f-9aa9-43a3-881c-66c0afa03fb2","URL":"https://blog.travis-ci.com/2018-04-03-incident-post-mortem","ArchiveURL":"https://web.archive.org/web/2019/https://blog.travis-ci.com/2018-04-03-incident-post-mortem","Title":"Travis CI database truncation and cross-account session exposure","StartTime":"2018-03-13T12:14:00Z","EndTime":"2018-03-13T21:14:00Z","Categories":["automation","security"],"Keywords":["travis ci","production database","truncate","database_url","tmux","database cleaner","session token","localstorage","credential rotation"],"Company":"Travis CI","Product":"travis-ci.com","SourcePublishedAt":"2018-04-03T00:00:00Z","SourceFetchedAt":"0001-01-01T00:00:00Z","Summary":"A stale environment variable caused a local test suite to truncate the production database.","Description":"On March 13, 2018, travis-ci.com was non-operational for around 5.5 hours starting at 12:14 UTC, with another 3.5 hours of build backlog after recovery. The trigger was a developer's local test suite truncating the production database, and the recovery briefly left the application serving an empty database, which led to a cross-account session-token exposure.\n\nA developer ran the project test suite from a tmux pane that had been used days earlier to inspect production data and still had `DATABASE_URL` pointing at the primary production Postgres. The Database Cleaner gem issued a TRUNCATE against every table during test setup. The query was blocked behind other activity for ~10 minutes and finally executed at 12:14 UTC. The team responded to alerts almost immediately, but missed that the API was still operational and serving from an empty database for roughly the next 30 minutes.\n\nDuring those 30 minutes, anyone who signed in to travis-ci.com saw a blank profile. Their old user records were gone, so the application created new records with primary keys drawn from the id sequences, which the TRUNCATE had left untouched (Postgres does not reset sequences on TRUNCATE unless RESTART IDENTITY is specified). Travis eventually took the user-facing applications offline, restored the database, and brought everything back. When customers logged back in, those who had signed in during the 30-minute window found themselves logged in as the *wrong user*: their localStorage held a signed token whose user id now pointed at a record created post-restore and reassigned by the user-sync from GitHub. Because Travis syncs user records from GitHub on a regular basis, both new and existing customers were potentially affected.\n\nTravis revoked all affected tokens by 14:22 UTC on March 14 and contacted every potentially impacted GitHub repo admin with a Security Advisory. The team also discovered that their cron scheduler had not been restarted after the database recovery, breaking scheduled jobs.\n\nContributing factors: developer environments could connect to the production primary with write access because the read-only follower was harder to reach, so the easy path was the dangerous path; the team rushed back to user-facing recovery and inadvertently left applications running while the database was empty, creating the duplicate-id situation; and certain alerts had not been re-armed after the outage, hiding the broken cron scheduler. Roughly 15 minutes of data was permanently lost once the database was restored. Committed follow-ups: make it harder to connect to production with write access from a developer machine, treat \"applications still running while the database is unreachable\" as an explicit incident state, and add alert re-arming to the standard recovery checklist."}