{"UUID":"f70ce67c-0944-422e-b0bd-be38f0edb0cd","URL":"https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/","ArchiveURL":"","Title":"GitLab.com database outage of January 31, 2017","StartTime":"2017-01-31T23:30:00Z","EndTime":"2017-02-01T18:00:00Z","Categories":["automation","cloud","config-change","hardware"],"Keywords":["gitlab.com","database","postgresql","data loss","replication","azure","lvm","backup"],"Company":"Gitlab","Product":"GitLab.com","SourcePublishedAt":"2017-02-10T00:00:00Z","SourceFetchedAt":"2026-05-04T17:55:28.107695Z","Summary":"Influx of requests overloaded the database, caused replication to lag, tired admin deleted the wrong directory, six hours of data lost. See also [earlier report](https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident) and [HN discussion](https://news.ycombinator.com/item?id=13537052).","Description":"On January 31, 2017, GitLab.com experienced increased database load starting around 19:00 UTC, leading to PostgreSQL replication lag. Around 23:00 UTC, an engineer attempted to resynchronize the secondary database, which involved wiping its data directory. At approximately 23:30 UTC, an engineer mistakenly wiped the primary database's data directory instead of the secondary's, causing a major outage. Recovery efforts began immediately, but were hampered by non-functional backup systems. GitLab.com was eventually restored by February 1, 2017, 18:00 UTC.\n\nThe primary database server for GitLab.com had its data directory accidentally deleted. This was compounded by multiple backup and recovery mechanisms failing. The pg_dump backups to S3 were not working due to a PostgreSQL version mismatch and silent failure notification. Azure disk snapshots were not enabled for database servers, and LVM snapshots, while available, were not designed for rapid disaster recovery. The secondary database was also unusable for failover as its data had been wiped earlier.\n\nThe immediate cause was an engineer mistakenly executing a `rm -rf` command on the primary database server while attempting to fix replication on the secondary. This human error was exacerbated by a lack of clear documentation for `pg_basebackup` behavior and an exhausted engineer. Underlying issues included a single point of failure database setup, silent backup failures due to version mismatch and DMARC email rejection, and inadequate disaster recovery procedures.\n\nGitLab.com was unavailable for approximately 18 hours. Data modifications (projects, comments, user accounts, issues, snippets) made between January 31st 17:20 UTC and 00:00 UTC were lost. This affected an estimated 5,000 projects, 5,000 comments, and 700 new user accounts. Code repositories and wikis were unavailable but not lost. Self-managed GitLab instances were unaffected.\n\nGitLab committed to multiple improvements to operations and recovery procedures. This included addressing the single point of failure, fixing backup systems, improving documentation, and enhancing disaster recovery capabilities. The recovery process itself involved restoring from an LVM snapshot taken 6 hours prior to the deletion, which was a slow process due to slow Azure disks."}