{"UUID":"80919009-d534-486c-9142-284d27690da0","URL":"https://github.blog/2021-12-01-github-availability-report-november-2021/","ArchiveURL":"","Title":"GitHub November 2021 Availability Incident due to MySQL Schema Migration","StartTime":"2021-11-27T20:40:00Z","EndTime":"2021-11-27T23:30:00Z","Categories":["automation","cascading-failure","cloud","config-change","security"],"Keywords":["github","mysql","schema migration","read replicas","deadlock","availability","november 2021"],"Company":"Github","Product":"GitHub Actions, API Requests, Codespaces, Git Operations, Issues, GitHub Packages, GitHub Pages, Pull Requests, Webhooks","SourcePublishedAt":"2021-12-01T20:14:04Z","SourceFetchedAt":"2026-05-04T17:50:00.202946Z","Summary":"GitHub's platform encountered a novel failure mode when processing a schema migration on a large MySQL table. Schema migrations are a common task at GitHub and often take weeks to complete. The final step in a migration is to perform a rename to move the updated table into the correct place. During the final step of this migration a significant portion of our MySQL read replicas entered a semaphore deadlock. Our MySQL clusters consist of a primary node for write traffic, multiple read replicas for production traffic, and several replicas that serve internal read traffic for backup and analytics purposes. The read replicas that hit the deadlock entered a crash-recovery state causing an increased load on healthy read replicas. Due to the cascading nature of this scenario, there were not enough active read replicas to handle production requests which impacted the availability of core GitHub services.","Description":"On November 27, 2021, starting at 20:40 UTC and lasting 2 hours and 50 minutes, GitHub experienced an incident that significantly impacted the availability of core services. 
Affected services included GitHub Actions, API Requests, Codespaces, Git Operations, Issues, GitHub Packages, GitHub Pages, Pull Requests, and Webhooks.\n\nThe incident stemmed from a novel failure mode during a schema migration on a large MySQL table. Specifically, during the final rename step of the migration, a significant portion of GitHub's MySQL read replicas entered a semaphore deadlock. This caused the affected read replicas to enter a crash-recovery state.\n\nThe crash-recovery state of the deadlocked replicas led to an increased load on the remaining healthy read replicas. This cascading effect resulted in an insufficient number of active read replicas to handle production requests, thereby degrading the availability of core GitHub services for users. Write operations remained healthy, and no data corruption occurred.\n\nDuring mitigation, GitHub attempted to increase capacity by promoting healthy internal replicas to production, but this was not sufficient. To restore service, production traffic was proactively removed from broken replicas until they could successfully process the table rename and recover. Once recovered, these replicas were returned to production, restoring normal operations.\n\nTo prevent similar incidents and reduce recovery time, GitHub is prioritizing functional partitioning efforts, which will allow migrations to run in canary mode on single shards. Additionally, internal procedures are being updated to increase the over-provisioning of each cluster. Schema migrations have been paused while the specific failure scenario is further investigated and migration tooling improvements are identified."}