{"UUID":"5d85d316-f0c4-4c1a-81a2-5f89aaad6148","URL":"https://github.com/blog/2106-january-28th-incident-report","ArchiveURL":"","Title":"GitHub January 28th, 2016 datacenter power disruption","StartTime":"2016-01-28T00:23:00Z","EndTime":"2016-01-28T02:29:00Z","Categories":["automation","cascading-failure","cloud","config-change","hardware","security"],"Keywords":["github","datacenter","power outage","firmware","redis","503","january 2016","outage"],"Company":"GitHub","Product":"GitHub","SourcePublishedAt":"2016-02-04T05:24:36Z","SourceFetchedAt":"2026-05-04T17:48:37.724456Z","Summary":"On January 28th, 2016 GitHub experienced a disruption in the power at their primary datacenter.","Description":"On January 28th, 2016, GitHub experienced an outage lasting two hours and six minutes, beginning at 00:23 UTC. During this period, users encountered HTTP 503 error pages, indicating the service was unavailable. Initial communication regarding the incident was delayed due to affected internal systems.\n\nThe incident was triggered by a brief power disruption at GitHub's primary datacenter, which caused over 25% of servers and several network devices to reboot. A critical root cause was a known firmware issue that prevented affected servers from recognizing their physical drives upon reboot. This was compounded by a hard dependency in the application's boot path on Redis clusters, which were also offline.\n\nThe response involved engineers working with on-site technicians to cold-boot servers affected by the firmware issue and rebuilding Redis clusters on standby hardware. The recovery was complicated by internal systems residing on the offline hardware, making provisioning new servers difficult. Service was restored after Redis clusters were brought back online and application processes recovered.\n\nGitHub identified several areas for improvement. These include updating firmware across their fleet, enhancing application test suites to ensure graceful degradation when external systems are unavailable, and improving circuit breakers. They also plan to review the availability requirements of critical internal systems and strengthen cross-team communication and incident escalation strategies to reduce recovery times."}