GitHub January 28th, 2016 datacenter power disruption
GitHub · GitHub
On January 28th, 2016, GitHub experienced an outage lasting two hours and six minutes, beginning at 00:23 UTC. During this period, users encountered HTTP 503 error pages, indicating the service was unavailable. Initial communication about the incident was delayed because internal systems were also affected.
The incident was triggered by a brief power disruption at GitHub’s primary datacenter, which caused over 25% of servers and several network devices to reboot. A critical root cause was a known firmware issue that prevented affected servers from recognizing their physical drives upon reboot. The problem was compounded by a hard dependency on Redis in the application’s boot path: the Redis clusters were also offline, so application processes could not start until they returned.
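A hard boot-path dependency of this kind can be softened by letting the process start in a degraded mode when the cache is unreachable. The following is a minimal sketch under stated assumptions, not GitHub's implementation: `connect_cache`, `CacheUnavailable`, and `NullCache` are hypothetical names introduced only for illustration, and a real service would substitute its own Redis client and error types.

```python
import logging

log = logging.getLogger("boot")


class CacheUnavailable(Exception):
    """Hypothetical error raised when the cache cannot be reached."""


class NullCache:
    """Fallback that behaves like an always-empty cache."""

    def get(self, key):
        return None

    def set(self, key, value):
        pass


def boot(connect_cache):
    """Start the app, degrading to NullCache instead of aborting.

    `connect_cache` is an assumed callable that returns a cache client
    or raises CacheUnavailable.
    """
    try:
        cache = connect_cache()
        degraded = False
    except CacheUnavailable as exc:
        log.warning("cache unavailable at boot, starting degraded: %s", exc)
        cache = NullCache()
        degraded = True
    return {"cache": cache, "degraded": degraded}


def offline_connector():
    """Simulates a cache cluster that is offline at boot time."""
    raise CacheUnavailable("connection refused")


app = boot(offline_connector)
```

With this shape the process still comes up when the cache is down, and the `degraded` flag gives health checks something to report.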
In response, engineers worked with on-site technicians to cold-boot the servers affected by the firmware issue and rebuilt the Redis clusters on standby hardware. The recovery was complicated by internal systems residing on the offline hardware, which made provisioning new servers difficult. Service was restored after the Redis clusters were brought back online and application processes recovered.
GitHub identified several areas for improvement: updating firmware across its fleet, enhancing application test suites to verify graceful degradation when external systems are unavailable, and improving circuit breakers. The company also plans to review the availability requirements of critical internal systems and to strengthen cross-team communication and incident escalation strategies to reduce recovery times.
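To illustrate what a circuit breaker does in this context, here is a small sketch, assuming a simple failure-count policy. The class name, thresholds, and injected `clock` parameter are illustrative choices, not details from GitHub's report. After a configurable number of consecutive failures the breaker stops calling the dependency for a cool-down period, then lets one trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of waiting on a dependency known to be down.
                raise CircuitOpenError("dependency marked unavailable")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to the external system in `breaker.call(...)` and serve a fallback when `CircuitOpenError` is raised. A test suite can drive the same code path with a fake clock and a function that always fails, which is one way to check graceful degradation without taking a real dependency offline.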