Postmortem Index

Explore incident reports from various companies

Windows Azure Service Disruption on Feb 29th, 2012

Azure · Windows Azure Compute

2012-02-29 – 2012-03-01 cascading-failure cloud config-change

On February 29, 2012, Windows Azure experienced a significant service disruption primarily affecting its Compute service and dependent components like Access Control Service (ACS), Windows Azure Service Bus, SQL Azure Portal, and Data Sync Services. The incident began at 00:00 UST on February 29th, when new virtual machines (VMs) attempted to generate transfer certificates.

The root cause was a “leap day bug” in the Guest Agent (GA) responsible for creating these certificates. The GA calculated the certificate’s valid-to date by simply adding one year to the current date. On February 29, 2012, this resulted in an invalid expiration date of February 29, 2013, causing certificate creation to fail. This failure prevented GAs from connecting to Host Agents (HAs), leading to VMs being marked faulty and eventually entire clusters entering a degraded “Human Investigate” (HI) state. To prevent further customer impact, service management functionality was globally disabled at 6:55 PM PST.

After identifying the bug at 6:38 PM PST, developers created and tested a fix. The updated GA code was ready by 11:20 PM PST and began rolling out to all clusters by 2:11 AM PST on February 29th, with service management restored to most clusters by 5:23 AM PST. However, a secondary outage occurred in seven clusters at 2:47 AM PST due to an incompatible networking plugin included in an emergency update package. This caused VMs in these clusters to lose network connectivity.

A corrected HA package was quickly developed and deployed to the affected seven clusters starting at 5:40 AM PST. These clusters were largely operational by 8:00 AM PST, though manual restoration efforts continued. All Windows Azure services were reported healthy by 2:15 AM PST on March 1st. Microsoft apologized for the disruption, issued service credits, and committed to improving testing for time-related bugs, fault isolation, graceful degradation, and faster error detection, as well as enhancing the service dashboard’s reliability and information delivery.

Keywords

azurecomputeleap day bugcertificatefabric controllerguest agentfebruary 2012outage