{"UUID":"5a99836e-bbde-4dfc-9621-054eac5c5c50","URL":"https://azure.microsoft.com/en-us/blog/summary-of-windows-azure-service-disruption-on-feb-29th-2012/","ArchiveURL":"","Title":"Windows Azure Service Disruption on Feb 29th, 2012","StartTime":"2012-02-29T00:00:00Z","EndTime":"2012-03-01T10:15:00Z","Categories":["cascading-failure","cloud","config-change"],"Keywords":["azure","compute","leap day bug","certificate","fabric controller","guest agent","february 2012","outage"],"Company":"Azure","Product":"Windows Azure Compute","SourcePublishedAt":"2012-03-09T00:00:00Z","SourceFetchedAt":"2026-05-04T17:48:16.227777Z","Summary":"Certificates that were valid for one year were created. Instead of using an appropriate library, someone wrote code that computed one year to be the current date plus one year. On February 29th 2012, this resulted in the creation of certificates with an expiration date of February 29th 2013, which were rejected because of the invalid date. This caused an Azure global outage that lasted for most of a day.","Description":"On February 29, 2012, Windows Azure experienced a significant service disruption primarily affecting its Compute service and dependent components like Access Control Service (ACS), Windows Azure Service Bus, SQL Azure Portal, and Data Sync Services. The incident began at 00:00 UST on February 29th, when new virtual machines (VMs) attempted to generate transfer certificates.\n\nThe root cause was a \"leap day bug\" in the Guest Agent (GA) responsible for creating these certificates. The GA calculated the certificate's valid-to date by simply adding one year to the current date. On February 29, 2012, this resulted in an invalid expiration date of February 29, 2013, causing certificate creation to fail. This failure prevented GAs from connecting to Host Agents (HAs), leading to VMs being marked faulty and eventually entire clusters entering a degraded \"Human Investigate\" (HI) state. To prevent further customer impact, service management functionality was globally disabled at 6:55 PM PST.\n\nAfter identifying the bug at 6:38 PM PST, developers created and tested a fix. The updated GA code was ready by 11:20 PM PST and began rolling out to all clusters by 2:11 AM PST on February 29th, with service management restored to most clusters by 5:23 AM PST. However, a secondary outage occurred in seven clusters at 2:47 AM PST due to an incompatible networking plugin included in an emergency update package. This caused VMs in these clusters to lose network connectivity.\n\nA corrected HA package was quickly developed and deployed to the affected seven clusters starting at 5:40 AM PST. These clusters were largely operational by 8:00 AM PST, though manual restoration efforts continued. All Windows Azure services were reported healthy by 2:15 AM PST on March 1st. Microsoft apologized for the disruption, issued service credits, and committed to improving testing for time-related bugs, fault isolation, graceful degradation, and faster error detection, as well as enhancing the service dashboard's reliability and information delivery."}