{"UUID":"3f376b77-6063-48a6-8ee2-92ca365194e4","URL":"http://azure.microsoft.com/blog/2014/11/19/update-on-azure-storage-service-interruption/","ArchiveURL":"https://web.archive.org/web/20160125202239/https://azure.microsoft.com/en-us/blog/update-on-azure-storage-service-interruption/","Title":"Azure Storage service interruption","StartTime":"2014-11-19T00:51:00Z","EndTime":"2014-11-19T11:45:00Z","Categories":["cloud","config-change"],"Keywords":["azure storage","azure blob","azure tables","configuration change","infinite loop","flighting","staged rollout","blob front-end"],"Company":"Microsoft","Product":"Microsoft Azure Storage","SourcePublishedAt":"2014-11-19T00:00:00Z","SourceFetchedAt":"0001-01-01T00:00:00Z","Summary":"A configuration change pushed to most regions at once exposed an infinite-loop bug in the Azure Storage Blob front-ends, causing a multi-region outage of Storage and the services that depend on it.","Description":"A configuration change rolled out as part of an Azure Storage performance update triggered a previously undetected infinite-loop bug in the Blob front-ends, which then could not absorb traffic and took down the Azure services that depended on Storage. Because of an operational error, the change was applied to most regions within a short period instead of following the standard protocol of applying production changes in incremental batches, which is why the blast radius was so broad.\n\nThe configuration change had been \"flighted\" for several weeks against a subset of Azure Tables front-ends and showed the expected performance improvement and CPU reduction. The Blob front-ends shared the same code path but contained a latent bug that the change exposed; when the same configuration was pushed broadly, the Blob front-ends went into an infinite loop. 
Reverting the configuration change on its own was not sufficient: the Blob front-ends had to be restarted before they would re-read the rolled-back configuration, which extended the recovery window.\n\n**Affected services**: Azure Storage (Blob, Table, Queue), Virtual Machines, SQL Geo-Restore, SQL Import/Export, Websites, Azure Search, Azure Cache, Management Portal, Service Bus, Event Hubs, Visual Studio, Machine Learning, HDInsight, Automation, Virtual Network, Stream Analytics, Active Directory, StorSimple, and Azure Backup. **Affected regions**: Central US, East US, East US 2, North Central US, South Central US, West US, North Europe, West Europe, East Asia, Southeast Asia, Japan East, Japan West.\n\nThe incident also degraded Microsoft's ability to communicate during the event: the Service Health Dashboard and Management Portal both depended on Azure Storage, so for roughly the first three hours of the outage Microsoft fell back to Twitter and other social channels for status, and customers struggled to file new support cases.\n\nCommitted remediations: enforce incremental-batch deployment in the deployment tooling itself (so the operational error of pushing globally is no longer possible); shorten recovery time for cases where a configuration revert needs a process restart; fix the infinite-loop bug in the CPU-reduction code path before it ships again; and harden the Service Health Dashboard infrastructure so it can report through future Storage outages."}