Azure Storage service interruption November 2014
Microsoft · Azure Storage
On November 19, 2014, Azure Storage services experienced a service interruption across the United States, Europe, and parts of Asia. This incident, which began at 00:51 UTC and was largely resolved by 11:45 UTC, impacted numerous cloud services built on Azure Storage, including Virtual Machines, Websites, SQL Geo-Restore, and the Management Portal.
The root cause was identified as a configuration change introduced as part of a performance update for Azure Storage. This change, intended to improve performance and reduce CPU footprint for Azure Table Front-Ends, exposed a previously undetected bug in the Blob Front-Ends. The bug caused the Blob Front-Ends to enter an infinite loop, preventing them from processing traffic.
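The failure mode described above — a configuration value that is safe for one component but sends another into an infinite loop — can be illustrated with a hypothetical sketch. This is not Microsoft's actual Front-End code; the batching loop and the `batch_size` parameter are invented for illustration:

```python
def drain_queue(items, batch_size):
    """Process queued requests in batches of batch_size.

    Latent bug: with batch_size == 0, each slice is empty, i never
    advances, and the loop spins forever -- the process stops serving
    traffic, much as the Blob Front-Ends did. A simple guard on the
    configured value prevents the hang.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")  # guard against bad config
    processed = []
    i = 0
    while i < len(items):
        batch = items[i:i + batch_size]  # empty slice if batch_size were 0
        processed.extend(batch)
        i += len(batch)                  # advances only when the batch is non-empty
    return processed
```

The point of the sketch is that the looping code path was never exercised until the new configuration reached it, which is why validating configuration values at load time matters as much as testing code.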
A significant contributing factor was an operational error during deployment. Instead of following the standard protocol of applying production changes in incremental batches, the configuration change was pushed across most regions in a short period. This widespread deployment exacerbated the impact of the bug.
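The incremental-batch protocol that was bypassed can be sketched as a staged rollout with a health gate between batches. The function and region names below are hypothetical, not Azure's deployment tooling:

```python
def deploy_in_batches(regions, apply_change, health_check, batch_size=1):
    """Roll a configuration change out in small batches, halting at the
    first unhealthy batch instead of pushing to all regions at once.

    apply_change(region) applies the change; health_check(region) returns
    True if the region is still serving traffic correctly afterward.
    """
    deployed = []
    for i in range(0, len(regions), batch_size):
        batch = regions[i:i + batch_size]
        for region in batch:
            apply_change(region)
            deployed.append(region)
        if not all(health_check(r) for r in batch):
            # Stop the rollout: remaining regions never receive the bad change.
            return deployed, False
    return deployed, True
```

Under this protocol, the infinite-loop bug would have surfaced in the first batch and the blast radius would have been one region rather than most of them.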
Customers using Azure Storage experienced timeouts and connectivity failures against the Blob, Table, and Queue services. Services dependent on Azure Storage, such as IaaS Virtual Machines, also became unavailable, with some users unable to connect via RDP or SSH. The incident additionally affected the Service Health Dashboard, delaying timely status updates and limiting customers’ ability to create support cases during the initial phase.
Upon detection, the faulty configuration change was promptly rolled back. However, Front-Ends already stuck in the infinite loop could not pick up the rollback and had to be restarted, prolonging recovery. Microsoft Azure committed to improving its processes by enforcing incremental deployments, improving recovery methods, fixing the identified bug, and strengthening the Service Health Dashboard infrastructure to prevent similar incidents.