Google Compute Engine Persistent Disk issue in europe-west1-b
Google · Google Compute Engine Persistent Disk
From 2015-08-13 09:25 PDT to 2015-08-16 09:35 PDT, Google Compute Engine’s Standard Persistent Disks in the europe-west1-b zone experienced sporadic I/O errors. Approximately 5% of these disks were affected, and some management operations, such as snapshot creation, also failed. A very small fraction of the allocated PD space in the zone, less than 0.000001%, suffered permanent data loss. SSD Persistent Disks, disk snapshots, and Local SSDs were not impacted.
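To give a sense of scale for the 0.000001% figure, the sketch below converts the stated upper bound into bytes. The report does not publish the zone's allocated capacity, so the 1 PB value is purely a hypothetical assumption, and the function name `max_bytes_lost` is illustrative only.

```python
from fractions import Fraction

# Upper bound on the lost share stated in the report: 0.000001 percent.
LOSS_PERCENT_UPPER_BOUND = Fraction(1, 1_000_000)
# Convert percent to a plain fraction: 0.000001% == 1e-8.
LOSS_FRACTION_UPPER_BOUND = LOSS_PERCENT_UPPER_BOUND / 100


def max_bytes_lost(allocated_bytes: int) -> Fraction:
    """Upper bound on bytes lost for a given allocated capacity."""
    return allocated_bytes * LOSS_FRACTION_UPPER_BOUND


# Hypothetical capacity, NOT a figure from the report: 1 PB (10**15 bytes).
hypothetical_allocated_bytes = 10**15
print(float(max_bytes_lost(hypothetical_allocated_bytes)))  # 10000000.0
```

Under that assumed capacity, the bound works out to at most 10 MB of lost data. Exact rational arithmetic is used so that the tiny percentage is not distorted by floating-point rounding.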
The root cause was identified as four successive lightning strikes on the local utilities grid powering the European datacenter at 09:19 PDT on August 13, 2015. The strikes caused a brief loss of power to the storage systems hosting disk capacity for GCE instances in europe-west1-b. Auxiliary power was restored automatically and the storage systems have battery backup, but some recently written data resided on storage systems that were more susceptible to power failure from extended or repeated battery drain, and that data became unrecoverable.
Customers experienced I/O errors and, in rare cases, permanent data loss. Google emphasized that GCE instances and Persistent Disks within a single zone are vulnerable to datacenter-scale disasters, and recommended that customers who need maximum availability use multiple GCE zones. For maximum durability, Google suggested GCE snapshots and Google Cloud Storage as resilient, geographically replicated repositories.
Google is upgrading storage hardware to be less susceptible to this type of power failure. A comprehensive review of the datacenter technology stack identified several areas for improvement, including enhancing cache data retention during transient power loss, implementing multiple orthogonal schemes for increased Persistent Disk data durability, and improving response procedures for system engineers during future incidents.