Google Cloud europe-west2 outage due to cooling system failure

On July 19, 2022, at 06:33 US/Pacific, multiple redundant cooling systems failed simultaneously in a data center hosting zone europe-west2-a, triggering a significant outage across numerous Google Cloud services. To prevent further damage and a longer outage, Google engineers powered down a portion of the impacted zone at 10:05 US/Pacific. This action led to widespread service unavailability, elevated error rates, and increased latencies for customers in the europe-west2 region.

The primary root cause was that a data center in europe-west2-a could not maintain safe operating temperatures, due to the combined failure of multiple cooling systems and exceptionally high external temperatures. Two further factors extended the impact from a single zone to the whole region: traffic routing for internal services was inadvertently modified to avoid all three zones in europe-west2, and this routing change left regional storage services unable to access their data replicas.

The incident affected a broad range of Google Cloud products, including Compute Engine, Persistent Disk, Cloud Storage, BigQuery, Cloud SQL, App Engine, Cloud Functions, and many others. Customers experienced issues such as VM terminations, unavailable disk volumes, HTTP 500 errors for object reads, data plane unavailability, degraded performance, and increased error rates for various operations. The impact varied by service, with some experiencing complete unavailability for specific operations or instances.

Google engineers repaired the cooling system by 14:13 US/Pacific on July 19. Cloud services then began recovering and were largely operational by 04:28 US/Pacific on July 20. A “long tail” of affected Google Compute Engine instances and related services, such as Cloud SQL, required additional manual work. Full mitigation was achieved by 21:20 US/Pacific on July 20, which marked the incident’s closure.
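
To make the length of each phase easier to see, the timestamps quoted in this summary can be turned into elapsed times from the initial cooling failure. The following is a minimal sketch in Python. The event labels and the helper name `elapsed_since_start` are illustrative shorthand introduced here, not terms from the incident report, and all times are treated as naive US/Pacific values, which is safe only because no daylight-saving change falls inside the window.

```python
from datetime import datetime

# Timestamps as quoted in this summary (all US/Pacific, 2022).
# The labels are illustrative shorthand, not official milestone names.
EVENTS = [
    ("cooling failure begins", datetime(2022, 7, 19, 6, 33)),
    ("partial zone power-down", datetime(2022, 7, 19, 10, 5)),
    ("cooling system repaired", datetime(2022, 7, 19, 14, 13)),
    ("services largely operational", datetime(2022, 7, 20, 4, 28)),
    ("full mitigation", datetime(2022, 7, 20, 21, 20)),
]


def elapsed_since_start(events):
    """Return {label: (hours, minutes)} measured from the first event."""
    start = events[0][1]
    result = {}
    for label, timestamp in events:
        total_minutes = int((timestamp - start).total_seconds() // 60)
        result[label] = (total_minutes // 60, total_minutes % 60)
    return result


if __name__ == "__main__":
    for label, (hours, minutes) in elapsed_since_start(EVENTS).items():
        print(f"{label}: +{hours}h{minutes:02d}m")
```

Under these assumptions, the cooling repair lands 7 hours 40 minutes after the initial failure, while full mitigation takes 38 hours 47 minutes, so most of the incident's duration was service recovery rather than the physical repair.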

Google is implementing several preventative measures. These include repairing and re-testing zonal failover automation, developing advanced methods to progressively decrease thermal load in data centers, and examining recovery procedures and tooling to improve future recovery times. A detailed analysis of the cooling system failure and an audit of cooling system equipment across global data centers are also underway.
