Google Compute Engine global connectivity loss April 2016
Google · Google Compute Engine
On April 11, 2016, Google Compute Engine instances in all regions lost external connectivity for 18 minutes, from 19:09 to 19:27 Pacific Time. Cloud VPN in asia-east1 was affected earlier, starting at 18:14 Pacific Time. During the outage, inbound internet traffic to GCE instances was not routed correctly, so existing connections were dropped and clients could not reconnect. Services that depend on inbound connectivity, such as VPNs and L3 network load balancers, were affected.
The incident was triggered at 14:50 Pacific Time when engineers removed an unused GCE IP block from the network configuration. A timing quirk in this removal caused an inconsistency in the network configuration management software. Instead of failing safe and reverting to a known good configuration, a previously unseen software bug caused the management software to remove all GCE IP blocks from the new configuration.
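The fail-safe behavior that was expected here can be illustrated with a minimal sketch. It is not Google's management software: the `NetworkConfig` type, the `inconsistent` flag, and the function names are hypothetical, and the IP blocks are documentation-only example ranges. The idea is that any detected inconsistency, or a candidate configuration with no IP blocks at all, results in the last known good configuration being kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkConfig:
    """Hypothetical stand-in for a network configuration snapshot."""
    version: int
    ip_blocks: frozenset


class InconsistentConfigError(Exception):
    """Raised when the inputs to config generation disagree with each other."""


def build_candidate(current, blocks_to_remove, inconsistent=False):
    # `inconsistent` simulates the timing-related mismatch described above.
    if inconsistent:
        raise InconsistentConfigError("inputs changed during generation")
    remaining = current.ip_blocks - frozenset(blocks_to_remove)
    return NetworkConfig(current.version + 1, remaining)


def next_config(last_known_good, blocks_to_remove, inconsistent=False):
    """Return the candidate config, or fall back to the last known good one."""
    try:
        candidate = build_candidate(last_known_good, blocks_to_remove, inconsistent)
    except InconsistentConfigError:
        return last_known_good  # fail safe: keep what is known to work
    if not candidate.ip_blocks:
        return last_known_good  # an empty block list is never accepted
    return candidate


good = NetworkConfig(1, frozenset({"198.51.100.0/24", "203.0.113.0/24"}))
print(next_config(good, ["203.0.113.0/24"]).ip_blocks)
print(next_config(good, ["203.0.113.0/24"], inconsistent=True) is good)
```

In this sketch the fallback is an explicit code path rather than a side effect of error handling, which makes it straightforward to test on its own.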
Google’s defense-in-depth systems include a canary step and progressive rollout. While the canary step correctly identified the new configuration as unsafe, a second software bug prevented this conclusion from propagating back to the push process. Consequently, the system deemed the configuration valid and proceeded with a progressive rollout, causing sites to stop announcing GCE IP blocks.
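The relationship between the canary verdict and the rollout can be sketched as a gate that proceeds only on an explicit "safe" result. This is an illustrative assumption about how such a gate might be written, not a description of Google's push system: `Verdict`, `canary_check`, `progressive_rollout`, and the site names are all hypothetical. The key design choice is that a missing verdict (`None`), which is roughly what a lost canary result looks like, is treated the same as "unsafe".

```python
from enum import Enum


class Verdict(Enum):
    SAFE = "safe"
    UNSAFE = "unsafe"


def canary_check(config_blocks, required_blocks):
    """Flag the config as unsafe if any required IP block is absent."""
    missing = set(required_blocks) - set(config_blocks)
    return Verdict.UNSAFE if missing else Verdict.SAFE


def progressive_rollout(verdict, sites):
    """Push site by site, but only after an explicit SAFE verdict.

    Anything else, including None (verdict never arrived), blocks the push.
    Returns the list of sites that received the new configuration.
    """
    if verdict is not Verdict.SAFE:
        return []
    pushed = []
    for site in sites:
        # A real system would push and re-check health here between sites.
        pushed.append(site)
    return pushed


required = {"198.51.100.0/24", "203.0.113.0/24"}
sites = ["site-a", "site-b", "site-c"]  # hypothetical site names

bad_verdict = canary_check(set(), required)
print(bad_verdict, progressive_rollout(bad_verdict, sites))
print(progressive_rollout(None, sites))
```

Defaulting to "do not push" means a failure in the reporting path stops the rollout instead of silently approving it.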
As the rollout progressed, internal monitoring detected anomalies, including the Cloud VPN in asia-east1 failing at 18:14 and rising user latency to GCE. By 19:09, with no sites announcing GCE IP blocks, inbound traffic to GCE dropped by over 95%. Engineers, initially investigating a localized VPN issue, quickly identified the widespread problem and reverted the recent configuration changes, resolving the outage by 19:27.
The immediate remediation was to revert the faulty configuration. To prevent a recurrence, Google plans a number of engineering changes. These include monitoring targeted GCE network paths, comparing IP block announcements before and after configuration changes, and implementing semantic checks on network configurations to ensure that specific Cloud IP blocks are present. The company is also offering service credits to impacted GCE and VPN customers.
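Two of the planned checks, the before/after comparison of announced blocks and the semantic check for required blocks, can be sketched with Python's standard `ipaddress` module. The function names and the example CIDR ranges (documentation-only prefixes) are assumptions made for illustration; the actual checks Google implemented are not described in this summary.

```python
import ipaddress


def parse_blocks(cidrs):
    """Parse CIDR strings into a set of network objects."""
    return {ipaddress.ip_network(c) for c in cidrs}


def withdrawn_blocks(before, after):
    """Blocks announced before the change that are no longer announced."""
    return parse_blocks(before) - parse_blocks(after)


def missing_required(config_blocks, required):
    """Required blocks not covered by any block in the configuration."""
    announced = parse_blocks(config_blocks)
    return {
        r
        for r in parse_blocks(required)
        # subnet_of() needs both networks to share an IP version.
        if not any(r.version == a.version and r.subnet_of(a) for a in announced)
    }


before = ["198.51.100.0/24", "203.0.113.0/24"]
after = ["198.51.100.0/24"]

print(withdrawn_blocks(before, after))
print(missing_required(after, ["198.51.100.128/25"]))
print(missing_required([], before))
```

A change that intentionally removes one unused block would show exactly that block in `withdrawn_blocks`; a result containing every block, or a non-empty `missing_required`, would be a signal to stop the push.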