Google Cloud Networking and Load Balancing outage of November 2021
Google · Google Cloud Load Balancing
On November 16, 2021, at 09:35 PT, Google Cloud Networking experienced issues with the Google External Proxy Load Balancing (GCLB) service. This resulted in customers receiving 404 errors for HTTP/S requests served by GCLB, impacting various downstream services such as Google Cloud Run, Google App Engine, Google Cloud Functions, Apigee, and Firebase. Engineers were alerted at 09:50 PT and initiated a rollback to a known good configuration, which resolved the 404 errors by 10:08 PT.
Following the initial impact, customer-initiated configuration changes in GCLB were suspended from 10:04 PT until 11:28 PT to prevent recurrence and allow for validation of fixes. During this period, customers were unable to modify their load balancing configurations. Normal service, including the ability to push configurations, was fully restored by 11:28 PT, bringing the total duration of impact to 1 hour and 53 minutes.
The incident was caused by a latent bug in the configuration pipeline responsible for propagating customer rules to GCLB. This bug, introduced six months prior, created a race condition that could, in rare instances, push a corrupted configuration file to GCLB. Although the pipeline had extensive validation checks, this specific race condition corrupted the file late in the process. An attempted fix was in progress, but the race condition manifested in an unpatched cluster before the full rollout of the primary patch, and a secondary patch did not prevent the specific error form produced.
Google engineers mitigated the issue by restoring a known-good configuration and subsequently deployed fixes to eliminate the risk of recurrence. Remediation efforts include adding additional alerting for faster detection, implementing strengthened automated correctness-checking for configurations, and accelerating planned architectural changes to improve isolation and resolution of similar issues in the future.