Postmortem Index

Google Cloud HTTP(S) Load Balancer 502 errors on April 5, 2017

Google · HTTP(S) Load Balancer

2017-04-05 · cloud

On Wednesday, April 5, 2017, the Google Cloud HTTP(S) Load Balancer experienced a 25% error rate for 22 minutes, from 01:13 to 01:35 PDT. Clients received 502 errors, and some recently modified HTTP(S) Load Balancers saw 100% error rates. Additionally, deployments of App Engine Flexible apps failed for over three hours as configuration changes were paused.

The root cause was identified as a bug in the HTTP(S) Load Balancer configuration update process. A replica of the master server lost access to Google’s distributed file system, and when mastership transferred to this server, it reverted all HTTP(S) Load Balancers to a substantially out-of-date configuration. This outdated configuration triggered excessive garbage collection on Google Frontend servers, leading to failed health checks and server restarts, which resulted in 502 errors.

Google engineers were alerted at 01:22 PDT and mitigated the issue by 01:34 PDT by switching the configuration update process to a different master server. Configuration updates were then paused until 05:16 PDT to allow for root cause analysis.
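The timeline above can be checked with a little arithmetic. The sketch below uses only the timestamps quoted in this summary. It assumes that the pause on configuration updates began at the 01:34 mitigation, which the summary implies but does not state exactly. The helper names `at` and `minutes_between` are illustrative.

```python
from datetime import datetime


def at(hhmm: str) -> datetime:
    """Build a timestamp on the incident date (all times are PDT)."""
    return datetime.strptime(f"2017-04-05 {hhmm}", "%Y-%m-%d %H:%M")


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


errors_start = at("01:13")     # 502 errors begin
alerted = at("01:22")          # engineers alerted
mitigated = at("01:34")        # switched to a different master server
updates_resumed = at("05:16")  # configuration updates unpaused

time_to_alert = minutes_between(errors_start, alerted)
time_to_mitigate = minutes_between(alerted, mitigated)
# Assumption: the pause is measured from the mitigation time.
update_pause = minutes_between(mitigated, updates_resumed)

print(f"alerted after {time_to_alert} min")
print(f"mitigated {time_to_mitigate} min after the alert")
print(f"updates paused for {update_pause // 60} h {update_pause % 60} min")
```

Under that assumption the pause lasted 3 hours 42 minutes, which is consistent with App Engine Flexible deployments failing for "over three hours".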

To prevent recurrence, Google plans to configure master servers to reject old configurations, ensure Google Frontend servers reject outdated configuration files, improve testing for new configurations, and fix the underlying issues causing master server file access failures and health check failures during heavy garbage collection.
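The first two remediation items amount to a freshness check on incoming configuration. The following is a minimal sketch of that idea, not Google's implementation: it assumes each configuration carries a monotonically increasing version number, and the names `Config`, `Frontend`, and `StaleConfigError` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    version: int   # assumed to increase with every published configuration
    payload: str   # stand-in for the real load balancer settings


class StaleConfigError(Exception):
    """Raised when a pushed configuration is older than the active one."""


class Frontend:
    """Hypothetical serving process that refuses to roll back to older config."""

    def __init__(self, initial: Config) -> None:
        self.active = initial

    def apply(self, candidate: Config) -> None:
        # Reject anything older than what is already being served, so a
        # master with a stale view cannot revert the frontend.
        if candidate.version < self.active.version:
            raise StaleConfigError(
                f"rejecting version {candidate.version}; "
                f"active version is {self.active.version}"
            )
        self.active = candidate


frontend = Frontend(Config(version=41, payload="current"))
frontend.apply(Config(version=42, payload="newer"))      # accepted

try:
    frontend.apply(Config(version=7, payload="very old"))  # rejected
except StaleConfigError as err:
    print(err)
```

A check like this on both the master and the frontend gives two independent chances to stop an out-of-date configuration before it is served. A real system would also need a deliberate, audited path for intentional rollbacks, which this sketch omits.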

Keywords

load balancer, http(s), 502 errors, configuration, google compute engine, app engine flexible, garbage collection, distributed file system