Google Cloud internal blob storage disruption March 2019
Google · internal blob storage service
On March 12, 2019, from 18:40 to 22:50 PDT, Google’s internal blob storage service experienced a disruption with elevated error rates, averaging 20% and peaking at 31%. The incident significantly impacted Google services that rely on blob storage, including Gmail, Photos, and Google Drive. Google Cloud Platform services were also affected: Google Cloud Storage saw an average 4.8% error rate across all bucket locations and storage classes, Stackdriver Monitoring experienced up to 5% errors for historical time series data, and App Engine’s Blobstore API and deployments faced elevated latency and error rates, peaking at 21% and 90% respectively.
The incident’s root cause originated on March 11, 2019, when Google SREs observed a substantial increase in the storage resources consumed by metadata used by the internal blob service. To mitigate this resource usage, SREs implemented a configuration change on March 12. The change, however, had an unintended side effect: it overloaded a critical component responsible for looking up the location of blob data, which triggered a cascading failure.
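The cascading dynamic described above can be made concrete with a toy model (purely illustrative, not Google’s actual system): once load per task exceeds capacity, a task crashes, its traffic is redistributed to the survivors, and each crash pushes the remaining tasks further past their limits.

```python
# Illustrative sketch of a cascading failure: overloaded tasks crash,
# and their traffic is redistributed to the remaining tasks.
# All numbers are hypothetical.

def simulate_cascade(num_tasks, per_task_capacity, total_load):
    """Return how many tasks remain alive after the cascade settles."""
    alive = num_tasks
    while alive > 0:
        load_per_task = total_load / alive
        if load_per_task <= per_task_capacity:
            return alive  # remaining tasks can absorb the load
        alive -= 1  # an overloaded task crashes; its load shifts to the rest
    return 0  # total collapse: no subset of tasks can carry the load

# 10 tasks, each handling up to 100 QPS, serve 900 QPS comfortably...
print(simulate_cascade(10, 100, 900))   # -> 10 (all survive)
# ...but a small overload pushes every task past capacity in turn.
print(simulate_cascade(10, 100, 1001))  # -> 0 (total collapse)
```

The sharp cliff between the two outcomes is what makes cascading failures dangerous: the system has no stable intermediate state once demand exceeds aggregate capacity.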
Upon being alerted to the service disruption at 18:56 PDT, SREs promptly halted the configuration change job that had initiated the problem. To recover from the cascading failure, engineers manually reduced traffic levels to the blob service. This allowed tasks to restart successfully without crashing under high load, enabling the gradual restoration of service.
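The recovery technique here, capping traffic well below capacity so restarting tasks can come up, then ramping the cap back toward normal, can be sketched as a simple admission throttle. This is a hypothetical illustration, not the actual tooling used; the class name, fractions, and ramp factor are invented for the example.

```python
# Minimal sketch of recovery via manual traffic reduction: admit only a
# fraction of offered traffic so restarting tasks are not immediately
# crushed, then ramp the limit up as the service stabilizes.
# (Hypothetical illustration; parameters are invented.)

class RecoveryThrottle:
    def __init__(self, capacity_qps, start_fraction=0.25, ramp_factor=1.5):
        self.capacity_qps = capacity_qps
        self.limit_qps = capacity_qps * start_fraction
        self.ramp_factor = ramp_factor

    def admit(self, offered_qps):
        """Return how much of the offered traffic is admitted; the rest is rejected."""
        return min(offered_qps, self.limit_qps)

    def ramp_up(self):
        """Raise the cap once the service is healthy at the current limit."""
        self.limit_qps = min(self.capacity_qps, self.limit_qps * self.ramp_factor)

throttle = RecoveryThrottle(capacity_qps=1000)
print(throttle.admit(1200))  # 250.0 -- only a quarter of capacity at first
throttle.ramp_up()
print(throttle.admit(1200))  # 375.0 -- limit raised gradually
```

Starting far below capacity is the key design choice: it gives restarted tasks headroom to warm up before they face production-level load.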
To prevent similar incidents, Google plans several improvements. These include enhancing the isolation between regions of the storage service to minimize the global impact of failures, and improving the ability to provision resources more quickly to recover from cascading failures caused by high load. Additionally, software measures will be implemented to prevent configuration changes that could overload key system components, and the metadata storage system’s load shedding behavior will be improved for more graceful degradation under stress.
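The improved load shedding mentioned above is commonly implemented as priority-based rejection: under stress, drop the lowest-priority requests first rather than failing uniformly. A minimal sketch, with invented priority levels and thresholds (the postmortem does not specify Google’s actual policy):

```python
# Hypothetical sketch of priority-based load shedding for graceful
# degradation: as utilization rises, reject best-effort traffic first,
# then normal traffic, protecting critical requests the longest.
# Priority levels and thresholds are illustrative, not Google's.

def should_shed(utilization, priority):
    """Decide whether to reject a request at the current utilization.

    priority: 0 = critical, 1 = normal, 2 = best-effort.
    """
    shed_thresholds = {0: 0.98, 1: 0.90, 2: 0.75}
    return utilization >= shed_thresholds[priority]

# At 80% utilization, only best-effort traffic is shed.
print(should_shed(0.80, 2))  # True  -- best-effort request rejected
print(should_shed(0.80, 1))  # False -- normal request served
# Under heavier load, normal traffic is shed too, but critical survives.
print(should_shed(0.95, 1))  # True
print(should_shed(0.95, 0))  # False
```

Shedding by priority degrades the service gracefully: instead of an across-the-board error spike, low-value work is sacrificed to keep the component below the overload point that triggers cascading failure.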