{"UUID":"0b69e012-c539-485c-a2c7-b59c50d0650a","URL":"https://www.elastic.co/blog/elastic-cloud-incident-report-feburary-4-2019","ArchiveURL":"","Title":"Elastic Cloud AWS us-east-1 outage of February 2019","StartTime":"2019-02-04T02:50:00Z","EndTime":"2019-02-04T18:44:00Z","Categories":["automation","cascading-failure","cloud","config-change","security"],"Keywords":["elastic cloud","aws","us-east-1","zookeeper","elasticsearch","kibana","runc","patching"],"Company":"Elastic","Product":"Elastic Cloud","SourcePublishedAt":"2019-03-04T19:00:00Z","SourceFetchedAt":"2026-05-04T17:44:58.455897Z","Summary":"Elastic Cloud customers with deployments in the AWS us-east-1 region experienced degraded access to their clusters.","Description":"On February 4, 2019, at approximately 02:50 UTC, Elastic Cloud customers with deployments in the AWS us-east-1 region experienced degraded access to their clusters. The incident was triggered during a routine patching procedure for the coordination layer (ZooKeeper) in that region. Despite following documented procedures, the patching led to unanticipated instability and an outage of the coordination services.\n\nThe primary root cause was identified as a failure in the coordination layer, stemming from insufficient metrics during host replacement that failed to accurately reflect the health of individual hosts and the overall coordination layer. This resulted in instability and a loss of quorum within the ZooKeeper ensemble. A contributing factor was a previously unknown `runc` bug that caused CPU softlocks and system unresponsiveness on ZooKeeper ensemble members.\n\nCustomer impact included partial or complete unavailability for Elasticsearch Service deployments in AWS us-east-1 between 02:50 and 09:28 UTC. Kibana access was disrupted for most customers from 02:50 to 09:28 UTC, with some experiencing degraded access until 18:44 UTC. The Elastic Cloud User Console also saw increased timeouts and was in a degraded state from 02:50 to 07:17 UTC.\n\nRemediation involved re-establishing quorum within the ZooKeeper ensemble by reducing client load and stabilizing the ZooKeeper observer layer through an increased `initLimit` setting. The extended Kibana issues were resolved by restarting internal proxying containers and Kibana instances, and applying `sysctl` limits to prevent recurrence. This also addressed identified connection leaks and HTTP request amplification bugs within Kibana.\n\nElastic has since implemented several action items, including reducing ZooKeeper dataset size, optimizing proxy health-checks, and improving ZooKeeper visibility. Ongoing efforts include a ground-up rewrite of the proxy layer, improving Kibana resiliency and addressing connection leaks, and formalizing maintenance procedures to prevent similar incidents."}