{"UUID":"655ee350-a86f-424b-89c5-f80cc0fe527e","URL":"https://medium.com/@florian_7764/technical-post-mortem-of-the-august-incident-82ab4c3d6547","ArchiveURL":"https://web.archive.org/web/20201202234639if_/https://medium.com/@florian_7764/technical-post-mortem-of-the-august-incident-82ab4c3d6547","Title":"Platform.sh EU region outage of August 2016","StartTime":"2016-08-18T19:00:00Z","EndTime":"2016-08-18T23:30:00Z","Categories":["automation","config-change"],"Keywords":["platform.sh","eu region","zookeeper","kazoo","orchestration software","downtime","august 2016","maintenance"],"Company":"Platform.sh","Product":"Platform.sh EU region","SourcePublishedAt":"2016-10-19T22:08:01.474Z","SourceFetchedAt":"2026-05-04T17:49:00.692982Z","Summary":"Outage during a scheduled maintenance window because there was too much data for ZooKeeper to boot.","Description":"On August 18, 2016, Platform.sh experienced a 4-hour downtime in its EU region. A scheduled maintenance window began at 19:00 UTC to upgrade the orchestration software, initially affecting only git servers and the UI. However, at 19:30 UTC, websites also went down, and all services were restored by 23:30 UTC.\n\nThe incident occurred due to a series of cascading issues. During the orchestration software upgrade, gateways were prematurely restarted before the orchestration software had fully started. This caused the gateways to lose their application list and be unable to fetch a new one, leading to website downtime.\n\nThe orchestration software itself failed to start correctly due to connection drops to ZooKeeper. Investigation revealed a bug in the Kazoo library, which dropped connections when its internal pipe buffer became full (64k queries). After a fix for this was deployed, a subsequent issue was discovered: one ZooKeeper node exceeded its maximum allowed size, preventing full recovery.\n\nRemediation involved several steps. Automated checks were implemented to ensure the orchestration software is fully started before proceeding with further maintenance. A semaphore lock was added to the Kazoo client to prevent pipe buffer overflow, and a pull request is being prepared for the upstream Kazoo project. Finally, the ZooKeeper max buffer size was increased, and monitoring for ZooKeeper node sizes was implemented to prevent future occurrences."}