Postmortem Index

Explore incident reports from various companies

Buildkite outage of August 22nd, 2016

Buildkite · Buildkite platform

On August 22nd, 2016, Buildkite experienced a severe unplanned outage starting at 17:21 UTC. During this period, customers were unable to log in, view build logs, or access documentation, although builds continued to run and GitHub/Bitbucket Pull Request statuses continued to update. The Buildkite team, based in Melbourne, was not alerted until hours later because of misconfigured PagerDuty settings and phones set to silent, delaying the response.

The root cause of the outage was a database capacity downgrade. Approximately two weeks earlier, Buildkite had decided to downgrade their m4.10xlarge Multi-AZ RDS PostgreSQL database to a smaller r3.2xlarge instance to reduce AWS costs, as their AWS credits were expiring. The change was made without sufficient load testing or due diligence, and the smaller instance began to struggle, maxing out its CPU around 16:00 UTC on August 22nd.
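
For context, a change of this kind is normally a single RDS instance class modification. The sketch below is illustrative only, not Buildkite's actual tooling: it shows roughly what such a downgrade looks like through boto3, with hypothetical identifiers, and why nothing in the API itself forces a capacity review before the change takes effect.

```python
# Illustrative sketch only; not Buildkite's actual tooling.
# Shows how an RDS instance class downgrade of this kind is typically
# applied with boto3. The instance identifier is hypothetical.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Request the smaller instance class. The API happily accepts the change;
# any load testing or capacity review has to happen as a separate process step.
rds.modify_db_instance(
    DBInstanceIdentifier="buildkite-primary",  # hypothetical identifier
    DBInstanceClass="db.r3.2xlarge",           # down from db.m4.10xlarge
    ApplyImmediately=False,                    # applied in the next maintenance window
)
```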

The struggling database caused a cascade of failures. Health checks on Elastic Load Balancers (ELBs) began returning HTTP 500 errors, leading to servers being marked “OutOfService” and removed. New servers launched by Auto Scaling Groups failed to come online because their bootstrapping process relied on fetching the latest code from buildkite.com/_secret/version, which was unavailable. Additionally, the baked-in AMI code referenced a decommissioned database, causing health checks on new instances to fail instantly.
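
The health-check part of that cascade is worth spelling out. The minimal sketch below (not Buildkite's code; the framework, endpoints, and connection details are assumptions) contrasts an instance-local check with one that hard-fails on a shared dependency: if every instance's ELB health check includes the database, a degraded database fails the check on every server at once and the load balancer empties itself.

```python
# Minimal sketch (not Buildkite's code) of the health-check distinction this
# incident highlights. Endpoint names and the database DSN are hypothetical.
from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)

@app.route("/health")
def health():
    # Instance-local check: can this process serve requests at all?
    # Returning 200 here keeps the instance behind the ELB even while a
    # shared dependency is degraded, so the outage stays partial.
    return jsonify(status="ok"), 200

@app.route("/health/deep")
def deep_health():
    # Dependency check: useful for dashboards and alerting, but risky as the
    # ELB health check, because a struggling database fails it on every
    # instance simultaneously.
    try:
        conn = psycopg2.connect("dbname=app host=db.internal connect_timeout=2")
        conn.close()
        return jsonify(status="ok", database="reachable"), 200
    except psycopg2.OperationalError:
        return jsonify(status="degraded", database="unreachable"), 500
```

Whether a load-balancer check should include shared dependencies is a trade-off; the point the incident illustrates is that when it does, a single struggling database can remove the entire fleet from service at once.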

Recovery efforts were further hampered by concurrent AWS issues, including IAM problems that prevented database upgrades and AMI launches via the console. Later, EC2 request limits were hit because of the continuous cycle of new servers launching and terminating. Buildkite eventually brought buildkite.com back online with temporary fixes, stealing servers from other ELBs, and once the AWS issues subsided they upgraded the database and reverted to a host-based pgbouncer setup using the aws-cli. Service was fully restored by 23:10 UTC.
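
As a rough illustration of the "stealing servers from other ELBs" workaround, the sketch below shows what moving an instance between classic (2016-era) load balancers looks like through the API, the kind of manual move a team falls back to when the console is unusable. The load balancer names and instance ID are hypothetical, and this is not a record of the commands Buildkite actually ran.

```python
# Illustrative sketch only: re-homing a healthy instance from one classic ELB
# to another via the API. Names and IDs are hypothetical.
import boto3

elb = boto3.client("elb", region_name="us-east-1")  # classic ELB API

instance = {"InstanceId": "i-0abc123def456789"}     # hypothetical instance

# Pull a healthy server out of a lower-priority load balancer...
elb.deregister_instances_from_load_balancer(
    LoadBalancerName="internal-tools-elb",
    Instances=[instance],
)

# ...and attach it to the load balancer serving buildkite.com traffic.
elb.register_instances_with_load_balancer(
    LoadBalancerName="buildkite-web-elb",
    Instances=[instance],
)
```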

Key lessons learned included the importance of monitoring AWS credits, conducting load testing after significant infrastructure changes, re-evaluating health check logic, ensuring robust self-deployment mechanisms for new servers, and maintaining correct on-call PagerDuty and phone settings.
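
On the self-deployment lesson specifically, the failure mode was that new instances could not boot because fetching the latest release depended on the very site that was down. The sketch below is one way to express that lesson, not Buildkite's bootstrap: try the release endpoint named in the summary with a short timeout, and fall back to the code already baked into the AMI so fresh instances can still come online during an outage. The file paths and commands are hypothetical.

```python
# Sketch (not Buildkite's bootstrap) of the "robust self-deployment" lesson:
# a new instance should still be able to start the code baked into its AMI
# when the endpoint serving the latest release is itself down.
# Paths and commands below are hypothetical.
import subprocess
import urllib.request

RELEASE_URL = "https://buildkite.com/_secret/version"  # endpoint named in the summary
BAKED_IN_RELEASE = "/opt/app/releases/baked-in"        # hypothetical AMI path


def resolve_release():
    try:
        # Short timeout so a dead endpoint does not stall bootstrapping.
        with urllib.request.urlopen(RELEASE_URL, timeout=5) as resp:
            return resp.read().decode().strip()
    except OSError:
        # The site is unreachable (the incident scenario): signal the caller
        # to fall back to the release already present on the AMI.
        return None


def bootstrap():
    release = resolve_release()
    if release is None:
        subprocess.run(["/opt/app/bin/start", BAKED_IN_RELEASE], check=True)
    else:
        subprocess.run(["/opt/app/bin/deploy", release], check=True)


if __name__ == "__main__":
    bootstrap()
```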

Keywords

buildkite · aws · rds · postgresql · database · outage · health checks · autoscaling