Postmortem Index

Travis CI container-based Linux builds outage due to worker rollback failure

TravisCI · worker

2017-02-02 – 2017-02-05 · cloud · config-change

On February 2nd, Travis CI began rolling out worker version v2.6.2 for container-based builds. By February 3rd, issues emerged where jobs were incorrectly marked as failed, prompting a rollback to v2.5.0. However, this rollback was initially unsuccessful, leading to continued impact for users.

On February 4th, it was discovered that new instances with worker v2.5.0 were not entering service because the image was missing a tag on Docker Hub. An on-call engineer identified the missing tag and fixed it, allowing the rollback to proceed. To expedite resolution, emergency maintenance was declared on February 5th, and the incident was fully resolved by 00:31 UTC that day.

Several factors contributed to the incident. A change in the worker’s Docker backend, affecting how bash handles exit codes, was not caught by staging tests. The recent migration of worker Docker images from quay.io to Docker Hub meant that the v2.5.0 image, needed for the rollback, was not available on Docker Hub. Additionally, there was a lack of alerting for errors related to image pull failures.
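The missing rollback image suggests a preflight check that could run before any instance is replaced. The following is a minimal sketch, not Travis CI's actual tooling: it assumes the caller has already fetched a tags-list response body in the Docker Registry HTTP API v2 shape (`{"name": ..., "tags": [...]}`), and the function names and the sample repository name are hypothetical.

```python
import json


def tag_is_published(tags_list_body: str, tag: str) -> bool:
    """Return True if `tag` appears in a registry tags-list response body."""
    payload = json.loads(tags_list_body)
    # A repository with no tags may report "tags": null, so fall back to [].
    return tag in (payload.get("tags") or [])


def rollback_preflight(tags_list_body: str, target_tag: str) -> None:
    """Raise before any instance is replaced if the rollback tag is missing."""
    if not tag_is_published(tags_list_body, target_tag):
        raise RuntimeError(
            f"rollback blocked: tag {target_tag!r} is not published"
        )


# Hypothetical response body; the repository name is made up for illustration.
SAMPLE_BODY = '{"name": "example/worker", "tags": ["v2.6.2"]}'
```

With this sample body, a preflight for `v2.5.0` raises instead of letting new instances start and then fail to enter service.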

The incident caused significant disruption, build outages, and delays for users relying on Travis CI’s Linux container-based builds. In response, Travis CI implemented additional alerting for image pull failures, began discussions on improving instance replacement processes and test diversity, and is exploring a move to an agent/pull-based job execution model for easier worker version updates.
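Alerting on image pull failures can be illustrated with a sliding-window counter. This is a sketch under stated assumptions, not the alerting Travis CI deployed: the class name, threshold, and window length are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class PullFailureAlert:
    """Fire when too many image pull failures occur within a time window."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._failures = deque()  # timestamps of recent failures, oldest first

    def record_failure(self, timestamp: float) -> bool:
        """Record one failure; return True if the alert should fire."""
        self._failures.append(timestamp)
        # Drop failures that have aged out of the window.
        while (
            self._failures
            and timestamp - self._failures[0] > self.window_seconds
        ):
            self._failures.popleft()
        return len(self._failures) >= self.threshold
```

A caller would invoke `record_failure` for each pull error it observes and page the on-call engineer when the method returns `True`. Isolated failures spread far apart never reach the threshold, while a burst like the one in this incident does.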

Keywords

travis ci, worker, docker, ec2, rollback, linux, container, builds, quay.io, docker hub