Postmortem Index

Explore incident reports from various companies

Partial Cloudflare outage on October 25, 2022

Cloudflare · Tiered Cache

2022-10-25

On October 25, 2022, a software release for a Cloudflare CDN component began rolling out at 08:40 UTC, leading to a partial outage. The issue caused HTTP 530 errors, affecting approximately 5% of all HTTP requests at its peak at 17:28 UTC. Customers using Tiered Cache, Cloudflare Images, or Bandwidth Alliance were impacted if their requests required an origin fetch through the tiered cache hierarchy.

The problem stemmed from new distributed tracing code introduced into the Tiered Cache logic. This code inadvertently cleared control headers, which are crucial for routing requests and performing internal DNS lookups for origin server IP addresses. Consequently, essential hostname information was missing, resulting in the 530 DNS errors observed by clients.

Cloudflare initiated a rollback of the faulty release, confirming the root cause at 17:03 UTC. An accelerated global rollback followed, with impact ending once the reversion was complete in all upper tier data centers by 18:04 UTC.

To prevent similar incidents, Cloudflare plans to include larger Tiered Cache upper tier data centers in earlier release stages, expand acceptance test coverage for various Tiered Cache topologies, implement more aggressive alerting for requests lacking full context, and ensure systems fail fast during development and testing.

Keywords

cloudflaretiered cachecdnoutage530 errorrollbackdistributed tracingdns530 errorscontrol headers