Cloudflare outage on June 21, 2022
On June 21, 2022, Cloudflare experienced an outage that began at 06:27 UTC, when a network configuration change was deployed to Multi-Colo PoP (MCP)-enabled locations. The deployment quickly took 19 of its data centers offline. The first data center was brought back online by 06:58 UTC, and all affected data centers were operational by 07:42 UTC.
The root cause was an incorrect re-ordering of terms within a BGP policy statement during a rollout designed to standardize BGP communities. Terms in a policy are evaluated in order, and evaluation stops at the first terminating action. The REJECT-THE-REST term was inadvertently moved ahead of the ADV-SITE-LOCALS terms, so the critical site-local prefixes were rejected before the terms meant to advertise them could match, and they were withdrawn immediately.
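The effect of the term re-ordering can be illustrated with a minimal sketch of first-match policy evaluation. This is a simplified model, not Cloudflare's router configuration: the Term class, the evaluate function, the "site-local:" label, and the default-reject fallback are all assumptions made for illustration. Only the two term names come from the incident description.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Term:
    name: str
    matches: Callable[[str], bool]  # predicate over a labelled prefix
    action: str                     # "accept" or "reject"


def evaluate(policy: List[Term], prefix: str) -> str:
    """Return the action of the first matching term (first match wins)."""
    for term in policy:
        if term.matches(prefix):
            return term.action
    return "reject"  # assumed default when no term matches


def is_site_local(prefix: str) -> bool:
    # Hypothetical labelling scheme used only in this sketch.
    return prefix.startswith("site-local:")


adv_site_locals = Term("ADV-SITE-LOCALS", is_site_local, "accept")
reject_the_rest = Term("REJECT-THE-REST", lambda p: True, "reject")

intended = [adv_site_locals, reject_the_rest]   # catch-all reject comes last
reordered = [reject_the_rest, adv_site_locals]  # catch-all reject comes first

prefix = "site-local:192.0.2.0/24"  # documentation address range
print(evaluate(intended, prefix))   # accept
print(evaluate(reordered, prefix))  # reject
```

In the re-ordered policy the catch-all term matches everything, so the later term is never reached. That is the sense in which a pure ordering change, with no change to the terms themselves, can withdraw prefixes.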
The withdrawal of site-local prefixes had two main consequences. It prevented Cloudflare engineers from easily reaching the affected locations to revert the change, and it caused the internal load balancing system, Multimog, to stop functioning within the MCPs, so the smaller compute clusters received as much traffic as the largest ones and became overloaded.
Customer impact was significant: the 19 affected data centers represent only 4% of Cloudflare’s total network, yet the outage affected roughly 50% of total requests. Users reaching Cloudflare services through these locations were unable to access websites and services.
Remediation involved reverting the problematic change. Cloudflare has outlined several immediate follow-up steps, including improving change procedures and automation for MCP-specific deployments, redesigning the problematic policy statement to prevent similar re-ordering issues, and enhancing automation for staggered rollouts and automated commit-confirm rollbacks to reduce future impact and resolution times.
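The idea behind staggered rollouts with automatic rollback can be sketched as follows. This is a hypothetical illustration, not Cloudflare's tooling: the site names, the staged_rollout function, and the simulated health check are all invented for the example. The confirm step is loosely modelled on commit-confirm behaviour, where a change is kept only if it is positively confirmed and is otherwise undone.

```python
from typing import Callable, Dict, List


def staged_rollout(
    stages: List[List[str]],
    apply: Callable[[str], None],
    rollback: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> bool:
    """Apply a change stage by stage; undo everything on the first failed check."""
    applied: List[str] = []
    for stage in stages:
        for site in stage:
            apply(site)
            applied.append(site)
        # Confirm step: keep the change only if every site in the stage is healthy.
        if not all(healthy(site) for site in stage):
            for site in reversed(applied):
                rollback(site)
            return False
    return True


# Hypothetical inventory mapping each site to its config version.
sites = ["canary-1", "small-1", "small-2", "mcp-1", "mcp-2"]
config: Dict[str, str] = {site: "old" for site in sites}


def apply_change(site: str) -> None:
    config[site] = "new"


def revert_change(site: str) -> None:
    config[site] = "old"


def is_healthy(site: str) -> bool:
    # Simulated failure: the new config breaks only the MCP-style sites.
    return not (site.startswith("mcp-") and config[site] == "new")


stages = [["canary-1"], ["small-1", "small-2"], ["mcp-1", "mcp-2"]]
ok = staged_rollout(stages, apply_change, revert_change, is_healthy)
print(ok, config)
```

In this simulation the failure only appears in the last stage, because the earlier stages contain no MCP-style site. Placing one such site in an early stage would let the check fail before the change reaches the rest, which is one way to read the follow-up item on MCP-specific deployment procedures.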