Postmortem Index

Explore incident reports from various companies

Cloudflare global outage on July 2, 2019

Cloudflare · Cloudflare WAF Managed Rules

On July 2, 2019, Cloudflare experienced a global outage lasting 27 minutes, from 13:42 UTC to 14:09 UTC. The incident began when an engineer deployed a new rule to the WAF Managed Rules. Within minutes, PagerDuty alerts indicated widespread failures, leading to a P0 incident declaration. Cloudflare’s core proxying, CDN, and WAF functionality became unavailable, causing customers worldwide to see 502 error pages.

The root cause was a poorly written regular expression within the newly deployed WAF rule. This regex caused excessive backtracking, leading to nearly 100% CPU utilization across Cloudflare’s global network servers handling HTTP/HTTPS traffic. The rapid deployment mechanism, Quicksilver, distributed the problematic rule globally in seconds.

Several factors contributed to the widespread impact. A critical protection mechanism designed to prevent excessive CPU usage by regular expressions had been mistakenly removed during a prior WAF refactoring. The WAF’s test suite also lacked the capability to identify such CPU consumption issues. Furthermore, the standard operating procedure for WAF rule changes allowed for immediate global deployment without staged rollouts, unlike other software releases.

Cloudflare’s internal systems, including the control panel and Jira, were also affected, hindering incident response due to their reliance on the very services that were down. SREs faced challenges accessing systems because their credentials had been timed out for security reasons, and the status page was not updated quickly enough.

In response, Cloudflare immediately re-introduced the missing CPU usage protection and manually inspected all existing WAF rules. They committed to implementing performance profiling in their test suite, switching to regex engines with runtime guarantees, and revising WAF deployment SOPs to include staged rollouts. Long-term plans include porting the WAF to a new firewall engine for enhanced performance and protection.

Keywords

cloudflarewafregexcpu exhaustionglobaloutagebacktrackingxssregular expressionquicksilver502 errors