{"UUID":"a0e252d3-10a6-4345-84c3-f271124e2d7b","URL":"https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/","ArchiveURL":"https://web.archive.org/web/20211006055154if_/https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/","Title":"Cloudflare global outage on July 2, 2019","StartTime":"2019-07-02T13:42:00Z","EndTime":"2019-07-02T14:09:00Z","Categories":["cascading-failure","config-change","security"],"Keywords":["cloudflare","waf","regex","cpu exhaustion","global","outage","backtracking","xss","regular expression","quicksilver","502 errors"],"Company":"Cloudflare","Product":"Cloudflare WAF Managed Rules","SourcePublishedAt":"2019-07-12T15:45:21Z","SourceFetchedAt":"2026-05-04T18:17:47.839537Z","Summary":"CPU exhaustion was caused by a single WAF rule containing a poorly written regular expression that triggered excessive backtracking. The rule was deployed quickly to production, and a series of events led to a global 27-minute outage of Cloudflare services.","Description":"On July 2, 2019, Cloudflare experienced a global outage lasting 27 minutes, from 13:42 UTC to 14:09 UTC. The incident began when an engineer deployed a new rule to the WAF Managed Rules. Within minutes, PagerDuty alerts indicated widespread failures, leading to a P0 incident declaration. Cloudflare's core proxying, CDN, and WAF functionality became unavailable, causing customers worldwide to see 502 error pages.\n\nThe root cause was a poorly written regular expression within the newly deployed WAF rule. This regex caused excessive backtracking, leading to nearly 100% CPU utilization across Cloudflare's global network servers handling HTTP/HTTPS traffic. The rapid deployment mechanism, Quicksilver, distributed the problematic rule globally in seconds.\n\nSeveral factors contributed to the widespread impact. 
A critical protection mechanism designed to prevent excessive CPU usage by regular expressions had been mistakenly removed during a prior WAF refactoring. The WAF's test suite also lacked the capability to identify such CPU consumption issues. Furthermore, the standard operating procedure for WAF rule changes allowed for immediate global deployment without staged rollouts, unlike other software releases.\n\nCloudflare's internal systems, including the control panel and Jira, were also affected, hindering incident response due to their reliance on the very services that were down. SREs faced challenges accessing systems because their credentials had been timed out for security reasons, and the status page was not updated quickly enough.\n\nIn response, Cloudflare immediately re-introduced the missing CPU usage protection and manually inspected all existing WAF rules. They committed to implementing performance profiling in their test suite, switching to regex engines with runtime guarantees, and revising WAF deployment SOPs to include staged rollouts. Long-term plans include porting the WAF to a new firewall engine for enhanced performance and protection."}