Cloudflare API and dashboard availability incident on 2020-11-02
Cloudflare · API and dashboard
On November 2, 2020, Cloudflare experienced an incident that impacted the availability of its API and dashboard for six hours and 33 minutes. During this time, the success rate for API queries periodically dropped to 75%, and the dashboard experience was up to 80 times slower than normal. While Cloudflare’s edge network remained operational, the control plane (API and dashboard) was affected.
The incident began at 14:43 UTC with a partial network switch failure. The misbehaving switch was not fully down: it carried some traffic but not all, leaving different nodes with inconsistent views of which peers were reachable. This created a “Byzantine fault” scenario for the etcd cluster, where members received conflicting information about connectivity and could not elect a stable leader; without a leader, the cluster could not commit writes and became read-only.
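To make that failure mode concrete, here is a minimal sketch (not Cloudflare’s tooling) using the etcd v3 Go client to ask each member which node it believes is the leader. The endpoints are hypothetical; in a healthy cluster every member reports the same non-zero leader ID, while a cluster stuck in repeated elections reports 0 or conflicting IDs.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Hypothetical member endpoints; substitute the real cluster.
	endpoints := []string{"http://10.0.0.1:2379", "http://10.0.0.2:2379", "http://10.0.0.3:2379"}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer cli.Close()

	// Ask each member who it believes the leader is.
	leaders := map[uint64]int{}
	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		status, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			fmt.Printf("%s: unreachable (%v)\n", ep, err)
			continue
		}
		fmt.Printf("%s: leader=%x raftTerm=%d\n", ep, status.Leader, status.RaftTerm)
		leaders[status.Leader]++
	}

	// Leader ID 0 means "no leader"; more than one distinct ID means the
	// members disagree. Either way, writes will stall.
	if _, noLeader := leaders[0]; noLeader || len(leaders) != 1 {
		fmt.Println("WARNING: no stable leader; cluster is effectively read-only")
	}
}
```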
The read-only etcd cluster subsequently impacted the database cluster management system, which relies on etcd for coordination. This triggered the automatic promotion of a new primary database. However, a defect in the cluster management system required a full rebuild of all database replicas when a new primary was promoted, causing significant delays. The primary authentication database became overloaded as it had to handle all traffic without available replicas.
To mitigate the impact, Cloudflare manually reduced load by dialing back non-critical services. They also manually diverted API read queries to database replicas in a secondary data center, which substantially improved API availability. The incident concluded at 21:20 UTC, when the first database replica finished rebuilding and performance returned to normal.
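One way to picture the read-query diversion is a small read/write splitter that sends writes to the primary and reads to replicas in another data center. This is an illustrative sketch with hypothetical connection strings and table names, not Cloudflare’s actual routing layer.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres driver; any database/sql driver works the same way
)

// splitter holds one handle for writes (the primary) and one for reads
// (replicas in a secondary data center). Both DSNs are hypothetical.
type splitter struct {
	primary *sql.DB
	replica *sql.DB
}

func newSplitter(primaryDSN, replicaDSN string) (*splitter, error) {
	p, err := sql.Open("postgres", primaryDSN)
	if err != nil {
		return nil, err
	}
	r, err := sql.Open("postgres", replicaDSN)
	if err != nil {
		return nil, err
	}
	return &splitter{primary: p, replica: r}, nil
}

// Reads go to the replica pool so the overloaded primary only sees writes.
func (s *splitter) QueryRow(query string, args ...interface{}) *sql.Row {
	return s.replica.QueryRow(query, args...)
}

// Writes must still reach the primary.
func (s *splitter) Exec(query string, args ...interface{}) (sql.Result, error) {
	return s.primary.Exec(query, args...)
}

func main() {
	s, err := newSplitter(
		"postgres://core-dc/api?sslmode=require",      // hypothetical primary DSN
		"postgres://secondary-dc/api?sslmode=require", // hypothetical replica DSN
	)
	if err != nil {
		log.Fatal(err)
	}
	var n int
	if err := s.QueryRow("SELECT count(*) FROM zones").Scan(&n); err != nil {
		log.Fatal(err)
	}
	log.Printf("zones: %d", n)
}
```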
Post-incident, Cloudflare adjusted the configuration for automatic database replica promotion so that it cannot trigger too quickly on transient failures. They also fixed the bug in the cluster management system that necessitated full replica rebuilds, and they are redesigning the user session system for greater cross-data-center flexibility.
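The “don’t promote too quickly” change can be illustrated with a simple debounce: only allow failover after the primary has been continuously unhealthy for longer than a grace period, so a brief network or etcd blip does not kick off an expensive replica rebuild. The grace period and health-check flow below are hypothetical, not Cloudflare’s actual configuration.

```go
package main

import (
	"fmt"
	"time"
)

// failoverGuard only signals promotion after the primary has been
// unhealthy for a full grace period.
type failoverGuard struct {
	grace       time.Duration // hypothetical value; tune to the environment
	unhealthyAt time.Time     // zero while the primary looks healthy
}

// observe records one health-check result and reports whether promotion
// should proceed now.
func (g *failoverGuard) observe(healthy bool, now time.Time) bool {
	if healthy {
		g.unhealthyAt = time.Time{} // reset on any successful check
		return false
	}
	if g.unhealthyAt.IsZero() {
		g.unhealthyAt = now // first failure: start the clock
		return false
	}
	return now.Sub(g.unhealthyAt) >= g.grace
}

func main() {
	g := &failoverGuard{grace: 2 * time.Minute}
	start := time.Now()
	fmt.Println(g.observe(false, start))                     // false: clock starts
	fmt.Println(g.observe(false, start.Add(30*time.Second))) // false: still within grace
	fmt.Println(g.observe(false, start.Add(3*time.Minute)))  // true: promotion allowed
}
```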