Datadog Infrastructure Connectivity Issue, March 2023
On March 8, 2023, at 06:00 UTC, Datadog experienced an infrastructure connectivity issue impacting its US1, EU1, US3, US4, and US5 regions. This led to customers being unable to access the platform, services, or APIs, with monitors unavailable and data ingestion affected. Initial recovery began by 09:13 UTC, with major services operational by 16:44 UTC on March 8. All services were operational by March 9, 08:58 UTC, and the incident was fully resolved, including historical data backfill, by March 10, 06:25 UTC.
The root cause was an automatic security update to systemd on March 8, 2023, at 06:00 UTC. This update triggered a latent interaction in the network stack, causing systemd-networkd to forcibly delete routing rules managed by the Cilium Container Network Interface (CNI) plugin. This action took tens of thousands of nodes offline.
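As a rough illustration of this interaction, the toy model below treats a host's policy routing rules as a list and a networkd restart as a pass that drops every rule networkd does not own. The `Rule` type, the `owner` label, and `restart_networkd` are hypothetical names invented for this sketch. The real kernel tables and systemd-networkd logic are more involved, and the `manage_foreign_rules` flag only loosely mirrors the `ManageForeignRoutingPolicyRules=` setting in networkd.conf.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A simplified policy routing rule (hypothetical model, not a kernel struct)."""
    priority: int
    table: int
    owner: str  # illustrative label: "networkd" or "cilium"


def restart_networkd(rules, manage_foreign_rules=True):
    """Return the rules that survive a simulated networkd restart.

    With manage_foreign_rules=True, every rule not owned by networkd is
    dropped, which is the failure mode described above. With False,
    foreign rules (such as those installed by a CNI plugin) are kept.
    """
    if not manage_foreign_rules:
        return list(rules)
    return [rule for rule in rules if rule.owner == "networkd"]


# Example state: one rule configured through networkd, two installed by the CNI.
# The priorities and table numbers are made up for illustration.
node_rules = [
    Rule(priority=100, table=254, owner="networkd"),
    Rule(priority=20, table=2004, owner="cilium"),
    Rule(priority=110, table=2005, owner="cilium"),
]

surviving = restart_networkd(node_rules)
lost = [rule for rule in node_rules if rule not in surviving]
```

In this model, `lost` holds exactly the CNI-owned rules. On a real node, losing them would break container networking until the CNI agent reinstalls them.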
The impact was compounded by a legacy security update channel that was still enabled on the base OS images used for Kubernetes clusters. Because the default update window ran from 06:00 to 07:00 UTC on every host, the update was applied automatically in multiple regions at the same time. This indirect coupling of behavior across regions significantly amplified the scale of the impact.
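The coupling can be made concrete with a small sketch that counts how many regions fall inside an update window during the same hour. The region names and the 06:00 UTC start come from the text above. The function names, the one-hour granularity, and the staggered schedule are assumptions added purely for contrast; the source does not say Datadog staggered anything, only that the channel was later disabled.

```python
REGIONS = ["US1", "EU1", "US3", "US4", "US5"]


def shared_windows(regions, start_hour=6):
    """Every region uses the same one-hour window, starting at start_hour UTC."""
    return {region: start_hour for region in regions}


def staggered_windows(regions, first_hour=6, spacing_hours=4):
    """Hypothetical alternative: each region's window starts spacing_hours later."""
    return {
        region: (first_hour + index * spacing_hours) % 24
        for index, region in enumerate(regions)
    }


def regions_updating(windows, hour_utc):
    """Regions whose one-hour update window covers the given UTC hour."""
    return [region for region, start in windows.items() if start == hour_utc]


def worst_case_overlap(windows):
    """Largest number of regions that can be updating during the same hour."""
    return max(len(regions_updating(windows, hour)) for hour in range(24))


shared = shared_windows(REGIONS)
staggered = staggered_windows(REGIONS)
```

With the shared default, `worst_case_overlap(shared)` equals the number of regions, so a single bad update can reach all of them within the same hour. With the hypothetical staggered schedule, the worst case is one region at a time.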
Remediation focused first on restoring real-time telemetry processing and then on recovering historical data. This involved scaling up compute capacity with cloud providers and recovering services in parallel. Datadog has since disabled the legacy security update channel and reconfigured systemd-networkd so that it no longer alters routing tables. The infrastructure was also audited for similar potential issues.
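One way to audit a host for this class of problem is to check the systemd-networkd configuration for the settings that stop it from removing routes and rules it did not create. The sketch below assumes the relevant options are `ManageForeignRoutingPolicyRules=` and `ManageForeignRoutes=` in the `[Network]` section of networkd.conf, both of which exist in recent systemd releases and default to yes. The source does not name the exact settings Datadog changed, so treat that mapping, and the helper name `missing_protections`, as assumptions.

```python
import configparser

# Options assumed to matter for this audit. Both default to "yes" in systemd.
REQUIRED_OFF = ("ManageForeignRoutingPolicyRules", "ManageForeignRoutes")

# Spellings systemd accepts for a false boolean value.
FALSE_VALUES = {"no", "false", "0", "off"}


def missing_protections(conf_text):
    """Return the options that are not explicitly disabled in the [Network] section."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(conf_text)
    section = parser["Network"] if parser.has_section("Network") else {}
    return [
        option
        for option in REQUIRED_OFF
        if section.get(option, "yes").strip().lower() not in FALSE_VALUES
    ]


hardened = """
[Network]
ManageForeignRoutingPolicyRules=no
ManageForeignRoutes=no
"""

stock = """
[Network]
# no overrides, so the systemd defaults apply
"""
```

Calling `missing_protections(hardened)` returns an empty list, while `missing_protections(stock)` lists both options. A fuller audit would also need to read drop-in files under networkd.conf.d, which this sketch ignores.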
The incident highlighted the need to re-examine assumptions about regional autonomy and the importance of communicating clearly about which services remain available during degraded operations. Lessons learned will inform future improvements to platform resilience, including enhanced chaos testing and refined strategies for prioritizing data during outages, such as processing real-time data ahead of historical backfill.