Datadog US region infrastructure connectivity issue

Datadog · service discovery and dynamic configuration system

2020-09-24 – 2020-09-25 · automation · cascading-failure · cloud · security

On September 24, 2020, at 14:27 UTC, Datadog’s US region experienced a service degradation that lasted until September 25, 2020, at 03:00 UTC. The incident affected multiple systems, including the web tier, API endpoints, logs, network performance monitoring, alerts, infrastructure monitoring, and APM, which were at times disabled, degraded, or intermittently available. Incoming data was still ingested and processed, but users had difficulty accessing the affected products.

The incident stemmed from the failure of an internal service discovery and dynamic configuration system, a core component that most Datadog software relies on. The root cause was traced to a faulty configuration change rolled out in late August: it made a large data intake cluster resolve names through the local DNS resolver, which proxies to the service discovery system, rather than through a more resilient local file.
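As a rough illustration of why that ordering matters, here is a minimal Go sketch of the more resilient pattern: consult a locally cached hosts-style snapshot first and fall back to the live resolver only when the snapshot has no answer, so resolution survives a service discovery outage. All names here (the snapshot path, the hostname) are hypothetical, not Datadog’s actual setup; the faulty change effectively inverted this ordering, leaving every lookup dependent on the resolver path.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"strings"
)

// resolveFromFile looks up a host in a static, hosts-style file, e.g. a
// snapshot periodically rendered from service discovery onto local disk.
func resolveFromFile(path, host string) (string, bool) {
	f, err := os.Open(path)
	if err != nil {
		return "", false
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[1] == host {
			return fields[0], true
		}
	}
	return "", false
}

// resolve prefers the resilient local snapshot and only then falls back to
// the local DNS resolver, which proxies to service discovery and therefore
// shares its failure modes.
func resolve(host string) (string, error) {
	if addr, ok := resolveFromFile("/etc/hosts.snapshot", host); ok { // hypothetical path
		return addr, nil
	}
	addrs, err := net.LookupHost(host)
	if err != nil {
		return "", fmt.Errorf("resolve %s: %w", host, err)
	}
	if len(addrs) == 0 {
		return "", fmt.Errorf("resolve %s: no addresses", host)
	}
	return addrs[0], nil
}

func main() {
	addr, err := resolve("latency-probe.internal") // hypothetical hostname
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("resolved to", addr)
}
```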

This faulty configuration set the stage for a “thundering herd” scenario. When a smaller, latency-measuring cluster (a dependency of the intake cluster) was recycled, its hostnames temporarily stopped resolving, and the intake cluster responded by issuing a massive volume of DNS requests, each answered with NXDOMAIN, to the service discovery system. This sudden onslaught overwhelmed the service discovery cluster, which lost quorum and could no longer reliably register or deregister services or answer DNS requests.
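The amplification at the heart of a thundering herd comes from many clients retrying a failed lookup in lockstep. The standard countermeasure is exponential backoff with full jitter, sketched below in Go; this is a generic illustration with a hypothetical hostname, not anything taken from Datadog’s code.

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
	"time"
)

// lookupWithBackoff retries a failed lookup with exponential backoff and
// full jitter, so retries from a large fleet decorrelate instead of
// hitting the resolver at the same instant.
func lookupWithBackoff(host string, maxAttempts int) ([]string, error) {
	base := 100 * time.Millisecond
	for attempt := 0; attempt < maxAttempts; attempt++ {
		addrs, err := net.LookupHost(host)
		if err == nil {
			return addrs, nil
		}
		// Sleep a random duration in [0, base*2^attempt).
		ceiling := base << attempt
		time.Sleep(time.Duration(rand.Int63n(int64(ceiling))))
	}
	return nil, fmt.Errorf("lookup %s: gave up after %d attempts", host, maxAttempts)
}

func main() {
	addrs, err := lookupWithBackoff("latency-probe.internal", 5) // hypothetical hostname
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(addrs)
}
```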

The failure of the service discovery system meant most services could not reliably find their dependencies or load runtime configuration, leading to widespread errors. The web tier, sitting at the top of the dependency tree, was hit particularly hard, with error rates of 60–90%. Recovery was complicated because many mitigation tools, including dynamic configuration changes and load shedding, were themselves unavailable due to the core system’s failure.
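Load shedding is one of those mitigation tools. As a hedged, illustrative sketch (not Datadog’s implementation), a shedder can be made independent of dynamic configuration by baking a concurrency limit into the process itself, so it keeps working even when the control plane is down:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// shed wraps a handler with a fixed concurrency limit. The limit lives in
// the binary, not in dynamic configuration, so shedding still functions
// when the configuration system is unreachable.
func shed(limit int, next http.Handler) http.Handler {
	sem := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default:
			// Over the limit: fail fast rather than queueing and melting down.
			http.Error(w, "overloaded, try again later", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	log.Fatal(http.ListenAndServe(":8080", shed(100, hello)))
}
```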

Datadog is implementing several changes to prevent similar incidents: further decoupling the control and data planes, separating service discovery from dynamic configuration, building additional caching layers for DNS queries, hardening components to tolerate the loss of service discovery, and regularly exercising these failure modes. Improvements to service discovery and web tier resilience, along with enhanced external communication processes, are also underway.
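One way to read “additional caching layers for DNS queries” is a serve-stale cache: keep the last known-good answer and return it when the upstream resolver fails. The Go sketch below is purely illustrative, with hypothetical names, and assumes that serving stale answers during an outage is acceptable.

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

type entry struct {
	addrs   []string
	expires time.Time
}

// StaleCache caches DNS answers for a TTL and, crucially, keeps expired
// entries around so it can serve a stale answer when the upstream fails.
type StaleCache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]entry
}

func NewStaleCache(ttl time.Duration) *StaleCache {
	return &StaleCache{ttl: ttl, m: make(map[string]entry)}
}

func (c *StaleCache) Lookup(host string) ([]string, error) {
	c.mu.Lock()
	e, ok := c.m[host]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.addrs, nil // fresh hit
	}
	addrs, err := net.LookupHost(host)
	if err == nil {
		c.mu.Lock()
		c.m[host] = entry{addrs: addrs, expires: time.Now().Add(c.ttl)}
		c.mu.Unlock()
		return addrs, nil
	}
	if ok {
		// Upstream failed: serve the stale answer so an outage of the
		// discovery system degrades freshness instead of cascading.
		return e.addrs, nil
	}
	return nil, fmt.Errorf("lookup %s: %w (no cached answer)", host, err)
}

func main() {
	cache := NewStaleCache(30 * time.Second)
	fmt.Println(cache.Lookup("example.com"))
}
```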

Keywords

service discovery · dynamic configuration · dns · thundering herd · web tier · us region · datadog · infrastructure