Honeycomb Ingest System Outage: Shepherd Cache Delays
Honeycomb · Shepherd cache
On September 8th, 2022, Honeycomb’s ingest system experienced repeated outages over more than eight hours. The incident began with a “shark fin” pattern in which Shepherd hosts, which are responsible for accepting customer data, ran out of memory (OOM) and restarted. The restarts in turn caused cascading crashes in the Refinery cluster, which processes internal Shepherd traces.
Initial attempts to stabilize the system, such as aggressive sampling in Refinery and scaling Shepherds, were ineffective. The system temporarily stabilized, but a subsequent deployment, which bypassed pinned build artifacts, caused the “shark fin” pattern to resume. Attempts to force a scale-down of Shepherds were counteracted by the autoscaler.
The core issue was identified as a contention problem within Shepherd’s in-memory schema cache. Each Shepherd worker acquired a table-wide lock when updating the cache, causing unrelated requests to queue up and leading to OOM errors and restarts. While the exact trigger for the initial bad state remained unknown, the cache contention was the underlying mechanism.
The outage left a significant portion of ingested data unavailable for over eight and a half hours, impacting most customers sending data during that window. Engineers developed and deployed a fix that reduced cache contention and pre-filled the cache before Shepherd hosts accepted traffic. This immediately stabilized the ingest system, and Refinery was subsequently stabilized by adding more hosts.
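The two-part mitigation can also be sketched, again as an assumption-laden illustration rather than Honeycomb's implementation: per-key fetch state (here via `sync.Once`) so a slow fetch for one key no longer blocks unrelated requests, plus a pre-fill step run before the host is marked ready for traffic.

```go
package main

import (
	"fmt"
	"sync"
)

// entry holds per-key fetch state so each schema is fetched at most
// once, without holding any table-wide lock during the fetch.
type entry struct {
	once   sync.Once
	schema string
}

// keyedCache is an illustrative low-contention cache: the mutex guards
// only the map of entries and is never held while fetching.
type keyedCache struct {
	mu      sync.Mutex
	entries map[string]*entry
}

func newKeyedCache() *keyedCache {
	return &keyedCache{entries: map[string]*entry{}}
}

// get returns the cached schema, fetching at most once per key; the
// slow fetch runs outside the map lock, so other keys stay responsive.
func (c *keyedCache) get(key string, fetch func(string) string) string {
	c.mu.Lock()
	e, ok := c.entries[key]
	if !ok {
		e = &entry{}
		c.entries[key] = e
	}
	c.mu.Unlock()
	e.once.Do(func() { e.schema = fetch(key) }) // per-key slow path
	return e.schema
}

// prefill loads known schemas before the host starts accepting traffic,
// so the first real requests hit a warm cache.
func (c *keyedCache) prefill(keys []string, fetch func(string) string) {
	for _, k := range keys {
		c.get(k, fetch)
	}
}

func main() {
	fetch := func(k string) string { return "schema:" + k }
	c := newKeyedCache()
	c.prefill([]string{"users", "events"}, fetch) // warm before serving
	fmt.Println(c.get("users", fetch))            // served from the warm cache
}
```

The design choice mirrors the fix described above: narrowing lock scope removes the head-of-line blocking, and pre-filling removes the thundering herd of cold-cache fetches at startup.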
Although the immediate cause was addressed, the team acknowledged that the precise trigger for the incident was not identified. The incident highlighted challenges with mental models of components and the exhausting nature of such outages. The fix has proven effective, and the team is now focusing on longer-term architectural improvements.