{"UUID":"7ba9335c-25db-4dfc-b0f6-3e6a2aff2244","URL":"https://www.honeycomb.io/blog/incident-review-shepherd-cache-delays/","ArchiveURL":"","Title":"Honeycomb Ingest System Outage: Shepherd Cache Delays","StartTime":"2022-09-08T00:00:00Z","EndTime":"0001-01-01T00:00:00Z","Categories":["cascading-failure","cloud","config-change"],"Keywords":["honeycomb","ingest","shepherd","cache","refinery","outage","oom","contention"],"Company":"Honeycomb","Product":"Shepherd cache","SourcePublishedAt":"2022-10-18T13:00:00Z","SourceFetchedAt":"2026-05-04T17:49:45.061888Z","Summary":"On September 8th, 2022, our ingest system went down repeatedly and caused interruptions for over eight hours. We will first cover the background behind the incident with a high-level view of the relevant architecture, then describe how we tried to investigate and fix the system, and finally go over some meaningful elements that surfaced from our incident review process.","Description":"On September 8th, 2022, Honeycomb's ingest system experienced repeated outages for over eight hours. The incident began with a \"shark fin\" memory-usage pattern, in which Shepherd hosts, responsible for accepting customer data, ran out of memory (OOM) and restarted. The OOM restarts also caused cascading crashes in the Refinery cluster, which processes internal Shepherd traces.\n\nInitial attempts to stabilize the system, such as aggressive sampling in Refinery and scaling up Shepherds, were ineffective. The system stabilized temporarily, but a subsequent deployment that bypassed pinned build artifacts caused the \"shark fin\" pattern to resume. Attempts to force a scale-down of Shepherds were counteracted by the autoscaler.\n\nThe core issue was identified as lock contention within Shepherd's in-memory schema cache: each Shepherd worker acquired a table-wide lock when updating the cache, so unrelated requests queued up behind it, and the growing backlog led to OOM errors and restarts. While the exact trigger for the initial bad state remained unknown, the cache contention was the underlying mechanism.\n\nThe outage left a significant portion of ingest data unavailable for over eight and a half hours, impacting most customers sending data at the time. Engineers developed and deployed a fix to reduce cache contention and pre-fill the cache before Shepherd hosts accepted traffic. This immediately stabilized the ingest system, and Refinery was subsequently stabilized by adding more hosts.\n\nAlthough the immediate cause was addressed, the team acknowledged that the precise trigger for the incident was never identified. The incident highlighted gaps in engineers' mental models of the system's components and the exhausting nature of long-running outages. The fix has proven effective, and the team is now focusing on longer-term architectural improvements."}