{"UUID":"019eb098-9da5-4d2c-a64c-c6498933f165","URL":"https://www.honeycomb.io/blog/incident-review-designed-failing/","ArchiveURL":"","Title":"Honeycomb query performance and alerting incident (August 2022)","StartTime":"0001-01-01T00:00:00Z","EndTime":"0001-01-01T00:00:00Z","Categories":["cloud"],"Keywords":["query performance","alerting","triggers","slos","lambda","s3","cold storage","backfill"],"Company":"Honeycomb","Product":"query engine, triggers, SLOs","SourcePublishedAt":"2022-09-09T16:00:00Z","SourceFetchedAt":"2026-05-04T17:44:40.228473Z","Summary":"Another story of multiple incidents that ended up impacting [query performance](https://status.honeycomb.io/incidents/fzw6hqjx5t4f) and [alerting via triggers and SLOs](https://status.honeycomb.io/incidents/jwhrxcs5zr06). These incidents were notable because of how challenging their investigation turned out to be.","Description":"In early August 2022, Honeycomb experienced incidents impacting query performance and alerting via triggers and SLOs. Visible issues began around 11:35 a.m. ET, with the worst impact lasting approximately 4 hours, though the investigation spanned 9 hours.\n\nThe incidents stemmed from a convergence of factors. Inaccurate timestamps in a customer's telemetry data caused short trigger queries to consistently access cold storage via AWS Lambda, tying trigger performance to Lambda capacity. Concurrently, a single customer's SLO, due to an SLI never returning valid results, continuously triggered aggressive backfills of up to 60 days of cold data, consuming significant Lambda resources.\n\nThe primary customer impact included degraded query performance and failures in alerting mechanisms managed by triggers and SLOs. The continuous backfills for one enterprise customer's SLO also consumed shared resources, affecting other services.\n\nInitial mitigation attempts were hampered by red herrings. Resolution came when an engineer, not involved in the initial investigation, identified the problematic SLO. The team fixed the customer's SLO, corrected the general SLO behavior for failures, and implemented a more aggressive default policy for handling future-stamps in data.\n\nHoneycomb is also working on improving on-call engineers' ability to search and categorize feature flags and enhancing support for Incident Commanders to prevent cognitive overload. The company aims to add controls to quickly manage problematic usage patterns and re-evaluate \"normal\" usage as the platform grows."}