{"UUID":"029b4a8e-0332-4b91-abc3-5f84bdf70094","URL":"https://incident.io/blog/intermittent-downtime","ArchiveURL":"","Title":"Intermittent downtime from repeated crashes","StartTime":"2022-11-18T15:40:00Z","EndTime":"2022-11-18T16:12:00Z","Categories":["automation","cloud","config-change"],"Keywords":["heroku","go","pub/sub","gcp","panic","goroutine","monolith","downtime","crash"],"Company":"incident.io","Product":"incident.io application","SourcePublishedAt":"2022-11-30T00:00:00Z","SourceFetchedAt":"2026-05-04T17:44:40.227476Z","Summary":"A bad event (a \"poison pill\") in the async worker queue triggered unhandled panics that repeatedly crashed the app. This interacted badly with Heroku's infrastructure, making the source of the problem hard to find. The mitigations applied are broadly relevant to anyone running a web service: handling a corner case of Go panic recovery, and splitting work by type/class to improve reliability.","Description":"On Friday, November 18th, 2022, incident.io experienced 13 minutes of intermittent downtime over a 32-minute period, from 15:40 to 16:12 GMT. The incident caused repeated crashes of their Go monolith application, impacting customer access to the service.\n\nThe core issue was an unhandled panic within the Go application, triggered by a \"poison pill\" message in the GCP Pub/Sub asynchronous message queue. This message caused a specific handler to panic, which, due to an edge case in Go's panic recovery mechanism, crashed the entire application.\n\nThe problem stemmed from the Google Cloud Pub/Sub client's `sub.Receive` method, which spawns new goroutines to handle messages. The parent function had a `recover()` block, but a deferred `recover()` only catches panics raised on its own goroutine, so panics in the child goroutines went unhandled and terminated the application. Heroku's dyno crash restart policy exacerbated the issue by introducing cool-off periods after repeated crashes, prolonging the downtime.\n\nInvestigation was hampered by Heroku's log buffering, which dropped the large stack traces produced by Go panics, and the app's immediate termination prevented Sentry from reporting the crash. Engineers eventually identified the problematic Pub/Sub subscription by looking for unacknowledged messages, and purging several queues stabilized the application.\n\nTwo key mitigations were implemented. First, a deferred `recover()` was added inside the message handler passed to `sub.Receive`, so that panics in the child goroutines are now caught. Second, the monolithic application was split into separate Heroku dynos for web, worker, and cron processes, improving reliability by isolating failures and preventing one component from crashing the entire service."}