{"UUID":"2590ea48-9ceb-4cb4-a034-49bffacd7b33","URL":"http://yellerapp.com/posts/2014-08-04-postmortem1.html","ArchiveURL":"","Title":"Yeller network partition causes processing delays","StartTime":"2014-07-29T12:00:00Z","EndTime":"2014-07-30T09:30:00Z","Categories":["cloud"],"Keywords":["yeller","network partition","processing delay","cluster","datacenter","riak","datomic","maintenance"],"Company":"Yeller","Product":"Yeller exception processing","SourcePublishedAt":"0001-01-01T00:00:00Z","SourceFetchedAt":"2026-05-04T17:45:54.802715Z","Summary":"A network partition in a cluster caused some messages to get delayed, up to 6-7 hours. For reasons that aren't clear, a rolling restart of the cluster healed the partition. There's some suspicion that it was due to cached routes, but there wasn't enough logging information to tell for sure.","Description":"On Tuesday, July 29, 2014, planned datacenter maintenance between 1100 UTC and 1200 UTC led to the formation of three network partitions within Yeller's server cluster. The issue persisted for approximately 21.5 hours until it was resolved on Wednesday, July 30, 2014, at 0930 UTC following a rolling restart of the cluster.\n\nThe network partitions primarily affected Yeller's internal networking, leading to severe exception processing delays and a complete inability to modify user account and billing data. The exact root cause of the partitions after the maintenance, and why a cluster restart resolved them, remain unclear. Cached routes are suspected, but the restart wiped the machine state, preventing definitive identification of the underlying issue.\n\nCustomers experienced processing delays for 42 exceptions, with some taking up to 7 hours to appear. These required manual reprocessing due to Riak write timeouts. 
Additionally, critical user actions such as signing up, changing passwords, resetting billing, or inviting users were unavailable because the network partition disabled write capacity to the user/billing data store. However, Yeller remained available for reading existing exceptions.\n\nThe immediate resolution involved a careful rolling restart of the affected server cluster. Moving forward, Yeller plans to improve outage communication by using statuspage.io and integrating it with internal tools. Technical improvements include investigating enhanced retry logic for exception writes to reduce manual intervention and a commitment to preserving bad machine states in future incidents to aid root cause analysis."}