Yeller network partition causes processing delays
Yeller · Yeller exception processing
On Tuesday, July 29, 2014, planned datacenter maintenance between 1100 UTC and 1200 UTC split Yeller’s server cluster into three network partitions. The partitions persisted for approximately 21.5 hours after the maintenance window, until a rolling restart of the cluster resolved them on Wednesday, July 30, 2014, at 0930 UTC.
The network partitions primarily affected Yeller’s internal networking, causing severe exception processing delays and a complete inability to modify user account and billing data. Why the partitions formed after the maintenance, and why a cluster restart resolved them, remain unclear. Cached routes are suspected, but the restart wiped the machine state, so the underlying issue could not be identified definitively.
Customers experienced processing delays for 42 exceptions, some of which took up to 7 hours to appear. These exceptions had to be reprocessed manually because their writes to Riak timed out. In addition, critical user actions such as signing up, changing passwords, resetting billing, or inviting users were unavailable, because the network partition disabled write capacity to the user/billing data store. Yeller remained available for reading existing exceptions throughout.
The immediate resolution was a careful rolling restart of the affected server cluster. Moving forward, Yeller plans to improve outage communication by using statuspage.io and integrating it with internal tools. On the technical side, Yeller will investigate better retry logic for exception writes, to reduce manual intervention, and commits to preserving the state of bad machines in future incidents to aid root cause analysis.
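To make the idea of retry logic for exception writes concrete, here is a minimal sketch of capped exponential backoff with jitter. It is not Yeller's implementation, and it does not use any real Riak client API: `WriteTimeout`, `write_with_retry`, and the `write` callable are hypothetical names chosen for illustration, and the attempt count and delays are arbitrary assumptions.

```python
import random
import time


class WriteTimeout(Exception):
    """Hypothetical error standing in for a data store write timeout."""


def write_with_retry(write, payload, max_attempts=5, base_delay=0.5,
                     max_delay=30.0, sleep=time.sleep):
    """Call write(payload), retrying on WriteTimeout with capped backoff.

    `write` is any callable supplied by the caller (an assumption here,
    not a real client method). `sleep` is injectable so tests need not wait.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return write(payload)
        except WriteTimeout:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the caller can queue the
                # payload for later (possibly manual) reprocessing.
                raise
            # Exponential delay doubles per attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time up to that delay, so many
            # clients do not retry in lockstep against a struggling store.
            sleep(random.uniform(0, delay))
```

A write that times out a few times and then succeeds would complete without anyone stepping in. A write that keeps failing for the whole retry budget still raises, so it can be set aside for reprocessing rather than silently dropped.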