{"UUID":"e7d7aa93-81f7-4338-9c0b-6e6c0dcefdcb","URL":"http://mail.tarsnap.com/tarsnap-announce/msg00035.html","ArchiveURL":"","Title":"Tarsnap outage 2016-07-24","StartTime":"2016-07-24T10:15:19Z","EndTime":"2016-07-24T11:40:04Z","Categories":["cloud"],"Keywords":["tarsnap","s3","amazon","outage","filesystem","disk space","monitoring","background job"],"Company":"Tarsnap","Product":"Tarsnap","SourcePublishedAt":"0001-01-01T00:00:00Z","SourceFetchedAt":"2026-05-04T17:54:49.983678Z","Summary":"A batch job which scans for unused blocks in Amazon S3 and marks them to be freed encountered a condition where all retries for freeing certain blocks would fail. The batch job logs its actions to local disk and this log grew without bound. When the filesystem filled, this caused other filesystem writes to fail, and the Tarsnap service stopped. Manually removing the log file restored service.","Description":"The Tarsnap service experienced an outage on 2016-07-24, lasting approximately 85 minutes from 10:15:19 UTC to 11:40:04 UTC. During this period, customers were unable to create new archives or retrieve existing ones, although all customer data remained safely stored. Affected users were credited with two days of storage costs.\n\nThe incident was precipitated by an increase in correlated timeout failures from Amazon S3, observed since around June 30th. This led to a significant increase in the rate at which S3 requests exhausted all permitted retries, suggesting an undiagnosed change in S3's behavior, particularly concerning retries of failed requests.\n\nA routine Tarsnap background job, tasked with identifying and marking unreferenced data blocks in S3, encountered these repeated S3 write failures. Instead of gracefully handling the persistent failures, the job continuously logged these events to a local disk file. This log file grew uncontrollably, eventually filling the filesystem.\n\nThe full filesystem subsequently caused other critical Tarsnap service writes to fail, leading to an immediate shutdown of the Tarsnap service code. Monitoring systems alerted the founder, who, after investigation, identified and deleted the runaway log file. The server code was then restarted, restoring full service functionality.\n\nLessons learned included the importance of better internal monitoring (e.g., disk space usage), the necessity of investigating anomalous behavior even if it appears harmless, and improvements to incident response protocols. The incident also reaffirmed Tarsnap's design philosophy of prioritizing data safety over service availability during critical failures."}