Postmortem Index

Explore incident reports from various companies

Tarsnap outage 2016-07-24

Tarsnap

2016-07-24 cloud

The Tarsnap service experienced an outage on 2016-07-24, lasting approximately 85 minutes from 10:15:19 UTC to 11:40:04 UTC. During this period, customers were unable to create new archives or retrieve existing ones, although all customer data remained safely stored. Affected users were credited with two days of storage costs.

The incident was precipitated by an increase in correlated timeout failures from Amazon S3, first observed around June 30th. These failures sharply raised the rate at which S3 requests exhausted all of their permitted retries, suggesting an undiagnosed change in S3’s behavior, particularly in how it handled retries of failed requests.
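The report doesn’t spell out Tarsnap’s actual retry logic; as a rough illustrative sketch (function names and limits here are invented, not from the postmortem), a request wrapper that retries with backoff and gives up after a fixed number of attempts — the "exhausted all permitted retries" condition — might look like:

```python
import time


def request_with_retries(do_request, max_retries=5, base_delay=0.1, sleep=time.sleep):
    """Attempt a request, retrying on timeout with exponential backoff.

    Re-raises the last error once all permitted retries are exhausted --
    the condition the postmortem describes becoming far more frequent.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            return do_request()
        except TimeoutError as err:
            last_error = err
            sleep(base_delay * (2 ** attempt))  # back off before retrying
    raise last_error


# A request that always times out exhausts every permitted retry:
calls = []

def flaky():
    calls.append(1)
    raise TimeoutError("S3 request timed out")

try:
    request_with_retries(flaky, max_retries=3, sleep=lambda _: None)
except TimeoutError:
    pass
assert len(calls) == 3  # all three attempts were consumed
```

When correlated failures rise, the chance that *every* attempt in such a loop fails rises much faster than the per-request failure rate, which is consistent with the sudden jump in retry exhaustion described above.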

A routine Tarsnap background job, tasked with identifying and marking unreferenced data blocks in S3, encountered these repeated S3 write failures. Instead of gracefully handling the persistent failures, the job continuously logged these events to a local disk file. This log file grew uncontrollably, eventually filling the filesystem.
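The failure mode here is unbounded error logging. A minimal sketch of one possible guard (the cap and function names are assumptions for illustration, not Tarsnap’s code): refuse to grow the error log past a fixed size, so persistent failures cannot fill the filesystem.

```python
import os

MAX_LOG_BYTES = 10 * 1024 * 1024  # illustrative 10 MiB cap


def log_error(path, message, max_bytes=MAX_LOG_BYTES):
    """Append an error line, but stop growing the log past a fixed cap.

    A size check like this would have kept the runaway log from
    filling the filesystem when S3 failures became persistent.
    """
    try:
        size = os.path.getsize(path)
    except OSError:
        size = 0  # log file does not exist yet
    if size >= max_bytes:
        return False  # drop the message rather than exhaust the disk
    with open(path, "a") as f:
        f.write(message + "\n")
    return True
```

Rate-limiting repeated identical messages, or aborting the background job after N consecutive failures, would serve the same purpose; the essential point is that the logging path must have some bound.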

The full filesystem subsequently caused other critical Tarsnap service writes to fail, leading to an immediate shutdown of the Tarsnap service code. Monitoring systems alerted the founder, who, after investigation, identified and deleted the runaway log file. The server code was then restarted, restoring full service functionality.

Lessons learned included the importance of better internal monitoring (e.g., disk space usage), the necessity of investigating anomalous behavior even if it appears harmless, and improvements to incident response protocols. The incident also reaffirmed Tarsnap’s design philosophy of prioritizing data safety over service availability during critical failures.
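The disk-space lesson is easy to make concrete. A minimal sketch (threshold and naming are assumptions, not Tarsnap’s monitoring setup): a periodic check that flags the filesystem once usage crosses a threshold, long before writes start failing.

```python
import shutil


def disk_usage_alert(path="/", threshold=0.9):
    """Return True when the filesystem holding `path` is above `threshold` full.

    Run periodically and wired into alerting, a check like this would
    have caught the growing log file well before the disk filled.
    """
    usage = shutil.disk_usage(path)
    fraction_used = (usage.total - usage.free) / usage.total
    return fraction_used >= threshold
```

A threshold of 0.9 leaves headroom to investigate; the exact value matters less than having the check exist at all.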

Keywords

tarsnap, s3, amazon, outage, filesystem, disk space, monitoring, background job