Instapaper AWS RDS MySQL 2TB File Size Limit Outage
Instapaper · Amazon RDS MySQL
Instapaper experienced an extended outage from Wednesday, February 9 at 12:30 PM PT until Thursday, February 10 at 7:30 PM PT, during which the service was completely unavailable. A short-term fix then restored Instapaper with limited access to archives, and full service was restored on Tuesday, February 14, 2017.
The critical system that failed was Instapaper’s MySQL database, hosted on Amazon’s Relational Database Service (RDS). RDS instances created before April 2014 used an ext3 filesystem, which imposed a 2TB file size limit, and that limit was the root cause. Instapaper’s “bookmarks” table, which stores user-saved articles, exceeded 2TB on February 9, and new entries began to fail. Although a read replica had been created in March 2015, it inherited the same filesystem and limitation from the original June 2013 instance.
The incident resulted in 31 hours of full downtime, followed by five days of limited access for users. The issue was difficult to foresee because the RDS console provided no monitoring, alerts, or logging related to the 2TB file size limit. Instapaper engineers were unaware of this specific limitation, which was likely known only to the contractors who performed the initial migration in 2013.
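Since the console gave no warning, a team in a similar position might run its own size check. The following is a minimal sketch, not something described in the incident report: the 2 TiB ceiling, the 80% warning threshold, and the function name `tables_near_limit` are all assumptions for illustration. The sizes reported by `information_schema.tables` are estimates and may differ from the real size of the table's file on disk.

```python
# Hypothetical sketch: flag MySQL tables whose estimated size nears a per-file limit.
# Assumptions (not from the incident report): a 2 TiB ceiling and an 80% warning threshold.

LIMIT_BYTES = 2 * 1024**4   # assumed per-file ceiling (2 TiB)
WARN_FRACTION = 0.80        # illustrative threshold; tune to your growth rate

# A query an operator could run to collect sizes. information_schema values
# are estimates, so treat the result as an early warning, not an exact figure.
SIZE_QUERY = """
SELECT table_schema, table_name, data_length + index_length AS size_bytes
FROM information_schema.tables
WHERE engine = 'InnoDB'
"""


def tables_near_limit(rows, limit_bytes=LIMIT_BYTES, warn_fraction=WARN_FRACTION):
    """Return (schema, table, fraction_of_limit) for tables at or above the threshold.

    rows: iterable of (schema, table, size_bytes) tuples, e.g. the result of SIZE_QUERY.
    The result is sorted with the largest fraction first.
    """
    flagged = []
    for schema, table, size_bytes in rows:
        fraction = size_bytes / limit_bytes
        if fraction >= warn_fraction:
            flagged.append((schema, table, round(fraction, 3)))
    return sorted(flagged, key=lambda item: item[2], reverse=True)
```

Feeding the rows returned by `SIZE_QUERY` into `tables_near_limit` on a schedule, and alerting whenever the result is non-empty, would give advance notice well before a table reaches the limit.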
Recovery involved extensive collaboration with AWS support and Pinterest’s Site Reliability Engineers. The primary path forward was to rebuild the 2.5TB database by performing a complete dump and restore to a new instance not subject to the 2TB limit. This was a multi-day effort. An Amazon engineer ultimately facilitated recovery by performing an rsync between the failed ext3 filesystem and a new ext4 filesystem, creating a new, fully indexed database. The new database was then synced with the temporary production database using binary logs, and finally promoted to master, restoring full service without data loss.
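Before a database that is catching up from binary logs is promoted, it helps to confirm that it has fully caught up. The sketch below shows one way to make that decision, assuming standard MySQL replication and the field names reported by `SHOW SLAVE STATUS` in MySQL 5.x. The report does not describe the checks that were actually used, and the function name `ready_to_promote` is hypothetical.

```python
# Hypothetical sketch: decide whether a binlog-fed replica is safe to promote.
# Assumes the caller has already fetched one row of SHOW SLAVE STATUS as a dict.


def ready_to_promote(status, max_lag_seconds=0):
    """Return True only if both replication threads run and lag is within bounds.

    status: dict with the SHOW SLAVE STATUS fields Slave_IO_Running,
            Slave_SQL_Running, and Seconds_Behind_Master.
    max_lag_seconds: largest acceptable lag (assumed default: fully caught up).
    """
    io_running = status.get("Slave_IO_Running") == "Yes"
    sql_running = status.get("Slave_SQL_Running") == "Yes"
    lag = status.get("Seconds_Behind_Master")

    # MySQL reports NULL lag when replication is not running; treat that as not ready.
    if not (io_running and sql_running) or lag is None:
        return False
    return int(lag) <= max_lag_seconds
```

In practice, writes to the old master would also need to be stopped before the final check, so that a lag of zero really means no data is left behind.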
The incident highlighted the lack of a disaster recovery plan for such a critical filesystem issue, which prolonged both the downtime and the recovery. Instapaper plans to define a better workflow for system-wide outages, one that escalates issues immediately to Pinterest’s SRE team, and to increase backup testing frequency from quarterly to monthly. Despite the challenges, Instapaper will continue to use AWS RDS, citing its overall reliability and the support received during the incident.