Foursquare MongoDB memory exhaustion outage
Foursquare · MongoDB
Foursquare suffered a prolonged outage when its MongoDB instances ran out of memory. Users saw a “large amount of downtime,” and some questioned the company’s reliability and considered alternative services. The incident was discussed in a Hacker News thread following a postmortem by MongoDB’s lead developer.
The core technical issue was fragmentation within MongoDB’s data files. Foursquare’s check-in documents were small (around 300 bytes), so roughly a dozen fit on each 4 KB page. When documents were removed, they left “holes” scattered across many pages instead of emptying whole pages. The result was a “swiss cheese” effect: the data files still needed the same amount of RAM even after some data had been moved.
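The fragmentation effect described above can be illustrated with a small simulation. This is a rough sketch, not a model of MongoDB's actual storage engine: it assumes fixed 300-byte documents packed into 4 KB pages, and it assumes the removed documents are scattered at random across pages. The 5% removal fraction and the function name `pages_still_needed` are illustrative choices, not figures from the incident.

```python
import random

PAGE_SIZE = 4096                        # 4 KB page, as in the text
DOC_SIZE = 300                          # approximate check-in document size, as in the text
DOCS_PER_PAGE = PAGE_SIZE // DOC_SIZE   # 13 whole documents fit on one page


def pages_still_needed(num_docs, fraction_removed, seed=0):
    """Return (pages_before, pages_after) for a random removal of documents.

    Assumption: removed documents are spread uniformly at random, so a page
    is reclaimed only if every document on it happens to be removed.
    """
    rng = random.Random(seed)
    pages_before = -(-num_docs // DOCS_PER_PAGE)  # ceiling division
    removed = set(rng.sample(range(num_docs), int(num_docs * fraction_removed)))

    # Count the surviving documents on each page.
    live_per_page = [0] * pages_before
    for doc_id in range(num_docs):
        if doc_id not in removed:
            live_per_page[doc_id // DOCS_PER_PAGE] += 1

    # A page must stay resident if even one live document remains on it.
    pages_after = sum(1 for live in live_per_page if live > 0)
    return pages_before, pages_after


if __name__ == "__main__":
    before, after = pages_still_needed(num_docs=130_000, fraction_removed=0.05)
    print(f"pages needed before removal: {before}")
    print(f"pages needed after removal:  {after}")
```

Under these assumptions, a page is freed only when all 13 of its documents are removed, which at a 5% removal rate is vanishingly unlikely. The number of pages that must stay in RAM therefore barely changes, even though 5% of the data is gone.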
A critical contributing factor was the absence of adequate monitoring and reporting on Foursquare’s database servers. Although the service relied heavily on these machines, nothing tracked their memory usage, CPU, or bandwidth, so the impending memory exhaustion went undetected in the days or weeks before the outage. This gap was particularly surprising given that Foursquare had reportedly experienced the same problem once before.
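As a rough illustration of the kind of early warning that was missing, the sketch below fits a straight line to daily memory-usage samples and estimates how many days remain before a RAM limit is reached. Everything here is hypothetical: the function name `days_until_exhaustion`, the sample values, and the 66 GB limit are made up for the example, and linear growth is an assumption that real workloads may not follow.

```python
def days_until_exhaustion(samples_gb, limit_gb):
    """Estimate days until memory usage reaches limit_gb.

    samples_gb: one memory-usage reading per day, oldest first (hypothetical data).
    Returns the number of days after the last sample at which a least-squares
    linear trend crosses limit_gb, or None if usage is flat or shrinking.
    """
    n = len(samples_gb)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    # Ordinary least-squares fit of usage against day index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(samples_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples_gb))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    if slope <= 0:
        return None  # no growth trend, so no projected exhaustion

    day_at_limit = (limit_gb - intercept) / slope
    # Clamp at zero: the trend may already be at or past the limit.
    return max(0.0, day_at_limit - (n - 1))


if __name__ == "__main__":
    usage = [50.0, 51.0, 52.0, 53.0]  # hypothetical daily readings in GB
    remaining = days_until_exhaustion(usage, limit_gb=66.0)
    print(f"projected days until memory limit: {remaining}")
```

A check like this, run daily against real metrics and wired to an alert threshold, is one simple way to get the days or weeks of warning that the discussion found lacking.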
The discussion highlighted the importance of robust monitoring and capacity planning for rapidly growing startups. Potential solutions mentioned included online compaction for MongoDB, improved data locality, and the necessity of a multi-server setup for data durability. Foursquare later sought an operations engineer, which suggests a move toward addressing these infrastructure gaps.