{"UUID":"b29ba3ed-e3be-48f0-95b4-979e69ced0ab","URL":"https://news.ycombinator.com/item?id=1769761","ArchiveURL":"","Title":"Foursquare MongoDB memory exhaustion outage","StartTime":"0001-01-01T00:00:00Z","EndTime":"0001-01-01T00:00:00Z","Categories":["automation","cloud","config-change"],"Keywords":["foursquare","mongodb","memory","outage","monitoring","fragmentation","database","ec2"],"Company":"Foursquare","Product":"MongoDB","SourcePublishedAt":"0001-01-01T00:00:00Z","SourceFetchedAt":"2026-05-04T17:52:21.519473Z","Summary":"MongoDB fell over under load when it ran out of memory. The failure was catastrophic and not graceful due to a query pattern that involved a read-load with low levels of locality (each user check-in caused a read of all check-ins for the user's history, and records were 300 bytes with no spatial locality, meaning that most of the data pulled in from each page was unnecessary). A lack of monitoring on the MongoDB instances caused the high load to go undetected until the load became catastrophic, causing 17 hours of downtime spanning two incidents in two days.","Description":"Foursquare experienced significant downtime due to its MongoDB instances running out of memory. This led to a \"large amount of downtime\" for users, with some questioning the company's reliability and considering alternative services. The incident was discussed in a Hacker News thread following a postmortem by MongoDB's lead developer.\n\nThe core technical issue stemmed from memory fragmentation within MongoDB. Foursquare's check-in documents were small (around 300 bytes), meaning many fit onto 4KB pages. When data was removed, it created \"holes\" in the address space rather than freeing up entire pages, leading to a \"swiss cheese\" effect where the data files still required the same amount of RAM despite some data being moved.\n\nA critical contributing factor was the absence of adequate monitoring and reporting on Foursquare's database servers. 
Despite relying heavily on these machines, there was no system in place to track memory usage, CPU, or bandwidth, preventing detection of the impending memory exhaustion days or weeks in advance. This lack of monitoring was particularly surprising given that Foursquare had reportedly experienced the problem once before.\n\nDiscussions highlighted the importance of robust monitoring and capacity planning for rapidly growing startups. Potential solutions mentioned included online compaction for MongoDB, improved data locality, and the necessity of a multi-server setup for data durability. Foursquare later sought an operations engineer, indicating a move towards addressing these infrastructure gaps."}