{"UUID":"3aac3a45-0815-432e-8057-f8a02132cd36","URL":"https://stackstatus.net/post/156407746074/outage-postmortem-january-24-2017","ArchiveURL":"https://web.archive.org/web/20170130231315if_/https://stackstatus.net/post/156407746074/outage-postmortem-january-24-2017","Title":"Stack Exchange SQL Server bugcheck outage January 2017","StartTime":"2017-01-24T17:53:00Z","EndTime":"2017-01-24T18:10:00Z","Categories":null,"Keywords":["sql server","bugcheck","read-only","outage","failover","ny-sql02","stack exchange","database"],"Company":"Stack Exchange","Product":"sql server","SourcePublishedAt":"2017-01-26T18:45:55-05:00","SourceFetchedAt":"2026-05-04T17:46:43.43678Z","Summary":"The primary SQL Server triggered a bugcheck on the SQL Server process, causing the Stack Exchange sites to go into read-only mode and, eventually, a complete outage.","Description":"On January 24, 2017, starting at 17:53 UTC, the Stack Exchange network experienced system degradation, entering a read-only state for approximately 5 minutes. This was followed by a complete site outage that lasted for 12 minutes.\n\nThe incident was triggered when the primary SQL Server, identified as NY-SQL02, initiated a bugcheck on its SQL Server process. This event initially forced the SQL Server into a read-only state.\n\nThe root cause involved a combination of the SQL Server bugcheck and a critical bug in the application-level failover logic. Although the system was designed to switch to standby SQL servers in read-only mode during such events, the failover mechanism was disabled due to this bug, leading to the complete network outage instead of a graceful degradation. The underlying cause of the SQL Server bugcheck is currently unknown, but logs suggest a potential issue with a bad DIMM.\n\nCustomer impact included the Stack Exchange network being inaccessible for 12 minutes after an initial 5-minute period of read-only access. 
Approximately 3.5 seconds of data may have been lost due to uncommitted transactions being rolled back.\n\nRemediation involved restarting the SQL service on NY-SQL02, which brought the network back online in read-only mode. After a sanity check, sites were restored to read-write functionality. NY-SQL02 has since been taken out of production for thorough testing, including memory diagnostics. Additionally, the SQL Server cluster was updated to SQL Server 2016 SP1 CU1."}