{"UUID":"4a4e2060-38cb-4ced-9788-3c4e305a3c3c","URL":"http://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse.html","ArchiveURL":"","Title":"1990 AT\u0026T Long Distance Network Collapse","StartTime":"1990-01-15T14:25:00Z","EndTime":"1990-01-15T23:30:00Z","Categories":["cascading-failure"],"Keywords":["at\u0026t","long distance network","4ess switches","software bug","c programming","cascading failure","signaling system no. 7","1990"],"Company":"AT\u0026T","Product":"long-distance network","SourcePublishedAt":"0001-01-01T00:00:00Z","SourceFetchedAt":"2026-05-04T17:47:42.280922Z","Summary":"A faulty line of C code introduced a race condition that eventually collapsed the phone network. After a switch performed a routine reset, the rapid-fire backlog of resumption messages triggered the race, causing further resets that retriggered the problem. \"The problem repeated iteratively throughout the 114 switches in the network, blocking over 50 million calls in the nine hours it took to stabilize the system.\" From 1990.","Description":"On Monday, January 15th, 1990, at 2:25 PM, AT\u0026T's long-distance network began experiencing a widespread malfunction. Network managers observed an alarming number of red warning signals, indicating a rapidly spreading issue across their computer-operated switching centers. For nine hours, almost 50% of calls placed through AT\u0026T failed, blocking over 50 million calls. The system stabilized around 11:30 PM when network loads decreased.\n\nThe incident involved AT\u0026T's long-distance network, specifically its 114 computer-operated electronic switches (4ESS) linked by Common Channel Signaling System No. 7. The problem was triggered when a New York switch performed a routine self-test and maintenance reset. Upon recovery, it began distributing its backed-up signals, and another switch received two messages from the New York switch in very close succession, less than ten milliseconds apart.\n\nThe core issue was a one-line software defect introduced in early December during an upgrade intended to speed message processing. The bug resided in the recovery software of each of the 114 switches. The C code contained a `break` statement within an `if` clause nested inside a `switch` clause. When a second message arrived before the first was fully processed, the `break` caused the program to exit the enclosing `switch` statement prematurely, so crucial communications data was overwritten.\n\nError-correction software in the receiving switch detected the overwrite and initiated a reset. However, the same timing condition (two closely spaced messages) caused the backup processors to shut down as well. As each recovering switch began routing its backlogged calls, it propagated the cycle of close-timed messages and subsequent shutdowns throughout the entire network of 114 switches. The cascading failure blocked over 50 million calls and cost AT\u0026T more than $60 million in unconnected calls, with significant additional losses for businesses relying on the network.\n\nWhile the article doesn't detail the immediate fix, it highlights lessons learned. The incident demonstrated how a simple, obscure software error could bring down a highly reliable system, especially one designed with self-healing features. It suggested that more structured programming languages and stricter compilers could have made such a defect more obvious, and that a more fault-tolerant hardware and software design, able to handle minor problems without shutting down, might have reduced the impact."}