1990 AT&T Long Distance Network Collapse

AT&T · long-distance network

1990-01-15 · cascading-failure

On Monday, January 15th, 1990, at 2:25 PM, AT&T’s long-distance network began experiencing a widespread malfunction. Network managers observed an alarming number of red warning signals, indicating a rapidly spreading issue across their computer-operated switching centers. For nine hours, almost 50% of calls placed through AT&T failed, blocking over 50 million calls. The system stabilized around 11:30 PM when network loads decreased.

The incident involved AT&T’s long-distance network, specifically its 114 computer-operated electronic switches (4ESS) linked by Common Channel Signaling System No. 7. The trouble began when a New York switch ran a routine self-test, detected a fault, and reset itself. Upon recovery, it began sending out its backlog of signaling messages, and a neighboring switch received two of those messages in very close succession, less than ten milliseconds apart.

The core issue was a one-line software defect introduced in early December during an upgrade meant to speed up message processing. The bug lived in the recovery software of each of the 114 switches. The C code contained a break statement within an if clause nested inside a switch statement; in C, such a break exits the entire switch, not just the if. When a second message arrived before the first was fully processed, the program exited the switch prematurely, skipping the remaining clean-up code, and crucial call-handling data was overwritten.
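
A minimal sketch of that defect pattern, in C. The message types, fields, and function names below are illustrative assumptions, not AT&T’s actual source; the point is that a break written to exit an if clause instead exits the enclosing switch, silently skipping the bookkeeping that follows the if/else:

    #include <stdio.h>

    enum msg_type { MSG_RECOVERY, MSG_OTHER };

    struct state {
        int in_flight_saved;   /* must be set before new data arrives */
        int call_data;
    };

    static void handle_message(struct state *s, enum msg_type type, int busy)
    {
        switch (type) {
        case MSG_RECOVERY:
            if (busy) {
                /* a second message arrived < 10 ms after the first */
                break;  /* BUG: meant to leave the if, leaves the switch */
            } else {
                s->call_data = 0;       /* normal recovery path */
            }
            s->in_flight_saved = 1;     /* skipped when the break fires */
            break;
        case MSG_OTHER:
            break;
        }
    }

    int main(void)
    {
        struct state s = { 0, 42 };
        handle_message(&s, MSG_RECOVERY, 1);  /* closely spaced second message */
        if (!s.in_flight_saved)
            printf("call data left unprotected; the overwrite triggers a reset\n");
        return 0;
    }

Run as written, the program prints the warning: the busy path skips the line that protects in-flight call data, which is exactly the kind of overwrite the error-correction software then detected.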

Error-correction software in the receiving switch detected the overwrite and initiated a reset. However, the same closely spaced messages hit the backup processors and shut them down as well. As each recovering switch began routing its backlogged calls, it sent out its own bursts of close-timed messages, propagating the cycle of faults and shutdowns throughout the entire network of 114 switches. The cascading failure blocked over 50 million calls and cost AT&T more than $60 million in unconnected calls, with significant additional losses for businesses relying on the network.
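
That feedback loop can be illustrated with a toy model. This is not the real CCS7 protocol; the ring topology, fan-out, and round count are invented for illustration. Each resetting switch, once it recovers, floods its neighbors with closely spaced messages, tripping the same defect downstream:

    #include <stdio.h>
    #include <string.h>

    #define N_SWITCHES 114
    #define NEIGHBORS 3   /* illustrative fan-out, not the real topology */
    #define ROUNDS 8

    int main(void)
    {
        int resetting[N_SWITCHES] = { 0 };
        resetting[0] = 1;   /* the New York switch resets first */

        for (int round = 1; round <= ROUNDS; round++) {
            int next[N_SWITCHES] = { 0 };
            for (int i = 0; i < N_SWITCHES; i++) {
                if (!resetting[i])
                    continue;
                /* on recovery, the switch floods its backlog; the
                 * closely spaced pairs trip the defect downstream */
                for (int k = 1; k <= NEIGHBORS; k++)
                    next[(i + k) % N_SWITCHES] = 1;
            }
            memcpy(resetting, next, sizeof next);
            int down = 0;
            for (int i = 0; i < N_SWITCHES; i++)
                down += resetting[i];
            printf("round %d: %d switches resetting\n", round, down);
        }
        return 0;
    }

Each round draws in new switches while re-tripping ones that just recovered, which is why the real network did not stabilize until call volume, and with it the message rate, dropped in the evening.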

While the article doesn’t detail the immediate fix, it highlights the lessons learned. The incident demonstrated how a simple, obscure software error could bring down a highly reliable system, especially one designed with self-healing features. It suggests that more structured programming languages and stricter compilers could make such defects more obvious, and that more fault-tolerant hardware and software, able to handle minor problems without shutting down, might have reduced the impact.

Keywords

at&t · long distance network · 4ess switches · software bug · c programming · cascading failure · signaling system no. 7 · 1990