Sun Microsystems Enterprise server cache memory flaw
Sun · Enterprise server line
A mysterious glitch began affecting Sun Microsystems’ high-end Enterprise server line starting in late 1999, causing servers to crash unexpectedly. This issue impacted major corporate clients such as America Online, Ebay, and Verisign, leading to significant downtime and operational costs. Sun initially struggled to identify the problem and, for months, required customers seeking repairs to sign non-disclosure agreements, which led to accusations of a cover-up.
In May 2000, Sun engineers finally identified the root cause: faulty S-RAM cache memory chips supplied by an unnamed vendor. These chips were susceptible to disruption by stray radiation, such as alpha particles or cosmic rays, causing bit-level errors (a 1 turning into a 0 or vice versa). When the server detected these memory errors, it would shut down and reboot. A critical contributing factor was Sun’s omission of Error-Correcting Code (ECC) protection in these server models, a feature commonly used by competitors like Hewlett-Packard and IBM to detect and correct such memory errors.
The customer impact was severe, with some clients reporting millions of dollars in diagnostic and repair costs. Verisign, for example, experienced a two-hour outage and subsequently moved important systems to IBM Unix servers. The incident damaged Sun’s reputation for product quality and reliability, and its handling of the issue drew criticism from industry analysts and publications.
Sun’s remediation efforts included developing new “mirrored” cache modules to replace the defective ones, which began shipping in November 2000 and were installed free of charge for affected customers. The company also committed to including ECC protection in its next generation of high-end servers, the UltraSparc III, slated for release in mid-2001. Sun expected to resolve the issue for all affected customers within a few months of the article’s publication.