{"UUID":"82216044-3155-403c-9ccc-83dbfcf0e0a5","URL":"https://status.aws.amazon.com/s3-20080720.html","ArchiveURL":"https://web.archive.org/web/20220403060108if_/https://status.aws.amazon.com/s3-20080720.html","Title":"Amazon S3 Availability Event: July 20, 2008","StartTime":"2008-07-20T15:40:00Z","EndTime":"2008-07-20T23:58:00Z","Categories":["cloud"],"Keywords":["s3","amazon","aws","availability","gossip protocol","message corruption","datacenter","us","eu","amazon s3","outage"],"Company":"Amazon","Product":"Amazon S3","SourcePublishedAt":"0001-01-01T00:00:00Z","SourceFetchedAt":"2026-05-04T18:15:29.996631Z","Summary":"Message corruption caused the distributed server state function to overwhelm resources on the S3 request processing fleet.","Description":"On July 20, 2008, at 8:40am PDT, Amazon S3 experienced a significant availability event, with error rates quickly climbing across all datacenters. By 8:50am PDT, error rates were significantly elevated, and very few customer requests were completing successfully. Engineers were engaged by 8:55am PDT, and by 9:41am PDT, it was determined that servers within Amazon S3 were having difficulty communicating with each other.\n\nThe core issue was that Amazon S3 servers, which use a gossip protocol to spread server state information, were spending almost all their time gossiping and failing while doing so. This prevented the system from successfully processing customer requests. To resolve this, at 10:32am PDT, S3 teams decided to shut down all server-to-server communication, clear the system's state, and reactivate request processing components. This shutdown was complete by 11:05am PDT.\n\nInternal communication was restored by 2:20pm PDT, and request processing components were reactivated concurrently in the US and EU. The EU location returned to normal by 3:10pm PDT, and the US location by 4:58pm PDT.\n\nThe root cause was identified as message corruption. A handful of internal state messages had a single bit corrupted, making the system state information incorrect. Unlike customer object data, which uses MD5 checksums, there was no protection in place to detect corruption of this internal state information, allowing it to spread throughout the system and cause widespread communication failures.\n\nAs remediation, Amazon S3 deployed changes to significantly reduce system restoration time and modified how it gossips about failed servers to prevent similar behavior. Additional monitoring and alarming for gossip rates and failures were implemented. Crucially, checksums are being added to proactively detect and reject corrupted system state messages, enhancing the system's resilience against such internal data corruption."}