{"UUID":"fde04641-ac3a-499a-887e-fb5d26184c09","URL":"https://aws.amazon.com/message/41926/","ArchiveURL":"","Title":"Amazon S3 US-EAST-1 outage of February 2017","StartTime":"2017-02-28T17:37:00Z","EndTime":"2017-02-28T21:54:00Z","Categories":["automation","cloud","config-change"],"Keywords":["s3","aws","us-east-1","outage","human error","index subsystem","placement subsystem","server removal"],"Company":"Amazon","Product":"Amazon S3","SourcePublishedAt":"0001-01-01T00:00:00Z","SourceFetchedAt":"2026-05-04T17:55:43.984736Z","Summary":"Human error. On February 28th 2017 9:37AM PST, the Amazon S3 team was debugging a minor issue. Despite using an established playbook, one of the commands intending to remove a small number of servers was issued with a typo, inadvertently causing a larger set of servers to be removed. These servers supported critical S3 systems. As a result, dependent systems required a full restart to correctly operate, and the system underwent widespread outages for US-EAST-1 (Northern Virginia) until final resolution at 1:54PM PST. Since Amazon's own services such as EC2 and EBS rely on S3 as well, it caused a vast cascading failure which affected hundreds of companies.","Description":"On February 28, 2017, at 9:37 AM PST, an incident began in the Amazon S3 US-EAST-1 region. An S3 team member, while debugging a billing system issue, inadvertently removed a larger set of servers than intended. This action impacted critical S3 subsystems, leading to service disruption. Recovery efforts saw the index subsystem begin servicing GET, LIST, and DELETE requests by 12:26 PM PST, fully recovering by 1:18 PM PST. The placement subsystem, crucial for PUT requests, completed its recovery by 1:54 PM PST, at which point S3 was operating normally.\n\nThe incident was triggered by human error: an authorized S3 team member executed a command with an incorrect input, leading to the removal of a significant portion of server capacity. The affected servers supported the S3 index subsystem, which manages object metadata and location, and the placement subsystem, which handles new storage allocation. The tool used allowed too much capacity to be removed too quickly.\n\nThe disruption rendered S3 unable to service GET, LIST, PUT, and DELETE requests. This had a cascading effect on other AWS services in the US-EAST-1 region that rely on S3, including the S3 console, new EC2 instance launches, EBS volumes dependent on S3 snapshots, and AWS Lambda. Communication via the AWS Service Health Dashboard was also impaired due to its dependency on S3.\n\nIn response, Amazon modified the problematic tool to ensure slower capacity removal and added safeguards to prevent operations that would take any subsystem below its minimum required capacity. An audit of other operational tools is underway. The S3 team is also reprioritizing and accelerating plans to further partition the index subsystem into smaller cells to improve recovery times. Additionally, the AWS Service Health Dashboard administration console has been reconfigured to run across multiple AWS regions to enhance its resilience during future events."}