Amazon Web Services outage of 2017

Name	Amazon Web Services outage of 2017
Date	February 28, 2017
Location	Amazon Web Services US-East-1
Cause	Human error and cascading failures
Outcome	Service disruptions across major websites and services

Contents

Background
Timeline of outage
Causes and technical analysis
Impact and affected services
Response and mitigation
Aftermath and policy changes

The Amazon Web Services outage of 2017 was a major disruption on February 28, 2017, that affected a broad set of internet services hosted in the US-East-1 region operated by Amazon Web Services (AWS). The outage interrupted platforms and applications across entertainment, retail, media, and communications, prompting responses from companies including Netflix, Airbnb, Slack Technologies, Trello, and The Washington Post. The incident highlighted the dependence of high-profile services such as Giphy, Quora, Fitbit, and Alexa on shared cloud infrastructure.

Background

The outage occurred during a period of rapid cloud adoption by organizations such as NASA, Comcast, Capital One, Expedia Group, and Spotify. Amazon Web Services had grown to serve enterprises including Netflix, Adobe Systems, and Twitch through regions such as US-East-1, which hosted resources for LinkedIn, Pinterest, and Uber. Earlier outages at cloud providers including Google Cloud Platform and Microsoft Azure had influenced architecture discussions at firms like Dropbox, Salesforce, and Box. Industry observers from institutions such as MIT and Stanford University had documented single-region risk for services used by organizations like The New York Times and Hulu.

Timeline of outage

On February 28, 2017, routine operational work in US-East-1 triggered errors that propagated through control planes relied upon by companies such as Airbnb, Quora, Trello, and Fitbit. Within minutes, content delivery networks and APIs used by Slack Technologies, Giphy, Mapbox, and Vimeo began failing. By midday, mainstream services including The Washington Post, Business Insider, Zillow, and HBO reported degraded performance or outages. The event lasted several hours. Restoration was coordinated among Amazon Web Services engineers, customers like Netflix and Expedia Group, and monitoring firms such as Dyn and Akamai Technologies, and full functionality returned later that day.

Causes and technical analysis

Amazon later reported that the immediate trigger was human error during routine operational work: according to its summary, a mistyped command issued while debugging the S3 billing system removed more server capacity than intended from subsystems supporting Simple Storage Service (S3) and related control-plane services used by customers including Heroku and Zendesk. The failure cascaded through services like Elastic Load Balancing, EC2, and EBS, affecting downstream applications such as Trello and Giphy. The interaction of metadata services, automated scaling, and dependency chains resembled failure modes analyzed in research from Carnegie Mellon University and the University of California, Berkeley. Third-party analyses referenced architectural patterns employed by Netflix (including chaos engineering from its Chaos Monkey initiatives) to illustrate how single-region dependency can amplify faults across services like Spotify and Airbnb.
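
The trigger described above, a single command removing too much capacity, is the kind of error a simple guardrail can limit. The following is a minimal sketch under stated assumptions, not a description of AWS's tooling: the function plan_removal, the exception CapacityGuardError, and the parameters min_required and max_batch are hypothetical names chosen for illustration. It rejects any request that would drop a subsystem below an assumed minimum host count and otherwise releases hosts only in small batches.

```python
class CapacityGuardError(Exception):
    """Raised when a removal request would breach the capacity floor."""


def plan_removal(active_hosts, requested, min_required, max_batch):
    """Return the hosts that may be removed in this step.

    active_hosts: hosts currently serving traffic.
    requested:    hosts the operator asked to remove.
    min_required: smallest host count the subsystem needs (assumed known).
    max_batch:    most hosts that may be removed per step (assumed policy).
    """
    active = set(active_hosts)
    # Ignore duplicates and hosts that are not active, preserving order.
    targets = [h for h in dict.fromkeys(requested) if h in active]

    # Refuse the whole request if it would leave too little capacity.
    remaining = len(active) - len(targets)
    if remaining < min_required:
        raise CapacityGuardError(
            f"removing {len(targets)} hosts leaves {remaining}, "
            f"below the minimum of {min_required}"
        )

    # Otherwise release only a small batch so mistakes surface early.
    return targets[:max_batch]


if __name__ == "__main__":
    hosts = [f"host-{i}" for i in range(10)]  # illustrative fleet
    print(plan_removal(hosts, ["host-0", "host-1", "host-2"],
                       min_required=6, max_batch=2))
```

Validating the full request before releasing any batch means a mistyped, oversized input fails immediately rather than after part of the capacity is already gone.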

Impact and affected services

The outage impacted thousands of companies and millions of users across platforms including Netflix, Airbnb, Spotify, Giphy, Trello, Quora, Fitbit, Slack Technologies, The Washington Post, and Hulu. E-commerce and retail experiences for customers of Zappos and Etsy were disrupted, while developer platforms such as Heroku and CircleCI reported provisioning and deployment failures. Media outlets including BuzzFeed, Vox, and The Verge covered user-facing failures, and enterprise clients like Capital One and Expedia Group examined their business continuity plans. The outage also affected advertising platforms and analytics tools used by The New York Times and AdRoll.

Response and mitigation

Amazon issued technical summaries and worked with customers including Netflix, Airbnb, Slack Technologies, and Quora to restore services, invoking incident response practices similar to playbooks used by Google and Microsoft. Some customers activated multi-region failover strategies advocated by firms such as GitHub and Dropbox; others relied on content delivery networks from Akamai Technologies and Cloudflare to reduce impact. Discussions at O’Reilly Media events and whitepapers from Gartner and Forrester Research emphasized redundancy patterns used by Netflix and by enterprise architects at IBM and Oracle.
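
A multi-region failover strategy of the kind mentioned above can be sketched, in simplified form, as a read that tries each region in priority order and returns the first success. This is an illustrative sketch only: read_with_failover and AllRegionsFailed are hypothetical names, and each region is represented by a caller-supplied fetch function rather than a real cloud SDK call.

```python
from typing import Callable, Dict, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when no configured region could serve the request."""


def read_with_failover(
    key: str,
    regions: Sequence[Tuple[str, Callable[[str], bytes]]],
) -> Tuple[str, bytes]:
    """Try each (region_name, fetch) pair in order; return the first success.

    The fetch callables are assumptions standing in for real storage clients.
    """
    errors: Dict[str, Exception] = {}
    for name, fetch in regions:
        try:
            return name, fetch(key)
        except Exception as exc:  # record the failure and try the next region
            errors[name] = exc
    raise AllRegionsFailed(errors)


if __name__ == "__main__":
    def primary(key: str) -> bytes:
        # Simulates an unavailable primary region.
        raise ConnectionError("region unavailable")

    def secondary(key: str) -> bytes:
        # Simulates a healthy replica in another region.
        return b"payload:" + key.encode()

    region, data = read_with_failover(
        "logo.png", [("us-east-1", primary), ("us-west-2", secondary)]
    )
    print(region, data)
```

A read-path fallback like this only helps if the data has already been replicated to the secondary region; writes and consistency between regions need separate handling.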

Aftermath and policy changes

In the aftermath, companies including Amazon Web Services, Netflix, Airbnb, and Spotify reviewed their architecture and disaster-recovery strategies, promoting multi-region deployments and stronger dependency mapping. AWS updated documentation and tooling used by enterprises such as Capital One and Expedia Group and expanded guidance echoing resilience work from Netflix OSS and research from the University of Cambridge and Imperial College London. The outage influenced procurement and risk conversations at institutional customers like NASA and Comcast, and spurred regulatory and governance interest among stakeholders including the Federal Communications Commission, large enterprises, and industry bodies such as the IEEE. Technical communities at ACM and USENIX later cited the event in case studies about cloud reliability.
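
Dependency mapping of the kind described above can be illustrated with a small audit that reports, for each service, which single-region components it relies on directly or transitively. This is a sketch under assumptions: the service names, region lists, and the function single_region_exposure are hypothetical and do not reflect any company's actual inventory or tooling.

```python
def single_region_exposure(deployments, depends_on):
    """Map each service to the single-region components it relies on.

    deployments: service name -> list of regions it runs in.
    depends_on:  service name -> list of services it calls.
    A service appears in the result if it, or anything it transitively
    depends on, is deployed in exactly one region.
    """
    pinned = {name for name, regions in deployments.items()
              if len(set(regions)) == 1}

    def closure(service):
        # Iterative depth-first walk over the dependency edges.
        seen = {service}
        stack = [service]
        while stack:
            current = stack.pop()
            for dep in depends_on.get(current, []):
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen

    report = {}
    for service in deployments:
        risky = sorted(closure(service) & pinned)
        if risky:
            report[service] = risky
    return report


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    deployments = {
        "web": ["us-east-1", "us-west-2"],
        "assets": ["us-east-1"],
        "auth": ["us-east-1", "eu-west-1"],
        "billing": ["us-east-1", "us-west-2"],
    }
    depends_on = {"web": ["assets", "auth"], "billing": ["auth"]}
    print(single_region_exposure(deployments, depends_on))
```

In this example the multi-region "web" service is still flagged, because it depends on "assets", which runs in only one region.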
