
A massive cloud outage stemming from Amazon Web Services’ key US-EAST-1 region, its hub in northern Virginia, near the US capital, caused widespread disruptions of websites and platforms around the world on Monday morning. Amazon’s main ecommerce platform and other properties, including Ring doorbells and the Alexa smart assistant, suffered interruptions and outages throughout the morning, as did Meta’s communication platform WhatsApp, OpenAI’s ChatGPT, PayPal’s Venmo payment platform, multiple web services from Epic Games, multiple British government sites, and many others.
The outages stemmed from Amazon’s DynamoDB database application programming interfaces in US-EAST-1, and AWS said in status updates that the problem was specifically related to DNS resolution issues. The “domain name system” is a foundational internet service that essentially acts as an automatic phonebook lookup to translate domain names like www.wired.com into numeric server IP addresses so web browsers show users the right content. DNS resolution issues occur when DNS servers aren’t accurately connecting these dots and, to keep with the phonebook analogy, are providing the wrong numbers for a given name, or vice versa.
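To make the phonebook analogy concrete, here is a minimal Python sketch of a name lookup. It is an illustration only, not a description of how AWS or any particular service resolves names. The function name `lookup` and its injectable `resolver` parameter are assumptions made for this example; the underlying call, `socket.getaddrinfo`, is the standard library's interface to the operating system's resolver.

```python
import socket


def lookup(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses for a hostname.

    Returns an empty list when the name cannot be resolved, which is
    roughly what a client sees during a DNS resolution failure.
    `resolver` is a hypothetical hook so the lookup can be swapped out.
    """
    try:
        # Each record is (family, type, proto, canonname, sockaddr);
        # the IP address is the first element of sockaddr.
        records = resolver(hostname, None)
    except socket.gaierror:
        # The "phonebook" had no usable entry for this name.
        return []
    return sorted({record[4][0] for record in records})
```

On a networked machine, calling `lookup("www.wired.com")` would return whatever addresses the local resolver currently reports. An empty list means the name could not be translated at all, so software depending on that name has no server to connect to.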
“Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1,” AWS wrote in status updates on Monday. Shortly after, the company added: “If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches.”
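The advice to flush DNS caches makes sense because resolvers typically remember answers for a period of time, so a bad or empty answer can linger after the underlying problem is fixed. The toy cache below sketches that behavior under simplifying assumptions. The class `DnsCache`, its `resolve` callback, and the five-minute lifetime are hypothetical and do not reflect AWS's systems or any real resolver's implementation.

```python
import time


class DnsCache:
    """Toy time-limited cache in front of a name-resolution function."""

    def __init__(self, resolve, ttl_seconds=60.0, clock=time.monotonic):
        self._resolve = resolve      # hypothetical callback: hostname -> list of IPs
        self._ttl = ttl_seconds      # how long an answer is reused
        self._clock = clock          # injectable clock, to keep the sketch testable
        self._entries = {}           # hostname -> (expires_at, addresses)

    def get(self, hostname):
        now = self._clock()
        entry = self._entries.get(hostname)
        if entry is not None and entry[0] > now:
            # Still within its lifetime: reuse the stored answer, even if stale.
            return entry[1]
        addresses = self._resolve(hostname)
        self._entries[hostname] = (now + self._ttl, addresses)
        return addresses

    def flush(self):
        """Forget every stored answer so the next lookup asks again."""
        self._entries.clear()


if __name__ == "__main__":
    host = "dynamodb.us-east-1.amazonaws.com"
    answers = {host: []}  # simulated broken resolution: no addresses
    cache = DnsCache(lambda name: answers[name], ttl_seconds=300)
    print(cache.get(host))            # the failed answer gets cached
    answers[host] = ["192.0.2.1"]     # upstream is fixed (documentation-range IP)
    print(cache.get(host))            # the cache still serves the stale answer
    cache.flush()
    print(cache.get(host))            # a fresh lookup sees the fix
```

In this sketch the client keeps receiving the old, empty answer until the entry expires or the cache is flushed, which is why a flush can speed recovery once the upstream records are correct again.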
An AWS spokesperson did not immediately respond when asked for details about the nature of the failure. DNS resolution issues can be malicious—known as DNS hijacking—but there is no indication that Monday’s AWS outages were nefarious.
“When the system couldn’t correctly resolve which server to connect to, cascading failures took down services across the internet,” says Davi Ottenheimer, a longtime security operations and compliance manager and a vice president at the data infrastructure company Inrupt. “Today’s AWS outage is a classic availability problem, and we need to start seeing it more as data integrity failure.”
Problems began around 3 am ET. By 5:22 am, AWS had applied “initial mitigations” that were starting to take effect. At 6:35 am, Amazon said that it had fully addressed the underlying technical issues but that “some services will have a backlog of work to work through, which may take additional time to fully process.”
AWS has suffered other large-scale outages, including a major incident in 2023. Reliance on central cloud services from giants like AWS, Microsoft Azure, and Google Cloud Services has, in many ways, improved cybersecurity and stability around the world by creating a baseline of guardrails and best practices for all customers. But this standardization comes with major trade-offs, because the platforms become a single point of failure for large swaths of critical services.
“Failures increasingly trace to integrity,” Ottenheimer says. “Corrupted data, failed validation or, in this case, broken name resolution that poisoned every downstream dependency. Until we better understand and protect integrity, our total focus on uptime is an illusion.”