Half the Internet Broken for 14 Hours - October 2025 AWS Outage
On the night of October 19, 2025, software engineers around the world split into two camps.
Half scrambled to fix cascading failures, working through the night as services went dark one by one. The other half sat idle, unable to work at all—their tools, deployments, and production systems unreachable.

For 14 hours and 32 minutes, the us-east-1 region suffered AWS’s longest outage in over a decade. DynamoDB went down first. Then EC2’s internal workflow manager collapsed under the recovery load. Then EC2 instances couldn’t start. Then the network manager buckled. Then load balancers started flapping.
The blast radius was staggering: Signal, Slack, Snapchat, Fortnite, Roblox, Venmo, Coinbase, Disney+, Duolingo. Banks in the UK. Government services. Over 100 AWS products affected.
The root cause? A race condition. A textbook distributed systems error that somehow made it into production at the world’s largest cloud provider.
The Architecture That Failed
DynamoDB uses Amazon Route 53 for internal service discovery. When load balancers scale, partitions move, or capacity changes, DNS records must be updated. AWS built a two-component system to handle this:
DNS Planner: Monitors load balancer health and generates DNS update plans containing weighted traffic assignments for regional endpoints.
DNS Enactors: Three independent processes (one per availability zone) that apply DNS plans to Route 53. The redundancy was intended to provide fault tolerance.
Each Enactor independently:
- Fetches new DNS plans from the Planner
- Checks if the plan is newer than the previously applied plan (a one-time check at the start)
- Applies the update to each endpoint via Route 53 API (can involve retries if blocked by another Enactor)
- Runs cleanup to remove plans that are older than the one just applied (the full loop is sketched below)
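To make the flow concrete, here is a minimal Python sketch of a single Enactor pass. The names and in-memory data structures are hypothetical stand-ins, not AWS code; three copies of this loop running concurrently against shared state is the setup for the race described next.

```python
from dataclasses import dataclass

@dataclass
class DnsPlan:
    generation: int                  # monotonically increasing plan version
    records: dict[str, list[str]]    # endpoint -> weighted IP addresses

# Hypothetical stand-ins for the Planner's plan feed and for Route 53.
available_plans: list[DnsPlan] = []
applied: dict[str, DnsPlan] = {}     # endpoint -> plan currently live in Route 53

def enactor_pass(endpoints: list[str]) -> None:
    """One pass of a DNS Enactor, mirroring the four steps listed above."""
    plan = max(available_plans, key=lambda p: p.generation)    # 1. fetch newest plan

    # 2. ONE-TIME staleness check, performed only at the start of the pass.
    newest_applied = max((p.generation for p in applied.values()), default=-1)
    if plan.generation <= newest_applied:
        return

    # 3. Apply endpoint by endpoint. Retries and throttling can stretch this
    #    loop out for a long time, but the check above is never repeated.
    for endpoint in endpoints:
        applied[endpoint] = plan

    # 4. Cleanup: delete every plan older than the one just applied.
    available_plans[:] = [p for p in available_plans
                          if p.generation >= plan.generation]
```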
On paper, this looks like reasonable redundancy. Three independent processes. Geographic distribution. If one Enactor fails, the others continue working.
In practice, AWS created a race condition that guaranteed eventual failure.
The TOCTOU Bug
The failure was a classic Time-of-Check-Time-of-Use (TOCTOU) vulnerability. The Enactors performed a staleness check at one point in time, then acted on that information later without re-validating.
Here’s the exact sequence that caused the outage, based on the official AWS post-mortem:
Three independent events converged to trigger the failure:
- Enactor A experienced unusual delays - According to AWS: “one DNS Enactor experienced unusually high delays needing to retry its update on several of the DNS endpoints.” While A was slowly working through endpoints, it fell far behind.
- The Planner continued generating plans - The DNS Planner kept running and “produced many newer generations of plans” during the window when Enactor A was delayed.
- Enactor B applied a much newer plan and started cleanup - A second Enactor picked up one of the newer plans (many generations ahead) and “rapidly progressed through all of the endpoints,” then invoked cleanup.
The critical race condition occurred when these events overlapped:
- Enactor B completed applying a newer plan, call it M+K, and started its cleanup
- Concurrently, Enactor A finally completed applying its much older plan M
- Enactor A's write overwrote the newer plan M+K with the older plan M
- Enactor B's cleanup saw that plan M, now live in Route 53, was "many generations older" than M+K
- Cleanup deleted plan M, removing all IP addresses for the regional endpoint (the sketch below replays this)
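To make the interleaving concrete, here is a small Python replay of the sequence above; the endpoint name is real, but the data structures, generation numbers, and IP addresses are invented stand-ins, not AWS internals.

```python
# Hypothetical replay of the race: generations M and M+K of a DNS plan, two
# Enactors, and a dict standing in for Route 53's records for one endpoint.
route53: dict[str, dict] = {
    "dynamodb.us-east-1": {"generation": 0, "ips": ["10.0.0.1"]},
}

def apply_plan(endpoint: str, generation: int, ips: list[str]) -> None:
    """Write a plan's records unconditionally -- no re-check at write time."""
    route53[endpoint] = {"generation": generation, "ips": ips}

def cleanup(endpoint: str, just_applied: int) -> None:
    """Delete whatever is older than the plan this Enactor believes is current."""
    if route53[endpoint]["generation"] < just_applied:
        route53[endpoint] = {"generation": None, "ips": []}   # records removed

M, K = 100, 20

# 1. Enactor B rapidly applies the much newer plan M+K.
apply_plan("dynamodb.us-east-1", M + K, ["10.0.9.9"])

# 2. The delayed Enactor A finally finishes applying old plan M, overwriting
#    the newer records (its staleness check happened long before this write).
apply_plan("dynamodb.us-east-1", M, ["10.0.0.2"])

# 3. Enactor B's cleanup now sees a plan "many generations older" than M+K
#    and deletes it, leaving the endpoint with no IP addresses at all.
cleanup("dynamodb.us-east-1", just_applied=M + K)

print(route53["dynamodb.us-east-1"])   # {'generation': None, 'ips': []}
```

The delayed writer never re-validates before writing, so the last write wins regardless of age, and the cleanup then removes the only records left.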
The root cause was the one-time staleness check. As AWS explained: “The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time due to the unusually high delays in Enactor processing.”
Why Recovery Failed
The deletion left the system in an unrecoverable state. As AWS explained: “because the active plan was deleted, the system was left in an inconsistent state that prevented subsequent plan updates from being applied by any DNS Enactors.”
The Enactors needed to compare their new plan against the previously applied plan. But the cleanup had deleted that plan reference. Without a baseline for comparison, no Enactor could pass validation, and none could write new records.
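In code terms, the gate each Enactor runs before writing had nothing left to compare against. A minimal illustration of that dead end, with hypothetical names:

```python
def can_apply(new_generation: int, previously_applied: int | None) -> bool:
    """The 'is this plan newer?' check every Enactor runs before writing."""
    if previously_applied is None:
        # Cleanup deleted the active plan, so there is no baseline: the
        # automation cannot prove the new plan is newer and refuses to act.
        raise RuntimeError("no previously applied plan to validate against")
    return new_generation > previously_applied
```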
Manual intervention was required. AWS engineers had to directly restore DNS records, bypassing the automation that had created the problem. This took approximately 2.5 hours (from 11:48 PM to 2:25 AM PDT).
The Cascade: Four Layers of Failure
DynamoDB’s DNS failure was just the trigger. The outage cascaded through four distinct layers, each with its own failure mode.
Layer 1: DynamoDB (0-3 hours)
With dynamodb.us-east-1.amazonaws.com resolving to nothing, all DynamoDB API calls failed. But DynamoDB isn’t just a customer-facing database—it’s foundational infrastructure for dozens of internal AWS services.
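For a sense of where the failure surfaces, here is a small Python sketch of the resolution step every SDK call depends on; the endpoint name is real, but the failing branch describes the outage, not current behavior.

```python
import socket

# Every DynamoDB API call bottoms out in: resolve the endpoint, then connect.
# During the outage the name had no records, so resolution itself failed and
# no request ever left the client. (Today the lookup succeeds.)
def resolve(endpoint: str) -> None:
    try:
        addrs = socket.getaddrinfo(endpoint, 443)
        print(f"{endpoint} -> {len(addrs)} address records")
    except socket.gaierror as exc:
        print(f"DNS resolution failed, every API call dies here: {exc}")

resolve("dynamodb.us-east-1.amazonaws.com")
```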
Layer 2: EC2 DropletWorkflow Manager (3-12 hours)
The most severe secondary failure occurred in EC2’s DropletWorkflow Manager (DWFM). This subsystem manages the physical servers (“droplets”) that host EC2 instances.
Normal operation:
- DWFM maintains over a million active leases
- Only a tiny fraction are broken at any time
- Lease state checks depend on DynamoDB
During the outage:
- DynamoDB unavailable for 3 hours
- Lease heartbeats couldn’t complete
- Leases began timing out
- Broken leases increased by three orders of magnitude
When DynamoDB recovered at 2:25 AM, DWFM attempted to re-establish all broken leases simultaneously. But the scale was too large: processing time exceeded lease timeout duration.
New timeouts accumulated faster than leases could be restored. The system entered congestive collapse—a feedback loop in which recovery attempts created new failures faster than they cleared existing ones.
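A toy Python model of that feedback loop, with every rate and count invented for illustration rather than taken from AWS's report: when new timeouts arrive faster than leases can be restored the backlog grows, and throttling the inflow below the restore rate, which is what the manual steps below accomplished, lets the same backlog drain.

```python
# Toy model of congestive collapse (all numbers invented for illustration).
# Each tick, up to `restore_rate` broken leases are re-established, while
# `timeout_rate` leases newly time out because they are still waiting.
def backlog_after(broken: int, restore_rate: int, timeout_rate: int, ticks: int) -> int:
    backlog = broken
    for _ in range(ticks):
        backlog -= min(backlog, restore_rate)   # leases restored this tick
        backlog += timeout_rate                 # leases that timed out meanwhile
    return backlog

# Inflow above the restore rate: recovery loses ground and the backlog grows.
print(backlog_after(broken=1_000_000, restore_rate=5_000, timeout_rate=8_000, ticks=600))
# Throttled inflow below the restore rate: the same backlog drains to a trickle.
print(backlog_after(broken=1_000_000, restore_rate=5_000, timeout_rate=1_000, ticks=600))
```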
Engineers had to manually:
- Throttle incoming work
- Selectively restart DWFM hosts to clear queued work
- Gradually restore leases
Full DWFM recovery took until 5:28 AM—3 additional hours after DynamoDB was restored.
Layer 3: Network Manager (5+ hours)
Even after EC2 instances could launch, they couldn’t connect to the network. The Network Manager propagates network configuration during instance creation and state transitions.
The 3-hour DynamoDB outage plus 3-hour DWFM recovery created a massive backlog of pending network state changes. Network Manager couldn’t process them fast enough:
- New instances launched without network connectivity
- Stale network configuration caused false health check results
- NLB endpoints flapped between healthy and unhealthy
Normal network propagation times didn’t resume until 10:36 AM—more than 8 hours after DynamoDB recovery.
Layer 4: NLB Health Check Instability (6+ hours)
The combination of delayed network propagation and instances coming online without proper configuration caused NLB health checks to oscillate:
- Instance launches successfully
- Health check runs before network config propagates
- Health check fails → instance removed from DNS
- Network config finally propagates
- Next health check succeeds → instance added back to DNS
- Repeat across thousands of instances (see the sketch below)
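A small Python sketch of that oscillation, using an invented health-result pattern and a generic damping rule (require several consecutive identical results before changing DNS membership), which is a standard mitigation rather than anything AWS described:

```python
# Toy model: while its network config is stale, an instance returns unstable
# health results; once propagation completes, it is steadily healthy.
# (The result pattern and thresholds below are invented for illustration.)
unstable = [True, False, True, False, False, True, False, True]
checks = unstable + [True] * 8          # stable once config has propagated

def dns_membership_changes(results: list[bool], consecutive_needed: int) -> int:
    """Count add/remove events, flipping state only after N identical results."""
    in_dns, last, streak, changes = True, None, 0, 0
    for healthy in results:
        streak = streak + 1 if healthy == last else 1
        last = healthy
        if streak >= consecutive_needed and in_dns != healthy:
            in_dns = healthy
            changes += 1
    return changes

print(dns_membership_changes(checks, consecutive_needed=1))  # 6: naive checker flaps
print(dns_membership_changes(checks, consecutive_needed=3))  # 0: damped checker rides it out
```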
This flapping behavior degraded the health check subsystem itself and triggered inappropriate availability zone failovers. Engineers disabled automatic AZ failover at 9:36 AM and didn’t re-enable it until 2:09 PM.
Timeline Summary
Total duration: ~14.5 hours (all times PDT)
- 11:48 PM: DynamoDB's regional DNS records are gone; API calls begin failing
- 2:25 AM: Engineers manually restore DNS; DynamoDB recovers
- 5:28 AM: DWFM finishes re-establishing droplet leases; EC2 instances can launch again
- 9:36 AM: Automatic NLB availability zone failover disabled to stop the flapping
- 10:36 AM: Network propagation times return to normal
- 2:09 PM: AZ failover re-enabled; the cascade is over
The Fundamental Issue
Here’s what AWS got wrong: they placed the responsibility for data integrity on the writers instead of the data store.
Three independent processes, each with full read-write-delete capabilities on the same data, each making independent decisions about what’s stale and what’s current. This is a recipe for exactly the failure that occurred.
The architectural principle being violated is simple: the entity that owns the data should enforce integrity constraints. Route 53 should have been the arbiter of which record is authoritative. Instead, each Enactor acted as an independent authority, and their “coordination” consisted of racing to see who could write last.
The irony is acute. AWS ran three Enactors specifically for redundancy. The redundancy is what caused the failure. A single Enactor would have been slower and less fault-tolerant, but it couldn’t have raced against itself.
This isn’t a novel observation. The “single writer” principle exists precisely because coordinating multiple writers on shared mutable state is one of the hardest problems in distributed systems. AWS’s architects presumably know this. They built a system that violated it anyway.
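As a hedged sketch of the alternative, not AWS's actual remediation: push the "only newer plans win" rule into the store itself as an atomic compare-and-set, so a delayed writer cannot clobber a newer plan no matter how late its write lands.

```python
import threading

class VersionedRecordStore:
    """Toy record store that enforces 'only newer plans win' at the store itself."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._records: dict[str, tuple[int, list[str]]] = {}  # endpoint -> (generation, ips)

    def apply_if_newer(self, endpoint: str, generation: int, ips: list[str]) -> bool:
        """Compare-and-set: the check and the write happen atomically."""
        with self._lock:
            current = self._records.get(endpoint)
            if current is not None and generation <= current[0]:
                return False          # stale write rejected -- no TOCTOU window
            self._records[endpoint] = (generation, ips)
            return True

store = VersionedRecordStore()
store.apply_if_newer("dynamodb.us-east-1", 120, ["10.0.9.9"])         # newer plan: accepted
late = store.apply_if_newer("dynamodb.us-east-1", 100, ["10.0.0.2"])  # delayed old plan
print(late)  # False -- the store, not the writer, arbitrates
```

The point is where the check lives: inside the same atomic operation as the write, rather than in each writer's memory minutes earlier.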
The Survivor
While Signal, Slack, Coinbase, and dozens of other well-known brands went down with AWS, one notable company stayed up: Netflix.
Their failover to backup regions took approximately 6-7 minutes. Users barely noticed.
How did they pull this off?
Netflix has spent 13+ years building resilience infrastructure. The Simian Army—their suite of chaos engineering tools—launched in 2011. Chaos Monkey randomly terminates production instances. Chaos Gorilla takes down entire availability zones. Chaos Kong simulates the failure of an entire AWS region. These tools have run continuously for over a decade, discovering failure modes and forcing Netflix’s engineers to build systems that survive them.
Netflix doesn’t just test failover. They practice failover. Regularly. In production. With paying customers. Their 6-7 minute recovery time isn’t luck or good architecture—it’s the result of failing over hundreds of times until the process became routine.
The cost of this capability is staggering.
Multi-region active-active deployment—where traffic can shift seamlessly between regions—roughly doubles infrastructure costs. You’re running two complete copies of everything. Netflix runs active-active. They’ve built custom tooling, trained specialized teams, and accepted the ongoing operational complexity of maintaining truly independent regional deployments.
The Resilience Tax
So should everyone follow Netflix’s example?
Not really.
Engineering decisions are never purely technical. More often than not, business and commercial factors weigh more heavily than the architectural elegance of a solution.
Redundancy isn’t free. Multi-region deployments, chaos engineering teams, active-active infrastructure—all of this comes with significant cost and complexity. For most companies, the price tag is steep.
Which leaves you with two options:
- Embrace redundancy. Accept the complexity and cost. Hope the investment pays off during infrastructure failures that might not happen for years, knowing that even if you had gone down with everyone else, nobody would have blamed you for it.
- Save the money. Sit back. Your financial reports will look better. When the next outage hits, you’ll be in good company with the rest of the internet.
Suppose there’s a button in front of you. Press it, and your product will never crash during a cloud outage again. The downside: your infrastructure costs double, and you need a dedicated team to maintain the complexity.
If you were the CEO of your own company, would you press it?