When AWS Goes Dark

The October 2025 AWS outage reminded the world how fragile even the most trusted cloud platforms can be. Here’s what happened, why it matters, and how higroup designs resilient systems that stay online when the cloud doesn’t.

The day AWS went dark

On October 20, 2025, AWS experienced a large-scale outage that disrupted services across the globe. Websites, mobile apps, and entire infrastructures went offline within minutes. The incident originated in AWS’s US-EAST-1 region, highlighting how dependent modern businesses have become on centralized cloud systems and why resilience in architecture is no longer optional.

A picture of AWS logotype

What happened and when?

In the early hours of the morning, AWS users began reporting connection failures and service disruptions. The issue originated in one of Amazon’s largest and most interlinked data centers. Within minutes, it affected global traffic routes, bringing down platforms like Snapchat, Fortnite, and even parts of Amazon’s own services. AWS confirmed that increased error rates and latency were caused by a fault in its load balancing and health monitoring subsystems, which led to cascading failures across multiple internal services.

A picture showcasing AWS outages.

Why did it happen?

At the core of the incident was a failure in AWS’s internal network load-balancing system, which distributes incoming traffic across servers to keep applications running smoothly. When that subsystem malfunctioned, it triggered a domino effect - health checks began failing, routing tables became unstable, and dependent services started timing out. Because many AWS services rely on shared internal components, the disruption quickly spread across the platform. Even applications hosted in other regions experienced intermittent slowdowns, since global DNS and authentication systems were impacted.

How higroup stayed online

While the outage disrupted many global platforms, higroup’s systems remained fully operational. Our infrastructure majorly runs on AWS’s EU region, which was unaffected by the incident. By design, we separate environments by geography and purpose — ensuring that a disruption in one region never cascades into another. This regional isolation is paired with redundant architecture: automated backups, independent service clusters, and failover mechanisms that keep core systems responsive even when external dependencies slow down. It’s part of our broader commitment to resilient engineering — designing for failure, so clients stay online when the cloud doesn’t.

A map showing AWS regions of their datacenters.

How to prevent issues like this?

Cloud outages are inevitable, but downtime doesn’t have to be. The key lies in architecture designed for resilience, not convenience. Systems should anticipate regional or service-level disruptions and be able to redistribute loads automatically when a failure occurs. At higroup, we design infrastructures that leverage multi-region redundancy, auto-scaling groups, and load balancers that reroute traffic to healthy zones in real time. For mission-critical platforms, we also apply disaster recovery environments across multiple cloud providers AWS, Azure and GCP.

Lessons for software teams

True resilience is built on preparation, not reaction. Design systems with geographic and infrastructure redundancy, ensuring multiple regions or availability zones can take over when one fails. Establish and rehearse clear failover procedures so they can be executed instantly when needed. Combine this with strong monitoring not just for uptime but for the performance of every service and dependency. Knowing when and where failures occur allows teams to respond fast, limit disruption, and maintain user trust. In short, resilience isn’t just about avoiding failure - it’s about being ready for it.

Picture showing AWS datacenter.

Beyond the outage: Building resilience that lasts

Cloud outages will happen again - it’s not a question of if, but when. The difference lies in how prepared teams are to respond. Each incident, like the AWS downtime, reminds us that resilience is a mindset, not just a configuration. At higroup, we see reliability as a continuous process — one built on smart architecture, observability, and preparation. From multi-region setups to proactive monitoring and tested failover plans, our goal is to make downtime invisible to users. Because in the end, the real test of great engineering isn’t avoiding failure - it’s how quickly you recover from it.

Related post

Handpicked Reads to Deepen Your Understanding

  • Development
  • Dino Starcic
  • 26/05/2025

Advantages of Golang over PHP

Go and PHP are both powerful languages, but they are designed with different philosophies and use cases in mind. Read more to learn the differences.

Read article
  • Business
  • Dino Starcic
  • 03/09/2025

The Right Way to Hire a Development Agency

Learn how to hire a development agency the right way. Discover best practices, outsourcing tips, and how higroup helps build digital products that last.

Read article

Do you have a specific idea in mind?

Share your vision, and we'll explore how we can make it happen together.

Frequently asked questions