Reliability lessons from the 2025 Microsoft Azure Front Door outage

On October 29th, 2025, Azure Front Door suffered an outage that impacted Microsoft services on a global level, including Microsoft 365, Outlook, Xbox Live, Copilot, and more. It also affected Microsoft Azure, meaning companies like Costco, Starbucks, and Alaska Airlines ran into issues for both customer-facing and internal systems.

The root of the issue was a misconfiguration in the data plane for Azure Front Door and the Azure Content Delivery Network. While the Microsoft team responded within 7 minutes of customer impact, and less than 15 minutes after the initial issue showed up, it still took 7 hours for full mitigation and recovery.

So what reliability lessons can you take from this outage? And how can you make sure your company isn’t impacted by outages like this in the future?

Customers will blame you for outages

All three of the major cloud providers (AWS, Google Cloud, and Microsoft) regularly have high uptimes of 99.7% or greater, but even they sometimes experience high-profile outages. But here’s the thing: your customers aren’t going to care where the outage started. If someone went to Starbucks on October 29th and they couldn’t get their afternoon caffeine fix, they’d blame Starbucks, not Azure.

Because ultimately, reliability and uptime are your responsibility as the engineers and teams behind your systems.

That sounds harsh, but it’s actually a good thing because it means you can take actions right now to lessen (or even prevent) the impact from outages in the future.

Verify redundancy and failover

One of the biggest issues with this outage was how quickly it spread across Azure’s global network. It impacted a number of Azure services beyond just Azure Front Door and Azure Content Delivery Network, making hosts and other resources unavailable to customers.

Fortunately, resource outages like this can be addressed with redundancy and failover systems. Microsoft implemented redundancy after the outage to make sure similar data issues won’t bring down their systems:

“We have migrated critical first-party infrastructure (including Azure Portal, Azure Communication Services, Marketplace, Linux Software Repository for Microsoft Products, Support ticket creation) into an active-active solution with fail away.” (Source)

These are common best practices for a reason, but remember the old adage: a backup’s not a backup unless you test it. You can’t really be sure that your failover will work until you see how your system responds when a resource is unavailable. Fortunately, reliability tests allow you to do that without there being an outage.

How to test redundancy with Gremlin

When a resource goes down, your application can’t tell whether the network cable was cut, the power source fried itself, or the entire data center was flooded. All it knows is that a resource was reachable, and now it’s suddenly not.

It’s this logic that allows Gremlin to safely and securely simulate resource outages. Gremlin’s redundancy tests cut off all network traffic to a specific resource, such as a host, zone, or DNS server. This allows you to make sure your systems respond the way they should, such as by failing over to redundant zones. Once you’ve verified the behavior, you can stop the test to quickly restore network traffic and roll back your application to the previous state without having to wait for things like a server to start up.
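
To make the expected behavior concrete, here is a minimal Python sketch of client-side failover and a stand-in for an unreachable zone. It does not use Gremlin or any real network calls. The names call_with_failover and make_fake_send and the zone labels are hypothetical, and it assumes your client sees an unreachable resource as a ConnectionError or TimeoutError.

```python
def call_with_failover(endpoints, send):
    """Try each endpoint in order and return (endpoint, response) from the first success.

    `send` is any callable that takes an endpoint and either returns a
    response or raises ConnectionError / TimeoutError (an assumption about
    how your client reports an unreachable resource).
    """
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, send(endpoint)
        except (ConnectionError, TimeoutError) as exc:
            # Record the failure and move on to the next redundant endpoint.
            errors.append((endpoint, exc))
    raise RuntimeError(f"all endpoints failed: {errors}")


def make_fake_send(unreachable):
    """Build a stand-in transport where the named endpoints drop all traffic."""
    def send(endpoint):
        if endpoint in unreachable:
            raise ConnectionError(f"{endpoint} unreachable")
        return f"ok from {endpoint}"
    return send


if __name__ == "__main__":
    # Hypothetical zones: zone-a is cut off, so traffic should land on zone-b.
    send = make_fake_send(unreachable={"zone-a"})
    print(call_with_failover(["zone-a", "zone-b"], send))
```

A unit test like this only checks the failover logic in isolation. It can't tell you whether real traffic actually shifts when a zone disappears, which is what a redundancy test against the running system verifies.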

These tests also allow you to verify how your system as a whole will react to a missing resource. Say you have two zones allocated, each running at 60% capacity. If zone A is suddenly unavailable, all traffic will shift over to zone B…which will now need to run at 120% capacity in order to handle the increased traffic. We’ve had many customers find out that failover worked correctly, but the other resource wasn’t able to scale, leading to decreased performance and a potential outage.
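
The arithmetic in that example is easy to check for your own layout. The sketch below is a rough model, not a capacity planner: it assumes every zone has the same capacity and that a failed zone’s traffic is spread evenly across the survivors. The function names and zone labels are hypothetical.

```python
def utilization_after_failover(zone_loads, failed_zone):
    """Return the utilization of each surviving zone after one zone fails.

    zone_loads maps zone name -> current utilization as a fraction of that
    zone's capacity (0.6 means 60%). Assumes equal-capacity zones and an
    even spread of the failed zone's traffic across the survivors.
    """
    survivors = {z: load for z, load in zone_loads.items() if z != failed_zone}
    if not survivors:
        raise ValueError("no surviving zones to absorb traffic")
    extra = zone_loads[failed_zone] / len(survivors)
    return {z: load + extra for z, load in survivors.items()}


def overloaded_zones(zone_loads, failed_zone, limit=1.0):
    """List surviving zones that would exceed `limit` (1.0 = 100% capacity)."""
    after = utilization_after_failover(zone_loads, failed_zone)
    return sorted(z for z, load in after.items() if load > limit)


if __name__ == "__main__":
    loads = {"zone-a": 0.6, "zone-b": 0.6}
    print(utilization_after_failover(loads, "zone-a"))
    print(overloaded_zones(loads, "zone-a"))
```

With two zones at 60%, the survivor lands at 120%, matching the example above. Add a third zone at the same load and each survivor lands at roughly 90%, which is why the number of zones and their headroom matter as much as the failover itself.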

By testing these scenarios early, you’ll know how your system will behave if a resource becomes unavailable.

Map and test dependencies

But what if you’re using a key Microsoft service like Azure Content Delivery Network or Microsoft 365 as part of your architecture? These function as dependencies for your application. Modern complex architectures are massive networks of dependencies, and each dependency will usually have its own dependencies.
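
To see how quickly dependencies of dependencies add up, here is a small Python sketch that walks a dependency map and returns everything a service relies on, directly or indirectly. The service names and the graph are made up for illustration, and the function is a plain breadth-first traversal, not how any particular tool discovers dependencies.

```python
from collections import deque


def transitive_dependencies(graph, service):
    """Return every dependency reachable from `service`, direct or indirect.

    `graph` maps a service name to the list of services it calls.
    """
    seen = set()
    queue = deque(graph.get(service, ()))
    while queue:
        dep = queue.popleft()
        if dep in seen or dep == service:
            continue  # Skip repeats and cycles back to the starting service.
        seen.add(dep)
        queue.extend(graph.get(dep, ()))
    return sorted(seen)


# Hypothetical example: checkout calls two services directly,
# but ends up relying on five.
EXAMPLE_GRAPH = {
    "checkout": ["payments-api", "cdn"],
    "payments-api": ["auth", "database"],
    "auth": ["identity-provider"],
}

if __name__ == "__main__":
    print(transitive_dependencies(EXAMPLE_GRAPH, "checkout"))
```

In this made-up graph, an outage at identity-provider can reach checkout even though checkout never calls it directly.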

Being ready for dependency outages really comes down to two key pieces:

Visibility into all of your dependencies
Verification of what happens if they’re unavailable

When you know the dependencies of your application or microservices, you can then test them to make sure your system responds as expected when those dependencies are unavailable. But don’t limit your testing to just outages. You’ll also want to know how your system responds to degraded performance or increased latency.

Hypothetically, say a similar issue happens with Azure in the future, but it doesn’t cause an outage because they’ve taken actions and built redundancy to prevent full outages. It could still cause lowered performance and increased latency, which could, in turn, cause your application to time out.
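
One common way to handle a slow dependency is to give each call a time budget and serve a fallback when the budget runs out. The Python sketch below shows that idea using only the standard library. The helper names, the delays, and the "cached response" fallback are all hypothetical, and a real service would more likely use the timeout settings built into its HTTP or database client.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn() and return its result, or `fallback` if it takes longer than timeout_s."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback
    finally:
        # Don't block on the slow call; the worker thread finishes in the background.
        executor.shutdown(wait=False)


def slow_dependency(delay_s):
    """Build a stand-in for a dependency that answers after delay_s seconds."""
    def call():
        time.sleep(delay_s)
        return "fresh response"
    return call


if __name__ == "__main__":
    # Normal latency: the real response arrives within the budget.
    print(call_with_timeout(slow_dependency(0.01), 0.5, "cached response"))
    # Added latency: the budget is exceeded, so the fallback is served.
    print(call_with_timeout(slow_dependency(0.5), 0.05, "cached response"))
```

A latency test against the real dependency then tells you whether the budget you picked is realistic and whether the fallback path actually works under load.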

Fully testing your dependencies makes sure you’re prepared and know how your system is going to respond.

How to map and test dependencies with Gremlin

Whenever you set up a service in Gremlin, it will automatically use Dependency Discovery to map the service’s network dependencies. Don’t be surprised if it turns up dependencies you didn’t know about! With modern complex systems, it can be hard to know every single dependency communicating with your service, which is why it’s so important to map them.

Once mapped, you can run a variety of dependency tests to see how your service responds when its dependencies are slow or unavailable. Start with the failure tests (which drop all network traffic to a specific dependency) to test an outage like Azure Front Door going down. Similarly, you can test a partial outage by running a latency test to delay traffic by a specific number of milliseconds.

And because we’ve all seen what happens when a TLS certificate expires, don’t forget to run the Certificate Expiry test to make sure there aren’t any certificates that expire within the next 30 days.
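
If you want a rough idea of what a check like this involves, the following Python sketch flags certificates that expire within a window, using only the standard library. It is an illustration, not Gremlin’s implementation. The function names are hypothetical, and fetch_not_after needs network access to a host you choose.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone


def expires_within(not_after, now, window=timedelta(days=30)):
    """Return True if a certificate's notAfter date falls within `window` of `now`.

    `not_after` uses the format Python's ssl module reports,
    for example "Jan  5 09:34:43 2018 GMT". `now` must be timezone-aware.
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return expires - now <= window


def fetch_not_after(host, port=443, timeout_s=5.0):
    """Fetch the notAfter field of the certificate a host presents.

    With the default context, an already-expired certificate fails the TLS
    handshake and raises an ssl.SSLError instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    # Offline example with a fixed date, so the result does not depend on today.
    now = datetime(2017, 12, 20, tzinfo=timezone.utc)
    print(expires_within("Jan  5 09:34:43 2018 GMT", now))
```

To check a live endpoint, pass fetch_not_after("your-host.example") and datetime.now(timezone.utc) to expires_within, where the hostname is a placeholder for one of your own dependencies.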

Know your weak spots to reduce risk and speed recovery

The most important part of reliability tests is gaining visibility into the reliability risks that could cause outages or incidents with your systems. Sometimes these can be simple fixes, such as changing timeout values or autoscaling configurations. Issues like these can be addressed quickly and easily as part of your standard sprints with minimal effort.

But sometimes they can be more involved, like spinning up an entire redundant database system at substantial cost. Situations like this open up a larger conversation with the organization, which may be willing to accept the risk rather than spend the money.

Either way, you and your teams need to be informed of the risks. That way your teams can fix what makes sense, and have a plan to respond to risks they’ve decided to take on. Because the ultimate goal is that when a cloud provider outage does happen, your system either responds correctly and stays up, or your team knows exactly what to do and responds quickly to minimize the impact.

