June 18, 2020 - 7 min read

Chaos Engineering and Windows: Mitigating common Windows failure scenarios

Microsoft Windows is a popular operating system for many enterprise applications, such as Microsoft SQL Server clusters and Microsoft Exchange Servers. About 30% of the world’s web application hosting systems are running Windows, making it an important part of every enterprise’s plans to prevent outages and enhance reliability.

Chaos Engineering is the science of performing intentional experimentation on a system by injecting precise and measured amounts of harm. By observing how it responds, we can improve the system’s resiliency. The concept was born from Netflix, who created and released their open source Chaos Monkey and Simian Army tools when moving from a private data center to the Amazon public cloud (AWS). These tools originally worked by randomly shutting down various servers, but the practice has evolved greatly since then. Today’s fault injection practices are scientific, methodical, and safe, even in production environments.

Traditional QA falls short of fully testing the reliability of today’s production systems. Chaos Engineering is the only way to know with certainty how our systems will react as a whole when a part of the system fails. This lets software engineering and site reliability engineering (SRE) find and fix problems with our systems before those problems can cause issues for our users, while also teaching us more about how these systems behave.

This article explains why Chaos Engineering on Windows specifically is important. It then describes three potential failure scenarios that Chaos Engineering can help you find and mitigate, and shows how to test the effectiveness of your reliability work. Each of these could be run during a GameDay, where an engineering team sets aside a couple of hours to carefully plan some chaos testing of their system, records results, and uses them to prioritize software development work.

Why is Chaos Engineering important on Windows systems?

Windows has several built-in fault tolerance and high availability features for production systems, but simply turning those on does not guarantee reliability, especially in today’s complex systems. You have to test. Only then can you be certain that everything is functioning as designed and well enough to keep customers from being impacted when something goes wrong.

As an added bonus, the process of testing also helps your team gain confidence in their skills alongside the confidence gained in the reliability of the system. This is the case when you use Chaos Engineering to run a FireDrill. A FireDrill is a planned outage simulation where part of your team runs a chaos experiment, monitors it, and halts it if things go awry. Meanwhile, the other part of the team knows nothing about the experiment design and is tasked with solving the problem, just like in a real incident.

For specific applications, Windows has a larger ecosystem and better support options, but that comes with greater vulnerability. For the most part, the existing Chaos Engineering tools were designed for use with Linux systems. That’s fair, as Linux has most of the market share in the server world. However, that never meant that Windows DevOps teams don’t need the same level of testing options. On the contrary, the lack of testing options has kept Windows systems in a position of greater risk.

Testing in this way on Windows systems has been difficult before now. Organizations had to write their own testing methods with all the expense and engineering time that implies. This means that Windows systems have been left untested or inadequately tested. We could learn from a failure event and make changes to how our systems operate, but learning that way is unpredictable and expensive.

The following sections provide a high-level description of some common Windows failure scenarios, along with how to test whether your system will stay reliable and available to customers when one of them happens. As we describe these three scenarios, you will begin to see how beneficial Chaos Engineering will be to your implementation.

An example of Chaos Engineering with Windows Server

Let’s say you have a product or offering that consists of one or more applications installed on Windows Server 2016. You have enabled Failover Clustering (WSFC) to provide high availability because it delegates a clustered role across multiple servers. If one server fails, another immediately takes its place running the application. Great!

What happens when a server doesn’t fail, but suffers from partial resource exhaustion, slowing down all processes? How does the system respond then? How can you test the infrastructure that Windows Server runs on to determine what will happen?

One way to find out is to use a Chaos Engineering attack that simulates this condition by using up a specified percentage of specific system resources and monitoring to see what happens. If you start small by limiting the blast radius, which designates which part or parts of the system are permitted to be impacted, and by limiting the magnitude, which is the intensity of the attack, you can gain insights into actual system behavior while keeping the system safe from large, customer-impacting problems. Learning how your system actually functions by testing it is always superior to assuming you know, because the design specifications only tell you how it is supposed to work.
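To make blast radius and magnitude concrete, here is a minimal Python sketch of how an experiment definition might capture both limits. The ResourceAttack class, the pick_targets helper, and the host names are hypothetical and exist only for illustration; they are not the schema or API of any particular chaos tool.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceAttack:
    """Hypothetical description of a resource attack (not a real tool's schema)."""
    resource: str          # e.g. "cpu", "memory", "disk"
    magnitude_pct: int     # how much of the resource to consume on each target
    blast_radius_pct: int  # share of eligible hosts that may be affected
    duration_s: int        # the attack should stop automatically after this long

    def __post_init__(self):
        # Reject definitions that fall outside sensible limits.
        for name in ("magnitude_pct", "blast_radius_pct"):
            value = getattr(self, name)
            if not 1 <= value <= 100:
                raise ValueError(f"{name} must be between 1 and 100, got {value}")
        if self.duration_s <= 0:
            raise ValueError("duration_s must be positive")


def pick_targets(hosts, attack, seed=None):
    """Choose which hosts the attack may touch, honoring the blast radius."""
    if not hosts:
        raise ValueError("no eligible hosts")
    count = max(1, len(hosts) * attack.blast_radius_pct // 100)
    return sorted(random.Random(seed).sample(hosts, count))


cluster = ["node-1", "node-2", "node-3", "node-4"]  # made-up host names
first_run = ResourceAttack("cpu", magnitude_pct=20, blast_radius_pct=25, duration_s=60)
targets = pick_targets(cluster, first_run, seed=1)
print(f"Consume {first_run.magnitude_pct}% {first_run.resource} on {targets} "
      f"for {first_run.duration_s}s")
```

Starting with one node at 20% and widening either limit only after a clean run keeps each step small enough to halt safely.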

Similar attacks that exhaust disk resources or even shut down servers will help you understand whether your overall system behaves as you expect. As a bonus, if it does not, you will learn more about how your system actually works and that will help you decide what work to prioritize as you strive for reliability.

An example of Chaos Engineering with Microsoft SQL Server

We’ve tested some of the ways infrastructure might fail to learn about how our Windows Server ecosystem will respond. Now, let’s think about testing an application installed on top of Windows Server.

Microsoft SQL Server is one of the applications that can leverage the failover clustering (WSFC) mentioned in the previous example. It also offers its own high availability features, such as Always On availability groups (AG), which provide replication and fault tolerance for databases. You can replicate your SQL Server database across up to nine nodes. This set is your availability group, where one node is the primary and the others mirror it. With AG, if the primary node fails, perhaps due to an attempted DDoS, the system should immediately fail over to a healthy secondary node, keeping your database available and your system running.

That’s the design. That’s what should happen in our software system. What Chaos Engineering provides is the means of confirming that this function works. There are many types of chaos attacks that can be used to test whether the availability groups actually do reroute traffic to a secondary database in the event that the primary fails. Here’s an outline of one way to test this.

Our hypothesis is that our application will continue to have the ability to query the database and access information even in the event of a primary node failure. One way we know that a node has failed is that communication with the node becomes impossible. We will simulate this with a blackhole attack, which we will configure to drop all network communication to and from the primary SQL Server node.

If we run this attack while monitoring our system and using our application to try to access data from the database, we will learn precisely what happens when communication with the primary database node fails. Does a secondary node take over? How long does it take for the secondary node to respond with the requested data? Is the application impacted with delayed response? Does that cause any problems in the application or for the end user?
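One way to put numbers on those questions is to probe the database at a fixed interval while the attack runs and record how long requests keep failing. The sketch below is a minimal illustration in Python. The measure_outage function is hypothetical, and the probe is assumed to be any callable you supply that raises an exception when a query does not succeed. The run at the end uses a simulated probe and a fake clock rather than a real SQL Server connection.

```python
import time


def measure_outage(probe, attempts, interval_s=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Call probe() repeatedly; return (failed_attempts, longest_outage_seconds).

    probe is a zero-argument callable that returns normally when the database
    answered and raises an exception when it did not.
    """
    failed = 0
    longest = 0.0
    outage_started = None
    for _ in range(attempts):
        now = clock()
        try:
            probe()
        except Exception:
            failed += 1
            if outage_started is None:
                outage_started = now  # first failure of a new outage
        else:
            if outage_started is not None:
                longest = max(longest, now - outage_started)
                outage_started = None
        sleep(interval_s)
    if outage_started is not None:  # still failing when probing stopped
        longest = max(longest, clock() - outage_started)
    return failed, longest


class FakeClock:
    """Deterministic clock so the simulated run does not actually wait."""
    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


calls = {"n": 0}


def simulated_probe():
    # Probes 3 to 5 fail, standing in for the window between the blackhole
    # starting and a secondary node taking over.
    calls["n"] += 1
    if 3 <= calls["n"] <= 5:
        raise ConnectionError("primary unreachable")


fake = FakeClock()
failed, longest = measure_outage(simulated_probe, attempts=8,
                                 clock=fake, sleep=fake.sleep)
print(f"{failed} failed probes, longest outage {longest:.1f}s")
```

In a real experiment, the probe would run a lightweight query through the same connection path your application uses, so the measured outage reflects what users would see.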

Testing and learning the answer to these and similar questions will help your engineers make the system better and more resilient to internal issues. The goal is not a perfect system, because those don’t exist. The goal is a reliable system with built-in safeguards to handle problems like these, which are expected from time to time for reasons we can’t control. Since we can’t control the problem source, we choose to control how the system responds. So far, so good, but testing is the only way to be confident our designs work.

An example of Chaos Engineering with Microsoft Exchange Server

Many of us host company email using Microsoft Exchange. This can be done on-premises or in a cloud like Azure. Email is essential to business operations and must be reliable. One significant challenge when deploying Microsoft Exchange is choosing the right capacity for your server.

There are calculators that exist to help you determine the needed capacity. They are complex, but useful. However, they are not perfect. To help, Microsoft built a feature into Exchange called the back pressure system. When the resources available to Exchange are nearing capacity due to an event like traffic spikes, the back pressure system will slow down or stop various Exchange features to try to keep the system up and running while usage is high.

We can design a Chaos Engineering experiment to test whether the back pressure system works as expected when the system approaches its capacity. One common problem with Exchange is that various log files and user inboxes can strain disk resources. We can create chaos experiments and attacks that help us configure back pressure thresholds. At the same time, we can test our scaling and automated node capacity increases before we hit physical limits.

The goal is to find the sweet spot between two extremes: overspending on unnecessary resources because our limits are not configured well, and underspending in a way that creates a bad experience for all of our email users. We want the system to be reliably available and consistently right-sized. We set metrics like SLOs and SLAs to help us find that sweet spot. Chaos Engineering helps us use our error budgets in more effective ways. Once we find and fix the problems we have, we can then use our time to experiment and improve by trying out new technologies or design patterns. Chaos Engineering helps us make our systems better and better.

One chaos experiment we can run is to deliberately trigger the back pressure system by consuming disk space on the Exchange server. Then, we will monitor whether the response conforms to expected results. A positive result would be that the back pressure system triggers and begins limiting features.
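As a rough illustration of the disk-consumption step, the following Python sketch writes a filler file of a chosen size and always removes it afterward. The fill_disk function, the file name, and the sizes are assumptions made for this example. It runs against a temporary directory instead of an Exchange volume, and it says nothing about the thresholds at which back pressure actually triggers on your servers.

```python
import os
import shutil
import tempfile


def fill_disk(directory, size_bytes, min_free_bytes):
    """Create a filler file of size_bytes in directory and return its path.

    Refuses to run if the write would leave less than min_free_bytes free,
    which acts as the experiment's safety limit.
    """
    free = shutil.disk_usage(directory).free
    if free - size_bytes < min_free_bytes:
        raise RuntimeError("aborting: attack would breach the safety limit")
    path = os.path.join(directory, "chaos_filler.bin")  # hypothetical file name
    chunk = b"\0" * (1024 * 1024)
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            f.write(chunk[:remaining])  # the last write may be a partial chunk
            remaining -= min(remaining, len(chunk))
    return path


with tempfile.TemporaryDirectory() as scratch:  # stand-in for the target volume
    filler = fill_disk(scratch, size_bytes=2 * 1024 * 1024,
                       min_free_bytes=1024 * 1024)
    try:
        written = os.path.getsize(filler)
        print(f"wrote {written} bytes; observe the system, then roll back")
    finally:
        os.remove(filler)  # always release the space when the experiment ends
```

The safety limit and the unconditional cleanup are the important parts: the experiment should consume only as much space as planned and should give it back even if something goes wrong while you are observing.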

In a follow-up experiment, we could have our monitoring tools trigger the creation of additional node capacity and scale our system resources up before the Exchange Server back pressure system begins limiting what users can do.
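The decision that monitoring has to make in this follow-up experiment can be sketched as a simple threshold check. The next_action function and both threshold values below are hypothetical. Real values would come from your own back pressure configuration and from what earlier experiments taught you.

```python
def next_action(disk_used_pct, scale_out_at, back_pressure_at):
    """Decide what monitoring should do for a given disk usage reading.

    scale_out_at must sit below back_pressure_at so that new capacity is
    requested before Exchange starts limiting features.
    """
    if not 0 <= scale_out_at < back_pressure_at <= 100:
        raise ValueError("scale_out_at must be below back_pressure_at")
    if disk_used_pct >= back_pressure_at:
        return "too late: back pressure expected"
    if disk_used_pct >= scale_out_at:
        return "scale out"
    return "no action"


# Illustrative readings taken while a disk attack slowly raises usage.
for used in (55.0, 75.0, 93.0):
    print(used, next_action(used, scale_out_at=70.0, back_pressure_at=90.0))
```

The experiment succeeds if the "scale out" branch fires, and the new capacity arrives, before usage ever reaches the back pressure threshold.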

Conclusion

Each of the high-level examples in this article is expanded upon with greater implementation detail in our white paper, The First 3 Chaos Experiments to Run on Windows. Engineers will appreciate the specific guidance given.

It’s not enough to turn on some (quality and appreciated) features in our Windows servers and hope they work as expected. The only way we can have confidence in the reliability of our systems is to test them. Specifically, we must use a new form of testing.

QA isn’t enough. Traditional forms of testing are useful and should continue to be used, but they don’t help us learn how our systems work holistically in the real world: how they will react to unexpected component failures, networking issues, or unreliable dependencies that we don’t control.

Chaos Engineering is the path to resilient systems and Gremlin has given us a way to do it on Windows. Get started today with the Gremlin Free testing tool and see how easy we make it for you to build reliability into your Windows Server architecture.