Testing doesn't stop at staging

Originally published April 27, 2020.

Imagine a perfect world where software releases ship bug-free. Developers write perfect code the first time, all tests pass without issues, operations teams effortlessly deploy builds to production, and customers never experience defects. Everyone's happy, and the Engineering team can focus exclusively on building and delivering features.

Of course, we don't live in a perfect world. Releasing buggy software is a reality for any organization, even those that invest time, money, and resources into testing. Production bugs have led to many high-profile problems, such as:

Flight disruptions caused by a corrupt database file in the FAA's Notice to Air Missions (NOTAM) system.
A network configuration change in Azure causing outages for Teams, Outlook, and other Azure services.
A June 2022 Cloudflare outage caused by a network configuration change intended to increase resilience.

Engineering teams are motivated to move quickly, shorten release cycles, and push new changes to production faster. This is especially true as more organizations adopt DevOps practices. At the same time, microservices and cloud-native applications are increasing the size and complexity of applications. When velocity and complexity increase, the chance of defects making their way into production increases. To address this risk, teams need to ramp up their testing practices to include testing in production, a process known as "shifting right."

Why is testing in production important?

Traditionally, engineering teams create dedicated, isolated environments specifically for testing. These environments are built to replicate production as closely as possible, so testers have a reasonably realistic test environment without putting production systems (or customers) at risk. Developers will always need separate, non-production test environments to enforce quality standards using unit, smoke, and regression testing. But pre-production testing alone isn't enough to catch every problem, for two reasons.

Production is a unique environment with unique problems.

There's no substitute for production. Teams can design test environments that use the same architecture, configuration, and even the same data as production (e.g., using infrastructure as code), but they can never fully replicate production. Different environments have unique configurations and emergent behaviors that affect how applications behave. These can result from software updates, ad-hoc fixes, or even routine operations resulting from normal usage. In any case, these unique attributes are hard to detect and even harder to reproduce.

Production is where your customers are.

If a defect slips by your pre-production tests, then your customers may be the ones who find it. A defect found by a QA engineer reflects poorly on development, but a defect found by a customer reflects poorly on the entire organization. Testing in production helps testers find defects unique to production before they can negatively impact the customer experience. To quote Martin Fowler, "we'd rather fix bad data or system state than disappoint a potential [user]."

Risks of Testing in Production

Testing in production does come with risks. Pre-production environments let testers safely execute intrusive tests, such as stress, endurance, and disaster recovery, in a controlled environment. Running these same tests in production risks:

Impacting performance or stability in a way that harms the customer experience.
Accidentally exposing, modifying, or losing customer data.
Skewing marketing analytics and operational metrics, such as user traffic or error rates.
Violating compliance with regulations, standards, or practices.

Because of these risks, testing in production requires a more measured and thoughtful approach. The following section looks at several deployment strategies that support this approach.

Strategies for Testing in Production Safely

Many application deployment strategies are well-suited for production testing, as they allow tests to run on production infrastructure while containing the risk to a relatively small percentage of users. They also allow you to roll back changes in case of critical defects or outages. This section presents some of the more common ones.

Blue-Green Deployments

A blue-green deployment is a release strategy that involves running two identical production environments side-by-side. One environment (the "blue" environment) hosts the live version of your application, while the other (the "green" environment) hosts the new version. Both versions run simultaneously, but the green version remains idle and serves no user traffic until the team is ready to deploy it. At this point, all user traffic is seamlessly redirected from blue to green, with customers experiencing zero interruptions or downtime.

The main benefit of a blue-green deployment is that it lets DevOps teams validate changes in a production environment without putting users at risk. Any adverse changes can be undone by routing users back to the previous version, and DevOps teams always have a proven and safe production environment to fall back to. The biggest problem is that until traffic is migrated over, the green environment handles zero user traffic, making it difficult to test how the application will behave once it goes live. Maintaining two independent production environments also comes with added costs and operational overhead.
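
The cutover and rollback described above can be sketched in a few lines of Python. This is a minimal illustration rather than a real load balancer: the `Environment` and `BlueGreenRouter` names are hypothetical, and the sketch assumes the switch is a single atomic change of the live target.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    name: str
    version: str


class BlueGreenRouter:
    """Hypothetical router that sends all traffic to exactly one environment."""

    def __init__(self, blue: Environment, green: Environment) -> None:
        self.blue = blue
        self.green = green
        self.live = blue  # blue serves users until the cutover

    @property
    def idle(self) -> Environment:
        # The environment that is currently receiving no user traffic.
        return self.green if self.live is self.blue else self.blue

    def route(self, request_id: str) -> str:
        # Every request goes to the live environment; the idle one gets none.
        return f"{request_id} -> {self.live.name} ({self.live.version})"

    def switch(self) -> None:
        # Cutover and rollback are the same operation: swap live and idle.
        self.live = self.idle


router = BlueGreenRouter(Environment("blue", "v1"), Environment("green", "v2"))
before = router.route("req-1")
router.switch()  # cut over to green
after = router.route("req-2")
router.switch()  # roll back to blue
rolled_back = router.route("req-3")
```

Because the idle environment never appears in `route`, the sketch also shows the limitation noted above: green sees no user traffic until the switch happens.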

Canary Deployments

In a canary deployment, new changes are deployed initially to a small subset of users before being gradually rolled out to all users. Where blue-green deployments consist of two separate production environments, canaries consist of a single production environment hosting two separate versions of an application. A small percentage (e.g., 2% or 10%) of real user traffic is routed from the stable version to the canary. Engineers can then increase or decrease this percentage as they validate the new version's reliability.
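
One way to split traffic by percentage is to hash each user ID into a fixed bucket, so the same user consistently sees the same version. The sketch below assumes that approach; the `bucket` and `choose_version` functions are hypothetical names, and real deployments usually do this weighting in a load balancer or service mesh rather than in application code.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_version(user_id: str, canary_percent: int) -> str:
    """Route users whose bucket falls below the threshold to the canary."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "canary" if bucket(user_id) < canary_percent else "stable"


users = [f"user-{i}" for i in range(10_000)]
at_2 = {u for u in users if choose_version(u, 2) == "canary"}
at_10 = {u for u in users if choose_version(u, 10) == "canary"}
```

With this scheme, raising the percentage from 2 to 10 only adds users to the canary group: everyone in `at_2` is also in `at_10`, so nobody flips back and forth between versions as the rollout widens.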

The main benefits of a canary deployment are that it:

Runs on your existing production systems.
Reduces the potential impact of defects to a small number of users.
Lets you easily redirect users to the new version once the changes are validated.

For example, Netflix uses canaries to run chaos experiments, load tests, and regression tests. One of these experiments might involve deploying a version of a service where an API has been deliberately disabled, routing a small percentage of users to the service, and measuring how the system responds to the increase in failed requests. Once the test is complete, the canary can be removed, and those users can be rerouted to the stable version of the service. There is still a risk of a bad deployment, outage, or customer impact, but because only a small percentage of users reach the canary, this risk is relatively small compared to releasing the change to all production customers.
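
The decision to widen or remove a canary can be framed as a comparison of error rates between the two versions. The following is a rough sketch under stated assumptions, not a description of Netflix's tooling: the `Metrics` class, the `canary_verdict` function, and both thresholds are hypothetical and would need tuning for a real service.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    stable: Metrics,
    canary: Metrics,
    max_increase: float = 0.01,  # assumed tolerance: one percentage point
    min_requests: int = 500,     # assumed minimum sample before deciding
) -> str:
    """Return 'wait', 'rollback', or 'promote' for the canary."""
    if canary.requests < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    if canary.error_rate - stable.error_rate > max_increase:
        return "rollback"  # canary fails noticeably more often than stable
    return "promote"


healthy = canary_verdict(Metrics(10_000, 50), Metrics(1_000, 6))
failing = canary_verdict(Metrics(10_000, 50), Metrics(1_000, 40))
too_early = canary_verdict(Metrics(10_000, 50), Metrics(100, 0))
```

A simple threshold like this ignores statistical significance and other signals such as latency, so it is best read as the shape of the check rather than a complete policy.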

Dark Launches

In a dark launch, live user traffic is copied and sent to both the stable and new versions of your application. The stable version continues to respond to user requests, while responses from the new version are dropped before they reach users. Dark launches let you thoroughly test the end-to-end functionality of a new release and its performance under a realistic load without impacting the customer experience. As with canary deployments, you can start by copying only a small percentage of traffic and scale up over time. Once you've thoroughly tested and vetted the new version, releasing it is simply a matter of enabling responses from the new version and disabling responses from the old version. However, this requires running two versions of your application simultaneously, which can quickly become expensive.
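
The traffic-copying behavior can be illustrated with a small proxy that always returns the stable response and only records what the new version did. This is a simplified sketch: `DarkLaunchProxy`, its handlers, and the `mirror_fraction` parameter are hypothetical, and the shadow call runs synchronously here, whereas a production setup would typically mirror requests asynchronously so the new version cannot slow down users.

```python
import random
from typing import Callable, List, Optional

Handler = Callable[[str], str]


class DarkLaunchProxy:
    """Hypothetical proxy that mirrors requests to a candidate version."""

    def __init__(self, stable: Handler, candidate: Handler,
                 mirror_fraction: float = 1.0,
                 seed: Optional[int] = None) -> None:
        self.stable = stable
        self.candidate = candidate
        self.mirror_fraction = mirror_fraction  # share of traffic to copy
        self.mismatches: List[str] = []
        self.candidate_errors: List[str] = []
        self._rng = random.Random(seed)

    def handle(self, request: str) -> str:
        response = self.stable(request)  # the only response users see
        if self._rng.random() < self.mirror_fraction:
            self._shadow(request, response)
        return response

    def _shadow(self, request: str, expected: str) -> None:
        try:
            shadow = self.candidate(request)
        except Exception as exc:
            # Candidate failures are recorded, never surfaced to the user.
            self.candidate_errors.append(f"{request}: {exc}")
            return
        if shadow != expected:
            self.mismatches.append(request)


def stable_handler(request: str) -> str:
    return request.upper()


def candidate_handler(request: str) -> str:
    if request == "boom":
        raise RuntimeError("candidate failed")
    return "???" if request == "diff" else request.upper()


proxy = DarkLaunchProxy(stable_handler, candidate_handler)
responses = [proxy.handle(r) for r in ["ok", "diff", "boom"]]
```

Users receive the stable responses for all three requests, while the proxy's `mismatches` and `candidate_errors` lists capture where the new version diverged or failed.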

Using production to test and improve reliability

Regardless of your chosen deployment model, it's always essential to run reliability tests before fully moving users to the new deployment. Netflix demonstrated how canaries provide a safe environment for running chaos experiments since only a small percentage of customers would be impacted if any experiments failed. Dark launches and blue-green deployments provide the same benefits, with the added advantage that a failed experiment doesn't impact customers at all.

Another benefit of dark launches and blue-green deployments is that we can run reliability tests on the same scale as our production systems without impacting customers. We can introduce faults at different points in our infrastructure, measure the impact on real-world user requests, and address weaknesses that directly impact the user experience. We can uncover problems that wouldn't have been exposed by functional or end-to-end testing in a controlled environment, and we can do so without affecting systems that are actively serving user requests.

Conclusion

We don't live in a world of perfect software, which means we need to search for bugs wherever they might pop up. Testing in production provides significant value by helping DevOps teams better understand their applications and infrastructure, lower the risk of outages, and improve the customer experience. Using a staged deployment strategy like canary deployments or dark launches can ease the transition from testing in controlled environments to testing in production while allowing for more in-depth testing methods. Your systems will become more reliable, your production defect rates will drop, and your customers will be much happier.
