
Testing doesn't stop at staging

In a perfect world, software releases would ship bug-free. Developers would write perfect code, all tests would pass without issue, operations would deploy flawless builds to production, and customers would never encounter defects.

Of course, we don’t live in a perfect world, and the risk of releasing buggy software is a reality for any organization. Production bugs have led to many high-profile problems.

The challenge is that organizations adopting DevOps practices are asking their teams to shorten release cycles and push new features and changes to production faster than before. At the same time, microservices and cloud-native architectures are increasing the size and complexity of applications. When velocity and complexity increase, so does the chance of defects making their way into production. To address this risk, teams need to extend their testing practices to include testing in production, a process known as “shifting right.”

Why is Testing in Production Important?

Traditionally, testing was restricted to isolated environments created for the sole purpose of testing. These environments are designed to replicate production as closely as possible in order to provide testers with a fairly accurate test environment, but without actually putting production systems (or customers) at risk.

Testing in pre-production is still an important part of the overall testing strategy. Unit tests, smoke tests, and regression tests are essential in enforcing quality standards before new builds ever hit production. But pre-production testing alone isn’t enough to catch all problems, and there are several reasons why:

Production is a unique environment with unique problems.

There is no substitute for production. You can architect test environments that use the same architecture and even the same data as production (e.g. using infrastructure as code), but they will never fully replicate production. Environments have unique configurations and emergent behaviors that affect how applications behave and are difficult to reproduce. These can result from software updates, ad-hoc fixes, or even just routine operations resulting from normal usage.

Production is where your customers are.

If a defect slips by your pre-production tests (the ones we just established are inaccurate), then your customers will be the ones who find it. A defect found in testing might reflect poorly on development, but a defect found by a customer reflects poorly on the entire organization. Testing in production gives testers an extra opportunity to find defects before they can negatively impact the customer experience. To quote Martin Fowler, “we’d rather fix bad data or system state than disappoint a potential [user].”

Risks of Testing in Production

A benefit to testing in controlled environments is that it lets testers safely execute intrusive tests such as stress, endurance, and disaster recovery tests. Running these same tests in production risks:

  • Impacting performance or stability in a way that harms the customer experience
  • Causing customer data to be exposed, modified, or lost
  • Skewing marketing analytics and operational metrics, such as user traffic or error rates
  • Causing non-compliance with regulations or standards

Testing in production therefore requires a more measured and thoughtful approach when compared to testing in a controlled environment. In the next section, we’ll look at several deployment strategies that support this approach.

Strategies for Testing in Production Safely

Many application deployment strategies are well-suited for production testing, as they allow tests to run on production infrastructure while containing the risk to a relatively small percentage of users. They also allow you to roll back changes in case of critical defects or outages.

Blue-Green Deployments

A blue-green deployment is a release strategy that involves running two identical production environments side-by-side. One environment (e.g. blue) hosts the live version of your application, while the other (e.g. green) hosts the new release. The green version remains idle and serves no user traffic until the team is ready to deploy it, at which point all user traffic is seamlessly shifted from blue to green, with customers experiencing zero interruptions or downtime.

The main benefit of a blue-green deployment is that it allows DevOps teams to validate changes in a production environment without putting users at risk. Any adverse changes can be rolled back by routing users to the previous version, and DevOps teams always have a proven and safe production environment to fall back to. The biggest problem is that until traffic is migrated over, the green environment handles zero user traffic, making it difficult to test how the application will behave once it goes live. Additionally, maintaining two independent production environments comes with added costs and operational overhead.
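The cutover and rollback described above can be sketched in a few lines of Python. This is a minimal illustration, not a production router: the `BlueGreenRouter` class, its method names, and the `health_check` callback are all hypothetical, and each environment is modeled as a plain function that takes a request and returns a response.

```python
class BlueGreenRouter:
    """Sends all traffic to one live environment; the other stays idle."""

    def __init__(self, blue, green):
        # Each environment is modeled as a callable: request -> response.
        self.environments = {"blue": blue, "green": green}
        self.live = "blue"

    @property
    def idle(self):
        """Name of the environment that is not serving user traffic."""
        return "green" if self.live == "blue" else "blue"

    def handle(self, request):
        """Serve a user request from whichever environment is live."""
        return self.environments[self.live](request)

    def cut_over(self, health_check):
        """Shift all traffic to the idle environment if it passes the check."""
        candidate = self.idle
        if not health_check(self.environments[candidate]):
            return False  # keep serving from the proven environment
        self.live = candidate
        return True

    def roll_back(self):
        """Route users back to the previous environment."""
        self.live = self.idle
```

Because the switch is a single assignment, rolling back is as cheap as cutting over. The sketch also shows the limitation noted above: until `cut_over` succeeds, the idle environment only ever sees the synthetic requests issued by the health check.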

Canary Deployments

In a canary deployment, new changes are deployed initially to a small subset of users before being gradually rolled out to all users. Where blue-green deployments consist of two separate production environments, canaries consist of a single production environment hosting two separate versions of an application. The stable version continues to receive most of your user traffic, while the canary receives a much smaller ratio.
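One common way to split traffic between the two versions is to hash a stable identifier, such as a user ID, into a fixed number of buckets. The sketch below assumes that approach; the function names `bucket` and `choose_version` and the 100-bucket scheme are illustrative choices, not something prescribed by any particular tool.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_version(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of users to the canary, the rest to stable."""
    return "canary" if bucket(user_id) < canary_percent else "stable"
```

Hashing instead of picking randomly on each request means a given user always lands on the same version, and raising `canary_percent` only adds users to the canary without moving existing canary users back to stable. Setting it to 0 removes the canary entirely, and setting it to 100 completes the rollout.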

The main benefits of a canary deployment are that it:

  • Runs on your actual production systems
  • Reduces the potential impact of defects to a small number of users
  • Lets you easily roll out new versions to all users once they are validated

As an example, Netflix uses canaries to run chaos experiments, load tests, and regression tests. One of these experiments might involve deploying a version of a service where an API has been deliberately disabled, routing a small percentage of users to the service, and measuring how the system responds to the increase in failed requests. Once the test is complete, the canary can be removed and those users rerouted to the stable version of the service. There is still a risk of a bad deployment, outage, or chaos experiment affecting customers, but that risk is confined to a small fraction of your production customers.

Dark Launches

In a dark launch, live user traffic is copied and sent to both your stable application version and the new version. While the stable version continues to respond to user requests, responses from the new version are discarded, effectively making it invisible to the user. Dark launches allow you to fully test the end-to-end functionality of a new release, as well as its performance under realistic load. Like canary deployments, dark launches can be gradually scaled up to handle increasing amounts of traffic over time. Once you’ve fully tested and vetted a new version, releasing it is simply a matter of returning responses from the new version and disabling responses from the original version. However, this does require running two versions of your application at once, which can quickly become expensive.
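The request mirroring described above might look like the following sketch. The `dark_launch` function and its parameters are hypothetical, and both versions are modeled as plain functions. For brevity the shadow call runs inline; a real system would more likely mirror traffic asynchronously so that the new version's latency never reaches the user.

```python
def dark_launch(request, stable, shadow, mismatches):
    """Serve the request from stable while silently exercising shadow.

    stable and shadow are callables (request -> response). Any difference
    or failure in the shadow version is recorded in the mismatches list
    for later analysis and never returned to the user.
    """
    response = stable(request)
    try:
        shadow_response = shadow(request)
        if shadow_response != response:
            mismatches.append((request, response, shadow_response))
    except Exception as error:
        # A crash in the new version must not affect the user's response.
        mismatches.append((request, response, error))
    return response
```

The user only ever sees the stable response, while the `mismatches` list accumulates evidence about how the new version behaves under real traffic. One caveat this sketch ignores: if the new version writes to shared state, such as a database, mirrored requests can have side effects that need to be isolated separately.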

Dark launches also provide an ideal environment for testing reliability through Chaos Engineering. Netflix demonstrated how canaries can be used to run chaos experiments safely, but with dark launches, we can run these same experiments on the same scale as our production systems with zero impact to customers. We can introduce failure at different points in our infrastructure, measure the impact on real-world user requests, and address weaknesses that directly affect our customers. We can uncover problems that wouldn’t have been exposed using functional or end-to-end testing in a controlled environment, and we can do so without impacting customers.


We don’t live in a world of perfect software, and that means searching for bugs wherever they might pop up. Testing in production provides significant value by helping DevOps teams better understand their applications and infrastructure, lower the risk of outages, and improve the customer experience. Using a staged deployment strategy like canary deployments or dark launches can ease teams from testing in controlled environments to testing in production, while allowing for more in-depth testing methods. Your systems will become more reliable, your production defect rates will drop, and your customers will be much happier.

© 2021 Gremlin Inc. San Jose, CA 95113