Resilience testing for Kubernetes clusters

How to use Fault Injection testing to verify resilience to known Kubernetes failures—and discover unknown reliability risks

Editor’s Note: Metrics and reporting are only one part of the Kubernetes Resiliency Framework. Find out more about the framework in The Ultimate Guide to Kubernetes High Availability.


Why use Fault Injection resilience testing for Kubernetes

The goal of resiliency testing is to provide as accurate a picture as possible of your current reliability posture. While many types of testing (e.g. load testing) are done as part of the QA process before deployment, resiliency testing is specifically designed to verify that your Kubernetes clusters function as expected under real-world conditions.

This is done by using Fault Injection testing. Fault Injection works by creating controlled failure in a computing component, such as a host, container, or service. By observing how your components respond to failure, you can take action to make your services more resilient. Fault Injection can be used in exploratory testing (e.g. Chaos Engineering experiments) to uncover new reliability risks and failure modes, or in standardized groups, known as test suites, to validate workload behavior.

When to test in your SDLC and which exact tests to run will vary depending on your individual organization’s standards and the maturity of your resilience practice. But there is a core set of resiliency tests that should be run for every Kubernetes deployment, as well as best practices to help determine when in your SDLC your teams should run resiliency tests.

Kubernetes exploratory resilience testing

Exploratory testing is used to better understand your systems and suss out the unknowns in how they respond to external pressures. Many of the experiments performed under the practice of Chaos Engineering make use of exploratory tests to find unknown points of potential failure.

To minimize the impact on your systems, exploratory tests should always be done in a controlled manner. While a trustworthy Fault Injection tool will contain safeguards like automatic rollback in case of problems, the injection of faults can potentially cause disruption when doing exploratory tests. Be sure to follow Chaos Engineering best practices like limiting the blast radius and carefully defining the boundaries of the experiment. Ideally, these tests should start with individual services, then expand across the organization as you become more confident in the results and impact of the test.

For example, a common type of exploratory test is making sure your Kubernetes deployments scale properly in response to high demand. You can set up a Horizontal Pod Autoscaling (HPA) rule on your deployment to increase the number of pods when CPU usage exceeds a certain percentage. Then, you can use a Fault Injection tool to apply CPU pressure directly to the deployment, while monitoring the number of pods.

If Kubernetes deploys an additional pod, then you know your system will scale properly under similar conditions in production. If not, tweak your HPA rules and repeat the test until the system behaves the way you expect. Then those HPA rules can become part of your resilience standards, and future tests will be used to validate that the rules are in place.
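The pass/fail condition of a test like this can be expressed as a simple check over the metrics you collect during the experiment. Below is a minimal Python sketch of that check; the `Sample` structure, the 80% CPU target, and the replica counts are illustrative assumptions, not part of any particular tool's API.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One observation taken while CPU pressure is being injected."""
    seconds: int        # time since the experiment started
    cpu_percent: float  # measured CPU usage of the deployment
    replicas: int       # pod count reported by Kubernetes

def hpa_scaled_up(samples: list[Sample], cpu_target: float, baseline_replicas: int) -> bool:
    """Return True if the deployment added pods after CPU usage crossed the HPA target."""
    breached = False
    for s in samples:
        if s.cpu_percent >= cpu_target:
            breached = True
        # Any post-breach sample with more replicas than baseline counts as a scale-up.
        if breached and s.replicas > baseline_replicas:
            return True
    return False

# Illustrative run: injected pressure pushes CPU past an 80% target and a third pod appears.
observed = [
    Sample(0, 35.0, 2),
    Sample(60, 85.0, 2),   # target breached, HPA reacting
    Sample(120, 88.0, 3),  # scale-up observed -> test passes
]
print(hpa_scaled_up(observed, cpu_target=80.0, baseline_replicas=2))  # True
```

In practice the samples would come from your monitoring stack; the point is that the expected behavior is encoded as an explicit, repeatable assertion rather than a manual observation.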

Kubernetes validation resilience testing

Once you have a standardized set of known failures and reliability risks for your Kubernetes clusters, you can test your resilience to them with validation testing. Using Fault Injection, validation tests inject specific failure conditions into your systems to verify resilience to failures. Unlike exploratory testing, which is done manually, validation testing works best when it can be automated on a schedule. Ideally, these tests should run weekly, but many organizations will start with monthly testing, then gradually increase the frequency as they become more comfortable with the testing process.

To define these standards, start with a list of known failures and reliability risks, then set the standards for how your Kubernetes clusters should respond to them. As an example, say you have a Redis cache database set up for your checkout service. Cache databases are often a known point of failure, which is why it’s a good idea to have timeout standards for failing over to an origin database. But if the timeout settings are too long, then you’ll introduce latency before the failover occurs, leading to a brownout or even an outage. In this case, you’d want to create a test that makes the cache database unavailable to verify that your containers do, in fact, fail over correctly and that the latency settings are correct.
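As a rough illustration of the behavior such a test validates, here is a Python sketch of a read path that fails over from a slow or unavailable cache to the origin database. The callables stand in for your real cache and database clients, and the 50ms timeout is purely illustrative; tune it against your own latency budget.

```python
import concurrent.futures
import time

def read_with_failover(key, cache_read, origin_read, cache_timeout_s=0.05):
    """Try the cache first; if it is slow or unreachable, fail over to the origin DB."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(cache_read, key)
    try:
        return future.result(timeout=cache_timeout_s), "cache"
    except Exception:  # timeout or cache error: fall back to the origin
        return origin_read(key), "origin"
    finally:
        pool.shutdown(wait=False)

# Simulated dependencies: a healthy cache, a hung cache (the injected fault), an origin DB.
def healthy_cache(k): return f"cached:{k}"
def hung_cache(k): time.sleep(0.5); return f"cached:{k}"
def origin(k): return f"origin:{k}"

print(read_with_failover("cart", healthy_cache, origin))  # ('cached:cart', 'cache')
print(read_with_failover("cart", hung_cache, origin))     # ('origin:cart', 'origin')
```

A validation test for this standard would inject the "hung cache" condition against the real cache and assert that responses still arrive from the origin within the agreed latency budget.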

By collecting validation tests that correspond to these known failures, you can create a standardized set of test suites that can be run across your organization. The results of these test suites can also be collected over time to create metrics and dashboards that can be used to align and prioritize reliability risk remediation efforts.

Standardized test suites for every Kubernetes cluster

There are certain resiliency tests that should be run for every Kubernetes deployment. Because they are based on traits common to every Kubernetes cluster, these should form the core of your resiliency test standards. These core tests fall into three groups.

Resource tests

Any Kubernetes cluster needs to be resilient to sudden spikes in traffic, demand, or resource needs. These two tests will verify that your services are resilient to sudden resource spikes. Depending on your architecture, you may also want to add a Disk I/O scalability test to this mix.

  • CPU Scalability: Test that your service scales as expected when CPU capacity is limited. This should be done in three stages of 50%, 75%, and 90% CPU consumption.
    • Estimated test length: 15 minutes
  • Memory Scalability: Test that your service scales as expected when memory is limited. Memory consumption should be done in three stages: 50%, 75%, and 90% capacity.
    • Estimated test length: 15 minutes
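To make the staging explicit, the three-stage pattern above can be written as a small plan generator. This is only a sketch; the stage percentages and 15-minute window simply mirror the defaults listed above.

```python
def staged_pressure_plan(stages=(50, 75, 90), total_minutes=15):
    """Split a test window into equal-length stages of increasing resource pressure.

    Returns (start_minute, end_minute, percent) tuples; the defaults mirror the
    CPU/memory tests above: 0-5 min at 50%, 5-10 at 75%, 10-15 at 90%.
    """
    per_stage = total_minutes // len(stages)
    return [(i * per_stage, (i + 1) * per_stage, pct) for i, pct in enumerate(stages)]

print(staged_pressure_plan())  # [(0, 5, 50), (5, 10, 75), (10, 15, 90)]
```

Stepping pressure up gradually, rather than jumping straight to 90%, lets you see at which stage scaling or degradation begins.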

Redundancy tests

Make sure that your deployments are resilient to infrastructure failures. These tests shut down a host or access to an availability zone to verify that your deployment has the redundancy in place to stay up when a host or zone goes down. If your standards call for multi-region redundancy, then you should add tests that make regions unavailable.

  • Host Redundancy: Test resilience to host failures by immediately shutting down a randomly selected host or container.
    • Estimated test length: 5 minutes
  • Zone Redundancy: Test your service’s availability when a randomly selected zone is unreachable from the other zones.
    • Estimated test length: 5 minutes
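Before shutting down a random host, it is worth confirming that your pods are actually spread across hosts; otherwise the test is guaranteed to cause downtime rather than reveal anything new. Here is a minimal Python pre-check, assuming you can list (service, host) pairs from your cluster inventory; the function and data shapes are illustrative, not any tool's API.

```python
from collections import defaultdict

def survives_any_single_host_loss(pods):
    """pods is a list of (service, host) pairs. Return True only if every
    service keeps at least one running pod no matter which single host
    is shut down."""
    hosts_per_service = defaultdict(set)
    for service, host in pods:
        hosts_per_service[service].add(host)
    # A service spread across 2+ hosts survives the loss of any one host.
    return all(len(hosts) >= 2 for hosts in hosts_per_service.values())

pods = [("checkout", "node-a"), ("checkout", "node-b"), ("search", "node-a")]
print(survives_any_single_host_loss(pods))  # False: search runs only on node-a
```

If this pre-check fails, fixing the pod spread (e.g. with anti-affinity rules) is the remediation, and the redundancy test then validates it.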

Dependency & network tests

The microservices nature of Kubernetes architectures can create a web of dependencies. These tests help you verify that your deployments will respond correctly when dependencies fail, when network issues delay communications, or when expiring certificates make dependencies unavailable. If you have a more complex architecture, you may want to periodically run dependency discovery tests to uncover any unknown dependencies.

  • Dependency Failure: Test your service’s ability to tolerate unavailable dependencies by dropping all network traffic to a specific dependency.
    • Estimated test length: 5 minutes.
  • Dependency Latency: Test your service’s ability to tolerate slow dependencies by delaying all network traffic to this dependency by 100ms.
    • Estimated test length: 5 minutes.
  • Certificate Expiry: Test your service’s dependencies for expired or expiring TLS certificates by opening a secure connection to your dependency, retrieving the certificate chain, and validating that no certificates expire in the next 30 days. A lack of a secure connection would also pass the test, since that would mean there are no certificates.
    • Estimated test length: 1 minute.
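The certificate check maps directly onto standard TLS tooling. Here is a sketch using only the Python standard library; the host name, port, and 30-day window are illustrative assumptions, and handling the no-TLS "pass" case described above is left to the caller.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone

def expires_within(not_after, days=30):
    """Check a certificate's notAfter field (getpeercert() string format)
    against an expiry window."""
    expiry = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return expiry <= datetime.now(timezone.utc) + timedelta(days=days)

def dependency_cert_ok(host, port=443, days=30):
    """Open a TLS connection to the dependency and verify its leaf certificate
    does not expire within `days`. Returns True when the check passes."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return not expires_within(cert["notAfter"], days)

# Offline check of the window logic (the network call needs a reachable host):
print(expires_within("Jan  1 00:00:00 2020 GMT"))  # True: already expired
# dependency_cert_ok("example.com")                # against a real dependency
```

A fuller version would walk the whole chain rather than just the leaf certificate, but the structure of the check is the same.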

Customizing resilience test suites to fit your Kubernetes deployment

While you should start with the standardized test suites above, there are situations where you should make adjustments to better fit your organization and its reliability goals. These changes could be adding new tests designed to fit specific failures, or tweaking the parameters of existing tests, such as adjusting the allowed latency for a given service.

When customizing suites, you should base your changes on data from sources like:

  • Incidents - When there’s an outage, it’s a good practice to set up tests to detect and prevent the same incident from happening in the future. For example, if you experienced a DNS-based outage, then you may want to set up weekly tests to make sure you can failover to a fallback DNS service.
  • Observability alerts - There’s plenty of application behavior that doesn’t directly create an outage, but is still a definite warning sign. Perhaps a service owner has noticed that compute resource increases that take up 85% of compute capacity don’t take the system down, but still create a situation where a sudden spike in traffic would cause an outage. In this case, you’d want to add tests that simulate compute resource usage at 85% capacity to ensure resilience to this potential failure.
  • Exploratory testing - Using exploratory testing, operators can determine exactly what the failures are so you can design tests against them. Critical services, for example, should have higher resilience standards than internal services, and the test suite should be customized to fit these standards.
  • Industry models - There are many architecture models, such as the AWS Well-Architected Framework, that have specific best practices to improve reliability. If you’re using these architecture standards, then you can adjust your testing suites to verify compliance with those standards.
  • Industry compliance requirements - Highly regulated industries, such as finance, can often have resilience and reliability standards unique to their industry. Often these can be much more strict than common best practices, and you should adjust test suites accordingly to fit these compliance requirements.

When to run resilience testing in your Kubernetes SDLC

The goal of validation testing is to provide an accurate picture of your current system’s resiliency. As such, testing should be done in production environments when possible. However, resiliency testing with Fault Injection does have some risks. These can be mitigated with the right tool, setup, and experience with testing, which is why many organizations will start their resilience testing journey by running tests in staging.

Most organizations will follow one of the three strategies below. Ultimately, the choice comes down to your individual organization and its familiarity with resilience testing. Consider the pros and cons of each choice before you decide on a strategy.

Gating release candidates on tests in staging

Testing in a staging environment prevents any potential downtime caused by testing from impacting customers. However, perfectly duplicating the workloads, resources, and traffic of production in a staging environment is cost-prohibitive and time-intensive. Additionally, there are conditions outside your control, such as network topology, that can’t be accounted for in staging environments.

In the end, while testing in staging can catch key reliability risks, it can’t give you an accurate view of the reliability of your system in production.

Running tests after production deployment

Like other kinds of testing, validation resilience testing can be done as part of a release pipeline process, such as CI/CD. But whether this is the right strategy will often come down to your release schedule.

Due to the nature of Fault Injection testing, a full battery of tests could take several hours. If you’re releasing on a weekly or monthly schedule, holding up a deployment to run these tests could be worth it for the reliability risks you uncover. However, if you’re set up for multiple releases a day, then the time spent on the tests prevents them from being used as a gating mechanism. In this case, you should consider testing in production, either post-deployment or on a regular schedule.

Remember, the goal of resiliency testing is to uncover reliability risks in production. While some of these can be uncovered before deployment, you should fit testing into your SDLC where it makes the most sense and can be the most effective at uncovering reliability risks in production before they impact customers.

Scheduling tests at regular intervals

Kubernetes clusters are constantly changing with new deployments, resource changes, network topology shifts, and more. A service that had very few reliability risks two weeks ago could suddenly have a much more vulnerable reliability posture due to new releases, changes in dependency services, or network shifts.

The only way to catch these changes is through regular, automated validation testing using test suites. Ideally, you should aim to have weekly scheduled tests in production, though many organizations work up to this point.

It’s best to schedule these tests during a time when engineers are present and available to address any issues. You should also schedule them to run shortly before your prioritization and resourcing meetings. This will allow your teams to move quickly to address any critical reliability risks the tests uncover.

Next steps in your Kubernetes high availability journey:

Download the comprehensive eBook

Learn how to build your own resiliency management practice for Kubernetes in the 55-page guide Kubernetes Reliability at Scale: How to Improve Uptime with Resiliency Management.
