How can you tell if your systems are reliable when under load? A common answer is to open your observability dashboards, wait for a high-traffic event (like Black Friday), and cross your fingers.
While this approach will certainly give you an answer, it's far from ideal. Without proactive reliability and load testing, we have no idea whether a system will hold up to real-world usage patterns, which could mean a production outage at the worst possible time. What if there were a way to test your systems' reliability and load capabilities in a safe, controlled way?
This post looks at the benefits of combining reliability testing with load testing and how Gremlin helps you do so.
Both reliability testing and load testing aim to answer a critical question: how do technology systems behave when under stress?
With load testing, we want to determine whether our systems have enough capacity and scalability to handle sudden increases in load or if performance and stability suffer. We do this by putting artificially created demand on our systems—often by replicating user activity—and seeing how well the system handles the demand. This lets you preview how your systems will behave when handling similar traffic patterns in production.
Reliability testing takes a slightly different approach. Instead of creating load, it creates faults, such as exhausted resources, host or zone outages, and slow or unavailable dependencies. Reliability testing tools can generate load by consuming CPU, memory, storage bandwidth, and other resources, but the aim is to replicate unstable conditions.
There's already value in running reliability and load tests independently, but when combined, the whole is greater than the sum of its parts. In production, systems are likely to experience both high load and unstable conditions. In many cases, instability and failures happen because of high load. Think of major e-commerce platforms on Black Friday and Cyber Monday, where traffic can exceed several dozen terabytes per minute. This is so much extra traffic compared to a normal day that many online retailers do extensive testing and scaling months beforehand.
Running proactive reliability and load tests helps ensure that your systems are resilient under stressful conditions, better preparing them for real-world scenarios.
Most load-testing tools provide externally accessible APIs, and Gremlin can use these APIs to start and stop load tests alongside reliability tests. Gremlin makes it easy to integrate with some of the industry's leading load-testing tools, such as Grafana k6. The way it works is simple:
- You authenticate Gremlin with your load testing tool. This allows Gremlin to send API commands for any service(s) your team manages.
- For each service you add to Gremlin, you can specify which load test to run at the start of a reliability test and whether to stop the load test when the reliability test finishes.
- Each time you initiate a reliability test in Gremlin, Gremlin automatically starts the load test, then runs the reliability test. If you opted to have Gremlin halt the load test, it does so once the reliability test finishes, preventing excess stress.
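To make the ordering concrete, here is a minimal Python sketch of that sequence. It is not Gremlin's or k6's API: `run_with_load`, `start_load`, `stop_load`, and `run_reliability_test` are hypothetical names, and the callables are stand-ins for whatever API calls your load-testing tool actually exposes.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def load_test(start: Callable[[], None],
              stop: Callable[[], None],
              stop_when_done: bool) -> Iterator[None]:
    """Start the load test, and optionally stop it when the block exits."""
    start()
    try:
        yield
    finally:
        # Runs even if the reliability test raises, so load never lingers
        # by accident when the caller asked for it to be stopped.
        if stop_when_done:
            stop()


def run_with_load(start_load: Callable[[], None],
                  stop_load: Callable[[], None],
                  run_reliability_test: Callable[[], str],
                  stop_when_done: bool = True) -> str:
    """Hypothetical orchestration: load first, then the reliability test."""
    with load_test(start_load, stop_load, stop_when_done):
        return run_reliability_test()


# Stand-ins that only record what happened, in order.
events: List[str] = []


def fake_reliability_test() -> str:
    events.append("reliability test ran")
    return "passed"


result = run_with_load(
    start_load=lambda: events.append("load started"),
    stop_load=lambda: events.append("load stopped"),
    run_reliability_test=fake_reliability_test,
)
print(result, events)
```

Wrapping the load test in a context manager is one way to guarantee the "stop when finished" option is honored even when the reliability test ends early.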
Remember that Gremlin automatically monitors the health of your services using your Health Checks, and this is no different with load testing. While your reliability test runs, Gremlin regularly checks your service's Health Check to verify that it's still healthy. If any of your Health Checks reports as unhealthy, Gremlin immediately halts the test and marks it as failed. This provides immediate feedback on how well your service withstood the test while returning the service to a healthy state.
There's an important assumption here that shouldn't get overlooked. Since Gremlin fails a test when a Health Check is unhealthy, your services must be able to pass their Health Checks under normal conditions. When you run a load test, the service under load becomes the new baseline. If adding load causes any of your Health Checks to fail, the test is marked as failed before your reliability test even starts. In other words, your systems must already operate reliably under load in order for Gremlin to work effectively.
However, this doesn't completely prevent you from using Gremlin. Even if a reliability test fails because one of your Health Checks is impacted by a load test, this is a good outcome. It means you've discovered a problem with your service that you need to address. Then, once you've implemented a fix, you can easily re-run your reliability and load tests to ensure your fix is effective.
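The Health Check behavior described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not how Gremlin is implemented: `FakeService`, `run_guarded_test`, and `simulate` are hypothetical names, the latency numbers are made up, and health is checked once before the test and once after each step rather than polled continuously.

```python
from typing import Callable, List

HealthCheck = Callable[[], bool]
Step = Callable[[], None]


def run_guarded_test(health_checks: List[HealthCheck], steps: List[Step]) -> str:
    """Run steps, halting as soon as any health check reports unhealthy."""
    def healthy() -> bool:
        return all(check() for check in health_checks)

    # The service under load is the baseline: it must be healthy up front.
    if not healthy():
        return "failed before start"
    for step in steps:
        step()  # inject one fault
        if not healthy():
            return "halted"  # remaining steps are skipped
    return "passed"


class FakeService:
    """Stand-in service whose only observable property is latency."""

    def __init__(self, latency_ms: int) -> None:
        self.latency_ms = latency_ms

    def add_latency(self, extra_ms: int) -> None:
        self.latency_ms += extra_ms


def simulate(load_ms: int, fault_ms: int, limit_ms: int = 300) -> str:
    """Apply load, then run one latency fault behind a latency health check."""
    service = FakeService(latency_ms=100)
    service.add_latency(load_ms)  # the load test is already running
    checks = [lambda: service.latency_ms <= limit_ms]
    steps = [lambda: service.add_latency(fault_ms)]
    return run_guarded_test(checks, steps)


print(simulate(load_ms=50, fault_ms=100))   # healthy under load and under fault
print(simulate(load_ms=250, fault_ms=100))  # load alone breaks the health check
print(simulate(load_ms=50, fault_ms=200))   # fault pushes latency past the limit
```

In the second call the simulated service is already unhealthy from load alone, so the fault is never injected. That is the "failed before the reliability test starts" case, and it points to a capacity problem to fix first.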
Adding reliability testing is easy if you're already using load-testing tools. With Gremlin, all you need to do is deploy the Gremlin agent, add your services, and then run our pre-built suite of reliability tests. These tests cover a number of critical scenarios, such as auto-scaling, zone and region redundancy, and resilience to failed dependencies. Running reliability tests alongside your load tests lets you ensure your systems are resilient even under heavy load while adding minimal extra time to your testing cycle.
To get started, visit gremlin.com/demo.