With so many Chaos Engineering tools available, it’s no surprise that SRE and platform leaders are doing their homework when choosing a platform to help them build and scale their Chaos Engineering programs. But like anything else you can research on the internet, there’s a lot of noise and hype that you need to wade through.
Gremlin works with Reliability Engineering teams at hundreds of companies with the most sensitive workloads—and has since 2016. Installed in virtually every environment, and with millions of Chaos Engineering experiments under our belt, we’ve heard it all. So to make things easy for you, we’ve put together a list of the top Chaos Engineering tool myths we hear and the reality you need to know.
Lots of Chaos Engineering tools will brag about the number of experiments you can run. The reality is that most of those numbers aren’t apples-to-apples comparisons. For instance, Gremlin has eleven fully customizable experiment types you can perform on dozens of different environment types. You can combine and arrange these into a near-infinite number of unique scenarios. Does that mean that Gremlin has a near-infinite number of experiments? The answer is… it’s not actually important.
When you get down to it, experiments are only useful if they help you learn about your system and get down to the real-world causes of reliability issues. You should never run a ton of experiments for the sake of running experiments. This can lead to shallow, surface-level testing that misses key vulnerabilities. Instead, you should create well-designed, targeted experiments built around identifying and addressing the most critical failure modes. You also want to make sure your experiments are safe to run in all your environments, which means they need more robust features like effective scoping and automatic rollbacks.
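To make "effective scoping and automatic rollbacks" concrete, here is a minimal sketch of what a targeted, safe experiment could look like in code. It is an illustration only, not Gremlin’s API or any other tool’s: the `Experiment` class, the `run` function, the host names, and the simulated error rate are all hypothetical. The experiment names its hypothesis, touches only the targets it lists, checks a halt condition before each injection, and always rolls back whatever it touched.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Experiment:
    """Hypothetical experiment definition; not any vendor's real API."""
    name: str
    hypothesis: str
    targets: List[str]                # explicit scope: only these hosts are touched
    inject: Callable[[str], None]     # applies the fault to one target
    rollback: Callable[[str], None]   # reverts the fault on one target
    should_halt: Callable[[], bool]   # safety check, e.g. error rate too high


def run(experiment: Experiment) -> Dict[str, object]:
    """Inject the fault target by target, always rolling back what was touched."""
    touched: List[str] = []
    halted = False
    try:
        for target in experiment.targets:
            if experiment.should_halt():
                halted = True
                break
            # Record the target first so a failed injection is still rolled back.
            touched.append(target)
            experiment.inject(target)
    finally:
        # Automatic rollback: runs on normal completion, halt, or exception.
        for target in reversed(touched):
            experiment.rollback(target)
    return {"name": experiment.name, "halted": halted, "touched": touched}


# Simulated system state standing in for real hosts and real monitoring.
state = {"faulted": set(), "error_rate": 0.0}


def inject_latency(host: str) -> None:
    state["faulted"].add(host)
    state["error_rate"] += 0.03  # pretend each faulted host adds 3% errors


def remove_latency(host: str) -> None:
    state["faulted"].discard(host)


experiment = Experiment(
    name="checkout-cache-latency",
    hypothesis="Checkout stays available when cache nodes respond slowly",
    targets=["cache-1", "cache-2", "cache-3"],
    inject=inject_latency,
    rollback=remove_latency,
    should_halt=lambda: state["error_rate"] > 0.05,
)

result = run(experiment)
print(result)
```

In this simulated run, the halt condition trips before the third host is touched, and the rollback leaves no host in a faulted state. A real tool would replace the simulated state with actual fault injection and monitoring signals.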
Quality over quantity is key to building a successful chaos engineering practice. And to do that, you need to be able to run strategic experiments where you can have confidence in their results.
Some IaaS, CI/CD, and software delivery platforms have built (or bought) a Chaos Engineering tool to integrate into their platforms. Still others have integrated open source tools as just one small part of their greater platform. Add-ons like this may seem convenient and cost-effective, but many are built just to tick the box during the purchasing process. They may have a long list of features, but they often suffer from complicated setup and usability challenges, and they lock you into integrating with the rest of their toolset.
Purpose-built platforms, like Gremlin, are built on years of focused Chaos Engineering experience. They’re specifically designed to address the unique challenges of Chaos Engineering and provide a more comprehensive, reliable, and user-friendly experience than add-ons can offer. They have best practices built in and often help get your Chaos Engineering program up and running faster and more successfully.
When looking for a pipeline automation platform, you should always look for one that can be molded to fit exactly what your organization needs. That includes integrating cleanly with third-party systems and not locking you into their platform or add-ons.
In theory, using open source tools and building your own solution seems cheaper and more customizable. Open source tools are free to download, and you can tailor the tool to your systems to perfectly fit your exact experimental needs. And this is all true! Open source solutions give you incredible customization and flexibility.
But in practice, this can actually limit the number and quality of the experiments you run. It takes a lot of time and resources to build, maintain, and troubleshoot your own tool, and that is time you’re not spending on experiments and improving your reliability. And every time anyone on your team wants to make a small change (such as a security upgrade), it takes even more time to adjust. With a vendor or managed tool, you’re working with a tool that is constantly being designed, maintained, and improved using data from thousands of different systems. That lets you focus on your experiments and on improving your product, knowing that your Chaos Engineering tool is being maintained and updated with the latest features, technology, security capabilities, and more.
When you build a Chaos Engineering tool, you’re on your own for the entire process, but with a vendor tool, you’ve got expertise baked in, support in using the tool, and a team behind you every step of the way.
While integrating Chaos Engineering tools into your CI/CD pipeline can help catch issues early in the development process, it shouldn’t be the sole focus. It’s important not to confuse Chaos Engineering with other quality engineering practices like functional testing. Functional testing is the most effective tool for catching problematic releases, but its job is finished after the build pipeline. Chaos Engineering is important for understanding slower-changing issues with load, complex interrelated deployment problems, or other operating conditions like changes to infrastructure.
Unlike functional testing, Chaos Engineering is equally valuable when applied to production environments, where it can uncover and address real-world, system-level issues that may not be apparent during pre-production testing. Applying Chaos Engineering continuously ensures your production systems stay resilient, even as they change due to new infrastructure and code deployments. Striking a balance between pre-production and production testing ensures a comprehensive approach to system resilience.
So now that we’ve gone over the myths you shouldn’t use to select a Chaos Engineering tool, what should you look for? Pulled from The Guide to Chaos Engineering Tools, here are the parameters you should look at when evaluating your tool options:
- How easy is it to deploy and use? Like any tool, you have to be able to use your Chaos Engineering tool to get value out of it. It should be easy to deploy, easy to use, and easy to maintain so you can get more value faster.
- What tests can it run, and what target types does it support? Make sure the tool supports your architecture. You should be able to run the experiments you need and have confidence that you’re uncovering any reliability risk.
- Does it include any guidance around what tests to run, how to run them, and what to look for? Your systems are complex, and there are thousands of tests you could run. You should be able to focus your efforts where they’ll have the greatest impact. Look for tools that test against best practices and the most common reliability risks.
- Does it support your environment? For instance, Gremlin can run in cloud or on-prem environments. Your tool should be able to run experiments everywhere your systems run, otherwise you’re not truly testing reliability.
- Does it support organizational practices like GameDays? Reliability takes everyone working together across teams. Your tool should be able to be used across teams to help spur action that will find and reduce reliability risks.
- What’s the tool’s license? Most tools have a commercial, freemium, or open source license. Each one comes with its own level of support, feature access, and costs.
- Does it provide recommendations, scores, or other feedback to tell you how you’re doing? There are always going to be reliability risks in a system. You should be able to identify which risks to spend your time on, and be able to show the results of your fixes.
The right Chaos Engineering tool will be the one that allows you to uncover reliability risks across your systems, evaluate the threat of those risks so you can prioritize your engineering resources, and show the improvement in reliability you’ve made over time, without being too difficult to deploy or use.
These criteria should help you evaluate any tool with these goals in mind.
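One way to apply these criteria side by side is a simple weighted scoring matrix. The sketch below is a hypothetical illustration, not part of The Guide to Chaos Engineering Tools: the weights, the 1 to 5 rating scale, and the two placeholder tools (`tool_a`, `tool_b`) with their ratings are all assumptions you would replace with your own priorities and evaluation results.

```python
from typing import Dict

# Hypothetical weights (summing to 1.0) for the seven criteria listed above.
# Adjust them to reflect what matters most to your organization.
WEIGHTS: Dict[str, float] = {
    "ease_of_deployment_and_use": 0.20,
    "experiments_and_targets": 0.15,
    "guidance": 0.15,
    "environment_support": 0.20,
    "gameday_support": 0.10,
    "license_fit": 0.05,
    "feedback_and_scores": 0.15,
}


def weighted_score(ratings: Dict[str, int]) -> float:
    """Combine 1-5 ratings into one weighted score; every criterion must be rated."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    return sum(WEIGHTS[criterion] * ratings[criterion] for criterion in WEIGHTS)


# Placeholder ratings for two made-up tools; these are not real evaluation data.
candidates: Dict[str, Dict[str, int]] = {
    "tool_a": {
        "ease_of_deployment_and_use": 4,
        "experiments_and_targets": 3,
        "guidance": 4,
        "environment_support": 5,
        "gameday_support": 3,
        "license_fit": 4,
        "feedback_and_scores": 4,
    },
    "tool_b": {
        "ease_of_deployment_and_use": 2,
        "experiments_and_targets": 5,
        "guidance": 2,
        "environment_support": 3,
        "gameday_support": 2,
        "license_fit": 5,
        "feedback_and_scores": 2,
    },
}

# Rank candidates from highest to lowest weighted score.
ranking = sorted(candidates, key=lambda name: weighted_score(candidates[name]), reverse=True)
for name in ranking:
    print(f"{name}: {weighted_score(candidates[name]):.2f}")
```

A matrix like this does not make the decision for you, but it forces the team to agree on what matters before comparing tools, and it shows that a tool with the most experiment types does not automatically come out on top.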
In the end, Chaos Engineering is just one part of a mature reliability program, one where you’ve gone from constantly reacting to operating with a reliability strategy. At Gremlin, we’ve seen leading organizations in this area build robust reliability programs that proactively resolve risks before they become outages, use reliability standards and baselines, incorporate automated reliability testing, and have a culture of reliability built into their team every step of the way.
And it all starts with using the right Chaos Engineering tool to start taking command of your incidents and reliability risks.
Ready to find the right tool for your team? Check out The Guide to Chaos Engineering Tools or try Gremlin’s Reliability Management tool for free.