Chaos Engineering, where engineers intentionally inject failure to test the resilience of their systems, is becoming a regular practice for companies that value uptime and availability. As cloud-based systems have grown more complex, Chaos Engineering has become a critical part of the software testing and release process to uncover surprise dependencies, fix problems before they become 3am outages, and bake resiliency into every feature.
As the discipline becomes mainstream, engineering teams are naturally debating whether they should build their own tools in-house, or buy a commercial offering.
We get the “build vs buy” question frequently, so we wrote the following resource for teams that are ready to adopt chaos engineering and want guidance on the best strategy.
Many internal tools start off as a fork of an open source project (like Chaos Monkey), which offers a quicker path to a minimum viable product and addresses simple concerns, such as random shutdowns or reboots of hosts. More failure states can be added over time, and ideally an automation layer can be added on top.
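To make the "random shutdowns of hosts" starting point concrete, here is a minimal sketch of what such a first version might look like. It is not Chaos Monkey's actual implementation. The host names, the `shutdown` callback, and the `dry_run` flag are all hypothetical, and a real tool would call your cloud provider or orchestration API where the callback is.

```python
import random
from typing import Callable, Optional, Sequence


def pick_victim(hosts: Sequence[str], rng: random.Random) -> Optional[str]:
    """Return one host chosen uniformly at random, or None if the list is empty."""
    if not hosts:
        return None
    return rng.choice(hosts)


def run_random_shutdown(
    hosts: Sequence[str],
    shutdown: Callable[[str], None],
    rng: Optional[random.Random] = None,
    dry_run: bool = True,
) -> Optional[str]:
    """Pick a random host and shut it down, unless dry_run is set.

    `shutdown` is a hypothetical callback supplied by the caller; this
    sketch does not know how to stop a real machine.
    """
    if rng is None:
        rng = random.Random()
    victim = pick_victim(hosts, rng)
    if victim is None:
        return None
    if dry_run:
        # Default to reporting only, so the sketch is safe to run as-is.
        print(f"[dry-run] would shut down {victim}")
    else:
        shutdown(victim)
    return victim


if __name__ == "__main__":
    fleet = ["web-1", "web-2", "db-1"]  # hypothetical host names
    run_random_shutdown(fleet, shutdown=lambda h: print(f"shutting down {h}"))
```

Even this toy version hints at the work that follows: scheduling, safety limits, additional failure types, and access control all have to be added before it is fit for production use.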
Building an internal chaos engineering tool means you can customize it to the exact needs of your application or environment, so that it integrates deeply with other systems such as your monitoring and development pipelines.
Building can also mean a shorter feedback cycle between development and production, budget and resources permitting. You also get full control over the product roadmap, including the features and direction of the tool.
There’s some security benefit as well. When you build a tool, all connections stay within your company’s internal network, which gives you control over the attack surface. Everything can be kept on-prem, with no reliance on the outside world. This also potentially makes it easier to monitor and control traffic.
Buying a commercial offering for Chaos Engineering means you’re able to get up and running sooner with zero in-house development. Even starting with an existing open source project will require a non-trivial amount of time to build robust features, not to mention sufficiently hardening the tool for security. Chaos Engineering is still a new discipline, but there are tools like Gremlin out there so you don’t need to task your engineers with reinventing the wheel.
When commercial tools exist, building your own solution ends up looking like undifferentiated heavy lifting that pulls engineering resources away from business drivers. It’s often more cost-effective for your team to purchase a tool than to build one at the expense of customer-facing, revenue-driving features.
Purchasing a solution means you can outsource everything, including API layer availability. Because a SaaS product needs to be extensible and generally available across diverse customer environments, the vendor provides and maintains an API layer. The API layer will also usually have feature parity with the in-app experience and the same SLA. Additionally, because the service itself depends on the API, features that the SaaS provider needs become available through the API immediately.
Cross-platform compatibility also comes into play. In the age of microservices and distributed systems, tooling needs to support various environments, rather than only being compatible with specific environments.
Customer support will also be included in your contract. This can be helpful for a couple of reasons. First, your in-house engineers don’t need to provide support. Second, the support engineers provided under your contract will be experts not only in their tool, but in chaos engineering as well.
Purchasing a solution also means you realize a more immediate impact on reducing downtime. An hour of downtime costs, on average, $100,000, and that doesn’t include engineering costs to bring the site back up, or potential impact to your brand. Every month that an organization delays disciplined chaos engineering is a month that might contain a large outage.
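As a rough illustration of that cost-of-delay argument, the sketch below multiplies the $100,000-per-hour figure quoted above by an expected number of outage hours. The monthly outage probability and the outage length are hypothetical inputs you would replace with your own incident history. The function name is ours, and the result ignores the engineering and brand costs mentioned above.

```python
DOWNTIME_COST_PER_HOUR = 100_000  # USD per hour, the average quoted in this article


def expected_delay_cost(
    months_delayed: int,
    outage_probability_per_month: float,
    outage_hours: float,
    cost_per_hour: float = DOWNTIME_COST_PER_HOUR,
) -> float:
    """Expected downtime cost accumulated while chaos engineering is delayed.

    Assumes (hypothetically) that each month independently has the same
    chance of one outage of the same length.
    """
    if months_delayed < 0:
        raise ValueError("months_delayed must be non-negative")
    if not 0.0 <= outage_probability_per_month <= 1.0:
        raise ValueError("outage_probability_per_month must be between 0 and 1")
    if outage_hours < 0:
        raise ValueError("outage_hours must be non-negative")
    expected_outage_hours = months_delayed * outage_probability_per_month * outage_hours
    return expected_outage_hours * cost_per_hour


if __name__ == "__main__":
    # Hypothetical inputs: a 6-month delay, a 25% chance per month of a 2-hour outage.
    print(f"${expected_delay_cost(6, 0.25, 2):,.0f}")
```

With those illustrative inputs the expected cost is 6 × 0.25 × 2 × $100,000 = $300,000. The point is the shape of the calculation, not the specific number.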
A purchased product will also be refined by feedback and input from a number of customers. Lots of user feedback can mean a simple, intuitive UI and a more robust feature set that covers use cases your internal engineers may not have considered but would find very valuable.
A vendor’s proven success is also a factor. Buying a chaos engineering solution means you can look to the vendor’s track record of success, and offload risk that you would otherwise own by building your own solution.
Costs - Building any chaos tooling requires engineers to dedicate time to supporting and maintaining the tool as well. With buying, support is handled by the SaaS provider, leaving engineers time to develop chaos experiments and truly test their application’s resilience, rather than having to ensure the availability of their chaos tool. Building and maintaining a sophisticated fault injection platform takes roughly 14-18 months of work from several engineers.
Ease of use - Internal tooling, as noted, tends to be released first as a minimum viable product, with interfaces that might not be well-documented or easy to use. SaaS tools have to be approachable for novices, as well as extensible for the most advanced users, right from the start. Both qualities are major factors in how well your organizational culture adopts chaos engineering.
Scalability - While building a chaos tool to address one particular application stack allows for much more control and sophistication around what faults you can inject, this can be a double-edged sword. The tool may not extend to other application stacks and infrastructure types, such as container-based vs. host-based architecture, or AWS vs. Google Cloud Platform or Azure. Additionally, most in-house tools won’t have an API as an out-of-the-box feature, limiting your ability to automate your chaos experiments.
Security - Using open source tools or developing in-house means security may be an afterthought or an implied feature. For example, a tool might assume that all users are authenticated through the corporate network, and thus automatically have the right access, rather than enforcing the principle of least privilege and multi-factor auth.
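The difference between "on the corporate network, so allowed" and least privilege can be sketched as a deny-by-default check. This is a hypothetical illustration, not any particular product's access model. The `User` type, the `(action, target)` grant format, and the example names are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class User:
    name: str
    mfa_verified: bool = False
    # Explicit (action, target) pairs this user may run; empty by default.
    grants: FrozenSet[Tuple[str, str]] = field(default_factory=frozenset)


def may_run_attack(user: User, action: str, target: str) -> bool:
    """Deny by default: require MFA and an explicit grant for this exact pair."""
    if not user.mfa_verified:
        return False
    return (action, target) in user.grants


if __name__ == "__main__":
    # Hypothetical user who may only shut down hosts in staging.
    alice = User("alice", mfa_verified=True,
                 grants=frozenset({("shutdown", "staging-web")}))
    print(may_run_attack(alice, "shutdown", "staging-web"))
    print(may_run_attack(alice, "shutdown", "prod-db"))
```

Being inside the network grants nothing here. A user with no grants, or one who has not passed multi-factor auth, cannot run any attack.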
Security - Because a SaaS product is a hosted solution, there will be traffic going out of your network. SaaS vendors have this in mind when building their software. With Chaos Engineering in particular, security is always a huge concern and must be baked into the product.
Product - As with any vendor software, you lack control over the roadmap. You’re buying a product you can begin using immediately, but you’ll be a layer removed from the vendor’s product roadmap.
In both scenarios, you still need a Chaos Engineering champion (or ideally, many champions) within your company to help drive chaos engineering adoption and to treat resilience as a first-class feature. The main difference is whether your engineers spend time building and maintaining “yet another app” that’s outside of their core competence, or realize the benefits of Chaos Engineering as soon as they can get their first Gameday running.
Downtime is expensive, as is the operations burden of building and maintaining a system. While companies won’t get full control over their chaos tool when buying, buying affords them the time and availability to start making their systems more resilient immediately, ultimately helping everyone sleep better at night.