Today, we’re excited to announce our native Gremlin integration with Spinnaker, the open source continuous delivery platform built by Netflix and backed by Google, Microsoft, and Oracle. With this integration, Spinnaker users can now automate chaos experiments by adding a Gremlin stage to their delivery pipelines.
For software engineering teams small and large, automation is necessary to keep velocity up while delighting end users with new features and a rock-solid platform. Continuous integration and continuous deployment (CI/CD) tools enable this automation, reducing the toil of manually checking out code, building it, and deploying it to the right environment.
Automation and the adoption of microservices mean code makes it to production faster, but more frequent changes also mean more opportunities for regressions. To deal with the increased complexity, it’s now standard to rely on a test suite that verifies released code works as expected. Just as unit and integration tests verify that the components of our systems work as expected, running Chaos Engineering experiments at a regular cadence during the build and deploy process creates a consistent feedback loop, ensuring our systems don’t drift into failure before a release is pushed.
To get started, update your Spinnaker deployment to v1.13.0. When editing an existing pipeline or creating a new one, create a new stage and select Gremlin from the drop-down.
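Under the hood, the new stage is just another entry in the pipeline’s JSON definition. A minimal sketch of what the stage configuration might look like — the field names here are assumptions based on the integration’s UI, not a documented schema, so check your pipeline’s JSON after adding the stage:

```json
{
  "type": "gremlin",
  "name": "Run Chaos Experiment",
  "gremlinApiKey": "<your-api-key>",
  "gremlinTargetTemplateId": "<target-template-id>",
  "gremlinCommandTemplateId": "<attack-template-id>"
}
```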
Paste in a Gremlin API key and you’ll be presented with your Gremlin templates: one to identify the hosts to target and one to select the type of attack.
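If you’d rather script against the Gremlin API directly, fetching your templates is a single authenticated GET. Here’s a minimal Python sketch, assuming a `https://api.gremlin.com/v1/templates` endpoint and the `Key` authorization scheme — both are assumptions, so verify the exact path and auth against the Gremlin API documentation:

```python
import json
import urllib.request

GREMLIN_API = "https://api.gremlin.com/v1"

def build_template_request(api_key: str) -> urllib.request.Request:
    """Build an authenticated request for the templates endpoint.

    The /templates path and the `Key` auth scheme are assumptions;
    verify them against the Gremlin API documentation.
    """
    return urllib.request.Request(
        f"{GREMLIN_API}/templates",
        headers={"Authorization": f"Key {api_key}"},
    )

def list_templates(api_key: str):
    """Fetch the template list and decode it as JSON."""
    with urllib.request.urlopen(build_template_request(api_key)) as resp:
        return json.load(resp)
```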
Once both templates have been selected, you’ll see a description of what the Spinnaker stage will carry out. After a Gremlin stage has run, its logs are visible in the Spinnaker UI, or you can head to the Gremlin WebApp to view the attack details.
In the real world, implementing continuous chaos lets you verify that a past incident won’t happen again. At a previous company, one of Gremlin’s engineers dealt with a customer-facing outage caused by large traffic spikes making it past a caching mechanism.
All inbound traffic was hitting a service or the service’s cache. If the cache stopped responding with valid data, the service would see dramatic spikes in traffic volume that were difficult for it to handle. Ensuring the service’s behavior would remain unaffected by the behavior of the cache required a large number of code changes and the introduction of new architectural components. The team created a black hole chaos experiment that tested for cache failure in order to get a solution in place and verify it worked. From here, they automated the chaos test in order to prevent future deploys from drifting back into this failure.
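Automating an experiment like that one amounts to posting an attack definition to the Gremlin API on every deploy. A hedged Python sketch of building such a request body — the field names echo Gremlin’s command/target vocabulary but are assumptions, not the documented schema:

```python
def blackhole_attack_body(target_tags: dict, length_seconds: int = 60) -> dict:
    """Build a request body for a black hole attack that drops traffic
    to the targeted hosts, simulating the cache failure described above.

    The exact field names and argument flags are assumptions; consult
    the Gremlin API documentation before wiring this into a pipeline.
    """
    return {
        "command": {
            "type": "blackhole",
            # "-l" is assumed here to set the attack length in seconds
            "args": ["-l", str(length_seconds)],
        },
        "target": {
            "type": "Random",
            "tags": target_tags,
        },
    }
```

A pipeline step could POST this body on every deploy and fail the build if the service’s error rate rises while its cache is unreachable.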
As we continue our mission to make the internet more reliable for everyone, we’ll continue to remind folks that downtime is expensive. Increasing the automated test coverage of your codebase is a proven way to improve application health, but it isn’t sufficient on its own for increasing system reliability. Adding automated chaos experiments will allow you to reduce outages, providing your end users with a more positive experience.