Since 2017, Gremlin has offered a platform to run Chaos Engineering experiments, enabling our customers to increase the reliability of their applications. Gremlin provides a variety of failure modes across state, resource, network, and application to test your reliability. Running a single Chaos Engineering attack has always been simple, but we’ve received many customer requests to simplify planning and tracking an experiment that simulates a real-world outage.
Today we’re introducing Scenarios, which make it simple to simulate real-world outages that lead to costly downtime. Scenarios allow you to link attacks together, growing both the blast radius and magnitude over time. Once created, these Scenarios become a shareable resource for your team, complete with a name, description, hypothesis, and a place to record your notes and observations.
Whether you’ve been practicing Chaos Engineering for years or you’re just starting out, we’re simplifying your reliability journey with a set of pre-built Recommended Scenarios.
Available with today’s release are six Recommended Scenarios for you to clone and start with. Each Recommended Scenario is based on a real-world outage and can be run in a couple of clicks. In other words, you can validate your systems’ ability to withstand each of them, and avoid downtime, in a few minutes.
Remember the AWS S3 outage in 2017? An enormous number of sites were unavailable, affecting millions of internet users. How about when thousands of people couldn’t clear customs and were forced to wait when entering the United States by air? And finally, GitHub’s outage in 2018, bringing engineering productivity around the globe to a halt for more than 24 hours.
"Scenarios feel like an important step in the natural evolution of chaos. Replicating isolated failures will always be helpful, but scenarios provide the means to ratchet up pressure on our systems in ways that more closely mirror the complex, orchestrated failure states we observe in production environments."
Matt Simons
Senior Engineering Manager at Workiva
Inside Gremlin you’ll find Recommended Scenarios for each of these outages. All you need to do is select a set of hosts to target and click Run Scenario to see how your system would respond to these conditions!
Recommended Scenarios guide you through Chaos Engineering experiments to be sure that your application is reliable despite resource constraints, unreliable networks, unavailable dependencies, and more, preventing these types of outages from affecting your business and your customers.
The practice of Chaos Engineering is all about injecting failure, starting with a small blast radius (a low number of hosts) and a limited magnitude (such as a minimal CPU load). Scenarios allow you to create many attacks and link them together, growing both the blast radius and magnitude over time.
All systems have a breaking point, but do you know where that is? When your system is on the verge of failure, is your customer aware?
Create a sequence of attacks within a Scenario with increasing blast radius and magnitude to cause failure on your own terms. Observe your system under low stress, at a breaking point, and in a critical state to determine next steps to improve your reliability.
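To make the idea of growing blast radius and magnitude concrete, here is a minimal Python sketch of how such a sequence could be planned. It is not the Gremlin API: the `AttackStep` class and the `escalating_steps` function are hypothetical names invented for illustration, and the sketch assumes a CPU attack where blast radius is a host count and magnitude is a CPU percentage.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AttackStep:
    """One attack in a sequence (hypothetical model, not a Gremlin type)."""
    hosts: int        # blast radius: how many hosts are targeted
    cpu_percent: int  # magnitude: how much CPU load is applied


def escalating_steps(start: AttackStep, end: AttackStep, stages: int) -> List[AttackStep]:
    """Plan `stages` attacks that grow linearly from `start` to `end`."""
    if stages < 2:
        raise ValueError("need at least two stages to escalate")
    steps = []
    for i in range(stages):
        fraction = i / (stages - 1)  # 0.0 at the first stage, 1.0 at the last
        hosts = round(start.hosts + fraction * (end.hosts - start.hosts))
        cpu = round(start.cpu_percent + fraction * (end.cpu_percent - start.cpu_percent))
        steps.append(AttackStep(hosts, cpu))
    return steps


if __name__ == "__main__":
    # Low stress, then a middle stage, then a critical state.
    for step in escalating_steps(AttackStep(1, 10), AttackStep(5, 90), stages=3):
        print(step)
```

Growing both values in small, even increments keeps each stage observable on its own, so you can tell which stage brought the system to its breaking point.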
Once you’ve thoughtfully created a Scenario, take the time to record your hypothesis for what you expect to happen when the Scenario runs. Once complete, jot down your observations, and indicate if the result met your expectations, or if an incident was detected. Following this method will enable you to take action on the results of your Scenario, improving the reliability of your application.
After creating, running, and adding results to a Scenario, you can schedule it to run on an ongoing basis to verify the results of your Scenario and prevent drifting into failure. Select the days of the week you would like it to run, the maximum number of times it can run, and the time window in which it can run.
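The three scheduling settings combine into a simple rule, sketched below in Python. This illustrates the logic only and is not how Gremlin implements scheduling: the `Schedule` class and its fields are hypothetical, and the sketch assumes the time window does not cross midnight.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Set


@dataclass
class Schedule:
    """Hypothetical model of a Scenario schedule (not a Gremlin type)."""
    days: Set[int]        # allowed weekdays, Monday=0 ... Sunday=6
    window_start: time    # earliest time of day a run may start
    window_end: time      # latest time of day a run may start
    max_runs: int         # maximum number of times the Scenario can run
    runs_so_far: int = 0

    def may_run(self, now: datetime) -> bool:
        """Return True only if all three settings allow a run at `now`."""
        if self.runs_so_far >= self.max_runs:
            return False
        if now.weekday() not in self.days:
            return False
        # Assumes window_start <= window_end (no overnight windows).
        return self.window_start <= now.time() <= self.window_end


if __name__ == "__main__":
    schedule = Schedule(days={0, 2}, window_start=time(9), window_end=time(17), max_runs=2)
    print(schedule.may_run(datetime(2024, 1, 1, 10, 0)))  # a Monday morning
```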