Introducing Scenarios to prepare for real-world outages
Since 2017, Gremlin has offered a platform for running Chaos Engineering experiments, enabling our customers to increase the reliability of their applications. Gremlin provides a variety of failure modes across state, resource, network, and application to test your reliability. Running a single Chaos Engineering attack has always been simple, but until now, planning and tracking an experiment that simulates a real-world outage has taken more effort, and many customers have asked us to simplify it.
Today we’re introducing Scenarios, which make it simple to simulate real-world outages that lead to costly downtime. Scenarios allow you to link attacks together, growing both the blast radius and magnitude over time. Once created, these Scenarios become a shareable resource for your team, complete with a name, description, hypothesis, and a place to record your notes and observations.
Whether you’ve been practicing Chaos Engineering for years or you’re just starting out, we’re simplifying your reliability journey with a set of pre-built Recommended Scenarios.
Available with today’s release are six Recommended Scenarios for you to clone and start with. Recommended Scenarios are based on real-world outages, and you can run them in a couple of clicks. In other words, you can validate your systems’ ability to withstand each of them and avoid downtime in just a few minutes.
Remember the AWS S3 outage in 2017? An enormous number of sites were unavailable, affecting millions of internet users. How about when thousands of people couldn’t clear customs and were forced to wait when entering the United States by air? And finally, there was GitHub’s outage in 2018, which brought engineering productivity around the globe to a halt for more than 24 hours.
"Scenarios feel like an important step in the natural evolution of chaos. Replicating isolated failures will always be helpful, but scenarios provide the means to ratchet up pressure on our systems in ways that more closely mirror the complex, orchestrated failure states we observe in production environments."
Inside Gremlin you’ll find Recommended Scenarios for each of these outages. All you need to do is select a set of hosts to target and click Run Scenario to see how your system would respond to these conditions!
Recommended Scenarios guide you through Chaos Engineering experiments to be sure that your application is reliable despite resource constraints, unreliable networks, unavailable dependencies, and more, preventing these types of outages from affecting your business and your customers.
Blast Radius & Magnitude
The practice of Chaos Engineering is all about injecting failure, starting with a small blast radius (a low number of hosts) and limited magnitude (such as a minimal CPU load). Scenarios allow you to create many attacks and link them together, growing both the blast radius and magnitude over time.
All systems have a breaking point, but do you know where that is? When your system is on the verge of failure, is your customer aware?
Create a sequence of attacks within a Scenario with increasing blast radius and magnitude to cause failure on your own terms. Observe your system under low stress, at a breaking point, and in a critical state to determine next steps to improve your reliability.
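As a rough illustration of what such a sequence might look like, here is a minimal Python sketch that plans steps with a growing blast radius (host count) and magnitude (CPU load). It does not use the Gremlin API; `AttackStep`, `escalating_steps`, and all default values are hypothetical names and numbers chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackStep:
    """One hypothetical step in a sequence of attacks."""
    host_count: int   # blast radius: how many hosts are targeted
    cpu_percent: int  # magnitude: how much CPU load is applied


def escalating_steps(total_hosts, start_hosts=1, start_cpu=10,
                     cpu_step=20, max_cpu=90):
    """Plan steps that double the host count and raise CPU load each time.

    Planning stops once both the host cap and the CPU cap are reached.
    All defaults are illustrative assumptions, not recommended values.
    """
    if start_hosts < 1 or total_hosts < start_hosts or cpu_step < 1:
        raise ValueError("need 1 <= start_hosts <= total_hosts and cpu_step >= 1")

    steps = []
    hosts, cpu = start_hosts, start_cpu
    while True:
        steps.append(AttackStep(hosts, cpu))
        if hosts >= total_hosts and cpu >= max_cpu:
            break
        hosts = min(hosts * 2, total_hosts)  # grow the blast radius
        cpu = min(cpu + cpu_step, max_cpu)   # grow the magnitude
    return steps


plan = escalating_steps(total_hosts=8)
for number, step in enumerate(plan, start=1):
    print(f"step {number}: {step.host_count} host(s) at {step.cpu_percent}% CPU")
```

Starting small and capping both dimensions keeps the early steps low-risk while still reaching the load at which a breaking point is likely to show up.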
Hypothesize and Observe
Once you’ve thoughtfully created a Scenario, take the time to record your hypothesis for what you expect to happen when the Scenario runs. Once the run is complete, jot down your observations and indicate whether the result met your expectations or an incident was detected. Following this method will enable you to take action on the results of your Scenario, improving the reliability of your application.
After creating, running, and adding results to a Scenario, you can schedule it to run on an ongoing basis to verify the results of your Scenario and prevent drifting into failure. Select the days of the week you would like it to run, the maximum number of times it can run, and the time window in which it can run.
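The three scheduling settings can be thought of as a simple eligibility check. The sketch below is only a model of that idea in Python, not Gremlin's scheduler: the `Schedule` class, its field names, and the assumption that the time window does not cross midnight are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Schedule:
    """Hypothetical model of a recurring schedule for a Scenario."""
    days: set            # allowed weekdays, Monday = 0 ... Sunday = 6
    window_start: time   # earliest time of day a run may start
    window_end: time     # latest time of day a run may start
    max_runs: int        # maximum number of runs allowed in total
    runs_so_far: int = 0

    def may_run(self, now: datetime) -> bool:
        """Return True if a run is allowed at `now`.

        Assumes the window lies within a single day (start <= end).
        """
        return (
            self.runs_so_far < self.max_runs
            and now.weekday() in self.days
            and self.window_start <= now.time() <= self.window_end
        )


# Tuesdays and Thursdays, 09:00 to 17:00, at most three runs.
schedule = Schedule(days={1, 3}, window_start=time(9, 0),
                    window_end=time(17, 0), max_runs=3)

print(schedule.may_run(datetime(2024, 1, 2, 10, 0)))  # a Tuesday morning
print(schedule.may_run(datetime(2024, 1, 3, 10, 0)))  # a Wednesday morning
```

Restricting runs to a known window, such as working hours, means someone is available to observe the results and respond if the run uncovers a problem.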
Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30-day trial.