Today, we’re thrilled to announce Gremlin Free, to further our mission of making the internet more reliable for everyone. Gremlin Free lets you run both Shutdown and CPU attacks on your hosts or containers through a simple user interface, at no cost.
The practice of Chaos Engineering has expanded, with folks across DevOps, Engineering, and SRE all discussing the what and why of thoughtfully breaking things on purpose. Netflix gave rise to the practice with Chaos Monkey, their homegrown utility to randomly shut down compute instances when Netflix first migrated to AWS, and it’s grown from there.
With the ongoing migration to microservices, serverless, and cloud environments, we believe the industry has answered “why do Chaos Engineering?” and has begun asking “how do I begin practicing Chaos Engineering?” The goal is to significantly increase the reliability and resiliency of our systems and provide the best user experience possible.
Gremlin Free answers the how, making it easy for any engineering team to safely, simply, and securely start their journey towards greater reliability.
The Gremlin Shutdown attack allows you to shut down or reboot one or many hosts or containers. Before configuring the attack itself, first identify the hosts that you’d like to target.
Either select a set of specific hosts or choose to impact a random selection of tagged hosts. Although the selection is random, you configure the number or percentage of hosts to impact. Then set the delay before the attack starts, along with whether or not you’d like the host to reboot.
Finally, you can specify whether you’d like the attack to start right away or run on a schedule. In the screenshot, the attack will run at a random time within a specific window. This functionality offers a great opportunity to train team members who are new to your on-call rotation, and to ensure your runbooks are up to date.
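To make the targeting and scheduling choices described above concrete, here is a minimal Python sketch. It is not Gremlin's implementation or API; the function names `pick_targets` and `pick_start_time`, the host names, and the round-up rule for percentages are all assumptions made for illustration.

```python
import math
import random
from datetime import datetime, timedelta


def pick_targets(hosts, count=None, percent=None, rng=random):
    """Choose a random subset of hosts by absolute count or by percentage.

    Hypothetical helper: illustrates "impact N hosts or N% of tagged hosts".
    """
    if (count is None) == (percent is None):
        raise ValueError("specify exactly one of count or percent")
    if percent is not None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        # Assumption: round up, so a non-zero percentage hits at least one host.
        count = math.ceil(len(hosts) * percent / 100)
    if not 0 <= count <= len(hosts):
        raise ValueError("count must be between 0 and the number of hosts")
    return rng.sample(list(hosts), count)


def pick_start_time(window_start, window_end, rng=random):
    """Choose a uniformly random start time inside the given window."""
    if window_end < window_start:
        raise ValueError("window_end must not be before window_start")
    span_seconds = (window_end - window_start).total_seconds()
    return window_start + timedelta(seconds=rng.uniform(0, span_seconds))


if __name__ == "__main__":
    tagged_hosts = [f"web-{i}" for i in range(10)]  # hypothetical host names
    print(pick_targets(tagged_hosts, percent=25))
    print(pick_start_time(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 17)))
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which can be handy when you want to replay the same experiment after a fix.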
Curious how your autoscaled instances behave when their CPUs are consumed? Simply target the instances, select how many cores to consume and for how long, and using your favorite monitoring tool, watch the CPU consumption increase along with the number of instances in place to handle your traffic.
After running your first attacks and observing the effects on your end user experience, your team can get busy fixing any critical issues that may have been exposed. Once fixes are in place, simply re-run the same attacks to verify your newfound reliability.
Along with Shutdown and CPU attacks, Gremlin Free delivers Failure-as-a-Service through an intuitive UI, CLI, and API, giving you full control at all times. Quickly halt and revert all attacks at the click of a button, returning your hosts to a healthy state. Rest assured that if the Gremlin client ever loses communication with our control plane, all attacks will be halted and reverted.
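The fail-safe behavior described above resembles a general pattern often called a dead man's switch: the client reverts on its own if it stops hearing from the controller. The Python sketch below illustrates that pattern only. The `AttackWatchdog` class, its timeout, and the `revert` callback are hypothetical and are not taken from the Gremlin client.

```python
import time


class AttackWatchdog:
    """Halts and reverts an attack when the control plane goes quiet.

    Hypothetical illustration of a dead man's switch, not Gremlin's code.
    """

    def __init__(self, timeout_seconds, revert, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._revert = revert          # callback that undoes the attack
        self._clock = clock            # injectable so the logic is testable
        self._last_contact = clock()
        self.halted = False

    def heartbeat(self):
        """Record a successful check-in with the control plane."""
        self._last_contact = self._clock()

    def halt(self):
        """Revert the attack once; later calls do nothing."""
        if not self.halted:
            self.halted = True
            self._revert()

    def check(self):
        """Halt if the last check-in is older than the timeout."""
        if self._clock() - self._last_contact > self._timeout:
            self.halt()
        return self.halted
```

In this sketch the attack loop would call `check()` on every iteration and `heartbeat()` after each successful check-in, so a lost connection results in a revert without any instruction from the server. The same `halt()` path can serve a manual "halt all" button.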
Downtime is expensive, and it’s critical to be proactive in preventing it. Gartner cites an average per-company cost of $5,600 per minute (more than $300,000 per hour), and for top eCommerce websites, that figure can be millions per hour. Our team is made up of engineers and on-call leaders from Amazon, Netflix, Google, and Dropbox, who have developed Gremlin Free to help your team better understand its systems and identify weaknesses before they cause outages and impact customers.
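As a quick check on the per-minute figure quoted above, the arithmetic can be sketched in a few lines of Python. The function name is illustrative, and the per-minute cost is the cited average, not a number for any particular company.

```python
COST_PER_MINUTE_USD = 5_600  # average figure cited above


def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE_USD):
    """Estimated cost in USD of an outage lasting the given number of minutes."""
    return minutes * cost_per_minute


print(downtime_cost(60))  # 336000, i.e. more than $300,000 per hour
print(downtime_cost(15))  # 84000 for a 15-minute outage
```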
Sign up for Gremlin Free to get started with Chaos Engineering. If you have any questions, check out our documentation, our Shutdown Experiment Pack, or get in touch on the Chaos Engineering #support channel.
Tammy Butow, Principal Site Reliability Engineer