Democratizing Chaos Engineering and progressing from why to how
It’s been a busy year. We launched new products, attended events around the world, hosted the second annual Chaos Conf and spent a lot of time engaging the community and educating those interested in learning more about Chaos Engineering.
A key takeaway from 2019 is that the industry has matured significantly in its Chaos Engineering practices. In 2018, we often found ourselves explaining to people why they should be doing Chaos Engineering. Today, the people we meet understand that every business is becoming an online business; that microservices and distributed systems are on the rise; and that we rely more and more on complex software for our businesses, our health, and even our safety...
In a recent report, Gartner made the prediction that by 2023, 40% of organizations will implement Chaos Engineering as part of their DevOps initiatives. No longer are people asking us why they should do Chaos Engineering; the question we get now is how to do it effectively. This is a responsibility we take seriously, along with our core values to make running these experiments simple, safe, and secure.
With all that said, we wanted to share some highlights from the past year. It wouldn’t have been possible without our amazing community—so thank you. 💚
For a long time, when people heard the term ‘Chaos Engineering’ they immediately thought of Netflix’s open source tool Chaos Monkey. This has been a gift and a curse for Gremlin: On one hand, our CEO and Co-Founder Kolton worked at Netflix and helped build their 2nd generation of fault injection tooling. On the other hand, Chaos Monkey fuels the perception that Chaos Engineering is all about randomly breaking things, when in actuality the discipline has matured to a more sophisticated, scientific approach.
The truth is that Chaos Monkey is a bit difficult to use and maintain, it only works on AWS, and it only has one attack mode: randomly killing servers. It lacks the safety and security features we find critical to a successful Chaos Engineering program, as well as a user interface that’s useful in getting people up and running.
All that said: It’s free. And if you’re a team or company that is interested in Chaos Engineering but not convinced, we can see why a free tool would be compelling. So we decided to create Gremlin Free! It gives you everything offered in Chaos Monkey, plus some other neat features that add a ton of value. Gremlin Free can randomly shut down servers not only on AWS but on any cloud, and you can be much more targeted in your approach. We also added the CPU attack so that you can simulate situations like traffic spikes and see how your systems handle those stressful moments.
Read more about getting started with Gremlin Free here.
Your wish was our command: Scenarios was based on direct user feedback. It was our productized response to the question of how to get started with Chaos Engineering, which we heard over and over in 2019.
To be honest, when we founded Gremlin in 2016, our inclination was to provide engineers with raw tooling that let them approach these problems however they saw fit. What we found pretty quickly, however, was significant demand for more guidance within the product. So last year at Chaos Conf we released six out-of-the-box scenarios that simulated real-world outages: essentially a chain of individual attacks that, when grouped together, led to disaster. This has helped companies deal with unreliable networks, unavailable dependencies, and even region evacuations.
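To make the idea of "a chain of individual attacks" concrete, here is a minimal Python sketch of a scenario as an ordered list of steps with a safety check between them. This is not Gremlin's API or implementation: `AttackStep`, `Scenario`, the step names, the magnitudes, and the health check are all hypothetical, chosen only to illustrate the concept.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AttackStep:
    """One hypothetical attack in a scenario (names are illustrative only)."""
    name: str
    magnitude: int  # e.g. milliseconds of latency or percent of traffic dropped


@dataclass
class Scenario:
    """An ordered chain of attack steps with a safety check after each step."""
    title: str
    steps: List[AttackStep] = field(default_factory=list)

    def run(self, is_healthy: Callable[[AttackStep], bool]) -> List[str]:
        """Run steps in order and halt as soon as the system looks unhealthy."""
        log = []
        for step in self.steps:
            log.append(f"ran {step.name} at {step.magnitude}")
            if not is_healthy(step):
                log.append(f"halted after {step.name}")
                break
        return log


# Hypothetical "unavailable dependency" scenario: escalate latency, then cut traffic.
scenario = Scenario(
    title="Unavailable dependency",
    steps=[
        AttackStep("latency", 100),
        AttackStep("latency", 500),
        AttackStep("blackhole", 100),
    ],
)

# Stand-in health check: pretend the system degrades once latency exceeds 300 ms.
result = scenario.run(
    lambda step: not (step.name == "latency" and step.magnitude > 300)
)
print(result)
```

In this sketch the run stops before the final step because the stand-in health check fails at 500 ms of latency. Halting early once the system degrades is the kind of safety behavior a grouped experiment needs, so that escalating steps do not turn a test into a real outage.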
Read more about leveraging Scenarios here.
Chaos Engineering on Kubernetes
It’s important for us to keep up with technology trends and make sure that our solutions address various architectures, environments, and use cases. We had an amazing time meeting so many of you at KubeCon, and it’s obvious that Kubernetes has become the default option for orchestrating containers. Given the highly dynamic, ephemeral, and complex nature of Kubernetes, we wanted to build something that made it easy to target entire pods or specific containers within those pods, in order to give our users a better sense of how their Kubernetes infrastructure actually behaves in production. This was one of the best-received updates to our product in 2019.
Read more about Chaos Engineering on Kubernetes here.
SOC 2 Type II Certified
One of our core principles is a commitment to security. Our SOC report, compiled by Peterson & Sullivan, verifies the existence of internal controls designed and implemented to meet the requirements for the security principles set forth in the Trust Services Principles and Criteria for Security. It provides a thorough review of how Gremlin’s internal controls affect the security, availability, processing integrity, and confidentiality of the systems it uses to process users’ data, and the confidentiality and privacy of the information processed by these systems. This independent validation of security controls is crucial for our customers in highly regulated industries.
Read more about our commitment to security here.
The Second Annual Chaos Conf
Chaos Conf 2019 was held in September at the historic Regency Ballroom in San Francisco. The event doubled in size (we still can’t believe it!) and we can’t say enough about how grateful we are to the community for making it possible. You can read all about it in our recap post or by searching the #ChaosConf hashtag on Twitter. Follow the Twitter handle @ChaosConf to keep an eye on updates for 2020.
By The Numbers
In 2019, we scaled our Chaos Engineering bootcamps and hosted 20 of them across the USA. That means our team had the opportunity to meet over a thousand of you in person and lead in-depth, hands-on training. We also gave 70 talks on Chaos Engineering across the globe and ended the year with 4,000+ people in the Chaos Engineering Community Slack. If you’d like to host a bootcamp, or have Gremlin get involved with your event, shoot us an email at firstname.lastname@example.org. You can also find where we’ll be next by checking our events page. Here’s to 2020!