Our blog covers Chaos Engineering insights and education, product news, and our own internal experiences in running effective failure tests and improving reliability.
Subscribe to our RSS feed to receive blog posts as they are published.
The Gremlin YouTube channel provides feature demos, conference talks, Site Reliability Engineering chats, company updates, and much more. We also provide playlists for Kubernetes, Chaos Engineering basics, and product demos.
Our webinar page provides access to dozens of free, on-demand webinars. These range from product feature overviews to fireside chats to thought leadership.
In addition, check out the following talks:
- The Evolution of Chaos - Chaos Engineering is intentionally injecting failure into a system to proactively identify and fix problems before they cause outages. It's an emerging discipline, but its roots are decades old. So why is it now becoming the go-to approach for building resilient systems? Why does the current state of distributed architectures require chaos as the best solution for system failure?
- Breaking Things on Purpose - Failure Testing prepares us, both socially and technically, for how our systems will behave in the face of failure. By proactively testing, we can find and fix problems before they become crises. Practice makes perfect, yet a real calamity is not a good time for training. Knowing how our systems fail is paramount to building a resilient service.
- Monkeys in Lab Coats - In this talk, we present our experience: a fruitful industry/academic collaboration. We describe how a “big idea” -- lineage-driven fault injection -- evolved from a theoretical model into an automated failure testing system that leverages Netflix’s state-of-the-art fault injection and tracing infrastructures.
Break Things on Purpose is Gremlin's own podcast featuring experts such as Kelsey Hightower from Google, Paul Osman from Under Armour, Haley Tucker from Netflix, and Kolton Andrus from Gremlin.
In addition, check out these recommended podcasts:
- Software Engineering Daily - Servers in a data center fail. Bugs in an application make it into production. Human operators make mistakes. Failure is unavoidable. Jeff and Kolton discuss failure testing, startup life, and culture at Amazon and Netflix.
- InfoQ - QCon chair Wesley Reisz talks to Kolton Andrus, founder of Gremlin Inc. He was a Chaos Engineer at Netflix, focused on the resilience of the Edge services. He designed and built FIT: Netflix’s failure injection service. Before that, he improved the performance and reliability of the Amazon Retail website.
- The Cloud Cast - Brian talks with Kolton Andrus about his background at Amazon and Netflix, the discipline of Chaos Engineering, the challenges of breaking things in production, and Gremlin Inc’s approach to building better applications and systems.
Visit our resources page to learn more about the "why" behind Chaos Engineering and reliability testing. In addition, check out these recommended reading materials:
- Automating Failure Testing Research at Internet Scale - ucsc.edu - In this paper, we describe how we adapted and implemented a research prototype called lineage-driven fault injection (LDFI) to automate failure testing at Netflix.
- On Designing and Deploying Internet-Scale Services - James Hamilton - Proceedings of the 21st Large Installation System Administration Conference (LISA '07)
- Chaos Engineering - Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones & Ali Basiri - Netflix
- Site Reliability Engineering - Betsy Beyer, Chris Jones, Jennifer Petoff & Niall Murphy - Google
- Antifragile - Nassim Nicholas Taleb
- Drift into Failure - Sidney Dekker