Gremlin blog

Our blog covers Chaos Engineering insights and education, product news, and our own internal experiences in running effective failure tests and improving reliability.

Subscribe to our RSS feed to receive blog posts as they are published.

Talks and videos

The Gremlin YouTube channel provides feature demos, conference talks, Site Reliability Engineering chats, company updates, and much more. We also provide playlists for Kubernetes, Chaos Engineering basics, and product demos.

Our webinar page provides access to dozens of free on-demand webinars. These range from product feature overviews to fireside chats to thought leadership.

In addition, check out the following talks:

  • The Evolution of Chaos - Chaos Engineering is the practice of intentionally injecting failure into a system to proactively identify and fix problems before they cause outages. It's an emerging discipline, but its roots are decades old. So why is it now becoming the go-to approach for building resilient systems? Why does the current state of distributed architectures make chaos the best answer to system failure?
  • Breaking Things on Purpose - Failure Testing prepares us, both socially and technically, for how our systems will behave in the face of failure. By proactively testing, we can find and fix problems before they become crises. Practice makes perfect, yet a real calamity is not a good time for training. Knowing how our systems fail is paramount to building a resilient service.
  • Monkeys in Lab Coats - In this talk, we present our experience: a fruitful industry/academic collaboration. We describe how a “big idea” -- lineage-driven fault injection -- evolved from a theoretical model into an automated failure testing system that leverages Netflix’s state-of-the-art fault injection and tracing infrastructures.


Podcasts

Break Things on Purpose is Gremlin's own podcast featuring experts such as Kelsey Hightower from Google, Paul Osman from Under Armour, Haley Tucker from Netflix, and Kolton Andrus from Gremlin.

In addition, check out these recommended podcasts:

  • Software Engineering Daily - Servers in a data center fail. Bugs in an application make it into production. Human operators make mistakes. Failure is unavoidable. Jeff and Kolton discuss failure testing, startup life, and culture at Amazon and Netflix.
  • InfoQ - QCon chair Wesley Reisz talks to Kolton Andrus, founder of Gremlin Inc. Andrus was a Chaos Engineer at Netflix, focused on the resilience of the Edge services, where he designed and built FIT, Netflix’s failure injection service. Before that, he improved the performance and reliability of the Amazon Retail website.
  • The Cloud Cast - Brian talks with Kolton Andrus about his background at Amazon and Netflix, the discipline of Chaos Engineering, the challenges of breaking things in production, and Gremlin Inc.’s approach to building better applications and systems.

Recommended reading

Visit our resources page to learn more about the "why" behind Chaos Engineering and reliability testing.