Gremlin Scenario: Linux host redundancy

Description

Test resilience to host failures by shutting down a randomly selected Linux host. Verify that your platform automatically restarts or replaces it.

What this Scenario does

This Scenario shuts down a randomly selected Linux host, simulating an unexpected host failure. This forces your infrastructure to detect the failure and initiate recovery, whether through auto-scaling groups, load balancer health checks, or manual failover processes.

Why run this Scenario?

This Scenario uses the same principle as Chaos Monkey: if a host or container shuts down unexpectedly, the underlying platform should detect this and automatically restart or replace it.

Validate that Linux instances restart within your target recovery time and that workloads successfully migrate to healthy hosts.
Verify that load balancers automatically route traffic away from the failed Linux host.
Test that losing a critical node (such as a Kafka broker or database primary) doesn't cause a split-brain scenario.
Build the same confidence as Netflix's Chaos Monkey approach: if a host shuts down unexpectedly, the platform handles it automatically.
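
To put a number on "restarts within your target recovery time", you can poll a health probe while the Scenario runs and record how long the service takes to come back. The following is a minimal sketch, not part of Gremlin: the function name `measure_recovery_seconds`, its parameters, and the default timeout and interval are all hypothetical, and the probe itself is left to you because it depends on your environment.

```python
import time
from typing import Callable, Optional


def measure_recovery_seconds(
    is_healthy: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll is_healthy() until it returns True and return the elapsed seconds.

    Returns None if the probe is still failing once timeout_s has elapsed.
    The clock and sleep parameters exist only so the loop can be tested
    without waiting in real time.
    """
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Start the measurement right after the host shuts down, with `is_healthy` wrapping whatever check matters to you (for example, a request through your load balancer). A `None` result means the service did not recover within the window you allowed.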

Expected outcome

When a Linux host fails, the cloud platform or infrastructure automatically restarts or replaces it, and workloads migrate to healthy instances.
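
One way to check this outcome is to compare the set of healthy hosts before the Scenario with the set observed after recovery. The sketch below assumes you can obtain those sets from your own inventory or monitoring system; the function `classify_recovery` and its return labels are hypothetical and only illustrate the comparison.

```python
from typing import Set


def classify_recovery(
    healthy_before: Set[str],
    healthy_after: Set[str],
    failed_host: str,
) -> str:
    """Classify what happened to capacity after a host was shut down.

    Returns "restarted" if the same host ID is healthy again, "replaced" if
    healthy capacity is back but under a different host ID, and "degraded"
    if fewer hosts are healthy than before the Scenario.
    """
    if failed_host not in healthy_before:
        raise ValueError(f"{failed_host} was not healthy before the Scenario")
    if len(healthy_after) < len(healthy_before):
        return "degraded"
    if failed_host in healthy_after:
        return "restarted"
    return "replaced"
```

A result of "restarted" or "replaced" matches the expected outcome. "degraded" suggests the platform did not restore capacity in the time you waited before taking the second snapshot.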