How to test your systems for scalability and redundancy with Fault Injection

Office Hours

Do you know if your services can tolerate losing a node? What about an entire availability zone? Or a region?

Large-scale outages aren’t unheard of. When you’re running critical services, it’s vital that those services can keep running even if an AZ or region fails. In addition to failing over, these services also need to scale quickly so traffic shifts don’t overwhelm your systems. How do you prove that a service is both scalable and redundant? The answer is Fault Injection.
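To make the traffic-shift concern concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes traffic is spread evenly across zones and that surviving zones absorb a failed zone's share equally; the function names and the 100% capacity threshold are illustrative assumptions, not part of any Gremlin API.

```python
def load_after_zone_loss(zones: int, utilization: float, lost: int = 1) -> float:
    """Per-zone utilization after `lost` zones fail.

    Assumes (hypothetically) that every zone carries the same load and that
    the survivors split the displaced traffic evenly.
    """
    if not 0 < lost < zones:
        raise ValueError("lost must be between 1 and zones - 1")
    return utilization * zones / (zones - lost)


def survives(zones: int, utilization: float, lost: int = 1, capacity: float = 1.0) -> bool:
    """True if the surviving zones stay at or below `capacity` without scaling up."""
    return load_after_zone_loss(zones, utilization, lost) <= capacity


if __name__ == "__main__":
    # Three zones at different steady-state utilizations, losing one zone.
    for u in (0.5, 0.6, 0.7):
        after = load_after_zone_loss(zones=3, utilization=u)
        verdict = "ok" if survives(3, u) else "overloaded unless it scales"
        print(f"{u:.0%} per zone before -> {after:.0%} per zone after: {verdict}")
```

Arithmetic like this only gives a prediction. A Fault Injection experiment is how you check whether the real system behaves the way the estimate says it should.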

In this webinar, we’ll show you how to verify the scalability and redundancy of your systems by testing them directly. We’ll use Fault Injection to simulate large-scale failures, use observability tools to monitor the state of our systems, and discuss ways of using our findings to make our systems more resilient.

You'll learn:

  • What is Fault Injection? Learn how simulating incidents is the first step towards resolving them.
  • How to run blackhole and shutdown experiments using Gremlin.
  • How to use observability to monitor your system's response, then use these insights to make reliability improvements.

About the speakers

Andre Newman

Sr. Reliability Specialist

At Gremlin, Andre promotes the benefits of Chaos Engineering and reliability testing to engineering teams around the world, including at some of the largest enterprise organizations. Prior to Gremlin, he created technical content explaining Kubernetes and containerization, the shift to cloud computing, DevOps, observability, and more. His work has been featured in The New Stack, DZone, Software Engineering Daily, TechBeacon, and StatusCode Weekly.

Dan Muret

Sr. Solutions Architect

At Gremlin, Dan works closely with organizations to understand, design, and implement Chaos Engineering and reliability testing practices. Prior to Gremlin, he worked as a system administrator and solutions architect for companies like IBM, Zerto, and Veeam/Kasten. Dan’s real-world experience in system architecture, cloud migrations, disaster recovery, and resilience testing helps him guide companies to make the most out of their reliability and Chaos Engineering efforts.

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.
