WEBINAR

Navigating the Reliability Minefield

Finding and Fixing Your Hidden Reliability Risks

Reliability risks lurk everywhere in complex cloud architectures, especially when they’re scaled to an enterprise level. But how do you know where all of those risks are? How can you distinguish minor risks with low impact and low probability from risks that are likely to become catastrophic outages? And more importantly, how do you drive conversations with the broader engineering organization about which risks need mitigation now?

Designed by reliability and chaos engineering experts, Gremlin’s Reliability Tracker gives you a working reliability map of the services and most likely failure scenarios in your organization.

On-demand

About this webinar

By combining the Reliability Tracker spreadsheet and methodology with reliability testing, you’ll be able to test your systems, find reliability risks, and know what will happen if they fail, then prioritize your engineering efforts to stop disruptive outages before they happen.

Agenda

In this webinar, Sam Rossoff, Principal Engineer and one of the creators of Gremlin’s Reliability Tracker, will use it to walk you through performing a Reliability Risk Assessment, and show you how to:

Test your systems
Find reliability risks
Know what will happen if they fail
Prioritize your engineering efforts to stop disruptive outages before they happen

About the speakers

Sam Rossoff

Principal Software Engineer

Gremlin

Sam is currently a Principal Engineer at Gremlin, designing and implementing core components of the Gremlin platform. His work enables Gremlin customers to improve the reliability of their applications and infrastructure. Before his time at Gremlin, he held engineering roles at Snapchat, Amazon, Nokia Research and Activision.

Andre Newman

Sr. Reliability Specialist

Gremlin

At Gremlin, Andre promotes the benefits of Chaos Engineering and reliability testing to engineering teams around the world, including at some of the largest enterprise organizations. Prior to Gremlin, he created technical content explaining Kubernetes and containerization, the shift to cloud computing, DevOps, observability, and more. His work has been featured in The New Stack, DZone, Software Engineering Daily, TechBeacon, and StatusCode Weekly.

Check out other webinars from Gremlin

How to test zone redundancy using Gremlin

How to run Chaos Engineering experiments in your CI/CD pipeline

How to test your systems for scalability and redundancy with Fault Injection

How to find Kubernetes reliability risks with Gremlin

How to find and test critical dependencies with Gremlin

Kubernetes Reliability Risks

Enterprise Chaos Engineering Certification Prep Session

More Reliability, Less Firefighting

Automate Reliability in Your CI/CD Pipeline

Secure yourself against expiring TLS certificates

Building a Culture of Reliability

Preparing for Traffic Spikes with Chaos Engineering

Introduction to Chaos Engineering

Validate Your Disaster Recovery Strategy: Ensuring Your Plan Works

This is Fine: The SRE's Guide to Chaos & Observability

The Road to Reliability

Running Your First 5 Chaos Experiments on Kubernetes

Serverless resilience: How to Build a Reliable Serverless Platform

RELIABILITY: The Next Big Development Trend

Recreating 3 Common Outages with Gremlin Scenarios

Reducing Trauma in Production with SLOs and Chaos Engineering

Beyond Chaos: Reliability in the Age of Cloud Native

Improving Network Resiliency & Performance with Network Attacks

Beyond Chaos Engineering: Using Reliability Scores to Drive Real Results

Planning and Architecting for Reliability - Part 1

Planning and Architecting for Reliability - Part 2

Organizational Reliability: What, Why, and How?

Improving system stability with Gremlin Resource Attacks

Is Your Microservice Actually a Distributed Monolith?

Incident Repro & Playbook Validation with Chaos Engineering

Gremlin Chaos Engineering Professional Certificate Prep Session

How Twilio Built a Culture of Reliability

Improving the Reliability of Financial Services with Chaos Engineering

Improving system resiliency with Gremlin State attacks

Improving Incident Management and Postmortem Analysis at Google

Getting Started With Chaos Engineering

Partner Update

Gremlin Chaos Engineering Practitioner Certificate Prep Session

Full-service Ownership: Owning Your Service from Code to Production

GameDays: Preparing Systems for the Real World

Five Hidden Barriers to Chaos Engineering Success

Chaos Engineering: When the Network Breaks

Continuous Validation of the AWS Well-Architected Framework with Chaos Engineering

Chaos Engineering: Test Your Systems NOT Your People

How to Baseline and Improve Reliability with Automated Scoring

Introduction to Chaos Engineering with Microsoft Azure

Achieving SLO Success with Golden Signals and Reliability Testing

Proactively improve reliability

Explore our tutorials to learn about the technologies and processes that help you manage reliability to a higher standard.

Chaos Engineering: the history, principles, and practice

How To Establish a High Severity Incident Management Program

4 Chaos Experiments to Start With

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.