WEBINAR

Planning and Architecting for Reliability - Part 1

Don’t wait for an incident to start focusing on the reliability of your systems. Join this two-part series to take a proactive approach to reliability, so you can prevent incidents from happening in the first place.

In this first part, we map dependencies and uncover failure points to identify where to improve reliability.

On-demand

About this webinar

The reliability of your systems is crucial, but it is often put on the back burner until an incident occurs. We walk through how to take a proactive approach to reliability so you can find and fix weaknesses before they become incidents.

You’ll walk away having identified vulnerabilities and knowing how to test them for failure and how to prioritize your reliability efforts across services.

Part 1: Planning for Reliability

Lay the foundation for reliability by better understanding our complex, multi-layered architectures.
Map dependencies in a single view and identify failure points.

Part 2: Architecting for Reliability

Put reliability plans into action by testing our dependencies and vulnerabilities.
Learn how to test the technologies in your stack against common failure modes.

About the speakers

Vince Huang

Reliability Architect

Gremlin

Vincent is a Reliability Architect at Gremlin, helping teams and companies strategize, design, implement, and interpret their Chaos Engineering and resiliency efforts. Previously, he worked at LinkedIn and Twitch in Operations, Site Reliability, and Incident and Problem Management, focusing on uptime and availability.

Jacob Plicque III

Solutions Architect

Gremlin

Jacob is a Solutions Architect at Gremlin, where he works on Chaos Engineering, the facilitation of controlled experiments to identify systemic weaknesses. Jacob has worked on Network Chaos Engineering across a variety of industries including finance, e-commerce, airlines, retail, and insurance. Jacob is also the co-host of the Break Things on Purpose podcast, a series dedicated to sharing Chaos Engineering experiences. Jacob previously worked at Fanatics as a Senior SRE, where he was responsible for providing a reliable e-commerce experience that processed over 1,100 orders a minute. He has in-depth experience providing a reliable service on peak days such as Cyber Monday and Black Friday.

Check out other webinars from Gremlin

How to ensure your AWS workloads are resilient

How to test your systems for scalability and redundancy with fault injection

Improving Resilience for GenAI Workloads on AWS

How to keep track of what’s running in your Gremlin team

How to test Istio and other service meshes

How to demonstrate your reliability progress

How to build a Test Suite that fits your requirements

Building Resilience from Architecture to Production with AWS & Gremlin

Integrating Gremlin with your observability tools

How Visa Cross-Border Solutions Reduces Outages by Testing System Resilience in Their SDLC

How to test serverless applications using Failure Flags

How to Build Resilience Throughout Your SDLC: Lessons from a Top 10 Bank

Confident Cloud Migrations: How a Top 5 Bank Ensured Reliability With AWS and Gremlin

Building Resilience in the Cloud With the AWS Well Architected Framework and Gremlin

Get better reliability on AWS with our new release

5 essential resilience tests for a successful cloud migration

How to run fault injection tests on AWS managed services

How to test zone redundancy using Gremlin

How to run Chaos Engineering experiments in your CI/CD pipeline

How to find Kubernetes reliability risks with Gremlin

How to find and test critical dependencies with Gremlin

Kubernetes Reliability Risks

Enterprise Chaos Engineering Certification Prep Session

More Reliability, Less Firefighting

Automate Reliability in Your CI/CD Pipeline

Secure yourself against expiring TLS certificates

Building a Culture of Reliability

Preparing for Traffic Spikes with Chaos Engineering

Introduction to Chaos Engineering

Validate Your Disaster Recovery Strategy: Ensuring Your Plan Works

This is Fine: The SRE's Guide to Chaos & Observability

The Road to Reliability

Running Your First 5 Chaos Experiments on Kubernetes

Serverless resilience: How to Build a Reliable Serverless Platform

RELIABILITY: The Next Big Development Trend

Recreating 3 Common Outages with Gremlin Scenarios

Reducing Trauma in Production with SLOs and Chaos Engineering

Beyond Chaos: Reliability in the Age of Cloud Native

Improving Network Resiliency & Performance with Network Attacks

Beyond Chaos Engineering: Using Reliability Scores to Drive Real Results

Planning and Architecting for Reliability - Part 2

Organizational Reliability: What, Why, and How?

Improving system stability with Gremlin Resource Attacks

Is Your Microservice Actually a Distributed Monolith?

Navigating the Reliability Minefield

Incident Repro & Playbook Validation with Chaos Engineering

Gremlin Chaos Engineering Professional Certificate Prep Session

How Twilio Built a Culture of Reliability

Improving the Reliability of Financial Services with Chaos Engineering

Improving system resiliency with Gremlin State attacks

Getting Started With Chaos Engineering

Partner Update

Gremlin Chaos Engineering Practitioner Certificate Prep Session

Full-service Ownership: Owning Your Service from Code to Production

GameDays: Preparing Systems for the Real World

Five Hidden Barriers to Chaos Engineering Success

Chaos Engineering: When the Network Breaks

Continuous Validation of the AWS Well-Architected Framework with Chaos Engineering

Chaos Engineering: Test Your Systems NOT Your People

How to Baseline and Improve Reliability with Automated Scoring

Introduction to Chaos Engineering with Microsoft Azure

Achieving SLO Success with Golden Signals and Reliability Testing

Proactively improve reliability

Explore our tutorials to learn about the technologies and processes that help you manage reliability to a higher standard.

Chaos Engineering: the history, principles, and practice

How To Establish a High Severity Incident Management Program

4 Chaos Experiments to Start With

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.
