CHAOS CONF IS THE LARGEST CHAOS ENGINEERING EVENT IN HISTORY

Watch all 18 talks from Chaos Conf 2020, and join us for the next Chaos Conf!

CHAOS ENGINEERING: THE PATH TO RELIABILITY

Kolton Andrus, Gremlin

We’re all here for the same purpose: to ensure the systems we build operate reliably. This is a hard task, one that must balance people, process, and technology under difficult conditions. We operate with incomplete information, assessing risks and dealing with emerging issues. We’ve found Chaos Engineering to be a valuable tool in addressing these concerns. Learn from real-world examples what works, what doesn’t, and what the future holds.


TOP 5 THINGS YOU CAN DO TO REDUCE OPERATIONAL LOAD

Rachel Obstler, PagerDuty

With the world shifting to everything online, digital dependency and pressure are higher than ever. In March, PagerDuty saw incidents double across the board for its customers, with significant spikes in industries like online learning and ecommerce. The pressure isn't letting up, nor are customer expectations. Based on PagerDuty's data and conversations with thousands of customers, Rachel will talk about the easiest things you can do to make a big difference in reducing operational work from incidents. She'll also discuss ways to reduce duplicative efforts, surface issues, and improve response times to build more reliable teams.


FAILING OVER WITHOUT FALLING OVER

Adrian Cockcroft, Amazon Web Services

Many organizations have disaster recovery (DR) failover plans that are poorly tested and implemented, and they are scared to test or use them in a realistic manner. This talk will show how we can use System Theoretic Process Analysis (STPA), as advocated by Professor Nancy Leveson’s team at MIT, to analyze failover hazards. Observability and human understanding of safety margins and the state of a failover are critical to having a real DR capability. Chaos engineering, game days, and a high level of automation provide continuously tested resilience, and confidence that systems will fail over without falling over.


SCALING RELIABILITY

Nate Vogel, Charter Communications

How do you build a culture of reliability in a massive organization with well-established expectations of how to operate? A common assumption about enterprises is that everything moves at a glacial pace. After growing Charter’s product data engineering team from a handful of engineers to 30, the company implemented a large reorg. The new data platforms group quadrupled in size to over 120 engineers and took on responsibility for a mission-critical services platform that backs customer self-service digital applications and portals. This set of services needed to grow its reliability and Chaos Engineering practice. Nate Vogel, VP, Data Platforms, will share how he grew the data engineering team with an emphasis on building a culture of reliability. He’ll discuss the processes and tools his team used to ensure Charter and its customers have the data and analytics necessary to drive the business. Nate will also provide insight on how to share a culture of reliability in the face of sudden team expansion.


STABILIZING AND REINFORCING H-E-B'S EXISTING CURBSIDE FULFILLMENT SYSTEMS WHILE REINVENTING THEM

Justin Turner, H-E-B

While reinventing H-E-B's curbside and home delivery fulfillment systems, we had to spend significant effort to stabilize and reinforce the existing mission-critical systems to give us the cover needed to get to the finish line. It took a blend of using new services as anti-corruption layers and addressing complex technical debt and performance issues to improve our uptime and reduce business impact. It also took our newly developed chaos engineering mindset to get creative in introducing failure to validate our fixes.


THE MORE YOU KNOW: A GUIDE TO UNDERSTANDING YOUR SYSTEMS

Tyler Wells, Twilio

As a platform provider, incidents and outages cost our customers money. It doesn't matter what your role is (developer, quality engineer, SRE, or even technical management): you must deliver trust. Delivering trust is accomplished by shipping secure and reliable systems, and you have to know your systems in order to do that. I'll share how we developed a template that enables anyone at Twilio to understand their systems better, identify critical metrics to watch, and use Chaos Engineering to verify it all.


LET DEVS BE DEVS: ABSTRACTING AWAY COMPLIANCE AND RELIABILITY TO ACCELERATE MODERN CLOUD DEPLOYMENTS

Rahul Arya, JPMC

Reliability gets harder as complexity grows, and it makes shipping software difficult. The rigorous compliance requirements of the financial industry pose further challenges to developer velocity on modern cloud platforms. When you scale that up to an organization of JP Morgan Chase’s size, with over 6500 apps and 50,000 engineers working across a global organization, it can bring everything to a grinding halt. In this session, Rahul Arya, Managing Director & Head of Global Technology Solutions Architecture at JPMC, will share how they built a platform to abstract away compliance, make reliability with Chaos Engineering completely self-serve, and enable developers across the organization to ship code faster than ever.


CAN CHAOS COERCE CLARITY FROM COMPOUNDING COMPLEXITY? CERTAINLY.

Matt Simons, Workiva

Let's go Black Swan hunting together. No no -- you can leave the guns at home. The camo too. No bait, traps, dogs, or calls needed. This is a very different kind of hunting, and the tool we need is chaos. You see, the swans we're hunting aren't sitting in a tranquil pond or gliding majestically over a clear lake on a beautiful, sunny day. These swans are hiding in your products. They are hiding in your architecture, your infrastructure, and every dark-corner-turned-refuge created by layer upon layer of increasing system complexity. And these swans, these Black Swans, are not friendly or majestic creatures. They are wild, coked-out maniacs, whose singular purpose is to watch your products burn. So suit up! Grab some coffee, put on something comfortable, and follow me, chaos tools in hand. Let's get some birds.


IBM’S PRINCIPLES OF CHAOS ENGINEERING

Haytham Elkhoja, IBM

IBM has a long history of improving the reliability and availability of systems ranging from the largest of mainframes to the smallest of microservices. As part of cultural and organisational improvements, we’ve sat down and codified a list of Chaos Engineering principles which define our view of Chaos Engineering. These principles do not replace existing principles, but adapt them and match them to the requirements we have from our clients and from our own internal services. In this session we will describe a little of the process of getting engineers from across the company to agree on these principles, and present the principles and lessons we agreed upon.


CULTURING RESILIENCY WITH DATA: A TAXONOMY OF OUTAGES

Ranjib Dey, IT Revolution

This talk provides an overview of the categorization of outages that happened at Uber in the past few years, based on root cause types. We'll start with some background information, including definitions, the incident management framework, and existing preventive techniques, aka best practices. This is followed by details and rationale around individual categories, sub-categories, and their relative distribution. Then we'll dive deep into two of the biggest categories, deployment and capacity, with a focus on time-series-based data mining techniques to assist detection and simulation of some of the common root causes. Finally, we'll discuss the propagation of lessons learned in terms of policy and process changes based on these insights.


CONVERGENCE OF CHAOS ENGINEERING AND REVOLUTIONIZED TECHNOLOGY TECHNIQUES

Yury Niño Roa, ADL Digital Labs

Novel research areas such as the Internet of Things (IoT), Artificial Intelligence (AI), Cybersecurity, and Human Augmentation (HA) have demonstrated great potential for solving specific problems. Medicine, Transportation, Software, Education, and Finance have all benefited from their progress. However, reaching this success requires assuming risks and failing many times to gain resilience. This journey involves terms and techniques that we study in Chaos Engineering, so in this talk we will explore how these emerging paradigms can use Chaos Engineering to manage the pains on the path toward providing a solution. Conversely, we will show how Chaos Engineering can benefit from Artificial Intelligence, for example. Further, we will propose a conceptual model to explore the influence of these emerging paradigms on Chaos Engineering and how to use the Chaos Principles to identify risks and vulnerabilities and to generate resilience solutions.


BREAKING SERVERLESS THINGS ON PURPOSE: CHAOS ENGINEERING IN STATELESS ENVIRONMENTS

Emrah Şamdan, Thundra

Serverless enabled us to build highly distributed applications that led to more granular functions and ultimate scalability. However, it also spread the risk of failure from a single microservice to many serverless functions and resources. You might be able to predict and design for certain troublesome issues, but there are many, many more that you probably will not be able to easily plan for. How do you build a resilient system under these highly distributed circumstances? The answer is Chaos Engineering: breaking things on purpose just to experience how the whole system will react.


IDENTIFYING HIDDEN DEPENDENCIES

Liz Fong-Jones, Honeycomb.io

You don't need to write automation or deploy on Kubernetes to gain benefits from resilience engineering! Learn how Honeycomb improved the reliability of our Zookeeper, Kafka, and stateful storage systems through terminating nodes on purpose. We'll discuss the initial manual experiments we ran, the bugs in our automatic replacement tools we uncovered, and what steps we needed to progress towards continuously running the experiments. Today, no node at Honeycomb lives longer than 12 months, and we automatically recycle nodes every week.


Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.
