Chaos Conf 2019 Recap
Chaos Conf 2019 was held on September 26 at the historic Regency Ballroom in San Francisco. I wasn’t able to attend Chaos Conf 2018 (and had huge amounts of FOMO as I watched people tweet about it), so it was great to not only be able to attend this year, but to help select the talks for the conference. We had a great crowd of people excited to learn more about Chaos Engineering. The weather decided to add a little extra chaos to the conference by spiking to highs not seen since the 80s, which we appreciate everyone taking in stride. We'll start working on a scenario for next year 😀!
The day started off with a welcome from Gremlin CEO Kolton Andrus (check out Kolton’s slides). Kolton talked about the S3 outage of 2017, which impacted so many companies. Kolton then announced a new Gremlin feature that launched the day of the conference called Scenarios, which lets people run Chaos Engineering experiments based on real-world outages they may have experienced. Some examples are Unavailable Dependencies (like S3 being unavailable) or a DNS failure like the one many people experienced during the Dyn DDoS attack. (If you’d like more info on Gremlin Scenarios you can read the announcement.)
The opening keynote was from Dave Rensin from Google, with a talk titled “Chaos Engineering for People Systems” (check out Dave’s slides). Dave has worked with complex distributed systems for many years, but he said that companies are a system of human microservices. Dave suggested four experiments to run in a company, to test the company’s resiliency. The experiments were to simulate an employee being unexpectedly out of the office, to simulate latency in communications, to simulate people getting information that may be inaccurate, and to simulate an existential emergency for the company. These ideas were a lot of fun, and it was interesting to see the kinds of experiments we perform on computing systems applied to people.
Companies are distributed systems. Most of the complexity comes from the humans, not the machines.
I told Dave afterward that I thought the idea of someone giving out unreliable information was pretty evil.
Subbu Allamaraju from Expedia spoke next about Forming Failure Hypothesis (check out Subbu’s slides). Subbu described going through a period where his organization didn’t seem to be making progress with Chaos Engineering, and he experienced self-doubt. This led him to research incidents that happened across the company, to try to understand better why they occurred. Subbu said we need to learn from incidents, and make decisions based on what kind of business value our work will deliver.
Next up was Caroline Dickey from Mailchimp (check out Caroline's slides), with a talk called Think Big: Chaos Testing a Monolith. Caroline explained that many people think of Chaos Engineering as something you do in a microservice environment, but that Mailchimp has a monolithic app with over 23 million lines of code, mostly PHP. Caroline also mentioned that within a monolith you still have separate "functionality" that you're dependent on. The team at Mailchimp has gained value from their Chaos Engineering by using experiments to test things like load balancers and databases, to validate changes, and to test for what happens when both internal and external dependencies fail. Caroline’s talk contained a lot of great information about GameDays her team has run at Mailchimp, including screenshots of the results.
After lunch (delicious food trucks... nom nom nom 😋!), Paul Osman from Under Armour and Ana Medina from Gremlin talked about Embracing Chaos! (check out Ana and Paul’s slides). Both Paul and Ana are currently practicing Chaos Engineering, and have done so at previous jobs. They spoke about using Chaos Engineering to validate runbooks, for onboarding, and to learn about systems. One tidbit Paul dropped was that the famous image that became the Disaster Girl meme was taken at a fire department training exercise planned to give them practice responding to fires.
Next up were four lightning talks. Yury Niño Roa from Aval Digital Labs talked about Hot Recipes for Building Chaos Experiments, and a cookbook she is working on for chaos experiments (check out Yury’s slides).
Niran Fajemisin from Starbucks talked about how humans are fallible, and we should expect failures to happen. Niran is on the application development side in his organization, and it was really great to see him embrace this viewpoint (check out Niran’s slides).
Joyce Lin from Postman asked Who is Responsible for Chaos? Joyce did some research on folks in the Community Slack and found that people who are feeling the pain of outages tend to be the ones who are motivated to implement Chaos Engineering (check out Joyce’s slides).
Jason Yee from Datadog talked about how we can start thinking about resiliency earlier in the development process (RDD: resilience driven development). He also included code samples that you can use for both the Datadog and Gremlin APIs. His talk was titled What should I monitor? It was a great real-world example of how monitoring and Chaos Engineering can be used together (check out Jason’s slides).
Robert Ross (AKA Bobby Tables) from FireHydrant and Tammy Butow from Gremlin spoke about Incident Repro & Playbook Validation With Chaos Engineering (check out Tammy and Bobby's slides). Bobby and Tammy worked together at Dropbox, and they traveled back in time to talk about the AWS S3 outage. Tammy showed how it’s possible to reproduce S3 becoming unavailable with Gremlin’s new Unavailable Dependency Scenario. Bobby showed a hilarious recording of a fire drill they did at his office where a co-founder was trying to respond to an outage and was blocked almost immediately because he didn’t have credentials to update their status page. Chaos Engineering and fire drills are great ways to prepare for the inevitable failures. Plus this talk had lots of Bill & Ted references, which means it gets an upvote from me.
Jose Esquivel from Backcountry talked about his company’s Roadmap Towards Chaos Engineering, and about incorporating Chaos Engineering into your normal testing pyramid (check out Jose's slides). He also talked about using other patterns for reliability, like having good observability and alerting, examining your timeout and retry logic, and using circuit breakers. Jose’s team has done a lot to improve their reliability and it was great to hear from him.
Lenny Sharpe and Brian Lee from Target spoke next about Finding the Joy in Chaos Engineering (check out Lenny's slides). It was great to hear about the hard work they’ve done at Target. One aspect of implementing Chaos Engineering or any new reliability program is socializing it within the company, and getting teams excited about doing the work. Lenny and Brian’s team had the great idea of making a logo for their work, and even created a menu of Chaos Engineering experiments (with spicy options) to make the process friendlier for people just getting started. I love their team’s focus on customers and their experience, both online and in Target’s retail locations. We do reliability work because it benefits the business, but in the end it’s the customer experience that drives those benefits. It’s important to always keep the end user in mind, along with the internal users of the tooling we provide. The Target team’s focus on these values was very evident. They referred to their customers as “guests,” which I think is fantastic.
Our closing keynote came from Crystal Hirschorn from Conde Nast (check out Crystal's slides). Crystal talked about how infrastructure has become much more complex, and how even the best engineers can no longer hold accurate models of it. Crystal also shared a lot of great information from thinkers in the Safety field like Sidney Dekker, Richard Cook and David Woods. My favorite part of Crystal’s presentation was when she shared information about an incident they experienced at Conde Nast, including chat logs and screenshots. It was great to see some of our speakers be transparent about the realities of what they do. I think for our community to grow and thrive, that kind of sharing is very important.
After the conference there was a Chaos Cookout with giant Jenga and other games, and some great food and drinks. My favorite guest was a senior dog named Tina Fey, who would love to be adopted.
Thanks to the folks from Muttville for bringing these sweet senior dogs to the event 🐶🤗. And thanks to everyone who helped make Chaos Conf 2019 such a fun day: the speakers, the organizers, and everyone who participated 😀. There are now over 1000 of us in our Chaos Conf community, between the audience and folks on the livestream, and it was a great day. I’m already looking forward to Chaos Conf 2020!!