A Chaos Day is a dedicated team day focused on using Chaos Engineering to reveal weaknesses in your system. We’ve all heard of hack days and hack weeks, where you focus on building new features. Well, a Chaos Day is focused on building more resilient systems by breaking things on purpose.
Chaos Engineering means running thoughtful, planned experiments designed to reveal weaknesses in your system. Read more about Chaos Engineering here: Chaos Engineering: the history, principles, and practice.
Large companies have BCPs (Business Continuity Plans). They aren’t exercised very frequently and are often out of date. However, having a BCP means you already do experiments in production (e.g. disaster recovery testing). Chaos Engineering is focused on making these experiments automated and continuous.
Chaos Days are a fairly new concept, but they are inspired by many of the days at work we all love and look forward to, including:
GameDays were coined by Jesse Robbins when he worked at Amazon and was responsible for availability. The goal was to increase reliability by purposefully creating major failures on a regular basis. You can read more about GameDays here: How To Run a GameDay.
One of the earliest company Hack Days was held at Yahoo back in 2006. Many of our Gremlin team members (Grems as we are known to each other) have been participating in company Hack Days since the late 2000s.
Hack Days are entire days dedicated to building software with no meetings or other distractions. It’s heads-down time. Many of the greatest software features we know and love were created during Hack Days and Hack Weeks — they are a great way to encourage teamwork and drive innovation.
“Today, Facebook hopes to make security education easier and more accessible, especially for students, with the release of our Capture the Flag (CTF) platform to open source on GitHub! CTFs provide a safe and legal way to try your hand at hacking challenges.”
There are a variety of benefits for a variety of people. Let’s explore some of the common personalities and questions you may encounter.
We don’t learn by always doing things the way we have always done them.
Engineers who are on-call are your influencers when it comes to kicking off a Chaos Day at your company. They wear the pagers, they feel the pain. They will extract great value out of identifying weaknesses within their systems with Chaos Engineering.
Engineering Managers will be interested in knowing what Chaos Engineering is, but it’s also helpful to share information with them about the actual cost of downtime. Chaos Engineering practitioners have been able to demonstrate that Chaos Engineering can reduce high severity incidents (SEVs) and overall downtime experienced by customers. Ask them what it would cost the company if it were down for 10 hours. British Airways recently experienced a 10-hour outage which was estimated to cost 80 million pounds.
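As a rough illustration of the arithmetic behind that question, the sketch below turns an outage estimate into an hourly figure that you can apply to your own outage lengths. The function names are hypothetical, and the calculation assumes cost scales linearly with duration, which is a simplification.

```python
def hourly_downtime_cost(total_cost: float, outage_hours: float) -> float:
    """Average cost per hour of an outage, assuming cost scales linearly."""
    if outage_hours <= 0:
        raise ValueError("outage_hours must be positive")
    return total_cost / outage_hours


def estimated_outage_cost(cost_per_hour: float, outage_hours: float) -> float:
    """Estimated cost of an outage of the given length at a flat hourly rate."""
    return cost_per_hour * outage_hours


if __name__ == "__main__":
    # Figures quoted in the text: a 10-hour outage estimated at 80 million pounds.
    per_hour = hourly_downtime_cost(80_000_000, 10)
    print(f"Roughly £{per_hour:,.0f} per hour of downtime")
```

Swap in your own company's estimate to make the conversation concrete.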
Engineering Directors, VPs and CTOs will want you to explain why they should consider practicing Chaos Engineering. Here it’s helpful to highlight the business impact of incidents: the cost of downtime, the loss of customer trust, and the engineering resources spent fixing issues.
There’s always someone (and you probably already know who they are). They will likely tell you that they are too busy and don’t have time for new practices. A great tip for interacting with moaners is to explain to them “we don’t learn by always doing things the way we have always done them.”
You may also find yourself standing in front of an opponent. An opponent is likely to tell you that “there’s so much chaos already”. This is a great conversation starter. Ask them what chaos they are dealing with at the moment. Then speak with them about the impact of high severity incidents on the business and all the teams across the company. Ask them what they think the top 5 most critical services are. Ask them which systems they think are the most fragile. Ask them which system they would start with if they were to practice Chaos Engineering. Try to get them involved so they see the value of being proactive.
There are some people you will meet on this journey who will proudly explain that what they do is the only way to improve reliability. For example, they might be diehard fans of unit tests. That too is an excellent conversation starter. You can follow it with “Chaos Engineering is unit tests for alerting and monitoring.”
There will always be skeptics and they will usually outright tell you “we won’t get value from this”. An important conversation to have with skeptics is one focused on defense protection and training. Security teams in large companies will often have security engineers train the greater engineering team on how to write more secure code. Who is training your team on building more resilient systems and improving your on-call habits?
The mutineers will be very forceful and explain “we don’t need to do this”. Share metrics with your mutineers. Give them the data on the top 5 most unreliable systems and focus on the importance of building more resilient systems.
Starting early and giving yourself plenty of time is a major advantage. Your goal is to plan, create and host an impactful Chaos Day.
Your Chaos Day could be:
90 Days may sound like a long time, but getting everyone involved requires giving a large amount of lead time. This is especially important if you’d like your CTO and CEO to attend.
There are a number of questions which are useful to ask yourself before you start officially planning your Chaos Day: Who will attend? What is the focus? How will you measure success? What is your budget? Where will it be?
Here is a Chaos Day timeline for you to use when planning your Chaos Day. This will help you ensure your Chaos Day is a success!
Chaos Day Countdown: 90 Days
Chaos Day Countdown: 60 Days
Chaos Day Countdown: 30 Days
Chaos Day Countdown: 0 Days
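To put the countdown on a calendar, a small sketch like the one below can work backwards from your chosen date. The function name and the example date are hypothetical. Only the 90, 60, 30 and 0 day milestones come from the timeline above.

```python
from datetime import date, timedelta

# Countdown milestones from the timeline above, in days before Chaos Day.
COUNTDOWN_DAYS = (90, 60, 30, 0)


def countdown_dates(chaos_day: date) -> dict[int, date]:
    """Map each countdown milestone to the calendar date it falls on."""
    return {days: chaos_day - timedelta(days=days) for days in COUNTDOWN_DAYS}


if __name__ == "__main__":
    # Hypothetical Chaos Day date, chosen only for illustration.
    for days, when in countdown_dates(date(2030, 6, 14)).items():
        print(f"Chaos Day Countdown: {days} Days -> {when.isoformat()}")
```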
The very first thing you need to tick off your list when planning a Chaos Day is creating an official Chaos Day Crew. It’s important to have a diverse team, we recommend you include the following people:
Once you’ve created your Chaos Day Crew, get together and determine who will be responsible, accountable, consulted or informed (RACI) when it comes to your Chaos Day tasks. You can read more about RACI here: The RACI Matrix - Your blueprint for project success.
We recommend you create a matrix together similar to the one below:
Planning a successful and impactful Chaos Day isn’t an overnight activity. It takes time and a great team. But the outcome will be well worth the effort.
Here are the top 3 objectives when planning your Chaos Day:
You will need to determine how much your Chaos Day is going to cost. Remember to take into account venue, food, drink, whiteboards, and time away from your usual work.
It’s important to determine who you need to attend Chaos Day before booking a specific date and venue. Using a tool like Doodle to send out a quick and simple poll will enable you to collect everyone’s available dates.
Where you choose to host your Chaos Day really comes down to three questions:
If your team culture could be described as “scientific, love experiments, and exploring new concepts”, you could host your Chaos Day at a Science Exploratorium event space. If your team culture is “serious, no nonsense”, it would likely be more appropriate to host your Chaos Day on-site at your office.
There are many tools out there that can help you find a great venue. Try out Splacer:
You really want everyone to be there, especially when you are putting in a ton of effort to make this an impactful and useful team day. Send all your attendees a very simple placeholder invite. Don’t give much away!
[Chaos Day is coming]
Keep this day free, we need you.
Many of your attendees may never have practiced Chaos Engineering. It’s still a very new concept for most engineers.
Sharing useful articles for your attendees to read 30 days before your Chaos Day will help them feel included. It will also give them time to learn more so they can do their best work on the day. This is very similar to giving everyone on your team time to prepare for Hack Days. It’s useful to have time to think about what Chaos Engineering experiments you will perform and which service you will target.
You can find the Gremlin Chaos Day Pre-Read Pack on GitHub.
Now it’s time to create your Chaos Day Agenda.
When you are determining what experiments to run on your Chaos Day, ask yourself the following question:
What are your top 5 most critical services?
Remember that it is okay to start in staging and then later move to production. When you are practicing Chaos Engineering in production or staging, be sure to take into consideration the blast radius. Start small and measure the impact on your engineering team.
Before you run your Chaos Engineering experiments, make sure you have an exit plan. Gremlin has built-in functionality to stop experiments.
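An exit plan is easier to follow when the abort rule is written down before the experiment starts. The sketch below is one minimal way to express such a rule in code. It is not Gremlin's API: `HaltCondition`, `run_with_exit_plan`, the 5% threshold and the sample readings are all hypothetical, and `halt` stands in for whatever actually stops the experiment in your tooling.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class HaltCondition:
    """A hypothetical abort rule: stop if a metric rises above a threshold."""
    name: str
    threshold: float

    def breached(self, value: float) -> bool:
        return value > self.threshold


def run_with_exit_plan(
    samples: Iterable[float],
    condition: HaltCondition,
    halt: Callable[[], None],
) -> bool:
    """Check each metric sample; call halt() and return True on the first breach.

    `samples` stands in for readings from your monitoring tool, and `halt`
    stands in for the action that stops the experiment.
    """
    for value in samples:
        if condition.breached(value):
            halt()
            return True
    return False


if __name__ == "__main__":
    # Hypothetical error-rate readings (percent) taken during an experiment.
    readings = [0.2, 0.4, 1.1, 6.3, 0.5]
    rule = HaltCondition(name="error rate %", threshold=5.0)
    halted = run_with_exit_plan(readings, rule, halt=lambda: print("Halting experiment"))
    print("halted early:", halted)
```

Agreeing on the threshold as a group before the day starts keeps the decision to stop from becoming a debate in the middle of an experiment.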
Gremlin has the following Chaos Engineering experiments available to you.
A very beneficial activity during your Chaos Day is having everyone in the room draw up a current diagram of the systems you will focus on for your Chaos Day. Use this time to your advantage. With so many great minds in the room, it’s a great opportunity to debate assumptions and gain consensus.
Chaos Engineering Hypothesis
This is an example of a latency attack that runs for 120 seconds and adds 1500 milliseconds of latency to DynamoDB traffic.
This will delay egress packets by 1500 milliseconds. This Chaos Engineering experiment is being performed in the staging environment.
Gremlin gives you the ability to see the logs within the Gremlin Control Panel. This is useful because everyone who has access to your Gremlin account is able to see the progress of experiments in real-time.
You will then see an increase in HTTP 500 responses in your monitoring tooling.
You will also be able to confirm that graceful degradation occurs as expected. Here you can see that the Gremlin app degrades gracefully.
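To make graceful degradation concrete, here is a minimal sketch of a caller that falls back to a cached answer when a dependency becomes slow. It does not show how the Gremlin app is built. The function names, the 500 millisecond timeout and the fallback value are assumptions for illustration, and the injected latency is simulated with a sleep rather than a real network attack.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def slow_dependency(injected_latency_s: float) -> str:
    """Stand-in for a datastore call with simulated latency in front of it."""
    time.sleep(injected_latency_s)
    return "fresh data"


def fetch_with_fallback(injected_latency_s: float, timeout_s: float, fallback: str) -> str:
    """Return the dependency's answer, or the fallback if it exceeds the timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(slow_dependency, injected_latency_s)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The dependency is too slow: degrade instead of failing the request.
        return fallback
    finally:
        # Do not block the caller on the slow call still running in the pool.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    # 1500 ms of latency, as in the experiment above, against a 500 ms timeout.
    print(fetch_with_fallback(1.5, 0.5, fallback="cached data"))
```

If a service has no timeout or fallback like this, the same latency tends to surface as errors to the user instead.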
Gremlin has a library of Recommended Scenarios available for you to use that is always expanding. To get access to these Scenarios, please chat with our team. Our CEO Kolton shared a guide on how you can use Scenarios to prepare for real-world outages.
Scenarios feel like an important step in the natural evolution of chaos. Replicating isolated failures will always be helpful, but scenarios provide the means to ratchet up pressure on our systems in ways that more closely mirror the complex, orchestrated failure states we observe in production environments.
Now that you have successfully run your experiment, it’s a good time to turn this experiment into continuous chaos. You can use Gremlin to automatically schedule this Chaos Engineering experiment to occur on a daily/weekly/monthly basis. This is great because you will be able to have confidence that you have the same level of resilience as you did on the day of your Chaos Day. It also frees your team up to think of new experiments to run for future Chaos Days.
Here are some resources on how you can integrate Gremlin with your CI/CD platform or use the built-in Gremlin scheduler:
Now that you have successfully run your experiment, it’s a good time to ask who would like to volunteer to run the next Chaos Day!
Finally, there are a number of important aspects of your Chaos Day that will make it really stand out. With these additional touches your Chaos Day could become the reason engineers want to join your company. Your Chaos Day could be a day that every engineer looks forward to, thinks about and plans for.
Don’t forget to order coffee, lunch, drinks and snacks. Everyone will need delicious brain food to do their best work on Chaos Day.
Do you want a theme for your Chaos Day? Themes are very common at Hack Week. They get everyone generating new ideas and thinking about the upcoming activities.
You could use a technical theme like packet loss and hide physical packets around your office for engineers to find. What are your favorite chaos-filled movies and TV shows?
The day before your Chaos Day, send everyone a reminder with information on where they need to be, what they need to bring and what time they need to show up. Make sure to tell everyone to bring along their laptop and charge it in advance.
WELCOME TO CHAOS DAY
WE NEED YOU
WE ARE GLAD YOU ARE HERE
----- END OF MESSAGE --------