April 8, 2020

Design Thinking Leads to Chaos Engineering

In the Double Diamond Framework for Innovation from the Design Council in the United Kingdom, there are four defined stages in the process of creating a good design. They illustrate those stages using a diagram like this (ours is simplified slightly from theirs).

[Figure: An illustration of the Double Diamond Framework for Innovation]

The Framework for Innovation is intended for use across disciplines. It can be and is used in the design of everyday things from toasters to cars to public areas like parks and museum exhibits. It also works well for things like User Experience (UX) design in information technology (IT). Here we apply it to software testing.

In this article, we assert that the testing style that is most commonly used today in software engineering, Quality Assurance (QA), is inadequate and that applying design thinking in the style of the Double Diamond leads us to a better solution. That solution is Chaos Engineering.

What is the Design Thinking Process?

The Double Diamond illustrates a design process that first explores an issue in a wider, deeper way (divergent thinking) and then takes focused action (convergent thinking). We do this twice: once to understand the problem at hand and a second time to create a solution.

The vertical widening of the first diamond represents doing research to understand the problem more fully. This likely involves talking with people affected by the issue and studying the system itself to understand how it works today and where it falls short. As we understand the broader context, the diamond narrows, representing our homing in on the core of the specific problem we are trying to solve.

The vertical widening of the second diamond represents brainstorming to consider as many ways of solving the problem as we can think of in the time available. The narrowing represents weeding out potential solutions that are ineffective or inadequate for our needs. Finally, we select and implement the solution we believe to be the best among those we have considered.

Applying the Double Diamond to QA

System design is moving away from monoliths, where the entire application is contained in one single program, and toward distributed models like the use of microservice architectures and cloud computing. It is in this context that we are exploring traditional QA methods, design thinking, and effectiveness.

With QA, we test our systems using both automated and manual methods in the interval between the triggering of a software build (when we might run things like unit tests) and final deployment. We run integration tests and system tests that compare the build with design specifications, and we may even have some humans sit in front of a test deployment and click through the UI to confirm that things work as expected.

This is good. The question is: Are we finding the right problems?

In QA testing, thinking converges on finding and solving problems in a build. In a distributed system, each service has its own build. Those individual builds may pass QA tests with flying colors, but we have no insight into whether the individual services will function well within our distributed system as a whole. We are converging too soon in the first diamond, so everything we do in the second diamond is too limited in scope to prevent problems in the greater whole of our application deployment.
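To make the gap concrete, here is a minimal sketch of one way it can show up. The service names, latency figures, and thresholds are all hypothetical and chosen only for illustration: every service passes its own per-build check, yet a request that passes through all of them misses the budget a user would experience.

```python
# Hypothetical latencies (ms) measured in each service's own QA run.
SERVICE_LATENCY_MS = {"gateway": 120, "auth": 180, "catalog": 190, "checkout": 170}

PER_SERVICE_LIMIT_MS = 200  # assumed threshold checked against each build in isolation
END_TO_END_BUDGET_MS = 500  # assumed budget for one user-facing request


def passes_service_qa(latency_ms, limit_ms=PER_SERVICE_LIMIT_MS):
    """The per-build check: is this one service fast enough on its own?"""
    return latency_ms <= limit_ms


def end_to_end_latency(call_chain, latencies=SERVICE_LATENCY_MS):
    """Latency of a request that calls each service in the chain in sequence."""
    return sum(latencies[name] for name in call_chain)


if __name__ == "__main__":
    chain = ["gateway", "auth", "catalog", "checkout"]
    every_build_passes = all(passes_service_qa(SERVICE_LATENCY_MS[s]) for s in chain)
    total = end_to_end_latency(chain)
    print(f"every build passes QA: {every_build_passes}")
    print(f"end-to-end latency: {total} ms (budget {END_TO_END_BUDGET_MS} ms)")
```

No individual build is at fault here. The problem exists only at the level of the whole system, which is exactly where per-build QA is not looking.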

Chaos Engineering Saves the Day

Traditionally, QA relies on having an understanding of the system and the types of things that might happen. That is a natural outgrowth of the traditional model of systems design where you have build specifications and design documents and full knowledge of how the monolithic system will be deployed in production.

Today, our systems are constantly changing and incredibly complex. Relying on a problem-finding convergence at the QA stage no longer serves us well.

Chaos Engineering allows us to diverge in our search for potential problems. Here’s why.

With Chaos Engineering, we design our monitoring and observability efforts to give us insights across the system as a whole. We limit the scope (blast radius) of our chaos experiments to only a small piece of the system, but we watch to see how our intentional and focused interactions on that one piece impact the wider whole.
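The shape of such an experiment can be sketched in a few lines. This is a toy simulation under stated assumptions, not how any particular chaos tool works: the host names, the `pick_blast_radius` and `system_error_rate` helpers, and the round-robin traffic model are all invented for illustration. The point is the structure: attack a small subset, measure the whole.

```python
import random


def pick_blast_radius(hosts, fraction, seed=0):
    """Pick a small, reproducible subset of hosts to target."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, int(len(hosts) * fraction))
    return sorted(random.Random(seed).sample(hosts, count))


def system_error_rate(hosts, failed, requests=1000):
    """Spread requests round-robin over all hosts and return the share that
    landed on a failed host. This is the system-wide signal we watch."""
    errors = sum(1 for i in range(requests) if hosts[i % len(hosts)] in failed)
    return errors / requests


if __name__ == "__main__":
    hosts = [f"host-{n}" for n in range(10)]
    targets = pick_blast_radius(hosts, fraction=0.1)  # attack only 10% of hosts
    baseline = system_error_rate(hosts, failed=set())
    during = system_error_rate(hosts, failed=set(targets))
    print(f"targets={targets} baseline={baseline:.2%} during={during:.2%}")
```

In a real experiment the second measurement would come from monitoring rather than a simulation, and a hypothesis such as "the error rate stays below 1% when one host fails" would be stated before the fault is injected.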

Here are some examples of vertical widening of the first diamond to find problems using Chaos Engineering that are impossible to find using traditional QA methods:

  • Injecting Latency and Packet Loss: Does messaging latency between services affect the user experience? How much latency? In what ways? What happens if packets are lost? How does the system compensate (or does it)?
  • Exhausting Infrastructure Resources: Remember that in a cloud computing environment we often give up control of the infrastructure to a third-party vendor in exchange for cost savings and flexibility. So, what happens if a disk that we provisioned turns out to be too small for our needs and fills up? Or perhaps a node running our authentication service runs out of RAM? What happens if a runaway process spikes CPU usage on the system running our API?
  • Unexpected Infrastructure Changes: What happens when a node running a load balancer for an important high-traffic feature suddenly disappears? Will communications between services deployed across multiple regions be affected during the Daylight Saving Time change? Is our application resilient to dependency crashes?
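As a rough illustration of the first item in the list above, the sketch below wraps a service call in a link that adds latency and drops requests, then checks whether a client with a time budget and retries can compensate. Everything here is an assumption made for the example: `FlakyLink` and `call_with_retries` are hypothetical names, time is simulated rather than slept, and real chaos tooling typically injects these faults at the network layer rather than in application code.

```python
import random


class FlakyLink:
    """Hypothetical in-process stand-in for a degraded network path."""

    def __init__(self, call, latency_ms, loss_rate, seed=0):
        self.call = call
        self.latency_ms = latency_ms
        self.loss_rate = loss_rate
        self.rng = random.Random(seed)

    def send(self, request):
        """Return (response, elapsed_ms); response is None if the request is lost."""
        if self.rng.random() < self.loss_rate:
            return None, self.latency_ms
        return self.call(request), self.latency_ms


def call_with_retries(link, request, timeout_ms, max_attempts=3):
    """Client-side compensation: retry lost requests within a total time budget."""
    elapsed = 0
    for attempt in range(1, max_attempts + 1):
        response, cost = link.send(request)
        elapsed += cost
        if elapsed > timeout_ms:
            return {"ok": False, "reason": "timeout", "elapsed_ms": elapsed, "attempts": attempt}
        if response is not None:
            return {"ok": True, "response": response, "elapsed_ms": elapsed, "attempts": attempt}
    return {"ok": False, "reason": "lost", "elapsed_ms": elapsed, "attempts": max_attempts}


if __name__ == "__main__":
    def echo(request):
        return f"echo:{request}"

    for latency_ms, loss_rate in [(50, 0.0), (300, 0.0), (50, 0.5)]:
        link = FlakyLink(echo, latency_ms=latency_ms, loss_rate=loss_rate)
        result = call_with_retries(link, "ping", timeout_ms=200)
        print(latency_ms, loss_rate, result)
```

Varying `latency_ms` and `loss_rate` answers the questions in the list in miniature: how much latency the caller tolerates, and whether retries are enough to hide packet loss from the user.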

Modern systems require modern testing. We can’t continue to rely on what worked in the past, no matter how effective it once was. We must evolve with the times and test the chaotic systems we now have: systems that change constantly as load changes, spinning up instances, deleting unused resources, and deploying as frequently as daily or even hourly. Chaos Engineering is the only way to gain insight into these systems and learn what happens when failure occurs before it occurs unwatched. That gives us time to mitigate problems and improve the resilience of our systems now.
