Failure mode and effects analysis (FMEA) is a decades-old method for identifying all possible failures in a design, a manufacturing or assembly process, or a product or service. In the last few years, companies have begun using it to make their computer systems better. FMEA is not itself an ISO standard procedure in the way that ISO 31000 Risk Management is, although there are ISO implementations specific to certain applications. It is a broader method that can be applied to a wide range of needs.
DevOps and the move from corporate-owned datacenters to the cloud, combined with the rise of microservices, have created systems that are more flexible and cost-efficient. But that comes with greater complexity and the risk of unknown single points of failure. Our new applications and production environments are too large and complex to adequately or accurately reproduce in testing and staging environments.
The goal of Chaos Engineering is to create more reliable and resilient systems by injecting small, controlled failures into large scale computer systems. Only by properly testing a system can we truly know where its weaknesses lie. We think we know our system’s capacity and how it will respond to known actions, but we aren’t certain because we haven’t yet found a way to test our assumptions. In addition, we can only fix problems by finding them. When we have a good sense of what can go wrong, disaster recovery times are shortened and the criticality of incidents is reduced.
This seems like a match made in heaven. Using FMEA and Chaos Engineering in tandem for failure detection and failure analysis will help achieve the ultimate goals of both more quickly and with greater assurance that the desired results have been achieved. Whether using root cause analysis to find initial causes of failures or instead expecting those failures going forward and creating automated failover and initial incident response schemes, our systems benefit from a careful examination and testing.
FMEA was started in the 1940s by the U.S. military as a step-by-step approach for identifying all possible failures in a design, a manufacturing or assembly process, or a product or service. This is important because by anticipating the potential small failures, you can adjust processes and systems to prevent the cascading consequences of failures. This is just as true in software applications.
As a process analysis tool, FMEA can be used at any stage of a process, product, or service’s life cycle, from design onward. It is typically implemented after quality function deployment (QFD) is performed. QFD measures customer satisfaction and analyzes customer needs. In software, this is similar to the requirements gathering process that precedes writing the first line of code. Only by knowing what customers want and need can we produce something suitable and delightful. If you are coming from a DMAIC (Define, Measure, Analyze, Improve, Control) world, FMEA occurs during the Analyze phase of the process.
FMEA is typically used while a process, product, or service is being:
- Applied for the first time
- Modified or improved
- Run in production
The approach is also useful while developing control plans for quality assurance or quality control and when analyzing failures that have already happened in order to prevent their recurrence. Organizations using FMEA may also apply it periodically throughout the life cycle, either for continual improvement or to catch an undocumented change that caused a regression or introduced a new failure. In this case, the analysis may be scheduled at regular intervals or applied at random.
No one can control world events; the scope is just too large, which is one of the reasons disaster recovery plans and failure mitigation were created. Similarly, in the world of today’s complex systems, we don’t always control all the moving parts. In today’s cloud computing environments, we don’t control the infrastructure, so changes can happen that we have no control over or way to anticipate. What we can do is apply some very similar steps to enhance our understanding of our systems and mitigate potential failures.
FMEA is a complex process with implementation details that vary according to organizational goals and industry standards. From a high level, it involves assembling a team of people who analyze and determine the scope of the FMEA. From there, flowcharts and forms are created to capture every detail as clearly as possible. Then, every part of the system is analyzed carefully to find potential modes of failure, possible root causes, and probable customer impacts. While this process was designed for other uses, any software application will benefit when we take the time to look for potential failures and consider their customer impacts. We can only prevent what we anticipate.
Once the failure modes are discovered, described, and documented, FMEA requires the analysis of three things:
- Severity - using an S rating, this denotes the potential impact of a failure, from insignificant to catastrophic
- Occurrence - using an O rating, this denotes the probability that the failure will occur, from extremely unlikely to inevitable
- Detection - using a D rating, this denotes how well the process controls put in place can detect a cause or a failure mode after it occurs but before a customer is affected
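As a rough illustration, the three ratings can be captured as one record per failure mode. The sketch below assumes the commonly used 1 to 10 scale for each rating, with a higher Detection number meaning the failure is harder to catch; the class name, field names, and sample rows are hypothetical and your organization's scales may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of a hypothetical FMEA worksheet (1-10 scales are an assumption)."""

    name: str
    severity: int    # S: 1 = insignificant, 10 = catastrophic
    occurrence: int  # O: 1 = extremely unlikely, 10 = inevitable
    detection: int   # D: 1 = almost certain to be caught, 10 = almost certain to be missed

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed 1-10 scale.
        for label in ("severity", "occurrence", "detection"):
            value = getattr(self, label)
            if not 1 <= value <= 10:
                raise ValueError(f"{label} must be between 1 and 10, got {value}")


# Illustrative entries only; the ratings are made up for the example.
worksheet = [
    FailureMode("primary database unreachable", severity=9, occurrence=3, detection=2),
    FailureMode("CPU saturation on API hosts", severity=4, occurrence=7, detection=3),
]
```

Validating the ratings at construction time keeps a mistyped score from silently skewing any later prioritization.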
When the team assigns a severity rating to each failure mode, this is similar to the SEV-1 or SEV-4 style ratings that are used for software incidents. The occurrence rating is similar to categorizing the results of a Chaos Engineering test to prioritize system enhancements; there’s more on that in the next section.
After the severity and occurrence ratings are calculated, process controls are identified and written covering the tests, procedures, or mechanisms that we will put in place to keep the failures from reaching the customer. Notice that we aren’t assuming the ability to prevent all failures, but instead expect there to be failures that are impossible to keep from happening. What we are trying to do is prevent customer impact.
Site reliability engineers (SREs) do something similar when we write runbooks (sometimes called playbooks) documenting procedures and information to make human intervention in an emergency more efficient and effective.
Finally, the FMEA group determines how to detect failures as early as possible and gives a set of recommended actions. An SRE will agree that good system monitoring and observability enable early failure detection in today’s computer systems and applications.
This methodical and scientific approach of FMEA is useful, but not perfectly possible with our constantly-changing computing architectures today, at least not in the sense of documenting every piece of the architecture and how it interacts with other parts of the system. When load balancers spin up or take down resources according to system load, it is impossible to know second by second what some of our systems look like. We can approximate, but any drawing we attempt will be imprecise at any given moment.
However, most FMEA goals do not actually require that level of exactness. What the goals require is finding failure modes and doing something about them. In fact, FMEA itself is a bit fuzzy because it is anticipating failure modes, not documenting actual experience (although some analyses may start from past failure events). We can choose to be comfortable with the inexactness of prediction while searching for potential problems and solving them in advance.
Chaos Engineering uses an intentional, planned process through which we inject harm into a system to learn how it responds and, ultimately, to find and fix defects before they impact customers. Before starting any attacks on your systems, you should fully think out and develop the experiments you want to run. Sounds familiar, doesn’t it?
Chaos Engineering doesn’t create chaos; it acknowledges the chaos that already exists in our massive deployments and constantly changing systems and was created to rein in the potential customer impacts of that chaos. A component or service failure should not be able to bring down the whole system. We can’t anticipate every failure, but that doesn’t mean we shouldn’t try to find the ones we can anticipate.
When creating a chaos experiment we:
- Start with a hypothesis stating the question that you’re trying to answer and what you think the result will be. For example, if your experiment is to learn what happens to one of your bank’s ATMs when network latency increases beyond a set level, your hypothesis might state that you expect the machine to store a local record of the incident, prevent customer account balance errors, and signal maintenance as soon as connectivity improves.
- Define your blast radius. The blast radius includes any and all components affected by this test. A smaller blast radius will limit the potential damage done by the test. We strongly recommend you start with the smallest blast radius possible. Once you are more comfortable running chaos experiments, you can increase the blast radius to include more components.
- You should also define your magnitude, which is how large or impactful the attack is. For example, a low-magnitude experiment might be to test application responsiveness after increasing CPU usage by 5%. A high-magnitude experiment might be to increase usage by 90%, as this will have a far greater impact on performance. As with the blast radius, start with low magnitude experiments and scale up over time.
- Monitor your infrastructure. Determine which metrics will help you reach a conclusion about your hypothesis, take measurements before you test to establish a baseline, and record those metrics throughout the course of the test so that you can watch for changes, both expected and unexpected.
- Run the experiment. You can use Gremlin to run experiments on your infrastructure in a simple, safe, and secure way. We also want to define abort conditions, which are the conditions where we should stop the test to avoid unintentional damage. With Gremlin, we can actively monitor our experiments and immediately stop them at any time.
- Form a conclusion from your results. Does it confirm or refute your hypothesis? Use the results you collect to modify your infrastructure, then design new experiments around these improvements.
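The steps above can be sketched in code. This is a minimal illustration and not Gremlin's API: `ChaosExperiment`, `run_experiment`, and `simulated_latency_ms` are hypothetical names, the component name is made up, and the "system" is a simulated function standing in for real monitoring. It shows a hypothesis with a measurable expectation, a blast radius, a magnitude, a baseline measurement, and an abort condition.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ChaosExperiment:
    hypothesis: str
    blast_radius: List[str]          # components the experiment is allowed to touch
    magnitude: float                 # added CPU load as a fraction (0.05 = 5%)
    max_expected_latency_ms: float   # what the hypothesis predicts we stay under
    abort_latency_ms: float          # abort condition: stop immediately above this


def simulated_latency_ms(extra_cpu_load: float) -> float:
    """Stand-in for a real metric query; latency grows with injected load."""
    return 100.0 + 400.0 * extra_cpu_load


def run_experiment(
    experiment: ChaosExperiment,
    probe: Callable[[float], float],
    samples: int = 5,
) -> Dict[str, object]:
    baseline = probe(0.0)  # measure before injecting anything
    observations: List[float] = []
    aborted = False
    for _ in range(samples):
        latency = probe(experiment.magnitude)
        observations.append(latency)
        if latency > experiment.abort_latency_ms:
            aborted = True  # halt the experiment to avoid unintended damage
            break
    hypothesis_held = (not aborted) and all(
        value <= experiment.max_expected_latency_ms for value in observations
    )
    return {
        "baseline_ms": baseline,
        "observations_ms": observations,
        "aborted": aborted,
        "hypothesis_held": hypothesis_held,
    }


small = ChaosExperiment(
    hypothesis="A 5% CPU increase keeps latency under 150 ms",
    blast_radius=["checkout-service host 1"],
    magnitude=0.05,
    max_expected_latency_ms=150.0,
    abort_latency_ms=300.0,
)
result = run_experiment(small, simulated_latency_ms)
```

Starting with a single host and a 5% load mirrors the advice to begin with the smallest blast radius and magnitude, then scale up once the results are understood.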
Repeating this process over time will help us harden our applications and processes against failure. Use this process to discover failure modes or confirm suspected ones, learn their impact on the rest of the system, discover mean time to detect (MTTD) when a failure occurs and learn whether the system auto-heals with designed failover schemes and mitigation techniques. MTTD is very similar to FMEA’s Detection.
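To make MTTD concrete, here is a small sketch that averages the gap between when a failure was injected and when monitoring flagged it. The function name and the timestamps are hypothetical; in practice the two times would come from your experiment log and your alerting system.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_detect(events: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (detected_at - injected_at) over a list of experiments."""
    if not events:
        raise ValueError("at least one (injected_at, detected_at) pair is required")
    total = sum(
        (detected_at - injected_at for injected_at, detected_at in events),
        timedelta(),  # start value so sum() adds timedeltas rather than ints
    )
    return total / len(events)


# Made-up timestamps: one failure detected after 30 s, another after 90 s.
events = [
    (datetime(2024, 1, 1, 12, 0, 0), datetime(2024, 1, 1, 12, 0, 30)),
    (datetime(2024, 1, 1, 13, 0, 0), datetime(2024, 1, 1, 13, 1, 30)),
]
mttd = mean_time_to_detect(events)
```

Tracking this number across repeated experiments gives a measured input for the Detection rating instead of an estimate.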
Likewise, you can use Chaos Engineering to help you calculate both Severity and Occurrence. Here are some examples. If you perform a blackhole attack against a single DB and discover that it shuts down the entire application, the Severity is high. If instead you are performing a simple CPU attack, which reproduces a common event, that would be considered a high-Occurrence event when compared to something rare like a full region evacuation. Carefully designed chaos experiments can help you detect new entries into these categories and determine the proper ratings to assign to each that you find, accelerating the overall process.
Frequently, organizations begin their Chaos Engineering journey using GameDays, which are days with a couple of hours set aside for a team to run one or more chaos experiments and then focus on the technical outcomes. Risk assessment goals can be set in advance and then tested alongside the system’s other capabilities, with everyone on the software development team participating in real-time monitoring and implementation.
The Chaos Engineering goal of finding failure modes more safely, easily, and quickly aligns perfectly with FMEA goals. In fact, although they developed at different times for different purposes, site reliability engineering looks like an application of FMEA to the world of distributed systems and large-scale software applications.
Once failure scenarios are discovered, the information gathered is used to set improvement goals as the results of chaos experiments inform how reliability work is prioritized or scheduled. Serious issues that are detected are given the highest priority. Implementing Chaos Engineering into our operation helps us achieve our goals of continuous improvement and reliable systems. We can eliminate or reduce outages as we build resiliency into our systems.
Both SRE and FMEA have one major goal: prevent anything that affects the customer negatively. FMEA uses risk priority numbers (RPN) when assessing risk. An RPN is conventionally calculated as the product of the severity, occurrence, and detection ratings. In SRE, we use service-level objectives (SLO) with associated error budgets to set targets for reliability that are measurable and actionable.
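The two measures can be compared side by side with a little arithmetic. The sketch below uses the conventional RPN formula (severity × occurrence × detection) and treats an error budget as the allowed unreliability, 1 minus the SLO, over a time window. The function names, the sample ratings, and the 30-day window are assumptions for illustration.

```python
def risk_priority_number(severity: int, occurrence: int, detection: int) -> int:
    """Conventional FMEA risk priority number: S x O x D."""
    return severity * occurrence * detection


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


# Illustrative ratings only: (name, S, O, D).
failure_modes = [
    ("primary database unreachable", 9, 3, 2),
    ("CPU saturation on API hosts", 4, 7, 3),
]

# Highest RPN first, so the riskiest failure mode is addressed first.
ranked = sorted(
    failure_modes,
    key=lambda row: risk_priority_number(row[1], row[2], row[3]),
    reverse=True,
)

budget = error_budget_minutes(0.999)  # roughly 43.2 minutes in a 30-day window
```

With these sample ratings, the frequent CPU saturation (RPN 84) ranks above the rarer database outage (RPN 54), which shows how occurrence and detection can outweigh raw severity when prioritizing work.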
Whether in the physical world of manufacturing or the virtual world of computing, we all have expectations and agreements that we must meet and we all have defined ways to measure success. Reliability and customer satisfaction are the goals.
Applying FMEA goals to today’s enterprise software systems is possible, and Chaos Engineering accelerates the work. SREs have discovered that Chaos Engineering helps them meet and exceed their service level agreements. It can also help organizations with regulations or mandates to use FMEA meet or exceed similar expectations by discovering potential failure modes in the only way possible in today’s distributed architectures.