This article was originally published on TechCrunch.
The outages at RBS, TSB, and Visa left millions of people unable to deposit their paychecks, pay their bills, acquire new loans, and more. As a result, the Treasury Select Committee (TSC) began an investigation of the UK finance industry and found the “current level of financial services IT failures is unacceptable.” Following this, the Bank of England (BoE), Prudential Regulation Authority (PRA), and Financial Conduct Authority (FCA) decided to take action and set a standard for operational resiliency.
While policies can often feel burdensome and detached from reality, these guidelines are reasonable steps that any company across any industry can exercise to improve the resilience of their software systems.
The BoE standard breaks down to these five steps:
- Identify critical business services based on those that end users rely on most
- Set a tolerance level for how much outage time during an incident is acceptable for each service, based on the utility the service provides
- Test if the firm is able to stay within that acceptable period of time during real-life scenarios
- Involve management in the reporting and sign-off of these thresholds and tests
- Take action to improve resiliency against the different scenarios where feasible
Following this process aligns with best practices in architecting resilient systems. Let’s break each of these steps down and discuss how Chaos Engineering can help.
The operational resilience framework recommends focusing on the services that serve external customers. While internal applications are important for productivity, this customer-first mentality is sound advice for determining a starting place for reliability efforts. While it’s ultimately up to the business to weigh the criticality of the different services they offer, the ones necessary to make payments, retrieve payments, invest, or insure against risks are all recommended priorities.
For example, retail companies can prioritize the customer’s critical shopping and checkout path as a place to start. Business SaaS companies can start with their customer-facing applications, especially those with Service Level Agreements (SLAs). To pick a simple example, Salesforce would focus reliability efforts on their CRM, not on their internal ticketing systems.
The second part of this stage is mapping, where firms “identify and document the necessary people, processes, technology, facilities, and information (referred to as resources) required to deliver each of its important business services.” The valuable insight here is that mapping an application doesn’t stop at the microservices that make up the application itself -- so neither should the reporting and testing! Sometimes even a service that’s thought to be non-critical can take down other critical services (due to undiscovered bugs, for example), so companies need to be aware of this unintentional tight coupling.
This is where Chaos Engineering comes into play. Firms can map critical dependencies by running network attacks to see which services make up the application and find any unknown or undocumented dependencies. Then, incrementally scale the testing to include multiple services.
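As a sketch of that workflow, the loop below degrades one dependency at a time and records which user-facing endpoints break, rediscovering the dependency map. All service names are hypothetical, and the stubbed probe stands in for a real network attack plus an HTTP health check:

```python
# Hypothetical dependency-mapping loop: block one downstream dependency at
# a time, probe the user-facing endpoints, and record which ones break.
# In a real run, "blocking" would be a network attack (e.g. dropping
# traffic to a host) and the probe an actual HTTP request; both are
# stubbed here against a made-up environment.

DEPENDENCIES = ["auth-service", "ledger-db", "fraud-check", "email-gateway"]

# Ground truth the experiment is trying to rediscover -- normally unknown.
ACTUAL_DEPS = {
    "checkout": {"auth-service", "ledger-db", "fraud-check"},
    "receipts": {"ledger-db", "email-gateway"},
}

def endpoint_healthy(endpoint, blocked):
    # Stubbed probe: the endpoint fails if any dependency it needs is blocked.
    return not (ACTUAL_DEPS[endpoint] & blocked)

def map_dependencies():
    discovered = {endpoint: set() for endpoint in ACTUAL_DEPS}
    for dep in DEPENDENCIES:  # attack one dependency at a time
        blocked = {dep}
        for endpoint in ACTUAL_DEPS:
            if not endpoint_healthy(endpoint, blocked):
                discovered[endpoint].add(dep)
    return discovered

print(map_dependencies())
```

Running one attack at a time keeps the blast radius small; once single-dependency results are understood, the `blocked` set can be grown to test combinations.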
The next step is setting tolerance levels, also known as Service Level Agreements (SLAs), customized to the criticality of the service. In other words, banks must preemptively set and agree to the amount of time it should take to restore service during any named incident. There are multiple Service Level Indicators (SLIs) suggested for tracking -- e.g. outage time and the number of failed requests -- but the paper suggests using outage time as the primary metric. The initial paper also recommends self-assessment to determine the acceptable outage time, but the three governing bodies have since taken a stronger stance. The new standard is a maximum of two consecutive days of downtime, which, in my view, is still too lenient a standard.
The paper also stresses the importance of setting what it calls an “impact tolerance”, which attempts to account for every scenario that could happen and then outlines agreed-upon timeframes for remediation. Impact tolerance can be considered a Recovery Time Objective (RTO) metric, and is slightly stricter than a time-bound Service Level Objective (SLO). For example, region loss scenarios have occurred but are quite rare (less frequent than once a year) -- yet the requirements state that a bank should be prepared to recover from such an incident within two days.
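As a rough illustration, the check below compares recorded outage windows against the two-day impact tolerance from the standard. The incident timestamps are invented for the example:

```python
from datetime import datetime, timedelta

# Illustrative impact-tolerance check. The two-day bound comes from the
# standard discussed above; the incident records are made up.
IMPACT_TOLERANCE = timedelta(days=2)

incidents = [
    (datetime(2020, 3, 1, 9, 0), datetime(2020, 3, 1, 14, 30)),   # 5.5h outage
    (datetime(2020, 6, 12, 22, 0), datetime(2020, 6, 13, 3, 0)),  # 5h outage
]

def within_tolerance(incidents, tolerance=IMPACT_TOLERANCE):
    # The tolerance bounds each individual incident's outage time,
    # so check the longest single outage rather than an annual total.
    longest = max(end - start for start, end in incidents)
    return longest <= tolerance

print(within_tolerance(incidents))  # True: longest outage is 5.5 hours
```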
However, in order to be realistic and limit the universe of possible failures, the paper suggests starting with the scenarios that have happened to the firm before, or to others in their industry. For example, if retailers see an outage at another retailer, it behooves them to replicate that incident in their environment to ensure resilience.
I believe this is the right approach -- if you try to do too much too soon, it can quickly become overwhelming. I’d recommend running FireDrills to see how fast systems self-heal, or measuring your team’s mean time to recover (MTTR) for known, likely incidents.
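A minimal MTTR calculation over fire-drill records might look like the following; the detection and recovery timestamps are invented for illustration:

```python
from datetime import datetime

# Hypothetical fire-drill log: (detected, recovered) timestamp pairs.
# MTTR is the mean of the recovery durations, reported here in minutes.
drills = [
    (datetime(2020, 1, 10, 10, 0), datetime(2020, 1, 10, 10, 25)),  # 25 min
    (datetime(2020, 2, 14, 15, 0), datetime(2020, 2, 14, 15, 45)),  # 45 min
    (datetime(2020, 3, 20, 9, 30), datetime(2020, 3, 20, 9, 50)),   # 20 min
]

def mttr_minutes(drills):
    total_seconds = sum(
        (recovered - detected).total_seconds() for detected, recovered in drills
    )
    return total_seconds / len(drills) / 60

print(mttr_minutes(drills))  # 30.0
```

Tracking this number per scenario over time shows whether resilience investments are actually shortening recovery.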
This is the step where Chaos Engineering is most directly applicable. The guidance on testing comes down to performing severe but plausible failure scenarios seen by the firm or by others in the industry. The proposal recommends varying the severity of attacks by the number of resources affected or the length of time those resources are unavailable. For example, if you are testing a database failover scenario, you can begin by injecting a small amount of latency into the database, then drop the connection to the primary node, then remove additional nodes until you see how your system handles losing connection to the entire database cluster.
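That escalation can be sketched as a loop over increasingly severe stages, stopping at the first one the system cannot absorb. The stage names and parameters below are illustrative, not a real chaos tool’s API; `run_stage` stands in for launching the attack and probing user-facing SLIs while it runs:

```python
# Escalation stages for a database failover experiment: mild latency
# first, then sever the primary, then remove nodes until the whole
# cluster is unreachable. Names and parameters are hypothetical.
STAGES = [
    ("latency",   {"ms": 100, "target": "db-primary"}),
    ("blackhole", {"target": "db-primary"}),
    ("shutdown",  {"targets": ["db-node-1", "db-node-2"]}),
    ("shutdown",  {"targets": "all"}),
]

def escalate(run_stage):
    """Run stages in order of severity.

    `run_stage(name, params)` should launch the attack via a chaos tool,
    check that user-facing SLIs stay within tolerance, and return True
    on success. The first failing stage marks where to focus fixes.
    """
    for name, params in STAGES:
        if not run_stage(name, params):
            return (name, params)
    return None  # system absorbed every stage
```

Stopping at the first failure keeps the experiment’s blast radius proportional to what the system has already proven it can handle.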
Additionally, the paper recommends testing third-party resources. This is a best practice in the new API-driven economy, where modern applications are built by stitching together services from other teams inside or outside the organization. If an eCommerce store’s third-party payment service goes down, the store should be prepared for this outage and have a failover solution in place. Chaos Engineering lets you simulate these scenarios without actually having to bring down the service.
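A minimal version of that failover pattern is sketched below, with hypothetical `primary` and `backup` provider calls. A chaos experiment can force the primary call to fail to verify that the fallback path actually works:

```python
# Minimal third-party failover pattern. `primary` and `backup` stand in
# for calls to two payment providers; both names are hypothetical.

class PaymentError(Exception):
    """Raised when a provider rejects or cannot process a charge."""

def charge(amount, primary, backup):
    try:
        return primary(amount)
    except PaymentError:
        # Primary provider is down (or a chaos experiment is simulating
        # the outage): fail over to the backup instead of dropping the order.
        return backup(amount)
```

In production this would typically sit behind a circuit breaker with retries and alerting; the point of the experiment is to prove the fallback branch is exercised before a real outage does.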
None of this should be performed in isolation. Setting the impact tolerance is a critical risk management process that should involve buy-in from senior management. The impact tolerance levels provide a clear metric that can be reported to senior management, ensuring they are included in the decision-making process when it comes to determining the criticality of needed improvements.
The proposal recommends firms take action by leveraging the findings of the testing stage and management’s cost/benefit analysis to prioritize areas for improvement. The recommendations stated include:
- Replacing outdated or weak infrastructure
- Increasing system capacity
- Achieving full fail-over capability
- Addressing key person dependencies
- Being able to communicate with all affected parties
This rounds out the chaos experiment lifecycle as well. Once a system weakness is identified, companies should prioritize a fix based on its severity, and then rerun the chaos experiments to ensure there isn’t a regression. Conducting post-mortems and publishing the results helps management and other teams learn from the findings and improve their own work.
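That rerun step can be treated as a regression guard: replay the saved experiments and flag any scenario that previously passed but now fails. This is a sketch under assumed data shapes, not a real tool’s interface:

```python
# Sketch of a chaos regression guard. `experiments` maps scenario name to
# a callable that re-runs the experiment and returns True on success;
# `baseline` maps scenario name to the previously recorded result.

def regression_check(experiments, baseline):
    regressions = [
        name
        for name, rerun in experiments.items()
        if baseline.get(name) and not rerun()  # passed before, fails now
    ]
    return regressions
```

Wiring a check like this into CI means a fix for one weakness cannot silently reintroduce another.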
Following the process outlined by the BoE, PRA, and FCA aligns with best practices for identifying and prioritizing areas for resiliency improvements. The process is currently opt-in beginning October 1, and while the standards only apply to UK-based institutions, many multinational firms, out of a strong customer focus, are preparing to adhere to them voluntarily. The process parallels much of the tech industry’s best practices, and could be considered astute guidance for any company wanting to build more resilient systems.