Improving the reliability of financial services with Chaos Engineering
In late 2019, customers of one U.S. bank suddenly found that they could no longer access their accounts. Across the country, customers couldn’t log into the website, use the mobile app, or even access ATM services. Debit and credit cards were declined, bill payment processing came to a halt, and customer service lines went dark. The outage lasted for over an hour, with sites like Downdetector receiving thousands of reports from frustrated customers. The problem was soon resolved, but the damage to customer trust had already been done.
Unplanned IT outages are a grim reality for any company. Lost revenue, eroded customer goodwill, and lost engineering time are only a few of the consequences of system failures. For companies in the financial services industry, the risk is much greater due to the importance and sensitivity of the services provided. Finance companies face increased scrutiny from not only their own customers, but also state and federal regulatory agencies, investors, and financial markets. An outage that might cause a drop in revenue for a retail company could mean severe penalties—including fines and investigations—for a company handling financial data.
The need for reliable systems is critical today as social distancing and stay-at-home orders are driving customers towards online financial services. Financial customers expect to be provided with technology platforms that are always available, reliable, and trustworthy. From mobile banking to online investment platforms, finance companies have had to rapidly scale up capacity to meet a sudden surge in demand. Meanwhile, established financial institutions have had to compete with a growing number of financial technology (fintech) companies vying to draw customers through innovation and greater accessibility.
This leaves financial services companies in a difficult position. How do we provide technology platforms that are reliable, can rapidly scale to meet changing customer demands, and still allow us to compete in the fast-paced fintech industry? The answer is Chaos Engineering.
What is Chaos Engineering?
Chaos Engineering is a new testing discipline that helps finance companies proactively test for failure in their applications and systems. It helps IT teams find and address potential failure points before they can grow into high-profile outages.
Enterprise financial systems undergo rigorous testing, but they’re still susceptible to problems. Network outages, system failures, and drops in performance not only harm the customer experience, but can compromise commitments to service level agreements (SLAs) and regulatory compliance. In order to provide the best possible customer experience and meet obligations, engineering teams need a way to proactively test the ability of their systems to withstand problems in a safe and holistic way.
Chaos Engineering is a scientific process that involves performing intentional experimentation on systems by injecting precise and measured amounts of harm. Injecting harm means creating conditions that would typically be considered detrimental or undesirable, such as adding latency to network calls or consuming extra CPU resources. This is similar to stress testing, but allows us to test the entire system holistically instead of a single component. By deliberately putting our systems under these conditions, we can observe how they respond for the purpose of making them more resilient.
This process of safely introducing small amounts of harm to observe the effects is called a chaos experiment, and the actual injection of harm is called an attack. Early Chaos Engineering tools worked by randomly executing attacks across your entire IT infrastructure, and while this worked for companies like Netflix, it is more of a liability for companies that don’t already have resilient systems. Risk is a significant factor in finance, and teams managing financial systems need fine-tuned control over how experiments are run. As the only enterprise Chaos Engineering platform, Gremlin provides this level of control by letting you schedule, halt, and roll back attacks within seconds, making it a much safer approach.
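To make this terminology concrete, here is a minimal Python sketch of a chaos experiment. It is an illustration under stated assumptions, not a description of how Gremlin or any other tool implements attacks: the system is a plain dictionary, the health metric is a toy function, and the names latency_attack, error_rate, and run_experiment are hypothetical. It shows three ideas from the paragraph above: an attack with a measured magnitude, a halt condition, and a rollback that always runs.

```python
from contextlib import contextmanager


@contextmanager
def latency_attack(system, extra_ms):
    """Add latency to the simulated system and always remove it afterwards."""
    system["latency_ms"] += extra_ms
    try:
        yield system
    finally:
        system["latency_ms"] -= extra_ms  # rollback runs even if an error occurs


def error_rate(system, timeout_ms=500):
    """Toy health metric: every request fails once latency exceeds the timeout."""
    return 1.0 if system["latency_ms"] > timeout_ms else 0.0


def run_experiment(system, magnitudes_ms, max_error_rate=0.05):
    """Raise the attack magnitude step by step and halt when the metric degrades."""
    results = []
    for extra_ms in magnitudes_ms:
        with latency_attack(system, extra_ms):
            observed = error_rate(system)
        results.append((extra_ms, observed))
        if observed > max_error_rate:
            break  # halt condition: stop before causing more harm
    return results


system = {"latency_ms": 20}  # hypothetical baseline latency
results = run_experiment(system, [100, 300, 600, 1000])
```

In this sketch the experiment stops at the 600 ms step, never runs the 1000 ms step, and leaves the system at its baseline latency. Starting small and halting early is what separates a controlled experiment from random failure injection.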
How legacy IT systems restrain innovation
In the finance industry, reliability is paramount. We need to be confident that our systems are always available and highly responsive. For established companies, this can be a challenge due to the presence of legacy infrastructure.
Many established companies built their IT infrastructure on top of monolithic systems such as mainframes. These systems provide core business functionality and have undergone years, if not decades, of extensive testing, validation, security hardening, and integration with other systems. They tend to be:
- Always available (or allowed very limited downtime)
- Locked down against unnecessary changes
- Legacy or specialized systems that don’t support modern software
For all intents and purposes, these systems are unchangeable except under very specific circumstances and only by highly specialized engineers. The challenge is that technology is always evolving, and the rigidity of these systems acts as a bottleneck to change. Fintech companies typically don’t have these legacy systems and can potentially outmaneuver companies that do.
Rather than re-architect these systems using modern technologies, engineering teams in established companies tend to build new systems separately and bridge them with legacy systems. This allows for innovation without disrupting the stable core infrastructure, but introduces many new potential failure points. Throughout this white paper, we’ll explore these failure points and how Chaos Engineering addresses them safely.
Three ways Chaos Engineering provides value for financial services
In this section, we’ll present three ways Chaos Engineering helps transform financial IT systems: improving reliability while reducing IT costs, improving the customer experience, and proactively testing for compliance.
Improving reliability while reducing IT costs
Engineering teams in finance companies face unique challenges compared to other industries, and one of the top challenges is in how these teams collaborate.
Financial services companies often have large technology teams distributed across multiple cities or regions, separation of ownership over services, and technology platforms managed by third-party providers. These are needed to deliver the services customers expect, but they create additional complexity, difficulty integrating multiple systems, and a slowdown in the escalation of critical issues. In turn, this becomes a competitive disadvantage compared to newly created startups in the fintech space.
One way IT leaders have increased reliability, decreased costs, and countered their competitors’ advantages is with cloud computing. As in other industries, cloud computing allows financial institutions to deploy scalable and reliable services more quickly and cost-effectively than they could on-premises, while simultaneously centralizing their infrastructure and leveraging extra computing power for resource-intensive tasks such as data analytics. However, moving to the cloud brings an additional set of risks that need to be identified and mitigated in order to ensure high availability.
Chaos Engineering helps engineering teams test and verify the reliability of their own systems, as well as their ability to withstand failures caused by systems that they depend on. It helps teams build confidence in their ability to keep services up and running under conditions that would otherwise result in an outage, even if the cause is outside of their control. Improving the reliability of one system also improves the overall reliability of the systems and services that utilize it, making for a more robust infrastructure.
Consider the following scenario. A bank uses a mainframe system to store customer account data and facilitate transactions. Connected to this mainframe is a gateway server that bridges the mainframe with customer-facing applications, such as the bank’s website and mobile app. When a customer performs an action on either the website or the app, their request is sent from the web server to the gateway server, which interfaces with the mainframe to perform the transaction. Let’s assume that for security reasons, each of these three systems is owned and maintained by a separate team.
There are multiple potential failure points along this path, and if any one of these three systems has a problem, then the customer experience suffers. Since each service is handled by a separate team, creating a unified plan for addressing and mitigating every failure point would be difficult. At the same time, each team will likely have implemented its own strategy for handling failures, so we can’t be certain about how the system as a whole will respond to failure.
With Chaos Engineering, we can design experiments that simulate real-world conditions in order to proactively find and address problems before they affect customers. For example, we might inject latency into network requests from our web server to our gateway in order to see the impact on mobile and web application response time. This lets us answer questions such as:
- How does a small amount of latency impact users’ ability to perform transactions?
- If a request times out in the middle of a transaction, do we roll it back?
- Can customers still access the website even if our login portal is down?
By answering these questions early in the development process, we can design systems that are more reliable, provide more value to the business, and are less likely to fail in production.
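As a rough illustration of the second question, the following Python sketch simulates a transfer that times out between the debit and the credit. Everything here is an assumption made for the example: the accounts are an in-memory dictionary, gateway_call stands in for the request from the web server to the gateway, and latency is a number we pass in, not real network delay. A real experiment would inject latency into live traffic and observe the actual services.

```python
class GatewayTimeout(Exception):
    """Raised when the simulated gateway call exceeds the client timeout."""


def gateway_call(injected_latency_ms, timeout_ms):
    """Stand-in for the web server's request to the gateway server."""
    if injected_latency_ms > timeout_ms:
        raise GatewayTimeout(f"no response within {timeout_ms} ms")


def transfer(accounts, src, dst, amount, injected_latency_ms=0, timeout_ms=2000):
    """Move money between accounts and undo the debit if the gateway times out."""
    snapshot = dict(accounts)  # state to restore on failure
    try:
        accounts[src] -= amount
        gateway_call(injected_latency_ms, timeout_ms)  # may fail mid-transaction
        accounts[dst] += amount
        return True
    except GatewayTimeout:
        accounts.clear()
        accounts.update(snapshot)  # roll back the partial transaction
        return False


accounts = {"checking": 100, "savings": 50}  # hypothetical balances
ok = transfer(accounts, "checking", "savings", 30, injected_latency_ms=100)
after_success = dict(accounts)
failed = transfer(accounts, "checking", "savings", 30, injected_latency_ms=5000)
after_timeout = dict(accounts)
```

The hypothesis under test is that balances after a timed-out transfer match the balances before it. If the debit survived the timeout, the experiment would have found a defect before a customer did.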
Improving the customer experience
Technology has made an enormous impact on how customers use financial services. For Millennial and Gen Z customers, the quality of a mobile banking app is one of the top three factors in choosing a banking provider. Customers expect a high level of performance, availability, and usability in financial applications, and a poor experience can have a stronger negative impact on brand perception than in other industries.
The challenge that established finance companies face is that they must build resilient systems on top of an untouchable legacy backbone. Developing new services or features often means either working around the limitations of these legacy systems or requesting modifications to them, both of which require significant engineering time and effort. Unless a modification addresses a high priority problem or shows a clear return on investment (ROI), engineering teams are generally disincentivized from making changes.
Companies must also contend with growing competition from fintechs, who aren’t restrained by legacy applications or mainframe systems. These companies can build applications on top of modern architectures like microservices, allowing them to rapidly deploy features that focus on improving the customer experience. Platforms like Kubernetes allow them to build scalable, reliable systems at a faster rate, while technologies like machine learning and conversational banking allow for more personalized customer service. While these technologies aren’t exclusive to fintechs, implementing them in a more established firm is challenging due to the engineering effort and risks of introducing new technologies into an already complex environment.
In addition, the growing trend of deposit displacement means customers are moving more of their money to alternative accounts such as health savings accounts (HSAs), peer-to-peer (P2P) payment apps like Venmo and Cash App, and investment savings apps. Established firms have the benefit of scale, larger budgets, and brand recognition, which can help protect against disruption from fintechs. However, attracting new customers requires them to cater to customer demands by improving the quality of their services and adding features offered by competing services, such as social networking, budget tracking and charting, and gamification. In doing so, they need to be sure that these changes don’t impact the reliability of their existing services or overall infrastructure.
Chaos Engineering helps mitigate the risk of system failures even as engineering teams increase the velocity of changes.
Let’s consider another scenario. An investment brokerage firm is adding a stock recommendation service to its platform. This new service uses machine learning to analyze each investor’s current portfolio while factoring in variables such as their risk tolerance and market forecast. It returns a list of stocks that the investor can choose from directly within the platform. The firm decides to deploy the new service to AWS to leverage features such as automatic scaling and load balancing across multiple server instances, as well as use machine learning and data analysis services like Amazon Forecast. However, when designing this system for reliability, there are several risk factors to consider:
- What happens when a component fails? Can we continue providing recommendations, or do our users see errors?
- How efficiently does our service scale? If we have a sudden surge in users, can we accommodate them while maintaining performance and availability?
- If the recommendation service goes down, can users still perform critical functions like executing trades?
We can use Chaos Engineering to answer these questions while also validating our operational excellence and reliability. Here are some ways we can do this.
First, we can consume computing resources on our server instances to simulate higher demand and verify that we have enough capacity provisioned. This also allows us to test any auto scaling policies we have in place. We might also shut down or restart our server instances to verify that our load balancer automatically redistributes requests to healthy nodes.
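The load balancer check can be sketched in a few lines of Python. This is a simplified model, not AWS load balancing behavior: the LoadBalancer class, its round-robin policy, and the node names are all hypothetical, and a real experiment would shut down actual instances and watch real traffic.

```python
from itertools import cycle


class LoadBalancer:
    """Toy round-robin balancer that skips nodes marked unhealthy."""

    def __init__(self, nodes):
        self.healthy = {node: True for node in nodes}
        self._ring = cycle(nodes)

    def shutdown(self, node):
        """Simulate a shutdown attack against one instance."""
        self.healthy[node] = False

    def route(self):
        """Return the next healthy node, trying each node at most once."""
        for _ in range(len(self.healthy)):
            node = next(self._ring)
            if self.healthy[node]:
                return node
        raise RuntimeError("no healthy nodes available")


balancer = LoadBalancer(["node-a", "node-b", "node-c"])
balancer.shutdown("node-b")  # the attack
routed = [balancer.route() for _ in range(4)]
```

After the shutdown, every request in this model lands on node-a or node-c. The same assertion, that no request reaches a terminated instance, is what we would verify against real metrics during the experiment.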
Second, we can run experiments to verify that our service continues operating even when it can’t connect to services it depends on. For example, we can use a blackhole attack to block network traffic to and from the system that stores stock data, try to access the recommendation service, then observe the results. Does it display an error message? Does it repeatedly try to access the system while making the user wait in the meantime? Or worse, does it cause our platform to crash? We can use similar experiments to verify that other services, such as our mobile app, remain functional while the recommendation service is unavailable.
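One behavior we might hope to observe under a blackhole attack is graceful degradation. The Python sketch below models that outcome under stated assumptions: the fetch functions, the scores, and the ticker names are invented for the example, and the blackhole is simulated by raising ConnectionError instead of dropping real packets.

```python
def healthy_fetch():
    """Stand-in for a successful call to the stock data system."""
    return {"AAA": 0.4, "BBB": 0.9}  # hypothetical tickers and scores


def blackholed_fetch():
    """Stand-in for the same call while traffic is being dropped."""
    raise ConnectionError("stock data system unreachable")


def recommend(fetch, cache):
    """Return ranked stocks, falling back to cached data if the fetch fails."""
    try:
        data = fetch()
        cache.update(data)
        stale = False
    except ConnectionError:
        data = cache  # degrade gracefully instead of crashing or retrying forever
        stale = True
    ranked = sorted(data, key=data.get, reverse=True)
    return {"stocks": ranked, "stale": stale}


cache = {}
normal = recommend(healthy_fetch, cache)
degraded = recommend(blackholed_fetch, cache)
```

Here the service keeps answering during the attack and flags the result as stale so the interface can say so. Whether stale recommendations are acceptable is a business decision; the experiment only tells us which behavior we actually have.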
These examples help us verify the reliability of our systems both individually and holistically. Each experiment is an opportunity for us to gain insight into how our systems perform under pressure and where their faults lie. As we learn from these insights and implement fixes, we can repeat our experiments to avoid introducing regressions and provide the best possible experience for customers.
Proactively testing for compliance
Finance companies face enormous pressure from regulatory agencies and financial markets regarding service availability. Failing to provide a certain level of service can result in fines, investigations, and lawsuits, as well as loss of customer trust and damage to the company brand. As an example, three of the UK’s largest banks all experienced outages on the same day, triggering a parliamentary inquiry into the cause of the outages. High-profile outages like these create a culture of risk avoidance and discourage innovation and agility, especially in larger financial firms.
Strategies like the governance, risk, and compliance (GRC) framework help guide the development of IT systems in ways that align with business objectives, reduce risk, and maintain compliance. In addition, companies managing critical services often provide service level agreements (SLAs) for their customers, which are contracts guaranteeing the availability of a service. SLAs set the level of expectation that customers should have when utilizing a service and provide avenues for reimbursement if the provider fails to meet those expectations. This creates an additional financial incentive for companies to engineer reliable systems and respond quickly to potential problems.
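To see how little room an availability commitment leaves, it helps to convert the percentage into minutes. The short Python calculation below does this for a 30-day period. The availability targets are illustrative assumptions, not figures from any particular SLA.

```python
def allowed_downtime_minutes(availability_pct, period_days=30):
    """Minutes of downtime permitted in a period at a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


# Hypothetical targets, rounded to two decimal places for readability.
budgets = {
    pct: round(allowed_downtime_minutes(pct), 2)
    for pct in (99.0, 99.9, 99.99)
}
```

Under these assumptions, a 99.9% target allows roughly 43 minutes of downtime in a 30-day month, and 99.99% allows a little over 4 minutes. A single outage lasting over an hour, like the one described at the start of this paper, would exceed a 99.9% monthly budget on its own.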
Meeting these obligations is further complicated by mergers and acquisitions (M&A), in which several different and sometimes incompatible systems must be consolidated. Consolidation is rife with technical challenges, instability, and countless potential failure points. Without careful planning, IT integration can add incremental costs of 50–100% on top of what banks already spend, and can take 50% longer than expected to deliver its expected value. In the merger between Sabadell and TSB, a failed three-year integration project resulted in a web and mobile banking outage lasting at least a week, millions of dollars in post-merger costs, multiple investigations, and lost income due to waived fees.
Chaos Engineering gives engineering teams insight into how systems behave in general, not just how they can fail. This creates an opportunity to find and address vulnerabilities, resolve the underlying causes of outages, and create more reliable systems. By doing this early in the development and integration process, finance companies can avoid the fines and scrutiny that come with high-profile IT failures. In addition, engineers can focus their efforts on providing value to the company instead of fighting fires when unanticipated and unmitigated failures occur.
Unfortunately, many companies don’t have the processes in place to proactively test for failures. In our white paper titled The New QA, we show how traditional software and systems testing only accounts for a small fraction of the many ways that systems can fail. Systems are becoming more complex and engineering teams are releasing changes more frequently, and traditional testing practices are no longer sufficient. This creates a reliability gap, where fast-paced development and limited testing leave many unknown variables that can grow into outages.
Chaos Engineering bridges this gap by helping engineering teams test their systems holistically, implement fixes that improve reliability, meet their SLAs, and avoid expensive outages. This is especially true during M&A. When multiple systems are integrated, the risk of something going wrong is compounded. Chaos Engineering helps engineering teams verify that the combined system provides the stability and reliability that financial companies need.
Why Gremlin is the preferred Chaos Engineering solution
While there are many open source Chaos Engineering projects available, Gremlin is the only enterprise Chaos Engineering platform built by experts who pioneered it at Amazon and Netflix. Gremlin is simple, safe, secure, and comprehensive, letting you quickly and easily run chaos experiments across your infrastructure. This includes bare metal systems, virtual machines, and containers running on-premises or in the cloud.
How Gremlin helps teams improve reliability and prepare for incidents
Gremlin provides the most value when experiments are run on production systems, as this is the most effective way to verify reliability. However, we recognize that this comes with inherent risks. To mitigate this risk, Gremlin gives you complete control over how your attacks are executed, including control over the blast radius (the number of systems impacted by the attack) and magnitude (the intensity of the attack). If you’re concerned about running experiments in production, you can start by running small-scale attacks on non-critical systems, or by running attacks in a test environment. As you become more familiar with Gremlin and more confident in the reliability of your systems, you can easily redirect your attacks to your production systems.
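The idea of a blast radius can be illustrated with a small Python sketch that picks a percentage of hosts to target. This is a generic illustration, not Gremlin's targeting logic or API: the function select_targets, the host names, and the use of a seed for repeatability are assumptions made for the example.

```python
import math
import random


def select_targets(hosts, blast_radius_pct, seed=None):
    """Pick a random subset of hosts covering the requested percentage."""
    if not 0 < blast_radius_pct <= 100:
        raise ValueError("blast radius must be between 0 (exclusive) and 100")
    count = math.ceil(len(hosts) * blast_radius_pct / 100)
    return random.Random(seed).sample(hosts, count)


hosts = [f"test-host-{i}" for i in range(10)]  # hypothetical inventory
small = select_targets(hosts, 10, seed=1)   # first run: a single host
wider = select_targets(hosts, 50, seed=1)   # later run: half the fleet
```

Growing the percentage only after the smaller run behaves as expected mirrors the progression described above, from non-critical systems toward production.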
When starting with Gremlin, we recommend planning the execution of your first chaos experiments and tracking your progress towards increased reliability. GameDays were created for this exact purpose: they provide a structured and safe way of running experiments. In a GameDay, a team of engineers (typically the owners of an application or system) comes together in a single location to run a specific set of experiments on a system and observe the results. Involving the owners of the system is important for two reasons: they will gain the most insight from the experiments, and they can respond to unexpected problems. You can learn more about planning GameDays by reading our tutorial on how to run a GameDay.
While GameDays help improve system reliability, they don’t necessarily help teams respond to unexpected or unplanned outages. Teams managing critical systems maintain runbooks (also called playbooks or disaster recovery plans), which instruct engineers on how to respond to outages. As systems change over time, these runbooks must be continuously updated and validated. One method is by running FireDrills, which are staged incidents meant to test runbooks and train teams on how to respond to incidents. FireDrills help teams prepare to respond to real production incidents, reducing your mean time to resolution (MTTR) during an actual outage.
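MTTR is simple to compute once detection and resolution times are recorded, which makes it a practical way to track whether FireDrills are paying off. The Python sketch below assumes incidents are stored as pairs of timestamps. The function name and the sample incidents are hypothetical and are not drawn from real data.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average the time between detection and resolution across incidents."""
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents as (detected, resolved) pairs.
incidents = [
    (datetime(2020, 1, 6, 9, 0), datetime(2020, 1, 6, 9, 30)),
    (datetime(2020, 2, 3, 14, 0), datetime(2020, 2, 3, 15, 0)),
]
mttr = mean_time_to_resolution(incidents)
```

With one 30-minute and one 60-minute incident, the result is 45 minutes. Comparing this figure before and after a series of FireDrills gives a simple, if coarse, measure of how much faster the team responds.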
Gremlin facilitates GameDays and FireDrills by letting you plan, execute, and record your observations on experiments. You can perform ad-hoc attacks, schedule attacks to run at a specific time, or use Scenarios to link multiple attacks together to simulate real-world outages. This lets you carefully plan experiments in advance, track your observations and improvements to system reliability, and easily halt experiments at any time.
Our commitment to security and safety
At Gremlin, we take security seriously. We recognize the challenges financial firms face when considering new tools, especially tools that interact with how their systems operate. For this reason, we have several security and safety features built into our product, including:
- Agents that require no administrative or root permissions to run
- Proxy support for outbound network traffic
- TLS encryption for data in transit and AES-256 encryption for data stored on Gremlin servers
- Single sign-on (SSO) authentication, multi-factor authentication (MFA), and access logging
- Role-based access controls (RBAC)
- Compliance with ISO 27001 and 27017, PCI DSS Level 1, and SOC 1, 2, and 3
If an unexpected problem occurs, you can safely halt and roll back any ongoing attacks at any time. Our agents also have a built-in failsafe that stops an attack if the agent loses contact with Gremlin servers.
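The general pattern behind a failsafe like this is often called a dead man’s switch: the attack continues only while the agent keeps hearing from its control plane. The Python sketch below shows the pattern in its simplest form. It is not Gremlin’s implementation. The AttackFailsafe class, the 10-second limit, and the numeric timestamps are assumptions chosen to keep the example deterministic.

```python
class AttackFailsafe:
    """Stops an attack when the control plane has been silent for too long."""

    def __init__(self, max_silence_s, started_at):
        self.max_silence_s = max_silence_s
        self.last_contact = started_at
        self.attack_active = True

    def heartbeat(self, now):
        """Record a successful contact with the control plane."""
        self.last_contact = now

    def check(self, now):
        """Halt the attack if the last contact is older than the allowed silence."""
        if self.attack_active and now - self.last_contact > self.max_silence_s:
            self.attack_active = False  # a real agent would also revert the fault
        return self.attack_active


failsafe = AttackFailsafe(max_silence_s=10, started_at=0)
failsafe.heartbeat(now=5)
still_running = failsafe.check(now=12)  # 7 seconds of silence: keep going
after_loss = failsafe.check(now=20)     # 15 seconds of silence: halt
```

The key design choice is that losing contact halts the attack by default, so a network partition caused by the experiment itself cannot leave a fault running unattended.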
With technology changing at an ever-increasing rate, increased competition from fintechs, customers demanding 100% availability, and greater scrutiny from external parties, financial firms face an uphill battle. With Chaos Engineering and Gremlin, firms can feel confident in their ability to provide value to customers through greater reliability and innovative new services without risking system failures and outages. This requires engineering teams to re-think their approach to systems testing and experimentation, and Gremlin lets you do so safely.