For decades, information technology in the financial services industry meant deploying bulky applications onto monolithic systems like mainframes. These systems have a proven track record of reliability, but don’t offer the flexibility and scalability of more modern architectures such as microservices and cloud computing. During periods of unexpectedly high demand, this inflexibility can cause technical issues for organizations ranging from personal trading platforms to major banks. Conversely, periods of low demand leave computing resources unused, costing these same organizations money.
In addition, financial institutions must navigate many technical challenges in order to:
- Reduce latency as much as possible.
- Release new features to compete with FinTechs.
- Lower operating costs.
- Comply with regulatory requirements.
- Maintain high availability.
Legacy architectures struggle to keep up with these requirements, which is why many organizations are migrating to modern architectures such as microservices, Kubernetes, and cloud computing. Modernization is a difficult and complex process, and we need to ensure that it doesn’t negatively impact the reliability of our applications and systems. We do this by proactively testing our systems to discover ways to improve their resilience, and the best way to do this is with Chaos Engineering. In this article, we’ll explain how Chaos Engineering addresses these challenges and allows for a smoother, more successful modernization process.
Migrating monolithic, on-premises software to a more modern architecture isn’t always as straightforward as “lift-and-shift.” It involves provisioning and configuring new infrastructure, modifying applications, and testing extensively. This creates countless opportunities for problems, such as:
- Developers introducing bugs when rewriting code
- Applications behaving in unexpected ways when deployed to a new environment
- Failing to properly configure the environment
Failing to identify these problems before they reach production can result in poor performance at best, and catastrophic outages at worst.
We also need to consider how modernization changes our testing practices. Modern applications can fail in ways that legacy systems do not. For example, applications that were once co-located on the same server might now be spread out across multiple servers, requiring fast and reliable network connections. Conditions such as high latency, packet loss, and outages have a much greater impact on application performance. This makes it even more important to test our application’s ability to withstand these conditions and, if possible, work around them.
Lastly, moving to a different architecture affects how our applications behave. Tools like container orchestrators and cloud management consoles provide a wealth of features such as horizontal and vertical scaling, load balancing, and automatic failover. We need to understand how these features work and how they affect our applications; otherwise, we risk running into surprising production outages.
Chaos Engineering is the science of intentionally experimenting on a system: we inject precise, measured amounts of harm, observe how the system responds, and use what we learn to improve its resilience. This lets us proactively uncover and address failure modes in our systems. As the world’s leading enterprise Chaos Engineering platform, Gremlin lets us perform chaos experiments in a simple, safe, secure, and comprehensive way.
We will look at a few ways that applying Chaos Engineering practices can help with the modernization process.
Modern architectures and applications have countless interdependent components, configuration options, and downstream dependencies, all of which impact predictability and reliability. Even if we generally understand how a platform like Kubernetes works, the inherent complexity of these platforms means that a minor problem in one component can cause a failure in another component.
For example, in Kubernetes, applications are deployed in units called Pods. Pods communicate entirely over the network, even those located on the same server. A small amount of network latency in one Pod can have a cascading effect on Pods that depend on it, resulting in requests timing out. How do we test a scenario like this?
With Gremlin, we can use a latency Gremlin to add latency to network traffic to and from our application. We can then monitor requests to the application to measure the impact of this additional latency from a user’s perspective. If there is a noticeable impact (for instance, requests fail or take significantly longer than the amount of latency added by the attack), we should investigate the cause and take action to mitigate or resolve it. This way, if we experience a similar problem in production caused by a saturated or degraded network connection, our application is already prepared for it.
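One way to make this comparison concrete is to check measured response times against the amount of latency that was injected. The following is a minimal sketch, not Gremlin functionality: the function name `latency_impact`, the tolerance value, and the sample timings are all hypothetical, and the timings would come from whatever monitoring or load-testing tool is already in use.

```python
from statistics import median


def latency_impact(baseline_ms, during_attack_ms, injected_ms, tolerance_ms=50.0):
    """Compare request timings before and during a latency experiment.

    All inputs are illustrative: the two lists hold response times in
    milliseconds, injected_ms is the latency added by the experiment, and
    tolerance_ms is an assumed allowance for normal measurement noise.
    """
    if not baseline_ms or not during_attack_ms:
        raise ValueError("both timing lists must contain at least one sample")

    observed_increase = median(during_attack_ms) - median(baseline_ms)
    # Delay beyond what the experiment itself injected. A large value hints at
    # amplification, for example retries or serialized downstream calls.
    excess = observed_increase - injected_ms
    return {
        "observed_increase_ms": observed_increase,
        "excess_ms": excess,
        "needs_investigation": excess > tolerance_ms,
    }


if __name__ == "__main__":
    baseline = [100, 110, 105, 95, 102]   # made-up timings before the experiment
    during = [420, 450, 430, 410, 445]    # made-up timings during the experiment
    print(latency_impact(baseline, during, injected_ms=100))
```

In this made-up data the median response time rises far more than the 100 ms that was injected, which is the kind of result that would warrant a closer look at timeouts, retries, and call ordering.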
Cloud platforms and distributed computing platforms emphasize high availability, but outages are still possible. Applications can crash without warning, servers or even entire data centers can experience hardware failures, and we can lose connection to downstream dependencies at any time. We need to ensure that our applications can safely withstand these conditions, and we can do so using Chaos Engineering.
For example, autoscaling is a core feature of cloud platforms. As demand increases, we want to be confident that we can automatically add infrastructure capacity to handle the increased traffic. However, we also want to be sure that we automatically scale down during periods of low demand in order to save on hosting costs.
We can use a resource Gremlin to verify that our autoscaling rules are configured correctly by consuming computing resources, such as CPU time or memory. We can then monitor for changes to our infrastructure to determine whether it responds the way we expect it to. This way, we can be confident that our systems respond appropriately to changes in customer traffic, even if these changes are rapid and hard to predict. Additionally, this gives us an opportunity to verify that our applications continue to provide fast performance and low latency at high scale.
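To judge whether the infrastructure responds the way we expect, it helps to write the expectation down before running the experiment. The sketch below assumes a proportional scaling rule like the one documented for the Kubernetes Horizontal Pod Autoscaler (desired replicas = ceil(current replicas * current metric / target metric)). Other platforms use different rules, real autoscalers also apply tolerances and stabilization windows that are ignored here, and the function names, limits, and numbers are hypothetical.

```python
import math


def expected_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                      min_replicas=1, max_replicas=10):
    """Replica count predicted by an assumed proportional autoscaling rule."""
    if target_cpu_pct <= 0:
        raise ValueError("target_cpu_pct must be positive")
    desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    # Clamp to the configured bounds, as autoscalers typically do.
    return max(min_replicas, min(max_replicas, desired))


def scaling_matches(observed_replicas, current_replicas, current_cpu_pct,
                    target_cpu_pct, **limits):
    """True if the replica count seen after the experiment matches the prediction."""
    predicted = expected_replicas(current_replicas, current_cpu_pct,
                                  target_cpu_pct, **limits)
    return observed_replicas == predicted


if __name__ == "__main__":
    # Made-up scenario: 4 replicas, CPU pushed to 90% against a 50% target.
    print(expected_replicas(4, 90, 50))
    # Made-up observation: the platform never scaled beyond 4 replicas.
    print(scaling_matches(4, 4, 90, 50))
```

The same check works in the other direction: once the resource attack ends and utilization drops, the predicted count falls, and we can confirm that the platform actually scales back down.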
A migration isn’t a “one-and-done” project. Applications and infrastructure change over time due to new feature deployments, bug fixes, and normal everyday operation. All of these changes can negatively impact reliability, but we can reduce this risk by performing Continuous Chaos.
Continuous Chaos is the use of Chaos Engineering in a repeated or scheduled manner. This can involve manually running chaos experiments on a periodic basis, or automating chaos experiments as part of our CI/CD pipeline. This helps ensure that we’re meeting our reliability targets on an ongoing basis, which in turn helps us:
- Remain compliant with regulations.
- Meet service level agreements (SLAs).
- Lower the risk of a sudden and unexpected outage.
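When chaos experiments run as part of a CI/CD pipeline, their results need to become a pass or fail signal. The following is a minimal sketch of such a gate. It does not call Gremlin or any real CI system, and the `ExperimentResult` type, its fields, the experiment names, and the thresholds are hypothetical stand-ins for whatever the experiment tooling and reliability targets actually are.

```python
from typing import NamedTuple


class ExperimentResult(NamedTuple):
    name: str
    error_rate: float      # fraction of failed requests during the experiment
    p95_latency_ms: float  # 95th percentile response time during the experiment


def failed_experiments(results, max_error_rate=0.01, max_p95_latency_ms=500.0):
    """Return the names of experiments that breached the assumed targets."""
    return [
        r.name
        for r in results
        if r.error_rate > max_error_rate or r.p95_latency_ms > max_p95_latency_ms
    ]


def gate_exit_code(results, **thresholds):
    """Return 0 if every experiment met its targets and 1 otherwise."""
    return 1 if failed_experiments(results, **thresholds) else 0


if __name__ == "__main__":
    # Made-up results for two hypothetical experiments.
    results = [
        ExperimentResult("latency-checkout", error_rate=0.002, p95_latency_ms=340.0),
        ExperimentResult("cpu-payments", error_rate=0.030, p95_latency_ms=280.0),
    ]
    print("failed:", failed_experiments(results))
    print("exit code:", gate_exit_code(results))
```

A pipeline step could pass the returned value to `sys.exit` so that a breached target stops the deployment, which keeps the reliability check on the same footing as any other automated test.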
Not only does Continuous Chaos help us ensure that our systems continue working optimally, but it also reinforces an organizational focus on reliability. In addition to the technical practices, such as running latency and resource attacks, we can test our team’s ability to identify and respond to reliability issues. This includes:
- Running GameDays, where we practice running chaos experiments as a team and analyze the technical outcomes to determine the resilience of our applications and systems.
- Running FireDrills, where we deliberately cause unexpected outages in order to test our team’s incident response processes.
The goal of Continuous Chaos isn’t to keep our teams in a constant state of anxiety, but to ensure that reliability remains a top priority before, during, and after the modernization process.
No matter the architecture, there is always a risk of failure. Modern architectures may provide more tools to help us address reliability concerns, but they’re only effective when we know they work. Using Chaos Engineering, we can proactively test our applications and reduce the risk of outages early in the modernization process. By uncovering and addressing failure modes early, we can save engineering time, save money, avoid costly outages and compliance breaches, and keep our customers happy.
To learn more about using Chaos Engineering in financial services, read our white paper on Improving the Reliability of Financial Services. If you want to get started with Chaos Engineering, sign up for a Gremlin Free account and join the Chaos Engineering Slack community.