Creating a Culture of Reliability at Visa Cross-Border Solutions

Using Gremlin, Visa Cross-Border Solutions was able to standardize resilience testing in staging to create a culture of reliability that improved the resilience and availability of services across their organization.

Outages stopped before reaching production

Testing in a production-parity staging environment routinely uncovers risks that would have caused major P1 outages, so they can be addressed before code ships.

Notable technologies

AWS • Kubernetes

Executive Summary

When a new CTO joined Visa Cross-Border Solutions with the goal of improving uptime, the company turned to Chaos Engineering to proactively find and fix failures before they reached production. By using Gremlin, they were able to standardize resilience testing in staging to create a culture of reliability that improved the resilience and availability of services across their organization.

“We couldn’t have done this without Gremlin and the close working relationship we have with them.”

Chris Kempster, Senior NFT Engineer, Visa Cross-Border Solutions

The Challenge:

How do you meet high availability needs for complex transactions?

Cross-border financial transactions involve a web of security, regulatory, compliance, and currency requirements. To navigate these quickly and reliably, Visa Cross-Border Solutions built an architecture of over 120 microservices in the cloud that serves thousands of clients across the world.

Despite this complexity, there was no dedicated function for testing performance and resilience. To close that gap, the CTO hired a Head of Non-Functional Testing (NFT) to bring in best practices for testing the system's availability.

The Solution:

Create a culture that addresses failures before they cause outages

The first step was to build a dedicated staging environment exclusively to test stress and failure scenarios. Both the CTO and the Head of NFT were familiar with Gremlin for chaos testing from previous roles and knew how effectively it could improve reliability.

After Gremlin was fully set up in the staging environment, the reliability team at Visa Cross-Border Solutions was able to uncover failures across their organization using standardized and automated resilience testing.

Coupled with a robust staging environment and simulation of production-level traffic, this testing has created a culture of reliability that allows engineering teams and service owners to find and fix reliability risks in staging so they can confidently deploy resilient, reliable code into production.

“One of the things that was awesome about Gremlin was the ability to customize the scenarios with the test suites, then standardize them for our teams, so we were able to dial in the test to right where we needed them.”

Chris Kempster, Senior NFT Engineer, Visa Cross-Border Solutions

How Visa Cross-Border Solutions proactively improves reliability

Verified resilience to known failures

After an outage or incident, one of the first actions taken by engineering is to make sure the same failure doesn’t happen again. However, once the fixes have been implemented, how can you verify your resilience to the same outage without waiting for the same dangerous circumstances to recur?

By safely recreating the same traffic, network, dependency, or similar condition that caused the outage, resilience testing helps you make sure the fixes work the way they’re supposed to. And if they don’t, you’re able to take further action before the failure causes another outage.

This was where the Non-Functional Testing (NFT) team at Visa Cross-Border Solutions first introduced Chaos Engineering to many of their teams.

They started with automated reliability score tests every fortnight for the most critical services to ensure they met a minimum score of 90. The team owning any service that scored below 90 would investigate the causes in the same sprint so the tests could be rerun.
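The fortnightly rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the service names and scores are invented, and `ScoreResult` and `services_needing_follow_up` are hypothetical helpers, not part of Gremlin or of the team's actual tooling. The only detail taken from the text is the minimum score of 90.

```python
from dataclasses import dataclass

MINIMUM_SCORE = 90  # minimum reliability score stated in the text


@dataclass(frozen=True)
class ScoreResult:
    service: str  # hypothetical service name
    score: int    # reliability score from one fortnightly run (assumed 0-100)


def services_needing_follow_up(results, minimum=MINIMUM_SCORE):
    """Return the names of services scoring below the minimum, lowest score first."""
    failing = [r for r in results if r.score < minimum]
    return [r.service for r in sorted(failing, key=lambda r: r.score)]


if __name__ == "__main__":
    # Invented example data for a single fortnightly run.
    fortnightly_run = [
        ScoreResult("payments-api", 94),
        ScoreResult("fx-rates", 87),
        ScoreResult("ledger", 90),
        ScoreResult("notifications", 72),
    ]
    for name in services_needing_follow_up(fortnightly_run):
        print(f"{name}: below {MINIMUM_SCORE}, investigate this sprint and rerun")
```

A score of exactly 90 passes here because the text describes 90 as the minimum to meet. Sorting the lowest scores first is a design choice of this sketch, so the services furthest from the bar are reviewed first.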

The NFT team also started a culture of GameDays, held every Friday from 10 AM to 12 PM in the staging environment. They have conducted over 60 GameDays so far, building the process of resolving P1 and P2 incidents into the engineers' muscle memory.

The test results helped the engineering teams verify and prove the effectiveness of their work, allowing them to improve reliability and resilience while also increasing their trust in resilience testing.

Outages stopped before reaching production

Resilience testing is effective because it accurately simulates real-world conditions that could lead to failure. One hazard, however, is testing in a separate environment whose setup, traffic, and workloads differ from production. This can create false positives, where a service seems resilient in the low-traffic, simpler staging environment but can’t stand up to the traffic, throughput, and complexity of real-world conditions.

To prevent this issue, the CTO of Visa Cross-Border Solutions committed to creating a staging environment that mirrors production, including the ability to simulate production traffic, data, dependencies, and more. While this effort requires more resources, the investment pays off by preventing failures and bad code from reaching production in the first place.

Using Gremlin, service teams work with the NFT team to simulate real-world conditions in staging. Any test failures or “outages” happen in staging, where they can be safely addressed and the fixes verified without impacting customers. By the time the code is deployed, it’s resilient to known and common failures, resulting in a more reliable and available production environment.

“We put the onus on the teams to own reliability. This was key. It wasn’t enough to meet with everyone and run the experiments. We handed that over to them to care about their services and help them understand more about their services and what could go wrong.”

Chris Kempster, Senior NFT Engineer, Visa Cross-Border Solutions

Culture of reliability across teams

Visa Cross-Border Solutions runs on a complex web of more than 120 microservices. With a company and systems that extensive, it’s impossible for a single team to be responsible for testing and verifying the resilience of all those services. The only solution is to create a culture of reliability across the organization, from the CTO to service leaders to engineers. In that culture, engineers think about reliability risks and build resilience into services as they code, and they are empowered to run and act on their own resilience tests.

This kind of reliability culture doesn’t happen overnight, but the team at Visa Cross-Border Solutions has used Gremlin to become an industry leader in reliability. What started as verifying resilience to past outages grew into service owners asking the NFT team to run a wide-spectrum battery of resilience tests on their services. Those GameDays, in turn, evolved into automated testing embraced by service teams.

Using Gremlin Reliability Management, the NFT team created test suites and risk monitoring customized to fit their exact reliability standards. These are integrated into release processes and automated to run in staging. Individual service owners are able to view results and reliability scores, lead efforts to address the failures, then run the tests again themselves.

The result? An industry-leading team of engineers deploying code that’s more reliable, resilient, and available for customers.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.
