Building resiliency at a 160-year-old bank
NAB, Australia’s largest business bank and one of the ‘big four,’ kicked off its Technology Transformation program at the end of 2018 in pursuit of simplicity, agility, and resilience, and to stay relevant in an ever-evolving competitive landscape.
Movement to the Cloud
NAB now has more than 45% of its apps on cloud. It has established the NAB Cloud Guild to deliver its #1 position in the country for number of employees with industry-recognised cloud certification, is bolstering its DevOps practices through the NAB Engineering Foundation (NEF), and is strengthening its business practices with its own frameworks, such as CAST (Cloud Adoption Services & Techniques) and Multi-cloud. As NAB continues on this journey, reliability and resiliency are top of mind.
"As the organisation grows further and deeper into its technology transformation, resilience becomes increasingly important. This pushes NAB’s engineering teams to ensure reliability is baked into their processes. After all, it’s our responsibility to see that the services we provide are resilient," said Chaitanya Krant, Engineering Manager at NAB.
Finding an enterprise-ready Reliability solution
Chaitanya began looking for ways to drive Reliability Engineering standardization across the company. In his research, he came across Chaos Engineering and the practice of running Game Days and bootcamps to test how resilient services and their dependencies are in the cloud.
"With hundreds of teams running through their own workflows and with multiple tools being used, it became apparent that we needed to standardise our approach," said Chaitanya.
Chaitanya suggested that NAB’s service management and continuity teams run a bootcamp to introduce the practice in the bank. It worked—150 people from 15 teams attended and executed three different experiments to get comfortable with the basics of Chaos Engineering. A few months later, they organised a more hands-on bootcamp where participants ran experiments on their own applications.
Support for Chaos Engineering grew at NAB, and the next step was to find a Chaos Engineering solution that was easy to use and that would allow them to drive cross-team collaboration by creating and sharing experiments between teams. "Chaos Engineering isn’t just about improving any one team’s systems, it’s about helping others improve their systems and share the knowledge across our Engineering Community."
Finding the right Chaos Engineering solution
Prateek Sachan, Engineering Manager, leads a Performance Engineering team at NAB. His team’s primary goal is high availability and reliability of all applications on the Retail Lending platform. Chaitanya and Prateek joined forces to find a solution that would meet NAB’s needs in one of the most highly regulated industries in the country. The solution needed to:
• Be secure enough for Financial Services and offer single sign-on
• Support multi-factor authentication
• Meet strict compliance guidelines
• Have a high level of customer support
• Be customizable
• Be able to revert experiments
They wanted something easy to use so that engineers could focus on experimentation rather than configuration and setup.
"Gremlin checked a lot of the boxes for us: it has a control plane and a simple design, which means the teams don’t have to spend time building the experiments and can focus on running them and improving reliability based on results," said Chaitanya.
Prateek’s team was the first to try Gremlin at NAB and they had a very positive first experience:
“We identified a few defects in an application, and we were impressed with how easy it was for us to identify unknowns. We were quite excited, so we started to share this with other teams.”
With this early success, the team moved to roll out Gremlin to a number of customer-facing applications, including trading platforms and other critical applications.
Creating a culture of reliability, at scale
Even with early success, there was a need to advocate for Chaos Engineering across the organization. In 2019, Chaitanya and Prateek joined forces to run a cross-functional Chaos Engineering advocacy team. This team runs enablement sessions, including bootcamps, Game Days and workshops, designed to shift engineering mindsets with one simple question: what can you break, and how can you improve its reliability?
Game Days have become a critical part of NAB’s Chaos Engineering culture, as they incentivize teams to take a proactive approach to reliability in an engaging, collaborative setting. During Game Days, teams use Gremlin to run specific Scenarios (a series of preconfigured tests) in a single environment. This process allows NAB teams to find potential vulnerabilities and resolve the resulting issues quickly.
While NAB is still early in its Chaos Engineering journey, running experiments with Gremlin is already bolstering its service management capabilities. In one Game Day scenario, the NAB team simulated a specific incident and produced a playbook for managing it. Several weeks later, that playbook was used to resolve the same incident in production. The result: the NAB team was able to reduce time to resolution from what could potentially have been two hours to under 30 minutes.
NAB now has more than 25 engineering teams running Chaos Engineering experiments on Gremlin, with the goal of eventually having 100 teams onboarded. The Chaos Engineering community continues to grow at NAB through collaboration, experimentation, and knowledge sharing. "The developer journey should start and end with reliability, and it is one of the pillars of NAB’s technology transformation," said Chaitanya and Prateek.