4 Chaos Engineering recommendations from Gartner

Gartner recently published their annual Hype Cycle reports, including the Hype Cycle for Infrastructure Platforms. Designed to help heads of infrastructure and IT operations make informed decisions about infrastructure platforms, it includes over thirty different topics covering everything from platform engineering to distributed cloud to policy as code—including Chaos Engineering and Site Reliability Engineering.

Let’s take a look at some of the key recommendations Gartner gave for adopting Chaos Engineering.

1. Leverage Chaos Engineering when embedding GenAI calls

“Leverage Chaos Engineering when embedding generative AI API calls in your applications to test fallback patterns.” —Gartner, Hype Cycle for Infrastructure Platforms, 2025

Organizations across every industry are integrating generative AI into their architectures, often with substantial investments. Generative AI can provide many benefits, but it also adds complexity, dependencies, and points of failure. This is especially true when using API calls to generative AI vendors.

Resilience testing will help you simulate failures, disconnections, or increased latency from these API calls so you can make sure that your fallback patterns perform as expected. Not only will this help you prevent costly outages, but the results of those tests will allow you to prove resilience and show leadership that their investment is protected.
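As a rough illustration of what testing a fallback pattern can look like, here is a minimal in-process sketch in Python. Everything in it is hypothetical: `healthy_llm` stands in for a vendor API call, `inject_fault` simulates an unreachable or slow vendor, and `answer` is an example fallback wrapper. A real Chaos Engineering tool would typically inject these faults at the network or infrastructure level rather than in application code.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical canned reply used when the model cannot be reached in time.
FALLBACK_REPLY = "Sorry, the assistant is unavailable right now."


def healthy_llm(prompt):
    """Hypothetical stand-in for a generative AI vendor API call."""
    return f"summary of: {prompt}"


def inject_fault(fn, *, latency_s=0.0, fail=False):
    """Wrap fn so it fails or responds slowly, simulating a degraded vendor."""
    def wrapped(prompt):
        if fail:
            raise ConnectionError("injected fault: vendor unreachable")
        time.sleep(latency_s)  # injected latency
        return fn(prompt)
    return wrapped


def answer(prompt, llm, timeout_s=1.0):
    """Call the model, falling back to a canned reply on error or timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(llm, prompt).result(timeout=timeout_s)
    except (ConnectionError, FutureTimeout):
        return FALLBACK_REPLY
    finally:
        # Do not block the caller waiting for a slow call to finish.
        pool.shutdown(wait=False)


# Each experiment checks that the caller still gets a usable response.
normal = answer("quarterly report", healthy_llm)
outage = answer("quarterly report", inject_fault(healthy_llm, fail=True))
slow = answer("quarterly report",
              inject_fault(healthy_llm, latency_s=0.5), timeout_s=0.1)
```

In this sketch, the outage and latency experiments should both return the fallback reply while the healthy call returns the model's output. The same three checks (healthy, disconnected, slow) are a reasonable starting set for any dependency on an external AI API.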

Gremlin CEO and founder Kolton Andrus recently sat down with Nobl9 and PagerDuty for a roundtable specifically about AI reliability. Check out this blog for key insights and takeaways, or watch the whole discussion on-demand.

2. Utilize scenario-based tests and GameDays

“Utilize scenario-based tests, known as GameDays, to evaluate and learn about how individual IT systems would respond to certain types of outages, including catastrophic failures such as CrowdStrike.” —Gartner, Hype Cycle for Infrastructure Platforms, 2025

A GameDay is an organized team event to practice Chaos Engineering, test your incident response process, validate past outages, or find unknown issues in your services. Enterprise organizations, including highly regulated financial institutions like NAB, have been utilizing GameDays for years to validate runbooks, optimize observability, uncover reliability risks, and more.

GameDays are especially important in today’s environment of massively interconnected systems, where outages in other systems can impact yours, as with the CrowdStrike incident or the recent GCP outage that affected customers that didn’t even use GCP.

GameDays help companies prepare for catastrophic failures and prove that preparedness to leadership, stakeholders, and regulators. In fact, many financial enterprises that face operational resilience regulations are also using GameDays to verify their Disaster Recovery and Business Continuity playbooks for compliance.

3. Prioritize Chaos Engineering on critical systems

“Prioritize CE activities on critical systems that have elevated security privileges, business-critical services such as payment/payroll or components that are single points of failure.” —Gartner, Hype Cycle for Infrastructure Platforms, 2025

In an era where downtime costs an average of $14,056/min (or $843,360/hr), outages have a material impact on businesses. At the same time, systems are getting larger and more complex than ever, making it easier for blind spots to sneak in. With the growing size of architectures, it can be daunting to try to tackle everything at once.

As engineering and IT leaders adopt Chaos Engineering, they should start their journey by identifying their critical services and getting a baseline of reliability metrics. From there, prioritize testing critical services, verifying resilience to known failures, and validating best practices. By doing so, you’ll be able to uncover key reliability risks, prevent outages, and get a quick, strong ROI on your reliability efforts.
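One simple way to turn the criteria in Gartner's recommendation into a starting order is to tag each service and sort by how many criteria it meets. The Python sketch below is illustrative only: the service names, the `Service` class, and the equal weighting of the three criteria are all assumptions, and a real inventory would likely draw on a service catalog and business input.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical inventory entry tagged with the three criteria."""
    name: str
    elevated_privileges: bool = False
    business_critical: bool = False
    single_point_of_failure: bool = False

    def priority(self):
        # Assumption: each criterion counts equally toward priority.
        return sum([
            self.elevated_privileges,
            self.business_critical,
            self.single_point_of_failure,
        ])


def rank(services):
    """Return services ordered from highest to lowest testing priority."""
    return sorted(services, key=lambda s: s.priority(), reverse=True)


# Made-up example inventory.
inventory = [
    Service("marketing-site"),
    Service("payments-api", business_critical=True,
            single_point_of_failure=True),
    Service("auth-service", elevated_privileges=True,
            business_critical=True, single_point_of_failure=True),
]

ordered = [s.name for s in rank(inventory)]
print(ordered)
```

With this made-up inventory, the authentication service comes first, followed by payments, with the marketing site last. The ranking is only a starting point for deciding where to run the first experiments.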

4. Adopt a platform to track reliability and create metrics

“Adopt a platform or tool to track activities and create metrics to build feedback for continuous improvements.” —Gartner, Hype Cycle for Infrastructure Platforms, 2025

You can’t track what you don’t measure, and that’s especially true when it comes to outage prevention. Reliability metrics created by regular resilience testing, such as Reliability Scores, give you a clear way to show the effectiveness of your efforts. Organizations will often measure reliability by a simple binary metric of uptime/downtime, but real operations are more complicated than that. Decreased performance, increased latency, and brownouts can all have a material impact on a business, and you need a metric that gives you visibility into these factors.
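To see why a binary uptime number can hide degraded service, consider this small Python sketch. It compares plain availability with a stricter ratio that only counts requests that succeeded within a latency threshold. The sample data, the 300 ms threshold, and the function names are assumptions made for illustration; this is not how Gremlin's Reliability Score is calculated.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool            # did the request succeed?
    latency_ms: float   # how long it took


def availability(requests):
    """Binary view: fraction of requests that succeeded at all."""
    return sum(r.ok for r in requests) / len(requests)


def good_ratio(requests, threshold_ms=300.0):
    """Stricter view: succeeded AND responded within the threshold."""
    good = sum(r.ok and r.latency_ms <= threshold_ms for r in requests)
    return good / len(requests)


# Made-up sample: four successes (two of them very slow) and one failure.
sample = [
    Request(True, 120),
    Request(True, 150),
    Request(True, 900),
    Request(True, 1200),
    Request(False, 50),
]

print(f"availability: {availability(sample):.0%}")
print(f"good within 300 ms: {good_ratio(sample):.0%}")
```

In this sample the service looks 80% available, but only 40% of requests were both successful and fast enough. That gap is the kind of brownout a simple uptime metric misses.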

It’s why measurement and metrics are one of the four pillars of a best-in-class reliability program. Any organization committed to increasing their reliability needs to include reliability metrics in their strategy and planning.

Adopt Chaos Engineering to become more resilient

“Chaos Engineering helps organizations become more resilient across their processes, knowledge and technology.” —Gartner, Hype Cycle for Infrastructure Platforms, 2025

Modern systems are getting more and more complex, and that complexity is only going to increase over the coming years, especially with the continued adoption of AI. At the same time, the cost and impact of downtime are increasing, with outages costing each of the Global 2000 an average of $200 million every year.

Just as organizations are prioritizing security investments, every organization needs to prioritize its reliability program. That means starting with Chaos Engineering, then maturing into standardized, organization-wide Reliability Management.

Gavin Cahill
Sr. Content Manager