AI has massively accelerated code deployment. In fact, since the introduction of agentic coding, GitHub has seen exponential growth in PRs, commits, and new repos. Capacity that they originally predicted would need to grow 10X is now estimated to need 30X, and the biggest driver is agentic development.

(Source: An update on GitHub availability)

Companies across industries are building agentic pipelines to ship features faster than ever before.

That acceleration isn’t without risk.

More code has always meant more potential errors, and that's only multiplied when using agentic AI for code generation. In CodeRabbit's recent State of AI vs. Human Code Generation report, they found that AI-generated code averages 1.7x more issues per PR than human-written code, with 1.4x more critical issues, 1.7x more major issues, and nearly double the number of minor issues.

That doesn’t mean you should stop using agentic AI. But it does mean that you need to account for the increased reliability risk it presents. The key is finding the balance between moving quickly and reducing the risk of catastrophic failures.

That’s where reliability guardrails come in. 

Reliability guardrails let your organization take advantage of this new velocity while keeping systems resilient and reliable.

Let’s dig into what those guardrails would look like and how to implement them so you can keep moving fast without flying off the track.

More code and less familiarity mean more risks

The combination of increased error rate and accelerated code generation creates a reliability problem for organizations. If you're now shipping 10x as much code with 1.7x as many issues per line, you have 17 issues for every one you had before.

Organizations need programmatic, scalable reliability guardrails that can make sure new code doesn’t introduce outage-causing risks. You need to be able to answer questions like: Does the code introduce new dependencies? Did it change how your application handles latency? Did configurations revert to a default that deviates from your standards? Can your system fail gracefully if your database is unavailable?

The only way to get answers is by testing.
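To make that concrete, here is what a guardrail test for the last question above might look like: simulate a database outage and assert that the service degrades gracefully instead of erroring out. This is a minimal sketch; `ProductService`, its cached fallback, and the `DatabaseDown` error are hypothetical stand-ins for your own code and fault injection tooling.

```python
# Sketch: verify graceful degradation when the database is unavailable.
# ProductService and DatabaseDown are hypothetical stand-ins.

class DatabaseDown(Exception):
    """Raised when the database is unreachable and no fallback exists."""

class ProductService:
    def __init__(self, db_available=True):
        self.db_available = db_available
        # A stale-but-serviceable cache to fall back on.
        self.cache = {"sku-1": {"name": "Widget", "stale": True}}

    def get_product(self, sku):
        if not self.db_available:
            # Graceful degradation: serve the stale cache instead of failing.
            cached = self.cache.get(sku)
            if cached is not None:
                return cached
            raise DatabaseDown("no cached copy available")
        return {"name": "Widget", "stale": False}

def test_degrades_gracefully_without_db():
    service = ProductService(db_available=False)  # inject the fault
    product = service.get_product("sku-1")
    assert product["stale"] is True  # served from cache, not an outage

test_degrades_gracefully_without_db()
print("graceful degradation verified")
```

In a real guardrail, the fault would be injected by your fault injection tool against a staging environment rather than a constructor flag, but the pass/fail assertion stays the same.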

Use fault injection to verify resilience

Fault injection accurately and safely creates failure conditions, so you know how your systems react to failures such as a dependency going down, a resource being unavailable, or increased latency. This makes it an ideal tool for creating reliability guardrails. When used systematically and at scale, fault injection tests can define the boundary between acceptable application operation and outage-causing reliability risk.

Fault injection generally supports two kinds of testing. Exploratory testing, like the kind used in Chaos Engineering, uncovers new failure modes. Guardrails use the second kind: validation testing.

With reliability guardrails, you're verifying continued resilience against known failure modes. As long as a service or application passes the tests, you know it's operating within acceptable reliability parameters. If it fails a test, you know you need to dig in and resolve the underlying issue.

When using these guardrails for AI, automate the tests as a gate in staging at the end of the CI/CD pipeline. Any issues found can be fed back into the AI agent for correction, and the code is only promoted once the core tests pass and resilience has been verified.
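A pipeline gate with this feedback loop can be sketched in a few lines. Everything here is illustrative: `run_resilience_tests` and `send_findings_to_agent` are hypothetical hooks you would wire to your fault injection provider and coding agent, and the stub agent simply resolves every finding.

```python
# Sketch of a CI/CD reliability gate with an agent feedback loop.
# run_resilience_tests and send_findings_to_agent are hypothetical
# hooks into your fault injection tool and coding agent.

MAX_ATTEMPTS = 3  # escalate to a human after repeated failures

def run_resilience_tests(build):
    # Placeholder: call your fault injection provider and collect failures.
    return build.get("known_issues", [])

def send_findings_to_agent(build, failures):
    # Placeholder: hand the failures back to the coding agent for a fix.
    # This stub assumes the agent resolves everything it is given.
    build["known_issues"] = []

def reliability_gate(build):
    """Promote the build only after the core resilience tests pass."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        failures = run_resilience_tests(build)
        if not failures:
            return "promoted"
        send_findings_to_agent(build, failures)
    return "blocked"

print(reliability_gate({"known_issues": ["no zone failover"]}))  # → promoted
```

The attempt cap matters: an agent that keeps failing the same test shouldn't loop forever, so the gate blocks the build and hands it to an engineer.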

Start with broad coverage, then refine

The whole goal of guardrails is to protect reliability without slowing deployment down. With that in mind, you should start by using tests that cover the widest range of outage causes for the least amount of effort.

So what should they cover? Well, the New Relic Observability Forecast identified the most common causes of outages by asking companies whether they had experienced outages due to a given failure. Their results showed that these are the 10 most common outage causes:

  1. Network failure (35%)
  2. Third-party or cloud provider failure (28%)
  3. Deploying software changes (28%)
  4. Someone making a change to the environment (26%)
  5. Security failure (24%)
  6. Hardware failure (23%)
  7. Power failure (22%)
  8. Capacity constraint (20%)
  9. Unexpected traffic surge (18%)
  10. DNS issue (18%)

Security failures are typically addressed by a dedicated security team, but engineers can build resilience to the rest of these failures into their systems. So the first step in building reliability guardrails is to build a broad coverage group of tests that verifies resilience to as many of these common causes as possible.

These six core tests will help you verify resilience to the majority of outage causes:

  • Zone redundancy - Does your system respond correctly if a zone isn’t available? Beyond failover, this also makes sure that backup or redundant resources are scaled correctly.
  • Host redundancy - Can your system survive the unavailability of specific resources, such as a host or container?
  • CPU scalability - If there’s a sudden surge in CPU requests, can your service scale? And can it scale back down afterward?
  • Memory scalability - Will your service scale memory as needed? If it can't, does it degrade gracefully?
  • Dependency failure - If a dependency is unavailable or goes down, how will your system respond? This could include external/third-party dependencies or internal ones like databases.
  • Dependency latency - If latency on your dependency connections increases substantially, will your system respond correctly? For example, does your application fail over correctly when its cache is slow to respond?

The first step is to run these against your services to get a baseline of your current reliability. Address any risks or issues that come up, then test again to verify your fixes. This gives you a reliability standard to judge new releases against. If a candidate fails to meet the standard, you identify the issues, fix them, and test again before release.

Over time, you can add additional tests tailored to your environment, but by starting with these six tests, you’ll get the most failure mode coverage for the lightest lift, allowing you to verify reliability without slowing down.

Keep the guardrails separate from the coding agent for independent verification and safety

You have to be able to count on your guardrails, which means they have to be strong and effective. This is why it’s important to run tests independently of the AI agents writing the code. This helps avoid destructive actions and helps make sure you can count on the neutrality of the test results. For example, if you tell an AI agent to test how your application handles a missing database, there’s no guarantee it won’t simply delete that database, which, technically, would fulfill the terms of the test.

Instead, you can have agents use API calls to established, proven fault injection methods. Over the years, fault injection technology has come a long way in figuring out how to safely create actual failure conditions without causing damage or problems. By using established providers and methods, you can count on the accuracy and safety of the tests.
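One way to enforce that separation is to give the agent a narrow wrapper that can only request predefined, reversible experiments from the provider's API, never touch infrastructure directly. The endpoint, payload shape, and allow list below are hypothetical, not any vendor's actual schema.

```python
# Sketch: the agent can only request allow-listed, auto-reverting
# experiments through the fault injection provider's API.
# The endpoint and payload shape are illustrative.
import json
import urllib.request

ALLOWED_FAULTS = {"latency", "blackhole", "cpu", "memory"}  # no destructive ops

def build_experiment_request(fault, target, duration_s=60,
                             api_url="https://faults.example.com/v1/experiments"):
    """Validate and package the only kind of request the agent may send."""
    if fault not in ALLOWED_FAULTS:
        raise ValueError(f"fault {fault!r} is not on the allow list")
    payload = {
        "fault": fault,
        "target": target,
        "duration_s": duration_s,
        "halt_on_failure": True,  # the provider auto-reverts the fault
    }
    return urllib.request.Request(
        api_url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"}, method="POST")

req = build_experiment_request("latency", "dependency:database")
print(req.get_method(), req.full_url)  # → POST https://faults.example.com/v1/experiments
```

Because the wrapper rejects anything outside the allow list, an agent asked to "test a missing database" can only black-hole traffic to it temporarily; deleting the database simply isn't a request it can express.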

Testing should always be as non-destructive and safe as possible, even in staging environments.

Roll out guardrails across your entire organization

Modern applications are complex chains of hundreds of microservices and dependencies that are only as reliable as their least-resilient link. It doesn’t matter if one single service is incredibly resilient and reliable. The only way to truly improve reliability is to test and apply reliability guardrails across your entire organization.

Gremlin was specifically designed with use cases like these in mind. The pre-built reliability test suite includes the core tests needed for the most common outages, while the central management capabilities and intelligent health checks make it easy to onboard new services. And with open APIs and MCP servers, Gremlin can integrate into your CI/CD pipeline and agentic workflows to help make automation even easier.

You don’t have to choose between reliability and deployment speed. With the right reliability guardrails, you can use AI agents to move quickly without sacrificing reliability and availability.

Gavin Cahill
Sr. Content Manager