Creating an agentic feedback loop with reliability guardrails

Reliability guardrails help your applications stay reliable without slowing development down. In an earlier post, we explained why agentic AI development needs reliability guardrails: the increased speed of AI development demands automated guardrails to verify resilience. We also covered what kinds of tests these guardrails should include.

But that’s only the beginning. By themselves, guardrails act as a gate to ensure resilience mechanisms hold under rapid changes. When set up to create a feedback loop, they can also help AI agents produce higher-quality code and react faster when an outage does occur.

Agentic code review isn’t enough for reliability

Every organization using AI for code generation should adopt some form of agentic code review. This is essential to provide that extra layer of verification, catch bugs, and make sure that everything is set up according to policies.

But we’ve all seen code that passes review and every single QA test, then fails in production. This issue is only exacerbated with AI coding. To avoid overloading context windows, agents typically only have access to information relevant to the code they are changing, not to the broader system it runs in. As a result, it’s easy to generate a solution that passes unit tests but breaks once in production.

The only way you can be sure is to have data on how your broader systems respond to actual failures. This is the logic behind fault injection testing and Chaos Engineering: safely create realistic failure conditions, such as a resource being unavailable, and make sure your application responds correctly.

A key idea behind Chaos Engineering was to test failures as efficiently as possible. Instead of exhaustively writing tests for each individual failure, we could replicate the actual failure conditions by consuming compute resources, adding network latency, and so on. This allowed us to cover dozens of potential failures with a single test, validating that our systems work as intended with far less effort.

This concept still holds true with AI coding. We need an efficient way to verify the actual resilience of code before we rely on it in production. And these same tests can be used to produce unique data that can strengthen our agents.
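To make the fault-injection idea concrete, here is a toy, in-process sketch. It is not how Gremlin or any other Chaos Engineering tool injects faults, which happens at the infrastructure level. All names (`get_price`, `blackholed_fetch`, `DependencyUnavailable`) are hypothetical and chosen only for illustration. The point is that one injected condition, "the dependency is unreachable," stands in for many underlying causes such as a crashed service, a DNS failure, or a network partition.

```python
from typing import Callable


class DependencyUnavailable(Exception):
    """Stands in for any way a dependency can become unreachable."""


def get_price(fetch: Callable[[], float], cached: float) -> tuple[float, str]:
    """Return a live price, or fall back to the cached one if the dependency fails."""
    try:
        return fetch(), "live"
    except DependencyUnavailable:
        return cached, "cached"


def healthy_fetch() -> float:
    # Baseline behavior: the (hypothetical) pricing service answers normally.
    return 19.99


def blackholed_fetch() -> float:
    # Injected fault: simulate the pricing service being unreachable.
    raise DependencyUnavailable("pricing service unreachable")


def run_experiment() -> dict[str, tuple[float, str]]:
    """Run the same code path with and without the injected fault."""
    return {
        "baseline": get_price(healthy_fetch, cached=18.50),
        "dependency_down": get_price(blackholed_fetch, cached=18.50),
    }


if __name__ == "__main__":
    for scenario, outcome in run_experiment().items():
        print(scenario, outcome)
```

If `get_price` had no fallback, the `dependency_down` scenario would raise instead of returning a cached value, which is the kind of gap a unit test written against a healthy dependency would not surface.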

Add guardrails as an automated CI/CD gate

In the past, using resilience tests in the CI/CD pipeline has created a tradeoff. Yes, it ensures resilience before release, but it also creates a bottleneck when engineers are required to manually review any failed tests to address issues.

Agentic development changes this situation. The increased risk potential demands additional guardrails before deployment, which means placing resilience tests as a gate at the end of the CI/CD pipeline.

Running resilience tests at the end of the pipeline verifies resilience right before code goes to production. The later the tests run in the software development lifecycle (SDLC), the closer the tested code is to what actually ships, and the more confidence we can have that it will be resilient once in production.

In order to minimize the bottleneck, the tests need to be automated. At a minimum, we should automate the tests to run during off-hours. That way, all of the results, including any recommendations to address failed tests, are waiting when an engineer signs back on.

But if we really want to increase resilience without slowing down, we need to automate remediation as well. With this approach, any failed tests produce recommended fixes. These recommendations go straight back to the AI coding agent, creating a cycle where tests are run automatically, any failures are addressed, and then the tests are run again until all of them pass.

The goal is to set up a system that can automate the removal of reliability risks so you can be sure that every release candidate complies with resilience policies.
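The test, fix, and retest cycle described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a Gremlin API: `CheckResult`, `run_gate`, `run_tests`, and `request_fix` are hypothetical names, and the attempt limit is an added assumption so that the loop cannot run forever if the agent never produces a passing fix.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    """One resilience test outcome (hypothetical shape)."""
    name: str
    passed: bool
    recommendation: str = ""


def run_gate(
    run_tests: Callable[[], list[CheckResult]],
    request_fix: Callable[[list[CheckResult]], None],
    max_fix_attempts: int = 3,
) -> bool:
    """Rerun tests after each round of fixes; return True once everything passes."""
    attempts = 0
    while True:
        failures = [r for r in run_tests() if not r.passed]
        if not failures:
            return True  # gate opens: release candidate meets policy
        if attempts >= max_fix_attempts:
            return False  # budget exhausted: escalate to a human
        request_fix(failures)  # hand recommendations back to the coding agent
        attempts += 1


def demo(max_fix_attempts: int = 3) -> bool:
    """Simulate an agent that resolves one failing check per round."""
    missing = {"circuit_breaker", "retry_budget"}
    checks = ("circuit_breaker", "retry_budget", "autoscaling")

    def run_tests() -> list[CheckResult]:
        return [CheckResult(c, c not in missing, f"add {c}") for c in checks]

    def request_fix(failures: list[CheckResult]) -> None:
        missing.discard(failures[0].name)

    return run_gate(run_tests, request_fix, max_fix_attempts)
```

In this simulation, `demo()` passes after two rounds of fixes, while `demo(max_fix_attempts=1)` returns `False`, which is where a real pipeline would block the release and notify an engineer.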

Creating a feedback loop to improve code quality

When a company’s reliability practice matures, there’s also a shift in how engineers approach their code in the first place. Regular testing makes them more familiar with how their systems can fail and with best practices to avoid those failures. As a result, they start building more resilience into the code to begin with.

Automated reliability guardrails allow us to do something similar with agentic AI. On a human level, the engineers looking at the test results will be able to adjust prompts, skills, and instructions to compensate for resilience issues. But the results of the resilience tests can also be fed back into the coding agents as context.

Let’s look at an example. Say a release candidate didn’t have circuit breakers set up on its network calls, causing it to fail dependency tests. Even after this is fixed and the code is deployed, the same AI agent could code a future candidate the same way. But if we feed these results and the resolution back into the agent as context, then the agent can compensate and generate future code without the same issue.

What we’re doing in this case is an extension of the basic premise for coding agents, where the agent makes a change, runs a unit test, and then refines the change based on failure. The difference is that, instead of testing code that’s still controlled by the agent, we’re testing deployed code. So to get the data back to the agent, we’ll need to use something like an MCP server, integrate the data directly into a harness, or use some other tool. For example, we can do this right now with Gremlin by using our MCP server, which connects to Gremlin via API.
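As a rough illustration of the circuit-breaker example above, the sketch below turns past test failures and their resolutions into a short block of text that could be supplied to a coding agent as context. The `ResilienceFinding` structure and `build_agent_context` function are hypothetical. How the text actually reaches the agent (an MCP server, a harness, or an instructions file) is left out, and the output format is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ResilienceFinding:
    """A failed resilience test plus how it was resolved (hypothetical shape)."""
    service: str
    test: str
    failure: str
    resolution: str


def build_agent_context(findings: list[ResilienceFinding]) -> str:
    """Render past findings as plain-text guidance for a coding agent."""
    if not findings:
        return ""
    lines = ["Resilience lessons from earlier release candidates:"]
    for f in findings:
        lines.append(
            f"- {f.service} failed the {f.test} test: {f.failure} "
            f"Resolution: {f.resolution}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    findings = [
        ResilienceFinding(
            service="checkout",
            test="dependency",
            failure="Network calls had no circuit breaker.",
            resolution="Wrap outbound calls in a circuit breaker with a fallback.",
        )
    ]
    print(build_agent_context(findings))
```

Keeping each finding to a single line is a deliberate choice here: the feedback should add signal without crowding the agent's context window.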

Decreasing MTTR for AI SREs

Unfortunately, even with resilience testing, unknown failures can still cause outages. When they do, resilience test results can help reduce AI SRE resolution time by providing valuable context to focus the search.

Take an autoscaling issue as an example. If an application goes down during a traffic surge, an AI SRE will have to start with a broad scope. But what if six of the eight microservices in the application passed resource scaling tests before deployment? Suddenly, the scope can be narrowed to focus on the other two microservices.

This will help minimize the length of the outage and also prevent the agents from burning through extra tokens with unnecessary searches.
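The narrowing step in the autoscaling example can be sketched as a simple filter over pre-deployment test results. The service names and the `prioritize_services` function are hypothetical. The sketch also assumes that a service with no recorded result is treated as unverified, and that services which passed are deprioritized rather than ruled out entirely.

```python
def prioritize_services(
    services: list[str], scaling_results: dict[str, bool]
) -> tuple[list[str], list[str]]:
    """Split services into (investigate first, investigate later)."""
    # Failed or untested services are the first suspects.
    suspects = [s for s in services if not scaling_results.get(s, False)]
    # Services that passed scaling tests are checked only if suspects are cleared.
    verified = [s for s in services if scaling_results.get(s, False)]
    return suspects, verified


if __name__ == "__main__":
    services = [
        "auth", "cart", "catalog", "checkout",
        "payments", "recommendations", "search", "shipping",
    ]
    # Six services passed; one failed and one has no recorded result.
    scaling_results = {
        "auth": True, "cart": True, "catalog": True, "checkout": False,
        "payments": True, "search": True, "shipping": True,
    }
    suspects, verified = prioritize_services(services, scaling_results)
    print("Investigate first:", suspects)
    print("Investigate later:", verified)
```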

Don’t sacrifice reliability for speed or vice versa

AI-generated code has been found to have 1.7x more issues per PR than human-written code. And at a time when teams are shipping 10x the number of PRs, those issues add up quickly. Agentic code review can catch many of those issues, but only guardrails with resilience testing can prove the reliability of your code.

At the same time, if we spend all of our time reviewing and verifying AI-generated code, then we risk losing any velocity gains from using AI in the first place.

The trick is to find the right balance by embracing strategic automation to produce verifiable, actionable data that can improve the reliability of your code without slowing you down.

Ready to see how it works? Check out our interactive Gremlin tour to see how reliability test suites verify resilience.
