AI has become a massive investment for companies. Engineering teams across industries are integrating AI into their products, whether it’s through homegrown, self-managed models or third-party model integrations.

But no matter how much AI shifts the user experience, it’s still an application, which means your engineering team still needs to operate it and keep it reliable. At the same time, AI applications add new layers of complexity that require a shift in your approach.

In a recent roundtable, Gremlin CEO and Founder Kolton Andrus sat down with Nobl9 CTO Alex Nauda and Mandi Walls from PagerDuty to talk about how to keep AI applications functioning at a high level.

Let’s dig into some high-level takeaways. And be sure to check out the whole conversation on-demand!

1. Even things that are the same are changing

At their core, AI applications still run on the same infrastructure as other enterprise applications, which means all of the operating best practices honed over the years are still effective. Observability, resilience testing, SLOs, and incident response playbooks are all valid and essential in the age of AI.

“The bits are roughly the same. We're talking about the same channels and that gives us the same rough shape. So from that side, I don't think we should be too daunted.” —Kolton Andrus, Gremlin

But there are some core shifts that every operations team is going to have to account for. AI brings different traffic patterns, increased complexity, expanded infrastructure footprints, and new dependencies.

“You're either running intensive training and using models in your own infrastructure, which is new and complex and can be expensive, or you're dealing with these new SaaS boundaries to these services providing LLM functionality. And you have to find a way to govern that and figure out if that's working correctly.” —Alex Nauda, Nobl9

Model training also adds additional wrinkles to operations. The background batch processing during model training is extremely valuable work that can be very expensive if it crashes partway through and needs to be restarted. Testing, metrics, and incident response will be necessary to keep these systems functioning, even though they’re often not real-time, directly customer-facing systems. For example, using the Gremlin GPU Experiment can show you if training your model will crash when there’s a surge of GPU resource usage.
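One common way to limit the cost of a mid-training crash is periodic checkpointing, so an interrupted run resumes from the last saved state instead of starting over. The sketch below is illustrative only (the function names, checkpoint interval, and the stand-in training step are assumptions, not anything from the roundtable or a specific framework):

```python
import os
import pickle
import tempfile

CHECKPOINT_EVERY = 100  # assumed interval: steps between checkpoint saves

def save_checkpoint(path, step, state):
    """Write training state to a temp file, then atomically rename it,
    so a crash mid-write can't leave a corrupt checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename

def load_checkpoint(path):
    """Return (step, state) from the last checkpoint, or a fresh start."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            ckpt = pickle.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {"loss": None}

def train(path, total_steps=500):
    # Resume from the last checkpoint if a crash interrupted us.
    step, state = load_checkpoint(path)
    while step < total_steps:
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
        step += 1
        if step % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, step, state)
    return step, state
```

If the process dies, rerunning `train()` with the same checkpoint path picks up from the most recent save, so at most `CHECKPOINT_EVERY` steps of work are lost rather than the whole run.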

When it comes to operating AI applications, whether running a model or using a third-party integration, it really comes down to balancing the new with the proven. Yes, there are changes you should make, but at the same time, you need to shore up your core standard reliability practices.

2. Reliability has multiple meanings with AI

AI reliability can be split into two buckets: uptime and responses. The first is concerned with making sure the AI application is available and performant, while the second is focused on making sure the responses it returns are useful and accurate.

Both of these are essential for AI applications to be truly reliable, and achieving both will require a shift in your approach that brings DevOps and machine learning/AI engineers together. A key part of this is setting the right metrics and SLOs to define reliable operation in the first place.
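To make that concrete, an availability SLO boils down to simple arithmetic over good versus total events. The sketch below is a minimal illustration; the 99.9% target and the function name are assumptions for the example, not values from the roundtable or from Nobl9's product:

```python
def slo_status(good_events: int, total_events: int, slo_target: float = 0.999):
    """Return (attainment, fraction of error budget consumed).

    attainment: share of events that met the objective.
    consumed:   how much of the allowed failure budget is used up
                (1.0 means the budget is fully spent).
    """
    if total_events == 0:
        return 1.0, 0.0
    attainment = good_events / total_events
    error_budget = 1.0 - slo_target               # allowed failure rate
    consumed = (1.0 - attainment) / error_budget  # share of budget spent
    return attainment, consumed

# Example: 99,950 good responses out of 100,000 against a 99.9% target
attainment, consumed = slo_status(99_950, 100_000)
```

With a 99.9% target, 50 bad responses out of 100,000 means half the error budget is gone, which is the kind of signal that can trigger a review before customers notice.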

“Imagine you have goal setting on each of those spots. Is my model good enough? Is my agent working and doing what it's supposed to? Is it crashing, or is it up all the time? From an operational perspective, each of those has their own measurement, goal setting, testing workflow, and an incident response workflow for each separate piece of it.” —Alex Nauda, Nobl9

These questions are often managed by different teams, but as far as the customer is concerned, they all fall into the same bucket. To maintain a reliable AI application for customers, your operations team will now have to work closely with the engineers responsible for training and maintaining the models. And that includes having those teams available for incident response.

“[The engineers responsible are] data engineers who are not used to being on call. They're in the back office with their statistical models and their Python. And their stuff hasn't in the past really faced the customers until now. It's as much an organizational flex on that side as it is anything else. We're gonna need to call in the folks who know what's going on.” —Mandi Walls, PagerDuty

Take the time now to figure out these metrics and processes, then test the systems to make sure they all work. As a hypothetical, you could use a Gremlin experiment to inject latency between your LLM and its databases. When you do, what happens to the responses? If they fall outside the SLO range, do the right processes get triggered, and do the right people get paged to address it?
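The check at the end of such an experiment can be as simple as comparing observed tail latency against the objective and firing the paging hook on a breach. This is a hypothetical sketch; the 2-second p99 objective, the `page` callback, and the percentile math are all assumptions for illustration:

```python
P99_SLO_MS = 2000  # assumed latency objective for LLM responses

def p99(latencies_ms):
    """Rough p99: the value at the 99th-percentile rank of the sorted sample."""
    ordered = sorted(latencies_ms)
    idx = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return ordered[idx]

def check_latency_slo(latencies_ms, page):
    """Compare observed p99 to the SLO; call the paging hook on a breach."""
    observed = p99(latencies_ms)
    if observed > P99_SLO_MS:
        page(f"p99 latency {observed}ms exceeds {P99_SLO_MS}ms SLO")
        return False
    return True
```

Running this during a controlled latency experiment verifies both halves of the question above: that the breach is detected, and that the alerting path (here, `page`) actually fires.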

Answering these questions on your time, your schedule, and in a controlled environment will minimize the customer impact. And your teams will be much more effective at addressing any issues that arise in a sprint than when on call in the middle of the night.

3. AI is still new and evolving

It can sometimes feel like AI has been around forever. ChatGPT was first released to the public almost three years ago in 2022, and machine learning has been a big part of the software world for a good decade now.

But it’s important to remember that AI is still cutting-edge technology. Operating it requires finding that balance between enabling it and setting the right guardrails.

“As [AI] becomes business critical, it's gonna get the same kind of attention business critical services get. Everyone's dipping their toes in. We might be all in on it from a marketing perspective, but technology-wise, we're making sure we're getting the right outcomes and we haven't painted ourselves in a corner.” —Kolton Andrus, Gremlin

This is especially true as more AI applications are integrated via APIs or the Model Context Protocol (MCP). As these become more common, we’ll start to see more interplay between AI models, which will change how we approach architecting and operating systems.

“In the past when you let AI communicate to themselves, they kind of make their own twin language and you're not even sure what they're talking about on the backend because they've sort of recalibrated how they communicate. Enterprise software being available via API was kind of a big deal, and now you're just making all of that stuff more accessible to more players and more components. There’s a lot of possibilities there for good stuff, interesting things, and who knows what else.” —Mandi Walls, PagerDuty

So while there are definite best practices emerging around AI, like integrating machine learning engineers into incident response playbooks, GPU testing, and specific SLOs, a lot of this is still being formulated. In situations like this, it behooves operating teams and engineers to test thoroughly, define their processes, and set their leashes within their organizational tolerance for risk.

“In a way, we wanna keep them on a shorter leash, but because they're flexible systems and the users are using them in flexible ways, we have to give them a reasonable amount of leash, right?” —Alex Nauda, Nobl9

Ultimately, AI reliability still rests in the hands of engineers

AI can do incredible things for customers, operations teams, and more, but one thing all the panelists agreed on is that reliability comes down to the engineers.

“You know, the next frontier for us is really about guiding people how to make the right decisions. Let's use it as the super assistant to really help you interpret the results and make suggestions.” —Kolton Andrus, Gremlin

There are many ways AI can help, such as improving shipping speed, helping to determine possible causes of incidents, and more, but engineers will be the ones behind the models, the applications, and, ultimately, solving software problems.

“One of the things we're gonna see, especially from enterprise customers, is they want auditing of what the agents are doing. Where did they come from? Who or what kicked it off? What events spurred this thing into action? What did it do? What environment was it in? Any of those kinds of things.” —Mandi Walls, PagerDuty

The best thing any engineering leader can do to keep their AI applications reliable is to enable their team. Make sure you’re measuring the right metrics, testing to verify resiliency, and setting up the right processes in case things go wrong.

That way, you’ll build trust, validity, and reliability into your AI systems.

Gavin Cahill
Sr. Content Manager
Start your free trial

Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30 day trial.
