Gartner: tips for improving reliability

In their report titled “IT Resilience — 7 Tips for Improving Reliability, Tolerability and Disaster Recovery”, Gartner presents seven strategies for improving the resilience posture of your critical systems. These recommendations range from how to get started, to identifying IT hazards and risks to reliability, to capturing metrics and translating them into business value.

In this post, we’ll take a high-level look at the report and summarize some of its key findings. You’ll find a link to the full report at the end.

What is IT resilience?

The term “IT resilience” can mean different things depending on who you ask. Gartner defines it as the ability of IT systems to be reliable, tolerable, and recoverable:

Reliable means absorbing or reverting failures while continuing to meet SLOs.
Tolerable means having the ability to limit the duration and scale of an event or hazard.
Recoverable means having the ability to recover from various scenarios, particularly those that can’t be predicted or mitigated beforehand.

IT resilience isn’t a one-and-done initiative, but a continuous process of iterative improvement. Systems are always changing, new risks are always emerging, and engineers are always working on ways to prevent incidents. What matters is defining a process to identify and mitigate risks, then building on that process as your team grows.

Why is IT resilience important?

Building resilient systems isn’t just about providing a good customer experience. Customers increasingly expect it, and because high-profile outages make the news so quickly, it is also becoming a competitive differentiator. Organizations that focus on IT resilience position themselves as more dependable to potential customers, which in turn can lead to greater revenue and customer retention.

Treat resilience as a key differentiator that sets you apart from the competition. Similarly, leverage the demise of your key competitor, which may have experienced extended delays or downtime, to highlight to management that it could have been your organization. For example, a transactional bank had an objective to use availability as a competitive differentiator. It was able to obtain additional support and investment for DR and availability because it was directly tied to meeting a business objective.

The challenge is that failure modes can take countless different forms and scale with the size and complexity of the system. There are also different categories of failure modes:

Those that we know of and understand well.
Those that we know of but don’t understand.
Those that we aren’t aware of.

Problems that we know of and understand are the easiest to resolve, and those that we know of but don’t understand can at least be documented. The big question is: how do we approach problems we aren’t aware of? This is where Chaos Engineering helps, because it allows teams to uncover and address unknown failure modes, particularly those that emerge from complex systems.

It’s important to note that Chaos Engineering doesn’t replace other IT resilience practices like Disaster Recovery and SRE, but complements them. For example, you can use Chaos Engineering to test your Disaster Recovery plans and your incident response runbooks to ensure they’re up to date.

Who owns IT resilience?

It’s tempting to assign responsibility for IT resilience to SREs. After all, it’s in the name: Site Reliability Engineering. And while it’s true that SREs tend to spearhead reliability initiatives, IT resilience is ultimately a shared responsibility.

The SRE role is less of a reliability gatekeeper (like QA in a traditional waterfall development practice) and more of a reliability evangelist. SREs create bridges between traditionally siloed teams like operations, disaster recovery, and product to ensure a shared focus on reliability. This is in addition to performing tactical tasks like troubleshooting and root cause analysis. Because SREs facilitate these reliability-centric tasks, their importance to businesses will only grow as systems become more complex.

The addition of SRE-like roles will increase in prominence, from less than 5% of enterprises in 2021 to at least 30% by the end of 2025, due to their measurable impact across all areas of resilience.

How do you identify IT hazards, risks, and failure modes?

If you’re not aware of the things that can break, it’s impossible to plan for their eventual failure and difficult to recover when they do fail. As IT systems become larger and more complex, finding their failure modes becomes harder too.

For example, the growing use of cloud computing over the past two decades means that organizations don’t just depend on the reliability of their own systems, but also on the reliability of their cloud providers. While providers can promise certain levels of reliability via SLAs, they can still have outages, and those outages can cascade to your own systems if you’re not prepared for them.
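To make the dependency point concrete, here is a minimal back-of-the-envelope sketch of how availability compounds when every request needs your own service and each cloud dependency to be up at the same time. The availability figures and service names are hypothetical, not taken from the report or from any provider’s SLA, and the calculation assumes failures are independent, which real outages often are not.

```python
from math import prod

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def composite_availability(availabilities):
    """Availability of a request path that needs every component up.

    Assumes component failures are independent (a simplification).
    """
    return prod(availabilities)


def expected_downtime_minutes(availability, period_minutes=MINUTES_PER_30_DAYS):
    """Expected unavailable minutes over the period for a given availability."""
    return (1 - availability) * period_minutes


# Hypothetical figures for illustration only.
own_service = 0.999      # your own application
cloud_compute = 0.9995   # provider's compute service
cloud_database = 0.9995  # provider's managed database

path = composite_availability([own_service, cloud_compute, cloud_database])

print(f"Own service alone: {own_service:.4%} available, "
      f"about {expected_downtime_minutes(own_service):.0f} min down per 30 days")
print(f"Full request path: {path:.4%} available, "
      f"about {expected_downtime_minutes(path):.0f} min down per 30 days")
```

Under these assumptions, the full path is always less available than its weakest component, which is why provider outages belong on your own list of risks.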

To uncover risks like these, organizations need to encourage engineers to actively look for and surface failure modes so that they can be addressed and accounted for. This not only helps find points of failure, but also helps prioritize reliability initiatives. Practices like Chaos Engineering support this by allowing teams to proactively identify and address points of failure before they become production outages.
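As a toy illustration of the idea, the sketch below injects failures into a stand-in dependency and checks whether the caller degrades gracefully instead of crashing. Every name here (`fetch_recommendations`, `render_homepage`, `with_fault_injection`) is hypothetical, and this is not how Gremlin or any specific Chaos Engineering tool works; real experiments target running infrastructure rather than in-process functions.

```python
import random


class DependencyError(Exception):
    """Raised when the (simulated) downstream dependency fails."""


def fetch_recommendations(user_id):
    """Stand-in for a call to a downstream service (hypothetical)."""
    return [f"item-{user_id}-1", f"item-{user_id}-2"]


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that a fraction of calls raise DependencyError."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise DependencyError("injected failure")
        return func(*args, **kwargs)
    return wrapper


def render_homepage(user_id, fetch):
    """Code under test: should fall back to an empty list on failure."""
    try:
        return {"status": "ok", "recommendations": fetch(user_id)}
    except DependencyError:
        return {"status": "degraded", "recommendations": []}


def run_experiment(trials=1000, failure_rate=0.3, seed=42):
    """Hypothesis: the homepage never crashes when the dependency fails."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_fetch = with_fault_injection(fetch_recommendations, failure_rate, rng)
    degraded = crashed = 0
    for user_id in range(trials):
        try:
            page = render_homepage(user_id, flaky_fetch)
        except Exception:
            crashed += 1  # an unhandled failure mode was found
            continue
        if page["status"] == "degraded":
            degraded += 1
    return {"trials": trials, "degraded": degraded, "crashed": crashed}


print(run_experiment())
```

A nonzero `crashed` count would point to a failure mode that the fallback logic does not cover, which is the kind of finding that helps prioritize reliability work.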

By the end of 2025, 30% of enterprises will establish new roles focused on IT resilience and boost end-to-end reliability, tolerability and recoverability by at least 45%.

Learn more

There’s much more to the report than what’s covered in this blog post, including:

How to get support from your organization.
Key operational and business metrics to track.
What to work on after these seven tips.

To read the full report for free, visit gremlin.com/gartner-seven-tips-for-improving-reliability.
