It’s one thing to say that reliability is a priority for your organization, and a whole other thing to make actual, demonstrable improvements in the availability of your applications.

Sadly, it’s common for organizations to invest time, money, and effort into improving reliability only to barely nudge the needle on incidents and downtime. But there are hundreds of companies successfully improving their reliability posture—and doing it at enterprise scale.

What separates the companies that succeed from the ones that don’t?

After analyzing successful programs, we found some common themes in the composition of the people involved. Every successful program was supported by three pillar roles, and having someone fill each one stood out as instrumental to the program’s effectiveness. These roles can be summarized as:

  1. Standards roles set resiliency standards and oversee the execution of resilience efforts.
  2. Operations roles perform resilience tests on services and remediate reliability risks.
  3. Leadership roles make reliability a priority and allocate resources to the initiative.

The roles aren’t tied to specific titles. In fact, sometimes one person or team will take on multiple roles. For example, performance engineering teams or centralized SRE teams often take on both setting the standards and performing tests and mitigations—at least until the program scales across the organization and software teams start owning tests on their own systems.

But without someone stepping in to take on the duties of each role, teams will struggle to make progress. Let’s dig a little deeper into each of the roles and how they work together to improve system resiliency.

1. Standards roles set the bar

The standards role is responsible for driving resilience efforts across the organization. It owns the standards, tooling, and organizational processes for carrying out those efforts. In some organizations, this role sits in a center of excellence, such as an SRE or Platform Engineering team, while in others it is added to an existing role like system architect.

Core Responsibilities:

  • Define reliability standards
    Reliability standards should be based on a combination of universal best practices, organizational reliability goals, and reliability risks unique to your deployment. These should be consistent across the organization, with any service-specific deviations (such as those discovered through exploratory testing) documented and incorporated into testing processes.

  • Manage tooling for testing & reporting
    Centralizing testing and reporting tools with the standards role means tests can be automated to minimize the lift for individual teams, and metrics can be compiled to make it easier to align around reliability and prioritize fixes.

  • Determine standardized validation test suites
    Resilience test suites are a powerful tool for creating a baseline of resiliency across your organization. The standards role should define these, then integrate them into tooling so teams can automate running them.

  • Own operationalization processes
    Metrics should be reported regularly in meetings where the reliability posture is reviewed and fixes are prioritized. Whether these are standalone meetings or integrated into existing meetings, the standards role should own and run these review processes.
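
To make the idea of a standardized validation suite more concrete, here is a minimal sketch of how a standards team might describe a baseline suite as shared data and compile per-service scores for review meetings. Everything in it is an assumption for illustration: the test names, the `ResilienceTest` type, and the `reliability_score` and `report` helpers are hypothetical and are not taken from any particular tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ResilienceTest:
    """One standardized test. Names and descriptions are hypothetical."""
    name: str
    description: str


# Hypothetical baseline suite, defined once by the standards role
# and run by every service team.
BASELINE_SUITE = (
    ResilienceTest("cpu_scalability", "Service stays responsive under CPU pressure"),
    ResilienceTest("zone_redundancy", "Service survives the loss of one zone"),
    ResilienceTest("dependency_latency", "Service tolerates a slow dependency"),
)


def reliability_score(results: dict[str, bool]) -> float:
    """Return the percentage of baseline tests a service passed.

    Tests missing from `results` count as failures, so a service that
    has not run the full suite cannot score 100.
    """
    passed = sum(1 for test in BASELINE_SUITE if results.get(test.name, False))
    return 100.0 * passed / len(BASELINE_SUITE)


def report(all_results: dict[str, dict[str, bool]]) -> dict[str, float]:
    """Compile one score per service for the review meeting."""
    return {service: reliability_score(r) for service, r in all_results.items()}
```

Treating unrun tests as failures is a deliberate choice in this sketch: it keeps the baseline consistent across services, so a high score always means the same thing.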

2. Operations roles ensure execution

This role could have a wide variety of titles, but the defining characteristic is that the people in it are responsible for the resiliency of specific services. They make sure the tests are run, report the results, and make sure any prioritized risks are addressed.

Core Responsibilities:

  • Run tests and report on results
    Once the initial agent installation or setup is done, testing should be automated to make this a light lift. In prioritization and review meetings, operators will need to make sure the test results are reported and be ready to discuss them.

  • Respond to risks detected by reliability risk monitoring
    Risks detected by monitoring can often be fixed with a change to the configuration or other lighter-lift fixes. In these cases, operators should be empowered to quickly address these risks to maintain resiliency.

  • Address and mitigate reliability risks
    Once reliability risks have been prioritized, operators are responsible for making sure the risk is fixed. They may not be the person to perform the actual work, but they should be responsible for making sure any risks are addressed, then testing again to verify the fixes.

3. Leadership roles prioritize reliability and commit resources

The leadership role is responsible for setting the priorities of engineering teams and allocating resources. In some companies this role is held by someone in the C-suite, while in others it’s held by Vice Presidents or Directors. The defining factor is that anyone in this role has the authority to set engineering priorities and direct resources towards them.

Core Responsibilities:

  • Dedicate resources to reliability
    Most reliability efforts fail due to a lack of prioritization from the organization. For your resilience practice to be effective, leadership roles need to allocate resources to it.

  • Ensure standards create business value
    Work with those in the standards roles to make sure resiliency standards and goals tie directly back to business value. Try to find the balance where finding and mitigating reliability risks creates far more value than it consumes in time, money, and effort.

  • Drive accountability and review metrics dashboards
    When leadership is visibly engaged in reviewing reliability metrics, it lends importance to the efforts, which in turn drives action. Leadership should hold operators accountable for improving resiliency, and applaud them when they do.

Align regularly around goals and metrics to improve reliability

These roles don’t exist in silos. In fact, in the most effective reliability programs, the people in these roles meet regularly to review the current reliability posture and resilience test results, prioritize fixes based on the possible impact of reliability risks, and check to make sure previous risks have been addressed.
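
As a rough illustration of what such a review could work from, the sketch below orders open risks by their possible impact and lists previously prioritized risks that are still unresolved. The `Risk` fields, the 1-5 impact scale, and both helper functions are assumptions made for this example, not part of any specific tool or process.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Risk:
    service: str
    summary: str
    impact: int                  # hypothetical 1-5 scale, 5 = most severe
    prioritized: bool = False    # was it picked up in an earlier review?
    fixed: bool = False          # has the fix been verified by a retest?


def prioritize(risks: list[Risk]) -> list[Risk]:
    """Return unfixed risks, highest possible impact first."""
    open_risks = [r for r in risks if not r.fixed]
    return sorted(open_risks, key=lambda r: r.impact, reverse=True)


def outstanding(risks: list[Risk]) -> list[Risk]:
    """Return risks prioritized in an earlier review that are still unfixed."""
    return [r for r in risks if r.prioritized and not r.fixed]
```

Keeping the two questions separate mirrors the meeting itself: one pass decides what to fix next, and the other checks whether earlier commitments were met.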

By working together, these three roles can improve the resiliency of your systems to increase their availability and uptime—making them more reliable for your customers.

Gavin Cahill
Sr. Content Manager