Interactive Assessment

How Mature Is Your Reliability Program?

Evaluate your team against the 18 traits Gremlin identified while working with reliability leaders at Fortune 100 companies. Get your score and a prioritized action plan.

PILLAR 1

Leadership & Strategy

The best reliability programs start with clear goals, executive commitment, and a proactive mindset. Strategy turns reliability from a hope into a plan.

Define clear, specific missions and goals
You know what you're working towards: which services you're targeting, milestone timelines, and the reliability levels you aspire to. Goals are specific enough to measure, like "double the reliability of mission-critical customer-facing applications."
Identify initial timelines or mission-critical dates
Your timelines and resources are informed by business events. You've set minimum reliability policies and target dates for compliance tied to real milestones like product launches, peak traffic periods, or regulatory deadlines.
Focus on proactive goals, not reactive incident response
Your strategy has a long-term, policy-based focus on desired reliability levels. You're getting ahead of incidents to prevent them rather than optimizing how fast you respond after they happen.
Define clear accountability with visible leadership interest
Everyone involved understands how they'll be held accountable. Leadership's interest in the program is visible and felt across the organization, not just mentioned in a quarterly all-hands.
Establish milestone-based review and celebration
You hold retrospectives and celebrate as the program reaches milestones. Each milestone is defined by both a target date and a reliability target. You don't celebrate until you've met the reliability target.

PILLAR 2

Ownership & Handoffs

Reliability programs fail without clear ownership. Centralize who drives testing and measurement, decentralize who fixes what's found, and document every handoff.

Identify your program owner
There's a named person taking responsibility for the reliability program. This isn't a committee or a shared Slack channel. Someone's name is on it.
Centralize ownership for baselines, testing, and reporting
You've defined who is measuring your progress, what you're being measured against, and who's responsible for reporting. The methodology is consistent, not team-by-team.
Decentralize ownership for system improvements
The people closest to each service are responsible for making improvements. Testing is centralized; fixing is distributed to the teams that know their services best.
Create ownership handoff processes
You track ownership and continuously onboard new owners whenever a service changes hands. When an engineer leaves or a reorg happens, institutional knowledge doesn't walk out the door.

PILLAR 3

Measurement & Metrics

You can't improve what you can't measure. The best programs quantify reliability, tie it to business outcomes, and create leading indicators rather than lagging ones.

Define the background behind the program
You've documented why you're doing this and quantified the impact of downtime in business terms. All relevant parties have access to this background and have reviewed it together.
Set up consistent reliability measurement and normalized scoring
You understand how reliability is measured and can compare it fairly between services. You're using standardized scoring rather than letting every team define "reliable" differently.
Record your progress against your goals
You maintain a well-known, regularly reviewed progress report that keeps historical data as a reverse-chronological journal. Anyone can find it and see the trajectory at a glance.
Tie high-value golden signals to business metrics
Your golden signals go beyond uptime and downtime. They're mission-critical indicators that help protect your company and demonstrate value against the business metrics leadership cares about.

PILLAR 4

Processes & Policies

Reliability is a sustained organizational effort, not a one-time project. Documented processes keep the program running as teams grow, services change, and people move on.

Build a catalog of services, owners, and disruption impact
The services in scope for your program are well defined, and their criticality is well understood. You know which services matter most and what it costs when they go down.
Establish biweekly progress reviews
You hold regular meetings with program owners, leadership, and service owners to review recent changes and any unexpected spikes or dips in progress toward reliability goals.
Document new service onboarding
You've built and regularly exercise new service onboarding processes. New services don't get added to the portfolio without going through a documented reliability intake.
Define response to services falling out of compliance
You've documented how you detect, respond to, and correct services that fall out of compliance. This includes when you review, how you reach out, what information you collect, and reasonable timelines for correction.
Define response to services coming into compliance
You've documented how you recognize and celebrate services coming into compliance. Positive reinforcement matters as much as detection and correction.

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.