Treat reliability risks like security vulnerabilities by scanning and testing for them

Finding, prioritizing, and mitigating security vulnerabilities is an essential part of running software. We all recognize that vulnerabilities exist and that new ones are introduced all the time, so we make sure to check for and remediate them on a regular basis.

Even if the code passed all the security checks before being deployed, you still perform regular security tests to make sure everything’s secure. In fact, the idea of just waiting for a security vulnerability to be exploited before addressing it is enough to make engineers wake up at night in a cold sweat.

But that’s exactly what many organizations are doing with reliability risks. Instead of actively checking our systems for issues that could cause incidents or outages, we often wait for things to break, then scramble to fix them. And with high-impact outages having a median annual cost of $7.75 million, this approach can get really costly, really quickly.

Just as we proactively work to improve our security posture, we can do similar work to improve our reliability posture by scanning and testing for reliability risks.

Check out the three steps below to find out how, and learn more in the Closing the Reliability Gap Ebook.

Reliability risks are potential outages

Having security vulnerabilities doesn’t mean that your system is automatically compromised, but they are a weak point where a breach could happen, so the vulnerability should be addressed before someone exploits it.

The same is true with reliability risks. Your system may be running fine right now, but these are places where your system could break. And you can prevent an incident by locating and remediating the cause of the risk before it causes an outage.

Let’s look at resource scalability as an example. A service may run just fine with a normal amount of traffic, but a sudden surge may cause its memory or CPU usage to spike. If the service isn’t set up to handle that spike, it could go down and take everything that depends on it down too. And if that surge comes from something like a traffic increase during a Black Friday sale, that can turn into a costly outage.

A combination of scanning for common risks, reliability testing, and custom Chaos Engineering experiments can help you locate reliability risks before outages like that happen.

1. Scan configurations to detect common risks

As with security vulnerabilities, the first step is to scan for any common or known risks. Gremlin’s Detected Risks helps you find and fix the most common causes of infrastructure outages without running Chaos Engineering experiments or reliability tests.

Detected Risks are high-priority reliability concerns that Gremlin automatically identifies in any environment where the Gremlin agent is installed. These risks can include misconfigurations, bad default values, or reliability anti-patterns.

Gremlin prioritizes these risks based on severity and impact for each of your services. This gives you near-instantaneous feedback on risks and action items to improve the reliability and stability of your services.

The risks Gremlin checks for include:

CPU Requests
Liveness Probes
Availability Zone Redundancy
Memory Requests
Memory Limits
Application Version Uniformity

The goal with Detected Risks is to quickly give you actionable data that you can use to uncover and remediate the most common reliability risks.
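To make the idea of a configuration scan concrete, here is a minimal sketch in Python. It is not how Gremlin implements Detected Risks. It assumes a Kubernetes Deployment manifest that has already been loaded into a plain dictionary, and the function name scan_deployment and the example manifest are hypothetical. It covers only the checks that can be read from a single manifest (CPU requests, memory requests, memory limits, and liveness probes). Availability zone redundancy and application version uniformity depend on live cluster state, so they are left out.

```python
from typing import Any


def scan_deployment(manifest: dict[str, Any]) -> list[str]:
    """Return human-readable findings for one Deployment-style manifest.

    Hypothetical helper for illustration only; it inspects static
    configuration and knows nothing about the running cluster.
    """
    findings: list[str] = []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})

    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})

        if "cpu" not in requests:
            findings.append(f"{name}: missing CPU request")
        if "memory" not in requests:
            findings.append(f"{name}: missing memory request")
        if "memory" not in limits:
            findings.append(f"{name}: missing memory limit")
        if "livenessProbe" not in container:
            findings.append(f"{name}: missing liveness probe")

    return findings


# Example manifest (made up): requests are set, but there is no
# memory limit and no liveness probe.
example_manifest = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "web",
                        "resources": {
                            "requests": {"cpu": "250m", "memory": "256Mi"},
                        },
                    }
                ]
            }
        }
    }
}

for finding in scan_deployment(example_manifest):
    print(finding)
# web: missing memory limit
# web: missing liveness probe
```

A scan like this is cheap to run on every deploy, which is what makes the feedback close to instantaneous.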

2. Test for specific failure modes with reliability tests

Gremlin provides several pre-built tests designed to proactively validate against common reliability issues. These tests use Fault Injection to stress systems while watching the monitors you’ve set in your observability tool. If your systems can run the test without triggering an alert, you pass. If an alert fires, the test is halted and marked as failed.
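The pass/fail logic described above can be sketched as a small loop. This is an illustration under stated assumptions, not Gremlin's implementation: run_reliability_test is a hypothetical function, and start_fault, stop_fault, and monitor_is_alerting are placeholder callables standing in for whatever fault injection and observability tooling you actually use.

```python
import time
from typing import Callable


def run_reliability_test(
    start_fault: Callable[[], None],
    stop_fault: Callable[[], None],
    monitor_is_alerting: Callable[[], bool],
    duration_s: float = 60.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Inject a fault and watch a monitor; return True on pass, False on fail.

    All three callables are hypothetical hooks supplied by the caller.
    The fault is always stopped, whether the test passes, fails, or raises.
    """
    start_fault()
    try:
        elapsed = 0.0
        while elapsed < duration_s:
            if monitor_is_alerting():
                # An alert fired: halt immediately and mark the test failed.
                return False
            sleep(poll_interval_s)
            elapsed += poll_interval_s
        # One last check once the full duration has elapsed.
        return not monitor_is_alerting()
    finally:
        stop_fault()
```

Putting the stop call in a finally block is the important design choice here: the injected fault is rolled back even if the monitor check itself throws an error.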

The tests fall into three broad categories:

Scalability: Tests if your service behaves as expected when system resources (such as CPU or memory) are limited or exhausted.
Redundancy: Tests whether your service is reliable when one of your hosts, zones, or regions is unavailable.
Dependencies: Dependencies are any independently maintained software component used by a service to provide features or functionality. Dependencies can be hard (required) or soft (not required), and this test checks whether your service performs as expected when a dependency is slow, unavailable, or has an expiring security certificate.
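As one small illustration of the dependency category, the sketch below checks whether a dependency's certificate is close to expiring. It is a simplified stand-in, not Gremlin's test: the function names and the 30-day threshold are assumptions, and the expiry string is expected in the "notAfter" format that Python's ssl module reports for a peer certificate.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format "Jan 31 00:00:00 2018 GMT".
    `now` must be a timezone-aware datetime.
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now.timestamp()) / 86400


def is_expiring(not_after: str, now: datetime, threshold_days: float = 30.0) -> bool:
    """True if the certificate expires within `threshold_days` (assumed default)."""
    return days_until_expiry(not_after, now) <= threshold_days


# Made-up example: a certificate that expires 30 days after the check date.
check_date = datetime(2018, 1, 1, tzinfo=timezone.utc)
print(days_until_expiry("Jan 31 00:00:00 2018 GMT", check_date))  # 30.0
print(is_expiring("Jan 31 00:00:00 2018 GMT", check_date))        # True
```

Passing the current time in as a parameter keeps the check easy to test and makes it simple to ask "would this certificate still be valid N days from now?"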

When you first start testing, you may want to run reliability tests in a non-production environment that closely mirrors production, but the goal is to eventually run and pass tests in production. Because tests are designed to simulate real-world scenarios, it’s best to validate that services can withstand these issues in real-world environments.

Gremlin includes a number of safeguards to make testing in production safe and secure.

3. Understand systems with custom Chaos Engineering experiments

Chaos Engineering experiments are incredibly effective for finding specific risks.

Between Detected Risks scanning for the most common risks and reliability tests verifying services against the most common failure modes, most potential failures are covered. But every system and architecture is unique, and Chaos Engineering fills in the rest of the gaps so you can confidently test your entire system.

With Gremlin, you can build custom experiments using a wide variety of targeting and experiment types, then save them as custom scenarios that you can add to your automated testing setup. Gremlin is designed to run anywhere your services are, so you can use it to test failure modes on anything from on-prem to cloud to Kubernetes to serverless.

Reliability, like security, takes collaboration

Scanning for reliability risks, reliability testing, and Chaos Engineering experiments are only the beginning. Once you’ve found the reliability risks, you have to work with the service owners to remediate the issues, and then keep testing regularly to make sure new reliability risks are found and remediated.

But all that time and effort is worth it when you can find and fix reliability risks before they turn into costly outages, just as the effort you put into security is worth it when you patch vulnerabilities before someone exploits them.

Ready to start finding reliability risks? Start a free 30-day trial and start scanning for risks right away, or read more about building a culture of reliability in the Closing the Reliability Gap Ebook.


