Zero-click risk detection

Uncover critical reliability risks as soon as you deploy your service, without running a single test.

Free for 30 days. No credit card required.

Top Fortune 500 organizations worldwide trust Gremlin

Comprehensive and automatic reliability risk detection

Stay one step ahead of reliability risks by detecting them before they impact your customers. With Detected Risks, Gremlin automatically monitors the services running in your environment for potential failure modes—without requiring any active testing.

What is a reliability risk?

A reliability risk is anything that jeopardizes a system’s ability to work as expected, such as missing autoscaling rules, limited or no redundancy, or missing health checks and liveness probes. Left unaddressed, risks can become incidents or outages in production. Gremlin helps you proactively find and fix these risks before they can impact your customers.

What risks can Gremlin detect for you?

Kubernetes

CPU Requests: Do you have enough CPU reserved for your containers?
Liveness Probes: Are you monitoring your Kubernetes deployments for failed or unresponsive Pods?
Availability Zone (AZ) Redundancy: Are you running at least one instance of your application in another zone or region?
Memory Requests: Do you have enough memory (RAM) reserved for your containers?
Memory Limits: Did you specify an upper boundary to the amount of memory (RAM) your containers can use?
Application Version Uniformity: Are all instances of your application running the same version?
CrashLoopBackOff: Are any of your Kubernetes Pods stuck in a crash loop?
ImagePullBackOff: Are all of your Kubernetes worker nodes able to pull the correct container images?
Init Container Error: Have any of your initialization (init) containers failed?
Unschedulable Pods: Is Kubernetes unable to schedule any of your Pods?
Horizontal Pod Autoscaler Missing: Is your Kubernetes service configured to scale horizontally?
Horizontal Pod Autoscaler - Scaling Inactive: Is your Horizontal Pod Autoscaler (HPA) active?
Horizontal Pod Autoscaler - Unable to Scale: Is Kubernetes unable to scale your workload?
Horizontal Pod Autoscaler - Scaling Limited: Has your Horizontal Pod Autoscaler (HPA) reached its maximum or minimum pod limits?
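
To make a few of the Kubernetes checks above concrete, here is a minimal Python sketch of how such checks could be expressed against a Deployment manifest loaded as a dictionary. It is an illustration only and not Gremlin's implementation: the function name `find_pod_spec_risks` and the output labels are hypothetical, and the only assumption about the input is that it follows the standard Kubernetes Deployment schema.

```python
from __future__ import annotations

from typing import Any


def find_pod_spec_risks(deployment: dict[str, Any]) -> list[str]:
    """Return risk labels for each container in one Deployment manifest.

    Hypothetical helper for illustration; the labels reuse the risk names
    listed above, but the detection logic is an assumption.
    """
    risks: list[str] = []
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        if "cpu" not in requests:
            risks.append(f"{name}: CPU Requests")
        if "memory" not in requests:
            risks.append(f"{name}: Memory Requests")
        if "memory" not in limits:
            risks.append(f"{name}: Memory Limits")
        if "livenessProbe" not in container:
            risks.append(f"{name}: Liveness Probes")
    return risks


# Example manifest: CPU request and memory limit are set, but the
# memory request and the liveness probe are missing.
example_deployment = {
    "kind": "Deployment",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "web",
                        "resources": {
                            "requests": {"cpu": "250m"},
                            "limits": {"memory": "256Mi"},
                        },
                    }
                ]
            }
        }
    },
}

print(find_pod_spec_risks(example_deployment))
# ['web: Memory Requests', 'web: Liveness Probes']
```

A check like this only inspects static configuration, which is why it can run without injecting any faults.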

Amazon Web Services (AWS)

Load Balancer Deletion Protection: Are your load balancers protected against accidental deletions?
Load Balancer AZ Redundancy: Are you load balancing traffic across two or more availability zones?
Cross Zone Load Balancing: Are you distributing traffic equally across all availability zones?
AWS Load Balancer ASG with Policies: Are your Auto Scaling Groups (ASGs) configured correctly?
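
As a rough illustration of two of the AWS checks above, the Python sketch below inspects plain dictionaries shaped like the responses of the Elastic Load Balancing v2 `describe_load_balancers` and `describe_load_balancer_attributes` calls. It makes no API calls itself. The function name `find_load_balancer_risks` is hypothetical, and the logic is a simplified assumption rather than a description of how Gremlin evaluates these risks.

```python
from __future__ import annotations

from typing import Any


def find_load_balancer_risks(
    load_balancer: dict[str, Any],
    attributes: list[dict[str, str]],
) -> list[str]:
    """Return risk labels for one load balancer description.

    Hypothetical helper; inputs are assumed to look like ELBv2 API
    responses that were already fetched elsewhere.
    """
    risks: list[str] = []

    # AZ redundancy: traffic should be spread over at least two zones.
    zones = {az["ZoneName"] for az in load_balancer.get("AvailabilityZones", [])}
    if len(zones) < 2:
        risks.append("Load Balancer AZ Redundancy")

    # Deletion protection: attribute values arrive as the strings "true"/"false".
    by_key = {attr["Key"]: attr["Value"] for attr in attributes}
    if by_key.get("deletion_protection.enabled") != "true":
        risks.append("Load Balancer Deletion Protection")

    return risks


# Example: a load balancer in a single zone with deletion protection off.
example_lb = {"AvailabilityZones": [{"ZoneName": "us-east-1a"}]}
example_attrs = [{"Key": "deletion_protection.enabled", "Value": "false"}]

print(find_load_balancer_risks(example_lb, example_attrs))
# ['Load Balancer AZ Redundancy', 'Load Balancer Deletion Protection']
```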

Azure

No AZ Redundancy: Are you load balancing traffic across two or more availability zones?
Autoscaling Missing: Are your Application Gateways configured to scale in case of traffic surges?
SSL Certificate Expiring Soon: Are any of your TLS/SSL certificates expiring in the next 30 days?
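
The certificate check above comes down to a date comparison. The Python sketch below shows one way to express it, assuming the certificate's expiry timestamp has already been read from wherever it is stored. The function name `expires_soon` and the `window_days` parameter are hypothetical; the 30-day default mirrors the window mentioned above.

```python
from datetime import datetime, timedelta, timezone


def expires_soon(not_after: datetime, now: datetime, window_days: int = 30) -> bool:
    """Return True if the certificate expires within `window_days` of `now`.

    Certificates that have already expired also return True.
    Both datetimes are expected to be timezone-aware.
    """
    return not_after - now <= timedelta(days=window_days)


now = datetime(2024, 1, 1, tzinfo=timezone.utc)

# Expires in 19 days: inside the 30-day window.
print(expires_soon(datetime(2024, 1, 20, tzinfo=timezone.utc), now))  # True

# Expires in 60 days: outside the window.
print(expires_soon(datetime(2024, 3, 1, tzinfo=timezone.utc), now))  # False
```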

Google Cloud Platform (GCP)

Single-zone Backends: Are your Backend Services zone-redundant?
Connection Draining Disabled: Will your Backend Services finish in-progress requests before removing an endpoint?
No Circuit Breaker Configured: Do you have upper limits set for connection and request volume?
Outlier Detection Disabled: Can your Backend Services detect and route around unhealthy endpoints?

Reveal hidden reliability risks

Modern systems have countless moving parts. The potential for failure is significant and increases as teams move to distributed cloud-native platforms.

Gremlin continuously monitors your changing environments for common reliability risks, such as missing autoscaling rules, undefined resource limits, and non-redundant services. It then surfaces these risks in an easy-to-understand way, so you can see what areas require your attention.

Surface regressions and recurring issues

As systems change, the chance of a regression grows. Gremlin’s fully automated risk detection processes capture risks as soon as they appear in your environment, letting you know immediately if something’s wrong.

Gremlin tracks these risks over time and also presents them as a team-wide report so you can identify risks across all of your services at once. And as Gremlin’s risk library continues to grow, you’ll get more insights into how reliable your services are—all without having to run a single Chaos Engineering experiment.
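
One simple way to think about surfacing regressions is as a comparison between two snapshots of detected risks. The Python sketch below illustrates that idea with plain sets. It is not Gremlin's tracking mechanism: the function name `diff_risks` and the `service/risk` identifier format are hypothetical, chosen only for the example.

```python
from __future__ import annotations


def diff_risks(previous: set[str], current: set[str]) -> dict[str, list[str]]:
    """Compare two risk snapshots and group risks by what changed.

    Each risk is identified by a hypothetical "service/risk name" string.
    Results are sorted so the output is stable.
    """
    return {
        "new": sorted(current - previous),        # appeared since last snapshot
        "resolved": sorted(previous - current),   # no longer detected
        "persisting": sorted(current & previous), # still present
    }


yesterday = {"checkout/Memory Limits", "checkout/Liveness Probes"}
today = {"checkout/Liveness Probes", "cart/CPU Requests"}

report = diff_risks(yesterday, today)
print(report["new"])         # ['cart/CPU Requests']
print(report["resolved"])    # ['checkout/Memory Limits']
print(report["persisting"])  # ['checkout/Liveness Probes']
```

A risk that shows up under "new" after previously being resolved is a regression, and the "persisting" group is a natural input for a team-wide report.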

Build confidence in your system's resiliency

Engineering teams need to know that their systems can withstand faults whenever they occur. Gremlin helps you understand how your systems behave under adverse conditions, not just ideal ones.

Environments change over time, especially as systems scale and engineers push new code. Gremlin helps you stay ahead of changing systems and configuration drift with automated, repeated experiments, so you can push to production with confidence that new reliability risks will be caught early.

Shift from observing to improving

Gremlin enables teams to proactively improve reliability at every stage of maturity.

Experimenting

Custom Chaos Tests & Experiments

Robust, customizable chaos tests to safely replicate any incident scenario.

Standardizing

Standardized Reliability Tests

Pre-built test suite to cover the most common reliability risks. Get started in minutes.

Scaling

Automated & Scaled Reliability Programs

Standardized scoring tools to identify and prioritize risks, and build reliability programs.
