Detected Risks

Supported platforms:

N/A

Detected Risks are high-priority reliability concerns that Gremlin automatically identified in your environment. These risks can include misconfigurations, bad default values, or reliability anti-patterns. Gremlin prioritizes these risks based on severity and impact for each of your services. This gives you near-instantaneous feedback on risks and action items to improve the reliability and stability of your services.

This guide is also available as an interactive demo.

Viewing Detected Risks in Gremlin

Gremlin provides a visual indicator of the number of Detected Risks on the Service Catalog view, as well as on the service details page. Detected Risks are shown in a separate indicator next to the reliability score.

The details page for a service showing a reliability score of 12% and 3 Detected Risks

Click on this indicator to see a list of all potential Detected Risks for your service. Each risk will show one of three statuses:

At Risk: This risk is currently present in your systems and hasn't been addressed.
Mitigated: This risk has been fixed since it was last detected.
N/A: This risk could not be evaluated. A warning tooltip will be shown next to the risk with more details.

Clicking on a risk name provides additional information about the risk, including guidance on how to fix it.

A screenshot of the Kubernetes Liveness Probe Detected Risk with its details.

Once you've addressed a risk, refresh the page to confirm that it's been mitigated.

‍

Kubernetes Detected Risks

In a Kubernetes environment, Gremlin will detect the following set of risks:

CPU Requests
Liveness Probes
Availability Zone Redundancy
Memory Requests
Memory Limits
Application Version Uniformity
CrashLoopBackOff
ImagePullBackOff
Init Container Error
Unschedulable Pods
Horizontal Pod Autoscaler Missing
Horizontal Pod Autoscaler - Scaling Inactive
Horizontal Pod Autoscaler - Unable to Scale
Horizontal Pod Autoscaler - Scaling Limited
No readiness probe defined
Topology spread constraints absent

‍

CPU Requests

What is this?

spec.containers[].resources.requests.cpu specifies how much CPU should be reserved for your pod container.

Why is this a risk?

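The kubelet reserves at least the requested amount of CPU specifically for that container to use.
This protects your node from resource shortages and helps the scheduler place pods on nodes that can accommodate the requested amount. Without a CPU request, the scheduler cannot account for the container's needs, and the container has no guaranteed share of CPU when the node is under contention.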

How can I fix this?

Specify an appropriate resource request for your pod container. Think of this as the minimum amount of the resource needed for your application to run.

How does this work?

Gremlin will consider the absence of a container's CPU request "at-risk".
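As a minimal sketch of the fix described above, the snippet below writes a Pod manifest that sets spec.containers[].resources.requests.cpu. The pod name, image, file name, and the 250m value are illustrative assumptions, not recommendations.

```shell
# Write an example Pod whose container declares a CPU request.
# "my-app", the image, and the 250m value are placeholders.
cat > cpu-request-example.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.4.2
      resources:
        requests:
          cpu: "250m"   # reserve a quarter of one CPU core
EOF

# Apply it once the values fit your workload:
#   kubectl apply -f cpu-request-example.yaml
```

In practice you would add the same resources block to the Pod template of your Deployment or StatefulSet.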

‍

Liveness Probe

What is this?

spec.containers[].livenessProbe specifies how the kubelet will decide when to restart your pod container.

Why is this a risk?

The kubelet uses liveness probes to know when to restart a container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a container in such a state can help to make the application more available despite bugs.

How can I fix this?

Implement a livenessProbe for your pod container, such that it fails when your container needs restarting.

How does this work?

Gremlin will consider the absence of a container's livenessProbe "at-risk".
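The sketch below shows one possible livenessProbe, assuming the application exposes an HTTP health endpoint. The /healthz path, port 8080, image, and timing values are hypothetical and should be replaced with values that match your application.

```shell
# Write an example Pod with an HTTP liveness probe.
# The endpoint, port, and timings are assumptions for illustration.
cat > liveness-probe-example.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.4.2
      livenessProbe:
        httpGet:
          path: /healthz         # hypothetical health endpoint
          port: 8080
        initialDelaySeconds: 10  # wait before the first check
        periodSeconds: 5         # check every 5 seconds
        failureThreshold: 3      # restart after 3 consecutive failures
EOF

# Apply with: kubectl apply -f liveness-probe-example.yaml
```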

‍

Availability Zone Redundancy

What is this?

Major cloud providers define a region as a set of failure zones (also called availability zones) that provide a consistent set of features: within a region, each zone offers the same APIs and services.

Why is this a risk?

Availability zone redundancy ensures your applications continue running, even in the event of critical failure within a single zone.
Typical cloud architectures aim to minimize the chance that a failure in one zone also impairs services in another zone.

How can I fix this?

If you are running in a single availability zone now, you should deploy your service to at least one other zone.
For a Kubernetes service, once your cluster spans multiple zones or regions, you can use node labels in conjunction with Pod topology spread constraints to control how Pods are spread across your cluster among fault domains: regions, zones, and even specific nodes. These hints enable the scheduler to place Pods for better expected availability, reducing the risk that a correlated failure affects your whole workload.
For a Kubernetes service, you can apply node selector constraints to Pods that you create, as well as to Pod templates in workload resources such as Deployment, StatefulSet, or Job.

How does this work?

Gremlin performs zone redundancy analysis much as it generates targeting for a zone failure test: it identifies all unique zone tags among the Gremlin agents that are co-located with the given service.
A service with one or no zone values is considered "at-risk".
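To approximate this check yourself, you can count the distinct zones your nodes report. The sketch below reads the well-known topology.kubernetes.io/zone node label rather than Gremlin agent tags, so treat it as a rough cross-check and not Gremlin's actual implementation. The function names are hypothetical.

```shell
# Count the distinct, non-empty zone names read from stdin (one per line).
count_zones() {
  grep -v '^$' | sort -u | wc -l | tr -d ' '
}

# Print the zone label of every node in the cluster (requires kubectl access).
list_node_zones() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}'
}

# Usage: list_node_zones | count_zones
# A result below 2 corresponds to the "at-risk" case described above.
```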

‍

Memory Request

What is this?

spec.containers[].resources.requests.memory specifies how much memory should be reserved for your Pod container.

Why is this a risk?

The kubelet reserves at least the requested amount of memory specifically for that container to use. This protects your node from resource shortages and helps the scheduler place pods on nodes that can accommodate the requested amount. Without a memory request, the scheduler may place the pod on a node that does not have enough free memory for it.

How can I fix this?

Specify an appropriate resource request for your pod container. Think of this as the minimum amount of the resource needed for your application to run.

How does this work?

Gremlin will consider the absence of a container's memory request "at-risk".

‍

Memory Limit

What is this?

spec.containers[].resources.limits.memory specifies a maximum amount of memory your Pod container can use.

Why is this a risk?

Specifying a memory limit for your pod containers protects the underlying nodes from applications consuming all available memory.
Kubernetes enforces the memory limit through the container's cgroup. If the container tries to allocate more memory than this limit, the Linux kernel out-of-memory subsystem activates and, typically, intervenes by stopping one of the processes in the container that tried to allocate memory. If that process is the container's PID 1, and the container is marked as restartable, Kubernetes restarts the container.

How can I fix this?

Specify an appropriate memory limit for your Pod container.

How does this work?

Gremlin will consider the absence of a container's memory limit "at-risk".
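A memory request and a memory limit are usually set together. The sketch below writes a Pod manifest with both; the 256Mi and 512Mi values, the pod name, and the image are placeholders chosen for illustration.

```shell
# Write an example Pod with a memory request and a memory limit.
# The sizes are assumptions; base yours on observed usage.
cat > memory-limit-example.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.4.2
      resources:
        requests:
          memory: "256Mi"   # amount reserved for scheduling
        limits:
          memory: "512Mi"   # allocations beyond this trigger the OOM killer
EOF

# Apply with: kubectl apply -f memory-limit-example.yaml
```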

‍

Application Version Uniformity

What is this?

This checks whether your application is configured so that all of its replicas run the exact same version.

Why is this a risk?

Version uniformity ensures your application behaves consistently across all instances.
Image tags such as latest can be easily modified in a registry. As application pods redeploy over time, this can produce a situation where the application is running unexpected code.

How can I fix this?

Specify an image tag other than latest, ideally using the complete sha256 digest which is unique to the image manifest.

How does this work?

Gremlin will consider the presence of more than one image version running within your service as "at-risk".
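You can get a similar view from the command line by listing the images your pods are running and counting each distinct value. This is a sketch, not Gremlin's implementation; the function names and the app=my-app label selector are assumptions.

```shell
# Summarize image references read from stdin: one output line per distinct image.
summarize_images() {
  grep -v '^$' | sort | uniq -c
}

# Print the image of every container in pods matching a label selector.
list_service_images() {
  kubectl get pods -l "$1" \
    -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}'
}

# Usage: list_service_images app=my-app | summarize_images
# More than one output line means more than one image version is running.
```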

‍

CrashLoopBackOff

What is this?

CrashLoopBackOff is a Kubernetes state that indicates a restart loop is happening in a pod. It’s a common error message that occurs when a Kubernetes container fails to start up properly for some reason, then repeatedly crashes.

Why is this a risk?

CrashLoopBackOff is not an error in itself; it indicates that an underlying error is causing the application to crash. A CrashLoopBackOff state also means that a portion of your application fleet is not running, which usually leaves the fleet in a degraded state.

How can I fix this?

Fixing this issue will depend on identifying and fixing the underlying problem(s).

Examine the output or log file for the application to identify any errors that lead to crashes.
Use kubectl describe to identify any relevant events or configuration that contributed to crashes.

How does this work?

Gremlin considers a service as "at-risk" when it finds at least one containerStatus in a state of waiting with reason=CrashLoopBackOff.
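The steps above can be sketched with a few kubectl calls. The function names are hypothetical, and the filter only mirrors the waiting-reason check described here; it is not Gremlin's code.

```shell
# Print "<container>\t<waiting reason>" for each container in a pod.
# The reason column is empty for containers that are not waiting.
pod_waiting_reasons() {
  kubectl get pod "$1" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state.waiting.reason}{"\n"}{end}'
}

# From that tab-separated output, print the containers with a given reason.
containers_with_reason() {
  awk -F'\t' -v reason="$1" '$2 == reason { print $1 }'
}

# Show recent events, then the logs of the previous (crashed) container run.
# For multi-container pods, add -c <container> to the logs command.
debug_crashing_pod() {
  kubectl describe pod "$1"
  kubectl logs "$1" --previous
}

# Usage: pod_waiting_reasons my-pod | containers_with_reason CrashLoopBackOff
```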

‍

ImagePullBackOff

What is this?

Kubernetes pods sometimes experience issues when trying to pull container images from a container registry. If an error occurs, the pod goes into the ImagePullBackOff state. The ImagePullBackOff error occurs when the image path is incorrect, the network fails, or the kubelet does not succeed in authenticating with the container registry. Kubernetes initially reports the ErrImagePull error, and then, after retrying a few times, backs off and schedules another download attempt. For each unsuccessful attempt, the delay increases exponentially, up to a maximum of 5 minutes.

Why is this a risk?

An ImagePullBackOff error means a portion of your application fleet is not running, and cannot download the image required to start running. This usually means your application fleet is in a degraded state.

How can I fix this?

In most cases, restarting the pod and deploying a new version will resolve the problem and keep the application online. Otherwise:

Check that your pod specification is using correct values for the image’s registry, repository, and tag.
Check for network connection issues with the image registry. You can also forcibly recreate the pod to retry an image pull.
Verify your pod specification can properly authenticate to the targeted container registry.

How does this work?

Gremlin considers a service as "at-risk" when it finds at least one containerStatus in a state of waiting with reason=ImagePullBackOff.
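For the authentication case, one common approach is a registry credential secret referenced from the pod specification. The sketch below assumes a private registry at registry.example.com and a secret named regcred; both are placeholders, as is the helper function name.

```shell
# Create a docker-registry secret. Arguments: secret name, registry server,
# username, password (all supplied by you).
create_pull_secret() {
  kubectl create secret docker-registry "$1" \
    --docker-server="$2" --docker-username="$3" --docker-password="$4"
}

# Write a Pod that spells out registry/repository:tag and references the secret.
cat > image-pull-example.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  imagePullSecrets:
    - name: regcred   # must match the secret name passed to create_pull_secret
  containers:
    - name: my-app
      image: registry.example.com/team/my-app:1.4.2
EOF
```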

‍

Init Container Error

What is this?

An init container is a type of container with slightly different operational behavior and rules. Most notably, init containers start and terminate before application containers, and they must run to completion successfully. They exist specifically to initialize the workload environment.

Why is this a risk?

If a Pod's init container fails, the kubelet repeatedly restarts that init container until it succeeds. However, if the Pod has a restartPolicy of Never, and an init container fails during startup of that Pod, Kubernetes treats the overall Pod as failed.

How can I fix this?

Inspect the failing init container's logs and the pod's events to find the underlying error, then correct the init container's definition. Init containers are defined in the pod.spec.initContainers array, whereas regular containers are defined under the pod.spec.containers array. Both arrays hold Container objects.

How does this work?

Gremlin will consider one or more containerStatuses in a state of waiting with a reason=Init Container Error as "at-risk".
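To illustrate where init containers sit in a pod specification, the sketch below writes a Pod with one init container and one regular container. The names, images, and command are placeholders; the helper function name is hypothetical.

```shell
# Write an example Pod with an init container that must finish successfully
# before the application container starts.
cat > init-container-example.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  initContainers:
    - name: init-setup
      image: busybox:1.36
      command: ["sh", "-c", "echo preparing environment && sleep 2"]
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.4.2
EOF

# Read the logs of one init container. Arguments: pod name, init container name.
init_container_logs() {
  kubectl logs "$1" -c "$2"
}

# Usage: init_container_logs my-app init-setup
```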

‍

Unschedulable Pods

What is this?

A pod may be unschedulable for several reasons:

Resource Requests: If the pod is requesting more resources than any node can provide, it will not be scheduled. This can be solved by adding nodes, increasing node size, or reducing the resource requests of pods.
Persistent Volumes: If the pod requests persistent volumes that are not available, it may not be able to schedule. This can happen when using dynamic volumes, or referring to a persistent volume claim that cannot be completed e.g. requesting an EBS volume without permissions to create it.

In rare cases, it is possible for a pod to get stuck in the terminating state. This is detected by finding any pods where every container has been terminated, but the pod is still running. Usually, this is caused when a node in the cluster gets taken out of service abruptly, and the cluster scheduler and controller-manager do not clean up all of the pods on that node.

Why is this a risk?

When the node that the pod is running on doesn't have enough resources, the pod can be evicted and moved to a different node. If none of the nodes have sufficient resources, the pod remains in the Pending state and does not start, leaving your service with fewer running replicas than intended.

How can I fix this?

On Google Kubernetes Engine (GKE), if your Pod's resource requests exceed the capacity of a single node in every eligible node pool, GKE does not schedule the Pod and also does not trigger a scale-up to add a new node. For GKE to schedule the Pod, you must either request fewer resources for the Pod, or create a new node pool with sufficient resources. You can also enable node auto-provisioning so that GKE can automatically create node pools with nodes where the unscheduled Pods can run.

On GKE, the default CPU request is 100m, or 10% of one CPU core. If you want to request more or fewer resources, specify the value in the Pod specification under spec.containers[].resources.requests.

How does this work?

Gremlin will consider one or more containerStatuses in a state of waiting with a reason=Unschedulable Pods as "at-risk".
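To find out why a pod is not being scheduled, it usually helps to list Pending pods and the scheduler's FailedScheduling events. A minimal sketch follows; the function names are hypothetical.

```shell
# List pods in every namespace that have not been scheduled or started yet.
list_pending_pods() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending
}

# List scheduler events that explain why pods could not be placed,
# for example insufficient CPU or an unbound persistent volume claim.
list_scheduling_failures() {
  kubectl get events --all-namespaces --field-selector=reason=FailedScheduling
}

# Usage: list_pending_pods; list_scheduling_failures
```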

‍

Horizontal Pod Autoscaler Missing

What is this?

A Horizontal Pod Autoscaler (HPA) automatically adds or removes replicas to a workload (such as a deployment) based on observed metrics, such as CPU or memory utilization. HPAs scale workloads horizontally in response to user demand.

Why is this a risk?

Not having an HPA puts your service at risk of running out of resources when demand increases. HPAs help maintain performance while optimizing resource utilization.

How can I fix this?

Create an HPA for your workload using the command below as an example. Adjust the minimum replica count, maximum replica count, and CPU threshold to match your requirements. In this example, the my-app deployment will scale up if CPU usage across its replicas exceeds 50%, up to four replicas. If CPU usage drops below 50%, the HPA will scale back down.

You can replace CPU with a different, or even a custom, metric.

SHELL


kubectl autoscale deployment my-app --cpu-percent=50 --min=1 --max=4
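If you prefer a declarative definition, or want to scale on a metric other than CPU, an autoscaling/v2 manifest is one option. The sketch below scales the same hypothetical my-app deployment on memory utilization; the 70% target is an arbitrary example value.

```shell
# Write an HPA manifest that scales my-app between 1 and 4 replicas
# based on average memory utilization.
cat > hpa-example.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
EOF

# Apply with: kubectl apply -f hpa-example.yaml
```

Utilization targets are computed against the container's resource requests, so the target deployment needs a memory request for this metric to work.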

‍

Horizontal Pod Autoscaler - Scaling Inactive

What is this?

The Horizontal Pod Autoscaler periodically checks the service’s resource usage and increases (or decreases) the number of replicas as needed. If the HPA’s ScalingActive condition is False, the HPA is not scaling, and the service is at risk.

Why is this a risk?

An inactive Horizontal Pod Autoscaler will not scale your service. This makes it more vulnerable to changes in demand, especially surges.

How can I fix this?

This risk occurs when the target service's replica count is set to zero. Check to make sure that your service’s replica count is set to a non-zero number. You can view the specific error message by using the following command (replace service-hpa-name with your HPA’s name):

SHELL


kubectl get hpa service-hpa-name -o jsonpath='{.status.conditions[?(@.type=="ScalingActive")]}'

‍

Horizontal Pod Autoscaler - Unable to Scale

What is this?

The HPA’s AbleToScale condition indicates whether the HPA can fetch and update scales and whether any backoff-related conditions prevent scaling. If this condition is False, something is preventing the HPA from scaling.

Why is this a risk?

A Horizontal Pod Autoscaler that is unable to scale will not be performing its primary function of scaling your service.

How can I fix this?

Check your Pod or Deployment for backoff-related issues, such as CrashLoopBackOff or ImagePullBackOff. If there are none, view the specific error message using the following command (replace service-hpa-name with your HPA’s name):

SHELL


kubectl get hpa service-hpa-name -o jsonpath='{.status.conditions[?(@.type=="AbleToScale")]}'

‍

Horizontal Pod Autoscaler - Scaling Limited

What is this?

The HPA associated with this service has reached its minimum or maximum replica count and won’t scale further. If this happens, the HPA’s ScalingLimited condition is set to True.

Why is this a risk?

Limited scaling may mean your Horizontal Pod Autoscaler wants to scale more due to demand or lack thereof, but cannot. This carries similar risks to not having an HPA, where the overall service may not have enough resources to meet customer demand.

How can I fix this?

Consider increasing your HPA’s maximum replica count if you can no longer scale up, or decreasing its minimum replica count if you can no longer scale down. You can determine which of these is the problem by running the following command (replace service-hpa-name with your HPA’s name):

SHELL


kubectl get hpa service-hpa-name -o jsonpath='{.status.conditions[?(@.type=="ScalingLimited")]}'
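If the condition shows the HPA is pinned at its maximum, one way to raise the ceiling is a merge patch on spec.maxReplicas. This is a sketch; the function names are hypothetical and the new value is yours to choose.

```shell
# Build the JSON merge patch that sets spec.maxReplicas to the given integer.
max_replicas_patch() {
  printf '{"spec":{"maxReplicas":%d}}' "$1"
}

# Apply the patch. Arguments: HPA name, new maximum replica count.
raise_hpa_max() {
  kubectl patch hpa "$1" --type=merge -p "$(max_replicas_patch "$2")"
}

# Usage: raise_hpa_max service-hpa-name 8
```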

‍

No readiness probe defined

What is this?

The container spec has no readiness probe defined.

Why is this a risk?

Readiness probes indicate to Kubernetes that the pod is ready to start receiving traffic. Without one, Kubernetes sends traffic to the pod as soon as it starts. If your application isn’t done initializing, this can result in request failures that are invisible in deployment logs.

How can I fix this?

Add a readiness probe that checks a part of your application that returns success only when it’s ready to receive traffic.

How does this work?

Gremlin checks for the existence of spec.containers[].readinessProbe. If this property is empty, Gremlin considers the service at-risk.
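A minimal sketch of a readiness probe follows, assuming the application serves an HTTP endpoint that succeeds only after initialization. The /ready path, port 8080, image, and timings are hypothetical.

```shell
# Write an example Pod with an HTTP readiness probe.
cat > readiness-probe-example.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.4.2
      readinessProbe:
        httpGet:
          path: /ready          # hypothetical endpoint, succeeds once initialized
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
EOF

# Apply with: kubectl apply -f readiness-probe-example.yaml
```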

‍

Topology spread constraints absent

What is this?

The pod spec does not contain topologySpreadConstraints.

Why is this a risk?

Topology spread constraints determine how Kubernetes distributes pods across failure domains, such as nodes, regions, and zones. Without them, Kubernetes may schedule pods within the same failure domain (e.g. on the same host), reducing redundancy.

How can I fix this?

Configure a topology spread constraint in your pod spec.

How does this work?

Gremlin checks for the existence of spec.topologySpreadConstraints. If this property is empty, Gremlin considers the service at-risk.
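As an illustration, the sketch below writes a Deployment whose pods are spread across zones using the well-known topology.kubernetes.io/zone node label. The app name, image, replica count, and the ScheduleAnyway policy are assumptions for the example.

```shell
# Write an example Deployment with a zone-level topology spread constraint.
cat > topology-spread-example.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                # zones may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway         # prefer spreading, but do not block
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.4.2
EOF

# Apply with: kubectl apply -f topology-spread-example.yaml
```

Setting whenUnsatisfiable to DoNotSchedule instead makes the constraint strict, at the cost of leaving pods Pending when it cannot be met.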

‍

AWS Detected Risks

In an AWS (Amazon Web Services) environment, Gremlin will detect the following set of risks:

Availability Zone Redundancy
Cross-zone Load Balancing
Deletion Protection Enabled
Auto Scaling Group (ASG) with Policies

Note

You must be using Amazon Elastic Load Balancers (ELBs) in front of your services for these Detected Risks to work.

‍

Multiple Availability Zones

What is this?

This checks if a load balancer (Application or Gateway) is mapped to multiple Availability Zones (AZs). If it’s mapped to fewer than two AZs, the service is “at-risk.”

Why is this a risk?

An Availability Zone is a single point of failure in a cloud network. If your load balancer is mapped to a single AZ and that AZ becomes unavailable, your entire application will become unavailable. Mapping to multiple AZs helps ensure that any redundancy or failover systems in place will work as expected in case of an AZ failure.

How can I fix this?

You should deploy your Load Balancer to at least two availability zones. You can refer to the AWS documentation for details on how to do this.

How does this work?

When you authenticate Gremlin with AWS, Gremlin detects the load balancers running in your region, as well as the number of AZs they’re mapped to. If the number of AZs is less than two, Gremlin considers it “at-risk.”
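You can run a comparable check with the AWS CLI. The sketch below lists each load balancer with its AZ count and filters for those mapped to fewer than two; it assumes configured AWS credentials and a default region, and the function names are hypothetical.

```shell
# Print "<load balancer name>\t<number of AZs>" for each ELBv2 load balancer.
list_lb_az_counts() {
  aws elbv2 describe-load-balancers \
    --query 'LoadBalancers[].[LoadBalancerName, length(AvailabilityZones)]' \
    --output text
}

# From that output, print the names of load balancers with fewer than two AZs.
single_az_lbs() {
  awk '$2 < 2 { print $1 }'
}

# Usage: list_lb_az_counts | single_az_lbs
```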

‍

Cross-zone Load Balancing

What is this?

Cross-zone load balancing reduces the need to maintain equivalent numbers of instances in each enabled Availability Zone, and improves your application's ability to handle the loss of one or more instances.

Why is this a risk?

When cross-zone load balancing is disabled, each load balancer node only distributes traffic across the registered targets in its Availability Zone (AZ). If the AZ becomes unavailable, or if there are no healthy targets in the AZ, then the load balancer can’t distribute traffic.

How can I fix this?

See the AWS Elastic Load Balancer documentation for instructions.

How does this work?

Gremlin checks the load balancer’s CrossZoneLoadBalancing.Enabled attribute to determine whether it’s enabled or disabled.

‍

Deletion Protection Enabled

What is this?

To prevent your load balancer from being deleted accidentally, you can enable deletion protection. Deletion protection is disabled by default.

Note

Classic Load Balancers do not support deletion protection and therefore are always considered "at-risk".

Why is this a risk?

Accidentally deleting a load balancer will prevent users from accessing the applications that the load balancer targets. Enabling deletion protection adds an additional step to prevent accidental deletions.

How can I fix this?

See the AWS Elastic Load Balancer documentation for instructions.

How does this work?

Gremlin checks the load balancer’s deletion_protection attribute to determine whether it’s enabled or disabled.
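For Application, Network, and Gateway Load Balancers, deletion protection can be turned on from the AWS CLI by setting the deletion_protection.enabled attribute. A minimal sketch follows; the function name is hypothetical and the load balancer ARN is yours to supply.

```shell
# Enable deletion protection on one load balancer. Argument: load balancer ARN.
enable_deletion_protection() {
  aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn "$1" \
    --attributes Key=deletion_protection.enabled,Value=true
}

# Usage: enable_deletion_protection "<your-load-balancer-arn>"
```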

‍

Auto Scaling Group (ASG) with Policies

What is this?

An Auto Scaling Group (ASG) automatically scales an EC2 target group in response to changing demand. This risk checks your ASGs to ensure they have enabled scaling policies.

Why is this a risk?

Without an enabled scaling policy, an ASG won’t add instances to the target group to meet increasing demand. This can cause performance and stability problems as load increases.

How can I fix this?

Add a scaling policy to your ASG. AWS supports several different scaling methods, including dynamic scaling, which scales in response to changing metrics such as CPU utilization or request rate.
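As one example of dynamic scaling, the sketch below attaches a target tracking policy that aims for a given average CPU utilization. The policy name cpu-target-tracking and the function names are assumptions; the target value and ASG name are yours to choose.

```shell
# Build the target tracking configuration JSON for an average-CPU target.
target_tracking_config() {
  printf '{"PredefinedMetricSpecification":{"PredefinedMetricType":"ASGAverageCPUUtilization"},"TargetValue":%s}' "$1"
}

# Attach the policy. Arguments: ASG name, target CPU percentage (for example 50.0).
add_cpu_scaling_policy() {
  aws autoscaling put-scaling-policy \
    --auto-scaling-group-name "$1" \
    --policy-name cpu-target-tracking \
    --policy-type TargetTrackingScaling \
    --target-tracking-configuration "$(target_tracking_config "$2")"
}

# Usage: add_cpu_scaling_policy my-asg 50.0
```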

‍

Azure Detected Risks

Gremlin will detect the following Azure risks if the service is mapped to an Azure Application Gateway:

No AZ redundancy
Autoscaling missing
SSL certificate expiring soon

‍

No AZ redundancy

What is this?

An Application Gateway was detected with fewer than two assigned zones.

Why is this a risk?

Application Gateways without multiple zones are not zone-redundant. If the availability zone that it’s scheduled in fails, the Application Gateway will also fail.

How can I fix this?

When deploying an Application Gateway, specify two or more zones (or omit zones to let the gateway use all available zones). If your Gateway is currently limited to one zone, you may need to re-deploy it.

How does this work?

Gremlin checks the zones property of the Application Gateway mapped to the service. If there are fewer than two elements, the service is at-risk.

‍

Autoscaling missing

What is this?

An Application Gateway was detected with no autoscaling configured.

Why is this a risk?

A fixed-capacity Application Gateway can easily saturate during traffic spikes, such as during sales holidays (Black Friday, Cyber Monday, etc.), new product releases, or unexpected incidents. Autoscaling ensures the Application Gateway can meet sudden surges in demand.

How can I fix this?

In your Application Gateway’s configuration, change the capacity type from “Manual” to “Autoscale” and set the maximum scale units to the maximum number of instances you want to scale to.

How does this work?

Gremlin checks the autoscaleConfiguration property of the Application Gateway mapped to the service. If this value is null, the service is at-risk.
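The same properties can be inspected and changed with the Azure CLI. The sketch below assumes you are logged in with az and that your CLI version accepts --min-capacity and --max-capacity on update; the function names and capacity values are placeholders.

```shell
# Show the zones and autoscale settings of an Application Gateway.
# Arguments: resource group, gateway name.
show_gateway_resilience() {
  az network application-gateway show \
    --resource-group "$1" --name "$2" \
    --query '{zones: zones, autoscale: autoscaleConfiguration}'
}

# Switch a gateway to autoscaling between a minimum and maximum capacity.
# Arguments: resource group, gateway name, min capacity, max capacity.
enable_gateway_autoscale() {
  az network application-gateway update \
    --resource-group "$1" --name "$2" \
    --min-capacity "$3" --max-capacity "$4"
}

# Usage: enable_gateway_autoscale my-rg my-gateway 2 10
```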

‍

SSL certificate expiring soon

What is this?

An Application Gateway was detected with a TLS/SSL certificate expiring in the next 30 days.

Why is this a risk?

Expired certificates are one of the most common outage categories. Even though the service is available, users will see a browser warning. A 30-day lead time lets teams rotate on a regular change window rather than during an incident.

How can I fix this?

Renew and replace your TLS certificate. Using Key Vault to manage your Application Gateway certificates will also address this risk.

How does this work?

Gremlin checks the sslCertificates.publicCertData property of the Application Gateway mapped to the service, decodes it, and checks the notAfter property. If this property falls within 30 days of the current date, the service is at-risk. Gremlin also checks if the Application Gateway has an associated key vault. If so, the service is not at-risk.
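To run the same kind of expiry check on a certificate you hold locally, openssl can compare the certificate's notAfter date against a time window. This is a sketch under the assumption that you have the certificate as a PEM file; the function name is hypothetical.

```shell
# Succeed (exit 0) if the PEM certificate expires within the given number of
# days, or has already expired. Arguments: certificate file, days.
# An unreadable file is also reported as expiring, so check the path first.
cert_expires_within_days() {
  # -checkend exits 0 when the certificate is still valid after N seconds.
  if openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Usage:
#   if cert_expires_within_days gateway-cert.pem 30; then
#     echo "renew this certificate"
#   fi
```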

‍

GCP Detected Risks

Gremlin will detect the following GCP risks if the service is mapped to a GCP Backend Service:

Single-zone backends
Connection draining disabled
No circuit breaker configured
Outlier detection disabled

‍

Single-zone backends

What is this?

At least one GCP Backend Service has all of its network endpoint groups (NEGs) or instance groups located in a single zone.

Why is this a risk?

NEGs route traffic between load balancers and endpoints, such as virtual machine instances, serverless applications, and containers. Locating every NEG in the same zone creates a single point of failure in case of a zone outage.

How can I fix this?

Assign at least one NEG to your load balancer in a different zone than the other NEGs assigned to the same load balancer. See Zonal NEG backends for more information.

How does this work?

Gremlin checks the Backend Service’s groups.zone property for unique values. If there are fewer than two unique values, the service is at-risk.

‍

Connection draining disabled

What is this?

Your Backend Service doesn’t have connection draining enabled.

Why is this a risk?

Connection draining gives in-progress requests time to complete when an endpoint is removed (such as a virtual machine instance being removed from an instance group). Without connection draining, in-flight requests might stop suddenly instead of completing, which could result in client errors.

How can I fix this?

Enable connection draining and increase the connection draining timeout to a period above zero seconds.

How does this work?

Gremlin checks the Backend Service’s connectionDraining.drainingTimeoutSec property to ensure it’s not equal to zero.
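With the gcloud CLI, the draining timeout can be read and changed as sketched below. The example assumes a global backend service (use --region for a regional one); the function names and the 300-second value are illustrative.

```shell
# Print the current connection draining timeout, in seconds.
# Argument: backend service name.
show_draining_timeout() {
  gcloud compute backend-services describe "$1" --global \
    --format='value(connectionDraining.drainingTimeoutSec)'
}

# Set the connection draining timeout. Arguments: backend service name, seconds.
set_draining_timeout() {
  gcloud compute backend-services update "$1" --global \
    --connection-draining-timeout="$2"
}

# Usage: set_draining_timeout my-backend-service 300
```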

‍

No circuit breaker configured

What is this?

One or more of your Backend Services does not have a circuit breaker configured.

Why is this a risk?

Circuit breakers set limits on the volume of requests hitting your Backend Service. Without them, a misbehaving client or downstream failure could trigger a thundering-herd or retry storm that saturates the backend.

How can I fix this?

Update your Backend Service configuration with a circuit breaker.

How does this work?

Gremlin checks the Backend Service’s circuitBreakers property for a null value.

‍

Outlier detection disabled

What is this?

At least one of your Backend Services does not have outlier detection configured.

Why is this a risk?

Outlier detection identifies unhealthy endpoints in your Backends and automatically evicts them. By not configuring outlier detection, you may be sending traffic towards endpoints that aren’t healthy enough to process it.

How can I fix this?

Enable and configure outlier detection for your Backends. Make sure to add criteria that determine when an endpoint is healthy enough to receive traffic again.

How does this work?

Gremlin checks the Backend Service’s outlierDetection property for any null values.

‍

Privileges Required

Privilege	Description
SERVICES_READ	Allows reading information about services and reliability management

‍
