Troubleshooting Failure Flags

This guide helps you diagnose and resolve common issues with deploying the Failure Flags sidecar.

Recommended steps

Enable debug logging

Before troubleshooting any issue, always enable debug logging to get detailed information about what the sidecar is doing. You can enable debug logging in one of two ways:

Using an environment variable (recommended): GREMLIN_DEBUG=true
Using the configuration file: debug: true

Debug logging adds the following to logs:

Sidecar startup sequence
Configuration loading and validation
Proxy initialization status
Service registration with Gremlin
Incoming request processing
Experiment evaluation and execution

Accessing debug logs

AWS Lambda: Check the CloudWatch Logs for your function, or run: aws logs tail /aws/lambda/your-function-name --follow
AWS ECS: Check your ECS service logs, or run: aws logs tail /etc/failure-flags-sidecar --follow
Kubernetes: Check your pod's logs. For example: kubectl logs -f deployment/your-app -c failure-flags-sidecar
Docker: Check the container logs for the sidecar. For example: docker logs -f failure-flags-sidecar

Enable trace logging

If you suspect issues communicating with Gremlin's control plane, enable trace logging for detailed network information. To enable trace logging:

Using an environment variable (recommended): GREMLIN_TRACE=true
Using the configuration file: trace: true

Trace logging shows:

HTTP requests to the Gremlin API
TLS handshake details
Network timeouts and retries
Response codes and error messages
Certificate validation issues
Corporate proxy interactions

Common issues and solutions

The sidecar is not starting

Symptoms:

The container immediately exits after starting.
No log output from the sidecar container.
Sidecar container health checks are failing.

Debugging steps:

Check the logs for configuration or startup error messages.
Verify that the required environment variables are set:
1. For Lambda: GREMLIN_LAMBDA_ENABLED=true
2. For others: GREMLIN_SIDECAR_ENABLED=true
If using configuration files, check your configuration file and path for accuracy.
If you're using Lambda and can't pull the layer, verify that the ARN you're using is accurate. The ARN must match your function's region and compute architecture.
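
The environment variable check above can be scripted and run inside the container or function before digging into the logs. The following is a minimal sketch, not part of the Gremlin tooling: the function name and platform labels are hypothetical, and it treats any value other than the exact string true as a problem, which may be stricter than the sidecar itself.

```python
import os

# Variable names come from the debugging steps above.
# The "lambda" / "other" platform labels are illustrative only.
REQUIRED_FLAG = {
    "lambda": "GREMLIN_LAMBDA_ENABLED",
    "other": "GREMLIN_SIDECAR_ENABLED",
}


def enable_flag_problem(env, platform):
    """Return a description of the problem, or None if the flag is set to 'true'."""
    name = REQUIRED_FLAG["lambda" if platform == "lambda" else "other"]
    value = env.get(name)
    if value is None:
        return f"{name} is not set"
    # Assumption: only the exact string 'true' is accepted.
    if value != "true":
        return f"{name} is set to {value!r}, expected 'true'"
    return None


if __name__ == "__main__":
    problem = enable_flag_problem(os.environ, "other")
    print(problem or "enable flag looks correct")
```

Pass "lambda" as the platform when running inside a Lambda function, and any other label for ECS, Kubernetes, or Docker.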

The sidecar can't connect to the Gremlin Control Plane

Symptoms:

The sidecar is generating "Failed to register service" or "Connection timeout" error messages.
No Failure Flags appear in the Gremlin UI, even after you've invoked your application at least once.

Debugging steps:

Check network connectivity between your application and api.gremlin.com.
1. If you have shell access in your container, try running a command such as: curl -v https://api.gremlin.com/v1/ff/health
Verify your credentials and team ID. If you're using certificate-based authentication, verify that your certificate pair is still active and hasn't expired.
1. Check your team_id, team_certificate, and team_private_key configuration options against the information on your team page.
2. When storing certificates in the config file, ensure newlines are preserved (add \n, or use the multi-line YAML format).
If your function is behind a firewall, check your proxy settings (https_proxy in your configuration file).
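
Lost newlines in a certificate or private key are easy to miss by eye. The sketch below shows one way to produce the single-line form with \n escapes mentioned above, and a rough heuristic for checking whether a value still has its line breaks. Both function names are hypothetical, and the check only looks at the PEM header and footer lines; it does not validate the certificate itself.

```python
def pem_to_single_line(pem):
    """Replace real line breaks with literal backslash-n sequences."""
    lines = pem.strip().splitlines()
    return "\\n".join(line.strip() for line in lines)


def pem_newlines_preserved(value):
    """Heuristic: a PEM value with intact newlines has a BEGIN line,
    at least one body line, and an END line, each on its own line."""
    lines = value.strip().splitlines()
    return (
        len(lines) >= 3
        and lines[0].startswith("-----BEGIN ")
        and lines[-1].startswith("-----END ")
    )
```

If pem_newlines_preserved returns False for the value your application actually loads, the line breaks were probably collapsed somewhere between the original file and the configuration.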

Failure Flags not appearing in the Gremlin UI

If your application and sidecar start successfully, but you still don't see Failure Flags in the Gremlin web app, try these debugging steps:

Send network traffic through your application while it's running. If you're using the SDK, send requests to code points where the SDK is invoked.
If you're using proxy mode, verify that the HTTP_PROXY and HTTPS_PROXY environment variables are set.
1. Additionally, ensure you've enabled each proxy by setting the following configuration options to true: dependency_proxy_enabled, ingress_proxy_enabled, and/or lambda_proxy_enabled.
For Kubernetes services, explicitly set the SERVICE_NAME environment variable.

Application can't reach its dependencies

If you're using proxy mode and your application can't connect to its dependencies, your proxy may be misconfigured.

Symptoms:

HTTP requests from your application to its dependencies fail.
Your application returns "Connection refused" errors.
You experience timeouts on outbound calls.

Debugging steps:

Verify that you've set your application's HTTP_PROXY and HTTPS_PROXY environment variables to the URL and port of the Gremlin sidecar container.
Verify that you've enabled the dependency proxy by setting dependency_proxy_enabled: true in the configuration file.
Test your application's network access via curl (if shell access is available and curl is installed): curl -x http://localhost:5034 https://example.com
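
The first check above can also be done programmatically from inside the application container. This is a minimal sketch with a hypothetical function name. The default host and port (localhost:5034) are taken from the curl example and are an assumption about your deployment; pass your own values if the sidecar listens elsewhere.

```python
import os
from urllib.parse import urlsplit


def proxy_env_problems(env, sidecar_host="localhost", sidecar_port=5034):
    """Return a list of problems with HTTP_PROXY / HTTPS_PROXY (empty if none)."""
    problems = []
    for name in ("HTTP_PROXY", "HTTPS_PROXY"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
            continue
        try:
            parts = urlsplit(value)
            host, port = parts.hostname, parts.port
        except ValueError:
            # Raised for malformed URLs, such as a non-numeric port.
            host, port = None, None
        if host != sidecar_host or port != sidecar_port:
            problems.append(
                f"{name}={value} does not point at {sidecar_host}:{sidecar_port}"
            )
    return problems


if __name__ == "__main__":
    for problem in proxy_env_problems(os.environ):
        print(problem)
```

The script prints nothing when both variables point at the expected sidecar address.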

Load balancer health checks are failing

If you're using proxy mode and your application's load balancer has failing health checks, you may need to update your load balancer to target the Failure Flags sidecar instead of your application.

Symptoms:

Your load balancer has marked your service as "unhealthy".
Your load balancer is logging HTTP 502 / 503 errors.
Network traffic is not reaching your application.

Debugging steps:

Check your ingress proxy configuration. Ensure that ingress_proxy_enabled is set to true, and that ingress_proxied_endpoint is set to the URL and port of your application (e.g., http://localhost:8080). The proxy port is set to 5035 by default, but you can change this by setting ingress_proxy_port.
Verify that your load balancer target configuration is pointed to the Gremlin sidecar. For example, if your sidecar's ingress port is set to 5035, ensure your load balancer is targeting the Gremlin container on port 5035.

Lambda function is timing out or returning errors

If your Lambda function is no longer running correctly, there may be problems with your Lambda layer or proxy configuration.

Symptoms:

Lambda function timeouts
Extension not starting
Runtime API errors

Debugging steps:

Check the function's CloudWatch logs for errors.
Verify that you attached the Lambda layer.
1. Additionally, make sure you're using the correct ARN for your function's region and architecture.
Ensure that you've enabled the Lambda layer by setting the GREMLIN_LAMBDA_ENABLED environment variable to true.
Double-check your Lambda proxy configuration, and ensure lambda_proxy_enabled is set to true.

ECS task is timing out or returning errors

If your ECS container is no longer running correctly, there may be a problem with your sidecar deployment or permissions.

Symptoms:

Task fails to start
Sidecar container exits
Service registration issues

Debugging steps:

Check your ECS event log for error messages.
Verify the task role has the required permissions:

JSON


{
  "Effect": "Allow",
  "Action": [
    "secretsmanager:GetSecretValue",
    "ssm:GetParameter"
  ],
  "Resource": "arn:aws:secretsmanager:*:*:secret:gremlin-config-*"
}

Kubernetes pod fails to start or is unresponsive

Symptoms:

Pod fails to start
Service name not detected
Network policies blocking traffic

Debugging steps:

Check your pod's logs using kubectl or a similar tool.
Verify that the service account used to run the pod has the correct permissions.
Ensure that you've explicitly set the name of the service using the SERVICE_NAME environment variable.
Ensure the pod can reach the Gremlin API at api.gremlin.com. If you have shell access to the pod, you can use a tool like curl, wget, or ping.
Verify that your application container can communicate with the Gremlin sidecar.
Double-check your resource requests and limits to ensure your container isn't being pre-emptively terminated.
