Troubleshooting Failure Flags
This guide helps you diagnose and resolve common issues with deploying the Failure Flags sidecar.
Recommended steps
Enable debug logging
Before troubleshooting any issue, always enable debug logging to get detailed information about what the sidecar is doing. You can enable debug logging in one of two ways:
- Using an environment variable (recommended):
GREMLIN_DEBUG=true
- Using the configuration file:
debug: true
Debug logging adds the following to logs:
- Sidecar startup sequence
- Configuration loading and validation
- Proxy initialization status
- Service registration with Gremlin
- Incoming request processing
- Experiment evaluation and execution
Accessing debug logs
- AWS Lambda: Check CloudWatch Logs for your function, or run the following command:
aws logs tail /aws/lambda/your-function-name --follow
- AWS ECS: Check your ECS service logs, or run:
aws logs tail /etc/failure-flags-sidecar --follow
- Kubernetes: Check your pod's logs. For example:
kubectl logs -f deployment/your-app -c failure-flags-sidecar
- Docker: Check the container logs for the sidecar. For example:
docker logs -f failure-flags-sidecar
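The per-platform commands above can be collected into one small helper, which is handy in runbooks. This is only a sketch: the function name sidecar_log_command is hypothetical, and the resource names (your-function-name, your-app, failure-flags-sidecar) are the placeholders from the list above, so substitute your own.

```shell
# Hypothetical helper: print the log command for a given platform.
# The commands are copied from the list above; resource names are placeholders.
sidecar_log_command() {
  case "$1" in
    lambda)     echo "aws logs tail /aws/lambda/your-function-name --follow" ;;
    ecs)        echo "aws logs tail /etc/failure-flags-sidecar --follow" ;;
    kubernetes) echo "kubectl logs -f deployment/your-app -c failure-flags-sidecar" ;;
    docker)     echo "docker logs -f failure-flags-sidecar" ;;
    *)          echo "unknown platform: $1" >&2; return 1 ;;
  esac
}
```

Running sidecar_log_command kubernetes prints the kubectl command, which you can then copy, adjust, and run.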
Enable trace logging
If you suspect issues communicating with Gremlin's control plane, enable trace logging for detailed network information. To enable trace logging:
- Using an environment variable (recommended):
GREMLIN_TRACE=true
- Using the configuration file:
trace: true
Trace logging shows:
- HTTP requests to Gremlin API
- TLS handshake details
- Network timeouts and retries
- Response codes and error messages
- Certificate validation issues
- Corporate proxy interactions
Common issues and solutions
The sidecar is not starting
Symptoms:
- The container immediately exits after starting.
- No log output from the sidecar container.
- Sidecar container health checks are failing.
Debugging steps:
- Check the logs for configuration or startup error messages.
- Verify that the required environment variables are set:
  - For Lambda:
    GREMLIN_LAMBDA_ENABLED=true
  - For others:
    GREMLIN_SIDECAR_ENABLED=true
- If using configuration files, check your configuration file and path for accuracy.
- If you're using Lambda and can't pull the layer, verify that the ARN you're using is accurate. The ARN must match your function's region and compute architecture.
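The environment variable check in the steps above can be scripted. The following is a minimal sketch to run in the same environment as the sidecar. The function name check_sidecar_enabled and the platform argument are hypothetical; only the two variable names come from this guide.

```shell
# Sketch: confirm the enable flag for the current platform is set to "true".
# check_sidecar_enabled is a hypothetical name; the variable names are the
# ones listed in the debugging steps.
check_sidecar_enabled() {
  platform="$1"
  if [ "$platform" = "lambda" ]; then
    name="GREMLIN_LAMBDA_ENABLED"
    value="${GREMLIN_LAMBDA_ENABLED:-}"
  else
    name="GREMLIN_SIDECAR_ENABLED"
    value="${GREMLIN_SIDECAR_ENABLED:-}"
  fi
  if [ "$value" = "true" ]; then
    echo "ok: $name=true"
  else
    echo "missing: set $name=true (current value: '$value')"
    return 1
  fi
}
```

Call it as check_sidecar_enabled lambda for Lambda functions, or with any other argument (for example, check_sidecar_enabled ecs) for the remaining platforms.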
The sidecar can't connect to the Gremlin Control Plane
Symptoms:
- The sidecar is generating "Failed to register service" or "Connection timeout" error messages.
- No Failure Flags appear in the Gremlin UI, even after you've invoked your application at least once.
Debugging steps:
- Check network connectivity between your application and api.gremlin.com.
  - If you have shell access in your container, try running a command such as:
    curl -v https://api.gremlin.com/v1/ff/health
- Verify your credentials and team ID. If you're using certificate-based authentication, verify that your certificate pair is still active and hasn't expired.
  - Check your team_id, team_certificate, and team_private_key configuration options against the information on your team page.
  - When storing certificates in the config file, ensure newlines are preserved (add \n, or use the multi-line YAML format).
- If your function is behind a firewall, check your proxy settings (https_proxy in your configuration file).
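When the curl connectivity test fails, its exit status usually narrows down which step to look at next. The helper below is a sketch that maps curl's documented exit codes to hints. The function name explain_curl_exit and the wording of the hints are assumptions, not part of Gremlin's tooling.

```shell
# Sketch: turn a curl exit status into a troubleshooting hint.
# explain_curl_exit is a hypothetical helper. The numeric codes are curl's
# documented exit codes; the hints point back to the steps above.
explain_curl_exit() {
  case "$1" in
    0)  echo "connected: network path to the endpoint is open" ;;
    5)  echo "could not resolve proxy: check https_proxy" ;;
    6)  echo "could not resolve host: check DNS" ;;
    7)  echo "failed to connect: check firewall and egress rules" ;;
    28) echo "timed out: check firewall and proxy settings" ;;
    35) echo "TLS handshake failed: check for TLS interception by a proxy" ;;
    60) echo "certificate not trusted: check the CA bundle in the container" ;;
    *)  echo "curl exit $1: see the curl manual for this code" ;;
  esac
}

# Usage (needs network access, so it is left commented out):
# curl -sS -o /dev/null --max-time 10 https://api.gremlin.com/v1/ff/health
# explain_curl_exit "$?"
```

An exit status of 0 only means the connection succeeded. The HTTP response can still carry an error status, so keep the -v output from the original command when you need the response details.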
Failure Flags not appearing in the Gremlin UI
If your application and sidecar start successfully, but you still don't see Failure Flags in the Gremlin web app, try these debugging steps:
- Send network traffic through your application while it's running. If you're using the SDK, send requests to code points where the SDK is invoked.
- If you're using proxy mode, verify that the HTTP_PROXY and HTTPS_PROXY environment variables are set.
  - Additionally, ensure you've enabled each proxy by setting the following configuration options to true: dependency_proxy_enabled, ingress_proxy_enabled, and/or lambda_proxy_enabled.
- For Kubernetes services, explicitly set the SERVICE_NAME environment variable.
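To see at a glance which proxies are enabled, you can scan the configuration file for the three options named above. This sketch assumes the flat "key: value" YAML layout used for the debug and trace options in this guide. The function name enabled_proxies and the file path you pass in are hypothetical.

```shell
# Sketch: list which proxy options are set to true in a config file.
# Assumes one "key: true" pair per line with no quotes or trailing comments.
# enabled_proxies is a hypothetical name.
enabled_proxies() {
  config_file="$1"
  for key in dependency_proxy_enabled ingress_proxy_enabled lambda_proxy_enabled; do
    if grep -Eq "^[[:space:]]*${key}:[[:space:]]*true[[:space:]]*\$" "$config_file"; then
      echo "$key"
    fi
  done
}
```

Running enabled_proxies with the path to your configuration file prints one enabled option per line. Empty output means none of the proxies is enabled in that file.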
Application can't reach its dependencies
If you're using proxy mode and your application can't connect to its dependencies, your proxy may be misconfigured.
Symptoms:
- HTTP requests from your application to its dependencies fail.
- Your application returns "Connection refused" errors.
- You experience timeouts on outbound calls.
Debugging steps:
- Verify that you've set your application's HTTP_PROXY and HTTPS_PROXY environment variables to the URL and port of the Gremlin sidecar container.
- Verify that you've enabled the dependency proxy by setting dependency_proxy_enabled: true in the configuration file.
- Test your application's network access via curl (if shell access is available and curl is installed):
curl -x http://localhost:5034 https://example.com
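The first check in this list, that both proxy variables point at the sidecar, can be sketched as a small function to run inside the application container. The name check_proxy_env is hypothetical, and the expected address is whatever your sidecar listens on; http://localhost:5034 is used here only because it appears in the curl example.

```shell
# Sketch: check that HTTP_PROXY and HTTPS_PROXY both equal the expected
# sidecar address. check_proxy_env is a hypothetical name.
check_proxy_env() {
  expected="$1"   # for example http://localhost:5034
  status=0
  for name in HTTP_PROXY HTTPS_PROXY; do
    # Indirect lookup of the variable named in $name (names come from the fixed list above).
    eval "value=\${$name:-}"
    if [ "$value" = "$expected" ]; then
      echo "ok: $name=$value"
    else
      echo "mismatch: $name is '$value' (expected $expected)"
      status=1
    fi
  done
  return "$status"
}
```

For example, check_proxy_env http://localhost:5034 prints one line per variable and returns a non-zero status if either one differs from the expected address.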
Load balancer health checks are failing
If you're using proxy mode and your application's load balancer has failing health checks, you may need to update your load balancer to target the Failure Flags sidecar instead of your application.
Symptoms:
- Your load balancer has marked your service as "unhealthy".
- Your load balancer is logging HTTP 502 / 503 errors.
- Network traffic is not reaching your application.
Debugging steps:
- Check your ingress proxy configuration. Ensure that ingress_proxy_enabled is set to true, and that ingress_proxied_endpoint is set to the URL and port of your application (e.g., http://localhost:8080). The proxy port is set to 5035 by default, but you can change this by setting ingress_proxy_port.
- Verify that your load balancer target configuration is pointed to the Gremlin sidecar. For example, if your sidecar's ingress port is set to 5035, ensure your load balancer is targeting the Gremlin container on port 5035.
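To confirm which port the load balancer should target, you can read ingress_proxy_port from the configuration file and fall back to the default of 5035 when it is not set. This is a sketch that assumes a flat "key: value" YAML file. The function name ingress_port and the file path are hypothetical.

```shell
# Sketch: print the port the load balancer should target.
# Reads ingress_proxy_port from a flat "key: value" config file and falls
# back to the default of 5035 when the option is absent.
# ingress_port is a hypothetical name.
ingress_port() {
  config_file="$1"
  port="$(sed -n 's/^[[:space:]]*ingress_proxy_port:[[:space:]]*\([0-9][0-9]*\)[[:space:]]*$/\1/p' "$config_file" | head -n 1)"
  echo "${port:-5035}"
}
```

Compare the printed value with the target port configured on your load balancer. If they differ, the health checks are going to the wrong port.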
Lambda function is timing out or returning errors
If your Lambda function is no longer running correctly, there may be problems with your Lambda layer or proxy configuration.
Symptoms:
- Lambda function timeouts
- Extension not starting
- Runtime API errors
Debugging steps:
- Check the function's CloudWatch logs for errors.
- Verify that you attached the Lambda layer.
- Additionally, make sure you're using the correct ARN for your function's region and architecture.
- Ensure that you've enabled the Lambda layer by setting the GREMLIN_LAMBDA_ENABLED environment variable to true.
- Double-check your Lambda proxy configuration, and ensure lambda_proxy_enabled is set to true.
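A region mismatch between the layer ARN and the function is easy to miss by eye. The sketch below extracts the region from a layer ARN, which has the form arn:aws:lambda:<region>:<account>:layer:<name>:<version>, and compares it with the function's region. The function name layer_region_matches is hypothetical, and the ARN in the usage comment is a made-up placeholder, not a real Gremlin layer.

```shell
# Sketch: compare the region embedded in a layer ARN with the function's region.
# layer_region_matches is a hypothetical name.
layer_region_matches() {
  layer_arn="$1"
  function_region="$2"
  # The region is the fourth colon-separated field of the ARN.
  arn_region="$(printf '%s' "$layer_arn" | cut -d: -f4)"
  if [ "$arn_region" = "$function_region" ]; then
    echo "ok: layer region $arn_region matches"
  else
    echo "mismatch: layer is in '$arn_region', function is in '$function_region'"
    return 1
  fi
}

# Usage (placeholder ARN; list the attached layers first with the AWS CLI):
# aws lambda get-function-configuration --function-name your-function-name --query 'Layers[].Arn'
# layer_region_matches "arn:aws:lambda:us-east-1:123456789012:layer:example-layer:1" us-east-1
```

This only checks the region. The compute architecture still has to be compared against the layer you selected.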
ECS task is timing out or returning errors
If your ECS container is no longer running correctly, there may be a problem with your sidecar deployment or permissions.
Symptoms:
- Task fails to start
- Sidecar container exits
- Service registration issues
Debugging steps:
- Check your ECS event log for error messages.
- Verify that the task role has the required permissions.
Kubernetes pod fails to start or is unresponsive
Symptoms:
- Pod fails to start
- Service name not detected
- Network policies blocking traffic
Debugging steps:
- Check your pod's logs using kubectl or a similar tool.
- Verify that the service account used to run the pod has the correct permissions.
- Ensure that you've explicitly set the name of the service using the SERVICE_NAME environment variable.
- Ensure the pod can reach the Gremlin API at api.gremlin.com. If you have shell access to the pod, you can use a tool like curl, wget, or ping.
- Verify that your application container can communicate with the Gremlin sidecar.
- Double-check your resource requests and limits to ensure your container isn't being pre-emptively terminated.
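For the resource limits check, one common signal is a container whose last termination reason is OOMKilled, which means it exceeded its memory limit. The following sketch reads that reason with kubectl. The function names and the pod name are hypothetical, and it assumes kubectl is already pointed at the right cluster and namespace.

```shell
# Sketch: detect whether any container in a pod was last terminated for
# exceeding its memory limit. Function names are hypothetical.

# Print the last termination reason of each container, separated by spaces.
last_termination_reasons() {
  kubectl get pod "$1" -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# Succeed if the space-separated list of reasons contains OOMKilled.
was_oom_killed() {
  case " $1 " in
    *" OOMKilled "*) return 0 ;;
    *)               return 1 ;;
  esac
}

# Usage (needs cluster access, so it is left commented out):
# reasons="$(last_termination_reasons your-pod-name)"
# if was_oom_killed "$reasons"; then echo "raise the container's memory limit"; fi
```

If no container reports OOMKilled, resource limits are less likely to be the cause, and the earlier steps in this list are a better place to look.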