Services and Dependencies

Supported platforms:

N/A

A service is a discrete unit of functionality provided by one or more systems in your environment. For example, a web server deployed as a load balancer for your backend systems is a service. In Gremlin, services are the units used to test and measure the reliability of your system. This page will show you how to add, manage, and test your services using the Gremlin web app.

In order to use services, the Gremlin agent must be configured to collect process data. See Enabling Process Collection for more information.

Viewing your list of services

You can access your list of services using the Services menu item in the nav bar. This is the main view of any services that you or your teammates have added to Gremlin, along with their reliability score. This is also called the Service Catalog. If no services have been added yet, this list will appear empty.

To open a service, click its entry in the list. You can search for a specific service by name using the search box, or sort the list by clicking the Name or Score column header.

Viewing a list of services in the Gremlin web app

Viewing your production services

Gremlin lets you flag services as being in a Production environment. When this flag is enabled for one or more services, those services will be highlighted, and an additional tab named Production will appear at the top of the list. Clicking on the Production tab shows only those services that are flagged as Production so that you can more easily identify key services.

Adding a service

You can add a new service in one of three ways: by defining it manually, by creating it from an AWS Elastic Load Balancer (ELB), or by using an annotation.

Adding a service manually

To add a new service, click the + Service button on the top-right corner of the services list. This will walk you through a short wizard with the following steps:

Give your service a name and define the type of service. Gremlin supports host-based, container-based, Kubernetes-based, and Failure Flags-based services.
Define your service's fingerprint. This is where you select the resources in your environment that comprise your service. The selection will change depending on the type of service selected in step 1. For example, selecting Kubernetes will show all of the Kubernetes resources detected by the Gremlin agent.
1. For Failure Flags, this will show the services detected by your Failure Flags agent(s).
2. For hosts and Kubernetes, you can select multiple resources. For example, you can select multiple Kubernetes Deployments, a Deployment and a DaemonSet, etc.
Click Create Service.

Next, you will need to add a health check.

Adding a service via an AWS Elastic Load Balancer (ELB)

If you've authenticated Gremlin with your AWS account, Gremlin can detect your ELBs and use them to define your services for you. Gremlin will also automatically create Health Checks for these services, so you can start running tests immediately after creating them.

Click on the Getting Started nav item.
Under step 3 (Define your services), select the AWS account you wish to add the service from. If you haven't added an account yet, follow the instructions in Authenticating Gremlin to AWS.
Select the Elastic Load Balancer (ELB) that you want to use to create the service. Note that you can select multiple ELBs at once.
1. If Gremlin can't detect your ELBs, make sure you've set up your AWS authentication correctly, and that your Gremlin IAM role has the correct permissions. You can find the full policy and permissions on the CloudWatch Health Checks page.
Once your ELB(s) are selected, click Create Service.

Adding a service by using an annotation

You can use annotations to register a service with Gremlin in one of two ways, depending on the type of service being added:

For host-based services, you can add tags to the agent configuration file defining the service name.
For Kubernetes-based services, you can annotate Kubernetes objects with the service name.

Warning

Using the same service IDs across multiple Gremlin teams is unsupported. If you wish to annotate multiple services with the same service ID, consider adding a team-specific identifier to the ID, such as the team name.

Host-based service annotations

You can register a host-based service with Gremlin by tagging agents with the gremlin-service-id tag. Tags are key-value pairs that provide additional metadata for objects. In this case, gremlin-service-id is the key, and the value is the name that you want the service to have in Gremlin.

Identify the hosts that you want to annotate.
Choose a name for the service. This is the name that the service will have in the Gremlin web app and REST API. This can be the same name as the hostname, or a unique name specifically for Gremlin.
Add the tag to the agent configuration. An example is available below.
1. Optional: If you want multiple hosts to be part of the same service, use the same service name on each host and Gremlin will consider them as part of the same service.
Save the updated configuration(s) and restart the agent(s). Gremlin will detect the tag and add the new service(s), which you can manage using the Service Catalog.

YAML


identifier: gremlin-01
team_id: 11111111-1111-1111-1111-111111111111

## Gremlin Client Tags; Tag your machine with key-value pairs that help you target this machine during experiments
## (can also set with GREMLIN_CLIENT_TAGS environment variable)
tags:
  gremlin-service-id: my-nginx-service
  interface: http

team_certificate: |
  -----BEGIN CERTIFICATE-----
  111111111111111111111111111==
  -----END CERTIFICATE-----
team_private_key: file:///var/lib/gremlin/key.pem
push_metrics: true
collect_dns: true

Kubernetes-based service annotations

You can register a Kubernetes service with Gremlin by annotating Kubernetes objects with the gremlin.com/service-id annotation. Annotations are key-value pairs that provide additional metadata for objects. In this case, gremlin.com/service-id is the key, and the value is the name that you want the service to have in Gremlin.

Identify the Kubernetes resource that you want to annotate. Gremlin supports annotations on Deployments, DeploymentConfigs, DaemonSets, StatefulSets, and Argo Rollouts.
Choose a name for the service. This is the name that the service will have in the Gremlin web app and REST API. This can be the same name as the Kubernetes object, or a unique name specifically for Gremlin.
Add the annotation to the object definition. An example is available below.
1. Optional: If you want multiple objects to be part of the same service (e.g. two or more Deployments), use the same service name for both and Gremlin will consider them as part of the same service.
2. Optional: If you want to create this service for another team in your company, you can annotate the spec with gremlin.com/team-id. The value of this annotation should be the teamId of the team within your company that you want to create the service for. Remember to share access to the namespace with that team before doing so, otherwise the service will not be automatically generated.
3. Optional: You can add custom metadata to this service by adding a gremlin.com/tags annotation containing a comma-separated list of key:value pairs. For example: gremlin.com/tags: environment:staging,app:nginx
Save and apply the updated manifest(s). Gremlin will detect the annotation and add the new service(s), which you can manage using the Service Catalog.

YAML


apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
  labels:
    app: nginx
  annotations:
    gremlin.com/service-id: my-nginx-service
    gremlin.com/team-id: my-team-id
    gremlin.com/tags: environment:production,app:nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.14.2
          ports:
            - containerPort: 80
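
The gremlin.com/tags value in the manifest above is a comma-separated list of key:value pairs. As a minimal sketch of that format, the following Python helper splits such a value into a dictionary, which can be handy for checking a manifest before applying it. The function name parse_gremlin_tags is hypothetical and is not part of any Gremlin tool or API; it only illustrates the format described on this page.

```python
def parse_gremlin_tags(annotation_value: str) -> dict:
    """Parse 'key:value,key:value' into a dict (illustrative helper, not a Gremlin API)."""
    tags = {}
    for pair in annotation_value.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing commas and stray whitespace
        key, sep, value = pair.partition(":")
        if not sep or not key.strip():
            raise ValueError(f"expected key:value, got {pair!r}")
        tags[key.strip()] = value.strip()
    return tags


print(parse_gremlin_tags("environment:production,app:nginx"))
# {'environment': 'production', 'app': 'nginx'}
```

A value without a colon raises ValueError, which makes a malformed annotation visible before the manifest is applied.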

Managing services

The service details page is your dashboard for managing and testing each service. You can perform tasks such as viewing the service's reliability score, running reliability tests, adding Health Checks, adding other integrations, deleting the service, and viewing the service's selection criteria (i.e., the systems in your environment that comprise the service). You can also view, manage, and run tests on the service's dependencies.

A detailed overview of a service in the Gremlin web app

Assigning an owner

Services can be assigned to a service owner. Service owners will receive email notifications when:

Test suites finish running, and
Gremlin sends weekly reliability reports.

Adding and removing Health Checks

The Health Check feature automatically checks external metrics or REST API endpoints while a reliability test is running. These are usually monitors configured in an observability tool like Datadog, New Relic, or Prometheus. They can also include custom monitoring tools and URLs.

Before you can run a reliability test on a service, you'll need to assign at least one Health Check to the service. While a reliability test is running, your Health Check(s) will poll your observability tool every 10 seconds. If the monitor/endpoint reports back as failed, unhealthy, or unavailable, the Health Check will halt the ongoing test, revert the impact, and mark it as failed.
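
The polling behavior described above can be sketched as a simple loop. This is an illustration only, not Gremlin's implementation: the function name, its parameters, and the "passed"/"failed" return values are assumptions made for the example, and the simulated monitor stands in for a real observability tool.

```python
import time
from typing import Callable


def run_test_with_health_check(
    is_healthy: Callable[[], bool],
    test_duration_s: float,
    poll_interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return 'passed' if every poll is healthy, otherwise 'failed'."""
    elapsed = 0.0
    while elapsed < test_duration_s:
        if not is_healthy():
            # A real run would halt the test and revert its impact here.
            return "failed"
        sleep(poll_interval_s)
        elapsed += poll_interval_s
    return "passed"


# Simulated monitor: healthy for two polls, then unhealthy.
readings = iter([True, True, False])
result = run_test_with_health_check(
    lambda: next(readings), test_duration_s=60, sleep=lambda _: None
)
print(result)  # failed
```

The sleep function is injectable so the loop can be exercised without waiting for real 10-second intervals.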

Adding a Health Check to a Service

Once you've created a Health Check, you can add it to a service in one of two ways:

Adding a Health Check from the service details page

In the Gremlin web app, open the details page for the service you want to add the Health Check to.
Click Settings at the top of the page next to the service name, then select Health Checks.
Click on the Health Checks drop-down and select the Health Check you want to add. You can also type text to search for a specific Health Check.
Click + Add to add the Health Check to the service.

Adding a Health Check from the Service Catalog

In the Gremlin web app, open the Service Catalog.
Click the check box next to the service you want to add the Health Check to. Note that you can select multiple services to add Health Checks in bulk.
Click on the Health Checks drop-down at the top of the list and select the Health Check you want to add. You can also type text to search for a specific Health Check.
Click + Add to add the Health Check to the service.

Editing a Health Check

You can edit an existing Health Check in one of two ways:

Editing a Health Check from the service details page

In the Gremlin web app, open the details page for the service whose Health Check you want to edit.
Click Settings at the top of the page next to the service name, then select Health Checks.
Find the Health Check you want to edit, then click Edit.
Make the desired changes to the Health Check.
Click Test Connection, then Test Evaluation to verify the new settings.
Click Save Health Check to save the changes.

Editing a Health Check from the Health Checks page

In the Gremlin web app, open the Health Checks page.
Find the Health Check you want to edit, then click Edit.
Make the desired changes to the Health Check.
Click Test Connection, then Test Evaluation to verify the new settings.
Click Save Health Check to save the changes.

Removing a Health Check

To remove a Health Check from a Service, open the Service in the Gremlin web app, click Settings, and then click the Health Checks tab. Find the Health Check you want to remove, then remove it from the list. This removes this specific Health Check from the service, but it does not delete the authentication settings for the observability tool. In other words, you can continue using this observability tool for other Health Checks.

Note

Gremlin will not let you delete Health Checks that are actively in use by at least one service. You'll need to make sure a Health Check is not in use before deleting it.

Viewing the reliability score

Each service has a reliability score ranging from 0 to 100. This score is a calculated value that represents how reliable the service is. Running a reliability test will increase your score. To learn how the score is calculated, see Reliability Score.

Managing the service's queue

The Queue page shows all the reliability tests that are scheduled to run, or are currently running, on a service. To access the queue, click the Queue button at the top of each service’s overview page.

A list of upcoming and recently run reliability tests for a service.

If a reliability test is running, it will be shown under the Now Running section at the top of the page with its start time, target service, source (how it was triggered), and a button to halt the test. The Next section shows the tests that are scheduled to run next, while the Recent Runs section shows the last ten (10) tests that have run on this service.

Editing service settings

You can modify a service by clicking the Settings button at the top of the service's page. This page lets you:

Change the service's name.
Add or remove Health Checks.
Change the testing schedule.
Manage integrations (e.g. load generators).
Flag whether this service is a production service.
Delete the service.

Note on service deletion

Deleting a service will also delete its score, test history, and dependencies. This is irreversible!

Flagging a service as Production

If a service is running in production, you may want to avoid running tests on it without confirmation. Gremlin lets you do this by tagging the service with metadata that identifies it as running in a production environment. The service will be highlighted in the service list and will show a confirmation warning when you try to run tests.

To flag a service as "Production":

Open the service settings and select the Tags tab.
In the Tag Name box, enter environment, and in the Tag Value box, enter production.
Click Add Tag. The new tag will appear in the box below, and "production" will be highlighted in orange.

Marking a service as Production in the Gremlin web app

If you want to flag every service as a Production service, you can do so by navigating to Team Settings, selecting the Environments tab, and clicking the Everything in this Team is in a Production environment checkbox.

Tagging a service with custom metadata

In addition to auto-detected tags (region, zone, etc.), you can also add your own custom tags to services. This lets you add metadata to help with searching, grouping, and filtering services.

Warning

The following tags are reserved by Gremlin and shouldn't be used: local-hostname, local-ip, instance-id, gremlin-client-version, cloud, os-name, and os-type.

To add a tag to a service, open the service's settings page and select the Tags tab. Give the tag a name in the Tag Name box, and enter its value in the Tag Values box. You can store multiple values in a single tag by entering them as a comma-separated list. When you're ready to add the tag, click Add Tag, then click Save.

The tag table shows all tags associated with this service. You can remove a tag from the service by clicking the Delete button. Note that deleting or editing a tag here won't change other services' tags, even if they share the same tag name.

For Kubernetes services, you can add tags using annotations. For example, here is an Nginx deployment with a production environment tag and an additional custom tag:

YAML


...
metadata:
  name: my-deployment
  annotations:
    gremlin.com/service-id: my-nginx-service
    gremlin.com/tags: environment:production,app:my-web-application
...

Managing dependencies

In addition to testing services, Gremlin can test each service's dependencies. For infrastructure-based services, Gremlin will try to automatically detect all relevant dependencies using the service’s network traffic and DNS requests. For Failure Flags-based services, Gremlin will add any Flag whose name starts with the prefix dependency- as a dependency. For example, if your application connects to DynamoDB, you can wrap the calls to DynamoDB in a Failure Flag named dependency-dynamodb to run dependency tests. You can also manually add infrastructure dependencies.

Examples of dependencies Gremlin can discover:

Common technologies using well-known ports (e.g. Oracle over port 1521)
AWS, Azure, and Google Cloud services (e.g. DynamoDB)
Any Failure Flag with the dependency- prefix.

To learn how Gremlin detects dependencies, see how dependency detection works.

Adding infrastructure dependencies

If you have host, service, or Kubernetes dependencies that Gremlin didn't detect, you can add them manually. To add a dependency:

Open your service page in the Gremlin web app.
Scroll down to Dependencies and click the Add Dependency button.
Enter a name for the dependency.
Enter the dependency's network identifier. This can be a hostname, IP address, CIDR subnet, URL, or cloud service.
Optionally, enter the port(s) to target. You can enter a single port number, a port range, or a comma-separated (CSV) string of multiple ports and/or port ranges. Leaving this blank will target all ports.
Click Save Dependency to add the dependency.
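
The port field in the steps above accepts a single port, a range, or a comma-separated mix of both. As a rough sketch of how such a string can be interpreted, the Python function below normalizes it into inclusive ranges. The name parse_ports and the choice to treat an empty string as "all ports" by returning an empty list are assumptions for this example; Gremlin's own validation rules may differ.

```python
def parse_ports(spec: str) -> list:
    """Parse '443', '8000-8080', or '80,443,8000-8080' into inclusive (start, end) ranges.

    An empty string returns an empty list, which this sketch reads as "all ports".
    """
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if not (1 <= start <= end <= 65535):
            raise ValueError(f"invalid port or range: {part!r}")
        ranges.append((start, end))
    return ranges


print(parse_ports("80,443,8000-8080"))
# [(80, 80), (443, 443), (8000, 8080)]
```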

Adding Failure Flags dependencies

Gremlin automatically detects Failure Flags dependencies by looking for Flags whose names start with the prefix dependency-. There is currently no way to manually add a Failure Flags dependency.
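
To check in advance which of your Flags would be picked up under this naming rule, you could filter your flag names by the prefix. The sketch below is a plain string check written for illustration; the function name and the sample flag names are hypothetical, and it does not call the Failure Flags SDK.

```python
DEPENDENCY_PREFIX = "dependency-"


def dependency_flags(flag_names):
    """Return the flag names that start with the dependency- prefix."""
    return [name for name in flag_names if name.startswith(DEPENDENCY_PREFIX)]


flags = ["dependency-dynamodb", "checkout-latency", "my-dependency-cache"]
print(dependency_flags(flags))
# ['dependency-dynamodb']
```

Note that my-dependency-cache is not selected, because the prefix has to appear at the start of the name.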

Editing dependencies

To edit a dependency, click the gear icon next to its name, then click Edit. After making your edits, click Save to save your changes.

Combining dependencies using wildcard matching

If you have multiple dependencies that you want to treat as one single dependency, you can combine them using wildcard characters. If Gremlin detects a dependency that matches the wildcard pattern, it will roll it into the wildcard dependency along with other dependencies that have the same wildcard pattern. This lets you test multiple endpoints simultaneously without needing to create a separate dependency for each one.

To use wildcard dependency matching:

Navigate to your desired Service in the Gremlin web app.
Scroll down to the Dependencies section and add a new dependency.
In the dependency’s Identifier section, enter the wildcard pattern you wish to use. The table below lists the patterns you can use. Optionally, you can also supply one or more comma-separated Port values that will be used for matching and targeting.
Click Create Dependency or Save.

When Gremlin detects a new dependency that matches this wildcard, it will list its endpoint under the wildcard dependency and include it when testing the wildcard dependency.

Pattern	Description	Example
`?`	Matches a single arbitrary character	`dynamodb.us-west-?.amazonaws.com` matches all Amazon DynamoDB instances in all us-west regions
`*`	Matches zero or more arbitrary characters	`dynamodb.*.amazonaws.com` matches all DynamoDB instances on AWS
`**`	Matches zero or more arbitrary characters	`**.amazonaws.com` matches all AWS services
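
To preview which hostnames a pattern from the table would cover, the patterns can be approximated with a regular expression. The sketch below is an assumption-laden illustration, not Gremlin's matcher: because the table describes `*` and `**` with the same wording, it treats both as "zero or more arbitrary characters", and the real product may distinguish them. The function names are hypothetical.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate ?, *, and ** wildcards into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            # Collapse '*' and '**' into one "zero or more characters" token,
            # since the table describes both the same way.
            while i + 1 < len(pattern) and pattern[i + 1] == "*":
                i += 1
            parts.append(".*")
        elif ch == "?":
            parts.append(".")  # exactly one arbitrary character
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts))


def matches(pattern: str, hostname: str) -> bool:
    """True when the whole hostname matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(hostname) is not None


print(matches("dynamodb.us-west-?.amazonaws.com", "dynamodb.us-west-2.amazonaws.com"))  # True
print(matches("dynamodb.us-west-?.amazonaws.com", "dynamodb.us-east-1.amazonaws.com"))  # False
print(matches("**.amazonaws.com", "s3.us-east-1.amazonaws.com"))  # True
```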

Grouping behavior when creating a new dependency with a wildcard pattern

When creating a new dependency with a wildcard pattern, all existing dependencies that match the supplied wildcard expression are merged automatically with the new dependency.
When an existing dependency has only a subset of hostnames that match the new pattern, the hostnames that match are moved to the wildcard. Hostnames that don't match will still be associated with the original dependency.
When creating a regular hostname dependency, each existing wildcard that matches the new hostname will also track it, separately from the new dependency.
Wildcard dependencies are allowed to overlap with other wildcard dependencies.

Grouping behavior at discovery time

Gremlin continuously monitors the dependency activity of your services. A newly detected hostname that is not yet tracked by your dependencies is called a “discovery”. Gremlin treats discoveries as follows:

When a discovery matches a known wildcard, its hostname and port details are tracked on the dependency for as long as the hostname is reported by Gremlin agents.
A discovery can be tracked by one or more wildcard dependencies for a service.
A discovery is only created as a standalone dependency when it matches no existing wildcard.
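
The discovery rules above amount to a small decision: a new hostname is tracked by every wildcard it matches, and becomes a standalone dependency only when it matches none. A minimal sketch follows, using Python's standard fnmatch module as a stand-in for Gremlin's matcher (an assumption; the two may differ in edge cases). The function name assign_discovery and the sample hostnames are hypothetical.

```python
from fnmatch import fnmatchcase


def assign_discovery(hostname: str, wildcards: list) -> list:
    """Return the wildcard dependencies that track a hostname, or the hostname itself as standalone."""
    matched = [w for w in wildcards if fnmatchcase(hostname, w)]
    return matched if matched else [hostname]


wildcards = ["dynamodb.*.amazonaws.com", "**.amazonaws.com"]

# Matches both wildcards, so both track it and no standalone dependency is created.
print(assign_discovery("dynamodb.us-west-2.amazonaws.com", wildcards))
# ['dynamodb.*.amazonaws.com', '**.amazonaws.com']

# Matches no wildcard, so it becomes a standalone dependency.
print(assign_discovery("db.internal.example.com", wildcards))
# ['db.internal.example.com']
```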

Removing dependencies

To remove a single dependency, click the gear icon next to its name, then click Delete. Click Delete again to confirm the deletion.

To remove all automatically-discovered dependencies, you can use the Flush button. Flushing your dependencies removes any dependencies that Gremlin automatically detected (this does not include user-created dependencies). Flushing will also halt any dependency tests running on impacted dependencies, remove any scheduled tests associated with a flushed dependency, and recalculate the service's reliability score. Dependencies for Failure Flags services will be unaffected. Gremlin will rediscover your dependencies if there is network traffic between the service and the dependency.

Marking a dependency as a Single Point of Failure (SPOF)

Some dependencies are known to be a risk. These dependencies may be critical parts of your infrastructure, where a failure of the dependency will result in an outage. These are known as single points of failure (SPOF).

Flagging a dependency as a single point of failure excludes it from running reliability tests when a user clicks the Run All Tests button or sets up Auto Scheduling. You can still run tests on the dependency manually, and the results will still contribute to the service's reliability score. This feature is simply meant to prevent users from accidentally running automated tests on the dependency when it's already known to be a risk. Once the risk is addressed, you can simply uncheck this feature to include the dependency in automatic testing once again.

Viewing a dependency in the Company Report with a single SPOF

Marking an existing dependency as a single point of failure

To mark an existing dependency as a single point of failure:

Open the Services list in the Gremlin web app, then click on the service containing the dependency you want to flag.
Scroll down to the Dependencies section of the service's overview page, then click Dependencies to view its dependencies.
Click on the gear icon next to the name of the dependency you want to flag, then click Edit. This opens the dependency's settings.
In the Edit Dependency pane, check the option Mark this Dependency as a Single Point of Failure, then click Save.

Screenshot of the SPOF checkbox in the Edit Dependency screen

Marking a new dependency as a single point of failure

To mark a new dependency as a single point of failure:

Follow the instructions in Adding infrastructure dependencies, but don't click Save yet.
Check the option Mark this Dependency as a Single Point of Failure.
Click Save.

Screenshot of adding a new suggested dependency using IP address and port number

How dependency detection works

Gremlin relies on DNS and socket data to identify dependencies. Every 30 seconds, Gremlin monitors packets sent over UDP port 53 (i.e., DNS traffic). The Gremlin agent also polls the host’s (or container’s) socket table every 5 seconds and matches open sockets to DNS network traffic. If there’s a match, the agent sends the DNS name, IP address, and port to the Gremlin Control Plane. Finally, Gremlin must find the same dependency on at least two replicas of the service. If these conditions are met, the dependency will be listed in the service’s Dependencies table and available for testing.

Note

Gremlin only monitors TCP and UDP sockets on Linux and only TCP sockets on Windows. Other types of traffic, such as RDP and SCTP, won’t appear.
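
The detection rule described in this section (match open sockets to observed DNS answers, then require the same dependency on at least two replicas) can be sketched as follows. This is an illustration under stated assumptions, not the agent's actual logic: the input shapes, the function name detect_dependencies, and the sample hostnames and addresses are all hypothetical.

```python
from collections import defaultdict


def detect_dependencies(observations, min_replicas=2):
    """observations: iterable of (replica_id, dns_answers, open_sockets).

    dns_answers maps a resolved IP to its DNS name; open_sockets is a set of
    (remote_ip, port) pairs. Returns the sorted (dns_name, ip, port) tuples
    seen on at least min_replicas distinct replicas.
    """
    seen_on = defaultdict(set)
    for replica_id, dns_answers, open_sockets in observations:
        for ip, port in open_sockets:
            name = dns_answers.get(ip)
            if name is not None:  # the socket matches observed DNS traffic
                seen_on[(name, ip, port)].add(replica_id)
    return sorted(dep for dep, replicas in seen_on.items() if len(replicas) >= min_replicas)


observations = [
    ("replica-a",
     {"192.0.2.10": "dynamodb.us-west-2.amazonaws.com", "192.0.2.20": "cache.example.internal"},
     {("192.0.2.10", 443), ("192.0.2.20", 6379)}),
    ("replica-b",
     {"192.0.2.10": "dynamodb.us-west-2.amazonaws.com"},
     {("192.0.2.10", 443), ("198.51.100.7", 22)}),
]
print(detect_dependencies(observations))
# [('dynamodb.us-west-2.amazonaws.com', '192.0.2.10', 443)]
```

In this example the cache host is dropped because only one replica connected to it, and the socket on port 22 is dropped because no DNS answer maps to its address.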

Privileges required

Privilege	Description
RELIABILITY_MANAGEMENT_READ	Allows reading all RM services
RELIABILITY_MANAGEMENT_RUN	Allows running of an RM test for a Team
SERVICES_READ	Allows reading information about services and reliability management
SERVICES_WRITE	Allows writing to manage services and reliability management

FAQ

Q: How often are services discovered?

A: Gremlin currently discovers services once every hour.

Q: How often are characteristics of an existing service discovered and/or modified?

A: Gremlin currently discovers and updates the characteristics of existing services once every hour.

Q: How often are targets resolved to an existing service?

A: Gremlin resolves targets instantly, as soon as they change on a service. If a new pod is registered with the control plane, it’s immediately registered as a target to a service.

Q: How often does Gremlin associate pods, containers and hosts with existing services?

A: Every 30 seconds.

Q: How does Gremlin find dependencies?

A: See how dependency detection works. In short, Gremlin observes DNS traffic and compares it to the open sockets on the host (or container). If there’s a match between an open socket and DNS traffic, Gremlin records the DNS name, IP address, and port. Gremlin must detect the same dependency on at least two replicas of the service to make it available for testing.

Q: I have a known dependency on my service, but Gremlin won’t detect it.

A: A dependency might not be listed if your service communicates with it infrequently or if only one instance of your service has connected to the dependency. Try increasing the number of instances of your service to two or more. If the dependency still isn’t being detected, you can add it manually.

If you're using Istio in sidecar mode, add --set gremlin.features.discoverDestinationService.enabled=true to your Helm configuration and restart the Gremlin daemon. You can learn more in our Helm Chart documentation.
