It’s been another busy few months here at Gremlin. Overall, our team has been working on feature improvements to enable teams to measurably improve the reliability of their systems, whether that’s through broadening platform support so you can run Gremlin in more places, making it easier than ever to identify reliability risks, or improving reporting so you can manage reliability programs effectively at enterprise scale. Here’s a summary of what’s new.
The headline feature this month is the introduction of Detected Risks. This new capability automatically detects high-priority reliability concerns in a Kubernetes environment—without running any reliability tests or chaos experiments. You can look forward to dozens more risks being added by the end of this year.
We also launched the beta release of Failure Flags, Gremlin's new framework for running Chaos Engineering experiments on fully managed platforms such as AWS Lambda functions, serverless workloads, and containers. Teams can now run chaos experiments where access to the underlying infrastructure is limited, or simulate failures at the application layer that aren’t possible at the infrastructure layer. It also means Gremlin can now run across your entire stack—even if it’s managed for you.
Also this month, we improved Company Summary reports (previously called the Dashboard). You can now see summary reports of both your Detected Risk reports and Reliability Score reports, so you can get a sense of your reliability posture in one place. As part of this change, plan usage details have been moved to Company Settings.
In other news, we’ve made a number of general improvements:
- Gremlin now supports delegation of Namespaces to a Team for both manual and automatic service creation. Teams can more confidently run experiments without accidentally impacting other teams' resources.
- We’ve added service annotations, which let you automatically register your Kubernetes services in Gremlin by adding a simple annotation. This speeds up service creation significantly: any service with an annotation simply appears in the Gremlin Service Catalog, ready for you to manage and test.
- We’ve added web app support for managing multiple services simultaneously. This lets you add Health Checks to multiple services with a single click and start testing within seconds. The Service Catalog has been reworked to reflect this change.
- Scenarios can now be deleted as well as archived, so your list shows only the Scenarios that are still relevant.
We’ve made two significant improvements to the Linux agent, both of which reduce network overhead and improve overall performance.
First, Gremlin now uploads discovered process data less frequently, reducing network overhead. gremlind now batches process data over 15-minute intervals, deduplicating all network and process data detected during each interval. Previously, gremlind emitted snapshots of process and socket data to Gremlin's control plane at two-minute intervals.
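To illustrate why batching with deduplication cuts traffic, here is a minimal sketch of the idea in Python. It is not gremlind's implementation: the `ProcessBatcher` class, its `observe` method, and the shape of the observations (a process name paired with a remote address) are all assumptions made for this example.

```python
from dataclasses import dataclass, field

BATCH_INTERVAL_SECONDS = 15 * 60  # the 15-minute window described above


@dataclass
class ProcessBatcher:
    """Hypothetical batcher: emits one deduplicated batch per interval."""

    interval: float = BATCH_INTERVAL_SECONDS
    window_start: float = 0.0
    seen: set = field(default_factory=set)

    def observe(self, now, process, remote_addr):
        """Record one observation; return the previous batch if its window has closed."""
        flushed = None
        if now - self.window_start >= self.interval:
            # The window is over: hand back everything seen in it, once each.
            flushed = sorted(self.seen)
            self.seen = set()
            self.window_start = now
        # A set keeps a single copy of each (process, address) pair.
        self.seen.add((process, remote_addr))
        return flushed
```

In this sketch, a process that is observed every two minutes still contributes only one entry to the batch that is returned when the 15-minute window closes.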
As noted above, Gremlin can now detect specific reliability risks without fault injection. To support this functionality, the Chao Kubernetes agent now sends the imageID of each container, which enables Gremlin to identify services running multiple container versions simultaneously, a common reliability risk. You can learn more about Detected Risks here.
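The check itself is simple to picture. The sketch below shows one way to flag services whose containers report more than one imageID. It is an illustration only, not Gremlin's detection logic: the function name and the input shape (pairs of service name and image ID) are assumptions.

```python
from collections import defaultdict


def find_mixed_version_services(containers):
    """Return {service: sorted image IDs} for services running more than one image ID.

    `containers` is an iterable of (service_name, image_id) pairs; this input
    shape is an assumption made for the example.
    """
    image_ids_by_service = defaultdict(set)
    for service, image_id in containers:
        image_ids_by_service[service].add(image_id)
    # Keep only services whose containers disagree about the image they run.
    return {
        service: sorted(ids)
        for service, ids in image_ids_by_service.items()
        if len(ids) > 1
    }
```

Comparing image IDs rather than image tags is useful because a tag can be repointed to a new image, so two containers started from the same tag may not be running the same version.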
We continue to build out enterprise-grade security capabilities trusted by some of the world’s largest and most regulated companies, and this month we’ve made two updates.
First, when installed directly on the host and launched with systemd, the Gremlin agent now runs with ambient capabilities (see capabilities(7)) rather than file capabilities. Ambient capabilities are held by the process itself and are preserved when it executes unprivileged programs, so the agent retains the permissions it needs without capability bits being stored on its binaries, making it more flexible and secure in a Linux environment.
Second, when installed directly on the host, the suid bit, which lets any user run a binary with the privileges of its owner, is no longer set for installed binaries (/usr/sbin/gremlind). Additionally, these binaries are no longer owned by the Gremlin Linux user, but by root, which further improves security.
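If you want to confirm this on a host after upgrading, a short check along the following lines could work. This is a sketch using only Python's standard library; the helper names `describe_binary` and `is_hardened` are made up for this example, and you would pass the path of the installed binary yourself.

```python
import os
import stat


def describe_binary(path):
    """Return (suid_set, owner_uid) for the file at `path`."""
    info = os.stat(path)
    suid_set = bool(info.st_mode & stat.S_ISUID)
    return suid_set, info.st_uid


def is_hardened(path):
    """True when the file has no suid bit and is owned by root (uid 0)."""
    suid_set, owner_uid = describe_binary(path)
    return not suid_set and owner_uid == 0
```

For example, `is_hardened("/usr/sbin/gremlind")` should be true on a host where the changes described above have been applied.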
When run against CIDR values (e.g., 10.0.0.0/24), Certificate Expiry experiments will make several attempts to find an active IP address in use by the target system for evaluating certificate expiration characteristics within the duration specified by the argument
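To make the idea of probing a CIDR range concrete, here is a rough sketch using Python's standard ipaddress module. It does not reflect how Gremlin chooses addresses or how many attempts it makes: the function name, the `max_attempts` default, and the caller-supplied `is_active` probe are all hypothetical.

```python
import ipaddress


def find_active_address(cidr, is_active, max_attempts=5):
    """Probe up to `max_attempts` host addresses in `cidr`; return the first active one.

    `is_active` is a caller-supplied callable (hypothetical here) that takes an
    address string and returns True when something answers at that address.
    Returns None if none of the probed addresses is active.
    """
    network = ipaddress.ip_network(cidr)
    for attempt, host in enumerate(network.hosts()):
        if attempt >= max_attempts:
            break  # give up after the allowed number of attempts
        address = str(host)
        if is_active(address):
            return address
    return None
```

Passing the probe in as a function keeps the sketch independent of any particular network check, such as opening a TLS connection to the address.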
With Helm, you can now add labels to the deployed Gremlin Pods using the gremlin.podLabels parameter. Labels make it easier to filter, sort, or select pods for tests and experimentation in Gremlin. See the Chart documentation for details.
If you already have a Gremlin account, everything noted here is already available to you, as long as you have the latest agent installed.
If not, sign up for a free trial to start understanding and improving your reliability posture in minutes.