2024 is off to a fast start here at Gremlin. Since our last release roundup, we’ve shipped a new experiment type, new features that tighten integration with cloud platforms, and improvements to our auto-detection processes. Now you can push processes to their limits, find dependencies more easily, limit when tests can be run, and much more. We also introduced a slew of platform improvements that boost efficiency, performance, and user experience in the Gremlin web application.

Check out what’s new in Gremlin below.

New features

Simulate massively parallel workloads with the new Process Exhaustion experiment

Gremlin has a brand new experiment type: Process Exhaustion! This experiment simulates running processes on a system in order to consume process IDs (PIDs). This lets you test your systems’ ability to handle massively concurrent workloads, such as container orchestration tools and large-scale web and proxy servers. You can use this to determine:

  • How many processes your systems can handle before becoming unstable.
  • How your services respond when the host runs out of PIDs.
  • Whether your PID limits are being enforced across your services.

This new experiment is available for Linux today under the State category of experiments, and it’s ready to be incorporated into your Scenarios and Reliability Management Test Suites. Check out our blog post for more details.
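Before running a Process Exhaustion experiment, it can help to know how much PID headroom a host has to begin with. The following is a minimal sketch, not part of Gremlin: it assumes a Linux host with a standard /proc filesystem, and the function names are hypothetical. It reads the kernel-wide limit from /proc/sys/kernel/pid_max and counts running processes by their numeric directories in /proc.

```python
from pathlib import Path


def read_pid_max(proc_root: str = "/proc") -> int:
    """Return the kernel-wide PID limit from <proc_root>/sys/kernel/pid_max (Linux only)."""
    return int(Path(proc_root, "sys", "kernel", "pid_max").read_text().strip())


def count_processes(proc_root: str = "/proc") -> int:
    """Count processes by counting the numeric directories under <proc_root>.

    Threads also take IDs from the same space, so treat this as a lower
    bound on how many IDs are actually in use.
    """
    return sum(
        1
        for entry in Path(proc_root).iterdir()
        if entry.is_dir() and entry.name.isdigit()
    )


def pid_headroom(pid_max: int, in_use: int) -> float:
    """Fraction of the PID space still unused, clamped to the range 0.0 to 1.0."""
    if pid_max <= 0:
        raise ValueError("pid_max must be positive")
    return max(0.0, (pid_max - in_use) / pid_max)
```

On a Linux host, calling pid_headroom(read_pid_max(), count_processes()) before and during an experiment gives a rough view of how close the system is to its limit. Per-service limits (for example, those enforced by cgroups) are not covered by this sketch.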

Streamline deploying Gremlin to AWS with AWS Key Management Service

Gremlin now natively integrates with AWS Key Management Service (KMS), making deploying Gremlin to your AWS environment easier and more secure. When deploying the agent, you can replace your normal configuration values (team_id, team_certificate, etc.) with the Amazon Resource Name (ARN) of the KMS secret you wish to use. When the Gremlin agent starts, it will retrieve the values from KMS, letting you deploy Gremlin securely without having to store or distribute plaintext passwords or certificates.

Learn how to do it in our tutorial.

Prevent testing during critical time blocks with restricted time windows

Sometimes there isn’t a good time to run reliability tests, such as during code merges, scheduled deployments, or peak traffic times. Gremlin now has a native way to prevent users from running experiments, Scenarios, or reliability tests during these periods: restricted time windows. Restricted time windows let you set blocks of time at either the team or company level where tests won’t run. Tests that are already running are halted, and scheduled tests will not run during this time. For each window, you can specify a weekday, start time, and duration. Check out the docs to learn more.
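To make the weekday, start time, and duration model concrete, here is a rough sketch of how such a window could be evaluated. This is an illustration only and not Gremlin's implementation or schema: the RestrictedWindow class and is_blocked function are hypothetical names, time zones are ignored, and each window is assumed to last less than a week.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass(frozen=True)
class RestrictedWindow:
    """Hypothetical model of a restricted time window (not Gremlin's schema)."""

    weekday: int  # 0 = Monday ... 6 = Sunday, matching datetime.weekday()
    start: time
    duration: timedelta

    def contains(self, moment: datetime) -> bool:
        """Return True if `moment` falls inside the most recent occurrence of the window."""
        # Step back to the most recent day that matches the window's weekday.
        days_back = (moment.weekday() - self.weekday) % 7
        start_dt = datetime.combine(moment.date() - timedelta(days=days_back), self.start)
        if start_dt > moment:
            # Right weekday, but before today's start time: use last week's occurrence.
            start_dt -= timedelta(days=7)
        return start_dt <= moment < start_dt + self.duration


def is_blocked(moment: datetime, windows: list[RestrictedWindow]) -> bool:
    """Return True if any restricted window covers `moment`."""
    return any(window.contains(moment) for window in windows)
```

For example, RestrictedWindow(weekday=4, start=time(16, 0), duration=timedelta(hours=4)) would cover Fridays from 16:00 to 20:00. Because the check works from the window's start rather than the calendar day, a window that begins late on Friday and runs past midnight still covers the early hours of Saturday.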

Platform updates

We typically spend a lot of time in December and January on platform improvements, and this year was no different: we’ve made many updates, fixes, and under-the-hood improvements.

Discover and track dependencies more accurately

Gremlin has long been able to find your services' critical dependencies automatically, and now we’ve improved our discovery methods. Gremlin now detects DNS calls made by your service to other services, and uses this information to identify dependencies. This DNS-based method is faster, more accurate, and lets Gremlin track dependencies even if their IP addresses change. We talk about it in detail in our blog: How dependency discovery works.
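One way to see why a DNS-based approach survives IP changes is to key each dependency by hostname instead of by address. The sketch below is a simplified illustration under that assumption, not a description of how Gremlin's agent works internally. The DependencyTracker class and its methods are hypothetical, and the hostnames and addresses are placeholders.

```python
from collections import defaultdict


class DependencyTracker:
    """Hypothetical sketch: group observed DNS lookups by hostname."""

    def __init__(self) -> None:
        # Maps each normalized hostname to the set of IPs it has resolved to.
        self._addresses: dict[str, set[str]] = defaultdict(set)

    @staticmethod
    def _normalize(hostname: str) -> str:
        # DNS names are case-insensitive and may carry a trailing dot.
        return hostname.rstrip(".").lower()

    def record_lookup(self, hostname: str, resolved_ip: str) -> None:
        """Record that `hostname` resolved to `resolved_ip`."""
        self._addresses[self._normalize(hostname)].add(resolved_ip)

    def dependencies(self) -> list[str]:
        """Return the known dependencies, one entry per hostname."""
        return sorted(self._addresses)

    def addresses_for(self, hostname: str) -> set[str]:
        """Return a copy of every IP seen for `hostname`."""
        return set(self._addresses.get(self._normalize(hostname), set()))


tracker = DependencyTracker()
tracker.record_lookup("db.internal.example.com", "10.0.0.5")
# The same service later resolves to a new address; it is still one dependency.
tracker.record_lookup("db.internal.example.com", "10.0.0.9")
```

After both lookups, tracker.dependencies() still lists a single entry for the database, while addresses_for keeps the history of addresses it has resolved to.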

Improved auditing tools in the Gremlin API

Having a comprehensive audit trail is important for any software tool, especially one that tests your systems. Gremlin now provides two REST API endpoints for retrieving log data about who logged into your Gremlin organization, and which experiments/Scenarios were run. Both of these logs are available under the /reports/security endpoint. You can learn more in our REST API documentation.

A cleaner, more streamlined web app

We squashed some bugs and improved the user experience in our web interface. This includes:

  • Adding help text to the Test Suite creation wizard warning users that reliability scores will be reset when a new Test Suite is applied.
  • Making test results clickable on the service overview page. Clicking on a test result will bring you to the most recent test run.
  • Displaying more information about LostCommunication agent errors.
  • Reducing the delay when creating multiple Test Suites.
  • Improving the way columns are rendered in reports.
  • And much more!

Agent updates

As always, we’re constantly improving our agents. In addition to the normal dependency and library updates, we’ve added new container drivers for Docker, containerd, and CRI-O. These new drivers remove our dependency on runC, which significantly reduces CPU and I/O usage. The agent is also smarter about detecting pre-existing network ingress rules that conflict with the Blackhole experiment, which can happen with network integrations like Cilium or Kata. We’ve also made several improvements to experiment rollback logic, such as improved logging and error reporting, and better handling for network devices that are taken offline externally while an experiment is running.

Try it for yourself

If you already have a Gremlin account, everything noted here is available to you now, as long as you have the latest agent installed.

If not, sign up for a free trial to start understanding and improving your reliability posture in minutes.

Andre Newman
Sr. Reliability Specialist