Gremlin User Newsletter: Exploring Istio with Chaos, Chaos at Gremlin, & more
For new users just learning about Chaos Engineering, and even for seasoned Gremlin pros looking for inspiration to break beyond their normal experiments, it can be challenging to develop new Chaos Experiments. I hope this post sparks some new ideas for experimenting on your own systems.
Exploring service meshes with chaos
Istio is an open source service mesh that is popular in the Kubernetes community for the extensive network control it provides to container-based workloads. The recently released Istio version 1.6 introduces a new Workload Entry resource that makes VM and bare metal workloads first-class objects, just like a Kubernetes Pod. This allows you to extend the traffic management, security, and observability features of Istio to those applications.
Whenever you’re operating heterogeneous container and VM infrastructure, there’s almost always some network complexity. That complexity often introduces latency. Latency attacks are a great way to test your application’s ability to handle slow communications.
Use a Latency attack on your VM or bare metal workloads, starting with a small amount of latency and increasing it until you identify your breakpoints. Using this information, you can set your Istio Service Entry timeout and retry policies.
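The ramp-up described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Gremlin's API: `stays_healthy` is a hypothetical hook that you would implement to run a Latency attack at a given value and report whether the service still met its objectives, and `simulated_probe` is a stand-in that pretends the client times out at 1000 ms.

```python
from typing import Callable, Iterable, Optional


def find_latency_breakpoint(
    steps_ms: Iterable[int],
    stays_healthy: Callable[[int], bool],
) -> Optional[int]:
    """Return the first injected latency (ms) at which the service degrades.

    `stays_healthy` is a hypothetical hook: it should run a latency
    experiment at the given value and report whether the service coped.
    """
    for latency_ms in steps_ms:
        if not stays_healthy(latency_ms):
            return latency_ms  # stop ramping at the first failure
    return None  # no breakpoint found within the tested range


# Stand-in for a real experiment: pretend the client times out at 1000 ms.
def simulated_probe(latency_ms: int) -> bool:
    return latency_ms < 1000


breakpoint_ms = find_latency_breakpoint([100, 250, 500, 1000, 2000], simulated_probe)
print(breakpoint_ms)
```

Stopping at the first failing step keeps the blast radius small, and the last healthy step gives you a starting point for choosing timeout values.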
Network complexity can also lead to outages. Whether an outage is caused by a misconfiguration or a provider failure, the customer impact is the same, so you need to handle service outages gracefully. Blackhole attacks allow you to simulate network breaks to your services.
Istio’s Service Entry can simplify your high-availability configuration by allowing you to add multiple VMs as a service endpoint and automatically load balancing among them. Use a Blackhole attack to isolate individual VMs and test Istio’s outlier detection feature, which temporarily ejects unhealthy hosts from the load balancing pool.
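To make the expected behavior concrete, here is a toy Python model of outlier detection. It is a simplified sketch of the idea, not Istio's implementation: the class, the `max_consecutive_errors` parameter, and the host names are all assumptions made for illustration, and the model leaves out the re-admission of hosts after an ejection period.

```python
from collections import defaultdict


class LoadBalancingPool:
    """Toy model of outlier detection: eject a host after N consecutive errors."""

    def __init__(self, hosts, max_consecutive_errors=3):
        self.hosts = list(hosts)
        self.max_consecutive_errors = max_consecutive_errors
        self.consecutive_errors = defaultdict(int)
        self.ejected = set()

    def record_result(self, host, ok):
        """Track a request outcome and eject the host if it keeps failing."""
        if ok:
            self.consecutive_errors[host] = 0  # a success resets the streak
            return
        self.consecutive_errors[host] += 1
        if self.consecutive_errors[host] >= self.max_consecutive_errors:
            self.ejected.add(host)

    def healthy_hosts(self):
        """Hosts still eligible to receive traffic."""
        return [h for h in self.hosts if h not in self.ejected]


pool = LoadBalancingPool(["vm-a", "vm-b", "vm-c"])
# Simulate a Blackhole attack on vm-b: every request to it fails.
for _ in range(3):
    pool.record_result("vm-b", ok=False)
    pool.record_result("vm-a", ok=True)
print(pool.healthy_hosts())
```

During a real Blackhole experiment, the equivalent check is whether traffic stops reaching the isolated VM after a few failed requests while the remaining endpoints keep serving.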
Are you running Istio? If so, how are you testing the reliability of your service mesh?
Chaos at Gremlin
At Gremlin, we’re always exploring ways to improve our internal Chaos Engineering efforts. We recently experimented with an idea from Gremlin’s VP of Engineering Jade Rubick, called The Gauntlet—a surprise attack that would test each engineering team’s reliability. Here are a few things we learned that may be helpful in your own applications:
Monitoring for change
A common monitoring strategy is to set warning alerts that will notify you prior to hitting critical thresholds. With static thresholds—such as warning when CPU consumption hits 75% and sending a critical alert at 90%—a problem arises when the metrics quickly change.
During the Gauntlet, we stepped up CPU consumption, but small steps that didn’t breach the thresholds went ignored, while larger steps triggered our warning and critical alerts almost simultaneously. As a result, we’ve started reevaluating our alerts and are updating them to use Datadog’s Anomaly Detection when appropriate. Using machine learning-based alerting in your own systems can help amplify small signals and alert you prior to large spikes.
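The gap between static thresholds and change-based alerting can be shown with a small Python sketch. It uses a simple rolling z-score as a stand-in for anomaly detection; this is not how Datadog's Anomaly Detection works internally, and the sample values, window size, and z-score threshold are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def static_alerts(samples, warn=75, crit=90):
    """Label each CPU sample using fixed warning and critical thresholds."""
    labels = []
    for value in samples:
        if value >= crit:
            labels.append("critical")
        elif value >= warn:
            labels.append("warning")
        else:
            labels.append("ok")
    return labels


def change_alerts(samples, window=5, z_threshold=3.0):
    """Return indexes of samples that deviate sharply from the preceding window."""
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        spread = stdev(history) or 1.0  # avoid dividing by zero on flat history
        z = (samples[i] - mean(history)) / spread
        if abs(z) >= z_threshold:
            flagged.append(i)
    return flagged


# CPU % over time: a small step (30 -> 50), then a large jump (50 -> 95).
cpu = [30, 31, 29, 30, 31, 50, 51, 50, 95]
print(static_alerts(cpu))  # no "warning" ever fires before "critical"
print(change_alerts(cpu))  # both the small step and the large jump are flagged
```

With these samples, the static thresholds stay silent through the step to 50% and then go straight to critical, while the change-based check flags both the small step and the large jump.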
Cattle vs pets
Cycling machines is often an easy and effective way to resolve issues, especially when you have a system such as Auto-Scaling Groups (ASGs) to enforce infrastructure state. As our Gauntlet attack consumed CPU, the incident response team followed our runbook to delete an affected node. This caused the ASG to replace it with a new node that was unaffected by the attack (in a real incident, the new node would be free of the runaway CPU-heavy processes). After the new node was confirmed to be healthy and operating normally, the team cycled the remaining affected nodes.
One new addition to our runbook is that we now remove one node from the ASG and keep it for forensic analysis rather than deleting it. While it’s easy to focus on just getting your systems back to normal during an incident, ensure that you’re preserving opportunities to learn more about failure.
What have you learned from your Chaos Engineering? Share your experience in the Chaos Engineering Slack. Feeling stuck or need some help? You can always contact us at firstname.lastname@example.org for technical support and Chaos Engineering education.
What’s new in Gremlin
We're excited to introduce Gremlin’s agent for Windows. With the addition of our Windows agent, you will now be able to run Shutdown, CPU, Disk, Memory and IO exhaustion attacks across your Windows systems (Server 2008 and later, client Vista and later). To get started, see our tutorial.
Network Attacks with Tags
Tags can be used for targeting IP addresses during network attacks. This is important for ephemeral environments where hosts live for a short time and have dynamic IP addresses.
Custom tags can be used to designate hosts where the attack will run and can also be used to constrain the attack to network communications with tagged destinations. For example, to test latency between serviceA and serviceB, select all clients with the tag service:serviceA when choosing the Hosts to target, and select the tag service:serviceB when configuring the Network Gremlin to run. For more information, see the documentation.
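The idea behind tag-based targeting is that tags are resolved to whatever hosts and IP addresses currently carry them, so the attack definition survives host churn. A minimal Python sketch of that resolution step follows; it does not use or represent Gremlin's API, and the inventory, host names, IP addresses, and function names are hypothetical.

```python
# Hypothetical inventory; in an ephemeral environment this changes constantly.
hosts = [
    {"name": "host-1", "ip": "10.0.0.11", "tags": {"service": "serviceA"}},
    {"name": "host-2", "ip": "10.0.0.12", "tags": {"service": "serviceA"}},
    {"name": "host-3", "ip": "10.0.0.21", "tags": {"service": "serviceB"}},
    {"name": "host-4", "ip": "10.0.0.31", "tags": {"service": "serviceC"}},
]


def select_by_tag(inventory, tag):
    """Return hosts matching a 'key:value' tag such as 'service:serviceA'."""
    key, value = tag.split(":", 1)
    return [h for h in inventory if h["tags"].get(key) == value]


def plan_latency_attack(inventory, source_tag, destination_tag):
    """Resolve tags into the hosts to attack and the destination IPs to affect."""
    return {
        "run_on": [h["name"] for h in select_by_tag(inventory, source_tag)],
        "affect_traffic_to": [h["ip"] for h in select_by_tag(inventory, destination_tag)],
    }


plan = plan_latency_attack(hosts, "service:serviceA", "service:serviceB")
print(plan)
```

Because the plan is computed from tags at run time, replacing host-3 with a new serviceB host at a different IP address would not require any change to the attack definition.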
New settings menu
We consolidated our Settings and Team menus into a more focused view with a User/Company/Team hierarchy, an easier way to switch teams, and new resource links.
New pricing model
We’ve introduced our new pricing model with Starter, Professional, and Enterprise packages. The new model provides more flexibility to suit organizations of any size. Learn more on our pricing page.
For a comprehensive list of features, updates, and bug fixes, see our release notes. We’re constantly working on new features and enhancements.