CHAOS ENGINEERING GUIDE

Improving the Reliability of Financial Services

How do we increase development velocity to meet changing customer demands while ensuring reliability, avoiding outages, and meeting compliance requirements? The answer is Chaos Engineering.

Learn how you can use Chaos Engineering to proactively increase reliability and mitigate the risk of outages, so you can stay competitive in an ever-changing market.

What's inside?

  • An introduction to Chaos Engineering
  • How to improve reliability while reducing IT costs
  • How to mitigate the risk of system failures while increasing development velocity
  • How to proactively test for compliance and fix vulnerabilities before they become high-profile outages

With Chaos Engineering, you can confidently increase development velocity without risking system failures and outages.

To keep up with the rapid pace of digital transformation and provide innovative new services, teams must be able to push changes quickly. However, legacy IT backbones, distributed system ownership, and compliance regulations can create bottlenecks.

In this white paper, we explore how Chaos Engineering enables you to safely increase development velocity while proactively increasing reliability and mitigating risks of outages.

Over a decade of collective experience unleashing chaos at companies like

  • Reliability across your environment
    Test and improve the reliability of distributed systems. Orchestrate chaos experiments across your environment, including cloud platforms, bare metal, containers, and Kubernetes clusters.
  • Track reliability improvements
    Keep track of the experiments run across your environment so you can prioritize which services need attention and see where vulnerabilities may lie.
  • Win customer trust
    Keep customers at the heart of your engineering practices. Test across every variable to uncover and address potential failure points before they disrupt customer experiences and damage your brand.
  • Avoid costly downtime
    Minimize your risk of system failure by proactively testing for weaknesses and addressing them before they become costly, public outages.
  • Reduce your MTTD and MTTR
    Get comfortable with a range of failure scenarios so that when the unlikely happens (and it will), you're prepared to respond quickly and efficiently.

Confidently run Chaos Engineering experiments

Confidently test system reliability by thoughtfully injecting failure into services, hosts, or containers with a Gremlin attack. Using the attack library, see how systems respond to a variety of common failure conditions. Scale up the blast radius of the attack once you're confident in system stability, and easily halt attacks should issues arise.

Teams that run Chaos Engineering experiments frequently (weekly or monthly) have greater than 99.9% availability. Keep your availability high and your incident count low by setting attacks to run on an automated schedule.
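
To put the 99.9% figure in concrete terms, the short Python sketch below converts an availability percentage into the downtime it allows over a period. The function name and the 30-day default are illustrative assumptions, not part of Gremlin.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Return the minutes of downtime permitted at a given availability.

    Hypothetical helper for illustration only; the 30-day default period
    is an assumption, not a figure from this guide.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The unavailable fraction of the period is the downtime budget.
    return total_minutes * (1.0 - availability_pct / 100.0)
```

At 99.9% availability this works out to roughly 43.2 minutes of downtime in a 30-day period, or about 8.76 hours in a 365-day year.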

Validate resilience to common failures

Scenarios let you run multiple attacks in sequence and create more complex Chaos Engineering experiments. Create your own, or use Gremlin's pre-configured library of Recommended Scenarios to simulate real-world outages that can impact performance, uptime, and customer experience. Share scenarios across teams to create a stronger culture of reliability.
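
As a rough illustration of running attacks in sequence, the Python sketch below models a scenario as an ordered list of steps with a growing blast radius and stops at the first step the system does not tolerate. The `Step` type, `run_scenario`, and the `run_step` callback are hypothetical names used for this example; they are not Gremlin's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Step:
    """One attack in a scenario (hypothetical model, not a Gremlin type)."""
    attack: str            # plain label, e.g. "latency" or "blackhole"
    blast_radius_pct: int  # share of targets affected by this step


def run_scenario(steps: Sequence[Step], run_step: Callable[[Step], bool]) -> List[Step]:
    """Run steps in order and stop at the first one that fails.

    `run_step` is an assumed callback that executes a step and returns
    True when the system stayed healthy. Returns the steps that passed.
    """
    completed: List[Step] = []
    for step in steps:
        if not run_step(step):
            break  # do not widen the blast radius after a failure
        completed.append(step)
    return completed
```

Ordering the steps from the smallest blast radius to the largest means a weakness is found while the fewest targets are affected.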

Automatic services detection and tracking

Gremlin auto-detects all services in an environment, giving you complete systems visibility and helping you uncover any unknowns. Isolate, target, and attack distributed services no matter where they're running. Track your reliability practice with a full history of all attacks run on a service, and quickly identify and prioritize services that need attention.

Prevent unintended failures

Prevent experiments from running when systems are unstable. Status Checks automatically halt and roll back experiments if systems don't meet expected criteria. Integrate with your preferred monitoring and observability tool to validate conditions and trigger rollbacks if any issues arise.
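
The gating behavior described above can be sketched in a few lines of Python. This is a minimal model under stated assumptions: `run_with_status_checks`, the `checks`, `start`, and `halt` callbacks, and the `polls` count are all hypothetical, and a real integration would query a monitoring tool and wait between polls.

```python
from typing import Callable, Sequence


def run_with_status_checks(
    checks: Sequence[Callable[[], bool]],
    start: Callable[[], None],
    halt: Callable[[], None],
    polls: int = 3,
) -> str:
    """Gate an experiment on health checks (illustrative sketch only).

    Returns "blocked" if a check fails before the experiment starts,
    "halted" if a check fails while it runs, and "completed" otherwise.
    """
    def healthy() -> bool:
        return all(check() for check in checks)

    if not healthy():
        return "blocked"  # system already unstable: never start
    start()
    for _ in range(polls):
        if not healthy():
            halt()  # assumed to stop the experiment and roll it back
            return "halted"
    return "completed"
```

Checking health both before starting and during the run covers the two cases in the text: experiments do not start on an unstable system, and a running experiment is stopped as soon as a criterion is no longer met.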

An integral part of your testing framework

Use Gremlin's APIs and webhooks to notify monitoring, incident management, or other DevOps tools of new or ongoing experiments. Automate Chaos Engineering practices by integrating experiments into your CI/CD pipeline.

Supported platforms

Gremlin is a cloud-native platform that runs in any environment. Gremlin supports all public cloud environments - AWS, Azure, and GCP - and runs on Linux, Windows, containerized environments like Kubernetes, and yes, bare metal too.

Gremlin attack library

Rely on Gremlin's comprehensive attack library to build resilience to common failure conditions.
  • CPU
    Generate load across CPU cores. Ensure your systems can withstand stressful conditions caused by high demand or heavy traffic.
  • Memory
    Consume a specific amount of RAM. Validate resilience to memory leaks or resource-intensive applications.
  • IO
    Create read/write pressure on I/O devices such as hard disks. Test the performance of your systems when connected to high-latency, low-throughput storage.
  • Disk
    Consume a specific amount of space on a storage device. Ensure your systems keep running even with disk-hungry workloads (such as growing log files) and exhausted storage volumes.
  • Shutdown
    Shut down (and optionally reboot) the host operating system. Build resilience to host failures.
  • Time Travel
    Change the system time. Prepare for Daylight Saving Time, clock drift between systems, expiring SSL/TLS certificates, and other time-sensitive events.
  • Process Killer
    Terminate a specific process or set of processes. Prepare for application crashes, out-of-memory (OOM) events, and similar failures.
  • Blackhole
    Drop network traffic based on port, network interface, or hostname. Test your ability to fail over during a complete network outage.
  • Latency
    Inject a delay into outbound network traffic. Validate your system's responsiveness under slow network conditions.
  • Packet Loss
    Drop or corrupt a percentage of outbound network traffic. Ensure you can successfully send and receive data despite poor network conditions.
  • DNS
    Block access to DNS servers. Prepare for DNS outages, test fallback DNS servers, and validate DNS resolver configurations.

Enterprise-grade security and compliance

Gremlin is SOC 2 compliant and follows industry-standard security practices.
  • Least Permissions
    Gremlin runs on default Linux permissions and doesn't require root access
  • Secure user management
    Multi-factor authentication, Secure Single Sign On, and Role-Based Access Control (RBAC)
  • Audit trails
    Every action on the platform is tracked for compliance
  • 3rd party testing
    Gremlin regularly undergoes security audits by a 3rd party

Trusted by teams worldwide

Leading SRE teams rely on Gremlin to keep their systems available and their customer experience reliable.
Charter Communications, Dreamworks Animation, Expedia, Grubhub, H-E-B, MailChimp, NAB, Qualtrics, SAS, Shipt, Target, Twilio, Walmart, Workiva

© 2021 Gremlin Inc.
All rights reserved.