April 2, 2018

Making Your APIs More Resilient with Gremlin

Here's the thing: when a company measures its critical services, APIs are often treated as second-class citizens. But APIs are a core part of an organization's infrastructure, and not understanding their weaknesses can lead to performance issues and downtime.

An API brokers data between internal services, and there's always the risk that a failure will degrade the user experience or cause an outage. As adoption of your API scales, the increased read/write load can even amount to an unintentional denial-of-service attack on your own infrastructure.

Here at Gremlin, we aim to help engineers build more resilient infrastructure. We believe focusing on API-related failure injection is critical to ensure your API never disrupts your user experience or causes a high severity incident. One of the ways to help accomplish this is to run an API GameDay with controlled chaos experiments.

* If you are unfamiliar with GameDays, they are like fire drills: you practice a potentially dangerous scenario in a safe environment to proactively identify weaknesses. To learn more, read our Introduction to GameDays and our guide on How to Run a GameDay.

Example API GameDay Infrastructure: The MyStatus App

Let’s say we were running experiments on a status update sharing application called “MyStatus”. The MyStatus infrastructure is composed of an API Gateway (e.g. the open source Kong), Memcached for caching, and MySQL for the database. This is demonstrated in the diagram below:

[Diagram: MyStatus infrastructure with an API Gateway (Kong), Memcached, and MySQL]

The best-case scenario is that when your instances are impacted by a chaos experiment, they either handle the stress or are automatically removed from your fleet. After removal, they would be automatically replaced with fresh hosts; provisioning a fresh host is safer than rebooting a degraded one.
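
The removal decision can be as simple as a consecutive-failure counter. Here is a minimal sketch, assuming a generic health-check command; the threshold and the check itself are placeholders for your own tooling:

```shell
# Decide whether a host should be pulled from the fleet: only declare it
# dead after max_fails consecutive failed health checks, so one flaky
# probe doesn't trigger a replacement.
should_replace() {
  local max_fails="$1"; shift
  local fails=0
  while [ "$fails" -lt "$max_fails" ]; do
    if "$@"; then
      return 1      # a check passed, so keep the host in the fleet
    fi
    fails=$((fails + 1))
  done
  return 0          # max_fails consecutive failures: replace, don't reboot
}

# Example: probe a hypothetical health endpoint up to three times
# should_replace 3 curl -sf --max-time 2 http://localhost:8080/healthz
```

In practice an autoscaling group or orchestrator health check plays this role; the sketch only shows the shape of the decision.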

Install the Gremlin agent on your memcached instances.

Round 1: Small Blast Radius Chaos Engineering Experiment

  • Send a large number of read requests (e.g. 1000) and verify that our system performs as expected and does not drop below SLA.
curl -X GET "https://api.mystatus.com/v1/status?size=1000" -H "accept: application/json" -H "Authorization: Bearer $TOKEN"
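
One way to check the "does not drop below SLA" condition is to time the request itself. Here is a small sketch using curl's %{time_total} timer; the 0.5-second threshold is an illustrative assumption, not a real SLA:

```shell
# Return success when the request completes within the threshold,
# failure otherwise (thresholds and URL below are illustrative).
check_sla() {
  local url="$1" threshold="$2" t
  t=$(curl -s -o /dev/null -w '%{time_total}' --max-time 5 "$url") || return 2
  # awk handles the floating-point comparison
  awk -v t="$t" -v max="$threshold" 'BEGIN { exit !(t <= max) }'
}

# Example against the Round 1 request:
# check_sla "https://api.mystatus.com/v1/status?size=1000" 0.5 \
#   && echo "within SLA" || echo "SLA violated"
```

Run it before, during, and after the experiment so you have a baseline to compare against.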

Round 2: Medium Blast Radius Chaos Engineering Experiment

  • Send a large number of read requests (e.g. 1000) and verify that our system performs as expected and does not drop below SLA.
curl -X GET "https://api.mystatus.com/v1/status?size=1000" -H "accept: application/json" -H "Authorization: Bearer $TOKEN"
  • Run a memory attack using Gremlin at the same time we trigger the large number of API requests.
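
To make the two stresses actually overlap, the read load can be driven from a small wrapper that fires requests in parallel while the memory attack runs. A sketch, assuming an xargs that supports -P; the concurrency of 10 and the MyStatus endpoint are illustrative:

```shell
# Fire COUNT invocations of a request command, 10 at a time, so the API
# load runs concurrently with the Gremlin memory attack.
run_load() {
  local count="$1"; shift
  seq "$count" | xargs -P 10 -I{} "$@"
}

# Example against the hypothetical MyStatus endpoint:
# run_load 1000 curl -s -o /dev/null \
#   "https://api.mystatus.com/v1/status?size=1000" \
#   -H "accept: application/json" -H "Authorization: Bearer $TOKEN"
```

Sequential curl in a loop tends to understate the pressure on the cache; parallel requests are closer to real traffic.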

Round 3: Large Blast Radius Chaos Engineering Experiment

  • Send a large number of read requests (e.g. 1000) and verify that our system performs as expected and does not drop below SLA.
curl -X GET "https://api.mystatus.com/v1/status?size=1000" -H "accept: application/json" -H "Authorization: Bearer $TOKEN"
  • Send a large number of write requests (e.g. 1000) and verify that our system performs as expected and does not drop below SLA.
#!/bin/bash
COUNT=0
while [ $COUNT -lt 1000 ]; do
    curl -X POST "https://api.mystatus.com/v1/status" \
        -d '{"update":"Today is a great day for a GameDay"}' \
        -H "accept: application/json" \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $TOKEN"
    COUNT=$((COUNT + 1))
done
  • Kill a cache instance using Gremlin at the same time we trigger the large number of API requests.

Additional Chaos Engineering Experiments

Network gremlins also allow you to see the impact of lost or delayed traffic to your application. You can test how your service behaves when you can’t reach one of your external dependencies.
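
A concrete thing to verify during a network experiment is that calls to an external dependency time out quickly and degrade gracefully instead of hanging. A minimal sketch, assuming a cached copy on disk is an acceptable fallback; the URL, timeout, and cache path are placeholders:

```shell
# Call an external dependency with a hard timeout; if it is unreachable
# (e.g. blackholed by a network gremlin), serve the last cached response
# instead of failing the user request.
fetch_with_fallback() {
  local url="$1" cache_file="$2" body
  if body=$(curl -sf --max-time 2 "$url"); then
    printf '%s\n' "$body" | tee "$cache_file"   # success: refresh the cache
  else
    cat "$cache_file"                           # unreachable: serve stale data
  fi
}
```

Running a blackhole or latency attack against the dependency while this path is exercised tells you whether the timeout and fallback actually fire.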

Understanding how your system behaves if Memcached becomes overloaded will give you critical insight into your infrastructure. If Memcached crashes, how does that impact your SLA and database reliability? Does your database crash? Does it fail over?
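
Those questions usually come down to the cache-aside read path: does a miss, or a dead cache, fall through to the database cleanly? A sketch with stubbed-out clients; cache_get, db_get, and cache_set are hypothetical stand-ins for real Memcached and MySQL calls:

```shell
# Stub clients so the sketch runs standalone; swap in real Memcached and
# MySQL client calls (these function names are hypothetical).
cache_get() { return 1; }               # simulate a cache outage: every read misses
db_get()    { echo "status for $1"; }
cache_set() { :; }                      # no-op while the cache is down

# Cache-aside read: try the cache, fall back to the database, then
# repopulate the cache on a best-effort basis.
get_status() {
  local key="$1" val
  if val=$(cache_get "$key" 2>/dev/null); then
    printf '%s\n' "$val"
    return 0
  fi
  val=$(db_get "$key") || return 1      # the database is the source of truth
  cache_set "$key" "$val" 2>/dev/null || true
  printf '%s\n' "$val"
}
```

The failure mode to watch for is the cache outage turning every read into a database read, which is exactly the stampede a shutdown experiment on a cache instance will surface.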

Preparing For Failures In 3rd Party APIs

After you’ve tested and confirmed the resiliency of your own APIs, the next step is ensuring you are prepared for what happens when 3rd party APIs fail. An outage of a 3rd party API can still affect customer experience, so it’s essential to have a plan for these outages as well. There have been a number of API outages that caused some well-known websites and applications to go down:

  • The Facebook and Instagram API servers went down for an hour. The outage also impacted a number of well-known websites and applications, including Tinder and HipChat.
  • Amazon Web Services (AWS) experienced a disruption that caused an increase in faults for the EC2 Auto Scaling APIs.
  • Twitter experienced a major one-hour outage that impacted websites and applications using Twitter APIs.

API reliance is only going to increase, so testing how your systems handle API failures is a key step in ensuring their resilience. If you have an application that depends on external APIs to perform a critical function, you need a plan for dealing with disruptions. API virtualization, synthetic and real-user monitoring, asynchronous scripting, and caching are all common ways to mitigate failures. But how do you know the fallbacks you’ve put in place actually work in a real-world scenario if you’ve never tested them?
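
As a concrete example of a fallback worth exercising under a chaos experiment, here is a retry-with-exponential-backoff wrapper; the attempt count and base delay are illustrative assumptions:

```shell
# Retry a command up to a fixed number of attempts, doubling the delay
# between tries, so a brief 3rd party blip doesn't become a user-facing
# failure.
retry() {
  local attempts="$1" delay="$2"; shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
      delay=$((delay * 2))
    fi
    i=$((i + 1))
  done
  return 1
}

# Example: retry a flaky 3rd party call up to 4 times, starting at 1 second
# retry 4 1 curl -sf --max-time 5 "https://api.example.com/v1/resource"
```

Injecting latency or blackhole failures against the dependency while this wrapper runs is how you confirm the backoff behaves as intended rather than hammering an already-struggling API.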

In Conclusion

Running chaos experiments on a consistent basis is one of many things you can do to begin measuring the resiliency of your APIs. Making sure you have good visibility (monitoring) and increasing your fallback coverage will both help strengthen your systems. But don’t stop there: with the number of connected devices and application ecosystems growing rapidly, it is more important than ever to safeguard applications from internal and third-party API outages and errors.

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. Try Gremlin for free and see how you can harness chaos to build resilient systems.