When a chaos experiment shows you a weird new way to fail, fix the failure and revise the experiment—or devise a new one—to test the fix.
Unless your application was born yesterday, it’s probably resilient to at least some kinds of failure: when it’s missing crucial data, it makes an educated guess; when it’s running at capacity, it scales out automatically; when any node goes bad, it bows out of the service pool. Call these safeguards resiliency mechanisms.
Obviously you’re better off with resiliency mechanisms than without, but they can fail too. How? That’s what a good Chaos Engineer is unendingly curious about. That’s what drives the design of her chaos experiments. She doesn’t kill the database just to see what happens; she kills it to make sure its replicas are healthy and failover is automatic. But she doesn’t rest easy after finding one or two scenarios where failover definitely works. She designs a more clever chaos experiment, looking for new ways for failover to fail. Over time, she improves her resiliency mechanisms, but they’re never perfect—that’s why she keeps practicing Chaos Engineering.
At Gremlin, we practice what we preach. And like you, we keep finding new ways to fail. This post looks at one of our recent failures to show how we iterate on chaos.
One day our API service started spitting out a ton of errors. It couldn’t resolve the address of our DynamoDB service. “DNS is down!” we thought. Then we noticed the errors were coming from just one node. “For this one node, DNS is down!” we thought. That wasn’t right either, we would soon discover. We took the node out of service, ending the service degradation, but kept it around for investigation.
In short order, we found the culprit: one of our unattended chaos experiments. It reboots random production servers three times every weekday during working hours. On this day, one server had come back in a state we had never seen before.
Our API service runs in containers. On the problem node, docker ps -a showed that the service wasn’t running, but ps -ef showed otherwise:
root 1813 1 0 18:19 ? 00:00:00 docker-containerd-shim 955282e0cb351149c03d1b4cac15731c1ca05af31936a0506e5fb3bb494ac14f /var/run/docker/libcontainerd/955282e0cb351149c03d1b4cac15731c1ca05af31936a0506e5fb3bb494ac14f docker-runc
root 1855 1813 2 18:19 ? 00:05:55 java -jar ...
docker-containerd-shim seemed to have been orphaned during startup; normally its parent process is docker-containerd, not pid 1.
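The orphaned-shim symptom can be spotted mechanically. The following is a minimal sketch, not part of the original investigation: it assumes the standard `ps -ef` column layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD), and the helper names `parse_ps_ef` and `orphaned_shims` are hypothetical. The sample input reuses the two processes shown above, with a header row added for illustration.

```python
from typing import List, NamedTuple


class Proc(NamedTuple):
    uid: str
    pid: int
    ppid: int
    cmd: str


def parse_ps_ef(output: str) -> List[Proc]:
    """Parse `ps -ef` output (UID PID PPID C STIME TTY TIME CMD) into records."""
    procs = []
    for line in output.splitlines():
        # CMD may contain spaces, so split into at most 8 fields.
        fields = line.split(None, 7)
        if len(fields) < 8 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # skip the header row, blank lines, and anything malformed
        procs.append(Proc(fields[0], int(fields[1]), int(fields[2]), fields[7]))
    return procs


def orphaned_shims(procs: List[Proc]) -> List[Proc]:
    """Return docker-containerd-shim processes whose parent is pid 1."""
    return [
        p for p in procs
        if p.ppid == 1 and p.cmd.split()[0].endswith("docker-containerd-shim")
    ]


# Sample input: the two processes from the incident, plus an illustrative header.
CONTAINER_ID = "955282e0cb351149c03d1b4cac15731c1ca05af31936a0506e5fb3bb494ac14f"
SAMPLE = (
    "UID PID PPID C STIME TTY TIME CMD\n"
    f"root 1813 1 0 18:19 ? 00:00:00 docker-containerd-shim {CONTAINER_ID} "
    f"/var/run/docker/libcontainerd/{CONTAINER_ID} docker-runc\n"
    "root 1855 1813 2 18:19 ? 00:05:55 java -jar ...\n"
)

suspects = orphaned_shims(parse_ps_ef(SAMPLE))
```

Here `suspects` contains only the shim with pid 1813; the Java process is ignored because its parent is the shim rather than pid 1.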
So why couldn’t the service resolve DynamoDB’s address? Syslog held the clue:
systemd: Started Gremlin API service.
dockerd: time="2018-03-08T18:20:52.539350691Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
dockerd: time="2018-03-08T18:20:52.541300177Z" level=error msg="Create container failed with error: transport is closing"
kernel: [ 73.904566] docker0: port 1(veth7999d80) entered disabled state
Docker seemed to have incompletely set up the container’s virtual network interface, but started the API service anyway. This left the container in a weird limbo: the outside world could reach the API service—otherwise its health check would have removed the node from the service pool—but the service couldn’t initiate outbound connections. This wasn’t a DNS outage; it was a local network misconfiguration, confined to one container, broken in one direction.
We strongly believe in never failing the same way twice. The solution here wasn’t to dig into the Docker weirdness and prevent it from recurring, but to improve the resiliency mechanism our chaos experiment sought to test: the API health check.
Not all health checks are equally sophisticated. They tend to sit somewhere in this list:
1. Service responds to TCP/UDP probes
2. Service responds to HTTP GET / (with 200 OK)
3. Service responds to HTTP GET /custom-check (with custom content or status)
Before the incident, our health check sat at level 2. Since our problem node was able to receive and respond to the load balancer’s simple GET (the health check endpoint was a no-op—an HTTP ping), it remained in the service pool—despite the fact it couldn’t reach the database. And our API service can’t function without the database.
So it was time to level up our health check. Since a custom check can do anything you want, the trick is writing one that doesn’t gum up the machinery with new problems. Restraint is key. Clearly we needed the health check to monitor database connectivity, but we couldn’t let it do the probing itself: the load balancer hits every node’s health check often, so active probes could overwhelm the database. Instead, each API node keeps an in-memory count of the database connections that fail while it serves real traffic (failures which no longer exhaust our API servers’ thread pools, thanks to one of our Gamedays) and considers itself healthy as long as the count doesn’t spike. We call it a Dead Man’s switch. (Fun fact: we also put a Dead Man’s switch in the Gremlin daemon. When any daemon loses its connection to our backend, it immediately halts any attacks in progress.)
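A minimal sketch of such a passive health check follows. It is an illustration under assumptions, not the production implementation: the class name `DeadMansSwitch`, the sliding time window, the default threshold of 5 failures in 30 seconds, and the 200/503 mapping are all hypothetical choices standing in for "the count doesn’t spike".

```python
import threading
import time
from collections import deque
from typing import Callable, Deque


class DeadMansSwitch:
    """Counts recent database connection failures observed by real requests."""

    def __init__(
        self,
        max_failures: int = 5,          # assumed threshold for a "spike"
        window_seconds: float = 30.0,   # assumed look-back window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._clock = clock
        self._failures: Deque[float] = deque()
        self._lock = threading.Lock()

    def record_failure(self) -> None:
        """Call from the data-access layer whenever a connection attempt fails."""
        with self._lock:
            self._failures.append(self._clock())

    def healthy(self) -> bool:
        """No I/O here: drop expired failures, then compare against the threshold."""
        with self._lock:
            cutoff = self._clock() - self.window_seconds
            while self._failures and self._failures[0] <= cutoff:
                self._failures.popleft()
            return len(self._failures) < self.max_failures


def health_status(switch: DeadMansSwitch) -> int:
    """HTTP status a custom health check endpoint could return."""
    return 200 if switch.healthy() else 503
```

Because `healthy()` only inspects an in-memory deque, the load balancer can poll it as often as it likes without generating any database traffic. The injectable `clock` exists so the window can be exercised in tests without sleeping.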
Just as we cover new application code with new tests, we cover new resiliency mechanisms with new chaos experiments. With the health check upgraded, we turned to reconsider our daily Server Reboot experiment.
Server Reboot experiments usually set out to test 1) that remaining servers can handle extra load with some servers down, and 2) that downed servers gracefully fail out of their service pools, return healthy, and rejoin the pools only if healthy. Our experiment found a new case where unhealthy nodes rejoin the pool, so the experiment was a success. But it’s not the only—and certainly not the best—experiment for testing our Dead Man’s switch.
For starters, it’s heavy-handed. We don’t need to reboot servers to simulate a failed DNS lookup. Plus, reboots cannot reliably simulate a failed lookup. Whatever bug bit us (we found a few issues on GitHub) seems to manifest only in a perfect storm, and we haven’t seen another half-baked container since the incident. So what we want is an experiment that both narrows the blast radius and reliably simulates the failure we want to test.
For a narrower blast radius, we could kill just the Docker daemon. But that may not simulate the failure any more reliably than killing the whole server. Better to blackhole outbound traffic to our DNS servers. That experiment is guaranteed to test our new health check every time—to always trigger the Dead Man’s switch.
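One way to approximate a DNS blackhole by hand is with plain iptables rules that drop outbound packets to port 53. The sketch below only builds the command lines and does not run them. It is an assumption about how such an experiment could be set up, not a description of how this one was run; the function name `dns_blackhole_rules` and the resolver address `10.0.0.2` are hypothetical.

```python
from typing import List


def dns_blackhole_rules(dns_servers: List[str], action: str = "-I") -> List[List[str]]:
    """Build iptables commands that drop outbound DNS traffic to the given servers.

    action="-I" inserts the rules; action="-D" with the same arguments deletes
    them again, which ends the experiment.
    """
    rules = []
    for server in dns_servers:
        for proto in ("udp", "tcp"):  # DNS uses both UDP and TCP on port 53
            rules.append([
                "iptables", action, "OUTPUT",
                "-p", proto, "-d", server, "--dport", "53",
                "-j", "DROP",
            ])
    return rules


# Hypothetical resolver address, for illustration only.
start = dns_blackhole_rules(["10.0.0.2"])
stop = dns_blackhole_rules(["10.0.0.2"], action="-D")
```

Each inner list could be passed to `subprocess.run` by a root-privileged process on the target host. Keeping the teardown symmetric with the setup makes it easy to halt the experiment and restore normal traffic.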
Should we retire our Server Reboot experiment, then? No. With its wider blast radius, it serves to test many resiliency mechanisms, not just the API health check. And though it did prove its worth by exposing a weakness in the health check, it also stressed our need for a brand new resiliency mechanism.
Now that we can handle a few nodes losing database connectivity, what if all nodes do? We had considered that before, but waved it aside. “If the database goes down, a smarter health check won’t save us,” we reasoned. True enough, but what a lapse of imagination! Obviously a database crash isn’t the only way for clients to lose connectivity. This incident reminded us of that. We shouldn’t have presumed that by imagining one failure, we had covered them all. That’s not the way of Chaos Engineering.
We need a new resiliency mechanism: a local cache of the database. While most caches are performance enhancers, ours will be nothing more than a short-term repository for crucial data. Why short-term? Because we have a business requirement to make our API’s state eventually consistent and as current as possible.
Once we’ve implemented the cache, we’ll need to make the API health check aware of it, and in turn, retune our chaos experiment to answer these two questions (in the affirmative):
1. Do health checks respond HEALTHY when the database is unavailable but the cache is available?
2. Do health checks respond UNHEALTHY when both the database and cache are unavailable?
This may suffice for the first iteration of our new resiliency mechanism, but again, the cache data is only good for a time. Once we decide how old is too old for this data, we’ll need to split the first question above into two:
1a. Do health checks respond HEALTHY when the database is unavailable but the cache is available and has fresh data?
1b. Do health checks respond UNHEALTHY when the database is unavailable and the cache is available but has stale data?
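The questions above amount to a small decision table, which can be sketched as a pure function. This is a minimal illustration, assuming the health check can see whether the database is reachable and how old the cached data is; the function name `is_healthy`, its parameters, and the 60-second freshness limit in the usage lines are hypothetical, since the source has not yet decided how old is too old.

```python
from typing import Optional


def is_healthy(
    db_available: bool,
    cache_age_seconds: Optional[float],  # None means the cache is unavailable
    max_cache_age_seconds: float,        # assumed freshness limit
) -> bool:
    """Decide health from database reachability and cache freshness."""
    if db_available:
        return True
    if cache_age_seconds is None:
        return False  # question 2: database and cache both unavailable
    # Questions 1a and 1b: the cache only helps while its data is fresh.
    return cache_age_seconds <= max_cache_age_seconds


# Illustrative checks with a hypothetical 60-second freshness limit.
q1a = is_healthy(db_available=False, cache_age_seconds=10.0, max_cache_age_seconds=60.0)
q1b = is_healthy(db_available=False, cache_age_seconds=300.0, max_cache_age_seconds=60.0)
q2 = is_healthy(db_available=False, cache_age_seconds=None, max_cache_age_seconds=60.0)
```

A chaos experiment that blackholes the database, and then the cache, can assert these same three outcomes against the live health check endpoint.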
As long as the CAP Theorem keeps us busy choosing the best way to degrade our service when chaos strikes, we'll be iterating on this resiliency mechanism—and the chaos experiments that test it—for a while.
Without our daily chaos experiment, would this weird failure have bitten us eventually? Probably. And while it would have been nice to delay it for a day, a week, or a month, we’re glad we dealt with it during waking hours—and that it reminded us to implement a cache before the day our database inevitably goes down.
Natural chaos will always find you. In synthesizing your own chaos, you preempt the disruptive cycle of natural chaos with a safer cycle:
Step 0: Implement some resiliency mechanism. (Don’t inject chaos where you know it’s going to wreak havoc.)
Step 1: Devise and run experiments to confirm the resiliency mechanism works (or doesn’t).
Step 2: When it doesn’t work, improve it.
Step 3: Retune the experiments to fully test the improved resiliency mechanism—reliably, and with the smallest possible blast radius.
Step 4: Implement new resiliency mechanisms if the experiments revealed a need for them.
Step 5: For any new resiliency mechanisms, return to Step 1.
If this looks similar to software testing patterns, that’s because it is similar. You wouldn’t write software without an iterative testing cycle. Would you design production systems without one?