
How to test for expired TLS/SSL certificates using Gremlin

Transport Layer Security (TLS) and its predecessor, Secure Sockets Layer (SSL), are essential components of the modern Internet. By encrypting network communications, TLS protects both users and organizations from exposing their in-transit data to third parties. This is especially true for the web, where TLS is used to secure HTTP traffic (HTTPS) between backend servers and customers’ browsers.

TLS is such a critical part of the modern web that browsers and search engines will penalize unencrypted websites. Unsecured pages are displayed with warnings and given reduced SEO rankings. This caused a surge in websites using TLS, growing HTTPS traffic on the desktop from just 45% of websites to 98%.

While TLS adoption has gotten easier through initiatives like Let’s Encrypt, it’s not without challenges. For one, a TLS certificate is only valid for a certain period of time (called the validity period). Security teams need to request new certificates and roll them out over existing ones before the old ones expire. If a certificate’s expiration date lapses, customers will see an alarming warning when trying to access your website or service.

Chrome invalid certificate warning

Additionally, organizations often have multiple certificates in rotation for different services. Security teams need to track which certificates are in use, where they’re deployed, and when they’re due for renewal, creating logistical overhead. The risk of a certificate expiring and bringing down a critical service increases as the size and complexity of the service grows. For example, in 2020, Spotify had a nearly hour-long outage when an expired certificate brought down one of their main endpoints.

To add to this challenge, certificate renewal is an infrequent maintenance task. Renewals only happen once every few months to every few years (depending on the certificate’s validity period). Some certificate providers support fully automated renewals, which makes it even less likely that teams will catch renewal problems until they happen. This creates several risks, such as:

  • Automated renewal notifications falling through the cracks or getting ignored.
  • Security team personnel changing and losing track of ownership over certificate rotations.
  • Expiration dates changing when certificates renew.

Since certificates are time-sensitive, and different certificates can expire at different times, we need a way to continuously check for expiring certificates across multiple services. But how do we test whether a certificate is expiring? Fortunately, we can use Chaos Engineering to help.

Using Chaos Engineering to detect expired certificates

With Chaos Engineering, we can simulate the conditions that would cause a certificate to expire. First, let’s look at how certificates are validated.

When a device connects to an encrypted website, it downloads the website’s certificate and checks the expiration date against its own internal system time. If the expiration date and time falls after the current date and time, then the certificate is valid. However, if we change the device’s time to after the expiration date, the device will think the certificate is expired. In other words, by "time travelling" into the future, we can accurately detect when a certificate is going to expire simply by connecting to a website.
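
The comparison described above can be sketched as plain integer arithmetic. This is only an illustration of the idea, not how any particular TLS client is implemented: `cert_status` is a hypothetical helper, and the timestamps are made-up example values.

```shell
#!/bin/bash
# Illustration only: models the client's validity check as integer arithmetic.
# cert_status NOT_AFTER NOW [OFFSET]
#   NOT_AFTER  certificate expiration as a Unix timestamp (seconds)
#   NOW        the client's real clock as a Unix timestamp
#   OFFSET     seconds to shift the clock forward (default 0, i.e. no time travel)
cert_status() {
  local not_after=$1 now=$2 offset=${3:-0}
  if (( now + offset <= not_after )); then
    echo "valid"
  else
    echo "expired"
  fi
}

# A certificate that expires 10 days after "now" (example timestamps):
now=1610000000
not_after=$(( now + 10 * 86400 ))
cert_status "$not_after" "$now"            # real clock: prints "valid"
cert_status "$not_after" "$now" 2678400    # clock shifted 31 days ahead: prints "expired"
```

Shifting the clock does not change the certificate. It only changes which side of the expiration date the client believes it is on.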

With Gremlin, we can use Chaos Engineering to test TLS security using a Time Travel experiment. Time Travel changes the system clock on a host, letting us shift seconds, minutes, hours, days, or even years into the future. We can move our systems forward, send a request out to a website, and if we receive an expiration error, we know how much time we have before our certificate expires.

The benefits of this approach are that:

  • It tests the entire certificate chain, including intermediate certificates, root certificates, and certificate authorities (CAs).
  • It detects certificates that we might have overlooked or forgotten about (do you know when your Gremlin certificates expire?).
  • It tests for other time-sensitive failure modes, like Daylight Saving Time compatibility and time synchronization errors.

To run this chaos experiment, we need two things: a website to test, and a separate host with Gremlin installed. We’ll use the second host as a stand-in for a user’s device. We also need a tool to access the website. For this experiment, we’ll use curl.

To demonstrate how curl responds to healthy and expired certificates, let’s run a curl request against a working website:

shell
curl -I https://gremlin-demo-lab-host/
HTTP/2 200
cache-control: max-age=604800
content-length: 543958
content-type: text/plain
date: Mon, 18 Jan 2021 23:06:18 GMT
...

Now let’s try sending a request to a website with an expired certificate:

shell
curl -I https://expired.badssl.com/
curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html

Now that we know what to look for, let’s design our experiment. Our hypothesis is that if we set our system clock forward by a chosen interval (e.g. one day or one month) and send a curl request, we’ll still see a successful response. If curl returns an error instead, then we know the certificate will expire within that interval.
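
When checking the hypothesis in a script, it can help to tell a certificate failure apart from other failed requests, such as a DNS or connection error. The sketch below relies on curl's exit status 60, the same code shown in the error output above. `classify_curl_exit` is a hypothetical helper name. Note that exit status 60 covers certificate verification problems in general (for example an untrusted CA), not only expiration.

```shell
#!/bin/bash
# classify_curl_exit CODE: interpret curl's exit status for this experiment.
classify_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    60) echo "certificate problem" ;;   # curl could not verify the server certificate
    *)  echo "other failure ($1)" ;;    # e.g. DNS, connection, or timeout errors
  esac
}

# Example use against the demo host from this article:
#   curl -sI https://gremlin-demo-lab-host/ > /dev/null
#   classify_curl_exit "$?"
```

Treating every non-zero exit status as an expired certificate would make a network hiccup look like a failed experiment, so separating the two keeps the result easier to interpret.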

We’ll log into the Gremlin web app, create a new attack, and select our test host, which is shown here as "gremlin-demo-lab-host":

Selecting targets in the Gremlin web app

Next, we’ll expand the State category and select Time Travel. We’ll keep the length of the experiment set to 60 seconds, block NTP (Network Time Protocol) communication so that our host doesn’t automatically update to the correct time, and set the offset to 2,678,400 seconds, which is 31 days, or about one month, from now.

Tip: To calculate the offset, use a date/time conversion tool such as the ones provided by timeanddate.com.
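
For offsets measured in whole days, shell arithmetic is enough. The helper below is a minimal sketch. `days_to_seconds` is a hypothetical name, and the commented-out `date -d` line assumes GNU date, which is not available on every platform.

```shell
#!/bin/bash
# days_to_seconds DAYS: convert a whole number of days to a Time Travel offset in seconds.
days_to_seconds() {
  echo $(( $1 * 86400 ))
}

days_to_seconds 1    # 86400   (one day)
days_to_seconds 7    # 604800  (one week)
days_to_seconds 31   # 2678400 (31 days)

# With GNU date, the offset to a specific calendar date could be computed as:
#   echo $(( $(date -d '2021-02-18' +%s) - $(date +%s) ))
```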

Time travel attack parameters

Now let’s run the attack. While the attack is running, let’s re-run curl:

shell
curl -I https://gremlin-demo-lab-host/
curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html

Curl returned an error, meaning that our certificate is going to expire within the next month. We’ll click the Halt button in the top-right corner of the Gremlin web app to halt the experiment, which automatically reverts the system clock to the correct time. We’ll record our observations in Gremlin, then work on replacing our certificate. Using Time Travel allowed us to catch this before it became a problem for our customers, while doing so in a safe and controlled way.

Scaling up and automating your certificate checks

For large teams, manually testing each and every certificate isn’t scalable. Imagine if we were managing dozens of certificates across different hosts and services. We also need a way to test more than one time window in a single run. Otherwise, each check only tells us whether a certificate expires within that one window, and we could miss several certificates expiring within a short time frame.

To address these problems, we’ll do two things: we’ll use a Scenario to gradually increase the magnitude (i.e. the time offset) of our Time Travel experiment, then automate our experiment using the Gremlin REST API.

Using Scenarios to gradually increase the magnitude of a Time Travel experiment

With a Scenario, we can run multiple Time Travel attacks back-to-back and increase the interval each time. This lets us test over multiple time periods during a single experiment.

To create a Scenario, we’ll click on our previously run Time Travel attack to open the Attack Details page. From here, we’ll click Create Scenario. We’ll call our Scenario "SSL/TLS certificate expiration" and enter a description.

Creating a Scenario from an attack

Entering Scenario details

Next, we’ll click “Add a recent attack”, re-select our previous Time Travel attack, and choose our test host. With the first attack’s offset set to 86,400 seconds (one day), we’ll change the offset for the second attack to 604,800 seconds (one week). We’ll repeat this step to create a third attack, then change its offset to 2,678,400 seconds (31 days, or about one month).

While the Scenario is running, we’ll run curl in a continuous loop. In the following script, curl makes a request, and if the request is successful, it waits 10 seconds before repeating it. If curl fails, it exits the loop and prints the failure to the console. This script also prints the current system time before each check, so we can see which stage of the Scenario was active when curl failed.

shell
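#!/bin/bash
# Poll the website until curl fails, printing the system time before each check.
while :; do
  date
  # -s hides curl's progress output; the response body is discarded.
  if ! curl -s https://gremlin-demo-lab-host/ > /dev/null; then
    break
  fi
  sleep 10
done
echo "Failed to connect."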

Now let’s run the Scenario and start our script. Once the Scenario hits stage 2, curl returns an error and the loop exits. This tells us our TLS certificate will expire between one day and one week from now.
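
The reasoning behind that conclusion can be written down as a small lookup. This is a sketch under the assumption that the three stages use offsets of one day, one week, and 31 days, in that order. `OFFSETS` and `expiry_window` are hypothetical names, not part of Gremlin.

```shell
#!/bin/bash
# Assumed offsets for the three Scenario stages, in seconds: one day, one week, 31 days.
OFFSETS=(86400 604800 2678400)

# expiry_window STAGE: STAGE is the 1-based Scenario stage at which curl first failed.
# Prints the range of clock offsets (in seconds) that must contain the expiration.
expiry_window() {
  local stage=$1 lower=0
  if (( stage > 1 )); then
    # The previous stage passed, so the certificate outlives that stage's offset.
    lower=${OFFSETS[stage - 2]}
  fi
  echo "expires between ${lower}s and ${OFFSETS[stage - 1]}s from now"
}

expiry_window 2   # prints "expires between 86400s and 604800s from now"
```

A failure at stage 2 means stage 1 succeeded, so the expiration falls after the first offset and no later than the second.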

Aborted Scenario results

If you have a Gremlin account, you can also start from a pre-configured Recommended Scenario. Click "Run Scenario" to open the Recommended Scenario in the Gremlin web app, click "Add targets and run" to select the hosts you want to run the attack on, then run the Scenario.

Using the Gremlin REST API to automate a Scenario

The Gremlin REST API provides a RESTful interface for performing actions in Gremlin, such as starting attacks and Scenarios. By using the REST API in our test script, we can automatically initiate our Scenario.

First, let’s reopen our executed Time Travel Scenario in the web app. Click Rerun, then in the bottom-right corner of the page, click Gremlin API Examples. This generates a full curl request that we can use to initiate the attack:

shell
curl -i -X POST 'https://api.gremlin.com/v1/scenarios/<your Scenario ID>/runs?teamId=<your team ID>' -H 'Content-Type: application/json;charset=utf-8' -H 'Authorization: Bearer <your bearer token>' -d '{}'

Next, we’ll copy this command to our curl script and add it just before the loop. We can add a second API command after the end of the loop to halt the experiment if curl fails. This lets us safely roll back after detecting an expired certificate without having to open the Gremlin web app and halt the experiment ourselves. Make sure to replace <your Scenario ID>, <your team ID>, and <your bearer token> with your own values:

shell
#!/bin/bash

# Start the Scenario and capture the API response, which identifies the run when halting
RUN=$(curl -X POST 'https://api.gremlin.com/v1/scenarios/<your Scenario ID>/runs?teamId=<your team ID>' -H 'Content-Type: application/json;charset=utf-8' -H 'Authorization: Bearer <your bearer token>' -d '{}')

# Test your website(s)
while :; do
  date
  if ! curl -s https://gremlin-demo-lab-host/ > /dev/null; then
    break
  fi
  sleep 10
done

# Halt the Scenario
curl -X POST 'https://api.gremlin.com/v1/scenarios/halt/<your Scenario ID>/runs/'"$RUN"'?teamId=<your team ID>' -H 'Content-Type: application/json;charset=utf-8' -H 'Authorization: Bearer <your bearer token>' -d '{}'
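
To avoid editing the placeholders in two places, the two request URLs can be built from variables. This is a sketch that only mirrors the URL shapes shown in the commands above. `GREMLIN_API`, `scenario_run_url`, and `scenario_halt_url` are hypothetical names, and the IDs passed in are placeholder values.

```shell
#!/bin/bash
# Base URL taken from the commands above; the variable name is an assumption.
GREMLIN_API="https://api.gremlin.com/v1"

# scenario_run_url SCENARIO_ID TEAM_ID: URL used to start a Scenario run.
scenario_run_url() {
  echo "${GREMLIN_API}/scenarios/$1/runs?teamId=$2"
}

# scenario_halt_url SCENARIO_ID RUN TEAM_ID: URL used to halt that run.
scenario_halt_url() {
  echo "${GREMLIN_API}/scenarios/halt/$1/runs/$2?teamId=$3"
}

# Example with placeholder values:
scenario_run_url "my-scenario-id" "my-team-id"
# prints https://api.gremlin.com/v1/scenarios/my-scenario-id/runs?teamId=my-team-id
```

The start and halt commands could then pass `"$(scenario_run_url ...)"` and `"$(scenario_halt_url ...)"` to curl in place of the hard-coded URLs.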

Now we have a fully scripted chaos experiment that we can schedule using a service like cron, add to our CI/CD pipeline, or run as part of our client-side testing suite.

Conclusion

Staying ahead of expiring certificates is vital for keeping your websites and services accessible and secure. Time Travel lets you quickly and safely test your certificates on any environment, whether your websites are hosted on AWS, GCP, Azure, or on-premises.

Categories
Compliance, Tools and Integrations
© 2021 Gremlin Inc. San Jose, CA 95113