How to test the reliability of a Point of Sale (POS) system

Point of Sale (POS) systems are the backbone of any retail store. A single outage can cost a retailer thousands of dollars per minute in lost sales, and more if it happens during peak hours. If the outage drags on, the damage grows as customers abandon carts and turn to competitors.

In an industry where customer loyalty is worth its weight in gold, that brand damage can end up even more costly than the initial lost sales.

To combat this, many businesses have embraced microservices to build redundant, self-healing, and resilient systems that can easily include 400+ different services for a single checkout transaction. These systems bring a lot of benefits, but they also sharply increase complexity, which creates more potential points of failure.

With stakes this high, companies need to use reliability testing to understand how their complex systems fail so they can minimize downtime and prevent outages.

Gremlin is used by some of the leading retailers in the world across industries, including beauty, apparel, and more. These testing best practices have helped them build reliable, resilient POS systems that customers can count on.

1. Test autoscaling and capacity

Autoscaling is essential in retail POS systems to handle the inconsistent surges of traffic and purchases. But it can also be derailed by small errors like misconfigurations or incompatible timeouts. Unfortunately, you’re more likely to find out autoscaling doesn’t work during a huge surge, which is precisely the worst time for a system to fall over.

Prevent this by regularly simulating resource use surges to make sure autoscaling scales up (and down!) correctly.

You should also run these tests when opening new lane capacity or new stores in the same region. Your current infrastructure might be correctly allocated for a single store or your current load, but what happens when you drastically increase the demands?

By running resource tests, you can make sure your POS system can handle the expanded loads.

Test these resources:

CPU - Consume CPU in multiple stages (e.g. 50%, 75%, 90%) to simulate limited CPU.
Memory - Consume memory in multiple stages (e.g. 50%, 75%, 90%) to simulate running into memory limits.
Disk I/O - Make sure your disk has enough I/O capacity by running a high number of read and write operations.
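One way to make a staged CPU test checkable is to work out the replica count you expect at each stage before you run it. The sketch below assumes the POS services scale with a Kubernetes Horizontal Pod Autoscaler on CPU utilization and applies the scaling formula from the Kubernetes documentation. The 60% target, the starting replica count, and the min/max bounds are illustrative assumptions, not recommendations.

```python
import math


def desired_replicas(current_replicas, current_util, target_util,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Expected replica count per the documented HPA formula:
    ceil(currentReplicas * currentMetric / targetMetric).

    All default values here are assumptions for illustration.
    """
    ratio = current_util / target_util
    # The autoscaler skips scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * current_util / target_util)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# Expected replicas for each test stage, starting from 4 replicas
# with a hypothetical 60% CPU utilization target.
for stage in (50, 75, 90):
    print(f"{stage}% CPU -> {desired_replicas(4, stage, 60)} replicas")
```

If the replica count observed during the test differs from the expected value, or never drops back after the load stops, that points to a configuration problem worth investigating.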

2. Verify resilience to outages and failures

Infrastructure failures happen all the time, and can be caused by anything from a network switch breaking to faulty code to a backhoe slicing a fiber cable. But when that outage takes down your retail POS system, your customer doesn’t care why. All they know is that their transaction can’t go through.

Gremlin simulates these outages by cutting off network traffic between specific targets and your services. This allows you to verify that your service responds correctly when a resource is entirely unavailable, such as by failing over to redundant resources. It also gives you a chance to verify how the system as a whole reacts as traffic is rerouted. For example, you might suddenly have two regions’ worth of traffic going through a single region. Can your POS handle twice as much traffic?
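The "twice as much traffic" question can be estimated before running the test. The sketch below is a back-of-the-envelope calculation, assuming traffic is spread evenly across regions and that surviving regions absorb the failed regions' share in equal parts. The function names and the 80% utilization ceiling are hypothetical.

```python
def utilization_after_failover(regions, failed, utilization):
    """Per-region utilization once `failed` regions drop out.

    Assumes load was evenly balanced before the failure and is
    redistributed evenly across the surviving regions.
    """
    if not 0 <= failed < regions:
        raise ValueError("failed must be >= 0 and less than regions")
    return utilization * regions / (regions - failed)


def survives_failover(regions, failed, utilization, ceiling=0.8):
    """True if surviving regions stay at or below the assumed ceiling."""
    return utilization_after_failover(regions, failed, utilization) <= ceiling


# Two regions at 45% each: losing one pushes the survivor to about 90%.
print(utilization_after_failover(2, 1, 0.45))
print(survives_failover(2, 1, 0.45))
```

The estimate only tells you what to expect. The failover test itself is what confirms whether the remaining region actually copes.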

Testing resource outages and failures keeps you from being blindsided when they happen for real, and gives your teams confidence that the system can weather them.

Test redundancy and failure on these targets:

Hosts - Shut down hosts and containers to simulate failure.
Zones - Make zones unavailable to see how your system reacts to zonal failure.
DNS - Don’t forget DNS! Verify how your system responds to unavailable DNS servers.

3. Map dependencies and test their failure

Retail POS systems have a web of internal and external dependencies. Some of these are less critical, while others are essential for the service to function. For example, if the checkout is supposed to still go through when the retail loyalty program is unavailable, then it’s a non-critical dependency. On the other hand, payment processing is usually a critical dependency, and the retail POS system should stop if it becomes unavailable.
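The difference between the two kinds of dependency can be sketched in code. In this minimal example, the payment call is treated as critical (its failure stops the checkout) and the loyalty call as non-critical (its failure is swallowed and the sale proceeds). The function names, the exception type, and the return shape are all hypothetical.

```python
class DependencyUnavailable(Exception):
    """Raised by a client when its backing service cannot be reached."""


def checkout(cart_total, charge_payment, award_points):
    # Critical dependency: if payment fails, let the error propagate
    # so the transaction stops instead of completing unpaid.
    receipt = charge_payment(cart_total)

    # Non-critical dependency: the sale still goes through when the
    # loyalty program is down; points can be reconciled later.
    try:
        points = award_points(cart_total)
    except DependencyUnavailable:
        points = None

    return {"receipt": receipt, "loyalty_points": points}


def loyalty_down(total):
    raise DependencyUnavailable("loyalty service unreachable")


# Checkout still succeeds with the loyalty service unavailable.
print(checkout(25.0, lambda total: "receipt-001", loyalty_down))
```

A dependency failure test checks that the running system really behaves this way, rather than trusting that the code path was written as intended.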

Gremlin’s Dependency Discovery automatically detects all relevant dependencies using a service’s network traffic and DNS requests, making it easy to see and test all of your service’s dependencies.

Since these dependencies are often owned by different teams or even different companies, teams need to make sure their services compensate if those dependencies are unavailable. Even if all of the dependencies are available, there can still be issues caused by how they all interact, such as a timeout cascading through multiple dependencies and services.
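Cascading timeouts often come from a caller giving up before the service it called has finished retrying its own downstream call. The sketch below flags that pattern in a call chain. It assumes each service's worst-case response time is roughly its downstream timeout multiplied by its number of attempts. The `Hop` type, the service names, and the values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    name: str
    timeout_s: float   # how long this service waits on its downstream call
    attempts: int = 1  # total tries, including the first


def find_timeout_inversions(chain):
    """Return (caller, callee, callee_worst_case_s) for each adjacent pair
    where the caller times out before the callee can finish retrying.

    `chain` is ordered from the outermost caller inward.
    """
    problems = []
    for caller, callee in zip(chain, chain[1:]):
        worst_case = callee.timeout_s * callee.attempts
        if caller.timeout_s <= worst_case:
            problems.append((caller.name, callee.name, worst_case))
    return problems


chain = [
    Hop("pos-terminal", timeout_s=5.0),
    Hop("checkout", timeout_s=2.0, attempts=3),
    Hop("payment", timeout_s=1.5),
]
# The terminal gives up at 5s, but checkout may spend up to 6s retrying.
print(find_timeout_inversions(chain))
```

A latency test on the innermost dependency is the practical way to confirm whether a flagged pair actually causes failed transactions.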

Run these tests on your dependencies:

Failure - Drop all network traffic to a specific dependency to test failure.
Latency - Increase the network latency to check timeouts and failover behavior.
Certificate Expiry - Check your dependencies for any upcoming certificate expiries.
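For the certificate check, a rough do-it-yourself version is possible with Python's standard `ssl` module. This is a minimal sketch and not how Gremlin performs the check. The helper names are hypothetical, and the host you pass in would be one of your own dependencies.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the notAfter field of the certificate served by host:port.

    With the default context, the handshake itself fails if the
    certificate has already expired or is otherwise invalid.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Example with a fixed timestamp, 30 days before the expiry date.
print(days_until_expiry("Jan  5 09:34:43 2018 GMT",
                        now=1515144883 - 30 * 86400))
```

Running `days_until_expiry(fetch_not_after(host))` on a schedule for each dependency, and alerting below a threshold such as 30 days, would be one way to apply this.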

4. Detect Kubernetes and configuration issues

Many modern retail POS systems use Kubernetes and cloud deployments, and one of the most common causes of incidents and outages in Kubernetes is misconfiguration. Even if your Kubernetes cluster was correctly configured once, an update or restart might reset the configuration to defaults, including vital settings like timeouts, container restart parameters, and more. The same goes for status problems, like inconsistent application versions or stuck states like CrashLoopBackOff.

Fortunately, most of these risks can be surfaced with Gremlin’s Detected Risks. Detected Risks automatically uncovers the most common Kubernetes and AWS risks so you can catch them before they cause a POS outage.

Check for risks in these areas:

CPU Requests
Memory Requests
Liveness Probes
AZ Redundancy
Memory Limits
Application Version Uniformity
Status check: CrashLoopBackOff
Status check: ImagePullBackOff
Init Container Errors
Unschedulable Pods
Horizontal Pod Autoscaler errors
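To illustrate what a few of these checks look for, the sketch below inspects a pod manifest that has already been loaded into a Python dictionary (for example, from `kubectl get pod -o json`). It covers only a handful of the items above and is not how Detected Risks is implemented. The function name and message wording are hypothetical, while the field names follow the Kubernetes Pod schema.

```python
from typing import List

STUCK_REASONS = {"CrashLoopBackOff", "ImagePullBackOff"}


def find_pod_risks(pod: dict) -> List[str]:
    """Return human-readable risk messages for a single pod manifest."""
    risks = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        if "cpu" not in requests:
            risks.append(f"{name}: no CPU request")
        if "memory" not in requests:
            risks.append(f"{name}: no memory request")
        if "memory" not in limits:
            risks.append(f"{name}: no memory limit")
        if "livenessProbe" not in container:
            risks.append(f"{name}: no liveness probe")
    # Stuck states are reported in the pod status, not the spec.
    for status in pod.get("status", {}).get("containerStatuses", []):
        reason = status.get("state", {}).get("waiting", {}).get("reason")
        if reason in STUCK_REASONS:
            risks.append(f"{status.get('name', '<unnamed>')}: stuck in {reason}")
    return risks


example_pod = {
    "spec": {"containers": [{
        "name": "checkout",
        "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}},
    }]},
    "status": {"containerStatuses": [{
        "name": "checkout",
        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
    }]},
}
for risk in find_pod_risks(example_pod):
    print(risk)
```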

Retail POS reliability requires regular testing and processes

Reliability isn’t a one-and-done task. Your systems are constantly changing. Even if you’re not deploying, your dependencies, network topology, and more are shifting all the time. The only way to catch fluctuations in reliability is through consistent, regular testing.

Also, make sure to integrate that testing into your established processes. The test results should enable conversations between your teams so they can plan the work to address the risks on their schedule—instead of when unaddressed risks cause an outage.

Gremlin makes it easy to scale reliability testing across your organization to keep your POS systems reliable and available when your customers need them.
