An E-Commerce Example of Preparing for Black Friday.
Black Friday. For shoppers, it is a day of mad dashes amid the great herd of bargain hunters. The entire weekend brings heavy traffic to e-commerce sites. That is great for retailers, as long as their sites stay up and running. Outages mean lost sales. They may even push your customers into the arms of a competitor.
What’s the big deal? The process of shopping online seems simple: add stuff to your cart and check out. But behind the ease of 'one touch ordering' lies a surprising level of complexity that teams of engineers are responsible for maintaining.
Here is a description of some of that complexity:
- The customer must first find the item they want to buy. This requires using the site search function, which calls databases containing the list of products, product images, reviews, and prices. Here and in every step that follows, the database must be available to fulfill the request at the moment the customer intends to buy. If it is not, the revenue from that potential sale is effectively lost.
- The customer then selects and views a single item. This involves further database calls to determine the availability of the item, a full product description, whether the item has multiple features from which a selection must be made such as size or color, and to pull and display product reviews.
- The customer decides to add the item to their cart. Now they must indicate how many of the product they will buy (the default is typically one) and enter their address. The cart then offers multiple shipping options, frequently from multiple carriers. All of these require location-based internal and external database queries, because shipping costs differ based on locale, the total size and weight of the products being purchased (you didn’t think there was only one item in the cart, did you?), the shipping zone(s) involved, and the vendor being used. Note that this may require API calls to multiple external dependencies.
- Finally, the customer clicks “checkout”. This triggers another set of service calls and processes. The cost must be charged to and collected from a payment vendor, whether a credit card, a bank, or another service like PayPal. The backend must perform a live credit card authorization, checking for mistyped card numbers or invalid accounts. When payment is authorized, the transaction is allowed to occur. Stock is marked in the shipping database queue to be pulled and packed. Shipping is ordered and often labels and tracking numbers are created, ready to be applied to a box once the items are pulled from stock and packed.
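To make the chain of dependencies in the checkout step concrete, here is a minimal sketch, not a real checkout implementation. The step names, `DependencyError`, and `run_checkout` are hypothetical and stand in for calls to a payment vendor, a stock database, and a shipping service. The sketch only shows that the steps run in order and that one unavailable dependency stops the sale.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class DependencyError(Exception):
    """Raised by a (hypothetical) downstream service that is unavailable."""


@dataclass
class CheckoutResult:
    completed: List[str] = field(default_factory=list)
    failed_step: str = ""

    @property
    def ok(self) -> bool:
        return self.failed_step == ""


def run_checkout(steps: Dict[str, Callable[[], None]]) -> CheckoutResult:
    """Run checkout steps in insertion order; stop at the first failing dependency."""
    result = CheckoutResult()
    for name, call in steps.items():
        try:
            call()
        except DependencyError:
            result.failed_step = name
            break
        result.completed.append(name)
    return result


# Usage: stand-in callables instead of real service clients.
def service_up() -> None:
    pass


def payment_vendor_down() -> None:
    raise DependencyError("payment vendor unreachable")


outcome = run_checkout({
    "authorize_payment": payment_vendor_down,
    "reserve_stock": service_up,
    "create_shipping_label": service_up,
})
print(outcome.ok, outcome.failed_step)
```

In this run the payment step fails first, so stock is never reserved and no label is created. A real system would also need to undo any steps that had already completed.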
This is an incomplete, high-level description. We have yet to mention topics like input verification, such as address validation, or sanity checks that ask whether the customer actually meant to order 111 of something, or just 1, or maybe 11. It also doesn’t cover logging in to user accounts or the database calls that pull up customer records during the purchase process.
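The quantity sanity check mentioned above could be as small as the following sketch. The function name and the threshold of 10 are assumptions for illustration. A real store would likely derive the limit from the product type and order history.

```python
def needs_quantity_confirmation(quantity: int, typical_max: int = 10) -> bool:
    """Return True when the quantity is unusual enough to ask "did you mean this?".

    typical_max is an assumed, illustrative threshold, not an industry standard.
    """
    if quantity < 1:
        raise ValueError("quantity must be at least 1")
    return quantity > typical_max


for qty in (1, 11, 111):
    print(qty, needs_quantity_confirmation(qty))
```

With the assumed threshold, an order of 1 passes silently while 11 and 111 both trigger a confirmation prompt.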
The description doesn’t begin to cover taxes or sales or discount codes or applying free shipping or tiered product orders where you get a discount per item when purchasing in quantity. Nor do we even touch on order management, where you deal with things like order confirmation emails, changing shipping dates, customer returns, and updating inventory information.
No mention was made of consistency. For example, if a buyer puts something in their shopping cart while browsing on one tab, it had better appear in their shopping cart in another tab. We haven’t yet discussed conversion rate optimization with features like suggested selling, as in “Other customers who bought X also bought Y.”
Overwhelmed yet? This is the stuff that gives engineers nightmares. And applications and needs have not become any simpler as e-commerce has matured and grown.
Today, most of us in retail have broken up our monolithic programs into microservices, deployed those microservices as cloud services with complicated (but really useful!) load balancing and failover schemes, and required those microservices to communicate with one another using APIs. We deploy updated versions of those microservices frequently, often running just a few new instances among a sea of veteran instances and migrating over time while we test whether the new code works as expected.
A failure in any part of the system has the potential to cause massive cascading failure across the rest of the system. The inability to communicate with one of the many databases described could mean that customers do not have access to product descriptions, images, or even prices. A networking failure between the intended transaction and the credit card vendor could mean an inability to complete a sale. So many things can go wrong in so many ways.
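One common way to keep a single failed dependency from taking the whole page down is to treat some data as optional and fall back when it cannot be fetched. The sketch below assumes a hypothetical `build_product_page` function, stand-in fetch callables, and a made-up placeholder path. It is one possible shape for graceful degradation, not a description of any particular retailer's code.

```python
from typing import Callable, Dict, List

PLACEHOLDER_IMAGE = "/static/placeholder.png"  # hypothetical path


def build_product_page(
    product_id: str,
    fetch_details: Callable[[str], Dict[str, str]],
    fetch_images: Callable[[str], List[str]],
) -> Dict[str, object]:
    """Assemble a product page, degrading gracefully if images are unavailable."""
    # Details such as name and price are treated as essential here,
    # so any error from fetch_details is allowed to propagate.
    page: Dict[str, object] = dict(fetch_details(product_id))
    try:
        page["images"] = fetch_images(product_id)
        page["degraded"] = False
    except ConnectionError:
        # The image store is treated as optional: show a placeholder
        # rather than failing the whole page.
        page["images"] = [PLACEHOLDER_IMAGE]
        page["degraded"] = True
    return page


# Usage with stand-in dependencies.
def details_ok(product_id: str) -> Dict[str, str]:
    return {"id": product_id, "name": "Example mug", "price": "19.99"}


def images_down(product_id: str) -> List[str]:
    raise ConnectionError("image store unreachable")


page = build_product_page("sku-123", details_ok, images_down)
print(page["degraded"], page["images"])
```

Which dependencies count as optional is a business decision. A page without images may still sell, while a page without a price probably should not be shown.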
Phew, no wonder people working behind the scenes get stressed just before and during heavy shopping periods! There are so many potential points of failure, even in the most well-designed systems.
As online sales continue to grow, so do the size and complexity of our sales systems. How can we gain any assurance that everything is operating as it should at any given moment, and that a failure or slowdown in one area of this complex process will not cause the entire system to fail, affecting the customers’ ability to buy from us?
Our answer is Chaos Engineering. What we learned solving this exact problem is that Chaos Engineering is not a luxury but a necessity. The goal of Chaos Engineering is to perform precise, limited experiments that show how your system will respond to chaotic events long before those events occur naturally, so that you can fix weaknesses and enhance system reliability. You want to anticipate problems before they occur. When you do not know exactly what the impact of a high-load event will be, you experiment to find out, so you can mitigate those problems early instead of meeting them on big shopping days.
At Gremlin we offer multiple Infrastructure Gremlins that we can use to test our systems. For example, the Blackhole Gremlin is a great way to test what happens when communication to and from a specific microservice is lost, and the Latency Gremlin shows what happens when messaging slows down. Is the graceful degradation scheme working? Are failover systems taking over properly to minimize customer impact? Are monitoring triggers correctly spinning up new instances of an overloaded service before the pressure causes problems? Gremlin can help engineers find answers to these and other uptime-related questions.
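To show the kind of question these experiments answer, here is a small in-process simulation. It does not use Gremlin or its API. The helpers `with_latency`, `blackholed`, and `call_with_fallback` are hypothetical names invented for this sketch, and the blackhole is simplified to an immediate connection error, whereas real dropped traffic usually shows up as a hang until a timeout fires.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def with_latency(call: Callable[[], str], delay_s: float) -> Callable[[], str]:
    """Wrap a dependency call so every invocation is delayed by delay_s seconds."""
    def slowed() -> str:
        time.sleep(delay_s)
        return call()
    return slowed


def blackholed() -> str:
    """Stand-in for a dependency whose traffic is being dropped (simplified)."""
    raise ConnectionError("simulated blackhole: service unreachable")


def call_with_fallback(call: Callable[[], str], timeout_s: float, fallback: str) -> str:
    """Call a dependency, returning fallback if it is too slow or unreachable."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call)
    try:
        return future.result(timeout=timeout_s)
    except (FutureTimeout, ConnectionError):
        return fallback
    finally:
        # Do not block the caller waiting for a slow dependency to finish.
        pool.shutdown(wait=False)


def recommendations() -> str:
    return "personalized"


# Experiment: does the caller degrade to a default list under each fault?
print(call_with_fallback(recommendations, 0.5, "bestsellers"))
print(call_with_fallback(with_latency(recommendations, 0.3), 0.05, "bestsellers"))
print(call_with_fallback(blackholed, 0.5, "bestsellers"))
```

The healthy call returns the personalized result, while both injected faults should fall back to the default list. If either faulted call instead hung or raised, that would be the weakness the experiment was designed to expose.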
Sometimes we believe we have properly set up mitigation schemes. We have network redundancy. We have automated failover schemes set up and running. We have thought through potential points of failure to make sure that a communication failure between our website’s front end and the data store that contains product images will not cause a catastrophic site failure. The question then becomes, “Have those schemes been tested?” The only way to be confident in any mitigation scheme is to intentionally cause the problem and see what happens. Chaos Engineering does exactly that.
Black Friday and other big shopping days put significant stress on e-commerce sites. That pressure can be simulated and its effects measured through careful Chaos Engineering experiments. Running those experiments well in advance of expected high-volume days can prevent heartache and pain later.
Create a little controlled chaos to help you create resilient systems and prevent catastrophic chaos.