It’s the time of year when teams at our favourite brands are gearing up for the Black Friday and Cyber Monday shopping period. It’s not unusual for retailers to see 5 to 10X their normal traffic during this short period, and these few days can account for a significant percentage of the year’s overall sales.
Hardening systems for seasonal spikes in traffic is not unique to retail and e-commerce, though. I have seen traffic spikes on Tax Day and at the end of the financial year while working in financial services and during product launches and marketing promos while working in software.
Likewise, news organizations see peak traffic during major election cycles, while gaming and media companies manage spikes during popular sporting events, series premieres and finales, and after major releases. Additionally, providers of online college entrance tests and progress exams experience significant traffic spikes when thousands of students log on during admissions or testing seasons.
In each case, engineering effort is required to avoid costly (and brand damaging) incidents.
There’s a lot you can do ahead of time to make your life easier and to ensure you’re meeting your business and customers’ needs. I thought I’d share my favourite tips from my 10 years working to build more reliable systems.
An IMOC is an Incident Manager on Call. Ensure your most experienced IMOCs are on call during these high-traffic events, with a minimum of two (a primary and a secondary) during peak traffic. Your most experienced IMOCs have the best understanding of your system’s architecture, its weak points, and your most critical services.
Ensure that ahead of the event, your IMOCs have been trained and briefed on any known issues or problem services (see #6 for how to identify these). If you don’t have an on-call rotation program, check out my other tutorial about creating a high-severity incident management program.
Create a one-page document several weeks ahead of the high traffic event that can be shared across your organization. This document will contain all the need-to-know information about the peak period including:
- Background information on the event
- Contact information for your IMOCs (see #1) and where to find the Slack channel specific to the event
- Any known bottlenecks (e.g., if your cache has caused incidents in the past, be sure to note it!)
- Primary query plans and any expected query pattern changes. (This is particularly important if you are launching a new feature or service that may alter query or traffic patterns.) How long do these queries take to run under normal conditions?
- Scaling bounds and known capacity limits (e.g., a capacity limit on Lambdas)
- Results from Chaos Experiments run on the relevant services (see #4).
Especially during your first year at a new company, you may not have the same “tribal knowledge” as your team. Seek out postmortems from similar high traffic events like big marketing pushes or product launches to understand what has caused problems in the past. Identify any outstanding action items and connect with the engineers responsible for them—even if it’s just for a quick cup of coffee. You’ll be amazed at what you’ll learn.
Testing in production doesn’t need to be as scary as it may sound! Start with a single host and send it 1X normal traffic. Scale up to 2X. Find out where it tips over.
Remember that load testing is just one part of preparing for traffic spikes. Incidents happen when a service fails or when one of its critical dependencies fails, so testing a service in isolation is not enough. Testing in production is the only way to be sure you are accounting for all of the upstream and downstream dependencies that could contribute to an incident.
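The ramp described here (start at 1X, step up to 2X and beyond, and find where the host tips over) can be sketched as a small loop. This is a minimal sketch rather than a load-testing tool: `send_load` is a hypothetical hook standing in for whatever load generator you use, `simulated_host` is a made-up stand-in for a real host, and the 1% error-rate limit and step size are assumptions you would tune for your own service.

```python
from typing import Callable, Optional


def find_tipping_point(
    send_load: Callable[[float], float],
    max_multiplier: float = 10.0,
    step: float = 1.0,
    max_error_rate: float = 0.01,  # assumed limit; pick one that fits your SLOs
) -> Optional[float]:
    """Ramp load in steps and return the first multiplier that breaches the limit.

    send_load(multiplier) is a hypothetical hook: it should drive the host at
    `multiplier` times normal traffic and return the observed error rate
    as a fraction between 0.0 and 1.0.
    Returns None if the host survives every step up to max_multiplier.
    """
    multiplier = step
    while multiplier <= max_multiplier:
        error_rate = send_load(multiplier)
        if error_rate > max_error_rate:
            return multiplier
        multiplier += step
    return None


def simulated_host(multiplier: float) -> float:
    """Made-up stand-in for a real host: healthy up to 3X, then errors climb."""
    return 0.0 if multiplier <= 3.0 else 0.05 * (multiplier - 3.0)


if __name__ == "__main__":
    print("Tips over at multiplier:", find_tipping_point(simulated_host))
```

In a real test, error rate is only one possible signal. Latency or memory pressure could be swapped in as the condition that ends the ramp.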
As you do this, you will determine the important characteristics of a real production build, like:
- What is your memory bound?
- Are your autoscaling rules triggering as expected?
- How long does it take for a new instance to spin up?
- Can you lose your cache and continue to operate?
- Are your alerts configured with the right thresholds, and do they actually fire when you hit those thresholds?
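One of the questions above, how long a new instance takes to spin up, is easy to measure with a polling timer. The following is a rough sketch under stated assumptions: `is_healthy` is a hypothetical callable that you would wire to your own health check, and the timeout and poll interval are placeholder values. The `clock` and `sleep` parameters exist only so the helper can be exercised without real waiting.

```python
import time
from typing import Callable, Optional


def time_until_healthy(
    is_healthy: Callable[[], bool],
    timeout_s: float = 300.0,      # assumed upper bound on spin-up time
    poll_interval_s: float = 1.0,  # assumed polling cadence
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll a health check and return the seconds elapsed until it passes.

    is_healthy is a hypothetical hook; in practice it might call the new
    instance's health endpoint. Returns None if the timeout is reached first.
    """
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_interval_s)


if __name__ == "__main__":
    # Demo with a fake check that passes on the third poll.
    polls = {"count": 0}

    def fake_check() -> bool:
        polls["count"] += 1
        return polls["count"] >= 3

    elapsed = time_until_healthy(fake_check, poll_interval_s=0.01)
    print(f"Healthy after {elapsed:.2f}s and {polls['count']} polls")
```

Comparing this number with how quickly traffic ramps up during a spike gives a first indication of whether autoscaling alone can keep up.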
Within the first 3 months of starting on a new team or at a new company, aim to know the services that are the source of the most problems.
Ask your team, “What service would you least like to write code for?” Services that your fellow engineers avoid working with are often the services primed to cause the most issues.
Other things to look out for:
- Has this service been involved in an incident or outage?
- Does the service have monitoring and alerting in place?
- Does the service have a 24/7 on-call rotation?
- Does the service have an owner?
Based on the answers to these simple questions, you can quickly start to identify the services that are more likely to cause trouble under heavy load.
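To make the four questions concrete, here is a minimal sketch that turns the answers into a rough ranking. Everything in it is illustrative: the `ServiceCheck` type, the equal weighting of one point per warning sign, and the example service names are all assumptions, not a standard scoring method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceCheck:
    """Answers to the four questions for one service (hypothetical structure)."""
    name: str
    had_incident: bool
    has_monitoring: bool
    has_oncall: bool
    has_owner: bool

    def risk_score(self) -> int:
        """One point per worrying answer, from 0 to 4 (equal weights are an assumption)."""
        return sum([
            self.had_incident,
            not self.has_monitoring,
            not self.has_oncall,
            not self.has_owner,
        ])


def rank_by_risk(services: List[ServiceCheck]) -> List[ServiceCheck]:
    """Return services ordered from most to fewest warning signs."""
    return sorted(services, key=lambda s: s.risk_score(), reverse=True)


if __name__ == "__main__":
    # Example data is invented purely for illustration.
    services = [
        ServiceCheck("checkout", had_incident=True, has_monitoring=True,
                     has_oncall=True, has_owner=True),
        ServiceCheck("legacy-billing", had_incident=True, has_monitoring=False,
                     has_oncall=False, has_owner=False),
        ServiceCheck("search", had_incident=False, has_monitoring=True,
                     has_oncall=True, has_owner=True),
    ]
    for svc in rank_by_risk(services):
        print(svc.name, svc.risk_score())
```

A simple count like this is only a conversation starter. Your team’s answer to “What service would you least like to write code for?” may matter more than any score.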
One quick way to get started is to use Gremlin’s reliability calculator to help you discover and prioritize your efforts.
The best piece of advice I can offer after 10 years of preparing for peak traffic events is to start preparing today. A recent study found that 80% of engineers implement their preparations less than 3 months before a big event, yet confidence in those preparations grows the earlier the work starts.
Building reliability takes time, but even if you choose to start with just one of these recommendations, you will be reducing the likelihood of an incident on the most important days of the year for your business and your users.
These tips have helped me avoid incidents during the busiest days of my career. However, all of these efforts will ultimately build reliability not just for the peak periods, but year-round.
Now I’d like to hear from you! What are your tips for ensuring a smooth Black Friday, product launch, or other peak event? Join me and 5,000+ engineers in the Chaos Engineering Slack to share your tips in the #learning channel.