Networks are being stressed in unprecedented ways now that workers are connecting remotely to do their jobs. Video calls add to the network traffic spike. How do we cope effectively? How do we maintain the ability for our employees and customers to stay productive in the new paradigm?
The reliability of our internal systems, which were often not designed for today's levels of externally connecting traffic, and of the work-related services we depend on to get our work done, is more important than ever. Yet usage is at an all-time high. What can we do to prepare for and adapt to the surge in demand?
Whether we are a service provider of things like video conferencing or an enterprise used to having employees work in an office environment, usage of our networks and systems is up. Let’s consider some examples.
Some companies are expanding their load-balancing capacity by spinning up more instances of vital services in the cloud. All major clouds are being used at higher levels than ever before, and at times the strain shows on the customer side. For example, some Microsoft Azure customers found themselves hitting capacity limits. Although there have been no service disruptions at this time, demand has increased significantly, to the point that Microsoft has been proactive in communicating cloud services continuity plans.
Business continuity is important. We need to be ready to handle online disasters remotely. The question becomes: How do we make sure that services remain available for those who need them to get their work done?
There are a number of ways that we can help make sure systems stay up and running during a traffic spike, whether this is a one-time event or a new long-term shift. In either case, the goal is always to prepare in advance of the incident, when possible, to try to prevent scrambling during a crisis. Much of this section comes from a previous Gremlin blog post, but is information worth repeating.
Check the on-call schedule and make sure the IMOC (incident manager on call) rotation is solid. During times of heavy traffic, we should have at least two people on call at all times, naming one primary and one secondary so there is never a question about who ultimately makes decisions.
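The primary/secondary split can be sketched as a simple escalation rule. This is an illustrative sketch, not a real paging integration; the `Rotation` type, the names, and the acknowledgement flag are all hypothetical:

```python
# Hypothetical escalation sketch: page the primary, and if the page is not
# acknowledged, escalate to the secondary so decision-making is never ambiguous.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rotation:
    primary: str
    secondary: str
    acked_by: Optional[str] = None  # who currently owns the incident

def page(rotation: Rotation, primary_acked: bool) -> str:
    """Page the primary; escalate to the secondary if the page is not acked."""
    rotation.acked_by = rotation.primary if primary_acked else rotation.secondary
    return rotation.acked_by
```

If the primary misses the page, the secondary becomes the decision-maker, so ownership is never unclear mid-incident.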
Take the time to think through all of the high-priority things a team should know about the system. Include entries like this, but keep it brief:
- Why we expect the traffic spike
- Contact information for all on-call people and a link to the rotation calendar
- Known system trouble spots like potential bottlenecks or single points of failure
- Primary database query plans and any expected query-pattern changes, including how long these queries take to run under normal conditions
- Scaling bounds and known capacity limits, such as a capacity limit on Lambdas
- Results from Chaos Engineering experiments run on services
Look over notes from past incidents, scanning specifically for any involving higher-than-normal traffic. Check for action items that were never completed, and get those prioritized and fixed as soon as possible!
Chaos Engineering is all about controlled testing and experimentation to learn about the reality of how our systems actually operate. Ultimately, we want to understand how things work in production, but even small starts are useful.
Think about how to carefully and safely limit the blast radius (the systems impacted by an experiment) and begin with a test of small magnitude, such as selecting a single host and sending it 105% of typical traffic. When it passes (and hopefully it will), increase the magnitude in small increments and watch your monitoring. Find out when the host begins to fail and how, then stop the experiment. Use that data to prioritize reliability work.
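That ramp-up can be sketched as a simple loop. This is an illustrative stand-in, not the Gremlin API; `send_load` is a hypothetical load generator that returns whether the host stayed healthy at a given percentage of typical traffic:

```python
# Sketch: increase load on a single host in small increments, stopping at the
# first failure and recording the breaking point for reliability planning.
def find_breaking_point(send_load, start_pct=105, step_pct=5, max_pct=200):
    """Ramp load until the host fails; return (last passing %, failing %)."""
    last_passing = None
    pct = start_pct
    while pct <= max_pct:
        if send_load(pct):              # True means the host stayed healthy
            last_passing = pct
            pct += step_pct             # small increments; watch monitoring
        else:
            return last_passing, pct    # stop the experiment at first failure
    return last_passing, None           # never failed within the tested range

# Example with a simulated host that degrades above 130% of typical traffic:
breaking = find_breaking_point(lambda pct: pct <= 130)  # (130, 135)
```

The returned pair tells you both the last safe load level and where failure began, which is exactly the data needed to prioritize reliability work.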
Load testing is a good place to start, but we all know it is not the part of the system where we find the most failures. Examine your networking: what happens when we introduce latency? Packet loss? Outright failures? Verify that our intended mitigation schemes work as designed, and that autoscaling rules trigger as expected. Test alerts, too: they should fire at appropriate thresholds, while things are degraded but not yet causing major incidents, yet not so frequently that we start ignoring them.
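As one example of tuning alert thresholds, a rule can require several consecutive breaching samples before firing, so a single blip does not page anyone. The latency threshold and sample count below are assumptions for illustration:

```python
# Sketch of a noise-resistant alert rule: fire only when latency exceeds the
# threshold for several consecutive samples, not on a one-off spike.
def should_alert(samples_ms, threshold_ms=500, consecutive=3):
    """Return True if the last `consecutive` samples all breach the threshold."""
    recent = samples_ms[-consecutive:]
    return len(recent) == consecutive and all(s > threshold_ms for s in recent)

single_spike = should_alert([120, 130, 900, 140])  # one blip: no page
sustained = should_alert([120, 900, 950, 980])     # sustained breach: alert
```

A rule like this catches real degradation early while keeping the pager quiet enough that alerts stay meaningful.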
Qualtrics uses Gremlin to help plan and test for disaster recovery readiness with Chaos Engineering. Under the current circumstances, this is vital.
As early as possible in a job, and before an incident, try to learn as much as possible about what keeps the other engineers on the team up at night. Where are the problems? Which service do engineers try to avoid writing code for? These are often involved in failures and incidents in some way.
It’s helpful to learn the history of these services. Which have been involved in past incidents or outages, and how: as the trigger, or as part of a cascade into a larger failure? Which services do or don’t have monitoring and alerting in place, on-call rotations, owners?
Start now, before anything goes wrong. If something already has, then start working today on the known problems. Either way, there are issues we know could impact customers unless we have solutions in place before they come up.
Here are some things you can do now:
- Expand the number of compute nodes in use and make sure your load balancing can handle the increase. Base this expansion on your expected traffic handling need; don’t just add randomly.
- Create failover methods to protect users and their data from the inevitable (and hopefully rare) disappearance or freeze of a cloud instance or even a region failure.
- Create automated traffic redundancy and rerouting for busy networks experiencing latency, packet loss, or even full failures.
- Create automated mitigation schemes for resources nearing usage limits, like I/O or RAM.
- Create full backup systems for all vital resources like DNS servers and databases.
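The failover idea above can be sketched as trying endpoints in order until one responds. The endpoint names and the `fetch` callable here are hypothetical, standing in for whatever client your services actually use:

```python
# Illustrative failover sketch: try the primary endpoint first, then fall back
# to a backup when an instance freezes or a region fails.
def fetch_with_failover(fetch, endpoints):
    """Return the first successful response, trying endpoints in order."""
    last_error = None
    for endpoint in endpoints:
        try:
            return fetch(endpoint)   # e.g. primary region first, then backup
        except ConnectionError as err:
            last_error = err         # instance gone or region down: fail over
    raise last_error                 # every endpoint failed; surface the error

# Simulated outage of the primary region:
def fetch(endpoint):
    if endpoint == "primary":
        raise ConnectionError("primary region unavailable")
    return f"served by {endpoint}"

result = fetch_with_failover(fetch, ["primary", "backup"])
```

Keeping the endpoint list ordered by preference means the backup only takes traffic when the primary is actually unreachable.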
Work intentionally and methodically. Add code using small changes that are easy to test and roll back. Test each new addition well. Consider using canary instances to test in production while existing instances bear the main load. Keep moving forward!
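Canary routing can be sketched as weighted instance selection, sending a small share of requests to the canary while stable instances bear the main load. The 5% weight below is an assumption for illustration:

```python
# Sketch of weighted canary routing: a small, configurable fraction of
# requests goes to the canary instance; the rest go to stable instances.
import random

def pick_instance(rng, canary_weight=0.05):
    """Route a request to 'canary' with probability canary_weight."""
    return "canary" if rng.random() < canary_weight else "stable"

# With a fixed seed, roughly 5% of requests hit the canary:
rng = random.Random(42)
routed = [pick_instance(rng) for _ in range(10_000)]
share = routed.count("canary") / len(routed)
```

Because the weight is a single parameter, rolling back a bad canary is as simple as setting it to zero, which fits the small, easy-to-revert changes recommended above.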