One of the most frequent questions we get asked by teams looking to improve reliability is, “Where should I start?”
Even the simplest applications can consist of several microservices, and as complexity increases, it becomes very difficult to decide where to focus your reliability efforts. As engineers with only so many hours in a day, we want any time we spend thinking about reliability (and not the newest feature) to have the maximum effect.
Earlier this year, Gremlin’s Tammy Butow presented a live tutorial on preparing for peak traffic events, whether that’s Black Friday, Election Day, or something else. In this session, she drew from her 10 years of experience and suggested focusing efforts on the services that hit the right combination: (1) critical, (2) struggling, but (3) not a total dumpster fire (my words, not Tammy’s). In short, taking a “B” service to an “A” will net much greater value than trying to rescue a “D” or an “F”, or trying to squeeze incremental gains from an “A”.
However, evaluating our services in terms of how they stack up is a nontrivial task. In an effort to make this process more straightforward, the team at Gremlin has been hard at work creating a Reliability Calculator. To use this interactive calculator, input a few key bits of information about a specific service in your architecture and quickly get its reliability grade. By repeating this process across various services, we can start to develop a plan for which services to focus on and which to safely deprioritize.
Additionally, for each service, the calculator gives personalized recommendations for how to improve its reliability. These are actionable steps we can take to raise the service’s reliability (and its reliability grade) over time.
So, if you’re like many of the engineers we have spoken with and are looking for a way to prioritize your reliability efforts, take our calculator for a spin and let us know what you think!