Last week, over five hundred SREs gathered in Santa Clara to share the latest research, tips, tricks, best practices, and more for site reliability engineering. They were joined by some of the biggest names in the reliability space. And, yes, Gremlin was there to answer any and all questions about chaos engineering and proactive reliability.
After three days of great conversations and insightful talks, let’s take a look at some of the themes we heard weaving through SRECon.
These days, being an SRE is as much about having the right people skills and processes as it is about technical skills. When there’s an incident, you’re bringing people together across teams, digging through multiple services, and getting buy-in from stakeholders who just want to get back to other priorities. It’s a balancing act that starts with testing in the CI/CD pipeline and extends through incident resolution into postmortems.
And let’s be honest: It’s a lot. Fortunately, you’re not alone. SRECon was filled with best practices, collaboration tools, reliability dashboards, and more to help make it easier.
A big part is embracing an SRE mindset as a proactive facilitator. (Austin Parker from Lightstep shared some good insights in his talk, “The Revolution Will Not Be Terraformed: SRE and the Anarchist Style.”) The more you get ahead of incidents with chaos experiments, incident response dry runs, and shared collaboration processes, the better aligned everyone is and the faster you can react with less drama.
The baseline for observability has moved beyond basic monitoring and logs. Complex architectures have made a good observability setup with golden signals a bare minimum for strong reliability. And that’s just the beginning. During his talk on J.P. Morgan's Journey into the Cloud, Fred Moyer shared a snapshot of the hundreds of services and tools involved in an enterprise-level architecture, with a dozen observability tools spread across the organization.
As things get more and more complex, getting those basic health check observability metrics in place is essential for knowing how reliable your systems actually are. Once you have that, you can start layering in more advanced tools and methodologies to cut down your incident response times, like automated response or regular reliability testing.
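To make that baseline concrete, here is a minimal sketch of computing the four golden signals (latency, traffic, errors, and saturation) over a window of request records. The `Request` record, the `golden_signals` function, and the use of CPU utilization as the saturation measure are all assumptions made for illustration; in practice these numbers would come from whatever observability stack you already run.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one handled request."""
    latency_ms: float
    status: int  # HTTP status code


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(
    requests: Sequence[Request],
    window_s: float,
    cpu_utilization: float,
) -> dict:
    """Summarize one time window into the four golden signals.

    cpu_utilization (0.0 to 1.0) stands in for saturation here; that is
    an assumption, and the right measure depends on the service.
    """
    latency: Optional[float] = None
    error_rate = 0.0
    if requests:
        latency = percentile([r.latency_ms for r in requests], 95)
        # Count server-side failures (5xx) as errors.
        errors = sum(1 for r in requests if r.status >= 500)
        error_rate = errors / len(requests)
    return {
        "latency_p95_ms": latency,
        "traffic_rps": len(requests) / window_s,
        "error_rate": error_rate,
        "saturation": cpu_utilization,
    }


if __name__ == "__main__":
    window = [
        Request(100, 200),
        Request(200, 200),
        Request(300, 500),
        Request(400, 200),
    ]
    print(golden_signals(window, window_s=2.0, cpu_utilization=0.5))
```

A summary like this is only the starting point; alert thresholds, automated response, and regular reliability tests would build on top of it.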
Whether it’s through acquisitions, migrations, or good old company success, having to scale up your reliability practice as your systems grow is a good sign, albeit one that comes with its own brand of complications. And those complications have only grown with the increased use of Kubernetes, microservices, and serverless. It’s why scaling was a common topic for SRECon talks.
A big part of this has to do with the complexity of architectures outpacing our mental models. As architectures become more fluid and ephemeral, it gets harder for us as individuals to wrap our heads around them. Fortunately, we saw a variety of new tools and approaches emerging that help teams better understand reliability in these environments, including stronger Kubernetes observability, Kubernetes chaos engineering, and more.
No matter how quickly you can resolve an incident, it’s always better to avoid the incident in the first place. For that reason, SREs are looking to shift reliability left in the software development cycle, while also looking to be more proactive with reliability testing and monitoring.
Practices like using feature flags and integrating chaos engineering into your CI/CD pipeline are becoming more common. And once the code has shipped, chaos engineering can be used to regularly test your reliability, while automated remediation tools can help address basic issues before they become incidents.
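As one illustration of what a chaos check in a CI/CD pipeline might look like, the sketch below injects failures into a dependency and asserts that the calling code degrades gracefully instead of crashing. Everything here is hypothetical: `FaultInjector`, `get_recommendations`, and the fallback behavior are invented for the example and do not represent Gremlin’s API or any specific tool. Real chaos experiments usually inject faults at the infrastructure or network level rather than in process.

```python
import random
from typing import Callable, Iterable, List


class FaultInjector:
    """Wraps a callable and fails a configurable fraction of calls.

    Hypothetical in-process stand-in for a real fault-injection tool.
    """

    def __init__(self, func: Callable, failure_rate: float, seed: int = 0):
        self._func = func
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so CI runs are repeatable

    def __call__(self, *args, **kwargs):
        if self._rng.random() < self._failure_rate:
            raise ConnectionError("injected fault")
        return self._func(*args, **kwargs)


def get_recommendations(
    fetch: Callable[[int], Iterable[str]],
    user_id: int,
    fallback: Iterable[str] = (),
) -> List[str]:
    """Return personalized results, or the fallback if the dependency fails."""
    try:
        return list(fetch(user_id))
    except ConnectionError:
        return list(fallback)


def test_survives_dependency_outage() -> None:
    # Simulate a total outage of the (hypothetical) recommendation service.
    broken = FaultInjector(lambda uid: ["a", "b"], failure_rate=1.0)
    assert get_recommendations(broken, 42, fallback=["popular"]) == ["popular"]


def test_healthy_dependency_is_used() -> None:
    healthy = FaultInjector(lambda uid: ["a", "b"], failure_rate=0.0)
    assert get_recommendations(healthy, 42, fallback=["popular"]) == ["a", "b"]


if __name__ == "__main__":
    test_survives_dependency_outage()
    test_healthy_dependency_is_used()
    print("chaos checks passed")
```

Running checks like these on every build catches a missing fallback before the code ships, which is the point of shifting reliability left.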
Part of the SRE mindset shift we mentioned above includes taking a more proactive approach to reliability, and a big part of that is embracing chaos engineering. In many of our conversations, SREs already knew about chaos engineering or had at least heard of it.
Microsoft’s Kaitlyn Yang and Vikram Raju gave a talk about building resilient distributed systems using chaos engineering. Bloomberg’s Dhishan Amaranath and Tucker Vento, in their talk “Chaos-Driven Development: TDD for Distributed Systems,” stated, “Chaos experimentation is a force multiplier for other Reliability Engineering practices.”
SRE tools, processes, and practices have matured substantially in the past few years. It’s no longer just about making sure we have the visibility to resolve incidents quickly. Modern SRE practices are about taking active measures to decrease response time, improve team collaboration, and prevent incidents before they happen.
And if there’s one big takeaway, it’s this: As architectures get more complex and distributed, SREs and a strong reliability practice are more important than ever.