How to become a top-notch SRE

You have some experience with programming or systems administration, in development or operations, and now that you have heard about Site Reliability Engineering (SRE), it sounds like a good next step. This article explains in greater detail what you need to know to be not only a successful SRE, but one of the best.

Site Reliability Engineers (SREs) are specialized DevOps-style engineers whose primary focus is keeping modern software services and systems up and running, or reliable. They work with large, distributed computer systems to prevent downtime.

The goal of this article is to walk you through the transition from software engineer or operations engineer to SRE, diving into the skills you need to develop, the mindset shift that needs to take place, and the training you should pursue before hopping directly into an SRE role.

What does a Site Reliability Engineer do?

Site Reliability Engineers start by looking at the system, then take the easiest and most mundane tasks and automate them. Doing so frees up their time and mental capacity for other tasks that improve system reliability, efficiency, and design. Automation also reduces error.

Reducing error and improving reliability are important because SREs are directly responsible for meeting the service-level objectives (SLOs) set for the services they manage and the service-level agreements (SLAs) we promise in our contracts. An SLO sets a target for reliability, and the gap between that target and 100% is the error budget: the amount of unreliability the service can tolerate. Data gathered while the system is operating is compiled into service-level indicators (SLIs), which help SREs decide which parts of the system to prioritize for enhancement.
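To make the relationship between SLOs, SLIs, and error budgets concrete, here is a minimal Python sketch. It assumes a request-based SLI (successful requests divided by total requests) over a single measurement window. The function name error_budget, its fields, and the numbers are illustrative assumptions, not part of any standard tool.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget consumption for one measurement window.

    Assumption: the SLI is the fraction of requests that succeeded.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # SLI: the measured fraction of requests that succeeded.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: the number of failed requests the SLO tolerates.
    allowed_failures = (1.0 - slo_target) * total_requests

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_remaining": allowed_failures - failed_requests,
        "slo_met": sli >= slo_target,
    }


# Hypothetical window: a 99.9% SLO, one million requests, 400 failures.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(report)
```

With these sample numbers the budget is about 1,000 failed requests, so roughly 600 remain. A team in that position might decide it can afford a risky deployment; a team with a negative remaining budget would likely prioritize reliability work instead.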

While many Site Reliability Engineers come from a software development background and others from a systems or operations background, all of us spend time on both kinds of work: writing code and managing the system. This is why we are well suited both to know what would be useful to automate and to write the code that does the automation.

When there are problems, such as an outage, SREs take a problem-solving approach and work to find the fastest and most appropriate way to fix the immediate problem. They then schedule work to prevent it from happening again. Several metrics are used to measure the speed and efficiency of incident response, such as:

Mean time to detect (MTTD), which measures the average time needed to discover a problem
Mean time to resolve (MTTR), which measures the average time it takes to fix a failed system
Mean time to failure (MTTF), which is the average amount of time a non-repairable system or component runs before it fails; this is similar to uptime and helps teams plan for future replacement of system components before they stop working
Mean time between failures (MTBF), which measures the average time a repairable system or component works properly between one failure and the next
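As a rough illustration, here is how three of these metrics could be computed from a list of incident records in Python. The Incident fields, the sample timestamps, and the choice to measure MTTR from the start of the fault (some teams measure from detection) are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the fault began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def mean(deltas):
    """Average a non-empty collection of timedelta values."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def incident_metrics(incidents):
    incidents = sorted(incidents, key=lambda i: i.started)
    # MTTD: average delay between a fault starting and being noticed.
    mttd = mean(i.detected - i.started for i in incidents)
    # MTTR: average time from fault start to restored service (assumed convention).
    mttr = mean(i.resolved - i.started for i in incidents)
    # MTBF: average healthy time between one resolution and the next fault.
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(incidents, incidents[1:])]
    mtbf = mean(gaps) if gaps else None
    return {"mttd": mttd, "mttr": mttr, "mtbf": mtbf}


# Hypothetical incident history, for illustration only.
history = [
    Incident(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 0, 10), datetime(2024, 1, 1, 1, 0)),
    Incident(datetime(2024, 1, 3, 1, 0), datetime(2024, 1, 3, 1, 20), datetime(2024, 1, 3, 1, 30)),
]
metrics = incident_metrics(history)
print(metrics)
```

For this sample history, MTTD is 15 minutes, MTTR is 45 minutes, and MTBF is 48 hours. Whatever conventions you pick, apply them consistently so the numbers stay comparable over time.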

This article gives the highlights of the job as a segue into our main topic, how to become the best SRE possible. We describe the job in greater detail in our article, What is Site Reliability Engineering? A Primer for Engineering Leaders.

Here are the main tasks that SREs perform:

Write software features
Fix bugs
Build and deploy software
Test software
Administer production deployments
Manage tooling and provisioning
Mitigate disasters
Configure and use monitoring for observability
Prevent data loss
Manage incidents and incident response and repair the site when problems happen
Prevent recurrence of past problems
Analyze past incidents
Use Chaos Engineering to find and prevent future problems and to confirm that fixes from past incidents function as intended
Learn and share skills

Ultimately, the job is focused on owning a system or a service, from initial coding to production deployment, and keeping that system or service running reliably. See our sample SRE job description and interview questions article for more.

What makes someone a good fit for Site Reliability Engineering?

How do I know if SRE is right for me? Hiring managers and committees all say that talented future SREs are hard to find. That said, there is a set of qualities that makes someone a good candidate for SRE.

Companies hiring SREs look for people who are smart, who are passionate about building and running complex systems, and who can quickly understand how something works, especially when they have never seen it before. This requires a strong curiosity and interest in learning new things.

Google's Site Reliability Engineering (SRE) organization is a mix of software engineers (known as SWEs) and systems engineers (known as SEs) with a flair for building and operating reliable complex software systems at an incredible scale. SREs have a wide range of backgrounds - from a traditional CS degree or self-taught sysadmin to academic biochemists; we've found that a candidate's educational background and work experience are less predictive than their performance in interviews with future colleagues.

From "Hiring Site Reliability Engineers" by Chris Jones, Todd Underwood, and Shylaja Nukala, in the USENIX publication ;login:, vol. 40, no. 3, June 2015

Finding people who are both technology generalists, with a wide array of knowledge and experience, and true specialists, with sufficient technical depth in vital fields like networking or distributed systems, is difficult. Becoming that person takes work.

Before we get to the technical details, let's start with some specific personality traits that highly successful SREs share:

An impatience with drudgery and toil, because this provides an inherent motivation to figure out how to automate repetitive work.
An eye for detail, because this helps prevent foreseeable problems.
A good sense of the big picture, because knowing how your work fits into a wider system helps you write better code and helps you work with others to create a better system overall.
A strong interest in learning new things, because once you automate all the simple things it will be your imagination and curiosity that help drive new efficiencies, improvements, and enhancements that have not been previously tried.
A humble demeanor, because everyone makes mistakes and casting blame toward others or heaping it upon yourself when problems and system failures happen does not help solve issues when they arise.

It also helps to be okay with writing code that external users will probably never see or even know about, because it works invisibly to keep the site up and running ever more efficiently. You should definitely enjoy interacting with a system using a terminal (command-line interface), because it is often faster, more powerful, and more elegant than using a GUI.

One of the most important traits is the ability to stay calm under pressure, including during unexpected on-call events, because systems never seem to fail at convenient moments. You will be asked to solve problems that no one has ever solved before, because you and your team have already automated solutions to any obvious problems or even complex ones that you have discovered.

Finally, it is a huge bonus if you have a set of interests that coincide with the problems you are going to solve, the people you will be working with, and the technologies you will be using or may want to use in the future. Ask yourself whether you like chaotic, large systems and working with a team to figure out how to create quality services that are modular and play well in those types of systems, such as those that often exist behind the scenes keeping the company's mobile app available and working.

Having the ability to connect seemingly disparate ideas is a key component of finding great solutions that seem obvious, but only in retrospect. Being a person who works well in a collaborative, communicative environment helps all of you do this together in ways that each of you working separately never could. Listening to teammates and respecting one another's strengths and opinions gives all of you greater potential for success.

What training and skill sets should potential SREs have?

People come to SRE from many different backgrounds. This is good because it makes SRE teams well-rounded, with expert knowledge from many different perspectives and experiences. Many high-quality SREs have come from backgrounds like those described in this section and the next, and have significant hands-on experience. This list is not exhaustive.

Ultimately, what you need is a variety of technical experiences and skills combined with an interest in working on large-scale systems, making them reliable and even more scalable.

Programmers and developers bring programming language skills into a team, most likely along with experience building that software. They have also written code from scratch, fixed bugs and errors, and added new features.

System administrators have experience on the operations side of things, where the job has been to keep servers running and get the software they have been handed to run on those servers. They have had to take up the slack left by developers who thought their code was finished when it was thrown over the wall to operations, only to have SysAdmins discover that configuration info is missing, dependencies are not properly listed, or even that features are only mostly coded.

Once exposed to DevOps, each of these groups begins to understand the limitations of operating separately. When they come together, though, they often have to struggle with competing priorities. Developer-experienced team members want to focus on getting new features into the hands of users as quickly as possible. SysAdmin-experienced team members want to avoid breaking anything. Both concerns are valid.

Here is a definition of SRE according to Ben Treynor, the creator of the SRE position at Google:

Fundamentally, it's what happens when you ask a software engineer to design an operations function...doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.

This leads to a need for training in systems, especially the design and development of large, distributed ones. By eliminating as much human interaction as possible through automation, SREs make systems more reliable. Automated systems also react more quickly than people can, rerouting traffic around damage or failing over to backup service instances when primary instances fail.
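A very small Python sketch of the idea of automated failover follows. It assumes each backend is just a callable that either returns a response or raises an exception. The names call_with_failover, AllBackendsFailed, primary, and backup are hypothetical, and real systems usually handle this in load balancers or service meshes rather than in application code.

```python
from typing import Callable, Sequence


class AllBackendsFailed(Exception):
    """Raised when every backend in the list raised an error."""


def call_with_failover(backends: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each backend in priority order and return the first successful response."""
    errors = []
    for backend in backends:
        try:
            return backend(request)
        except Exception as exc:  # deliberately broad: this is only a sketch
            errors.append(exc)
    raise AllBackendsFailed(f"{len(errors)} backend(s) failed: {errors!r}")


# Hypothetical backends: the primary is down, the backup is healthy.
def primary(request: str) -> str:
    raise ConnectionError("primary is down")


def backup(request: str) -> str:
    return f"backup handled {request}"


result = call_with_failover([primary, backup], "GET /health")
print(result)
```

No human has to notice the failure and switch traffic by hand, which is the point: the reaction takes as long as one failed call, not as long as a page and a login.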

Database administrators often join SRE teams and learn programming and systems administration while teaching team members how to manage databases so they stay up and running efficiently. Their expertise expands what a team can effectively support, and from the team they learn how to script and automate many of their more common tasks, freeing up time to explore more interesting or problematic DB-related issues.

Good SREs are pragmatic. They analyze. They use their big-picture understanding of a service and how it fits into a wider system to come up with solutions that minimize negative impacts on others or provide positive ones. They also know when to let go of processes, policies, procedures, and even fixes or automated schemes they created when those are no longer helpful. Even the most well-meaning idea can one day become unproductive, and SREs are not sentimental about removing obstacles.

Great SREs are able to persuade teammates and organizations of what needs to be done. They confidently advocate for work they see is needed but that other people may not value or want to do (at first). We must be able to see how short-term pain can bring long-term benefit and demonstrate that with data, acting as effective salespeople to team members, managers, and sometimes people higher up the org chart. We also must be able to say "No" effectively when it needs to be said, and that is not a common skill.

How do I become a Site Reliability Engineer?

Everyone's path is a little different, but there are some commonalities. The foundational step is to learn about large computing systems and get whatever experience you can interacting with components of those systems, even on a small scale.

Learn to program shell scripts. Learn a few programming languages like Python, Java, and C (or maybe Rust or Go instead of C). Install a Linux distribution on a personally-owned computer or in a virtual machine and really learn it. Create problems for yourself by stretching a little too far and breaking something, then figure out how to fix it. Get in over your head and swim to the surface.

Document everything. Write clear notes, code comments, and instructions for yourself. This will make future-you happy when something breaks or needs to be upgraded and you can't remember where to look for information, what the process for upgrading should be, and so on. SREs do this all the time and create runbooks for training, incident management, disaster recovery and more. Examine and update your documentation every time you change anything.

Learn to use a distributed version control system, preferably Git. Follow some projects on GitHub, download open source code and study it to figure out how it works; write and release your own projects, even very small ones, using a proper license and a public repository. Create things that you find useful. At the same time you will be creating a portfolio to show hiring managers when you apply for SRE positions.

There are many quality text and code editors out there, from massive integrated development environments (IDEs) like Eclipse to lighter, modern ones such as Atom. Pick one you like and learn to use it well. Perhaps more importantly, because it is available on nearly every Linux system you will find, learn to use Vim.

Learn about automation, especially how to automate testing and automate software builds with a continuous integration/continuous delivery (CI/CD) pipeline, such as Travis CI or Jenkins.

Create some websites. They don't have to be huge projects and they don't have to be intended for lots of traffic, but do it yourself. Start by creating a cloud server using a service like DigitalOcean, Linode, or Amazon Web Services. Start with a bare-bones Linux installation. Install and configure a web server yourself, either Apache HTTP Server or Nginx (or better yet, learn both and tie them together for maximum performance). Create a basic website by hand-coding your HTML. Create a more complex one using an old-school common combination like PHP and MySQL so that you learn some fundamentals of connecting to and using a relational database on the web. Learn how to back everything up and restore from backups. Document everything.

Learn how to monitor your systems with tools like Nagios, Datadog, or New Relic. Spend some time gaining a good understanding of what things are important to monitor and what you can ignore. Create a dashboard. Study observability and how it builds on monitoring.
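If you want a feel for what a monitoring check does under the hood, here is a minimal Python sketch of a disk-usage check. It borrows the exit-code convention used by Nagios plugins (0 for OK, 1 for WARNING, 2 for CRITICAL). The 80% and 90% thresholds and the function names are assumptions chosen for illustration, not recommendations.

```python
import shutil

# Status codes following the Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(percent_used: float, warn: float = 80.0, crit: float = 90.0) -> int:
    """Map a usage percentage to a status code using two thresholds."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path: str = "/", warn: float = 80.0, crit: float = 90.0):
    """Return (status, percent_used) for the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    # Treat a filesystem that reports zero total size as 0% used.
    percent_used = 100.0 * usage.used / usage.total if usage.total else 0.0
    return classify(percent_used, warn, crit), percent_used


status, percent_used = check_disk("/")
print(f"DISK {LABELS[status]} - {percent_used:.1f}% used")
```

A real plugin would exit with the status code so the monitoring system can act on it. Deciding which thresholds deserve an alert, and which are just noise, is exactly the judgment this section asks you to develop.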

This is a good time to learn about service-oriented architecture (SOA) and how that has developed into microservices. Understanding systems architecture and how discrete services interact in that larger system is a vital part of SRE.

Spend some time working with containers, using tools such as Docker and Kubernetes. You don't have to be a master at these to get started, but familiarity with each of them provides a solid foundation for getting your foot in the door.

Learn about so-called "NoSQL" databases. There are many different types and each has fairly specific use cases where it excels. Compare and contrast them with relational databases like MySQL. This is a good time to dive into understanding what a data model is, why data models are necessary, and how the data model should inform your choice of database and your service architecture.

That was a lot. You don't need to master all of this, but you should be a master of at least some of these things if you want to become an SRE. When you have, you are ready to start reading job listings to see how close you are to the listed requirements. If you see knowledge or experience gaps, prioritize filling them, especially if you have many gaps. Even if you have a few gaps, you may be able to land a junior position and get started with a mature company that provides training programs for employees.

What online resources and courses can potential SREs learn from?

So many great options exist for self-directed study of topics mentioned in the previous two sections. We have curated this short, opinionated list to help you get started learning some of the skills mentioned. This is not exhaustive and we intentionally chose a different source for each entry on the list. Don't like what you find when you click? See if one of the other sources on the list has something you like on the topic you want to learn.

Programming - Learn Python with The Python Guru
Version control - Begin your journey with Git using GitHub's Git Handbook
Text editor - Vim has an incredibly useful starter program called vimtutor that teaches you the basics and is frequently installed by default in Linux distributions, but you may find the online Vim Adventures tutorial game more compelling
Virtualization - Enjoy an introduction to virtualization technologies in this online course from CloudAcademy
Linux - Start with this Introduction to Linux course from edX
CI/CD - Watch this introductory video from the University of Virginia via Coursera to understand what a CI/CD pipeline is and the high-level process involved
Containers
Microservices - See how Google creates scalable microservices with Kubernetes in this course from Udacity
Site Reliability Engineering - Learn how Google runs production systems using SRE with the complete contents of their book, provided online for free by Google

In addition to these, many SREs like to find ways to connect with others or learn new technologies. They use some less common social options, like finding a local Meetup group that is focused on DevOps, Site Reliability Engineering, Chaos Engineering, or even a specific vendor's technology like Amazon Web Services.

You will find many SREs searching for answers, asking questions, or sharing knowledge in communities like the Gremlin-sponsored Chaos Engineering Slack (which has participants from across the industry, well beyond Gremlin), Stack Overflow or Stack Exchange, AnandTech, Spiceworks, and even the technical areas of Reddit.

What companies have SRE training programs for employees?

Honestly, any company that has a mature and healthy SRE implementation will have developed a strong culture of collaboration. They are hiring you for who you are, for their belief that you are smart, reliable, imaginative, and have the right technical interests, background, knowledge, and experience to be successful.

Good companies begin with the expectation that you will need time to learn their technology stack. They may have you start using an onboarding runbook that trains newcomers on the system. Many will pair you up with an experienced senior-level SRE who will guide you into the position.

However, this is not universal. Never fear, though. You can discover a lot about a company's perspective on training employees just by reading their job listings and asking good questions during an interview.

Job listings are unlikely to use explicit phrases like "employee training programs" but they will demonstrate to you the company's values. For example, do they list their tech stack, even from a high level? If so, they value transparency and want you to know what you are getting into. Do they talk about working in a collaborative environment? Then they are likely to actively cultivate a work atmosphere and team dynamic that values good, clear, open, and honest communication, which will include teaching and training.

While you are interviewing, remember that just as the company is trying to determine whether you are a good fit for their team, this is your chance to see whether the company is a good fit for your needs. Ask good questions about how they onboard new SREs, about training, and about team culture and values. Read the compensation packages on offer closely. Let them know that you are as interested in how things work at the company as they are in how you might fit their needs.

What companies are actively hiring SREs?

There are numerous companies hiring Site Reliability Engineers, from small to large ones. Because smaller companies have fewer job openings, they are not as easy to include in a list that we want to be useful for a long time. For that reason, after we mention some big names to look at, we also list some job sites where you can do a search and find literally thousands of job openings.

Visit LinkedIn and Indeed for current SRE job listings.
