This article is intended for engineering managers and leaders working with Site Reliability Engineering (SRE) teams. We begin with an example SRE job description that you can copy, paste, and edit for your specific location and needs. The description includes a sample list of desired skills, based on Gremlin's experience working with companies across many industries, to help you assess a candidate's skill level.
We complete the piece with a set of SRE interview questions that include information about how to evaluate and think about candidate answers. You can also adapt and enhance these for your specific needs. You may also want to peruse the listings in the #jobs channel of the Gremlin-sponsored Chaos Engineering Slack, which has participants from across the industry well beyond Gremlin.
We realize that SRE candidates are also likely to read this article. Good. We want people to be prepared and ready. However, we are not providing a candidate cheat sheet, but rather a resource for those doing the hiring. Candidates would be better served focusing on our article, How to Become a Top-Notch Site Reliability Engineer. If you have the experience and the skills, answering the questions will be easy. This article is about figuring out how to write an intriguing job description and how to ask the right interview questions to help companies find the perfect match for their SRE team needs.
Do you enjoy working with a highly motivated and talented team to deliver mission critical software? [company name] is growing our Site Reliability Engineering team to help deploy, manage, troubleshoot, and enhance our complex cloud-based services for a wide variety of customers.
As a Site Reliability Engineer you will design and implement web applications and REST API services using a microservice-based infrastructure to replace our current monolith implementation. The new technology stack includes [Amazon Web Services (AWS)/Google Cloud/etc.], [Docker/Kubernetes/other], [relational database], [NoSQL/NewSQL database], and [monitoring tool]. Your focus will be on maximizing system uptime. Team members all participate in an on-call rotation.
You will build innovative automated solutions and tools to help debug and resolve problems in production and prevent them from recurring. Further, you will use monitoring data, trend analysis, and Chaos Engineering to proactively seek out system weaknesses and fix them before they cause production issues.
It's not expected that any single candidate would have expertise across all of these areas - we're looking for candidates that are particularly strong in a few areas, and have some interest and capabilities in others.
At [company name] our mission is to [insert company mission]. Our products help software companies [do something awesome] - thereby empowering businesses and individuals to [save time and money]. Our customers include [name], [name], [name], and [name]. [company] is a unique place to work and offers competitive compensation packages that include medical, dental, and vision benefits with flexible PTO and a 401k with company-matched contributions [up to X%].
[company] has a [industry] startup culture that emphasizes transparency, collaboration and career growth, with the ability to work on small, nimble teams. Employees are able to create change at scale and have an opportunity to truly disrupt and shape [industry].
[company] is an equal opportunity employer. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender perception or identity, national origin, age, marital status, protected veteran status, or disability status.
Learn more at [company URL].
Our sample questions do not form a complete set, and we do not recommend that anyone use them without first considering the needs of the hiring company and team. Modify the questions to help find someone who is a great fit for the role the team needs filled. Many would also work well as DevOps interview questions. The important thing is to decide how the questions fit into your interview process. Most of our sample questions are focused on the technical interview.
The goal of these questions is to help gauge a candidate's knowledge, experience, and ability to answer with technical competence and clarity while interacting with the interviewer. We wouldn't expect any but the top candidates for senior-level positions to answer all of these. How a candidate handles not knowing an answer, by being transparent and discussing how they would approach a solution, is one of the most valuable indicators to look for in a job interview.
A service-level objective (SLO) defines the target availability (uptime) we want for a system or service. We define reliability as meeting our SLOs.
A service-level agreement (SLA) is the uptime promise that we make to a customer. SLAs are often legally defined, with penalties for missing the target availability. For this reason, SLAs are generally set using figures that are easier to meet than SLOs.
A service-level indicator (SLI) is something you can measure with precision to help you think about, define, and determine whether you are meeting SLOs and SLAs. SLIs are generally reported as the ratio of good events to total events. A simple example would be the number of successful HTTP requests / total HTTP requests. SLIs are frequently reported as a percentage, with 0% meaning everything is broken and 100% meaning everything is working perfectly.
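As a rough illustration of how the three terms relate, the sketch below computes an SLI as good events over total events and compares it with an SLO and a looser SLA. The function names, the request counts, and the 99.9% and 99.5% targets are assumptions chosen for the example, not figures from any real contract.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Return the SLI as a percentage: good events / total events * 100."""
    if total_events == 0:
        # No traffic means nothing failed; treating this as 100% is a design choice.
        return 100.0
    return 100.0 * good_events / total_events


def allowed_downtime_minutes(target_percent: float, window_days: int = 30) -> float:
    """Return how many minutes of downtime a target allows within the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_percent) / 100.0


SLO_PERCENT = 99.9  # hypothetical internal target
SLA_PERCENT = 99.5  # hypothetical contractual promise, easier to meet than the SLO

# Hypothetical month: 997,000 successful HTTP requests out of 1,000,000.
sli = sli_percent(good_events=997_000, total_events=1_000_000)
meets_slo = sli >= SLO_PERCENT
meets_sla = sli >= SLA_PERCENT

print(f"SLI: {sli:.2f}%")
print(f"Meets SLO: {meets_slo}, meets SLA: {meets_sla}")
print(f"SLO downtime budget: {allowed_downtime_minutes(SLO_PERCENT):.1f} minutes per 30 days")
```

In this made-up month the service misses its internal SLO but still honors the SLA, which is the situation the gap between the two figures is meant to allow for.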
It's a data structure in which each data element is stored in a separate node. Nodes are connected (linked) using pointers. The list starts with a head, which is a reference to the first node in the list. Each node includes a data element and a reference to the next node. The final node, the tail, includes a data element and a reference to null, indicating the end of the list.
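A candidate might sketch this on a whiteboard along the following lines. This is a minimal singly linked list, assuming Python; the class names and the sample values are illustrative only.

```python
from typing import Any, Iterator, Optional


class Node:
    """One element of the list: a data value plus a reference to the next node."""

    def __init__(self, data: Any) -> None:
        self.data = data
        self.next: Optional["Node"] = None  # None marks the end of the list


class LinkedList:
    def __init__(self) -> None:
        self.head: Optional[Node] = None  # reference to the first node

    def append(self, data: Any) -> None:
        """Walk to the tail and link a new node after it."""
        new_node = Node(data)
        if self.head is None:
            self.head = new_node
            return
        current = self.head
        while current.next is not None:
            current = current.next
        current.next = new_node

    def __iter__(self) -> Iterator[Any]:
        """Yield each data element by following the next references."""
        current = self.head
        while current is not None:
            yield current.data
            current = current.next


services = LinkedList()
for name in ("dns", "cache", "db"):
    services.append(name)

print(list(services))
```

Appending here walks the whole list, so it takes time proportional to the list length. A good follow-up is to ask how the candidate would make it constant time, for example by keeping a tail reference.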
Queue, stack, heap, hash table, binary tree, etc.
Depending on your needs, this could be followed up with a question about algorithms.
This is a BIG question, and it will be interesting to see how the candidate answers. Ultimately, you aren't necessarily looking for comprehensive knowledge, but rather for whether they can name the main points of interest and do so with clear definitions.
The domain name system (DNS) is a decentralized naming system for resources connected to the internet or a private network. These resources are assigned internet protocol (IP) addresses, which are defined strings of unique identifying numbers that follow a precise format. However, humans cannot feasibly remember IP addresses, so DNS allows the assigning of a human-readable name, such as google.com, to use in place of the IP address.
They may also talk about IPv4 versus IPv6, DNS records and the fields involved and how to create one, nameservers and decentralization and the existence of a set of canonical root nameservers, queries, caching, primary versus secondary DNS settings, reverse DNS lookups, DNS zones, and security concerns. All of these are important, but you are really looking at whether the candidate understands the big picture and how they communicate it to you.
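To make a couple of these topics concrete, here is a small sketch, assuming Python's standard library, of a forward lookup through the system resolver and of the names used for reverse DNS lookups. The helper names are hypothetical, and the addresses come from ranges reserved for documentation.

```python
import ipaddress
import socket
from typing import List


def reverse_lookup_name(ip: str) -> str:
    """Return the DNS name that a reverse (PTR) lookup for this address queries."""
    return ipaddress.ip_address(ip).reverse_pointer


def resolve(hostname: str) -> List[str]:
    """Forward lookup: ask the system resolver for the addresses behind a name."""
    results = socket.getaddrinfo(hostname, None)
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in results})


# Reverse lookups use special zones: in-addr.arpa for IPv4 and ip6.arpa for IPv6.
print(reverse_lookup_name("192.0.2.10"))
print(reverse_lookup_name("2001:db8::1"))

# Forward lookup of a name that is normally answered locally, without a network.
try:
    print(resolve("localhost"))
except OSError as error:
    print("lookup failed:", error)
```

Calling `resolve` with a public hostname goes through whatever resolver and cache the machine is configured to use, which can lead naturally into a discussion of caching and nameservers.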
They must name relational databases as one of the types, like MySQL, Postgres, Oracle and so on.
After that, we are looking for what sorts of other databases they may know of or have familiarity working with. The candidate should be able to describe the difference between each type they name. Here are some examples:
Key/value stores: BerkeleyDB, Cassandra, etcd, Memcached and MemcacheDB, Redis, Riak
Document stores: CouchDB, MongoDB
Wide column stores: BigTable, HBase
Graph stores: FlockDB, Neo4j, OrientDB
An inode is a data structure in Unix/Linux that contains metadata about a file. Some of the items contained in an inode are:
The filename is not stored in the inode itself. It is stored in the parent directory's entry for the file, which maps the name to the inode number.
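One way to see the separation between names and inodes is to create a hard link and compare the metadata reported for both names. The sketch below assumes Python on a POSIX filesystem that supports hard links; the file names are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "report.log")
    alias = os.path.join(workdir, "report-alias.log")

    with open(original, "w") as handle:
        handle.write("hello\n")

    # A hard link adds a second directory entry that points at the same inode.
    os.link(original, alias)

    info_original = os.stat(original)
    info_alias = os.stat(alias)
    same_inode = info_original.st_ino == info_alias.st_ino

    # os.stat reports metadata held in the inode, none of which is the name.
    print("inode number:", info_original.st_ino)
    print("size in bytes:", info_original.st_size)
    print("permission bits:", oct(info_original.st_mode & 0o777))
    print("hard link count:", info_original.st_nlink)
    print("same inode for both names:", same_inode)
```

Both names report the same inode number and a link count of 2, because each name is only a directory entry that refers to the one inode.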
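RAID 0 uses striping, which splits the data across two or more disks and provides no redundancy. RAID 5 is striping with distributed parity across three or more disks, which allows the array to survive the failure of a single disk and rebuild its contents. RAID 0 strictly emphasizes performance, while RAID 5 introduces fault tolerance at the expense of somewhat lower write performance and the capacity used for parity.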
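The parity idea can be shown with a few lines of XOR arithmetic. This is a simplified sketch of a single stripe, assuming Python; real RAID 5 implementations rotate the parity block across disks and work on much larger blocks, and the block contents here are invented.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks, each imagined to live on a different disk.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)  # stored on a fourth disk for this stripe

# Simulate losing the second disk, then rebuild it from the survivors plus parity.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print("rebuilt block matches the lost one:", rebuilt == data_blocks[1])
```

Because XOR is its own inverse, any one missing block can be recomputed from the rest. With RAID 0 there is no parity block, so losing one disk loses the data.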
If a filesystem is full, and you see a large file that is taking up a lot of space, how do you make space on the filesystem?
There are several options. We want the candidate to name at least one of them, or something just as good. Perhaps follow up with a question about when and why their answer might be suitable and when a different option would be better.
cp /dev/null on the file, which will reduce its size to 0.
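The same effect can be had programmatically by truncating the file in place. The sketch below assumes Python and uses a temporary file as a stand-in for the large file; the helper name is hypothetical.

```python
import os
import tempfile


def truncate_in_place(path: str) -> None:
    """Cut a file to zero bytes without deleting or replacing it."""
    os.truncate(path, 0)


# Create a stand-in for the large file (about 1 MB of filler data).
with tempfile.NamedTemporaryFile(delete=False) as big_file:
    big_file.write(b"x" * 1_000_000)
    path = big_file.name

size_before = os.path.getsize(path)
truncate_in_place(path)
size_after = os.path.getsize(path)

print(size_before, "->", size_after)
os.remove(path)  # clean up the temporary file
```

Truncating keeps the same file in place, so a process that still has it open (a logging daemon, for example) can keep writing to it. Deleting a file that a process still holds open does not free the space until that process closes it, which is a useful point for the follow-up discussion.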
kill -15 sends a TERM signal, which attempts to gracefully stop a process. It is the default.
kill -1 sends a HUP signal, which many daemons treat as a request to reload their configuration.
kill -9 sends a KILL signal, which kills a process immediately and cannot be caught or ignored.
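The difference between the three signals shows up clearly when a process tries to handle them. The sketch below assumes Python on a Unix-like system, run from the main thread; the process signals itself, and the handler only records what it receives.

```python
import os
import signal
import time

received = []


def handle_signal(signum, frame):
    """Record the signal instead of exiting, as a reloadable daemon might."""
    received.append(signal.Signals(signum).name)


# TERM and HUP can be caught, so a process can clean up or reload.
signal.signal(signal.SIGTERM, handle_signal)
signal.signal(signal.SIGHUP, handle_signal)

# KILL cannot be caught: the operating system rejects the handler.
try:
    signal.signal(signal.SIGKILL, handle_signal)
    kill_is_catchable = True
except OSError:
    kill_is_catchable = False

for signum in (signal.SIGHUP, signal.SIGTERM):
    os.kill(os.getpid(), signum)  # like running `kill -1` or `kill -15` on this pid
    time.sleep(0.1)  # give the interpreter a moment to run the handler

print("handled:", received)
print("KILL can be caught:", kill_is_catchable)
```

Since a process gets no chance to clean up after KILL, a reasonable follow-up is to ask why TERM should usually be tried first.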
You can follow this up nicely with a discussion of important system calls.
Bonus points if they start by talking about a bare metal server.
Virtualization installs a control layer on top of a set of bare metal servers to create a pool of resources from the combination of the physical resources of those servers. It then allows you to create "virtual machines" that have a varied combination of memory, storage, and processor resources according to need, each machine with its own operating system. Virtual machines can be created and destroyed quickly and easily.
Containers are similar, except they do not include their own operating system kernel. Instead, the container runtime gives each container access to the host's operating system while keeping the containers and their processes isolated from one another. Containers include software, such as a microservice, along with all of the software dependencies required to run it. This provides isolation and flexibility.
Kubernetes adds an orchestration layer to containers, making them easier to manage, especially in large systems.
Common answers are "using someone else's computer" or running services on equipment in someone else's data center. Follow up with a question about why companies use any of the various cloud platforms (save money, offload maintenance, etc.).
You are looking for their thinking process, their organization, and how methodical they are in finding problem sources. You are also looking for how creative they can be in solving them.
Every architecture is different, so you are looking for them to mention networking problems, resource allocation, unusual service interactions, and so on.
Do the candidate's steps match with your company's? Close? Is the candidate open to suggestions or do they act like they have the definitive answer (like a know-it-all)?
You want to learn about how the candidate thinks about interacting with coworkers to gauge how those thoughts fit with your company's current culture as well as the culture you want in the future.
Site Reliability Engineering (SRE) is the outcome of combining IT operations responsibilities with software development. With SRE there is an inherent expectation of responsibility for meeting the service-level objectives (SLOs) set for the service they manage and the service-level agreements (SLAs) we promise in our contracts.
DevOps. Site Reliability Engineering (SRE). Are they different or just different names for the same thing? This article explores that question in depth by delving into each and then comparing them.
What do Site Reliability Engineers do and what exactly are they responsible for within an engineering organization? While the specifics will depend on your company, there are some general trends for how SRE teams tend to organize themselves. This article focuses on how SRE teams share responsibilities across members while at the same time recognizing the strengths each member brings to the team as they work towards a common reliability goal.
You have some experience with programming or systems administration, development or operations, and now that you have heard about Site Reliability Engineering (SRE) you think this sounds like something you would like to do as your next step. This article will help you learn in greater detail what you need to know to not only be successful, but one of the best SREs.