Thank you all for joining us last week for Failover Conf 2! We had a great turnout this year, with over 1,800 participants, 20 sponsors, and 9 amazing sessions.

After more than a year of virtual events and video calls, we know that Zoom fatigue is real. We tried to make this event different by finding new ways to bring the community together and shake up the conference formula. We hope you enjoyed the Slack channels, live Q&A sessions, fireside chats, and sponsored rooms (shoutout to Red Means Recording for the groovy tunes)!

Your questions answered

When you registered for Failover Conf, we asked you if you had any questions for our panelists about teams and culture, or observability and monitoring. You sent us a ton of great questions, but unfortunately our panelists didn't have time to answer all of them.

We didn't want to leave you hanging, so we passed your questions on to Gremlin's own Director of Engineering, James Thigpen. James has over a decade of experience building and leading high-performance software engineering teams, and moderated the panel discussion on the evolution of teams and culture at Failover Conf. Read his answers below!

Q: How do you create a culture of reliability in teams that haven't done reliability testing or Chaos Engineering before?

A: We do not create culture, we foster it. Engineers by and large want to create and build reliable systems.

Start there, with the motivations and desires of the people responsible for the systems that are struggling with reliability. Ask them how they want to resolve the issue, empower them to propose solutions, guide them through iterating on those solutions (our first ideas are rarely our best), prioritize the solutions the group comes up with, give them a set of tools (e.g., Gremlin) that support them on the road to achieving their goals, and celebrate their inevitable victories.

Wash, rinse, repeat.

Q: How do you influence culture when the organization is composed of silos due to mergers and acquisitions?

A: Silos can form organically in a single organization or as the result of M&A activity. This is a tough, long-term issue that won’t be resolved overnight, and it has to be addressed from two different angles.

First, horizontally, it’s critical that you try to be as inclusive as is reasonable with the people in the silos you work with. Communicate early and often, try to understand their problems and offer whatever support is possible to help solve them. Build relationships with them, and try to understand the culture and business that exists on its own terms.

Second, breaking down those silos has to be something leadership cares about and prioritizes. If it's not on leadership's radar, it will be really hard to drive results across the many and varied silos in the organization. It's important to understand what leadership's vision of success is with respect to these various teams.

Finally, it’s critical to understand that the various teams, subcultures, processes, etc. are not always things to be completely unified. Various teams can have different styles of work, and different values. The key is that there is explicit agreement on what the interfaces and seams between teams are, and there is alignment on the key values, which need to be shared by the entire organization for it to be successful.

Something to make explicit is that a part of this exercise is creating an environment where humans have the opportunity and motivation to build relationships with each other. This is an evolution that plays out over years, not weeks or months.


Q: How do you promote and embed ownership in a cross-functional world?

A: The key is to figure out how your organization is exposing people on one side of the fence to the problems on the other side of the fence. Rotation programs—where engineers go work on another team for ~6 weeks—are a great way to get meaningful experience in another part of the organization and a relatively deep understanding that they can bring back to their home teams.

Pairing is also really effective and can be done in a structured way to drive the cross-pollination that is most critical to the organization. Even small amounts of pairing (e.g., a couple of hours a week) can have a dramatic impact in this space.

Your code and teams should, as much as possible, be architected and chartered in such a way that teams can control their own destiny. If they are set up such that they cannot be successful in achieving their goals without throwing things over the wall to other teams, it will drive inefficiency and resentment over the long term.

And, obviously, any sentiment of “no other engineer is allowed to touch the code in my codebase” should be very closely examined.


Q: In a blameless culture, how does one successfully hold teams accountable for their actions?

A: I think this depends on what is meant by “accountable”. On the surface, it reads like “how do I punish people if they make mistakes?”, and if that is the attitude, then a blameless culture is not something you are interested in building.

In a blameless culture, we start from the belief that people are doing their best given the resources and information they have available. If we are punishing teams for making mistakes, then we are destroying psychological safety, and we are ensuring that those teams will over-index towards risk prevention in order to stay “safe”.

Performance management is still a tool in the toolbox though. Establishing a clear vision of success, communicating clearly when employees don’t meet those expectations, providing a path forward in those situations, and carrying through are all things every manager should be doing all the time.

It is possible to communicate in a manner which is both direct and kind.


Q: What strategies do you have for managing teams when maintaining an existing stack and also working on a redesign? How do you keep both teams engaged and motivated with the obvious disparities in the amount of attention each team is getting?

A: Sometimes the work we do is not glamorous and exciting, but that does not mean it is unimportant. Often, the team supporting the existing design is supporting the engine of the business while the more forward-leaning team is building the next version of that engine.

Connect the work they do with the importance of it: to the business, to the company finances, to the customers using it. Celebrate their victories as much as you celebrate the victories of the team working on 2.0.

Leadership has a big role to play in this situation in making sure those teams feel valued. The new thing will get a lot of praise by the nature of what it is, but it is critical for leadership to be aware of all the teams doing important work and ensure that they are well taken care of.

Q: Do you have any suggestions for anyone forming a new SRE team?

A: Separate out the goals of recording metrics for a system and moving those metrics in a positive direction. Too often we set out to instrument a system and improve it at the same time. Often we end up not knowing the start state and thus not being able to understand the impact of our efforts. Measure first, then influence.
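To make "measure first, then influence" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular tool: the function name `change_from_baseline` and the daily availability figures are hypothetical. The idea is simply that improvement can only be reported against a baseline that was recorded before the work started.

```python
from statistics import mean


def change_from_baseline(baseline: list[float], current: list[float]) -> float:
    """Return the change in the mean of a metric between two windows.

    `baseline` holds samples recorded BEFORE any improvement work began;
    `current` holds samples recorded afterwards. Both are hypothetical
    inputs, e.g. daily availability as a fraction between 0 and 1.
    """
    if not baseline:
        # Without a recorded start state there is nothing to compare against.
        raise ValueError("record a baseline before trying to measure improvement")
    if not current:
        raise ValueError("no current samples to compare with the baseline")
    return mean(current) - mean(baseline)


# Hypothetical numbers: three days of availability before and after a change.
before = [0.990, 0.992, 0.988]
after = [0.995, 0.996, 0.997]
print(f"availability changed by {change_from_baseline(before, after):+.4f}")
```

The check for an empty baseline is the point of the sketch: if nobody recorded the start state, the function refuses to report a result rather than guess at one.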

Reflecting on Failover Conf 2

The key theme of this year's Failover Conf was evolution. We've all had to adapt, overcome unexpected challenges, and become more resilient to change. As a community, we came together to share our experiences and learnings, from managing fully-remote teams, to fostering organizational culture, to building and maintaining systems that are growing increasingly complex. Here are some of our favorite highlights from the conference:

Special thanks to Mind’s Eye Creative for creating an amazing set of detailed illustrations for each session!

The evolution of teams and culture panel
The evolution of observability and monitoring panel
Fireside chat with Jeff Smith and Matt Stratton

Folks also shared their comments and insights on Twitter, especially around creating a culture of reliability, enabling teams to adapt to fast-changing situations, and building confidence in their ability to handle the unknown.

Watch the sessions

If you weren't able to attend Failover Conf, or you want to rewatch the sessions, all of the talks are available here. If you want to learn more about building a culture of reliability, you can also check out our guide to creating a culture of reliability, which provides additional information and links to resources. We hope to see you next year, ideally in person!

James Thigpen
Director of Engineering