SRE as a team sport
Tracy Ferrell and Phil Beevers on the principles of Site Reliability Engineering and successful SRE teams.
The responsibility for operating data centers has evolved from responding to a buzzing pager to building software systems that heal themselves. The role of systems operator has thus taken on a new name: Site Reliability Engineer (SRE), reflecting what happens when you treat operations as software. In this interview at Google Cloud Next in London this past November, Tracy Ferrell and Phil Beevers were expertly guided by Yaniv Aknin through their views about the principles of SRE, along with a range of insights about team building, staff development, and other practical aspects of the work.
The term “engineering” comes loaded with meanings. In part of the interview, Ferrell and Beevers contrast engineering with operational tasks. Here, “operational” means responding to immediate needs such as a failure or a missed service level objective (SLO): firefighting, in their metaphor. They contrast this constant state of “operational” response (which can be exhausting) with the long-range “engineering” skill of anticipating problems so they don’t occur in the first place. Part of this type of engineering involves collecting metrics that can alert you to oncoming problems; another part is building up enough experience with SRE to see the problems in advance: “to skate where the puck is heading,” as Beevers puts it. He goes on to say that understanding the business and the market is a key part of this skill. Both interviewees also stress the importance of “great postmortems” to analyze system failures, and of extrapolating from the details of a previous failure to larger “failure modes” that may look different the next time they appear.
Engineering has a second meaning in this interview: it is contrasted with leadership, another key SRE skill. In this contrast, engineering covers all the technical work while leadership covers people skills. Both are critical to an SRE, according to the interviewees, and more critical in SRE than in software development roles. The great skill of a leader is to prioritize when there are many competing needs: in particular, to strike a productive balance between urgent needs and longer-range accomplishments.
The importance of leadership springs from the extra layer of separation between SREs and users. Traditionally, operators weren’t directly trained to understand user needs, as developers and their project managers are. This was not a problem for conventional operations because operators didn’t have to take user needs at the application level into account (at least in theory): they just had to keep the machinery well oiled and running. But the sophistication of modern environments, along with the importance of good performance and more specific SLOs, requires an SRE to understand what the user is trying to do and how the machines’ operational behavior affects these goals.
Beevers calls for SREs to have “unbridled compassion” for the users. It’s also important to speak the language developers use to describe the user experience, rather than talking only about technical operational goals.
Now that the job of the SRE has been defined, how can you create a successful team and develop its members? Ferrell and Beevers emphasize, beyond all tools or technical skills, the importance of good communication and teamwork, which create “psychological safety.” People have to feel both that their managers and peers support them and that they have “agency,” the chance to exercise their skills. Ferrell cites, as a marker of success, a major network outage that was handled by fairly low-level staff while high-level management kept out of it.
Listen to the full interview for details about achieving organizational goals through SRE while keeping a team motivated.
This post is a collaboration between O’Reilly and Google. See our statement of editorial independence.