Site Reliability Engineering: How Google Runs Production Systems

by: Jennifer Petoff

The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?

In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization.

This book is divided into four sections:

  • Introduction—Learn what site reliability engineering is and why it differs from conventional IT industry practices
  • Principles—Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
  • Practices—Understand the theory and practice of an SRE’s day to day work: building and operating large distributed computing systems
  • Management—Explore Google's best practices for training, communication, and meetings that your organization can use

The Reviews

Will have to relisten 1000 times, or just pick a section and work on that instead of overwhelming myself.

First off - it's worth noting that Google lets you read this entire book for free on their website. I bought the Kindle version anyways because I spend enough time in front of a backlit screen that it seemed worth it to read something this large using a device that's better on your eyes. Unfortunately the Kindle version is formatted terribly and I wish I'd bought the print version instead. The book is broken up into Parts which are broken up into Chapters which are further broken up into headlined sections. The Kindle version identifies those headlined sections as chapters, which is somewhat useless.
Anyways, the first few chapters aren't especially useful unless you work at Google. They mostly discuss what's unique about Google's computing infrastructure. Despite this, they were EASILY my favorite part of the book because the material is so interesting and their approach is so unique. After that, each chapter is written in a way that it can stand on its own if you aren't reading the entire book, or are reading it out of order. This is convenient for people who want to pick and choose what parts they want to read, but means that people who are reading the entire thing wind up getting a lot of the same information multiple times. It's all written by different people too, which on the one hand makes it not quite as repetitive, but on the other hand makes it hard to just skim over the sections with info you already have because you don't recognize it as information you already know until you've processed it.
Overall this is a fantastic book on DevOps, SRE, and current trends in the industry. It's a great read for anyone who wants to apply some "best practices" to their role. I would however say that reading the entire thing is overkill for most people and not necessarily the best use of your time if you have other things you'd like to be learning as well.
Part 1 - Fascinating read. I imagine this would be a good overview if you're about to start at Google and want a sneak peek at how things are done, but I'm only speculating this as an outsider.
Part 2 - Interesting and useful concepts for modern cloud computing.
Part 3 - Some useful info and a lot of stuff that's not really unique to Google in my experience. Read the parts that you think you could use some improvement on, skip the rest.
Part 4 - A condensed view from a managerial perspective of things you already read in Part 3.
Part 5 - Some case studies, comparisons from other businesses, a useless recap, and examples that could be useful to share using the website version of the book if you're trying to explain to your team what new concepts are being implemented.

Tons of nuggets about best practices, how they can be useful across industry, Google's tooling, how they got there, challenges faced, communication between engineers and SRE, how to look at problems, and so much more. There were parts of the book that got too deep or were not well explained, and ended up boring. I just skipped pages to move on to the next learning. Overall a good addition to my library.

I was amazed by the depth of this book, and the way it covers several aspects of what it takes to operate a complex and distributed software system. I was particularly impressed with the details of some chapters related to monitoring, load balancing (at the front end and back end), designing applications to manage overload conditions, and being on call. I think the book has a lot to teach and inspire. Some of the approaches described are very counterintuitive, like the error budget and the blameless postmortem culture. One of the shortcomings I noticed was that some chapters are hard to read because they treat rather advanced topics. The fact that the book has very few illustrations makes it hard to understand some of the concepts at times. Overall, an invaluable resource.

It's worth noting that there is a great Coursera course about SRE from Google. It will not cover as much as the book, but it is a distilled version to learn the basics. This book has a lot of great information, which I found invaluable over the years. One of the harder things for growing organizations is to keep teams focused, and I've seen that DevOps and SRE practices help to zero in on what is essential. A lot of automation-related work feels like 'yak shaving,' which is a term to refer to entirely unrelated things that don't add value to our product. For development teams, this feels very frustrating. Why would I want to make a script to automate this? We only use it once a year! SRE helps to solve these frustrations, to some extent, with practices that help organizations understand why they should communicate, why they should talk about issues, and why we measure some things on some level and not others.

The practices of Site Reliability Engineering are all well known, and successful teams practiced them before there was a collective name for them. The establishment of SRE as a discipline is ascribed to Google, and many organizations are trying to hire SREs and establish SRE teams, even though the definition of what a Site Reliability Engineer does varies by team. My hope in reading this book was to get a canonical definition of SRE, and also to learn more about practices I can apply to my work. The early chapters of the book did a good job in terms of determining the bounds of SRE, though there are certainly some fuzzy edges. The latter goal, learning practices, was mixed. Because the book is actually a collection of chapters, the writing is a bit uneven. Some chapters do a good job of walking you through the subject area, and distinguish how what Google does could apply to your organization and toolchain. Others are Google-centric to a fault, describing internal tools with little if any reference to similar, more generally available or even open source tools. Google is a successful company and has solved some challenging problems, and we can learn a lot from the Google practices. It’s important to remember that Google is also unique, in terms of history and problem space, so one should consider adapting aspects of the Google Way, rather than adopting it without interpretation. This book is a great launching point for discussion, and it’s worth having a copy if you deploy systems at scale. Just don’t take it as The Way to do Site Reliability Engineering.

After receiving this publication, I quickly realized this content was a trove of knowledge. There are sections I need to review again as there is so much that can be taken from each chapter, that I was overwhelmed. Understanding incident response, toil, after-action reports, and sharing knowledge were just a few nuggets I was able to glean from the text. Definitely, a must read to assist with expanding your knowledge on revamping how to tackle your production environment.


I enjoyed reading this book and learning about challenges Google faced internally as they grew their SRE team. I found some of the "war stories" about big outages fun to read, like Google Music having to restore customers' music data from tape drives via trucks. The most useful parts of the book to me were around scaling the SRE team and how they went from white glove support, to identifying and recommending best practices, to finally building out automated tools to provide teams with these best practices. Some chapters were pretty weak and could have been cut. For example, a chapter near the end comparing Google SRE practices with outside companies basically amounted to "Google already does all the best practices that the industry uses." Other chapters mostly just repeated content contained elsewhere in the book. Also, what about some open-ended content or questions about further improvements that could be made? Not present.

Good coverage of the concepts at a medium level.

If you are in any role, please read this book to understand the core concepts.

Site Reliability Engineering: How Google Runs Production Systems
⭐ 4.7 💛 870
kindle: $31.34
paperback: $15.56