My Indispensable SRE Book List
Five Reads for Learning SRE
- How Complex Systems Fail. Richard Cook. Not technically a book; it’s the only “paper” that appears here. It seems like I facilitate a reading of this treatise about every year, and it is maybe the first thing I will tell any technologist (dev or ops, data or designer) to dig into and start asking their own questions.
  [how.complexsystems.fail]
- Seeking SRE. Compilation of experts on SRE. This large volume outlines a lot of the problems, solutions, and constructs we consider. It offers a more realistic set of contexts for the “non-FAANG SRE” and hits closer to home for me than the very-often-mentioned books on SRE that describe a single company. My favorite chapter features Richard Cook (surprised?) introducing the Line of Representation.
  [Amazon]
- Making Work Visible: Exposing Time Theft to Optimize Work & Flow. Dominica DeGrandis. Working as an SRE on a team in a business means the team needs to understand how operations work against the backdrop (and challenges) of a larger organization. It is crucial for SRE teams to get a handle on workload and flow, and I have never read a better book on how to get successful results from managing operational work.
  [Amazon]
- The Field Guide to Understanding Human Error. Sidney Dekker. This book is not about SRE, yet at the same time it is everything about SRE, especially when it comes to understanding how to navigate emergency response and the post-incident. It is worth putting on the list because it ties together many of the concepts from all the other books listed here.
  [Amazon]
- Kill It with Fire: Manage Aging Computer Systems (and Future Proof Modern Ones). Marianne Bellotti. I just finished this book this past summer. Instead of including a favorite of mine on what metrics are useful to measure, I decided this book needed a place at the table at the top. Like these other resources, this book applies in so many similar situations found across wildly different contexts.
  [Amazon]
About My List
It’s kinda hard to keep this at five; I have read an extensive amount on Resilience and Reliability Engineering. It’s also hard not to just throw in a bunch of links to other things, but I don’t intend this to be exhaustive. It is only a snapshot of what’s on my mind, and it could change next year.
So maybe it’s kinda easy to keep it at five. And when I think about the things SRE does that distinguish us from other very similar and Venn-crossy disciplines (Infrastructure, Platform, or Deployment Engineering come to mind), I think of these five pieces of Resilience literature.
These move the Venn diagram of “what SRE does” across not only technology disciplines, but entire industries. They are proof that we are all dealing with the same energies of resilient performance and efficiency-thoroughness trade-offs. They are in no particular order; they share a Number One spot for me as indispensable for either aspiring or maturing SREs.