SRE 123

30 Sep 2020, 14:50

technical / devops / cloud / sre

Everyone is all on fire about SRE. But what is it really? It’s as easy as 1-2-3 if you boil it down to the basics.

You have read the book, right? If not, go read it. I’ll wait.

Site Reliability Engineering (SRE) means a lot of things to a lot of people. You may have read my post about How To Do DevOps. You also may have read my thoughts on Cloudy DevOps. SRE is not all that different. It’s DevOps through a different lens.

But it’s a lens that is distinctly SOFTWARE ENGINEERING. Yes, I did just yell that. I’ve talked to so many folks in the last 18 months who want to ride the buzzword wave and say they are doing SRE. They mean everything from dashboards to CI/CD to monitoring tools to alerting systems. It’s funny, in a way. And it’s fine to call it SRE, I guess. It’s still a free country. But I don’t think it really is SRE unless it’s intrinsically about building software.

Fundamentally, SRE is the practice of applying Software Engineering to Operations problems. That’s also why it’s DevOps. By my rules of DevOps, anyway (see my post referred to above). Is writing Terraform scripts SRE? Maybe. If it’s just automating deployment, it’s in the grey zone. If it’s part of automating the entire development process, then hell yes.

That’s why my team, who often spend most of their time really doing Site Reliability Engineering, get angry if you call them SRE. They will tell you “NO! We are Software Engineers.” And they are right. In my opinion, if you are not writing software you are not doing SRE work. You might be an architect improving a system’s reliability, uptime, operability, sustainability, performance, whatever… but taking Operations into account should be part of your job. You can’t really “be an SRE.” It’s a verb, not a noun. You “do SRE” and when you do that you are writing software.

This is STILL the part I keep seeing get missed, and STILL the part that is the hardest. Automating your cloud deployment only for production is Cloud 1.0 thinking. That’s table stakes today. If you think you want to do SRE, then you need to be automating the deployment of ALL your software, even in dev and test. You need to be deploying your whole system to run tests and then killing the whole thing. SRE is about building software, not just automating operations. And then it’s about building things into your software that make it work better, faster, and more reliably, and that enable faster recovery for WHEN (not if) something goes wrong. SRE is not just directly in support of Operations, though. It can be 100% about building tools, too.
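To make the "deploy the whole system, run tests, kill the whole thing" idea concrete, here is a minimal sketch in Python. It assumes you already have some way to stand up and tear down a full environment; the `deploy` and `teardown` callables are hypothetical placeholders for whatever your tooling provides, not a real API.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_environment(deploy, teardown):
    """Stand up a full environment, hand it to the tests, then destroy it.

    `deploy` and `teardown` are hypothetical callables supplied by the caller.
    `deploy()` returns a handle describing the environment, and
    `teardown(handle)` destroys it.
    """
    env = deploy()
    try:
        yield env
    finally:
        # Runs whether the tests pass, fail, or crash, so nothing is left behind.
        teardown(env)
```

You would wrap a test run in `with ephemeral_environment(deploy, teardown) as env:` and point the tests at whatever `env` describes. The point of the `finally` is that the environment dies even when the tests blow up.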

Kubernetes is a direct child of real SRE. It’s what happens when Software Engineers decided that they didn’t want to manage live systems when software could do it better.

Prometheus is a direct child of real SRE because it provides a real way to collect, track, and alert on metrics.

OpenTelemetry is a direct child of real SRE because it provides a real way to measure, track and alert on latencies and traces.

The people working on those products are doing SRE. Are they “an SRE?” No. SRE is Software Engineering. It’s writing code.

But do you have to be building tools to do SRE? Absolutely not. It may be writing something that supports automatic database replication to a second (or third) cloud region, or building circuit breaker proxies to limit the blast radius of a cascading failure, or refactoring connection code libraries to support failover connections… or whatever. All of those are SRE work. What makes it SRE work is the obsession over operations. That’s why not all Software Engineers are doing work I would call SRE. If you are building features of the product where the user story is for a non-Operations person, then you are probably not doing SRE. But if you are writing software that enables material improvements in Operations, then you are probably doing SRE.
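For a feel of what the circuit breaker example looks like in code, here is a minimal sketch in Python. It is an illustration under my own assumptions, not a production proxy: the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all made up for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected without being tried."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling more load on a sick dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down has elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures or self.opened_at is not None:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrap calls to a flaky dependency as `breaker.call(fetch, arg)`. Once the dependency has failed `max_failures` times in a row, callers get an immediate `CircuitOpenError` instead of waiting on timeouts, which is what keeps one failure from cascading.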

Conversely, if your company is calling you an SRE and you are automating Ops, that’s important, critical work. You are doing DevOps for sure. But to me, that’s not really SRE. That’s just what DevOps should be doing. SRE is hard-core Engineering with the same skills as your development team - only elevated, and probably at the cutting edge.

In a mature software organization whose developer base is well skilled in the technologies in play, it’s wise to distribute strong, senior Engineers across the teams to “do SRE” as they build the product. If your products - and thus your teams - are making a transition from older technologies, then having a centralized SRE team might make sense. You are unlikely to mature the products and the overall developer skills homogeneously. If you are like most companies, those teams are distributed across the world. It’s unlikely they would each choose similar approaches and tech stacks, and if you let that happen you don’t get SRE, you get Operations chaos. Operations needs the fewest possible technologies, wired up the simplest way, and SRE is all about keeping one eye on Operations. In those cases you are probably better off building out a centralized SRE team to enable consistent patterns at least - and to provide a single escalation point for Operations.

So in many ways, SRE is just an evolution of DevOps, from a certain point of view. It’s just a more comprehensive lens. It’s thinking every moment of every day about how the software will run, how it will fail, how it will recover, how it will self-heal, how it will degrade gracefully if it can’t heal, how it can be better without waking up a human… and then designing and building software to solve that. In many ways, it’s the highest calling for people who really care about the craft of writing software.

At least, it is for me.
