How Many SREs Does Your Company Need? Here’s How to Decide
Tips for deciding how many SREs your company should hire.
October 16, 2024
7 mins
Treating your incident response playbook as rigid can backfire. Incidents demand flexibility, judgment, and real-time decision-making. Discover how to balance process with empowerment and foster a culture where responders can make effective choices under pressure.
Everyone wants to make their incident response process better. So they iterate and iterate, refining the process based on what went well and what didn't after each incident. The document that collects all of this feedback is the playbook. Playbooks are crucial, yes. But treating them as instruction manuals in the middle of an incident can lead to less-than-happy results.
Incidents are inherently unpredictable. Your playbook might suggest what steps to take, but in real life incidents unfold in ways no one can predict. What seemed simple in theory quickly becomes complex in practice.
In this post, we’ll explore the common trap of treating playbooks as rigid instruction manuals, how it hurts our response efforts, and what we can do instead.
No matter your scale, you’ll have to deal with incidents. When you’re a startup, it’s easy to just have everyone hop on a call and figure things out. But as the company grows, the complexity and number of your incidents grow with it. Your team gets frustrated, and communication with customers gets messy. “Winging it” stops being an option.
This is usually when somebody who’s seen a lot of incidents takes it upon themselves to put together some sort of incident response plan. This person may come from the engineering team, but it is not rare to see the playbook emerge from the ops or customer support teams.
Your first incident response playbook is probably going to look like a Frankenstein: some generic enterprise framework stitched together with the individual experiences of the person who stepped forward to write the plan in their spare time. It’s a start, but the road towards a functional playbook will be rough.
After each important incident you want to make sure you incorporate the learnings from your retrospective into your incident response process. You’ll add more branches to cover more cases in your playbook, or extend existing sections to provide clearer guidance. While that sounds good in theory, it rapidly turns playbooks into monsters. They become these bloated documents filled with detailed instructions, rules, and policies.
The more complicated the playbook, the more mental toll it imposes on your responders. Ultimately, this creates an environment where responders are more worried about doing things by the book than doing the right thing for the situation. This hurts not only your responders but the overall effectiveness of your response.
Imagine being in the middle of a high-severity incident, with major outages and frustrated customers, while you’re flipping through a giant document trying to figure out what to do next. The problem with over-reliance on playbooks is that, over time, your team leans so heavily on what’s documented that they stop trusting their own judgment.
Responders stop making real-time decisions based on the context of the situation and instead focus on following the rules.
So, if playbooks aren’t the answer, what is? The key is to shift the mindset. Instead of focusing on “What’s the process for this type of incident?” you should be arming your responders to make better decisions in unpredictable circumstances. This mindset shift is essential for building confident, strategic responders who can think on their feet when things get chaotic.
Instead of relying solely on process, we need to empower people to make real-time judgment calls. But how do we do that without throwing out the structure entirely? Here’s where I think we can find a balance: by implementing strategies that create structure at the right level, while still leaving room for critical thinking.
Automation is a critical part of modern incident response. I personally think it's a necessity. Of course, you cannot automate the entire process, but you can automate repetitive tasks. For example, you can automate executive escalation paths, internal communication updates, or reminders to update the status page.
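To make the status page reminder concrete, here is a minimal sketch of the check such an automation might run. The 30-minute cadence and the function name `status_page_reminder_due` are assumptions for illustration, not a recommendation or any particular tool's API. How the reminder is delivered (chat message, page, ticket) is left out on purpose.

```python
from datetime import datetime, timedelta

# Hypothetical cadence: how often the status page should be refreshed
# during an active incident. Pick a value that fits your own policy.
STATUS_PAGE_INTERVAL = timedelta(minutes=30)


def status_page_reminder_due(
    last_update: datetime,
    now: datetime,
    interval: timedelta = STATUS_PAGE_INTERVAL,
) -> bool:
    """Return True when the last status page update is at least `interval` old."""
    return now - last_update >= interval


if __name__ == "__main__":
    last = datetime(2024, 10, 16, 9, 0)
    if status_page_reminder_due(last, datetime(2024, 10, 16, 9, 45)):
        print("Reminder: the status page has not been updated recently.")
```

The point of a check like this is that nobody has to remember the clock during an incident. The responder still decides what the update should say.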
Heuristics are mental shortcuts based on perceived truths. They allow you to make quick judgment calls with limited information. In the context of incident response, heuristics can be incredibly useful. For example, if your organization is risk-averse due to compliance reasons, a heuristic might be “assume the worst” when assessing an incident’s severity. This gives your responders a guideline for how to approach a tough decision under pressure.
Another example could be, “Our policy is to protect data at all costs.” This makes the decision clear when faced with a choice between a high-risk action that could cause data loss or a lower-risk approach that extends downtime.
Heuristics aren’t perfect, but they provide responders with a framework for making decisions quickly.
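If your tooling asks responders to pick a severity, the “assume the worst” heuristic can be encoded as a default. The sketch below is only an illustration: the `Severity` levels and the `assume_the_worst` helper are hypothetical names, and treating "no information yet" as critical is an assumption that suits a risk-averse organization, not a universal rule.

```python
from enum import IntEnum
from typing import Iterable


class Severity(IntEnum):
    """Hypothetical severity scale; a higher value means a more severe incident."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def assume_the_worst(plausible: Iterable[Severity]) -> Severity:
    """Pick the most severe of the plausible severities.

    With no assessment at all, fall back to CRITICAL (an assumption
    that matches the risk-averse example in the text).
    """
    return max(plausible, default=Severity.CRITICAL)


if __name__ == "__main__":
    # Responders disagree on whether this is MEDIUM or HIGH: start at HIGH.
    print(assume_the_worst([Severity.MEDIUM, Severity.HIGH]).name)
```

Responders can always downgrade later. The heuristic only settles where to start when there is no time to debate.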
Tripwires create clear checkpoints or boundaries for when to take action. They give your responders a frame of reference for making decisions, especially when ambiguity can become a blocker.
For example, let’s say your customer support team is dealing with a spike in inquiries during an outage. Instead of asking, “Should we email all customers?” you can set a tripwire: if support load exceeds three times the normal volume, trigger a mass email. This changes the conversation from “What should we do?” to “Do we see a reason to make an exception to our tripwire?” It reshapes decision-making in a structured way, while still leaving room for context and judgment.
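The support-load tripwire above can be sketched in a few lines. The threshold of three times normal volume comes from the example. Everything else (the function name, the ticket counts as inputs, the wording of the recommendation) is assumed for illustration. Note that the function only recommends an action. The exception check stays with the responders.

```python
from dataclasses import dataclass

# From the example in the text: trip when load exceeds 3x the normal volume.
TRIPWIRE_MULTIPLIER = 3.0


@dataclass
class TripwireResult:
    tripped: bool
    ratio: float
    recommendation: str


def check_support_tripwire(current_tickets: int, baseline_tickets: int) -> TripwireResult:
    """Compare current support load against the normal (baseline) volume."""
    if baseline_tickets <= 0:
        raise ValueError("baseline_tickets must be positive")
    ratio = current_tickets / baseline_tickets
    tripped = ratio > TRIPWIRE_MULTIPLIER
    if tripped:
        recommendation = "Send the mass email unless responders see a reason to make an exception."
    else:
        recommendation = "Below the tripwire: no mass email by default."
    return TripwireResult(tripped=tripped, ratio=ratio, recommendation=recommendation)


if __name__ == "__main__":
    result = check_support_tripwire(current_tickets=340, baseline_tickets=100)
    print(result.tripped, result.recommendation)
```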
No amount of strategy or structure will work without the right culture. A culture of fear around incidents will always undermine your efforts to create a better incident response process. I've seen many well-intentioned organizations inadvertently create environments where responders are afraid to make the wrong decision, leading to slow and ineffective responses.
The key is to create a culture where people feel empowered to make decisions, exercise judgment, and do the right thing—even when the pressure is high. Incident response is a high-stakes, high-reward environment, and when people are empowered to step into that ring, you’ll be amazed at the quality of their responses.
At the end of the day, incident response is about more than just bringing systems back online. It’s about communication, judgment, and empowering people to make thoughtful, human responses to complex problems. Playbooks can be helpful, but they should serve as a guide, not as the final word. By building a culture of empowerment, using automation and heuristics strategically, and setting clear tripwires, you can create an incident response process that not only works but works well.
Let’s not forget that incident response is about more than just getting systems back online. One of the most overlooked aspects of incident management is communication—across teams, with customers, and with stakeholders. As your company grows, it’s not just about solving technical problems anymore; it’s about managing the message and keeping everyone in the loop.
When people rely too much on playbooks, communication gets messy. You might have detailed steps about who to inform and how to phrase things, but that often results in canned, impersonal communications. You’ve probably experienced this yourself: a confusing email during a system outage that doesn’t feel like it’s addressing the real problem. That’s what happens when responders aren’t empowered to think critically and tailor their message to the specific situation.
Let’s face it: your customers know when you're sending them a canned statement. They can tell when you’re just trying to cover your legal bases. It’s the difference between an email that says, “We’re experiencing some issues; your service will be restored soon,” and one that says, “We understand this outage has disrupted your business, and we’re working around the clock to fix it. We’re sorry for the inconvenience.” The latter builds trust; the former erodes it.
The lesson here is that communication during an incident needs to be authentic, not just a regurgitation of pre-approved language from a playbook. Responders should feel empowered to decide in real time how to communicate, based on the situation’s context.