August 23, 2024
7 mins
Handling SEV0 incidents requires careful but swift action. Learn how top-performing teams deal with them at scale.
Just a few weeks ago, millions of developers had a 36-minute forced break when GitHub's global outage took down most of its core services, preventing them from pushing upstream or running GitHub Actions. The culprit was a misconfiguration in GitHub's database infrastructure, which was rapidly resolved with a rollback.
Could this be considered a SEV0 incident? While the impact was massive, the time to mitigation was so short that it did not merit the scandalous title. GitHub teaches us a valuable lesson when dealing with high-severity incidents: no matter how grim the situation looks, do not jump to conclusions.
The boundary between declaring a SEV1 and a SEV0 is somewhat fuzzy in most organizations. Crossing it, though, should be a very thoughtful decision. Depending on your scale, the mere fact of naming an incident a SEV0 will bring you immediate (bad) press attention, investor inquiries, and reputational damage.
As a CEO in the incident management space, I’m constantly in the trenches helping our enterprise partners deal with high-severity incidents. This year alone, Rootly has helped manage more than 150,000 critical incidents. In this article, I’ve compiled a list of best practices I’ve seen work effectively at leading tech companies.
SEV0 is the most catastrophic type of incident you can encounter. But before you get there, most incidents will start at lower levels of severity and only scale up when more impact is discovered and confirmed.
There are different schemes for defining incident severity levels. Enterprise frameworks like ITIL (Information Technology Infrastructure Library) or COBIT (Control Objectives for Information and Related Technologies) provide basic guidelines on what constitutes each level.
But ultimately, you’ll have to define your own guidelines, with parameters that apply to your context and requirements. It’s even common to find different severity scales within the same organization depending on the department. For example, security teams tend to have very specific criteria to categorize incidents.
In general terms, these are common ways of thinking about each severity level: SEV3 covers minor issues with limited impact, SEV2 covers noticeable degradation for a subset of customers or features, SEV1 covers critical impact on core functionality for most customers, and SEV0 is reserved for catastrophic, business-threatening events.
While severity is usually linked to incident priority, that's not always the case. Priority refers explicitly to how urgent it is to fix an incident.
A simple example to illustrate the difference: an engineer finds that one of your dependencies released a security patch to mitigate a potential vulnerability that may apply to your service. In terms of severity, this update is not impacting any customer or the system’s performance. But in terms of priority, this patch should be applied as soon as possible.
Schemes that classify incidents by priority make the urgency-to-fix criteria explicit and typically use the P4-P3-P2-P1-P0 notation. This approach is useful when you have to deal with several incidents at the same time, as it helps prioritize efforts.
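To make the distinction concrete, here's a minimal sketch in Python. The scales, labels, and the example incident are illustrative, not a prescription for your organization:

```python
from enum import IntEnum

# Illustrative severity and priority scales; adjust the labels and
# criteria to match your own organization's guidelines.
class Severity(IntEnum):
    SEV3 = 3  # minor issue, limited impact
    SEV2 = 2  # degraded experience for a subset of customers
    SEV1 = 1  # critical impact on core functionality
    SEV0 = 0  # catastrophic, business-threatening

class Priority(IntEnum):
    P4 = 4  # fix when convenient
    P3 = 3  # schedule into normal work
    P2 = 2  # fix soon
    P1 = 1  # fix as soon as possible
    P0 = 0  # drop everything

# The dependency-patch example from above: nobody is impacted yet
# (low severity), but the fix shouldn't wait (high priority).
dependency_patch = {"severity": Severity.SEV3, "priority": Priority.P1}
print(dependency_patch)
```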
SEV0 incidents are life-defining events for most companies, their leadership, and their processes. Recent examples include CrowdStrike’s outage, which disrupted hundreds of flights and caused $5.4 billion in damages, or Google Cloud accidentally wiping clean the data of a $125 billion Australian pension fund.
SEV0 incidents are not just "oh no, our service is down" kinds of incidents. One of the key differentiators between a SEV0 and a SEV1 goes beyond impact: it's how rapidly you can recover your systems. If your core services are down for all your customers, but a straight-from-the-manual rollback gets you back on your feet, you were not dealing with a SEV0 incident.
It’s precisely entering uncharted territory that characterizes SEV0 incidents. You walk on thin ice. You don’t know if whatever you do will make things better or worse. You have to deal not only with the technical issue but also with keeping your board at ease, working with customers to mitigate the impact, and dealing with the press.
SEV0 incidents are all-hands-on-deck. It doesn't matter whether it's someone's turn to be on call or not. You won't hesitate to call your VP who is on holiday in Costa Rica if needed. The ongoing incident is already costing you millions in expenses and reputation, and you need to remediate it as soon as possible, whatever it takes.
When dealing with an incident, the steps usually follow the same broad sequence: identification, triage, containment, root cause analysis, mitigation, and post-incident review. All of that applies to high-severity incidents, with a few caveats:
When the impact of an incident keeps creeping up and starts to look like a SEV1 (or SEV0), the Incident Commander and the entire response team should get ready for immediate action. Escalation paths to on-call managers and other departments should be triggered to make sure everybody is ready if the situation deteriorates.
Executive-level stakeholders should be notified and involved as early as possible. Top-down coordination can help expedite results and ensure all teams and resources are available to mitigate the situation as soon as possible.
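As a rough illustration, an escalation path can be as simple as a mapping from severity to the roles that get paged. The roster and the page() helper below are placeholders for whatever paging tool you actually use:

```python
# Hypothetical escalation paths; adapt the roles to your org chart.
ESCALATION_PATH = {
    "SEV2": ["on-call engineer"],
    "SEV1": ["on-call engineer", "on-call manager", "engineering director"],
    "SEV0": ["on-call engineer", "on-call manager", "engineering director",
             "VP of engineering", "executive on-call"],
}

def page(role: str, incident_id: str) -> None:
    # Placeholder: call your paging provider's API here.
    print(f"Paging {role} for incident {incident_id}")

def escalate(severity: str, incident_id: str) -> None:
    # Notify every role on the path for this severity level.
    for role in ESCALATION_PATH.get(severity, []):
        page(role, incident_id)

escalate("SEV0", "INC-1234")
```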
High-severity incidents will usually go beyond your defined playbook and require coordination from more people than other types of incidents. Keeping a centralized reference for communication, such as a dedicated Slack channel or an internal status page, will be paramount.
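For example, a small script can spin up that dedicated channel the moment a high-severity incident is declared. This sketch uses Slack's Python SDK; the token, scopes, and naming convention are assumptions you'd adapt to your own workspace:

```python
import os
from slack_sdk import WebClient  # pip install slack_sdk

# Assumes a bot token with channel-creation scopes in the environment;
# the incident identifier and naming convention are just examples.
client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

incident_id = "inc-2024-08-23-checkout-outage"  # hypothetical identifier
channel = client.conversations_create(name=incident_id)["channel"]

client.conversations_setTopic(
    channel=channel["id"],
    topic="SEV1 (under review for SEV0): single source of truth for this incident",
)
client.chat_postMessage(
    channel=channel["id"],
    text="Incident channel created. All updates and decisions go here.",
)
```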
You’ll need to be very strategic about external communication when dealing with a SEV1 or SEV0 incident. Try to control the narrative by being the one disclosing the incident through your Public Relations team when warranted. Communicate effectively and keep your channels updated.
The retrospective will be a deep dive into what went wrong, why, and how you can prevent something similar from ever happening again. Usually, it’ll involve an exhaustive investigation of the issue and circumstances that led to it.
After a SEV0 incident, you'll probably need to provide explanations to shareholders, customers, and regulators, depending on the nature of the issue. You'll need to coordinate different teams to gather insights on the financial implications and the legal and compliance risks, and to propose an improvement plan.
Yes, you have every reason to be freaking out. Everybody does. But taking it out on your team, stakeholders, or customers is nothing but counterproductive. Project calm and control to create a better environment where your team feels that somebody's got their back. There's no need to add extra pressure on your response team or anyone else involved.
Provide any resources they may need and make sure progress is being made through different avenues, but without becoming a burden with your demands. The goal is to get out of this, and you'll only do it as a team.
Even though high-severity incidents come in all shapes and forms and usually require ad hoc maneuvers, having a structured way of distributing work is essential. Your incident commander will assign all the incident roles that are usually required, but shouldn't be afraid of splitting up roles or creating new ones to make sure everything has an owner.
Distributing responsibilities makes it easier for your response team to function in the high-stress circumstances of a critical event. Each person can focus on fulfilling their role well rather than trying to cover different kinds of gaps alone.
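A simple way to keep role ownership explicit is to track it as data rather than in people's heads. The roles and names below are purely illustrative:

```python
# A sketch of explicit role ownership during a high-severity incident.
INCIDENT_ROLES = {
    "incident commander": "alice",
    "operations lead": "bob",
    "communications lead": "carol",
    "scribe": "dave",
    "customer liaison": None,  # unassigned: flag it, don't leave it implicit
}

# Surface any role without an owner so the incident commander can act.
unassigned = [role for role, owner in INCIDENT_ROLES.items() if owner is None]
if unassigned:
    print(f"Roles still needing an owner: {', '.join(unassigned)}")
```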
The more comprehensive your incident response playbooks are, the less your responders will have to worry about trivial details when dealing with a high-severity incident. The playbook should be used as a flexible guideline that eases the incident response, but it shouldn’t become a rigid protocol that gets in the way of remediation.
Make sure your playbooks are designed to take work off your responders and to cover all your compliance needs. Look for signs that any of your playbooks are merely bureaucratic stamps that slow down your resolution time.
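One way to keep a playbook helpful rather than bureaucratic is to treat it as a lightweight checklist the response team can reorder or skip. The steps below are examples, not a canonical list:

```python
# A playbook sketch expressed as a flexible checklist rather than a rigid
# protocol; responders can skip or reorder steps as the situation demands.
PLAYBOOK = [
    {"step": "Declare severity and open the incident channel", "required": True},
    {"step": "Assign incident roles", "required": True},
    {"step": "Post first internal status update", "required": True},
    {"step": "Notify executive stakeholders (SEV1 and above)", "required": False},
    {"step": "Draft external communication with PR", "required": False},
]

def remaining_steps(completed: set) -> list:
    # Return the steps not yet marked as done.
    return [s["step"] for s in PLAYBOOK if s["step"] not in completed]

print(remaining_steps({"Declare severity and open the incident channel"}))
```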
If your high-severity incident occurs while you have an on-call team at high risk of burnout, you can expect less than optimal results. Make sure there are proactive policies to take care of your responders, with properly staffed and spaced rotations and adequate rest after challenging shifts.
Incident management tools like Rootly can help you centralize communications, form a response team, distribute responsibilities through roles, and gather insights for detailed retrospectives throughout a high-severity incident.
Rootly has helped resolve hundreds of thousands of incidents at some of the most influential teams in the industry. Our on-call and incident management solutions are trusted by LinkedIn, NVIDIA, Cisco, Canva, and more than a hundred other teams.