August 23, 2024
7 mins
Handling SEV0 incidents requires careful but swift action. Learn how top-performing teams deal with them at scale.
Just a few weeks ago, millions of developers had a 36-minute forced break when GitHub's global outage took down most of its core services, preventing them from pushing upstream or running GitHub Actions. The culprit was a misconfiguration in GitHub's database infrastructure, which was rapidly resolved with a rollback.
Could this be considered a SEV0 incident? While the impact was massive, the time to mitigation was so short that it did not merit the scandalous title. GitHub teaches us a valuable lesson when dealing with high-severity incidents: no matter how grim the situation looks, do not jump to conclusions.
The boundary between declaring a SEV1 and a SEV0 is somewhat fuzzy in most organizations. Crossing it, though, should be a very thoughtful decision. Depending on your scale, the mere fact of naming an incident a SEV0 will bring you immediate (bad) press attention, investor inquiries, and reputational damage.
As a CEO in the incident management space, I’m constantly in the trenches helping our enterprise partners deal with high-severity incidents. This year alone, Rootly has helped manage more than 150,000 critical incidents. In this article, I’ve compiled a list of best practices I’ve seen work effectively at leading tech companies.
SEV0 is the most catastrophic type of incident you can encounter. But before you get there, most incidents will start at lower levels of severity and only scale up when more impact is discovered and confirmed.
There are different schemes for defining incident severity levels. Enterprise frameworks like ITIL (Information Technology Infrastructure Library) or COBIT (Control Objectives for Information and Related Technologies) provide basic guidelines on what constitutes each level.
But ultimately, you’ll have to define your own guidelines, with parameters that apply to your context and requirements. It’s even common to find different severity scales within the same organization depending on the department. For example, security teams tend to have very specific criteria to categorize incidents.
In general terms, these are common ways of thinking about each severity level: SEV3 covers minor issues with limited impact, SEV2 covers noticeable degradation for a subset of customers or features, SEV1 covers critical impact on core functionality for most customers, and SEV0 is reserved for catastrophic, business-threatening events.
While severity is usually linked to incident priority, that's not always the case. Priority refers explicitly to how urgent it is to fix an incident.
A simple example to illustrate the difference: an engineer finds that one of your dependencies released a security patch to mitigate a potential vulnerability that may apply to your service. In terms of severity, this update is not impacting any customer or the system’s performance. But in terms of priority, this patch should be applied as soon as possible.
Schemes that classify incidents by priority make the urgency-to-fix criteria explicit and typically use the P4-P3-P2-P1-P0 notation. This approach is useful when you have to deal with several incidents at the same time, as it helps prioritize efforts.
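To make the distinction concrete, here's a minimal sketch in Python. The scales, labels, and the example incident are illustrative, not a prescription for your organization:

```python
from enum import IntEnum

# Illustrative severity and priority scales; adjust the labels and
# criteria to match your own organization's guidelines.
class Severity(IntEnum):
    SEV3 = 3  # minor issue, limited impact
    SEV2 = 2  # degraded experience for a subset of customers
    SEV1 = 1  # critical impact on core functionality
    SEV0 = 0  # catastrophic, business-threatening

class Priority(IntEnum):
    P4 = 4  # fix when convenient
    P3 = 3  # schedule into normal work
    P2 = 2  # fix soon
    P1 = 1  # fix as soon as possible
    P0 = 0  # drop everything

# The dependency-patch example from above: nobody is impacted yet
# (low severity), but the fix shouldn't wait (high priority).
dependency_patch = {"severity": Severity.SEV3, "priority": Priority.P1}
print(dependency_patch)
```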
SEV0 incidents are life-defining events for most companies, their leadership, and their processes. Recent examples include CrowdStrike’s outage, which disrupted hundreds of flights and caused $5.4 billion in damages, or Google Cloud accidentally wiping clean the data of a $125 billion Australian pension fund.
SEV0 incidents are not just "oh no, our service is down" kinds of incidents. One of the key differentiators between a SEV0 and a SEV1 goes beyond impact: it's how rapidly you can recover your systems. If your core services are down for all your customers, but a straight-from-the-manual rollback gets you back on your feet, you were not dealing with a SEV0 incident.
It’s precisely entering uncharted territory that characterizes SEV0 incidents. You walk on thin ice. You don’t know if whatever you do will make things better or worse. You have to deal not only with the technical issue but also with keeping your board at ease, working with customers to mitigate the impact, and dealing with the press.
SEV0 incidents are all-hands-on-deck. It doesn't matter whether it's someone's turn to be on call or not. You won't hesitate to call your VP who is on holiday in Costa Rica if needed. The ongoing incident is already costing you millions in expenses and reputation, and you need to remediate it as soon as possible, whatever it takes.
When dealing with an incident, the steps usually follow the same broad sequence: identification, triage, containment, root cause analysis, mitigation, and post-incident review. All of that applies to high-severity incidents, with a few caveats:
When the impact of an incident keeps creeping up and starts to look like a SEV1 (or SEV0), the Incident Commander and the entire response team should get ready for immediate action. Escalation paths to on-call managers and other departments should be triggered to make sure everybody is ready if the situation deteriorates.
Executive-level stakeholders should be notified and involved as early as possible. Top-down coordination can help expedite results and ensure all teams and resources are available to mitigate the situation as soon as possible.
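As a rough illustration, an escalation path can be as simple as a mapping from severity to the roles that get paged. The roster and the page() helper below are placeholders for whatever paging tool you actually use:

```python
# Hypothetical escalation paths; adapt the roles to your org chart.
ESCALATION_PATH = {
    "SEV2": ["on-call engineer"],
    "SEV1": ["on-call engineer", "on-call manager", "engineering director"],
    "SEV0": ["on-call engineer", "on-call manager", "engineering director",
             "VP of engineering", "executive on-call"],
}

def page(role: str, incident_id: str) -> None:
    # Placeholder: call your paging provider's API here.
    print(f"Paging {role} for incident {incident_id}")

def escalate(severity: str, incident_id: str) -> None:
    # Notify every role on the path for this severity level.
    for role in ESCALATION_PATH.get(severity, []):
        page(role, incident_id)

escalate("SEV0", "INC-1234")
```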
High-severity incidents will usually go beyond your defined playbook and require coordination from more people than other types of incidents. Keeping a centralized reference for communication, such as a dedicated Slack channel or an internal status page, will be paramount.
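For example, a small script can spin up that dedicated channel the moment a high-severity incident is declared. This sketch uses Slack's Python SDK; the token, scopes, and naming convention are assumptions you'd adapt to your own workspace:

```python
import os
from slack_sdk import WebClient  # pip install slack_sdk

# Assumes a bot token with channel-creation scopes in the environment;
# the incident identifier and naming convention are just examples.
client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

incident_id = "inc-2024-08-23-checkout-outage"  # hypothetical identifier
channel = client.conversations_create(name=incident_id)["channel"]

client.conversations_setTopic(
    channel=channel["id"],
    topic="SEV1 (under review for SEV0): single source of truth for this incident",
)
client.chat_postMessage(
    channel=channel["id"],
    text="Incident channel created. All updates and decisions go here.",
)
```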
You’ll need to be very strategic about external communication when dealing with a SEV1 or SEV0 incident. Try to control the narrative by being the one disclosing the incident through your Public Relations team when warranted. Communicate effectively and keep your channels updated.
The retrospective will be a deep dive into what went wrong, why, and how you can prevent something similar from ever happening again. Usually, it’ll involve an exhaustive investigation of the issue and circumstances that led to it.
After a SEV0 incident, you'll probably need to provide explanations to shareholders, customers, and regulators, depending on the nature of the issue. You'll need to coordinate different teams to gather insights on the financial implications and the legal and compliance risks, and to propose an improvement plan.
Yes, you have every reason to be freaking out. Everybody does. But taking it out on your team, stakeholders, or customers is nothing but counterproductive. Project calm and control to create a better environment where your team feels that somebody's got their back. There's no need to add extra pressure on your response team or anyone else involved.
Provide any resources they may need and make sure progress is being made through different avenues, but without becoming a burden with your demands. The goal is to get out of this, and you'll only do it as a team.
Even though high-severity incidents come in all shapes and forms and usually require ad hoc maneuvers, having a structured way of distributing work is essential. Your incident commander will assign all the incident roles that are usually required, but shouldn't be afraid of splitting up roles or creating new ones to make sure everything has an owner.
Distributing responsibilities makes it easier for your response team to function in the high-stress circumstances of a critical event. Each person can focus on fulfilling their role well rather than trying to cover different kinds of gaps alone.
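A simple way to keep role ownership explicit is to track it as data rather than in people's heads. The roles and names below are purely illustrative:

```python
# A sketch of explicit role ownership during a high-severity incident.
INCIDENT_ROLES = {
    "incident commander": "alice",
    "operations lead": "bob",
    "communications lead": "carol",
    "scribe": "dave",
    "customer liaison": None,  # unassigned: flag it, don't leave it implicit
}

# Surface any role without an owner so the incident commander can act.
unassigned = [role for role, owner in INCIDENT_ROLES.items() if owner is None]
if unassigned:
    print(f"Roles still needing an owner: {', '.join(unassigned)}")
```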
The more comprehensive your incident response playbooks are, the less your responders will have to worry about trivial details when dealing with a high-severity incident. The playbook should be used as a flexible guideline that eases the incident response, but it shouldn’t become a rigid protocol that gets in the way of remediation.
Make sure your playbooks are designed to take work off your responders and to cover all your compliance needs. Look for signs that any of your playbooks are merely bureaucratic stamps that slow down your resolution time.
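One way to keep a playbook helpful rather than bureaucratic is to treat it as a lightweight checklist the response team can reorder or skip. The steps below are examples, not a canonical list:

```python
# A playbook sketch expressed as a flexible checklist rather than a rigid
# protocol; responders can skip or reorder steps as the situation demands.
PLAYBOOK = [
    {"step": "Declare severity and open the incident channel", "required": True},
    {"step": "Assign incident roles", "required": True},
    {"step": "Post first internal status update", "required": True},
    {"step": "Notify executive stakeholders (SEV1 and above)", "required": False},
    {"step": "Draft external communication with PR", "required": False},
]

def remaining_steps(completed: set) -> list:
    # Return the steps not yet marked as done.
    return [s["step"] for s in PLAYBOOK if s["step"] not in completed]

print(remaining_steps({"Declare severity and open the incident channel"}))
```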
If your high-severity incident occurs while you have an on-call team at high risk of burnout, you can expect less than optimal results. Make sure there are proactive policies to take care of your responders, with properly staffed and spaced rotations and adequate rest after challenging shifts.
Incident management tools like Rootly can help you centralize communications, form a response team, distribute responsibilities through roles, and gather insights for detailed retrospectives throughout a high-severity incident.
Rootly has helped resolve hundreds of thousands of incidents at some of the most influential teams in the industry. Our on-call and incident management solutions are trusted by LinkedIn, NVIDIA, Cisco, Canva, and more than a hundred other teams.