If you know any professional cook, you’ve likely seen them carrying around their knife roll. This is because they need to perform very precise work and have specific preferences for how it's done. That’s why they bring their own, exceptionally sharp knives to the kitchen, rather than using just any random knife lying around.
SREs also perform very precise work and need the best tools available. You need an incident management solution that works for your team based on your processes, tech stack, and budget. For example, if your team uses Slack, Linear, and Datadog, your incident management software should integrate seamlessly with them. If your team relies on automations, your incident response tool should offer simple yet powerful options.
Following the cooking analogy: a dull knife is not only less effective, it’s dangerous. For your SREs, relying on sub-par tools can be detrimental. The stakes are higher than "we’d be 15% faster if we had XYZ tool." An inadequate incident response tool can introduce confusion and frustration to your resolution process.
In this blog post, you’ll find out which features to look for in an incident management tool and ideas for criteria you can establish when choosing a vendor.
What to Look For When Choosing Incident Management Software
Ease of Use
If you need people to go through “PagerDuty University” (50+ corporate videos) just to get a push notification, you’re probably not making the best use of everybody’s time.
Incident management software shouldn’t require you to look at a manual or Google how to perform everyday tasks like shift overrides or setting up a 24/7 schedule.
An easy-to-use, intuitive tool helps ensure faster onboarding and promotes cross-functional collaboration during an incident, unlike legacy tools that are clunky and designed for engineers only.
Customization
When your on-call scheduler requires you to buy a dummy seat just so you can leave intentional gaps in your schedule (looking at you, PagerDuty and OpsGenie), the level of customization of your alerting solution is quite low.
Many SREs have accepted, after years of frustration, that PagerDuty won’t let them create schedules that work for them without jumping through hoops. Want to page multiple teams? Need different rules for incidents that occur off-hours vs. during business hours? You’ll have to implement numerous hacks that will keep breaking over time.
The truth is, your on-call and incident management should adapt to your team’s workflows and current needs. Reliability is difficult enough on its own; your incident response solution should work for you, not become another chore for your team.
Flexibility
When someone is on call, they need to ensure they’re reachable and can address potential issues immediately. But what if I’m on call and I get a call from my kid’s school saying she’s unwell and needs to be picked up? I just need someone to cover for me for an hour or two. That’s when your incident management software’s flexibility is put to the test.
In legacy on-call tools, making an override in the schedule feels like performing a ritual dance with intricate steps. Making a partial shift override in PagerDuty, where someone covers only a few hours of a shift, is just as frustrating.
Your alerting solution should respond to real-life incident management scenarios, whether it’s covering for someone who’s sick, managing special events like a Black Friday sale, or handling last-minute changes. In each case, your incident management software should let you make the change in a few clicks.
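To make the school pickup scenario concrete, here is a minimal Python sketch of what a partial shift override amounts to. The `Shift` class and the `apply_override` and `on_call_at` functions are hypothetical names made up for illustration; they are not the API of Rootly, PagerDuty, or any other vendor.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Shift:
    """Hypothetical on-call shift: one responder covering [start, end)."""
    responder: str
    start: datetime
    end: datetime


def apply_override(shift: Shift, cover: str, start: datetime, end: datetime) -> List[Shift]:
    """Split one shift so that `cover` is on call for [start, end) within it."""
    # Clamp the override window to the shift so it cannot extend coverage.
    start = max(start, shift.start)
    end = min(end, shift.end)
    if start >= end:
        return [shift]  # The override does not overlap this shift.

    segments = []
    if shift.start < start:
        segments.append(Shift(shift.responder, shift.start, start))
    segments.append(Shift(cover, start, end))
    if end < shift.end:
        segments.append(Shift(shift.responder, end, shift.end))
    return segments


def on_call_at(segments: List[Shift], moment: datetime) -> Optional[str]:
    """Return who is on call at `moment`, or None if nobody is scheduled."""
    for segment in segments:
        if segment.start <= moment < segment.end:
            return segment.responder
    return None


if __name__ == "__main__":
    shift = Shift("alex", datetime(2024, 11, 29, 9), datetime(2024, 11, 29, 17))
    # Sam covers for Alex from 13:00 to 15:00.
    for segment in apply_override(shift, "sam", datetime(2024, 11, 29, 13), datetime(2024, 11, 29, 15)):
        print(segment.responder, segment.start.time(), segment.end.time())
```

The point of the sketch is that a two-hour cover is conceptually just "split the shift into three pieces." A tool that makes this harder than it looks here is adding friction of its own.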
Multi-Cloud Redundancy
You need to make absolutely sure that your alerting and incident management solution is reliable. The lack of reliable options from 2009 through the early 2020s made PagerDuty the industry leader. Its rivals at the time, OpsGenie and ZenDuty, were known for long maintenance windows or worrying outages for their alerting services.
However, modern on-call solutions are built with state-of-the-art infrastructure. Rootly On-Call, for example, is the only alerting solution offering multi-cloud redundancy. That means even if AWS has an outage, you still won’t miss a single alert.
Check each vendor’s status page to see how many incidents they’ve had over the past year. Be critical about the SLA offered, and ask what measures the vendor takes to keep critical services available.
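When comparing SLAs, it can help to translate the percentage into the downtime it actually permits. The helper below is a small sketch of that arithmetic; the function name is made up for this example, and it assumes the SLA is measured over a plain 365-day year, which a vendor’s contract may define differently.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 365) -> float:
    """Minutes of downtime an availability SLA permits over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.9, 99.95, 99.99):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% allows about {minutes:.0f} minutes of downtime per year")
```

Under that assumption, 99.9% allows roughly 8.8 hours of downtime per year, while 99.99% allows about 53 minutes. For a tool whose whole job is to wake you up when something breaks, that difference matters.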
Automation
Automation is often used as a catch-all phrase, but there are several areas throughout an incident’s lifecycle where automation can add real value.
- Managing alerts: Automations in alert management can help prevent alert fatigue in your responders. While reviewing your observability pipeline is generally the best practice, you can also deduplicate alerts at the alerting software level. PagerDuty charges a hefty fee per seat for this automation as an add-on, while Rootly includes it as part of the service.
- Escalation policies: Automate how alerts are treated based on severity, timing, or service type. For example, you probably don’t want to wake up someone on-call during the weekend for a SEV3 alert on a secondary service. PagerDuty doesn’t offer a way to automate this process, but modern solutions like Rootly do.
- Incident resolution process: Automate significant parts of the incident response process, from using a Slack or Microsoft Teams bot that allows your team to manage the incident without leaving their usual collaboration tool, to automatically fetching dashboards from Datadog.
- Status pages: Automations can keep your status page updated. For instance, you can set periodic reminders for your team to update the status page, or upon marking an incident as resolved, the status page can automatically reflect it and notify subscribers.
- Post-incident: Automate the retrospective process or have your incident management tool compile a timeline of the incident. You can also automatically sync actions with Jira tasks to ensure that the retrospective leads to actionable insights.
Your on-call and alerting solution may support some of these automations or offer them as paid add-ons.
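As a rough illustration of the first two items in the list above, here is a minimal Python sketch of alert deduplication and severity-based routing. Everything in it is an assumption made for the example: the `Alert` fields, the five-minute style window, the 09:00 to 18:00 weekday business hours, and the three routing outcomes. It does not reflect how any particular vendor implements these features.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Set


@dataclass(frozen=True)
class Alert:
    service: str
    summary: str
    severity: int  # 1 = most severe (SEV1), 3 = least severe (SEV3)
    received_at: datetime


class Deduplicator:
    """Flags an alert as a duplicate if the same service and summary
    were last seen within `window` of it."""

    def __init__(self, window: timedelta):
        self.window = window
        self._last_seen = {}

    def is_duplicate(self, alert: Alert) -> bool:
        key = (alert.service, alert.summary)
        previous = self._last_seen.get(key)
        self._last_seen[key] = alert.received_at
        return previous is not None and alert.received_at - previous <= self.window


def is_off_hours(moment: datetime) -> bool:
    # Assumption: business hours are 09:00-18:00, Monday to Friday.
    return moment.weekday() >= 5 or not (9 <= moment.hour < 18)


def route(alert: Alert, primary_services: Set[str]) -> str:
    """Decide what to do with an alert based on severity, timing, and service."""
    if alert.severity <= 2:
        return "page"  # SEV1 and SEV2 always page the responder.
    if is_off_hours(alert.received_at) and alert.service not in primary_services:
        return "queue_for_business_hours"  # Don't wake anyone for this.
    return "notify_channel"


if __name__ == "__main__":
    saturday_3am = datetime(2024, 11, 30, 3, 0)
    alert = Alert("reports", "high latency", 3, saturday_3am)
    print(route(alert, primary_services={"checkout"}))
```

In this sketch, a SEV3 alert on a secondary service at 3 a.m. on a Saturday is queued instead of paging anyone, which is the behavior the escalation policy example describes.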
AI
By using AI, the SRE team at Google experienced a 50% increase in velocity when writing incident reports. You can also leverage AI to make incident response faster. Modern on-call and incident management solutions offer AI features that boost responders’ productivity.
Ensure that your organization’s privacy guidelines align with how the incident management software handles data for AI processing. Also, be aware that some vendors charge a base fee for AI access plus a cost per usage.
Transparent Pricing
Most legacy on-call vendors are known for their opaque pricing strategies. They often charge per seat per month, but that only includes access to core features. They’ll try to upsell you on add-ons for even basic features like status pages. That’s why many SRE teams are questioning whether their PagerDuty costs are still worth it.
How to Choose the Right Incident Response Tool for Your Team
Finding the right incident response tool for your team involves figuring out what you need and putting the vendor to the test. Below are examples of questions to consider for each feature in your next on-call and incident response solution. Adapt them to your circumstances and add questions as needed.
Ease of Use:
- How long does it take for a new colleague to be ready to go on call with this tool?
- Can someone outside the engineering team use this tool?
Customization:
- I want to page several teams at the same time for one service. How do I go about it?
Flexibility:
- Two teammates want to swap part of their on-call shift. How long does it take to get it done?
- Can we easily have a new teammate shadow an experienced responder?
Automation:
- Where in the incident lifecycle do I need automations? Does this vendor offer them?
- Are there additional costs for automations or usage limits?
Reliability:
- How many incidents have they had in the past quarter?
- How is their infrastructure set up to guarantee their SLA?
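One way to turn the answers to these questions into a comparison is a simple weighted scorecard. The sketch below is only one possible approach: the criteria weights and the 1 to 5 ratings are placeholder numbers you would replace with your own, and "Vendor A" and "Vendor B" are hypothetical.

```python
from typing import Dict


def weighted_score(weights: Dict[str, float], ratings: Dict[str, int]) -> float:
    """Weighted average of 1-5 ratings, using your own criteria weights."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[c] * ratings[c] for c in weights) / total_weight


if __name__ == "__main__":
    # Placeholder weights: how much each criterion matters to your team.
    weights = {"ease_of_use": 3, "customization": 2, "flexibility": 2,
               "automation": 2, "reliability": 3}
    # Placeholder ratings from your own trial of each (hypothetical) vendor.
    vendors = {
        "Vendor A": {"ease_of_use": 4, "customization": 3, "flexibility": 5,
                     "automation": 4, "reliability": 5},
        "Vendor B": {"ease_of_use": 3, "customization": 4, "flexibility": 2,
                     "automation": 3, "reliability": 4},
    }
    for name, ratings in vendors.items():
        print(f"{name}: {weighted_score(weights, ratings):.2f}")
```

The numbers matter less than the exercise: agreeing on weights forces your team to decide which criteria are non-negotiable before the sales demos start.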
Build a Stronger Incident Response Strategy with Rootly
Rootly is a modern on-call and incident response solution trusted by leading SRE teams like LinkedIn, Dropbox, NVIDIA, and Webflow. Rootly offers an intuitive UI that’s easy for both engineers and non-technical staff to use, allows full customization to fit your team’s workflows, and offers transparent pricing with all features included.
Talk with one of our reliability experts to see if Rootly is the right incident response vendor for you.