July 22, 2024
10 mins
Your on-call management software can make or break your reliability story. Find out which boxes your on-call solution should be checking for you.
If you’re not into bikes, you might find it excessive that a bicycle can cost more than 30 thousand dollars. You might think it’s totally unnecessary to optimize your biking gear and obsess over every single gram you put on it. Yet improved bike technology is one of the reasons the winner’s average speed at the Tour de France, one of the most important bicycle races in the world, has climbed from roughly 25 km/h in the early 1900s to more than 40 km/h today.
The tools you use make a huge impact on your performance as a professional. Maybe 15 years ago you found it cool to upload your website files through Cyberduck FTP. But in 2024, you’d never consider it a suitable deployment solution. On-call management software is no different.
Since PagerDuty and other legacy on-call scheduling tools were created around 2009, the way we think about and operate systems at scale has evolved. Thankfully, SREs now have more options in the market and can demand more and better features from vendors.
But what are the key considerations when choosing an on-call management solution? In this article, you’ll get tips on features to look at and questions to ask the vendors you evaluate while looking at possible solutions.
Being on-call can be a taxing task on its own, and when an incident breaks, you’ll need all the help you can get from your tooling. On-call solutions can do a lot for your responders, including making their life easier by introducing flexibility into their schedules and proactively simplifying the incident response process.
Signs that you may need to revise your on-call solution include:
By choosing the right on-call management software, you’re helping streamline the incident resolution process.
Reliability is a team effort, and every team is unique. You need to deep dive into how your team operates and the business objectives you’re trying to hit. Your team’s scale and toolchain can dramatically change the requirements you have for an on-call solution. Ultimately, clearly defining the structure of your budget will help you determine which vendor and cost structure fits you better.
A startup establishing its reliability foundation has very different requirements from a Fortune 500 company. On-call management software vendors tend to tailor their solutions to one segment of the market. It’s never black and white, but the use cases of a small team and a scaled-up team differ enough that vendors build different features for each.
When you have a smaller team, your focus will be contributing to your main business line. You do not want your incipient SRE team to dedicate too much time to setting up the tooling around on-call. That means you want your on-call solution to come with batteries included: smart defaults, ready-made schedule templates, and a simplified UI.
As you scale up your reliability practice, you’ll have more specialized practices and tools that need to be part of your on-call strategy. That’s when generic pre-made on-call settings don’t cut it anymore. You’ll have SREs dedicating significant effort to set up tooling around on-call and fine-tuning settings, perhaps even building ad-hoc integrations. That means you need an on-call management solution that offers flexibility and is ready to partner up with your team.
You want your on-call management software to take work off your plate, not the opposite. Thus, you need to make sure the majority of your processes and workflows can be mapped in the on-call solution you’re evaluating.
Take note of the tools your team uses for incident response. For example, if you use Slack for collaboration and Jira for project management, it’ll be handy to have native integrations for those in the on-call solution you choose to work with.
Your budget can be an asset when negotiating rates and services with an on-call management vendor. Typically, vendors will want to charge you a tiered per-seat cost, plus additional fees for other services.
To emerge victorious from this complex process, it’s best to have a good understanding of how your organization’s budget is structured in the long run. For example, you can get a better deal by discussing how you expect the budget to grow over the next two years when you’ve fully ramped up the on-call solution.
On-call management software comes in different shapes and sizes. The essentials will look similar, so you need to scratch below the surface to see which on-call solution fits your team better.
There are two basic personas that will interact heavily with the on-call solution you choose: admins and responders.
Both roles must have their use cases fulfilled, and feel comfortable using the on-call solution you come up with.
The whole purpose of an on-call management solution is to bring alerts to your team. The first thing to check is if the on-call software you’re evaluating can digest your alerts. Popular observability vendors like Datadog, Grafana, or New Relic are supported in most on-call solutions as alert sources. But how you manage alerts will likely be unique to you.
For example, it’s common to rely on multiple alert sources, which may include proprietary solutions exclusive to your organization. Make sure your on-call management solution offers a wide range of ways to connect your alerts with simple interfaces such as UIs, APIs, or generic webhooks.
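To make the generic webhook idea concrete, here is a minimal sketch of how alerts from a proprietary source could be mapped onto one common shape before they reach your on-call tool. The `Alert` fields and the payload keys (`title`, `summary`, `severity`) are assumptions made for illustration, not the format of any specific vendor.

```python
import json
from dataclasses import dataclass


@dataclass
class Alert:
    source: str    # which system sent the alert (hypothetical label)
    title: str     # short human-readable description
    severity: str  # normalized to lowercase


def normalize_alert(source: str, raw_body: str) -> Alert:
    """Map a raw JSON webhook body onto a common Alert shape."""
    payload = json.loads(raw_body)
    # Key names are hypothetical; real alert sources name these fields differently.
    title = payload.get("title") or payload.get("summary") or "untitled alert"
    severity = str(payload.get("severity", "unknown")).lower()
    return Alert(source=source, title=title, severity=severity)
```

The point of a thin layer like this is that every alert source, however unusual, ends up looking the same to your schedules and escalation policies.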
Arguably, managing on-call schedules is one of the most challenging aspects you need to deal with. Thus, you need to make sure your on-call management software gives you as many thoughtful features as possible. A simplified user experience will go a long way.
Assess how the on-call solution makes it easier for you to manage the complexity of dealing with multiple schedules. For example, the ability to see multiple schedules at once or filter calendar views can make your life easier, but not every on-call solution offers this functionality.
Another area where your on-call scheduler can help you is finding gaps in your coverage. Check if the on-call software you’re evaluating can detect gaps in your schedules and automatically backfill them so you’re not left hanging with nobody on call due to a minor oversight.
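Gap detection is simple to reason about once shifts are treated as time intervals. The sketch below is one possible approach, not how any particular vendor implements it: it sorts the shifts and reports every stretch of the window that no shift covers. The function name and the tuple layout are assumptions made for illustration.

```python
from datetime import datetime


def find_coverage_gaps(shifts, window_start, window_end):
    """Return (start, end) pairs inside the window where nobody is on call.

    shifts: list of (start, end) datetime tuples, in any order.
    """
    gaps = []
    covered_until = window_start
    for start, end in sorted(shifts):
        if start >= window_end:
            break  # remaining shifts begin after the window
        if start > covered_until:
            gaps.append((covered_until, start))
        covered_until = max(covered_until, end)
    if covered_until < window_end:
        gaps.append((covered_until, window_end))
    return gaps
```

Backfilling would then be a matter of assigning someone to each returned interval, which is the part a good on-call scheduler automates for you.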
Along with alert sources and schedules, escalation policies are the third pillar of any on-call strategy. At a minimum, the on-call management software you evaluate should give you a simple way of defining escalation policies.
A helpful on-call solution also gives you flexibility in how you define who belongs to each level of an escalation policy. The minimum expectation is being able to set individuals and schedules on a level. But vendors like Rootly On-Call offer other helpful escalation policy options, such as assigning a Slack channel to be notified as part of the escalation policy.
Another helpful feature is to make it easy to define Round Robin escalation policies. Instead of you having to manually keep track of whose turn it is to handle alerts, your on-call management software can do it for you.
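As a rough illustration of what the tool tracks on your behalf, a Round Robin level boils down to cycling through a list of responders. The class and method names below are hypothetical and only sketch the idea.

```python
from itertools import cycle


class RoundRobinLevel:
    """Hands each new alert to the next responder in a fixed order."""

    def __init__(self, responders):
        if not responders:
            raise ValueError("at least one responder is required")
        # cycle() repeats the list forever, so turns wrap around automatically.
        self._turns = cycle(list(responders))

    def next_responder(self):
        return next(self._turns)


level = RoundRobinLevel(["ashley", "andre"])
first_three = [level.next_responder() for _ in range(3)]
print(first_three)  # ['ashley', 'andre', 'ashley']
```

A real implementation would also persist whose turn it is and skip people who are unavailable, which is exactly the bookkeeping you don’t want to do by hand.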
On-call duty has been a critical part of reliability for decades, so it’s possible you already have on-call management software. That means you’ve already spent significant effort bringing your teams, schedules, rotations, and escalation policies into that tool. You also have important data and metrics about your reliability practice in your alert history.
Switching to a modern on-call management solution shouldn’t mean starting from scratch. Make sure the on-call software you’re evaluating offers tools and expert assistance to migrate your PagerDuty or Opsgenie schedules and data. The goal is to get you started as soon as possible without incurring any rework or metrics inconsistencies.
Reliability is typically measured with hard numbers: how many incidents did we get last month? What’s the alert-incident ratio? What’s our mean time to mitigation? You want your on-call management software to keep track of all the service level indicators.
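The two metrics mentioned above are straightforward to compute from alert and incident history. This sketch assumes the alert-incident ratio means alerts received per declared incident, and that each incident has a start time and a mitigation time; the function names and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def alert_incident_ratio(alert_count, incident_count):
    """Alerts received per declared incident, or None if there were no incidents."""
    if incident_count == 0:
        return None
    return alert_count / incident_count


def mean_time_to_mitigation(incidents):
    """incidents: list of (started_at, mitigated_at) datetime tuples."""
    if not incidents:
        return None
    total = sum((mitigated - started for started, mitigated in incidents), timedelta())
    return total / len(incidents)


# Hypothetical sample data: one 30-minute and one 90-minute incident.
incidents = [
    (datetime(2024, 7, 1, 10, 0), datetime(2024, 7, 1, 10, 30)),
    (datetime(2024, 7, 9, 22, 0), datetime(2024, 7, 9, 23, 30)),
]
print(mean_time_to_mitigation(incidents))  # 1:00:00
```

A high alert-incident ratio under this definition suggests many alerts never turn into incidents, which is often a sign of noisy alerting.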
Additionally, on-call solutions can generally help you save time by generating reports for you. These reports can provide valuable insights into your on-call and incident management performance.
Ashley is on call right now, but she gets a call from her kid’s school to pick him up. She won’t be available for a few hours. What can Ashley do about her on-call shift? With legacy on-call management software, she would have to ping colleagues one by one until somebody verbally agreed to cover, and then somebody would have to apply manual overrides across schedules.
However, modern on-call software like Rootly On-Call lets you request coverage for part of your shift when you need it. In Ashley’s case, she can drop a message letting her colleagues know she needs some backup. Her colleague, Andre, can volunteer to cover for her and make a partial shift override with a few clicks in Rootly.
Life happens, and it will certainly impact your perfectly crafted rotations. Responders are humans, so their on-call schedule should be human too. Your on-call scheduler should therefore support different ways of changing, on demand, the schedule people are assigned to.
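Conceptually, the partial override in Ashley’s example splits one shift into up to three segments. The sketch below illustrates that idea only; it is not how Rootly or any other vendor implements overrides, and the function name and tuple layout are assumptions.

```python
from datetime import datetime


def apply_override(shift, override):
    """Split a shift around a partial override.

    shift and override are (responder, start, end) tuples.
    Returns the resulting list of (responder, start, end) segments.
    """
    owner, s_start, s_end = shift
    cover, o_start, o_end = override
    # Clamp the override so it never extends beyond the original shift.
    o_start, o_end = max(o_start, s_start), min(o_end, s_end)
    if o_start >= o_end:
        return [shift]  # the override does not overlap this shift
    segments = []
    if s_start < o_start:
        segments.append((owner, s_start, o_start))
    segments.append((cover, o_start, o_end))
    if o_end < s_end:
        segments.append((owner, o_end, s_end))
    return segments
```

If Ashley’s shift runs from 9:00 to 17:00 and Andre covers 13:00 to 15:00, the result is three segments: Ashley, then Andre, then Ashley again.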
Cultivating on-call skills is a great way of improving your reliability. But you can’t throw somebody who’s new to your team or to being on-call for a service into the darkness of on-call alone. Most organizations set up shadowing or reverse-shadowing sessions as part of their SRE onboarding.
Although on-call shadowing is a great training technique, it presents a tough challenge for on-call scheduling. Legacy on-call management software like PagerDuty, and even most modern on-call solutions, force you to go through an error-prone and tedious process to align schedules to achieve on-call shadowing.
However, Rootly On-Call makes it easy for you to schedule shadow rotations. Rootly does all the necessary adjustments, so you don’t have to do any manual work like duplicating schedules or adding/removing people after the shadowing time is over.
Legacy on-call management software like PagerDuty, for most users, is just a very expensive phone call service. PagerDuty sends you a push notification, SMS, or call when an alert comes in while you’re on-call. But that’s the end of it: you’re left alone to figure out what to do.
However, on-call pages can give you much more than a ping. With modern on-call solutions, like Rootly On-Call, you’ll get additional context with each alert. You’ll also get relevant next steps to try and a straightforward way to triage and start your incident resolution process in-app.
Choosing the best on-call management software for your team will need an intentional mapping of your business needs. Team size and preferences, as well as tech stack and your incident response process, will inform the requirements for the on-call management solution that best suits you.
Find out whether Rootly On-Call could be the right fit for your team by booking a demo with one of our reliability experts.
Escalation, overrides, viewing on-call, and more are natively supported in Slack