

You’re hanging out with a dear friend when she trips and faints. Oh no, are you okay? You reach out to help her up. But she’s not moving. Hey, Jane, hey. It’s not funny. She’s not reacting. Did she hit her head? Is she bleeding? You don’t know what’s going on—you’re panicking. You call 911.
Imagine that instead of the usual “911, what’s your emergency?” the operator picked up the phone and said, “911, what’s the severity of the incident? Which body parts were impacted? Is it raining right now?” and went on and on, asking you a long list of fixed questions, several of them irrelevant to the situation.
Sounds silly, but this is how most organizations approach incident declarations—one of the first communication touchpoints of an incident. Unafraid of challenging popular practices like this one, the industry leaders who gathered at the latest Reliability Leaders Roundtable discussed how they’re tackling incident communications in their teams.
Reliability Leaders Roundtables are private events where 20–30 SRE leaders gather for casual conversations about how their peers are approaching specific challenges.
In this article, we present a distilled version of the insights uncovered during the incident communications roundtable.
Tickets can drive anyone insane. It’s not just about new requests—it’s also about dealing with ticket hygiene. It never ends. Not only do you get an overwhelming number of new requests through tickets with barely any information or context, but you also end up losing track of what’s being done for each of them—if anything at all. It’s been two weeks; are these resolved or even relevant by now? No easy way to tell.
To make matters worse, that same lack of information makes it hard to set up automations around tickets. And let’s not even get started on the tickets that some long-abandoned integration keeps creating automatically.
The “intuitive” solution to this problem is to add extra validations to the ticket creation process. Let’s safeguard the process by adding a bunch of required fields; that way, we’ll make sure we capture essential information in every ticket.
Except you just end up with a baroque questionnaire that nobody wants to deal with.
The reliability leaders at the incident communications roundtable agreed that adding friction didn’t improve ticket management. At the end of the day, people fill the form with irrelevant data because you’re asking them the wrong questions at the wrong time, and you frustrate them in the process.
Incomplete information is a problem you’ll always have to deal with—you need to accept it. Attendees explained that they’ve been using tools like Rootly to bring up additional context automatically when an incident is created through a ticket.
For example, if a user files a Linear triage ticket whose only content is “Catalog API taking too long,” the team has a Rootly workflow that automatically pulls up a Datadog dashboard.
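The exact workflow lives in Rootly, but the underlying pattern is simple to sketch. The snippet below is a minimal, hypothetical version of it: it assumes a Linear webhook pointed at a small Flask endpoint, a Slack incoming webhook for the incident channel, and a hand-maintained keyword-to-dashboard map. None of the endpoint names or URLs are real, and this is not Rootly’s API.

```python
# Hypothetical sketch of the "auto-attach context" pattern, not Rootly's actual workflow.
# Assumes: a Linear webhook configured to call /linear-webhook, a Slack incoming webhook
# URL in SLACK_WEBHOOK_URL, and placeholder Datadog dashboard URLs.
import os

import requests
from flask import Flask, request

app = Flask(__name__)

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

# Hand-maintained map from keywords in ticket titles to the dashboard most
# likely to give responders immediate context.
DASHBOARDS = {
    "catalog": "https://app.datadoghq.com/dashboard/catalog-api",  # placeholder
    "checkout": "https://app.datadoghq.com/dashboard/checkout",    # placeholder
}


@app.route("/linear-webhook", methods=["POST"])
def enrich_ticket():
    payload = request.get_json(force=True)
    title = payload.get("data", {}).get("title", "").lower()

    # Pick the first dashboard whose keyword appears in the ticket title.
    dashboard = next((url for kw, url in DASHBOARDS.items() if kw in title), None)
    if dashboard:
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f"New triage ticket: {title!r}\nSuggested dashboard: {dashboard}"},
            timeout=5,
        )
    return {"ok": True}
```

Even something this crude beats an empty ticket: the responder opens the channel and finds a dashboard link already waiting, instead of a one-line title with no context.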
Incidents are unavoidable. However, an incident at the wrong time can sour a deal or deteriorate your relationship with your customers. That’s why incidents are not only relevant to engineering—reliability leaders agree that cross-functional collaboration is vital in incident management.
But how do you tell your sales team that a bunch of OOMKilled errors are flooding your logs, but you’re not sure why or when you’ll have a fix?
Traditionally, internal status pages have been the tool for driving visibility inside the organization. Your customer support teams could check the internal status page to see whether a system was down, and that information might help them understand a customer’s issue. And because the page is private, your SRE team could hook automations to it, making it easier to maintain.
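The automation half is the easy part. As a rough illustration, assuming an internal status page that exposes a plain HTTP API (the endpoint, token, and payload below are made up for the sketch, not a specific vendor’s API), an alert handler could flip a component’s status like this:

```python
# Hypothetical helper for keeping an internal status page in sync with alerts.
# The base URL, token, and payload shape are assumptions for illustration only;
# substitute your status page vendor's real API.
import os

import requests

STATUS_PAGE_API = "https://status.internal.example.com/api/components"  # assumed
API_TOKEN = os.environ["STATUS_PAGE_TOKEN"]


def set_component_status(component_id: str, status: str) -> None:
    """Mark a component as e.g. 'operational' or 'degraded' on the internal page."""
    requests.patch(
        f"{STATUS_PAGE_API}/{component_id}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"status": status},
        timeout=5,
    )


# Called from an alert handler when the Catalog API starts failing health checks:
set_component_status("catalog-api", "degraded")
```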
The problem? Discoverability. Unless the team has a regular need to check the internal status page, they’ll have a hard time figuring out where to find the URL for it. You need to make a full context switch just to know if a system is down or if your organization is experiencing an incident at all.
The roundtable attendees agreed they had more success setting up automations to notify their colleagues about ongoing incidents in specific Slack channels. Most attendees had a tool like Rootly to update a Slack channel when an incident was declared and resolved.
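Tools like Rootly provide this out of the box, but the shape of the automation is easy to picture. Below is a bare-bones sketch using the Slack SDK; the channel name, the announce() helper, and the incident details are illustrative stand-ins, not anyone’s production setup.

```python
# Bare-bones version of the "announce incidents in Slack" pattern the
# attendees described. Channel name and incident details are illustrative.
import os

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])


def announce(incident_id: str, title: str, status: str) -> None:
    """Post a short, consistent update so non-engineering teams see state changes."""
    prefix = ":rotating_light:" if status == "declared" else ":white_check_mark:"
    client.chat_postMessage(
        channel="#incident-updates",  # assumed cross-team visibility channel
        text=f"{prefix} Incident {incident_id} {status}: {title}",
    )


# Fire at the two lifecycle points the roundtable called out:
announce("INC-142", "Catalog API latency spike", "declared")
announce("INC-142", "Catalog API latency spike", "resolved")
```

The point isn’t the code; it’s that the update lands where support and sales already live, with no URL to remember and no context switch.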
It’s standard practice to start dealing with an incident by assessing its severity (SEV1, SEV2, SEV3) or priority (P1, P2, P3, etc.). But how useful—or even realistic—is it to conduct an impact assessment when you’re basically in the dark about what the incident entails?
It doesn’t help that the boundaries between, for example, SEV1 and SEV2 are arbitrary. They can vary across companies and even across teams within the same company. Then you run into black swan events—incidents that don’t fit any category—or incidents that leave you perplexed, unable to move beyond filling in the incident declaration form.
The attendees didn’t reach a consensus on how to replace severities. But the room agreed that severity levels are largely a convention inherited from legacy tools and frameworks.
Various leaders are experimenting with these ideas instead of relying on SEV or P levels:
Over-relying on processes tends to cause frustration and slow responses. Declaring an incident alone becomes a chore: you can’t raise one without filling in a questionnaire and writing an essay on the semantic limits of severity levels.
How can we improve incident communications at scale? Empower your Incident Commanders.
The reliability leaders agreed that it’s important to create an on-call culture that makes Incident Commanders feel empowered to make decisions without fear of punitive measures if their initial assessment isn’t entirely correct.
For example, Incident Commanders should be able to:
You’ve seen it firsthand. Your team is in the middle of a technical discussion to find a remediation path when a VP bursts into the room (or Zoom call), bypassing all standard communication channels. Now, instead of focusing on the solution, you have to stop and explain what’s going on.
VPs and execs aren’t the enemy or inconvenient allies—they just want to do everything in their power to help resolve the incident as quickly as possible. And they have more influence than you may realize while you’re knee-deep in logs and traces.
All the reliability leaders at the roundtable agreed: building trust with executives is essential. And the best way to build that trust? Over-communicate.
Here’s how roundtable attendees build rapport with executives during incidents:
Incident communication is evolving away from the bureaucracy of forms and SEV levels and toward trust and automation. Pairing stronger human relationships with better tooling is what’s letting SRE leaders move toward more robust reliability.
Apply to join us in the next Reliability Leaders Roundtable to discuss how AI is impacting incident response.