November 2, 2023
Status pages are a simple yet underutilized element of incident communication. Done well, they’re a low-lift way to keep your customers and stakeholders informed when incidents impact them. But without a solid approach, updating status pages can easily become a tedious and often neglected task during incidents. In this post, we’ll cover some tips to get your status page right.
This blog post is adapted from my talk at SRECon EMEA 2023 - original slides are available here!
Want to dive deeper? Get expert advice on your status page setup with a free consultation from our Reliability Experts.
When things go wrong, customers notice. A status page acts as a consistent source of truth for your customers (and in turn, your customer support team) to stay up to date on issues that impact their ability to use your product/service. When frustrated customers don’t know where to find this information, they head straight to your support team, creating unnecessary support debt. Or, worse, to social media, where things can quickly snowball.
A status page allows you to build trust with your customers by proactively communicating the information they need to know about the issue, how it’s impacting them, and the progress you’ve made towards resolving it.
A basic status page includes the following elements:
Components: Elements of your product that can be impacted by incidents.
Status: The current state of each component (operational, degraded, partial outage, or full outage). You may also choose to have a maintenance status option for planned maintenance that impacts functionality.
Incident title: A brief headline for the incident.
Incident description: One or two sentences describing the incident and its impact on users.
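To make these elements concrete, here is a minimal sketch of how they could fit together as a data model. The class names, field names, and the sample "Login" and "Editor" components are illustrative assumptions, not the schema of any particular status page product.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OPERATIONAL = "operational"
    DEGRADED = "degraded"
    PARTIAL_OUTAGE = "partial outage"
    FULL_OUTAGE = "full outage"
    MAINTENANCE = "maintenance"  # optional, for planned maintenance


@dataclass
class Component:
    name: str  # customer-facing name, not an internal service name
    status: Status = Status.OPERATIONAL


@dataclass
class Incident:
    title: str  # brief headline
    description: str  # one or two sentences on user impact
    affected: list[str] = field(default_factory=list)  # component names


@dataclass
class StatusPage:
    components: dict[str, Component] = field(default_factory=dict)
    incidents: list[Incident] = field(default_factory=list)

    def add_component(self, name: str) -> None:
        self.components[name] = Component(name)

    def open_incident(self, incident: Incident, status: Status) -> None:
        # Mark every affected component with the chosen status.
        for name in incident.affected:
            self.components[name].status = status
        self.incidents.append(incident)


# Hypothetical usage with made-up component names.
page = StatusPage()
page.add_component("Login")
page.add_component("Editor")
page.open_incident(
    Incident(
        title="Login errors",
        description="Some users may encounter errors when signing in.",
        affected=["Login"],
    ),
    Status.PARTIAL_OUTAGE,
)
```

Components that are not listed on an incident keep their default Operational status, which mirrors how most status pages behave.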
One of the first challenges people encounter when setting up their status page is defining their components. Depending on your product, it might not be clear how to break it down in a way that reflects both your architecture and how your customers experience different aspects of the product.
On your external-facing status page, use language that your customers will recognize when naming your components. Canva’s status page is a great example of this!
If you have quite a few components, consider grouping them in a logical order to make it easier for your customers to find what they’re looking for.
Your statuses reflect the current state of each component, with the default state being Operational (i.e. working as expected with no known issues impacting performance). If something isn’t right, you’ll update the component’s status to one of the following:
Degraded
Partial Outage
Full Outage
It can be easy for the lines between these to blur, leading to engineers spending valuable time in incidents second-guessing what status to use. To avoid this, make sure your definitions for each are clear and documented. Here’s how I like to break these down:
Degraded: The component is usable, but in a degraded state, meaning it is performing slowly or experiencing intermittent errors.
Partial Outage: The component is inaccessible or experiencing functionality errors that make it unusable to a portion of users.
Full Outage: The component is fully inaccessible or unusable to all users.
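One way to take the guesswork out of these definitions is to encode them as a small decision function. The sketch below is only an illustration: the function name and its two inputs (whether affected users can still use the component, and the fraction of users affected) are assumptions chosen to match the definitions above.

```python
def choose_status(usable: bool, affected_fraction: float) -> str:
    """Map observed impact to a component status.

    usable: can affected users still use the component, even if it is
            slow or intermittently failing?
    affected_fraction: share of users affected, from 0.0 to 1.0.
    """
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0.0 and 1.0")
    if affected_fraction == 0.0:
        return "Operational"
    if usable:
        # Slow or intermittently failing, but still usable.
        return "Degraded"
    if affected_fraction < 1.0:
        # Unusable, but only for a portion of users.
        return "Partial Outage"
    # Unusable for all users.
    return "Full Outage"
```

For example, `choose_status(False, 0.3)` returns "Partial Outage", while `choose_status(True, 0.3)` returns "Degraded". In practice a person still makes the call; the point is that the rule is written down before the incident starts.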
“I apologize for such a long letter - I didn't have time to write a short one.” - often attributed to Mark Twain, though the line traces back to Blaise Pascal
When it comes to status page messaging, brevity and clarity are key. Distilling complex incidents into one or two sentences can be a challenging task, and the quote above perfectly captures the difficulty that comes with succinct communication.
One of the best ways to make this easier is to create messaging templates in advance. If possible, work with your communications or customer support teams on templates that are easy to adjust during incidents, so you aren’t writing messages from scratch under the pressure and time constraints of a live incident.
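As a rough sketch of what pre-written templates could look like, the example below uses Python's standard `string.Template`. The stage names, the wording, and the `render_update` helper are hypothetical; your own templates should come from your communications or support teams.

```python
from string import Template

# Hypothetical templates, one per incident stage.
TEMPLATES = {
    "investigating": Template(
        "We are investigating an issue with $component. $impact"
    ),
    "identified": Template(
        "We have identified the cause of the issue with $component "
        "and are working on a fix. $impact"
    ),
    "resolved": Template("The issue with $component is resolved. $impact"),
}


def render_update(stage: str, component: str, impact: str) -> str:
    # substitute() raises KeyError if a placeholder is left unfilled,
    # so an incomplete message is never published by accident.
    return TEMPLATES[stage].substitute(component=component, impact=impact)


message = render_update(
    "investigating",
    "Login",
    "Users may encounter errors when signing in.",
)
```

During an incident, the responder only fills in the component and a one-sentence impact statement; the rest of the wording is already agreed on.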
Another quick trick to shorten messaging? Use active voice. In the active voice, the subject of a sentence performs the action. In a sentence written in the passive voice, the subject receives the action. It’s the difference between:
Passive voice: Slow loading, error messages, and timeouts may be encountered by users.
Active voice: Users may encounter slow loading, errors, and timeouts.
When in doubt, focus on the impact of the issue on users instead of the details surrounding the issue. Your understanding of the incident’s cause will likely evolve as the incident progresses, so it’s best to focus on the immediate impact and any potential workarounds. Unless your audience is deeply technical, stick with plain language to avoid unnecessary confusion.
Slack’s status page consistently sets a great example for user-focused communication. Check out this recent incident as an example. They’ve used active voice, focused on the impact, provided workarounds/troubleshooting steps, and followed up with additional details around the incident’s cause once they reached resolution. While the incident was active, they didn’t let their status page go longer than 30 minutes without an update.
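If you want to hold yourself to a similar update cadence, a simple check like the one below could drive a reminder to the incident responder. This is a sketch under stated assumptions: the 30-minute threshold mirrors the Slack example, and the function name and parameters are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed target: never go longer than 30 minutes without an update.
MAX_SILENCE = timedelta(minutes=30)


def update_overdue(
    last_update: datetime,
    now: datetime,
    max_silence: timedelta = MAX_SILENCE,
) -> bool:
    """Return True if the status page has gone too long without an update."""
    return now - last_update > max_silence


# Hypothetical timestamps for an active incident.
last_posted = datetime(2023, 11, 2, 14, 0, tzinfo=timezone.utc)
overdue = update_overdue(
    last_posted, datetime(2023, 11, 2, 14, 45, tzinfo=timezone.utc)
)
```

Here `overdue` is `True`, because 45 minutes have passed since the last update.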
Rootly allows you to integrate with your existing Atlassian Statuspage or create your own custom Rootly status page in minutes.
Learn more and book a free, customized demo here.