September 13, 2024
Learn how to build a clear, actionable incident response communication plan that ensures effective internal and external communication during any incident.
After years of training and over a decade of investments, you’re aboard Apollo 13, on your way to the moon. The whole world is watching the mission closely. From the launch and as you leave Earth’s orbit, it’s all excitement and constant monitoring. Two days into the journey, you hear unexpected noises. Your dashboards are showing concerning signs. Your life, and everyone else’s aboard, is in danger. “Houston, we’ve had a problem,” you tell mission control, maintaining your cool even though your chances of survival are slim.
It’s been over 50 years, but the safe return of Apollo 13’s crew to Earth remains one of the most emblematic incidents where communication played a crucial role. The astronauts’ exhaustive communications with HQ, along with the coordination of dozens of engineers back on Earth, made it possible to put together a survival plan in the most extreme conditions.
Apollo 13 had over fifteen communication protocols with specific channels, dedicated teams, and precise terminology. The accidental oxygen tank explosion aboard was not a foreseeable incident, but having a communication plan for dealing with the unexpected was crucial to making sure failure was not an option.
Having an incident response communication plan is not a nice-to-have feature. Effective communication can significantly mitigate the impact of an incident. Good communication can smooth out your customers' experience, soothe concerned shareholders, and protect your brand reputation.
No service or team is perfect, which means you’ll inevitably run into incidents. Undoubtedly, your response team will be hard at work to resolve any incident as quickly as possible. However, the perception that impacted users, shareholders, and other teammates have of the incident is just as important as what is actually happening. How you handle an incident can strengthen your relationship with customers and partners and protect your brand reputation. But you can’t improvise communications while managing expectations during and after an incident—you’ll need to have an effective Incident Communication Plan ready.
An Incident Communication Plan is a framework that outlines how and when an organization communicates with different audiences during and after an incident.
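An Incident Communication Plan may include the key stakeholders to notify, incident severity levels and their communication implications, the channels and update cadence for each audience, guidelines for external communications, and guidance on style and tone.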
However, an Incident Communication Plan is not a list of canned messages or generic copy meant to be sent to customers. In fact, I recommend that the SRE teams I coach avoid those practices. Every incident is different, and your communication should be carefully crafted for each case.
An effective Incident Response Communication Plan ensures that everyone who needs to know about an incident is informed. Thus, you need to identify who those people are and when and how to contact them during an incident. You may also outline the type of details that are relevant to each stakeholder.
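Examples of key stakeholders may include customers, executives, investors or shareholders, partners, the press, and teammates outside the response team.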
Depending on the incident’s scope and impact, you will have different communication needs. For example, minor incidents may not need to be communicated to unaffected customers, while for security incidents, you may need to coordinate with law enforcement agencies.
While some communication elements may seem “obvious” as you read this, things can get quite confusing when you’re dealing with an active incident. That’s why outlining severity levels and their potential communication implications is a key part of your Incident Response Communication Plan.
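One way to make severity tiers and their communication implications concrete is to write them down as data that responders can look up during an incident. The sketch below is only an illustration: the tier names (`SEV1` to `SEV3`), the audience labels, and the `audiences_for` helper are all hypothetical, and your own plan will likely define different tiers and audiences.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """Hypothetical severity tiers; adapt the names and meanings to your plan."""
    SEV1 = 1  # critical: widespread customer impact
    SEV2 = 2  # major: partial degradation
    SEV3 = 3  # minor: limited or no customer impact


@dataclass(frozen=True)
class CommsRule:
    audiences: tuple[str, ...]   # who must be kept informed
    public_status_page: bool     # whether a public status update is expected


# Hypothetical mapping from severity tier to communication needs.
COMMS_MATRIX = {
    Severity.SEV1: CommsRule(
        ("responders", "executives", "customer_support", "customers", "investors"),
        public_status_page=True,
    ),
    Severity.SEV2: CommsRule(
        ("responders", "executives", "customer_support", "customers"),
        public_status_page=True,
    ),
    Severity.SEV3: CommsRule(
        ("responders", "customer_support"),
        public_status_page=False,
    ),
}


def audiences_for(severity: Severity, security_incident: bool = False) -> list[str]:
    """Return the audiences to notify for a given severity tier."""
    audiences = list(COMMS_MATRIX[severity].audiences)
    if security_incident:
        # Assumption: security incidents add a liaison for law enforcement.
        audiences.append("law_enforcement_liaison")
    return audiences
```

For example, `audiences_for(Severity.SEV3)` leaves unaffected customers out of the loop, while passing `security_incident=True` adds the extra liaison regardless of tier.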
You’ve established who needs to be contacted and under which circumstances by defining key stakeholders and incident severity tiers. Now, you need to pin down how you’ll get in contact with each of them and set expectations for communication cadence.
For example, for internal communication, I recommend you keep incident collaboration within your usual platforms, such as Slack or Microsoft Teams. That may mean having a dedicated incident channel where responders actively collaborate, while higher-level updates can be shared by a Slack bot like Rootly’s in a different channel for org-wide visibility.
Establish guidelines on what information should remain confidential and what can be shared externally, as well as how to manage those communications.
While your response team needs to remain nimble and not overly bound by protocol to collaborate and mitigate an incident, communication with third parties should be very intentional. You want to avoid triggering unnecessary panic from your customers and ensure there are no misunderstandings from your partners or investors.
External communications are best handled by a teammate with a strong grasp of technology, a deep understanding of your business, and, ideally, media training. It is common to have individuals responsible for each external stakeholder, such as Investor Relations, Customer Relations, or Press Relations.
Collaborate with each of them to develop an external communications strategy that the response team can rely on during an incident. This ensures responders feel comfortable reaching out to the right person at the right time.
Transparency does not necessarily mean sharing specific technical details of an incident. While technical details may clarify the situation for a few people, they are unlikely to help others, even those within your organization, understand what to expect.
Instead, I recommend writing messages that make sense to your audience. For example, when writing a status page update, provide details on what users can expect from the degraded services. When explaining the situation to executives, focus on which accounts are impacted.
Keep in mind that not all of your audiences are technical, but they still require updates that help them understand what’s going on and how you’re mitigating the situation. Try to translate technical issues into simplified terminology that your audience can grasp intuitively.
Communicating too frequently can become noise and take up too much of your time. Communicating too little, however, can be read by stakeholders as a lack of progress in mitigating the incident.
Your Incident Response Communication Plan can include details on how frequently each audience and channel should be updated. This can help alleviate tensions with stakeholders by setting clear expectations.
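As a rough illustration of how a cadence could be written down and checked, the sketch below keeps one update interval per audience and reports which audiences are due for an update. The audience names, the intervals, and the `audiences_due` helper are assumptions made for this example, not recommended values.

```python
from datetime import datetime, timedelta

# Hypothetical update cadence per audience; tune these to your own plan.
UPDATE_CADENCE = {
    "responders": timedelta(minutes=15),
    "executives": timedelta(minutes=30),
    "status_page": timedelta(minutes=60),
}


def audiences_due(last_update: dict[str, datetime], now: datetime) -> list[str]:
    """Return audiences whose last update is older than their cadence.

    Audiences that have never been updated are always considered due.
    """
    due = []
    for audience, interval in UPDATE_CADENCE.items():
        last = last_update.get(audience)
        if last is None or now - last >= interval:
            due.append(audience)
    return due


# Usage: responders were updated 20 minutes ago, executives 10 minutes ago,
# and the status page has not been updated yet.
now = datetime(2024, 9, 13, 12, 0)
last_update = {
    "responders": now - timedelta(minutes=20),
    "executives": now - timedelta(minutes=10),
}
print(audiences_due(last_update, now))  # ['responders', 'status_page']
```

Keeping the cadence as plain data means it can be reviewed alongside the rest of the plan and adjusted per severity tier if needed.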
Tools like Rootly AI can also help you shorten the time needed to write updates for each audience, as they can generate summaries for specific stakeholders based on the incident’s context. Use those summaries as a starting point to save up to 51% of the time spent writing them, as Google has done.
You can skip the “Dearest customer, we sincerely regret any inconvenience caused” in most cases. Nobody appreciates corporate jargon or generic phrases being thrown at them while they’re in distress.
Include a section on style and tone in your Incident Response Communication Plan so everyone is aligned on how to provide a cohesive voice for your company. Emphasize making your messaging sound empathetic, showing you’re human and care about those affected.
You’ve seen the typical “an error was made” in incident communication messages. For your audience, this language is unhelpful and can come across as if you're deflecting responsibility for what is happening.
Acknowledge the issue directly and provide details that reassure your audience you’re doing everything possible to resolve the incident. Be specific if you need action from your reader—for example, if you need your customers to reconnect their authentication service.
Rootly is an on-call and incident management solution that helps you orchestrate communication during and after an incident. You can set up automated workflows to notify relevant stakeholders when certain conditions are met, or generate AI-powered summaries based on the incident context.
Rootly’s Reliability Advocate, Ashley Sawatsky—former incident communications lead at Shopify—coaches dozens of teams on best practices around reliability. She has put together an in-depth, comprehensive Incident Comms Playbook to help companies formalize their communications plan.