Incident Management Goes to the Olympics
A look at outages and disruptions to the IT systems that power the Olympics, from 1996 to today.
January 17, 2024
Every quarter, we host a roundtable discussion centered around the challenges encountered by incident responders at the world’s leading organizations. These discussions are lightly facilitated and vendor-agnostic, with a carefully curated group of experts. Everyone brings their own unique perspective and experience to the group as we dive deep into the real-world challenges incident responders are facing today.
Our last roundtable, held in December, focused on incident retrospectives: whether they’re necessary, how to build the right culture around them, and what a successful retrospective entails. As always, we had a wide range of opinions (and a few hot takes). We’ve distilled some of the key takeaways from the discussion into this post, so without further ado, here they are!
Should a universal retrospective process be applied to all incidents? Or should you align the scope of your retro with the scope and severity of the incident? Participants were divided, with some advocating for a universal retrospective model and others favoring a tailored approach based on the incident's severity and impact. The discussion revolved around key triggers, such as resource allocation, customer impact, and the necessity of communicating outcomes to leadership. It was also called out that running retros even for less severe incidents is good reinforcement: the practice builds an organization's overall competence at running retrospectives and raises their quality.
Some companies have experimented with allowing the incident commander to decide on a case-by-case basis whether or not to run a retro. This garnered mixed results. Some found it to be a positive way to reduce “retrospective fatigue” and ensure that time spent in retros felt meaningful, while others found this opened up too much inconsistency in decision-making, as some commanders had a tendency to skirt the retro process more frequently.
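One way to reduce the inconsistency of case-by-case decisions is to write the triggers down as an explicit rule. The sketch below is illustrative only: the `Incident` fields, the 1-to-5 severity scale, and the threshold are hypothetical and were not prescribed by the roundtable. It keeps a lightweight retro for minor incidents, reflecting the point that regular practice builds competence.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    severity: int                   # hypothetical scale: 1 = most severe, 5 = least
    customer_impact: bool           # did customers notice the incident?
    leadership_report_needed: bool  # must outcomes be communicated to leadership?


def retro_scope(incident: Incident, full_retro_max_severity: int = 2) -> str:
    """Return "full" or "lightweight" based on the triggers named in the discussion."""
    if incident.severity <= full_retro_max_severity:
        return "full"
    if incident.customer_impact or incident.leadership_report_needed:
        return "full"
    # Minor incidents still get a short retro so the habit is maintained.
    return "lightweight"


scope = retro_scope(Incident(severity=4, customer_impact=True, leadership_report_needed=False))
```

An organization that prefers the universal model could simply return "full" in every case; the value of a written rule is that the decision no longer depends on who the commander happens to be.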
Building a culture of reliability emerged as a crucial factor influencing the enthusiasm for incident retrospectives. Ease of implementation and a correlation between organizational size and retrospective adoption were key considerations. Senior leadership was highlighted as instrumental in setting the tone for a culture of improvement, especially in larger organizations.
Creating a psychologically safe environment during retrospectives was a top priority across orgs of all sizes. Discussions covered attendance policies, language choices, and the involvement of leadership. “Blameless” retro language like “areas of opportunity” and “contributing factors” was seen as driving a more psychologically safe environment than “root cause”. Some went as far as using the passive voice intentionally to describe actions and events during an incident (e.g. “A failure occurred” vs. “We caused a failure”). A focus on fixing systems rather than humans was a common sentiment, very much in line with Sidney Dekker’s writing on the topic.
Some pushed for open attendance (anyone interested in the incident) but many seemed committed to keeping the group restricted to responders and responsible teams. There was a notable distinction between sharing the outcomes of the retro (e.g. the completed doc) widely, which was unanimously encouraged, vs. inviting spectators to attend the actual meeting, which was a more contentious approach. Leadership was often involved in the retrospective meeting, which opened discussion around whether this might hinder transparency and candor during the process. It was noted that for this reason, leadership should be held to the same (if not higher) standards around using blameless language and tactful discussion.
Given the popularity of the “Blameless” approach, the group explored the challenge of maintaining accountability without assigning blame. Larger organizations often designated a "responsible party" and an executive sponsor for incidents, while smaller teams experimented with the "if you declare it, you run it" mentality.
The group discussed the evolving landscape of metrics, with MTTx facing scrutiny, and the shift towards measuring "speed of shipping" versus "frequency of incidents."
Human-centric metrics, balancing incident frequency, number of responders, and duration, were deemed valuable, aligning with the principles of DORA metrics.
A model to balance human metrics and occurrence metrics was described using the following factors:
Frequency of incidents + Number of responders + Duration of incident
This felt aligned and complementary with how DORA metrics are being used in some organizations, and paired nicely with quantifying the human cost of incidents.
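The roundtable named the three factors but did not specify how to combine them. As a minimal sketch, assuming responder-hours (responders multiplied by duration) is an acceptable proxy for the human cost of a single incident, the factors could be summarized per reporting period as follows. The `Incident` fields and the `human_cost_summary` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    responders: int        # number of people pulled into the response
    duration_hours: float  # time from declaration to resolution


def human_cost_summary(incidents: list[Incident]) -> dict[str, float]:
    """Summarize frequency, responders, and duration for one reporting period."""
    count = len(incidents)
    # Assumption: responders x duration approximates the human cost of one incident.
    responder_hours = sum(i.responders * i.duration_hours for i in incidents)
    return {
        "incident_count": count,
        "responder_hours": responder_hours,
        "avg_responder_hours": responder_hours / count if count else 0.0,
    }


# Made-up example data for one quarter.
quarter = [Incident(3, 2.0), Incident(5, 0.5), Incident(2, 4.0)]
summary = human_cost_summary(quarter)
```

Tracking the count alongside the totals keeps the frequency factor visible, so a quarter with many small incidents is not mistaken for one with a single long one.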
The group discussed the importance of normalizing, for leadership and other stakeholders, the idea that while incidents cannot be avoided altogether in complex systems, valuable business improvements can still be made by investing in the speed and quality of response.
Overall, December’s roundtable showcased a rich exchange of ideas on incident retrospectives, emphasizing the need for flexibility, cultural alignment, psychological safety, and thoughtful metric choices. As organizations continue to refine their incident response processes, these insights can provide valuable guidance for creating a resilient and learning-oriented culture.
🗓️ Our next roundtable discussion takes place on February 13, 2024.
Ryan McDonald is a Senior Solutions Architect at Rootly. He has over 10 years' experience in information technology and incident management and is passionate about helping organizations improve their resilience, reliability, and responsiveness.