

RescueOps - Ep. 4: Situation Awareness and Real-Time Tracking
Whether scaling a mountain or troubleshooting an outage, situational awareness and real-time tracking can help your team build resilience and minimize costly delays.
February 5, 2025
6 mins
MTTR isn’t the silver bullet for reliability—it’s a trap. Learn why traditional incident metrics fall short, how SLOs provide a better approach, and how gamedays can help you test and improve system resilience.
Jacob is a Developer Advocate with 13 years of experience, specializing in Cloud, SRE, and storytelling to make technology more accessible.
When I was an SRE at an enterprise e-commerce company, I ran into one incident that’s burned into my memory. It was late on a Sunday night, peak traffic for sports fans finalizing their purchases after a day of games. Suddenly, alerts were firing left and right: CPU spikes, database latency through the roof, timeouts cascading like dominoes. It felt like everything was on fire, and we were frantically trying to put it out.
In the aftermath, my team did what many teams do: we measured MTTR (Mean Time to Resolve) and patted ourselves on the back for bringing the system back online in just under an hour. (If you’re curious: the data center lost power, the backups failed, and then our ability to move traffic over to the other data center also failed. Classic.) But here’s the kicker: despite “fixing” the problem quickly, the same issue reared its head again a month later. Why? Because we were more focused on chasing metrics like MTTR than on addressing the underlying systemic issues.
This wasn’t a one-off; it was a pattern. And it took us far too long to realize that the way we measured success with MTTR was completely broken.
Metrics like MTTR, MTTD (Mean Time to Detect), and MTTM (Mean Time to Mitigate) are supposed to measure incident management success. But here’s the uncomfortable truth: they often do more harm than good. Here’s why:
They Lack Context:
They Encourage Bad Behavior:
They Focus on Internal Efficiency Over User Impact:
They Ignore Prevention:
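To make the "lack of context" and "user impact" points concrete, here is a minimal sketch in Python. Every number in it is invented for illustration, and the `user_minutes_lost` measure is an assumption of this example rather than a standard metric. It shows how two quarters can report the same MTTR while hurting users to wildly different degrees.

```python
from statistics import mean

# Hypothetical incident records: (minutes_to_resolve, users_affected).
# All numbers are invented for illustration only.
quarter_a = [(30, 50), (30, 40), (30, 60)]
quarter_b = [(5, 10), (5, 20), (80, 100_000)]


def mttr(incidents):
    """Mean time to resolve, in minutes."""
    return mean(minutes for minutes, _ in incidents)


def user_minutes_lost(incidents):
    """Rough user impact: outage minutes multiplied by users affected."""
    return sum(minutes * users for minutes, users in incidents)


for name, incidents in (("A", quarter_a), ("B", quarter_b)):
    print(
        f"Quarter {name}: MTTR={mttr(incidents):.0f} min, "
        f"user-minutes lost={user_minutes_lost(incidents):,}"
    )
```

Both quarters average 30 minutes to resolve, yet the second one includes a single long outage that touches far more users. The average alone cannot tell those two stories apart.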
What You’re Measuring Wrong
The Common Mistakes:
What to Measure Instead:
Service Level Objectives (SLOs) are a modern, user-focused approach to measuring reliability. They shift the focus from internal averages to outcomes that matter to customers.
How SLOs Work:
Examples of SLOs in Practice:
By tracking these objectives and tying them to an error budget, teams gain a structured way to prioritize reliability without obsessing over arbitrary resolution times.
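As a rough illustration of how an error budget falls out of an SLO, here is a small Python sketch. The 99.9% target, the request counts, and the `error_budget` helper are all hypothetical, chosen only to keep the arithmetic easy to follow. They are not taken from any particular tool or from the incident described above.

```python
from fractions import Fraction


def error_budget(slo_percent, total_requests, failed_requests):
    """Return (allowed_failures, remaining_failures, fraction_consumed).

    slo_percent is passed as a string such as "99.9" so the arithmetic
    stays exact instead of picking up floating-point rounding noise.
    """
    slo = Fraction(slo_percent) / 100
    allowed = (1 - slo) * total_requests          # failures the SLO tolerates
    remaining = allowed - failed_requests         # negative means overspent
    consumed = Fraction(failed_requests) / allowed if allowed else None
    return allowed, remaining, consumed


# Hypothetical month: 1,000,000 requests, 250 of them failed.
allowed, remaining, consumed = error_budget("99.9", 1_000_000, 250)
print(f"Allowed failures: {float(allowed):.0f}")
print(f"Remaining budget: {float(remaining):.0f}")
print(f"Budget consumed:  {float(consumed):.0%}")
```

In this made-up month the team has spent a quarter of its budget. How to react to that (keep shipping features, or slow down and invest in reliability) is a policy decision each team would set for itself.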
When you over-rely on MTTR, you optimize for resolution speed at the expense of prevention and systemic fixes. It wasn’t until we started adopting SLOs and measuring user impact that we began to see real improvements. We stopped rushing to “fix” and started focusing on resilience and reliability.
One of the most impactful changes you can make is introducing gamedays—simulated incident response exercises that help teams prepare for real incidents before they happen. Here’s how gamedays can benefit you:
Gamedays don’t have to be complex or time-consuming, but they do require planning. Here’s how you can get started:
Metrics are only as good as the outcomes they drive. In the beginning, I shared a story about an incident where we focused too much on resolution speed (MTTR) instead of long-term reliability. By shifting away from MTTX metrics and focusing on SLOs, gamedays, and user impact, you can avoid repeating these same mistakes. Building resilience and improving incident response isn't about chasing the fastest resolution—it's about preventing failures from recurring and ensuring your team and customers are set up for long-term success.
Get more features at half the cost of legacy tools.