
February 5, 2025

6 mins

From MTTR to SLOs: a shift towards proactive reliability

MTTR isn’t the silver bullet for reliability—it’s a trap. Learn why traditional incident metrics fall short, how SLOs provide a better approach, and how gamedays can help you test and improve system resilience.

Written by Jacob Plicque III

Jacob is a Developer Advocate with 13 years of experience, specializing in Cloud, SRE, and storytelling to make technology more accessible.

When I was an SRE at an enterprise e-commerce company, I ran into one incident that’s burned into my memory. It was late on a Sunday night—peak traffic for sports fans finalizing their purchases after a day of games. Suddenly, alerts were firing left and right: CPU spikes, database latency through the roof, timeouts cascading like dominoes. It felt like everything was on fire, and we were frantically trying to put it out.

In the aftermath, my team did what many teams do: we measured MTTR (Mean Time to Resolve) and patted ourselves on the back for bringing the system back online in just under an hour. (If you’re curious: the data center lost power, the backups failed, and then our ability to move traffic over to the other data center failed too. Classic.) But here’s the kicker: despite “fixing” the problem quickly, the same issue reared its head again a month later. Why? Because we were focused on chasing metrics like MTTR instead of addressing the underlying systemic issues.

This wasn’t just a one-off; it was a pattern. And it took us far too long to realize that the way we measured success with MTTR was completely broken.

The Pain: Why MTTX Metrics Are Misleading

Metrics like MTTR, MTTD (Mean Time to Detect), and MTTM (Mean Time to Mitigate) are supposed to measure incident management success. But here’s the uncomfortable truth: they often do more harm than good. Here’s why:

They Lack Context:

  • MTTR averages out resolution times, masking the complexity of incidents. A quick fix for a minor issue carries the same weight as a marathon effort to resolve a critical outage.

They Encourage Bad Behavior:

  • Teams rush to “resolve” incidents quickly to hit targets, often closing tickets prematurely or implementing band-aid solutions.

They Focus on Internal Efficiency Over User Impact:

  • Metrics like MTTR don’t tell you how incidents affect your customers or the business.

They Ignore Prevention:

  • MTTX metrics only track what happens after something goes wrong, not how to stop it from happening in the first place.

What You’re Measuring Wrong

The Common Mistakes:

  • Over-Reliance on Averages: One long incident skews your MTTR and hides real trends (see the quick illustration after this list).
  • Chasing Arbitrary Numbers: Setting MTTR targets often leads to gaming the system instead of meaningful improvements.
  • Neglecting User Experience: Metrics that don’t account for customer impact fail to reflect what really matters.
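
To make the averages problem concrete, here is a quick illustration in Python; the incident durations are made up for the example:

import statistics

# Hypothetical resolution times (minutes) for five incidents in one month:
# four routine fixes and one major outage.
resolution_times = [12, 9, 15, 11, 240]

print(f"MTTR (mean): {statistics.mean(resolution_times):.1f} min")   # 57.4
print(f"Median:      {statistics.median(resolution_times):.1f} min") # 12.0

# The mean suggests incidents "typically" take almost an hour to resolve,
# yet four of the five were fixed in 9-15 minutes and a single outage
# dominated the number. The average hides the trend you actually care about.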

What to Measure Instead:

  • Customer Impact Duration: How long were users affected by the incident?
    • Example: “Degraded performance will not exceed 30 minutes per month.”
  • Error Rate: Track the percentage of requests that fail due to errors (a minimal sketch of this calculation follows this list).
    • Example: “Error rates will remain below 1% on a rolling 30-day window.”
  • Post-Incident Action Item Follow-Through: Measure how well your team addresses systemic issues uncovered in postmortems.
  • Incident Recurrence Rates: Track how often similar issues reoccur and focus on reducing them.
  • Team Health Metrics: Monitor alert fatigue, on-call workload, and recovery time between incidents to ensure sustainability.
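
As a concrete sketch of the error-rate measurement above, the snippet below computes the failure percentage over a rolling 30-day window from a list of (timestamp, succeeded) records. The record format is an assumption for illustration; in practice you would pull this from your metrics store or access logs:

from datetime import datetime, timedelta, timezone

def rolling_error_rate(records, now, window_days=30):
    """Percentage of failed requests within the rolling window."""
    cutoff = now - timedelta(days=window_days)
    recent = [ok for ts, ok in records if ts >= cutoff]
    if not recent:
        return 0.0
    failures = sum(1 for ok in recent if not ok)
    return 100.0 * failures / len(recent)

# Toy data: (timestamp, request succeeded?)
requests = [
    (datetime(2025, 1, 20, tzinfo=timezone.utc), True),
    (datetime(2025, 1, 25, tzinfo=timezone.utc), False),
    (datetime(2025, 2, 3, tzinfo=timezone.utc), True),
    (datetime(2025, 2, 4, tzinfo=timezone.utc), True),
]

now = datetime(2025, 2, 5, tzinfo=timezone.utc)
rate = rolling_error_rate(requests, now)
print(f"30-day error rate: {rate:.2f}%")          # 25.00% on this toy sample
print("Within SLO" if rate < 1.0 else "SLO breached (target: below 1%)")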

SLOs: A Better Way to Measure Success

Service Level Objectives (SLOs) are a modern, user-focused approach to measuring reliability. They shift the focus from internal averages to outcomes that matter to customers.

How SLOs Work:

  • Set Realistic Targets: Define SLOs based on user expectations.
    • Example: “99.95% uptime per month.”
  • Leverage Error Budgets: Allow for some failure while balancing reliability with innovation (the arithmetic is sketched after this list).
    • Example: “At 99.95% uptime, we have roughly 22 minutes of downtime allowed per month.”
  • Align with Business Goals: Ensure SLOs reflect what’s important to customers and stakeholders.
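
To show where numbers like “22 minutes of downtime allowed per month” come from, here is a small sketch that converts an availability target into a monthly error budget (assuming a 30-day month for the arithmetic):

def error_budget_minutes(slo_target_pct, days=30):
    """Minutes of allowed downtime for a given availability SLO."""
    total_minutes = days * 24 * 60                  # 43,200 for a 30-day month
    allowed_failure_fraction = 1 - slo_target_pct / 100
    return total_minutes * allowed_failure_fraction

for target in (99.9, 99.95, 99.99):
    print(f"{target}% uptime -> {error_budget_minutes(target):.1f} min/month")

# 99.9%  -> 43.2 min/month (the "no more than 43 minutes" example below)
# 99.95% -> 21.6 min/month (roughly the 22 minutes mentioned above)
# 99.99% ->  4.3 min/month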

Examples of SLOs in Practice:

  1. Latency SLO:
    • Example: "99% of API requests must complete in under 200ms."
    • Why It Matters: Keeps the focus on customers experiencing smooth, responsive applications, rather than on uptime alone (a compliance-check sketch follows this list).
  2. Error Rate SLO:
    • Example: "The percentage of failed transactions must remain below 0.5% over a 30-day rolling period."
    • Why It Matters: Tracks and limits errors from impacting users before they become a crisis.
  3. Availability SLO:
    • Example: "99.9% uptime per month, ensuring no more than 43 minutes of downtime."
    • Why It Matters: Ties system reliability directly to the user experience.
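
To make the latency SLO concrete, here is a minimal compliance check for “99% of API requests must complete in under 200ms”; the function name and the sample latencies are illustrative:

def latency_slo_met(latencies_ms, threshold_ms=200.0, target_fraction=0.99):
    """True if at least target_fraction of requests beat the latency threshold."""
    if not latencies_ms:
        return True  # no traffic, nothing violated
    fast_enough = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast_enough / len(latencies_ms) >= target_fraction

# Toy sample of observed request latencies (milliseconds).
sample = [120, 95, 180, 210, 150, 175, 190, 130, 160, 140]
print("Latency SLO met:", latency_slo_met(sample))  # 9/10 = 90% -> False

In production you would usually compute this from latency histograms in your monitoring system rather than raw samples, but the pass/fail comparison is the same.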

By tracking these objectives and tying them to an error budget, teams gain a structured way to prioritize reliability without obsessing over arbitrary resolution times.

Bringing It All Together: Lessons from the Field

When you over-rely on MTTR, the emphasis on resolution speed crowds out prevention and systemic fixes. It wasn’t until we started adopting SLOs and measuring user impact that we began to see real improvements. We stopped rushing to “fix” and started focusing on resilience and reliability.

Putting SLOs to the Test with Gamedays

One of the most impactful changes you can make is introducing gamedays—simulated incident response exercises that help teams prepare for real incidents before they happen. Here’s how gamedays can benefit you:

  • Test SLOs in Real Scenarios: Use gamedays to simulate failures and validate your team’s response time and effectiveness under controlled conditions.
  • Uncover Blind Spots: Identify system dependencies and failure modes that might not be obvious until an actual incident occurs.
  • Build Team Confidence: Gamedays provide a low-stakes environment to practice incident resolution, helping team members build confidence in their skills.

How to Implement Gamedays in Your Team

Gamedays don’t have to be complex or time-consuming, but they do require planning. Here’s how you can get started:

  1. Start Small: Choose a simple but impactful scenario, like a database connection failure or API timeout, and simulate it in a controlled way (see the sketch after this list).
  2. Make It Routine: Schedule gamedays regularly, such as once per sprint or quarterly, so they become part of your engineering culture.
  3. Leverage Automation: Use chaos engineering tools like Gremlin, Harness, or Steadybit to introduce controlled failures, reducing manual effort and increasing realism.
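
If you don’t have a chaos engineering tool wired up yet, a gameday can still start in plain code. The sketch below is a hypothetical fault-injection wrapper (not part of Gremlin, Harness, or Steadybit) that randomly fails or slows calls to a dependency, which is enough to rehearse the “database connection failure or API timeout” scenario from step 1 in a test environment:

import random
import time

class FaultInjector:
    """Wraps a callable and injects failures and latency with a given probability."""

    def __init__(self, func, failure_rate=0.3, max_extra_latency_s=0.5, seed=None):
        self.func = func
        self.failure_rate = failure_rate
        self.max_extra_latency_s = max_extra_latency_s
        self.rng = random.Random(seed)

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            # Simulate a dependency outage, e.g. a refused database connection.
            raise ConnectionError("gameday: injected dependency failure")
        # Simulate degraded latency on the calls that do succeed.
        time.sleep(self.rng.uniform(0, self.max_extra_latency_s))
        return self.func(*args, **kwargs)

def fetch_order(order_id):
    # Stand-in for a real database or API call.
    return {"order_id": order_id, "status": "shipped"}

flaky_fetch_order = FaultInjector(fetch_order, failure_rate=0.3, seed=7)

for order_id in range(5):
    try:
        print(flaky_fetch_order(order_id))
    except ConnectionError as exc:
        print(f"order {order_id}: {exc}")  # does your retry and alerting path catch this?

During the gameday, point your dashboards and alerts at the environment running the wrapper and check whether the on-call flow fires, escalates, and communicates the way you expect.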

Lessons from the Field: What We Learned from Gamedays

  1. Don’t Treat Incident Response as an Afterthought:
    • Gamedays forced us to acknowledge that response efficiency is just as critical as system stability, and they’re an opportunity to learn.
  2. Communication Is Key:
    • Role-playing different on-call situations revealed weak points in our alerting and escalation processes.
  3. Metrics Should Drive Learning, Not Vanity:
    • Gamedays exposed how some of our old metrics (MTTR, MTTD) weren’t actually helping us improve.

Conclusion: Measure What Matters

Metrics are only as good as the outcomes they drive. In the beginning, I shared a story about an incident where we focused too much on resolution speed (MTTR) instead of long-term reliability. By shifting away from MTTX metrics and focusing on SLOs, gamedays, and user impact, you can avoid repeating these same mistakes. Building resilience and improving incident response isn't about chasing the fastest resolution—it's about preventing failures from recurring and ensuring your team and customers are set up for long-term success.
