July 24, 2024
8 mins
Measure what matters, not what is easier. Learn tips to untangle the different common metrics used by SREs.
Pets.com was an online pet supply retailer founded in 1998, during the dot-com craze. In February 2000, it raised $83 million in an IPO built mainly on metrics like user acquisition, website traffic, and brand recognition. However, its profit margins were minimal and its marketing costs exorbitant, and Pets.com filed for bankruptcy nine months after its IPO.
The industry now recognizes these as vanity metrics: they might make you feel good when they trend upwards, but they do not necessarily translate to business value. SREs are also at risk of focusing too much on vanity metrics in their incident management practice.
There are dozens of things you could measure in your incident management process, each with a progressively cryptic acronym. But which of them have a real impact on your business? Which ones actually drive your reliability upwards?
In this article, you’ll get an overview of the most common metrics in the SRE space, how they can be useful, and when they can be misleading.
{{subscribe-form}}
Incident volume metrics help you gauge the performance of your overall system. They hint at how reliable your system is based on the number of incidents and their frequency over time. However, incident volume metrics do not give you insights into your readiness to handle incidents.
The alert-to-incident ratio, also known as the Alert Conversion Rate, is a priority metric for taking care of your team. It's important to never miss a beat, but overwhelming your team to the point of alert fatigue has negative consequences for your reliability strategy. Fatigued responders are at risk of burnout and underperform when an alert actually turns out to be an incident.
If too many of your alerts never convert to incidents, you need to review your alert sources. Finding the right balance in your observability toolchain requires fine-tuning and never-ending iteration, which is why keeping an eye on the alert-to-incident ratio matters.
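As a minimal sketch of how you might track this, assuming you can export alerts with a flag indicating whether each one was promoted to an incident (the field names here are hypothetical, not a specific tool's API):

```python
from dataclasses import dataclass

@dataclass
class Alert:
    id: str
    became_incident: bool  # hypothetical flag: did this alert get promoted to an incident?

def alert_to_incident_ratio(alerts: list[Alert]) -> float:
    """Share of alerts that converted into real incidents (0.0 to 1.0)."""
    if not alerts:
        return 0.0
    converted = sum(1 for a in alerts if a.became_incident)
    return converted / len(alerts)

# Example: 3 alerts, only 1 turned into an incident -> roughly 33%
alerts = [Alert("a1", True), Alert("a2", False), Alert("a3", False)]
print(f"{alert_to_incident_ratio(alerts):.0%} of alerts became incidents")
```

A very low ratio suggests noisy alert sources; a ratio near 100% may mean you're only alerting after things are already on fire.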
Incidents Over Time is simply the number of incidents declared over a period of time. This metric is relatively shallow and isn't usually useful on its own. It's generally of interest to stakeholders outside the SRE team, but it can easily become a vanity metric.
Having more or fewer incidents over a month or a quarter doesn’t mean your system is more or less reliable; it might be more related to the seasonality of your business or broader issues outside your control.
However, Incidents Over Time is useful for figuring out how burdened your on-call schedules are and whether they need additional support. Typically, organizations aim for a maximum of two incidents per on-call shift for their responders.
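As a rough sketch, assuming you can export incident start times and map them onto your shift boundaries (the 12-hour day/night split below is illustrative), you could flag shifts that exceed that two-incident guideline:

```python
from collections import Counter
from datetime import datetime

# Illustrative incident start times pulled from your incident tracker
incident_starts = [
    datetime(2024, 7, 1, 3, 15),
    datetime(2024, 7, 1, 9, 40),
    datetime(2024, 7, 1, 11, 5),
    datetime(2024, 7, 2, 22, 30),
]

def shift_key(ts: datetime) -> str:
    """Map a timestamp to a 12-hour shift label, e.g. '2024-07-01 day'."""
    return f"{ts.date()} {'day' if 8 <= ts.hour < 20 else 'night'}"

incidents_per_shift = Counter(shift_key(ts) for ts in incident_starts)
for shift, count in incidents_per_shift.items():
    flag = "  <- over the 2-incident guideline" if count > 2 else ""
    print(f"{shift}: {count} incident(s){flag}")
```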
MTBF measures the time elapsed between two consecutive incidents in a system or component and averages it over a period of time, much like the "days since the last accident" boards on warehouse walls. MTBF can be a good indicator of how reliable your system is because it hints at how often it fails.
You can use the Mean Time Between Failures to predict the reliability of your systems, schedule preventive maintenance, and compare the reliability of different components. Identifying the components with the highest failure rates gives you the data to justify a deeper dive into them.
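A minimal sketch of the calculation itself, assuming you have a list of failure timestamps for a single component (the dates are illustrative):

```python
from datetime import datetime, timedelta

def mean_time_between_failures(failures: list[datetime]) -> timedelta:
    """Average gap between consecutive failures; needs at least two data points."""
    ordered = sorted(failures)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)

failures = [
    datetime(2024, 6, 3, 14, 0),
    datetime(2024, 6, 17, 2, 30),
    datetime(2024, 7, 1, 20, 15),
]
print(f"MTBF: {mean_time_between_failures(failures)}")  # ~14 days between failures
```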
Incident response metrics measure how effective your incident response process is. They can help you improve that process by highlighting gaps in your team's skills or toolchain.
How fast is your team reacting to alerts? The answer can be found in the Mean Time to Acknowledge (MTTA). This metric focuses specifically on the lapse between an alert being generated by the system and an acknowledgment by someone on your team.
While important, MTTA is more an indicator of how well set up your on-call solution is. If the Mean Time To Acknowledge is higher than you’d like, you’ll want to review your on-call rotations and escalation policies.
The tricky part of MTTA is that it can easily be gamed by on-call engineers. If you set goals for MTTA, your responders will make sure to acknowledge alerts quickly, but that doesn't mean they effectively see an alert through to resolution when it turns out to be an incident.
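Conceptually, MTTA is just the average of acknowledgment time minus alert time. A minimal sketch, assuming each alert record carries both timestamps (the pairs below are hypothetical):

```python
from datetime import datetime, timedelta

# Hypothetical (triggered_at, acknowledged_at) pairs exported from your alerting tool
alert_timestamps = [
    (datetime(2024, 7, 1, 3, 15), datetime(2024, 7, 1, 3, 19)),
    (datetime(2024, 7, 2, 11, 0), datetime(2024, 7, 2, 11, 2)),
    (datetime(2024, 7, 3, 22, 40), datetime(2024, 7, 3, 22, 51)),
]

def mtta(pairs: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean Time to Acknowledge: average delay between trigger and acknowledgment."""
    delays = [acked - triggered for triggered, acked in pairs]
    return sum(delays, timedelta()) / len(delays)

print(f"MTTA: {mtta(alert_timestamps)}")
```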
Now we get to the star of all metrics: MTTR, which stands for Mean Time to Resolve / Recover / Repair. You'll see MTTR plastered across the marketing of everyone trying to sell you SRE-related tools. That's because it's the metric that everyone, from incident commander to leadership, cares about most: are we solving things fast?
The Mean Time to Resolve (MTTR) counts the average time between acknowledgment and resolution across your incidents in a given timeframe. MTTR is critical because it’s linked to how effective your incident response process is and can be directly related to lost revenue, eroded customer trust, brand damage, or potential SLA breaches.
However, MTTR doesn't tell you what's going on inside your incident response process. You'll need to dig deeper into which step of the process is the bottleneck and experiment with changes.
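One way to dig deeper is to break each incident's resolution time into phases and average them separately, so the bottleneck stands out. A rough sketch, assuming you record per-incident timestamps for acknowledgment, diagnosis, and resolution (these phase names are illustrative, not a standard schema):

```python
from datetime import datetime, timedelta

# Hypothetical per-incident timestamps: acknowledged -> diagnosed -> resolved
incidents = [
    {"ack": datetime(2024, 7, 1, 3, 19), "diagnosed": datetime(2024, 7, 1, 4, 0), "resolved": datetime(2024, 7, 1, 4, 30)},
    {"ack": datetime(2024, 7, 2, 11, 2), "diagnosed": datetime(2024, 7, 2, 12, 15), "resolved": datetime(2024, 7, 2, 12, 40)},
]

def average(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)

diagnosis_time = average([i["diagnosed"] - i["ack"] for i in incidents])
fix_time = average([i["resolved"] - i["diagnosed"] for i in incidents])
mttr = average([i["resolved"] - i["ack"] for i in incidents])

print(f"MTTR: {mttr}  (diagnosis: {diagnosis_time}, fix: {fix_time})")
```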
The name on this one can be confusing. Average Incident Response Time (AIRT) refers to the time it takes to route the incident to the team member best suited to resolve it.
Surprisingly, many companies report that a significant share of their MTTR is spent simply assembling the right response team.
Improving your AIRT is not simple. Finding the right team usually implies a more or less structured diagnosis of the incident, which means your responders must be well-versed in logs and traces.
Some companies are using AI in their incident response to find teammates who have solved a similar issue in the past. Others, like Meta, use AI to find changes in the codebase that may have caused the incident at hand.
Mean Time To Contain (MTTC) comes from the security world, known for having to deal with the scariest incidents. This indicator aims to capture the total time of an incident from its inception to its resolution.
MTTC includes the time it took to detect the issue (MTTD) + the time to acknowledge (MTTA) + the time to resolve it (MTTR).
MTTC can be useful to understand the whole lifespan of a threat found in your system and figure out its implications.
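As a back-of-the-envelope sketch of how those pieces add up (the numbers are purely illustrative, not benchmarks):

```python
from datetime import timedelta

# Illustrative component times for a single incident
mttd = timedelta(minutes=12)   # detection: issue starts -> alert fires
mtta = timedelta(minutes=4)    # acknowledgment: alert fires -> responder acks
mttr = timedelta(minutes=75)   # resolution: ack -> service restored

mttc = mttd + mtta + mttr
print(f"MTTC: {mttc}")  # total lifespan of the incident: 1:31:00
```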
Incidents are not merely technical glitches: they carry a wave of implications, from loss of potential revenue to legal and compliance issues. Reliability leaders often have to keep an eye on other metrics that track both their own cost and the business value they contribute to the company.
Although not paying for on-call duty was the norm for a while, many organizations have started adopting different on-call compensation schemes. Beyond the labor laws that require it in some regions, being compensated for on-call duty makes responders more willing to acknowledge alerts and resolve incidents effectively.
When compensation is involved, the question of how many people you have on-call becomes more important. You'll usually have several levels in your escalation policy, with potentially more than one person on-call at each level.
However, the amount of on-call time is usually just a curiosity. You’ll generally structure on-call schedules, rotations, and escalation policies based on your system, business needs, and available staff. The additional compensation, if any, will be already baked into your staff budget.
System Uptime is the ultimate indicator of reliability: the percentage of time in which your systems are working properly. The closer you are to 100%, the happier your stakeholders and customers will be. But that ideal world does not exist, so you need to monitor closely what your uptime actually looks like.
System uptime is often used in marketing materials and can be baked into SLAs for deals that require strong reliability promises.
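The arithmetic itself is simple: uptime is the share of the period during which your system was healthy. A minimal sketch, assuming you already know the total downtime for the period:

```python
from datetime import timedelta

def uptime_percentage(period: timedelta, downtime: timedelta) -> float:
    """Percentage of the period during which the system was up."""
    return 100 * (period - downtime) / period

# Example: 43 minutes of downtime over a 30-day month -> ~99.90% uptime
month = timedelta(days=30)
print(f"{uptime_percentage(month, timedelta(minutes=43)):.2f}% uptime")
```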
SLAs and SLOs are an entire field of study within SRE. SLA stands for Service Level Agreement, and SLO for Service Level Objective.
SLAs are legally binding: they describe the quality and availability of the service you're offering to a third party.
SLOs are generally for internal use and express the ideal objective you want to reach, leaving you some error budget compared to the SLA.
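To make the SLO/error-budget relationship concrete, here is a rough sketch that converts an availability target into the downtime you can "spend" over a period (the 99.9% and 99.5% targets are illustrative):

```python
from datetime import timedelta

def allowed_downtime(target: float, period: timedelta) -> timedelta:
    """Downtime permitted by an availability target over a period, e.g. 99.9% over 30 days."""
    return (1 - target) * period

month = timedelta(days=30)
slo_budget = allowed_downtime(0.999, month)     # internal SLO: 99.9%
sla_allowance = allowed_downtime(0.995, month)  # external SLA: 99.5%

print(f"SLO error budget: {slo_budget}")                        # ~43 minutes
print(f"Buffer before breaching the SLA: {sla_allowance - slo_budget}")
```

The gap between the two is the room your SLO leaves you before the legally binding SLA is at risk.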
Metrics and KPIs help you gauge how your reliability practice is performing over time and at scale. Black swan incidents or internal strategy changes can hugely alter your indicators. That's why you'll also need to get into the weeds of the day-to-day practice to understand what's really going on in your SRE team and process.
Rootly is the incident management solution trusted by LinkedIn, Cisco, and Canva. Rootly manages the entire incident lifecycle in a centralized tool, which means it keeps track of most SRE indicators for you. You can set up dashboards and generate reports with MTTR or any other metric split by team or service. Book a demo to see how Rootly can help your team improve their reliability practice.
{{cta-demo}}
See Rootly in action and book a personalized demo with our team