May 6, 2021

Practical Guide to SRE: Automating On-Call

Let's face it: on-call work isn't fun, but it can be better. Even if you have to be on call, it would be nice to have at least some of the work done for you before you drag yourself out of bed at 3am to respond to an incident.

Written by

JJ Tang

In a perfect world, everything would just work, and humans could focus on important things like having a work-life balance. Unfortunately, we don't live in a perfect world. No one wants to wake up in the middle of the night to respond to a page, but it happens.

Systems break, broken code is deployed to production, DDoS attacks happen, humans make mistakes, websites go down. Even with modern automated systems, humans often need to get involved to deal with the incidents that invariably creep into our day-to-day work.

Pretty much anyone who has ever worked on-call has multiple horror stories about waking up bleary-eyed to a page at 3am because the company website is down. This person has no idea what caused the problem, and now they need to troubleshoot, find the cause of the issue, and hopefully fix it in a timely manner.

In the end, on-call work is really just incident response with a different name. So how can we make on-call easier and more efficient?

Problems of On-Call

Probably the single biggest issue with on-call work is scheduling. Staff turnover, vacations, and sick days all make on-call schedules harder to maintain. Sometimes the on-call engineer is simply in transit from work to home, or maybe they just need to hit the gym to work off some of the frustration from last night's page.

There is also the issue of new team members coming on board who don't know all of the ins and outs of the new systems they're supposed to learn. Frequently, another engineer needs to shadow them (or vice versa) to assist when something happens.

Other times, due to tight deadlines or software breakage, there will be an emergency release that could probably have been handled better with a little more planning.

It's often the case that there isn't a clear set of criteria as to when a human should be alerted, and when not. This leads to any and all unusual conditions in a system causing an alert to be fired and a human notified.

Too many alerts lead to alert fatigue and engineer burnout, which can spill over into your personal life. It's difficult to destress during your off hours when you're afraid you might miss a page.

Tired engineers are less effective than fresh ones. It's almost always the case that a fresh engineer can respond and deal with a problem more effectively and efficiently than one who's been awakened multiple times during the week for ongoing issues.

Even aside from the problems of scheduling and burnout, on-call work can be difficult in other ways. An engineer may lack domain-specific knowledge about a particular part of the system that's having trouble.

They may not be aware of any recent changes that happened. It might be necessary to log into multiple systems and dashboards to gather information and formulate a reasonable hypothesis as to the root cause of a problem.

Isolation and the lack of an escalation path are also issues. If there is no obvious person to call with questions, the engineer loses time searching for someone who can help. If they can't find the right runbook or other needed information, that also lengthens the time it takes to remediate an incident.

Incident coordination adds to the long list of tasks that the on-call engineer needs to do each time something happens. If there isn't a standard, automated way to communicate the nature and impact of a situation, the engineer can end up spending more time answering questions from people than remediating the issue at hand.

Documenting incidents after the fact adds to the tedium and toil of on-call work. Even if a problem has been resolved, there is still the remaining work of writing up a trouble ticket or postmortem document describing what happened for follow-up later in the process.

Many of these issues don't need to be this way, but how can they be addressed?

Follow the Sun

In the modern world, a distributed workforce is becoming more common. Especially with global companies, workers may be spread across multiple time zones and geolocations. Scheduling on-call work across time zones helps to alleviate burnout and keep engineers fresh.

The Follow the Sun (FTS) model has been used for years at many companies to support not only software development, but also customer support. There is no reason that this can't also apply to on-call work and incident response.

In an FTS scheduling model, teams work in different time zones and hand off activities at the end of their day to the next time zone. The team coming online reviews the activities of the team going offline and takes up the work where the other team left off. Obviously, this works best with multiple time zones several hours apart, but it can be made to work with any reasonably long time gap between teams.

A very common example of this would be a US-based company with a presence in the EU. With a time gap of several hours, it means that engineers would respond on their day shift to incidents that occur, while the other time zones are offline. Once the day is over, the incident and all information pertaining to it can be handed off to the new team for follow-up.
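To make the hand-off idea concrete, here is a minimal sketch of how a scheduler might decide which team is on shift at a given moment. The shift table, the UTC hours, and the team names (including a third region beyond the US and EU) are assumptions made for illustration, not a recommendation for any particular company.

```python
from datetime import datetime, timezone

# Hypothetical shift table: (start_hour_utc, end_hour_utc, team).
# Hours and team names are illustrative assumptions.
SHIFTS = [
    (7, 15, "eu-team"),    # roughly the European working day
    (15, 23, "us-team"),   # roughly the US working day
    (23, 7, "apac-team"),  # this shift wraps past midnight UTC
]


def team_on_shift(now: datetime) -> str:
    """Return the team whose shift covers `now` (a timezone-aware datetime)."""
    hour = now.astimezone(timezone.utc).hour
    for start, end, team in SHIFTS:
        if start < end:
            if start <= hour < end:
                return team
        elif hour >= start or hour < end:
            # The shift crosses midnight, so it covers two hour ranges.
            return team
    raise LookupError(f"no team covers hour {hour} UTC")
```

A page fired at 09:00 UTC would go to `eu-team`, while one at 03:00 UTC would go to `apac-team`, so nobody is woken up as long as the table covers all 24 hours.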

One of the requirements for a Follow the Sun model is having the correct processes and automation in place to facilitate a hand-off between teams at the end of their shift. It's not simply a matter of different teams working different hours in their own time zones.

With this type of model, not only can engineer burnout be reduced, but so can Mean Time To Detection (MTTD) and Mean Time To Resolution (MTTR). Engineers are already awake and working when they receive an alert. They can respond more quickly to address it.
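If you want to check whether a scheduling change actually moves these numbers, both metrics are simple averages over incident records. The sketch below assumes each incident is a dictionary with `started`, `detected`, and `resolved` timestamps (hypothetical field names), and it measures both metrics from the moment the problem started. Definitions vary between organizations, so treat that choice as an assumption too.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average gap in minutes between two timestamps across incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


# Made-up incident records, for illustration only.
incidents = [
    {
        "started": datetime(2021, 5, 6, 3, 0),
        "detected": datetime(2021, 5, 6, 3, 10),
        "resolved": datetime(2021, 5, 6, 4, 0),
    },
    {
        "started": datetime(2021, 5, 6, 14, 0),
        "detected": datetime(2021, 5, 6, 14, 2),
        "resolved": datetime(2021, 5, 6, 14, 20),
    },
]

mttd = mean_minutes(incidents, "started", "detected")  # (10 + 2) / 2 = 6.0
mttr = mean_minutes(incidents, "started", "resolved")  # (60 + 20) / 2 = 40.0
```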

Proper communication and handoff are crucial for this model to work. Organizations have to avoid the silos that have been so common in the past. Shared ownership of software and services is a must. This is where automation comes in.

Automating Away the Toil

One of the key principles of Site Reliability Engineering (SRE) is to automate toil. The less manual work engineering teams have to perform, the more time they can spend on software development. Having standardized tools and processes is a great way to begin automating away the toil.

Proper on-call automation needs to have several key features. Since we talked about scheduling being a major issue, that's the one we'll address first.

On-call schedules shouldn't just be kept in a calendar or spreadsheet. They should be integrated into your monitoring and alerting systems. That way, it's never a question of trying to remember who is on call when, or what the proper escalation path is, or even how to contact someone during an incident.

Being able to switch schedules easily is also important. You might be scheduled to be on call during a given shift, but maybe there is an emergency that necessitates swapping with someone else. This shouldn't be complicated or difficult.
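As a rough sketch of what "schedule as data" could look like, the example below keeps a base rotation plus a list of overrides, and lets an override win whenever it covers the moment in question. The `Shift` class, the `on_call` function, and the engineer names are all hypothetical; a real on-call tool would expose its own API for this.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Shift:
    engineer: str
    start: datetime
    end: datetime  # exclusive


def on_call(at: datetime, rotation: List[Shift], overrides: List[Shift]) -> Optional[str]:
    """Return who is on call at `at`; overrides (swaps) take precedence."""
    for shift in overrides + rotation:
        if shift.start <= at < shift.end:
            return shift.engineer
    return None  # nobody scheduled: a gap worth alerting on


rotation = [
    Shift("alice", datetime(2021, 5, 3), datetime(2021, 5, 10)),
    Shift("bob", datetime(2021, 5, 10), datetime(2021, 5, 17)),
]
# Carol covers one night for Alice; the swap is a single extra record.
overrides = [Shift("carol", datetime(2021, 5, 6, 18), datetime(2021, 5, 7, 6))]
```

Because a swap is just one more record, the base rotation never has to be edited, and the alerting system can always answer "who is on call right now?" with a single lookup.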

Along those lines, alerts should be routed automatically to the appropriate teams. It doesn't do anyone any good if an alert is sent to the wrong team, who then needs to locate the correct team to resolve an issue. This only increases toil and lengthens MTTR.
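In its simplest form, automatic routing is a lookup from something on the alert to the team that owns it. The sketch below assumes alerts arrive as dictionaries with a `service` field and that unowned alerts fall back to a default team. The field name, the routing table, and the team names are illustrative assumptions.

```python
# Hypothetical ownership table: service name -> owning team.
ROUTES = {
    "checkout": "payments-team",
    "login": "identity-team",
}
DEFAULT_TEAM = "sre-team"  # catch-all so no alert is silently dropped


def route_alert(alert: dict) -> str:
    """Pick the team that should receive this alert."""
    return ROUTES.get(alert.get("service", ""), DEFAULT_TEAM)
```

With this table, `route_alert({"service": "checkout"})` returns `"payments-team"`, and an alert for an unknown service lands with `"sre-team"` rather than with whoever happens to be awake.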

At the same time, escalation of alerts should be simple and automatic whenever possible. In one example, maybe the engineer who is normally on call simply didn't receive a page because their phone is dead. The system should be able to escalate on a predefined path to another engineer who can respond instead.

In another example, perhaps there's a junior engineer who simply isn't able to resolve an alert themselves. They need a quick and easy path where they can simply push a button to get the support they require.
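Both examples can be covered by one small rule: page further down a predefined path as an alert stays unacknowledged, and let the responder skip ahead on request. The following is a sketch only. The escalation path, the ten-minute steps, and the role names are assumptions, not values from any particular paging product.

```python
from typing import List

# Hypothetical escalation path: (minutes unacknowledged, who gets paged).
ESCALATION_PATH = [
    (0, "primary"),
    (10, "secondary"),
    (20, "engineering-manager"),
]


def responders_to_page(minutes_unacked: float, escalate_now: bool = False) -> List[str]:
    """Return everyone who should have been paged so far.

    `escalate_now` models the "push a button for help" case: it pulls in
    the next person on the path without waiting for the timer.
    """
    reached = [who for delay, who in ESCALATION_PATH if delay <= minutes_unacked]
    if escalate_now and len(reached) < len(ESCALATION_PATH):
        reached.append(ESCALATION_PATH[len(reached)][1])
    return reached
```

If the primary's phone is dead, the alert reaches the secondary after ten minutes with no human intervention. A junior engineer three minutes into an incident can call `responders_to_page(3, escalate_now=True)` and bring in the secondary immediately.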

An automated scheduling system will also help organizations if they decide that a Follow the Sun model makes sense for their teams. Once the issue of scheduling is resolved, you will need a way to address some of the other problems of being on call.

The more time an engineer spends hunting for the information needed to troubleshoot and resolve an issue, the higher MTTR climbs. Incident enhancement is one way to reduce toil in this area.

If an alert is fired due to a detected problem, automation should take care of as much of the information gathering as possible. Records of changes made to the system, monitoring data, commit history, logs, and runbooks should be at a responder's fingertips through a centralized dashboard.

Using this dashboard, the engineer can quickly see what happened just prior to the alert firing, and create a better hypothesis as to what caused the problem. They can then act accordingly to address the root cause of the issue.
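One piece of that information gathering is easy to sketch: given a list of change records, show only the ones that landed shortly before the alert fired, newest first. The record shape (a dictionary with `time` and `summary` keys) and the one-hour lookback window are assumptions for illustration.

```python
from datetime import datetime, timedelta


def recent_changes(changes, alert_time, lookback=timedelta(hours=1)):
    """Return changes made within `lookback` before the alert, newest first."""
    window_start = alert_time - lookback
    hits = [c for c in changes if window_start <= c["time"] <= alert_time]
    return sorted(hits, key=lambda c: c["time"], reverse=True)


# Made-up change records, for illustration only.
changes = [
    {"time": datetime(2021, 5, 6, 1, 0), "summary": "Rotate TLS certificates"},
    {"time": datetime(2021, 5, 6, 2, 10), "summary": "Raise cache TTL"},
    {"time": datetime(2021, 5, 6, 2, 45), "summary": "Deploy checkout v42"},
    {"time": datetime(2021, 5, 6, 3, 10), "summary": "Roll back checkout"},
]
suspects = recent_changes(changes, alert_time=datetime(2021, 5, 6, 3, 0))
```

Here `suspects` contains the 02:45 deploy followed by the 02:10 cache change, which is usually a much better starting point for a hypothesis than a blank dashboard.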

In the middle of an incident, time is frequently spent simply tracking people down and communicating the scope and impact of a problem. Having a centralized dashboard helps, but automatically centralizing communications in a chat channel or virtual war room goes a long way towards easing the burden of telling people what's happening.

Finally, once an incident is over, the last thing anyone wants to spend time on is creating tickets and writing reports. A system should take all of the information regarding the incident and automatically create tickets or generate reports. This also helps to ensure that details which surfaced in the heat of the moment aren't forgotten.
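As a small illustration, a first-draft report can be assembled from nothing more than the timestamped events captured during the incident. The `draft_report` function and its plain-text layout are hypothetical; the point is only that the timeline and duration never need to be retyped by hand.

```python
from datetime import datetime


def draft_report(title, events):
    """Build a plain-text report from (timestamp, description) pairs."""
    if not events:
        raise ValueError("an incident report needs at least one event")
    ordered = sorted(events)  # chronological order
    minutes = int((ordered[-1][0] - ordered[0][0]).total_seconds() // 60)
    lines = [f"Incident: {title}", f"Duration: {minutes} minutes", "Timeline:"]
    lines += [f"  {when:%H:%M} {what}" for when, what in ordered]
    return "\n".join(lines)


# Made-up events, for illustration only.
events = [
    (datetime(2021, 5, 6, 3, 47), "Rollback complete, alert cleared"),
    (datetime(2021, 5, 6, 3, 2), "Alert fired: checkout error rate high"),
    (datetime(2021, 5, 6, 3, 15), "Recent deploy identified as likely cause"),
]
report = draft_report("Checkout errors", events)
```

The responder still has to add analysis and follow-up actions, but the draft starts with an accurate timeline instead of whatever anyone remembers the next morning.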

Conclusion

As you can see, there are a lot of problems with on-call work that make life harder for everyone. Using a Follow the Sun scheduling model is one way to ease some of the pain. Proper automation is another.