The Best SRE Tools To Improve Reliability and Streamline Operations
Discover the essential SRE tools for monitoring, incident management, automation, and more!
December 13, 2023
8 min read
Before I stumbled into the tech industry (a story for another day), I spent several years in the customer service world as a server and front-of-house manager in restaurants. It was in these jobs that I first honed some critical skills that would later lead me on the path to incident response. I hadn’t thought much about the connections between these two parts of my career until a recent conversation with Adriana Villela, in which we noted two things: A) many people in tech got here by the more traditional route of studying computer science in college after high school, without ever working in industries like customer service, and B) the skills gained in service-focused roles are, sadly, often overlooked. So I decided to share some of the lessons I learned in customer service and how they taught me to be a better incident responder in the tech world.
Imagine you’re out for dinner with a guest. Maybe you’re catching up with an old friend, or out on a first date. Your server delivers your food to the table, wishes you a nice meal, and walks away to tend to their other tables and duties. All is well until you go to take that first bite and you realize you’ve been served the wrong meal. Instead of the vegetarian pasta you ordered, you’ve received pasta that has pieces of chicken in it. As a longtime vegetarian, it’s not an option for you to eat the meal in front of you. From here, let’s imagine two different scenarios follow:
In the first, your server promptly checks on you to ask how the first couple of bites are going. You point out the mistake, they apologize, take it to the kitchen to be corrected, and bring you an extra breadbasket while you wait. Nobody is sitting hungry, your new meal is on its way, and you continue enjoying a night out.
Now, an alternate scenario. You point out to your friend that your meal is incorrect. Your server is nowhere to be found. Your friend, trying to be polite, isn’t eating their food either (who wants to sit there eating while their guest is stuck with an inedible meal?!). Your dinners are getting cold and you’re awkwardly looking around for your server hoping to get their attention. You’ve seen them whiz by a few times, clearly busy. The more time goes by, the hungrier you’re getting, and the more you can feel the energy of your night out shifting. Instead of conversing with your guest, you’re both focused on the problem with your meal. Finally, you get your server’s attention and they apologize for the confusion and take your meal back to the kitchen to be replaced.
In both of these scenarios, the same mistake was made. But the speed of response made the difference between a minor inconvenience and a ruined night out.
Speed of response is just as important when things go wrong with technology. When your system is down and you haven’t acknowledged the problem, your customers’ frustration builds. They’re left to fend for themselves, maybe taking to social media or reaching out to your support team for answers. By promptly acknowledging the issue and showing that you’re on your way to resolving it, you avoid creating unnecessary stress for those impacted by the problem.
You may have noticed that there’s another difference between the two response scenarios, and this one illustrates the important practice of next-issue avoidance.
The initial problem is, of course, that you were served the wrong meal. But, as we know from incidents, the impact of an initial problem has downstream consequences. Without anticipating and responding to those, you’re always a step behind. In this case, the incorrect meal led to the next problem — one of you has food, the other one doesn’t. Awkward. An astute server will notice this and recognize that people don’t come to a restaurant with a guest solely to eat a meal. They’ve come to share a meal. To eat, talk, and enjoy each other’s company. By bringing a breadbasket to the table while you wait for your new meal, the first server has ensured that the mistaken entrée is a minimal disruption, instead of adding to the problem by creating an uncomfortable situation for the guest whose meal was correct.
In IT incident response, this is the goal of the containment or mitigation phase of the incident. We haven’t solved the problem, but we’ve contained the impact of it. This highlights the importance of managing an incident through its stages. We don’t need to have a new meal ready immediately, or know whose fault it was that the meal was incorrect, to address the immediate problem that a customer doesn’t have food in front of them. Don’t rush past opportunities to contain the impact of an issue in pursuit of a full resolution. Having containment strategies in place is key to minimizing the impact of incidents. Tactics like rollbacks or beta flags are good ways to ensure you can contain issues quickly while you work towards a more complete resolution.
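To make the containment idea concrete, here is a minimal sketch of a flag used as a "kill switch": turning the flag off sends users back to the known-good path right away, while the actual fix is still in progress. Everything in it is hypothetical and for illustration only (`FlagStore`, the `new_checkout` flag, and both checkout functions are invented names); a real setup would more likely read flags from a flag service or config store.

```python
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """Hypothetical in-memory flag store (assumption: stands in for a real flag service)."""

    flags: dict[str, bool] = field(default_factory=dict)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so unreleased code paths stay dark.
        return self.flags.get(name, False)

    def disable(self, name: str) -> None:
        # The containment action: switch the risky path off without a deploy.
        self.flags[name] = False


def legacy_checkout(cart_total: float) -> str:
    """Known-good path (hypothetical)."""
    return f"legacy:{cart_total:.2f}"


def new_checkout(cart_total: float) -> str:
    """Newly released path that might misbehave (hypothetical)."""
    return f"new:{cart_total:.2f}"


def checkout(flags: FlagStore, cart_total: float) -> str:
    # Route each request based on the current flag value.
    if flags.is_enabled("new_checkout"):
        return new_checkout(cart_total)
    return legacy_checkout(cart_total)


flags = FlagStore({"new_checkout": True})
before = checkout(flags, 10.0)   # served by the new path
flags.disable("new_checkout")    # contain the incident
after = checkout(flags, 10.0)    # served by the legacy path again
```

Like the breadbasket, disabling the flag does not resolve the underlying problem. It only limits who is affected while the investigation continues.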
I don’t know about you, but when I go to a chef-owned restaurant with a Michelin star, I have very different expectations of my meal and experience than I would at, say, an Olive Garden. When someone is out for their anniversary or a special occasion, they have different expectations of how their night will feel than if they were out having a casual meal with a co-worker after work. What’s tricky about this is that we don’t always know what these expectations are until we’ve failed to meet them. So how can we avoid the disappointment of mismatched expectations? For starters, we get to know our customers better.
At a fine dining restaurant I worked at, it was mandatory to ask anyone booking a reservation whether it was a special occasion or not. That way, if someone came out on their birthday or an anniversary, or to celebrate a milestone, their server knew about it before they even sat down. We had special table settings and champagne at the ready, so couples who came to dine at our restaurant for their anniversary were greeted with a long-stem rose on the table and two glasses of champagne. The goal was to surprise and delight—to exceed expectations and create a memorable experience. While your users may not be interacting with your software to celebrate their special occasions, they do come with their own expectations in terms of your reliability and the experience they’ll have if something does go wrong.
As you build out your incident response program, consider what your users’ expectations might be. Do you have high-priority customers who expect a different level of service than a user who just signed up for a free trial? Do your users know where they’ll be informed of outages or issues that impact them? Are you publishing incidents to your status page in vain only to be inundated with support requests from customers who expected an email to notify them of issues?
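One way to turn these questions into something reviewable is to write the expectations down as data. The sketch below is only an illustration under invented assumptions: the `Tier` values, the `wants_email` preference, and the channel names are hypothetical and would need to reflect what your own customers have actually been promised.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Hypothetical customer tiers, for illustration only.
    FREE_TRIAL = "free_trial"
    HIGH_PRIORITY = "high_priority"


@dataclass(frozen=True)
class Customer:
    name: str
    tier: Tier
    wants_email: bool = False  # assumed opt-in preference


def notification_channels(customer: Customer) -> list[str]:
    """Return the channels on which this customer expects to hear about an incident."""
    channels = ["status_page"]  # assumption: everyone can check the status page
    if customer.wants_email:
        channels.append("email")
    if customer.tier is Tier.HIGH_PRIORITY:
        # Assumption: high-priority customers get a direct, personal contact.
        channels.append("account_manager")
    return channels


trial_user = Customer("trial-user", Tier.FREE_TRIAL)
key_account = Customer("key-account", Tier.HIGH_PRIORITY, wants_email=True)
trial_channels = notification_channels(trial_user)
key_channels = notification_channels(key_account)
```

Even a small table like this makes mismatches visible before an incident, such as customers who expect an email while the process only updates the status page.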
On top of that, everything you do is setting a precedent for your customers on what to expect from you. We know the importance of conducting a thorough retrospective on our systems when they break so we can identify what exactly happened and how to prevent it, but there’s also value in taking a deeper look at the human side of the response after an incident. This is where you can dig into the actions you took, the experience they created for your team and users, and how you can continue or improve that experience next time.
For the most part, I enjoyed my time working in the service industry. I met interesting people every night, had fun with my co-workers, and made pretty good money doing it. But the job was also physically demanding: I worked long shifts (often from 5 pm until well after 1 am), carried trays full of heavy plates and drinks, and was on my feet walking across a large restaurant floor. The kicker was that I was expected to do all of this while wearing black stiletto heels, under a policy that set a three-inch minimum heel height. It was (in my opinion) unnecessary, unfair, and unsafe. It was so unsafe, in fact, that one night a co-worker of mine slipped on a melted ice cube on the floor as she passed through the bar area to grab drinks for a table, and sprained her ankle badly. She returned to work many weeks later with a doctor’s note specifying that she required an exemption from the high heel policy. The reason I’m sharing this story isn’t to complain about sexist dress codes, but to highlight the fact that this policy made it more difficult for this restaurant’s employees to do their jobs well, and it did absolutely nothing to improve the experience of the restaurant’s customers.
When this friend returned to work wearing a sensible pair of ballet flats, there wasn’t any change in how customers reacted. Nobody complained or cared (or likely even noticed) that she was wearing flat shoes. She served her tables well, her regular customers were glad to see her, and she felt safer doing her job without the distracting risk of rolling her ankle with any misstep. There was likely a time when societal norms (unfortunately) made this policy make sense, but these days, most reasonable modern people would rather have their meal brought to them quickly and safely.
Incident responders have difficult jobs too: long on-call shifts, high-stakes situations, and technically complex systems to troubleshoot. Process and policy can be an accelerator in incident response, bringing structure and clarity to otherwise chaotic situations. But if you’re not careful, bad process can easily become an insidious hindrance. It pays to have a healthy amount of skepticism when implementing rigid process or policy into your incident response practice so you don’t accidentally slow your responders down. The policies you have in place at one point also may not make sense later on. Just as we’re responsible for keeping our software systems healthy and up to date, we should pay attention to the operations and policies that support the people who take care of these systems.
If you worked in another industry before moving into tech, I’d love to hear about the parallels you’ve noticed between previous jobs and your current work. And a special thanks to Adriana Villela for inspiring this blog post!