A system is reliable when it's behaving the way you expect it to behave. Imagine you're buying shoes online. You hit “Add to cart,” but it takes 10 seconds to process. Yes, it is working, but you're probably frustrated because it took 10 seconds when it should’ve taken less than a second.
When you’re on the other side, running that system, you hopefully know there is a problem with the cart experience, but you need more details to fix it. This is where Observability comes into play. Adriana builds on Hazel Weakly’s concept to define Observability as “the ability to ask meaningful questions, get useful answers, and to be able to act on that information.”
You use Observability, first of all, to know if your system is behaving as expected and, if it isn’t, to figure out why. That’s why Observability is the foundation of Reliability.
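To make "behaving as expected" concrete, here is a minimal sketch in Python that checks the cart example against an expectation. The one-second limit comes from the example above; the 95th-percentile choice, the sample latencies, and every name in the code are assumptions made for illustration, not a prescribed method.

```python
import math

# Assumed expectation, taken from the cart example: under one second.
EXPECTED_MAX_SECONDS = 1.0


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def behaves_as_expected(latencies_seconds, pct=95, limit=EXPECTED_MAX_SECONDS):
    """True when the chosen percentile of latency is within the limit."""
    return percentile(latencies_seconds, pct) <= limit


# Made-up "Add to cart" latencies, in seconds.
healthy = [0.2, 0.3, 0.25, 0.4, 0.35]
degraded = [0.3, 0.4, 10.0, 9.5, 10.2]

print(behaves_as_expected(healthy))
print(behaves_as_expected(degraded))
```

A check like this only answers the first question, whether the system is behaving as expected. Figuring out why it isn't still depends on the telemetry you can query.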
One of your responsibilities as a manager is to protect your team. This is especially true for an SRE team, which tends to come under fire more than most. But when you, as a manager, are being questioned non-stop during an incident, it takes a lot of self-awareness and experience not to pass that pressure down to your team.
Nothing is more annoying than someone pinging you with, "Is it done yet? Is it done yet? Have you figured out the problem?" It'll be done when it gets done. It’s important to keep your cool and protect your team so they feel they’re working in a psychologically safe environment.
If they work under the assumption that fingers can be pointed at them at any moment, they’ll be less effective responders. It’s also a long-term problem for the organization: people who fear blame tend to hold back details, so the root causes of incidents go unexamined. Blameless postmortems are important to let your team grow and make your organization more reliable.
Being on call is stressful as it is. You know you can get called at any point during your shift for something that has gone kaboom in the system. To make matters worse, you might have been in a deep sleep, and your brain isn't functioning properly.
Many people don't last long as SREs because of the anxiety and stress involved. Being on call directly affects your quality of life: a page can wake you in the middle of the night over something that went massively wrong.
As an SRE manager, you have to constantly adjust your on-call rotations and make sure you have clear escalation policies. You also have to be flexible with the personal circumstances of your team members, as some people are more willing than others to take difficult on-call shifts.
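As a rough illustration of what a clear escalation policy and a flexible rotation can look like when written down, here is a small Python sketch. The roles, the 15- and 30-minute thresholds, and the function names are all hypothetical; real paging tools model this in their own ways.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    notify: str         # who gets paged at this step (hypothetical roles)
    after_minutes: int  # minutes the alert has gone unacknowledged


# Assumed policy, purely for illustration.
POLICY = [
    EscalationStep("primary on-call", 0),
    EscalationStep("secondary on-call", 15),
    EscalationStep("SRE manager", 30),
]


def who_to_page(minutes_unacknowledged, policy=POLICY):
    """Return everyone who should have been paged by now, in order."""
    return [s.notify for s in policy if minutes_unacknowledged >= s.after_minutes]


def on_call_for_week(week_number, engineers, unavailable=frozenset()):
    """Round-robin rotation that skips engineers unavailable that week."""
    n = len(engineers)
    for offset in range(n):
        candidate = engineers[(week_number + offset) % n]
        if candidate not in unavailable:
            return candidate
    raise ValueError("nobody is available for this week")
```

For example, `who_to_page(20)` would include the primary and the secondary but not yet the manager, and passing a set of names as `unavailable` is one simple way to account for personal circumstances without rewriting the whole rotation.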
A common complaint from teams getting into Observability is that engineers instrument their code and go all-in on Observability, but once that’s done, they ask, "Okay… now what?"
For Adriana, AI in incident response will be more like a buddy: your Observability buddy. Observability generates a lot of data, and AI is well suited for identifying patterns in logs and traces and suggesting, "Hey, this looks interesting. Maybe you want to take a look."
AI can be valuable as a starting point, for example by suggesting a few entry points for tackling a problem. It can also propose courses of action, including ones you hadn’t thought of.
Adriana started her career relatively isolated, without the chance to meet other women in the industry. It wasn’t until she got into open source that she was exposed to greater diversity: “Oh my God! There are so many women in tech, and so many underrepresented groups!”
In her podcast, Geeking Out with Adriana, she wants to elevate underrepresented voices in tech. “I want it to be a combination of well-known folks and the not-so-well-known names,” she explains.
More than ever, we need diverse groups of technical folks shaping technology.