August 14, 2023
5 min read
What do you do when someone else's incident becomes your problem?
Very few SaaS products exist completely independently. Between cloud service providers, payment processors, content delivery networks, and more, chances are you rely on external systems to keep your product working. When these systems fail, it can leave you feeling pretty helpless. In some cases you might have fallback options, but oftentimes all you can do is wait for recovery and clean up the fallout.
It’s probably not realistic to completely eliminate third-party dependencies, but there are things you can do to enhance your resilience against third-party failures and maintain trust with your customers when outages out of your control impact them.
The first step to managing third-party incidents well is having a clear understanding of all your services and the role they play in your system. Keeping this information in a service catalog ensures it all lives in one place and makes it easily accessible when you need it. It’s where you can document things like:
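As one illustration of what a catalog entry could hold, here is a minimal Python sketch. The field names, the vendors (`ExamplePay`, `ExampleCDN`), and the URLs are all hypothetical, chosen for illustration rather than taken from any particular catalog tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Dependency:
    """A third-party service a component relies on (hypothetical schema)."""
    vendor: str
    purpose: str
    status_page: str                 # where the vendor posts incident updates
    critical: bool                   # True if the component cannot work without it
    fallback: Optional[str] = None   # what to do if the vendor is down, if anything


@dataclass
class Service:
    """One of your own services, with its owner and external dependencies."""
    name: str
    owner: str
    dependencies: List[Dependency] = field(default_factory=list)


def affected_services(catalog: List[Service], vendor: str) -> List[Tuple[str, Dependency]]:
    """Return (service name, dependency) pairs that use the vendor, critical ones first."""
    hits = [(s.name, d) for s in catalog for d in s.dependencies if d.vendor == vendor]
    # sorted() is stable, so services keep catalog order within each group.
    return sorted(hits, key=lambda hit: not hit[1].critical)


# Example catalog with placeholder vendors and URLs.
CATALOG = [
    Service("checkout", "payments-team", [
        Dependency("ExamplePay", "card processing",
                   "https://status.examplepay.example", critical=True),
        Dependency("ExampleCDN", "static assets",
                   "https://status.examplecdn.example", critical=False,
                   fallback="serve from origin"),
    ]),
    Service("web-app", "frontend-team", [
        Dependency("ExampleCDN", "page delivery",
                   "https://status.examplecdn.example", critical=True),
    ]),
]
```

With a structure like this, calling `affected_services(CATALOG, "ExampleCDN")` during a vendor outage would tell you which of your services are exposed, who owns them, and whether a fallback is documented.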
Don’t settle for observability tools that don’t give you the full picture. Investing in observability that goes beyond your self-managed systems allows you to better understand how third-party outages may be affecting your system, and detect them sooner. We would all love it if all of our vendors updated their status pages right away and proactively notified us of incidents, but it doesn’t always happen that way. If you’re relying on third parties for any business-critical services, make sure your monitoring reflects that.
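One way to avoid depending on a vendor's status page is to probe the dependency yourself. The sketch below is a rough illustration using only the Python standard library. The thresholds, the `ProbeResult` type, and the function names are assumptions made for this example, not a recommendation of specific values or a description of any particular monitoring product.

```python
import http.client
import time
import urllib.request
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    ok: bool
    latency_s: float


def probe(url: str, timeout: float = 5.0) -> ProbeResult:
    """Make one request to a vendor endpoint and record success and latency."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (4xx/5xx) and socket timeouts are OSError subclasses.
        ok = False
    return ProbeResult(ok=ok, latency_s=time.monotonic() - start)


def classify(window: List[ProbeResult],
             max_error_rate: float = 0.5,
             max_latency_s: float = 2.0) -> str:
    """Summarize a window of recent probes as 'ok', 'degraded', 'down', or 'unknown'."""
    if not window:
        return "unknown"
    error_rate = sum(not r.ok for r in window) / len(window)
    if error_rate >= max_error_rate:
        return "down"
    any_slow = any(r.ok and r.latency_s > max_latency_s for r in window)
    if error_rate > 0 or any_slow:
        return "degraded"
    return "ok"
```

In practice you would run `probe` on a schedule against an endpoint your vendor exposes, keep the last few results, and alert when `classify` changes state, so that you hear about a third-party problem from your own monitoring rather than from your customers.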
Don’t assume that because an outage isn’t your fault, you’re off the hook for keeping your customers informed. Even if the origin isn’t in your system, your customers expect to be kept aware of downtime. That said, clarifying the source of the problem can help set expectations around your ability to resolve it.
While being transparent with your customers about the cause of the problem is important, be careful about how you frame it. You don’t want to simply throw your providers under the bus. Your relationships with the third-party providers you rely on are important, and those of us who have handled incidents should have more empathy than anyone for other teams experiencing the same thing. If you’re worried your providers are failing to meet their SLAs, have that conversation with them privately after the incident is resolved. As tempting as it can be in the heat of the moment, publicly dragging another service, especially one you’ve chosen to use, won’t reflect well on your business.
When it comes to external communication, third-party incidents can add a layer of complexity. Your agreement with your provider may even prevent you from referencing them directly, so make sure you’re aware of any contractual agreements around how you communicate your relationship to your customers. Avoid speculating on things like time to resolution or cause of the issue. Instead, focus your communication on what you know and what you can control. Your Legal and Public Relations teams are great resources to navigate these situations, so lean on them as needed. Here's an example:
❎ Avoid saying:
Sorry folks, Google Cloud is down right now, which is affecting our platform. There is nothing we can do until the issue is resolved on their end, which could potentially take several hours. You can follow along at status.cloud.google.com.
✅ Instead, say:
Our platform is currently inaccessible due to a cloud network provider outage. We’re in close communication with our provider, and our engineers are following along with their progress towards resolution. Once we have received confirmation that things are back up and running, we will notify all customers via our Status Page. Thank you for your patience.
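Relying on external services is the reality for most SaaS companies, and you should have a plan in place for what you’ll do if a service you rely on fails. To recap the tips covered in this article: keep a service catalog that documents your third-party dependencies, invest in observability that extends beyond your self-managed systems, keep your customers informed even when the outage isn’t your fault, avoid publicly blaming your providers, and focus your external communication on what you know and what you can control.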
Rootly integrates with dozens of tools including Grafana, ServiceNow, Confluence, and more, so you can fully manage any kind of incident right in Slack. Book a free personalized demo today to learn more about how Rootly can automate your incident workflows.
About the Author
Ashley Sawatsky has owned sensitive incident communications for global brands like Disney and Shopify throughout her career in tech. Now she works at Rootly, where she helps other organizations level up their incident management game.