January 14, 2022
4 min read
An overview of major IT incidents and outages in 2021
Now that 2021 has come and gone, SREs can look back definitively at the major incidents of the past year. Let's do that in this post by examining outages at AWS, Azure, Verizon, Fastly and Facebook, and considering what SREs can learn from each.
2021 was not an excellent year for AWS, which suffered multiple network outages.
Arguably the worst happened in mid-December, when AWS networking issues crippled Amazon’s retail logistics systems, among others, for about twelve hours. Given that the incident was squeezed right between Black Friday and Christmas, this was an especially problematic time for retail operations to go offline.
It remains unclear what the root cause of the outage was, but packet loss seems to have played a role. Without knowing more about the exact origin of this outage, it’s hard to draw a specific conclusion about what SREs can learn. But it is a reminder of the importance of observing the network – a process that is sometimes left out of common observability strategies, which tend to focus more on applications and infrastructure than on the networking layer.
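To make that lesson concrete, here is a minimal sketch of what network-layer observability can look like: a probe that measures packet loss toward a handful of critical endpoints and flags anything above a threshold. The target hosts and the 5 percent threshold are assumptions for the example, not anything tied to the AWS incident itself, and in practice you would ship the numbers to your monitoring system rather than print them.

```python
import re
import subprocess

# Illustrative targets only: substitute the dependencies that actually matter
# to your workloads (upstream APIs, DNS resolvers, cloud gateways).
TARGETS = ["1.1.1.1", "8.8.8.8", "api.example.internal"]
LOSS_ALERT_THRESHOLD = 5.0  # percent; an assumed starting point, tune per SLO


def packet_loss(host: str, count: int = 20) -> float:
    """Send a burst of ICMP pings and parse the reported packet-loss percentage."""
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), host],
            capture_output=True, text=True, timeout=60,
        )
    except subprocess.TimeoutExpired:
        return 100.0
    match = re.search(r"([\d.]+)% packet loss", result.stdout)
    return float(match.group(1)) if match else 100.0


if __name__ == "__main__":
    for target in TARGETS:
        loss = packet_loss(target)
        status = "ALERT" if loss >= LOSS_ALERT_THRESHOLD else "ok"
        print(f"{status}: {target} packet loss {loss:.1f}%")
```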
On the very same day that packet loss brought Amazon’s operations to a standstill, Azure suffered its own (apparently unrelated) outage when problems with Azure Active Directory led to authentication failures on applications like O365.
In this case, the damage was less severe than the concurrent AWS outage. Azure AD was offline for only about 1.5 hours. Microsoft fixed the problem quickly by pivoting to backup infrastructure – a lesson for SREs in the importance of keeping fallback environments on hand.
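As a rough illustration of that pattern, here is a minimal sketch of application-level failover, assuming you run a secondary environment behind its own health-check endpoint. The URLs are hypothetical, and Microsoft has not published details of how its own fallback works, so treat this only as an illustration of the idea.

```python
import urllib.error
import urllib.request

# Hypothetical endpoints for illustration; the point is having a tested
# fallback environment, not these specific URLs.
PRIMARY = "https://auth.primary.example.com/healthz"
FALLBACK = "https://auth.fallback.example.com/healthz"


def healthy(url: str, timeout: float = 3.0) -> bool:
    """Consider an endpoint healthy if its health check returns HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False


def pick_endpoint() -> str:
    """Prefer the primary, but switch to the fallback when it stops responding."""
    return PRIMARY if healthy(PRIMARY) else FALLBACK
```

The useful part is not the few lines of logic but the discipline behind them: the fallback environment has to exist, stay patched and be exercised regularly, or the switch will fail exactly when you need it.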
Sometimes, the Internet breaks just because your router freaked out and needs to be reset. Other times, it’s because of a major failure at your ISP.
That was the case for thousands of customers in the northeastern United States back in January 2021, when Verizon suffered a performance degradation on its Fios service.
The incident, which lasted about an hour, reportedly cut traffic volume by around 12 percent. That's not nearly as bad as a complete networking outage, but it was still a major issue, especially for the many users relying on network connectivity to work remotely during the first pandemic winter.
For SREs, the Fios outage was a reminder that, although ISPs are on the whole pretty reliable, they can and do fail sometimes. It's not a bad idea to configure your workloads so that they can fall back to a secondary Internet provider. Most data centers and colocation facilities have multiple Internet connections available, but you need to make sure you're ready to switch to a different one quickly if the need arises.
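In practice this kind of failover is usually handled by BGP or by the routers themselves, but the sketch below shows the basic idea on a single Linux host: probe the Internet through the primary uplink and, if it stops responding, swap the default route to the secondary one. The interface names, gateways and probe target are assumptions made up for the example.

```python
import subprocess

# All names below are assumptions for illustration: adjust the interfaces,
# gateways, and probe target to match your own uplinks.
PRIMARY = {"iface": "eth0", "gateway": "203.0.113.1"}
SECONDARY = {"iface": "eth1", "gateway": "198.51.100.1"}
PROBE_TARGET = "1.1.1.1"


def uplink_alive(iface: str) -> bool:
    """Probe the Internet through a specific interface (Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "3", "-W", "2", "-I", iface, PROBE_TARGET],
        capture_output=True,
    )
    return result.returncode == 0


def switch_default_route(uplink: dict) -> None:
    """Point the default route at the chosen uplink (requires root)."""
    subprocess.run(
        ["ip", "route", "replace", "default",
         "via", uplink["gateway"], "dev", uplink["iface"]],
        check=True,
    )


if __name__ == "__main__":
    if not uplink_alive(PRIMARY["iface"]):
        switch_default_route(SECONDARY)
```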
A large number of Internet users were similarly frustrated in June when websites they were trying to visit were unavailable or slow to load. This time, the problem wasn’t with an ISP’s network, but rather with Fastly’s CDN infrastructure.
Fastly attributed the incident to a bad configuration change. However, the company received a lot of praise for its fast resolution of the incident, which lasted under an hour.
There are two takeaways here for SREs. The first is that you should validate configuration changes before you apply them. But because no amount of validation guarantees against failure, the second lesson is that SREs must be able to trace failures to root causes – like bad configuration updates – very quickly. It’s unclear exactly how Fastly handled the incident so rapidly, but we’re guessing its engineers did a good job of tracking configuration changes and mapping them to performance issues.
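We don't know what Fastly's internal tooling looks like, but the first lesson is easy to demonstrate in miniature: put a validation gate in front of every configuration change and refuse to deploy anything that fails it. The file name and the specific checks below are made up for the example; the pattern is what matters.

```python
import json
import sys

# "cdn-config.json" and the checks below are purely illustrative; the point
# is that a change must pass validation before it is ever pushed.
REQUIRED_KEYS = {"origin", "ttl_seconds", "regions"}


def validate(path: str) -> list[str]:
    """Return a list of problems; an empty list means the config may ship."""
    problems = []
    try:
        with open(path) as f:
            config = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"unreadable or malformed config: {exc}"]

    missing = REQUIRED_KEYS - config.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    ttl = config.get("ttl_seconds")
    if not isinstance(ttl, int) or ttl <= 0:
        problems.append("ttl_seconds must be a positive integer")
    if not config.get("regions"):
        problems.append("at least one region is required")
    return problems


if __name__ == "__main__":
    issues = validate(sys.argv[1] if len(sys.argv) > 1 else "cdn-config.json")
    for issue in issues:
        print(f"INVALID: {issue}")
    sys.exit(1 if issues else 0)  # a non-zero exit blocks the deploy step
```

Run in CI, a non-zero exit from a script like this blocks the deploy; pairing it with a record of which change shipped when is what makes the second lesson, fast root-cause mapping, possible.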
Incident resolution took somewhat longer when Facebook experienced a major outage in October 2021. The failure, which Facebook attributed to an errant command within its data centers, brought most of Facebook, Instagram and WhatsApp offline for about six hours.
As Facebook explained in a blog post, the incident proved especially difficult to resolve because engineers couldn’t reach remote data centers using the network to bring services back online. Instead, they had to go onsite to access the affected servers. Even then, physical access control barriers slowed down operations.
Given the complexity of this incident, you can’t really fault Facebook for not solving it sooner (and six hours is not exactly a long time, although the scope of the outage was severe given that it brought Facebook’s services almost totally offline). Still, the lesson here for SREs involves the importance of developing incident response plans ahead of time, and making sure they address scenarios where remote access via the network is not possible.
Outages are inevitable, even at the world’s biggest companies. What the top incidents of 2021 highlight, however, is the importance of being able to identify issues quickly, determine their contributing factors and then – last but not least – orchestrate an efficient response that resolves the incident as quickly as possible.
Given that the longest major outage of 2021 lasted only about half a day, we think companies have gotten fairly good at doing the above. But there’s always room for improvement!