The Role of SREs in Observability
Although conversation about observability often ignores SREs, SREs have a central role to play in observability success.
April 7, 2021
8 min read
The Suez Canal has been big news over the last couple of weeks. We wondered how a Site Reliability Engineer (SRE) might conduct a postmortem on what happened with the Ever Given, and what that might mean if a comparable incident occurred at a modern tech company.
By now, almost everyone who pays attention to news or social media is aware of the Ever Given, and how it blocked the Suez Canal on March 23, 2021. We thought it would be interesting to look at this incident from an SRE perspective and discuss postmortem best practices through that lens.
Before we dive in though, it's worth talking a little about Ever Given and the Suez Canal for a bit of context and scope.
The Ever Given is a Golden-class container ship, one of eleven ships of this class built to date. Golden-class ships are among the largest ships in the world, and are truly massive vessels.
Ever Given and her sister ships are approximately 400 meters in length, or about 1,312 feet--almost one-quarter mile long. They are about 59 meters (193 feet) wide.
For comparison, the Statue of Liberty is about 305 feet tall; the Eiffel Tower is 984 feet tall; and the Empire State Building (depending on how you measure it…) is between 1,250 feet and 1,454 feet tall. If you were to stand Ever Given on end next to the Empire State Building, the two would be almost the same height.
The Titanic, arguably one of the most infamous ships that ever existed, measured in at only 269 meters (882 feet) long, and 28 meters (93 feet) wide. So in terms of length, Titanic was approximately two-thirds the length of Ever Given, and slightly less than half as wide.
To give you some perspective on just how large this class of ship is, here is an image of one of her sister ships, the Ever Glory. Each of those green boxes you see is a shipping container.
Shipping containers come in various sizes, but the standard unit of measurement for such containers is the TEU, which stands for "Twenty-foot Equivalent Unit". In other words, capacity is counted in containers 20 feet long.
Ever Given has a capacity of approximately 20,124 TEU. That's a lot of containers and a lot of weight these ships can carry.
The Suez Canal connects the Mediterranean Sea and the Red Sea. It provides a key route for ships of all types travelling from Asia and the Middle East to Europe. Another available route, around the tip of Africa, takes weeks longer than transiting through the canal.
It's said that an average of 50 vessels transit the canal each day. In 2020, this amounted to approximately 18,500 vessels. Approximately 12% of global trade passes through the Suez.
Opened in 1869, the Suez Canal has been enlarged over time to its present length of 193 kilometers (120 miles). It is approximately 205 meters (673 feet) wide, and 24 meters (79 feet) deep.
At the time when the canal was originally constructed, ship building technology was significantly different than it is now. As time has gone on, the Suez was enlarged to its present dimensions to accommodate these newer--much larger--classes of ships. A second channel was added along part of the canal to allow for two-way traffic through that portion.
Even with these improvements, the canal is legacy infrastructure badly in need of more updates to keep up with current ship technology and global trade demands.
A lot of what happened on March 23, 2021 is still under investigation. The general agreement at this time seems to be that a sandstorm with high winds of up to 46 mph blew the ship off course, and it became lodged sideways in the canal, completely blocking it.
Considering that the canal is only 205 meters wide, and the Ever Given is approximately 400 meters long, it's not hard to imagine that a ship of this size blown off course could get wedged into the constricted space of the canal if it were to drift sideways at an angle.
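As a rough back-of-the-envelope check on that geometry, the sketch below estimates how far a ship would have to swing away from the canal's centerline before it could touch both banks. It uses only the approximate figures quoted above (400 m length, 59 m width, 205 m canal width), treats the ship as a rectangle, and treats the canal as a straight channel with vertical banks. Real banks slope and the navigable channel is narrower than the surface width, so this is an illustration under stated assumptions, not a reconstruction of the grounding. The function names are our own.

```python
import math

SHIP_LENGTH_M = 400.0   # approximate length quoted above
SHIP_BEAM_M = 59.0      # approximate width quoted above
CANAL_WIDTH_M = 205.0   # approximate canal width quoted above


def span_across_canal(heading_deg: float) -> float:
    """Canal width covered by a rectangular ship rotated heading_deg
    away from the channel centerline (0 means aligned with the canal)."""
    theta = math.radians(heading_deg)
    return SHIP_LENGTH_M * math.sin(theta) + SHIP_BEAM_M * math.cos(theta)


def wedge_angle_deg() -> float:
    """Smallest heading, in 0.1 degree steps, at which the ship spans the canal."""
    for tenths in range(0, 901):
        angle = tenths / 10
        if span_across_canal(angle) >= CANAL_WIDTH_M:
            return angle
    raise ValueError("ship never spans the canal")


if __name__ == "__main__":
    print(f"Ship spans the canal at about {wedge_angle_deg():.1f} degrees off course")
```

Under these simplifying assumptions the answer comes out at roughly 22 degrees, which suggests a ship of this length does not need to be anywhere near fully sideways to touch both banks.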
As the ship became grounded at a point in the canal that did not allow for two-way travel, the Suez was effectively blocked. Over the course of the six days the canal was blocked, hundreds of vessels were delayed or had to be rerouted.
Due to the enormous weight and size of the ship, a massive effort was required to free her from the banks of the canal. Dredging equipment, tug boats, and pumps to help redistribute ballast and fuel inside the ship were all required.
On March 29, 2021, Ever Given was refloated and towed to a lake in the middle of the Suez for inspection and investigation. Fortunately, there was no loss of life. So far no major damage to the ship or the canal has been reported. Global trade through the Suez has resumed.
We are neither maritime nor civil engineers, so we're going to keep the focus on how one might conduct a postmortem of this incident if it were done by an SRE team at a modern tech company. Since not all of the information about the incident has been presented yet, we'll also limit the scope to what is currently available and keep things simple for now.
First, we'd start with an up-front statement that our investigation would be conducted as a blameless postmortem. This means the people involved would all be brought together with the understanding that there would be no blaming, shaming, or recriminations for presenting the information they have on the incident.
Second, we'd use the "five whys" to analyze the incident to try and determine a root cause for what happened. We'll demonstrate how this technique might be used shortly.
Third, a thorough analysis of the incident would be compiled into a postmortem document. This document would be made available to all interested parties to serve as a learning experience, and provide valuable information on how to prevent similar problems in the future.
Fourth, once all the information has been compiled into our postmortem document, we'd make suggestions for remediations--along with associated task tickets--on how similar incidents could be avoided in the future. Team members would be assigned as the responsible person for executing each ticket and due dates would be established.
Finally, regular follow-ups would be conducted to track progress of the tickets until the work is completed.
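To make the ticket and follow-up steps concrete, here is a minimal sketch of how remediation items with an owner and a due date might be tracked, and how a follow-up review could pick out the ones that need attention. The `ActionItem` class, the `needs_follow_up` helper, and every ticket shown are hypothetical and invented for illustration; a real team would normally rely on its existing ticketing system.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    ticket_id: str
    summary: str
    owner: str        # the person responsible for executing the ticket
    due: date
    done: bool = False


def needs_follow_up(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items that are past due, oldest due date first."""
    overdue = [item for item in items if not item.done and item.due < today]
    return sorted(overdue, key=lambda item: item.due)


# Entirely made-up tickets, for illustration only.
items = [
    ActionItem("PM-1", "Document wind-speed limits for transit", "alice", date(2021, 4, 30)),
    ActionItem("PM-2", "Review tug escort policy", "bob", date(2021, 5, 15), done=True),
    ActionItem("PM-3", "Draft blockage response runbook", "carol", date(2021, 4, 20)),
]

for item in needs_follow_up(items, today=date(2021, 5, 1)):
    print(f"{item.ticket_id} ({item.owner}) was due {item.due}: {item.summary}")
```

Running a check like this at each follow-up meeting keeps attention on the open, overdue work rather than on the full ticket list.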
Based on what we know at present, here's a simplified version of how our postmortem might go.
1. What was the expected outcome before the incident happened?
2. What actually happened?
3. Why did it become stuck?
4. Why did this happen?
5. Some additional whys…
6. How can we prevent similar incidents in the future?
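One lightweight way to record a chain of questions like the one above is a small data structure that keeps each "why" next to its answer. The sketch below is an assumption about how that might look, not a standard tool; the `FiveWhys` class is hypothetical, and the sample answers are placeholders drawn from the preliminary reports summarized earlier in this post, not findings of any investigation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    expected: str
    actual: str
    whys: list[tuple[str, str]] = field(default_factory=list)

    def ask(self, question: str, answer: str) -> None:
        """Record one why question together with the best current answer."""
        self.whys.append((question, answer))

    def deepest_answer(self) -> str:
        """The last answer recorded: a candidate root cause, not a verdict."""
        return self.whys[-1][1] if self.whys else self.actual

    def render(self) -> str:
        lines = [f"Expected: {self.expected}", f"Actual: {self.actual}"]
        for depth, (question, answer) in enumerate(self.whys, start=1):
            lines.append(f"Why #{depth}: {question} -> {answer}")
        return "\n".join(lines)


# Placeholder answers based on preliminary reports; still under investigation.
analysis = FiveWhys(
    expected="Ever Given transits the Suez Canal without incident",
    actual="Ever Given became lodged sideways and blocked the canal",
)
analysis.ask("Why did it become stuck?",
             "Reportedly blown off course by winds of up to 46 mph")
analysis.ask("Why could it block the whole canal?",
             "Ship is about 400 m long; canal is about 205 m wide")
analysis.ask("Why could traffic not pass around it?",
             "It grounded in a section without two-way travel")

print(analysis.render())
```

Keeping the chain as data makes it easy to paste into the postmortem document and to revise as the investigation produces better answers.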
As an SRE, there is probably one last thing we'd want to do. One of the key tenets of Site Reliability Engineering is to assume that failure is normal.
We also know this isn't the first time that this sort of problem has happened. With that in mind, we'd want to make better preparations for dealing with similar incidents in the future.
Preparations of this type would include analyzing all past failures and reducing each to a root cause. Finding commonalities among them would allow better responses to be formulated before another incident occurs. Doing so should help reduce the time to respond to and the time to resolve each incident.
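As a sketch of what that analysis could look like, the snippet below groups past incidents by root cause and computes the mean time to respond and the mean time to resolve. The `Incident` record, the `summarize` helper, and all of the sample incidents are hypothetical; they are invented to show the calculation and are not real shipping data.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    name: str
    root_cause: str
    started: datetime
    responded: datetime
    resolved: datetime


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def summarize(incidents: list[Incident]) -> dict:
    """Count shared root causes and average the response and resolution times."""
    return {
        "common_causes": Counter(i.root_cause for i in incidents).most_common(),
        "mean_time_to_respond": mean_delta([i.responded - i.started for i in incidents]),
        "mean_time_to_resolve": mean_delta([i.resolved - i.started for i in incidents]),
    }


# Made-up incident history, for illustration only.
history = [
    Incident("A", "high winds", datetime(2021, 1, 1, 0), datetime(2021, 1, 1, 1), datetime(2021, 1, 1, 10)),
    Incident("B", "engine failure", datetime(2021, 2, 1, 0), datetime(2021, 2, 1, 2), datetime(2021, 2, 1, 20)),
    Incident("C", "high winds", datetime(2021, 3, 1, 0), datetime(2021, 3, 1, 3), datetime(2021, 3, 2, 6)),
]

summary = summarize(history)
for cause, count in summary["common_causes"]:
    print(f"{cause}: {count} incident(s)")
print("Mean time to respond:", summary["mean_time_to_respond"])
print("Mean time to resolve:", summary["mean_time_to_resolve"])
```

Tracking these two averages over time gives a simple way to see whether preparation is actually shortening response and recovery.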
It will be interesting to see what happens in the coming months once investigations on Ever Given and the Suez Canal have been conducted. We'd guess that this will cause the global shipping and transit industries to rethink infrastructure deficiencies and hopefully begin modernization efforts.
Along with these efforts, there will most likely be a great deal of thought given to how to recover more quickly from these types of incidents in the future, and to better ways of working around them when they do occur.
You can coordinate response to your own Suez Canal incidents or postmortem analysis with Rootly. Try it free today or book a demo!