Top 9 Skills for SREs from ex-Instacart SRE
A list of the top nine SRE skills, from incident management, to cloud computing, to networking and beyond.
January 21, 2022
Many of the concepts SREs take for granted about incident management originated with efforts to fight fires in California in the 1970s.
What’s the history of incident management?
If you’re an SRE, you may be so caught up in the day-to-day work of managing reliability and responding to incidents that you never take time to step back and ask that question. And that’s a shame, because SREs didn’t invent incident management concepts and strategies on their own.
On the contrary, the ways SREs think about incident response, structure incident management teams and rank the priority of incidents owe much to incident management strategies developed in the offline world decades ago. To understand fully what it means to be an SRE today, you have to appreciate this deep history of incident response.
So, let’s take a look at that history, and examine where modern incident response concepts originated.
Societies have always had incidents, of course. Fires, floods, infrastructure breakdowns and similar crises have been happening for millennia.
For most of history, however, humans lacked an efficient, purposeful way to manage these sorts of incidents. Response efforts were ad hoc, and their effectiveness owed more than a little to sheer luck.
Particular challenges included:
Historically, organizations may have been able to handle incidents well enough if the incidents required response from only one, small group. But the more stakeholders involved, the harder it was to respond quickly and effectively.
Matters began to improve when stakeholders started looking for better ways to put out fires – literally.
By the 1960s, fire chiefs in California realized that they were struggling to respond effectively to the wildfires that broke out every summer. Each year brought blazes worse than the last, with more land burned and more buildings lost. The Laguna fire of 1970 brought matters to a head and prompted fire agencies to take a new approach to incident response.
After assessing what was going wrong, the fire chiefs determined that the problem wasn’t a lack of equipment or personnel. It was poor coordination among the various firefighting agencies that responded to blazes. Lacking a clear chain of command and a systematic approach to firefighting, the agencies struggled to deploy their resources quickly and effectively.
To fix the problem, California fire chiefs developed what became known as the Incident Command System, or ICS. The ICS defines a hierarchy for incident response, with an incident commander at the top. It also defines several categories of incident response processes, including operations, planning, logistics and finance. And it establishes a consistent set of terms that stakeholders can use to describe their actions during incident response, which makes it easier to communicate clearly.
Although the ICS was initially conceived to fight fires, it became the de facto standard for organizing incident response strategies of all types.
The history of incident response doesn’t end with the ICS. A new chapter began in the early 2000s, when the U.S. federal government developed an even more comprehensive approach to incident management called the National Incident Management System, or NIMS.
NIMS was born in the wake of the September 11, 2001, terrorist attacks, which underlined the importance of efficient communication not just between different agencies of the same type (like fire departments), but also between entirely separate organizations. To achieve this, NIMS expanded upon the principles of the ICS.
In addition to adopting most of the incident command principles and practices included in the ICS, NIMS includes standards for coordinating the distribution of resources. It also embraces the concept of the emergency operations center, which is akin in some ways to a network operations center.
In some respects, NIMS resembles a compliance framework (although, to be clear, that’s not what it is). It includes fourteen management characteristics, which are similar to compliance controls, that organizations must implement in order to manage incidents using a NIMS approach.
Obviously, putting out forest fires and responding to terrorist attacks are pretty different from dealing with data center failures or a buggy application deployment. ICS and NIMS aren’t designed for Site Reliability Engineering or IT teams specifically.
Still, the influence of ICS and NIMS on the way SREs think is clear enough. Terminology like “incident commander” comes from these frameworks. So do concepts like shared accountability for incident response processes and the importance of involving all stakeholders – not just technical teams – in incident response.
ICS and NIMS may not be acronyms familiar to most SREs. But they should be, because they are the historical sources of the incident management philosophies that form the foundation for SRE work today.