De-Siloing Incident Management: How to Make Reliability Engineering Everyone’s Job

4 best practices for breaking down silos and establishing a culture of shared responsibility toward reliability.

JJ Tang

July 15, 2021
5 min read
The Incident Review: 4 Incidents in Outer Space

From network problems to computer failures, a variety of incidents can disrupt operations for systems in outer space.

JJ Tang

July 6, 2021
4 min read
The Incident Review: 4 Times When Typos Brought Down Critical Systems

Sometimes, as these 4 incidents highlight, major failure results from a mere typo or configuration oversight.

JJ Tang

June 4, 2021
5 min read
Practical Guide to SRE: Automating On-Call

Let's face it: on-call work isn't fun. But it can be better. Even if you have to be on call, it would be nice to have at least some of the work done for you before you drag yourself out of bed at 3 a.m. to respond to an incident.

JJ Tang

May 6, 2021
8 min read
Creating Chaos to Achieve Reliability

How can creating chaos achieve better reliability? Chaos and reliability might seem mutually exclusive, but through Chaos Engineering, SREs can meaningfully improve system resilience.

JJ Tang

April 22, 2021
5 min read
How Would an SRE Conduct a Postmortem on the Suez Canal Incident?

The Suez Canal has been big news over the last couple of weeks. We wondered how a Site Reliability Engineer (SRE) might conduct a postmortem on what happened with the Ever Given, and what that might mean if a comparable incident occurred at a modern tech company.

JJ Tang

April 7, 2021
8 min read