FRESH

Hacker News

Home

Root cause analysis? You're doing it wrong

107 points by davedx

by CobrastanJorji

13 subcomments

Many years ago, I worked at Amazon, and it was at the time quite fond of the "five whys" approach to root cause analysis: say what happened, ask why that happened, ask why that in turn happened, and keep going until you get to some very fundamental problem.
I was asked to write up such a document for an incident where our team had written a new feature which, upon launch, did absolutely nothing. Our team had accidentally mistyped a flag name on the last day before we handed it to a test team, the test team examined the (nonfunctional) tool for a few weeks and blessed it, and then upon turning it on, it failed to do anything. My five whys document was most about "what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do."
I recall my manager handing the doc back to me and saying that I needed to completely redo it because it was unacceptable for us to blame another team for our team's bug, which is how I learned that you can make a five why process blame any team you find convenient by choosing the question. I quit not too long after that.

by captainkrtek

1 subcomments

At a large cloud provider I held a role for a bit in the “safety” organization that was tasked with developing better understanding of our incidents, working on tooling to protect systems, and so on.
A few problems I faced:
- culturally a lack of deeper understanding or care around “safety” topics. Forces that be inherently are motivated by launching features and increasing sales, so more often than not you could write an awesome incident retro doc and just get people who are laser focused on the bare minimum of action items.
- security folks co-opting the safety things, because removing access to things can be misconstrued to mean making things safer. While somewhat true, it also makes doing jobs more difficult if not replaced with adequate tooling. What this meant was taking away access and replacing everything with “break glass” mechanisms. If your team is breaking glass every day and ticketing security, you’re probably failing to address both security and safety..
- related to the last point, but a lack of introspection as to the means of making changes which led to the incident. For example: user uses ssh to run a command and ran the wrong command -> we should eliminate ssh. Rather than asking why was ssh the best / only way the user could affect change to the system? Could we build an api for this with tooling and safeguards before cutting off ssh?

by AstroJetson

1 subcomments

I did a very long RCA on a problem. My management at the time was really BIG into looking at ALL THE CAUSES. They wanted HUGE fishbone diagrams to show that we had looked at everything. This was in the days of having huge drum plotters, so the diagrams could be 36" and many feet long.
So I did what they wanted and the root cause was:
On December 11 1963 Mr and Mrs Stanley Smith had sexual intercourse.
I got asked what that had to do with anything and I said, "If you look up a few lines you'll see that the issue was a human error caused by Bob Smith, if he hadn't been born we wouldn't have had this problem and I just went back to the actual conception date."
I got asked how I was able to pin it to that date and said "I asked Bob what his father's birthday was and extrapolated that info"
I was never asked to do a RCA again.

by tptacek

4 subcomments

Some of the same thoughts in Richard Cook, which was a brain-altering read for me:
https://how.complexsystems.fail/

by jph

0 subcomment

Thanks for the article and shoutout - CAST is great and I use it extensively with tech teams.
Causal Analysis based on Systems Theory - my notes - https://github.com/joelparkerhenderson/causal-analysis-based...
The full handbook by Nancy G. Leveson at MIT is free here: http://sunnyday.mit.edu/CAST-Handbook.pdf

by hshdhdhehd

2 subcomments

Please keep working on that piece I think it will be very useful for incident reviewers.
Someone said the quiet part loud! :
"""
Common circumstances missing from accident reports are:
Pressures to cut costs or work quicker,
Competing requests for colleagues,
Unnecessarily complicated systems,
Broken tools,
Biological needs (e.g. sleep or hunger),
Cumbersome enforced processes,
Fear of being consequences of doing something out of the ordinary, and
Shame of feeling in over one’s head.
"""

by kqr

1 subcomments

Author here. Please note this is an early draft/stream-of-consciousness. Feel free to read and share anyway but my actual published articles hold a higher standard!

by 0xbadcafebee

1 subcomments

A lot of this is based on heavy assumptions about systems and risk/safety analysis. The biggest assumption this post is making is that humans should be involved at all.
Systems do not have to facilitate operators in building accurate mental models. In fact, safe systems disregard mental models, because a mental model is a human thing, and humans are fallible. Remove the human and you have a safer system.
Safety is not a dynamic control problem, it's a quality-state management problem. You need to maintain a state of quality assurance/quality control to ensure safety. When it falters, so does safety. Dynamism is sometimes not a factor (although when it is, it's typically a harmful factor).
Also fwiw, there's often not a root cause, but instead multiple causes (or coincidental states). For getting better at tracking down causes of failures (and preventing them), I recommend learning the Toyota Production System, then reading Out Of The Crisis. That'll kickstart your brain enough to be more effective than 99.99% of people.

by exmadscientist

1 subcomments

I agree with a lot of the statements at the top of the article, but some of them are just nonsense. This one, in particular:
> If we analyse accidents more deeply, we can get by analysing fewer accidents and still learn more.
Yeah, that's not how it works. The failure modes of your system might be concentrated in one particularly brittle area, but you really need as much breadth as you can get: the bullets are always fired at the entire plane.
> An accident happens when a system in a hazardous state encounters unfavourable environmental conditions. We cannot control environmental conditions, so we need to prevent hazards.
I mean, I'm an R&D guy, so my experience is biased, but... sometimes the system is just broke and no amount of saying "the system is in a hazardous state" can paper over the fact that you shipped (or, best-case, stress-tested) trash. You absolutely have to run these cases through the failure analysis pipeline, there's no out there, but the analysis flow looks a bit different for things that should-have worked versus things that could-never-have worked. And, yes, it will roll up on management, but... still.

by protocolture

0 subcomment

Good RCA: Produce some useful documentation to prevent issue from recurring.
Fantastic RCA: Remove requirement that caused the action that resulted in the problem occurring.
Bad RCA: Lets get 12 non technical people on a call to ask the on call engineer who is tired from 6 hours managing the fault, a bunch of technical questions they don't understand the answers to anyway.
(Worst possible fault practice is to bring in a bunch of stakeholders and force the engineer to be on a call with them while they try and investigate the fault)
Worst RCA: A half paragraph describing the problem in the most general terms to meet a contractual RCA requirement.

by gtirloni

0 subcomment

I recommend attending the next STAMP Workshop offered by MIT if you have a chance: https://psas.scripts.mit.edu/home/stamp-workshops

by gmuslera

0 subcomment

Not all problems (and systems) are alike. And probably simple approaches like Occam's Razor will work good enough with most. But the remaining 10% will need deeper digging into more data and correlations.

by bluGill

1 subcomments

Root cause works better if you can come back next time the same thing happens and find a different root cause to fix. keep repeating until the problem doesn't happen enough to care anymore.
If the result/accident is too bad though you need to find all the different faults and mitigate as manyias possible the first time.

by opwieurposiu

1 subcomments

I feel like half the time issues are caused by adding some stupid feature that nobody really wants, but makes it in anyways because the incentive is to add features, not make good software.
People rarely react well if you tell them "Hey this feature ticket you made is poorly conceived and will cause problems, can we just not do it?" It is easier just to implement whatever it is and deal with the fallout later.

by o4c

0 subcomment

[dead]