Blameless PostMortems and a Just Culture

Posted by John Allspaw on May 22, 2012

Last week, Owen Thomas wrote a flattering article over at Business Insider on how we handle errors and mistakes at Etsy. I thought I might give some detail on how that actually happens, and why.

Anyone who’s worked with technology at any scale is familiar with failure. Failure cares not about the architecture designs you slave over, the code you write and review, or the alerts and metrics you meticulously pore over.

So: failure happens. This is a foregone conclusion when working with complex systems. But what about those failures that result from the actions (or lack of action, in some cases) of individuals? What do you do with those careless humans who caused everyone to have a bad day?

Maybe they should be fired. Or maybe they need to be prevented from touching the dangerous bits again. Or maybe they need more training.

This is the traditional view of “human error”, which focuses on the characteristics of the individuals involved. It’s what Sidney Dekker calls the “Bad Apple Theory” – get rid of the bad apples, and you’ll get rid of the human error. Seems simple, right?

We don’t take this traditional view at Etsy. We instead want to view mistakes, errors, slips, lapses, etc. with a perspective of learning. Having blameless Post-Mortems on outages and accidents is part of that.

A Blameless Post-Mortem

What does it mean to have a ‘blameless’ Post-Mortem?
Does it mean everyone gets off the hook for making mistakes? No.

Well, maybe. It depends on what “gets off the hook” means. Let me explain.

Having a Just Culture means that you’re making an effort to balance safety and accountability. It means that by investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism and on the decision-making process of the individuals proximate to the failure, an organization can come out safer than it would if it had simply punished the actors involved as a remediation.

Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:

  - what actions they took at what time,
  - what effects they observed,
  - expectations they had,
  - assumptions they had made,
  - and their understanding of the timeline of events as they occurred

…and that they can give this detailed account without fear of punishment or retribution.
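
To make that concrete, here is a minimal sketch of how such an account might be captured as structured data. The field names simply mirror the list above; they are illustrative, not our actual Post-Mortem template:

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import List

    @dataclass
    class TimelineEntry:
        """One timestamped action and the effect the engineer observed."""
        at: datetime
        action: str
        observed_effect: str

    @dataclass
    class BlamelessAccount:
        """An engineer's first-person account for a blameless Post-Mortem."""
        engineer: str
        expectations: List[str]   # what they expected to happen
        assumptions: List[str]    # what they assumed going in
        # Their understanding of the timeline of events as they occurred.
        timeline: List[TimelineEntry] = field(default_factory=list)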

Why shouldn’t they be punished or reprimanded? Because an engineer who thinks they’re going to be reprimanded is disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, then with another one in the future.

We believe that this detail is paramount to improving safety at Etsy.

If we go with “blame” as the predominant approach, then we’re implicitly accepting that deterrence is how organizations become safer. This is founded in the belief that individuals, not situations, cause errors. It’s also aligned with the idea that there has to be some fear that not doing one’s job correctly could lead to punishment, because the fear of punishment will motivate people to act correctly in the future. Right?

This cycle of name/blame/shame can be looked at like this:

  1. Engineer takes action and contributes to a failure or incident.
  2. Engineer is punished, shamed, blamed, or retrained.
  3. Trust is reduced between engineers on the ground (the “sharp end”) and management (the “blunt end”) looking for someone to scapegoat.
  4. Engineers become silent on details about actions/situations/observations, resulting in “Cover-Your-Ass” engineering (from fear of punishment).
  5. Management becomes less aware of and informed about how work is performed day to day, and engineers become less educated about lurking or latent conditions for failure, due to the silence mentioned in #4, above.
  6. Errors become more likely, and latent conditions can’t be identified, due to #5, above.
  7. Repeat from step 1.

We need to avoid this cycle. We want the engineer who has made an error to give details about why (either explicitly or implicitly) they did what they did; why the action made sense to them at the time. This is paramount to understanding the pathology of the failure. The action made sense to the person at the time they took it; if it hadn’t made sense to them at the time, they wouldn’t have taken it in the first place.

The fundamental point here is something Erik Hollnagel has said:

We must strive to understand that accidents don’t happen because people gamble and lose.
Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever risk there is.

A Second Story

This idea of digging deeper into the circumstances and environment that an engineer found themselves in is called looking for the “Second Story”. In Post-Mortem meetings, we want to find Second Stories to help understand what went wrong.

From Behind Human Error, here’s the difference between “first” and “second” stories of human error:

First story: Human error is seen as the cause of failure.
Second story: Human error is seen as the effect of systemic vulnerabilities deeper inside the organization.

First story: Saying what people should have done is a satisfying way to describe failure.
Second story: Saying what people should have done doesn’t explain why it made sense for them to do what they did.

First story: Telling people to be more careful will make the problem go away.
Second story: Only by constantly seeking out its vulnerabilities can organizations enhance safety.

Allowing Engineers to Own Their Own Stories

A funny thing happens when engineers make mistakes and feel safe when giving details about it: they are not only willing to be held accountable, they are also enthusiastic in helping the rest of the company avoid the same error in the future. They are, after all, the most expert in their own error. They ought to be heavily involved in coming up with remediation items.

So technically, engineers are not at all “off the hook” with a blameless Post-Mortem process. They are very much on the hook for helping Etsy become safer and more resilient, in the end. And lo and behold: most engineers I know find this idea of making things better for others a worthwhile exercise.

So what do we do to enable a “Just Culture” at Etsy?

Failure happens. In order to understand how failures happen, we first have to understand our reactions to failure.

One option is to assume the single cause is incompetence and scream at engineers to make them “pay attention!” or “be more careful!”
Another option is to take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event.

That’s why we have blameless Post-Mortems at Etsy, and why we’re looking to create a Just Culture here.

Category: engineering, people

33 Comments

Thanks for posting this. I really don’t see why anyone would be playing the ‘blame’ game, especially in an environment with 30+ developers or so.

Anyone can make a mistake at any given time, regardless of knowledge, time working for the company, etc. To assume that ‘mistakes’ are representative of the quality of a worker seems incredibly naive. Moreover, it denotes ‘arrogance’ rather than ‘progress’.

Mistakes are learning and improvement opportunities. I think it is awesome that you guys are putting a label on it and making it ‘official’. While some companies *may* be implementing a ‘Just Culture’ already, I firmly believe that making it KNOWN to employees and even outside the organization itself, is the right way to promote openness and to make it an ‘easy’ topic for discussion (and hence adoption of the same).

Second stories and Post-Mortem reviews can only add value to an organization. Congrats!

You can foster a blameless attitude even when choosing how to investigate source code changes. svn blame has three aliases to choose from: svn praise, svn annotate and svn ann. Unfortunately git-blame lacks these aliases.

This is a great post. I was fortunate to receive advice early in my career as a manager from a wise veteran who told me to live by the mantra “Fix the problem, not the blame”. This was a great simple statement of something I have found to have great depth over the years. This is a notion that came out of a lot of the discussions of how Japanese companies manage differently from US companies. Toyota (as usual) is the one most cited.

To be sure, a lot of the tendency to blame comes from culture, so I have developed a number of techniques for defusing it. And a lot of that comes from how you define the problem. One of my favorite tools for gaining perspective on the problem definition is the “5 whys”. This guy http://www.youtube.com/watch?v=JmrAkHafwHI does a fabulous job of explaining the basic concept in 3 minutes flat.

[...] Some of this is caused by blame cultures. I was speaking to Kevin Marks earlier this evening about this and related issues, and he referred me to this Etsy post: Blameless PostMortems and a Just Culture. [...]

Very well said. I couldn’t agree more – in places where blamestorming is the response to an incident, the natural consequence is that those involved become less and less forthcoming with the information required to make the systems safer.

One thing I would add: it’s also critical for a successful process that the output of post-mortems be acted upon in a *timely* and *visible* manner. If someone spends their time doing a bunch of analysis and comes up with recommendations on how to avoid problems, but then feels like those recommendations are ignored or not appreciated, that too is a strong disincentive to future analysis. I’m sure Etsy doesn’t have this problem, but I’ve seen it happen in other organizations, especially as they get larger.

[...] costly. Mistakes can turn into a positive learning experience for everyone on the team. There's a great post on Etsy's "Code as Craft" blog about this [...]

Great post! I have seen so many teams debilitated because of fear. Once you get out of the ‘blame the developer’ game, you open the discussion up to the context of the failure, and ultimately to the root cause and likely solutions.

[...] on Etsy’s Code as Craft blog, they explain: Why shouldn’t they be punished or reprimanded? Because an engineer who thinks [...]

[...] root cause analysis, problem management, after-action reports, etc. Then John Allspaw wrote this incredibly fantastic blog post about blameless postmortems that so eloquently and thoughtfully conveys a bunch of the things I [...]

Constructolution: Indeed, you’re right. Not acting on remediation items is akin to paying lip-service and essentially dismissing the learning out of hand.

At the moment, in Etsy’s case, we have a 30-day due date (this is likely too long) on remediation tasks, and they are essentially actionable tickets that trump any other work the engineer is currently doing, including shipping product.
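
As a rough sketch of how that might look when automated (the tracker client and its method here are hypothetical, not our actual ticketing system):

    from datetime import date, timedelta

    REMEDIATION_DUE_DAYS = 30  # current policy; as noted above, likely too long

    def file_remediation_ticket(tracker, summary, owner):
        # File a remediation task that outranks the owner's other work,
        # including shipping product. `tracker` is a stand-in for whatever
        # issue tracker is in use; `create_issue` is a hypothetical method.
        return tracker.create_issue(
            summary=summary,
            assignee=owner,
            due_date=date.today() + timedelta(days=REMEDIATION_DUE_DAYS),
            priority="highest",
        )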

Excellent catch, thanks. :)

This was a great article (definitely going to share this link). I might also add that it’s a good idea to put monitoring in place whenever possible to help catch the problem, whether through change control or thresholds on key metrics that change drastically when a problem is introduced. That is our philosophy at LogicMonitor. We too are aware that people make mistakes, but if you can track when a mistake gets introduced with monitoring, you can typically thwart it much more quickly and cause less damage than you would without monitoring in place. Just another method of providing failsafes.
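
A crude sketch of the thresholds idea, purely for illustration (this is not our product, just the general shape of the check):

    from statistics import mean

    def changed_drastically(before, after, threshold=0.5):
        # Compare a key metric's average before and after a deploy marker.
        # If it shifted by more than `threshold` (50% here, an arbitrary
        # example value), flag the deploy as the likely source of trouble.
        base = mean(before)
        if base == 0:
            return mean(after) != 0
        return abs(mean(after) - base) / abs(base) > threshold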

[...] Blameless PostMortems and a Just Culture via Etsy Filed Under: Link Tagged With: culture, etsy, [...]


[...] came across a post recently on Etsy‘s developer blog, Code as Craft, on Blameless PostMortems and a Just Culture, which details how the company takes a blameless approach to learning from mistakes, and ensuring [...]

[...] Jason Antman Here is a small selection of sysadmin links that I recently found, and wanted to share:Blameless PostMortems and a Just Culture « Code as Craft – some really good ideas about a culture that recognizes and seeks to remedy human errors, [...]

[...] Email | tweet us @STWNextness.If you only read one thing.Blameless postmortems and a just culture | Code as CraftManagement.5 ideas trapping the advertising industry right now | AdverblogA new report from McKinsey [...]

This was a really interesting post, and very timely. My organization does not do engineering (we administer a health insurance program for uninsured children) but many of the principles about the path to blamelessness apply.

[...] and performance degradations. In consequence, we implemented what John Allspaw calls a “blameless post-mortem“: a thorough review process with the single goal of helping the team prevent the problem in [...]

[...] as in “justice”), to them is predicated in large measure on conducting a Blameless Post Mortem when diagnosing critical problems and mistakes. Here is why they believe a blameless post [...]

[...] — you are coming up with a way to prevent that problem in the future. (See Etsy’s blog, Code as Craft, for a great summary of how and why they host [...]

[…] provides examples of techniques that can be adopted for positive impact. For e.g. Etsy’s blameless post- mortem technique is very relevant in this cloud era. Failures happen and what better way to deal with them […]

[…] What are the various ways we can anticipate, monitor, respond to, and learn from our failures and our successes?  […]

[…] we get through a post-mortem without pointing fingers at one another, yet still have accountability. The folks at Etsy think […]

[…] landscape changes? John Allspaw writes about corporate culture and the need to adapt in his article Blameless PostMortems and a Just Culture. The point of doing a post-mortem after a failure is to learn about what went wrong and figure out […]

Great article. I’m curious what Etsy would do if the problem was caused not by a mistake, but rather by inaction. I’m thinking of operational support roles where, if action isn’t taken, it can cause an outage.

Then, as a follow-up, I’d ask what Etsy does if there were a recurring pattern.

Thanks!

Hey Jason, thanks for the comments and questions.

There’s certainly something there in what you’ve mentioned: inaction or, in some cases, indecision. In either case, not acting can be the result of a number of things:

– Perceptual recognition (was the font too small or color not salient enough in a signal that told you to take action?)
– Data overload (are we not detecting when to take action? If we are, is the information that accompanies an alert informative or noisy?)
– Lots of other things that I can’t think of at the moment :)

Regarding recurring patterns:

This is a question that comes up many times in conversations about this approach. What do we do when the same mistake happens again, or when something looks like a recurring pattern? There are a number of questions we can ask about that…

1. If there is a pattern, what allows for that pattern to emerge, if we’re looking at each event? In other words, what *didn’t* we address in previous postmortems that allowed it to happen again?

2. What other ways can we explore events? If we see similar things happening again and again, what perspectives are we not covering?

If we find that the answer to recurring patterns is something that sounds like “I guess that person is just clumsy or not paying attention” or finding that the “pattern” is a person, and not a dynamic or mechanism…then it’s an indication that we’re not digging deep enough for second stories.

Thoughts?

[…] not for retribution and blame. The REAL failure is not allowing teams to learn from incidents; the blameless post-mortem review is a crucial part of helping that organisational learning to take place (p. […]

Well-reasoned responses. I’m wondering if there’s value in taking this approach with additional roles in IT. I’m guessing that you’re focused on development or code teams… which is great. I can see parallels in operational teams, even in roles like Business Analysts.

With Operational or Run teams it’s often inaction that can cause outages: a missed alert for a filling drive, failure to heed a hardware vendor’s recommendation to upgrade firmware, failure to test backup strategies, etc.

I appreciate your time responding to my questions!

[…] post-mortem process, which although I had seen John Allspaw talk about it a couple of years ago (blog post, video), now when presented within the larger DevOps context really started to […]

It’s a bit late to comment, but NASA runs the Aviation Safety Reporting System, which gives immunity from prosecution to any flight crew that reports an unsafe incident.
http://asrs.arc.nasa.gov/overview/summary.html
The idea is to avoid people covering up mistakes. It is more important to report a mistake than it is to punish someone for making a mistake.

It could be one reason why aviation is so safe in the US.

It certainly is an unusual program for the government.

[…] the way in which the organisation deals with failures in the software systems needs to shift to a blame-free model, allowing the whole organisation to learn and improve. In our experience, a ‘big […]

[…] error” approach is the equivalent of cutting off your nose to spite your face. He explains in a blog post that at Etsy, their approach it to “view mistakes, errors, slips, lapses, etc., with a […]

There are two relevant books for reading up on this in depth:
Behind Human Error (Woods, Dekker, Cook, Johannesen, and Sarter)
The Field Guide to Understanding Human Error (Sidney Dekker)
Thank you for providing this info in your recent O’Reilly podcast!