Blameless PostMortems and a Just Culture

Posted on May 22, 2012

Last week, Owen Thomas wrote a flattering article over at Business Insider on how we handle errors and mistakes at Etsy. I thought I might give some detail on how that actually happens, and why.

Anyone who’s worked with technology at any scale is familiar with failure. Failure cares not about the architecture designs you slave over, the code you write and review, or the alerts and metrics you meticulously pore over.

So: failure happens. This is a foregone conclusion when working with complex systems. But what about failures that result from the actions (or, in some cases, the inaction) of individuals? What do you do with those careless humans who caused everyone to have a bad day?

Maybe they should be fired. Or maybe they need to be prevented from touching the dangerous bits again. Or maybe they need more training.

This is the traditional view of “human error”, which focuses on the characteristics of the individuals involved. It’s what Sidney Dekker calls the “Bad Apple Theory” – get rid of the bad apples, and you’ll get rid of the human error. Seems simple, right?

We don’t take this traditional view at Etsy. We instead want to view mistakes, errors, slips, lapses, etc. with a perspective of learning. Having blameless Post-Mortems on outages and accidents is part of that.

A Blameless Post-Mortem

What does it mean to have a ‘blameless’ Post-Mortem?
Does it mean everyone gets off the hook for making mistakes? No.

Well, maybe. It depends on what “gets off the hook” means. Let me explain.

Having a Just Culture means that you’re making an effort to balance safety and accountability. It means that by investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism and on the decision-making process of the individuals proximate to the failure, an organization can come out safer than it would be if it had simply punished the actors involved as a remediation.

Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:

  - what actions they took at what time,
  - what effects they observed,
  - expectations they had,
  - assumptions they had made,
  - and their understanding of the timeline of events as they occurred

…and that they can give this detailed account without fear of punishment or retribution.

Why shouldn’t they be punished or reprimanded? Because an engineer who thinks they’re going to be reprimanded is disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat, if not with the original engineer then with another one in the future.

We believe that this detail is paramount to improving safety at Etsy.

If we go with “blame” as the predominant approach, then we’re implicitly accepting that deterrence is how organizations become safer. This is founded in the belief that individuals, not situations, cause errors. It’s also aligned with the idea that there has to be some fear that not doing one’s job correctly could lead to punishment, because the fear of punishment will motivate people to act correctly in the future. Right?

This cycle of name/blame/shame can be looked at like this:

  1. Engineer takes action and contributes to a failure or incident.
  2. Engineer is punished, shamed, blamed, or retrained.
  3. Trust is reduced between engineers on the ground (the “sharp end”) and management (the “blunt end”) looking for someone to scapegoat.
  4. Engineers become silent on details about actions/situations/observations, resulting in “Cover-Your-Ass” engineering (from fear of punishment).
  5. Management becomes less aware of and informed about how work is performed day to day, and engineers become less educated on lurking or latent conditions for failure, due to the silence mentioned in #4, above.
  6. Errors become more likely, and latent conditions can’t be identified, due to #5, above.
  7. Repeat from step 1.

We need to avoid this cycle. We want the engineer who has made an error to give details about why (either explicitly or implicitly) they did what they did; why the action made sense to them at the time. This is paramount to understanding the pathology of the failure. The action made sense to the person at the time they took it; if it hadn’t made sense to them, they wouldn’t have taken the action in the first place.

The fundamental principle here is something Erik Hollnagel has said:

We must strive to understand that accidents don’t happen because people gamble and lose.
Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever risk there is.

A Second Story

This idea of digging deeper into the circumstance and environment that an engineer found themselves in is called looking for the “Second Story”. In Post-Mortem meetings, we want to find Second Stories to help understand what went wrong.

From Behind Human Error here’s the difference between “first” and “second” stories of human error:

First Story: Human error is seen as the cause of failure.
Second Story: Human error is seen as the effect of systemic vulnerabilities deeper inside the organization.

First Story: Saying what people should have done is a satisfying way to describe failure.
Second Story: Saying what people should have done doesn’t explain why it made sense for them to do what they did.

First Story: Telling people to be more careful will make the problem go away.
Second Story: Only by constantly seeking out its vulnerabilities can organizations enhance safety.

Allowing Engineers to Own Their Own Stories

A funny thing happens when engineers make mistakes and feel safe giving details about them: they are not only willing to be held accountable, they are also enthusiastic about helping the rest of the company avoid the same error in the future. They are, after all, the most expert in their own error. They ought to be heavily involved in coming up with remediation items.

So technically, engineers are not at all “off the hook” with a blameless PostMortem process. They are very much on the hook for helping Etsy become safer and more resilient, in the end. And lo and behold: most engineers I know find this idea of making things better for others a worthwhile exercise.

So what do we do to enable a “Just Culture” at Etsy?

Failure happens. In order to understand how failures happen, we first have to understand our reactions to failure.

One option is to assume the single cause is incompetence and scream at engineers to make them “pay attention!” or “be more careful!”
Another option is to take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event.

That’s why we have blameless Post-Mortems at Etsy, and why we’re looking to create a Just Culture here.

Category: engineering, people


Thanks for posting this. I really don’t see why anyone would be playing the ‘blame’ game, especially in an environment with 30+ developers or so.

Anyone can make a mistake at any given time, regardless of knowledge, time working for the company, etc. To assume that ‘mistakes’ are representative of the quality of a worker seems incredibly naive. Moreover, it denotes ‘arrogance’ rather than ‘progress’.

Mistakes are learning and improvement opportunities. I think it is awesome that you guys are putting a label on it and making it ‘official’. While some companies *may* be implementing a ‘Just Culture’ already, I firmly believe that making it KNOWN to employees and even outside the organization itself, is the right way to promote openness and to make it an ‘easy’ topic for discussion (and hence adoption of the same).

Second stories and Post-Mortem reviews can only add value to an organization. Congrats!

You can foster a blameless attitude even when choosing how to investigate source code changes. svn blame has three aliases to choose from: svn praise, svn annotate and svn ann. Unfortunately git-blame lacks these aliases.
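For what it’s worth, git does let you add the friendlier name yourself via user-defined aliases; the `praise` name below is our own choice, not a built-in command:

```shell
# Define "git praise" as an alias for "git blame".
git config --global alias.praise blame

# Both commands now produce the same annotated output:
git praise README.md
```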

This is a great post. I was fortunate to receive advice early in my career as a manager from a wise veteran who told me to live by the mantra “Fix the problem, not the blame”. This was a great simple statement of something I have found to have great depth over the years. This is a notion that came out of a lot of the discussions of how Japanese companies manage differently from US companies. Toyota (as usual) is the one most cited.

To be sure, a lot of the tendency to blame comes from culture, so I have developed a number of techniques for de-fusing it. And a lot of that comes from how you define the problem. One of my favorite tools for gaining perspective on the problem definition is the “5 whys”. This guy does a fabulous job of explaining the basic concept in 3 minutes flat.


Very well said. I couldn’t agree more – in places where blamestorming is the response to an incident, the natural consequence is that those involved become less and less forthcoming with the information required to make the systems safer.

One thing I would add: it’s also critical for a successful process that the output of post-mortems be acted upon in a *timely* and *visible* manner. If someone spends their time doing a bunch of analysis and comes up with recommendations on how to avoid problems, but then feels like those recommendations are ignored or not appreciated, that also strongly disincentivizes future analysis. I’m sure Etsy doesn’t have this problem, but I’ve seen it happen in other organizations, especially as they get larger.


Great post! I have seen so many teams debilitated because of fear. Once you get out of the ‘blame the developer’ game, you open the discussion up to the context of the failure, and ultimately to the root cause and likely solutions.



Constructolution: Indeed, you’re right. Not acting on remediation items is akin to paying lip-service and essentially dismissing the learning out of hand.

At the moment, in Etsy’s case, we have a 30-day due date (this is likely too long) on remediation tasks, and they are essentially actionable tickets that trump any other work the engineer is currently doing, including shipping product.

Excellent catch, thanks. 🙂

This was a great article (definitely going to share this link). I might also add that it’s a good idea to put monitoring in place whenever possible to help catch the problem, whether through change control or thresholds on key metrics that change drastically when the problem is introduced. That is our philosophy at LogicMonitor. We too are aware that people make mistakes, but if you can track when a mistake gets introduced, you can typically thwart it much more quickly, and possibly with less damage than you would have seen without monitoring in place. Just another method of providing failsafes.
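The monitoring idea can be sketched in a few lines: compare a key metric observed just after a change against its pre-change baseline, and flag the change as the likely culprit when the deviation crosses a threshold. The metric, the sample values, and the 3-sigma rule below are purely illustrative assumptions, not any particular product’s implementation.

```python
from statistics import mean, stdev

def deviates(baseline, value, sigmas=3.0):
    """Flag `value` as anomalous if it falls more than `sigmas`
    sample standard deviations from the mean of the baseline samples."""
    mu, sd = mean(baseline), stdev(baseline)
    return abs(value - mu) > sigmas * sd

# Hypothetical error rate per minute before a deploy,
# and the first reading observed after it.
before = [0.8, 1.1, 0.9, 1.0, 1.2, 0.9, 1.1, 1.0]
after = 4.5

if deviates(before, after):
    print("metric jumped right after the change: investigate the deploy")
```

A real monitoring system would of course use rolling windows and smarter baselines, but the point is the same: tie the anomaly to the moment the change landed.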






This was a really interesting post, and very timely. My organization does not do engineering (we administer a health insurance program for uninsured children) but many of the principles about the path to blamelessness apply.








Great article. I’m curious what Etsy would do if the problem was not caused by a mistake made, but rather inaction. I’m thinking of operational type support roles where if action isn’t taken, it can cause an outage.

Then, as a follow-up, I’d ask what Etsy does if there were a recurring pattern.


Hey Jason, thanks for the comments and questions.

There’s certainly something there in what you’ve mentioned: inaction or in some cases, indecision. In those cases, not acting can be the result of a number of things:

– Perceptual recognition (was the font too small or color not salient enough in a signal that told you to take action?)
– Data overload (are we not detecting when to take action? If we are, is the information that accompanies an alert informative or noisy?)
– Lots of other things that I can’t imagine 🙂

Regarding recurring patterns

This is a question that comes up many times in conversations about this approach. What to do when the same mistake happens again? Or what can look like a recurring pattern? There are a number of questions we can ask about that…

1. If there is a pattern, what allows for that pattern to emerge, if we’re looking at each event? In other words, what *didn’t* we address in previous postmortems that allowed it to happen again?

2. What other ways can we explore events? If we see similar things happening again and again, what perspectives are we not covering?

If we find that the answer to recurring patterns is something that sounds like “I guess that person is just clumsy or not paying attention” or finding that the “pattern” is a person, and not a dynamic or mechanism…then it’s an indication that we’re not digging deep enough for second stories.



Well reasoned responses. I’m wondering if there’s value in taking this approach with additional roles in IT. I’m guessing that you’re focused on development or code teams… which is great. I can see parallels in operational teams, even roles like Business Analysts.

With Operational or Run teams it’s often inaction that can cause outages: a missed alert for a filling drive, failure to heed a hardware vendor’s recommendation to upgrade firmware, failure to test backup strategies, etc.

I appreciate your time responding to my questions!


Brilliant. Thank you.

It’s a bit late to comment, but NASA runs the Aviation Safety Reporting System which gives immunity from prosecution to any flight crew that reports on an unsafe incident.
The idea is to avoid people covering up mistakes. It is more important to report a mistake than it is to punish someone for making a mistake.

It could be one reason why aviation is so safe in the US.

It certainly is an unusual program for the government.



There are two relevant books for reading up on this in depth:
Behind Human Error [Kindle Edition]
The Field Guide to Understanding Human Error [Kindle Edition]
Thank you for providing this info in your recent O’Reilly podcast!



Thank you.




Thank you very much for this insightful post. You have hit the nail squarely on the head with issues I am seeing at several of my clients these days. Referencing this post has made it much easier to consult with my clients, as I don’t have to be the messenger.

Thank you again.










I love this piece, but there are two implicit assumptions it makes that (in my experience) don’t hold true in the real world very often:

(1) The action that contributed to a failure or incident was an action taken by an Engineer, not a Manager, and
(2) A “blameful” approach will result in reduced quality of work experience over time because of reduced information flow from Engineering to Management.

In my career thus far (a decade and counting), I’ve observed that it’s much more likely for a failure or incident to have a root cause in Management, not in Engineering. My favorite example is when our Management signed a contract for a vendor to provide a critical service through an API, without even checking to see whether that vendor provided that particular service (they didn’t) or even had an API (they don’t). You can’t blame Engineering for being unable to use a product when that product literally doesn’t exist; that sort of logical atrocity is unique to the profession of Management. I’ve never seen an Engineering decision result in anything worse than a small amount of lost data or a few hours of downtime; but I’ve seen Management decisions result in company-wrecking calamities. There’s not even a contest.

Unfortunately, unlike Engineers, there is no mechanism in a corporate hierarchy to “punish, shame, blame, or retrain” a Manager when their action led to a failure or incident. Even when it is clear to everyone studying the situation that a particular Management decision is the direct cause of a failure, the political reality of living in a capitalist society prevents that person from being blamed or held responsible in any way. So, if avoiding blame is truly a better approach for minimizing failure over time — if the claim of this article is indeed true — then we would expect the areas of a company that are intrinsically shielded from blame to be the areas with the LOWEST rates of failure. Instead, we observe them to be the areas with the HIGHEST rates of failure. In other words, failure has an inverse correlation to blame, not a positive correlation as this article suggests.

You rightly point out that “human error is seen as the effect of systemic vulnerabilities deeper inside the organization.” However, those systemic vulnerabilities are all, categorically, the purview of Management. There is no part of the system of a company that is outside the authority of Management to control; ergo, any systemic vulnerability that has been identified and documented, but not eliminated, must exist BECAUSE OF (rather than despite) Management’s choices. When you are talking about Engineering, you can clearly separate the humans from the system they work within, because that system is created and maintained independently of the actions of the Engineers. But when you are talking about Management, there is NO distinction between human failure and systemic failure, because the “system” is entirely the product of human action — in particular, the actions of the humans called Managers. To put it another way, what we call “the system” is simply the collective actions and decisions of Management, so if “the system” fails, then by definition a Manager’s action (or decision not to act) is the cause of that failure.

You might be able to address SOME Engineering failures by avoiding blame and drawing distinctions between human and systemic failures. But you will never be able to minimize failure across an entire company by this approach because (a) people who expect to never be blamed can objectively be seen to fail more often, and (b) the distinctions between human and systemic failures are mostly fictional.













Very nice piece. It’s worth taking a look at the postmortem methodology practiced by the NTSB (National Transportation Safety Board). It is very similar to what you describe in your work at Etsy.
