Post Mortem How-To
Post mortems are insanely powerful tools when used well, whether by software teams, operational teams, or even business teams.
They are a reliable path to operational excellence, they create ownership, and they help teams earn trust with their stakeholders, customers and managers.
This is a guide to conducting a post mortem, based on a simplified version of Amazon’s Correction of Error (CoE) process.
Two templates are included at the end—a simplified one for quick use, and a detailed one for major issues and more mature organisations.
Why Write a Post Mortem?
Don’t write a full detailed post mortem for every problem that occurs.
Write them for major or recurring issues, situations which would otherwise erode trust in your team or business. Writing a good post mortem is an investment you make in places you really want to improve.
The most powerful case for writing one is a problem which had, or could have had, a significant impact on customers (internal or external). You do a post mortem to demonstrate that you are taking the issue seriously: not just resolving it, but doing your best to prevent it from recurring.
Why Share your Post Mortem?
You will get far more value from doing a post mortem if you share it openly with your peers, internal customers, managers, and other stakeholders.
In most cases, they already know that there is a problem. Sharing your assessment, analysis and plan for improvement with them will get them focused on how they can help you improve. This might be in the form of advice or coaching, training, extra resources, or other changes. It may just be in the form of moral support and respect.
Often, the problem you encounter is more widely spread than your own example, and others can learn from your post mortem too.
How to Conduct a Post Mortem
Assign an Owner and a Reviewer
The owner writes the post mortem, and should be experienced enough and hands-on enough to get into the details. This is often a senior engineer or product manager.
The reviewer makes sure that the owner goes deep enough, gets input from the right people (peers, stakeholders, internal customers, managers), and follows up to make sure the actions are taken. They are usually a more senior manager.
Write the Document
Write the document, and review it informally a few times with key players. Get their input on how to improve the analysis and action plan. Your side-goal here is to have them feel like they are part of the process, so they support the end result.
Present the Document
Publish the document once key players (including suitably hands-on representatives from internal customers and stakeholders) are happy with what has been created, or once their concerns have been addressed to their satisfaction in an FAQ section. Send it to all interested parties and offer them a meeting to review it, which they should attend if the incident is something they care about resolving.
What about External Customers?
You can’t always be as open with external customers as you can be within your organisation. The same goes for external partners, distributors, suppliers, and even board members or investors: each one requires a thoughtful approach based on the relationship and what you want them to think, feel, and do.
However, the work done on creating a detailed post mortem document is very helpful as a basis for all of these communications, and enables you to have consistent messaging.
Focus on the parts of the incident that are already visible to them, your plans for preventing a recurrence, and how you would communicate with them in case of similar incidents (e.g. with a public service status page).
What Next?
Next time you have an incident, focus first on solving the immediate problem in the best way you can. After the problem is resolved, use this guide to run a post mortem analysis on it and improve your operational excellence, create ownership in your team, and earn the trust of your stakeholders.
Over time, your collection of post mortems becomes a useful resource and reference. You can look back at how the issues you experience are changing; hopefully, you’ll see common issues being permanently resolved as you work through the actions developed in your post mortem practice.
Simplified Template
Description
What happened, when, where and how.
Impact
How did this affect customers or other parts of the organisation? Include metrics, costs, lost revenues, reputation damage and risks.
Quick Fixes
What did we do to fix this immediately?
Root Cause
What was the root cause that enabled this error to happen and have these consequences?
Long Term Fixes
What will we do to stop this happening again in future?
Detailed Template
Use this as a guide - omit parts as needed, or simplify. It’s better to do some of this well than all of it badly.
Summary
Summarise the main points as if you were writing a brief email to your CEO. This helps anyone looking at the document know if they want to read more details.
What happened when.
When and how it was detected.
Impact, and who was affected.
High level cause and how it was diagnosed.
When it was fixed and high level solution.
Any plan to prevent recurrence.
Any resources needed to implement that plan.
Write this last, and keep it to a few sentences or bullet points.
Impact
Define the impact in business terms - customer impact, revenue, costs, risks. Assess scope and severity, especially of how the problem really affected customers, and any reputational damage, especially if broader than the immediate impact.
Also record impact on service metrics (uptime, latency, throughput, cost, etc), and list any further incidents caused or other work affected.
Timeline
Starting with the trigger that led to the problem, write a detailed timeline. Include names of people or systems or teams who took actions. Be as specific as possible, and link to external data sources where relevant.
Format this as a table or bullet list with everything related to the incident in chronological order (use a single timezone for clarity, but state local time also where relevant).
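If your timeline entries come from several sources in different timezones, a small script can merge them into one chronological table. The following is a minimal sketch, not part of the CoE process itself: the events, the fixed-offset EST zone, and the names `EVENTS`, `timeline_rows` and `print_timeline` are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: a fixed UTC-5 offset stands in for a real local timezone.
EST = timezone(timedelta(hours=-5), "EST")

# Hypothetical events: (timezone-aware timestamp, who acted, what happened).
EVENTS = [
    (datetime(2024, 3, 5, 10, 5, tzinfo=UTC), "on-call engineer", "Rolled back release"),
    (datetime(2024, 3, 5, 4, 40, tzinfo=EST), "monitoring", "Error-rate alarm fired"),
    (datetime(2024, 3, 5, 9, 12, tzinfo=UTC), "deploy pipeline", "Release rolled out"),
]


def timeline_rows(events):
    """Return rows in chronological order: (UTC time, local time, who, what)."""
    rows = []
    # Aware datetimes compare correctly even when their timezones differ.
    for when, actor, action in sorted(events, key=lambda e: e[0]):
        utc_text = when.astimezone(UTC).strftime("%Y-%m-%d %H:%M")
        local_text = when.strftime("%H:%M %Z")
        rows.append((utc_text, local_text, actor, action))
    return rows


def print_timeline(events):
    """Print the timeline as a plain-text table, one event per line."""
    print(f"{'Time (UTC)':<17} | {'Local':<9} | {'Who':<16} | What")
    for utc_text, local_text, actor, action in timeline_rows(events):
        print(f"{utc_text:<17} | {local_text:<9} | {actor:<16} | {action}")


if __name__ == "__main__":
    print_timeline(EVENTS)
```

Keeping the original local time next to the single reference timezone makes it easier to cross-check each entry against logs and chat messages recorded locally.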
Metrics
List the metrics that helped detect or diagnose the incident, and any metrics that would have helped but weren’t available or monitored.
Analysis
Analyse the incident in three phases:
Detection
When did we learn of the problem, and how? In particular, distinguish between proactive monitoring and customer reporting.
How could we reduce time to detection, and by how much?
Diagnosis
What was the underlying cause, how did we diagnose it, and how long did this take?
How could we reduce time to diagnosis, and by how much?
Mitigation
When was the problem resolved, how do we know it is resolved, and what did we do?
How could we reduce time to mitigation, and by how much?
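To answer the "by how much" questions above, it helps to know how long each phase actually took. This is a rough sketch with hypothetical timestamps. It assumes each phase is measured from the end of the previous one, which is one reasonable convention rather than a rule of the CoE process.

```python
from datetime import datetime

# Hypothetical milestones taken from the timeline, all in the same timezone.
STARTED = datetime(2024, 3, 5, 9, 12)    # the trigger that led to the problem
DETECTED = datetime(2024, 3, 5, 9, 40)   # first alarm or customer report
DIAGNOSED = datetime(2024, 3, 5, 9, 58)  # underlying cause identified
MITIGATED = datetime(2024, 3, 5, 10, 5)  # problem resolved for customers


def phase_minutes(started, detected, diagnosed, mitigated):
    """Return whole minutes spent in each phase, plus the total."""
    def minutes(begin, end):
        return int((end - begin).total_seconds() // 60)

    return {
        "detection": minutes(started, detected),
        "diagnosis": minutes(detected, diagnosed),
        "mitigation": minutes(diagnosed, mitigated),
        "total": minutes(started, mitigated),
    }


if __name__ == "__main__":
    for phase, value in phase_minutes(STARTED, DETECTED, DIAGNOSED, MITIGATED).items():
        print(f"{phase}: {value} min")
```

Seeing the split makes it easier to decide where an improvement would pay off most. In this made-up example, detection takes more than half of the total time.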
Communication
What was communicated to customers, stakeholders, and other affected parties, and when?
What could have been improved or accelerated? What’s next?
Root Cause
Do a deep “five whys” analysis of the problem. Keep going until you get to system-level root causes (human error is not a root cause - humans make mistakes, and systems need to account for this and eliminate those errors or make them inconsequential).
Actions
List the actions you’ll take as a result of this CoE.
Include actions from the analysis to improve the metrics, detection, diagnosis, mitigation, and communication, as well as actions to address each level of the root cause analysis.
For each action, include the priority, owner and ETA for completion.
If the action is complex, link to an external document.
If the action requires more planning, use an ETA for the plan, and update the real ETA later (Amazon calls this an ETA for an ETA).
If the action is not committed, or needs more resources or external approvals, the ETA is for when that decision will be made.
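If you track actions somewhere more structured than a document, the fields above map onto a simple record. The sketch below is one possible shape, not a standard. The class names, the three ETA kinds, the "1 is highest priority" convention, and the sample actions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class EtaKind(Enum):
    COMPLETION = "completion"  # date the action will be finished
    PLAN = "plan"              # date a plan and real ETA will exist ("ETA for an ETA")
    DECISION = "decision"      # date a commitment or approval decision will be made


@dataclass
class Action:
    description: str
    priority: int                  # assumption: 1 is the highest priority
    owner: str
    eta: date
    eta_kind: EtaKind = EtaKind.COMPLETION
    link: Optional[str] = None     # external document for complex actions


def overdue(actions, today):
    """Return actions whose ETA has passed, highest priority first."""
    late = [a for a in actions if a.eta < today]
    return sorted(late, key=lambda a: (a.priority, a.eta))


# Hypothetical actions for illustration only.
ACTIONS = [
    Action("Plan migration off the shared database", 2, "platform lead",
           date(2024, 3, 19), EtaKind.PLAN),
    Action("Add alarm on checkout error rate", 1, "on-call lead", date(2024, 3, 12)),
    Action("Fund a second on-call rotation", 2, "engineering manager",
           date(2024, 4, 2), EtaKind.DECISION),
]

if __name__ == "__main__":
    for action in overdue(ACTIONS, date(2024, 3, 20)):
        print(f"P{action.priority} {action.owner}: {action.description} "
              f"({action.eta_kind.value} ETA {action.eta})")
```

Recording the kind of ETA alongside the date keeps it clear whether a date is a promise to finish, a promise to plan, or a promise to decide, which is useful when the reviewer follows up.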
References
Here are some articles I found most helpful when writing this. Some are very technical in nature:
AWS’s content (quite technical):
AWS Blog on Why you should develop a correction of error (CoE)
Includes a more rigorous template, good for technical services
AWS Blog on Creating a correction of errors document (followup to the above)
Includes an example of a technical service CoE
CoE Definition in the AWS Glossary
Joshua Harris’s Postmortem / Correction of Error (CoE) template (quite technical)
Warwick Massey’s articles on this method (non-technical):
Colin Bryar and Bill Carr’s resources from their excellent Working Backwards book (non-technical):