Post-Mortems Are Necessary
No major incident is ever truly resolved without a post-mortem. Post-mortems
are a great way for development teams to identify and analyze elements of a
project that were successful or unsuccessful. It’s a way to look back and review
the incident in detail to determine exactly what went wrong, why it went wrong,
and what can be done in the future to make sure it doesn’t happen again.
A post-mortem can also be referred to as an after-action review, incident review, or
follow-up review. While the name may be different, the process and goal are the same.
Sharing Our Incident Response Process
Reliability has always been one of the primary design considerations at
PagerDuty. But what do we do when the unexpected happens and something
does go wrong? It’s of the utmost importance that we are prepared and can
get our systems back into full working order as quickly as possible. We pride
ourselves on being able to quickly resolve issues that arise and keep our
systems working within their SLA. We’ve worked very hard to accomplish this,
and our incident response process is where it all begins.
Our internal incident response documentation is something we’ve built up
over the last few years as we’ve learned from our mistakes. It details the best practices of our process, from how to prepare new
employees for on-call responsibilities, to how to handle major incidents, both in preparation and in follow-up work. Few companies seem
to talk about their internal processes for dealing with major incidents. It’s sometimes considered taboo to even mention the word
“incident” in any sort of communication. We would like to change that.
To that end, we’d like to share how we here at PagerDuty conduct post-mortems internally. It is our hope that others will use the
documentation as a starting point to formalize their own processes. This guide provides information on what to do after a major
incident and shares PagerDuty’s follow-up and after-action review procedures.
Check out the rest of our incident response documentation to learn how we
prepare for and handle incidents, as well as how we prep our teams to go
on-call effectively.
First & Foremost, Create Response Roles
Creating response roles for individuals on your team gives each person specific follow-up tasks to be accountable for. These are
generally lightweight tasks that ensure information is organized and customers are followed up with accordingly. Below are the
five response roles we assign.
Incident Commander
An Incident Commander acts as the single source
of truth of what is currently happening and helps
drive major incidents to resolution.
TASKS INCLUDE:
• Create the post-mortem page from the template, and assign an owner
to the post-mortem for the incident.
• Send out an internal email to the relevant stakeholders explaining that
we had an incident and provide a link to the post-mortem page.
• Check on the progress of the post-mortem to ensure that it’s
completed within the desired time frame.
Deputy
A Deputy is a direct support role for the Incident
Commander. They support the Incident Commander
so that the Incident Commander can focus on the
incident at hand.
TASKS INCLUDE:
• There are no steps for a Deputy after an incident is resolved; however,
the Incident Commander may ask for your help with their steps.