Sharing Our Internal Response Process
Our internal incident response documentation is something we’ve built up over the last few years as we’ve learned from our
mistakes. It details the best practices of our process, from how to prepare new employees for on-call responsibilities, to how to
handle major incidents, both in preparation and in the follow-up work afterward.
We’d like to share how we here at PagerDuty prepare our team members for going on-call. It is our hope that others will use the
documentation as a starting point to formalize their own processes. In this guide, we’ll talk about what being on-call actually
means, what on-call responsibilities entail (and don’t entail), and best practices for being on-call.
What is “On-Call”?
Being on-call means that you are able to be contacted at any time in order to investigate and fix issues that may arise for the
system you are responsible for. For example, if you are on-call for a service at your organization, should any alarms be triggered for
that service, you will receive an alert on your mobile device via email, phone call, push notification or SMS, providing you with details on
what’s wrong and how to fix it. You’re expected to take whatever actions are necessary to resolve the issue and return your service
to a normal state.
On-call responsibilities extend beyond normal office hours, and if you are on-call, you are expected to be able to respond to issues,
even at 2 a.m. This sounds horrible (and it sometimes can be), but this is what our customers go through, and it is the exact problem
that PagerDuty is trying to solve! PagerDuty exists to make on-call life less painful for everyone.
Responsibilities
Knowing exactly what your responsibilities are can make being on-call much less painful. Below are responsibilities as they
relate to each step of the incident management process.
Prepare
For peace of mind, it’s crucial that you’re prepared with everything you need before going on-call.
Have your laptop and Internet with you (office, home, a MiFi, a phone with a tethering plan, a hotspot, etc.).
Have a way to charge your MiFi.
Team alert escalation happens within 5 minutes. Be sure to set or stagger your notification timeouts accordingly.
Make sure PagerDuty can bypass your “Do Not Disturb” settings.
Your environment should be set up and a current working copy of the necessary repos should be local and functioning.
Have your environments configured and tested on your workstations.
Ensure your credentials for third-party services are current.
Understand how your organization handles serious incidents, as well as what the different roles and methods of communication are.
Be aware of your upcoming on-call time (primary, backup) and arrange swaps around travel, vacations, appointments, etc.
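As a rough illustration of the advice above about the 5-minute team escalation, the sketch below checks a set of personal notification rules against that window. It is only a sketch: the `NotificationRule` class, the helper functions, and the sample delays are hypothetical and are not part of any PagerDuty API. It also assumes "staggered" means that no two rules share the same delay.

```python
from dataclasses import dataclass

# From the guide: team alert escalation happens within 5 minutes.
ESCALATION_TIMEOUT_MIN = 5


@dataclass(frozen=True)
class NotificationRule:
    """One personal notification step (hypothetical model, not a PagerDuty object)."""
    channel: str        # e.g. "push", "sms", "phone"
    delay_minutes: int  # minutes after the alert is assigned to you


def rules_too_late(rules, escalation_timeout=ESCALATION_TIMEOUT_MIN):
    """Return the rules that would only fire at or after the escalation timeout."""
    return [r for r in rules if r.delay_minutes >= escalation_timeout]


def is_staggered(rules):
    """True if no two rules share the same delay (assumed meaning of 'stagger')."""
    delays = [r.delay_minutes for r in rules]
    return len(delays) == len(set(delays))


# Example rules, invented for illustration only.
my_rules = [
    NotificationRule("push", 0),
    NotificationRule("sms", 1),
    NotificationRule("phone", 3),
    NotificationRule("phone", 6),  # fires after the alert has already escalated
]

late = rules_too_late(my_rules)
for rule in late:
    print(f"{rule.channel} at +{rule.delay_minutes} min fires too late to be useful")
```

Any rule reported here would notify you only after the alert has already moved on to the next person, so it is worth shortening its delay before your shift starts.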