Data-Driven DevOps
IMPORTANT METRICS
It takes time to build a data-driven culture, so where do you start? Incident response is an
essential part of keeping your business active, and is a good place to lay that foundation for
your team. Here are four incident response metrics to get you started.
Raw Incident Count
When you know the number of incidents a team normally encounters, a spike or continuous
upward trend in the incident count tells you either that team’s infrastructure has a weakness or
their monitoring tools need to be recalibrated.
A data-driven DevOps team has
the tools and the agility to put an
end to alert fatigue.
As you add features and monitoring tools, incident count
may rise. But by filtering out low-quality alerts, building
runbooks, and automating common fixes, the team can lower
real incidents per responder, prevent alert fatigue,
and maximize the time it can spend tackling critical
incidents and building new features.
As with any metric, knowing the number is less important than knowing the context that
gave you that number. It’s important to break your incident count down by team or service
and drill into specific incidents to understand what is causing problems. Was that spike on
Wednesday due to a failed deploy that caused issues across teams or just a flapping alert on
a low-severity service? Comparing incident counts across services and teams also helps you
understand whether a particular incident load is better or worse than the organization average.
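As a rough illustration of the breakdown described above, the following Python sketch counts incidents per service and per team and flags the groups that sit above the average. The record layout, the sample data, and the helper names (count_by, above_average) are assumptions made for this example, not part of any particular tool.

```python
from collections import Counter
from statistics import mean

# Hypothetical incident records as (service, team) pairs; illustrative data only.
incidents = [
    ("checkout", "payments"),
    ("checkout", "payments"),
    ("checkout", "payments"),
    ("search", "discovery"),
    ("email", "platform"),
    ("email", "platform"),
]


def count_by(records, index):
    """Count incidents grouped by the field at position `index`."""
    return Counter(record[index] for record in records)


def above_average(counts):
    """Return the groups whose incident count exceeds the average across groups."""
    average = mean(counts.values())
    return sorted(name for name, n in counts.items() if n > average)


by_service = count_by(incidents, 0)
by_team = count_by(incidents, 1)
noisy_services = above_average(by_service)

print(dict(by_service))
print(dict(by_team))
print(noisy_services)
```

A group flagged here is only a starting point: the next step is still to drill into the individual incidents behind the number.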
Time to Acknowledgment
Time to Acknowledgment (TTA) is a good way to measure individual performance. Team
members may not always have control over the root cause of a particular incident, but they are
always in control of how quickly they acknowledge and respond. Fast response time is a marker
of operational readiness, and teams with the attitude and tools to respond faster tend to have the
attitude and tools to recover faster. Operationally mature teams have high expectations for their
team members’ TTA and hold themselves accountable with internal targets on response time.
You can enforce a response time target with IT operations management software using an
escalation timeout. If, for example, you decide that all incidents should be responded to within
five minutes, you simply set your timeout to five minutes to make sure the next person in line
is alerted if the timeout is triggered. Tracking your escalations will also give you valuable data
about how your team is working together.
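The five-minute example above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not how any specific IT operations management product implements it; the function names and the choice to treat an acknowledgment exactly at the deadline as on time are assumptions.

```python
from datetime import datetime, timedelta

# The five-minute response target used as the example in the text.
ESCALATION_TIMEOUT = timedelta(minutes=5)


def time_to_ack(triggered_at, acknowledged_at):
    """Time to Acknowledgment (TTA) for a single incident."""
    return acknowledged_at - triggered_at


def should_escalate(triggered_at, acknowledged_at, now, timeout=ESCALATION_TIMEOUT):
    """Return True if the next person in line should be alerted.

    An incident escalates once the timeout has elapsed without an
    acknowledgment arriving on or before the deadline.
    """
    deadline = triggered_at + timeout
    if acknowledged_at is not None and acknowledged_at <= deadline:
        return False
    return now >= deadline


triggered = datetime(2024, 1, 1, 12, 0)
acked = triggered + timedelta(minutes=3)
later = triggered + timedelta(minutes=6)

print(time_to_ack(triggered, acked))
print(should_escalate(triggered, None, later))
print(should_escalate(triggered, acked, later))
```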
Escalations
For most organizations using IT operations management software, escalations are rare.
They are a sign that either a responder wasn’t able to get to an incident in time or that the
responder didn’t have the tools or skills to work on it. While escalation policies are a necessary and
valuable part of incident management, teams should generally be trying to drive the number
of escalations down. If you’re seeing a rising trend in escalations over time, you can make
adjustments to your workflow and alerting protocols to ensure that alerts are being funneled to
the people with the time and skills to address them.
pagerduty.com
It should be noted that there are some situations in which an escalation will be part of standard
operating practice. For example, you might have a NOC, first-tier support team or even an auto-remediation tool that triages or escalates incoming incidents based on their content. In this
case, you’ll want to track what types of alerts should be escalated and what normal numbers
should look like for those alerts.
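One simple way to watch for a rising trend in escalations is to track the share of incidents that escalate in each period. The sketch below assumes weekly (total incidents, escalated incidents) pairs with made-up numbers, and uses a deliberately strict definition of "rising" (every period higher than the one before); both are illustrative choices rather than a recommended standard.

```python
def escalation_rate(total_incidents, escalated):
    """Fraction of incidents in a period that were escalated."""
    if total_incidents == 0:
        return 0.0
    return escalated / total_incidents


def is_rising(rates):
    """True when every period's rate is higher than the previous period's."""
    return len(rates) > 1 and all(
        later > earlier for earlier, later in zip(rates, rates[1:])
    )


# Hypothetical weekly data: (total incidents, escalated incidents).
weekly = [(40, 2), (50, 5), (45, 9)]
rates = [escalation_rate(total, escalated) for total, escalated in weekly]

print(rates)
print(is_rising(rates))
```

Where escalation is part of standard operating practice, the same calculation can be run per alert type to establish what a normal rate looks like.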
Mean Time to Resolution
Mean Time to Resolution (MTTR) is the highest standard you can use to measure your team.
How long does it take your team to resolve an incident?
Every organization has a different baseline for MTTR. Complexity of infrastructure, organization of
responsibility, even the industry in which the organization operates can all contribute to different
norms. But downtime is expensive, both in loss of revenue and customer trust, and it’s important
to track MTTR to make sure that your team is up to the challenges of a major incident.
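For reference, MTTR is simply the average of the individual resolution times. The Python sketch below assumes resolution time is measured from when an incident is triggered to when it is resolved; the sample durations are invented for illustration.

```python
from datetime import timedelta


def mean_time_to_resolution(durations):
    """Average of per-incident resolution times, given as timedelta values."""
    if not durations:
        raise ValueError("at least one resolved incident is required")
    return sum(durations, timedelta()) / len(durations)


# Hypothetical resolution times for three incidents.
resolution_times = [
    timedelta(minutes=30),
    timedelta(minutes=45),
    timedelta(minutes=15),
]
mttr = mean_time_to_resolution(resolution_times)

print(mttr)
```

Because a single long outage can dominate the mean, it can help to review the individual durations alongside the average.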
HOW TO BUILD A DATA-DRIVEN CULTURE
Now that you have some basic metrics to drive your team’s performance, the question is how
to build a culture around them. There aren’t simple answers to this question, and you will know
best how to guide your team through this change. There are, however, a few principles of data-driven DevOps culture to keep in mind.
Relate the metrics to both your specific business goals and the team’s role in achieving them.
The goal is to get your engineers to see themselves as generating value for your customers, not
just “keeping the lights on” for the company. Mean Time to Resolution is the ultimate customer-facing metric, but it can be difficult for teams to take sole responsibility for the results you
see there. Combining MTTR with TTA, however, should give you a clearer picture of how your team
is contributing to customer satisfaction. Once everyone is working with the same customer-oriented goals in mind, you’ll have established a common reference for success as you tackle
new challenges.