8 Key Performance Indicators (KPIs) that every IT help desk needs to know
People often say "what gets measured gets improved," but they rarely say what, exactly, should be measured. With the recent developments in the reporting capabilities of IT help desk software, hundreds of KPIs and help desk metrics can be measured and monitored.
But that doesn't mean you should measure them all. Only the KPIs and metrics that are critical to your IT help desk need to be measured to improve service delivery.
This paper describes the 8 KPIs that are critical to every IT help desk. These KPIs help meet basic IT help desk objectives such as business continuity, organizational productivity, and delivery of services on time and within budget. The KPIs
are as follows:
Ensuring business continuity
Making the organization productive
Ensuring business continuity
1. Lost business hours
The number of hours the business is down because IT services are unavailable.
Keep lost business hours to the bare minimum.
Most IT teams track service availability to see the overall performance of their IT service desks. But the pain of lost business isn't always reflected in service availability levels, even when those levels are high. For instance, if service
availability is at 99.9%, the company still loses more than eight hours per year. Tracking lost business hours clearly highlights the loss and its impact on business.
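The arithmetic behind that figure is easy to check. The snippet below is a minimal sketch, assuming a service that is expected to be available around the clock for a 365-day year; the function name `lost_hours_per_year` is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 24x7 service, non-leap year


def lost_hours_per_year(availability_pct: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (100 - availability_pct) / 100


for pct in (99.0, 99.9, 99.99):
    hours = lost_hours_per_year(pct)
    print(f"{pct}% available -> {hours:.2f} hours lost per year")
```

At 99.9 percent this works out to roughly 8.76 hours, which matches the "more than eight hours" mentioned above.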
Case study: No-fly time at Virgin Blue
In September 2010, Virgin Blue faced what could be considered every airline's worst nightmare. About 50,000 customers and 100 flights were grounded. Four hundred more flights were delayed or rescheduled over the following days because the solid-state disk server infrastructure hosting Virgin Blue's applications failed. This affected Virgin Blue's online check-in and booking system.
Despite SLAs to restore services immediately, it took 11 hours for the service to be restored, and 10 more hours to restore full operations. This was because of an attempted repair of a faulty device, which delayed the switchover to a contingency hardware platform. By then, the damage was already done. Although these 11 hours didn't cost much in terms of Virgin Blue's IT service availability for the year, they cost Virgin Blue approximately $10 million in terms of lost business.
Industry standards - Lost business hours
- Number of downtime events in the last 12 months
- Average amount of downtime per event in the last 12 months
- Longest downtime event
- Critical application availability
- Length of time to recover from last downtime event
Tips for minimizing lost business hours
- Plan and execute application upgrades, server migrations, and every other IT change implementation carefully.
- Maintain a clean, well-defined CMDB to identify critical failure points, and understand CI interactions in the network to identify the cascading impact of failed changes.
- Educate IT teams on the risks of SLA violations in terms of lost business hours and revenue.
- Evaluate the past performance of the IT help desk to gain insight into anticipating and handling outages.
That said, a lot of factors could contribute negatively towards lost business hours. In 2010, Gartner projected that, "Through 2015, 80% of outages impacting mission-critical services will be caused by people and process issues, and more than
50% of those outages will be caused by change/configuration/release integration, and hand-off issues."
2. Change success rate
The ratio of the number of successful changes to the total number of changes
that were executed in a given time frame.
Achieve a higher percentage of successful change implementations.
Opinion remains divided on what a failed change implies. It basically refers to any change that did not meet its objectives or go as planned.
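As a rough illustration of the ratio defined above, the sketch below expresses the change success rate as a percentage. The function name and the sample figures are hypothetical, and what counts as a "successful" change is left to whatever definition your team agrees on.

```python
def change_success_rate(successful: int, total: int) -> float:
    """Percentage of executed changes that were successful in a time frame."""
    if total <= 0:
        raise ValueError("no changes were executed in this time frame")
    if not 0 <= successful <= total:
        raise ValueError("successful changes must be between 0 and total")
    return 100 * successful / total


# Hypothetical month: 47 of 50 executed changes met their objectives.
rate = change_success_rate(47, 50)
print(f"Change success rate: {rate:.1f}%")
```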
Case study: The ASX outage
On October 27, 2011, trading had to be halted at the Australian Securities Exchange (ASX) for four hours due to a failed change implementation. An upgrade on the ASX's internal network (to improve the latency of the trading platform) led to unprecedented connectivity issues between the supporting components and the disseminating gateways of the trading system. ASX had to initiate trading services from one of its disaster recovery sites. Finally, to restore normalcy, the change had to be backed out that night.
A declining or stagnant change success rate usually stems from failed change implementations caused by:
- Lack of relevant information such as the impact of the change, the dependencies of the assets involved, the change implementation window, and business priorities.
- Inability to collaborate between teams for successful change implementation.
- Improper communication to end users and stakeholders of the change implementation.
Tips for a high change success rate
- Perform a proper impact analysis and prepare a detailed rollout plan with a checklist of tasks to be completed.
- Collect all relevant information from end users and technicians before the implementation.
- Constitute change advisory boards (CABs) and ensure a strict approval process.
Another help desk metric that should be tracked to have an effective change management process is the number of unplanned changes. An unplanned change can be an emergency change or an urgent change.
- An emergency change: A service restoration change due to an incident, or a change that needs to be implemented quickly to avoid an incident.
- An urgent or expedited change: Changes that are required quickly due to a pressing need such as a legal requirement or a business need, but are not related to restoring service.
Although there is no industry standard or defined number for the number of unplanned changes permissible in an IT infrastructure, this reporting metric is important, especially during an increasing trend or a spike in the number of unplanned changes.
An increasing trend in unplanned changes
An increasing trend in the number of unplanned changes indicates inadequate planning of changes and calls the efficiency of the change management process into question. Therefore, the change management process has to be improved to ensure proper planning and execution of changes.
Increasing trend in unplanned changes
A discrete spike in unplanned changes
A sudden spike in the number of unplanned changes can be due to unanticipated
major incidents, which warrant emergency changes to restore
service. Such a situation is probably due to an unstable infrastructure, which could affect service availability and, ultimately, the business.
Discrete spikes in unplanned changes
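The two patterns above can be told apart with very simple statistics. The following is a minimal sketch, not a prescribed method: it assumes you have a count of unplanned changes per period (say, per month), uses a least-squares slope as a stand-in for "increasing trend," and flags a period as a spike when it exceeds an arbitrary multiple of the median. The function names, the `factor` threshold, and the sample counts are all hypothetical.

```python
from statistics import median


def slope(counts):
    """Least-squares slope of the counts against their period index."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two periods")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def spikes(counts, factor=2.0):
    """Indexes of periods whose count exceeds factor times the median."""
    baseline = median(counts)
    return [i for i, c in enumerate(counts) if c > factor * baseline]


# Hypothetical monthly counts of unplanned changes.
rising = [2, 3, 3, 4, 5, 6]   # steady growth, no single outlier
spiky = [3, 2, 3, 11, 3, 2]   # flat apart from one bad month

print(slope(rising), spikes(rising))
print(slope(spiky), spikes(spiky))
```

A clearly positive slope with no flagged periods points to a planning problem, while a flagged period against a flat baseline points to a major incident worth investigating.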
3. Infrastructure stability
A highly stable infrastructure is characterized by maximum availability, very
few outages, and low service disruptions.
Maintain a highly stable infrastructure.
To effectively gauge and monitor infrastructural stability, IT help desks need to monitor the following:
- Percentage reduction in the number of problematic assets
- Percentage reduction in the number of major incidents
Percentage reduction in the number of problematic assets
Delivering maximum availability and better service quality will be impossible in an infrastructure where routers have to be restarted multiple times a day, servers are often down, or workstations have to be rebooted every now and then. Therefore,
such problematic assets must be identified and replaced to ensure business continuity.
A problematic asset is one that is repeatedly the cause of service disruptions or outages. For reporting purposes, these could be assets that have more than a couple of incidents associated with them. The percentage reduction in the number of problematic assets can be calculated using the following formula:
Percentage reduction = (Number of problematic assets replaced by the end of the time frame ÷ Number of problematic assets identified at the beginning of the time frame) × 100
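The sketch below shows one way this could be computed from incident records. It assumes each incident is tagged with the ID of the asset involved and interprets "more than a couple" as more than two incidents; the threshold, the function names, and the asset IDs are illustrative assumptions.

```python
from collections import Counter

INCIDENT_THRESHOLD = 2  # assumption: "more than a couple" means more than two


def problematic_assets(incident_asset_ids):
    """Assets with more incidents than the threshold in the time frame."""
    counts = Counter(incident_asset_ids)
    return {asset for asset, n in counts.items() if n > INCIDENT_THRESHOLD}


def percentage_reduction(identified_at_start, replaced_by_end):
    """Share of the initially identified problematic assets that were replaced."""
    if not identified_at_start:
        raise ValueError("no problematic assets were identified")
    replaced = identified_at_start & replaced_by_end
    return 100 * len(replaced) / len(identified_at_start)


# Hypothetical incident log: one asset ID per incident.
incidents = ["router-1"] * 5 + ["srv-2"] * 3 + ["ws-9"] + ["srv-7"] * 4
identified = problematic_assets(incidents)
reduction = percentage_reduction(identified, {"router-1", "srv-7"})
print(sorted(identified), f"{reduction:.1f}%")
```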
Percentage reduction in the number of major incidents
Another major indication of stability is the recurrence of major incidents on the IT infrastructure, which can lead to service disruptions or service level deterioration. A major incident, by definition, is a high-impact, high-urgency incident that affects a large number of users, depriving the business of one or two key services.
The goal is to reduce the number of major incidents, which can be achieved with efficient Root Cause Analysis (RCA) and a reduction of problem backlog.
Identifying root causes and fixing problems can reduce the recurrence of major incidents and, subsequently, ticket volumes to the IT helpdesk.
Tips to reduce problem backlog (and therefore major incidents)
- Faster initiation of RCA: In this case, the sooner the better. The sooner the RCA is initiated, the greater the chances are of identifying the root cause.
- Quick completion of investigations: If the root cause is identified faster, the IT team can fix and resolve the problem faster, making sure that incidents don't recur.
Teams can also measure these action items with details on time taken to initiate root cause analysis after problem identification and time taken to complete root cause analysis.
Case study: Reducing major incidents helps improve IT stability
One of the world's leading financial institutions was able to improve its stability by reducing its major incidents. This reduction in the number of incidents was achieved by improving its root cause analysis process.
Reducing major incidents helps improve IT stability
The major reasons for a heavy problem backlog could be:
- Delayed and long-pending RCAs.
- Inconsistent quality of RCAs, and lack of proper documentation.
- Not effectively communicating the investigation process to the stakeholders.
Without identifying and rectifying the root cause, the chances of major incidents recurring are fairly high. Thankfully, though, the problem backlog can be reduced by following the tips above: initiating RCA sooner and completing investigations quickly.
Working on these two simple ITIL® service desk metrics (percentage reduction in the number of major incidents and percentage reduction in the number of problematic assets) can help you maintain a highly stable IT infrastructure.
4. Ticket volume trends
Total number of tickets handled by the IT helpdesk and their patterns within a
given time frame.
Optimize the number of incidents and service requests, and prepare the IT team to handle the ticket load.
What can you do with ticket volume trends?
- Identify peaks and troughs to optimize resource management and technician workload.
- Create a better staffing model.
- Design training sessions for your IT service desk team.
- Analyze service request patterns and plan ahead for purchases of assets and licenses.
- Validate any additional resource requirements.
IT help desks should watch out for a few trends when it comes to ticket
volumes, such as:
Discrete spikes in ticket volumes
A sudden upward spike in the ticket volume can be due to the following reasons:
- a. Period of peak business activity
- b. IT rollouts leading to:
  - i. Service disruptions and unavailability
  - ii. FAQs
- c. IT disruptions
- d. Post-holiday password reset tickets
Case study: Fall intake leads to ticket spike at a university
Figure 7 below shows the number of tickets handled by the IT help desk at a university in the United States. The graph clearly shows a ticket spike in September of both 2012 and 2013. This is due to the increased number of students joining the university during the fall. So, the IT team makes sure that this extra load is distributed evenly across the team, and each member works overtime to handle these ticket spikes.
Ticket volume at an American university
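Seasonal spikes like this one can be surfaced from raw ticket data with a simple monthly count. The sketch below is an illustration only: it assumes you can export each ticket's creation date, and it flags months whose volume exceeds 1.5 times the median month. The function names, the threshold, and the sample data are hypothetical and are not the university's actual figures.

```python
from collections import Counter
from datetime import date
from statistics import median


def monthly_volume(created_dates):
    """Count tickets per (year, month)."""
    return Counter((d.year, d.month) for d in created_dates)


def peak_months(volume, factor=1.5):
    """Months whose ticket count exceeds factor times the median month."""
    baseline = median(volume.values())
    return sorted(m for m, n in volume.items() if n > factor * baseline)


# Hypothetical year: 20 tickets every month, 45 in September.
tickets = [
    date(2013, month, 1)
    for month in range(1, 13)
    for _ in range(45 if month == 9 else 20)
]
volume = monthly_volume(tickets)
print(peak_months(volume))
```

Knowing the peak months in advance makes it easier to plan staffing before the spike rather than absorb it with overtime.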
Gradual continuous upward trend
Continuous upward trend in ticket volumes
An upward trend could be due to any of the following reasons:
Increase in the organization size
As the business grows, it is obvious that the IT service desk has to support more end users, which typically leads to increased ticket volumes. This gradual increase in the ticket volume can be handled by an effective staffing plan in accordance
with the growth of the business. Furthermore, end users can be segregated into departments and user groups to handle tickets effectively.
Initiatives to support more business functions
As IT starts supporting more business functions, the ticket volume (both incidents and service requests) rises. This can be handled by understanding the requirements and expectations of the end users, and equipping the IT help desk team to
handle the increase in tickets.
Decrease of infrastructure stability
As the number of problematic and outdated assets in the IT network increases, the number of tickets is bound to increase as well. This can be addressed by associating incidents and problems with their assets, which helps the IT team decide whether to retire or upgrade each asset.
5. First Call Resolution Rate (FCRR)
Percentage of incidents resolved by the first level of support (first call or
contact with the IT help desk).
Have a higher level of FCRR.
A high first call resolution rate is usually associated with higher customer satisfaction, as confirmed by a study that Customer Relationship Metrics conducted. Furthermore, a study conducted by the Service Quality Measurement Group revealed that every one percent improvement in FCRR yields a one percent improvement in customer or end user satisfaction.
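To make the definition concrete, here is a minimal sketch of an FCRR calculation. It assumes each incident record carries the number of contacts the requester made and the support level that resolved it; the `Incident` class, its field names, and the sample records are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    contacts: int        # contacts with the help desk before resolution
    resolved_level: int  # support level that resolved it (1 = first level)


def fcrr(incidents):
    """Percentage of incidents resolved by level 1 on the first contact."""
    if not incidents:
        raise ValueError("no incidents in this time frame")
    first_call = sum(
        1 for i in incidents if i.contacts == 1 and i.resolved_level == 1
    )
    return 100 * first_call / len(incidents)


sample = [
    Incident(contacts=1, resolved_level=1),
    Incident(contacts=1, resolved_level=1),
    Incident(contacts=2, resolved_level=1),  # needed a second contact
    Incident(contacts=1, resolved_level=2),  # escalated to level 2
]
print(f"FCRR: {fcrr(sample):.1f}%")
```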
First call resolution is also related to cost per ticket. The following graph represents the cost per ticket for every level.
Cost per ticket at various levels of support
Sometimes IT helpdesk technicians rush to close tickets during the first call, even without accurate resolutions. Such cases can lead to first call resolution rates rising, while end user satisfaction rates drop drastically, as depicted in
the following graph.
First call resolution rate Vs. End user satisfaction
FCRR Excellence Tip
Here is a simple three-phase technique to get your IT help desk team resolving tickets in the first call.
Phase 1: Learn the environment
- Gather environment-specific knowledge.
- Populate the knowledge base with the information collected, creating relevant artifacts.
- Generate regular status reports on the IT help desk performance with sections on lessons learned, achievements, and obstacles overcome.
- Invite experts to evaluate performance.
- Create an operations manual that clearly outlines support processes, centralizes key environmental information, and explicitly defines complex procedures for ticket resolution.
Phase 2: Fine tune
Generate reports to ascertain that the efforts of phase 1 panned out, and identify areas of improvement. Below are some sample reports to help you get started.
- Percentage of calls taken by each technician.
- Number of calls taken per agent, per hour.
- Average talk time, by agent.
- Of the tickets we did not close, where were they transferred?
- Of those transfer destinations, who received the most tickets?
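Two of the sample reports above can be sketched from a plain call log. The snippet assumes each call record notes the technician who took it, whether the ticket was closed on that call, and where it was transferred otherwise; the record layout, names, and data are all hypothetical.

```python
from collections import Counter


def call_share(calls):
    """Percentage of calls taken by each technician."""
    counts = Counter(c["technician"] for c in calls)
    total = len(calls)
    return {tech: 100 * n / total for tech, n in counts.items()}


def transfer_destinations(calls):
    """Transfer destinations for unclosed tickets, busiest first."""
    counts = Counter(
        c["transferred_to"]
        for c in calls
        if not c["closed"] and c["transferred_to"]
    )
    return counts.most_common()


# Hypothetical call log.
calls = [
    {"technician": "alice", "closed": True, "transferred_to": None},
    {"technician": "alice", "closed": True, "transferred_to": None},
    {"technician": "alice", "closed": False, "transferred_to": "network"},
    {"technician": "bob", "closed": True, "transferred_to": None},
    {"technician": "bob", "closed": False, "transferred_to": "network"},
    {"technician": "bob", "closed": False, "transferred_to": "database"},
]
print(call_share(calls))
print(transfer_destinations(calls))
```

A destination that receives most of the transfers is a natural candidate for targeted first-level training or new knowledge base articles.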
Phase 3: Optimize
Establish a well-defined process for continual improvement of first call resolution rate.
This technique not only helps you improve the FCRR levels, but also helps ensure that tickets are properly resolved, not just closed.
Another possible trend is a constantly degrading FCRR, as shown in the following graph.
Constantly degrading FCRR
There are a few reasons this could occur, but the primary reasons are as follows:
- Lack of requester and system information.
- Poor technician capabilities.
- Poor knowledge transfer and sharing.
According to MetricNet's benchmarking levels, the average net FCRR for service desks globally is 74 percent, with a range of 41 to 94 percent. The most common factors among the service desks on the higher end of the spectrum were the presence of highly trained agents, the availability of knowledge management tools, and the presence of tools such as remote desktop management.
FCRR can be improved with the following tips:
- Communicate the importance of FCRR to the technicians.
- Design training programs for the first level technicians on specific subjects to help resolve tickets faster.
- Maintain a knowledge base of advanced technical solutions and articles exclusively for, and limited to, technicians.
- Create custom forms to collect all relevant information at the time of ticket creation to avoid turnaround delays.
- Automatically route tickets to the right technician or group based on ticket parameters.
6. SLA compliance rate
Percentage of incidents resolved within the agreed SLA time.
Maintain maximum SLA compliance rate.
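As an illustration of the definition above, the sketch below computes the compliance rate from opened and resolved timestamps. It assumes a single resolution-time SLA applies to every incident, which is a simplification since real SLAs usually vary by priority; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def sla_compliance_rate(incidents, sla):
    """Percentage of (opened, resolved) pairs resolved within the SLA."""
    if not incidents:
        raise ValueError("no incidents in this time frame")
    within = sum(1 for opened, resolved in incidents if resolved - opened <= sla)
    return 100 * within / len(incidents)


# Hypothetical incidents, all opened at the same time for simplicity.
start = datetime(2024, 1, 8, 9, 0)
incidents = [
    (start, start + timedelta(hours=1)),
    (start, start + timedelta(hours=3, minutes=30)),
    (start, start + timedelta(hours=4)),  # exactly on the limit counts as met
    (start, start + timedelta(hours=6)),  # SLA violation
]
rate = sla_compliance_rate(incidents, timedelta(hours=4))
print(f"SLA compliance rate: {rate:.1f}%")
```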
Tracking SLA compliance levels helps IT help desks:
- Ascertain that the service levels are realistic and attainable.
- Check the performance of the IT help desk against the service levels agreed with the end user.
- Identify areas of improvement, strengths, and weaknesses of the IT help desk.
Sometimes IT help desk technicians close tickets without proper resolutions, just to avoid SLA violations. When this happens, SLA compliance rates remain high, but end-user satisfaction levels are bound to decrease, as shown in the following graph.
SLA compliance rate Vs. End user satisfaction
SLA compliance levels may drop for other reasons, though, so it is important to keep the following possibilities in mind:
- Your team may not understand the business requirements, which can lead to service level agreements that don't fulfill the business needs, or improper categorization and prioritization of tickets leading to SLA violations.
- There is often a lack of proper communication on the risks of outages affecting mission-critical services and their business impacts.
In such scenarios, IT service desk teams must understand the requirements of the business, and redefine their SLAs as appropriate.
Case study: When meeting SLAs doesn't help
SLAs and SLA compliance are critical to ensuring business continuity. This case from a cement manufacturing company, however, stresses that SLAs must also be set carefully. The IT help desk was unavailable for immediate response to an issue
on a truck dispatch, but did resolve it within the SLA. Unfortunately, the cement manufactured had to be dispatched to the client location within one hour to avoid hardening.
The IT help desk was unaware of this, and SLAs were set without considering these factors. As a result, though the ticket was resolved within the SLA, the cement had already hardened, which affected the business.
Decreasing trend in SLA compliance rate
Another alarming trend to keep an eye out for is a constantly degrading SLA compliance rate.
This falling trend could be due to any of the following:
- Unrealistic service level agreements.
- Lack of awareness of the SLAs and the risks of SLA violations.
- Absence of proper monitoring and proactive escalation.
- Lack of technician expertise.
- Unassigned tickets, and delayed or faulty ticket assignments.
The SLA compliance rate can be kept at higher levels by:
- Setting realistic SLAs based on the business requirements and IT capabilities.
- Communicating the SLAs and risks of SLA violations to the business and technicians.
- Setting necessary escalation rules.
- Automating the process of routing and assigning tickets.
- Designing training programs for your technicians.
7. Cost per ticket
The total monthly operating expense of IT support, divided by the monthly ticket volume.
Maintain minimum levels of cost per ticket.
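The definition above reduces to a single division. The sketch below is only an illustration; the function name and the sample figures are hypothetical and are not benchmark values.

```python
def cost_per_ticket(monthly_operating_expense: float, monthly_tickets: int) -> float:
    """Total monthly IT support expense divided by monthly ticket volume."""
    if monthly_tickets <= 0:
        raise ValueError("ticket volume must be positive")
    return monthly_operating_expense / monthly_tickets


# Hypothetical month: $50,000 in operating expense, 2,500 tickets handled.
cost = cost_per_ticket(50_000, 2_500)
print(f"Cost per ticket: ${cost:.2f}")
```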
As per MetricNet, the following were the cost per ticket benchmarks for 2014.
Industry standard - Cost per ticket at a high density environment
Industry standard - Cost per ticket at a medium density environment
As seen in both cases, a service request usually costs more than an incident. This is because incidents typically take less time to resolve than service requests. So, the cost per ticket is heavily influenced by the mix of incidents and service requests.
IT support is considered a cost center in most organizations, and is usually the first to get budget cuts during a financial downturn. Therefore, IT support must remain efficient, even when IT spending is reduced. Cost per ticket is a key
service desk performance metric that helps IT support analyze its efficiency in handling tickets within a given budget. The goal is always to maintain an optimal level of cost per ticket.
However, it is important to keep in mind that a higher-than-average cost per ticket may not necessarily be a bad thing, and a lower-than-average cost per ticket may not always be good, as shown in the following graphs.
The scenario depicted in this graph may mean that the IT service desk team is compromising on service quality to reduce the cost per ticket, which often results in lower customer satisfaction levels.
Cost per ticket Vs End user satisfaction
The scenario depicted in the above graph shows where the increase in the cost per ticket is accompanied by an increase in the customer satisfaction levels. This may mean that the increasing cost per ticket has led to better service delivery,
justifying the extra cost.
One key factor in optimizing the cost per ticket is to enable quick resolution of tickets and reduce unnecessary escalations. Cost per ticket can be kept under control by following these pointers:
- Analyze service request patterns to plan ahead for purchase of assets and licenses, reducing the time taken to close service requests.
- Identify peaks and troughs to optimize resource management and technician workload.
- Properly categorize and prioritize tickets to reduce incorrect ticket assignments, helping provide quick resolutions.
- Create a robust knowledge base.
8. Software asset utilization rate
Percentage of software products and licenses in actual use by the business.
Maximize ROI (return on investments) on software investments.
With software license purchases taking up a major part of the IT
spending, it is important to track software utilization. Unfortunately, this is one of the least discussed service desk metrics. For easy management, the software can be categorized as follows:
- Category 1 - Software that needs the most attention (with the highest business implications, license cost, or compliance risks).
- Category 2 - Software that needs the least attention (free software such as Adobe Reader).
- Category 3 - Prohibited software and malware.
The following service desk metrics can be used to track software utilization:
Ratio of total used to total owned software
This metric helps identify any software purchase expenditure that does not provide any value to the organization. Ideally, this ratio should be close to one, meaning there is maximum utilization of all purchased software, thereby ensuring a maximum ROI on the software license purchase. A large number of Category 1 applications in the unused list means that a major portion of the software asset spending is sitting idle.
Ratio of unallocated licenses to total license count
This metric helps analyze the license utilization of a particular software application, helping IT teams plan ahead for license purchases. The ratio should be as small as possible for maximum ROI. A higher ratio could mean that some of the software applications are over-licensed, which could be an idle investment with no ROI.
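Both ratios can be sketched in a few lines. The example assumes you have an inventory of owned software titles, a list of titles actually in use, and license counts for a given application; the function names, titles, and numbers are hypothetical.

```python
def used_to_owned_ratio(used_titles: set, owned_titles: set) -> float:
    """Fraction of owned software titles that are actually in use."""
    if not owned_titles:
        raise ValueError("no owned software recorded")
    return len(used_titles & owned_titles) / len(owned_titles)


def unallocated_license_ratio(total_licenses: int, allocated: int) -> float:
    """Fraction of an application's licenses not allocated to anyone."""
    if total_licenses <= 0:
        raise ValueError("license count must be positive")
    if not 0 <= allocated <= total_licenses:
        raise ValueError("allocated licenses must be between 0 and the total")
    return (total_licenses - allocated) / total_licenses


# Hypothetical inventory.
owned = {"office", "visio", "project", "cad"}
used = {"office", "project", "cad"}
print(used_to_owned_ratio(used, owned))     # ideally close to 1
print(unallocated_license_ratio(200, 150))  # ideally close to 0
```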
Case study: Increasing software asset utilization saves a million dollars
A leading global pharmaceutical company saved about one million dollars in spending. The pharmaceutical company, with its services spread across 50+ countries, was using a diverse range of Microsoft products. At one particular office, there
were thousands of software applications licensed under a Microsoft volume licensing agreement, but there was no visibility or control of these software assets, initially. The purchase had been made without understanding the business requirements.
In fact, the company had limited information on the software assets and the number and type of assets the organization actually needed. This, again, put the organization at the risk of over-licensing, under-licensing, and compliance penalties.
The IT help desk started with a simple analysis by comparing the installed Microsoft software with the Microsoft licenses they held. The insight gained, and IT's efforts to understand the business requirements, led to a redesigned Microsoft
license purchase that involved stepping down from the Microsoft Office Professional edition to the cheaper standard edition, which met the business requirement.
Furthermore, several other volume licenses were replaced, leading to cost cuts that saved the company about one million dollars in software license purchases.
License compliance rate
Another important software asset management metric, and one that can cost the organization money if neglected, is the license compliance rate. Maintaining maximum compliance can save your organization from penalties and fines. The following are a few tips for achieving maximum compliance:
- Track all software installations and license purchases.
- Allocate licenses to individual software installations to find the over- and under-licensed software.
- Purchase the right license types for the software. For example, it is better to purchase a perpetual license for core software to avoid compliance issues due to license expiry.
- Conduct formal internal assessments for compliance and audit readiness.
Achieve maximum compliance with a three-step pre-audit
A 100 percent license compliance rate will no longer be a myth with this simple three-step pre-audit.
Step 1: Gap analysis
- Request a list of all software applications licensed to your organization from the specific vendor.
- Identify and pin down software that is in use by the business, but not on the list provided by the vendor.
Step 2: Compliance analysis
Check the total number of software installations vs. the total number of licenses purchased for every software application to identify over- and under-licensed software.
Step 3: Software license optimization
With all the insight gained from steps 1 and 2, redesign your software purchases to optimize compliance and attain a 100 percent license compliance rate.
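The first two steps of the pre-audit lend themselves to a small script. The sketch below is an illustration under stated assumptions: installation counts come from your own discovery data, license counts come from the vendor's list, and both are keyed by application name. The function name, the status labels, and the sample figures are hypothetical.

```python
def compliance_report(installs: dict, licenses: dict) -> dict:
    """Map each application to (license gap, status).

    A negative gap means more installations than licenses (under-licensed);
    a positive gap means unused licenses (over-licensed).
    """
    report = {}
    for app in sorted(set(installs) | set(licenses)):
        gap = licenses.get(app, 0) - installs.get(app, 0)
        if gap < 0:
            status = "under-licensed"
        elif gap > 0:
            status = "over-licensed"
        else:
            status = "compliant"
        report[app] = (gap, status)
    return report


# Hypothetical data: "diagram-tool" is installed but absent from the vendor list.
installs = {"office-suite": 120, "db-server": 8, "diagram-tool": 15}
licenses = {"office-suite": 150, "db-server": 8}

for app, (gap, status) in compliance_report(installs, licenses).items():
    print(f"{app}: {status} (gap {gap:+d})")
```

Applications that appear in the installation data but not in the vendor's list (step 1) show up here as under-licensed with a gap equal to their full installation count.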
These 8 KPIs, with respective metrics, will help you establish a measurement engine to constantly measure and continuously improve your service desk performance. The first step in establishing this measurement engine is to understand the business
that the IT help desk is supporting, and align the IT help desk objectives to the business objectives. The next step is to identify the KPIs and metrics that are critical to these help desk objectives, and constantly measure them.
The 8 service desk KPIs discussed here are critical to the three basic IT help desk objectives of ensuring business continuity, making the organization productive, and delivering services within budgets and on time, which underlines the fact
that these 8 KPIs are the ones that your IT help desk should care most about.