Developing Information Quality Metrics
Published in DM Review, May 2005
Currently in vogue is the ability
to summarize an organization's "business productivity" to senior
managers using pithy representations that are expected to carry deep meaning
and, at the same time, reduce the attention required to absorb that meaning.
Business productivity management systems engage key performance indicators
whose values are posted to executive dashboards for the CEO's periodic
(be it daily or hourly) review. The intention of these applications is
to provide a presentation of the current state of the environment in the
context of reasonable expectations. In other words, a business manager
wants to have an overview of the "value creation" of the entire
system, much the same way a nuclear engineer gauges different metrics
associated with the safety status of a nuclear reactor.
In most areas of a business,
the metrics that back up the key performance indicators may be relatively
straightforward. For example, in a shoe factory, one might gauge the number
of shoes coming off the production line, the rate at which shoes are being
produced, the number of flawed shoes coming off the line or the number
of accidents that occur each day. Each of these metrics may be represented
using various visual cues, each of which provides a warning when the performance
indicator reaches some critical level.
When it comes to the world
of information quality, however, the analogy seems to break down, mostly
because there is a disconnect between what is apparently measurable and
what the value of that measurement means. For example, one may count the
number of times a value is missing from a specific column in a specific
table, but in the absence of any business context, it is not clear how
those missing values affect the business, or if they even affect the business
at all.
Yet we all know that poor data
quality does affect the business. Thus, there should be some kind of performance
indicator that can capture and summarize the relationship between data
that does not meet one's expectations and the organizational bottom line.
The challenge, then, is to devise a strategy for identifying and managing
"business-relevant" information quality metrics.
What Makes a Good Metric?
More challenging, however, is that the individuals typically tasked with
devising good information quality metrics are better trained at data analysis
and less skilled in business performance monitoring. Therefore, part of
this strategy is to understand the characteristics of a reasonable business
performance metric and then explore how to map those characteristics to
the measurable aspects of data quality. The following list of characteristics,
which is by no means complete, should give some guidance as to how to
jump-start the strategy:
- Clarity of definition
- Measurability
- Business relevance
- Controllability
- Representation
- Reportability
- Trackability
- Drill-down capability
Clarity of Definition
Because the metric is intended to convey a particular piece of information
regarding an aspect of business performance in a summarized manner, it
is critical that its underlying definition be stated in a way that clearly
explains what is being measured. In fact, each metric should be subject
to a rigorous "standardization" process in which the key stakeholders
participate in its definition and agree to the definition's final wording.
In addition, it is advisable to provide the metric's value range, as well
as a qualitative segmentation of the value range that relates the metric's
score to its performance assessment.
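As a minimal sketch of what such a definition might look like in practice, the following records a metric's agreed wording, its value range and a qualitative segmentation of that range. The class name, the metric name and the band thresholds are all hypothetical choices made for illustration; real thresholds would come out of the stakeholders' standardization process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    description: str   # the wording the stakeholders agreed on
    low: float         # lowest possible score
    high: float        # highest possible score
    bands: tuple       # (upper_bound, label) pairs in ascending order

    def assess(self, score: float) -> str:
        """Map a score to the qualitative label of the band it falls in."""
        if not self.low <= score <= self.high:
            raise ValueError(f"{score} is outside [{self.low}, {self.high}]")
        for upper_bound, label in self.bands:
            if score <= upper_bound:
                return label
        return self.bands[-1][1]


# Hypothetical metric and thresholds, for illustration only.
address_completeness = MetricDefinition(
    name="shipping_address_completeness",
    description="Percentage of shipping address records with no missing fields",
    low=0.0,
    high=100.0,
    bands=((90.0, "unacceptable"), (98.0, "needs attention"), (100.0, "acceptable")),
)

print(address_completeness.assess(95.0))
```

Keeping the segmentation next to the definition means a reader of the score never has to guess whether a given value is good or bad.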
Measurability
Any metric must be measurable and should be quantifiable within a discrete
range. Note, however, that there are many things that can be measured
that may not translate into useful metrics, and that implies the need
for business relevance.
Business Relevance
The metric is of no value if it cannot be related to some aspect of business
operations or performance. Therefore, every desirable metric must be defined
within a business context with an explanation of how the metric score
correlates with a measurement of performance. Better still is a metric whose
measurement can be directly associated with a critical business impact;
this is probably the most important characteristic of a data quality metric.
Controllability
Any measurable characteristic of information that is suitable as a metric
should reflect some controllable aspect of the business. In other words,
the assessment of an information quality metric's value within an undesirable
range should trigger some action to improve the data being measured.
Representation
Without digressing into a discussion about the plethora of visual "widgets"
that can be used to represent a metric's value, it is reasonable to note
that one should associate with each metric a visual representation that
logically presents the metric's value in a concise and meaningful way.
Reportability
From a different point of view, each metric's definition should provide
enough information that can be summarized as a line item in a comprehensive
report. The difference between representation and reportability is that
the representation will focus on the specific metric in isolation, while
the reporting should show each metric's contribution to an aggregate assessment.
In turn, this allows the manager to evaluate the priority of any issues
needing resolution.
Trackability
A major benefit of metrics is the ability to measure performance improvement
over time. Tracking performance over time not only validates any improvement
efforts, but once an information process is presumed to be stable, tracking
provides insight into maintaining statistical control. In turn, these
kinds of metrics can evolve from performance indicators into standard
monitors, placed in the background to notify the right individuals when
the data quality measurements suddenly indicate a deviation from expected
control bounds.
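One common way to derive control bounds, assumed here for illustration and not prescribed by this article, is to take the mean of a stable baseline period plus or minus three standard deviations. The function names and the sample percentages below are hypothetical.

```python
from statistics import mean, stdev


def control_bounds(baseline, sigmas=3.0):
    """Return (lower, upper) bounds: baseline mean +/- sigmas * sample std dev."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(value, bounds):
    """True when a new measurement falls outside the control bounds."""
    lower, upper = bounds
    return value < lower or value > upper


# Hypothetical daily percentages of records with a missing value,
# taken from a period when the process was presumed stable.
baseline_missing_rates = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9]
bounds = control_bounds(baseline_missing_rates)

for todays_rate in (2.1, 3.5):
    if out_of_control(todays_rate, bounds):
        print(f"{todays_rate}% is outside the control bounds; notify the data steward")
    else:
        print(f"{todays_rate}% is within the control bounds")
```

A background monitor of this kind only needs to run the check after each measurement and notify someone when it fails.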
Drill-Down Capability
Because the representation of a data quality metric is a summary, the flip
side is the ability to expose the
underlying data that contributed to a particular metric score. The natural
instinct, when reviewing data quality measurements, is to review the data
instances that contributed to any low scores. The ability to drill down
through the performance metric allows an analyst to get a better understanding
of patterns (if any exist) that may have contributed to a low score, and
consequently use that understanding for a more comprehensive root-cause
analysis. This kind of insight allows your organization to isolate the
processing stage at which any flaws are introduced and, in turn, enables
you to eliminate the source of the introduction of data problems (instead
of the typical, counterproductive reaction of correcting the data values
themselves).
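A drill-down can be sketched by having the scoring routine return the failing records alongside the score, so that they can be grouped by whatever attribute might reveal a pattern. The record layout, the `source` field and the validity rule below are hypothetical.

```python
from collections import Counter


def score_with_drill_down(records, is_valid):
    """Return (score, failures): percent of valid records and the failing ones."""
    if not records:
        raise ValueError("cannot score an empty data set")
    failures = [record for record in records if not is_valid(record)]
    score = 100.0 * (len(records) - len(failures)) / len(records)
    return score, failures


def has_postal_code(record):
    """Hypothetical rule: a postal code must be present and non-blank."""
    value = record.get("postal_code")
    return value is not None and value.strip() != ""


# Hypothetical records; "source" names the process that created each one.
records = [
    {"id": 1, "source": "web", "postal_code": "10001"},
    {"id": 2, "source": "call_center", "postal_code": ""},
    {"id": 3, "source": "call_center", "postal_code": None},
    {"id": 4, "source": "web", "postal_code": "94105"},
]

score, failures = score_with_drill_down(records, has_postal_code)
failures_by_source = Counter(record["source"] for record in failures)

print(score)
print(failures_by_source.most_common())
```

In this made-up data every failure comes from one source, which is the kind of pattern that points root-cause analysis at a processing stage, not at individual values.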
Measurements of Data Quality
The conventional wisdom for measuring data quality relies on quantifying
how data sets relate to "dimensions of data quality." Those
dimensions, including (among others) accuracy, completeness, consistency,
timeliness and currency, are useful for discussing ways that data values
exhibit quality within their information context. Some of these measurements
are relatively easy to capture, such as data value completeness (which,
for data in a structured relational database, is trivially done using
simple SQL queries), while others might require more dedicated resources,
such as the manual review necessary to determine accuracy.
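To illustrate how simple the completeness measurement can be, here is a sketch that uses Python's built-in `sqlite3` module with an in-memory database. The `customer` table and `postal_code` column are hypothetical, and treating blank strings as missing is an assumption that a real definition would have to settle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, postal_code TEXT)")
conn.executemany(
    "INSERT INTO customer (id, postal_code) VALUES (?, ?)",
    [(1, "10001"), (2, None), (3, ""), (4, "94105"), (5, "60601")],
)

# Count all rows and the rows whose postal code is NULL or blank.
total, missing = conn.execute(
    """
    SELECT COUNT(*),
           SUM(CASE WHEN postal_code IS NULL OR TRIM(postal_code) = ''
                    THEN 1 ELSE 0 END)
    FROM customer
    """
).fetchone()

completeness = 100.0 * (total - missing) / total
print(f"{missing} of {total} postal codes missing; completeness = {completeness:.1f}%")
conn.close()
```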
Unfortunately, it is easy to
confuse a measurement with a metric. Generating a count of the number of
missing data values is easy, but does it truly reflect the characteristics
of a data quality metric? In fact, outside of the business context, one
may not be able to answer that question. However, by identifying and placing
that measurement within the appropriate business context, one may evolve
a measurement into a metric. To do this, one must identify the "relevance"
of any measurement in terms of the business impact associated with what
is being measured.
Finding Business Relevance
We can divide the set of all of your organization's information flaws
into two groups - those that impact the achievement of the business's operational
and strategic goals, and those that do not. For all intents and purposes,
we can ignore those flaws that do not have any impact. Associating a specific
information flaw with a specific business impact may be hard work, but it
is not an impossible task. First, identify the areas of business impact.
Then, for each perceived data quality problem, break the process down
into these subtasks:
1) Review how that data flaw relates to each area of impact.
2) Determine the frequency with which impact is incurred.
3) Sum up the measurable costs associated with each impact incurred by the data quality problem.
4) Assign an average cost to each occurrence of the problem.
For example, let's presume that 10 percent of a company's shipping addresses
have some problem and that 20 percent of the time that an item with a
flawed shipping address is shipped, it is returned - incurring an additional
cost of $10.00. Assuming shipments are spread evenly across addresses, this
means that two percent of shipments incur an additional $10.00 cost.
Therefore, if the company ships 1,000 packages per day, the data quality flaw
costs $200.00 per day (20 returned packages at $10.00 each). If the
company manages 100,000 shipping address records, then 10,000 of them
are flawed, and ultimately, the average cost per day of each occurrence
of a flawed address is $0.02 ($200.00 divided by 10,000).
This is relatively simplistic,
and this example is contrived to demonstrate an approach to attaching
business value to each instance of a problem. In turn, this evaluation
allows us to turn a measurement (of a flawed address) into a metric because
the number of occurrences of the flaw is directly associated with business
relevance.
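The shipping address arithmetic can be written out as a short calculation. The figures are the ones used in the example; the variable names are illustrative, and the calculation assumes that shipments are spread evenly across address records.

```python
# Figures from the shipping address example.
flawed_address_rate = 0.10        # share of address records with a problem
return_rate_when_flawed = 0.20    # share of flawed shipments that are returned
cost_per_return = 10.00           # additional cost of each returned package
packages_per_day = 1_000
address_records = 100_000

# Assumption: shipments are spread evenly across address records.
returns_per_day = packages_per_day * flawed_address_rate * return_rate_when_flawed
daily_cost = returns_per_day * cost_per_return

flawed_records = address_records * flawed_address_rate
cost_per_flawed_record_per_day = daily_cost / flawed_records

print(f"Returned packages per day: {returns_per_day:.0f}")
print(f"Daily cost of the flaw: ${daily_cost:.2f}")
print(f"Daily cost per flawed address: ${cost_per_flawed_record_per_day:.2f}")
```

Once the per-occurrence cost is known, multiplying it by a fresh count of flawed addresses turns any later measurement into a cost figure.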
There are four general areas of business impact that can be associated
with data quality problems:
1) Productivity
2) Profit
3) Risk
4) Intangibles
Productivity
Productivity can be assessed in terms of physical production (i.e., the
number of usable components coming off the production line) or in terms
of individual production (i.e., how much time someone is spending on a
task). While physical productivity is easy to measure, personal productivity
is less so. Yet, the frequently referenced data quality costs incurred
due to "scrap and rework" are most often attributable to individual
productivity, in which we accumulate the number of hours that a person
spends identifying that a problem exists, tracking down its source, rewinding
any tasks that were performed using the incorrect data, and re-running
the processing. The cost of that problem instance is stated in terms of
that person's fully loaded cost per hour multiplied by the number of hours
spent addressing the problem.
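As a small sketch of that calculation, the hours spent on each rework activity are summed and multiplied by the person's fully loaded hourly cost. The function name, the hours and the rate are hypothetical.

```python
def rework_cost(hours_by_activity, fully_loaded_rate):
    """Cost of one problem instance: hourly rate times total hours spent."""
    return fully_loaded_rate * sum(hours_by_activity.values())


# Hypothetical hours one analyst spent on a single data quality incident.
incident_hours = {
    "identify the problem": 1.5,
    "track down the source": 3.0,
    "rewind affected tasks": 2.0,
    "re-run the processing": 1.5,
}

cost = rework_cost(incident_hours, fully_loaded_rate=85.0)
print(f"Cost of this incident: ${cost:.2f}")
```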
Profit
Simply, an organization's profit is based on how much money it takes in
(revenue) minus the amount of money it spends (expenses). A data quality
problem can be associated with increased costs as well as decreased revenue.
In addition to any increased costs associated with reduced productivity,
a problem's impacts may have ramifications further down the knowledge
chain. For example, when one organization released a report with inconsistencies,
there were additional costs associated with recalling and destroying the
distributed (hard-copy) documents, and producing and distributing a corrected
version. Events such as these are usually tracked within a company; therefore,
it should be relatively easy to accumulate costs and statistics and, again,
isolate an average cost to each data quality incident.
The more insidious impacts
are ones that result in lost revenue. For example, component pricing information
on the supplier side is likely to be integrated into a final product's
pricing strategy; inaccurate data on the supplier side may result in underpricing
the final product, which in turn reduces the margin for each product sold.
The cost is calculated as the sum, over the products sold, of the difference
between the margin that would have been earned had the data quality flaw not
existed and the margin actually earned.
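One way to read that calculation is as a per-product margin shortfall multiplied by the units sold, then summed. The products, margins and unit counts below are hypothetical.

```python
def lost_margin(sales):
    """Sum of (expected margin - actual margin) * units over all products."""
    return sum(
        (sale["expected_margin"] - sale["actual_margin"]) * sale["units"]
        for sale in sales
    )


# Hypothetical products underpriced because of inaccurate supplier data.
# expected_margin is the per-unit margin had the component prices been correct.
sales = [
    {"product": "A", "units": 1200, "expected_margin": 4.50, "actual_margin": 3.75},
    {"product": "B", "units": 300, "expected_margin": 12.00, "actual_margin": 11.50},
]

total_lost = lost_margin(sales)
print(f"Margin lost to the pricing flaw: ${total_lost:.2f}")
```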
Another example is the concept of the "lost customer." There
are two kinds of lost customers: parties within your information domain
who are understood not to be current customers, although more accurate
analysis of the data would indicate that they are, and parties who
are understood to be current customers, but in fact have been subject
to attrition. Both of these kinds of lost customers incur profit impact,
either through increased marketing costs for current customers or decreased
sales to ex-customers.
Risk
There are many different forms of risk, and each can be used as the business
basis for a data quality metric. Regulatory risks are associated with
noncompliance with legal imperatives, such as statutes, laws, regulations,
etc. Investment risks are associated with increasing the value of your
assets. Development risks are associated with capital investment in the
development of systems intended to improve the business operations. One
might even incorporate credit risk into the stable of risks related to
poor data quality.
While it is difficult to assign
a "precise" value to each of these risks, there are algorithms
for assigning some value to each. Even in the absence of a precise measurement,
there are strata of quantifications (e.g., high, medium, low) that can
be measured and represented.
Intangibles
Poor data quality can impact organizations in intangible ways as well.
Examples include customer satisfaction, public relations and goodwill.
Each of these areas can be measured, and in cases where bad data clearly
affects that measurement, one can configure a corresponding metric.
Direct and Subsidiary Metrics
We have looked at some areas of business relevance and how to relate the
measurement of the number of data flaws to those areas - we can refer
to these as direct metrics. More interestingly, there are subsidiary (and
possibly, more useful) metrics that can be created from this process as
well. For example, we looked at how to assign an average cost to each
occurrence of an unexpected data value, and we can provide periodic reports
and ongoing tracking of that metric. The subsidiary metric is to review
how data quality improvement reduces that average cost per occurrence
over time. As a different example, we have metrics that we track over
time, and we can see how improvement is made over time. The subsidiary
metric reports the rate at which improvement is made. Both of these examples
are used to provide insight into how effective the improvement program
is overall.
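The second kind of subsidiary metric, the rate of improvement, can be sketched as the period-over-period relative reduction in a tracked cost. The function name and the monthly figures are hypothetical.

```python
def improvement_rates(series):
    """Relative reduction from each period to the next (positive = improvement)."""
    return [(prev - curr) / prev for prev, curr in zip(series, series[1:])]


# Hypothetical daily cost of a data quality flaw, measured once per month.
monthly_daily_cost = [200.0, 160.0, 120.0, 108.0]

rates = improvement_rates(monthly_daily_cost)
for month, rate in enumerate(rates, start=2):
    print(f"Month {month}: cost reduced by {rate:.0%} versus the previous month")
```

A slowing rate, as in the last period of this made-up series, is itself a signal worth reporting about the improvement program.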
The Data Quality Dashboard
Lastly, consider how these metrics are reported and presented
to the business partner. This presentation, which we
can accumulate into a "dashboard," would provide a visual representation
conveying the business relevance of each metric, as well as access
to its definition. More importantly, the dashboard provides access to
two further capabilities: trackability and drill-down.
The tracking component would
provide a visual graph over the periods that the metric is measured and
allow the knowledge-worker the opportunity to review the details of any
specific period's measurements. The drill-down capability would allow
the analyst to access the data underlying the metric and review those
data instances contributing to the measurement, which enables more comprehensive
review as well as root cause analysis.
The Challenge
Developing key performance indicators for information quality is clearly
a challenge, mostly because the hard numbers presented by data quality
tools are typically out of the business context. Here we have provided
some insight into how the data analyst can work with the business customer
to identify ways that poor data quality impacts the achievement of business
objectives and subsequently determine hard costs associated with each
occurrence of a flaw. Once this has been done, providing a dashboard with
tracking and drill-down capabilities establishes a value-added approach
for value-directed information quality management and improvement.