Customer Data Integration, Linkage Precision and Match Accuracy
Published in DM Review, October 2004 (co-authored with Ed Allburn, Data Delta)

Competitive advantage may be gained by evaluating and improving match accuracy, which improves the effectiveness of all BI technologies that rely on the data.

As customer relationship management
(CRM), personalization, data mining, one-to-one relationship marketing/database
marketing and customer loyalty programs are becoming de rigueur at many
large (and some not so large) organizations, billions of dollars are being
invested in sophisticated customer data integration technology as a means
to total customer data integration (CDI). The underlying technology for
CDI evolved out of the data quality tools space, particularly from the
concepts of record linkage and matching.

Record matching is a sophisticated process referred to by a variety of different terms such as merge/purge, de-duping, householding, building a 360-degree single customer view, creating a marketing customer information file (MCIF) and others. Regardless of the term used, all perform a similar process of identifying and linking related records by parsing name, address and other text fields into separate components and then using advanced approximate string matching algorithms and sophisticated similarity scoring to compare sets of these components and identify pairs that are similar enough to isolate as referring to the same entity.

There has been great success in deploying record linkage for the purpose of customer data integration. However, a key aspect of this process is often glossed over and ignored: the issue of linkage precision and record match accuracy. Linkage precision guides how well a set of record linkage applications are tuned.

Consider this simple mechanism for tuning: match/not-a-match thresholds. As part of the matching process, two records are compared across multiple fields, and the similarity of the two records is evaluated as a function of the application of a set of business rules and corresponding weights associated with each field, resulting in the assignment of a similarity score. If that score is greater than the match threshold, then the pair is deemed a match. If the score is less than the not-a-match threshold, it is reported that the pair does not match. When the score falls between the two thresholds, the pair is shunted to a separate repository for subsequent manual review.

Match accuracy is a measure of how well the assorted thresholds, business rules and weights are set to provide the most accurate match. When match accuracy is high, the results are excellent: better CDI, more aggressive personalization, reduced costs associated with customer interaction, etc.
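The two-threshold mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, weights and threshold values are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for whatever approximate string matching a commercial engine actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical weights and thresholds, chosen only for illustration.
WEIGHTS = {"first_name": 0.2, "last_name": 0.4, "address": 0.3, "zip": 0.1}
MATCH_THRESHOLD = 0.85
NOT_A_MATCH_THRESHOLD = 0.60


def field_similarity(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1] (a stand-in for a vendor algorithm)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def similarity_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted average of the per-field similarities."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(w * field_similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())
    return weighted / total_weight


def classify_score(score: float) -> str:
    """Apply the match / not-a-match thresholds to a similarity score."""
    if score > MATCH_THRESHOLD:
        return "match"
    if score < NOT_A_MATCH_THRESHOLD:
        return "not-a-match"
    return "manual-review"  # falls between the two thresholds


def classify(rec_a: dict, rec_b: dict) -> str:
    return classify_score(similarity_score(rec_a, rec_b))


# Example records; the address and ZIP values are made up for this sketch.
a = {"first_name": "Michael", "last_name": "Jablowski", "address": "12 Elm St", "zip": "20901"}
b = {"first_name": "Mike", "last_name": "Jadlowsky", "address": "12 Elm Street", "zip": "20901"}
print(round(similarity_score(a, b), 3), classify(a, b))
```

Raising a field's weight or moving either threshold changes which pairs land in each of the three outcomes, which is exactly the tuning that match accuracy measures.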
On the other hand, low match accuracy is likely to give the impression of much poorer customer relationship management, resulting in duplicate mailings, mixed-up credit profiles and repeated attempts at direct marketing, among other less heinous crimes. More seriously, businesses increasingly face major risks when linking records for applications such as health records and financial management, especially in the context of HIPAA privacy requirements, Sarbanes-Oxley compliance, the Anti-Kickback Statute and other regulatory constraints. As more businesses and more applications rely upon a single customer view, it becomes increasingly important to ensure that this single view is accurate.

Today's CDI systems have evolved into highly sophisticated applications incorporating leading-edge research and development advances in fields such as information theory, natural language processing, artificial intelligence and others. One major advancement has been the recognition of users' needs to be able to fine-tune the matching and householding behavior to create a single customer view that more directly fits with the business needs. CDI vendors no longer assume that they can dictate to businesses what the "correct" single customer view is. As businesses have become increasingly sophisticated with business intelligence (BI), CRM and one-to-one systems, they have demanded control of their customer definition. This is typically effected via business rules that control how the single customer view is resolved by the CDI system. In general, a business rule is anything that controls or changes the CDI application's function, such as:

* Parsing, standardizing and matching program parameters.

In recent years, CDI vendors
have started competing on who has the most business rules, basically arguing
that more business rules are better. Many vendors now claim to have more
than 100,000 business rules, and one vendor at a major industry conference
bragged that a large customer added more than 50,000 custom business rules
(thus emphasizing how flexible their system was). However, CDI matching
and householding accuracy requires precise refinement of all these business
rules; otherwise it generates a less-than-effective data warehouse.

It is interesting to note that until recently, the notion of applying data quality technology for the purposes of CDI was considered to be a leading-edge application of technology. Today, it would be unusual for an organization not to be doing this. Years ago, businesses could gain a major competitive advantage by implementing basic data quality and BI technology. However, today this technology is no longer an optional luxury, but instead is a fundamental requirement just to be on a level playing field.

For example, a company might cancel a promotional campaign because too many consumers such as "Michael Jablowski" did not respond. However, more accurate record matching might reveal that "Mike Jadlowsky" is in fact the same person, and Mike Jadlowsky did respond (or worse, was already a customer, thus indicating wasted marketing).

It is very likely that your own company has already made major investments in record matching, and it is equally likely that all of your major competitors have also made similar investments. However, if this is true, and everyone is doing the same thing, then an enlightened manager should be looking for second-order opportunities for additional competitive advantages. One idea with potential is the evaluation and improvement of match accuracy, which, in turn, will deliver an ongoing competitive advantage by improving the accuracy and effectiveness of all the BI technologies that rely on the data.

The opportunity to gain a fresh
competitive advantage in this area is very compelling because although
most companies already have similar technology, the odds are that their
technology is significantly underperforming. In fact, an individual's
chances for winning the Powerball lottery are greater than having one's
current matching system's complex business rules fully optimized to deliver
maximum match accuracy.

More than 100 vendors offer record matching systems, many of which have evolved into highly sophisticated technologies that are often composed of more than 100,000 "business rules" that control their exact behavior. In theory, information workers can precisely fine-tune these business rules to improve match accuracy. In reality, very little, if any, significant time is spent attempting to do so. Instead, because of the overwhelming complexity of attempting to make changes to all these close-knit, highly interdependent rules, many people just rely on the business rules and settings right out of the box, or on some organizational adaptations based on recommendations from minimal vendor consulting.

As an example of a way to adjust similarity scoring, most matching systems provide a way to modify the weighting assigned to parsed components during matching, using qualifiers such as disabled, very low, low, medium, high, very high or required. These systems often parse name and address fields into multiple components, as well as other fields often used in matching such as phone, social security number, e-mail, city, state and ZIP code (often 12 to 20 total components). The potential for modifying settings for even a reasonable number of fields yields a staggering number of possible combinations. For instance, when matching records by scoring 12 fields with seven different weighting settings, there are 7^12, or nearly 14 billion, possible combinations. Bump that up to 20 fields, and you have 7^20, or almost 80 quadrillion, combinations! In comparison, the odds of someone winning the Powerball lottery are one in 121 million.

In addition to the issue of
complexity, the reality is that project time for fine-tuning is typically
scheduled toward the end of the project. Yet when projects run over schedule
and over budget, a common target for project elimination is the step for
fine-tuning the matching business rules. The ramification of this is that
most businesses are using these expensive, sophisticated matching engines
with little or no changes to their default settings.

One of the most insidious aspects of match accuracy is that its responsibility often falls through the cracks of the company organizational chart, with responsibility typically defaulting to IT staff to fine-tune the business rules. However, the exact goal of the matching behavior can often be a moving target or, equally often, a target that has either conflicting definitions or no definition at all. This creates a very strong temptation for the IT staff to simply accept the default business rules with little, if any, attempt to truly refine them.

More accurate matching results are achieved when business users actively collaborate with the IT staff to analyze and refine the business rules. Business users are the ones with the critical information about the intended business use of the data, which then drives the decisions on the matching business rules. For example, a newspaper company may place much higher priority on postal address matching criteria and may not want any records that have different addresses to be matched. A bank, on the other hand, may place a higher priority on the individual, regardless of how many different addresses their records may span (such as home and work addresses). Fraud detection applications may utilize even looser match rules to find all possible relationships between records.

The bottom line is that regardless of what business rules the IT staff defines, those rules will be wrong if they are defined in a vacuum without business user involvement. It follows that any initiative to find and fix matching errors must be driven by the business users (and executive sponsorship helps).

One interesting (one might even say, "procedurally fractal") aspect of linkage precision is that the quality of match accuracy can be analyzed and improved the same way other aspects of information quality are treated.
We can take the same steps in evaluating the quality of the application's matching rules as a baseline measure and then identify potential areas for improvement. One way to start this process is to ask some questions about how your record matching software is used, including:

* How many business rules are in your current matching process?

Another key step is to review your current tools and techniques for fine-tuning the matching business rules. Surprisingly, many teams still use the same tools and techniques that were being used more than 20 years ago. For example:

* A small sample of a few thousand records is selected, subjected to the parsing, standardizing and matching steps, and used as the basis for developing and evaluating rules.

Clearly, basing the rules used to aggregate large sets of disparate customer information on a small selected sample may not be the most effective way to develop business rules. To address this issue, automated tools are now being developed that can be used to adjust business rule settings, run the record matching applications and then automatically evaluate the results, providing reports that can be scanned to assess the differences between sets of business rules and corresponding thresholds and similarity scoring.

Finally, it is not uncommon to uncover conflicting requirements during this stage that may warrant creating multiple customer views instead of trying to force a single customer view on the entire organization. Therefore, a key step in match accuracy assessment and improvement is to ensure that business clients closely collaborate with IT staff to clarify matching requirements.
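As a rough illustration of the adjust, run and evaluate loop described above, the sketch below scores several candidate threshold settings against a small set of scored pairs whose true match status is already known. Everything here is assumed for the example: the `LabeledPair` type, the sample scores and labels, and the candidate settings are hypothetical, and a real tool would obtain the scores by rerunning the matching engine under each set of business rules.

```python
from dataclasses import dataclass


@dataclass
class LabeledPair:
    score: float    # similarity score produced by the matching engine
    is_match: bool  # true status, e.g. from manual review (hypothetical labels)


def evaluate(pairs, match_threshold, not_a_match_threshold):
    """Summarize how one threshold setting performs on labeled pairs."""
    tp = fp = fn = review = 0
    for p in pairs:
        if p.score > match_threshold:          # automatically linked
            if p.is_match:
                tp += 1
            else:
                fp += 1
        elif p.score < not_a_match_threshold:  # automatically rejected
            if p.is_match:
                fn += 1
        else:                                  # sent to manual review
            review += 1
    true_matches = sum(1 for p in pairs if p.is_match)
    return {
        # Share of automatic links that are correct.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of all true matches that were linked automatically.
        "recall": tp / true_matches if true_matches else 0.0,
        "missed": fn,
        "review": review,
    }


# Made-up sample data for illustration only.
sample = [
    LabeledPair(0.95, True), LabeledPair(0.90, True), LabeledPair(0.88, False),
    LabeledPair(0.70, True), LabeledPair(0.65, False),
    LabeledPair(0.40, False), LabeledPair(0.30, True),
]

# Compare a few candidate settings side by side.
for match_t, non_match_t in [(0.85, 0.60), (0.89, 0.60), (0.85, 0.35)]:
    print(match_t, non_match_t, evaluate(sample, match_t, non_match_t))
```

A report of this kind makes the trade-off visible: tightening the match threshold tends to raise precision while pushing more pairs into manual review, and which balance is right depends on the business use of the data.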