Who's In There?

Published in www.businessintelligence.com, March 2004

We don't store objects or entities inside a database; we store some representation of real-world entities, hence the concept of a data model. In other words, a database table maintains some textual (or perhaps somewhat enhanced) representation of something that exists in the real world. And in fact, the way we represent the real-world object often both describes the object and defines it. What I mean by this is that each unique object in the real world that is represented in the database must be distinguishable from every other object. This is how people operate in real life: we use attribution to identify objects.

For example, a cable television subscriber might be represented in a table that has a set of name fields (so we can recognize the customer by his or her name), a telephone number (one way of contacting them), and some address data (so we can differentiate by where they live). These values are attempts at describing the customer, although together they may be perceived as defining each one as well. This may provide some insight into the reasons that early database tables contain many attributes, with significant denormalization. These are relics of the operations approach to data management, where organizations captured data as part of how businesses are run.

To think about names and addresses in a slightly different way, consider that an address or a telephone number is a description of a means of contact, and a name is just one (of possibly many) labels that individuals choose to use in reference to themselves or others. In other words, names and addresses are also descriptions. An individual cannot be uniquely defined by his or her own description: there are many people who share the same name, so we can't use a name for differentiation. And in large apartment buildings, many people share the same address, so that is out as a differentiator as well.
If there were no such thing as identical twins, we might be able to use looks as a distinguishing factor, but even so I suspect that it would be hard to capture enough precision to uniquely describe how any one person could be distinguished from any other. In fact, the problem is more complex than one might think, since there are inherent statistical laws and idiosyncratic dependencies that make similarity more frequent than might be presumed.

For example, if we can't use name as a unique identifier, and we can't use address, then let's combine the name and address and use that. Think that works? I thought so at one time, until I scanned the names on the mailboxes in my apartment building a few years ago and found one last name used three times, three or four used twice, and one instance of the same first and last name! Intuitively, this actually makes sense: the distribution of last names in a specific geographic area is not independent, since some names are extremely common, members of tight-knit families live close to each other, and so on. For proof of this in the United States, open up your local phone book to Jones and start counting pages.

The upshot is that in order to really make sure that two different people (or products, or whatever) are distinguishable in a database, we artificially assign each a unique identifier. For the cable subscriber, that might be a customer id.

This all leads me to this month's conundrum: how do we really know who is who in the database? Each cable subscription may serve more than one person, such as both partners of a married couple, and so a customer identifier might effectively refer to two people. On the other hand, each cable customer may actually have more than one subscription: one for the home, one for the office, and one for the vacation home. In this case there may be three customer identifiers that refer to one person.
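The identifier problem can be sketched in a few lines of Python. Everything here is hypothetical and invented for illustration: the subscriber records, the field names, and the `duplicate_keys` and `assign_customer_ids` helpers do not come from any real system. The sketch shows that a name-plus-address pair can collide, that an artificially assigned identifier is unique by construction, and that the identifier alone still says nothing about how many people stand behind it.

```python
from collections import Counter
from itertools import count

# Hypothetical subscriber records: two different people in one building
# happen to share both a name and a street address.
subscribers = [
    {"name": "John Smith", "address": "12 Main St", "phone": "555-0101"},
    {"name": "John Smith", "address": "12 Main St", "phone": "555-0199"},
    {"name": "Mary Jones", "address": "12 Main St", "phone": "555-0142"},
]

def duplicate_keys(records, fields):
    """Return the key values (built from `fields`) shared by more than one record."""
    keys = Counter(tuple(r[f] for f in fields) for r in records)
    return [k for k, n in keys.items() if n > 1]

def assign_customer_ids(records, start=1000):
    """Attach an artificial, sequential customer_id to each record."""
    ids = count(start)
    return [{**r, "customer_id": next(ids)} for r in records]

# Name plus address is not a safe identifier for this data.
collisions = duplicate_keys(subscribers, ["name", "address"])

# The assigned identifier is unique by construction.
customers = assign_customer_ids(subscribers)
id_collisions = duplicate_keys(customers, ["customer_id"])

# Illustrative mappings only: an identifier does not say who is behind it.
# One subscription may serve a couple; one person may hold three subscriptions.
people_by_customer = {1000: ["John Smith", "Jane Smith"]}
customers_by_person = {"Mary Jones": [1002, 1003, 1004]}

print(collisions)     # name/address pairs that appear more than once
print(id_collisions)  # customer ids that appear more than once
```

The two dictionaries at the end are the real point: uniqueness of the key is easy to guarantee, but the relationship between keys and people runs many-to-many in both directions.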
To complicate matters, remember that while our unique identifiers can be restricted to a numeric domain, names and addresses are conveyed via language and text, which may be subject to variations depending on the source of the information. This means that the same real person may be represented in a database multiple times, or many individuals may be captured under a single unique tag. If the database is used solely for well-defined operational purposes, this may not be a relevant problem. But once the data is designated for use in a business intelligence application, it is worthwhile to try to sort out the factors used to uniquely identify individuals.

As a savvy BI manager, you might reconsider the operations approach to information modeling in favor of an individual-oriented approach. This should be the essence of any Customer Relationship Management (CRM) system, and if the methods and software selected for your CRM program are not helping you figure out who's in there, it might be worthwhile to take a step backwards and reassess the CRM program's true intention.
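To make the text-variation problem concrete, here is a rough Python sketch of one common first step: building a normalized comparison key so that differently typed versions of the same name and address line up. This technique is not described in the column; the `match_key` function, the tiny abbreviation table, and the sample records are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table; a realistic one would be far larger.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "apartment": "apt"}

def match_key(name, address):
    """Build a crude comparison key: lowercase, drop punctuation, unify abbreviations."""
    text = f"{name} {address}".lower()
    words = re.sub(r"[^a-z0-9\s]", " ", text).split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

# Hypothetical records: the first two differ only in how the text was entered.
records = [
    (1001, "John Smith", "12 Main Street, Apt. 4"),
    (1002, "JOHN SMITH", "12 Main St Apt 4"),
    (1003, "Mary Jones", "12 Main St Apt 7"),
]

# Group customer ids whose normalized name and address agree.
candidates = defaultdict(list)
for customer_id, name, address in records:
    candidates[match_key(name, address)].append(customer_id)

possible_duplicates = [ids for ids in candidates.values() if len(ids) > 1]
print(possible_duplicates)
```

A shared key only flags candidates for review. As the mailbox example shows, two different people can share a name and an address, so a match on text alone does not prove that two customer ids refer to one individual.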