Data Models, Data Standards, and XML
Published in www.businessintelligence.com, January 2004

I have been thinking
a lot about information exchange lately, mostly in terms of how different
groups within a single virtual enterprise share information. When the
topic comes up with some of our clients, there is often some confusion
as to the difference between a data model and a data standard. In a non-technical
environment, that confusion is understandable, since both are related
to the representation of data, and both are driven by business objectives.

In fact, data models
and data standards are related, yet they differ subtly. We can loosely
define a data model as a formal, structured representation of real-world
entities, focused on the definition of an object and its associated attributes.
Data models concentrate on entities, their characteristics, and, in a relational
environment, the relationships between different entities. For example,
a data model representing people might capture all attributes relevant
to the description of a person: last name, first name, weight, height,
birth date, hair color, eye color, etc. A data model can also capture
how individual entities are related, such as documenting all line items
associated with a customer's order, or even the set of orders placed by
any particular customer.

The data model, though,
is mostly concerned with the structure of the representation, but not
necessarily the content used to fill in that structure. For example, consider
a person's birth date attribute represented as a character string,
and imagine that the model does not specify whether that birth date
is expressed using the month name, followed by the day of the month, followed
by the year (e.g., "February 28, 1977"), or whether it is expressed
using the MM/DD/YY format (e.g., "02/28/77"), or in any of a
number of other formats for representing dates. No matter what format is used,
as long as the representation meets the needs of those working with that
data, the value will conform to the model's directive, and this may be
fine as long as the people using the data in that model understand this
to be true. However, this changes as soon as anyone wants to share the
data stored using that model with someone in a different organization.
The variety of date formats in use may make it harder to migrate
information from its source to its next destination.
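To make the birth date example concrete, here is a minimal sketch in Python. The language, the `Person` class, the `KNOWN_FORMATS` list, and the `parse_birth_date` helper are all my own illustrative assumptions, not part of any particular system. It shows two records that both satisfy a model whose birth date is "a character string," and what a recipient has to do to recover the actual date. The month-name format assumes an English locale.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Person:
    last_name: str
    first_name: str
    birth_date: str  # the model only says "character string", not which format


# Both records conform to the model, even though the formats differ.
people = [
    Person("Smith", "Ann", "February 28, 1977"),
    Person("Smith", "Ann", "02/28/77"),
]

# The raw values do not match, although they describe the same day.
same_raw_value = people[0].birth_date == people[1].birth_date

# Hypothetical list: a recipient must know every format the source used.
KNOWN_FORMATS = ["%B %d, %Y", "%m/%d/%y"]


def parse_birth_date(text: str) -> date:
    """Try each known format in turn; fail if none of them matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized birth date format: {text!r}")


parsed = [parse_birth_date(p.birth_date) for p in people]
```

Every format the source tolerates becomes one more entry the recipient has to discover and maintain, which is where the migration cost comes from.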

For example, in contrast
to the source data model's laxness about date
representation, the next user of that data set may have strict requirements
about date formats. This apparent formatting dichotomy stems from the
fact that each participant sharing the information may have its own data
model, and embedded within each data model is information about the data
types that populate each field. So while one data model (built using one
vendor's database system) may allow dates to be stored as character strings,
another data model (built using a different vendor's database system)
might use a built-in system type for representing dates. When the target
system attempts to load a record whose values do not conform to the specified
type, an exception occurs, which may prevent the participant from using
the offending record (or, at worst, the entire set of records).

Of course, the solution
to this problem is the use of a data standard for information exchange.
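As a rough sketch of what such an agreement might look like in practice, the Python below assumes the participants have settled on YYYY-MM-DD as the exchange format. That choice, and the names `EXCHANGE_FORMAT`, `export_birth_date`, and `load_birth_date`, are hypothetical and only for illustration. The source converts its local representation on the way out, and the target accepts only the agreed format on the way in.

```python
from datetime import date, datetime

# Hypothetical agreed-upon exchange format (an assumption for this sketch).
EXCHANGE_FORMAT = "%Y-%m-%d"


def export_birth_date(source_value: str, source_format: str) -> str:
    """Source side: convert a locally formatted string to the exchange format."""
    parsed = datetime.strptime(source_value, source_format)
    return parsed.strftime(EXCHANGE_FORMAT)


def load_birth_date(exchanged_value: str) -> date:
    """Target side: accept only the exchange format; raise ValueError otherwise."""
    return datetime.strptime(exchanged_value, EXCHANGE_FORMAT).date()


# A value that went through the agreed format loads cleanly into a date type.
exchanged = export_birth_date("02/28/77", "%m/%d/%y")
loaded = load_birth_date(exchanged)

# A value sent in the source's local format is rejected by the target.
try:
    load_birth_date("02/28/77")
    rejected = False
except ValueError:
    rejected = True
```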
The standard may correspond to the source data model, or the target data
model, or may provide for a format that is foreign to both models. The
actual format selected is irrelevant; what is important is that the two participants
agree to use the selected format in any situation where they exchange
data. This is not to say that a data standard must be developed in isolation from
the data models associated with the applications that use the exchanged
data. On the contrary, it is sometimes very important to develop the data
standard in concert with the data model.

This brings me to
the issue of XML, a definition framework that is gaining in popularity
for developing data standards. XML does not by itself prescribe a standard;
instead, XML is used to define standards for the exchange of information
through conforming XML documents. Data values in an XML document are surrounded
by tags (labels) that identify where the data content begins and ends,
such as <state>Alabama</state>.

On the other hand,
since one of my major interests is in automated data validation, I have
to think twice about the use of XML for data exchange standards. First
of all, using XML will not necessarily provide any automatic increase
in data quality; the XML document definition describes object structure
without necessarily providing any insight into data element content. Yet
I have already pointed out that differences in the format used for content
are one of the drivers for defining standards in the first place, and XML
does not provide any improvement in that department without additional
auxiliary processing on either the server or the client side.

Second, in data-intensive
applications, XML documents will tend to be much larger because they include
the start and end tags, which can affect application performance. Note
that a simple data field such as "STATE" may take
two characters in a fixed-field exchange and on the order of sixteen characters in an
XML exchange!

Third, a lot of the
value of XML can be exploited only when the target systems are specifically
built to use and process XML documents; if a participant does not yet
have the infrastructure for absorbing XML messages, there is not a significant
benefit over other kinds of defined standards. Lastly, despite claims
of "human-readability," there is a degree of complexity to XML
documents that necessitates additional processing to bend them into a
truly readable form.

I am not saying that there is no value in using XML as a framework for defining a data standard for information exchange. I do think, however, that there is a tendency for people to latch onto a technological meme, such as XML as the de facto method for defining data standards, as a way to decouple themselves from addressing the underlying business issues, such as developing the process by which participating parties negotiate information exchange.

But this brings me to the issue that faces the savvy BI manager: the focus of the standards process should be the agreement by all participants as to what the data will look like when it moves from one place to another. Once that is agreed upon, the implementation may be done using XML just as well as using any other approach.
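As an illustration of that last point, the Python sketch below carries the same agreed-upon content in two ways: as an XML document and as a fixed-field record. The content rules (two-letter state code, YYYY-MM-DD dates), the `person` and `birth_date` element names, the fixed-field layout, and the helper names are all assumptions of mine. It also shows the size difference for a single field, and that an XML parser accepts well-formed structure regardless of how the content is formatted.

```python
import xml.etree.ElementTree as ET

# Hypothetical agreed-upon content: two-letter state code, YYYY-MM-DD date.
record = {"state": "AL", "birth_date": "1977-02-28"}


def to_xml(rec: dict) -> str:
    """Wrap each field in start and end tags under a <person> element."""
    root = ET.Element("person")
    for name, value in rec.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def to_fixed(rec: dict) -> str:
    """Hypothetical layout: state in columns 1-2, birth date in columns 3-12."""
    return f"{rec['state']:<2}{rec['birth_date']:<10}"


xml_text = to_xml(record)
fixed_text = to_fixed(record)

# Size of the state field alone in each representation.
state_fixed_len = len("AL")
state_xml_len = len("<state>AL</state>")

# Tags delimit structure only: content that ignores the agreement is still
# well-formed XML and parses without complaint.
loose_xml = to_xml({"state": "Alabama", "birth_date": "02/28/77"})
loose_state = ET.fromstring(loose_xml).find("state").text
```

With the `<state>` tag used earlier and a two-letter code, the XML form of the field comes to 17 characters against 2 in the fixed-field form; the exact figure depends on the tag name chosen. Either carrier works once the content rules are agreed, and neither enforces those rules by itself.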