Data Models, Data Standards, and XML

Data Models, Data Standards, and XML - Published in www.businessintelligence.com January 2004

I have been thinking a lot about information exchange lately, mostly in terms of how different groups within a single virtual enterprise share information. When the topic has come up with some of our clients, there is often some confusion as to the difference between a data model and a data standard. In a non-technical environment, that confusion is understandable, since both are related to the representation of data, and both are driven by business objectives.

In fact, data models and data standards are related, yet they differ subtly. We can loosely define a data model to be a formal structured representation of real-world entities, focused on the definition of an object and its associated attributes. Data models concentrate on entities, their characteristics, and, in a relational environment, the relationships between different entities. For example, a data model representing people might capture all attributes relevant to the description of a person: last name, first name, weight, height, birth date, hair color, eye color, etc. This data model could also capture how individual entities are related, such as documenting all line items associated with a customer's order, or even the set of orders placed by any particular customer.
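
As a rough illustration, the person and order entities described above might be sketched as follows. The class and field names are hypothetical, chosen only to mirror the example attributes; they are not drawn from any particular system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    # Attributes describing the entity itself.
    last_name: str
    first_name: str
    birth_date: str  # stored here as a character string
    hair_color: str = ""
    eye_color: str = ""


@dataclass
class LineItem:
    description: str
    quantity: int


@dataclass
class Order:
    order_id: int
    # Relationship: an order has many line items.
    line_items: List[LineItem] = field(default_factory=list)


@dataclass
class Customer:
    person: Person
    # Relationship: a customer has many orders.
    orders: List[Order] = field(default_factory=list)


customer = Customer(Person("Smith", "Pat", "February 28, 1977"))
customer.orders.append(Order(1, [LineItem("widget", 3)]))
print(len(customer.orders), len(customer.orders[0].line_items))
```

The sketch captures structure and relationships only; it says nothing yet about what the values inside each field must look like.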

The data model, though, is mostly concerned with the structure of the representation, not necessarily with the content used to fill in that structure. For example, consider a person's birth date attribute represented as a character string, and suppose the model does not specify whether that birth date is expressed as a month name followed by the day of the month and the year (e.g., "February 28, 1977"), in the MM/DD/YY format (e.g., "02/28/77"), or in any of a number of other date formats. No matter what format is used, the value conforms to the model, and this may be fine as long as the people using the data understand which representation is in play. However, this changes as soon as anyone wants to share the data stored using that model with someone in a different organization. The variety of date formats in use may make it harder to migrate information from its source to its next destination.
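
To make the point concrete, here is a minimal sketch in which two strings that both satisfy a "birth date is a character string" model can only be interpreted by a reader who already knows the formats in use. The list of formats and the helper name are assumptions for illustration, and parsing the month name assumes an English (or default C) locale.

```python
from datetime import datetime

# Two strings that both conform to a string-typed birth date attribute.
values = ["February 28, 1977", "02/28/77"]

# Hypothetical: formats the reader of the data must know in advance.
known_formats = ["%B %d, %Y", "%m/%d/%y"]


def parse_birth_date(text):
    """Try each known format in turn; return a date, or None if none match."""
    for fmt in known_formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


parsed = [parse_birth_date(v) for v in values]
print(parsed[0] == parsed[1])           # both strings denote the same day
print(parse_birth_date("28.02.1977"))   # an unanticipated format is not understood
```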

For example, in contrast to the laxness of the source data model with respect to date representation, the next user of that data set may have strict requirements about date formats. This apparent formatting dichotomy arises from the fact that any participant sharing the information may have their own data model, and embedded within each data model is information about the data types that populate each field. So while one data model (built using one vendor's database system) may allow dates to be stored as character strings, another data model (built using a different vendor's database system) might use a built-in system type for representing dates. When the target system attempts to load a record whose values do not conform to the specified type, an exception occurs, which may prevent the participant from using the violating record (or, at worst, the entire set of records).
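
The failure mode described above can be sketched with a toy loader. Here the target's native date type is stood in for by Python's `date`, and the record layout and function name are hypothetical.

```python
from datetime import date

# Hypothetical records from a source whose model stores dates as strings.
incoming = [
    {"name": "A", "birth_date": "1977-02-28"},
    {"name": "B", "birth_date": "February 28, 1977"},
]


def load_record(record):
    """Mimic a target column with a native date type: ISO strings only."""
    return {
        "name": record["name"],
        "birth_date": date.fromisoformat(record["birth_date"]),
    }


loaded, rejected = [], []
for record in incoming:
    try:
        loaded.append(load_record(record))
    except ValueError:
        # The violating record is set aside; a stricter loader
        # might instead abort the entire batch.
        rejected.append(record)

print(len(loaded), len(rejected))
```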

Of course, the solution to this problem is the use of a data standard for information exchange. The standard may correspond to the source data model, or the target data model, or may provide for a format that is foreign to both models. The actual format selected is irrelevant; what is important is that the two participants agree to use the selected format in any situation where they exchange data. This is not to say that a data standard must be developed in isolation from the data models associated with the applications that use the exchanged data. On the contrary, it is sometimes very important to develop the data standard in concert with the data model.
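
A minimal sketch of such an agreement, assuming the participants settle on a year-month-day layout (any other agreed format would serve equally well). The function names are hypothetical.

```python
from datetime import datetime, date

AGREED_FORMAT = "%Y-%m-%d"  # assumption: the layout both parties agreed on


def export_birth_date(source_text, source_format):
    """Sender side: convert the local representation to the agreed format."""
    return datetime.strptime(source_text, source_format).strftime(AGREED_FORMAT)


def import_birth_date(exchanged_text):
    """Receiver side: accept only the agreed format, yield a native date."""
    return datetime.strptime(exchanged_text, AGREED_FORMAT).date()


wire_value = export_birth_date("02/28/77", "%m/%d/%y")
print(wire_value, import_birth_date(wire_value))
```

Each side remains free to keep its own internal representation; only the value that crosses the boundary has to follow the agreement.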

This brings me to the issue of XML, a definition framework that is gaining in popularity for developing data standards. XML does not by itself prescribe a standard; instead, XML is used to define standards for the exchange of information through conforming XML documents. Data values in an XML document are surrounded by tags (labels) that identify where the data content begins and ends, such as <state>Alabama</state>.
The value of XML documents lies in their apparent self-documenting form as well as their flexibility. Because a valid XML document must refer to its definition schema, the receiving application can determine what data elements may be present within a valid document, as well as which elements it may choose to pay attention to or to ignore. In addition, because the representation is text-based, yet embeds structure, data exchanged in the form of XML messages can be accepted by many different kinds of systems, making it very portable.
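
As an informal illustration of a receiver attending to some elements and ignoring others, the sketch below uses Python's standard `xml.etree.ElementTree` parser. The message and its tag names (other than `state`) are invented, and this parser only reads the document; it does not validate it against a schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical message; the tag names are invented for illustration.
message = """
<person>
  <last_name>Smith</last_name>
  <state>Alabama</state>
  <hair_color>brown</hair_color>
</person>
"""

root = ET.fromstring(message)

# The receiver picks out the elements it cares about and ignores the rest.
wanted = ("last_name", "state")
record = {tag: root.findtext(tag) for tag in wanted}
print(record)
```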

On the other hand, since one of my major interests is in automated data validation, I have to think twice about the use of XML for data exchange standards. First of all, using XML will not necessarily provide any automatic increase in data quality; the XML document definition describes object structure without necessarily providing any insight into data element content. Yet I have already pointed out that differences in the formats used for content are one of the drivers for defining standards in the first place, and XML does not provide any improvement in that department without auxiliary processing at either the server or the client side.
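
The gap between structure and content can be sketched as follows: a document that parses without complaint still carries a value that a separate content check rejects. The element name, the date layout, and the helper function are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical document: structurally fine, but the content is not a usable date.
doc = "<person><birth_date>sometime in 1977</birth_date></person>"

root = ET.fromstring(doc)  # parsing succeeds: the structure is acceptable
raw = root.findtext("birth_date")


def is_valid_birth_date(text, fmt="%Y-%m-%d"):
    """Auxiliary content check; the format is an assumed agreement."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


print(is_valid_birth_date(raw), is_valid_birth_date("1977-02-28"))
```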

Second, in data-intensive applications XML documents will tend to be much larger because they include the start and end tags, which can affect application performance. Note that a simple data field such as "STATE" may take two characters in a fixed-field exchange but seventeen characters in an XML exchange (<state>AL</state>)!
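
The overhead is easy to count for a single field. The sketch below assumes a two-letter state code and a lowercase `state` tag; the exact figure depends on the tag name chosen.

```python
fixed_field = "AL"               # two-character code in a fixed-field layout
xml_field = "<state>AL</state>"  # the same value wrapped in start and end tags

overhead = len(xml_field) - len(fixed_field)
print(len(fixed_field), len(xml_field), overhead)
```

Multiplied across every field of every record in a large exchange, the tag overhead can come to dominate the size of the message.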

Third, a lot of the value of XML can be exploited only when the target systems are specifically built to use and process XML documents; if a participant does not yet have the infrastructure for absorbing XML messages, there is not a significant benefit over other kinds of defined standards. Lastly, despite claims of "human-readability," there is a degree of complexity to XML documents that necessitates additional processing to bend them into a truly readable form.

I am not saying that there is no value in using XML as a framework for defining a data standard for information exchange. I do think, however, that there is a tendency for people to latch onto a technological meme, such as XML as the de facto method for defining data standards, as a way to avoid addressing the underlying business issues, such as developing the process by which participating parties negotiate information exchange. This brings me to the issue that faces the savvy BI manager: the focus of the standards process should be the agreement by all participants as to what the data will look like when it moves from one place to another. Once that is agreed upon, the implementation may be done using XML just as well as using any other approach.
