Data, Databases, and Distribution
- Published at tdan.com
October 2003
Over the past fifteen
years or so, there have been a number of attempts at making better use
of idle computers by farming out application functionality to underused
computing resources from some designated pool. This used to be referred
to as "cycle-stealing," and it is a way to increase computing
volume without a significant capital investment. In recent years,
other ideas and concepts have been incorporated, mostly abstracted as
general resource sharing, and a number of researchers working on these
ideas have gotten together to formalize the process, which is now referred
to as "grid computing."
In general, grid computing is about collaboration and sharing. A grid
is composed of a collection of resources and a set of protocols for sharing
those resources, much like the electrical power grids that provide power
services to many distributed clients. The kinds of resources that compose
a grid might include computers, disks, memory, services, or even data.
The protocols control the distribution of functionality, security, authentication,
management of distributed processes, and policies regarding restriction
of use, such as thresholds of local activity above which the resource
is not available for sharing. The people who interact or share resources
are together referred to as a virtual organization, and a virtual organization
may actually incorporate individuals and resources from more than one
administrative authority. For example, two scientific research groups
at different universities working on similar problems may share both data
and resources to solve problems relevant to both groups.
Grid applications are typically scientific ones, which require large-scale
processing power to run, or require large amounts of data as input. However,
if we imagine a collection of capabilities that are transparently
provided by a virtual organization, we can see that grids have the potential
to develop into an underlying fabric for providing value-added services
seamlessly throughout an enterprise (or across more than one). Incorporating
functionality as services within the grid augments enterprise architecture,
because it provides a formal method for reusing capabilities that up to
this point might have been acquired multiple times independently by different
vertical organizations within the enterprise, and that reuse promises
some cost benefit.
As a simple example, more than one group within a company might have a
need for an ETL tool, and vendors are all too happy to sell multiple
licenses to these different groups. On the other hand, it is not likely
that the same tool is being used constantly by processes in each group,
and a single license would probably be sufficient. In the grid environment,
all the groups needing that ETL tool can participate in a virtual organization,
and one instance of that same tool can be made available to any of the
members. By purchasing a single instance, the company saves money by not
buying multiple copies, the costs of maintenance and management are reduced,
and the tool is more likely to have a high utilization, all of which results
in a larger ROI.
I was recently working on a report on grid computing, and during my research
I found a lot of material discussing the benefits of grid computing with
respect to sharing information in what is called a "data grid."
Data grids are more likely to focus on the distribution and/or sharing
of data for scientific applications, but conceptually there is no reason
why the grid paradigm cannot be extended to the information management
world. Apparently this is what Oracle has been thinking, considering its
recent announcement of its Oracle Enterprise Grid computing initiative
as part of its Oracle 10g product suite.
Contrast this with the concepts of web services built to provide distributed
access to databases. These services can be constructed to allow knowledge
workers to peruse metadata via the browser interface, request data from
a database, and have the service provide the data and deliver it directly
to the client. This also gives clients in different locations distributed
access to data.
I can imagine the next step in combining these two ideas: the virtual
distributed database. This might be a system that, via web services, provides
a traditional client interface to a collection of independent databases
that have been made available as shared resources within a virtual organization
implemented using grid technology. In this paradigm, a knowledge worker
can sit down in front of a browser and interact with the virtual database
as if it were a single resource. Having this capability would reduce the
reliance on data replication (whether that is managed replication or whether
it is ad hoc, i.e., "making my own copy") and reduce storage
requirements. It might mitigate the need for certain ETL functions, or
even reduce the necessity for some kinds of data marts for analytical
purposes.
I am certain that some readers are champing at the bit to claim that this
is all doable today; new enterprise metadata tools are helpful in producing
the enterprise view of available information, and we can certainly build
services as information provision middleware objects. While I am confident
that this environment is buildable, I believe that there are some serious
issues that need to be understood before this virtual distributed database
emerges into reality, and these include (although this list is not meant to
be all-inclusive):
- Access
- Security
- Functionality
- Performance
My use of the word access refers to the mechanisms through which information
is collected and repackaged. For this database to be successful, there
must be a mechanism to gain access to any data that exists in a structured
format that is to be made available. Fortunately, we can use Open Database
Connectivity (ODBC) or its Java counterpart (JDBC) to allow
applications to access data directly from program code. But this will
only work as long as there is an existing driver for the targeted database.
Ultimately, there must be some kind of adapter and corresponding API that
will allow the service to access the data if it is to be incorporated
into the pool of shared data resources.
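To make the adapter idea concrete, here is a minimal sketch in Python. It assumes each shared source can be reached through a Python DB-API 2.0 connection (the role ODBC or JDBC drivers play elsewhere); the names SourceAdapter and SharedDataPool are hypothetical, and an in-memory SQLite database stands in for a real shared source.

```python
import sqlite3
from typing import Any, Dict, List, Tuple

Row = Tuple[Any, ...]


class SourceAdapter:
    """Hypothetical adapter wrapping one open DB-API 2.0 connection."""

    def __init__(self, name: str, connection: Any) -> None:
        self.name = name
        self._conn = connection

    def query(self, sql: str, params: Tuple[Any, ...] = ()) -> List[Row]:
        # Parameter placeholders ("?") follow the sqlite3 style; other
        # drivers may use a different DB-API paramstyle.
        cur = self._conn.cursor()
        try:
            cur.execute(sql, params)
            return cur.fetchall()
        finally:
            cur.close()


class SharedDataPool:
    """Hypothetical registry of the sources shared in the virtual organization."""

    def __init__(self) -> None:
        self._adapters: Dict[str, SourceAdapter] = {}

    def register(self, adapter: SourceAdapter) -> None:
        self._adapters[adapter.name] = adapter

    def query(self, source: str, sql: str, params: Tuple[Any, ...] = ()) -> List[Row]:
        # Without an adapter (that is, without a driver) the source
        # simply cannot join the pool.
        if source not in self._adapters:
            raise LookupError(f"no adapter registered for source {source!r}")
        return self._adapters[source].query(sql, params)


def demo_pool() -> SharedDataPool:
    """Build a pool with one illustrative source named 'crm'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "Ada"), (2, "Grace")])
    pool = SharedDataPool()
    pool.register(SourceAdapter("crm", conn))
    return pool
```

A client of the service would then call something like demo_pool().query("crm", "SELECT name FROM customers WHERE id = ?", (2,)) without knowing which driver sits behind the source name.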
Security is particularly relevant in a distributed environment. For the
most part, the database server model provides certain levels of security
and authentication at either a gross level (via usernames and passwords)
or a granular level (in those database systems with table, record,
or column-level access control). In the distributed environment, though,
we are using proxies to access the data, and therefore the access rights
are those granted to the proxy, not to the ultimate information client.
Therefore, the service must also be able to provide a level of security
and authentication, and in the distributed environment the management
of these access rights must also be viewed as a service that is part of
the set of shared resources. Again, we are fortunate in that these capabilities
are already available as part of the grid computing protocols.
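The proxy problem can be sketched in a few lines of Python. This is only an illustration of the service checking the rights of the ultimate information client before it uses its own proxy credentials; it does not use any actual grid security protocol, and the names AccessPolicy, AccessDenied, and authorize are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


class AccessDenied(Exception):
    """Raised when the end user lacks rights, whatever the proxy may hold."""


@dataclass
class AccessPolicy:
    """Hypothetical grant table: user -> set of (source, table) pairs."""

    grants: Dict[str, Set[Tuple[str, str]]] = field(default_factory=dict)

    def grant(self, user: str, source: str, table: str) -> None:
        self.grants.setdefault(user, set()).add((source, table))

    def is_allowed(self, user: str, source: str, table: str) -> bool:
        return (source, table) in self.grants.get(user, set())


def authorize(policy: AccessPolicy, user: str, source: str, table: str) -> None:
    """Check the end user's rights before the service queries via its proxy."""
    if not policy.is_allowed(user, source, table):
        raise AccessDenied(f"{user} may not read {source}.{table}")
```

The point of the design is that the check is keyed on the end user, not on the proxy account, so the database's own view of "who is asking" is supplemented rather than replaced.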
The last two issues are the ones I believe are more difficult. Let's consider
a simple view of functionality: supporting standard SQL. Simple queries
are easy to handle - we have already discussed the use of ODBC/JDBC for
accessing the data from any specific database, and it is easy to package
a query, direct it to a server, and then wait for the result set to be
forwarded back to the service, which can then repackage it and display
it to the information client.
The problem arises when supporting cross-table queries in which the
tables live in different databases. Joining two tables from the same database
is easy - it is just another query. Joining tables from different databases
means that you cannot rely on the internal query engine of either database
to materialize the result. This implies that we need to build the mechanics
of the query engine into the service itself! In other words, a join of
two tables from different databases means that the service needs to access
the data from each table and then execute the join outside of the database,
and this requires both query engine functionality and memory and
disk resources outside of the individual databases.
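A minimal Python sketch of what "executing the join outside of the database" might look like follows. It assumes both inputs fit in the service's memory and uses a simple hash join as one possible technique; two in-memory SQLite databases stand in for the independent servers, and all table and function names are hypothetical.

```python
import sqlite3
from collections import defaultdict
from typing import Any, Iterable, List, Tuple

Row = Tuple[Any, ...]


def hash_join(left: Iterable[Row], right: Iterable[Row],
              left_key: int, right_key: int) -> List[Row]:
    """Inner equi-join done in the service's memory, not in either database."""
    index = defaultdict(list)
    for row in left:
        if row[left_key] is not None:      # SQL NULLs never match in a join
            index[row[left_key]].append(row)
    joined: List[Row] = []
    for row in right:
        key = row[right_key]
        if key is None:
            continue
        for match in index.get(key, []):
            joined.append(match + row)
    return joined


def demo_cross_database_join() -> List[Row]:
    # Two separate databases: neither query engine can see the other's table.
    crm = sqlite3.connect(":memory:")
    crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    crm.executemany("INSERT INTO customers VALUES (?, ?)",
                    [(1, "Ada"), (2, "Grace")])

    billing = sqlite3.connect(":memory:")
    billing.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
    billing.executemany("INSERT INTO invoices VALUES (?, ?)",
                        [(1, 120.0), (1, 30.0), (3, 75.0)])

    # Each database answers only a simple single-table query.
    customers = crm.execute("SELECT id, name FROM customers").fetchall()
    invoices = billing.execute(
        "SELECT customer_id, amount FROM invoices").fetchall()

    # The service materializes the join itself.
    return hash_join(customers, invoices, left_key=0, right_key=0)
```

Even this toy version shows where the cost goes: both result sets cross the network in full, and the index lives in the service's memory rather than in either server.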
And this leads to the last issue: performance. For this virtual database
to work, it must be able to provide certain performance levels that are
acceptable to the information client. But presumably we are implementing
a significant amount of database functionality outside of the database
servers. In addition, we also need to factor in the latency for delivery
of information from the separate servers through the network. That means
that the service must be able to optimize the database functionality across
the collection of distributed resources. And we are in luck again: the grid
computing paradigm is designed to support parallelism across distributed
systems, and a large part of database query functionality is well suited
to parallelization, so we can take advantage of the grid to provide for
the computational requirements and exploit network connectivity to
support the needed functionality.
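To illustrate why joins parallelize well, here is a small Python sketch of a partitioned join: rows are split by a hash of the join key, so matching rows always land in the same partition and each partition can be joined independently. A local thread pool stands in for grid nodes purely for illustration (in CPython, threads will not speed up this CPU-bound work), and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, List, Sequence, Tuple

Row = Tuple[Any, ...]


def partition(rows: Sequence[Row], key: int, n: int) -> List[List[Row]]:
    """Split rows into n buckets by hash of the join key."""
    parts: List[List[Row]] = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def join_partition(left: Sequence[Row], right: Sequence[Row],
                   left_key: int, right_key: int) -> List[Row]:
    """Plain in-memory hash join of one pair of partitions."""
    index = {}
    for row in left:
        index.setdefault(row[left_key], []).append(row)
    return [match + row
            for row in right
            for match in index.get(row[right_key], [])]


def parallel_join(left: Sequence[Row], right: Sequence[Row],
                  left_key: int, right_key: int,
                  workers: int = 4) -> List[Row]:
    """Join each pair of partitions as an independent task."""
    left_parts = partition(left, left_key, workers)
    right_parts = partition(right, right_key, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each submitted task is the unit a grid scheduler could place
        # on a separate resource.
        futures = [pool.submit(join_partition, lp, rp, left_key, right_key)
                   for lp, rp in zip(left_parts, right_parts)]
        results: List[Row] = []
        for future in futures:
            results.extend(future.result())
    return results
```

Because no partition needs data from any other, the only coordination required is the initial split and the final concatenation, which is the property that makes this kind of work a natural fit for distributed resources.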
I anticipate that this kind of virtual distributed database service is
the next logical step within a grid services environment. If you are currently
working on this kind of project, I would be interested in hearing about your
experiences - email me at loshin@knowledge-integrity.com,
and I will be happy to summarize what I learn in my next column.