Over the past fifteen years or so, there have been a number of attempts at making better use of idle computers by farming out application functionality to underused computing resources from some designated pool. This used to be referred to as “cycle-stealing,” and it is a way to increase computing volume without a significant capital investment. Within recent years, other ideas and concepts have been incorporated, mostly abstracted as general resource sharing, and a number of researchers working on these ideas have gotten together to formalize the process, which is now referred to as “grid computing.”
In general, grid computing is about collaboration and sharing. A grid is composed of a collection of resources and a set of protocols for sharing those resources, much like the electrical power grids that provide power services to many distributed clients. The kinds of resources that compose a grid might include computers, disks, memory, services, or even data. The protocols control the distribution of functionality, security, authentication, management of distributed processes, and policies regarding restriction of use, such as thresholds of local activity above which the resource is not available for sharing. The people that interact or share resources are together referred to as a virtual organization, and a virtual organization may actually incorporate individuals and resources from more than one administrative authority. For example, two scientific research groups at different universities working on similar problems may share both data and resources to solve problems relevant to both groups.
Grid applications are typically scientific ones that require large-scale processing power or large amounts of input data. However, if we imagine a collection of capabilities provided transparently by a virtual organization, we can see that grids have the potential to develop into an underlying fabric for providing value-added services seamlessly throughout an enterprise, or across more than one. Incorporating functionality as services within the grid augments enterprise architecture, because it provides a formal method for reusing capabilities that, up to this point, might have been acquired multiple times independently by different vertical organizations within the enterprise. That reuse holds the promise of a cost benefit.
As a simple example, more than one group within a company might have a need for an ETL tool, and vendors are all too happy to sell multiple licenses to these different groups. On the other hand, it is not likely that the same tool is being used constantly by processes in each group, and a single license would probably be sufficient. In the grid environment, all the groups needing that ETL tool can participate in a virtual organization, and one instance of that same tool can be made available to any of the members. By purchasing a single instance, the company saves money by not buying multiple copies, the costs of maintenance and management are reduced, and the tool is more likely to have a high utilization, all of which results in a larger ROI.
I was recently working on a report on grid computing, and during my research I found a lot of material discussing the benefits of grid computing with respect to sharing information in what is called a “data grid.” Data grids are more likely to focus on the distribution and/or sharing of data for scientific applications, but conceptually there is no reason why the grid paradigm cannot be extended to the information management world. Apparently this is what Oracle has been thinking, considering its recent announcement of the Oracle Enterprise Grid computing initiative as part of its Oracle 10g product suite.
Contrast this with the concepts of web services built to provide distributed access to databases. These services can be constructed to allow knowledge workers to peruse metadata via the browser interface, request data from a database, and have the service provide the data and deliver it directly to the client. This also provides for distributed access to data to clients in different locations.
I can imagine the next step in combining these two ideas: the virtual distributed database. This might be a system that, via web services, provides a traditional client interface to a collection of independent databases that have been made available as shared resources within a virtual organization implemented using grid technology. In this paradigm, a knowledge worker can sit down in front of a browser and interact with the virtual database as if it were a single resource. Having this capability would reduce the reliance on data replication (whether that is managed replication or whether it is ad hoc, i.e., “making my own copy”) and reduce storage requirements. It might mitigate the need for certain ETL functions, or even reduce the necessity for some kinds of data marts for analytical purposes.
I am certain that some readers are champing at the bit to claim that this is all doable today: new enterprise metadata tools are helpful in producing the enterprise view of available information, and we can certainly build services as information provision middleware objects. While I am confident that this environment is buildable, I believe that there are some serious issues that need to be understood before this virtual distributed database emerges into reality. These include (although this list is not meant to be all-inclusive):
- Access
- Security
- Functionality
- Performance
My use of the word access refers to the mechanisms through which information is collected and repackaged. For this database to be successful, there must be a mechanism to gain access to any data that exists in a structured format that is to be made available. Fortunately, we can use open database connectivity (ODBC) or the corresponding version for Java (JDBC) to allow applications to access data directly from program code. But this will only work as long as there is an existing driver for the targeted database. Ultimately, there must be some kind of adapter and corresponding API that will allow the service to access the data if it is to be incorporated into the pool of shared data resources.
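The adapter idea can be sketched in a few lines. This is a minimal illustration rather than a real grid component: the names `SqliteAdapter` and `ResourcePool` are hypothetical, and Python's built-in `sqlite3` driver stands in for whatever ODBC or JDBC driver a real source would need.

```python
import sqlite3


class SqliteAdapter:
    """Hypothetical adapter: wraps one driver behind a uniform fetch() call."""

    def __init__(self, conn):
        self._conn = conn

    def fetch(self, sql, params=()):
        cur = self._conn.execute(sql, params)
        cols = [d[0] for d in cur.description]
        # Return rows as dicts so callers do not depend on the driver's row type.
        return [dict(zip(cols, row)) for row in cur.fetchall()]


class ResourcePool:
    """Hypothetical registry of the data sources shared in the pool."""

    def __init__(self):
        self._adapters = {}

    def register(self, name, adapter):
        self._adapters[name] = adapter

    def query(self, name, sql, params=()):
        # A source without an adapter cannot join the pool at all.
        if name not in self._adapters:
            raise LookupError(f"no adapter registered for source {name!r}")
        return self._adapters[name].fetch(sql, params)


# An in-memory database plays the role of one independent source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Acme')")

pool = ResourcePool()
pool.register("sales", SqliteAdapter(conn))
rows = pool.query("sales", "SELECT id, name FROM customers")
print(rows)
```

Any source for which someone can write an object with the same `fetch` method could be registered the same way, which is the sense in which the adapter, not the driver, is what admits a database to the shared pool.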
Security is particularly relevant in a distributed environment. For the most part, the database server model provides certain levels of security and authentication at either a gross level (via usernames and passwords) or a granular level (in those database systems with table-, record-, or column-level access control). In the distributed environment, though, we are using proxies to access the data, and therefore the access rights are those granted to the proxy, not to the ultimate information client. The service must therefore provide a level of security and authentication of its own, and in the distributed environment the management of these access rights must also be viewed as a service that is part of the set of shared resources. Again, we are fortunate in that these capabilities are already available as part of the grid computing protocols.
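To make the proxy problem concrete, here is a small sketch in which the service checks the end user's rights before using its own proxy access. It is an assumption-laden illustration, not the grid security protocols themselves: `AccessPolicy`, `SecuredService`, and the user and table names are all hypothetical.

```python
class AccessDenied(Exception):
    """Raised when the requesting user holds no grant for a table."""


class AccessPolicy:
    """Hypothetical shared service recording who may read which table."""

    def __init__(self):
        self._grants = set()

    def grant(self, user, source, table):
        self._grants.add((user, source, table))

    def is_allowed(self, user, source, table):
        return (user, source, table) in self._grants


class SecuredService:
    """Checks the end user's rights before reading through the proxy."""

    def __init__(self, policy, readers):
        self._policy = policy
        # readers maps (source, table) to a callable that performs the
        # read using the proxy's own database credentials.
        self._readers = readers

    def read(self, user, source, table):
        # The database only ever sees the proxy, so this check is the one
        # place where the real requester's identity is enforced.
        if not self._policy.is_allowed(user, source, table):
            raise AccessDenied(f"{user} may not read {source}.{table}")
        return self._readers[(source, table)]()


policy = AccessPolicy()
policy.grant("alice", "hr", "salaries")
service = SecuredService(policy, {("hr", "salaries"): lambda: [("E1", 50000)]})

alice_rows = service.read("alice", "hr", "salaries")
try:
    service.read("bob", "hr", "salaries")
    bob_denied = False
except AccessDenied:
    bob_denied = True
```

The proxy's credentials would let either user read the table; only the policy check distinguishes them, which is why the management of access rights has to travel with the service.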
The last two issues are the ones I believe are more difficult. Let’s consider a simple view of functionality: supporting standard SQL. Simple queries are easy to handle – we have already discussed the use of ODBC/JDBC for accessing the data from any specific database, and it is easy to package a query, direct it to a server, and then wait for the result set to be forwarded back to the service, which can then repackage it and display it to the information client.
The problem arises in supporting cross-table queries when the tables live in different databases. Joining two tables from the same database is easy; it is just another query. Joining tables from different databases means that you cannot rely on the internal query engine of either database to materialize the result. This implies that we need to build the mechanics of the query engine into the service itself! In other words, a join of two tables from different databases means that the service needs to access the data from each table and then execute the join outside of the database, and this requires query engine functionality as well as memory and disk resources outside of the individual databases.
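A join executed outside the databases might look like the following sketch. Two separate in-memory `sqlite3` databases stand in for independent servers, and the table names, column names, and the `hash_join` helper are illustrative assumptions. A hash join is only one of several strategies a real query engine could choose.

```python
import sqlite3


def fetch(conn, sql):
    """Pull a result set out of one database as a list of dicts."""
    cur = conn.execute(sql)
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]


def hash_join(left, right, left_key, right_key):
    """Inner-join two row lists in the service's own memory."""
    # Build phase: index the right-hand rows by their join key.
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    # Probe phase: look up each left-hand row and merge the matches.
    joined = []
    for lrow in left:
        for rrow in index.get(lrow[left_key], []):
            joined.append({**lrow, **rrow})
    return joined


# Two independent databases; neither engine can see the other's table.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (cust_id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])

billing = sqlite3.connect(":memory:")
billing.execute(
    "CREATE TABLE invoices (invoice_id INTEGER, customer INTEGER, amount REAL)")
billing.executemany("INSERT INTO invoices VALUES (?, ?, ?)",
                    [(10, 1, 250.0), (11, 1, 75.5), (12, 3, 40.0)])

customers = fetch(crm, "SELECT cust_id, name FROM customers ORDER BY cust_id")
invoices = fetch(
    billing,
    "SELECT invoice_id, customer, amount FROM invoices ORDER BY invoice_id")

result = hash_join(customers, invoices, "cust_id", "customer")
for row in result:
    print(row)
```

Note that both result sets and the index are held in the service's memory here, which is exactly the resource cost outside the databases that the paragraph above describes. Tables too large for memory would force the service to spill to disk or partition the work.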
And this leads to the last issue: performance. For this virtual database to work, it must be able to provide performance levels that are acceptable to the information client. But presumably we are implementing a significant amount of database functionality outside of the database servers. In addition, we need to factor in the latency for delivery of information from the separate servers through the network. That means that the service must be able to optimize the database functionality across the collection of distributed resources. And we are in luck again: the grid computing paradigm is designed to support parallelism across distributed systems, and a large part of database query functionality is well suited to parallelization. We can therefore take advantage of the grid to provide for the computational requirements, as well as exploit network connectivity to support the needed functionality.
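One small piece of this, overlapping the latency of several remote reads, can be sketched as follows. This is a simulation under stated assumptions: `make_source` and `fetch_all` are hypothetical helpers, `time.sleep` stands in for network and query latency, and threads on one machine stand in for what a grid would spread across separate nodes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_source(rows, latency_seconds):
    """Return a callable that simulates one remote database read."""
    def read():
        time.sleep(latency_seconds)  # stands in for network and query latency
        return list(rows)
    return read


def fetch_all(sources):
    """Issue every read at once and collect the results by source name."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(read) for name, read in sources.items()}
        return {name: future.result() for name, future in futures.items()}


sources = {
    "crm": make_source([("Acme",)], 0.2),
    "billing": make_source([(250.0,)], 0.2),
    "inventory": make_source([("widget", 7)], 0.2),
}

started = time.perf_counter()
results = fetch_all(sources)
elapsed = time.perf_counter() - started
print(f"fetched {len(results)} sources in {elapsed:.2f}s")
```

Because the three simulated reads wait concurrently, the total time should be close to the slowest single read rather than the sum of all three. Parallelizing the join itself, for example by partitioning rows on the join key, would be a further step that this sketch does not attempt.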
I anticipate that this kind of virtual distributed database service is the next logical step within a grid services environment. I would be interested in hearing from you if you are currently working on this kind of project to learn more about your experiences – email me at loshin@knowledge-integrity.com, and I will be happy to summarize what I learn in my next column.
1st Oct, 2003
Data, Databases, and Distribution