We have recently been tinkering with an open-source data mining tool provided by the Weka project, and this has enticed us to read up a little on data mining techniques and how they are used.
One area of focus is the undirected approach to grouping stuff together using clustering algorithms. The basic idea is to take a set of data instances and organize them into groups such that all members of a single group are similar to each other, and every group is dissimilar to every other group (could one say, “the most diverse group of similar things?”).
In order to do this clustering, though, the algorithm must be able to determine whether two data instances are similar to each other or not, and that means some quantification of similarity. Of course, this is based on the values within the attributes of the data objects, suggesting the need for a measure of how close any pair of values are to each other. This notion of closeness is reflected in the typical approaches to gauging similarity using a Euclidean distance function (remember high school geometry?).
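To make the geometry concrete, here is a minimal sketch of a Euclidean distance function in Python. It assumes each data instance has already been reduced to a sequence of numeric attribute values; the function names and the conversion of distance into a similarity score are our own illustrative choices, not anything prescribed by a particular tool.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric instances."""
    if len(a) != len(b):
        raise ValueError("instances must have the same number of attributes")
    # Square the per-attribute differences, sum them, and take the root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """One illustrative (assumed) way to turn distance into a score in (0, 1]."""
    return 1.0 / (1.0 + euclidean_distance(a, b))

# The classic 3-4-5 right triangle from high school geometry.
print(euclidean_distance((0, 0), (3, 4)))
```

Smaller distances mean more similar instances. Note that attributes measured on very different scales would normally be normalized first, or the largest-valued attribute will dominate the result.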
In essence, there are different classes of value sets: some are interval or ranked (meaning they have magnitude and/or can be ordered according to some criteria), while others are categorical (that is, discrete values with no specific ordering, such as shoe color or printer model number). Each value set must have some distance function that indicates its contribution to the similarity analysis.
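As a rough illustration of how different value sets might each contribute to one overall distance, the sketch below scales interval differences by the attribute's range and treats categorical values as a simple match or mismatch, then averages the per-attribute results. This is loosely in the spirit of Gower's distance; the schema layout, attribute names, and function names are hypothetical and chosen only for this example.

```python
def interval_distance(x, y, lo, hi):
    """Absolute difference scaled to [0, 1] by the attribute's range."""
    if hi == lo:
        return 0.0  # a constant attribute cannot distinguish instances
    return abs(x - y) / (hi - lo)

def categorical_distance(x, y):
    """No ordering to exploit: values either match (0) or they do not (1)."""
    return 0.0 if x == y else 1.0

def mixed_distance(a, b, schema):
    """Average the per-attribute distances described by the (assumed) schema."""
    total = 0.0
    for name, spec in schema.items():
        if spec["kind"] == "interval":
            total += interval_distance(a[name], b[name], spec["min"], spec["max"])
        elif spec["kind"] == "categorical":
            total += categorical_distance(a[name], b[name])
        else:
            raise ValueError(f"unknown attribute kind: {spec['kind']}")
    return total / len(schema)

# Hypothetical schema: one interval attribute and one categorical attribute.
schema = {
    "shoe_size": {"kind": "interval", "min": 5, "max": 15},
    "shoe_color": {"kind": "categorical"},
}
a = {"shoe_size": 9, "shoe_color": "black"}
b = {"shoe_size": 11, "shoe_color": "brown"}
print(mixed_distance(a, b, schema))
```

Ranked values could be handled by first mapping each value to its position in the ordering and then treating that position as an interval value.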
Some data mining tools require users to classify the data themselves and let them supply distance functions, while others apply statistical analysis and employ heuristics to assign a distance function. And these functions become integral to other data mining activities as well. For example, consider:
- Case-based reasoning: this compares new situations to a model built from existing instances and their outcomes. New instances are matched against the model to find the closest matches, and suggestions for actions are made based on outcome probabilities from the model.
- Classification: comparing new data instances against existing group profiles to assign the new instance to an existing group also needs similarity functions.
- Link analysis: seeking to connect individual instances together yet again requires the ability to match records against each other to determine closeness or similarity.
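All three activities above boil down to finding the closest existing instances to a new one. As a minimal sketch, here is a nearest-neighbor classifier that assigns a new instance to the group most common among its k closest labeled instances. Nearest-neighbor voting is our choice for illustration (many other classification methods exist), and the function names, default k, and sample data are assumptions.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric instances."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(new_instance, labeled_instances, k=3, distance=euclidean):
    """Label a new instance by majority vote among its k closest neighbors.

    labeled_instances is a list of (attribute_values, label) pairs.
    """
    # Rank every known instance by its distance to the new one, keep the k nearest.
    nearest = sorted(
        labeled_instances,
        key=lambda pair: distance(new_instance, pair[0]),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up sample data: two well-separated groups.
known = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]
print(classify((1.1, 1.0), known))
```

Because the distance function is passed in as a parameter, the same matching logic can be reused with whatever similarity measure suits the data at hand.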
There are more examples in wide use within the descriptive and predictive analytics world, so understanding the value of good similarity functions will go a long way toward applying data mining techniques!