Task : Scalable Clustering

Problem description: When clustering large data sets it is often not necessary to use the entire data set to construct a clustering model. Most clustering algorithms scale either linearly or quadratically in the size of a data set. Thus, the run-times for clustering methods can become burdensome when using large data sets. Meek, Thiesson, and Heckerman (2001) provide a method for selecting the number of points to use during the construction of a clustering model and demonstrate the method can lead to substantial speedups while obtaining clusters with comparable predictive power. Their analysis is performed on several data sets including the USCensus1990 data set.

Meek, Thiesson, and Heckerman (2001), "The Learning Curve Method Applied to Clustering", to appear in The Journal of Machine Learning Research. (Also see MSR-TR-2001-34

The UCI KDD Archive
Information and Computer Science
University of California, Irvine
Irvine, CA 92697-3425
Last modified: 6 Nov 2001