Wednesday, November 18, 2009

Research Notes 4: One-Pass Clustering Algorithms

The development of data stream clustering algorithms starts from addressing the one-pass requirement. Among the one-pass clustering algorithms, BIRCH (1996) and STREAM (2000-2003) are two of the frequently cited.

BIRCH can be seen as a two-phase clustering method. In the first phase, all the data points are scanned once and inserted into Clustering Feature (CF) tree according to specified branching factor, threshold and number of entries in leaf node. The nodes in CF tree are primitive clusters and described by CFs that allow computation of cluster centers and distances. If exceeding the memory limit, the threshold value is increased to rebuild a new, smaller CF tree. In the second phase, global clustering, which can be k-means or agglomerative hierarchical clustering, is done on the primitive clusters according to user specification.

STREAM formulates the data stream clustering as a problem of Sum-of-Square (SSQ) distance minimization where a finite set of points can only be read in sequence. In the STREAM algorithm, k weighted clusters centers are derived for each data chunk. While the total number of cluster centers are exceeding the memory limits, the cluster centers will be revised and only k weighted clusters are retained. The STREAM with LSEARCH algorithm as a k-Median subroutine is tested to give the minimum SSQ.

No comments:

Post a Comment