
Sunday, November 22, 2009

Research Notes 5: Addressing Concept Drift in Data Stream Clustering

Besides the one-pass requirement, data stream algorithms also need to address concept drift, a well-known issue in the data stream mining community. In simple terms, concept drift means that the target variable the model is trying to predict changes over time in unforeseen ways. Among the data stream clustering algorithms, CluStream (2003), HPStream (2004), Den-Stream (2006) and D-Stream (2007) are frequently cited.

CluStream extends the two-phase clustering approach and the Clustering Feature (CF) from BIRCH by storing the data scan results in a pyramidal time frame. In the online summary phase, the k-means algorithm is used to cluster the incoming data points into micro-clusters, which are described by CFs. Snapshots of the micro-clustering results are stored according to the pyramidal time frame. Because the micro-clustering results possess the subtractive property, the user can cluster over different portions of the stream, which is how concept drift is handled. In the offline clustering phase, the k-means algorithm is again used to cluster the difference between two snapshots specified by the user.
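The subtractive property is easiest to see on the CF summary itself. Below is a minimal Python sketch of an additive/subtractive clustering feature; it is illustrative only and omits the temporal sums and the pyramidal snapshot schedule that the actual CluStream micro-clusters maintain.

```python
import copy
import numpy as np

class ClusteringFeature:
    def __init__(self, dim):
        self.n = 0                      # number of points absorbed
        self.ls = np.zeros(dim)         # linear sum of the points
        self.ss = np.zeros(dim)         # per-dimension sum of squares

    def absorb(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += x * x

    def centroid(self):
        return self.ls / self.n

    def subtract(self, older):
        # subtractive property: CF(t2) - CF(t1) summarizes exactly the
        # points that arrived between the two snapshots
        diff = ClusteringFeature(len(self.ls))
        diff.n = self.n - older.n
        diff.ls = self.ls - older.ls
        diff.ss = self.ss - older.ss
        return diff

# snapshot at t1, absorb more points, snapshot at t2, then subtract
cf = ClusteringFeature(2)
for x in [[1, 1], [2, 0]]:
    cf.absorb(x)
snapshot_t1 = copy.deepcopy(cf)
for x in [[10, 10], [12, 8]]:
    cf.absorb(x)
window = cf.subtract(snapshot_t1)
print(window.n, window.centroid())      # 2 [11.  9.]
```

Clustering the subtracted summary therefore reflects only the recent portion of the stream, rather than the whole history.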

HPStream addresses the challenge of clustering high dimensional data streams. In high dimensional data, distances computed over all dimensions become meaningless (commonly known as the curse of dimensionality). Hence, a feasible way is to compute distances on a projected set of dimensions for each cluster individually during clustering. Finding and maintaining a relevant set of dimensions for each cluster increases the computational difficulty, especially in the data stream environment. To address this issue, HPStream incorporates a fading cluster structure, based on the CF, for projected distance computation in high dimensional data stream clustering. The fading cluster structure carries a weight given by a fading function of the time stamps of all data points assigned to the cluster, as sketched below. The cluster with the smallest weight is removed to make room for a new cluster whenever memory is full.
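A minimal sketch of such a fading cluster structure, assuming an exponential fading function f(Δt) = 2^(-λ·Δt); the projected-dimension bookkeeping of the actual HPStream algorithm is omitted and all names are illustrative.

```python
import numpy as np

class FadingCluster:
    def __init__(self, dim, lam=0.1):
        self.lam = lam                  # decay rate of the fading function
        self.weight = 0.0               # sum of faded point weights
        self.ls = np.zeros(dim)         # faded linear sum
        self.last_update = 0.0

    def decay_to(self, t):
        # exponential fading: everything seen so far loses weight over time
        factor = 2.0 ** (-self.lam * (t - self.last_update))
        self.weight *= factor
        self.ls *= factor
        self.last_update = t

    def absorb(self, x, t):
        self.decay_to(t)
        self.weight += 1.0
        self.ls += np.asarray(x, dtype=float)

    def centroid(self):
        return self.ls / self.weight

# when memory is full, the cluster with the smallest faded weight is evicted
clusters = [FadingCluster(2), FadingCluster(2)]
clusters[0].absorb([1, 1], t=0)
clusters[1].absorb([5, 5], t=10)
for c in clusters:
    c.decay_to(12)                      # bring every weight to the same time
victim = min(clusters, key=lambda c: c.weight)
```

Clusters that stop receiving points thus lose weight automatically, which is what lets the summary drift with the stream.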

Den-Stream and D-Stream address the following additional issues in data stream clustering:

  • Handling of outliers

  • No assumption on number of clusters

  • Arbitrary cluster shape


To handle outliers, Den-Stream enhances the micro-cluster concept of CluStream by differentiating micro-clusters into potential core-micro-clusters and outlier micro-clusters. The differentiating factor is the weight of the cluster, similar to HPStream. In the online summary phase, new incoming data points are first merged into existing micro-clusters. The micro-clusters are then updated to determine whether their types have changed or whether they should be deleted. Proofs are given to ensure that deleting micro-clusters does not affect the offline clustering result. In the offline clustering phase, the potential core-micro-clusters are used to generate the final clusters with the DBSCAN algorithm.
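A minimal sketch of the weight-based distinction between potential core-micro-clusters and outlier micro-clusters; the threshold β·μ and the decay rate are illustrative parameters, not tuned values from the paper, and the cluster statistics (CF sums) are left out.

```python
LAM = 0.25           # decay rate lambda (illustrative)
MU = 10.0            # core weight threshold mu (illustrative)
BETA = 0.2           # outlier tolerance beta (illustrative)

class MicroCluster:
    def __init__(self, t_created):
        self.weight = 0.0
        self.last_update = t_created

    def absorb(self, t):
        # decay the existing weight, then add the new point's unit weight
        self.weight *= 2.0 ** (-LAM * (t - self.last_update))
        self.weight += 1.0
        self.last_update = t

    def kind(self, t):
        w = self.weight * 2.0 ** (-LAM * (t - self.last_update))
        if w >= BETA * MU:
            return "potential-core"     # kept and used in the offline DBSCAN step
        return "outlier"                # candidate for deletion if it stays light

mc = MicroCluster(t_created=0)
for t in range(5):
    mc.absorb(t)
print(mc.kind(t=5), mc.kind(t=50))      # recently-fed vs long-stale cluster
```

A micro-cluster that keeps receiving points stays above the threshold and remains a potential core; one that goes stale decays below it and becomes a deletion candidate.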

D-Stream manages the online summary phase with a grid-based approach. As in other grid-based approaches, each grid cell is formed by selecting one interval from each dimension. In the online summary phase, each existing grid in D-Stream is assigned a density value, maintained with a decay mechanism similar to HPStream's, according to the incoming data points. The existing grids are divided into dense grids, transitional grids and sporadic grids according to their density values, and grids are deleted when their density falls below a certain threshold, so all retained grids stay above a cutoff. Proofs are also given to ensure that deleting grids does not affect the offline clustering result. In the offline clustering phase, only dense grids and transitional grids are clustered, using a grid-based clustering approach.
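A minimal sketch of the grid mapping and decayed grid density, assuming equal-width intervals on every dimension and illustrative thresholds; the actual D-Stream derives its dense/sparse thresholds from the decay factor and the total number of grids, and handles sporadic-grid removal more carefully.

```python
from collections import defaultdict

LAM = 0.998          # per-time-step decay factor (illustrative)
WIDTH = 0.1          # interval width on every dimension (illustrative)
D_M, D_L = 3.0, 0.8  # dense / sparse density thresholds (illustrative)

grids = defaultdict(lambda: {"density": 0.0, "last_update": 0})

def grid_key(point):
    # a grid cell is one interval per dimension
    return tuple(int(v // WIDTH) for v in point)

def insert(point, t):
    g = grids[grid_key(point)]
    g["density"] = g["density"] * (LAM ** (t - g["last_update"])) + 1.0
    g["last_update"] = t

def classify(key, t):
    d = grids[key]["density"] * (LAM ** (t - grids[key]["last_update"]))
    if d >= D_M:
        return "dense"
    if d <= D_L:
        return "sporadic"               # candidate for deletion
    return "transitional"

for t, p in enumerate([(0.05, 0.07), (0.06, 0.02), (0.04, 0.09),
                       (0.02, 0.03), (0.91, 0.95)]):
    insert(p, t)
print({k: classify(k, t=5) for k in grids})
# {(0, 0): 'dense', (9, 9): 'transitional'}
```

Only dense and transitional grids would then be passed to the offline, grid-based clustering step.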

Wednesday, November 18, 2009

Research Notes 4: One-Pass Clustering Algorithms

The development of data stream clustering algorithms started with addressing the one-pass requirement. Among the one-pass clustering algorithms, BIRCH (1996) and STREAM (2000-2003) are two of the most frequently cited.

BIRCH can be seen as a two-phase clustering method. In the first phase, all the data points are scanned once and inserted into a Clustering Feature (CF) tree according to a specified branching factor, threshold and number of entries per leaf node. The nodes in the CF tree are primitive clusters, described by CFs that allow computation of cluster centers and distances. If the memory limit is exceeded, the threshold value is increased and a new, smaller CF tree is rebuilt. In the second phase, global clustering, which can be k-means or agglomerative hierarchical clustering, is performed on the primitive clusters according to the user's specification.
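A minimal sketch of the CF triple (N, LS, SS) and the threshold test used when a point is inserted into a leaf entry; the tree structure, branching factor and rebuilding are omitted, and all names are illustrative.

```python
import numpy as np

class CF:
    def __init__(self, dim):
        self.n = 0
        self.ls = np.zeros(dim)         # linear sum of the points
        self.ss = 0.0                   # sum of squared norms

    def centroid(self):
        return self.ls / self.n

    def radius_if_added(self, x):
        # radius of the entry after tentatively absorbing x
        n, ls, ss = self.n + 1, self.ls + x, self.ss + x @ x
        c = ls / n
        return np.sqrt(max(ss / n - c @ c, 0.0))

    def absorb(self, x):
        self.n += 1
        self.ls += x
        self.ss += x @ x

def insert(entries, x, threshold, dim=2):
    # absorb x into the closest entry if its radius stays under the
    # threshold; otherwise start a new entry (splitting/rebuilding omitted)
    if entries:
        best = min(entries, key=lambda e: np.linalg.norm(e.centroid() - x))
        if best.radius_if_added(x) <= threshold:
            best.absorb(x)
            return
    entry = CF(dim)
    entry.absorb(x)
    entries.append(entry)

leaf = []
for p in [[0, 0], [0.2, 0.1], [5, 5], [5.1, 4.9]]:
    insert(leaf, np.array(p, dtype=float), threshold=0.5)
print(len(leaf), [e.centroid() for e in leaf])   # two primitive clusters
```

Increasing the threshold makes each entry absorb more points, which is why a larger threshold yields a smaller tree when memory runs out.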

STREAM formulates data stream clustering as a sum-of-squares (SSQ) distance minimization problem in which a finite set of points can only be read in sequence. In the STREAM algorithm, k weighted cluster centers are derived for each data chunk. When the total number of retained cluster centers exceeds the memory limit, the centers themselves are re-clustered and only k weighted centers are kept. STREAM with the LSEARCH algorithm as its k-median subroutine was tested to give the minimum SSQ.
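A minimal sketch of this chunk-and-recluster scheme; a weighted k-means from scikit-learn stands in for the paper's LSEARCH k-median subroutine, and the chunk size, center limit and data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def weighted_k_centers(points, weights, k):
    # cluster weighted points and return k centers with accumulated weights
    km = KMeans(n_clusters=k, n_init=10).fit(points, sample_weight=weights)
    new_weights = np.bincount(km.labels_, weights=weights, minlength=k)
    return km.cluster_centers_, new_weights

def stream_cluster(stream, k=3, chunk_size=200, max_centers=12):
    centers, weights = np.empty((0, stream.shape[1])), np.empty(0)
    for start in range(0, len(stream), chunk_size):
        chunk = stream[start:start + chunk_size]
        c, w = weighted_k_centers(chunk, np.ones(len(chunk)), k)
        centers = np.vstack([centers, c])
        weights = np.concatenate([weights, w])
        if len(centers) > max_centers:          # memory limit reached:
            centers, weights = weighted_k_centers(centers, weights, k)
    # final pass: reduce whatever is retained to k weighted centers
    return weighted_k_centers(centers, weights, k)

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc, 0.3, size=(500, 2)) for loc in (0, 5, 10)])
rng.shuffle(data)
final_centers, final_weights = stream_cluster(data, k=3)
print(final_centers, final_weights)
```

Re-clustering weighted centers instead of raw points is what keeps the memory footprint bounded while the stream grows.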

Monday, November 16, 2009

Research Notes 3: Motivation of Data Stream Clustering Algorithms

Clustering algorithms separate a set of data points into two or more groups of similar points. Their performance is measured by objective functions based on similarity.

Traditional clustering algorithms, such as K-means, hierarchical clustering, DBSCAN and SOM, revisit the data points several times to optimize the performance indicators.

However, revisiting data points is not possible in data stream clustering because the data volume grows over time. This limitation stimulates the development of data stream clustering algorithms. A short overview of data stream mining algorithms will follow in the subsequent posts.