Sunday, November 22, 2009

Research Notes 5: Addressing Concept Drift in Data Stream Clustering

Besides the one-pass requirement, the data stream algorithms need to address the concept drift issue which is well-known in data stream mining community. In simple words, concept drift means the target variable, which the model is trying to predict, changes over in time in unforeseen ways. Among the data stream clustering algorithms, CluStream (2003), HPStream (2004), Den-Stream (2006) and D-Stream (2007) are frequently cited.

CluStream extends the two-phase clustering approach and Clustering Feature (CF) from BIRCH by storing the data scan results in pyramidal time frame. In the online summary phase, k-means algorithm are used to cluster the data points into micro-clusters which are described by CFs. Snapshots of the micro-clustering results are stored according to pyramidal time frame. The micro-clustering results possess the subtractive property. Thus, the subtractive property allows the user to cluster over different portions of the stream and hence handles concept drift. In the offline clustering phase, k-means algorithm is again used to cluster the difference of the two snapshots specified by users.

HPStream addresses the challenge to perform data stream clustering on high dimensional data. In high dimensional data, the distances on all dimension become meaningless (which is commonly known as the curse of dimensionality). Hence, a feasible way is to compute the distances on projected dimensions for each cluster individually during clustering. Finding and maintaining a relevant set of dimensions for each cluster increases the computational difficulty especially in the data stream environment. To address this issue, HPStream incorporates fading cluster structure based on CF for projected distance computation in high dimensional data stream clustering. The fading cluster structure contains a weight which is a fading function of time stamps of all data points assigned to the cluster. The cluster with the least weight will be removed for new incoming cluster whenever memory is full.

Den-Stream and D-Stream proposes the following additional issues in data stream clustering:

  • Handle of outliers

  • No assumption on number of clusters

  • Arbitrary cluster shape


To handle the outliers, Den-Stream enhances the micro-cluster concept in CluStream by differentiating the micro-clusters into potential core-micro-clusters and outlier micro-clusters. The differentiating factor is the weight of cluster, which is similar to HPStream. In the online summary phase, the new incoming data points are firstly fitted in existing clusters. Then, the clusters are updated to determine whether the types of micro-clusters are changed or they should be deleted. Proofs are done to ensure the deletion of micro-clusters do not impact the offline clustering result. In the offline clustering phase, the potential core-micro-clusters are used to generate the final clusters using DBSCAN algorithm.

D-Stream manages the online summary phase with grid-based approach. Similar to other grid-based approach, each grid is formed on all dimensions by selecting one interval from each dimension. In the online summary phase, each existing grid in D-Stream is given a density value which is similar to HPStream according to incoming data points. All the density values of existing grids are above a cutoff threshold. The existing grids are divided into dense grids, transitional grids and sporadic grids according to their density values. Those grids are deleted if their density values fall below certain threshold. Proofs are also done to ensure the deletion of grids do not impact the offline clustering result. In the offline clustering phase, only dense grids and transitional grids are clustered with grid-based clustering approach.

1 comment: