The traditional classification algorithms work on small data sets which has limited data samples (typically less than 10,000) samples. The main challenge of these algorithms is to get accurate model without overfitting. In contrast, data stream classification algorithms are performed on the continuously growing data volume where only one sequential data scan is allowed. Thus, the algorithms should work on limited amount of memory and time. In addition, the models trained by the algorithms should not be dominated by the old data.
Very Fast Decision Tree (VFDT) (2000) addresses the above challenges by introducing stricter condition during node splitting, which is the core component in the decision tree learning algorithms. The decision tree is a popular classification model due to its ease understanding. VFDT realizes the stricter condition with Hoeffding bound, which guarantees the correct data attribute to be chosen for node splitting with a desired probability according to the current data statistics regardless the type of underlying data distribution. In detailed, the Hoeffding bound guarantees the significance of difference in heuristic measures of choosing different data attribute. The impact of using Hoeffding bound during node splitting is that the correctness of node splitting is ensured without the scanning on whole data. To validate Hoeffding bound in online fashion, VFDT collects data statistics with histogram-like data structures on leaf nodes.
The Heoffding bound assumes stationary data distribution while data stream classification encounters concept drift issue. CVFDT (2001) handles this issue by growing alternate trees and subtrees. In addition, CVFDT also keeps a window of recent data to preriodically determine whether the alternate trees and subtrees should replaced the current tree.
Thursday, November 26, 2009
Sunday, November 22, 2009
Research Notes 5: Addressing Concept Drift in Data Stream Clustering
Besides the one-pass requirement, the data stream algorithms need to address the concept drift issue which is well-known in data stream mining community. In simple words, concept drift means the target variable, which the model is trying to predict, changes over in time in unforeseen ways. Among the data stream clustering algorithms, CluStream (2003), HPStream (2004), Den-Stream (2006) and D-Stream (2007) are frequently cited.
CluStream extends the two-phase clustering approach and Clustering Feature (CF) from BIRCH by storing the data scan results in pyramidal time frame. In the online summary phase, k-means algorithm are used to cluster the data points into micro-clusters which are described by CFs. Snapshots of the micro-clustering results are stored according to pyramidal time frame. The micro-clustering results possess the subtractive property. Thus, the subtractive property allows the user to cluster over different portions of the stream and hence handles concept drift. In the offline clustering phase, k-means algorithm is again used to cluster the difference of the two snapshots specified by users.
HPStream addresses the challenge to perform data stream clustering on high dimensional data. In high dimensional data, the distances on all dimension become meaningless (which is commonly known as the curse of dimensionality). Hence, a feasible way is to compute the distances on projected dimensions for each cluster individually during clustering. Finding and maintaining a relevant set of dimensions for each cluster increases the computational difficulty especially in the data stream environment. To address this issue, HPStream incorporates fading cluster structure based on CF for projected distance computation in high dimensional data stream clustering. The fading cluster structure contains a weight which is a fading function of time stamps of all data points assigned to the cluster. The cluster with the least weight will be removed for new incoming cluster whenever memory is full.
Den-Stream and D-Stream proposes the following additional issues in data stream clustering:
To handle the outliers, Den-Stream enhances the micro-cluster concept in CluStream by differentiating the micro-clusters into potential core-micro-clusters and outlier micro-clusters. The differentiating factor is the weight of cluster, which is similar to HPStream. In the online summary phase, the new incoming data points are firstly fitted in existing clusters. Then, the clusters are updated to determine whether the types of micro-clusters are changed or they should be deleted. Proofs are done to ensure the deletion of micro-clusters do not impact the offline clustering result. In the offline clustering phase, the potential core-micro-clusters are used to generate the final clusters using DBSCAN algorithm.
D-Stream manages the online summary phase with grid-based approach. Similar to other grid-based approach, each grid is formed on all dimensions by selecting one interval from each dimension. In the online summary phase, each existing grid in D-Stream is given a density value which is similar to HPStream according to incoming data points. All the density values of existing grids are above a cutoff threshold. The existing grids are divided into dense grids, transitional grids and sporadic grids according to their density values. Those grids are deleted if their density values fall below certain threshold. Proofs are also done to ensure the deletion of grids do not impact the offline clustering result. In the offline clustering phase, only dense grids and transitional grids are clustered with grid-based clustering approach.
CluStream extends the two-phase clustering approach and Clustering Feature (CF) from BIRCH by storing the data scan results in pyramidal time frame. In the online summary phase, k-means algorithm are used to cluster the data points into micro-clusters which are described by CFs. Snapshots of the micro-clustering results are stored according to pyramidal time frame. The micro-clustering results possess the subtractive property. Thus, the subtractive property allows the user to cluster over different portions of the stream and hence handles concept drift. In the offline clustering phase, k-means algorithm is again used to cluster the difference of the two snapshots specified by users.
HPStream addresses the challenge to perform data stream clustering on high dimensional data. In high dimensional data, the distances on all dimension become meaningless (which is commonly known as the curse of dimensionality). Hence, a feasible way is to compute the distances on projected dimensions for each cluster individually during clustering. Finding and maintaining a relevant set of dimensions for each cluster increases the computational difficulty especially in the data stream environment. To address this issue, HPStream incorporates fading cluster structure based on CF for projected distance computation in high dimensional data stream clustering. The fading cluster structure contains a weight which is a fading function of time stamps of all data points assigned to the cluster. The cluster with the least weight will be removed for new incoming cluster whenever memory is full.
Den-Stream and D-Stream proposes the following additional issues in data stream clustering:
- Handle of outliers
- No assumption on number of clusters
- Arbitrary cluster shape
To handle the outliers, Den-Stream enhances the micro-cluster concept in CluStream by differentiating the micro-clusters into potential core-micro-clusters and outlier micro-clusters. The differentiating factor is the weight of cluster, which is similar to HPStream. In the online summary phase, the new incoming data points are firstly fitted in existing clusters. Then, the clusters are updated to determine whether the types of micro-clusters are changed or they should be deleted. Proofs are done to ensure the deletion of micro-clusters do not impact the offline clustering result. In the offline clustering phase, the potential core-micro-clusters are used to generate the final clusters using DBSCAN algorithm.
D-Stream manages the online summary phase with grid-based approach. Similar to other grid-based approach, each grid is formed on all dimensions by selecting one interval from each dimension. In the online summary phase, each existing grid in D-Stream is given a density value which is similar to HPStream according to incoming data points. All the density values of existing grids are above a cutoff threshold. The existing grids are divided into dense grids, transitional grids and sporadic grids according to their density values. Those grids are deleted if their density values fall below certain threshold. Proofs are also done to ensure the deletion of grids do not impact the offline clustering result. In the offline clustering phase, only dense grids and transitional grids are clustered with grid-based clustering approach.
Wednesday, November 18, 2009
Research Notes 4: One-Pass Clustering Algorithms
The development of data stream clustering algorithms starts from addressing the one-pass requirement. Among the one-pass clustering algorithms, BIRCH (1996) and STREAM (2000-2003) are two of the frequently cited.
BIRCH can be seen as a two-phase clustering method. In the first phase, all the data points are scanned once and inserted into Clustering Feature (CF) tree according to specified branching factor, threshold and number of entries in leaf node. The nodes in CF tree are primitive clusters and described by CFs that allow computation of cluster centers and distances. If exceeding the memory limit, the threshold value is increased to rebuild a new, smaller CF tree. In the second phase, global clustering, which can be k-means or agglomerative hierarchical clustering, is done on the primitive clusters according to user specification.
STREAM formulates the data stream clustering as a problem of Sum-of-Square (SSQ) distance minimization where a finite set of points can only be read in sequence. In the STREAM algorithm, k weighted clusters centers are derived for each data chunk. While the total number of cluster centers are exceeding the memory limits, the cluster centers will be revised and only k weighted clusters are retained. The STREAM with LSEARCH algorithm as a k-Median subroutine is tested to give the minimum SSQ.
BIRCH can be seen as a two-phase clustering method. In the first phase, all the data points are scanned once and inserted into Clustering Feature (CF) tree according to specified branching factor, threshold and number of entries in leaf node. The nodes in CF tree are primitive clusters and described by CFs that allow computation of cluster centers and distances. If exceeding the memory limit, the threshold value is increased to rebuild a new, smaller CF tree. In the second phase, global clustering, which can be k-means or agglomerative hierarchical clustering, is done on the primitive clusters according to user specification.
STREAM formulates the data stream clustering as a problem of Sum-of-Square (SSQ) distance minimization where a finite set of points can only be read in sequence. In the STREAM algorithm, k weighted clusters centers are derived for each data chunk. While the total number of cluster centers are exceeding the memory limits, the cluster centers will be revised and only k weighted clusters are retained. The STREAM with LSEARCH algorithm as a k-Median subroutine is tested to give the minimum SSQ.
Monday, November 16, 2009
Research Notes 3: Motivation of Data Stream Clustering Algorithms
Clustering algorithms separate a set of data points into two or more groups of similar points. Their performance indicators are the objective functions related to similarity.
Traditional clustering algorithms, such as K-means, hierarchical clustering, DBSCAN and SOM, revisit the data points several times to optimize the performance indicators.
However, the revisit of data points is not allowed in data stream clustering as the data volume grows along the time. This stimulates the development of data stream clustering algorithms. The short overview on data stream mining algorithms will be the subsequent post.
Traditional clustering algorithms, such as K-means, hierarchical clustering, DBSCAN and SOM, revisit the data points several times to optimize the performance indicators.
However, the revisit of data points is not allowed in data stream clustering as the data volume grows along the time. This stimulates the development of data stream clustering algorithms. The short overview on data stream mining algorithms will be the subsequent post.
Foxpro Notes 2: Steps to Build Foxpro Application
If you meet problems when building a runnable exe with Visual Foxpro 9.0 (VFP9), here are some guides for you. (It takes some time for me to search through the forum, damn Microsoft!)
Note: If you create your own menu and find that you cannot compile the program, change the environment variable "_GENMENU" to locate the "genmenu.prg".
- Create the main form, eg. "frmmain.scx"
- Create a program and set it as main, eg. "prgmain.prg", with following code:
Line 1: ON SHUTDOWN QUIT
Line 2: DO FORM "../forms/frmewsmain.scx"
Line 3: READ EVENTS - Press "Build.." button on "Project Manager" window to create exe
- Copy the "config.fpw" from project folder to the exe folder. Change "SCREEN = OFF" to "SCREEN = ON"
- Get "vfp9r.dll" and "VFP9RENU.DLL" from "\Program Files\Common Files\Microsoft Shared\VFP" to the exe folder
- Place the exe and data properly and you will be able to run the exe
Note: If you create your own menu and find that you cannot compile the program, change the environment variable "_GENMENU" to locate the "genmenu.prg".
Subscribe to:
Posts (Atom)