Thursday, December 10, 2009

(Recommend-Music) Any Other World by Mika

I recently received this nice song from a colleague. As it happens, he resigned yesterday, which is quite a coincidence given the song's title. All the best to him. The lyrics are also attached in this post. Enjoy~



Lyrics

Sunday, December 6, 2009

A Brief Introduction to Academic Majors

Do you feel lost when applying to university, not knowing which department to enter? I studied computer engineering, but I have always been interested in all the different disciplines. Below are some of my thoughts on what each major is about:

  • Computer Engineering (电脑工程): efficiently planning and executing repetitive operations

  • Materials Engineering (材料工程): measuring and synthesizing the properties of materials

  • Mechanical Engineering (机械工程): designing and measuring movable objects

  • Civil Engineering (土木工程): designing and measuring large immovable structures

  • Electrical Engineering (电子工程): designing electrical circuits

  • Economics (经济): discovering and applying supply-and-demand relationships

  • Biology (生物): the evolution and equilibrium of living things

  • Mathematics (数学): developing a quantitative language

  • Physics (物理): expressing and predicting phenomena in that quantitative language

Friday, December 4, 2009

Research Notes 7: Ensemble Learning for Data Stream Classification

Ensemble learning attempts to produce a strong learner by combining many weak learners. It can build a stronger classification model on top of multiple simpler models, such as decision trees, provided there is significant diversity among the models.

In SEA (2001), it is suggested that even simple ensembles can produce results that are comparable to, or better than, those of individual classifiers. SEA is an ensemble learning scheme proposed for data stream classification. In SEA, a new classifier is trained whenever a new chunk of data arrives, and the new classifiers are added to the existing ones to form an ensemble until the memory is full. Once the memory is full, each newly trained classifier is evaluated to decide whether it should replace one of the existing classifiers. This replacement can be seen as a way to handle concept drift. To decide which classifiers to retain, SEA evaluates them with a novel heuristic that favors classifiers that correctly classify instances on which the ensemble is nearly undecided. In the paper, SEA is tested with C4.5 decision trees as the base classifiers.
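
To make the chunk-based scheme concrete, here is a minimal Python sketch (not the authors' code): scikit-learn's DecisionTreeClassifier stands in for C4.5, and plain accuracy on the latest chunk replaces SEA's quality heuristic; the class and parameter names are mine.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SEAEnsemble:
    def __init__(self, capacity=25):
        self.capacity = capacity  # maximum number of member classifiers
        self.members = []

    def predict(self, X):
        # Unweighted majority vote over the members; assumes labels are
        # small non-negative integers.
        votes = np.stack([m.predict(X) for m in self.members])
        return np.apply_along_axis(
            lambda col: np.bincount(col).argmax(), 0, votes)

    def process_chunk(self, X, y):
        new_clf = DecisionTreeClassifier().fit(X, y)
        if len(self.members) < self.capacity:
            self.members.append(new_clf)  # memory not yet full: just add
            return
        # Memory full: replace the weakest member if the new classifier
        # beats it on the latest chunk (this is what handles concept drift).
        scores = [(m.predict(X) == y).mean() for m in self.members]
        worst = int(np.argmin(scores))
        if (new_clf.predict(X) == y).mean() > scores[worst]:
            self.members[worst] = new_clf
```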

Thursday, November 26, 2009

Research Notes 6: VFDT and CVFDT -- Decision Trees for Data Stream Classification

Traditional classification algorithms work on small data sets with a limited number of samples (typically fewer than 10,000). The main challenge for these algorithms is to obtain an accurate model without overfitting. In contrast, data stream classification algorithms operate on a continuously growing data volume where only one sequential scan of the data is allowed. Thus, the algorithms must work within limited memory and time. In addition, the models they train should not be dominated by old data.

Very Fast Decision Tree (VFDT) (2000) addresses the above challenges by introducing a stricter condition during node splitting, which is the core step in decision tree learning algorithms. The decision tree is a popular classification model due to its ease of understanding. VFDT realizes the stricter condition with the Hoeffding bound, which guarantees that the correct data attribute is chosen for node splitting with a desired probability according to the current data statistics, regardless of the type of the underlying data distribution. In detail, the Hoeffding bound guarantees the significance of the difference in the heuristic measures of the candidate attributes. The consequence of using the Hoeffding bound during node splitting is that the correctness of the split is ensured without scanning the whole data set. To evaluate the Hoeffding bound in an online fashion, VFDT collects data statistics with histogram-like data structures at the leaf nodes.
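
Concretely, the Hoeffding bound says that after n independent observations of a random variable with range R, the sample mean deviates from the true mean by more than ε = sqrt(R² ln(1/δ) / (2n)) with probability at most δ. A tiny sketch of the resulting split test (function and parameter names are mine):

```python
import math

def hoeffding_bound(value_range, delta, n):
    # With probability at least 1 - delta, the true mean lies within
    # epsilon of the mean observed over n independent samples.
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_best_gain, n, delta=1e-7, value_range=1.0):
    # Split as soon as the best attribute's lead over the runner-up is
    # statistically significant, i.e. larger than epsilon. For information
    # gain on c-class labels the range R is log2(c), so R = 1 when binary.
    return (best_gain - second_best_gain) > hoeffding_bound(value_range, delta, n)
```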

The Hoeffding bound assumes a stationary data distribution, while data stream classification encounters concept drift. CVFDT (2001) handles this issue by growing alternate trees and subtrees. In addition, CVFDT keeps a window of recent data to periodically determine whether the alternate trees and subtrees should replace the current ones.

Sunday, November 22, 2009

Research Notes 5: Addressing Concept Drift in Data Stream Clustering

Besides the one-pass requirement, data stream algorithms need to address the concept drift issue, which is well known in the data stream mining community. In simple words, concept drift means that the target variable, which the model is trying to predict, changes over time in unforeseen ways. Among the data stream clustering algorithms, CluStream (2003), HPStream (2004), DenStream (2006) and D-Stream (2007) are frequently cited.

CluStream extends the two-phase clustering approach and the Clustering Feature (CF) from BIRCH by storing the data-scan results in a pyramidal time frame. In the online summary phase, the k-means algorithm is used to cluster the data points into micro-clusters, which are described by CFs. Snapshots of the micro-clustering results are stored according to the pyramidal time frame. The micro-clustering results possess a subtractive property, which allows the user to cluster over different portions of the stream and hence handle concept drift. In the offline clustering phase, the k-means algorithm is again used to cluster the difference between two snapshots specified by the user.
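
As an illustration of the subtractive property, here is a simplified CF sketch in Python: only (n, linear sum, squared sum) are kept, whereas CluStream's micro-clusters additionally store temporal statistics; the field names are mine.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CF:
    n: int            # number of absorbed points
    ls: np.ndarray    # linear sum of the points, per dimension
    ss: np.ndarray    # sum of squares, per dimension

    def add(self, x):
        return CF(self.n + 1, self.ls + x, self.ss + x * x)

    def __sub__(self, other):
        # Subtractive property: snapshot(t2) - snapshot(t1) summarizes
        # exactly the points that arrived in (t1, t2].
        return CF(self.n - other.n, self.ls - other.ls, self.ss - other.ss)

    def center(self):
        return self.ls / self.n
```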

HPStream addresses the challenge of performing data stream clustering on high-dimensional data. In high-dimensional data, distances computed over all dimensions become meaningless (commonly known as the curse of dimensionality). Hence, a feasible approach is to compute the distances on a projected subset of dimensions for each cluster individually during clustering. Finding and maintaining a relevant set of dimensions for each cluster increases the computational difficulty, especially in the data stream environment. To address this issue, HPStream incorporates a fading cluster structure, based on the CF, for projected distance computation in high-dimensional data stream clustering. The fading cluster structure carries a weight that is a fading function of the timestamps of all data points assigned to the cluster. The cluster with the least weight is removed to make room for a new cluster whenever the memory is full.
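
A minimal sketch of how such a fading weight could be computed, assuming the common exponential decay form 2^(−λ·age); the function names and the default λ are mine:

```python
def fade(age, lambda_=0.5):
    # Exponential fading: a point's contribution halves every 1/lambda_
    # time units, so old data gradually loses influence.
    return 2.0 ** (-lambda_ * age)

def cluster_weight(point_timestamps, now, lambda_=0.5):
    # A fading cluster structure's weight sums the faded contributions of
    # its points; the least-weighted cluster is evicted when memory fills.
    return sum(fade(now - t, lambda_) for t in point_timestamps)
```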

DenStream and D-Stream address the following additional issues in data stream clustering:

  • Handling of outliers

  • No assumption on the number of clusters

  • Arbitrary cluster shape


To handle outliers, DenStream refines the micro-cluster concept from CluStream by differentiating micro-clusters into potential core-micro-clusters and outlier micro-clusters. The differentiating factor is the weight of the cluster, similar to HPStream. In the online summary phase, each new incoming data point is first merged into an existing cluster. The clusters are then updated to determine whether their types have changed or whether they should be deleted. Proofs are given to ensure that the deletion of micro-clusters does not affect the offline clustering result. In the offline clustering phase, the potential core-micro-clusters are used to generate the final clusters with the DBSCAN algorithm.
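
A rough sketch of the weight-based distinction, using the paper's β and μ notation; the pruning rule shown is a heavy simplification of DenStream's actual bookkeeping:

```python
def micro_cluster_type(weight, beta, mu):
    # A micro-cluster whose weight reaches beta * mu counts as a potential
    # core-micro-cluster; these are the input to the offline DBSCAN phase.
    return "potential core" if weight >= beta * mu else "outlier"

def should_delete(weight, lower_limit):
    # An outlier micro-cluster is pruned once its decaying weight falls
    # below a (time-dependent) lower limit, freeing memory.
    return weight < lower_limit
```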

D-Stream manages the online summary phase with a grid-based approach. As in other grid-based approaches, each grid cell is formed over all dimensions by selecting one interval from each dimension. In the online summary phase, each existing grid in D-Stream is given a density value, similar to HPStream's weight, which is updated according to the incoming data points. The existing grids are divided into dense, transitional and sporadic grids according to their density values, and grids are deleted when their density values fall below a certain cutoff threshold. Proofs are also given to ensure that the deletion of grids does not affect the offline clustering result. In the offline clustering phase, only the dense and transitional grids are clustered with a grid-based clustering approach.
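
A sketch of the per-grid density bookkeeping, assuming the usual exponential decay form; the threshold names and the default decay factor are mine:

```python
def update_density(density, last_update, now, lambda_=0.998, new_point=True):
    # All densities decay by a factor lambda_ per time step; an arriving
    # point adds 1 to the density of the grid cell it falls into.
    density *= lambda_ ** (now - last_update)
    return density + 1.0 if new_point else density

def grid_type(density, dense_threshold, sparse_threshold):
    if density >= dense_threshold:
        return "dense"         # clustered in the offline phase
    if density <= sparse_threshold:
        return "sporadic"      # candidate for deletion
    return "transitional"      # kept; may sit on a cluster boundary
```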

Wednesday, November 18, 2009

Research Notes 4: One-Pass Clustering Algorithms

The development of data stream clustering algorithms started with addressing the one-pass requirement. Among the one-pass clustering algorithms, BIRCH (1996) and STREAM (2000-2003) are two of the most frequently cited.

BIRCH can be seen as a two-phase clustering method. In the first phase, all the data points are scanned once and inserted into a Clustering Feature (CF) tree according to the specified branching factor, threshold and number of entries per leaf node. The nodes in the CF tree are primitive clusters described by CFs, which allow the computation of cluster centers and distances. If the memory limit is exceeded, the threshold value is increased to rebuild a new, smaller CF tree. In the second phase, global clustering, which can be k-means or agglomerative hierarchical clustering, is performed on the primitive clusters according to the user's specification.
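
A minimal sketch of the leaf-level absorption test, with a CF reduced to an (n, ls, ss) tuple; the function names and threshold semantics are simplifications of the paper's:

```python
import numpy as np

def radius_after_insert(cf, x):
    n, ls, ss = cf
    n, ls, ss = n + 1, ls + x, ss + x * x
    centroid = ls / n
    # Radius = root mean squared distance of the points to the centroid.
    return np.sqrt(max(float((ss / n - centroid ** 2).sum()), 0.0))

def try_absorb(cf, x, threshold):
    # A point joins an existing leaf entry only if the resulting radius
    # stays within the threshold T; otherwise the caller creates a new
    # entry (and rebuilds the tree with a larger T if memory runs out).
    if radius_after_insert(cf, x) <= threshold:
        n, ls, ss = cf
        return (n + 1, ls + x, ss + x * x)
    return None
```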

STREAM formulates data stream clustering as a Sum-of-Squares (SSQ) distance minimization problem in which a finite set of points can only be read in sequence. In the STREAM algorithm, k weighted cluster centers are derived for each data chunk. When the total number of retained cluster centers exceeds the memory limit, the centers themselves are re-clustered so that only k weighted centers are kept. In the experiments, STREAM with the LSEARCH algorithm as the k-median subroutine gives the minimum SSQ.
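
A rough sketch of the chunked processing, substituting scikit-learn's KMeans for the LSEARCH k-median subroutine used in the paper; the function names are mine:

```python
import numpy as np
from sklearn.cluster import KMeans

def chunk_to_weighted_centers(X, k):
    # Summarize one chunk as k centers weighted by assignment counts.
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    weights = np.bincount(km.labels_, minlength=k)
    return km.cluster_centers_, weights

def compress(centers, weights, k):
    # When the retained centers exceed the memory limit, re-cluster the
    # weighted centers themselves down to k weighted centers.
    km = KMeans(n_clusters=k, n_init=10).fit(centers, sample_weight=weights)
    new_weights = np.bincount(km.labels_, weights=weights, minlength=k)
    return km.cluster_centers_, new_weights
```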

Monday, November 16, 2009

Research Notes 3: Motivation of Data Stream Clustering Algorithms

Clustering algorithms separate a set of data points into two or more groups of similar points. Their performance indicators are objective functions related to similarity.

Traditional clustering algorithms, such as k-means, hierarchical clustering, DBSCAN and SOM, revisit the data points several times to optimize the performance indicators.

However, revisiting the data points is not allowed in data stream clustering, as the data volume grows over time. This stimulates the development of data stream clustering algorithms. A short overview of data stream mining algorithms will follow in a subsequent post.