Recently I have got the following nice song from my colleague.  Somehow, he has resigned yesterday.  It's quite coincidence with the song name.  All the best to him.  The lyric is also attached in this post.  Enjoy~
Lyric
Thursday, December 10, 2009
Sunday, December 6, 2009
科系简介
是否在要申请大学时很彷徨,不知道要进入哪一个系呢?我是学电脑工程的,我一直对各个科系都很感兴趣。以下是我的一些关于各个科系内容的想法:
- 电脑工程(Computer Engineering):有效率地策划和执行重复的操作
- 材料工程(Material Engineering):测量和合成材料的性质
- 机械工程(Mechanical Engineering):设计和测量可动物件
- 土木工程(Civil Engineering):设计和测量巨型不动物件
- 电子工程(Electrical Engineering):设计电路
- 经济(Economic):发现并应用供需关系
- 生物(Biology):生物的演化与平衡
- 数学(Mathematic):发展量化的语言
- 物理(Physic):用量化语言表达和预测现象
Friday, December 4, 2009
Research Notes 7: Ensemble Learning for Data Stream Classification
Ensemble learning attempts to produce a strong learner by combining many weak learners.  Ensemble learning can train stronger classification model on top of multiple simpler models such as decision trees, if there is significant diversity among the models.
In SEA (2001), it is suggested that even simple ensembles can produce results better that are comparable to individual classifiers. SEA is an ensemble learning scheme proposed for data stream classification. In SEA, a new classifier is trained whenever a new chunk of data arrives. The new classifiers are added to existing classifiers to form an ensemble until the memory is full. Whenever the memory is full, the new classifier is evaluated whether it should replaced one of the existing classifiers. The replacement can be considered as a way to handle concept drift. To decide which classifiers to be retained, SEA evaluate the classifiers with a novel heuristic that favors classifiers that can correctly classify instances on which the ensemble is nearly undecided. In the paper, SEA with C4.5 decision trees are tested.
In SEA (2001), it is suggested that even simple ensembles can produce results better that are comparable to individual classifiers. SEA is an ensemble learning scheme proposed for data stream classification. In SEA, a new classifier is trained whenever a new chunk of data arrives. The new classifiers are added to existing classifiers to form an ensemble until the memory is full. Whenever the memory is full, the new classifier is evaluated whether it should replaced one of the existing classifiers. The replacement can be considered as a way to handle concept drift. To decide which classifiers to be retained, SEA evaluate the classifiers with a novel heuristic that favors classifiers that can correctly classify instances on which the ensemble is nearly undecided. In the paper, SEA with C4.5 decision trees are tested.
Thursday, November 26, 2009
Research Notes 6: VFDT and CVFDT -- Decision Trees for Data Stream Classification
The traditional classification algorithms work on small data sets which has limited data samples (typically less than 10,000) samples.  The main challenge of these algorithms is to get accurate model without overfitting.  In contrast, data stream classification algorithms are performed on the continuously growing data volume where only one sequential data scan is allowed.  Thus, the algorithms should work on limited amount of memory and time.  In addition, the models trained by the algorithms should not be dominated by the old data.
Very Fast Decision Tree (VFDT) (2000) addresses the above challenges by introducing stricter condition during node splitting, which is the core component in the decision tree learning algorithms. The decision tree is a popular classification model due to its ease understanding. VFDT realizes the stricter condition with Hoeffding bound, which guarantees the correct data attribute to be chosen for node splitting with a desired probability according to the current data statistics regardless the type of underlying data distribution. In detailed, the Hoeffding bound guarantees the significance of difference in heuristic measures of choosing different data attribute. The impact of using Hoeffding bound during node splitting is that the correctness of node splitting is ensured without the scanning on whole data. To validate Hoeffding bound in online fashion, VFDT collects data statistics with histogram-like data structures on leaf nodes.
The Heoffding bound assumes stationary data distribution while data stream classification encounters concept drift issue. CVFDT (2001) handles this issue by growing alternate trees and subtrees. In addition, CVFDT also keeps a window of recent data to preriodically determine whether the alternate trees and subtrees should replaced the current tree.
Very Fast Decision Tree (VFDT) (2000) addresses the above challenges by introducing stricter condition during node splitting, which is the core component in the decision tree learning algorithms. The decision tree is a popular classification model due to its ease understanding. VFDT realizes the stricter condition with Hoeffding bound, which guarantees the correct data attribute to be chosen for node splitting with a desired probability according to the current data statistics regardless the type of underlying data distribution. In detailed, the Hoeffding bound guarantees the significance of difference in heuristic measures of choosing different data attribute. The impact of using Hoeffding bound during node splitting is that the correctness of node splitting is ensured without the scanning on whole data. To validate Hoeffding bound in online fashion, VFDT collects data statistics with histogram-like data structures on leaf nodes.
The Heoffding bound assumes stationary data distribution while data stream classification encounters concept drift issue. CVFDT (2001) handles this issue by growing alternate trees and subtrees. In addition, CVFDT also keeps a window of recent data to preriodically determine whether the alternate trees and subtrees should replaced the current tree.
Labels:
Data Stream Classification,
Decision Tree,
Research
Sunday, November 22, 2009
Research Notes 5: Addressing Concept Drift in Data Stream Clustering
Besides the one-pass requirement, the data stream algorithms need to address the concept drift issue which is well-known in data stream mining community.  In simple words, concept drift means the target variable, which the model is trying to predict, changes over in time in unforeseen ways.  Among the data stream clustering algorithms, CluStream (2003), HPStream (2004), Den-Stream (2006) and D-Stream (2007) are frequently cited.
CluStream extends the two-phase clustering approach and Clustering Feature (CF) from BIRCH by storing the data scan results in pyramidal time frame. In the online summary phase, k-means algorithm are used to cluster the data points into micro-clusters which are described by CFs. Snapshots of the micro-clustering results are stored according to pyramidal time frame. The micro-clustering results possess the subtractive property. Thus, the subtractive property allows the user to cluster over different portions of the stream and hence handles concept drift. In the offline clustering phase, k-means algorithm is again used to cluster the difference of the two snapshots specified by users.
HPStream addresses the challenge to perform data stream clustering on high dimensional data. In high dimensional data, the distances on all dimension become meaningless (which is commonly known as the curse of dimensionality). Hence, a feasible way is to compute the distances on projected dimensions for each cluster individually during clustering. Finding and maintaining a relevant set of dimensions for each cluster increases the computational difficulty especially in the data stream environment. To address this issue, HPStream incorporates fading cluster structure based on CF for projected distance computation in high dimensional data stream clustering. The fading cluster structure contains a weight which is a fading function of time stamps of all data points assigned to the cluster. The cluster with the least weight will be removed for new incoming cluster whenever memory is full.
Den-Stream and D-Stream proposes the following additional issues in data stream clustering:
To handle the outliers, Den-Stream enhances the micro-cluster concept in CluStream by differentiating the micro-clusters into potential core-micro-clusters and outlier micro-clusters. The differentiating factor is the weight of cluster, which is similar to HPStream. In the online summary phase, the new incoming data points are firstly fitted in existing clusters. Then, the clusters are updated to determine whether the types of micro-clusters are changed or they should be deleted. Proofs are done to ensure the deletion of micro-clusters do not impact the offline clustering result. In the offline clustering phase, the potential core-micro-clusters are used to generate the final clusters using DBSCAN algorithm.
D-Stream manages the online summary phase with grid-based approach. Similar to other grid-based approach, each grid is formed on all dimensions by selecting one interval from each dimension. In the online summary phase, each existing grid in D-Stream is given a density value which is similar to HPStream according to incoming data points. All the density values of existing grids are above a cutoff threshold. The existing grids are divided into dense grids, transitional grids and sporadic grids according to their density values. Those grids are deleted if their density values fall below certain threshold. Proofs are also done to ensure the deletion of grids do not impact the offline clustering result. In the offline clustering phase, only dense grids and transitional grids are clustered with grid-based clustering approach.
CluStream extends the two-phase clustering approach and Clustering Feature (CF) from BIRCH by storing the data scan results in pyramidal time frame. In the online summary phase, k-means algorithm are used to cluster the data points into micro-clusters which are described by CFs. Snapshots of the micro-clustering results are stored according to pyramidal time frame. The micro-clustering results possess the subtractive property. Thus, the subtractive property allows the user to cluster over different portions of the stream and hence handles concept drift. In the offline clustering phase, k-means algorithm is again used to cluster the difference of the two snapshots specified by users.
HPStream addresses the challenge to perform data stream clustering on high dimensional data. In high dimensional data, the distances on all dimension become meaningless (which is commonly known as the curse of dimensionality). Hence, a feasible way is to compute the distances on projected dimensions for each cluster individually during clustering. Finding and maintaining a relevant set of dimensions for each cluster increases the computational difficulty especially in the data stream environment. To address this issue, HPStream incorporates fading cluster structure based on CF for projected distance computation in high dimensional data stream clustering. The fading cluster structure contains a weight which is a fading function of time stamps of all data points assigned to the cluster. The cluster with the least weight will be removed for new incoming cluster whenever memory is full.
Den-Stream and D-Stream proposes the following additional issues in data stream clustering:
- Handle of outliers
- No assumption on number of clusters
- Arbitrary cluster shape
To handle the outliers, Den-Stream enhances the micro-cluster concept in CluStream by differentiating the micro-clusters into potential core-micro-clusters and outlier micro-clusters. The differentiating factor is the weight of cluster, which is similar to HPStream. In the online summary phase, the new incoming data points are firstly fitted in existing clusters. Then, the clusters are updated to determine whether the types of micro-clusters are changed or they should be deleted. Proofs are done to ensure the deletion of micro-clusters do not impact the offline clustering result. In the offline clustering phase, the potential core-micro-clusters are used to generate the final clusters using DBSCAN algorithm.
D-Stream manages the online summary phase with grid-based approach. Similar to other grid-based approach, each grid is formed on all dimensions by selecting one interval from each dimension. In the online summary phase, each existing grid in D-Stream is given a density value which is similar to HPStream according to incoming data points. All the density values of existing grids are above a cutoff threshold. The existing grids are divided into dense grids, transitional grids and sporadic grids according to their density values. Those grids are deleted if their density values fall below certain threshold. Proofs are also done to ensure the deletion of grids do not impact the offline clustering result. In the offline clustering phase, only dense grids and transitional grids are clustered with grid-based clustering approach.
Wednesday, November 18, 2009
Research Notes 4: One-Pass Clustering Algorithms
The development of data stream clustering algorithms starts from addressing the one-pass requirement.  Among the one-pass clustering algorithms, BIRCH (1996) and STREAM (2000-2003) are two of the frequently cited.
BIRCH can be seen as a two-phase clustering method. In the first phase, all the data points are scanned once and inserted into Clustering Feature (CF) tree according to specified branching factor, threshold and number of entries in leaf node. The nodes in CF tree are primitive clusters and described by CFs that allow computation of cluster centers and distances. If exceeding the memory limit, the threshold value is increased to rebuild a new, smaller CF tree. In the second phase, global clustering, which can be k-means or agglomerative hierarchical clustering, is done on the primitive clusters according to user specification.
STREAM formulates the data stream clustering as a problem of Sum-of-Square (SSQ) distance minimization where a finite set of points can only be read in sequence. In the STREAM algorithm, k weighted clusters centers are derived for each data chunk. While the total number of cluster centers are exceeding the memory limits, the cluster centers will be revised and only k weighted clusters are retained. The STREAM with LSEARCH algorithm as a k-Median subroutine is tested to give the minimum SSQ.
BIRCH can be seen as a two-phase clustering method. In the first phase, all the data points are scanned once and inserted into Clustering Feature (CF) tree according to specified branching factor, threshold and number of entries in leaf node. The nodes in CF tree are primitive clusters and described by CFs that allow computation of cluster centers and distances. If exceeding the memory limit, the threshold value is increased to rebuild a new, smaller CF tree. In the second phase, global clustering, which can be k-means or agglomerative hierarchical clustering, is done on the primitive clusters according to user specification.
STREAM formulates the data stream clustering as a problem of Sum-of-Square (SSQ) distance minimization where a finite set of points can only be read in sequence. In the STREAM algorithm, k weighted clusters centers are derived for each data chunk. While the total number of cluster centers are exceeding the memory limits, the cluster centers will be revised and only k weighted clusters are retained. The STREAM with LSEARCH algorithm as a k-Median subroutine is tested to give the minimum SSQ.
Monday, November 16, 2009
Research Notes 3: Motivation of Data Stream Clustering Algorithms
Clustering algorithms separate a set of data points into two or more groups of similar points.  Their performance indicators are the objective functions related to similarity.  
Traditional clustering algorithms, such as K-means, hierarchical clustering, DBSCAN and SOM, revisit the data points several times to optimize the performance indicators.
However, the revisit of data points is not allowed in data stream clustering as the data volume grows along the time. This stimulates the development of data stream clustering algorithms. The short overview on data stream mining algorithms will be the subsequent post.
Traditional clustering algorithms, such as K-means, hierarchical clustering, DBSCAN and SOM, revisit the data points several times to optimize the performance indicators.
However, the revisit of data points is not allowed in data stream clustering as the data volume grows along the time. This stimulates the development of data stream clustering algorithms. The short overview on data stream mining algorithms will be the subsequent post.
Foxpro Notes 2: Steps to Build Foxpro Application
If you meet problems when building a runnable exe with Visual Foxpro 9.0 (VFP9), here are some guides for you.  (It takes some time for me to search through the forum, damn Microsoft!)
Note: If you create your own menu and find that you cannot compile the program, change the environment variable "_GENMENU" to locate the "genmenu.prg".
- Create the main form, eg. "frmmain.scx"
- Create a program and set it as main, eg. "prgmain.prg", with following code:
 Line 1: ON SHUTDOWN QUIT
 Line 2: DO FORM "../forms/frmewsmain.scx"
 Line 3: READ EVENTS
- Press "Build.." button on "Project Manager" window to create exe
- Copy the "config.fpw" from project folder to the exe folder. Change "SCREEN = OFF" to "SCREEN = ON"
- Get "vfp9r.dll" and "VFP9RENU.DLL" from "\Program Files\Common Files\Microsoft Shared\VFP" to the exe folder
- Place the exe and data properly and you will be able to run the exe
Note: If you create your own menu and find that you cannot compile the program, change the environment variable "_GENMENU" to locate the "genmenu.prg".
Sunday, October 18, 2009
Foxpro Notes 1: Modification on Foxpro Auto-generated Stored Procedure
Last tuesday, I was being asked to support development with Visual Foxpro 9.0.  I have setup Foxpro database according to specification except foreign key constraint.  The auto-generated foreign key constraint will restrict the direct data entry from the form.
I have looked for help at "www.foxite.com". Tushar from India has provided me 2 clues to derive the solution. Really appreciate for his help.
I have looked for help at "www.foxite.com". Tushar from India has provided me 2 clues to derive the solution. Really appreciate for his help.
- Add "IF !EMPTY(foreignkey)" to bypass the foreign key constraint checking code
- Add round brackets () for the variable lcParentWkArea in statement "unlock record pnParentRec in lcParentWkArea". Thhe new statement is "unlock record pnParentRec in (lcParentWkArea)"
Tuesday, October 13, 2009
Web Notes 1: Way to Setup CGI Environment
Recently, I have written javascript to paste data from excel sheet into input elements of web form.  The javascript function invoked on "onpaste" event is successful implemented on IE but the integration of javascript into cgi script fails due to my empty knowledge about cgi.  Thus, i decided to setup cgi environment in company's computer.  
Wah, it's really a hard time to search and understand the pieces of relevant information. I have spent whole morning and finally success. The following are the main steps to setup CGI environment in Window XP with Apache Tomcat server. The IIS may also be used as the server.
You can refer to http://tomcat.apache.org/tomcat-6.0-doc/cgi-howto.html for more details to setup cgi environment in Apache Tomcat 6.0. The debug information can be found in "logs" folder at Apache root folder.
Wah, it's really a hard time to search and understand the pieces of relevant information. I have spent whole morning and finally success. The following are the main steps to setup CGI environment in Window XP with Apache Tomcat server. The IIS may also be used as the server.
- Install Perl: I obtained the installer from "http://www.activestate.com/activeperl/"
- Install Apache Tomcat Server: I obtained the installer from "http://tomcat.apache.org/download-60.cgi"
- Uncomment cgi portion in "conf/web.xml" at Apache root folder
- Insert "privileged="true"" in context element of respective xml file of the web applications
- By default in "web.xml", the cgi scripts should be placed in "WEB-INF/cgi".  The following is the sample cgi script for testing:
 #!perlroot/bin/perl.exe
 print "Content-type: text/html\n\n" ;
 print <<HTML ;
 <html>
 <head><title>CGI Results</title></head>
 <body>
 <h1>Hello, world.</h1>
 </body>
 </html>
 HTML
- By default, the cgi scripts can be accessed under "cgi-bin" at the root of web application
You can refer to http://tomcat.apache.org/tomcat-6.0-doc/cgi-howto.html for more details to setup cgi environment in Apache Tomcat 6.0. The debug information can be found in "logs" folder at Apache root folder.
Tuesday, October 6, 2009
Research Notes 2: Literature in Plan
- Data stream mining algorithms: Understand the data stream mining mechanisms that can be integrated and evaluation criteria against concept drift
- Parallel data mining: Understand the frameworks, mechanisms and evaluation criteria
- Ensemble mining: Understand the frameworks and mechanisms
- Multi-Stream Mining: Understand the frameworks, mechanisms and evaluation criteria against concept drift
- Spatio-Temporal Mining: Understand how the mining associate spatial and temporal relationship and evaluation criteria
Sunday, October 4, 2009
Research Notes 1: Notes to be taken when reading literature
Some preliminary thought on what to be taken down when reading literature is listed down.  Just for sharing~~~
- What are the existing work it based on?
 "If I have seen further than others, it is by standing upon the shoulders of giants" ~~ Isaac Newton
 Just like the quote from Newton, the existing work is essential to achieve new contribution. Thus, it is important to have basic understanding on existing work.
- What is its contribution?
 It is critical to understand the paper's contribution, so that we understand the impact and do not repeat their work.
- What is not done and related to our interested area?
 This is the most important part. We will need to advance our research on existing work.
Subscribe to:
Comments (Atom)
 
