Thursday, November 26, 2009

Research Notes 6: VFDT and CVFDT -- Decision Trees for Data Stream Classification

Traditional classification algorithms work on small data sets with a limited number of samples (typically fewer than 10,000). Their main challenge is to learn an accurate model without overfitting. In contrast, data stream classification algorithms operate on a continuously growing volume of data where only one sequential scan is allowed, so they must work within a limited amount of memory and time. In addition, the models they train should not be dominated by old data.

Very Fast Decision Tree (VFDT) (2000) addresses these challenges by introducing a stricter condition for node splitting, the core step in decision tree learning. (The decision tree is a popular classification model because it is easy to understand.) VFDT realizes this stricter condition with the Hoeffding bound, which guarantees that the correct attribute is chosen for splitting with a desired probability based on the current data statistics, regardless of the underlying data distribution. In detail, the Hoeffding bound guarantees that the difference in the heuristic measure (e.g., information gain) between the best and the second-best attribute is statistically significant. The effect of using the Hoeffding bound during node splitting is that the correctness of each split is ensured without scanning the whole data set. To apply the bound in an online fashion, VFDT collects sufficient statistics with histogram-like data structures at the leaf nodes.
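As a rough illustration, the Hoeffding bound for a variable with range R, confidence parameter δ, and n observed examples is ε = sqrt(R² ln(1/δ) / (2n)); a leaf splits on the best attribute once its observed advantage over the runner-up exceeds ε. Below is a minimal sketch of this split test, assuming information gain as the heuristic; the function names are hypothetical and not from the VFDT paper's pseudocode.

import math

def hoeffding_bound(value_range, delta, n):
    # With probability 1 - delta, the true mean of a variable with the given
    # range lies within epsilon of the mean observed over n samples.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_best_gain, value_range, delta, n):
    # Split once the best attribute's observed gain advantage exceeds epsilon,
    # so it is very likely the truly best attribute on the full stream.
    epsilon = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_best_gain) > epsilon

# Example: binary classification (information gain range R = 1), delta = 1e-7,
# after 5,000 examples at a leaf with gains 0.32 vs. 0.25.
print(should_split(0.32, 0.25, 1.0, 1e-7, 5000))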

The Hoeffding bound assumes a stationary data distribution, whereas data stream classification must cope with concept drift. CVFDT (2001) handles this issue by growing alternate trees and subtrees. In addition, CVFDT keeps a window of recent data and periodically checks whether an alternate tree or subtree should replace the current one.
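A simplified sketch of that periodic check is given below, assuming a fixed-size window and check interval (the values and the tree API such as nodes_with_alternates() are hypothetical, not CVFDT's actual parameters): the alternate subtree replaces the current one when it is more accurate on the recent window.

from collections import deque

class DriftMonitor:
    # Keep a window of recent labeled examples and periodically compare each
    # subtree against its alternate on that window, swapping when the
    # alternate is more accurate. Parameters are illustrative only.
    def __init__(self, window_size=10000, check_interval=1000):
        self.window = deque(maxlen=window_size)
        self.check_interval = check_interval
        self.seen = 0

    def observe(self, example, label, tree):
        self.window.append((example, label))
        self.seen += 1
        if self.seen % self.check_interval == 0:
            self.maybe_replace(tree)

    def maybe_replace(self, tree):
        for node in tree.nodes_with_alternates():   # hypothetical tree API
            if accuracy(node.alternate, self.window) > accuracy(node, self.window):
                tree.replace(node, node.alternate)  # hypothetical tree API

def accuracy(subtree, window):
    # Fraction of windowed examples the subtree classifies correctly.
    correct = sum(1 for x, y in window if subtree.predict(x) == y)
    return correct / max(1, len(window))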
