🤖 AI Summary
This work addresses the challenge of constructing decision trees from non-i.i.d. data streams. The authors propose streaming algorithms that compute optimal splits for both regression (minimizing mean squared error) and classification (minimizing misclassification rate or Gini impurity). The method computes optimal split points exactly in sublinear space (O(√n)) using a single pass or a small number of passes. The approach combines online statistical estimation, space-efficient summary structures, and multi-pass optimization, and extends naturally to MapReduce-based distributed deployment. While not directly comparable to the Domingos–Hulten family of algorithms, the framework complements that line of work by providing theoretical guarantees and space efficiency for streaming optimal splitting, making streaming tree models more practical and scalable on large-scale, non-stationary data streams.
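To make the classification objective concrete, here is a minimal batch-mode sketch of the quantity the streaming algorithms target: among all thresholds on a single feature, find the one minimizing the weighted Gini impurity of the resulting two-way split. The function name and interface are illustrative, not from the paper.

```python
from collections import Counter

def best_split_gini(xs, ys):
    """Batch reference: threshold on x minimizing the size-weighted
    Gini impurity of the two sides {x <= t} and {x > t}."""
    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_score, best_thr = float("inf"), None
    for k in range(1, n):  # candidate split between pairs[k-1] and pairs[k]
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_score = score
            best_thr = (pairs[k - 1][0] + pairs[k][0]) / 2
    return best_score, best_thr
```

This brute-force version stores all n points; the point of the paper is to approach the same optimum with sublinear memory over a stream.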
📝 Abstract
In this work, we present data stream algorithms to compute optimal splits for decision tree learning. In particular, given a data stream of observations (x_i) and their corresponding labels (y_i), without the i.i.d. assumption, the objective is to identify the optimal split point (j) that partitions the data into two sets so as to minimize the mean squared error (for regression) or the misclassification rate or Gini impurity (for classification). We propose several efficient streaming algorithms that require sublinear space and a small number of passes to solve these problems. These algorithms can also be extended to the MapReduce model. Our results, while not directly comparable, complement the seminal works of Domingos-Hulten (KDD 2000) and Hulten-Spencer-Domingos (KDD 2001).
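For the regression case, the objective can be sketched as follows in batch mode: predict each side of a candidate split by its mean label and pick the threshold with the smallest total squared error. This is an illustrative reference implementation, not the paper's streaming algorithm, and the function name is hypothetical.

```python
def best_split_mse(xs, ys):
    """Batch reference for the regression objective: find the threshold t
    minimizing the summed squared error when each side of the split
    {x <= t} / {x > t} is predicted by its mean label."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_sse, best_thr = float("inf"), None
    for k in range(1, n):  # candidate split between pairs[k-1] and pairs[k]
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        sse = sum((y - mean_l) ** 2 for y in left) + \
              sum((y - mean_r) ** 2 for y in right)
        if sse < best_sse:
            best_sse = sse
            best_thr = (pairs[k - 1][0] + pairs[k][0]) / 2
    return best_sse, best_thr
```

The streaming setting replaces the sorted in-memory pass over all n points with sublinear-space summaries, while aiming for the same optimal threshold.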