🤖 AI Summary
This work addresses the challenge of constructing decision trees from non-i.i.d. data streams. The authors propose streaming algorithms that compute optimal splits for both regression (minimizing mean squared error) and classification (minimizing misclassification rate or Gini impurity). The method computes optimal split points exactly in sublinear space (O(√n)) using a single pass or a small number of passes. The approach combines online statistical estimation, space-efficient summary structures, and multi-pass optimization, and extends naturally to MapReduce-based distributed deployment. While not directly comparable to the Domingos–Hulten family of algorithms, the framework complements that line of work by providing theoretical guarantees and space efficiency for streaming optimal splitting, making streaming tree models more practical and scalable on large-scale, non-stationary data streams.
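To make the classification objective concrete, here is a minimal batch-mode sketch of the quantity the streaming algorithms target: among all thresholds on a single feature, find the one minimizing the weighted Gini impurity of the resulting two-way split. The function name and interface are illustrative, not from the paper.

```python
from collections import Counter

def best_split_gini(xs, ys):
    """Batch reference: threshold on x minimizing the size-weighted
    Gini impurity of the two sides {x <= t} and {x > t}."""
    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_score, best_thr = float("inf"), None
    for k in range(1, n):  # candidate split between pairs[k-1] and pairs[k]
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_score = score
            best_thr = (pairs[k - 1][0] + pairs[k][0]) / 2
    return best_score, best_thr
```

This brute-force version stores all n points; the point of the paper is to approach the same optimum with sublinear memory over a stream.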
📝 Abstract
In this work, we present data stream algorithms to compute optimal splits for decision tree learning. In particular, given a data stream of observations (x_i) and their corresponding labels (y_i), without the i.i.d. assumption, the objective is to identify the optimal split point (j) that partitions the data into two sets so as to minimize the mean squared error (for regression) or the misclassification rate or Gini impurity (for classification). We propose several efficient streaming algorithms that require sublinear space and a small number of passes to solve these problems. These algorithms can also be extended to the MapReduce model. Our results, while not directly comparable, complement the seminal works of Domingos-Hulten (KDD 2000) and Hulten-Spencer-Domingos (KDD 2001).
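For the regression case, the objective can be sketched as follows in batch mode: predict each side of a candidate split by its mean label and pick the threshold with the smallest total squared error. This is an illustrative reference implementation, not the paper's streaming algorithm, and the function name is hypothetical.

```python
def best_split_mse(xs, ys):
    """Batch reference for the regression objective: find the threshold t
    minimizing the summed squared error when each side of the split
    {x <= t} / {x > t} is predicted by its mean label."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_sse, best_thr = float("inf"), None
    for k in range(1, n):  # candidate split between pairs[k-1] and pairs[k]
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        sse = sum((y - mean_l) ** 2 for y in left) + \
              sum((y - mean_r) ** 2 for y in right)
        if sse < best_sse:
            best_sse = sse
            best_thr = (pairs[k - 1][0] + pairs[k][0]) / 2
    return best_sse, best_thr
```

The streaming setting replaces the sorted in-memory pass over all n points with sublinear-space summaries, while aiming for the same optimal threshold.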