Approximating splits for decision trees quickly in sparse data streams

📅 2026-01-18
🏛️ SDM
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an efficient approximation algorithm for constructing online binary classification decision trees under sparse binary features, enabling rapid identification of near-optimal split points in data streams. It achieves the first $(1+\alpha)$-approximation guarantees for both information gain and Gini impurity in the context of sparse streaming data, leveraging a counter-based approximation strategy combined with sparsity-aware design. Amortized analysis ensures low time complexity, particularly when the feature dimensionality $d$ greatly exceeds the number of non-zero features $m$ per data point. Experimental results demonstrate that the method significantly outperforms existing baselines in runtime while maintaining split quality close to optimal, with empirical approximation quality even exceeding the theoretical guarantees.

📝 Abstract
Decision trees are one of the most popular classifiers in the machine learning literature. While the most common decision tree learning algorithms treat data as a batch, numerous algorithms have been proposed to construct decision trees from a data stream. A standard training strategy involves augmenting the current tree by changing a leaf node into a split. Here we typically maintain counters in each leaf which allow us to determine the optimal split, and whether the split should be made at all. In this paper we focus on how to speed up the search for the optimal split when dealing with sparse binary features and a binary class. We focus on finding splits that have approximately optimal information gain or Gini index. In both cases finding the optimal split can be done in $O(d)$ time, where $d$ is the number of features. We propose an algorithm that yields a $(1 + \alpha)$-approximation when using conditional entropy in amortized $O(\alpha^{-1}(1 + m\log d) \log \log n)$ time, where $m$ is the number of 1s in a data point, and $n$ is the number of data points. Similarly, for the Gini index, we achieve a $(1 + \alpha)$-approximation in amortized $O(\alpha^{-1} + m \log d)$ time. Our approach is beneficial for sparse data where $m \ll d$. In our experiments we find almost-optimal splits efficiently, faster than the baseline, and in practice exceed the theoretical approximation guarantees.
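The counter-based setup the abstract describes can be sketched in a few lines: each leaf keeps, per class, a count of how many points had each binary feature set to 1, so a stream update touches only the $m$ non-zero features, while the exact split search scans all $d$ features. The sketch below is a minimal illustration of this standard $O(d)$ Gini baseline, not the paper's $(1+\alpha)$-approximation algorithm; the class and method names are hypothetical.

```python
from collections import Counter


class LeafCounters:
    """Per-leaf counters for sparse binary features and a binary class.

    ones[c][j] counts class-c points whose feature j equals 1; the
    class totals give the feature = 0 side implicitly, so one stream
    update costs O(m) for a point with m non-zero features.
    """

    def __init__(self, d):
        self.d = d
        self.total = [0, 0]                  # points seen per class
        self.ones = [Counter(), Counter()]   # ones[c][j]

    def update(self, nonzero_features, label):
        """Absorb one stream point, given as its set of non-zero feature ids."""
        self.total[label] += 1
        for j in nonzero_features:
            self.ones[label][j] += 1

    @staticmethod
    def _gini(a, b):
        # Gini impurity of a two-class node with counts (a, b).
        n = a + b
        if n == 0:
            return 0.0
        p = a / n
        return 2.0 * p * (1.0 - p)

    def best_gini_split(self):
        """Exact O(d) scan: weighted Gini impurity of each binary split."""
        n = self.total[0] + self.total[1]
        best_j, best_imp = None, float("inf")
        for j in range(self.d):
            a1, b1 = self.ones[0][j], self.ones[1][j]        # feature = 1 side
            a0, b0 = self.total[0] - a1, self.total[1] - b1  # feature = 0 side
            n1, n0 = a1 + b1, a0 + b0
            imp = (n1 * self._gini(a1, b1) + n0 * self._gini(a0, b0)) / n
            if imp < best_imp:
                best_j, best_imp = j, imp
        return best_j, best_imp


leaf = LeafCounters(d=2)
for _ in range(3):
    leaf.update({0}, 0)   # feature 0 on, class 0
for _ in range(3):
    leaf.update({1}, 1)   # feature 1 on, class 1
j, imp = leaf.best_gini_split()  # either feature separates the classes perfectly
```

The paper's contribution is replacing the `best_gini_split` scan with a sparsity-aware approximate search, so the per-point cost depends on $m$ and $\log d$ rather than on $d$.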
Problem

Research questions and friction points this paper is trying to address.

decision trees
sparse data streams
optimal split
information gain
Gini index
Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse data streams
decision trees
approximate split selection
information gain
Gini index