Approximating splits for decision trees quickly in sparse data streams

📅 2026-01-18
🏛️ SDM
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an efficient approximation algorithm for constructing online binary classification decision trees under sparse binary features, enabling rapid identification of near-optimal split points in data streams. It achieves the first $(1+\alpha)$-approximation guarantees for both information gain and Gini impurity in the context of sparse streaming data, leveraging a counter-based approximation strategy combined with sparsity-aware design. Amortized analysis ensures low time complexity, particularly when the feature dimensionality $d$ greatly exceeds the number of non-zero features $m$ per data point. Experimental results demonstrate that the method significantly outperforms existing baselines in runtime while maintaining split quality close to optimal, with empirical approximation quality even exceeding the theoretical guarantees.

📝 Abstract
Decision trees are one of the most popular classifiers in the machine learning literature. While the most common decision tree learning algorithms treat data as a batch, numerous algorithms have been proposed to construct decision trees from a data stream. A standard training strategy involves augmenting the current tree by changing a leaf node into a split. Here we typically maintain counters in each leaf which allow us to determine the optimal split, and whether the split should be made at all. In this paper we focus on how to speed up the search for the optimal split when dealing with sparse binary features and a binary class. We focus on finding splits that have approximately optimal information gain or Gini index. In both cases finding the optimal split can be done in $O(d)$ time, where $d$ is the number of features. We propose an algorithm that yields a $(1 + \alpha)$-approximation when using conditional entropy in amortized $O(\alpha^{-1}(1 + m\log d) \log \log n)$ time, where $m$ is the number of 1s in a data point, and $n$ is the number of data points. Similarly, for the Gini index, we achieve a $(1 + \alpha)$-approximation in amortized $O(\alpha^{-1} + m \log d)$ time. Our approach is beneficial for sparse data where $m \ll d$. In our experiments we find almost-optimal splits efficiently, faster than the baseline, and in practice exceed the theoretical approximation guarantees.
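The counter-based setup the abstract describes can be sketched in a few lines: each leaf keeps, per class, a count of how many points had each binary feature set to 1, so a stream update touches only the $m$ non-zero features, while the exact split search scans all $d$ features. The sketch below is a minimal illustration of this standard $O(d)$ Gini baseline, not the paper's $(1+\alpha)$-approximation algorithm; the class and method names are hypothetical.

```python
from collections import Counter


class LeafCounters:
    """Per-leaf counters for sparse binary features and a binary class.

    ones[c][j] counts class-c points whose feature j equals 1; the
    class totals give the feature = 0 side implicitly, so one stream
    update costs O(m) for a point with m non-zero features.
    """

    def __init__(self, d):
        self.d = d
        self.total = [0, 0]                  # points seen per class
        self.ones = [Counter(), Counter()]   # ones[c][j]

    def update(self, nonzero_features, label):
        """Absorb one stream point, given as its set of non-zero feature ids."""
        self.total[label] += 1
        for j in nonzero_features:
            self.ones[label][j] += 1

    @staticmethod
    def _gini(a, b):
        # Gini impurity of a two-class node with counts (a, b).
        n = a + b
        if n == 0:
            return 0.0
        p = a / n
        return 2.0 * p * (1.0 - p)

    def best_gini_split(self):
        """Exact O(d) scan: weighted Gini impurity of each binary split."""
        n = self.total[0] + self.total[1]
        best_j, best_imp = None, float("inf")
        for j in range(self.d):
            a1, b1 = self.ones[0][j], self.ones[1][j]        # feature = 1 side
            a0, b0 = self.total[0] - a1, self.total[1] - b1  # feature = 0 side
            n1, n0 = a1 + b1, a0 + b0
            imp = (n1 * self._gini(a1, b1) + n0 * self._gini(a0, b0)) / n
            if imp < best_imp:
                best_j, best_imp = j, imp
        return best_j, best_imp


leaf = LeafCounters(d=2)
for _ in range(3):
    leaf.update({0}, 0)   # feature 0 on, class 0
for _ in range(3):
    leaf.update({1}, 1)   # feature 1 on, class 1
j, imp = leaf.best_gini_split()  # either feature separates the classes perfectly
```

The paper's contribution is replacing the `best_gini_split` scan with a sparsity-aware approximate search, so the per-point cost depends on $m$ and $\log d$ rather than on $d$.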
Problem

Research questions and friction points this paper is trying to address.

decision trees
sparse data streams
optimal split
information gain
Gini index
Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse data streams
decision trees
approximate split selection
information gain
Gini index