Nearly Optimal Bounds for Computing Decision Tree Splits in Data Streams

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of efficiently computing approximate decision tree splits in data streams. The authors propose one-pass algorithms with nearly optimal space complexity for two settings: regression with labels bounded in {0, 1, …, M}, and classification under Gini impurity. The algorithms exploit the Lipschitz property of the loss functions and combine reservoir sampling with Count-Min sketches that support range queries. For regression, the algorithm uses Õ(M²/ε) space to output a split within additive ε error of the optimal split, and a matching Ω(M²/ε) lower bound is proved. For classification, the Gini algorithm uses Õ(1/ε) space, improving on the previous Õ(1/ε²), with a matching Ω(1/ε) lower bound; an Ω(1/ε) lower bound is also shown for misclassification. All lower bounds follow from reductions from the INDEX problem. Together, these results show that the one-pass algorithms are optimal up to the factors hidden by the Õ notation.
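To make the "Count-Min Sketch with range queries" ingredient concrete, here is a minimal Python sketch of the standard dyadic-interval construction: one Count-Min sketch per level of a binary hierarchy over the feature domain, so that any range is covered by at most two dyadic blocks per level. This illustrates the general technique only; the class names, parameters (`bits`, `width`, `depth`), and hash family are assumptions made for this example, and the summary does not say how the paper's variant is parameterised or used.

```python
import random


class CountMin:
    """Plain Count-Min sketch over non-negative integer keys (illustrative)."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.p = (1 << 61) - 1  # a Mersenne prime for the hash family
        # One (a, b) pair per row: h(key) = ((a*key + b) mod p) mod width
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        for r, (a, b) in enumerate(self.hashes):
            yield r, ((a * key + b) % self.p) % self.width

    def add(self, key, count=1):
        for r, c in self._cells(key):
            self.rows[r][c] += count

    def estimate(self, key):
        # Never underestimates when all updates are non-negative.
        return min(self.rows[r][c] for r, c in self._cells(key))


class RangeCountMin:
    """One Count-Min sketch per dyadic level of the domain [0, 2**bits)."""

    def __init__(self, bits, width, depth, seed=0):
        self.bits = bits
        self.levels = [CountMin(width, depth, seed + lvl)
                       for lvl in range(bits + 1)]

    def add(self, x, count=1):
        # At level lvl, x belongs to the dyadic block with index x >> lvl.
        for lvl, sketch in enumerate(self.levels):
            sketch.add(x >> lvl, count)

    def range_count(self, lo, hi):
        """Estimate how many inserted items x satisfy lo <= x <= hi."""
        total, lvl = 0, 0
        hi += 1  # work with the half-open interval [lo, hi)
        while lo < hi:
            if lo & 1:  # lo is a right child: take its block, move right
                total += self.levels[lvl].estimate(lo)
                lo += 1
            if hi & 1:  # hi - 1 is a left child: take its block
                hi -= 1
                total += self.levels[lvl].estimate(hi)
            lo >>= 1
            hi >>= 1
            lvl += 1
        return total
```

A split at threshold t partitions the stream into items with feature value at most t and the rest, so a prefix query such as `range_count(0, t)` on a per-label sketch gives an approximate count for one side of the split. Because each level's estimate can only overcount, the range estimate is an upper bound on the true count.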

📝 Abstract

We establish nearly optimal upper and lower bounds for approximating decision tree splits in data streams. For regression with labels in the range $\{0,1,\ldots,M\}$, we give a one-pass algorithm using $\tilde{O}(M^2/\varepsilon)$ space that outputs a split within additive $\varepsilon$ error of the optimal split, improving upon the two-pass algorithm of Pham et al. (ISIT 2025). Furthermore, we provide a matching one-pass lower bound showing that $\Omega(M^2/\varepsilon)$ space is indeed necessary. For classification, we also obtain a one-pass algorithm using $\tilde{O}(1/\varepsilon)$ space for approximating the optimal Gini split, improving upon the previous $\tilde{O}(1/\varepsilon^2)$-space algorithm. We complement these results with matching space lower bounds: $\Omega(1/\varepsilon)$ for Gini impurity and $\Omega(1/\varepsilon)$ for misclassification (which matches the upper bound obtained by sampling). Our algorithms exploit the Lipschitz property of the loss functions and use reservoir sampling along with Count-Min sketches with range queries. Our lower bounds follow from careful reductions from the INDEX problem.
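The sampling-based approach mentioned for misclassification can be sketched as follows: keep a uniform reservoir sample of the stream in one pass, then pick the best split on the sample. This is a minimal illustration under stated assumptions, not the paper's algorithm: it assumes integer features with labels in {0, 1}, a split at threshold t that sends x ≤ t left and x > t right, and a majority-label prediction on each side. The function names and the sample size `k` are hypothetical; the abstract does not specify how large the sample must be.

```python
import random


def reservoir_sample(stream, k, rng):
    """Uniform sample of k items from a stream in one pass (Algorithm R)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # uniform over 0..i, both ends inclusive
            if j < k:
                sample[j] = item
    return sample


def best_misclassification_split(pairs):
    """Return (threshold, error_rate) minimising misclassification on pairs.

    pairs: non-empty list of (x, y) with y in {0, 1}.
    threshold None means "do not split" (predict the overall majority).
    """
    pairs = sorted(pairs)
    n = len(pairs)
    total1 = sum(y for _, y in pairs)
    total0 = n - total1
    best = (None, min(total0, total1) / n)
    left0 = left1 = 0
    for idx, (x, y) in enumerate(pairs):
        left1 += y
        left0 += 1 - y
        # Only evaluate a threshold after the last copy of this x value.
        if idx + 1 < n and pairs[idx + 1][0] == x:
            continue
        # Each side predicts its majority label; the minority count is the error.
        errors = min(left0, left1) + min(total0 - left0, total1 - left1)
        if errors / n < best[1]:
            best = (x, errors / n)
    return best
```

The split chosen on the sample is then used as the approximate split for the full stream; how the sample size relates to the additive error ε is exactly what the paper's bounds quantify, and is not derived here.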

Problem

Research questions and friction points this paper is trying to address.

decision tree splits

data streams

space complexity

regression

classification

Innovation

Methods, ideas, or system contributions that make the work stand out.

data streams

decision tree splits

space complexity