🤖 AI Summary
This work studies how to approximate the best decision tree split over a data stream using little space. The authors give one-pass algorithms with nearly optimal space complexity for regression with labels in {0, 1, …, M} and for classification under Gini impurity. The algorithms rely on the Lipschitz property of the loss functions, reservoir sampling, and Count-Min sketches that support range queries. For regression, the algorithm uses Õ(M²/ε) space to output a split within additive error ε of the optimum, and an Ω(M²/ε) one-pass lower bound shows this is tight up to polylogarithmic factors. For classification, the Gini algorithm uses Õ(1/ε) space, improving on a previous Õ(1/ε²) bound, and is complemented by a matching Ω(1/ε) lower bound; an Ω(1/ε) lower bound is also shown for misclassification. The lower bounds are obtained by reductions from the INDEX problem.
📝 Abstract
We establish nearly optimal upper and lower bounds for approximating decision tree splits in data streams. For regression with labels in the range $\{0,1,\ldots,M\}$, we give a one-pass algorithm using $\tilde{O}(M^2/\varepsilon)$ space that outputs a split within additive $\varepsilon$ error of the optimal split, improving upon the two-pass algorithm of Pham et al. (ISIT 2025). Furthermore, we provide a matching one-pass lower bound showing that $\Omega(M^2/\varepsilon)$ space is indeed necessary.
For classification, we also obtain a one-pass algorithm using $\tilde{O}(1/\varepsilon)$ space for approximating the optimal Gini split, improving upon the previous $\tilde{O}(1/\varepsilon^2)$-space algorithm. We complement these results with matching space lower bounds: $\Omega(1/\varepsilon)$ for Gini impurity and $\Omega(1/\varepsilon)$ for misclassification (which matches the upper bound obtained by sampling).
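To make the classification objective concrete, the following is a minimal offline sketch of scoring a threshold split by Gini impurity. It assumes binary labels in {0, 1}, integer feature values, and the common size-weighted form of the split cost; the abstract does not state the exact normalization, so treat the formula and the names `gini`, `split_cost`, and `best_split` as illustrative assumptions rather than the authors' definitions. It stores all data, which is exactly what a streaming algorithm avoids.

```python
from typing import Iterable, Sequence, Tuple


def gini(pos: int, total: int) -> float:
    """Binary Gini impurity 1 - p^2 - (1-p)^2 = 2p(1-p) for a group."""
    if total == 0:
        return 0.0
    p = pos / total
    return 2.0 * p * (1.0 - p)


def split_cost(data: Sequence[Tuple[int, int]], threshold: int) -> float:
    """Size-weighted Gini impurity of the split x <= threshold vs. x > threshold.

    Assumption: labels y are 0 or 1, so sum(labels) counts the positives.
    """
    left = [y for x, y in data if x <= threshold]
    right = [y for x, y in data if x > threshold]
    n = len(data)
    return (len(left) / n) * gini(sum(left), len(left)) + (
        len(right) / n
    ) * gini(sum(right), len(right))


def best_split(data: Sequence[Tuple[int, int]], candidates: Iterable[int]) -> int:
    """Exact (non-streaming) baseline: the candidate with the smallest cost."""
    return min(candidates, key=lambda t: split_cost(data, t))
```

An approximate split in the sense of the abstract would be any threshold whose cost is at most the optimal cost plus $\varepsilon$.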
Our algorithms exploit the Lipschitz property of the loss functions and use reservoir sampling along with Count--Min sketches with range queries. Our lower bounds follow from careful reductions from the INDEX problem.
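The two building blocks named above can be sketched generically as follows. This is not the paper's algorithm: it shows textbook reservoir sampling (Algorithm R) and one standard way to answer range queries with Count-Min sketches, namely keeping one sketch per dyadic level. The class names, the parameters `width`, `depth`, and `log_universe`, and the use of Python's built-in `hash` as a stand-in for a pairwise-independent hash family are all assumptions made for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Algorithm R: a uniform sample of k items from a stream in one pass."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


class CountMin:
    """Plain Count-Min sketch; estimates never fall below the true count."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _cols(self, key):
        # Stand-in hashing (assumption): salted built-in hash of an int tuple.
        return [hash((salt, key)) % self.width for salt in self.salts]

    def add(self, key, count=1):
        for row, col in zip(self.rows, self._cols(key)):
            row[col] += count

    def estimate(self, key):
        return min(row[col] for row, col in zip(self.rows, self._cols(key)))


class DyadicCountMin:
    """Range counts over [0, 2**log_universe) via one sketch per dyadic level."""

    def __init__(self, log_universe, width, depth):
        self.sketches = [
            CountMin(width, depth, seed=level) for level in range(log_universe + 1)
        ]

    def add(self, x, count=1):
        # At level l, x belongs to the dyadic interval with index x >> l.
        for level, sketch in enumerate(self.sketches):
            sketch.add(x >> level, count)

    def range_count(self, lo, hi):
        """Estimate the number of items x with lo <= x <= hi."""
        total, level = 0, 0
        hi += 1  # work with the half-open interval [lo, hi)
        while lo < hi:
            if lo & 1:  # lo is a right child: take it and move right
                total += self.sketches[level].estimate(lo)
                lo += 1
            if hi & 1:  # hi - 1 is a left child: take it and move left
                hi -= 1
                total += self.sketches[level].estimate(hi)
            lo >>= 1
            hi >>= 1
            level += 1
        return total
```

Any range is covered by at most about two dyadic intervals per level, so a range query sums a logarithmic number of point estimates. That is what allows counts on either side of a candidate split to be estimated without storing the stream.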