AI Summary
Existing mixed-integer programming (MIP)-based regression tree learning methods struggle to jointly model continuous features and to scale to large datasets. Method: This paper proposes a two-stage MIP optimization framework. Its core innovations are: (1) restricting branch-and-bound exclusively to tree-structure variables, yielding sample-size-independent convergence guarantees; (2) tightening bounds via closed-form leaf predictions, empirical threshold discretization, and exact analytical solutions for depth-1 subtrees; and (3) integrating decomposition-based upper/lower bound estimation with node-level parallelization for efficient training on million-scale datasets. Results: Experiments on multi-source benchmark datasets with mixed feature types demonstrate substantial improvements over state-of-the-art MIP baselines. The method constructs high-quality regression trees on 2 million samples within four hours, achieving both provable optimality and practical scalability.
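Two of the bound-tightening ideas mentioned above have simple concrete forms for squared-error regression trees: candidate split thresholds can be restricted to midpoints between consecutive distinct feature values (empirical threshold discretization), and the optimal constant prediction at a leaf is just the mean of the samples routed to it (closed-form leaf prediction). The following is a minimal illustrative sketch of these two facts, not the paper's actual MIP formulation; the function names are hypothetical.

```python
import numpy as np

def candidate_thresholds(x):
    """Empirical threshold discretization: for an axis-aligned split on one
    feature, only midpoints between consecutive distinct sorted values can
    change which samples go left vs. right, so they suffice as candidates."""
    v = np.unique(x)  # sorted distinct values
    return (v[:-1] + v[1:]) / 2.0

def leaf_prediction(y):
    """Closed-form leaf prediction: under squared error, the optimal constant
    prediction for the samples routed to a leaf is their sample mean."""
    return y.mean()
```

For example, a feature column `[1, 2, 2, 4]` yields only two candidate thresholds, `1.5` and `3.0`, regardless of how many samples share each value.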
Abstract
Mixed-integer programming (MIP) has emerged as a powerful framework for learning optimal decision trees. Yet existing MIP approaches for regression tasks are either limited to purely binary features or become computationally intractable on continuous, large-scale data. Naively binarizing continuous features sacrifices global optimality and often yields needlessly deep trees. We recast optimal regression-tree training as a two-stage optimization problem and propose Reduced-Space Optimal Regression Trees (RS-ORT) - a specialized branch-and-bound (BB) algorithm that branches exclusively on tree-structural variables. This design guarantees convergence independently of the number of training samples. Leveraging the model's structure, we introduce several bound-tightening techniques - closed-form leaf prediction, empirical threshold discretization, and exact depth-1 subtree parsing - that combine with decomposable upper- and lower-bounding strategies to accelerate training. The BB node-wise decomposition enables trivial parallel execution, further alleviating computational intractability even for million-sample datasets. In empirical studies on several regression benchmarks containing both binary and continuous features, RS-ORT delivers better training and testing performance than state-of-the-art methods. Notably, on datasets with up to 2,000,000 samples and continuous features, RS-ORT obtains guaranteed training performance with a simpler tree structure and better generalization within four hours.
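The "exact depth-1 subtree parsing" mentioned in the abstract exploits the fact that a one-split regression tree can be solved exactly by enumeration: for each feature and each empirical threshold, predict each leaf by its mean and keep the split with the lowest squared error. The sketch below illustrates that enumeration in brute-force form under these assumptions; it is not the paper's implementation, and the function name is hypothetical.

```python
import numpy as np

def best_depth1_split(X, y):
    """Exactly solve a depth-1 regression subtree under squared error:
    enumerate every feature and every midpoint threshold, predict each
    leaf by its sample mean, and keep the split with the lowest SSE."""
    n, d = X.shape
    # Fallback: no split, a single leaf predicting the global mean.
    best = (None, None, float(np.sum((y - y.mean()) ** 2)))
    for j in range(d):
        vals = np.unique(X[:, j])
        for t in (vals[:-1] + vals[1:]) / 2.0:  # empirical thresholds
            left = X[:, j] <= t
            yl, yr = y[left], y[~left]
            sse = np.sum((yl - yl.mean()) ** 2) + np.sum((yr - yr.mean()) ** 2)
            if sse < best[2]:
                best = (j, float(t), float(sse))
    return best  # (feature index, threshold, training SSE)
```

Because midpoints lie strictly between distinct feature values, both leaves are always non-empty, and the enumeration is exact rather than heuristic.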