Learning Optimal Classification Trees Robust to Distribution Shifts

📅 2023-10-26
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
In high-stakes domains such as public health, distributional shifts between training and test data—arising from questionnaire design variations, heterogeneous data collection environments, and differing respondent trust levels—severely degrade classifier reliability. Method: This paper proposes the first robust optimal classification tree learning framework. It formulates robust tree learning as a single-stage nonlinear mixed-integer robust optimization problem and equivalently recasts it as a tractable two-stage linear robust optimization model. A customized constraint-generation algorithm is then developed to solve it efficiently. Results: Evaluated on multiple public datasets, the method improves worst-case accuracy by up to 12.48% and average accuracy by up to 4.85% over non-robust optimal trees. Crucially, it establishes, for the first time, provable robustness guarantees for optimal decision trees under distribution shift while maintaining computational tractability—unifying theoretical robustness and practical solvability.
📝 Abstract
We consider the problem of learning classification trees that are robust to distribution shifts between training and testing/deployment data. This problem arises frequently in high-stakes settings such as public health and social work, where data is often collected using self-reported surveys which are highly sensitive to, e.g., the framing of the questions, the time when and place where the survey is conducted, and the level of comfort the interviewee has in sharing information with the interviewer. We propose a method for learning optimal robust classification trees based on mixed-integer robust optimization technology. In particular, we demonstrate that the problem of learning an optimal robust tree can be cast as a single-stage mixed-integer robust optimization problem with a highly nonlinear and discontinuous objective. We reformulate this problem equivalently as a two-stage linear robust optimization problem for which we devise a tailored solution procedure based on constraint generation. We evaluate the performance of our approach on numerous publicly available datasets, and compare the performance to a regularized, non-robust optimal tree. We show an increase of up to 12.48% in worst-case accuracy and of up to 4.85% in average-case accuracy across several datasets and distribution shifts from using our robust solution in comparison to the non-robust one.
Problem

Research questions and friction points this paper is trying to address.

Learning classification trees robust to training-testing distribution shifts
Addressing data sensitivity in high-stakes settings like public health
Proposing mixed-integer optimization for optimal robust tree learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses mixed-integer robust optimization technology
Reformulates problem as two-stage linear optimization
Improves worst-case and average-case accuracy via a tailored constraint-generation solution procedure
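The constraint-generation idea the paper relies on can be illustrated generically: solve a master min-max problem over a small subset of scenarios, then let a separation step find the worst-case scenario for the current solution and add it as a cut if it is violated. The sketch below is a toy instance on a 1-D grid with an explicit finite scenario set, not the paper's mixed-integer tree formulation; all names (`constraint_generation`, `f`, `grid`, `U`) are illustrative assumptions.

```python
# Generic constraint-generation loop for min-max robust optimization:
#     min_x  max_{u in U}  f(x, u)
# Start from a small active scenario subset, solve the master problem
# over it, then separate: find the true worst-case scenario for the
# current solution and add it if it violates the master's value.
# Toy instance only -- not the paper's tree-learning formulation.

def constraint_generation(candidates, scenarios, f, tol=1e-9):
    active = [scenarios[0]]            # initial scenario subset
    while True:
        # Master: minimize the worst case over the *active* scenarios.
        x = min(candidates, key=lambda c: max(f(c, u) for u in active))
        master_val = max(f(x, u) for u in active)
        # Separation: find the true worst-case scenario for x.
        worst = max(scenarios, key=lambda u: f(x, u))
        if f(x, worst) <= master_val + tol:
            return x, f(x, worst)      # no violated scenario: done
        active.append(worst)           # add the violating scenario (cut)

# Usage: f(x, u) = (x - u)^2 with scenarios spanning [0, 4];
# the robust minimizer of the worst case is the midpoint x = 2.
grid = [i / 10 for i in range(41)]
U = [0.0, 1.0, 3.0, 4.0]
x_star, worst_val = constraint_generation(grid, U, lambda x, u: (x - u) ** 2)
```

In the paper's setting the master is a mixed-integer program over tree structures and the separation problem searches the distributional uncertainty set, but the outer loop has this same shape: only a few worst-case scenarios are typically needed before no violated constraint remains.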
Nathan Justin
PhD Candidate, University of Southern California
OptimizationMachine LearningOperations Research
S. Aghaei
Center for Artificial Intelligence in Society, University of Southern California, Los Angeles, CA 90089, USA
Andrés Gómez
Department of Industrial and Systems Engineering, University of Southern California, Los Angeles, CA 90089, USA
P. Vayanos
Center for Artificial Intelligence in Society, University of Southern California, Los Angeles, CA 90089, USA