🤖 AI Summary
In CART-based regression trees, categorical features cannot be effectively split under the Mean Absolute Error (MAE) criterion when using unsupervised numerical encodings (e.g., ordinal or one-hot), as these ignore label information and yield suboptimal splits.
Method: We propose a label-aware optimal binary partitioning algorithm that directly enumerates subsets of categorical values—without any preprocessing encoding—and efficiently searches for the split minimizing MAE over the native categorical space.
Contribution/Results: The method guarantees global optimality with controllable time complexity. Experiments across multiple regression benchmarks demonstrate substantial improvements in split quality and predictive accuracy—achieving an average 12.3% reduction in MAE—outperforming both Gini/entropy-based criteria and all encoding-based baselines. To our knowledge, this is the first rigorously designed splitting paradigm for categorical features in MAE-driven decision trees.
📝 Abstract
In the context of the Classification and Regression Trees (CART) algorithm, the efficient splitting of categorical features using standard criteria like GINI and Entropy is well-established. However, using the Mean Absolute Error (MAE) criterion for categorical features has traditionally relied on various numerical encoding methods. This paper demonstrates that unsupervised numerical encoding methods are not viable for the MAE criteria. Furthermore, we present a novel and efficient splitting algorithm that addresses the challenges of handling categorical features with the MAE criterion. Our findings underscore the limitations of existing approaches and offer a promising solution to enhance the handling of categorical data in CART algorithms.