ZTree: A Subgroup Identification Based Decision Tree Learning Framework

📅 2025-09-16
🤖 AI Summary
Conventional decision trees (e.g., CART) rely on heuristic impurity measures for splitting and require extensive pruning and hyperparameter tuning to control model complexity; they lack a rigorous statistical foundation. Method: This paper proposes ZTree, a statistically principled decision tree framework that replaces impurity criteria with hypothesis tests (e.g., z-test, t-test, Mann–Whitney U test, log-rank test) to assess the significance of subgroup differences at each node. Multiple testing is corrected via cross-validation, enabling automatic termination without explicit pruning. Contribution/Results: ZTree unifies interpretable statistical thresholds (i.e., p-value–based significance levels) for both splitting decisions and complexity control, supporting nested tree construction. Evaluated on five large-scale UCI datasets, ZTree demonstrates superior robustness in small-sample regimes, yields significantly more compact trees, and achieves accuracy comparable to or exceeding that of CART, without manual regularization or post-hoc pruning.

📝 Abstract
Decision trees are a commonly used class of machine learning models valued for their interpretability and versatility, capable of both classification and regression. We propose ZTree, a novel decision tree learning framework that replaces CART's traditional purity-based splitting with statistically principled subgroup identification. At each node, ZTree applies hypothesis testing (e.g., z-tests, t-tests, Mann–Whitney U, log-rank) to assess whether a candidate subgroup differs meaningfully from its complement. To adjust for the complication of multiple testing, we employ a cross-validation-based approach to determine whether further node splitting is needed. This robust stopping criterion eliminates the need for post-pruning and makes the test threshold (z-threshold) the only parameter for controlling tree complexity. Because of the simplicity of the tree-growing procedure, once a detailed tree is learned using the most lenient z-threshold, all simpler trees can be derived by simply removing nodes that do not meet the larger z-thresholds. This makes parameter tuning intuitive and efficient. Furthermore, this z-threshold is essentially a p-value, allowing users to easily plug appropriate statistical tests into our framework without adjusting the range of parameter search. Empirical evaluation on five large-scale UCI datasets demonstrates that ZTree consistently delivers strong performance, especially in low-data regimes. Compared to CART, ZTree also tends to grow simpler trees without sacrificing performance. ZTree introduces a statistically grounded alternative to traditional decision tree splitting by leveraging hypothesis testing and a cross-validation approach to multiple testing correction, resulting in an efficient and flexible framework.
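To make the splitting idea concrete, here is a minimal, hypothetical sketch (not the authors' code) of z-test-based split selection at a single node: each candidate binary split is scored by a two-sample z-statistic comparing the subgroup's mean outcome against its complement's, and the most significant split is kept only if it clears the z-threshold. The function names, `min_leaf` guard, and exhaustive threshold scan are my own assumptions.

```python
# Hypothetical sketch of ZTree-style node splitting: score each candidate
# split by a two-sample z-statistic and keep the most significant one,
# provided it exceeds the z-threshold (otherwise the node becomes a leaf).
import math

def z_statistic(group, complement):
    """Large-sample two-sample z-statistic for a difference in means."""
    n1, n2 = len(group), len(complement)
    m1 = sum(group) / n1
    m2 = sum(complement) / n2
    v1 = sum((y - m1) ** 2 for y in group) / (n1 - 1)
    v2 = sum((y - m2) ** 2 for y in complement) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

def best_split(xs, ys, z_threshold=1.96, min_leaf=5):
    """Scan thresholds on one feature; return (threshold, |z|) for the most
    significant split, or None if no split clears z_threshold."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue  # subgroup too small to test
        z = abs(z_statistic(left, right))
        if z > z_threshold and (best is None or z > best[1]):
            best = (t, z)
    return best
```

Because the z-threshold maps directly to a p-value, swapping in a different test (e.g., Mann–Whitney U for non-normal outcomes) changes only the statistic, not the parameter range being searched.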
Problem

Research questions and friction points this paper is trying to address.

Replacing purity-based splitting with subgroup identification
Addressing multiple testing complications via cross-validation
Simplifying tree complexity control with statistical thresholds
Innovation

Methods, ideas, or system contributions that make the work stand out.

Subgroup identification replaces purity-based splitting
Cross-validation handles multiple testing for stopping
Single z-threshold parameter controls tree complexity
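The nested-tree property behind the single-parameter complexity control can be illustrated with a short sketch (class and function names are hypothetical, not the paper's API): grow one detailed tree at the most lenient threshold, record each split's |z|, then derive any stricter tree by collapsing nodes whose recorded |z| falls below the new bar.

```python
# Hypothetical illustration of ZTree's nested trees: a stricter z-threshold
# is applied by collapsing under-threshold subtrees into leaves, with no
# retraining and no post-pruning pass.
class Node:
    def __init__(self, z=None, left=None, right=None, prediction=None):
        self.z = z                    # |z| of this node's split (None = leaf)
        self.left, self.right = left, right
        self.prediction = prediction  # value used if this node is a leaf

def prune_to_threshold(node, z_threshold):
    """Return a copy of the tree keeping only splits with |z| >= z_threshold."""
    if node.z is None or node.z < z_threshold:
        return Node(prediction=node.prediction)  # collapse subtree to a leaf
    return Node(z=node.z,
                left=prune_to_threshold(node.left, z_threshold),
                right=prune_to_threshold(node.right, z_threshold),
                prediction=node.prediction)

def depth(node):
    """Depth of the tree (0 for a single leaf)."""
    if node.z is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))
```

Tuning then amounts to sweeping z-thresholds over the one stored tree, which is why no separate pruning or regularization hyperparameters are needed.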