🤖 AI Summary
Supervised learning models’ black-box nature limits their trustworthy deployment in high-stakes applications. This paper proposes a model- and distribution-agnostic framework for feature significance testing, applicable to both regression and classification tasks. The method quantifies each feature’s incremental contribution to model performance via mask-based perturbations—without retraining the original model or introducing auxiliary models—and constructs a uniformly most powerful randomized sign test based on the median performance difference. The test yields exact p-values and confidence intervals with exact coverage, combining statistical rigor with computational efficiency. Experiments on synthetic data confirm its accuracy and robustness under distributional shifts; on real-world high-dimensional datasets, it delivers reproducible and reliable feature importance assessments. By unifying statistical hypothesis testing with interpretability, the framework offers a principled route to trustworthy, explainable AI.
📝 Abstract
The opacity of many supervised learning algorithms remains a key challenge, hindering scientific discovery and limiting broader deployment -- particularly in high-stakes domains. This paper develops model- and distribution-agnostic significance tests to assess the influence of input features in any regression or classification algorithm. Our method evaluates a feature's incremental contribution to model performance by masking its values across samples. Under the null hypothesis, the distribution of performance differences across a test set has a non-positive median. We construct a uniformly most powerful, randomized sign test for this median, yielding exact p-values for assessing feature significance and confidence intervals with exact coverage for estimating population-level feature importance. The approach requires minimal assumptions, avoids model retraining or auxiliary models, and remains computationally efficient even for large-scale, high-dimensional settings. Experiments on synthetic tasks validate its statistical and computational advantages, and applications to real-world data illustrate its practical utility.
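To make the testing procedure concrete, the following is a minimal sketch of the exact (randomized) sign test described above. It assumes per-sample performance differences have already been computed (e.g., masked loss minus unmasked loss on a held-out test set); the function name and interface are illustrative, not the paper's actual API.

```python
# Hypothetical sketch: exact randomized sign test for H0: median(diffs) <= 0,
# where diffs[i] = loss(model, x_i with feature masked) - loss(model, x_i).
# Under H0, positive differences behave like Binomial(n, 1/2) draws.
import math

def sign_test_pvalue(diffs, u=None):
    """Exact sign test p-value for H0: median of diffs is non-positive.

    diffs: per-sample performance differences on the test set.
    u: optional Uniform(0, 1) draw; if given, the test is randomized at the
       boundary to achieve exact size, otherwise it is conservative.
    """
    d = [x for x in diffs if x != 0.0]        # drop ties, standard for sign tests
    n = len(d)
    s = sum(1 for x in d if x > 0)            # number of positive differences
    # Upper-tail probability P(S >= s) under S ~ Binomial(n, 1/2)
    tail = sum(math.comb(n, k) for k in range(s, n + 1)) / 2**n
    if u is None:
        return tail
    # Randomized version: subtract u * P(S == s) so the test has exact level
    at_s = math.comb(n, s) / 2**n
    return tail - u * at_s
```

A small p-value indicates that masking the feature degrades performance on more than half of the test samples, i.e., the feature is significant. Note the test only needs one forward pass per masked feature; no retraining or auxiliary model is involved.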