🤖 AI Summary
To address the challenge of unsupervised model performance estimation under covariate shift, where ground-truth labels are unavailable or delayed after deployment, this paper proposes the Probabilistic Adaptive Performance Estimation (PAPE) framework. PAPE requires neither access to true labels nor knowledge of the original model's architecture or feature representations; it operates solely on the model's predictions and probability estimates. By combining density ratio estimation with performance generalization bounds, PAPE models the prediction distribution and applies adaptive reweighting to estimate arbitrary classification metrics, without assuming a specific form of shift or resorting to feature learning or generative modeling. Extensive evaluation across more than 900 dataset-model combinations built from US census data shows that PAPE reduces mean absolute error by 37% relative to state-of-the-art proxy metrics and drift detection methods, substantially improving the reliability and generality of model monitoring in production environments.
📝 Abstract
Machine learning models often experience performance degradation post-deployment due to shifts in the data distribution, and it is challenging to assess a model's performance accurately when labels are missing or delayed. Existing proxy methods, such as drift detection, fail to adequately measure the effects of these shifts. To address this, we introduce a new method, Probabilistic Adaptive Performance Estimation (PAPE), for evaluating classification models on unlabeled data that accurately quantifies the impact of covariate shift on model performance. PAPE is model- and data-type-agnostic and works for various performance metrics. Crucially, it operates independently of the original model, relying only on its predictions and probability estimates, and makes no assumptions about the nature of the covariate shift, learning directly from data instead. We tested PAPE on tabular data using over 900 dataset-model combinations created from US census data, assessing its performance against multiple benchmarks. Overall, PAPE provided more accurate performance estimates than the other evaluated methodologies.
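The paper's exact algorithm is not reproduced here, but the core idea it builds on, estimating a performance metric on unlabeled data purely from the model's predicted probabilities, can be illustrated with a minimal sketch. Assuming the probabilities are well calibrated (a hypothetical simplification; PAPE itself adapts to miscalibration under shift), the confidence assigned to the predicted class equals the chance that prediction is correct, so averaging it yields an expected-accuracy estimate:

```python
import numpy as np

def estimate_accuracy(proba):
    """Estimate expected accuracy on unlabeled data from predicted
    class probabilities of shape (n_samples, n_classes).

    Assumes the probabilities are well calibrated: the probability
    assigned to the predicted class is then the chance that the
    prediction is correct, so its mean is the expected accuracy.
    """
    confidence = proba.max(axis=1)  # confidence in the predicted class
    return float(confidence.mean())

# Toy example: two confident predictions and two less certain ones.
proba = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.30, 0.70],
    [0.55, 0.45],
])
print(estimate_accuracy(proba))  # 0.7375
```

Under covariate shift the deployed-data confidences drift away from those seen in training, which is why a naive average like this can be biased; the adaptive reweighting described above is what corrects for that.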