🤖 AI Summary
Existing out-of-distribution (OOD) performance prediction research suffers from inconsistent evaluation protocols and insufficient coverage of real-world OOD datasets and distribution shift types.
Method: We introduce ODP-Bench—the first systematic OOD performance prediction benchmark—integrating 12 real-world datasets, 6 canonical distribution-shift categories (e.g., semantic, compound), and state-of-the-art prediction algorithms, with standardized evaluation pipelines and pre-trained models that eliminate redundant training overhead. Crucially, ODP-Bench enables performance prediction evaluation in the zero-shot, unlabeled-OOD setting, supporting risk-sensitive deployment.
Contribution/Results: Through extensive cross-dataset and cross-shift experiments, we systematically characterize the capabilities and limitations of existing methods for the first time, revealing significant failures under semantic and compound shifts. We publicly release code, models, and evaluation tools, establishing a reproducible, extensible, and authoritative testbed for future research.
📝 Abstract
Recently, increasing attention has been paid to Out-of-Distribution (OOD) performance prediction, whose goal is to predict the performance of trained models on unlabeled OOD test datasets, so that off-the-shelf trained models can be better leveraged and deployed in risk-sensitive scenarios. Although progress has been made in this area, evaluation protocols in previous literature are inconsistent, and most works cover only a limited number of real-world OOD datasets and types of distribution shifts. To provide convenient and fair comparisons for various algorithms, we propose the Out-of-Distribution Performance Prediction Benchmark (ODP-Bench), a comprehensive benchmark that includes the most commonly used OOD datasets and existing practical performance prediction algorithms. We provide our trained models as a testbed for future researchers, thus guaranteeing consistency of comparison and avoiding the burden of repeating the model training process. Furthermore, we conduct in-depth experimental analyses to better understand the capability boundaries of existing algorithms.
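To make the task concrete: an OOD performance predictor takes a trained model's outputs on an *unlabeled* test set and returns an estimated accuracy, with no ground-truth labels available. The abstract does not specify any particular algorithm, so the sketch below uses a generic average-confidence baseline (mean maximum softmax probability) purely as an illustration of the input/output contract such methods share; the function names and synthetic data are assumptions, not part of ODP-Bench.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with a max-shift for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_accuracy_avg_conf(logits: np.ndarray) -> float:
    """Estimate accuracy on an unlabeled OOD set as the mean
    maximum softmax confidence -- a simple baseline predictor.
    No labels are used: this is the zero-shot, unlabeled setting."""
    probs = softmax(logits)
    return float(probs.max(axis=1).mean())

# Synthetic stand-in for a trained model's logits on an unlabeled OOD test set
# (1000 samples, 10 classes); real usage would run the model on OOD inputs.
rng = np.random.default_rng(0)
ood_logits = rng.normal(scale=3.0, size=(1000, 10))

estimated_acc = predict_accuracy_avg_conf(ood_logits)
print(f"predicted OOD accuracy: {estimated_acc:.3f}")
```

A benchmark like ODP-Bench would then compare `estimated_acc` against the model's true accuracy on the labeled version of each OOD dataset, averaged across datasets and shift types.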