🤖 AI Summary
Existing out-of-distribution (OOD) performance prediction research suffers from inconsistent evaluation protocols and insufficient coverage of real-world OOD datasets and distribution shift types.
Method: We introduce ODP-Bench—the first systematic OOD performance prediction benchmark—integrating 12 real-world datasets, 6 canonical distribution-shift categories (e.g., semantic, compound), and state-of-the-art prediction algorithms, with standardized evaluation pipelines and pre-trained models that eliminate redundant training overhead. Crucially, ODP-Bench enables performance prediction evaluation in the zero-shot, unlabeled-OOD setting, supporting risk-sensitive deployment.
Contribution/Results: Through extensive cross-dataset and cross-shift experiments, we systematically characterize the capabilities and limitations of existing methods for the first time, revealing significant failures under semantic and compound shifts. We publicly release code, models, and evaluation tools, establishing a reproducible, extensible, and authoritative testbed for future research.
📝 Abstract
Recently, increasing attention has been paid to Out-of-Distribution (OOD) performance prediction, whose goal is to predict the performance of trained models on unlabeled OOD test datasets, so that off-the-shelf trained models can be better leveraged and deployed in risk-sensitive scenarios. Although progress has been made in this area, evaluation protocols in previous literature are inconsistent, and most works cover only a limited number of real-world OOD datasets and types of distribution shifts. To provide convenient and fair comparisons for various algorithms, we propose the Out-of-Distribution Performance Prediction Benchmark (ODP-Bench), a comprehensive benchmark that includes the most commonly used OOD datasets and existing practical performance prediction algorithms. We provide our trained models as a testbed for future researchers, thus guaranteeing consistency of comparison and avoiding the burden of repeating the model training process. Furthermore, we conduct in-depth experimental analyses to better understand the capability boundaries of existing algorithms.
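To make the task concrete: an OOD performance predictor takes a trained model's outputs on an *unlabeled* test set and returns an estimated accuracy, with no ground-truth labels available. The abstract does not specify any particular algorithm, so the sketch below uses a generic average-confidence baseline (mean maximum softmax probability) purely as an illustration of the input/output contract such methods share; the function names and synthetic data are assumptions, not part of ODP-Bench.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with a max-shift for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_accuracy_avg_conf(logits: np.ndarray) -> float:
    """Estimate accuracy on an unlabeled OOD set as the mean
    maximum softmax confidence -- a simple baseline predictor.
    No labels are used: this is the zero-shot, unlabeled setting."""
    probs = softmax(logits)
    return float(probs.max(axis=1).mean())

# Synthetic stand-in for a trained model's logits on an unlabeled OOD test set
# (1000 samples, 10 classes); real usage would run the model on OOD inputs.
rng = np.random.default_rng(0)
ood_logits = rng.normal(scale=3.0, size=(1000, 10))

estimated_acc = predict_accuracy_avg_conf(ood_logits)
print(f"predicted OOD accuracy: {estimated_acc:.3f}")
```

A benchmark like ODP-Bench would then compare `estimated_acc` against the model's true accuracy on the labeled version of each OOD dataset, averaged across datasets and shift types.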