🤖 AI Summary
Current out-of-distribution (OOD) evaluation protocols are questioned for their efficacy in exposing question-answering models' reliance on spurious features (i.e., prediction shortcuts), owing to potential confounding between OOD test sets and the training distribution.
Method: The authors conduct a systematic cross-dataset analysis, comparing multiple OOD benchmarks against in-distribution (ID) evaluation, while performing failure diagnostics grounded in known shortcut patterns.
Contribution/Results: They find that most OOD datasets inadvertently retain spurious correlations present in the training distribution, leading to misleading robustness estimates; some OOD benchmarks are even less sensitive to shortcut exploitation than ID evaluation. Consequently, OOD evaluations are not inherently robust: their validity critically depends on how the datasets are constructed. The paper advocates principled OOD benchmark design and proposes a shortcut-aware evaluation framework to make generalization assessment more reliable and interpretable. This work offers methodological reflection and practical guidance for trustworthy AI evaluation.
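The core failure mode can be illustrated with a toy simulation (not the paper's actual framework): if an OOD split inherits the same spurious feature-label correlation as the training distribution, a model that relies purely on the shortcut still scores well on it, so the benchmark cannot expose shortcut reliance. The correlation strengths and the feature below are hypothetical.

```python
import random

random.seed(0)

def make_split(n, shortcut_correlation):
    """Synthetic QA-style split: each example has a gold label and a
    spurious feature (e.g. answer position) that agrees with the label
    with probability `shortcut_correlation`."""
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        feature = label if random.random() < shortcut_correlation else 1 - label
        data.append((feature, label))
    return data

def shortcut_accuracy(split):
    """Accuracy of a predictor that relies solely on the spurious feature."""
    return sum(f == y for f, y in split) / len(split)

# The ID split shares a strong spurious correlation with training data.
id_split = make_split(10_000, shortcut_correlation=0.9)
# A well-constructed OOD split breaks the correlation; a poorly
# constructed one inherits it and cannot expose shortcut reliance.
good_ood = make_split(10_000, shortcut_correlation=0.5)
bad_ood  = make_split(10_000, shortcut_correlation=0.9)

print(f"shortcut acc, ID:       {shortcut_accuracy(id_split):.2f}")
print(f"shortcut acc, good OOD: {shortcut_accuracy(good_ood):.2f}")
print(f"shortcut acc, bad OOD:  {shortcut_accuracy(bad_ood):.2f}")
```

On the "bad" OOD split the shortcut predictor scores as well as in-distribution, so a shortcut-reliant model looks robust there; only the split that breaks the correlation reveals the failure.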
📝 Abstract
A majority of recent work in AI assesses models' generalization capabilities through the lens of performance on out-of-distribution (OOD) datasets. Despite their practicality, such evaluations build upon a strong assumption: that OOD evaluations can capture and reflect the failure modes a model would exhibit in real-world deployment.
In this work, we challenge this assumption and confront the results obtained from OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models, referred to as a reliance on spurious features or prediction shortcuts.
We find that the different datasets used for OOD evaluation in QA provide estimates of models' robustness to shortcuts of vastly different quality, some largely underperforming even a simple in-distribution (ID) evaluation. We partially attribute this to the observation that spurious shortcuts are shared across ID and OOD datasets, but also find cases where a dataset's quality for training and its quality for evaluation are largely disconnected. Our work underlines the limitations of commonly used OOD-based evaluations of generalization, and provides methodology and recommendations for evaluating generalization more robustly, within and beyond QA.