Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Aggregated metrics can obscure failures stemming from spurious correlations in out-of-distribution (OOD) generalization. The commonly observed "accuracy-on-the-line" phenomenon (a positive correlation between in-distribution (ID) and OOD accuracy across models) may arise from coarse-grained aggregation over heterogeneous OOD samples; within semantically coherent OOD subsets, higher ID accuracy can in fact predict *lower* OOD performance. Method: OODSelect, a simple gradient-based method for identifying semantically coherent subsets of OOD samples, enabling fine-grained analysis of ID-OOD performance correlations. Contribution/Results: Across widely used distribution-shift benchmarks, OODSelect uncovers subsets, sometimes comprising over half of a standard OOD test set, in which higher ID accuracy predicts lower OOD accuracy. This challenges the common inference from accuracy-on-the-line that harmful spurious correlations are rare in practice. Code and the identified OOD subsets are publicly released to advance robustness research.

📝 Abstract
Benchmarks for out-of-distribution (OOD) generalization frequently show a strong positive correlation between in-distribution (ID) and OOD accuracy across models, termed "accuracy-on-the-line." This pattern is often taken to imply that spurious correlations (correlations that improve ID but reduce OOD performance) are rare in practice. We find that this positive correlation is often an artifact of aggregating heterogeneous OOD examples. Using a simple gradient-based method, OODSelect, we identify semantically coherent OOD subsets where accuracy-on-the-line does not hold. Across widely used distribution shift benchmarks, OODSelect uncovers subsets, sometimes over half of the standard OOD set, where higher ID accuracy predicts lower OOD accuracy. Our findings indicate that aggregate metrics can obscure important failure modes of OOD robustness. We release code and the identified subsets to facilitate further research.
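The abstract does not spell out how OODSelect works, so the following is only a minimal numpy sketch of one plausible reading of a "gradient-based" subset search: soft per-sample inclusion weights are optimized by gradient descent to minimize the Pearson correlation between models' ID accuracy and their weighted OOD accuracy. The function names, the sigmoid parameterization, and the exact objective are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def id_ood_correlation(id_acc, ood_acc):
    """Pearson correlation between per-model ID and OOD accuracies."""
    a = id_acc - id_acc.mean()
    b = ood_acc - ood_acc.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_negative_subset(id_acc, ood_correct, steps=1000, lr=2.0):
    """Hypothetical stand-in for OODSelect: gradient descent on logits of
    soft per-sample weights, minimizing the ID-OOD accuracy correlation.

    id_acc:      (n_models,) ID accuracy of each model
    ood_correct: (n_models, n_samples) 0/1 per-sample OOD correctness
    Returns a boolean mask over OOD samples (weight > 0.5 after descent).
    """
    n_models, n_samples = ood_correct.shape
    logits = np.zeros(n_samples)                  # start with uniform weights
    a = id_acc - id_acc.mean()
    na = np.linalg.norm(a)
    for _ in range(steps):
        w = 1.0 / (1.0 + np.exp(-logits))         # sigmoid -> weights in (0, 1)
        W = w.sum()
        ood_acc = ood_correct @ w / W             # weighted OOD accuracy per model
        b = ood_acc - ood_acc.mean()
        nb = np.linalg.norm(b) + 1e-12
        r = (a @ b) / (na * nb)                   # current correlation
        # d r / d ood_acc (a, b already centered, so no extra projection term)
        g_acc = a / (na * nb) - r * b / (nb ** 2)
        # chain rule: d ood_acc[m] / d w[j] = (ood_correct[m, j] - ood_acc[m]) / W
        g_w = (g_acc @ (ood_correct - ood_acc[:, None])) / W
        logits -= lr * g_w * w * (1.0 - w)        # descend: push r negative
    return 1.0 / (1.0 + np.exp(-logits)) > 0.5
```

On synthetic data where some OOD samples are solved only by low-ID-accuracy models, this descent drives the weights toward exactly those samples, producing a subset whose ID-OOD correlation is negative even though the aggregate correlation over all samples is positive, which is the aggregation artifact the paper describes.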
Problem

Research questions and friction points this paper is trying to address.

Uncovering hidden OOD generalization failures masked by aggregated metrics
Identifying subsets where higher ID accuracy predicts lower OOD performance
Revealing spurious correlation impacts obscured by accuracy-on-the-line patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

OODSelect identifies semantically coherent failure subsets within OOD test sets
Simple gradient-based technique reveals generalization failures hidden by aggregation
Uncovers subsets where higher ID accuracy predicts lower OOD accuracy