🤖 AI Summary
Current domain generalization (DG) benchmarks such as ColoredMNIST and Waterbirds have a fundamental flaw as tests of robustness to spurious correlations: their distribution shifts do not meaningfully alter the spurious associations that govern out-of-distribution (OOD) generalization. This produces the "accuracy on the line" phenomenon and leaves the benchmarks unable to probe whether models truly disentangle spurious dependencies.
Method: We introduce the concept of *benchmark misspecification* and, grounded in causal modeling and distribution-shift analysis, establish necessary conditions a shift must satisfy for robustness evaluation. We theoretically show that mainstream DG benchmarks violate these conditions and derive verifiable criteria for *well-specified* benchmarks.
Contribution/Results: Our work provides the first formal theoretical foundation for assessing spurious-correlation robustness, yielding both principled guidelines and practical criteria for designing credible, causally sound DG evaluation protocols.
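To ground the notion of benchmark misspecification, here is a minimal sketch (our illustration; the function names, probabilities, and construction are assumptions, not the paper's setup). It builds ColoredMNIST-style data in which a spurious color feature matches the label with a tunable probability, and shows why a shift that barely moves that probability cannot expose a color-reliant model, while a reversed shift does:

```python
import numpy as np

rng = np.random.default_rng(0)

def colored_data(n, p_color_given_label):
    """Toy ColoredMNIST-style data: binary label y plus a spurious color
    feature c that matches y with probability p_color_given_label."""
    y = rng.integers(0, 2, size=n)
    match = rng.random(n) < p_color_given_label
    c = np.where(match, y, 1 - y)
    return c, y

def color_only_accuracy(p_color_given_label, n=100_000):
    """Accuracy of a shortcut classifier that predicts y from color alone."""
    c, y = colored_data(n, p_color_given_label)
    return (c == y).mean()

# Training distribution: color agrees with the label 90% of the time.
print(f"train (p=0.9):          {color_only_accuracy(0.9):.3f}")  # ~0.90

# Misspecified shift: the spurious association barely moves, so the
# color-reliant model still looks robust OOD.
print(f"weak shift (p=0.8):     {color_only_accuracy(0.8):.3f}")  # ~0.80

# Well-specified shift: the spurious association is reversed, so reliance
# on color is exposed (accuracy falls well below chance).
print(f"reversed shift (p=0.1): {color_only_accuracy(0.1):.3f}")  # ~0.10
```

The point is only illustrative: a shift that leaves the spurious association essentially intact rewards the very shortcut the benchmark is meant to penalize.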
📝 Abstract
Spurious correlations are unstable statistical associations that hinder robust decision-making. Conventional wisdom suggests that models relying on such correlations will fail to generalize out-of-distribution (OOD), especially under strong distribution shifts. However, empirical evidence challenges this view: naive in-distribution empirical risk minimizers often achieve the best OOD accuracy across popular OOD generalization benchmarks. In light of these results, we propose a different perspective: many widely used benchmarks for evaluating robustness to spurious correlations are misspecified. Specifically, they fail to include shifts in spurious correlations that meaningfully impact OOD generalization, making them unsuitable for evaluating the benefit of removing such correlations. We establish conditions under which a distribution shift can reliably assess a model's reliance on spurious correlations. Crucially, under these conditions, we should not observe a strong positive correlation between in-distribution and OOD accuracy, often called "accuracy on the line." Yet most state-of-the-art benchmarks exhibit this pattern, suggesting they do not effectively assess robustness. Our findings expose a key limitation in current benchmarks used to evaluate domain generalization algorithms, that is, methods designed to avoid spurious correlations. We highlight the need to rethink how robustness to spurious correlations is assessed, identify well-specified benchmarks the field should prioritize, and enumerate strategies for designing future benchmarks that meaningfully reflect robustness under distribution shift.
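As a concrete diagnostic, one can check for "accuracy on the line" by correlating in-distribution and OOD accuracies across a pool of trained models. The sketch below is an assumption-laden illustration, not the paper's protocol: the accuracy values and the 0.9 threshold are hypothetical, and published analyses typically fit probit-transformed accuracies rather than raw Pearson correlation.

```python
import numpy as np

def accuracy_on_the_line(id_acc, ood_acc, threshold=0.9):
    """A strong positive ID/OOD accuracy correlation across a pool of models
    suggests the benchmark's shift is not probing reliance on spurious
    correlations. The 0.9 threshold is an arbitrary illustrative cutoff."""
    id_acc, ood_acc = np.asarray(id_acc), np.asarray(ood_acc)
    r = np.corrcoef(id_acc, ood_acc)[0, 1]
    return r, r > threshold

# Hypothetical per-model accuracies on a benchmark's ID and OOD splits.
id_acc  = [0.91, 0.88, 0.95, 0.84, 0.93]
ood_acc = [0.72, 0.69, 0.78, 0.64, 0.75]

r, on_the_line = accuracy_on_the_line(id_acc, ood_acc)
print(f"Pearson r = {r:.2f}; accuracy on the line: {on_the_line}")
```

On a well-specified benchmark, by the paper's argument, a diverse model pool should not trace out such a tight positive line.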